The key selling point of the CrowdSec CTI is not only that it is the most exhaustive CTI database on the planet, but also that it provides in-depth information about malicious IPs, which helps threat hunters and SOC analysts understand the whole lifecycle of the threats that land on their desks.
A big part of this is the diverse scores that we assign to each IP, broken down over different time spans.
In this article, we take a deep dive into the recent changes that help make CTI scores easier to explain and more dependable. Due to the way we layer our scores, this change will have downstream effects touching many other aspects of our CTI, but more on that later.
CrowdSec’s three-layer approach to CTI
Our CTI scores are divided into three layers.
Layer #1: Base scores
These scores have a one-to-one correspondence to data points that we aggregate for each IP. Examples include the total number of reports we have received for a given IP and the total number of CrowdSec Security Engines that have reported it.
The base scores are used internally for the generation of the community blocklist or for our machine learning models. The one-to-one design allows us to build products from our data without having to make sure every product has a mechanism to adapt to the way our data shifts as our network grows.
Layer #2: Analyst scores
The analyst scores break down the information we have on a given IP along the main axes that analysts care about. There are four categories: aggressiveness, threat, trust, and anomaly, each scored from 0 to 5.
As an example, the aggressiveness rating represents how many attacks our users see for a given IP. IPs with a high aggressiveness rating might try to attack many different users or try many different attacks on single users.
These four scores can be accessed through our CTI API and can be used, for instance, to build proprietary rating systems in a SOC or as enrichment for threat hunters. The aggressiveness score is also displayed as the four gauges in the activity section of our CTI searchbar.
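As a rough illustration of the proprietary-rating use case, the sketch below combines the four analyst scores into a single priority value between 0 and 5. The weights, the function name `soc_priority`, and the flat dictionary layout are all hypothetical. They are not part of the CTI API, whose actual response format should be taken from its documentation.

```python
# Hypothetical weights chosen for illustration only; a SOC would tune its own.
WEIGHTS = {"aggressiveness": 0.4, "threat": 0.3, "trust": 0.2, "anomaly": 0.1}


def soc_priority(analyst_scores: dict[str, int]) -> float:
    """Weighted mean of the four analyst scores (each 0 to 5), so the result is also 0 to 5."""
    for name in WEIGHTS:
        score = analyst_scores[name]
        if not 0 <= score <= 5:
            raise ValueError(f"{name} score {score} is outside the 0 to 5 range")
    return sum(WEIGHTS[name] * analyst_scores[name] for name in WEIGHTS)


# Hypothetical analyst scores for one IP (assumed layout, not the API format).
example = {"aggressiveness": 5, "threat": 4, "trust": 5, "anomaly": 1}
print(round(soc_priority(example), 2))  # 4.3
```

Because the weights sum to 1, the combined value stays on the same 0 to 5 scale as its inputs, which keeps it comparable with the individual analyst scores.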
Layer #3: Product scores
These are ready-to-use scores made by our Data Science teams for common CTI use cases. One example is the background noise score, which can be used to filter common attackers out of alerts, allowing security teams to focus on the important threats.
In addition to the CTI API, these scores are also presented on the CrowdSec console and on our CTI searchbar.
Score drift
One of the issues with the data we have at CrowdSec is that the absolute value of most counts increases as our network grows in size. For instance, an IP that got reported by 25 different CrowdSec instances might have been considered very active back when we had around 1000 watchers. Now that our network is 100x bigger, 25 reports no longer indicate a very active IP, as very active IPs collect over 1000 reports in the same time period.
This process is referred to as “data drift” in data science lingo. The base scores, which make up the first layer in our scoring system, are designed to address this issue of data drift.
The idea of the base scores is fairly simple. Instead of directly using the number of reports in a calculation, we simply provide a score from 0 to 5 that represents how many reports we have received for a given IP relative to how many reports other IPs receive. This abstracts away the problem of data drift for all scores that are computed downstream of the base scores. As long as the base scores compensate for the data drift, the rest of the system will stay consistent.
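To make the drift concrete, the sketch below compares the same raw count of 25 reports against two made-up populations. The report counts, the 40x growth factor, and the helper name `share_of_ips_below` are assumptions for illustration, not CrowdSec data or code.

```python
def share_of_ips_below(count, all_counts):
    """Fraction of observed IPs that collected strictly fewer reports than `count`."""
    return sum(1 for other in all_counts if other < count) / len(all_counts)


# Hypothetical report counts per IP while the network was small...
small_network = [1, 2, 3, 5, 8, 12, 25]
# ...and after growth, with every IP collecting 40x more reports.
large_network = [c * 40 for c in small_network]

print(share_of_ips_below(25, small_network))  # 6 of 7 IPs have fewer reports
print(share_of_ips_below(25, large_network))  # 0.0: 25 reports is now the bottom
```

The raw count is identical in both cases, but its position relative to other IPs collapses as the network grows. A score derived from that relative position keeps its meaning where a fixed threshold on the raw count would not.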
Designing base scores
As explained above, the core requirement for these base scores is that they stay consistent under continuous data drift.
One way to fulfill this requirement is by using quantiles of the underlying data. Imagine you had 100 different values and had to define a score from 1 to 4 for them. With this method, you would sort your values and then go from bottom to top, mapping the first 25 values you see to 1, the next 25 to 2, and so on. To decide which score a new value maps to, you simply compare it to the values at positions 25, 50 and 75 of the sorted list.
These three values are called quantiles. To address data drift, we now only have to recompute the values of the quantiles over the dataset from time to time. With this update, we are adding a mechanism that recomputes these quantiles. While we are at it, we are also changing the way we choose the quantiles.
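The 100-value example above can be sketched in a few lines. This is a minimal illustration of quantile bucketing in general, with hypothetical function names, and it assumes the number of values is a multiple of 4. It is not the production implementation.

```python
from bisect import bisect_left


def quartile_cut_points(values):
    """Return the values sitting at positions 25%, 50% and 75% of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[n * k // 4 - 1] for k in (1, 2, 3)]


def score_1_to_4(value, cut_points):
    """Score is 1 plus the number of cut points strictly below the value."""
    return 1 + bisect_left(cut_points, value)


values = list(range(1, 101))        # 100 different values
cuts = quartile_cut_points(values)  # [25, 50, 75]
scores = [score_1_to_4(v, cuts) for v in values]
print(cuts, [scores.count(s) for s in (1, 2, 3, 4)])  # [25, 50, 75] [25, 25, 25, 25]
```

Scoring a new value only requires the three cut points, not the full dataset, so handling drift comes down to calling `quartile_cut_points` again on fresh data from time to time.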
Previously, the quantiles were based on a normal distribution that we had fitted to our data at one point in time and that we would shift depending on the time span. For each datapoint, we would compute where it fell on this distribution and then use six regular quantiles to compute the score from 0 to 5. While the normal distribution was a good fit when we added it, over time our data has become more bimodal (a fancy way of saying the curve has two humps, like a camel’s back).
This means that the normal distribution was a bad fit not only because the values in the data had shifted. It was also no longer optimal because the shape of the distribution had changed as our network grew in size.
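The effect of a shape mismatch can be seen on toy data. The sketch below fits a normal distribution to a made-up bimodal sample and compares its 95% cut point with the one read directly from the data. The sample and the helper `share_above` are hypothetical, and the fit uses Python's `statistics.NormalDist` rather than whatever fitting procedure was used originally.

```python
from statistics import NormalDist, quantiles

# Hypothetical bimodal data: 70 quiet IPs (1 to 10 reports) and 30 busy ones (91 to 120).
reports = sorted(list(range(1, 11)) * 7 + list(range(91, 121)))


def share_above(cut, values):
    """Fraction of values strictly above the cut point."""
    return sum(1 for v in values if v > cut) / len(values)


# Cut point for the "top 5%" according to a normal distribution fitted to the data.
fitted_cut = NormalDist.from_samples(reports).inv_cdf(0.95)
# Cut point taken directly from the data (95th of the 99 percentile cut points).
empirical_cut = quantiles(reports, n=100, method="inclusive")[94]

print(share_above(fitted_cut, reports))     # 0.09
print(share_above(empirical_cut, reports))  # 0.05
```

On this sample, the fitted normal places 9% of the IPs in what is supposed to be the top 5%, while the empirical quantile selects exactly 5%. The size of the error depends entirely on the made-up data, but the direction of the argument holds: a cut point derived from an assumed shape only means what it claims while the data keeps that shape.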
With the new changes we no longer make any assumptions about the shape of our data and simply look at the quantiles for 60%, 80%, 80% and 95%. This makes our scores significantly more explainable, as a score of 5 now means that this IP falls within the top 5% of values for this datapoint. It also makes sure that the changing shape of the underlying distribution doesn’t suddenly force us to change the scores again.
Another guarantee we are adding is that a score of 0 is given if, and only if, the value is 0. This allows downstream systems to filter IPs more efficiently without having to look up our raw data.
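Putting the pieces together, a base score along these lines could be sketched as follows. The cut levels in `LEVELS` are illustrative assumptions (only the 95% top cut mirrors the text), as are the function names, the choice to fit the cut points on non-zero values only, and the use of `statistics.quantiles` with the inclusive method.

```python
from bisect import bisect_left
from statistics import quantiles

# Assumed cut levels in percent, chosen for illustration.
LEVELS = (60, 80, 90, 95)


def fit_cut_points(values):
    """Recompute the cut points from the current non-zero values."""
    nonzero = [v for v in values if v > 0]
    percentiles = quantiles(nonzero, n=100, method="inclusive")  # 99 cut points
    return [percentiles[level - 1] for level in LEVELS]


def base_score(value, cut_points):
    """0 only for a value of 0; otherwise 1 to 5 by position among the cut points."""
    if value == 0:
        return 0
    return 1 + bisect_left(cut_points, value)


reports = list(range(0, 101))  # hypothetical report counts, including a zero
cuts = fit_cut_points(reports)
print([base_score(v, cuts) for v in (0, 1, 61, 96)])  # [0, 1, 2, 5]

# After the network grows, refit on the new data: the same raw count scores lower.
grown = [v * 10 for v in reports]
print(base_score(96, fit_cut_points(grown)))  # 1
```

In this sketch, a score of 5 is reached only above the 95% cut point, and any non-zero value scores at least 1, so a downstream filter on a score of 0 is equivalent to a filter on a raw value of 0.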
Downstream effects
As the base scores are the foundation upon which all other scores are built, changing the way they are computed will have downstream effects on many of the scores we present in our CTI.
The CrowdSec team will be monitoring the impact of these changes and we might slightly tweak product scores to make sure they can still serve their intended purpose.
If you haven’t tried the CrowdSec CTI yet, go check it out and explore our vast database of malicious IPs and their activity!