One of the most common use cases of Cyber Threat Intelligence (CTI) data is the identification of VPN and proxy users. This is because many abuse prevention mechanisms enforce decisions based on the abuser's IP address, from system-level firewalls to preventing multiple signups for user accounts.
As a provider of CTI data, CrowdSec also offers this information across our database of 55 million IPs. Until recently, it came from various third-party sources. However, as providers of the world’s biggest CTI database, we felt we ought to build our own solution for collecting and analyzing this information.
Our new AI-powered VPN and proxy detection mechanism for the CrowdSec CTI will be available soon, but we thought it would be interesting to give you a sneak peek into the research behind building the feature. This article details some of the methods we use to discover and tag users of anonymization tools on the web. If you’re interested in cybersecurity data collection and learning more about the backend of our VPN/proxy detection system, we’re sure you’ll find this article fascinating!
Looking at the available data
Any data science project goes through three basic steps:
- Collecting and preprocessing the relevant data
- Selection and training of an AI model
- Deployment of said model
Of these three steps, the first is generally the most important. While steps two and three are also interesting, this article focuses only on the first.
In compliance with GDPR, we only collect a limited amount of data from CrowdSec users, namely, the malicious IP, the attack vector (aka scenario) triggered, the timestamp of the attack, and some basic information about the Security Engine that detected the attack.
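To make this concrete, a single signal could be sketched as a record like the one below. The field names here are illustrative, not CrowdSec's actual schema; only the scenario name (`crowdsecurity/ssh-bf`) is a real Hub scenario.

```python
# Hypothetical sketch of one signal as described above; field names
# are illustrative, not CrowdSec's actual schema.
signal = {
    "ip": "203.0.113.42",                 # malicious IP (documentation range)
    "scenario": "crowdsecurity/ssh-bf",   # attack vector (aka scenario) triggered
    "timestamp": "2024-05-01T12:34:56Z",  # time of the attack
    "engine": {"version": "v1.6.2", "os": "linux"},  # basic Security Engine info
}
```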
By itself, data from a single Security Engine does not provide much value. The real strength of CrowdSec is that we get these bits of data from a growing community of over 65,000 protected workloads. So, while an individual engine might only see an attacker once, we get to receive multiple signals about said attacker, allowing us to aggregate this information. These aggregates also power our community blocklist.
In addition to the data from our network, we also have access to enrichments from other services, ranging from whois information to scanning information from places like shodan.io. This metadata is collected per IP and periodically refreshed to stay up-to-date.
In this article, we focus on the parts of the detection process that rely on the data we get from our own network.
Attack signatures
For the task of detecting anonymization, we came into the data collection process with a couple of hypotheses. One of them was that an IP hosting an anonymization service will be used by different attackers, and that its observed attacks will therefore change frequently as one attacker hands off to another.
So, if on day one a certain IP is doing a lazy SSH brute force, on day two it attempts a log4shell attack, and on day three it tries to brute force a GitLab instance, then it is highly probable that this IP is actually being used by multiple attackers.
More concretely, we define the attack signature of an IP for a specific day to be the normalized vector of all the scenarios which the IP triggered that day. See an example of 10 such signatures for a specific IP below.
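As an illustration, a daily attack signature as defined above could be computed like this. The function and scenario names are examples, not our production pipeline:

```python
from collections import Counter

def attack_signature(events, scenarios):
    """Normalized attack signature of one IP for one day (illustrative sketch).

    events:    list of scenario names the IP triggered that day
    scenarios: fixed, ordered list of all known scenario names
    """
    counts = Counter(events)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(scenarios)
    # Normalize so the vector sums to 1: each entry is the share
    # of that scenario in the day's activity
    return [counts[s] / total for s in scenarios]

scenarios = ["ssh-bf", "ssh-slow-bf", "http-probing", "gitlab-bf"]
sig = attack_signature(["ssh-bf", "ssh-bf", "http-probing"], scenarios)
# sig is [2/3, 0.0, 1/3, 0.0]
```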
This signature, however, comes with an important caveat. Not every Security Engine in our network protects the same workload or has the same detection scenarios installed. So, even if an attacker performs the GitLab brute force attack mentioned above against every IP it sees, a server not running a GitLab instance will not see the attack and, therefore, will not report it. While this shouldn’t affect attacks that try to exploit only specific CVEs, it will affect attackers who throw a whole blanket of attacks at a server to see what sticks.
The same applies to IPs that are actually compromised machines that have been captured by different malicious actors — the toaster in my grandma’s kitchen might run SSH brute force attacks for evil-guy-1 while also doing WordPress brute force for script-kiddie-4. As a consequence, the WordPress instance without SSH exposed will be sure the toaster only attacks via HTTP, while the database server might see only SSH-based attacks.
We address this problem in two ways.
First, we assume that our watchers with common tech stacks cover similar attack vectors (aka scenarios). This is a reasonable assumption since production tech stacks usually use popular frameworks and services. It is also facilitated by the CrowdSec Hub, which ships scenarios packed into collections, ensuring that most Security Engines have the same detection capabilities. In addition, our partner, Shield (a WordPress protection solution), uses the same collection on every machine, so for signals issued from HTTP attacks, the signature is very reliable. This is particularly important as most anonymized traffic uses HTTP.
Second, we compute the signature features in a way that is not overly sensitive to outliers.
Detecting attack signatures
A very simple way to check whether our original hypothesis holds true is to look at the standard deviations for each IP — i.e., in the table above, we first examine the vertical (per-scenario, across days) standard deviation and then the horizontal one. This turned out to have quite a bit of predictive power, so we decided to spend more time on the attack signature hypothesis.
A very fast technique that builds on this double standard deviation idea (and is still implemented in the production model) is the top_k_std feature set. It first takes the vertical standard deviations and then returns the k largest of them.
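A minimal sketch of this top_k_std computation, assuming daily signatures are stacked into a (days × scenarios) matrix; the exact production implementation may differ:

```python
import numpy as np

def top_k_std(signatures, k=3):
    """Sketch of the top_k_std features.

    signatures: (n_days, n_scenarios) array of daily attack signatures
    Returns the k largest per-scenario ("vertical") standard deviations.
    """
    stds = np.std(signatures, axis=0)  # std per scenario, across days
    return np.sort(stds)[::-1][:k]     # k largest, in descending order

# An IP whose signature changes completely from day to day
sigs = np.array([
    [1.0, 0.0, 0.0],  # day 1: pure SSH brute force
    [0.0, 1.0, 0.0],  # day 2: log4shell attempts
    [0.0, 0.0, 1.0],  # day 3: GitLab brute force
])
features = top_k_std(sigs, k=2)  # large values hint at multiple attackers
```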
The drawback of this method is that attack types with a high variance in configurations get proportionally higher values than very standardized ones. For instance, SSH attacks are mostly covered by the two brute force scenarios found in the CrowdSec Hub (ssh-bf and ssh-slow-bf), while HTTP-based attacks can take a lot of different shapes and forms. So, the variance for a single attacker who is not behind a VPN is still higher for HTTP than for SSH.
To address this issue, we added a second feature based on unsupervised clustering, a machine learning method for grouping points together. In our case, that means days with a similar attack signature get grouped together. Each group represents a generic attack signature that appears on a reasonably frequent basis. This method is more robust to single-day outliers in the attack signature. See an example of this in the diagram below.
Here, we have an outlier where an IP that usually attacks within two main groups displays an unseen mix. This might happen if the attacker changes machines in the middle of the day or if they just happen to attack a Security Engine with a rather unique combination. In this case, the standard deviation for the GitLab brute force scenario would jump massively. The clustering method, however, is still robust, and the day would likely still be classified in the mostly-ssh-attacks cluster.
The combination of the top_k_std and n_distinct_clusters features allows us to measure changes in the attack signature of the IP over time in an outlier-resistant way. In combination with other data and enrichments, we can reliably detect and flag the use of anonymization methods in our CTI database.
Wrapping up
We hope you found this short article on our internal research interesting! It’s an important piece of the CrowdSec VPN/proxy detection algorithm, which will soon be rolled out to general availability through both our CTI and premium blocklists.
Stay tuned!