AI-Powered Proxy and VPN Detection — The CrowdSec Way

One of the most common use cases of Cyber Threat Intelligence (CTI) data is the identification of VPN and proxy users. This is due to the fact that a lot of abuse prevention mechanisms focus on enforcing decisions based on the IP address of the abuser — starting from firewalls at the system level to preventing multiple signups for user accounts. 

As a provider of CTI data, CrowdSec also provides such information on our 55 million IP database. Until recently, this information was provided by various third-party sources. However, as providers of the world’s biggest CTI database, we ought to create our own solution for collecting and analyzing that set of information.

Our new AI-powered VPN and proxy detection mechanism for the CrowdSec CTI will be available soon, but we thought it would be interesting to give you a sneak peek into the research behind building the feature. This article details some of the methods we use to discover and tag users of anonymization tools on the web. If you’re interested in cybersecurity data collection and learning more about the backend of our VPN/proxy detection system, we’re sure you’ll find this article fascinating!

Looking at the available data

Any data science project goes through three basic steps: 

  1. Collecting and preprocessing the relevant data
  2. Selection and training of an AI model 
  3. Deployment of said model 

Of these three steps, the first is generally the most important. Steps two and three are interesting too, but this article focuses only on the first.

In compliance with GDPR, we only collect a limited amount of data from CrowdSec users, namely, the malicious IP, the attack vector (aka scenario) triggered, the timestamp of the attack, and some basic information about the Security Engine that detected the attack. 

By itself, data from a single Security Engine does not provide much value. The real strength of CrowdSec is that we get these bits of data from a growing community of over 65,000 protected workloads. So, while an individual engine might only see an attacker once, we get to receive multiple signals about said attacker, allowing us to aggregate this information. These aggregates also power our community blocklist.

In addition to the data from our network, we also have access to enrichments from other services, ranging from whois information to scanning information from places like shodan.io. This metadata is collected per IP and periodically refreshed to stay up-to-date.

In this article, we focus on the parts of the detection process that rely on the data we get from our own network. 

Attack signatures

For the task of detecting anonymization, we came into the data collection process with a couple of hypotheses. One of them was that an IP hosting an anonymization service is shared by many different attackers, so the attacks coming from that IP should change frequently as one attacker after another uses the service.

So, if on day one, a certain IP is doing a lazy SSH brute force, on day two, it attempts to do a log4shell attack, and on day three is trying to brute force a GitLab instance, then it is highly probable that this IP is actually being used by multiple attackers. 

More concretely, we define the attack signature of an IP for a specific day to be the normalized vector of all the scenarios which the IP triggered that day. See an example of 10 such signatures for a specific IP below.
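As a minimal sketch of this definition, the Python snippet below builds daily signatures from a list of (day, scenario) pairs for one IP. It assumes that "normalized" means dividing each scenario count by the day's total so that the vector sums to 1; the article does not say which normalization is used. The function name `daily_signatures`, the input format, and every scenario label other than ssh-bf and ssh-slow-bf are hypothetical.

```python
from collections import Counter, defaultdict


def daily_signatures(signals, scenarios):
    """Build one signature per day for a single IP.

    signals:   iterable of (day, scenario) pairs reported for the IP.
    scenarios: fixed, ordered list of scenario names (the vector axes).
    Returns {day: [share of each scenario on that day]}.
    """
    counts = defaultdict(Counter)
    for day, scenario in signals:
        if scenario in scenarios:  # ignore scenarios outside the chosen axes
            counts[day][scenario] += 1

    signatures = {}
    for day, counter in counts.items():
        total = sum(counter.values())
        # Assumption: normalize so that each daily vector sums to 1.
        signatures[day] = [counter[name] / total for name in scenarios]
    return signatures


# Illustrative data only: labels other than ssh-bf / ssh-slow-bf are made up.
SCENARIOS = ["ssh-bf", "ssh-slow-bf", "log4shell", "gitlab-bf"]
signals = [
    ("day-1", "ssh-bf"), ("day-1", "ssh-bf"),
    ("day-1", "ssh-bf"), ("day-1", "ssh-slow-bf"),
    ("day-2", "log4shell"),
    ("day-3", "gitlab-bf"), ("day-3", "gitlab-bf"),
]
for day, signature in sorted(daily_signatures(signals, SCENARIOS).items()):
    print(day, signature)
```

In this toy data the signature moves from mostly SSH brute force to a completely different scenario each day, which is the pattern the hypothesis associates with a shared IP.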

This signature, however, comes with an important warning. Not every security engine in our network protects the same workload and has installed the same detection scenarios. So, even if an attacker performs the GitLab brute force attack mentioned above on any IP it sees, a server not running a GitLab instance will not see the attack and, therefore, will not report it. While this shouldn’t affect attacks that try to exploit only specific CVEs, it will affect attackers who throw a whole blanket of attacks at a server to see what sticks. 

The same applies to IPs that are actually compromised machines that have been captured by different malicious actors — the toaster in my grandma’s kitchen might run SSH brute force attacks for evil-guy-1 while also doing WordPress brute force for script-kiddie-4. As a consequence, the WordPress instance without SSH exposed will be sure the toaster only attacks via HTTP, while the database server might see only SSH-based attacks. 

We address this problem in two ways.

First, we assume that our watchers with common tech stacks cover similar attack vectors (aka scenarios). This is a reasonable assumption since production tech stacks usually use popular frameworks and services. It is also facilitated by the CrowdSec Hub, which ships scenarios packed into collections, ensuring that most Security Engines have the same detection capabilities. In addition, our partner, Shield (a WordPress protection solution), uses the same collection on every machine, so for signals issued from HTTP attacks, the signature is very reliable. This is particularly important as most anonymized traffic uses HTTP.

The second way we addressed this issue is by detecting the signatures in a way that is not too sensitive to outliers.

Detecting attack signatures

A very simple way to check whether our original hypothesis holds true is to look at the standard deviation of each IP — i.e., from the table above, we first examine the vertical and then the horizontal standard deviation. This turned out to have quite a bit of predictive power, so we decided to spend more time on the attack signature hypothesis. 

A very fast technique that builds on the double standard deviation idea (and is still implemented in the production model) is the set of top_k_std features. It first takes the vertical standard deviations and then returns the k largest of them.
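A rough sketch of such a feature in Python is shown below. It assumes the signatures are arranged with one row per day and one column per scenario, so that "vertical" means per scenario across days, and it uses the population standard deviation. Both choices are assumptions, as is the function signature; the production implementation is not described in this article.

```python
from statistics import pstdev


def top_k_std(signatures, k):
    """Return the k largest per-scenario standard deviations.

    signatures: list of daily vectors for one IP
                (assumed layout: rows = days, columns = scenarios).
    """
    columns = zip(*signatures)  # transpose: one tuple per scenario
    stds = [pstdev(column) for column in columns]  # "vertical" std per scenario
    return sorted(stds, reverse=True)[:k]


# An IP whose attacks flip between two scenarios from day to day...
switching_ip = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
# ...versus an IP that does the same thing every day.
stable_ip = [[1.0, 0.0, 0.0]] * 4

print(top_k_std(switching_ip, k=2))
print(top_k_std(stable_ip, k=2))
```

The switching IP yields large values for the two scenarios it alternates between, while the stable IP yields zeros, so the feature grows with day-to-day changes in behavior.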

The drawback that comes with this method is that setups that have a high variance in configurations get proportionally higher values compared to setups that come with very standardized configurations. For instance, SSH attacks are mostly covered by the two brute force scenarios found in CrowdSec Hub (ssh-bf and ssh-slow-bf), while HTTP-based attacks can take a lot of different shapes and forms. So, the variance for a single attacker who is not behind a VPN is still higher for HTTP than for SSH. 

To address this issue, we added a second feature based on unsupervised clustering, a machine learning method that groups similar points together. In our case, that means that days with a similar attack signature get grouped together. Each group represents a generic attack signature that appears on a reasonably frequent basis. This method is more robust to single-day outliers in the attack signature. See an example of this in the diagram below.

Here, we have an outlier where an IP that usually attacks within two main groups displays an unseen mix. This might happen if the attacker changes machines in the middle of the day or if they just happen to attack a Security Engine with a rather unique combination. In this case, the standard deviation for the GitLab brute force scenario would jump massively. The clustering method, however, is still robust, and the day would likely still be classified in the mostly-ssh-attacks cluster. 

The combination of the top_k_std and n_distinct_clusters features allows us to measure changes in the attack signature of the IP over time in an outlier-resistant way. In combination with other data and enrichments, we can reliably detect and flag the use of anonymization methods in our CTI database.
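To illustrate how a cluster-based feature can absorb a one-day outlier, here is a small Python sketch. The article does not name the clustering algorithm, so this example makes a loud assumption: cluster centroids have already been learned offline (for instance with a k-means-style method), and each daily signature is simply assigned to its nearest centroid by Euclidean distance. The centroid values, the gitlab-bf label, and the function names are hypothetical.

```python
import math


def nearest_cluster(signature, centroids):
    """Index of the centroid closest to the signature (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(signature, centroids[i]))


def n_distinct_clusters(signatures, centroids):
    """Number of different clusters an IP's daily signatures fall into."""
    return len({nearest_cluster(sig, centroids) for sig in signatures})


# Axes: [ssh-bf, ssh-slow-bf, gitlab-bf]. Centroids are made-up examples
# standing in for clusters learned from the whole network.
CENTROIDS = [
    [0.8, 0.2, 0.0],  # "mostly SSH attacks"
    [0.0, 0.0, 1.0],  # "mostly GitLab brute force"
]

# Two ordinary SSH days plus one outlier day with an unusual mix.
ssh_ip_with_outlier = [
    [0.75, 0.25, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.2, 0.3],  # outlier: still closest to the SSH centroid
]
# An IP that genuinely alternates between the two behaviors.
alternating_ip = [
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

print(n_distinct_clusters(ssh_ip_with_outlier, CENTROIDS))
print(n_distinct_clusters(alternating_ip, CENTROIDS))
```

The outlier day raises the per-scenario standard deviation, but it still lands in the SSH cluster, so the cluster count for that IP does not change. Only an IP whose days fall into different clusters gets a higher count.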

Wrapping up

We hope you found this short article on our internal research interesting! It covers an important piece of the CrowdSec VPN/proxy detection algorithm, which will soon be rolled out to general availability through both our CTI and premium blocklist.

Stay tuned!
