We love AI, don’t we, folks?
Whether it’s getting Claude to write some unbelievably sloppy code to squeak past a sprint deadline or asking ChatGPT to draft the announcement blog post the marketing team asked us for, most of us have come to terms with the quirky, sometimes funny, but mostly useless AI ✨ buttons and features in our everyday apps.
While the aforementioned features and buttons are mostly harmless, the Large Language Models (LLMs) behind them are constantly hungry for new data, which AI companies strive to collect from every corner of the internet. This observation prompted us to release the AI Crawlers Blocklist, a new blocklist that protects our users by preemptively blocking IPs known to be used by AI bots crawling websites for data.
In this article, I’ll detail some of the abusive behaviors we have seen by AI companies recently, leading us to build this blocklist. I’ll also share the details of how we created this blocklist and how you can deploy it in any firewall that might be running on your stack.
Crawling the internet 101
The training and development of large language models requires a gigantic amount of language data. A very common way of finding this data is to use a process referred to as web crawling.
A web crawler is a bot that iteratively visits sites one by one and processes the content of the website for further use. For some, this might be downloading the text from the website to train language models, but other uses include scanning the website for critical vulnerabilities or simply indexing the website to power a search engine such as Google Search.
In addition to processing the content of the site, a web crawler will usually also extract the links contained in the site. Once the site content has finished processing, the web crawler can then proceed to the next site to crawl using one of the links it has found so far. This is why such crawlers are also referred to as spiders, as they traverse the web along the threads or links that connect it.
Note: Bots usually crawl multiple websites slowly so they don’t trigger alerts by hitting request quotas.
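To make the spider pattern concrete, here is a minimal sketch of such a crawler in Python, using the third-party requests library. It is purely illustrative: link extraction is done with a crude regex, and a real crawler would add robots.txt checks, deduplication at scale, and far better error handling.

import re
import time
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_url, max_pages=50, delay=1.0):
    # Breadth-first crawl: visit each page once, process its content,
    # extract its links, and queue unseen ones for later visits.
    queue = deque([seed_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # Process the page here (extract text, index it, etc.), then
        # follow the links that make the crawler a "spider".
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in visited:
                queue.append(link)
        # Crawl slowly so we don't hammer any single site.
        time.sleep(delay)
    return visited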
While the design of such web crawlers is relatively simple, operating them at scale is quite hard, as multiple crawlers need to coordinate to avoid scanning the same site multiple times. The ability to solve such problems at web scale is the reason why engineers at Google are most likely paid high six figures, while you and I probably are not!
In addition to avoiding duplicate crawls, web crawlers are also supposed to follow the rules that each site owner can define for crawling their site. These rules are specified in a file called robots.txt, and they let site owners define which crawlers are allowed to access which parts of the site. For instance, if you wanted to block Google Search from indexing your site, you could add the following to your robots.txt file:
User-agent: Googlebot
Disallow: /
Furthermore, site owners can also add parameters such as preferred visiting times and global limits on crawl speed, for example via the non-standard Crawl-delay directive, which some crawlers honor and others, like Googlebot, simply ignore.
The crawling menace
As the above section has made clear, web crawling is a very common process that has been part of the internet for a long time. This is why there are rules in place that are respected by most big crawlers such as Googlebot or Bingbot.
However, there is no real enforcement of these rules. There is no one to complain to or sue if Google crawls your site even though you excluded it. In addition, the cost of excessive crawling is mainly incurred by the owner of the crawled website rather than by the crawling service. And let’s not forget the aforementioned difficulties of managing multiple crawlers at web scale.
Combine these three problems with AI startups desperate to cut corners to catch up to the competition in terms of data, and you get a toxic pile of traffic problems that site owners are left to deal with. As the number of incidents has grown far beyond the scale of a single blog post, we will simply present a highlight reel of incidents that we found particularly egregious.
Case #1
Early this year, the project website for the popular Linux Mint distro was taken offline by what looked like a DDoS attack. The project’s webmaster later revealed that the “attack” was simply Anthropic’s ClaudeBot trying to crawl the site for AI training data.
Case #2
In a similar case, Kyle Wiens, the CEO of iFixit, a popular site for all kinds of device repairs, tweeted about Anthropic’s web crawler being aggressive enough to alert their security team. The resulting X thread also saw multiple other site owners, such as Eric Holscher from Read the Docs, sharing similar concerns.
Case #3
While Claude seems to be a pretty bad offender, abusive crawling is not limited to Anthropic’s flagship product. Bytedance, the Chinese company that owns TikTok, jumped on the crawling train as well, and apparently, it’s not pretty. In April 2024, Nerdcrawler, a small startup that enables the sale of comics, reportedly cut their bandwidth by 60% simply by blocking the Bytespider crawler.
Case #4
Wired, a popular US media company, ran a story revealing that Perplexity, another AI startup, still crawled their site even after being explicitly disallowed via the robots.txt file. The same company has been accused of similar failings by Forbes.
Case #5
Vercel, which lets you build and host sites with its Next.js framework, reported in December that their network now sees over 1.3 billion requests from popular AI crawlers, leaving some of their customers to foot massive bills for worthless traffic. To prevent further incidents, the Vercel WAF now ships with a one-click rule to block all such crawlers from accessing the site. This follows a similar move by Cloudflare, which also shipped a one-click rule to stop crawlers with their WAF.
With all these WAF providers adding AI blocking to their tools, our blocklist release should not come as a surprise. But, before I go into more details about the new CrowdSec Blocklist, allow me to add one final article to the highlight reel.
Bonus case #6
In January this year, Benjamin Flesch, a security researcher, published an article explaining how one could use OpenAI’s ChatGPT to run a DDoS attack against any site. The fairly trivial exploit allows a user to pass the same link to OpenAI’s crawler multiple times, triggering multiple parallel requests that hit the target site simultaneously. In a writeup produced by The Register, Flesch simply concludes with: “If crawlers don’t limit their amount of requests to the same website, they will get blocked immediately”.
I will certainly back up Flesch’s warning with action, but before I do, let’s take a look at how we tag the most common AI companies in our CTI feed.
A global robots.txt file
About a year ago, we added the Alert Context feature to the CrowdSec Security Engine. This opt-in feature allows our users to pass along additional context with each alert, which is then enriched and displayed for them in the CrowdSec Console.
By default, Alert Context includes fields such as the targeted URI and the user agent of the attacker; however, the feature extends far beyond this. MSSPs such as ScaleCommerce use Alert Context to figure out which customer is being attacked, giving their security team a global view of their threat landscape. On our side, the additional threat intelligence we gain from contextualized alerts allows us to build products such as the AI Crawlers Blocklist we present in this article.
I previously mentioned the robots.txt file that crawlers use to check whether and where they are allowed to crawl a site. Compliant crawlers check the robots.txt file to see if it contains rules for their user agent.
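As a rough illustration of that compliance check, Python’s standard library ships a robots.txt parser; example.com below is just a placeholder domain.

from urllib import robotparser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A compliant crawler asks whether its user agent is allowed to fetch
# a page before requesting it.
allowed = parser.can_fetch("Bytespider", "https://example.com/blog/post-1")
print(allowed)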
The user agent is an attribute sent with each HTTP request that allows a client to identify itself to the web server. It is used, for instance, by browsers to tell a website whether they are Safari or Chrome and which operating system they are running on, allowing the website to enable or disable features that are not supported by a given browser or operating system.
This user agent is also used by a lot of crawlers to identify themselves to the website as a crawler. For example, the Bytespider crawler of Bytedance will identify itself to the website using the following user agent:
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
This self-identification allowed us to easily build the first iteration of the AI Crawlers Blocklist.
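As a simplified sketch of that first iteration, detecting a self-identifying crawler boils down to a substring check on the user agent. The token list below is a small, hand-picked sample, not our actual detection logic.

# A few well-known AI crawler tokens (a small, non-exhaustive sample).
AI_CRAWLER_TOKENS = ("Bytespider", "ClaudeBot", "GPTBot", "PerplexityBot")

def is_ai_crawler(user_agent: str) -> bool:
    # Flag requests whose user agent self-identifies as a known AI crawler.
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

ua = ("Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) "
      "Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)")
print(is_ai_crawler(ua))  # True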
AI Crawlers Blocklist to the rescue
After matching the user agents used by AI companies against our Alert Context data and applying our usual trust metrics, we created a blocklist of around 25,000 AI crawler IP addresses.
Once we knew what crawlers looked like and, in particular, what kind of behaviors they would present to the Security Engine, we also found additional IP addresses that behaved like crawlers without presenting themselves via their user agent.
These non-compliant crawlers will be subject to further research on our side to ensure that they pass the stringent quality requirements we have for our IP blocklists. For the moment, the AI Crawlers Blocklist includes all the crawlers whose identities we were able to confirm.
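For illustration only, here is one deliberately naive behavioral heuristic along those lines: an IP that requests an unusually large number of distinct pages looks crawler-like, whatever user agent it presents. This is not the trust metrics or detection logic the Security Engine actually uses.

from collections import Counter

def flag_crawler_like_ips(log_entries, threshold=100):
    # log_entries: iterable of (ip, path) tuples parsed from access logs.
    # Flag IPs that fetch an unusually large number of distinct pages,
    # regardless of the user agent they present.
    distinct_pages = Counter()
    seen = set()
    for ip, path in log_entries:
        if (ip, path) not in seen:
            seen.add((ip, path))
            distinct_pages[ip] += 1
    return {ip for ip, count in distinct_pages.items() if count >= threshold}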
Compared to the previously mentioned vendors, the CrowdSec AI Crawlers Blocklist offers distinct advantages.
No dedicated integration required: just feed the blocklist to your existing firewall, proxy, or CDN.
CrowdSec Blocklists don’t require the integration of a specific WAF or service into your stack, unlike the other AI crawler solutions I mentioned earlier. As a blocklist provider, we built simple and straightforward integrations of our blocklists into almost any firewall you might have on your stack. After all, good protection should be available everywhere and not depend on the JavaScript framework used by your engineering team!
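As a rough sketch of what “feeding the blocklist to your firewall” can look like on a Linux box, the snippet below turns a plain-text IP list (the file name is a placeholder) into ipset commands that a single iptables rule can then match against. In practice, you would use one of our ready-made integrations instead of rolling your own.

# "ai-crawlers.txt" is a placeholder: one IP address per line.
print("ipset create ai-crawlers hash:ip -exist")
with open("ai-crawlers.txt") as blocklist:
    for line in blocklist:
        ip = line.strip()
        if ip and not ip.startswith("#"):
            print(f"ipset add ai-crawlers {ip} -exist")

# A single firewall rule then drops traffic from every IP in the set:
# iptables -I INPUT -m set --match-set ai-crawlers src -j DROP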
Unique, unmatched threat intel, powered by the crowd.
Our unique crowdsourcing process guarantees that our lists stay up to date and are protected from most methods AI companies are using to evade detection. This ensures that you remain protected even as your adversaries adapt to match your defense.
The CrowdSec AI Crawlers Blocklist is now available as part of the Platinum Blocklists plan. You can also browse our extended Catalog of Platinum and Premium Blocklists to find the perfect blocklist for your needs.
Take care, and stay safe out there!