As our install base grows, more and more users rely on our Security Engine to process very high volumes of logs from heavily trafficked web infrastructures. You might remember a previous post where we outlined how to use horizontal scaling to process over 2.6 billion daily events. While that method works well for our users, we knew we could do more to reduce the processing power required to handle high log volumes, while also significantly cutting processing time.
We’re very excited to announce that, as part of our work on CrowdSec 1.5 (our latest release, available this summer), we have achieved much faster response times when processing high volumes of logs, while significantly reducing the amount of processing power needed. Read on to see our benchmarking tests and get a closer look at the results!
Benchmarking parameters
CrowdSec uses go-routines for parsing and enriching logs, pouring events into buckets, and managing outputs; the number of routines represents CrowdSec's level of concurrency. For this benchmarking test, we ran up to 4 routines at once and focused on 2 specific parameters:
- Bucket Routines: The number of go-routines used to manage the leaky buckets created by the scenarios. The higher it is, the more buckets CrowdSec will be able to handle.
- Parser Routines: The number of go-routines used to parse the log lines. The higher it is, the more lines CrowdSec will be able to parse.
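For reference, both knobs live in the Security Engine's main configuration file. A minimal sketch of the relevant section (the key names follow the `crowdsec_service` section of `config.yaml`; double-check them against your installed version):

```yaml
# /etc/crowdsec/config.yaml (excerpt) — concurrency settings used in this benchmark
crowdsec_service:
  parser_routines: 4   # go-routines parsing log lines
  buckets_routines: 4  # go-routines managing the leaky buckets
```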
We conducted this performance benchmarking using a file of 50,000 lines of nginx logs. These nginx logs generate a significant amount of alerts within a very short time frame, which is exactly why we chose them. We wanted to put CrowdSec under as much stress as possible!
We used the HTTP collection to analyze these logs, running all 11 HTTP detection scenarios, alongside our geolocation enricher. Here are the results:
Test 1: Testing our current state, 1.4.6
We first ran the nginx logs through the current CrowdSec version – v1.4.6.
Here are the raw results that we saw:
The first obvious result is that the total time CrowdSec spends analyzing logs decreases significantly as we increase the number of go-routines it is allowed to run. This is promising, but we wanted to see exactly where that time is being spent.
Since Go ships with built-in support for pprof, we could profile CrowdSec's performance in real time. Here's what we saw:
What did we notice?
exprhelpers.RegexpInFile represents 24% of the time spent
RegexpInFile is used by various scenarios to match incoming lines against a file of regular expressions. An expensive example is the http-bad-user-agent scenario, which runs every user agent string against 614 regular expressions.
regexp.FindStringSubmatch represents 19% of the time spent
As CrowdSec uses grok patterns to parse nearly everything, we weren’t surprised to see this one.
TL;DR: Regexps are expensive.
Improvement 1: Caching some things
We saw that scenarios that match data against a lot of regular expressions can become very expensive, very quickly. In an attempt to bring the cost of running these scenarios down, we introduced an optional caching mechanism.
```yaml
filter: 'evt.Meta.log_type in ["http_access-log", "http_error-log"] && RegexpInFile(evt.Parsed.http_user_agent, "bad_user_agents.regex.txt")'
data:
  - source_url: https://raw.githubusercontent.com/crowdsecurity/sec-lists/master/web/bad_user_agents.regex.txt
    dest_file: bad_user_agents.regex.txt
    type: regexp
    strategy: LRU
    size: 40
    ttl: 5s
```
This caching mechanism lets scenarios cache the results of regular expression matches for a given time, with a given maximum cache size and a configurable eviction strategy ("least recently used", "least frequently used", etc.).
Test 2: Testing our current state with a caching mechanism
After performing the same tests as above, this is what we saw:
As you can see, the time it took to process the logs was nearly cut in half, close to a 100% improvement. Here is how our flamegraph looked:
As you can see, we managed to significantly improve the time taken to parse 50,000 lines of nginx logs, but could we find a more elegant solution than simply introducing a caching mechanism? This got us thinking, and we took a closer look at how we could speed up the regular expressions themselves.
Improvement 2: Speeding up regexps
To improve our speed even further, we started looking at alternatives to Go's notoriously slow native regexp implementation. Our search led us to go-re2, a drop-in replacement for the standard library regexp package that uses the C++ re2 library for better performance on large inputs and complex expressions.
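Because go-re2 mirrors the standard library's API, switching is meant to be little more than an import swap. A sketch of what that looks like (the module path shown is the commonly used wasilibs port; verify it against the dependency you actually pull in):

```go
// Before: the standard library implementation.
//   import "regexp"
//
// After: alias the drop-in replacement to the same package name, so the
// rest of the codebase compiles unchanged.
import regexp "github.com/wasilibs/go-re2"

// Existing call sites keep working as-is, e.g.:
//   re := regexp.MustCompile(`(?i)badbot`)
//   re.MatchString(userAgent)
```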
As our initial tests were going very well, we looked to integrate it into CrowdSec.
One important thing to note is that go-re2 supports two modes:
- WASM integration
- CGO integration
While the CGO integration offers around 30% more performance, it is not yet fully supported, so it won't be integrated natively into the 1.5 preview and is out of scope for this benchmark.
Having said that, v1.5 introduces a set of feature flags that let our users opt in to features still under development, and that is how we integrated the RE2 WASM support.
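Opting in is a matter of listing the flag in the engine's feature file. A sketch of what that looks like (both the file location and the flag name below are assumptions; consult the 1.5 release notes for the exact values):

```yaml
# /etc/crowdsec/feature.yaml — one opt-in feature flag per line
- re2_regexp_support
```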
Test 3: Testing with go-re2
After performing the same tests, but this time using the go-re2 library, this is what we saw:
Again, we made significant improvements in processing time. However, we noticed that with the RE2 WASM support, processing time no longer decreases linearly with the number of go-routines: the implementation relies on mutual exclusion locks that serialize part of the work.
The CGO implementation would avoid this issue, so we are excited to work on fully supporting it; we estimate it will improve regular expression performance by a further 30%! We will keep you posted on our progress, but for now we are happy to have significantly reduced log processing time using the WASM integration.
Here is the flamegraph for the 3rd test:
Conclusion and next steps
If we look at the raw results of each of these evolutions, we can see that we made CrowdSec's log processing more than 3 times faster.
This graph gives you a great visualization (x-axis is the number of go-routines, and y-axis is the time spent parsing 50k log lines).
And interestingly, while the RE2 WASM integration doesn't scale as well across go-routines, it uses a lot less CPU!
What are the next steps?
We hope that you have enjoyed reading through our benchmark tests! While optimization is a never ending process, we are delighted to see that CrowdSec v1.5 significantly reduces the time it takes for our security engine to parse through logs.
We’re excited to work on the CGO integration and reduce this time even further!
CrowdSec 1.5 is currently in private preview.
If you are interested in being one of the first to test it out and give us your feedback, contact us! As a thank-you for helping us with this new version, all testers will get 2 months of free access to all the new features, including the premium ones.