The negative impact of regular expressions on systems

5/1/18

Sooner or later you have no choice to use them in your code. It doesn't matter if it is PHP, Javascript, Go, C#, Java or any other language, there will be one day where you have to check or manipulate a string that has such a complexity, that you would need a lot of if-else blocks to cover a use case. This will be the point where you have to use regular expressions (or short: regex). At first it looks like gibberish but is even with its basics a very powerful tool. A tool that can slow down your application or even be used to attack your application.

As you may already know, a regular expression isn't just gibberish, but a pattern of a textual syntax that match or doesn't match a string. While its syntax is kind of simple, a bigger and more complex pattern can be hard to read, so we keep the examples for this blog as small as [A-Z]+. This pattern can be used to get all matches out of a string, where upper-case letters are present and depending on the function or class from the code language of your choice, you can return those found matches for further aggregation or to replace them with other values. It becomes even more powerful with the help of groups, which enables you to capture a part of a pattern for further processing on the target string. With a pattern like ([A-Z]+) ([A-Z]+) it is possible, for example, to switch values of a string from AAA BBB to BBB AAA. Each group of a pattern, enclosed by round brackets, gets a numeric representation and can be used in most replacing functions of a language's regex implementation. In some languages, those group numbers need to be prefixed with a Dollar character ($2 $1) while others use a backslash (\2 \1), but the result is the same despite the group syntax. I didn't even scratched the surface of this topic and as I mentioned before, regular expressions can be found in nearly any language, which makes them an important basic knowledge for developers to learn.

The performance problem with regular expressions

A lot of performance issues are connected to functions or methods used in a language and can be notably slow when called multiple times. This can be inside of a loop or on systems with high queries/requests per seconds that calls those slow parts of code multiple times, creating potential bottlenecks. In other cases, like with regex patterns, performance problems depend on the arguments used in the code and the engine that is used to interpret and execute the patterns on the given string. There are differences in the engines used by languages or tools, which can also influence runtime, but I don't know of any case where the coding language for a project was chosen by the efficiency of an engine for regular expressions. With new versions came improvements or deprecations of old functionalities, ready to be removed and replaced by something better, so being up to date with the version of your language of choice should be enough most of the time.

Even though the engine usually doesn't really bother us very much, patterns can have an enormous influence on the performance of an application. But instead of words, let's take a look at a simple series of tests. The tests are written in JavaScript, so they can simply be executed in a browser. Here's the base code used for the tests:

var regex = /^([a-z]+)+$/;
var value = "teststring1";

var startTime = new Date() / 1000;
value.match(regex);
var runTime = new Date() / 1000 - startTime;
console.log(runTime);

We measure the time for a simple matching of a pattern var regex against a string var value. The pattern ^([a-z]+)+$ stays the same and only the string value will change.
As you can see, the pattern is about a string that has to start and end with a lower-case letter from a to z and has to be of the size of 1 or more characters. The complete word will also be defined as a group with repetition using the quantifier +. This pattern isn't a complicated one and I'll come back to it later, but let's see how long it takes to process different values.

value = "aaaaaaaaaaaaaaaaaaaaaaa1"    // runTime: 0.068 seconds
value = "aaaaaaaaaaaaaaaaaaaaaaaa1"   // runTime: 0.136 seconds
value = "aaaaaaaaaaaaaaaaaaaaaaaaa1"  // runTime: 0.272 seconds
value = "aaaaaaaaaaaaaaaaaaaaaaaaaa1" // runTime: 0.544 seconds

To keep the example more readable, I only used the letter a. As you may have noticed, none of the examples above is matching to the pattern because of a number at the end of each string. The runtime results are in seconds and may also differ between the browsers/engines you use to reconstruct the test cases.

As we can see, every additional letter seems to double the time to process the string and we stopped at half a second, because this is already a poorly performing result. So is this about the size of a string? Let's repeat the last test case, but I put the not-matching number closer to the beginning of the string.

value = "aaaaaaaa1aaaaaaaaaaaaaaaaaa" // runTime: 0.002 seconds

Before it was half a second, now it's only 2 milliseconds. Yes, the length of a string is a factor, but it doesn't need to be processed until its end when there is an early factor, that makes the string a non-fitting match. In this test, the performance differs by the content of the string.

Pattern matters

This was just a one-sided view of the problem and those of you, who are already familiar with regular expressions realized, that the easy to read pattern of the test can still be reduced to a simpler one. You get the same behaviour with a pattern like ^[a-z]+$, but without creating a bottleneck. So we also have the pattern as a factor for performance of a regex and you might already realize what a bad idea it can be to make a user be able to send and inject a regex in your application.

A pattern with bad influence on performance is also called an evil pattern. So ^([a-z]+)+$ is a repetitive group containing a repetition in form of [a-z]+ that slows down our test. The engine is going through every possible path for a string to determine if a match is given or not, which isn't a straight forward process. If one path fails to match, the regex engine is trying a different one from the previous state to see if a match can be obtained. The more paths are available the more time it will take to come to the conclusion if a string fits to the expression or not. In our high repetitive case from the test are a lot of possible path to "walk through" and therefore it has a higher runtime. If you're interested in the possible paths of regex engines, I can recommend you regex101.com and its debugger feature.

High repetition is one cause for multiple paths of a string, that has to match with a pattern and a regex group with a quantifier is a most common cause. It isn't always inevitable just to omit the group part of a regular expression as you may need it for different use cases. I just covered matching patterns in the previous example and I mentioned before, how powerful regular expressions for replacing or extracting parts of a string can be. It doesn't have to be the complete pattern, that can cause performance problems. Even just one repetitive part in a bigger expression can be used to slow down your application.

The security issue

The previous sentence should ring some alarm bells: "can be used to". Calling the bottleneck regular expression an evil pattern was not randomly chosen, because when there's a possible slow down for your application, there may be a way for a potential Denial of Service attack too. In this case, it's called ReDoS (Regular expression Denial of Service) and it can be a big issue compared to other attacks. Most security issues can be prevented by two important steps:

Validate user input
Escape user input

"User" means any third party source of input values that a developer can't influence, but check, prevent or "defuse". Regular expressions are used for validation in some projects, which can, depending on the pattern, make the part of your code, that was created to prevent security issues, be a security issue itself. If a delay in loading time was found by just using a simple string, a first test to create a workload can be started quickly and the first step for getting the system down is done.

It can be easy for an attacker to find out if you're using an evil pattern. In case of a web application a pattern can be used to validate input values in the browser and later again on server side to prevent bypassing the frontend. An attacker can review a frontend-pattern, assume that it's also used in the backend and try to use that knowledge for an attack on the server side. For example, some E-Mail addresses are checked with a regular expression and are possibly one of the first places for an attack.

So with a great regular expression comes great responsibility and you have to be careful about what your pattern does allow.

Fixing evil patterns

To prevent long runtimes with regex, I'll show a few solutions for simple evil pattern examples, to prevent a bad performance influence.
What if we need a group like in ([a-f0-9]+)* to be repetitive, because it is an important part of a bigger expression? The asterisk quantifier * in there means, that this group can be zero or more times in a string. Do you really need it multiple times? If you need it just zero or one times to be in a string, a simple ? would be a better choice. If you know the exact amount, you can use {2} where the number represents the amount, but don't set it to high. That would remove a zero times check. Otherwise, {0,2} may be a more fitting quantifier to make it possible for 0, 1 or 2 occurrences.
This will lead to the following suggestions:

([a-f0-9]+)?
([a-f0-9]+){2}
([a-f0-9]+){0,2}

Even if there is no quantifier inside your group, the meta-character for an OR | inside a group can slow down your code. A pattern like (a|aa|aaa|aaaa|aaaaa|aaaaaa|aaaaaaa)+ (and more) is a good source of a bottleneck or for a ReDoS attack. Even here can a more accurate quantifier a good help: (a{1,6})+. But this is still an evil pattern and the question needs to be asked again: Do you need to use + for your group or is there a limit for a match? Or do you need the quantifier anyway?

In case of matching a hash, you will usually have a maximal length where a string is valid or not, but even when a string can be really long and still be valid, is it a real world use case to have a string with a high length in your project? A check on the input value before it hits the validating regular expression can be a wise decision too.

Review the regular expressions

Like every part of code, SQL or API calls, regular expressions need reviews too. They may just occupy a part of a code line, but can hide unwanted surprises.
Changing a pattern until it works as expected is one source where evil patterns can be created in a project. If it works, create tests for the functionality, review and adapt/simplify your expressions (in case you found a bottleneck) and prevent possible attack angles. Some testing frameworks offer test capabilities for execution timeouts, if a test runtime is too long, and it may be possible to use those for a string, that would cause performance issues with an evil pattern. A project and its code is always changing and a previously inaccessible part can become reachable over time, so there is no argument like "It can't be used for an attack" in development. It doesn't hurt to be on a safe side when it comes to a project's security and performance. If you want to read more about the ReDoS topic and performance, you can visit regular-expressions.info and its ReDoS section Preventing Regular Expression Denial of Service. I hope you got a good short insight for this matter and you might revisit the regular expressions of your code.

About Claudio Zizza
I'm working as a software developer for more than 15 years and used many coding languages and frameworks during that time, like PHP (of course), ASP, .NET, Javascript, Symfony, Laminas and more. I also contribute to the open source communities and am a part of the PHP Usergroup in Karlsruhe, Germany.
Follow me on Twitter (@SenseException)