Getty Images/iStockphoto
The benefits of using compiled regex in Python and Java
Whether you're programming in Java or Python, compiled regex can significantly speed up your text manipulation routines.
I was involved in a recent discussion on the "best" way to remove a given parameter from a URL string.
The conversation began with using string primitives to split and join the parameter, a method that is clear and easy to read. One can achieve the same result using one or two lines of regular expression replacement, but the retort was that regular expressions, or regex, are notoriously slow. That's generally true.
Nevertheless, I wondered how much slower are regex versus string primitives, and if using precompiled regular expressions improves performance.
Remove parameters with String Manipulation
The simplest and arguably most readable method to remove a parameter is to split the string using the split primitive, remove the given parameter and then rejoin the string.
This felt like a lot of work to me, and likely it would be memory-heavy given that strings are immutable in most languages.
In Java, this amounts to four lines of code:
String[] parts = input.split("&");
List<String> partsList = new ArrayList<>(Arrays.asList(parts));
partsList.removeIf(part -> part.contains("option"));
outputsSplit = String.join("&",partsList);
In Python, the code is shorter and more succinct:
parts = input.split("&")
parts = [part for part in parts if "option" not in part]
output = '&'.join(parts)
Both do the same thing: manipulate and rejoin a split list.
Remove parameters using regular expressions
The other option to remove parameters in Java or Python is to use a regular expression to delete the matching portion of the input string. This method is much less readable but uses fewer lines of code. It is a use case for which regular expression is well-suited.
In Java this required a single line of code:
output = input.replaceFirst("option=one$","").replaceFirst("option=one&","");
In Python it amounted to two lines of code:
output = re.sub(r"option=one$", "", input)
output = re.sub(r"option=one&", "", output)
The Pythonic way uses fewer lines of code than the string primitives. It is a well-suited use case for regular expressions, although the catch is that regular expressions are notoriously slow.
Knowing that ahead of time, what can you do? There's another way.
Compiled regular expressions to remove parameters
When the regular expression that you use to match the section to delete does not change, you can instead use what are called compiled regular expressions.
Consider standard regular expressions to be a recipe of what to look for in a string. The computer must read the recipe every time it is used and parse it. This is generally the most performance-consuming part of using regular expressions.
Compiled regular expressions only do this once, which makes them much faster than standard regular expressions, although they cannot be used if the pattern changes.
The negative here is that the number of lines of code increases from the original string primitives version, which was the initial motivation to use regular expressions.
In Java we end up with six lines:
private static final Pattern PATTERN_END = Pattern.compile("&option=one$");
private static final Pattern PATTERN_START_MIDDLE = Pattern.compile("option=one&");
Matcher matcherEnd = PATTERN_END.matcher(input);
String intermediateResult = matcherEnd.replaceFirst("");
Matcher matcherStartMiddle = PATTERN_START_MIDDLE.matcher(intermediateResult);
output = matcherStartMiddle.replaceFirst("");
In Python it takes us four lines:
PATTERN_END = re.compile(r"&option=one$")
PATTERN_START_MIDDLE = re.compile(r"option=one&")
intermediate_result = PATTERN_END.sub('', input)
output = PATTERN_START_MIDDLE.sub('', intermediate_result)
The real question, then, is this: Is the increase in lines of code worth the performance gain?
To test the methods head-to-head, I wrote some code to generate 100,000 input lines, and within it I randomly placed the parameter we want to remove. I then ran the code through the functions and measured the time to completion.
Java results with compiled regular expressions
In Java, I was surprised by how close the performance was for standard regular expressions to using string primitives. String primitives took 188 ms to process the 100,000 lines. Standard regular expressions took 224 ms to process the same 100,000 lines.
Compiled regular expressions, on the other hand, were breathtaking. They managed to process the 100,000 lines in just 52 ms. That's over three times faster than string primitives.
Python results with compiled regular expressions
Using the same methodology of 100,000 randomly generated input lines in Python showed some differences to Java. The first was that the code in general was almost five times slower. This is to be expected with an interpreted language versus a compiled one.
In the second, there was a much larger differential between using string primitives and regular expressions. It was almost two times as slow to use a regular expression as opposed to string primitives. String primitives processed the 100,000 lines in 477 ms, while the standard regular expression came in at 904 ms.
Compiled regular expressions still performed comparatively well in Python, though. This method processed the 100,000 lines in 165 ms, about two times faster than string primitives in Python.
Conclusion
The result of all this is that yes, standard regular expressions are significantly slower than simple string primitives. However, the extent of the difference heavily depends upon the coding language you use.
Moreover, if you write high-performance string replacement code, then compiled regular expressions are the way to go.
David "Walker" Aldridge is a programmer with 40 years of experience in multiple languages and remote programming. He is also an experienced systems admin and infosec blue team member with interest in retrocomputing.