
How to remove all control characters from a string - the story of one stormy optimization

I once had to optimize the code of a cluster task that was part of the Big Cluster Algorithm and did a very simple thing: depending on the contents of the fields, an input stream of n fields had to be redistributed into an output stream of m fields, and that was almost it. Almost, because the fields contained strings of arbitrary content that needed to be "cleaned": the seemingly simplest operation of removing all control characters from a string.

It turned out that this operation is not as simple as it seems, especially in modern languages that run on a virtual machine. Below I will show how a one-line solution can be replaced with a couple of dozen lines, increasing the algorithm's throughput roughly tenfold. I will note up front that the examples are in Java, but similar reasoning holds for most other languages and virtual machines, first of all .NET-based ones.

In fact, the whole activity of the algorithm boiled down to:
  1. Receive from outside (over the network or from disk) a set of n fields: (in₁, in₂, ..., inₙ)
  2. Perform a dozen trivial operations such as copying in to out with checks of simple conditions
  3. Clear all strings in (out₁, out₂, ..., outₘ) of control characters
  4. Send them on (write to disk or send over the network)
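As a rough illustration of steps 2 and 3 (the field names, the copy conditions, and the `strip` helper below are all hypothetical; the article never shows the real redistribution logic), the pipeline might be sketched as:

```java
// Hypothetical sketch of steps 2-3: redistribute n input fields into m
// output fields, then strip control characters from every output field.
// The conditions and field layout here are invented for illustration only.
public final class FieldPipeline {
    // Step 3: the "simplest" operation this article is about,
    // in its original one-liner form.
    static String strip(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    static String[] redistribute(String[] in) {
        String[] out = new String[2];
        // Step 2: trivial copies guarded by simple conditions (made up here).
        out[0] = in[0].isEmpty() ? in[1] : in[0];
        out[1] = in.length > 2 ? in[2] : "";
        // Step 3: clear all output fields of control characters.
        for (int i = 0; i < out.length; i++) {
            out[i] = strip(out[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] out = redistribute(new String[] { "a\u0001b", "x", "c\nd" });
        System.out.println(out[0]); // prints "ab"
        System.out.println(out[1]); // prints "cd"
    }
}
```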

From my point of view, such a task should logically be bound by the disk or the network; the CPU load should be minimal. However, the simplest profiling showed something entirely different. The original version contained a line that accounted for 80-90% of the algorithm's running time, exactly one line performing step 3. Here it is:

Option 1
s = s.replaceAll("\\p{Cntrl}", ""); 

I even know where this line came from: a quick Google search for "java strip non-printable characters" gives exactly this variant. So, the task is clear, the goal is clear, and the programmer's pride is wounded ("how can I possibly fail to speed up this unfortunate line"), so let's go!
We quickly write a simple harness that measures the running time of the algorithm, isolate it from the rest of the code, generate a test input line, run it through the algorithm millions of times and measure throughput. It turns out that the first version processes 517,009 lines per second. Taking several measurements, we find that the accuracy of our experiments is about 2-3%; roughly speaking, only the first two digits of the result are meaningful, so the numbers land somewhere between 500,000 and 540,000.
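The harness from the article is not shown, but a minimal sketch of such a throughput measurement (the test line, the iteration count, and the warm-up size are all assumptions) could look like this:

```java
// Minimal throughput harness: time a strip variant over a fixed test line.
// The real harness, test data, and iteration counts from the article are
// not shown; everything here is an assumed stand-in.
public final class StripBench {
    static String strip(String s) {
        return s.replaceAll("\\p{Cntrl}", ""); // variant under test
    }

    public static void main(String[] args) {
        String line = "some text\u0001 with a few\u0002 control\n chars";
        int iterations = 1_000_000;
        // Warm up so the JIT compiles the hot path before we measure.
        for (int i = 0; i < 100_000; i++) strip(line);
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) strip(line);
        long elapsedNs = System.nanoTime() - t0;
        double linesPerSec = iterations / (elapsedNs / 1e9);
        System.out.printf("%.0f lines/sec%n", linesPerSec);
    }
}
```

Swapping the body of `strip` for each of the variants below reproduces the style of comparison used throughout the article.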

We recall exactly what our strings contain, what needs filtering, and what \p{Cntrl} actually is. This element is essentially equivalent to selecting the characters with codes 0 to 31 inclusive, plus 127. Just in case, we quickly check whether anything changes if we replace \p{Cntrl} with a trivial enumeration of the characters via an [a-z]-style character class, in our case:
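That equivalence is easy to spot-check. In Java, \p{Cntrl} is the POSIX cntrl class, matching exactly codes 0-31 and 127 (DEL); note that a simplified [\x00-\x1F] range quietly drops 127. A small verification sketch:

```java
// Spot-check which ASCII characters Java's \p{Cntrl} actually matches.
import java.util.regex.Pattern;

public final class CntrlCheck {
    public static void main(String[] args) {
        Pattern cntrl = Pattern.compile("\\p{Cntrl}");
        for (int c = 0; c < 128; c++) {
            boolean matches = cntrl.matcher(String.valueOf((char) c)).matches();
            // \p{Cntrl} should match exactly codes 0..31 plus 127 (DEL).
            boolean expected = c <= 31 || c == 127;
            if (matches != expected) {
                throw new AssertionError("mismatch at code " + c);
            }
        }
        System.out.println("\\p{Cntrl} == [\\x00-\\x1F\\x7F]");
    }
}
```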

Option 2
 s = s.replaceAll("[\\x00-\\x1F]", ""); 

No, it does not: still 520,000 ± 3%.

We recall that in the wonderful Java language the compiler, the virtual machine and the standard library often live separate lives and most likely do not optimize such a simple operation: even though the same regular expression is applied millions of times, it is recompiled on every call. A peek into the documentation for String#replaceAll confirms this guess. Let's hoist that compilation out of the loop:

Option 3
public static final Pattern KILL_CC_PATTERN = Pattern.compile("[\\x00-\\x1F]");
...
Matcher m = KILL_CC_PATTERN.matcher(s);
s = m.replaceAll("");

Suddenly there are three lines instead of one, and throughput has grown to 640,000 ± 3%. Not a multiple, but a +23% gain nonetheless.

We don't stop there. Let's try walking the string manually in a loop, examining it character by character (pulling code points out via String#codePointAt) and accumulating the result not in a String but in a StringBuilder or StringBuffer. In our case it makes no difference which of the two to use: the cluster handles parallelization by launching N independent processes, so no thread safety is needed. A quick glance at the documentation shows that it is recommended to pre-initialize a StringBuilder with a capacity equal to the expected number of characters in the result. There is no reason not to believe the documentation, so let's do it: we know for sure that the result will contain no more characters than the original string.

Option 4
StringBuilder sb = new StringBuilder(s.length());
for (int j = 0; j < s.length(); j++) {
    int ch = s.codePointAt(j);
    if (ch >= ' ') sb.appendCodePoint(ch);
}
s = sb.toString();

Seven lines instead of one, but already 710,000 ± 3% lines per second. That is +37% over the original, more than a third.

We keep thinking. A simple thought flashes by: what happens if we drop the work with Unicode code points and switch to ordinary Java chars? We lose the ability to treat surrogate pairs, composite characters and so on as single units, but from the point of view of stripping control characters nothing breaks. Let's try:

Option 5
StringBuilder sb = new StringBuilder(s.length());
for (int j = 0; j < s.length(); j++) {
    char ch = s.charAt(j);
    if (ch >= ' ') sb.append(ch);
}
s = sb.toString();

The same seven lines, and the change looks insignificant at first glance, but throughput suddenly jumps to 1,052,000 ± 3%! That is already impressive: slightly more than 2× the original (+103%).

Can it be done even faster? It can! Looking inside StringBuilder, we suddenly see that it is no wild magic reaching deep into the C code of the JVM, but a solution in pure Java: internally it simply keeps the characters in a char array plus some bookkeeping. In our case all that machinery is unnecessary; we are not going to insert anything into the middle of the string, shift it around, and so on. We can simply do everything directly on arrays, almost C-style:

Option 6
char[] res = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
    char ch = s.charAt(j);
    if (ch >= ' ') {
        res[newLen] = ch;
        newLen++;
    }
}
s = new String(res, 0, newLen);

Now it is 10 lines, but throughput has doubled: a full 2,022,000 ± 3%! That is almost 4 times faster than the original regexp.

What else is there to optimize? The append call is already gone, but can the pass over the string via String#charAt() be optimized too? It turns out it can. Over another cup of coffee, let's peek inside the sources of String and of that same StringBuilder. There we see that String also keeps all its data in the same kind of char[], which means we can reduce the number of method calls (invokevirtual operations in bytecode terms), replacing them with operations like aload, iload and castore, which are very cheap and well optimized by the JVM.

So the move is trivial: replace the many charAt() calls with a single String#getChars. Checking:

Option 7
char[] oldChars = new char[s.length()];
s.getChars(0, s.length(), oldChars, 0);
char[] newChars = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        newChars[newLen] = ch;
        newLen++;
    }
}
s = new String(newChars, 0, newLen);

Another small miracle: 12 lines, but 2,500,000 ± 3%, i.e. almost 5 times faster than the original.

By this point the original task had long been solved: the cluster algorithm was now bottlenecked on something other than this operation, and I had already spent an indecently long time on these micro-optimizations, so my brain preferred to surrender and accept that it could hardly be made any faster. After a while, purely out of sporting interest, I returned to the problem and tried a few more hypotheses.

Continuing in the same sporting spirit, I posted the puzzle on Stack Overflow, where I was pointed to several rather ancient but, to my surprise, effective techniques, essentially in pure C style:

Option 8
int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

Oddly enough, it yields a full 3,100,000 ± 3% lines per second, i.e. almost 6 times faster than the original and another 24% faster than the previous best version.

The main gain comes from two banal constructs, well known from C but still perfectly effective: hoisting the string length into the local variable length (saving the repeated calls to String#length(); I had naively hoped the JVM would do this for me) and reusing the same oldChars array to hold both the old and the new string. The latter works because we only ever read the j-th character of the old string, and at the moment of reading we always have newLen <= j, so the write position never overtakes the read position.

It would seem there is nowhere further to go, but there is also a ninth option. We can avoid allocation altogether by sacrificing the thread safety of the function and allocating a static buffer in advance. We exploit the fact that most lines in the source stream are bounded from above by some specific length, say 1024 characters; only in very rare cases will a larger buffer need to be allocated:

Option 9
public static char[] buf = new char[1024];
...
int length = s.length();
char[] oldChars = (length < 1024) ? buf : new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

It looks like a trifle, but it yields a full 3,490,000 ± 3% lines per second, a 6.7× speedup (or +12.5% over the previous version).

The final version, at which I have stopped for now, is the tenth. The last thing left to save on is creating a new String object. It is an especially great pity to do so when, in practice, 99.9% of the lines passing through the algorithm need no stripping at all, so the output equals the input.

Option 10
public static char[] buf = new char[1024];
...
int length = s.length();
char[] oldChars = (length < 1024) ? buf : new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
if (newLen != length) s = new String(oldChars, 0, newLen);

This is a bit of cheating, since it exploits a priori knowledge of the input stream, but it is worth it. The final best result is 5,350,000 lines per second when processing a stream in which only 0.1% of the lines actually need stripping. That is already a 10× speedup over the original, or +53% over the previous version.

Conclusions and morals


Hmm. Maybe someone knows a way to make it even faster?

P.S. The author of the wonderful photo at the beginning of the article is JcOlivera.com; the photo is distributed under CC-BY-NC-ND.

Source: https://habr.com/ru/post/126954/
