📜 ⬆️ ⬇️

The subtleties of regular expressions. Part 2: Returns and their quantity

Part 1: metacharacters inside and outside character classes .

In this part, I would like to tell you about how the regular expression engines work, why some people think that regular expressions are very slow, and why the authors of many engines do not follow the POSIX standard.

As in many problems, the problem of “finding a match in a string for a regular expression” exists in many approaches and principles for solving it. The basic principles in this case are 2: matching the search, controlled by a regular expression, and matching by string. Let's take a closer look at what each of these principles means.

String-driven matching


This type of regular expression engines is implemented using the so-called DKA ( Deterministic State Machine ). The main advantage is the tremendous speed of finding matches (compared to another principle). Naturally, this principle has a lot of flaws (otherwise it would not be interesting, right?).
')
First, the compilation time (meaning the compilation of a regular expression into the internal structures of the engine, and not the compilation of the source code of the program) can be extremely long (compared to another principle). In fact, during compilation, a whole program of comparison of a specific pattern is created. Since DFA is “controlled” by a string, and we don’t know the string at compile time, the algorithm is created taking into account all possible strings. For complex expressions, this is obviously not done so quickly.

Secondly, DKA does not support very many convenient and necessary features to which we are used to regular expressions. For example, saving brackets are not supported. Of course, with the set that DKA has, it is impossible to do in more or less real situations. Therefore, in its pure form, DKA is now almost not used anywhere. But it is successfully used in the so-called "hybrid" engines, which in the simplest case, choose the engine that is most suitable for a particular regular expression.

Thus, if you need to search in a huge amount of text and at the same time your regular expression is not too complicated, it is difficult to find anything better than DKA.

So how does DKA work? The engine is called “string driven” because it works just like that. The state machine reads the string in which to find matches, and makes comparisons of the string with the pattern. At each moment in time, the algorithm “knows” which parts of the pattern and lines coincide. When the algorithm completely scans the entire line, it remains to select only the longest match that is closer to the beginning of the line and display it as an answer.

For definiteness, we assume that we have a regular expression ^abc\w+e$ and two strings abcde and abce . It is quite obvious that the expression matches the first line, but not the second.

Let us consider how the search for a match occurs in the DFA mechanism. For the first line has the following sequence:



What will happen in the second case:


What can we see in these two examples? The algorithm doesn't care if we have a match or not. It simply scans the string character by character and tries to match matches based on the pattern. If at the end of the work we have several complete matches, then the result will be what is located closer to the left edge of the line and the longest.

Actually, it becomes clear why the algorithm works so quickly. Except in some cases, each character of the string is viewed only once. It also becomes clear why a pattern is compiled for a long time - for complex patterns, many conditions and dependencies must be taken into account in order to correctly conduct partial matches.

At the moment, the DKA engine is used in the grep utility as part of the engine selection mechanism and in the Tcl language as part of the hybrid engine, it’s probably used somewhere else, but I don’t know about it.

Regular expression driven matching


This type of regular expression engines is implemented using the so-called NCA ( Nondeterministic Finite State Machine ). The main advantage is the support of a large number of functions (compared to another principle). Naturally, this principle also has a lot of flaws.

Firstly, the work time can be extremely long (compared to another principle). Up to exponential asymptotics from the length of the input string. Many popular engines even have protection against this by interrupting the search.

Secondly, the NCA depends very much on how the regular expression is written. If the DFA is absolutely all the same on the record form: the time and the search results will be exactly the same for any equivalent expressions, then for the NCA this is a whole problem. One expression can give exponential asymptotics, and the other is almost linear. Accordingly, the question of optimizing the pattern rises to its full height. NCA is currently used in most regular expression engines.

A key feature of the classic NCA is that it interrupts the search as soon as it finds the first complete match. Those. search results in some cases will be very different from DKA. This is a feature of the algorithm.

So why authors of engines do not hasten to follow POSIX? The fact is that the semantics of DFA search is fixed in the standard. Those. Of course, it is not said literally that DFA should be used, but it is said about “the leftmost coincidence of the greatest length”. But the NKA returns just the first, and not the greatest length. And what to do? In order to comply with the POSIX, they invented a modification of the algorithm, which is usually called the POSIX NCA. And it works much longer than just an NCA. Because in order to satisfy the standard, it is necessary to perform a brute-force search in general of all variants that a regular expression allows. Next, I will explain why it is much longer.

Compiled (the value is the same as for DCA) NCA is a sequential algorithm, instructions based on a pattern. A key feature of NCAs is “returns.” The search speed directly depends on their number.

What returns are best explained by example. Take the same two strings that were used in the DTA example, and the same regular expression.

For the first line:


What will happen in the second case:


For more understanding, I’ll tell you that the + quantifier always captures as many characters of a string as it can. Therefore, it is called the maximum. This is a frequent error of beginners. The pattern \(.+\) Will match in the line “here is the text (bracket) and then there is the text (another bracket) and here the text” with the part “(bracket) and then there is the text (still bracket)”, and not with “( bracket) ". Just because at the processing stage .+ whole line will be captured to the end, and as soon as we execute the returns, we will reach the “)” symbol, so a complete match will be found right away. Let's slightly change the regular expression: \([^)]+\) and now there is no problem in it that interfered with us. But another one appeared (and how could it be without it). Now, in the line “here is the text (bracket (and inside there is another bracket) well, and again the text) and then there is text” will be captured “(bracket (and inside is another bracket)”, which is of course not correct. This will happen because at the capture stage, we will not reach the second closing bracket and immediately return the match found on the first one. I will not consider this example further, because in the general case the problem of balanced brackets cannot be solved only with the help of regular expressions. In .NET there is a special construct that allows you to solve the problem balance checks (namely checks).

Actually, I was a little bit mean when I described how exactly the algorithm works. The fact is that it will work this way if the engine applies optimization (obviously, if the pattern starts with a ^ , then either it will match at the beginning of the line, or it will not match at all). In fact, if we omit the optimization, the engine will perform exactly the same search starting with each next character of the line, if it does not find any complete matches in the previous step. Those. for the second line, 5 consecutive searches will be performed (starting before the “a” symbol, then “b”, “c”, “e” and the last one after the “e” symbol), and only after that it will be possible to say that there are no matches. DKA would deal with all cases in 1 pass. But in the absence of a match, the NKA will need 5.

The difference between the POSIX NCA engine will be that in the first case (when a match is found) the search would also not be stopped, but all 5 searches would be performed, as well as if there was no match, which of course is much longer.

Now that we have considered the technique of performing returns, you can guess where the exponential search time is taken. In some situations, the algorithm falls into the trap, creating a huge number of returns. Consider, for example, the expression (\w+)*a . If you apply it to the string "bbb", then what happens? First, the first + quantifier captures the entire string and transfers control to the second quantifier. Since there are no more characters in the string, go to the “a” symbol of the pattern. Which is not in the line, we perform a return by one character, re-take the newly released character into a new group, check “a”, again there is no coincidence, return. Further the meaning should be clear. The algorithm will iterate over all possible line breaks to capture two quantifiers. For the string “bbb” these will be the options: “[bbb]”, “[bb] [b]”, “[b] [bb]”, “[b] [b] [b]”. Where square brackets conditionally denote the captured group. The number of partitions grows exponentially with the length of the string in this case.

Another situation. Imagine a 1 megabyte text and a view pattern .*a What happens when you search? At each stage, the entire megabyte line will be captured and successive returns will be performed until the “a” character is found. And if there is no such character in the string? Then 1000001 searches will be done. Which each time will capture the entire line from the current character to the end and return through it until it finds the character “a” (which is not present). In order to avoid this, it would be enough in this case to write [^a]*a . But this is for this case. In the general case (as in the example with brackets) this is not enough.

About techniques for optimizing returns, you can write a lot more. The next article I plan to devote to this.

Based on the book by Jeffrey Friedl, Mastering Regular Expressions .

Source: https://habr.com/ru/post/112327/


All Articles