The subtleties of regular expressions. Part 2: Returns and their quantity

Part 1: metacharacters inside and outside character classes .

In this part, I would like to tell you about how the regular expression engines work, why some people think that regular expressions are very slow, and why the authors of many engines do not follow the POSIX standard.

As in many problems, the problem of “finding a match in a string for a regular expression” exists in many approaches and principles for solving it. The basic principles in this case are 2: matching the search, controlled by a regular expression, and matching by string. Let's take a closer look at what each of these principles means.

String-driven matching

This type of regular expression engines is implemented using the so-called DKA ( Deterministic State Machine ). The main advantage is the tremendous speed of finding matches (compared to another principle). Naturally, this principle has a lot of flaws (otherwise it would not be interesting, right?).
')
First, the compilation time (meaning the compilation of a regular expression into the internal structures of the engine, and not the compilation of the source code of the program) can be extremely long (compared to another principle). In fact, during compilation, a whole program of comparison of a specific pattern is created. Since DFA is “controlled” by a string, and we don’t know the string at compile time, the algorithm is created taking into account all possible strings. For complex expressions, this is obviously not done so quickly.

Secondly, DKA does not support very many convenient and necessary features to which we are used to regular expressions. For example, saving brackets are not supported. Of course, with the set that DKA has, it is impossible to do in more or less real situations. Therefore, in its pure form, DKA is now almost not used anywhere. But it is successfully used in the so-called "hybrid" engines, which in the simplest case, choose the engine that is most suitable for a particular regular expression.

Thus, if you need to search in a huge amount of text and at the same time your regular expression is not too complicated, it is difficult to find anything better than DKA.

So how does DKA work? The engine is called “string driven” because it works just like that. The state machine reads the string in which to find matches, and makes comparisons of the string with the pattern. At each moment in time, the algorithm “knows” which parts of the pattern and lines coincide. When the algorithm completely scans the entire line, it remains to select only the longest match that is closer to the beginning of the line and display it as an answer.

For definiteness, we assume that we have a regular expression ^abc\w+e$ and two strings abcde and abce . It is quite obvious that the expression matches the first line, but not the second.

Let us consider how the search for a match occurs in the DFA mechanism. For the first line has the following sequence:

The first character of the string “a” matches the position with the metacharacter ^ . Found 1 partial match.
The first character of the string “a” is the same as the second character of the pattern. The algorithm continues to lead 1 partial match.
The second character of the string “b” is the same as the third character of the pattern. The algorithm continues to lead 1 partial match.
The third character of the string "c" is the same as the fourth character of the pattern. The algorithm continues to lead 1 partial match.
The fourth character of the string “d” is the same as the fifth metacharacter \w pattern. The quantifier requires that the metacharacter matches at least once. The condition is fulfilled. The algorithm continues to lead 1 partial match.
The fifth character of the string “e” is the same as the eighth character of the pattern. The quantifier condition for the previous metacharacter is fulfilled, we can count the character “e” as a match with the eighth character of the pattern, or as a match with the \w metacharacter. The algorithm does both, and it continues to lead already 2 partial coincidences. Remember this moment, in another principle there will be a very significant difference.
The sixth character of the string is missing - matches the position with the $ metacharacter. Found 1 complete match, and the second partial does not meet expectations and is discarded.
The line ended, return the result.

What will happen in the second case:

The beginning of the algorithm is exactly the same, immediately go to the fourth character of the string "e"
The fourth character of the string “e” is the same as the fifth metacharacter of the \w pattern. The quantifier condition requires that at least one character match exists. Therefore, "e" is captured by the \w metacharacter. 1 partial match
The fifth character of the string is absent - there is no match with the symbol “e”. Obviously, the string does not match.
The line has ended, there are no matches, we return the result.

What can we see in these two examples? The algorithm doesn't care if we have a match or not. It simply scans the string character by character and tries to match matches based on the pattern. If at the end of the work we have several complete matches, then the result will be what is located closer to the left edge of the line and the longest.

Actually, it becomes clear why the algorithm works so quickly. Except in some cases, each character of the string is viewed only once. It also becomes clear why a pattern is compiled for a long time - for complex patterns, many conditions and dependencies must be taken into account in order to correctly conduct partial matches.

At the moment, the DKA engine is used in the grep utility as part of the engine selection mechanism and in the Tcl language as part of the hybrid engine, it’s probably used somewhere else, but I don’t know about it.

Regular expression driven matching

This type of regular expression engines is implemented using the so-called NCA ( Nondeterministic Finite State Machine ). The main advantage is the support of a large number of functions (compared to another principle). Naturally, this principle also has a lot of flaws.

Firstly, the work time can be extremely long (compared to another principle). Up to exponential asymptotics from the length of the input string. Many popular engines even have protection against this by interrupting the search.

Secondly, the NCA depends very much on how the regular expression is written. If the DFA is absolutely all the same on the record form: the time and the search results will be exactly the same for any equivalent expressions, then for the NCA this is a whole problem. One expression can give exponential asymptotics, and the other is almost linear. Accordingly, the question of optimizing the pattern rises to its full height. NCA is currently used in most regular expression engines.

A key feature of the classic NCA is that it interrupts the search as soon as it finds the first complete match. Those. search results in some cases will be very different from DKA. This is a feature of the algorithm.

So why authors of engines do not hasten to follow POSIX? The fact is that the semantics of DFA search is fixed in the standard. Those. Of course, it is not said literally that DFA should be used, but it is said about “the leftmost coincidence of the greatest length”. But the NKA returns just the first, and not the greatest length. And what to do? In order to comply with the POSIX, they invented a modification of the algorithm, which is usually called the POSIX NCA. And it works much longer than just an NCA. Because in order to satisfy the standard, it is necessary to perform a brute-force search in general of all variants that a regular expression allows. Next, I will explain why it is much longer.

Compiled (the value is the same as for DCA) NCA is a sequential algorithm, instructions based on a pattern. A key feature of NCAs is “returns.” The search speed directly depends on their number.

What returns are best explained by example. Take the same two strings that were used in the DTA example, and the same regular expression.

For the first line:

The first metacharacter of the pattern ^ - matches the position with the beginning of the line. The return stack is empty.
The second character of the pattern “a” is the same as the first character of the string. The return stack is empty.
The third character of the pattern “b” is the same as the second character of the string. The return stack is empty.
The fourth character of the “c” pattern is the same as the third character of the string. The return stack is empty.
The fifth metacharacter of the string \w is the same as the fourth character of the string. The quantifier requires at least one character, so there can be no return. The return stack is empty.
The fifth metacharacter of the string \w is the same as the fifth character of the string. The quantifier requires at least one character that has already been captured. At this point there may be a refund. The return stack contains 1 item.
At this intermediate stage, the NCA found as a coincidence the entire string “abcde”, but had not yet completed the pattern. The current position in the line is behind the “e” symbol, although the “e” symbol from the pattern itself has not yet coincided. This is where the fun begins. Watch your hands.
The eighth character of the pattern “e” - does not coincide in meaning with the sixth character of the string (it simply does not exist). No match found. Check the return stack - there is one record. We carry out the return. After the return, we are in the position “before the character e” in the line and in the position “before the character e” in the pattern. The return stack is empty.
When the NCA does not find a match, it checks the stack of returns. When the return is executed, the position in the row is restored and the return is removed from the stack. If the return stack is empty, but there is no match, then it is considered that there are no matches in this starting position in the string.
The eighth character of the pattern “e” is the same as the fifth character of the string. The return stack is empty.
The ninth metacharacter of the $ pattern - coincides in position with the end of the line. The return stack is empty.
The pattern is over, there is a coincidence at the moment. We return it.

What will happen in the second case:

The beginning of the algorithm is exactly the same; we immediately go to the eighth character of the “e” pattern.
The eighth character of the pattern “e” does not coincide in meaning with the fifth character of the string (again, it simply does not exist). No match found. We check the return stack — it is empty (empty because at the previous stage we didn’t add anything to the return stack, because the quantifier + requires at least one match). Return cannot be performed. The line did not match, we return the result.

For more understanding, I’ll tell you that the + quantifier always captures as many characters of a string as it can. Therefore, it is called the maximum. This is a frequent error of beginners. The pattern $.+$ Will match in the line “here is the text (bracket) and then there is the text (another bracket) and here the text” with the part “(bracket) and then there is the text (still bracket)”, and not with “( bracket) ". Just because at the processing stage .+ whole line will be captured to the end, and as soon as we execute the returns, we will reach the “)” symbol, so a complete match will be found right away. Let's slightly change the regular expression: $[^)]+$ and now there is no problem in it that interfered with us. But another one appeared (and how could it be without it). Now, in the line “here is the text (bracket (and inside there is another bracket) well, and again the text) and then there is text” will be captured “(bracket (and inside is another bracket)”, which is of course not correct. This will happen because at the capture stage, we will not reach the second closing bracket and immediately return the match found on the first one. I will not consider this example further, because in the general case the problem of balanced brackets cannot be solved only with the help of regular expressions. In .NET there is a special construct that allows you to solve the problem balance checks (namely checks).

Actually, I was a little bit mean when I described how exactly the algorithm works. The fact is that it will work this way if the engine applies optimization (obviously, if the pattern starts with a ^ , then either it will match at the beginning of the line, or it will not match at all). In fact, if we omit the optimization, the engine will perform exactly the same search starting with each next character of the line, if it does not find any complete matches in the previous step. Those. for the second line, 5 consecutive searches will be performed (starting before the “a” symbol, then “b”, “c”, “e” and the last one after the “e” symbol), and only after that it will be possible to say that there are no matches. DKA would deal with all cases in 1 pass. But in the absence of a match, the NKA will need 5.

The difference between the POSIX NCA engine will be that in the first case (when a match is found) the search would also not be stopped, but all 5 searches would be performed, as well as if there was no match, which of course is much longer.

Now that we have considered the technique of performing returns, you can guess where the exponential search time is taken. In some situations, the algorithm falls into the trap, creating a huge number of returns. Consider, for example, the expression (\w+)*a . If you apply it to the string "bbb", then what happens? First, the first + quantifier captures the entire string and transfers control to the second quantifier. Since there are no more characters in the string, go to the “a” symbol of the pattern. Which is not in the line, we perform a return by one character, re-take the newly released character into a new group, check “a”, again there is no coincidence, return. Further the meaning should be clear. The algorithm will iterate over all possible line breaks to capture two quantifiers. For the string “bbb” these will be the options: “[bbb]”, “[bb] [b]”, “[b] [bb]”, “[b] [b] [b]”. Where square brackets conditionally denote the captured group. The number of partitions grows exponentially with the length of the string in this case.

Another situation. Imagine a 1 megabyte text and a view pattern .*a What happens when you search? At each stage, the entire megabyte line will be captured and successive returns will be performed until the “a” character is found. And if there is no such character in the string? Then 1000001 searches will be done. Which each time will capture the entire line from the current character to the end and return through it until it finds the character “a” (which is not present). In order to avoid this, it would be enough in this case to write [^a]*a . But this is for this case. In the general case (as in the example with brackets) this is not enough.

About techniques for optimizing returns, you can write a lot more. The next article I plan to devote to this.

Based on the book by Jeffrey Friedl, Mastering Regular Expressions .

Source: https://habr.com/ru/post/112327/

All Articles

The subtleties of regular expressions. Part 2: Returns and their quantity

String-driven matching

Regular expression driven matching

More articles: