Atomic grouping, or Not a step back!

0. Tip

In a certain kingdom, in a certain state, there lived a programmer. His name, as it should be, Ivan. He was a real specialist, possessed all the Three Great Virtues of the Programmer , that is, he was lazy, arrogant and impatient. Great sadness happened in that kingdom: a crisis. And they kicked Vanya out of work without severance pay. Vanya grieved for a long time, and then he collected his courage and sent out a resume all over the world. How long, shortly, did they call Vanya for an interview. There were many requirements for the job seeker, but the main thing was that he needed to have regular expressions well. Before the interview - almost a month, I do not want to get ready. Being a serious man, Ivan decided to prepare thoroughly. 3 weeks and 3 days he lay on the stove, read Habr and thought how incredibly thoroughly he would prepare. Before the interview was 1 day. Vanyusha mentally cursed employers who schedule an interview so soon that he didn’t have time to get ready, got off the stove, handed over beer bottles and bought a regex book with the money. He read it to complete exhaustion, until he turned off. In the morning we will find the sleepy physiognomy of Vanya lying, as if on a pillow, on this very book under Habrakat.

1. Recursion

At the interview, Ivan was given such a task. Write a regex that matches the expression in parentheses (...), and inside the parentheses there can be many expressions in other parentheses, probably nested ones. For example: in the chain
sin(2*pi/tan(.7+b*tanh(b/2)))+8*cos(b/4)
Regex should match (find) the contents of the 1st pair of brackets:
(2*pi/tan(.7+b*tanh(b/2)))
However, regex should not match such a part of the chain in which the brackets are not balanced (open brackets do not close or, conversely, brackets that were not opened at all) are closed. For example, processing a chain
(sin(b/2)
Regex should find (b/2) , ignoring the 1st bracket, since it does not close. And in the next thread
2*pi)*(r*r
Regex should not find anything at all, since there is not a single pair of “correct” brackets. And one more limitation is that “empty” brackets () forbidden to match, that is, there must be at least some content inside a pair of brackets.
In short, you need to match the "correct" expression in brackets, which may contain subexpressions with nested brackets, and empty brackets () not considered correct.
Ivan writes out approximate expressions and shakes on a mustache:
1) the sought expression is something in brackets.
2) Inside brackets there can be either something without brackets, and then everything is simple:
(3.14 * R * R)
... or something in brackets:
(2 * sin(pi/2))
Stop! In the latter case, inside the brackets is no longer just “something in brackets”, but first “something without brackets” 2 * sin , and only then “something in brackets” (pi/2) !
And it becomes clear that within the brackets "something without brackets" and "something in brackets" can occur as many times as necessary:
(2 * (a+8.5) * sin(pi/2) / (b - 1e-8)) .
To set “something without brackets” in the language of regex is quite simple: [^)(]+ . How to set an alternative (something without brackets OR something in brackets) and “as many times as you like” is also simple: metacharacters | and +. But what is “Something in brackets” and how to write it in the language of regex?
"Something in brackets" ... "Something in brackets" ... Somewhere it has already occurred ... Ah, here: point
1) the sought expression is something in brackets . What is this “something in brackets”? If the sought expression is something in parentheses, then something in parentheses is ... the sought expression! Eureka!
Ivan comes to mind this grammar:
search-expression :: = ( {expression-without-brackets | search-expression} + )
Where + means “1 or more times”, {a | b} means "a or b", bold brackets denote the brackets themselves, and
expression-without-brackets :: = any-any-except-brackets +
That is, the definition of the sought expression is recursive. But how to write it in the language of regex? Can regex include itself? Ivan would have thought it was impossible, but you forgot - he had been preparing hard for a month! He recalled that in modern regex engines this is possible: anywhere within regex
(?R)
means a link to the entire regex. Ivan writes the following regex (with the / x key, which means that all whitespace characters, including line breaks, are not taken into account, and comments are also possible after the # character):

/
\( #
(
[^)(]+ # --
| #
(?R) #
)+ # 1
\) #
/x

Ivan checks the regex on several examples of chains (good, interviews are allowed to run test programs, you can not just go online and read the documentation):
there are no parentheses here
(a + b)
sin((pi/180)*deg + theta)
1+(1+1/(1+1/(2+1/(1+1/(1+1/(4+1/(1+1/(1+1/(6+1/(1+1/(1+1/(8+1))))))))))))
sin)a - b(
sin(a - b(
sin(a * (b+1)
and regex with them with all cope as expected. Satisfied, Ivan shows the solution to the interviewer. Happy end, curtain? But why the article is named as it is called ??
The interviewer asks what he should (if he should) find the regex in the following chain:
(you're gonna fail sonny unless you correct it (your regex)
Having read and turned pale, Vanyusha, nevertheless, without batting an eye, answers that, since the 1st bracket does not close anywhere, the regex should find the 2nd pair of brackets: (your regex) . He is asked to check. Ivan drives this chain into his verification script (Perl):

#!/usr/bin/perl -wl
use strict;

my $string = "(you're gonna fail sonny unless you correct it (your regex)";
print $string;

if ( $string =~ / \( ( [^)(]+ | (?R) )+ \) /x ) {
print "Match: $&";
} else {
print "Not matched!"
}

Having checked the syntax, Ivan runs the script and ... and nothing happens, the script seems to be “hanging”. Says neither yes nor no. Ivan and the interviewer silently look at it for 10 seconds. During this time, Ivan turns pale pink bright pink: he clearly already feels a trick. But what did he do wrong ?? At this point, the laptop on which the script was launched (and for some reason has not yet been completed) the script begins to buzz distinctly, although it worked silently until then. The interviewer, in order to interrupt an awkwardly prolonged pause, offers Ivan a cup of tea or coffee. “In the meantime, you drink, just think what could be wrong,” he adds. It takes another half hour, Ivan drinks 2 cups of coffee and one tea with lemon, but he has no ideas why the script behaves so mysteriously. The interviewer with Ivan parted, the HR department of the tsar-father makes a note: “the candidate was rejected as a result of the interview”. And we turn to the point

2. Catastrophic pullback

What happened with regex Ivan? Let's try to understand how the work of regex occurred.
/ \( ( [^)(]+ | (?R) )+ \) /x
on the "ill-starred" chain
(you're gonna fail sonny unless you correct it (your regex)

Regex bracket matches bracket in chain
[^)(]+ you're gonna fail sonny unless you correct it and “rests” on the opening bracket that you can’t eat
The alternative (... | ...) stands with the greedy quantifier +, so the regex engine tries to find (... | ...) again. The first branch [^)(]+ alternatives immediately upon seeing the opening bracket (says "I can not match"
The engine moves to the 2nd alternative: (?R) . This is nothing more than a whole regex at first. And in the chain left (your regex) . Everything is simple: a bracket (in regex matches a bracket (in a chain, then [^)(]+ your regex matches, then (there are no other occurrences of an alternative (... | ...) engine in the remaining part of the chain: no non-brackets ", nor the opening bracket) the engine moves to the closing bracket) in regex. And it matches the bracket) in the chain.
Great, the engine managed to match (?R) parts of the source chain (your regex)
In the chain, we come to an end, there are no more characters left. The quantifier in (... | ...) + is greedy, and the regex engine tries to find another alternative (... | ...), but it does not work: in the empty remaining part of the chain, it finds none brackets, no opening brace with which a branch could start (?R)
So, one more alternative cannot be found. Well, all greed should have boundaries. The engine is “content” with the 2 occurrences of the alternative in the chain already found and proceeds to the next part of the regex. This is a closing bracket). But there is no brace in the chain: in the chain in general, by this time it is empty, everything has already been removed from the previous parts of the regex. There is nothing to open the regex opening bracket.
What makes the regex engine go from here? Surrenders and says that there is no match? In no case. He remembers that the last expression with the quantifier [^)(]+ was greedy and, perhaps, "ate" too much, leaving nothing to the next parts of the regex. The regex slider rolls back.
The engine remembered that the previous time [^)(]+ “ate” your regex chain. It forces [^)(]+ “give” 1 character, x, for the following parts of the regex. That is, it remains in the chain: x) . And in regex we are inside an alternative (... | ...) +
Because it is greedy, the engine is trying to find another 1 alternative (... | ...), and it works, because the 1st branch of the alternative, [^)(]+ , the x character matches remarkably remaining in the chain.

So you can continue for a very long time, but here, in fact, the problem has already surfaced, the root of evil, the flaw of our regex. In the chain of your regex , which from the point of view of a person is one indivisible token-expressless brackets, our regex engine found 2 tokens: your rege and x , which were distributed across “different instances” of the regex [^)(]+ . If you remove source regex:
/ \( ( [^)(]+ | (?R) )+ \) /x
all “superfluous”, all that requires the presence of symbols “)” or “(” ”in the chain (and obviously there are no such characters in the chain of your regex), then only:
/ ( [^)(]+ )+ /x
And here the problem becomes even more obvious. After all, your regex can be matched as 1 token of your regex, and as 2 of your rege and x , and as you and r regex , and as 3, 4, ... or even 10 separate tokens, each of which consists of one character! The number of options for splitting your regex chain into tokens is huge. And regex / ( [^)(]+ )+ /x implies, if necessary, enumeration of all these options. Why did Ivan immediately, when checking test chains, not notice this problem? Because in that case the usual luck took place: although the number of options for splitting the chain was huge, but the 1st option of splitting (when [^) (] +, being greedy, captures all the text without whole brackets) turned out to be successful, there was a match and the regex engine just didn't have to roll back. In the case of the chain given by the interviewer, everything was worse, since neither the 1st, nor the 2nd, nor the 100,000,000th option of splitting a long line into tokens did not give a match, because he tried to find a closing bracket which simply was not. Therefore, in the latter case, the regex engine will roll back, trying more and more new splitting options, but it will not find the match in a reasonable time. This is called a catastrophic pullback .
')

3. Atomic grouping

Is it possible to prevent this problem? Yes, and very easy. The problem is that the “stupid” regex engine rolls back to the Greek calends, while a man of one glance is enough to understand: there is no use rolling back. No partitioning of your regex (and the previous part of the chain) will help to find the missing closing bracket. There is a way to tell the regex engine: “in this place, do not try to roll back, even if it leads to the absence of a full match”.
(?> .....)
this is called the "atomic grouping", and means roughly "inside the part of the regex that is between (?> and ) , rollback is prohibited." Or rather, in a different way. Let (?>X) , where X is a certain regex, is part of a larger regex A(?>X)B (A and B are also some rehexes). Suppose that the chain ab fed to the input of this larger regex, and here a and b are not single characters, but some chains of characters. Suppose that the initial part A of the larger rehex has already found a correspondence (“otmatchila”) chain a. The regex engine moves to the regex (?> X), and in the chain being processed, the character pointer stands immediately behind the chain a (and immediately before b). In this case, the atomic grouping (?> X) works as if an absolutely separate, independent, regex X was fed to the input chain b. And X “does not know” and it doesn’t matter to him whether there is any other regex B after him and whether B will succeed in something to match. X works as if there are no more regexes besides him. And if, in particular, X contains something with a greedy quantifier:
(?> [^)(]+ )
then the atomic group will not allow this “greedy” part to roll back. If we immediately captured the entire chain of your regex , then so be it, not one step back!
Changing our regex to find the text in brackets, we would solve the problem:
/ \( (?> [^)(]+ | (?R) )+ \) /x

In the article Superjugate quantifiers we considered quantifiers ++, *, +, etc. It turns out that superjudic quantifiers are a special case of atomic grouping. And we could achieve the same effect (get rid of a catastrophic pullback) by making the quantifiers in the source regex super-greedy:
/ \( ( [^)(]++ | (?R) )+ \) /x
or better yet
/ \( ( [^)(]++ | (?R) )++ \) /x

So the tale is over, and who listened ... let him grieve with the author, who, before writing the article, could not have imagined that the explanations would be so long and incomplete. But if someone understood something and does not repeat Vanya’s experience, it’s already worth something.

Source: https://habr.com/ru/post/56765/

All Articles

Atomic grouping, or Not a step back!

0. Tip

1. Recursion

2. Catastrophic pullback

3. Atomic grouping

More articles: