📜 ⬆️ ⬇️

Supergreant Quantifiers

The Regexp article is a “programming language”. The basics were given the task: to write a regular expression that finds text in double quotes in a chain of characters, and inside the quotation marks "..." there may be characters themselves, if they are escaped with a backslash, for example:
one two "foo:=\"quux\"; print" three "four"
Here our regex should find a match for the chain.
"foo:=\"quux\"; print"
The author (of that article) proposed such a solution:
/ " ( \\" | [^"] )* " /x
(hereafter, the Perl syntax; the / x key means that regex spaces are not taken into account, we added them just for clarity so that parts of the regex do not merge into a single “modem noise”).
This regex works when there is a match (text in quotes). The problem is that he finds the text in quotes, even when the text is in quotes (according to our screening rules with a backslash) simply does not exist. For example, in the "\" chain, the regex finds a match (equal to the entire string "\"), although it should not be: the quotation mark is open, the screened quotation mark ... but there is no closing quotation mark.
The situation is easy to fix, the initial problem is easy to solve, making a few simple changes to regex ... but it's not about that, but about the fact that you have a modern tool in your hands, i.e. a regex engine (the latest version of Perl, Java or PHP with PCRE), then you can “fix” the described regex by adding only 1 character to it. Which one Where? Why? If you know the answers, then you should not read further ;-)

First, let's figure out why the original version
/ " ( \\" | [^"] )* " /x
matches the string "\"
Events develop (very roughly) like this:
  1. "(in regex) matches" (in the checked chain)
  2. \\ "(1st option in alternative (... | ...) in regex) matches \" (in chain)
  3. We have come to the end of the chain. There is still a closing quotation mark in the regex "(or, once again, the contents of the alternative (... | ...)), but there is nothing in the chain, there is nothing to match. The regex engine sees that there is no match (matching).
    There seems to be an end to a fairy tale ... but not here it was! Here only the interesting begins. The regex engine rolls back to the beginning of step 2. And this time the next alternative is trying to match the checked chain, the 2nd alternative: [^ "]. And - that's wonderful! - it works out for him.
  4. [^ "] in regex matches \ in a chain. Backslash is not a quote? Not a quote. So, a match.
  5. The regex engine will try to find (... | ...) again, because * is a greedy quantifier (A * means “find as many A-matching items as possible”). He does not succeed: only a quotation mark remains in the chain, which is neither a combination \ "nor a non-quotation mark [^"]
  6. Then the regex engine will match the rest of the chain, the lone quotation mark ", the rest of the regex, the lone quotation mark". Match! A match was found.

In general, the whole thing is in the "magic bubbles", more precisely, in rolling back the regular expression engine. It rolls back after the alternative with the quantifier (... | ...) * captures the rest of the chain to the end, and the rest of the regex is “not getting anything.” Is it possible to ask the regex engine not to roll back? It turns out that in modern regex engines (in particular, in Perl 5.10, in relatively fresh versions of Java and PHP with fresh PCRE) this is possible. Come to the rescue ... Chip and Dale Rescuers Malibu

... Supergreant Quantifiers



What kind of fruit is it? Everyone knows the usual quantifiers:
* + ? {m, n}
They are greedy , i.e. they “capture” as much as possible from the tested chain. There are also non-greedy (non-greedy, lazy, reluctant, lazy) quantifiers
*? +? ?? {m, n}?
which capture as little as possible from the chain.
In modern regex engines, in addition to these two “classical” types of quantifiers, possessive quantifiers are also implemented:
*+ ++ ?+ {m, n}+
We do not know the well-established Russian translation for this term, so they were called, as it seemed reasonable: super-nimble quantifiers . Why "super-bad"? Because, first of all, they behave like greedy, that is, they capture as much of the chain as possible. Secondly, they are, in a sense, even more “greedy” greedy and go further than them: once they are “grabbed”, they never roll back, they do not “give back” pieces of the rehex grasped by them to the next parts.
Example. Regex
/ " .* " /x
when processing a string
one "two" three "four"
Will match:
"two" three "four"
because * greedy and "eats" all quotes that he can eat (including he eats and quotes after four, but then he "gives" it back, because the regex engine does not see what to play the last "regex" .
And here is the “superjust” option
/ " .*+ " /x
will find nothing in the same chain! (i.e. there will be no match). Why? Because. * + Eats the rest of the chain after the opening quotation mark, and he doesn’t care what the closing reel of “regex” will match. He will not give up part of the piece of the chain “eaten” by him.
Why do we need such strange quantifiers? It turns out that they are very useful when the “rollback” of the regex engine is undesirable for us. And as practice shows, a rollback is not always desirable ... unless, of course, we are not talking about wasting government money ;-)
')
So after all, we had an initial task - searching for text in quotes - there was just such a problem, the rollback worked, which led to the match in “unnecessary” chains like “\”. Add to the original regex
/ " ( \\" | [^"] )* " /x
Ma-scarlet plus sign:
/ " ( \\" | [^"] )*+ " /x
and the problem is solved! This kind of regex will not match strings like
"\"
or
who "\"the\"heck\" is quux anyway

From the gun on the sparrows



In fact, for such a simple task as searching for a “quoted” text, no advanced features like the possessive quantifiers are needed. The following regex will do a great job with this task:
/" ( \\. | [^"\\] )* "/x
That is, literally “an opening quote, (backslash followed by any character OR any character except for a backslash and a quote) zero or more times, closing the quote”.
If the character new-line can occur after \ and we consider it valid, then you need to add the key / s, since without this key the point. does not match the new-string character.

Source: https://habr.com/ru/post/55863/


All Articles