
I came across an extremely simple but interesting task that required going a little beyond the scope of a basic regular-expression course, and I hope a short story about it will be useful to those who have not yet become regex Jedi.
Skimming the regular-expression documentation, you, like me, have probably run into lookahead and lookbehind assertions more than once. But without knowing what kind of task they are for, they will not come to mind when you actually need them.
The task is trivial: replace line breaks with <br />, except when the line break comes right after an HTML tag (for simplicity, right after the > character). As an aside, such a replacement is needed so that line breaks are added automatically inside blocks of text, Habr-style, without breaking ordinary HTML layout.
The head-on solution is as simple as an axe: make the previous character part of the pattern being replaced and re-insert it into the result:
preg_replace("/([^>])\n/","\\1<br />",$text);
It worked fine for a whole year, until line breaks were suddenly normalized: so that the code behaves the same regardless of the operating system, every line-break variant (\n, \r, \r\n) was replaced with \n. All of a sudden, two line breaks in a row were no longer replaced by two <br />.
This behavior is quite reasonable (especially after debugging): to avoid looping, preg_replace does not re-examine what it has just matched, yet we need to check the previous character. Before line breaks were normalized, we actually had \r\n\r\n (0x0D 0x0A 0x0D 0x0A; by the way, you can remember the order of the two characters from the word ReturN). Each \n had its own \r in front of it, and it was that \r that the regular expression checked against '>'. After normalization we lost this one-character "reserve": the first \n was consumed by the previous match, so preg_replace resumed the search right at the second \n with no unconsumed character in front of it, and of course no replacement happened.
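To make the failure concrete, here is a small sketch that runs the naive pattern on the same text before and after normalization. The helper names `naiveBr` and `normalizeNewlines` are made up for this illustration, and the normalization regex is only my guess at what such a step might look like; the original code is not shown in this story.

```php
<?php
// Naive approach from the text: capture the previous character and re-insert it.
function naiveBr(string $text): string
{
    return preg_replace("/([^>])\n/", "\\1<br />", $text);
}

// Hypothetical normalization step (an assumption, not the original code):
// turn \r\n and lone \r into \n.
function normalizeNewlines(string $text): string
{
    return preg_replace("/\r\n|\r/", "\n", $text);
}

$crlf = "one\r\n\r\ntwo";           // Windows-style double line break
$lf   = normalizeNewlines($crlf);   // "one\n\ntwo"

// Each \n has its own \r in front of it, so both are replaced.
echo substr_count(naiveBr($crlf), "<br />"), "\n"; // 2

// The first match consumes "e\n"; the second \n has nothing left to pair with.
echo substr_count(naiveBr($lf), "<br />"), "\n";   // 1
```

After normalization the second `\n` simply stays in the output, which is exactly the symptom described above.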
Lookahead and lookbehind assertions exist to solve exactly this kind of problem (I had personally never run into them in practice before).
Lookahead and lookbehind zero-width assertions let you build your own analogues of $ and ^: they define a condition that must (or must not) hold just before or just after the match, but they are not part of the matched text, i.e. they will not be replaced by preg_replace. This is exactly what we need for this task.
A lookbehind "looks" backward, so it is normally placed at the beginning of the regular expression.
A lookahead "looks" forward and is normally placed at the end.
Their syntax is:
(?<=pattern) positive lookbehind
(?<!pattern) negative lookbehind
(?=pattern) positive lookahead
(?!pattern) negative lookahead
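A tiny sketch of all four forms on one made-up string may help. In each case only the letter itself is replaced, because the assertion checks its neighbour without consuming it. The sample string and variable names are just for illustration.

```php
<?php
$s = "ab ac bc";

// Positive lookahead: "a" only when followed by "b".
$r1 = preg_replace('/a(?=b)/', 'A', $s);   // "Ab ac bc"

// Negative lookahead: "a" only when NOT followed by "b".
$r2 = preg_replace('/a(?!b)/', 'A', $s);   // "ab Ac bc"

// Positive lookbehind: "c" only when preceded by "a".
$r3 = preg_replace('/(?<=a)c/', 'C', $s);  // "ab aC bc"

// Negative lookbehind: "c" only when NOT preceded by "a".
$r4 = preg_replace('/(?<!a)c/', 'C', $s);  // "ab ac bC"

echo implode("\n", [$r1, $r2, $r3, $r4]), "\n";
```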
Regular expression engines impose various restrictions on lookbehind assertions: in most cases the lookbehind must match text of a fixed length known in advance. The restrictions are weaker in Java and .NET, and lookbehind was not supported in JavaScript at all when this was written (it was added in ECMAScript 2018), so check the documentation for your engine.
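Here is a rough sketch of how the restriction shows up in PHP's PCRE. As far as I know, a fixed-length lookbehind and a lookbehind whose alternatives are each of fixed length are accepted, while an unbounded one such as `a+` fails to compile, in which case `preg_match` emits a warning and returns false. Exact limits depend on the PCRE version bundled with your PHP, so treat this as an illustration rather than a specification.

```php
<?php
// Fixed-length lookbehind: accepted.
$fixed = preg_match('/(?<=ab)c/', 'abc');

// Alternatives of different lengths, each one fixed: accepted by PCRE.
$alternation = preg_match('/(?<=ab|x)c/', 'xc');

// Unbounded repetition inside a lookbehind: the pattern does not compile.
// The @ operator hides the compilation warning; preg_match() returns false.
$unbounded = @preg_match('/(?<=a+)c/', 'aac');

var_dump($fixed, $alternation, $unbounded);
```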
Thanks to senia, we can consult an exhaustive compatibility table of various regular expression engines; here is the part relevant to our topic:
Feature | .NET | Java | Perl | PCRE | ECMA | Python | Ruby | Tcl ARE | POSIX BRE | POSIX ERE | GNU BRE | GNU ERE | XML | XPath |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(?=regex) (positive lookahead) | YES | YES | YES | YES | YES | YES | YES | YES | no | no | no | no | no | no |
(?!regex) (negative lookahead) | YES | YES | YES | YES | YES | YES | YES | YES | no | no | no | no | no | no |
(?<=text) (positive lookbehind) | full regex | finite length | fixed length | fixed + alternation | no | fixed length | no | no | no | no | no | no | no | no |
(?<!text) (negative lookbehind) | full regex | finite length | fixed length | fixed + alternation | no | fixed length | no | no | no | no | no | no | no | no |
Accordingly, the regular expression with a negative lookbehind looks like this:
preg_replace("/(?<!>)\n/","<br />",$text);
And, for demonstration, the same thing rewritten with a positive lookbehind (the preceding character must be anything except ">"):
preg_replace("/(?<=[^>])\n/","<br />",$text);
Now our code works with normalized line breaks and needs no crutches such as re-inserting part of the match into the result unchanged.
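As a quick check, here is a sketch that runs both lookbehind variants on a made-up sample. The function names are hypothetical. One difference worth knowing: the positive form requires that some character exists before the line break, so a line break at the very start of the text is replaced only by the negative form.

```php
<?php
// Negative lookbehind: replace \n unless the previous character is ">".
function brNegative(string $text): string
{
    return preg_replace("/(?<!>)\n/", "<br />", $text);
}

// Positive lookbehind: replace \n if the previous character is anything but ">".
function brPositive(string $text): string
{
    return preg_replace("/(?<=[^>])\n/", "<br />", $text);
}

$text = "<p>intro</p>\nline one\n\nline two";

// The \n after </p> is kept; both consecutive line breaks become <br />.
echo brNegative($text), "\n";
echo brPositive($text), "\n";

// Edge case: a line break at the very start of the text.
echo brNegative("\nfirst"), "\n";                        // "<br />first"
echo brPositive("\nfirst") === "\nfirst" ? "unchanged" : "changed", "\n";
```

The lookbehind inspects the original subject string, so the second `\n` still sees the first `\n` in front of it even though that one has already been matched.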
P.S. On Habr, this topic has already been touched upon in the article "Imitating intersection, exclusion, and subtraction with lookahead assertions in ECMAScript regular expressions", but its title is intimidating and it takes careful reading :-)