Look-ahead and Look-behind Assertions in Regular Expressions

I came across an extremely simple but interesting task that required stepping a little beyond a basic regular expression course, and I hope a short story about it will be useful to those who have not yet become regex Jedi.

Skimming the regular expression documentation, you, like me, have probably come across look-ahead and look-behind assertions more than once. But without knowing what kind of task they are good for, they will not come to mind when you actually need them.

The task is trivial: replace line breaks with <br />, except when the line break comes right after an HTML tag (for simplicity, right after the > character). As an aside, this replacement is needed to add line breaks automatically inside blocks of text, Habr style, without breaking ordinary HTML markup.
The brute-force solution is dead simple: make the previous character part of the pattern being replaced and re-insert it into the result:

preg_replace("/([^>])\n/","\\1<br />",$text);

And it basically worked for a whole year, until line breaks were suddenly normalized: so that the code behaves the same regardless of the operating system, every line-break variant (\n, \r, \r\n) was replaced with \n. Suddenly, two line breaks in a row were no longer replaced by two <br /> tags.
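The article does not show the normalization code itself, so the following is only a minimal sketch of what such a step might look like. The helper name normalize_newlines is an assumption made for this illustration.

```php
<?php
// Hypothetical helper (not from the original article): turn every
// line-break variant into a single "\n".
function normalize_newlines(string $text): string
{
    // str_replace applies the search strings in order, so "\r\n" is
    // handled first and does not turn into two line breaks; any lone
    // "\r" that remains is converted afterwards.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$normalized = normalize_newlines("one\r\ntwo\rthree\nfour");
var_dump($normalized === "one\ntwo\nthree\nfour"); // bool(true)
```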

This behavior is quite reasonable (especially in hindsight, after debugging): to avoid looping, preg_replace does not re-examine text it has already consumed, and our pattern needs the previous character. Before normalization we actually had \r\n\r\n (0x0D 0x0A 0x0D 0x0A; by the way, you can remember the order of the special characters from the word ReturN). The pattern matched each \r\n pair: \r was the character checked against '>' and re-inserted, and \n was replaced. After normalization we lost this one-character "reserve": the first \n is consumed together with the character before it, so the only character that could precede the second \n already belongs to the previous match, and of course no replacement happens.
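The effect is easy to reproduce. The sketch below runs the original pattern on two short sample strings (the inputs are made up for this illustration): one with \r\n pairs and one with plain \n.

```php
<?php
$pattern = "/([^>])\n/";

// Windows-style breaks: every "\n" has its own "\r" in front of it,
// which the character class consumes and the replacement re-inserts.
$crlf = preg_replace($pattern, "\\1<br />", "a\r\n\r\nb");
var_dump($crlf === "a\r<br />\r<br />b"); // bool(true)

// Plain "\n" breaks: the first match consumes "a\n", so nothing is
// left in front of the second "\n" and it stays as it is.
$lf = preg_replace($pattern, "\\1<br />", "a\n\nb");
var_dump($lf === "a<br />\nb"); // bool(true)
```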

Look-ahead and look-behind assertions exist precisely for problems like this (I had personally never had to use them before).

Look-ahead and look-behind zero-width assertions let you create your own analogues of ^ and $: they define a condition that must (or must not) hold immediately before or after the match, and they are not part of the matched text, i.e. they will not be replaced by preg_replace. This is exactly what we need for this task.

Look-behind "looks" back and is therefore usually placed at the beginning of the regular expression.
Look-ahead "looks" forward and is usually placed at the end.

Their syntax is:
(?<=pattern) positive look-behind
(?<!pattern) negative look-behind
(?=pattern) positive look-ahead
(?!pattern) negative look-ahead
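To make the four forms concrete, here is a small sketch on a made-up sample string. The helper all_matches is hypothetical and exists only to keep the example short. Note that the asserted character never appears in the matched text.

```php
<?php
// Hypothetical helper for this illustration: return every match of
// $pattern found in $text.
function all_matches(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

$text = "a1 b2 a3";

$afterA     = all_matches('/(?<=a)\d/', $text);   // digits preceded by "a"
$notAfterA  = all_matches('/(?<!a)\d/', $text);   // digits not preceded by "a"
$before2    = all_matches('/[a-z](?=2)/', $text); // letters followed by "2"
$notBefore2 = all_matches('/[a-z](?!2)/', $text); // letters not followed by "2"

var_dump($afterA === ['1', '3']);     // bool(true)
var_dump($notAfterA === ['2']);       // bool(true)
var_dump($before2 === ['b']);         // bool(true)
var_dump($notBefore2 === ['a', 'a']); // bool(true)
```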

Regular expression engines impose various restrictions on look-behind assertions: in most cases the look-behind pattern must have a fixed length known in advance. The restrictions are weaker in the Java and .NET engines, and JavaScript did not support look-behind at all at the time of writing (it was added later, in ECMAScript 2018), so check the documentation for your engine.
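For PCRE, which is what preg_replace uses, the restriction is slightly softer than a single fixed length: the look-behind may contain several alternatives, each of its own fixed length. A minimal sketch on a made-up sample string:

```php
<?php
// Two alternatives of different, but individually fixed, lengths:
// "abc" (3 characters) and "de" (2 characters).
$result = preg_replace('/(?<=abc|de)X/', '_', 'abcX deX fX');

// Only the X characters right after "abc" or "de" are replaced.
var_dump($result === 'abc_ de_ fX'); // bool(true)
```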

Thanks to senia, we can consult an exhaustive compatibility table for various regular expression engines. Here is the part relevant to our topic:

Feature	.NET	Java	Perl	PCRE	ECMA	Python	Ruby	Tcl ARE	POSIX BRE	POSIX ERE	GNU BRE	GNU ERE	XML	XPath
(?=regex) (positive lookahead)	YES	YES	YES	YES	YES	YES	YES	YES	no	no	no	no	no	no
(?!regex) (negative lookahead)	YES	YES	YES	YES	YES	YES	YES	YES	no	no	no	no	no	no
(?<=text) (positive lookbehind)	full regex	finite length	fixed length	fixed + alternation	no	fixed length	no	no	no	no	no	no	no	no
(?<!text) (negative lookbehind)	full regex	finite length	fixed length	fixed + alternation	no	fixed length	no	no	no	no	no	no	no	no

Accordingly, the regular expression with a negative look-behind looks like this:

 preg_replace("/(?<!>)\n/","<br />",$text);
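A quick check on a made-up sample string shows that consecutive \n characters are now both replaced, while a line break right after a tag is left alone:

```php
<?php
$text = "one\n\ntwo\n<p>\nthree</p>";

// The look-behind only inspects the previous character and does not
// consume it, so every "\n" is tested independently.
$html = preg_replace("/(?<!>)\n/", "<br />", $text);

var_dump($html === "one<br /><br />two<br /><p>\nthree</p>"); // bool(true)
```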

And here it is rewritten, for demonstration, with a positive look-behind
(the character before the line break must be any character except ">"):

 preg_replace("/(?<=[^>])\n/","<br />",$text);
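One caveat worth knowing: the two variants are not strictly identical. The positive form requires some character to exist before the line break, so it skips a line break at the very start of the text, whereas the negative form replaces it. A short sketch on a made-up sample string:

```php
<?php
$text = "\nfirst\n<p>\nlast";

$negative = preg_replace("/(?<!>)\n/", "<br />", $text);
$positive = preg_replace("/(?<=[^>])\n/", "<br />", $text);

// Negative look-behind: nothing precedes the first "\n", so ">" does
// not precede it either, and it is replaced.
var_dump($negative === "<br />first<br /><p>\nlast"); // bool(true)

// Positive look-behind: there is no character before the first "\n"
// for [^>] to match, so that line break is kept.
var_dump($positive === "\nfirst<br /><p>\nlast"); // bool(true)
```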

Now our code works with normalized line breaks and needs no hacks such as re-inserting part of the matched text into the result unchanged.

P.S. Habr has already touched on this topic in the article "Imitate Intersection, Exclusion, and Subtraction, using advanced checks, in regular expressions in ECMAScript", but its title is scary and it takes a careful read :-)

Source: https://habr.com/ru/post/159483/
