
Named Capturing Group and Backreferences

This note is not meant to teach regular expressions to beginners; for that I would recommend Ben Forta’s book “Teach Yourself Regular Expressions in 10 Minutes” (ISBN: 0-672-32566-7).

The RegexBuddy program (http://www.regexbuddy.com) is ideal for testing and debugging regular expressions. To debug the following examples, copy the HTML of any page into the Test tab, or type a few tags by hand.

The task is to find all the IMG tags in HTML and extract the values of the SRC and ALT attributes from them.


The first part of the task, finding all the IMG tags, is solved quite simply with the regular expression:
<img .*?>

Do not forget to tick the "Dot matches newline" and "Case insensitive" options.
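Outside RegexBuddy, the same first step can be sketched in Python. This is only an illustration: the sample HTML is made up, and the two RegexBuddy checkboxes are assumed to correspond to the re.DOTALL and re.IGNORECASE flags.

```python
import re

# Hypothetical sample page; note the newline inside the first tag.
html = '<p><IMG src="a.png"\n alt="A"> text <img src=\'b.png\'></p>'

# re.DOTALL plays the role of "Dot matches newline",
# re.IGNORECASE the role of "Case insensitive".
IMG_TAG = re.compile(r'<img .*?>', re.DOTALL | re.IGNORECASE)

tags = IMG_TAG.findall(html)
for tag in tags:
    print(repr(tag))
```

Without the two flags only the second, lowercase single-line tag would be found.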


However, the expression .*? is not exactly what we need. An IMG tag can contain many attributes in any order, and attribute values can be enclosed in single quotes, in double quotes, or not quoted at all.

Let's try to catch the SRC attribute first.
\s+src\s*=\s*
This expression catches the whitespace before the attribute name, as well as the optional whitespace before and after the equals sign.
It does not yet take into account the attribute value, which can be enclosed in single or double quotes.
This is where backreferences and named capturing groups come to the rescue.
\s+src\s*=\s*(?P<qt1>[\"\'])(?P<src>.*?)\k<qt1>


So, the expression (?P<qt1>[\"\']) creates a named group qt1 that captures a single character, either " or '.
Next comes the named group src, which lazily captures all characters up to the closing quotation mark.
The backreference \k<qt1> ensures that the closing quote is the same one that opened the value and was captured under the name qt1.
Notice how the RegexBuddy debugger highlights the characters belonging to the src group in a darker color.
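The named group and the backreference can be tried out with a short Python sketch. One assumption to note: Python's re module spells the named backreference (?P=qt1) rather than \k<qt1>, so the expression is adapted accordingly; the sample tags are invented for the illustration.

```python
import re

# Same idea as the expression above, with Python's (?P=name) backreference.
SRC_ATTR = re.compile(
    r'''\s+src\s*=\s*(?P<qt1>["'])(?P<src>.*?)(?P=qt1)''',
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical tags whose values contain the "other" kind of quote.
samples = [
    '<img src="it\'s.png">',
    "<img src='say \"hi\".png'>",
]

found = [SRC_ATTR.search(s).group('src') for s in samples]
print(found)
```

Because the closing quote must equal whatever qt1 captured, a stray quote of the other kind inside the value does not end the match early.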

A regular expression for alt can be constructed in the same way.

Now let's combine the alt and src attributes with all the others (.*?).
The resulting regular expression looks a bit complicated, so first an explanation:
the expression (?:) works like (), except that whatever is matched inside (?:) is not captured into the result.
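The difference between the two kinds of brackets is easy to see in a tiny Python sketch (the input string here is invented for the example):

```python
import re

# With ordinary brackets the attribute name becomes a captured group...
capturing = re.search(r'(src|alt)=(\w+)', 'src=logo')
# ...with (?: ) it only groups the alternatives and is not captured.
non_capturing = re.search(r'(?:src|alt)=(\w+)', 'src=logo')

print(capturing.groups())
print(non_capturing.groups())
```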

Schematically, our regular expression looks like this:
<img(?: (?: src)|(?: alt)|(?: ) )*/?>
That is, inside the IMG tag we can meet the src attribute, the alt attribute, or anything else; all of these are combined into a group that can repeat any number of times.
The IMG tag ends with an optional / character followed by >.
Here is what we get:
<img(?:(?:\s+src\s*=\s*(?P<qt1>[\"\'])(?P<src>.*?)\k<qt1>)|(?:\s+alt\s*=\s*(?P<qt2>[\"\'])(?P<alt>.*?)\k<qt2>)|(?:.*?))*/?>
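As a rough check, the combined expression can be ported to Python. The sketch below assumes Python's (?P=name) backreference syntax in place of \k<name> and uses made-up tags; it is not the RegexBuddy setup from the article.

```python
import re

IMG = re.compile(
    r'''<img(?:'''
    r'''(?:\s+src\s*=\s*(?P<qt1>["'])(?P<src>.*?)(?P=qt1))'''
    r'''|(?:\s+alt\s*=\s*(?P<qt2>["'])(?P<alt>.*?)(?P=qt2))'''
    r'''|(?:.*?)'''
    r''')*/?>''',
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical input: attributes in different order, plus an extra attribute.
html = (
    '<img src="a.png" alt="first">'
    "<IMG class='c' alt='second' src='b.png' />"
)

pairs = [(m.group('src'), m.group('alt')) for m in IMG.finditer(html)]
print(pairs)
```

A word of caution: the catch-all branch (?:.*?) sits inside a repeated group, so on input where an IMG tag is never closed the engine may backtrack for a very long time.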


One thing remains: what if the value is not quoted at all?
In that case the expression
\s+src\s*=\s*(?P<qt>[\"\'])(?P<src>.*?)\k<qt>
splits into two alternatives:
\s+src\s*=\s*( | )

So here is an extended version, in which the value of the src attribute is recognized both with and without quotes:
<img(?:(?:\s+src\s*=\s*(?:(?:(?P<qt1>[\"\'])(?P<src>.*?)\k<qt1>)|(?:(?P<src>\S+))))|(?:\s+alt\s*=\s*(?P<qt2>[\"\'])(?P<alt>.*?)\k<qt2>)|(?:.*?))*/?>

The particularly meticulous can extend the regular expression so that alt is also recognized without quotes (in that case, of course, the value must not contain spaces).
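A Python port of the extended version needs two adjustments, both of which are assumptions of this sketch rather than part of the original expression. Python's re module does not allow the same group name twice in one pattern, so the hypothetical names src_q and src_u stand for the quoted and unquoted value. The unquoted branch also uses [^\s>]+ instead of \S+, so that a value directly followed by > cannot swallow the closing bracket.

```python
import re

IMG = re.compile(
    r'''<img(?:'''
    r'''(?:\s+src\s*=\s*(?:(?P<qt1>["'])(?P<src_q>.*?)(?P=qt1)|(?P<src_u>[^\s>]+)))'''
    r'''|(?:\s+alt\s*=\s*(?P<qt2>["'])(?P<alt>.*?)(?P=qt2))'''
    r'''|(?:.*?)'''
    r''')*/?>''',
    re.DOTALL | re.IGNORECASE,
)

def src_of(match):
    """Return the src value from whichever branch matched."""
    return match.group('src_q') if match.group('qt1') else match.group('src_u')

# Hypothetical input mixing quoted and unquoted values.
html = '<img src=a.png alt="first"><img alt="second" src="b c.png"><img src=d.gif>'

results = [(src_of(m), m.group('alt')) for m in IMG.finditer(html)]
print(results)
```

An unquoted value immediately followed by /> would still pick up the slash, so this remains a sketch rather than a complete solution.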


UPD: This regular expression does not claim to be universal. False positives are possible inside commented-out blocks, JavaScript chunks, PRE blocks and other places where images are not actually displayed.
If you parse an entire page, it is advisable to first remove scripts, comments and PRE blocks from it (with a separate regular expression), although this does not solve the problem with constructions like
  onmouseover = "document.write ('<img src = ...')" 
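The clean-up step mentioned above might look roughly like this in Python. The NOISE pattern and the sample page are assumptions made for the illustration, and, as noted, markup hidden inside attribute values such as onmouseover is not handled.

```python
import re

# Separate expression that strips comments, scripts and PRE blocks.
NOISE = re.compile(
    r'<!--.*?-->|<script\b.*?</script>|<pre\b.*?</pre>',
    re.DOTALL | re.IGNORECASE,
)
IMG_TAG = re.compile(r'<img .*?>', re.DOTALL | re.IGNORECASE)

# Hypothetical page: only the last image is actually displayed.
html = (
    '<!-- <img src="commented.png"> -->'
    '<script>var s = "<img src=\'script.png\'>";</script>'
    '<pre><img src="pre.png"></pre>'
    '<img src="real.png">'
)

cleaned = NOISE.sub('', html)
tags = IMG_TAG.findall(cleaned)
print(tags)
```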


UPD2: My karma after this article is lower than it was before publication... Some motivation!


UPD3: Moved to the blog that best fits the topic; I hope no one objects :)

Source: https://habr.com/ru/post/54681/

