Regular expressions, a guide for beginners. Part 2

In the first part of this tutorial, we covered only a small part of what regular expressions can do. In this second, larger part, we will look at some new metacharacters, at how to use groups to retrieve parts of the matched text, and at how to split strings and find and replace parts of the text. At the end, we will talk a little about common mistakes.

More metacharacters


There are some metacharacters that we have not yet learned. Most of them will be covered in this section.

Some of the remaining metacharacters are zero-width assertions. They do not cause the engine to advance through the string; they consume no characters at all and simply succeed or fail. For example, \b is an assertion that the current position is at a word boundary; \b itself does not change the position. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched there an infinite number of times.

|
Alternation, or the "or" operator. If A and B are regular expressions, then A|B will match any string that matches either A or B. The | metacharacter has very low precedence so that it works sensibly when you alternate multi-character strings: Crow|Servo will match either Crow or Servo, not Cro, a 'w' or an 'S', and ervo. To match a literal |, use \| or enclose it in a character class, as in [|].
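The precedence rule can be checked directly. The following is a minimal sketch; the second pattern, with explicit parentheses, is an illustrative assumption added here to show how to alternate only a single letter.

```python
import re

# | binds more loosely than concatenation: this means (Crow) or (Servo).
whole_words = re.compile(r'Crow|Servo')
# To alternate only part of the pattern, group that part explicitly.
one_letter = re.compile(r'Cro(w|S)ervo')

crow_ok = bool(whole_words.fullmatch('Crow'))
servo_ok = bool(whole_words.fullmatch('Servo'))
grouped_w = bool(one_letter.fullmatch('Crowervo'))
grouped_s = bool(one_letter.fullmatch('CroServo'))
grouped_crow = bool(one_letter.fullmatch('Crow'))

print(crow_ok, servo_ok)                   # True True
print(grouped_w, grouped_s, grouped_crow)  # True True False
```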
^
Matches at the beginning of a line. Unless the MULTILINE flag has been set, this matches only at the beginning of the string. In MULTILINE mode, as mentioned in the previous part, it also matches immediately after each newline within the string.

For example, if you want to match the word From only at the beginning of a line, the regular expression to use is ^From:

>>> print(re.search('^From', 'From Here to Eternity'))
<re.Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None


$
Matches at the end of a line, which is defined as either the end of the string or any location followed by a newline character.

>>> print(re.search('}$', '{block}'))
<re.Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<re.Match object; span=(6, 7), match='}'>


\A
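Matches only at the beginning of the string. Outside MULTILINE mode, \A and ^ are effectively the same; in MULTILINE mode, \A still matches only at the beginning of the string, while ^ may also match right after any newline.

\Z
Matches only at the end of the string, regardless of the MULTILINE flag. Unlike $, it does not match before a trailing newline.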
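The difference between the four anchors is easiest to see on a multi-line string. A minimal sketch follows; the sample text is made up for illustration.

```python
import re

text = 'From: a\nTo: b\nFrom: c\n'

# Without MULTILINE, ^ matches only at the start of the string.
single = re.findall(r'^From', text)
# With MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)
# \A ignores the flag and still matches only at the very start.
start_only = re.findall(r'\AFrom', text, re.MULTILINE)
# $ may match just before a trailing newline; \Z may not.
dollar = bool(re.search(r'c$', text))
strict_end = bool(re.search(r'c\Z', text))

print(single)              # ['From']
print(multi)               # ['From', 'From']
print(start_only)          # ['From']
print(dollar, strict_end)  # True False
```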

\b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or any non-alphanumeric character.

The following example matches class only when it is a complete word; it will not match when it is contained inside another word:

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<re.Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None


There are two subtleties you should remember when using this special sequence. First, this is one of the worst collisions between Python's string literals and regular expression sequences: in Python's string literals, \b is the backspace character, ASCII value 8. If you do not use raw strings, Python will convert \b to a backspace, and your regular expression will not match as you expect. The following example looks the same as the previous one but omits the r in front of the pattern string:

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<re.Match object; span=(0, 7), match='\x08class\x08'>


Second, this assertion cannot be used inside a character class: there, for compatibility with Python's string literals, \b represents the backspace character.

\B
The opposite of \b: another zero-width assertion, which matches only when the current position is not at a word boundary.
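As a counterpart to the \b example, here is a small sketch of \B applied to the same three sample strings. It matches class only when the word is embedded in a longer word on both sides.

```python
import re

# \B succeeds where \b fails: between two word characters
# (or between two non-word characters).
inner = re.compile(r'\Bclass\B')

standalone = inner.search('no class at all')
embedded = inner.search('the declassified algorithm')
suffix = inner.search('one subclass is')

print(standalone)       # None: 'class' has a boundary on both sides
print(embedded.span())  # (6, 11): 'class' sits inside 'declassified'
print(suffix)           # None: the right side of 'subclass' is a boundary
```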

Grouping


It is often necessary to obtain more information than just whether the regular expression matched or not. Regular expressions are often used to dissect strings by writing a regular expression divided into several subgroups, each of which matches a different component of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a colon:

From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com


This can be handled by writing a regular expression that matches an entire header line, with one group that matches the header name and another group that matches the header's value.
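A minimal sketch of that idea, applied to the four header lines above. The pattern and the name header_re are illustrative assumptions, not taken from any library: group 1 captures the header name and group 2 its value.

```python
import re

# Hypothetical pattern: everything before the first colon is the name,
# everything after it (minus leading whitespace) is the value.
header_re = re.compile(r'([^:]+):\s*(.*)')

lines = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
    'To: editor@example.com',
]

headers = {}
for line in lines:
    m = header_re.match(line)
    if m:
        headers[m.group(1)] = m.group(2)

print(headers['From'])          # author@example.com
print(headers['MIME-Version'])  # 1.0
```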

Groups are marked by the metacharacters '(' and ')'. They have much the same meaning as in mathematical expressions: they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier such as *, +, ? or {m,n}. For example, (ab)* will match zero or more repetitions of ab.

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)


Groups marked with '(' and ')' also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end() and span(). Groups are numbered starting with 0. Group 0 is always present; it is the whole regular expression, so the match object methods all take group 0 as their default argument:

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'


Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine a group's number, just count the opening parentheses, going from left to right:

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'


group() can be passed several group numbers at once, in which case it returns a tuple containing the corresponding values for those groups:

>>> m.group(2, 1, 2)
('b', 'abc', 'b')


The groups() method returns a tuple of strings for all subgroups, starting from the 1st:

>>> m.groups()
('abc', 'b')


Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise.

For example, the following regular expression detects doubled words in a string:

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'


Backreferences like this are not often useful for just searching through a string, but you will soon find out that they are very useful when performing string substitutions.

Non-capturing and Named Groups

Elaborate regular expressions may use many groups, both to capture substrings of interest and to group and structure the regular expression itself. In complex regular expressions, it becomes difficult to keep track of the group numbers. There are two features that help with this problem. Both of them use a common syntax for regular expression extensions, so we will look at that first.

Perl 5 added several features to standard regular expressions, and the re module supports most of them. It would have been difficult to choose new single-character metacharacters or new special sequences beginning with a backslash to represent these features without making Perl's regular expressions confusingly different from standard ones. If & had been chosen as a new metacharacter, for example, old regular expressions would treat it as an ordinary character and would not have escaped it by writing \& or [&].

The solution chosen by the Perl developers was to use (?...) as the extension syntax. A ? immediately after an opening parenthesis used to be a syntax error, because the ? would have nothing to repeat, so this did not introduce any compatibility problems. The characters immediately after the ? indicate which extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Python adds an extension syntax of its own to Perl's extension syntax. If the first character after the question mark is P, you know that it is an extension specific to Python. Currently there are two such extensions: (?P<name>...) defines a named group, and (?P=name) is a backreference to a named group. If future versions of Perl 5 add similar features using a different syntax, the re module will be changed to support the new syntax, while preserving the Python-specific syntax for compatibility.

Sometimes you will want to use a group to collect a part of a regular expression, but are not interested in retrieving the group's contents. You can make this explicit by using a non-capturing group: (?:...), where you can replace the ... with any other regular expression:

>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()


Except for the fact that you cannot retrieve the contents of what the group matched, a non-capturing group behaves exactly like a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing).

A more significant feature is named groups: instead of referring to them by numbers, groups can be referenced by name.

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). Named groups behave exactly like capturing groups, and additionally associate a name with the group. The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group's name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search('((((Lots of punctuation))))')
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'


Named groups are handy because they let you use easily remembered names instead of having to remember numbers. Here is an example of a regular expression from the imaplib module:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
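To illustrate how the names pay off, the sketch below restates the pattern so that the snippet is self-contained and matches it against a made-up INTERNALDATE string. The exact spacing inside the pattern (the leading spaces before the hour and zone groups, and the space in the day class) is an assumption about the original, and the sample input is invented for illustration.

```python
import re

# Pattern restated from the example above; spacing is an assumption.
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Made-up sample input, for illustration only.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

date = (m.group('day'), m.group('mon'), m.group('year'))
zone = m.group('zonen') + m.group('zoneh') + m.group('zonem')
same = m.group('zonem') == m.group(9)

print(date)  # ('17', 'Jul', '1996')
print(zone)  # -0700
print(same)  # True: the named group is also group number 9
```

Retrieving m.group('zonem') is clearly easier than remembering that the zone minutes are group 9.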


Backreferences of the form (...)\1 refer to the group by number. It is more natural to use the group's name instead of its number. Another Python extension, (?P=name), indicates that the contents of the group called name should again be matched at the current position. The earlier regular expression for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):

>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'


Lookahead Assertions

Lookahead assertions are another kind of zero-width assertion. They are available in both positive and negative form, and look like this:

(?=...)
Positive lookahead assertion. It succeeds if the contained regular expression, represented here by ..., matches at the current position, and fails otherwise. But once the contained expression has been tried, the matching engine does not advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)
Negative lookahead assertion. It is the opposite of the positive assertion: it succeeds if the contained expression does not match at the current position in the string.

To make this concrete, let's look at a case where a lookahead is useful. Consider a simple pattern that matches a file name and splits it into a base name and an extension, separated by a dot.

The pattern to match this is quite simple:

.*[.].*$

Notice that the . needs to be treated specially because it is a metacharacter, so it is placed inside a character class to match only a literal dot. Also notice the trailing $; it is added to ensure that all the rest of the string is included in the extension.

Now, consider complicating the problem a bit: what if you want to match file names whose extension is not bat? Some incorrect attempts:

.*[.][^b].*$
The first attempt tries to exclude bat by requiring that the first character of the extension is not b. This is wrong, because the pattern also rejects foo.bar.

.*[.]([^b]..|.[^a].|..[^t])$
The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension is not b, the second character is not a, or the third character is not t. This accepts foo.bar and rejects autoexec.bat, but it requires a three-letter extension and will not accept a file name with a two-letter extension such as sendmail.cf. We will have to complicate the pattern again to fix this:

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$
In the third attempt, the second and third letters are made optional in order to allow matching extensions shorter than three characters.

The pattern is getting really complicated now, which makes it hard to read and understand. Worse, if the problem changes and you need to exclude both bat and exe as extensions, the pattern will become even more complicated and confusing.

A negative lookahead cuts through all this confusion:

.*[.](?!bat$).*$
The negative lookahead means: if the expression bat does not match at this point, try the rest of the pattern; if bat$ does match, the whole pattern fails. The trailing $ is required to ensure that something like sample.batch, where the extension only starts with bat, is still allowed.

Excluding another file name extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes file names that end in either bat or exe:

.*[.](?!bat$|exe$).*$
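The two lookahead patterns can be tried on the file names mentioned above. This is a minimal sketch; setup.exe is an extra name added here for illustration.

```python
import re

no_bat = re.compile(r'.*[.](?!bat$).*$')
no_bat_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

# File names from the text, plus 'setup.exe' (made up for illustration).
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch', 'setup.exe']

allowed = [n for n in names if no_bat.match(n)]
allowed_strict = [n for n in names if no_bat_exe.match(n)]

print(allowed)         # ['foo.bar', 'sendmail.cf', 'sample.batch', 'setup.exe']
print(allowed_strict)  # ['foo.bar', 'sendmail.cf', 'sample.batch']
```

One limitation worth knowing: because the leading .* can backtrack to an earlier dot, a name with several dots, such as a.b.bat, is still accepted by these patterns.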

Modifying Strings


Up to this point, we have simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:

Method/Attribute    Purpose
split()             Split the string into a list, splitting it wherever the regular expression matches
sub()               Find all substrings where the regular expression matches, and replace them with a different string
subn()              Does the same thing as sub(), but returns the new string and the number of replacements


String splitting

The split() method of a pattern splits a string apart wherever the regular expression matches, returning a list of the pieces. It is similar to the split() method of strings but provides much more generality in the delimiters that you can split by; string split() only supports splitting by whitespace or by a fixed string. As you would expect, there is a module-level re.split() function, too.

.split(string[, maxsplit=0])
Splits string by the matches of the regular expression. If capturing parentheses are used in the regular expression, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed, and the remainder of the string is returned as the final element of the list.

In the following example, the delimiter is any sequence of non-alphanumeric characters.

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']


Sometimes you are not only interested in what the text between delimiters is, but also need to know what the delimiter was. If capturing parentheses are used in the regular expression, then their values are also returned as part of the list. Compare the following calls:

>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']


The module-level function re.split() takes the regular expression as its first argument, but otherwise behaves the same:

>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']


Search and replace

Another common task is to find all the matches for a pattern and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.

.sub(replacement, string[, count=0])
Returns the string obtained by replacing the leftmost non-overlapping occurrences of the regular expression in string by replacement. If the pattern is not found, string is returned unchanged.

The optional count argument is the maximum number of matches to replace; the default value of 0 means that all matches are replaced.

Here is a simple example of using the sub() method. It replaces colour names with the word colour:

>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'


The subn() method does the same work, but returns a 2-tuple containing the new string and the number of replacements that were performed:

>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)


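Empty matches are replaced only when they are not adjacent to a previous empty match. (This is the behavior in Python 3.7 and later; earlier versions also skipped an empty match adjacent to a previous non-empty match and returned '-a-b-d-' here.)

>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b--d-'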


If the replacement is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Backreferences, such as \6, are replaced with the substring matched by the corresponding group in the regular expression. This lets you incorporate portions of the original text in the resulting replacement string.

This example matches the word section followed by a string enclosed in {, }, and changes section to subsection:

>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First} section{second}')
'subsection{First} subsection{second}'


There is also a syntax for referring to named groups as defined by the (?P<name>...) syntax. \g<name> uses the substring matched by the group named name, and \g<number> uses the corresponding group number. \g<2> is therefore equivalent to \2, but is not ambiguous in a replacement string such as \g<2>0. (\20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'.) The following substitutions are all equivalent, but use all three variations of the replacement string:

>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}', 'section{First}')
'subsection{First}'
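The ambiguity that \g<...> resolves can be seen in a short sketch. The pattern and input here are made up for illustration: the goal is to emit the first captured digit followed by a literal 0.

```python
import re

p = re.compile(r'(\d)-(\d)')

# \g<1>0 means "group 1, then a literal 0".
padded = p.sub(r'\g<1>0', '3-4')

# \10 is read as a reference to group 10, which this pattern does not have,
# so the re module rejects the replacement string.
try:
    p.sub(r'\10', '3-4')
    numeric_form_ok = True
except re.error:
    numeric_form_ok = False

print(padded)           # 30
print(numeric_form_ok)  # False
```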


The replacement can also be a function, which gives you even more control. If the replacement is a function, it is called for every non-overlapping occurrence of the pattern. On each call, the function is passed a match object as its argument and returns the replacement string.

In the following example, the replacement function converts decimal numbers to hexadecimal:

>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'


Common problems


Regular expressions are a powerful tool for some applications, but in some respects their behavior is not intuitive, and sometimes they do not behave as you would expect from them. This section will point out some of the most common mistakes.

Using string methods

Sometimes using the re module is a mistake. If you are looking for a fixed string or a single character, and you do not need any special features of re, then the full power of regular expressions is not required. Strings have several methods for operations with fixed strings, and they are usually much faster because they are optimized for this purpose.

Suppose you need to replace one fixed string with another, for example, replace word with deed. re.sub() seems like the function to use for this, but consider the replace() string method. Note that replace() will also replace word inside words, turning swordfish into sdeedfish, but the naive regular expression word would do that, too. (To avoid performing the substitution on parts of words, the pattern would have to be \bword\b.)

Another common task is deleting every occurrence of a single character from a string or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but the translate() method can handle both tasks and does so faster than any regular expression.

In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
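The comparisons in this section can be sketched as follows. The sample strings are made up for illustration, and str.maketrans() is used to build the table that translate() expects in Python 3.

```python
import re

s = 'a word about swordfish'

# replace() works on fixed strings, including inside longer words.
replaced_all = s.replace('word', 'deed')
# A regular expression is only needed once word boundaries matter.
whole_only = re.sub(r'\bword\b', 'deed', s)

# Single-character replacement without re: map newline and tab to a space.
table = str.maketrans({'\n': ' ', '\t': ' '})
translated = 'a\tb\nc'.translate(table)

print(replaced_all)  # a deed about sdeedfish
print(whole_only)    # a deed about swordfish
print(translated)    # a b c
```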

match() versus search()

The match() function only checks whether the regular expression matches at the beginning of the string, while search() scans forward through the string for a match. It is important to keep this distinction in mind:

>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)


You may be tempted to always use re.match() and just add .* to the front of your regular expression. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of regular expressions in order to speed up the matching process. One such analysis figures out what the first character of a match must be; for example, a match for a pattern starting with Crow must start with 'C'. This analysis lets the engine quickly scan through the string looking for the starting character, and begin a full match only if a 'C' is found.

Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to match the rest of the regular expression. Use re.search() instead.
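Besides the lost optimization, the .* prefix also changes what the match object reports. A small sketch using the same example word:

```python
import re

s = 'insuperable'

# Prefixing .* lets match() succeed, but the match now starts at 0
# and includes the text before 'super'.
m1 = re.match(r'.*super', s)
# search() finds the same word and reports where it actually is.
m2 = re.search(r'super', s)

print(m1.span(), m1.group())  # (0, 7) insuper
print(m2.span(), m2.group())  # (2, 7) super
```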

Greedy versus non-greedy

When repeating a regular expression, as in a*, the resulting action is to consume as much of the string as possible. This often bites those who want to match a pair of balanced delimiters, such as the angle brackets <> surrounding an HTML tag. The naive pattern for matching a single HTML tag does not work because of the greedy nature of .*:

>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>


The regular expression matches the '<' in the first tag, <html>, and .* consumes the rest of the string. As a result, the match extends from the '<' of the opening <html> tag to the '>' of the closing </title> tag, which is, of course, not what we wanted.

In this case, the solution is to use the non-greedy qualifiers *?, +?, ?? or {m,n}?, which match as little text as possible. In the example above, the '>' is tried immediately after the first '<' matches, and when that fails, the engine advances one character at a time, retrying the '>' at every step. This gives the desired result:

>>> print(re.match('<.*?>', s).group())
<html>


(Note that parsing HTML or XML using regular expressions is painful. A hastily made template can do some things, but collapses when the page code changes. A well-designed template can be too complicated to try to modify. For such tasks, it is better to use HTML or XML parser modules.)
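For a quick side-by-side view of the two qualifiers on the same string, here is a minimal sketch using findall(). It illustrates greedy and non-greedy matching only and is not a recommendation to parse HTML this way.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* runs to the last '>' in the string, giving one long match.
greedy = re.findall(r'<.*>', s)
# Non-greedy: .*? stops at the first '>' after each '<'.
lazy = re.findall(r'<.*?>', s)

print(greedy)  # ['<html><head><title>Title</title>']
print(lazy)    # ['<html>', '<head>', '<title>', '</title>']
```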

Using re.VERBOSE

By now you have probably noticed that regular expressions are a very compact notation, but they are not very readable. Regular expressions of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.

For such regular expressions, specifying the re.VERBOSE flag when compiling can be helpful, because it allows you to format the regular expression more clearly.

The re.VERBOSE flag has several effects. Whitespace in the regular expression that is not inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] will still match the characters 'a', 'b', or a space. In addition, you can put comments inside a regular expression; comments extend from a # character to the next newline. When used with triple-quoted strings, this lets regular expressions be formatted more neatly:

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)


It is much easier to read than:

pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
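A quick way to convince yourself that the two forms are equivalent is to run both on the same input. In this sketch, the names verbose and compact and the sample line are illustrative assumptions.

```python
import re

verbose = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value (non-greedy)
    \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = '  MIME-Version: 1.0   '  # made-up input with extra whitespace
mv = verbose.match(line)
mc = compact.match(line)

print(mv.groups())                 # ('MIME-Version', ' 1.0')
print(mv.groups() == mc.groups())  # True
```

Note that the value keeps the space that follows the colon, because nothing in the pattern consumes whitespace between the colon and the value; only the trailing whitespace is dropped.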


Finally


Documentation of the re module
Habrablog “Regular Expressions”
Regexp editor

Source: https://habr.com/ru/post/115436/

