Regular expressions (REs) are essentially a tiny, highly specialized programming language embedded in Python and made available through the re module. Using it, you specify the rules for the set of possible strings that you want to match; this set can contain English phrases, e-mail addresses, TeX commands, or anything you like. With the help of an RE you can ask questions such as "Does this string match the pattern?" or "Is there a match for the pattern anywhere in this string?". You can also use regular expressions to modify a string or to split it apart in various ways.
Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, it may be important to pay attention to how the engine will execute a given regular expression, and to write it so that the resulting bytecode runs faster. Optimization is not covered in this document, because it requires a good understanding of the engine's internals.
The language of regular expressions is relatively small and restricted, so not every possible string-processing task can be done with them. There are also tasks that can be done with regular expressions, but the resulting expressions turn out to be very complicated. In such cases it may be better to write plain Python code: it may run slower than an elaborate regular expression, but it will be more understandable.
Simple patterns
We will start by learning about the simplest regular expressions. Since regular expressions are used to operate on strings, we begin with the most common task: matching characters.
For a detailed explanation of the theoretical underpinnings of regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on compiler writing.
Character Matching
Most letters and characters simply match themselves. For example, the regular expression test will match the string test exactly (you can enable a case-insensitive mode that would also let this regular expression match Test or TEST, but more on that later).
There are exceptions to this rule; some characters are special metacharacters and do not match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other portions of the regular expression by repeating them or changing their meaning. Much of this tutorial is devoted to discussing various metacharacters and what they do.
Here is a complete list of metacharacters; their meanings will be discussed in the rest of this HOWTO.
. ^ $ * + ? { [ ] \ | ( )
The first metacharacters we will look at are [ and ]. They are used to specify a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving the first and last characters separated by a '-'. For example, [abc] will match any of the characters a, b or c; this is the same as the expression [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, the RE would be [a-z].
Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm' or '$'. The character '$' is usually a metacharacter (as the list above shows), but inside a character class it is stripped of its special nature.
To match the characters not listed in a class, the symbol '^' is added as the first character of the class. For example, the expression [^5] matches any character except '5'.
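A quick sketch of these ideas in code (the sample strings here are arbitrary ones of my own, not from the original text):

```python
import re

# [abc] lists characters individually; [a-c] expresses the same set as a range.
assert re.match('[abc]', 'banana') is not None
assert re.match('[a-c]', 'banana') is not None

# [a-z] matches any single lowercase letter.
assert re.match('[a-z]', 'q') is not None
assert re.match('[a-z]', 'Q') is None

# [^5] matches any character except '5'.
assert re.match('[^5]', '6') is not None
assert re.match('[^5]', '5') is None
```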
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal different special sequences. It is also used to escape metacharacters so that they can be used in patterns; for example, if you need to match a [ or a \, you can strip them of their special role as metacharacters by putting a backslash in front of them: \[ or \\.
Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that is not whitespace (spaces, tabs, and so on). The following predefined sequences are a subset of them. For a complete list of sequences and the expanded class definitions for Unicode strings, see the last part of Regular Expression Syntax.
\d
Matches any decimal digit; equivalent to the class [0-9].
\D
Matches any non-digit character; equivalent to the class [^0-9].
\s
Matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; equivalent to the class [^a-zA-Z0-9_].
These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, comma, or period.
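A few quick checks of these sequences (the sample strings are my own):

```python
import re

# \d pulls out the individual digits of a string.
assert re.findall(r'\d', 'room 101') == ['1', '0', '1']

# \w+ matches runs of letters, digits, and underscores.
assert re.findall(r'\w+', 'py3_rocks!') == ['py3_rocks']

# \s can be combined with other characters inside a class, as in [\s,.]:
assert re.findall(r'[\s,.]', 'a, b.') == [',', ' ', '.']
```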
The final metacharacter in this section is '.'. It matches anything except a newline character; there is an alternate mode (re.DOTALL) where it will match even a newline. '.' is often used where you want to match "any character".
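A short demonstration of '.' and the re.DOTALL mode (the test strings are my own):

```python
import re

# '.' matches any character except a newline...
assert re.match('.', '\n') is None
# ...unless re.DOTALL is specified.
assert re.match('.', '\n', re.DOTALL) is not None

# A common use: 'a.c' matches 'a', then any one character, then 'c'.
assert re.match('a.c', 'abc') is not None
assert re.match('a.c', 'axc') is not None
```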
Repeating things
Being able to match varying sets of characters is the first thing regular expressions can do that is not always possible with string methods. However, if that were their only extra capability, they would not be much of an advance. Another capability is that you can specify how many times a portion of the regular expression must be repeated.
The first metacharacter for repetition is *. It specifies that the previous character can be matched zero or more times, instead of exactly once.
For example, ca*t will match ct (0 a characters), cat (1 a character), caaat (3 a characters), and so on. The regular expression engine has various internal limitations stemming from the size of C's int type, which prevent it from matching more than about 2 billion a characters. (I hope you do not need that many.)
Repetitions such as * are greedy; the engine will try to repeat the match as many times as possible. If later portions of the pattern do not match, the engine will back up and try again with fewer repetitions of the character.
A step-by-step examination of an example will make this clearer. Consider the expression a[bcd]*b. It matches the letter 'a', zero or more characters from the class [bcd], and finally a final 'b'. Now imagine matching this regular expression against the string abcbd. Here is how the matching proceeds, stage by stage:
1. a - the 'a' in the regular expression matches.
2. abcbd - the engine matches [bcd]* against as many characters as it can, going all the way to the end of the string (all the remaining characters belong to the class in brackets).
3. Failure - the engine tries to match the final character of the regular expression, the letter b, but the current position is at the end of the string, where there are no characters left, so it fails.
4. abcb - the engine backs up, reducing the match of [bcd]* by one character.
5. Failure - it tries b again, but the character at the current position is the final d.
6. abc - the engine backs up again, so that [bcd]* now matches only bc.
7. abcb - it tries the last character of the regular expression, b, once more. This time the character at the current position really is 'b', and it succeeds.
So the end of the RE has now been reached, and it has matched abcb. This example shows how the engine first goes as far as it can, and if no match is found it then progressively backs up and retries the rest of the RE again and again. It will back up until it has tried zero matches for [bcd]*, and if that subsequently fails too, the engine concludes that the string does not match the RE pattern at all.
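The outcome of the walk-through above can be verified directly; the final match is abcb, not the longer string the greedy first attempt consumed:

```python
import re

m = re.match('a[bcd]*b', 'abcbd')
assert m is not None
assert m.group() == 'abcb'   # the engine backed up from 'abcbd' to find this

# If even zero repetitions of [bcd]* cannot make the final 'b' match,
# the RE fails as a whole:
assert re.match('a[bcd]*b', 'addd') is None
```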
Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match cat or, say, caaat, but not ct.
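The difference can be seen directly in code:

```python
import re

# ca*t: zero or more 'a' characters between 'c' and 't'
assert re.match('ca*t', 'ct') is not None
assert re.match('ca*t', 'caaat') is not None

# ca+t: at least one 'a' is required
assert re.match('ca+t', 'ct') is None
assert re.match('ca+t', 'cat') is not None
```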
There are two more repeating qualifiers. The question mark, ?, matches either zero or one time. For example, home-?brew matches both homebrew and home-brew.
The most complicated repeating qualifier is {m,n}, where m and n are integers. It means that there must be at least m and at most n repetitions. For example, a/{1,3}b matches a/b, a//b and a///b. It will not match ab, which has no slashes, or a////b, which has four of them.
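These two qualifiers can be checked in the interpreter. Note that re.fullmatch, which requires the entire string to match, is a Python 3.4+ addition used here for convenience; it is not mentioned in the text above:

```python
import re

# home-?brew: the '-' may appear zero or one time
assert re.match('home-?brew', 'homebrew') is not None
assert re.match('home-?brew', 'home-brew') is not None

# a/{1,3}b: one to three slashes are allowed
assert re.fullmatch('a/{1,3}b', 'a//b') is not None
assert re.fullmatch('a/{1,3}b', 'ab') is None       # zero slashes: too few
assert re.fullmatch('a/{1,3}b', 'a////b') is None   # four slashes: too many
```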
You can omit either m or n; in that case, a reasonable value is assumed for the missing one. Omitting m means a lower bound of 0, while omitting n means an upper bound of infinity; as mentioned above, the upper bound is in practice limited by memory.
Readers may already have noticed that the three other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} can replace ?.
Using regular expressions
Now that we have covered a few simple regular expressions, how do we actually use them in Python? The re module provides an interface to the regular expression engine, allowing you to compile regular expressions into objects and then perform matches with them.
Compiling regular expressions
Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.
>>> import re
>>> p = re.compile('ab*')
>>> print p
<_sre.SRE_Pattern object at 0x...>
re.compile() also accepts optional arguments, used to enable various special features and syntax variations:
>>> p = re.compile('ab*', re.IGNORECASE)
The regular expression is passed to re.compile() as a string. Regular expressions are handled as strings because they are not part of the core Python language, and no special syntax was created for expressing them. (There are applications that do not need regular expressions at all, so there is no need to bloat the language specification by including them.) Instead, there is the re module, which is simply a C extension module included with Python, just like the socket or zlib modules.
Passing regular expressions as strings keeps the Python language simpler, but it has one disadvantage, which is the topic of the next section.
The backslash plague
As noted earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to lose their special role. This conflicts with Python's use of the same character for the same purpose in string literals.
Let's say you want to write a regular expression matching \section, which you need to find in a LaTeX file. To figure out what to write in the program code, start with the string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, so the string acquires a \\ portion. The resulting string that must be passed to re.compile() is therefore \\section. However, to express this as a Python string literal, both backslashes must be escaped again, giving the literal "\\\\section".
In short, to match a literal backslash, you have to write '\\\\' as the regular expression string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will therefore often be written in Python code using raw strings.
Regular string | Raw string |
'ab*' | r'ab*' |
'\\\\section' | r'\\section' |
'\\w+\\s+\\1' | r'\w+\s+\1' |
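The \section example can be verified directly; both spellings denote the same regular expression (the sample string is my own):

```python
import re

# A Python literal with four backslashes, and a raw string with two,
# both compile to an RE that matches a literal backslash followed by 'section'.
pat_plain = re.compile('\\\\section')
pat_raw = re.compile(r'\\section')

sample = r'\section{Introduction}'   # raw text beginning with a backslash
assert pat_plain.match(sample) is not None
assert pat_raw.match(sample) is not None
```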
Performing matches
Once you have an object representing a compiled regular expression, what do you do with it? Pattern objects have several methods and attributes. Only the most significant ones will be covered here; see the re documentation for a complete list.
Method / Attribute | Purpose |
match() | Determine whether the regular expression matches at the beginning of the string. |
search() | Scan through the string, looking for any location where the regular expression matches. |
findall() | Find all substrings where the regular expression matches, and return them as a list |
finditer() | Find all substrings where the regular expression matches, and return them as an iterator |
If no match is found, match() and search() return None. If the search is successful, a MatchObject instance is returned, containing information about the match: where it begins and ends, the substring it matched, and so on.
You can learn about this by interactively experimenting with the re module. You can also take a look at Tools/scripts/redemo.py, a demonstration program included with the Python distribution. It allows you to enter regular expressions and strings, and displays whether the regular expression matches or not. redemo.py can be quite useful for debugging complicated regular expressions. Phil Schwartz's Kodos is another interactive tool for developing and testing RE patterns.
In this tutorial, we use the standard Python interpreter for the examples:
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 0x...>
Now you can try matching strings against the regular expression [a-z]+. An empty string will not match it, because + means repeating "one or more" times. match() should return None in this case, which is what we see:
>>> p.match("")
>>> print p.match("")
None
Now try a string that should match the pattern: 'tempo'. In this case, match() will return a MatchObject, which you can store in a variable for later use:
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 0x...>
You can now query the MatchObject for information about the matched string. MatchObject instances also have several methods and attributes; the most important ones are:
Method / Attribute | Purpose |
group() | Return the string matched by the regular expression |
start() | Return the starting position of the match |
end() | Return the ending position of the match |
span() | Return a tuple (start, end) containing the positions of the match |
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
Since the match() method only checks for a match at the beginning of the string, start() will always return 0. However, the search() method scans through the whole string, so the starting position in that case is not necessarily zero:
>>> print p.match('::: message')
None
>>> m = p.search('::: message'); print m
<_sre.SRE_Match object at 0x...>
>>> m.group()
'message'
>>> m.span()
(4, 11)
In real programs, the most common style is to store the MatchObject in a variable and then check whether it is None. It usually looks like this:
p = re.compile( ... )
m = p.match('string goes here')
if m:
    print 'Match found:', m.group()
else:
    print 'No match'
Two methods return all of the matches for a pattern. findall() returns a list of matched substrings:
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
The findall() method has to create the entire list before it can be returned as a result. The finditer() method returns a sequence of MatchObject instances as an iterator:
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)
Module-level functions
You do not have to create a pattern object and call its methods; the re module also provides top-level functions such as match(), search(), findall(), sub(), and so on. These functions take the same arguments as the corresponding pattern method, with the RE string added as the first argument, and still return either None or a MatchObject.
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<_sre.SRE_Match object at 0x...>
Under the hood, these functions simply create a pattern object for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same regular expression will be faster.
Should you use these module-level functions, or compile pattern objects and call their methods? It depends on how often the regular expression will be used and on your personal coding style. If a regular expression is used at only one point in the code, the module functions are probably more convenient. If a program contains many regular expressions, or re-uses the same ones in several places, then it may be worthwhile to collect all the definitions in one place, in a section of code that compiles all the regular expressions ahead of time. As an example from the standard library, here is a fragment from xmllib.py:
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
I myself prefer to work with compiled objects, even for one-time uses, but few people will be as much of a purist about this as I am.
Compilation flags
Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE, and a short one-letter form such as I. Multiple flags can be combined with bitwise OR; for example, re.I | re.M sets both the I and M flags.
DOTALL, S
Makes '.' match any character whatsoever, including a newline; without this flag, '.' matches anything except a newline.
IGNORECASE, I
Performs case-insensitive matching; for example, [A-Z] will also match lowercase letters, so Spam will match Spam, spam, spAM, and so on.
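For example (test strings are my own):

```python
import re

p = re.compile('spam', re.IGNORECASE)
assert p.match('Spam') is not None
assert p.match('spAM') is not None

# Without the flag, case matters:
assert re.match('spam', 'Spam') is None
```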
LOCALE, L
Makes \w, \W, \b, \B dependent on the current locale. For example, suppose you are processing French text and want to write \w+ to match words; but \w by default matches only the characters in the set [A-Za-z], so it will not match 'é' or 'ç'. If your system is configured properly and the French locale is selected, 'é' will also be treated as a letter.
MULTILINE, M
(The metacharacters ^ and $ have not been described yet; they will be introduced a little later, at the beginning of the second part of this HOWTO.) Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string, immediately before the newline character (if any) at the end. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within it, that is, immediately after each newline character. Similarly, $ matches at the end of the string and at the end of each line, immediately before each newline.
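The effect of re.M on ^ can be seen with a small two-line string (my own example):

```python
import re

text = 'first line\nsecond line'

# Without re.M, '^' matches only at the very beginning of the string:
assert re.findall(r'^\w+', text) == ['first']

# With re.M, '^' also matches immediately after each newline:
assert re.findall(r'^\w+', text, re.M) == ['first', 'second']
```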
UNICODE, U
Makes \w, \W, \b, \B, \d, \D, \s, \S conform to the Unicode character database.
VERBOSE, X
Enables verbose regular expressions, which can be organized more clearly and understandably. When this flag is specified, whitespace within the RE string is ignored, except when it occurs inside a character class or is preceded by an unescaped backslash; this lets you organize and indent the regular expression more clearly. This flag also allows you to put comments inside a regular expression, starting with '#'; they are ignored by the engine.
Here is an example of how a regular expression becomes considerably easier to read with re.VERBOSE:
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)
Without the verbose setting, it would look like this:

charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")
In the above example, Python's automatic concatenation of string literals was used to break the regular expression into smaller pieces, but even so, without the comments this version is harder to understand than the re.VERBOSE one.
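Both spellings compile to the same pattern; here is a quick check of the verbose version against a few entity references of my own invention:

```python
import re

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

assert charref.match('&#160;') is not None   # decimal form
assert charref.match('&#xFF;') is not None   # hexadecimal form
assert charref.match('&#160') is None        # missing trailing semicolon
```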
At this point we will conclude our overview. I advise you to take a short break before the second half, which covers the remaining metacharacters, the methods for splitting, searching in, and replacing strings, and a large number of examples of using regular expressions.