For example, \b is an assertion that the current position is at a word boundary; \b itself does not change the position. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched there an infinite number of times.

|
Alternation, or the "or" operator. If A and B are regular expressions, A|B will match any string that matches either A or B. The | metacharacter has very low precedence so that it works sensibly when you alternate multi-character strings: Crow|Servo will match either Crow or Servo, not Cro, then a 'w' or an 'S', then ervo.

^
Matches at the beginning of a line. Unless the MULTILINE flag has been set, this matches only at the beginning of the string. In MULTILINE mode (described in the previous part), it also matches immediately after each newline within the string. For example, to match the word From only at the beginning of a line, the RE to use is ^From:

>>> print(re.search('^From', 'From Here to Eternity'))
<re.Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
None
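As a rough illustration of the MULTILINE behaviour described above, the sketch below runs the same ^From pattern over a multi-line string, with and without the flag. The sample text is made up for this example and is not from the original article.

```python
import re

# Hypothetical sample text: two lines start with "From", one does not.
text = "From Here to Eternity\nReciting From Memory\nFrom Russia with Love"

# Without MULTILINE, ^ matches only at the very start of the string.
plain = re.findall(r'^From', text)

# With MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)

print(len(plain))  # 1
print(len(multi))  # 2
```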
$
Like ^, but matches at the end of a line, which is defined as either the end of the string or any location followed by a newline character:

>>> print(re.search('}$', '{block}'))
<re.Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
None
>>> print(re.search('}$', '{block}\n'))
<re.Match object; span=(6, 7), match='}'>
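\A
Like ^, but does not depend on the MULTILINE flag: it matches only at the start of the string.

\Z
Like $, but does not depend on the MULTILINE flag: it matches only at the end of the string.

\b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. The following example matches class only when it is a separate word; if it is contained within another word, there is no match:

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))
<re.Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
None
>>> print(p.search('one subclass is'))
None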
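There are two subtleties to remember when using this special sequence. First, this is the worst collision between Python string literals and regular expression sequences: in a Python string literal, \b is the backspace character, ASCII value 8. If you do not use raw strings, Python will convert \b to a backspace, and your RE will not match as you intend:

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
None
>>> print(p.search('\b' + 'class' + '\b'))
<re.Match object; span=(0, 7), match='\x08class\x08'>

Second, inside a character class, where this assertion has no use, \b represents the backspace character, for compatibility with Python string literals.

Often you need more information than just whether the RE matched. Regular expressions are frequently used to dissect strings by writing an RE divided into several subgroups that match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ':', like this:

From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com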
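One way to take such a header line apart is an RE with two capturing groups (the parenthesised parts, covered in the following paragraphs): one for the header name and one for the value. The pattern and the name header_re below are illustrative assumptions, not the article's own code.

```python
import re

# One capturing group for the header name, one for the value.
# The exact pattern is an assumption made for this sketch.
header_re = re.compile(r'([^:\s]+)\s*:\s*(.*)')

line = 'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)'
m = header_re.match(line)
name, value = m.group(1), m.group(2)

print(name)   # User-Agent
print(value)  # Thunderbird 1.5.0.9 (X11/20061227)
```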
Groups are marked by the '(' and ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions: they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab:

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)
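Groups indicated with '(' and ')' also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it is the whole RE, so match object methods all have group 0 as their default argument:

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, count the opening parentheses from left to right:

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'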
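The same group numbers can be passed to start(), end(), and span(). A minimal sketch, reusing the nested pattern from the example above:

```python
import re

m = re.match(r'(a(b)c)d', 'abcd')

# With no argument these methods describe group 0, the whole match.
print(m.span())    # (0, 4)

# With a group number they describe that subgroup instead.
print(m.span(1))   # (0, 3)
print(m.start(2))  # 1
print(m.end(2))    # 2
```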
group() can be passed several group numbers at once, in which case it returns a tuple containing the corresponding values for those groups:

>>> m.group(2, 1, 2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, starting from group 1:

>>> m.groups()
('abc', 'b')
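Backreferences in a pattern let you specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 are repeated at the current position, and fails otherwise. Python string literals also use a backslash followed by digits to insert arbitrary characters, so be sure to use a raw string when putting backreferences in an RE. For example, the following RE detects doubled words in a string:

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'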
Perl 5 added several features to standard regular expressions, and the Python re module supports most of them. It would have been difficult to choose new single-character metacharacters or new backslash sequences for these features without making Perl's regular expressions confusingly different from standard REs. If & had been chosen as a new metacharacter, for example, old expressions would have treated it as a regular character and would not have escaped it by writing \& or [&].

The solution chosen by the Perl developers was to use (?...) as the extension syntax. A question mark immediately after an opening parenthesis was a syntax error in ordinary REs, because the ? had nothing to repeat, so this introduced no compatibility problems. The characters immediately after the ? indicate which extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Python adds an extension syntax of its own to the Perl extension syntax. If the first character after the question mark is a P, then it is an extension specific to Python. Currently there are two such extensions: (?P<some_name>...) defines a named group, and (?P=some_name) is a backreference to a named group. If future versions of Perl 5 add similar features using a different syntax, the re module will be changed to support the new syntax, while preserving the Python-specific syntax for compatibility.

Sometimes you will want to use a group to collect a part of a regular expression, but are not interested in retrieving the group's contents. You can make this explicit by using a non-capturing group: (?:...), where you can replace the ... with any other regular expression:

>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
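Except for the fact that you cannot retrieve the contents of what the group matched, a non-capturing group behaves exactly like a capturing group: you can put anything inside it, repeat it with a repetition metacharacter such as *, and nest it within other groups (capturing or non-capturing).

A more significant feature is named groups: instead of referring to them by number, groups can be referenced by name. The syntax for a named group is one of the Python-specific extensions: (?P<some_name>...). Named groups behave exactly like ordinary capturing groups, and additionally associate a name with the group. The match object methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group's name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search('((((Lots of punctuation))))')
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'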
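Named groups can also be retrieved all at once as a dictionary through the match object's groupdict() method. The group names first and last and the sample string below are assumptions chosen for this sketch.

```python
import re

# Two named groups; names and input are illustrative only.
m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')

info = m.groupdict()
print(info)  # {'first': 'Jane', 'last': 'Doe'}
```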
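Named groups are handy because they let you use easily remembered names instead of having to remember numbers. Here is an example RE from the imaplib module:

InternalDate = re.compile(r'INTERNALDATE "'
    r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
    r'(?P<year>[0-9][0-9][0-9][0-9])'
    r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
    r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
    r'"')

It is obviously much easier to retrieve m.group('zonem') than to have to remember to retrieve group 9.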
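To see the named groups of the InternalDate pattern at work, the sketch below applies it to a date string. The sample input is a made-up fragment in the shape the pattern expects, not real server output.

```python
import re

# The InternalDate pattern shown above, written as a text pattern.
InternalDate = re.compile(r'INTERNALDATE "'
    r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
    r'(?P<year>[0-9][0-9][0-9][0-9])'
    r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
    r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
    r'"')

# Hypothetical input, invented for this example.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

date = (m.group('day'), m.group('mon'), m.group('year'))
zone = m.group('zonen') + m.group('zoneh') + m.group('zonem')
print(date)  # ('17', 'Jul', '1996')
print(zone)  # -0700

# The name and the number refer to the same group.
print(m.group(9) == m.group('zonem'))  # True
```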
The syntax for backreferences in an expression such as (...)\1 refers to the number of the group. It is more natural to use the group's name instead of its number. Another Python extension, (?P=name), indicates that the contents of the group called name should again be matched at the current position. The earlier RE for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):

>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
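(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., matches at the current position, and fails otherwise. But once the contained expression has been tried, the matching engine does not advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion: it succeeds if the contained expression does not match at the current position in the string.

To make this concrete, consider a simple pattern that matches a file name and splits it into a base name and an extension, separated by a dot. The pattern to match this is quite simple:

.*[.].*$

Notice that the dot needs special treatment because it is a metacharacter, so it is placed inside a character class (square brackets). Also notice the trailing $; it is added to ensure that all of the rest of the string is included in the extension.

Now consider complicating the problem a bit: what if you want to match file names where the extension is not bat? Several incorrect attempts:

.*[.][^b].*$

This first attempt tries to exclude bat by requiring that the first character of the extension is not a b. This is wrong, because the pattern also rejects foo.bar.

.*[.]([^b]..|.[^a].|..[^t])$

The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension is not b, the second character is not a, or the third character is not t. This accepts foo.bar and rejects autoexec.bat, but it requires a three-letter extension and will not accept a file name with a two-letter extension such as sendmail.cf. The pattern has to be complicated again to fix this:

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In this third attempt, the second and third letters are made optional so that extensions shorter than three characters, such as sendmail.cf, can match. The pattern is now hard to read and understand, and it would become even more confusing if you also wanted to exclude exe as an extension.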
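A negative lookahead cuts through all this confusion:

.*[.](?!bat$).*$

The negative lookahead means: if the expression bat does not match at this point, try the rest of the pattern; if bat$ does match, the whole pattern fails. The trailing $ is required so that something like sample.batch, where the extension only starts with bat, is still allowed.

Excluding another file name extension is now easy; simply add it as an alternative inside the assertion. The following pattern excludes file names that end in either the bat or the exe extension:

.*[.](?!bat$|exe$).*$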
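A quick way to check the lookahead pattern is to run it over a handful of file names. The list below is made up for this sketch, and every name in it contains exactly one dot.

```python
import re

# Pattern from the text: reject names whose extension is bat or exe.
pattern = re.compile(r'.*[.](?!bat$|exe$).*$')

# Sample file names; each has exactly one dot (an assumption of this sketch).
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch', 'setup.exe']

accepted = [n for n in names if pattern.match(n)]
print(accepted)  # ['foo.bar', 'sendmail.cf', 'sample.batch']
```

Names with several dots behave differently: with this pattern, foo.bar.bat is accepted, because [.] can match the first dot. Tightening the tail from .*$ to [^.]*$ is one possible way to handle that case.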
Method / Attribute | Purpose |
split() | Split the string into a list, splitting it wherever the RE matches |
sub() | Find all substrings where the RE matches and replace them with a different string |
subn() | Does the same thing as sub(), but returns the new string and the number of replacements |
The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It is similar to the split() method of strings but provides much more generality in the delimiters you can split by; the string split() only supports splitting by whitespace or by a fixed string. As you would expect, there is a module-level re.split() function, too.

If maxsplit is nonzero, at most maxsplit splits are made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters:

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
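Sometimes you are interested not only in the text between delimiters, but also in what the delimiter was. If capturing parentheses are used in the RE, their values are also returned as part of the list. Compare the following calls:

>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']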
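Because a capturing pattern keeps the delimiters, the pieces of such a split can be separated into words and delimiters, or joined back into the original string. A small sketch of that property (the variable names are arbitrary):

```python
import re

text = 'This... is a test.'
pieces = re.split(r'(\W+)', text)

# Words sit at even indices, delimiters at odd indices.
words = pieces[0::2]
delimiters = pieces[1::2]
print(words)       # ['This', 'is', 'a', 'test', '']
print(delimiters)  # ['... ', ' ', ' ', '.']

# Joining every piece restores the original string.
restored = ''.join(pieces)
print(restored == text)  # True
```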
The module-level function re.split() takes the RE as its first argument, but otherwise behaves the same way:

>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
Another common task is to find all the matches for a pattern and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.

The optional count argument is the maximum number of matches to replace; the default value of 0 means that all occurrences are replaced.

Here is a simple example of using the sub() method. Colour names are replaced by the word colour:

>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
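The subn() method does the same work, but returns a 2-tuple containing the new string and the number of replacements that were performed:

>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colours at all')
('no colours at all', 0)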
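Empty matches are replaced only when they are not adjacent to a previous empty match (the output shown is for Python 3.7 and later):

>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b--d-'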
If the replacement is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so on. Backreferences, such as \6, are replaced with the substring matched by the corresponding group in the RE. This lets you include portions of the original text in the resulting replacement string.

This example matches the word section followed by a string enclosed in curly braces, { and }, and changes section to subsection:

>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First} section{second}')
'subsection{First} subsection{second}'
There is also a syntax for referring to named groups defined with (?P<name>...): \g<...>, where ... can be a number or the name of a group. \g<2> is equivalent to \2, but it is not ambiguous in a replacement string such as \g<2>0. (\20 would be interpreted as a reference to group 20, not as a reference to group 2 followed by the literal character '0'.) The following substitutions are all equivalent, but use all three variations of the replacement string:

>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}', 'section{First}')
'subsection{First}'
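The ambiguity that \g<2>0 resolves can be seen directly. The pattern and input below are assumptions chosen to keep the sketch short; the second call is wrapped in try/except because the pattern has no group 20.

```python
import re

# Two numbered groups separated by a hyphen (illustrative pattern).
p = re.compile(r'(\d+)-(\d+)')

# \g<2>0 means "group 2, then a literal 0".
result = p.sub(r'\g<2>0', '7-5')
print(result)  # 50

# \20 is read as a reference to group 20, which this pattern lacks.
try:
    p.sub(r'\20', '7-5')
    failed = False
except re.error as exc:
    failed = True
    print('error:', exc)
```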
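The replacement can also be a function, which gives you even more control. If the replacement is a function, it is called for every non-overlapping occurrence of the pattern. On each call, the function is passed a match object as its argument and can use this information to compute the desired replacement string and return it.

In the following example, the replacement function translates decimal numbers into hexadecimal:

>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'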
Sometimes using the re module is a mistake. If you are looking for a fixed string or a single character, and you do not need any special re features, then the full power of regular expressions is not required. Strings have several methods for operations with fixed strings, and they are usually much faster because they are optimized for this purpose.

One example is replacing a single fixed string with another one; for example, you might replace word with deed. re.sub() seems like the function to use here, but consider the replace() string method. Note that replace() will also replace word inside words, turning swordfish into sdeedfish, but the naive RE word would have done the same. (To avoid performing the substitution on parts of words, the pattern would have to be \bword\b.)

Another common task is deleting every occurrence of a single character from a string or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but the translate() method can do both tasks and will be faster than any regular expression operation.

In short, before turning to re, see if the problem can be solved by faster and simpler string methods.

match() only checks whether the RE matches at the beginning of the string, while search() scans forward through the whole string for a match. It is important to keep this distinction in mind:

>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)
Sometimes you will be tempted to keep using re.match() and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the matching process. One such analysis figures out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a 'C'. This analysis lets the engine quickly scan through the string looking for the starting character, only trying the full match if a 'C' is found.

Adding .* defeats this optimization, requiring scanning to the end of the string and then backtracking to find a match for the rest of the RE. Use re.search() instead.

When repeating a regular expression, as in a*, the resulting action is to consume as much of the pattern as possible. This often bites those who want to match a pair of balanced delimiters, such as the angle brackets <> surrounding an HTML tag. The naive pattern for matching a single HTML tag does not work because of the greedy nature of .*:

>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

The RE matches the '<' in the first tag, <html>, and the .* consumes the rest of the string. Since '>' cannot match at the end of the string, the engine backtracks character by character until it finds a match for the '>'. As a result, the match extends from the '<' of the <html> tag to the '>' of the closing </title> tag, which is not what we wanted.

In this case, the solution is to use the non-greedy qualifiers *?, +?, ??, or {m,n}?, which match as little text as possible. In the example above, the '>' is tried immediately after the first '<' matches; when it fails, the engine advances one character at a time, retrying the '>' at every step. This gives the desired result:

>>> print(re.match('<.*?>', s).group())
<html>
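The difference between the greedy and non-greedy forms is easy to see when collecting every match with re.findall(). A minimal sketch on the same string:

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* runs to the last '>' in the string, giving one long match.
greedy = re.findall(r'<.*>', s)
print(greedy)
# ['<html><head><title>Title</title>']

# Non-greedy: .*? stops at the first '>' after each '<'.
lazy = re.findall(r'<.*?>', s)
print(lazy)
# ['<html>', '<head>', '<title>', '</title>']
```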
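By now you have probably noticed that regular expressions are a very compact notation, but they are not terribly readable. For REs of moderate complexity, the re.VERBOSE flag can help, because it lets you format the regular expression more clearly.

With re.VERBOSE, whitespace in the regular expression that is not inside a character class is ignored. This means that an expression such as dog | cat is equivalent to the less readable dog|cat, but [a b] will still match the characters 'a', 'b', or a space. In addition, you can put comments inside an RE; they extend from a # character to the next newline. With triple-quoted strings, the formatting becomes neater:

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value; *? is used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

This is far more readable than:

pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")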
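As a rough check that the verbose pattern behaves as its comments say, the sketch below applies it to a header line padded with extra whitespace. The input line is an assumption made for this example.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (non-greedy)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

# Hypothetical header line with leading and trailing spaces.
m = pat.match('  MIME-Version: 1.0   ')

header = m.group('header')
value = m.group('value')
print(repr(header))  # 'MIME-Version'
print(repr(value))   # ' 1.0'
```

The trailing spaces are dropped, but the value keeps the single space that follows the colon, since the pattern does not skip whitespace there; calling strip() on the value is one way to remove it.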
Source: https://habr.com/ru/post/115436/