
Regexp and Python: extracting tokens from text

Parsing logs and configuration files is a task that comes up again and again and has been described many times. In this article, I will show how to implement its classic solution in Python: using regular expressions and named groups. Where possible, I will explain why a particular solution is used, and point out the pitfalls and the ways around them.


Why parse text, and what tokens are


In the text files our programs deal with, there is usually more than one piece of information. So that a program can separate one piece of information from another, we define file formats, that is, conventions about how the text inside the file is laid out. The simplest format puts each piece of information on its own line. Such a file needs almost no extra processing: it is enough to read it with the standard facilities of the programming language and split it into lines, which most languages can do in one or two statements. Unfortunately, most of the files that need processing have a somewhat more complex format. The classic settings file, for example, consists of lines of the form name = value. In the general case such a format is also fairly easy to parse: read the file line by line and find the '=' in each line; whatever is to the left of it is the field name, and whatever is to the right is the value. This logic works until we need to parse a file with multi-line field values, or with values that themselves contain the '=' character. Attempts to handle such a file quickly fill the code with checks, loops, and other complications. Therefore, for text files whose structure is more complex than a list of lines, tokenizing with regular expressions has long been applied successfully. A "token" usually means a small piece of text that sits at a specific place in the text and carries a specific meaning. For example, in the following fragment of a configuration file:

name=Vasya

three tokens can be distinguished: "name" as the field name, "=" as the separator, and "Vasya" as the field value. Strictly speaking, what I call tokens in this article corresponds more closely to the definition of a lexeme. The difference between them is that a lexeme is a fragment of text of a certain format, taken without regard to its position relative to other fragments of text. Complex parsers, such as those used in compilers, first break the text into lexemes and then process the list of lexemes with a large, branching finite automaton, which builds the tokens out of the lexemes.
Fortunately, Python has a very good library for working with regular expressions, which lets you solve most text processing tasks in a single pass, without an intermediate search for lexemes and their subsequent conversion into tokens.
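Before moving on to regular expressions, here is a minimal sketch of the naive line-by-line approach described above; the sample text and field names in it are made up purely for the illustration:

text = "name = Vasya\ncolor = red\n"

config = {}
for line in text.splitlines():
    if "=" in line:
        # everything left of the first '=' is the name, the rest is the value
        name, value = line.split("=", 1)
        config[name.strip()] = value.strip()

print(config)  # {'name': 'Vasya', 'color': 'red'}

Multi-line values, or any format richer than one field per line, are exactly where this approach starts to drown in special cases.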

What are regular expressions?


Regular expressions are, well... To put it very briefly, they are a programming language designed for searching in text. A very, very simple programming language: it has almost no conditionals, no loops, no functions; there is only a single expression that describes what text we want to find. But that single expression can be very long :). To use regular expressions successfully, in general and in Python in particular, you need to know a few things. First, every self-respecting library for working with regular expressions uses its own syntax for those regular expressions. The syntaxes are broadly similar, but the details and extra features can differ a lot, so before using regular expressions in Python you should get acquainted with the syntax in the official documentation.
Second, regular expressions do not separate the syntax of the language from the user's data. That is, if we want to find the word "Vasya", the regular expression that searches for it looks exactly like that: "Vasya". There is no programming language in that line, only the string we are searching for. But if we want to find the word "Vasya" followed by a comma or a semicolon, the regular expression acquires the necessary and important details: "Vasya,|Vasya;". As you can see, the "logical or" construct of the language, written as a vertical bar, has appeared here, and yet the strings we specified are in no way set apart from the language syntax. This leads to an important and unpleasant consequence: if the string we want to find contains a character that is part of the language syntax, we have to write a "\" in front of it. So a regular expression that searches for the word "Vasya" followed by a dot or a question mark looks like this: "Vasya\.|Vasya\?". Both the dot and the question mark are part of the regular expression syntax :(
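A side note: when the text being searched for comes from elsewhere and may contain special characters, the escaping does not have to be done by hand. The re library provides re.escape() for exactly this purpose; a small sketch:

import re

word = "Vasya?"  # a string that happens to contain a special character
pattern = re.escape(word)  # becomes the pattern Vasya\?
print(re.search(pattern, "Was that Vasya?").group())  # prints: Vasya?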
Third, regular expressions are greedy by default. Unless you explicitly say otherwise, the string of maximum length satisfying the regular expression will be found. For example, if we want to find lines like "name=value" in a text and write the regular expression ".+=.+", then for the text "a=b" it works correctly and returns "a=b". But for the text "a=b, c=d" it returns the whole text, "a=b, c=d". This property of regular expressions must always be kept in mind; write them so that the library is not tempted to return half of "War and Peace" as a search result. The previous regular expression, for example, only needs a small modification: "[^=]+=[^=]+". This version takes into account that there should be no "=" in the text before and after the "=" character.
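A small demonstration of the difference. Note that the character-class version still runs on to the next '=', while the non-greedy quantifier '+?', which the re syntax also supports, stops at the shortest possible match:

import re

text = "a=b, c=d"

print(re.search(r".+=.+", text).group())  # a=b, c=d - greedy, grabs everything
print(re.search(r"[^=]+=[^=]+", text).group())  # a=b, c - better, but not perfect
print(re.search(r".+?=.+?", text).group())  # a=b - non-greedy, shortest match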

We are looking for a token in the text


The Python regular expression library is called "re". Essentially there is one main function: search(). Pass the regular expression as the first argument and the text to search as the second, and you get the search result back. Note that it is best to use the "r" prefix on the string containing the regular expression, so that the "\" characters are not interpreted as string escape sequences. A search example:

import re

match = re.search(r"Vasya\.|Vasya\?", "Vasya?")
print(match.group())

As the example shows, search() returns a "search result" object with several methods and fields that give you the found text, its position in the original string, and other necessary and useful properties. Let us take a more lifelike example: a classic configuration file consisting of section names in curly brackets, field names, and field values. The regular expression for finding section names looks like this:

import re

txt = '''
{number section}
num=1
{text section}
txt="2"
'''
match = re.search(r"{[^}]+}", txt)
print(match.group())

Running this code prints the string "{number section}": the section name was successfully found.
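Continuing this example: besides the found text, the match object also knows where exactly in txt it was found. A quick sketch of its span(), start(), and end() methods:

print(match.span())  # (start, end) indexes of the fragment in txt
print(match.start())  # index of the first matched character
print(txt[match.start():match.end()])  # slicing txt gives the same fragment back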

We are looking for all instances of the token in the text


As the previous example shows, a simple call to re.search() finds only the first token in the text. The re library offers several ways to find all instances of a token. The most correct one, in my opinion, is the finditer() method, which returns an iterator over "search result" objects. By getting these mysterious objects instead of plain strings (which, say, the findall() method returns), we can not only see what text was found, but also learn exactly where it was found: the "search result" object has a handy span() method that returns the exact position of the found fragment in the source text. The code modified to find all instances of the token with finditer() looks like this:

result = re.finditer(r"{[^}\n]+}", txt)
for match in result:
    print(match.group())
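For comparison, here is what findall() gives on the same pattern: plain strings, with the position information lost along the way:

print(re.findall(r"{[^}\n]+}", txt))  # ['{number section}', '{text section}']

for match in re.finditer(r"{[^}\n]+}", txt):
    print(match.group(), match.span())  # the same text, plus where it was found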

We are looking for different tokens in the text


Unfortunately, searching for a single kind of token is interesting but nearly useless. Usually a text, even one as simple as a configuration file, contains many tokens we care about. In the configuration file example these are at least section names, field names, and field values. To search for several different tokens, the regular expression language uses groups. Groups are fragments of a regular expression enclosed in parentheses; the parts of the text matching these fragments are returned as separate results. A regular expression that can find sections, field names, and values thus looks like this:

result = re.finditer(r"({[^}\n]+})|(?:([^=\n]+)=([^\n]+))", txt)
for match in result:
    print(match.groups())

Note that this code differs substantially from the previous one. First, there are three groups in the regular expression: "({[^}\n]+})" matches a title in curly brackets, "([^=\n]+)" before the '=' sign matches a field name, and "([^\n]+)" after the '=' sign matches a field value. There is also the strange-looking group "(?:...)", which wraps the name and value groups. It is a special non-capturing group meant for use with the logical operator '|': it lets you combine several groups under a single '|' without side effects. Second, the groups() method is used instead of group() to print the results. That is no accident: the Python regular expression library has its own idea of what a "search result" is. It shows in the fact that the regular expression of two groups "([^=\n]+)=([^=\n]+)", applied to the text "a=b", returns ONE object of type "result", which consists of several GROUPS.

Determine exactly what we found


If we run the previous example, the following result will be displayed on the screen:

('{number section}', None, None)
(None, 'num', '1')
('{text section}', None, None)
(None, 'txt', '"2"')

As you can see, for each result the groups() method returns a magic list of three elements, each of which can be either None (empty) or the found text. If you study the documentation thoughtfully, it turns out that the library found three groups in our expression and now, for each result, reports which of the groups are present in it. We can see that the first group corresponds to a section name, the second to a field name, and the third to a field value. So the first result, "{number section}", is a section name; the second result, "num=1", is a field name plus a field value; and so on. As you can see, this is rather confusing and inconvenient: in the general case it is hard to determine WHAT exactly we have found.
To answer this important question, groups can be given names. The regular expression language provides a special syntax for this: "(?P<group_name>expression)". If we change our code slightly and give the three groups names, everything becomes much more convenient:

import re

txt = '''
{number section}
num=1
{text section}
txt="2"
'''
regex = re.compile(r"(?P<section>{[^}\n]+})|(?:(?P<name>[^=\n]+)=(?P<value>[^\n]+))",
                   re.M | re.S | re.U)
result = regex.finditer(txt)
group_name_by_index = dict((v, k) for k, v in regex.groupindex.items())
print(group_name_by_index)
for match in result:
    for group_index, group in enumerate(match.groups()):
        if group:
            print("text: %s" % group)
            print("group: %s" % group_name_by_index[group_index + 1])
            print("position: %d,%d" % match.span(group_index + 1))

Note a few cosmetic changes. Before searching, re.compile() is called; it returns a so-called "compiled regular expression" object. Besides speed and convenience, it has one remarkable property: its groupindex attribute is a dictionary containing the names of all the groups and their indexes. Unfortunately, the dictionary is inverted, so to speak: it maps group names to indexes rather than the other way around. The scary-looking dict() expression corrects this annoying misunderstanding, and the group_name_by_index dictionary can then be used to look up a group name by its number. Also, the compilation uses the flags re.M (correct search for the beginning of a line "^" and the end of a line "$" in multi-line text), re.S ("." matches absolutely everything, including "\n"), and re.U (correct search in Unicode text). As a result, analyzing what was found takes two loops: first we iterate over the search results, and then, for each result, over the groups it contains. The outcome is an exact and complete list of tokens, with their types and positions in the text. Such a list can be used for text processing, syntax highlighting, or error detection: a necessary and useful thing all around.
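As a side note, the match object also has a lastgroup attribute holding the name of the last named group that participated in the match; for a pattern built from alternatives, like ours, it gives a quick way to tell which alternative fired. A sketch, reusing the compiled regex from the example above:

for match in regex.finditer(txt):
    if match.lastgroup == "section":
        print("section:", match.group("section"))
    else:  # a field line: both the 'name' and 'value' groups matched
        print("field:", match.group("name"), "=", match.group("value"))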

Conclusion


The way of finding lexemes in text demonstrated here is neither the best nor the only correct one: even within Python and its standard regular expression library there are at least a dozen alternatives that are no worse. But I hope the examples and explanations above will help someone get up to speed quickly when the need arises, saving the time otherwise spent googling and figuring out why it does not work quite the way we would like. Good luck to everyone; I am waiting for your comments.

Source: https://habr.com/ru/post/60369/

