
Parsing in Python: Pyparsing for beginners

Parsing (syntactic analysis) is the process of matching a sequence of words or characters against a set of rules, the so-called formal grammar. For example, take this line of code:

import matplotlib.pyplot as plt 

Its grammar is as follows: first comes the import keyword, then a module name or a dot-separated chain of module names, then the as keyword, followed by our own name for the imported module.

Parsing this line might, for example, need to produce the following expression:
 { 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' } 

This expression is a Python dictionary that has two keys: 'import' and 'as'. The value for the 'import' key is a list in which the names of the imported modules are listed in order.

As a rule, regular expressions are used for parsing. Python has a module for them called re (short for "regular expression"). If you have not worked with regular expressions, their appearance may scare you. For example, for the line of code 'import matplotlib.pyplot as plt' the expression would look like this:

 r'^[ \t]*import +\D+\.\D+ +as \D+' 
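
To make the comparison concrete, here is a small sketch that applies this regular expression with the standard re module. The second test string is our own example and is not from the original text; it only illustrates how rigid the pattern is.

```python
import re

# The pattern from the article, unchanged.
pattern = re.compile(r'^[ \t]*import +\D+\.\D+ +as \D+')

s = 'import matplotlib.pyplot as plt'
match = pattern.match(s)

# The line is recognised, but only as one undivided piece of text:
# the pattern has no groups, so we get no module list and no alias.
print(match is not None)
print(match.group(0))

# A valid import without a dot in the module name is not recognised at all.
print(pattern.match('import os as o'))
```

Getting the module names and the alias out of the match would require adding groups and further post-processing, which is exactly the part Pyparsing makes easier.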

Fortunately, there is a convenient and flexible parsing tool called Pyparsing. Its main advantage is that it makes the code more readable, and also allows additional processing of the analyzed text.

In this article we will install Pyparsing and create our first parser with it.


First, install Pyparsing. If you are working on Linux, type on the command line:

 sudo pip install pyparsing 

On Windows, open a command prompt as administrator, change to the directory where pip.exe is located (for example, C:\Python27\Scripts\), and run:

 pip install pyparsing 

Another way is to go to the Pyparsing project page on SourceForge, download the Windows installer there and install Pyparsing like a regular program. Complete information about the various ways to install Pyparsing can be found on the project page.
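
A quick way to check that the installation worked is to import the package and print its version. This is only a sanity check, not a step from the original instructions.

```python
import pyparsing

# __version__ is a plain string such as '2.4.7' or '3.1.2'.
print(pyparsing.__version__)
```

The camelCase names used in this article (parseString, asList, setParseAction) come from the 2.x series. As far as we know, the 3.x releases still accept them alongside the newer snake_case names such as parse_string and as_list.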

Let's move on to parsing. Let s be the following string:

 s = 'import matplotlib.pyplot as plt' 

As a result of parsing, we want to get a dictionary:

 { 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' } 

First you need to import Pyparsing. Start, for example, Python IDLE and enter:

 from pyparsing import * 

The asterisk * above imports all names from pyparsing. This can pollute the namespace and lead to errors in the program. In our case, * is used temporarily, because we do not yet know which classes from Pyparsing we will use. After we have written the parser, we will replace * with the names of the classes we actually used.

With Pyparsing, you first write parsers for individual keywords, symbols and short phrases, and then assemble the parser for the entire text from these parts.

To begin with, the line contains the name of a module. Formal grammar: in general, a module name is a word consisting of letters and the underscore character. In Pyparsing:

 module_name = Word(alphas + '_') 

Word stands for a word, alphas for letters. Word(alphas + '_') is a word consisting of letters and underscores. The variable module_name is our name for this small parser. Now read it all together: the name of a module is a word consisting of letters and the underscore character. Code written with Pyparsing reads very close to natural language.
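
Here is a minimal sketch of what this one-line parser accepts. The test strings and the variant module_name_with_digits are our own illustration and are not part of the article's parser.

```python
from pyparsing import Word, alphas, alphanums

module_name = Word(alphas + '_')

print(module_name.parseString('matplotlib').asList())
print(module_name.parseString('my_module').asList())

# parseString stops at the first character that does not fit the grammar,
# so the trailing digit is simply left unparsed here.
print(module_name.parseString('numpy2').asList())

# Hypothetical variant: the first character is a letter or underscore,
# the following characters may also be digits.
module_name_with_digits = Word(alphas + '_', alphanums + '_')
print(module_name_with_digits.parseString('numpy2').asList())
```

The rest of the article keeps the simpler letters-and-underscore definition.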

The full name of a module is the name of a module, then a dot, then the name of another module, then a dot again, then the name of a third module, and so on, until we reach the desired module along the chain. The full name may also consist of a single module name and contain no dots. In Pyparsing:

 full_module_name = module_name + ZeroOrMore('.' + module_name) 

ZeroOrMore means exactly what it says: the content in parentheses may be repeated several times or be absent altogether. So the second line of the parser reads: the full name of a module is a module name, followed zero or more times by a dot and a module name.
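
The behaviour of ZeroOrMore is easy to check on names with zero, one and two dots. This is a small sketch; the test strings are our own.

```python
from pyparsing import Word, alphas, ZeroOrMore

module_name = Word(alphas + '_')
# The string '.' is turned into a literal parser automatically
# when it is added to another parser expression.
full_module_name = module_name + ZeroOrMore('.' + module_name)

for text in ('os', 'matplotlib.pyplot', 'a.b.c'):
    print(text, '->', full_module_name.parseString(text).asList())
```

At this stage the dots are still part of the output; we deal with that later.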

After the full name of the module comes the optional part 'as plt'. It consists of the 'as' keyword followed by the name that we ourselves gave to the imported module. In Pyparsing:

 import_as = Optional('as' + module_name) 

Optional means that the content in parentheses may or may not be present. Altogether we get: "an optional expression consisting of the word 'as' and a module name".
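
A short sketch shows both cases, with and without the 'as' part. The empty input string is our own test case.

```python
from pyparsing import Word, alphas, Optional

module_name = Word(alphas + '_')
import_as = Optional('as' + module_name)

# The optional part is present: both tokens are returned.
print(import_as.parseString('as plt').asList())

# The optional part is absent: parsing still succeeds, with no tokens.
print(import_as.parseString('').asList())
```

Without Optional, the second call would fail with a ParseException, because the literal 'as' would be required.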

The full import statement consists of the import keyword, followed by the full name of the module, then the optional 'as plt' construct. In Pyparsing:

 parse_module = 'import' + full_module_name + import_as 

As a result, we have our first parser:

 module_name = Word(alphas + '_')
 full_module_name = module_name + ZeroOrMore('.' + module_name)
 import_as = Optional('as' + module_name)
 parse_module = 'import' + full_module_name + import_as

Now we need to parse the string s:

 parse_module.parseString(s) 

We'll get:

 (['import', 'matplotlib', '.', 'pyplot', 'as', 'plt'], {}) 

The output can be improved by converting the result into a list:

 parse_module.parseString(s).asList() 

We get:

 ['import', 'matplotlib', '.', 'pyplot', 'as', 'plt'] 
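
One detail worth knowing at this point: by default parseString is satisfied as soon as the beginning of the string matches the grammar and ignores whatever follows. The sketch below uses the parseAll argument to require that the whole string matches. The input with trailing garbage is our own test case.

```python
from pyparsing import Word, alphas, ZeroOrMore, Optional, ParseException

module_name = Word(alphas + '_')
full_module_name = module_name + ZeroOrMore('.' + module_name)
import_as = Optional('as' + module_name)
parse_module = 'import' + full_module_name + import_as

# By default, text after a successful match is silently ignored.
print(parse_module.parseString('import os !!!').asList())

# With parseAll=True the whole string has to fit the grammar.
try:
    parse_module.parseString('import os !!!', parseAll=True)
    rejected = False
except ParseException as err:
    rejected = True
    print('rejected:', err)
```

For checking user input, the stricter form is usually the safer choice.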

Now let's improve the parser. First of all, we do not want to see the word import and the dots between module names in the parser output. Suppress() is used to suppress output. With it, our parser looks like this:

 module_name = Word(alphas + '_')
 full_module_name = module_name + ZeroOrMore(Suppress('.') + module_name)
 import_as = Optional(Suppress('as') + module_name)
 parse_module = Suppress('import') + full_module_name + import_as

Running parse_module.parseString(s).asList(), we get:

 ['matplotlib', 'pyplot', 'plt'] 

Now let's make the parser return a dictionary right away, of the form { 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' }. Before doing this, we need separate access to the list of imported modules (full_module_name) and to our own name for the module (import_as). For this, Pyparsing lets you assign names to parsing results. Let's give the list of imported modules the name 'modules', and our own name for the module the name 'import_as':

 full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
 import_as = (Optional(Suppress('as') + module_name))('import_as')
 parse_module = Suppress('import') + full_module_name + import_as

As can be seen above, to give a parsing result a name, you wrap the parser expression in parentheses and then, in a second pair of parentheses, give the name of the result. Note that parse_module has to be rebuilt from the new, named expressions. Let's see what has changed. To do this, execute the code:

 res = parse_module.parseString(s)
 print(res.modules.asList())
 print(res.import_as.asList())

We get:

 ['matplotlib', 'pyplot']
 ['plt']
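
The call syntax expr('name') is shorthand for the setResultsName method, and named results can also be read with dictionary-style indexing. The sketch below spells both out; it is equivalent to the code above, just written more explicitly.

```python
from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional

module_name = Word(alphas + '_')

# Same as (...)('modules') and (...)('import_as'), written out in full.
full_module_name = (module_name
                    + ZeroOrMore(Suppress('.') + module_name)).setResultsName('modules')
import_as = Optional(Suppress('as') + module_name).setResultsName('import_as')
parse_module = Suppress('import') + full_module_name + import_as

res = parse_module.parseString('import matplotlib.pyplot as plt')
print(res.modules.asList())        # attribute-style access
print(res['import_as'].asList())   # dictionary-style access
```

Which access style to use is a matter of taste; attribute access is shorter, while indexing also works for names that are not valid Python identifiers.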

Now we can separately extract the chain of modules that leads to the imported one and our own name for it. It remains to make the parser return a dictionary. To do this, we use a so-called parse action, an action performed during parsing:

 parse_module = (Suppress('import') + full_module_name + import_as).setParseAction(lambda t: {'import': t.modules.asList(), 'as': t.import_as.asList()[0]})

lambda defines an anonymous function in Python, and t is the argument of this function. After the colon comes a Python dictionary expression into which we substitute the data we need. Calling asList() gives us a list. When the as part is present, there is exactly one name after as, so the list t.import_as.asList() contains exactly one value. Therefore we take the single list element (it has index zero) and write asList()[0].
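
The lambda above assumes that the as part is present; for a plain 'import os' there is no import_as result to index. As a possible variation, the parse action can be written as an ordinary function that handles both cases. The helper name to_dict and the choice of None for a missing alias are our own assumptions, not part of the original parser.

```python
from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional

module_name = Word(alphas + '_')
full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
import_as = Optional(Suppress('as') + module_name)('import_as')


def to_dict(tokens):
    """Hypothetical helper: build the result dictionary from named results."""
    # get() returns None when the 'as' part was not matched.
    alias = tokens.get('import_as')
    return {'import': tokens.modules.asList(),
            'as': alias[0] if alias else None}


parse_module = (Suppress('import') + full_module_name + import_as).setParseAction(to_dict)

print(parse_module.parseString('import matplotlib.pyplot as plt')[0])
print(parse_module.parseString('import os')[0])
```

A named function is a little longer than a lambda, but it leaves room for checks like this one and is easier to test on its own.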

Check the parser. Run parse_module.parseString(s).asList() and get:

 [{ 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' }] 

We have almost reached the goal. Since the resulting list has a single element, add [0] at the end of the line that parses the text:

 parse_module.parseString(s).asList()[0]

The result:

 { 'import': [ 'matplotlib', 'pyplot' ], 'as': 'plt' } 

We got what we wanted.

Having reached the goal, we need to go back to 'from pyparsing import *' and replace the asterisk with the names we actually used:

 from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional 

As a result, our code has the following form:

 from pyparsing import Word, alphas, ZeroOrMore, Suppress, Optional

 module_name = Word(alphas + "_")
 full_module_name = (module_name + ZeroOrMore(Suppress('.') + module_name))('modules')
 import_as = (Optional(Suppress('as') + module_name))('import_as')
 parse_module = (Suppress('import') + full_module_name + import_as).setParseAction(
     lambda t: {'import': t.modules.asList(), 'as': t.import_as.asList()[0]})

We looked at a very simple example and only a small part of Pyparsing's capabilities. Left out of this article are creating recursive expressions, processing tables, optimized text search that dramatically speeds up the search itself, and much more.

In conclusion, a few words about myself. I am a graduate student and assistant at Bauman MSTU (Department MT-1, "Metal-cutting machines"). I am interested in Python, Linux, HTML, CSS and JS. My hobby is automating engineering work and engineering calculations. I believe I can be useful to Habr by sharing my knowledge of Pyparsing, Sage and some aspects of automating engineering calculations. I also know the SageMathCloud environment, which is a powerful alternative to Wolfram Alpha. SageMathCloud is geared toward running calculations in Python in the cloud. It gives you access to a console (Ubuntu under the hood), Sage, IPython and LaTeX, and it supports collaborative work. In addition to Python code, SageMathCloud supports HTML, CSS, JS, CoffeeScript, Go, Fortran, Scilab, and more. Currently the environment is free (a fairly stable beta version); later it will switch to a freemium model. At the moment this environment is not covered on Habr, and I would like to fill this gap.

I thank Darya Frolov and Nikita Konovalov for help in editing the article.

Source: https://habr.com/ru/post/239081/

