
Deep into Pyparsing: Parsing units of measurement in Python

In the previous article we got acquainted with Pyparsing, a convenient parsing library, and wrote a parser for the expression 'import matplotlib.pyplot as plt'.

In this article we dive deeper into Pyparsing, using the task of parsing units of measurement as an example. Step by step, we will build a recursive parser that can match Russian characters, check that the name of a unit is valid, and group the units that the user has enclosed in brackets.

Note: The code for this article has been tested and posted on Sagemathcloud. If something does not work for you (most likely because of the text encoding), be sure to let me know in a private message, in the comments, by email or on VK.

Getting started. Input data and the task.


As an example, we will parse the expression:
 s = "Н*м^2/(кг*с^2)"

This unit of measurement was made up so that parsing it exercises every feature of our parser. We need to get:

 res = [('Н',1.0), ('м',2.0), ('кг',-1.0), ('с',-2.0)]

Replacing division with multiplication in the string s, expanding the brackets and writing out the exponents explicitly, we get: N*m^2/(kg*s^2) = N^1 * m^2 * kg^-1 * s^-2 (newton, metre, kilogram and second; the code uses the Russian symbols Н, м, кг and с).

Thus, each tuple in the variable res contains the name of a unit of measurement and the exponent to which it must be raised. You can mentally put multiplication signs between the tuples.
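To make the target format concrete, here is a minimal sketch that turns such a list of (name, exponent) tuples back into a product string. The helper name format_units is hypothetical and is not part of the article's parser; the units are written in the Latin notation of the formula above.

```python
def format_units(units):
    """Join (name, exponent) tuples into a product such as 'N^1 * m^2'."""
    parts = []
    for name, power in units:
        # Show whole-number exponents without a trailing '.0'.
        shown = int(power) if float(power).is_integer() else power
        parts.append("{}^{}".format(name, shown))
    return " * ".join(parts)


res_latin = [("N", 1.0), ("m", 2.0), ("kg", -1.0), ("s", -2.0)]
print(format_units(res_latin))  # N^1 * m^2 * kg^-1 * s^-2
```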

Before using pyparsing, you must import it:

 from pyparsing import * 

Once the parser is written, we will replace * with the classes we actually use.

A method for writing a parser in Pyparsing


When using pyparsing, it helps to follow this method of writing a parser:
  1. First, pick out of the text the keywords or important individual characters that serve as the "building blocks" of the final string.
  2. Write separate parsers for the building blocks.
  3. "Assemble" the parser for the final string from them.

In our case, the main building blocks are the names of the individual units of measurement and their exponents.

Writing a parser for a unit of measurement. Parsing Russian letters.


A unit of measurement is a word that begins with a letter and consists of letters and dots (for example, мм.рт.ст., millimetres of mercury). In pyparsing, we can write:

 ph_unit = Word(alphas, alphas+'.') 

Note that the Word class now takes two arguments. The first specifies which characters may begin the word, and the second specifies which characters may follow. A unit of measurement necessarily begins with a letter, so the first argument is alphas. Besides letters, a unit may contain dots (for example, мм.рт.ст.), so the second argument is alphas + '.'.
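The two-argument form of Word can be tried on its own. The sketch below uses Latin input only, since alphas does not cover Cyrillic; the variable starts_with_dot_ok exists only for this illustration.

```python
from pyparsing import ParseException, Word, alphas

# First character: a letter; remaining characters: letters or dots.
ph_unit = Word(alphas, alphas + '.')

print(ph_unit.parseString("mm.Hg").asList())  # ['mm.Hg']

# A string that starts with a dot violates the first-character rule.
try:
    ph_unit.parseString(".mm")
    starts_with_dot_ok = True
except ParseException:
    starts_with_dot_ok = False
print(starts_with_dot_ok)  # False
```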

Unfortunately, if we try this parser on real input, we will find that it only works for units written in Latin letters. This is because alphas contains only the letters of the English alphabet, not letters in general.

This problem is easy to solve. First, create a string listing all the letters of the Russian alphabet:

 rus_alphas = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'

And the parser code for a particular unit of measurement should be changed to:

 ph_unit = Word(alphas+rus_alphas, alphas+rus_alphas+'.') 

Now our parser understands units of measurement in Russian and English. For other languages, the parser code is written similarly.
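As an alternative to typing the alphabet by hand, the same set of letters can be generated from Unicode code points. This is only a sketch that assumes Python 3 (where chr() returns Unicode characters); the name rus_alphas_generated is hypothetical.

```python
# Lowercase а..я occupy U+0430..U+044F and uppercase А..Я occupy U+0410..U+042F.
# The letters ё (U+0451) and Ё (U+0401) lie outside those ranges, so they are
# appended explicitly.
rus_lower = ''.join(chr(c) for c in range(0x0430, 0x0450)) + '\u0451'
rus_upper = ''.join(chr(c) for c in range(0x0410, 0x0430)) + '\u0401'
rus_alphas_generated = rus_lower + rus_upper

print(len(rus_alphas_generated))  # 66
```

The resulting string can be passed to Word() in the same way as a hand-written one.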

Fixing the encoding of the parser's result.


When testing the parser for a unit of measurement, you may get a result in which the Russian characters are replaced by their byte codes. For example, in Sage:

 ph_unit.parseString("мм").asList() # Result: ['\xd0\xbc\xd0\xbc'] 

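If you get the same result, everything works correctly and only the displayed encoding needs fixing. This is Python 2 behaviour; in Python 3, where strings are Unicode, a plain print() already shows the letters. In my case (Sage), a home-made bprint ("better print") function helps: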

 def bprint(obj): print(obj.__repr__().decode('string_escape')) 

Using this function, we get the output in Sage in the correct encoding:

 bprint(ph_unit.parseString("мм").asList()) # Result: ['мм'] 

Writing a parser for an exponent. Parsing an arbitrary number.


Let us learn to parse the exponent. Usually the exponent is an integer, but in rare cases it may have a fractional part or be written in exponential notation. We will therefore write a parser for an ordinary number such as this one:

 test_num = "-123.456e-3" 

The building block of an arbitrary number is a natural number, which consists of digits:

 int_num = Word(nums) 

The number may be preceded by a plus or minus sign. The plus sign should not appear in the result, so we wrap it in Suppress():

 pm_sign = Optional(Suppress("+") | Literal("-")) 

The vertical bar means "or" (plus or minus). Literal() means an exact match with the given string. Thus pm_sign looks for an optional + symbol, which is kept out of the parsing result, or an optional - symbol, which is kept.
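A quick way to see the difference between the suppressed plus and the retained minus is to run pm_sign on its own. This is a small sketch using the same definition as above.

```python
from pyparsing import Literal, Optional, Suppress

pm_sign = Optional(Suppress("+") | Literal("-"))

print(pm_sign.parseString("+").asList())  # []    the plus is matched but hidden
print(pm_sign.parseString("-").asList())  # ['-'] the minus is kept
print(pm_sign.parseString("").asList())   # []    the sign is optional
```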

Now we can write a parser for the whole number. The number starts with an optional plus or minus sign, followed by digits, then an optional decimal point with more digits, then optionally the character e followed by another number: an optional sign and digits. The number after e has no fractional part. In pyparsing:

 float_num = pm_sign + int_num + Optional('.' + int_num) + Optional('e' + pm_sign + int_num) 

We now have a parser for the number. Let's see how the parser works:

 float_num.parseString('-123.456e-3').asList() # Result: ['-', '123', '.', '456', 'e', '-', '3'] 

As we can see, the number is divided into separate components. We do not need this, and we would like to “collect” the number back. This is done using Combine() :

 float_num = Combine(pm_sign + int_num + Optional('.' + int_num) + Optional('e' + pm_sign + int_num)) 

Check:

 float_num.parseString('-123.456e-3').asList() # Result: ['-123.456e-3'] 

Fine! But the output is still a string, and we need a number. Let us add a string-to-number conversion using setParseAction():

 float_num = Combine(pm_sign + int_num + Optional('.' + int_num) + Optional('e' + pm_sign + int_num)).setParseAction(lambda t: float(t.asList()[0])) 

We use an anonymous function (lambda) whose argument is t. First we get the result as a list: t.asList(). Since the resulting list has only one element, we can extract it right away: t.asList()[0]. The float() function converts the text to a floating-point number. If you work in Sage, you can replace float with RR, the constructor of Sage's real number class.
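The finished number parser can be checked on a few inputs. The sketch below repeats the definitions from this section; the only liberty taken is writing t[0] instead of t.asList()[0], which picks the same single token.

```python
from pyparsing import Combine, Literal, Optional, Suppress, Word, nums

int_num = Word(nums)
pm_sign = Optional(Suppress("+") | Literal("-"))
float_num = Combine(
    pm_sign + int_num + Optional('.' + int_num) + Optional('e' + pm_sign + int_num)
).setParseAction(lambda t: float(t[0]))  # convert the combined text to a float

for text in ["2", "+1.5", "-123.456e-3"]:
    print(text, "->", float_num.parseString(text)[0])
```

Note that this grammar requires digits before the decimal point and a lowercase e, so inputs such as ".5" or "1E5" are not fully covered by it.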

Parsing a unit with an exponent.


A single unit of measurement is the name of the unit, optionally followed by the power sign ^ and a number, the exponent to which the unit is raised. In pyparsing:

 single_unit = ph_unit + Optional('^' + float_num) 

Test:

 bprint(single_unit.parseString("м^2").asList()) # Result: ['м', '^', 2.0] 

Let us improve the output right away. We do not need to see ^ in the parsing result, and we want the result as a tuple (see the variable res at the beginning of this article). To suppress output we use Suppress(), and to convert the list into a tuple we use setParseAction():

 single_unit = (ph_unit + Optional(Suppress('^') + float_num)).setParseAction(lambda t: tuple(t.asList())) 

Check:
 bprint(single_unit.parseString("м^2").asList()) # Result: [('м', 2.0)] 


Parsing units enclosed in brackets. Implementing recursion.


Now we come to an interesting part: implementing recursion. When writing a unit of measurement, the user can enclose in brackets one or more units joined by multiplication and division signs. A bracketed expression may contain another, nested bracketed expression (for example, "(м^2/(с^2*кг))"). This ability to nest bracketed expressions inside one another is the source of recursion. Let's turn to Pyparsing.

First we write the expression, not paying attention that we have a recursion:

 unit_expr = Suppress('(') + single_unit + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr))) + Suppress(")") 

Optional contains the part of the string that may or may not be present. OneOrMore ("one or more") contains the part of the string that must appear at least once. Here OneOrMore wraps two terms: first the multiplication or division sign, then a unit of measurement or a nested expression.

As written, this definition cannot work: unit_expr appears on both sides of the assignment, and the name on the right is not yet defined when the line runs. This is the clear sign of recursion. The problem is solved very simply: change the assignment sign to <<, and on the line before it assign unit_expr an instance of the special class Forward():

 unit_expr = Forward()
 unit_expr << Suppress('(') + single_unit + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr))) + Suppress(")")

Thus, when writing a parser you do not need to foresee recursion in advance. First write the expression as if it had no recursion, and when you see that recursion appears, just replace the = sign with << and assign Forward() to the name on the line above.
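The Forward() pattern is easier to see on a smaller grammar. The sketch below is not part of the article's parser: it parses nested lists of integers, and the names number and nested are made up for this illustration. It uses <<=, which pyparsing also accepts; unlike <<, it cannot be affected by operator precedence when the right-hand side contains |.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, nums

nested = Forward()   # placeholder that may be referenced before it is defined
number = Word(nums)

# A list is "(" followed by numbers or further lists, then ")".
nested <<= Group(Suppress("(") + ZeroOrMore(number | nested) + Suppress(")"))

print(nested.parseString("(1 (2 3) 4)").asList())  # [['1', ['2', '3'], '4']]
```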

Check:

 bprint(unit_expr.parseString("(Н*м/с^2)").asList()) # Result: [('Н',), '*', ('м',), '/', ('с', 2.0)] 


Parsing a general expression for a unit of measurement.


One last step remains: a general expression for a unit of measurement. In pyparsing:

 parse_unit = (unit_expr | single_unit) + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr))) 

Note that the expression has the form (a | b) + (c | d). The brackets are required and play the same role as in mathematics. With them we say that the first term is unit_expr or single_unit, and the second term is the optional expression. Without the brackets, because + binds more tightly than |, parse_unit would mean unit_expr, or single_unit followed by the optional expression, which is not what we intended. The same reasoning applies to the expression inside Optional().
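The effect of the brackets can be demonstrated with three literals. This is a minimal sketch with made-up names; it relies only on Python's rule that + has higher precedence than |.

```python
from pyparsing import Literal

a, b, c = Literal("a"), Literal("b"), Literal("c")

with_parens = (a | b) + c      # (a or b), then c
without_parens = a | b + c     # a, or (b then c)

print(with_parens.parseString("ac").asList())     # ['a', 'c']
print(without_parens.parseString("ac").asList())  # ['a']
```

In the second case the parser is satisfied by the first alternative alone, so the trailing c is never consumed.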

Draft parser. Correction of the result encoding.


So, we wrote a draft parser:

 from pyparsing import *
 rus_alphas = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'
 ph_unit = Word(rus_alphas+alphas, rus_alphas+alphas+'.')
 int_num = Word(nums)
 pm_sign = Optional(Suppress("+") | Literal("-"))
 float_num = Combine(pm_sign + int_num + Optional('.' + int_num) + Optional('e' + pm_sign + int_num)).setParseAction(lambda t: float(t.asList()[0]))
 single_unit = (ph_unit + Optional(Suppress('^') + float_num)).setParseAction(lambda t: tuple(t.asList()))
 unit_expr = Forward()
 unit_expr << Suppress('(') + single_unit + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr))) + Suppress(")")
 parse_unit = (unit_expr | single_unit) + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr)))

Check:

 print(s) # s = "Н*м^2/(кг*с^2)", see the beginning of the article
 bprint(parse_unit.parseString(s).asList()) # Result: [('Н',), '*', ('м', 2.0), '/', ('кг',), '*', ('с', 2.0)]

Grouping units enclosed in brackets.


We are already close to the result we want. The first thing to implement is the grouping of the units that the user has enclosed in brackets. For this Pyparsing provides Group(), which we apply to unit_expr:

 unit_expr = Forward()
 unit_expr << Group(Suppress('(') + single_unit + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr))) + Suppress(")"))

Let's see what has changed:

 bprint(parse_unit.parseString(s).asList()) # Result: [('Н',), '*', ('м', 2.0), '/', [('кг',), '*', ('с', 2.0)]] 


Setting the exponent to 1 in tuples that have none.


In some tuples nothing follows the comma. Recall that a tuple corresponds to a unit of measurement and has the form (unit name, exponent). Recall also that we can give names to pieces of the parser's result (described in the previous article). Let us name the unit found 'unit_name' and its exponent 'unit_degree'. In setParseAction() we write a lambda that puts 1 wherever the user did not specify the exponent of the unit. In pyparsing:

 single_unit = (ph_unit('unit_name') + Optional(Suppress('^') + float_num('unit_degree'))).setParseAction(lambda t: (t.unit_name, float(1) if t.unit_degree == "" else t.unit_degree)) 

Now our entire parser produces the following result:

 bprint(parse_unit.parseString(s).asList()) # Result: [('Н', 1.0), '*', ('м', 2.0), '/', [('кг', 1.0), '*', ('с', 2.0)]] 

In the code above one could write simply 1.0 instead of float(1), but in Sage that would produce Sage's own real number type rather than a float.
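The comparison with "" works because pyparsing returns an empty string for a named result that was not matched. The sketch below shows this on a simplified Latin-only unit parser; the names unit, with_power and without_power are made up for the illustration.

```python
from pyparsing import Optional, Suppress, Word, alphas, nums

unit = Word(alphas)("unit_name") + Optional(Suppress("^") + Word(nums)("unit_degree"))

with_power = unit.parseString("m^2")
without_power = unit.parseString("m")

print(with_power.unit_name, with_power.unit_degree)  # m 2
print(repr(without_power.unit_degree))               # ''
```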

Removing the * and / signs from the parser's result and expanding the brackets.


All that is left is to remove the * and / signs and the nested square brackets from the parser's result. If a nested list (that is, a [) is preceded by a division sign, the signs of the exponents of the units inside the nested list must be reversed. To do this we write a separate function transform_unit(), which we use in setParseAction() for parse_unit:

 def transform_unit(unit_list, k=1):
     res = []
     for v in unit_list:
         if isinstance(v, tuple):
             res.append(tuple((v[0], v[1]*k)))
         elif v == "/":
             k = -k
         elif isinstance(v, list):
             res += transform_unit(v, k=k)
     return(res)
 parse_unit = ((unit_expr | single_unit) + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr)))).setParseAction(lambda t: transform_unit(t.asList()))

After this, our parser returns the unit of measure in the required format:

 bprint(parse_unit.parseString(s).asList()) # Result: [('Н', 1.0), ('м', 2.0), ('кг', -1.0), ('с', -2.0)] 

Note that the transform_unit() function removes nesting: during the conversion all brackets are expanded. If a bracket is preceded by a division sign, the signs of the exponents of the units inside the brackets are reversed.
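One detail of transform_unit() deserves attention: k is flipped on every "/" and is not reset by "*", so for an input such as a/b*c the unit c also receives a negative exponent. If the usual left-to-right reading is wanted instead (a/b*c = a * b^-1 * c), a variant along the following lines could be used. This is only a sketch under that assumption; the name transform_unit_ltr is hypothetical and this is not the article's function.

```python
def transform_unit_ltr(unit_list, k=1):
    """Flatten the parsed list; each operator affects only the next operand."""
    res = []
    sign = k  # sign applied to the next unit or bracketed group
    for v in unit_list:
        if isinstance(v, tuple):
            res.append((v[0], v[1] * sign))
        elif isinstance(v, list):
            # A bracketed group inherits the sign of the operator before it.
            res += transform_unit_ltr(v, k=sign)
        elif v == "*":
            sign = k
        elif v == "/":
            sign = -k
    return res


flat = transform_unit_ltr([("a", 1.0), "/", ("b", 1.0), "*", ("c", 1.0)])
print(flat)  # [('a', 1.0), ('b', -1.0), ('c', 1.0)]
```

For the example string of this article both versions give the same answer, because the only division sign stands directly before the bracketed group.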

Validating units of measurement during parsing.


The last thing promised was early validation of units of measurement. In other words, as soon as the parser finds a unit, it immediately checks it against our database.

We will use a Python dictionary as the database:

 unit_db = {'length': {'м': 1, 'дм': 1/10, 'см': 1/100, 'мм': 1/1000, 'км': 1000, 'мкм': 1/1000000},
            'force': {'Н': 1},
            'power': {'Вт': 1, 'кВт': 1000},
            'time': {'с': 1},
            'mass': {'кг': 1, 'г': 0.001}}

To check a unit of measurement quickly, it is convenient to build a Python set containing all the unit names:

 unit_set = set([t for vals in unit_db.values() for t in vals]) 
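The same flattening can be written as a set comprehension. The sketch below uses a tiny made-up database (unit_db_demo with Latin unit names) purely to show what ends up in the set and how membership is tested.

```python
# Hypothetical miniature database: category -> {unit name: multiplier}.
unit_db_demo = {
    'length': {'m': 1, 'km': 1000},
    'time': {'s': 1},
    'mass': {'kg': 1, 'g': 0.001},
}

# Collect the unit names from every category into one set.
unit_set_demo = {name for units in unit_db_demo.values() for name in units}

print(sorted(unit_set_demo))                           # ['g', 'kg', 'km', 'm', 's']
print('km' in unit_set_demo, 'inch' in unit_set_demo)  # True False
```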

Let's write the check_unit function, which will check the unit of measurement, and insert it into the setParseAction for ph_unit :

 def check_unit(unit_name):
     if unit_name not in unit_set:
         raise ValueError("Unit of measurement is incorrect or missing from the database: " + unit_name)
     return(unit_name)
 ph_unit = Word(rus_alphas+alphas, rus_alphas+alphas+'.').setParseAction(lambda t: check_unit(t.asList()[0]))

The parser's output does not change, but if it meets a unit that is missing from the database (or does not exist at all), the user gets an error message. Example:

 ph_unit.parseString("дюйм")
 # Result:
 Error in lines 1-1
 Traceback (most recent call last):
 …
   File "", line 1, in <lambda>
   File "", line 3, in check_unit
 ValueError: Unit of measurement is incorrect or missing from the database: дюйм

The last line is our error message to the user.
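In a program the error would normally be caught rather than shown as a traceback. The sketch below illustrates this with a simplified Latin-only parser; the names known_units, check and the wording of the message are made up for the example.

```python
from pyparsing import Word, alphas

known_units = {"m", "s", "kg"}  # hypothetical miniature database


def check(tokens):
    """Parse action: accept the unit only if it is in the database."""
    name = tokens[0]
    if name not in known_units:
        raise ValueError("unknown unit: " + name)
    return name


unit = Word(alphas).setParseAction(check)

print(unit.parseString("kg").asList())  # ['kg']

try:
    unit.parseString("inch")
    message = None
except ValueError as err:
    message = str(err)
print(message)  # unknown unit: inch
```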

Full parser code. Conclusion


In conclusion, here is the full code of the parser. Do not forget to replace * in the import line "from pyparsing import *" with the classes actually used.

 from pyparsing import nums, alphas, Word, Literal, Optional, Combine, Forward, Group, Suppress, OneOrMore
 def bprint(obj):
     print(obj.__repr__().decode('string_escape'))
 # Database of units of measurement
 unit_db = {'length': {'м': 1, 'дм': 1/10, 'см': 1/100, 'мм': 1/1000, 'км': 1000, 'мкм': 1/1000000},
            'force': {'Н': 1},
            'power': {'Вт': 1, 'кВт': 1000},
            'time': {'с': 1},
            'mass': {'кг': 1, 'г': 0.001}}
 unit_set = set([t for vals in unit_db.values() for t in vals])
 # Parser for the name of a unit, with Russian letters and a database check
 rus_alphas = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'
 def check_unit(unit_name):
     """Check that the unit of measurement is in the database."""
     if unit_name not in unit_set:
         raise ValueError("Unit of measurement is incorrect or missing from the database: " + unit_name)
     return(unit_name)
 ph_unit = Word(rus_alphas+alphas, rus_alphas+alphas+'.').setParseAction(lambda t: check_unit(t.asList()[0]))
 # Parser for a number (the exponent)
 int_num = Word(nums)
 pm_sign = Optional(Suppress("+") | Literal("-"))
 float_num = Combine(pm_sign + int_num + Optional('.' + int_num) + Optional('e' + pm_sign + int_num)).setParseAction(lambda t: float(t.asList()[0]))
 # Parser for a single unit with an optional exponent
 single_unit = (ph_unit('unit_name') + Optional(Suppress('^') + float_num('unit_degree'))).setParseAction(lambda t: (t.unit_name, float(1) if t.unit_degree == "" else t.unit_degree))
 # Parser for units enclosed in brackets (recursive)
 unit_expr = Forward()
 unit_expr << Group(Suppress('(') + single_unit + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr))) + Suppress(")"))
 # Parser for the general expression
 def transform_unit(unit_list, k=1):
     """Expand the brackets, flatten nested lists and drop the * and / signs."""
     res = []
     for v in unit_list:
         if isinstance(v, tuple):
             res.append(tuple((v[0], v[1]*k)))
         elif v == "/":
             k = -k
         elif isinstance(v, list):
             res += transform_unit(v, k=k)
     return(res)
 parse_unit = ((unit_expr | single_unit) + Optional(OneOrMore((Literal("*") | Literal("/")) + (single_unit | unit_expr)))).setParseAction(lambda t: transform_unit(t.asList()))
 # Check
 s = "Н*м^2/(кг*с^2)"
 bprint(parse_unit.parseString(s).asList())

Thank you for the patience with which you read my article. Let me remind you that the code presented in this article is posted on Sagemathcloud. If you are not registered on Habr, you can send me a question by email or write to me on VK. In the next article I want to introduce you to Sagemathcloud and show how much it can simplify your work in Python. After that I will return to parsing with Pyparsing at a qualitatively new level.

I thank Darya Frolov and Nikita Konovalov for help in checking the article before publishing it.

Source: https://habr.com/ru/post/241670/

