
Parsing and building syntax trees with PLY. The basics


What is PLY?


PLY stands for Python Lex-Yacc.
Essentially, it is a port of the classic lex and yacc utilities to Python, in a nice wrapper.
Working with PLY is very simple, and the barrier to entry is almost zero.
It is written in pure Python and implements an LALR(1) parser, but who cares?
I am a practitioner by nature (like most of you), so let's head straight into battle!

What do we do?


The PLY site already has an example of writing yet another calculator, so we will not repeat that. Instead, let's build something like a parser for a very, very narrow subset of PHP :)
Our task by the end of the article is to build a syntax tree for this example:

<?php
$val = 5;
$result = substr( "foobar", 2*(7-$val) );
echo "this is our result: $result";
The example is tiny and completely made up, but even building a tree for this code takes quite a bit, and along the way we will get to use a PLY mechanism called state.



Lex


Lex is the component that breaks text down into basic language elements, or rather groups characters into those elements. Something like that.

What do we see in this example, apart from the useless code? We see tokens (basic elements):
PHPSTART - '<?php'
PHPVAR - '$result'
PHPEQUAL - '='
PHPFUNC - 'substr'
PHPOPEN - '('
PHPSTRING - '"foobar"', '"this is our result: $result"'
PHPNUM - '2', '7'
PHPCLOSE - ')'
PHPECHO - 'echo'
PHPCOLON - ';'
PHPCOMA - ','
PLUSMINUS - '-'
DIVMUL - '*'
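Before reaching for PLY, it may help to see that tokenizing really is just regex matching. Below is a rough stdlib-only sketch of the idea (my own illustration, not how PLY works internally; the combined pattern and the `toy_tokenize` helper are hypothetical names):

```python
import re

# All the token regexes merged into one alternation with named groups.
# The group that matched tells us which token we found.
TOKEN_RE = re.compile(r"""
      (?P<PHPSTART><\?php)
    | (?P<PHPVAR>\$[a-z]\w*)
    | (?P<PHPSTRING>"(?:\\.|[^"])*")
    | (?P<PHPNUM>\d+)
    | (?P<PHPEQUAL>=)
    | (?P<PHPOPEN>\()
    | (?P<PHPCLOSE>\))
    | (?P<PHPCOLON>;)
    | (?P<PHPCOMA>,)
    | (?P<PLUSMINUS>[+-])
    | (?P<DIVMUL>[*/])
    | (?P<PHPFUNC>[a-z]\w*)
    | (?P<SKIP>\s+)
""", re.VERBOSE | re.IGNORECASE)

def toy_tokenize(code):
    """Yield (token_name, matched_text) pairs, skipping whitespace."""
    for m in TOKEN_RE.finditer(code):
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

toks = list(toy_tokenize('$val = 5;'))
# [('PHPVAR', '$val'), ('PHPEQUAL', '='), ('PHPNUM', '5'), ('PHPCOLON', ';')]
```

PLY does considerably more (table generation, states, error handling), but at its heart each token is a regular expression like the ones above.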

To split text into tokens, PLY provides ply.lex, and it works with regular expressions.
It is quite picky about what we write in the code: it requires a tuple named tokens to be present.
For each element of this tuple, we must define either a regex string or a function named t_ELEMENT.
Why so picky? Because the tables are not built on every run of the program: they are built once, on the first launch, and written to a lextab.py file describing the transition tables and regexes. On subsequent launches PLY checks for this file and, if it is present, uses the generated tables instead of building them again.
Back to the code.
If the PHP syntax were limited to the thirteen tokens listed above, we would write the following:

# coding=utf8
import ply.lex as lex
import re

# List of token names. Required.
tokens = (
    'PHPSTART', 'PHPVAR', 'PHPEQUAL', 'PHPFUNC', 'PHPSTRING',
    'PHPECHO', 'PHPCOLON', 'PHPCOMA', 'PHPOPEN', 'PHPCLOSE',
    'PHPNUM', 'PLUSMINUS', 'DIVMUL'
)

# Regular expression for identifiers
ident = r'[a-z]\w*'

# Simple tokens are defined as variables named t_<TOKEN> = regex
t_PHPSTART = r'\<\?php'
t_PHPVAR = r'\$' + ident
t_PHPEQUAL = r'\='
t_PHPFUNC = ident
t_PHPSTRING = r'"(\\.|[^"])*"'
t_PHPECHO = r'echo'
t_PHPCOLON = r';'
t_PHPCOMA = r','
t_PHPOPEN = r'\('
t_PHPCLOSE = r'\)'
t_PHPNUM = r'\d+'
t_PLUSMINUS = r'\+|\-'
t_DIVMUL = r'/|\*'

# Characters ignored between tokens, so that $var = $value
# parses the same as $var=$value
t_ignore = ' \r\n\t\f'

# Error handler: report the bad character and skip it
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex(reflags=re.UNICODE | re.DOTALL)

data = '''
<?php $result = substr("foobar", "bar"); echo $result;
'''

lexer.input(data)
while True:
    tok = lexer.token()  # read the next token
    if not tok: break    # no more tokens
    print tok


In the comments to the code, I have roughly described what is going on there. I will only note that instead of a plain regex defining a token, you can write a function whose docstring holds the same regex, and in addition do something of your own inside it. For example:

def t_PHPVAR(t):
    r'\$[a-zA-Z]\w*'
    print 'Found a variable: ' + t.value  # value holds the matched text
    return t


By the way, here is the output of the lexer run on the example from the beginning of the article:

LexToken(PHPSTART,'<?php',1,1)
LexToken(PHPVAR,'$val',1,7)
LexToken(PHPEQUAL,'=',1,12)
LexToken(PHPNUM,'5',1,14)
LexToken(PHPCOLON,';',1,15)
LexToken(PHPVAR,'$result',1,17)
LexToken(PHPEQUAL,'=',1,25)
LexToken(PHPFUNC,'substr',1,27)
LexToken(PHPOPEN,'(',1,33)
LexToken(PHPSTRING,'"foobar"',1,35)
LexToken(PHPCOMA,',',1,43)
LexToken(PHPNUM,'2',1,45)
LexToken(DIVMUL,'*',1,46)
LexToken(PHPOPEN,'(',1,47)
LexToken(PHPNUM,'7',1,48)
LexToken(PLUSMINUS,'-',1,49)
LexToken(PHPVAR,'$val',1,50)
LexToken(PHPCLOSE,')',1,54)
LexToken(PHPCLOSE,')',1,56)
LexToken(PHPCOLON,';',1,57)
LexToken(PHPFUNC,'echo',1,59)
LexToken(PHPSTRING,'"this is our result: $result"',1,64)
LexToken(PHPCOLON,';',1,107)


The two numbers at the end are the line number and the character position. As you can see, the line number never changes: you have to update it yourself, namely when the lexer passes a newline. To do this, remove the newline character from the ignored characters and add a t_newline function:

t_ignore = ' \r\t\f'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)


Now the line numbers work:

LexToken(PHPSTART,'<?php',2,1)
LexToken(PHPVAR,'$val',3,7)
LexToken(PHPEQUAL,'=',3,12)
LexToken(PHPNUM,'5',3,14)
LexToken(PHPCOLON,';',3,15)
LexToken(PHPVAR,'$result',4,17)
LexToken(PHPEQUAL,'=',4,25)
LexToken(PHPFUNC,'substr',4,27)
LexToken(PHPOPEN,'(',4,33)
LexToken(PHPSTRING,'"foobar"',4,35)
LexToken(PHPCOMA,',',4,43)
LexToken(PHPNUM,'2',4,45)
LexToken(DIVMUL,'*',4,46)
LexToken(PHPOPEN,'(',4,47)
LexToken(PHPNUM,'7',4,48)
LexToken(PLUSMINUS,'-',4,49)
LexToken(PHPVAR,'$val',4,50)
LexToken(PHPCLOSE,')',4,54)
LexToken(PHPCLOSE,')',4,56)
LexToken(PHPCOLON,';',4,57)
LexToken(PHPFUNC,'echo',5,59)
LexToken(PHPSTRING,'"this is our result: $result"',5,64)
LexToken(PHPCOLON,';',5,107)


It works, but not completely. Our string "this is our result: $result" is returned by the lexer as-is. How will we parse its contents? Create a separate lexer? Nope. PLY (like lex) supports a thing called state: a switch to a different set of tokens. That is what we will use.

State

So, in order to create a new state, we first need to declare it:

states = (
    ('string', 'exclusive'),
)


'string' is the name of the new state, and 'exclusive' is its type. There are two types of state: exclusive and inclusive. In an exclusive state, the common tokens (the ones we wrote before; by default they live in the INITIAL state) are not visible. In an inclusive state, the common tokens remain visible and we simply add extra ones to them.
In the case of our string, we do not need the other tokens, since everything inside it is different. But we will borrow a couple. We need the PHPVAR token, because a variable looks the same both inside and outside a string, and one more token that you will see shortly.
To make a token visible in every state, we give it the ANY_ prefix; to make it visible only in a particular state, the <state>_ prefix. Below are the modified tokens.

# Simple tokens are defined as variables named t_<TOKEN> = regex
t_PHPSTART = r'\<\?php'
t_ANY_PHPVAR = r'\$' + ident
t_PHPEQUAL = r'\='
t_PHPFUNC = ident
t_PHPECHO = r'echo'
t_PHPCOLON = r';'
t_PHPCOMA = r','
t_PHPOPEN = r'\('
t_PHPCLOSE = r'\)'
t_PHPNUM = r'\d+'
t_PLUSMINUS = r'\+|\-'
t_DIVMUL = r'/|\*'

# PHPSTRING now only matches the quote itself and toggles the state
def t_ANY_PHPSTRING(t):
    r'"'
    if t.lexer.current_state() == 'string':
        t.lexer.begin('INITIAL')  # leave the string state
    else:
        t.lexer.begin('string')   # enter the string state
    return t

# Everything inside a string except variables and the closing quote
t_string_STR = r'(\\.|[^$"])+'

# Nothing is ignored inside a string
t_string_ignore = ''

# Each exclusive state needs its own error handler
def t_string_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)
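As an aside, the state switch can be illustrated without PLY at all. The sketch below is my own toy code (`toy_string_lexer` and its regexes are hypothetical names, not PLY API): a quote toggles between two states, and each state recognizes its own tokens.

```python
import re

VAR_RE = re.compile(r'\$[a-zA-Z]\w*')    # a variable, valid in both states
STR_RE = re.compile(r'(\\.|[^$"])+')     # raw text, valid only inside a string

def toy_string_lexer(text):
    tokens, state, pos = [], 'INITIAL', 0
    while pos < len(text):
        if text[pos] == '"':
            tokens.append(('PHPSTRING', '"'))          # the quote itself
            # the quote toggles between the two states
            state = 'string' if state == 'INITIAL' else 'INITIAL'
            pos += 1
        elif state == 'string':
            # inside a string, only variables and raw text exist
            m = VAR_RE.match(text, pos) or STR_RE.match(text, pos)
            kind = 'PHPVAR' if m.re is VAR_RE else 'STR'
            tokens.append((kind, m.group()))
            pos = m.end()
        else:
            pos += 1                                   # skip everything else
    return tokens

toy_string_lexer('"hi $name!"')
# [('PHPSTRING', '"'), ('STR', 'hi '), ('PHPVAR', '$name'),
#  ('STR', '!'), ('PHPSTRING', '"')]
```

This is exactly the shape of what PLY does for us: the `t_string_*` rules are only consulted while the lexer is in the 'string' state.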


By the way, did you notice this line: LexToken(PHPFUNC,'echo',5,59)?
Before building its tables, PLY sorts string token rules in order of decreasing regex length, and during lexing the first matching rule wins. The ident regex is longer than r'echo', so it is tried first, and echo never comes out as PHPECHO. How do we get around this? Easy: just change the token's type inside a function.
Like this:

@TOKEN(ident)
def t_PHPFUNC(t):
    if t.value.lower() == 'echo':
        t.type = 'PHPECHO'
    return t


Now echo is returned as needed. By the way, about the TOKEN decorator: instead of writing the regex as the function's docstring, you can put it into a variable and apply it to the function as a decorator, which is what we did here.
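The same pitfall is easy to reproduce with plain `re` alternation. The analogy is loose (PLY orders string rules by regex length, while `re` simply tries alternatives left to right), but in both cases whichever pattern is tried first wins; `classify` below is my own hypothetical name for the reclassification trick:

```python
import re

# Identifier pattern tried first: 'echo' is swallowed as a generic identifier
ident_first = re.compile(r'(?P<FUNC>[a-z]\w*)|(?P<ECHO>echo)')

# Keyword pattern tried first: now 'echo' is recognized as the keyword
echo_first = re.compile(r'(?P<ECHO>echo)|(?P<FUNC>[a-z]\w*)')

ident_first.match('echo').lastgroup  # 'FUNC': the wrong, generic token
echo_first.match('echo').lastgroup   # 'ECHO': the keyword wins

# The workaround used in the article: match everything as one identifier
# token, then reclassify keywords afterwards
def classify(word):
    return 'PHPECHO' if word.lower() == 'echo' else 'PHPFUNC'

classify('echo')    # 'PHPECHO'
classify('substr')  # 'PHPFUNC'
```

Reclassifying inside the token function also scales better than ordering tricks once you have many keywords.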

There we go. Now everything seems to be in place. Well, not quite everything:
it would be nice to ignore comments.
Let's add one more small function to the lexer:

def t_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'
    pass


Full lexer source code (lexer.py)
# coding=utf8
import ply.lex as lex
from ply.lex import TOKEN
import re

states = (
    ('string', 'exclusive'),
)

# List of token names. Required.
tokens = (
    'PHPSTART', 'PHPVAR', 'PHPEQUAL', 'PHPFUNC', 'PHPSTRING',
    'PHPECHO', 'PHPCOLON', 'PHPCOMA', 'PHPOPEN', 'PHPCLOSE',
    'PHPNUM', 'PLUSMINUS', 'DIVMUL', 'STR'
)

# Regular expression for identifiers
ident = r'[a-z]\w*'

# Simple tokens are defined as variables named t_<TOKEN> = regex
t_PHPSTART = r'\<\?php'
t_ANY_PHPVAR = r'\$' + ident
t_PHPEQUAL = r'\='
t_PHPCOLON = r';'
t_PHPCOMA = r','
t_PHPOPEN = r'\('
t_PHPCLOSE = r'\)'
t_PHPNUM = r'\d+'
t_PLUSMINUS = r'\+|\-'
t_DIVMUL = r'/|\*'

@TOKEN(ident)
def t_PHPFUNC(t):
    if t.value.lower() == 'echo':
        t.type = 'PHPECHO'
    return t

# Ignore comments
def t_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'
    pass

# PHPSTRING only matches the quote itself and toggles the state
def t_ANY_PHPSTRING(t):
    r'"'
    if t.lexer.current_state() == 'string':
        t.lexer.begin('INITIAL')  # leave the string state
    else:
        t.lexer.begin('string')   # enter the string state
    return t

# Everything inside a string except variables and the closing quote
t_string_STR = r'(\\.|[^$"])+'

# Nothing is ignored inside a string
t_string_ignore = ''

# Each exclusive state needs its own error handler
def t_string_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

# Characters ignored between tokens, so that $var = $value
# parses the same as $var=$value
t_ignore = ' \r\t\f'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# Error handler: report the bad character and skip it
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex(reflags=re.UNICODE | re.DOTALL | re.IGNORECASE)

if __name__ == "__main__":
    data = '''
<?php
$val = 5;
$result = substr( "foobar", 2*(7-$val) );
echo "this is our result: $result";
'''
    lexer.input(data)
    while True:
        tok = lexer.token()  # read the next token
        if not tok: break    # no more tokens
        print tok



Now let's move on to the parser (parser.py).

Yacc


Yacc is the part (the parser) that we feed tokens into, together with rules describing how they combine (the grammar). From its work we can build an abstract tree, or execute the program immediately (as the calculator example on the site does).
In PLY, the grammar is described with the ply.yacc module.
Describing a grammar in PLY is very simple and even rather pleasant. Once again we have specially named functions, p_<name>, whose docstrings contain the grammar rules.
Let's try to write a very simple grammar for our abstract php example:

php -> [PHPSTART phpbody]?
phpbody -> [phpline phpcolons]*
phpcolons -> [PHPCOLON]+
phpline -> assign | func | [PHPECHO args]
assign -> PHPVAR PHPEQUAL expr
expr -> [fact | expr PLUSMINUS fact]
fact -> [term | fact DIVMUL term]
term -> [arg | PHPOPEN expr PHPCLOSE]
func -> PHPFUNC PHPOPEN args PHPCLOSE
args -> [expr [PHPCOMA expr]*]?
arg -> string | phpvar | PHPNUM | func
string -> PHPSTRING str PHPSTRING
str -> [STR | str phpvar]?
phpvar -> PHPVAR
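The layering of expr, fact, and term above is what encodes operator precedence: multiplication binds tighter because it sits one level deeper, and parentheses restart from the top. As a sanity check, here is a small hand-written recursive-descent evaluator of just that arithmetic fragment (my own sketch with hypothetical names; it replaces the grammar's left recursion with loops, which is how recursive descent handles it):

```python
import re

def eval_expr(text):
    """Evaluate an expression with +, -, *, / and parentheses."""
    toks = re.findall(r'\d+|[()+\-*/]', text)
    pos = [0]  # mutable cursor shared by the nested functions

    def peek():
        return toks[pos[0]] if pos[0] < len(toks) else None

    def take():
        tok = toks[pos[0]]
        pos[0] += 1
        return tok

    def expr():  # expr -> fact (PLUSMINUS fact)*
        val = fact()
        while peek() in ('+', '-'):
            val = val + fact() if take() == '+' else val - fact()
        return val

    def fact():  # fact -> term (DIVMUL term)*
        val = term()
        while peek() in ('*', '/'):
            val = val * term() if take() == '*' else val // term()
        return val

    def term():  # term -> NUM | '(' expr ')'
        if peek() == '(':
            take()          # consume '('
            val = expr()
            take()          # consume ')'
            return val
        return int(take())

    return expr()

eval_expr('2*(7-5)')  # 4
```

In the PLY version we never write this dispatch by hand: yacc builds its LALR(1) tables from the same layered rules, and our functions only construct tree nodes.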


Quite a lot of text has been written already, and one could keep talking for a long time. But to understand the basics of ply.yacc it is enough to look at one example, and then I will post the full parser source.

So, an excerpt from the parser:

def p_str(p):
    '''str :
           | STR
           | str phpvar'''
    if len(p) == 1:
        p[0] = Node('str', [''])
    elif len(p) == 2:
        p[0] = Node('str', [p[1]])
    else:
        p[0] = p[1].add_parts([p[2]])


In order. Parsing starts from the first function matching the p_<name> pattern; in my source that is p_php, so the parser starts from there (top to bottom). The parser matches the rules written in the docstrings, and when a rule matches, it calls the corresponding function and passes it the variable p, which holds the parse results. After processing, we must put our own result into p[0]. Elements from p[1] onward correspond to the right-hand side of the rule: tokens and sub-rules.

'''str :
       | STR
       | str phpvar'''


The rule above is actually a combined (merged) rule. As you can easily guess, '|' means OR: alternative variants of the rule.
A space is required between the colon and the left/right sides; that is how PLY likes it. If the right side may be empty, we simply write nothing after the colon. Tokens are written in capital letters, rule names in lowercase and without the p_ prefix. So in the example above, the rules p_str and p_phpvar are referenced.

For example, if the input is "" (an empty string), then p holds only one value, p[0]; it is undefined and we must fill it in.
If the input is the string "hello $vasya pupkin", this function runs three times. The first time for the STR value, which we return via p[0]; the second time for $vasya, which we append to the previous value (p[1] holds the node we returned via p[0] on the previous iteration); and the third time "pupkin" is appended to the result. I have probably described this messily; you really just need to touch it yourself. Open an interpreter and play :)
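To make the reduction sequence concrete, the first two reductions for a string like "hello $vasya" can be played out by hand with a minimal stand-in for the Node helper class shown later in the article (a standalone simulation; PLY is not driving anything here):

```python
class Node:
    """Minimal stand-in for the article's Node class."""
    def __init__(self, type, parts):
        self.type = type
        self.parts = parts

    def add_parts(self, parts):
        self.parts += parts
        return self

# Reduction 1: str -> STR. Conceptually p[1] is 'hello ',
# and we return a fresh node via p[0].
step1 = Node('str', ['hello '])

# Reduction 2: str -> str phpvar. Now p[1] is the node from reduction 1
# and p[2] is the variable node; we grow the existing node in place.
step2 = step1.add_parts([Node('var', ['$vasya'])])

assert step2 is step1  # the same node object, grown in place
```

Each reduction hands the accumulated node back through p[0], which is exactly why p[1] already holds the partial tree on the next call.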

By the way, sometimes the variant with separate variants (tautology, sorry) is more convenient:

def p_str_empty(p):
    '''str :'''
    p[0] = Node('str', [''])

def p_str_raw(p):
    '''str : STR'''
    p[0] = Node('str', [p[1]])

def p_str_var(p):
    '''str : str phpvar'''
    p[0] = p[1].add_parts([p[2]])


This is absolutely identical to the previous version of the code. The functions have different names (Python requires them to be distinct), but their docstrings all define the same rule name (str), so PLY groups them together as variants of one rule.

For convenience in building the tree, I used this simple and effective class:

class Node:
    def __init__(self, type, parts):
        self.type = type
        self.parts = parts

    def add_parts(self, parts):
        self.parts += parts
        return self

    def parts_str(self):
        st = []
        for part in self.parts:
            st.append(str(part))
        return "\n".join(st)

    def __repr__(self):
        return self.type + ":\n\t" + self.parts_str().replace("\n", "\n\t")


It stores everything we need and at the same time structures its output, which is convenient for debugging.
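For instance, here is how the class nests its output (the class is repeated so the snippet is self-contained, and the tree below is a hypothetical fragment built by hand rather than by the parser):

```python
class Node:
    def __init__(self, type, parts):
        self.type = type
        self.parts = parts

    def add_parts(self, parts):
        self.parts += parts
        return self

    def parts_str(self):
        st = []
        for part in self.parts:
            st.append(str(part))
        return "\n".join(st)

    def __repr__(self):
        # each nesting level shifts its children by one extra tab
        return self.type + ":\n\t" + self.parts_str().replace("\n", "\n\t")

tree = Node('assign', ['$val', Node('arg', ['5'])])
print(tree)
# assign:
#     $val
#     arg:
#         5
# (indented with tab characters)
```

Because str() of a child Node calls its __repr__, the indentation deepens automatically as the tree nests.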

Full parser code
# coding=utf8
from lexer import tokens
import ply.yacc as yacc

class Node:
    def __init__(self, type, parts):
        self.type = type
        self.parts = parts

    def add_parts(self, parts):
        self.parts += parts
        return self

    def parts_str(self):
        st = []
        for part in self.parts:
            st.append(str(part))
        return "\n".join(st)

    def __repr__(self):
        return self.type + ":\n\t" + self.parts_str().replace("\n", "\n\t")

def p_php(p):
    '''php :
           | PHPSTART phpbody'''
    if len(p) == 1:
        p[0] = None
    else:
        p[0] = p[2]

def p_phpbody(p):
    '''phpbody :
               | phpbody phpline phpcolons'''
    if len(p) > 1:
        if p[1] is None:
            p[1] = Node('body', [])
        p[0] = p[1].add_parts([p[2]])
    else:
        p[0] = Node('body', [])

def p_phpcolons(p):
    '''phpcolons : PHPCOLON
                 | phpcolons PHPCOLON'''

def p_phpline(p):
    '''phpline : assign
               | func
               | PHPECHO args'''
    if len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = Node('echo', [p[2]])

def p_assign(p):
    '''assign : PHPVAR PHPEQUAL expr'''
    p[0] = Node('assign', [p[1], p[3]])

def p_expr(p):
    '''expr : fact
            | expr PLUSMINUS fact'''
    if len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = Node(p[2], [p[1], p[3]])

def p_fact(p):
    '''fact : term
            | fact DIVMUL term'''
    if len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = Node(p[2], [p[1], p[3]])

def p_term(p):
    '''term : arg
            | PHPOPEN expr PHPCLOSE'''
    if len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = p[2]

def p_func(p):
    '''func : PHPFUNC PHPOPEN args PHPCLOSE'''
    p[0] = Node('func', [p[1], p[3]])

def p_args(p):
    '''args :
            | expr
            | args PHPCOMA expr'''
    if len(p) == 1:
        p[0] = Node('args', [])
    elif len(p) == 2:
        p[0] = Node('args', [p[1]])
    else:
        p[0] = p[1].add_parts([p[3]])

def p_arg(p):
    '''arg : string
           | phpvar
           | PHPNUM
           | func'''
    p[0] = Node('arg', [p[1]])

def p_phpvar(p):
    '''phpvar : PHPVAR'''
    p[0] = Node('var', [p[1]])

def p_string(p):
    '''string : PHPSTRING str PHPSTRING'''
    p[0] = p[2]

def p_str(p):
    '''str :
           | STR
           | str phpvar'''
    if len(p) == 1:
        p[0] = Node('str', [''])
    elif len(p) == 2:
        p[0] = Node('str', [p[1]])
    else:
        p[0] = p[1].add_parts([p[2]])

def p_error(p):
    print 'Unexpected token:', p

parser = yacc.yacc()

def build_tree(code):
    return parser.parse(code)



The main file, from which we call everything
# coding=utf8
from parser import build_tree

data = '''
<?php
$val = 5;
$result = substr( "foobar", 2*(7-$val) );
/* comment */
echo "this is our result: ", $result;
'''

result = build_tree(data)
print result



Syntax tree output
body:
	assign:
		$val
		arg:
			5
	assign:
		$result
		arg:
			func:
				substr
				args:
					arg:
						str:
							foobar
					*:
						arg:
							2
						-:
							arg:
								7
							arg:
								var:
									$val
	echo:
		args:
			arg:
				str:
					this is our result: 
			arg:
				var:
					$result



Conclusion



If you have any comments or suggestions, I will gladly amend or extend the post.

Article code here

Well, in case you are wondering why I suddenly decided to write about PLY: pyfox. Little by little I am implementing CSS processing in Python. More precisely, the CSS parser is written, but the DOM walk is not fully implemented yet (not all pseudo-selectors work). Right now the specific problem is nth-child, not at the CSS level but at the level of efficient matching: a DOM node has no property giving its index among its siblings, and I do not want to count the preceding siblings every time. Apparently I will have to customize HTMLParser. Would anyone like to join me? Everyone is welcome :)

Source: https://habr.com/ru/post/191252/

