📜 ⬆️ ⬇️

A simple interpreter from scratch in Python (part 3)

image


Explanation to the publication
User @duse previously posted translations of two previous articles, clearly intending to translate the entire series. Since the topic interests me very much, but there are no new translations, I turned to the original source. Since English is not very strong, so as not to lose the point, I began to translate into Russian. So this translation was born. I apologize to @duse if I had to suffer a little more. But for the community in any case there should be a benefit.


Thus, we wrote a lexer parser combinator library for our interpreter. In this part, we will create structural data of an abstract syntax tree (AST), and write a parser using our library of combinators that translate the list of tokens returned by the lexer into an abstract syntax tree (AST). After we parse the AST, it will be very easy to start the program.
')

Determine AST


Before we start writing our parser, we need to define the data structures that our parser will return. We define them using classes. Each IMP syntax element will have a corresponding class. Objects of this class will display the nodes in the AST.

There are only three structures in our IMP: arithmetic expressions (used to calculate numbers), logical expressions (used to calculate conditions for if and while statements), and state. Let's start with arithmetic expressions, as the remaining two depend on it.
An arithmetic expression can take one of three forms:


We can also group expressions together with brackets (like (x+2)*3 ). This is not just a different type of expression, but a different way of parsing an expression.
We define three classes for these forms, plus a base class for basic arithmetic expressions. While classes do nothing but store information. The __repr__ method allows to print AST during debugging. All AST classes will inherit Equality , so we will be able to verify the similarity of two AST objects, which is also useful in testing.

 from equality import * class Aexp(Equality): pass class IntAexp(Aexp): def __init__(self, i): self.i = i def __repr__(self): return 'IntAexp(%d)' % self.i class VarAexp(Aexp): def __init__(self, name): self.name = name def __repr__(self): return 'VarAexp(%s)' % self.name class BinopAexp(Aexp): def __init__(self, op, left, right): self.op = op self.left = left self.right = right def __repr__(self): return 'BinopAexp(%s, %s, %s)' % (self.op, self.left, self.right) 

Logical expressions are a bit more complicated. There are 4 types of logical expressions:


The left and right sides of relation expressions are arithmetic expressions. The left and right sides and , or , or not expressions are logical expressions. This type distinction helps us avoid meaningless expressions like x < 10 and 30 .

 class Bexp(Equality): pass class RelopBexp(Bexp): def __init__(self, op, left, right): ... class AndBexp(Bexp): def __init__(self, left, right): ... class OrBexp(Bexp): def __init__(self, left, right): ... class NotBexp(Bexp): def __init__(self, exp): ... 

Statements (statements) can contain arithmetic and logical expressions simultaneously. There are 4 types of expressions: assignment, join, conditions, and cycles.

 class Statement(Equality): pass class AssignStatement(Statement): def __init__(self, name, aexp): ... class CompoundStatement(Statement): def __init__(self, first, second): ... class IfStatement(Statement): def __init__(self, condition, true_stmt, false_stmt): ... class WhileStatement(Statement): def __init__(self, condition, body): ... 

Primitives


Now we have our AST classes and a suitable set of combinators - it's time to write our parser. When writing a parser, it is easier to start with the basic structures of the language, gradually moving to more complex things.

The first parser we’ll look at is the keyword parser. It is just a special version of the Reserved Combinator using the RESERVED tag, which is labeled with all keywords. Remember, Reserved corresponds to a single token, for which the value and tag are the same as for the transmitted ones.

 def keyword(kw): return Reserved(kw, RESERVED) 

keyword is an actual combinator, because it is a function that returns a parser. We will use it directly in other parsers.

The id parser is used for matching variable names. It uses the Tag combinator, which matches the token with the specified tag.

 id = Tag(ID) 

The num parser is used to match integers. It works almost like id , except that we use the Process combinator (more precisely, the operator that calls Process ) to convert the token to a specific integer.

 num = Tag(INT) ^ (lambda i: int(i)) 

Parsing arithmetic expressions


Parsing arithmetic expressions is not easy, but we need to parse arithmetic expressions in order to parse logical expressions or statements, so we must start with them.

First we define the aexp_value parser, which will convert the values ​​returned by num and id into actual expressions.

 def aexp_value(): return (num ^ (lambda i: IntAexp(i))) | \ (id ^ (lambda v: VarAexp(v))) 

We used the operator | which is short for the Alternate combinator. This allows the first expressions to be parsed. If they fail, an attempt will be made to parse the expression with the variable.

Note that we have defined aexp_value as a function with no arguments instead of a global value, just as we did with id and num . We will do the same with all remaining parsers. And we did it because we do not want the code of each parser to be calculated immediately. If we defined each parser as global, each parser would not be able to refer to the other parsers that follow in the same source file, because they would not have been declared yet. This would make it impossible to define recursive parsers, and the source code would be less readable.

Further, we would like to implement grouping with brackets in our arithmetic expressions. Although grouping expressions do not require their own AST class, they need another parser to process them.

 def process_group(parsed): ((_, p), _) = parsed return p def aexp_group(): return keyword('(') + Lazy(aexp) + keyword(')') ^ process_group 

The process_group function is used with the Process combinator ( ^ operator). It simply discards the tokens of the parentheses and returns the expression inside. In fact, aexp_group also a parser. Remember, operator + is short for Concat combinator. So she parses '(', following the arithmetic expression (exploded by aexp , which we will soon define), followed by ')'. You must avoid calling aexp , because, as aexp calls aexp_group , which will lead to infinite recursion. We use the Lazy combinator, which postpones the aexp call until the parser is applied to any input data.

Next, we combine aexp_value and aexp_group with aexp_term . The expression aexp_term is any simple independent expression in which we do not have to care about the precedence of operators with respect to other expressions.

 def aexp_term(): return aexp_value() | aexp_group() 

Now we are approaching the tricky part: operators and seniority. It will be easier to define another parser for aexp and throw it together with aexp_term . This will result in the expression:
 1 + 2 * 3 

To such a wrong analysis:
 (1 + 2) * 3 

The parser must be aware of the precedence of the operators, and it must group the operators with higher precedence with each other.

We will define several auxiliary functions in order to perform this work.

 def process_binop(op): return lambda l, r: BinopAexp(op, l, r) 

The process_binop function is what creates the BinopAexp object. This function accepts any arithmetic operator and returns a function that combines pairs of expressions using this operator ...

The process_binop function should be used with the Exp ( * operator) combinator. Combinator Exp parses a list of expressions with a separator between each pair of expressions. The left operator Exp is a parser that matches the individual elements of the list (in our case, arithmetic expressions). The right operator is a parser that maps separators (operators). No matter which separator is matched, the right parser returns a function that, given the correspondence of the operators, returns a union function. The union function accepts parsed expressions to the left and right of the separator, and returns a single, unified expression. Still not confused? We will quickly go through the use of Exp . The process_binop function is what the right parser will return.
Next, we will define our levels of seniority and a combinator to help us cope with them.
 def any_operator_in_list(ops): op_parsers = [keyword(op) for op in ops] parser = reduce(lambda l, r: l | r, op_parsers) return parser aexp_precedence_levels = [ ['*', '/'], ['+', '-'], ] 

The any_operator_in_list function takes a list of keyword strings and returns the corresponding parser. We define aexp_precedence_levels containing a list of operators for each level of precedence (starting with a higher one).

 def precedence(value_parser, precedence_levels, combine): def op_parser(precedence_level): return any_operator_in_list(precedence_level) ^ combine parser = value_parser * op_parser(precedence_levels[0]) for precedence_level in precedence_levels[1:]: parser = parser * op_parser(precedence_level) return parser 

precedence is the actual content of the operation. Its first argument, value_parser , is a parser that can read the simple parts of an expression: numbers, variables, and groups. This will be aexp_term . The precedence_levels list contains a list of statements, one list for each level. For this we use aexp_precedence_levels . combine will accept a function that, passed by the operator, will return a function for constructing one large expression from two small ones. This will be the process_binop

Inside precedence , we first define op_parser , which, for a given level of precedence, reads only operators with the same level and returns a function that combines two expressions. op_parser can be used as the right argument to Exp . We start by calling Exp with op_parser for the highest level of precedence, for these operations must be grouped first. Next we use the resulting parser as an element of the parser (the left argument Exp ) at the next level. After the end of the cycle, the resulting parser is able to parse the arithmetic expression correctly.

How does it work in practice? Let's sort it out.
 E<sub>0</sub> = value_parser E<sub>1</sub> = E<sub>0</sub> * op_parser(precedence_levels[0]) E<sub>2</sub> = E<sub>1</sub> * op_parser(precedence_levels[1]) 

E 0 is the same as value_parser . It can parse numbers, variables and groups, but not operators. E 1 parsits expressions containing everything that can coincide with E 0 , separated by operators from the first level of precedence. So E 1 can match a * b / c , but should cause an error as soon as it encounters the + operator. E 2 can match expressions separated by operators of the next level of precedence. Since we only have 2 levels of precedence, E 2 can match any arithmetic expressions that we support.

Let's follow the example. Take a complex arithmetic expression, and gradually replace each part with its comparison.

 4 * a + b / 2 - (6 + c) E<sub>0(4)</sub> * E<sub>0</sub>(a) + E<sub>0</sub>(b) / E<sub>0</sub>(2) - E<sub>0</sub>(6+c) E<sub>1</sub>(4*a) + E<sub>1</sub>(b/2) - E<sub>1</sub>(6+c) E<sub>2</sub>((4*a)+(b/2)-(6+c)) 

We use precedence directly to determine aexp :

 def aexp(): return precedence(aexp_term(), aexp_precedence_levels, process_binop) 

We can probably define seniority less abstractly, but the advantage of our approach is that we apply it in any situation that has the problem of operator precedence. We will reapply it for parsing logical expressions.

Parsing logical expressions


We can already move from arithmetic expressions to logical ones. Logical expressions are usually simpler than arithmetic, so we don’t need new tools to parse them. Let's start with the simplest logical expressions:
 def process_relop(parsed): ((left, op), right) = parsed return RelopBexp(op, left, right) def bexp_relop(): relops = ['<', '<=', '>', '>=', '=', '!='] return aexp() + any_operator_in_list(relops) + aexp() ^ process_relop 

We use the process_relop function with the Process combinator. It takes three connected values ​​and creates RelopBexp from them. In bexp_relop we parse two arithmetic expressions ( aexp ) separated by a relational operator. And we use our old man - the function any_operator_in_list , so we don’t need to write a case for each operator. There is also no need to use combinators like Exp or precedence , since relationship expressions cannot be connected to each other in IMP as they are done in other languages.

Next, we define the expression not . The expression is not a unary operator with high seniority. This makes it easier to parse than and and or expressions.
 def bexp_not(): return keyword('not') + Lazy(bexp_term) ^ (lambda parsed: NotBexp(parsed[1])) 

Here we have connected the not keyword with the name of the logical expression (which we will define later). Since bexp_not will be used to define bexp_term , we need to use the Lazy combinator to avoid endless recursion.

 def bexp_group(): return keyword('(') + Lazy(bexp) + keyword(')') ^ process_group def bexp_term(): return bexp_not() | \ bexp_relop() | \ bexp_group() 

We define bexp_group and bexp_term in the same way as for arithmetic equivalents. This is nothing new.
Next, we need to define expressions that include the and and or operators. These operators actually work like arithmetic operators; both are performed from left to right, and and has the highest level of seniority.
 bexp_precedence_levels = [ ['and'], ['or'], ] def process_logic(op): if op == 'and': return lambda l, r: AndBexp(l, r) elif op == 'or': return lambda l, r: OrBexp(l, r) else: raise RuntimeError('unknown logic operator: ' + op) def bexp(): return precedence(bexp_term(), bexp_precedence_levels, process_logic) 

As well as process_binop , process_logic intended for use with the combinator Exp . It takes an operator and returns a function that combines two subexpressions into one expression using this operator. We substitute this in accordance with the precedence as in aexp . Writing a common code here pays off, since we don’t have to repeat complex expression expression processing code.

Parsing allegations


With the end of aexp and bexp , we can begin to parse the IMP statements. Let's start with a modest assignment statement:
 def assign_stmt(): def process(parsed): ((name, _), exp) = parsed return AssignStatement(name, exp) return id + keyword(':=') + aexp() ^ process 


Nothing particularly interesting. Next, we look at the stmt_list . It will parse a series of statements separated by a semicolon.
 def stmt_list(): separator = keyword(';') ^ (lambda x: lambda l, r: CompoundStatement(l, r)) return Exp(stmt(), separator) 


Remember, we need to use the EXP combinator here instead of something simpler like stmt() + keyword(';') + stmt() to avoid left-side recursion.
Next we have an if :
 def if_stmt(): def process(parsed): (((((_, condition), _), true_stmt), false_parsed), _) = parsed if false_parsed: (_, false_stmt) = false_parsed else: false_stmt = None return IfStatement(condition, true_stmt, false_stmt) return keyword('if') + bexp() + \ keyword('then') + Lazy(stmt_list) + \ Opt(keyword('else') + Lazy(stmt_list)) + \ keyword('end') ^ process 

Here the only difficulty is that the else clause is optional. This makes executing a function a little more difficult.
Finally, we have a while :
 def while_stmt(): def process(parsed): ((((_, condition), _), body), _) = parsed return WhileStatement(condition, body) return keyword('while') + bexp() + \ keyword('do') + Lazy(stmt_list) + \ keyword('end') ^ process 

We wrap this with stmt :
 def stmt(): return assign_stmt() | \ if_stmt() | \ while_stmt() 

You may notice that in both if and while if , using stmt_list in the body is better than stmt . stmt_list is actually our top level definition. We cannot have stmt directly dependent on stmt_list , since this will lead to left-side recursion of the parser, and since we want to have optional statements in the if and while bodies, we directly use stmt_list .

Putting it all together


Now we have a parser for each part of the language. We just need to make some high-level definitions:

 def parser(): return Phrase(stmt_list()) 

parser will parse the entire program. A program is just a list of statements, but the Phrase combinator ensures that we use every token in the file before a premature termination due to garbage (invalid) tokens at the end.

 def imp_parse(tokens): ast = parser()(tokens, 0) return ast 

The imp_parse function will be called by the client to parse the program. It accepts the entire list of tokens, calls parser, starts with the first token, and returns the resulting Abstract Syntax Tree (AST).

Here is a simple control program to test our parsers (in addition to unit tests):
 import sys from imp_parser import * if __name__ == '__main__': if len(sys.argv) != 3: sys.stderr.write('usage: %s filename parsername' % sys.argv[0]) sys.exit(1) filename = sys.argv[1] file = open(filename) characters = file.read() file.close() tokens = imp_lex(characters) parser = globals()[sys.argv[2]]() result = parser(tokens, 0) print result 

This program reads a file (first argument) and parses it with some kind of parser from imp_parse.py (second argument). Example:

 $ echo '1 + 2 * 3' >foo.imp $ python imp_parser_driver.py foo.imp aexp Result(BinopAexp(+, IntAexp(1), BinopAexp(*, IntAexp(2), IntAexp(3))), 5) 

This should provide a good sandbox for experiments.

Conclusion


In this article, we wrote a combinator library from scratch and used it to write a parser for IMP . In the next and last article in this series, we will write a performer ( Comment : could not find the best translation of the word evaluator ) for our parsed Abstract Syntax Tree ( AST ).

The author of the original article - Jay Conrod

PS I see problems with the Sub tag. There is an assumption that it is too early for a newbie to use such a tag. What to replace it - I do not know. I leave to clarify the decision.

Source: https://habr.com/ru/post/208872/


All Articles