
Simple interpreter from scratch in Python, part 2



In the previous article, we looked at the IMP language itself and the basic structure of the interpreter, and we examined the lexer in detail. In this article we will write a small parser for our language: it will build an AST (abstract syntax tree) from the list of tokens generated by the lexer. The combinator library will be independent of IMP, so you can use it to write a parser for any language.


What are parser combinators?

There are many ways to write a parser. The simplest and fastest of them is to use parser combinators.
You can think of a parser as a function that accepts a stream of tokens. If it succeeds, the parser "eats up" some tokens from the stream and returns a piece of the final AST along with the remaining tokens. A combinator is a function that produces a parser as its result, usually after taking one or several parsers as input, hence the name "combinator". You can use combinators to create a complete parser for a language like IMP by building up a set of small parsers for each part of the language.

Our little combinator

Parser combinators are fairly generic and can be used with any language. As with the lexer, we will start by writing a language-agnostic combinator library and then use it to write our parser.

First, let's create a Result class. Each parser will return a Result instance on success, or None on failure. A Result holds a value (a fragment of the AST) and pos (the index of the next token in the stream).

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

    def __repr__(self):
        return 'Result(%s, %d)' % (self.value, self.pos)


Next we create the base class Parser. Earlier I said that parsers are functions that take a stream of tokens. In fact, we will define parsers as objects with a __call__ method. This means a parser object will behave as if it were a function, but we can also provide additional functionality by overloading operators.

class Parser:
    def __call__(self, tokens, pos):
        return None  # subclasses will override this

    def __add__(self, other):
        return Concat(self, other)

    def __mul__(self, other):
        return Exp(self, other)

    def __or__(self, other):
        return Alternate(self, other)

    def __xor__(self, function):
        return Process(self, function)


The method that actually performs the parsing is __call__. Its input is the complete list of tokens (produced by the lexer) and an index into that list pointing at the next token. The default implementation always returns None; subclasses will provide their own __call__.

The __add__, __mul__, __or__, and __xor__ methods define the +, *, | and ^ operators, respectively. Each operator is a shortcut for calling a particular combinator. We will look at each of them shortly.

Now let's look at the simplest combinator, Reserved. Reserved will be used to parse reserved words and operators; it accepts tokens with a specific value and tag. Remember that tokens are value-tag pairs: token[0] is the value and token[1] is the tag.

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value = value
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and \
           tokens[pos][0] == self.value and \
           tokens[pos][1] is self.tag:
            return Result(tokens[pos][0], pos + 1)
        else:
            return None


At this point you might object: "I thought combinators were supposed to be functions that return parsers; this subclass doesn't look like a function at all." It is one, if you think of the constructor as a function that returns an object (a callable object, in this case). Subclassing is a simple way to define new combinators, since all we need to provide is a constructor and a __call__ method, and we still keep the rest of the functionality (such as the operator overloads).

Moving on. The Tag combinator is very similar to Reserved. It matches any token with a specific tag; the value can be anything.

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] is self.tag:
            return Result(tokens[pos][0], pos + 1)
        else:
            return None
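Here is a small runnable sketch of Reserved and Tag in action. The (value, tag) token format and the RESERVED/INT tag names are assumptions based on the lexer from the previous article; the sketch compares tags with == rather than `is` for robustness with plain strings.

```python
RESERVED, INT = 'RESERVED', 'INT'  # assumed tag names

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value, self.tag = value, tag

    def __call__(self, tokens, pos):
        # Match only a token with this exact value and tag.
        if pos < len(tokens) and tokens[pos][0] == self.value \
                and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, tokens, pos):
        # Match any token with this tag, regardless of value.
        if pos < len(tokens) and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

tokens = [(':=', RESERVED), ('42', INT)]

assign_op = Reserved(':=', RESERVED)
number = Tag(INT)

r = assign_op(tokens, 0)
print(r.value, r.pos)               # := 1
print(number(tokens, r.pos).value)  # 42
print(number(tokens, 0))            # None: token 0 is RESERVED, not INT
```

Note how a successful Result carries both the matched value and the position of the next unconsumed token.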


The Tag and Reserved combinators are our primitives. All the other combinators will be built from them at the most fundamental level.

The Concat combinator takes two parsers as input (left and right). When applied, a Concat parser applies the left parser, then the right one. If both succeed, the result value is a pair of the left and right results. If either fails, None is returned.

class Concat(Parser):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            right_result = self.right(tokens, left_result.pos)
            if right_result:
                combined_value = (left_result.value, right_result.value)
                return Result(combined_value, right_result.pos)
        return None


Concat is useful for parsing a specific sequence of tokens. For example, for parsing 1 + 2 , you can write

 parser = Concat(Concat(Tag(INT), Reserved('+', RESERVED)), Tag(INT)) 


or more briefly using the shorthand operator

 parser = Tag(INT) + Reserved('+', RESERVED) + Tag(INT) 


The Alternate combinator is also very similar to the previous ones. It too takes left and right parsers. It first applies the left parser; if that succeeds, its result is returned. Otherwise it applies the right parser and returns its result.

class Alternate(Parser):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            return left_result
        else:
            right_result = self.right(tokens, pos)
            return right_result


Alternate is useful for choosing among several possible parsers. For example, if we want to parse a binary operator:

 parser = Reserved('+', RESERVED) | Reserved('-', RESERVED) | Reserved('*', RESERVED) | Reserved('/', RESERVED) 


The Opt combinator is useful for optional text, such as the else clause of an if statement. It takes one parser as input. If that parser succeeds, its result is returned normally. If it fails, a successful result is still returned, but with a value of None. No tokens are consumed on failure; the result's position is the same as the input position.

class Opt(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            return result
        else:
            return Result(None, pos)
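A minimal sketch of Opt's "never fail, consume nothing on failure" behaviour. The token format and the RESERVED tag name are assumptions based on the previous article's lexer; tags are compared with == here.

```python
RESERVED = 'RESERVED'  # assumed tag name

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value, self.tag = value, tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == self.value \
                and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Opt(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            return result
        # Failure still yields a Result; position is unchanged.
        return Result(None, pos)

maybe_else = Opt(Reserved('else', RESERVED))

hit = maybe_else([('else', RESERVED)], 0)
print(hit.value, hit.pos)    # else 1

miss = maybe_else([('end', RESERVED)], 0)
print(miss.value, miss.pos)  # None 0 — still a Result, no tokens consumed
```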


Rep applies its input parser repeatedly until it fails. This is useful for building lists. Note that Rep happily matches an empty list and consumes no tokens if its parser fails on the first try.

class Rep(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        results = []
        result = self.parser(tokens, pos)
        while result:
            results.append(result.value)
            pos = result.pos
            result = self.parser(tokens, pos)
        return Result(results, pos)
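A small sketch of Rep collecting zero or more matches into a list. The (value, tag) token format and the ID/RESERVED tag names are assumptions based on the previous article's lexer.

```python
ID, RESERVED = 'ID', 'RESERVED'  # assumed tag names

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Rep(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        # Apply the parser until it fails, collecting each value.
        results = []
        result = self.parser(tokens, pos)
        while result:
            results.append(result.value)
            pos = result.pos
            result = self.parser(tokens, pos)
        return Result(results, pos)

names = Rep(Tag(ID))

r = names([('a', ID), ('b', ID), (';', RESERVED)], 0)
print(r.value, r.pos)          # ['a', 'b'] 2

empty = names([(';', RESERVED)], 0)
print(empty.value, empty.pos)  # [] 0 — the empty list still succeeds
```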


The Process combinator is very useful for manipulating result values. Its input is a parser and a function. When the parser succeeds, its result value is passed to the function, and the function's return value is returned instead of the original. We will use Process to build the actual AST nodes out of the pairs and lists that Concat and Rep return.

class Process(Parser):
    def __init__(self, parser, function):
        self.parser = parser
        self.function = function

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            result.value = self.function(result.value)
            return result


As an example, consider the parser we built with Concat. When it parses 1 + 2 , the result is (('1', '+'), '2') , which is not very convenient. With Process we can change that result. For example, the following code returns the sum of the two numbers.

def process_func(parsed):
    ((l, _), r) = parsed
    return int(l) + int(r)

better_parser = parser ^ process_func
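Here is a runnable version of that example: build the Concat parser for the tokens of 1 + 2, then use Process (via ^) to turn the nested pairs into a sum. The INT/RESERVED tag names are assumptions based on the previous article's lexer, and only the combinators needed here are defined.

```python
INT, RESERVED = 'INT', 'RESERVED'  # assumed tag names

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None
    def __add__(self, other):
        return Concat(self, other)
    def __xor__(self, function):
        return Process(self, function)

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value, self.tag = value, tag
    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == self.value \
                and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag
    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Concat(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            right_result = self.right(tokens, left_result.pos)
            if right_result:
                return Result((left_result.value, right_result.value),
                              right_result.pos)
        return None

class Process(Parser):
    def __init__(self, parser, function):
        self.parser, self.function = parser, function
    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            result.value = self.function(result.value)
        return result

tokens = [('1', INT), ('+', RESERVED), ('2', INT)]
parser = Tag(INT) + Reserved('+', RESERVED) + Tag(INT)
print(parser(tokens, 0).value)  # (('1', '+'), '2')

def process_func(parsed):
    ((l, _), r) = parsed
    return int(l) + int(r)

better_parser = parser ^ process_func
print(better_parser(tokens, 0).value)  # 3
```

Note that `+` binds tighter than `^` in Python, so `a + b ^ f` means `(a + b) ^ f`, which is exactly what we want here.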


Lazy is a less obvious combinator. Instead of taking a parser as input, it takes a zero-argument function that returns a parser. Lazy will not call that function to get the parser until the parser is actually applied. This is needed to build recursive parsers: since such a parser refers to itself, we cannot simply define it by referencing itself directly; at the moment the parser expression is evaluated, the parser is not defined yet. We wouldn't need this in languages like Haskell or Scala, since they are lazily evaluated, but Python is not.

class Lazy(Parser):
    def __init__(self, parser_func):
        self.parser = None
        self.parser_func = parser_func

    def __call__(self, tokens, pos):
        if not self.parser:
            self.parser = self.parser_func()
        return self.parser(tokens, pos)
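To see why Lazy matters, here is a toy grammar rule that refers to itself: a parser that counts how deeply an 'x' is nested in parentheses. The grammar and token values are purely illustrative (not part of IMP), and tags are compared with == in this sketch.

```python
RESERVED = 'RESERVED'  # assumed tag name

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None
    def __add__(self, other):
        return Concat(self, other)
    def __or__(self, other):
        return Alternate(self, other)
    def __xor__(self, function):
        return Process(self, function)

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value, self.tag = value, tag
    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == self.value \
                and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Concat(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            right_result = self.right(tokens, left_result.pos)
            if right_result:
                return Result((left_result.value, right_result.value),
                              right_result.pos)
        return None

class Alternate(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, tokens, pos):
        # Try the left parser first; fall back to the right one.
        return self.left(tokens, pos) or self.right(tokens, pos)

class Process(Parser):
    def __init__(self, parser, function):
        self.parser, self.function = parser, function
    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            result.value = self.function(result.value)
        return result

class Lazy(Parser):
    def __init__(self, parser_func):
        self.parser = None
        self.parser_func = parser_func
    def __call__(self, tokens, pos):
        if not self.parser:
            self.parser = self.parser_func()
        return self.parser(tokens, pos)

def expr():
    # Lazy(expr) defers calling expr() until parse time; calling expr()
    # directly here would recurse forever while building the parser.
    nested = Reserved('(', RESERVED) + Lazy(expr) + Reserved(')', RESERVED)
    deeper = nested ^ (lambda parsed: parsed[0][1] + 1)
    atom = Reserved('x', RESERVED) ^ (lambda _: 0)
    return deeper | atom

tokens = [('(', RESERVED), ('(', RESERVED), ('x', RESERVED),
          (')', RESERVED), (')', RESERVED)]
print(expr()(tokens, 0).value)  # 2
```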


The next combinator, Phrase, takes a single parser as input, applies it, and returns its result. The only catch is that it fails unless the parser consumes all of the tokens. Phrase will be our top-level parser: it keeps us from matching programs that have garbage at the end.

class Phrase(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result and result.pos == len(tokens):
            return result
        else:
            return None
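A quick sketch of the "must consume everything" rule: the same parser succeeds or fails depending only on whether tokens are left over. The INT/RESERVED tag names are assumptions based on the previous article's lexer.

```python
INT, RESERVED = 'INT', 'RESERVED'  # assumed tag names

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Phrase(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        # Succeed only if every token was consumed.
        if result and result.pos == len(tokens):
            return result
        return None

whole_number = Phrase(Tag(INT))

print(whole_number([('1', INT)], 0).value)             # 1
print(whole_number([('1', INT), (';', RESERVED)], 0))  # None: trailing token
```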


The last combinator is, unfortunately, the trickiest one. The idea behind Exp is very simple: it matches an expression that consists of a list of elements separated by something. Here is an example with compound statements:

 a := 10; b := 20; c := 30 


In this case, we have a list of statements separated by semicolons. You might think we don't need Exp, since we can do the same thing with the other combinators. We could try to write a parser for such expressions like this:

def compound_stmt():
    return stmt() + Reserved(';', RESERVED) + stmt()


And define stmt like this:

def stmt():
    return Lazy(compound_stmt) | assign_stmt()


So stmt calls compound_stmt, which calls stmt again. These will call each other until we get a stack overflow. This problem is not unique to compound statements; it is called left recursion.

Fortunately, Exp provides a way to work around left recursion, simply by matching the list iteratively (much like Rep does). Exp takes two parsers as input. The first parser matches the actual elements of the list; the second matches the separators. On success, the separator parser must return a function that combines the elements parsed on its left and right into a single value. This value is accumulated across the whole list, left to right, and is ultimately returned.

Let's look at the code:

class Exp(Parser):
    def __init__(self, parser, separator):
        self.parser = parser
        self.separator = separator

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)

        def process_next(parsed):
            (sepfunc, right) = parsed
            return sepfunc(result.value, right)
        next_parser = self.separator + self.parser ^ process_next

        next_result = result
        while next_result:
            next_result = next_parser(tokens, result.pos)
            if next_result:
                result = next_result
        return result


result is used to accumulate everything parsed so far. process_next is used with the Process combinator: next_parser matches a separator followed by the next list element, and process_next builds a new result by applying the separator function to the current result and the newly parsed element. next_parser is applied in a loop until it fails to match any more elements.

Let's see how Exp can be used to solve the compound_stmt problem:

def assign_stmt():
    ...

def compound_sep():
    def process_sep(parsed):
        return lambda l, r: CompoundStmt(l, r)
    return Reserved(';', RESERVED) ^ process_sep

def compound_stmt():
    return Exp(assign_stmt(), compound_sep())


We can even write this:

def compound_stmt():
    return assign_stmt() * compound_sep()
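To tie everything together, here is a runnable end-to-end sketch of Exp parsing the token stream for `a := 10; b := 20; c := 30` into left-nested nodes. The AssignStmt and CompoundStmt node shapes, the ID/INT/RESERVED tag names, and the assign_stmt grammar are assumptions standing in for the AST classes of the next article.

```python
ID, INT, RESERVED = 'ID', 'INT', 'RESERVED'  # assumed tag names

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

class Parser:
    def __call__(self, tokens, pos):
        return None
    def __add__(self, other):
        return Concat(self, other)
    def __mul__(self, other):
        return Exp(self, other)
    def __xor__(self, function):
        return Process(self, function)

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value, self.tag = value, tag
    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][0] == self.value \
                and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag
    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] == self.tag:
            return Result(tokens[pos][0], pos + 1)
        return None

class Concat(Parser):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            right_result = self.right(tokens, left_result.pos)
            if right_result:
                return Result((left_result.value, right_result.value),
                              right_result.pos)
        return None

class Process(Parser):
    def __init__(self, parser, function):
        self.parser, self.function = parser, function
    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            result.value = self.function(result.value)
        return result

class Exp(Parser):
    def __init__(self, parser, separator):
        self.parser, self.separator = parser, separator
    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        def process_next(parsed):
            (sepfunc, right) = parsed
            return sepfunc(result.value, right)
        next_parser = self.separator + self.parser ^ process_next
        next_result = result
        while next_result:
            next_result = next_parser(tokens, result.pos)
            if next_result:
                result = next_result
        return result

class AssignStmt:  # placeholder AST node
    def __init__(self, name, value):
        self.name, self.value = name, value
    def __repr__(self):
        return '%s := %s' % (self.name, self.value)

class CompoundStmt:  # placeholder AST node
    def __init__(self, first, second):
        self.first, self.second = first, second
    def __repr__(self):
        return '(%r; %r)' % (self.first, self.second)

def assign_stmt():
    def process(parsed):
        ((name, _), value) = parsed
        return AssignStmt(name, value)
    return Tag(ID) + Reserved(':=', RESERVED) + Tag(INT) ^ process

def compound_sep():
    def process_sep(parsed):
        return lambda l, r: CompoundStmt(l, r)
    return Reserved(';', RESERVED) ^ process_sep

def compound_stmt():
    return assign_stmt() * compound_sep()

tokens = [('a', ID), (':=', RESERVED), ('10', INT), (';', RESERVED),
          ('b', ID), (':=', RESERVED), ('20', INT), (';', RESERVED),
          ('c', ID), (':=', RESERVED), ('30', INT)]

ast = compound_stmt()(tokens, 0)
print(ast.value)  # ((a := 10; b := 20); c := 30)
```

The nesting of the printed result shows that Exp folds the list from left to right: the first two statements are combined first, then the third is attached.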


We will dig into this further when we look at parsing arithmetic expressions in the next article.

Download full source code: imp-interpreter.tar.gz

Source: https://habr.com/ru/post/206454/

