📜 ⬆️ ⬇️

Classic parser combinators in Python

A parser is a part of a program that, from a linear sequence of simple data, builds more complex data structures with some grammar included.

Functional programming languages ​​allow you to describe higher-order functions that take as arguments and return other functions as results.

Parser combinators is a well-known technique for creating parsers that uses the capabilities of functional programming languages ​​to dynamically build more complex parsers from simple grammar rules.

In Python, classical parser combinators can be described as follows. The definition of parsers will be done through the description of classes. Each class will override the __call__ method, which will do all the work.
')
At the input, the parsers will take a linear sequence of some data (this may be a set of characters, tokens, etc.) and the initial position from which to start parsing.

As a result of the parsing, an object of type Res will be returned, which, if successful, will contain part of the parsed AST (abstract syntax tree) and the next position in the input sequence; otherwise, the position of the element that caused the error.

Res class definition:

class Res: def __init__(self, subtree, pos): self.subtree = subtre self.pos = pos def __str__(self): return '(' + ', '.join((str(self.subtree), str(self.pos))) + ')' def __bool__(self): return self.subtree != None 

Tree class definition:

 class Tree: def __init__(self, root = None, children = None): self.root = root if children == None: self.children = [] else: self.children = children def __str__(self): r = str(self.root) c = ', '.join(str(c) for c in self.children) if c: r = '[' + r + ', ' + c + ']' return r 

We describe the base class from which all parsers will be inherited:

 import abc class Parser(metaclass = abc.ABCMeta): @abc.abstractmethod def __call__(self): pass def __lshift__(self, other): #   << return Concat(self, other, 1) def __rshift__(self, other): #   >> return Concat(self, other, 0) def __or__(self, other): #   | return Alt(self, other) 

The Atom parser takes one element and maps it to an element in the input sequence. If successful, returns the sheet associated with this element.

 class Atom(Parser): def __init__(self, token): self.token = token def __call__(self, tokens, pos = 0): if pos != len(tokens) and self.token == tokens[pos]: return Res(Tree(tokens[pos]), pos + 1) return Res(None, pos) 

The Concat parser takes two parsers as input. First, the left parser is used, then the right one. If it works successfully, the result will contain a left or right associative tree. If one of them fails to parse its part of the sequence, then the whole combination returns a failure.

 class Concat(Parser): def __init__(self, left, right, F = 0): self.left = left self.right = right self.F = F #  F == 0  - , #  – - def __call__(self, tokens, pos = 0): left_res = self.left(tokens, pos) if left_res: right_res = self.right(tokens, left_res.pos) if right_res: if self.F == 0: right_res.subtree.children.insert(0, left_res.subtree) return right_res left_res.subtree.children.append(right_res.subtree) return Res(left_res.subtree, right_res.pos) return right_res return left_res 

We describe the alternative parser. Alt parser takes two parsers as input. It works successfully if the left or right parser has successfully worked.

 class Alt(Parser): def __init__(self, left, right): self.left = left self.right = right def __call__(self, tokens, pos = 0): left_res = self.left(tokens, pos) if left_res: return left_res right_res = self.right(tokens, pos) if right_res: return right_res if left_res.pos > right_res.pos: return left_res return right_res 

We describe the optional parser Opt. If its argument worked successfully, it returns the result, otherwise it returns success anyway, but with the specified default value.

 class Opt(Parser): def __init__(self, parser, default = None): self.parser = parser def __call__(self, tokens, pos = 0): res = self.parser(tokens, pos) if res: return res return Res(Tree(default), pos) 

Parser Repeat Repeat works until it “breaks”. If the argument did not work once, it is also considered a success.

 class Repeat(Parser): def __init__(self, root, parser, F = 0): self.root = root self.parser = parser self.F = F def __call__(self, tokens, pos = 0): tree = Tree(self.root) while True: res = self.parser(tokens, pos) if not res: break if self.F == 0: tree.children.append(res.subtree) else: tree.children.insert(0, res.subtree) pos = res.pos return Res(tree, pos) 

Parser Prog sequentially applies parsers devoted to it and returns the result of the specified (by default last).

 class Prog(Parser): def __init__(self, parser, *others, N = -1): self.parser = parser self.others = others self.N = N def __call__(self, tokens, pos = 0): i = 1 + len(self.others) + self.N if self.N < 0 else self.N if i < 0 or i > len(self.others): raise IndexError res = self.parser(tokens, pos) if not res: return res t = res error = 0 j = 1 for parser in self.others: res = parser(tokens, res.pos) if not res: error = 1 break if j == i: t = res j += 1 if error: return res return Res(t.subtree, res.pos) 

The Lazy parser is used to describe recursive parsers. It takes as its input a function with no arguments, which returns a parser. This is due to the fact that at the time of the description the parser is not yet defined and cannot refer to itself directly.

 class Lazy(Parser): def __init__(self, parser_func): self.parser_func = parser_func self.parser = None def __call__(self, tokens, pos = 0): if not self.parser: self.parser = self.parser_func() return self.parser(tokens, pos) 

The parser LExp in some cases allows you to bypass the left recursion (for example, when parsing left-associative operators). It accepts three parsers as input: for parsing the left element, the separator and the right element. In the absence of the next separator returns the result.

 class LExp(Parser): def __init__(self, first, sep, parser): self.first = first self.sep = sep self.parser = parser def __call__(self, tokens, pos = 0): left_res = self.first(tokens, pos) if not left_res: return left_res error = 0 while True: sep_res = self.sep(tokens, left_res.pos) if not sep_res: break right_res = self.parser(tokens, sep_res.pos) if not right_res: error = 1 break sep_res.subtree.children.append(right_res.subtree) sep_res.subtree.children.insert(0, left_res.subtree) left_res = Res(sep_res.subtree, right_res.pos) if error: return right_res return left_res 

The parser combinators described above provide convenient and flexible tools for creating parsers. Of course, they cannot compare with such well-known libraries as Boost.Spirit, but for a beginner, writing your own parser library will allow you to better understand the parsing process, which often causes confusion.

Source: https://habr.com/ru/post/317304/


All Articles