Parsers, word processing. Just about the complicated. CFG, BNF, LL (k), LR (k), PEG and other scary words

Probably every programmer had to deal with tasks like “read something in format A and perform some manipulations with it”. Whether it be json, logs of nginx, cfg, sql, yaml, csv or something else. Well, when you can use the library, however, for various reasons, it is not always possible. Then the question arises of creating your own parser for a given format. And this, as the English say, often turns out to be PITA (pain in ...). In this article I will try to alleviate this pain. Who cares, welcome.

Introduction

So the question is, what exactly is this article about? Here I will try to help the reader to get out of the situation with the least losses, when you need to parse some text format, and the library will not work. That is, to solve an absolutely specific problem by common means.

At once I will make a reservation, the topic in itself is not enough that it is VERY difficult, it is also impossible to be extensive and to cover everything in one article just will not work. Therefore, I will start from the general and will proceed to the particular, specifically in this article, giving a review of the tools (which, however, can be used to solve very specific parsing problems) rather than diving deep into the concepts. Perhaps, if readers are interested, the article will turn into a cycle in which I can reveal specific questions in more detail.

I decided to write it because the information available on the topic is scattered and often incomplete, there are very few sources in Russian, and those that exist imply a fairly decent acquaintance of the reader with very specific mathematics. Therefore, so that the reader far from the topic did not experience pain from the consciousness of his (imaginary) inferiority, I decided to try to describe this topic as simply as possible.
')
Be brave, the article is large, some places will not be very interesting, but without them it can be incomprehensible.

Basic concepts

Before talking on the topic is to determine the basic concepts, so that there is no ambiguity. This is a glossary of this article. It may coincide with generally accepted terminology, but generally speaking, it is not obliged, since it shows a picture that is being formed in the author’s head.
So:

input character stream (hereinafter input stream or stream ) - a stream of characters to parse, fed to the input of the parser
parser / parser ( parser, analyzer ) - a program that accepts input stream and converts it into AST and / or allows you to bind executable functions to grammar elements
AST (Abstract Syntax Tree) / ASD (Abstract Syntax Tree) ( output data structure ) - An object structure representing the hierarchy of non-terminal grammar entities of the parsed stream and their constituent terminals . For example, the algebraic flow (1 + 2) + 3 can be represented as EXPRESSION (EXPRESSION (NUMBER (1) OPERATOR (+) NUMBER (2)) OPERATOR (+) NUMBER (3)). As a rule, then this tree is somehow processed by the client of the parser to get results (for example, counting the response of this expression)
CFG (Context-free grammar) / CSG (Context-free grammar) is the most common grammar type used to describe the incoming character stream for a parser (not only for this, of course). Characterized by the fact that the use of its rules does not depend on the context (which does not exclude the fact that in some way it sets the context for itself, for example, the rule for calling a function will not matter if it is inside a stream fragment described by the comment rule). Consists of product rules defined for terminal and non-terminal characters.
Terminal symbols ( terminals ) - for a given parse language - a set of all (atomic) symbols that can occur in the incoming stream
Not terminal characters ( not terminals ) - for a given parsing language - a set of all characters that are not found in the input stream, but are participating in grammar rules.
parse language (in most cases, there will be a CLS ( context-free language )) - a set of all terminal and non-terminal symbols, as well as a YOG for a given input stream. For example, in natural languages, the terminal symbols are all letters, numbers and punctuation marks used by the language, not the terminals will be words and sentences (and other constructions like subject, predicate, verbs, adverbs, etc.), but grammar itself language.
BNF (Backus-Naur Form, Backus normal form) / BNF (Backus-Naur form) is a form in which some syntactic categories are sequentially defined through others. The representation form of the CGC, often used directly to specify the input to the parser. Characterized by the fact that the ONE nonterminal symbol is always definable. The classic form of the form is:
< > ::= <.1> | <.2> | . . . | <.n>
There are also a number of varieties, such as ABNF (AugmentedBNF), EBNF (ExtendedBNF), etc. In general, these forms somewhat extend the syntax of the usual BNF notation.
LL (k), LR (k), SLR, ... are types of parser algorithms. In this article we will not dwell on them if someone is interested, below I will give a few links to material from which you can learn about them. However, let us dwell on another aspect, on the grammars of parsers. The grammar of the LL / LR parser groups is BNF, this is true. But it is also true that not every BNF grammar is also LL (k) or LR (k). Anyway, what does the letter k in the LL / LR (k) record mean? It means that to parse the grammar you need to look forward a maximum of k terminal characters along the stream. That is, to parse (0) grammar you only need to know the current character. For (1), you need to know the current and 1 next character. For (2) - current and 2 following, etc. A little more about choosing / compiling BNF for a particular parser will be discussed below.
PEG ( Parsing expression grammar ) / PB grammar is a grammar designed to recognize strings in a language. An example of such a grammar for algebraic actions +, -, *, / for non-negative numbers is:
Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Sum ← Product (('+' / '-') Product)* Expr ← Sum

Finally, we are done with basic concepts. Let us turn to the choice of the method of analysis.

Parsing options

When we are faced with the task of creating a parser, the solution boils down, as a rule, to 4 main options:

Solve the problem in the forehead, that is, analyze the input stream character-by-character and using the rules of grammar, build an ASD or immediately perform the operations we need on the components we need. Of the benefits - this option is the most simple, if we talk about the algorithms and the presence of the mathematical base. Minuses - the probability of an accidental error is close to the maximum, since you have no formal criteria for whether you have taken into account all the rules of grammar when building a parser. Very time consuming. In general, it is not very easily modified and not very flexible, especially if you have not implemented the construction of the ASD. Even with long work parser you can not be sure that it works absolutely correctly. From plus or minus. In this embodiment, it all depends on the directness of your hands. We will not talk about this option in detail.
We use regular expressions! I will not joke right now on the topic of the number of problems and regular expressions, but in general, the method, although affordable, is not too good. In the case of a complex grammar, working with regulars will turn into hell of a hell, especially if you try to optimize the rules to increase work speed. In general, if you choose this method, I can only wish you good luck. Regular expressions are not for parsing! And let me not assure the opposite. They are designed to search and replace. Attempting to use them for other things inevitably leads to losses. With them, we either significantly slow down the analysis, passing along the line many times, or lose brain cells, trying to figure out a way to remove the tonsils through the anus. Perhaps the situation will be slightly improved by an attempt to cross this method with the previous one. Probably no. In general, the benefits are almost the same as in the past. Only still need knowledge of regular expressions, and it is desirable not only to know how to use them, but also to have an idea how fast the option you are using works. Of the minuses, too, about the same as in the previous version, except that it is less time consuming.
Use a bunch of parsing tools for bnf! This option is more interesting. Firstly, we are offered a version of the type lex-yacc or flex-bison, secondly, in many languages we can find native libraries for parsing BNF. The keywords for the search can be taken LL, LR, BNF. The point is that all of them in some form accept the BNF variation as input, and LL, LR, SLR, etc. are specific algorithms that the parser uses. Most often, the end user is not particularly interested in which algorithm is used, although they have certain limitations on grammar parsing (we’ll dwell on this below) and may have different work times (although most say O (L), where L is the length of the stream of characters). One of the advantages is a stable toolkit, an intelligible form of recording (BNF), adequate estimates of working time and availability of a BNF record for most modern languages (if desired, can be found for sql, python, json, cfg, yaml, html, csv, and many others). Of the minuses - not always the obvious and convenient interface of tools, you may have to write something on an unfamiliar PL, especially the understanding of grammar with different tools.
Use the PEG parsing tools! This is also an interesting option, plus, here is a bit richer with libraries, although they, as a rule, are of several different eras (PEG was proposed by Brian Ford in 2004, while the roots of BNF stretch in the 1980s), that is, noticeably younger and worse ironed and live mostly on github. Of the benefits - quickly, simply, often - natively. Of the minuses - heavily dependent on the implementation. The pessimistic estimate for the PEG according to the specification seems to be O (exp (L)) (another thing, to create such a grammar will have to try hard). Strongly dependent on the presence / absence of the library. For some reason, many creators of PEG libraries consider tokenization and search / replace operations to be sufficient, and no AST to you, or even binding functions to grammar elements. But in general, the topic is promising.

We solve the problem of parsing

Let's go in order, skip the brute-force and regexp options.

Bnf

So the time has come to dwell a bit more on the parser's algorithms, or rather, on the grammars used by them. So, GLR (up to O (L ^ 3) or up O (L ^ 4), as some sources (antlr) say), LR, LALR and LL are most often found, all within O (L). GLR has the greatest “gluttony” - it is able to better handle grammar ambiguities, but due to this it is slower. That is, if we consider the “size” of the class of grammars processed by the parser (let's call it the power of the parser), then all other things being equal, the power will be distributed as follows: GLR> LR> LALR> SLR> LL. Resource consumption, respectively, close to the opposite. But back to the grammar.

Compiling or searching the grammar for the LR parser as a whole is quite simple and there is a high chance that the BNF compiled by you “on the knee” will be easily accepted by the parser and processed. The main thing is that the grammar should be complete, that is, to describe all possible situations of the input stream, besides try to understand by yourself whether knowing the following k characters (depending on the selected LR parser) can unambiguously determine which rule should apply.

For the LR-parser there can be conflicts of the following type:

Shift-reduce: NT ::= x NT | x NT ::= x NT | x . Where is the length x> k. It is solved this way: NT ::= xK; K ::= K | e NT ::= xK; K ::= K | e
Convolution (reduce-reduce):
NT :: = e A ::= NT B ::= NT D ::= B uv | A uw NT :: = e A ::= NT B ::= NT D ::= B uv | A uw , where the length u> k
It is solved like this:
R ::= Au J ::= Bu D ::= Rw | Jv

Conflicts of a type are characteristic for an LL parser (it is necessary, but not enough, to reformulate them, upon request I can dwell on LL (k) grammars in more detail in the following article):

Left recursion: NT ::= NT x | y NT ::= NT x | y , where x, y are arbitrary strings of terminals / not terminals, but y does not start with NT

Example: E ::= E + T | T E ::= E + T | T It can be reformulated as:

E ::= TZ
Z ::= '+' TZ | x

Left factorization: NT ::= ax | ay NT ::= ax | ay , where a is a string of length> k of our parser. Solving is even simpler: NT ::= aC; C = x|y NT ::= aC; C = x|y

So what are we going to decide?

Well, let's say it will be a simple calculator:

 OPERATOR ::= "+ "| "-" | "*" | "/" STMT ::= "(" STMT ")" | STMT OPERATOR STMT | FLOAT | INT FLOAT ::= INT "." INT INT ::= (POSITIVE_DIGIT INT | DIGIT ) | DIGIT POSITIVE_DIGIT ::= "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" DIGIT ::= POSITIVE_DIGIT | "0"

If the reader tries to find the grammar of the calculator on the Internet, he will see that often the operations of addition / subtraction and multiplication / division are processed by different grammatical structures. This was done specifically to take into account in the grammar such a moment as the priority of operations (and also to reveal the ambiguities of the grammar). We will do this further in the course of the article.

We try to find a native Python-solution, we find a lot of them .

Use parglare . This is the Python library / cli-tool, which implements an LR / GLR parser with a fairly wide range of capabilities (inline functions, prioritization, trees, a distinct grammar analysis, and a QA visualizer resulting from grammar processing).
```
 pip install parglare 
```
We reformulate our calculator as parglare asks.
```
 OPERATOR : "+ "| "-" | "*" | "/" | = STMT : "(" STMT ")" | STMT OPERATOR STMT | FLOAT | INT FLOAT : INT "." INT INT : (POSITIVE_DIGIT INT | DIGIT ) | DIGIT POSITIVE_DIGIT : "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" DIGIT : POSITIVE_DIGIT | "0" 
```
Is that enough? Save to calc.pg and use the cli-tool
```
 pglr --debug check calc.pg Error in file "/home/survivor/tests/calc.pg" at position 1,42 => "" | "/" | *=\nSTMT ". Expected: Name or StrTerm or RegExTerm 
```
Oops! it seems something extra. Bingo! I put in for some reason | = after “/” (no, I know where he is from there (but this does not apply to the topic of the article)). The main thing is that the program clearly indicated this to us. Further, after the correction, the program will complain about the absence; at the end of the nonterminal designation and on the bracket in the INT rule. The reformulated version will look like this:
```
 STMT : "(" STMT ")" | STMT OPERATOR STMT | FLOAT | INT; OPERATOR : "+ "| "-" | "*" | "/"; FLOAT : INT "." INT; INT : POSITIVE_DIGIT INT | POSITIVE_DIGIT DIGIT | DIGIT; POSITIVE_DIGIT : "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"; DIGIT : POSITIVE_DIGIT | "0"; 
```
As a result, pglr likes everything and he will tell us:
```
 Grammar ok! 
```
BUT:
```
 There are 4 Shift/Reduce conflicts. Either use 'prefer_shifts' parser mode, try to resolve manually or use GLR parsing. There are 7 Reduce/Reduce conflicts. Try to resolve manually or use GLR parsing. 
```
Well, as described above for debug, you can read their beautiful (and understandable) description. Well, let's think about it. First, let's not be the smartest and throw out positive_digit. If someone writes 0034 - firstly, he is angry with Pinocchio on his own, and secondly, most PLs, including Python, will not tell us anything when converting this to a number, and will simply give out 34. Well, that will help a lot. Secondly, see here the division into int / float, for simplicity we assume that all numbers are floating point. Also, pglr understands regular expressions in BNF, use this. We get:
```
  STMT : "(" STMT ")" | STMT OPERATOR STMT | FLOAT; OPERATOR : "+ "| "-" | "*" | "/"; FLOAT : /\d+(\.\d+)?/; 
```
and still
```
 There are 4 Shift/Reduce conflicts. Either use 'prefer_shifts' parser mode, try to resolve manually or use GLR parsing. 
```
Well, be that as it may, grammar
```
 *** GRAMMAR *** Terminals: + EOF ) ( FLOAT STOP * / - \d+(\.\d+)? EMPTY NonTerminals: OPERATOR S' STMT Productions: 0: S' = STMT STOP 1: STMT = STMT OPERATOR STMT 2: STMT = ( STMT ) 3: STMT = FLOAT 4: OPERATOR = + 5: OPERATOR = - 6: OPERATOR = * 7: OPERATOR = / 
```
Let's try to parse something.
```
 from parglare import Grammar from parglare import Parser grammar = Grammar.from_file('calc.pg') parser = Parser(grammar, build_tree=True, prefer_shifts=True) result = parser.parse('1 + 2 / (3 - 1 + 5)') 
```
We get:
```
 result <NonTerm(start=0, end=19, sym=STMT)> 
```
we can get result.children and further along the tree. We can calculate the final expression at the same time, but we will not do it here. It is important that we have access to the object tree, for the terminal symbols they are also their “value” and we can do whatever we want with this.

If we want to fix conflict situations, how else can we resolve grammar conflicts?

There are several options:
- Solve the problem by reformulating the grammar.
  For example:
```
 STMT : TERM | STMT ADDOP TERM ; TERM : FACTOR | FACTOR MULOP FACTOR ; FACTOR : "(" STMT ")" | NUMBER; ADDOP : "+" | "-"; MULOP : "*"|"/"; NUMBER: /\d+(\.\d+)?/; 
```
  As you can see, the task has become more complicated, but not too much. Moreover, if we do the analysis of just such a tree, it will be easier for us.
- Use the means of parglare itself
  
  In this case, the solution is more specific, but in some cases more convenient. Parglare provides a good toolkit for prioritizing rules, for example, for arithmetic expressions you can set the operation priority and associativity (from which side this operation is performed - from left to right or from right to left) the higher the priority, the operation will be performed earlier (relative to the others), for example, our grammar in this notation might look like this:
```
 STMT : STMT ADDOP STMT {1, left} | STMT MULOP STMT {2, left} | "(" STMT ")" | NUMBER; ADDOP : "+" | "-"; MULOP : "*"|"/"; NUMBER: /\d+(\.\d+)?/; 
```
  Parsing, but not working correctly. For example, for
  
  1 + 2 / (3 - 1 + 5)
  
  we get (with non-tree parsing)
```
 ['1', u'+', ['2', u'/', [u'(', ['3', u'-', ['1', u'+', '5']], u')']]] 
```
  which obviously does not match the expected result:
```
 ['1', u'+', ['2', u'/', [u'(', [['3', u'-', '1'], u'+', '5'], u')']]] 
```
Morals - it is better to use the standard moments described in BNF. Brilliant and cool stuff shine and cool, but not always work. Or I do not know how to cook them. And most parsing libraries have alpha | beta.

According to the author about this bug, it is due to the fact that, in essence, the associativity (left) of the parser at the algorithm level means to prefer to reduce rather than shift. That is, in fact, if there is an opportunity to “chop off” the rule, or continue within its framework - it is better to chop off. Since the parsing goes from left to right, this means left associativity. The reason for the error is that the priority for the rules inside ADDOP / MULOP is not defined, which breaks the entire rule.

However, even if we prioritize (for example:

ADDOP: “+” {1} | “-” {1} ADDOP: “+” {1} | “-” {1} ), will not work anyway, already due to another error.

Finally, an example with inline functions that are bound directly to the rules from the parglare documentation.
```
 from parglare import Parser, Grammar grammar = r""" E: E '+' E {left, 1} | E '-' E {left, 1} | E '*' E {left, 2} | E '/' E {left, 2} | E '^' E {right, 3} | '(' E ')' | number; number: /\d+(\.\d+)?/; """ # Define inline functions bound to grammar rule actions = { "E": [lambda _, nodes: nodes[0] + nodes[2], # for rule[0] “+” lambda _, nodes: nodes[0] - nodes[2], # for rule[1] “-” lambda _, nodes: nodes[0] * nodes[2], # for rule[2] “*” lambda _, nodes: nodes[0] / nodes[2], # for rule[3] “/” lambda _, nodes: nodes[0] ** nodes[2], # for rule[4] “^” lambda _, nodes: nodes[1], # for rule[5] “()” - just get the middle lambda _, nodes: nodes[0]], # for rule[6] just get node "number": lambda _, value: float(value), # for rule - convert to float } g = Grammar.from_string(grammar) parser = Parser(g, debug=True, actions=actions) result = parser.parse("34 + 4.6 / 2 * 4^2^2 + 78") print("Result = ", result) # Output # -- Debuging/tracing output with detailed info about grammar, productions, # -- terminals and nonterminals, DFA states, parsing progress, # -- and at the end of the output: # Result = 700.8 
```
Next, try the standalone ANTLR tool.

Installation is pretty simple, for linux it is
```
 $ cd /usr/local/lib $ curl -O http://www.antlr.org/download/antlr-4.7.1-complete.jar $ export CLASSPATH=".:/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" $ alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool' $ alias grun='java org.antlr.v4.gui.TestRig' 
```
In order to work on python2, you still need to install
```
 pip install antlr4-python2-runtime 
```
Then try to make a grammar for it. It has a slightly different format, so we replace the double quotes with single quotes and slightly rewrite the regular expression, as well as add the notation for WS, which was set by default in the previous tool. In our case, the left recursion can be eliminated simply by rewriting it on the right.

An important point! ANTLR has parser rules and grammar rules. The rules of parsing lead to the appearance of a node in the resulting AST. In addition, there is some difference that can / cannot be in the grammatical / parsing rules. In particular, there are no regular expressions in the rules of parsing. In addition, the parser should receive the input rule, the starting net terminal (in general, everything is somewhat more complicated, but in our case this is also enough). In general, it is worth noting that ANTLR has a rather powerful syntax, also supports inline functions (albeit in a slightly different format) and non-tree non-terminals and something else. As a result, our grammar looks like this:
```
 grammar calc; stmt : term | term addop stmt ; term : factor | factor mulop factor ; factor : '(' stmt ')' | NUMBER; addop : '+' | '-'; mulop : '*'|'/'; NUMBER: [0-9]+|[0-9]+.[0-9]+; WS : [ \t\r\n]+ -> skip ; 
```
The file is called calc.g4 (This is important! The name of the file must match the name of the grammar inside).

Let's create a java-parser.
```
 antlr4 calc.g4 javac calc*.java grun calc stmt -gui 
```
Then we give him some input (for example, 1 + 2 / (3 - 1 + 5) ) and press the end of the line ( ctrl-d on * nix, ctrl-z on windows) and look at the result. We were shown a beautiful parse tree, also interactive. You can see, twist the nodes, think, export as a picture.

Create a python2 parser:
```
 antlr4 -Dlanguage=Python2 calc.g4 
```
Further, we can hang listeners on the grammar and enjoy.
```
 import sys from antlr4 import * from calc_bakLexer import calc_bakLexer from calc_bakListener import calc_bakListener from calc_bakParser import calc_bakParser class StmtPrinter(calc_bakListener): def __init__(self): self.map_results = {} self.result = 0 def exitStmt(self, ctx): print("Oh, a stmt! {}".format(ctx.getText())) def main(argv): input = FileStream(argv[1]) print(input) lexer = calc_bakLexer(input) stream = CommonTokenStream(lexer) parser = calc_bakParser(stream) tree = parser.stmt() printer = StmtPrinter() walker = ParseTreeWalker() walker.walk(printer, tree) if __name__ == '__main__': main(sys.argv) 
```
... Enjoy the wrong program. Or rather, right, but unexpectedly.
```
 python ./calc_antlr_min.py ./example.antlr 1 + 2 / (3 - 1 + 5) Oh, a stmt! 5 Oh, a stmt! 1+5 Oh, a stmt! 3-1+5 Oh, a stmt! 2/(3-1+5) Oh, a stmt! 1+2/(3-1+5) 
```
As you can see, the associativity here is “right”. And operations of addition, subtraction, multiplication-division are left. I honestly tried a few days (sic!) To find a solution (of course, I did not do this all the time, but in total I killed 12-15 hours for this). Associative markers, heaps of manuals, grammar reformulation ... Did not work out. I am sure there is a solution. Moreover, an example of a grammar calculator is here . But I wanted to find my solution, if possible, in terms of a general grammar. In the end, the goal was to LEARN, not to solve the problem.

And I admit my failure. Yes, the problem is solved. But using only documentation and searching for general topics, I could not solve it. The reasons ... First, excluding the book on ANTLR, the sources available on the web are not very detailed and expressive. Secondly, there are a lot of materials in the network for different (non-compatible) versions of ANTLR. No, I understand everything, development and so on. But for some reason, in the process of getting acquainted with the instrument, I got the feeling that the author did not even hear about the concept of backward compatibility. In general, sad.
Update

As rightly noted, one head is good and two is better.
The reformulation of the grammar with the right-associative on the left is done literally “with a click of the fingers” (Thanks to Valentin Nechayev netch80 for the addition) - you just need to replace
```
 stmt : term | term addop stmt ; 
```
on
```
 stmt : term | stmt addop term ; 
```
In this case, ANTLR handles left-side recursion without questions (apparently, due to some optimizations). Output simple listeners in this case.
```
 python ./calc_antlr_min.py ./example.antlr 1 + 2 / (3 - 1 + 5) Oh, a stmt! 1 Oh, a stmt! 3 Oh, a stmt! 3-1 Oh, a stmt! 3-1+5 Oh, a stmt! 1+2/(3-1+5) 
```

Peg

In essence, they are a simplified form of BNF, but it’s not generally necessary to know about it to the programmer. Unlike BNF, they are originally more like regular expressions. In fact, one could say that it is BNF with the ability to use regular expressions “according to the standard” (and, which is nice, not only within the non-terminal line, but also to some extent in the expression itself (PEG supports constructions of the form A = B ( XC)* , Z = R (K)? , “A B X C, ” “Z R K 0 1 ”). , , . , PEG , . do parsers encounter, including BNF? The ambiguity of choice! To solve this problem, PEG has a nice sequential or “/” operator . Unlike the operator “|”, which describes equivalent alternatives, “/” clearly indicates the order of resolution of the rule. For example, A / B / C / Dindicates: compare with A, if not suitable, compare with B, if not suitable, compare with C, etc. For this reason, operating PEG grammars is EASIER. And still, the recommendation - if you are indifferent to the order of the comparison, it is better to write “/” This will reduce the number of ambiguous situations for the parser. However, as with any tool, this should be handled with care.
Also, be careful, the creators of PEG parsers are even more prone to wanting to understand the standard as they like, so the vocabulary of different implementations can vary significantly.

So, let's move on to practice.

Use Arpeggio by parglare:

 pip install arpeggio

Next, we look at the documentation and find out that the Visitor pattern recommended for working with AST for this library is very similar to the listener recommended in ANTLR. Fortunately, here for the whole experiment I had an hour, everything turned out to be not difficult:

 from arpeggio.cleanpeg import ParserPEG from arpeggio import PTNodeVisitor, EOF, visit_parse_tree # Setting up our simple grammar calc_grammar = """ number = r'\d+(\.\d+)?' factor = number / "(" stmt ")" term = factor (mulop factor)* stmt = term (addop term)* addop = "+" / "-" mulop = "*" / "/" calc = stmt EOF """ #Creating a visitor class to calculate the result class CalcVisitor(PTNodeVisitor): def visit_number(self, node, children): return float(node.value) def visit_factor(self, node, children): # Children are list-like structure of VISITED node results # visits are made from depth to top return children[0] def visit_term(self, node, children): # For such rules you may use, for example children["factor"] # Though, you need to know, that children["factor"] is a list of ALL # factor's of this term if len(children) == 1: return children[0] else: res = children[0] for i in xrange(len(children) / 2): if children[2 * i + 1] == '*': res *= children[2 * (i + 1)] else: res /= children[2 * (i + 1)] return res def visit_stmt(self, node, children): if len(children) == 1: return children[0] else: res = children[0] for i in xrange(len(children) / 2): if children[2 * i + 1] == '+': res += children[2 * (i + 1)] else: res -= children[2 * (i + 1)] return res # Don't forget about root rule for your parser, as it will be produced as # a parsing result parser = ParserPEG(calc_grammar, "calc") input_expr = "(4-1)*5+(2+4.67)+5.89/(1.2+7)" print(input_expr) parse_tree = parser.parse(input_expr) result = visit_parse_tree(parse_tree, CalcVisitor( #debug=True )) print(result) input_expr = "1 + 2 / (3 - 1 + 5)" print(input_expr) parse_tree = parser.parse(input_expr) result = visit_parse_tree(parse_tree, CalcVisitor( #debug=True )) print(result)

At startup, it will display the following:

 python ./calc_arpeggio.py (4-1)*5+(2+4.67)+5.89/(1.2+7) 22.3882926829 1 + 2 / (3 - 1 + 5) 1.28571428571

If you want to see how a visitor goes around a tree, you can uncomment the debug.

As we can see, the grammar has undergone minor but important changes. In particular, if we pay attention to the stmt and term rules, it will be clear that they have an arbitrary number of elements. Accordingly, thanks to this, the processing of the associativity of mathematical operations is entirely on our conscience. Despite these changes, the program remains simple and straightforward.

General conclusions

In fact, there are several conclusions:

, . , , , . , . — . , .
. , , .
. , , . , - , - (- js, python, lua ruby — ). , “ stand-alone ”, .
() . “:” “=” BNF, . . , 20. - , , . . , …
, “” . , .
, , C (, Bison), “ ”. , , ( , — ). , — . , .

Conclusion

In conclusion, I would like to say that this was an interesting study. I tried to describe my conclusions as simply and clearly as possible, I hope I managed to write this article so that the topic could be understood even by programmers far from mathematics, at least in general terms, sufficient to use the existing tools.

You can ask any questions, if any details on the topic will excite many, they can serve for writing other articles.

As promised, several useful links on the subject of the article:

In Wikipedia, in English (there are significantly fewer articles in Russian):
FG
BNF
LL
LR
PEG
- , “ ”, . xo, . “ , ”. , , . , .

Source: https://habr.com/ru/post/348314/

All Articles