📜 ⬆️ ⬇️

Parsers, word processing. Just about the complicated. CFG, BNF, LL (k), LR (k), PEG and other scary words

Probably every programmer had to deal with tasks like “read something in format A and perform some manipulations with it”. Whether it be json, logs of nginx, cfg, sql, yaml, csv or something else. Well, when you can use the library, however, for various reasons, it is not always possible. Then the question arises of creating your own parser for a given format. And this, as the English say, often turns out to be PITA (pain in ...). In this article I will try to alleviate this pain. Who cares, welcome.

Introduction


So the question is, what exactly is this article about? Here I will try to help the reader to get out of the situation with the least losses, when you need to parse some text format, and the library will not work. That is, to solve an absolutely specific problem by common means.

At once I will make a reservation, the topic in itself is not enough that it is VERY difficult, it is also impossible to be extensive and to cover everything in one article just will not work. Therefore, I will start from the general and will proceed to the particular, specifically in this article, giving a review of the tools (which, however, can be used to solve very specific parsing problems) rather than diving deep into the concepts. Perhaps, if readers are interested, the article will turn into a cycle in which I can reveal specific questions in more detail.

I decided to write it because the information available on the topic is scattered and often incomplete, there are very few sources in Russian, and those that exist imply a fairly decent acquaintance of the reader with very specific mathematics. Therefore, so that the reader far from the topic did not experience pain from the consciousness of his (imaginary) inferiority, I decided to try to describe this topic as simply as possible.
')
Be brave, the article is large, some places will not be very interesting, but without them it can be incomprehensible.

Basic concepts


Before talking on the topic is to determine the basic concepts, so that there is no ambiguity. This is a glossary of this article. It may coincide with generally accepted terminology, but generally speaking, it is not obliged, since it shows a picture that is being formed in the author’s head.
So:


Finally, we are done with basic concepts. Let us turn to the choice of the method of analysis.

Parsing options


When we are faced with the task of creating a parser, the solution boils down, as a rule, to 4 main options:


We solve the problem of parsing


Let's go in order, skip the brute-force and regexp options.

Bnf


So the time has come to dwell a bit more on the parser's algorithms, or rather, on the grammars used by them. So, GLR (up to O (L ^ 3) or up O (L ^ 4), as some sources (antlr) say), LR, LALR and LL are most often found, all within O (L). GLR has the greatest “gluttony” - it is able to better handle grammar ambiguities, but due to this it is slower. That is, if we consider the “size” of the class of grammars processed by the parser (let's call it the power of the parser), then all other things being equal, the power will be distributed as follows: GLR> LR> LALR> SLR> LL. Resource consumption, respectively, close to the opposite. But back to the grammar.

Compiling or searching the grammar for the LR parser as a whole is quite simple and there is a high chance that the BNF compiled by you “on the knee” will be easily accepted by the parser and processed. The main thing is that the grammar should be complete, that is, to describe all possible situations of the input stream, besides try to understand by yourself whether knowing the following k characters (depending on the selected LR parser) can unambiguously determine which rule should apply.

For the LR-parser there can be conflicts of the following type:

  1. Shift-reduce: NT ::= x NT | x NT ::= x NT | x . Where is the length x> k. It is solved this way: NT ::= xK; K ::= K | e NT ::= xK; K ::= K | e
  2. Convolution (reduce-reduce):
    NT :: = e
    A ::= NT
    B ::= NT
    D ::= B uv | A uw
    NT :: = e
    A ::= NT
    B ::= NT
    D ::= B uv | A uw
    , where the length u> k
    It is solved like this:
    R ::= Au
    J ::= Bu
    D ::= Rw | Jv


Conflicts of a type are characteristic for an LL parser (it is necessary, but not enough, to reformulate them, upon request I can dwell on LL (k) grammars in more detail in the following article):

Left recursion: NT ::= NT x | y NT ::= NT x | y , where x, y are arbitrary strings of terminals / not terminals, but y does not start with NT

Example: E ::= E + T | T E ::= E + T | T It can be reformulated as:

E ::= TZ
Z ::= '+' TZ | x


Left factorization: NT ::= ax | ay NT ::= ax | ay , where a is a string of length> k of our parser. Solving is even simpler: NT ::= aC; C = x|y NT ::= aC; C = x|y

So what are we going to decide?

Well, let's say it will be a simple calculator:

 OPERATOR ::= "+ "| "-" | "*" | "/" STMT ::= "(" STMT ")" | STMT OPERATOR STMT | FLOAT | INT FLOAT ::= INT "." INT INT ::= (POSITIVE_DIGIT INT | DIGIT ) | DIGIT POSITIVE_DIGIT ::= "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" DIGIT ::= POSITIVE_DIGIT | "0" 

If the reader tries to find the grammar of the calculator on the Internet, he will see that often the operations of addition / subtraction and multiplication / division are processed by different grammatical structures. This was done specifically to take into account in the grammar such a moment as the priority of operations (and also to reveal the ambiguities of the grammar). We will do this further in the course of the article.

We try to find a native Python-solution, we find a lot of them .

  1. Use parglare . This is the Python library / cli-tool, which implements an LR / GLR parser with a fairly wide range of capabilities (inline functions, prioritization, trees, a distinct grammar analysis, and a QA visualizer resulting from grammar processing).

     pip install parglare 

    We reformulate our calculator as parglare asks.

     OPERATOR : "+ "| "-" | "*" | "/" | = STMT : "(" STMT ")" | STMT OPERATOR STMT | FLOAT | INT FLOAT : INT "." INT INT : (POSITIVE_DIGIT INT | DIGIT ) | DIGIT POSITIVE_DIGIT : "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" DIGIT : POSITIVE_DIGIT | "0" 

    Is that enough? Save to calc.pg and use the cli-tool

     pglr --debug check calc.pg Error in file "/home/survivor/tests/calc.pg" at position 1,42 => "" | "/" | *=\nSTMT ". Expected: Name or StrTerm or RegExTerm 

    Oops! it seems something extra. Bingo! I put in for some reason | = after “/” (no, I know where he is from there (but this does not apply to the topic of the article)). The main thing is that the program clearly indicated this to us. Further, after the correction, the program will complain about the absence; at the end of the nonterminal designation and on the bracket in the INT rule. The reformulated version will look like this:

     STMT : "(" STMT ")" | STMT OPERATOR STMT | FLOAT | INT; OPERATOR : "+ "| "-" | "*" | "/"; FLOAT : INT "." INT; INT : POSITIVE_DIGIT INT | POSITIVE_DIGIT DIGIT | DIGIT; POSITIVE_DIGIT : "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"; DIGIT : POSITIVE_DIGIT | "0"; 

    As a result, pglr likes everything and he will tell us:

     Grammar ok! 

    BUT:

     There are 4 Shift/Reduce conflicts. Either use 'prefer_shifts' parser mode, try to resolve manually or use GLR parsing. There are 7 Reduce/Reduce conflicts. Try to resolve manually or use GLR parsing. 

    Well, as described above for debug, you can read their beautiful (and understandable) description. Well, let's think about it. First, let's not be the smartest and throw out positive_digit. If someone writes 0034 - firstly, he is angry with Pinocchio on his own, and secondly, most PLs, including Python, will not tell us anything when converting this to a number, and will simply give out 34. Well, that will help a lot. Secondly, see here the division into int / float, for simplicity we assume that all numbers are floating point. Also, pglr understands regular expressions in BNF, use this. We get:

      STMT : "(" STMT ")" | STMT OPERATOR STMT | FLOAT; OPERATOR : "+ "| "-" | "*" | "/"; FLOAT : /\d+(\.\d+)?/; 

    and still

     There are 4 Shift/Reduce conflicts. Either use 'prefer_shifts' parser mode, try to resolve manually or use GLR parsing. 

    Well, be that as it may, grammar

     *** GRAMMAR *** Terminals: + EOF ) ( FLOAT STOP * / - \d+(\.\d+)? EMPTY NonTerminals: OPERATOR S' STMT Productions: 0: S' = STMT STOP 1: STMT = STMT OPERATOR STMT 2: STMT = ( STMT ) 3: STMT = FLOAT 4: OPERATOR = + 5: OPERATOR = - 6: OPERATOR = * 7: OPERATOR = / 

    Let's try to parse something.

     from parglare import Grammar from parglare import Parser grammar = Grammar.from_file('calc.pg') parser = Parser(grammar, build_tree=True, prefer_shifts=True) result = parser.parse('1 + 2 / (3 - 1 + 5)') 

    We get:

     result <NonTerm(start=0, end=19, sym=STMT)> 

    we can get result.children and further along the tree. We can calculate the final expression at the same time, but we will not do it here. It is important that we have access to the object tree, for the terminal symbols they are also their “value” and we can do whatever we want with this.

    If we want to fix conflict situations, how else can we resolve grammar conflicts?

    There are several options:

    • Solve the problem by reformulating the grammar.
      For example:

       STMT : TERM | STMT ADDOP TERM ; TERM : FACTOR | FACTOR MULOP FACTOR ; FACTOR : "(" STMT ")" | NUMBER; ADDOP : "+" | "-"; MULOP : "*"|"/"; NUMBER: /\d+(\.\d+)?/; 

      As you can see, the task has become more complicated, but not too much. Moreover, if we do the analysis of just such a tree, it will be easier for us.
    • Use the means of parglare itself

      In this case, the solution is more specific, but in some cases more convenient. Parglare provides a good toolkit for prioritizing rules, for example, for arithmetic expressions you can set the operation priority and associativity (from which side this operation is performed - from left to right or from right to left) the higher the priority, the operation will be performed earlier (relative to the others), for example, our grammar in this notation might look like this:

       STMT : STMT ADDOP STMT {1, left} | STMT MULOP STMT {2, left} | "(" STMT ")" | NUMBER; ADDOP : "+" | "-"; MULOP : "*"|"/"; NUMBER: /\d+(\.\d+)?/; 

      Parsing, but not working correctly. For example, for

      1 + 2 / (3 - 1 + 5)

      we get (with non-tree parsing)

       ['1', u'+', ['2', u'/', [u'(', ['3', u'-', ['1', u'+', '5']], u')']]] 

      which obviously does not match the expected result:

       ['1', u'+', ['2', u'/', [u'(', [['3', u'-', '1'], u'+', '5'], u')']]] 


    Morals - it is better to use the standard moments described in BNF. Brilliant and cool stuff shine and cool, but not always work. Or I do not know how to cook them. And most parsing libraries have alpha | beta.

    According to the author about this bug, it is due to the fact that, in essence, the associativity (left) of the parser at the algorithm level means to prefer to reduce rather than shift. That is, in fact, if there is an opportunity to “chop off” the rule, or continue within its framework - it is better to chop off. Since the parsing goes from left to right, this means left associativity. The reason for the error is that the priority for the rules inside ADDOP / MULOP is not defined, which breaks the entire rule.

    However, even if we prioritize (for example:

    ADDOP: “+” {1}
    | “-” {1}
    ADDOP: “+” {1}
    | “-” {1}
    ), will not work anyway, already due to another error.

    Finally, an example with inline functions that are bound directly to the rules from the parglare documentation.

     from parglare import Parser, Grammar grammar = r""" E: E '+' E {left, 1} | E '-' E {left, 1} | E '*' E {left, 2} | E '/' E {left, 2} | E '^' E {right, 3} | '(' E ')' | number; number: /\d+(\.\d+)?/; """ # Define inline functions bound to grammar rule actions = { "E": [lambda _, nodes: nodes[0] + nodes[2], # for rule[0] “+” lambda _, nodes: nodes[0] - nodes[2], # for rule[1] “-” lambda _, nodes: nodes[0] * nodes[2], # for rule[2] “*” lambda _, nodes: nodes[0] / nodes[2], # for rule[3] “/” lambda _, nodes: nodes[0] ** nodes[2], # for rule[4] “^” lambda _, nodes: nodes[1], # for rule[5] “()” - just get the middle lambda _, nodes: nodes[0]], # for rule[6] just get node "number": lambda _, value: float(value), # for rule - convert to float } g = Grammar.from_string(grammar) parser = Parser(g, debug=True, actions=actions) result = parser.parse("34 + 4.6 / 2 * 4^2^2 + 78") print("Result = ", result) # Output # -- Debuging/tracing output with detailed info about grammar, productions, # -- terminals and nonterminals, DFA states, parsing progress, # -- and at the end of the output: # Result = 700.8 

  2. Next, try the standalone ANTLR tool.

    Installation is pretty simple, for linux it is

     $ cd /usr/local/lib $ curl -O http://www.antlr.org/download/antlr-4.7.1-complete.jar $ export CLASSPATH=".:/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" $ alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool' $ alias grun='java org.antlr.v4.gui.TestRig' 

    In order to work on python2, you still need to install

     pip install antlr4-python2-runtime 

    Then try to make a grammar for it. It has a slightly different format, so we replace the double quotes with single quotes and slightly rewrite the regular expression, as well as add the notation for WS, which was set by default in the previous tool. In our case, the left recursion can be eliminated simply by rewriting it on the right.

    An important point! ANTLR has parser rules and grammar rules. The rules of parsing lead to the appearance of a node in the resulting AST. In addition, there is some difference that can / cannot be in the grammatical / parsing rules. In particular, there are no regular expressions in the rules of parsing. In addition, the parser should receive the input rule, the starting net terminal (in general, everything is somewhat more complicated, but in our case this is also enough). In general, it is worth noting that ANTLR has a rather powerful syntax, also supports inline functions (albeit in a slightly different format) and non-tree non-terminals and something else. As a result, our grammar looks like this:

     grammar calc; stmt : term | term addop stmt ; term : factor | factor mulop factor ; factor : '(' stmt ')' | NUMBER; addop : '+' | '-'; mulop : '*'|'/'; NUMBER: [0-9]+|[0-9]+.[0-9]+; WS : [ \t\r\n]+ -> skip ; 

    The file is called calc.g4 (This is important! The name of the file must match the name of the grammar inside).

    Let's create a java-parser.

     antlr4 calc.g4 javac calc*.java grun calc stmt -gui 

    Then we give him some input (for example, 1 + 2 / (3 - 1 + 5) ) and press the end of the line ( ctrl-d on * nix, ctrl-z on windows) and look at the result. We were shown a beautiful parse tree, also interactive. You can see, twist the nodes, think, export as a picture.

    Create a python2 parser:

     antlr4 -Dlanguage=Python2 calc.g4 


    Further, we can hang listeners on the grammar and enjoy.

     import sys from antlr4 import * from calc_bakLexer import calc_bakLexer from calc_bakListener import calc_bakListener from calc_bakParser import calc_bakParser class StmtPrinter(calc_bakListener): def __init__(self): self.map_results = {} self.result = 0 def exitStmt(self, ctx): print("Oh, a stmt! {}".format(ctx.getText())) def main(argv): input = FileStream(argv[1]) print(input) lexer = calc_bakLexer(input) stream = CommonTokenStream(lexer) parser = calc_bakParser(stream) tree = parser.stmt() printer = StmtPrinter() walker = ParseTreeWalker() walker.walk(printer, tree) if __name__ == '__main__': main(sys.argv) 

    ... Enjoy the wrong program. Or rather, right, but unexpectedly.

     python ./calc_antlr_min.py ./example.antlr 1 + 2 / (3 - 1 + 5) Oh, a stmt! 5 Oh, a stmt! 1+5 Oh, a stmt! 3-1+5 Oh, a stmt! 2/(3-1+5) Oh, a stmt! 1+2/(3-1+5) 

    As you can see, the associativity here is “right”. And operations of addition, subtraction, multiplication-division are left. I honestly tried a few days (sic!) To find a solution (of course, I did not do this all the time, but in total I killed 12-15 hours for this). Associative markers, heaps of manuals, grammar reformulation ... Did not work out. I am sure there is a solution. Moreover, an example of a grammar calculator is here . But I wanted to find my solution, if possible, in terms of a general grammar. In the end, the goal was to LEARN, not to solve the problem.

    And I admit my failure. Yes, the problem is solved. But using only documentation and searching for general topics, I could not solve it. The reasons ... First, excluding the book on ANTLR, the sources available on the web are not very detailed and expressive. Secondly, there are a lot of materials in the network for different (non-compatible) versions of ANTLR. No, I understand everything, development and so on. But for some reason, in the process of getting acquainted with the instrument, I got the feeling that the author did not even hear about the concept of backward compatibility. In general, sad.

    Update


    As rightly noted, one head is good and two is better.
    The reformulation of the grammar with the right-associative on the left is done literally “with a click of the fingers” (Thanks to Valentin Nechayev netch80 for the addition) - you just need to replace
     stmt : term | term addop stmt ; 

    on
     stmt : term | stmt addop term ; 

    In this case, ANTLR handles left-side recursion without questions (apparently, due to some optimizations). Output simple listeners in this case.
     python ./calc_antlr_min.py ./example.antlr 1 + 2 / (3 - 1 + 5) Oh, a stmt! 1 Oh, a stmt! 3 Oh, a stmt! 3-1 Oh, a stmt! 3-1+5 Oh, a stmt! 1+2/(3-1+5) 



Peg


In essence, they are a simplified form of BNF, but it’s not generally necessary to know about it to the programmer. Unlike BNF, they are originally more like regular expressions. In fact, one could say that it is BNF with the ability to use regular expressions “according to the standard” (and, which is nice, not only within the non-terminal line, but also to some extent in the expression itself (PEG supports constructions of the form A = B ( XC)* , Z = R (K)? , “A B X C, ” “Z R K 0 1 ”). , , . , PEG , . do parsers encounter, including BNF? The ambiguity of choice! To solve this problem, PEG has a nice sequential or “/” operator . Unlike the operator “|”, which describes equivalent alternatives, “/” clearly indicates the order of resolution of the rule. For example, A / B / C / Dindicates: compare with A, if not suitable, compare with B, if not suitable, compare with C, etc. For this reason, operating PEG grammars is EASIER. And still, the recommendation - if you are indifferent to the order of the comparison, it is better to write “/” This will reduce the number of ambiguous situations for the parser. However, as with any tool, this should be handled with care.
Also, be careful, the creators of PEG parsers are even more prone to wanting to understand the standard as they like, so the vocabulary of different implementations can vary significantly.

So, let's move on to practice.

Use Arpeggio by parglare:

 pip install arpeggio 

Next, we look at the documentation and find out that the Visitor pattern recommended for working with AST for this library is very similar to the listener recommended in ANTLR. Fortunately, here for the whole experiment I had an hour, everything turned out to be not difficult:

 from arpeggio.cleanpeg import ParserPEG from arpeggio import PTNodeVisitor, EOF, visit_parse_tree # Setting up our simple grammar calc_grammar = """ number = r'\d+(\.\d+)?' factor = number / "(" stmt ")" term = factor (mulop factor)* stmt = term (addop term)* addop = "+" / "-" mulop = "*" / "/" calc = stmt EOF """ #Creating a visitor class to calculate the result class CalcVisitor(PTNodeVisitor): def visit_number(self, node, children): return float(node.value) def visit_factor(self, node, children): # Children are list-like structure of VISITED node results # visits are made from depth to top return children[0] def visit_term(self, node, children): # For such rules you may use, for example children["factor"] # Though, you need to know, that children["factor"] is a list of ALL # factor's of this term if len(children) == 1: return children[0] else: res = children[0] for i in xrange(len(children) / 2): if children[2 * i + 1] == '*': res *= children[2 * (i + 1)] else: res /= children[2 * (i + 1)] return res def visit_stmt(self, node, children): if len(children) == 1: return children[0] else: res = children[0] for i in xrange(len(children) / 2): if children[2 * i + 1] == '+': res += children[2 * (i + 1)] else: res -= children[2 * (i + 1)] return res # Don't forget about root rule for your parser, as it will be produced as # a parsing result parser = ParserPEG(calc_grammar, "calc") input_expr = "(4-1)*5+(2+4.67)+5.89/(1.2+7)" print(input_expr) parse_tree = parser.parse(input_expr) result = visit_parse_tree(parse_tree, CalcVisitor( #debug=True )) print(result) input_expr = "1 + 2 / (3 - 1 + 5)" print(input_expr) parse_tree = parser.parse(input_expr) result = visit_parse_tree(parse_tree, CalcVisitor( #debug=True )) print(result) 

At startup, it will display the following:
 python ./calc_arpeggio.py (4-1)*5+(2+4.67)+5.89/(1.2+7) 22.3882926829 1 + 2 / (3 - 1 + 5) 1.28571428571 

If you want to see how a visitor goes around a tree, you can uncomment the debug.

As we can see, the grammar has undergone minor but important changes. In particular, if we pay attention to the stmt and term rules, it will be clear that they have an arbitrary number of elements. Accordingly, thanks to this, the processing of the associativity of mathematical operations is entirely on our conscience. Despite these changes, the program remains simple and straightforward.

General conclusions


In fact, there are several conclusions:


Conclusion


In conclusion, I would like to say that this was an interesting study. I tried to describe my conclusions as simply and clearly as possible, I hope I managed to write this article so that the topic could be understood even by programmers far from mathematics, at least in general terms, sufficient to use the existing tools.

You can ask any questions, if any details on the topic will excite many, they can serve for writing other articles.

As promised, several useful links on the subject of the article:

Source: https://habr.com/ru/post/348314/


All Articles