
Compilation: 5. Top-down parsing

So far we have been doing bottom-up parsing. What other options are there?
Let's put bison aside for a while and get back to theory.

Further in the post:

  1. Idea
  2. Implementation
  3. Holy war
  4. Backtracking

Idea


Recall the general idea of bottom-up parsing: reading the input, we shift it onto the stack one symbol at a time; as soon as the top of the stack forms a combination that matches some grammar rule, we reduce it to a single non-terminal and carry on.
An important feature of this method is that at the moment of reduction we already have all of the rule's components in hand, and we can build a tree of whatever structure we like out of them, or even use them for calculations on the fly.

The opposite approach is to start expanding the start symbol of the grammar, choosing which rule to expand by the next input symbol.
For example, if the grammar has the rules
 OP: '{' OPS '}' // block
     |  EXPR ';'  // expression
     |  'if' '(' EXPR ')' OP
     |  'if' '(' EXPR ')' OP 'else' OP
     |  'while' '(' EXPR ')' OP
     |  'exit' ';'  ;
and we see that the next thing in the text is while, then we expand according to the rule OP: 'while' '(' EXPR ')' OP;
Such parsing, too, can be implemented on a pushdown machine: the start symbol goes onto the stack, a non-terminal on top of the stack is replaced by the right-hand side of the chosen rule, and a terminal on top must match the next input symbol.

Note right away that it is pointless to perform an action at the moment of expansion: the right-hand side of the expanded rule has not been read yet, and it is not even known whether it will be read at all. We could push a dummy signal "rule read successfully, you may perform the action" onto the stack underneath the symbols of the right-hand side; but by the time that signal surfaces, the entire right-hand side has already been popped off the stack! What would the action work with?
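For the curious, here is a rough sketch of what such a pushdown machine might look like in code. The grammar and all names are invented for the illustration (a number, or a parenthesized sum of two expressions); this is not the automaton from the series. A non-terminal on top of the stack is replaced by the right-hand side of the rule chosen by the lookahead character; a terminal on top must match the input.

#include <cstdio>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical grammar for the sketch:
//   EXPR: NUM | '(' EXPR '+' EXPR ')'
// Terminals are plain characters; 'E' and 'N' stand for the non-terminals.
bool parse(const std::string& text) {
    std::vector<char> stack = {'E'};          // the start symbol goes onto the stack
    size_t pos = 0;
    while (!stack.empty()) {
        char top = stack.back();
        char next = pos < text.size() ? text[pos] : '\0';
        if (top == 'E') {                     // non-terminal: pick a rule by the lookahead
            stack.pop_back();
            if (isdigit((unsigned char)next)) {
                stack.push_back('N');         // EXPR -> NUM
            } else if (next == '(') {         // EXPR -> '(' EXPR '+' EXPR ')'
                stack.push_back(')');         // push the right-hand side in reverse
                stack.push_back('E');
                stack.push_back('+');
                stack.push_back('E');
                stack.push_back('(');
            } else {
                return false;                 // no rule fits: syntax error
            }
        } else if (top == 'N') {              // NUM: one or more digits
            if (!isdigit((unsigned char)next)) return false;
            while (pos < text.size() && isdigit((unsigned char)text[pos])) ++pos;
            stack.pop_back();
        } else {                              // terminal: must match the input exactly
            if (top != next) return false;
            ++pos;
            stack.pop_back();
        }
    }
    return pos == text.size();                // the whole input must be consumed
}

int main() {
    printf("%d %d\n", parse("(1+(2+3))"), parse("(1+2"));   // prints: 1 0
}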

But the main difficulty of top-down parsing lies elsewhere: how do we choose the right rule to expand?
Recall the grammar that proved too tough for bison:
 WORD: S1 'a' 'i' 'l' |  S2 'a' 'l' 'e';
 S1: 's';
 S2: 's';

How do we expand WORD when the next character is 's'?
There is no way to choose: both rules for WORD begin with that very 's'.

Implementation


We can allow the automaton to look ahead more than one character: judging by the reviews, this is the approach taken by ANTLR, a popular generator of LL parsers.
The technical difficulty is that every additional symbol on which the choice of expansion depends increases the dimension of the transition tables; uncompressed tables for an automaton of N states that reads 3 characters of text (plus the symbol from the stack) would contain N·256⁴ entries, which is tens of gigabytes.
Therefore the practicality of LL parsing with a long lookahead is determined precisely by the parser generator's ability to compress its tables efficiently (recall how bison squeezed our 258×14 table into seven short one-dimensional arrays).

As far as I understand, nobody builds LR parsers with a long lookahead, because everyone is happy as it is: conflicts caused by too short a lookahead are rare in LR parsers. For bison to recognize our grammar, it is enough not to force it to choose between two identical reductions:
 WORD: 's' 'a' 'i' 'l' |  's' 'a' 'l' 'e';
For the LL parser, on the other hand, nothing has changed: both rules still begin with the symbol 's'. So, to "adapt" a grammar for LL parsing, the common beginning of different rules for the same non-terminal is moved out into an auxiliary non-terminal:
 WORD1: 's' 'a' WORD;
 WORD: 'i' 'l' |  'l' 'e';

This trick is called "left factoring": the "common factor" is, so to speak, taken out of the rules that begin identically.
Now WORD1 has a single expansion, and the two possible expansions of WORD begin with different letters.
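To see what the factoring buys us, here is a hand-written sketch of the factorized grammar above (the Parser struct and its helpers are invented for this example): each rule becomes a function, and WORD chooses its expansion by a single lookahead character.

#include <cstdio>
#include <string>

struct Parser {
    std::string text;
    size_t pos = 0;
    bool eat(char c) { return pos < text.size() && text[pos] == c && (++pos, true); }

    // WORD: 'i' 'l' | 'l' 'e';  -- the lookahead alone picks the expansion
    bool word() {
        if (pos < text.size() && text[pos] == 'i') return eat('i') && eat('l');
        if (pos < text.size() && text[pos] == 'l') return eat('l') && eat('e');
        return false;                              // neither expansion fits
    }
    // WORD1: 's' 'a' WORD;
    bool word1() { return eat('s') && eat('a') && word() && pos == text.size(); }
};

int main() {
    printf("%d %d %d\n", Parser{"sail"}.word1(), Parser{"sale"}.word1(),
                         Parser{"salt"}.word1());  // prints: 1 1 0
}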

Why bother with factoring and litter the grammar with meaningless non-terminals, if the ANTLR wonder-machine can look ahead as far as it pleases?
Let's go back to the grammar
 OP: '{' OPS '}' // block
     |  EXPR ';'  // expression
     |  'if' '(' EXPR ')' OP
     |  'if' '(' EXPR ')' OP 'else' OP
     |  'while' '(' EXPR ')' OP
     |  'exit' ';'  ;

How do we expand OP when the next token is if? Two rules can begin this way, and their common part, 'if' '(' EXPR ')' OP, can be arbitrarily long, so no amount of lookahead will save the parser.
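Hand-written top-down parsers usually dodge this particular case in another way: the common prefix 'if' '(' EXPR ')' OP is parsed once, and only then the parser checks whether an else follows. A sketch under those assumptions (the token names and the expr()/op() stubs are hypothetical):

#include <cstdio>
#include <string>
#include <vector>

struct P {
    std::vector<std::string> toks;
    size_t pos = 0;
    bool eat(const std::string& t) {
        return pos < toks.size() && toks[pos] == t && (++pos, true);
    }
    bool expr() { return eat("x"); }            // stub: a one-token expression
    bool op() {                                 // OP: 'if' ... | 'exit' ';'
        if (pos < toks.size() && toks[pos] == "if") {
            // the common prefix of both 'if' rules is parsed exactly once
            if (!(eat("if") && eat("(") && expr() && eat(")") && op())) return false;
            if (pos < toks.size() && toks[pos] == "else")   // optional tail
                return eat("else") && op();
            return true;                        // plain 'if' without 'else'
        }
        return eat("exit") && eat(";");
    }
};

int main() {
    P a{{"if", "(", "x", ")", "exit", ";", "else", "exit", ";"}};
    P b{{"if", "(", "x", ")", "exit", ";"}};
    printf("%d %d\n", a.op() && a.pos == a.toks.size(),
                      b.op() && b.pos == b.toks.size());    // prints: 1 1
}

Greedily grabbing the else here is also the usual way the dangling-else ambiguity gets resolved, and it matches the prioritized choice discussed further below.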

Another problem that LL cannot cope with is left recursion. Recall the market-stall calculator grammar:
 EXPR: NUM |  EXPR OP NUM;

Both rules for EXPR begin with a NUM: the first explicitly, the second implicitly; and there is no common "factor" that could be taken out of the rules. If we also recall that the length of a NUM is unbounded, it becomes clear that no lookahead will solve the problem.

To fix the grammar, we start from its meaning: the first NUM in the string is a ready-made expression on its own, and everything that follows it is a chain of operator-operand pairs.
 EXPR: NUM EXPR1;
 EXPR1: |  OP NUM EXPR1;

By left factoring and eliminating left recursion we can adapt most grammars for LL parsing, at the cost of adding a heap of auxiliary non-terminals that mean nothing at all.
For example, when reducing the rule EXPR: EXPR OP NUM ; we could easily build a node of the syntax tree: here is the left operand, in $1; here is the right operand, in $3. And what can be done with EXPR1: OP NUM EXPR1 ;? The left operand was expanded long ago and has already left the stack; what we hold is EXPR1, the sub-expression that comes after the right operand. If only it were of any use to us here!
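For completeness: hand-written recursive-descent parsers usually sidestep this by turning the right-recursive EXPR1 into a loop and carrying the already-built left subtree along as an accumulator. A sketch of that workaround (all names invented; this is not what an LL generator emits):

#include <cstdio>
#include <cctype>
#include <memory>
#include <string>

struct Node {
    char op;                      // 0 for a number leaf
    int value;
    std::unique_ptr<Node> left, right;
};

struct ExprParser {
    std::string text;
    size_t pos = 0;

    std::unique_ptr<Node> num() {
        int v = 0;
        while (pos < text.size() && isdigit((unsigned char)text[pos]))
            v = v * 10 + (text[pos++] - '0');
        return std::unique_ptr<Node>(new Node{0, v, nullptr, nullptr});
    }
    // EXPR: NUM EXPR1;  EXPR1: | OP NUM EXPR1;  -- written as a loop
    std::unique_ptr<Node> expr() {
        auto left = num();        // the first NUM is a ready-made expression
        while (pos < text.size() && (text[pos] == '+' || text[pos] == '-')) {
            char op = text[pos++];
            auto right = num();
            left = std::unique_ptr<Node>(
                new Node{op, 0, std::move(left), std::move(right)});
            // the node just built becomes the left operand of the next one,
            // so the tree stays left-associative
        }
        return left;
    }
};

int main() {
    auto tree = ExprParser{"1-2-3"}.expr();   // parsed as (1-2)-3, not 1-(2-3)
    printf("%c\n", tree->op);                 // prints: -
}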

It is important that left recursion is natural for all left-associative operations, and most operations are left-associative. So contortions like the one above are not a rare curiosity but the typical case.

On the one hand, converting a grammar to LL form is a purely formal procedure, and it can be entrusted to a soulless piece of hardware. On the other hand, how do you debug an automaton that works not on the grammar you gave it, but on some grammar of its own?
In particular, we will not be able to attach actions to expansions, because the rules being expanded are not the ones we wrote, but the ones the hardware invented for itself. It is already something if such a parser manages to produce the parse tree of the original grammar: to do that, it has to correlate the rules of the given grammar with the rules the automaton actually works by, and rebuild the tree, "returning" the nodes that migrated from one rule to another.

Holy war


The dispute about the merits of LL vs. LR is as old as automatic parsing itself; both approaches have their supporters. For example, Niklaus Wirth, whom I personally hold in deep respect, was an active apologist of LL parsing, and one of the design goals of Pascal was the possibility of LL(1) parsing, i.e. by an automaton that sees only one upcoming symbol of the text. Most "living" languages do not fit into LL: for example, in C both a variable declaration and a function declaration can begin with int, and we cannot tell one from the other until we have read far enough into the line.

As for convenience: by and large, everyone praises the tool they are used to and dislikes the unfamiliar.
For example, here is a quote from a "best answer" on stackoverflow in the "ANTLR vs. bison" dispute:
In terms of personal taste, I think these grammars are a lot easier to construct and debug. What can I say, it is simply more fun.

The advantages cited for ANTLR in this dispute are all about everything except the quality of the parsing itself: a convenient development environment, code generation in several languages, readable generated code.

Readable generated code is the most striking difference between ANTLR and table-driven parsers. Indeed, instead of a specialized parse stack the ordinary call stack is used, and the recognition of each grammar rule is implemented as a function call (a recursive function, for a recursive rule). So when debugging, the call stack shows directly how the parser got into its current state; whereas with bison, as we remember, we had to turn on debug printing of the transitions between states and check it against the printout of the automaton to understand why those transitions happened.
The drawbacks of a recursive implementation compared to a table-driven one are just as clear: a much larger amount of code, and hence memory consumption; and the impracticality of hand-patching the generated code, because it changes whenever the grammar changes, and the changes spread across dozens of functions.

Backtracking


LL parsers, which choose a rule for every expansion in advance, are not the only option for top-down parsing. An alternative idea: when several rules fit, let us try them all; one of them is bound to work. For example, we could do something like fork(): create clones of the parser in its current state, and let each clone try one of the possible expansions. If a clone runs into a syntax error, it dies. If all the clones have died, then no expansion fits: there is a syntax error in the input text.
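The clone idea can be sketched without a literal fork(): it is enough to remember the position before trying an alternative and to roll the position back on failure; every rollback plays the part of killing one clone and starting the next. A toy illustration (grammar and names invented for the example):

#include <cstdio>
#include <string>

// WORD: 's' 'a' 'i' 'l' | 's' 'a' 'l' 'e';   -- the grammar from above
struct Backtracker {
    std::string text;
    size_t pos = 0;

    bool lit(const std::string& s) {           // match a literal or fail without consuming
        if (text.compare(pos, s.size(), s) != 0) return false;
        pos += s.size();
        return true;
    }
    bool word() {
        size_t saved = pos;                    // the "clone" remembers where it started
        if (lit("sail")) return true;
        pos = saved;                           // the first clone died: roll back
        if (lit("sale")) return true;
        pos = saved;                           // all clones died: syntax error
        return false;
    }
};

int main() {
    Backtracker a{"sale"}, b{"salt"};
    printf("%d %d\n", a.word(), b.word());     // prints: 1 0
}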

For correct texts this approach is more or less acceptable; the problems begin with error detection. First, which of the detected errors do we report? Most of them are the result of a wrongly chosen expansion rather than of errors in the program text; but from the parser's point of view they all look equally real.

Second, the parse may never finish: every time we expand by a left-recursive rule, one of the clones ends up in exactly the same state as before the expansion; so at every step we get yet another identical clone, and so on to infinity. If one of the clones reaches the end of the parse, all the others can be killed; but what if, on the contrary, all the other clones run into syntax errors and die, and only the uselessly multiplying clone lives on?

Finally, what do we do with ambiguous grammars? Both LL and LR parsers detect conflicts when the parser is compiled; here there is no compilation as such: the parser is guided by the grammar almost in its raw form, i.e. interprets it on the fly.
So it may turn out that for the same text we get now one parse, now another, depending on which clone manages to finish the analysis first.

The last problem was solved in a most original way: it was declared that the very possibility of ambiguous parsing is a fundamental flaw of grammars, and that one should instead use parsing expressions, which differ in essence only in that a strict priority order is imposed on the alternative expansions. For example, while the grammars
OP: EXPR ';' | 'if' '(' EXPR ')' OP | 'if' '(' EXPR ')' OP 'else' OP ;
and
OP: EXPR ';' | 'if' '(' EXPR ')' OP 'else' OP | 'if' '(' EXPR ')' OP ;
are one and the same ambiguous grammar, the parsing expression
OP: EXPR ';' / 'if' '(' EXPR ')' OP 'else' OP / 'if' '(' EXPR ')' OP
tells the parser unambiguously: "first try to recognize if..else, and only if that fails, recognize the if without else". Whereas the expression
OP: EXPR ';' / 'if' '(' EXPR ')' OP / 'if' '(' EXPR ')' OP 'else' OP
means "first recognize the if without else", and so the else will never be recognized at all.

Now, instead of checking all the possible expansions at once, we check them one by one, in priority order; we move on to the next expansion only when the previous one has run into an error. The link mentioned by the commenters gives an excellent metaphor for this style of parsing:
Have you ever ridden a lazy elephant? It walks slowly and sways heavily. And it does not follow the road, but wanders in whatever directions it finds interesting; when those do not coincide with the needed one, the driver tells the elephant "no, not that way". So, having tried all the options, the elephant ends up at the point about which nobody said "not that way". Parser combinators likewise try all the combinations of options until one of them works. And after every rollback they do the same work all over again. <...> A program of a couple of lines was parsed quickly. Three lines took a few seconds. Five lines: I gave up waiting. Remember, children: such parser combinators are only suitable for very simple grammars.

It remains to detect a looping parse, to achieve acceptable running time and, ideally, to localize errors.

Looping is the simpler part: if we have been in the same state twice without advancing in the input text, then we will end up in it a third time, and as many more times as you like. So for every position in the input text we keep a list of all the states in which that position has already been visited: if we arrive at the same state a second time, we say "no thanks, I'm not going there again".

As a free bonus we get "linear" running time, if we remember not just the mark "been here already" but also the result of that past attempt (the expansion fit / did not fit); then the worst case is that every possible state occurs at every position of the text. If the text is N characters long and the parsing-expression grammar has M rules (counting each alternative expansion of a non-terminal separately), we get O(M·N) running time. If we squint cleverly and pretend that M is a constant, then yes, the time is linear. For comparison, traditional LL and LR parsers with a predetermined action in every state have exactly the same worst case O(M·N):
 S: |  T1 S;
 T1: T2;
 T2: T3;
 ...
 TM: 't';
Here LR will, after each shift of 't', perform M reductions 't' -> TM -> ... -> T3 -> T2 -> T1; and LL, before "eating" each 't', performs M expansions T1 -> T2 -> T3 -> ... -> TM -> 't'.
The whole question is how far the average case is from the worst: at the very least, the "linear elephant" performs no fewer expansions on any parse than LL does on the same grammar.
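To make the memoization concrete, here is a toy packrat-style sketch (grammar and names invented for the example): every position gets a memo entry with the result of the past attempt, so when the second alternative of S rolls back and asks about A at the same position again, the answer comes from the memo instead of being re-parsed.

#include <cstdio>
#include <map>
#include <string>

// Hypothetical parsing expressions for the sketch:
//   S: A 'x' / A 'y';
//   A: 'a' A / 'a';
struct Packrat {
    std::string text;
    std::map<size_t, long> memoA;                  // memo for rule A only, for brevity

    long parse_A(size_t pos) {                     // new position, or -1 if A does not fit
        auto it = memoA.find(pos);
        if (it != memoA.end()) return it->second;  // reuse the remembered result
        long res = -1;
        if (pos < text.size() && text[pos] == 'a') {
            long rest = parse_A(pos + 1);          // A: 'a' A ...
            res = (rest >= 0) ? rest : (long)pos + 1;   // ... or just 'a'
        }
        memoA[pos] = res;
        return res;
    }
    bool parse_S() {                               // S: A 'x' / A 'y'
        long p = parse_A(0);
        if (p >= 0 && (size_t)p < text.size() && text[p] == 'x') return true;
        p = parse_A(0);                            // rollback: this time it is a memo hit
        return p >= 0 && (size_t)p < text.size() && text[p] == 'y';
    }
};

int main() {
    printf("%d %d\n", Packrat{"aaay"}.parse_S(), Packrat{"aaz"}.parse_S());  // prints: 1 0
}

A left-recursive rule would additionally need the "been here already" mark from the loop-detection paragraph above, otherwise parse_A would keep calling itself at the same position forever.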

Another snag is memory consumption. We will need to store the M·N results of past expansions; and that is on top of having to keep the entire input text in memory, because the elephant has to run back and forth over it without end. Compare this with pushdown parsers, which never return to text already read, so their memory requirements depend only on the grammar, not on the text.
"The history of one byte" read everything? Moreover, one of the most natural applications for new compilers is to support all sorts of compact platforms, where saved memory matters.

Finally, about detecting syntax errors. Our elephant, which actually goes by the name of packrat (a hoarder), concludes that the text contains an error only when not a single expansion fits. Therefore the position at which the error is detected may well precede the place of the actual error: take the parsing expression
 DECL: DECLVAR / DECLFN;
 DECLVAR: TYPE ID ';'  ;
 DECLFN: TYPE ID '(' ARGS ')';
- vaguely resembling the syntax of declarations in C.
If the input text contains a sequence that can be parsed as TYPE ID '!', then at which position is the syntax error? Packrat will check the first expansion; it will not fit; the parser rolls back to the beginning of TYPE and tries the second expansion; it does not fit either. So the error will be reported before TYPE; and what if TYPE is a convoluted expression half a line long?
It would be logical to report as the error position the rightmost position that at least one expansion managed to reach, i.e. the last position at which the parser still had hope that the text would parse successfully.
I assume that there are packrat implementations in which the error localization is implemented in this way.
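A rough sketch of that heuristic (names and grammar invented for the example): every failed match updates the furthest position that any expansion managed to reach, and that position is reported instead of the rollback point.

#include <cstdio>
#include <string>
#include <algorithm>

// A toy version of the example above, with one-character "tokens":
//   DECL: 'T' 'i' ';' / 'T' 'i' '(' ')';
struct ErrParser {
    std::string text;
    size_t pos = 0, furthest = 0;

    bool lit(char c) {
        if (pos < text.size() && text[pos] == c) { ++pos; return true; }
        furthest = std::max(furthest, pos);        // the last place we still had hope
        return false;
    }
    bool decl() {
        size_t saved = pos;
        if (lit('T') && lit('i') && lit(';')) return true;
        pos = saved;                               // roll back to the start of TYPE
        if (lit('T') && lit('i') && lit('(') && lit(')')) return true;
        pos = saved;
        return false;
    }
};

int main() {
    ErrParser p{"Ti!"};
    if (!p.decl())
        printf("syntax error near position %zu\n", p.furthest);   // prints: ... position 2
}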

That's it: this was the last post of the series devoted to syntax analysis.
Next time we turn to semantics, and then our toy language will truly become a programming language.

Source: https://habr.com/ru/post/99466/

