Foreword
Good day.
This is the second part of an article about writing your own LALR parser generator. In this part I will talk about the evolution from primitive bottom-up parsers to the most relevant, though not exactly new, LALR parsers. For those who have not read the first article (links are below), I advise reading at least the first half of its last section. I will refer to that little code snippet several times.
In the comments to the previous article several people were interested in my motives for writing my own compiler compiler. Unfortunately, they will not find answers to that question in this article. I will not hide that initially I planned to write an article without much theory, but with a justification of the tasks and goals for which I started writing a generator, and I wanted to share the nuances and peculiarities of the implementation. That alone is a decent volume: a few screens. But then I decided to also describe the basic theory in layman's terms, so the article has grown to three parts. Thus, in order not to break the logic of the presentation, I will first talk about LR/SLR/LALR parsers, and tomorrow I will publish the final and, I think, most interesting part.
Why do we need evolution?
If you look at the primitive code of the example from the last part, you can see that the complexity of the algorithm is O(n), where n is the number of input characters. I must say that this is a very good estimate for an algorithm; for comparison, the fastest general-purpose sorts run in O(n * log(n)). However, what does this statement about complexity actually mean? It means that we will have to perform not n operations, but at least C * n, where C is a constant. For that example the constant is very large, on the order of the sum of the lengths of all the rules, and can easily be 100, 1000, or 10,000. That is, for a string of 10 characters we may spend a hundred thousand operations. And this is very bad.
That is why we need to improve the parsing algorithm itself. The basic idea is to know, at each step, whether the current symbol together with a number of previous ones leads to a reduction (replacing the last K symbols, including already reduced non-terminals, with one non-terminal) or to a shift, that is, simply recording that this terminal took its position. Roughly speaking, if there are rules A = 012 and B = 021, then the string 01 leads to a shift for rule A, 02 to a shift for rule B, and 012 to a reduction by rule A. This idea brings the above-mentioned constant C down to 1. It is easy to see that this scheme fits very well with the concept of an FSM (finite state machine). Put simply, it is a two-dimensional table that, given the number of the current state and the incoming symbol, yields a new state number. And so on until we reach the final state.
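To make the table idea concrete, here is a minimal sketch of such a table-driven loop (TypeScript; the fsm layout and the encoding of symbols as small integers are my assumptions for illustration, not the generator's actual code):

type State = number;

// fsm[state][symbol]: the next state, or -1 meaning "no transition".
// Symbols are assumed to be encoded as small integers (0 and 1 here).
const fsm: State[][] = [
  /* s0 */ [ 1, -1],  // on symbol 0 go to s1
  /* s1 */ [-1,  2],  // on symbol 1 go to s2
  /* s2 */ [-1, -1],  // s2 is final, no outgoing transitions
];

function run(input: number[]): boolean {
  let state: State = 0;
  for (const symbol of input) {
    state = fsm[state][symbol];
    if (state < 0) return false; // no transition: syntax error
  }
  return state === 2; // accepted only if we stopped in the final state
}

// run([0, 1]) === true: the string "01" is accepted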
FSM states
OK then. I think it is obvious that a state should reflect the current position within the rules. For the above rules A and B, after reading the characters 0, 1 the state can be described as "we have already read the first 2 of the 3 characters of rule A". If we add the rule C = 013, then the state becomes less specific: "we have already read the first 2 of the 3 characters of A or C". Then everything depends on the next symbol, 2 or 3. If 2 arrives, we go to the state "we have read 3 of the 3 characters of A" and then perform a reduction, turning 012 into the symbol A. For 3 and C it is similar. This should be clear.
In addition, we need to determine where to start and when to stop. In this example the start is "we have read 0 of the 3 characters of A, B, or C". The end also seems clear: the 3 possible states "3 of 3" for A/B/C. That is, when a reduction by one of the rules occurs, we stop. However, while this works for a primitive grammar, make things slightly more complicated and everything breaks down:
A = 0 A | 1
Such a grammar specifies m zeros followed by a 1 at the end, for example "0000001". What is the difficulty? The fact that we can no longer determine when to stop. If we stop at a reduction by A, then we stop right after the very first reduction (when we read the terminal 1 and reduce it to A), without performing the m pending reductions by the rule A = 0 A. Therefore, when developing a grammar, a restriction is imposed: there must be a single start symbol to which the whole grammar folds, e.g. S = A | B | C. On top of that, a special symbol S' and the rule S' = S are automatically added by the parser, and when a reduction by this rule happens we know that the mission is complete and we can go home. Thus we have simultaneously determined both the initial state, "we have read 0 of the 1 characters of rule S'", and the final one.
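As a sketch, the augmentation step can look like this (the Rule shape and the augment helper are my own illustration, not the article's code):

// A hypothetical rule shape, reused by the sketches below.
interface Rule {
  left: string;    // the non-terminal on the left-hand side
  right: string[]; // the sequence of symbols on the right-hand side
}

// Add the synthetic start rule S' = S, where `start` is the
// user-declared start symbol of the grammar.
function augment(rules: Rule[], start: string): Rule[] {
  return [{ left: "S'", right: [start] }, ...rules];
}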
Now is the time to think about state encoding. We are not going to write our wishes and expectations out in plain text; the parser simply would not understand us. We need a machine representation of the phrases used above. In fact, comparing them with each other, it is easy to see that they contain only 2 defining pieces of information: the rule and the current position within that rule. And, as the examples show, a state may contain several such pairs. In the classical literature these pairs are called items, and they are written like this: A = 01·2, that is, 2 characters of the rule A = 012 have been read.
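In code, an item can be a rule index plus a dot position, and a state is just a set of items; a sketch reusing the Rule shape from above (the names Item, dot and LRState are my own):

// One item: a rule plus the position of the marker in it.
interface Item {
  rule: number; // index of the rule in the grammar
  dot: number;  // how many right-hand-side symbols are already read
}

// A state of the automaton is simply a set of items; for the rules
// A = 012 and C = 013, the state after reading "01" holds the items
// for A and C with dot = 2.
type LRState = Item[];

// An item triggers a reduction when the marker reaches the end.
function isComplete(item: Item, rules: Rule[]): boolean {
  return item.dot === rules[item.rule].right.length;
}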
FSM table
It remains only to build the FSM itself, and the parser is essentially ready. The table of transitions between states is constructed very simply. A reduction is emitted when we encounter a state containing an item with all of the rule's characters read, that is, with the marker (·) at the very end. In addition, since the columns of the table are indexed by the current symbol, we must find all the possible terminals that can follow the symbol being reduced. If a reduction has formed but is followed by a symbol that has no place in the overall logic of the grammar, then this is an error. Example:
#start A
A = B 2
B = 0 1
Here, during the reduction of B, we must check that 2 follows, and not some other character. I think the logic of computing this set is clear, and if not, the pseudo-algorithm is given below. Beyond that, when reducing we must keep track of which parent rule is currently being analyzed. After all, the symbol being reduced may easily participate in several rules, and we need to know exactly which rule this operation currently serves in order to go to the appropriate state. This is solved by maintaining a stack of states. When shifting (consuming a terminal and changing state), we push the state number onto the stack. When reducing, we pop L elements from the stack, where L is the length of the rule's right-hand side (the sequence of elements that collapse into one non-terminal). Then we push the index of the new state, which is determined by the same FSM table, with the column given by the non-terminal on the left side of the reduced rule and the row given by the state now at the top of the stack. It may sound cumbersome, but in practice everything is simple. I will illustrate the work of the algorithm on the above grammar.
0. The input string is "012".
1. We start in s0 = {A = · B 2} (plus {S = · A} and {B = · 0 1} after closure); the incoming terminal is 0.
2. Shift: the new state is [s0, 0] = s1 = {B = 0 · 1}; the symbol stack is {0}; the next terminal is 1.
3. Shift: the new state is [s1, 1] = s2 = {B = 0 1 ·}; the stack is {0, 1}; the next terminal is 2.
4. The cell [s2, 2] holds a reduction (2 is the expected follow terminal), so we reduce 0 1 to B; the new state is [s0, B] = s3 = {A = B · 2}; the stack is {B}.
5. Shift: the new state is [s3, 2] = s4 = {A = B 2 ·}; the stack is {B, 2}; the next terminal is EOF (end of input).
6. Reduce: [s4, EOF] reduces B 2 to A; the new state is [s0, A] = s5 = {S = A ·}; the stack is {A}.
7. Reduce: [s5, EOF] reduces A to S, giving s6 = {S}; the start rule has been reduced, parsing stops, the input is accepted.
We have figured out reductions, but it is still unclear how we determine the transitions between states and where the states come from. Everything is resolved trivially: we simply iterate over all available symbols (terminals and non-terminals), applying each to the state currently being analyzed. If it contains an item of the form {A = ... · X ...}, where X is the applied symbol, then we move the marker (this is exactly the consumption of the symbol X) and get an item of the new state, {A = ... X · ...}. Of course, if there are several such items for X, the new state will also contain several entries. During this process we also build the required FSM: [sOld, X] = sNew. This is almost a finished row of the table; it only remains to add the reduction cells to it.
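A sketch of that transition-building step, reusing the Item/Rule/LRState shapes assumed above (the helper name gotoState is mine):

// Build the successor of `state` on symbol X by advancing the marker
// in every item that expects X next.
function gotoState(state: LRState, x: string, rules: Rule[]): LRState {
  const next: LRState = [];
  for (const item of state) {
    const expected = rules[item.rule].right[item.dot];
    if (expected === x) {
      next.push({ rule: item.rule, dot: item.dot + 1 });
    }
  }
  return next; // an empty result means there is no transition on X
}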
Another important point. If we turn to the last example again, it may seem that from the starting state {A = · B 2} you can only get {A = B · 2} and [s0, B] = sX. But that alone will never work, simply because the input of the starting state never receives a non-terminal (B), only input terminals (0, 1, 2). To make the algorithm work, we must take into account possible expansions (the operation inverse to reduction). In this case, from the grammar we see that B can be expanded, giving, loosely speaking, {A = · 0 1 2}, which fully matches the intended logic. Formalizing this: "if a state contains {X = ... · Y ...}, then it must also contain items of the form {Y = · ...}". This expansion process is called closure.
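A minimal sketch of closure under the same assumed shapes (a real implementation, like this one, must deduplicate items and iterate to a fixed point):

// Expand a state: for every item whose marker stands before a
// non-terminal Y, add the items {Y = · ...} for all rules of Y.
function closure(state: LRState, rules: Rule[],
                 isNonTerminal: (s: string) => boolean): LRState {
  const result = [...state];
  const seen = new Set(result.map(i => `${i.rule}:${i.dot}`));
  for (let i = 0; i < result.length; i++) { // result grows as we append
    const item = result[i];
    const y = rules[item.rule].right[item.dot];
    if (y !== undefined && isNonTerminal(y)) {
      rules.forEach((r, ri) => {
        if (r.left === y && !seen.has(`${ri}:0`)) {
          seen.add(`${ri}:0`);
          result.push({ rule: ri, dot: 0 });
        }
      });
    }
  }
  return result;
}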
That is enough theory now to put together both the FSM generation algorithm and the FSM-driven parser itself. Pseudocode of the terminal-set computation mentioned above:
function firstTerminal(X) {
  if (X.type == Terminal)
    return X;
  result = [];
  for (rule in rules) {
    // only the rules that define the non-terminal X matter
    if (rule.left == X)
      result.add(firstTerminal(rule.right[0]));
  }
  // note: a real implementation must also guard against left
  // recursion (X = X ...), e.g. with a set of visited symbols
  return result;
}
Algorithm code:
input >> term;
accepted = false;
/* state 0 is the initial one, so the stack starts with it */
stack = [0];
while (!accepted) {
    // the current state is always stack.top()
    state = FSM[stack.top(), term];
    switch (state) {
    case Success:
        accepted = true;
        break;
    case Shift:
        // consume the terminal and remember the new state
        stack.push(state.index);
        input >> term;
        break;
    case Reduce:
        // pop L states, where L is the length of the right-hand side
        // of the rule (the elements collapsing into one non-terminal)
        stack.pop(state.rule.right.length);
        // state.rule.left is the non-terminal X we reduced to;
        // stack.top() is the state we returned to, from which we now
        // transition by X: {..., Y, ..., Z} becomes {..., X}
        // for the rule X = Y ... Z
        stack.push(FSM[stack.top(), state.rule.left]);
        break;
    default:
        throw SyntaxError;
    }
}
Further improvement of the algorithm: SLR(1)
We have obtained a fairly fast and good algorithm, but there is one flaw in it that is difficult to spot. It lies in the imperfect definition of the symbol that can come after the non-terminal being reduced (the function nextTerminal). A small example:
A = B 2 | C
B = C 0 | 1
C = B

States:
S0: [{S = · A}, {A = · B 2}, {A = · C}, {B = · C 0}, {B = · 1}, {C = · B}]
S1: [{S = A ·}], reached from S0 by A
S2: [{A = B · 2}, {C = B ·}], reached from S0 by B
...
SX: [{A = B 2 ·}], reached from S2 by 2
The grammar describes strings of the form 10...02 and 10...0, and nextTerminal for the collapsing symbol yields [0, 2, EOF]. So for FSM[S2, 2] we must perform a reduction by the rule C = B, but at the same moment, by the rule A = B 2, we must perform a shift to SX on that very same terminal 2 from the state S2: the classic shift-reduce conflict. Sad. This happened because in state S2 the item C = B · should expect only EOF (reached through A = C, followed by the end of input) or 0 (through B = C 0), and by no means 2; that terminal can appear only further down the derivation.
Yes, this is a synthetic example, but in reality grammars with such traps do exist, and not so rarely. The solution is obvious: distinguish the items that generate a reduction by the expected symbol. That is, {C = B ·} turns, conditionally, into {C = B · [expect EOF]} or, for short, {C = B ·, EOF}. Only 2 places are affected: the generation of items (items of the new format must be created) and the generation of the reduction cells. That is, only the 2 functions closeItem() and addReducing():
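A hedged sketch of what these two functions might look like; the names closeItem and addReducing come from the article, but everything else here (the Item1 shape, the single-symbol first() helper, the string-keyed fsm map) is my own assumption:

// Sketch only: an item that carries one expected symbol.
interface Item1 {
  rule: number;
  dot: number;
  expect: string; // the terminal expected after the whole rule
}

// Closing one item {X = ... · Y beta, a}: every rule of Y is added
// with lookahead first(beta), or `a` itself when beta is absent.
// Simplified: assumes first() of a symbol never yields "empty".
function closeItem(item: Item1, rules: Rule[],
                   first: (s: string) => string[]): Item1[] {
  const rhs = rules[item.rule].right;
  const y = rhs[item.dot];
  const beta = rhs[item.dot + 1];
  const lookaheads = beta !== undefined ? first(beta) : [item.expect];
  const added: Item1[] = [];
  rules.forEach((r, ri) => {
    if (r.left === y) {
      for (const la of lookaheads) {
        added.push({ rule: ri, dot: 0, expect: la });
      }
    }
  });
  return added;
}

// Writing a reduction: only into the column of the expected symbol,
// not into every column from a global follow set.
function addReducing(fsm: Map<string, string>, state: number,
                     item: Item1, ruleIndex: number): void {
  fsm.set(`${state}:${item.expect}`, `reduce ${ruleIndex}`);
}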
The summit: LALR(1)
The disadvantage of the previous algorithm is visible to the naked eye: where we previously had 1 item and 1 state, there are now several items and several states that differ only in the expected symbol. Accordingly, the FSM table swells, so much so that its size can be a couple of orders of magnitude larger than the corresponding LR(0) table. Not good.
Finding a way out is easy: we merge states that differ only in the expected symbol. For example: [{A = B ·, 0}, {B = C ·, 0}] and [{A = B ·, 1}, {B = C ·, 1}]. Why is this safe?
- When building a new state during a transition we do not use the expect symbol at all, so if s0 and s1 are identical in their core (the part of each item except the expected symbol), then the state s2 generated from s0 by X and the state s3 generated from s1 by X are also identical in their cores. Thus we can safely merge s0 and s1 into sn0, s2 and s3 into sn1, and create a transition between them. This is completely correct.
- The second part of the FSM table is the reductions. Here the states being merged carry different expect symbols, and each reduction is written only into the single cell of its corresponding expect symbol, so the cells do not clash and the states can be combined painlessly.
So in the end we get a table of practically the same size as LR(0), but without the grammar restrictions. (Strictly speaking, merging can introduce reduce-reduce conflicts for a few grammars that full LR(1) would handle; that is exactly the price of LALR(1).)
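As a sketch of that merging step, reusing the Item1 shape from the previous sketch (the coreKey serialization is just my way of grouping states; a real generator would also merge the transition rows of the table):

// Merge LR(1) states whose item cores coincide.
type LR1State = Item1[];

function coreKey(state: LR1State): string {
  // The core ignores `expect`; deduplicate and sort so that the
  // order and multiplicity of items do not matter.
  return [...new Set(state.map(i => `${i.rule}:${i.dot}`))]
    .sort().join(',');
}

function mergeStates(states: LR1State[]): LR1State[] {
  const byCore = new Map<string, LR1State>();
  for (const s of states) {
    const key = coreKey(s);
    const merged = byCore.get(key);
    if (!merged) {
      byCore.set(key, [...s]);
    } else {
      // Union of items: same core, possibly new expect symbols.
      for (const item of s) {
        if (!merged.some(m => m.rule === item.rule &&
                              m.dot === item.dot &&
                              m.expect === item.expect)) {
          merged.push(item);
        }
      }
    }
  }
  return [...byCore.values()];
}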
Parts of the article
- Part 1. Basic theory
- Part 2. Description of LR-generators
- Part 3. Features of writing and possible features of LR-generators