Compilation. 2: grammar

In the previous post there was a lot of code and, according to some opinions, not enough explanations. We will alternate: this time there will be a lot of theory, and the practice will almost never come.

Further in the post:

Vending machine
Formal grammar
LR parsing

Vending machine

Last time, we identified a problem with regular expressions and the construction of DFAs for them to parse the text: it is impossible to determine whether or not the nesting level is located in the text of the bracket. With the help of regexp it is impossible to solve even a much narrower task: check in a string of type {*} * whether the number { number } is equal. The proof is by contradiction: suppose that such a regexp can be created. So, you can build on it a DFA, which will distinguish suitable strings from inappropriate ones. If in the resulting DFA there are N states, then an automaton of the type { ^{N +1} (i.e. " N +1 times { ") will have the machine in one of the states more than once ("looped"). So, by adding new characters to the beginning of the line { , you can drive the automaton through the same cycle once again - and he will not notice the substitution. As a result, the machine will not be able to distinguish the appropriate string { ^{N +1} } ^{N +1} from the unsuitable string { ^{N + M +1} } ^{N +1} . ( M is the length of the cycle detected in the automaton; it is exactly that { add to the beginning of the line.)

In other words, DKA cannot recognize nested constructions due to the fact that it does not have enough memory: all that it remembers is only one value from 1 to N (the number of the current state). We will need a more capacious automaton whose memory can grow indefinitely as necessary.
')

An automaton with a store memory , in addition to the current state, has a stack where it can write characters. At each step, the machine reads the next input character, plus the top character from the stack. In accordance with the triple (the current state, the input character, a character from the stack), the automaton chooses which state to go to and what to write to the stack instead of the read character. Writing to the stack is optional: you can simply erase the read character from there; so the stack can grow and shrink during operation.

Regarding the terminology: the vending machine is not related to vending machines in stores. Strangely enough, the slot machine is named after the machine shops (see fig.), Which are the classic implementation of the stack: access is possible only to the ~~topmost element of the~~ cartridge.

Now we can solve the original problem: check whether it is equal in the { and } line . Reading each { , we write it to the stack. Reading each } , erase one from the stack { .

 state 1: consider {
              stack symbol: {EOF (stack is empty)
 input character:
       {write down {{stay in 1 write down {, stay in 1
       } do not write anything, go to line 2 does not fit (1)
      EOF string does not fit (2) string matches (3)


 state 2: believe}
              stack symbol: {EOF (stack is empty)
 input character:
       {line does not fit (4) line does not fit (4)
       } do not write anything, stay in line 2 does not fit (5)
      EOF string does not fit (6) string matches (7)

Notes:

line starts with }
there are several { at the beginning of the line, but there is not a single one after them }
string is empty
after } the line is again {
} more than { (the stack is over, but no input)
} less than { (input ended, but no stack)
equally { and } (the stack and input ended at the same time)

In addition to the capacious automaton, we still need a new way to specify the syntax - more flexible than regexps.

Formal grammar

To define language constructs that support arbitrary nesting, they began to use "inference rules" consisting of characters ("terminals") and variables ("non-terminals"), for example, for the above problem of nested brackets:

 VALID: '{' VALID '}';
 VALID:;

On the left, before the colon - variable; on the right is a piece of text that can come out of it, and which can contain variables, including recursive references to the variable being defined.

To save space, alternative definitions of the same variable can be divided | :

 VALID: '{' VALID '}' |  ;

A more meaningful example is the grammar of arithmetic expressions:

 EXPR: NUM |  EXPR OP EXPR;
 NUM: DIGIT |  NUM DIGIT;
 DIGIT: '0' |  '1' |  '2' |  '3' |  '4' |  '5' |  '6' |  '7' |  '8' |  '9' ;
 OP: '+' |  '-' |  '*' |  '/';

Grammars of this type are called context-free , indicating that variable definitions are context- independent: for example, NUM always matches a string of decimal digits, even if the number was preceded by a '0x'. In practice, context-sensitive grammars are not used, because there are no effective ways to parse them.

So, in accordance with our grammar, the expression 22+3*4-5 would be recognized like this:

   '2' '2' '+' '3' '*' '4' '-' '5'
 DIGIT DIGIT OP DIGIT OP DIGIT OP DIGIT
  NUM DIGIT OP NUM OP NUM OP NUM
    NUM OP EXPR OP EXPR OP EXPR
     EXPR OP EXPR OP EXPR
            EXPR OP EXPR
                     EXPR

And maybe so:

   '2' '2' '+' '3' '*' '4' '-' '5'
 DIGIT DIGIT OP DIGIT OP DIGIT OP DIGIT
  NUM DIGIT OP NUM OP NUM OP NUM
    NUM OP EXPR OP EXPR OP EXPR
     EXPR OP EXPR OP EXPR
            EXPR OP EXPR
                             EXPR

In the first case, the expression is recognized as the product of the sum on the left and the difference on the right; in the second - in accordance with the rules of arithmetic. There are other options, except for the two.

When one expression has several “parse trees”, the question arises: which of them should the parser choose?

There was a similar problem with regular expressions: if you apply (. +) [] (. +) To the line “quick brown fox”, then you could get either $ 1 = “quick brown” and $ 2 = “fox”, or $ 1 = “ quick ”and $ 2 =“ brown fox ”. Regarding regular expressions, they agreed that they act “greedily”, and, reading a line from left to right, they take as many suitable characters as possible into each sub-expression. (So we’ll actually get “quick brown” and “fox”).

There is no such agreement with grammars. If the grammar permits ambiguous analysis, then this is a bad, worthless grammar, and you need to come up with another one. For example, the grammar for the calculator from the market, which performs all actions from left to right, would be:

EXPR: NUM | EXPR OP NUM ;
NUM: DIGIT | NUM DIGIT ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
OP: '+' | '-' | '*' | '/' ;

Now, in the EXPR definition, the second operand is always a bare number, and therefore the parsing can only be like this:

   '2' '2' '+' '3' '*' '4' '-' '5'
 DIGIT DIGIT OP DIGIT OP DIGIT OP DIGIT
  NUM DIGIT OP NUM OP NUM OP NUM
    NUM OP NUM OP NUM OP NUM
     EXPR OP NUM OP NUM OP NUM
            EXPR OP NUM OP NUM
                     EXPR OP NUM
                             EXPR

If, instead, we want to observe the priority of multiplication over addition, then we need to define EXPR as the sum of the products :

EXPR: TERM | EXPR '+' TERM | EXPR '-' TERM ;
TERM: NUM | TERM '*' NUM | TERM '/' NUM ;
NUM: DIGIT | NUM DIGIT ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

'2' '2' '+' '3' '*' '4' '-' '5'
DIGIT DIGIT '+' DIGIT '*' DIGIT '-' DIGIT
NUM DIGIT '+' NUM '*' NUM '-' NUM
NUM '+' TERM '*' NUM '-' TERM
TERM '+' TERM '-' TERM
EXPR '+' TERM '-' TERM
EXPR '-' TERM
EXPR

In programming languages, similar ambiguities are often found, and in C ++, right at every step: the commentators mentioned a wonderful article “A Rare Profession” about developing a C ++ compiler from beginning to end.
Not time to delve into specifics, but the simplest example is if (1) if (2) foo(); else bar(); if (1) if (2) foo(); else bar(); - can be constructed either as if (1) { if (2) foo(); else bar(); } if (1) { if (2) foo(); else bar(); } if (1) { if (2) foo(); else bar(); } , or as if (1) { if (2) foo(); } else bar(); if (1) { if (2) foo(); } else bar(); , if the grammar maker does not take care to ban one of the interpretations.
Compiling grammars without ambiguities is an important skill.

LR parsing

How to make the vending machine recognize the grammar?
It will read the input string character by character, and write ( shift ) the read characters onto the stack. Once at the top of the stack there is a sequence (of characters and variables) that matches the grammar rule, the machine will push all of it out of the stack and replace it with a variable on the left side of the matched rule ( convolution ). The whole work of the automaton consists in a sequence of shifts and convolutions.

An interesting point: the automaton doesn’t really matter which characters are on the stack. Anyway, he cannot compare them with the rules of grammar, because he sees only the upper one; instead, he chooses which rule to apply for the convolution, according to its current state. He needs the stack then to know what state to go after convolution. To do this, during the shift, he writes to the stack, together with the symbol, his current state; and during the convolution it takes from the stack the state written under all the erased characters and, depending on it, goes to the next state.

The captious habraiser will notice that the automatic machine is not exactly turning out: at each step not one element is erased from the stack, but several at once; and look not at the top element of the stack, but at the one that is erased. In addition, during the convolution, the automaton leaves the input symbol unread. Theoretically, such an “extended” automaton can be converted to a “standard” form, creating a bunch of additional states for clearing the stack. In practice, the implementation of both options is quite similar: exactly the same cycle, stack and transition table. To save space for the table of transitions, use the "extended" machine, without inflating his state in vain.

To recognize our last grammar (arithmetic expressions with the correct priority of operations), we need an automaton from 11 states:

 state 1:
   number: shift, go to 2
 state 2:
   remove 1 character from stack, write DIGIT there
   if the state on the stack is 1 or 7 or 10, go to 3
   if the state on the stack is 4 or 9, go to 5
 state 3:
   remove 1 character from stack, write NUM there
   if the state on the stack is 1 or 10, go to 4
   if the state on the stack is 7, go to 9
 state 4:
   number: shift, go to 2
   otherwise: remove 1 character from the stack, write TERM there
          if the state on the stack is 1, go to 6
          if the state on the stack is 10, go to 11
 state 5:
   remove 2 characters from stack, write NUM there
   if the state on the stack is 1, go to 4
   if the state on the stack is 7, go to 9
 state 6:
   '*', '/': shift, go to 7
   otherwise: remove 1 character from the stack, write EXPR there
          if the state on the stack is 1, go to 8
 state 7:
   number: shift, go to 2
 state 8:
   EOF: expression recognized successfully
   '+', '-': shift, go to 10
 state 9:
   number: shift, go to 2
   otherwise: remove 3 characters from the stack, write TERM there
          if the state on the stack is 1, go to 6
          if the state on the stack is 10, go to 11
 state 10: EXPR shift: EXPR '+' TERM or shift EXPR: EXPR '+' TERM
   number: shift, go to 2
 state 11:
   '*', '/': shift, go to 7
   otherwise: remove 3 characters from the stack, write EXPR there
          if the state on the stack is 1, go to 8

We will not dwell on where this automaton came from; suppose it brought us a UFO.
It’s better to see how it works: let's skip through it all the same expression 22+3*4-5 :

 state 1, input "22 + 3 * 4-5", the stack is empty
 * number: shift, go to 2
 state 2, input "2 + 3 * 4-5", stack '2', (1)
 * remove 1 character from the stack, write DIGIT there, go to 3
 state 3, input "2 + 3 * 4-5", stack DIGIT, (1)
 * remove 1 character from the stack, write NUM there, go to 4
 state 4, input "2 + 3 * 4-5", stack NUM, (1)
 * number: shift, go to 2
 state 2, input "+ 3 * 4-5", stack '2', (4), NUM, (1)
 * remove 1 character from the stack, write DIGIT there, go to 5
 state 5, input "+ 3 * 4-5", stack DIGIT, (4), NUM, (1)
 * remove 2 characters from the stack, write NUM there, go to 4
 state 4, input "+ 3 * 4-5", stack NUM, (1)
 * remove 1 character from the stack, write TERM there, go to 6
 state 6, input "+ 3 * 4-5", stack TERM, (1)
 * remove 1 character from the stack, write EXPR there, go to 8
 state 8, input "+ 3 * 4-5", EXPR stack, (1)
 * '+': shift, go to 10
 state 10, input "3 * 4-5", stack '+', (8), EXPR, (1)
 * number: shift, go to 2
 state 2, input "* 4-5", stack '3', (10), '+', (8), EXPR, (1)
 * remove 1 character from the stack, write DIGIT there, go to 3
 state 3, input "* 4-5", stack DIGIT, (10), '+', (8), EXPR, (1)
 * remove 1 character from the stack, write NUM there, go to 4
 state 4, input "* 4-5", stack NUM, (10), '+', (8), EXPR, (1)
 * remove 1 character from the stack, write TERM there, go to 11
 state 11, input "* 4-5", stack TERM, (10), '+', (8), EXPR, (1)
 * '*': shift, go to 7
 state 7, input 4-5, stack '*', (11), TERM, (10), '+', (8), EXPR, (1)
 * number: shift, go to 2
 state 2, input "-5", stack '4', (7), '*', (11), TERM, (10), '+', (8), EXPR, (1)
 * remove 1 character from the stack, write DIGIT there, go to 3
 state 3, input "-5", stack DIGIT, (7), '*', (11), TERM, (10), '+', (8), EXPR, (1)
 * remove 1 character from the stack, write NUM there, go to 9
 state 9, input "-5", stack NUM, (7), '*', (11), TERM, (10), '+', (8), EXPR, (1)
 * remove 3 characters from the stack, write TERM there, go to 11
 state 11, input "-5", stack TERM, (10), '+', (8), EXPR, (1)
 * remove 3 characters from the stack, write EXPR there, go to 8
 state 8, input "-5", EXPR stack, (1)
 * '-': shift, go to 10
 state 10, input "5", stack '-', (8), EXPR, (1)
 * number: shift, go to 2
 state 2, input is empty, stack is '5', (10), '-', (8), EXPR, (1)
 * remove 1 character from the stack, write DIGIT there, go to 3
 state 3, input is empty, stack DIGIT, (10), '-', (8), EXPR, (1)
 * remove 1 character from the stack, write NUM there, go to 4
 state 4, input is empty, stack NUM, (10), '-', (8), EXPR, (1)
 * remove 1 character from the stack, write TERM there, go to 11
 state 11, input is empty, stack TERM, (10), '-', (8), EXPR, (1)
 * remove 3 characters from the stack, write EXPR there, go to 8
 state 8, input is empty, EXPR stack, (1)
 * EOF: expression recognized successfully

In fact, our automaton determines whether the input string is a valid arithmetic expression. If not, then in one of the states there will be no action suitable for the next input symbol; and then the machine will report a syntax error.
We can use the same automaton and calculate expressions: in each symbol in the stack we will store its mathematical value; during convolution, we calculate the value of the new symbol based on those that are removed from the stack.
Get an advanced calculator, which takes into account the priority of operations:

 state 1, input "22 + 3 * 4-5", the stack is empty
 * number: shift, go to 2
 state 2, input "2 + 3 * 4-5", stack '2', (1)
 * remove '2' from the stack, write DIGIT = 2 there, go to 3
 state 3, input "2 + 3 * 4-5", stack DIGIT = 2, (1)
 * remove DIGIT = 2 from the stack, write NUM = 2 there, go to 4
 state 4, input "2 + 3 * 4-5", stack NUM = 2, (1)
 * number: shift, go to 2
 state 2, input "+ 3 * 4-5", stack '2', (4), NUM = 2, (1)
 * remove the '2' character from the stack, write DIGIT = 2 there, go to 5
 state 5, input "+ 3 * 4-5", stack DIGIT = 2, (4), NUM = 2, (1)
 * remove DIGIT = 2, NUM = 2 from the stack, write NUM = 22 there, go to 4
 state 4, input "+ 3 * 4-5", stack NUM = 22, (1)
 * remove NUM = 22 from the stack, write TERM = 22 there, go to 6
 state 6, input "+ 3 * 4-5", stack TERM, (1)
 * remove TERM = 22 from the stack, write EXPR = 22 there, go to 8
 state 8, input "+ 3 * 4-5", stack EXPR = 22, (1)
 * '+': shift, go to 10
 state 10, input "3 * 4-5", stack '+', (8), EXPR = 22, (1)
 * number: shift, go to 2
 state 2, input "* 4-5", stack '3', (10), '+', (8), EXPR = 22, (1)
 * remove '3' from the stack, write there DIGIT = 3, go to 3
 state 3, input "* 4-5", stack DIGIT = 3, (10), '+', (8), EXPR = 22, (1)
 * remove DIGIT = 3 from the stack, write NUM = 3 there, go to 4
 state 4, input "* 4-5", stack NUM = 3, (10), '+', (8), EXPR = 22, (1)
 * remove NUM = 3 character from the stack, write TERM = 3 there, go to 11
 state 11, input "* 4-5", stack TERM = 3, (10), '+', (8), EXPR = 22, (1)
 * '*': shift, go to 7
 state 7, input “4-5”, stack '*', (11), TERM = 3, (10), '+', (8), EXPR = 22, (1)
 * number: shift, go to 2
 state 2, input "-5", stack '4', (7), '*', (11), TERM = 3, (10), '+', (8), EXPR = 22, (1)
 * remove '4' from the stack, write there DIGIT = 4, go to 3
 state 3, input "-5", stack DIGIT = 4, (7), '*', (11), TERM = 3, (10), '+', (8), EXPR = 22, (1)
 * remove DIGIT = 4 character from the stack, write NUM = 4 there, go to 9
 state 9, input "-5", stack NUM = 4, (7), '*', (11), TERM = 3, (10), '+', (8), EXPR = 22, (1)
 * remove NUM = 4, '*', TERM = 3 from the stack, write TERM = 12 there, go to 11
 state 11, input "-5", stack TERM = 12, (10), '+', (8), EXPR = 22, (1)
 * remove TERM = 12, '+', EXPR = 22 from the stack, write EXPR = 34 there, go to 8
 state 8, input "-5", stack EXPR = 34, (1)
 * '-': shift, go to 10
 state 10, input "5", stack '-', (8), EXPR = 34, (1)
 * number: shift, go to 2
 state 2, input is empty, stack is '5', (10), '-', (8), EXPR = 34, (1)
 * remove '5' from the stack, write DIGIT = 5 there, go to 3
 state 3, input is empty, stack DIGIT = 5, (10), '-', (8), EXPR = 34, (1)
 * remove DIGIT = 5 from the stack, write NUM = 5 there, go to 4
 state 4, input is empty, stack NUM = 5, (10), '-', (8), EXPR = 34, (1)
 * remove NUM = 5 from the stack, write TERM = 5 there, go to 11
 state 11, input is empty, stack TERM = 5, (10), '-', (8), EXPR = 34, (1)
 * remove TERM = 5, '-', EXPR = 34 from the stack, write EXPR = 29 there, go to 8
 state 8, input is empty, stack EXPR = 29, (1)
 * EOF: expression recognized successfully, value = 29

The parser in the real compiler is similar, only there the expression is not calculated in each bundle, and the corresponding code is generated.

The described method of parsing is called ascending : we start with the recognition of the smallest constructions, and, connecting them together, we get more and more. There is an opposite, downward approach: we start with the “all text” construction, and we split it up, recognizing ever smaller substructures. As it turned out, there are slack holivors between the proponents of the two approaches: each side believes that its approach is more natural, closer to the recognition of syntax by man, and therefore easier to debug.

Next time we will get acquainted with the bison , which according to the grammar itself builds a recognizing LR machine, and then we can compile our wonderful calculator.

Source: https://habr.com/ru/post/99298/

All Articles

Compilation. 2: grammar

Further in the post:

Vending machine

Formal grammar

LR parsing

More articles: