Compilation. 3: bison

This is the only post in the series, the center of attention of which is the old-believing sishny bison, which has become so annoying to some. Those who write not in C, the post should still be interesting, because generators of LR parsers, similar in their way of working, exist for very many languages. Those who ideologically reject LR-parsers, I have nothing to attract today.

Further in the post:

Grammar compilation
Two-level parser
What's inside?
Conflicts in grammar
How it works?

What have we been doing so far?
Last time we compiled a grammar for arithmetic expressions:

 EXPR: TERM |  EXPR '+' TERM |  EXPR '-' TERM;
 TERM: NUM |  TERM '*' NUM |  TERM '/' NUM;
 NUM: DIGIT |  NUM DIGIT;
 DIGIT: '0' |  '1' |  '2' |  '3' |  '4' |  '5' |  '6' |  '7' |  '8' |  '9' ;

We ended up looking intently at the automaton that parses it.
We do not know how to invent parsing machines themselves, and we don’t need to, because bison knows how to build them for us!
With the help of bison, we can compile our parser, and see how it all really works.

Grammar compilation

The general principle is the same as with flex in the first part : we list the grammar rules, and in front of each rule we write the sish code that will be executed during the rule folding.
')
Last time we mentioned that during the convolution we take out a set of characters (strings or variables) from the stack, look at their values, set the value for the coiled variable, and put it on the stack instead of the ones taken out.

%{
#include <stdio.h>
int yylex() { return getc( stdin ); }
void yyerror( char *s) {
fprintf ( stderr , " %s \n " , s);
}
%}

%%

EVALUATE: EXPR '\n' { printf( " %d \n " , $$ ) } ;

EXPR: TERM
| EXPR '+' TERM { $$ = $1 + $3 ; }
| EXPR '-' TERM { $$ = $1 - $3 ; }
;

TERM: NUM
| TERM '*' NUM { $$ = $1 * $3 ; }
| TERM '/' NUM { $$ = $1 / $3 ; }
;

NUM: DIGIT
| NUM DIGIT { $$ = $1 * 10 + $2 ; }
;

DIGIT: '0' { $$ = 0 ; } | '1' { $$ = 1 ; } | '2' { $$ = 2 ; } | '3' { $$ = 3 ; }
| '4' { $$ = 4 ; } | '5' { $$ = 5 ; } | '6' { $$ = 6 ; } | '7' { $$ = 7 ; }
| '8' { $$ = 8 ; } | '9' { $$ = 9 ; }
;

%%

Parsing code

Just like in lex files, the code in % {%} tags is copied into the parser unchanged. There are two functions that need to be defined: int yylex() returns the next input character, and void yyerror(char *s) prints an error message. Classic yacc included a ready-made implementation of yyerror , which could, if necessary, be redeclared; but its bison GNU-clone requires an explicit implementation from the programmer.

%% , as in lex files, divides the declaration area and the grammar rules area. The rules are listed in the same way as we are used to; at the end of the rule we add the convolution code. In the convolutional code, to refer to the character values on the parser stack, we use $ -tags. $$ refers to a collapsible variable (the left side of the rule), $1 to the leftmost character on the right side of the rule (the deepest in the parser stack), $2 to the second left, and so on. If the right side of the rule is N characters, then you can use values from $1 to $N
If the convolution code is not specified, and the right side of the rule is one character, then by default the bison “inherits” its value: $$=$1

The very first variable declared is considered the “root” of the grammar, i.e. all input text should eventually collapse into this root. EXPR would be suitable as a root, but then the printing of the calculated expression would have to be put in an EXPR convolution; which means that the value of each subexpression along the way would also be printed. For the convenience of printing we will create a new variable, EVALUATE , which is used only as a root. This at the same time makes it possible to read a line feed at the end of the expression - in addition to the convenience of printing, we get the convenience of input.

When compiling a parser, you need to specify the liby library, which contains the standard main() bison.

[tyomitch@home ~]$ bison 2.y
[tyomitch@home ~]$ cc -ly 2.tab.c
[tyomitch@home ~]$ ./a.out
22+3*4-5
=29
[tyomitch@home ~]$ ./a.out
2 + 2
syntax error

Unlike the market calculator, which was able to ignore spaces and comments, the bison calculator understands a strictly defined syntax: numbers, operation signs, newline at the end. To provide, for example, spaces between any pair of characters, we would have to explicitly add them to the grammar:

EXPR: TERM | EXPR WS '+' WS TERM | EXPR WS '-' WS TERM ;
TERM: NUM | TERM WS '*' WS NUM | TERM WS '/' WS NUM ;
NUM: DIGIT | NUM DIGIT ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
WS: | WS ' ' ;

Clearly this is inconvenient. And it is inconvenient to write for recognizing the digits of 10 different rules; and if we needed Latin letters, for example, in variable names, we would set 52 rules ?!

It would be great to combine the advantages of flex and bison ! It was easy to recognize simple expressions (numbers, spaces), and it was possible to recognize complex expressions.

Two-level parser

If we already have a flex parser that successfully recognizes numbers and deletes comments and spaces, then run the input text through it; and what we succeed in, we will drive through the advanced bison parser. In fact, there is no need to store an intermediate result: both steps can be completed in one pass. flex reads the input text character by character, occasionally passing tokens to the bison, the terminal symbols of the grammar. The values for tokens flex itself sets.

In order for such a symbiosis to be possible, flex must somehow recognize the terminal characters of the bison grammar. We will run bison with the -d key, then it will generate a .h file listing all terminals. To do this, you need to declare (indicating %token ) the terminals in the grammar file - and all that remains is to refer to the generated .h file in the .lex file.

%{
#include <stdio.h>
void yyerror( char *s) {
fprintf ( stderr , " %s \n " , s);
}
%}

%token NUM

%%

EVALUATE: EXPR { printf( "= %d \n " , $$ ); } ;

EXPR: TERM
| EXPR '+' TERM { $$ = $1 + $3 ; }
| EXPR '-' TERM { $$ = $1 - $3 ; }
;

TERM: NUM
| TERM '*' NUM { $$ = $1 * $3 ; }
| TERM '/' NUM { $$ = $1 / $3 ; }
;

%%

The function yylex is no longer needed: now the input characters will not come from stdin , but from flex .
In addition, erased '\n' after EXPR (it will be swallowed by flex ), and removed all the rules about NUM and DIGIT .

Relevant .lex file:

%{
#include "3.tab.h"
%}

%option yylineno
%option noyywrap

%%

[/][/] .*\n ; // comment
[0-9] + { yylval = atoi(yytext);
return NUM;
}
[ \t\r\n] ; // whitespace
. { return * yytext ; }

%%

The file with token definitions gets the suffix .tab.h
The only thing in it is - #define NUM 258
All tokens receive numbers higher than 256 to differ from "ordinary" characters.

To transfer the token to the bison, we write its value to the global (oh, horror!) Variable yylval , and return the token code.
We transfer ordinary characters in the usual way (we return the character code).

The noyywrap option indicates that the input text is one, and after reading EOF do not need to try to go to the next text. This option was set automatically while we were using %option main , which reads from stdin . Now main() will be bison, so we do not need to ask flex standard main() , nor write our own.

Compile a two-stage parser:

[tyomitch@home ~]$ lex 3.lex
[tyomitch@home ~]$ bison -d 3.y
[tyomitch@home ~]$ cc -ly lex.yy.c 3.tab.c
[tyomitch@home ~]$ ./a.out
22+ // hello
3*4 - 5
=29
[tyomitch@home ~]$ ./a.out
22+x
syntax error

Regarding the terminology: in such a two-step model, the lower parser that recognizes tokens is traditionally called a lexical analyzer (“lexer”), and the upper one, recognizing constructions of tokens, is called a parser . This is an indication of the role of the parser, and not its devices: other parsing systems can, for example, use shop-automatic parsers for both stages.

What's inside?

To see the generated automaton, you do not need to dive into the depths of the code: the bison has powerful tools for debugging grammars. Specify the key -v , and look at the file with the suffix .output .
After transferring the parsing of numbers to a lexer, there are 14 states left in the machine, and they are described like this:

 ...

 state 7

     4 EXPR: EXPR '-'.  TERM

     NUM shift, and go to state 1

     TERM go to state 11


 state 8

     6 TERM: TERM '*'.  NUM

     NUM shift, and go to state 12


 state 9

     7 TERM: TERM '/'.  NUM

     NUM shift, and go to state 13


 state 10

     3 EXPR: EXPR '+' TERM.
     6 TERM: TERM.  '*' NUM
     7 |  TERM.  '/' NUM

     '*' shift, and go to state 8
     '/' shift, and go to state 9

     $ default reduce using rule 3 (EXPR)

 ...

For each state, the rules that are recognized in this state (along with their grammar numbers) are indicated, and the actions performed for each input symbol are listed. The following state after convolution is not indicated; instead, the automaton returns to the state read from the stack, and in it searches for the “go to” rule corresponding to the newly-flipped non-terminal. Thus, the automaton transition table is two-dimensional: in each state, the action depends only on the input symbol, and does not depend on the contents of the stack. (But the following state is taken from the stack after convolution.)

In order not to crawl with a pencil on the printout of the machine, substituting into it symbol by symbol, you can ask the bison to print all transitions from state to state during parsing. To do this, we compile the grammar with the -t key, and the global flag yydebug will appear in the parser. It must be set to 1 - for example, in main() .
If, in addition, we want the character values to be printed, then we need to define the macro YYPRINT :

%{
#include <stdio.h>
void yyerror( char *s) {
fprintf ( stderr , " %s \n " , s);
}
#define YYPRINT(file, type, value) fprintf(file, " %d " , value);
%}

%token NUM

%%

EVALUATE: EXPR { printf( "= %d \n " , $$ ); } ;

EXPR: TERM
| EXPR '+' TERM { $$ = $1 + $3 ; }
| EXPR '-' TERM { $$ = $1 - $3 ; }
;

TERM: NUM
| TERM '*' NUM { $$ = $1 * $3 ; }
| TERM '/' NUM { $$ = $1 / $3 ; }
;

%%
int main () { yydebug= 1 ; return yyparse(); }

The code after the second %% tag is copied into the parser unchanged, just as if it were in % {%} .
Now, since we have defined main() ourselves, it is no longer necessary to connect liby when compiling:

[tyomitch@home ~]$ lex 3.lex
[tyomitch@home ~]$ bison -dt 3.y
[tyomitch@home ~]$ cc lex.yy.c 3.tab.c
[tyomitch@home ~]$ ./a.out
Starting parse
Entering state 0
Reading a token: 22+3*4-5
Next token is token NUM (22)
Shifting token NUM, Entering state 1
Reducing stack by rule 5 (line 20), NUM -> TERM
Stack now 0
Entering state 4
Reading a token: Next token is token '+' (22)
Reducing stack by rule 2 (line 15), TERM -> EXPR
Stack now 0
Entering state 3
Next token is token '+' (22)
Shifting token '+', Entering state 6
Reading a token: Next token is token NUM (3)
Shifting token NUM, Entering state 1
Reducing stack by rule 5 (line 20), NUM -> TERM
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token '*' (3)
Shifting token '*', Entering state 8
Reading a token: Next token is token NUM (4)
Shifting token NUM, Entering state 12
Reducing stack by rule 6 (line 21), TERM '*' NUM -> TERM
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token '-' (4)
Reducing stack by rule 3 (line 16), EXPR '+' TERM -> EXPR
Stack now 0
Entering state 3
Next token is token '-' (4)
Shifting token '-', Entering state 7
Reading a token: Next token is token NUM (5)
Shifting token NUM, Entering state 1
Reducing stack by rule 5 (line 20), NUM -> TERM
Stack now 0 3 7
Entering state 11
Reading a token: Now at end of input.
Reducing stack by rule 4 (line 17), EXPR '-' TERM -> EXPR
Stack now 0
Entering state 3
Now at end of input.
Reducing stack by rule 1 (line 13), EXPR -> EVALUATE
=29
Stack now 0
Entering state 2
Now at end of input.

Only states are printed from the stack; the types of characters in the stack, and their meanings, are left to guess from the context.
If the YYPRINT macro YYPRINT not specified, then it is necessary to guess the values of the read tokens: the bison will print only empty brackets.

Conflicts in grammar

Last time, ambiguous grammars were mentioned, allowing for several expressions to be parsed.
What will a bison say if we try to compile an ambiguous grammar?

%{
#include <stdio.h>
void yyerror( char *s) {
fprintf ( stderr , " %s \n " , s);
}
%}

%token NUM

%%

EVALUATE: EXPR { printf( "= %d \n " , $$ ) } ;

EXPR: NUM
| EXPR '+' EXPR { $$ = $1 + $3 ; }
| EXPR '-' EXPR { $$ = $1 - $3 ; }
| EXPR '*' EXPR { $$ = $1 * $3 ; }
| EXPR '/' EXPR { $$ = $1 / $3 ; }
;

%%

[tyomitch@home ~]$ bison 4.y
4.y: conflicts: 16 shift/reduce

When the grammar allows several continuations from one state, it is not clear to the buffalo what to perform. In our case, it oscillates between shift and convolution. You can correct the grammar, as we did last time; or you can “push” the buffalo in the right direction, and hint what to do in the event of a conflict. It is necessary to treat this as a quick hack: the parser starts working, but it becomes harder to debug the “grammar with hints”.

Since the typical source of conflicts is arithmetic expressions, hints are given in the form of an indication of the precedence of operators (to perform multiplication before addition) and their associativity (from operators with equal priority, which they should perform before). The lower the operator in the list, the higher the priority. Operators in one line of the list receive the same priority.

%{
#include <stdio.h>
void yyerror( char *s) {
fprintf ( stderr , " %s \n " , s);
}
%}

%token NUM

%left '+' '-'
%left '*' '/'

%%

EVALUATE: EXPR { printf( "= %d \n " , $$ ) } ;

EXPR: NUM
| EXPR '+' EXPR { $$ = $1 + $3 ; }
| EXPR '-' EXPR { $$ = $1 - $3 ; }
| EXPR '*' EXPR { $$ = $1 * $3 ; }
| EXPR '/' EXPR { $$ = $1 / $3 ; }
;

%%

For %right associative operators, there is a %right directive.
The unnatural nature of hack with priorities can be assessed by the example of ambiguity if (1) if (2) foo(); else bar(); if (1) if (2) foo(); else bar();
So that it breaks up in the usual way - if (1) { if (2) foo(); else bar(); } if (1) { if (2) foo(); else bar(); } if (1) { if (2) foo(); else bar(); } - it is necessary that the priority of else be higher than that of ')'
Both of these terminals are hard to call operators, and it is even more difficult to give them a “natural” priority.
But it works!

"Grammar with hints" is more compact and unambiguous in its original form (shorter than twice), and in the form of an automaton (saved one state).

Even in the unambiguous grammar, there may be conflicts related to the principle of operation of the slot machine: during the execution of each action, he sees only one next character of input. For example, grammar

 WORD: S1 'a' 'i' 'l' |  S2 'a' 'l' 'e';
 S1: 's';
 S2: 's';

- unambiguous, and it corresponds to only two words - sail and sale. When the parser has shifted the first letter of 's' and sees after it of 'a', it cannot make a choice, collapse S1 or S2 : the correct convolution depends on what letter will be after 'a'; but its machine does not yet see.
This is the second reason why the parser is made in two steps: due to the fact that the lexer compresses the strings into tokens and discards unnecessary characters between tokens, the LR parser succeeds in looking “further”: not one letter, but one token ahead.

How it works?

Like flex , the core of the parser is a transition table and a small loop. Two parallel stacks are used: the yyssa state yyssa and the yyvsa value yyvsa — all the same, the states and characters enter and leave the stack always in pairs.

Like last time, symbols that are identical from the point of view of the parser are combined into classes. In this case, classes 7, and they are listed in the .output file. The static const unsigned char yytranslate[259] array static const unsigned char yytranslate[259] matches all terminals with a class. (From 0 to 255 - ordinary characters; 258 - terminal NUM declared by us).

Conversion tables are cleverly combined. The main table stores descriptions of actions: for a shift (a positive number) - which state to go to; for convolution (negative) - according to what rule to collapse.

 static const unsigned char yytable [] =
 {
        6, 7, 5, 8, 9, 10, 11, 1, 12, 13
 };

It is surprising that the table is not only one-dimensional, but even the elements in it are smaller than our states (there are 14).
The trick is that the index of the first action for each state is stored in a separate table:

 #define YYPACT_NINF -5
 static const yysigned_char yypact [] =
 {
        4, -5, 2, -4, -3, -5, 4, 4, 5, 6,
       -3, -3, -5, -5
 };

YYPACT_NINF means that no yytable element yytable ; in other words, the action performed does not depend on the input character.
The default actions for each state are stored in another separate table:

 static const unsigned char yydefact [] =
 {
        0, 6, 0, 2, 3, 1, 0, 0, 0, 0,
        4, 5, 7, 8
 };

Only the execution of the convolution can be independent of the input character, so the values in yydefact are the numbers of the grammar rules by which to fold.

By index from yypact[n] action is stored for state n and for character class 0. The action for character class k is stored in yytable[yypact[n]+k] ; therefore, in yypact can be negative indices - this is just a “base” to which the class number will be added.

To check to which symbol each action belongs in the yytable , there is another table:

 static const unsigned char yycheck [] =
 {
        4, 5, 0, 6, 7, 6, 7, 3, 3, 3
 };

What do we see? yytable[0] refers to class 4 characters, yytable[1] to class 5 characters, and so on.
Let's try to find some action, for example, those listed above:

 ...

 state 7

     4 EXPR: EXPR '-'.  TERM

     NUM shift, and go to state 1

     TERM go to state 11


 state 8

     6 TERM: TERM '*'.  NUM

     NUM shift, and go to state 12

 ...

Terminal class NUM - 3.
We are looking for the first shift: yytable[yypact[7]+3]==yytable[4+3]==1 (and indeed, yycheck[yypact[7]+3]==3 )
Second shift: yytable[yypact[8]+3]==yytable[5+3]==12 (and indeed, yycheck[yypact[7]+3]==3 )

Similarly, the “go to” table is broken into three arrays, which specifies which state to go to after the convolution.

Parser code itself: (uninteresting pieces are cut, and interesting ones are commented)

yynewstate :
*++yyssp = yystate; //

yyn = yypact[yystate]; //
if (yyn == YYPACT_NINF)
goto yydefault; //

yychar = yylex(); //
yytoken = yytranslate[yychar]; //
yyn += yytoken; // yytable
if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken)
goto yydefault; //
yyn = yytable[yyn];

if (yyn <= 0 ) {
yyn = -yyn;
goto yyreduce;
}

*++yyvsp = yylval; //
yystate = yyn; //
goto yynewstate;

yydefault :
yyn = yydefact[yystate];
if (yyn == 0 ) // :
goto yyerrlab; // ,
//
yyreduce :
yylen = yyr2[yyn]; //

yyval = yyvsp[ 1 -yylen]; // : $1

// :
// $- yyval yyvsp[]

switch (yyn) {
case 2 :
{ printf( "= %d \n " , yyval); }
break ;

case 4 :
{ yyval = yyvsp[- 2 ] + yyvsp[ 0 ]; }
break ;

case 5 :
{ yyval = yyvsp[- 2 ] - yyvsp[ 0 ]; }
break ;

case 7 :
{ yyval = yyvsp[- 2 ] * yyvsp[ 0 ]; }
break ;

case 8 :
{ yyval = yyvsp[- 2 ] / yyvsp[ 0 ]; }
break ;
}

yyvsp -= yylen; //
yyssp -= yylen;

*++yyvsp = yyval; //

yyn = yyr1[yyn]; //

//
yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
if ( 0 <= yystate && yystate <= YYLAST && yycheck[yystate] == *yyssp)
yystate = yytable[yystate];
else
yystate = yydefgoto[yyn - YYNTOKENS];

goto yynewstate;

Again, we see: the entire parser, along with the evaluation of the expression, fit into a couple of pages of code; and even then, its third is a tricky search on compressed tables.

Next time let's do a parsing of a toy programming language .
The dignity of a bison and others like it is that from complicating a language, only tables and a switch with convolution actions copied from a .y file will grow in the parser.
The rest of the parser code is universal: no spaghetti, no recursive functions that call each other in tricky combinations. The rules of grammar are really compiled , and not framed in the syntax of the high level language.

Source: https://habr.com/ru/post/99366/

All Articles