
Compilation. 1: lexer

I have always been fascinated by the mystery of how a program is born. Unfortunately, Russian universities pay little attention to this interesting topic. I hope to write a series of posts in which we will gradually create a small, efficient compiler.

The first posts of the series have already been written and beta-tested in one small, tightly closed community. Nevertheless, I will keep revising them, taking the wishes of the esteemed community into account.

In this post:

  1. Why write compilers?
  2. Overall plan
  3. Text analysis
  4. Practical example
  5. How does it work?

Why write compilers?


Haven't all the necessary programming languages already been written? Who, in our enlightened age, could need to write a compiler of their own?

Everything that can be invented has been invented.
-- Charles H. Duell, Director of US Patent Office, 1899 (attributed)

What has been will be again, and what has been done will be done again; there is nothing new under the sun.
-- Ecclesiastes 1:9 (c. 3rd century BC)

First, languages are constantly being created and developed. Among the currently popular ones, Ruby, PHP, Java and C# were all created within living memory and came into heavy use only a few years ago. Right now Microsoft is pushing its new F# language, and, given Microsoft's weight, it will surely also end up among the most widely used.
There are still niches for new languages: for example, attempts to invent a convenient imperative language for parallel programming continue.
Secondly, the techniques used in compilation (above all, grammar parsing) have many other applications. Source-to-source transformations are often needed (refactoring, translating code into another language, etc.), where you must parse text in a programming language, process it, and output the processed text (in the same or a different language).

Admittedly, compilation is a rather exotic task for a programmer, somewhere alongside programming huge humanoid combat robots. Still, all of this has practical applications, unlike many other extravagant programmer hobbies.

Overall plan


The compilation process, in essence, consists of two main stages:
  1. Source text analysis
  2. Machine code generation
At the first stage, a representation of the program is built that is convenient to work with later: usually a tree, but not necessarily. At the second stage, the compiler walks that tree and generates the final code for each node.

The most challenging part of a compiler is code optimization. At the first stage, high-level optimization is performed on tree nodes (for example, loops are unrolled and inline functions are expanded); at the second, low-level optimization is performed on the instruction stream (for example, instructions are reordered to keep the pipelines of a particular processor as full as possible). University courses traditionally never get as far as optimization; but in our example we will try to implement the simplest kinds (copy elimination, constant propagation).
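
To make those two terms concrete, here is a hand-written before/after sketch (an illustration of the idea, not actual compiler output):

 /* Before optimization: */
 int f(void)
 {
     int a = 5;     /* a is a known constant      */
     int b = a;     /* b is a mere copy of a      */
     return b + 2;  /* only the copy is ever used */
 }

 /* Copy elimination replaces b with a; constant propagation then
    replaces a with 5; folding 5 + 2 at compile time leaves just: */
 int f_optimized(void)
 {
     return 7;
 }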

I have seen older posts about parsing here on Habr, but their authors never got as far as code generation.

Text analysis


When a novice programmer tries to write a text parser, the natural approach is to recurse into the text: find the beginning of a construct (for example, { ); find its end (for example, the matching } at the same nesting level); extract the contents of the construct, and parse it recursively.
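
A minimal sketch of this naive approach in C (hand-written for illustration; the function name is invented):

 /* Given text[pos] == '{', return the index of the matching '}'
    at the same nesting level, or -1 if the braces are unbalanced. */
 long find_matching_brace(const char *text, long pos)
 {
     int depth = 0;
     for (; text[pos] != '\0'; pos++) {
         if (text[pos] == '{')
             depth++;
         else if (text[pos] == '}') {
             depth--;
             if (depth == 0)
                 return pos;  /* back at the starting level */
         }
     }
     return -1;
 }

The caller would then take the text between the braces and scan it all over again, recursively; that repeated re-scanning is exactly the first problem below.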

This approach has two problems: first, excessive complexity (we walk back and forth over the same piece of text); second, poor maintainability (the language syntax ends up smeared across kilobytes and kilobytes of branching code).

The syntax of a language can instead be specified declaratively; for example, everyone knows regular expressions. Ideally, we would write a pile of regexps, one for each language construct, and next to each one a description of the node to create in the program tree; a “universal parser” would then simply match the program against one regexp after another and create nodes according to the descriptions, one by one.

The first of these problems is solved by the fact that matching all the regexps against the text can be done in a single pass, so there is no need to keep the entire program in memory: just read it one character at a time, process the character, and forget it.
The second is solved because we now have a centralized, formal description of the language: we can change the regexps without touching the code, and conversely, change the code without risking breaking the parser.
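
A hedged sketch of what such a declarative description might look like (the table layout and all names here are invented for illustration):

 /* One regexp per construct, each paired with the kind of tree node
    to create. A generic driver tries the patterns one after another;
    the language can grow without the driver ever changing. */
 struct token_rule {
     const char *pattern;    /* regexp for the construct   */
     const char *node_kind;  /* node to create in the tree */
 };

 static const struct token_rule rules[] = {
     { "[0-9]+",                 "NUMBER"     },
     { "[a-zA-Z_][a-zA-Z0-9_]*", "IDENTIFIER" },
     { "[-+*/]",                 "OPERATOR"   },
     { ";",                      "SEPARATOR"  },
 };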

Practical example


The “universal parser”, of course, was implemented long ago, and more than once. The classic implementation, lex, takes a description of actions in C for each regexp. The standard GNU utilities include flex, a lex-compatible implementation, which we will use for the examples. (Similar utilities exist for C#, Java, etc.)

Following tradition, the first example will be a calculator:
 %{
     #include <stdio.h>
     #include <stdlib.h>   /* for atoi() and exit() */
     int reg = 0;      /* accumulator                       */
     char op = '+';    /* pending operator, 0 if none       */
     int unmin = 0;    /* unary minus seen before operand   */
 %}

 %option main
 %option yylineno

 %%

 \/\/.*\n        ; // comment
 [0-9]+          { int opnd = atoi(yytext);
                   if (unmin) opnd = -opnd; unmin = 0;
                   switch (op) {
                     case '+': reg += opnd; break;
                     case '-': reg -= opnd; break;
                     case '*': reg *= opnd; break;
                     case '/': reg /= opnd; break;
                   }
                   op = 0;
                 }
 [-+*/]          { if (op) {
                     if (*yytext == '-')
                       unmin = 1;
                     else {
                       printf("Unexpected operator in line %d\n", yylineno);
                       exit(1);
                     }
                   } else
                     op = *yytext;
                 }
 [;]             { if (op) {
                     printf("Unexpected semicolon in line %d\n", yylineno);
                     exit(1);
                   }
                   printf("=%d\n", reg);
                   reg = 0;
                   op = '+';
                 }
 [ \t\r\n]      ; // whitespace
 .               { printf("Syntax error in line %d\n", yylineno); exit(1); }
 %%


What we get simulates a market trader's calculator: no parentheses, no operator precedence, no fractions. Expressions are separated by semicolons, and comments can be inserted from // to the end of the line.

Let's compile our calculator and try to play with it:
[tyomitch@home ~]$ lex 1.lex
[tyomitch@home ~]$ cc lex.yy.c
[tyomitch@home ~]$ ./a.out
2+2;
=4
2+2*2;
=8
2 + // hello
- 3
;
=-1
2 / /
Unexpected operator in line 6

In a real compiler, of course, the rules would not print the results of calculations to the screen; they would store the expressions themselves for the subsequent code generation.
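
For example, the number rule might then look like this (a hedged sketch using the yylval/token-code convention of lex/yacc; the NUMBER token name is invented for illustration):

 [0-9]+   { yylval = atoi(yytext);  /* remember the operand's value */
            return NUMBER; }        /* hand the token to the parser */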

How does it work?

Mathematicians have the notion of a DFA (deterministic finite automaton): a gadget that can be in one of N states; that reads the input stream character by character; and that has a table saying, for every combination of current state and character read, which state to move to. What flex does is build a DFA from the given set of regexps; in some states this DFA not only moves on to the next state, but also triggers one of our rules.
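
To make the idea concrete, here is a hedged, hand-written table-driven DFA (not flex output) that recognizes strings matching [0-9]+ :

 #include <stdio.h>

 /* Character classes: 0 = digit, 1 = anything else.
    States: 0 = start, 1 = inside a number (accepting), 2 = dead. */
 static const int transition[3][2] = {
     /* digit  other */
     {  1,     2  },  /* state 0: start           */
     {  1,     2  },  /* state 1: inside a number */
     {  2,     2  },  /* state 2: dead            */
 };

 static int char_class(unsigned char c)
 {
     return (c >= '0' && c <= '9') ? 0 : 1;
 }

 int main(void)
 {
     const char *input = "12345";
     int state = 0;
     for (const char *p = input; *p != '\0'; p++)
         state = transition[state][char_class((unsigned char)*p)];
     puts(state == 1 ? "matches [0-9]+" : "does not match");
     return 0;
 }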

You can see the DFA it built by looking inside the generated parser, lex.yy.c. It took 15 states. To save space, the transition table is not stored explicitly (that would be 15x256), but is split into cunningly overlapping lists. To see it in its most readable form, compile the parser with the -Cef option (“turn off table compression”):

 static yyconst short yy_nxt[][8] =
     {
     {   0,   0,   0,   0,   0,   0,   0,   0 },
     {   3,   4,   5,   6,   7,   8,   9,  10 },
     {   3,   4,   5,   6,   7,   8,   9,  10 },
     {  -3,  -3,  -3,  -3,  -3,  -3,  -3,  -3 },
     {   3,  -4,  -4,  -4,  -4,  -4,  -4,  -4 },
     {   3,  -5,  -5,  -5,  -5,  -5,  -5,  -5 },
     {   3,  -6,  -6,  -6,  -6,  -6,  -6,  -6 },
     {   3,  -7,  -7,  -7,  -7,  -7,  -7,  -7 },
     {   3,  -8,  -8,  -8,  -8,  11,  -8,  -8 },
     {   3,  -9,  -9,  -9,  -9,  -9,  12,  -9 },
     {   3, -10, -10, -10, -10, -10, -10, -10 },
     {   3,  13,  13,  14,  13,  13,  13,  13 },
     {   3, -12, -12, -12, -12, -12,  12, -12 },
     {   3,  13,  13,  14,  13,  13,  13,  13 },
     {   3, -14, -14, -14, -14, -14, -14, -14 },
     };

 static yyconst short int yy_accept[15] =
     {   0,
         0,   0,   8,   6,   5,   5,   3,   3,   2,   4,
         0,   2,   0,   1
     };

The characters here are divided into 8 classes that are indistinguishable from the parser's point of view (for example, all digits are combined into one class). A separate array, static yyconst int yy_ec[256], assigns a class to each character.
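
The payoff of the class map is size; here is a hedged sketch of the two-step lookup (the function name is invented; yy_ec and yy_nxt are the arrays described above):

 /* 256 class-map entries + a 15x8 table = 376 entries,
    instead of a flat 15x256 = 3840-entry table. */
 static int lookup_next(int state, unsigned char c,
                        const int yy_ec[256], const short yy_nxt[][8])
 {
     return yy_nxt[state][yy_ec[c]];  /* two small lookups */
 }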

The main loop of the parser is very simple:
 yy_match:
     while ((yy_current_state = yy_nxt[yy_current_state][yy_ec[(unsigned char)(*yy_cp)]]) > 0)
         {
         if (yy_accept[yy_current_state])
             {
             yy_last_accepting_state = yy_current_state;
             yy_last_accepting_cpos = yy_cp;
             }

         ++yy_cp;
         }

     yy_current_state = -yy_current_state;

Positive numbers in the transition table mean “go to that state and keep reading”; negative numbers mean “go to that state and perform an action”. The number of the action to perform on arriving in a state is stored in yy_accept.

Consider an example: for digits, the character class number is 6.
In the initial state (1), a class-6 character takes us, according to the transition table, to state 9. We go there and keep reading.
In state 9, one more digit (class 6) takes us to state 12, still reading.
From state 12, further digits just keep us reading (the table entry is 12).
When a non-digit appears (any class other than 6), an action must be performed: row 9 holds -9 in the non-digit columns, and row 12 holds -12.
Checking yy_accept: in both cases we apply rule 2. (Recall that the rule recognizing numbers in our parser is indeed the second one.)
It is unclear why flex kept states 9 and 12 separate when it does the same thing in both. But it is a soulless piece of hardware; it knows best.

It is remarkable how simple the parser turns out to be: instead of branching recognition code for the different constructs, we have one big table and a ten-line loop. But there is a significant problem. Let us return to the example from the very beginning of the post: “find the beginning of a construct (for example, { ); find its end (for example, the matching } at the same nesting level)...” How do we express the condition “at the same nesting level”?

Mathematicians tried to do just that, and proved that it is impossible: a regular expression cannot count nesting depth. So, to parse languages that allow nested constructs (and that is every language more complex than BAT files), we need a more powerful tool than regexps. More about that tool next time.

Source: https://habr.com/ru/post/99162/

