
Writing an LALR(1) parser for a compiler. Basic theory

Introduction, or why parsers are needed


Good day.
Not so long ago I was faced with the task of parsing a certain grammar. Alas, the existing solutions did not suit me, so the problem of writing my own parser generator arose. Even though the topic is quite popular and there is no shortage of articles and books on the subject, I decided to describe the process once more, starting from the most basic concepts.

This part is devoted to the foundation, the general theory of computer science. It may well be taught in schools and universities in Russia. The real meat comes in the second part.

So, why would anyone need to write a parser, and what is it anyway? A parser is code that gives an incoming set of characters semantic meaning. That is, the characters are analyzed, and on the basis of this analysis the program understands how to interpret these letters and digits. A simple example is "1 + 2": after (or during) parsing, the "+" sign is not just a plus character but the designation of the binary addition operator, while in "+3" it is a unary sign operator. This is obvious to most people, but not to a machine.
Parsers are used everywhere: in Word to analyze sentences, word forms, formulas, and so on; on almost any site when validating input data: email, phone number, credit card number; in configuration files; in serialized data (for example, XML); in many games - scripted cutscenes, AI scripts, the console. In short, parsing is an integral part of computer science.


Two types of parsers


OK, having realized the importance of this technology, we should ask how this class of programs is customarily written.

Roughly speaking, parsers can be divided into two types: those that use a formal description of the language, and those where the analysis is coded directly, without an abstract data schema.

The first approach means splitting the analyzer into two parts: a description of the format/scheme of the input data, and the logic, which no longer operates on a raw data stream but on data already structured according to that scheme. Example:
scheme:

DIGIT = '0'..'9'
ADD   = DIGIT '+' DIGIT
SIGN  = ('+' | '-') DIGIT
EXPR  = ADD | SIGN

code:

long result = 0;
ParseNode & expr = Parse(input);
ParseNode & oper = expr.child();
switch (oper.type)
{
    // 0: DIGIT, 1: '+', 2: DIGIT
    case ADD:
        result = oper.child(0) + oper.child(2);
        break;
    // 0: '+' or '-', 1: DIGIT
    case SIGN:
        result = oper.child(0) == '-' ? - oper.child(1) : oper.child(1);
        break;
}


The second approach is easier to show than to explain:
char c = 0;
long result = 0;
input >> c;
// we have three variants of the first symbol
switch (c)
{
    case '+':
        // next symbol is a digit
        input >> c;
        if (c < '0' || c > '9')
            return SyntaxError;
        result = c - '0';
        break;
    case '-':
        // next symbol is a digit
        input >> c;
        if (c < '0' || c > '9')
            return SyntaxError;
        result = -(c - '0');
        break;
    default:
        if (c < '0' || c > '9')
            return SyntaxError;
        result = c - '0';
        input >> c;
        if (c != '+')
            return SyntaxError;
        input >> c;
        if (c < '0' || c > '9')
            return SyntaxError;
        result += (c - '0');
        break;
}


In principle, this example already suggests which way is better, but I will still try to collect the advantages of each type. Here, a plus for one is effectively a minus for the other.

Advantages of the grammar description:
  1. First of all, language maintenance is easy. Modifications, even the most drastic ones, are applied very simply.
  2. Since the syntactic constructs are collected in one place and are compact, it is very easy to review the structure of the language as a whole. With the alternative, you can simply get bogged down among thousands and tens of thousands of lines interspersed with logic. In addition, the transparency of the language is ensured, and the evolution of its constructs is clearly traced: in our example you can see how DIGIT turns into ADD and then into EXPR, which in effect marks the points where semantic logic can be attached.
  3. Easy error detection, in two categories at once: syntax errors in the input data and errors in the description of the analyzer itself. Hand-written code may contain contradictions, dead sections and other logical delights and still compile; with an explicit scheme this is impossible - contradictions are detected right at the generation stage.
  4. The extra level of abstraction simplifies the source code and lets you debug and test higher-level entities, which by definition are fewer in number and more specific than a raw stream of characters.


Despite all this, coding the parser directly ("head-on") has a number of its own advantages:
  1. The entry threshold is significantly lower. Almost any programmer can be told "sit down and write the code", and a parser will be produced in the proposed style. Composing a formal grammar requires a small but still substantial amount of theory: the basics of mathematical logic, an understanding of how grammars are constructed, and command of a code-generation tool.
  2. Although a grammar can describe almost anything, there are exceptions, and the limitations of the toolkit add to them. Directly written code has the highest flexibility.
  3. There is also a rather tiny plus: everything is in one place. With a sensible program layout you can pick out the key pieces of logic and immediately grasp their essence. In the first variant we effectively need two screens: the screen with the scheme plus the screen with the code, and sometimes you have to switch between them to clarify the nuances. This is not very convenient, but again, it is an extremely small advantage, especially when weighed against the second advantage of the separated approach.


Parser structure


Of course, like most ordinary programmers, I almost always choose the first approach. Besides, implementing the second requires no special knowledge, and I have no additional theory to offer for it. So from here on I will talk about the structure of a parser based on a formal grammar.



Lexical analyzer → syntax analyzer → semantic analyzer

This is a very rough scheme: an analyzer may contain all three parts, combine the first two or the last two, or even be a single monolithic piece with no separation into components at all. There are almost no established best practices here, but personally I prefer to keep things separate. I will not start a holy war, just describe the possible options.

First of all, it is worth figuring out what these components are. The lexical analyzer receives as input a stream of characters from a given alphabet (letters, digits, punctuation, and so on). It breaks this stream up into more complex primitives (an oxymoron, yes), the so-called tokens or terminals. For example, instead of a sequence of digits of some length, the parser receives a "Number" token with an attribute equal to the value of that number. Why this is needed, I will explain below. Next, the syntax analyzer, working from the tokens and the rules for combining symbols of the second hierarchical order (the first order being the tokens), the so-called non-terminals, builds the output tree. This data structure unambiguously represents the structure of the recognized data. And finally, the last stage is the semantic analysis of the resulting tree and the execution of business logic. This stage is what the first listing showed.
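As a rough sketch (the structures and field names here are my own illustration, not from the article), the data handed between the stages might look like this:

#include <vector>

struct Token                       // produced by the lexical analyzer
{
    enum Type { Number, Plus, Minus, End };
    Type type;
    long value = 0;                // attribute, e.g. the numeric value of a Number
};

struct ParseNode                   // produced by the syntax analyzer
{
    int symbol;                    // terminal or non-terminal id (DIGIT, ADD, EXPR, ...)
    long value = 0;                // attribute carried up the tree
    std::vector<ParseNode> children;
};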

Why do we need a lexical analyzer at all, if it could simply be folded into the syntactic part? After all, what difference does it make which alphabet we consume: the original one or a new, synthetic one? The answer is fairly obvious. Firstly, as a rule, it narrows the alphabet, that is, it simplifies the semantics. Secondly, we remove one or even several lower levels of the tree while fully preserving its properties. Clearly the lexical analyzer should not take on the role of the syntactic one, let alone the semantic one, so it should not, say, return 3 for the input "1 + 2"; but the simplest actions, such as assembling numbers, suit it perfectly. Let's complicate the example slightly and look at the output tree in the two cases. At the same time it will show what kind of tree this is, for those who did not quite follow the brief explanation.
DIGIT  = '0'..'9'
NUMBER = NUMBER? DIGIT
ADD    = NUMBER '+' NUMBER
SIGN   = ('+' | '-') NUMBER
EXPR   = ADD | SIGN


Parsing the expression 12 + 34:
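With a separate lexer, each NUMBER arrives as a single token carrying its value; without one, the grammar itself has to assemble every NUMBER from individual DIGITs. Roughly, the two output trees look like this:

With a lexical analyzer:

EXPR
└─ ADD
   ├─ NUMBER(12)
   ├─ '+'
   └─ NUMBER(34)

Without a lexical analyzer:

EXPR
└─ ADD
   ├─ NUMBER
   │  ├─ NUMBER
   │  │  └─ DIGIT('1')
   │  └─ DIGIT('2')
   ├─ '+'
   └─ NUMBER
      ├─ NUMBER
      │  └─ DIGIT('3')
      └─ DIGIT('4')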


Even on such a simple example it is clear that it is more convenient to split the analysis into at least two stages. Moreover, the specialization of the lexical analyzer allows it to use more efficient algorithms, different from grammar analysis. The main difference between the lexer and the syntax analyzer, apart from the empirical one, is the absence of production rules; more precisely, they may be present implicitly, but their right-hand sides contain only characters of the alphabet, and the rules never refer to the lexer's other terminals. That is, they are self-sufficient and describe only the expected stream of characters.
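For instance, a minimal hand-written lexer for this alphabet (a sketch of my own, not code from the article) is just a character loop; note that it never consults grammar rules or other tokens, only the raw character stream:

#include <cctype>
#include <istream>
#include <vector>

struct Token
{
    enum Type { Number, Plus, Minus, End };
    Type type;
    long value;     // attribute: the numeric value for Number tokens
};

// Folds the raw character stream into tokens: a run of digits becomes a single
// Number token, '+' and '-' become one-character tokens.
std::vector<Token> Tokenize(std::istream & input)
{
    std::vector<Token> result;
    int c;
    while ((c = input.get()) != EOF)
    {
        if (std::isspace(c))
            continue;
        if (std::isdigit(c))
        {
            long value = c - '0';
            while (std::isdigit(input.peek()))
                value = value * 10 + (input.get() - '0');
            result.push_back({Token::Number, value});
        }
        else if (c == '+')
            result.push_back({Token::Plus, 0});
        else if (c == '-')
            result.push_back({Token::Minus, 0});
        else
            break;  // unexpected character: stop here (or report a lexical error)
    }
    result.push_back({Token::End, 0});
    return result;
}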

Now consider the second potential fusion: integrating the semantic logic directly into the parser, bypassing the tree-building stage. Here the tactic is simple: we mark the points that have semantic meaning and write the code. Example:
DIGIT = '0'..'9'           {value = child[0] - '0'}
ADD   = DIGIT '+' DIGIT    {value = child[0].value + child[2].value}
SIGN  = ('+' | '-') DIGIT  {value = child[0].value == '-' ? - child[1].value : child[1].value}
EXPR  = ADD | SIGN         {result = child[0].value}
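A generator could turn such a scheme into an action handler along these lines (purely a sketch; the Symbol structure and the rule identifiers are invented here for illustration). The important point is that the action body runs at the moment the rule is matched, so no tree is ever stored, only the attribute values:

#include <vector>

struct Symbol { int id; long value; };   // a terminal or non-terminal with its attribute

enum Rules { DIGIT, ADD, SIGN, EXPR };   // hypothetical rule identifiers

// Executes the semantic action attached to the rule just recognized;
// child holds the symbols that formed the rule's right-hand side.
long RunAction(int rule, const std::vector<Symbol> & child)
{
    switch (rule)
    {
    case DIGIT: return child[0].value - '0';                                      // '0'..'9'
    case ADD:   return child[0].value + child[2].value;                           // DIGIT '+' DIGIT
    case SIGN:  return child[0].value == '-' ? -child[1].value : child[1].value;  // ('+' | '-') DIGIT
    case EXPR:  return child[0].value;                                            // ADD | SIGN
    }
    return 0;
}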


Obviously, a complex grammar cannot be described this way, and such mixing takes away some of the benefits of keeping the code separate. But for simple grammars it is a perfectly good technique, and its importance should not be understated.

LL vs LR, or elephant vs whale

Writing a good lexer is a separate large topic, so from here on I will describe only the development of the syntax analyzer.

There are two groups of analyzers: bottom-up (LR) and top-down (LL).

They take their names from the way they build the output tree: the first builds the tree from the bottom up, the second from the top down. I will dwell on each a little and present the corresponding algorithms in their most basic form.

An LR parser notionally keeps a stack in which it stores the most recently received symbols (both terminals and non-terminals). At each step, after reading the next token, the parser tries to find a rule that it can apply to the symbols at the top of the stack; if it finds one, it collapses that sequence of symbols into a single non-terminal. That is, if the stack looks like {DIGIT, +, DIGIT}, the analyzer will fold it into {ADD}, and then into {EXPR}. The pseudocode of such a parser would look something like this:

input >> term;
while (term != EOF)
{
    stack.push(term);
    do
    {
        reduced = false;
        for (Rule rule : rules)
            if (rule.Reduce(stack))
            {
                stack.pop(rule.right.length);
                stack.push(rule.symbol);
                reduced = true;
                break;
            }
    } while (reduced);
    input >> term;
}
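To make the idea concrete, here is a complete (if naive) C++ version of that loop for our toy grammar. This is my own sketch, not the author's code; note that listing ADD before SIGN is what keeps the greedy matcher from folding '+' DIGIT into SIGN in the middle of an addition. A real LR parser resolves such conflicts with states and lookahead rather than rule order:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Symbol ids: plain characters stand for themselves, the rest are non-terminals.
enum { DIGIT = 256, ADD, SIGN, EXPR };

struct Rule
{
    int symbol;                 // left-hand side
    std::vector<int> right;     // right-hand side
};

// Rule order matters for this naive matcher: ADD must be tried before SIGN.
const std::vector<Rule> rules = {
    { ADD,  { DIGIT, '+', DIGIT } },
    { SIGN, { '+', DIGIT } },
    { SIGN, { '-', DIGIT } },
    { EXPR, { ADD } },
    { EXPR, { SIGN } },
};

// True if the top of the stack matches the rule's right-hand side.
bool Matches(const Rule & rule, const std::vector<int> & stack)
{
    return stack.size() >= rule.right.size()
        && std::equal(rule.right.begin(), rule.right.end(),
                      stack.end() - rule.right.size());
}

int main()
{
    std::string input = "1+2";
    std::vector<int> stack;
    for (char c : input)
    {
        stack.push_back(c >= '0' && c <= '9' ? DIGIT : c);      // shift
        bool reduced;
        do                                                      // reduce while possible
        {
            reduced = false;
            for (const Rule & rule : rules)
                if (Matches(rule, stack))
                {
                    stack.resize(stack.size() - rule.right.size());
                    stack.push_back(rule.symbol);
                    reduced = true;
                    break;
                }
        } while (reduced);
    }
    std::cout << (stack.size() == 1 && stack.back() == EXPR ? "accepted" : "rejected") << std::endl;
}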


An LL parser tries to do the opposite: for each incoming symbol it guesses which rule the symbol belongs to. The most primitive version chooses among the alternatives (for example, EXPR can expand into ADD or into SIGN, i.e. two alternatives) by the first symbol, and then descends recursively with the new set of rules produced by the non-terminals along the chosen path, and so on until the rules are decomposed into terminals. The description sounds rather convoluted; in code it is much easier to understand:
function ExpandRule(symbol, term)
{
    for (Rule rule : rules)
    {
        if (rule.symbol == symbol && firstTerminal(rule) == term)
            break;
    }
    for (s : rule)
    {
        if (s.type == NonTerminal)
            term = ExpandRule(s, term);
        else
        {
            if (term != s)
                throw SyntaxError;
            input >> term;
        }
    }
    return term;
}

input >> term;
ExpandRule(EXPR, term);
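And for comparison, the same toy grammar as a hand-written recursive descent in C++ (again a sketch of mine, not the original code): each non-terminal becomes a function, and the alternative is chosen by looking at a single symbol of lookahead. As a bonus, the semantic value is computed along the way:

#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// EXPR = ADD | SIGN, ADD = DIGIT '+' DIGIT, SIGN = ('+' | '-') DIGIT
struct Parser
{
    std::istringstream in;
    char cur = 0;

    explicit Parser(const std::string & text) : in(text) { Next(); }

    void Next() { if (!(in >> cur)) cur = 0; }      // 0 plays the role of EOF

    long Digit()                                    // DIGIT = '0'..'9'
    {
        if (cur < '0' || cur > '9')
            throw std::runtime_error("digit expected");
        long value = cur - '0';
        Next();
        return value;
    }

    long Expr()                                     // EXPR = ADD | SIGN
    {
        if (cur == '+' || cur == '-')               // SIGN = ('+' | '-') DIGIT
        {
            bool negative = (cur == '-');
            Next();
            long value = Digit();
            return negative ? -value : value;
        }
        long left = Digit();                        // ADD = DIGIT '+' DIGIT
        if (cur != '+')
            throw std::runtime_error("'+' expected");
        Next();
        return left + Digit();
    }
};

int main()
{
    std::cout << Parser("1+2").Expr() << std::endl;   // prints 3
    std::cout << Parser("-7").Expr() << std::endl;    // prints -7
}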


Which parser to use is a matter of taste; almost all of their characteristics are identical. There is an old story that LL(x) was favored in the Soviet Union and LR(x) in the West, but I do not know how true it is. Personally, I took an ideological liking to LR, and I will talk about it in more detail in the next part.

PS: The importance of writing parsers correctly can be seen right in this article: just look at the second code section, which is left unhighlighted even though it is wrapped in the same source block as the first example.

Parts of the article


  1. Part 1. Basic theory
  2. Part 2. Description of LR-generators
  3. Part 3. Nuances of writing LR generators and their possible features

Source: https://habr.com/ru/post/140058/

