
Writing an LR(0) parser: the difficult in simple words

Introduction



Good day.
I could not find a simple and clear description of this algorithm in Russian, so I decided to fill the gap. First of all, what is it? An LR(0) analyzer is, above all, a parser. The purpose of a parser is to process the input stream of tokens (the basic elements of the language that the lexical analyzer produces from the input stream of characters; examples of tokens are a number, a comma, a symbol) and match it against a description of the language given in a specific format. The result of the matching is a particular data structure, most often a tree. This structure then goes to the next stage, semantic analysis, where the compiler tries to understand the meaning contained in the tree.

There are two classes of parsers: bottom-up and top-down analyzers. The former build the tree starting from the leaves, which are the input tokens; the latter, accordingly, start from the root of the tree. LR means that the analyzer reads the stream from left to right (L for 'Left') and builds the tree from the bottom up (do not let the letter R, which stands for 'Right', confuse you; an explanation is given just below). The index 0 means that we do not look ahead at the following tokens but work only with the current one. What does choosing this type of analyzer give us? It needs no lookahead, runs in linear time, and, unlike top-down analyzers, has no trouble with left-recursive rules.

There are disadvantages as well: LR(0) accepts only a rather narrow class of grammars; for most practical languages the table construction runs into conflicts, and stronger variants such as SLR, LALR, or LR(1) are needed.


Terminology



A terminal symbol (terminal) is a symbol that can come to the analyzer from the user; in our case these are lexemes.
A non-terminal symbol (non-terminal) is a symbol that denotes a construct of the language. For example, let us agree that the symbol A denotes a term. Of course, we could choose multi-character names, such as term instead of A.
A context-free grammar (CFG) is a set of rules of the form A → w, where A is a non-terminal and w is an arbitrary string of terminals and non-terminals. In this article I will write simply "grammar", meaning precisely a context-free one.
A small example of grammar for which we will build an analyzer:
1. E → F
2. E → E + F
3. E → E - F
4. F → T
5. F → F * T
6. F → F / T
7. T → 0
8. T → 1
This grammar describes an incomplete set of arithmetic operations on two numbers, 0 and 1. A grammar is a description of a language. To check whether the input stream belongs to our language, or whether we made a syntax error somewhere (wrote 1 + instead of 1 + 1), we look for a possible way to derive this input stream by following the rules, starting from the start symbol (here it is E). For 1 + 1 the path is: E, apply rule 2, E + F, rule 1, F + F, rule 4, T + F, rule 8, 1 + F, rule 4 again, 1 + T, and finally rule 8, 1 + 1. As we can see, we were able to derive the input string, which means it is syntactically correct.
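The derivation above can be replayed mechanically. Here is a minimal sketch (the helper `apply` and the encoding of rules as substitution strings are my own illustration, not the article's code) that performs each step as a textual substitution of the leftmost occurrence:

```cpp
#include <cassert>
#include <string>

// Replace the leftmost occurrence of `from` in `s` with `to`.
// Each call below corresponds to applying one grammar rule.
std::string apply(std::string s, const std::string& from, const std::string& to)
{
    std::string::size_type pos = s.find(from);
    if (pos != std::string::npos)
        s.replace(pos, from.size(), to);
    return s;
}

std::string derive_one_plus_one()
{
    std::string s = "E";
    s = apply(s, "E", "E+F");  // rule 2: E -> E + F
    s = apply(s, "E", "F");    // rule 1: E -> F
    s = apply(s, "F", "T");    // rule 4: F -> T
    s = apply(s, "T", "1");    // rule 8: T -> 1
    s = apply(s, "F", "T");    // rule 4 again
    s = apply(s, "T", "1");    // rule 8 again
    return s;                  // "1+1"
}
```

Running the six substitutions in the order from the text turns "E" into "1+1", confirming the derivation.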
Now we can explain the letter R in the name of the analyzer. It means that the analyzer reconstructs a rightmost derivation in reverse: we go from the rightmost ends of the rules back to the axiom, that is, from the simpler rules (7 and 8) we assemble our way up to the starting one (1). LL analyzers, in contrast, choose the next direction of analysis from the left-hand ends of the rules.

We should also mention finite-state machines (FSM). This is a model that has a set of states and an input stream. The machine starts in the initial state and changes its state based on the current state and the input symbol. That is, we start in state 0; if a arrives at the input, the automaton goes to state 1, and if b, to state 2. The transition mechanics is given by a table, where the rows are the current states and the columns are the input symbols.
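The two-transition automaton just described can be sketched directly as a table lookup (the function name and the table encoding are my own convention):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Runs a table-driven finite-state machine: the table maps
// (current state, input symbol) to the next state.
// Returns the final state, or -1 if some transition is missing.
int run_fsm(const std::map<std::pair<int, char>, int>& table,
            int start, const std::string& input)
{
    int state = start;
    for (char c : input) {
        auto it = table.find({state, c});
        if (it == table.end())
            return -1;            // no such transition: reject
        state = it->second;
    }
    return state;
}
```

With the table {(0,'a') → 1, (0,'b') → 2} from the text, input "a" ends in state 1 and "b" in state 2.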

Algorithm



The analyzer needs several things to work:

- a stack of states;
- a table of actions, indexed by state and terminal;
- a transition (goto) table, indexed by state and non-terminal;
- the input stream of tokens.

Here it is necessary to clarify how the analyzer works. The current state is the state at the top of the stack. We look into the table of actions (the indices are the current input symbol and the current state). There are 4 types of entries in this table:

- shift: push the given state onto the stack and move to the next input symbol;
- reduce: fold the right-hand side of the given rule, popping as many states as there are symbols in it, then push the state found in the transition table;
- accept: the input stream is recognized, analysis is finished;
- an empty cell: a syntax error.

As code, it looks like this:
  stack.push(states[0]);
  while (!accepted)
  {
      State *st = stack.top();
      Terminal term = s[inp_pos];
      if (!terms.IsTerm(term))
          error();
      Action *action = actionTable.Get(st, term);
      if (!action)
          error();
      switch (action->Type())
      {
      case ActionAccept:
          accepted = true;
          break;
      case ActionShift:
          inp_pos++;
          stack.push(action->State());
          break;
      case ActionReduce:
      {
          Rule *rule = action->Rule();
          stack.pop(rule->Size());
          State *transferState = transferTable.Get(stack.top(), rule->Left());
          if (!transferState)
              error();
          stack.push(transferState);
          break;
      }
      }
  }
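To see the loop in action, here is a compact, runnable variant of the same driver for a toy grammar (not the article's one): 1: S' → S, 2: S → aS, 3: S → b. The tables are written out by hand, and all names below are my own; the structure of the loop mirrors the pseudocode above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A runnable version of the driver loop for the toy grammar
//   1: S' -> S    2: S -> aS    3: S -> b
// with hand-built LR(0) tables. '$' marks the end of input.
enum ActType { Shift, Reduce, Accept };
struct Act { ActType type; int arg; };           // arg: target state or rule no.

bool parse(const std::string& input)
{
    // action[state][terminal]
    std::map<int, std::map<char, Act>> action = {
        {0, {{'a', {Shift, 2}}, {'b', {Shift, 3}}}},
        {1, {{'$', {Accept, 0}}}},
        {2, {{'a', {Shift, 2}}, {'b', {Shift, 3}}}},
        {3, {{'a', {Reduce, 3}}, {'b', {Reduce, 3}}, {'$', {Reduce, 3}}}},
        {4, {{'a', {Reduce, 2}}, {'b', {Reduce, 2}}, {'$', {Reduce, 2}}}},
    };
    std::map<int, int> transfer = {{0, 1}, {2, 4}};  // goto on non-terminal S
    std::map<int, int> rhsLen = {{2, 2}, {3, 1}};    // |w| for each rule

    std::vector<int> stack = {0};                    // stack of states
    std::string::size_type pos = 0;
    while (pos < input.size()) {
        auto row = action.find(stack.back());
        if (row == action.end())
            return false;
        auto cell = row->second.find(input[pos]);
        if (cell == row->second.end())
            return false;                            // empty cell: syntax error
        Act a = cell->second;
        if (a.type == Accept)
            return true;
        if (a.type == Shift) {
            stack.push_back(a.arg);
            ++pos;
        } else {                                     // Reduce
            stack.resize(stack.size() - rhsLen[a.arg]);
            auto g = transfer.find(stack.back());
            if (g == transfer.end())
                return false;
            stack.push_back(g->second);
        }
    }
    return false;                                    // ran out of input
}
```

For example, "aab$" and "b$" are accepted, while "aa$" hits an empty cell and is rejected.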


As you can see, there is nothing difficult in the analysis itself. The whole trick lies in constructing those cunning tables. To start, let's look at what a parser state is. This is a rather nontrivial part of the algorithm. And no, it is not just a number; we will have to introduce several new concepts.
First of all, there are items. An item is a rule with one new property, a marker. The marker indicates which symbol we currently expect to see. Accordingly, each rule generates n + 1 items, where n is the number of symbols in the right-hand side of the rule. For example, take rule 3; the bullet (•) indicates the position of the marker:

1. E → • E - F
2. E → E • - F
3. E → E - • F
4. E → E - F •

The marker in the second item, for example, indicates that we expect to see a minus sign as the current symbol. A collection of items is an item set. The state, in fact, is a set of items gathered together in a certain way.
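The n + 1 items of a rule are easy to enumerate. A tiny sketch (single-character symbols and '.' as the marker are my own conventions):

```cpp
#include <cassert>
#include <string>
#include <vector>

// For a rule lhs -> rhs, produce the n + 1 items obtained by placing
// the marker ('.') at every position from 0 to n, where n = |rhs|.
std::vector<std::string> items_of(const std::string& lhs, const std::string& rhs)
{
    std::vector<std::string> items;
    for (std::string::size_type pos = 0; pos <= rhs.size(); ++pos)
        items.push_back(lhs + " -> " + rhs.substr(0, pos) + "." + rhs.substr(pos));
    return items;
}
```

For a right-hand side of three symbols, such as E-F, this yields exactly four items.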

But to work with states, we first need to close the set. This means that we want to obtain a complete branch of analysis. That is, if the set contains an item whose marker points to a non-terminal (and all our non-terminals appear as left-hand sides), then the corresponding non-terminal must be "expanded". This is done simply by adding the items whose left-hand side is this non-terminal and whose marker points to the first symbol. Naturally, we expand recursively: if in a newly added item the first symbol is a non-terminal, we expand it as well, until we get a complete set. Let us close the set containing only one item (number 3 from the previous example):
1. E → E - • F
2. F → • T
3. F → • F * T
4. F → • F / T
5. T → • 0
6. T → • 1
Expanding F, we get items 2, 3, and 4. In items 3 and 4 we are again invited to expand F, but those items are already in the set, so we skip them. T, however, has not been expanded yet; doing that, we get items 5 and 6. That's it, the closure is ready.

  for (closed_item in itemset)
  {
      if (closed_item.isClose)
          continue;
      Element marker = closed_item.Marker();
      if (marker.Type() != ElementNonTerm)
      {
          closed_item.isClose = true;
          continue;
      }
      NonTerminal nonTerm = marker.NonTerm();
      item = allitems->First(0, nonTerm);
      while (!item.isend())
      {
          if (!itemset.exists(item))
              itemset.add(item);
          item.next(0, nonTerm);
      }
      closed_item.isClose = true;
  }
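The same closure can be written as a small self-contained function. An item here is a (rule index, marker position) pair, and uppercase letters are non-terminals; this encoding and the names are my own sketch, not the article's classes. With the example grammar it grows the single item E → E - • F into the six items shown earlier.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Rule { char left; std::string right; };   // single-character symbols

// Closure of an item set: while some item's marker stands before a
// non-terminal X, add the item (X-rule, 0) for every rule of X.
std::set<std::pair<int, int>> closure(const std::vector<Rule>& g,
                                      std::set<std::pair<int, int>> items)
{
    bool changed = true;
    while (changed) {
        changed = false;
        std::set<std::pair<int, int>> snapshot = items;
        for (const auto& it : snapshot) {
            const std::string& rhs = g[it.first].right;
            if (it.second >= (int)rhs.size() ||
                !std::isupper((unsigned char)rhs[it.second]))
                continue;                        // marker not before a non-terminal
            for (int i = 0; i < (int)g.size(); ++i)
                if (g[i].left == rhs[it.second] && items.insert({i, 0}).second)
                    changed = true;
        }
    }
    return items;
}
```

Closing {(2, 2)}, that is, rule E → E - F with the marker before F, over the example grammar yields six items, matching the closure above.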

Having understood what states are, we can begin to build them. To begin with, we introduce a new rule, which is the root of every derivation and which we must reach at the end:
S → E

The first state, naturally, is the closure of the item based on this rule with the marker pointing at E. Now we begin building a temporary state-machine table, which will serve as the basis for the transition and action tables. We divide a state into groups by the symbol the marker points at. For the closure from the example above there are 4 groups: the F-group, the T-group, the 0-group, and the 1-group. Each group is a transition to a new state. The first index of the transition is the symbol by which we group (F, T, 0, 1). The second index is the current state. The value in the table is the state we move to. So we have 4 new states. Constructing them is quite simple: in each item of the group we shift the marker one position to the right and close the resulting set. That will be the new state.

  firstState.Add(items.First());
  firstState.MakeClosure();
  states.add(firstState);
  size_t state_idx = 0;
  while (state_idx < states.size())
  {
      State *st = states[state_idx];
      GroupedItems group = st->Group();
      for (group_class in group)
      {
          if (group_class->first.Type() == ElementEnd)
              continue;
          State newState(&items, states.Size());
          for (group_item in group_class)
              newState.Add(group_item, group_item.MarkerInt() + 1);
          newState.MakeClosure();
          State *oldState = states.find(newState);
          if (!oldState)
          {
              states.add(newState);
              fsmTable.Add(st, group_class->first, newState);
          }
          else
              fsmTable.Add(st, group_class->first, oldState);
      }
      state_idx++;
  }
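The grouping step from the construction above can be sketched in the same (rule index, marker position) representation used earlier; the encoding is my own assumption. With the closure of the running example it indeed yields the 4 groups (F, T, 0, 1) mentioned in the text.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Rule { char left; std::string right; };   // single-character symbols

// Splits a state's items into groups by the symbol right after the
// marker; items whose marker is at the end (reducible items) are skipped.
std::map<char, std::set<std::pair<int, int>>>
group_items(const std::vector<Rule>& g,
            const std::set<std::pair<int, int>>& items)
{
    std::map<char, std::set<std::pair<int, int>>> groups;
    for (const auto& it : items) {
        const std::string& rhs = g[it.first].right;
        if (it.second < (int)rhs.size())
            groups[rhs[it.second]].insert(it);
    }
    return groups;
}
```

Each group then becomes one transition: shift every item's marker right by one and close the result.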


The transition table is built very simply: we carry over those columns of the FSM table whose indices are non-terminals.
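Carrying the columns over can be pictured as filtering the FSM table by the kind of symbol in the key. Uppercase-as-non-terminal and the names below are my own convention for this sketch:

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <utility>

// Splits the temporary FSM table into the two final tables:
// columns indexed by non-terminals go to the transfer (goto) table,
// terminal columns become shift actions. Keys are (state, symbol) pairs.
void split(const std::map<std::pair<int, char>, int>& fsm,
           std::map<std::pair<int, char>, int>& transfer,
           std::map<std::pair<int, char>, int>& shifts)
{
    for (const auto& entry : fsm) {
        if (std::isupper((unsigned char)entry.first.second))
            transfer[entry.first] = entry.second;   // non-terminal column
        else
            shifts[entry.first] = entry.second;     // terminal: shift target
    }
}
```

The shift half is then filled into the action table, and the non-terminal half becomes the transition table as-is.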

The action table is a little more interesting. Part of it is also carried over from the FSM table, namely the columns with terminal indices; into these cells we write the shift action with the state parameter that was recorded in the original FSM table. Then we add a new column, '$', which marks the end of the input line. In this column we write the accept action for every state that contains the item S → E •. It means success: we have folded everything into the primary rule exactly when the input stream ended. Then come the reduce actions. For each state that contains an item A → w •, where w is any combination of terminals and non-terminals, we write the reduce action (of course, only into free cells not occupied by other actions) with the parameter of the rule to which this item belongs.

  fsmTable.FeedTransferTable(transferTable);
  fsmTable.FeedActionTable(actionTable);
  Item endItem = items.GetItem(1, 'S', Elements("E", nonTerms));
  for (st in states)
      if (st.HaveItem(endItem))
          actionTable.Add(st, '$', new Action());
  for (st in states)
  {
      ItemList list = st.GetReducable();
      for (listItem in list)
          actionTable.Add(st, new Action(listItem.GetRule()));
  }


Literature



Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. 1986.

Source: https://habr.com/ru/post/116732/

