Compilation. 4: toy yap

With grammar calculators have played enough, go to the programming languages. Beta testers of the article gave the idea to write a JavaScript-like language: let's start with the simplest staple skeleton, and we will gradually turn it into twists - syntax sugar, data types, support for functions, etc.

To the inferiority of our language was already clear from the name, let's call it JSkrip.

Further in the post

Syntax
Grammar
Parser
Syntax tree
Pretty-printing

Syntax

In the initial version of the language will be arithmetic constructions over integers, if , while , and exit ; and a pair of "predefined functions": echo() for printing, and input() for input.
')

from = 0; to = 1000;
echo( " " ,from, " " ,to, ", \n " );
while (from <= to) {
guess = (from+to)/2;
echo( " " ,guess, "? (1=, 2=, 3=) " );
i = input();
if (i==1)
to = guess-1;
else if (i==2)
from = guess+1;
else if (i==3) {
echo( "! ! \n " );
exit;
} else
echo( " ! \n " );
}
echo( ", ! \n " );

Grammar

We look at the sample program, and try to formalize the language:

PROGRAM: OPS ; //
OPS: OP | OPS OP ;
OP: '{' OPS '}' //
| EXPR ';' //
| ' if ' ' (' EXPR ')' OP
| ' if ' ' (' EXPR ')' OP ' else ' OP
| ' while ' ' (' EXPR ')' OP
| 'exit ' ' ;' ;
EXPR: NUM //
| ID //
| ID '(' ARGS ')' //
| EXPR '+' EXPR | EXPR '-' EXPR | EXPR '*' EXPR | EXPR '/' EXPR | '(' EXPR ')' | '-' EXPR //
| EXPR '==' EXPR | EXPR ' < =' EXPR | EXPR ' > =' EXPR | EXPR '!=' EXPR | EXPR '>' EXPR | EXPR '<' EXPR
| '!' EXPR // ; && *, || +
| ID '=' EXPR ; //
ARGS: //
| ARG //
| ARGS ',' ARG ;
ARG: EXPR | STRING ; //

We come across already familiar ambiguities: with else , and with a group of operators in the expression. We already know how to fix the grammar :

PROGRAM: OPS ;
OPS: OP | OPS OP ;
// , else
OP1 : '{' OPS '}' | EXPR ';'
| ' if ' ' (' EXPR ')' OP1 ' else ' OP1 // if else: if else
| ' while ' ' (' EXPR ')' OP1
| 'exit ' ' ;' ;
// , if else
OP2 : ' if ' ' (' EXPR ')' OP // if else:
| IF '(' EXPR ')' OP1 ELSE OP2
| WHILE '(' EXPR ')' OP2 ;
OP: OP1 | OP2 ;
EXPR: EXPR1 | ID '=' EXPR ; //
// ,
EXPR1 : EXPR2
| EXPR1 '==' EXPR2 | EXPR1 ' < =' EXPR2 | EXPR1 ' > =' EXPR2
| EXPR1 '!=' EXPR2 | EXPR1 '>' EXPR2 | EXPR1 '<' EXPR2 ;
EXPR2 : TERM | EXPR2 '+' TERM | EXPR2 '-' TERM ;
TERM : VAL | TERM '*' VAL | TERM '/' VAL ;
VAL : NUM | '-' VAL | '!' VAL | '(' EXPR ')' | ID | ID '(' ARGS ')' ;
ARGS: | ARG | ARGS ',' ARG ;
ARG: EXPR | STRING ;

Parser

It is enough to add the usual bison heading to turn the grammar into a dummy parser:

%{
#include <iostream>
extern int yylineno;
extern int yylex();
void yyerror( char *s) {
std::cerr << s << ", line " << yylineno << std::endl;
exit( 1 );
}
#define YYSTYPE std::string
%}

%token IF ELSE WHILE EXIT
%token EQ LE GE NE
%token STRING NUM ID

%%

PROGRAM: OPS
;

OPS: OP
| OPS OP
;

OP1: '{' OPS '}'
| EXPR ';'
| IF '(' EXPR ')' OP1 ELSE OP1
| WHILE '(' EXPR ')' OP1
| EXIT ';'
;

OP2: IF '(' EXPR ')' OP
| IF '(' EXPR ')' OP1 ELSE OP2
| WHILE '(' EXPR ')' OP2
;

OP: OP1 | OP2 ;

EXPR: EXPR1
| ID '=' EXPR

EXPR1: EXPR2
| EXPR1 EQ EXPR2
| EXPR1 LE EXPR2
| EXPR1 GE EXPR2
| EXPR1 NE EXPR2
| EXPR1 '>' EXPR2
| EXPR1 '<' EXPR2
;

EXPR2: TERM
| EXPR2 '+' TERM
| EXPR2 '-' TERM
;

TERM: VAL
| TERM '*' VAL
| TERM '/' VAL
;

VAL: NUM
| '-' VAL
| '!' VAL
| '(' EXPR ')'
| ID
| ID '(' ARGS ')'
;

ARGS:
| ARG
| ARGS ',' ARG
;

ARG: EXPR
| STRING
;

%%
int main() { return yyparse(); }

This time it is more convenient for us to store in each token not a number, but a string ( std::string ). The type for character values is specified by the YYSTYPE macro.

It remains to attach lexer to grammar; there will be a small trick - named states, for recognizing iscaps inside strings.
Files of our language will be called jsk .

%{
#include <string>
#define YYSTYPE std::string
#include "jsk.tab.h"
void yyerror( char *s);
%}

%option yylineno
%option noyywrap

%x STR

%%

[/][/] .*\n ; // comment
if return IF;
else return ELSE;
while return WHILE;
exit return EXIT;
== return EQ;
[<] = return LE;
>= return GE;
!= return NE;
[0-9] + { yylval = yytext ;
return NUM;
}
[a-zA-Z_][a-zA-Z0-9_] * { yylval = yytext ;
return ID;
}
["] { yylval = "" ; BEGIN (STR); }
<STR> [^\\\n"] + yylval += yytext ;
<STR> \\n yylval += '\n' ;
<STR> \\ ["] yylval += '"' ;
<STR> \\ yyerror( "Invalid escape sequence" );
<STR> \n yyerror( "Newline in string literal" );
<STR> ["] { BEGIN (INITIAL); return STRING; }
[ \t\r\n] ; // whitespace
[-{};()=<>+*/!,] { return * yytext ; }
. yyerror( "Invalid character" );

%%

The % x declaration defines a named state in which the lexer enters not as a result of reading input characters, but as a result of an explicit call to BEGIN(STR) . The initial state is called INITIAL ; To return to it, call BEGIN(INITIAL) . As a result, we have two different sets of regexps in the lexer: one for the main program, the other for string literals. On each quote switch between these two sets.

The resulting dummy parser looks for syntax errors in the program; and if it does not find it, then it does not display anything:

[tyomitch@home ~]$ lex jsk.lex
[tyomitch@home ~]$ bison -d jsk.y
[tyomitch@home ~]$ c++ lex.yy.c jsk.tab.c
[tyomitch@home ~]$ ./a.out < test.jsk
[tyomitch@home ~]$ ./a.out
{ foo();
bar; }
}
syntax error, line 3

Syntax tree

What can be done useful in the rule folding?
In the past times, we were able to perform all the necessary actions right in the convolutions, right in the course of the analysis, (calculate the value of the expression).
A more universal approach is to build a tree for the input text that matches the syntactic structure of the text and, after parsing, finish processing this tree. For getting information from trees, there are a lot of methods not related to compilation: from recursive traversals to XPath queries.

For each node type in the tree, we define a separate class, and we will store pointers to the node objects on the parser stack. Besides them, the stack may contain lines from the lexer, as well as “intermediate results” for which we decide not to select a separate node in the tree. In our example, the tree node does not correspond to the ARGS symbol: we will store references to all the function arguments directly in the “function call” node. In other words, the syntax tree does not have to match the parse tree exactly; we have the right to rebuild it more comfortably. Another example of inconsistency is that there will not be an “operator” node in our syntax tree; only a “list of operators” - possibly from a single element.

%{
#include <iostream>
#include <list>
extern int yylineno;
extern int yylex();
void yyerror( char *s) {
std::cerr << s << ", line " << yylineno << std::endl;
exit( 1 );
}

class oper_t { // abstract
protected : oper_t() {}
public : virtual ~oper_t() {}
};

class expr_t { // abstract
protected : expr_t() {}
public : virtual ~expr_t() {}
};

class block : public oper_t {
std::list<oper_t*> ops;
void append(oper_t* op) {
block* b = dynamic_cast <block*>(op);
if (b) {
ops.splice(ops.end(), b->ops, b->ops.begin(), b->ops.end());
delete b;
}
else ops.push_back(op);
}
public :
block() {}
block(oper_t* op) { append(op); }
block(oper_t* op1, oper_t* op2) { append(op1); append(op2); }
};

class exprop : public oper_t {
expr_t* expr;
public : exprop(expr_t* expr) : expr(expr) {}
};

class ifop : public oper_t {
expr_t* cond;
block thenops, elseops;
public : ifop(expr_t* cond, oper_t* thenops, oper_t* elseops) :
cond(cond), thenops(thenops), elseops(elseops) {}
};

class whileop : public oper_t {
expr_t* cond;
block ops;
public : whileop(expr_t* cond, oper_t* ops) : cond(cond), ops(ops) {}
};

class exitop : public oper_t {};

class binary : public expr_t {
const char * op;
expr_t *arg1, *arg2;
public : binary( const char * op, expr_t *arg1, expr_t *arg2) :
op(op), arg1(arg1), arg2(arg2) {}
};

class assign : public expr_t {
std::string name;
expr_t* value;
public : assign( const std::string& name, expr_t* value) :
name(name), value(value) {}
};

class unary : public expr_t {
const char * op;
expr_t* arg;
public : unary( const char * op, expr_t* arg) : op(op), arg(arg) {}
};

class funcall : public expr_t {
std::string name;
std::list<expr_t*> args;
public : funcall( const std::string& name,
const std::list<expr_t*>& args) :
name(name), args(args) {}
};

class value : public expr_t {
std::string text;
public : value( const std::string& text) : text(text) {}
};

// : , , ,
typedef struct {
std::string str;
oper_t* oper;
expr_t* expr;
std::list<expr_t*> args;
} YYSTYPE;
#define YYSTYPE YYSTYPE

//
std::string replaceAll( const std::string& where, const std::string& what, const std::string& withWhat) {
std::string result = where;
while ( 1 ) {
int pos = result.find(what);
if (pos==- 1 ) return result;
result.replace(pos, what.size(), withWhat);
}
}
%}

%token IF ELSE WHILE EXIT
%token EQ LE GE NE
%token STRING NUM ID

%type < str > ID NUM STRING
%type < oper > OPS OP1 OP2 OP
%type < expr > EXPR EXPR1 EXPR2 TERM VAL ARG
%type < args > ARGS

%%

PROGRAM: OPS //
;

OPS: OP // inherit
| OPS OP { $$ = new block( $1 , $2 ); }
;

OP1: '{' OPS '}' { $$ = $2 ; }
| EXPR ';' { $$ = new exprop( $1 ); }
| IF '(' EXPR ')' OP1 ELSE OP1 { $$ = new ifop( $3 , $5 , $7 ); }
| WHILE '(' EXPR ')' OP1 { $$ = new whileop( $3 , $5 ); }
| EXIT ';' { $$ = new exitop(); }
;

OP2: IF '(' EXPR ')' OP { $$ = new ifop( $3 , $5 , new block()); }
| IF '(' EXPR ')' OP1 ELSE OP2 { $$ = new ifop( $3 , $5 , $7 ); }
| WHILE '(' EXPR ')' OP2 { $$ = new whileop( $3 , $5 ); }
;

OP: OP1 | OP2 ; // inherit

EXPR: EXPR1 // inherit
| ID '=' EXPR { $$ = new assign( $1 , $3 ); }

EXPR1: EXPR2 // inherit
| EXPR1 EQ EXPR2 { $$ = new binary( "==" , $1 , $3 ); }
| EXPR1 LE EXPR2 { $$ = new binary( "<=" , $1 , $3 ); }
| EXPR1 GE EXPR2 { $$ = new binary( ">=" , $1 , $3 ); }
| EXPR1 NE EXPR2 { $$ = new binary( "!=" , $1 , $3 ); }
| EXPR1 '>' EXPR2 { $$ = new binary( ">" , $1 , $3 ); }
| EXPR1 '<' EXPR2 { $$ = new binary( "<" , $1 , $3 ); }
;

EXPR2: TERM // inherit
| EXPR2 '+' TERM { $$ = new binary( "+" , $1 , $3 ); }
| EXPR2 '-' TERM { $$ = new binary( "-" , $1 , $3 ); }
;

TERM: VAL // inherit
| TERM '*' VAL { $$ = new binary( "*" , $1 , $3 ); }
| TERM '/' VAL { $$ = new binary( "/" , $1 , $3 ); }
;

VAL: NUM { $$ = new value( $1 ); }
| '-' VAL { $$ = new unary( "-" , $2 ); }
| '!' VAL { $$ = new unary( "!" , $2 ); }
| '(' EXPR ')' { $$ = $2 ; }
| ID { $$ = new value( $1 ); }
| ID '(' ARGS ')' { $$ =new funcall( $1 , $3 ); }
;

ARGS: { $$ .clear(); }
| ARG { $$ .clear(); $$ .push_back( $1 ); }
| ARGS ',' ARG { $$ = $1 ; $$ .push_back( $3 ); }
;

ARG: EXPR // inherit
| STRING { $$ =new value( '"' +replaceAll( $1 , " \n " , " \\ n" )+ '"' ); }
;

%%
int main() { return yyparse(); }

The %type<str> declaration specifies for grammar characters (terminals and non-terminals) which YYSTYPE fields should be used as the value of a character. If the field is set, the bison will substitute it every time the $ -tag refers to such a grammar symbol. If the field is not specified (as in the previous examples), then YYSTYPE used entirely.
In classical syshnyh parsers, YYSTYPE declared as a union : each character interprets the same stored value in its own way. This does not suit us: an object with a destructor cannot be a union field; therefore, we declare YYSTYPE as a structure, and the structure fields of the structure that are not needed by the symbol will simply take up nothing. But we are not greedy.

The unpleasant moment is that due to the repair of the conflict with else in the grammar two identical if..else rules and two identical while rules appeared. It’s on our conscience to choose what is more unpleasant: to give priority to else and ) , as if they were operators; or copy-paste convolution code twice.

So, our parser builds a tree in memory, and when it is ready, it just comes out, without even freeing the memory. Not too impressive?

Pretty-printing

It takes quite a bit of effort for the parser to do something meaningful; for example, placed brackets in expressions, deleted redundant {} and, and in addition, leveled operators using BSD Kernel Normal Form.
Let's add a recursive printout to our tree classes.
Only class definitions and convolution of the PROGRAM symbol are PROGRAM .

#define TOKENPASTE(x, y) x ## y
#define TOKENPASTE2(x, y) TOKENPASTE(x, y)
#define foreach(i, list) typedef typeof(list) TOKENPASTE2(T, __LINE__ ); \
                    for (TOKENPASTE2(T, __LINE__ )::iterator i = list.begin(); i != list.end(); i++)

class oper_t { // abstract
protected : oper_t() {}
public : virtual ~oper_t() {}
virtual void print( int indent= 0 ) = 0 ;
};

class expr_t { // abstract
protected : expr_t() {}
public : virtual ~expr_t() {}
virtual void print() = 0 ;
};

class block : public oper_t {
std::list<oper_t*> ops;
void append(oper_t* op) {
block* b = dynamic_cast <block*>(op);
if (b) {
ops.splice(ops.end(), b->ops, b->ops.begin(), b->ops.end());
delete b;
}
else ops.push_back(op);
}
public :
block() {}
block(oper_t* op) { append(op); }
block(oper_t* op1, oper_t* op2) { append(op1); append(op2); }
int size() { return ops.size(); }
virtual void print( int indent= 0 ) {
foreach(i, ops) {
std::cout << std::string(indent, '\t' );
(*i)->print(indent);
}
}
virtual ~block() { foreach(i, ops) delete *i; }
};

class exprop : public oper_t {
expr_t* expr;
public : exprop(expr_t* expr) : expr(expr) {}
virtual void print( int indent= 0 ) {
expr->print();
std::cout << ";" << std::endl;
}
virtual ~exprop() { delete expr; }
};

class ifop : public oper_t {
expr_t* cond;
block thenops, elseops;
public : ifop(expr_t* cond, oper_t* thenops, oper_t* elseops) :
cond(cond), thenops(thenops), elseops(elseops) {}
virtual void print( int indent= 0 ) {
std::cout << "if " ; cond->print(); std::cout << " {" << std::endl;
thenops.print(indent+ 1 );
if (elseops.size()) {
std::cout << std::string(indent, '\t' ) << "} else {" << std::endl;
elseops.print(indent+ 1 );
}
std::cout << std::string(indent, '\t' ) << "}" << std::endl;
}
virtual ~ifop() { delete cond; }
};

class whileop : public oper_t {
expr_t* cond;
block ops;
public : whileop(expr_t* cond, oper_t* ops) : cond(cond), ops(ops) {}
virtual void print( int indent= 0 ) {
std::cout << "while " ; cond->print(); std::cout << " {" << std::endl;
ops.print(indent+ 1 );
std::cout << std::string(indent, '\t' ) << "}" << std::endl;
}
virtual ~whileop() { delete cond; }
};

class exitop : public oper_t {
virtual void print( int indent= 0 ) { std::cout << "exit;" << std::endl; }
};

class binary : public expr_t {
const char * op;
expr_t *arg1, *arg2;
public : binary( const char * op, expr_t *arg1, expr_t *arg2) :
op(op), arg1(arg1), arg2(arg2) {}
virtual void print() {
std::cout<< "(" ;
arg1->print();
std::cout<<op;
arg2->print();
std::cout<< ")" ;
}
virtual ~binary() { delete arg1; delete arg2; }
};

class assign : public expr_t {
std::string name;
expr_t* value;
public : assign( const std::string& name, expr_t* value) :
name(name), value(value) {}
virtual void print() { std::cout<<name<< " = " ; value->print(); }
virtual ~assign() { delete value; }
};

class unary : public expr_t {
const char * op;
expr_t* arg;
public : unary( const char * op, expr_t* arg) : op(op), arg(arg) {}
virtual void print() { std::cout<<op; arg->print(); }
virtual ~unary() { delete arg; }
};

class funcall : public expr_t {
std::string name;
std::list<expr_t*> args;
public : funcall( const std::string& name,
const std::list<expr_t*>& args) :
name(name), args(args) {}
virtual void print() {
std::cout<<name<< "(" ;
foreach(i,args) {
if (i!=args.begin())
std::cout<< ", " ;
(*i)->print();
}
std::cout<< ")" ;
}
virtual ~funcall() { foreach(i,args) delete *i; }
};

class value : public expr_t {
std::string text;
public : value( const std::string& text) : text(text) {}
virtual void print() { std::cout<<text; }
};

PROGRAM: OPS { $1 - > print(); delete $1 ; }
;

Checking what happened:

[tyomitch@home ~]$ lex jsk.lex
[tyomitch@home ~]$ bison -d jsk.y
[tyomitch@home ~]$ c++ jsk.tab.c lex.yy.c
[tyomitch@home ~]$ ./a.out < test.jsk
from = 0;
to = 1000;
echo(" ", from, " ", to, ", \n");
while (from <= to) {
guess = ((from + to) / 2);
echo(" ", guess, "? (1=, 2=, 3=) ");
i = input();
if (i == 1) {
to = (guess - 1);
} else {
if (i == 2) {
from = (guess + 1);
} else {
if (i == 3) {
echo("! !\n");
exit;
} else {
echo(" !\n");
}
}
}
}
echo(", !\n");

A formatted listing of the input text is a common way to initially check the parser. Sample code from the beginning of the post is not very good in this regard, because covers not all language constructs. But it is intended to illustrate the language, not for testing.

It is clear that for the sake of text formatting alone there was no point in bothering with trees: it would be enough to keep a list of text lines for each operator on the stack, and merge these lines into rolls, adding indents. Yes, there, for the sake of indentation, even a bison is not needed: and flex would have done it. (Although bracing is a bit more interesting.)

Next time we take a break from the bison .

Source: https://habr.com/ru/post/99397/

All Articles