Working with the coco / r translator generator

coco / r generator compilers and translators, which according to attribute grammar generates a scanner (lexical analyzer) and a parser (syntax analyzer). The scanner is built as a deterministic finite state machine, and the parser is constructed by a recursive descent.

1. Grammar

The generator generates a code according to the RBNF rules. Because The parser is worth a recursive descent, and this is a top-down parsing, then the grammar should belong to LL (k). Ideally, LL (1), but conflict resolution is done in coco / r.
')
LL (1) conflicts are resolved by semantic actions and viewing ahead of the tokens.

A grammar is LL (1) if the conditions for two different products S-> A | B are:
1. There is no such terminal a for which A and B generate a string starting with a.
2. An empty line can be generated by no more than one of products A or B.
3. If e belongs to the set FIRST (B), then the sets FIRST (A) FOLLOW (S) do not intersect. A similar statement for e belonging to FIRST (A) is also true.

2. Language coco / r

The generator can create code in languages: java, C ++, C #, F #, VB.Net, Oberon.
In the example C ++ was taken.

All rules must end with a period.
Products of the form S-> AB will be written as S = A B.
| - or
( ) - Group
[] - 0 or 1
{} - 0 or more

2.1 Import, i.e. connect headers.

#include <string> #include <sstream> #include <iostream> #include <fstream> #include <vector>

2.2 Compiler ID , i.e. followed by the keyword COMPILER, followed by the non-terminal where your grammar begins.

COMPILER expr

2.3 Global variables and functions.
After the keyword COMPILER, you can insert a section with variables and functions. If there are many functions, it is better to put them in a separate file, since grammar and without them will turn out very scary and unreadable.

 int toInt(const std::wstring& strbuf) { std::wstringstream converter; int value = 0; converter << strbuf; converter >> value; return value; } std::wstring toString ( int Number ) { std::wostringstream ss; ss << Number; return ss.str(); }

2.4 Rules for generating a scanner.
2.4.1 IGNORECASE keyword to ignore case of characters.

2.4.2 CHARACTERS - section in which we describe a set of valid characters. For example, what is a letter for us and what is a number.

Charters
letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".
digit = "0123456789".

Also here you can describe the characters separators.
cr = '\ r'.
lf = '\ n'.
tab = '\ t'.

An entry of the form char 1 .. char2 denotes a sequence of characters from char1 to char2.
digit = '0' .. '9'.

ANY - any sequence of characters.
+ including many
- not including

For example, you can describe a set of characters that does not include double quotes:
verbatimStringChar = ANY - '"'.

Set of characters for hexadecimal number system:
hexDigit = digit + "ABCDEFabcdef".

2.4.3 Tokens themselves, or another name - tokens. Keyword section - TOKENS. We describe what we mean by the lexeme "identifier" and the lexeme "number".

TOKENS
ident = letter {letter | digit | "_"}.
number = digit {digit}.

The token "string", like any sequence of characters enclosed in double quotes:
string = "\" "{verbatimStringChar}" \ "".

Number of hexadecimal system:
hex = "0x" {hexDigit hexDigit}.

2.4.4 Special Options Section.

2.4.5 Comments.
COMMENTS FROM "/ *" TO "* /" NESTED
COMMENTS FROM "//" TO cr lf

2.4.6 Separators. We write which delimiters we ignore.
IGNORE cr + lf + tab

2.5 Rules for parser generation.
Start with the keyword PRODUCTIONS.
Expression grammar example:
Expr = ident ': =' NumExpr ';'
NumExpr = Term {('+' | '-') NumExpr}
Term = Multiplier {('*' | '/') Term}
Multiplier = ident | number | '(' NumExpr ')'

All attributes are written in brackets <>. If you use something from stl, for example, list, then attributes should be placed in brackets with dots, i.e. <. .>.
Attributes are written at non-terminals and are translated as function parameters.
The terminal has no attributes, but if you want so much, you can wrap it in a non-terminal:
LogOp <std :: wstring & op> = ("&&" | "||") (.op = t-> val ;.).

Semantic actions are written in brackets with dots (..). This is the code that the generator inserts into the parser. The code is written in the language in which the lexer and parser are generated.

ANY is a keyword in the parser section for any token. So we can describe the “text” as {ANY}.

2.6 The keyword END , followed by the name you entered after COMPILER in 2.2.

3. Conflict resolution.

It is very important to understand that the grammar should correspond to the type of the analyzer. If this condition is violated, the generator begins to swear in terrible words.

3.1 Factoring.
Rules start with the same token. The generator writes several alternatives
Example:
S-> a '=' B | a '(' C ')'
As you can see, 2 rules start with the token "a", which violates the first rule LL (1). We rewrite it as S-> a ('=' B | '(' C ')')
Some conflicts such as if-else cannot be resolved.
Statement = if '(' Eepr ')' Statement [else Statement].
if (a> b) if (a> c) max = a; else max = b;
It is not clear what to choose here, therefore an agreement was made: to choose the first alternative. In this example, the choice is correct, but in your grammar it is better to avoid such ambiguities.

Failed to write.
S = a [id b] A.
A = id {.id}.
It is not clear which rule to choose, since [id b] and A begin with identical tokens. In this case, it is best to rewrite the grammar:
S = a id (b A | {. Id}).

3.2 Left recursion.
Left recursion for LL grammar is in no way valid. It must be removed by conversion. Fortunately, any left recursion can be converted to right.
A-> Aa1 | ... | Aan | b1 ... | bm
on
A-> b1B | .. | bmB
B-> a1B | .. | anB | e

Record in RBNF:
A = A b | c.
on
A = c {b}.

3.3 Semantic significance.
In some cases, the choice of rules is based on their semantics. For example, an expression with types:
Expr = Factor {'+' Factor}.
Factor = '(' ident ')' Factor | '(' Expr ')' | ident | number.

Those. Such a grammar allows the following chains:
a + b
(a) + b
(int) a + b
In the last thread, the choice of the rule '(' ident ')' Factor is determined by the semantics of the identifier. Those. This rule is chosen if we have the type as ident.

! An extremely unfortunate example in terms of building a language. Usually in the grammar "keywords" are described by a separate rule. Then there is no need for ID checks.

Another example.
A = ident (. X = 1;.) {',' Ident (.x ++ ;.)} ':' | ident (.Foo () ;.) {',' ident (.bar () ;.)} ';'.
In this case, the grammar cannot be edited, since the same parts of the rule have different semantic actions. To define a rule, you must look at all the tokens to a colon or semicolon. Only then will it become clear which rule to choose.

Decision:
In the grammar, you can insert a Boolean function, which will be selected alternative.
S = a [id b] A.
A = id {.id}.

S = a [IF (isAlias ()) id b] A.
IsAlias () is a function that scans 2 tokens in front of it. If it is b, then returns true.

Token t - just recognized token
Token la - the next token

t.val - the value of the token
t.kind - type of token, determined by lexer

A = IF (FollowedByColon ())
ident (. x = 1;.) {',' ident (.x ++ ;.)} ':'
| ident (.Foo () ;.) {',' ident (. Bar () ;.)} ';'.

 bool FollowedByColon(){ //   Token x = la; //      while(x.kind==_comma || x.kind== _ident) //     x=scanner.Peek(); // true,     return x.kind==_colon; }

Notes:
IF keyword generator language.
The FollowedByColon () function refers to the first rule. If she gave true, then she considers him.
The names of the types of tokens assigns a scanner. But if in the TOKENS section to make such announcements
ident = letter {letter | digit | "_"}.
comma = ','.
semicolon = ';'.
colon = ':'.
Then the scanner will generate constants with good names:
const int _EOF = 0;
const int _ident = 1;
const int _comma = 2;
const int _semicolon = 3;
const int _colon = 4;

From the point of view of building a language, a separate description of each special character as a token has no meaning. But if the need arose to write the conditions in which verification is required of the tokens in front, this description can be useful.

If in the first rule you had a function in which you checked tokens, and in the second rule you also have a function, then the scanner should return to its original position. Reset position can function scanner.ResetPeek ().

4. Sample code

We translate the expression into a postfix notation. Expressions should be derived from this grammar:
Expr = ident ': =' NumExpr ';'
NumExpr = Term {('+' | '-') NumExpr}
Term = Multiplier {('*' | '/') Term}
Multiplier = ident | number | '(' NumExpr ')'

atg file:

 #include <string> #include <sstream> #include <iostream> #include <fstream> #include <vector> COMPILER expr int toInt(const std::wstring& strbuf) { std::wstringstream converter; int value = 0; converter << strbuf; converter >> value; return value; } std::wstring toString ( int Number ) { std::wostringstream ss; ss << Number; return ss.str(); } IGNORECASE CHARACTERS letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz". digit = "0123456789". cr = '\r'. lf = '\n'. tab = '\t'. TOKENS ident = letter {letter | digit | "_"}. number = digit {digit}. COMMENTS FROM "/*" TO "*/" NESTED COMMENTS FROM "//" TO cr lf IGNORE cr + lf + tab PRODUCTIONS expr (.std::wstring str;.) = (.std::wstring s,s1,s2,s3,s4; .) ident (.s1=t->val;.)":=" NumExpr<s2> ";" (. str+=s1; str+=s2; str+=L":=\n";.) {ident(.s3=t->val; s4=L"";.)":=" NumExpr<s4> ";" (. s+=s3; s+=s4; s+=L":=\n"; .)} (. str+=s; std::wofstream outfile ("out.txt", std::ios_base::out); outfile << str << std::endl; outfile.close(); .) . NumExpr<std::wstring &str> = (.std::wstring s1,s2, op; .) Term<s1> (.str+=s1;.) { ("+" (.op=L"+";.)|"-" (.op=L"-";.)) NumExpr<s2> (. str+=s2; str+=op; .) }. Term<std::wstring &str> =(.std::wstring s1,s2, op;.) Multiplier<s1> (.str+=s1;.) { ("*" (.op=L"*";.)|"/"(.op=L"/";.)) Term<s2> (. str+=s2; str+=op; .) }. Multiplier<std::wstring &str> = ident (.str=t->val; .) | number (.str=t->val;.) | (.std::wstring s; .) "(" NumExpr<s> ")" (.str=s;.). END expr.

main:

 #include <iostream> #include <wchar.h> #include "Parser.h" #include "Scanner.h" #include <string> using namespace std; main(int argc, char *argv[]) { if (argc == 2) { wchar_t *file = coco_string_create(argv[1]); Scanner *scanner = new Scanner(file); Parser *parser = new Parser(scanner); parser->Parse(); delete parser; delete scanner; delete file; return 0; } else { cout << "Use: translator filename" << endl; return 1; } }

make:

 all: translator translator: Coco scanner.o parser.o main.o g++ -o tr.exe scanner.o parser.o main.o main.o: main.cpp g++ -c main.cpp scanner.o: Scanner.cpp Scanner.h g++ -c Scanner.cpp -o scanner.o parser.o: Parser.cpp Parser.h g++ -c Parser.cpp -o parser.o Coco: expr.atg coco expr.atg clean: del Scanner.cpp Scanner.h Parser.cpp Parser.h del Scanner.cpp.old Scanner.h.old Parser.cpp.old Parser.h.old del scanner.o parser.o main.o del translator

Parser function, built on grammar

 void Parser::NumExpr(std::wstring &str) { std::wstring s1,s2, op; Term(s1); str+=s1; while (la->kind == 5 || la->kind == 6) { if (la->kind == 5) { Get(); op=L"+"; } else { Get(); op=L"-"; } NumExpr(s2); str+=s2; str+=op; } }

entrance
a: = b;
a: = a-5;
a: = 9-5 + 2;
a: = 2 + 3 * 4;
a: = (5-4) * (3 + 2);

output
ab: =
aa5 -: =
a952 + -: =
a234 * +: =
a54-32 + *: =

Source: https://habr.com/ru/post/139972/

All Articles

Working with the coco / r translator generator

More articles: