Introduction
Introduction is the word, so hello, dear Habr user. I would like to present some material on the practical creation of a compiler that translates code written in the Cool language into CIL (Common Intermediate Language) virtual machine code for the .NET platform. I decided to split this material into two parts, being too lazy to describe it all at once.
In the first part we will cover writing the grammar (taking operator precedence into account) in the ANTLR environment, as well as generating the lexer and parser for C#. It will also examine the pitfalls that I met on my way. This way I will try to save someone some time (maybe my future self).
In the second part we will describe building the semantic analyzer, code generation, and a home-grown code optimization that nobody really needs. It will also show how to make a nice interface, with blackjack and hookers, with syntax highlighting and code folding, as in modern IDEs. At the end of the second part I will, of course, publish all the sources of my solution and talk about further improvements to the architecture and code, at least as they seem to me.
Warning: before developing this compiler I had read practically no thematic literature. Everything was done by reading a few articles that interested me, watching the ANTLR video tutorials on the official website, and talking to knowledgeable groupmates. So the developed software is far from ideal.
In order for the reader to better understand me, I will first give some definitions (from Wikipedia):
- Token - a sequence of characters identified during lexical analysis that corresponds to a lexeme.
- Lexer - a module that analyzes an input sequence of characters (for example, source code in one of the programming languages) and produces a sequence of tokens at the output (like grouping letters into words).
- Parser - a module that matches a linear sequence of lexemes (words, tokens) of a language against its formal grammar. The result is a parse tree or a syntax tree.
On Habrahabr, and not only there, there are many articles explaining in detail how to write grammars for mathematical expressions and languages, so I will not dwell on setting up the environment. I will also not go into the theoretical issues of compiler construction, since there are a number of quite good articles on this topic, for example, this series. So, the following programs and libraries were used to develop the compiler:
- ANTLRWorks - an IDE for writing grammars and generating lexer and parser code for various languages (C, C#, Java, JavaScript, Objective-C, Perl, Python, Ruby, etc.). It is based on LL(*) parsing.
- ANTLR C# runtime distribution (Antlr3.Runtime.dll) - used by the generated lexer and parser.
- Visual Studio 2010. In the next article we will also look at AvalonEdit, a WPF component - an advanced text editor with syntax highlighting and code completion.
- Java Runtime Environment (since ANTLRWorks is written in Java).
- IL Disassembler - will be used in the second part to understand how an existing compiler (for example, the C# one) generates CIL code and what that code should look like.
Cool language description
Cool (Classroom Object-Oriented Language) is, as we can see from the name, a useless programming language that includes classes and objects. Cool is a functional, strictly typed language, i.e. type checking happens statically at compile time. A program is essentially a set of classes. Classes consist of a set of fields and functions (here they are called features).
All fields are visible inside the base and derived classes. All functions are visible from everywhere (public).
This language is built on expressions (expression). All functions can take expressions as arguments, and the bodies of all functions are expressions that return a result. Even functions such as string output return a result, namely an instance of the class on which they were called.
Cool also contains the built-in classes String, Int and Bool, from which you cannot inherit and which cannot take null values (here called void). Objects of these classes are passed by value, not by reference.
Keywords in this language: class, else, false, fi, if, in, inherits, isvoid, let, loop, pool, then, while, case, esac, new, of, not, true
I want to note that some constructs end not with a curly brace (as in C-like languages) and not with the word end (as in good old dead Pascal), but with the name of the same keyword spelled backwards (if -> fi, loop -> pool, case -> esac).
Now, if every statement returns a result, what should if return? After all, it can return either of two alternatives. The correct answer: an object whose type is the closest common ancestor of the types of the two branches (the textbook describes this in a rather convoluted way, using a special operator). Here is a picture for clarity:

Figure 1.
Here, in classes A and B, the closest common ancestor is class D.
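To make this concrete, here is a minimal sketch (not the compiler's actual implementation, which is a topic for the second part) of how such a closest common ancestor could be computed in C#, assuming the class hierarchy is stored in a hypothetical child-to-parent dictionary:

using System.Collections.Generic;

static class TypeHierarchy
{
    // parentOf maps a class name to the name of its direct parent
    static string Parent(string type, Dictionary<string, string> parentOf)
    {
        string p;
        return parentOf.TryGetValue(type, out p) ? p : null;
    }

    public static string ClosestCommonAncestor(string a, string b,
                                               Dictionary<string, string> parentOf)
    {
        // collect a and all of its ancestors up to the root
        var ancestorsOfA = new HashSet<string>();
        for (var t = a; t != null; t = Parent(t, parentOf))
            ancestorsOfA.Add(t);

        // walk up from b until we meet a type that is also an ancestor of a
        for (var t = b; t != null; t = Parent(t, parentOf))
            if (ancestorsOfA.Contains(t))
                return t;

        return "Object"; // in Cool every class ultimately inherits from Object
    }
}

For the hierarchy in Figure 1, where A and B both inherit from D, ClosestCommonAncestor("A", "B", parentOf) returns "D".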
The case statement is like an if statement applied several times.
The while construct always returns void (unless, of course, the loop runs forever).
A block construct { <expr1>; ... <exprn>; } returns the object produced by exprn, i.e. the last expression.
Functions here can be called not only on the class itself and its ancestors (virtual functions), but explicitly on any ancestor of the class. This is done with the '@' operator. As an example, consider the class hierarchy in Figure 1. Let class E have a function func(), and let all of its descendants override it in their own way. Then, if we have an instance of class A and want to call func as defined in class D, we need the following syntax:
result <- (new A)@D.func()
The keyword SELF_TYPE is used as a synonym for the class in which the given function is described. The self identifier is a pointer to the current object (i.e. this in other languages).
Local variables in Cool are introduced with the let operator and can only be used within the corresponding let block.
void is an analogue of null, a sign that an object has not been created.
I think the rest of the operators need no explanation, and if something is unclear, you can study the manual for this Cool dialect via the link at the end of the article.
Writing Cool Language Grammar at ANTLRWorks
So, the original grammar is given in the following form:
program ::= [[class; ]]+
class ::= class TYPE [inherits TYPE] { [[feature; ]]* }
feature ::= ID( [ formal [[, formal]]* ] ) : TYPE { expr }
          | ID : TYPE [ <- expr ]
formal ::= ID : TYPE
expr ::= ID <- expr
       | expr[@TYPE].ID( [ expr [[, expr]]* ] )
       | ID( [ expr [[, expr]]* ] )
       | if expr then expr else expr fi
       | while expr loop expr pool
       | { [[expr; ]]+ }
       | let ID : TYPE [ <- expr ] [[, ID : TYPE [ <- expr ]]]* in expr
       | case expr of [[ID : TYPE => expr; ]]+ esac
       | new TYPE
       | isvoid expr
       | expr + expr
       | expr - expr
       | expr * expr
       | expr / expr
       | ~expr
       | expr < expr
       | expr <= expr
       | expr = expr
       | not expr
       | (expr)
       | ID
       | integer
       | string
       | true
       | false
Notation: [[ ]]* - iteration; [[ ]]+ - positive iteration.
So, what should be done after seeing such a grammar? Forget about it and about compilers forever - who needs yet another half-baked language?? Seriously though, if you rewrite this grammar in ANTLR as is and generate the lexer and parser code, nothing good will come of it, for the following reasons:
- The presence of left recursion. Of course, ANTLRWorks, which is based on the recursive descent method, will immediately detect it and produce an error of the following kind: [fatal] rule compilation_unit has non-LL(*) ... Resolve by left-factoring or using syntactic predicates or using backtrack=true option ... And, of course, ANTLR has the backtrack=true option, which lets you get around this limitation. But having applied it, you have to put up with extra states in the generated finite automaton for expressions with left recursion, so it is still worth getting rid of it. Besides, as can be seen from this article, this option slows down the parser, and the author recommends abandoning it where possible.
- But the main thing is that operator precedence is not taken into account in this form of the grammar. While left recursion only affects parser efficiency, ignoring this problem will lead to incorrect code generation later (for example, 2 + 2 * 2 would evaluate to 8 instead of 6).
We read the manual further, and right after that the operator priorities are described. They are as follows (the highest priority at the top, the lowest at the bottom):
- .
- @
- ~
- isvoid
- * /
- + -
- <= < =
- not
- <-
So, we need to build our grammar around these operators. To do this, we stick to the following construction rule: for each operator of lower priority, we create a new rule that contains the operator of this priority and, in its right-hand side, the nonterminal for the next higher priority. I know this sounds confusing, so here is this construction for our original grammar:
expr: (ID ASSIGN^)* not;
not: (NOT^)* relation;
relation: addition ((LE^ | LT^ | GE^ | GT^ | EQUAL^) addition)*;
addition: multiplication ((PLUS^ | MINUS^) multiplication)*;
multiplication: isvoid ((MULT^ | DIV^) isvoid)*;
isvoid: (ISVOID^)* neg;
neg: (NEG^)* dot;
dot: term ((ATSIGN typeName)? DOT ID LPAREN invokeExprs? RPAREN)?
    -> ^(Term term ID? typeName? invokeExprs?);
The term rule here covers all the remaining expressions that were not listed above, i.e. expressions of the same priority. For example, with these rules the expression 2 + 2 * 2 is parsed into the tree ^(PLUS 2 ^(MULT 2 2)), so the multiplication is performed first.
Comments in the language are essentially a sequence of characters that must be ignored during lexical analysis. For this, the following ANTLR construction is used:
{$channel = Hidden;}
written to the right of each rule describing comments. For example:
COMMENT : '--' .* ('\n'|'\r') {$channel = Hidden;};
ANTLR Operators
Next, I will describe the ANTLR operators used:
- ! - needed so that auxiliary tokens of the language do not clutter the tree: parentheses and other tokens that are needed only by the parser (for example, the closing keywords of syntactic constructs such as fi and esac). These operators are what let us build an abstract syntax tree from the token stream rather than a full parse tree.
- -> - needed to convert the sequence of tokens on the left-hand side into the tree structure described on the right-hand side, which is convenient to work with during code generation. The ^ operator is also used together with this operator.
- ^ - used to make the marked token a new node (root) whose children are its operands in the generated syntax tree. This operator has two forms: the inline form, where ^ is placed right after a token inside a rule (for example, PLUS^), and the tree-rewrite form ^(root child1 child2 ...) used on the right-hand side of the -> operator.
All the other ANTLR operators used here need no comments (at least for anyone who has studied discrete mathematics), and if they do, only brief ones:
- * - Iteration.
- + - Positive iteration.
- ? - The preceding token or group of tokens may or may not be present.
I want to note that when using the '->' and '^' operators, it becomes necessary to introduce new tokens. This is done in the corresponding tokens {} section. By the way, I am in favor of keeping the amount of code specific to the target lexer and parser language (in our case C#) in the .g file to a minimum. However, such code still had to be used for the STRING token:
STRING: '"' { StringBuilder b = new StringBuilder(); } ( '"' '"' { b.Append('"');} | c=~('"'|'\r'|'\n') { b.Append((char)c);} )* '"' { Text = b.ToString(); } ;
Here a StringBuilder is used to make string processing efficient. In principle, other inclusions of such code are acceptable for optimizing the lexer, but not in every rule. To specify the namespaces used in the generated C# code of the lexer and parser (System.Text for StringBuilder), the following constructs are used (a few "unnecessary" warnings are also disabled here):
@lexer::header {
    #pragma warning disable 3021
    using System;
    using System.Text;
}
@parser::header {
    #pragma warning disable 3021
    using System.Text;
}
After the grammar has been compiled in ANTLRWorks, we need to generate the C# code of the lexer and the parser (thanks, Cap).
ANTLRWorks with the specified settings generates 3 files:
- CoolGrammar.tokens
- CoolGrammarLexer.cs
- CoolGrammarParser.cs
Using the generated lexer in C# code
Getting a list of all the tokens from the generated C# code does not look entirely trivial (although it is not always necessary):
var stream = new ANTLRStringStream(Source);
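The rest of the listing might continue roughly like this (a sketch only: it assumes the generated lexer class is called CoolGrammarLexer and uses the CoolTokens.Dictionary described just below to print readable token names; in ANTLR 3 the end-of-file token has type -1, and the exact runtime API may differ slightly between Antlr3.Runtime versions):

// requires: using System; using Antlr.Runtime;
var lexer = new CoolGrammarLexer(stream);

IToken token;
while ((token = lexer.NextToken()).Type >= 0) // stop at EOF (type -1)
{
    Console.WriteLine("{0,-15} '{1}' (line {2}, pos {3})",
        CoolTokens.Dictionary[token.Type], token.Text,
        token.Line, token.CharPositionInLine);
}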
CoolTokens.Dictionary should be generated as follows:
var Lines = File.ReadAllLines(fileName);
Dictionary = new Dictionary<int, string>(Lines.Length);
foreach (var line in Lines)
{
    var parts = line.Split('=');
    if (!Dictionary.ContainsKey(int.Parse(parts[1])))
        Dictionary.Add(int.Parse(parts[1]), parts[0]);
}
fileName - path to the CoolCompiler.tokens file
Using the generated parser in C# code
var tokenStream = new CommonTokenStream(lexer);
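After that, the parser is created and the start rule is invoked, roughly as follows (a sketch: it assumes the grammar is named CoolGrammar, the axiom rule is program and the grammar uses output=AST; the generated method and return-type names may differ slightly depending on the ANTLR version):

var parser = new CoolGrammarParser(tokenStream);
var result = parser.program();       // program is the grammar axiom (must be public, see below)
var tree = (CommonTree)result.Tree;  // root of the generated syntax tree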
To traverse the resulting syntax tree, the following properties and methods of a tree node (the ITree interface) are used (a small traversal sketch follows the list):
- treeNode.ChildCount - the number of children of the treeNode node
- treeNode.GetChild(i) - gets the i-th child of the treeNode node
- treeNode.Type - the token type of this node. It has type int and should be compared with the constants declared in the lexer class, for example: treeNode.Type == CoolGrammarLexer.EQUAL (is this token the '=' sign). I want to note that when traversing the tree you should use the Type property rather than Text, because comparing numbers is faster than comparing strings, and the probability of error is lower, since all the constants are generated automatically.
- treeNode.Line - the line number of the token in the source code
- treeNode.CharPositionInLine - the position of the token from the beginning of the line
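For example, a recursive dump of the tree using these properties could look like this (a sketch; Walk is a hypothetical helper, not part of the generated code):

// requires: using System; using Antlr.Runtime.Tree;
static void Walk(ITree node, int depth)
{
    Console.WriteLine("{0}{1} (type {2}, line {3})",
        new string(' ', depth * 2), node.Text, node.Type, node.Line);

    if (node.Type == CoolGrammarLexer.EQUAL)
    {
        // at this point we know the node is the '=' comparison operator
    }

    for (int i = 0; i < node.ChildCount; i++)
        Walk(node.GetChild(i), depth + 1);
}

// usage: Walk(tree, 0);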
I also want to draw special attention to the fact that when generating C# code from the grammar, you need to put the public modifier before the grammar axiom (in this case, before the program rule) so that it can later be called from your code (otherwise it will be private).
Conclusion
This article explained how to describe a grammar in the ANTLRWorks environment, which is then used to generate lexer and parser code in C#, and how to use that lexer and parser. The next article will cover processing the semantics of the language (i.e. type checking and handling of semantic errors), how to visit each node of the syntax tree and generate CIL code for the corresponding construct, an explanation of the CIL instructions and of how the virtual machine stack works, and using the .NET library System.Reflection.Emit to make code generation easier. Of course, I will describe the pitfalls I met on my way (in particular, creating a class hierarchy with System.Reflection.Emit). It will also describe how to make a beautiful and modern interface! However, what is written here in the conclusion was already partially covered in the introduction.
Finally, I would like to ask one question to the community about which I could not find the answer:
For the C# lexer, ANTLR always generates the following code for handling a NoViableAltException, and this throw is impossible to intercept:
catch (NoViableAltException nvae)
{
    DebugRecognitionException(nvae);
    throw;
}
How can I write the grammar file so that this place does not simply rethrow, but, for example, throws my own exception, or so that the lexer just keeps working? As it stands, I have to fix this line by hand every time I regenerate the C# code. Note: my experiments with @rulecatch in ANTLR did not lead to anything.
It would be nice if someone figured out this problem and wrote the answer in the comments.
Recommended reading
- The Definitive ANTLR Reference, Terence Parr, 2007.
- Compilers: Principles, Techniques, and Tools, 2008, Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman.
- The Cool language manual, its grammar file with the .g extension, and the ANTLRWorks environment.