⬆️ ⬇️

MiniJava programming language compiler

How to write a compiler of any programming language in a relatively short time. To do this, use the development tools that automate the process. I would like to talk about how I wrote the MiniJava programming language compiler for the .NET Framework.



The whole process of writing a compiler as a whole is displayed in the following image. The input file is the source code in the programming language MiniJava. The output is a PE file for execution by the CLR.





')

Further I would like to concentrate on the practical part of the description.



Programming language MiniJava



MiniJava is a subset of the Java programming language. Overloads are not allowed; only integers are printed to the console by calling the System.out.println(...) method; The e.length expression applies only to int[] arrays.

A description of the grammar in the form of Backus-Naur (taken from the language project site http://www.cambridge.org/us/features/052182060X/ ):



 Goal ::= MainClass ( ClassDeclaration )* <EOF> MainClass ::= "class" Identifier "{" "public" "static" "void" "main" "(" "String" "[" "]" Identifier ")" "{" Statement "}" "}" ClassDeclaration ::= "class" Identifier ( "extends" Identifier )? "{" ( VarDeclaration )* ( MethodDeclaration )* "}" VarDeclaration ::= Type Identifier ";" MethodDeclaration ::= "public" Type Identifier "(" ( Type Identifier ( "," Type Identifier )* )? ")" "{" ( VarDeclaration )* ( Statement )* "return" Expression ";" "}" Type ::= "int" "[" "]" | "boolean" | "int" | Identifier Statement ::= "{" ( Statement )* "}" | "if" "(" Expression ")" Statement "else" Statement | "while" "(" Expression ")" Statement | "System.out.println" "(" Expression ")" ";" | Identifier "=" Expression ";" | Identifier "[" Expression "]" "=" Expression ";" Expression ::= Expression ( "&&" | "<" | "+" | "-" | "*" ) Expression | Expression "[" Expression "]" | Expression "." "length" | Expression "." Identifier "(" ( Expression ( "," Expression )* )? ")" | <INTEGER_LITERAL> | "true" | "false" | Identifier | "this" | "new" "int" "[" Expression "]" | "new" Identifier "(" ")" | "!" Expression | "(" Expression ")" Identifier ::= <IDENTIFIER> 


Here the grammar is represented by the rules for outputting non-terminals into chains of a language.

The terminal is the final grammar symbol. A non-terminal is a grammar symbol for which there is an inference rule that translates it into a string of combinations of terminals and / or non-terminals, i.e. α → β, where α∈V, β∈ (N∪V) *. N is the alphabet of terminals, V is the alphabet of non terminals.

All terminals in the description of the grammar are indicated by a sequence of characters in double quotes, non-terminals - by a sequence of characters, as well as in angular quotes. Grammar Axiom - Netal Goal .

Consider the process of writing a compiler (until the generation of the IL code) using the example of the program factorial.java , which calculates factorial taken from the MiniJava project site:



 class Factorial{ public static void main(String[] a){ System.out.println(new Fac().ComputeFac(10)); } } class Fac { public int ComputeFac(int num){ int num_aux ; if (num < 1) num_aux = 1 ; else num_aux = num * (this.ComputeFac(num-1)) ; return num_aux ; } } 




Lexical and parsing



I combined these two phases of writing a compiler, because they were made using a compiler generator. Now there are a sufficient number of compiler generators, such as ANTLR , GPPG , Coco / R , GOLD Parsing System, and many others .

I used ANTLR because it is convenient for writing grammar, has a graphical shell with a text editor, and allows you to visualize an abstract syntax tree (ANTLRWorks 1.4.3, but now the site already has version 2.0 http://tunnelvisionlabs.com/products/demo / antlrworks ).

The compiler generator allows you to get a set of tokens and an abstract syntax tree (AST) corresponding to the grammar.

From the description of the grammar in the form of Backus-Naur, I have compiled a grammar file intended for processing ANTLR. Giving the grammar to the LL-type, removing the left recursion and inscribing the creation of grammar symbols in C # into the appropriate parts, I prepared the file for the further code generation process (the grammar file MiniJava.g is located in the project directory, which is referenced below).

For example, the creation of the main class is described as follows:



 mainClassDecl returns [NonTerm value] : CLASS^ id=ID LCURLY! PUBLIC! STATIC! VOID! MAIN! LPAREN! STRING! LBRACK! RBRACK! ID RPAREN! LCURLY! statement=stmtList RCURLY! RCURLY! { $value = NonTerm.CreateMainClassDecl(new Token(TokenType.ID, id.Text, id), statement.valueList); if (flagDebug) Console.WriteLine("mainClassDecl"); } ; 


Here mainClassDecl is the rule corresponding to the description of the main class, which is responsible for the following description:

 MainClass ::= "class" Identifier "{" "public" "static" "void" "main" "(" "String" "[" "]" Identifier ")" "{" Statement "}" "}" 


The "^" sign makes the CLASS token the root of the rule, the "!" after other tokens, indicates ANTLR to ignore their checking (i.e. excluded when building an AST). Those rules whose results must be saved (the name of the main class ID and the list of stmtList ) must be assigned to variables ( id and statement , respectively). It should be noted that tokens are written in uppercase, and the rules - in lowercase. $value is the value that the root rule returns. The type of the return value is determined in square brackets after the returns keyword (in this case, NonTerm ).

The output file is generated using the ANTLR compiler command for .NET (using antlr-dotnet-tool-3.3.1.7705):



Antlr3.exe -o "___" "_____\MiniJava.g"





Own changes that were made to the grammar (for convenience):



The results of lexical and syntactic analysis are displayed as an xml file.

To display the result of the lexical analysis, use the following method of the class Lexer .

 public void SaveToFile(string fileName) { List<IToken> tokens = TokenStream.GetTokens(); XElement xElement = new XElement("Lexer", from token in tokens select new XElement("Token", new XAttribute("Text", token.Text.TokenToString()), new XAttribute("TokenIndex", token.TokenIndex), new XAttribute("Type", token.Type), new XAttribute("Line", token.Line), new XAttribute("CharPositionInLine", token.CharPositionInLine), new XAttribute("StartIndex", token.StartIndex), new XAttribute("StopIndex", token.StopIndex) ) ); xElement.Save(fileName); } 




The result of lexical analysis is 152 tokens (the first 30 are shown here):

Token flow
 <?xml version="1.0" encoding="utf-8"?> <Lexer> <Token Text="class" TokenIndex="0" Type="8" Line="1" CharPositionInLine="0" StartIndex="0" StopIndex="4" /> <Token Text=" " TokenIndex="1" Type="74" Line="1" CharPositionInLine="5" StartIndex="5" StopIndex="5" /> <Token Text="Factorial" TokenIndex="2" Type="26" Line="1" CharPositionInLine="6" StartIndex="6" StopIndex="14" /> <Token Text="{" TokenIndex="3" Type="32" Line="1" CharPositionInLine="15" StartIndex="15" StopIndex="15" /> <Token Text="\n" TokenIndex="4" Type="74" Line="1" CharPositionInLine="16" StartIndex="16" StopIndex="16" /> <Token Text=" " TokenIndex="5" Type="74" Line="2" CharPositionInLine="0" StartIndex="17" StopIndex="17" /> <Token Text=" " TokenIndex="6" Type="74" Line="2" CharPositionInLine="1" StartIndex="18" StopIndex="18" /> <Token Text=" " TokenIndex="7" Type="74" Line="2" CharPositionInLine="2" StartIndex="19" StopIndex="19" /> <Token Text=" " TokenIndex="8" Type="74" Line="2" CharPositionInLine="3" StartIndex="20" StopIndex="20" /> <Token Text="public" TokenIndex="9" Type="56" Line="2" CharPositionInLine="4" StartIndex="21" StopIndex="26" /> <Token Text=" " TokenIndex="10" Type="74" Line="2" CharPositionInLine="10" StartIndex="27" StopIndex="27" /> <Token Text="static" TokenIndex="11" Type="64" Line="2" CharPositionInLine="11" StartIndex="28" StopIndex="33" /> <Token Text=" " TokenIndex="12" Type="74" Line="2" CharPositionInLine="17" StartIndex="34" StopIndex="34" /> <Token Text="void" TokenIndex="13" Type="72" Line="2" CharPositionInLine="18" StartIndex="35" StopIndex="38" /> <Token Text=" " TokenIndex="14" Type="74" Line="2" CharPositionInLine="22" StartIndex="39" StopIndex="39" /> <Token Text="main" TokenIndex="15" Type="41" Line="2" CharPositionInLine="23" StartIndex="40" StopIndex="43" /> <Token Text="(" TokenIndex="16" Type="40" Line="2" CharPositionInLine="27" StartIndex="44" StopIndex="44" /> <Token Text="String" TokenIndex="17" Type="66" Line="2" CharPositionInLine="28" StartIndex="45" StopIndex="50" /> <Token Text="[" TokenIndex="18" Type="31" Line="2" CharPositionInLine="34" StartIndex="51" StopIndex="51" /> <Token Text="]" TokenIndex="19" Type="57" Line="2" CharPositionInLine="35" StartIndex="52" StopIndex="52" /> <Token Text=" " TokenIndex="20" Type="74" Line="2" CharPositionInLine="36" StartIndex="53" StopIndex="53" /> <Token Text="a" TokenIndex="21" Type="26" Line="2" CharPositionInLine="37" StartIndex="54" StopIndex="54" /> <Token Text=")" TokenIndex="22" Type="60" Line="2" CharPositionInLine="38" StartIndex="55" StopIndex="55" /> <Token Text="{" TokenIndex="23" Type="32" Line="2" CharPositionInLine="39" StartIndex="56" StopIndex="56" /> <Token Text="\n" TokenIndex="24" Type="74" Line="2" CharPositionInLine="40" StartIndex="57" StopIndex="57" /> <Token Text="\t" TokenIndex="25" Type="74" Line="3" CharPositionInLine="0" StartIndex="58" StopIndex="58" /> <Token Text="System.out.println" TokenIndex="26" Type="54" Line="3" CharPositionInLine="1" StartIndex="59" StopIndex="76" /> <Token Text="(" TokenIndex="27" Type="40" Line="3" CharPositionInLine="19" StartIndex="77" StopIndex="77" /> <Token Text="new" TokenIndex="28" Type="50" Line="3" CharPositionInLine="20" StartIndex="78" StopIndex="80" /> <Token Text=" " TokenIndex="29" Type="74" Line="3" CharPositionInLine="23" StartIndex="81" StopIndex="81" /> ... </Lexer> 




For AST, I compiled an object model representing each element of the grammar. Classes correspond to non-terminals and terminals. The class diagram is shown in the following figure.







The base class BaseSymbol is a grammar symbol from the combined alphabet, from it the Token - token (terminal) and NonTerm - nonterminal are inherited. No terminals are: ProgramStatement - grammar axiom, MainClassDecl - main class, ClassDecl - class, MethodDecl - method, ExtendsClause - class inheritance, TypeDecl - type, VarDecl - variable, ExpressionDecl - expression, StatementDecl - base class of statements. Operator classes: IfStatement is a conditional operator, WhileStatement is a while statement, StatementList is a list of statements, AssignVarStatement is a declaration and assignment of a new variable, AssignIdStatement is assignment to a variable by the identifier previously announced.

These classes are used in the ANTLR grammar file to generate an AST.

The AST representation formed as a result of the parsing is saved in the xml file using the recursive method ToXmlTree() base class BaseSymbol . The method is redefined for tokens only. For the base class BaseSymbol method is as follows:



 public virtual XElement ToXmlTree() { XElement elements = new XElement(ToString()); Symbols.ForEach(symbol => { if (symbol != null) { XElement el = symbol.ToXmlTree(); elements.Add(el); } }); return elements; } 




The result of the formation of the tree is presented here:

AST
 <?xml version="1.0" encoding="utf-8"?> <Program> <MainClass> <ID Value="Factorial" /> <PrintStatement> <MethodCallExpression> <NewStatement> <ID Value="Fac" /> </NewStatement> <ID Value="ComputeFac" /> <ArgumentListExpression> <INTEGER Value="10" /> </ArgumentListExpression> </MethodCallExpression> </PrintStatement> </MainClass> <Class> <ID Value="Fac" /> <Method> <INT /> <ID Value="ComputeFac" /> <FormalArgumentList> <Variable> <INT /> <ID Value="num" /> </Variable> </FormalArgumentList> <StatementList> <VarStatement> <Variable> <INT /> <ID Value="num_aux" /> </Variable> </VarStatement> <IfElseStatement> <LessExpression> <ID Value="num" /> <INTEGER Value="1" /> </LessExpression> <IdStatement> <ID Value="num_aux" /> <INTEGER Value="1" /> </IdStatement> <IdStatement> <ID Value="num_aux" /> <MultiplyExpression> <ID Value="num" /> <MethodThisCallExpression> <ID Value="ComputeFac" /> <ArgumentListExpression> <MinusExpression> <ID Value="num" /> <INTEGER Value="1" /> </MinusExpression> </ArgumentListExpression> </MethodThisCallExpression> </MultiplyExpression> </IdStatement> </IfElseStatement> </StatementList> <ID Value="num_aux" /> </Method> </Class> </Program> 




To catch exceptions, I wrote exception classes that are used in the ParserException lexical parsing, as well as in the generation of CodeGenerationException code, inherited from the CompilerException compiler exception base class.

Thus, as a result of the syntactic analysis, an abstract syntactic tree is formed, according to which the IL-code is generated in the next phase.



IL code generation



To generate, I again did not reinvent the wheel, but used the existing project RunSharp, located on Codeproject http://www.codeproject.com/Articles/20921/RunSharp-Reflection-Emit-Has-Never-Been-Easier . RunSharp is a project that encapsulates calls to class methods from the System.Reflection.Emit namespace, simplifying the process of generating IL code.

The following data structures are used for code generation: a program class table (for code generation of classes and for resolving dependencies, for example, when a method of one class refers to another, and vice versa), a list of methods of each class (for code generation of methods), a list of local variables (for tracking links) for local variables inside if , while and body blocks), local variable stacks (for performing unary and binary operations), a list of formal parameters of the current method.

It also uses a code generation control table containing the name of the grammar symbol and the method that generates the code when it appears.

The IL code is Generate(BaseSymbol root) in the main recursive Generate(BaseSymbol root) method Generate(BaseSymbol root) of the SharpCodeGen class.

 private void Generate(BaseSymbol root) { if (root == null) { return; } if (root.GrammarMember == GrammarMemberType.NonTerm) { NonTerm nonTerm = root as NonTerm; _compilerLogger.PrintGenerateNonTerm(nonTerm); if (_emitTableDictionary.ContainsKey(nonTerm.TypeNonTerm)) { _emitTableDictionary[nonTerm.TypeNonTerm](nonTerm); } else { root.Symbols.ForEach(Generate); } } } 




The method determines whether root non-terminal, and if the type of a non-terminal is contained in the control table, the method of generating the control table is executed; otherwise, generation is performed recursively for each grammar character contained in root .

The control table contains the following methods that generate instructions for the following grammar symbols:



In part of these methods, the recursive Generate() method is called.

So, the method that generates the method instructions is as follows:

 private void EmitMethod(NonTerm nonTerm) { Token typeMethodDeclSimple; Token methodName; List<BaseSymbol> formalParametersList; BaseSymbol methodStatementList; BaseSymbol returnStatement; NonTermFactory.GetMethodDecl(nonTerm, out typeMethodDeclSimple, out methodName, out formalParametersList, out methodStatementList, out returnStatement); _currentFormalArgumentList.Clear(); foreach (BaseSymbol symbol in formalParametersList) { Token type; Token id; NonTermFactory.GetFormalArgumentDeclaration(symbol, out type, out id); _currentFormalArgumentList.Add(id.Value); } _compilerLogger.PrintRefreshFormalArgumentList(_currentFormalArgumentList); _currentMethod = _methodsTables[_currentClass.Name][methodName.Value]; _g = _currentMethod; GeneratePreInitLocalVariables(methodStatementList); Generate(methodStatementList); Type resultType = GetVariableType(typeMethodDeclSimple); string nameResult = AddTempLocalVariable(resultType); EmitExpression(returnStatement, resultType, nameResult); try { _g.Return(_currentOperandTempResult); } catch (InvalidCastException ex) { throw new CodeGenerationException(MessagesHelper.TypeMismatchEx, returnStatement.ToStringInfo(), ex); } ClearCurrentBlockLocalVariables(); } 




First we get all the necessary data by the method (return type typeMethodDeclSimple , method name methodName , formalParametersList formal parameter formalParametersList , methodStatementList method statement list, and methodStatementList return expression). Next, the current list of formal variables is populated, after which the _currentMethod variable _currentMethod assigned an object (of type MethodGen from the RunSharp library), which is responsible for generating the method. Then, in the GeneratePreInitLocalVariables() method, the list of local variables is populated and the contents of the method itself is recursively generated, the operator responsible for returning the method result after the return keyword is generated, and the list of local block variables is cleared at the end.

And the method responsible for generating the variable assignment operation is as follows:

 private void EmitIdStatement(NonTerm nonTerm) { Token idToken; BaseSymbol expression; NonTermFactory.GetAssignIdStatement(nonTerm, out idToken, out expression); Operand operand; if (_currentFormalArgumentList.Contains(idToken.Value)) { operand = _g.Arg(idToken.Value); } else if (_localVariablesTable.ContainsKey(idToken.Value)) { operand = _localVariablesTable[idToken.Value]; } else { operand = _g.This().Field(idToken.Value); } _currentOperandTempResult = EmitExpression(expression, operand.Type, idToken.Value); try { _g.Assign(operand, _currentOperandTempResult); } catch (InvalidCastException ex) { throw new CodeGenerationException(MessagesHelper.AssignTypeMismatchEx, expression.ToStringInfo(), ex); } } 




Here, first we get the token to which we assign the expression to the right ( idToken ) and the expression itself ( expression ). Next, we determine where the variable came from: from the list of formal parameters ( _currentFormalArgumentList ), from the table of local variables ( _localVariablesTable ) or this is a class variable. Then we calculate the expression on the right by calling the EmitExpression() method, and generate an assignment.



Similarly, taking into account the specifics of other structures, their code is generated.



This shows the log of the result of code generation for our factorial calculation program.

Code Generation Log
  Generating classes and method signatures
 Fac Class
  ComputeFac System.Int32 Method

 Generating Instructions for Program
 Generating instructions for MainClass
 There are no local variables in this block.
 Generating instructions for PrintStatement
 Generating Instructions for MethodCallExpression
 Generating Instructions for NewStatement
 Generating Instructions for Class
 Generating Instructions for Method
 Updated list of current method parameters
 num
 Added variables for block
 num_aux
 Statement generation for StatementList
 Generating Instructions for VarStatement
 Generating Instructions for IfElseStatement
 Generating Instructions for LessExpression
 There are no local variables in this block.
 Generating Instructions for IdStatement
 There are no local variables in this block.
 Generating Instructions for IdStatement
 Generating instructions for MultiplyExpression
 Generating Instructions for MethodThisCallExpression
 Generating instructions for MinusExpression
 Removed variables 
 num_aux






For the compiler, a console application is written that takes the following command line arguments:

-i < > [-o < >]



If no output directory is specified, the output files will be created in the directory where the application was launched. Examples and the application itself are located in the Samples directory.



Conclusion



The whole method of writing the MiniJava programming language compiler under the .NET Framework in C # was analyzed using existing development tools. Here I concentrated on the higher level points needed to write the compiler.

The written compiler successfully copes with examples from the MiniJava project home page, such as: factorial computation, binary search, bubble sort, tree traversal, quick sort, linear search, construction of a coherent list, construction of a binary tree.

The source code for the project is available on GitHub .



Sources



General (theory and practice)




ANTLR




MiniJava


Source: https://habr.com/ru/post/191766/



All Articles