⬆️ ⬇️

We write a primitive and useless compiler

I believe that every programmer should write his own compiler.



I myself have long believed that the creation of compilers is the lot of the elite, and the mere mortal programmer cannot comprehend this science. I will try to prove that it is not.



In the post, we will look at how you can write your own C-like language compiler in less than an hour, using only 300 lines of code. As a bonus, this includes the code of the virtual machine, in which the source code will be compiled into the bytecode.



The compiler will be written in Python. I love this language, but the code can be sometimes clumsy. If something is wrong - write in a personal.

')

Yes, the compiler is completely brazenly based on Tiny-C.



Grammar



Before we begin, let's define what our language will be able to do.

He will be able to do very little:



- the only data type is int

- all variables are global. There are 26 variables in total (az)

- of the mathematical operations, only "+" and "-" are supported.

- the only comparison operator - is a "<"

- from the language structures - conditional statements if, if / else, while, do / while



Everything. No arrays, functions, structures. Here is the grammar of our language:



 <program> ::= <statement> <statement> ::= "if" <paren-expr> <statement> | "if" <paren-expr> <statement> "else" <statement> | "while" <paren-expr> <statement> | "do" <statement> "while" <paren-expr> | "{" { <statement> } "}" | <expr> ";" | ";" <paren-expr> ::= "(" <expr> ")" <expr> ::= <test> | <id> "=" <expr> <test> ::= <sum> | <sum> "<" <sum> <sum> ::= <term> | <sum> "+" <term> | <sum> "-" <term> <term> ::= <id> | <int> | <paren-expr> <id> ::= "a" | "b" | ... | "z" <int> ::= <digit>, { <digit> } <digit> ::= "0" | "1" | ... | "9" <statement> "else" <statement> | <program> ::= <statement> <statement> ::= "if" <paren-expr> <statement> | "if" <paren-expr> <statement> "else" <statement> | "while" <paren-expr> <statement> | "do" <statement> "while" <paren-expr> | "{" { <statement> } "}" | <expr> ";" | ";" <paren-expr> ::= "(" <expr> ")" <expr> ::= <test> | <id> "=" <expr> <test> ::= <sum> | <sum> "<" <sum> <sum> ::= <term> | <sum> "+" <term> | <sum> "-" <term> <term> ::= <id> | <int> | <paren-expr> <id> ::= "a" | "b" | ... | "z" <int> ::= <digit>, { <digit> } <digit> ::= "0" | "1" | ... | "9" " <paren-expr> | <program> ::= <statement> <statement> ::= "if" <paren-expr> <statement> | "if" <paren-expr> <statement> "else" <statement> | "while" <paren-expr> <statement> | "do" <statement> "while" <paren-expr> | "{" { <statement> } "}" | <expr> ";" | ";" <paren-expr> ::= "(" <expr> ")" <expr> ::= <test> | <id> "=" <expr> <test> ::= <sum> | <sum> "<" <sum> <sum> ::= <term> | <sum> "+" <term> | <sum> "-" <term> <term> ::= <id> | <int> | <paren-expr> <id> ::= "a" | "b" | ... | "z" <int> ::= <digit>, { <digit> } <digit> ::= "0" | "1" | ... | "9" | <program> ::= <statement> <statement> ::= "if" <paren-expr> <statement> | "if" <paren-expr> <statement> "else" <statement> | "while" <paren-expr> <statement> | "do" <statement> "while" <paren-expr> | "{" { <statement> } "}" | <expr> ";" | ";" <paren-expr> ::= "(" <expr> ")" <expr> ::= <test> | <id> "=" <expr> <test> ::= <sum> | <sum> "<" <sum> <sum> ::= <term> | <sum> "+" <term> | <sum> "-" <term> <term> ::= <id> | <int> | <paren-expr> <id> ::= "a" | "b" | ... | "z" <int> ::= <digit>, { <digit> } <digit> ::= "0" | "1" | ... | "9" | <program> ::= <statement> <statement> ::= "if" <paren-expr> <statement> | "if" <paren-expr> <statement> "else" <statement> | "while" <paren-expr> <statement> | "do" <statement> "while" <paren-expr> | "{" { <statement> } "}" | <expr> ";" | ";" <paren-expr> ::= "(" <expr> ")" <expr> ::= <test> | <id> "=" <expr> <test> ::= <sum> | <sum> "<" <sum> <sum> ::= <term> | <sum> "+" <term> | <sum> "-" <term> <term> ::= <id> | <int> | <paren-expr> <id> ::= "a" | "b" | ... | "z" <int> ::= <digit>, { <digit> } <digit> ::= "0" | "1" | ... | "9" 


This is a record in the form of grammar EBNF . That's what this record roughly means.



A program is a single statement.

Operators are conditional (if..else ...), cyclic (while ...) and just operators (for example, "a = 2 + 3").

Conditional and cyclic contain expression and body condition (as operator). Normal operators end with a semicolon. Operators can be grouped in braces, then obtain a compound statement.

Expressions are either a sum or a variable assignment.

Here the sum is a sequence of additions and subtractions of terms.

A term is a number, variable or expression in brackets.

Variables are characters from a to z. Numbers are a set of numbers from 0 to 9.



All these difficulties are needed in order to tell the compiler exactly how to automatically analyze the source code. For example, we met the word “if” - it means that the expression in brackets goes on, and the operator follows.



Lexical analyzer



The compiler receives a text file (source). The lexical analyzer is needed in order to highlight the tokens file. Those. The line “a = 3 + 42;” the lexical analyzer should represent in the form “identifier: a”, “operator =”, “number 3”, “operator +”, “number 42”, “symbol;”. The lexical analyzer works only with a sequence of letters, i.e. for him the line “ab if =” also makes sense and is absolutely correct.



So, our lexical analyzer should recognize the following tokens:



- numbers

- identifiers-variables

- keywords: if, else, while, do

- symbols +, -, <, =, {,}, (,) ,;

- end of file



Here is what its source code looks like:



 class Lexer: NUM, ID, IF, ELSE, WHILE, DO, LBRA, RBRA, LPAR, RPAR, PLUS, MINUS, LESS, \ EQUAL, SEMICOLON, EOF = range(16) #    SYMBOLS = { '{': LBRA, '}': RBRA, '=': EQUAL, ';': SEMICOLON, '(': LPAR, ')': RPAR, '+': PLUS, '-': MINUS, '<': LESS } #   WORDS = { 'if': IF, 'else': ELSE, 'do': DO, 'while': WHILE } #  ,    ch = ' ' # ,   -   def error(self, msg): print 'Lexer error: ', msg sys.exit(1) def getc(self): self.ch = sys.stdin.read(1) def next_tok(self): self.value = None self.sym = None while self.sym == None: if len(self.ch) == 0: self.sym = Lexer.EOF elif self.ch.isspace(): self.getc() elif self.ch in Lexer.SYMBOLS: self.sym = Lexer.SYMBOLS[self.ch] self.getc() elif self.ch.isdigit(): intval = 0 while self.ch.isdigit(): intval = intval * 10 + int(self.ch) self.getc() self.value = intval self.sym = Lexer.NUM elif self.ch.isalpha(): ident = '' while self.ch.isalpha(): ident = ident + self.ch.lower() self.getc() if ident in Lexer.WORDS: self.sym = Lexer.WORDS[ident] elif len(ident) == 1: self.sym = Lexer.ID self.value = ord(ident) - ord('a') else: self.error('Unknown identifier: ' + ident) else: self.error('Unexpected symbol: ' + self.ch) 


The method next_tok () analyzer receives the next token. The type of the token can be obtained from the sym attribute, and its value (if it is a variable or a number) from the value attribute.



The analyzer ignores spaces, checks whether the current character is a special character of the language, if not, checks if it is a number or identifier. Those. upon encountering the number 1, the analyzer will continue to read the characters until it encounters a non-numeric character. Similarly, identifiers are checked (there instead of numbers of letters).



Parser



This is probably the most difficult component of our compiler. His task, using tokens received from a lexical analyzer, is to form a kind of tree in which hierarchy and relationships display the structure of the code. For example:



 "if (a < 0) a = 5;" IF +-LESS | +-VAR(a) | +-NUM(0) +-SET +-VAR(a) +-NUM(5) 




The tree that is built by the parser consists of nodes. A node has a type (IF, LESS, SET, VAR ...), a value (if it is a number or a variable) and several child nodes of operands (in the code - op1, op2, op3). For if, they store the condition and then / else branches, for loops, the condition and the body of the loop.



To describe the nodes, we introduce the Node class. Here is the code for the parser class and the Node class:



 class Node: def __init__(self, kind, value = None, op1 = None, op2 = None, op3 = None): self.kind = kind self.value = value self.op1 = op1 self.op2 = op2 self.op3 = op3 class Parser: VAR, CONST, ADD, SUB, LT, SET, IF1, IF2, WHILE, DO, EMPTY, SEQ, EXPR, PROG = range(14) def __init__(self, lexer): self.lexer = lexer def error(self, msg): print 'Parser error:', msg sys.exit(1) def term(self): if self.lexer.sym == Lexer.ID: n = Node(Parser.VAR, self.lexer.value) self.lexer.next_tok() return n elif self.lexer.sym == Lexer.NUM: n = Node(Parser.CONST, self.lexer.value) self.lexer.next_tok() return n else: return self.paren_expr() def summa(self): n = self.term() while self.lexer.sym == Lexer.PLUS or self.lexer.sym == Lexer.MINUS: if self.lexer.sym == Lexer.PLUS: kind = Parser.ADD else: kind = Parser.SUB self.lexer.next_tok() n = Node(kind, op1 = n, op2 = self.term()) return n def test(self): n = self.summa() if self.lexer.sym == Lexer.LESS: self.lexer.next_tok() n = Node(Parser.LT, op1 = n, op2 = self.summa()) return n def expr(self): if self.lexer.sym != Lexer.ID: return self.test() n = self.test() if n.kind == Parser.VAR and self.lexer.sym == Lexer.EQUAL: self.lexer.next_tok() n = Node(Parser.SET, op1 = n, op2 = self.expr()) return n def paren_expr(self): if self.lexer.sym != Lexer.LPAR: self.error('"(" expected') self.lexer.next_tok() n = self.expr() if self.lexer.sym != Lexer.RPAR: self.error('")" expected') self.lexer.next_tok() return n def statement(self): if self.lexer.sym == Lexer.IF: n = Node(Parser.IF1) self.lexer.next_tok() n.op1 = self.paren_expr() n.op2 = self.statement() if self.lexer.sym == Lexer.ELSE: n.kind = Parser.IF2 self.lexer.next_tok() n.op3 = self.statement() elif self.lexer.sym == Lexer.WHILE: n = Node(Parser.WHILE) self.lexer.next_tok() n.op1 = self.paren_expr() n.op2 = self.statement(); elif self.lexer.sym == Lexer.DO: n = Node(Parser.DO) self.lexer.next_tok() n.op1 = self.statement() if self.lexer.sym != Lexer.WHILE: self.error('"while" expected') self.lexer.next_tok() n.op2 = self.paren_expr() if self.lexer.sym != Lexer.SEMICOLON: self.error('";" expected') elif self.lexer.sym == Lexer.SEMICOLON: n = Node(Parser.EMPTY) self.lexer.next_tok() elif self.lexer.sym == Lexer.LBRA: n = Node(Parser.EMPTY) self.lexer.next_tok() while self.lexer.sym != Lexer.RBRA: n = Node(Parser.SEQ, op1 = n, op2 = self.statement()) self.lexer.next_tok() else: n = Node(Parser.EXPR, op1 = self.expr()) if self.lexer.sym != Lexer.SEMICOLON: self.error('";" expected') self.lexer.next_tok() return n def parse(self): self.lexer.next_tok() node = Node(Parser.PROG, op1 = self.statement()) if (self.lexer.sym != Lexer.EOF): self.error("Invalid statement syntax") return node 


This parser works recursively, starting with the method parse ().

At first, it creates a "Program" node, a child node which becomes the main operator of the program.



Operators are formed in the statement () method. In this method, the first token is checked and, depending on it, an if, if / else, while, do / while, compound statement (if it starts with a curly bracket) or just a single statement, terminated by a semicolon, is formed.



When constructing operators, the methods expr () are used - get the expression and paren_expr () - get the expression in brackets.



Expressions are built from checks that are built from sums that consist of terms. And the terms are in turn composed of numbers, variables, and expressions in parentheses. The house that Jack built.



Such is the recursion. As you can see, the methods correspond to the concepts described above grammar.



At the output of the parse () method, we obtain an object of class Node, which contains the tree of our program. This tree can now be converted to machine code.



Machine code



We will compile in a simple bytecode our special virtual machine. The virtual machine will stack because They are much easier register. Here are her possible instructions:



 FETCH x -      x STORE x -    x     PUSH n -   n    POP -      ADD -       SUB -       LT -       (a < b).  - 0  1 JZ a -     0 -    a. JNZ a -      0 -    a. JMP a -    a HALT -   


For example, operators «a = 2; b = 5; "is converted into the following sequence of instructions:



 PUSH 2 STORE 0 PUSH 5 STORE 1 


The virtual machine code is very simple. It is all described in the run method:



 IFETCH, ISTORE, IPUSH, IPOP, IADD, ISUB, ILT, JZ, JNZ, JMP, HALT = range(11) class VirtualMachine: def run(self, program): var = [0 for i in xrange(26)] stack = [] pc = 0 while True: op = program[pc] if pc < len(program) - 1: arg = program[pc+1] if op == IFETCH: stack.append(var[arg]); pc += 2 elif op == ISTORE: var[arg] = stack.pop(); pc += 2 elif op == IPUSH: stack.append(arg); pc += 2 elif op == IPOP: stack.append(arg); stack.pop(); pc += 1 elif op == IADD: stack[-2] += stack[-1]; stack.pop(); pc += 1 elif op == ISUB: stack[-2] -= stack[-1]; stack.pop(); pc += 1 elif op == ILT: if stack[-2] < stack[-1]: stack[-2] = 1 else: stack[-2] = 0 stack.pop(); pc += 1 elif op == JZ: if stack.pop() == 0: pc = arg else: pc += 2 elif op == JNZ: if stack.pop() != 0: pc = arg else: pc += 2 elif op == JMP: pc = arg; elif op == HALT: break print 'Execution finished.' for i in xrange(26): if var[i] != 0: print '%c = %d' % (chr(i+ord('a')), var[i]) ] - = stack [-1]; IFETCH, ISTORE, IPUSH, IPOP, IADD, ISUB, ILT, JZ, JNZ, JMP, HALT = range(11) class VirtualMachine: def run(self, program): var = [0 for i in xrange(26)] stack = [] pc = 0 while True: op = program[pc] if pc < len(program) - 1: arg = program[pc+1] if op == IFETCH: stack.append(var[arg]); pc += 2 elif op == ISTORE: var[arg] = stack.pop(); pc += 2 elif op == IPUSH: stack.append(arg); pc += 2 elif op == IPOP: stack.append(arg); stack.pop(); pc += 1 elif op == IADD: stack[-2] += stack[-1]; stack.pop(); pc += 1 elif op == ISUB: stack[-2] -= stack[-1]; stack.pop(); pc += 1 elif op == ILT: if stack[-2] < stack[-1]: stack[-2] = 1 else: stack[-2] = 0 stack.pop(); pc += 1 elif op == JZ: if stack.pop() == 0: pc = arg else: pc += 2 elif op == JNZ: if stack.pop() != 0: pc = arg else: pc += 2 elif op == JMP: pc = arg; elif op == HALT: break print 'Execution finished.' for i in xrange(26): if var[i] != 0: print '%c = %d' % (chr(i+ord('a')), var[i]) 


It remains to write the compiler itself - a code generator.



Compiler



The crown of our creation. Here is its source code:



 class Compiler: program = [] pc = 0 def gen(self, command): self.program.append(command) self.pc = self.pc + 1 def compile(self, node): if node.kind == Parser.VAR: self.gen(IFETCH) self.gen(node.value) elif node.kind == Parser.CONST: self.gen(IPUSH) self.gen(node.value) elif node.kind == Parser.ADD: self.compile(node.op1) self.compile(node.op2) self.gen(IADD) elif node.kind == Parser.SUB: self.compile(node.op1) self.compile(node.op2) self.gen(ISUB) elif node.kind == Parser.LT: self.compile(node.op1) self.compile(node.op2) self.gen(ILT) elif node.kind == Parser.SET: self.compile(node.op2) self.gen(ISTORE) self.gen(node.op1.value) elif node.kind == Parser.IF1: self.compile(node.op1) self.gen(JZ); addr = self.pc; self.gen(0); self.compile(node.op2) self.program[addr] = self.pc elif node.kind == Parser.IF2: self.compile(node.op1) self.gen(JZ); addr1 = self.pc; self.gen(0) self.compile(node.op2) self.gen(JMP); addr2 = self.pc; self.gen(0) self.program[addr1] = self.pc self.compile(node.op3) self.program[addr2] = self.pc elif node.kind == Parser.WHILE: addr1 = self.pc self.compile(node.op1) self.gen(JZ); addr2 = self.pc; self.gen(0) self.compile(node.op2) self.gen(JMP); self.gen(addr1); self.program[addr2] = self.pc elif node.kind == Parser.DO: addr = self.pc self.compile(node.op1) self.compile(node.op2) self.gen(JNZ); self.gen(addr); elif node.kind == Parser.SEQ: self.compile(node.op1) self.compile(node.op2) elif node.kind == Parser.EXPR: self.compile(node.op1); self.gen(IPOP) elif node.kind == Parser.PROG: self.compile(node.op1); self.gen(HALT) return self.program 


The gen () method adds a new byte (command or argument) to the program (list of bytes).

In the compile () method, the entire program is built. In fact, this method recursively traverses the node tree. and for each type of node generates the corresponding code.



Note the trick in conditional statements and loops. After JMP / JZ, we first write 0, and when the branch itself is compiled and the address we need to go to is known, in order not to execute this branch, we change the value 0 to the actual address.



We are testing



Here is the complete source code compiler. I used the script engine to run and test (I read Lexer from stdin):



 echo " i = 3;" | ./cc.py echo " { a=3; b=5; }" | ./cc.py echo " { a = 1; b = 2; c = a + b; }" | ./cc.py echo " { a = 5; b = 2; c = a - b; }" | ./cc.py echo " { a = 5; b = 2; c = b < a; }" | ./cc.py echo " { a = 5; if (a < 10) a = 33; }" | ./cc.py echo " { a = 5; if (10 < a) a = 33; else { a = 1; b = 2; } }" | ./cc.py echo " { a = 10; do { a = a - 2;} while (3 < a); }" | ./cc.py echo " { a = 1; b = 5; while (a < b) { a = a + 3; }}" | ./cc.py 


Answers machine gave out seems to be correct.



What's next?



Our compiler has no applications. But the experience gained.

I hope you want to write your own compiler even more. Then it is better to start with languages ​​with simple syntax, for example. TinyBasic or PL / 0 .

If you want to read the source code of other compilers, then I advise you to pay attention to the work of Bellard ( otcc ), tinyc , tcc , smallC , lcc .

Source: https://habr.com/ru/post/133780/



All Articles