Most compilers share the same high-level architecture: a frontend (lexical analysis and parsing), a middle (semantic analysis), and a backend (code generation).

In this article I am going to dissect this architecture in detail, element by element.
Think of it as an addition to the huge amount of existing material on compilers: a self-contained introduction to the basics of programming language design and implementation.
The target audience is people whose understanding of compilers is very limited - at most, they know that a compiler compiles code. I do, however, expect the reader to be comfortable with data structures and algorithms.
The article is in no way about modern production compilers with millions of lines of code - it is a short "compilers for dummies" course to help you figure out what a compiler is.
Introduction
I'm currently working on Krug, a systems language inspired by Rust and Go, and I will refer to it throughout the article to illustrate my points. Krug is under development, but it is already available at https://github.com/krug-lang in the caasper and krug repositories. Its architecture is not quite typical compared to the usual compiler architecture, which is partly what inspired me to write this article - but more on that later.
I hasten to add that I am by no means a compiler expert! I don't have a doctorate and I haven't had any formal training - everything described here I taught myself in my free time. I should also say that I am not describing the one true way to build a compiler; rather, I'm presenting basic methods suitable for building a small "toy" compiler.
Frontend
Let's return to the architecture above: the arrows on the left pointing into the frontend are the familiar, well-loved languages like C. The frontend consists of two stages: lexical analysis -> parsing.
Lexical analysis
When I started learning about compilers and language design, I was told that lexical analysis is just tokenization, and that description will do for us. The lexer takes input as a string or a stream of characters, recognizes patterns in it, and cuts it into tokens.
In a compiler, the input is the program itself: the source file is read into a string, and the lexer tokenizes its source code.
enum TokenType {
    Identifier,
    Number,
};

struct Token {
    std::string Lexeme;
    TokenType type;
    // ...
    // It's also handy to store things in here
    // like the position of the token (start to end row:col)
};
In this fragment, written in a C-like language, you can see a structure containing the aforementioned lexeme, as well as a TokenType used to identify what kind of lexeme it is.
Note: this article is not a step-by-step guide to building a language, but I will drop in code fragments from time to time to aid understanding.
Lexers are usually the simplest components of a compiler. The whole frontend is, in fact, fairly simple compared to the rest of the puzzle - although that depends a lot on how much work you put into it.
Take the following piece of C code:
int main() {
    printf("Hello world!\n");
    return 0;
}
Having read the file into a string and scanned it linearly, you can slice it into tokens. We recognize tokens naturally: we see that int is a "word" and that the 0 in the return statement is a "number". The lexer goes through much the same procedure we do - we'll look at this in more detail later. For example, consider how numbers might be classified:
0xdeadbeef — HexNumber
1231234234 — WholeNumber
3.1412 — FloatingNumber
55.5555 — FloatingNumber
0b0001 — BinaryNumber
Defining what counts as a word can be trickier. Most languages define a word as a sequence of letters and digits, and an identifier usually has to begin with a letter or an underscore. For example:
123foobar := 3
person-age := 5

fmt.Println(123foobar)
In Go, this code is not valid, but the lexer would still slice it into the following tokens:
Number(123), Identifier(foobar), Symbol(:=), Number(3) ...
The most common identifiers look like this:
foo_bar
__uint8_t
fooBar123
A lexer also has to deal with other issues: whitespace, multi-line and single-line comments, identifiers, numbers, numeric bases and number formatting (for example 1_000_000), and encodings (for example supporting UTF-8 rather than just ASCII).
And if you're tempted to reach for regular expressions - better not. It is much easier to write a lexer from scratch, and I highly recommend reading this article by our king and god Rob Pike. The reasons Regex isn't the right tool here have been covered in plenty of other articles, so I'll skip them. Besides, writing a lexer is far more interesting than suffering over long, wordy expressions pasted into regex101.com at 5:24 am. In my first language, I used a simple split(str) function for tokenization - you won't get very far with that.
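To make this concrete, here is a minimal hand-written lexer sketch in C++, using a Token shape similar to the fragment above. The names and structure are illustrative assumptions, not Krug's actual lexer:

#include <cctype>
#include <string>
#include <vector>

enum class TokenType { Identifier, Number, Symbol };

struct Token {
    std::string Lexeme;
    TokenType type;
};

std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < input.size()) {
        char c = input[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;                                   // skip whitespace
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            size_t start = i;                      // a run of digits -> Number
            while (i < input.size() && std::isdigit(static_cast<unsigned char>(input[i]))) ++i;
            tokens.push_back({input.substr(start, i - start), TokenType::Number});
        } else if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            size_t start = i;                      // letter/underscore, then letters/digits -> Identifier
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) || input[i] == '_')) ++i;
            tokens.push_back({input.substr(start, i - start), TokenType::Identifier});
        } else {
            tokens.push_back({std::string(1, c), TokenType::Symbol});  // single-char symbol
            ++i;
        }
    }
    return tokens;
}

A real lexer would also track source positions, handle comments, floats and different numeric bases, but the skeleton stays the same.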
Parsing
Parsing is somewhat more complicated than lexical analysis. There are many parsers and parser generators - this is where the game gets serious.
Parsers in compilers usually take tokens as input and build a tree - either an abstract syntax tree or a parse tree. The two are similar in nature but differ in some ways.
These stages can be represented as functions:
fn lex(string input) []Token {...}
fn parse(tokens []Token) AST {...}

let input = "int main() { return 0; }";
let tokens = lex(input);
let parse_tree = parse(tokens);
// ....
Typically, compilers are built from a number of small components that take some input, change it, or convert it into a different output. This is one of the reasons functional languages are well suited to building compilers; other reasons include excellent pattern matching and fairly extensive standard libraries. Fun fact: the first implementation of the Rust compiler was written in OCaml.
I advise you to keep these components as simple and autonomous as possible - modularity will greatly facilitate the process. In my opinion, the same can be said about many other aspects of software development.
Trees
Parse tree
What the fuck is that? Also known as a concrete syntax tree, this dense tree represents the source program. It contains all (or most of) the information about the input program, usually mirroring exactly what is described in your language's grammar. Each node in the tree is either a leaf or an interior node - NumberConstant or StringConstant, say, would be leaves.
Abstract syntax tree
As the name implies, an AST is an abstract syntax tree. The parse tree contains a lot of (often redundant) information about your program, which an AST doesn't need: the AST drops the details of structure and grammar that don't affect the program's semantics.
Suppose your tree holds an expression like ((5 + 5) - 3) + 2. In a parse tree you would store it in full: parentheses, operators, and the values 5, 5, 3 and 2. With an AST, however, we only need the values, the operators, and their order.
For example, for the expression a + b / c, the AST has + at the root, with a on the left and the b / c subtree on the right.
An AST can be represented as follows:
interface Expression { ... };

struct UnaryExpression {
    Expression value;
    char op;
};

struct BinaryExpression {
    Expression lhand, rhand;
    string op; // string because the op could be more than 1 char.
};

interface Node { ... };

// or for something like a variable
struct Variable : Node {
    Token identifier;
    Expression value;
};
This representation is quite limited, but I hope you can see how your nodes might be structured. For parsing, you could use a procedure like this:
Node parseNode() {
    Token current = consume();
    switch (current.lexeme) {
    case "var":
        return parseVariableNode();
    // ...
    }
    panic("unrecognized input!");
}

Node n = parseNode();
if (n != null) {
    // append to some list of top level nodes?
    // or append to a block of nodes!
}
I hope this gives you an idea of how the remaining nodes would be parsed step by step, starting from the high-level language constructs. Exactly how a recursive descent parser is implemented is something you'll have to study on your own.
Grammar
Parsing a set of tokens into an AST can be difficult. Usually you start from the grammar of your language. In essence, a grammar defines the structure of your language. There are several languages for defining languages, which can even describe (or parse) themselves.
An example of a language for defining languages is the extended Backus-Naur form (EBNF). It is a variation of BNF with fewer angle brackets. Here is an EBNF example from the Wikipedia article:
digit excluding zero = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
digit                = "0" | digit excluding zero ;
This defines two production rules: they specify which patterns of terminals make up a "non-terminal". Terminals are part of the alphabet - for example, the token if, or the 0 and 1 in the example above. Non-terminals are the opposite; they appear on the left-hand side of production rules, and can be thought of as variables or "named pointers" to groups of terminals and non-terminals.
Many languages have specifications that include a grammar - for example, Go, Rust and D.
Recursive descent parsers
Recursive descent is the simplest of many approaches to parsing.
Recursive descent parsers are top-down parsers built from mutually recursive procedures. Writing one is much easier if your grammar contains no left recursion. For most "toy" languages this technique is enough. GCC uses a hand-written recursive descent parser, although it used YACC before.
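To show what "one procedure per grammar rule" looks like, here is a minimal recursive descent sketch for arithmetic expressions, reusing the Token/TokenType shape from the lexer sketch earlier. The grammar and names are illustrative assumptions, not Krug's parser:

// Grammar (EBNF):
//   expr   = term   { ("+" | "-") term } ;
//   term   = factor { ("*" | "/") factor } ;
//   factor = Number | "(" expr ")" ;
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

struct ExprNode {
    std::string op;                       // operator for interior nodes
    std::string value;                    // lexeme for number leaves
    std::unique_ptr<ExprNode> lhs, rhs;
};

struct Parser {
    std::vector<Token> tokens;
    size_t pos = 0;

    bool atLexeme(const std::string& s) const {
        return pos < tokens.size() && tokens[pos].Lexeme == s;
    }
    Token consume() { return tokens[pos++]; }

    // expr = term { ("+" | "-") term }
    std::unique_ptr<ExprNode> parseExpr() {
        auto node = parseTerm();
        while (atLexeme("+") || atLexeme("-")) {
            auto parent = std::make_unique<ExprNode>();
            parent->op = consume().Lexeme;
            parent->lhs = std::move(node);
            parent->rhs = parseTerm();
            node = std::move(parent);
        }
        return node;
    }

    // term = factor { ("*" | "/") factor }
    std::unique_ptr<ExprNode> parseTerm() {
        auto node = parseFactor();
        while (atLexeme("*") || atLexeme("/")) {
            auto parent = std::make_unique<ExprNode>();
            parent->op = consume().Lexeme;
            parent->lhs = std::move(node);
            parent->rhs = parseFactor();
            node = std::move(parent);
        }
        return node;
    }

    // factor = Number | "(" expr ")"
    std::unique_ptr<ExprNode> parseFactor() {
        if (pos < tokens.size() && tokens[pos].type == TokenType::Number) {
            auto leaf = std::make_unique<ExprNode>();
            leaf->value = consume().Lexeme;
            return leaf;
        }
        if (atLexeme("(")) {
            consume();                    // "("
            auto inner = parseExpr();
            if (!atLexeme(")")) throw std::runtime_error("expected ')'");
            consume();                    // ")"
            return inner;
        }
        throw std::runtime_error("unexpected token");
    }
};

Each production rule in the grammar becomes one function, which is a big part of why recursive descent is so approachable.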
However, some languages are still awkward to parse this way. C in particular, where
foo * bar
can be interpreted as
int foo = 3;
int bar = 4;
foo * bar; // unused expression
or as
typedef struct {
    int b;
} foo;

foo* bar;
bar->b = 3;
The Clang implementation also uses a recursive descent parser:
Since it is plain C++ code, recursive descent makes it easy for new developers to understand; it supports the ad-hoc rules and other strange hacks that C/C++ require, and it makes it straightforward to implement good diagnostics and error recovery.
It's also worth knowing about the other approaches:
- top-down: LL, recursive descent
- bottom-up: LR, shift-reduce, recursive ascent
Parser Generators
Another good option. There are disadvantages, of course - but the same can be said of any choice programmers make when building software.
Parser generators are quick to get going with. Using one is easier than writing your own parser, and you get a decent result - though they are not very user friendly and their error messages leave much to be desired. You also have to learn how to use the parser generator, and as your compiler evolves you may well end up having to untangle yourself from it.
An example of a parser generator is ANTLR; there are many others.
I think these tools suit people who don't want to spend time writing the frontend and would rather get on with the middle and backend of their compiler or interpreter.
Applications of parsing
In case it wasn't obvious: a compiler frontend (lexing and parsing) can be applied to plenty of other problems:
- syntax highlighting
- HTML/CSS parsing in rendering engines
- transpilers: TypeScript, CoffeeScript
- linkers
- regular expressions
- parsing interface data
- URL parsing
- formatting tools like gofmt
- SQL parsing, and more.
The middle
Semantic analysis! Analyzing the semantics of a language is one of the most difficult tasks in building a compiler.
You need to make sure that every input program is correct. Semantic analysis isn't built into my Krug language yet, and without it the programmer would always have to write correct code - which in reality is impossible. We write, compile, sometimes run, fix errors, and go around again. The spiral never ends.
Besides, you can't compile programs properly without checking their semantic correctness at the appropriate stage of compilation.
I once came across a chart showing the relative share of work in the frontend, middle and backend. Back then it looked something like
F: 20%  M: 20%  B: 60%
Today it is more like
F: 5%  M: 60%  B: 35%
The frontend can largely be generated, and for context-free languages without grammar ambiguities it can be put together fairly quickly - recursive descent helps here.
With LLVM, most of the optimization work can be offloaded to the framework, which ships with a set of ready-made optimizations.
The next step is semantic analysis, the most important part of the compilation phase.
For example, in Rust, with its memory management model, the compiler acts as a big, powerful machine that performs various kinds of static analysis on the input. Part of this task is converting the input into a form more convenient for analysis.
For this reason, semantic analysis plays an important role in the compiler's architecture: by the time you get to it, the grunt work of reading the input into an AST has been done for you, and the grind of optimizing the generated assembly is still to come.
Semantic passes
During semantic analysis, most compilers perform a number of "semantic passes" over the AST or some other abstract representation of the code.
This article goes into detail on most of the passes performed by the .NET C# compiler.
I won't go through every pass, especially since they differ from language to language, but below are a few of the passes Krug performs.
Top-level declarations
The compiler walks all the top-level declarations in each module and takes note of their existence. It doesn't descend into blocks - it simply records which structures, functions, and so on are available in which module.
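A minimal sketch of what such a pass might look like, with hypothetical Module and Node types - an illustrative assumption, not Krug's implementation:

#include <map>
#include <string>
#include <vector>

struct Node { virtual ~Node() = default; std::string name; };
struct StructureNode : Node {};
struct FunctionNode : Node {};
struct Module { std::string name; std::vector<Node*> top_level; };

// Maps each module to the set of names it declares at the top level.
std::map<std::string, std::map<std::string, Node*>> declared;

void declarePass(const Module& mod) {
    for (Node* decl : mod.top_level) {
        // Record that the declaration exists, but do not walk into its body yet.
        declared[mod.name][decl->name] = decl;
    }
}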
Name / Symbol Resolution
The compiler walks through all the blocks of code in functions and so on and resolves them - that is, it resolves the symbols that need resolving. This is a common pass, and it is usually where errors like "No such symbol XYZ" come from when compiling Go code.
Running this pass can be quite involved, especially if there are cycles in your dependency graph. Some languages forbid them: Go, for example, reports an error if your packages form a cycle, and so does my Krug language. Cyclic dependencies can be considered a side effect of bad architecture.
Cycles can be detected with a modified DFS over the dependency graph, or with Tarjan's algorithm (as is done in Krug), which finds the (multiple) cycles.
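Here is a minimal sketch of the DFS approach over a dependency graph of numbered modules; Tarjan's algorithm (which Krug uses) additionally groups nodes into strongly connected components:

#include <vector>

enum class Mark { Unvisited, InProgress, Done };

bool hasCycleFrom(int node,
                  const std::vector<std::vector<int>>& deps,
                  std::vector<Mark>& marks) {
    marks[node] = Mark::InProgress;
    for (int dep : deps[node]) {
        if (marks[dep] == Mark::InProgress) return true;   // back edge: cycle found
        if (marks[dep] == Mark::Unvisited &&
            hasCycleFrom(dep, deps, marks)) return true;
    }
    marks[node] = Mark::Done;
    return false;
}

bool hasCycle(const std::vector<std::vector<int>>& deps) {
    std::vector<Mark> marks(deps.size(), Mark::Unvisited);
    for (int i = 0; i < (int)deps.size(); ++i)
        if (marks[i] == Mark::Unvisited && hasCycleFrom(i, deps, marks))
            return true;
    return false;
}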
Type Inference
The compiler goes through all the variables and infers their types. Type inference in Krug is very weak: it simply infers a variable's type from its value. It is nothing like the fancy systems found in functional languages such as Haskell.
You can infer types using a process called "unification" (or "type unification"). For simpler type systems the implementation can be very simple.
Types are implemented in Krug as follows:
interface Type {};

struct IntegerType : Type {
    int width;
    bool signed;
};

struct FloatingType : Type {
    int width;
};

struct ArrayType : Type {
    Type base_type;
    uint64 length;
};
You can also do simple type inference in which you assign a type to each expression node - for example, an IntegerConstantNode might get the type IntegerType(64) - and then have a function unify(t1, t2) that picks the widest type, which can be used to infer the type of more complex expressions such as binary ones. Then inferring a variable's type is just a matter of assigning the type of the value on the right to the variable on the left.
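A minimal sketch of what unify(t1, t2) could look like, over simplified versions of the Type structs above - an assumption about how the "widest" type might be chosen, not Krug's actual code:

#include <algorithm>
#include <memory>

struct Type { virtual ~Type() = default; };
struct IntegerType : Type { int width; bool is_signed; };
struct FloatingType : Type { int width; };

std::unique_ptr<Type> unify(const Type& t1, const Type& t2) {
    auto* i1 = dynamic_cast<const IntegerType*>(&t1);
    auto* i2 = dynamic_cast<const IntegerType*>(&t2);
    if (i1 && i2) {
        // Two integers unify to the widest one; signed wins if either is signed.
        auto result = std::make_unique<IntegerType>();
        result->width = std::max(i1->width, i2->width);
        result->is_signed = i1->is_signed || i2->is_signed;
        return result;
    }
    auto* f1 = dynamic_cast<const FloatingType*>(&t1);
    auto* f2 = dynamic_cast<const FloatingType*>(&t2);
    if (f1 && f2) {
        auto result = std::make_unique<FloatingType>();
        result->width = std::max(f1->width, f2->width);
        return result;
    }
    // Mixed or unsupported combinations would be a type error in this sketch.
    return nullptr;
}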
I once wrote a simple type inference implementation in Go, which became the prototype for Krug's.
Mutability Pass
Krug (like Rust) is immutable by default: variables cannot be changed unless declared otherwise:
let x = 3;
x = 4; // BAD!

mut y = 5;
y = 6; // OK!
The compiler walks all blocks and functions and checks that mutability is respected: we don't change what we shouldn't, and variables passed to functions are constant or mutable where required.
This is done using symbol information collected in previous passes. The symbol table built during semantic analysis contains symbol names and mutability flags; it may hold other data too - in C++, for example, it can record whether a symbol is extern or static.
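A minimal sketch of the check itself, assuming a hypothetical symbol-table entry with a mutability flag - illustrative, not Krug's actual pass:

#include <map>
#include <stdexcept>
#include <string>

struct SymbolEntry {
    bool mutable_flag;  // declared with `mut`?
};

// Reject an assignment to a name that was not declared mutable.
void checkAssignment(const std::map<std::string, SymbolEntry>& symbols,
                     const std::string& name) {
    auto it = symbols.find(name);
    if (it == symbols.end())
        throw std::runtime_error("no such symbol " + name);
    if (!it->second.mutable_flag)
        throw std::runtime_error("cannot assign to immutable variable " + name);
}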
Symbol tables
A symbol table, or "symtab", is a table used to look up the symbols used in your program. One table is created per scope, and each contains information about the symbols in that scope.
This information includes properties such as the symbol's name, its type, whether it is mutable, whether it has external linkage, whether it lives in static storage, and so on.
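As a sketch, an entry in such a table might look like this - the exact fields are an assumption for illustration, not a particular compiler's layout:

#include <map>
#include <string>

struct Symbol {
    std::string name;
    std::string type_name;   // or a pointer to a Type node
    bool is_mutable;
    bool is_external;        // external linkage?
    bool is_static;          // lives in static storage?
};

struct SymbolTable {
    std::map<std::string, Symbol> symbols;

    void define(const Symbol& s) { symbols[s.name] = s; }

    // Returns nullptr when the name is not defined in this table.
    const Symbol* lookupLocal(const std::string& name) const {
        auto it = symbols.find(name);
        return it == symbols.end() ? nullptr : &it->second;
    }
};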
Scope
Scope is an important concept in programming languages. Of course, your language doesn't have to support nested scopes; everything could live in one global namespace!
Representing scope is an interesting problem in compiler architecture; in most C-like languages, scope behaves like (or simply is) a stack.
We push and pop scopes, and they are typically used to manage names - in particular, they allow variables to be shadowed:
{ // push scope
    let x = 3;
    { // push scope
        let x = 4; // OK!
    } // pop scope
} // pop scope
A scope can be represented like this:
struct Scope {
    Scope* outer;
    SymbolTable symbols;
};
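Name resolution can then walk outward through the chain of scopes. A minimal sketch, assuming the SymbolTable shape from the previous section:

const Symbol* resolve(const Scope* scope, const std::string& name) {
    for (; scope != nullptr; scope = scope->outer) {
        if (const Symbol* sym = scope->symbols.lookupLocal(name))
            return sym;   // innermost definition wins, which is what makes shadowing work
    }
    return nullptr;       // a "no such symbol" error in the resolution pass
}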
A small aside, but I recommend reading about the spaghetti stack - a data structure used to attach scopes to the AST nodes of their corresponding blocks.
Type systems
Many of the sections below could be expanded into articles of their own, but this one deserves it most. There is a huge amount of information available today about type systems, and about the many kinds of systems themselves, which are endlessly argued over. I won't go deep into the topic; I'll just leave a link to the excellent article by Steve Klabnik.
From the compiler's point of view, a type system is something enforced and semantically defined through the compiler's representations of the program and the analyses performed on them.
Ownership
This concept is showing up in programming more and more. Ownership and move semantics are baked into Rust, and I hope they will appear in other languages too. Rust performs many kinds of static analysis to check that the input satisfies a set of rules about memory: who owns which memory, when that memory is destroyed, and how many references (or borrows) to those values or that memory exist.
The beauty of Rust is that all of this happens at compile time, inside the compiler, so the programmer doesn't need garbage collection or reference counting. All these semantics belong to the type system and can be enforced before the program is ever turned into a complete binary.
I can't say exactly how it all works under the hood, but it is the result of static analysis and wonderful research by the Mozilla team and the people behind the Cyclone project.
Control Flow Graphs
To represent the flow of a program we use control flow graphs (CFGs), which encode all the paths a program's execution can take. They are used in semantic analysis to eliminate dead sections of code - blocks, functions, and even modules that can never be reached during execution. Graphs can also be used to detect loops that can never terminate, or to find unreachable code - for example, code after a panic, or after a return inside a loop, which will never execute.
Data flow analysis plays an important role in the semantic phase of the compiler, so I recommend reading up on the kinds of analysis you can do, how they work, and what optimizations they enable.
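A minimal sketch of a CFG and a reachability walk over it - the data structures here are illustrative, not taken from any particular compiler:

#include <set>
#include <vector>

struct BasicBlock {
    std::vector<int> statements;          // indices into some statement list
    std::vector<BasicBlock*> successors;  // edges: where control can go next
};

// Collects every block reachable from `entry`; anything not visited is dead code.
void markReachable(BasicBlock* entry, std::set<BasicBlock*>& visited) {
    if (!entry || visited.count(entry)) return;
    visited.insert(entry);
    for (BasicBlock* next : entry->successors)
        markReachable(next, visited);
}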
Backend
The final part of our architecture diagram. Most of the work is now behind us; what remains is producing an executable binary. This can be done in several ways, which we will discuss below.
One option is not to generate machine code at all, but to translate your language into another high-level language and let an existing compiler do the rest. This is usually called "transpiling".
It keeps your backend small and portable, although you give up control over the generated code and add another compiler to your toolchain. It is a well-trodden path: the first C++ compiler, Cfront, translated C++ into C.
A huge amount of this happens around JavaScript. TypeScript is the obvious example: it compiles to JavaScript so that it can run anywhere JavaScript runs.
"Transpilers" are a perfectly reasonable way to get a language working quickly before investing in a native backend.
LLVM
A lot of well-known languages use LLVM for their backend: Rust, Swift, C/C++ (clang), D, Haskell.
Instead of emitting assembly for each architecture "by hand", you emit LLVM IR and let LLVM lower and optimize it for every target it supports. The IR is typed and fairly expressive: integers, for example, can have arbitrary widths, so 1-, 4-, 8- or 16-bit values are all representable, and LLVM worries about how they map onto the actual machine.
Assembly
Another option is to generate assembly for your target architecture yourself, without LLVM.
Go, for example, does not use LLVM (in its main toolchain): it has its own backends and generates code for Windows, Linux and MacOS.
This is more work than using LLVM: you don't get LLVM's ready-made optimizations or its many targets for free, and you have to learn the instruction set you are targeting.
The approach itself can be quite simple, though: you walk your IR (or even the AST) and "print" the corresponding assembly instructions, for example with fprintf, then hand the result to an assembler and linker.
If you want to see what this looks like in a small, readable codebase, the tiny C compiler 8cc is worth studying.
You don't have to generate native code at all - another option is bytecode for a virtual machine. The classic example is Java: the JVM executes bytecode, and many languages target it, Kotlin among them.
Bytecode is portable: the same compiled program runs wherever the VM runs, at the cost of an extra layer between your program and the hardware.
Modern VMs narrow that cost: the JVM JIT-compiles hot bytecode into native code at runtime, so performance can come close to ahead-of-time compiled code.
Whichever backend you choose, don't expect to beat GCC or Clang! Production compilers are enormous projects, refined over decades by large teams, and a hobby compiler won't compete with them on the quality of the generated code.
Godbolt (the Compiler Explorer) is a great tool here: it shows exactly what assembly real compilers emit for a given snippet, so you can compare it with your own output. When comparing against something like GCC, it helps to strip the debug symbols and keep optimizations off, so you are looking at a fair baseline.
There is an interesting comment by rwmj on this subject that is worth looking up. Rust is a good example of just how much work goes into a modern production compiler.
IR
An intermediate representation (IR) is a form of the program that sits between the AST and the final machine code. It is usually simpler and more regular than the AST, which makes it easier to analyze, optimize, and translate into different targets.
Much of the optimization work happens on the IR rather than on the AST or the generated assembly.
A popular property for an IR to have is SSA - Static Single Assignment - in which every variable is assigned exactly once.
The Go compiler uses an SSA-based IR, and LLVM's IR is in SSA form as well.
SSA makes a number of optimizations straightforward, such as constant propagation and dead code elimination.
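To make constant propagation concrete, here is a tiny before/after sketch written as C++ rather than real IR; the single-assignment style mimics what SSA gives the optimizer:

// Every "variable" below is assigned exactly once, so each use can be
// replaced by the known constant and then folded.
int before() {
    int x0 = 3;
    int y0 = x0 + 4;     // y0 = 3 + 4 = 7
    return y0 * 2;       // 7 * 2 = 14
}

// After constant propagation and folding, the whole body collapses:
int after() {
    return 14;
}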
A related backend problem is register allocation. A machine has a limited number of registers (say 16 or 32), and the compiler has to decide which values live in which registers; when it runs out, values have to spill to the stack.
There are several approaches to register allocation; the two you will hear about most are:
- graph colouring - build an interference graph from the liveness ranges of values and colour it; colouring a graph is NP-hard in general, so this is relatively expensive;
- linear scan - walk the liveness intervals once and assign registers greedily; it is simpler and faster, at the cost of somewhat worse allocations.
Liveness also interacts with scoping. Take this example:
fn main() int {
    let x = 0;
    {
        let x = 0;
        {
            let x = 0;
        }
    }
    return 0;
}
All of these x variables (well, unless you are doing something odd :) ) don't each need their own register or stack slot: the allocator can reuse the same storage, or throw the values away entirely, based on their liveness.
LLDB
If you want your language to be debuggable with LLDB or GDB, you need to emit debug information, usually in the DWARF format. LLVM can generate DWARF metadata for you, and the GNU toolchain understands it as well. Debug information maps the generated instructions back to source locations, variable names, and so on - without it, a debugger has very little to work with.
Foreign Function Interface (FFI)
Sooner or later your language will want to talk to libc - most languages build their I/O, memory allocation and so on on top of it - and to existing libraries in general. How do you call them?
That is what a foreign function interface is for. You also have to decide how far it goes: do you allow linking against object files and assembly (.s/.asm)? Do you support inline assembly?
Jai is an interesting language to look at in this regard. Getting FFI right takes real design work.
Compiler as a Service (CaaS)
The compiler is exposed through an API. The Krug frontend sends requests to the compiler service, which does the lexing, parsing and analysis and sends back responses.
Each stage of compilation becomes something you can invoke separately over the API, rather than one monolithic binary that does everything in a single run.
There are production examples of CaaS: Microsoft's Roslyn, the C# compiler platform, is one. Exposing the compiler through an API also makes it much easier to build tooling - editors, analyzers, language servers - on top of it; the Rust RLS works in a similar spirit.
Krug works this way: its compiler, Caasper, is a CaaS-style service.
Caasper is the service itself - it does the lexing, parsing and analysis - while the krug repository contains the frontend that talks to it. The eventual plan is to bootstrap the language, that is, to rewrite the frontend in Krug itself.
The Krug frontend is currently written in JavaScript (there is also a Go* version), with a self-hosted Krug version planned. The JavaScript frontend is distributed via yarn/npm.
* The Go version of the frontend lags behind the JavaScript one.
Caasper
The Caasper service and the krug frontend are available at https://github.com/krug-lang. An earlier version of the Krug compiler, written in D with an LLVM backend, is also on GitHub, along with some YouTube videos about it.
Krug (the language)
Useful links