
LLVM: a do-it-yourself compiler. Introduction

Imagine that one day you came up with the idea of a processor of your own, unlike any existing architecture, and you really wanted to implement it "in hardware". Fortunately, there is nothing impossible about that: a little Verilog, and your idea lives. You are already dreaming of Intel going bankrupt, Microsoft hastily porting Windows to your architecture, and the Linux community shipping a fresh version of the system for your microprocessor, with very boring wallpapers.
However, one little thing is missing for all this: a compiler!
Yes, I know that many people do not consider a compiler to be anything important, believing that everyone should program strictly in assembler. If you think so too, I will not argue with you; just stop reading here.
If you want at least the C language to be available for your original architecture, read on.
This article discusses using the LLVM compiler infrastructure to build your own solutions on top of it.
The scope of LLVM is not limited to compilers for new processors: the infrastructure can also be used to develop compilers for new programming languages, new optimization algorithms, and special-purpose static code analysis tools (bug finding, statistics collection, and so on).
For example, you might use a standard core (say, ARM) in combination with a specialized coprocessor (say, a matrix FPU); in that case you may need to modify the existing ARM compiler so that it can also generate code for your FPU.
Another interesting use of LLVM is generating high-level source code ("translation" from one language to another). For example, you could write a generator of Verilog code from C source.




Why LLVM?


Today there are only two realistic ways to develop a compiler for your own architecture: GCC or LLVM. Other open-source compiler projects either never reached the level of maturity of GCC and LLVM, or have become outdated and stopped developing; they lack advanced optimization algorithms and may not even provide full compliance with the C standard, let alone support for other programming languages. Developing your own compiler from scratch is highly irrational, because existing open-source solutions already implement the compiler frontend together with many decidedly non-trivial optimization algorithms that have been thoroughly tested and used in production for a long time.
Which of these two open-source projects should you choose as the basis for your compiler? GCC (GNU Compiler Collection) is the older project; its first release took place in 1987, and its author is Richard Stallman, a well-known figure in the open-source movement [4]. It supports many programming languages: C, C++, Objective-C, Fortran, Java, Ada, Go. There are also frontends for many other languages that are not included in the main distribution. GCC supports a large number of processor architectures and operating systems, and is currently the most widely used compiler. GCC itself is written in C (as readers pointed out in the comments, it has by now largely been rewritten in C++).
LLVM is much "younger": its first release took place in 2003. Its Clang frontend supports C, C++, Objective-C, and Objective-C++, and there are also frontends for Common Lisp, ActionScript, Ada, D, Fortran, OpenGL Shading Language, Go, Haskell, Java bytecode, Julia, Swift, Python, Ruby, Rust, Scala, C#, and Lua. It was developed at the University of Illinois in the USA and is the main compiler for OS X development. LLVM is written in C++ (C++11 for recent releases) [5].

The relative "youth" of LLVM is not a disadvantage: it is mature enough to be free of critical bugs, yet it does not carry the huge burden of outdated architectural decisions that GCC does. The modular structure of the compiler makes it possible to use the LLVM-GCC frontend, which provides full GCC compatibility, while code for the target platform is generated by llc (the LLVM backend). You can also use Clang, LLVM's native frontend.
LLVM is well documented, and a large number of code examples are available for it.

Compiler Modular Architecture


The LLVM compiler infrastructure consists of many tools, and it makes no sense to cover them all within this article. We will confine ourselves to the basic tools that make up the compiler proper.
The LLVM compiler, like some others, consists of a frontend, an optimizer (middle-end), and a backend. This structure separates the development of a compiler for a new programming language, the development of optimization methods, and the development of a code generator for a specific processor (such compilers are called retargetable).
The connecting link between them is the LLVM intermediate language, the assembly language of a "virtual machine". The frontend (for example, Clang) converts the text of a program in a high-level language into the intermediate language; the optimizer (opt) performs various optimizations on it; and the backend (llc) generates code for the target processor (as assembly or directly as a binary). In addition, LLVM can work in JIT (just-in-time) mode, where compilation happens directly during program execution.
The intermediate representation of a program can exist either as a text file in the LLVM assembly language or in a binary format, the "bitcode". By default, clang generates bitcode (a .bc file), but for debugging and learning LLVM we will need the textual intermediate representation in LLVM assembly (it is also called IR code, from Intermediate Representation).


Fig. 1. Modular compiler architecture

In addition to the programs mentioned above, LLVM includes other utilities [6].
So let's write the simplest possible C program:
int foo(int x, int y) { return x + y; } 

And compile it:
clang-3.5 -c add.c -O0 --target=xcore -emit-llvm -S -o add_o0.ll

Some explanations:
-c add.c: the input file;
-O0: optimization level 0, no optimizations;
--target=xcore: the XCore processor core has no tricky features when compiling to IR code, which makes it an ideal object for study. The core is 32-bit, and clang aligns all variables on 32-bit word boundaries;
-emit-llvm -S: an instruction to clang to generate the LLVM file in text form (in LLVM assembly);
-o add_o0.ll: the output file.
Let's look at the result:
; ModuleID = 'add.c'
target datalayout = "e-m:e-p:32:32-i1:8:32-i8:8:32-i16:16:32-i64:32-f64:32-a:0:32-n32"
target triple = "xcore"

; Function Attrs: nounwind
define i32 @foo(i32 %x, i32 %y) #0 {
%1 = alloca i32, align 4
%2 = alloca i32, align 4
store i32 %x, i32* %1, align 4
store i32 %y, i32* %2, align 4
%3 = load i32* %1, align 4
%4 = load i32* %2, align 4
%5 = add nsw i32 %3, %4
ret i32 %5
}

attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}
!xcore.typestrings = !{!1}

!0 = metadata !{metadata !"Ubuntu clang version 3.5.0-4ubuntu2~trusty2 (tags/RELEASE_350/final) (based on LLVM 3.5.0)"}
!1 = metadata !{i32 (i32, i32)* @foo, metadata !"f{si}(si,si)"}

Looks pretty complicated, doesn't it? Still, let's see what is written here. So:
target datalayout = "e-m:e-p:32:32-i1:8:32-i8:8:32-i16:16:32-i64:32-f64:32-a:0:32-n32"
is a description of the bit widths of variables and the most basic features of the architecture. e means a little-endian architecture; for big-endian there would be a capital E here. m:e is the mangling (naming) convention; we don't need it yet. p:32:32 means pointers are 32 bits wide and aligned on 32-bit word boundaries. i1:8:32 means Boolean variables (i1) are represented as 8-bit values and aligned on 32-bit word boundaries. The same applies to the integer types i8 through i64 (8 to 64 bits wide, respectively) and to f64, double-precision floating-point variables. a:0:32 means aggregate variables (that is, arrays and structures) have 32-bit alignment, and n32 is the width of integers natively processed by the processor's ALU (native integer width).
Next comes the name of the target (that is, the target processor): target triple = "xcore". Although IR code is often called "machine-independent", we can see that in fact it is not: it reflects certain features of the target architecture. This is one of the reasons why the target architecture has to be specified not only for the backend, but also for the frontend.
Next comes the code of the function foo:
define i32 @foo(i32 %x, i32 %y) #0 {
%1 = alloca i32, align 4
%2 = alloca i32, align 4
store i32 %x, i32* %1, align 4
store i32 %y, i32* %2, align 4
%3 = load i32* %1, align 4
%4 = load i32* %2, align 4
%5 = add nsw i32 %3, %4
ret i32 %5
}

The code is rather verbose, even though the source function is very simple. Here is what it does.
First, note that all variable names in LLVM carry a prefix: % for local variables and @ for global ones. In this example all variables are local.
%1 = alloca i32, align 4 allocates 4 bytes on the stack for a variable; %1 is the pointer to this area.
store i32 %x, i32* %1, align 4 copies one of the function arguments (%x) into the allocated area.
%3 = load i32* %1, align 4 loads the value into the variable %3. A local copy of %x is now stored in %3.
The same is done for the %y argument.
%5 = add nsw i32 %3, %4 adds the local copies of %x and %y and puts the result into %5. There is also the nsw attribute, but it is not important for us yet.
Finally, ret i32 %5 returns the value of %5.
This example shows that at optimization level zero clang generates code that literally follows the source: it creates local copies of all arguments and makes no attempt to remove redundant instructions. This may look like a flaw, but in fact it is a very useful property both when debugging programs and when debugging the code of the compiler itself.
Let's see what happens if we raise the optimization level to -O1:
define i32 @foo(i32 %x, i32 %y) #0 {
%1 = add nsw i32 %y, %x
ret i32 %1
}

We see that not a single superfluous instruction remains: the program now adds the function arguments directly and returns the result.
There are higher optimization levels, but in this particular case they will not produce a better result, because the maximum has already been reached.
The rest of the LLVM code (attributes, metadata) carries service information that does not interest us yet.

LLVM code structure


The structure of LLVM code is very simple. A program consists of modules; the compiler processes one module at a time. A module contains global declarations (variables, constants, and declarations of functions located in other modules) and functions. Functions have arguments and return types. Functions consist of basic blocks. A basic block is a sequence of LLVM assembly instructions with a single entry point and a single exit point. A basic block contains no branches or loops; it executes strictly sequentially from beginning to end and must end with a terminator instruction (a return from the function or a transfer of control to another block).
Finally, a basic block consists of assembly instructions. The list of instructions is given in the LLVM documentation [7].
So, once again: a basic block has a single entry point marked with a label and must end with a terminator instruction: an unconditional branch br, a conditional branch, or a return instruction ret. A basic block has a list of predecessors, basic blocks from which control can reach it, and successors, basic blocks to which it can transfer control. From this information the compiler builds the CFG (Control Flow Graph), the most important structure representing a program inside the compiler.
Consider a test case: C source containing a loop:

// sum of 10 array elements
int for_loop(int x[]) {
    int sum = 0;
    for (int i = 0; i < 10; ++i) {
        sum += x[i];
    }
    return sum;
}

Its CFG is:

Another type of graph in LLVM is the DAG (directed acyclic graph), which represents a basic block.
It describes the assembly instructions and the dependencies between them. The figure below shows the DAG of the basic block representing the loop body from the example above, at optimization level -O1:


SSA form


The key feature of IR code that distinguishes it from high-level languages is that it is kept in so-called SSA form (Static Single Assignment form). This is very important to understand when developing a compiler on the LLVM platform, so we will give it some attention. In short, in SSA form each variable is assigned a value exactly once and only at one point in the program. All algorithms for optimizing and transforming IR code work only with this form.
But how can an ordinary high-level language program be converted into such a form? After all, in ordinary programming languages the value of a variable can be assigned many times in different places in the program, or, for example, in a loop.
To express such behavior, one of two techniques is used. The first is to use load/store instruction pairs, as in the code above. The single-assignment restriction applies only to named LLVM variables and does not apply to memory cells accessed through pointers: you can write to memory any number of times with the store instruction, and the formal SSA rule is still observed, since the pointer to the cell does not change. This method is typically used at optimization level -O0, and we will not dwell on it in detail. The second technique uses the φ-function, another interesting concept of IR code.

Code in SSA-form: load / store


Let's reuse the test example from the previous section and compile it at optimization level -O0:
define i32 @for_loop(i32* %x) #0 {
%1 = alloca i32*, align 4
%sum = alloca i32, align 4
%i = alloca i32, align 4
store i32* %x, i32** %1, align 4
store i32 0, i32* %sum, align 4
store i32 0, i32* %i, align 4
br label %2

; <label>:2 ; preds = %12, %0
%3 = load i32* %i, align 4
%4 = icmp slt i32 %3, 10
br i1 %4, label %5, label %15

; <label>:5 ; preds = %2
%6 = load i32* %i, align 4
%7 = load i32** %1, align 4
%8 = getelementptr inbounds i32* %7, i32 %6
%9 = load i32* %8, align 4
%10 = load i32* %sum, align 4
%11 = add nsw i32 %10, %9
store i32 %11, i32* %sum, align 4
br label %12

; <label>:12 ; preds = %5
%13 = load i32* %i, align 4
%14 = add nsw i32 %13, 1
store i32 %14, i32* %i, align 4
br label %2

; <label>:15 ; preds = %2
%16 = load i32* %sum, align 4
ret i32 %16
}

Here we see what was described above: the loop variable is just a memory cell referenced by the %i pointer.

Code in SSA form: φ-function


Now let's try optimization level -O1:
define i32 @for_loop(i32* nocapture readonly %x) #0 {
br label %1

; <label>:1 ; preds = %1, %0
%i.02 = phi i32 [ 0, %0 ], [ %5, %1 ]
%sum.01 = phi i32 [ 0, %0 ], [ %4, %1 ]
%2 = getelementptr inbounds i32* %x, i32 %i.02
%3 = load i32* %2, align 4, !tbaa !2
%4 = add nsw i32 %3, %sum.01
%5 = add nsw i32 %i.02, 1
%exitcond = icmp eq i32 %5, 10
br i1 %exitcond, label %6, label %1

; <label>:6 ; preds = %1
ret i32 %4
}

Here the loop variable is %i.02 (variable names in LLVM may contain dots), and it is not a pointer but an ordinary 32-bit variable; its value is assigned by the instruction phi i32 [ 0, %0 ], [ %5, %1 ]. This means the φ-function takes the value 0 if control arrived from basic block %0 (the first basic block of the function), and the value of variable %5 if control arrived from basic block %1 (that is, from the exit point of this same basic block). Thus the IR code generator got rid of the repeated assignments while strictly following the formal rules of SSA. The variable %sum.01 is assigned in a similar way.
So, the meaning of the φ-function is that its value depends on which basic block control came from. φ-functions may appear only at the beginning of a basic block; if there are several, they must follow one another contiguously, starting from the first instruction of the block.

Moar optimization!


What happens at optimization level -O2?
define i32 @for_loop(i32* nocapture readonly %x) #0 {
%1 = load i32* %x, align 4, !tbaa !2
%2 = getelementptr inbounds i32* %x, i32 1
%3 = load i32* %2, align 4, !tbaa !2
%4 = add nsw i32 %3, %1
%5 = getelementptr inbounds i32* %x, i32 2
%6 = load i32* %5, align 4, !tbaa !2
%7 = add nsw i32 %6, %4
%8 = getelementptr inbounds i32* %x, i32 3
%9 = load i32* %8, align 4, !tbaa !2
%10 = add nsw i32 %9, %7
%11 = getelementptr inbounds i32* %x, i32 4
%12 = load i32* %11, align 4, !tbaa !2
%13 = add nsw i32 %12, %10
%14 = getelementptr inbounds i32* %x, i32 5
%15 = load i32* %14, align 4, !tbaa !2
%16 = add nsw i32 %15, %13
%17 = getelementptr inbounds i32* %x, i32 6
%18 = load i32* %17, align 4, !tbaa !2
%19 = add nsw i32 %18, %16
%20 = getelementptr inbounds i32* %x, i32 7
%21 = load i32* %20, align 4, !tbaa !2
%22 = add nsw i32 %21, %19
%23 = getelementptr inbounds i32* %x, i32 8
%24 = load i32* %23, align 4, !tbaa !2
%25 = add nsw i32 %24, %22
%26 = getelementptr inbounds i32* %x, i32 9
%27 = load i32* %26, align 4, !tbaa !2
%28 = add nsw i32 %27, %25
ret i32 %28
}

The optimizer has unrolled the loop. In general, the IR optimizer in LLVM is very smart: it can not only unroll loops, but also simplify non-trivial constructs, compute constant values even when they do not appear in the code explicitly, and perform other complex code transformations.

Linking IR code


Real programs consist of more than one module. A traditional compiler compiles modules separately, turning them into object files, and then hands them to the linker, which combines them into a single executable. LLVM allows this too.
But LLVM can also link IR code. This is easiest to see with an example. Suppose there are two source modules, foo.c and bar.c:
//bar.h
#ifndef BAR_H
#define BAR_H
int bar(int x, int k);
#endif

//bar.c
int bar(int x, int k) {
    return x * x * k;
}

//foo.c
#include "bar.h"
int foo(int x, int y) {
    return bar(x, 2) + bar(y, 3);
}

If the program is compiled in the "traditional" way, the optimizer can do almost nothing with it: when compiling foo.c, the compiler does not know what is inside the bar function and can only do the obvious thing, namely emit calls to bar().
But if we link the IR code instead, we get a single module which, after optimization at the -O2 level, looks like this (for clarity, the module header and metadata are omitted):
define i32 @foo(i32 %x, i32 %y) #0 {
%1 = shl i32 %x, 1
%2 = mul i32 %1, %x
%3 = mul i32 %y, 3
%4 = mul i32 %3, %y
%5 = add nsw i32 %4, %2
ret i32 %5
}

; Function Attrs: nounwind readnone
define i32 @bar(i32 %x, i32 %k) #0 {
%1 = mul nsw i32 %x, %x
%2 = mul nsw i32 %1, %k
ret i32 %2
}

Here we can see that no calls occur in the foo function: the compiler has inlined the entire body of bar() into it, substituting the constant values of k along the way. Although the bar() function remains in the module, it will be dropped when the executable is linked, provided it is not called anywhere else in the program.
Note that GCC also has the ability to link and optimize intermediate code (LTO, link-time optimization) [6].
Of course, optimization in LLVM is not limited to the IR code. Inside the backend, various optimizations also happen at different stages of converting IR code into the machine representation. Some of these optimizations LLVM performs on its own, but the backend developer can (and should) implement custom optimization algorithms that take full advantage of the features of the processor architecture.

Target platform code generation


Developing a compiler for an original processor architecture is mainly backend development. Intervening in the frontend algorithms is usually unnecessary, or at any rate requires very good reasons. If you analyze the Clang source code, you will see that most of the "special-case" handling concerns the x86 and PowerPC processors with their non-standard floating-point formats. For most other processors you only need to specify the sizes of the base types and the endianness (big-endian or little-endian). Most often you can simply find a processor of similar bit width among the many already supported.
Code generation for the target platform happens in the LLVM backend, llc. llc supports many different processors, and you can use it as the basis for a code generator for your own original processor. The task is simplified by the fact that all the source code, including the modules for every supported architecture, is completely open and available for study.
The code generator for the target platform is the most labor-intensive part of developing a compiler on the LLVM infrastructure. I decided not to dwell here on the implementation details of the backend, since they depend heavily on the target processor architecture. However, if the respected Habr audience is interested in the topic, I am ready to describe the key points of backend development in a follow-up article.

Conclusion


Within a short article it is impossible to cover in detail the LLVM architecture, the syntax of the LLVM IR language, or the backend development process; these topics, however, are covered thoroughly in the documentation. Instead, the author has tried to give a general picture of the LLVM compiler infrastructure, emphasizing that the platform is modern, powerful, and universal, independent of both the input language and the target processor architecture, and allows you to implement either.
Beyond what we have covered, LLVM contains other utilities, but discussing them is outside the scope of this article.
LLVM lets you implement a backend for any architecture (see the note below): architectures with pipelining, with out-of-order execution, with various kinds of parallelism, VLIW, classical and stack architectures; in short, for any option.
No matter how unconventional the decisions underlying a processor architecture are, it is only a question of writing a larger or smaller amount of code.
Note
For any reasonable architecture, that is. It is hardly possible to implement a C compiler for a 4-bit architecture, since the language standard explicitly requires a bit width of at least 8.


Literature



Compilers
[1] Aho A., Sethi R., Ullman J. Compilers: Principles, Techniques, and Tools (the "Dragon Book")
[2] Wirth N. Compiler Construction
GCC
[3] gcc.gnu.org, the GCC project site
[4] Richard M. Stallman and the GCC Developer Community. GNU Compiler Collection Internals
LLVM
[5] http://llvm.org/, the LLVM project site
[6] http://llvm.org/docs/GettingStarted.html, Getting Started with the LLVM System
[7] http://llvm.org/docs/LangRef.html, LLVM Language Reference Manual
[8] http://llvm.org/docs/WritingAnLLVMBackend.html, Writing An LLVM Backend
[9] http://llvm.org/docs/WritingAnLLVMPass.html, Writing An LLVM Pass
[10] Chen Chung-Shu. Creating an LLVM Backend for the Cpu0 Architecture
[11] Mayur Pandey, Suyog Sarda. LLVM Cookbook
[12] Bruno Cardoso Lopes. Getting Started with LLVM Core Libraries
[13] Suyog Sarda, Mayur Pandey. LLVM Essentials
The author will be happy to answer your questions in the comments and in private messages.
Please report any typos you notice via private message. Thanks in advance.

Source: https://habr.com/ru/post/277717/

