📜 ⬆️ ⬇️

My first compiler on LLVM

This tutorial is dedicated to writing the simplest compiler on LLVM. No prior preparation is required.



The input language of our compiler will be BF. This is a classic "toy" language for compilers, and even there is a BF compiler in the examples for LLVM ! In this post I will give the process of writing a compiler with explanations.

First LLVM program


The LLVM infrastructure is built around the LLVM IR programming language. First we will write the simplest LLVM IR program as far as possible.
')
Since LLVM already uses LLVM IR, we can simply use clang to see what a simple program looks like. Take a simple C program

int main() { return 42; } 

Let's skip it through clang:

 # -O3 ensures we discard any unnecessary instructions. $ clang -S -emit-llvm -O3 forty_two.c 

Get the file forty_two.ll, which looks like this:

 define i32 @main() { ret i32 42 } 

Our first program on LLVM IR! We can use lli to run it:

 $ lli forty_two.ll $ echo $? 42 

We write a skeleton


We will compile the BF commands into a sequence of LLVM IR instructions. However, these instructions must be placed inside the main function so that LLVM knows where the entry point is. We will also need to allocate and initialize the memory cells and the cell index.

Again, we simply write the equivalent in C and look at the LLVM IR code that Clang generates. Here is the skeleton of the program that we will use:

 declare i8* @calloc(i32, i32) declare void @free(i8*) define i32 @main() { ; Allocate 30,000 cells on the heap. %cells = call i8* @calloc(i32 30000, i32 1) ; Allocate a stack variable to track the cell index. %cell_index_ptr = alloca i32 ; Initialise it to zero. store i32 0, i32* %cell_index_ptr ;;;;;;;;;;;;;;;;;;;; ; Our BF code will go here! ;;;;;;;;;;;;;;;;;;;; ; Free the memory for the cells. call void @free(i8* %cells) ret i32 0 } 

Manually compile>


">" Is the easiest BF command to compile. Open a text editor and write its equivalent on LLVM IR.

If your knowledge of BF is a little rusty, ">" simply increments the cell index.

 %cell_index = load i32* %cell_index_ptr %new_cell_index = add i32 1, %cell_index store i32 %new_cell_index, i32* %cell_index_ptr 

To verify that the code is correct, run it. Just insert it into our skeleton , run it in lli, and see what happens. The implementation of ">" is straightforward, and we will also write a program for "<".

Manually compile +


The BF "+" command increments the current cell. We are required to dereference the cell pointer, calculate the new value, and save it. In C, it looks like this:

 char *cell_ptr = cells + cell_index; char current_value = *cell_ptr; char new_value = current_value + 1; *cell_ptr = new_value; 

LLVM uses the getelementptr instruction to calculate the pointer. The code will look like this:

 %cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %current_value = load i8* %cell_ptr %new_value = add i8 %current_value, 1 store i8 %new_value, i8* %cell_ptr 

We test it again by placing the program in our skeleton, and do the same for "-".

Input Output


BF has two I / O commands: "," which reads from stdin to a cell, and "." which writes from cell to stdout. We need to call C functions for this: putchar and getchar .

We need to declare these functions, as we did earlier for malloc :

 declare i32 @putchar(i32) declare i32 @getchar() 

To implement the command, we call getchar , prune the result to char, and write it to the current cell.

 %cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %input_int = call i32 @getchar() %input_byte = trunc i32 %input_int to i8 store i8 %input_byte, i8* %cell_ptr 

"." works in the reverse order: we read from the cell, make a significant extension, and call putchar .

 %cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %current_cell = load i8* %cell_ptr %current_cell_word = sext i8 %current_cell to i32 call i32 @putchar(i32 %current_cell_word) 

Cycles


LLVM IR instructions are organized into basic blocks; sequences of instructions without transitions inside the block. At the end of each base unit, you must either make the transition to another base unit, or return.

To compile "[x] y", we need to check the current cell, then either go to x , our loop body, or y . At the end of x , we need to go to the beginning.

 loop_header: %cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %cell_value = load i8* %cell_ptr %is_zero = icmp eq i8 %cell_value, 0 br i1 %is_zero, label %loop_after, label %loop_body loop_body: ; x br label %loop_header loop_after: ; y 

Note that there is a recursion, x can also contain loops. Sample program here .

Putting it all together


We have done the most difficult! Now we use LLVM API for generation of instructions. Each IR instruction has a corresponding C ++ object that you can add to your base unit.

LLVM API also has a convenient IRBuilder concept. The IRBuilder class provides methods for creating all IR instructions, making generating IR simple.

Here is the C ++ code for generating LLVM IR for ">". The excellent LLVM manual includes instructions for compiling a simple C ++ program using the LLVM API.

 BasicBlock *BB; // Instantiate an IRBuilder that appends to our // current basic block. IRBuilder<> Builder(getGlobalContext()); Builder.SetInsertPoint(BB); // We want to increment by 1, but since cell_index is // 32-bit, our constant must be 32-bit too. Value *IncrementAmount = ConstantInt::get(getGlobalContext(), APInt(32, 1)); // Emit the load, add and store instructions. Value *CellIndex = Builder.CreateLoad(CellIndexPtr, "cell_index"); Value *NewCellIndex = Builder.CreateAdd(CellIndex, IncrementAmount, "new_cell_index"); Builder.CreateStore(NewCellIndex, CellIndexPtr); 

Compiling other BF commands is a simple translation of our hand-written snippets. You can see a fully working compiler here .

Machine code generation


Our compiler generates only LLVM IR. Real compilers generate machine code. Use llc to convert to an object file, and then link to get the executable file.

 $ ./compiler hello_world.bf $ llc -filetype=obj hello_world.ll $ gcc hello_world.o -o hello_world $ ./hello_world Hello World! 

That's all!

Appendix: Optimization


You can do optimizations to get faster executables from BF programs. However, LLVM is smart enough to compile a simple BF program without cycles into the optimal LLVM IR representation!

 $ cat exclamation.bf +++++ +++++ +++++ +++++ +++++ +++++ +++ ASCII 33 is '!' . Write ! to stdout $ ./compiler exclamation.bf $ opt -S -O3 exclamation.ll -o optimised_exclamation.ll 

We get:

 define i32 @main() { entry: %0 = tail call i32 @putchar(i32 33) ret i32 0 } 

Source: https://habr.com/ru/post/342456/


All Articles