This tutorial is dedicated to writing the simplest compiler on LLVM. No prior preparation is required.

The input language of our compiler will be BF. This is a classic "toy" language for compilers, and even there is a BF compiler in the
examples for LLVM ! In this post I will give the process of writing a compiler with explanations.
First LLVM program
The LLVM infrastructure is built around the
LLVM IR programming language. First we will write the simplest LLVM IR program as far as possible.
')
Since LLVM already uses LLVM IR, we can simply use
clang to see what a simple program looks like. Take a simple C program
int main() { return 42; }
Let's skip it through clang:
Get the file forty_two.ll, which looks like this:
define i32 @main() { ret i32 42 }
Our first program on LLVM IR! We can use
lli to run it:
$ lli forty_two.ll $ echo $? 42
We write a skeleton
We will compile the BF commands into a sequence of LLVM IR instructions. However, these instructions must be placed inside the
main function so that LLVM knows where the entry point is. We will also need to allocate and initialize the memory cells and the cell index.
Again, we simply write the equivalent in C and look at the LLVM IR code that Clang generates. Here is the skeleton of the program that we will use:
declare i8* @calloc(i32, i32) declare void @free(i8*) define i32 @main() { ; Allocate 30,000 cells on the heap. %cells = call i8* @calloc(i32 30000, i32 1) ; Allocate a stack variable to track the cell index. %cell_index_ptr = alloca i32 ; Initialise it to zero. store i32 0, i32* %cell_index_ptr ;;;;;;;;;;;;;;;;;;;; ; Our BF code will go here! ;;;;;;;;;;;;;;;;;;;; ; Free the memory for the cells. call void @free(i8* %cells) ret i32 0 }
Manually compile>
">" Is the easiest BF command to compile. Open a text editor and write its equivalent on LLVM IR.
If your knowledge of BF is a little rusty, ">" simply increments the cell index.
%cell_index = load i32* %cell_index_ptr %new_cell_index = add i32 1, %cell_index store i32 %new_cell_index, i32* %cell_index_ptr
To verify that the code is correct, run it. Just
insert it into our skeleton , run it in lli, and see what happens. The implementation of ">" is straightforward, and we will also write a
program for "<".
Manually compile +
The BF "+" command increments the current cell. We are required to dereference the cell pointer, calculate the new value, and save it. In C, it looks like this:
char *cell_ptr = cells + cell_index; char current_value = *cell_ptr; char new_value = current_value + 1; *cell_ptr = new_value;
LLVM uses the
getelementptr instruction to calculate the pointer. The code will look like this:
%cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %current_value = load i8* %cell_ptr %new_value = add i8 %current_value, 1 store i8 %new_value, i8* %cell_ptr
We test it again by
placing the program in our skeleton, and do
the same for "-".
Input Output
BF has two I / O commands: "," which reads from stdin to a cell, and "." which writes from cell to stdout. We need to call C functions for this:
putchar and
getchar .
We need to declare these functions, as we did earlier for
malloc :
declare i32 @putchar(i32) declare i32 @getchar()
To implement the command, we call
getchar , prune the result to char, and write it to the current cell.
%cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %input_int = call i32 @getchar() %input_byte = trunc i32 %input_int to i8 store i8 %input_byte, i8* %cell_ptr
"." works in the reverse order: we read from the cell, make a significant extension, and call
putchar .
%cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %current_cell = load i8* %cell_ptr %current_cell_word = sext i8 %current_cell to i32 call i32 @putchar(i32 %current_cell_word)
Cycles
LLVM IR instructions are organized into basic blocks; sequences of instructions without transitions inside the block. At the end of each base unit, you must either make the transition to another base unit, or return.
To compile "[x] y", we need to check the current cell, then either go to
x , our loop body, or
y . At the end of
x , we need to go to the beginning.
loop_header: %cell_index = load i32* %cell_index_ptr %cell_ptr = getelementptr i8* %cells, i32 %cell_index %cell_value = load i8* %cell_ptr %is_zero = icmp eq i8 %cell_value, 0 br i1 %is_zero, label %loop_after, label %loop_body loop_body: ; x br label %loop_header loop_after: ; y
Note that there is a recursion,
x can also contain loops. Sample program
here .
Putting it all together
We have done the most difficult! Now we use LLVM API for generation of instructions. Each IR instruction has a corresponding C ++ object that you can add to your base unit.
LLVM API also has a convenient
IRBuilder concept. The IRBuilder class provides methods for creating all IR instructions, making generating IR simple.
Here is the C ++ code for generating LLVM IR for ">". The excellent LLVM manual includes instructions for compiling a simple C ++ program using the LLVM API.
BasicBlock *BB;
Compiling other BF commands is a simple translation of our hand-written snippets. You can see a fully working compiler
here .
Machine code generation
Our compiler generates only LLVM IR. Real compilers generate machine code. Use llc to convert to an object file, and then link to get the executable file.
$ ./compiler hello_world.bf $ llc -filetype=obj hello_world.ll $ gcc hello_world.o -o hello_world $ ./hello_world Hello World!
That's all!
Appendix: Optimization
You can do optimizations to get faster executables from BF programs. However, LLVM is smart enough to compile a simple BF program without cycles into the optimal LLVM IR representation!
$ cat exclamation.bf +++++ +++++ +++++ +++++ +++++ +++++ +++ ASCII 33 is '!' . Write ! to stdout $ ./compiler exclamation.bf $ opt -S -O3 exclamation.ll -o optimised_exclamation.ll
We get:
define i32 @main() { entry: %0 = tail call i32 @putchar(i32 33) ret i32 0 }