📜 ⬆️ ⬇️

Toy Front End for LLVM written in Rust: Beginner's Guide

Translator's Note
The code given in the article was compiled with fairly old versions of peg and peg_syntax_ext crate. For current versions of the source you need to make minimal changes. I inserted the changed sites in the spoilers on the text of the article. To build the code, install the nightly Rust compiler.
The full source code with my edits can be downloaded here: https://github.com/arktur04/rust-llvm-toy-frontend


I am currently working on a compiler that is written in Rust, and spawns LLVM IR. The LLVM API looks a bit scary for newbies, and there are not so many guides on it (and they are all in C ++, so it’s not quite clear how to do the same in Rust). I would like someone to give me a helping hand when I started all this, and this article is what I would like to show myself at that time.



In Rust, the best way to interact with LLVM is through llvm-sys crate. One kind person posted documentation to him here . Of course, you should also read the LLVM guide , as it will help you understand how LLVM “thinks”. This post is basically a translation of the subsets from this guide to Rust.
')
The full source code for this tutorial is here .

Get the development environment


For starters, here is a way to start working with LLVM:

# `curl` is just so we can next install Rust sudo apt-get -y install clang curl llvm-3.8-dev curl https://sh.rustup.rs -sSf | sh # The `llvm-sys` crate expects something called `llvm-config` on your PATH. sudo ln -s /usr/bin/llvm-config-3.8 /usr/bin/llvm-config 

If you are working on a fresh Ubuntu (you may need apt-get update ), then everything is ready, you can start. If not, you can run in the Vagrant virtual machine, via Vagrantfile :

 Vagrant.configure("2") do |config| config.vm.box = "bento/ubuntu-16.04" end 

You can proceed by running cargo init llvm-example --bin and placing the following (taken from llvm-sys) in src / main.rs :

 //! Construct a function that does nothing in LLVM IR. extern crate llvm_sys as llvm; use std::ptr; fn main() { unsafe { // Set up a context, module and builder in that context. let context = llvm::core::LLVMContextCreate(); let module = llvm::core::LLVMModuleCreateWithName(b"nop\0".as_ptr() as *const _); let builder = llvm::core::LLVMCreateBuilderInContext(context); // Get the type signature for void nop(void); // Then create it in our module. let void = llvm::core::LLVMVoidTypeInContext(context); let function_type = llvm::core::LLVMFunctionType(void, ptr::null_mut(), 0, 0); let function = llvm::core::LLVMAddFunction(module, b"nop\0".as_ptr() as *const _, function_type); // Create a basic block in the function and set our builder to generate // code in it. let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function, b"entry\0".as_ptr() as *const _); llvm::core::LLVMPositionBuilderAtEnd(builder, bb); // Emit a `ret void` into the function llvm::core::LLVMBuildRetVoid(builder); // Dump the module as IR to stdout. llvm::core::LLVMDumpModule(module); // Clean up. Values created in the context mostly get cleaned up there. llvm::core::LLVMDisposeBuilder(builder); llvm::core::LLVMDisposeModule(module); llvm::core::LLVMContextDispose(context); } } 

And in your Cargo.toml :

 [package] name = "llvm-example" version = "0.1.0" authors = ["Ulysse Carion <ulysse@ulysse.io>"] [[bin]] name = "main" [dependencies] llvm-sys = "0.2" 

You should get the following:

 vagrant@vagrant:/vagrant$ cargo run Compiling llvm-example v0.1.0 (file:///vagrant) Running `target/debug/main` ; ModuleID = 'nop' define void @nop() { entry: ret void } 

Hooray! We can start writing our own program.

A bit less trivial program.


To begin with, let's compile a program that simply returns the termination code by returning the integer from the main function.

Here's how I did it: (We will need a parser soon, and I added it now; I used a peg crate):

Note trans.
Text Cargo.toml:

 [package] name = "llvm-example" version = "0.1.0" authors = ["Ulysse Carion <ulysse@ulysse.io>"] [[bin]] name = "main" [dependencies] llvm-sys = "38" peg = "0.5.4" peg-syntax-ext = "0.5.2" 


 #![feature(plugin)] #![plugin(peg_syntax_ext)] extern crate llvm_sys as llvm; use std::ffi::CString; use std::fs::File; use std::io::Read; use std::ptr; fn main() { let mut input = String::new(); let mut f = File::open("in.ex").unwrap(); f.read_to_string(&mut input).unwrap(); let parsed_input = parser::program(&input).unwrap(); unsafe { codegen(parsed_input); } } peg! parser(r#" #[pub] program -> String = i:int_literal "\n" { i } int_literal -> String = [0-9]+ { match_str.to_owned() } "#); unsafe fn codegen(input: String) { let context = llvm::core::LLVMContextCreate(); let module = llvm::core::LLVMModuleCreateWithName(b"example_module\0".as_ptr() as *const _); let builder = llvm::core::LLVMCreateBuilderInContext(context); // In LLVM, you get your types from functions. let int_type = llvm::core::LLVMInt64TypeInContext(context); let function_type = llvm::core::LLVMFunctionType(int_type, ptr::null_mut(), 0, 0); let function = llvm::core::LLVMAddFunction(module, b"main\0".as_ptr() as *const _, function_type); let entry_name = CString::new("entry").unwrap(); let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function, entry_name.as_ptr()); llvm::core::LLVMPositionBuilderAtEnd(builder, bb); // The juicy part: construct a `LLVMValue` from a Rust value: let int_value: u64 = input.parse().unwrap(); let int_value = llvm::core::LLVMConstInt(int_type, int_value, 0); llvm::core::LLVMBuildRet(builder, int_value); // Instead of dumping to stdout, let's write out the IR to `out.ll` let out_file = CString::new("out.ll").unwrap(); llvm::core::LLVMPrintModuleToFile(module, out_file.as_ptr(), ptr::null_mut()); llvm::core::LLVMDisposeBuilder(builder); llvm::core::LLVMDisposeModule(module); llvm::core::LLVMContextDispose(context); } 

Note trans.
Parser changes:

 peg! parser(r#" #[pub] program -> String = i:int_literal "\n" { i } int_literal -> String = n:$([0-9]+) { n.to_owned() } "#); 


Works! Checking:

 vagrant@vagrant:/vagrant$ cat in.ex 42 vagrant@vagrant:/vagrant$ cargo run Running `target/debug/main` vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $? 42 

Cool! This is how out.ll looks like :

 ; ModuleID = 'example_module' define i64 @main() { entry: ret i64 42 } 

Arithmetic


Add support for addition, subtraction, multiplication, and division of numbers. To do this, we need to expand our grammar. Let's enter the enum that represents AST:

 pub enum Expr { Add(Box<Expr>, Box<Expr>), Sub(Box<Expr>, Box<Expr>), Mul(Box<Expr>, Box<Expr>), Div(Box<Expr>, Box<Expr>), Literal(String), } 

We also need to expand the grammar:

 // `product` and `sum` are that way to get operator precedence peg! parser(r#" use super::Expr; #[pub] program -> Expr = e:expression "\n" { e } expression -> Expr = sum sum -> Expr = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) } / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) } / product product -> Expr = a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) } / a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) } / int_literal int_literal -> Expr = [0-9]+ { Expr::Literal(match_str.to_owned()) } _ = " "* "#); 

Note trans.
Parser changes:

 // `product` and `sum` are that way to get operator precedence peg! parser(r#" use super::Expr; #[pub] program -> Expr = e:expression "\n" { e } expression -> Expr = sum sum -> Expr = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) } / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) } / product product -> Expr = a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) } / a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) } / int_literal int_literal -> Expr = n:$([0-9]+) { Expr::Literal(n.to_owned()) } _ = " "* "#); 


Then we generate the code. You can define strings, such as “ addtmp ”, and they will be used as part of the corresponding register name in IR.

 // When you write out instructions in LLVM, you get back `LLVMValueRef`s. You // can then use these references in other instructions. unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, expr: Expr) -> LLVMValueRef { match expr { Expr::Literal(int_literal) => { let int_type = llvm::core::LLVMInt64TypeInContext(context); llvm::core::LLVMConstInt(int_type, int_literal.parse().unwrap(), 0) }, Expr::Add(lhs, rhs) => { let lhs = codegen_expr(context, builder, *lhs); let rhs = codegen_expr(context, builder, *rhs); let name = CString::new("addtmp").unwrap(); llvm::core::LLVMBuildAdd(builder, lhs, rhs, name.as_ptr()) }, Expr::Sub(lhs, rhs) => { let lhs = codegen_expr(context, builder, *lhs); let rhs = codegen_expr(context, builder, *rhs); let name = CString::new("subtmp").unwrap(); llvm::core::LLVMBuildSub(builder, lhs, rhs, name.as_ptr()) }, Expr::Mul(lhs, rhs) => { let lhs = codegen_expr(context, builder, *lhs); let rhs = codegen_expr(context, builder, *rhs); let name = CString::new("multmp").unwrap(); llvm::core::LLVMBuildMul(builder, lhs, rhs, name.as_ptr()) }, Expr::Div(lhs, rhs) => { let lhs = codegen_expr(context, builder, *lhs); let rhs = codegen_expr(context, builder, *rhs); let name = CString::new("divtmp").unwrap(); llvm::core::LLVMBuildUDiv(builder, lhs, rhs, name.as_ptr()) }, } } 

Now you can execute programs like 10 * 4 + 20/2 - 8 ! Very cool, I think.

Variables


We will follow a simple path and assume that our program does not do various annoying things, such as references to undefined variables. We simply save the variables to the registers, and save them to the HashMap <String, LLVMValueRef>. This works because the program has only one execution path.
Extend the language and parser:

 pub enum Expr { Literal(String), Ref(String), Assign(String, Box<Expr>), Add(Box<Expr>, Box<Expr>), Sub(Box<Expr>, Box<Expr>), Mul(Box<Expr>, Box<Expr>), Div(Box<Expr>, Box<Expr>), } peg! parser(r#" use super::Expr; #[pub] program -> Vec<Expr> = e:(expression ** "\n") "\n" { e } expression -> Expr = i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) } / sum sum -> Expr = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) } / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) } / product product -> Expr = a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) } / a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) } / ref_or_literal ref_or_literal -> Expr = i:identifier { Expr::Ref(i) } / int_literal identifier -> String = [a-zA-Z]+ { match_str.to_owned() } int_literal -> Expr = [0-9]+ { Expr::Literal(match_str.to_owned()) } _ = " "* "#); 

Note trans.
Parser changes:

 peg! parser(r#" use super::Expr; #[pub] program -> Vec<Expr> = e:(expression ** "\n") "\n" { e } expression -> Expr = i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) } / sum sum -> Expr = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) } / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) } / product product -> Expr = a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) } / a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) } / ref_or_literal ref_or_literal -> Expr = i:identifier { Expr::Ref(i) } / int_literal identifier -> String = n:$([a-zA-Z]+) { n.to_owned() } int_literal -> Expr = n:$([0-9]+) { Expr::Literal(n.to_owned()) } _ = " "* "#); 


Then we add support for two new expressions:

 unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef { match expr { // ... Expr::Ref(name) => { *names.get(&name).unwrap() }, Expr::Assign(name, expr) => { let new_value = codegen_expr(context, builder, names, *expr); names.insert(name, new_value); new_value }, } } 

And change the codegen function a bit :

 let int_type = llvm::core::LLVMInt64TypeInContext(context); let zero = llvm::core::LLVMConstInt(int_type, 0, 0); let mut names = HashMap::new(); let mut return_value = zero; // return value on empty program for expr in input { return_value = codegen_expr(context, builder, &mut names, expr); } llvm::core::LLVMBuildRet(builder, return_value); 

Voila! Checking:

 vagrant@vagrant:/vagrant$ cat in.ex a = 3 b = 76 a + b vagrant@vagrant:/vagrant$ cargo run Running `target/debug/main` vagrant@vagrant:/vagrant$ cat out.ll ; ModuleID = 'example_module' define i64 @main() { entry: ret i64 79 } 

If


With if, things get a bit more complicated. The easiest way to make it work is to save local variables on the stack, and allow LLVM to optimize. In LLVM, you create a stack variable with the alloca command, and then read / write with the load / store commands.

In order to do this, we again expand the language and grammar by adding new parser rules:

 expression -> Expr = if_expression / i:identifier _ "=" _ s:expression { Expr::Assign(i, Box::new(s)) } / sum if_expression -> Expr = "if" _ e:expression _ "{\n" _ then_body:statements _ "}" _ "else" _ "{\n" _ else_body:statements _ "}" { Expr::If(Box::new(e), then_body, else_body) } 

And add a new type of AST node:

 pub enum Expr { Literal(String), Ref(String), Assign(String, Box<Expr>), Add(Box<Expr>, Box<Expr>), Sub(Box<Expr>, Box<Expr>), Mul(Box<Expr>, Box<Expr>), Div(Box<Expr>, Box<Expr>), If(Box<Expr>, Vec<Expr>, Vec<Expr>), } 

Finally, we generate code for the if expression:

 unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, func: LLVMValueRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef { match expr { // ... Expr::If(condition, then_body, else_body) => { let condition_value = codegen_expr(context, builder, func, names, *condition); let int_type = llvm::core::LLVMInt64TypeInContext(context); let zero = llvm::core::LLVMConstInt(int_type, 0, 0); // `is_nonzero` is a 1-bit integer let name = CString::new("is_nonzero").unwrap(); let is_nonzero = llvm::core::LLVMBuildICmp(builder, llvm::LLVMIntPredicate::LLVMIntNE, condition_value, zero, name.as_ptr()); // It's fine to create blocks first, and then fill them in later. let entry_name = CString::new("entry").unwrap(); let then_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr()); let else_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr()); let merge_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr()); llvm::core::LLVMBuildCondBr(builder, is_nonzero, then_block, else_block); llvm::core::LLVMPositionBuilderAtEnd(builder, then_block); let mut then_return = zero; for expr in then_body { then_return = codegen_expr(context, builder, func, names, expr); } llvm::core::LLVMBuildBr(builder, merge_block); llvm::core::LLVMPositionBuilderAtEnd(builder, else_block); let mut else_return = zero; for expr in else_body { else_return = codegen_expr(context, builder, func, names, expr); } llvm::core::LLVMBuildBr(builder, merge_block); // Position the builder so that it's ready to work on the next // expression. llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block); zero } } } 

A lot of code, but it does what we expected. Now we can run this program:

 a = 1 if a { a = 42 } else { a = 13 } a 

which generates such an IR:

 ; ModuleID = 'example_module' define i64 @main() { entry: %a = alloca i64 store i64 1, i64* %a %a1 = load i64, i64* %a %is_nonzero = icmp ne i64 %a1, 0 br i1 %is_nonzero, label %entry2, label %entry3 entry2: ; preds = %entry store i64 42, i64* %a br label %entry4 entry3: ; preds = %entry store i64 13, i64* %a br label %entry4 entry4: ; preds = %entry3, %entry2 %a5 = load i64, i64* %a ret i64 %a5 } 

However, we have not finished yet. Now our "expression" if is always zero. Instead, if should be equal to then_return if the “ then ” path is executed, or else_return otherwise.

How to make LLVM track which path was taken? Using the “ Phi ” node. You give the phi instructions a list of pairs (block, value), and then the phi node will return the value corresponding to the block that was executed before it.

This concludes with if . Note that we need to update then_block and else_block . This is done in order to get the last block in the “ then ” / ” else ” structure, and earlier then_block was the first block in the “ then ” / ” else ".

 // ... // This is mostly the same code as before, just note the new calls to // `LLVMGetInsertBlock`. llvm::core::LLVMPositionBuilderAtEnd(builder, then_block); let mut then_return = zero; for expr in then_body { then_return = codegen_expr(context, builder, func, names, expr); } llvm::core::LLVMBuildBr(builder, merge_block); let then_block = llvm::core::LLVMGetInsertBlock(builder); llvm::core::LLVMPositionBuilderAtEnd(builder, else_block); let mut else_return = zero; for expr in else_body { else_return = codegen_expr(context, builder, func, names, expr); } llvm::core::LLVMBuildBr(builder, merge_block); let else_block = llvm::core::LLVMGetInsertBlock(builder); // Insert the phi node llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block); let phi_name = CString::new("iftmp").unwrap(); let phi = llvm::core::LLVMBuildPhi(builder, int_type, phi_name.as_ptr()); let mut values = vec![then_return, else_return]; let mut blocks = vec![then_block, else_block]; llvm::core::LLVMAddIncoming(phi, values.as_mut_ptr(), blocks.as_mut_ptr(), 2); phi 

And here he is, an amazing compiler:

 vagrant@vagrant:/vagrant$ cat in.ex a = 1 b = 0 c = if a { if b { 11 } else { 40 } } else { if b { 10 } else { 20 } } c + 2 vagrant@vagrant:/vagrant$ cargo run Running `target/debug/main` vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $? 42 

Cool! Here is the code that is generated for this example input program:

 ; ModuleID = 'example_module' define i64 @main() { entry: %a = alloca i64 %b = alloca i64 %c = alloca i64 store i64 1, i64* %a store i64 0, i64* %b %a1 = load i64, i64* %a %is_nonzero = icmp ne i64 %a1, 0 br i1 %is_nonzero, label %entry2, label %entry3 entry2: ; preds = %entry %b5 = load i64, i64* %b %is_nonzero6 = icmp ne i64 %b5, 0 br i1 %is_nonzero6, label %entry7, label %entry8 entry3: ; preds = %entry %b10 = load i64, i64* %b %is_nonzero11 = icmp ne i64 %b10, 0 br i1 %is_nonzero11, label %entry12, label %entry13 entry4: ; preds = %entry14, %entry9 %iftmp16 = phi i64 [ %iftmp, %entry9 ], [ %iftmp15, %entry14 ] store i64 %iftmp16, i64* %c %c17 = load i64, i64* %c %addtmp = add i64 %c17, 2 ret i64 %addtmp entry7: ; preds = %entry2 br label %entry9 entry8: ; preds = %entry2 br label %entry9 entry9: ; preds = %entry8, %entry7 %iftmp = phi i64 [ 11, %entry7 ], [ 40, %entry8 ] br label %entry4 entry12: ; preds = %entry3 br label %entry14 entry13: ; preds = %entry3 br label %entry14 entry14: ; preds = %entry13, %entry12 %iftmp15 = phi i64 [ 10, %entry12 ], [ 20, %entry13 ] br label %entry4 } 

Note the pattern that the blocks form: with the exception of the entry block, they form groups of three, where the “then” branch goes first, then the “else” branch, then the “merge” block (which can be recognized by the phi instruction). This is a consequence of the fact that every time we find an “if” expression, we add three new blocks to main. The triples of blocks are arranged in the order in which the AST tree is traversed.

That's all! I hope that now you have a sufficient basis to act independently.

Source: https://habr.com/ru/post/338420/


All Articles