Compilation. 5 and 1/2: llvm as back-end

In a series of articles from tyomitch “Compilation” ( here , here , here , here , here and here ), the construction of a translator of the toy language jsk, described in 4 parts, was considered .
As a back-end for this compiler, tyomitch proposed a bytecode implementation and interpreter of this bytecode.

In my opinion, a more reasonable approach would be to use existing solutions for the backend, for example llvm, and following the principle of “Criticism without concrete proposals - criticism”, I propose an option for implementing this small jsk language with llvm.

What will it give for jsk? This compilation, that is, the result will be an executable file that does not depend on any runtime, the possibility of serious optimization, code profiling and automatically get documentation on the back-end (which will facilitate maintenance).

How does llvm work from a front-end developer's point of view?

For llvm to work, you need to create a context, using it to create a module. Each module can have several functions. Since jsk does not support functions, we will create one function - main ().
In each function there can be several blocks of code.

What is a code block

Often, programs create the need to transfer control forward to a label that has not yet been defined. libjit allows you to first create and use a label, and define it later, Gnu llightning allows you to write a transfer of control to the code, and later patch this code to substitute the desired address, and llvm uses a completely different approach - blocks of code.
')
That is, llvm API does not know such a thing as a label at all. If you need to transfer control, a link to the code block is passed to the command. For example for

if then else endif

In the LLVM API, you need to create blocks of code for the then branch, a block of code for the else branches, and in the conditional branch command specify these blocks. Something like

BasicBlock *ThenBB = BasicBlock::Create(context, "");
BasicBlock *ElseBB = BasicBlock::Create(context, "");
BasicBlock *AfterIfBB = BasicBlock::Create(context, "");
Builder.CreateCondBr(CondV, ThenBB, ElseBB);

ThenBB, ElseBB, AfterIfBB.

For the actual code generation, IRBuilder is used, which contains (roughly speaking) methods for generating all LLVM commands.

After all the code was created, what next?

It is up to us. We can simply print the received bytecode in a readable form using the dump () method of the llvm module, execute the code (that is, use llvm as the JIT compiler), or, best of all, save the bytecode in the file. This will further optimize this byte code and convert it into an executable file using standard llvm tools.

So,

Objectives are defined, tasks are set, comrades for the work (s)

I can’t put all the code here, if someone wants to play with the source code, the archive is posted here or here .
First of all, we need to include certain headers in the source code.

#include "llvm / DerivedTypes.h"
#include "llvm / LLVMContext.h"
#include "llvm / Module.h"
#include "llvm / Analysis / Verifier.h"
#include "llvm / Support / IRBuilder.h"
#include <llvm / Support / raw_ostream.h>
#include <llvm / Bitcode / ReaderWriter.h>
using namespace llvm ;

That should be enough. Next, we initialize the context, module, create a function main and IRBuilder to add code to the function.

In the declaration:

static Module * TheModule ;
static IRBuilder <> Builder ( getGlobalContext ( ) ) ;

and to the top of the code the following:

LLVMContext & Context = getGlobalContext ( ) ;
TheModule = new Module ( "jsk" , Context ) ;
const Type * voidType = Type :: getVoidTy ( Context ) ;

func_main = cast < Function > ( TheModule - > getOrInsertFunction ( "main" , voidType, NULL ) ) ;
func_main - > setCallingConv ( CallingConv :: C ) ;
BasicBlock * block = BasicBlock :: Create ( getGlobalContext ( ) , "code" , func_main ) ;
Builder. SetInsertPoint ( block ) ;

After generating the code, we generate ret void to exit our dummy function and drop the bytecode into the a.out.bc file.

Builder. CreateRetVoid ( ) ;

std :: string ErrStr ;
raw_fd_ostream bitcode ( "a.out.bc" , ErrStr, 0 ) ;
WriteBitcodeToFile ( TheModule, bitcode ) ;
bitcode. close ( ) ;

Now about generating code for the specific operation of our jsk language

Variables

For each variable, we reserve space on the stack using the instructions

%varname = alloca i32

Or, in terms of API

Builder.CreateAlloca(IntegerType::get(context,32), 0, VarName.c_str());

The problem is that we cannot allocate a place on the stack exactly where we met the variable for the first time. Because if a variable occurs in the middle of a while loop, we will fill in the entire stack with copies of the variable.

Therefore, we need to have a list of already defined variables, and when we encounter a variable in the text, check it, and if it met for the first time, place the memory allocation for it at the very top of the function code. That is, from the current function, take the first block, and add the alloca command to the top of this block. Luckily, llvm allows you to write commands not only at the end, but also at the beginning of a block. How this is done, you can see in the source text - the function CreateEntryBlockAlloca ();

Variable assignment

Value *result = value->emit();
Builder.CreateStore(result,varreference);

..

store i32 %result, i32* %varreference

Accordingly, get the value of the variable


return Builder.CreateLoad(varref);

%val = load i32* %varref

Binary and unary operations.

It's simple

switch ( op ) {
case '+' : return Builder. CreateAdd ( L, R ) ;
case '-' : return Builder. CreateSub ( L, R ) ;
case '*' : return Builder. CreateMul ( L, R ) ;
case '/' : return Builder. CreateSDiv ( L, R ) ;
case '<' : tmp = Builder. CreateICmpSLT ( L, R ) ; return builder. CreateZExt ( tmp, IntegerType :: get ( getGlobalContext ( ) , 32 ) ) ;
case '>' : tmp = Builder. CreateICmpSGT ( L, R ) ; return builder. CreateZExt ( tmp, IntegerType :: get ( getGlobalContext ( ) , 32 ) ) ;
case 'L' : tmp = Builder. CreateICmpSLE ( L, R ) ; return builder. CreateZExt ( tmp, IntegerType :: get ( getGlobalContext ( ) , 32 ) ) ;
case 'G' : tmp = Builder. CreateICmpSGE ( L, R ) ; return builder. CreateZExt ( tmp, IntegerType :: get ( getGlobalContext ( ) , 32 ) ) ;
case 'N' : tmp = Builder. CreateICmpNE ( L, R ) ; return builder. CreateZExt ( tmp, IntegerType :: get ( getGlobalContext ( ) , 32 ) ) ;
case '=' : tmp = Builder. CreateICmpEQ ( L, R ) ; return builder. CreateZExt ( tmp, IntegerType :: get ( getGlobalContext ( ) , 32 ) ) ;

default : return ErrorV ( "invalid binary operator" ) ;
}

IF and WHILE

As already described, we use code blocks. For example for IF:

Value * CondV = cond - > emit ( ) ;
// Compare condition expression with 0
CondV = Builder. CreateICmpNE ( CondV, zero, "ifcond" ) ;
// Code blocks for IF branches
BasicBlock * ThenBB = BasicBlock :: Create ( getGlobalContext ( ) , "thenblock" ) ;
BasicBlock * ElseBB = BasicBlock :: Create ( getGlobalContext ( ) , "elseblock" ) ;
BasicBlock * MergeBB = BasicBlock :: Create ( getGlobalContext ( ) , "afterifblock" ) ;
// Actually branching for IF
Builder. CreateCondBr ( CondV, ThenBB, ElseBB ) ;

// Next, add the code to the THEN branch
Builder. SetInsertPoint ( ThenBB ) ;
this - > thenops. emit ( ) ;
// At the end of THEN switch to "after IF"
Builder. CreateBr ( MergeBB ) ;

TheFunction - > getBasicBlockList ( ) . push_back ( ElseBB ) ;
// Next, add the code to the ELSE century
Builder. SetInsertPoint ( ElseBB ) ;
this - > elseops. emit ( ) ;
// At the end of an else transition to "After IF"
Builder. CreateBr ( MergeBB ) ;

TheFunction - > getBasicBlockList ( ) . push_back ( MergeBB ) ;
// After IF code add only to MergeBB
Builder. SetInsertPoint ( MergeBB ) ;
return zero ;

For while:

BasicBlock * CondBB = BasicBlock :: Create ( getGlobalContext ( ) , "whilexpr" , TheFunction ) ;
BasicBlock * LoopBB = BasicBlock :: Create ( getGlobalContext ( ) , "loop" ) ;
BasicBlock * AfterBB = BasicBlock :: Create ( getGlobalContext ( ) , "after" ) ;
Builder. CreateBr ( CondBB ) ;

// Condition block. Add the code here. We generate the condition code, Compare the result with 0
// And if the condition is not met - go to AfterBB
Builder. SetInsertPoint ( CondBB ) ;
Value * CondV = cond - > emit ( ) ;
if ( CondV == 0 ) return 0 ;
// Convert condition to compose equal to 0.
CondV = Builder. CreateICmpNE ( CondV, zero, "whilecond" ) ;
Builder. CreateCondBr ( CondV, LoopBB, AfterBB ) ;

// Block while body. We write the body code here, then the transition to the condition WHILE
TheFunction - > getBasicBlockList ( ) . push_back ( LoopBB ) ;
Builder. SetInsertPoint ( LoopBB ) ;
Value * Ops = this - > ops. emit ( ) ;
Builder. CreateBr ( CondBB ) ;

// Block "After WHILE ', At the moment, just say
// That all the following code should be written in this block
TheFunction - > getBasicBlockList ( ) . push_back ( AfterBB ) ;
Builder. SetInsertPoint ( AfterBB ) ;

Finally, we can try our compiler

#make
bison -d jsk.y
flex jsk.lex
g++ -O2 jsk.tab.c lex.yy.c `llvm-config --cxxflags --libs` -lrt -ldl -o jskc
jsk.tab.c: In function 'int yyparse()':
jsk.tab.c:2026: warning: deprecated conversion from string constant to 'char*'
jsk.tab.c:2141: warning: deprecated conversion from string constant to 'char*'
jsk.lex: In function 'int yylex()':
jsk.lex:34: warning: deprecated conversion from string constant to 'char*'
jsk.lex:35: warning: deprecated conversion from string constant to 'char*'
jsk.lex:39: warning: deprecated conversion from string constant to 'char*'

So. The compiler is ready. Let's try it on some test task. For example on

a=b=88;
b=b+1;
echo("test4=",a," ",b,"\n");

We perform:

./jskc <test3.jsk

And we get a.out.bc in the current directory. We can disassemble it:

llvm-dis <a.out.bc

; ModuleID = ''

@.format1 = internal constant [3 x i8] c"%d\00" ; <[3 x i8]*> [#uses=1]
@.format2 = internal constant [3 x i8] c"%s\00" ; <[3 x i8]*> [#uses=1]
@0 = internal constant [7 x i8] c"test4=\00" ; <[7 x i8]*> [#uses=1]
@1 = internal constant [2 x i8] c" \00" ; <[2 x i8]*> [#uses=1]
@2 = internal constant [2 x i8] c"\0A\00" ; <[2 x i8]*> [#uses=1]

define void @main() {
code:
%a = alloca i32 ; <i32*> [#uses=2]
%b = alloca i32 ; <i32*> [#uses=4]
%int_for_scanf___ = alloca i32 ; <i32*> [#uses=0]
store i32 88, i32* %b
store i32 88, i32* %a
%0 = load i32* %b ; [#uses=1]
%1 = add i32 %0, 1 ; [#uses=1]
store i32 %1, i32* %b
call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format2, i32 0, i32 0), i8* getelementptr inbounds ([7 x i8]* @0, i32 0, i32 0))
%2 = load i32* %a ; [#uses=1]
call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format1, i32 0, i32 0), i32 %2)
call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format2, i32 0, i32 0), i8* getelementptr inbounds ([2 x i8]* @1, i32 0, i32 0))
%3 = load i32* %b ; [#uses=1]
call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format1, i32 0, i32 0), i32 %3)
call void (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.format2, i32 0, i32 0), i8* getelementptr inbounds ([2 x i8]* @2, i32 0, i32 0))
ret void
}

declare void @printf(i8*, ...)

declare void @scanf(i8*, ...)

declare void @exit(i32)


lli , ( ) - :

$ llvmc a.out.bc

(!) a.out, , JSK :

$ ls -al ./a.out
-rwxr-xr-x 1 walrus walrus 4639 2010-08-25 16:49 ./a.out

$ file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped

$ ldd a.out
linux-gate.so.1 => (0x00762000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x00456000)
/lib/ld-linux.so.2 (0x00ee4000)


,
$ ./a.out
test4=88 89


.

Source: https://habr.com/ru/post/102597/

All Articles