
LLVM for researchers

This article is about doing research with the LLVM compiler infrastructure. It should be enough to let researchers who were previously indifferent to compilers come to appreciate LLVM and do something interesting with it.

What is LLVM?


LLVM is a compiler: a genuinely convenient one that is easy to take apart and reassemble, for traditional languages such as C and C++.

LLVM is so good that it is often described as “more than just a compiler” (it is a dynamic compiler, it works with languages outside the C family, it is a new delivery format for the App Store, and so on). All of that is true, but for this article only the definition above matters.
LLVM has several key differences from other compilers:



Why would a researcher need LLVM?



LLVM is a great tool. But why should you care if your research is not about compilers?

Compiler infrastructure lets you do a lot of interesting things with programs. For example, you can analyze a program to see how often it performs certain actions. You can transform a program so that it works better on a specific system. You can also change a program to imitate how it would use a hypothetical new architecture or operating system for which no chip has been fabricated and no kernel module written yet. Compiler infrastructure is useful to researchers far more often than most people think. I recommend reaching for LLVM first, before you try hacking on one of these tools (unless you have a specific reason to):


Even if a compiler does not look like an ideal fit for your task, it can often make 90% of the work easier, for example when translating one source language into another.

Here are some good examples of research projects that do not look much like compiler work:



To repeat: LLVM is not only for developing new compiler optimizations.

Details



The figure below shows the main components of the LLVM architecture (and the general architecture of any modern compiler):

image



Preparation for work


So let's start hacking.

Install LLVM


First you need to install LLVM. Linux distributions often ship LLVM and Clang packages that are ready to use. Make sure that the version you get includes all the headers needed to build your own tools against the compiler. The OS X build that comes with Xcode, for example, is not complete enough. Fortunately, it is not hard to build LLVM from source using CMake. Usually you only need to build LLVM itself: the Clang that comes with your system will do fine as long as the versions match (although there are also instructions for building Clang).

For OS X in particular, Brandon Holt has written good instructions, and there is also a Homebrew recipe.

Study the documentation


You need to carefully review the documentation. In my opinion, the following materials will be especially useful:



Writing a pass


Research with LLVM usually means writing a new pass. This section walks through building and running a simple pass that transforms programs on the fly.

"Skeleton"


I created a template repository that contains one useless LLVM pass. I recommend starting from the template, because setting up the build configuration from scratch can be painful.
Clone the llvm-pass-skeleton repository from GitHub:

$ git clone git@github.com:sampsyo/llvm-pass-skeleton.git 


The real work is done in skeleton/Skeleton.cpp, so open that file. This is where the action happens:

 virtual bool runOnFunction(Function &F) {
   errs() << "I saw a function called " << F.getName() << "!\n";
   return false;
 }


There are several kinds of LLVM pass. We are using one of them, the function pass, which is a good place to start. As you would expect, LLVM invokes the method above for every function it finds in the program being compiled. For now, the method only prints the function's name.

Details:


Build


Build the pass using CMake:
 $ cd llvm-pass-skeleton
 $ mkdir build
 $ cd build
 $ cmake ..  # Generate the Makefile.
 $ make  # Actually build the pass.


If LLVM is not installed globally, you need to tell CMake where to find it. To do so, set the LLVM_DIR environment variable to the path of the share/llvm/cmake/ directory inside your LLVM installation. Here is an example with the path for Homebrew:

 $ LLVM_DIR=/usr/local/opt/llvm/share/llvm/cmake cmake .. 


The build produces a shared library. You will find it at build/skeleton/libSkeletonPass.so, or under a similar name depending on your platform. In the next step, we will load this library to run the pass on some real code.

Run


To run the pass, compile some C program with flags that tell the compiler to load the library you just built:

 $ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.* something.c
 I saw a function called main!


This -Xclang -load -Xclang path/to/lib.so dance is all you need to load and activate your pass in Clang. So when working with larger projects, you can simply add these arguments to the Makefile's CFLAGS variable or the equivalent in your build system.

(You can also run passes on their own, without invoking clang. That approach, which uses LLVM's opt command, is the one recommended in the official documentation, but I will not describe it in this article.)

Congratulations, you have just hacked a compiler! In the next steps, we will see how to extend our hello-world pass to do more interesting things.

Understanding LLVM's intermediate representation


To work with programs in LLVM, you need to know a little about how the intermediate representation (IR) is organized.

image

Modules (Module) contain functions (Function), which contain basic blocks (BasicBlock), which in turn contain instructions (Instruction). All of these classes except Module derive from Value.


Containers


Here is an overview of the most important components of an LLVM program:
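
To make the nesting concrete, here is a minimal sketch of the same containment hierarchy in plain C++. The Toy* types and the two helper functions are hypothetical stand-ins that I made up for illustration, not the real LLVM classes, and the instruction strings are just sample text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the real containers (hypothetical names, not LLVM classes).
struct ToyInst     { std::string text; };
struct ToyBlock    { std::vector<ToyInst> insts; };
struct ToyFunction { std::string name; std::vector<ToyBlock> blocks; };
struct ToyModule   { std::vector<ToyFunction> functions; };

// Walk module -> function -> basic block and count the instructions found.
std::size_t countInstructions(const ToyModule& module) {
    std::size_t count = 0;
    for (const auto& function : module.functions)
        for (const auto& block : function.blocks)
            count += block.insts.size();
    return count;
}

// Build a module with one function, one basic block, and three instructions.
ToyModule makeExampleModule() {
    ToyBlock entry;
    entry.insts.push_back(ToyInst{"%4 = load i32, i32* %2"});
    entry.insts.push_back(ToyInst{"%5 = add i32 %4, 2"});
    entry.insts.push_back(ToyInst{"ret i32 0"});

    ToyFunction mainFn;
    mainFn.name = "main";
    mainFn.blocks.push_back(entry);

    ToyModule module;
    module.functions.push_back(mainFn);
    return module;
}
```

Calling countInstructions(makeExampleModule()) visits every level once. A real pass walks the real containers with the same kind of nested loops.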



Most things in LLVM (including functions, basic blocks, and instructions) are C++ classes that inherit from an omnipresent base class called Value. A Value is any data that can be used in a computation, for example a number or the address of some code. Global variables and constants (also known as literals or immediates, such as 5) are Values too.

Instruction


Here is an example of an instruction in the human-readable text form of LLVM IR:

 %5 = add i32 %4, 2 


This instruction adds two 32-bit integers (indicated by the type i32). It adds the number in register 4 (written %4) and the literal number 2 (written simply as 2) and places the result in register 5. This is what I mean when I say that LLVM IR looks like idealized RISC machine code: we even use the same terminology, such as register, except that there are infinitely many registers.

Inside the compiler, the same instruction is represented as an instance of the C++ class Instruction. The object has an opcode indicating that it is an addition, a type, and a list of operands that are pointers to other Value objects. In our case, it points to a Constant object representing the number 2 and to another Instruction object corresponding to register %4. (Because LLVM IR is in static single assignment (SSA) form, registers and instructions are actually one and the same. Register numbers are an artifact of the text representation.)

By the way, if you want to see the LLVM IR for your own program, you can ask Clang for it:
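
The idea that an operand is simply a pointer to another value can be sketched in a few lines of plain C++. Everything here is hypothetical and much simpler than LLVM: the Sketch* classes are my own stand-ins, only add is modelled, and I invented a definition for %4 (add i32 4, 6) so that the example has something to evaluate.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of the Value hierarchy; not the real LLVM classes.
struct SketchValue {
    virtual ~SketchValue() = default;
};

// A literal such as the `2` in `add i32 %4, 2`.
struct SketchConstant : SketchValue {
    int value;
    explicit SketchConstant(int v) : value(v) {}
};

// An instruction is itself a value: other instructions point straight at it.
struct SketchInstruction : SketchValue {
    std::string opcode;
    std::vector<const SketchValue*> operands;
    SketchInstruction(std::string op, std::vector<const SketchValue*> ops)
        : opcode(std::move(op)), operands(std::move(ops)) {}
};

// Evaluate a value by following operand pointers (only `add` is modelled).
int evaluate(const SketchValue* v) {
    if (const auto* c = dynamic_cast<const SketchConstant*>(v))
        return c->value;
    const auto* inst = dynamic_cast<const SketchInstruction*>(v);
    assert(inst && inst->opcode == "add" && inst->operands.size() == 2);
    return evaluate(inst->operands[0]) + evaluate(inst->operands[1]);
}

// Models `%4 = add i32 4, 6` (invented) followed by `%5 = add i32 %4, 2`.
int evaluateExample() {
    SketchConstant four(4), six(6), two(2);
    SketchInstruction reg4("add", {&four, &six});
    SketchInstruction reg5("add", {&reg4, &two});  // operand 0 *is* reg4
    return evaluate(&reg5);
}
```

Note that there is no register table anywhere in the sketch. The "register" %4 exists only as the object reg4 that %5 points to, which is the sense in which registers and instructions are the same thing.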

 $ clang -emit-llvm -S -o - something.c 


Inspecting the IR in our pass


Let's go back to the LLVM pass we were working on. We can inspect all the important IR objects using a convenient common method, dump(), which prints a human-readable representation of an object in the IR. Since our pass is handed a Function object for each function it processes, we can walk over the function's basic blocks and then over the instructions of each block.

Here is the code that does it. You can get it from the containers branch of the llvm-pass-skeleton repository:

 errs() << "Function body:\n";
 F.dump();
 for (auto& B : F) {
   errs() << "Basic block:\n";
   B.dump();
   for (auto& I : B) {
     errs() << "Instruction: ";
     I.dump();
   }
 }


The auto type and range-based for loops from C++11 make it easy to traverse the hierarchy of LLVM IR.
If you rebuild the pass and run it, the output shows the various LLVM entities in the order they are traversed.

Making the pass do something more interesting


The real magic happens when you look for patterns in the program and change the code once you find them. Consider a simple example. Suppose you need to replace the first binary operator ("+", "-", and so on) in every function with a multiplication. Sounds useful, right?

Here is the code that does it. This version, along with a sample program to try it on, is available in the mutate branch of the llvm-pass-skeleton git repository:

 for (auto& B : F) {
   for (auto& I : B) {
     if (auto* op = dyn_cast<BinaryOperator>(&I)) {
       // Insert at the point where the instruction `op` appears.
       IRBuilder<> builder(op);

       // Make a multiply with the same operands as `op`.
       Value* lhs = op->getOperand(0);
       Value* rhs = op->getOperand(1);
       Value* mul = builder.CreateMul(lhs, rhs);

       // Everywhere the old instruction was used as an operand, use our
       // new multiply instruction instead.
       for (auto& U : op->uses()) {
         User* user = U.getUser();  // A User is anything with operands.
         user->setOperand(U.getOperandNo(), mul);
       }

       // We modified the code.
       return true;
     }
   }
 }


Details:



Now we can compile a program like this one (example.c in the repository):

 #include <stdio.h>
 int main(int argc, const char** argv) {
   int num;
   scanf("%i", &num);
   printf("%i\n", num + 2);
   return 0;
 }


An ordinary compiler produces code that behaves as expected, while the code produced with our pass multiplies by two instead of adding two:

 $ cc example.c
 $ ./a.out
 10
 12
 $ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.so example.c
 $ ./a.out
 10
 20


Magic!
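
If the use-rewriting loop in the pass feels opaque, the same trick can be modelled without LLVM at all. The following is only a rough sketch: MiniNode, replaceUses, evaluateNode and runMutation are hypothetical names of my own, and the "user" node (which adds zero) merely stands in for whatever consumes the result in the real program, such as the printf call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical expression node; `op` is '+', '*', or 'k' for a constant.
struct MiniNode {
    char op;
    int constant;                     // meaningful only when op == 'k'
    std::vector<MiniNode*> operands;  // meaningful only for '+' and '*'
};

// Point every operand slot that refers to `oldNode` at `newNode` instead.
// Returns the number of slots that were rewritten.
std::size_t replaceUses(std::vector<MiniNode*>& users,
                        MiniNode* oldNode, MiniNode* newNode) {
    std::size_t rewritten = 0;
    for (MiniNode* user : users)
        for (MiniNode*& slot : user->operands)
            if (slot == oldNode) {
                slot = newNode;
                ++rewritten;
            }
    return rewritten;
}

// Recursively compute the value of a node.
int evaluateNode(const MiniNode* n) {
    if (n->op == 'k') return n->constant;
    assert(n->operands.size() == 2);
    int lhs = evaluateNode(n->operands[0]);
    int rhs = evaluateNode(n->operands[1]);
    return n->op == '*' ? lhs * rhs : lhs + rhs;
}

// Models `num + 2` being swapped for `num * 2`, as the pass does.
int runMutation(int input) {
    MiniNode num{'k', input, {}};
    MiniNode two{'k', 2, {}};
    MiniNode zero{'k', 0, {}};
    MiniNode add{'+', 0, {&num, &two}};
    MiniNode user{'+', 0, {&add, &zero}};  // stand-in for the consumer of `add`
    assert(evaluateNode(&user) == input + 2);

    MiniNode mul{'*', 0, add.operands};    // same operands, new opcode
    std::vector<MiniNode*> users{&user};
    replaceUses(users, &add, &mul);
    return evaluateNode(&user);
}
```

The old add node is never deleted here. It simply has no users left, which is also the state the pass above leaves the original instruction in.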

Linking with a runtime library


If you need to change the code so that it does something non-trivial, generating the necessary instructions with IRBuilder can take a lot of effort. Instead, you can implement the desired behavior in C and link it with the program you are compiling. This section shows how to write a runtime library that logs the results of binary operators instead of silently changing them.

Here is the pass code that does this, taken from the rtlib branch of the llvm-pass-skeleton repository:
 // Get the function to call from our runtime library.
 LLVMContext& Ctx = F.getContext();
 Constant* logFunc = F.getParent()->getOrInsertFunction(
   "logop", Type::getVoidTy(Ctx), Type::getInt32Ty(Ctx), NULL
 );

 for (auto& B : F) {
   for (auto& I : B) {
     if (auto* op = dyn_cast<BinaryOperator>(&I)) {
       // Insert *after* `op`.
       IRBuilder<> builder(op);
       builder.SetInsertPoint(&B, ++builder.GetInsertPoint());

       // Insert a call to our function.
       Value* args[] = {op};
       builder.CreateCall(logFunc, args);

       return true;
     }
   }
 }

The tools you need are Module::getOrInsertFunction and IRBuilder::CreateCall. The first adds a declaration for the function logop, as if the C code contained the declaration void logop(int i); with no body. This declaration matches the definition of logop in the runtime library (rtlib.c in the repository):

 #include <stdio.h>
 void logop(int i) {
   printf("computed: %i\n", i);
 }


To run the instrumented program, link it with your runtime library:

 $ cc -c rtlib.c
 $ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.so -c example.c
 $ cc example.o rtlib.o
 $ ./a.out
 12
 computed: 14
 14
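
To see why the output looks like that, it can help to write by hand what the instrumented program effectively does. The sketch below is my own approximation in plain C++, not something the pass generates: instrumentedMain is a hypothetical name, and this logop writes to a stream (unlike the one in rtlib.c) so that the result can be inspected.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for rtlib.c's logop, writing to a stream instead of stdout.
void logop(std::ostream& out, int i) {
    out << "computed: " << i << "\n";
}

// Roughly what main in example.c does once the pass has inserted the call.
std::string instrumentedMain(int num) {
    std::ostringstream out;
    int sum = num + 2;   // the original binary operator
    logop(out, sum);     // the call the pass inserts right after it
    out << sum << "\n";  // the original printf
    return out.str();
}
```

With an input of 12, the log line comes first because the call is inserted immediately after the addition, before the program's own printf runs.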


If you like, you can also link the program and the library before compiling to machine code. The llvm-link utility, which you can think of as a rough equivalent of ld at the IR level, can help with this.

Annotations


Most projects eventually need to interact with the programmer. For this you will want annotations: a way to convey extra information from the program to your LLVM pass. There are several ways to build an annotation system.


I hope to describe these techniques in more detail in future posts.

And much more…


LLVM has great potential. I will list only a few topics not covered in this article:

I hope I have given you enough background to build something worthwhile. Explore, create, and email me if the article was helpful!
________________________________________

My thanks go to the architecture and systems groups at the University of Washington, who attended the talk version of this article and asked many remarkably useful questions.

Addition from my dear readers:


Translated by ABBYY Language Services

Source: https://habr.com/ru/post/265871/