
This article is about doing research with the LLVM compiler infrastructure. It is aimed at researchers who have so far been mostly indifferent to compilers, in the hope that they will discover the joys of LLVM and do something interesting with it.
What is LLVM?
LLVM is a genuinely convenient compiler framework for taking apart and reassembling programs written in traditional languages such as C and C++.
LLVM is so good that people like to call it "more than just a compiler" (it is also a JIT, it handles languages far outside the C family, it is the
new delivery format for the App Store, and so on). All of that is true, but for this article only the definition above matters.
LLVM has several key differences from other compilers:
- The main attraction is its intermediate representation (IR). LLVM works with a program representation you can actually read (provided you can read assembly code). That may not sound like a revelation, but it matters a lot: the IRs of other compilers usually have such a complicated in-memory structure that you cannot write them by hand, and they are hard to understand and use.
- LLVM is elegantly written: its architecture is far more modular than that of other compilers. One reason for this elegance is that its original author was one of us, an academic.
- LLVM is not only the research tool of choice for academic hackers like us; it is also an industrial-strength compiler backed by the largest company on the planet. That means you don't have to compromise between a great compiler and a hackable one (as happens in the Java world when choosing between HotSpot and Jikes).
Why would a researcher want LLVM?
LLVM is a great tool. But why should you care if your research is not about compilers?
A compiler infrastructure lets you do many interesting things to programs. For example, you can analyze a program to find out how often it performs certain actions. You can transform it to work better on a particular system. You can also modify it to see how it would behave on a hypothetical new architecture or operating system, before the chip has been fabricated or the kernel module written. Compiler infrastructure is useful to researchers far more often than many people think. I suggest reaching for LLVM first, before you hack on any of the following tools (unless you have a specific reason to):
- an architectural simulator;
- a dynamic binary instrumentation tool, such as Pin;
- source-to-source transformation (from simple tools such as sed to sophisticated toolkits that parse and serialize the AST);
- hacking the kernel to intercept system calls;
- any tool that looks like a hypervisor.
Even when a compiler does not look like the perfect fit for your task, it can often get you 90% of the way there, for example via source-to-source translation.
Here are some good examples of research projects that don't look much like compiler work:
- Virtual Ghost, from the University of Illinois at Urbana-Champaign (USA), shows how a compiler pass can protect processes from a compromised OS kernel.
- CoreDet, from the University of Washington (USA), makes multithreaded programs deterministic.
- In approximate computing, we use an LLVM pass to inject errors into programs to simulate failure-prone hardware.
Once again: LLVM is not just for implementing new compiler optimizations.
Details
The figure below shows the main components of the LLVM architecture (and the general architecture of any modern compiler):

- The frontend parses source code and turns it into an intermediate representation (IR). This simplifies life for the rest of the compiler, which would rather not grapple with the full complexity of C++ source. As a fearless researcher, you will probably not need to change anything in this part, so you can use Clang unmodified.
- Passes transform one IR into another. Under normal circumstances, passes optimize the code: the IR they emit does the same thing as the IR they were given, only faster. This is where you want to hack. Your pass can, for example, read and modify the IR as it flows through the compiler.
- The backend generates actual machine code. You will most likely not need to change anything in this part of the system.
- This architecture matches that of most modern compilers, but note one LLVM innovation: unlike compilers where each stage produces its own unique form of the program, LLVM uses one and the same IR throughout the process. That is ideal for us hackers: we don't need to worry about where exactly in the pipeline our code runs, as long as it's somewhere between the frontend and the backend.
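To make the pipeline concrete, here is a rough sketch of the three stages driven by hand from the command line. It assumes the LLVM tools (clang, opt, llc) are on your PATH; note that recent LLVM releases may want the newer -passes=mem2reg spelling for opt:

```
# Frontend: parse C and emit textual LLVM IR.
$ clang -O0 -emit-llvm -S something.c -o something.ll

# Middle end: run an IR-to-IR pass (here, the standard mem2reg pass).
$ opt -S -mem2reg something.ll -o something_opt.ll

# Backend: lower the IR to native assembly.
$ llc something_opt.ll -o something.s
```

Each stage reads and writes the same IR, which is exactly why a custom pass can slot in anywhere in the middle.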
Getting set up
So let's hack on something.
Install LLVM
First you need to install LLVM. Linux distributions often include LLVM and Clang packages that are ready to use. Make sure the version you get includes all the headers needed to build programs against the compiler; the OS X build that ships with Xcode, for example, is not complete enough. Fortunately, it is not hard to build LLVM from source using CMake. Usually you only need to build LLVM itself: the Clang that comes with your OS will do fine as long as the versions match (though there are
instructions for building Clang too).
In particular, Brandon Holt has written a
good guide for OS X, and there is also a recipe
for the Homebrew system.
Read the documentation
You need to carefully review the documentation. In my opinion, the following materials will be especially useful:
- A huge amount of important information lives in the automatically generated Doxygen pages. To get anywhere hacking programs with LLVM, you will have to spend a long time with these API documents. Since navigating the pages directly is painful, I recommend Googling instead: appending "LLVM" to the name of any function or class usually surfaces the right Doxygen page (and with practice you can even train Google to find compiler information without typing "LLVM"!). It sounds ridiculous, but you really do need to jump through these hoops around the LLVM API documentation to survive. If there is a more convenient way to navigate the API docs, I have not heard of it.
- The language reference manual is useful whenever you are confused by the syntax of the IR.
- The programmer's manual describes the data structures specific to LLVM (efficient strings, STL alternatives for maps and vectors, etc.), along with the type-inspection utilities (isa, cast, and dyn_cast) that you will use everywhere.
- Refer to the Writing an LLVM Pass guide when you have questions about what a single pass can do. Since you are a researcher rather than just a compiler engineer, I will note that I disagree with some parts of that tutorial. (Above all, ignore the Makefile-based build instructions and go straight to the out-of-tree build instructions.) Still, it is the canonical source of information about passes.
- The GitHub mirror is sometimes convenient for browsing the LLVM source code on the web.
Writing a pass
Research with LLVM usually means writing a custom pass. This section walks through building and running a simple pass that transforms programs on the fly.
The skeleton
I have put together a template repository containing a useless LLVM pass. I recommend starting from the template, because setting up the build configuration from scratch can be painful.
Clone the llvm-pass-skeleton repository from GitHub:

```
$ git clone git@github.com:sampsyo/llvm-pass-skeleton.git
```
The real work happens in skeleton/Skeleton.cpp, so open that file. Here is where the action is:
```cpp
virtual bool runOnFunction(Function &F) {
  errs() << "I saw a function called " << F.getName() << "!\n";
  return false;
}
```
There are several kinds of LLVM pass. We use one of them, the
function pass (a good fit for beginners). As you would expect, LLVM invokes the method above for every function it finds in the program being compiled. For now, the method just prints the function's name.
Details:
- errs() is a C++ output stream provided by LLVM. We use it to print to the console.
- The function returns false to indicate that it did not modify F. Later, when we actually change the IR, we will need to return true instead.
Building
Build the pass with
CMake:

```
$ cd llvm-pass-skeleton
$ mkdir build
$ cd build
$ cmake ..  # Generate the Makefile.
$ make      # Actually build the pass.
```
If LLVM is not installed globally, you will need to tell CMake where to find it: set the LLVM_DIR environment variable to LLVM's share/llvm/cmake/ directory. Here is an example path for a Homebrew installation:

```
$ LLVM_DIR=/usr/local/opt/llvm/share/llvm/cmake cmake ..
```
The build produces a shared library. You can find it at build/skeleton/libSkeletonPass.so, or a similarly named file depending on your platform. In the next step, we will load this library to run the pass on real program code.
Running the pass
To run the pass, compile a C program with flags telling Clang to load your freshly built library:

```
$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.* something.c
I saw a function called main!
```
That -Xclang -load -Xclang path/to/lib.so dance is all it takes to
load and activate the pass in Clang. So, when working with large projects, you can simply add those arguments to the CFLAGS variable in a Makefile or your build system's equivalent.
(You can also run passes on their own, separately from Clang, using LLVM's opt command; that is the way the
official documentation recommends, but I won't describe it in this article.)
Congratulations, you have just hacked a compiler! In the next steps, we will extend this "Hello, world"-level pass into something more interesting.
The structure of LLVM's intermediate representation
To work with programs in LLVM, it helps to know how the IR is organized.
Modules contain Functions, which in turn contain BasicBlocks, which consist of Instructions. Every class except Module descends from Value.
Containers
Here is an overview of the most important components of an LLVM program:
- A Module is, roughly speaking, a source file, or, pedantically, a translation unit. It contains all the other entities.
- Mostly, Modules contain Functions, which are exactly what they sound like: named chunks of executable code. (In C++, both functions and methods correspond to LLVM Functions.)
- Besides declaring its name and arguments, a Function is mainly a container of BasicBlocks. The basic block is a familiar concept from compiler theory, but for our purposes you can think of it simply as a contiguous run of instructions.
- An Instruction, in turn, is a single code operation, at roughly the level of abstraction of RISC machine code: an instruction might be an integer addition, a floating-point division, or a store to memory, for example.
Most LLVM entities (Functions, BasicBlocks, and Instructions) are C++ classes derived from the omnipresent Value base class. A Value is any piece of data that can be used in a computation, such as a number or the address of some code; global variables and constants (also known as literals or immediates, like 5) are Values too.
Instruction
Here is an example instruction in the human-readable text form of LLVM IR:

```
%5 = add i32 %4, 2
```

This instruction adds two 32-bit integers (implied by the i32 type). It adds the number in register %4 and the literal 2 (written as, well, 2) and writes the result to register %5. This is what I mean when I say LLVM IR looks like idealized RISC machine code: we even borrow the same terminology, like "register", except that we have infinitely many registers.
Inside the compiler, the same instruction is represented as an instance of the C++
Instruction class. The object has an opcode indicating that it is an addition, a type, and a list of operands that are pointers to other Value objects. In our case, it points to a
Constant object representing the number 2 and to the
Instruction object corresponding to the register %4. (Since LLVM IR is in
static single assignment (SSA) form, registers and Instructions are really one and the same; the register numbers are an artifact of the text representation.)
By the way, if you want to see the LLVM IR for your own program, you can ask Clang to print it:

```
$ clang -emit-llvm -S -o - something.c
```
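For instance, take a toy function like int add2(int x) { return x + 2; }. At -O0, Clang emits IR along these lines (register names, alignment, and attribute details vary between Clang versions, so treat this as an approximation):

```
define i32 @add2(i32 %x) {
entry:
  %x.addr = alloca i32, align 4
  store i32 %x, i32* %x.addr, align 4
  %0 = load i32, i32* %x.addr, align 4
  %add = add nsw i32 %0, 2
  ret i32 %add
}
```

You can see the add i32 instruction from above in the middle; the alloca/store/load boilerplate disappears once optimizations such as mem2reg run.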
Inspecting the IR
Let's go back to the LLVM pass we were working on. All the important IR objects have a convenient dump() method that prints a human-readable representation. Since our pass receives a Function object for each function being processed, we can walk through the function's basic blocks and then through each block's instructions.
Here is code that does exactly that, available in
the containers branch of the llvm-pass-skeleton repository:

```cpp
errs() << "Function body:\n";
F.dump();

for (auto &B : F) {
  errs() << "Basic block:\n";
  B.dump();

  for (auto &I : B) {
    errs() << "Instruction: ";
    I.dump();
  }
}
```
Thanks to C++11's fashionable auto and range-based for loops, navigating the LLVM IR hierarchy is pleasant.
If you rebuild the pass and run it again, you will see the various IR entities printed in traversal order.
Using a pass to solve harder problems
The real magic comes in when you look for patterns in the program and, optionally, change the code when you find them. Here is a simple example: suppose we want to replace the first binary operator ("+", "-", etc.) in every function with a multiply. Sounds useful, right?
Here is the code that does it. This version, along with a sample program to try it on, is available in the
mutate branch of the llvm-pass-skeleton repository:

```cpp
for (auto &B : F) {
  for (auto &I : B) {
    if (auto *op = dyn_cast<BinaryOperator>(&I)) {
      // Insert at the point where the instruction `op` appears.
      IRBuilder<> builder(op);

      // Make a multiply with the same operands as `op`.
      Value *lhs = op->getOperand(0);
      Value *rhs = op->getOperand(1);
      Value *mul = builder.CreateMul(lhs, rhs);

      // Everywhere the old instruction was used as an operand,
      // use our new multiply instruction instead.
      for (auto &U : op->uses()) {
        User *user = U.getUser();  // A User is anything with operands.
        user->setOperand(U.getOperandNo(), mul);
      }

      // We modified the code.
      return true;
    }
  }
}
```
Details:
- The dyn_cast<T>(p) construct is an LLVM-specific dynamic type cast. It uses some carefully designed LLVM conventions to make dynamic type checks fast, since compilers use them all the time. It returns a null pointer if I is not a BinaryOperator, which makes it perfect for handling a special case like ours.
- IRBuilder is designed to build code. It provides a million methods for creating any instruction you wish.
- To splice our new instruction into the code, we find every place the old instruction was used and insert the new one there as an operand. Recall that an Instruction is also a Value: here, the multiplication instruction becomes an operand of other instructions, meaning its result feeds their arguments.
- We should also delete the old instruction, but I have omitted that step to keep the description simple.
Now we can compile a program (
example.c in the repository):

```c
#include <stdio.h>

int main(int argc, const char **argv) {
  int num;
  scanf("%i", &num);
  printf("%i\n", num + 2);
  return 0;
}
```
An ordinary compiler produces code with the expected behavior; with our pass loaded, the program multiplies by two instead of adding two:

```
$ cc example.c
$ ./a.out
10
12
$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.so example.c
$ ./a.out
10
20
```
Magic!
Linking with a runtime library
When you need your modified code to do something nontrivial, generating the necessary instructions with
IRBuilder can take a lot of effort. Instead, you can implement the desired behavior in C and link it into the program being compiled. This section shows how to write a runtime library that logs the results of binary operators instead of silently changing them.
The code for this pass lives in the
rtlib branch of the llvm-pass-skeleton repository.
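As a rough sketch of what sits inside that pass's runOnFunction (consult the rtlib branch for the authoritative version; in particular, the getOrInsertFunction signature has changed across LLVM releases, so the exact calls below approximate the older API used in this article):

```cpp
// Declare `void logop(int)` in the module -- it matches rtlib.c's definition.
LLVMContext &Ctx = F.getContext();
Constant *logFunc = F.getParent()->getOrInsertFunction(
    "logop", Type::getVoidTy(Ctx), Type::getInt32Ty(Ctx), NULL);

for (auto &B : F) {
  for (auto &I : B) {
    if (auto *op = dyn_cast<BinaryOperator>(&I)) {
      // Insert the call *after* `op`, so its result is available.
      IRBuilder<> builder(op);
      builder.SetInsertPoint(&B, ++builder.GetInsertPoint());

      // Pass the operator's result to logop.
      Value *args[] = {op};
      builder.CreateCall(logFunc, args);
      return true;  // we changed the IR
    }
  }
}
return false;
```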
The essential tools here are
Module::getOrInsertFunction and
IRBuilder::CreateCall. The former adds a declaration of the logop function, much as if the C code contained the declaration void logop(int i); without a body. That declaration matches up with the definition of logop in the runtime library (
rtlib.c in the repository):
```c
#include <stdio.h>

void logop(int i) {
  printf("computed: %i\n", i);
}
```
To run the instrumented program, link it with your runtime library:

```
$ cc -c rtlib.c
$ clang -Xclang -load -Xclang build/skeleton/libSkeletonPass.so -c example.c
$ cc example.o rtlib.o
$ ./a.out
12
computed: 14
14
```
If you prefer, you can instead link the program with the library at the IR level, before compiling to machine code. The
llvm-link utility, which you can think of as the rough IR-level equivalent of
ld, will help you there.
Annotations
Most research projects eventually need to interact with the programmer. For this, you want annotations: a way to convey extra information from the source program to your LLVM pass. There are several ways to build annotation systems:
- The practical, hacky method is to use "magic" functions. Declare some functions with empty bodies and special, hopefully unique, names in a header file. Include that header in your sources and add calls to the functions. Then, in your pass, look for CallInst instructions that invoke those functions and use them to trigger your "magic". For example, you might use calls to __enable_instrumentation() and __disable_instrumentation() to let the program confine your code changes to specific regions.
- If you want to let programmers add markers to function or variable declarations, Clang's __attribute__((annotate("foo"))) construct attaches metadata with the given string, which you can process in your pass. Brandon Holt has written a post on this technique. If you need to mark expressions rather than declarations, the __builtin_annotation(e, "foo") intrinsic might work, though it is undocumented and limited.
- You can go all the way and modify Clang itself to interpret your new syntax. I don't recommend this.
- If you need to annotate types (which, in my view, people need surprisingly often without realizing it), I am developing a system called Quala. It adds support for custom type qualifiers and pluggable type systems to Clang, in the spirit of JSR-308 for Java. Let me know if you are interested in collaborating on this project!
I hope to write about these techniques in more detail in future posts.
And much more…
There is much more to LLVM. Here are just a few topics not covered in this article:
- Using the wide range of classic compiler analyses available in LLVM's toolbox.
- Generating exotic machine instructions dreamed up by hardware architects, by hacking the backend.
- Using debug info to map a point in the IR back to the line and column in the source code.
- Writing frontend plugins for Clang.
I hope I have given you enough background to build something great. Go explore, create something, and
email me if this article was helpful!
________________________________________
My thanks to the members of the University of Washington
architecture and
systems groups, who sat through a talk version of this article and asked many startlingly useful questions.
Additions from my dear readers:
- Emery Berger noted that dynamic binary instrumentation tools such as Pin are still the right choice if you need to observe architecture-specific details (registers, the memory hierarchy, instruction encodings, etc.).
- Brandon Holt has just published LLVM debugging tips, including drawing control-flow graphs with GraphViz.
- John Regehr commented on the downside of depending on LLVM in your project: the API is constantly changing. LLVM's internals change significantly from release to release, so keeping a project alive means keeping up with those changes.
- Alex Bradbury publishes the LLVM Weekly newsletter, an excellent resource for tracking what is happening in the LLVM ecosystem.
Translated by ABBYY Language Services