
JIT compiler as an academic project at the Academic University

About sixteen years ago the first version of HotSpot was released: a JVM implementation that later became the standard virtual machine shipped with Sun's JRE.

The main distinguishing feature of this implementation was its JIT compiler, thanks to which claims about Java being slow became, in many cases, completely untenable.
By now almost all interpreted platforms, such as the CLR, Python, Ruby, Perl, and even the wonderful R programming language, have acquired their own JIT compiler implementations.

In this article I do not plan to shed light on little-known implementation details of industrial JIT compilers; rather, this will be a fairly superficial acquaintance with the basics and a story about a course project on the subject.



Benefits of JIT Compilation


To talk about the benefits, let's first understand what it is. Let's start from afar, namely with the definition of a compiled language.

A compiled programming language is one whose source code is converted by a compiler into machine code for a specific architecture, for example x86 or ARM. This machine code is a sequence of instructions fully understood by your processor, and it can be executed without an intermediary.

Interpreted programming languages differ in that their source code is not converted into machine code for direct execution on the CPU, but is executed by a special interpreter program.
As an intermediate option, many languages (Java, C#, Python) are translated into machine-independent bytecode,
which still cannot be executed directly on the CPU; an interpreter is still needed to run it.

Why is interpretation slower? The answer, oddly enough, does not seem obvious to everyone, although it is really very simple: by interpreting each instruction separately, we spend extra resources on translating that instruction's semantics into the processor's language. Moreover, modern processors are optimized for executing sequential instructions (see Pipeline), while an interpreter is most often a huge switch with a pile of cases that drives the branch predictor into a dead end.
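For illustration, here is a minimal sketch of such a switch-based dispatch loop, with a made-up three-instruction bytecode (the instruction set is invented for the example, not taken from mathvm):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A made-up three-instruction bytecode, just to show the dispatch pattern.
    enum Op : uint8_t { PUSH, ADD, HALT };

    int64_t run(const std::vector<uint8_t>& code) {
        std::vector<int64_t> stack;
        std::size_t pc = 0;
        for (;;) {
            switch (code[pc++]) {   // one hard-to-predict indirect branch
            case PUSH:              // per executed instruction
                stack.push_back(code[pc++]);
                break;
            case ADD: {
                int64_t b = stack.back(); stack.pop_back();
                stack.back() += b;
                break;
            }
            case HALT:
                return stack.back();
            }
        }
    }
    // run({PUSH, 2, PUSH, 2, ADD, HALT}) == 4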

A completely natural solution to all these problems is to translate the bytecode into the processor's language once and hand the result to the CPU for execution. That is what is usually done, and the process is called Just-in-time compilation (JIT). Moreover, it is during this phase that various optimizations of the generated code are most often performed.

We now turn to practice.

Formulation of the problem


The course “Virtualization and Virtual Machines” at our university is taught by Nikolai Igotti, who took part in developing several very different VMs: HotSpot, VirtualBox, Native Client, and who knows the details of their implementation firsthand. Thanks to the wonders of modern technology, you do not even have to be a student of the Academic University to learn more about all this: the course lectures have been published online. Though it must be said that this is, of course, not quite the same thing, given the interactivity of the course and the work with the audience during lectures.

As part of the course, to gain admission to the exam, students were asked to write their own implementation of a simple stack-based virtual machine: a translator from a C-like language into bytecode plus an interpreter for that bytecode. The task is actually somewhat simpler than it might seem at first glance, since it is issued together with a lot of ready-made C++ code.

The code name for this whole assignment is mathvm, apparently because the main purpose of the programming language, as often happens, is to compute something.

A typical mathvm program:
    function int fib(int x) {
        if (x <= 1) {
            return x;
        }
        int r = fib(x - 1);
        return r + fib(x - 2);
    }

    print(fib(35), '\n');

Among the sample programs supplied with the task, the attentive reader may notice one that, for some reason, calculates the number of frames drawn per second (FPS). This figure makes it possible to measure interpreter performance, and thus to bring a competitive spirit into a learning process that some had found boring.
The student whose interpreter turns out to be the fastest on this and a couple of similar benchmarks automatically passes the exam and, on top of that, earns the love and respect of colleagues. It should be noted that ours was not the first cohort to play this game, and usually a JIT compiler was needed to win.

Difficulties


It would seem: what can really be so difficult about simply translating the semantics of simple bytecode instructions into the corresponding elements of the x86 instruction set?

  1. If we had a fully compiled language, we could literally generate assembly text and then assemble it, ending up with an executable file.
    But within the framework of JIT translation, splitting the process into two phases is an impermissible luxury: we must learn to generate machine code directly in the memory of the VM process.
  2. Not just any memory region is suitable for machine code, either. More precisely, you can write it anywhere, but you can transfer execution only to memory whose pages are marked as executable. A plain malloc clearly will not do here; a minimal sketch of the usual approach follows this list.
  3. Ideally, of course, the resulting machine code should run as fast as possible.
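A minimal POSIX sketch of points 1 and 2: the machine code bytes (here lea rax, [rdi + rsi]; ret, i.e. adding two integer arguments) are written into an anonymous executable mapping and called directly:

    #include <sys/mman.h>
    #include <cstdint>
    #include <cstring>
    #include <iostream>

    int main() {
        // Machine code for: lea rax, [rdi + rsi]; ret
        // i.e. an int64_t(int64_t, int64_t) returning the sum of its arguments.
        const uint8_t code[] = { 0x48, 0x8D, 0x04, 0x37, 0xC3 };

        // Ask the OS for a page that is both writable and executable.
        // (Production VMs usually write first, then flip the page to
        // executable with mprotect, to respect W^X policies.)
        void* mem = mmap(nullptr, sizeof(code),
                         PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return 1;

        std::memcpy(mem, code, sizeof(code));
        auto add = reinterpret_cast<int64_t (*)(int64_t, int64_t)>(mem);
        std::cout << add(2, 2) << std::endl; // 4

        munmap(mem, sizeof(code));
        return 0;
    }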

Asmjit


The first two problems are solved nicely by Asmjit, an open-source library developed mostly by one person over the past few years and used in the author's other projects, as well as in the JIT for scripts on the GTA San Andreas MP server.
As the name suggests, its main mission is to provide a convenient API for generating machine code just in time. Its public interface can be divided into two parts: a low-level Assembler, which emits exactly the instructions you ask for, and a higher-level Compiler, which operates on virtual registers and performs register allocation for you.

In my implementation I used the Assembler, which won me over with its transparency and finer control.

The easiest working example of using Asmjit:
    #include "asmjit/asmjit.h"
    #include <iostream>

    int main(int argc, char** argv) {
        using namespace std;
        using namespace asmjit;
        using namespace asmjit::x86;

        JitRuntime runtime;                  // RAII object
        X86Assembler assembler(&runtime);

        assembler.add(rdi, rsi);             // add second argument to first
        assembler.mov(rax, rdi);             // mov result to return register
        assembler.ret();

        void* address = assembler.make();
        cout << reinterpret_cast<int64_t (*)(int64_t, int64_t)>(address)(2, 2)
             << endl;                        // 4
        return 0;
    }

For people familiar with assembly, this code should not raise many questions. The only thing worth mentioning is the runtime object, which is responsible for the lifetime of the allocated memory: it is released when the object's destructor runs (RAII).
More usage examples can be found in the library's test suite or in my repository.

Debugging


Oddly enough, debugging the generated code does not call for anything especially perverse.
First, if something has gone wrong, you can study the Asmjit logs thoughtfully: they show the resulting assembly code with a few comments about where it came from.
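For instance, a minimal sketch of enabling that log; FileLogger and setLogger match the older Asmjit API used throughout this post, while current Asmjit versions attach the logger to a CodeHolder instead:

    #include "asmjit/asmjit.h"
    #include <cstdio>

    int main() {
        using namespace asmjit;
        using namespace asmjit::x86;

        JitRuntime runtime;
        X86Assembler assembler(&runtime);

        FileLogger logger(stderr);     // dump every emitted instruction
        assembler.setLogger(&logger);  // to stderr as it is generated

        assembler.mov(rax, 42);        // logged as "mov rax, 42"
        assembler.ret();
        assembler.make();
        return 0;
    }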
If even that did not help, most modern debuggers (gdb / VS) can work with disassembly in debug mode. That is, you can set a breakpoint at the call site of the generated function, and Step Into will take you to a window of assembly code, where you can debug almost as usual.


Optimization


After three days of active programming and debugging, the first, completely unsophisticated version of the JIT translator appeared, in which all stack values and variables were thoughtlessly stored in memory every time.
Even this solution gave a roughly sixfold performance gain: from the 6 FPS obtained with the interpreter up to 36 FPS.
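To show what “thoughtlessly stored in memory” means in code, here is an illustrative sketch of how such a naive translator might emit an integer addition; the frame layout and the helper are my assumptions, only the round trip through memory is the point:

    #include "asmjit/asmjit.h"

    // Illustrative: the virtual operand stack lives entirely in memory at
    // negative offsets from rbp, so a single integer add costs two loads
    // and a store. `stackDepth` tracks the stack height at compile time.
    void emitIAdd(asmjit::X86Assembler& a, int& stackDepth) {
        using namespace asmjit::x86;
        a.mov(rax, ptr(rbp, -8 * stackDepth));        // load top slot
        a.mov(rcx, ptr(rbp, -8 * (stackDepth - 1)));  // load second slot
        a.add(rax, rcx);
        --stackDepth;                                 // pop one slot...
        a.mov(ptr(rbp, -8 * stackDepth), rax);        // ...store result back
    }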

In general, at the beginning of the semester, when the rules of the game became clear, I had Napoleonic plans: to do everything properly, like the grown-ups, translating the bytecode into SSA form and using a clever register-allocation algorithm.
But due to an acute shortage of time and plain cowardice, it all ended somewhat more prosaically.

Register allocation

Just in case, let me remind you that one of the most performance-critical aspects of a program is how effectively it uses the available CPU registers.
This is because the main computational work can be performed only on them; besides, a read or write, even one served from the L1 cache, takes up to twice as long as the same operation on registers.
I used a heuristic that is not the most sophisticated but quite effective: keep the first elements of the virtual machine stack in registers, with 7 slots for general-purpose values (strings/integers) and 14 for floating-point ones.
This choice seems the most justified, since the hottest values during a function's execution are indeed the bottom slots of its operand stack, which participate in all the calculations.
Moreover, if you use the same registers into which arguments are placed when calling a function, this sometimes saves time at call sites.
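A sketch of the resulting mapping; the concrete register set here is my illustration of the idea, not a listing of the project's actual choice:

    #include <array>
    #include <cstddef>

    // Illustrative only: pin the lowest operand-stack slots of a frame to
    // fixed registers and spill everything deeper to memory. Reusing the
    // System V argument registers (rdi, rsi, rdx, rcx, r8, r9) means the
    // stack often already sits where a callee expects its arguments.
    enum class Loc { RDI, RSI, RDX, RCX, R8, R9, R10, Memory };

    constexpr std::array<Loc, 7> kGpSlots = {
        Loc::RDI, Loc::RSI, Loc::RDX, Loc::RCX, Loc::R8, Loc::R9, Loc::R10
    };

    // Where does virtual-stack slot `depth` (counted from the bottom of
    // the current frame's operand stack) live?
    constexpr Loc slotLocation(std::size_t depth) {
        return depth < kGpSlots.size() ? kGpSlots[depth] : Loc::Memory;
    }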
Implementing these ideas gained me another 9 FPS, reaching 45 FPS, which could not fail to please me.

Peephole optimization

One of the simple classical approaches to code generation is so-called peephole optimization, the idea of which is to find certain sequences of instructions and replace them with other, more efficient ones.

For example, due to the limited expressiveness of the mathvm bytecode, comparison operators like (x0 <= x1) have to be expressed something like this:
    LOADIVAR_0
    LOADIVAR_1
    IF_CMPLE L1
    ILOAD_0       // push 0: the comparison was false
    JA L2         // unconditional jump past the true branch
    L1:
    ILOAD_1       // push 1: the comparison was true
    L2:
    ...

Experts on the x86 instruction set will say that such code can be written much shorter and without a single conditional jump:
    mov rdi, %(var0)
    mov rsi, %(var1)
    cmp rdi, rsi
    setle al
    movzx rax, al

Similar patterns show up in other places as well: the negation operator and the for loop. Detecting and rewriting them gave another 4 FPS, bringing the total to 49.
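A minimal sketch of what such pattern detection could look like at the bytecode level; the flat one-byte encoding and the fused CMP_LE opcode are invented for the example, since real mathvm instructions carry label operands:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Invented flat encoding: one byte per instruction, no operands,
    // just enough to show the peephole idea.
    enum Op : uint8_t { LOADIVAR_0, LOADIVAR_1, IF_CMPLE, ILOAD_0, JA,
                        ILOAD_1, CMP_LE };

    // Fuse the six-instruction compare-and-branch pattern into a single
    // CMP_LE that the code generator can lower to cmp + setle.
    void peephole(std::vector<uint8_t>& code) {
        for (std::size_t i = 0; i + 6 <= code.size(); ++i) {
            if (code[i]     == LOADIVAR_0 && code[i + 1] == LOADIVAR_1 &&
                code[i + 2] == IF_CMPLE   && code[i + 3] == ILOAD_0 &&
                code[i + 4] == JA         && code[i + 5] == ILOAD_1) {
                code[i + 2] = CMP_LE;                 // fuse into one op
                code.erase(code.begin() + i + 3,      // drop the branchy tail
                           code.begin() + i + 6);
            }
        }
    }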

Closure optimization


To support closures, in the prologue and epilogue of each function I saved the address where the variables of the most recent call live into a specially allocated memory area outside the stack.
In addition, the previous base pointer has to be saved and restored. All this joy cost up to seven extra instructions per call, some of them actively working with memory.
In reality, most functions contain no closures at all, and these losses can be avoided.
So I added an extra analysis pass over all of the incoming bytecode to detect closures, removed the corresponding redundant instructions where they were not needed, and the animation started rendering another 3 FPS faster.
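For a rough idea of that analysis, here is a sketch under assumed opcodes: LOADCTXIVAR/STORECTXIVAR and their operand layout are guesses for illustration, not necessarily the real mathvm encoding.

    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    // Assumed opcodes for accessing a variable of an enclosing function,
    // each followed by a 2-byte owner-function id and a 2-byte variable id.
    enum Op : uint8_t { LOADCTXIVAR = 0x40, STORECTXIVAR = 0x41 };

    // Collect ids of functions whose variables are touched from other
    // functions: only they need the costly prologue that publishes their
    // frame address outside the stack. (A real pass would decode every
    // instruction's length instead of scanning byte by byte.)
    std::set<uint16_t> functionsNeedingContext(const std::vector<uint8_t>& bc) {
        std::set<uint16_t> ids;
        for (std::size_t i = 0; i + 5 <= bc.size(); ) {
            if (bc[i] == LOADCTXIVAR || bc[i] == STORECTXIVAR) {
                ids.insert(uint16_t(bc[i + 1] | (bc[i + 2] << 8)));
                i += 5;  // opcode + function id + variable id
            } else {
                ++i;
            }
        }
        return ids;
    }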

Inlining


The final point in this story about micro-optimizations is inlining. This is how industrial compilers fight the cost of calling small functions: the code of the callee is simply duplicated at the call site, without a prologue/epilogue or other “extra” instructions. It should be noted, however, that real-world compilers approach the inlining decision very carefully: the size of the resulting machine code is also a measure of the quality of their work.

In my case, however, since only running time is being judged, I decided to try to inline everything that can be inlined (that is, all non-recursive calls).
And here, oddly enough, it turned out to be easier to work at the level of translating the AST into bytecode.
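For a rough idea of the AST-level approach, here is a sketch under heavy assumptions: the node classes, the clone helper, and the assignment node are all hypothetical, not the actual mathvm AST API. A call node is replaced by a block that binds the arguments to the parameter variables and then runs a deep copy of the callee's body.

    #include <cstddef>
    #include <memory>
    #include <utility>
    #include <vector>

    // Hypothetical, heavily simplified AST, only to show the shape of the
    // transformation; the real mathvm node classes look quite different.
    struct AstNode {
        virtual ~AstNode() = default;
        virtual std::unique_ptr<AstNode> clone() const = 0;  // deep copy
    };

    struct BlockNode : AstNode {
        std::vector<std::unique_ptr<AstNode>> statements;
        std::unique_ptr<AstNode> clone() const override {
            auto b = std::make_unique<BlockNode>();
            for (const auto& s : statements) b->statements.push_back(s->clone());
            return b;
        }
    };

    struct AssignNode : AstNode {  // "var <varId> = <rhs>"
        int varId;
        std::unique_ptr<AstNode> rhs;
        AssignNode(int id, std::unique_ptr<AstNode> r)
            : varId(id), rhs(std::move(r)) {}
        std::unique_ptr<AstNode> clone() const override {
            return std::make_unique<AssignNode>(varId, rhs->clone());
        }
    };

    struct FunctionDecl {
        std::vector<int> paramIds;  // variable ids of the parameters
        BlockNode* body;
        bool isRecursive;           // recursive callees are never inlined
    };

    // Replace a call node with a block that assigns the argument
    // expressions to the parameter variables and then executes a deep copy
    // of the callee's body. (A real pass must also rewrite the callee's
    // return statements into "store result + jump to end"; omitted here.)
    std::unique_ptr<AstNode> inlineCall(const FunctionDecl& callee,
                                        std::vector<std::unique_ptr<AstNode>> args) {
        auto block = std::make_unique<BlockNode>();
        for (std::size_t i = 0; i < args.size(); ++i)
            block->statements.push_back(std::make_unique<AssignNode>(
                callee.paramIds[i], std::move(args[i])));
        block->statements.push_back(callee.body->clone());
        return block;
    }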

The resulting gain of 2 FPS was no longer as significant as in the previous cases, but the total of 53-54 FPS (an almost ninefold speed-up) could not fail to please. Speed-ups of the same order were observed on the other available benchmarks.

Interestingly, even my simple JIT translator needed several warm-up iterations to reach peak performance, although I never ran the interpreter at all and compiled all the code right away.
This illustrates the claims of authoritative people that warm-up in benchmarks is needed not only to get everything compiled, but also to let the program's execution profile settle.

Results


In the end my implementation was the fastest, though this is due not so much to me being the most cunning as to the fact that none of my rivals wrote even the simplest JIT.
So the absence of any sporting intrigue somewhat dimmed the victory.

The sources of my implementation can be found in the repository, in the impl directory.
I apologize in advance for the quality of the code, but this is still a toy project, and I do not plan to maintain it further.

Thanks for your attention!

Source: https://habr.com/ru/post/250841/

