
JIT compiler as an academic project at the Academic University

About sixteen years ago the first version of HotSpot was released: a JVM implementation that later became the standard virtual machine shipped with Sun's JRE.

The main distinguishing feature of this implementation was its JIT compiler, thanks to which claims about Java being slow became, in many cases, completely untenable.
By now almost all interpreted platforms, such as the CLR, Python, Ruby, Perl, and even the wonderful R programming language, have acquired their own JIT compiler implementations.

In this article I do not plan to shed light on little-known implementation details of industrial JIT compilers; rather, this will be a fairly superficial acquaintance with the basics and a story about a course project on the subject.



Benefits of JIT Compilation


To talk about the benefits, let's first understand what it is. Let's start from afar, namely with the definition of a compiled language.

A compiled programming language is one whose source code is converted by a compiler into machine code for a specific architecture, for example x86 or ARM. This machine code is a sequence of instructions fully understood by your processor, and it can be executed without an intermediary.

Interpreted programming languages differ in that their source code is not converted into machine code for direct execution on the CPU, but is executed by a special interpreter program.
As an intermediate option, many languages (Java, C#, Python) are translated into machine-independent bytecode,
which still cannot be executed directly on the CPU; an interpreter is still needed to run it.

Why is interpretation slower? The answer, oddly enough, does not seem obvious to everyone, although it is really very simple: by interpreting each instruction separately, we spend extra resources on translating that instruction's semantics into the processor's language. Moreover, modern processors are optimized for executing sequential instructions (see Pipeline), while an interpreter is most often a huge switch with a pile of cases that drives the branch predictor into a dead end.
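For illustration, here is a minimal sketch of such a switch-based dispatch loop, with a made-up three-instruction bytecode (the instruction set is invented for the example, not taken from mathvm):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A made-up three-instruction bytecode, just to show the dispatch pattern.
    enum Op : uint8_t { PUSH, ADD, HALT };

    int64_t run(const std::vector<uint8_t>& code) {
        std::vector<int64_t> stack;
        std::size_t pc = 0;
        for (;;) {
            switch (code[pc++]) {   // one hard-to-predict indirect branch
            case PUSH:              // per executed instruction
                stack.push_back(code[pc++]);
                break;
            case ADD: {
                int64_t b = stack.back(); stack.pop_back();
                stack.back() += b;
                break;
            }
            case HALT:
                return stack.back();
            }
        }
    }
    // run({PUSH, 2, PUSH, 2, ADD, HALT}) == 4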

A completely natural solution to all these problems is to translate the bytecode into the processor's language once and hand the result to the CPU for execution. That is what is usually done, and the process is called Just-in-time compilation (JIT). Moreover, it is during this phase that various optimizations of the generated code are most often performed.

We now turn to practice.

Formulation of the problem


The course “Virtualization and Virtual Machines” at our university is taught by Nikolai Igotti, who took part in developing several very different VMs: HotSpot, VirtualBox, Native Client, and who knows the details of their implementation firsthand. Thanks to the wonders of modern technology, you do not even have to be a student of the Academic University to learn more about all this: the course lectures have been published online. Though it must be said that this is, of course, not quite the same thing, given the interactivity of the course and the work with the audience during lectures.

As part of the course, to gain admission to the exam, students were asked to write their own implementation of a simple stack-based virtual machine: a translator from a C-like language into bytecode plus an interpreter for that bytecode. The task is actually somewhat simpler than it might seem at first glance, since it is issued together with a lot of ready-made C++ code.

The code name for this whole assignment is mathvm, apparently because the main purpose of the programming language, as often happens, is to compute something.

A typical mathvm program:
    function int fib(int x) {
        if (x <= 1) {
            return x;
        }
        int r = fib(x - 1);
        return r + fib(x - 2);
    }

    print(fib(35), '\n');

Among the sample programs supplied with the task, the attentive reader may notice one that, for some reason, calculates the number of frames drawn per second (FPS). This figure makes it possible to measure interpreter performance, and thus to bring a competitive spirit into a learning process that some had found boring.
The student whose interpreter turns out to be the fastest on this and a couple of similar benchmarks automatically passes the exam and, on top of that, earns the love and respect of colleagues. It should be noted that ours was not the first cohort to play this game, and usually a JIT compiler was needed to win.

Difficulties


It would seem: what can really be so difficult about simply translating the semantics of simple bytecode instructions into the corresponding elements of the x86 instruction set?

  1. If we had a fully compiled language, we could literally generate assembly text and then assemble it, ending up with an executable file.
    But within the framework of JIT translation, splitting the process into two phases is an impermissible luxury: we must learn to generate machine code directly in the memory of the VM process.
  2. Not just any memory region is suitable for machine code, either. More precisely, you can write it anywhere, but you can transfer execution only to memory whose pages are marked as executable. A plain malloc clearly will not do here; a minimal sketch of the usual approach follows this list.
  3. Ideally, of course, the resulting machine code should run as fast as possible.
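A minimal POSIX sketch of points 1 and 2: the machine code bytes (here lea rax, [rdi + rsi]; ret, i.e. adding two integer arguments) are written into an anonymous executable mapping and called directly:

    #include <sys/mman.h>
    #include <cstdint>
    #include <cstring>
    #include <iostream>

    int main() {
        // Machine code for: lea rax, [rdi + rsi]; ret
        // i.e. an int64_t(int64_t, int64_t) returning the sum of its arguments.
        const uint8_t code[] = { 0x48, 0x8D, 0x04, 0x37, 0xC3 };

        // Ask the OS for a page that is both writable and executable.
        // (Production VMs usually write first, then flip the page to
        // executable with mprotect, to respect W^X policies.)
        void* mem = mmap(nullptr, sizeof(code),
                         PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return 1;

        std::memcpy(mem, code, sizeof(code));
        auto add = reinterpret_cast<int64_t (*)(int64_t, int64_t)>(mem);
        std::cout << add(2, 2) << std::endl; // 4

        munmap(mem, sizeof(code));
        return 0;
    }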

Asmjit


The first two problems are solved nicely by Asmjit, an open-source library developed mostly by one person over the past few years and used in the author's other projects, as well as in the JIT for scripts on the GTA San Andreas MP server.
As the name suggests, its main mission is to provide a convenient API for generating machine code just in time. Its public interface can be divided into two parts: a low-level Assembler, which emits exactly the instructions you ask for, and a higher-level Compiler, which operates on virtual registers and performs register allocation for you.

In my implementation I used the Assembler, which won me over with its transparency and finer control.

The easiest working example of using Asmjit:
    #include "asmjit/asmjit.h"
    #include <iostream>

    int main(int argc, char** argv) {
        using namespace std;
        using namespace asmjit;
        using namespace asmjit::x86;

        JitRuntime runtime;                  // RAII object
        X86Assembler assembler(&runtime);

        assembler.add(rdi, rsi);             // add second argument to first
        assembler.mov(rax, rdi);             // mov result to return register
        assembler.ret();

        void* address = assembler.make();
        cout << reinterpret_cast<int64_t (*)(int64_t, int64_t)>(address)(2, 2)
             << endl;                        // 4
        return 0;
    }

For people familiar with assembly, this code should not raise many questions. The only thing worth mentioning is the runtime object, which is responsible for the lifetime of the allocated memory: it is released when the object's destructor runs (RAII).
More usage examples can be found in the library's test suite or in my repository.

Debugging


Oddly enough, debugging the generated code does not call for anything especially perverse.
First, if something has gone wrong, you can study the Asmjit logs thoughtfully: they show the resulting assembly code with a few comments about where it came from.
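For instance, a minimal sketch of enabling that log; FileLogger and setLogger match the older Asmjit API used throughout this post, while current Asmjit versions attach the logger to a CodeHolder instead:

    #include "asmjit/asmjit.h"
    #include <cstdio>

    int main() {
        using namespace asmjit;
        using namespace asmjit::x86;

        JitRuntime runtime;
        X86Assembler assembler(&runtime);

        FileLogger logger(stderr);     // dump every emitted instruction
        assembler.setLogger(&logger);  // to stderr as it is generated

        assembler.mov(rax, 42);        // logged as "mov rax, 42"
        assembler.ret();
        assembler.make();
        return 0;
    }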
If even that did not help, most modern debuggers (gdb / VS) can work with disassembly in debug mode. That is, you can set a breakpoint at the call site of the generated function, and Step Into will take you to a window of assembly code, where you can debug almost as usual.


Optimization


After three days of active programming and debugging, the first, completely unsophisticated version of the JIT translator appeared, in which all stack values and variables were thoughtlessly stored in memory every time.
Even this solution gave a roughly sixfold performance gain: from the 6 FPS obtained with the interpreter up to 36 FPS.
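To show what “thoughtlessly stored in memory” means in code, here is an illustrative sketch of how such a naive translator might emit an integer addition; the frame layout and the helper are my assumptions, only the round trip through memory is the point:

    #include "asmjit/asmjit.h"

    // Illustrative: the virtual operand stack lives entirely in memory at
    // negative offsets from rbp, so a single integer add costs two loads
    // and a store. `stackDepth` tracks the stack height at compile time.
    void emitIAdd(asmjit::X86Assembler& a, int& stackDepth) {
        using namespace asmjit::x86;
        a.mov(rax, ptr(rbp, -8 * stackDepth));        // load top slot
        a.mov(rcx, ptr(rbp, -8 * (stackDepth - 1)));  // load second slot
        a.add(rax, rcx);
        --stackDepth;                                 // pop one slot...
        a.mov(ptr(rbp, -8 * stackDepth), rax);        // ...store result back
    }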

In general, at the beginning of the semester, when the rules of the game became clear, I had Napoleonic plans: to do everything properly, like the grown-ups, translating the bytecode into SSA form and using a clever register-allocation algorithm.
But due to an acute shortage of time and plain cowardice, it all ended somewhat more prosaically.

Register allocation

Just in case, let me remind you that one of the most performance-critical aspects of a program is how effectively it uses the available CPU registers.
This is because the main computational work can be performed only on them; besides, a read or write, even one served from the L1 cache, takes up to twice as long as the same operation on registers.
I used a heuristic that is not the most sophisticated but quite effective: keep the first elements of the virtual machine stack in registers, with 7 slots for general-purpose values (strings/integers) and 14 for floating-point ones.
This choice seems the most justified, since the hottest values during a function's execution are indeed the bottom slots of its operand stack, which participate in all the calculations.
Moreover, if you use the same registers into which arguments are placed when calling a function, this sometimes saves time at call sites.
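A sketch of the resulting mapping; the concrete register set here is my illustration of the idea, not a listing of the project's actual choice:

    #include <array>
    #include <cstddef>

    // Illustrative only: pin the lowest operand-stack slots of a frame to
    // fixed registers and spill everything deeper to memory. Reusing the
    // System V argument registers (rdi, rsi, rdx, rcx, r8, r9) means the
    // stack often already sits where a callee expects its arguments.
    enum class Loc { RDI, RSI, RDX, RCX, R8, R9, R10, Memory };

    constexpr std::array<Loc, 7> kGpSlots = {
        Loc::RDI, Loc::RSI, Loc::RDX, Loc::RCX, Loc::R8, Loc::R9, Loc::R10
    };

    // Where does virtual-stack slot `depth` (counted from the bottom of
    // the current frame's operand stack) live?
    constexpr Loc slotLocation(std::size_t depth) {
        return depth < kGpSlots.size() ? kGpSlots[depth] : Loc::Memory;
    }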
Implementing these ideas gained me another 9 FPS, reaching 45 FPS, which could not fail to please me.

Peephole optimization

One of the simple classical approaches to code generation is so-called peephole optimization, the idea of which is to find certain sequences of instructions and replace them with other, more efficient ones.

For example, due to the limited expressiveness of the mathvm bytecode, comparison operators like (x0 <= x1) have to be expressed something like this:
    LOADIVAR_0
    LOADIVAR_1
    IF_CMPLE L1
    ILOAD_0       // push 0: the comparison was false
    JA L2         // unconditional jump past the true branch
    L1:
    ILOAD_1       // push 1: the comparison was true
    L2:
    ...

Experts on the x86 instruction set will say that such code can be written much shorter and without a single conditional jump:
    mov rdi, %(var0)
    mov rsi, %(var1)
    cmp rdi, rsi
    setle al
    movzx rax, al

Similar patterns show up in other places as well: the negation operator and the for loop. Detecting and rewriting them gave another 4 FPS, bringing the total to 49.
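A minimal sketch of what such pattern detection could look like at the bytecode level; the flat one-byte encoding and the fused CMP_LE opcode are invented for the example, since real mathvm instructions carry label operands:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Invented flat encoding: one byte per instruction, no operands,
    // just enough to show the peephole idea.
    enum Op : uint8_t { LOADIVAR_0, LOADIVAR_1, IF_CMPLE, ILOAD_0, JA,
                        ILOAD_1, CMP_LE };

    // Fuse the six-instruction compare-and-branch pattern into a single
    // CMP_LE that the code generator can lower to cmp + setle.
    void peephole(std::vector<uint8_t>& code) {
        for (std::size_t i = 0; i + 6 <= code.size(); ++i) {
            if (code[i]     == LOADIVAR_0 && code[i + 1] == LOADIVAR_1 &&
                code[i + 2] == IF_CMPLE   && code[i + 3] == ILOAD_0 &&
                code[i + 4] == JA         && code[i + 5] == ILOAD_1) {
                code[i + 2] = CMP_LE;                 // fuse into one op
                code.erase(code.begin() + i + 3,      // drop the branchy tail
                           code.begin() + i + 6);
            }
        }
    }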

Closure optimization


To support closures, in the prologue and epilogue of each function I saved the address where the variables of the most recent call live into a specially allocated memory area outside the stack.
In addition, the previous base pointer has to be saved and restored. All this joy cost up to seven extra instructions per call, some of them actively working with memory.
In reality, most functions contain no closures at all, and these losses can be avoided.
So I added an extra analysis pass over all of the incoming bytecode to detect closures, removed the corresponding redundant instructions where they were not needed, and the animation started rendering another 3 FPS faster.
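For a rough idea of that analysis, here is a sketch under assumed opcodes: LOADCTXIVAR/STORECTXIVAR and their operand layout are guesses for illustration, not necessarily the real mathvm encoding.

    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    // Assumed opcodes for accessing a variable of an enclosing function,
    // each followed by a 2-byte owner-function id and a 2-byte variable id.
    enum Op : uint8_t { LOADCTXIVAR = 0x40, STORECTXIVAR = 0x41 };

    // Collect ids of functions whose variables are touched from other
    // functions: only they need the costly prologue that publishes their
    // frame address outside the stack. (A real pass would decode every
    // instruction's length instead of scanning byte by byte.)
    std::set<uint16_t> functionsNeedingContext(const std::vector<uint8_t>& bc) {
        std::set<uint16_t> ids;
        for (std::size_t i = 0; i + 5 <= bc.size(); ) {
            if (bc[i] == LOADCTXIVAR || bc[i] == STORECTXIVAR) {
                ids.insert(uint16_t(bc[i + 1] | (bc[i + 2] << 8)));
                i += 5;  // opcode + function id + variable id
            } else {
                ++i;
            }
        }
        return ids;
    }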

Inlining


The final point in this story about micro-optimizations is inlining. This is how industrial compilers fight the cost of calling small functions: the code of the callee is simply duplicated at the call site, without a prologue/epilogue or other “extra” instructions. It should be noted, however, that real-world compilers approach the inlining decision very carefully: the size of the resulting machine code is also a measure of the quality of their work.

In my case, however, since only running time is being judged, I decided to try to inline everything that can be inlined (that is, all non-recursive calls).
And here, oddly enough, it turned out to be easier to work at the level of translating the AST into bytecode.
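For a rough idea of the AST-level approach, here is a sketch under heavy assumptions: the node classes, the clone helper, and the assignment node are all hypothetical, not the actual mathvm AST API. A call node is replaced by a block that binds the arguments to the parameter variables and then runs a deep copy of the callee's body.

    #include <cstddef>
    #include <memory>
    #include <utility>
    #include <vector>

    // Hypothetical, heavily simplified AST, only to show the shape of the
    // transformation; the real mathvm node classes look quite different.
    struct AstNode {
        virtual ~AstNode() = default;
        virtual std::unique_ptr<AstNode> clone() const = 0;  // deep copy
    };

    struct BlockNode : AstNode {
        std::vector<std::unique_ptr<AstNode>> statements;
        std::unique_ptr<AstNode> clone() const override {
            auto b = std::make_unique<BlockNode>();
            for (const auto& s : statements) b->statements.push_back(s->clone());
            return b;
        }
    };

    struct AssignNode : AstNode {  // "var <varId> = <rhs>"
        int varId;
        std::unique_ptr<AstNode> rhs;
        AssignNode(int id, std::unique_ptr<AstNode> r)
            : varId(id), rhs(std::move(r)) {}
        std::unique_ptr<AstNode> clone() const override {
            return std::make_unique<AssignNode>(varId, rhs->clone());
        }
    };

    struct FunctionDecl {
        std::vector<int> paramIds;  // variable ids of the parameters
        BlockNode* body;
        bool isRecursive;           // recursive callees are never inlined
    };

    // Replace a call node with a block that assigns the argument
    // expressions to the parameter variables and then executes a deep copy
    // of the callee's body. (A real pass must also rewrite the callee's
    // return statements into "store result + jump to end"; omitted here.)
    std::unique_ptr<AstNode> inlineCall(const FunctionDecl& callee,
                                        std::vector<std::unique_ptr<AstNode>> args) {
        auto block = std::make_unique<BlockNode>();
        for (std::size_t i = 0; i < args.size(); ++i)
            block->statements.push_back(std::make_unique<AssignNode>(
                callee.paramIds[i], std::move(args[i])));
        block->statements.push_back(callee.body->clone());
        return block;
    }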

The resulting gain of 2 FPS was no longer as significant as in the previous cases, but the total of 53-54 FPS (an almost ninefold speed-up) could not fail to please. Speed-ups of the same order were observed on the other available benchmarks.

Interestingly, even my simple JIT translator needed several warm-up iterations to reach peak performance, although I never ran the interpreter at all and compiled all the code right away.
This illustrates the claims of authoritative people that warm-up in benchmarks is needed not only to get everything compiled, but also to let the program's execution profile settle.

Results


In the end my implementation was the fastest, though this is due not so much to me being the most cunning as to the fact that none of my rivals wrote even the simplest JIT.
So the absence of any sporting intrigue somewhat dimmed the victory.

The sources of my implementation can be found in the repository, in the impl directory.
I apologize in advance for the quality of the code, but this is still a toy project, and I do not plan to maintain it further.

Thanks for your attention!

Source: https://habr.com/ru/post/250841/

