
In the last article, together with humbug, we showed how the speed of calculations can vary depending on how a method is executed and what it contains. Now we can look under the hood of the virtual machine and understand how and why this happens.
Earlier, we became acquainted with the Smalltalk language, or rather with its micro-implementation, Little Smalltalk. We dealt with the syntax of the language, the format of objects in memory, and the basic instruction set. Now we have come right up to the question of how Smalltalk interacts with LLVM (which is what this whole series of articles was started for).
Now we have all the knowledge we need to understand what is going on in our JIT compiler. In this article, we will learn how Smalltalk bytecodes are converted to LLVM IR code, how the code is compiled and executed, and why it works faster than software interpretation. The most impatient can look at the shellcasts (one and two) with their little numbers and running lines (do not forget that they can be scrolled).
Introduction
Both
TRIZ and common sense considerations tell us that:
- The quickest way to do a job is to prove that it does not need to be done at all;
- Instead of doing something quickly, you can take your time and do it in advance;
- Do not do the same job twice (especially if you can avoid doing it at all).
These principles go far beyond human relations and find application in many fields, including computer science. Many understand the benefits of using caches. Some have heard of lazy evaluation. And as a counterexample, everyone has faced senseless, time-consuming bureaucracy.
JIT compilers also have a lot in common with the considerations above, and our project is no exception. We will see how code is executed the traditional way, and what can be done to improve performance.
How the virtual machine differs from a real processor
The Smalltalk virtual machine is stack-oriented:
- When performing operations, operands are pushed onto the stack;
- The operation being performed pops the required number of operands from the stack;
- The result is placed back on top of the stack and can be used as an operand for the next operation.
The advantages of this approach are the simplicity of the virtual machine code and the ease with which chains of calculations can be expressed. Of course, there are drawbacks, the main one being that you have to juggle values, pushing and popping them, just to let the machine perform some action on them. Judge for yourself: in the example described in the last article (Collection>>sort:), we repeatedly put values on the stack just to immediately remove them, copy them into an instance of the Array class, and then put that object on the stack again (markArguments does this).
Of course, all of this is done for a reason. The stack organization of the machine makes it very easy to encode sequences of actions without introducing complex operations. For example, when evaluating arithmetic expressions with parentheses, the stack serves as temporary storage for intermediate values. Engineering calculators based on reverse Polish notation have been designed around these principles since time immemorial.
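As a side note, here is a minimal C++ sketch of this evaluation scheme (a generic illustration, not code from the project): operands are pushed onto a stack, and each operation pops its operands and pushes the result back.

    // Evaluates the RPN expression "2 3 + 4 *": (2 + 3) * 4 = 20.
    #include <iostream>
    #include <sstream>
    #include <stack>
    #include <string>

    int main() {
        std::stack<int> stack;
        std::istringstream input("2 3 + 4 *");
        std::string token;
        while (input >> token) {
            if (token == "+" || token == "*") {
                int right = stack.top(); stack.pop(); // operation: pop the operands...
                int left  = stack.top(); stack.pop();
                stack.push(token == "+" ? left + right : left * right); // ...push the result
            } else {
                stack.push(std::stoi(token)); // operand: push it onto the stack
            }
        }
        std::cout << stack.top() << "\n"; // prints 20
        return 0;
    }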
However, from the point of view of modern processor architecture, Smalltalk is very inconvenient. Most operations work with memory. The principle of locality is not maintained (objects are scattered all over the heap, far from each other). New objects are constantly being created, which leads to garbage collections and yet more shuffling of large amounts of memory. All memory accesses are indirect. Finally, the large number of small methods makes adequate register allocation impossible.
Modern processors are, by and large, register machines. Registers are much faster than RAM, so operating on them directly promises greater performance. A large data cache can significantly reduce the cost of memory access (provided the principle of locality is respected), but, as will be shown below, we achieve success largely through a fundamental reduction in the number of memory operations themselves, rather than by merely moving them into registers.
Thus, in order to effectively execute Smalltalk code on the processor, it is necessary to find a way to convert stack logic into register code.
JIT compiler stack
The intermediate representation (IR) in LLVM uses SSA notation. In this notation there are no variables in the usual sense; all computation is a graph. Each calculation result (a node) can be given a name, and this name can be used in further calculations (linking two nodes). Once assigned, a name can never change its value (reassigning a value to the same name is not allowed); the name refers only to the point in the program (and the data) where it was declared. In those places where you still have to work with memory, the alloca instruction is used, which allocates a memory area of the requested size, along with the load and store instructions for reading and writing, respectively.
Let's once again turn our attention to the Smalltalk bytecodes needed to initialize the left variable in the sort: method described earlier:
0023 PushLiteral 4
0024 MarkArguments 1
0025 SendMessage new
0026 AssignTemporary 0
0027 DoSpecial popTop
Another example is the beginning of a block from the same method:
0037 PushArgument 1
0038 PushTemporary 3
0039 PushTemporary 2
0040 MarkArguments 3
0041 SendMessage value:value:
0042 DoSpecial branchIfFalse 52
In either case, it is clear that a large number of simple memory operations are performed. The basic push instructions can each be converted to a pair of assembly instructions: all that is needed is to copy a pointer (in fact, we only ever work with pointers to object structures) from one memory region, at the offset given by the variable's index, to another region, at the offset given by the stack-top index.
The implementation of the above ideas in the IR code may look something like this:
define void @pushValueToStack(%TContext* %context, i32 %index, %TObject* %value) {
    %stackPointer = getelementptr inbounds %TContext* %context, i32 0, i32 4
    %stack = load %TObjectArray** %stackPointer
    %stackObject = bitcast %TObjectArray* %stack to %TObject*
    call %TObject** @setObjectField(%TObject* %stackObject, i32 %index, %TObject* %value)
    ret void
}

define %TObject** @setObjectField(%TObject* %object, i32 %index, %TObject* %value) {
    %fieldPointer = call %TObject** @getObjectFieldPtr(%TObject* %object, i32 %index)
    store %TObject* %value, %TObject** %fieldPointer
    ret %TObject** %fieldPointer
}

define %TObject** @getObjectFieldPtr(%TObject* %object, i32 %index) {
    %fields = getelementptr inbounds %TObject* %object, i32 0, i32 2
    %fieldPointer = getelementptr inbounds [0 x %TObject*]* %fields, i32 0, i32 %index
    ret %TObject** %fieldPointer
}
Yes, this will work and, converted to native code, it will be faster than the equivalent implementation in the interpreter. But the final performance of such a JIT VM will be surprisingly low. The thing is, the memory operations themselves have not gone anywhere: we simply moved them into low-level code, shedding only the "wrapper" of the software VM.
To really speed up the virtual machine, you need to use the capabilities that LLVM provides us, namely SSA, along with its rich set of optimization passes.
If we recall the purpose of the push and markArguments instructions, it becomes clear that the stack is used here only to bind some value (an argument, literal, constant, or temporary variable) to the corresponding cell of the argument array, which is determined by the position of the corresponding push instruction.
SSA notation allows us to do this directly, bypassing the stack. All we need to do is "remember" which objects we have prepared for placement in the argument array, and then copy them directly at the place of future use. Since the stored values will be accessed sequentially, in LIFO order, it is natural to use a stack to store the references. This stack of values exists and is used only at compile time; what actually executes is the already generated, "correct" code.
This principle works so well that the JIT versions of methods do not use the normal context stack at all. It is not even initialized when a context is created in JIT code, and that is one less allocation for each message send.
Thus, we obtain the following algorithm for optimized message sending (a sketch follows the list):
- Instead of pushing an object onto the context stack (push), we load the value from memory into a name (load) and put it on the compiler's value stack;
- When processing the markArguments instruction, we create the argument object, then sequentially remove elements from the value stack and write (store) them one after another into the slots of the created argument array;
- We call the sendMessage stub handler.
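To make this concrete, here is a simplified C++ sketch of the idea (not the actual MethodCompiler code; the emitted "IR" is plain text, and the helper name emitLoadTemporary is made up). Two push instructions and a markArguments boil down to one load and one store per argument:

    #include <cstdio>
    #include <stack>
    #include <string>

    static int nameCounter = 0;

    // Emit a load from the temporaries array and return its SSA-style name.
    std::string emitLoadTemporary(int index) {
        std::string name = "%tmp" + std::to_string(nameCounter++);
        std::printf("%s = load temporary %d\n", name.c_str(), index);
        return name;
    }

    int main() {
        std::stack<std::string> valueStack; // exists only at compile time

        // PushTemporary 3 / PushTemporary 2: no context-stack traffic,
        // just remember the loaded names on the compiler's own stack.
        valueStack.push(emitLoadTemporary(3));
        valueStack.push(emitLoadTemporary(2));

        // MarkArguments 2: pop the remembered names and store them straight
        // into the freshly created argument array, in LIFO order.
        int argCount = 2;
        std::printf("%%args = alloca argument array of %d\n", argCount);
        for (int i = argCount - 1; i >= 0; --i) {
            std::printf("store %s into %%args[%d]\n", valueStack.top().c_str(), i);
            valueStack.pop();
        }
        return 0;
    }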
It turns out that instead of four memory operations (load the value, store it onto the stack, load it back from the stack, and store it into the argument array), we are left with only two (load the value and store it into the array).
People familiar with LLVM may object that such redundant memory operations are effectively removed by the optimizer passes without any help from us. Indeed, in the simplest cases this is so. But the real code we have to generate looks much more complicated, primarily because of the need to add special markers and value references to ensure the correct operation of the garbage collector.
Methods and Functions
As it runs, the virtual machine creates a JIT version of each method used in the image. It is a functional equivalent of the method, but implemented in the processor's native instructions. A JIT function is created on the first send of a message that has not been sent before. For example, when the message List>>sort: is first sent, the message handler will be found, which is the Collection>>sort: method (since the List class does not have a sort: method but inherits it from Collection). Then the virtual machine will try to find the JIT function with the same name, and will not find it. The JIT compiler will be invoked and, from the body of the Collection>>sort: method, will create an equivalent function. This function will then be called with the same parameters that were intended for the normal message. The next time the message is sent, the compiler will not be invoked; the existing version of the method will be used instead.
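A minimal sketch of this lazy-compilation scheme (the names TMethod, JitFunction, compileMethod, and lookupOrCompile are illustrative, not llst's actual API):

    #include <string>
    #include <unordered_map>

    struct TMethod {};                            // stand-in for the Smalltalk method body
    typedef void* (*JitFunction)(void* context);  // native equivalent of a method

    // Stub for illustration: the real compiler translates the bytecodes to IR.
    JitFunction compileMethod(TMethod*) { return 0; }

    std::unordered_map<std::string, JitFunction> jitCache;

    JitFunction lookupOrCompile(const std::string& qualifiedName, TMethod* method) {
        auto it = jitCache.find(qualifiedName);   // e.g. "Collection>>sort:"
        if (it != jitCache.end())
            return it->second;                    // already compiled: reuse it
        JitFunction fn = compileMethod(method);   // first send: invoke the JIT compiler
        jitCache[qualifiedName] = fn;
        return fn;
    }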
Message sending
As we remember from the previous article, sending a message in the virtual machine requires:
- Removing values from the stack and creating the argument array (made a separate step in markArguments);
- Finding the method in the hierarchy that will handle the message;
- Creating a context object and filling in its fields;
- Switching execution to the new context.
At this stage, we cannot influence the lookup procedure for the destination method, since the virtual machine has no information about object types. More precisely, this information is available only at the moment of the call and cannot be used to predict the future types of objects. Therefore, the message lookup mechanism remains the same and is invoked through the system function sendMessage(), registered in the module. This function allows the JIT code to reach the software VM and ask it to find the message recipient. Control is then transferred to the JIT version of the found method.
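The boundary between JIT code and the software VM can be pictured as an ordinary C-style callback. The signature below is an assumption for illustration only; the real llst runtime function may look different:

    // Called from generated code when the receiver's class is unknown.
    extern "C" void* sendMessage(void* callingContext,
                                 void* selector,        // e.g. #sort:
                                 void* argumentsArray,  // receiver plus arguments
                                 void* receiverClass);  // hint for lookup and caching
    // Inside, the VM walks the class hierarchy, finds the handling method,
    // obtains (or compiles) its JIT version, and transfers control to it.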
A software VM creates all objects on the heap, since it has no other memory. In the case of a JIT VM, we can significantly reduce memory overhead by placing some objects on the stack. Such objects are:
- The argument array;
- The context object;
- The array of temporary variables.
The lifetime of these objects is never less than the execution time of the method itself (they always have references from active contexts), so they can safely be placed on the stack. When the method exits, the stack collapses to the previous frame, automatically freeing the occupied memory. Strictly speaking, the speed of memory allocation on the stack and on the heap is about the same: in both cases it is only necessary to advance a pointer by the size of the allocated memory and return the resulting value. But in the case of the heap, this can trigger garbage collection, which does take considerable time. The more often that happens, the more the stack variant wins.
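A conceptual C++ sketch of this difference (not the llst allocator): both allocators just bump a pointer, but the heap bump can run out of room and force a garbage collection.

    #include <cstddef>
    #include <cstdint>

    uint8_t heap[1 << 20];
    size_t  heapTop = 0;

    void collectGarbage() { heapTop = 0; } // stub: a real GC scans and moves live objects

    void* heapAllocate(size_t size) {
        if (heapTop + size > sizeof(heap))
            collectGarbage();          // the hidden cost of heap allocation
        void* result = heap + heapTop; // otherwise, just a pointer bump...
        heapTop += size;
        return result;
    }

    void jitMethodBody() {
        // In the JIT version the argument array lives in the native call frame,
        // like an LLVM 'alloca' in the generated code: the same pointer bump,
        // but freed automatically when the frame unwinds, and it never triggers GC.
        void* arguments[3];
        (void)arguments;
    }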
Statistics
The ideal end result of the JIT compiler can be described as follows:
- All Smalltalk methods involved are converted to functions;
- Each message send is converted to a direct function call (rather than a lookup stub);
- Once there are direct calls, the code of blocks and small functions is inlined at the place of use, bypassing message sends entirely.
The main obstacle on the way to this ideal is that an object's type is not fixed from call to call. Consider once again the Collection>>sort: method already familiar to us:
sort: criteria | left right mediane |
    (self size < 32) ifTrue: [ ^ self insertSort: criteria ].
    mediane <- self popFirst.
    left <- List new.
    right <- List new.
    self do: [ :x |
        (criteria value: x value: mediane)
            ifTrue: [ left add: x ]
            ifFalse: [ right add: x ] ].
    left <- left sort: criteria.
    right <- right sort: criteria.
    right add: mediane.
    ^ left appendList: right
What can be said about the types of the variables used? Without the execution context and the calling code, nothing definite. This is both the strength and the weakness of Smalltalk. A strength, because the code works equally well regardless of the receiver's class: it sorts a list and an array in exactly the same way. A weakness, because we cannot optimize at compilation time by relying on knowledge of object types.
We have a cunning plan that allows us to get a result close to the optimum, and its name is call statistics. As the program runs, the sendMessage handler is invoked and, in addition to its main job, updates statistics on which classes participate in message sends.
If we could postulate that the class hierarchy does not change as the program runs, we would be able to insert direct calls without any conditions. Unfortunately, harsh reality springs a surprise on us here too. For example, collecting statistics, we may find that in 10 consecutive calls the variable turned out to be of class String all 10 times. But this does not mean it will be so next time. Walking through a collection of objects, we may meet a wide variety of classes. Finally, even a method that previously returned a String instance may suddenly return nil, simply because the stars aligned that way.
Therefore, even with call statistics, the most we can do at this stage is to insert direct calls guarded by a check that the variable's type is one of the known ones. In practice, this means the patcher inserts switch blocks that check the class of the object and transfer control to the corresponding function.
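Roughly, what the patcher inserts can be pictured as follows (a hypothetical sketch: a single class check stands in for the generated switch blocks, and the function names are invented):

    struct TClass;
    struct TObject { TClass* klass; };

    // Stubs for illustration only.
    void* String_sortColon(TObject* receiver)         { return receiver; } // direct JIT call
    void* sendMessage(TObject* receiver, const char*) { return receiver; } // generic lookup

    TClass* stringClass = 0; // the class observed by the call statistics

    void* dispatch(TObject* receiver) {
        if (receiver->klass == stringClass)
            return String_sortColon(receiver);   // fast path: no lookup at all
        return sendMessage(receiver, "sort:");   // unknown class: fall back to sendMessage
    }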
Type inference
In the future, we plan to implement full-fledged type inference using the Hindley-Milner algorithm and additional heuristics. Then, in those places where a variable's type has been fully derived, it will be possible to make direct calls without any checks. This promises a great effect on performance, along with the ability to fully inline the called method into the calling code.
For example, looking at the Collection>>sort: method above, a careful reader may notice that the left and right variables are always initialized with instances of the List class. The virtual machine can deduce this from the following reasoning:
- The variable left is initialized with the result of the expression List new;
- In the expression List new, the message #new takes no parameters and has a fixed receiver (List);
- Sending the #new message to the List object is handled by the List>>new method;
- The List>>new method returns the result of the expression super new;
- In the current context, super of List means Collection;
- Sending the #new message to a Collection object is handled by the Object>>new method;
- The Object>>new method contains primitive 7, which always returns an object of a known class (an axiom).
The expressions are then folded in reverse order, leading to the statement: "the variable left is initialized with an object of the List class".
Knowing the class of the variables left and right, the remainder of the method can be optimized as well. The #add:, #sort:, and #appendList: operations are compiled as direct method calls without additional checks.
In fact, thorough type inference is desirable but not strictly necessary. It is enough to know that the type of a variable does not change from call to call; what exactly it is, we will find out at the first actual message send. Together, this allows us to eliminate type checks during program execution, reduce code size, and untie the optimizer's hands.
Compilation
Those who have read this far but have only a vague idea of what LLVM is are advised to read the numerous and quite good posts on Habr, or the articles Hacker's introduction to LLVM and the LLVM Language Reference Manual. In English, of course.
So, LLVM takes as its input an IR representation written as chains of instructions. The instructions are organized into entities called basic blocks. The defining feature of such a block is that it has exactly one entry and one exit. This is done intentionally, for convenience. Any instructions can be placed inside a block, except those that transfer control: conditional jumps, returns, and exception-throwing or exception-catching instructions cannot be in the middle of a block (function calls are allowed if they do not throw exceptions). If such an instruction is still needed, the block is split in two at that point, so that the problem instruction (in LLVM terminology, the terminator) ends up at the tail of the first half. Thus, the entire function consists of a set of basic blocks and transitions between them (the transition instructions themselves operate not on addresses but on the identifiers of the blocks to which control should be transferred).
Our task is to read the bytecodes of the method and recreate it in IR code, preserving the logic of the transitions, to eventually obtain a functional equivalent. Of course, this does not mean that we must repeat the bytecode operations word for word, but correctness must be preserved.
The first problem is that the bytecodes of a method are written contiguously, in a single array. Naturally, there are no basic blocks there, and all jump addresses are recorded as offsets relative to the beginning of the array. Therefore, the first thing we need to do is build the correspondence between the offsets in the jump instructions and the basic blocks. Currently this is done by a preliminary pass over the bytecodes (the scanForBranches method in MethodCompiler.cpp).
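A simplified pre-pass in the spirit of scanForBranches might look like this (the bytecode encoding below is invented for the example; the real method works on llst's actual instruction format):

    #include <cstdint>
    #include <set>
    #include <vector>

    enum Opcode : uint8_t { PushTemporary, SendMessage, BranchIfFalse, Branch };

    struct Instruction { Opcode opcode; uint16_t argument; };

    // Walk the bytecode once and record every jump target, so that a basic
    // block can later be created at each of those offsets.
    std::set<uint16_t> collectBranchTargets(const std::vector<Instruction>& bytecodes) {
        std::set<uint16_t> targets;
        for (const Instruction& insn : bytecodes) {
            if (insn.opcode == Branch || insn.opcode == BranchIfFalse)
                targets.insert(insn.argument); // an offset where a block must start
        }
        return targets;
    }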
With the skeleton of the future method's blocks in hand, we begin to fill it with instructions. The filling proceeds sequentially: we go from the beginning of the method, translating the Smalltalk instructions into the corresponding operations in IR code. Recall that push instructions are not encoded directly: instead, we push a TDeferredValue structure (see also TStackValue) onto the value stack, describing the required postponed action. Then, when we stumble upon an operation that removes such a value from the stack, we perform the deferred action and obtain the actual name that can now be used. In simple cases, which are the majority, actions are postponed for just a couple of instructions, so the actual position of the operation in the code does not change. In essence, two places in the code are logically linked, without the need to introduce intermediate values (or use the context stack). How exactly the transfer of the value will be implemented in the final code is up to LLVM.
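The deferred-value idea can be sketched in a few lines of C++ (a plain std::function stands in for the real TDeferredValue/TStackValue structures; the emitted names are illustrative):

    #include <functional>
    #include <stack>
    #include <string>

    typedef std::function<std::string()> DeferredValue; // emits code, returns the name

    std::stack<DeferredValue> valueStack;

    void compilePushLiteral(int index) {
        // Nothing is emitted yet: the load is postponed until the value is used.
        valueStack.push([index]() {
            // Here the real compiler would emit the load from the literals array.
            return "%literal" + std::to_string(index);
        });
    }

    std::string popValue() {
        DeferredValue deferred = valueStack.top();
        valueStack.pop();
        return deferred(); // the postponed action runs at the point of use
    }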
For example, if the bytecodes were:
0023 PushLiteral 4
0024 MarkArguments 1
0025 SendMessage new
0026 AssignTemporary 0
0027 DoSpecial popTop
... then when compiling this fragment, the sequence of actions will be as follows:
- Create a deferred operation "push the fourth literal onto the stack".
- The markArguments instruction requires removing a value from the stack. This triggers the deferred operation: the name lit0 is created and bound to a read of the fourth element of the literals array. This name is placed on the value stack of the current basic block.
- Create an argument array of one element and associate it with the name args0. Pop the prepared value off the value stack (this is lit0) and write it into the array: args0[0] = lit0.
- Call the sendMessage stub, passing it args0. The name bound to the result of the send is placed on the value stack.
- Store the result into the temporary variable temp0.
- Finally, drop the value from the value stack (popTop).
At first glance this looks scary, but in fact it all boils down to just a few assembly instructions. In particular, when forming the argument array, we keep only the necessary memory operations: read the literal, write it into the array, send the message, and write the result into the temporary variable.
The software VM, for the same set of operations, is forced to repeatedly read and write memory while working with the stack. It puts a value on the context stack only to immediately take it back and write it into the argument array. It then puts the pointer to the created argument array on the stack, pulls it back off the stack to use it in the message send, whose result is again put on the stack for a couple of moments, and so on. Every extra operation is lost performance, even on modern processors with fat data caches.
In our case, even the argument array is not created on the heap but placed on the real call stack, so allocating and filling the array happens quickly. Most importantly, LLVM, operating on the IR, is free to perform all kinds of optimizations that we do not see directly but that it considers appropriate. For example, it may happen that the same temporary value is used twice in a row; then, instead of re-reading the value, LLVM can reuse the previous name (if this does not affect the result). And these little things add up.
So far we have considered the case where the compiler's value stack is used locally. Things get considerably more complicated when branch operations come into play. All push operations are performed on the local value stack of the current basic block, but operations that take values from the stack may be non-local. In that case, the value has to be taken from a block higher up the transition hierarchy. For example: suppose there are two basic blocks, X and Y. Block X consists of the operations pushTemporary 0 and a branch to Y. Block Y begins with markArguments 1, which must pop a value from the stack, but the local value stack of block Y is empty. We therefore walk up to the predecessor block X and take the value from its stack. If block Y has several predecessor blocks X1..Xn, a φ-function is inserted, as SSA notation requires, selecting the value corresponding to whichever Xi control actually arrived from.
The details of this mechanism can be found in MethodCompiler::TJitContext::popValue() and in JITCompilation.txt.