Internal device llst, part 2 or Little Smalltalk + LLVM =

Hello! Together with humbug , we bring to your attention the third article from the Low Level Smalltalk (LLST) series. We hope that the article will be interesting not only for fans of ~~bicycles of~~ unusual programming languages, but also for those who are interested in such a wonderful thing as LLVM .

Let me remind you that the goal of our project is to create our own virtual machine compatible with Little Smalltalk at the byte-code level. The key difference is the heterogeneous architecture, which allows you to execute bytecodes both programmatically and compile them into low-level processor instructions by translating to LLVM IR code. Of course, the second method allows us to achieve higher performance and use the computing resources at our disposal in an optimal way.

However, first things first ...
')

New in version 0.2

Compared to the previous version described in the previous article , there are a lot of changes:

Show list

We connected the library readline . Now you can conveniently edit the command line; there is a history of previously entered commands (Ctrl + R) and auto-completion on the Tab key. In general, everything works the same as in any normal shell.
Added primitives work with files and implemented to write the image to disk. Now with the new VM you can do almost the same thing as the original one.
Implemented primitive multitasking (green threads) based on the class Scheduler. The plans - full multithreading.
Tests are written for basic operations with objects that greatly simplify subsequent debugging. Tests are awesome!
The pointers to the heap objects hptr<> . Previously, they used external std::list<> lists, now the list is maintained in a stack space. This alone sped up the software VM by half.
In branch 37 , the sketches of the Native API are made, which in the future will allow conveniently writing native methods completely without wrappers and other crutches. Methods are implemented "in place", in the same classes that describe the structure of equivalent entities from Smalltalk. Examples can be found here , here and here .
An attempt has been made to implement Generational GC based on the existing Baker. In fact, the difference is in the role of the halves of the heap and the storage of the list of references between generations. Thus, during a quick build it is necessary to run through only the younger generation and take from there objects that are referenced from older generations. The code is written, but not yet debugged.
Work has begun on rewriting ImageBuilder from scratch using Flex / Bison. The inherited utility has a lot of problems: you cannot pass parameters, there is no normal error output, “mystical” crashes when changing a pair of characters in the image code; the same “mystical” revivals when deleting, for example, a comment, etc. In addition, the utility in some cases generates an incorrect byte code. Of course, it is impossible to live like this, so we decided to redo it. At the moment, the grammar of Little Smalltalk is fully described; In addition to the constructions of the language, there are also control commands that are used to build the primary bootstrap image.
We got a domain name and moved to GitHub . Now the repository is available at llst.org . Pay attention also to the tracker .

And now, the most interesting:

Implemented the Smalltalk bytecode translation into the LLVM IR code. And this is done on the fly, right at the time of sending the message. That is, during the first call some time will be spent on translation and compilation of the method code (milliseconds), but subsequent calls will be executed natively.
An additional pass has been implemented, which will remove the extra intrinsics llvm.gcroot , to reduce the number of memory accesses (and increase the speed, of course).
Statistics are collected on the messages sent and hot-points of the methods. This is the basis for subsequent optimizations.
Finally, a pass for modifying “hot” methods is implemented, which, based on the collected statistics, changes the methods by introducing direct calls, bypassing caches and searching in class tables. This allows you to use the pipeline and the predictor of the transition processor in an optimal way, which has a positive effect on performance.

Currently, hot code with run-through optimizations runs from 2 to 100 times faster compared to the previous version of the software VM (depending on the executable code). Not bad, I think. Although there is still room for development. In fact, the simplest optimizations have been made that do not require complex graph analysis and type inference. If desired, and the availability of free time, you can do much more spectacular things.

What it looks like

Let's see how knowledge of call statistics can speed up code. We will conduct tests on the sorting algorithm already known to us, described in the last article, and also on the benchmark, which allows us to estimate the speed of processing messages. Those who wish to install the version locally, I advise you to either install the compiled version , or compile from source, after having read the points Usage and LLVM in the description of the repository on the githaba.

Let's start with the benchmark:

 loopBenchmark | sum | sum <- 0. 1 to: 100000 do: [ :x | sum <- sum + 1 ]. ^sum

This "ingenious" code a hundred thousand times adds one to the variable sum . Of course, at the output we must get the same one hundred thousand (or our VM can be sent to the garbage).

And here is the result of the run:

many beeches

 $ ./llst Image read complete. Loaded 5442 objects Soft run: 60 Cold jit run: 46 Hot jit run: 28 JIT Runtime stat: Messages dispatched: 200006 Objects allocated: 200004 Blocks invoked: 200002 Block cache hits: 199999 misses 3 ratio 100.00 % Message cache hits: 400004 misses 6 ratio 100.00 % Hot methods: Hit count Method name 200000 Block>>value: (0 sites) 2 Number>>to:do: (1 sites) value: (index 20, offset 109) class hits: (Block 200000) 2 Undefined>>loopBenchmark (1 sites) to:do: (index 15, offset 73) class hits: (SmallInt 2) 2 Block>>value (0 sites) Patching active methods that have hot call sites Recompiling method for patching: Number>>to:do: Patching Number>>to:do: ...done. Verifying ...done. Recompiling method for patching: Undefined>>loopBenchmark Patching Undefined>>loopBenchmark ...done. Verifying ...done. Optimizing Number>>to:do: ...done. Verifying ...done. Optimizing Undefined>>loopBenchmark ...done. Verifying ...done. Compiling machine code for Number>>to:do: ...done. Compiling machine code for Undefined>>loopBenchmark ...done. All is done. Patched cold jit run: 12 Patched hot jit run: 9 JIT Runtime stat: Messages dispatched: 200010 Objects allocated: 400008 Blocks invoked: 400004 Block cache hits: 399998 misses 6 ratio 100.00 % Message cache hits: 400006 misses 10 ratio 100.00 % Hot methods: Hit count Method name 200000 Block>>value: (0 sites) 4 Block>>value (0 sites) 2 Undefined>>loopBenchmark (0 sites) 2 Number>>to:do: (1 sites) value: (index 20, offset 109) class hits: (Block 200000) 2 Undefined>>loopBenchmark (1 sites) to:do: (index 15, offset 73) class hits: (SmallInt 2) ===-------------------------------------------------------------------------=== ... Statistics Collected ... ===-------------------------------------------------------------------------=== 2 adce - Number of instructions removed 2 branchfolding - Number of block tails merged 2 branchfolding - Number of dead blocks removed 1 cgscc-passmgr - Maximum CGSCCPassMgr iterations on one SCC 31 codegen-dce - Number of dead instructions deleted 63 codegenprepare - Number of GEPs converted to casts 9 codegenprepare - Number of blocks eliminated 114 codegenprepare - Number of memory instructions whose address computations were sunk 47 codegenprepare - Number of uses of Cast expressions replaced with uses of sunken Casts 313 dagcombine - Number of dag nodes combined 0 dse - Number of other instrs removed 12 dse - Number of stores deleted 54 gvn - Number of blocks merged 2 gvn - Number of instructions PRE'd 125 gvn - Number of instructions deleted 2 gvn - Number of loads PRE'd 37 gvn - Number of loads deleted 265 inline - Number of functions inlined 271 inline-cost - Number of call sites analyzed 263 instcombine - Number of dead inst eliminated 1 instcombine - Number of dead stores eliminated 67 instcombine - Number of instructions sunk 492 instcombine - Number of insts combined 159 isel - Number of blocks selected using DAG 7667 isel - Number of times dag isel has to try another path 101 jit - Number of bytes of global vars initialized 5310 jit - Number of bytes of machine code compiled 12 jit - Number of global vars initialized 239 jit - Number of relocations applied 3 jit - Number of slabs of memory allocated by the JIT 1 loop-simplify - Number of pre-header or exit blocks inserted 3 machine-licm - Number of hoisted machine instructions CSEed 11 machine-licm - Number of machine instructions hoisted out of loops 73 machine-sink - Number of machine instructions sunk 6 memdep - Number of block queries that were completely cached 11 memdep - Number of fully cached non-local ptr responses 43 memdep - Number of uncached non-local ptr responses 784 pei - Number of bytes used for stack in all functions 1 phi-opt - Number of dead PHI cycles 15 phielim - Number of atomic phis lowered 31 pre-RA-sched - Number of loads clustered together 48 reassociate - Number of insts reassociated 29 regalloc - Number of cross class joins performed 251 regalloc - Number of identity moves eliminated after coalescing 92 regalloc - Number of identity moves eliminated after rewriting 7 regalloc - Number of instructions deleted by DCE 4 regalloc - Number of instructions re-materialized 1 regalloc - Number of interferences evicted 251 regalloc - Number of interval joins performed 11 regalloc - Number of new live ranges queued 683 regalloc - Number of original intervals 369 regalloc - Number of registers assigned 1 regalloc - Number of registers unassigned 3 regalloc - Number of rematerialized defs for spilling 4 regalloc - Number of rematerialized defs for splitting 3 regalloc - Number of spilled live ranges 2 regalloc - Number of splits finished 4 simplifycfg - Number of blocks simplified 2 twoaddrinstr - Number of instructions aggressively commuted 2 twoaddrinstr - Number of instructions commuted to coalesce 4 twoaddrinstr - Number of instructions promoted to 3-address 30 twoaddrinstr - Number of two-address instructions 14 x86-codegen - Number of floating point instructions 1414 x86-emitter - Number of machine instructions emitted ->

Five lines are important here:

 Soft run: 60 Cold jit run: 46 Hot jit run: 28 Patched cold jit run: 12 Patched hot jit run: 9

The first line is the execution of the method by means of a software VM. The slowest way: no tricks, the code is executed extremely correctly, instruction by instruction. This mode is good for debugging, as it does not make any changes to the code; it is also used by the built-in image compiler when parsing the command line.

The second line is the “cold” JIT run. The method is translated into a functionally equivalent IR code, compiled into processor instructions that are already executed directly. Already at this stage, some optimizations are made, which will be discussed later. It can be seen that even with the compilation of the method, the final execution time is less than pure execution. This is not always the case. Often, for complex methods, the first execution may require ≈ 100 ms more time, but then we get a gain. In the same mode, statistics on calls is accumulated (each time a call handler is invoked, which runs the compiled method).

The third line is the “hot run”. The method already familiar JIT is called, so ready, compiled code is involved. There is no overhead projector - just checking for the presence of methods in the cache and a direct function call. The result is obvious.

The fourth line is the “cold” run of the patcher and the placement of direct calls (directs) where statistics show the feasibility of this event. In this case, the whole body is thrown out of the function (which is recompiled again), and then the patcher is passed. Full recompilation is needed so that the patcher can find the instruction that needs to be replaced with a direct call. The problem is that after optimizing passes (as well as preparing code for the GC), the method body may change beyond recognition. After all these manipulations, the IR code is compiled into real processor instructions and then executed.

The fifth line is the “hot” run of the patched and optimized method. Again, just the execution, without any tricks.

Such are the cakes ... Extrapolating the values, we get that about 12 million blocks per second are processed on my machine. From this constant, you can get an estimate of the number of processed messages. In the JIT code corresponding to the Number>>to:do: method, three operations are performed, equivalent to the following message parcels:

Sending < to the counter object inside #to:do: (loop condition);
Sending #value: to a block object (block call);
Sending + sum object (loop body).

We get a constant of 36 million, which can be viewed as a rough estimate above the number of messages sent per second.

At the same time, this is far from the speed limit. For example, if we rewrite the body of the benchmark using the whileTrue: construction whileTrue:

 loopBenchmark | sum | sum <- 0. [ sum < 1000000 ] whileTrue: [ sum <- sum + 1 ]. ^sum

then the results of the run on one million (instead of one hundred thousand, pay attention to the constant) will be:

 Soft run: 197 Cold jit run: 13 Hot jit run: 4

In this case, we get the equivalent already ≈ 8.1 ⋅ 10 ⁸ messages per second, or acceleration of 197/4 ≈ 50 times. This is because now the compiled version does not contain any real message sending operation. whileTrue: expands to a linear instruction set with transitions; all operations occur on numbers for which there are separate branches of execution that involve arithmetic operations directly.

The results of the runner with the patcher do not differ from the results of the JIT hot run, because there are no message packages, which means there is no statistics that can be collected, and there are no calls to the software VM that could be replaced by direct calls to JIT functions.

Of course, you need to be very careful about all sorts of benchmarks (especially integer) when it comes to VM performance. We use these figures only for a rough assessment of the optimization performed and as an indication that, on average, the code began to run faster.

Of course, the real numbers will be less, because here we have almost ideal conditions for optimization: cycles and arithmetic are compiled into native code entirely. The presence of direct transition instructions optimally utilizes the processor transition predictor, caches miss less. Hence, the increase in comparison with the non-optimized version, where for each parcel you need to go into the stub-processor, look into the cache and understand who really needs to address the message. Even taking into account 99% of hits, precious processor clock cycles are spent on it all the same.

The sorting test shows the results more modest:

 Soft run: 48 Cold jit run: 140 Hot jit run: 25 Patched cold jit run: 7 Patched hot jit run: 6

full output

 Preparing test data ...done Soft run: 48 Cold jit run: 140 Hot jit run: 25 JIT Runtime stat: Messages dispatched: 210613 Objects allocated: 17746 Blocks invoked: 43006 Block cache hits: 43001 misses 5 ratio 99.99 % Message cache hits: 369520 misses 51704 ratio 87.73 % Hot methods: Hit count Method name 44061 Link>>next (0 sites) 35102 MetaObject>>in:at:put: (0 sites) 27775 Link>>value (0 sites) 25778 Block>>value:value: (0 sites) 17746 Class>>new (0 sites) 17356 MetaLink>>value:next: (3 sites) new (index 3, offset 7) class hits: (MetaLink 17356) in:at:put: (index 11, offset 31) class hits: (MetaLink 17356) in:at:put: (index 18, offset 72) class hits: (MetaLink 17356) 17226 Block>>value: (0 sites) 15619 List>>add: (1 sites) value:next: (index 5, offset 13) class hits: (MetaLink 15619) 1999 List>>isEmpty (1 sites) = (index 4, offset 9) class hits: (SmallInt 1999) 1999 SmallInt>>= (0 sites) 1867 List>>insert:onCondition: (10 sites) isEmpty (index 3, offset 7) class hits: (List 1867) add: (index 10, offset 27) class hits: (List 130) value (index 33, offset 166) class hits: (Link 10419) value:value: (index 40, offset 210) class hits: (Block 10419) next (index 48, offset 268) class hits: (Link 1481) value:next: (index 50, offset 286) class hits: (MetaLink 1481) value:next: (index 57, offset 21) class hits: (Link 1481) next (index 68, offset 8) class hits: (Link 8938) value:next: (index 81, offset 24) class hits: (MetaLink 256) next: (index 83, offset 9) class hits: (Link 256) 1481 Link>>value:next: (0 sites) 392 List>>size (0 sites) 390 MetaList>>new (2 sites) new (index 4, offset 9) class hits: (MetaCollection 390) in:at:put: (index 12, offset 34) class hits: (MetaList 390) 384 Link>>next: (0 sites) 262 Collection>>sort: (13 sites) size (index 3, offset 7) class hits: (List 262) insertSort: (index 12, offset 34) class hits: (List 132) popFirst (index 21, offset 88) class hits: (List 130) new (index 26, offset 126) class hits: (MetaList 130) new (index 31, offset 158) class hits: (MetaList 130) value:value: (index 42, offset 219) class hits: (Block 15359) add: (index 49, offset 279) class hits: (List 8207) add: (index 56, offset 12) class hits: (List 7152) do: (index 59, offset 31) class hits: (List 130) sort: (index 64, offset 64) class hits: (List 130) sort: (index 70, offset 19) class hits: (List 130) add: (index 76, offset 4) class hits: (List 130) appendList: (index 81, offset 24) class hits: (List 130) 260 Link>>do: (2 sites) value (index 18, offset 72) class hits: (Link 260) value: (index 20, offset 82) class hits: (Block 260) 260 List>>do: (1 sites) do: (index 9, offset 25) class hits: (Link 260) 132 Collection>>insertSort: (4 sites) isEmpty (index 3, offset 7) class hits: (List 132) new (index 16, offset 55) class hits: (MetaList 130) insert:onCondition: (index 27, offset 130) class hits: (List 1867) do: (index 30, offset 143) class hits: (List 130) 130 List>>popFirst (3 sites) value (index 14, offset 43) class hits: (Link 130) next (index 19, offset 76) class hits: (Link 130) - (index 25, offset 111) class hits: (SmallInt 130) 130 SmallInt>>- (0 sites) 130 List>>appendList: (7 sites) firstLink (index 8, offset 21) class hits: (List 2) size (index 13, offset 40) class hits: (List 2) next (index 36, offset 181) class hits: (Link 8207) next (index 43, offset 234) class hits: (Link 8079) firstLink (index 54, offset 3) class hits: (List 128) next: (index 56, offset 12) class hits: (Link 128) size (index 61, offset 49) class hits: (List 128) 130 List>>firstLink (0 sites) 2 Collection>>sort (1 sites) sort: (index 10, offset 27) class hits: (List 2) 2 Block>>value (0 sites) ===-------------------------------------------------------------------------=== ... Statistics Collected ... ===-------------------------------------------------------------------------=== 2 adce - Number of instructions removed 14 branchfolding - Number of block tails merged 6 branchfolding - Number of branches optimized 5 branchfolding - Number of dead blocks removed 1 cgscc-passmgr - Maximum CGSCCPassMgr iterations on one SCC 38 codegen-dce - Number of dead instructions deleted 220 codegenprepare - Number of GEPs converted to casts 2 codegenprepare - Number of blocks eliminated 151 codegenprepare - Number of memory instructions whose address computations were sunk 123 codegenprepare - Number of uses of Cast expressions replaced with uses of sunken Casts 854 dagcombine - Number of dag nodes combined 250 dce - Number of insts removed 194 dse - Number of other instrs removed 158 dse - Number of stores deleted 51 gvn - Number of blocks merged 353 gvn - Number of instructions deleted 6 gvn - Number of loads PRE'd 277 gvn - Number of loads deleted 862 inline - Number of functions inlined 862 inline-cost - Number of call sites analyzed 1085 instcombine - Number of dead inst eliminated 69 instcombine - Number of instructions sunk 2540 instcombine - Number of insts combined 194 isel - Number of blocks selected using DAG 18193 isel - Number of times dag isel has to try another path 461 jit - Number of bytes of global vars initialized 12042 jit - Number of bytes of machine code compiled 25 jit - Number of global vars initialized 375 jit - Number of relocations applied 2 jit - Number of slabs of memory allocated by the JIT 15 llst - Number of removed loads from gc.root protected pointers <<<<<< 222 llst - Number of removed roots <<<<<< 4 machine-cse - Number of common subexpression eliminated 1 machine-licm - Number of hoisted machine instructions CSEed 14 machine-licm - Number of machine instructions hoisted out of loops 71 machine-sink - Number of machine instructions sunk 10 memdep - Number of block queries that were completely cached 81 memdep - Number of fully cached non-local ptr responses 84 memdep - Number of uncached non-local ptr responses 2792 pei - Number of bytes used for stack in all functions 9 phielim - Number of atomic phis lowered 2 phielim - Number of critical edges split 36 pre-RA-sched - Number of loads clustered together 23 reassociate - Number of insts reassociated 21 regalloc - Number of cross class joins performed 250 regalloc - Number of identity moves eliminated after coalescing 124 regalloc - Number of identity moves eliminated after rewriting 6 regalloc - Number of instructions deleted by DCE 1 regalloc - Number of interferences evicted 248 regalloc - Number of interval joins performed 21 regalloc - Number of new live ranges queued 1240 regalloc - Number of original intervals 891 regalloc - Number of registers assigned 1 regalloc - Number of registers unassigned 6 regalloc - Number of rematerialized defs for spilling 4 regalloc - Number of rematerialized defs for splitting 6 regalloc - Number of spilled live ranges 4 regalloc - Number of splits finished 13 simplifycfg - Number of blocks simplified 3 twoaddrinstr - Number of instructions re-materialized 43 twoaddrinstr - Number of two-address instructions 40 x86-codegen - Number of floating point instructions 2697 x86-emitter - Number of machine instructions emitted Patching active methods that have hot call sites Recompiling method for patching: MetaLink>>value:next: Patching MetaLink>>value:next: ...done. Verifying ...done. Recompiling method for patching: List>>add: Patching List>>add: ...done. Verifying ...done. Recompiling method for patching: List>>isEmpty Patching List>>isEmpty ...done. Verifying ...done. Recompiling method for patching: List>>insert:onCondition: Patching List>>insert:onCondition: ...done. Verifying ...done. Recompiling method for patching: MetaList>>new Patching MetaList>>new ...done. Verifying ...done. Recompiling method for patching: Collection>>sort: Patching Collection>>sort: ...done. Verifying ...done. Recompiling method for patching: Link>>do: Patching Link>>do: ...done. Verifying ...done. Recompiling method for patching: List>>do: Patching List>>do: ...done. Verifying ...done. Recompiling method for patching: Collection>>insertSort: Patching Collection>>insertSort: ...done. Verifying ...done. Recompiling method for patching: List>>popFirst Patching List>>popFirst ...done. Verifying ...done. Recompiling method for patching: List>>appendList: Patching List>>appendList: ...done. Verifying ...done. Recompiling method for patching: Collection>>sort Patching Collection>>sort ...done. Verifying ...done. Optimizing MetaLink>>value:next: ...done. Verifying ...done. Optimizing List>>add: ...done. Verifying ...done. Optimizing List>>isEmpty ...done. Verifying ...done. Optimizing List>>insert:onCondition: ...done. Verifying ...done. Optimizing MetaList>>new ...done. Verifying ...done. Optimizing Collection>>sort: ...done. Verifying ...done. Optimizing Link>>do: ...done. Verifying ...done. Optimizing List>>do: ...done. Verifying ...done. Optimizing Collection>>insertSort: ...done. Verifying ...done. Optimizing List>>popFirst ...done. Verifying ...done. Optimizing List>>appendList: ...done. Verifying ...done. Optimizing Collection>>sort ...done. Verifying ...done. Compiling machine code for MetaLink>>value:next: ...done. Compiling machine code for List>>add: ...done. Compiling machine code for List>>isEmpty ...done. Compiling machine code for List>>insert:onCondition: ...done. Compiling machine code for MetaList>>new ...done. Compiling machine code for Collection>>sort: ...done. Compiling machine code for Link>>do: ...done. Compiling machine code for List>>do: ...done. Compiling machine code for Collection>>insertSort: ...done. Compiling machine code for List>>popFirst ...done. Compiling machine code for List>>appendList: ...done. Compiling machine code for Collection>>sort ...done. All is done. Patched cold jit run: 7 Patched hot jit run: 6

Here you need to say a few words about what is optimized and how. First, now the patcher passes only on the functions of the methods. Blocks remain non-optimized. Secondly, for each operation of sending a message, an instance of the Array class is created, in which the message arguments are placed. Time is also being spent on this. Finally, inline methods are almost never used. It concerns only official functions (as we shall see later) and some trivial constructions. All this allows us to conclude that the potential for further optimization is far from exhausted.

To understand in more detail what is happening in the compilation process and during program execution, you need to understand the internal kitchen of the virtual machine. What we now do.

Smalltalk virtual machine

The virtual machine operates with objects.The operation is reduced to the exchange of messages between objects and sweeping debris behind them. In fact, the only serious operation that a virtual machine does is sending a message. Everything else, one way or another, comes down to the same premise.

To understand how a machine does this, you must first understand what the Smalltalk object is.

Objects

Any object can simply be represented by the following structure:

 struct TObject { TSize size; TClass* klass; union { TObject* fields[0]; uint8_t bytes[0]; }; };

Each object has a title in which the size of the object and a pointer to its class are recorded. Next are the fields of the object containing pointers to other objects. Of course, both the class and the fields are also objects, and therefore are represented by the same structures.

All objects in Smalltalk are a multiple of 4 bytes. This size is stored at zero offset, and the lower two bits have a special role - they store the Binary ( B ) and Indirect ( I ) flags . The B flag means that the object is binary, that is, it stores a raw set of bytes in the space reserved for the fields of ordinary objects. Such objects are, for example, strings (instances String). Byte codes of methods are stored in instances.ByteArraywhich are also binary objects. Binary objects are always padded to multiple lengths. Flag I is used by the garbage collector as it passes through the heap to mark those objects that have already been processed.

That leaves 30 bits to the size of the object. For ordinary objects, the size is considered in the fields (in multiples of 4 bytes), for binary objects - in bytes. Since both objects have the same, previously known title, its size is not taken into account in the total amount.

Smallint

All objects are located in memory aligned to 4 bytes. Thus, the lower two bits of the address are always equal to 0. This fact is used to store numbers up to 31 bits in the fields of the object. At the same time, the recorded number is multiplied by 2 (shifted 1 bit to the left), and the low-order bit is set to 1. The virtual machine is aware of this optimization, and in all places where the fields are accessed, checks whether the pointer to the object is really stored there or this information must be interpreted as a number.

It is important to note that this point is completely transparent to the rest of the system. The application programmer is not required to know what, where and how is stored. For example, it would be quite legal to write a command in the console 1 classand get the expected response.SmallInt, although this unit in the image will be represented by just such an “object” SmallInt.

This little trick can significantly reduce the amount of memory consumed, since only one bit more is used to write the number than its binary representation. If we use boxing , then for each number in addition to 4 bytes of a pointer to an object, another 8 bytes of the header and another 4 bytes of the actual data will be stored. Not the best option, I think.

Messages

As already described in the previous article, a message is a receiving object, plus a selector, plus a set of parameters. An important difference in sending a message from a function call is that until the last moment it is not known who will actually process the message. Only the object is known - the recipient of the message.

Sending a message begins with a search in the hierarchy of such a class that will be able to process the message. The search is conducted from the immediate class of the object, up the hierarchy, up toObject. I must say that this is a rather expensive operation, since you have to do a large number of memory accesses. Therefore, search results are cached. Thus, a complete search has to be done only once. The method caches are reset only in two cases - with the next garbage collection (method objects could move) and with the addition / removal of the next method (this could affect the hierarchy). Even with regular cleaning of caches during garbage collection, the percentage of hits remains very high (about 99%), so the time spent on finding a method can be considered on average insignificant.

Let's follow how the search takes place when a message is sent to an object 'Hello world'(class instance String) #isNil.

The search will be as follows:

hash(String, #isNil);
, ;
, String :
methods , ( Dictionary ) .
: ;
;
;
, , , ; ;
, :
( nil ), 4;
, .

If the method was not found in the class hierarchy, the virtual machine sends a message to the object #doesNotUnderstand:, which is guaranteed to be processed (at least in itself Object). In some cases, classes intentionally intercept this message for specific purposes. For example, proxy objects can do this, which then delivers the message to a valid address.

For the above message, the String>>isNilsearch string would be:
String→ Array→ Collection→ Magnitude→ Object.

After the method has been found, the virtual machine creates and populates the context object, and then proceeds to execute the method.

Context

The concept of context is inextricably linked with the operation of sending a message.

In traditional processor architectures, such as x86, there is the concept of a call stack . When a function is called, the parameters passed along with the return address are pushed onto the stack, after which the transition to the function body is performed. When exiting the function, respectively, the return address is removed from the top of the stack. It turns out that for each function on the stack lies the entire “hierarchy” of transitions from the moment the thread starts to the current function call.

In Smalltalk, everything is different. There is no single call stack. Instead, each time a message is sent, a context object is formed.which stores information relevant to that particular package. The same object is used by the virtual machine when executing the method body itself. Here is what it looks like:

 struct TContext : public TObject { TMethod* method; TObjectArray* arguments; TObjectArray* temporaries; TObjectArray* stack; TInteger bytePointer; TInteger stackTop; TContext* previousContext; };

method - this places the pointer to the method object found earlier that will process this message.
arguments - a pointer to the mass is stored here (instance Array) where the passed arguments are stored.
temporaries is a repository of temporary method variables. In particular, those that are written between characters | |in the method header.
stack — . . . , . .
bytePointer — ; IP . . .
stackTop — .
previousContext - and this is where the parent context is stored, from which method the message was sent. When returning a value from the current method, it will be put on top of the parent stack.

At each time point, the context has all the information regarding the current state of the method execution. This makes it possible to easily interrupt the execution of the method (for example, when the number of ticks allocated for it expires), and then return back later. There are also exotic use cases, including, for example, the implementation of continuations .

Methods

The methods are represented by the following objects:

 struct TMethod : public TObject { TSymbol* name; TByteObject* byteCodes; TSymbolArray* literals; TInteger stackSize; TInteger temporarySize; TClass* klass; TString* text; TObject* package; };

name - the actual name of the method, or rather the pointer to the symbol of this name.
byteCodes — ByteArray , - .
literals — . : , , . , .
stackSize — (. TContext::stack), .
temporarySize — , . .
klass — , .
text — .
package - the category of this method. It is intended for filtering lists of displayed methods in the user interface. Currently not used.

Method objects are generated when the primary image is compiled from the source imageSource.st by the ImageBuilder utility, and then saved to the resulting image file. A one-time method is also created when executing a command from the command line. In essence, the command line text is interpreted as the body of the method, which is then compiled and called in the usual way.

This is done like this. In the body of the method Undefined>>mainthere is a code:

 [ command <- String readline: '->'. command notNil ] whileTrue: [ command isEmpty ifFalse: [ command doIt printNl ] ]

First, using the readline library, we get the command string. Then a message is sent to the line object #doIt, the result of which is displayed on the screen. The method itself is #doItas follows:

 doIt | method | method <- Undefined parseMethod: 'doItCommand ^ ' + self. ^ method notNil ifTrue: [ ^ Context new perform: method withArguments: (Array new: 1) ]

All the magic is created in the method Undefined>>parseMethod:that, according to the source text, forms the object of the method #doItCommandusing the compiler implemented here in the text of the image. It turns out that Smalltalk compiles its own methods by means of a compiler written in Smalltalk itself and an integral part of the image. I find this moment rather amusing.

After the method object has been created, to call it, a context object is created with which the created method is executed. Since the new method was not added to the lists of methods of any of the classes, it will exist only at the time of its execution (well, until the next garbage collection).

Virtual Machine Instructions

A virtual machine can only execute instructions within methods. Outside code methods do not exist. Instructions are stored in the byteCodes field of the class instance Method. The instance itself also contains additional information, which is also used when initializing the context object (see above).

Since the article has already expanded greatly, here I will not describe in detail the format of the representation of bytecodes. I note only that one instruction can occupy one or two bytes.

Instructions for working with a stack of values

There are even push instructions pushing one of the objects to the top of the value stack. Along with the instruction code, an integer parameter is also specified, which is interpreted as an index for selecting an object from the corresponding data structure:

pushArgument - an argument from the current context argument array.
pushInstance is the field of the current object.
pushTemporary is a temporary variable from an array of current context variables.
pushLiteral is a literal from a constant array of literals of this method.

There are two more special push instructions that work a little differently:

pushConstant — : SmallInt , 0-9 , nil , true false , Undefined , True False .
pushBlock — Block , - , . bytePointer .

There are also several inverse operations. The following operations allow you to change the values of fields and temporary variables without removing the values from the stack. Fields and variables are also indexed with an optional integer parameter.

assignInstance - assigns a value to the field of the current object that lies on top of the stack.
assignTemporary - assigns a temporary variable the value that lies on top of the stack.

Since both the arguments and literals in the framework of a method call are considered constants, assignment operations for them do not exist. To remove the value from the stack, a separate operation ( popTop ) is provided , which will be discussed below.

Transition instructions

There are, of course, and transition instructions:

branchIfTrue — bytePointer , true .
branchIfFalse — , false .
branch — .

Although the procedure for sending a message is universal and can be used everywhere, in some cases its specialized implementations are used to speed up processing. Such special cases are the operations of sending unary and binary messages. Separate sendUnary and sendBinary instructions are provided for them . A normal message is sent by the sendMessage instruction .

When a message is sent, the arguments are pushed onto the stack, after which the markArguments N instruction is called . It removes N values from the stack and forms an object from them Array. Further, a pointer to this object returns to the top of the stack, which will be used when initializing the arguments field of the context object being created.

Return instructions

Sooner or later, the methods will need to somehow return. This is done using return instructions. The main one is stackReturn , which removes the value from the stack and passes it to the calling context, and then terminates the execution of the current method context.

In addition to returning an arbitrary value in Smalltalk, it is very often necessary to return self as a result. Therefore, a separate selfReturn instruction is provided for such an operation .

The last return instruction is blockReturn, explain which on the fingers is quite difficult. The basic idea is that control is not transferred to the generating context, but much higher, to the parent context of the method containing the declaration of the currently executing block. This behavior can be compared to the exclusion mechanism in other languages. But unlike exceptions, which occur only in special situations of the program and generally do not belong to the “normal” course of program execution, blockReturn is quite a regular operation from the point of view of a virtual machine and can be used in common code.

The easiest way to show it by example. This is the text of the method.Collection>>at:ifAbsent:

 at: value ifAbsent: exceptionBlock self do: [ :element | element = value ifTrue: [ ^element ] ]. ^exceptionBlock value

An expression ^elementthat is in the nested block itself will be compiled using the blockReturn statement . This is necessary because in reality, the block will be executed not in the current method, but much deeper. The method Collection>>at:ifAbsent:calls a method Collection>>do:, passing an external block as a parameter. The method Collection>>do:in turn will call Block>>value:for each element of the collection, passing it as a parameter to the block. And only inside Block>>value:is located the primitive number 8, which will already lead to the execution of the block code. Therefore, the code block decides that it is necessary to perform a return value element, it will do this using the blockReturn instruction , which will transfer control to the very top, beyondCollection>>at:ifAbsent:, returning the required value as the result of the message.

It should be noted that not every operator ^standing in the body of the block will be converted to a blockReturn instruction . Whenever possible, the compiler tries to expand the code into simpler instructions: embed the blocks in the body of the executing method and replace the block call with simple transitions according to the instructions. In this case, blockReturn will also be replaced by stackReturn or selfReturn .

Special instructions

In addition to the above instructions, there are some more auxiliary. These include popTop and dup instructions . The first simply removes the value from the top of the stack, which was put there earlier using one of the push instructions (or the virtual machine itself, as a result of sending a message). Typically, popTop is used after assignInstance or assignTemporary , to remove a value that is no longer needed from the stack.

The instruction dup , as you might guess from the name, duplicates the value on the stack, placing it alongside exactly the same. This instruction is used by the Smalltalk compiler when parsing complex expressions with conditions.

Execution method

As noted above, execution begins with the creation of a context object. After that, the virtual machine extracts and decodes the byte code of the first instruction. Then the machine begins to follow the instructions one by one, until it stumbles upon either the message being sent or the transition instruction. The method completes as soon as one of the return instructions is encountered.

We will follow the execution of the method and the work of the JIT compiler on the basis of the collection sorting method already familiar to us from the previous article:

 ->Collection viewMethod: #sort: sort: criteria | left right mediane | (self size < 32) ifTrue: [ ^ self insertSort: criteria ]. mediane <- self popFirst. left <- List new. right <- List new. self do: [ :x | (criteria value: x value: mediane) ifTrue: [ left add: x ] ifFalse: [ right add: x ] ]. left <- left sort: criteria. right <- right sort: criteria. right add: mediane. ^ left appendList: right

The result of compiling such a method is a whole sheet of instructions. A detailed explanation of the logic of the work is still beyond the scope of this article, so try to simply view the byte codes and understand which parts correspond to what. In fact, it is not so difficult. Unlike the traditional assembler, here the instructions are used rather monotonously and consistently. First, the stack of values is filled with arguments of the future call; then, using markArguments , an array of arguments is formed of them, which is already used in a particular message sending operation. Well, the transition instructions control all this disgrace. For convenience of reading, I repulsed with blank lines blocks of instructions related to one package and supplied them with comments:

 ->Collection methods at: #sort:; disassemble "  " 0000 PushArgument 0 "    self" 0001 MarkArguments 1 0002 SendMessage size 0003 PushLiteral 1 "  1        32" 0004 SendBinary < " " 0005 DoSpecial branchIfFalse 16 "     " "  :" 0008 PushArgument 0 0009 PushArgument 1 0010 MarkArguments 2 0011 SendMessage insertSort: 0012 DoSpecial stackReturn 0013 DoSpecial branch 17 "   :" 0016 PushConstant nil " " 0017 DoSpecial popTop "mediane <- self popFirst" 0018 PushArgument 0 0019 MarkArguments 1 0020 SendMessage popFirst 0021 AssignTemporary 2 0022 DoSpecial popTop "left <- List new" 0023 PushLiteral 4 0024 MarkArguments 1 0025 SendMessage new 0026 AssignTemporary 0 0027 DoSpecial popTop "right <- List new" 0028 PushLiteral 6 0029 MarkArguments 1 0030 SendMessage new 0031 AssignTemporary 1 0032 DoSpecial popTop "self do: ..." 0033 PushArgument 0 0034 PushBlock " " 0037 PushArgument 1 "criteria" 0038 PushTemporary 3 "x" 0039 PushTemporary 2 "mediane" 0040 MarkArguments 3 0041 SendMessage value:value: "  " 0042 DoSpecial branchIfFalse 52 "left add: x" 0045 PushTemporary 0 0046 PushTemporary 3 0047 MarkArguments 2 0048 SendMessage add: 0049 DoSpecial branch 56 "right add: x" 0052 PushTemporary 1 0053 PushTemporary 3 0054 MarkArguments 2 0055 SendMessage add: 0056 DoSpecial stackReturn 0057 MarkArguments 2 0058 SendMessage do: 0059 DoSpecial popTop "  left" 0060 PushTemporary 0 0061 PushArgument 1 0062 MarkArguments 2 0063 SendMessage sort: 0064 AssignTemporary 0 0065 DoSpecial popTop "  right" 0066 PushTemporary 1 0067 PushArgument 1 0068 MarkArguments 2 0069 SendMessage sort: 0070 AssignTemporary 1 0071 DoSpecial popTop "right add: mediane" 0072 PushTemporary 1 0073 PushTemporary 2 0074 MarkArguments 2 0075 SendMessage add: 0076 DoSpecial popTop "  " 0077 PushTemporary 0 0078 PushTemporary 1 0079 MarkArguments 2 0080 SendMessage appendList: " " 0081 DoSpecial stackReturn "( )" 0082 DoSpecial popTop 0083 DoSpecial selfReturn

Conclusion

... In general, this is all I wanted to tell about the internal structure of the Smalltalk virtual machine in this article. The format of the narrative requires conciseness, but this should be done not at the expense of understanding. I hope that I managed to find a middle ground.

We learned how objects look in a Smalltalk image, how numbers appear, figured out the algorithm for sending a message and finding the appropriate class, found out what the context object is; Finally, we got acquainted with the basic instructions of the virtual machine, examined the code of the sorting method already known to us and the instructions to which it is translated by the compiler.

In the next articleaddresses issues of JIT compiling Smalltalk methods into an intermediate IR code, understandable by LLVM. In turn, it will be compiled into the actual instructions of the processor. We will analyze the bytecodes of the method and try to understand what needs to be done to translate them to the IR in an optimal way.

Finally, a small survey:

Source: https://habr.com/ru/post/191250/

All Articles