Ruby Inside. YARV Bytecode (I)

In this and the following articles I would like to tell you about the bytecode of YARV, the virtual machine used in Ruby MRI¹ 1.9.

First, a little history. The very first implementation of Ruby, which eventually became Ruby 1.8, had a very inefficient interpreter: on loading, the code was turned into an abstract syntax tree that was kept in memory in its entirety, and executing the code amounted to a trivial walk of that tree. Walking a huge tree (think of Rails, whose AST took about a dozen megabytes²) by following pointers through memory is slow in itself, because the processor cannot cache the code adequately; but even setting that aside, this implementation left no room for any optimizations at all. On top of that, the object system is so flexible that the methods of any object can be redefined, including those of the built-in class Fixnum, so arithmetic was performed by calling methods on objects (yes, 5 + 3 called the “+” method of the object 5, creating a stack frame). As a result, Ruby 1.8 turned out to be one of the slowest of the commonly used interpreted programming languages.

YARV (Yet Another Ruby VM), a stack-based virtual machine developed by Sasada Koichi and later merged into the main tree, fixed many of these flaws, if not all of them. The code is now translated into a compact representation, optimized³ and executed significantly faster than before.

Here, however, there is one important difference from other virtual machines. The bytecode that YARV generates can be saved, but it cannot be loaded: in the distributed version the bytecode loader is disabled (although it is present in the source code and can be enabled if necessary). The official reason is the absence of a verifier, but it seems to me that the real reason is that this bytecode is considered an internal format that can be changed at any time without regard for compatibility, and the developers want to keep it that way.
As a result, even though access to the bytecode is absolutely indispensable when analyzing performance or developing alternative interpreters, documentation on it is practically nonexistent. The best you can find is sites like YARV Instructions, which are simply the virtual machine definition files from the Ruby source code in parsed form. (I worked out the meaning of some of the fields in the bytecode dump header from variable names in a Japanese blog post.)

I would like to partly remedy this situation. In this and the next article I will describe what I have managed to understand about the structure of Ruby bytecode and how I have applied it in practice. I should say at once that I have not fully understood some of its features; I will point out such cases separately. Where there is no such remark, I have verified the information in practice and everything works as described.

Let's proceed to the bytecode itself. In Ruby⁴ there is a system class, RubyVM::InstructionSequence, which lets you compile arbitrary text into bytecode (as far as I know, it is impossible to get the bytecode of an already loaded program). In the simplest case it is enough to use the InstructionSequence.compile method, which returns an object of this class, and the InstructionSequence#to_a method, which returns a dump of the bytecode.

Readers who know Ruby will already have noticed that the dump must be an array, because the #to_a method, following the Convention over Configuration principle, should convert an object into an array.

A small digression is needed here. In the canonical implementation the bytecode, as its name suggests, is a sequence of bytes, and somewhere deep inside the interpreter that is exactly what it looks like. However, the representation that can be obtained with standard tools is an ordinary Ruby object, namely a tree of nested arrays. It contains only a minimal subset of standard types: Array, String, Symbol, Fixnum, Hash (only in the header), as well as nil, true and false. This is very convenient (and in the spirit of Ruby): you do not have to parse binary data and can work directly with a readable representation of it, without worrying about magic constants, opcode numbers and incompatible changes in future versions of the translator.
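As a rough check of the claim about the set of types, the sketch below walks a dump recursively and collects the classes it meets. The helper name dump_classes is my own invention for this illustration, not part of Ruby, and the resulting set may differ on interpreters other than the 1.9.x discussed here.

```ruby
# Recursively collect the classes of all objects found in a bytecode dump.
# `dump_classes` is a hypothetical helper written for this illustration.
def dump_classes(node, seen = [])
  seen << node.class unless seen.include?(node.class)
  case node
  when Array
    node.each { |child| dump_classes(child, seen) }
  when Hash
    node.each do |key, value|
      dump_classes(key, seen)
      dump_classes(value, seen)
    end
  end
  seen
end

dump = RubyVM::InstructionSequence.compile(%{puts "Hello, YARV!"}).to_a
puts dump_classes(dump).map { |klass| klass.name }.sort.join(", ")
```

The helper descends only into arrays and hashes, since those are the only containers the dump is expected to use.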

So, let's get a dump of a simple program:

ruby-1.9.2-p136 :001 > seq = RubyVM::InstructionSequence.compile(%{puts "Hello, YARV!"})
=> <RubyVM::InstructionSequence:<compiled>@<compiled>>
ruby-1.9.2-p136 :002 > seq.to_a
=> ["YARVInstructionSequence/SimpleDataFormat", 1, 2, 1, {:arg_size=>0, :local_size=>1, :stack_max=>2}, "<compiled>", "<compiled>", nil, 1, :top, [], 0, [], [1, [:trace, 1], [:putnil], [:putstring, "Hello, YARV!"], [:send, :puts, 1, nil, 8, 0], [:leave]]]

The dump consists of two parts: the header and the actual code. Consider the header fields.

"YARVInstructionSequence/SimpleDataFormat", 1, 2, 1 , {:arg_size=>0, :local_size=>1, :stack_max=>2}, "<compiled>", "<compiled>", nil, 1, :top, [], 0, []

The first four fields are essentially a magic value that identifies the bytecode, but the last three of them also form a version number in the order major, minor, format. (These are the fields I deciphered with the help of the Japanese blog. And no, this is far from obvious.)
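A minimal sketch of pulling these four fields out of a dump, assuming the layout described above (magic string first, then the version triple). The method name dump_signature and the hash keys are mine, chosen only for readability.

```ruby
# Split off the identification part of a dump: the magic string
# followed by the major, minor and format numbers.
# `dump_signature` is a hypothetical helper, not a Ruby API.
def dump_signature(source)
  dump = RubyVM::InstructionSequence.compile(source).to_a
  magic, major, minor, format_type = dump.first(4)
  { :magic => magic, :major => major, :minor => minor, :format => format_type }
end

p dump_signature(%{puts "Hello, YARV!"})
```

The version numbers depend on the interpreter, so a tool that reads dumps would probably want to check them before trusting the rest of the layout.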

"YARVInstructionSequence/SimpleDataFormat", 1, 2, 1, {:arg_size=>0, :local_size=>1, :stack_max=>2} , "<compiled>", "<compiled>", nil, 1, :top, [], 0, []

The fifth field is a hash containing several parameters of the stack frame that will be created for this piece of code. The purpose of :arg_size and :stack_max is, I think, obvious.

The :local_size parameter should, in theory, contain the number of local variables, but in fact it is always greater by 1. This extra one is hard-coded (compile.c, 342); at first I thought it was the slot where the value of self is stored, but self lives in the stack frame (which, if you think about it, is more logical).
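To observe the off-by-one yourself, the sketch below compares :local_size with the length of the list of local variables, which sits at index 10 of the dump in the layout walked through in this article. The helper name local_size_report is hypothetical. According to the text above, on 1.9.x the difference should come out as 1; other interpreter versions may report something else, so no particular value is assumed here.

```ruby
# Compare :local_size from the header hash (index 4) with the number
# of entries in the list of local variables (index 10).
# `local_size_report` is a hypothetical helper for this illustration.
def local_size_report(source)
  dump = RubyVM::InstructionSequence.compile(source).to_a
  misc   = dump[4]
  locals = dump[10]
  {
    :local_size => misc[:local_size],
    :locals     => locals,
    :difference => misc[:local_size] - locals.size
  }
end

p local_size_report("a = 1\nb = a + 1\nb")
```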

"YARVInstructionSequence/SimpleDataFormat", 1, 2, 1, {:arg_size=>0, :local_size=>1, :stack_max=>2}, "<compiled>", "<compiled>", nil, 1 , :top, [], 0, []

The next four fields contain the name of the method (or a pseudo-name, for example “block in main”); the name of the file in which it is defined, exactly as it was given when loading (for example, require '../something' produces a block in which this field contains '../something'); the full path to the file (probably for the debugger); and the line on which the definition of the corresponding code block begins.
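These fields can be filled with something more telling than "<compiled>" by passing the optional file, path and line arguments to InstructionSequence.compile. A small sketch; the file name "example.rb", the path "/tmp/example.rb" and the helper location_fields are made up for the example.

```ruby
# Read the four location fields (indices 5..8) from a dump compiled
# with an explicit file name, full path and starting line.
# `location_fields` is a hypothetical helper for this illustration.
def location_fields(source, file, path, line)
  dump = RubyVM::InstructionSequence.compile(source, file, path, line).to_a
  {
    :name       => dump[5],
    :file       => dump[6],
    :full_path  => dump[7],
    :first_line => dump[8]
  }
end

p location_fields("puts 1", "example.rb", "/tmp/example.rb", 10)
```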

"YARVInstructionSequence/SimpleDataFormat", 1, 2, 1, {:arg_size=>0, :local_size=>1, :stack_max=>2}, "<compiled>", "<compiled>", nil, 1, :top , [], 0, []

The next field contains the type of the code block. I have encountered the values :top (toplevel; "plain" code that is not nested in a method or a class), :block, :method, and :class.

The following values are defined in the Ruby source code (vm_core.h, 552): top, method, class, block, finish, cfunc, proc, lambda, ifunc, and eval. Most of them do not occur in bytecode and are probably assigned dynamically; for example, a block of type ifunc is created during yield when the passed block is a C function (vm_insnhelper.c, 721). The purpose of the others (except cfunc) is not clear to me at the moment. All I can say is that blocks of type lambda, judging by the code, are quite clearly generated when the AST is compiled, yet I have never come across one. Presumably this is related to optimization (which I have not looked into at all yet).
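The four types I ran into can be seen in a single dump. Nested code blocks appear inside the instructions as complete dumps of their own, so the sketch below simply searches the tree for arrays that start with the magic string and reads their type field (index 9). The helper block_types and the Greeter class are invented for the illustration.

```ruby
# Collect the type field of every code block nested anywhere in a dump.
# A nested block is recognised by the magic string in its first element.
# `block_types` is a hypothetical helper, not a Ruby API.
def block_types(node, found = [])
  return found unless node.is_a?(Array)
  found << node[9] if node.first == "YARVInstructionSequence/SimpleDataFormat"
  node.each { |child| block_types(child, found) }
  found
end

source = <<-RUBY
class Greeter
  def greet
    [1, 2].each { |n| puts n }
  end
end
RUBY

p block_types(RubyVM::InstructionSequence.compile(source).to_a).uniq
```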

"YARVInstructionSequence/SimpleDataFormat", 1, 2, 1, {:arg_size=>0, :local_size=>1, :stack_max=>2}, "<compiled>", "<compiled>", nil, 1, :top, [], 0 , []

The next two fields contain the list of local variables (an array of symbols, something like [:local1, :local2]) and the number of arguments. In some cases (for example, when there are arguments with default values, or arguments of the form *splat or &block) a more complex structure appears in place of the number; its format is not fully known to me, and I will look at it when I write about function calls.

The list of local variables is probably needed at runtime in order to implement the Binding class, without which it is impossible to build, say, a REPL.
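To look at these two fields for a method rather than for top-level code, one can dig the nested method block out of the dump. This is only a sketch: method_header is a hypothetical helper, and the arguments field is printed as-is because its exact shape depends on the method signature and on the interpreter version.

```ruby
# Find the first nested code block of type :method and return its
# list of locals (index 10) and its arguments field (index 11).
# `method_header` is a hypothetical helper for this illustration.
def method_header(node)
  return nil unless node.is_a?(Array)
  if node.first == "YARVInstructionSequence/SimpleDataFormat" && node[9] == :method
    return { :locals => node[10], :args => node[11] }
  end
  node.each do |child|
    found = method_header(child)
    return found if found
  end
  nil
end

source = "def add(x, y = 2)\n  z = x + y\n  z\nend"
p method_header(RubyVM::InstructionSequence.compile(source).to_a)
```

Note that the arguments x and y are listed among the locals together with z.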

"YARVInstructionSequence/SimpleDataFormat", 1, 2, 1, {:arg_size=>0, :local_size=>1, :stack_max=>2}, "<compiled>", "<compiled>", nil, 1, :top, [], 0, []

The penultimate field is the catch table, which still remains a complete mystery to me. This mystical structure holds both the constructs related to exceptions (catch and ensure) and entries that are somehow related to the implementation of the next, redo, retry, and break keywords; the first two, despite having entries in the catch table, do not use it at all.
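Without claiming to explain the structure, here is a sketch that at least shows which kinds of entries turn up in the catch table (index 12) for a piece of code with exception handling. It assumes that each entry is an array whose first element is a symbol naming the entry type; catch_entry_types is a hypothetical helper.

```ruby
# List the type symbols of the catch table entries of top-level code.
# Assumes each entry is an array with the entry type as its first element.
# `catch_entry_types` is a hypothetical helper for this illustration.
def catch_entry_types(source)
  catch_table = RubyVM::InstructionSequence.compile(source).to_a[12]
  catch_table.map { |entry| entry.first }
end

p catch_entry_types("begin\n  raise 'boom'\nrescue => e\n  e\nend")
```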

[
1,
[:trace, 1],
[:putnil],
[:putstring, "Hello, YARV!"],
[:send, :puts, 1, nil, 8, 0],
[:leave]
]

And finally, the last field is the actual code.

The code is an array containing a sequence of instructions interspersed with line numbers and labels: if an element is a number, it is a line number; if it is a symbol of the form :label_01, it is a label that can be jumped to; otherwise it is an array representing an instruction.

[:putstring, "Hello, YARV!"]
The first element of an instruction is always a symbol containing the name of the instruction; the remaining elements are, obviously, its arguments.
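The rules above are enough to take the code array apart mechanically. The sketch below sorts its elements into three groups by class and then splits every instruction into a name and its arguments. The helper classify_body is hypothetical; on interpreters other than 1.9.x, symbols that are not labels may also show up in the second group.

```ruby
# Sort the elements of the code array (the last field of the dump)
# into line numbers, symbols (labels) and instructions.
# `classify_body` is a hypothetical helper for this illustration.
def classify_body(source)
  body = RubyVM::InstructionSequence.compile(source).to_a.last
  {
    :lines        => body.select { |element| element.is_a?(Integer) },
    :symbols      => body.select { |element| element.is_a?(Symbol) },
    :instructions => body.select { |element| element.is_a?(Array) }
  }
end

parts = classify_body("x = 1\nif x > 0\n  puts 'positive'\nelse\n  puts 'other'\nend")

# Each instruction is an array: the name first, then the arguments.
parts[:instructions].each do |name, *operands|
  puts "#{name}: #{operands.inspect}"
end
```

A conditional is used in the sample program so that the compiler has to emit at least one label for the jump.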

The general principles of the virtual machine and a detailed description of the instructions will be in the next section.

¹ Matz Reference Implementation
² You can read about this, for example, here.
³ The translator's settings include about a dozen optimizations, among them peephole and tailcall optimization, as well as various caches and specialized instructions.
⁴ Hereinafter, Ruby means Ruby MRI 1.9.x.

P.S. And even under pain of death I will not write a word about Bra... you know what I mean.

Source: https://habr.com/ru/post/113592/
