QEMU.js: now in earnest and with WASM

Once upon a time I decided, for the sake of laughter, to prove the reversibility of the process and learn how to generate JavaScript (or rather, Asm.js) from machine code. QEMU was chosen for the experiment, some time later an article was written on Habr. In the comments, I was advised to redo the project on WebAssembly, and I myself didn’t want to throw the almost finished project myself ... The work was going, but very slowly, and now, recently in that article a comment appeared on the topic "So how did it all end?". In my detailed response, I heard "It pulls on the article." Well, since pulls, there will be an article. Maybe someone will come in handy. From it, the reader will learn some facts about the QEMU code generation device backend, as well as how to write the Just-in-Time compiler for the web application.

Tasks

Since I had already learned how to port QEMU to JavaScript somehow, this time it was decided to do it according to the mind and not repeat old mistakes.

Error number one: tap off point release

My first mistake was to branch off my version from upstream version 2.4.1. Then it seemed to me a good idea: if point release exists, then it is probably more stable than the simple 2.4, and even more so the master branch. And since I planned to add a fair amount of my bugs, I didn’t need strangers at all. So it probably happened. But bad luck: QEMU does not stand still, and at some point there even announced optimization of the generated code by 10 percent. "Yeah, I'm dead now," I thought ~~and broken off~~ . Here we must digress: due to the single-threaded nature of QEMU.js and the fact that the original QEMU does not imply the absence of multithreading (that is, the possibility of simultaneous operation of several unrelated code paths, and not just “zayuzat all cores” is critical), the main stream functions had to "turn out" to be able to call outside. This created some natural problems with the merger. However, the fact that some of the changes from the master branch, with which I tried to merge my code, were also cherry picked in point release (and, therefore, in my branch) too, probably would not have added convenience.

In general, I decided that a prototype still makes sense. ~~throw out~~ disassemble for parts and build a new version from scratch on the basis of something fresher and now from the master .

Mistake number two: TLP methodology

In essence, this is not a mistake, in general - just a feature of creating a project in conditions of complete misunderstanding as “where and how to move?”, And in general “will we get there?”. Under these conditions, tyap-bloop programming was a viable option, but, naturally, I didn’t want to repeat it at all unnecessarily. This time I wanted to do it according to my mind: atomic commits, deliberate code changes (rather than “stringing random characters together until it compiles (with warnings)”, as Linus Torvalds once said about someone, if you believe Wikisitnik), etc.

Mistake number three: not knowing ford climb into the water

I didn’t get rid of this even now, but now I decided to go not along the path of the least resistance, and make it “adult”, namely, write my TCG backend from scratch, so as not to say “Yes, this, Of course, slowly, but I can’t control everything - TCI is so written ... ". Besides, initially it seemed like an obvious solution, since I am generating the binary code . As the saying goes, “Gent y assembled, yes not that”: the code, of course, is binary, but the control cannot simply be transferred to it - it must be explicitly shoved into the browser for compilation, resulting in some object from the JS world, which still need to save somewhere. However, on ~~normal~~ RISC architectures, as I understand it, a typical situation is the need to explicitly reset the instruction cache for the regenerated code — if this is not what we need, then, in any case, it’s close. In addition, from my last attempt, I learned that the control in the middle of the translation unit does not seem to be transferred, so we don’t really need the bytecode interpreted from any offset, and we can simply generate a function on TB.

Came and kicked

Although I started rewriting the code back in July, the magic Pendel sneaked up unnoticed: usually letters from GitHub come as notifications about replies to Issues and Pull requests, and here, suddenly the mention in the thread Binaryen as a qemu backend in context, “Here it is— he did something like that, maybe he would say something. ” It was about using the related Emscripten Binaryen library to create a WASM JIT. Well, I said that you have an Apache 2.0 license there, and QEMU as a whole is distributed under GPLv2, and they are not very compatible. Suddenly it turned out that the license can be somehow corrected (I don’t know: maybe, change, maybe double licensing, maybe something else ...). This, of course, made me happy, because by that time I had already looked closely at the WebAssembly binary format , and I was somehow sad and incomprehensible. There was also a library, which would devour base blocks with a transition graph, and issue a baytkod, and even launch it itself in the interpreter, if necessary.

Then there was another letter on the QEMU mailing list, but this is more likely to the question, “And who needs it at all?”. And it suddenly turned out to be necessary. At a minimum, it is possible to scrape together such possibilities of use, if it works more or less quickly:

launching something tutorial without installation at all
Virtualization on iOS, where, according to rumors, the only application eligible for code generation on the fly is the JS engine (is it true?)
demonstration of mini-OS - single-disk, built-in, all sorts of firmware, etc ...

Browser runtime features

As I said, QEMU is tied to multithreading, but not in the browser. Well, that is, how not ... At first, it was not there at all, then WebWorkers appeared - as far as I understand, this is multithreading, based on the transfer of messages without mutually variable variables . Naturally, this creates significant problems when porting existing code based on a shared memory model. Then, under public pressure, it was implemented and it was called SharedArrayBuffers . It was gradually introduced, celebrated its launch in different browsers, then they celebrated the new year, and then Meltdown ... Then they came to the conclusion that you don’t blush — don’t blush the time dimension, but with the help of a shared memory and a counter incrementing flow, you’ll still get pretty sure . So disconnected multithreading with shared memory. It seems that it was later turned back on, but as it became clear from the first experiment, there is life without it, and if so, we will try to do it without laying down on multithreading.

The second feature is the impossibility of low-level manipulations with the stack: you can not just take, save the current context and switch to a new one with a new stack. The call stack is managed by the JS virtual machine. It would seem, what's the problem, since we still decided to manage the former threads completely manually? The fact is that block I / O in QEMU is implemented through cortinas, and here low-level stack manipulations would be useful to us. Fortunately, Emscipten already contains a mechanism for asynchronous operations, even two: Asyncify and Emterpreter . The first works through a significant swelling of the generated JavaScript code and is no longer supported. The second is the current "correct way" and works through the generation of bytecode for its own interpreter. It works, of course, slowly, but it does not inflate the code. True, the support for corutin for this mechanism had to be contributed independently (there were already corortines written under Asyncify and there was an implementation of approximately the same API for Emterpreter, it was just necessary to connect them).

At the moment, I have not yet had time to divide the code into WASM compiled and interpreted using Emterpreter, so block devices are not working yet (see the next series, as they say ...). That is, in the end, you should get this funny layered something:

interpretable block I / O. Well, what do you really expect emulated NVMe with native performance? :)
statically compiled QEMU master code (translator, other emulated devices, etc.)
dynamically compiled guest code in WASM

QEMU source features

As you probably already guessed, the emulation code of guest architectures and the code for generating host machine instructions from QEMU are separate. In fact, there is even a little more cunning:

there are guest architectures
There are accelerators , namely, KVM for hardware virtualization on Linux (for compatible guest and host systems), TCG for JIT code generation anywhere. Starting with QEMU 2.9, support for the HAXM hardware virtualization standard on Windows has appeared ( details )
if TCG is used, and not hardware virtualization, then it has separate support for code generation for each host architecture, as well as a universal interpreter
... and around it all - emulated peripherals, user interface, migration, record-replay, etc.

By the way, did you know: QEMU can emulate not only a computer as a whole, but also a processor for a separate user process in a host core, which is used, for example, by an AFL fuser for the instrumentation of binaries. Perhaps someone wants to port this QEMU mode of operation to JS? ;)

Like most long-standing free software, QEMU is built via configure and make . Suppose you decide to add something: TCG-backend, implementation of threads, something else. Do not be in a hurry to rejoice / be horrified (underline the necessary) the perspective of communication with Autoconf - in fact, QEMU's configure seems to be self-written and is not generated from anything.

WebAssembly

So what is this thing - WebAssembly (aka WASM)? This is a replacement for Asm.js, now no longer pretending to be valid JavaScript code. On the contrary, it is purely binary and optimized, and even just writing an integer into it is not very simple: it is stored for compactness in LEB128 format.

You may have heard about the relooping algorithm for Asm.js - this is the restoration of “high-level” flow control instructions (i.e. if-then-else, loops, etc.), which have been sharpened by JS engines, from the low-level LLVM IR, closer to the machine code executed by the processor. Naturally, the QEMU intermediate representation is closer to the second. It would seem, here it is, the baytkod, the end of torment ... And then the blocks, if-then-else and cycles! ..

And therein lies another reason why Binaryen is useful: it can naturally accept high-level blocks that are close to what will be stored in WASM. But he can also issue a code from the graph of base blocks and transitions between them. Well, I already said that it hides behind a convenient C / C ++ API storage format for WebAssembly.

TCG (Tiny Code Generator)

TCG was originally a backend for the C compiler. Then it apparently could not stand the competition with GCC, but eventually found its place in QEMU as a code generation mechanism for the host platform. There is also a TCG backend that generates some abstract baytkod, which is immediately executed by the interpreter, but I decided to use it this time. However, the fact that QEMU already has the ability to enable the transition to the generated TB via the tcg_qemu_tb_exec function tcg_qemu_tb_exec out to be very useful to me.

To add a new TCG backend to QEMU, you need to create a subdirectory tcg/< > (in this case, tcg/binaryen ), and there are two files in it: tcg-target.h and tcg-target.inc.c and prescribe everything This is a matter of configure . You can put other files there, but, as you can guess from the names of these two, they will both be included somewhere: one as a normal header file (it is included in tcg/tcg.h , and that one in other files in the tcg directories, accel and not only), the other - only as a code snippet in tcg/tcg.c , but it has access to its static functions.

Having decided that I would spend too much time on the detailed proceedings, as it is arranged, I simply copied the “skeletons” of these two files from another implementation of the backend, honestly indicating this in the license header.

The tcg-target.h contains mainly settings in the form tcg-target.h #define s:

how many registers and how wide is the target architecture (we have - how much we want, so much is - the question is more than what will be generated in a more efficient code by the browser on the "very target" architecture ...)
alignment of host instructions: on x86, and in TCI, instructions are not aligned at all, I'm going to put code and not instructions at all in the buffer, but pointers to the structures of the Binaryen library, therefore I will say: 4 bytes
what optional instructions can generate a backend - we include everything we find in Binaryen, let the rest of the accelerator be broken down into simpler ones
what is the size of the TLB cache asks for the backend. The fact is that in QEMU everything is serious: although there are helpers that implement load / store taking into account the guest MMU (and where is it now without it?), They retain their broadcast cache as a structure, which processing can be easily embedded straight into the broadcast blocks. The question is which offset in this structure is most effectively handled by a small and fast sequence of commands.
here you can tweak the assignment of one or two reserved registers, enable the TB call through the function and optionally describe a couple of small inline functions like flush_icache_range (but this is not our case)

The tcg-target.inc.c , of course, usually much larger in size and contains several required functions:

initialization, which specifies, among other things, restrictions on what instruction with which operands can work. Impudently copied by me from another backend
function that accepts a single inner bytecode statement
here you can put auxiliary functions, and also here you can use static functions from tcg/tcg.c

For myself, I chose the following strategy: in the first words of the next translation block, I wrote down four pointers: a start mark (some value in the 0xFFFFFFFF neighborhood, which determined the current TB state), a context, a generated module, and a magic number for debugging. First, the label was set to 0xFFFFFFFF - n , where n is a small positive number, and with each execution through the interpreter it increased by 1. When it reached 0xFFFFFFFE , the compilation took place, the module was saved in the function table, imported into a small “starter”, into which the execution went away from tcg_qemu_tb_exec , and the module was removed from the QEMU memory.

Paraphrasing the classics, “A crutch, how much of this sound for Proger's heart is intertwined ...”. However, the memory flowed away somewhere. And it was a memory driven by QEMU! I had a code that, when I wrote down the next instruction (well, that is, the pointer), deleted the one to which it was at this place earlier, but it did not help. In fact, in the simplest case, QEMU allocates memory at startup and writes the generated code there. When the buffer ends, the code is discarded, and the next one begins to be written in its place.

After studying the code, I realized that the crutch with the magic number made it possible not to fall on the destruction of the heap, freeing up something that was not on the uninitialized buffer on the first pass. But who rewrites the buffer to bypass my function then? As the developers of Emscripten advise, having rested against the problem, I ported the resulting code back into the native application, set Mozilla Record-Replay on it ... In general, as a result, I understood a simple thing: for each block, there is a struct TranslationBlock with its description. Guess where ... That's right, directly in front of the block, right in the buffer. Realizing this, I decided to tie it with crutches (at least a few), and just threw out the magic number, and transferred the remaining words to the struct TranslationBlock , making a simply connected list that you can quickly go through when resetting the broadcast cache and freeing the memory.

Some crutches remain: for example, marked pointers in the code buffer - some of them are simply BinaryenExpressionRef , that is, they look at expressions that need to be linearly put into the generated base unit, part is the condition of transition between the BBs, part is where to go. Well, there are already prepared blocks for Relooper, which need to be connected according to the conditions. In order to distinguish them, the assumption is that they are all aligned at least four bytes, so you can safely use the lower two bits under the label, you just need to remember to remove it if necessary. By the way, such labels are already used in QEMU to indicate the reason for leaving the TCG cycle.

Use Binaryen

Modules in WebAssembly contain functions, each of which contains a body, which is an expression. Expressions are unary and binary operations, blocks consisting of lists of other expressions, control flow, etc. As I said, the control flow is organized here as high-level branches, cycles, function calls, etc. Arguments are passed to functions not on the stack, but explicitly, as in JS. There are global variables, but I have not used them, so I will not tell about them.

Functions also have local variables numbered from zero, of type: int32 / int64 / float / double. In this case, the first n local variables are the arguments passed to the functions. Please note that even though everything is not quite low-level in terms of control flow, but whole numbers still do not carry the sign / unsigned attribute: how the number behaves depends on the operation code.

Generally speaking, Binaryen provides a simple C-API : you create a module, you create expressions in it - unary, binary, blocks from other expressions, control flow, etc. Then you create a function whose body you want to specify an expression. If you, like me, have a low-level transition graph, the relooper component will help you. As far as I understand, you can use high-level flow control in a block as long as it does not go beyond the block limit — that is, you can do the internal branch of the fast path / slow path inside the embedded TLB cache processing code, but you cannot interfere with the outer control flow - no . When you release a relooper, its blocks are released, when you release a module — the expressions, functions, etc., that are highlighted in its arena , disappear.

However, if you want to interpret the code on the go without unnecessary creations and deletions of the interpreter instance, it may make sense to bring this logic to a file in C ++, and from there directly manage the entire C ++ library API, bypassing the ready-made wrappers.

So, to generate code, you need

 //    (  ) BinaryenSetAPITracing(0); BinaryenSetOptimizeLevel(3); BinaryenSetShrinkLevel(2); //   BinaryenModuleRef MODULE = BinaryenModuleCreate(); //    ( ,   ) helper_type BinaryenAddFunctionType(MODULE, "helper-func", BinaryenTypeInt32(), int32_helper_args, ARRAY_SIZE(int32_helper_args)); // (int23_helper_args ^W ) //  -  // ...     -  :) //    BinaryenAddFunction(MODULE, "tb_fun", tb_func_type, func_locals, FUNC_LOCALS_COUNT, expr); BinaryenAddFunctionExport(MODULE, "tb_fun", "tb_fun"); ... BinaryenSetMemory(MODULE, (1 << 15) - 1, -1, NULL, NULL, NULL, NULL, NULL, 0, 0); BinaryenAddMemoryImport(MODULE, NULL, "env", "memory", 0); BinaryenAddTableImport(MODULE, NULL, "env", "tb_funcs"); //       assert (BinaryenModuleValidate(MODULE)); BinaryenModuleOptimize(MODULE);

… — , , — .

--, :

 static char buf[1 << 20]; BinaryenModuleOptimize(MODULE); BinaryenSetMemory(MODULE, 0, -1, NULL, NULL, NULL, NULL, NULL, 0, 0); int sz = BinaryenModuleWrite(MODULE, buf, sizeof(buf)); BinaryenModuleDispose(MODULE); EM_ASM({ var module = new WebAssembly.Module(new Uint8Array(wasmMemory.buffer, $0, $1)); var fptr = $2; var instance = new WebAssembly.Instance(module, { 'env': { 'memory': wasmMemory, // ... } ); //       instance! }, buf, sz);

- QEMU JS , ( ), . , translation block, , struct TranslationBlock .

, ( ) Firefox. Chrome - , - WebAssembly, ...

. , , - . , . , WebAssembly , JS, , , .

: 32- , Binaryen, - - 2 32- . , Binaryen . ?

, « , 32- Linux?» . , : 1 2 Gb.

- ( )

. , — . « : , ...».

 // 2gbubble.c // Usage: LD_PRELOAD=2gbubble.so <program> #include <sys/mman.h> #include <assert.h> void __attribute__((constructor)) constr(void) { assert(MAP_FAILED != mmap(1u >> 31, (1u >> 31) - (1u >> 20), PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0)); }

… Valgrind-, , , , , Valgrind :)

, - , ...

Source: https://habr.com/ru/post/451306/

All Articles