Qemu.js with JIT support: the minced meat can be turned back after all

A few years ago, Fabrice Bellard wrote jslinux, a PC emulator written in JavaScript. After that there was at least Virtual x86 as well. But all of them, as far as I know, were interpreters, whereas Qemu, written much earlier by the same Fabrice Bellard, and probably any self-respecting modern emulator, JIT-compiles guest code into host code. It seemed to me that the time had come to tackle the inverse of the problem that browsers solve: JIT-compiling machine code into JavaScript, and the most logical way to do that was to port Qemu. Why Qemu, one might ask, when there are simpler and more user-friendly emulators, such as VirtualBox, which you just install and run? Because Qemu has several interesting features:

open source
ability to work without a kernel driver
ability to work in interpreter mode
support for a large number of both host and guest architectures

Regarding the third point, I can now explain that in TCI mode it is not the guest machine instructions themselves that are interpreted but the bytecode derived from them. That does not change the essence, though: to build and run Qemu on a new architecture, a C compiler is enough if you are lucky, and writing a code generator can be postponed.

And so, after two years of leisurely tinkering with the Qemu sources in my spare time, a working prototype appeared in which you can already run Kolibri OS, for example.

What is Emscripten?

Nowadays there are many compilers whose final output is JavaScript. Some, such as TypeScript, were originally conceived as a better way to write for the web. Emscripten, by contrast, is a way to take existing C or C++ code and compile it into a form the browser understands. On this page, many ports of well-known programs are collected: here, for example, you can look at PyPy, which, by the way, is claimed to already have a JIT. In fact, not every program can simply be compiled and run in the browser: there are a number of peculiarities you have to put up with, since the inscription on the same page speaks of compiling portable "C/C++ code to JavaScript". That is, there are a number of operations that are undefined behavior according to the standard but usually work on x86, for example unaligned access to variables, which is outright prohibited on some architectures. In general, Qemu is a cross-platform program, and I wanted to believe that it did not already contain much undefined behavior: take it and compile it, then tinker a little with the JIT, and off you go! But it was not to be...

First try

Generally speaking, I am not the first to come up with the idea of porting Qemu to JavaScript. Someone on the ReactOS forum asked whether this was possible with Emscripten. Even earlier there were rumors that Fabrice Bellard had done it personally, but that was about jslinux, which, as far as I know, is written from scratch and is precisely an attempt to achieve sufficient performance in JS by hand. Later, Virtual x86 was written; unobfuscated sources were published for it, and, as was claimed, the greater "realism" of its emulation made it possible to use SeaBIOS as firmware. In addition, there was at least one attempt to port Qemu using Emscripten, by socketpair, but that development, as far as I understand, was frozen.

So, it would seem, here are the sources, here is Emscripten: take it and compile. But there are also the libraries Qemu depends on, and the libraries those libraries depend on, and so on, and one of them is libffi, on which glib depends. There were rumors on the Internet that a large collection of library ports for Emscripten included it too, but that was somehow hard to believe: first, it was not made for building with a new compiler; second, it is too low-level a library to just take and compile to JS. And the matter is not only the assembly inserts: probably, if you go to perverse lengths, for some calling conventions you could lay out the necessary arguments on the stack and call the function without them. But Emscripten is a tricky thing: to make the generated code look familiar to the optimizer of the browser's JS engine, some tricks are used. In particular, there is the so-called relooping: the code generator takes the LLVM IR it receives, with its abstract branch instructions, and tries to recreate plausible ifs, loops, and so on. And how are arguments passed to a function? Naturally, as arguments of JS functions, that is, whenever possible, not via the stack.

At first the idea was to simply write a replacement for libffi in JS and run the regular tests, but in the end I got confused about how to make my own header files work with the existing code. What can you do; as they say, "either the tasks are that complicated or we are that stupid". I had to port libffi to one more architecture, so to speak. Fortunately, Emscripten has both macros for inline assembly (in JavaScript, yes: well, whatever the architecture, such is the assembler) and the ability to run code generated on the fly. In general, after fiddling with the platform-specific libffi fragments for a while, I got some code that compiled and ran it on the first test I came across. To my surprise, the test passed. Stunned by my own genius (no joke, it worked on the first launch), and still not believing my eyes, I went to look at the resulting code once more, to figure out where to dig next. Here I was stunned a second time: the only thing my ffi_call function did was report a successful call. The call itself never happened. So I sent my first pull request, fixing an error in the test that any olympiad contestant would recognize: real numbers should not be compared as a == b, and not even as a - b < EPS, because you must not forget the absolute value, otherwise 0 will turn out to be very much equal to 1/3... In general, I ended up with a libffi port that passes the simplest tests and with which glib compiles; I decided that if more was needed, I would add it later. Looking ahead, I will say that, as it turned out, the compiler did not even include the libffi functions in the final code.
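The comparison mistake mentioned above is easy to reproduce. The sketch below is in Python rather than the C of the libffi test suite, and the names EPS, broken_close, and close are made up for illustration; the tolerance value is also an assumption.

```python
EPS = 1e-6  # illustrative tolerance, not the value used in the libffi tests

def broken_close(a, b):
    # Missing abs(): any a smaller than b passes, however far apart they are.
    return a - b < EPS

def close(a, b):
    # Correct form: compare the absolute value of the difference.
    return abs(a - b) < EPS

print(broken_close(0.0, 1 / 3))  # True, so 0 "equals" 1/3
print(close(0.0, 1 / 3))         # False
```

A test built on the broken form passes whenever the computed value is too small, which is exactly what happens when the call under test never runs and leaves a zero behind.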

But, as I have already said, there are some limitations, and on top of the liberal use of various kinds of undefined behavior there was a more unpleasant feature: JavaScript by design does not support multithreading with shared memory. In principle, this can usually even be called a good idea, but not when porting code whose architecture is tied to C threads. Generally speaking, Firefox is experimenting with support for shared workers, and Emscripten has a pthread implementation for them, but I did not want to depend on that. I had to slowly root multithreading out of the Qemu code: that is, to find out where threads are started, move the body of the loop running in each thread into a separate function, and call these functions one by one from the main loop.
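The transformation can be pictured with a toy model. This is Python, not Qemu code, and cpu_step, rcu_step, and main_loop are hypothetical names standing in for the bodies of the former thread loops.

```python
def cpu_step(state):
    # One iteration of what used to be the body of the CPU thread's loop.
    state["executed_blocks"] += 1

def rcu_step(state):
    # One iteration of what used to be the body of the RCU thread's loop.
    state["rcu_runs"] += 1

def main_loop(state, steps, iterations):
    # Instead of running concurrently, the former thread bodies are
    # called one after another from a single loop.
    for _ in range(iterations):
        for step in steps:
            step(state)
    return state

state = main_loop({"executed_blocks": 0, "rcu_runs": 0},
                  [cpu_step, rcu_step], iterations=3)
print(state)
```

The price of this scheme is that no step may ever block waiting for another one, because the other one cannot run until the first returns.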

Second try

At some point it became clear that things were not moving, and that unsystematically scattering crutches around the code would not lead to anything good. Conclusion: the process of adding crutches must somehow be systematized. So I took version 2.4.1, which was fresh at the time (not 2.5.0, because who knows, the new version may have bugs that have not been caught yet, and I have enough bugs of my own), and the first thing I did was rewrite thread-posix.c in a safe way. Well, safe in the following sense: if anyone tried to perform an operation that would block, abort() was called immediately. Of course, this did not solve all the problems at once, but at least it was somehow nicer than quietly ending up with inconsistent data.
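A minimal sketch of the "fail instead of blocking" idea, in Python. The class and exception names are invented for this illustration, and raising an exception stands in for the abort() call of the real C code.

```python
class WouldBlock(Exception):
    """Raised where a real mutex would put the calling thread to sleep."""

class SingleThreadMutex:
    def __init__(self):
        self.locked = False

    def lock(self):
        # With only one thread, waiting for the lock can never succeed,
        # so fail loudly instead of blocking forever.
        if self.locked:
            raise WouldBlock("mutex is already held")
        self.locked = True

    def unlock(self):
        if not self.locked:
            raise RuntimeError("unlocking a mutex that is not held")
        self.locked = False

mutex = SingleThreadMutex()
mutex.lock()
try:
    mutex.lock()            # a second lock would block in a threaded program
except WouldBlock as exc:
    print("caught:", exc)
```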

In general, the Emscripten options -s ASSERTIONS=1 -s SAFE_HEAP=1 help a lot when porting code to JS: they catch some kinds of undefined behavior, such as accesses to an unaligned address (which does not agree at all with typed-array code like HEAP32[addr >> 2] = 1) or calling a function with the wrong number of arguments.
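Why an unaligned address and HEAP32[addr >> 2] do not get along can be shown with a small simulation. This is a Python model of the indexing arithmetic only; store32_typed_array and store32_checked are hypothetical names, and the check is merely in the spirit of what a safe-heap mode does, not Emscripten's actual code.

```python
import struct

heap = bytearray(16)  # stand-in for the Emscripten heap

def store32_typed_array(addr, value):
    # Mimics HEAP32[addr >> 2] = value: the two low bits of addr are dropped.
    index = addr >> 2
    struct.pack_into("<i", heap, index * 4, value)

def store32_checked(addr, value):
    # Refuse unaligned addresses instead of silently writing elsewhere.
    if addr % 4 != 0:
        raise ValueError(f"unaligned 32-bit store at address {addr}")
    store32_typed_array(addr, value)

store32_typed_array(6, 1)      # the program meant bytes 6..9
print(heap[4:8], heap[8:12])   # but bytes 4..7 were written instead
```

On x86 the same store would simply land at bytes 6..9, which is why such code can go unnoticed for years in a native build.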

By the way, alignment errors are a separate topic. As I have already said, Qemu has a "degenerate" interpreting code generation backend, TCI (Tiny Code Interpreter), and to build and run Qemu on a new architecture, a C compiler is enough if you are lucky. The key words are "if you are lucky". I was not lucky: it turned out that TCI uses unaligned access when parsing its bytecode. That is, on all sorts of ARMs and other architectures that require aligned access, Qemu compiles because there is a normal TCG backend for them that generates native code, and whether TCI will work on them is another question. However, as it turned out, something like this was clearly stated in the TCI documentation. In the end, I added to the code calls to the functions for unaligned reads that I found in another part of Qemu.
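The usual way to read a multi-byte value from an arbitrary offset without an unaligned load is to assemble it byte by byte. The Python sketch below shows only that idea; read_u32_le is a made-up name, the bytecode layout is invented, and the text does not say which Qemu functions were actually used.

```python
def read_u32_le(buf, offset):
    # Assemble the value one byte at a time, so no 32-bit load ever
    # touches an address that is not a multiple of four.
    return (buf[offset]
            | buf[offset + 1] << 8
            | buf[offset + 2] << 16
            | buf[offset + 3] << 24)

# Invented layout: a one-byte opcode followed by a 32-bit operand,
# which therefore starts at the odd offset 1.
bytecode = bytes([0x01, 0x78, 0x56, 0x34, 0x12])
print(hex(read_u32_le(bytecode, 1)))  # 0x12345678
```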

Heap corruption

As a result, the unaligned accesses in TCI were fixed, and a main loop was made that called the processor, RCU, and a few other small things in turn. And so I run Qemu with the option -d exec,in_asm,out_asm, which means that it should report which blocks of code are executed and also, at translation time, write what the guest code was and what host code it became (in this case, bytecode). It starts, executes several translation blocks, prints the debug message I left saying that RCU is about to start, and... crashes on abort() inside the free() function. Poking around in free() revealed that the heap block header, which lies in the eight bytes preceding the allocated memory, contained garbage instead of the block size or something like that.

Heap corruption, how nice... In such a case there is a useful remedy: build a native binary from (as far as possible) the same sources and run it under Valgrind. After a while the binary was ready. I launch it with the same options, and it crashes during initialization, before even reaching execution proper. Unpleasant, of course: evidently the sources were not exactly the same, which is not surprising, because configure probed slightly different options. But I have Valgrind: first I will fix this bug, and then, if I am lucky, the original one will show up. I run the same thing under Valgrind... Y-y-y, y-y-y, uh-uh-uh: it started, passed initialization normally, and went on past the original bug without a single warning about invalid memory access, let alone any crashes. Life, as they say, had not prepared me for this: a crashing program stops crashing when run under Valgrind. What it was remains a mystery. My hypothesis: since gdb showed, in the vicinity of the current instruction after the crash during initialization, memset working on a valid pointer using either mmx or xmm registers, maybe it was some kind of alignment error, although that is still hard to believe.

OK, Valgrind does not seem to be of help here. And here the most disgusting part began: everything seems to even start, but it crashes for completely unknown reasons because of an event that could have happened millions of instructions earlier. For a long time it was not even clear how to approach it. In the end I still had to sit down and debug. Printing what the header had been overwritten with showed that it did not look like a number but rather like some binary data. And, lo and behold, this binary string was found in the BIOS file: so now it could be said with reasonable confidence that it was a buffer overflow, and it was even clear that it happened while writing to this buffer. From there it went something like this: in Emscripten, fortunately, there is no address space randomization, and there are no holes in the address space either, so you can put code somewhere in the middle of the program that prints the data at a pointer taken from the previous run, look at the data, look at the pointer, and, if it has not changed, get some food for thought. True, a couple of minutes are spent on linking after every change, but what can you do. As a result, the specific line was found that copied the BIOS from a temporary buffer into guest memory, and, indeed, there was not enough space in the buffer. Searching for the origin of that strange buffer address led to the function qemu_anon_ram_alloc in the file oslib-posix.c. The logic there was as follows: sometimes it can be useful to align the address to a huge 2 MB page, and for that we ask mmap for a bit more than needed and then give the excess back with munmap. And if such alignment is not required, then instead of 2 MB we use the result of getpagesize(), since mmap will return an aligned address anyway... Well, in Emscripten mmap simply calls malloc, and that, of course, does not align to a page. In general, the bug that had frustrated me for a couple of months was fixed by a two-line change.
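The arithmetic behind this bug can be sketched in a few lines. This is a generic over-allocate-and-round-up scheme written in Python on plain integers, not the author's two-line patch and not the actual qemu_anon_ram_alloc code; align_up, aligned_alloc_sketch, and fake_malloc are hypothetical names.

```python
def align_up(addr, align):
    # Round addr up to the next multiple of align (align is a power of two).
    return (addr + align - 1) & ~(align - 1)

def aligned_alloc_sketch(raw_alloc, size, align):
    # Ask for align extra bytes, then skip forward to an aligned address.
    # The usable region [ptr, ptr + size) always fits inside the raw block.
    raw = raw_alloc(size + align)
    ptr = align_up(raw, align)
    assert ptr + size <= raw + size + align
    return ptr

# A fake allocator that, like malloc, returns addresses that are not page-aligned.
fake_malloc = lambda n: 0x10008
ptr = aligned_alloc_sketch(fake_malloc, size=0x20000, align=0x1000)
print(hex(ptr))  # 0x11000
```

If the extra bytes are not requested because the allocator is assumed to return aligned addresses already, rounding the pointer up pushes the end of the region past the end of the block, which would explain an overflow like the one described.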

Peculiarities of function calls

And now the processor is computing something, Qemu does not crash, but the screen does not turn on, and the processor quickly falls into a loop, judging by the output of -d exec,in_asm,out_asm. A hypothesis appeared: timer interrupts (or perhaps all interrupts) are not arriving. And indeed, if you disable interrupts in the native build, which for some reason worked, you get a similar picture. But the answer turned out to be quite different: comparing the traces produced with the option above showed that the execution paths diverge very early. It must be said here that comparing the log recorded via emrun with the output of the native build is not an entirely mechanical process. I do not know exactly how a program running in the browser connects to emrun, but some lines in the output end up reordered, so a difference in the diff is not yet a reason to assume that the paths have diverged. In general, it became clear that at the ljmpl instruction the jump goes to different addresses, and the generated bytecode is fundamentally different: in one case it contains an instruction calling a C helper function, in the other it does not. After googling the instruction and studying the code that translates it, it became clear that, first, immediately before it a write to the cr0 register was made, also via a helper, which switches the processor to protected mode, and second, that the JS version never switched to protected mode.

The thing is that another peculiarity of Emscripten is its unwillingness to tolerate code like the implementation of the call instruction in TCI, which casts any function pointer to the type long long f(int arg0, .. int arg9): functions must be called with the correct number of arguments. If this rule is violated, then, depending on the debug settings, the program will either crash (which is good) or call the wrong function altogether (which is sad to debug). There is also a third option: enabling the generation of wrappers that add or drop arguments, but in total these wrappers take up a lot of space, even though I actually need only a little over a hundred of them. That alone is quite sad, but there turned out to be a more serious problem: in the generated code of the wrapper functions the arguments were converted back and forth, only the function with the resulting arguments was sometimes never called, just like in my libffi implementation. That is, some helpers were simply not executed.
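What such a wrapper does can be shown with a toy model in Python. The helper bodies, make_wrapper, and call_table are all invented for this sketch; only the idea of a caller that always passes ten arguments comes from the text.

```python
def helper_lock():
    return "lock"

def helper_write_eflags(env, value, mask):
    return ("write_eflags", env, value, mask)

def make_wrapper(func, nargs):
    # The interpreter always passes ten arguments; the wrapper forwards
    # only the first nargs of them, which is what the helper really takes.
    def wrapper(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9):
        args = (a0, a1, a2, a3, a4, a5, a6, a7, a8, a9)
        return func(*args[:nargs])
    return wrapper

call_table = {
    "lock": make_wrapper(helper_lock, 0),
    "write_eflags": make_wrapper(helper_write_eflags, 3),
}

print(call_table["write_eflags"](1, 2, 3, 0, 0, 0, 0, 0, 0, 0))
```

With one wrapper per helper, the call site keeps its single ten-argument signature while every helper is still called with exactly the arguments it declares.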

Fortunately, in Qemu there are machine-readable lists of helpers in the form of a header file like

 DEF_HELPER_0(lock, void)
 DEF_HELPER_0(unlock, void)
 DEF_HELPER_3(write_eflags, void, env, tl, i32)

They are used in a rather amusing way: first, the DEF_HELPER_n macros are redefined in the most bizarre ways, and then helper.h is included. It goes as far as expanding the macro into a structure initializer followed by a comma, then defining an array, and in place of its elements writing #include <helper.h>. As a result, I finally had a reason to try the pyparsing library, and wrote a script that generates exactly those wrappers for exactly those functions that need them.
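To give a feel for how little is needed to read such a list, here is a small sketch. It uses the standard re module instead of pyparsing, handles only the plain DEF_HELPER_n form shown above, and parse_helpers is a hypothetical name; the author's actual script is not reproduced in the article.

```python
import re

HELPER_RE = re.compile(r"DEF_HELPER_(\d+)\(([^)]*)\)")

def parse_helpers(text):
    helpers = []
    for count, body in HELPER_RE.findall(text):
        # First field is the helper name, second the return type,
        # the rest are argument types.
        name, ret, *args = [part.strip() for part in body.split(",")]
        if len(args) != int(count):
            raise ValueError(f"{name}: expected {count} arguments, got {len(args)}")
        helpers.append({"name": name, "ret": ret, "args": args})
    return helpers

header = """
DEF_HELPER_0(lock, void)
DEF_HELPER_0(unlock, void)
DEF_HELPER_3(write_eflags, void, env, tl, i32)
"""

for h in parse_helpers(header):
    print(h["name"], h["ret"], h["args"])
```

From a list like this, a generator can emit one fixed-arity wrapper per helper instead of wrappers for every function in the program.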

And so, after that, the processor seemed to work. Seemed, because the screen was never initialized, although memtest86+ did run in the native build. Here it should be clarified that Qemu's block I/O code is written using coroutines. Emscripten has its own very intricate implementation of them, but it still had to be supported in the Qemu code, and the processor could be debugged right away: Qemu supports the options -kernel, -initrd, -append, which let you boot Linux or, for example, memtest86+ without using block devices at all. But here is the bad luck: in the native build I could watch the Linux kernel output on the console with the -nographic option, while from the browser no output reached the terminal from which emrun was launched. That is, it was unclear whether the processor was not working or the graphics output. And then it occurred to me to wait a little. It turned out that "the processor is not asleep, it just blinks slowly", and after about five minutes the kernel dumped a batch of messages to the console and went on to hang. It became clear that the processor, on the whole, works, and that I needed to dig into the SDL2 code. Unfortunately, I do not know how to use that library, so in places I had to act at random. At some point the line parallel0 flashed on the screen on a blue background, which suggested some thoughts. In the end it turned out that Qemu opens several virtual windows inside one physical window, which you can switch between with Ctrl-Alt-n: this works in the native build but not in Emscripten. After getting rid of the extra windows with the options -monitor none -parallel none -serial none and adding an instruction to forcibly redraw the whole screen on every frame, everything suddenly worked.

Coroutines

So, emulation in the browser works, but nothing interesting that boots from a single disk can be run in it, because there is no block I/O: coroutine support has to be implemented. Qemu already has several coroutine backends, but because of the peculiarities of JavaScript and the Emscripten code generator you cannot just start juggling stacks. It would seem that "all is lost, the cast is coming off", but the Emscripten developers have already taken care of everything. It is implemented in a rather amusing way: let us call suspicious any call to a function like emscripten_sleep and a few others that use the Asyncify mechanism, as well as calls through pointers and calls to any function in which one of the previous two cases can occur further down the stack. Now, before every suspicious call we set up an async context, and immediately after the call we check whether an asynchronous call has occurred; if it has, we save all local variables into that async context, record which function control should be passed to when execution needs to continue, and exit the current function. Here there is plenty of room to study how the code swells: for the purpose of continuing execution after returning from an asynchronous call, the compiler generates "stubs" of the function that start right after a suspicious call. So if there are n suspicious calls, the function gets inflated roughly n/2 times, and that is without counting that saving of some local variables has to be added to the original function after each potentially asynchronous call. Later I even had to write a simple Python script that, given a set of especially heavily used functions that supposedly "do not let asynchrony through themselves" (that is, stack unwinding and everything I just described never happens in them), indicates the calls through pointers, and in which functions, that the compiler should ignore, so that these functions are not treated as asynchronous. Otherwise JS files of 60 MB are clearly too much; let it be at least 30. Although, once I was setting up a build script and accidentally dropped the linker options, among which was -O3. I run the generated code, and Chromium eats up all the memory and crashes. Afterwards I happened to look at what it had been trying to load... Well, what can I say: I would hang too if I were asked to thoughtfully study and optimize 500+ MB of JavaScript.
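The save-and-resume scheme described above can be modelled by hand on one tiny function. This Python sketch only illustrates the idea as the text describes it; it is not what Emscripten actually generates, and AsyncContext, maybe_sleep, work, work_after_sleep, and run are all hypothetical names.

```python
class AsyncContext:
    def __init__(self):
        self.unwinding = False   # set by the "asynchronous" callee
        self.saved_locals = None
        self.resume = None       # which stub to run when execution continues

def maybe_sleep(ctx, should_sleep):
    # Stand-in for a suspicious call such as emscripten_sleep.
    if should_sleep:
        ctx.unwinding = True

def work(ctx, x, should_sleep):
    y = x * 2                       # code before the suspicious call
    maybe_sleep(ctx, should_sleep)
    if ctx.unwinding:
        # Save the live locals, note where to continue, and leave.
        ctx.saved_locals = {"y": y}
        ctx.resume = work_after_sleep
        return None
    return work_after_sleep(ctx, {"y": y})

def work_after_sleep(ctx, saved):
    # Generated stub: the tail of work() starting right after the call.
    return saved["y"] + 1

def run(x, should_sleep):
    ctx = AsyncContext()
    result = work(ctx, x, should_sleep)
    if ctx.unwinding:
        # Back at the top of the stack; later (e.g. from a timer) we resume.
        ctx.unwinding = False
        result = ctx.resume(ctx, ctx.saved_locals)
    return result

print(run(5, False), run(5, True))
```

Both paths give the same result, but the function now exists in two pieces plus the saving code, which is where the size growth comes from when a function contains many suspicious calls.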

Unfortunately, the checks in the Asyncify support library code were not entirely compatible with longjmp, which is used in the virtual processor code, but after a small patch that disables these checks and forcibly restores contexts as if everything were fine, the code worked. And then a strange thing began: sometimes checks in the synchronization code fired, the very ones that crash the program if, by the logic of execution, it ought to block: someone was trying to acquire an already acquired mutex. Fortunately, this turned out not to be a logical problem in the serialized code. I was simply using the standard main loop functionality provided by Emscripten, but sometimes an asynchronous call unwound the stack completely, and at that moment a setTimeout from the main loop fired, so the code entered a main loop iteration without having left the previous one. I rewrote it as an infinite loop with emscripten_sleep, and the mutex problems stopped. The code even became more logical: after all, I do not really have code that prepares the next animation frame; the processor just computes something and the screen is periodically updated. However, the problems did not end there: sometimes Qemu would simply terminate silently, without any exceptions or errors. At the time I gave up on it, but, looking ahead, I will say that the problem was this: the coroutine code does not actually use setTimeout at all (or at least not as often as you might think): the function emscripten_yield simply sets the asynchronous call flag. The whole point is that emscripten_coroutine_next is not an asynchronous function: internally it checks the flag, resets it, and passes control where it is needed. That is, stack unwinding ends at it. The problem was that, because of a use-after-free that showed up when the coroutine pool was disabled (I had not copied an important line of code from an existing coroutine backend), the function qemu_in_coroutine returned true when it should have returned false. This led to a call to emscripten_yield with no emscripten_coroutine_next above it on the stack; the stack unwound to the very top, but no setTimeout, as I said, had been set.
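The silent exit can be reproduced in a toy model. In this Python sketch an exception plays the role of stack unwinding, which is a modelling shortcut (the mechanism described in the text works with a flag and ordinary returns), and every function name here is invented.

```python
class Unwind(Exception):
    """Models the stack being unwound after an asynchronous-call flag is set."""

def coroutine_yield():
    # Like the described emscripten_yield: start unwinding, set no timer.
    raise Unwind()

def coroutine_next(body):
    # Unwinding stops here: catch it and carry on.
    try:
        body()
        return "finished"
    except Unwind:
        return "yielded"

def program(in_coroutine_bug):
    log = []
    if in_coroutine_bug:
        coroutine_yield()          # nobody above us will catch this
    else:
        log.append(coroutine_next(coroutine_yield))
    log.append("main loop continues")
    return log

def top_level(in_coroutine_bug):
    try:
        return program(in_coroutine_bug)
    except Unwind:
        # The stack unwound to the very top and nothing was scheduled
        # to resume execution: the program just stops, silently.
        return ["silent exit"]

print(top_level(False))
print(top_level(True))
```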

JavaScript code generation

And now, finally, the promised "turning the mince back". Well, not quite. Of course, if you run Qemu in the browser and Node.js inside it, then, naturally, after code generation in Qemu we will get completely different JavaScript. But still, it is an inverse transformation of a sort.

First, a little about how Qemu works. Immediately I ask you to forgive me: I am not a professional developer of Qemu and my conclusions may be erroneous in places. , " , ". Qemu target-i386 . , . , , Qemu, TCG (Tiny Code Generator) . readme-, tcg, C, JIT. , , target architecture — , . - — Tiny Code Interpreter (TCI), ( ) . , , , JIT- , . , .

TCG backend, , TCI. :

,
, ,
( , -, , , patch) JS-, ,

, , , .

switch , , Emscripten, JS relooping, , , , — . — , , if- ( ). , , , - . brcond . , … , assert- . , , , switch- , , JavaScript-. . , , . , TB, , TB , . " , ?" — , . , Chromium ( Firefox ), Firefox asm.js, .

 Compiling 0x15b46d0:
 CompiledTB[0x015b46d0] = function(stdlib, ffi, heap) {
   "use asm";
   var HEAP8 = new stdlib.Int8Array(heap);
   var HEAP16 = new stdlib.Int16Array(heap);
   var HEAP32 = new stdlib.Int32Array(heap);
   var HEAPU8 = new stdlib.Uint8Array(heap);
   var HEAPU16 = new stdlib.Uint16Array(heap);
   var HEAPU32 = new stdlib.Uint32Array(heap);
   var dynCall_iiiiiiiiiii = ffi.dynCall_iiiiiiiiiii;
   var getTempRet0 = ffi.getTempRet0;
   var badAlignment = ffi.badAlignment;
   var _i64Add = ffi._i64Add;
   var _i64Subtract = ffi._i64Subtract;
   var Math_imul = ffi.Math_imul;
   var _mul_unsigned_long_long = ffi._mul_unsigned_long_long;
   var execute_if_compiled = ffi.execute_if_compiled;
   var getThrew = ffi.getThrew;
   var abort = ffi.abort;
   var qemu_ld_ub = ffi.qemu_ld_ub;
   var qemu_ld_leuw = ffi.qemu_ld_leuw;
   var qemu_ld_leul = ffi.qemu_ld_leul;
   var qemu_ld_beuw = ffi.qemu_ld_beuw;
   var qemu_ld_beul = ffi.qemu_ld_beul;
   var qemu_ld_beq = ffi.qemu_ld_beq;
   var qemu_ld_leq = ffi.qemu_ld_leq;
   var qemu_st_b = ffi.qemu_st_b;
   var qemu_st_lew = ffi.qemu_st_lew;
   var qemu_st_lel = ffi.qemu_st_lel;
   var qemu_st_bew = ffi.qemu_st_bew;
   var qemu_st_bel = ffi.qemu_st_bel;
   var qemu_st_leq = ffi.qemu_st_leq;
   var qemu_st_beq = ffi.qemu_st_beq;
   function tb_fun(tb_ptr, env, sp_value, depth) {
     tb_ptr = tb_ptr|0;
     env = env|0;
     sp_value = sp_value|0;
     depth = depth|0;
     var u0 = 0, u1 = 0, u2 = 0, u3 = 0, result = 0;
     var r0 = 0, r1 = 0, r2 = 0, r3 = 0, r4 = 0, r5 = 0, r6 = 0, r7 = 0, r8 = 0, r9 = 0;
     var r10 = 0, r11 = 0, r12 = 0, r13 = 0, r14 = 0, r15 = 0, r16 = 0, r17 = 0, r18 = 0, r19 = 0;
     var r20 = 0, r21 = 0, r22 = 0, r23 = 0, r24 = 0, r25 = 0, r26 = 0, r27 = 0, r28 = 0, r29 = 0;
     var r30 = 0, r31 = 0, r41 = 0, r42 = 0, r43 = 0, r44 = 0;
     r14 = env|0;
     r15 = sp_value|0;
     START: do {
       r0 = HEAPU32[((r14 + (-4))|0) >> 2] | 0;
       r42 = 0;
       result = ((r0|0) != (r42|0))|0;
       HEAPU32[1445307] = r0;
       HEAPU32[1445321] = r14;
       if(result|0) {
         HEAPU32[1445322] = r15;
         return 0x0345bf93|0;
       }
       r0 = HEAPU32[((r14 + (16))|0) >> 2] | 0;
       r42 = 8;
       r0 = ((r0|0) - (r42|0))|0;
       HEAPU32[(r14 + (16)) >> 2] = r0;
       r1 = 8;
       HEAPU32[(r14 + (44)) >> 2] = r1;
       r1 = r0|0;
       HEAPU32[(r14 + (40)) >> 2] = r1;
       r42 = 4;
       r0 = ((r0|0) + (r42|0))|0;
       r2 = HEAPU32[((r14 + (24))|0) >> 2] | 0;
       HEAPU32[1445307] = r0;
       HEAPU32[1445308] = r1;
       HEAPU32[1445309] = r2;
       HEAPU32[1445321] = r14;
       HEAPU32[1445322] = r15;
       qemu_st_lel(env|0, r0|0, r2|0, 34, 22759218);
       if(getThrew() | 0) abort();
       r0 = 3241038392;
       HEAPU32[1445307] = r0;
       r0 = qemu_ld_leul(env|0, r0|0, 34, 22759233)|0;
       if(getThrew() | 0) abort();
       HEAPU32[(r14 + (24)) >> 2] = r0;
       r1 = HEAPU32[((r14 + (12))|0) >> 2] | 0;
       r2 = HEAPU32[((r14 + (40))|0) >> 2] | 0;
       HEAPU32[1445307] = r0;
       HEAPU32[1445308] = r1;
       HEAPU32[1445309] = r2;
       qemu_st_lel(env|0, r2|0, r1|0, 34, 22759265);
       if(getThrew() | 0) abort();
       r0 = HEAPU32[((r14 + (24))|0) >> 2] | 0;
       HEAPU32[(r14 + (40)) >> 2] = r0;
       r1 = 24;
       HEAPU32[(r14 + (52)) >> 2] = r1;
       r42 = 0;
       result = ((r0|0) == (r42|0))|0;
       if(result|0) {
         HEAPU32[1445307] = r0;
         HEAPU32[1445308] = r1;
       }
       HEAPU32[1445307] = r0;
       HEAPU32[1445308] = r1;
       return execute_if_compiled(22759392|0, env|0, sp_value|0, depth|0) | 0;
       return execute_if_compiled(23164080|0, env|0, sp_value|0, depth|0) | 0;
       break;
     } while(1);
     abort();
     return 0|0;
   }
   return {tb_fun: tb_fun};
 }(window, CompilerFFI, Module.buffer)["tb_fun"]
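To make the shape of that output easier to follow, here is a toy emitter that produces a few statements in the same style. It is a Python sketch under heavy assumptions: the tuple format of the operations and the function name emit are invented for the example and do not correspond to Qemu's internal TCG representation or to the author's backend.

```python
def emit(op):
    kind = op[0]
    if kind == "ld32":      # ("ld32", dst, base, offset): 32-bit load
        _, dst, base, off = op
        return f"r{dst} = HEAPU32[((r{base} + ({off}))|0) >> 2] | 0;"
    if kind == "movi":      # ("movi", dst, constant): load an immediate
        _, dst, value = op
        return f"r{dst} = {value};"
    if kind == "sub":       # ("sub", dst, a, b): 32-bit subtraction
        _, dst, a, b = op
        return f"r{dst} = ((r{a}|0) - (r{b}|0))|0;"
    if kind == "st32":      # ("st32", src, base, offset): 32-bit store
        _, src, base, off = op
        return f"HEAPU32[(r{base} + ({off})) >> 2] = r{src};"
    raise ValueError(f"unknown op {kind!r}")

ops = [("ld32", 0, 14, 16), ("movi", 42, 8), ("sub", 0, 0, 42), ("st32", 0, 14, 16)]
js = "\n".join(emit(op) for op in ops)
print(js)
```

The four statements it prints match the fragment of the listing that loads a value at offset 16 from r14, subtracts 8 from it, and stores it back.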

Conclusion

, , . , . , , , . , - Qemu. : - "" . , — git log .

(, ).

x86
JIT- JavaScript
32- : MIPS

. JIT , , , Virtual x86 ( Qemu )
— - , , , Emscripten,
Qemu — , VM ..
UPD: Emscripten -, Qemu . , Emscripten .

Source: https://habr.com/ru/post/315770/
