
Writing an x86-64 JIT compiler from scratch in stock Python

In this article, I will show how to write a rudimentary, native x86-64 just-in-time compiler (JIT) in CPython, using only built-in modules.

The code targets UNIX-based systems such as macOS and Linux, but it should be easy to port to other systems such as Windows. All code is published at github.com/cslarsen/minijit .

The goal is to generate new versions of the assembly code below at runtime and execute them.
    48 b8 ed ef be ad de    movabs $0xdeadbeefed, %rax
    00 00 00
    48 0f af c7             imul   %rdi, %rax
    c3                      retq

We will mainly deal with the left-hand part of the listing, the byte sequence 48 b8 ed ... and so on. These 15 bytes of machine code make up an x86-64 function that multiplies its argument by the constant 0xdeadbeefed . At the JIT stage we will create functions with different constants in that slot. This contrived form of specialization is enough to demonstrate the basic mechanics of JIT compilation.

The main strategy is to load the standard C library with the built-in Python module ctypes . From there we get access to the system functions for interacting with the virtual memory manager. We use mmap to get a block of memory aligned on a page boundary; the alignment is required for the code to be executable. For the same reason we avoid the usual malloc , since it may return memory that straddles a page boundary.

We then use the mprotect function to mark the memory block as read-only and executable. After that, we should be able to call our freshly compiled block of code through ctypes.

The boilerplate part


Before doing anything else, we need to load the standard C library.

    import ctypes
    import sys

    if sys.platform.startswith("darwin"):
        libc = ctypes.cdll.LoadLibrary("libc.dylib")
        # ...
    elif sys.platform.startswith("linux"):
        libc = ctypes.cdll.LoadLibrary("libc.so.6")
        # ...
    else:
        raise RuntimeError("Unsupported platform")

There are other ways to do this, for example:

    >>> import ctypes
    >>> import ctypes.util
    >>> libc = ctypes.CDLL(ctypes.util.find_library("c"))
    >>> libc
    <CDLL '/usr/lib/libc.dylib', handle 110d466f0 at 103725ad0>

To determine the page size, we call sysconf(_SC_PAGESIZE) . The value of the _SC_PAGESIZE constant is 29 on macOS but 30 on Linux, so we simply hard-code both in the program. You can find such constants by digging through the system header files or by writing a small C program that prints them. A more robust and elegant solution would be to use cffi instead of ctypes, because it can parse header files automatically. However, since the goal is to stick with the stock CPython distribution, we will continue with ctypes.
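Incidentally, stock CPython can already report the page size without any hard-coded constants, which makes a handy cross-check for the values above. A small sketch using only the standard library:

```python
import mmap
import os

# os.sysconf wraps the C sysconf() call and accepts the symbolic name,
# so we do not need to know the platform-specific constant at all.
pagesize = os.sysconf("SC_PAGESIZE")

# mmap.PAGESIZE is computed from the same source at interpreter start-up.
assert pagesize == mmap.PAGESIZE

print(pagesize)  # typically 4096 on x86-64
```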

We also need a few extra constants for mmap and friends; they are listed below. On other UNIX variants you may have to look up their values.

    import ctypes
    import sys

    if sys.platform.startswith("darwin"):
        libc = ctypes.cdll.LoadLibrary("libc.dylib")
        _SC_PAGESIZE = 29
        MAP_ANONYMOUS = 0x1000
        MAP_PRIVATE = 0x0002
        PROT_EXEC = 0x04
        PROT_NONE = 0x00
        PROT_READ = 0x01
        PROT_WRITE = 0x02
        MAP_FAILED = -1 # voidptr actually
    elif sys.platform.startswith("linux"):
        libc = ctypes.cdll.LoadLibrary("libc.so.6")
        _SC_PAGESIZE = 30
        MAP_ANONYMOUS = 0x20
        MAP_PRIVATE = 0x0002
        PROT_EXEC = 0x04
        PROT_NONE = 0x00
        PROT_READ = 0x01
        PROT_WRITE = 0x02
        MAP_FAILED = -1 # voidptr actually
    else:
        raise RuntimeError("Unsupported platform")

Although it is not strictly required, it is very useful to tell ctypes the type signatures of the functions we are going to use. That way, exceptions will be raised if we mix up types. For example:

    # Set up sysconf
    sysconf = libc.sysconf
    sysconf.argtypes = [ctypes.c_int]
    sysconf.restype = ctypes.c_long

This tells ctypes that the sysconf function takes a four-byte integer and returns a long integer. After that you can query the current page size with:

 pagesize = sysconf(_SC_PAGESIZE) 

The machine code we are going to generate will be treated as unsigned 8-bit bytes, so we declare a pointer type to unsigned bytes:

    # 8-bit unsigned pointer type
    c_uint8_p = ctypes.POINTER(ctypes.c_uint8)

Below we simply declare the types of the remaining functions we will use. For error reporting, the strerror function is handy. We use munmap to destroy the machine-code block once we are done with it, so the operating system can reuse that memory.

    strerror = libc.strerror
    strerror.argtypes = [ctypes.c_int]
    strerror.restype = ctypes.c_char_p

    mmap = libc.mmap
    mmap.argtypes = [ctypes.c_void_p,
                     ctypes.c_size_t,
                     ctypes.c_int,
                     ctypes.c_int,
                     ctypes.c_int,
                     # Below is actually off_t, which is 64-bit on macOS
                     ctypes.c_int64]
    mmap.restype = c_uint8_p

    munmap = libc.munmap
    munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    munmap.restype = ctypes.c_int

    mprotect = libc.mprotect
    mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    mprotect.restype = ctypes.c_int

At this stage it is hard to justify using Python over C: in C we would not need any of the boilerplate above. But Python gives much more freedom for further experiments.

Now we are ready to write the mmap wrapper.

    def create_block(size):
        ptr = mmap(0, size, PROT_WRITE | PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, 0, 0)

        if ptr == MAP_FAILED:
            raise RuntimeError(strerror(ctypes.get_errno()))

        return ptr

This function uses mmap to allocate page-aligned memory. We mark the region as readable and writable with the PROT flags, and as private and anonymous with the MAP flags. The latter means the region will not be visible to other processes and is not backed by a file. The Linux mmap man page covers the topic in more detail (just be sure to open the manual for your own system). If the mmap call fails, we raise a Python exception.
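As an aside, the standard-library mmap module can create the same kind of private, anonymous mapping without any ctypes at all. A minimal UNIX-only sketch (not the approach the article takes, since we want explicit control over the mprotect step):

```python
import mmap

size = mmap.PAGESIZE

# fileno=-1 together with MAP_ANONYMOUS gives memory with no file backing,
# just like the ctypes-based create_block above.
block = mmap.mmap(-1, size,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)

block[0] = 0xC3        # write a single byte (a retq opcode) into the block
assert block[0] == 0xC3

block.close()          # the counterpart of munmap
```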

To mark the memory as executable:

    def make_executable(block, size):
        if mprotect(block, size, PROT_READ | PROT_EXEC) != 0:
            raise RuntimeError(strerror(ctypes.get_errno()))

With this mprotect call we mark the region as readable and executable. If we wanted, we could also make it writable, but some systems will refuse to execute code from memory that is open for writing. This is sometimes called the W^X security feature.

To destroy the memory block, do the following:

    def destroy_block(block, size):
        if munmap(block, size) == -1:
            raise RuntimeError(strerror(ctypes.get_errno()))

The fun part


Now we are finally ready to write insanely simple JIT code!

Recall the assembly listing at the beginning of the article: it is a small function, without a local stack frame, that multiplies its input by a constant. In Python we would write it like this:

    def create_multiplication_function(constant):
        return lambda n: n * constant

It is a truly contrived example, but it fits the definition of a JIT: we really do create native code at runtime and execute it. It is easy to imagine more advanced examples, such as JIT-compiling Brainfuck to x86-64 machine code, or using AVX instructions for blazing-fast vectorized math.

The disassembly at the beginning of the article was actually produced by compiling and disassembling the following C code:

    #include <stdint.h>

    uint64_t multiply(uint64_t n)
    {
        return n * 0xdeadbeefedULL;
    }

If you want to compile it yourself, use something like:

    $ gcc -Os -fPIC -shared -fomit-frame-pointer \
          -march=native multiply.c -olibmultiply.so

Here I optimized for size ( -Os ) to generate minimal machine code, made the code position-independent ( -fPIC ) to avoid jumps to absolute addresses, omitted the frame pointer ( -fomit-frame-pointer ) to strip unnecessary stack-frame setup code (which may be needed for more advanced functions), and targeted the instruction set of the current processor ( -march=native ).

We could pass -S to get an assembly listing, but we are interested in the machine code , so instead we use a tool like objdump :

    $ objdump -d libmultiply.so
    ...
    0000000000000f71 <_multiply>:
     f71:   48 b8 ed ef be ad de    movabs $0xdeadbeefed,%rax
     f78:   00 00 00
     f7b:   48 0f af c7             imul   %rdi,%rax
     f7f:   c3                      retq

If you are not very familiar with assembly, here is how the function works. First, the movabs instruction simply places an immediate number in the RAX register. Immediate is assembler jargon for a value embedded directly in the machine code; in other words, it is an inline argument to the movabs instruction. So the RAX register now contains the constant 0xdeadbeefed .

Also, by the AMD64 calling convention, the first integer argument arrives in the RDI register and the return value is left in RAX. So RDI holds the multiplier. That is the essence of imul , which multiplies RAX by RDI and puts the result in RAX. Finally, retq pops the 64-bit return address off the stack and jumps to it. At this level, it is easy to imagine how continuation-passing style could be implemented.
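To make the semantics concrete, here is a rough pure-Python model of what the three instructions compute (the helper name is mine, not from the repository). Note that, unlike the earlier Python lambda, imul wraps around modulo 2**64:

```python
MASK64 = (1 << 64) - 1  # registers are 64 bits wide

def multiply_model(n, constant=0xDEADBEEFED):
    """Model of: movabs constant,%rax; imul %rdi,%rax; retq."""
    rax = constant               # movabs places the immediate in RAX
    rdi = n & MASK64             # first integer argument arrives in RDI
    rax = (rax * rdi) & MASK64   # imul wraps on 64-bit overflow
    return rax                   # the return value is whatever is in RAX

assert multiply_model(1) == 0xDEADBEEFED
assert multiply_model(2) == 2 * 0xDEADBEEFED  # still fits in 64 bits

# Unlike lambda n: n * constant, the native code truncates on overflow:
assert multiply_model(2 ** 30) != 0xDEADBEEFED * 2 ** 30
```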

Note that the constant 0xdeadbeefed is stored in little-endian byte order. We have to remember to do the same in our code.
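The struct module can produce this little-endian encoding directly, which is a convenient way to double-check the byte order against the disassembly (a sketch; "<Q" means little-endian unsigned 64-bit):

```python
import struct

# Encode the constant exactly as it must appear after the 48 b8 opcode.
encoded = struct.pack("<Q", 0xDEADBEEFED)

print(encoded.hex())  # edefbeadde000000

# These are the same bytes shown in the objdump listing above.
assert encoded == bytes([0xED, 0xEF, 0xBE, 0xAD, 0xDE, 0x00, 0x00, 0x00])
```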

Now we are ready to put everything into the Python function.

    def make_multiplier(block, multiplier):
        # Encoding of: movabs <multiplier>, rax
        block[0] = 0x48
        block[1] = 0xb8

        # Little-endian encoding of multiplication constant
        block[2] = (multiplier & 0x00000000000000ff) >> 0
        block[3] = (multiplier & 0x000000000000ff00) >> 8
        block[4] = (multiplier & 0x0000000000ff0000) >> 16
        block[5] = (multiplier & 0x00000000ff000000) >> 24
        block[6] = (multiplier & 0x000000ff00000000) >> 32
        block[7] = (multiplier & 0x0000ff0000000000) >> 40
        block[8] = (multiplier & 0x00ff000000000000) >> 48
        block[9] = (multiplier & 0xff00000000000000) >> 56

        # Encoding of: imul rdi, rax
        block[10] = 0x48
        block[11] = 0x0f
        block[12] = 0xaf
        block[13] = 0xc7

        # Encoding of: retq
        block[14] = 0xc3

        # Return a ctypes function prototype with the right signature:
        # uint64_t multiply(uint64_t)
        return ctypes.CFUNCTYPE(ctypes.c_uint64, ctypes.c_uint64)

At the bottom we return a ctypes function prototype to be used with this code. The placement is somewhat arbitrary, but I thought it would be nice to keep it next to the machine code.
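As a cross-check, the same 15 bytes can be built more compactly with int.to_bytes. This sketch mirrors the encoding done in make_multiplier (the helper is hypothetical, for illustration only):

```python
def encode_multiplier(multiplier):
    """Build the 15-byte body: movabs imm64,%rax; imul %rdi,%rax; retq."""
    return (bytes([0x48, 0xB8])                  # movabs ...,%rax
            + multiplier.to_bytes(8, "little")   # the imm64 constant
            + bytes([0x48, 0x0F, 0xAF, 0xC7])    # imul %rdi,%rax
            + bytes([0xC3]))                     # retq

code = encode_multiplier(0xDEADBEEFED)
assert len(code) == 15
# Matches the byte sequence from the beginning of the article:
assert code.hex(" ") == "48 b8 ed ef be ad de 00 00 00 48 0f af c7 c3"
```

The bytes could then be copied into the mmap'ed block one by one, exactly as make_multiplier does.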

The final part


Now we have all the main parts and can combine them. First, we allocate one page of memory:

    pagesize = sysconf(_SC_PAGESIZE)
    block = create_block(pagesize)

Then we generate the machine code, choosing 101 as the multiplier.

 mul101_signature = make_multiplier(block, 101) 

Now we mark the section of memory as executable and read-only:

 make_executable(block, pagesize) 

Finally, we take the address of the first byte of the memory block and cast it into a callable ctypes function with the correct type:

    address = ctypes.cast(block, ctypes.c_void_p).value
    mul101 = mul101_signature(address)

To get the block's memory address, we use ctypes to cast it to a void pointer and read its value. Then we instantiate an actual function from this address with the mul101_signature constructor.

Voila! Now we have a piece of native code that can be called from Python. If you are in the REPL environment, you can do it directly:

>>> print(mul101(8))
808


Note that this small multiplication function is slower than a native Python calculation. This is mostly due to the ctypes foreign-function library, whose calls carry a lot of overhead: on every call it has to check which dynamic types you are passing, unpack and convert them, and then do the same with the return value. So dropping down to assembly only pays off either if you have access to some new Intel instructions, or if you compile something like Brainfuck to native code.

Finally, if you want, you can let the system reuse the memory the function lives in. Keep in mind that after this the process will likely crash if you try to call the code again, so it is probably best to delete the Python references as well:

    destroy_block(block, pagesize)

    del block
    del mul101

If you run the full code from the GitHub repository , you can specify the multiplication constant directly on the command line:

$ python mj.py 101
Pagesize: 4096
Allocating one page of memory
JIT-compiling a native mul-function w/arg 101
Making function block executable
Testing function
OK mul(0) = 0
OK mul(1) = 101
OK mul(2) = 202
OK mul(3) = 303
OK mul(4) = 404
OK mul(5) = 505
OK mul(6) = 606
OK mul(7) = 707
OK mul(8) = 808
OK mul(9) = 909
Deallocating function


JIT debugging


If you want to keep experimenting with this little program, you will soon want to disassemble the machine code we generate. One option is simply to use gdb or lldb, but you need to know where to start. A handy trick is to print the hex value of the block's address and then wait for a keypress:

 print("address: 0x%x" % address) print("Press ENTER to continue") raw_input() 

Then run the program under the debugger, and while it is paused, disassemble that memory region. Of course, you can also single-step through the assembly code if you want to watch what happens. A sample lldb session:

    $ lldb python
    ...
    (lldb) run mj.py 101
    ...
    (lldb) c
    Process 19329 resuming
    ...
    address 0x1002fd000
    Press ENTER to continue

At this point, press CTRL+C to drop back into the debugger, then disassemble the code at that memory location:

    (lldb) x/3i 0x1002fd000
        0x1002fd000: 48 b8 65 00 00 00 00 00 00 00  movabsq $0x65, %rax
        0x1002fd00a: 48 0f af c7                    imulq  %rdi, %rax
        0x1002fd00e: c3                             retq

Note that hex 65 is 101 in decimal, i.e. the command-line argument we passed above.

If you need a disassembler only in Python, I recommend the Capstone module.

What's next?


A good exercise would be to JIT-compile Brainfuck programs to native code. If you want to dive right in, I have published a GitHub repository at github.com/cslarsen/brainfuck-jit , along with a Speaker Deck presentation . It demonstrates JIT compilation and optimization, but uses GNU Lightning to compile the native code instead of the approach from this article. Still, it should be very easy to adapt the examples without GNU Lightning, or to generate the code yourself. An interesting observation from the Brainfuck project: if you simply JIT-compile each Brainfuck instruction one by one, you get little performance gain even in native code. The whole speedup comes from the code optimization stage, where runs of repeated operations are folded into fewer x86 integer instructions. Another candidate for this kind of compilation is the Forth language .
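The coalescing step mentioned above can be sketched in a few lines of Python: collapse runs of identical Brainfuck instructions into (opcode, count) pairs, so that e.g. ten consecutive '+' become a single add of 10 (a toy helper, not taken from the repository):

```python
from itertools import groupby

def coalesce(program):
    """Run-length encode a Brainfuck program: '+++>>' -> [('+', 3), ('>', 2)]."""
    return [(op, sum(1 for _ in run)) for op, run in groupby(program)]

assert coalesce("+++>>-") == [("+", 3), (">", 2), ("-", 1)]
# A JIT can now emit one "add $3" instead of three separate "inc" instructions.
```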

Before you seriously set out to extend this JIT compiler, have a look at the PeachPy project . It is far more advanced than ours: it includes a disassembler and supports what appears to be the entire x86-64 instruction set, up to and including AVX .

As mentioned above, calling functions through ctypes carries a lot of overhead. The cffi module can eliminate some of it, but the fact remains: if you want to call very small JIT-compiled functions many times, it is usually faster to just do the work in pure Python.

What other good uses are there? I have seen some mathematical libraries in Python switch to vectorized operations for performance, and I can imagine other applications: tools for compressing and decompressing native code, accessing virtualization primitives, and so on. I know some BPF tools and regex modules also JIT-compile queries for faster data processing.

What makes this exercise interesting is that we step into territory beyond pure assembly. For example, it becomes apparent how different instructions can disassemble to identical-looking mnemonics. The RETQ instruction has a different opcode from the plain RET instruction, because it operates on 64-bit values. This may not matter when programming in assembly, since such details are not always relevant, but the difference is worth remembering: I have seen gcc, lldb and objdump emit slightly different disassembly listings of the RETQ and MOVABSQ instructions for the same code.

One more note: I mentioned that the naive Brainfuck compiler I made does not initially produce very fast code; it needs an optimization pass. That is, code does not automatically become faster just because you have AVX, CUDA or anything else. The harsh truth is that gcc contains a large corpus of optimizations that are almost impossible to reproduce by hand. For more information, I recommend Felix von Leitner's classic talk on source-code optimization .

What about the actual compilation?


Several people commented that they expected a more detailed description of the stage where the compilation is actually performed. That is a fair point. As I said, this is indeed a very limited form of compilation, where we do almost nothing with the code at runtime beyond patching in a constant. Perhaps I will write a follow-up article covering a pure compilation stage. Stay tuned!

Source: https://habr.com/ru/post/342410/

