    48 b8 ed ef be ad de    movabs $0xdeadbeefed,%rax
    00 00 00
    48 0f af c7             imul   %rdi,%rax
    c3                      retq

The bytes on the left (48 b8 ed ... and so on) are the raw machine code. These 15 bytes make up an x86-64 function that multiplies its argument by the constant 0xdeadbeefed. At the JIT stage, we will create functions with different constants. This contrived form of specialization is enough to demonstrate the basic mechanics of JIT compilation.

The strategy is to load the standard C library with ctypes. From there, we get access to the system functions for talking to the virtual memory manager. We use mmap to get a page-aligned block of memory. Alignment is necessary for the code to be executable. For this reason we do not use the usual malloc function, since it can return memory that spans page boundaries.

We then use the mprotect function to mark the memory block as read-only and executable. After that, it should be possible to call our freshly compiled code block through ctypes.

    import ctypes
    import sys

    # use_errno=True lets ctypes.get_errno() report errno after a failed call
    if sys.platform.startswith("darwin"):
        libc = ctypes.CDLL("libc.dylib", use_errno=True)
        # ...
    elif sys.platform.startswith("linux"):
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        # ...
    else:
        raise RuntimeError("Unsupported platform")
The library can also be located without hard-coding its file name:

    >>> import ctypes
    >>> import ctypes.util
    >>> libc = ctypes.CDLL(ctypes.util.find_library("c"))
    >>> libc
    <CDLL '/usr/lib/libc.dylib', handle 110d466f0 at 103725ad0>
To find the page size, we call sysconf(_SC_PAGESIZE). The _SC_PAGESIZE constant is 29 on macOS, but 30 on Linux. We simply hard-code these values in our program. You can find them by examining the system header files or by writing a small C program that prints them. A more reliable and elegant solution is to use the cffi module instead of ctypes, because it can parse header files automatically. However, since the goal is to stick to the standard CPython distribution, we will continue with ctypes.

We need a few more constants for mmap and so on. They are written out below. You may have to look them up for other UNIX variants.

    import ctypes
    import sys

    # use_errno=True lets ctypes.get_errno() report errno after a failed call
    if sys.platform.startswith("darwin"):
        libc = ctypes.CDLL("libc.dylib", use_errno=True)
        _SC_PAGESIZE = 29
        MAP_ANONYMOUS = 0x1000
        MAP_PRIVATE = 0x0002
        PROT_EXEC = 0x04
        PROT_NONE = 0x00
        PROT_READ = 0x01
        PROT_WRITE = 0x02
        MAP_FAILED = -1  # (void *) -1 in C
    elif sys.platform.startswith("linux"):
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        _SC_PAGESIZE = 30
        MAP_ANONYMOUS = 0x20
        MAP_PRIVATE = 0x0002
        PROT_EXEC = 0x04
        PROT_NONE = 0x00
        PROT_READ = 0x01
        PROT_WRITE = 0x02
        MAP_FAILED = -1  # (void *) -1 in C
    else:
        raise RuntimeError("Unsupported platform")
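As an aside, on Unix the standard mmap module already exposes most of these flag values, so the hard-coded numbers can be cross-checked without leaving the stock CPython distribution. The following is only a sketch: the dictionary name `flags` is illustrative, and PROT_NONE is left out because I am not certain the mmap module exports it.

```python
import mmap

# Flag values as defined for the platform this interpreter runs on (Unix only).
flags = {
    "MAP_ANONYMOUS": mmap.MAP_ANONYMOUS,
    "MAP_PRIVATE": mmap.MAP_PRIVATE,
    "PROT_READ": mmap.PROT_READ,
    "PROT_WRITE": mmap.PROT_WRITE,
    "PROT_EXEC": mmap.PROT_EXEC,
}

for name, value in flags.items():
    print("%-14s = 0x%04x" % (name, value))
```

If a value printed here differs from the table above, the table is the one to adjust for that system.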
    # Set up sysconf
    sysconf = libc.sysconf
    sysconf.argtypes = [ctypes.c_int]
    sysconf.restype = ctypes.c_long
The sysconf function takes a four-byte integer and returns a long integer. With that set up, we can find the current page size:

    pagesize = sysconf(_SC_PAGESIZE)
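For comparison, the standard library can report the same number without a platform-specific constant. This is a sketch of an alternative, not what the rest of this article uses: os.sysconf accepts the symbolic name, and mmap.PAGESIZE holds the same quantity.

```python
import mmap
import os

# os.sysconf takes the symbolic name, so neither 29 nor 30 has to be hard-coded.
pagesize = os.sysconf("SC_PAGESIZE")

# mmap.PAGESIZE is the page size as seen by the mmap module.
print("Pagesize:", pagesize, mmap.PAGESIZE)
```

Both values should agree with what the ctypes call to sysconf returns on the same machine.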
    # 8-bit unsigned pointer type
    c_uint8_p = ctypes.POINTER(ctypes.c_uint8)
For error reporting, it is handy to have the strerror function. We use munmap to destroy the machine code block when we are done with it, so that the operating system can reuse this memory.

    strerror = libc.strerror
    strerror.argtypes = [ctypes.c_int]
    strerror.restype = ctypes.c_char_p

    mmap = libc.mmap
    mmap.argtypes = [ctypes.c_void_p,
                     ctypes.c_size_t,
                     ctypes.c_int,
                     ctypes.c_int,
                     ctypes.c_int,
                     # Below is actually off_t, which is 64-bit on macOS
                     ctypes.c_int64]
    mmap.restype = c_uint8_p

    munmap = libc.munmap
    munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    munmap.restype = ctypes.c_int

    mprotect = libc.mprotect
    mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    mprotect.restype = ctypes.c_int
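One detail worth checking in isolation is how errno travels from libc to Python. The sketch below assumes the library is loaded with use_errno=True, which is what makes ctypes.get_errno() meaningful; it deliberately provokes an error by closing an invalid file descriptor. The names `close`, `result`, `err` and `message` are illustrative only.

```python
import ctypes
import ctypes.util
import errno
import os

# use_errno=True makes ctypes save errno after each call into this library.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

close = libc.close
close.argtypes = [ctypes.c_int]
close.restype = ctypes.c_int

# Closing an invalid descriptor fails and sets errno to EBADF.
result = close(-1)
err = ctypes.get_errno()
message = os.strerror(err)  # returns str, whereas libc's strerror gives bytes

print(result, errno.errorcode[err], message)
```

Without use_errno=True, ctypes.get_errno() returns the ctypes-private copy of errno, which the failed call never updates.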
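Now for the mmap wrapper.

    def create_block(size):
        ptr = mmap(0, size, PROT_WRITE | PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, 0, 0)

        # ptr is a ctypes pointer object, so compare raw addresses.
        # MAP_FAILED is (void *) -1 in C.
        address = ctypes.cast(ptr, ctypes.c_void_p).value
        if address == ctypes.c_void_p(MAP_FAILED).value:
            raise RuntimeError(strerror(ctypes.get_errno()))

        return ptr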
This function uses mmap to allocate page-aligned memory. We mark the memory region as readable and writable with the PROT flags, and as private and anonymous with the MAP flags. The latter means that other processes cannot see this memory and that it is not backed by a file. The Linux mmap manual page covers this topic in more detail (just be sure to read the manual for your own system). If the mmap call fails, we raise a Python error.

    def make_executable(block, size):
        if mprotect(block, size, PROT_READ | PROT_EXEC) != 0:
            raise RuntimeError(strerror(ctypes.get_errno()))
With this mprotect call, we mark the memory region as readable and executable. If we wanted to, we could also make it writable, but some systems refuse to execute code from writable memory. This security feature is sometimes called W^X.

    def destroy_block(block, size):
        if munmap(block, size) == -1:
            raise RuntimeError(strerror(ctypes.get_errno()))
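The page-alignment property of mmap mentioned earlier can be checked without any of the wrappers above. The sketch below uses the standard mmap module for the allocation (an assumption made for brevity, since it is not the ctypes wrapper from this article) and ctypes only to read the start address of the mapping.

```python
import ctypes
import mmap

# Anonymous mapping of one page (fd = -1 means "not backed by a file").
buf = mmap.mmap(-1, mmap.PAGESIZE)

# Borrow the buffer through ctypes just to read its start address.
first_byte = ctypes.c_char.from_buffer(buf)
address = ctypes.addressof(first_byte)
offset_in_page = address % mmap.PAGESIZE

print("address: 0x%x, offset within page: %d" % (address, offset_in_page))

# Drop the ctypes view before closing, otherwise close() raises BufferError.
del first_byte
buf.close()
```

An offset of zero means the block starts exactly on a page boundary, which is what mprotect requires of the address it is given.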
    def create_multiplication_function(constant):
        return lambda n: n * constant
    #include <stdint.h>

    uint64_t multiply(uint64_t n)
    {
      return n*0xdeadbeefedULL;
    }

If you want to compile it yourself, use something like

    $ gcc -Os -fPIC -shared -fomit-frame-pointer \
        -march=native multiply.c -olibmultiply.so
This optimizes for size (-Os) to generate minimal machine code, produces position-independent code (-fPIC) to avoid jumps to absolute addresses, omits frame pointers (-fomit-frame-pointer) to remove unnecessary stack setup code (although it may be needed for more advanced functions), and uses the native instruction set of the current processor (-march=native).

We could pass -S to get an assembly listing, but we are interested in the machine code, so instead we use a tool like objdump:

    $ objdump -d libmultiply.so
    ...
    0000000000000f71 <_multiply>:
     f71:  48 b8 ed ef be ad de    movabs $0xdeadbeefed,%rax
     f78:  00 00 00
     f7b:  48 0f af c7             imul   %rdi,%rax
     f7f:  c3                      retq
The movabs instruction simply places an immediate number in the RAX register. "Immediate" is assembler jargon for something that is encoded directly in the machine code. In other words, it is a built-in argument to the movabs instruction. So now the RAX register contains the constant 0xdeadbeefed.

By the AMD64 calling convention, the first integer argument arrives in RDI and the return value is expected in RAX. Next comes imul, which multiplies RAX and RDI and puts the result in RAX. Finally, we pop the 64-bit return address off the stack and jump to it with retq. At this level, it is easy to imagine how continuation-passing style could be implemented.

Note that the constant 0xdeadbeefed is stored in reverse byte order (little-endian). We need to remember to do the same in our code.

    def make_multiplier(block, multiplier):
        # Encoding of: movabs <multiplier>, rax
        block[0] = 0x48
        block[1] = 0xb8

        # Little-endian encoding of multiplication constant
        block[2] = (multiplier & 0x00000000000000ff) >> 0
        block[3] = (multiplier & 0x000000000000ff00) >> 8
        block[4] = (multiplier & 0x0000000000ff0000) >> 16
        block[5] = (multiplier & 0x00000000ff000000) >> 24
        block[6] = (multiplier & 0x000000ff00000000) >> 32
        block[7] = (multiplier & 0x0000ff0000000000) >> 40
        block[8] = (multiplier & 0x00ff000000000000) >> 48
        block[9] = (multiplier & 0xff00000000000000) >> 56

        # Encoding of: imul rdi, rax
        block[10] = 0x48
        block[11] = 0x0f
        block[12] = 0xaf
        block[13] = 0xc7

        # Encoding of: retq
        block[14] = 0xc3

        # Return a ctypes function prototype: the first argument to CFUNCTYPE
        # is the return type, the second is the type of the single parameter.
        function = ctypes.CFUNCTYPE(ctypes.c_uint64, ctypes.c_uint64)
        return function
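The byte-by-byte shifting above can also be expressed with the standard struct module, whose "<Q" format packs an unsigned 64-bit integer in little-endian order. The helper name `encode_multiplier` is hypothetical and the sketch assumes Python 3; it only builds the bytes and does not touch executable memory.

```python
import struct

def encode_multiplier(multiplier):
    """Return the 15 machine code bytes of the multiply function."""
    code = b"\x48\xb8"                     # movabs $imm64, %rax
    code += struct.pack("<Q", multiplier)  # imm64, least significant byte first
    code += b"\x48\x0f\xaf\xc7"            # imul %rdi, %rax
    code += b"\xc3"                        # retq
    return code

code = encode_multiplier(0xdeadbeefed)
print(" ".join("%02x" % byte for byte in code))
# 48 b8 ed ef be ad de 00 00 00 48 0f af c7 c3
```

The resulting bytes could then be copied into the block in one step, for instance with ctypes.memmove(block, code, len(code)), instead of assigning each index separately.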
    pagesize = sysconf(_SC_PAGESIZE)
    block = create_block(pagesize)

    mul101_signature = make_multiplier(block, 101)

    make_executable(block, pagesize)

    address = ctypes.cast(block, ctypes.c_void_p).value
    mul101 = mul101_signature(address)

We create the actual function from this address with the mul101_signature constructor.

    >>> print(mul101(8))
    808

Once we are done, we release the block:

    destroy_block(block, pagesize)
    del block
    del mul101
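One difference from the pure Python lambda shown earlier is worth keeping in mind: the native function works on 64-bit registers, so its result is the low 64 bits of the product, while Python integers grow without bound. A minimal model of that behaviour, with the hypothetical names `MASK64` and `multiply_model`, might look like this:

```python
MASK64 = (1 << 64) - 1

def multiply_model(n, constant):
    """Model of the native function: the low 64 bits of n * constant."""
    return (n * constant) & MASK64

print(multiply_model(8, 101))                    # 808
print(hex(multiply_model(2**63, 0xdeadbeefed)))  # 0x8000000000000000
```

For small inputs the two versions agree; they diverge only once the product no longer fits in 64 bits.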
$ python mj.py 101
Pagesize: 4096
Allocating one page of memory
JIT-compiling a native mul-function w/arg 101
Making function block executable
Testing function
OK mul(0) = 0
OK mul(1) = 101
OK mul(2) = 202
OK mul(3) = 303
OK mul(4) = 404
OK mul(5) = 505
OK mul(6) = 606
OK mul(7) = 707
OK mul(8) = 808
OK mul(9) = 909
Deallocating function
    print("address: 0x%x" % address)
    print("Press ENTER to continue")
    input()  # use raw_input() on Python 2
    $ lldb python
    ...
    (lldb) run mj.py 101
    ...
    (lldb) c
    Process 19329 resuming
    ...
    address 0x1002fd000
    Press ENTER to continue
    (lldb) x/3i 0x1002fd000
        0x1002fd000: 48 b8 65 00 00 00 00 00 00 00  movabsq $0x65, %rax
        0x1002fd00a: 48 0f af c7                    imulq  %rdi, %rax
        0x1002fd00e: c3                             retq
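The dump can be double-checked from Python as well. The sketch below takes the 15 bytes as printed by lldb (copied by hand, so the string `dump` is an assumption about what your session shows) and decodes the immediate operand of movabs.

```python
# Bytes copied from the lldb dump above.
dump = "48 b8 65 00 00 00 00 00 00 00 48 0f af c7 c3"
code = bytes.fromhex(dump)

# Bytes 2..9 hold the 64-bit immediate of movabs, least significant byte first.
constant = int.from_bytes(code[2:10], "little")

print(len(code), constant)  # 15 101
```

The decoded value 101 (0x65) matches the argument passed to mj.py, which confirms that the constant was patched into the code correctly.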
The call overhead might be lower with the cffi module, but the fact remains: if you want to call very small JIT functions repeatedly, it is usually faster to do it in pure Python.

Source: https://habr.com/ru/post/342410/