
Part 1. QInst: better to lose a day and then get there in five minutes (writing instrumentation the trivial way)

In the previous part, I described roughly how to load eBPF functions from an ELF file. Now it is time to move from fantasy to Soviet cartoons and, following their wise advice, spend the effort once to build a universal instrument of instrumentation (UII for short!!!). In doing so I will apply the Golden Hammer anti-pattern and build the tool out of QEMU, which I know relatively well. As a bonus, we get cross-architecture instrumentation, as well as instrumentation at the level of a whole virtual machine. An instrumentation will take the form "a small native .so plus a small .o file with eBPF". The eBPF functions are inserted before the corresponding instructions of QEMU's internal representation, ahead of optimization and code generation.


As a result, the instrumentation itself, meaning the part added during code generation (that is, not counting a couple of kilobytes of ordinary C runtime), looks like this, and it is not pseudocode:


    #include <stdint.h>

    extern uint8_t *__afl_area_ptr;
    extern uint64_t prev;

    void inst_qemu_brcond_i64(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
    {
        __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
        prev = tag;
    }

    void inst_qemu_brcond_i32(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
    {
        __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
        prev = tag;
    }

Well, it is time to load our ELF into the Matrix. Or rather, not so much load it as spray it in.


As mentioned in the article about QEMU.js, one of QEMU's modes of operation is JIT generation of host machine code from guest code (potentially for a completely different architecture). Last time I implemented my own code generation backend; this time I am going to process the internal representation, wedging myself in right before the optimizer. Is this an arbitrary decision? No. The hope is that the optimizer will trim the rough edges, throw out unnecessary variables, and so on. As far as I understand, it deals with simple things that can be done quickly: propagating constants, discarding expressions like "x := x + 0", and removing unreachable code. And we may well end up with a decent amount of that.


Build Script Configuration


First, let's add our source files, tcg/bpf-loader.c and tcg/instrument.c, to the Makefiles. Generally speaking, I would like to push this upstream someday, so eventually it will have to be done properly, but for now I will simply add these files to the build unconditionally. The parameters will be passed in the best traditions of AFL: through environment variables. By the way, I will once again test all this on instrumentation for AFL.


We simply look for mentions of the "neighbor", the file optimize.c, using grep -R, and find nothing. That is because we should have searched for optimize.o:


    --- a/Makefile.target
    +++ b/Makefile.target
    @@ -110,7 +110,7 @@ obj-y += trace/
     obj-y += exec.o
     obj-y += accel/
     obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/tcg-op-vec.o tcg/tcg-op-gvec.o
    -obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o
    +obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/optimize.o tcg/instrument.o tcg/bpf-loader.o
     obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
     obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
     obj-$(CONFIG_TCG) += fpu/softfloat.o

So this is what you look like, metaprogramming in C...


First, let's extend bpf-loader.c from the previous part with code that extracts the entry points corresponding to QEMU operations. The mysterious file tcg-opc.h will help us here. It looks like this:


    /*
     * DEF(name, oargs, iargs, cargs, flags)
     */

    /* predefined ops */
    DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
    DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)

    /* variable number of parameters */
    DEF(call, 0, 0, 3, TCG_OPF_CALL_CLOBBER | TCG_OPF_NOT_PRESENT)

    DEF(br, 0, 0, 1, TCG_OPF_BB_END)

    // ...

What is this nonsense? The point is simply that this file is not included at the top of a source file like an ordinary header: you define the DEF macro, include the file, and immediately undefine the macro. Note that it does not even have an include guard.


    static const char *inst_function_names[] = {
    #define DEF(name, a, b, c, d) stringify(inst_qemu_##name),
    #include "tcg-opc.h"
    #undef DEF
        NULL
    };

As a result, we get a neat array of target function names, indexed by opcode and terminated by NULL, which we can scan for every symbol in the file. I realize this is not efficient. But it is simple, which matters more given that this is a one-time operation. Then we simply skip all symbols for which


 ELF64_ST_BIND(sym->st_info) == STB_LOCAL || ELF64_ST_TYPE(sym->st_info) != STT_FUNC 

The remaining symbols are checked against the list.
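The name table and the symbol filter can be sketched together as follows. This is a minimal illustration, not the code from the repository: the HYPOTHETICAL_OPS list stands in for tcg-opc.h (which only exists inside the QEMU tree), its entries and zeroed flags are placeholders, and hook_index_for_symbol is a name invented here. It assumes a Linux host whose <elf.h> provides Elf64_Sym and the ELF64_ST_* macros.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for tcg-opc.h: DEF(name, oargs, iargs, cargs, flags). */
#define HYPOTHETICAL_OPS \
    DEF(discard, 1, 0, 0, 0) \
    DEF(br, 0, 0, 1, 0) \
    DEF(brcond_i32, 0, 2, 2, 0) \
    DEF(brcond_i64, 0, 2, 2, 0)

/* X-macro expansion: one hook name per op, in opcode order, NULL-terminated. */
static const char *const hook_names[] = {
#define DEF(name, oargs, iargs, cargs, flags) "inst_qemu_" #name,
    HYPOTHETICAL_OPS
#undef DEF
    NULL
};

/* Returns the index of the op whose hook name matches the symbol, or -1. */
int hook_index_for_symbol(const Elf64_Sym *sym, const char *sym_name)
{
    /* Skip local symbols and anything that is not a function. */
    if (ELF64_ST_BIND(sym->st_info) == STB_LOCAL ||
        ELF64_ST_TYPE(sym->st_info) != STT_FUNC) {
        return -1;
    }
    /* Linear scan over the table: slow, but it runs once per symbol at load time. */
    for (int i = 0; hook_names[i] != NULL; i++) {
        if (strcmp(hook_names[i], sym_name) == 0) {
            return i;
        }
    }
    return -1;
}
```

In a real loader, sym_name would come from the string table referenced by the symbol's st_name field, and the returned index would select the slot where the hook for that opcode is stored.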


Hooking into the execution flow


Now we need to take up a position somewhere in the flow of the code generation machinery and wait for an interesting instruction to pass by. But first we have to declare our functions instrumentation_init, tcg_instrument and instrumentation_shutdown in tcg/tcg.h and wire in their calls: initialization after the backend is initialized, instrumentation right before tcg_optimize. It might seem that instrumentation_shutdown could simply be registered with atexit inside instrumentation_init and forgotten about. I thought so too, and it will most likely work in full-system emulation mode. In user-mode emulation, however, QEMU translates the exit_group system call, and sometimes exit, into a call to the _exit function, which ignores all atexit handlers. So we find that call in linux-user/syscall.c and insert a call to our code in front of it.
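The difference between exit and _exit is easy to check outside QEMU. The sketch below is a standalone POSIX experiment, unrelated to the QEMU sources: a child process registers an atexit handler that writes one byte into a pipe and then terminates either way, and the parent reports whether the byte arrived. The names atexit_handler_ran and report_on_exit are invented for this illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

/* atexit handler: tell the parent that we were called. */
static void report_on_exit(void)
{
    const char mark = 'x';
    if (write(report_fd, &mark, 1) != 1) {
        /* Nothing sensible to do on failure in this sketch. */
    }
}

/* Returns 1 if the child's atexit handler ran, 0 if not, -1 on setup error. */
int atexit_handler_ran(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: register the handler, then leave one way or the other. */
        close(fds[0]);
        report_fd = fds[1];
        atexit(report_on_exit);
        if (use_underscore_exit) {
            _exit(0);   /* terminates immediately, handlers are skipped */
        }
        exit(0);        /* runs atexit handlers first */
    }
    /* Parent: one byte means the handler ran, EOF means it did not. */
    close(fds[1]);
    char mark;
    ssize_t n = read(fds[0], &mark, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1;
}
```

Calling atexit_handler_ran(0) exercises the exit path and atexit_handler_ran(1) the _exit path; the second is the situation that makes an explicit shutdown call before _exit necessary.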


Interpreting bytecode


So, it is time to read what the compiler generated for us. This is conveniently done with llvm-objdump and the -x option, or better yet, -d -t -r right away.


Sample output
    $ ./compile-bpf.sh

    test-bpf.o: file format ELF64-BPF

    Disassembly of section .text:
    0000000000000000 inst_brcond_i64:
           0: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00  r2 = 0 ll
                    0000000000000000:  R_BPF_64_64  prev
           2: 79 23 00 00 00 00 00 00  r3 = *(u64 *)(r2 + 0)
           3: 77 03 00 00 01 00 00 00  r3 >>= 1
           4: 7b 32 00 00 00 00 00 00  *(u64 *)(r2 + 0) = r3
           5: af 13 00 00 00 00 00 00  r3 ^= r1
           6: 57 03 00 00 ff ff 00 00  r3 &= 65535
           7: 18 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00  r4 = 0 ll
                    0000000000000038:  R_BPF_64_64  __afl_area_ptr
           9: 79 44 00 00 00 00 00 00  r4 = *(u64 *)(r4 + 0)
          10: 0f 34 00 00 00 00 00 00  r4 += r3
          11: 71 43 00 00 00 00 00 00  r3 = *(u8 *)(r4 + 0)
          12: 07 03 00 00 01 00 00 00  r3 += 1
          13: 73 34 00 00 00 00 00 00  *(u8 *)(r4 + 0) = r3
          14: 7b 12 00 00 00 00 00 00  *(u64 *)(r2 + 0) = r1
          15: 95 00 00 00 00 00 00 00  exit

    0000000000000080 inst_brcond_i32:
          16: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00  r2 = 0 ll
                    0000000000000080:  R_BPF_64_64  prev
          18: 79 23 00 00 00 00 00 00  r3 = *(u64 *)(r2 + 0)
          19: 77 03 00 00 01 00 00 00  r3 >>= 1
          20: 7b 32 00 00 00 00 00 00  *(u64 *)(r2 + 0) = r3
          21: af 13 00 00 00 00 00 00  r3 ^= r1
          22: 57 03 00 00 ff ff 00 00  r3 &= 65535
          23: 18 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00  r4 = 0 ll
                    00000000000000b8:  R_BPF_64_64  __afl_area_ptr
          25: 79 44 00 00 00 00 00 00  r4 = *(u64 *)(r4 + 0)
          26: 0f 34 00 00 00 00 00 00  r4 += r3
          27: 71 43 00 00 00 00 00 00  r3 = *(u8 *)(r4 + 0)
          28: 07 03 00 00 01 00 00 00  r3 += 1
          29: 73 34 00 00 00 00 00 00  *(u8 *)(r4 + 0) = r3
          30: 7b 12 00 00 00 00 00 00  *(u64 *)(r2 + 0) = r1
          31: 95 00 00 00 00 00 00 00  exit

    SYMBOL TABLE:
    0000000000000000 l    df *ABS*  00000000 test-bpf.c
    0000000000000000 l    d  .text  00000000 .text
    0000000000000000         *UND*  00000000 __afl_area_ptr
    0000000000000080 g     F .text  00000080 inst_brcond_i32
    0000000000000000 g     F .text  00000080 inst_brcond_i64
    0000000000000008 g     O *COM*  00000008 prev

If you go looking for a description of the eBPF opcodes, you will find that the obvious places (the Linux kernel sources and man pages) describe how to use eBPF, how to compile for it, and so on. Eventually you come across the iovisor team's page with a convenient unofficial eBPF reference.


An instruction occupies one 64-bit word (some take two) and looks like this:


    struct {
        uint8_t opcode;
        uint8_t dst:4;
        uint8_t src:4;
        uint16_t offset;
        uint32_t imm;
    };

Instructions that occupy two words simply consist of a first word carrying all the logic plus a "trailer" holding another 32 bits of immediate value. They are clearly visible in the objdump disassembly.
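As a sketch of this layout, the helper below decodes one 8-byte slot by hand rather than through bit-fields, so it does not depend on how the compiler orders them. It assumes the little-endian encoding seen in the dump above (65535 is stored as ff ff 00 00), keeps the field types of the struct shown earlier, and uses names (decode_slot, wide_immediate) that are invented for this example.

```c
#include <stdint.h>

struct insn_fields {
    uint8_t  opcode;
    uint8_t  dst;     /* low nibble of byte 1 */
    uint8_t  src;     /* high nibble of byte 1 */
    uint16_t offset;
    uint32_t imm;
};

/* Decode one 8-byte, little-endian eBPF instruction slot. */
struct insn_fields decode_slot(const uint8_t b[8])
{
    struct insn_fields f;
    f.opcode = b[0];
    f.dst = b[1] & 0x0f;
    f.src = b[1] >> 4;
    f.offset = (uint16_t)(b[2] | (b[3] << 8));
    f.imm = (uint32_t)b[4] | (uint32_t)b[5] << 8 |
            (uint32_t)b[6] << 16 | (uint32_t)b[7] << 24;
    return f;
}

/* Two-slot instruction: the trailer's imm supplies the upper 32 bits. */
uint64_t wide_immediate(const uint8_t first[8], const uint8_t trailer[8])
{
    uint64_t lo = decode_slot(first).imm;
    uint64_t hi = decode_slot(trailer).imm;
    return lo | (hi << 32);
}
```

Feeding it the bytes 7b 32 00 00 00 00 00 00 from the dump gives dst = 2 and src = 3, which matches the disassembly *(u64 *)(r2 + 0) = r3.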


The opcodes themselves also have a regular structure: the lower three bits are the operation class (32-bit ALU, 64-bit ALU, load/store, conditional jumps). That makes them very convenient to implement with macros, in the best traditions of QEMU. I will not walk through the code in detail, since this is not a code review; I would rather describe the pitfalls.
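The class extraction can be sketched in a few lines. The numeric values follow the class numbering used by the Linux kernel headers (BPF_LD = 0x00 through BPF_ALU64 = 0x07); the enum and function names here are made up for the example and are not taken from the project.

```c
#include <stdint.h>

/* Instruction classes: the low three bits of the opcode byte. */
enum insn_class {
    CLS_LD    = 0x00,
    CLS_LDX   = 0x01,
    CLS_ST    = 0x02,
    CLS_STX   = 0x03,
    CLS_ALU   = 0x04,  /* 32-bit ALU */
    CLS_JMP   = 0x05,
    CLS_JMP32 = 0x06,
    CLS_ALU64 = 0x07   /* 64-bit ALU */
};

enum insn_class insn_class_of(uint8_t opcode)
{
    return (enum insn_class)(opcode & 0x07);
}

/* Human-readable class name, handy when tracing a translator. */
const char *insn_class_name(uint8_t opcode)
{
    switch (insn_class_of(opcode)) {
    case CLS_LD:    return "ld";
    case CLS_LDX:   return "ldx";
    case CLS_ST:    return "st";
    case CLS_STX:   return "stx";
    case CLS_ALU:   return "alu";
    case CLS_JMP:   return "jmp";
    case CLS_JMP32: return "jmp32";
    case CLS_ALU64: return "alu64";
    }
    return "unknown";
}
```

With the opcodes from the dump above, 0x77 (r3 >>= 1) falls into the 64-bit ALU class, 0x79 is a register load, 0x7b a register store, and 0x95 (exit) belongs to the jump class.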


My first problem was that I implemented a lazy allocator of eBPF registers as QEMU local_temp values and thoughtlessly passed a call to that function as a macro argument. It turned out like the famous meme: "We put an abstraction in your abstraction so that you can generate an instruction while you generate an instruction." In hindsight I no longer quite remember what broke, but something strange was apparently going on with the order of the generated instructions. After that, I wrote analogues of the tcg_gen_... functions that insert new instructions into the middle of the list and take their operands as function arguments, and the order automatically became correct (since function arguments are fully evaluated exactly once before the call).
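The general hazard can be shown in isolation. The sketch below is not a reconstruction of the actual failure (which is not known in detail); it only demonstrates that an expression with side effects passed to a macro is evaluated as many times as the macro mentions it, while a function argument is evaluated once. All names here are hypothetical.

```c
/* Counts how many times the "allocator" has been invoked. */
int emitted_count;

/* Hypothetical lazy allocator: each call has a side effect
 * (in the real tool, emitting an instruction into the op list). */
int alloc_temp(void)
{
    return ++emitted_count;
}

/* Macro version: the argument text is pasted in twice, so it runs twice. */
#define ADD_TWICE_MACRO(t) ((t) + (t))

/* Function version: the argument is evaluated once, before the call. */
int add_twice_func(int t)
{
    return t + t;
}
```

After ADD_TWICE_MACRO(alloc_temp()) the counter has advanced by two and the two operands differ, whereas add_twice_func(alloc_temp()) advances it by one and uses the same value for both operands.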


The second problem was my attempt to stuff a TCG constant in as an operand of an arbitrary instruction whenever I saw an immediate operand in eBPF. Judging by the previously mentioned tcg-opc.h, the argument list of an operation is strictly fixed: n input arguments, m output arguments and k constant ones. By the way, when debugging such code it helps a lot to pass QEMU the command-line argument -d op,op_opt or even -d op,op_opt,out_asm.


Possible arguments
    $ ./x86_64-linux-user/qemu-x86_64 -d help
    Log items (comma separated):
    out_asm         show generated host assembly code for each compiled TB
    in_asm          show target assembly code for each compiled TB
    op              show micro ops for each compiled TB
    op_opt          show micro ops after optimization
    op_ind          show micro ops before indirect lowering
    int             show interrupts/exceptions in short format
    exec            show trace before each executed TB (lots of logs)
    cpu             show CPU registers before entering a TB (lots of logs)
    fpu             include FPU registers in the 'cpu' logging
    mmu             log MMU-related activities
    pcall           x86 only: show protected mode far calls/returns/exceptions
    cpu_reset       show CPU state before CPU resets
    unimp           log unimplemented functionality
    guest_errors    log when the guest OS does something invalid (eg accessing a
                    non-existent register)
    page            dump pages at beginning of user mode emulation
    nochain         do not chain compiled TBs so that "exec" and "cpu" show
                    complete traces
    trace:PATTERN   enable trace events

    Use "-d trace:help" to get a list of trace events.

And do not repeat my mistakes: the disassembler for the internal instructions is quite advanced, and if you see something like add_i64 loc15,loc15,$554412123213 in its output, then the thing after the dollar sign is not a pointer. More precisely, it is of course a pointer, but possibly one decorated with flags and playing the role of the operand's literal value rather than of a pointer. All of this applies, naturally, when you know that some specific number should be there, such as $0 or $ff; in general there is no need to be afraid of pointers. :) As for how to live with this: you simply write a function that returns a fresh temp into which the desired constant has been placed via movi.


By the way, if you comment out #define USE_TCG_OPTIMIZATIONS in tcg/tcg.c, then, all of a sudden, optimization is turned off, and it becomes easier to analyze the code transformations.


With that, I refer the reader who is interested in digging into QEMU to the documentation, the official one even! For everyone else, I will demonstrate the promised instrumentation for AFL.


That very rabbit


For the full text of the runtime I will, again, refer the reader to the repository, since it has no artistic value, is honestly lifted from qemu_mode in the AFL distribution, and is on the whole an ordinary piece of C code. But here is what the instrumentation itself looks like:


    #include <stdint.h>

    extern uint8_t *__afl_area_ptr;
    extern uint64_t prev;

    void inst_qemu_brcond_i64(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
    {
        __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
        prev = tag;
    }

    void inst_qemu_brcond_i32(uint64_t tag, uint64_t x, uint64_t y, uint64_t z, uint64_t u)
    {
        __afl_area_ptr[((prev >> 1) ^ tag) & 0xFFFF] += 1;
        prev = tag;
    }

It is important that the hook functions have as many arguments as the corresponding QEMU operation has iargs. The two extern declarations at the top are linked against the runtime during relocation. In principle, prev could be defined right here, but then it would have to be declared static; otherwise it ends up in the COMMON section, which I do not support. In effect, we have simply rewritten the pseudocode from the AFL documentation, except that here it is machine readable!
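To see what the hook arithmetic does, it can be run on the host with an ordinary array in place of the shared map. This is only a simulation under stated assumptions: coverage_map stands in for the memory behind __afl_area_ptr, prev_tag for prev, and the functions record_branch and reset_coverage are invented for the example.

```c
#include <stdint.h>
#include <string.h>

#define MAP_SIZE 0x10000

/* Stand-ins for the runtime's shared map and "previous location" variable. */
uint8_t coverage_map[MAP_SIZE];
uint64_t prev_tag;

/* Same arithmetic as the hooks; returns the map index that was bumped. */
uint32_t record_branch(uint64_t tag)
{
    uint32_t idx = (uint32_t)(((prev_tag >> 1) ^ tag) & 0xFFFF);
    coverage_map[idx] += 1;
    prev_tag = tag;
    return idx;
}

/* Start a fresh run: empty map, no previous location. */
void reset_coverage(void)
{
    memset(coverage_map, 0, sizeof coverage_map);
    prev_tag = 0;
}
```

Because the previous tag is shifted before the XOR, the pair (0x10, then 0x20) lands in cell 0x28 while the reverse order (0x20, then 0x10) lands in cell 0x00, so the map distinguishes the direction of a transition and not just the set of branches visited.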


To check that it works, create the file bug.c:


    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        char buf[16];
        int res = read(0, buf, 4);
        if (buf[0] == 'T' && buf[1] == 'E' && buf[2] == 'S' && buf[3] == 'T')
            abort();
        return res * 0;
    }

And also a forksrv script, which is convenient to feed to AFL:


    #!/bin/bash

    export NATIVE_INST=./instrumentation-examples/afl/afl-native.so
    export BPF_INST=./instrumentation-examples/afl/afl-bpf.co
    exec ./x86_64-linux-user/qemu-x86_64 ./instrumentation-examples/afl/bug

And run fuzzing:


 AFL_SKIP_BIN_CHECK=1 afl-fuzz -i ../input -o ../output -m none -- ./forksrv 

American Fuzzy Lop (Proceed)
    1234
    T234
    TE34
    TES4
    TEST <- crashes, 2200

So far the speed is not great, but in my defense, an important feature of the original qemu_mode is not used here (for the time being): sending the addresses of the executed code to the fork server. On the other hand, there is now nothing AFL-specific in the QEMU codebase, and there is hope that this generalized instrumentation will someday be pushed upstream.


GitHub project



Source: https://habr.com/ru/post/452608/

