McSema and decompiling into LLVM source code: is it real?

Imagine that there is some very useful program, but it, for example, exists only in the Windows version and only 64 bits. And you need, for example, under ARM64 and under another OS, respectively. And you do not have the source code, and it is impossible to get them.

What to do? There is a project MCSema (a post on Habré about mcsema: https://habrahabr.ru/post/232871/ ). Its creators (and they are financed by DARPA, by the way) promise fabulous things: translation of binaries to LLVM IR, optimization, semantic code analysis, etc. And of course, recompilation to any architectures that LLVM supports. Open Source Project (link to github: https://github.com/trailofbits/mcsema )

And now let's see what really happens.

First, you need to immediately clarify that MCSema does not know how to disassemble binaries by itself. This requires IDA Pro, an expensive product in itself. But this business is solved, in principle, you can replace IDA Pro with something else if you really want to.
')
But how does the magic of converting assembler to IR happen?

(If someone does not know: IR is an intermediate language LLVM, a kind of generalized assembler, higher level than processor assemblers.)

Very simple. The entire set of x86-64 registers can be represented as a structure called SimulatedCPU, which, with some abbreviations, you can see here:

x86 registers

struct SimulatedCPU { enum CPUFlags // See: https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture { CF = (1 << 0), // 1 is 1 PF = (1 << 2), // 3 is 0 AF = (1 << 4), // 5 is 0 ZF = (1 << 6), SF = (1 << 7), TF = (1 << 8), IF = (1 << 9), DF = (1 << 10), OF = (1 << 11), // 12-13 is IOPL // 14 is NT // 15 is 0 // 16 is RF // 17 is VM // 18 is AC // 19 is VIF // 20 is VIP ID = (1 << 11), // ID flag // rest is 0 }; enum FPUControl // See: http://www.plantation-productions.com/Webster/www.artofasm.com/Linux/HTML/RealArithmetic.html { // Exception masks FPUC_InvalidOpMask = (1 << 0), FPUC_DenormalMask = (1 << 1), FPUC_ZeroDivideMask = (1 << 2), FPUC_OverflowMask = (1 << 3), FPUC_UnderflowMask = (1 << 4), FPUC_PrecisionMask = (1 << 5), // 6 is 1 (no idea why) // 7 is 0 FPUC_PrecisionControlMask = (3 << 8), // 2 bits, 00 - 24 bits (float) , 01 - reserved, 10 - 53 bits (double), 11 - 64 bits (FP80), default 11 FPUC_RoundingMask = (3 << 10), // 2 bits, 00 - to nearest or even, 01 - down, 10 - up, 11 - truncate, default 00 FPUC_Infinity = (1 << 12), // rest is 0 }; enum FPUStatus // 16 bits { FPUS_InvalidOpFlag = (1 << 0), FPUS_DenormalFlag = (1 << 1), FPUS_ZeroDivideFlag = (1 << 2), FPUS_OverflowFlag = (1 << 3), FPUS_UnderflowFlag = (1 << 4), FPUS_PrecisionFlag = (1 << 5), FPUS_StackFaultFlag = (1 << 6), // that's for FPU stack (8 regs) FPUS_ExceptionFlag = (1 << 7), // set if any previous is set FPUS_C0 = (1 << 8), // Condition bit 0 FPUS_C1 = (1 << 9), // Condition bit 1 FPUS_C2 = (1 << 10), // Condition bit 2 FPUS_TOP_Mask = (7 << 11), // top-of-stack pointer (3 bits) FPUS_C3 = (1 << 14), // Condition bit 3 FPUS_Busy = (1 << 15), // 1 is FPU is busy. I don't expcet any code to ever use it. }; enum SSE_MXCSR { MXCSR_InvalidOp = (1 << 0), MXCSR_Denormal = (1 << 1), MXCSR_DivideByZero = (1 << 2), MXCSR_Overflow = (1 << 3), MXCSR_Underflow = (1 << 4), MXCSR_Precision = (1 << 5), MXCSR_DenormalsAreZero = (1 << 6), MXCSR_InvalidOpExceptionMask = (1 << 7), MXCSR_DenormalExceptionMask = (1 << 8), MXCSR_DivideByZeroExceptionMask = (1 << 9), MXCSR_OverflowExceptionMask = (1 << 10), MXCSR_UnderflowExceptionMask = (1 << 11), MXCSR_PrecisionExceptionMask = (1 << 12), MXCSR_RoundingModeMask = (3 << 13), // 2-bit, so it's a mask MXCSR_FlushToZero = (1 << 15), }; // General Purpose Registers (32-bit in 32-bit mode, 64-bit in 64-bit mode) UINT_PTR RegAX; // 0x00 (0) UINT_PTR RegBX; // 0x04 (4) UINT_PTR RegCX; // 0x08 (8) UINT_PTR RegDX; // 0x0C (12) UINT_PTR RegSI; // 0x10 (16) UINT_PTR RegDI; // 0x14 (20) UINT_PTR RegSP; // 0x18 (24) UINT_PTR RegBP; // 0x1C (28) #ifdef EON_WIN64 UINT_PTR Reg08; UINT_PTR Reg09; UINT_PTR Reg10; UINT_PTR Reg11; UINT_PTR Reg12; UINT_PTR Reg13; UINT_PTR Reg14; UINT_PTR Reg15; UINT_PTR RegRIP; // possibly it won't be needed, if we fold all RIP-relative instructions on conversion. If you remove it, see about padding below. #endif // Various often used flags - close to other registers to improve caching if possible // EFLAGS (some) BYTE FlagCF; // 0x20 (32) BYTE FlagPF; // 0x21 (33) BYTE FlagAF; // 0x22 (34) BYTE FlagZF; // 0x23 (35) BYTE FlagSF; // 0x24 (36) BYTE FlagDF; // 0x25 (37) BYTE FlagOF; // 0x26 (38) // FPU Status Word (some) BYTE FlagFPU_C0; // 0x27 (39) BYTE FlagFPU_C1; // 0x28 (40) BYTE FlagFPU_C2; // 0x29 (41) BYTE FlagFPU_C3; // 0x2A (42) BYTE FlagFPU_TOP; // 0x2B (43) // FPU Control Word (some) BYTE FlagFPU_PC; // 0x2C (44) BYTE FlagFPU_RC; // 0x2D (45) // Rarely used flags, serving as paddding // FPU Control Word (some) BYTE FlagFPU_IM; // 0x2E (46) BYTE FlagFPU_DM; // 0x2F (47) BYTE FlagFPU_ZM; // 0x30 (48) BYTE FlagFPU_OM; // 0x31 (49) BYTE FlagFPU_UM; // 0x32 (50) BYTE FlagFPU_PM; // 0x33 (51) BYTE FlagFPU_X; // 0x34 (52) // FPU Status Word (some) BYTE FlagFPU_ES; // 0x35 (53) // Tag Word WORD FPUTagWord; // 0x36 (54) UINT_PTR ThreadEnvironmentBlock; // 0x38 (56) UINT_PTR ZeroRegister; // 0x3C (60) zero-filled dummy to give as CS,ES or SS register // XMM registers (aligned on 64-bit alignment) #ifdef EON_WIN64 BYTE XMMRegs[16*16]; // 16 128-bit (16-byte) registers #else BYTE XMMRegs[8*16]; // 0x40 (64) 8 128-bit (16-byte) registers #endif // x87 FPU registers (aligned on 64-bit alignment) DOUBLE FPURegs[8]; // 0xC0 (192) // Rarely used flags and other rarely used members // FPU Status Word (some) BYTE FlagFPU_IE; // 0x100 (256) BYTE FlagFPU_DE; // 0x101 (257) BYTE FlagFPU_ZE; // 0x102 (258) BYTE FlagFPU_OE; // 0x103 (259) BYTE FlagFPU_UE; // 0x104 (260) BYTE FlagFPU_PE; // 0x105 (261) BYTE FlagFPU_SF; // 0x106 (262) BYTE FlagFPU_B; // 0x107 (263) // MXCSR BYTE SSEFlag_InvalidOperation; // 0x108 (264) BYTE SSEFlag_Denormal; // 0x109 (265) BYTE SSEFlag_DivideByZero; // 0x10A (266) BYTE SSEFlag_Overflow; // 0x10B (267) BYTE SSEFlag_Underflow; // 0x10C (268) BYTE SSEFlag_Precision; // 0x10D (269) BYTE SSEFlag_DenormalsAreZero; // 0x10E (270) BYTE SSEFlag_InvalidOperationMask; // 0x10F (271) BYTE SSEFlag_DenormalMask; // 0x110 (272) BYTE SSEFlag_DivideByZeroMask; // 0x111 (273) BYTE SSEFlag_OverflowMask; // 0x112 (274) BYTE SSEFlag_UnderflowMask; // 0x113 (275) BYTE SSEFlag_PrecisionMask; // 0x114 (276) BYTE SSE_RoundingMode; // 0x115 (277) BYTE SSEFlag_FlushToZero; // 0x116 (278) BYTE FlagID; // 0x117 (279) - EFLAGS' ID flag // FPU debugging, currently unused but reserved UINT_PTR LastInstructionPtrOffset; // 0x118 (280) UINT_PTR LastDataPtrOffset; // 0x11A (281) WORD FPUFopcode; // 0x11E (282) WORD LastInstructionPtrSeg; // 0x120 (283) WORD LastDataPtrSeg; // 0x122 (284) #ifdef EON_WIN64 UINT_PTR StackValueAtLastNativeCodeEntry; // 0x124 (286), this is used to unwind simulated stack when part of it was used by native code #endif //... };

In the LLVM assembler, this structure is reflected very simply:

 %SimulatedCPU = type <{ i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i3, i2, i2, i1, i1, i1, i1, i1, i1, i1, i1, i16, i64, i64, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, [8 x double], i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i2, i1, i1, i64, i64, i16, i16, i16 }>

(i64, i1, etc. are integer types in LLVM IR, having, respectively, 64 bits, 1 bits, etc.) That is, general-purpose registers are of type i64, flags are of type i1, followed by 128 -bit registers XMM, registers FPU (type double), flags FPU and various service registers.
Suppose we have some function of the source program, which begins, for example, like this:

 0000000140820d40 mov rax, rsp 0000000140820d43 push rbp 0000000140820d44 push rbx 0000000140820d45 push rsi 0000000140820d46 push rdi 0000000140820d47 push r12 0000000140820d49 push r13 0000000140820d4b push r14 0000000140820d4d push r15

Nothing unusual, in the function prologue, we copy the RSP value into RAX and save some registers onto the stack, according to the Calling Convention. That is, it is not even the content part of the function itself, it is only part of the prologue.

What makes this MCSema code (with abbreviations):

 %0 = type opaque %SimulatedCPU = type <{ i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i3, i2, i2, i1, i1, i1, i1, i1, i1, i1, i1, i16, i64, i64, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, [8 x double], i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i2, i1, i1, i64, i64, i16, i16, i16 }> define hidden fastcc void @Foo(%SimulatedCPU* noalias nocapture align 16 dereferenceable(542)) entry: %R15_val = alloca i64 %XMM9_val = alloca i128 %XMM8_val = alloca i128 ... %RSP_val = alloca i64 %RBP_val = alloca i64 ... %RAX_val = alloca i64 ... %1 = getelementptr inbounds %SimulatedCPU, %SimulatedCPU* %0, i64 0, i32 0 %2 = load i64, i64* %1 store i64 %2, i64* %RAX_val ... %13 = getelementptr inbounds %SimulatedCPU, %SimulatedCPU* %0, i64 0, i32 6 %14 = load i64, i64* %13 store i64 %14, i64* %RSP_val %15 = getelementptr inbounds %SimulatedCPU, %SimulatedCPU* %0, i64 0, i32 7 %16 = load i64, i64* %15 store i64 %16, i64* %RBP_val ... block_0x10820d40: ; preds = %entry %165 = load i64, i64* %RSP_val store i64 %165, i64* %RAX_val %166 = load i64, i64* %RBP_val %167 = load i64, i64* %RSP_val %168 = sub i64 %167, 8 %169 = inttoptr i64 %168 to i64* store volatile i64 %166, i64* %169, align 4

For those who do not know LLVM IR, some explanations:

1. A single argument is always passed to the function - a pointer to the global structure of SimulatedCPU. The function always returns void.

2. We allocate on a stack memory for a variable 64 bits:
% RSP_val = alloca i64

3. Get the address of the 6th element in the SimulatedCPU structure (i.e. the RSP register)
% 13 = getelementptr inbounds% SimulatedCPU,% SimulatedCPU *% 0, i64 0, i32 6

4. Get the RSP value:
% 14 = load i64, i64 *% 13

5. Save it in a pre-allocated memory under the name RSP_val:
store i64% 14, i64 *% RSP_val

6. Load the RSP value into the variable again.
% 165 = load i64, i64 *% RSP_val

7. Copy to RAX (memory for RAX was allocated in the same way):
store i64% 165, i64 *% RAX_val

8. Get the RBP value:
% 166 = load i64, i64 *% RBP_val

9. Get RSP value
% 167 = load i64, i64 *% RSP_val

10. Subtract 8 from the RSP value (i.e., increment the stack by 8 bytes)
% 168 = sub i64% 167, 8

11. Convert the resulting value into an index.
% 169 = inttoptr i64% 168 to i64 *

12. Save the RBP value to the received address.
store volatile i64% 166, i64 *% 169, align 4

These are the commands into which MCSema has converted one command of the source program: push rbp.

But this is a non-optimized code, maybe the opt optimizer will make it efficient?

We'll see:

 %0 = type opaque %SimulatedCPU = type <{ i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i3, i2, i2, i1, i1, i1, i1, i1, i1, i1, i1, i16, i64, i64, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, i128, [8 x double], i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i1, i2, i1, i1, i64, i64, i16, i16, i16 }> define hidden fastcc void @Foo(%SimulatedCPU* noalias nocapture align 16 dereferenceable(542)) { entry: ... %12 = getelementptr inbounds %SimulatedCPU, %SimulatedCPU* %0, i64 0, i32 6 %13 = load i64, i64* %12, align 16 %14 = getelementptr inbounds %SimulatedCPU, %SimulatedCPU* %0, i64 0, i32 7 %15 = load i64, i64* %14, align 8 ... %56 = add i64 %13, -8 %57 = inttoptr i64 %56 to i64* store volatile i64 %15, i64* %57, align 4

So, the optimizer has removed the placement on the stack of copies of the registers of the simulated processor, replacing them with direct read / write operations into the global structure of SimulatedCPU. Has the program made it more effective? Definitely not. The optimizer cannot do anything else: the volatile qualifier is an insurmountable obstacle for it.

If you compile the resulting code and run it on ARM64, it will be approximately two orders of magnitude slower than the source code on x86-64.

MCSema does not see any variables on the function stack: it stupidly turns elementary operations with registers not even into operations with registers of the target processor (target), but into operations with the SimulatedCPU structure, which is located in memory (if the program is multi-threaded, then each thread has its own copy this structure).

Conclusions that can be drawn from the above: it is too early to talk about transferring binary files from one architecture to another. Instead of the promised semantic analysis, we simply have a very inefficient processor emulation. Such an approach, in principle, does not allow achieving an effective program operation on a processor with a different architecture.

Related Links:

1. Andrew Ruef. There and back again. Binary Analysis with mcsema.
2. Lifting Windows Driver Binaries into LLVM IR.

Source: https://habr.com/ru/post/347000/

All Articles

McSema and decompiling into LLVM source code: is it real?

More articles: