Position-independent code (PIC) in x64 shared libraries


Hi, I'm still Marco and still a system programmer at Badoo. Last week I published a translation about PIC in shared libraries, but there is a second part, about shared libraries on x64, so I decided not to leave the job unfinished.

The previous article explained how position-independent code (PIC) works, using examples compiled for the x86 architecture. I promised to cover PIC on x64 [1] in a separate article. Here it is. This article goes into far less detail, since it assumes you already understand how PIC works in theory. In essence, the idea is the same for both architectures, but some details differ because of features specific to each.

RIP-relative addressing

On x86, function calls (with the call instruction) use offsets relative to IP, but data references (with the mov instruction) support only absolute addresses. As you know from the previous article, this makes PIC somewhat less efficient, since PIC by its nature requires all offsets to be relative. Absolute addresses and position independence do not go well together.

x64 solves this problem with a new RIP-relative addressing mode, which is the default for all 64-bit mov instructions that reference memory (it is also used by other instructions, such as lea). Here is a quote from the "Intel Architecture Manual vol 2a" (one of the main documents on the Intel architecture):

RIP-relative (relative instruction-pointer) addressing is a new form of addressing implemented in 64-bit mode. The effective address is formed by adding a displacement to the 64-bit RIP of the next instruction.

The displacement used in RIP-relative mode is 32 bits in size. Since it has to work for both positive and negative offsets, it is a signed value, so the maximum offset from RIP supported by this addressing mode is roughly ±2 GB.
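The address arithmetic described above can be sketched in C. This is only an illustration of the rule "next instruction address plus signed 32-bit displacement"; the function names rip_relative_target and fits_disp32 are made up for this sketch and do not come from any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Effective address of a RIP-relative operand.
 * insn_addr: address of the instruction itself
 * insn_len:  its encoded length in bytes
 * disp32:    the signed 32-bit displacement stored in the instruction */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len, int32_t disp32)
{
    assert(insn_len >= 1 && insn_len <= 15);      /* x86 instructions are at most 15 bytes */
    uint64_t next_rip = insn_addr + insn_len;     /* RIP points at the next instruction */
    return next_rip + (uint64_t)(int64_t)disp32;  /* sign-extend, then add */
}

/* True if target is reachable from next_rip with a signed 32-bit displacement. */
bool fits_disp32(uint64_t next_rip, uint64_t target)
{
    int64_t delta = (int64_t)(target - next_rip);
    return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

For a 7-byte instruction at 0x5f6 with displacement 0x2009db, rip_relative_target yields 0x200fd8. fits_disp32 returns false as soon as the distance exceeds the roughly ±2 GB window.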

x64 PIC with data references: an example

For easier comparison, I will use the same C source as in the data reference example of the previous article:

    int myglob = 42;

    int ml_func(int a, int b)
    {
        return myglob + a + b;
    }

Let's look at the disassembly of ml_func:

    00000000000005ec <ml_func>:
     5ec:   55                      push   rbp
     5ed:   48 89 e5                mov    rbp,rsp
     5f0:   89 7d fc                mov    DWORD PTR [rbp-0x4],edi
     5f3:   89 75 f8                mov    DWORD PTR [rbp-0x8],esi
     5f6:   48 8b 05 db 09 20 00    mov    rax,QWORD PTR [rip+0x2009db]
     5fd:   8b 00                   mov    eax,DWORD PTR [rax]
     5ff:   03 45 fc                add    eax,DWORD PTR [rbp-0x4]
     602:   03 45 f8                add    eax,DWORD PTR [rbp-0x8]
     605:   c9                      leave
     606:   c3                      ret

The most interesting instruction here is at 0x5f6: it loads the address of myglob into rax by reading an entry in the GOT. As we can see, it uses RIP-relative addressing. Since the displacement is relative to the address of the next instruction, we actually get 0x5fd + 0x2009db = 0x200fd8. So the GOT entry holding the address of myglob is located at 0x200fd8. Let's check that this matches reality:

    $ readelf -S libmlpic_dataonly.so
    There are 35 section headers, starting at offset 0x13a8:

    Section Headers:
      [Nr] Name              Type             Address           Offset
           Size              EntSize          Flags  Link  Info  Align
    [...]
      [20] .got              PROGBITS         0000000000200fc8  00000fc8
           0000000000000020  0000000000000008  WA       0     0     8
    [...]

The GOT starts at 0x200fc8, so the entry for myglob is its third element (0x200fd8 - 0x200fc8 = 0x10, and each entry is 8 bytes). We can also see the relocation generated for the reference to myglob:

    $ readelf -r libmlpic_dataonly.so

    Relocation section '.rela.dyn' at offset 0x450 contains 5 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    [...]
    000000200fd8  000500000006 R_X86_64_GLOB_DAT 0000000000201010 myglob + 0
    [...]

We see a relocation entry for the address 0x200fd8, telling the dynamic linker to place the address of myglob there once the final address of the symbol is known.
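To read the Info column in listings like this one, it helps to know that in ELF64 the upper 32 bits of r_info hold the symbol table index and the lower 32 bits hold the relocation type. The following C sketch decodes it and also computes a GOT slot index. The helper names and the RELOC_ constants are invented for this example (the values match the AMD64 psABI; on Linux, <elf.h> provides the official macros).

```c
#include <assert.h>
#include <stdint.h>

/* Relocation type numbers from the AMD64 psABI, under illustrative names. */
enum {
    RELOC_X86_64_64        = 1,  /* S + A, 64-bit absolute address */
    RELOC_X86_64_PC32      = 2,  /* S + A - P, 32-bit PC-relative */
    RELOC_X86_64_GLOB_DAT  = 6,  /* S, stored into a GOT entry */
    RELOC_X86_64_JUMP_SLOT = 7   /* S, stored into the GOT entry used by a PLT stub */
};

/* Upper 32 bits of r_info: index into the symbol table. */
uint32_t reloc_sym_index(uint64_t r_info)
{
    return (uint32_t)(r_info >> 32);
}

/* Lower 32 bits of r_info: relocation type. */
uint32_t reloc_type(uint64_t r_info)
{
    return (uint32_t)(r_info & 0xffffffffu);
}

/* Zero-based index of a GOT entry, given the section start and entry size. */
uint64_t got_slot_index(uint64_t got_start, uint64_t entry_addr, uint64_t entry_size)
{
    assert(entry_size != 0 && entry_addr >= got_start);
    return (entry_addr - got_start) / entry_size;
}
```

Applied to the Info value 000500000006 above, this gives symbol index 5 and type 6 (R_X86_64_GLOB_DAT). got_slot_index(0x200fc8, 0x200fd8, 8) gives 2, that is, the third entry.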

Now it should be clear how the address of myglob is obtained in the code. The next instruction, at 0x5fd, dereferences that address to load the value of myglob into eax [2].

x64 PIC with function calls: an example

Now let's see how function calls work with PIC on x64. Again, we'll use the same example as in the previous article:

    int myglob = 42;

    int ml_util_func(int a)
    {
        return a + 1;
    }

    int ml_func(int a, int b)
    {
        int c = b + ml_util_func(a);
        myglob += c;
        return b + myglob;
    }

Disassembling ml_func , we get:

    000000000000064b <ml_func>:
     64b:   55                      push   rbp
     64c:   48 89 e5                mov    rbp,rsp
     64f:   48 83 ec 20             sub    rsp,0x20
     653:   89 7d ec                mov    DWORD PTR [rbp-0x14],edi
     656:   89 75 e8                mov    DWORD PTR [rbp-0x18],esi
     659:   8b 45 ec                mov    eax,DWORD PTR [rbp-0x14]
     65c:   89 c7                   mov    edi,eax
     65e:   e8 fd fe ff ff          call   560 <ml_util_func@plt>
     [... snip more code ...]

The call, as before, goes to ml_util_func@plt. Let's see what is there:

    0000000000000560 <ml_util_func@plt>:
     560:   ff 25 a2 0a 20 00       jmp    QWORD PTR [rip+0x200aa2]
     566:   68 01 00 00 00          push   0x1
     56b:   e9 d0 ff ff ff          jmp    540 <_init+0x18>

So the GOT entry holding the actual address of ml_util_func is located at 0x200aa2 + 0x566 = 0x201008. And the relocation entry is in place, as expected:

    $ readelf -r libmlpic.so

    Relocation section '.rela.dyn' at offset 0x480 contains 5 entries:
    [...]

    Relocation section '.rela.plt' at offset 0x4f8 contains 2 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    [...]
    000000201008  000600000007 R_X86_64_JUMP_SLO 000000000000063c ml_util_func + 0
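Both targets in this example can be recomputed directly from the instruction bytes in the listings. The C sketch below decodes the little-endian signed displacement and adds it to the address of the next instruction. The function names are invented for illustration, and only the two encodings that appear above are handled (E8 for call rel32, FF 25 for jmp through a RIP-relative memory operand).

```c
#include <assert.h>
#include <stdint.h>

/* Read a little-endian signed 32-bit displacement from instruction bytes. */
int32_t read_disp32(const uint8_t *p)
{
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (int32_t)u;
}

/* Target of a near call rel32 (opcode E8, 5 bytes in total). */
uint64_t call_rel32_target(uint64_t insn_addr, const uint8_t *insn)
{
    assert(insn[0] == 0xe8);
    return insn_addr + 5 + (uint64_t)(int64_t)read_disp32(insn + 1);
}

/* Address of the GOT slot read by jmp QWORD PTR [rip+disp32] (FF 25, 6 bytes). */
uint64_t plt_jmp_got_slot(uint64_t insn_addr, const uint8_t *insn)
{
    assert(insn[0] == 0xff && insn[1] == 0x25);
    return insn_addr + 6 + (uint64_t)(int64_t)read_disp32(insn + 2);
}
```

The call at 0x65e encodes the displacement fd fe ff ff, which is -0x103 as a signed value, so the target is 0x663 - 0x103 = 0x560, the PLT stub. The jmp at 0x560 encodes 0x200aa2, which leads to the GOT slot at 0x201008.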

Performance

In both examples, you can see that PIC on x64 requires fewer instructions than the same code on x86. On x86, the GOT address is loaded into a base register (ebx by convention) in two steps: first the address of the instruction is obtained with a special function call, and then the offset to the GOT is added. Neither step is needed on x64, since the relative offset to the GOT is known to the linker and can simply be encoded in the instruction itself using RIP-relative addressing.

When calling a function, there is also no need to prepare the GOT address in ebx for the trampoline, as x86 code must do, since the trampoline simply accesses its GOT entry directly through RIP-relative addressing.

So PIC on x64 still requires extra instructions compared to non-PIC code, but the overhead is smaller. The indirect cost of tying up a whole register as the GOT pointer is gone as well: RIP-relative addressing needs no additional registers [3]. As a result, PIC overhead on x64 is much lower than on x86, which makes PIC even more popular. So popular, in fact, that PIC is the default choice when building shared libraries on this architecture.

For the curious: non-PIC code on x64

GCC does not merely encourage you to use PIC for x64 shared libraries; it requires it by default. For example, if we compile the first example without -fpic [4] and then try to link it into a shared library with -shared, we get an error from the linker:

    /usr/bin/ld: ml_nopic_dataonly.o: relocation R_X86_64_PC32 against symbol `myglob' can not be used when making a shared object; recompile with -fPIC
    /usr/bin/ld: final link failed: Bad value
    collect2: ld returned 1 exit status

What's going on? Let's look at the disassembly of ml_nopic_dataonly.o [5]:

    0000000000000000 <ml_func>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   89 7d fc                mov    DWORD PTR [rbp-0x4],edi
       7:   89 75 f8                mov    DWORD PTR [rbp-0x8],esi
       a:   8b 05 00 00 00 00       mov    eax,DWORD PTR [rip+0x0]
      10:   03 45 fc                add    eax,DWORD PTR [rbp-0x4]
      13:   03 45 f8                add    eax,DWORD PTR [rbp-0x8]
      16:   c9                      leave
      17:   c3                      ret

Notice how myglob is accessed here, in the instruction at 0xa. The linker is expected to patch the actual location of myglob into the operand of that instruction (that is, no GOT is involved):

    $ readelf -r ml_nopic_dataonly.o

    Relocation section '.rela.text' at offset 0xb38 contains 1 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    00000000000c  000f00000002 R_X86_64_PC32     0000000000000000 myglob - 4
    [...]

This is the R_X86_64_PC32 relocation the linker complained about. It cannot link an object with such a relocation into a shared library. Why? Because the displacement we add to rip must fit in 32 bits, and we cannot guarantee that this will always be enough. After all, this is a full 64-bit architecture with a huge address space. The symbol may eventually end up in some other shared library that is so far away that 32 bits are not enough to reach it. So the R_X86_64_PC32 relocation is not suitable for shared libraries on x64.
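The value a linker would store for this relocation can be sketched in C, assuming the AMD64 psABI formula S + A - P for R_X86_64_PC32 (S is the symbol address, A the addend, P the address of the field being patched). The function name compute_pc32 and the addresses used with it are hypothetical. The addend of -4 seen in the readelf output compensates for the fact that P points at the 4-byte displacement field (offset 0xc), while the CPU adds the displacement to the address of the next instruction (0x10, which is P + 4).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the value for an R_X86_64_PC32 relocation: S + A - P.
 * Returns false when the result does not fit in a signed 32-bit field,
 * which is the situation the text above is worried about. */
bool compute_pc32(uint64_t sym_addr, int64_t addend, uint64_t place, int32_t *out)
{
    assert(out != NULL);
    int64_t value = (int64_t)(sym_addr + (uint64_t)addend - place);
    if (value < INT32_MIN || value > INT32_MAX)
        return false;          /* target is out of reach of a 32-bit displacement */
    *out = (int32_t)value;
    return true;
}
```

With a symbol at the illustrative address 0x201010, addend -4 and P = 0xc, the stored displacement is 0x201000, and 0x10 + 0x201000 lands exactly on the symbol. Move the symbol 4 GB further away and the function reports failure.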

But can we still create non-PIC code for x64 shared libraries? We can! We need to tell the compiler to use the so-called "large code model", which is done by adding the -mcmodel=large flag. The topic of code models is interesting, but explaining it would take us too far from the goal of this article [6]. So I'll just say briefly that a code model is a kind of agreement between the programmer and the compiler, in which the programmer makes certain promises to the compiler about the size of the offsets the program will use. In exchange, the compiler can generate better code.

It turns out that to make the compiler generate non-PIC code on x64 that satisfies the linker, only the "large code model" will do, because it is the least restrictive. Remember my explanation of why the simple relocation is not good enough on x64, since the offset may need more than 32 bits? The "large code model" assumes nothing about offsets and uses full 64-bit absolute addresses for all data references. This makes load-time relocations always safe and allows non-PIC code on x64. Let's look at the disassembly of the first example, compiled without -fpic and with -mcmodel=large:

    0000000000000000 <ml_func>:
       0:   55                      push   rbp
       1:   48 89 e5                mov    rbp,rsp
       4:   89 7d fc                mov    DWORD PTR [rbp-0x4],edi
       7:   89 75 f8                mov    DWORD PTR [rbp-0x8],esi
       a:   48 b8 00 00 00 00 00    mov    rax,0x0
      11:   00 00 00
      14:   8b 00                   mov    eax,DWORD PTR [rax]
      16:   03 45 fc                add    eax,DWORD PTR [rbp-0x4]
      19:   03 45 f8                add    eax,DWORD PTR [rbp-0x8]
      1c:   c9                      leave
      1d:   c3                      ret

The instruction at 0xa loads the address of myglob into rax. Note that its operand is currently zero, which tells us to expect a relocation here. Note also that the operand is a full 64-bit value, and that it is absolute, not RIP-relative [7]. Finally, note that two instructions are needed here to get the value of myglob into eax. This is one of the reasons why the "large code model" is less efficient than the alternatives.

Now let's look at the relocation:

    $ readelf -r ml_nopic_dataonly.o

    Relocation section '.rela.text' at offset 0xb40 contains 1 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    00000000000c  000f00000001 R_X86_64_64       0000000000000000 myglob + 0
    [...]

The relocation type has changed to R_X86_64_64. This is an absolute relocation with a 64-bit value. The linker is now satisfied and agrees to link this object file into a shared library.
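What resolving such a relocation amounts to can be sketched in C, assuming the psABI formula S + A for R_X86_64_64: the 8-byte immediate of the mov instruction, which starts at offset 0xc (two bytes after the 48 b8 prefix and opcode at 0xa), is overwritten with the final address of the symbol. The helpers write_le64 and apply_abs64 and the address used with them are hypothetical; a real loader works on mapped memory, not on a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 64-bit value in little-endian byte order. */
void write_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Apply an R_X86_64_64 relocation: write S + A at the given section offset. */
void apply_abs64(uint8_t *section, size_t offset, uint64_t sym_addr, int64_t addend)
{
    assert(section != NULL);
    write_le64(section + offset, sym_addr + (uint64_t)addend);
}
```

If myglob ended up at, say, 0x201010, then after apply_abs64(text, 0xc, 0x201010, 0) the instruction bytes at 0xa would read 48 b8 10 10 20 00 00 00 00 00, that is, mov rax,0x201010.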

A critical reader may ask why the compiler, by default, generates code that is not suitable for load-time relocation. The answer is simple. Keep in mind that code is usually linked directly into an executable, which requires no load-time relocations at all. So by default the compiler assumes the "small code model" in order to generate the most efficient code. If you know that your code will end up in a shared library and you do not want PIC, just tell the compiler explicitly to use the large code model. I think gcc's behavior here is quite reasonable.

Another question is why PIC has no problems with the "small code model". The reason is that the GOT is always located in the same shared library as the code that references it. Unless a single shared library is so large that it does not fit in a 32-bit address space, 32-bit RIP-relative offsets are enough to address it. Such huge shared libraries are unlikely, but if you happen to have one, the AMD64 ABI defines a "large PIC code model" for this purpose.

Conclusion

This article complements the previous one by showing how PIC works on the x64 architecture. This architecture has a new addressing mode that makes PIC faster, and therefore more attractive for shared libraries than it is on x86. This is important to know, since x64 is currently the most popular architecture for servers, desktops and laptops.

[1]

As always, I use x64 as a convenient short name for an architecture known as x86-64, AMD64 or Intel 64.

[2]

Into eax, not rax, since myglob is of type int, which is still 32 bits wide on x64.

[3]

By the way, tying up a register would be much less of a problem on x64 anyway, since it has twice as many general-purpose registers as x86.

[4]

This also happens if we explicitly request non-PIC code by passing -fno-pic to gcc.

[5]

Note that, unlike the other disassembly listings in this and the previous article, this one is of an object file, not a shared library or an executable. So it still contains relocations for the linker.

[6]

More detailed information is available in the AMD64 ABI and in man gcc.

[7]

Some assemblers call this instruction movabs to distinguish it from the other mov instructions that take relative addresses. The Intel architecture manual nonetheless calls it simply mov. Its opcode format is REX.W + B8 + rd.

Source: https://habr.com/ru/post/324616/
