Hi. My name is Marco, and I'm a systems programmer at Badoo. I like to understand thoroughly how things work, and the subtleties of shared libraries in Linux are no exception. Here is a translation of an analysis that does exactly that. Enjoy.
I have already described the need for special handling of shared libraries when they are loaded into a process's address space. In short, when the linker creates a shared library, it does not know in advance where the library will be loaded in memory. This makes references to data and code inside the library problematic: it is not clear how to create a reference that points to the right place once the library has been loaded.
In Linux and ELF, there are two main ways to solve this problem: load-time relocation and position-independent code (PIC).
We have already considered load-time relocation. Now let's look at the second approach, PIC.
Initially, I planned to cover both x86 and x64 (also known as x86-64), but the article kept growing, and I decided to be more practical. So this article covers only x86, and x64 will be discussed in a separate (and, I hope, much shorter) one. I chose the older x86 architecture because, unlike x64, it was designed without PIC in mind, so implementing PIC on it is a bit more complicated.
As we saw in the previous article, load-time relocation is a very simple and straightforward method. And it works. But PIC is much more popular today and is the recommended way to build shared libraries. Why, you ask?
Load-time relocation has a couple of problems: it takes time, and it makes the text section (which contains the machine code) unsuitable for sharing between processes.
Let's talk about the performance problem first. If a library was linked with entries for symbols that require relocation, then performing those relocations at load time takes a while. You might think this cannot take long, because the loader does not need to scan the entire text section, only walk through the relocation entries. But if a complex program loads several large libraries, the overhead accumulates very quickly, and the result is a noticeable delay at program startup.
Now a few words about the inability to share the text section. This problem is somewhat more serious. One of the main reasons shared libraries exist is to save memory. Some libraries are used by several applications at the same time. If the text section (where the machine code lives) can be loaded into memory only once and then mapped into each process with mmap, a considerable amount of RAM can be saved. But this is not possible with load-time relocation, since the text section must be modified at load time to substitute the correct addresses for each particular process. As a result, every process that uses the library has to keep its own full copy of it in memory [1]. No sharing takes place.
Moreover, keeping the text section writable (and it must be writable so that the loader can fix up the references) is bad from a security point of view. It makes the library much easier to exploit.
As we will see, PIC solves these problems almost completely.
The idea behind PIC is very simple: add an intermediate layer of indirection to the code for all references to global data and functions. By cleverly using some artifacts of the linking and loading processes, we can make the text section truly independent of the address at which it is placed; we can map it with mmap at different addresses in different processes' address spaces without changing a single bit in it. In the next few sections I will show how this is achieved.
One of the key ideas PIC relies on is the offset between the text and data sections, which is known to the linker at link time. When the linker combines several object files, it gathers their sections together (for example, all text sections are merged into one large text section). Thus both the sizes of the sections and their relative positions are known to the linker.
For example, the data section may immediately follow the text section. In that case, the offset from any instruction in the text section to the beginning of the data section equals the size of the text section minus the offset of that instruction from the beginning of the text section. All of these sizes and offsets are known to the linker.
Suppose the code section was loaded at some address 0xXXXX0000 that is unknown at link time (the X's literally mean "whatever is there"), and the data section immediately follows it at 0xXXXXF000. If an instruction at offset 0x80 in the code section wants to refer to something in the data section, the linker knows the relative offset (0xEF80 in this case) and can encode it in the instruction.
Note that nothing changes if another section is placed between the code and data sections, or if the data section comes before the code section. Since the linker knows the sizes of all sections and decides where to put them, the idea remains the same.
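The point that the relative offset does not depend on the load address can be checked with a few lines of C. This is only an illustrative sketch: the function name relative_offset is made up for this example, and the addresses are plain 32-bit numbers rather than real pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: distance from an instruction to a location in the
 * data section, both given as absolute 32-bit addresses. The unknown load
 * base is part of both operands and cancels out in the subtraction. */
uint32_t relative_offset(uint32_t insn_addr, uint32_t data_addr)
{
    assert(data_addr >= insn_addr); /* layout from the text: data after code */
    return data_addr - insn_addr;
}
```

With the numbers from the text, relative_offset(0x12340080, 0x1234F000) and relative_offset(0xABCD0080, 0xABCDF000) both give 0xEF80, whatever the upper half of the load address happens to be.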
Everything described above works only if we can actually use relative offsets. But data references on x86 (for example, in the MOV instruction) require absolute addresses. So what do we do?
If we have a relative address and need an absolute one, what we are missing is the value of the instruction pointer (IP), also called the program counter. After all, by definition a relative address is relative to the IP. x86 has no instruction for reading the IP, but we can use a simple trick. Here is a bit of assembly pseudocode that demonstrates it:
        call TMPLABEL
    TMPLABEL:
        pop ebx
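What's going on here:
1. The CPU executes call TMPLABEL, which pushes the address of the next instruction (pop ebx) onto the stack and jumps to the label.
2. Because the instruction at the label is pop ebx, it runs next and pops the top of the stack into ebx. That value is the address of the pop ebx instruction itself, so ebx now holds the value of the instruction pointer.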
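The call/pop trick can also be modelled in C with a toy stack. This is a simulation, not real inline assembly: the struct sim_cpu type and the function names are invented for illustration, and the only hardware fact it relies on is that a near call with a 32-bit displacement occupies 5 bytes.

```c
#include <assert.h>
#include <stdint.h>

enum { CALL_REL32_SIZE = 5 };   /* opcode e8 plus a 32-bit displacement */

/* Toy CPU state: a small stack and the ebx register. */
struct sim_cpu {
    uint32_t stack[8];
    int      sp;        /* number of values currently on the stack */
    uint32_t ebx;
};

/* "call": push the address of the instruction that follows the call. */
void sim_call(struct sim_cpu *cpu, uint32_t call_addr)
{
    assert(cpu->sp < 8);
    cpu->stack[cpu->sp++] = call_addr + CALL_REL32_SIZE;
}

/* "pop ebx": move the top of the stack into ebx. */
void sim_pop_ebx(struct sim_cpu *cpu)
{
    assert(cpu->sp > 0);
    cpu->ebx = cpu->stack[--cpu->sp];
}

/* Run the two-instruction sequence for a call located at call_addr. */
uint32_t ip_after_call_pop(uint32_t call_addr)
{
    struct sim_cpu cpu = { {0}, 0, 0 };
    sim_call(&cpu, call_addr);
    sim_pop_ebx(&cpu);
    return cpu.ebx;     /* address of the pop instruction itself */
}
```

For a call placed at 0x1000, ip_after_call_pop(0x1000) returns 0x1005, the address of the instruction right after the call.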
Now we have everything we need to finally talk about how position-independent data addressing is implemented on x86. It is implemented using the global offset table, or GOT.
The GOT is simply a table of addresses located in the data section. Suppose an instruction in the code section wants to refer to a variable. Instead of accessing it through an absolute address (which would require a relocation), it accesses an entry in the GOT. Since the GOT has a well-defined place in the data section, and the linker knows where that is, this access is also relative. The GOT entry, in turn, contains the absolute address of the variable.
In pseudo-assembly, we replace an instruction that uses absolute addressing:

    ; put the value of the variable into edx
    mov edx, [ADDR_OF_VAR]

with addressing through a register plus one extra level of indirection:

    ; 1. Somehow find the GOT address and put it in ebx
    lea ebx, ADDR_OF_GOT

    ; 2. Suppose the variable's address (ADDR_OF_VAR) is stored at offset 0x10
    ;    in the GOT. Then this instruction puts ADDR_OF_VAR into edx
    mov edx, DWORD PTR [ebx + 0x10]

    ; 3. Finally, access the variable itself and put its value into edx
    mov edx, DWORD PTR [edx]
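The same double indirection can be written out in C as a toy model. Everything here is hypothetical: toy_got, some_var, toy_relocate and load_via_got are names invented for the sketch, and a real GOT is laid out by the linker, not declared as an array.

```c
#include <assert.h>
#include <stdint.h>

int some_var = 42;              /* stand-in for a global variable */

/* Slot index 4 is byte offset 0x10 on 32-bit x86, where entries are
 * 4 bytes wide (on a 64-bit host the entries are wider). */
enum { VAR_SLOT = 0x10 / 4 };

/* Toy GOT: a table of absolute addresses living in writable data. */
uintptr_t toy_got[8];

/* What the loader's relocation does: store the variable's real address. */
void toy_relocate(void)
{
    toy_got[VAR_SLOT] = (uintptr_t)&some_var;
}

/* The pseudo-assembly steps 2 and 3, given the GOT base from step 1. */
int load_via_got(const uintptr_t *got_base)
{
    uintptr_t addr = got_base[VAR_SLOT];    /* mov edx, [ebx + 0x10] */
    return *(const int *)addr;              /* mov edx, [edx]        */
}
```

After toy_relocate() has run, load_via_got(toy_got) returns the current value of some_var. The code that reads the variable never contains its address; only the table does.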
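Thus, we got rid of the relocation in the code section by redirecting variable references through the GOT. But we also created a relocation in the data section. Why? Because the GOT must still contain the absolute address of the variable for the scheme above to work. So what have we gained?
Quite a lot, as it turns out. Relocations in the data section cause far fewer problems than relocations in the code section. There are two reasons for this, corresponding to the two problems of load-time relocation described above:
1. A relocation in the code section is needed for every reference to a variable, while the GOT needs only one relocation per variable. There are usually many more references than variables, so this is more efficient.
2. The data section is writable and is not shared between processes anyway, so relocating it does no harm. Moving the relocations out of the code section, however, lets the code section stay read-only and be shared between processes.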
Now I will show a full-fledged example that demonstrates the PIC mechanics:
    int myglob = 42;

    int ml_func(int a, int b)
    {
        return myglob + a + b;
    }
This code will be compiled into a shared library, libmlpic_dataonly.so, using the -fpic and -shared flags.
Let's see what the compiler generated, focusing on the ml_func function:
    0000043c <ml_func>:
     43c:   55                      push   ebp
     43d:   89 e5                   mov    ebp,esp
     43f:   e8 16 00 00 00          call   45a <__i686.get_pc_thunk.cx>
     444:   81 c1 b0 1b 00 00       add    ecx,0x1bb0
     44a:   8b 81 f0 ff ff ff       mov    eax,DWORD PTR [ecx-0x10]
     450:   8b 00                   mov    eax,DWORD PTR [eax]
     452:   03 45 08                add    eax,DWORD PTR [ebp+0x8]
     455:   03 45 0c                add    eax,DWORD PTR [ebp+0xc]
     458:   5d                      pop    ebp
     459:   c3                      ret

    0000045a <__i686.get_pc_thunk.cx>:
     45a:   8b 0c 24                mov    ecx,DWORD PTR [esp]
     45d:   c3                      ret
I will refer to instructions by their addresses (the leftmost number in the output). Each address is an offset from the address at which the library is mapped.
Using readelf -S, we can also find out where the linker put the GOT:
    Section Headers:
      [Nr] Name     Type      Addr     Off    Size   ES Flg Lk Inf Al
      <snip>
      [19] .got     PROGBITS  00001fe4 000fe4 000010 04  WA  0   0  4
      [20] .got.plt PROGBITS  00001ff4 000ff4 000014 04  WA  0   0  4
      <snip>
Let's take out a calculator and check the compiler's work. We are looking for myglob. As I mentioned above, the call to __i686.get_pc_thunk.cx puts the address of the next instruction into ecx. That address is 0x444 [2]. The next instruction adds 0x1bb0 to it, so ecx ends up as 0x1ff4. Finally, the GOT entry that holds the address of myglob is read from [ecx - 0x10]. The entry is therefore at 0x1fe4, which is the first entry in the GOT according to the section header.
Why there is another section whose name begins with .got, I will explain later [3]. Note that the compiler chose to point ecx just past the end of the GOT and then use a negative offset. That is fine as long as the math works out. And so far it does.
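The calculator work above is easy to reproduce in C. The sketch below is only an illustration: got_slot_address is a made-up name, and the three arguments are the numbers read off the disassembly (address after the call, the immediate of the add, and the displacement of the mov).

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the sequence:
 *   call get_pc_thunk        ; ecx = address of the next instruction
 *   add  ecx, add_imm
 *   mov  eax, [ecx + disp]
 * and returns the address that the mov reads from. */
uint32_t got_slot_address(uint32_t next_insn, uint32_t add_imm, int32_t disp)
{
    uint32_t ecx = next_insn + add_imm;   /* value of ecx after the add */
    return ecx + (uint32_t)disp;          /* wraps around for negative disp */
}
```

got_slot_address(0x444, 0x1bb0, -0x10) gives 0x1fe4. The bytes f0 ff ff ff in the encoding of the mov instruction are the same -0x10, stored as a 32-bit two's-complement value.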
But there is one thing that we still lack. How exactly does the address of myglob appear in the GOT element at 0x1fe4? Recall that I mentioned relocation, so let's find it:
    > readelf -r libmlpic_dataonly.so

    Relocation section '.rel.dyn' at offset 0x2dc contains 5 entries:
     Offset     Info    Type            Sym.Value  Sym. Name
    00002008  00000008 R_386_RELATIVE
    00001fe4  00000406 R_386_GLOB_DAT    0000200c   myglob
    <snip>
Here it is: the relocation for myglob, pointing to address 0x1fe4, as we expected. The relocation is of type R_386_GLOB_DAT, which simply tells the loader: "put the actual value of the symbol (that is, its address) at the given offset". Now everything is clear. All that remains is to see how this looks once the library is loaded. We can do that by writing a simple binary (driver) that links against libmlpic_dataonly.so and calls ml_func, and then running it under gdb.
    > gdb driver
    [...] skipping output
    (gdb) set environment LD_LIBRARY_PATH=.
    (gdb) break ml_func
    [...]
    (gdb) run
    Starting program: [...]pic_tests/driver

    Breakpoint 1, ml_func (a=1, b=1) at ml_reloc_dataonly.c:5
    5         return myglob + a + b;
    (gdb) set disassembly-flavor intel
    (gdb) disas ml_func
    Dump of assembler code for function ml_func:
       0x0013143c <+0>:   push   ebp
       0x0013143d <+1>:   mov    ebp,esp
       0x0013143f <+3>:   call   0x13145a <__i686.get_pc_thunk.cx>
       0x00131444 <+8>:   add    ecx,0x1bb0
    => 0x0013144a <+14>:  mov    eax,DWORD PTR [ecx-0x10]
       0x00131450 <+20>:  mov    eax,DWORD PTR [eax]
       0x00131452 <+22>:  add    eax,DWORD PTR [ebp+0x8]
       0x00131455 <+25>:  add    eax,DWORD PTR [ebp+0xc]
       0x00131458 <+28>:  pop    ebp
       0x00131459 <+29>:  ret
    End of assembler dump.
    (gdb) i registers
    eax            0x1       1
    ecx            0x132ff4  1257460
    [...] skipping output
The debugger has entered ml_func and stopped at IP 0x0013144a [4]. We see that ecx is 0x132ff4 (the address of the add instruction, 0x131444, plus 0x1bb0). Note that at this point, at run time, all of these are absolute addresses: the library is already loaded into the process's address space.
So the GOT entry for myglob should be at [ecx - 0x10]. Let's check:
    (gdb) x 0x132fe4
    0x132fe4:       0x0013300c
That is, we expect that 0x0013300c is the address of myglob. Checking:
    (gdb) p &myglob
    $1 = (int *) 0x13300c
And it is!
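What the loader did with the relocation can be sketched in C as follows. This is a simplified model, not the loader's actual code: struct toy_reloc and both functions are invented for the example, it assumes the symbol resolves to the definition inside this same library, and the load base 0x131000 is inferred from the gdb listing (0x0013143c minus the file offset 0x43c).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of one GLOB_DAT-style relocation. */
struct toy_reloc {
    uint32_t offset;     /* where to write, relative to the load base   */
    uint32_t sym_value;  /* symbol's address, relative to the load base */
};

/* Absolute address of the GOT slot that the loader patches. */
uint32_t reloc_target(uint32_t load_base, const struct toy_reloc *r)
{
    return load_base + r->offset;
}

/* Value the loader stores there: the symbol's absolute address. */
uint32_t reloc_value(uint32_t load_base, const struct toy_reloc *r)
{
    return load_base + r->sym_value;
}
```

With offset 0x1fe4 and symbol value 0x200c from the readelf output, a load base of 0x131000 gives a target slot of 0x132fe4 holding 0x13300c, which are the two numbers gdb printed.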
So, we have seen how PIC works for data addresses. But what about functions? In theory, the same approach would work for functions. Instead of the call instruction containing the address of the function, it could contain the address of a GOT entry, and that entry would be filled in at load time.
But function calls in PIC do not work that way; in reality things are somewhat more complicated. Before explaining exactly how, I will briefly describe the motivation for the mechanism that was chosen.
When a shared library refers to a function, the real address of that function is not yet known. Determining the real address is called binding, and it is what the loader does when it loads the shared library into the process's address space. Binding is not trivial, since the loader has to look up function symbols in special tables [5].
Resolving the real address of each function therefore takes some time (not much, but since libraries typically contain many more functions than global variables, it adds up). Moreover, in most cases the work is wasted, since a typical run of a program calls only a small fraction of the functions (think of how many calls happen only on errors or under special conditions).
To speed this process up, a clever "lazy" binding scheme was devised. "Lazy" is a generic term for a family of optimizations in computing in which work is postponed until the last possible moment. The point is to avoid doing work that may turn out to be unnecessary. Other examples of "lazy" optimization are copy-on-write and lazy evaluation.
The lazy scheme is implemented by adding one more level of indirection: the procedure linkage table (PLT).
The PLT is a part of the text section of a binary that consists of a set of entries (one entry per external function that the library calls). Each PLT entry is a small piece of executable machine code. Instead of calling the function directly, the code calls the corresponding PLT entry, which in turn calls the function itself. This arrangement is often called a "trampoline". Each PLT entry has a corresponding GOT entry that holds the actual address of the function, but only once the loader has resolved it.
At first glance this is rather confusing, but I hope it will soon become clearer: the following sections go through the details.
As I already mentioned, the PLT makes lazy resolution of function addresses possible. At the moment the shared library is first loaded, the real addresses of the functions have not yet been resolved.
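Explanation:
- The code calls a function func. The compiler translates this into a call to func@plt, which is some n-th entry of the PLT.
- The PLT consists of a special first entry followed by a set of identically structured entries, one per function that needs resolving.
- Each PLT entry except the first consists of three parts: a jump to the address stored in the corresponding GOT entry, preparation of arguments for the resolver, and a call to the resolver, which is reached through the first PLT entry.
- The first PLT entry calls the resolver routine, which is located in the loader itself. This routine finds the real address of the function.
- Until the real address has been found, the n-th GOT entry simply points to the instruction right after the jump in PLT[n].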
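What happens when func is called for the first time:
- PLT[n] is called and jumps to the address stored in GOT[n].
- That address points back into PLT[n], at the code that prepares the arguments for the resolver.
- The resolver is called.
- The resolver finds the real address of func, writes it into GOT[n], and calls func.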
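After the first call, the picture is a bit different.
Note that GOT[n] now points to the real func [7] instead of pointing back into the PLT. So when the function is called again, the following happens:
- PLT[n] is called and jumps to the address stored in GOT[n].
- GOT[n] points to func, so control passes straight to func.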
In other words, func is now called without going through the resolver, at the cost of one extra jump through the PLT. This mechanism makes it possible to resolve function addresses lazily and to do no resolution at all for functions that are never called.
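The lazy scheme can be imitated in plain C with a function pointer that patches itself on first use. This is a toy analogy under obvious assumptions: got_slot, lazy_stub, call_func and resolver_runs are invented names, and the "resolution" here is a simple assignment rather than a symbol lookup in the loader.

```c
#include <assert.h>

typedef int (*func_ptr)(int);

int resolver_runs = 0;              /* counts how often resolution happens */

int func(int a) { return a + 1; }   /* the "real" function */

int lazy_stub(int a);               /* plays the role of the resolver path */

/* Toy GOT slot: initially it points at the stub, not at func. */
func_ptr got_slot = lazy_stub;

/* The first call lands here: resolve, patch the slot, then call func. */
int lazy_stub(int a)
{
    resolver_runs++;
    got_slot = func;                /* what the resolver writes into GOT[n] */
    return func(a);
}

/* Every call site goes through the slot, like the jump in PLT[n]. */
int call_func(int a)
{
    return got_slot(a);
}
```

Calling call_func twice runs the stub only once: after the first call got_slot points at func, so later calls pay for the indirect jump but not for resolution. If call_func is never called, no resolution work is done at all.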
Note that the library remains completely independent of the address at which it is loaded, because the only place where an absolute address is used is the GOT, which lives in the data section and is relocated by the loader at load time. Even the PLT does not depend on the load address, so it can live in the read-only text section.
I have not gone into the details of the resolver, but they are not that important here. The resolver is just a piece of low-level code in the loader that does its job. The arguments prepared before it is called tell it which function's address to resolve and where to store the result.
Well, in order to back up the theory with practice, consider an example that demonstrates a function call using the method described above.
Here is the shared library code:
    int myglob = 42;

    int ml_util_func(int a)
    {
        return a + 1;
    }

    int ml_func(int a, int b)
    {
        int c = b + ml_util_func(a);
        myglob += c;
        return b + myglob;
    }
This code will be compiled into libmlpic.so, and we will focus on the call to ml_util_func from ml_func. Here is the disassembly of ml_func:
    00000477 <ml_func>:
     477:   55                      push   ebp
     478:   89 e5                   mov    ebp,esp
     47a:   53                      push   ebx
     47b:   83 ec 24                sub    esp,0x24
     47e:   e8 e4 ff ff ff          call   467 <__i686.get_pc_thunk.bx>
     483:   81 c3 71 1b 00 00       add    ebx,0x1b71
     489:   8b 45 08                mov    eax,DWORD PTR [ebp+0x8]
     48c:   89 04 24                mov    DWORD PTR [esp],eax
     48f:   e8 0c ff ff ff          call   3a0 <ml_util_func@plt>
     <... snip more code>
The interesting part is the call to ml_util_func@plt. Notice also that the address of the GOT is in ebx. This is what ml_util_func@plt looks like (it is located in the .plt section, which has execute permission):
    000003a0 <ml_util_func@plt>:
     3a0:   ff a3 14 00 00 00       jmp    DWORD PTR [ebx+0x14]
     3a6:   68 10 00 00 00          push   0x10
     3ab:   e9 c0 ff ff ff          jmp    370 <_init+0x30>
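Recall that each PLT entry consists of three parts:
- a jump to the address stored in the GOT (here, the jump through [ebx+0x14]);
- preparation of the arguments for the resolver;
- a call to the resolver.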
The resolver (entry 0 in the PLT) is located at 0x370, but it does not interest us right now. It is much more interesting to see what the GOT contains. For that we need the calculator again.
The trick for getting the current IP in ml_func yielded the address 0x483, to which 0x1b71 was added. So the GOT is located at 0x1ff4. We can see what is there using readelf [8]:
    > readelf -x .got.plt libmlpic.so

    Hex dump of section '.got.plt':
      0x00001ff4 241f0000 00000000 00000000 86030000 $...............
      0x00002004 96030000 a6030000                   ........
The GOT entry for ml_util_func@plt is at offset +0x14, that is, at 0x2008. Judging by the output above, the word at that address has the value 0x3a6, which is the address of the push instruction in ml_util_func@plt.
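Reading the value 0x3a6 out of the hex dump requires remembering that x86 stores words in little-endian byte order, so the bytes a6 03 00 00 encode 0x000003a6. A small C sketch of that decoding follows; the function names are made up, and the byte array is copied by hand from the readelf output.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of .got.plt as printed by readelf, starting at address 0x1ff4. */
const uint8_t got_plt_dump[24] = {
    0x24, 0x1f, 0x00, 0x00,  0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,  0x86, 0x03, 0x00, 0x00,
    0x96, 0x03, 0x00, 0x00,  0xa6, 0x03, 0x00, 0x00
};

/* Decode one little-endian 32-bit word. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Word stored at address addr, given the address where the dump starts. */
uint32_t word_at(const uint8_t *section, uint32_t section_addr, uint32_t addr)
{
    assert(addr >= section_addr && (addr - section_addr) % 4 == 0);
    return read_le32(section + (addr - section_addr));
}
```

word_at(got_plt_dump, 0x1ff4, 0x1ff4 + 0x14) yields 0x3a6, the address of the push instruction in ml_util_func@plt.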
To help the loader do its job, a relocation entry has also been added that says which place in the GOT must be patched for ml_util_func:
    > readelf -r libmlpic.so
    [...] snip output

    Relocation section '.rel.plt' at offset 0x328 contains 3 entries:
     Offset     Info    Type            Sym.Value  Sym. Name
    00002000  00000107 R_386_JUMP_SLOT   00000000   __cxa_finalize
    00002004  00000207 R_386_JUMP_SLOT   00000000   __gmon_start__
    00002008  00000707 R_386_JUMP_SLOT   0000046c   ml_util_func
The last line means that the loader must put the address of the ml_util_func symbol at 0x2008 (which, in turn, is the GOT entry for this function).
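The Info column in the readelf output packs two fields into one 32-bit word: on 32-bit ELF the low 8 bits are the relocation type and the upper 24 bits are the index of the symbol in the symbol table. The C sketch below decodes it. The function and constant names are chosen for this example (the TOY_ prefix marks them as not coming from any system header); the numeric type values are the ones the i386 ABI assigns.

```c
#include <assert.h>
#include <stdint.h>

/* i386 relocation type numbers, named after the readelf output. */
enum {
    TOY_R_386_GLOB_DAT  = 6,
    TOY_R_386_JUMP_SLOT = 7,
    TOY_R_386_RELATIVE  = 8
};

/* Upper 24 bits: index into the dynamic symbol table. */
uint32_t rel_sym_index(uint32_t info) { return info >> 8; }

/* Low 8 bits: relocation type. */
uint32_t rel_type(uint32_t info) { return info & 0xffu; }
```

For the ml_util_func line, Info is 0x00000707, so the type is 7 (R_386_JUMP_SLOT) and the symbol index is 7. For the earlier myglob relocation, 0x00000406 decodes to type 6 (R_386_GLOB_DAT) and symbol index 4.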
It would be cool to see how this modification happens in the GOT. To do this, use GDB again.
    > gdb driver
    [...] skipping output
    (gdb) set environment LD_LIBRARY_PATH=.
    (gdb) break ml_func
    Breakpoint 1 at 0x80483c0
    (gdb) run
    Starting program: /pic_tests/driver

    Breakpoint 1, ml_func (a=1, b=1) at ml_main.c:10
    10        int c = b + ml_util_func(a);
    (gdb)
We are now in front of the first call to ml_util_func. Recall that the GOT address is in ebx. Let's see what is there:
    (gdb) i registers ebx
    ebx            0x132ff4
The entry we need is at [ebx + 0x14]:
    (gdb) x/w 0x133008
    0x133008:       0x001313a6
Yes, it ends in 0x3a6. That looks right. Now let's step into ml_util_func and look again:
    (gdb) step
    ml_util_func (a=1) at ml_main.c:5
    5         return a + 1;
    (gdb) x/w 0x133008
    0x133008:       0x0013146c
The value at 0x133008 has changed. It turns out that 0x0013146c is the real address ml_util_func, which was put there by the loader:
    (gdb) p &ml_util_func
    $1 = (int (*)(int)) 0x13146c <ml_util_func>
As we expected.
Now is a good time to mention that the lazy resolution performed by the loader can be configured with several environment variables (and with corresponding ld linker options). These settings are sometimes useful for debugging or for special performance requirements.
The LD_BIND_NOW variable, when set, tells the loader to resolve all addresses at startup rather than lazily. You can check its effect by setting it and looking at the gdb output for the example above: the GOT entry for ml_util_func will contain the real address of the function even before the first call.
In contrast, LD_BIND_NOT tells the loader never to update the GOT. In that case, every function call goes through the resolver.
The loader can be configured with some other flags as well. I recommend reading man ld.so; there is a lot of interesting information there.
We started with the problems of load-time relocation and how PIC solves them. But PIC itself, alas, is not free of problems either. One of them is the cost of the extra indirection: an additional memory access every time a global variable or function is referenced. How bad this is depends on the compiler, the processor architecture, and the application itself.
Another, less obvious problem is that implementing PIC ties up an additional register. To avoid locating the GOT too often, it makes sense for the compiler to generate code that keeps its address in a register (for example, ebx). But this means that an entire register is devoted to the GOT. For RISC architectures, which usually have plenty of general-purpose registers, this is not much of a problem, but the same cannot be said of architectures like x86, which have few registers available. Using PIC means one general-purpose register fewer, and therefore more memory accesses.
Now you know what position-independent code is and how it helps create shared libraries with a shareable, read-only text section.
PIC has advantages and disadvantages compared with load-time relocation, and the outcome depends on many factors (in particular, on the architecture of the processor the program runs on).
Nevertheless, despite its shortcomings, PIC is becoming an increasingly popular approach. Some non-Intel architectures, such as SPARC64, require PIC for shared libraries, and many others (for example, ARM) provide IP-relative addressing modes to make PIC more efficient. Both are true of x64, the successor to x86.
I did not focus on performance or on processor architectures. My goal was to explain how PIC works. If the explanation was not clear enough, let me know in the comments and I will try to provide more information.
Source: https://habr.com/ru/post/323904/