Hello World's thorny path

This article was inspired by a similar publication for the x86 architecture [1].

This material is for those who want to understand how programs are built on the inside, what happens before main is entered, and why. I will also show how some features of the glibc library can be used. At the end, as in the original article [1], the path travelled is presented visually. Most of the article is a walk through the glibc library.

So let's start our trip. We will use Linux x86-64, and lldb as a debugging tool. Also, sometimes we will disassemble the program with objdump.

The source is a plain Hello, world (hello1.cpp):

    #include <iostream>

    int main() {
        std::cout << "Hello, world!" << std::endl;
    }

Just in case, here is information about the system and tools:

 * Clang -- 4.0.1
 * lldb -- 4.0.1
 * glibc -- 2.25
 * `uname -r` -- 4.12.10-1-ARCH

Compile the code and start debugging:

    clang++ -stdlib=libc++ hello1.cpp -g -o hello1.out
    lldb hello1.out

Note

Most of the code examined in this article is almost independent of the chosen compiler and C++ library. It just so happens that the LLVM infrastructure is a bit closer to me than gcc, so the clang compiler with the libc++ library is used; but again, there is not much difference, because most of the code under consideration comes from glibc.

When a program is launched from bash (or another shell), the shell calls fork to create a new process and then execve to replace its image with the program, passing it the command-line arguments. The standard descriptors (STDIN, STDOUT, STDERR) are set up before control reaches the first instruction of the executable file. Then, in the case of dynamic linking, the dynamic linker loads and initializes the required libraries and calls the functions listed in the ".preinit_array" section. Only after all this is the first function located in the executable file itself called (not counting ".preinit_array"); it is traditionally named _start and is considered the beginning of the program. With static linking, the work of the dynamic linker, such as processing the ".preinit_array" section, is done inside the executable file, and the functions involved differ slightly from those in dynamically linked programs. We will consider dynamically linked programs.
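The fork and execve sequence described above can be sketched in a few lines. This is a minimal illustration, not what bash actually does internally; the helper name run_child_with_exit_code is hypothetical, and the sketch assumes that /bin/sh exists on the system.

```cpp
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: start "/bin/sh -c 'exit <code>'" the way a shell
// starts any program, and return the child's exit code (-1 on failure).
int run_child_with_exit_code(int code) {
    char script[32];
    std::snprintf(script, sizeof script, "exit %d", code);

    pid_t pid = fork();                  // duplicate the current process
    if (pid < 0) return -1;

    if (pid == 0) {
        // Child: replace the process image. argv and envp end up on the
        // initial stack of the new program.
        char* const argv[] = {const_cast<char*>("sh"),
                              const_cast<char*>("-c"), script, nullptr};
        char* const envp[] = {nullptr};
        execve("/bin/sh", argv, envp);
        _exit(127);                      // reached only if execve failed
    }

    int status = 0;                      // Parent: wait for the child
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_with_exit_code(7) should return 7: the child becomes /bin/sh, whose own _start runs exactly as described in the rest of this article.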

The entry point of the executable file is recorded in its header:

 readelf -h hello1.out | grep Entry
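The same field can be read programmatically. The sketch below assumes a 64-bit ELF file and the Linux <elf.h> header; the function name elf_entry_point is made up for this illustration.

```cpp
#include <elf.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Returns the e_entry field of a 64-bit ELF file, or 0 if the file cannot
// be read or is not ELF64.
std::uint64_t elf_entry_point(const char* path) {
    std::FILE* file = std::fopen(path, "rb");
    if (!file) return 0;

    Elf64_Ehdr header;
    std::size_t got = std::fread(&header, 1, sizeof header, file);
    std::fclose(file);

    if (got != sizeof header) return 0;
    if (std::memcmp(header.e_ident, ELFMAG, SELFMAG) != 0) return 0;  // "\177ELF"
    if (header.e_ident[EI_CLASS] != ELFCLASS64) return 0;
    return header.e_entry;               // what "readelf -h" prints as Entry
}
```

For a position-independent executable this value is an offset from the load address, which is why it looks much smaller than the addresses the debugger shows at run time.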

Next, we check which function is located at this address using objdump -d hello1.out . This is the already mentioned _start function, on which we set a breakpoint and start debugging.

    b _start
    r

About ABI

Wikipedia definition:
An ABI (application binary interface) is a set of conventions for application access to the operating system and other low-level services, designed for the portability of executable code between machines that have compatible ABIs. This is in contrast to an API, which governs compatibility at the source code level. An ABI can be thought of as a set of rules that allow a linker to combine compiled modules without recompiling all of the code, while defining a binary interface.

The ABI level is hidden from C/C++ programmers; all work at this level is done by the compiler and the standard C library. In my case, the clang compiler and the glibc library follow all the ABI rules. The ABI rules for Linux x86-64 are given in the System V AMD64 ABI document [2]. Solaris, Linux, FreeBSD, and OS X follow the conventions of this document. Microsoft uses its own, different ABI. The first chapter of [2] says that the architecture also follows the ABI rules for 32-bit processors [3]. These are therefore the two fundamental documents on which developers of low-level libraries like glibc rely.

According to the ABI, at program start all registers are undefined except for:

%rdx: A pointer to a function that must be called before the program ends.
%rsp: The stack is aligned on a 16-byte boundary and contains the number of arguments, the arguments themselves, and the environment:

0(%rsp)                 argc
8(%rsp)                 argv[0]
...
8*(argc+1)(%rsp)        NULL
8*(argc+2)(%rsp)        envp[0]
...
8*(argc+k+2)(%rsp)      envp[k]
                        NULL
                        auxiliary vectors
                        ...
                        NULL
                        NULL
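The relationship between argv and envp in this layout can be checked on a hand-built model of the initial stack. The sketch below uses made-up strings and hypothetical helper names; it only illustrates that argv ends with a NULL entry and envp begins right after it.

```cpp
#include <cstddef>

// argv holds argc pointers followed by a NULL entry; envp starts right
// after that NULL.
char** envp_after_argv(int argc, char** argv) {
    return argv + argc + 1;
}

// Counts the entries of a NULL-terminated pointer array.
std::size_t count_entries(char** entries) {
    std::size_t n = 0;
    while (entries[n] != nullptr) ++n;
    return n;
}

// Builds a small image of the pointer area of the initial stack and checks
// that the layout rules hold for it. All strings are invented.
bool layout_matches_model() {
    char prog[] = "./hello1.out";
    char arg1[] = "-v";
    char env0[] = "HOME=/root";
    char env1[] = "TERM=xterm";
    char* stack_image[] = {prog, arg1, nullptr, env0, env1, nullptr};

    int argc = 2;
    char** argv = stack_image;
    char** envp = envp_after_argv(argc, argv);

    return argv[argc] == nullptr && envp[0] == env0 && count_entries(envp) == 2;
}
```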

Auxiliary vectors contain information about the current machine and process. You can see their values by running LD_SHOW_AUXV=1 ./hello1.out. The values are described well in [4].
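Individual entries can also be queried from inside a program with the glibc function getauxval. A minimal sketch, assuming glibc 2.16 or later; the wrapper names are hypothetical.

```cpp
#include <sys/auxv.h>
#include <unistd.h>

// AT_PAGESZ: the system page size, as recorded by the kernel in the
// auxiliary vector.
unsigned long aux_page_size() {
    return getauxval(AT_PAGESZ);
}

// AT_ENTRY: the run-time address of the executable's entry point (_start).
unsigned long aux_entry_point() {
    return getauxval(AT_ENTRY);
}
```

The page size reported here should match sysconf(_SC_PAGESIZE), and the entry address should be the one at which the debugger stops for b _start.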

And indeed, in lldb:

    x `$rsp` -s8 -fu -c1     (the number of program arguments)
    p *(char**)($rsp+8)      (the name of the program)

Next on the stack are the program arguments, the NULL separator, the environment variables, and the auxiliary vectors.

In addition, flag registers are set, SSE and x87 are configured (§3.4.1 [2]).

You can see that the arguments are already almost prepared for the user-defined function main; all that remains is to set the correct pointers. But besides setting up pointers, a lot of work has to be done before main is entered.

Let's look at the _start function. It is small, and its main task is to transfer control to the function __libc_start_main.

Disassemble the current function with di (the output here and below is formatted for clarity):

    _start:
        xor  %ebp, %ebp
        mov  %rdx, %r9
        pop  %rsi
        mov  %rsp, %rdx
        and  $-0x10, %rsp
        push %rax
        push %rsp
        lea  0x1aa(%rip), %r8       ; __libc_csu_fini
        lea  0x133(%rip), %rcx      ; __libc_csu_init
        lea  0xec(%rip), %rdi       ; main
        call *0x200796(%rip)        ; __libc_start_main
        hlt

The _start function is linked into our program as the object file Scrt1.o. There are several variants of the crt1 object file (gcrt1, Scrt1, Mcrt1) that perform similar functions but are used in different cases. For example, Scrt1.o is used when generating PIC code [5]. You can verify which object file was chosen by compiling the program with the "-v" flag. Note that in the object file the offsets of __libc_csu_fini, __libc_csu_init and main are not filled in, since they become known only at link time.

According to the ABI, %ebp must be zeroed to mark the outermost frame, which is what the xor %ebp, %ebp instruction does.

Next is preparing to call the function __libc_start_main , the signature of which is:

    int __libc_start_main(int (*main) (int, char **, char **),
                          int argc,
                          char **argv,
                          __typeof (main) init,
                          void (*fini) (void),
                          void (*rtld_fini) (void),
                          void *stack_end)

According to the ABI, the function arguments have to be placed as follows:

Argument	Position for function call	Description
main	%rdi	The main function of the program
argc	%rsi	Number of program arguments
argv	%rdx	Array of arguments. The arguments are followed by the environment variables, and those by the auxiliary vectors
init	%rcx	The global object constructor, invoked before main. The type of this function is the same as that of main
fini	%r8	Global object destructor, called after main
rtld_fini	%r9	Dynamic linker destructor. Unloads dynamically loaded libraries
stack_end	stack (pushed by push %rsp)	The current position of the aligned stack

The ABI requires the stack to be aligned on a 16-byte boundary when a function is called (sometimes 32 or 64 bytes, depending on the types of the arguments). The instruction and $-0x10, %rsp establishes this alignment. Alignment matters because some SIMD instructions (for example, the SSE instruction movaps) work only with aligned data, and scalar instructions read and write aligned data faster.

To preserve the 16-byte alignment before calling __libc_start_main, the %rax register, which holds an undefined value, is also pushed onto the stack. This stack slot is never read.
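The arithmetic behind and $-0x10, %rsp and the extra push can be reproduced with ordinary integers. This is only a sketch of the address calculation; the function names are hypothetical, and the sample address is invented.

```cpp
#include <cstdint>

// Clearing the low four bits rounds an address down to a multiple of 16.
// In two's complement, -0x10 is the same bit pattern as ~0xF.
constexpr std::uintptr_t align_down_16(std::uintptr_t address) {
    return address & ~static_cast<std::uintptr_t>(0xF);
}

constexpr bool is_16_aligned(std::uintptr_t address) {
    return (address & 0xF) == 0;
}

// After aligning, _start pushes two 8-byte values (%rax and %rsp), so the
// stack pointer moves down by 16 and stays aligned at the call instruction.
constexpr std::uintptr_t after_two_pushes(std::uintptr_t aligned_sp) {
    return aligned_sp - 2 * 8;
}
```

For example, align_down_16(0x7ffe1238) gives 0x7ffe1230, and after_two_pushes of that value is 0x7ffe1220, which is still a multiple of 16.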

Control must never return from __libc_start_main, and the hlt instruction marks such a return as an error. The peculiarity of this instruction is that in protected mode it can be executed only in protection ring 0, that is, only by the operating system. We are in ring 3, so an attempt to execute an instruction the program has no rights to ends in a segmentation fault.
After hlt there is also the instruction nopl 0x0(%rax,%rax,1), which pads the code so that the next function starts on a 16-byte boundary. The ABI does not require this, but compilers align the beginning of a function to improve performance.

So, let's move on:

    b __libc_start_main
    c

From the source code of __libc_start_main you can see that different code is generated for statically and dynamically linked programs. You can check what the function code in libc.so.6 looks like with gdb or with lldb:

    lldb libc.so.6 -b -o 'di -n __libc_start_main'

A bit about __glibc_ [un] likely

The glibc code contains many uses of __glibc_likely and __glibc_unlikely; a large number of conditions are wrapped in these macros. They expand to the following built-in functions:

    # define __glibc_unlikely(cond) __builtin_expect ((cond), 0)
    # define __glibc_likely(cond)   __builtin_expect ((cond), 1)

__builtin_expect is an optimization hint that helps the compiler lay out code in memory. We tell the compiler which branch is most likely to be taken; the compiler places that branch immediately after the comparison instruction, which improves instruction caching, and moves the remaining branch, if any, to the end of the function.
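The same hint can be used in ordinary code compiled with gcc or clang. In the sketch below the macro names SKETCH_LIKELY and SKETCH_UNLIKELY and the function divide_or are invented for illustration; only __builtin_expect itself is the real compiler built-in.

```cpp
// Hypothetical local equivalents of the glibc macros. The double negation
// turns any condition into 0 or 1 before it is compared with the hint.
#define SKETCH_LIKELY(cond)   __builtin_expect(!!(cond), 1)
#define SKETCH_UNLIKELY(cond) __builtin_expect(!!(cond), 0)

// Divides a by b, returning fallback for a zero divisor. The error branch
// is marked as unlikely, so the compiler may move it off the hot path.
int divide_or(int a, int b, int fallback) {
    if (SKETCH_UNLIKELY(b == 0)) {
        return fallback;
    }
    return a / b;
}
```

The hint changes only code layout, never the result: divide_or(10, 2, -1) is 5 and divide_or(1, 0, -1) is -1 with or without it.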

The __libc_start_main function is rather bulky, so here is only a brief list of its main actions:

 * register rtld_fini with __cxa_atexit
 * call __libc_csu_init
 * create a cancellation point
 * call main
 * call exit
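The order of these steps can be modelled with a toy function. This is not glibc code: sketch_start_main, sample_main and sketch_trace are hypothetical names, the exit-handler list is a plain vector standing in for the __cxa_atexit list, and the cancellation point is left out.

```cpp
#include <functional>
#include <string>
#include <vector>

using MainFn = int (*)(int, char**, char**);

// Toy model of the sequence: records each step in a trace string instead
// of really initializing or terminating anything.
std::string sketch_start_main(MainFn main_fn, int argc, char** argv) {
    std::string trace;
    std::vector<std::function<void()>> exit_handlers;

    // 1. Register rtld_fini so that it runs at exit.
    exit_handlers.push_back([&trace] { trace += "rtld_fini;"; });

    // 2. Run the initializers (the role of __libc_csu_init).
    trace += "init;";

    // 3. The cancellation point is omitted in this model.

    // 4. Call main; envp starts right after argv's terminating NULL.
    char** envp = argv + argc + 1;
    int result = main_fn(argc, argv, envp);
    trace += "main=" + std::to_string(result) + ";";

    // 5. exit: run the registered handlers in reverse order.
    for (auto it = exit_handlers.rbegin(); it != exit_handlers.rend(); ++it) {
        (*it)();
    }
    return trace;
}

static int sample_main(int argc, char**, char**) { return argc; }

std::string sketch_trace() {
    char prog[] = "prog";
    char* argv[] = {prog, nullptr, nullptr};   // argv, NULL, empty envp
    return sketch_start_main(sample_main, 1, argv);
}
```

With one argument the model yields the trace "init;main=1;rtld_fini;".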

__cxa_atexit

The __cxa_atexit function, unlike atexit (which is a wrapper around it), can pass an argument to the registered function. It is not meant to be called directly from user code, because it takes a DSO identifier that is known only to the compiler. The identifier is needed so that after a call to __cxa_atexit(f, p, d), the function f(p) is called when the DSO d is unloaded [8].

Nevertheless, it can be used to pass an argument to the registered function.

An example of using __cxa_atexit:

    #include <cstdio>

    extern "C" int __cxa_atexit(void (*func)(void *), void *arg, void *d);
    extern void *__dso_handle;

    void printArg(void *a) {
        int arg = *static_cast<int *>(a);
        printf("%d\n", arg);
        delete static_cast<int *>(a);
    }

    int main() {
        int *k = new int(17);
        __cxa_atexit(printArg, k, __dso_handle);
    }

I recommend treating this trick as a curiosity only. To run cleanup code when the program exits, it is safer to use a standard mechanism with a similar effect.
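One standard mechanism of this kind is std::atexit. It cannot pass an argument to the handler, so the sketch below keeps the value in a file-level variable; the names g_value, print_and_free and register_cleanup are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>

static int* g_value = nullptr;

// Exit handler: prints the stored value and releases it.
static void print_and_free() {
    if (g_value != nullptr) {
        std::printf("%d\n", *g_value);
        delete g_value;
        g_value = nullptr;
    }
}

// Stores a value and registers the handler. Returns true if registration
// succeeded (std::atexit returns 0 on success).
bool register_cleanup(int value) {
    delete g_value;                // drop a value left by an earlier call
    g_value = new int(value);
    return std::atexit(print_and_free) == 0;
}
```

After register_cleanup(17), the program prints 17 when it exits normally, just like the __cxa_atexit version, but without relying on __dso_handle.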

rtld_fini is a pointer to the dynamic linker function _dl_fini. And yes, the dynamic linker is part of the glibc library. The _dl_fini function deinitializes and unloads all loaded libraries.

__libc_csu_init

You can get into the __libc_csu_init function in the same way as into the previous one. __libc_csu_init calls _init and then the functions whose pointers are stored in the .init_array section.
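The second part of that work, walking an array of function pointers, can be sketched as follows. The array sketch_init_array is a hypothetical stand-in for the real section contents, and the two initializers are invented.

```cpp
#include <cstddef>

using InitFn = void (*)(int, char**, char**);

static int g_calls = 0;
static int g_seen_argc = 0;

static void first_init(int argc, char**, char**) { ++g_calls; g_seen_argc = argc; }
static void second_init(int, char**, char**)     { ++g_calls; }

// Stand-in for the pointers the linker collects into .init_array.
static const InitFn sketch_init_array[] = {first_init, second_init};

// Calls every entry in order with the same arguments main will receive,
// and returns how many initializers ran.
int run_init_array(int argc, char** argv, char** envp) {
    g_calls = 0;
    const std::size_t count = sizeof sketch_init_array / sizeof sketch_init_array[0];
    for (std::size_t i = 0; i < count; ++i) {
        (*sketch_init_array[i])(argc, argv, envp);
    }
    return g_calls;
}

int last_seen_argc() { return g_seen_argc; }
```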

_init

The _init function lives entirely in the .init section. Its code is divided into two parts: the introduction and the epilogue. The introduction consists of a prologue and an attempt to call the function __gmon_start__.

    _init:
        subq  $0x8, %rsp
        leaq  0x105(%rip), %rax     ; __gmon_start__
        testq %rax, %rax
        je    0x5555555548a2        ; jump to the addq instruction
        callq *%rax
        addq  $0x8, %rsp
        retq

The main job of _init is to initialize the gprof profiler. The instruction "leaq 0x105(%rip), %rax" takes the address of __gmon_start__, the function that initializes the profiler. If the profiler is not present, %rax holds 0 and the je jump is taken. The instructions subq $0x8, %rsp and addq $0x8, %rsp align the stack and then return it to its original state. The alignment is needed because the call instruction pushes a return address onto the stack, and on x86-64 that address is 8 bytes long.

You can add your own code to the .init section. Consider the hello2.cpp example:

    #include <cstdio>

    extern "C" void my_init() {
        puts("Hello from init");
    }

    __asm__(
        ".section .init\n"
        "call my_init"
    );

    int main() {}

Now look at what _init has become:

    subq  $0x8, %rsp
    movq  0x200835(%rip), %rax
    testq %rax, %rax
    je    0x5555555547ba
    callq *%rax
    callq 0x555555554990        ; ::my_init()
    addq  $0x8, %rsp
    retq

As the listing shows, the instruction callq 0x555555554990, which is the call to my_init, was added between the introduction and the epilogue of the function. Apparently _init is implemented this way precisely so that you can easily add your own initialization for parts of the program.

Interesting fact: the attentive reader will have noticed that hello2.cpp prints via the puts function. If cout is used instead, a program built with libstdc++ crashes with a segmentation fault, while one built with libc++ prints the message normally. Why does this happen? In libstdc++, cout is initialized like a regular global object, and global objects are initialized a little later. With libc++, initialization happens while the libraries are being loaded, in the _dl_init function from ld-linux-x86-64.so.2. That function is called from _dl_start_user right before control is passed to _start.

Each approach has advantages and disadvantages. With libc++, the stream constructors are invoked even if the program never uses standard C++ output facilities such as cout. With libstdc++, even with optimization flags enabled, the constructor is called as many times as the iostream header is included. Naturally, the constructor accounts for being called several times and skips re-initialization. This does not noticeably slow down program startup, but it is still unpleasant. Apparently for this reason many high-performance projects avoid, discourage, or even prohibit including the iostream header and, as a result, create their own I/O interfaces.

.init_array

Next, the functions whose pointers are stored in the .init_array section are called.
Check the contents of the section:

    objdump hello1.out -s -j .init_array

In my case, the contents of .init_array are a00f0000 00000000, which is the address 0x0fa0 on a 64-bit system with little-endian byte order. At this address is the frame_dummy function.
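Reading the dumped bytes as a little-endian 64-bit value can be spelled out explicitly. A small sketch, with hypothetical function names, using the eight bytes quoted above:

```cpp
#include <cstddef>
#include <cstdint>

// Assembles eight bytes stored least-significant first into one value.
std::uint64_t decode_le64(const unsigned char (&bytes)[8]) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        value |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    }
    return value;
}

// The bytes "a00f0000 00000000" exactly as objdump printed them.
std::uint64_t init_array_entry_from_dump() {
    const unsigned char dumped[8] = {0xa0, 0x0f, 0x00, 0x00,
                                     0x00, 0x00, 0x00, 0x00};
    return decode_le64(dumped);
}
```

The first byte, 0xa0, is the lowest byte of the address and 0x0f the next one, which gives 0x0fa0.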

frame_dummy

Interestingly, frame_dummy is part of the gcc runtime.

What does gcc have to do with it? We are using the clang compiler!

Do not forget that the gcc project is very large and has put down deep roots in Linux operating systems. It contains not only the compiler but also auxiliary files needed to build programs. Thus, crt files such as crtbeginS.o and crtendS.o are used at link time.
Therefore it is not possible to get rid of the gcc project completely; at a minimum, its auxiliary crt files have to stay. Unix operating systems that do not use gcc as their main compiler do the same.

frame_dummy looks like this:

    pushq %rbp
    movq  %rsp, %rbp
    popq  %rbp
    jmp   0x555555554cc0        ; register_tm_clones
    nopw  (%rax,%rax)

The task of frame_dummy is to set up the arguments and call register_tm_clones; it exists only as a layer for setting arguments. In this case no arguments are set, but as the source code shows, that depends on the architecture. Interestingly, the first two instructions are the prologue and the third is the epilogue. The jmp instruction is a tail-call optimization. And, as usual, alignment padding follows at the end.

The register_tm_clones function is needed in order to activate transactional memory .

Initializing Global Objects

Global objects, if present, are initialized here.
If there are global objects, the address of the function _GLOBAL__sub_I_<file name> is added to the .init_array section.

Consider an example of initializing global variables:
global1.cpp:

    #include <cstdio>
    int k = printf("Hello from .init_array");

The variable will be initialized as follows:

    push %rbp
    mov  %rsp, %rbp
    lea  0xf59(%rip), %rdi      ; + 4
    mov  $0x0, %al
    call 0x555555554e80         ; symbol stub for: printf
    mov  %eax, 0x202130(%rip)   ; k
    pop  %rbp
    ret

The first two instructions are the prologue. Next comes the preparation for the printf call: a pointer to our string is put in %rdi, and %al is set to zero. According to the ABI [2], when a function with a variable number of arguments is called, %al carries a hidden parameter: an upper bound on the number of vector registers used to pass arguments. Functions such as printf use it to decide whether the vector registers have to be saved on the stack.
After the printf call, the result of the function is stored in the memory of the variable k, and the epilogue is executed.

global2.cpp :
Suppose we have a Global class with a non-default constructor and destructor:

 Global g;

Then the initialization will look like this:

    push %rbp
    mov  %rsp, %rbp
    sub  $0x10, %rsp
    lea  0x202175(%rip), %rdi   ; g
    call 0x5555555550e0         ; Global::Global()
    lea  0x1c5(%rip), %rdi      ; Global::~Global()
    lea  0x202162(%rip), %rsi   ; g
    lea  0x202147(%rip), %rdx   ; __dso_handle
    call 0x555555554f10         ; symbol stub for: __cxa_atexit
    mov  %eax, -0x4(%rbp)
    add  $0x10, %rsp
    pop  %rbp
    ret

Here we see how, after calling the global constructor, the destructor is registered with __cxa_atexit . This is implemented according to Itanium ABI [8].
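The observable effect of this sequence can be checked from C++ itself. The sketch below uses a hypothetical Global class of its own (the article does not show the class body): its constructor sets a flag, and its destructor prints a message when the handler registered by the compiler runs at exit.

```cpp
#include <cstdio>

// Constant-initialized, so it is already false before any constructor runs.
static bool g_constructed = false;

struct Global {
    Global()  { g_constructed = true; }              // runs from .init_array
    ~Global() { std::puts("Global destroyed"); }     // runs from the exit handlers
};

static Global g;

// By the time main (or anything called from it) runs, g already exists.
bool global_was_constructed() {
    return g_constructed;
}
```

When the program exits normally, "Global destroyed" is printed without any explicit call, because the destructor was registered during initialization.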

Initializing function call

From glibc, initialization is called as follows: (*__init_array_start [i]) (argc, argv, envp);

Notice that the initialization functions receive the same parameters as main, so we can use them. The gcc and clang compilers provide the constructor attribute, which makes a function run at this stage, before main.

Such a function can also receive these arguments. Check the output of the program using the following global function:

    void __attribute__((constructor)) hello(int argc, char **argv, char **env) {
        printf("#args = %d\n", argc);
        printf("filename = %s\n", argv[0]);
    }

This can be used for more practical purposes (hello3.cpp):

    #include <cstdio>

    class C {
    public:
        C(int i) {
            printf("Program has %d argument(s)\n", i);
        }
    };

    int constructorArg;
    const C c(constructorArg);

    void __attribute__((constructor (65535))) hello(int argc, char **argv, char **env) {
        constructorArg = argc;
    }

    int main() {}

The call priority is given as the parameter of the constructor attribute.

As you have probably guessed, the program prints the correct number of arguments and, most interestingly, the object c is constant. The main drawback of this approach is that it is not part of the standard and is therefore not portable. Such code also depends heavily on the libc implementation in use.

I would like to add that global variables of the form int x = 1 + 2 * 3; are not initialized at run time at all: the compiler writes their values into memory in advance. If you want variables initialized by simple functions, such as int s = sum(4, 5), to be initialized the same way, add the constexpr specifier from the C++11 standard to the sum function.
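A minimal sketch of the constexpr case follows. The function sum and the variable s come from the text; folded and read_s are hypothetical names added for the illustration.

```cpp
// With constexpr, sum(4, 5) is a constant expression, so s is
// constant-initialized and no function is added to .init_array for it.
constexpr int sum(int a, int b) {
    return a + b;
}

int s = sum(4, 5);

// The same holds for plain arithmetic on literals.
constexpr int folded = 1 + 2 * 3;

// Evaluated entirely at compile time.
static_assert(sum(4, 5) == 9, "sum must be usable in constant expressions");

int read_s() { return s; }
```

One way to confirm the effect is to compare the .init_array contents of the binary with and without constexpr on sum.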

Create a cancellation point

The cancellation point is created by calling setjmp and setting a global variable.
Saving the context with setjmp is needed to set up the cancellation buffer, so that the main thread can terminate correctly when it is cancelled.
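The general technique, saving a context and jumping back to it later, can be shown with the standard setjmp and longjmp. This is only an illustration of the mechanism, not the glibc implementation; all names in the sketch are hypothetical.

```cpp
#include <csetjmp>

static std::jmp_buf g_unwind_buf;
static volatile int g_cancel_code = 0;

// Plays the role of a cancellation request: control jumps back to the
// point where the context was saved.
static void request_cancel(int code) {
    g_cancel_code = code;
    std::longjmp(g_unwind_buf, 1);
}

// Saves the context, then runs work that gets "cancelled". setjmp returns
// 0 when called directly and a non-zero value after longjmp.
int run_until_cancelled() {
    if (setjmp(g_unwind_buf) != 0) {
        return g_cancel_code;      // reached only via longjmp
    }
    request_cancel(42);
    return 0;                      // never reached
}
```

Here run_until_cancelled returns 42: the second return from setjmp is where the cleanup and termination logic would live.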

Example of canceling the main thread

File cancel.cpp .

    #include <pthread.h>

    pthread_t g_thr = pthread_self();

    void *thread_start(void *) {
        pthread_cancel(g_thr);
        return 0;
    }

    int main() {
        pthread_t thr;
        pthread_create(&thr, NULL, thread_start, NULL);
        pthread_detach(thr);
        while (1) {
            pthread_testcancel();
        }
    }

cancel.cpp , , , exit . , , , , .

To look at this, set a breakpoint on the setjmp call inside __libc_start_main:

 br set -n __libc_start_main -R 162

: , — .

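setjmp is called here under the name __GI__setjmp. Names with the __GI_ prefix are glibc-internal aliases of its functions; on glibc naming conventions see [7]. Calls made through such an alias stay inside the library and bypass the PLT.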

main

 std::cout << "Hello, world!" << std::endl;

This line is equivalent to either of the following:

 operator<<(std::cout, "Hello, World!").operator<<(std::endl);

 operator<<(std::cout, "Hello, World!"); std::cout.operator<<(std::endl);

C++ << . , , , , . , .

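endl in libc++, as in libstdc++, is a function of the form: ostream& endl(ostream&);

ostream has an overload of operator<< that accepts such a function, much like the visitor pattern.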
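The equivalence between the operator syntax and the explicit calls can be verified on a string stream. In this sketch, exclaim is a hypothetical manipulator with the same shape as endl, and the two helper functions are invented for the comparison.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// A user-defined manipulator: a function that takes and returns a stream,
// just like std::endl.
std::ostream& exclaim(std::ostream& os) {
    return os << '!';
}

// The explicit form: a free operator<< for the string literal, then the
// member operator<< that accepts a function pointer.
std::string explicit_calls() {
    std::ostringstream out;
    operator<<(out, "Hello, world").operator<<(exclaim).operator<<(std::endl);
    return out.str();
}

// The usual form that the compiler translates into the calls above.
std::string operator_syntax() {
    std::ostringstream out;
    out << "Hello, world" << exclaim << std::endl;
    return out.str();
}
```

Both functions produce "Hello, world!" followed by a newline. The member operator<< simply calls the function it receives with the stream as the argument.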

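The length of the string is computed with strlen. strlen is an IFUNC symbol: when the library is loaded, one of its implementations, such as __strlen_avx2 or __strlen_sse2, is selected for the current processor.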

stdout _IO_file_doallocate malloc , 1 . , setvbuf .

stdout , . flush , stdout .

In the end, flush goes through fwrite to __libc_write, which executes the syscall instruction (the code is shown after macro expansion):

    ssize_t __libc_write (int fd, const void *buf, size_t nbytes)
    {
      return ({
        unsigned long int resultvar = ({
          unsigned long int resultvar;
          long int __arg3 = (long int) (nbytes);
          long int __arg2 = (long int) (buf);
          long int __arg1 = (long int) (fd);
          register long int _a3 asm ("rdx") = __arg3;
          register long int _a2 asm ("rsi") = __arg2;
          register long int _a1 asm ("rdi") = __arg1;
          asm volatile (
            "syscall\n\t"
            : "=a" (resultvar)
            : "0" (1) , "r" (_a1), "r" (_a2), "r" (_a3)
            : "memory", "cc", "r11", "cx");
          (long int) resultvar;
        });
        resultvar;
      });
    }

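The code above uses statement expressions, a gcc extension:

    int l = ({int b = 4; int c = 8; c += b;});

The value of the last expression, c += b, becomes the value of the whole construct, so l == 12.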

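__libc_write (seen in the debugger as __GI___libc_write, as was the case with _setjmp) executes the syscall instruction, not the syscall function from the C library. The system call number and the result both travel through rax: the "=a" constraint says that the result is read from rax, and "0" (1) says that the same register is first loaded with 1 (the number of sys_write).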

, , sys_write , .

System call arguments, according to the ABI [2], are passed in registers in the following order: %rdi, %rsi, %rdx, %r10, %r8, %r9.
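A stripped-down version of the __libc_write listing can be written directly. The sketch below assumes Linux; on x86-64 it issues the syscall instruction itself with the register assignments discussed above, and on other architectures it falls back to the libc syscall function. The names raw_write and bytes_through_pipe are hypothetical.

```cpp
#include <sys/syscall.h>
#include <unistd.h>

// Writes nbytes from buf to fd. On x86-64: rax = 1 (sys_write),
// rdi = fd, rsi = buf, rdx = nbytes; the result comes back in rax.
long raw_write(int fd, const void* buf, unsigned long nbytes) {
#if defined(__x86_64__) && defined(__linux__)
    long result;
    asm volatile("syscall"
                 : "=a"(result)
                 : "0"(1L), "D"(static_cast<long>(fd)), "S"(buf), "d"(nbytes)
                 : "rcx", "r11", "memory", "cc");
    return result;                 // negative errno value on failure
#else
    return syscall(SYS_write, fd, buf, nbytes);
#endif
}

// Sends text through a pipe with raw_write and reads it back.
// Returns the number of bytes that made the round trip, or -1.
long bytes_through_pipe(const char* text, unsigned long length) {
    if (length == 0 || length > 64) return -1;
    int fds[2];
    if (pipe(fds) != 0) return -1;

    long written = raw_write(fds[1], text, length);
    char buffer[64] = {};
    long received = read(fds[0], buffer, sizeof buffer);

    close(fds[0]);
    close(fds[1]);
    return (written == received) ? written : -1;
}
```

The syscall instruction itself overwrites rcx and r11, which is why both are listed as clobbered, as they are in the glibc listing.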

Note that system call numbers on x86 and x86-64 are different!

, , - , , . PIC ( 1 , 2 ).

exit

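The exit function does the following:

__call_tls_dtors: calls the destructors of thread-local storage objects.
Calls the functions registered with atexit:
- _dl_fini: the dynamic linker function whose pointer _start placed in %r9.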
- ( ).
Calls the functions registered in __libc_atexit:
- _IO_cleanup: flushes and cleans up the stdio streams.
_exit: terminates the process.

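_exit makes system call number 231 (sys_exit_group), with the exit code in %rdi.

Linux also has the sys_exit system call. It terminates only the calling thread, whereas sys_exit_group terminates every thread of the process. For more on how threads use sys_exit, see [6].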
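The final system call can be tried in isolation by issuing it in a child process and collecting the exit code in the parent. A sketch under the assumption of a Linux system; exit_group_status_of_child is a hypothetical helper name.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
// On x86-64 the number of sys_exit_group is 231.
static_assert(SYS_exit_group == 231, "unexpected system call number");
#endif

// Forks a child that terminates through the exit_group system call, then
// returns the exit code observed by the parent (-1 on failure).
int exit_group_status_of_child(int code) {
    pid_t pid = fork();
    if (pid < 0) return -1;

    if (pid == 0) {
        syscall(SYS_exit_group, code);   // does not return
        _exit(127);                      // safety net only
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child skips atexit handlers and stream flushing entirely, which is the difference between calling exit and going straight to the system call.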

, , "Hello, World!!!", C/C++, glibc . : , , setjmp, atexit...

, dot

[1] — http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html
[2] — https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r252.pdf
[3] — https://github.com/hjl-tools/x86-psABI/wiki/intel386-psABI-1.1.pdf
[4] — https://habrahabr.ru/post/128111/
[5] — https://dev.gentoo.org/~vapier/crt.txt
[6] — http://syprog.blogspot.ru/2012/03/linux-threads-through-magnifier-local.html
[7] — https://sourceware.org/glibc/wiki/Style_and_Conventions#Double-underscore_names_for_public_API_functions
[8] — https://itanium-cxx-abi.imtqy.com/cxx-abi/abi.html#dso-dtor

Source: https://habr.com/ru/post/339698/
