📜 ⬆️ ⬇️

Nothing is easier than calling a function; I myself have done this repeatedly.


The previous article about exceptions in C ++ left a bunch of dark places,
the main thing that remained incomprehensible is how is it done
transfer of control when an exception is raised?
With SJLJ, everything is clear, but it is argued that this technology is practically
pushed out by some no-cost (with no exceptions) tabular mechanism.
But what kind of mechanism it is and how it works, we will understand under the cut.

This article appeared in the process of preparing to speak in C ++ Siberia, when it turned out some details that may well be useful to someone else, except the author, known for his tediousness.

Introduction


It all began with a simple desire to find out the size of the buffer that the setjmp / longjmp functions use:
It is believed that the state of the processor is preserved in this structure.
And how does this fit in with the number of registers ( AMD x86-64 )?

MMX and FP87 registers are combined.
In 32 - bit mode - 32 + 128 + 80 + 12 (eip, flags) = 252 bytes
In 64-bit - 128 + 256 + 80 + 24 (...) = 480 bytes
Something does not converge.
We read documentation :
When you call setjmp, the current stack position, non-volatile registers, and flags are saved.
What other non-volatile registers? Again, read the documentation :
When calling a function, the responsibility for saving the contents of a part of the registers lies with the caller, and these are the so-called volatile registers. The callee takes care of the contents of the other registers and these are non-volatile registers.
So what is it and why is it needed?

Register division by volatile and non-volatile


It is about optimizing call functions. Such a division has existed for a long time.
In any case, in the architecture of 68K (1979), two out of 8 general-purpose registers and seven address registers were considered volatile, the rest were protected by the called party.
')
In the 88K architecture (1988), 12 of 32 general-purpose registers were protected by the called party + (pointers to the stack and frame).

In the IBM S / 360 (1964), there are 16 integer registers (32-bit) for general purpose and 4 floating point, but no hardware stack. Before calling a function, all registers are stored in a special area. For a recursive call, you need to dynamically request memory from the OS. Parameters are passed as a pointer to a list of pointers to parameter values. In fact, 11 registers are non-volatile.

For architectures with fewer registers, there is no problem with their preservation. In PDP-11 (1970), there are a total of 6 general-purpose registers. And the parameters are passed through the stack. Actually, this is how “C calling convention” ( cdecl ) appeared and it was here that the C language crystallized .

8086. Where do without him. The processor has 8 general-purpose registers, but only BX has no architectural burdens. There were several agreements on calling functions, they concerned the transfer of parameters, and all registers were considered volatile.

We will not dwell on IA-32 , we will return again to x86-64 . In this case, there are too many registers to save their values ​​before each call. In the case of a full-fledged function, this will somehow have to be done, but for small “functionlets” it is wasteful. Need a compromise.
Who can determine which class belongs to a particular register? The compiler itself, to the architecture is indirectly related. Here is what one of the GCC developers wrote in 2003 about this:
The decision to which category to assign a specific register was not easy. AMD64 has 15 general-purpose registers (av. Ln .:% rsp does not count, save it anyway), and using 8 of them (the so-called extended registers) in the instruction requires the REX prefix, which increases its size. In addition, the% rax,% rdx,% rcx,% rsi, and% rdi registers are implicitly used in some IA-32 instructions. We decided to make these volatile registers to avoid restrictions on the use of instructions.
Thus, we can make non-volatile only% rbx,% rbp and extended registers. A series of tests showed that the smallest code is obtained when non-volatile registers are assigned (% rbx,% rbp,% r12-% r15).

Initially, we wanted to make volatile 6 SSE registers. However, difficulties arose - these registers are 128-bit and only 64 bits are usually used to store data, so saving them for the caller is more expensive than for the party being called.
Various experiments were carried out and we came to the conclusion that the most compact and fast code is obtained in the case when all SSE registers are declared as volatile.

This is exactly how things are still, see AMD64 ABI spec , page 21.
The same ABI is supported in B OS X.

Microsoft considered it different and their division is as follows :
Well, at least all this can fit into the size of jmp_buf .

What about more case-rich architectures?
This is how things work with the OS X 64-bit compiler for PowerPC, where there are 32 integer registers and floating point ones:
GPR11 (*) - non-volatile for leaf functions (of which there are no other calls)

Total: division of registers into two classes implements a universal optimization of function calls:

Register windows


An alternative approach, processors using this technique , grow from the project Berkeley RISC (1980..1984).

Intel i960 (1989) is a 32-bit processor with 16 local and 16 global general-purpose registers. Parameters are passed through global registers; when a function is called, all local registers are saved with a special instruction. In fact, all local registers are non-volatile, but they are saved forcibly in the hope that hardware support will give this some kind of acceleration. However, in modern times this is just one line of cache .

AMD 29K (1988) - 32-bit processor with 192 (sic!) Registers

SPARC (1987) may have a different number of registers (S in the name means Scalable)

Itanium (2001) SPARC follower.

Definitely, Itanium gets the audience award, it’s a pity that this processor didn’t take off.

Function call


So, having considered all this architectural splendor, we can draw the following conclusions about the function call.

Yes, after the optimizer, the contents of the function body sometimes remind of the primary broth , where it is not always clear why this or that instruction is needed and how to find the value of a variable.

However, at the time of the child function call, all this hectic activity freezes.

Regardless of the processor architecture, the actual data from the registers is somehow prevented from being lost. For architects with register windows, this happens naturally. For other volatile registers are saved in memory, if this is a temporary value and it has no place in memory, it will have to be re-evaluated. Non-volatile registers either remain unchanged or their values ​​are restored.

Suppose an exception occurred in the underlying function and we want to transfer control to one of the catch blocks of a certain function. All the information we need to restore the execution context is already in the stack or in registers,
The try block may take no effort, it does not need to allocate space on the stack and save there anything. All information is already saved. But now the problem arises how to save information about where we placed that information.

Fortunately, this information is static and is determined at compile time. The compiler collects all of this into tables, and that’s how it turned out to be a no-cost tabular mechanism.

Let's see how this is implemented in the MSVC (x64) and GCC (x64) compilers.

MSVC (x64)


MSVC creates a prolog and an epilog for each function, while the RSP value between them remains unchanged. RBP is considered a regular register until someone uses alloca . Take for experiments some nontrivial function and look in the debugger at the significant part of its prologue for us:
000000013F440850 mov rax, rsp
000000013F440853 push rbp
000000013F440854 push rdi
000000013F440855 push r12
000000013F440857 push r14
000000013F440859 push r15
000000013F44085B lea rbp, [rax-0B8h] # initialization
000000013F440862 sub rsp, 190h
000000013F440869 mov qword ptr [rbp + 20h], 0FFFFFFFFFFFFFFFEh # initialization
000000013F440871 mov qword ptr [rax + 10h], rbx
000000013F440875 mov qword ptr [rax + 18h], rsi
000000013F440879 mov rax, qword ptr [__security_cookie (013F4C5020h)] #from hereafter - the function body
the optimizer dilutes the prolog code with initialization instructions whenever possible .

And an epilogue in which non-volatile registers are brought to their original state.
000000013F4410C2 lea r11, [rsp + 190h]
000000013F4410CA mov rbx, qword ptr [r11 + 38h]
000000013F4410CE mov rsi, qword ptr [r11 + 40h]
000000013F4410D2 mov rsp, r11
000000013F4410D5 pop r15
000000013F4410D7 pop r14
000000013F4410D9 pop r12
000000013F4410DB pop rdi
000000013F4410DC pop rbp
000000013F4410DD ret
The compiler collects information for the promotion of the stack in the .pdata section. A RUNTIME_FUNCTION structure is created for each function, from which there is a link to the unwind table . Its contents can be pulled out using the link utility with the parameters -dump -unwindinfo. For the same function we find:
00001D70 00020880 0002110E 000946C0? Write_table_header ...
Unwind version: 1
Unwind flags: EHANDLER UHANDLER
Size of prologue: 0x3A
Count of codes: 11
Unwind codes:
29: SAVE_NONVOL, register = rsi offset = 0x1D0
25: SAVE_NONVOL, register = rbx offset = 0x1C8
19: ALLOC_LARGE, size = 0x190
0B: PUSH_NONVOL, register = r15
09: PUSH_NONVOL, register = r14
07: PUSH_NONVOL, register = r12
05: PUSH_NONVOL, register = rdi
04: PUSH_NONVOL, register = rbp
Handler: 0006BFD0 __GSHandlerCheck_EH
EH Handler Data: 00087578
GS Unwind flags: UHandler
Cookie Offset: 00000188
We are interested in Unwind codes - they contain actions that must be performed when an exception is raised.


GCC (x64)


Similarly, we analyze the prologue and epilogue of the same function created by GCC.
Prologue
.cfi_startproc
.cfi_personality 0x9b, DW.ref .__ gxx_personality_v0
.cfi_lsda 0x1b, .LLSDA11339
pushq% r15
.cfi_def_cfa_offset 16
.cfi_offset 15, -16
pushq% r14
.cfi_def_cfa_offset 24
.cfi_offset 14, -24
pushq% r13
.cfi_def_cfa_offset 32
.cfi_offset 13, -32
movq% rdi,% r13
pushq% r12
.cfi_def_cfa_offset 40
.cfi_offset 12, -40
pushq% rbp
.cfi_def_cfa_offset 48
.cfi_offset 6, -48
pushq% rbx
.cfi_def_cfa_offset 56
.cfi_offset 3, -56
subq $ 456,% rsp
.cfi_def_cfa_offset 512
Epilogue:
addq $ 456,% rsp
.cfi_remember_state
.cfi_def_cfa_offset 56
popq% rbx
.cfi_def_cfa_offset 48
popq% rbp
.cfi_def_cfa_offset 40
popq% r12
.cfi_def_cfa_offset 32
popq% r13
.cfi_def_cfa_offset 24
popq% r14
.cfi_def_cfa_offset 16
popq% r15
.cfi_def_cfa_offset 8
ret
CFI prefix means Call Frame Information; these are directives to the assembler how to write additional information for stack promotion. This information is collected in the .eh_frame section, you can see it in a readable form using the dwarfdump utility with the -F key
#prologue
〈0〉 〈0x00000e08: 0x00000f4a 〈〉 〈fde offset 0x00000e00 length: 0x00000060 eh aug data len 0x0
0x00000e08: 〈off cfa = 08 (r7) 〈off r16 = -8 (cfa)
0x00000e0a: 〈off cfa = 16 (r7) 〈off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e0c: 〈off cfa = 24 (r7) 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e0e: 〈off cfa = 32 (r7) 〈off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e10: 〈off cfa = 40 (r7) off r12 = -40 (cfa) 〈off r13 = -32 (cfa)
〈Off r14 = -24 (cfa)〉 〈off r15 = -16 (cfa) 〈off r16 = -8 (cfa)
0x00000e11: 〈off cfa = 48 (r7) 〈off r6 = -48 (cfa)
〈Off r12 = -40 (cfa)〉 〈off r13 = -32 (cfa)
〈Off r14 = -24 (cfa)〉 〈off r15 = -16 (cfa)
〈Off r16 = -8 (cfa)
0x00000e12: 〈off cfa = 56 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
#body
0x00000e19: 〈off cfa = 64 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e51: 〈off cfa = 56 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e52: 〈off cfa = 48 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e53: 〈off cfa = 40 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e55: 〈off cfa = 32 (r7) 〈off r3 = -56 (cfa) 〈off r6 = -48 (cfa)
〈〈Off r12 = -40 (cfa)〉 〈off r13 = -32 (cfa)
〈Off r14 = -24 (cfa)〉 〈off r15 = -16 (cfa)
〈Off r16 = -8 (cfa)
0x00000e57: 〈off cfa = 24 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e59: 〈off cfa = 16 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e5b: 〈off cfa = 08 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000e60: 〈off cfa = 64 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa) 〈off r14 = -24 (cfa) 〈off r15 = -16 (cfa)
〈Off r16 = -8 (cfa)
0x00000f08: 〈off cfa = 56 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000f09: 〈off cfa = 48 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000f0a: 〈off cfa = 40 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000f0c: 〈off cfa = 32 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa) 〈off r14 = -24 (cfa) 〈off r15 = -16 (cfa)
〈Off r16 = -8 (cfa)
0x00000f0e: 〈off cfa = 24 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000f10: 〈off cfa = 16 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000f12: 〈off cfa = 08 (r7) 〈off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
0x00000f18: 〈off cfa = 64 (r7) off r3 = -56 (cfa)
〈Off r6 = -48 (cfa)〉 〈off r12 = -40 (cfa)
〈Off r13 = -32 (cfa)〉 〈off r14 = -24 (cfa)
〈Off r15 = -16 (cfa)〉 〈off r16 = -8 (cfa)
What we see here:


Total


So we got to the end. In the process, it turned out that there was no magic. In order to be able to recover from the exception after catching the exception, no action is needed, everything is saved by itself during the natural execution of the code.
But in order to restore the state, a little help from the compiler is required, but everything is rather modest, without frills.

In general, exception handling by modern compilers is an excellent example of how the most difficult problem can be solved calmly, without any fuss, using completely “worker-peasant” methods. Developers respect.

Source: https://habr.com/ru/post/267771/


All Articles