
The evolution of system calls in the x86 architecture

Much has already been written about system calls. You probably know that a system call is the way a program invokes a function of the OS kernel. I wanted to dig deeper: what exactly is special about a system call, which implementations exist, and how do they perform, using the x86-64 architecture as an example. If you are curious about the answers to these questions too, read on.


System call


Whenever we want to display something on the screen, write to a device, or read from a file, we have to go through the OS kernel. The kernel is responsible for all communication with hardware; it is where interrupts, processor modes, and task switching are handled. To keep a user program from bringing down the entire operating system, memory was divided into user space (the region intended for user programs) and kernel space, and user access to kernel memory was forbidden. In the x86 family this separation is implemented in hardware using segment-based memory protection. But a user program still needs some way to communicate with the kernel, and for this the concept of the system call was invented.


A system call is the way a user-space program enters kernel space. From the outside it may look like a call to an ordinary function with its own calling convention, but the processor actually does quite a bit more than it does for a plain call instruction. For example, on x86 a system call at a minimum raises the privilege level, replaces the user segments with kernel segments, and sets the instruction pointer to the system call handler.


The programmer usually does not work with system calls directly: they are wrapped in functions and hidden in libraries such as libc.so on Linux or ntdll.dll on Windows, which is what the application developer actually interacts with.


In theory, a system call can be implemented on top of any exception, even division by zero; the main thing is to transfer control to the kernel. Let us look at real-world implementations.


Ways to implement system calls


Execution of invalid instructions.


Back on the 80386, this was the fastest way to make a system call. A meaningless, invalid LOCK NOP instruction was typically used; executing it made the processor invoke the invalid-opcode handler. That was more than 20 years ago, and this is reportedly how Microsoft handled system calls at the time. Today the invalid-instruction handler is used only for its intended purpose.


Call gates


To allow transfers between code segments with different privilege levels, Intel introduced a special family of descriptors called gate descriptors. There are four types of them: call gates, interrupt gates, trap gates, and task gates.



We are interested only in call gates, since it was through them that it was planned to implement system calls in x86.


A call gate is invoked with a far call or far jmp instruction whose selector refers to a call-gate descriptor configured by the OS kernel. It is a rather flexible mechanism: it can switch to any ring of the protection model and even to 16-bit code. Call gates are considered faster than interrupts. The method was used in OS/2 and Windows 95. In Linux the mechanism was never adopted because it was inconvenient to use, and over time it fell out of use entirely once faster and simpler system call implementations (sysenter/sysexit) appeared.


System calls implemented in Linux


On the x86-64 architecture, the Linux operating system offers several different ways to make a system call:

  1. the software interrupt int 80h;
  2. the sysenter/sysexit instruction pair;
  3. the syscall/sysret instruction pair;
  4. virtual system calls (vsyscall);
  5. the vDSO.



Each implementation has its own peculiarities, but in general a system call handler in Linux has roughly the same structure:



Let's take a closer look at each system call.


int 80h


Originally, on x86, Linux used software interrupt number 128 to make a system call. The user puts the system call number in eax and its arguments, in order, in the registers ebx, ecx, edx, esi, edi, and ebp, then executes the int 80h instruction, which raises a software interrupt. The processor invokes the interrupt handler that the Linux kernel installed during initialization. On x86-64 this interrupt path is used only during x32 emulation, for backward compatibility.


In principle, nothing prevents you from using this instruction in long mode as well. But you should understand that the 32-bit system call table is used, and every address involved must fit in the 32-bit address space. According to the System V ABI [4] §3.5.1, programs whose virtual addresses are known at link time and fit in 2 GB use the small memory model by default, so all known symbols lie within the 32-bit address space. Statically compiled programs fit this definition, so int 80h can be used there. A step-by-step description of how the interrupt works can be found on Stack Overflow.


In the kernel, the interrupt handler is the entry_INT80_compat function, located in arch/x86/entry/entry_64_compat.S.


Call example int 80h
```nasm
section .text
global _start

_start:
    mov edx, len        ; message length
    mov ecx, msg        ; message to write
    mov ebx, 1          ; file descriptor (stdout)
    mov eax, 4          ; system call number (sys_write)
    int 0x80            ; call kernel
    mov eax, 1          ; system call number (sys_exit)
    int 0x80            ; call kernel

section .data
msg db 'Hello, world!', 0xa
len equ $ - msg
```

Compilation:


```shell
nasm -f elf main.s -o main32.o
ld -melf_i386 main32.o -o a32.out
```

Or in long mode (the program works because it is statically linked):


```shell
nasm -f elf64 main.s -o main.o
ld main.o -o a.out
```

sysenter / sysexit


Some time later, before x86-64 even existed, Intel realized that system calls could be sped up with a dedicated instruction, bypassing part of the interrupt overhead. Thus the sysenter/sysexit instruction pair appeared. The speedup comes from the fact that, in hardware, sysenter skips many descriptor validity checks as well as checks that depend on the privilege level [3] §6.1. The instruction also relies on the calling program using a flat memory model. On Intel processors the instruction is valid in both compatibility mode and long mode, but on AMD processors it raises an invalid-opcode exception in long mode [3]. For this reason the sysenter/sysexit pair is nowadays used only in compatibility mode.


In the kernel, the handler for this instruction is the entry_SYSENTER_compat function, located in arch/x86/entry/entry_64_compat.S.


Sysenter call example
```nasm
section .text
global _start

_start:
    mov edx, len        ; message length
    mov ecx, msg        ; message to write
    mov ebx, 1          ; file descriptor (stdout)
    mov eax, 4          ; system call number (sys_write)
    push continue_l
    push ecx
    push edx
    push ebp
    mov ebp, esp
    sysenter
    hlt                 ; dummy instruction that is going to be skipped
continue_l:
    mov eax, 1          ; system call number (sys_exit)
    mov ebx, 0
    push ecx
    push edx
    push ebp
    mov ebp, esp
    sysenter

section .data
msg db 'Hello, world!', 0xa
len equ $ - msg
```

Compiling:


```shell
nasm -f elf main.s -o main.o
ld main.o -melf_i386 -o a.out
```

Although Intel's implementation makes the instruction valid in long mode, you most likely cannot use this kind of system call there. The reason is that the current stack pointer is passed in the ebp register, while the address of the top of the stack, regardless of the memory model, lies outside the 32-bit address space: Linux maps the stack at the end of the lower half of the canonical address space.


Linux kernel developers warn against hard-coding sysenter, because the system call ABI may change. Because Android ignored this advice, Linux once had to roll back a patch to preserve backward compatibility. The correct way is to make the system call through the vDSO, which is discussed below.


syscall / sysret


Since AMD developed the x86-64 architecture (AMD64), they decided to create their own system call instruction, as an analogue of sysenter/sysexit from IA-32. AMD made sure syscall works both in long mode and in compatibility mode, while Intel chose not to support it in compatibility mode. Nevertheless, Linux has two handlers, one for each mode: entry_SYSCALL_64 for x64 and entry_SYSCALL_compat for x32, located in arch/x86/entry/entry_64.S and arch/x86/entry/entry_64_compat.S respectively.


If you want to learn more about the system call instructions, their pseudo-code is given in the Intel manual [0] (§4.3).


Syscall call example
```nasm
section .text
global _start

_start:
    mov rdx, len        ; message length
    mov rsi, msg        ; message to write
    mov rdi, 1          ; file descriptor (stdout)
    mov rax, 1          ; system call number (sys_write)
    syscall
    mov rax, 60         ; system call number (sys_exit)
    syscall

section .data
msg db 'Hello, world!', 0xa
len equ $ - msg
```

Compiling


```shell
nasm -f elf64 main.s -o main.o
ld main.o -o a.out
```

Example of a 32-bit syscall call

To run the following example, you will need a kernel configured with CONFIG_IA32_EMULATION=y and an AMD processor. If you have an Intel machine, you can run the example in a virtual machine. Linux may change the ABI of this system call without warning, so let me repeat: in compatibility mode, system calls are properly made through the vDSO.


```nasm
section .text
global _start

_start:
    mov edx, len        ; message length
    mov ebp, msg        ; message to write
    mov ebx, 1          ; file descriptor (stdout)
    mov eax, 4          ; system call number (sys_write)
    push continue_l
    push ecx
    push edx
    push ebp
    syscall
    hlt
continue_l:
    mov eax, 1          ; system call number (sys_exit)
    mov ebx, 0
    push ecx
    push edx
    push ebp
    syscall

section .data
msg db 'Hello, world!', 0xa
len equ $ - msg
```

Compilation:


```shell
nasm -f elf main.s -o main.o
ld main.o -melf_i386 -o a.out
```

Why AMD decided to develop its own instruction instead of extending Intel's sysenter to the x86-64 architecture remains unclear.


vsyscall


Moving from user space to kernel space involves a context switch, which is not the cheapest of operations. So, to improve system call performance, it was decided to handle some of them entirely in user space. For this, 8 MB of memory were reserved to map a piece of kernel space into user space. On x86-64, this region holds implementations of three commonly used read-only calls: gettimeofday, time, and getcpu.


Over time, it became clear that vsyscall has significant drawbacks: a fixed location in the address space is a security vulnerability, and the inflexible amount of reserved memory gets in the way of extending the mapped kernel area.


For the example to work, vsyscall support must be enabled in the kernel: CONFIG_X86_VSYSCALL_EMULATION=y.


Vsyscall call example
```c
#include <sys/time.h>
#include <stdio.h>

#define VSYSCALL_ADDR 0xffffffffff600000UL

int main() {
    // Offsets in x86-64
    //    0: gettimeofday
    // 1024: time
    // 2048: getcpu
    int (*f)(struct timeval *, struct timezone *);
    struct timeval tm;
    unsigned long addrOffset = 0;
    f = (void*)VSYSCALL_ADDR + addrOffset;
    f(&tm, NULL);
    printf("%d:%d\n", tm.tv_sec, tm.tv_usec);
}
```

Compilation:


```shell
gcc main.c
```

Linux does not map the vsyscall page in compatibility mode.


Today, to maintain backward compatibility, the Linux kernel emulates vsyscall. Emulation patches the security hole at the cost of performance.


Emulation can be implemented in two ways.


The first way is to replace the function body with an ordinary syscall. In this case, the virtual gettimeofday on x86-64 looks like this:


```asm
movq $0x60, %rax
syscall
ret
```

where 0x60 is the system call number of gettimeofday.


The second method is a bit more interesting. When a vsyscall function is called, a page fault occurs, which Linux handles. The OS sees that the fault was caused by executing an instruction at a vsyscall address and passes control to the virtual system call handler emulate_vsyscall (arch/x86/entry/vsyscall/vsyscall_64.c).


The vsyscall implementation can be controlled with the vsyscall kernel parameter. You can disable virtual system calls entirely (vsyscall=none), select the implementation based on the syscall instruction (vsyscall=native), or the one based on page faults (vsyscall=emulate).


vDSO (Virtual Dynamic Shared Object)


To fix vsyscall's main drawback, it was proposed to implement system calls as a mapped dynamically linked library to which ASLR is applied. In long mode the library is called linux-vdso.so.1, and in compatibility mode linux-gate.so.1. The library is loaded automatically into every process, even statically compiled ones. You can see an application's dependency on it with the ldd utility, in the case of dynamic linking against libc.


The vDSO is also used to select the most efficient system call mechanism available, for example in compatibility mode.


The list of exported functions can be found in the manual.


VDSO call example
```c
#include <sys/time.h>
#include <dlfcn.h>
#include <stdio.h>
#include <assert.h>

#if defined __x86_64__
#define VDSO_NAME "linux-vdso.so.1"
#else
#define VDSO_NAME "linux-gate.so.1"
#endif

int main() {
    int (*f)(struct timeval *, struct timezone *);
    struct timeval tm = {0};
    void *vdso = dlopen(VDSO_NAME, RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
    assert(vdso && "vdso not found");
    f = dlsym(vdso, "__vdso_gettimeofday");
    assert(f);
    f(&tm, NULL);
    printf("%d:%d\n", tm.tv_sec, tm.tv_usec);
}
```

Compilation:


```shell
gcc -ldl main.c
```

For compatibility mode:


```shell
gcc -ldl -m32 main.c -o a32.elf
```

The proper way to find vDSO functions is to obtain the library's address from the AT_SYSINFO_EHDR entry of the auxiliary vector and then parse the shared object. An example of parsing the vDSO from the auxiliary vector can be found in the kernel sources: tools/testing/selftests/vDSO/parse_vdso.c


Or, if you are curious, you can dig in and see how glibc parses the vDSO:


  1. Parsing the auxiliary vectors: elf/dl-sysdep.c
  2. Parsing the shared library: elf/setup-vdso.h
  3. Setting the function pointers: sysdeps/unix/sysv/linux/x86_64/init-first.c, sysdeps/unix/sysv/linux/x86/gettimeofday.c, sysdeps/unix/sysv/linux/x86/time.c

According to the System V AMD64 ABI [4], system calls should be made with the syscall instruction. In practice, calls to this instruction are reached through the vDSO. Support for system calls via int 80h and vsyscall remains for backward compatibility.


System Call Performance Comparison


Benchmarking system call speed is not straightforward. On x86, the execution of a single instruction is affected by many factors, such as whether the instructions are in the cache and how loaded the pipeline is; there is even a table of instruction latencies for the architecture [2]. So determining the execution time of a piece of code is rather difficult. Intel even has a dedicated guide on measuring the execution time of a code section [1]. The problem is that we cannot measure time exactly as the document prescribes, because its method requires kernel facilities that we would have to reach from user space.


So it was decided to measure time with clock_gettime and benchmark the gettimeofday call, since it exists in all the system call implementations. Absolute times will vary between processors, but the relative results should look similar.


The program was run several times, and the minimum execution time was taken as the result.
Testing of int 80h, sysenter, and vDSO-32 was performed in compatibility mode.


Testing program
```c
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <syscall.h>
#include <dlfcn.h>
#include <limits.h>

#define min(a,b) ((a) < (b)) ? (a) : (b)
#define GIGA 1000000000
#define difftime(start, end) (end.tv_sec - start.tv_sec) * GIGA + end.tv_nsec - start.tv_nsec

static struct timeval g_timespec;

#if defined __x86_64__
static inline int test_syscall() {
    register long int result asm ("rax");
    asm volatile (
        "lea %[p0], %%rdi \n\t"
        "mov $0, %%rsi \n\t"
        "mov %[sysnum], %%rax \n\t"
        "syscall \n\t"
        : "=r"(result)
        : [sysnum] "i" (SYS_gettimeofday), [p0] "m" (g_timespec)
        : "rcx", "rsi");
    return result;
}
#endif

static inline int test_int80h() {
    register int result asm ("eax");
    asm volatile (
        "lea %[p0], %%ebx \n\t"
        "mov $0, %%ecx \n\t"
        "mov %[sysnum], %%eax \n\t"
        "int $0x80 \n\t"
        : "=r"(result)
        : [sysnum] "i" (SYS_gettimeofday), [p0] "m" (g_timespec)
        : "ebx", "ecx");
    return result;
}

int (*g_f)(struct timeval *, struct timezone *);

static void prepare_vdso() {
    void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
    if (!vdso) {
        vdso = dlopen("linux-gate.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
    }
    assert(vdso && "vdso not found");
    g_f = dlsym(vdso, "__vdso_gettimeofday");
}

static int test_g_f() {
    return g_f(&g_timespec, 0);
}

#define VSYSCALL_ADDR 0xffffffffff600000UL
static void prepare_vsyscall() {
    g_f = (void*)VSYSCALL_ADDR;
}

static inline int test_sysenter() {
    register int result asm ("eax");
    asm volatile (
        "lea %[p0], %%ebx \n\t"
        "mov $0, %%ecx \n\t"
        "mov %[sysnum], %%eax \n\t"
        "push $cont_label%=\n\t"
        "push %%ecx \n\t"
        "push %%edx \n\t"
        "push %%ebp \n\t"
        "mov %%esp, %%ebp \n\t"
        "sysenter \n\t"
        "cont_label%=: \n\t"
        : "=r"(result)
        : [sysnum] "i" (SYS_gettimeofday), [p0] "m" (g_timespec)
        : "ebx", "esp");
    return result;
}

#ifdef TEST_SYSCALL
#define TEST_PREPARE()
#define TEST_PROC_CALL() test_syscall()
#elif defined TEST_VDSO
#define TEST_PREPARE() prepare_vdso()
#define TEST_PROC_CALL() test_g_f()
#elif defined TEST_VSYSCALL
#define TEST_PREPARE() prepare_vsyscall()
#define TEST_PROC_CALL() test_g_f()
#elif defined TEST_INT80H
#define TEST_PREPARE()
#define TEST_PROC_CALL() test_int80h()
#elif defined TEST_SYSENTER
#define TEST_PREPARE()
#define TEST_PROC_CALL() test_sysenter()
#else
#error Choose test
#endif

static inline unsigned long test() {
    unsigned long result = ULONG_MAX;
    struct timespec start = {0}, end = {0};
    int rt, rt2, rt3;
    for (int i = 0; i < 1000; ++i) {
        rt = clock_gettime(CLOCK_MONOTONIC, &start);
        rt3 = TEST_PROC_CALL();
        rt2 = clock_gettime(CLOCK_MONOTONIC, &end);
        assert(rt == 0);
        assert(rt2 == 0);
        assert(rt3 == 0);
        result = min(difftime(start, end), result);
    }
    return result;
}

int main() {
    TEST_PREPARE();
    // prepare calls
    int a = TEST_PROC_CALL();
    assert(a == 0);
    a = TEST_PROC_CALL();
    assert(a == 0);
    a = TEST_PROC_CALL();
    assert(a == 0);
    unsigned long result = test();
    printf("%lu\n", result);
}
```

Compilation:


```shell
gcc -O2 -DTEST_SYSCALL time_test.c -o test_syscall
gcc -O2 -DTEST_VDSO -ldl time_test.c -o test_vdso
gcc -O2 -DTEST_VSYSCALL time_test.c -o test_vsyscall
#m32
gcc -O2 -DTEST_VDSO -ldl -m32 time_test.c -o test_vdso_32
gcc -O2 -DTEST_INT80H -m32 time_test.c -o test_int80
gcc -O2 -DTEST_SYSENTER -m32 time_test.c -o test_sysenter
```

About the system:

```shell
cat /proc/cpuinfo | grep "model name" -m 1   # Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
uname -r                                     # 4.14.13-1-ARCH
```


Results Table


Implementation      Time (ns)
int 80h             498
sysenter            338
syscall             278
vsyscall emulate    692
vsyscall native     278
vDSO                 37
vDSO-32              51

As you can see, each new system call implementation is faster than the previous one, except vsyscall, since it is now emulated. As you have probably guessed, if vsyscall still worked as originally designed, its time would be similar to the vDSO's.


All of the measurements above were made with the KPTI patch that fixes the Meltdown vulnerability applied.


Bonus: system call performance without KPTI


The KPTI patch was developed specifically to fix the Meltdown vulnerability, and as is well known, it slows the OS down. Let us check performance with KPTI disabled (pti=off).


Results with the patch disabled


Implementation      Time (ns)   Increase after patch (ns)   Degradation after patch, (t1 - t0) / t0 * 100%
int 80h             317         181                          57%
sysenter            150         188                         125%
syscall             103         175                         170%
vsyscall emulate    496         196                          40%
vsyscall native     103         175                         170%
vDSO                 37           0                           0%
vDSO-32              51           0                           0%

The extra ~180 ns is the overhead KPTI adds to every kernel entry: the page tables are switched on each transition, with the associated TLB flushes.


The vDSO implementations are unaffected because they never enter the kernel at all, so there is no page table switch and no TLB flush.


Related reading:

Linux insides, system calls: https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html
The Linux kernel, system calls: https://www.win.tue.nl/~aeb/linux/lk/lk-4.html
Anatomy of a system call, part 1: https://lwn.net/Articles/604287/
Anatomy of a system call, part 2: https://lwn.net/Articles/604515/

Links


[0] Intel 64 and IA-32 Architectures Software Developer's Manual: Vol. 2B
[1] How to benchmark code execution times ...
[2] Instruction latencies and throughput for AMD and Intel x86 processors
[3] AMD64 Architecture Programmer's Manual Volume 2: System Programming
[4] System V ABI AMD64



Source: https://habr.com/ru/post/347596/

