In the previous section, we established that non-exported Linux kernel symbols can be used in the code of kernel modules just as successfully as exported ones. One of these symbols is the dispatch (selector) table of all Linux system calls, which is in effect the main interface between applications and kernel services. Now we will look at how to modify the original handler of any system call: replace it, or extend its execution according to our own needs.
Modification technique
The technique of modifying operating system calls has been known for a long time and has been used in a variety of operating systems. It has been a favorite of virus writers since MS-DOS, a system that practically invited such experiments. We, however, will use it for peaceful purposes. (Different publications call such actions modification, embedding, injection, substitution, or interception. The terms have their own nuances, but in our discussion they can be used as synonyms.)
Every system call in the Linux kernel is invoked indirectly, through an address stored in the system call table (array) sys_call_table. So if we replace the address in this table with the address of our own handler function, we replace the system call handler. This, in essence, is the modification technique. In practice, such a radical replacement is rarely needed: usually we still want to execute the original system call, but with some actions of our own before it (preprocessing), after it (post-processing), or both.
In the previous part, we made sure that we can locate and use any kernel symbol, including non-exported ones. In principle, all that is needed to modify a system call is to find the base address of the sys_call_table array and, at the index given by the number of the required system call, write the address of our own handler function.
In reality, the scheme is slightly more complicated, because the old (original) handler address must be saved first, in order to:
- call the original handler from our own function before and/or after executing the added code;
- restore the original handler when the module is unloaded.
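The scheme can be tried out safely in user space before touching the kernel. The following is only a minimal sketch under stated assumptions: call_table, dispatch(), hooked_double() and the other names are hypothetical stand-ins for sys_call_table and real handlers, and nothing here is kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's table: an array of handler pointers. */
typedef long (*handler_t)(long arg);

enum { NR_DOUBLE = 0, NR_NEGATE = 1, NR_MAX = 2 };

static long orig_double(long arg) { return arg * 2; }
static long orig_negate(long arg) { return -arg; }

static handler_t call_table[NR_MAX] = { orig_double, orig_negate };

/* Saved original handler, needed both for chaining and for restoring. */
static handler_t saved_double = NULL;
static int pre_calls = 0;

/* Replacement handler: pre-processing, original call, post-processing. */
static long hooked_double(long arg)
{
    long result;

    pre_calls++;                    /* pre-processing: count the call */
    result = saved_double(arg);     /* the original behaviour is preserved */
    return result + 1;              /* post-processing: adjust the result */
}

static void hook_install(void)
{
    saved_double = call_table[NR_DOUBLE];   /* 1. save the original address */
    call_table[NR_DOUBLE] = hooked_double;  /* 2. write our own address */
}

static void hook_remove(void)
{
    call_table[NR_DOUBLE] = saved_double;   /* 3. restore on "unload" */
}

/* Every call goes through the table, the way a system call is dispatched. */
static long dispatch(int nr, long arg)
{
    assert(nr >= 0 && nr < NR_MAX);
    return call_table[nr](arg);
}
```

Before hook_install(), dispatch( NR_DOUBLE, 21 ) goes to orig_double(); after it, the same call passes through hooked_double() first, and hook_remove() puts the table back into its initial state. The other table entries are not affected at any point.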
Implementation
Implementation, as always, is somewhat more complicated than theory. The first, minor, difficulty is how to write the prototype of our own handler function for a particular system call. The slightest error in the prototype is likely to crash the operating system. The solution is simply to copy the prototype of the function that handles this system call from the kernel header file <linux/syscalls.h>.
The next, much more significant, difficulty is that on processor architectures that allow it (i386 and x86_64 among them) the sys_call_table table is placed in memory pages that are read-only. This is enforced in hardware by the MMU (Memory Management Unit), and a write to such a page raises an exception. Therefore, we need to lift the write protection while modifying the sys_call_table entry and restore it afterwards.
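The same "unprotect, write, protect again" sequence can be observed in user space, where page protection is changed with mprotect() instead of a processor register. This is only an analogy, not the kernel mechanism itself; the function name patch_readonly_slot() and the values 100 and 200 are made up for illustration.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * User-space analogy only: one page is made read-only, briefly made
 * writable to change a single slot, then protected again.
 * Returns 0 on success, -1 on failure.
 */
static int patch_readonly_slot(long *before, long *after)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    long *table;
    int rc = -1;

    assert(page > 0);
    len = (size_t)page;

    table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (table == MAP_FAILED)
        return -1;

    table[1] = 100;                             /* initial "handler" value */
    if (mprotect(table, len, PROT_READ) != 0)   /* the page is now read-only */
        goto out;
    *before = table[1];                         /* reading is still allowed */

    /* A write at this point would raise SIGSEGV, so lift the protection. */
    if (mprotect(table, len, PROT_READ | PROT_WRITE) != 0)
        goto out;
    table[1] = 200;                             /* the actual modification */
    if (mprotect(table, len, PROT_READ) != 0)   /* protect the page again */
        goto out;

    *after = table[1];
    rc = 0;
out:
    munmap(table, len);
    return rc;
}
```

In the kernel module the page attributes are not changed at all; instead, the processor is temporarily told to ignore the read-only attribute for kernel-mode writes.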
On the i386 and x86_64 architectures, whether kernel-mode code may write to read-only pages is controlled by the WP (Write Protect) bit, bit 16 of the CR0 control register. To perform the actions we need, we use the following macros. For the 32-bit architecture they look like this (file CR0.c; the code is written as inline assembler, a GCC compiler extension):
    // write protection off: clear the WP bit (bit 16) of CR0, interrupts disabled
    #define rw_enable()              \
    asm( "cli \n"                    \
         "pushl %eax \n"             \
         "movl %cr0, %eax \n"        \
         "andl $0xfffeffff, %eax \n" \
         "movl %eax, %cr0 \n"        \
         "popl %eax" );

    // write protection on again: set the WP bit of CR0, interrupts enabled
    #define rw_disable()             \
    asm( "pushl %eax \n"             \
         "movl %cr0, %eax \n"        \
         "orl $0x00010000, %eax \n"  \
         "movl %eax, %cr0 \n"        \
         "popl %eax \n"              \
         "sti " );
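The two constants in the macros are simply the mask of bit 16 and its complement. The arithmetic can be checked in ordinary user-space C; the helper names below are hypothetical, and the value 0x80050033 is the CR0 content that appears in the log output later in this article.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_WP_BIT  16u
#define CR0_WP_MASK (UINT32_C(1) << CR0_WP_BIT)     /* 0x00010000 */

/* Value CR0 holds after the andl with 0xfffeffff (WP cleared). */
static uint32_t cr0_clear_wp(uint32_t cr0) { return cr0 & ~CR0_WP_MASK; }

/* Value CR0 holds after the orl with 0x00010000 (WP set again). */
static uint32_t cr0_set_wp(uint32_t cr0) { return cr0 | CR0_WP_MASK; }

static int cr0_wp_is_set(uint32_t cr0) { return (cr0 & CR0_WP_MASK) != 0; }

/* Replays the rw_enable() / rw_disable() sequence on a plain integer. */
static void cr0_demo(void)
{
    uint32_t cr0 = UINT32_C(0x80050033);    /* WP set: normal state */

    assert(cr0_wp_is_set(cr0));
    cr0 = cr0_clear_wp(cr0);                /* what rw_enable() does */
    assert(cr0 == UINT32_C(0x80040033));
    cr0 = cr0_set_wp(cr0);                  /* what rw_disable() does */
    assert(cr0 == UINT32_C(0x80050033));
}
```

Only one bit changes between the two states; all other CR0 flags are left as they were.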
P.S. Various techniques for writing to write-protected pages have been discussed, for example, in "WP: Safe or Not?" and in "Kosher method of modifying the write-protected areas of the Linux kernel".
Now we are ready to replace any Linux system call (see section 2 of the man pages) with our own handler function, which is what we were aiming for. To illustrate the method, we extend the write( 1, ... ) system call, that is, output to standard output, normally the terminal: the output stream is duplicated into the system log (similar to what the tee command does):
    // Headers assumed for this excerpt; find_sym() and show_cr0() are defined
    // elsewhere in the project and are not shown here.
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/string.h>
    #include <linux/uaccess.h>
    #include <asm/unistd.h>

    #define PREFIX "! "
    #define DEB2(...) do { if( debug > 1 ) printk( KERN_INFO PREFIX " ---- " __VA_ARGS__ ); } while( 0 )
    #define LOG(...) printk( KERN_INFO PREFIX __VA_ARGS__ )
    #define ERR(...) printk( KERN_ERR PREFIX __VA_ARGS__ )

    static unsigned int debug = 0;     // debug output level: 0, 1, 2
    module_param( debug, uint, 0 );    // the type now matches the variable

    // pointer to the original handler, same prototype as sys_write()
    asmlinkage long (*old_sys_write) (
       unsigned int fd, const char __user *buf, size_t count );

    #define LEN 250

    asmlinkage long new_sys_write (
       unsigned int fd, const char __user *buf, size_t count ) {
       if( 1 == fd ) {                              // standard output only
          char msg[ LEN + 1 ];
          size_t n = count < LEN ? count : LEN;     // truncate long output
          if( copy_from_user( msg, buf, n ) != 0 )
             return -EFAULT;                        // bad user-space buffer
          if( n > 0 && '\n' == msg[ n - 1 ] )       // n > 0: never index msg[ -1 ]
             msg[ n - 1 ] = '\0';
          else
             msg[ n ] = '\0';
          if( strchr( msg, '!' ) != NULL )
             goto rec;                              // to prevent recursion
          LOG( "{%04zu} %s\n", count, msg );        // count is size_t
       }
    rec:
       return old_sys_write( fd, buf, count );      // original write()
    }

    static void **taddr;                            // address of sys_call_table

    static int __init wrchg_init( void ) {
       void *waddr;
       if( NULL == ( taddr = find_sym( "sys_call_table" ) ) ) {
          ERR( "sys_call_table not found\n" );
          return -EINVAL;
       }
       old_sys_write = (void*)taddr[ __NR_write ];  // save the original handler
       if( NULL == ( waddr = find_sym( "sys_write" ) ) ) {
          ERR( "sys_write not found\n" );
          return -EINVAL;
       }
       if( old_sys_write != waddr ) {               // consistency check
          ERR( "Oooops! : addresses not equal\n" );
          return -EINVAL;
       }
       LOG( "set new sys_write syscall [%p]\n", &new_sys_write );
       show_cr0();
       rw_enable();                                 // write protection off
       taddr[ __NR_write ] = new_sys_write;
       show_cr0();
       rw_disable();                                // write protection on
       show_cr0();
       return 0;
    }

    static void __exit wrchg_exit( void ) {
       rw_enable();
       taddr[ __NR_write ] = old_sys_write;         // restore the original handler
       rw_disable();
       LOG( "restore old sys_write syscall [%p]\n", (void*)taddr[ __NR_write ] );
    }

    module_init( wrchg_init );
    module_exit( wrchg_exit );
The kernel symbol lookup function find_sym(), which uses the kallsyms_on_each_symbol() kernel API call, was shown in the previous part of the discussion. In addition, we check (mostly for illustration) that the address of the original sys_write() symbol matches the address stored at position __NR_write of the sys_call_table table.
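The message preparation inside new_sys_write() (truncation to LEN bytes, removal of one trailing newline, and skipping lines that already carry the '!' marker) can be exercised separately in user space. The sketch below is an illustration with hypothetical names (prepare_log_line, MARK), with memcpy() standing in for copy_from_user(); the guard against a zero-length write is an assumption added here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LEN  250
#define MARK '!'    /* marker character of the module's own log prefix */

/*
 * Copies at most LEN bytes of buf into msg (which must hold LEN + 1 bytes),
 * drops one trailing newline and NUL-terminates the result.
 * Returns 1 if the text should be logged, 0 if it already contains the
 * marker (logging it again would re-log the module's own output).
 */
static int prepare_log_line(const char *buf, size_t count, char *msg)
{
    size_t n = count < LEN ? count : LEN;

    assert(buf != NULL && msg != NULL);
    memcpy(msg, buf, n);                /* stands in for copy_from_user() */

    if (n > 0 && msg[n - 1] == '\n')    /* n > 0 guards against msg[-1] */
        msg[n - 1] = '\0';
    else
        msg[n] = '\0';

    return strchr(msg, MARK) == NULL;
}
```

Because the marker test is a plain substring check, any terminal output that happens to contain '!' is skipped as well; that is the price of such a simple recursion guard.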
Now we can run the system with parallel logging of everything written to the terminal (write() is not a particularly elegant choice for experiments, but it is very illustrative and, besides, safer in the early stages of experimentation than other Linux system calls):
    $ sudo insmod wrlog.ko debug=2
    $ ls
    CR0.c find.c Makefile Modi.hist wrlog.0.c wrlog.1.c wrlog.2.c wrlog.3.c wrlog.c wrlog.hist wrlog.ko
    $ sudo rmmod wrlog
    $ dmesg | tail -n31
    [ 1594.231242] ! set new sys_write syscall [f8854000]
    [ 1594.231248] ! ---- CR0 = 80050033
    [ 1594.231250] ! ---- CR0 = 80040033
    [ 1594.231252] ! ---- CR0 = 80050033
    [ 1594.232737] ! {0052} /home/olej/2015_WORK/own.BOOK/SysCalls/Modi/examles
    [ 1594.233368] ! {0078} \x1b[01;32molej@nvidia\x1b[01;34m ~/2015_WORK/own.BOOK/SysCalls/Modi/examles $\x1b[00m
    [ 1596.866659] ! {0001} l
    [ 1597.154675] ! {0001} s
    [ 1597.644985] ! {0110} CR0.c find.c Makefile Modi.hist wrlog.0.c wrlog.1.c wrlog.2.c wrlog.3.c wrlog.c wrlog.hist wrlog.ko
    [ 1597.645196] ! {0113}
    [ 1597.645196] CR0.c find.c Makefile Modi.hist wrlog.0.c wrlog.1.c wrlog.2.c wrlog.3.c wrlog.c wrlog.hist wrlog.ko
    [ 1597.645321] ! {0052} /home/olej/2015_WORK/own.BOOK/SysCalls/Modi/examles
    [ 1597.645951] ! {0078} \x1b[01;32molej@nvidia\x1b[01;34m ~/2015_WORK/own.BOOK/SysCalls/Modi/examles $\x1b[00m
    [ 1600.226651] ! {0001} s
    [ 1600.346587] ! {0001} u
    [ 1600.522683] ! {0001} d
    [ 1601.026667] ! {0001} o
    [ 1602.170701] ! {0001}
    [ 1602.426522] ! {0001} r
    [ 1603.218682] ! {0001} m
    [ 1603.682677] ! {0001} m
    [ 1603.906615] ! {0001} o
    [ 1604.338566] ! {0001} d
    [ 1606.442570] ! {0001}
    [ 1606.946670] ! {0001} w
    [ 1607.226667] ! {0001} r
    [ 1607.834662] ! {0001} l
    [ 1608.106672] ! {0001} o
    [ 1608.842694] ! {0001} g
    [ 1612.003059] ! {0002}
    [ 1612.014102] ! restore old sys_write syscall [c1179f70]
Discussion
In the same way, we can change the behavior of any Linux system call. This is done dynamically, by loading a module, and when the module is unloaded the original behavior of the system is restored. The range of applications is wide: monitoring and debugging during development, targeted changes to the behavior of individual system calls for the needs of a project, and more.
The code shown is noticeably simplified. A real module would have to take a number of precautions to ensure integrity. For example, the new handler function could increment the module's reference count by calling try_module_get( THIS_MODULE ), so that the module cannot be unloaded while the function is executing (which can happen with a vanishingly small but still nonzero probability). Before returning, the function would then do the opposite: module_put( THIS_MODULE ). Other precautions may be needed, for example during loading and unloading of the module. But these are ordinary kernel module techniques, and they are left out here so as not to obscure the principle.
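The idea behind the reference count can be illustrated in user space with C11 atomics. This is only an analogy of the try_module_get() / module_put() pairing, not the kernel implementation: the names handler_get, handler_put, try_unload and guarded_handler are hypothetical, and the return value -1 is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* >= 0: number of handlers currently running; -1: unloading has started. */
static atomic_int state = 0;

/* Analogue of try_module_get(): fails once unloading has started. */
static bool handler_get(void)
{
    int v = atomic_load(&state);

    while (v >= 0) {
        /* On failure v is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&state, &v, v + 1))
            return true;
    }
    return false;
}

/* Analogue of module_put(): drops the reference taken by handler_get(). */
static void handler_put(void)
{
    int prev = atomic_fetch_sub(&state, 1);

    assert(prev > 0);
    (void)prev;
}

/* Unloading succeeds only while no handler holds a reference. */
static bool try_unload(void)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&state, &expected, -1);
}

/* A handler that protects itself for the duration of its own execution. */
static long guarded_handler(long arg)
{
    long result;

    if (!handler_get())
        return -1;          /* "module" is going away */
    result = arg + 1;       /* stands in for the real work */
    handler_put();
    return result;
}
```

While any call to guarded_handler() is in progress, try_unload() fails, and once unloading has begun, no new call can enter the handler.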
Some additional nuances and special cases of this technique will be covered in the next part of the discussion.
The code archive for experiments can be downloaded here or here (since the examples are of minor significance, I do not post them on GitHub).
P.S. Everything shown works unchanged on 32-bit systems. On the 64-bit architecture the picture is somewhat more complicated because of the need to support 32-bit applications. To keep things simple, this case was deliberately left out (perhaps only for now; it may be worth returning to it later).