- Something about Honduras is bothering me ...
- Bothering you? Then don't scratch it.
In the previous parts of this discussion (1st, 2nd and 3rd), we looked at how the ability to change the contents of sys_call_table lets us change the behavior of a particular Linux system call. Now we continue the experiment and ask whether (and how) we can dynamically add a new system call for our own software project.
We will not dwell on the question "why?": in programming, "why?" is the last thing to ask, and the question to ask is "how?". If a technique does not appeal to you, you simply do not use it (see the epigraph). Nevertheless, we will return briefly to this question in the discussion at the end.
What does this look like?
Although this task broadly resembles the previously discussed examples of replacing a system call, it has some complicating features:
- The size of the sys_call_table system call table grows slowly but monotonically from one kernel version to the next, and depends significantly on the processor platform.
- The constant that gives the size of this table (known in the kernel as __NR_syscall_max, or in some newer versions as __NR_syscalls) is a preprocessor constant (macro) that exists only at compile time and is not available at run time (at least, I know of no way to get it).
- If we try to append our own entry point to the end of the table, we run a significant risk of writing beyond the area allocated for the table, and that must not happen!
The sys_call_table table is fairly large, and its size changes from one kernel version to the next; here is a very rough estimate (for version 3.13):
    $ cat /proc/kallsyms | grep ' sys_' | grep T | wc -l
    357
Kernel versions will have to be mentioned constantly in this part of the discussion: something defined in a header file in one version may be defined differently, and in a completely different place (file), in the next version, or may not be explicitly defined at all. This is common practice in kernel code. For all that, the basic principles and dependencies stay the same from version to version.
These limiting circumstances are mitigated by the fact that the system call table is not dense but rather sparse: it has unused positions (left over from obsolete system calls that are no longer supported).
All such positions are filled with the same address: a pointer to sys_ni_syscall(), the handler for unimplemented calls:
    $ cat /proc/kallsyms | grep sys_ni_syscall
    c045b9a8 T sys_ni_syscall
And the sys_ni_syscall() handler itself is defined like this:

    asmlinkage long sys_ni_syscall( void ) {
       return -ENOSYS;
    }
Therefore, we can put our new system call handler into any unused position of the sys_call_table table. Note that these positions do not hold obsolete, unused calls; they hold a call whose only action is to return an error code. Moreover, kernel developers are not free to reuse these positions, because otherwise a thoroughly outdated application could unknowingly invoke the new call that replaced the old one.
The structure of the sys_call_table table (for a chosen platform and version) can be examined in detail statically, in the source code. The source tree as the developers ship it is not very convenient for such study, but fortunately there are now quite a few sites that present the kernel code through the LXR project (Linux Kernel Cross Reference), for example, here or here (they let you compare versions and easily find the identifiers you need). As an example, I will show only those positions of sys_call_table in the 3.0.26 kernel for the x86 architecture (file <arch/x86/kernel/syscall_table_32.S>) that refer to sys_ni_syscall. (By kernel 3.2 and later this file disappears from the source tree altogether, but the principles by which the table is formed stay the same and its appearance does not change.)
    ENTRY(sys_call_table)
       .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */
       .long sys_exit
       ...
       .long sys_ni_syscall /* old break syscall holder */       //17
       .long sys_ni_syscall /* old stty syscall holder */        //31
       .long sys_ni_syscall /* old gtty syscall holder */        //32
       .long sys_ni_syscall /* 35 - old ftime syscall holder */  //35
       .long sys_ni_syscall /* old prof syscall holder */        //44
       .long sys_ni_syscall /* old lock syscall holder */        //53
       .long sys_ni_syscall /* old mpx syscall holder */         //56
       .long sys_ni_syscall /* old ulimit syscall holder */      //58
       .long sys_ni_syscall /* old profil syscall holder */      //98
       .long sys_ni_syscall /* old "idle" system call */         //112
       .long sys_ni_syscall /* old "create_module" */            //127
       .long sys_ni_syscall /* 130: old "get_kernel_syms" */     //130
       .long sys_ni_syscall /* reserved for afs_syscall */       //137
       .long sys_ni_syscall /* Old sys_query_module */           //167
       .long sys_ni_syscall /* reserved for streams1 */          //188
       .long sys_ni_syscall /* reserved for streams2 */          //189
       .long sys_ni_syscall /* reserved for TUX */               //222
       .long sys_ni_syscall                                      //223
       .long sys_ni_syscall                                      //251
       .long sys_ni_syscall /* sys_vserver */                    //273
       .long sys_ni_syscall /* 285 */ /* available */            //285
       ...
       .long sys_setns                                           // 346
The listing shows only the unused positions (plus the beginning and end of the table). The /* ... */ comments come from the source code; the trailing comment with the position number of the system call was added by me.
We see that for this kernel version the table has 347 system call positions, of which 21 are unused. Analyzing the unused positions dynamically, without relying on the ever-changing kernel sources, is the job of the first kernel module we look at:
    static void **taddr,                 // address of sys_call_table
                *niaddr;                 // address of sys_ni_syscall()
    static int nsys = 0;                 // highest system call number found so far
    #define SYS_NR_MAX 450
    // SYS_NR_MAX - an arbitrary number, chosen to exceed the length of sys_call_table

    static int sys_length( void* data, const char* sym, struct module* mod, unsigned long addr ) {
       int i;
       // skip symbols that do not start with "sys_", and sys_call_table itself
       if( ( strstr( sym, "sys_" ) != sym ) ||
           ( 0 == strcmp( "sys_call_table", sym ) ) ) return 0;
       for( i = 0; i < SYS_NR_MAX; i++ ) {
          if( taddr[ i ] == (void*)addr ) {  // found the sys_* symbol in sys_call_table
             if( i > nsys ) nsys = i;
             break;
          }
       }
       return 0;
    }

    static void put_entries( void ) {
       int i, ni = 0;
       char buf[ 200 ] = "";
       for( i = 0; i <= nsys; i++ )
          if( taddr[ i ] == niaddr ) {
             ni++;
             sprintf( buf + strlen( buf ), "%03d, ", i );
          }
       LOG( "found %d unused entries: %s\n", ni, buf );
    }

    static int __init init_driver( void ) {
       if( NULL == ( taddr = (void**)kallsyms_lookup_name( "sys_call_table" ) ) ) {
          ERR( "sys_call_table not found!\n" );
          return -EFAULT;
       }
       LOG( "sys_call_table address = %p\n", taddr );
       if( NULL == ( niaddr = (void*)kallsyms_lookup_name( "sys_ni_syscall" ) ) ) {
          ERR( "sys_ni_syscall not found!\n" );
          return -EFAULT;
       }
       LOG( "sys_ni_syscall address = %p\n", niaddr );
       kallsyms_on_each_symbol( sys_length, NULL );
       LOG( "sys_call_table length = %d\n", nsys + 1 );
       put_entries();
       return -EPERM;                    // deliberate: the module is not meant to stay loaded
    }

    module_init( init_driver );
As before, optional details (such as the LOG() macro) are not shown; they are all in the complete attached files.
It could have been done more simply (and just as correctly): to find the length of sys_call_table, just count the kernel symbols matching the sys_* mask and subtract 1 (the sys_call_table symbol itself). But we take the more redundant route:
- the loop visits the next symbol matching the sys_* mask;
- its position is looked up in sys_call_table (an extra check that it really is a system call);
- if this position is greater than those found for earlier symbols, it becomes the current number of the last call (the current size of sys_call_table).
This redundant (and not strictly necessary) scheme also lets us pin down the exact size of the system call table for the given architecture and Linux kernel version:
    $ uname -p
    i686
    $ uname -r
    3.13.0-37-generic
    $ sudo insmod nsys.ko
    insmod: ERROR: could not insert module nsys.ko: Operation not permitted
    $ dmesg | tail -n 4
    [10751.601851] ! sys_call_table address = c1666140
    [10751.602194] ! sys_ni_syscall address = c1075930
    [10751.659769] ! sys_call_table length = 351
    [10751.659779] ! found 27 unused entries: 017, 031, 032, 035, 044, 053, 056, 058, 098, 112, 127, 130, 137, 167, 169, 188, 189, 222, 223, 251, 273, 274, 275, 276, 285, 294, 317,
In total, this version has 351 system calls, of which 27 are unused (about 8% of the table size). This list is very stable (version 3.0.26 was deliberately chosen for the source analysis, and versions 2.6.32 and 3.13, released more than 4 years apart, for the dynamic runs).
Note: Without digressing too far, it is worth remarking in passing on the style in which this module is written: a) it is not meant to stay loaded at all, b) it therefore deliberately returns a non-zero completion code, and c) it consequently has no unload function (__exit) at all. It is the direct equivalent of a user application (starting from main()), except that it runs in supervisor mode with maximum privileges. But that is a subject for another discussion ...
Implementing a new system call
Now we are ready to return to the stated task: adding a new system call. Naturally, we will also need a user-space test application that uses this call. The number of the new call is defined in a common header file (syscall.h), so that the module and the program agree on it (the same file also defines LOG(), ERR() and other small things):
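    // number of the new system call
    #define __NR_own 223
    // any of the positions reported by the nsys.ko module can be used;
    // for kernel 3.13 it reported 27 unused positions:
    // 017, 031, 032, 035, 044, 053, 056, 058, 098, 112,
    // 127, 130, 137, 167, 169, 188, 189, 222, 223, 251,
    // 273, 274, 275, 276, 285, 294, 317,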
It is easier and clearer to start with the user application that will make the new system call. Everything here is as simple as it gets:
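    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include "syscall.h"                 // __NR_own, LOG(), ERR()

    static void do_own_call( char *str ) {
       // syscall() returns -1 on failure and leaves the actual error code in errno
       int n = syscall( __NR_own, str, strlen( str ) );
       if( n == 0 ) LOG( "syscall return %d\n", n );
       else {
          ERR( "syscall error %d : %s\n", n, strerror( -n ) );
          exit( n );
       }
    }

    int main( int argc, char *argv[] ) {
       if( 1 == argc ) do_own_call( "DEFAULT STRING" );
       else
          while( --argc > 0 ) do_own_call( argv[ argc ] );   // last argument first
       return EXIT_SUCCESS;
    }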
The program can make one system call or a series of them (if several parameters are given on the command line) and passes a character string to the call (just as is done, for example, for sys_write). In the module code we will then see how this string is copied into kernel space. But the main interest here is the return code: the success or failure of the system call.
And here is the module that "picks up" such a call on the kernel side:
    asmlinkage long (*old_sys_addr) ( void );

    // handler of the new system call:
    asmlinkage long new_sys_call ( const char __user *buf, size_t count ) {
       static char buf_msg[ 80 ];
       int res;
       if( count >= sizeof( buf_msg ) )           // never write past the end of buf_msg
          count = sizeof( buf_msg ) - 1;
       res = copy_from_user( buf_msg, (void*)buf, count );
       buf_msg[ count ] = '\0';
       LOG( "accepted %zu bytes: %s\n", count, buf_msg );
       return res;
    }

    static void **taddr;                          // address of sys_call_table

    static int __init new_sys_init( void ) {
       void *waddr;
       if( NULL == ( taddr = (void**)kallsyms_lookup_name( "sys_call_table" ) ) ) {
          ERR( "sys_call_table not found!\n" );
          return -EFAULT;
       }
       old_sys_addr = (void*)taddr[ __NR_own ];
       if( ( waddr = (void*)kallsyms_lookup_name( "sys_ni_syscall" ) ) != NULL )
          LOG( "sys_ni_syscall address = %p\n", waddr );
       else {
          ERR( "sys_ni_syscall not found!\n" );
          return -EFAULT;
       }
       if( old_sys_addr != waddr ) {              // the slot must still hold the stub
          ERR( "not free slot!\n" );
          return -EINVAL;
       }
       LOG( "old sys_call_table[%d] = %p\n", __NR_own, taddr[ __NR_own ] );
       rw_enable();
       taddr[ __NR_own ] = new_sys_call;
       rw_disable();
       LOG( "new sys_call_table[%d] = %p\n", __NR_own, taddr[ __NR_own ] );
       return 0;
    }

    static void __exit new_sys_exit( void ) {
       rw_enable();
       taddr[ __NR_own ] = old_sys_addr;          // put the stub back
       rw_disable();
       LOG( "restore sys_call_table[%d] = %p\n", __NR_own, taddr[ __NR_own ] );
       return;
    }

    module_init( new_sys_init );
    module_exit( new_sys_exit );
A double safety check is done here too: we verify that the address at position __NR_own of the sys_call_table table is the address of sys_ni_syscall, the handler for unused system calls.
And now let us evaluate what we have done:
    $ ./syscall
    syscall error -1 : Operation not permitted
    $ echo $?
    255
    $ sudo insmod adds.ko
    $ lsmod | head -n3
    Module                  Size  Used by
    adds                   12622  0
    pci_stub               12550  1
    $ dmesg | tail -n3
    [15000.600618] ! sys_ni_syscall address = c1075930
    [15000.600622] ! old sys_call_table[223] = c1075930
    [15000.600623] ! new sys_call_table[223] = f87d9000
    $ ./syscall new string for call
    syscall return 0
    syscall return 0
    syscall return 0
    syscall return 0
    $ dmesg | tail -n4
    [15070.680753] ! accepted 4 bytes: call
    [15070.680799] ! accepted 3 bytes: for
    [15070.680804] ! accepted 6 bytes: string
    [15070.680807] ! accepted 3 bytes: new
    $ ./syscall 'new string for call'
    syscall return 0
    $ dmesg | tail -n1
    [15167.526452] ! accepted 19 bytes: new string for call
    $ sudo rmmod adds
    $ dmesg | tail -n1
    [15199.917817] ! restore sys_call_table[223] = c1075930
    $ ./syscall
    syscall error -1 : Operation not permitted
After the module is unloaded, the kernel can no longer service the system call that the program requires!
Discussion
Actually, there is not much to discuss here: the example shows everything transparently. But I promised at the outset to share my thoughts on why such a thing might be used at all (and I repeat once more my firm conviction that the question "why?" is generally meaningless in programming). The trick shown provides one more way for applications to interact (in both directions) with the kernel. Yes, of course the same can be done through /dev, /proc or /sys ... but each of those methods is more cumbersome than a system call and involves more intermediate kernel mechanisms.
When might such a mechanism be used? For example, to notify an application asynchronously about certain events in the kernel: a separate application thread blocks in the system call until the expected event occurs. Such an event might be, for example, a hardware interrupt (IRQ) from a new device being debugged (one that is not too fast). With this approach, all input-output operations with the device can be carried out from user space using inb(), outb() ..., together with ioperm() and iopl(). Taken together, this makes it possible to study the device's operation and work out the code for talking to it in the finest detail without leaving user space, without the risks and difficulties of privileged kernel mode. And then, depending on circumstances and preference, you can mechanically rewrite the debugged driver code as a module, or leave it as it is in user space.
Note: The remark above about low-speed devices, as the only ones that can be handled this way, should not be taken too much to heart either. Truly high-speed devices do not work on interrupts even inside the Linux kernel; they use cyclic software polling. So do, for example, all network interfaces of the network stack at the hardware level ... anyone who knows the Linux network subsystem will understand what I mean.
Not to mention the developers of proprietary hardware and projects, which have the same right to exist as any others. In their work, such a technique may well find a use.
And again, as before, the code archive can be downloaded here or here ...
Epilogue
Since this is the final part of a short series about such unusual (indecent?) treatment of Linux system calls, I would like to say a few words by way of summary.
When you start writing kernel modules or kernel patches, you initially feel constrained, limited to the capabilities offered by the poorly documented kernel API or described in the few, outdated books of the "writing Linux drivers" kind. But experiments like those described in this series, and many similar ones, show that in a kernel module you have access to all (without exception!) the capabilities of user space (launching new processes and threads, sending UNIX signals, etc.). On top of that come capabilities unattainable in user space, tied to the processor's privileged protection mode (supervisor, ring 0): privileged instructions, internal processor registers, interrupt handling.
Showing this is the main goal of this series of articles, rather than just the particular tasks of replacing or adding system calls. Kernel-mode programming should give you a feeling of freedom: here you are like gods and can do anything. But this also demands a matching degree of responsibility ...