Network system calls. Part 3

We ended the previous part on an optimistic note: "In a similar way, we can change the behavior of any Linux system call." And here I slipped up: any ... well, not quite any. The exception is (or can be) the group of network system calls that work with BSD sockets. When you run into this artifact for the first time, it is rather puzzling.

How does a socket call occur?

To clarify the picture, we turn to the notes of one of the developers of the Linux network subsystem: Network systems calls on Linux (2008). For those who would rather not read the original, I will briefly retell its main content (the part that interests us).

When BSD socket support was added to the Linux kernel, the developers decided to add all 17 socket calls (there are currently 20) at once, and introduced one extra level of indirection for them. For the whole group, a single, rarely mentioned system call was introduced (see man socketcall(2)):

int socketcall( int call, unsigned long *args );

Where:
- call: the number of the network call (SYS_CONNECT, SYS_ACCEPT ... we will see them soon);
- args: a pointer to a 6-element array (parameter block) in which all parameters of any system call of this (network) group are packed sequentially, regardless of their type (each is cast to unsigned long).

The kernel (net/socket.c) contains a table that records how many parameters each socket call actually uses, indexed by the call number (in the range from 1 to 20):

    /* Argument list sizes for sys_socketcall */
    #define AL(x) ((x) * sizeof(unsigned long))
    static const unsigned char nargs[ 21 ] = {
       AL(0),AL(3),AL(3),AL(3),AL(2),AL(3),
       AL(3),AL(3),AL(4),AL(4),AL(4),AL(6),
       AL(6),AL(2),AL(5),AL(5),AL(3),AL(3),
       AL(4),AL(5),AL(4)
    };
    #undef AL

(Note that nargs[0] is not used at all, which is why the array has 21 elements.)
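To make the role of the table concrete, here is a minimal user-space sketch of the lookup. It stores argument counts rather than byte sizes, and the names demo_nargs and demo_socketcall_arg_bytes are hypothetical, introduced only for this illustration; the range check mirrors what one would expect the kernel handler to do for call numbers outside 1..20, which is an assumption here.

```c
#include <errno.h>
#include <stddef.h>

/* Same values as the kernel table above, stored as argument counts
 * (the kernel stores byte sizes via the AL() macro). */
static const unsigned char demo_nargs[21] = {
    0, 3, 3, 3, 2, 3,
    3, 3, 4, 4, 4, 6,
    6, 2, 5, 5, 3, 3,
    4, 5, 4
};

/* Hypothetical helper: number of bytes to copy from the user parameter
 * block for a given socket call, or -EINVAL if the call number is not
 * in the range 1..20. */
static long demo_socketcall_arg_bytes(int call)
{
    if (call < 1 || call > 20)
        return -EINVAL;
    return (long)(demo_nargs[call] * sizeof(unsigned long));
}
```

For example, demo_socketcall_arg_bytes(11) gives the size of six unsigned long values, matching the six parameters of sendto().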
On entry to kernel space (int 0x80 or sysenter), the eax register carries the system call number (__NR_socketcall), and the socket call number follows as the first system call argument (ebx on i386). We can see the values of these constants in the user-space headers (<linux/net.h>):

    #define SYS_SOCKET      1       /* sys_socket(2) */
    #define SYS_BIND        2       /* sys_bind(2) */
    #define SYS_CONNECT     3       /* sys_connect(2) */
    #define SYS_LISTEN      4       /* sys_listen(2) */
    #define SYS_ACCEPT      5       /* sys_accept(2) */
    ...
    #define SYS_SENDMSG     16      /* sys_sendmsg(2) */
    #define SYS_RECVMSG     17      /* sys_recvmsg(2) */
    #define SYS_ACCEPT4     18      /* sys_accept4(2) */
    #define SYS_RECVMMSG    19      /* sys_recvmmsg(2) */
    #define SYS_SENDMMSG    20      /* sys_sendmmsg(2) */

By this point the processing scheme should already be clear:
- the required number of system call parameters is packed into an unsigned long array; the largest count (6) belongs to SYS_SENDTO = 11 (nargs[11]):

 ssize_t sendto( int sockfd, const void *buf, size_t len, int flags, const struct sockaddr *dest_addr, socklen_t addrlen );

- the address of the resulting array is passed as the 2nd parameter of the system call, and the 1st parameter is the socket call number (for example, SYS_SENDTO);
- all socket calls are processed by a single kernel handler, sys_socketcall() (__NR_socketcall = 102);
- the handler first copies the array of parameter values from user space and then, depending on the socket call number, copies the data areas that the pointer values in this array (possibly) refer to.
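The packing and unpacking steps listed above can be sketched in user space as follows. This is only an illustration of the idea, not the libc or kernel code: the names demo_pack_sendto, demo_unpack_sendto and struct demo_sendto_args are hypothetical, and DEMO_SYS_SENDTO simply repeats the value 11 quoted from <linux/net.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

#define DEMO_SYS_SENDTO 11   /* same value as SYS_SENDTO in <linux/net.h> */
#define DEMO_MAX_ARGS   6    /* size of the socketcall parameter block */

/* Typed view of the sendto() arguments (hypothetical helper type). */
struct demo_sendto_args {
    int sockfd;
    const void *buf;
    size_t len;
    int flags;
    const struct sockaddr *dest;
    socklen_t addrlen;
};

/* Caller side: every argument is reduced to unsigned long and stored
 * sequentially, losing its original type. */
static void demo_pack_sendto(unsigned long a[DEMO_MAX_ARGS], int sockfd,
                             const void *buf, size_t len, int flags,
                             const struct sockaddr *dest, socklen_t addrlen)
{
    a[0] = (unsigned long)sockfd;
    a[1] = (unsigned long)(uintptr_t)buf;
    a[2] = (unsigned long)len;
    a[3] = (unsigned long)flags;
    a[4] = (unsigned long)(uintptr_t)dest;
    a[5] = (unsigned long)addrlen;
}

/* Handler side: the call number decides how the untyped slots are
 * interpreted. Returns 0 on success, -1 for any other call number. */
static int demo_unpack_sendto(int call, const unsigned long a[DEMO_MAX_ARGS],
                              struct demo_sendto_args *out)
{
    if (call != DEMO_SYS_SENDTO)
        return -1;
    out->sockfd  = (int)a[0];
    out->buf     = (const void *)(uintptr_t)a[1];
    out->len     = (size_t)a[2];
    out->flags   = (int)a[3];
    out->dest    = (const struct sockaddr *)(uintptr_t)a[4];
    out->addrlen = (socklen_t)a[5];
    return 0;
}
```

The sketch makes the weak point visible: nothing but the call number ties a slot to its real type, so a mismatch between the two sides goes undetected by the compiler.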

Some newer architectures (as the original notes) do not use this indirect calling scheme and implement these calls the same way as all other system calls. This is the case, in particular, for x86_64 and ARM. As a result, even 64-bit and 32-bit applications (the latter running in emulation on an x86_64 system) follow different schemes. But let's not get distracted by this ...

To verify that socket calls are serviced differently on 32-bit and 64-bit systems, compare the system call definitions for the two modes in the user-space C header directory (<i386-linux-gnu/asm>):

    $ cat unistd_32.h | grep socketcall
    #define __NR_socketcall 102
    $ cat unistd_32.h | grep connect

    $ cat unistd_64.h | grep socketcall
    $ cat unistd_64.h | grep connect
    #define __NR_connect 42

In a 32-bit system there is a sys_socketcall() system call, but no separate system calls for the 20 socket calls. Conversely, a 64-bit system has no sys_socketcall() system call, but has a complete set of system calls, one for each of the 20 socket calls.
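The same comparison can be made at compile time instead of with grep. The sketch below is an illustration under the assumption that <sys/syscall.h> exposes the __NR_* constants of the target ABI; the function names are hypothetical. Note also that header sets newer than the ones shown in this article may define both constants on 32-bit x86, so the result depends on the headers actually installed.

```c
#include <sys/syscall.h>   /* pulls in the __NR_* constants for the target ABI */

/* Returns 1 if the target ABI has the multiplexed socketcall() entry. */
static int demo_has_socketcall(void)
{
#ifdef __NR_socketcall
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the target ABI has a direct connect() system call. */
static int demo_has_direct_connect(void)
{
#ifdef __NR_connect
    return 1;
#else
    return 0;
#endif
}
```

Compiling the same file with -m32 and -m64 (where a multilib toolchain is available) is one way to observe the two schemes side by side.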

The author himself closes with the following assessment: at first glance this technique seems rather ugly compared with modern object-oriented programming methods, but there is also a certain simplicity in it. It also stores data compactly, which improves the cache hit rate. The only problem is that the dispatch has to be done by hand, which makes it easy to shoot yourself in the foot.

Implementation

The possibility of intercepting network system calls will be illustrated with a mock-up of a distributed firewall (simplified as far as possible). At one time this idea was widely discussed as a firewall implementation for large and very large networks (especially in Cisco environments). There are many publications on the topic; two that give a complete picture of what is meant by a distributed firewall are Implementing a Distributed Firewall and Automated Implementation of Stateful Firewalls in Linux.

The proposal is not to control all TCP/IP traffic at the level of IP packets, but to enforce the rules on every host of a very large network, only for the TCP protocol and only at the moment a connection is established. Only 2 system calls come under control: accept() and connect(). A deeper discussion of distributed firewalls would take us far from our goals ... let's look only at how we could control these network calls.

To illustrate the interception of socket calls, a kernel module implementing such a network filter was written for the accept() and connect() calls. The module is deliberately minimal (truncated): it takes as parameters an IP address (the deny parameter) and a TCP port (the port parameter) with which connections should be refused, plus one more parameter, debug, which sets the level of diagnostic output.

Note: In the tested version, multiple forbidden IP addresses and TCP ports were allowed; they were stored in a circular list of type struct list_head (as is common in the kernel) and were added (or removed) by a separate application, a user-space policy daemon. A real in-kernel filter should work in some similar way, but that is too cumbersome for an article describing a principle, especially since the principle in question is not that of a firewall but that of working with network system calls. Even with all the simplifications, the code is still rather large.

So, the code of the example module:

    static int debug = 0;        // debug output level: 0, 1, 2
    module_param( debug, int, 0 );
    static char* deny;           // string parameter: denied IPv4
    module_param( deny, charp, 0 );
    static int port = 0;         // denied port
    module_param( port, int, 0 );

    static void **taddr;         // table sys_call_table address
    u32 ipdeny;                  // denied IP (host byte order)

    #include "find.c"
    #include "CR0.c"

    inline char* in4_ntoa( uint32_t ip ) {   // mapping IP to a string
       static char saddr[ MAX_ADDR_LEN ];
       sprintf( saddr, "%d.%d.%d.%d",
                ( ip >> 24 ) & 0xFF, ( ip >> 16 ) & 0xFF,
                ( ip >> 8 ) & 0xFF, ( ip ) & 0xFF );
       return saddr;
    }

    asmlinkage long (*old_sys_socketcall) ( int call, unsigned long __user *args );

    asmlinkage long new_sys_socketcall( int call, unsigned long __user *args ) {
    #define PARMS 3
       static unsigned long a[ PARMS ];   // accept() and connect() have the same number of parameters: 3
       static struct sockaddr sa;
       // ----------- nested functions are a GCC extension ---------
       long get_addr( void ) {
          const unsigned int len = PARMS * sizeof( unsigned long );
          if( copy_from_user( a, args, len ) )
             return -EFAULT;
          if( copy_from_user( &sa, (struct sockaddr __user*)a[ 1 ], sizeof( struct sockaddr ) ) )
             return -EFAULT;
          return 0;
       }
       // ----------------------------------------------------------
       long ret;
       if( SYS_ACCEPT == call ) {             // accept() before syscall
          long err;
          if( ( err = get_addr() ) < 0 ) return err;
          if( AF_INET == sa.sa_family ) {     // only IPv4
             struct sockaddr_in *usin = (struct sockaddr_in *)&sa;
             if( ntohs( usin->sin_port ) == port ) {
                LOG( "accept from denied port %d\n", ntohs( usin->sin_port ) );
                return -EIO;
             }
          }
       }
       if( SYS_CONNECT == call ) {            // connect() before syscall
          long err;
          if( ( err = get_addr() ) < 0 ) return err;
          if( AF_INET == sa.sa_family ) {     // only IPv4
             struct sockaddr_in *usin = (struct sockaddr_in *)&sa;
             DEB( "connect to %s:%d\n",
                  in4_ntoa( ntohl( usin->sin_addr.s_addr ) ), ntohs( usin->sin_port ) );
             if( ( deny != NULL && ntohl( usin->sin_addr.s_addr ) == ipdeny ) ||
                 ( port != 0 && ntohs( usin->sin_port ) == port ) ) {
                LOG( "connect to %s:%d denied\n",
                     in4_ntoa( ntohl( usin->sin_addr.s_addr ) ), ntohs( usin->sin_port ) );
                return -EACCES;
             }
          }
       }
       ret = old_sys_socketcall( call, args );   // pass on to the original sys_socketcall()
       if( SYS_ACCEPT == call ) {             // accept() after syscall
          long err;
          if( ( err = get_addr() ) < 0 ) return err;
          if( AF_INET == sa.sa_family ) {     // only IPv4
             struct sockaddr_in *usin = (struct sockaddr_in *)&sa;
             DEB( "accept from %s:%d\n",
                  in4_ntoa( ntohl( usin->sin_addr.s_addr ) ), ntohs( usin->sin_port ) );
             if( ( deny != NULL && ntohl( usin->sin_addr.s_addr ) == ipdeny ) ||
                 ( port != 0 && ntohs( usin->sin_port ) == port ) ) {
                LOG( "accept from %s:%d denied\n",
                     in4_ntoa( ntohl( usin->sin_addr.s_addr ) ), ntohs( usin->sin_port ) );
                return -EACCES;
             }
          }
       }
       return ret;
    }

    static int __init init( void ) {
       void *waddr;
       // ----------- nested functions are a GCC extension ---------
       int pos_in_table( const char *symbol ) {      // position in sys_call_table (__NR_*)
          const int last = __NR_process_vm_writev;   // near last syscall in i386
          int n;
          waddr = find_sym( symbol );
          if( NULL == waddr ) return -1;
          for( n = 0; n <= last; n++ )
             if( taddr[ n ] == waddr ) break;
          return n <= last ? n : -1;
       }
       // --------------------------------------------------------
       void show_in_table( char *symb ) {            // print info about symbol
          waddr = find_sym( symb );
          if( NULL == waddr ) {
             DEB( "symbol %s not found in kernel\n", symb );
          }
          else {
             int n = pos_in_table( symb );
             if( n > 0 )
                DEB( "symbol %s address = %p, position in sys_call_table = %d\n",
                     symb, waddr, n );
             else
                DEB( "symbol %s address = %p, not found in sys_call_table\n",
                     symb, waddr );
          }
       }
       // --------------------------------------------------------
       ipdeny = ntohl( deny != NULL ? in_aton( deny ) : in_aton( "0.0.0.0" ) );
       LOG( "denied IP: %s\n", deny != NULL ? in4_ntoa( ipdeny ) : "no" );
       if( port != 0 ) LOG( "denied TCP port: %d\n", port );
       if( NULL == ( taddr = find_sym( "sys_call_table" ) ) ) {
          ERR( "sys_call_table not found\n" );
          return -EINVAL;
       }
       DEB( "sys_call_table address = %p\n", taddr );
       show_in_table( "sys_accept" );
       show_in_table( "sys_connect" );
       show_in_table( "sys_socketcall" );            // only diagnostic
       old_sys_socketcall = (void*)taddr[ __NR_socketcall ];
       if( NULL == ( waddr = find_sym( "sys_socketcall" ) ) ) {   // sys_socketcall not exported
          ERR( "sys_socketcall not found\n" );
          return -EINVAL;
       }
       if( old_sys_socketcall != waddr ) {           // sanity check
          ERR( "Oooops! I don't understand: addresses not equal\n" );
          return -EINVAL;
       }
       if( debug ) show_cr0();
       rw_enable();
       taddr[ __NR_socketcall ] = new_sys_socketcall;
       if( debug ) show_cr0();
       rw_disable();
       if( debug ) show_cr0();
       LOG( "install new sys_socketcall handler: %p\n", &new_sys_socketcall );
       return 0;
    }

    static void __exit exit( void ) {
       LOG( "sys_socketcall handler before unload: %p\n", (void*)taddr[ __NR_socketcall ] );
       rw_enable();
       taddr[ __NR_socketcall ] = old_sys_socketcall;
       rw_disable();
       LOG( "restore old sys_socketcall handler: %p\n", (void*)taddr[ __NR_socketcall ] );
       return;
    }

    module_init( init );
    module_exit( exit );
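One detail of the listing worth a closer look is in4_ntoa(), which formats a host-byte-order address into a static buffer and is therefore not re-entrant. A minimal user-space sketch of a variant that writes into a caller-supplied buffer is shown below; the name demo_in4_ntoa and the constant DEMO_ADDR_LEN are hypothetical and are not taken from the module or its archive.

```c
#include <stdint.h>
#include <stdio.h>

#define DEMO_ADDR_LEN 16   /* "255.255.255.255" plus the terminating NUL */

/* Format an IPv4 address given in host byte order as dotted decimal.
 * Writes into the caller's buffer and returns the snprintf() result:
 * the length of the full text, which is >= len if it was truncated. */
static int demo_in4_ntoa(uint32_t ip, char *buf, size_t len)
{
    return snprintf(buf, len, "%u.%u.%u.%u",
                    (unsigned)((ip >> 24) & 0xFF),
                    (unsigned)((ip >> 16) & 0xFF),
                    (unsigned)((ip >> 8) & 0xFF),
                    (unsigned)(ip & 0xFF));
}
```

With a per-call buffer, two formatted addresses can appear in the same log statement without overwriting each other.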

The code is simplified as far as possible; things like the LOG() and ERR() diagnostic macros were already shown, in part, in the previous parts. The symbol lookup function (find_sym(), from find.c) has also been discussed. For writing to the write-protected area of sys_call_table there are at least 3-4 alternatives, all of which were named and referenced in the discussion in the previous part. Protection against unloading the module while system calls are being serviced, by incrementing the module's reference counter, is also not shown (it was mentioned in the previous part). All these details are present in the code in the attached archive. In addition, the code in the archive is generously sprinkled with comments containing excerpts from the kernel sources, with the names of the files in the kernel tree; these point to the data structures required.

And yet, even with all the simplifications, the code remains fairly cumbersome (not complicated, but cumbersome). It is possible, though, not to delve into the code itself; the sequence for processing the modified network system calls is as follows:

- take control of the sys_socketcall() system call (replace its handler);
- if the call code (the 1st sys_socketcall() parameter) is SYS_ACCEPT or SYS_CONNECT, copy the 3-element array of unsigned long parameters from user space (in the general case the array has up to 6 elements, as for SYS_SENDTO);
- the 2nd element of the array (the 2nd parameter of accept() or connect()), although it looks like an unsigned long, is a pointer to a struct sockaddr in the user address space; as the second step of accessing the parameters, copy that structure from the user address space;
- the structure holds the IP address and TCP port; if they are on the banned list, return an error code and cancel the operation, otherwise call the original system call handler;
- all other socket calls (the 18 that are neither SYS_ACCEPT nor SYS_CONNECT) are simply passed through to the original sys_socketcall();
- requests that do not belong to the IPv4 protocol are passed to the network stack unmodified.
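The decision step in this sequence (is the address on the banned list or not) can be tried out in user space. The following is a minimal sketch that mirrors the condition used in the module for a single rule; struct demo_rule and demo_is_denied are hypothetical names introduced for the illustration, and the rule keeps the denied address in host byte order, as the module does with ipdeny.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical single-rule policy: have_ip == 0 disables the address
 * check, deny_port == 0 disables the port check. */
struct demo_rule {
    int      have_ip;     /* non-zero if deny_ip is valid */
    uint32_t deny_ip;     /* denied IPv4 address, host byte order */
    uint16_t deny_port;   /* denied TCP port, host byte order */
};

/* Returns 1 if the address matches the rule, 0 otherwise.
 * Non-IPv4 addresses are never filtered, as in the module. */
static int demo_is_denied(const struct sockaddr *sa, const struct demo_rule *r)
{
    struct sockaddr_in sin;

    if (sa->sa_family != AF_INET)
        return 0;
    memcpy(&sin, sa, sizeof sin);   /* copy to avoid aliasing issues */
    if (r->have_ip && ntohl(sin.sin_addr.s_addr) == r->deny_ip)
        return 1;
    if (r->deny_port != 0 && ntohs(sin.sin_port) == r->deny_port)
        return 1;
    return 0;
}
```

Keeping the check in a pure function like this makes it easy to test the policy separately from the interception mechanics.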

Some extra complexity comes from the fact that a call to accept() has to be checked twice:

- the TCP port number, before the original system call, when the server starts listening on a socket that is not yet connected;
- the source IP address, after the connection has been established on the socket, that is, after the original system call returns.

How does it look at work? Something like this:

    $ sudo insmod fwnet.ko deny=192.168.56.101 port=10000 debug=1
    $ lsmod | head -n2
    Module                  Size  Used by
    fwnet                  13116  0
    $ dmesg | tail -n10
    [ 786.609568] ! denied IP: 192.168.56.101
    [ 786.609572] ! denied TCP port: 10000
    [ 786.613047] ! sys_call_table address = c15b4000
    [ 786.636336] ! symbol sys_accept address = c149a070, not found in sys_call_table
    [ 786.656437] ! symbol sys_connect address = c149a0a0, not found in sys_call_table
    [ 786.661444] ! symbol sys_socketcall address = c149acd0, position in sys_call_table = 102
    [ 786.663994] ! CR0 = 8005003b
    [ 786.664090] ! CR0 = 8004003b
    [ 786.664096] ! CR0 = 8005003b
    [ 786.664100] ! install new sys_socketcall handler: e1ad50d0

Naturally, to watch the kernel network filter in action we need a TCP client and server (for example, ncat). For detailed testing, however, a special relay server (tcpserv) and client (tcpcli) were prepared. Apart from a few small details tailored to this work, there is nothing special about them and they will not be discussed here (but they are in the attached archive).
Here are some attempts to establish prohibited TCP connections:

- Starting the server listening on the forbidden port:

    $ ./tcpserv -v -p10000
    listening on the TCP port 10000
    denied TCP port: Input/output error
    $ dmesg | tail -n5
    ...
    [11213.888556] ! accept before: port = 10000
    [11213.888562] ! accept from denied port 10000

- Attempt to connect a client to a forbidden port:

    $ ./tcpcli -v -h 127.0.0.1 -p 10000
    client: can't connect to server: Permission denied
    $ dmesg | tail -n5
    ...
    [10984.082051] ! connect to 127.0.0.1:10000
    [10984.082060] ! connect to 127.0.0.1:10000 denied
    [11166.236948] ! connect to 127.0.0.1:53
    ...

And so on: the task offers a wide and exciting field for experimentation ...

(The log above deliberately keeps and shows a connection to DNS on port 53 as well. In the same way, during the filtering experiments you can observe many connections to TCP port 80: HTTP traffic keeps flowing without disruption.)

It is important that after unloading the module, the system is restored to its original state:

    $ sudo rmmod fwnet
    $ dmesg | grep \! | tail -n2
    [ 2890.602419] ! sys_socketcall handler before unload: e1ad50d0
    [ 2890.602439] ! restore old sys_socketcall handler: c149acd0

Discussion

So this is the rather convoluted way in which network system calls are handled in Linux ... at least in 32-bit implementations. When you first encounter these system calls, the way they work is somewhat discouraging.

This part of the discussion turned out long and dull, but an artifact like the way these system calls work is something you need to know about and take into account.

A small code archive (and an extensive test log) for experiments can be found here or here .

Source: https://habr.com/ru/post/268145/
