Containers and safety: seccomp

Potentially dangerous, untested, or simply “raw” programs are often run in so-called sandboxes: specially allocated environments with severe restrictions. A program running in a sandbox usually has very limited access to the network, to the operating system on the host machine, and to I/O devices.

Recently, containers have been increasingly used to run untested and insecure programs.

But a container, despite the many features it shares with a sandbox, is not a full equivalent of one, if only because sandboxes are usually tailored to specific applications, while containerization is a more general-purpose technology. An application running in a container can still reach the kernel and compromise it. That is why modern containerization tools include additional security mechanisms. In this article we look at one of them: seccomp.
First, we will examine the principles of how seccomp works, and then we will demonstrate how it is used in Docker.

Seccomp: first acquaintance

Seccomp (short for secure computing) is a Linux kernel mechanism that lets a process restrict the set of system calls it is allowed to make. If an attacker gains the ability to execute arbitrary code in that process, seccomp prevents them from using system calls that were not declared in advance.

Seccomp has been part of the Linux kernel since 2005; its filter mode, covered below, was later contributed by Google engineers. It is used, in particular, in the Google Chrome browser to sandbox plugins.

To activate seccomp, use the prctl() system call.

Let's see how it works, using a simple program as an example:

    #include <stdio.h>
    #include <unistd.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>

    int main()
    {
        pid_t pid;

        printf("Step 1: no restrictions yet\n");

        /* Enter strict mode: from here on only a minimal set of syscalls is allowed. */
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
        printf("Step 2: entering the strict mode. Only read(), write(), exit() and sigreturn() syscalls are allowed\n");

        /* getpid() is not on the allowed list, so the kernel kills the process here. */
        pid = getpid();
        printf("!!YOU SHOULD NOT SEE THIS!! My PID = %d", pid);

        return 0;
    }

Save this program as seccomp1.c, compile and run:

    $ gcc seccomp1.c -o seccomp1
    $ ./seccomp1

We will see the following output on the console:

    Step 1: no restrictions yet
    Step 2: entering the strict mode. Only read(), write(), exit() and sigreturn() syscalls are allowed
    Killed

To see exactly where this output comes from, let's use strace:

    $ strace ./seccomp1
    ...
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
    write(1, "Step 2: entering the strict mode"..., 100Step 2: entering the strict mode. Only read(), write(), exit() and sigreturn() syscalls are allowed
    ) = 100
    +++ killed by SIGKILL +++
    Killed

So what happened? Using the prctl() system call, we activated seccomp in strict mode. After that, our program tried to find out the PID of the current process with the getpid() system call, but the restrictions did not allow it: the process received a SIGKILL signal and was terminated immediately.
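The same behaviour can also be observed from a parent process instead of through strace. The following is a minimal sketch, not code from the original program: the helper name strict_mode_child_signal is hypothetical, and it calls getpid through syscall() so that the C library cannot answer from a cached value.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper (not part of the article's program): run a forbidden
 * system call in a child process that has entered strict mode.
 * Returns the number of the signal that killed the child, 0 if the child
 * exited normally, or -1 if fork or the seccomp setup failed.
 */
int strict_mode_child_signal(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(42);           /* strict mode is not available here */
        syscall(SYS_getpid);     /* not allowed in strict mode */
        _exit(0);                /* not reached when strict mode works */
    }

    int status = 0;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status); /* signal that terminated the child */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 42)
        return -1;
    return 0;
}
```

On a system where strict mode is available, the return value should equal SIGKILL, which matches the "killed by SIGKILL" line in the strace output.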

As you can see, seccomp does its job well. But strict mode is inconvenient because it gives no way to choose which system calls to allow and which to forbid. To solve this problem, we can use the BPF (Berkeley Packet Filter) mechanism.

Seccomp and BPF Filters

The BPF mechanism was originally created to filter network packets, but its scope has since expanded significantly. Today BPF is used, for example, to trace the Linux kernel (Brendan Gregg's blog has interesting posts on this topic). In 2012 BPF was integrated with seccomp; the resulting extended version is called seccomp-bpf.
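To give a feel for what a seccomp-bpf filter looks like at the kernel interface level, here is a minimal sketch that installs a hand-written filter with prctl(). It is an illustration under stated assumptions, not the article's method: the helper name getpid_errno_under_filter is hypothetical, the filter returns an error for getpid instead of killing the process, and the architecture check that a production filter needs is left out for brevity.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: in a child process, install a filter that makes
 * getpid fail with EPERM and allows every other system call.
 * Returns the errno the child saw for getpid (EPERM when the filter works),
 * 0 if getpid unexpectedly succeeded, or -1 if the setup failed.
 */
int getpid_errno_under_filter(void)
{
    struct sock_filter insns[] = {
        /* Load the system call number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* getpid: continue with the next instruction; anything else: skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* Simplification: a real filter must also check seccomp_data.arch. */
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* An unprivileged process must set no_new_privs before loading a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        errno = 0;
        long result = syscall(SYS_getpid);
        _exit(result == -1 ? errno : 0);
    }

    int status = 0;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) == 255)
        return -1;
    return WEXITSTATUS(status);
}
```

Even this four-instruction filter requires manual jump offsets and knowledge of the seccomp_data layout, which is the kind of detail a higher-level library takes care of.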

Writing BPF programs by hand is quite complicated. We will not discuss the details of BPF syntax (that topic goes far beyond the scope of this article); instead we will use the libseccomp library, which provides a simple and convenient API for filtering system calls.

It is installed using the standard package manager:

 $ sudo apt-get install libseccomp-dev

Now let's try to write a small program:

    #include <stdio.h>
    #include <seccomp.h>
    #include <unistd.h>

    int main()
    {
        pid_t pid;

        /* Default action: kill the process on any system call not listed below. */
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

        /* The system calls the process is allowed to make. */
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sigreturn), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

        printf("No restrictions yet\n");

        /* Load the filter into the kernel; the restrictions apply from here on. */
        seccomp_load(ctx);

        pid = getpid();
        printf("!! YOU SHOULD NOT SEE THIS!! My PID is %d\n", pid);

        return 0;
    }

Let's comment on this code line by line.

 scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

Here we initialize the filter and specify the default action. In our case it is SCMP_ACT_KILL: a process that makes a forbidden system call is killed immediately.

Next are the seccomp rules; in them we specify the system calls that our process will be allowed to perform:

    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sigreturn), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

Next we activate the rules:

 seccomp_load(ctx);

As in the previous example, the program then tries to print the PID of the current process. Will it succeed?

Compile and run the program:

    $ gcc -o seccomp2 seccomp2.c -lseccomp
    $ ./seccomp2

We will see the following output:

    No restrictions yet
    Bad system call

What happened during the execution of this program? As in the previous case, strace will help us to answer this question:

    $ strace ./seccomp2
    ...
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 9, filter = 0x1ef5fe0}) = 0
    +++ killed by SIGSYS +++

We see that the filter worked: the process attempted the getpid system call, which the rules prohibit, and was killed immediately.

To better understand how seccomp filters work, it is useful to replace SCMP_ACT_KILL with SCMP_ACT_TRAP as the default action in the code:

 scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);

The output of strace will be much more detailed:

    $ strace ./seccomp2
    ...
    syscall_18446744073709551615(0xffffffff, 0x7feb8c47ab28, 0, 0x22b, 0x130c0c0, 0) = 0x27
    --- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x7feb8c18366f, si_syscall=__NR_getpid, si_arch=AUDIT_ARCH_X86_64} ---
    +++ killed by SIGSYS +++

In our case (Ubuntu 16.04, kernel 4.4), the output directly names the forbidden system call whose attempted execution stopped the process: si_syscall=__NR_getpid.
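The si_syscall field that strace prints is also available to the process itself if it installs a handler for SIGSYS. The sketch below illustrates this under a few assumptions: it uses a hand-written filter loaded with prctl() rather than libseccomp so that it has no library dependency, it traps only getpid and allows everything else, and the names on_sigsys and sigsys_reports_getpid are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* SIGSYS handler: exit code 0 if the trapped call was getpid, 1 otherwise. */
static void on_sigsys(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    _exit(info->si_syscall == SYS_getpid ? 0 : 1);
}

/*
 * Hypothetical helper: returns 1 if a child process received SIGSYS with
 * si_syscall set to getpid, 0 if it did not, or -1 if the setup failed.
 */
int sigsys_reports_getpid(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),   /* getpid: send SIGSYS */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),  /* everything else */
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_sigsys;
        sa.sa_flags = SA_SIGINFO;  /* needed to receive siginfo_t */
        if (sigaction(SIGSYS, &sa, NULL) != 0 ||
            prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        syscall(SYS_getpid);       /* trapped: the kernel delivers SIGSYS */
        _exit(2);                  /* reached only if no trap happened */
    }

    int status = 0;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) == 255)
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Without a handler, the default action for SIGSYS terminates the process, which is what the "killed by SIGSYS" line in the strace output shows.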

On other distributions and kernel versions, the output may contain not the name of the system call but its number, as defined in asm/unistd.h.
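The mapping between names and numbers is easy to inspect from C. The following short sketch uses the SYS_getpid constant from sys/syscall.h, which carries the same value as __NR_getpid; the two function names are hypothetical. On x86_64 the number is 39, that is 0x27, the value visible at the end of the syscall line in the strace output above.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Number the kernel uses for getpid on the architecture we compile for. */
long getpid_syscall_number(void)
{
    return SYS_getpid;  /* same value as __NR_getpid from <asm/unistd.h> */
}

/* Invoke getpid by number, bypassing the C library wrapper. */
long getpid_by_number(void)
{
    return syscall(getpid_syscall_number());
}
```

Printing getpid_syscall_number() on the target machine is a quick way to decode a numeric si_syscall value; the number differs between architectures, so it should not be hard-coded.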

Seccomp in Docker

The previous sections covered the basic principles of seccomp. Now let's use Docker as an example of how seccomp is applied in real containerization tools.

Seccomp profiles for containers first appeared in runc, which we have already written about.

In the Docker Engine, they have been added since version 1.10.

By default, 44 system calls are blocked in all Docker containers (out of several hundred available on modern 64-bit Linux systems). One of them is reboot(): it is hard to imagine a situation in which a container would need to reboot the OS on the host machine.

Another good example is the keyctl() system call, in which a vulnerability was discovered not long ago (CVE-2016-0728). Docker now blocks it by default.

Default seccomp profiles are a welcome innovation, useful if only because they limit an intruder's options and reduce the likelihood of a successful attack. But this is clearly not enough: many of the calls that remain unblocked have had vulnerabilities, and for obvious reasons it is impossible to ban every potentially dangerous call.

That is why Docker also lets you filter system calls yourself. Filters are written in configuration files in JSON format.

Let's give a simple example:

    {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "name": "chmod",
                "action": "SCMP_ACT_ERRNO"
            }
        ]
    }

The structure mirrors the code examples above. First we specify the default action. Then we list the system calls that need special treatment, together with the action to take when one of them is made. Here SCMP_ACT_ERRNO makes the call fail with an error instead of killing the process.

Save this file as chmod.json and try to start a container with the seccomp settings specified above:

    $ docker run --security-opt seccomp:chmod.json busybox chmod 400 /etc/hostname
    chmod: /etc/hostname: Operation not permitted

As you can see, the filter worked in accordance with the formulated rules: the forbidden chmod system call was blocked.
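A process can also check whether it is running under a seccomp filter by reading the Seccomp field that the kernel exposes in /proc/self/status (0 means disabled, 1 strict mode, 2 filter mode). The sketch below is only an illustration; the function name current_seccomp_mode is hypothetical. Inside a container started with a seccomp profile it would be expected to report 2.

```c
#include <stdio.h>

/*
 * Returns the seccomp mode of the calling process as reported by the kernel:
 * 0 = disabled, 1 = strict, 2 = filter; -1 if the field cannot be read.
 */
int current_seccomp_mode(void)
{
    FILE *status = fopen("/proc/self/status", "r");
    if (status == NULL)
        return -1;

    char line[256];
    int mode = -1;
    while (fgets(line, sizeof(line), status) != NULL) {
        /* Matches the "Seccomp:" line only, not "Seccomp_filters:". */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }

    fclose(status);
    return mode;
}
```

The same value can be read for any other process through /proc/PID/status, which is convenient for checking from the host whether a container's processes are filtered.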

Conclusion

In this article, we described how seccomp works and how it is used in Docker. Questions, comments, and suggestions are welcome in the comments.

In conclusion, as usual, here are some useful links for those who want to learn more:

http://blog.viraptor.info/tag/seccomp.html - a good introduction to the topic;
https://blog.yadutaf.fr/2014/05/29/introduction-to-seccomp-bpf-linux-syscall-filter/ - another interesting introductory publication;
https://eigenstate.org/notes/seccomp - a detailed article about seccomp filters;
https://lwn.net/Articles/494252/ - an article about libseccomp;
http://events.linuxfoundation.org/sites/events/files/slides/limiting_kernel_attack_surface_with_seccomp-ContainerCon.eu_2016-Kerrisk.pdf - slides from a very detailed and interesting talk about seccomp;
https://docs.docker.com/engine/security/seccomp/ - documentation on using seccomp in Docker;
https://coreos.com/rkt/docs/latest/seccomp-guide.html - documentation on using seccomp in rkt containers.

Source: https://habr.com/ru/post/322046/
