Isolating daemons with systemd, or “you don't need Docker for this!”

Lately I see a fairly large number of people using container virtualization solely to lock a potentially unsafe application inside a container. As a rule, Docker is used for this because it is so widespread and people do not know of anything better. Indeed, many daemons start as root and then either drop their privileges, or the master process spawns worker processes with lower privileges. Some run exclusively as root. If such a daemon turns out to have a vulnerability that grants access with maximum privileges, it is not pleasant to discover intruders who have already downloaded all your data and left malware behind.
Containerization as provided by Docker and similar software really does solve this problem, but it also introduces new ones: you have to build a container for each daemon, look after the files that change inside it, and keep the base image updated, and the containers themselves are often based on different operating systems that have to be stored on disk even though you do not really need them. What if you do not need containers as such, the application on Docker Hub is not built the way you want and is outdated, SELinux and AppArmor seem too complicated, and you would like to run the application in your own environment, but with the same isolation that Docker uses?

Capabilities

What is the difference between a regular user and root? Why can root manage the network, load kernel modules, mount file systems and kill any user's processes, while an ordinary user cannot? It all comes down to capabilities, a mechanism for managing privileges. By default all of these privileges are given to the user with UID 0 (i.e. root), while a regular user has none of them. Privileges can be both granted and taken away. For example, the ordinary ping command needs to create a RAW socket, which a regular user is not allowed to do. Historically, ping had the SUID bit set, which simply ran the program as the superuser, but modern distributions now set the CAP_NET_RAW capability on it instead, which lets ping run under any account.
You can list the capabilities set on a file with the getcap command from libcap.

 % getcap $(which ping)
 /usr/bin/ping = cap_net_raw+ep

The p flag here means permitted, i.e. the application is allowed to use the specified capability; e means effective, i.e. the application will actually use it; and there is also the i flag, inheritable, which lets the capability list be preserved across a call to execve().
Capabilities can be set on a file at the filesystem level, or on an individual thread of a program. A capability that was not available at launch cannot be acquired later; in other words, privileges can only be lowered, never raised.
There are also secure bits (securebits), three of them: KEEP_CAPS lets a process keep its capabilities when it calls setuid, NO_SETUID_FIXUP stops the kernel from adjusting capabilities when setuid is called, and NOROOT prohibits granting additional privileges when SUID programs are run.
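To see which capability sets a running process actually holds, the kernel exposes them as hexadecimal bit masks in the Cap* lines of /proc/<pid>/status. The following is a minimal sketch that decodes such a mask into names; the function names decode_caps and read_cap_sets are made up for this illustration, and the name table only covers the bit numbers from linux/capability.h up to CAP_AUDIT_READ.

```python
# Capability names indexed by bit number, as in <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Bits newer than this table are reported by number.
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append("bit_%d" % bit)
        bit += 1
    return names


def read_cap_sets(status_text):
    """Parse the Cap* lines of /proc/<pid>/status into {set: [names]}."""
    sets = {}
    for line in status_text.splitlines():
        if line.startswith("Cap"):
            key, value = line.split(":", 1)
            sets[key] = decode_caps(int(value.strip(), 16))
    return sets
```

On a Linux machine, read_cap_sets(open("/proc/self/status").read()) would return the inheritable, permitted, effective and bounding sets of the calling process under the keys CapInh, CapPrm, CapEff and CapBnd.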

Namespaces

Placing an application in its own namespaces is another feature of the Linux kernel. Separate namespaces can be set for:

File system
UTS (hostname)
System V IPC (interprocess communication)
Network
PID
Users

If we place an application in, for example, a separate network namespace, it will not be able to see the network adapters that are visible from the host. The same can be done with the file system.
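One way to check which namespaces a process lives in is to read the symbolic links under /proc/<pid>/ns: two processes share a namespace when the corresponding links point to the same identifier. The sketch below assumes a Linux system with /proc mounted; list_namespaces and shared_namespaces are hypothetical helper names, not part of any library.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace type (e.g. 'net', 'mnt') to its identifier string."""
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    if not os.path.isdir(ns_dir):
        # Not Linux, /proc not mounted, or no such process.
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symlink whose target names the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # No permission to inspect this entry, or the process exited.
            result[name] = None
    return result


def shared_namespaces(pid_a, pid_b):
    """Return the namespace types that two processes have in common."""
    a = list_namespaces(pid_a)
    b = list_namespaces(pid_b)
    return sorted(n for n in a if a[n] is not None and a[n] == b.get(n))
```

Run as root, shared_namespaces("self", some_pid) for a daemon started with PrivateNetwork=true would be expected to leave out net, since that daemon gets its own network namespace.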

systemd

Fortunately, systemd supports everything needed to isolate applications and restrict their rights.
We will use these features, but first let's think a little about which rights our application actually needs.
So, what kinds of daemons are there? Some do not need superuser rights at all and use them only to listen on a port below 1024. It is enough to give such programs the CAP_NET_BIND_SERVICE capability, which lets them listen on any port without restriction, and to launch them as an unprivileged user from the start. A capability can be set on a file with the setcap command. Our experimental “service” will be ncat from nmap, handing out shell access to anyone who wants it; you could hardly think of anything worse:

 % sudo setcap CAP_NET_BIND_SERVICE=ep /usr/bin/ncat
 % getcap /usr/bin/ncat
 /usr/bin/ncat = cap_net_bind_service+ep

Now let's write the simplest possible systemd unit, which runs ncat with the necessary parameters on port 81 as the user nobody:

 [Unit]
 Description=Vuln

 [Service]
 User=nobody
 ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1

Save it as /etc/systemd/system/vuln.service and start it with the usual sudo systemctl start vuln.
Connect to it:

 % ncat ::1 81
 whoami
 nobody

Works great!
Now it is time to protect our service. For this, systemd offers the following directives:

CapabilityBoundingSet= - manages capabilities. It sets only those passed in this parameter or, if the first character is a tilde “~”, removes the ones passed.
SecureBits= - sets the secure bits.
Capabilities= - also manages capabilities, but capabilities set on the file at the filesystem level take precedence over it, so it is practically useless.
ReadWriteDirectories=, ReadOnlyDirectories=, InaccessibleDirectories= - control the file system namespace. They remount file systems inside the daemon's namespace so that the specified directories are readable and writable, read-only, or inaccessible (they appear empty). Newer systemd releases call these ReadWritePaths=, ReadOnlyPaths= and InaccessiblePaths=.
PrivateTmp= - remounts /tmp and /var/tmp as private instances inside the namespace.
PrivateDevices= - takes away access to the devices in /dev, leaving only standard devices such as /dev/null, /dev/zero, /dev/random and others.
PrivateNetwork= - creates an empty network namespace with a single lo interface.
ProtectSystem= - mounts /usr and /boot read-only and, when given the argument “full”, does the same with /etc.
ProtectHome= - makes the /home, /root and /run/user directories inaccessible, or remounts them read-only when given the argument “read-only”.
NoNewPrivileges= - ensures that the application cannot gain additional privileges. According to the authors, it is more powerful than the corresponding capability settings.
SystemCallFilter= - filters system calls using seccomp. More on this later.

Let's rewrite our unit file using these options:

 [Unit]
 Description=Vuln

 [Service]
 User=nobody
 ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
 CapabilityBoundingSet=CAP_NET_BIND_SERVICE
 InaccessibleDirectories=/sys
 PrivateTmp=true
 PrivateDevices=true
 ProtectHome=true
 ProtectSystem=full

So, we gave our application a single capability, CAP_NET_BIND_SERVICE, created separate /tmp and /var/tmp, took away access to devices and home directories, remounted /usr, /boot and /etc read-only, and blocked /sys, since a typical daemon is unlikely to go in there. And all of this runs as an unprivileged user.
Note that with this CapabilityBoundingSet even SUID applications such as su or sudo gain no additional capabilities, so we cannot get access as another user or as root even if we know their passwords: the kernel will not allow the setuid and setgid calls.

 % ncat ::1 81
 python -c 'import pty; pty.spawn("/bin/bash")'  # spawn a pty, needed for sudo and su
 [nobody@valaptop /]$ sudo -i  # setuid() and setgid() are not allowed
 sudo: unable to change to root gid: Operation not permitted
 sudo: unable to initialize policy plugin
 [nobody@valaptop /]$ ping  # no cap_net_raw capability
 bash: /usr/sbin/ping: Operation not permitted
 [nobody@valaptop /]$ cd /home
 bash: cd: /home: Permission denied
 [nobody@valaptop /]$ ls -lad /home
 d--------- 2 root root 40 Nov 3 11:46 /home
 [nobody@valaptop tmp]$ ls -la /tmp
 total 4
 drwxrwxrwt 2 root root 40 Nov 5 00:31 .
 drwxr-xr-x 19 root root 4096 Nov 3 22:28 ..

Now consider the second type of daemon: those that start as root and then drop their privileges. This approach serves several purposes: reading confidential files that only the superuser can access (for example, the private key a web server uses for TLS), writing logs that remain out of reach if the non-root fork is compromised, and simply switching to arbitrary UIDs (SSH servers, FTP servers). If such programs are not isolated, the worst case is that an attacker gets full superuser access. Although root stripped of its usual capabilities is almost an ordinary unprivileged user, root is still root, with a pile of files that belong to it and that it can read. So we additionally need to make inaccessible the directories holding keys and configuration files that the daemon should not read:

 [Unit]
 Description=Vuln

 [Service]
 ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
 CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SETUID CAP_SETGID
 NoNewPrivileges=yes
 InaccessibleDirectories=/sys
 InaccessibleDirectories=/etc/openvpn
 InaccessibleDirectories=/etc/strongswan
 InaccessibleDirectories=/etc/nginx
 ReadOnlyDirectories=/proc
 PrivateTmp=true
 PrivateDevices=true
 ProtectHome=true
 ProtectSystem=full

Here we added the CAP_SETUID and CAP_SETGID capabilities so that our daemon can drop privileges, used NoNewPrivileges so that it cannot raise them, blocked access to directories it should not read, and made /proc read-only so that it cannot use sysctl. You can also mount the entire root file system read-only at once and grant write access only to the directories the program uses.
Check the permissions of /etc/shadow separately. In modern distributions it is not readable even by root; the CAP_DAC_OVERRIDE capability, which allows access permissions to be ignored, is used to work with it.

 % ls -la /etc/shadow
 ---------- 1 root root 1214  3 19:57 /etc/shadow

Check our settings!

 python -c 'import pty; pty.spawn("/bin/bash")'  # spawn a pty
 [root@valaptop /]# whoami
 root
 [root@valaptop /]# ping  # no cap_net_raw capability
 bash: /usr/sbin/ping: Operation not permitted
 [root@valaptop /]# cat /etc/shadow  # no CAP_DAC_OVERRIDE
 cat: /etc/shadow: Permission denied
 [root@valaptop /]# cd /etc/openvpn
 bash: cd: /etc/openvpn: Permission denied
 [root@valaptop /]# /suid  # SUID shell
 [root@valaptop /]# cat /etc/shadow  # still no access, even from the SUID shell
 cat: /etc/shadow: Permission denied

Unfortunately, systemd does not (yet) know how to work with the PID namespace, so our root daemon can kill other programs that run as root.
In principle we could stop here: the capability and namespace settings do a good job of isolating applications. But there is one more thing that would be great to configure.

seccomp

The seccomp technology forbids a program from making certain system calls and kills it immediately when it tries. Although seccomp appeared long ago, in 2005, it only came into real use relatively recently, with the release of Chrome 20, vsftpd 3.0 and OpenSSH 6.0.
There are two approaches to using seccomp: a blacklist and a whitelist. Compiling a blacklist of potentially dangerous calls is noticeably easier than compiling a whitelist, so this approach is used more often. By default, the firejail project forbids programs from making the following syscalls (the leading tilde turns on blacklist mode):

 SystemCallFilter=~mount umount2 ptrace kexec_load open_by_handle_at init_module \
     finit_module delete_module iopl ioperm swapon swapoff \
     syslog process_vm_readv process_vm_writev \
     sysfs _sysctl adjtimex clock_adjtime lookup_dcookie \
     perf_event_open fanotify_init kcmp add_key request_key \
     keyctl uselib acct modify_ldt pivot_root io_setup \
     io_destroy io_getevents io_submit io_cancel \
     remap_file_pages mbind get_mempolicy set_mempolicy \
     migrate_pages move_pages vmsplice perf_event_open

In systemd up to and including version 227 there is a bug that requires setting NoNewPrivileges=true in order to use seccomp.
A whitelist can be built as follows:

Run the required program under strace:

 % strace -qcf nginx

We get a large table of syscalls:

 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
   0.00    0.000000           0        24           read
   0.00    0.000000           0        27           open
   0.00    0.000000           0        32           close
   0.00    0.000000           0         6           stat
 …
   0.00    0.000000           0         1           set_tid_address
   0.00    0.000000           0         4           epoll_ctl
   0.00    0.000000           0         3           set_robust_list
   0.00    0.000000           0         2           eventfd2
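Copying the syscall column out of such a table by hand is tedious, so it can be scripted. The sketch below assumes the summary was saved to a file, for example with strace's -o option, and that it has the usual layout of a header, a dashed rule, one row per syscall and a closing dashed rule; the function names are invented for this example.

```python
def syscalls_from_strace_summary(summary):
    """Extract syscall names from the table printed by `strace -c`."""
    names = []
    in_table = False
    for line in summary.splitlines():
        fields = line.split()
        if not fields:
            continue
        if set(fields[0]) == {"-"}:
            # The first dashed rule opens the table, the second closes it,
            # which also skips the final "total" row.
            if in_table:
                break
            in_table = True
            continue
        if in_table and fields[-1] not in names:
            # The syscall name is always the last column.
            names.append(fields[-1])
    return names


def system_call_filter(names):
    """Build a whitelist-style SystemCallFilter= line for a unit file."""
    return "SystemCallFilter=" + " ".join(names)
```

The resulting line is only a starting point for the whitelist: it contains just the calls strace happened to observe during that run.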

Copy them all into SystemCallFilter. Most likely your application will crash, because strace did not catch all of the calls. Look in the audit daemon logs to see which call the application was killed on:
```
 type=SECCOMP msg=audit(1446730375.597:7943724): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=11915 comm="(nginx)" exe="/usr/lib/systemd/systemd" sig=31 arch=40000003 syscall=191 compat=0 ip=0xb75e5be8 code=0x0 
```
The syscall number we need is 191. Open the syscall table and look up the name of this call by its number.
Add it to the allowed calls. If the application crashes again, go back to the audit log and repeat.
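Since this loop may have to be repeated several times, it helps to pull the interesting fields out of the audit record automatically. The following sketch parses a type=SECCOMP line like the one shown above; the helper names are hypothetical, and the architecture table only lists the two AUDIT_ARCH values for 32-bit and 64-bit x86. The arch field matters because syscall numbers differ between architectures.

```python
import re

# AUDIT_ARCH_* constants from <linux/audit.h> (x86 only, for illustration).
AUDIT_ARCHES = {
    0x40000003: "i386",
    0xC000003E: "x86_64",
}


def parse_audit_record(line):
    """Return the key=value fields of an audit record as a dict of strings."""
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', line)
    return {key: value.strip('"') for key, value in pairs}


def describe_violation(line):
    """Summarise a SECCOMP record, or return None for other record types."""
    rec = parse_audit_record(line)
    if rec.get("type") != "SECCOMP":
        return None
    arch = int(rec["arch"], 16)  # the arch field is hexadecimal without 0x
    return {
        "comm": rec["comm"],
        "syscall": int(rec["syscall"]),
        "arch": AUDIT_ARCHES.get(arch, hex(arch)),
    }
```

For the record above this reports syscall 191 on i386, so the name has to be looked up in the 32-bit x86 syscall table; if the audit userspace tools are installed, the ausyscall utility should be able to do that lookup.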

Tips & Tricks

You can check the current privileges, and whether they can be raised, with the captest command.
filecap lists the files that have capabilities set.
netcap lists running network programs that have at least one socket and one capability, while pscap does the same for all running software, not just network programs.
You do not have to edit the whole systemd unit and track its changes across updates; instead, add the necessary directives through systemctl edit.
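What systemctl edit produces is a small drop-in file containing only the overriding directives (by default an override.conf in a directory such as /etc/systemd/system/vuln.service.d/). As a rough sketch, such a drop-in could also be generated from a list of hardening options; render_dropin is a hypothetical helper, not a systemd API.

```python
def render_dropin(directives):
    """Render a [Service] drop-in from a list of (name, value) pairs.

    A list of pairs is used rather than a dict because a directive such as
    InaccessibleDirectories= may legitimately appear several times.
    """
    lines = ["[Service]"]
    for name, value in directives:
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append("%s=%s" % (name, value))
    return "\n".join(lines) + "\n"


# The hardening options used for the first unit in this article.
HARDENING = [
    ("CapabilityBoundingSet", "CAP_NET_BIND_SERVICE"),
    ("InaccessibleDirectories", "/sys"),
    ("PrivateTmp", True),
    ("PrivateDevices", True),
    ("ProtectHome", True),
    ("ProtectSystem", "full"),
]
```

Printing render_dropin(HARDENING) gives text that can be pasted into the editor opened by systemctl edit, leaving the packaged unit file itself untouched.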

Source: https://habr.com/ru/post/270165/
