
Containerization on Linux in detail - LXC and OpenVZ. Part 1

Hello!

In the previous article we started talking about the advantages of container isolation (containerization); now I would like to delve into the technical aspects of how containers are implemented.



Linux Containerization


For Linux, there are several container implementations; in this article we will be dealing with two of them:
  • Linux upstream containers (namespaces and cgroups)
  • OpenVZ

Wait, where is LXC, and what are these "Linux upstream containers" with namespaces and cgroups? If we are discussing the Linux kernel subsystem responsible for isolating sets of processes from each other, then calling it LXC is categorically wrong, because LXC is a proper name: the name of a utility set that merely manages the corresponding Linux kernel subsystems, not of the mechanism that provides containers in the kernel!
The mechanisms that enable containers inside the kernel are called namespaces (here is a good overview of this functionality). In addition, limiting the various resources of a container (the amount of memory used, the load on the disk, the load on the CPU) is the job of another kernel mechanism, cgroups (overview). In turn, I decided to combine both of these technologies under the name "Linux upstream containers", that is, containers from the current version of the Linux kernel from kernel.org.
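To make this more concrete, here is a minimal sketch of how a cgroup is driven from user space on kernels of that era (cgroup v1): a cgroup is just a directory in a special filesystem, and limits are ordinary files you write to. The mount point /sys/fs/cgroup/memory, the group name "demo" and the 256 MB cap are assumptions for illustration; the exact layout varies between distributions.

/* A sketch of limiting memory for a group of processes via cgroup v1.
   Assumes the memory controller is mounted at /sys/fs/cgroup/memory
   (distribution-dependent); build with gcc, run as root. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(value, f);
    fclose(f);
}

int main(void) {
    /* Creating a cgroup is just creating a directory. */
    mkdir("/sys/fs/cgroup/memory/demo", 0755);

    /* Cap the group at 256 MB. */
    write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes",
               "268435456");

    /* Move ourselves into the group; all children inherit the cap. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_file("/sys/fs/cgroup/memory/demo/tasks", pid);

    /* Everything started from this shell is now limited as a group,
       without affecting the rest of the server. */
    execlp("/bin/sh", "sh", NULL);
    perror("execlp");
    return 1;
}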

If we talk about user space, packages such as LXC, systemd-nspawn and even vzctl (a utility from the OpenVZ project) are engaged exclusively in creating/deleting namespaces and configuring cgroups for them.
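For illustration, here is a minimal sketch of that lowest level (assuming a Linux host and root privileges): a single unshare(2) call detaches the process from the host's UTS namespace, after which the hostname can be changed without the host noticing.

/* The smallest possible "container": a private UTS namespace.
   Build with gcc, run as root. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Detach from the host's UTS namespace. */
    if (unshare(CLONE_NEWUTS) != 0) { perror("unshare"); return 1; }

    /* Visible only inside the new namespace; the host keeps its name. */
    sethostname("container", 9);

    execlp("hostname", "hostname", NULL);  /* prints "container" */
    perror("execlp");
    return 1;
}

Tools like vzctl or lxc-start do essentially this kind of work, only for the full set of namespaces and with cgroup configuration on top.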


What is OpenVZ?


If everything is more or less clear with Linux upstream containers, since they are included in the Linux kernel, then OpenVZ requires an explanation. OpenVZ is an open-source project developed by Parallels (formerly known as SWsoft) since 2005 (for more than 8 years now), engaged in the development and support of production-ready containerization solutions based on the Linux kernel (the patches are applied to RHEL kernels).

The project consists of two parts: a Linux kernel with patches from Parallels (vzkernel), and a set of utilities for managing it (vzctl, vzlist, etc.). Since the modified OpenVZ kernel is built on the basis of the Red Hat Enterprise Linux kernel, it is logical that the recommended OS for OpenVZ is Red Hat or CentOS. A lot of work has also been done recently to provide OpenVZ support for Debian 7 Wheezy, but for my part I still recommend using OpenVZ exclusively on CentOS 6.

At the moment there are more than 25 thousand servers in the world with OpenVZ installed (it is worth noting that, due to the peculiarities of statistics collection, this figure is somewhat underestimated); detailed statistics are available here: https://stats.openvz.org/

OpenVZ development history


For the OpenVZ project, the following development milestones can be identified:
  • OpenVZ for RHEL4 (discontinued due to EOL)
  • OpenVZ for RHEL5
  • OpenVZ for RHEL6
  • OpenVZ for RHEL7 (in development at the time of writing)

Unfortunately, I cannot comment on the differences between the RHEL4 and RHEL5 versions, because I have never even seen the RHEL4 version. But the differences between RHEL5 and RHEL6 are extremely significant. The most important one is that a solid part of the OpenVZ patch was removed, and the project switched to using the standard Linux kernel mechanisms, namespaces and cgroups. This work will continue in future versions, and in the bright future OpenVZ will not require anything for its operation other than a set of utilities; all the required kernel functionality will be in the Linux upstream.

Linux upstream containers development history


Two fundamental technologies underpin Linux upstream containers: namespaces and cgroups. The namespaces framework, in the form of the mount namespace, was first introduced in 2000 by the well-known developer Al Viro (Alexander Viro). In turn, the cgroups framework was developed at Google with the participation of engineers Paul Menage and Rohit Seth.

The developers of the OpenVZ project have also made a great contribution to the development of the new namespaces/cgroups subsystems. In particular, the PID and network namespaces were developed in the context of the OpenVZ project. In addition, the CRIU framework for serializing/deserializing sets of running Linux processes recently reached its stable phase ( lwn.net/Articles/572125 ); it, too, was developed in the context of OpenVZ tasks.

If you want to evaluate the contribution yourself, I suggest two documents: the report on companies' contributions to kernel development in 2013 and the similar report for 2012; Parallels has a contribution of 0.7% and 0.9% there, respectively. It may seem that about 1% is not much, so I would like to remind you that the volume of changes per year reaches several million lines.

If you already use OpenVZ and have something to report (bugs) or something to share (patches), then you are welcome at the OpenVZ bug tracker. Every report you file improves the stability and quality of the project for all its users!

Technical implementation of the container isolation subsystem





Here I would like to retrace the path that, I am sure, every system administrator who has ever been tasked with "isolating users from each other" within a single Linux server has walked.

So, suppose we have two web applications that belong to different clients (note that I do not use the term "users" here, to avoid confusion with Linux users) and that must not be able to interfere with each other (for example, in the event of hacking, overload, or resource exhaustion), let alone with the physical server itself.

Let's go through the sequence of steps we would take, in the form of a list:

  1. We can run the applications as different Linux users. The security of a neighboring application's file system is then controlled solely by the Linux file and folder permission system, and if it is configured incorrectly, files can be compromised. In addition, all the service files of the physical server remain accessible to both applications, which can also lead to data leakage and subsequent compromise of the entire system. This method does not solve all the problems, and something more advanced is needed.
  2. To increase the security of the file system, let's try the chroot functionality, which allows us to "lock" a process inside a specific folder, out of which it is very difficult to get. To do this, we need to create a complete set of system files (for example, via debootstrap) for each of the applications and place them in separate file hierarchies. Each application then has its own file hierarchy that does not intersect with its neighbor's. But if a process inside the chroot runs as the root user (and there are quite a lot of such processes), it can escape the chroot using standard mechanisms (a nested chroot; see the sketch after this list) and get into the file system of the physical server, which, of course, is not acceptable. This option does not suit us either.
  3. Besides chroot, Linux has long had the mount namespace subsystem, which provides broader functionality together with full protection, so that it cannot be escaped; this is what containerization uses. Since it is no longer possible to describe our actions in detail without resorting to C code, I will limit myself to this formal description. With this, we have solved the problem of isolating the file systems.
  4. Now suppose that each of the applications uses a fixed port, for example 443. What should we do if the server has only one IP address and it is attached to the physical server? Of course, first of all we need two additional IP addresses, one for each service. We will have to assign both of them to the physical server and configure each application to listen explicitly on its own IP address. But what if one of the applications is hacked and tries to listen on a port on the physical server's IP, or on the neighbor's IP, in order to compromise access? It will succeed, and nothing will prevent it. How do we deal with this? Here namespaces help us again, namely the network namespace, through which each of the containers (perhaps they can already be called that!) gets its own strictly fixed IP address, as well as its own space of sockets and ports and its own routing table. This ensures complete isolation from the neighbor at the level of sockets, IP addresses and ports. No mutual influence is now possible.
  5. Now let's dive a little into the jungle of processes and recall that, for example, the Apache web server uses semaphores very actively, and the PHP APC opcode accelerator makes heavy use of shared (shm) memory. What happens if one process simply clears the memory of a neighbor that is plainly visible to it? Or reads it and retrieves sensitive data? A highly unpleasant event that reduces all our isolation tricks to nothing. But there is a solution: the IPC namespace, which isolates semaphores, queues, mutexes and shared memory for POSIX IPC and System V IPC, so that each container sees only its own IPC resources and nobody else's. Hooray! Problem solved.
  6. At this stage almost everything is fine, but one problem remains: from the container we can see all the processes of the physical server, as well as the processes of the neighbor. Processes often disclose passwords or other important information in their names, so this possibility is extremely undesirable. In addition, if we want to start a process with a fixed PID that is already occupied by the neighboring container or by the host system, this will not work, since Linux requires a unique PID for every process. To get rid of this behavior we use the PID namespace, which creates for each container a PID numbering completely isolated from the physical server and from the neighbor, which in turn makes it possible to start, for example, several processes with PID 1. Wait, which process on Linux has that PID? Right, init! In our "home-made isolation" we have reached the stage where we can run a full-fledged system inside the container, not just a single process! Hooray!
  7. But our happiness at launching a full-fledged Linux in the container will be incomplete, since the hostname in both containers will coincide with the hostname of the physical server, which can lead to unpleasant consequences, for example when working over ssh, and a lot of software checks that the hostname is correct and consistent with the machine's IP address. It, too, needs to be isolated; this is done with a namespace with the rather non-intuitive name UTS, which simply allows each container to be given its own unique hostname.
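Here is the sketch promised in item 2, illustrating the classic escape: chroot(2) changes the root directory but not the current directory, so a root process that re-chroots into a subdirectory can simply climb out with "..". This is a deliberately simplified illustration for a disposable test chroot (the loop bound of 64 is an arbitrary assumption, merely "deep enough").

/* Classic double-chroot escape, for a test environment only.
   Run as root from inside a chroot; build with gcc. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* Re-chroot into a subdirectory: our current directory is now
       OUTSIDE the new root, and nothing stops us climbing up from it. */
    mkdir("jail", 0755);
    if (chroot("jail") != 0) { perror("chroot"); return 1; }

    /* Walk up to the real root of the physical server. */
    for (int i = 0; i < 64; i++)
        chdir("..");
    chroot(".");

    /* A shell started here sees the host's file system. */
    execlp("/bin/sh", "sh", NULL);
    perror("execlp");
    return 1;
}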

So, to implement the isolation of basic systems, we needed the following namespaces (a combined sketch follows the list):
  • Mount
  • Network
  • Pid
  • UTS
  • IPC
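As a closing illustration, here is a sketch (assuming a Linux host and root privileges) that pulls all five together: a single clone(2) call requests every namespace from the list above, and the child comes up as a tiny container with its own hostname, PID 1, a private /proc, isolated IPC objects and a lone loopback network interface.

/* All five namespaces in one clone(2) call; build with gcc, run as root. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int container(void *arg) {
    /* UTS: our own hostname, invisible to the host. */
    sethostname("ct101", 5);

    /* Mount: make our copy of the mount tree private, then mount a
       fresh /proc so ps(1) shows only this container's processes. */
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
    mount("proc", "/proc", "proc", 0, NULL);

    /* PID: inside the new namespace this process is init. */
    printf("pid inside the container: %d\n", (int)getpid()); /* prints 1 */

    /* Net and IPC need no extra calls here: the new network namespace
       starts with only a loopback device, and the host's shm segments
       and semaphores are simply not visible. */
    execlp("/bin/sh", "sh", NULL);
    perror("execlp");
    return 1;
}

int main(void) {
    pid_t pid = clone(container, stack + sizeof(stack),
                      CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
                      CLONE_NEWPID | CLONE_NEWNET | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}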

Of course, the path was very thorny, and the implementation of these subsystems in the kernel happened in a slightly different order, but by now they are all available and together form this wonderful picture called Linux upstream containers!



What to expect in the continuation?


  • Technical implementation of the container hardware resource limiting subsystem
  • Common problems when using containers
  • The advantages of OpenVZ over standard Linux containerization
  • Problems on the user space side
  • Conclusions

Thank you for staying with us! In the coming days we will prepare and publish the continuation of the series!

Separately, I would like to thank Andrey Wagin (avagin) for his help with particularly complex technical questions.

Respectfully,
CTO of the hosting company FastVPS
Pavel Odintsov

Source: https://habr.com/ru/post/209072/

