Docker in production on Iron.io

Earlier this year (2014), we decided to run every IronWorker task inside its own Docker container. Since then, we have run more than 300 million programs inside Docker containers in the cloud.
After several months of use, we'd like to share with the community some of the challenges we faced in building a Docker-based infrastructure, how we overcame them, and why it was worth it.
IronWorker is a task execution service that lets developers schedule, process, and scale tasks without having to build and maintain infrastructure. When we launched the service more than three years ago, we used a single LXC container that held all the languages and packages needed to run tasks. Docker gave us the ability to easily update and manage a set of containers, which lets us offer our customers a much larger range of language environments and installed packages.
We started with Docker version 0.7.4, which had several known bugs (one of the biggest: containers did not shut down properly; it was later fixed). We worked around almost all of them, and gradually found that Docker not only meets our needs but exceeds our expectations, so much so that we have expanded our use of Docker across our entire infrastructure. Given our experience to date, it made sense.
Benefits
Here is a list of just some of the benefits we saw:
Easy to update and maintain images
Docker's git-like approach is very powerful and makes it easy to manage a large number of constantly evolving environments, and its layered image system lets us fine-tune individual images while saving disk space. We can now keep up with rapidly updated languages, and we can also offer special-purpose images, such as a new FFmpeg stack designed specifically for media processing. We now have about 15 different stacks, and the list is growing quickly.
Resource allocation and analysis
LXC containers are an operating-system-level virtualization method that lets containers share a kernel while limiting each container to a set amount of resources, such as CPU, memory, and I/O. Docker provides these capabilities and many more, including a REST API, version control for images, image updates, and easy access to metrics. In addition, Docker supports a more secure way to isolate data using a copy-on-write (CoW) filesystem: every change made to files during a task is stored separately and can be wiped with a single command. LXC cannot track such changes.
Easy integration with Dockerfiles
Our teams are scattered around the world. Being able to publish a simple Dockerfile and go to sleep knowing that someone on the other side of the world, when they wake up, can build exactly the same image you did is a huge win for each of us and for our sleep schedules. Clean images also make development and testing much faster. Our development cycles are much shorter, and everyone on the team is much happier.
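As a sketch of what such a shared image definition can look like (the base image and package names below are illustrative, not one of our actual stacks):

```dockerfile
# Hypothetical stack definition; anyone on the team builds the identical image.
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y \
    ruby1.9.3 \
    imagemagick
CMD ["ruby", "--version"]
```

Anyone can then run `docker build -t mystack .` and get a bit-for-bit reproducible environment.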
Growing community
Updates to Docker come out very often (even more often than Chrome updates). The level of community involvement in adding features and fixing bugs is growing quickly. Whether it is image support, updates to Docker itself, or new tools for working with Docker, a large number of smart people are tackling these problems so that we don't have to. The Docker community is extremely positive, participating in it brings many benefits, and we are glad to be part of it.
Docker + CoreOS
We are still in the research stage, but the combination of Docker and CoreOS looks likely to take a serious place in our stack. Docker provides stable image management and containerization. CoreOS provides a stripped-down cloud OS, distributed state, and cluster configuration management. The combination gives a more logical separation of concerns and a better-managed infrastructure stack than we have now.
Disadvantages
Every server technology requires fine-tuning, especially at scale, and Docker is no exception. (For perspective: we run about 50 million tasks and 500,000 CPU-hours per month, while rapidly updating the images we make available.)
Here are some of the problems we encountered using Docker at scale:
Limited backward compatibility
The rapid pace of development is certainly an advantage, but it has its drawbacks, one of which is limited backward compatibility. In most cases what we ran into were changes in command-line syntax and API methods, which is not a critical issue from a production standpoint.
In other cases, though, it affected runtime behavior. For example, when Docker errors occur after a container launches, we parse STDERR and react depending on the type of error (for example, by retrying the task). Unfortunately, the error output format changed from version to version, which forced us to debug on the fly.
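A minimal sketch of the kind of STDERR classification we mean (the matched substrings here are illustrative examples, not actual Docker output, precisely because the real strings varied between versions):

```shell
# Map a Docker error message to an action for the task supervisor.
# The matched substrings are illustrative, not verbatim Docker output.
classify_docker_error() {
  case "$1" in
    *"device or resource busy"*|*"no such container"*) echo retry ;;
    *"no such image"*)                                 echo rebuild ;;
    *)                                                 echo fail ;;
  esac
}
```

A supervisor can then retry the task, rebuild the image, or surface the failure based on the result; every format change upstream means revisiting those patterns.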
Solving these problems is relatively easy, but it means every update must be tested several times, and you remain in limbo until you roll the release out to the world of large numbers. It should be noted that we started a few months ago on version 0.7.4, recently updated to 1.2.0, and have seen significant progress in this area.
Limited tooling and libraries
While Docker had its first stable release four months ago, many of the tools built for it remain unstable. Adopting many tools from the Docker ecosystem also means taking on a lot of overhead: someone on your team has to stay current and tinker constantly to keep up with new features and work around bugs. That said, we like many of the tools being built for Docker, and we can't wait for someone to win these battles (we are watching infrastructure management in particular). Of particular interest to us are etcd, fleet, and Kubernetes.
Overcoming the difficulties
To share our experience in a bit more depth, here are some of the problems we ran into and our solutions.
This list was provided by Roman Kononov, our lead IronWorker developer and director of infrastructure management, and Sam Ward, who also plays an important role in debugging and optimizing our Docker operations.
It should be noted that when an error is caused by Docker or some other system problem, we can automatically retry the task with no impact on the user (task retries are built into the platform).
Slow container removal
Removing containers took too long and required too many disk read/write operations. This caused significant delays and bottlenecks in our systems, and we had to increase the number of available cores well beyond what should have been necessary.
After digging into devicemapper (a Docker filesystem driver), we found the option `--storage-opt dm.blkdiscard=false`. This option tells Docker to skip expensive disk operations when removing containers, which dramatically speeds up container shutdown. Since that change, the problem has been gone.
Volumes that would not unmount
Containers did not shut down correctly because Docker was not unmounting their volumes properly. Because of this, containers kept running even after their task had completed. The workaround was to unmount the volumes and delete the folders explicitly with a set of custom scripts. Fortunately, that was long ago, back when we ran Docker v0.7.6. We removed that long script once the unmount problem was fixed in Docker v0.9.0.
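Our actual script is long gone, but the workaround looked roughly like this sketch (the function name and overridable root are hypothetical; the real script ran against /var/lib/docker):

```shell
# Force-clean a stuck container's directory: unmount anything still
# mounted there, then remove it. DOCKER_ROOT defaults to /var/lib/docker.
cleanup_container() {
  root="${DOCKER_ROOT:-/var/lib/docker}"
  dir="$root/containers/$1"
  umount "$dir" 2>/dev/null || true   # ignore "not mounted" errors
  rm -rf "$dir"
}
```

This kind of manual cleanup became unnecessary once the unmount bug was fixed upstream.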
Memory limit switching
In one release, Docker suddenly added the ability to set memory limits and dropped support for the raw LXC options. As a result, some worker processes exceeded their memory limits and the whole system stopped responding. This caught us off guard, because Docker did not fail even when given the now-unsupported options. The fix was easy (apply the memory limits through Docker itself), but the change still took us by surprise.
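The fix amounted to passing the limit through Docker's own flag instead of raw LXC configuration, roughly like this (the image name and limit value are illustrative):

```shell
# Old style: raw LXC config passed through Docker, which silently
# stopped being applied once the option was dropped
#   docker run --lxc-conf="lxc.cgroup.memory.limit_in_bytes=268435456" ...

# Supported style: let Docker set the cgroup memory limit itself
docker run -m 256m example/worker-image ruby worker.rb
```

With `-m`, Docker manages the cgroup limit directly, so the constraint survives changes in the underlying execution driver.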
Future plans
As you may have noticed, we are investing heavily in Docker and continue to do so every day. In addition to using it to sandbox user code running in IronWorker, we are in the process of rolling it out in several other areas of our business.
These areas include:
IronWorker backend
In addition to using Docker as a container for tasks, we are in the process of adopting it to manage the processing on each server that controls and runs worker tasks. (The master process on each runner takes a task off the queue, loads the appropriate Docker environment, executes the task, monitors it, and then tears the environment down once the task completes.) Interestingly, this means we will have containerized code managing other containers on the same machines. Putting our entire IronWorker infrastructure into Docker containers also makes it straightforward to run it on CoreOS.
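A heavily simplified sketch of one iteration of that master-process loop, with the queue and task details stubbed out (everything here is illustrative; `DOCKER` is overridable so the sketch can be exercised without a daemon):

```shell
DOCKER="${DOCKER:-docker}"

# Run one task: start a container for the task's environment with a
# memory cap, wait for it to finish, then clean the container up.
run_one_task() {
  image="$1"; cmd="$2"
  cid=$($DOCKER run -d -m 256m "$image" sh -c "$cmd") || return 1
  $DOCKER wait "$cid" >/dev/null
  $DOCKER rm "$cid" >/dev/null
}
```

The real master process additionally pulls tasks from the queue, selects the stack image, streams logs, and enforces per-task timeouts.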
IronWorker API, IronMQ and IronCache
We are no different from other development teams that dislike configuring and deploying by hand. So we are very excited about packaging all of our services in Docker containers to create simple, predictable environments. No more configuring servers: all we need are servers that can run Docker containers, and our services just work. It is also worth noting that we are replacing our build servers, the servers that build releases of our software for specific environments, with Docker containers. The gains are greater flexibility and a simpler, more reliable stack. Stay tuned.
Building and uploading workers
We are also experimenting with using Docker containers to build and upload tasks to IronWorker. The big advantage is that this gives users a convenient way to define the environments and dependencies of specific tasks, upload them, and then run and scale them. Another advantage is that users can test workers locally in the same environment as our service.
Implementing On-Premise Builds
Using Docker as the main distribution method for the latest version of IronMQ Enterprise simplifies our work and gives us an easy, universal way to deploy to virtually any cloud environment. As with the services we run in the cloud, all customers need are servers that can run Docker containers, and they can stand up multi-server cloud services in a test or production environment with relative ease.
Production and beyond
In the year and a half since Solomon Hykes gave his demo at a GoSF meetup, Docker has come a long way. Since the 1.0 release, Docker has proven itself quite stable and truly production-ready.
Docker's pace of development is very impressive. As the list above shows, we look forward to new features, but we are also pleased with its current capabilities.
Now if only we could get a mature tool for managing Docker-based infrastructure.