Hello everyone!
We return to our beloved tradition: sharing translations about the tools we collect and study as part of our courses. Today the DevOps course is on the agenda, and one of its tools is Kubernetes.
We recently built a distributed cron job scheduling system on top of Kubernetes, a new and exciting platform for container orchestration. Kubernetes is becoming increasingly popular and makes many promises: for example, engineers do not have to worry about which machine their application runs on.
Distributed systems are inherently complex, and operating services on top of them is one of the hardest problems operations teams face. We take the introduction of new software into production, and learning to operate it reliably, very seriously. As an example of why operating Kubernetes matters (and why it is so hard!), read this excellent post-mortem of an hour-long outage caused by a Kubernetes bug.
In this post we will explain why we chose Kubernetes. We will look at how we integrated it into our existing infrastructure, how we built confidence in (and improved) the reliability of our Kubernetes cluster, and the abstractions we built on top of Kubernetes.

What is Kubernetes?
Kubernetes is a distributed system for scheduling programs to run in a cluster. You can tell Kubernetes to run five copies of a program, and it dynamically schedules them across your worker nodes. Containers are meant to increase utilization and thereby save money, powerful deployment primitives let you roll out new code gradually, and Security Contexts and Network Policies let you run multi-tenant workloads safely.
Kubernetes has many different scheduling capabilities. You can schedule long-running HTTP services, daemonsets that run on every machine in your cluster, jobs that run every hour, and more. There is even more to Kubernetes, and if you want to learn about it, watch Kelsey Hightower's excellent talks: you can start with Kubernetes for sysadmins and healthz: Stop reverse engineering applications and start monitoring from the inside. And don't forget about the great community in Slack.
Why Kubernetes?
Every infrastructure project (I hope!) begins with a business need, and our goal was to improve the reliability and security of our existing cron job scheduling system. Our requirements were as follows:
- The system can be built and operated by a relatively small team (only 2 people working full time on this project);
- Ability to schedule about 500 different jobs across about 20 machines;
And here are some reasons why we chose Kubernetes:
- The desire to build on top of an existing open-source project;
- Kubernetes includes a distributed task scheduler, so we don’t have to write our own;
- Kubernetes is an actively developing project in which you can take part;
- Kubernetes is written in Go, which is easy to learn. Almost all of our Kubernetes bug fixes were made by programmers with no prior Go experience;
- If we can operate Kubernetes, we will be able to build on top of it in the future (for example, we are currently working on a Kubernetes-based system for training machine learning models).
Previously, we used Chronos as a task scheduler, but it no longer met our reliability requirements and is now barely maintained (1 commit over the past 9 months, and the last time a pull request was merged was March 2016). We therefore decided not to invest further in improving our existing cluster.
If you are considering Kubernetes, keep in mind: do not use Kubernetes simply because other companies do. Building a reliable cluster takes a tremendous amount of time, and the business case for using it is not always obvious. Invest your time wisely.
What does “reliable” mean?
When it comes to operating a service, the word “reliable” is meaningless on its own. To talk about reliability, you first need to define an SLO (service level objective, that is, a target level of service).
We had three main objectives:
- 99.99% of jobs should be scheduled and started within 20 minutes of their scheduled run time. 20 minutes is quite a wide window, but we interviewed our internal customers and none of them needed greater precision;
- Jobs should run to completion in 99.99% of cases (without being terminated);
- Our move to Kubernetes should not cause any customer-facing incidents.
This meant several things:
- Short periods of Kubernetes API downtime are acceptable (if it goes down for 10 minutes, that is not critical, as long as we can recover within 5 minutes);
- Scheduling bugs (when a running job is dropped or cannot start at all) are unacceptable. We took reports of scheduling bugs very seriously;
- We need to be careful with pod evictions and instance terminations so that jobs are not interrupted too often;
- We need a good migration plan.
Creating a Kubernetes Cluster
Our basic approach to creating our first Kubernetes cluster was to build it from scratch, without using tools like kubeadm or kops. We managed the configuration with Puppet, our standard configuration management tool. Building from scratch appealed to us for two reasons: we could deeply integrate Kubernetes into our architecture, and we would gain a deep understanding of its internals.
Building from scratch allowed us to integrate Kubernetes into our existing infrastructure. We wanted seamless integration with our existing logging systems, certificate management, secrets, network security, monitoring, AWS instance management, deployment, database proxies, internal DNS servers, configuration management, and much more. Integrating all these systems sometimes required creativity, but overall it turned out to be easier than trying to force kubeadm/kops to do what we needed from them.
We already trust our existing systems and know how to operate them, so we wanted to keep using them in the new Kubernetes cluster. For example, reliable certificate management is a very hard problem, but we already had a way to solve it. A clever integration let us avoid creating a new CA just for Kubernetes.
Building from scratch also forced us to understand exactly how the tunable parameters affect our Kubernetes installation. For example, more than a dozen parameters are involved in configuring certificates and CA authentication. Understanding all of them made it much easier to debug our installation when we ran into authentication problems.
Building Confidence in Kubernetes
When we started working with Kubernetes, no one on the team had any experience using it (beyond home projects). How do you get from “none of us has ever used Kubernetes” to “we are confident running Kubernetes in production”?

Strategy 0: Talk to other companies
We asked a few people at other companies about their experiences with Kubernetes. They all used it differently and in different environments (to run HTTP services, on bare metal, on Google Kubernetes Engine, etc.).
When it comes to large and complex systems like Kubernetes, it is especially important to think about your own use cases, experiment, build confidence in your own environment, and decide for yourself. For example, after reading this post you should not conclude, “Well, Stripe uses Kubernetes successfully, so it will work for us too!”.
Here is what we learned from conversations with various companies managing Kubernetes clusters:
- Prioritize work on the reliability of your etcd cluster (etcd is where all the state of your Kubernetes cluster is stored);
- Some Kubernetes features are more stable than others, so be careful with alpha features. Some companies only use features once they have been stable for more than one release in a row (that is, if a feature becomes stable in 1.8, they wait for 1.9 or 1.10 before using it);
- Consider hosted Kubernetes offerings like GKE/AKS/EKS. Setting up a highly available Kubernetes system from scratch yourself is a difficult task. AWS did not have a managed Kubernetes service at the time of this project, so we did not consider it;
- Be careful of the additional network latency introduced by overlay/software-defined networks.
Conversations with other companies did not, of course, give a definitive answer as to whether Kubernetes was right for us, but they did give us questions to ask and things to worry about.
Strategy 1: Read the code
We planned to depend heavily on a single Kubernetes component, the cron job controller. At the time it was in alpha, which was a bit of a cause for concern. We tried it on a test cluster, but how could we tell whether it would work for us in production? Fortunately, the controller's core logic is only about 400 lines of Go. Reading the source code quickly showed that:
- The controller is a stateless service (like any other Kubernetes component, with the exception of etcd);
- Every 10 seconds, the controller calls the syncAll function: go wait.Until(jm.syncAll, 10*time.Second, stopCh);
- The syncAll function fetches all cron jobs from the Kubernetes API, iterates over the list, determines which jobs should run next, and then starts them (a simplified sketch of this loop follows the list).
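To make the pattern concrete, here is a minimal, self-contained Go sketch of that loop. It is not the real controller code: the job listing and schedule evaluation are stubbed out, and only the structure (call syncAll on a fixed interval until a stop channel is closed) mirrors what we read in the source.

```go
package main

// Simplified sketch of the cron job controller's control loop, for
// illustration only: every 10 seconds, syncAll lists all cron jobs, decides
// which ones are due, and starts them. The real controller talks to the
// Kubernetes API; here those calls are stubbed out.

import (
	"fmt"
	"time"
)

type cronJob struct {
	name     string
	schedule string // cron expression, e.g. "15 */2 * * *"
}

// listAllCronJobs stands in for the API call that fetches every cron job.
func listAllCronJobs() []cronJob {
	return []cronJob{{name: "example-job", schedule: "* * * * *"}}
}

// isDue stands in for the real schedule evaluation (cron parsing, last run time, etc.).
func isDue(j cronJob, now time.Time) bool {
	return true // stub: pretend every job is due
}

// syncAll mirrors the controller's syncAll: fetch all jobs, decide which are due, start them.
func syncAll() {
	now := time.Now()
	for _, j := range listAllCronJobs() {
		if isDue(j, now) {
			fmt.Printf("%s: starting job %q\n", now.Format(time.RFC3339), j.name)
		}
	}
}

func main() {
	stopCh := make(chan struct{})              // closed by whoever wants to stop the controller
	ticker := time.NewTicker(10 * time.Second) // same 10-second period as the real controller
	defer ticker.Stop()

	syncAll() // run once immediately, then on every tick
	for {
		select {
		case <-ticker.C:
			syncAll()
		case <-stopCh:
			return
		}
	}
}
```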
The basic logic was easy to understand. More importantly, if a bug was detected in the controller, most likely we would be able to fix it ourselves.
Strategy 2: Perform load testing
Before building our cluster, we did some load testing. We were not worried about how many nodes the Kubernetes cluster could handle (we planned to deploy about 20 nodes); what mattered was making sure Kubernetes could handle running as many jobs as we needed (about 50 per minute).
We ran a test on a three-node cluster, creating 1000 cron jobs, each running once a minute. Each job simply ran bash -c 'echo hello world'. We chose simple jobs because we wanted to test the cluster's scheduling and orchestration capabilities, not its total computing power.
The test cluster could not handle 1000 jobs per minute. We found that each node would start at most one pod per second, and that the cluster could comfortably run 200 jobs per minute. Since we need only about 50 jobs per minute, we decided this limitation did not block us (and, if necessary, we could deal with it later). Onward!
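For reference, a load test like this can be generated with a few lines of code. The sketch below prints N CronJob manifests that each run bash -c 'echo hello world' once a minute, ready to pipe into kubectl apply -f -. The image, naming scheme, and apiVersion are assumptions for illustration (the CronJob API was still alpha at the time of this work, so the apiVersion would have differed).

```go
package main

// Sketch: emit N one-per-minute CronJob manifests for a scheduling load test.
// Usage: go run loadgen.go 1000 | kubectl apply -f -
// The apiVersion, image, and naming are illustrative assumptions.

import (
	"fmt"
	"os"
	"strconv"
)

const manifest = `---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: load-test-%d
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: ubuntu:16.04
            command: ["bash", "-c", "echo hello world"]
`

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: loadgen <number-of-cronjobs>")
		os.Exit(1)
	}
	n, err := strconv.Atoi(os.Args[1])
	if err != nil || n <= 0 {
		fmt.Fprintln(os.Stderr, "number-of-cronjobs must be a positive integer")
		os.Exit(1)
	}
	for i := 0; i < n; i++ {
		fmt.Printf(manifest, i)
	}
}
```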
Strategy 3: Prioritize building and testing a fault-tolerant etcd cluster
When setting up Kubernetes, it is very important to run etcd correctly. etcd is the heart of your Kubernetes cluster: it is where all the cluster's state is stored. Everything other than etcd is stateless. If etcd is not running, you cannot make changes to your Kubernetes cluster (although existing services will keep working!).

When running etcd, there are two important points to keep in mind:
- Set up replication so that your cluster does not die if a node is lost. We currently run three etcd replicas;
- Make sure you have sufficient I/O bandwidth. Our version of etcd had an issue where a single node with high fsync latency could trigger prolonged leader elections, making the cluster unavailable. We fixed this by making sure that the I/O bandwidth of our nodes was higher than the write rate generated by etcd.
Setting up replication is not a set-it-and-forget-it task. We carefully tested that when an etcd node is lost, the cluster still recovers gracefully.
Here are some of the tasks we performed to set up the etcd cluster:
- Setting up replication;
- Monitoring etcd availability, so that if etcd goes down we know about it immediately (a minimal health-check sketch follows this list);
- Writing simple tools to easily create new etcd nodes and add them to the cluster;
- Integrating etcd with Consul so that we can run more than one etcd cluster in our production environment;
- Testing recovery from etcd backups;
- Testing the ability to rebuild the cluster completely without downtime.
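As a rough illustration of the availability check mentioned above, here is a minimal sketch using the official etcd v3 Go client: it queries the status of each endpoint and reports the ones that fail to answer. The endpoints are placeholders, TLS configuration is omitted, and a production check would of course feed into alerting rather than print to stdout.

```go
package main

// Sketch of a simple etcd availability check: ask every endpoint for its
// status and complain about the ones that don't respond. Endpoints are
// placeholders and TLS is omitted for brevity.

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints for the three etcd replicas.
	endpoints := []string{
		"http://etcd-1.internal:2379",
		"http://etcd-2.internal:2379",
		"http://etcd-3.internal:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Println("could not create etcd client:", err)
		return
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		status, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("UNHEALTHY %s: %v\n", ep, err)
			continue
		}
		fmt.Printf("healthy   %s (leader: %x)\n", ep, status.Leader)
	}
}
```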
We are very glad we tested this early. One Friday morning, one of the etcd nodes in our production cluster stopped responding to pings. We got an alert, destroyed the node, brought up a new one, added it to the cluster, and all the while Kubernetes kept working without interruption. Amazing.
Strategy 4: Gradually move tasks to Kubernetes
One of our main goals was to move jobs to Kubernetes without causing any outages. The secret of a successful production migration is not an error-free process (that is impossible), but designing the process to reduce the damage caused by mistakes.
We were lucky to have a wide variety of jobs to migrate to the new cluster, so there were some low-impact ones among them where a couple of mistakes during the move were acceptable.
Before starting the migration, we built an easy-to-use tool that let us move jobs between the old and the new systems in less than five minutes. This tool reduced the damage from mistakes: if we moved a job that had a dependency we hadn't planned for, no problem! We could simply move it back, fix the problem, and try again.
We followed this migration strategy:
- Roughly order the jobs by their importance;
- Repeatedly move a few jobs at a time to Kubernetes. Whenever we discovered new problem areas, we quickly rolled back, fixed the problem, and tried again.
Strategy 5: Investigate Kubernetes bugs (and fix them)
At the very beginning of the project, we established a rule: if Kubernetes does something unexpected, we need to understand why and propose a fix.
Investigating each bug takes a lot of time, but it is very important. If we had simply dismissed strange Kubernetes behavior, blaming the inherent complexity of distributed systems, every new bug would have left us feeling that we alone were to blame for it.
After adopting this approach, we found (and fixed!) several bugs in Kubernetes.
Here are a few problems we found during these tests:
- Cron jobs with names longer than 52 characters silently fail to schedule jobs (fixed here);
- Pods could hang forever in the Pending state (fixed here and here);
- The scheduler crashed every three hours (fixed here);
- Flannel's hostgw backend did not replace stale route table entries (fixed here).
Fixing these bugs made us feel much better about using Kubernetes in the project: not only did it work quite well, it also accepted patches and had a good PR review process.
Kubernetes has bugs, like any software. In particular, we use the scheduler heavily (because our cron jobs constantly create new pods), and the scheduler's use of caching sometimes leads to bugs, regressions, and crashes. Caching is hard! But the codebase is accessible, and we were able to handle every bug we encountered.
Another issue worth mentioning is Kubernetes' pod eviction logic. Kubernetes has a component called the node controller, which is responsible for evicting pods and moving them to other nodes when a node becomes unresponsive. If all nodes become unresponsive at once (for example, because of a networking or configuration problem), Kubernetes can destroy all the pods in the cluster. We ran into this early in our testing.
If you are running a large Kubernetes cluster, carefully read the node controller documentation, think through the settings, and test thoroughly. Every time we tested changes to these settings (for example, --pod-eviction-timeout) by creating network partitions, surprising things happened. It is better to learn about such surprises in testing, and not at 3 a.m. in production.
Strategy 6: Intentionally cause problems in the Kubernetes cluster
We have written before about running game day exercises at Stripe, and we continue to do so now. The idea is to come up with situations that might occur in production (for example, losing the Kubernetes API server) and then deliberately create them (during the working day, with advance warning) to make sure we can handle them.
Running several such exercises on our cluster revealed problems such as gaps in monitoring and configuration errors. We were very glad to find them in advance and in a controlled environment, rather than by accident six months later.
Some of the game day exercises we ran:
- Shutting down one Kubernetes API server;
- Shutting down all the Kubernetes API servers and bringing them back up (to our surprise, everything worked fine);
- Shutting down an etcd node;
- Disconnecting the worker nodes in our Kubernetes cluster from the API servers (so that they lose the ability to communicate). As a result, all the pods on those nodes were moved to other nodes.
We were very pleased to see how gracefully Kubernetes withstood the failures we caused. Kubernetes is designed to be resilient to errors: it has an etcd cluster in which all state is stored, an API server that is essentially a simple REST interface to that database, and a set of stateless controllers that coordinate cluster management.
If any of the core Kubernetes components (API server, controller manager, scheduler) is interrupted and restarted, it reads the state from etcd as soon as it comes back up and carries on working. This is one of those things that sounded attractive in theory and turned out to be just as good in practice.
Here are some problems that we found during the tests:
- “Strange, we were not alerted about this, although we should have been. The monitoring needs fixing here”;
- “When we destroyed the API server instances and brought them back up, they required human intervention. We'd better fix that”;
- “Sometimes during etcd failover tests, the API server starts timing out requests until we restart it.”
After these tests, we addressed the problems we found: we improved monitoring, fixed the offending configuration settings, and documented the Kubernetes bugs.
Making cron jobs easy to use
Let's briefly review how we made our Kubernetes-based system easy to use.
Our original goal was to design a system for running cron jobs that our team could operate with confidence. Once we became confident in Kubernetes, we needed to make it easy for our engineers to set up and add new cron jobs. We developed a simple YAML configuration format so that our users do not need to understand anything about Kubernetes internals to use our system. Here is the format:
```yaml
name: job-name-here
kubernetes:
  schedule: '15 */2 * * *'
  command:
    - ruby
    - "/path/to/script.rb"
  resources:
    requests:
      cpu: 0.1
      memory: 128M
    limits:
      memory: 1024M
```
Nothing complicated: we just wrote a program that takes this format and translates it into a Kubernetes cron job configuration, which we then apply with kubectl.
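Below is a minimal sketch of what such a translator could look like: it parses the simplified config with a YAML library and renders a Kubernetes CronJob manifest that can be piped to kubectl apply -f -. The container image, restart policy, and apiVersion are illustrative assumptions, and resource requests/limits are omitted for brevity; this is not our actual tool.

```go
package main

// Sketch of a config translator: read the simplified job config and print a
// Kubernetes CronJob manifest suitable for `kubectl apply -f -`.
// Image, restartPolicy, and apiVersion are assumptions; resources omitted.

import (
	"log"
	"os"
	"text/template"

	yaml "gopkg.in/yaml.v2"
)

type jobConfig struct {
	Name       string `yaml:"name"`
	Kubernetes struct {
		Schedule string   `yaml:"schedule"`
		Command  []string `yaml:"command"`
	} `yaml:"kubernetes"`
}

const cronJobTemplate = `apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Name }}
spec:
  schedule: "{{ .Kubernetes.Schedule }}"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: {{ .Name }}
            image: jobs-base-image   # assumption: a shared image containing the job code
            command:
{{- range .Kubernetes.Command }}
            - "{{ . }}"
{{- end }}
`

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: cronjob-gen <job-config.yaml>")
	}
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	var cfg jobConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}
	tmpl := template.Must(template.New("cronjob").Parse(cronJobTemplate))
	if err := tmpl.Execute(os.Stdout, cfg); err != nil {
		log.Fatal(err)
	}
}
```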
We also wrote a test suite that checks job name length (Kubernetes cron job names cannot exceed 52 characters) and name uniqueness. We currently do not use cgroups to limit memory for most of our jobs, but that is in our plans for the future.
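A sketch of what those checks might look like as Go tests is shown below. The 52-character limit matches the bug described earlier; the hard-coded jobNames slice is a stand-in, since a real suite would collect the names from the job config files.

```go
package jobs

// Sketch of the validation tests described above: job names must be unique
// and at most 52 characters (longer names silently broke scheduling in the
// cron job controller version we used). jobNames is a hard-coded stand-in
// for names collected from the job configs.

import "testing"

var jobNames = []string{
	"nightly-report",
	"cleanup-stale-sessions",
}

const maxCronJobNameLength = 52

func TestJobNameLength(t *testing.T) {
	for _, name := range jobNames {
		if len(name) > maxCronJobNameLength {
			t.Errorf("job name %q is %d characters, max is %d",
				name, len(name), maxCronJobNameLength)
		}
	}
}

func TestJobNamesAreUnique(t *testing.T) {
	seen := make(map[string]bool)
	for _, name := range jobNames {
		if seen[name] {
			t.Errorf("duplicate job name %q", name)
		}
		seen[name] = true
	}
}
```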
Our format was easy to use, and since we automatically generated both the Chronos and the Kubernetes cron job definitions from it, moving a job between the two systems was easy. This was a key part of making our gradual migration work well. Whenever moving a job to Kubernetes caused an error, we could move it back with a simple three-line configuration change in less than 10 minutes.
Kubernetes Monitoring
Monitoring the internal state of our Kubernetes cluster turned out to be a pleasure. We use the kube-state-metrics package for monitoring and a small Go program called veneur-prometheus to scrape the Prometheus metrics that kube-state-metrics emits and publish them as statsd metrics into our monitoring system.
For example, here is a graph of the number of pending pods in our cluster over the last hour. Pending means that a pod is waiting to be assigned to a worker node and started. You can see a spike at 11 a.m., because many of our jobs run at the 0th minute of the hour.

We also watch for pods stuck in the Pending state: we check that every pod is started on a worker node within 5 minutes, and otherwise we get an alert.
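As an illustration of that check, here is a standalone sketch using client-go that lists Pending pods and flags any that have been waiting for more than five minutes. The kubeconfig path is an assumption, and a production version would feed the monitoring system described above rather than print to stdout.

```go
package main

// Sketch of the "stuck in Pending" check: list Pending pods via the
// Kubernetes API and flag any that have been unscheduled for more than
// five minutes. Assumes a recent client-go and a kubeconfig in $HOME.

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List pods in all namespaces that are still in the Pending phase.
	pods, err := clientset.CoreV1().Pods("").List(context.Background(),
		metav1.ListOptions{FieldSelector: "status.phase=Pending"})
	if err != nil {
		log.Fatal(err)
	}

	for _, pod := range pods.Items {
		waiting := time.Since(pod.CreationTimestamp.Time)
		if waiting > 5*time.Minute {
			fmt.Printf("pod %s/%s has been Pending for %s\n",
				pod.Namespace, pod.Name, waiting.Round(time.Second))
		}
	}
}
```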
Future plans for Kubernetes
Setting up Kubernetes, getting to the point where we were ready to run production code on it, and migrating our cron jobs to the new cluster took three engineers working full time for five months. A big reason we invested in learning Kubernetes is that we expect to use it more widely at Stripe.
Here are some principles that apply to operating Kubernetes (and any other complex distributed system):
- Define a clear business reason for your Kubernetes project (and for all infrastructure projects!). Understanding the business case and our users' needs made our project much simpler;
- Aggressively cut scope. We decided not to use many of Kubernetes' basic features in order to simplify our cluster. This let us move faster: for example, since pod-to-pod networking was not required for our project, we could firewall off all network connections between nodes and defer thinking about network security in Kubernetes to a future project.
- Spend a lot of time learning how to operate your Kubernetes cluster properly. Carefully test edge cases. Distributed systems are very complex, so lots of things can go wrong. Take the example described earlier: depending on your configuration, the node controller can kill all the pods in your cluster if they lose contact with the API servers. Learning how Kubernetes behaves after each configuration change takes time and care.
By keeping these principles in focus, we can use Kubernetes in production with confidence. Our use of Kubernetes will keep growing and evolving: for example, we are watching AWS's release of EKS with interest. We are finishing work on another system for training machine learning models, and we are also considering moving some HTTP services to Kubernetes. And as we operate Kubernetes in production, we plan to keep contributing to the open-source project.
THE END
As always, we look forward to your comments and questions, here or at our Open House.