We recently built a distributed cron job scheduling system on top of Kubernetes, an exciting new platform for container cluster management. Kubernetes is now a leading platform in this space and offers many interesting capabilities. One of its main advantages is that engineers do not need to know which machines their applications run on.
Distributed systems are genuinely complex, and managing their services is one of the biggest challenges operations teams face. Introducing new software into production and learning to operate it reliably is a task worth taking seriously. To understand why learning to operate Kubernetes matters (and why it is hard!), we suggest reading this fantastic postmortem of a one-hour outage caused by a bug in Kubernetes.
This article explains why we decided to build on Kubernetes. We will describe how we integrated Kubernetes into our existing infrastructure, our approach to building (and improving) confidence in the reliability of the Kubernetes cluster, and the abstractions we built on top of Kubernetes.
Kubernetes is a distributed system for scheduling programs to run in a cluster. You can tell Kubernetes to run five copies of a program, and it will dynamically schedule them across the worker nodes. Containers are scheduled automatically, which improves resource utilization and saves money. Powerful deployment primitives let you roll out new code gradually, and Security Contexts and Network Policies let you run different projects securely in the same cluster. A minimal sketch of such a request is shown below.
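For illustration, here is a minimal sketch of a Deployment manifest asking Kubernetes for five copies of a program; the name and image are hypothetical, and the apps/v1 API version assumes a reasonably recent cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                 # hypothetical name
spec:
  replicas: 5                       # run five copies; Kubernetes decides where
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: example.com/example-app:1.0   # hypothetical image

Kubernetes continuously reconciles the cluster toward this desired state, rescheduling pods onto healthy nodes as needed.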
Kubernetes has many scheduling capabilities built in: it can schedule long-running HTTP services, daemonsets that run on every node of the cluster, cron jobs that run every hour, and more. If you want to learn more, Kelsey Hightower has given several excellent talks, such as Kubernetes for sysadmins and healthz. There is also a great Slack community.
Every infrastructure project (hopefully!) starts with a business need, and our goal was to improve the reliability and security of our existing distributed cron job system. Our requirements were:
If you are considering Kubernetes, keep one thing in mind: do not use Kubernetes just because other companies do. Building a reliable cluster takes an enormous amount of time, and the business case for using it is not always obvious. Invest your time wisely.
When it comes to operating services, the word "reliable" is not meaningful on its own. To talk about reliability, you first need to set an SLO (service level objective).
We had three goals:
Our basic approach to building our first Kubernetes cluster was to build it from scratch, without tools like kubeadm or kops, using Kubernetes The Hard Way as a reference. We provisioned the cluster with Puppet. Building from scratch was a good choice for two reasons: we could deeply integrate Kubernetes into our architecture, and we gained a deep understanding of how its internal components work.
Building from scratch allowed us to integrate Kubernetes into our existing infrastructure.
We wanted seamless integration with our existing systems for logging, certificate management, secrets, network security, monitoring, AWS instance management, deployment, database proxies, internal DNS, configuration management, and more. Integrating all of these systems sometimes required a little creativity, but overall it was easier than trying to force kubeadm/kops into doing what we wanted.
We already trust these existing systems and know how to manage them, so we wanted to keep using them in the new Kubernetes cluster. For example, reliable certificate management is a very hard problem, and we already had a way to issue and manage certificates. With proper integration, we were able to avoid creating a new certificate authority just for Kubernetes.
We came to understand exactly how each parameter affected our Kubernetes setup. For example, there are more than a dozen certificate and certificate-authority parameters used for authentication. Understanding how these parameters work made it much easier to debug our setup when we ran into authentication problems.
When we started working with Kubernetes, no one on the team had used it before (apart from a few toy projects). How do you get from "none of us have ever used Kubernetes" to "we are confident running Kubernetes in production"?
We asked several engineers at other companies about their experience with Kubernetes. They all used it in different ways and in different environments (to run HTTP services, on bare metal, on Google Kubernetes Engine, and so on).
When looking at a large, complex system like Kubernetes, it is important to think carefully about your own use cases, run your own experiments, build confidence in your own environment, and make your own decisions. For example, you should not read this article and conclude: "Well, Stripe uses Kubernetes successfully, so it will work for us too!"
Here is what we learned from people at companies running Kubernetes clusters:
Our plans depended heavily on one Kubernetes component, the cronjob controller. At the time, this component was in alpha, which made us uneasy. We had tested it in a test cluster, but how could we tell whether it would hold up in production?
Fortunately, all of the cronjob controller's core functionality is only 400 lines of Go. Reading the source code quickly showed that every 10 seconds the controller calls syncAll:

go wait.Until(jm.syncAll, 10*time.Second, stopCh)

The syncAll function fetches all cron jobs from the Kubernetes API, iterates through the list, determines which jobs should run next, and then starts those jobs.

Before creating the cluster, we did some load testing. We were not worried about how many nodes Kubernetes could manage (we planned on about 20), but we did want to make sure it could schedule as many cron jobs as we needed (about 50 per minute).
We ran a test on a three-node cluster where we created 1,000 cron jobs, each scheduled to run once a minute. Each of these jobs simply ran bash -c 'echo hello world'. We chose simple jobs because we wanted to test the cluster's scheduling and orchestration capacity, not its total compute capacity. A sketch of one such job appears below.
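Here is a hedged sketch of what each of these test jobs could look like as a Kubernetes CronJob manifest; the name and image are hypothetical, and the batch API group and version depend on the cluster version (CronJob was still an alpha resource at the time):

apiVersion: batch/v1beta1            # alpha/beta at the time; batch/v1 on modern clusters
kind: CronJob
metadata:
  name: load-test-job-0001           # hypothetical; we created 1,000 of these
spec:
  schedule: "* * * * *"              # run once a minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: hello
              image: ubuntu          # any image with bash would do
              command: ["bash", "-c", "echo hello world"]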
Our test cluster could not handle 1,000 cron jobs per minute. We observed that each node would start at most one pod per second, while the cluster could run 200 cron jobs per minute without problems. Since we only needed about 50 cron jobs per minute, we decided these limits were not blockers and that we could deal with them later if necessary. Onward!
One of the most important decisions when setting up Kubernetes is how to run etcd. etcd is the heart of your Kubernetes cluster: it stores all of the state of your cluster. Everything in Kubernetes other than etcd is stateless. If etcd is not running, you cannot make any changes to your cluster (though existing services will keep running!).
etcd really does play the role of the "heart" of the Kubernetes cluster: the API server is a stateless layer in front of etcd, and every other cluster component talks to etcd through the API server.
When operating etcd, there are two important points to keep in mind:
We set a goal of migrating our cron jobs to Kubernetes with zero interruptions. The secret to successful production migrations is not avoiding mistakes (that is impossible) but designing the migration so that mistakes have minimal consequences.
Fortunately, we had a wide variety of cron jobs to migrate to the new cluster, including some low-priority jobs that could tolerate a little downtime.
Before starting the migration, we built an easy-to-use tool that let us move jobs back and forth between the old and new systems in less than five minutes whenever needed. This simple tool greatly reduced the cost of errors: if we moved a job that turned out to have a dependency we had not planned for, no harm done! We could just move the job back, fix the problem, and try again later.
Here is the general migration strategy we used:
At the start of the project we established a rule: if Kubernetes does something strange or unexpected, we investigate, find the reason, and fix it.
Investigating each issue takes a lot of time, but it is essential. If we simply dismissed strange behavior, we would certainly run into problems in production.
Using this approach, we discovered (and were able to fix!) several bugs in Kubernetes.
Here is one example of an issue we found during this research: when we stopped nodes for longer than the pod eviction timeout (--pod-eviction-timeout) and created network partitions, surprising behavior happened. It is always better to detect these surprises in testing than at 3 a.m. in production.

At Stripe we have a practice of running "game day" exercises, and we still do them. The idea is to come up with situations that you expect to eventually occur in production (for example, the Kubernetes API server crashing) and then deliberately reproduce those situations in production (during the working day, with warning) to make sure you can handle them.
Running these exercises against the cluster regularly revealed gaps in our monitoring or configuration. We were glad to discover (and fix!) these problems early rather than stumble on them six months later.
Here are some of the game day exercises we ran:
Let's take a quick look at how we made the Kubernetes-based system easy to use.
Our original goal was to build a cron job system that our team could easily operate. Once we became confident in Kubernetes, we needed to make it easy for fellow engineers to configure and add new cron jobs. We developed a simple YAML configuration format so that users would not need to understand Kubernetes internals to use the system. Here is the format:
name: job-name-here
kubernetes:
  schedule: '15 */2 * * *'
  command:
    - ruby
    - "/path/to/script.rb"
  resources:
    requests:
      cpu: 0.1
      memory: 128M
    limits:
      memory: 1024M
We did not do anything fancy here: we wrote a simple program that takes this format and transforms it into a Kubernetes cronjob configuration, which we then apply with kubectl. A sketch of what the generated manifest could look like follows.
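For illustration, here is a hedged sketch of the kind of Kubernetes cronjob manifest such a translation program could generate from the config above; the restart policy and the image are our assumptions for the sketch, not necessarily what the actual tool emits, and the batch API version depends on the cluster version:

apiVersion: batch/v1beta1              # depends on cluster version
kind: CronJob
metadata:
  name: job-name-here
spec:
  schedule: '15 */2 * * *'
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never         # an assumption; OnFailure is also common
          containers:
            - name: job-name-here
              image: example.com/cron-image:latest   # hypothetical image
              command:
                - ruby
                - "/path/to/script.rb"
              resources:
                requests:
                  cpu: "0.1"
                  memory: 128M
                limits:
                  memory: 1024M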
We also wrote tests to ensure that job names are not too long (Kubernetes cron job names cannot exceed 52 characters) and that all names are unique. We currently do not use cgroups to enforce memory limits on most of our cron jobs, but we plan to roll that out in the future.
Our simple configuration format was easy to use, and since we automatically generated both Chronos and Kubernetes job definitions from the same source definition, moving a job between the two systems was very easy. This was a key part of making our gradual migration work smoothly.
Monitoring the internal state of the Kubernetes cluster turned out to be easier than expected. We use the kube-state-metrics package and a small Go program, veneur-prometheus, to scrape the Prometheus metrics that kube-state-metrics emits and publish them as statsd metrics to our monitoring system.
For example, here is a chart of the number of pending pods in our cluster over the last hour. Pending means that a pod is waiting to be assigned a worker node to run on. You can see a spike at 11 a.m., because many of our cron jobs run at the 0th minute of the hour.
We also have a monitor that checks that no pod stays stuck in the Pending state: we verify that every pod starts running on a node within 5 minutes, and otherwise we receive an alert. A sketch of such a check is shown below.
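Our actual monitor lives in our statsd-based monitoring system, but as a hedged illustration, a similar check could be written as a Prometheus alerting rule over the kube_pod_status_phase metric that kube-state-metrics exposes; the group name, alert name, and severity label here are hypothetical:

groups:
  - name: cron-cluster                 # hypothetical group name
    rules:
      - alert: PodStuckPending         # hypothetical alert name
        # kube_pod_status_phase is exported by kube-state-metrics and is 1
        # while a pod is in the given phase
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 5m                        # mirrors the "running within 5 minutes" check
        labels:
          severity: warning            # hypothetical
        annotations:
          summary: "Pod {{ $labels.pod }} has been Pending for more than 5 minutes"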
Setting up the Kubernetes cluster and migrating all of our cron jobs to it took five months, with three engineers working full time.
We invested in learning Kubernetes because we expect to use it more widely at Stripe.
Here are some principles that apply when working with Kubernetes (or any other complex distributed system):
Original: Learning to operate Kubernetes reliably.
Source: https://habr.com/ru/post/347014/