At some point in the comments, someone asked how attending Slurm differs from reading the Kubernetes manuals. I asked Pavel Selivanov, a speaker at Slurm-2 and MegaSlurm, to give a small example of the kind of thing he would talk about at Slurm. I'll hand it over to him.
I administer a Kubernetes cluster. Recently I needed to update the k8s version and, among other things, restart all the machines in the cluster. I started the process at 12:00, and by the end of the working day everything was done. The first time I still kept an eye on the update; the second time I went out for a 1.5-hour lunch (to be fair, taking my laptop along). The cluster updated itself, without my involvement and invisibly to clients: the developers noticed nothing, deployments kept running, and the service worked as usual.
What it looked like.
When rebooting machines, there are two bad scenarios.
I could coordinate the reboot with the developers: "stop your deployments, check your instances, I'm going to restart the machines." But I like the DevOps idea that human coordination should be kept to a minimum: it is better to set up automation once than to coordinate your actions every time.
I use Amazon, with its convenience and stability. Everything is automated: you can create and terminate virtual machines, check their availability, and so on.
The Kubernetes cluster is deployed, managed, and updated with the kops utility, which I really love.
During an update, kops automatically drains a node (kubectl drain node), waits until everything has been evacuated from it, deletes it, creates a new node in Amazon with the correct versions of the Kubernetes components, attaches it to the cluster, and checks that the node has joined the cluster properly. It repeats this for every node in turn until the whole cluster is running the right version of Kubernetes.
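A rough sketch of what that looks like from the command line, assuming the cluster name and state-store bucket shown here (both made up); exact flags can differ between kops versions:

```bash
# Hypothetical cluster name and kops state store
export KOPS_STATE_STORE=s3://my-kops-state
export NAME=my-cluster.example.com

kops upgrade cluster $NAME --yes         # bump the Kubernetes version in the cluster spec
kops update cluster $NAME --yes          # push the new configuration to AWS
kops rolling-update cluster $NAME --yes  # drain, replace and re-validate nodes one by one
```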
In CI, I use kube-lint to check every manifest that is going to run in Kubernetes. `helm template` renders everything that will be deployed, and I point the linter at that output so it evaluates everything against a set of rules.
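A hypothetical CI step along these lines (the chart path and rules file are made-up names, and the kube-lint flags are illustrative rather than the tool's documented CLI):

```bash
# Render everything the chart would deploy, then lint the result.
# kube-lint invocation is an assumption; check your linter's actual flags.
helm template ./charts/my-app > rendered.yaml
kube-lint --rules rules.yaml rendered.yaml || exit 1
```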
For example, one of the rules states that for any application in the Kubernetes cluster, the number of replicas must be at least 2.
If the replica count is not set at all (it defaults to 1), or is set to 0 or 1, kube-lint blocks the manifest from getting into the cluster, to avoid problems later.
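A minimal manifest that passes such a rule might look like this (the app name, labels, and image are made-up examples):

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2              # anything below 2 would be rejected by the linter
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0
EOF
```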
For example, an application may by design be left with only one replica at some point during a deployment. For such cases there is the PodDisruptionBudget, where maxUnavailable and minAvailable are set for an application running in Kubernetes. If you want at least one replica to always be available, set minAvailable: 1.
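A minimal sketch of such a PodDisruptionBudget (the names and labels are illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: policy/v1        # policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1            # never evict the last live replica voluntarily
  selector:
    matchLabels:
      app: my-app
EOF
```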
Say there were 2 replicas, a deployment started, 1 replica died and 1 remained. On the machine where that replica lives, the admin runs kubectl drain node. In theory, Kubernetes should evict this live replica and move it to another node. But the pod disruption budget kicks in: Kubernetes tells the admin, sorry, there is a replica here, and if we evict it we will violate the pod disruption budget. The drain is smart about it: it keeps retrying until its timeout expires. Once the deployment finishes and both replicas are available again, the replica on this node is evicted and the drain goes through.
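Draining a node by hand works the same way; the node name below is just a placeholder:

```bash
# Evictions that would violate a PodDisruptionBudget are refused,
# and kubectl keeps retrying until --timeout runs out.
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m
```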
At MegaSlurm, I'll show you the complete set of rules that lets me drink coffee in a café while the Kubernetes cluster is updated with a restart of all its nodes.
My topics at Slurm:
My topics at MegaSlurm:
Source: https://habr.com/ru/post/425217/