At some point in the comments, someone asked how attending Slurm differs from reading the Kubernetes manuals. I asked Pavel Selivanov, a speaker at Slurm-2 and MegaSlurm, to give a small example of the kind of thing he would talk about at Slurm. I'll hand it over to him.
I administer a Kubernetes cluster. Recently I needed to update the k8s version and, among other things, restart all the machines in the cluster. I started the process at 12:00, and by the end of the working day everything was done. The first time I still kept an eye on the update; the second time I went out for a 1.5-hour lunch (to be fair, taking my laptop along). The cluster updated itself, without my involvement and invisibly to clients: the developers noticed nothing, deployments kept running, and the service worked as usual.
What it looked like.
When rebooting machines, there are two bad scenarios.
I could coordinate the reboot with the developers: "stop your deployments, check your instances, I'm going to restart the machines." But I like the DevOps idea that human coordination should be kept to a minimum: it is better to set up automation once than to coordinate your actions every time.
I use Amazon, with its convenience and stability. Everything is automated: you can create and terminate virtual machines, check their availability, and so on.
The Kubernetes cluster is deployed, managed, and updated with the kops utility, which I really love.
During an update, kops automatically drains a node (kubectl drain node), waits until everything has been evacuated from it, deletes it, creates a new node in Amazon with the correct versions of the Kubernetes components, attaches it to the cluster, and checks that the node has joined the cluster properly. It repeats this for every node in turn until the whole cluster is running the right version of Kubernetes.
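A rough sketch of what that looks like from the command line, assuming the cluster name and state-store bucket shown here (both made up); exact flags can differ between kops versions:

```bash
# Hypothetical cluster name and kops state store
export KOPS_STATE_STORE=s3://my-kops-state
export NAME=my-cluster.example.com

kops upgrade cluster $NAME --yes         # bump the Kubernetes version in the cluster spec
kops update cluster $NAME --yes          # push the new configuration to AWS
kops rolling-update cluster $NAME --yes  # drain, replace and re-validate nodes one by one
```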
In CI, I use kube-lint to check every manifest that is going to run in Kubernetes. `helm template` renders everything that will be deployed, and I point the linter at that output so it evaluates everything against a set of rules.
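A hypothetical CI step along these lines (the chart path and rules file are made-up names, and the kube-lint flags are illustrative rather than the tool's documented CLI):

```bash
# Render everything the chart would deploy, then lint the result.
# kube-lint invocation is an assumption; check your linter's actual flags.
helm template ./charts/my-app > rendered.yaml
kube-lint --rules rules.yaml rendered.yaml || exit 1
```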
For example, one of the rules states that for any application in the Kubernetes cluster, the number of replicas must be at least 2.
If the replica count is not set at all (it defaults to 1), or is set to 0 or 1, kube-lint blocks the manifest from getting into the cluster, to avoid problems later.
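A minimal manifest that passes such a rule might look like this (the app name, labels, and image are made-up examples):

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2              # anything below 2 would be rejected by the linter
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0
EOF
```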
For example, an application may by design be left with only one replica at some point during a deployment. For such cases there is the PodDisruptionBudget, where maxUnavailable and minAvailable are set for an application running in Kubernetes. If you want at least one replica to always be available, set minAvailable: 1.
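A minimal sketch of such a PodDisruptionBudget (the names and labels are illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: policy/v1        # policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1            # never evict the last live replica voluntarily
  selector:
    matchLabels:
      app: my-app
EOF
```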
Say there were 2 replicas, a deployment started, 1 replica died and 1 remained. On the machine where that replica lives, the admin runs kubectl drain node. In theory, Kubernetes should evict this live replica and move it to another node. But the pod disruption budget kicks in: Kubernetes tells the admin, sorry, there is a replica here, and if we evict it we will violate the pod disruption budget. The drain is smart about it: it keeps retrying until its timeout expires. Once the deployment finishes and both replicas are available again, the replica on this node is evicted and the drain goes through.
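Draining a node by hand works the same way; the node name below is just a placeholder:

```bash
# Evictions that would violate a PodDisruptionBudget are refused,
# and kubectl keeps retrying until --timeout runs out.
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m
```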
At MegaSlurm, I'll show you the complete set of rules that lets me drink coffee in a café while the Kubernetes cluster is updated with a restart of all its nodes.
My topics at Slurm:
My topics at MegaSlurm:
Source: https://habr.com/ru/post/425217/