
Scaling Kubernetes to 2,500 nodes

Hi all!

So, the first cohort of our DevOps course has graduated, the second is studying at full speed, and the third is on its way. The course will keep improving, and so will the project; one thing remains unchanged for now: interesting articles that we are, so far, only translating for you, although big reveals of the things you have asked us about are just around the corner :)

Go!
We have been using Kubernetes for deep-learning research for more than two years now. While our largest workloads manage cloud VMs directly, Kubernetes provides a fast iteration cycle and scalability, which makes it ideal for our experiments. We now run several Kubernetes clusters (some in the cloud, some on bare metal), the largest of which has more than 2,500 nodes: a cluster in Azure on a mix of D15v2 and NC24 virtual machines.

Many system components broke along the way as we scaled, including etcd, the Kube masters, Docker image pulls, the network, KubeDNS, and even the ARP caches of our machines. So we decided it would be useful to share which problems we ran into and how we dealt with them.


etcd

After growing the cluster to 500 nodes, our researchers began to hit regular timeouts in kubectl. We tried to solve the problem by adding more Kube masters (VMs running kube-apiserver). At first glance this seemed to help, but after 10 such iterations we realized we were treating a symptom rather than the root cause (for comparison, GKE uses a single 32-core VM for 500 nodes).

The main suspect was our etcd cluster, the central store of the Kube masters' state. Looking at the Datadog charts, we found write latencies of up to hundreds of milliseconds on the DS15v2 machines running our etcd replicas, even though each server used a P30 SSD rated at 5,000 IOPS.


These latency spikes block the entire cluster!

A performance test with fio showed that etcd could use only about 10% of the available IOPS because of the ~2ms write latency: etcd does serial I/O, so each write is blocked by the latency of the previous one.
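A quick way to see this for yourself is an fio run that mimics etcd's write pattern: sequential small writes, each followed by an fdatasync. This is a sketch with illustrative sizes and paths, not the exact benchmark we ran:

 # Sequential small writes, each followed by fdatasync, against the disk that
 # hosts etcd's data directory; fio reports the sync latency distribution.
 fio --name=etcd-wal-sim --directory=/var/lib/etcd-bench \
     --rw=write --ioengine=sync --fdatasync=1 \
     --bs=2300 --size=100m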

For each node, we moved the etcd directory to the local temp disk, an SSD attached directly to the instance rather than over the network. Moving to the local disk brought write latency down to about 200us and cured etcd!
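In flag form this is just a matter of pointing etcd's data directory at the local disk; the mount point below is illustrative:

 # Keep etcd's write-ahead log and snapshots on the locally attached SSD
 # instead of a network-attached disk (other etcd flags unchanged).
 mkdir -p /mnt/local-ssd/etcd
 etcd --data-dir=/mnt/local-ssd/etcd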

The problem came back when the cluster grew to 1,000 nodes: we again saw high etcd latencies. This time, the kube-apiservers were reading more than 500MB/s from etcd. We set up Prometheus to monitor the apiservers and added the --audit-log-path and --audit-log-maxbackup flags for better apiserver logging. This revealed a series of slow queries and excessive LIST calls to the Events API.
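The extra flags look roughly like this; the log path and retention count are example values, not necessarily what we ran with:

 # Write an audit log of incoming API requests and keep a few rotated files,
 # so slow or overly chatty clients can be identified.
 kube-apiserver --audit-log-path=/var/log/kube-apiserver/audit.log \
                --audit-log-maxbackup=10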

The root cause: the default settings of the Fluentd and Datadog monitoring processes were to query the apiservers from every node in the cluster (this particular issue, for example, has since been fixed). We made the polling less aggressive, and load on the apiservers became stable again:



etcd egress dropped from over 500MB/s to nearly zero (negative values in the image above indicate egress)

Another useful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation would not affect the main etcd instances. To do this, we simply set the --etcd-servers-overrides flag, like so:

 --etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381 

Another problem after passing 1,000 nodes was hitting etcd's storage limit (2GB by default), after which it stopped accepting writes. This triggered a cascading failure: all of our Kube nodes failed their health checks, so our autoscaler decided it should tear down all of the workers. We raised the maximum etcd storage size with the --quota-backend-bytes flag, and we added a sanity check to the autoscaler: if its action would destroy more than 50% of the cluster, it refuses to perform it.
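For example, raising the quota to 8GB (a value often cited as etcd's practical upper bound; the exact number is a per-cluster choice) looks like this:

 # Raise etcd's backend quota from the 2GB default so it keeps accepting writes.
 etcd --quota-backend-bytes=8589934592   # 8GB, illustrative value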

Kube masters


We host the kube-apiserver, kube-controller-manager and kube-scheduler processes on the same machines. For fault tolerance we always have at least two masters, and the --apiserver-count flag is set to the number of running apiservers (otherwise Prometheus can get confused about the instances).
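With two masters, for instance, the flag is simply set like this (a sketch; the rest of the apiserver configuration is unchanged):

 # Tell each apiserver how many replicas exist, so the set of apiserver
 # endpoints (and anything scraping them) stays consistent.
 kube-apiserver --apiserver-count=2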

We mostly use Kubernetes as a job scheduling system, and the autoscaler dynamically scales the cluster up and down; this significantly reduces the cost of idle nodes while keeping latency low during rapid iteration. By default, kube-scheduler spreads load evenly across nodes, but we want the opposite: to pack pods tightly so that unused nodes can be terminated and large pods can still be scheduled quickly. So we switched to the following policy:

 { "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ {"name" : "GeneralPredicates"}, {"name" : "MatchInterPodAffinity"}, {"name" : "NoDiskConflict"}, {"name" : "NoVolumeZoneConflict"}, {"name" : "PodToleratesNodeTaints"} ], "priorities" : [ {"name" : "MostRequestedPriority", "weight" : 1}, {"name" : "InterPodAffinityPriority", "weight" : 2} ] } 

We use KubeDNS extensively for service discovery, but soon after deploying the new scheduling policy it started having reliability problems. It turned out the problems occurred only on certain KubeDNS pods: because of the new scheduling policy, some machines ended up running more than 10 copies of KubeDNS, creating hotspots, and we exceeded the ~200QPS limit that Azure allows each VM for external DNS queries.

We fixed this by adding an anti-affinity rule to all of our KubeDNS pods:

 affinity:
   podAntiAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
     - labelSelector:
         matchExpressions:
         - key: k8s-app
           operator: In
           values:
           - kube-dns
       topologyKey: kubernetes.io/hostname

Docker image pulls


We got our Dota project running on Kubernetes, but as it grew we noticed that pods on Kubernetes nodes often sat in the Pending state for a long time. The game image is about 17GB and takes around 30 minutes to pull onto a fresh cluster node, so it was clear why the Dota container stayed Pending for a while; it turned out, though, that other containers were affected too. We found that kubelet has a --serialize-image-pulls flag (true by default), which meant that pulling the Dota image blocked all other image pulls. Changing the flag to false required switching Docker from AUFS to overlay2. To speed up pulls further, we also moved the Docker root to the local instance SSD, as we did for the etcd machines.
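Sketched out, those changes amount to something like the following; the paths and file locations are illustrative, not our exact configuration:

 # Docker daemon: use overlay2 and keep image data on the local SSD.
 echo '{ "storage-driver": "overlay2", "data-root": "/mnt/local-ssd/docker" }' \
     > /etc/docker/daemon.json
 systemctl restart docker

 # kubelet: stop serializing pulls so one huge image cannot block the rest.
 kubelet --serialize-image-pulls=false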

But even after optimizing pull speed, pods could not start, failing with the error rpc error: code = 2 desc = net/http: request canceled. The kubelet and Docker logs contained messages indicating that image pulls were being cancelled due to lack of progress. The root of the problem was that pulls of very large images took too long to make progress, and that the backlog of pending image pulls sometimes grew too long.

To solve this, we set kubelet's --image-pull-progress-deadline flag to 30 minutes and the Docker daemon's max-concurrent-downloads option to 10. (The second change did not make pulling large images faster, but it allowed the queue of pending pulls to proceed in parallel.)
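In flag form (values as above, everything else unchanged):

 # Give slow pulls of very large images up to 30 minutes before cancelling them.
 kubelet --image-pull-progress-deadline=30m

 # Let the Docker daemon work on up to 10 image downloads in parallel.
 dockerd --max-concurrent-downloads=10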

Our last Docker issue was related to the Google Container Registry. By default, kubelet pulls a particular image from gcr.io (controlled by the --pod-infra-container-image flag) that is used when starting any new container. If that pull fails for any reason (for example, because a quota is exceeded), the node cannot start any containers at all. Since our nodes do not have their own public IPs and reach gcr.io through NAT, we were most likely hitting a per-IP quota limit. To solve the problem, we simply preload that Docker image into the machine image for our Kubernetes workers using:

 docker image save -o /opt/preloaded_docker_images.tar
 docker image load -i /opt/preloaded_docker_images.tar

We do the same with the whitelist of common internal OpenAI images, such as the Dota image, to improve performance.

Network


As development progressed, our experiments grew into complex distributed systems whose performance depended heavily on the network. Early in our distributed experiments it became clear that our network was not configured well. The bandwidth between machines was 10-15Gbit/s, but our Kube pods using Flannel maxed out at around 2Gbit/s. Machine Zone's public benchmarks show similar numbers, which suggests the problem is not simply a bad setup on our part but something inherent to our environment. (At the same time, Flannel adds no such overhead on our bare-metal machines.)

To work around this, users can add two settings to a pod spec to opt out of Flannel: hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet. (But read the warnings in the Kubernetes documentation before doing this.)
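A minimal pod spec with these two settings might look like the sketch below; the pod name and image are purely illustrative:

 # Hypothetical benchmark pod that bypasses Flannel by using the host network.
 apiVersion: v1
 kind: Pod
 metadata:
   name: iperf-hostnet            # illustrative name
 spec:
   hostNetwork: true              # use the node's network stack directly
   dnsPolicy: ClusterFirstWithHostNet
   containers:
   - name: iperf
     image: networkstatic/iperf3  # illustrative image
     args: ["-s"]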

ARP Cache


Despite the DNS tuning, we still ran into DNS problems. One day an engineer reported that an nc -v command to a Redis server sat for more than 30 seconds before printing that the connection was established. We traced the problem to the kernel's ARP stack. Examining the Redis host revealed a network problem: communication on any port hung for several seconds, and no DNS names could be resolved via the local dnsmasq daemon, while dig just printed a cryptic error: socket.c:1915: internal_send: 127.0.0.1#53: Invalid argument. The dmesg log was more informative: neighbor table overflow! This meant that the ARP cache had run out of space. ARP maps a network address, such as an IPv4 address, to a physical address, such as a MAC address. Fortunately, this is easy to fix by setting a few parameters in /etc/sysctl.conf:
 net.ipv4.neigh.default.gc_thresh1 = 80000
 net.ipv4.neigh.default.gc_thresh2 = 90000
 net.ipv4.neigh.default.gc_thresh3 = 100000
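The change can be applied to a running node without a reboot by reloading the file:

 # Reload /etc/sysctl.conf so the larger neighbour-table thresholds take effect.
 sysctl -p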


These parameters are commonly tuned in HPC clusters, and they are especially relevant for Kubernetes clusters, since every pod has its own IP address that takes up space in the ARP cache.

Our Kubernetes clusters have now been running without incident for 3 months, and we plan to scale them even further in 2018. We recently upgraded to version 1.8.4 and are pleased to see that it now officially supports 5,000 nodes.

THE END

We await your questions and comments here or at our Open Day, which, by the way, is hosted today by Alexander Titov, a man and a legend.

Source: https://habr.com/ru/post/348640/

