
How to connect Kubernetes clusters in different data centers


Welcome to our series of quick Kubernetes tutorials. This is a regular column featuring the most interesting questions we receive online and at our trainings, answered by a Kubernetes expert.


Today’s expert is Daniel Polencic, an instructor and software developer at Learnk8s.

If you want to get an answer to your question in the next post, please contact us by email or on Twitter: @learnk8s.


Missed previous posts? Look for them here.


How to connect Kubernetes clusters in different data centers?


In short: Kubefed v2 is coming out soon, and I also advise you to read about Shipper and the multi-cluster-scheduler project.


Quite often, infrastructure is replicated and distributed across different regions, especially in regulated environments.

If one region is unavailable, traffic is redirected to another to avoid interruptions.


With Kubernetes, you can use a similar strategy and distribute workloads across different regions.


You can have one or more clusters per team, region, environment, or combination of these elements.


Your clusters can live in different clouds and on-premises.


But how do you plan the infrastructure for such geographical spread?
Do you create one large cluster spanning several cloud environments over a single network?
Or do you run lots of small clusters and find a way to manage and synchronize them?


One big cluster


Creating a single cluster over a single network is not as easy as it sounds.


Imagine an outage that cuts connectivity between cluster segments.


If you have a single master node, half of your resources will not be able to receive new commands because they cannot reach the master.


On top of that, you are left with stale routing tables (kube-proxy cannot load new ones) and no additional pods (the kubelet cannot ask for updates).


Worse, if Kubernetes loses sight of a node, it marks it as lost and reschedules the missing pods onto the remaining nodes.


As a result, you end up with twice as many pods.
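

How quickly this duplication happens is governed by taint-based evictions: in current Kubernetes versions, an unreachable node is tainted and its pods are evicted once their toleration expires (300 seconds by default). As a minimal sketch, assuming a hypothetical app called example-api, you can stretch that window per workload:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api               # hypothetical name, for illustration only
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: nginx
      tolerations:
        # Kubernetes normally injects this toleration with tolerationSeconds: 300,
        # i.e. pods on an unreachable node are rescheduled elsewhere after ~5 minutes.
        # Raising it delays the duplication described above during short network splits.
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 600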


If you run one master server per region, you will run into problems with the consensus algorithm in the etcd database. (Editor's note: in fact, etcd does not necessarily have to live on the master servers; it can run on a separate group of servers in the same region. Although that gives you a single point of failure for the cluster, it is fast.)


etcd uses the Raft algorithm to agree on a value before writing it to disk.
In other words, a majority of instances (two out of three, for example) must reach consensus before the state can be written to etcd.


If latency between etcd instances increases dramatically, as it does with three etcd instances in different regions, it takes a long time to agree on a value and write it to disk.
This is reflected in the Kubernetes controllers.


The controller manager takes longer to learn about a change and write the response back to the database.


And since there is not just one controller but several, a chain reaction builds up and the entire cluster becomes painfully slow.


etcd is so sensitive to latency that the official documentation recommends using an SSD instead of regular hard drives.
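

For a sense of the knobs involved, here is a rough, illustrative fragment of an etcd pod manifest with the two timing flags that usually need retuning on high-latency links; the member name, image tag, and the concrete values are assumptions for the sketch, not recommendations:

apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
    - name: etcd
      image: quay.io/coreos/etcd:v3.3.12   # example image
      command:
        - etcd
        - --name=etcd-eu-west              # hypothetical member name
        # The heartbeat is usually set close to the round-trip time between members,
        # and the election timeout several times higher; cross-region RTTs push both
        # values up, and with them the time every write needs to reach consensus.
        - --heartbeat-interval=250         # milliseconds (default 100)
        - --election-timeout=2500          # milliseconds (default 1000)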


Right now there is no good, proven way to run a single cluster across a large, geographically distributed network.


For the most part, the developer community and the SIG-cluster group are trying to figure out how to orchestrate clusters the same way Kubernetes orchestrates containers.


Option 1: federate clusters with kubefed


The official answer from SIG-cluster is kubefed2, a new version of the original kube federation client and operator.


The kube federation tool was the first attempt to manage a collection of clusters as a single object.


It got off to a good start, but in the end kube federation never became popular because it did not support all resource types.


It supported Deployments and Services, but not, for example, StatefulSets.
The federation configuration was passed as annotations and was not particularly flexible.


Imagine trying to describe the split of replicas for each cluster in the federation using only annotations.
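

For a flavour of that style, here is a hypothetical illustration of how kubefed v1 expressed per-cluster replica preferences: a blob of JSON stuffed into an annotation. The exact annotation key and schema below are recalled from the v1 docs and may differ in detail; the shape of the approach is the point.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-deployment
  annotations:
    # Hypothetical illustration of the annotation-based configuration style.
    federation.kubernetes.io/replica-set-preferences: |
      {"rebalance": true,
       "clusters": {"cluster1": {"weight": 1}, "cluster2": {"weight": 2}}}
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx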


It turned into a complete mess.


SIG-cluster has done a lot of work since kubefed v1 and decided to approach the problem from a different angle.


Instead of annotations, they decided to release a controller that is installed on the clusters and can be customized using Custom Resource Definitions (CRDs).


For each resource you want to include in the federation, you have a custom CRD definition with three sections: template, placement, and overrides.



Here is an example of a FederatedDeployment with the placement and overrides sections:


apiVersion: types.federation.k8s.io/v1alpha1
kind: FederatedDeployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  template:                     # the Deployment to be distributed
    metadata:
      labels:
        app: nginx
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - image: nginx
              name: nginx
  placement:                    # which clusters receive it
    clusterNames:
      - cluster2
      - cluster1
  overrides:                    # per-cluster customizations
    - clusterName: cluster2
      clusterOverrides:
        - path: spec.replicas
          value: 5

As you can see, the Deployment is distributed across two clusters: cluster1 and cluster2.


The first cluster gets three replicas, while the second overrides that value to 5.


If you need more control over the number of replicas, kubefed2 provides a new ReplicaSchedulingPreference object, where replicas can be distributed by weight:


apiVersion: scheduling.federation.k8s.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: test-deployment
  namespace: test-ns
spec:
  targetKind: FederatedDeployment
  totalReplicas: 9
  clusters:
    A:
      weight: 1    # cluster A receives 3 of the 9 replicas
    B:
      weight: 2    # cluster B receives 6 of the 9 replicas

The CRD structure and the API are not yet finalized, and active work is underway in the official project repository.


Keep an eye on kubefed2, but remember that it is not yet suitable for production.


Learn more about kubefed2 in the official announcement on the Kubernetes blog and in the official repository of the kubefed project.


Option 2: combine clusters Booking.com-style


The developers at Booking.com did not bother with kubefed v2; instead, they built Shipper, an operator for deploying to multiple clusters, regions, and clouds.


Shipper is somewhat similar to kubefed2.


Both tools allow you to customize your deployment strategy across multiple clusters (which clusters are used and how many replicas they have).


But Shipper's goal is to reduce the risk of failed rollouts.


In Shipper, you define a series of steps that describe how replicas and incoming traffic are split between the previous and the current deployment.


When you push a resource to a cluster, the Shipper controller rolls the change out step by step across all the federated clusters.


Shipper is also quite limited, though.


For example, it accepts Helm charts as input and does not support vanilla resources.
In broad terms, Shipper works as follows.


Instead of a standard Deployment, you create an Application resource that references a Helm chart:


apiVersion: shipper.booking.com/v1alpha1
kind: Application
metadata:
  name: super-server
spec:
  revisionHistoryLimit: 3
  template:
    chart:
      name: nginx
      repoUrl: https://storage.googleapis.com/shipper-demo
      version: 0.0.1
    clusterRequirements:
      regions:
        - name: local
    strategy:
      steps:
        # "staging": bring up the contender at minimal capacity
        # while the incumbent keeps all the traffic
        - capacity:
            contender: 1
            incumbent: 100
          name: staging
          traffic:
            contender: 0
            incumbent: 100
        # "full on": shift all capacity and traffic to the contender
        - capacity:
            contender: 100
            incumbent: 0
          name: full on
          traffic:
            contender: 100
            incumbent: 0
    values:
      replicaCount: 3

Shipper is a good option for managing multiple clusters, but its tight coupling to Helm gets in the way.


What if we all move from Helm to kustomize or kapitan?


Learn more about Shipper and its philosophy in this official press release.


If you want to dig into the code, head over to the official project repository.


Option 3: "magic" cluster clustering


Kubefed v2 and Shipper both tackle cluster federation by providing clusters with new resources through custom resource definitions.


But what if you don’t want to rewrite all your Deployments, StatefulSets, DaemonSets, and so on just to federate them?


How do you include an existing cluster in a federation without changing any YAML?


The multi-cluster-scheduler is an Admiralty project for scheduling workloads across clusters.


But instead of inventing a new way of interacting with the cluster and wrapping resources in custom definitions, the multi-cluster-scheduler hooks into the standard Kubernetes lifecycle and intercepts all the calls that create pods.


Every pod that is created is immediately replaced by a dummy.


The multi-cluster-scheduler uses mutating admission webhooks to intercept the call and create an inactive dummy pod.

The original pod then goes through another scheduling cycle in which, after polling the entire federation, a placement decision is made.
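

As a rough illustration of the mechanism (this is not Admiralty's actual configuration; the names and paths below are made up), a mutating webhook that intercepts pod creation looks something like this:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: pod-interceptor                      # hypothetical name
webhooks:
  - name: pods.scheduler.example.com         # hypothetical webhook name
    clientConfig:
      service:
        name: multicluster-scheduler-webhook # hypothetical service
        namespace: kube-system
        path: /mutate-pods                   # hypothetical path
      # caBundle omitted for brevity
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]               # intercept pod creation only
        resources: ["pods"]
    failurePolicy: Ignore                    # never block pod creation if the webhook is down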


Finally, the pod is delivered to the target cluster.


As a result, you have an extra pod that does nothing but take up space.


The advantage is that you do not have to write any new resources to federate your deployments.


Every resource that creates a pod is automatically ready to be federated.
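

For example, a completely ordinary Deployment like the sketch below (the names are illustrative) contains nothing federation-specific, yet the pods it creates would be intercepted by the webhook and scheduled across the federation, assuming the scheduler is set up to act on their namespace:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: plain-app            # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: plain-app
  template:
    metadata:
      labels:
        app: plain-app
    spec:
      containers:
        - name: app
          image: nginx       # nothing federation-specific anywhere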


This is interesting because you suddenly have deployments spread across several regions without even noticing. However, it is also quite risky, because everything here rests on magic.


But while Shipper mostly tries to soften the impact of rollouts, the multi-cluster-scheduler solves a more general problem and is probably better suited for batch jobs.


It does not have an advanced incremental rollout mechanism.


You can read more about the multi-cluster-scheduler on its official repository page.


If you want to see the multi-cluster-scheduler in action, Admiralty has an interesting use case with Argo (Kubernetes workflows, events, CI and CD).


Other tools and solutions


Connecting and managing multiple clusters is a hard problem, and there is no universal solution.


If you want to learn more about this topic, here are some resources:



That's all for today


Thank you for reading to the end!


If you know a more effective way to connect multiple clusters, tell us.


We will add your approach to the list of links.


Special thanks to Chris Nesbitt-Smith and Vincent de Smet (reliability engineer at swatmobile.io) for reviewing the article and sharing useful insights into how federation works.



Source: https://habr.com/ru/post/454056/

