Translator's note: This is the second and final part of our translation of the article originally titled "What happens when ... Kubernetes edition!", which walks through what happens when nginx is deployed to a cluster.
While the first part covered the work of kubectl, kube-apiserver, etcd, and initializers, this one is about the Deployments and ReplicaSets controllers, informers, the scheduler, and kubelet. As a reminder, we stopped at the point where the request sent by the user (via kubectl) had been authorized and executed in Kubernetes: the new objects (resources) were created, stored in the database (etcd), initialized, and made available through the apiserver.
Control loops
Deployments controller
By this stage, the Deployment record exists in etcd and all the initialization logic has completed. The next steps set up the resource topology that Kubernetes relies on. If you think about it, a Deployment is really just a collection of ReplicaSets, and a ReplicaSet is a collection of pods. How does Kubernetes build this hierarchy from a single HTTP request? This is where the built-in Kubernetes controllers take over.
Kubernetes makes heavy use of "controllers" throughout the system. A controller is an asynchronous routine that reconciles the current state of the Kubernetes system with the desired one. Each controller is responsible for its own small piece and is run by the kube-controller-manager component. Let's meet the first one to enter the process: the Deployments controller.
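Before diving into that controller, here is a deliberately simplified, self-contained sketch of the reconcile loop just described. The types and fetch functions are invented for illustration; real controllers are built on informers and work queues (more on those below).

    // A simplified sketch of the controller pattern: compare current state with
    // desired state and correct the difference, forever.
    package main

    import (
        "fmt"
        "time"
    )

    // state is a stand-in for what a real controller would read from the
    // apiserver (usually through an informer cache).
    type state struct{ replicas int }

    func fetchDesired() state { return state{replicas: 3} } // hypothetical
    func fetchCurrent() state { return state{replicas: 0} } // hypothetical

    // reconcile compares the current state with the desired one and takes
    // corrective action when they differ.
    func reconcile() {
        desired, current := fetchDesired(), fetchCurrent()
        if missing := desired.replicas - current.replicas; missing > 0 {
            fmt.Printf("creating %d replicas\n", missing)
            // a real controller would POST new objects to kube-apiserver here
        }
    }

    func main() {
        // Each controller runs a loop like this asynchronously.
        for {
            reconcile()
            time.Sleep(10 * time.Second)
        }
    }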
After the Deployment record has been saved in etcd and initialized, it becomes visible through kube-apiserver. When this new resource appears, it is detected by the Deployments controller, whose job is to track changes to Deployment records. In our case, the controller registers a specific callback for creation events via an informer (see below for what that is).
This handler is invoked when our Deployment first becomes available and starts its work by adding the object to an internal work queue. By the time it gets around to processing the object, the controller inspects the Deployment and realizes that no ReplicaSet or Pod records are associated with it. It gets this information by querying kube-apiserver with label selectors (for more information, see the Kubernetes documentation - translator's note). Interestingly, this synchronization process is state agnostic: it checks new records the same way as existing ones.
Having learned that the required records do not exist, the controller begins a scaling process to reach the desired state. It does so by rolling out (i.e. creating) a ReplicaSet resource, assigning it the label selector, and giving it the first revision. The ReplicaSet's PodSpec and other metadata are copied from the Deployment's manifest. Sometimes the Deployment record also needs to be updated after this (for instance, if a progress deadline is set via the .spec.progressDeadlineSeconds field - translator's note).
After that, the status is updated and the controller re-enters the same reconciliation loop, comparing the Deployment against the desired, completed state. Since this controller only knows how to create ReplicaSets, reconciliation continues with the next controller, which is responsible for ReplicaSets.
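As a rough illustration of the label-selector query mentioned above, here is a hedged client-go sketch (not the actual controller code); the namespace, selector value, and kubeconfig path are assumptions.

    // Query kube-apiserver for ReplicaSets matching a Deployment's label selector.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset := kubernetes.NewForConfigOrDie(config)

        // List ReplicaSets whose labels match the Deployment's selector.
        rsList, err := clientset.AppsV1().ReplicaSets("default").List(context.TODO(),
            metav1.ListOptions{LabelSelector: "app=nginx"})
        if err != nil {
            panic(err)
        }
        fmt.Printf("found %d matching ReplicaSets\n", len(rsList.Items))
    }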
ReplicaSets controller
In the previous step, the Deployments controller created the first ReplicaSet for our Deployment, but we still have no pods. This is where the ReplicaSets controller comes in. Its job is to watch the life cycle of ReplicaSets and their dependent resources (pods). Like most other controllers, it does this through handlers triggered by certain events.
The event we care about is creation. When a ReplicaSet is created (as a result of the Deployments controller's activity), the RS controller inspects the state of the new ReplicaSet and sees a difference between what exists and what is required. It then corrects the state by rolling out pods that will belong to the ReplicaSet. Their creation is done carefully, matching the ReplicaSet's replica count (inherited from the parent Deployment).
Create operations for pods are also batched, starting with SlowStartInitialBatchSize and doubling that value with every successful iteration of the "slow start" operation. This is designed to reduce the risk of flooding kube-apiserver with unnecessary HTTP requests when pod creation fails repeatedly (for example, because of resource quotas). If we are going to fail, better to fail with minimal impact on the rest of the system.
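Here is a minimal sketch of that "slow start" batching, assuming an initial batch size of 1 (mirroring the upstream SlowStartInitialBatchSize constant) and a stand-in createPod function.

    // Create pods in exponentially growing batches, bailing out early on failure.
    package main

    import "fmt"

    const slowStartInitialBatchSize = 1

    func createPod(i int) error {
        fmt.Println("creating pod", i)
        return nil // a real implementation would POST the pod to kube-apiserver
    }

    // slowStart creates `total` pods in growing batches, limiting the pressure
    // put on the apiserver when things go wrong.
    func slowStart(total int) {
        created := 0
        for batch := slowStartInitialBatchSize; created < total; batch *= 2 {
            if remaining := total - created; batch > remaining {
                batch = remaining
            }
            for i := 0; i < batch; i++ {
                if err := createPod(created + i); err != nil {
                    return // fail with minimal impact on the rest of the system
                }
            }
            created += batch
        }
    }

    func main() { slowStart(3) } // creates 1 pod, then 2 more

With a replica count of 3, this creates one pod, then two more, stopping early if a batch fails.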
Kubernetes enforces its object hierarchy through Owner References (a field on the child resource that references the ID of its parent). This not only ensures that child resources are garbage collected once a controller-managed resource is deleted (cascading deletion), but also gives parent resources an effective way not to fight over their children (imagine a scenario where two would-be parents both think they own the same child).
Another benefit of the Owner References design is that it is stateless: if a controller has to restart, its downtime does not affect the rest of the system, because the resource topology does not depend on the controller. This focus on isolation also shapes the design of the controllers themselves: they must not operate on resources they do not explicitly own. Instead, controllers should be selective in claiming ownership of resources, non-interfering, and non-sharing.
But back to owner references. Sometimes "orphaned" resources appear in the system, usually because:
- the parent was deleted but its children were not;
- garbage collection policies forbid deleting the children.
When this happens, controllers make sure orphans are adopted by a new parent. Multiple parents can compete for a child, but only one of them will succeed (the others get a validation error).
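To make owner references a bit more tangible, here is a hedged sketch of the metadata a child ReplicaSet carries to point back at its parent Deployment; the field names come from the Kubernetes API types, while the concrete names and UID are invented for illustration.

    // A child ReplicaSet referencing its parent Deployment via OwnerReferences.
    package main

    import (
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
    )

    func main() {
        isController := true
        rs := appsv1.ReplicaSet{
            ObjectMeta: metav1.ObjectMeta{
                Name: "nginx-deployment-6b474476c4",
                OwnerReferences: []metav1.OwnerReference{{
                    APIVersion: "apps/v1",
                    Kind:       "Deployment",
                    Name:       "nginx-deployment",
                    UID:        types.UID("c1a0f1e2-0000-0000-0000-000000000000"),
                    Controller: &isController, // only one owner may be the controller
                }},
            },
        }
        fmt.Println(rs.OwnerReferences[0].Kind, "owns", rs.Name)
    }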
Informers
As you may have noticed, some controllers, like the RBAC authorizer or the Deployments controller, need to retrieve cluster state to do their job. Returning to the RBAC authorizer example: we know that when a request comes in, the authenticator saves an initial representation of the user's state for later use. The RBAC authorizer then uses it to fetch all the roles and role bindings associated with that user from etcd. How are controllers supposed to read and modify such resources? It turns out this is a common use case, and in Kubernetes it is solved with informers.
An informer is a pattern that lets controllers subscribe to storage events and list the resources they are interested in. Besides providing a clean abstraction to work with, it also takes care of a lot of plumbing, such as caching (important because it reduces both the number of connections to kube-apiserver and the amount of repeated serialization on the server and controller sides). This design also lets controllers interact in a thread-safe way without stepping on anyone else's toes.
For more on how informers work in relation to controllers, read this blog post.
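For a feel of what using an informer looks like in code, here is a minimal client-go sketch that subscribes to Deployment creation events; error handling is trimmed and the kubeconfig path is an assumption.

    // Subscribe to Deployment additions via a shared informer.
    package main

    import (
        "fmt"
        "time"

        appsv1 "k8s.io/api/apps/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, _ := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        clientset := kubernetes.NewForConfigOrDie(config)

        factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
        informer := factory.Apps().V1().Deployments().Informer()

        // The informer keeps a local cache, so handlers do not hammer kube-apiserver.
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                d := obj.(*appsv1.Deployment)
                fmt.Println("deployment added:", d.Name)
            },
        })

        stop := make(chan struct{})
        factory.Start(stop)
        cache.WaitForCacheSync(stop, informer.HasSynced)
        select {} // block forever; a real controller would run worker goroutines here
    }

A real controller would also register Update and Delete handlers and push object keys onto a work queue instead of acting inside the callback.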
(Translator's note: The operation of informers was also described in this translated article from our blog.)
Scheduler
Once all the controllers have done their job, we have a Deployment, a ReplicaSet, and 3 pods stored in etcd and available through kube-apiserver. Our pods, however, are stuck in the Pending state because they have not yet been scheduled, i.e. assigned to a node. The last controller to take care of that is the scheduler.
The scheduler runs as a standalone control plane component and works like the other controllers: it watches for events and tries to bring the current state toward the desired one. Specifically, it selects pods whose PodSpec has an empty NodeName field and tries to find a suitable node to assign them to. To find that node, a dedicated scheduling algorithm is used. By default it works like this (a simplified sketch of this filter-and-score flow follows after the list):
- When the scheduler starts, a chain of default predicates is registered. These predicates are essentially functions that, when evaluated, filter out the nodes unsuitable for hosting a pod. For example, if the PodSpec explicitly requests CPU or RAM and a node cannot satisfy those requests due to a lack of capacity, the node is excluded (a node's remaining capacity is calculated as its total capacity minus the sum of resources requested by the containers currently running on it).
- Once the suitable nodes have been selected, a series of priority functions is run to rank them and pick the most appropriate one. For example, to spread the workload across the system, preference is given to nodes with the fewest requested resources (an indicator of a lighter workload). As these functions run, each node is assigned a numerical score, and the node with the highest score is chosen for scheduling (assignment).
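Here is the promised toy sketch of the filter-and-score flow. It uses one invented predicate ("does the node have enough free CPU?") and one priority function ("prefer the node with the most free CPU"); the node data is made up.

    // Filter nodes with predicates, rank the survivors with a priority function.
    package main

    import "fmt"

    type node struct {
        name                string
        allocatableMilliCPU int64
        requestedMilliCPU   int64
    }

    // Predicate: can this node fit a pod requesting podMilliCPU?
    func fitsResources(n node, podMilliCPU int64) bool {
        return n.allocatableMilliCPU-n.requestedMilliCPU >= podMilliCPU
    }

    // Priority: prefer nodes with the largest share of unrequested CPU.
    func leastRequestedScore(n node) int64 {
        return (n.allocatableMilliCPU - n.requestedMilliCPU) * 10 / n.allocatableMilliCPU
    }

    func main() {
        nodes := []node{
            {"node-1", 4000, 3500},
            {"node-2", 4000, 1000},
            {"node-3", 2000, 1900},
        }
        podMilliCPU := int64(500)

        best, bestScore := "", int64(-1)
        for _, n := range nodes {
            if !fitsResources(n, podMilliCPU) {
                continue // filtered out by the predicate
            }
            if s := leastRequestedScore(n); s > bestScore {
                best, bestScore = n.name, s
            }
        }
        fmt.Println("scheduling the pod onto", best) // node-2 in this toy example
    }

The real scheduler evaluates many predicates and priority functions and combines weighted scores, but the overall filter-then-rank shape is the same.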
When the algorithm has picked a node, the scheduler creates a Binding object whose Name and UID match the pod and whose ObjectReference field holds the name of the chosen node. The Binding is then sent to the apiserver via a POST request.
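Expressed with client-go's Bind helper (equivalent to POSTing to the pod's binding subresource), the request might look roughly like this; the pod and node names are assumptions.

    // Bind a pod to a node the way the scheduler does.
    package main

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset := kubernetes.NewForConfigOrDie(config)

        binding := &v1.Binding{
            ObjectMeta: metav1.ObjectMeta{
                Namespace: "default",
                Name:      "nginx-6b474476c4-abcde", // the pod being scheduled
            },
            Target: v1.ObjectReference{
                Kind: "Node",
                Name: "worker-1", // the node the algorithm picked
            },
        }
        // POST /api/v1/namespaces/default/pods/nginx-6b474476c4-abcde/binding
        if err := clientset.CoreV1().Pods("default").Bind(context.TODO(), binding, metav1.CreateOptions{}); err != nil {
            panic(err)
        }
    }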
When kube-apiserver receives this Binding object, the registry deserializes it and updates the following fields on the Pod object:
- sets NodeName to the value from the ObjectReference;
- adds the relevant annotations;
- sets the PodScheduled status condition to True.
Once the scheduler has bound a pod to a node, the kubelet running on that node takes over.
A note on scheduler customization: interestingly, both predicates and priority functions are extensible and can be defined with the --policy-config-file flag. This provides a degree of flexibility. Administrators can also run custom schedulers (controllers with arbitrary processing logic) as standalone Deployments. If a PodSpec contains a schedulerName, Kubernetes hands the scheduling of that pod over to whichever scheduler has registered under that name.
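A tiny sketch of what that looks like in a pod definition; the custom scheduler name here is an assumption, and any scheduler registered under that name would pick the pod up.

    // Hand a pod to a custom scheduler via the schedulerName field.
    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        pod := v1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "nginx-custom"},
            Spec: v1.PodSpec{
                SchedulerName: "my-custom-scheduler", // the default is "default-scheduler"
                Containers:    []v1.Container{{Name: "nginx", Image: "nginx"}},
            },
        }
        fmt.Println(pod.Name, "will be scheduled by", pod.Spec.SchedulerName)
    }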
kubelet
Synchronization
Okay, the main controller chain has finished, phew! Let's recap: the HTTP request passed authentication, authorization, and admission control; the Deployment, ReplicaSet, and three Pod resources were created in etcd; a set of initializers ran; and, finally, a suitable node was assigned to each pod. So far, however, the state we have been discussing exists only in etcd. The next steps distribute this state across the worker nodes, which is the whole point of a distributed system like Kubernetes. This happens through a component called kubelet. Let's go!
Kubelet is an agent that runs on every node of a Kubernetes cluster and, among other things, is responsible for managing the pod life cycle. That means it handles all the logic of translating the "pod" abstraction (which is really just a Kubernetes concept) into its building blocks: containers. It also handles all the logic around mounting volumes, container logging, garbage collection, and many other important things.
It is convenient to think of kubelet as yet another controller. Every 20 seconds (this is configurable) it polls kube-apiserver for pods [such intervals in Kubernetes were discussed in this post of our blog - translator's note], filtering for those whose NodeName matches the name of the node the kubelet is running on. Having received that list, it compares it against its own internal cache, detects new additions, and starts synchronizing state if there are any differences. Let's look at this synchronization process:
- If the pod is being created (our case), kubelet registers some startup metrics, which Prometheus uses to track pod latency.
- A PodStatus object is then generated, representing the state of the pod's current phase. A pod's phase is a high-level summary of where the pod is in its life cycle. Examples: Pending, Running, Succeeded, Failed, and Unknown. Determining it is not trivial, so let's look at exactly what happens (a simplified sketch of this phase computation follows after the list):
  - first, a chain of PodSyncHandlers is executed. Each handler checks whether the pod should still reside on the node. If any of them decides it no longer belongs there, the pod's phase changes to PodFailed and it is eventually evicted from the node. An example is evicting a pod once its activeDeadlineSeconds has been exceeded (used with Jobs);
  - next, the pod's phase is determined by the state of its init and regular containers. Since our containers have not been started yet, they are classified as waiting. Any pod with a waiting container is in the Pending phase;
  - finally, the pod condition is determined by the condition of its containers. Since none of our containers has yet been created by the container runtime, the PodReady condition is set to False.
- Once the PodStatus is generated, it is sent to the pod's status manager, which asynchronously updates the etcd record via the apiserver.
- Next, a series of admission handlers runs to make sure the pod has the correct security permissions. Among other things, they enforce AppArmor profiles and NO_NEW_PRIVS. Pods rejected at this stage stay in the Pending state indefinitely.
- If the cgroups-per-qos runtime flag is specified, kubelet creates cgroups for the pod and applies resource parameters to them. This enables better Quality of Service (QoS) handling for pods.
- Data directories are created for the pod: the pod directory (usually /var/run/kubelet/pods/<podID>), its volumes directory (<podDir>/volumes), and its plugins directory (<podDir>/plugins).
- The volume manager attaches and waits for any volumes defined in Spec.Volumes. Depending on the type of volume being mounted (for example, cloud or NFS volumes), some pods may have to wait longer.
- All secrets listed in Spec.ImagePullSecrets are fetched from the apiserver so that they can later be injected into the container.
- Then the container runtime starts the container (described in more detail below).
CRI and pause containers
We are now at the stage where most of the preparatory work is done and the container is ready to be launched. The software that performs this launch is called the Container Runtime (for example, docker or rkt).
In an effort to be more extensible, since version 1.5.0 kubelet has used a concept called CRI (Container Runtime Interface) to interact with concrete container runtimes. In short, CRI provides an abstraction between the kubelet and a specific runtime implementation. Communication happens via Protocol Buffers (roughly speaking, a faster JSON) and a gRPC API (an API type well suited to performing Kubernetes operations). This is a really neat idea, because by using a well-defined contract between the kubelet and the runtime, the actual implementation details of how containers are orchestrated become largely irrelevant. All that matters is the contract. This approach allows new runtimes to be added with minimal overhead, since no core Kubernetes code needs to change.
(Translator's note: We wrote more about the CRI interface in Kubernetes and its implementation CRI-O in this article.)
Enough lyrical digressions, let's get back to deploying our container... When a pod is started for the first time, kubelet makes the RunPodSandbox remote procedure call (RPC). A "sandbox" is the CRI term for a set of containers, which in Kubernetes means, you guessed it, a pod. The term is deliberately broad so that it does not lose its meaning for other runtimes that may not actually use containers (imagine a hypervisor-based runtime where the sandbox is a virtual machine).
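As a hedged illustration, the RPC might look roughly like this when made directly against a CRI runtime; the import path (runtime/v1) and the containerd socket are assumptions, since older kubelets used the v1alpha2 API and, in the Docker-based setup described here, the call went through dockershim instead.

    // Call RunPodSandbox on a CRI runtime over gRPC.
    package main

    import (
        "context"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
    )

    func main() {
        conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
            grpc.WithTransportCredentials(insecure.NewCredentials()))
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        client := runtimeapi.NewRuntimeServiceClient(conn)
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // The sandbox config mirrors what kubelet derives from the PodSpec.
        resp, err := client.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{
            Config: &runtimeapi.PodSandboxConfig{
                Metadata: &runtimeapi.PodSandboxMetadata{
                    Name:      "nginx-6b474476c4-abcde",
                    Namespace: "default",
                    Uid:       "c1a0f1e2-0000-0000-0000-000000000000",
                },
            },
        })
        if err != nil {
            panic(err)
        }
        _ = resp.PodSandboxId // the handle used for all later container RPCs
    }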
In our case we are using Docker. In this runtime, creating a sandbox means creating a "pause" container. The pause container serves as the parent for all the other containers in the pod, since it hosts many of the pod-level resources that the workload containers will use. These "resources" are Linux namespaces (IPC, network, PID). If you are not familiar with how containers work in Linux, here is a quick refresher. The Linux kernel has the concept of namespaces, which allow the host OS to carve out a dedicated set of resources (for example, CPU or memory) and assign it to a process as if that process were the only one consuming them. Cgroups matter too, because they are how Linux governs resource allocation (acting as the cop that polices resource usage). Docker uses both of these kernel features to host a process with guaranteed resources and enforced isolation. For more on how Linux containers work, see this wonderful post by b0rk: "What even is a Container?".
The pause container provides a way to host all these namespaces and lets the child containers share them. By being part of the same network namespace, containers within one pod can reach each other via localhost. The second role of the pause container is related to how PID namespaces work. In namespaces of this type, processes form a hierarchical tree, and the top "init" process takes responsibility for "reaping" dead processes (removing their entries from the operating system's process table - translator's note). Details on how this works can be found in this excellent article. Once the pause container is created, it is checkpointed to disk and started.
CNI and network
Our pod now has a skeleton: a pause container that hosts all the namespaces needed for communication within the pod. But how does networking actually work, and how is it set up?
When kubelet sets up networking for a pod, it delegates the task to a CNI plugin. CNI stands for Container Network Interface and operates on a principle similar to the Container Runtime Interface: in short, it is an abstraction that allows different network providers to use different networking implementations for containers. Plugins are registered, and kubelet interacts with them by streaming JSON data (configuration files live in /etc/cni/net.d) to the relevant CNI binary (located in /opt/cni/bin) via stdin. Here is an example of such a JSON configuration:
{
    "cniVersion": "0.3.1",
    "name": "bridge",
    "type": "bridge",
    "bridge": "cnio0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
            [{"subnet": "${POD_CIDR}"}]
        ],
        "routes": [{"dst": "0.0.0.0/0"}]
    }
}
Additional metadata for the pod, such as its name and namespace, is passed via the CNI_ARGS environment variable.
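As a rough illustration of this contract, here is a sketch of how a runtime might invoke the bridge plugin: the operation is described by CNI_* environment variables and the JSON configuration is piped to stdin. The paths, container ID, and netns value are assumptions for illustration.

    // Invoke a CNI plugin binary the way a container runtime would.
    package main

    import (
        "bytes"
        "os"
        "os/exec"
    )

    func main() {
        conf, err := os.ReadFile("/etc/cni/net.d/10-bridge.conf")
        if err != nil {
            panic(err)
        }

        cmd := exec.Command("/opt/cni/bin/bridge")
        cmd.Env = append(os.Environ(),
            "CNI_COMMAND=ADD",                 // add the container to the network
            "CNI_CONTAINERID=abc123",          // the pause container's ID
            "CNI_NETNS=/var/run/netns/abc123", // its network namespace
            "CNI_IFNAME=eth0",                 // interface to create inside it
            "CNI_PATH=/opt/cni/bin",           // where to look for plugins
            "CNI_ARGS=K8S_POD_NAMESPACE=default;K8S_POD_NAME=nginx", // extra pod metadata
        )
        cmd.Stdin = bytes.NewReader(conf)
        cmd.Stdout = os.Stdout // the plugin replies with JSON on stdout
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }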
What happens next depends on the CNI plugin; let's look at the bridge plugin:
- The plugin first sets up a local Linux bridge in the root network namespace to serve all containers on the host.
- It then inserts an interface (one end of a veth pair) into the pause container's network namespace and attaches the other end to the bridge. A veth pair is best thought of as a big tube: one end is connected to the container and the other sits in the root network namespace, letting packets travel between them.
- The pause container's interface is then assigned an IP address and routes are set up. As a result, the pod gets its own IP address. IP assignment is delegated to the IPAM plugin specified in the JSON configuration.
- IPAM plugins are similar to the main network plugins: they are invoked as a binary and have a standardized interface. Each must determine the IP and subnet of the container's interface, along with the gateway and routes, and return this information to the main plugin. The most common IPAM plugin, host-local, allocates IP addresses from a predefined range. It stores its state locally on the host filesystem, thereby guaranteeing the uniqueness of IP addresses on a single host.
- For DNS, kubelet passes the internal DNS server's IP address to the CNI plugin, which makes sure the container's resolv.conf file is set up accordingly.
Once these steps are complete, the plugin returns JSON data to kubelet describing the result of the operation.
(Translator's note: We have also written about CNI in this article of our blog.)
Inter-host networking
So far we have described how containers connect to the host, but how do hosts talk to each other? Obviously, this has to happen when two pods on different machines want to communicate.
This is usually accomplished with overlay networking, a way to dynamically synchronize routes between multiple hosts. One popular overlay network provider is Flannel. Its main job is to provide a layer-3 IPv4 network between the nodes of a cluster. Flannel does not control how containers are connected to the host (that is CNI's job), but rather how traffic is transported between hosts. To do this, it selects a subnet for the host and registers it in etcd. It then keeps a local view of the cluster routes and encapsulates outgoing packets in UDP datagrams, making sure they reach the right host.
For more details, see the CoreOS documentation.
Container startup
All the networking work is done. What's left? Actually launching the workload containers.
Once the sandbox has finished initializing and is active, kubelet can start creating containers for it. It first starts any init containers defined in the PodSpec, and then the main containers themselves. The process looks like this:
- Pull the image for the container. Any secrets defined in the PodSpec are used for private registries.
- Create the container via CRI. Kubelet populates a ContainerConfig structure (defining the command, image, labels, mounts, devices, environment variables, and so on) from the PodSpec and sends it via protobufs to the CRI plugin. In Docker's case, the plugin deserializes the payload and fills in its own configuration structures for the Daemon API. Along the way it applies a few metadata labels (such as the container type, log path, and sandbox ID) to the container.
- The container is then registered with the CPU Manager, an alpha feature in 1.8 that assigns containers to sets of CPUs on the local node using the UpdateContainerResources CRI method.
- The container is started.
- If any post-start lifecycle hooks are registered, they are run. Hooks can be of the Exec type (executing a specific command inside the container) or HTTP (performing an HTTP request against a container endpoint). If a PostStart hook takes too long, hangs, or fails, the container will never reach the Running state.
Results
Okay. We're done. The end.
After all these steps we have 3 containers running on one or more cluster nodes. All the networking, volumes, and secrets have been set up by kubelet and turned into containers via the CRI plugin.