Kubernetes 1.8: a review of major innovations

The large, well-organized open source community behind Kubernetes has taught us to expect numerous significant changes in each release. Kubernetes 1.8 was no exception, bringing DevOps engineers and everyone else involved improvements and new features in almost all of its components.

The official release of Kubernetes 1.8 was scheduled for last Wednesday, but the official announcements (on the project blog and from CNCF) have not yet appeared. However, today at 3:35 am MSK, a change to the CHANGELOG was committed to the project's Git repository, indicating that Kubernetes 1.8 is ready for download and use.

So, what did the new release of Kubernetes 1.8 bring?

Network

An alpha version of IPVS mode support has been added to kube-proxy for load balancing (instead of iptables). In this mode, kube-proxy watches the services and endpoints in Kubernetes and calls the netlink interface to create IPVS virtual servers and real servers for them, respectively. In addition, it periodically synchronizes them to keep the IPVS state consistent. When a service is accessed, the traffic is redirected to one of the backend pods. IPVS also offers various load balancing algorithms (round-robin, least connection, destination hashing, source hashing, shortest expected delay, never queue). This feature was often requested in Kubernetes tickets, and we were eagerly waiting for it too.
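
To make the difference between two of the listed algorithms concrete, here is a minimal Python sketch of round-robin and least-connection selection. It only illustrates the selection logic and is not kube-proxy or IPVS code; the backend addresses and connection counts are made-up values.

```python
from itertools import cycle


def round_robin(backends):
    """Return an endless iterator that rotates through the backends in order."""
    return cycle(backends)


def least_connection(active_connections):
    """Pick the backend that currently has the fewest active connections."""
    return min(active_connections, key=active_connections.get)


# Hypothetical pod IPs standing in for IPVS "real servers".
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

rr = round_robin(backends)
rr_picks = [next(rr) for _ in range(4)]

# Hypothetical number of open connections per backend.
active = {"10.0.0.1": 5, "10.0.0.2": 2, "10.0.0.3": 7}
lc_pick = least_connection(active)

print(rr_picks)
print(lc_pick)
```

Round-robin ignores the current load and simply rotates, while least connection needs per-backend state, which is why the two can choose different backends for the same request.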

Other network innovations include beta support for egress rules (EgressRules) for outgoing traffic in the NetworkPolicy API, as well as the ability (in the same NetworkPolicy) to apply rules by source / destination CIDR (via ipBlockRule).
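
As a rough illustration, the sketch below builds a NetworkPolicy manifest with an egress rule restricted by CIDR and then checks addresses against that block. The policy name, labels, and address ranges are hypothetical, and the ip_block_allows helper is only an assumption-level model of "cidr minus except", not the code of any network plugin.

```python
import ipaddress
import json

# Hypothetical policy: pods labeled app=web may only send traffic to 10.0.0.0/24,
# excluding the upper half of that range.
policy = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "web-egress"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "web"}},
        "policyTypes": ["Egress"],
        "egress": [
            {"to": [{"ipBlock": {"cidr": "10.0.0.0/24",
                                 "except": ["10.0.0.128/25"]}}]}
        ],
    },
}


def ip_block_allows(block, ip):
    """Return True if ip is inside block['cidr'] and outside every 'except' range."""
    addr = ipaddress.ip_address(ip)
    if addr not in ipaddress.ip_network(block["cidr"]):
        return False
    return not any(addr in ipaddress.ip_network(net)
                   for net in block.get("except", []))


block = policy["spec"]["egress"][0]["to"][0]["ipBlock"]
print(json.dumps(policy, indent=2))
print(ip_block_allows(block, "10.0.0.5"), ip_block_allows(block, "10.0.0.200"))
```

The JSON printed by the script can be saved to a file and passed to kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.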

Scheduler

The main innovation in the scheduler is the ability to set pod priorities (in the pod specification, PodSpec, users define the PriorityClassName field, and Kubernetes sets Priority based on it). The goal is simple: to improve resource allocation when resources are scarce and the cluster has to run both truly critical tasks and less urgent or important ones. High-priority pods now have a better chance of being run. In addition, when resources in the cluster are freed up (preemption), lower-priority pods are affected rather than high-priority ones. In particular, the kubelet has changed its eviction strategy for this purpose, which now takes into account both the priority of pods and their resource consumption. All of these features are in alpha status. Kubernetes priorities and how to work with them are described in detail in the architecture documentation.
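
The idea of preemption can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions (a single node, CPU as the only resource, made-up pod names and numbers); the real scheduler weighs many more factors, so treat it as an illustration of "lower priority goes first" rather than as the actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int  # higher value means more important
    cpu: int       # requested CPU in millicores


def pick_victims(running, incoming, free_cpu):
    """Return lower-priority pods to evict so that `incoming` fits, or None.

    Victims are taken in ascending priority order and only from pods whose
    priority is strictly lower than that of the incoming pod.
    """
    victims = []
    candidates = sorted(
        (pod for pod in running if pod.priority < incoming.priority),
        key=lambda pod: pod.priority,
    )
    for pod in candidates:
        if free_cpu >= incoming.cpu:
            break
        victims.append(pod)
        free_cpu += pod.cpu
    return victims if free_cpu >= incoming.cpu else None


running = [Pod("batch", 10, 500), Pod("web", 100, 500), Pod("db", 1000, 500)]
critical = Pod("critical", 500, 800)
victims = pick_victims(running, critical, free_cpu=0)
print([pod.name for pod in victims])
```

Note that the "db" pod is never considered, because its priority is higher than that of the incoming pod.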

Another interesting innovation in alpha status is a more sophisticated mechanism for handling the conditions field (Condition, see the documentation) on nodes. Traditionally, this field records problematic states of a node: for example, when the network is unavailable, the NetworkUnavailable condition is set to True, and as a result pods are no longer scheduled to that node. With the new Taints Node by Condition approach, the same situation results in the node being tainted (for example, node.kubernetes.io/networkUnavailable=:NoSchedule), and based on that taint you can decide (in the pod specification) what to do next, such as whether to still schedule the pod to the problematic node.
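
How a pod "decides" about such a node comes down to tolerations. The sketch below models, in plain Python, the matching rules between a toleration and a taint as we understand them from the Kubernetes documentation; the function names are our own, and only the NoSchedule effect is considered.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint.

    An empty effect in the toleration matches any effect; the Exists operator
    ignores the value, while Equal (the default) compares it.
    """
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key combined with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(tolerations, taints):
    """A pod may land on a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(toleration, taint) for toleration in tolerations)
        for taint in taints
        if taint["effect"] == "NoSchedule"
    )


# Taint from the example above: node.kubernetes.io/networkUnavailable=:NoSchedule
taint = {"key": "node.kubernetes.io/networkUnavailable",
         "value": "", "effect": "NoSchedule"}
toleration = {"key": "node.kubernetes.io/networkUnavailable",
              "operator": "Exists", "effect": "NoSchedule"}

print(schedulable([toleration], [taint]))  # pod that opted in
print(schedulable([], [taint]))            # ordinary pod
```

A pod without the toleration is kept off the node, which reproduces the old behavior, while a pod that declares it can still be placed there.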

Storage

Specifying mount options for volumes has become stable, and at the same time:

in the PersistentVolume specification, a new MountOptions field has appeared for specifying mount options (instead of annotations);
in the StorageClass specification, a similar MountOptions field has appeared for dynamically created volumes.
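
For illustration, here is what the field might look like in both objects, written as Python dictionaries that serialize to JSON manifests. The names, the NFS server, and the provisioner are hypothetical placeholders, and the mount_flag helper is only our own way of showing how such a list maps onto a mount -o argument.

```python
import json

# Hypothetical PersistentVolume with mount options set in the spec.
pv = {
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {"name": "example-pv"},
    "spec": {
        "capacity": {"storage": "10Gi"},
        "accessModes": ["ReadWriteMany"],
        "mountOptions": ["hard", "nfsvers=4.1"],
        "nfs": {"server": "nfs.example.com", "path": "/exports"},
    },
}

# Hypothetical StorageClass; the provisioner name is a placeholder.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "example-class"},
    "provisioner": "example.com/hypothetical-provisioner",
    "mountOptions": ["debug"],
}


def mount_flag(options):
    """Join a list of mount options into the comma-separated form used by mount -o."""
    return ",".join(options)


print(json.dumps(pv, indent=2))
print(mount_flag(pv["spec"]["mountOptions"]))
print(mount_flag(storage_class["mountOptions"]))
```

Note that in the PersistentVolume the field sits inside spec, whereas in the StorageClass it is a top-level field.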

Kubernetes API metrics now include information about the available space in persistent volumes (PV), as well as success and latency metrics for all mount / unmount / attach / detach / provision / delete calls.

In the PersistentVolume specification for Azure File, CephFS, iSCSI, GlusterFS, you can now refer to resources in namespaces.

Among unstable innovations (in alpha and beta statuses):

Beta support for defining a reclaim policy has been added to StorageClass (similar to PersistentVolume), instead of always applying the default delete policy;
The ability to increase the size of a volume has been added to the Kubernetes API; the alpha version of this feature resizes only the volume itself (not the file system) and supports only Gluster;
Work has begun on storage isolation / restrictions: an alpha feature introduces a new ephemeral-storage resource, which covers all the disk space available to a container and lets you set limits and requests for it, manage quotas, and use limitrange (see the current documentation for details);
A new VolumeMount.Propagation field for VolumeMount in pod containers (alpha version) lets you set the Bidirectional value so that the same mounted directory can be used on the host and in other containers;
An early prototype for creating volume snapshots via the Kubernetes API is available; for now, these snapshots may be inconsistent, and the code responsible for them has been moved out of the Kubernetes core into an external repository.

kubelet

The kubelet has gained an alpha version of a new component, CPU Manager, which interacts directly with kuberuntime and lets you assign dedicated processor cores to pod containers (that is, CPU affinity policies at the container level). As stated in the documentation, it was introduced in response to two problems:

poor or unpredictable performance compared to virtual machines (due to the large number of context switches and insufficiently efficient use of the cache),
unacceptable latency attributed to the OS process scheduler, which is especially noticeable in virtual network functions.

Dynamic kubelet configuration is another feature in alpha status that allows you to update the configuration of this agent in all nodes of the live cluster. Bringing it to a stable state (GA) is expected only in release 1.10.

Metrics

Support for custom metrics in Horizontal Pod Autoscaler (HPA) has reached beta status, and the associated API has been moved to v1beta1.

metrics-server has become the recommended way to provide the resource metrics API. It is deployed as an add-on, similarly to Heapster. Fetching metrics directly from Heapster is deprecated.

Cluster Autoscaler

The Cluster Autoscaler utility, created to automatically resize a Kubernetes cluster (when there are pods that cannot start due to lack of resources, or when some nodes have been unused for a long time), has reached stable status (GA) and supports up to 1000 nodes.

In addition, when deleting nodes, Cluster Autoscaler now gives pods 10 minutes to shut down correctly (graceful termination). If a pod has not stopped within this time, the node is deleted anyway. Previously, this limit was 1 minute, or the autoscaler did not wait for correct termination at all.

kubeadm and kops

An alpha implementation of a cluster with a self-hosted control plane has appeared (kubeadm init with the flag --feature-gates=SelfHosting=true). Certificates can be stored on disk (hostPath) or in secrets. The new kubeadm upgrade subcommand (in beta status) lets you automatically upgrade a self-hosted cluster created with kubeadm.

Another new kubeadm feature in alpha status is the ability to run individual subtasks instead of the whole kubeadm init cycle by using the phase subcommand (currently available as kubeadm alpha phase; it will be brought to its official form in the next Kubernetes release). Its main purpose is better integration of kubeadm with provisioning tools such as kops and GKE.

In kops, meanwhile, there are two new features in alpha status: support for bare metal machines as targets and the ability to run as a server (see Kops HTTP API Server). Finally, GCE support in kops has been promoted to beta status.

CLI

The kubectl console utility has received experimental (alpha) support for plug-ins. This means that its standard set of commands can now be extended with plug-ins.

The rollout and rollback commands in kubectl now support StatefulSet .

API

API changes include APIListChunking, a new approach to returning responses to LIST requests. Responses are now split into small chunks and returned to the client according to the limit it specifies. As a result, the server consumes less memory and CPU when returning very large lists, and this behavior will become the default for all informers in Kubernetes 1.9.
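
The client side of chunking is a simple loop: request a page with a limit, then keep passing the returned continue token until it comes back empty. The sketch below simulates this against an in-memory fake server; limit and continue are the real request parameter names, but the list_pods function and its numeric token are our own stand-ins (real tokens are opaque strings that clients must not interpret).

```python
def list_pods(store, limit, continue_token=""):
    """Fake API server: return one chunk and a token pointing at the next chunk."""
    start = int(continue_token) if continue_token else 0
    next_start = start + limit
    token = str(next_start) if next_start < len(store) else ""
    return {"items": store[start:next_start], "metadata": {"continue": token}}


def list_all(store, limit):
    """Client loop: fetch chunks until the server returns an empty continue token."""
    items, token, calls = [], "", 0
    while True:
        response = list_pods(store, limit, token)
        calls += 1
        items.extend(response["items"])
        token = response["metadata"]["continue"]
        if not token:
            return items, calls


store = [f"pod-{i}" for i in range(7)]
items, calls = list_all(store, limit=3)
print(len(items), calls)
```

Each response holds at most limit items, so neither side ever has to hold the entire list in a single response.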

The CustomResourceDefinition API has learned to validate objects against a JSON schema (from the CRD specification); the alpha implementation is available as the CustomResourceValidation feature gate in kube-apiserver.
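
To give a feel for what such validation means, the sketch below defines a hypothetical CRD with a schema under spec.validation.openAPIV3Schema and checks two objects against it. The CronTab resource and its fields are invented for the example, and the violations function is a toy validator that understands only type, minimum, required, and properties; the API server uses a full OpenAPI v3 schema validator.

```python
# Hypothetical CRD; only the validation part matters for this example.
crd = {
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "crontabs.example.com"},
    "spec": {
        "group": "example.com",
        "version": "v1",
        "scope": "Namespaced",
        "names": {"plural": "crontabs", "singular": "crontab", "kind": "CronTab"},
        "validation": {"openAPIV3Schema": {"properties": {"spec": {
            "required": ["cronSpec"],
            "properties": {
                "cronSpec": {"type": "string"},
                "replicas": {"type": "integer", "minimum": 1},
            },
        }}}},
    },
}

TYPES = {"string": str, "integer": int}


def violations(obj, schema, path=""):
    """Toy validator: return a list of human-readable schema violations."""
    if "type" in schema and not isinstance(obj, TYPES[schema["type"]]):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if "minimum" in schema and obj < schema["minimum"]:
        errors.append(f"{path}: below minimum {schema['minimum']}")
    if isinstance(obj, dict):
        for name in schema.get("required", []):
            if name not in obj:
                errors.append(f"{path}.{name}: required")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in obj:
                errors.extend(violations(obj[name], sub_schema, f"{path}.{name}"))
    return errors


schema = crd["spec"]["validation"]["openAPIV3Schema"]
good = {"spec": {"cronSpec": "* * * * *", "replicas": 2}}
bad = {"spec": {"replicas": 0}}
print(violations(good, schema))
print(violations(bad, schema))
```

With validation enabled, an object like the second one would be rejected at creation time instead of being stored and failing later in a controller.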

The garbage collector has received support for custom APIs added via CustomResourceDefinition or aggregated API servers. Since the controller updates periodically, expect a delay of about 30 seconds between adding an API and the garbage collector starting to work on it.

Workload API

The so-called Workload API is the core part of the Kubernetes API related to "workloads" and includes DaemonSet, Deployment, ReplicaSet, and StatefulSet. These APIs have now been moved to the apps group, and with the release of Kubernetes 1.8 they have reached version v1beta2. Stabilizing the Workload API means placing these APIs in a separate group and making them as consistent as possible by deleting / adding / renaming existing fields, defining the same default values, and applying common validation. For example, the default spec.updateStrategy for StatefulSet and DaemonSet is now RollingUpdate, and defaulting of spec.selector for all Workload APIs has been disabled (due to incompatibility with kubectl apply and strategic merge patch), so the user now has to define it explicitly in the manifest. The umbrella ticket with details is #353.
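
In practice, the selector change means that an apps/v1beta2 manifest must spell out spec.selector. The sketch below shows a hypothetical Deployment written that way, together with a small pre-flight check of our own (selector_problem is not a Kubernetes function) that only looks at matchLabels and ignores matchExpressions.

```python
# Hypothetical Deployment for the apps/v1beta2 API with an explicit selector.
deployment = {
    "apiVersion": "apps/v1beta2",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web", "tier": "frontend"}},
            "spec": {"containers": [{"name": "web", "image": "nginx:1.13"}]},
        },
    },
}


def selector_problem(manifest):
    """Return a description of the selector problem, or None if it looks fine."""
    spec = manifest["spec"]
    match_labels = spec.get("selector", {}).get("matchLabels")
    if not match_labels:
        return "spec.selector is required"
    labels = spec["template"]["metadata"].get("labels", {})
    mismatched = sorted(key for key, value in match_labels.items()
                        if labels.get(key) != value)
    if mismatched:
        return "selector does not match template labels: " + ", ".join(mismatched)
    return None


without_selector = {
    **deployment,
    "spec": {k: v for k, v in deployment["spec"].items() if k != "selector"},
}
print(selector_problem(deployment))
print(selector_problem(without_selector))
```

The selector only has to be a subset of the template labels, which is why the extra tier label does not cause a problem here.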

Other

Among the other (quite numerous!) changes in the Kubernetes 1.8 release, I would note:

role-based access control (RBAC), which uses the rbac.authorization.k8s.io API group to enable dynamic policy configuration, has reached stable status (GA) and also received a beta version of a new API (SelfSubjectRulesReview) for viewing the actions a user can perform in a namespace;
an alpha version of a mechanism for storing resource encryption keys in third-party systems (Key Management Systems, KMS) has been introduced, along with a Google Cloud KMS plugin (#48522);
PodSecurityPolicies gained support for a whitelist of allowed paths for host volumes;
CRI-O support (Container Runtime Interface), based on the Open Container Initiative standard, has been declared stable (it passed all e2e tests) [CRI-O is the link between the kubelet and OCI-compatible runtimes such as runc; for details, see GitHub], and the cri-containerd project has reached alpha status;
multi-cluster support, formerly known as Federation, is preparing for a stable release (GA) in upcoming Kubernetes releases; for now, alpha implementations are available of Federated Jobs, which are automatically deployed to multiple clusters, and of Federated Horizontal Pod Autoscaling (HPA), which works similarly to the regular HPA but, again, spans multiple clusters;
the team responsible for scalability has formalized its testing process, documented the existing thresholds, and defined new sets of service levels (Service Level Indicators and Service Level Objectives).

PS

During the preparation of Kubernetes 1.8, the project was built with the following Docker versions: 1.11.2, 1.12.6, 1.13.1, and 17.03.2. For a list of known issues with them, see here. In the same document, entitled "Introduction to v1.8.0", you can find a more complete list of all major changes.

We ourselves delayed updating the Kubernetes clusters we maintain from release 1.6 to 1.7 and carried out the main migration only 2 weeks ago (at the moment, several installations are still on version 1.6). A full update to the new release, 1.8, is planned for October.
