Hundreds of containers. Millions of external queries. Billions of internal transactions. Monitoring and notification of problems. Simple scaling. 99% uptime. Deploying and rolling back releases.
Kubernetes as a solution to all problems! “To be or not to be?” - that is the question!
Disclaimer
Despite this article being public, I am most likely writing it first of all for myself, like a conversation with a rubber duck. After a voyage of more than two years with “hipster” technologies, I have to step aside and soberly assess how justified and adequate they will be for my next project.
Nevertheless, I very much hope that this publication will find its readers and help many of them approach the choice of - or the rejection of - Kubernetes well prepared.
I will try to describe all the experience we have gained at `Lazada Express Logistics`, a company that is part of `Lazada Group`, which in turn is part of `Alibaba Group`. We develop and support systems that automate, as far as possible, the entire operational cycle of delivery and fulfilment in the 6 largest countries of Southeast Asia.
Prerequisites for use
One day, a representative of a company selling cloud solutions all over the world asked me: “What is the 'cloud' for you?”. After hesitating for a couple of seconds (and thinking: “Hmmm... our dialogue is clearly not about water vapor condensed in the atmosphere...”), I replied that, to me, it is like one ultra-reliable computer with unlimited resources and practically no overhead for moving data around (network, disk, memory, etc.). As if my laptop worked for the whole world, could hold such a load, and I alone could manage it.
Actually, why do we need this cloudy miracle? Everything is very simple! We strive to make life easier for developers, system administrators, DevOps engineers, and technical managers, and such a thing as a properly prepared cloud makes life easier for all of them. Besides, monomorphic systems working for a business are always cheaper and generate fewer risks.
We set out to find a simple, convenient and reliable private cloud platform for all of our applications and for all the roles in the team listed above. We did a little research: Docker, Puppet, Swarm, Mesos, OpenShift + Kubernetes, Kubernetes without OpenShift... and settled on the latter - Kubernetes without any add-ons.
The functionality described on the very first page fit perfectly and suited our entire enterprise. A detailed study of the documentation, conversations with colleagues, and a little quick testing experience - all this gave us confidence that the authors of the product were not lying and that we would be able to get our magnificent cloud!
We rolled up our sleeves. And so it began...
Problems and solutions
3-tier architecture
Everything comes with the basics. In order to create a system that can live well in a Kubernetes cluster, you will need to think about the architecture and development processes, set up a bunch of delivery mechanisms and tools, learn to put up with the limitations / concepts of the Docker world and isolated processes.
As a result, we came to the conclusion that the ideology of microservice and service-oriented architecture is no easy walk for our tasks. If you read Martin Fowler's article on this topic (or its translation), you should more or less understand what titanic work must be done before the first service comes to life.
My checklist divides the infrastructure into three layers and then roughly describes what you need to keep in mind when building such systems at each level. The three layers in question:
- Hardware - servers, physical networks
- Cluster - in our case Kubernetes and system services supporting it (flannel, etcd, confd, docker)
- Service - the Docker-packaged processes themselves: the micro/macro services of your domain
In general, the idea of a 3-layer architecture, and the tasks associated with it, is a topic for a separate article. But that article will not see the light of day before this very checklist is immaculately complete. And that may never happen :)
Qualified specialists
As the topic of private clouds becomes ever more relevant and interesting to medium and large businesses, so grows the question of qualified architects, DevOps engineers, developers, and database administrators who can work with them.
The reason is that new technologies enter the market before they have time to accumulate the necessary volume of documentation, training articles, and answers on `Stack Overflow`. Despite this, such technologies - Kubernetes, in our case - become very popular and create a shortage of personnel.
The solution is simple - you need to cultivate specialists within the company! Fortunately, in our case, we already knew what Docker is and how to prepare it; the rest we had to catch up on.
Continuous Delivery / Integration
In spite of the beauty of the “smart cloud cluster” technology, we still needed a way to communicate with Kubernetes and install objects inside it. Having travelled the road from a self-written bash script with hundreds of branches of logic, we ended up with quite understandable and readable Ansible recipes. To fully transform Docker files into live objects, we needed:
- A set of standard solutions:
- images-builder - a script that recursively searches the repository for Dockerfiles, builds images from them, and pushes the results to a centralized registry
- Ansible Kubernetes Module - a module for installing objects with different strategies depending on the object (create or update / create or replace / create or skip)
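To illustrate the first item, here is a minimal sketch of what the discovery half of an `images-builder`-style script might look like. This is not our actual tool - the function names and the path-to-image naming convention are hypothetical - but it shows the recursive Dockerfile search and the mapping from repository path to registry image name:

```python
import os


def find_dockerfiles(root):
    """Recursively collect the paths of all Dockerfiles under `root`.

    Matches both plain `Dockerfile` and variants like `Dockerfile.dev`.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "Dockerfile" or name.startswith("Dockerfile."):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def image_name(dockerfile_path, root, registry):
    """Derive an image name from the Dockerfile's directory, e.g.
    <repo>/svc/api/Dockerfile -> <registry>/svc-api
    (a hypothetical convention, chosen here for illustration)."""
    rel = os.path.relpath(os.path.dirname(dockerfile_path), root)
    return "{}/{}".format(registry, rel.replace(os.sep, "-"))
```

The real script would then shell out to `docker build` and `docker push` for each discovered file; tagging policy and error handling are omitted here.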
In passing, we also studied Kubernetes Helm. Nevertheless, we could not find the killer feature that would force us to abandon or replace Ansible templating with Helm charts, nor could we find any other useful capabilities in this solution.
For example, how do you check that one object has been installed successfully before rolling out the next ones? Or how do you perform finer configuration of containers that are already running, when you only need to execute a couple of commands inside them?
These and many other questions force one to treat Helm as a simple template engine. But why?.. when Jinja2, which is part of Ansible, beats any non-specialized solution hands down.
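The first of those questions - pausing a rollout until one object is confirmed healthy - boils down to a polling loop. A minimal sketch, assuming the readiness check itself is supplied by the caller (in practice it might shell out to `kubectl rollout status` or query the Kubernetes API; the function name is ours, not part of any tool):

```python
import time


def wait_until(check, timeout=300, interval=5):
    """Poll `check()` until it returns True or `timeout` seconds pass.

    Returns True on success, False if the deadline is exceeded -
    the caller can then abort the rest of the rollout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage from a deploy script:
#   ready = wait_until(
#       lambda: subprocess.run(
#           ["kubectl", "rollout", "status", "deployment/api", "--timeout=1s"]
#       ).returncode == 0,
#       timeout=300,
#   )
```

In our case the equivalent logic lives inside Ansible tasks, but the shape of the check-then-continue step is the same.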
Services storing state
As a complete solution for any type of service, including stateful ones, Kubernetes comes with a set of drivers for working with network block devices. In the case of AWS, the only acceptable option is EBS.
As you can see, the k8s issue tracker is replete with EBS-related bugs, and they are resolved rather slowly. Today we do not suffer from any serious problems, apart from the fact that it sometimes takes up to 20 minutes to create an object with persistent storage. The EBS-k8s integration is of very, very dubious quality.
However, even if you use other storage solutions and do not experience any particular problems, you still need high-quality solutions for everything that stores data. We spent a lot of time filling in the gaps and providing quality solutions for each of these cases.
In addition, Kubernetes - and the Docker world in principle - sometimes forces you into tricks and subtleties that seem obvious at first glance but require a solution of their own.
A small example: you cannot collect logs into files inside a running Docker container. Yet many systems and frameworks are not ready to stream to `STDOUT`. You have to deal with `patching` and deliberate system-level development: writing to pipes, taking care of processes, and so on. A little time, and we had a Monolog handler for `php` that emits logs in the way Docker/k8s understands.
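Our handler was for PHP's Monolog, but the idea translates to any stack: point the framework's logger at `STDOUT` instead of a file, one record per line, and Docker's logging driver picks the stream up automatically. A sketch of the same idea in Python (the function name and format string are ours, purely for illustration):

```python
import logging
import sys


def docker_friendly_logger(name):
    """Return a logger that writes single-line records to STDOUT,
    where Docker's logging driver (and hence k8s) collects them."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(sys.stdout)  # STDOUT, not a file
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate records via the root logger
    return logger
```

The point is not the language but the contract: one self-contained line per event on `STDOUT`, so the container itself stays free of log files.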
API Gateway
As part of any microservice or service-oriented architecture, you will most likely need some kind of gateway. That is true of the architecture in general, but here I want to focus on why it is especially important for the cluster and the services deployed into it.
Everything is quite simple - you need a single point of access (and denial) to all your services.
There are a number of tasks that we solved at the Kubernetes cluster level:
- Access control and rate limiting of requests from outside - as an example, a small Lua script sheds light on the problem
- Single point of user authentication / authorization for any services
- Avoiding a multitude of services requiring HTTP access from the `world` - reserving ports on servers for every service that wants one is harder to manage than routing in Nginx
- Kubernetes-AWS integration for work with AWS Load Balancer
- Single point of monitoring HTTP statuses - convenient even for internal communication of services
- Dynamic routing of service requests or service versions, A / B tests (alternatively, the problem can be solved by different pods behind the Kubernetes service)
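The request-limiting item above was handled for us by a small Lua script inside Nginx; the underlying idea is an ordinary token bucket. A language-neutral sketch of that idea (written in Python here purely for illustration - the class and its parameters are hypothetical, not our production code):

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allow at most `rate` requests per
    second on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.last = time.monotonic()

    def allow(self):
        """Return True if the current request may pass, False if it
        should be rejected (e.g. with HTTP 429)."""
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In the gateway, one such bucket per client key (IP, token, etc.) is enough; the Nginx/Lua version keeps the same counters in shared memory.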
The sophisticated Kubernetes user will hasten to ask about the Kubernetes Ingress Resource, which is designed specifically for solving such problems. That's right! But, as you may have noticed, we required a few more `features` for our API Gateway than Ingress provides. Moreover, Ingress is just a wrapper around Nginx, which we already know how to work with.
Current state
Despite the myriad nuances and problems associated with installing, using, and supporting the solution described above, if you are persistent enough you will most likely succeed and end up with approximately what we have today.
What the platform looks like in its current state - some dry facts:
- Only 2-3 people to support the entire platform.
- One repository storing all information about the entire infrastructure
- From 10-50 independent automated releases per day - CI / CD mode
- Ansible as a cluster management tool
- A few hours to create an identical `live` environment - locally on minikube or on real servers
- An AWS-based architecture: EC2 + EBS, CentOS, Flannel
- 500-1000 pods in the system
- A list of technologies wrapped in Docker/K8s: Go, PHP, Java/Spring FW/Apache Camel, Postgres/Pgpool/Repmgr, RabbitMQ, Redis, Elasticsearch/Kibana, FluentD, Prometheus, etc.
- There is no infrastructure outside the cluster, except for monitoring at the `Hardware` level
- Centralized log storage based on Elasticsearch inside the Kubernetes cluster
- A single point for collecting metrics and alerting problems based on Prometheus
The list reflects many facts, but the clear advantages and pleasant features of Kubernetes as a Docker process management system remain unmentioned. More information about these can be found on the official Kubernetes website, or in articles on Habr or Medium.
Our wish list - features at the prototype stage or covering only a small part of the system so far - is also very long:
- The system of profiling and tracing - for example, zipkin
- Anomaly detection - machine-learning algorithms that analyze problems across hundreds of metrics, for when we cannot or do not want to understand what each metric or set of metrics means separately, but still want to know about the problems associated with them
- Automatic capacity planning and scaling of both the number of pods in the service and servers in the cluster based on certain metrics
- Intelligent backup management system - for any stateful services, primarily databases
- The system of network monitoring and visualization of connections - within the cluster, between services and pods, first of all ( interesting example )
- Federation mode - distributed and connected mode of operation of several clusters.
So, to be or not to be?
An experienced reader has most likely already guessed that the article is unlikely to give an unequivocal answer to such a seemingly simple, short question. Many small details can make your system incredibly cool and productive. Or another handful of bugs and crooked implementations will turn your life into hell.
You decide! But my opinion on all of this is: “To be!.. but very carefully”