Last week it was announced that from now on all new Reddit services will be launched in production on infrastructure based on Kubernetes clusters. This is a significant milestone in the migration of one of the most popular online resources to K8s, and here is how the company got there.
Background: today, Reddit is among the top 20 sites globally (and number 6 in the US) according to Alexa. This online community of American origin has more than 400 million monthly active users, 12 million posts and 2 billion votes a day.
Greg Taylor, head of the Release Engineering group in the project's infrastructure department, explained why and how Reddit's engineers came to Kubernetes in a talk (presentation + video) at KubeCon 2018 in December last year.
Why did Reddit come to Kubernetes?
At the beginning of 2016, the service, implemented as a monolithic application, had only about 20 engineers. They formed 3 teams, one of which, the Infrastructure team, is the hero of this story. That year brought big changes: by its end, more than 60 engineers worked at the company, and by the end of 2018 their number had grown to 200. In just 3 years, the staff increased 10-fold.
Such rapid growth exposed the inefficiency of the application's monolithic architecture: it became very difficult for different teams to make numerous changes to its various components. After considering many options, the engineers chose the path of service-oriented architecture (SOA).
Having moved from a large monolith to a service architecture, Reddit faced a new problem.
The Infrastructure team became a bottleneck for developers, who depended on it heavily at every stage: when bootstrapping services, during their day-to-day operation, and when debugging and solving performance problems. As a quick fix, the company formed more self-sufficient teams, called “infrastructure-oriented”. Members of such teams had the necessary infrastructure skills, which let them overcome many difficulties without waiting for the Infrastructure team, which was overloaded with an endless backlog of requests from numerous developers.
However, this was still a temporary solution, and practice showed that not everyone wanted to operate the entire stack for their service.
How was this situation resolved? The organization introduced the concept of service owners, who could:
- develop their service from start to finish;
- deploy the service early and often;
- operate the service (including its availability and performance).
But how can this be achieved?
Rather than expecting teams with impeccable skills to assemble services from dozens of building blocks, many of which they may know little about, the organization should offer them a well-thought-out, predefined path for taking services to production that touches a minimum of technologies. This spares engineers from having to learn many new technologies and tools, of which there can be a great many.
“To put this idea into practice, we needed to ‘pack’ the knowledge, process, best practices and much more into a more accessible form.”
InfraRed: Kubernetes at Reddit
That is how InfraRed appeared: Reddit's internal infrastructure product based on Kubernetes.
How were the three needs of service owners, listed in their definition above, met?
1. Development
The organization's development standard does not prescribe a specific language or framework. Instead, it defines the general “shape” that a service must conform to. The standard is a language-independent service specification that covers the RPC protocol, secrets handling, metrics reporting, tracing, and the log output format. An example Python implementation of such a specification can be found in the baseplate project. It is unlikely to be directly usable by anyone else, but it can serve as inspiration.
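To make the idea of a language-independent "shape" more concrete, here is a minimal sketch of one part of such a specification, the log output format. The field names and the choice of JSON lines are assumptions made for illustration; the article does not describe the actual fields, and this is not baseplate's API.

```python
import json
from datetime import datetime, timezone

# Hypothetical field set: the real specification's fields are not given in the article.
REQUIRED_FIELDS = ("timestamp", "service", "level", "message", "trace_id")


def format_log(service, level, message, trace_id, now=None):
    """Render one log record as a single JSON line with a fixed set of fields."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "service": service,
        "level": level.upper(),
        "message": message,
        "trace_id": trace_id,
    }
    return json.dumps(record, sort_keys=True)


def conforms(line):
    """Check that a log line is valid JSON and carries every required field."""
    try:
        record = json.loads(line)
    except ValueError:
        return False
    return isinstance(record, dict) and all(f in record for f in REQUIRED_FIELDS)
```

Because the contract is expressed as a set of fields rather than as a library, a Go or Node service could satisfy the same check with its own logger.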
In addition, quick-start materials were created for writing new services: code stubs for different languages (Python, Go, Node), as well as a Dockerfile, CI configs, and even Helm charts.
For local development, Reddit's engineers chose Skaffold, a Google product that gives developers a clear edit → rebuild → refresh cycle that:
- does not require deep knowledge of Kubernetes;
- is as close as possible to production;
- allows the use of standard charts / images;
- and, unlike Minikube, which was used before, does not demand huge resources from work laptops (because the rollout is performed on remote clusters).
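Conceptually, an edit → rebuild → refresh cycle is a watch loop: detect changed files, rebuild the image, and redeploy it. The sketch below illustrates only that idea. It is not how Skaffold is implemented or configured, and the `rebuild` and `redeploy` callables are hypothetical stand-ins for an image build and a rollout to a remote cluster.

```python
import os


def snapshot(root):
    """Map every file under root to its last-modified time in nanoseconds."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            state[path] = os.stat(path).st_mtime_ns
    return state


def changed_paths(before, after):
    """Return paths that were added, removed, or modified between two snapshots."""
    all_paths = set(before) | set(after)
    return sorted(p for p in all_paths if before.get(p) != after.get(p))


def dev_cycle_step(before, after, rebuild, redeploy):
    """Run one iteration: if anything changed, rebuild and then redeploy."""
    changes = changed_paths(before, after)
    if changes:
        image = rebuild(changes)  # hypothetical: build an image from the changes
        redeploy(image)           # hypothetical: roll the image out remotely
    return changes
```

A real loop would call `snapshot` periodically and pass consecutive results to `dev_cycle_step`.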
2. Deployment
To run tests and build artifacts (usually Docker images), Reddit uses the Drone continuous delivery platform.
For deployment to Kubernetes, the Helm plug-in for Drone was used at first. Fairly quickly, however, the engineers concluded that Helm did not suit them: they wanted a system that “better understands the state of the objects being created or updated”. Further automation of the deployment process also called for a solution that could query the tools already in use and pause a rollout if it detected malfunctions or performance problems.
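A deployment gate of the kind described above, one that consults monitoring and halts when it sees malfunctions or performance problems, can be sketched as follows. This is a hypothetical illustration under stated assumptions: the metric names, the thresholds, and the staged traffic percentages are invented for the example and do not describe Reddit's pipelines or any particular tool's API.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"


def gate(error_rate, p99_latency_ms, max_error_rate=0.01, max_p99_latency_ms=500.0):
    """Decide whether a rollout may proceed, given current health metrics."""
    if error_rate > max_error_rate or p99_latency_ms > max_p99_latency_ms:
        return Decision.PAUSE
    return Decision.CONTINUE


def run_rollout(stages, read_metrics):
    """Advance through rollout stages, stopping at the first unhealthy one.

    stages: ordered traffic percentages, e.g. [10, 50, 100] (assumed scheme).
    read_metrics: callable returning (error_rate, p99_latency_ms) for a stage.
    Returns the list of completed stages and the final decision.
    """
    completed = []
    for stage in stages:
        error_rate, latency = read_metrics(stage)
        decision = gate(error_rate, latency)
        if decision is Decision.PAUSE:
            return completed, decision
        completed.append(stage)
    return completed, Decision.CONTINUE
```

The point of the design is that the health check sits inside the deployment loop instead of being a separate manual step.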
As a result, Spinnaker was chosen to orchestrate deployments to Kubernetes. Templates for typical pipelines were created for it in Jsonnet. Helm charts are then generated and rolled out to Kubernetes by Spinnaker. Users receive information on the progress of a deployment and help with diagnosis if problems arise. Here is how a typical deployment to staging / production looks in very general terms:
3. Operation
First, how are responsibilities divided between service owners and the Infrastructure team?
- Service owners: understand the basics of Kubernetes, and deploy and operate their services;
- Infrastructure team: keeps the Kubernetes clusters operational (rollout, support, scaling), provides them with all the necessary resources, and advises the organization's engineers on designing reliable, performant, fault-tolerant services (in particular, training sessions are held regularly and then distributed throughout the company).
Service owners have limited permissions. However, to gain access to production (to diagnose a problem), they can request, via a special command-line utility, a temporary token that gives them full rights to their own namespaces.
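The temporary-token idea can be sketched in a few lines. Everything below is an assumption made for illustration: the article does not describe how Reddit's utility works, so the `TemporaryToken` type, the `issue_token` and `is_allowed` functions, and the one-hour default lifetime are all hypothetical.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporaryToken:
    value: str             # opaque random secret handed to the user
    user: str
    namespaces: frozenset  # namespaces the token grants full rights to
    expires_at: float      # Unix timestamp after which the token is invalid


def issue_token(user, owned_namespaces, ttl_seconds=3600, now=None):
    """Issue a short-lived token scoped to the namespaces the user owns."""
    now = time.time() if now is None else now
    return TemporaryToken(
        value=secrets.token_urlsafe(32),
        user=user,
        namespaces=frozenset(owned_namespaces),
        expires_at=now + ttl_seconds,
    )


def is_allowed(token, namespace, now=None):
    """Allow access only before expiry and only inside the token's namespaces."""
    now = time.time() if now is None else now
    return now < token.expires_at and namespace in token.namespaces
```

Scoping by namespace and by time keeps elevated access narrow: the token is useless outside the owner's namespaces and stops working on its own.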
Another important aspect of operations is minimizing the potential damage that can come from various sources. This is what Reddit does about it:
To make life easier for the engineers involved in operations, the following are also used:
Status of Kubernetes at Reddit
The overall statistics for the Kubernetes infrastructure as of December last year were as follows:
- 7 clusters (3 to 6 more were expected to be added over the following few months);
- between one third and one half of all engineering teams interact with Kubernetes;
- about 20 Reddit services run in production on K8s;
- on a working day, these services are deployed to K8s 10 to 20 times.
General availability of InfraRed with Kubernetes for the entire organization was planned for the first quarter of 2019, meaning that every new service would be deployed to production on Kubernetes. (At that time, this was already the case for about 3 out of 4 new services.)
As mentioned at the beginning of the article, this milestone was reached just last week.