
Kubernetes success stories in production. Part 10: Reddit

Last week it was announced that from now on all new Reddit services will be launched in production on infrastructure based on Kubernetes clusters. This is a significant milestone in the migration of one of the most popular online resources to K8s, and here is how the company got there.







Background: Today, Reddit is among the top 20 sites globally (and number 6 in the US) according to Alexa. This online community of American origin has more than 400 million monthly active users, with 12 million posts and 2 billion votes a day.


Greg Taylor, head of the Release Engineering group in the project's infrastructure department, spoke about why and how Reddit's engineers came to Kubernetes at KubeCon 2018 in December last year (presentation + video).







Why did Reddit come to Kubernetes?



At the beginning of 2016, the service, implemented as a monolithic application, was maintained by only about 20 engineers. They formed 3 teams, one of which, the Infrastructure team, is the hero of this story. That year brought big changes, however: by its end, more than 60 engineers worked at the company (and by the end of 2018 their number had grown to 200, a 10-fold increase in staff in just 3 years).



Such rapid growth exposed the inefficiency of the application's monolithic architecture, since it became very difficult for different teams to make numerous changes to its various components. After meeting to tackle the problem and considering numerous options, the engineers chose the path of a service-oriented architecture (SOA).



Having moved from a large monolith to a service architecture, Reddit faced a new problem. The Infrastructure team became a bottleneck for developers, who depended heavily on it at various stages: when bootstrapping services, during their ongoing operation, and when debugging and resolving performance problems. As a quick fix, the company formed more self-sufficient teams, called “infrastructure-oriented”: their members had the infrastructure skills needed to overcome many difficulties without waiting for the Infrastructure team, which was overloaded with an endless backlog of requests from numerous developers.



However, this was still a temporary solution, and practice showed that not everyone wanted to operate the entire stack for their service.







How was this situation resolved? The organization introduced the concept of service owners, who could develop their service from start to finish, deploy it early and often, and operate it (including responsibility for its availability and performance). But how could this be achieved?



Instead of expecting teams to have the flawless skills needed to assemble services from dozens of building blocks, many of which they may know little about, the company needed to offer them a well-thought-out, predefined path for getting services into production that touches a minimum of technologies. This spares engineers from having to learn many new technologies and tools, of which there can be a great many.







“To put this idea into practice, we needed to ‘package’ the knowledge, processes, best practices and much more into a more accessible form.”


InfraRedd: Kubernetes at Reddit



That is how InfraRedd appeared: Reddit's internal infrastructure product based on Kubernetes.



How were the three needs of service owners named in their definition (development, deployment and operation) satisfied?



1. Development



The organization's development standard does not prescribe a specific language or framework; instead, it defines the general “shape” that a service must comply with. The standard, a service specification that is independent of the programming language, covers the RPC protocol, handling of secrets, metrics export, tracing, and the format of log output. A Python implementation of this specification can be found in the baseplate project; it is unlikely to be directly usable by anyone else, but it can serve as inspiration.
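The idea of a language-independent service "shape" can be illustrated with a short Python sketch. This is not baseplate's API and not Reddit's actual specification: the names `REQUIRED_CAPABILITIES`, `ServiceDescriptor`, `missing_capabilities` and `format_log_line` are hypothetical, and the only thing taken from the talk is the list of five areas the standard covers.

```python
import json
from dataclasses import dataclass, field

# Hypothetical checklist: the five areas the text says the standard covers.
REQUIRED_CAPABILITIES = ("rpc_protocol", "secrets", "metrics", "tracing", "log_format")


@dataclass
class ServiceDescriptor:
    """Illustrative description of a service, independent of its language."""
    name: str
    language: str
    capabilities: dict = field(default_factory=dict)


def missing_capabilities(service):
    """Return the spec areas the service has not declared, in spec order."""
    return [cap for cap in REQUIRED_CAPABILITIES if not service.capabilities.get(cap)]


def format_log_line(service_name, level, message):
    """Render one log record in a single agreed format (JSON is an assumption)."""
    record = {"service": service_name, "level": level, "message": message}
    return json.dumps(record, sort_keys=True)
```

A descriptor for a Go service that declares only `rpc_protocol` and `metrics` would be reported as missing `secrets`, `tracing` and `log_format`, regardless of the language it is written in. That is the point of specifying a shape rather than a framework.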



In addition, quick-start materials were created for writing new services: code stubs for different languages (Python, Go, Node), as well as a Dockerfile, CI configs and even Helm charts.



To help with local development, Reddit's engineers chose Skaffold, a Google product that offers developers a clear edit → rebuild → refresh cycle.





2. Deployment



To run tests and build artifacts (usually Docker images), Reddit uses the Drone platform for continuous delivery.



For deployment to Kubernetes, the Helm plug-in for Drone was used initially, but the engineers soon concluded that Helm did not suit them: they wanted a system that “better understands the state of the objects being created or updated”. Further automation of the deployment process also called for a solution that could consult the tools in use and pause a rollout if any malfunctions or performance problems arose.
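The requirement described above, a rollout that checks the state of what it has just deployed and pauses on trouble, can be sketched in a few lines of Python. This is a minimal illustration and not how Spinnaker or Helm work internally; `staged_rollout`, `RolloutResult` and the `apply_step` and `is_healthy` callbacks are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    status: str          # "completed" or "paused"
    steps_applied: int   # number of steps that were applied and passed the check
    reason: str = ""


def staged_rollout(steps, apply_step, is_healthy):
    """Apply rollout steps one by one, pausing at the first failed health check.

    apply_step(step) performs the change (for example, shifting more traffic);
    is_healthy(step) stands in for a query to whatever monitoring is in use.
    """
    applied = 0
    for step in steps:
        apply_step(step)
        if not is_healthy(step):
            # Stop here and leave the decision (fix or roll back) to the operator.
            return RolloutResult("paused", applied, f"health check failed after {step}")
        applied += 1
    return RolloutResult("completed", applied)
```

The key design choice is that the health check runs after every step rather than once at the end, so a problem is caught while only part of the rollout has been exposed to it.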



As a result, Spinnaker was chosen to orchestrate deployment to Kubernetes. Templates of typical pipelines were created for it (in Jsonnet). Helm charts are then generated, and Spinnaker rolls them out to Kubernetes. Users receive information about the progress of the deployment and help with diagnostics if any problems arise. Here is how a typical deployment process to staging / production looks in very general terms:







3. Operation



First, how are responsibilities divided between service owners and the infrastructure team?





Service owners have limited rights by default. However, when they need access to production (to diagnose a problem), they can request, via a special console utility, a temporary token that gives them full rights to their own namespaces.
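The idea of a temporary, namespace-scoped token can be sketched as follows. This is not Reddit's console utility and not the Kubernetes token mechanism: `issue_token`, `token_allows` and the `SECRET_KEY` placeholder are assumptions made for illustration, using only Python's standard library.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"demo-only-secret"  # placeholder; a real system would protect this key


def issue_token(user, namespace, ttl_seconds, now=None):
    """Return a signed token granting `user` access to `namespace` until expiry."""
    now = time.time() if now is None else now
    expires_at = int(now) + ttl_seconds
    payload = f"{user}|{namespace}|{expires_at}"
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def token_allows(token, namespace, now=None):
    """Check the signature, the namespace scope and the expiry time of a token."""
    now = time.time() if now is None else now
    try:
        user, token_ns, expires_at, signature = token.split("|")
    except ValueError:
        return False  # malformed token
    payload = f"{user}|{token_ns}|{expires_at}"
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False  # tampered or forged token
    return token_ns == namespace and now < int(expires_at)
```

Because both the namespace and the expiry time are covered by the signature, the elevated rights cannot be extended or moved to another namespace without requesting a new token.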



Another important aspect of operation is minimizing the potential damage that can come from different sources. Here is what Reddit does to achieve this:







To make life easier for the engineers involved in operations, the following are also used:





Kubernetes status at Reddit



The overall statistics for the Kubernetes infrastructure as of December last year were as follows:





InfraRedd with Kubernetes was planned to become available to the entire organization in the first quarter of 2019, meaning that any new service in production would be deployed on Kubernetes. (At that time, this was already the case for about 3 out of 4 new services.)



As mentioned at the beginning of the article, this milestone was successfully achieved just last week:










Source: https://habr.com/ru/post/441754/


