Kubernetes success stories in production. Part 4: SoundCloud (the creators of Prometheus)

This series about large, successful Kubernetes users continues with a popular online service for distributing audio content: SoundCloud. Last year Spotify AB was looking to buy the company (Spotify has Swedish roots, as does SoundCloud), and more recently so was the Chinese Internet giant Tencent. Even though it serves about 175 million active users per month, SoundCloud has recently had financial problems, which became public after a large layoff (173 employees) last summer; according to the latest data, however, the situation has improved. In any case, we are more interested in the technology side, specifically the use of Kubernetes, and here is what is known about SoundCloud from public sources.

Starting the migration to Kubernetes

At the first annual Kubernetes conference, KubeCon 2015, held in San Francisco in November 2015, SoundCloud engineer Tobias Schmidt talked about the company's path to a microservice architecture (completed by 2014) and later to Kubernetes.

SoundCloud started out (“4-5 years ago”) with a monolithic Ruby on Rails application (plus MySQL and RabbitMQ), deployed with Capistrano, with the infrastructure described in Chef. The main problems at that time were slow scaling, a painful and unstable deployment scheme, and laborious rollout of new applications. For these reasons the company built its own system, Bazooka, which it described as a “PaaS in the style of Heroku”. Managing containers (based on LXC), it gave developers simple commands to deploy and scale applications. At that time the company had 400-500 services (not all of them in production), and Bazooka solved the problems of fast rollout and rollback of application releases, of scaling, and of team independence (developers no longer had to ask operations specialists for new resources).

During these experiments, SoundCloud engineers firmly fell in love with containers and the Go language. At the same time, the in-house Bazooka stopped being a good fit for the much larger set of microservices, which had evolved from simple HTTP request handlers into complex, resource-hungry applications (by then written in Scala, Clojure and JRuby). In addition, maintaining and developing Bazooka was becoming increasingly difficult with the resources the company had, while Docker had gained considerable popularity in the container world (and SoundCloud developers had already come to like it after starting to use it for CI/CD). All of this prompted the engineers to move to a third-party orchestration system.

Why Kubernetes?

SoundCloud compared Kubernetes with its in-house Bazooka and with other products available at the time, such as Mesos, and also checked the system's capabilities against the company's actual requirements. The talk gives the following reasons for choosing Kubernetes:

Simple and clear basic objects that are enough for users and developers to work with: containers, pods, services, Replication Controllers.
Powerful networking capabilities: flexible network-level configuration of applications with different traffic and resource profiles, convenient auditing of user connections, secure traffic distribution.
Scheduling at the pod level, where one pod can hold several containers that share a common life cycle.
A labeling system for grouping resources, finding them and setting limits on them.
A large and vibrant community, as well as the availability of commercial support.
In the view of SoundCloud engineers, the Kubernetes project essentially extended the ideas that had been built into Bazooka.
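
To make the labeling point more concrete, here is a minimal sketch in Go of how equality-based label selection works in principle. It is not Kubernetes code: the Resource type, the Select function and the pod names are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// Labels is a set of key/value pairs attached to a resource.
type Labels map[string]string

// Resource is a hypothetical stand-in for any labeled object, e.g. a pod.
type Resource struct {
	Name   string
	Labels Labels
}

// Matches reports whether every key/value pair of the selector
// is present with the same value in the label set.
func (l Labels) Matches(selector Labels) bool {
	for k, v := range selector {
		if got, ok := l[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// Select returns the names of the resources whose labels satisfy
// the selector, in the order they were given.
func Select(resources []Resource, selector Labels) []string {
	var names []string
	for _, r := range resources {
		if r.Labels.Matches(selector) {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	// Made-up pods, labeled by application and environment.
	pods := []Resource{
		{Name: "api-1", Labels: Labels{"app": "api", "env": "prod"}},
		{Name: "api-2", Labels: Labels{"app": "api", "env": "staging"}},
		{Name: "search-1", Labels: Labels{"app": "search", "env": "prod"}},
	}
	fmt.Println(Select(pods, Labels{"env": "prod"}))
	fmt.Println(Select(pods, Labels{"app": "api", "env": "prod"}))
}
```

The same selector can pick out everything in production or a single application, which is what makes labels convenient for grouping resources without a fixed hierarchy.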

Bottom line: as of late 2015, SoundCloud was actively experimenting with Kubernetes on Google Cloud Platform and on its own bare-metal cluster in a data center. The near-term plans at that time were to finish building the CI pipeline (it was “mostly set up”), to integrate fully with Prometheus for monitoring, and to solve logging problems. The Prometheus mentioned in these plans deserves special attention.

Prometheus

Hardly any engineer working with Kubernetes today has not heard of Prometheus: it is one of the first projects of the CNCF (Cloud Native Computing Foundation), the organization that formally stands behind K8s itself. Today this Open Source project is positioned as a “monitoring system for systems and services” that collects metrics from configured targets at given intervals, evaluates rules on them, displays the results and triggers alerts when the specified conditions are met. The main components of Prometheus are written in Go; the overall architecture is shown in a diagram in the project documentation.

Why does this CNCF project get special attention in an article about SoundCloud and Kubernetes? Because SoundCloud did its initial development, back in 2012, long before its engineers could even in theory think about switching to Kubernetes (which did not exist yet). Over time, the growing popularity of Prometheus turned it into an independent project supported by a wide community, and in 2016 it joined the CNCF, becoming the organization's second project (after Kubernetes).

Even more interesting, the person behind Prometheus at SoundCloud is former Google employee Matt Proud. By 2012 SoundCloud was unhappy with the tools it used for statistics and monitoring (StatsD and Graphite) and began looking for alternatives. The engineer from Google, who had joined the company around then, found no suitable Open Source product that could store time series data in a multidimensional format and offered a simple query language for selecting from it. That is how the Prometheus project started, created “inspired by what he [Matt Proud] knew about Borgmon, the companion of Google's cluster manager and task scheduler Borg [which, as everyone knows, became the ancestor of Kubernetes]”.
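
What “multidimensional” means here can be shown with a small Go sketch: one metric name covers many series that differ only in their labels, and a query aggregates across one of those dimensions. The Series type, the SumBy function and the sample data are hypothetical and are not taken from Prometheus; they only mimic the kind of grouping its query language allows.

```go
package main

import "fmt"

// Series is a simplified time series: a metric name, a set of labels
// (the dimensions) and, for brevity, only its latest value.
type Series struct {
	Metric string
	Labels map[string]string
	Value  float64
}

// SumBy adds up the values of all series with the given metric name,
// grouped by the value of one label.
func SumBy(series []Series, metric, label string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range series {
		if s.Metric != metric {
			continue
		}
		out[s.Labels[label]] += s.Value
	}
	return out
}

func main() {
	// Made-up data: one metric, split by instance and HTTP status.
	series := []Series{
		{Metric: "http_requests_total", Labels: map[string]string{"instance": "a", "status": "200"}, Value: 100},
		{Metric: "http_requests_total", Labels: map[string]string{"instance": "b", "status": "200"}, Value: 50},
		{Metric: "http_requests_total", Labels: map[string]string{"instance": "a", "status": "500"}, Value: 4},
		{Metric: "queue_length", Labels: map[string]string{"instance": "a"}, Value: 7},
	}
	fmt.Println(SumBy(series, "http_requests_total", "status"))
}
```

With a flat, hierarchical naming scheme each combination of instance and status would need its own metric name; with labels, the same data can be sliced by any dimension at query time.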

As a result, Matt Proud, who spent only 2 years at SoundCloud (he left the company at the end of 2013 and returned to Google in 2014), initiated the creation of Prometheus. Other SoundCloud engineers joined in and continued its development after the departure of its originator. The project was published on GitHub as early as 2012 and a year later was in production use at SoundCloud itself.

“Prometheus is an excellent example of something that was written before Kubernetes existed, for a world in which many applications of various kinds need to be monitored. Prometheus is not just monitoring for applications in Kubernetes; it is also for Mesos, Docker, OpenStack and other platforms. Many more will appear, and I personally believe there will be more major platforms. So this is a truly ubiquitous and powerful tool. We try to choose good tools that fit different use cases, not just a single one. These need not be containers; they may be virtual machines.”

- Alexis Richardson, chair of the CNCF Technical Oversight Committee and CEO of Weaveworks, in 2016, when Prometheus officially became an incubated project at CNCF.

Production at SoundCloud

According to an April 2016 interview with Björn Rabenstein, who led Production Engineering at SoundCloud, the company was already using Prometheus and Kubernetes in production at that time, and he called the two products a perfect combination for its infrastructure needs.

As the same interview notes, when a replacement for Bazooka was being chosen, Kubernetes won only “slightly ahead” of the other options, because the project was still young and had yet to gain many of its capabilities.

At European DevOps events in 2016 (JAX DevOps in London, CoreOS Fest in Berlin, DevOpsCon 2015 in Berlin), Björn spoke together with Fabian Reinartz, one of the main Prometheus developers and a CoreOS engineer who previously worked at SoundCloud, about how to monitor Kubernetes with Prometheus.

The experience described in that talk was already based on the production environment at SoundCloud, and it was soon complemented by a talk from the same SoundCloud engineer, Tobias Schmidt, at ContainerDays NYC 2016 in November 2016. (Those interested in the example Prometheus configuration for Kubernetes described in the talk can find it in a dedicated repository on GitHub.)

Unfortunately, there are practically no details about the SoundCloud infrastructure itself. There is, however, a screenshot that, according to the author, was taken from the real infrastructure. In particular, it shows a rate of about 19 thousand RPS for incoming requests (HTTP + Thrift).

From a SoundCloud job posting for a Production Engineer, we know that the infrastructure involves “technologies such as Kubernetes, Kafka, Distributed Storage and Prometheus”, and from a Backend Engineer posting that the company still uses Scala, Java and Go for the backend, along with Spark and Hadoop. (By the way, there is a separate article about using Go in the SoundCloud production infrastructure, but it is probably partly outdated by now.)

Finally, the following is known about the SoundCloud production infrastructure:

microservices use Finagle, an Open Source framework developed at Twitter as an “extensible RPC system for the JVM, used to construct high-concurrency servers”;
part of the data is stored in memcached, Redis and MySQL;
AWS cloud services are used for storing petabytes of data (S3 and Glacier), as well as for transcoding (Amazon EC2) and analyzing user behavior (Amazon Redshift);
Elasticsearch serves real-time search queries for many users;
IPVS (L4) and HAProxy (L7) are used for load balancing, and Consul for service discovery.

Although the operating characteristics of Kubernetes and specific figures for the SoundCloud production environment are not publicized (possibly because the company's engineers are busier promoting Prometheus, the monitoring tool that grew up there), the fact that Kubernetes is used is evident, and it makes the service one of the early users of Kubernetes in truly large-scale production.
