
How Spotify Scales Apache Storm

Spotify is a Swedish music streaming service offering licensed content from companies such as Sony, EMI, Warner, and Universal. The service launched in October 2008 and now provides more than 30 million tracks. Many consider it an attempt to replicate the success of Napster while legalizing its model, and the Swedes have managed this almost better than anyone else in the world.

The service itself works as follows (in general terms): an algorithm analyzes user playlists, taking genre classification into account, and compares the resulting “preference profiles” with millions of other playlists. As a result, you get songs that suit your tastes and that you have not heard before.


Photo: Sunil Soundarapandian, CC

More than 75 million active users rely on the service, which places special demands on maintaining an adequate level of performance across all of its systems.
Spotify runs several task pipelines, split by purpose and expected load: advertising and recommendation services are separated from each other, as is everything related to visualization. Technically, this separation is organized on top of Apache Storm.
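To make that split concrete, here is a minimal sketch of deploying each concern as its own Storm topology with its own worker count, so each pipeline can be scaled and redeployed independently. It assumes Storm 1.x APIs; the pipeline names, worker counts, and the toy spout are illustrative stand-ins, not Spotify's actual components.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

/** Toy stand-in for a real event source such as a KafkaSpout. */
class PlayEventSpout extends BaseRichSpout {
    private transient SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Emit a fake "user played a track" event.
        collector.emit(new Values("user-42", "PLAY track-1"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer d) {
        d.declare(new Fields("userId", "event"));
    }
}

public class PipelineLauncher {
    public static void main(String[] args) throws Exception {
        submit("ads-pipeline", 4);   // sized for ad-event volume
        submit("recs-pipeline", 8);  // recommendations carry the heaviest load
        submit("viz-pipeline", 2);   // visualization is comparatively light
    }

    private static void submit(String name, int workers) throws Exception {
        Config conf = new Config();
        conf.setNumWorkers(workers);
        StormSubmitter.submitTopology(name, conf, build(name));
    }

    /** Placeholder wiring; each real pipeline attaches its own bolts here. */
    private static StormTopology build(String name) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new PlayEventSpout(), 1);
        return builder.createTopology();
    }
}
```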

In our posts on Habr we talk a lot about how virtual infrastructure solves scalability problems as workload grows. The fast-growing Spotify service is no exception: the logic of this case is much the same.

When operating such highly loaded systems, you have to think not only about distributing the load itself, but also about the operational cost of maintaining the physical or virtual infrastructure.

Spotify engineers have described how the pipeline mechanism lets them scale the service. The idea of this approach is to track events such as a user starting to work with a playlist or playing back a particular piece of content (a song, an ad, and so on).

A Kafka cluster collects topics for the various event types, and Storm subscribes to the user events and enriches them with special metadata read from Cassandra. This makes it possible to account for each individual user's personal experience with the service. Further computation on top of this yields general, averaged models that can then be used to recommend interesting content to other Spotify users.
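Below is a rough sketch of the enrichment step this describes: a Storm bolt that reads per-user metadata from Cassandra and attaches it to incoming events. It assumes Storm 1.x and the DataStax Java driver; the keyspace, table, column, and config key are invented for illustration.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class UserMetadataBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient Session session;

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
        // One Cassandra session per bolt instance; the driver's Session is thread-safe.
        this.session = Cluster.builder()
                .addContactPoint((String) conf.get("cassandra.host")) // assumed config key
                .build()
                .connect("profiles");                                 // assumed keyspace
    }

    @Override
    public void execute(Tuple tuple) {
        String userId = tuple.getStringByField("userId");
        // Enrich the raw event with per-user metadata from Cassandra.
        Row row = session.execute(
                "SELECT attributes FROM user_profile WHERE user_id = ?", userId).one();
        String attributes = row != null ? row.getString("attributes") : "{}";
        // Anchor the emit to the input tuple and ack it: this is what drives
        // Storm's processing guarantees.
        collector.emit(tuple, new Values(userId, attributes));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("userId", "attributes"));
    }
}
```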

Given how complex such a mechanism is to set up and debug, the developers advise following a few principles that, in their view, simplify the work. These are fairly standard: split the topology logically by task, and use shared libraries to shrink the code base and the test suite.
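As a sketch of what such a logical split might look like, here is a topology wired from small, named stages, reusing PlayEventSpout from the launcher sketch and UserMetadataBolt from the enrichment sketch above. The shared NormalizeEventBolt is the kind of tiny library component that is easy to cover with unit tests; groupings and parallelism values are illustrative.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

/** Shared-library bolt: a pure function of its input, trivial to unit-test. */
class NormalizeEventBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple t, BasicOutputCollector out) {
        out.emit(new Values(t.getStringByField("userId"),
                            t.getStringByField("event").trim().toLowerCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer d) {
        d.declare(new Fields("userId", "event"));
    }
}

public class RecsTopologyLocal {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // One named component per logical task keeps the wiring readable
        // and lets each stage be scaled on its own.
        builder.setSpout("play-events", new PlayEventSpout(), 1);
        builder.setBolt("normalize", new NormalizeEventBolt(), 2)
               .shuffleGrouping("play-events");
        builder.setBolt("enrich", new UserMetadataBolt(), 4)
               .fieldsGrouping("normalize", new Fields("userId"));

        Config conf = new Config();
        conf.put("cassandra.host", "127.0.0.1"); // assumed config key from above
        new LocalCluster().submitTopology("recs-local", conf, builder.createTopology());
    }
}
```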

The behavior of the entire system can then be observed through metrics that reflect its “well-being” and help identify problems at the topology level. It is also better to keep the basic settings in a common file, decoupled from changes in the main code.
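A sketch of that last point: load tunables from a file that lives outside the code and hand them to the topology through Storm's Config map, so bolts read endpoints in prepare() instead of hard-coding them. The file name and property keys are assumptions for illustration.

```java
import java.io.FileInputStream;
import java.util.Properties;
import org.apache.storm.Config;

public class TopologyConfigLoader {
    /** Reads e.g. topology.properties and turns it into a Storm Config. */
    public static Config load(String path) throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        Config conf = new Config();
        conf.setNumWorkers(Integer.parseInt(props.getProperty("workers", "4")));
        conf.setMaxSpoutPending(Integer.parseInt(props.getProperty("max.spout.pending", "500")));
        // Service endpoints travel in the config map, not in the code.
        conf.put("cassandra.host", props.getProperty("cassandra.host"));
        return conf;
    }
}
```

With this in place, retuning a worker count or moving an endpoint is a config change and a redeploy, with no edits to the topology code itself.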


Photo: stuartpilbrow, CC

Next, let's look at how the Spotify team approaches topology deployment and solves the various problems involved in operating the service and its supporting systems.

We have already learned a little about what the personalization pipeline is; now let's see why Apache Storm is used here. Since computation is spread across several independent personalization topologies and event processing is duplicated among them, the system needs a Storm cluster that can sustain the required level of performance.

The whole system is structured so that, even when problems occur, everything can be returned to its original state safely and painlessly. This minimizes risk and leaves room to experiment with changes.

Apache Storm, on which the whole system runs, is a distributed tool for processing large volumes of data in real time. Its fault tolerance comes from a mechanism that tracks whether every piece of data has been fully processed. Following best practices, it is used here in combination with Apache Kafka.

The author notes that the service currently processes more than 3 billion events a day. This requires stable throughput and minimal latency, which is achieved by load balancing in Kafka and by using a different group id for each KafkaSpout.
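A sketch of the group-id point, assuming the storm-kafka-client API: giving each topology's KafkaSpout its own consumer group means Kafka tracks offsets for each topology separately, so every pipeline sees the full event stream instead of splitting partitions with the others. The broker address and topic name are invented.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

public class Spouts {
    public static KafkaSpout<String, String> playEventSpout(String groupId) {
        KafkaSpoutConfig<String, String> conf =
                KafkaSpoutConfig.builder("kafka:9092", "play-events")
                        // A distinct group id per topology: offsets are tracked
                        // independently, and topologies never steal partitions
                        // from one another.
                        .setProp(ConsumerConfig.GROUP_ID_CONFIG, groupId)
                        .build();
        return new KafkaSpout<>(conf);
    }
}
```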

Concurrency problems, which can arise if tuple creation and processing in Storm are not properly guarded, can be solved with an approach that boils down to certain principles of system configuration.
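The referenced approach itself is not reproduced in the article, but the underlying pitfall is well known: in classic Storm the OutputCollector is not thread-safe, so a bolt that hands work to its own threads must serialize its emits and acks. Below is one common pattern, offered as a sketch rather than Spotify's actual code.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AsyncLookupBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient ExecutorService pool;

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
        this.pool = Executors.newFixedThreadPool(4);
    }

    @Override
    public void execute(Tuple tuple) {
        pool.submit(() -> {
            String result = slowLookup(tuple.getStringByField("userId"));
            // Guard the collector: emits and acks from worker threads
            // must not interleave.
            synchronized (collector) {
                collector.emit(tuple, new Values(result));
                collector.ack(tuple);
            }
        });
    }

    private String slowLookup(String userId) {
        return userId; // placeholder for a real network call
    }

    @Override
    public void cleanup() {
        pool.shutdown();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer d) {
        d.declare(new Fields("result"));
    }
}
```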

The authors of the presentation advise focusing on the slowest tasks, which genuinely need parallelization and extra resources; fast ones do not require such close attention.
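As a sketch of that advice, reusing the bolts from the earlier sketches (the caller is assumed to have registered the "play-events" spout already, and all numbers are illustrative): spend parallelism on the slow, I/O-bound stage and keep the cheap one small.

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ParallelismHints {
    public static void wire(TopologyBuilder builder) {
        // Cheap, CPU-light normalization: a couple of executors suffice.
        builder.setBolt("normalize", new NormalizeEventBolt(), 2)
               .shuffleGrouping("play-events");
        // Slow, I/O-bound enrichment: many executors, plus spare tasks so the
        // stage can be rebalanced upward later without redeploying.
        builder.setBolt("enrich", new UserMetadataBolt(), 16).setNumTasks(32)
               .fieldsGrouping("normalize", new Fields("userId"));
    }
}
```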

Among other things, it is worth noting the issues around processing the user profile itself, that is, the attributes the handler receives. Internal caching is recommended here to avoid unnecessary network I/O, which would only increase latency.
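A minimal sketch of such internal caching, using Guava's in-process cache; the size, TTL, and stubbed Cassandra read are assumptions for illustration.

```java
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class ProfileCache {
    private final Cache<String, String> attributes = CacheBuilder.newBuilder()
            .maximumSize(100_000)                   // bound memory per executor
            .expireAfterWrite(10, TimeUnit.MINUTES) // tolerate slightly stale profiles
            .build();

    public String attributesFor(String userId) throws Exception {
        // Only cache misses hit Cassandra; repeated events for the same
        // user are served from memory with no network round-trip.
        return attributes.get(userId, () -> fetchFromCassandra(userId));
    }

    private String fetchFromCassandra(String userId) {
        return "{}"; // placeholder for the real read
    }
}
```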


Source: https://habr.com/ru/post/264549/

