
Discover Nirvana - Yandex’s Universal Computing Platform

Machine learning has become a fashionable term, but for anyone working with large amounts of data it has been a vital necessity for many years. Yandex processes more than 200 million search queries daily! There was a time when the Internet had so few sites that the best of them were listed in a directory; today, the relevance of the links on a results page is determined by complex formulas that learn from ever newer data. Training and monitoring these formulas is the job of so-called pipelines: regular, repeatable processes.

Today we want to share with the Habr community our experience in creating the Nirvana computing platform, which, among other things, is used for machine learning tasks.


Nirvana is a non-specialized cloud platform for managing computations, in which applications are launched in the order the user specifies. Nirvana stores the descriptions of processes, the links between them, the blocks they are built from, and the related data. Processes are expressed as acyclic graphs.

Developers, analysts, and managers across many Yandex departments use Nirvana to solve their computational problems, because not everything can be computed on your own laptop (and for other reasons, which we will cover at the end of the article when we get to examples of how Nirvana is used).

We will describe the problems we ran into with the previous solution, walk through Nirvana's key components, and explain why we chose this name for the platform. Then we will look at a screenshot and move on to the tasks the platform is useful for.

How Nirvana came about


Training ranking formulas is a constant, large-scale task. Yandex currently relies on the CatBoost and Matrixnet technologies; in both cases, building ranking models requires significant computational resources, and a clear interface to them.

The FML service (Friendly Machine Learning) was, in its time, a big step toward automation and simplification: it put machine learning work on an assembly line. FML provided simple access to tools for configuring training parameters, analyzing results, and managing the hardware resources for distributed runs on a cluster.

But since users received FML as a finished tool, every interface improvement and new feature fell on the team's shoulders. At first this seemed convenient: we added only the features FML really needed, controlled the release cycle, immersed ourselves in our users' domain, and built a genuinely friendly service.

But along with these advantages came poor development scalability. The flow of requests to refine and extend FML exceeded all our expectations, and to handle everything quickly we would have had to grow the team indefinitely.

FML was created as an internal search service, but developers from other departments whose work also involved Matrixnet and machine learning quickly learned about it. It turned out that FML's capabilities reached far beyond search tasks, and demand far exceeded our resources: we were at a dead end. How do you grow a popular service if doing so requires growing the team proportionally?

We found our answer in an open architecture: while developing Nirvana, we deliberately avoided tying it to any subject area. Hence the name: the platform is indifferent to the tasks you bring to it, just as a development environment is indifferent to what your program is about, and a graphics editor does not care which image you are editing.

So what does Nirvana care about? Executing an arbitrary process accurately and quickly. The process is configured as a graph whose vertices are blocks containing operations, and whose edges between blocks carry data.

Since Nirvana appeared in the company, developers, analysts, and managers of various Yandex departments have taken an interest in it, and not only those involved in machine learning (more examples at the end of the article). In a week, Nirvana processes millions of operation blocks. Some run from scratch, others come from the cache: if a process is routine and its graph is restarted often, the deterministic blocks probably do not need to be re-run, and the result such a block already produced in another graph can be reused.
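The caching idea above can be sketched in a few lines: a deterministic block's result can be reused whenever the operation's identity and its inputs are unchanged, which suggests a content-based cache key. This is only an illustration of the principle, not Nirvana's actual cache implementation; the field names are made up.

```python
import hashlib
import json

def cache_key(operation_id: str, version: str, inputs: dict) -> str:
    """Build a deterministic cache key from an operation's identity and inputs.

    Any change to the operation version or to an input's content hash
    yields a new key, so a restarted graph reuses a block's result only
    when nothing it depends on has changed.
    """
    payload = json.dumps(
        {"op": operation_id, "ver": version, "inputs": inputs},
        sort_keys=True,  # stable serialization -> stable key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical operation + inputs -> identical key -> cache hit
k1 = cache_key("train_model", "1.4", {"data": "abc123"})
k2 = cache_key("train_model", "1.4", {"data": "abc123"})
# Different input data -> different key -> block must be re-run
k3 = cache_key("train_model", "1.4", {"data": "def456"})
```

The same key can then be computed in any graph that happens to contain an identical block, which is exactly what makes cross-graph reuse possible.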

Nirvana not only made machine learning more accessible, it became a meeting place: a manager creates a project and invites a developer; the developer assembles the process and launches it; after many launches, which the manager watches, an analyst comes to interpret the results. Nirvana lets users reuse operations (or entire graphs!) created and maintained by others, so no one has to do the same work twice. Graphs vary widely, from a handful of blocks to several thousand operations and data objects. They can be assembled in the graphical user interface (a screenshot is at the end of the article) or through the API.

How Nirvana is organized


Nirvana has three large sections: Projects (large business tasks, or groups of people working on shared tasks), Operations (a library of ready-made components and the ability to create new ones), and Data (a library of all objects uploaded to Nirvana and the ability to upload new ones).

Users assemble graphs in the Editor. You can clone someone else's successful process and edit it, or build your own from scratch by dragging operation and data blocks onto the canvas and wiring them together (in Nirvana, the connections between blocks carry data).

First we will describe the system's architecture; we suspect there are backend colleagues among our readers who are curious to peek into our kitchen. We usually cover this in interviews, too, so that candidates are ready for Nirvana.

Then we will turn to a screenshot of the interface and real-life examples.


Users usually start with Nirvana's graphical user interface (a single-page application); over time, many of their recurring processes migrate to the API. In general, Nirvana does not care which interface is used: graphs launch the same way. But the more production processes move to Nirvana, the more noticeable it is that most graphs run through the API, while the UI is left for experiments, initial configuration, and occasional changes.

On the backend side sits Data Management: the model and storage of information about graphs, operations, and results, plus the service layer that serves the frontend and the API.

A level below is the Workflow Processor, another important component. It drives graph execution while knowing nothing about the operations a graph consists of: it initializes blocks, works with the operation cache, and tracks dependencies. Executing the operations themselves is not the Workflow Processor's job; that is done by separate external components, which we call processors.
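Dependency tracking over an acyclic graph, as the Workflow Processor does, boils down to running a block only once all of its inputs are ready. A minimal sketch of that scheduling order, using Kahn's topological sort (this is a textbook illustration, not Nirvana's code; the block names are invented):

```python
from collections import deque

def execution_order(graph: dict) -> list:
    """Order blocks so each runs only after all its inputs are ready.

    `graph` maps a block name to the list of downstream blocks that
    consume its output (edges follow the data, as in Nirvana).
    """
    indegree = {node: 0 for node in graph}
    for downstream in graph.values():
        for node in downstream:
            indegree[node] += 1

    # Blocks with no unmet dependencies are ready to launch
    ready = deque(node for node, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in graph[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(graph):
        raise ValueError("cycle detected: not a valid acyclic graph")
    return order

# Hypothetical three-block graph: load feeds both train and eval,
# and eval also depends on train's output.
dag = {"load": ["train", "eval"], "train": ["eval"], "eval": []}
order = execution_order(dag)
```

In a real scheduler the "ready" blocks would be dispatched to processors in parallel rather than appended to a list, but the dependency bookkeeping is the same.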



Processors bring domain-specific functionality into Nirvana; they are developed by users themselves (though we maintain the basic processors ourselves). Processors have access to our distributed storage, from which they read the input data for their operations and to which they write the results.

From Nirvana's point of view, a processor is an external service implementing a given API, so you can write your own processor without changing Nirvana or any existing processor. There are three main methods: start a task, stop it, and get its status. Once all of an operation's incoming dependencies in the graph are ready, Nirvana (the Workflow Processor, to be precise) sends a launch request to the processor specified in the task, along with the configuration and links to the input data. It then periodically polls the execution status and, once the task is done, proceeds along the graph's dependencies.
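The start/status/stop contract can be sketched with an in-memory stand-in. Everything here is illustrative: the class, method signatures, and status strings are our invention for the sketch, not the real processor API, and a real processor would of course be a network service rather than a local object.

```python
import itertools

class StubProcessor:
    """In-memory stand-in for an external processor implementing the
    three methods described in the text: start, status, and stop."""

    def __init__(self, ticks_to_finish: int = 3):
        self._ticks = ticks_to_finish
        self._tasks = {}            # task_id -> remaining "work" ticks
        self._ids = itertools.count(1)

    def start(self, config: dict, input_urls: list) -> int:
        """Accept a launch request with configuration and input links."""
        task_id = next(self._ids)
        self._tasks[task_id] = self._ticks
        return task_id

    def status(self, task_id: int) -> str:
        """Report progress; each poll simulates one tick of work."""
        remaining = self._tasks[task_id]
        if remaining <= 0:
            return "COMPLETED"
        self._tasks[task_id] = remaining - 1
        return "RUNNING"

    def stop(self, task_id: int) -> None:
        self._tasks[task_id] = 0

def poll_until_done(processor, task_id: int, max_polls: int = 10) -> str:
    """What the Workflow Processor does conceptually: poll until ready."""
    for _ in range(max_polls):
        if processor.status(task_id) == "COMPLETED":
            return "COMPLETED"
    return "TIMEOUT"

proc = StubProcessor(ticks_to_finish=2)
task = proc.start({"cmd": "./train"}, ["storage://input/1"])
result = poll_until_done(proc, task)
```

The key design point the article makes survives even in this toy: the orchestrator knows only the three-method contract, so any service implementing it can plug in.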

The core processor, maintained by the Nirvana team, is called the Job Processor. It runs an arbitrary executable on a large Yandex cluster (using a scheduler and resource management system). Its distinctive feature is that it launches applications in full isolation, so parallel runs work strictly within the resources allocated to them.

In addition, an application can, if necessary, run on multiple servers in distributed mode (this is how Matrixnet works). The user simply uploads the executable, specifies the command line to run and the required amount of computing resources, and the platform takes care of the rest.
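A job description of the kind the text mentions (an executable, a command line, a resource request) might be validated like this. The field names below are purely illustrative; the real Job Processor schema is not public and certainly differs.

```python
def validate_job_spec(spec: dict) -> dict:
    """Check that a job description carries the pieces the text mentions:
    an uploaded executable, a command line, and a resource request.

    Field names here are hypothetical, not the platform's real schema.
    """
    required = {"executable", "command", "resources"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for key in ("cpu_cores", "memory_gb"):
        if spec["resources"].get(key, 0) <= 0:
            raise ValueError(f"resource {key!r} must be a positive number")
    return spec

job = validate_job_spec({
    "executable": "train_model",                 # the uploaded binary
    "command": "./train_model --iterations 200", # how to invoke it
    "resources": {"cpu_cores": 16, "memory_gb": 64, "hosts": 4},
})
```

Validating such a spec up front is what lets the platform "take care of the rest": once the request is well-formed, scheduling and isolation are the cluster's problem, not the user's.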



Another key component of Nirvana is the key-value storage, which holds both the results of operations and uploaded executables and other resources. We designed Nirvana's architecture to work with several storage locations and implementations at once, which lets us improve the efficiency and structure of data storage and carry out migrations without interrupting user processes. Over the platform's lifetime we have lived with the CEPH file system and with YT, our MapReduce and data storage technology, and eventually moved to MDS, another internal storage system.

Any storage system has limits, starting with the maximum amount of data it can hold. As the number of Nirvana's users and processes keeps growing, we risk filling even the largest storage. But we believe most of the data in the system is temporary, which means it can be deleted: because an experiment's structure is known, any result can be obtained again by restarting the corresponding graph. And if a user needs a data object forever, they can deliberately save it to Nirvana's storage with an infinite TTL to protect it from deletion. A quota system lets us divide the storage among different business tasks.
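The expiry rule described above, where regular objects age out but an infinite TTL pins an object forever, amounts to a one-line check. A minimal sketch, with `None` standing in for "infinite" (how Nirvana actually encodes TTLs is an assumption we are not making here):

```python
def is_expired(created_at: float, ttl_seconds, now: float) -> bool:
    """Decide whether a stored object may be garbage-collected.

    ttl_seconds=None models an 'infinite TTL': the object is pinned
    and never eligible for deletion, no matter how old it is.
    """
    if ttl_seconds is None:
        return False
    return now - created_at >= ttl_seconds

# An intermediate result with a finite TTL eventually expires...
temp_expired = is_expired(created_at=0.0, ttl_seconds=10.0, now=20.0)
# ...while a pinned object never does, regardless of age.
pinned_expired = is_expired(created_at=0.0, ttl_seconds=None, now=1e12)
```

The trade-off is the one the article states: expired intermediate results are cheap to drop precisely because rerunning the deterministic graph can recreate them.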

What Nirvana looks like and what it is useful for


So you can picture what our service's interface looks like, we have attached an example of a graph that prepares and launches a quality assessment of a formula built with CatBoost.



Why do services and developers of Yandex use Nirvana? Here are some examples.

1. The process of selecting ads for the Advertising Network using Matrixnet is implemented as Nirvana graphs. Machine learning makes it possible to improve the formula by adding new factors. Nirvana lets the team visualize the learning process, reuse results, schedule regular training runs, and, when needed, change the process.

2. The Weather team uses Nirvana for ML tasks. Because the predicted values vary seasonally, the model must be retrained constantly, with the most recent data added to the training sample. There is a graph in Nirvana that automatically clones itself through the API and relaunches new versions on fresh data to recompute and regularly update the model.
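The self-cloning pattern can be sketched against a toy client: clone the proven graph, point the clone at a fresh data snapshot, launch. The class, method names, and graph IDs below are entirely hypothetical; Nirvana's real API is internal and surely looks different.

```python
class GraphClient:
    """Toy stand-in for a graph API client (names are illustrative)."""

    def __init__(self):
        # Pretend one production graph already exists.
        self._graphs = {"weather-v1": {"data_snapshot": "2024-01-01"}}
        self._counter = 1

    def clone(self, graph_id: str) -> str:
        """Copy an existing graph under a new id."""
        self._counter += 1
        new_id = f"weather-v{self._counter}"
        self._graphs[new_id] = dict(self._graphs[graph_id])
        return new_id

    def set_param(self, graph_id: str, key: str, value: str) -> None:
        self._graphs[graph_id][key] = value

    def start(self, graph_id: str) -> str:
        return f"started {graph_id}"

def retrain_on_fresh_data(client, base_graph: str, snapshot: str) -> str:
    """Clone the proven graph, swap in fresh data, and relaunch it."""
    new_graph = client.clone(base_graph)
    client.set_param(new_graph, "data_snapshot", snapshot)
    return client.start(new_graph)

client = GraphClient()
result = retrain_on_fresh_data(client, "weather-v1", "2024-06-01")
```

The appeal of the pattern is that the production graph itself is never mutated: each retraining run is a fresh, reproducible copy with one parameter changed.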

The Weather team also assembles experiments in Nirvana to improve its current production solutions, tests new features, and compares ML algorithms while selecting the right settings. Nirvana guarantees reproducibility of experiments, provides the compute for large-scale calculations, works with other internal and external products (YT, CatBoost, and others), and removes the need to install frameworks locally.

3. The computer vision team uses Nirvana to search over a neural network's hyperparameters, launching a hundred copies of a graph with different parameters and choosing the best. Thanks to Nirvana, a new classifier for any task can, if necessary, be created "at the push of a button" without help from computer vision specialists.

4. The Directory team runs thousands of assessments per day through Toloka and assessors, using Nirvana to automate this pipeline. For example, this is how photos of organizations are filtered and new ones collected via the mobile Toloka app. Nirvana helps cluster organizations (finding duplicates and merging them). Most importantly, building automatic processes for completely new kinds of assessments can take literally hours.

5. All assessment processes involving assessors and Toloka are built on Nirvana, not just those important to the Directory. For example, Nirvana helps organize all the work of the Pedestrians who update maps, as well as technical support workflows and assessor testing.

Discuss?


Our offices regularly host "Yandex from the Inside" meetups. At one of them we already talked a little about Nirvana (there are videos about how Nirvana works and about using Nirvana for machine learning), and it drew great interest. For now, the platform is available only to Yandex employees, but we would like to hear your opinion on the design we described and about tasks of yours it would be useful for. We would be grateful if you tell us in the comments about systems similar to ours; perhaps your companies already use comparable computing platforms, and we would appreciate advice, feedback, and stories from your experience.

Source: https://habr.com/ru/post/351016/
