Yandex ranking: how to put machine learning on a production line (post #1)

Today we begin a series of posts about machine learning and its place at Yandex, and about the tools that freed search developers from routine work and let them focus on the main thing: inventing new approaches to improving search. We will concentrate on how these tools are used to improve the relevance formula and, more broadly, ranking quality.

Our industry is such that search quality has to be worked on constantly. First, search companies compete with each other, and any improvement quickly shifts their market shares. Second, search engine optimizers look for weak spots in the algorithms in order to push up in the search results even sites that are less relevant to people's needs. Third, user habits change. For example, over the past few years the average search query length has grown from 1.5 to 3 words.

Search quality is a complex concept. It consists of many interrelated elements: ranking, snippets, completeness of the search index, security and many others. In this series we will look at only one aspect of quality - ranking. As many already know, it is responsible for the order in which the information found is presented to the user.

Even the simplest idea for improving ranking has to travel a complex multi-step path, from discussion among colleagues to the launch of a finished solution. The more automated (and therefore quicker and easier for the developer) this path is, the sooner users benefit from the improvement and the more often Yandex can launch such innovations.
You may have already read about the distributed computing platform Yet Another MapReduce (YAMR), the Matrixnet machine learning library, and the basic algorithm for learning ranking formulas. Now we want to talk about the FML framework (friendly machine learning - "machine learning with a human face"). It became the next step in automating and simplifying our colleagues' work: it put machine learning on a production line. Together, FML and Matrixnet form a single solution - Yandex machine learning technology.

FML is not easy to describe briefly, and we want to cover it in detail, so we have split the story into several posts:

What ranking is and what problem it solves. Here we talk about the ranking task and the main difficulties in this area. Even if you have never dealt with the topic, this introduction is enough to understand all the material that follows. Those already familiar with ranking can check their terminology against ours.
Selecting the ranking formula. You will learn how FML became a pipeline for fitting formulas (yes, there are many of them!) that quickly takes into account large volumes of assessor judgments and new factors with minimal human involvement. We will also describe the GPU cluster built at Yandex, which could well rank among the hundred most powerful supercomputers in the world.
Developing new factors and evaluating their effectiveness. As a rule, publications on machine learning focus on the process of fitting formulas and pass over the development of new factors. Yet however remarkable the machine learning technology, it will not work without good factors. Yandex even has a separate group of developers engaged exclusively in creating them. Here we describe the process cycle that produces new factors, and how FML helps evaluate the benefit of introducing each one and its cost.
Monitoring the quality of factors already in use. The Internet is constantly changing. It is quite possible that factors which really did improve quality a few years ago have since lost their value and now only waste computational resources. So we will describe how FML sustains a constant evolution in which weak factors die out and give way to strong ones.
Pipelined distributed computing. Machine learning is just one of the tasks FML solves well. More broadly, it is used to simplify distributed computations on a cluster of several thousand servers over a large data set that changes over time. Today about 70% of the computations in Yandex.Search development run under FML's control.
Applications and comparison with analogues. At Yandex, FML is used for machine learning by a number of teams, including for problems far from search. We believe our work can also be useful to colleagues in the industry who deal with machine learning tasks or simply with computations over large volumes of data. We will outline the range of tasks for which FML may be useful outside Yandex and compare it with similar products available on the market. We will also explain how applying FML at CERN could open the way to a Nobel Prize.

What is ranking and what problem does it solve

After the search engine has accepted the user's query and found all the matching pages, it must order them by how well they match the query. The algorithm that does this is called the ranking function (in the media it is sometimes called the relevance formula). Selecting the most important of the pages found and determining the "correct" order in which to show them is precisely the ranking task. Improving it is the first and main place where FML and Matrixnet are used.

Once upon a time at Yandex, the ranking function was a single formula tuned by hand. Its size grew exponentially (on the graph, the Y axis is logarithmic).

Besides the fact that the formula threatened to grow to an unmanageable size over time, there were other reasons for moving from manual tuning to machine learning. For example, at some point we needed several formulas at once, so that identical queries would be processed differently depending on the user's region.

Formally, in ranking, as in any supervised machine learning task, we need to build a function that best fits the expert data. In ranking, experts determine the order in which documents should be shown for specific queries. There are tens of thousands of such queries. The closer the order of documents produced by the formula is to the order given by the expert assessments, the better the ranking. These data are called judgments and, as many know, are prepared by dedicated specialists - assessors. For each query, they judge how well a particular document answers it.

The input data from which the trained function must determine the order of documents for any other query are the so-called factors - various features of the pages. These features may depend on the query (for example, how many of its words occur in the page text) or not (for example, whether the page is the site's start page or an internal one). The factors used for training also include features of the query itself, which are the same for all pages - for example, what language the query is in, how many words it contains, and how often users submit it.
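
The three kinds of factors described above can be illustrated with a small sketch. It is not Yandex's factor code: the `Page` type, the `compute_factors` function and the factor names are hypothetical, chosen only to show one query-dependent factor, one query-independent factor and one factor of the query itself.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical, minimal description of a page (an assumption for this sketch)."""
    url: str
    text: str
    is_start_page: bool


def compute_factors(query: str, page: Page) -> dict:
    """Return a small dictionary of illustrative factors for a (query, page) pair."""
    query_words = query.lower().split()
    page_words = set(page.text.lower().split())
    matched = sum(1 for word in query_words if word in page_words)
    return {
        # Depends on the query: share of query words that occur in the page text.
        "query_words_in_text": matched / len(query_words) if query_words else 0.0,
        # Does not depend on the query: start page versus internal page.
        "is_start_page": 1.0 if page.is_start_page else 0.0,
        # Feature of the query itself: identical for every page of this query.
        "query_length": float(len(query_words)),
    }


page = Page(url="https://example.com/", text="Red bike shop", is_start_page=True)
factors = compute_factors("buy red bike", page)
print(factors)
```

A real system computes many more factors per pair; the point here is only that every (query, page) pair becomes a fixed-length vector of numbers.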

Machine learning uses a training set to establish the relationship between the order of pages for a query, derived from human judgments, and the features of these pages. The resulting function is then used to rank results for all queries, whether or not expert judgments exist for them.
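
As a rough illustration of this idea, the sketch below fits a function to labelled (factors, label) rows and then uses it to order pages for a query that has no labels. It deliberately swaps in a plain linear model trained by gradient descent on squared error; this is not Matrixnet and not the method Yandex uses. The factor vectors, labels and all function names are made up for the example.

```python
def predict(weights, bias, factors):
    """Score one page: a weighted sum of its factors."""
    return bias + sum(w * x for w, x in zip(weights, factors))


def train_linear(rows, labels, epochs=5000, lr=0.1):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for factors, label in zip(rows, labels):
            error = predict(weights, bias, factors) - label
            for j, x in enumerate(factors):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def rank(docs, weights, bias):
    """docs is a list of (doc_id, factors); return doc ids, best predicted first."""
    ordered = sorted(docs, key=lambda d: predict(weights, bias, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ordered]


# Made-up training set: [share of query words in text, is start page] -> label 0..4.
train_rows = [[1.0, 1.0], [1.0, 0.0], [0.5, 1.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.0]]
train_labels = [4, 3, 2, 2, 1, 0]
weights, bias = train_linear(train_rows, train_labels)

# Pages for a query without any labels: the learned function still orders them.
unseen = [("a", [0.2, 0.0]), ("b", [0.9, 0.0]), ("c", [0.9, 1.0])]
print(rank(unseen, weights, bias))
```

This pointwise approach predicts each label separately and ignores the ranking metric itself, which is one reason real ranking systems need more specialised methods.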

To build a good ranking formula, it is important not only to obtain relevance judgments but also to choose correctly the queries for which they are collected. So we take a sample of queries that best represents users' interests.

There are several techniques for collecting assessor judgments, and each yields a different type of judgment. At present, Yandex assessors rate the relevance of a document to a query on a five-point scale. This method is based on the Cranfield II methodology. In other tasks we use other kinds of expert data - for example, binary judgments can be used in classifiers.

Why standard techniques are not applicable in ranking

However, even with a sufficient number of judgments collected and a set of factors computed for each (query, document) pair, constructing a ranking function with standard optimization methods is not so easy. The main difficulty comes from the piecewise constant nature of the target ranking metrics (nDCG, pFound, etc.). Because of this property we cannot use, for example, the well-known gradient methods, which require the function being optimized to be differentiable.

There is a separate research field devoted to ranking metrics and their optimization - Learning to Rank. And Yandex has a dedicated group that implements and improves various methods for solving this rather narrow class of optimization problems, which is nonetheless very important for search.
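
The piecewise constant behaviour of a ranking metric is easy to see on a toy example. The sketch below uses one common variant of nDCG (gain 2^label - 1, logarithmic position discount); the exact metric definitions used at Yandex are not given in this post, so treat the formula, the three made-up documents and the one-parameter scoring function as assumptions. The metric depends on the scores only through the order they induce, so sweeping the parameter `w` changes nDCG only at the points where two documents swap places.

```python
import math


def dcg(labels_in_ranked_order):
    """Discounted cumulative gain for labels listed from top position down."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels_in_ranked_order))


def ndcg(scores, labels):
    """nDCG of the order induced by sorting documents by score, descending."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = dcg(sorted(labels, reverse=True))
    return dcg([labels[i] for i in order]) / best if best > 0 else 0.0


# Made-up documents: (factor a, fixed score part b, assessor label).
DOCS = [(1.0, 0.0, 3), (0.0, 0.5, 1), (0.5, 0.1, 2)]


def ndcg_at(w):
    """nDCG when each document is scored as w * a + b."""
    scores = [w * a + b for a, b, _ in DOCS]
    labels = [label for _, _, label in DOCS]
    return ndcg(scores, labels)


values = [ndcg_at(i / 100) for i in range(101)]
# Only a handful of distinct values appear, although w took 101 different values.
print(sorted(set(values)))
# A small step in w usually changes nothing, so the finite difference is zero.
print(ndcg_at(0.31) - ndcg_at(0.30))
```

Between the swap points the derivative with respect to `w` is zero, and at the swap points the metric jumps, so a gradient step receives no useful signal. This is why Learning to Rank methods typically optimise smooth surrogate objectives or use other dedicated techniques.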

So, the ranking function is built from a set of factors and from training data prepared by experts. Building it is the job of machine learning - in Yandex's case, the Matrixnet library. In the following posts we will describe where search factors come from and how all this relates to FML.

Source: https://habr.com/ru/post/174213/
