
Modeling the world for a search engine. A lecture at Yandex

Today we will talk about modeling reality as a way of thinking, of perceiving information, and of analyzing data. Together we will reinvent and improve the models currently used in search engines: in search quality metrics, in building ranking factors, and even in creating new Internet services. This is the subject of Fyodor Romanenko's lecture.







However, before moving on to the main topic of the lecture, it is worth considering a few philosophical questions related to modeling.



People think in models, and use them to perceive and understand the world around them. As an illustration, consider a simple model of the world in which only good and bad guys exist. The bad ones always lie, and the good ones always tell the truth. If a person lies to us once, we mark him as a bad guy and stop trusting him. But if this bad guy suddenly starts telling the truth, we may experience cognitive dissonance: a discrepancy between what we observe and the model we hold. You can respond to this in different ways. Some people, for example, may deny what they see and stay within their model. However, that approach is very far from scientific. It would be better to abandon the adopted model, or try to extend it.


Scientific method



The most successful approach to obtaining knowledge is this: we have some experience (empirical knowledge), on the basis of which we create theories and models. All hypotheses we can think of are initially equal. They can explain our experience, or part of it, and also predict new experience. A model does not have to be exact; it can be approximate, as long as it helps us understand, explain, or predict something. Strictly speaking, it is not quite right to talk about a model being correct or incorrect; a model should be judged in terms of its usefulness. For example, after the theory of relativity appeared, it became clear that Newtonian mechanics was not entirely correct. Yet the model it represents is very useful: it helps to explain and predict many things.



A principle known as "Occam's razor" is very useful to scientists. It says that entities should not be introduced into a model unnecessarily. If you can build a simpler model with the same utility, it is better to use it.



The development of search and data analysis is also a kind of scientific work, one aimed at identifying high-level patterns. We have huge data arrays and logs of user actions, on the basis of which we can build models, predict actions, and, based on this, build good services. At the same time, there is much more room for modeling here than, for example, in physics, where new models appear extremely rarely. In search, you can come up with new models at least every day, and each of them will be useful to some degree.



PageRank



As an example, let's talk about the PageRank model, which is defined on the graph of web pages and the links between them. The Internet can be represented as a graph, where the vertices are pages and the edges are links. Pages can be important and useful, or they can be the result of automatic generation, carrying no semantic load or value for most users. Our task is to compute a certain authority for each page: to estimate the probability that we will be interested in dealing with it at all. Based on this indicator, we will be able to select pages for search results and rank them.
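
To make the graph view concrete, here is a minimal sketch of a toy web graph stored as an adjacency map from a page to the pages it links to. The page names are hypothetical and chosen purely for illustration.

```python
# Toy web graph: vertices are pages, edges are hyperlinks (page -> pages it links to).
# The page names are made up for illustration only.
web_graph = {
    "news.example/home":    ["news.example/article", "blog.example/post"],
    "news.example/article": ["news.example/home"],
    "blog.example/post":    ["news.example/home", "news.example/article"],
    "spam.example/page":    ["news.example/home"],  # links out, but nothing links to it
}
```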



There is a very simple classic PageRank algorithm invented at Google. Its inspiration was the citation graph of scientific papers: every paper references the literature it uses and related publications, and the more a given paper is cited, the more authoritative it is considered in the scientific world. The model itself is arranged quite simply; it is the so-called random walker model. There are web pages of varying popularity, connected to each other by links. A user walks through these pages and, with some probability, clicks on one of the outgoing links. Suppose we have many such users, each starting from a random page, and we need to calculate the probability that a user ends up on a particular page.

This is computed as follows. Suppose we have N pages; at the initial moment a user lands on a random page with probability 1/N. We take the probability that he gets tired of reading to be 15 percent; accordingly, with 85 percent probability the user continues surfing and follows one of the outgoing links, chosen uniformly at random. When the user gets tired, he starts again from a random page. The function on the graph is computed iteratively: at iteration t, each node has a value PR(t). We take this PageRank and distribute it evenly over the outgoing links, so a value appears on each link edge, called the PageRank delta, dPR. It turns out that after many iterations the weights in the nodes practically stop changing.
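
Below is a rough sketch of that iteration in Python, assuming the toy web_graph map from the example above and that every link target is itself a vertex of the graph. The 85/15 split matches the probabilities described in the text; the function and variable names are ours, not from the lecture.

```python
def pagerank(graph, d=0.85, iterations=50):
    """Random-walker PageRank: graph maps each page to the list of pages it links to."""
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}                # start: every page gets 1/N
    for _ in range(iterations):
        new_pr = {page: (1.0 - d) / n for page in graph}  # mass from "tired" random restarts
        for page, out_links in graph.items():
            if not out_links:
                # dangling page with no outgoing links: spread its weight evenly
                for target in graph:
                    new_pr[target] += d * pr[page] / n
                continue
            dpr = d * pr[page] / len(out_links)           # the PageRank delta per outgoing link
            for target in out_links:
                new_pr[target] += dpr                     # incoming deltas accumulate here
        pr = new_pr
    return pr

# After enough iterations the values stop changing noticeably:
print(pagerank(web_graph))
```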



Interestingly, even such a simple model exhibits quite reasonable properties on the real web graph. For example, a page that many other pages link to will have a high PR, since its PR is made up of the deltas of the links pointing to it. And the pages it links to, in turn, will also have a fairly high PR.



This model can predict page traffic, although it also has problems, stemming from the fact that the Internet was different back when the model was invented. There were not that many pages, and all links were created by hand. Now, Yandex's database alone contains about 20 billion pages, and not many of them are useful. The main problem with the algorithm in its classical form is that you can set up a spam site and generate pages on it whose sole purpose is to link to a specific page whose PR you want to raise. In addition, classic PR favors old pages.



After watching the lecture to the end, you will learn how to deal with the problems of classical PageRank, how to measure and improve search quality, what the pfound and widepfound models are, and why machine learning is needed in search.

Source: https://habr.com/ru/post/212545/


