In this lecture, using Yandex as an example, we will look at the basic components needed to build an Internet search engine. We will talk about how these components interact and what their distinctive features are. You will also learn what document ranking is and how search quality is measured.
The lecture is aimed at high school students attending the Small ShAD, but adults can also learn a lot from it about how a search engine works.
The first component of our search engine is the Spider. It crawls the Internet and tries to download as much information as possible. The Robot then processes the downloaded documents to make them easier to search: plain HTML files are inconvenient to search through directly, because they are large and full of irrelevant markup, so the Robot strips out everything unnecessary. Finally, there is the Search itself, which receives queries and returns answers.

Spider
How do we know how well the Spider is doing? The first metric is what percentage of sites we have seen. The second is how quickly we notice changes.
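To make these two metrics concrete, here is a minimal sketch; the function and field names are hypothetical, not Yandex's actual code:

```python
from datetime import datetime, timedelta

def coverage(seen_sites: set[str], known_sites: set[str]) -> float:
    """Metric 1: what share of the known sites the Spider has visited."""
    return len(seen_sites & known_sites) / len(known_sites)

def freshness_lag(changed_at: datetime, noticed_at: datetime) -> timedelta:
    """Metric 2: how long it took the Spider to notice a page change."""
    return noticed_at - changed_at

print(coverage({"a.ru", "b.ru"}, {"a.ru", "b.ru", "c.ru"}))        # 0.666...
print(freshness_lag(datetime(2012, 5, 1), datetime(2012, 5, 3)))   # 2 days, 0:00:00
```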
Runet:
- Fetching servers: 300;
- Load: 20,000 documents per second;
- Traffic: 400 MB/s (3,200 Mbit/s).
Total (worldwide):
- Fetching servers: 700;
- Load: 35,000 documents per second;
- Traffic: 700 MB/s (5,600 Mbit/s).
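A quick sanity check of the quoted numbers: dividing traffic by load gives the implied average size of a raw downloaded document.

```python
# Runet: 400 MB/s spread over 20,000 documents/s
print(400e6 / 20_000)   # 20000.0 bytes, i.e. ~20 KB per raw document
# Worldwide: 700 MB/s over 35,000 documents/s gives the same ~20 KB
print(700e6 / 35_000)   # 20000.0
# 400 MB/s in bits: 400 * 8 = 3,200 Mbit/s, matching the figure above
print(400 * 8)          # 3200
```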
If the entire Spider, with all its servers, started downloading a single site, the result would be a fairly powerful DDoS attack. To prevent such situations there is a component called Zora. It coordinates downloads: it knows which sites were downloaded recently and which should be downloaded in the near future.
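The internals of Zora are not public; below is a minimal per-host politeness scheduler in the same spirit, which makes sure no single site is hit more than once per `delay` seconds no matter how many fetching servers want its pages. All names are illustrative.

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Allows each host to be fetched at most once per `delay` seconds."""

    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.last_fetch: dict[str, float] = {}   # host -> last fetch time

    def wait_turn(self, url: str) -> None:
        host = urlparse(url).netloc
        next_allowed = self.last_fetch.get(host, 0.0) + self.delay
        now = time.monotonic()
        if now < next_allowed:
            time.sleep(next_allowed - now)        # hold back, don't DDoS the site
        self.last_fetch[host] = time.monotonic()

scheduler = PolitenessScheduler(delay=2.0)
for url in ["https://example.com/a", "https://example.com/b"]:
    scheduler.wait_turn(url)   # the second call sleeps ~2 seconds
    # ...the actual download would happen here...
```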

The Spider should also be able to look at a page from different vantage points, to make sure that, for example, it looks the same from Russia and from the USA. If a site serves different pages to different regions, the Spider must take this into account and process that information.
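One hedged way to check whether a page differs by region is to download it through exit points in different countries and compare the results. The proxy addresses below are placeholders, not real endpoints:

```python
import hashlib
import urllib.request

def page_fingerprint(url: str, proxy: str | None = None) -> str:
    """Download a page (optionally through a proxy) and hash its body."""
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    body = opener.open(url, timeout=10).read()
    return hashlib.sha256(body).hexdigest()

# Hypothetical exit points; substitute real proxy addresses to run this.
ru = page_fingerprint("http://example.com", proxy="http://ru-proxy.example:3128")
us = page_fingerprint("http://example.com", proxy="http://us-proxy.example:3128")
print("same everywhere" if ru == us else "region-dependent page")
```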
Robot
Pages downloaded by the Spider go to the Robot, which processes them and makes them searchable. First of all, the HTML markup is removed and the important sections of the document are identified.
The Robot consists of three components. The first is a MapReduce distributed computing system, which computes properties for every document the Spider has downloaded. The second is a cluster of servers that builds the search index. The third is an archive containing several versions of the search base.

We mentioned document properties. What do we mean by this? Suppose we have downloaded an HTML file and need to gather as much data as possible about its content. MapReduce performs the calculations and attaches labels to the document that will later be used as ranking factors: the document's language, topic, commercial orientation, and so on.
Factors come in two kinds: fast and slow. Slow factors are computed only once and belong to the document alone. Fast factors are computed at query time, for the combination of document and search query.
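A minimal sketch of this split, with invented factor names: slow factors depend only on the document, so the offline MapReduce stage can compute them once; fast factors need both the document and the query, so they can only be computed when the query arrives.

```python
def slow_factors(doc_text: str) -> dict:
    """Query-independent: computed once by the offline stage and stored."""
    return {
        "length": len(doc_text.split()),
        "is_commercial": "buy" in doc_text.lower(),
    }

def fast_factors(doc_text: str, query: str) -> dict:
    """Query-dependent: computed per (document, query) pair at search time."""
    words = set(doc_text.lower().split())
    q_words = set(query.lower().split())
    return {"query_words_found": len(words & q_words) / len(q_words)}

doc = "Buy a frame in Moscow"
print(slow_factors(doc))                  # stored alongside the document
print(fast_factors(doc, "frame Moscow"))  # recomputed for every query
```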
Even without counting the MapReduce servers (they can be used for other tasks), the Robot comprises more than two thousand servers.
Russian base:
- Factor cluster: 650;
- Search base build: 169;
- Test servers: 878;
- Archive: 172.
World base:
- Factor cluster: 301;
- Search base build: 120;
- Test servers: ???;
- Archive: 60.
The base contains about 25 billion documents (214 TB) and is fully recomputed twice a week.
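Another sanity check on the numbers: the average processed document in the base comes out noticeably smaller than the ~20 KB of raw downloaded data per document, which is at least consistent with the Robot having stripped the markup.

```python
# 214 TB spread over 25 billion documents:
print(214e12 / 25e9)   # 8560.0 bytes, i.e. roughly 8.5 KB per processed document
```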
Structure of the search index
Suppose we have three very simple documents containing short texts:
- Mom washed the frame
- Buy a frame in Moscow
- Moscow for moms
Scanning all these documents for the words of each incoming query would be inefficient. Instead, an inverted index is built: we write out all the words from the three documents (normalizing word forms) and record which documents each one occurs in.
- Mom (1, 3)
- Wash (1)
- Frame (1, 2)
- Moscow (2, 3)
- Buy (2)
- In (2)
- For (3)
Now, if we receive the search query [mom], we already have a ready answer: one lookup in the table tells us which documents this word occurs in. If the query contains more than one word (for example, [mom Moscow]), the Robot uses the same table to find the documents in which both words appear. In our case, that is the third document.
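This whole example fits in a few lines of Python. Below is a toy sketch of building the inverted index and answering a two-word query by intersecting posting lists; the tiny `LEMMAS` table stands in for real morphology normalization:

```python
docs = {
    1: "mom washed the frame",
    2: "buy a frame in moscow",
    3: "moscow for moms",
}

# Toy stand-in for real morphology: map word forms to a base form.
LEMMAS = {"moms": "mom", "washed": "wash"}

def normalize(word: str) -> str:
    return LEMMAS.get(word.lower(), word.lower())

# Build the inverted index: base form -> set of document ids.
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(normalize(word), set()).add(doc_id)

print(index["mom"])   # {1, 3}, matching the table above

# Multi-word query: intersect the posting lists of all query words.
def search(query: str) -> set[int]:
    postings = [index.get(normalize(w), set()) for w in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("mom moscow"))   # {3}, as in the example above
```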
As we said, our real search index is 214 TB in size. For the Search to work with sufficient speed, all this data must be held in RAM. Our servers currently have from 24 to 128 GB of memory. To make this manageable, we divide the search base into tiers (from the English tier, meaning level), which split documents by language and other features. So when we receive a query in Russian, we can search only the relevant documents. In total we have more than ten such tiers. Each tier is divided into shards of 32 gigabytes: the amount of data that fits in the memory of a single physical machine. We currently have about 6,700 shards.
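The shard count follows directly from the numbers above. Here is the arithmetic, plus a hypothetical routing sketch (the per-tier split is made up for illustration):

```python
# 214 TB split into 32 GB shards:
print(214e12 / 32e9)   # ~6687 shards, close to the quoted ~6,700

# Hypothetical routing: a tier per language, a shard within the tier by hash.
SHARDS_PER_TIER = {"ru": 3000, "en": 2000, "other": 1700}   # made-up split

def route(doc_id: str, language: str) -> tuple[str, int]:
    tier = language if language in SHARDS_PER_TIER else "other"
    return tier, hash(doc_id) % SHARDS_PER_TIER[tier]

print(route("doc-42", "ru"))   # e.g. ('ru', 1234); shard number varies per run
```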
Query processing
The stream of search queries to Yandex can reach 36,000 requests per second. Simply receiving that much traffic is a serious task in itself, so there are several levels of load balancing: DNS serves as the first level, an L3 balancer distributes packets, and an HTTP balancer processes their contents. First of all, requests from robots are filtered out. Then typos are corrected and the query is analyzed; the result is a "query tree" containing the possible spellings of the query and its likely meanings. After all this processing, the query is sent to the frontend and the search itself begins. Besides the main search over all the documents in the base, there are many smaller searches with specific parameters: images, video, event listings, and so on. If these searches return relevant results, they are blended into the main results.
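A simplified, runnable sketch of this pipeline; the typo dictionary, the expansions, and the blending rule are all illustrative stubs, not Yandex's actual logic:

```python
TYPO_FIXES = {"moskow": "moscow"}           # toy typo dictionary
SYNONYMS = {"moscow": ["moscow", "msk"]}    # toy spelling/meaning expansions

def fix_typos(query: str) -> str:
    return " ".join(TYPO_FIXES.get(w, w) for w in query.lower().split())

def build_query_tree(query: str) -> list[list[str]]:
    """Per word, the alternatives to try (a flat stand-in for the query tree)."""
    return [SYNONYMS.get(w, [w]) for w in query.split()]

def handle_query(raw_query: str, user_agent: str) -> list[str]:
    if "bot" in user_agent.lower():          # 1. cut off requests from robots
        return []
    query = fix_typos(raw_query)             # 2. correct typos
    tree = build_query_tree(query)           # 3. expand spellings and meanings
    results = [f"web results for {tree}"]    # 4. main search (stubbed)
    vertical = ["image results"]             # 5. smaller searches (stubbed)
    if vertical:                             #    blend relevant verticals in
        results += vertical
    return results

print(handle_query("moskow frame", user_agent="Mozilla/5.0"))
```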
