
Search engine dataflow

A continuation of the article "How to start a search engine, or a few thoughts about the crawler".

In the previous article I briefly covered the experiments with crawl intensity and the work of the crawler. Here I will describe, in general terms, the project's dataflow up to the point where the index is built, so that it is clear what I am writing about. I will try to describe each step in detail in its own article.

So, a downloaded page first goes through link extraction. New links pointing to the current site are put into a local queue, to be downloaded in the current session; links to all other sites are added to the common crawler queue. That queue contains only the main pages of sites.
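A minimal sketch of that routing step (the queue objects and names here are my illustration, not the project's actual code):

```python
from urllib.parse import urljoin, urlparse

def route_links(page_url, hrefs, local_queue, global_queue):
    """Split extracted links: same-site links go to the session-local
    list, other sites go to the common crawler queue (a set) as main
    pages only."""
    current_host = urlparse(page_url).netloc
    for href in hrefs:
        url = urljoin(page_url, href)
        host = urlparse(url).netloc
        if host == current_host:
            local_queue.append(url)              # crawl in the current session
        else:
            global_queue.add(f"http://{host}/")  # only the site's main page
```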
Once a sufficient number of pages of one site has been collected, the analyzer is launched: it detects the template fragments that appear on most pages of the site and cuts them out.
The output is the text of every page, stripped of everything superfluous and grouped by site.
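Roughly, the template cutting can be sketched like this; real template detection works on HTML blocks rather than bare text lines, and the 80% threshold is an assumption of mine:

```python
from collections import Counter

def strip_site_template(pages, threshold=0.8):
    """pages: dict {url: list of text blocks}. Blocks that appear on
    most pages of the site are treated as template and removed."""
    block_counts = Counter()
    for blocks in pages.values():
        block_counts.update(set(blocks))          # count each block once per page
    n_pages = len(pages)
    template = {b for b, c in block_counts.items() if c / n_pages >= threshold}
    return {url: [b for b in blocks if b not in template]
            for url, blocks in pages.items()}
```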

I put all the finished pages of a site into a directory that goes to the indexer. The indexer builds a small version of the "direct" index out of several (10-1000) sites prepared at the previous step: it parses the HTML, matches the words, packs the data and computes per-word metrics. The direct index is a table in which, given a document ID, I can get the list of its words and all the information about their occurrences in the document. Attached to it are a word database (built from this small set of documents) and the links (not yet added to the general index and therefore without IDs). The links go to the PR recalculation module (words and their weights are part of the link information); the rest goes to update the inverted index. The chunk the indexer produces is usually 100-300 MB per hour, and that is without the page texts.
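To make the terminology concrete, a direct-index record could look something like the sketch below; the field names are illustrative, not the project's actual on-disk format:

```python
from dataclasses import dataclass, field

@dataclass
class WordEntry:
    word_id: int            # ID in the chunk-local word database
    positions: list         # offsets of the word in the document
    weight: float = 1.0     # per-word metric (e.g. title or anchor boost)

@dataclass
class DirectIndexRecord:
    doc_id: int
    words: list = field(default_factory=list)   # list of WordEntry

# The direct index maps doc_id -> record; the inverted index built later
# flips this into word_id -> sorted list of doc_ids.
direct_index = {}
```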

Computing PR and various metrics is done by a separate module. The funny thing is that for PR calculation it is more convenient to store links as strings rather than as IDs, because not all links will make it into the final index, and it would be rather silly to clutter the link storage with unneeded entries. Because of this, it is easier to write this part in Perl or a similar language that handles strings well; otherwise you have to build your own system for storing a large number of strings in a compiled language.
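For illustration, here is what plain power-iteration PR over string-keyed links looks like (a standard textbook version with a 0.85 damping factor; the author's module additionally folds word weights from the links into the calculation):

```python
def pagerank(links, iterations=20, d=0.85):
    """links: dict {source_url: [target_url, ...]} - URLs kept as strings,
    no numeric IDs needed for pages that may never enter the index.
    Dangling pages are ignored for simplicity."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / len(pages) for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            share = d * rank[src] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank
```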

At the output of the indexer we get a packed chunk of the direct index. But for fast search we need an inverted index: a table in which each word is associated with a list of the documents that contain it, sorted by relevance or by PR (as Google does). Obviously, simply converting the whole direct index into an inverted one would take years once the base grows, so the inverted index is grown incrementally: a couple of dozen direct-index chunks prepared at the previous stage are merged in and sorted. Technically both happen at once - a sorted insertion into an already sorted list. There are plenty of details here: how to organize the database so it survives this, how to avoid copying hundred-gigabyte files every time, how to keep the needed piece in memory, and so on.
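An in-memory sketch of that sorted insertion, assuming a precomputed rank (e.g. PR) for every document; all the on-disk machinery the paragraph mentions is left out:

```python
import bisect

def merge_chunk(inverted, chunk_postings, rank):
    """inverted: word_id -> list of doc_ids kept sorted by descending rank.
    chunk_postings: word_id -> new doc_ids from one direct-index chunk.
    rank: doc_id -> precomputed score (PR or similar) for all documents."""
    for word_id, doc_ids in chunk_postings.items():
        postings = inverted.setdefault(word_id, [])
        keys = [-rank[d] for d in postings]          # parallel sort keys
        for doc in doc_ids:
            pos = bisect.bisect_left(keys, -rank[doc])
            postings.insert(pos, doc)                # sorted insertion
            keys.insert(pos, -rank[doc])
    return inverted
```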
The database of words, metrics and everything else is prepared in advance, also incrementally, by merging the previous version with the new information. Word IDs in the small direct index are replaced on the fly with the IDs from the full database, and the rest of the information is remapped in the same way.
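The ID remapping itself is conceptually simple; a toy version, reusing the record layout from the earlier sketch:

```python
def remap_word_ids(chunk_records, local_words, global_words):
    """Replace chunk-local word IDs with IDs from the full word database,
    adding new words to it along the way.
    local_words: local_id -> word; global_words: word -> global_id."""
    mapping = {}
    for local_id, word in local_words.items():
        if word not in global_words:
            global_words[word] = len(global_words)   # assign a new global ID
        mapping[local_id] = global_words[word]
    for record in chunk_records:
        for entry in record.words:
            entry.word_id = mapping[entry.word_id]
    return chunk_records
```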
When building the inverted index, a separate database that stores URLs is also updated. This is a large structure of its own: a huge number of records has to be stored with fast access by ID, sequential reading within one site and quick checks for whether a link already exists. I will go into this question when describing the database structures.
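As a rough picture of the interface such a URL store has to provide (an in-memory toy, while the real structure is disk-based and far larger):

```python
class UrlStore:
    """Toy URL database: access by ID, sequential listing within one
    site, fast existence check."""
    def __init__(self):
        self.by_id = {}        # url_id -> url
        self.by_url = {}       # url -> url_id (existence check)
        self.by_site = {}      # host -> [url_id, ...] for sequential reads

    def add(self, url_id, url, host):
        self.by_id[url_id] = url
        self.by_url[url] = url_id
        self.by_site.setdefault(host, []).append(url_id)

    def exists(self, url):
        return url in self.by_url

    def site_urls(self, host):
        return [self.by_id[i] for i in self.by_site.get(host, [])]
```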

At the stage of rebuilding the inverted index we need the PR data, page metrics and everything else that helps sort the results. In fact, only the first couple of hundred thousand results need to be guaranteed sorted; if a query ever gets past them, ordering the rest becomes the job of the ranking function.

The built inverted index can already be used for searching: we take the lists for the words of the query, find their intersection, compute the coefficients and the distances between the found words, and sort the results with the ranking function.
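A toy version of that query path, assuming the index also carries word positions so the distance between query words can be scored; `rank_fn` stands in for the ranking function:

```python
def search(query_words, inverted, positions, rank_fn):
    """inverted: word -> set of doc_ids; positions: (doc_id, word) -> [pos, ...]."""
    posting_sets = [inverted.get(w, set()) for w in query_words]
    if not posting_sets:
        return []
    docs = set.intersection(*posting_sets)           # documents with all words
    scored = []
    for doc in docs:
        # minimal distance between occurrences of the first two query words
        dist = 0
        if len(query_words) >= 2:
            a = positions[(doc, query_words[0])]
            b = positions[(doc, query_words[1])]
            dist = min(abs(i - j) for i in a for j in b)
        scored.append((rank_fn(doc, dist), doc))
    return [doc for _, doc in sorted(scored, reverse=True)]
```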

A little about morphology. First, using Zaliznyak's morphological dictionary, we pick the longest matching stem, cut off the ending and substitute the first dictionary form. The whole process is compiled into a tree for fast parsing; the leaves contain the possible ending variants. We walk along the word, descending the tree letter by letter, until we reach the deepest reachable leaf, and there, based on the ending, we substitute the normalized form.
If no normal form is found, we apply stemming: from the texts of books downloaded from lib.ru I built a table of ending frequencies; we look for the most frequent of the suitable endings (also with a tree) and replace it to obtain the normal form. Stemming works well even if the word did not exist in the language 5-10 years ago - it will happily reduce "crawlers" to "crawler".
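A much-reduced sketch of this two-stage normalization; the tree layout and the frequency-table format are my assumptions, not the author's actual structures:

```python
def normalize(word, stem_trie, ending_freq):
    """stem_trie: nested dict built from the morphological dictionary,
    mapping stem letters to deeper nodes; a '#' key at a node holds a
    {ending: normal_ending} table. ending_freq: ending -> (frequency,
    normal_ending), used as the stemming fallback."""
    node, depth, best = stem_trie, 0, None
    for ch in word:
        if ch not in node:
            break
        node = node[ch]
        depth += 1
        if '#' in node:                       # a leaf with known endings
            best = (depth, node['#'])
    if best:
        depth, endings = best
        ending = word[depth:]
        if ending in endings:
            return word[:depth] + endings[ending]   # dictionary normal form
    # stemming fallback: replace the most frequent matching ending
    for ending in sorted(ending_freq, key=lambda e: -ending_freq[e][0]):
        if ending and word.endswith(ending):
            return word[:-len(ending)] + ending_freq[ending][1]
    return word
```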

The full table of contents and the list of my articles about the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/

Source: https://habr.com/ru/post/123805/

