The full table of contents of my articles about the search engine will be kept up to date here.
In previous articles I described how the search engine works, and now we have arrived at a technically difficult point. Let me remind you that there are two kinds of indices - forward and inverted. A forward index maps a document to the list of words it contains; an inverted index maps a word to the list of documents that contain it. Logically, an inverted index is the best fit for fast search. The order in which the documents are stored in those lists is also an interesting question.
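A toy sketch in Python (my own illustration, not the engine's code) of the difference between the two index types:

```python
from collections import defaultdict

# Forward index: document id -> words it contains.
forward_index = {
    1: ["search", "engine", "index"],
    2: ["engine", "barrel"],
}

# Inverted index: word -> documents that contain it.
inverted_index = defaultdict(list)
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index[word].append(doc_id)

# Answering a query for "engine" is now a single lookup
# instead of a scan over every document.
print(inverted_index["engine"])  # [1, 2]
```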
At the previous DataFlow step, the indexing module gave us a chunk of data in the form of a forward index, link information, and page information. Such a chunk is usually about 200-300 MB and covers about 100 thousand pages. Over time I abandoned storing the complete forward index altogether; now I only keep all these chunks plus the full inverted index in several versions, so that I can roll back.
The structure of the inverted index, in a simplified form: we store a file that begins with a table of offsets to the start of the data for each word, followed by the data itself. I am simplifying, of course. This turns out to be the format most beneficial for search speed - no jumping around the file - as Brin and Page wrote: one seek, one read. At each rebuild iteration I use 20-50 of the chunks described above; obviously I cannot load them all into memory, all the more so because a lot of auxiliary index data is worth keeping there too.
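A minimal sketch of such a layout, assuming the offset table holds an (offset, length) pair of 64-bit integers per word (the field sizes here are my assumption, not taken from the article):

```python
import struct

# file = [offset table: (offset, length) per word] [posting data ...]
ENTRY = struct.Struct("<QQ")  # two uint64 per word

class Barrel:
    def __init__(self, path, num_words):
        self.f = open(path, "rb")
        # The offset table is small enough to keep in memory, so fetching a
        # word's list really is one seek and one read on the data itself.
        raw = self.f.read(ENTRY.size * num_words)
        self.table = [ENTRY.unpack_from(raw, i * ENTRY.size)
                      for i in range(num_words)]

    def postings(self, word_id):
        offset, length = self.table[word_id]
        self.f.seek(offset)
        return self.f.read(length)
```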
To solve this and many other problems related to the limitations of the file system, the disk, and parallel access to several page lists, I split the entire index into 256 parts. Part I contains the page lists for the words with WORD_ID % 256 == I.
This divides the index into approximately equal parts. Moreover, all processing at the next rebuild iteration is confined to a single part: out of the 20-50 chunks of 200-300 MB each, only the data for the words that fall into the part currently being processed needs to be loaded.
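A sketch of that partitioning (the helper names are mine):

```python
NUM_BARRELS = 256

def barrel_of(word_id: int) -> int:
    # Word lists are spread across parts by word id modulo 256,
    # which gives roughly equal parts.
    return word_id % NUM_BARRELS

def words_for_barrel(chunk_words: dict, barrel_id: int) -> dict:
    # When rebuilding part `barrel_id`, only this slice of each
    # input chunk has to be loaded.
    return {word_id: postings
            for word_id, postings in chunk_words.items()
            if barrel_of(word_id) == barrel_id}
```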
In total, we repeat the same procedure 256 times:
- read the necessary data from the input chunks, loading all the information about words, pages, and page URLs; sort it in memory and build lists of the proper structure; pack the page info and assign its ID, HASH, and other data;
- open the barrel (as I call the parts of the index) of the “latest version” of the inverted index for reading, and an empty barrel of the “new version” of the index for writing;
- for each word, merge the two sorted lists in order - the one just built in memory and the one read from the file, which is already sorted (see the sketch after this list);
- close the files, append whatever else is needed, rejoice.
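A sketch of the per-word merge, assuming for simplicity that each entry is a (sort_key, page_id, payload) tuple and both lists are sorted by the same key (in my index that key is the combined factor discussed below); a page present in both lists keeps its fresh in-memory entry:

```python
import heapq

def merge_postings(fresh, old):
    """Merge two posting lists that are already sorted by their sort key.
    `fresh` was just built in memory, `old` is streamed from the previous
    barrel; stale duplicates from `old` are dropped."""
    fresh_ids = {page_id for _, page_id, _ in fresh}
    old = (e for e in old if e[1] not in fresh_ids)  # drop superseded entries
    return list(heapq.merge(fresh, old, key=lambda e: e[0]))
```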
In practice it naturally turned out that the list already stored in the “previous version” of the index remains correctly sorted at the new iteration only if no external information about the pages has changed - and that is obviously not the case.
After all, while we were saving data for the index, we may have recalculated PR, page metrics, site relevance, and so on, and as a result the sorting coefficients change. Whether the lists need to be sorted by such parameters at all is still an open question. Rumor has it that Google sorts the pages in its lists by PR, but then at search time you obviously need to be able to quickly discard the chaff - pages whose PR has been artificially pumped up but whose content is lacking.
I use a combined factor built from general information about a word on a page (number of occurrences; where they occur - title, meta, links, body text; highlighting, formatting), from topic metrics (very poorly developed at the moment), from PR, from per-word PR (I also have a PR for every word), and a few others.
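The article gives no exact formula, so the following is only an illustrative weighted combination; the weights and field names are made up:

```python
def combined_factor(word_stats, topic_score, page_pr, word_pr):
    # Hypothetical weights: the real coefficients are not published.
    occurrence = (1.0 * word_stats["in_title"]
                  + 0.6 * word_stats["in_meta"]
                  + 0.7 * word_stats["in_links"]
                  + 0.2 * word_stats["in_text"]
                  + 0.3 * word_stats["highlighted"])
    return (0.4 * occurrence
            + 0.1 * topic_score
            + 0.3 * page_pr
            + 0.2 * word_pr)
```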
And obviously, for high-frequency queries we really only need the first 1000-2000 results in the list - well, let's count 262 thousand to be safe - because all the popular answers are bound to fall into them, and then the ranking function can be run over those results to sort them properly.
For low-frequency queries, when we really do need those 2-3-10 results, a full pass over the list cannot be avoided; however, its total length rarely exceeds 100 thousand pages.
In general, if we can guarantee that the first N thousand pages in the list are the best of the remaining couple of million, the problem is solved, and this is quite easy to guarantee with heuristic methods. Then, to rebuild each barrel correctly (a sketch follows the list below):
- remember which pages' PR changed significantly during recalculation;
- load the first N thousand entries from the list, plus a little from the rest selected by PR, page metrics, and other heuristics;
- load from the input data the portion of words belonging to the current barrel;
- update all the data (PR, metrics, and so on) to the new, correct values only for the selected pages;
- sort hundreds of thousands of entries instead of a couple of tens of millions, and do it in memory;
- write out the sorted part, and simply copy the rest from the previous index, removing duplicates.
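A sketch of that partial rebuild for a single word's list; all names here (rebuild_word_list, changed_pr, rank, the entry fields) are hypothetical, since the article only describes the steps:

```python
def rebuild_word_list(old_list, fresh_entries, changed_pr, top_n, rank):
    """old_list      -- previous version, sorted by the combined factor
       fresh_entries -- new/updated entries built in memory this iteration
       changed_pr    -- ids of pages whose PR changed a lot since last time
       rank          -- function computing the up-to-date combined factor"""
    head, tail = old_list[:top_n], old_list[top_n:]

    # Re-rank the head, the fresh entries, and any page from the tail
    # whose PR changed significantly; fresh data wins on duplicates.
    promoted = [e for e in tail if e.page_id in changed_pr]
    candidates = {e.page_id: e for e in head + promoted + fresh_entries}
    resorted = sorted(candidates.values(), key=rank, reverse=True)

    # The rest of the tail is copied over as-is, skipping re-ranked pages.
    rest = [e for e in tail if e.page_id not in candidates]
    return resorted + rest
```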
And of course, you need a “rebuild everything” function that, say, once a month or once every six months rebuilds the whole index from scratch with the simplest straightforward algorithm.
Clearly, even this amount of work, applied to the roughly 3 million words and numbers in use on average, is not easy, but compared with a full rebuild the gain is significant: you do not need to sort on disk, and you do not need to load external metrics, PR, and other parameters - because you cannot cache all of that in memory anyway.
It is amusing that the iterative rebuilding method gives many times better performance than the straightforward construction of the inverted index from the forward one. Of course, some search engines manage to do that fairly quickly on fairly powerful hardware, but it is still not the best approach, even if it is the most correct one. The sheer number of long inserts into the inverted index negates all the advantages, and in the storage structures known to me either insertion speed or sequential read speed suffers.