
Search technology, or what's the catch in writing your own search engine

Once upon a time, an idea came to me: to write my own search engine. It was a very long time ago, when I was still at university. I knew little about the technology of developing large projects, but I was fluent in a couple of dozen programming languages and protocols, and by that time I ran quite a few websites.

Well, yes, I do have a weakness for monstrous projects...

At that time, little was known about how search engines work: articles in English, and scarce ones at that. Some of my friends, who knew about my research back then, took the documents and ideas we had dug up, including those born in the course of our arguments, and now run quite good courses and come up with new search technologies; in general, this topic spawned a fair amount of interesting work. That work led, among other things, to new developments at various large companies, including Google, although I personally have no direct relation to that.
At the moment I have my own search engine that I know inside and out, with many nuances: PR (PageRank) calculation, collection of per-topic statistics, a trainable ranking function, and some know-how in the form of stripping irrelevant page content such as menus and advertising. The indexing speed is about half a million pages per day. All this runs on my two home servers, and at the moment I am scaling the system out to roughly five spare servers to which I have access.


Here, for the first time, I will publicly describe what I have built personally. I think many will be interested in how Yandex, Google, and almost all the search engines known to me work from the inside.

There are many problems in building such systems that are almost impossible to solve in the general case, but with the help of a few tricks, insights, and a good understanding of how your computer's hardware works, they can be seriously simplified. One example is recalculating PR when, at several tens of millions of pages, the data no longer fits into even the largest RAM, especially if, like me, you are greedy for information and want to store far more than a single number per page. Another task is storing and updating the index: at minimum a two-dimensional database that maps each specific word to the list of documents in which it occurs.
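To make the second task concrete, here is a minimal in-memory sketch of such an inverted index in Python. This is purely my illustration under simplified assumptions (whitespace tokenization, everything held in RAM, hypothetical names like `InvertedIndex`); the real structure lives on disk, is compressed, and is updated incrementally, as the later articles describe.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: word -> {doc_id: [positions]}."""

    def __init__(self):
        # word -> {doc_id: [positions where the word occurs]}
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # Naive whitespace tokenization; a real engine normalizes far more.
        for pos, word in enumerate(text.lower().split()):
            self.postings[word].setdefault(doc_id, []).append(pos)

    def lookup(self, word):
        # All documents (and positions) where the word occurs.
        return self.postings.get(word.lower(), {})

index = InvertedIndex()
index.add_document(1, "search technology and search engines")
index.add_document(2, "how a search engine stores its index")
print(index.lookup("search"))  # {1: [0, 3], 2: [2]}
```

Even this toy version makes the scaling problem visible: the postings grow with every indexed document, which is exactly why the real index cannot simply sit in memory.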

Just think: by one estimate, Google stores more than 500 billion pages in its index. If each word occurred on only one page, exactly once, and storing that fact took just 1 byte (which is impossible, because you need to store at least the page id, already 4 bytes or more), the index would still take 500 GB. In reality a word occurs on a page up to 10 times on average, and the information about an occurrence rarely takes less than 30-50 bytes, so the whole index grows roughly a thousandfold... So how would you store that? And how would you update it?
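The same arithmetic spelled out as a quick back-of-envelope calculation (the figures are the rough estimates above, not measurements):

```python
# Back-of-envelope index size estimate using the rough figures above.
PAGES = 500 * 10**9  # ~500 billion pages in the index (one published estimate)

# Deliberately impossible lower bound: each word on 1 page, once, 1 byte each.
lower_bound = PAGES * 1 * 1
print(lower_bound / 10**9, "GB")  # 500.0 GB

# Closer to reality: ~10 occurrences per page, 30-50 bytes per occurrence.
for bytes_per_hit in (30, 50):
    size = PAGES * 10 * bytes_per_hit
    print(size / 10**12, "TB")  # 150.0 TB and 250.0 TB: hundreds of times
                                # the lower bound, on the order of the
                                # "thousandfold" growth mentioned above
```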

How it all works, I will tell you step by step: how to compute PR quickly and incrementally, how to store millions and billions of text pages, how to store their addresses and search among them quickly, how the different parts of my database are organized, how to incrementally update an index many hundreds of gigabytes in size, and, probably, how to build a learning ranking algorithm.

At present, the index used for searching alone takes 57 GB and grows by about 1 GB every day. The compressed texts take another 25 GB, and I also keep a lot of other useful information whose volume is very hard to estimate because there is so much of it.

Here is the complete list of articles about my project published here:
0. Search technology, or what's the catch in writing your own search engine
1. How does a search engine begin, or a few thoughts about a crawler
2. General remarks about a web search engine
3. Dataflow of the search engine
4. About the removal of insignificant parts of the pages during site indexing
5. Methods to optimize application performance when working with DDB
6. A little about database design for a search engine
7. AVL trees and the breadth of their application.
8. Working with URLs and storing them.
9. Building an index for a search engine

Source: https://habr.com/ru/post/123671/

