Since so many questions came up about how the search engine works overall, here is a small introductory article. To make it a little clearer what a search engine is and what it should do, I will describe things in general terms. It probably will not be very interesting for programmers, so don't blame me.
But to the point: in my humble opinion, a search engine should be able to find the most relevant results for a search query. In the familiar case of text search, the query is a collection of words; I personally limited its length to eight words. The answer is a set of links to the pages most relevant to the query. It is advisable to supply each link with an annotation, so that a person knows what to expect and can pick the desired result - such an annotation is called a snippet.
It must be said that the search problem in its general form is unsolved: for any document that ranks highest for, say, the word "work", you can create a modified copy that looks even better from the search engine's point of view but is complete nonsense to a person. It is only a question of cost and time, of course. Given the vastness of today's Internet, there are, to put it mildly, a lot of such pages. Different systems fight them in different ways and with varying success; someday artificial intelligence will conquer us all...
Algorithms that recognize meaning would help here, but I am familiar with only one of them (one that actually recognizes meaning rather than counting statistics), and I have little idea of its applicability. So the problems are solved empirically, by picking manipulations on the pages that separate the wheat from the chaff.
In the real world it is impossible to crawl the entire Internet in a second and pick the best results on the fly, so a search engine stores a local copy of the piece of the network it has managed to collect and process. To quickly extract, out of a billion pages, only those that contain the right words, an "index" is built: a database in which each word maps to the list of pages containing it. Naturally, it also has to store the positions where the words occurred, how they were highlighted in the text, and other numeric metrics of the page, so they can be used when sorting the results.
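A minimal sketch of that data structure in Python. Real postings pack positions, highlighting flags and the other metrics described below, but the shape of the mapping is the same:

```python
from collections import defaultdict

# Minimal illustration of an inverted index: word -> list of postings.
# A real engine packs far more per posting (highlighting, ranking
# metrics); here a posting is just (page_id, word positions).
index = defaultdict(list)

def index_page(page_id, text):
    positions = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        positions[word].append(pos)
    for word, pos_list in positions.items():
        index[word].append((page_id, pos_list))

index_page(1, "search engines build an index of words")
index_page(2, "an index maps each word to pages")

print(index["index"])   # -> [(1, [4]), (2, [1])]
```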
Say I have 100 million pages. An average word occurs on 1-1.5% of pages, i.e. about 1 million pages per word (there are words found on every second page, and there are rarer ones). Let's say about 3 million distinct words occur that often; the rest are much rarer, mostly typos and numbers. One record stating that a specific word occurs on a specific page takes: page id, 4 bytes; site id, 4 bytes; packed information about where and how the word was highlighted, 16-32 bytes; 3 link ranking factors, 12 bytes; the remaining metrics, about 12-24 bytes. I leave it to you to estimate the index size:
3 million words * 1 million pages * the size of one record.
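Taking the figures above at face value (3 million common words, 1 million pages per word, and the per-record byte ranges quoted), a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope index size, using the figures quoted above.
words          = 3_000_000      # distinct common words
pages_per_word = 1_000_000      # average postings per word
# Per-posting record: page id (4) + site id (4) + packed positions/
# highlighting (16..32) + 3 link factors (12) + other metrics (12..24).
record_min = 4 + 4 + 16 + 12 + 12   # 48 bytes
record_max = 4 + 4 + 32 + 12 + 24   # 76 bytes

records = words * pages_per_word    # 3e12 postings
for rec in (record_min, record_max):
    print(f"{records * rec / 1e12:.0f} TB")   # ~144 TB .. ~228 TB
```

Even this deliberately rough multiplication lands in the hundreds of terabytes, which is why every byte of the record format matters.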
To build this index, three mechanisms are needed:
indexing pages - fetching pages from the web and doing their initial processing
building link metrics such as PageRank from the collected information (see the sketch after this list)
updating the existing index - adding the new information to it and sorting it by the computed metrics, in particular PageRank
Additionally, the page texts need to be saved, to build annotations during the search.
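The PageRank step from the list above can be pictured as the classic textbook power iteration; this is just the standard formulation, heavily simplified compared to what runs over a real link graph:

```python
# Textbook PageRank by power iteration (a sketch, not production code).
# links[p] is the list of pages that page p links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if not outgoing:            # dangling page: spread evenly
                share = damping * rank[p] / len(pages)
                for q in pages:
                    new_rank[q] += share
            else:
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new_rank[q] += share
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))
```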
Search process
Many relevance metrics can be identified: some depend on how "useful" the result is to a specific user, others on the total number of results found, still others on the properties of the pages themselves; some search engines, for example, have a certain "standard" they aspire to.
For a machine, i.e. the server, to sort the results by some metric, each page must be reduced to a set of numbers. For example: the total number of query words found in the page text, their weight computed from how those words are highlighted in the page, and so on. Another kind of factor does not depend on the query at all - for example, the number of pages that link to this one: the more there are, the more significant the page is in the output. A third kind of factor depends on the query itself: how rare the words in it are, and which of them are so common that they can be skipped.
From this large number of factors, the search engine must derive a single number per page - the relevance - and sort all the results by it.
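One common way to collapse the factors into a single number is a weighted sum; the factor names and weights below are invented purely for illustration:

```python
# Collapsing many ranking factors into one relevance number.
# Factor names and weights are made up for illustration only; real
# engines use hundreds of factors and far more elaborate formulas.
WEIGHTS = {
    "words_found":  2.0,   # query-dependent: matched words on the page
    "text_weight":  1.5,   # query-dependent: emphasis of the matches
    "inlink_count": 0.5,   # query-independent: pages linking here
    "word_rarity":  3.0,   # query-only: rare words matter more
}

def relevance(factors):
    return sum(WEIGHTS[name] * value for name, value in factors.items())

page_a = {"words_found": 3, "text_weight": 0.8, "inlink_count": 12, "word_rarity": 0.4}
page_b = {"words_found": 2, "text_weight": 1.0, "inlink_count": 40, "word_rarity": 0.4}
results = sorted([("a", page_a), ("b", page_b)],
                 key=lambda kv: relevance(kv[1]), reverse=True)
print([(pid, round(relevance(f), 2)) for pid, f in results])
```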
Once the index is built, you can search it (a toy end-to-end sketch follows this list):
split the query into words, select the index slices matching each word, intersect them or do something else, depending on the chosen policy
calculate the factors for each page - there can easily be more than a thousand of them
build a relevance metric from the factors, sort, and select the best results
build the annotations - the snippets - and display the result
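Putting the four steps together over the toy index built earlier (run it after the first sketch; the intersection policy, scoring and snippets are all heavily simplified here):

```python
# The four search steps over the toy inverted index from above.
# texts[] stands in for the saved page texts used to build snippets.
texts = {1: "search engines build an index of words",
         2: "an index maps each word to pages"}

def search(query, max_words=8):
    words = query.lower().split()[:max_words]      # 1. split the query
    postings = [dict(index[w]) for w in words if w in index]
    if not postings:
        return []
    page_ids = set.intersection(*(set(p) for p in postings))  # AND policy
    scored = []
    for pid in page_ids:                           # 2-3. score each page
        score = sum(len(p[pid]) for p in postings) # crude: match count
        scored.append((score, pid))
    scored.sort(reverse=True)                      # sort by relevance
    results = []
    for score, pid in scored:                      # 4. build snippets
        snippet = " ".join(texts[pid].split()[:6]) + "..."
        results.append((pid, score, snippet))
    return results

print(search("index word"))   # -> [(2, 2, 'an index maps each word to...')]
```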
The full content and list of my articles on the search engine will be updated here:
http://habrahabr.ru/blogs/search_engines/123671/