The idea of search engines has always haunted me, especially the fact that their creators did not at first realize the extraordinary prospects of the technology.
I decided to learn in practice what a search engine really is, and called mine nanorit.com. For the experiments I did not use any existing API from Google, but decided to build my own.
To start, I loaded a database of domains, which came to about 70,000 unique sites. Then I developed a search robot that connects to each site in turn and downloads all the links on the main page that belong to that same site. I added this restriction so that the robot does not get bogged down in the depths of a large site or a popular forum, though I think I will optimize the algorithm further. After that, I mark the site as indexed, with the indexing date, and move on to the next site.
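The link filter described above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the robot's actual code; the names LinkCollector and same_site_links are made up for the example, and "belongs to this site" is assumed to mean "has the same host name".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute links found in `html` that stay on the host of `base_url`.

    Assumption: a link belongs to the site when its host name is identical.
    """
    host = urlparse(base_url).netloc.lower()
    collector = LinkCollector()
    collector.feed(html)

    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        same_host = parsed.netloc.lower() == host
        if parsed.scheme in ("http", "https") and same_host:
            if absolute not in links:  # drop duplicates, keep page order
                links.append(absolute)
    return links
```

Downloading the main page itself (for example with urllib.request.urlopen) and saving the indexing date are left out of the sketch; only the same-site restriction is shown.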
What I have achieved so far: there are now about 1.5 million documents in the database. I load only the headings, because loading the body of each document is too expensive in resources. The database already takes up 500 MB on disk and lives on ordinary hosting, without a dedicated server.
Next, I described my idea to a Ph.D. I know; we studied together. He told me about linguistic analysis. I decided to split all the headings into separate words and build a register of these words, plus a linking table that lists, for each heading, the identifiers of its words. The result was 139,000 words in the index and 2,184,204 word-to-heading links. I then wrote a search algorithm for this index, but the results turned out worse than a simple scan with LIKE '%keyword%', so I have decided not to develop the algorithm further for now.
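To make the word register and the linking table more concrete, here is a small in-memory sketch in Python. It is an assumption-laden illustration, not the code behind the site: the function names tokenize, build_index and search are invented for the example, words are split on non-word characters, and a heading matches only if it contains every word of the query.

```python
import re


def tokenize(heading):
    """Split a heading into lowercase words (assumed splitting rule)."""
    return re.findall(r"\w+", heading.lower())


def build_index(headings):
    """Build the word register and the linking table.

    headings: dict mapping heading id -> heading text.
    Returns (word_ids, links), where word_ids maps word -> word id and
    links is a list of (heading_id, word_id) pairs.
    """
    word_ids = {}
    links = []
    for heading_id, text in headings.items():
        # dict.fromkeys removes repeated words but keeps their order
        for word in dict.fromkeys(tokenize(text)):
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            links.append((heading_id, word_id))
    return word_ids, links


def search(query, word_ids, links):
    """Return sorted ids of headings that contain every query word."""
    wanted = set()
    for word in tokenize(query):
        if word not in word_ids:
            return []  # an unknown word cannot match any heading
        wanted.add(word_ids[word])
    if not wanted:
        return []

    found = {}
    for heading_id, word_id in links:
        if word_id in wanted:
            found.setdefault(heading_id, set()).add(word_id)
    return sorted(h for h, ids in found.items() if ids == wanted)
```

Note that this sketch matches whole words only, while LIKE '%keyword%' also matches substrings inside longer words, so the two approaches can return different sets of headings for the same query.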
Then I decided to gauge user interest and added a rating of search queries: for each query I count the number of hits. The most interesting part is that search engine robots have also started to "click" on these queries. There is a danger of a ban because of this, but so far Yandex is still indexing the site.
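A query rating of this kind can be sketched in a few lines of Python. The class name QueryRating and the normalization rule (lowercase, collapsed spaces) are assumptions made for the example; the real rating presumably lives in the same database as the index.

```python
from collections import Counter


class QueryRating:
    """Counts how many times each search query has been requested."""

    def __init__(self):
        self.hits = Counter()

    def record(self, query):
        # Assumed normalization: lowercase and collapse whitespace, so
        # "Search  Engine" and "search engine" count as one query.
        normalized = " ".join(query.lower().split())
        if normalized:  # ignore empty queries
            self.hits[normalized] += 1

    def top(self, n=10):
        """Return the n most frequent queries as (query, hits) pairs."""
        return self.hits.most_common(n)
```

A counter like this cannot tell a person from a robot, so hits from search engine robots would inflate the rating unless they are filtered out separately.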
I have now also added a feature for submitting your own site to the index. Users have shown interest here too and regularly add their sites.
What conclusion did I reach? It is not gods who fire the pots: this is work that ordinary people can do. That is the main conclusion. I now plan to develop the idea and buy a dedicated server for the search engine. Further plans are to study the architecture of cluster data processing and to speed up query processing. Frankly, compared with Google, the search is currently very slow.