
I'm writing a search engine (virtual project). Part 1.1. Inside the brick

There are two ways to design: “top down” and “bottom up”. It looks like I'm reinventing the wheel again and going a third way: from the middle.
For me personally, the most interesting thing at the moment is a “partial derivative” of search, namely search within a single site (or a group of sites combined into a single unit), so that is the direction I will take.

Besides the search itself, there is also the problem of updating the index. For example, my colleagues set up the Sphinx search engine (www.sphinxsearch.com). One of the problems they ran into was the inability to update the main index quickly, which is unacceptable for a media site.
They worked around it by searching across several indexes (Sphinx allows this). During the day, whenever a new article was posted, the index for the current day was rebuilt. Over a week, 7 such indexes accumulated, and on the weekend, when the server load dropped, the main index was rebuilt to include them. And so on in a circle. The index was made this rigid in order to speed up searching. You have to pay for everything.
I have heard that the developer of Sphinx is already solving (or has even solved) this problem: indexes can be merged without regenerating them from the original data. In doing so he pointed out (for which many thanks) one of the rakes you can step on. Information about the technological rakes waiting for you during development is no less valuable than all the manuscripts on search theory. After all, the greatest difficulties begin when you try to turn theory into practice.
Phew! That was a lot of words. But if not here, I would have had to explain the reasons for my decision in the comments anyway.
I want to split the base index into three parallel ones:
The main index - it holds the bulk of the content.
Today's index - all updates made during the current day.
Yesterday's index - after midnight, “today” becomes “yesterday” and makes room for a new day.
During the day, the main index and “yesterday's” index are merged, after which “yesterday's” index is deleted and the main index is replaced by the result of the merge.
This minimizes the cost of keeping the index current.
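
The three-index scheme above can be sketched roughly as follows. This is only an illustrative toy, not the project's actual design: the names `InvertedIndex`, `TieredIndex`, `publish`, `rotate_at_midnight` and `merge_yesterday` are hypothetical, the index lives in memory, and edits or deletions of already indexed documents are ignored.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # .get() avoids creating empty entries for unknown terms.
        return set(self.postings.get(term.lower(), ()))

    def merged_with(self, other):
        """Combine two indexes without re-reading the source documents."""
        result = InvertedIndex()
        for source in (self, other):
            for term, doc_ids in source.postings.items():
                result.postings[term] |= doc_ids
        return result


class TieredIndex:
    """Main + yesterday + today indexes, queried in parallel (hypothetical names)."""

    def __init__(self):
        self.main = InvertedIndex()
        self.yesterday = None  # exists only between midnight and the merge
        self.today = InvertedIndex()

    def publish(self, doc_id, text):
        # New content only ever touches the small "today" index.
        self.today.add(doc_id, text)

    def rotate_at_midnight(self):
        # If the previous merge never ran, do it now so nothing is lost.
        self.merge_yesterday()
        self.yesterday = self.today
        self.today = InvertedIndex()

    def merge_yesterday(self):
        # Build the merged index first, then swap it in for the main one.
        if self.yesterday is None:
            return
        self.main = self.main.merged_with(self.yesterday)
        self.yesterday = None

    def search(self, term):
        hits = self.main.search(term) | self.today.search(term)
        if self.yesterday is not None:
            hits |= self.yesterday.search(term)
        return hits
```

In this sketch a query always consults every index that currently exists, so a document stays findable while it moves from “today” to “yesterday” to the main index.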
In the usual search engine mode (when the data is updated as the crawler gets around to scanning the site), the “yesterday's” index is not needed. Instead, the “today's” index is built from the freshly received pages, then we pretend that midnight has struck and proceed by the same algorithm.
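
A minimal sketch of that crawler-driven cycle, under the same caveats: the function names `build_batch_index`, `fold_into_main` and `crawl_cycle` are made up for illustration, indexes are plain dictionaries mapping a term to a set of URLs, and stale postings of a re-crawled page are not cleaned up.

```python
def build_batch_index(pages):
    """Build a small "today" index (term -> set of URLs) from freshly crawled pages."""
    index = {}
    for url, text in pages.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(url)
    return index


def fold_into_main(main, batch):
    """Return a new main index with the batch merged in; the inputs are left untouched."""
    merged = {term: set(urls) for term, urls in main.items()}
    for term, urls in batch.items():
        merged.setdefault(term, set()).update(urls)
    return merged


def crawl_cycle(main, pages):
    """One crawl pass: index the fresh pages, then "declare midnight" and merge at once."""
    today = build_batch_index(pages)
    return fold_into_main(main, today)
```

Because `fold_into_main` builds a new dictionary, the old main index can keep serving queries until the merged one is ready to replace it.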

P.S. Google is already working on a technology for instantly indexing updates to site content - PubSubHubbub. I do not think the entire index will be rebuilt each time the next portion of updates arrives; most likely the new items will accumulate in a kind of buffer index that is available in parallel with the main index. Search engines have long used such technology for news search. Now it is time to extend it to all content.


Source: https://habr.com/ru/post/89543/

