
Full-text search: how it is done in Mail.Ru Mail

Historically, Mail.Ru Mail used the engine of the “large” Search (go.mail.ru); however, this option was not optimal for mailbox search: it consumed too many resources and was relatively hard to maintain. Only about 3% of mailbox owners use mail search; although this figure seems small, the mailboxes of these users are usually quite large, and they really do need search. We therefore decided to write a specialized search daemon dedicated to searching mail. The main requirements were limits on consumed resources (index size: no more than 3% of the mailbox size; average RAM consumption: no more than 100 MB; average CPU utilization: no more than 3%) and query speed (average time: no more than 200 ms). Below I will describe how it is organized.

The two main processes involved in mail search are indexing mailboxes and executing search queries. When a new letter arrives, the search index must be updated to include it. Obviously, the data in the index should be ordered and as compact as possible; but in that case an update would most likely require an insert into the middle of the file, which generates a very “heavy” write to disk. Given that new letters arrive many times more often than search queries are executed, using such a heavy operation to keep the index up to date is questionable.

We decided to build the index from two files: a snapshot, which contains the full-text index (sorted data), and an xlog, which contains the list of consecutive transactions applied to the index. Any operation on the index (for example, receiving a new letter) costs exactly one disk write: an append to the end of the xlog file. Executing a search query actually involves two search operations, a binary search over the snapshot and a sequential scan of the xlog, whose results are merged. When the xlog search speed stops satisfying us, we rebuild the snapshot: we apply all changes from the xlog to it and start a fresh xlog. This moment is determined automatically by one of two possible policies: when the execution time of the next query exceeds a set threshold, or when the size of the xlog file does.
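The snapshot + xlog scheme can be sketched in a few lines of Python. This is an illustrative in-memory model, not the daemon's real API: the class name, the size-based rebuild policy, and the data layout are assumptions for the sketch.

```python
import bisect

class SearchIndex:
    """Illustrative model of a sorted snapshot plus an append-only xlog."""

    def __init__(self, xlog_limit=1000):
        self.snapshot = {}   # word -> sorted list of letter ids
        self.xlog = []       # append-only list of (word, letter_id) transactions
        self.xlog_limit = xlog_limit

    def add_letter(self, letter_id, words):
        # Receiving a letter costs one sequential write: append to the xlog.
        for w in words:
            self.xlog.append((w, letter_id))
        if len(self.xlog) > self.xlog_limit:   # size-based rebuild policy
            self.rebuild()

    def rebuild(self):
        # Merge all xlog transactions into the sorted snapshot, then reset xlog.
        for w, lid in self.xlog:
            postings = self.snapshot.setdefault(w, [])
            if lid not in postings:
                bisect.insort(postings, lid)
        self.xlog = []

    def search(self, word):
        # Binary-searchable snapshot lookup plus a sequential scan of the xlog.
        hits = set(self.snapshot.get(word, []))
        hits.update(lid for w, lid in self.xlog if w == word)
        return sorted(hits)
```

In the real daemon both parts live on disk and the rebuild can also be triggered by query latency, but the merge-on-rebuild flow is the same.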
Indexing a new letter begins with tokenization, i.e. splitting the letter into separate words (full-text search operates on whole words and cannot search for an arbitrary substring). Tokenization is not the most trivial task. Take, for example, the email address

d.kalugin-balashov@corp.mail.ru

Obviously, it is a single word. It is reasonable to also make it searchable by the part d.kalugin (a study of user search queries showed that people often try to search by part of an email address). However, it is impossible to support all substrings of a given word, since that would lead to a sharp increase in the index size and, as a consequence, a loss of query speed. A reasonable compromise is to break a word into substrings only at punctuation marks. This gives the following subwords:

d.kalugin-balashov@corp.mail.ru
d.kalugin-balashov@corp.mail
d.kalugin-balashov@corp
d.kalugin-balashov
d.kalugin
d
kalugin-balashov@corp.mail.ru
kalugin-balashov@corp.mail
kalugin-balashov@corp
kalugin-balashov
kalugin
balashov@corp.mail.ru
balashov@corp.mail
balashov@corp
balashov
corp.mail.ru
corp.mail
mail.ru
mail
ru

All these words will be included in the index.
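This punctuation-based splitting can be sketched as follows. The sketch assumes that a subword is any substring that both starts and ends on a token boundary; the exact punctuation set and function name are illustrative, not the tokenizer's real interface.

```python
import re

PUNCT = r"[.\-@/]+"  # assumed set of splitting punctuation

def subwords(word):
    """Return all substrings of `word` that start and end on token boundaries."""
    starts = [0] + [m.end() for m in re.finditer(PUNCT, word)]
    ends = [m.start() for m in re.finditer(PUNCT, word)] + [len(word)]
    return [word[s:e] for s in starts for e in ends if e > s]

tokens = subwords("d.kalugin-balashov@corp.mail.ru")
```

Such a scheme yields the subwords listed above (plus a few extra boundary pairs) with a number of tokens quadratic in the number of punctuation marks, which is why long words need special handling, as described below.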
Note that such recursive word splitting has some problems. For example, system administrators often receive service letters that contain various paths (/usr/local/something/libexec/libany.so), often very long ones. Such words can cause a large recursion depth. Therefore, words longer than the limit specified in the tokenizer configuration file are not split into tokens recursively, but broken into subwords of minimal length (the original word itself, however, also goes into the index).
For example, take the word:

/usr/local/something/libexec/libany.so

Provided that it is longer than the limit allowed for recursive tokenization, it is split into the following subwords:

/usr/local/something/libexec/libany.so
usr
local
something
libexec
libany
so

Such a breakdown gives lower-quality search results, but it is a reasonable compromise in terms of the quality/resources trade-off.
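The long-word fallback is simple to sketch: keep the whole word plus only its minimal punctuation-separated pieces. The threshold value and function name here are assumptions for illustration.

```python
import re

MAX_RECURSIVE_LEN = 30  # assumed threshold from the tokenizer config

def tokenize_long(word):
    """Fallback for words too long to expand recursively:
    the original word plus its minimal punctuation-separated pieces."""
    parts = [p for p in re.split(r"[.\-@/]+", word) if p]
    return [word] + parts
```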
The final stage of tokenization is obtaining the base form of the word (the lemmatizer from the “big search” is used to match all word forms) and taking the CRC32 of it. All words in the index are simply these 32-bit integers.
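In Python this final step is a one-liner; the function name is illustrative, and the masking makes the result an unsigned 32-bit value as the index expects.

```python
import zlib

def word_key(base_form: str) -> int:
    """Words are stored in the index as 32-bit CRC32 values of the base form."""
    return zlib.crc32(base_form.encode("utf-8")) & 0xFFFFFFFF
```

Storing only 32-bit hashes keeps the dictionary compact at the cost of rare collisions between different words.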

A letter has a specific set of numeric zones (folder, date, size, flags, presence of attachments, etc.) and text zones (subject, sender, body, etc.). The list of zones is configured in a special file and can be extended while the search engine is in operation.

The snapshot consists of two parts: a dictionary (a list of all words found in the letters, with pointers (offsets)) and the actual lists of documents (and zones) referenced by the pointers from the dictionary. During a search, the dictionary entries for the words in the query are read, after which the document lists are read via the pointers from the dictionary, and the results are merged. On average, a search query over the snapshot alone requires two disk reads per word: one to read the dictionary and one to read the document list.
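A minimal in-memory model of this layout, with Python lists standing in for the two disk reads; the class name and constructor are assumptions for the sketch.

```python
import bisect

class Snapshot:
    """Sketch of the snapshot: a sorted dictionary of word keys with
    'pointers' into an area of posting lists."""

    def __init__(self, postings_by_word):
        # Dictionary part: sorted word keys; pointer part: parallel lists.
        self.words = sorted(postings_by_word)
        self.postings = [postings_by_word[w] for w in self.words]

    def lookup(self, word_key):
        # First "disk read": binary search in the dictionary.
        i = bisect.bisect_left(self.words, word_key)
        if i == len(self.words) or self.words[i] != word_key:
            return []
        # Second "disk read": fetch the posting list via its pointer.
        return self.postings[i]
```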



The xlog consists of sequentially recorded transactions. The integrity of each transaction is guaranteed by a checksum (CRC32) of its contents. When reading the xlog, all transactions whose checksum does not match are skipped (the reading process is not interrupted until the number of errors exceeds a limit set in the configuration file). A transaction, as a rule, consists of several commands and always describes exactly one letter. The main role is played by the command that describes the list of words present in the letter and the numbers of the text zones in which they occur. Executing a search query over the xlog requires reading the entire file into memory and analyzing all transactions, so the xlog size is strictly limited. The results of queries over the xlog and the snapshot are merged into one common result.
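A hypothetical on-disk framing for such transactions might look like this: a length prefix, the payload, and a CRC32 trailer, with corrupt records skipped up to an error limit. The exact format is an assumption; the article only states that CRC32 guards each transaction.

```python
import struct
import zlib

def write_txn(buf: bytearray, payload: bytes) -> bytearray:
    """Append one framed transaction: <len><payload><crc32>."""
    buf += struct.pack("<I", len(payload)) + payload
    buf += struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)
    return buf

def read_txns(data: bytes, max_errors: int = 3):
    """Read all transactions, skipping those whose checksum does not match,
    until the error budget is exhausted."""
    txns, errors, pos = [], 0, 0
    while pos + 8 <= len(data) and errors <= max_errors:
        (length,) = struct.unpack_from("<I", data, pos)
        end = pos + 4 + length + 4
        if end > len(data):
            break  # truncated tail
        payload = data[pos + 4 : pos + 4 + length]
        (crc,) = struct.unpack_from("<I", data, pos + 4 + length)
        if zlib.crc32(payload) & 0xFFFFFFFF == crc:
            txns.append(payload)
        else:
            errors += 1  # skip the corrupt transaction, keep reading
        pos = end
    return txns
```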



After the list of letters containing the query words is generated, the values of all numeric zones for them are fetched. The values of the numeric zones are stored in the nzdata file. Its structure is similar to that of the snapshot: a dictionary containing all the letter numbers, with pointers to the lists of numeric-zone values of each letter. However, this file is read into memory in its entirety, because the number of accesses to the data inside it, unlike the snapshot, is large, while the file itself is much smaller. Note that nzdata does not contain all the current values of the numeric zones. Like the snapshot, it contains the values of the numeric zones at a certain point in time, and all subsequent changes are stored in the xlog. nzdata is rebuilt at the same time as the snapshot. After all numeric zones are loaded, the xlog is read a second time and all commands that describe changes to numeric zones are applied to the loaded data. Also note that before accessing nzdata, the search daemon tries to obtain the numeric zones from the daemon that serves the mail indexes (which are cached in memory while the mailbox is actively used). This path is many times faster and ensures data consistency. Reading numeric zones from nzdata actually happens only as a fallback.
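The load-then-patch flow for numeric zones can be sketched as follows; the data layout and function name are assumptions for illustration.

```python
def load_numeric_zones(nzdata, xlog_updates):
    """Load the point-in-time nzdata values in full, then apply the later
    numeric-zone changes found on the second pass over the xlog."""
    zones = {lid: dict(z) for lid, z in nzdata.items()}  # copy, do not mutate
    for lid, zone, value in xlog_updates:
        zones.setdefault(lid, {})[zone] = value
    return zones
```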



After obtaining the numeric zones, the search daemon applies the filters specified in the query. Filters are constraints on numeric zones: “equal”, “not equal”, “greater than”, etc. The most popular filters are restrictions by date (“yesterday”, “this week”), by folder (“Inbox”, “Sent”, “not in Spam”), and by flag (“Unread”, “Flagged”, “With attachments”). Filters are applied sequentially to each letter in the result list and exclude some of the letters from it.



At the same stage, groupers are run; they calculate how many results there would be if a given filter were applied.
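Filters and groupers can be modeled together in a few lines. The zone names and data layout are assumptions for this sketch, not the daemon's real schema.

```python
from operator import eq, ne, gt

def apply_filters(letters, filters):
    """letters: {letter_id: {zone: value}}; filters: [(zone, op, value)].
    A letter survives only if every constraint on its numeric zones holds."""
    return {lid: zones for lid, zones in letters.items()
            if all(op(zones.get(zone), value) for zone, op, value in filters)}

def group_count(letters, candidate_filter):
    """Grouper: how many results would there be if this filter were applied?"""
    return len(apply_filters(letters, [candidate_filter]))
```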



The next step is ranking. For mail search, unlike other full-text search tasks, the most relevant ranking function is obvious and simple: sorting by the letter date. Ranking ends with cutting off the letters that do not fall on the current page (an analog of the LIMIT clause in SQL).
Numeric zones alone are not enough to form the search result. Since after ranking there are usually no more than 25 letters left (the standard number of letters displayed on one page), the heavy loading and processing of text zones happens at this stage. After loading, each text zone is split into words, among which those present in the search query are found (including word forms, case-insensitively). These words are “highlighted” with <b> and </b> tags, and the text zone itself is clipped on the left and right to a certain size that is guaranteed to include the first highlighted word.
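The rank-cut-highlight pipeline might look like this; the page size matches the 25 letters mentioned above, while the function name and record layout are assumptions for the sketch (snippet clipping is omitted for brevity).

```python
import re

PAGE_SIZE = 25  # standard number of letters on one page

def render_page(hits, query_words, page=0):
    """Sort by date (newest first), cut to one page, highlight query words."""
    ranked = sorted(hits, key=lambda h: h["date"], reverse=True)
    page_hits = ranked[page * PAGE_SIZE : (page + 1) * PAGE_SIZE]
    pattern = re.compile("|".join(map(re.escape, query_words)), re.IGNORECASE)
    for h in page_hits:
        # Wrap case-insensitive matches in <b>...</b> tags.
        h["snippet"] = pattern.sub(lambda m: "<b>%s</b>" % m.group(0), h["text"])
    return page_hits
```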





The suggest index is built at the moment the xlog is merged into the snapshot. The expected (mean) length of a suggest query is 6 characters.



The suggest index stores words as they appear in the letters: not reduced to the base form and with the original case preserved. The suggest index consists of two parts:

1. A dictionary consisting of word prefixes (up to 6 characters long). For words longer than 6 characters, the dictionary stores references to lists of postfixes.
2. A set of postfix lists (of arbitrary length).



It was established experimentally that building the suggest index with prefixes of exactly 6 characters minimizes its size. Unlike the search index, it makes sense to cache the suggest index, but to keep it in memory for a relatively short time, because users tend to type several letters in a row until the suggestions satisfy them. The prefix list is cached in full, along with all postfix lists read from disk so far. Data from the xlog is not used when generating suggestions, since reading the xlog can take quite a long time; therefore the suggest index always lags slightly behind the real state of the mailbox. It is updated with data from the xlog during the next snapshot rebuild.
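A minimal sketch of this prefix/postfix split, assuming the dictionary maps each 6-character prefix to the list of postfixes of words sharing it; all names are illustrative.

```python
PREFIX_LEN = 6  # prefix length that was found to minimize the index size

def build_suggest(words):
    """Group words by their first PREFIX_LEN characters; the remainder of
    each word goes into that prefix's postfix list."""
    index = {}
    for w in sorted(set(words)):
        index.setdefault(w[:PREFIX_LEN], []).append(w[PREFIX_LEN:])
    return index

def suggest(index, typed):
    """Return all indexed words that start with the typed characters."""
    out = []
    for prefix, postfixes in index.items():
        # A prefix can match whether the user typed less or more than 6 chars.
        if prefix.startswith(typed) or typed.startswith(prefix):
            out.extend(prefix + p for p in postfixes
                       if (prefix + p).startswith(typed))
    return sorted(out)
```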

If you have questions about the implementation, or experience solving a similar problem, let's share best practices.

Dmitry Kalugin-Balashov,
Mail.Ru Mail Team Programmer

Source: https://habr.com/ru/post/167497/

