
Fast morphology, or files versus MySQL

On one project I had to build super-fast Russian morphology: about 50,000 words per second on a rather weak laptop, which is only 2-3 times slower than stemming (trimming endings according to rules), but far more accurate. That figure is for a regular hard disk; on an SSD or a virtual disk the search is much faster.

The original version ran on MySQL, but by moving it to plain files I achieved a roughly hundred-fold increase in performance. When and why files turn out to be faster than MySQL is what this article is about.


But first, so as not to weary those who are only interested in the implementation itself: here is a link to it on GitHub. The source is in PHP. Porting it to another language is easy; it is no more than 100 lines of code. The hard part, converting the database into a searchable form, is already done. You only need to port the search to your language.

Task


Anyone involved in contextual advertising uses Yandex.WordStat at least a little. It is a service for keyword selection, but it contains a lot of noise.
[Screenshot: a Yandex.WordStat results list with duplicate queries crossed out and refinements highlighted]

In the screenshot, all queries that duplicate previous ones are crossed out, and the difference between the main query and a refinement is highlighted in green. In other words, all 60 words from the screenshot can be written down with just six:

  Odessa apartments
    rent
    lease
    inexpensively


So about 90% of the information in WordStat is noise. The main problem is that you have to copy the data, paste it into a text editor and edit it by hand. It is much more convenient for users to work with checkboxes. Here, for example, is a comparison of WordStat with what I ended up with:

[Screenshot comparing Yandex.WordStat and HTraffic: the same data, different visualization.]

Overall, this speeds up keyword selection 10-20 times. However, all of it requires morphology. At the same time, the WordStat API can return up to 200 queries, each with about 5 words on average; that is already 1,000 words. On top of that, together with a query we also process its synonyms, of which there can be up to 10. So 10,000 words is a realistic situation.
Of course, you can use caching and remember previously found word forms. That reduces the number of lookups roughly tenfold, but it does not completely solve the speed problem.
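
A minimal sketch of such a cache; findWordId() here is just a placeholder for whatever slow lookup is used (MySQL, a file, etc.), not a function from the library:

    <?php
    // Simple in-memory cache of previously resolved word forms.
    $wordIdCache = [];

    function cachedWordId(string $wordForm): ?int
    {
        global $wordIdCache;
        if (array_key_exists($wordForm, $wordIdCache)) {
            return $wordIdCache[$wordForm];   // resolved earlier, no lookup needed
        }
        $id = findWordId($wordForm);          // the slow lookup (placeholder)
        $wordIdCache[$wordForm] = $id;        // remember the result, including misses (null)
        return $id;
    }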

Homonymy and normalization


Speed is not our only problem. The Russian language is too great and powerful: it has a huge number of word forms (spellings of a word in different cases, genders, and so on), and sometimes word forms of different words are written identically. This is called homonymy. For example, when the system encounters the string "дели" in a text, it does not know which word to attribute it to.


Such ambiguity gets in our way. It would be far more convenient if every word string had at most one interpretation: there would be much less code, and the search would be faster.

But teaching a machine to resolve homonymy is a hard task. It takes a long time to implement and can also hurt performance.

I copied the solution to this problem from Yandex.Direct: words that share homonymous forms are merged into one, i.e. they get a single id. Occasionally this leads to glitches, when a query is matched not by the word you searched for but by the one glued to it, yet this happens quite rarely. Otherwise Direct would not use such a solution.

The AOT.ru database


There is only one free database of Russian morphology on the Runet: AOT.ru. It contains about 200 thousand words, each with about 15 word forms on average. Its documentation is rather complicated and has inaccuracies, so I will spend a couple of paragraphs describing it.

The database looks like this: it has sets of endings, prefixes and words. A word record has a pseudo-base (roughly speaking, the stem of the word), the number of a prefix (if any) and the number of a set of endings. To generate all forms of a word, you loop over the endings of that set and prepend the prefix and the pseudo-base to each of them, as in the sketch below.
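
Roughly, the expansion looks like this; the field names are illustrative and do not match the actual AOT file format:

    <?php
    // Expand one compressed dictionary entry into all of its word forms.
    function expandEntry(array $entry, array $endingSets): array
    {
        $forms  = [];
        $prefix = $entry['prefix'] ?? '';
        foreach ($endingSets[$entry['endingSetId']] as $ending) {
            $forms[] = $prefix . $entry['pseudoBase'] . $ending;
        }
        return $forms;
    }

    // Example: pseudo-base "кошк" with an ending set shared by similar nouns.
    print_r(expandEntry(
        ['pseudoBase' => 'кошк', 'endingSetId' => 0],
        [0 => ['а', 'и', 'е', 'у', 'ой', 'ами', 'ах']]
    ));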

There are several points:


A dictionary in which the endings are stored separately from the words is called compressed. Such a base takes little space, but searching it is extremely slow. Therefore, the dictionary has to be expanded: every word form is written out separately. But there are several million word forms, so the result takes several dozen megabytes. The first thing I tried for storing it was a database.

MySQL


I wrote all word forms into one table with two columns: the text of the form and the word id. I added an index on the form text, but the speed was extremely low, and running table optimization procedures did not help.

To understand the reason, I started searching for 10 words at once with IN. The speed increased only slightly, so the problem was not in SQL parsing, and NoSQL would not have helped me.
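
The setup looked roughly like this; the table and column names are my reconstruction, not the article's actual schema:

    <?php
    // One table of word forms with an index on the text column.
    $pdo = new PDO('mysql:host=localhost;dbname=morphology;charset=utf8', 'user', 'password');

    $pdo->exec('CREATE TABLE IF NOT EXISTS word_forms (
        form    VARCHAR(64) NOT NULL,
        word_id INT UNSIGNED NOT NULL,
        KEY idx_form (form)
    )');

    // One query per word form.
    $one = $pdo->prepare('SELECT word_id FROM word_forms WHERE form = ?');
    $one->execute(['кошками']);
    $wordId = $one->fetchColumn();

    // Batching: several word forms in a single query via IN.
    $forms        = ['кошками', 'снять', 'квартиру'];
    $placeholders = implode(',', array_fill(0, count($forms), '?'));
    $many         = $pdo->prepare("SELECT form, word_id FROM word_forms WHERE form IN ($placeholders)");
    $many->execute($forms);
    $found = $many->fetchAll(PDO::FETCH_KEY_PAIR);   // form => word_id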

Playing with index types and storage engines also gave almost nothing.

I also studied several performance tests comparing MySQL with other engines, for example Mongo. None of them showed a noticeable advantage in read speed over MySQL.

CRC32


I decided to shorten the index: the text of a word form is replaced by its checksum (crc32).

There are about 2 million word forms and 2^32 ≈ 4 billion possible values, so the probability of a collision is about 0.05%. Among the known words there are about 1,000 collisions; they can be excluded simply by listing them all in an associative array with their real ids:
if(isset($Collisions[$wordStr])) return $Collisions[$wordStr];
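
In context, the lookup looks roughly like this; lookupByCrc() is a placeholder for the actual index search (file or DB), not a function from the library:

    <?php
    // $collisions is a small precomputed map (word form => real word id) covering
    // the ~1,000 forms whose crc32 clashes with another form.
    function findWordId(string $wordStr, array $collisions): ?int
    {
        if (isset($collisions[$wordStr])) {
            return $collisions[$wordStr];     // known collision, resolved explicitly
        }
        return lookupByCrc(crc32($wordStr));  // everything else is found by its 4-byte checksum
    }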

This does not, however, exclude false positives: if a word form is not in the database, then in roughly 1 case out of 2,000 a wrong id is returned. But this matters only when searching huge multi-topic text corpora (such as Wikipedia).

A person's vocabulary very rarely exceeds 10,000 words; Leo Tolstoy, for example, used about 20,000. There are 200,000 words in the database, so if you search the complete works of Tolstoy, the probability that a collision makes the search find something extra is about 1 in 10 × 2,000, that is, 1 in 20,000.

In other words, the probability of glitches due to crc32 is negligible, and it is more than offset by the gain in performance and the reduction in data volume.

However, even this did not help much: the speed was about 500 words per second. I started looking towards files.

Files


When all the MySQL optimization options had been tried, I decided to write the morphological dictionary to a file and search it with binary search. The results exceeded my expectations: the speed increased about three times.

This figure surprised me, because I was writing in PHP, and scripting languages carry their own overhead (such as marshalling), while MySQL and my implementation use similar algorithms: binary trees and binary search.

In both cases the complexity is O(log n), i.e. about log2(2,000,000) ≈ 21 read operations from the file. PHP has to convert the bytes it reads into its internal representation 21 times; in MySQL this does not happen. In theory, PHP should have lost noticeably in speed.
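
A binary search over a file of sorted fixed-size records can be sketched like this (8-byte records: a 4-byte crc32 key plus a 4-byte word id; this is an illustration, not the library's exact on-disk format):

    <?php
    // Binary search in a file of sorted fixed-size records.
    function fileBinarySearch($handle, int $recordCount, int $key): ?int
    {
        $lo = 0;
        $hi = $recordCount - 1;
        while ($lo <= $hi) {
            $mid = intdiv($lo + $hi, 2);
            fseek($handle, $mid * 8);
            $rec = unpack('Ncrc/Nid', fread($handle, 8));
            if ($rec['crc'] === $key) {
                return $rec['id'];
            }
            if ($rec['crc'] < $key) {
                $lo = $mid + 1;
            } else {
                $hi = $mid - 1;
            }
        }
        return null;   // the form is not in the dictionary
    }

    $handle = fopen('morphology.idx', 'rb');
    $count  = intdiv(filesize('morphology.idx'), 8);
    $id     = fileBinarySearch($handle, $count, crc32('кошками'));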

To understand more, I added file locking around the reads. The locking itself is not needed here, because reads do not conflict with reads and I was testing in a single thread; I was interested in the overhead. The speed dropped 1.5 times. It is likely that one of the main reasons MySQL is slower is that it guards against write conflicts between threads, which our task does not need.
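
The experiment amounted to wrapping every read in a shared lock, roughly like this (a sketch; $handle is a dictionary file handle like the one in the binary-search example above):

    <?php
    // Wrap each read in a shared lock, purely to measure the locking overhead.
    function lockedRead($handle, int $offset, int $length): string
    {
        flock($handle, LOCK_SH);    // shared lock: readers do not block each other
        fseek($handle, $offset);
        $data = fread($handle, $length);
        flock($handle, LOCK_UN);    // release right after the read
        return $data;
    }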

But this still does not explain half of the difference in speed, and I do not understand where the rest of the slowdown comes from. Trees and the driver carry some overhead, but not that much. I would be interested in your opinions in the comments.

Hashing


Trees are not the optimal structure for this kind of lookup; a hash is much faster. A hash lookup takes constant time: in our case we spend 1-2 operations instead of 21. Moreover, the reads become sequential, so the disk cache works better and the read head does not have to jump back and forth.

A hash is what PHP itself uses, for example, to implement associative arrays. Databases use binary trees because they support sorting and the greater-than/less-than operators. But when you need to select elements equal to some value, trees lose to a hash tenfold on large tables.

In this way I managed to speed the library up 10-15 times. I did not stop there: I started storing the crc values and the word ids in separate files, so that reads became fully sequential, and I began keeping the file handles open between searches.
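
The idea can be sketched as an open-addressed hash table stored in two parallel files, with the handles kept open between lookups (again an illustration, not the library's exact on-disk format):

    <?php
    // Hash lookup over two parallel files: crc.bin (4-byte keys) and ids.bin (4-byte word ids).
    // The slot is derived from the key, so a lookup costs one or two reads instead of ~21.
    // Assumes 64-bit PHP (crc32() is then non-negative); empty slots hold 0.
    class FileHashIndex
    {
        private $crcHandle;
        private $idHandle;
        private int $slots;

        public function __construct(string $dir)
        {
            $this->crcHandle = fopen("$dir/crc.bin", 'rb');  // handles stay open between lookups
            $this->idHandle  = fopen("$dir/ids.bin", 'rb');
            $this->slots     = intdiv(filesize("$dir/crc.bin"), 4);
        }

        public function find(string $wordForm): ?int
        {
            $key  = crc32($wordForm);
            $slot = $key % $this->slots;
            for ($i = 0; $i < $this->slots; $i++) {          // linear probing
                $pos = ($slot + $i) % $this->slots;
                fseek($this->crcHandle, $pos * 4);
                $crc = unpack('N', fread($this->crcHandle, 4))[1];
                if ($crc === 0) {
                    return null;                              // empty slot: unknown form
                }
                if ($crc === $key) {
                    fseek($this->idHandle, $pos * 4);
                    return unpack('N', fread($this->idHandle, 4))[1];
                }
            }
            return null;
        }
    }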

The final performance gain over MySQL was 80-100 times. The spread in the estimate is due to testing with different ratios of misspelled and correct word forms.




Conclusions


I am not urging you to replace databases with files always and everywhere; a DB has far more features. But there are several cases where such a replacement is justified, for example static databases (cities by IP, morphology). Moving to files, among other things, also makes life easier: to install on another server it is enough to copy the files.

The main thing a database gives you is protection from conflicts between threads, which is quite hard to organize yourself when working with files. A DB blocks reads while a write is in progress, but under high load that creates problems of its own.

No matter how many servers and threads you have, they all get blocked while one of them is writing data to the database. A lot depends on the storage engine: some lock at the row level, but even that does not always help.

That is why databases are often split: one is used for reads, another for writes, and they are synchronized from time to time, for example at night. If most of your read queries select with the "=" operator, you can multiply the speed by copying the tables into files and using a hash.

If the write database is the bottleneck, in some cases it is faster to write something like a change log and, during synchronization, replay that log into the normal database. But this option does not always help either.

Source: https://habr.com/ru/post/230393/

