
Whose morphology is better? Yandex vs Google

It is widely believed that Yandex handles Russian morphology better than Google does. In this article I will show that the opposite is true.

This article is an adaptation for Habr of my article on SeoNews.

Russian morphology

Russian has several hundred thousand words, each of which occurs in many word forms. An adjective, for example, can have 100 word forms.

As a result, storing the morphological dictionary naively (“head-on”) takes about 500 MB: 500,000 (number of words) * 75 (average number of word forms) * (10 bytes (average word length) + 4 bytes (word number) + 2 bytes (word-form number)). For speed, all of this data has to be kept in memory, and speed is critical for a search engine.
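
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are mine, and one byte per character is an assumption (for example, a single-byte Cyrillic encoding). With the article's round averages the product comes to 600 million bytes, the same order of magnitude as the "about 500 MB" quoted above.

```python
# Rough memory estimate for an uncompressed ("head-on") morphological dictionary.
# All figures are the article's round averages, not measurements.
WORDS = 500_000        # number of words
FORMS_PER_WORD = 75    # average number of word forms per word
AVG_FORM_LENGTH = 10   # average word length, ASSUMING 1 byte per character
WORD_ID_BYTES = 4      # bytes to store the word number
FORM_ID_BYTES = 2      # bytes to store the word-form number

# Each stored word form carries its text plus the two identifiers.
bytes_per_entry = AVG_FORM_LENGTH + WORD_ID_BYTES + FORM_ID_BYTES
total_bytes = WORDS * FORMS_PER_WORD * bytes_per_entry

print(bytes_per_entry)  # 16
print(total_bytes)      # 600000000
```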

There is also a “compressed” representation. Many words take the same endings in the same forms, for example “great” and “mighty”. For such words we only need to store the beginning of the word (the stem) and the number of the ending group it belongs to. That takes about 5 MB: 500,000 * (8 bytes (average stem length) + 2 bytes (group number)). However, a dictionary stored this way will contain artifacts.
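
A minimal sketch of the stem-plus-group idea described above. The two Russian adjectives, the four-form paradigm, and all names (`ENDING_GROUPS`, `STEMS`, `expand`, `find_stem`) are illustrative assumptions of mine, not data or code from any search engine.

```python
from typing import List, Optional

# Toy "compressed" morphological dictionary: stem + ending-group number.
# The words and the four-form paradigm are illustrative assumptions.

# Each group lists the endings shared by every word in that group
# (here: masc. nominative, masc. genitive, fem. nominative, plural nominative).
ENDING_GROUPS = {
    0: ["ый", "ого", "ая", "ые"],
}

# Per word we store only the stem and its group number.
STEMS = {
    "нов": 0,     # новый ("new")
    "красив": 0,  # красивый ("beautiful")
}


def expand(stem: str) -> List[str]:
    """Generate all word forms of a stem from its ending group."""
    return [stem + ending for ending in ENDING_GROUPS[STEMS[stem]]]


def find_stem(word_form: str) -> Optional[str]:
    """Return the stem whose paradigm contains word_form, or None."""
    # Try every split of the word into stem + ending.
    for cut in range(1, len(word_form) + 1):
        stem, ending = word_form[:cut], word_form[cut:]
        if stem in STEMS and ending in ENDING_GROUPS[STEMS[stem]]:
            return stem
    return None


print(expand("нов"))          # ['новый', 'нового', 'новая', 'новые']
print(find_stem("красивая"))  # красив
print(find_stem("красиво"))   # None: this form is not in the toy paradigm

# Size estimate with the article's averages: 8-byte stem + 2-byte group number.
print(500_000 * (8 + 2))      # 5000000
```

The lists of endings are stored once per group, so their size is negligible next to the per-word entries.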

Artifacts

There are few rules for turning verbs (to do) into participles (doing). In a compressed dictionary, therefore, verbal adverbs and participles are treated as word forms of the verb rather than as separate words.
The rules for forming the perfective aspect of a verb, however, are innumerable (as in the imperfective and perfective pairs for “do”, for “buy”, and for “search” and “find”), so in a compressed dictionary the perfective and imperfective verbs are different words.
These artifacts matter only for search, where morphology is used to merge word forms.

Yandex

Yandex highlights not only word forms, but also synonyms. However, synonym highlighting can be turned off using the "+" operator.

In Yandex, perfective and imperfective verbs are linked through synonyms rather than through morphology.
Verbs and participles, by contrast, are linked through morphology.
This grouping clearly shows the artifacts of a compressed morphological dictionary. In other words, Yandex uses compression.
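
The two kinds of link described in this section can be modelled with a toy example. This is a sketch under my own assumptions, not Yandex's implementation: the word list, the tables, and the function names are hypothetical, and `use_synonyms=False` only imitates the effect of the "+" operator mentioned above.

```python
# Toy model: morphology maps word forms to a lexeme,
# synonyms link one lexeme to another. All data here is hypothetical.

# Morphology: the participle and verbal adverb are forms of the verb,
# while the perfective verb is a separate lexeme.
FORM_TO_LEXEME = {
    "делать": "делать",     # imperfective infinitive ("to do")
    "делает": "делать",     # 3rd person singular
    "делающий": "делать",   # participle ("doing")
    "делая": "делать",      # verbal adverb
    "сделать": "сделать",   # perfective infinitive, a different lexeme
    "сделает": "сделать",
}

# Synonyms: the perfective and imperfective lexemes are linked here instead.
SYNONYMS = {
    "делать": {"сделать"},
    "сделать": {"делать"},
}


def matching_lexemes(query_word, use_synonyms=True):
    """Lexemes that a query word is allowed to match."""
    lexeme = FORM_TO_LEXEME.get(query_word, query_word)
    result = {lexeme}
    if use_synonyms:
        result |= SYNONYMS.get(lexeme, set())
    return result


def matches(query_word, document_word, use_synonyms=True):
    """True if document_word counts as a hit for query_word."""
    doc_lexeme = FORM_TO_LEXEME.get(document_word)
    return doc_lexeme in matching_lexemes(query_word, use_synonyms)


print(matches("делать", "делающий", use_synonyms=False))  # True: morphology
print(matches("делать", "сделать", use_synonyms=False))   # False: separate lexemes
print(matches("делать", "сделать", use_synonyms=True))    # True: via synonyms
```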

Differences in search results

Perhaps the highlighting simply lags behind the “brains” and does not reflect how the search itself works. However, for high-frequency queries synonym highlighting switches off by itself. This shows that, at least for synonyms, the highlighting is tied to the brains: it would not switch off for no reason. The only explanation is that there are already enough results, so Yandex saves resources by not running the synonym search.

The difference in results is easy to see with queries that contain the verb in both aspects and the participle: for example, the imperfective, perfective, and participle variants of “make an enema”, typed into both Yandex and Google.

Impact on result quality

We have shown that Yandex's morphology contains artifacts and that they affect ranking, although they do not necessarily affect the quality of the results. However, I quickly found a few exceptions in Yandex: the imperfective and perfective verbs for “buy”, for “pull out”, and for “send” are merged at the morphology level. The only hypothesis I have for why these exceptions exist is that they were added to improve the results. Consequently, the artifacts do degrade the results, at least in particular cases.

Google

Google uses uncompressed morphology. At least, I did not manage to find “compression artifacts”.

Google's only departure from the formal model of the Russian language is that the positive (good) and superlative (best) degrees of adjectives are separated in its morphology. They are probably linked as synonyms, but Google does not highlight synonyms.

This cannot be explained as a compression artifact: there are few rules for forming the degrees of adjectives (beautiful -> most beautiful, clever -> cleverest), and neither the AOT.ru database nor the Zaliznyak dictionary separates the degrees of an adjective.

The separation of adjectives by degree is the result of optimizing result quality. Degree changes an adjective's “coloring”, so the semantic link between degrees is closer to synonymy than to the link between word forms. For example, the query “beautiful photos” is much closer to the same phrase in another case form than to “most beautiful photos”.

This matches the intuitive view of the language. Several times I have seen “good” and “better” cited as an example of Yandex understanding synonyms.

Why it happened

Yandex's morphology was written about 10 years ago, when 500 MB of memory for several hundred servers could cost a lot of money. Memory has become cheaper since then, but changing the morphology would trigger a whole cascade of changes in the Yandex database. That is why Yandex still uses compressed morphology.
Google was originally an English-language search engine. English words have only a few word forms, so there is no point in compressing the morphology. That is apparently why Google does not use compression for Russian morphology either.

Summary

Google's morphology is organized “more correctly” and works slightly better than Yandex's. The irony is that this stems from Google's English-language origins.
However, morphology is just one of many aspects of search results. Claiming that Google's results are better than Yandex's on the basis of morphology alone is like judging intelligence by the height of the forehead. The purpose of this article was to dispel the belief that Google's morphology is organized worse than Yandex's.

Source: https://habr.com/ru/post/173351/

