Hybrid realization of Russian morphology

When creating a search solution, one of the first things that a developer has to face is text preprocessing. Breakdown of terms, filtering stop words. An important operation affecting the quality of the search at this stage is to bring the words to normal form. Below are the main approaches to this problem.

In the simplest case, you can use the truncation operator (stemming) , which throws off the endings and leaves the basic form of the word. The most popular solution for the English language is a snawball stemmer. It is based on a few simple rules. For example, if s is the last letter in a word, then it can be dropped. Here is an example of his work for the English language:

comments -> comment
using -> use
stemming -> stem
received -> receiv
develops -> develop.
')
Although Snawball supports the Russian language, its use does not provide enough qualitative results. Here is an example of two word forms of the same word for which truncation occurs in different ways:

review -> reviews
reviews -> review

Another approach is to store all pairs (word, normal form). The downsides to this are obvious. If there is an unfamiliar word, the algorithm will not work for it. Another minus follows from the fact that the modern dictionary of the Russian language contains more than one and a half million word forms. All of them need to be stored and quickly search among them. You can build a state machine to reduce the amount of memory required and increase the speed of work. The algorithm for constructing such an automaton is given in [1]. Implementation in C ++ is available at aot.ru. This algorithm is quite difficult to implement.

You can use a hybrid version of stemming and vocabulary approach. In most cases, we will need only the knowledge of the last few letters of a word to correctly determine its normal form. Thus, we replace the pairs (word, normal form) with (ending word, ending normally form). As the end of the word we will take the last 7 letters of the word. Thus, the problem of the absence of unfamiliar words in the dictionary was solved. It is clear that this result will be worse than the vocabulary, but much better than the result of the measurer. For a word of no more than 7 letters of words, the algorithm will give the exact answer. It is also used in the implementation of morphology from aot.ru as a heuristic for unfamiliar words. The difference of implementation is in the same way a finite state machine is built as for words from a dictionary.

Consider the question of its complexity and the required amount of memory. In the Russian alphabet there are 33 letters plus in it is necessary to take into account that words can contain a hyphen. Thus, one letter in a word is encoded with a number from 0 to 33 and the 64-bit data type (for example, long in java) can contain information about 7–12 letters. This is quite enough for our purposes. For the experiment, let's take a dictionary from the aot.ru project site, distributed under the LGPL 2.0 license. He will give us 750 thousand unique pairs (the end of a word, the end of a normal form). Some of the pairs will have the same endings of words for which you want to define a normal form. We will leave only one - which is more common and suitable for more common words in Russian. So they way we will have about 690 thousand pairs. In particular, for more than 600 thousand cases, the normal form will determine unequivocally.

It is easy to calculate that the data required for the operation of the algorithm will take about 10.6 megabytes. The desired ending can be found using a binary search. (you need to find about 20 comparisons). For comparison in the algorithm from aot.ru, the size of the automaton without morphological information is about 5 megabytes. To implement a heuristic at the end of a word, one more automaton is needed, unfortunately its size is not given. Thus, the memory size is comparable, and this implementation does not give much worse results at lower costs.

The implementation of this approach was implemented in java, the Russian language Analayzer was also made
for the search engine framework lucene . It is distributed under the apache 2.0 license.

[1] Jan Daciuk, Bruce Watson, and Richard Watson, Increase Construction of Minimal Acyclic Finite State Automata and Transducers, Proceedings of the Natural History Processing, pp. 48-56, Bilkent University, Ankara, Turkey, June 29 - July 1, 1998.

Hybrid realization of Russian morphology.

Source: https://habr.com/ru/post/66560/

All Articles

Hybrid realization of Russian morphology

More articles: