One of the promising approaches in machine learning is based on memorizing the already disassembled examples and finding a similar sample. For example, we already have a collection of decrypted audio recordings, and if a new sound file appears, we are looking for a similar sample and build recognition on the basis of it. Consider how, based on this principle, you can build the morphology of the Russian language.
In most cases, the word form is formed by changing the word suffix. In this case, in the simplest case, the rule according to which the normal form will be determined is given by a triple (n, newsuffix, morphinfo)
- n is the length of the suffix to be dropped;
- newsuffix - new suffix, adding which will give a normal form;
- morphinfo - morphological information (for example, case, gender, etc.).
For example, for the word “ people
” the rule will have the following form - (4, “man”, ”noun, masculine, in the nominative case, plural”)
. In the event that a word can have several normal word forms ( wine
-> wine, wine
), the rule will consist of several triples. Examples when the word form is also formed with the help of the prefix will not be considered. Basically, this applies to superlatives, which, in principle, can be interpreted as a separate word (for example, “best”, “greatest”
). Having established such a rule for each word, we can easily get a normal word form.
The dictionary from the AOT
project contains about 5 million word forms, according to which about 30 thousand rules are built to determine the normal form of the word. Actually, these are the same processed results, so necessary for the construction of our algorithm. At the same time, at first glance, it seems that to find the necessary rule, it is necessary to store all 5 million word forms. In fact, it is not.
Let's try to build a heuristic to find the normal form of the word, which at the same time will help reduce the amount of stored information. For this we will use the following pattern: the longer a common suffix have two words, the more likely they will have the same rule for the formation of a normal form.
We proceed as follows: we will sort all word forms in the lexographic order of inverted words. For example, for words a, weapon, side, sea, the
following order will be true: a, side, weapon, sea
. We get the following pattern: words that are after sorting side by side will be more likely to have the same rule. Now we will remove from our list all the words whose rules coincide with the rule of the word preceding it. At our disposal will remain about 500 thousand words, that is, we were able to reduce the amount of stored information by 10 times. Thus, in order to get a normal word form for a word, we need to find the largest word from the remaining list that does not exceed the given one and apply the rule from it. This algorithm will work correctly for all words from the dictionary and will give good heuristics for unfamiliar examples. This approach was also used to add support for the Russian language in lucene.
Also, to reduce the amount of information occupied, we can perform additional coding, in which six consecutive letters are represented as 32-bit int. Thus, the list of samples by which the necessary rule is searched is represented as a two-dimensional array of numbers in which one line corresponds to one word. To find the desired sample, you must use a binary search. The speed of this algorithm is about 200 thousand words per second. In addition, further improvements are possible: for example, using a prefix tree for storing and searching information. But this is a topic for the next conversation.