A morphological analyzer for the Russian language - does that sound like something abstruse? A program that reduces a word to its initial form, determines its case and finds its word forms - not obvious how to even approach it? In fact it is all not that difficult. This article is about how I wrote an analogue of mystem, lemmatizer and phpmorphy in Python, and what came of it.
I will say right away that the result is a library for morphological analysis in Python, which you can use and modify under the MIT license.
The first question - why all this?
I am slowly working on a hobby project, and decided to write it in Python. The project vitally needed a more or less high-quality morphological analyzer. The mystem option did not fit because of its license and the inability to dig into its internals; lemmatizer and phpmorphy put me off slightly with their non-Unicode dictionaries; and for some reason I did not find a decent analogue in Python. In general there are many reasons, none of them very serious to be honest - mostly I simply wanted to. So I finally decided to reinvent the wheel and write my own implementation.
It is based on the work published at aot.ru. The dictionaries (LGPL) for Russian and English, as well as the ideas, come from there. Specific implementation algorithms are described there, but in terms of automata theory. I am familiar with automata theory (I even defended my diploma thesis on formal grammars), but it seemed to me that one could manage quite well without it. So the implementation is independent, which has both its advantages and disadvantages.
To begin with, the library should, at a minimum, be able to:
1. Reduce a word to its normal form (for example, singular, nominative case for nouns) - for the word “PEOPLE” it should return “PERSON”.
2. Determine the form of a word (or forms). For example, for the word “PEOPLE” it should somehow let us know that this is a noun, in the plural, possibly in the genitive or accusative case.
Sorting out the dictionaries
Dictionaries from aot.ru contain the following information:
1. word paradigms and specific word-formation rules;
2. stress marks;
3. user sessions;
4. a set of prefixes (productive prefixes);
5. lemmas (the unchangeable parts of words, the stems);
6. grammatical information - in a separate file.
Of all this, we are interested in the word-formation rules, the prefixes, the lemmas and the grammatical information.
All words are formed according to the same principle:
[prefix] + [grammatical prefix] + [stem] + [ending]
Prefixes are things like “mega”, “super”, etc. The set of prefixes is stored simply as a list.
Grammatical prefixes are prefixes that belong to a grammatical form but not to the unchangeable part of the word (“po”, “nai”). For example, “nai” in the Russian word for “most beautiful”, because without it the word is no longer a superlative and is just “beautiful”.
The word-formation rules describe what must be attached to the front and to the back of the stem to obtain a particular form. The dictionary stores them as pairs “grammatical prefix - ending”, plus the “number” of the corresponding grammatical information record (which is stored separately).
The word-formation rules are grouped into paradigms. For example, for some class of nouns a paradigm describes what the word looks like in all cases and numbers. Knowing that a noun belongs to this class, we can correctly obtain any of its forms. Such a class is a paradigm. The first rule in a paradigm always produces the normal form of the word (an empirical observation, to be honest)).
Lemmas are the unchangeable parts of words. The dictionary stores information about which paradigms (that is, which sets of rules for forming the grammatical forms of a word) correspond to which lemma. Several paradigms can correspond to one lemma.
Grammatical information is simply pairs (record “number”, grammatical info). “Number” is in quotes because it is actually 2 letters - a bit odd, but they are all different.
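To make this more concrete, here is a rough sketch of how these pieces can be held in memory; the names and sample values are mine, not necessarily what pymorphy uses internally:

# An illustrative in-memory layout (names and sample values are mine).
# A paradigm is a list of word-formation rules; each rule is a pair
# "grammatical prefix + ending" plus the gram. info "number".
paradigms = {
    0: [(u"", u"", u"aa"),        # the first rule gives the normal form
        (u"", u"AM", u"ab")],     # (gram. prefix, ending, info "number")
}

# A lemma (stem) maps to the list of paradigms it may belong to.
lemmas = {u"STEM": [0]}

# Grammatical information is stored separately, keyed by that "number".
gram_info = {u"aa": u"noun, singular, nominative"}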
The dictionary file is plain text: for each section the number of lines in it is given first, and then the lines themselves follow. The format is described, so writing a parser was no trouble.
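A minimal sketch of reading one such section, assuming only the “count, then that many lines” layout described above (the actual field parsing is left to a per-section parse_record function):

def read_section(dict_file, parse_record):
    # each section starts with the number of lines, then the lines themselves
    count = int(dict_file.readline())
    return [parse_record(dict_file.readline().strip()) for _ in range(count)]

Each section - rules, stresses, prefixes, lemmas - then only needs its own parse_record function.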
Having understood the structure of the dictionary, it is already easy to write the first version of the morphological analyzer.
Writing the morphological analyzer
So, we have a word, and it needs to be found among all reasonable combinations of the form
<prefix> + <grammatical prefix> + <lemma> + <ending> and
<grammatical prefix> + <lemma> + <ending>.
The task is simplified by the fact that, as a couple of lines of Python showed, there are only 2 “grammatical prefixes” in our language (and, it seems, in English too), while there are about 20 prefixes in the dictionary for Russian. So we can search among combinations of
<prefix> + <lemma> + <ending>, mentally merging the lists of prefixes and grammatical prefixes, and then doing a small check.
Let us deal with the prefixes as follows: if a word starts with one of the possible prefixes, we discard it (the prefix) and try to analyze the remainder morphologically (recursively), and then simply attach the discarded prefix to the forms that were found.
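Roughly, that idea can look like this; parse stands for any function that analyzes a word, and the names are illustrative rather than the library's actual API:

def parse_with_prefixes(word, prefixes, parse):
    # "parse" is any function that analyzes a word and returns a list of
    # (normal_form, gram_info) pairs; the names here are illustrative.
    results = list(parse(word))
    for prefix in prefixes:
        if word.startswith(prefix):
            for norm, info in parse(word[len(prefix):]):
                # glue the discarded prefix back onto the found forms
                results.append((prefix + norm, info))
    return results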
As a result the task comes down to searching among the combinations
<lemma> + <ending>, which is even better. We look for suitable lemmas, then check whether suitable endings exist for them *.
Here I decided to simplify my life considerably and use the standard Python associative array (dict), into which I put all the lemmas. The result is a dictionary of the form lemmas: {base -> [rule_id]}, i.e. the key is a lemma and the value is a list of numbers of admissible paradigms. Then off we go: first we assume that the lemma is the first letter of the word, then the first 2 letters, and so on. For each candidate lemma we try to get the list of paradigms. If we got one, then in each admissible paradigm we run through all the rules and check whether our word is produced when the rule is applied. If it is - we add it to the list of found forms.
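A simplified sketch of that search, using the lemmas and paradigms structures sketched earlier and ignoring grammatical prefixes:

def find_forms(word, lemmas, paradigms):
    # A simplified sketch: grammatical prefixes are ignored here.
    found = []
    for i in range(1, len(word) + 1):
        stem = word[:i]                       # candidate lemma
        for paradigm_id in lemmas.get(stem, []):
            for prefix, ending, info_id in paradigms[paradigm_id]:
                if stem + ending == word:     # the rule reproduces our word
                    found.append((stem, paradigm_id, info_id))
    return found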
* There was another option - to build a dictionary of all possible <lemma> + <ending> words at once; it came out to about 5 million words, not that many, but I did not like that option at all.
Finishing up the morphological analyzer
In fact, everything is ready - we have written the morphological analyzer, except for a few details that took me 90% of the time)
Detail One
If we recall the example from the beginning, about “PEOPLE” - “PERSON”, it becomes clear that there are words whose unchangeable part is missing, and then it is unclear where to look for the forms of such words. I tried searching among all the endings (just as among the lemmas), because for such words both “PEOPLE” and “PERSON” are stored in the dictionary as endings. For some words it worked; for others it produced, besides the correct result, a bunch of garbage as well. In the end, after long experiments, it turned out that the dictionary has a cunning magic lemma “#” which corresponds to all the empty lemmas. Hurray, the problem is solved: we just look there as well every time.
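In code, this only adds one more lookup on top of the search sketched above - something along these lines (again, just an illustration):

def find_empty_stem_forms(word, lemmas, paradigms):
    # complements the find_forms sketch above: words with no unchangeable
    # part are stored in the dictionary under the magic "#" lemma
    found = []
    for paradigm_id in lemmas.get(u"#", []):
        for prefix, ending, info_id in paradigms[paradigm_id]:
            if ending == word:
                found.append((u"#", paradigm_id, info_id))
    return found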
Detail Two
It would be nice to have a “predictor” that could handle words that are not in the dictionary. These are not only rare words unknown to science, but also simply typos, for example.
There are two simple but quite workable approaches:
1. If two words differ only in that something is attached to the front of one of them, then most likely they are declined the same way.
2. If two words end the same way, then most likely they are declined the same way.
The first approach is prefix guessing. It is implemented very simply: we try to treat first the first letter of the word as a prefix, then the first 2 letters, and so on, and pass what remains to the morphological analyzer. And we do this only for not-too-long prefixes and not-too-short remainders.
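A sketch of this first approach; the length thresholds are my own guesses, the ones used in the library may differ:

def predict_by_prefix(word, parse, max_prefix=5, min_remainder=3):
    # try to treat the 1, 2, ... first letters as an unknown prefix and
    # analyze the remainder; thresholds here are illustrative guesses
    guesses = []
    for i in range(1, max_prefix + 1):
        remainder = word[i:]
        if len(remainder) < min_remainder:
            break
        for norm, info in parse(remainder):
            guesses.append((word[:i] + norm, info))
    return guesses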
The second approach (prediction by the end of the word) is a bit harder to implement (that is, much harder if you want a good implementation) and is “smarter” in its predictions.
The first difficulty comes from the fact that the end of a word can consist not only of the ending but also of part of the lemma. Here I again decided to “cut corners” and reused the associative-array trick, this time with all possible word endings (up to 5 letters) prepared in advance. There turned out to be not that many of them, a few hundred thousand. The key of the array is the end of a word, the value is a list of possible rules. Then everything works just like the search for a suitable lemma, only we take not the beginning of the word but its 1, 2, 3, 4 and 5-letter endings, and instead of the lemmas we now have this new monster array.
The second difficulty is that a lot of obvious garbage comes out. The garbage is cut off if we assume that the predicted words can only be nouns, adjectives, adverbs or verbs.
Even after that, too many non-garbage rules remain. For definiteness, for each part of speech we keep only the most common rule.
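And a sketch of the second approach; endings is the precomputed {word ending -> rules} array described above, the part-of-speech tags are illustrative, and the “keep only the most common rule per part of speech” step is left out for brevity:

PREDICTED_POS = set([u"NOUN", u"ADJECTIVE", u"VERB", u"ADVERB"])  # illustrative tags

def predict_by_suffix(word, endings, rule_pos):
    # "endings" maps a 1..5-letter word ending to a list of rule ids,
    # "rule_pos" tells which part of speech a given rule produces
    for size in range(5, 0, -1):              # longest ending first
        candidates = endings.get(word[-size:], [])
        candidates = [r for r in candidates if rule_pos(r) in PREDICTED_POS]
        if candidates:
            return candidates
    return []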
In theory, if a word could not be predicted as a noun, it would be good to add the variant of an indeclinable noun in the singular nominative, but that is in the TODO.
The ideal text for testing the predictor is, of course, “Syapala Kalusha po napushke” - the one about how she rescued the butyavka there and what came of it:
Syapala Kalusha po napushke i uvazila butyavku. I volit:
- Kalushata, kalushatochki! Butyavka!
Kalushata prisyapali i butyavku stryamkali. I podudonilis.
A Kalusha volit:
- Oee, oee! Butyavka-to nekuzyavaya!
Kalushata butyavku vychuchili.
Butyavka vzdrebeznulas, soprityuknulas i usyapala s napushki.
A Kalusha volit:
- Butyavok ne tryamkayut. Butyavki dyubye i zyumo-zyumo nekuzyavye. Ot butyavok dudonyatsya.
A butyavka volit za napushkoy:
- Kalushata podudonilis! Kalushata podudonilis! Zyumo nekuzyavye! Puski byatye!
In this text the predictor failed to cope with the proper name Kalusha (the forms came out masculine, “Kalush”), with “Oee”, with the mysterious “zyumo-zyumo”, and instead of “puski” it again produced the masculine “pusek”; the rest, in my opinion, it determined correctly. In general, there is plenty to strive for, but it is not bad.
Oh horror, “Habrahabr” is not predicted in any way either. But given “Habrahabra”, it already understands that it is “Habrahabr”.
Here it would, in principle, be possible to wrap things up - if computers worked instantly. But they do not, so there is
Detail Three: CPU and RAM resources
The speed of the first version was fine with me. It came out, by rough estimate, at up to 10 thousand words per second for simple Russian words and about a thousand for tricky ones. For English it was even faster. But there were 2 obvious “buts” (obvious even before development started), both related to the fact that all the data was loaded into RAM (via pickle/cPickle):
1. the initial loading took 3-4 seconds;
2. about 150 megabytes of RAM were eaten with psyco and about 100 without (plus it was possible to shave off a little more by converting the various Python sets and lists to tuples where possible).
Without thinking twice, I did a little refactoring and moved all the data behind a separate entity. And then the magic of Python and duck typing came to my aid. In short, the algorithms used the data as associative and ordinary arrays. Everything keeps working, without rewriting the algorithms, if instead of “real” arrays you slip them something that behaves like an array - specifically, for our task, something that supports [] and in. That's all.) In the standard Python distribution such “array-like” interfaces to several non-relational databases were found. These databases (bsddb, ndbm, gdbm) are, in essence, associative arrays on disk. Just what we need. There were also a couple of high-level wrappers over all this (anydbm and shelve). I settled on inheriting from DbfilenameShelf and added support for unicode keys and integer keys. I also added caching of results, which for some reason exists in Shelf but is killed in DbfilenameShelf.
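The subclass itself comes down to very little code. Here is a rough sketch of the idea for Python 2 (not the actual shelve_addons.py code; the class name and the utf-8 key encoding are my assumptions):

import shelve

class UnicodeShelf(shelve.DbfilenameShelf):
    # a shelf that accepts unicode and integer keys and caches lookups
    def __init__(self, filename, flag='r'):
        shelve.DbfilenameShelf.__init__(self, filename, flag)
        self._cache = {}

    def _fix_key(self, key):
        # the underlying dbm wants plain byte strings
        if isinstance(key, int):
            key = str(key)
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        return key

    def __getitem__(self, key):
        key = self._fix_key(key)
        if key not in self._cache:
            self._cache[key] = shelve.DbfilenameShelf.__getitem__(self, key)
        return self._cache[key]

    def __contains__(self, key):
        return self.dict.has_key(self._fix_key(key))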
Speed data on test texts (995 words, Russian dictionary, MacBook):
get_pickle_morph('ru', predict_by_prefix=False): 0.214 CPU seconds
get_pickle_morph('ru'): 0.262 CPU seconds
get_shelve_morph('ru', predict_by_prefix=False): 0.726 CPU seconds
get_shelve_morph('ru'): 0.874 CPU seconds
The shelve variant, one might say, did not eat memory at all.
The shelve variants showed these timings when the dictionaries were already in the disk cache; without that, the time can reach 20 seconds with the predictor turned on. I also noticed that prefix prediction was the slowest part: before I screwed caching onto my DbfilenameShelf heirs, that prediction alone took 5 times longer than everything else put together. In these tests it no longer seems to matter much.
By the way, taking this opportunity, I want to ask: does anyone happen to know how to find out from Python the amount of memory occupied by the current process, ideally in a cross-platform way? Otherwise I have to put a delay at the end of the script and measure it from the process list.
Usage example
import pymorphy
morph = pymorphy.get_shelve_morph('ru')
# then do whatever is needed, for example
# (any Russian word in upper case; u'ЛЮДЯМ' here is just a placeholder):
info = morph.get_graminfo(u'ЛЮДЯМ')
So after all, why?
At the very beginning I already explained why I started writing the morphological analyzer.
Now it is time to explain why I started writing this article. I wrote it in the hope that such a library is interesting and useful, and that together we can improve it. The thing is that the functionality already implemented is quite sufficient for my needs. I will support and fix it, but not very actively. So this article is not a detailed instruction manual, but an attempt to describe how it all works.
In the package
pymorphy.py - the library itself
shelve_addons.py - the Shelf subclasses, which may come in handy on their own
encode_dicts.py - a utility that converts the dictionaries from the AOT format into pymorphy's formats. It takes no parameters, runs for a long time, eats about 200 MB of memory, and only needs to be run once. I do not distribute the converted dictionaries because of possible binary incompatibility and their large size.
test.py - unit tests for pymorphy
example.py - a small example and the text with those 995 words
dicts/src - the folder with the source dictionaries downloaded from aot.ru
dicts/converted - the folder where encode_dicts.py will put the converted dictionaries
Finally
The link: bitbucket.org/kmike/pymorphy
License - MIT.
Tested only under Python 2.5.
Comments, suggestions and questions on the matter are welcome. If you are interested - join the development.