
pymorphy2

Back in 2009 there was already an article on Habr about pymorphy ("Are the butyavki kuzyavye?.."), a morphological analyzer for the Russian language in Python (a thing that can inflect words and report information about part of speech, case, and so on).

In 2012 I started slowly working on pymorphy2 (github, bitbucket), and I think it is time to introduce the library here: pymorphy2 can work hundreds of times faster than pymorphy (without using C/C++ extensions) while requiring less memory; it has better dictionaries, better parsing quality, better support for the letter ё, easier installation and a more "honest" API. On the downside, not all features of pymorphy are implemented in pymorphy2 yet.
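For orientation, basic usage looks roughly like this (a minimal sketch; the exact parses and tags depend on the dictionary version installed, and the Russian dictionaries package must be available):

    # A minimal usage sketch; output details depend on the installed dictionaries.
    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()

    # parse() returns all possible analyses of a word
    for p in morph.parse('стали'):
        print(p.normal_form, p.tag)      # lemma + grammatical tag

    # a single analysis can be inflected into another form
    p = morph.parse('хомяк')[0]
    print(p.inflect({'gent'}).word)      # genitive case (inflect() may return None)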

This article is about how pymorphy2 was created (at times with rather boring technical details) and how much nonsense I managed to do along the way; if you just want to try it out, you can read the documentation.


What were the problems with pymorphy:



Before writing pymorphy I had never written anything in Python and had never done any natural language processing (pymorphy was a way to get a grip on both), so it was quite clear that a lot of things could have been done better.

Still, a lot of things in pymorphy1 were done right. For example, documentation in Russian and installation without special dependencies (most importantly, without the need for a compiler); a more or less normal development process with tests; the quality of analysis and prediction itself was quite high. Integration with Django was also a good "marketing" move - although it had some conceptual problems (it is impossible to correctly inflect a word right in a template without resolving the ambiguity of the parse, and the API made no provision for that).

Despite all its shortcomings, the library (unexpectedly) turned out to be quite popular - for example, I googled that pymorphy was used a bit in developing a Speech-to-Text system for Russian within the French Quaero project, and that it is recommended as teaching material in some universities. I don't think there is any great merit of mine here; rather, I just ended up in the right place at the right time.

For a long time I didn't want to break backward compatibility and tried to develop what I had (although buriy, for example, even made a fork that broke backward compatibility but worked faster). The decisive impetus for writing pymorphy2 was the OpenCorpora project - the people there, among everything else (and there is a lot of "everything else"), took the dictionary from aot.ru, completely reworked its structure and started updating it and making other improvements.

So, the idea was to use a dictionary from OpenCorpora.

pymorphy used the dictionaries from aot.ru almost unprocessed (they were converted to sqlite, but the structure essentially stayed the same). Word "stems" were stored separately, "prefixes" separately, "endings" separately, and the rules for assembling words from these parts separately as well. This approach was good in that it made it possible to implement a morphological analyzer in a couple of weekends without much effort or research.
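Schematically, the "assemble words from parts in code" idea looks something like this (a toy sketch with made-up data; the names and tables are not pymorphy's actual ones):

    # A toy illustration of the "stem + endings" storage scheme (made-up data,
    # not pymorphy's real tables): every word form is assembled in code at runtime.
    paradigms = {
        # paradigm id -> list of (ending, grammatical info)
        1: [('', 'NOUN sing,nomn'), ('и', 'NOUN plur,nomn'), ('ами', 'NOUN plur,ablt')],
    }
    stems = {'хомяк': [1]}  # stem -> ids of paradigms it can follow

    def word_forms(stem):
        for paradigm_id in stems[stem]:
            for ending, info in paradigms[paradigm_id]:
                yield stem + ending, info

    print(list(word_forms('хомяк')))
    # [('хомяк', ...), ('хомяки', ...), ('хомяками', ...)]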

But all of this - access to the data through a pile of wrappers and (especially) assembling words from parts in code - hurt the speed badly; I never managed to drastically improve the speed within this approach, the solutions were complex and didn't help much (note: the option "rewrite it all as is, but in C++" does work fast, but as a result it requires more memory than pymorphy2 does).

In short, in pymorphy2 I wanted to try some other approach, since the dictionaries were new anyway and much of the dictionary-related code had to be rewritten regardless. At the same time I wanted pymorphy2 to remain a Python library, not just a wrapper around a pile of C/C++ code - I wanted the parsing logic to stay in Python. There were several options for what to do.

Finite automata


The first option that came to mind was to reformulate the problem in terms of finite automata. lightcaster described the approach well here: habrahabr.ru/post/109736 . The approach is beautiful, no denying that.

What confused me here:

a) the article used the OpenFST C++ library (apparently the most popular way to work with finite-state machines), but forcing users to install it by hand is not an option;
b) even with the C++ library the results, judging by the article, were quite modest (2 thousand words/sec versus 100+ thousand words/sec for mystem or lemmatizer); it is clear that this figure could most likely be improved significantly (and lightcaster writes that he didn't optimize anything) - but still;
c) this is one of those approaches that (in my opinion) raises the entry threshold - I consider that rather a minus.

The upshot was that I would need to: figure out how to optimize the code and why it is so slow even with the C++ library; write an easier-to-install wrapper for OpenFST (or use another FST implementation - for example, write my own) + make a pure-Python implementation of at least part of OpenFST (or simply of an FST) so that pymorphy could be used without a compiler; and, well, formulate all the algorithms in terms of finite automata.

Some time ago lightcaster sent me another sketch of a pymorphy implementation on finite automata (pure Python, without C extensions; I understood nothing in it at the time :) - the sketch eventually worked at about 2 thousand words/sec and required about 300MB of memory. All this is, on the one hand, not bad, but on the other - not very inspiring after all; it was clear that a "head-on" solution along this path would not be very fast, and that I would have to do it considerably better than a person who actually understands the topic did. In short, it looked like a lot of work with no guaranteed result (+ there were other architectural considerations too - more on those closer to the end of the article). Perhaps that impression was wrong, who knows - I haven't tried, only excuses.

I did not go that way (treating the task solely as an automaton problem). Although - how to put it, it all depends on the point of view :)

Let's copy mystem


The second option was to implement roughly what is described in the mystem publication. I don't know whether mystem uses the same algorithms now, but the method described in the paper is quite reasonable. Many good things are written about mystem, so at first I tried to implement in pymorphy2 something similar to what the publication describes.

The essence of the method (as I understood it):
a) We have a list of all words of the Russian language, and each word has a "declension model" ("paradigm") - a kind of template by which the other forms of the word can be built. For example, for many nouns you can add "и" at the end to get the nominative plural, and "ами" to get the instrumental plural. Most implementations of Russian morphology are built on this principle.

b) A prefix tree (trie) is built over all reversed "endings" of words. The "ending" is taken roughly the same way as in aot.ru - the right-hand part of the word, the one that changes under declension. For example, the word "хомяками" ("hamsters", instrumental) has the ending "ами" - so we add "има" to the trie.

c) A prefix tree (trie) is built over all reversed "stems". For example, for the word "хомяк" that is "кямох". In addition, the possible indices of the declension models (paradigms) for a stem are appended to its end (after a separator). For example, if "хомяк" can be declined according to paradigms A and B, and $ is chosen as the separator, then we add the keys "кямох$A" and "кямох$B" to the trie. The mystem paper used not one tree for the stems but several (in my view, as an optimization measure) - but this does not affect the algorithm much, so let us assume there is a single tree.

d) Parsing a word: we walk the word from the end and collect all possible "endings" from the first trie (in a prefix tree this operation is O(1) with respect to the number of keys in the trie and O(N) with respect to the length of the word). Suppose for "хомяками" we find that there are words with the inflectional endings "" (empty), "и", "ами" or "ками". These "endings" correspond to the "stems" хомяками, хомякам, хомяк and хомя.

e) Take the shortest stem found and look it up in the second trie (descending to the $ separator). If it is found, then besides the stem itself we immediately get all the possible indices of the word's declension models (they are behind the separator). For each declension model we can now check whether it has a form with the ending we need (found in step (d)) - if it does, we have parsed the word.

f) If nothing is found, there is no dictionary word with the shortest stem - move on to a longer stem; similar stems can also be checked (I didn't quite understand from the paper what exactly is done here, but that is not the point - for example, you can try the next, longer stem; after all possible stems have been tried, you can start looking for similar words - that works).

The paper also describes various heuristics that can reduce the size of the dictionary and remove unlikely predictions.

I may have misinterpreted or misunderstood something; if you actually need to implement this, it is better to read the original paper, of course.
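To make the scheme above a bit more concrete, here is a toy sketch of steps (d)-(f), with plain Python sets and dicts standing in for the tries (the key reversal, the real trie traversal and all heuristics are omitted; the data is made up):

    # A toy sketch of the "two tries" scheme; sets/dicts stand in for the tries.
    PARADIGMS = {
        'A': {'': 'sing,nomn', 'и': 'plur,nomn', 'ами': 'plur,ablt'},  # ending -> info
    }
    ENDINGS = {'', 'и', 'ами', 'ками'}   # all known "endings" (the first trie)
    STEMS = {'хомяк': ['A']}             # stem -> paradigm ids (the second trie)

    def parse(word):
        # step (d): walk the word from the end, collect candidate (stem, ending) splits
        splits = [(word[:len(word) - n], word[len(word) - n:])
                  for n in range(len(word) + 1) if word[len(word) - n:] in ENDINGS]
        # steps (e)-(f): try shorter stems first, check each paradigm for the ending
        for stem, ending in sorted(splits, key=lambda s: len(s[0])):
            for paradigm_id in STEMS.get(stem, []):
                if ending in PARADIGMS[paradigm_id]:
                    yield stem, ending, PARADIGMS[paradigm_id][ending]

    print(list(parse('хомяками')))   # [('хомяк', 'ами', 'plur,ablt')]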



Trie

The method uses prefix trees (tries), so to implement it you need a Trie implementation for Python. I didn't want to use a pure-Python implementation (I was worried about speed and memory appetite), but, to my surprise, I could not find ready-made Python wrappers for any good C/C++ Trie implementation.

It rarely makes sense to write your own C/C++ implementation of a data structure: a lot already exists in C/C++, and many of those libraries implement state-of-the-art algorithms. Why is that? Someone comes up with a clever data structure and writes a paper about it in a scientific journal. So that the results can be reproduced, the author often publishes an implementation, most often in C or C++. If the author doesn't, then someone else reads the paper and writes an implementation - also in C/C++ (okay, sometimes in Java).

Besides, pymorphy2 is a hobby project, and I couldn't spend too much time on it; it was better to spend the time efficiently and reinvent fewer wheels. In short, I decided to take a ready-made Trie implementation and write a Cython wrapper for it (as a separate package not tied to pymorphy2).

There are 2 big advantages in this approach:

1) the data structure can be used for other purposes; even if the approach did not pan out (spoiler: that is exactly what happened, it didn't), the effort would not be wasted;
2) the complexity related to the data structures is pulled out of the morphological analyzer itself; the analyzer talks to the Trie through a simple API and deals only with the actual analysis of words. To understand how the algorithm works, you don't need to understand in detail how a prefix tree is arranged. In addition, replacing the implementation of the underlying data structure (for example, with a pure-Python one) becomes a simple task: there is a clearly defined boundary (= interface) between the morphological analyzer and the word storage.
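The boundary I have in mind is roughly this kind of minimal interface (a hypothetical sketch, not pymorphy2's actual classes):

    # A hypothetical sketch of the analyzer/storage boundary (not pymorphy2's
    # actual classes): the analyzer relies only on a small storage interface,
    # so the trie/DAWG behind it can be swapped without touching parsing logic.
    class WordStorage:
        def __contains__(self, key): ...        # is the key present?
        def prefixes(self, word): ...           # keys that are prefixes of `word`
        def items(self, prefix=''): ...         # (key, payload) pairs under a prefix

    class Analyzer:
        def __init__(self, storage):
            self.storage = storage              # datrie-, marisa- or pure-Python-backed

        def parse(self, word):
            ...                                 # uses only the WordStorage methods above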

At first I liked the libdatrie library; I wrote about the wrapper for it here: habrahabr.ru/post/147963 . A few things in the library had to be fixed (it usually went like this: I would write a working implementation in C, the library author would throw it all out and write more idiomatic C code for the same thing - and rightly so, because his C code really was better :); in the end a quite usable wrapper came out of it.

With datrie and the "steal everything from mystem!" approach I got 20-60 thousand words/sec (without a predictor and with a minimum of surrounding glue code; why such a spread, I don't remember - the weather, I guess), and the whole thing took about 30MB of memory.

The speed bottleneck turned out to be in an unexpected (for me) place: matching the admissible analyses for the "ending" of a word against the admissible analyses for its "stem" (this is part of the algorithm) was the slowest step. In my implementation it boiled down to finding the intersection of two sorted lists of numbers. The obvious way ("walk both lists in parallel") turned out to be much slower on this task than the approach "walk the short list and look each element up in the long one with a narrowing binary search". But even that second variant remained a bottleneck after being rewritten in Cython, and I didn't know what to do about it. This code could probably have been written better, but it didn't work out right away.
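For illustration, the two intersection strategies look roughly like this (a sketch in plain Python; the real code was in Cython and operated on lists of paradigm indices):

    # Two ways to intersect sorted lists of ids (a plain-Python sketch).
    from bisect import bisect_left

    def intersect_merge(a, b):
        """Walk both sorted lists in parallel (the 'obvious' way)."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def intersect_binsearch(short, long_):
        """Walk the short list, doing a narrowing binary search over the long one."""
        out, lo = [], 0
        for x in short:
            lo = bisect_left(long_, x, lo)   # search only to the right of the last hit
            if lo < len(long_) and long_[lo] == x:
                out.append(x)
        return out

    print(intersect_binsearch([2, 5, 9], [1, 2, 3, 5, 8, 13]))   # [2, 5]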

Besides, 30MB is good on the one hand, but on the other - aot.ru's analyzer is smaller (they write 9MB in one place and 19MB in another; let's assume 9, I never got around to checking). Memory consumption matters because pymorphy also gets used for shooting sparrows with a cannon (declining words on a website), and that means loading it into every web process. Or, if you don't load it (because you begrudge the memory), you have to fiddle with a separate service (over some zeromq, or talking HTTP) - which isn't desirable either.

The datrie implementation is not a "naive" pointer-based trie; datrie as it is needs about half the memory of an ordinary prefix tree (so 30MB is actually not bad).

Plus, it all somehow came out complicated in terms of algorithms; it was clear that if I kept going (predictor, heuristics), everything would get even slower and even more complicated. A mess.

marisa-trie

But I decided not to give up on the "à la mystem" approach just yet and to try replacing datrie with something else (which was rather silly, but had good consequences). The choice fell on the C++ library marisa-trie, written by the data structure guru Susumu Yata. For it, too, I made a Python wrapper with roughly the same interface as datrie.

When I started testing marisa-trie, at first my eyes nearly popped out of my head: the memory numbers were several times lower than with datrie.



How does that happen? The thing is that MARISA-Trie is in fact not a Trie at all :) How exactly it works I never properly understood; the folder with the "meat" in the sources is called grimoire, and the description of the algorithm exists only in Japanese.

Apparently, MARISA-Trie is a "succinct" implementation of a Patricia trie (wiki) in which each node can be associated not only with a piece of text but also with another, nested MARISA-Trie. "Succinct" means that the tree structure is encoded as a bit array, which lets it take up very little memory. There is also an interesting feature: a MARISA-Trie simultaneously acts as a "perfect hash". Each key is assigned a unique number in the range 0 to len(trie)-1, and each number from 0 to len(trie)-1 corresponds to a single key; you can get a key by its number or a number by its key.
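Through the Python wrapper, that "perfect hash" behaviour looks roughly like this (a small sketch assuming the marisa-trie package API):

    # The "perfect hash" property of MARISA-Trie, via the Python wrapper
    # (a sketch assuming the marisa-trie package API).
    import marisa_trie

    trie = marisa_trie.Trie(['хомяк', 'хомяки', 'хомяками'])

    key_id = trie['хомяками']        # every key gets a unique id in 0..len(trie)-1
    assert 0 <= key_id < len(trie)
    assert trie.restore_key(key_id) == 'хомяками'   # and the id maps back to the key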

Speed-wise, datrie is faster than marisa-trie - by a factor of about 10.

Back to pymorphy. After a straightforward replacement of datrie with marisa-trie, it turned out that the whole contraption takes 5MB of memory and runs at 2-5 thousand words/sec (without a predictor).

On the one hand, 5MB is cool (though still without a predictor). On the other, 2-5 thousand words/sec is slow after 20-60 thousand words/sec, + I really didn't want to tie pymorphy2 to marisa-trie, because that would mean requiring a compiler to install pymorphy2. I fully understood how datrie works, and writing a compatible pure-Python implementation would not have been a problem, but marisa-trie... In the case of marisa-trie a port would have been possible, but it would have taken more effort and would most likely have worked slowly, so the question would probably have been "postponed" and pymorphy2 would have ended up requiring a compiler for installation.

The attempt itself was a dead end, but it revealed one interesting thing: 5MB (the "almost mystem" algorithm + marisa-trie) and 7MB ("just put all the words into marisa-trie") are very comparable numbers. These numbers finally opened my eyes to the fact that approaches where all words are loaded into memory whole and at once (without splitting into stems and endings) should not be dismissed.

Intuitively: if all words are kept in memory "as is", fewer computations are needed and words don't have to be assembled from pieces - everything should work faster and the code should be simpler.

At this point I stopped trying to "copy mystem"; the unfinished variants with datrie and marisa-trie are available in the early pymorphy2 commits.

I suspect that mystem uses several tries with parts of words (well, according to the publication - I can't know what is used there now) not because they somehow make the predictor algorithms easier to describe, but simply in order not to keep all words in memory in full. According to my tests, with an ordinary trie this would take 300-400 megabytes (words + data about their paradigms), with datrie about 200 megabytes, but with MARISA-Trie everything could fit into some 20 megabytes. Using several "ordinary" tries, as in the mystem paper, is a good trick, but I think (I may be wrong - I never finished the "I want my own mystem" approach) that this path is in principle slower than "keep all words as they are".

Back to the Future


MARISA-Trie itself did not suit keeping all words in memory: it was not fast, which would have killed the potential speed gain from not having to assemble words piece by piece, + the API lacked a way to navigate the tree "step by step" in an arbitrary manner, which I needed for something I no longer remember - probably for proper support of the letter ё. And then there was a vague recollection that in aot.ru all the words were somehow stored at once, in full.

So I started re-reading the description of the morphological analyzer from aot.ru, and in particular the section "Binary dictionary representation". I returned, so to speak, to where it all began.

In the article on aot.ru everything is described in terms of an "automaton".

Architecturally, I liked (and still like) the variant with an "alienable" data structure (one that can be used for something else - like datrie or marisa-trie) much more than a specialized dictionary automaton. A specialized automaton would be "buried" inside pymorphy2: nothing could be reused from it, it could be debugged only within pymorphy, and the complexity would keep accumulating inside pymorphy. So I usually skimmed that section, thought "hm, not good" and went on inventing ways to do it differently.

But things can be looked at from different angles. It probably should have been obvious from the very beginning, but only after I tried to make a "mystem clone" did it dawn on me that the automaton used in aot is in fact exactly the DAWG data structure with words written into it as keys (with annotations attached at the end). And that all the operations described there map perfectly onto the same API that datrie and marisa-trie had.

It really is true: I'm not sure about "cache invalidation", but "naming things" definitely got me here.

You feel like a complete idiot at such moments: everything was simple, everything was right before my eyes from the very beginning, and people had told me about it many times (both about automata and about DAWGs).

So, the new plan (down a well-trodden path) was:

  1. find a DAWG implementation in C/C++;
  2. make a wrapper for it with the same API as datrie and marisa-trie;
  3. write all the words into the DAWG (together with information about their grammatical forms);
  4. as a bonus, make a pure-Python library exclusively for reading DAWGs, to simplify installation.


Morphological analysis of dictionary words in this scheme amounts to simply fetching information about a word from the DAWG; in the simplest case (assuming a word can have only one parse and that only grammatical information is needed) it can be one line of code.

A good DAWG library (dawgdic) was found from the same Susumu Yata who wrote marisa-trie; I made a Python wrapper for it, github.com/kmike/DAWG , and a pure-Python "reader" of the same format, github.com/kmike/DAWG-Python (which needs no compiler to install).
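With these wrappers, the "one line of code" case from above looks roughly like this (a sketch with made-up annotation data, assuming the DAWG package's RecordDAWG API; real pymorphy2 packs more into the values):

    # "Dictionary lookup = morphological analysis", sketched with the DAWG package
    # (made-up annotation data: here each value is a (paradigm id, form index) pair).
    import dawg

    words = [('хомяк', (7, 0)), ('хомяками', (7, 5)), ('стали', (3, 1)), ('стали', (12, 4))]
    d = dawg.RecordDAWG('>HH', words)

    print(d['хомяками'])   # all analyses of a dictionary word in one lookup: [(7, 5)]
    print(d['стали'])      # an ambiguous word simply has several records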

The entire dictionary without annotations took about 3MB in the DAWG, and with annotations (= with information about how to parse each word) about 7MB. For comparison, loading the same data into a Python dict "as is" takes several gigabytes of memory.

After that, all that remained was to write pymorphy2 :)

This article is turning out endless as it is, and so far there has been almost nothing about how pymorphy2 is actually arranged inside. To avoid inflating the text even further, there won't be: how pymorphy2 works internally is described in (perhaps excessive) detail in the documentation, pictures included. I will just note that pymorphy2 is not a Python clone of the aot.ru lemmatizer, everything is different there, but the information from aot.ru, of course, helped a lot.

Here are a couple of amusing benchmarks (1.8 GHz laptop processor; under Python 3.3 using DAWG and under PyPy 1.9 using DAWG-Python):

Memory usage: 13.8M dictionary, 25.4M total (load time 0.14s)
py33 runtests: commands[1]
benchmarking MorphAnalyzer():
    morph.parse(w)                                   29950 words/sec
    morph.parse(w)                                   47474 words/sec (considering word frequencies)
    morph.word_is_known(w)                          244816 words/sec
    [p.normal_form for p in morph.parse(w)]          27807 words/sec
    [p.normalized for p in morph.parse(w)]           18231 words/sec
    [p.lexeme for p in morph.parse(w)]                3421 words/sec
    [{'NOUN'} in p.tag for p in morph.parse(w)]      23862 words/sec
    [p.tag.POS == 'NOUN' for p in morph.parse(w)]    22157 words/sec
    morph.tag(w):                                    96342 words/sec (considering word frequencies)
    morph.tag(w):                                    46767 words/sec
    morph.tag(w):                                    46935 words/sec (umlauts removed from input)
    morph.tag(w):                                    36331 words/sec (str(tag) called)
benchmarking MorphAnalyzer(result_type=None):
    morph.parse(w)                                   35088 words/sec
    morph.parse(w)                                   62431 words/sec (considering word frequencies)


Memory usage: 40.5M dictionary, 84.8M total (load time 0.39s)
pypy runtests: commands[1]
benchmarking MorphAnalyzer():
    morph.parse(w)                                   41360 words/sec
    morph.parse(w)                                  108858 words/sec (considering word frequencies)
    morph.word_is_known(w)                          243742 words/sec
    [p.normal_form for p in morph.parse(w)]          51728 words/sec
    [p.normalized for p in morph.parse(w)]           37551 words/sec
    [p.lexeme for p in morph.parse(w)]               14612 words/sec
    [{'NOUN'} in p.tag for p in morph.parse(w)]      44878 words/sec
    [p.tag.POS == 'NOUN' for p in morph.parse(w)]    45129 words/sec
    morph.tag(w):                                   133086 words/sec (considering word frequencies)
    morph.tag(w):                                    60049 words/sec
    morph.tag(w):                                    62567 words/sec (umlauts removed from input)
    morph.tag(w):                                    52632 words/sec (str(tag) called)
benchmarking MorphAnalyzer(result_type=None):
    morph.parse(w)                                   60777 words/sec
    morph.parse(w)                                  124226 words/sec (considering word frequencies)


What I find amusing here is (a) that the numbers are quite similar even though what happens "inside" is very different (a C++ wrapper + an interpreter vs a JIT compiler), and (b) that PyPy is faster.

By itself, the C++ DAWG implementation (when used from Python, Cython wrapper included) is several times faster than DAWG-Python under PyPy, and early versions of pymorphy2 (which did less) were indeed faster under CPython. Over time there were more features and the code got more complex, and pymorphy2 got slower (early versions could parse 200+ thousand words/sec - though not very well and not very conveniently); the faster underlying data structure stopped outweighing everything else, and now pymorphy2 works faster under PyPy. On the other hand, it is clear how to speed up the CPython version - rewrite something else in Cython (which, by the way, I don't plan to do); with PyPy it is not so obvious.

You may notice that on real text it is quite possible to get around 100 thousand words/sec both with Python 3.3 (obtaining grammatical information) and with PyPy (full analysis, with the normal form and a user-friendly API). For comparison: the "real" mystem (written, apparently, in C++) worked, according to my tests at some early stage, one and a half to two times faster and required about the same amount of memory; from this I concluded for myself - perhaps wrongly - that mystem still uses the approach with several tries: if the words were stored whole, the C++ implementation ought to pull far ahead, even if mystem does something else clever on top. But this is, again, idle talk backed by nothing.

If neither PyPy nor the C++ DAWG implementation is used, pymorphy2 still works many times faster (by my estimate, a couple of dozen times) than pymorphy1 with all its speedups enabled - and parses better, too.

How you can help


An interesting question remains: what to do with pymorphy1. It now has features that pymorphy2 lacks (for example, Django integration, agreeing words with numbers, and declining surnames), but in the version on bitbucket I broke backward compatibility. Releasing yet another backward-incompatible version of pymorphy would be somehow silly :) so everything hangs in limbo. I simply don't have enough hands for it all.



A lot of people took part in developing pymorphy, and many of them made a genuinely serious contribution; not all of it has been carried over to pymorphy2, and I would not want that work to be lost (I think it will not be). Without this help I would have had neither the motivation to maintain pymorphy nor the motivation to rewrite everything as pymorphy2.

And several people have already made serious contributions to pymorphy2 as well, by the way :)

Links


Source: https://habr.com/ru/post/176575/

