Articles about writing morphological analysis programs appear on Habr from time to time. Most authors use databases or standard data structures such as dictionaries. But this is not always convenient: first, speed suffers; second, some algorithms, such as predicting the morphology of unknown words, are nontrivial to implement.
Here I present an approach based on finite-state machines that tries to avoid these problems. A description of how it works can be found here.
So, what do we want from morphology? A morphological analysis module is usually expected to provide:
- the lemma for a given word
- morphological information for a given word
- morphology prediction for unknown words
We will prepare all this from two ingredients: a morphological dictionary and a library of automata. The first we take (as is customary) from the
aot.ru project; thanks to them for the LGPL. The automata library comes from
openfst.org . The amount of code will be minimal: all the necessary operations are already implemented in the library.
Step one. Parsing the morphological dictionary.
The morphological dictionary contains lines of the following form:
Here is an example for the word 'MGLA' ('mist'). The letters MGL are the stem of the word; the rest are endings (flexions). We can obtain the lemma, i.e. the normalized form of the word, by joining the stem with the first flexion — in this case, MGLA. The small two-letter combinations are ancodes, i.e. codes that uniquely identify the morphology of a word form. For example, 'ga' stands for 'noun, feminine gender, singular, nominative case'.
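The idea above can be sketched in a few lines of Python. This is a simplified assumption about the paradigm layout (the real aot.ru format differs), and all ancodes other than 'ga' are hypothetical placeholders:

```python
# Unfold one dictionary paradigm into (word form -> lemma + ancode) pairs.
# The stem/flexion/ancode layout is a simplified assumption, not the real
# aot.ru file format; ancodes other than 'ga' are made up for illustration.

def unfold(stem, flexions):
    """stem: e.g. 'MGL'; flexions: list of (ending, ancode) pairs."""
    lemma = stem + flexions[0][0]          # stem + first flexion = lemma
    return {stem + ending: lemma + ancode  # every form maps to lemma+ancode
            for ending, ancode in flexions}

pairs = unfold("MGL", [("A", "ga"), ("Y", "gb"), ("E", "gc"), ("OYU", "gf")])
print(pairs["MGLA"])    # -> MGLAga  (the lemma itself, plus its ancode)
print(pairs["MGLOYU"])  # -> MGLAgf  (an oblique form, same lemma)
```

Each surface form is answered with the lemma concatenated with the ancode — exactly the output words the transducer below will produce.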
Let's get to work. For each line of the dictionary we build the following structure.

This is a finite-state transducer. It works as follows: each transition is labeled with two symbols, where the first is the input symbol and the second is the output symbol. If we "collect" the input and output symbols along any path from the initial state to a final state, we get two words. The magic is that the input word will be one of the morphological forms of the word, and the output word will be Lemma + Ancode. For example, feeding one of the word's forms into this machine, we get "MGLAmg" at the output. Note that "-" denotes the empty symbol, which is skipped when traversing the graph.
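A toy traversal makes the mechanics concrete. The arc topology below is a hand-made sketch for the forms MGLA/MGLOYU, not the exact picture from the article, and the ancodes are hypothetical:

```python
# A toy transducer run: each arc is (input, output, next_state),
# and '-' is the empty symbol. Topology and ancodes are illustrative.
EPS = "-"
ARCS = {
    0: [("M", "M", 1)],
    1: [("G", "G", 2)],
    2: [("L", "L", 3)],
    3: [("A", "A", 4), ("O", "A", 5)],   # form branches; output keeps lemma
    4: [(EPS, "g", 8)],
    5: [("Y", EPS, 6)],                  # surplus input letters map to EPS
    6: [("U", "g", 7)],
    7: [(EPS, "f", 10)],
    8: [(EPS, "a", 10)],
}
FINAL = {10}

def transduce(word, state=0, pos=0, out=""):
    """Depth-first search over the arcs; returns the output word or None."""
    if state in FINAL and pos == len(word):
        return out
    for i, o, nxt in ARCS.get(state, []):
        if i == EPS:                              # consume no input symbol
            r = transduce(word, nxt, pos, out + (o if o != EPS else ""))
        elif pos < len(word) and word[pos] == i:  # match one input symbol
            r = transduce(word, nxt, pos + 1, out + (o if o != EPS else ""))
        else:
            continue
        if r is not None:
            return r
    return None

print(transduce("MGLA"))    # -> MGLAga
print(transduce("MGLOYU"))  # -> MGLAgf
```

Every accepted form comes out as lemma plus ancode; words not in the paradigm return None.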
In general, the output word can be obtained from the input word using the standard composition operation:
if D is the dictionary transducer in the picture above,
and W is the input word form encoded as an identity transducer, e.g.
W = <(M, M) (G, G) (L, L) (O, O) (YU, YU)>,
then the result of the composition M = W ∘ D, with empty symbols removed, reads out the output word: the lemma plus the ancode.
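Composition can also be viewed at the level of string relations: a transducer defines a set of (input, output) pairs, and A ∘ B relates x to z whenever A maps x to some y and B maps y to z. Real FST composition works arc by arc, but its effect on the relations is exactly this join. A minimal sketch with hypothetical dictionary data:

```python
# Composition as a join on the middle word: (A o B) relates x to z
# whenever A maps x to y and B maps the same y to z.
def compose(a, b):
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

# W: the identity relation for the input word; D: a tiny dictionary relation.
W = {("MGLOYU", "MGLOYU")}
D = {("MGLA", "MGLAga"), ("MGLOYU", "MGLAgf")}
print(compose(W, D))   # -> {('MGLOYU', 'MGLAgf')}
```

Composing the identity relation for a word with the dictionary relation is exactly a lookup: the result pairs the word with its lemma + ancode.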
Step two. Optimization.
The structure in the picture above has two problems. First, there are many unnecessary empty symbols that uselessly inflate the transducer. Second, the transducer itself is not minimal. If we built such a structure for the complete dictionary, it would be very large.
So, first we push the output symbols toward the initial state. This reduces the number of empty symbols. The openfst library implements this operation; it is called fstpush.

As you can see in the picture, arcs labeled "-:-" have appeared. They will be removed during minimization.
Next, we minimize the transducer. There is a subtlety here: in the general case, a finite transducer cannot be minimized like an ordinary finite automaton. The point is that, to be minimized, a transducer, like an automaton, must be deterministic (a single output word for each input word), and this transducer is not.
To solve this problem, you can:
- encode the transducer as an ordinary finite automaton
- minimize the automaton (a standard algorithm)
- decode the resulting automaton back into a transducer
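The trick behind the first and last steps is that each (input, output) label pair is packed into a single symbol, turning the transducer into an acceptor that standard determinization and minimization algorithms can handle. A minimal sketch of that round trip on arc labels (the minimization itself is left to the library):

```python
# Pack each (input, output) label pair into one symbol, so the transducer
# becomes an ordinary acceptor; after minimization, unpack the symbols.
def encode(arcs):
    """arcs: list of (state, in_sym, out_sym, next) -> coded arcs + table."""
    table, coded = {}, []
    for s, i, o, n in arcs:
        sym = table.setdefault((i, o), len(table))  # one code per label pair
        coded.append((s, sym, n))
    return coded, table

def decode(coded, table):
    back = {v: k for k, v in table.items()}         # code -> (in, out)
    return [(s, *back[sym], n) for s, sym, n in coded]

arcs = [(0, "M", "M", 1), (1, "G", "G", 2), (2, "-", "g", 3)]
coded, table = encode(arcs)
assert decode(coded, table) == arcs   # lossless round trip
```

Because the encoding is lossless, any state-merging done on the encoded acceptor translates directly back into a smaller transducer.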
All these operations are standard; no code needs to be written. The result for the word "mist" is in the picture below.

As can be seen from the figure, the number of states and transitions has noticeably decreased, which is exactly our goal.
Step three. Predicting the morphology of an unknown word.
Everything is simple here. All we need is to compose the incoming word W with two transducers and find the shortest path to a final state:
P = shortest(W ∘ T ∘ D)
where D is the dictionary, and
T is a special weighted transducer that absorbs a prefix of the word, penalizing each "eaten" character.
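The effect of W ∘ T ∘ D can be simulated directly without building T: strip a prefix from the unknown word at a cost of one point per character and require the remaining suffix to be in the dictionary; the cheapest variant plays the role of the shortest path. The dictionary contents below are hypothetical:

```python
# Simulate W o T o D: T eats a prefix at cost 1 per character, D accepts
# the remaining suffix. The minimal-cost match stands in for the shortest
# path. Dictionary data is illustrative.
D = {"MGLA": "MGLAga", "MGLOYU": "MGLAgf"}

def predict(word):
    """Return (cost, analysis) with minimal prefix cost, or None."""
    for cut in range(len(word)):      # try cheaper (shorter) prefixes first
        suffix = word[cut:]
        if suffix in D:
            return (cut, D[suffix])
    return None

print(predict("POLUMGLA"))   # -> (4, 'MGLAga'): strips 'POLU', matches 'MGLA'
```

An unknown word like "POLUMGLA" thus inherits the analysis of the known word it ends with, which is exactly the suffix-based prediction the formula describes.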

I have not yet implemented the prediction in the example. Perhaps later.
Conclusions
In my opinion, this is a rather interesting way to work with morphology. I see several significant advantages:
- speed and low resource usage
- ease of development
- universality of the algorithms
As for speed, I was not able to achieve good results right away in this example: I got about 2000 words/sec. But I used the standard library and did not worry much about optimization; perhaps with other transducer types the speed would increase. The dictionary weighed about 35 MB (although with a compact representation it compresses about threefold).
As for simplicity: very little code is needed here. It is enough to use the standard operations: composition, concatenation, union, and shortest-path search. Building the dictionary took about 50 lines of code. A tarball with the example source code and Python bindings can be
downloaded here .
In addition, this approach has a very interesting property: it allows building complex transformations by combining transducers into larger schemes. For example, by adding the automaton from my
previous article about a spellchecker to the predictor, we get a device that answers the question: "is this word really unknown, or does it contain an error?"
That's all for now. I hope it was interesting.