
Using statistical methods to generate paradigms for the canonical forms of new words

The time has finally come to describe an interesting method of filling in the base (vocabulary) of a morphology module using statistical methods applied to the data already in it.

We have:
  1. The Zaliznyak base: about 94 thousand words, almost 3 million word forms
  2. A base of 52 thousand new words with information about their grammatical features (gender, number, animacy, etc.)
  3. A Ruby interpreter
  4. Some free time after work
  5. The need to supplement the module's base with paradigms for the new words (only their canonical forms are available)


Let's try to use statistical methods (or, as some might say, pseudo-statistical ones):

1. First, we build a base of paradigm ending models for each word in the existing base. Sample data will probably be clearer than any explanation. Records linking the canonical form of a word to its ending model look like this:
 ...
 abbot | 15 | <... any grammatical information for the canonical form of the word ...>
 abbey | 20 | <... any grammatical information for the canonical form of the word ...>
 ...

here 15 and 20 are the model numbers of endings.
Entries for the ending models look like this (each line is one Russian ending plus the grammatical information of the corresponding form):
 ...
 ---
 15
 | | <... any grammatical information for the canonical form of the word ...>
 th | <... any grammatical information for the canonical form of the word ...>
 oh | <... any grammatical information for the canonical form of the word ...>
 s | <... any grammatical information for the canonical form of the word ...>
 oh | <... any grammatical information for the canonical form of the word ...>
 oh | <... any grammatical information for the canonical form of the word ...>
 oh | <... any grammatical information for the canonical form of the word ...>
 their | <... any grammatical information for the canonical form of the word ...>
 omu | <... any grammatical information for the canonical form of the word ...>
 ...
 and | <... any grammatical information for the canonical form of the word ...>
 ---
 ...
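To make the two record types concrete, here is a minimal Ruby sketch. The model numbers 15 and 20 and the words come from the examples above; the ending strings and grammar tags are invented placeholders, not actual Zaliznyak data, and the stem handling is deliberately naive:

```ruby
# Canonical form => ending model number (grammar info omitted for brevity)
WORD_TO_MODEL = {
  "abbot" => 15,
  "abbey" => 20
}

# Ending model number => list of [ending, grammar info] pairs.
# Appending each ending to the word's stem yields one form of the paradigm.
# The endings and tags below are placeholders, not real dictionary data.
ENDING_MODELS = {
  15 => [["", "nom.sg"], ["a", "gen.sg"], ["u", "dat.sg"]],
  20 => [["o", "nom.sg"], ["a", "gen.sg"]]
}

# Expanding a word's paradigm is then a lookup plus concatenation.
# Here the stem is naively taken to be the canonical form itself.
def paradigm(word, stem = word)
  model = WORD_TO_MODEL.fetch(word)
  ENDING_MODELS.fetch(model).map { |ending, info| [stem + ending, info] }
end
```

With this layout, every word whose endings behave the same way shares one model record instead of storing its full paradigm.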

2. After that, we re-analyze the existing words to find how the ending model depends on the word's structure. To do this, we split words into syllables (I cannot vouch for the correctness of this algorithm, but for simple cases it works reasonably well; we will not consider tricky cases like "Landsknecht"). The important thing is that the same syllabification algorithm is used in both phases: analyzing the existing words and generating paradigms for new ones.
Syllable splitting, rather than n-grams, was chosen deliberately: in my opinion, it expresses dependencies in Russian more naturally.
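The article does not show the syllabification algorithm itself. As an illustration only, here is a naive vowel-based splitter in Ruby (transliterated input, one vowel group per syllable, trailing consonants glued to the last syllable); it will mishandle exactly the kind of edge cases mentioned above:

```ruby
VOWELS = "aeiouy".freeze

# Naive syllabifier: a syllable is a run of consonants followed by one
# vowel group; any consonants left at the end of the word are appended
# to the last syllable. Words with no vowels stay whole.
def syllables(word)
  parts = word.scan(/[^#{VOWELS}]*[#{VOWELS}]+/)
  return [word] if parts.empty?
  parts[-1] += word[parts.join.length..]
  parts
end

syllables("abazhurny")  # => ["a", "ba", "zhu", "rny"]
```

Note that this toy version splits "abazhurny" as a-ba-zhu-rny, not the a-ba-zhur-ny you would expect from real Russian syllabification; the only property the method actually requires is that the same splitter is applied consistently in both phases.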
So, we split each word into syllables, number them in reverse order, and write them into the database (if such a record already exists, we merge the counts with the existing data).
As a result, we get a table like the following (with a great many records):
 Record ID | Syllable | Syllable position (from the end of the word) | Ending model number | Number of identical combinations (syllable / position / ending model)
 16        | ba       | 2                                            | 10                  | 29
 16        | zhur     | 1                                            | 10                  | 3
 16        | ny       | 0                                            | 10                  | 5609
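The accumulation step can be sketched in a few lines of Ruby. The in-memory hash stands in for the database table above; the word a-ba-zhur-ny and model number 10 are taken from that table, and merging on conflict becomes a simple counter increment:

```ruby
# (syllable, position from word end, ending model) => occurrence count
STATS = Hash.new(0)

# Record one word from the existing base: number its syllables in
# reverse order and bump the counter for each resulting triple.
def record_word(word_syllables, model_id)
  word_syllables.reverse.each_with_index do |syl, pos_from_end|
    STATS[[syl, pos_from_end, model_id]] += 1
  end
end

# e.g. the word a-ba-zhur-ny, which uses ending model 10:
record_word(%w[a ba zhur ny], 10)
```

Running this over the whole Zaliznyak base would populate the counts; a triple like ["ny", 0, 10] would end up with a large count, since very many words ending in that syllable share the same ending model.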

3. Finally, we write code that splits new words into syllables with the same algorithm, interprets the counts of syllable / position / ending-model combinations as probabilities, and, with weight decreasing from the last syllable to the first (that is, from the end of the word toward its beginning), computes the most likely ending model.
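A hedged sketch of how this last step could look in Ruby. The article does not specify the weighting scheme, so the geometric decay and the `DECAY` constant below are my assumptions; `stats` is the count table from step 2, keyed by [syllable, position-from-end, model]:

```ruby
# Assumed weighting: each step away from the end of the word halves
# the influence of a syllable's counts on the score.
DECAY = 0.5

# Score every ending model seen in the stats and return the best one.
def predict_model(word_syllables, stats)
  models = stats.keys.map { |_, _, m| m }.uniq
  models.max_by do |m|
    word_syllables.reverse.each_with_index.sum do |syl, pos|
      stats.fetch([syl, pos, m], 0) * DECAY**pos
    end
  end
end
```

Because the last syllable gets the full weight, a new word ending in a familiar syllable is pulled strongly toward the model that syllable most often co-occurs with, while earlier syllables only break ties, which matches the intuition that Russian inflection is driven mainly by how a word ends.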

Results:
  1. On a sample of 1000 arbitrary new words, the analysis produced 85% correct paradigms.
  2. The 15% of incorrect paradigms are words whose syllable sequences were "unusual" relative to the old set: phials -> phialy, phialyas...
  3. The word-form base grew by almost 2 million word forms.

Source: https://habr.com/ru/post/107109/

