
Notes on NLP (Part 2)

Although I said in the first part that I was not going to dwell on morphology, it seems we cannot do without it at all. After all, sentence processing is tightly bound to the morphological analysis that precedes it.

A lyrical digression. Our native Russian is a very good language (for us) and a difficult one (for foreigners), with rich phonetics and a wide variety of grammatical devices. So, in principle, learning foreign languages should not be too hard for us. First, they do not contain that many unfamiliar phonemes. Fine, you have to spend some time practicing “th”, “w” and “r” in English, but imagine a foreigner learning Russian whose native language does not distinguish between “b” and “p” or, say, between “r” and “l”! Many are also terrified by consonant clusters (“Krzhizhanovsky”), which we crack almost effortlessly. Second, the abundance of grammatical phenomena means we rarely run into something incomprehensible. For an American, by contrast, the very notion of grammatical gender or case is not at all obvious. There are languages without personal verb forms, and there are languages without prepositions.

Morphology after all

Now about morphology. At first glance there is nothing much to discuss here: automatic morphological analyzers work well. Of course, they cannot resolve context on their own, so they return every possible interpretation (for example, the word “Russian” can be both a noun and an adjective). If you are curious to see an automatic analyzer at work, you can experiment on S.A. Starostin's website. I would venture to say that almost all Russian morphological analyzers rely, one way or another, on Zaliznyak's Grammatical Dictionary. In addition, the module has to take the regularity of the language's structure into account and “guess” new words where possible. The regularity of Russian is easy to verify with the well-known phrase “Glokaya kuzdra shteko budlanula bokra i kurdyachit bokryonka”. The parts of speech and the word forms in it are easy to guess, although not a single word of this phrase can be found in any dictionary. In short, “ne dudon butyavku”.

There are also downloadable modules. I myself use the work of Alexey Sokirko, “wrapped” in a user-friendly interface on the Lemmatizer site. Note that it offers not only a form analyzer but also a synthesizer that automatically generates the required form of a word. There are, of course, some flaws. For example, I am mildly annoyed by the analyzer's dislike of the letter “ё”, as well as by some technical quirks of attribute generation. The analyzer can guess unknown words, although the results are sometimes funny. For example, it believes that the word “crocodile” is a word form of the base form “crocodile” :)
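
To make the two behaviours described above concrete (returning every reading of a known word, and guessing unknown words from their endings), here is a minimal sketch. Everything in it is an assumption made for illustration: the `Parse` record, the two-entry `LEXICON`, the four `ENDING_RULES` and the tag names are hypothetical and are not taken from any real analyzer, which would use a full dictionary and far richer rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parse:
    lemma: str
    pos: str
    tags: frozenset
    guessed: bool = False


# Hypothetical toy lexicon: one word form maps to ALL of its readings.
LEXICON = {
    "русский": [
        Parse("русский", "NOUN", frozenset({"nom", "sg", "masc"})),
        Parse("русский", "ADJ", frozenset({"nom", "sg", "masc"})),
    ],
}

# Hypothetical ending rules: (form ending, lemma ending, part of speech, tags).
ENDING_RULES = [
    ("ая", "ый", "ADJ", {"nom", "sg", "fem"}),
    ("ла", "ть", "VERB", {"past", "sg", "fem"}),
    ("а", "а", "NOUN", {"nom", "sg", "fem"}),
    ("а", "", "NOUN", {"acc", "sg", "masc"}),
]


def analyze(word):
    """Return every reading of a word; guess by ending if it is not in the lexicon."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    guesses = []
    for ending, lemma_ending, pos, tags in ENDING_RULES:
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: len(word) - len(ending)]
            guesses.append(Parse(stem + lemma_ending, pos, frozenset(tags), guessed=True))
    return guesses


for w in ["русский", "глокая", "будланула", "бокра"]:
    for p in analyze(w):
        print(w, "->", p.lemma, p.pos, sorted(p.tags), "(guessed)" if p.guessed else "")
```

Note that the guesser happily proposes the lemma “глокый” for “глокая” and several competing readings for “будланула”: a purely ending-based guess has no way to know which one is right, which is exactly why the results can look funny.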

I have already said that a morphological analyzer is only one module within a project, but it seems to me that it can be interesting to language learners in its own right. I wrote a GUI for the analyzer modules in my spare time, but so far I have had no time to publish or advertise it anywhere :) I have only posted a short description here, but that is clearly not enough.

However, I have probably painted too rosy a picture of morphological analysis :) There are difficulties as well. The first is technical: not all languages are equally easy to analyze. Judge for yourself: the Russian morphological analyzer by Alexey Sokirko mentioned above works with an 18.5-megabyte database, while its English version needs only 1.6 megabytes.

The second problem concerns terminology. Strange as it may sound (after all, everyone went to school), morphology is not that simple. Yes, we all know that “table” is a noun, “red” is an adjective, Russian has six cases, and so on. But there are also plenty of subtleties on which “the comrades do not agree”. For example, the same analyzer treats both “o lese” (“about the forest”) and “v lesu” (“in the forest”) as forms of the prepositional case, although many linguists will insist that the second form is a locative, a case that has almost died out in Russian. There are other “relic forms” too. Take the vocative (“Grish! Hey, Grish!”); as far as I know, it is quite alive in Ukrainian. There is also the partial case, also known as the partitive: “waiter, more tea!” (with “chayu” instead of “chaya”). The partitive blooms luxuriantly in Finnish, giving students many joyful moments.

Nor is there agreement about which part of speech certain words belong to. For example, what are “net” (“no”), “pora” (“it is time”) and “zhal” (“it is a pity”)? Gramota suggests classifying them as “predicatives”, but there is no generally accepted approach.

You may ask what difference it makes. The analyzer parses, and that is that. If it calls a word a predicative, excellent; an adverb, wonderful. Whether it sees the difference between “lese” and “lesu” (two forms of “forest”) or not is just a game for philologists. Unfortunately, you will have to live with the analyzer you choose for a long time. If at some later point you assume that only the prepositional case can appear in a given context, whether your rule works will depend on the morphological analyzer. It will label “v lesu” (“in the forest”) a “locative”, and your prepositional case will weep :)
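
The dependency described above can be shown with a small sketch. The two “analyzers” here are just hypothetical lookup tables that differ in one label, and the rule and the case names (`"prepositional"`, `"locative"`) are illustrative assumptions rather than the output of any real tool.

```python
# Hypothetical outputs of two analyzers for the same word forms.
# Only the case label is modelled; other readings (such as the dative "лесу") are left out.
ANALYZER_A = {"лесе": "prepositional", "лесу": "prepositional"}
ANALYZER_B = {"лесе": "prepositional", "лесу": "locative"}


def rule_accepts(preposition, noun, analyzer):
    """Naive rule: after 'в' or 'о' only the prepositional case may appear."""
    return preposition in {"в", "о"} and analyzer.get(noun) == "prepositional"


print(rule_accepts("о", "лесе", ANALYZER_A), rule_accepts("о", "лесе", ANALYZER_B))
print(rule_accepts("в", "лесу", ANALYZER_A), rule_accepts("в", "лесу", ANALYZER_B))
```

The same rule accepts “в лесу” with the first table and rejects it with the second, although nothing in the rule itself has changed.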

Accordingly, there is no guarantee that an analyzer, once chosen, can be swapped for another without extra effort. Here is another quirk of Sokirko's analyzer: it calls finite verb forms (“runs”) verbs and the base form (“to run”) an infinitive, so they come out as two different parts of speech. If another analyzer treats every verb as a verb and simply adds a “this is an infinitive” flag to the infinitive, you cannot do without an adapter.
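
Such an adapter might look roughly like the sketch below. The dictionary layout and the labels `"INFINITIVE"`, `"VERB"` and `"inf"` are hypothetical names chosen for the example; the real tag names of either analyzer will differ, and a real adapter would have to cover the whole tag set, not just this one case.

```python
def to_flag_style(parse):
    """Convert 'infinitive as a separate part of speech' into 'VERB plus an inf flag'."""
    tags = set(parse["tags"])
    if parse["pos"] == "INFINITIVE":
        return {"lemma": parse["lemma"], "pos": "VERB", "tags": tags | {"inf"}}
    return {"lemma": parse["lemma"], "pos": parse["pos"], "tags": tags}


def to_separate_pos_style(parse):
    """Inverse conversion: a VERB carrying the inf flag becomes an INFINITIVE."""
    tags = set(parse["tags"])
    if parse["pos"] == "VERB" and "inf" in tags:
        return {"lemma": parse["lemma"], "pos": "INFINITIVE", "tags": tags - {"inf"}}
    return {"lemma": parse["lemma"], "pos": parse["pos"], "tags": tags}


infinitive = {"lemma": "бежать", "pos": "INFINITIVE", "tags": set()}
finite = {"lemma": "бежать", "pos": "VERB", "tags": {"3p", "sg", "pres"}}

print(to_flag_style(infinitive))
print(to_flag_style(finite))
print(to_separate_pos_style(to_flag_style(infinitive)) == infinitive)
```

Keeping the conversion in one place means the rest of the project can be written against a single convention, whichever analyzer sits underneath.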

That is all. If morphology does not raise a flood of questions, in the next part we will move on to sentences.

Source: https://habr.com/ru/post/79819/

