Notes on NLP (Part 1)

On the eve of the New Year, I decided to start a short series of articles on a field that interests me personally: natural language processing. (Yes, NLP in the title stands for natural language processing, as Captain Obvious would point out.) Parsing, semantics, machine translation, working out the meaning of a word in context: in short, all the joys of a computational linguist :)

It probably makes sense to set the level of presentation right away. I am trying to do computational linguistics myself (with varying success). I will try to talk about what specifically concerns me, what is already possible, what is not yet possible, and what should be worked on right now. Perhaps these articles will help me structure the information in my own head, so that I can rely on a ready-made structure in the new year. And if readers have ideas of their own, or proposals for cooperation, so much the better.

First, a little about the general structure of our conversation. A great many things can be filed under NLP, from the morphological analysis of words to searching for addresses in a text file. I will only talk about what I work on. The approach is more scientific than practical. This does not mean that I do not care about applications, not at all! I simply will not pursue a potentially dead-end branch, even if it gives a quick (marketable) short-term result. Nor do I care which programming language is used, whether the text lives on the Internet or on a hard drive, whether the system is Windows or Linux, and so on. What matters is the quality of the ideas, their potential for further development, and their applicability to different tasks and to different languages.

Let's start with a short tour through the “levels” of NLP, so that it is clear which of them the conversation will mainly be about.

Fact extraction and text segmentation

There is a fairly large class of tasks that can be solved without any special NLP tools. Suppose you want to find out what is being written on the Internet about the President of Russia (no doubt the relevant services are doing just that :)). Then you need to search for the strings “Dmitry Medvedev”, “President of Russia”, “D. Medvedev” and the like. Of course, not everything is simple here either: the text may contain phrases such as “the former President of Russia”, or just “the head of state” (and you have to work out which state). These are the things that force algorithm designers to invent all sorts of heuristics. Searching for postal addresses is a task of the same level. In principle there is a more or less clear structure: postal code, region, city, street, house, apartment. But different people write addresses differently, and in general the task is not trivial.

I would put the task of finding sentence boundaries in roughly the same basket. It looks like NLP, but not quite. In the simplest case we look for a period followed by a capital letter: the period ends one sentence and the capital letter starts the next. In reality, of course, there are a million exceptions: “Our office is on ul. Ivanova” (where “ul.” abbreviates the Russian word for “street”), “At 10 p.m. London time, Reuters reported ...”
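
The simplest rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample sentences are made up for the example, and the rule is deliberately the naive one, so that its failure on an abbreviation is visible.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split wherever a period is followed by whitespace and a capital letter."""
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)


# Works when every period really ends a sentence.
print(naive_split("It is late. We should go."))

# Breaks on an abbreviation: "p.m." is followed by the capitalized
# word "London", so one sentence is wrongly cut in two.
print(naive_split("At 10 p.m. London time, Reuters reported the news."))
```

The second call returns two fragments for a single sentence, which is exactly the kind of exception that has to be handled with abbreviation lists and other heuristics.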

All in all, this looks like NLP, but it can hardly be called full-fledged computational linguistics.

Word-level processing

The first thing that comes to mind here is automatic morphological analysis: the task is to determine the part of speech and the “attributes” that define a word form. For example, the Russian “красная” (“red”) is a feminine adjective in the nominative singular, with the initial form “красный”. And “стекла” is either the noun “стекло” (“glass”) in the genitive singular, or the verb “стечь” (“to flow down”) in the third person, feminine, past tense.

There is also the inverse problem of synthesis: given the attributes and the initial form, generate the required word form.
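
As a rough illustration, both directions can be pictured as lookups in a table of word forms. The sketch below assumes a tiny hand-written lexicon (the `LEXICON` table, the `Analysis` type and the feature labels are all invented for the example). It uses the Russian form “стекла”, which has both a noun reading and a verb reading. Real analyzers rely on large dictionaries and rules for unknown words.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str           # initial (dictionary) form
    pos: str             # part of speech
    features: frozenset  # grammatical attributes


# Hypothetical three-entry lexicon, for illustration only.
LEXICON = {
    "красная": [Analysis("красный", "ADJ", frozenset({"fem", "nom", "sing"}))],
    "стекла": [
        Analysis("стекло", "NOUN", frozenset({"gen", "sing"})),
        Analysis("стечь", "VERB", frozenset({"past", "fem", "sing"})),
    ],
}


def analyze(form: str) -> list[Analysis]:
    """Analysis: return every reading of a word form (possibly several)."""
    return LEXICON.get(form.lower(), [])


def synthesize(lemma: str, pos: str, features: set) -> list[str]:
    """Synthesis: return the word forms matching a lemma and attributes."""
    wanted = frozenset(features)
    return [
        form
        for form, analyses in LEXICON.items()
        for a in analyses
        if (a.lemma, a.pos, a.features) == (lemma, pos, wanted)
    ]


for reading in analyze("стекла"):
    print(reading.lemma, reading.pos, sorted(reading.features))

print(synthesize("красный", "ADJ", {"fem", "nom", "sing"}))
```

The point of the sketch is the ambiguity: the analyzer returns two readings for one form, and choosing between them is left to the later levels of processing.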

Morphology is, frankly, an interesting thing, and its complexity differs greatly from language to language. For English, for example, it is simple. For Russian it is fairly difficult. For Finnish it is very difficult, since the morphology there is highly developed. For Japanese it does not seem to be very difficult, but there is another problem: you first have to work out where one word ends and the next begins!

The best thing about this task is that it can be considered solved. There are now very good systems (with “good” meaning somewhat different things for different languages). Each has its pluses and minuses, of course. But if you set your mind to it, you can download or buy a decent analyzer (and even a synthesizer).

On their own, however, morphological analysis and synthesis can hardly be called very useful. Clearly, such an analyzer or synthesizer is just a module in a larger project. Still, I think that for a person learning a foreign language with a rich morphology, such software would be useful in its own right. When I get around to it, I will post a link to a prototype analyzer/synthesizer of Russian morphology for learners.

Sentence-level processing

I will say straight away: this is where we mainly work, and will continue to work. So for now I will only briefly explain what it is about, and we will go into more detail in the next article.

The main goal of sentence analysis is to build a dependency tree, or parse tree. It shows the structure of the sentence, in particular which words depend on which. Why is it needed? For example, in a machine translation system. Take the English sentence “I have a red ball and a blue shovel”. When translating into Russian, the computer must understand that “red” refers to “ball”, so it must be translated in the masculine (“красный мяч”). “Shovel”, on the other hand, is feminine in Russian (“лопата”), so “blue” must take the feminine form (“синяя”).
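
To make the role of the tree concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the dependency links are assigned by hand (a parser would have to produce them), and the two small dictionaries stand in for a real bilingual lexicon. It shows only how an adjective picks up its gender from the noun it depends on.

```python
# Each token: (index, word, index of its head, or None for the root).
# The links are hand-assigned; this is not the output of a real parser.
TOKENS = [
    (0, "I", 1), (1, "have", None), (2, "a", 4), (3, "red", 4), (4, "ball", 1),
    (5, "and", 8), (6, "a", 8), (7, "blue", 8), (8, "shovel", 4),
]

# Hypothetical mini-lexicon: Russian noun with its gender, and
# adjective forms (nominative singular) for each gender.
NOUNS = {"ball": ("мяч", "masc"), "shovel": ("лопата", "fem")}
ADJECTIVES = {
    "red": {"masc": "красный", "fem": "красная"},
    "blue": {"masc": "синий", "fem": "синяя"},
}


def translate_noun_phrases(tokens):
    """Translate each adjective using the gender of the noun it depends on."""
    words = {index: word for index, word, _ in tokens}
    phrases = []
    for _, word, head in tokens:
        if word in ADJECTIVES and head is not None and words[head] in NOUNS:
            noun, gender = NOUNS[words[head]]
            phrases.append(f"{ADJECTIVES[word][gender]} {noun}")
    return phrases


print(translate_noun_phrases(TOKENS))
```

Without the link from “blue” to “shovel”, the program would have no principled way to choose between “синий” and “синяя”.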

I must say that Google Translate is quite bad at this kind of analysis: in its translation of the phrase “I have a blue shovel”, the adjective does not agree in gender with the noun. That, of course, is no good.

Here we should probably also mention disambiguating word meanings using the local context. For example, the Russian verb “разбить” can be translated into English in at least ten different ways: to break (a cup), to defeat (an enemy), to lay out (a park), and so on. There are also clearly exotic cases such as “его разбил паралич” (literally “paralysis broke him”); obviously there is no “breaking” in the English version (“he was paralysed”). In complex cases the broad context of the phrase has to be analyzed, but in practice the local context is often enough. If the object is a cup, the translation is “to break”; if it is a garden or a park, “to lay out”; and so on.
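
The local-context idea can be sketched as a lookup keyed on the direct object of the verb. The table below is a hypothetical fragment written for this example, built around the Russian verb “разбить” and a handful of objects. A real system would need a far larger resource and a parser to find the object in the first place.

```python
# Hypothetical fragment: English translation of the Russian verb "разбить"
# depending on the lemma of its direct object.
TRANSLATION_BY_OBJECT = {
    "чашка": "break",    # cup
    "враг": "defeat",    # enemy
    "парк": "lay out",   # park
    "сад": "lay out",    # garden
}


def translate_razbit(object_lemma: str):
    """Pick a translation from the local context; None if it is not enough."""
    return TRANSLATION_BY_OBJECT.get(object_lemma)


for obj in ["чашка", "враг", "сад", "паралич"]:
    print(obj, "->", translate_razbit(obj))
```

The last call returns `None`: the object alone does not settle the translation, and a wider context (or a special rule) is needed.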

I want to dwell on this level not only because it interests me personally, but also because, as it seems to me, this is where the main work is still going on. Sentence analysis is still not as good as we would like.

Text-level processing

Of course, our problems do not end at the sentence level (although, note, without good sentence analysis it is pointless to go any further). Very often a more global context is required. Example: “I have a sibling. She is beautiful.” In English, “sibling” means “brother or sister”. Since the text then clearly refers to a woman (“she”), “sibling” has to be translated as “sister”. This is called reference tracking.

Another example of the same kind: “Minulla on veli. Hänellä on auto” (Finnish), “I have a brother. He has a car”. The catch is that “hänellä” covers both “he” and “she” (Finnish makes no distinction between the two), so you have to choose the right option. There are worse cases: how would you translate “he and she are one Satan” into Finnish? :)

Sometimes you have to collect a whole arsenal of information about the object of interest. For example, the phrase “my brother is a student” cannot be translated into Japanese as it stands, because Japanese has no word for just “brother”, only “elder brother” and “younger brother”. So it will be either “watashi no ani wa gakusei desu” or “watashi no otōto wa gakusei desu” (and that is before we even mention the levels of politeness in Japanese, which are a problem of their own).
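
A very crude version of the “sibling” case can be sketched as follows. The heuristic (take the first gendered pronoun after the word “sibling”) and all names in the code are assumptions made for this example. It ignores sentence structure entirely and would be fooled by any other person mentioned in between.

```python
import re

# Hypothetical lookup tables for the example.
PRONOUN_GENDER = {"she": "fem", "her": "fem", "he": "masc", "him": "masc", "his": "masc"}
SIBLING_BY_GENDER = {"fem": "sister", "masc": "brother"}


def resolve_sibling(text: str):
    """Guess whether "sibling" means brother or sister from a later pronoun."""
    words = re.findall(r"[a-z]+", text.lower())
    if "sibling" not in words:
        return None
    after = words[words.index("sibling") + 1:]
    for word in after:
        if word in PRONOUN_GENDER:
            return SIBLING_BY_GENDER[PRONOUN_GENDER[word]]
    return None  # no pronoun found: the text does not settle the question


print(resolve_sibling("I have a sibling. She is beautiful."))
print(resolve_sibling("I have a sibling."))
```

Even this toy shows why the sentence level is not enough: the information needed to translate the first sentence sits in the second one.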

Here, as you can see, we end up with a potent cocktail of data mining, natural language processing, all sorts of heuristics, and who knows what else. The tasks are interesting, but I am not ready to get into them yet :)

This is probably a good place to end the article (for reasons of length). My apologies if it has not been very informative so far; it will get more interesting from here.

Source: https://habr.com/ru/post/79790/

