
Notes on NLP (Part 10)

(The first parts: 1 2 3 4 5 6 7 8 9.) As the well-known advertising joke goes, "you weren't expecting us, but here we are" :)

In the time since the ninth part was published, I have read one good book on the topic (a couple more are on the to-read list), many articles, and have also talked with several experts. A new volume of material has accumulated that deserves a separate note. As usual, I'm introducing the material to others and, in parallel, structuring the knowledge for myself.

I apologize in advance: this part is difficult to read and understand. Well, as the saying goes, not every day is Shrovetide for the cat. Complex tasks make for complex texts :)

A little about machine translation: statistics versus rules

I'll start with my favorite flame war: which is better, statistical models or models based on explicit grammar rules? Yorick Wilks writes well about this in his book Machine Translation: Its Scope and Limits. The book is devoted to machine translation, and it is clear that you won't get far in translation without analyzing sentence structure, so the topics of syntactic analysis and translation are closely related.

Wilks surveys how the field has developed. Initially, everything was built on rules. Then, in the early 90s, a group from IBM had the idea of applying pure statistics (with no rules at all) and got results that were much better than expected. However, Wilks notes that good initial results do not guarantee improvement in the future, since the "theoretical ceiling" of a technology may not be that high. And in general, in accordance with "Wilks's law," any theory, even the most outlandish, can produce good results in machine translation. There is also a "second Wilks law": a successful machine translation system usually does not work on its stated principles.

That is, if we look at the current state of the dinosaur SYSTRAN, it turns out that calling this system "rule-based" is not entirely accurate: it uses a pile of different "crutches" and ad hoc algorithms. Similarly, the Google and IBM systems quickly moved out of the "purely statistical" category. For example, IBM initially rejected even per-language morphology modules, assuming that everything should be derived "on the fly" from the corpus. They no longer do. One more thing worth noting: "pure" statistical analysis of a text one way or another boils down to statistics over word sequences. If we take trigrams, that is, three consecutive words, as the basic unit of analysis, a modern computer copes more or less (although in the early 90s even trigrams were very heavy). Longer word sequences lead to a space of options beyond the capabilities of any current hardware.
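To make "statistics over word sequences" concrete, here is a minimal sketch (my own illustration, not from the book) of counting trigrams in a toy corpus; the comment at the end shows the arithmetic behind why longer word strings become intractable.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count how often each three-word sequence occurs in a toy corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

corpus = [
    "the cat sat on the mat",
    "the cat lay on the mat",
]
print(trigram_counts(corpus).most_common(1))
# [(('on', 'the', 'mat'), 2)] -- only attested trigrams get counts

# Why longer word strings blow up: with a 50,000-word vocabulary there are
# 50_000 ** 3 = 1.25e14 possible trigrams and 50_000 ** 5 ~ 3.1e23 possible
# 5-grams; a real corpus attests only a vanishing fraction of them.
```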

The book points out that Google does not take things to the point of fanaticism either. For example, the SYSTRAN system has evolved over decades for certain domains, and no alternative algorithm can catch up quickly. So for certain language pairs Google actually uses SYSTRAN rather than its newfangled algorithms. There are a few more good observations on this topic. For example, the author notes that the statistical English-to-French translator rests on a good parallel corpus of English-French texts (the proceedings of the Canadian parliament). For rare language pairs, however (and in this context almost all other pairs are rare), it is very difficult to find such an extensive corpus. Of course, "War and Peace" can be found in dozens of languages, but literary translations are hardly a good fit for a parallel corpus.

Overall, Wilks believes that: (1) bare statistics, like bare rules, are not very promising (good machine translation requires a knowledge base about the objects in the text and the outside world); (2) future systems will be hybrid, though it is not quite clear how to build these hybrids; (3) even purely statistical algorithms must become "smarter." Here is one more observation along these lines. Speech recognition systems today are built on statistics: a system is first "trained" on input data and can then be used in practice. It has been estimated that if children needed as much training data to learn speech as computer recognizers do, learning a language would take more than 100 years of round-the-clock exposure.

Towards hybridization

Machine translation is a complex and multifaceted topic, so let's get back to parsing, that is, to the analysis of sentences.

I have already been accused of an unjustified dislike of statistical methods. I think that's not entirely fair: from my point of view, the ideal direction for developing a syntactic (or syntactic-semantic) analyzer is the automatic, statistics-based extraction of parsing rules from an existing treebank. It is precisely in this direction that my understanding of the subject has progressed noticeably over the past month.

First, a few words about treebanks themselves (a reminder: we are discussing almost exclusively dependency treebanks). It seems to me that building a bank of parsed example sentences is a very worthwhile task. Obviously, annotated sentences enable much more interesting kinds of text analysis than unannotated ones. And the amount of work needed to create a treebank is not that large. In essence, compiling a treebank means parsing sentences into their parts, a job each of us knows from school. The authors of the Finnish Turku dependency treebank believe that a treebank of ten thousand sentences is enough to build a full-fledged parser. OK: if you parse 10 sentences a day without straining yourself, such a treebank can be built in three years; with three people working, in a year. Is that a lot? The number of treebanks in the world is growing, that's a fact, and many of them are available to anyone, often for free.

With some bitterness one can note that with Russian, as usual, everything is difficult. There is a fairly large treebank, SynTagRus (about 42 thousand parsed sentences). I haven't contacted its authors, but from an outside observer's point of view everything about it is somehow opaque: "Write us letters, and maybe we'll answer. Or maybe we won't." Compare this with the freely available Czech and Finnish corpora! I'm not saying the treebank has to be free, but surely it's easier to spell out the distribution terms than to annotate 42 thousand sentences? Why is the hard part done while the simple part is still "pending"?.. On ruscorpora.ru you can search for individual words in the corpus and inspect the resulting parse trees in PDF. Meanwhile, this phrase is "heartening": "No offline versions of the corpus are available yet, but work is underway in this direction." I can vividly picture this "ongoing work": apparently an old RAR archiver is running around the clock on an old 80286, packing gigabytes of trees so the archive can later be uploaded to the site. What else could it be? The Finnish treebank mentioned above is simply posted as an archive with explanations, and nobody complains.

I think it makes sense to talk about two main problems in treebank-based parsing: how to annotate a treebank, and how to automatically build a parser from a ready-made treebank. Let's start with the second task. Suppose the treebank already exists and is "somehow" annotated. That is, for each sentence in the treebank there is a ready-made parse graph; in other words, the links between words are listed (which word depends on which), and for each link its "type" is indicated (for example, "subject").
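As a toy illustration of what such an annotation amounts to (my own sketch; real treebanks use formats such as CoNLL, and the label names here are invented), a parsed sentence can be stored as a list of labeled links:

```python
# A toy representation of one annotated sentence from a treebank:
# each entry is (head word index, dependent word index, link type).
# Index 0 is an artificial root node.
sentence = ["<root>", "Ivan", "came", "from", "the", "guests"]
links = [
    (0, 2, "root"),       # "came" is the main predicate
    (2, 1, "subject"),    # Ivan <- came
    (2, 3, "adverbial"),  # from <- came
    (3, 5, "object"),     # guests <- from
    (5, 4, "det"),        # the <- guests
]

for head, dep, label in links:
    print(f"{sentence[dep]} --{label}--> {sentence[head]}")
```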

In one of the previous parts I mentioned a project in which a treebank was converted into XDG grammar rules for subsequent parsing. Unfortunately, that line of work has stalled, and the author explains why: a treebank does not provide enough constraints on the trees. That is, the output rules can generate a dozen different trees for a single input sentence. The author believes the problem can be solved by ranking the resulting trees by "goodness." Such studies do exist; let's see what comes of them (and maybe we'll take part :)). By the way, on the subject of Google and pure statistics: one of the people promoting this tree-ranking direction, Liang Huang, worked at Google for a while.

Generating rules a la XDG is not the only way to build a parser, however. There are more successful projects to date, for example MaltParser. This thing promises something close to science fiction: feed it a treebank for any language, and it will generate the corresponding parser. And the system apparently works quite well. I remember being snide somewhere: if statistics is so good, why has nobody tried to write a statistical Pascal parser? :) After all, Pascal is clearly simpler than any natural language! Well, the authors of MaltParser actually managed to solve this problem: they generated a C++ parser with the help of MaltParser!

I haven't yet studied the details of the algorithms that turn treebanks into parsers, but in the most general form the idea is that the parsing process is represented as a procedure that depends on numerical parameters. Different parameter values steer the parsing procedure toward one scenario or another. The treebank is used as a training set for a standard machine learning algorithm, which selects the required set of parameters. The result is "a set of parameters + a fixed algorithm = a parser."
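A very rough sketch of that "parameters + fixed algorithm" idea (my own simplification; MaltParser's actual transition system and feature model are far richer): the parsing procedure itself is fixed (a stack, an input buffer, three possible actions), and the learned part is just the scoring function that decides which action to take at each step.

```python
def parse(words, score):
    """A fixed shift-reduce procedure; the learned part is `score`,
    a function over (features, action) whose parameters come from
    training on a treebank."""
    stack, buffer, edges = [], list(range(len(words))), []
    while buffer or len(stack) > 1:
        actions = []
        if buffer:
            actions.append("SHIFT")
        if len(stack) >= 2:
            actions += ["LEFT-ARC", "RIGHT-ARC"]
        # toy features: the word on top of the stack and the next input word
        feats = (words[stack[-1]] if stack else None,
                 words[buffer[0]] if buffer else None)
        best = max(actions, key=lambda a: score(feats, a))
        if best == "SHIFT":
            stack.append(buffer.pop(0))
        elif best == "LEFT-ARC":   # second item on the stack depends on the top one
            dep = stack.pop(-2)
            edges.append((stack[-1], dep))
        else:                      # RIGHT-ARC: top item depends on the one below it
            dep = stack.pop()
            edges.append((stack[-1], dep))
    return edges                   # list of (head index, dependent index)
```

In a real system the features include word forms, parts of speech and already-attached dependents, and the scoring function is trained (for example, with an SVM or a perceptron) on action sequences reconstructed from the treebank's gold trees.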

Here I would like to pause and make a few important remarks about the internal, unavoidable limitations of MaltParser:
- Statistical methods always try to parse the input sentence, even if it is ill-formed. In principle, this can be seen two ways. If the task is to analyze any phrase, even an incorrectly formed one, this is a plus. If the goal is to build a grammar checker, it is an obvious minus.
- A statistical method is, in essence, a "black box." If we are not satisfied with the analysis of some specific phrase, there is no way to investigate why it happens and somehow "patch" the algorithm. All you can do is change the treebank and rerun the learning algorithm.
- The parser outputs only one parse tree, which is taken to be the solution to the parsing problem.

In principle, the first two constraints are clear enough. But the third point deserves a more detailed discussion. Here we run into yet another reincarnation of the basic question: what exactly should a parser do?

Facets of parsing

I should warn you right away: I haven't had time to study my colleagues' views on this issue, so what follows is my personal opinion.

One and only one tree: is that good or bad? I think we need to distinguish between the tree as a structure of relations "in general" and the tree as a structure with explicitly designated link types. In other words, the tree as (1) a graph with unlabeled edges and as (2) a graph with labeled edges.

Consider the phrase Ivan came from the guests (i.e., came back from visiting). It corresponds to a tree with the edges (came, Ivan), (came, from) and (from, guests). The phrase Ivan came out of politeness is syntactically arranged in the same way, and it corresponds to a similar parse tree. However, if you add edge labels describing the "link types," the trees are no longer identical: Ivan came (from where?) from the guests, but Ivan came (why?) out of politeness.
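The same point in code (a sketch of mine, with invented label names): the unlabeled edge sets of the two phrases coincide, while the labeled ones do not.

```python
# Edges are (head position, dependent position); positions:
# 1 = Ivan, 2 = came, 3 = preposition, 4 = guests / politeness.
# In the Russian original both phrases use the same preposition,
# so the token-level structure really is identical.
unlabeled = {(2, 1), (2, 3), (3, 4)}            # the same set for both phrases

labeled_guests     = {(2, 1, "subject"), (2, 3, "where"), (3, 4, "object")}
labeled_politeness = {(2, 1, "subject"), (2, 3, "why"),   (3, 4, "object")}

print(labeled_guests == labeled_politeness)     # False: only the labels differ
```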

Now let's think about how important it is for the parser to be able to generate trees that differ in structure and trees that differ in labels. In other words, how likely is it that several different trees (differing in structure, or differing only in labels) correspond to the same sentence?

It seems to me the situation is as follows. A sentence should correspond to exactly one unlabeled graph. The case of two or more graphs is possible, but then we are dealing with a pun, with text and subtext, as it were. Whether we need such wordplay in practice is a big question; honestly, I don't think we do. Essentially, we are talking about examples of this sort:

  He saw her in front of his eyes  =>  he saw (whom?) her (where?) in front of his eyes
                                       he saw (what?) in front of her (how?) with his own eyes

  The countess rode in a carriage with a raised rear  =>  the countess rode in a carriage (what kind?) with a raised rear
                                                          (who?) the countess with a raised rear rode in a carriage

Of course, it's hard for me to say for sure, but my feeling is that a sentence should correspond to exactly one parse tree with unlabeled edges. Whether the parser can build such a tree on its own, and what information it needs for that, is a difficult question. For example, let's slightly change the phrase about the countess (forgive me, I'm just reporting what came into my head): The countess rode in a carriage with a raised behind. Here one would want to notice that it is the countess who has a behind, not the carriage; therefore, in the correct parse tree "behind" must be linked to "countess" and not to "carriage." But that requires the parser to have substantial knowledge of anatomy, so whether it is worth analyzing such subtleties remains an open question.

Thus, MaltParser is great for unlabeled graphs. The situation becomes considerably more complicated for graphs with edge labels. Logically, an ideal analysis of a phrase always corresponds to a single tree. In practice, however, as our appetites grow, the parser's capabilities shrink sharply. The real scale of the problem depends on the link types used in the treebank: the more elaborate the link inventory, the harder things are for the parser. After all, the parser tries to analyze sentences "in the image and likeness" of the treebank, that is, using every way of linking words known to it. (And thus we arrive at the first problem mentioned above, the problem of annotating the treebank.)

Don't think I'm talking about purely theoretical matters. Treebank annotation is not a trivial question. There is no unified annotation scheme for treebanks. There are some well-developed approaches you can adopt, or you can choose not to. For instance, the authors of MaltParser report that training on different treebanks yields different parsing quality. The Prague Dependency Treebank, on the whole, gives worse quality, and this is explained by its more detailed markup with more link types. Accordingly, a parser that has to juggle such a large number of link types has more opportunities to make a mistake.

So here I would like to put the question bluntly: how detailed should a treebank be for the parser to remain useful? If we confine ourselves to a "subject-predicate-object" scheme, MaltParser (and other statistical analyzers) is quite sufficient. If, however, each word is assigned a whole set of semantic attributes, the problem of "underspecification" inevitably arises: there is genuinely not enough data to build a single tree.

For example, in a simple annotation scheme, the phrase "I see a luk" (the Russian word "лук" means both "bow" and "onion") is parsed trivially: subject, predicate, object.
In a more detailed scheme there are two equally valid trees: "I see a luk[weapon]" and "I see a luk[food]."
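A sketch of that underspecification effect (my own toy example, with invented sense tags): under the coarse scheme the parser can commit to a single tree, but as soon as the annotation asks for word senses, the same string yields two equally valid analyses.

```python
# Coarse scheme: one tree, no ambiguity.
coarse = [("see", "I", "subject"), ("see", "luk", "object")]

# Detailed scheme with word senses: Russian "luk" is both "bow (weapon)"
# and "onion (food)", so two equally valid analyses appear for the same string.
detailed = [
    [("see", "I", "subject"), ("see", "luk[weapon]", "object")],
    [("see", "I", "subject"), ("see", "luk[food]",   "object")],
]
print(len(detailed), "trees under the detailed scheme")
```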

In principle, you can parse according to the "simple scenario" and then shift the responsibility for identifying precise link types and word senses onto subsequent modules. But wouldn't that mean that all the work we'd rather not think about has simply been moved elsewhere, and will still have to be done somehow?

I don't know yet. I'll keep digging :)

Source: https://habr.com/ru/post/82068/

