(Beginning: 1, 2, 3.) This time I want to digress a little and hold forth on statistical algorithms and, more generally, on the “workarounds” of computational linguistics.
In the first parts of our conversation we discussed the “classical” path of text analysis: from words to sentences, from sentences to coherent text. But in our hectic times there have also been temptations to solve the problem “in one fell swoop”: to find, if you will, a bug in the system, a “royal road”.
Speaking of royal roads in science and learning “in general”, forgive me, readers, for a lengthy quote:
Desiderius. How are your studies going, Erasmus?
Erasmus. It seems the Muses are not very kind to me. But things would go better if I could learn something from you.
Desiderius. May you meet no failure in anything, so long as it turns to your benefit. So speak.
Erasmus. I have no doubt that there is not a single secret art you do not know.
Desiderius. If only that were so!
Erasmus. They say there is a certain art of memory which makes it possible to master all the liberal arts almost without effort.
Desiderius. What do I hear! And have you seen the book yourself?
Erasmus. I have seen it. But seen is all: there was no teacher.
Desiderius. And what is in the book?
Erasmus. Images of various animals: dragons, lions, leopards; various circles, and in them words in Greek, and Latin, and Hebrew, and even in barbarian tongues.
Desiderius. And does the title say in how many days the science can be mastered?
Erasmus. Yes, in fourteen.
Desiderius. A generous promise, to say the least. But do you know even one person whom this art of memory has made a scholar?
Erasmus. No, not a single one.
Desiderius. And no one has ever seen such a person, nor will anyone, unless we first see a lucky man whom alchemy has made rich.
Erasmus. And how I wish it were true!
Desiderius. Probably because it is vexing to buy knowledge at the cost of so much labor.
Erasmus. Of course.
...
If you are interested in the story, read the ending yourself. This is Erasmus of Rotterdam, the “Colloquies” (1524). The twenty-first century is upon us, we note, and books from the “in 21 days” series show no sign of disappearing.
So, attempts are being made to analyze text without any understanding of its structure: both at the level of syntactic analysis (building a sentence tree without knowing anything about the laws of phrase construction) and at the level of further work, for example, machine translation. How is this possible at all? The answer lies in the magic spell “statistics”.
The splendor and misery of statistics
Statistics is a great thing and has many applications, including in computational linguistics. But it is not a panacea. Since humanity has already accumulated a myriad of texts, there is a reasonable temptation to study the structure of new texts based on existing ones (presumed correct). I should say that in the previous parts I never mentioned exactly how a phrase parse tree is built. Yes, Chomsky's grammars were discussed, but only as the idea from which the concept of phrase-structure parsing grew. I deliberately never wrote that Chomsky's grammar is actually used to construct such trees. That is not necessarily the case.
How can you judge the correctness of a phrase from accumulated data? For example, like this. Take the phrase “I ate a cake.” Let's see how often it occurs in existing documents. Probably quite often. And the phrase “I ate a broom”? Most likely, rarely. And the phrase “I a cake ate” probably does not occur at all. Hence the conclusion: the first phrase is correct, the second is dubious, and the third is wrong. You can also search for “correlated” words: if certain words often occur together, they are probably dependent on each other. In this way you can build an entire tree. Note, though, that such a system will never explain to you exactly why a phrase is bad. It will just say that people don't say that. You understand that this is not much help to a person learning a foreign language.
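The frequency idea above can be sketched in a few lines. This is a toy, not any real system: the corpus and the scoring rule (a product of bigram counts, so that a single unseen word pair zeroes the score) are invented for illustration.

```python
from collections import Counter
from math import prod

# A tiny stand-in corpus; real systems count n-grams over billions of words.
corpus = [
    "i ate a cake",
    "i ate a sandwich",
    "she ate a cake",
    "he bought a broom",
]

def bigrams_of(words):
    """Adjacent word pairs of a tokenized sentence."""
    return list(zip(words, words[1:]))

# Count how often each adjacent word pair occurs in the corpus.
counts = Counter()
for sentence in corpus:
    counts.update(bigrams_of(sentence.split()))

def phrase_score(phrase):
    """Product of bigram counts: 0 means some pair was never seen."""
    return prod(counts[pair] for pair in bigrams_of(phrase.split()))

print(phrase_score("i ate a cake"))   # 12: every bigram well attested
print(phrase_score("i ate a broom"))  # 6: rarer, but still possible
print(phrase_score("i a cake ate"))   # 0: scrambled order, unseen pairs
```

Note that the scorer can rank the phrases, but it cannot say *why* the last one is bad; it only reports that nobody says that.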
You can go even further. Suppose you want to translate a document into another language. What is the probability that nobody has ever translated your phrase before? Quite likely you can find a ready-made translation, at least for part of the phrase. The “knowledge base” in such projects is a corpus of bilingual texts. For example, researchers are very fond of the proceedings of the Canadian Parliament, since they are kept in two languages, English and French; the texts are formal and the translation is strict, with no liberties taken. So you take a piece of text, find the corresponding piece in the other language, and voila! (Of course, I am greatly simplifying, but the basic idea is this.) This is also where jokes about strange translations come from: “made in China” turns into “made in the Republic of Belarus”. I like this topical joke, but in fact that is exactly what happens.
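The piece-by-piece lookup described above can be sketched as a toy phrase-table translator. The table entries here are invented for illustration; real systems extract millions of such pairs from aligned bilingual corpora.

```python
# Toy phrase-table translation: find the longest known chunk starting at
# the current position, emit its stored translation, and move on.
phrase_table = {
    ("i", "ate"): "j'ai mangé",
    ("a", "cake"): "un gâteau",
    ("the", "parliament"): "le parlement",
}

def translate(sentence):
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest chunk first, shrinking until a match is found.
        for j in range(len(words), i, -1):
            chunk = tuple(words[i:j])
            if chunk in phrase_table:
                out.append(phrase_table[chunk])
                i = j
                break
        else:
            out.append(f"<?{words[i]}?>")  # no entry in the table
            i += 1
    return " ".join(out)

print(translate("I ate a cake"))   # j'ai mangé un gâteau
print(translate("I ate a broom"))  # j'ai mangé <?a?> <?broom?>
```

As long as the input consists of known pieces, the output looks fluent; one unseen word, and the seams show, which is exactly the “step to the side” failure mode discussed below.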
Do not think that I am attacking statistical algorithms on principle. There are plenty of great ideas there. For example, I like the idea of analyzing treebanks, but more on that another time.
Testing harmony with algebra
And now I want to play a little game of “believe it or not”: what I believe, and what I do not.
I believe that much can be done by analyzing ready-made texts. I do not believe that a “royal road” exists in machine linguistics. Thirty years ago, it seemed that writing a chess-playing program was roughly equivalent to creating artificial intelligence. The current results, where a computer can beat any grandmaster, were met with mixed feelings. On the one hand, yes, a success; on the other, it is obvious that the algorithms have not advanced much. Computers simply became dramatically faster, making it possible to evaluate millions of combinations and store a vast library of ready-made games.
In linguistics a similar breakthrough can be made, but I am sure this approach has a theoretical ceiling. Whatever one may say, at the very least the creation of “portraits of objects” is necessary. After all, how can you translate “sibling” into Russian if you do not know whether it refers to a brother or a sister? You can fill an extensive database and make the computer translate Byron (by matching against well-known translations), but in essence it will be the same as Searle's Chinese room. As long as we recognize the input pieces, we translate; one step to the left or right, and we are lost. And machine translation is not the only goal. The goal may be, for example, understanding the text, whatever lies behind that term, such as extending a knowledge base about the world described in the text. (But that is already a conversation about the pragmatics of language, clearly not today's topic.)
That is why, in a sense, the approach of the same Google Translate leaves me with mixed feelings. On the one hand, thank you for a fast and convenient service. On the other hand, it seems to me that they have shifted the “center of gravity” too far toward statistics. I think that in a few years they will hit the ceiling, and then other methods will have to be sought. This is especially obvious for languages with free word order and rich morphology: here the translator simply goes crazy, because there are many translation options, it is difficult to gather unambiguous statistics, and there are also many different input phrases.
After all, it occurs to no one to write a statistical compiler for Pascal, although a great number of programs have been written in Pascal, too. That said, Google hires quite prominent computational linguists, so perhaps things are not so clear-cut with the algorithms they use.
So, this came out somewhat venomous and emotional :) But never mind; in the following parts we will return to a more productive conversation. Although, apparently, not everything has been said here. Well, all right, I will write a sequel if need be.