(The first parts: 1, 2, 3, 4, 5, 6.)
As promised yesterday, we continue the discussion of XDG and move on to the next topics. Perhaps we are moving too fast, and it would really make sense to publish one article every two or three days, so that there is time to discuss everything. But as long as there is "gasoline" left, I will probably keep writing, and later we can return to the questions raised earlier and discuss them. It seems to me that in computational linguistics the various topics are so closely intertwined that talking about one of them in isolation from the others is unproductive. And we have not yet talked about everything, so it is best to cover as many aspects of computer text analysis as possible, and then discuss the specifics within the framework of the overall picture.
More about XDG
In principle, we have already covered the most important features and capabilities of XDG. The rest falls into the category of "goodies." For example, you can specify a set of attributes and assign them not to a specific word but to a "class." A word is then simply declared an instance of that class and automatically receives the listed attributes.
One important point: the quantitative and qualitative questions of attaching dependent words are well thought out (this already came up in the example, but I want to mention it explicitly). For a verb, you can specify that it has exactly one subject, at most one object, and as many adjuncts as needed (where, when, why). For each word to be attached, its own set of agreement attributes is specified; say, the same verb agrees in person and number with its subject. And yes, once again: the same word can be described several times in different contexts. For example, the word for "dining room" (in Russian, "столовая") is both an adjective and a noun.
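To make the valency idea more concrete, here is a rough sketch in plain Python (not actual XDK syntax; the class and attribute names are my own, made up for illustration): a "finite verb" class demands exactly one subject, at most one object, any number of adjuncts, and agreement in person and number with its subject.

```python
# A toy model of XDG-style lexical classes: each class lists valency slots
# (how many dependents of each kind a word may take) and agreement features.
# Illustrative Python, not the real XDK description language.

FINITE_VERB = {
    "out": {                 # outgoing edges: (min, max) occurrences per label
        "subj": (1, 1),      # exactly one subject
        "obj":  (0, 1),      # at most one object
        "adv":  (0, None),   # any number of adjuncts (where, when, why)
    },
    "agree": ["person", "number"],   # features matched against the subject
}

# A concrete word is simply declared an instance of a class
# and adds its own feature values.
LEXICON = {
    "reads": {"class": FINITE_VERB, "person": 3, "number": "sg"},
    "read":  {"class": FINITE_VERB, "person": 3, "number": "pl"},
}

def valency_ok(label: str, count: int, cls: dict) -> bool:
    """Check that `count` dependents with this label fit the class limits."""
    lo, hi = cls["out"].get(label, (0, 0))
    return count >= lo and (hi is None or count <= hi)
```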
The output mechanism is also quite flexible. Trees can be drawn on the screen or written to a text or XML file. Principles (such as the valency principle or the tree principle) can be developed independently and added to the project library. To do this, of course, you will have to learn Mozart/Oz :)
So let us thank the author of this project, Ralph Debusmann, and not dig into XDG any further. A small tip for those who want to get to know XDG better: it is best to read not only the manual but also the author's dissertation. The manual is not very suitable as a textbook and is hard going; the dissertation, on the other hand, carefully introduces the reader to the ideas of XDG with simple examples.
Treebanks
Now a few words about such an important phenomenon as treebanks. Unfortunately, I still understand them poorly; I just haven't gotten around to them.
We have already talked about where grammar rules come from. Clearly, they can either be written by hand or "somehow" extracted from existing texts. Simple statistical approaches (which induce melancholy) just analyze plain text in its original form, relying on rather arbitrary "heuristic" criteria. For example: adjacent words are more likely to depend on each other than words that stand far apart. It seems to me you will not get far on such principles. For myself I use this criterion: could the ideas behind the proposed natural-language parser be used to write a parser for the incomparably simpler Pascal? If not, then, in my opinion, there is nothing to talk about.
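For illustration, here is roughly what such a "heuristic" approach boils down to, as a minimal toy sketch of my own (no real system works exactly like this): every word is simply attached to its nearest neighbour, which is exactly the kind of shortcut that would never get you through even a Pascal parser.

```python
# A deliberately naive "parser" built only on the proximity heuristic:
# every word depends on the word immediately before it; the first word is the root.
# This is the kind of baseline criticised above.

def naive_parse(sentence: str):
    words = sentence.split()
    # list of (dependent, head); None marks the root
    return [(w, words[i - 1] if i > 0 else None) for i, w in enumerate(words)]

print(naive_parse("my horse laughs and jumps"))
# [('my', None), ('horse', 'my'), ('laughs', 'horse'), ...]
```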
But the statistical method can be let loose on a specially prepared text, and that is a completely different matter.
A treebank is a collection of parsed sentences (that is, parse graphs) prepared by hand. In my opinion, it is exactly these banks that should be used to automatically extract rules for generating trees.
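As a minimal illustration (a toy format of my own, not the annotation scheme of any real treebank), one entry of a dependency treebank can be pictured as the sentence plus, for every word, the index of its head and the label of the relation:

```python
# One hand-annotated sentence from a hypothetical dependency treebank.
# Each token is (index, word, head_index, relation); head 0 means the root.
SENTENCE = [
    (1, "my",    2, "det"),    # "my" modifies "horse"
    (2, "horse", 3, "subj"),   # "horse" is the subject of "jumps"
    (3, "jumps", 0, "root"),   # main verb
    (4, "high",  3, "adv"),    # adverbial modifier of "jumps"
]
```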
Logically, treebanks are divided into phrase-structure treebanks and dependency treebanks. Probably the most famous example of the first kind is the Penn Treebank, whose annotation syntax is the de facto standard for "Chomskyan" treebanks. Dependency treebanks are a more recent phenomenon, and there are noticeably fewer of them; the Prague Dependency Treebank for Czech is probably mentioned most often. There are others, of course (look for projects with the word "dependency" in the title). For Russian, as usual, some work is "in progress," but nothing concrete can be said so far.
I have not dug deep into treebanks, because they are full of technical details: how the annotation is done, what the syntax is, what exactly is described and what is not, and so on. In general, there is a feeling that you could dig into the subject and never dig your way out. So before digging in, it would be better to know exactly what you are looking for, and I haven't decided that for myself yet.
What does interest me is that there have been attempts to automatically extract rules from a treebank in the form of an XDG grammar. True, this experience can hardly be called very successful. The author writes that the resulting rules are still too "loose," and the XDK allows far too many words to attach to each other. Maybe the treebank is just not big enough; it is hard to say. He also writes that there are no "probabilistic" rules in XDG: if something occurred in the treebank even once, it becomes a rule of the same standing as a rule derived from hundreds of examples. However, I do not think this is XDG's problem. If a rule is that rare, it is better not to put it into the grammar at all. There is also the complaint that there are a lot of rules and the grammar "swells." To this I repeat my old thesis: a grammar can be generated for each specific sentence.
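A minimal sketch of what "extracting rules from a treebank" could look like, including the frequency cut-off I argue for above (all names and the token layout are mine, reusing the toy treebank format from earlier): count how often each (head POS, relation, dependent POS) pattern occurs and keep only the patterns seen more than a handful of times.

```python
from collections import Counter

# A treebank here is a list of sentences; each token is
# (word, pos, head_index, relation), with head index 0 for the root.
def extract_rules(treebank, min_count=5):
    counts = Counter()
    for sentence in treebank:
        for word, pos, head, rel in sentence:
            if head == 0:
                continue
            head_pos = sentence[head - 1][1]        # POS of the governing word
            counts[(head_pos, rel, pos)] += 1
    # Drop patterns seen only a few times: a rule met once in the whole
    # treebank is probably noise, not grammar.
    return {rule: n for rule, n in counts.items() if n >= min_count}
```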
Perhaps the most difficult issue in this topic, for me, remains the following: suppose the rules have been extracted from the treebank, and they even allow us to get a good parse of the text. What next? Will such a parser help in further text analysis? This is what we will discuss now.
Pragmatics
Honestly, this topic crept up unexpectedly :) I wanted to talk first about my own ideas for using XDG, but it is better to postpone them until next time. And here there is just enough space for a bit of abstract philosophizing.
We have already talked about the syntax of a sentence. To understand the syntax is to be able to build a graph of syntactic dependencies between the words. At the next level of understanding is semantics, that is, understanding the meaning of the sentence. All sorts of things can hide behind "understanding," but for now let's leave the term as it is. The next level is pragmatics, which is responsible for how language is used.
When you write a computer program, almost all processing of the source text by the computer comes down, one way or another, to transforming one kind of text into another kind of text, that is, in essence, to machine translation. A program in a high-level language can be turned into C code; the C code can then be automatically optimized and turned into assembler text or straight into machine instructions. Text in, text out: that is what all compilers, translators, optimizers and the like do. But at the very end of this chain there is the stage for whose sake everything was started: the execution of the program by the processor. That is, a set of passive data (text) is transformed into a control sequence that triggers some active process. This is the point of a programming language: to control action.
With natural language, everything is much more complicated. If I write a Pascal compiler, I know the ultimate goal firmly: it is a machine translator that turns a Pascal program into a program in machine code, executed by the central processor. If I write a parser for natural language (and a parser is only a relatively early and simple module of a compiler!), I need to decide clearly for myself what will happen at the next steps. Even if the question of the semantics/meaning of the text is successfully solved, it is still unclear what comes after the semantics.
In practice, three applications of text processing are obvious. First: machine translation. Here the question of pragmatics does not even arise: the text is translated into another language, and let the reader figure out what to do with it :) Second: populating some kind of "knowledge base" or "fact base" with the information contained in the text. Third: a speech interface. Here, in essence, a natural language (or rather, a narrow subset of it) replaces a programming language: open a window, click a button, launch an application.
Surely you can come up with many other applications, from games and education to analytics. You can play "edible or inedible": a person types "I ate X," and the computer, depending on X, says whether it believes you or not. You can use a parser as part of a spell-checking or speech-recognition system: for example, during recognition we cannot tell whether the person said "my horse laughs and jumps" or "my horses laughs and jumps." The parser will immediately see that "my horses laughs" is an ill-formed construction and help the system choose the first option.
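The horse example can be sketched in a few lines (a toy check of my own, with hand-coded number features, nothing like a real recognizer): of two transcription hypotheses, prefer the one whose subject agrees in number with the verb.

```python
# Toy disambiguation of speech-recognition hypotheses by subject-verb agreement.
# Number features are hard-coded here just for the example.
NUMBER = {"horse": "sg", "horses": "pl", "laughs": "sg", "laugh": "pl"}

def agreement_ok(subject: str, verb: str) -> bool:
    return NUMBER.get(subject) == NUMBER.get(verb)

hypotheses = [("horse", "laughs"), ("horses", "laughs")]
best = [h for h in hypotheses if agreement_ok(*h)]
print(best)  # [('horse', 'laughs')]: the grammatical reading wins
```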
The thing is that the further usage scenario (of the language or the parser) leaves an imprint on the parser itself, strange as that may sound. For example, if we play "edible or inedible," we can set up the grammar (XDG, say) so that an attempt to process a construction like "I ate <something inedible>" results in a parsing error.
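As a toy sketch of that idea (the feature names are invented for illustration): the lexical entry for "ate" can demand that its object carry an "edible" feature, so "I ate a brick" simply fails to parse.

```python
# Selectional restriction encoded right in the lexicon: "ate" accepts
# only objects marked as edible. Feature values are made up for the example.
NOUN_FEATURES = {
    "apple": {"edible": True},
    "brick": {"edible": False},
}

def parse_ate(obj: str) -> bool:
    """Return True if 'I ate <obj>' is accepted by this toy grammar."""
    return NOUN_FEATURES.get(obj, {}).get("edible", False)

print(parse_ate("apple"))  # True
print(parse_ate("brick"))  # False: a parsing error in the game grammar
```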
A less contrived example is a machine translation system. The phrases "they set up camp" and "they crashed the car" (in the Russian original both use the same verb, "разбить") are parsed in exactly the same way: subject, verb, object. However, another problem arises: what does this verb mean here? The first one should be translated into English as "set up," the second as "crash." That is, syntactically we seem to have understood everything, but it did not help us much with the translation. At the same time, note that a parser which does not see the difference between "set up" and "crash" is still perfectly suited to checking the spelling of Russian: the same verb works for a garden, a tent, a car, a cup, a heart. Who cares how it comes out in English?
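The translation point can also be reduced to a small sketch (the word lists are mine and obviously simplistic): once the parser has identified the object of the verb, choosing between "set up," "pitch," "crash," and "break" becomes a lookup keyed on that object.

```python
# Choosing the English translation of the same source verb based on the
# object found by the parser. The table below is a made-up fragment.
TRANSLATION = {
    ("razbit", "camp"): "set up",
    ("razbit", "tent"): "pitch",
    ("razbit", "car"):  "crash",
    ("razbit", "cup"):  "break",
}

def translate_verb(verb: str, obj: str) -> str:
    # Fall back to a default sense if the (verb, object) pair is unknown.
    return TRANSLATION.get((verb, obj), "break")

print(translate_verb("razbit", "camp"))  # set up
print(translate_verb("razbit", "car"))   # crash
```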
In this I see a certain problem (though I may be mistaken) with automatically extracting rules from a treebank: all that work is concentrated on getting the correct parse tree, but what to do with that tree afterwards is not discussed. One could object that the parser's job is to parse, and the semantics should be handled by the next modules in the chain. However, if the next module (say, a semantic analyzer) turns out to be as complex as the parser itself, what is the use of such a parser? It seems to me that at the parsing stage it would be good to squeeze as much task-relevant information out of the text as possible.
There are attempts to assess the quality of a particular parser "objectively," but it usually comes down to checking whether the parser produces the right parse trees or not.
That is it for today. Tomorrow (well, or whenever the mood strikes :)) we will continue. Or maybe we will wrap things up in the next part :)