(The first parts: 1 2 3 4 5 6 7.) In this part I will talk about the syntactic-semantic analyzer as I see it. By the way, take a look at part 7: it did not make it to the main page, so I am not sure that everyone interested has seen it.
Beyond parsing
In the last part I touched on what seems to me an important question: what information, besides the tree structure, needs to be extracted from a sentence? In the sentence “he broke her mother’s cup,” knowing the structure of the phrase lets us see that the adjective “mother’s” modifies the word “cup”; therefore, in English it should be rendered as a possessive: “mother's”. However, structure alone is not enough to pick the right translation of the verb “broke” from almost a dozen alternatives.
In part, this problem can be solved by analyzing semantics, that is, by studying the meaning of the words along with the syntax of the phrase. Here I first of all want to pay tribute to my teacher, Vitaly Alekseevich Tuzov, whose graduate student I once was. My views on the subject were largely shaped by his ideas, and the basic thoughts I am going to present are either his own or at least known to me through him.
First, let us outline the range of tasks. Let's start with something relatively simple: word sense disambiguation, that is, determining the meaning of a word in a particular context at the level of the phrase (UPD: it seems that in today's part we will not get any further than this :)). Not every word can be understood unambiguously within a single sentence. "Down the street walked a girl with a kosa." (The Russian word "kosa" can mean a braid, a scythe, or a sand spit.) Which of these is meant, we simply cannot know. "I have a sibling": should "sibling" be translated as "brother" or as "sister"? That, too, cannot be determined from the local context.
However, many things are well within the reach of an analyzer that parses text at the level of phrases. For example, the translation of the word "broke" obviously depends on the object. If we know what was broken, we can translate the verb. (Tuzov liked this analogy: what is sin(x) equal to? The question can be answered if and only if x is known.)
The next steps more or less suggest themselves. We need to introduce an ontology (a hierarchy of concepts) and assign a “class” to each word. Then, depending on the class, we can draw certain conclusions about the translation or, more broadly, about the meaning of the word.
Returning to the example with the word "broke", the grammar of the language can include rules like these:
dishes CUP() cup
transport CAR() car
people ARMY() army
BREAK(subject, object: dishes) to break
BREAK(subject, object: transport) to crash
BREAK(subject, object: people) to defeat
Here “dishes”, “transport” and “people” are data types in the hierarchy. Such strict typing also makes it possible to describe set expressions in a fairly natural way:
property/bald BALD() bald
property WHITE() white
bird EAGLE(property) eagle
bird EAGLE(property/bald) bald eagle (and not a literal word-for-word rendering :))
Here, of course, we need not just types but a tree-like structure of types, so that we can specify the required level of detail of an object's type. In some cases an explicit subclass is required; in others the superclass is enough.
For example, in the rules above only the set expression gets the special translation “bald eagle”, while any other adjective leaves the default translation “eagle” unchanged (so “white eagle” simply stays “white eagle”).
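The rule selection described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the notation of any real analyzer: the names WORD_CLASS, Rule, is_subclass and translate are hypothetical, class paths are written as slash-separated strings, and "most specific rule wins" is implemented simply as "deepest matching class path".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mini-ontology: word -> class path, most general part first.
WORD_CLASS = {
    "cup": "dishes",
    "car": "transport",
    "army": "people",
    "bald": "property/bald",
    "white": "property",
}

@dataclass(frozen=True)
class Rule:
    head: str         # the word whose translation we are choosing
    arg_class: str    # required class (or superclass) of the dependent word
    translation: str

RULES = [
    Rule("BREAK", "dishes", "to break"),
    Rule("BREAK", "transport", "to crash"),
    Rule("BREAK", "people", "to defeat"),
    Rule("EAGLE", "property", "eagle"),
    Rule("EAGLE", "property/bald", "bald eagle"),
]

def is_subclass(path: str, ancestor: str) -> bool:
    """True if `path` equals `ancestor` or lies below it in the tree."""
    return path == ancestor or path.startswith(ancestor + "/")

def translate(head: str, dependent: str) -> Optional[str]:
    """Return the translation from the most specific matching rule, if any."""
    dep_class = WORD_CLASS.get(dependent)
    if dep_class is None:
        return None
    matches = [
        r for r in RULES
        if r.head == head and is_subclass(dep_class, r.arg_class)
    ]
    if not matches:
        return None
    # A deeper class path means a more specific rule.
    best = max(matches, key=lambda r: r.arg_class.count("/"))
    return best.translation
```

With these toy tables, translate("EAGLE", "bald") picks the rule for the subclass, while translate("EAGLE", "white") falls back to the superclass rule; a word of an unrelated class, as in translate("BREAK", "white"), matches nothing and yields None.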
The problem of the "girl with a kosa" can be solved in a similar way: if the "kosa" is blond, for example, it is hair (a braid), and if it is metal, it is an agricultural tool (a scythe). If it has no modifier at all or is, say, "black", the question remains open.
Hierarchy: dreams and reality
If we accept the proposed concept (and I see nothing in it that departs radically from common sense), we face the full-scale problem of categorizing everything. Is it really possible to build a reasonable tree of "all the objects in the world"? Here is what I think. To begin with, such work already exists (EuroWordNet, WordNet). However, I am rather cool toward a “universal catalog of everything”. On reflection, it becomes more or less clear that the hierarchy of objects for a particular syntactic-semantic analyzer depends on two closely related things: (a) the task at hand; (b) the world view of the analyzed language.
In its pure form, the “task” shifts the level of detail of the tree in one direction or the other. You can develop an analyzer with a deep hierarchy of vehicles, from rickshaws to rockets. You can, on the contrary, limit yourself to a single class “vehicles”. You can create an analyzer in which all objects are divided into “large”, “small” and “abstract”.
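One way to picture this "level of detail" knob: if class paths are stored as slash-separated strings, a coarser hierarchy for a simpler task can be obtained by cutting every path at a fixed depth. The sketch below is my own illustration; the function name coarsen and the sample paths are hypothetical.

```python
def coarsen(class_path: str, max_depth: int) -> str:
    """Keep only the first `max_depth` levels of a slash-separated class path."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    parts = class_path.split("/")
    return "/".join(parts[:max_depth])

# Hypothetical detailed ontology for a task that cares about vehicle kinds.
DETAILED = {
    "rickshaw": "vehicles/wheeled/rickshaw",
    "rocket": "vehicles/flying/rocket",
}

# The same words for a task where a single class "vehicles" is enough.
COARSE = {word: coarsen(path, 1) for word, path in DETAILED.items()}
```

After coarsening to depth 1, both words end up in the single class "vehicles"; with depth 2 the analyzer would still distinguish "vehicles/wheeled" from "vehicles/flying".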
In machine translation, the “task at hand” already overlaps heavily with the linguistic picture of the world. In English the word "broke" has about ten translations, so when creating a Russian-English translator you will have to write multiple descriptions for the word "break". If the target language has a less developed system of alternatives for “breaking”, fewer descriptions will be needed.
The picture of the source (analyzed) language matters too. For example, when an Englishman travels somewhere, it is always “to”: to travel to Moscow, to travel to England, to travel to Cyprus. In Russian there are at least “на” (“on”) places and “в” (“in”) places: one goes “в” Moscow and “в” Crimea, but “на” Cyprus and “на” Sakhalin. There is not much logic here. There are, of course, general rules, such as “on” for an island and “in” for a country or region. But we say “на Руси” (“on Rus”), and likewise “на Украине” (“on Ukraine”); not all Ukrainians like that, but let us try to keep the language out of politics.
Moreover, the picture of one language may, of course, differ markedly from the picture of another. Finnish also has “on” places and “in” places. They are chosen by Finnish's own criteria, which have nothing to do with the Russian ones (and are basically just as arbitrary as ours).
From all these observations I draw a comparatively optimistic conclusion: we do not need a “universal classification of everything” (which will always be debatable and is hardly achievable in practice); a classification within the scope of a given language is enough. On the one hand, this is bad: each language needs its own hierarchy. On the other hand, this is good: the linguistic picture of the world can certainly be studied from existing texts, and I think that building the required hierarchy is at least theoretically achievable.
One more question arises: is it really a hierarchy? Maybe it is a more complex structure, a kind of "multiple inheritance" in which objects belong to different branches of the hierarchy at the same time? Honestly, I do not know. For now I have settled on a hierarchy; I cannot give an example where a tree-like class system would not be enough. But I admit that such cases may exist.
This, by the way, is where one of XDG's problems surfaces: hierarchical types are not supported. I did find a workaround for myself, though: generate separate types for all possible superclasses. For example, suppose there is a branch “object/material/dishes”. Then, in XDG, the word “cup” of class “dishes” turns into three descriptions:
object CUP()
object/material CUP()
object/material/dishes CUP()
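The expansion above is mechanical, so it can be generated. Here is a minimal sketch, assuming class paths are slash-separated strings; the helper names superclass_chain and expand_entry are hypothetical, and the output format only mimics the informal notation of this article, not real XDG lexicon syntax.

```python
from typing import List

def superclass_chain(class_path: str) -> List[str]:
    """List every prefix of the path, from the root class down to the full path."""
    parts = class_path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]

def expand_entry(word: str, class_path: str) -> List[str]:
    """Emit one flat description per superclass, emulating hierarchical types."""
    return [f"{cls} {word}()" for cls in superclass_chain(class_path)]

if __name__ == "__main__":
    for line in expand_entry("CUP", "object/material/dishes"):
        print(line)
```

The price of this workaround is that the number of lexicon entries per word grows with the depth of the hierarchy, which is tolerable for a tree a few levels deep.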
That's all for now; let's stop here. In the next, probably final, part I am going to talk about the “semantic language” and speculate a little about where in our field it is worth digging and what the difficulties are in the “scientific and political” respect.