Continuing our visits to scientists' laboratories, we went to the company ABBYY and talked with Anatoly Starostin, head of the semantic analysis group and a lecturer in the Computational Linguistics department at MIPT. He told us about the work of his group, the directions of computational linguistics at ABBYY, and who onto-engineers are.

First we need to settle on terminology. Computational linguistics is a science about language on the one hand, and on the other hand about how to work with language (not always natural language) using computer methods. It originated at the junction of linguistics and computer science and examines natural and formal languages from different angles. At its heart is the concept of language, which can be viewed from different perspectives; it can, for example, be treated formally. There is also a related field, mathematical linguistics, which came into being before computer science: it is a branch of mathematics with its own theorems, proofs, and formal objects.
In computational linguistics it is important that there is always a practical, concrete problem at the center that needs to be solved: for example, automatic syntactic analysis, machine translation, or speech recognition.
What is syntactic analysis? There is a standard exercise for schoolchildren: determine the members of a sentence, which words are main and which are subordinate. Syntactic analysis is the same task performed automatically, without human intervention. The computer receives a string of characters at the input, which it must correctly interpret, break into words, link together, and turn into a syntax tree. The structure of a natural-language sentence is tree-like, and this is an established fact of ordinary, not only computational, linguistics.
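To make the task concrete, here is a minimal sketch of automatic syntactic analysis using the open-source library spaCy (chosen for illustration; it is not the technology discussed in this article). For each word it prints the word's syntactic head and the dependency relation; together these links encode the tree.

# A sketch assuming spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

# Each token points to its head; the set of these links forms the syntax tree.
for token in doc:
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")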

Homonymy as a problem
The main difficulty in syntactic analysis is the problem of homonymy, when two words are written the same way but have different meanings, and the machine must determine which meaning is intended.
Anatoly Starostin: “Homonymy also occurs at higher levels. For example, when we try to understand how words relate to each other, there is a typical example that is always cited, a Russian sentence usually rendered in English as "These types of steel are in stock." All kinds of homonymy are present in that sentence at once. On the one hand, it can be heard as saying that certain characters started eating at the warehouse, and on the other hand, that different types of metal are kept at the warehouse. Are these types of steel? Or did the types start to? This homonymy can be heard. And if you draw the syntax trees, in one case the predicate is the word "steel" (which in Russian coincides with a form of the verb "to become"), and in the other case the predicate is the word "is" (which coincides with the verb "to eat"). Accordingly, if you draw the two trees, they will be different. This is an example of syntactic homonymy.” The problem is how to automatically take the context of a sentence into account and resolve the homonymy. This is one of the directions of computational linguistics.
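The Russian example above loses its ambiguity in translation, but the same phenomenon is easy to reproduce with the classic English sentence "I saw the man with the telescope." A minimal sketch with NLTK and a toy grammar written for this illustration (not ABBYY's technology) yields two different trees for one string:

import nltk

# A toy grammar that deliberately allows the prepositional phrase to attach
# either to the verb phrase or to the noun phrase.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Pronoun -> 'I'
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
tokens = "I saw the man with the telescope".split()

# Two parses are printed: in one the telescope is the instrument of seeing,
# in the other it is something the man is carrying.
for tree in parser.parse(tokens):
    tree.pretty_print()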

History
Computational linguistics as a field originated in the middle of the 20th century. Its first task was a very ambitious one for those days: to create a program for the automatic translation of text. It should be remembered that the computers of the time were not powerful, and over time people came to realize just how difficult this task is.
At the same time, a field called artificial intelligence was popular: people wanted computers to solve intellectual problems. Although artificial intelligence is a broader area than computational linguistics, natural language processing was considered part of it.
A. S.: “At some point there was a boom in machine learning methods. It happened when people realized that, with the help of mathematical statistics and special algorithms, you can make a computer reproduce some kind of human intellectual behavior quite accurately. I can give an example of a task that is still very relevant today: the task of finding named entities in texts, when you need to find in the text all mentions of persons or all mentions of organizations. It turns out that this problem can be solved analytically (by writing rules and complex algorithms). But you can also solve it in another way: take a stack of texts and annotate by hand where the persons are and where the organizations are. After that, give it to the computer and say: "Look, in these texts the persons are here, and the organizations are here." With the help of machine learning methods, a computer can absorb this knowledge, and on other texts that it has not seen before, reproduce it with rather high accuracy. That is, it will take another text, which it has not read before, and by analogy with those texts it will guess where persons are mentioned and where organizations are. In doing so, it naturally takes advantage of features it has learned itself: capital letters, certain morphological forms, various cues that actually exist in the text. When we read, we understand that this is a person; in fact, persons are usually mentioned in particular contexts, and we are not even aware of it. A computer using machine learning can absorb these contexts and reproduce them. When people understood this, a great many applications of machine learning appeared. Machine learning methods today are an essential part of computational linguistics as applied to the problems of analyzing texts from various angles. A lot of different tasks are solved with the help of machine learning.”
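As an illustration of what such a trained model looks like from the outside, here is a minimal sketch using spaCy's pretrained English pipeline (an assumption of convenience; it is not the system described in the interview). The model has already absorbed contextual cues from annotated corpora and labels the spans it believes are persons or organizations:

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced that Apple will open a new office in Austin.")

# Print every named entity the statistical model found, with its label,
# e.g. "Tim Cook PERSON" and "Apple ORG".
for ent in doc.ents:
    print(ent.text, ent.label_)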
The profession of computational linguist
Several core professions come together in computational linguistics. One is the programmer. Programmers usually arrive without the necessary linguistic background and need to be trained. But linguists are also needed, because they are the carriers of knowledge about natural language. Linguists, entering the field of computational linguistics, must also acquire new knowledge and become more structured and formal in order to work in computer science.
A. S.: “Computational linguistics involves cooperation between linguists and programmers, and they move toward each other. Programmers who practice computational linguistics describe an object, so they have to understand it. Any programmer with us (at ABBYY), for example, understands what a syntax tree is, understands how words connect with each other, knows a lot about linguistics, and understands what gender, number, and case are. At ABBYY, dedicated levels of abstraction are built in their pure form. That is, for the linguists, formal languages and working environments are created that are close to their view of the world; they are close to language, naturally. And the linguists live inside these environments. At the same time, they still know perfectly well that the rule they write will be picked up by such-and-such an algorithm and used in such-and-such a way. Linguists have this understanding; without it they could not work. Training linguists with an eye on computational linguistics significantly affects the linguists themselves. More or less modern linguists today (if we are talking about people who study natural language and write theoretical papers about it) turned to ideas about computational methods long ago.”
Onto-engineers at ABBYY
The basic linguistic component, which is the foundation of the Compreno technology, has been developed at ABBYY over many years. It is a program that builds semantic-syntactic trees.
Using this basic layer, which turns any sentence in a natural language into a semantic-syntactic tree, it is possible to solve higher-level problems, in particular information extraction. The field goes by various abbreviations, mostly English ones, but they all have Russian analogues. Essentially it is a complex of tasks around the analysis of information: if there is a text at the input, it needs to be interpreted in a certain way. This is what Anatoly does in his group at ABBYY.
A. S.: “Speaking in a bit more detail, what does it mean to interpret a text and extract something from it in a certain way? Information extraction tasks are always set as follows. First, a domain model is described. That is, we always know what problem we are solving, and we fix it formally. This domain model is also called an ontology. We draw in advance what we are interested in: for example, persons, organizations, and facts of persons working in organizations. Or we are interested in locations and the facts of organizations being situated in locations. That is, we draw a conceptual schema of the subject area, and we look at the text through the prism of this conceptual schema. We do not need to extract all the information that is in the text (which would be a completely vague task, because any text contains a great deal of different information). We need to extract only the information that fits into the slots we have drawn in advance. That is how information extraction tasks are set.” Ontologies themselves vary widely and are usually thematic: there may be an ontology of medicine, of business, or of sport. An ontology is always described first, and then development begins. This is done by special people called onto-engineers.
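As a rough illustration of what such a conceptual schema might look like in code, here is a minimal sketch of a toy ontology for the persons-and-organizations example; the class and field names are hypothetical and are not ABBYY's actual formalism:

from dataclasses import dataclass

# Concepts of the domain model.
@dataclass
class Person:
    name: str

@dataclass
class Organization:
    name: str

# A fact (relation) connecting instances of the concepts above.
@dataclass
class Employment:
    person: Person
    organization: Organization

# Information extraction means populating this schema with instances found
# in text, ignoring everything the schema does not describe.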
Onto-engineers are a good example of the symbiosis of a linguist and a programmer. They are usually graduates of mathematical universities, because they have to conceptualize reality really well: break tasks into subtasks, understand where the entities are and how they are connected. On the other hand, they should have a good grasp of what semantic-syntactic trees are, that is, possess linguistic knowledge.
A. S.: “Onto-engineers sit down and write rules in a high-level language. At the input, this language receives semantic-syntactic trees, and at the output it generates a conceptual graph of the corresponding domain model. A simple example: your domain model has persons, organizations, and facts of employment, and you have the sentence "Vasya works at ABBYY". The program must extract the person "Vasya" (that is, a specific instance of the person concept), extract the organization ABBYY (a specific instance of the organization concept), and understand that these two instances are connected by the employment relation. This is a typical example of information extraction. The difficulty is that in natural language the same concept can be expressed in very different ways; there is always a huge variety of ways to say the same thing. You can say: "Vasya is an employee of ABBYY". You can say: "Vasya works at ABBYY". You can say: "Vasya was fired from ABBYY", and it will still mean that he was once an employee. You can say: "Vasya works part time at ABBYY". All these phrases must be understood and reduced to a common denominator. That is the information extraction task in a model form.” To summarize, computational linguistics becomes a tool for extracting information from places where it could not be extracted before.
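As a toy illustration of reducing different phrasings to a common denominator, here is a minimal sketch that maps several surface forms onto one employment relation. A real system of the kind described here operates over semantic-syntactic trees; plain regular expressions stand in for that machinery, and the patterns are invented for the example:

import re

# Each pattern is one of many possible surface forms of the same relation.
PATTERNS = [
    r"(?P<person>\w+) works (?:part time )?at (?P<org>\w+)",
    r"(?P<person>\w+) is an employee of (?P<org>\w+)",
    r"(?P<person>\w+) was fired from (?P<org>\w+)",  # still implies past employment
]

def extract_employment(sentence):
    """Return (person, organization) pairs implied by the sentence."""
    facts = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, sentence):
            facts.append((match.group("person"), match.group("org")))
    return facts

for s in ["Vasya works at ABBYY",
          "Vasya is an employee of ABBYY",
          "Vasya was fired from ABBYY",
          "Vasya works part time at ABBYY"]:
    print(s, "->", extract_employment(s))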
A. S.: “If someone has put data into a structured database, we can take it from there, because it is structured; it is only necessary to understand the format, that is, how it was laid out. But if something is written as text, it would seem that only a person can understand it. It turns out that with the help of such methods you can write programs that will understand it instead of a person. These are, roughly speaking, converters of unstructured information into structured information. This is what we create in my group at ABBYY.” The same material in video format can be found here.