For more than a decade, people have been struggling with the problems of automatic natural language processing and of getting machines to understand the meaning of text. The Russian company ABBYY has made some progress in this area: it has developed Compreno, a universal linguistic platform intended to handle many applied tasks at a qualitatively different level.
The idea of tackling one of the key problems of artificial intelligence, getting computers to understand human language, arose among ABBYY specialists fifteen years ago. It was then that, at the initiative of company founder David Yang, research began, followed by development and engineering work on a new-generation machine translation system. That work later grew into a separate project, Compreno (formerly called Natural Language Compiler), aimed at solving many problems related to natural language processing.
The seriousness of ABBYY’s intention to revolutionize computational linguistics is evidenced not only by the long-term work of more than three hundred of the company's employees, but also by the interest in the platform from the Development Fund of the Center for Development and Commercialization of New Technologies (Skolkovo Foundation), which selects the most promising projects for support. The financial side is no less convincing: the Skolkovo Foundation's total investment in Compreno is 475 million rubles, which is half of the project's financing. The other half (475 million rubles) is contributed by ABBYY itself. These figures underline the scope and scale of the project.
The sum of technologies
To understand the mechanisms underlying Compreno and the logic of how they work, one first has to understand the project's fundamental idea. Whatever language people speak, the concepts they express in words are far more similar than different. We all live in houses, use furniture and telephones, drive cars, go to work in offices, fly on airplanes, and so on. These concepts are shared, and the way we picture them does not depend on the language. Starting from this common thread, ABBYY built a language-independent universal semantic hierarchy of concepts.
The semantic hierarchy is a single tree shared by all languages. Its thick branches are the more general concepts (for example, “movement”), and its thin branches are more specific meanings, arranged from general to specific (“crawl”, “fly”, “walk”, “run”, and so on). If we are talking about the head of an organization, the concept “leader” stands at the top of the lexical class, and its subclasses hold more specific concepts such as “boss” and “chief”, along with other words and phrases that act as the leaves of the concept tree.
This tree structure lets descendants inherit properties from their ancestors and helps avoid ambiguity when translating sentences from one language to another. The developers illustrate this with the Russian word for “management”, which corresponds to several concepts on different branches of the universal semantic tree: it can be read as a department, or as an action. Because the semantic class “management” in the sense of an organization sits on one branch of the tree and “management” as an action sits on another, the system automatically selects the right word when translating into English, choosing “department” or “management” depending on the context of the phrase. As a result, the semantic descriptions at the core of Compreno make it possible to translate text from English or Russian into a universal language, and from the universal language into any other language described in the system.
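The inheritance and branch-based disambiguation described above can be sketched in a few lines. This is only an illustrative toy, not Compreno's actual data model: the class `Concept`, the concept names, the `LEXICON` table, its key `"management_ru"`, and the function `choose_concept` are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """A node in a toy language-independent concept tree (hypothetical)."""
    name: str
    parent: Optional["Concept"] = None
    properties: Dict[str, str] = field(default_factory=dict)

    def inherited(self) -> Dict[str, str]:
        """Merge properties from the root down; descendants override ancestors."""
        base = self.parent.inherited() if self.parent is not None else {}
        return {**base, **self.properties}

    def ancestors(self) -> List[str]:
        """Names of all ancestors, nearest first."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# Two branches of the toy tree: organizations and actions.
entity = Concept("ENTITY")
organization = Concept("ORGANIZATION", entity, {"kind": "thing"})
action = Concept("ACTION", entity, {"kind": "event"})
department = Concept("DEPARTMENT", organization, {"english": "department"})
managing = Concept("MANAGING", action, {"english": "management"})

# One surface word maps to concepts that live on different branches.
LEXICON = {"management_ru": [department, managing]}


def choose_concept(word: str, expected_branch: str) -> Optional[Concept]:
    """Pick the reading whose ancestors include the branch the context expects."""
    for concept in LEXICON[word]:
        if expected_branch in concept.ancestors():
            return concept
    return None


as_org = choose_concept("management_ru", "ORGANIZATION")
as_act = choose_concept("management_ru", "ACTION")
print(as_org.inherited()["english"], as_act.inherited()["english"])
```

Here the context is reduced to a single expected branch name; a real system would have to derive that expectation from the rest of the sentence.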
The second major block of the Compreno platform is syntax. Syntax describes how concepts are related to each other within one or more sentences. Languages encode these relations through parts of the sentence, agreement, word order, case, function words, conjunctions, prepositions, and much else. Figuratively speaking, syntax is a large construction kit made up of these elements.
Different languages can use different pieces of the kit. In English, for example, word order is an important part of syntax. Interrogative sentences are formed one way, declarative sentences another, and there are no other options. Some optional adverbials of time and place can be placed at the beginning of a sentence, but usually the subject comes first, the predicate second, and the remaining parts of the sentence follow. In Russian the situation is different. Word order is not fixed, but agreement matters a great deal, and it is perhaps the biggest stumbling block for people learning Russian.
Another thing that has to be taken into account when parsing text is ellipsis and reference: we leave a word out, yet still understand that it is there. A clear example is the phrase “The boy loves red apples, and the girl green ones.” It is obvious that the part about the girl is also about apples (and about the fact that she loves them), and we understand this perfectly even though a couple of words are missing from the text. There are other, more complex syntactic links that Compreno parses successfully. For example: “Although the boy wanted to play, he understood that he had little time.” Here the word “boy” is replaced by the pronoun “he” twice, and the machine has to understand that all three refer to the same object and restore the missing nodes.

The Compreno block responsible for syntax works out the roles of the various concepts in a sentence and links them to each other. The system analyzes the text and builds a tree of relations whose root is usually some action. Branching from it are the subject, the object, and other attributes that attach to either the subject or the object and convey the meaning carried by that particular sentence. To make parsing as accurate as possible, Compreno uses semantic analysis based on the universal hierarchy of concepts described above. Together this gives the machine a new degree of freedom in processing text: it can “understand” the meaning of the original sentence and then synthesize that meaning in another language.
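To make the idea of a tree of relations more concrete, here is a minimal hand-built sketch. It does not parse anything and is not how Compreno represents sentences internally; the `Node` class, the role names, and the `restore_ellipsis` helper are assumptions chosen for illustration. The tree for the clause about the boy and the red apples is written out by hand, and the elided clause about the girl is restored by copying the structure and swapping in the parts that the short clause actually states.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One concept in a toy relation tree, labeled with its role (hypothetical)."""
    concept: str
    role: str
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def child(self, role: str) -> Optional["Node"]:
        """First direct child with the given role, or None."""
        return next((c for c in self.children if c.role == role), None)


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten the tree into indented 'role: concept' lines."""
    lines = ["  " * depth + f"{node.role}: {node.concept}"]
    for c in node.children:
        lines.extend(render(c, depth + 1))
    return lines


def restore_ellipsis(full: Node, subject: str, attribute: str) -> Node:
    """Copy the complete clause and replace only what the elided clause states."""
    restored = copy.deepcopy(full)
    restored.child("subject").concept = subject
    restored.child("object").child("attribute").concept = attribute
    return restored


# "The boy loves red apples": the action is the root of the tree.
first = Node("LOVE", "action")
first.add(Node("BOY", "subject"))
apples = first.add(Node("APPLE", "object"))
apples.add(Node("RED", "attribute"))

# The second clause names only the girl and the color; the rest is restored.
second = restore_ellipsis(first, subject="GIRL", attribute="GREEN")

print("\n".join(render(first)))
print("\n".join(render(second)))
```

The deep copy keeps the first clause untouched, so both trees can be passed on independently to a later synthesis step.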
Finally, the third important component of the ABBYY linguistic platform is statistics. It helps the system combine words into phrases correctly and deal with homonymy, where the same word can mean different things (a typical example is the Russian word “замок”, which can mean either “castle” or “lock”). Statistical information is just as important for parsing sentences that allow more than one interpretation. For example, the Russian phrase rendered here as “These types of steel are in our workshop” can equally be read as “These characters started eating in our workshop”, and it can be analyzed correctly only by drawing on data about how frequently concepts are related to each other, which reveals the context, or in other words the subject matter. If the text is about metallurgy, the sentence is about steel; if it is about people's behavior, the sensible choice is the reading about some rather unpleasant characters.
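A rough sketch of how frequency data can settle such an ambiguity is shown below. It is a toy under stated assumptions, not Compreno's statistical model: the co-occurrence counts in `COOCCURRENCE` are invented, the reading labels `STEEL_GRADES` and `SHADY_CHARACTERS` are hypothetical, and the scoring (a sum of log counts with add-one smoothing) is just one simple choice.

```python
import math
from collections import Counter
from typing import List

# Invented counts: how often each reading was seen near each context word.
COOCCURRENCE = Counter({
    ("STEEL_GRADES", "furnace"): 55,
    ("STEEL_GRADES", "alloy"): 40,
    ("STEEL_GRADES", "lunch"): 1,
    ("SHADY_CHARACTERS", "lunch"): 30,
    ("SHADY_CHARACTERS", "behavior"): 25,
    ("SHADY_CHARACTERS", "furnace"): 2,
})


def score(reading: str, context: List[str]) -> float:
    """Sum of log counts; add-one smoothing keeps unseen pairs from breaking log()."""
    return sum(math.log(COOCCURRENCE[(reading, word)] + 1) for word in context)


def pick_reading(readings: List[str], context: List[str]) -> str:
    """Return the reading that co-occurs most strongly with the context words."""
    return max(readings, key=lambda r: score(r, context))


readings = ["STEEL_GRADES", "SHADY_CHARACTERS"]
metallurgy = pick_reading(readings, ["furnace", "alloy"])
people = pick_reading(readings, ["lunch", "behavior"])
print(metallurgy, people)
```

With these made-up counts, a metallurgical context selects the steel reading and a context about people's behavior selects the other one; the quality of the choice depends entirely on the quality of the text collection behind the counts.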
Compreno's statistical model is built on a large collection of texts on various subjects and in various genres, which the system processes almost daily. Moreover, these are not just any texts: they were written by people or translated from one language to another by people. This approach reduces the likelihood of errors in the system's decisions and of distortions when it synthesizes semantic structures.
What was the end result? Combining knowledge, imagination, ideas, and experience, ABBYY specialists built on three pillars (the semantic hierarchy of concepts, syntax, and statistics) a language-independent model of data about how the world is structured, along with a model for accessing that data. This brought computer understanding of the meaning of text as close as has so far been possible and opened the way to solving a wide range of linguistic tasks. Which ones?
Mind games
When discussing the practical value of the ABBYY Compreno platform, the developers focus first of all on two key tasks: automatic translation of texts for many language pairs, and intelligent information search.
The first task, translating text, is extremely important in the digital age, which erases formal boundaries and barriers between countries. As the volume of multilingual information keeps growing and modern projects involve more and more participants from different parts of the world, not only the speed of translation but also the quality of the resulting text becomes critical. In terms of quality, existing machine translation systems are not nearly as good as they may seem at first glance. This is due to fundamental limitations in the scientific approaches underlying many existing machine translators: an inability to handle exceptions correctly, the objective complexity of language constructions, disregard for semantics, failure to capture the real relations within a sentence, and other problems. Compreno technology is an engineering embodiment of fundamental linguistic research by many scientists around the world, drawing on roughly 50 years of accumulated experience. Thanks to this, Compreno can overcome the difficulties listed above and synthesize text whose meaning is the same as in the original language, or as close to it as possible. To let readers judge the system's capabilities, below is an example: a fragment of the article “Google's ‘Babel fish’ heralds future of translation” translated by a statistical translator and by the ABBYY platform. The results speak for themselves.
Source: It would be a hopeless task. The power of machine computation. We build statistical models that are automatically training themselves and learning all the time.
ABBYY Compreno: If we tried to manually give those languages to the system, this would be a hopeless task. The only possible way we could do this is to use the power of machine computing. We create statistical models that automatically learn and learn all the time.
Statistical translator: If we tried manually to give the system of these languages, it would be a hopeless task. The only possible way we could do this is to use the capabilities of the computing machine. We build statistical models that automatically train yourself and learn all the time.
The importance of the second task, intelligent search, follows from the enormous amount of information humanity generates, which grows exponentially and calls for different approaches to analyzing and finding the data one needs. Today search relies mainly on keywords: when looking for a document, we first think of words it should contain, then enter key phrases, receive results that match the search criteria, and finally pick out the information we want by hand. This familiar kind of search has several major flaws. First, it is far from always possible to formulate a query that accurately describes the information to be found. Second, by coming up with additional clarifying words we narrow the selection and limit the search. Finally, going through every combination of keywords is sometimes extremely tedious, if not impossible. ABBYY Compreno technologies address all of these shortcomings by performing a semantic search that uses the concepts and relations the machine has extracted from a query formulated in ordinary language.
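The difference between matching keywords and matching concepts can be illustrated with a small sketch. It is a deliberately simplified toy, not the Compreno search engine: the `WORD_TO_CONCEPT` table, the concept names, and the functions `concepts` and `search` are hypothetical, and relations between concepts are ignored entirely.

```python
import re
from typing import List, Set

# Hypothetical mapping from surface words to language-independent concepts.
WORD_TO_CONCEPT = {
    "boss": "LEADER", "chief": "LEADER", "leader": "LEADER",
    "car": "VEHICLE", "automobile": "VEHICLE",
    "bought": "PURCHASE", "purchased": "PURCHASE",
}


def concepts(text: str) -> Set[str]:
    """Map every known word in the text to its concept; unknown words are skipped."""
    words = re.findall(r"[a-z]+", text.lower())
    return {WORD_TO_CONCEPT[w] for w in words if w in WORD_TO_CONCEPT}


def search(query: str, documents: List[str]) -> List[str]:
    """Return documents that mention every concept found in the query."""
    wanted = concepts(query)
    if not wanted:
        return []
    return [doc for doc in documents if wanted <= concepts(doc)]


documents = [
    "The chief purchased an automobile.",
    "The boss went home early.",
    "The weather is nice today.",
]
hits = search("boss bought car", documents)
print(hits)
```

The first document shares no keyword with the query, yet it is the only one that covers all three concepts, which is the kind of match a purely keyword-based search would miss.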
The platform's “erudition” and the huge knowledge base concentrated in it make Compreno suitable for many other applied tasks. Companies can build on it to create qualitatively new solutions for multilingual search and data classification, extraction of facts and of connections between objects, monitoring, protection against unauthorized use of information, automatic summarization and annotation of documents, speech recognition, and many other tasks.
Another promising and interesting application area for Compreno is text visualization. A clear example is the creation of animated videos and films from text scripts. The company Bazelevs Innovations, which also takes an active part in the Skolkovo project, has already achieved some results in this direction by building a software package for interactive three-dimensional visualization of texts. ABBYY states, not without pride, that no other platform in the world today is universal enough to solve so many applied tasks that require high-quality linguistic analysis of text.
Plenty of plans
Today, as mentioned above, more than 300 specialists work on the project. Young talent is actively recruited: students of the ABBYY department at the Moscow Institute of Physics and Technology and graduates of the country's leading universities, including Moscow State University, RSUH, MGLU, St. Petersburg State University, and many others. The work is rooted in serious research in Russian and world linguistics, and ABBYY specialists draw on this scientific heritage. The company's plans include bringing in the world's leading experts in linguistics and giving the project international status.
ABBYY is currently running pilot projects to deploy Compreno-based software solutions. So far the project's initiators have not disclosed details about the products under development, but they assure that everyone will benefit from their adoption and widespread use, both software producers and consumers, that is, you and me.
It is too early to say how much the ambitious ABBYY Compreno project will change people's lives. It is safe to say, however, that in the near future computational linguistics will make significant progress in language modeling and will move to a completely new technological base, the foundation of which is being laid now.