
What do people do in the R&D department at ABBYY? To answer this question, we are starting a series of publications about how our developers create new technologies and improve existing solutions. Today we will talk about Natural Language Processing (NLP).
At ABBYY we do research in the field of natural language processing and take on complex scientific problems for which there are no ready-made solutions yet. This is how we create the innovations that form the basis of our products and help our customers, and us, move forward. By the way, on November 24, at a lecture at the School of Deep Learning at MIPT, the head of the NLP Advanced Research Group in ABBYY's R&D department, Ivan Smurov, will talk about what text analysis tasks exist in the world and how modern neural networks help solve them. In this post, Ivan tells us about three tasks he is working on right now.
For the NLP Advanced Research Group it is important to choose self-contained tasks, that is, tasks not too tightly bound to existing ABBYY technologies and solutions. Sometimes our researchers find such tasks themselves; sometimes other R&D teams tell us about them and ask for help with a solution and, later, with publishing the results in scientific journals. So, the first task.
Automatic summarization: no harder than a retelling?

This text analysis technique turns a text into a retelling or an abstract. People have long used summarization in this form. At ABBYY we try to apply summarization techniques in a broader sense: we try to solve problems that traditionally are not solved with summarization, for example, obtaining integral characteristics of a text or highlighting the events that occur in it.
Summarization can also simplify the traditional pipeline. For example, to extract the names of the parties to a contract from a document, many consecutive NLP tasks are traditionally solved, from named entity recognition to filtering the extracted facts. All of these tasks depend on each other, and, most importantly, each of them requires its own reference markup. And creating markup is one of the most expensive parts of machine learning.
With summarization you can do end-to-end fact extraction, that is, extraction without intermediate steps, subtasks and markup for each of them. And it will be as easy and fast as retelling the text, and perhaps cheaper.
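To give a feel for what treating extraction as summarization might look like, here is a minimal sketch with an off-the-shelf seq2seq summarizer from the Hugging Face transformers library. The model name and the contract snippet are placeholders, and this is not ABBYY's actual pipeline, only an illustration of the end-to-end idea.

```python
# A minimal sketch: a single seq2seq model maps the whole document to a short
# target text, with no separate NER, coreference or fact-filtering steps.
# Model and input are illustrative only; this is not ABBYY's pipeline.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

contract_text = (
    "This Services Agreement is entered into between Acme Corp, a Delaware "
    "corporation, and Globex Ltd, a company registered in England and Wales. "
    "Acme Corp agrees to provide consulting services to Globex Ltd..."
)

result = summarizer(contract_text, max_length=40, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```

In a real end-to-end setup the model would be fine-tuned so that the target sequence is not a generic summary but exactly the facts of interest, for example the list of contract parties.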
Syntactic parsing: in search of the ellipsis
Remember how at school we did syntactic analysis of a sentence: subject, predicate, object? In the linguistic sense, sentence analysis is more complex and detailed. Everything can be represented as dependencies, where the head is the predicate, a verb, and the subject, objects and so on depend on it. In modern programs this kind of sentence analysis is done by a syntactic parser. Usually a parser spends a considerable amount of time creating and discarding the syntactic zeros that appear due to ellipsis.

Let us give an example: Misha ate a pear, and Masha an apple. Both in speech and in writing we simply skip the verb "ate", and the meaning does not change for us. But for computational linguistics, detecting syntactic zeros is a hard problem. There are many types of ellipsis, and they can occur in different places in a sentence. As a result, the parser has to recheck many hypotheses: is there a zero here, or is what looks like a zero actually not one?
Such rechecking complicates and slows down the parser, and a lot of computing power is spent on it. Therefore we are devising new ways to find the places where syntactic zeros are likely to appear. This will reduce the time the parser needs to detect ellipsis.
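As a rough illustration of what a dependency parser produces, and of why the gapped conjunct is troublesome, here is a small sketch with spaCy (a general-purpose open-source parser, not ABBYY's). The second conjunct "Masha an apple" has no verb of its own, which is exactly the syntactic zero the parser has to hypothesize.

```python
# A rough illustration with spaCy (not ABBYY's parser): print the dependency
# relations of a gapped sentence. Note that "Masha an apple" contains no verb;
# the elided "ate" is the syntactic zero discussed above.
import spacy

nlp = spacy.load("en_core_web_sm")  # standard small English model
doc = nlp("Misha ate a pear, and Masha an apple.")

for token in doc:
    print(f"{token.text:<8} {token.dep_:<10} head: {token.head.text}")
```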
By the way, interest in ellipsis in computational linguistics has grown considerably this year. A research paper, "Sentences with Gapping: Parsing and Reconstructing Elided Predicates", by the leading contemporary computational linguists Sebastian Schuster, Joakim Nivre and Christopher Manning, has been published. So the study of ellipsis is a good task, and its solution can yield results both for the scientific community and for practical applications.
Resolving lexical ambiguity

What is a "stop"? It may be the place where a bus arrives, or the act of stopping a process, or a pause in speech. The word is one, but it has many meanings.
Many companies have thesauri in which these meanings are spelled out. It is convenient to automatically turn a sequence of words, word forms or tokens into a sequence of meanings or semantic classes. At ABBYY we are trying to build a standalone model that determines the meaning of a word with good quality and speed. If you resolve lexical ambiguity quickly, you can noticeably speed up downstream work, whether it is syntactic parsing or the extraction of named entities and facts.
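For a quick taste of word sense disambiguation, here is a sketch using the classic dictionary-based Lesk algorithm from NLTK. This baseline only illustrates the task itself; it is not the standalone model described in the post.

```python
# Word sense disambiguation with the classic Lesk algorithm from NLTK.
# This dictionary-based baseline only illustrates the task; it is not
# the model discussed in the post.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

sentence = "The bus pulled over at the next stop to let the passengers off."
tokens = nltk.word_tokenize(sentence)

# Lesk picks the WordNet sense whose gloss overlaps most with the context.
sense = lesk(tokens, "stop", pos="n")
print(sense, "-", sense.definition())
```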
And what do neural networks and the School of Deep Learning have to do with it?
All these tasks are solved using neural networks. Not that they cannot be solved without neural networks, but right now this is the most up-to-date method. Recurrent neural networks give the best results on NLP tasks. So this is not just an abstract fashionable phenomenon, but something that is used in practice to solve a variety of NLP problems.
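As a small, hedged illustration of the kind of recurrent network used for NLP sequence labeling, here is a minimal bidirectional LSTM tagger in PyTorch. The vocabulary size, dimensions and number of tags are arbitrary placeholders, not the configurations used at ABBYY.

```python
# A minimal bidirectional LSTM tagger in PyTorch, only to illustrate the kind
# of recurrent network used for NLP sequence labeling tasks. All sizes are
# arbitrary placeholders.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, num_tags=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> per-token tag scores
        x = self.embed(token_ids)
        h, _ = self.lstm(x)
        return self.out(h)

model = BiLSTMTagger()
dummy_batch = torch.randint(0, 10000, (2, 12))  # two "sentences" of 12 token ids
print(model(dummy_batch).shape)  # torch.Size([2, 12, 5])
```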
Ivan Smurov will say more about what other text analysis tasks exist and how modern neural networks are used to solve them in Russia and around the world at a lecture at the School of Deep Learning at MIPT. The lecture will take place this Saturday, November 24, at 5:00 pm, at Dmitrovskoye Highway 9.