
ABBYY Labs. Q&A Project: A Demonstration of Capabilities

Summary of past episodes:
ABBYY Labs? What is it?
The idea of student laboratories is very simple: we assemble a team of students who solve problems under the guidance of our specialists. At MIPT this takes place as part of the annual “Innovation Workshop” course. The goal is to let students work on problems that are closer to real ones than those in the usual curriculum, and at the same time to “immerse” them in the appropriate environment: development takes place inside a real IT company.
Past projects
Formula Recognition
Problem statement: Student laboratories ABBYY
Solution: ABBYY Labs - what's new?

Briefly about the project:
Task: Find the part of a previously loaded text that most fully answers a user's question asked in natural language.
Current state: You can try it hands-on!
Future: Unclear. Depends on the reaction of the audience and its willingness to pay.
Therefore: Don't walk past!

Below the cut: a link to a demo and, overall, a logical continuation of the previous part.


Principle of operation
First, the texts you plan to work with are loaded. Once they have been processed, you can ask questions about them in natural language and, importantly, get answers :). An API for this is already available. The developers think the service would suit sites with a lot of textual information, for example forums or medical and legal reference books. If Habr readers suggest a new area of application, the developers will be only too happy.
The language barrier is not a problem, for now within English and Russian (that is, you can ask a question in English about a Russian text and vice versa). In the future, the list of languages the engine handles natively will grow.

The developers insisted that I describe how the engine is organized. I understood a little less than half of their description (read: nothing), so I have put it under a spoiler.
Scary words: ellipsis, morphological description, tree, graph
  1. Text processing
    • The analysis of the text is received from Compreno as XML.
    • Each sentence of the text is a tree (in general a forest, if the sentence is complex). A node of such a tree is a word in the sentence (or a phrase, for example when the sentence contains an idiom). Each node also stores the morphological description of its word (case, number, gender, etc.). Two connected nodes of the tree form a word combination.
    • In the general case a sentence consists of several trees (for example, the two parts of a complex sentence), so a purely technical node that carries no information is created for each sentence, and the trees of that sentence are hung from it. These technical nodes, in turn, are hung from the root of the text. Thus any text yields a single parse tree.
    • Then the non-tree links are added (anaphora and ellipsis).
  2. A similar tree is built for the question.
  3. Then comes the search
    • We go through all sentences of the text and compare every node of the question with every node of the current sentence. The comparison uses their morphological and semantic descriptions, which makes it possible to match not only synonyms but also words with a similar meaning from different languages. The result of the comparison is a coefficient of content similarity for the nodes in the pair.
    • Next, anaphoric links are processed, and for some pairs (sentence node, question node) the coefficient is recalculated.
    • All question nodes are compared with all sentence nodes again, but now taking their children into account. As a result, each pair of nodes receives a coefficient of structural similarity (a number that characterizes how similar the subtrees are in structure).
    • At the next stage, entire subtrees are compared. The process is hard to describe in words, so we will try an analogy (not an entirely accurate one). Imagine two road networks shaped like trees. Choose a node in each network and place one of a pair of twins at each; the twins like to walk along the same (or very similar) roads. Then ask them to walk, adding up the total similarity as they go. By starting them from different pairs of nodes, you can find the pair from which the total similarity is greatest. This number is stored for the sentence as its weight.
    • The answers are the sentences with the highest weight.
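
The search steps above can be sketched roughly as follows. This is a toy Python illustration, not the project's code: the `Node` class, the `concept` label standing in for a language-independent meaning, the 1.0 / 0.8 coefficients and the greedy matching of children are all assumptions made for the example, and anaphora, ellipsis and morphological features are left out.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    """One word of a parsed sentence (hypothetical, simplified structure)."""
    lemma: str
    concept: str = ""  # assumed language-independent meaning label
    children: list = field(default_factory=list)

    def walk(self):
        """Yield this node and every node below it."""
        yield self
        for child in self.children:
            yield from child.walk()


def content_similarity(q, s):
    """Toy coefficient: 1.0 for the same lemma, 0.8 for the same concept."""
    if q.lemma and q.lemma == s.lemma:
        return 1.0
    if q.concept and q.concept == s.concept:
        return 0.8
    return 0.0


def subtree_similarity(q, s):
    """Similarity of the pair plus the best match found for each child of q."""
    own = content_similarity(q, s)
    if own == 0.0:
        return 0.0
    total = own
    for qc in q.children:
        total += max((subtree_similarity(qc, sc) for sc in s.children), default=0.0)
    return total


def sentence_weight(question, sentence):
    """Try every starting pair of nodes and keep the best total similarity."""
    return max(subtree_similarity(q, s)
               for q, s in product(question.walk(), sentence.walk()))


def best_answers(question, sentences):
    """Return the sentences whose weight is the highest."""
    weights = {text: sentence_weight(question, tree) for text, tree in sentences.items()}
    top = max(weights.values())
    return [text for text, w in weights.items() if w == top]


# "Who wrote the novel?" against three candidate sentences.
question = Node("write", "WRITE", [Node("who"), Node("novel", "NOVEL")])
sentences = {
    "Tolstoy wrote the novel.": Node("write", "WRITE", [Node("Tolstoy"), Node("novel", "NOVEL")]),
    "The novel is long.": Node("be", "BE", [Node("novel", "NOVEL"), Node("long")]),
    "Толстой написал роман.": Node("написать", "WRITE", [Node("Толстой"), Node("роман", "NOVEL")]),
}
print(best_answers(question, sentences))  # prints ['Tolstoy wrote the novel.']
```

In this sketch the Russian sentence still scores above the unrelated English one, because its nodes share concept labels with the question even though no lemma matches.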


Other important things
Speed. It is not lightning fast, and there is a reason: everything runs in the Amazon cloud on the cheapest instance (the one that is free for testing purposes). So please don't judge it by its speed!

Answer quality.
On the site you see the single answer that the system considers most relevant. However, as tends to happen in harsh reality, what the “computer” considers most relevant and what a person considers most relevant are not always the same thing. So API users will get several possible answers ranked by relevance, and the owner of the service is better placed to decide how to display them. You can already see them as XML via the link directly below the words "File with all the answers."
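
Returning several candidates ordered by relevance could look like the sketch below. The `(sentence, weight)` pairs and the `top_answers` helper are hypothetical; the actual format of the API response is not described here.

```python
import heapq


def top_answers(weighted, k=3):
    """Return the k highest-weight (sentence, weight) pairs, best first."""
    return heapq.nlargest(k, weighted, key=lambda pair: pair[1])


# Made-up candidates with made-up weights, for illustration only.
candidates = [
    ("The novel is long.", 1.0),
    ("Tolstoy wrote the novel.", 2.0),
    ("It was published later.", 0.5),
    ("He lived in Russia.", 0.0),
]
for sentence, weight in top_answers(candidates, k=2):
    print(f"{weight:.1f}  {sentence}")
```

The site would show only the first item, while an API client could take the whole list and decide how many to display.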

Opportunities for improvement. When the subject of the material being searched is known in advance, the service can be tuned to it, which will improve the relevance of the results. Speed, as already mentioned, can also be improved by moving to a more powerful cloud instance.

Most importantly. Here is the link!
Achtung! At the moment you can search for an answer in one of 3 texts (uploading new ones is disabled to avoid the Habr effect), among which

I would like to hear what Habr users think: where else could this service be used? And bring on the criticism :)

UPD: By the way, I just remembered that nafany121 is suffering without an invite and cannot even reply to comments. And he is, by the way, one of the developers of this thing. You see what I'm getting at, right? Thanks, HeadMatters

Source: https://habr.com/ru/post/161245/

