
Algorithm for extracting information in ABBYY Compreno. Part 1

Hi, Habr!

My name is Ilya Bulgakov, and I am a programmer in the information extraction department at ABBYY. In this series of two posts, I will reveal our main secret: how information extraction technology works in ABBYY Compreno.

Earlier, my colleague Danya Skorinkin (DSkorinkin) described the system from an engineer's point of view, covering the following topics:

This time we will dive deeper into ABBYY Compreno technology and talk about the architecture of the system as a whole, the basic principles of its operation, and the information extraction algorithm!



What is this about?


Recall the task.

We analyze natural-language texts using ABBYY Compreno technology. Our task is to extract information important to the customer, represented as entities, facts, and their attributes.

Ontology engineer Danya wrote a post on Habr in 2014



The information extraction system takes as input the parse trees produced by the semantic-syntactic parser, which look like this:



After full semantic-syntactic analysis, the system needs to know what should be extracted from the text. This requires a domain model (ontology) and information extraction rules. Ontologies and rules are created by a special department of computational linguists, whom we call ontology engineers. Here is an example of an ontology that models the fact of publication:



The system applies the rules to different parts of the parse tree: if a fragment matches a rule's pattern, the rule generates assertions (for example, create a Person object, add a Name attribute, etc.). These are added to the “bag of statements” if they do not contradict the statements already in it. When no more rules can be applied, the system builds an RDF graph (the output format of the extracted information) from the statements in the bag.
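As a rough illustration, the apply-until-fixed-point loop described above can be sketched in Python. Everything here — the statement triples, the `consistent` check, rules as generator functions — is an assumption of this sketch, not ABBYY's actual implementation:

```python
# A toy fixed-point loop over extraction rules. A rule is a function
# that, given a tree node and the current bag, yields statement
# triples (subject, attribute, value). A statement enters the bag
# only if it does not contradict what is already there.

def consistent(statement, bag):
    # Toy consistency check: one (subject, attribute) pair may not
    # carry two different values.
    subj, attr, value = statement
    return all(not (s == subj and a == attr and v != value)
               for (s, a, v) in bag)

def extract(tree_nodes, rules):
    bag = set()  # the "bag of statements"
    changed = True
    while changed:  # stop when no rule can add anything new
        changed = False
        for rule in rules:
            for node in tree_nodes:
                for statement in rule(node, bag):
                    if statement not in bag and consistent(statement, bag):
                        bag.add(statement)
                        changed = True
    return bag  # in the real system, serialized into an RDF graph
```

A rule that creates a name statement for every node of a given semantic class is then just a small generator function, and running `extract` over a tree repeatedly fires rules until the bag stops growing.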

The system's complexity comes from the fact that patterns are built over semantic-syntactic trees, the variety of possible statements is wide, rules can be written almost without worrying about the order of their application, the output RDF graph must conform to a specific ontology, and many other details. But let's take it step by step.

Information retrieval system


The system's operation can be divided into two stages:

  1. Preparation of ontologies and compilation of models
  2. Text analysis:
    • Semantic and syntactic analysis of texts in natural language
    • Information retrieval and generation of the final RDF-graph




Preparation of ontological data and compilation of the model


Ontological data is prepared by ontology engineers in a special environment. Besides designing ontologies, ontology engineers create the information extraction rules. We covered the rule-writing process in detail in the previous article.

Rules and ontologies are stored in a special repository of ontological information, from which they are fed to a compiler that builds a binary domain model from them.

The model includes:


The compiled model is fed as input to the information extraction “engine”.

Semantic-syntactic analysis of texts


In the very depths of ABBYY Compreno technology lies the semantic-syntactic parser. It deserves a separate article of its own; today we will discuss only the features most important for our task. If you wish, you can read the paper from the Dialog conference.

What we need to know about the parser:

The parsed sentences are fed into the information extraction “engine”.

A word about information objects


Internally, our system works not with an RDF graph but with an internal representation of the extracted information: a set of information objects, where each object represents some entity or fact together with the set of statements associated with it.

Information objects are created by the system using rules written by ontology engineers. Already-created objects can themselves be used in rules to extract other objects.

The following operations can be performed on objects:


The first four points are intuitive, and we already talked about them in the previous article. Let us dwell on the last one.

The “anchor” mechanism occupies a very important place in the system. In general, a single information object can be connected by “anchors” to a whole set of nodes of the semantic-syntactic trees. Anchor binding allows the rules to refer back to objects later.

Consider an example.
Ontology engineer Danya Skorinkin wrote a good post



The rule below creates a person “Danya Skorinkin” and connects it to two components.

name "PERSON_BY_FIRSTNAME" [ surname "PERSON_BY_LASTNAME" ] => Person P(name), Anchor(P, surname);

The first part of the rule (before the => sign) is a pattern over the parse tree. The pattern involves two components with the semantic classes “PERSON_BY_FIRSTNAME” and “PERSON_BY_LASTNAME”, matched to two variables, name and surname. In the second part of the rule, the first line creates a person P on the component matched to the variable name; this component is connected to the object by an “anchor” automatically. In the second line, Anchor(P, surname), we explicitly anchor the object to the second component, the one matched to the variable surname.

The result is a person information object connected to two components.



This gives the pattern part of the rules a fundamentally new capability: checking that an information object is already attached to a specific place in the tree.

 name "PERSON_BY_LASTNAME" <% Person %> => This.o.surname == Norm(name); 

This rule will fire only if an object of the Person class has already been attached to the component with the semantic class “PERSON_BY_LASTNAME”.
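To make the interplay of the two rules more concrete, here is a minimal Python model of objects bound to nodes via anchors. It is a sketch under assumed names (`InfoObject`, `objects_at`, a toy `capitalize` standing in for the real `Norm`), not the actual rule engine:

```python
# Toy model of information objects and anchors. Nodes are plain
# dicts; an object keeps a list of the nodes it is anchored to.

class InfoObject:
    def __init__(self, cls):
        self.cls = cls        # ontology class, e.g. "Person"
        self.attrs = {}       # attribute statements
        self.anchors = []     # tree nodes the object is bound to

def anchor(obj, node):
    if node not in obj.anchors:
        obj.anchors.append(node)

def objects_at(node, objects, cls):
    # The "<% Person %>" check: objects of class `cls` anchored here.
    return [o for o in objects if o.cls == cls and node in o.anchors]

# First rule: create a Person on the name node and explicitly
# anchor it to the surname node as well (Anchor(P, surname)).
def make_person(name_node, surname_node, objects):
    p = InfoObject("Person")
    anchor(p, name_node)     # automatic anchor on the matched node
    anchor(p, surname_node)  # the explicit Anchor(P, surname)
    objects.append(p)
    return p

# Second rule: fires only if a Person is already anchored to the
# surname node, then fills in the normalized surname.
def add_surname(surname_node, objects):
    for p in objects_at(surname_node, objects, "Person"):
        p.attrs["surname"] = surname_node["text"].capitalize()
```

Running `make_person` first and `add_surname` second reproduces the behavior described above: the second rule sees the Person only because the first rule left an anchor on the surname node.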

Why is this technique important to us?


The concept of the anchor mechanism is close to the notion of reference, but it does not fully correspond to the model adopted by linguists. On the one hand, anchors often mark different mentions of the same extracted entity. On the other hand, in practice this is not always the case, and anchor placement is sometimes used simply as a technical tool for convenient rule writing.

Anchor placement in the system is a fairly flexible mechanism that makes it possible to take non-tree (coreferential) links into account. A special construction in the rules makes it possible to anchor an object not only to the selected component, but also to the components connected to it by coreferential non-tree links.

This feature is very important for increasing the recall of fact extraction: extracted information objects are automatically associated with all nodes that the parser considered coreferential, after which the fact-extraction rules begin to “see” them in new contexts.

Below is an example of coreference. We analyze the text “The questions of love and death did not worry Ippolit Matveyevich Vorobyaninov, although, by the nature of his service, he was in charge of these questions from 9 am to 5 pm daily, with a half-hour break for breakfast.”

The parser restores the IPPOLIT semantic class for the “he” node. The nodes are connected by a coreferential non-tree link (indicated by a purple arrow).



The following construction in the rules anchors an object P not only to the node matched to the variable this, but also to the nodes connected to it by a coreference relation (i.e., reachable along the purple arrows).

// Anchors object P not only to the node matched to the
// variable this, but also to the nodes connected to it
// by the Coreferential link (i.e., along the purple arrows).
anchor( P, this, Coreferential )
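The effect of this construction can be sketched in Python: anchoring at a node also anchors at everything reachable over coreferential links. The edge-list representation and the function name are assumptions of this sketch:

```python
# Toy propagation of anchors along coreferential (non-tree) links.
# `coref_edges` is an undirected list of node pairs — the "purple
# arrows" between coreferential nodes.

def anchor_with_coref(obj_anchors, node, coref_edges):
    """Anchor an object at `node` and at every node connected to it
    by a chain of Coreferential links."""
    stack = [node]
    while stack:
        n = stack.pop()
        if n in obj_anchors:
            continue
        obj_anchors.add(n)
        for a, b in coref_edges:  # follow links in both directions
            if a == n:
                stack.append(b)
            elif b == n:
                stack.append(a)
    return obj_anchors
```

After such propagation, rules that pattern-match on the pronoun node “see” the already-extracted person, which is exactly how recall improves in the Vorobyaninov example above.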

This concludes the first part. In it, we talked about the general system architecture and examined the input data of the information extraction algorithm (parses, ontologies, rules).

In the next post, which will be released tomorrow, we will get straight into how the information extraction “engine” works and what ideas underlie it.

Thank you for your attention, stay tuned!

Update: second part

Source: https://habr.com/ru/post/269191/

