
Implementing a semantic news aggregator with extensive search capabilities

The purpose of this article is to share our experience and ideas from a project built around the complete transformation of texts into a semantic representation and the organization of semantic search over the resulting knowledge base. We will discuss the basic principles of how this system works, the technologies used, and the problems that arose during its implementation.

Why is this needed?


Ideally, a semantic system “understands” the content of the articles it processes as a system of semantic concepts and distinguishes the main ones among them (what the text is about). This opens up tremendous opportunities for more accurate clustering, automatic summarization, and semantic search, where the system searches not by the words of the query but by the meaning behind those words.

Semantic search is not only answering by the meaning of the phrase typed into the search box; it is, more generally, the way the user interacts with the system. A semantic query can be not only a simple concept or phrase but also a whole document, in which case the system returns semantically related documents. The user's interest profile is also a semantic query and can run in “background” mode in parallel with other queries.
In general, the response to a semantic query consists of several components.


A news aggregator is the most convenient information application for testing this semantic approach: a working system can be built with a relatively small volume of processed text and a relatively high acceptable level of processing errors.

Ontology


When choosing an ontology, the main criterion was its convenience both for building the semantic text parser and for organizing search efficiently. To simplify the system, we assumed that some of the information contained in the text, presumed not to be very important for search tasks (auxiliary information), could be skipped or processed with a large allowable error rate.

In our ontology, simple semantic concepts (objects) can be divided into the following classes:

  1. Material objects, people, organizations, non-material objects (for example, films), geographical objects, etc.
  2. Actions, indicators ("sell", "inflation", "make").
  3. Characteristics ("big", "blue"), let's call them attributes.
  4. Periods of time, numeric information.

The backbone of the information contained in a text consists of “nodes” formed by semantic combinations of concepts of the second class (actions) with concepts of the first class. Objects of various types fill the free valencies (roles): for example, a price, but for which product? where? from which seller? We can say that objects of the first class refine and specify the actions and indicators (a price becomes the price of oil). Not only action objects but also first-class objects (“Russian companies”) can act as node-forming objects. This approach is similar to the frames widely known in Western computational linguistics (FrameNet).

Nodes can be nested within one another: one node fills an empty role in another. As a result, the text is converted into a system of nested nodes.
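The nested-node structure described above can be sketched with a minimal data model. All names, types, and roles here are hypothetical illustrations, not the system's actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class SemObject:
    """A simple semantic concept (any of the ontology classes above)."""
    name: str
    obj_type: str

@dataclass
class Node:
    """A node: a node-forming object plus roles filled either by
    simple objects or by other, nested nodes."""
    head: SemObject
    roles: Dict[str, Union[SemObject, "Node"]] = field(default_factory=dict)

# "Oil prices fell in Russia": the "fall" node's theme role is itself
# a "price" node, illustrating how nodes nest.
price = Node(SemObject("price", "indicator"),
             roles={"product": SemObject("oil", "commodity")})
fell = Node(SemObject("fall", "action"),
            roles={"theme": price, "location": SemObject("Russia", "country")})
```

Auxiliary information (attributes, numbers, time periods) would be linked to such nodes externally rather than stored inside them.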

Characteristics applied to semantic concepts of the first and second classes can, as a rule, be considered “secondary” information with respect to search tasks. For example, in the expressions “low oil prices remain” and “stable supplies of oil to Europe”, the italicized attributes are less significant than the other objects. Such information is not included in the nodes but is linked to them at a specific place in the document. Numeric information and time periods are attached to nodes in the same way.

The figure below illustrates the semantic transformation of two simple phrases. The colored rectangles are the elements of the node templates, and the rectangles above them are the elements of the node constructed from this template.


With this approach, we have two kinds of information: the nodes themselves, and the auxiliary information (attributes, numeric data, and time periods) attached to them.


We make this separation to simplify and speed up search: as a rule, we first look for the nodes relevant to the query and then filter the results by auxiliary parameters.

Converting text into a semantic representation


The main task of the semantic transformation of a text is to structure the objects contained in it into a set of suitable nodes. To do this, we use a system of node templates, in which a condition on the acceptable object type is established for each element. Types form a tree graph: when a certain object type is specified in a node template for a given role, all objects of that type or of “subordinate” types can fill the role.

For example, in the “trading operations” node, the active object (seller or buyer) may be of the “person or organization” type, as well as of any of its subordinate types (companies, shops, cultural institutions, etc.). Node templates also include syntactic restrictions. Unlike most other systems for semantic text analysis, we do not perform a preliminary syntactic analysis that produces a network of syntactic dependencies; instead, we apply syntactic restrictions in parallel with the semantic analysis.
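The type check behind this role-filling rule can be sketched as a walk up the type tree. The tree fragment below is a hypothetical illustration:

```python
# Hypothetical fragment of the type tree: each type maps to its parent.
TYPE_PARENT = {
    "shop": "organization",
    "company": "organization",
    "organization": "person_or_org",
    "person": "person_or_org",
}

def fits_role(obj_type: str, restriction: str) -> bool:
    """An object fits a role if its type equals the restriction set in
    the node template, or is a descendant of it in the type tree."""
    t = obj_type
    while t is not None:
        if t == restriction:
            return True
        t = TYPE_PARENT.get(t)
    return False
```

So a “shop” object satisfies a “person or organization” restriction, while the reverse check fails.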


Let us briefly explain the main stages.

First, simple objects are identified, namely those denoted by individual words or known phrases. Next, combinations of first and last names are recognized as references to people, and an algorithm runs that analyzes individual words and word sequences which may be objects unknown to the system.

At the second stage, we form nodes based on objects of class 1 together with the objects that refine them. Phrases like “the general director of the Moscow trading company Horns and Hooves” are rolled up into a single object. The supplementary information contained in such nodes (“Moscow” as an indication of location and “trading” as an indication of industry in this example) can be added to the semantic link graph for the company in question. We will look at the semantic link graph in the next chapter.

Then the text is structured as a sequence of independent fragments, each of which usually contains a specific phrase built around a verb and ideally should be folded into one node, which may include other nodes. We process participial constructions and similar structures, and enumerations of class 1 objects, including already formed nodes, are turned into special objects.

After that, for each fragment we search for suitable nodes based on objects of class 2. If several nodes were formed for a single node-forming object, we keep those that include the maximum number of objects from the fragment. Thus, based on the types of the surrounding objects, a transition occurs from semantically broad objects like “go” to a node with a clear semantic meaning. If several parallel objects arose in place of homonyms during initial processing, after this stage only the objects that entered nodes remain (that is, those that semantically agree with their neighboring objects).
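The "keep the widest candidate" rule can be sketched in a few lines. The data shapes here are assumptions for illustration:

```python
def select_nodes(candidates):
    """candidates: (node_forming_head, set_of_covered_objects) pairs
    produced for one fragment from different node templates.
    For each head, keep the candidate covering the most objects."""
    best = {}
    for head, covered in candidates:
        if head not in best or len(covered) > len(best[head]):
            best[head] = covered
    return best

# Two candidate nodes were built for the broad verb "go"; the one
# covering more fragment objects wins.
chosen = select_nodes([("go", {"train"}), ("go", {"train", "station"})])
```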

The last block of the transformation into a semantic representation handles objects that are distant in the text from their node-forming objects but are implied by the meaning. For example: “It's warm in Moscow, it's raining. It will be cold tomorrow and it will snow.” Semantic analysis of the second sentence leaves the role of the geographical object vacant, and by a number of signs one can determine that “Moscow” fits it.

When the nodes are fully formed, we attach attributes, numeric information, and time periods to them. A typical situation is that a time period is mentioned in only one place in the text but refers to several nodes throughout it. A special algorithm is needed to “distribute” periods over all nodes, assigning the missing period to each node based on its semantic meaning.
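As a toy stand-in for that distribution algorithm, one could borrow the nearest mentioned period by text position (the real algorithm, per the text, relies on semantic meaning rather than position alone):

```python
def attach_periods(nodes, period_mentions):
    """nodes: list of {'pos': int, 'period': str or None} dicts;
    period_mentions: (position, period) pairs found in the text.
    A node without its own period borrows the nearest mention:
    a crude positional stand-in for the meaning-based distribution."""
    for node in nodes:
        if node["period"] is None and period_mentions:
            node["period"] = min(
                period_mentions,
                key=lambda m: abs(m[0] - node["pos"]))[1]
    return nodes

nodes = attach_periods(
    [{"pos": 7, "period": None}, {"pos": 21, "period": None}],
    [(0, "yesterday"), (20, "tomorrow")])
```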

Finally, in each document we determine the main objects (what the document is about). In addition to the number of occurrences, the participation of objects in nodes of different types is taken into account.

Having rich semantic information, one can construct a fairly accurate measure of the semantic proximity of documents. Documents are clustered together when their semantic proximity exceeds a certain threshold. We form semantic profiles of clusters (the main objects of a cluster, which are what users usually search for) and a network of semantic relations between clusters, which makes it possible to display a “cloud” of documents related in meaning to a given document.

How semantic search works


The semantic search algorithm consists of the following main blocks: converting the query into a semantic representation, expanding it with semantically related objects and nodes, searching the object profiles of clusters and documents, filtering by attributes, and ranking the results.

First, a text query needs to be converted into a semantic representation. The differences from the document-processing algorithm described above are dictated primarily by the need to execute search queries very quickly. Therefore, we do not form full nodes; instead, we select one or several blocks, each consisting of a potentially node-forming object and a number of objects which, based on their type and position in the query, can relate to it.

In this case, several parallel combinations can be formed; in one of them, the next stage must expand combinations of the “Moscow companies” type into a list of specific objects via the knowledge base, while in another it need not.


The next stage is the search for semantically related objects and nodes. For single objects of class 1, this is a selection of semantically related objects. For a combination of “action + objects”, we search for nodes whose node-forming object has the same or a subordinate type, and which contain objects identical or semantically related to the query objects. At this stage, combinations such as “Moscow companies” or “European countries” are also expanded into lists of specific objects.

This stage uses a tree graph of semantic links between objects. The principle of its construction is simple: the “subordinate” objects that should be taken into account when searching for a given object are attached to it. For example, cities are subordinated to states, politicians are also subordinated to states, companies are subordinated to countries or cities, and company leaders are subordinated to companies. For material objects, this graph is built from more general concepts to particular ones and partially coincides with the type graph.

For some objects, the number of “subordinates” can be very large, and the most significant ones must be chosen. To do this, a numerical coefficient of semantic connection is established between the elements of the graph, calculated from the significance of the objects. Significance is determined differently for different object types: for companies, from economic indicators (turnover) or the number of employees; for geographic objects, from population.
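The subordination graph with significance coefficients can be sketched as follows. The graph fragment, the objects in it, and the threshold values are hypothetical:

```python
# Hypothetical fragment of the subordination graph; each link carries a
# semantic-connection coefficient (derived from population, turnover,
# etc., normalized here by hand).
SUBORDINATES = {
    "Russia": [("Moscow", 0.9), ("Gazprom", 0.8), ("Tver", 0.2)],
    "Gazprom": [("Gazprom CEO", 0.7)],
}

def expand(obj, min_weight=0.5, depth=2):
    """Recursively collect the most significant subordinate objects,
    keeping only links above the significance threshold."""
    found = []
    if depth == 0:
        return found
    for sub, weight in SUBORDINATES.get(obj, []):
        if weight >= min_weight:
            found.append(sub)
            found.extend(expand(sub, min_weight, depth - 1))
    return found
```

Expanding “Russia” this way pulls in its major city and company (and the company's leader), while the low-significance link is dropped.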

Next, we look up the simple objects and nodes produced by the previous stage in the object profiles of clusters. If few clusters are found, the search continues in the object profiles of individual documents.

If the search query contains attribute objects (characteristics), the found documents are additionally filtered by the presence of the required attributes attached to the found nodes. If the query contains lexemes for which the database has no mapping to semantic objects, the semantic search is supplemented with a regular text search over those lexemes.
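The attribute-filtering step reduces to a set-containment check; the data shapes are assumptions for illustration:

```python
def filter_by_attributes(hits, required):
    """hits: (doc_id, attributes_attached_to_matched_nodes) pairs;
    keep only documents whose matched nodes carry every attribute
    requested in the query."""
    return [doc for doc, attrs in hits if required <= set(attrs)]

# Query "low oil prices": only documents whose matched "price" nodes
# carry the attribute "low" survive the filter.
kept = filter_by_attributes(
    [("doc1", {"low"}), ("doc2", set()), ("doc3", {"low", "stable"})],
    {"low"})
```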

Finally, we rank the found clusters and documents and form snippets and other output elements (links to related objects, etc.). Ranking is usually based on the degree of semantic connection between the query objects and the objects through which the documents were found. A semantic profile of the user's interests can also be taken into account in ranking.

Before executing a complex query, it is necessary to analyze the processing cost of its various components and to order its execution so that fewer intermediate objects or documents appear during processing. Therefore, the processing order may not always correspond to the one described above. Sometimes it is beneficial to first find documents based on one part of the query, and then filter the objects they contain against the rest of the query.
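At its simplest, this cost-based ordering is a sort by expected intermediate-result size; the component names and estimates below are invented for illustration:

```python
def plan(components, estimated_hits):
    """Order query components so the most selective one (the fewest
    expected intermediate objects/documents) runs first; the remaining
    components then act as filters over an already small result set."""
    return sorted(components, key=lambda c: estimated_hits[c])

# A broad component like "economy" should run as a filter after a
# narrow one like a specific company name.
order = plan(["economy", "Gazprom"],
             {"economy": 100_000, "Gazprom": 300})
```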

A separate algorithm is required for “broad” queries such as “economy”, “politics”, or “Russia”, which are characterized by a very large number of related objects and relevant documents.

For example, a very large number of objects of different kinds are associated with the object “politics”.


In this case, the search is conducted over a relatively small number of highly significant relevant clusters, which we rank by the number of fresh documents in each cluster.

The main problems of implementing this approach, and their solutions


Problem 1. The system should "know" all the objects that are found in the texts.

There are several possible solutions.


Problem 2. How to create templates for all possible semantic nodes.

English-language SRL (Semantic Role Labeling) systems solve the similar problem of distributing objects over semantic roles with machine-learning algorithms trained on already annotated corpora, using, for example, FrameNet as the inventory of “action + roles” constructions. However, no suitable corpus exists for Russian. In addition, this approach has problems of its own, the discussion of which is beyond the scope of this short article.

In our approach, as described above, the distribution of objects over roles is based on matching object types against the semantic restrictions established for the roles in node templates. The system currently has about 1,700 node templates, most of which were formed semi-automatically from FrameNet frames. However, the semantic restrictions for roles largely have to be set manually, at least for the most frequently encountered nodes.

One can also try to form nodes automatically using machine learning based on the nodes already formed. If a certain combination of objects and words unknown to the system occurs with certain syntactic characteristics, nodes similar to existing ones can be formed. Although templates for such nodes still have to be made manually, the presence of such a node is better than its absence.

Problem 3. High computational complexity of performing many semantic queries.

Some queries may involve processing a very large number of intermediate objects and nodes and therefore run slowly. This problem can be fully solved by engineering methods.


Source: https://habr.com/ru/post/277351/

