Earlier we wrote about analyzing restaurant reviews in order to extract mentions of various aspects (food, decor, and the like). Recently a question came up in the comments about extracting actual facts from text: is it possible, for example, to extract facts from car reviews, such as "the gearbox breaks down quickly" => breaks down (gearbox, quickly), so that these facts can be worked with later? In this article we describe one approach to solving this problem.

The method we describe is based on a number of simplifications. It is not the most accurate, but it is easy to implement and lets you quickly build a prototype of the application it is meant for. In some cases it will be quite sufficient, while in others improvements can be introduced without departing from the basic principle.
Consider, for example, this sentence from a review of a TV wall mount:
All the holes match up with the TV, the washers do not fall out.
We want to extract relations from it, for example in this form:
predicate => match up
subject => holes
object => with TV

predicate => do not fall out
subject => washers
This task is often referred to as semantic role labeling. There is a verb (matches, falls out, etc.) and it has arguments. What the arguments of a verb are, and which ones it takes, is a subject of controversy among linguists. In practice, the arguments are whatever a specific task needs. Therefore, in order not to plunge into philosophical problems, we decide that we will need a subject, an object, and a circumstance/condition under which the action took place. A description can also be attached to an object or a subject: for example, in the phrase "good TV", the word "good" plays this role. If the description is not a characteristic of the object's quality but names one of its components (plasma TV), we will put it into a separate class. This will be enough to begin with, and we will return to this question a little later.
Now we will try to reduce the problem of relation extraction to the task of sequence labeling, whose solution we described earlier.
All | |
holes | subject |
with | |
TV | object |
match up | predicate |
, | |
washers | object |
do not | predicate |
fall out | predicate |
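To make the setup concrete, here is a minimal sketch of how the annotated sentence above could be represented as (token, label) pairs and turned into per-token features for a sequence labeler such as a CRF. The feature names (`word`, `prev`, `next`, `is_punct`) are illustrative choices, not tied to any particular library or to the API mentioned in the article.

```python
# Sketch: the annotated sentence as (token, label) pairs, plus simple
# per-token features of the kind a linear-chain CRF would consume.
sentence = [
    ("All", ""), ("holes", "subject"), ("with", ""), ("TV", "object"),
    ("match up", "predicate"), (",", ""), ("washers", "object"),
    ("do not", "predicate"), ("fall out", "predicate"),
]

def token_features(tokens, i):
    """Features for token i: the word itself plus its left/right neighbors."""
    word = tokens[i]
    feats = {"word": word.lower(), "is_punct": word in ",.!?"}
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<s>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return feats

tokens = [w for w, _ in sentence]
labels = [l for _, l in sentence]
X = [token_features(tokens, i) for i in range(len(tokens))]
```

A real model would be trained on many such (features, labels) sequences; at prediction time only the feature side is built and the labels come from the classifier.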
We assign a corresponding category to each word, and in this way mark up a training sample of sufficient size. Next, we can train any classifier that can work with sequences, for example a CRF, feed a new sentence to its input, and get a category prediction for each word. For our experiments we used, of course, our own API, which anyone can access for free by registering on our website. We wrote in detail here about how to use it, so we will not repeat that and lose the main idea.
We manually annotated about 100 sentences, which is actually a very small sample for such a task. Then we fed several new sentences to the model, and this is what came out:
First sentence (the Russian idiom "changed an awl for soap", i.e. traded one thing for another of no greater value):

Changed | predicate |
awl | object |
for | |
soap | object |
for | |
such | |
money | object |
, | |
bought | predicate |
on | |
the name | object |
Second sentence:

In | |
the set | object |
besides | |
the ordinary | |
knife | object |
there is | predicate |
for | |
dotted | description |
notches | object |
At this stage we noticed that the links between a verb and its corresponding objects are lost (in fact, we knew this right away, but kept quiet about it for simplicity).
There are different ways to solve this problem. In highly specialized systems, the argument type can itself indicate which verb it should be attributed to:
The train arrives at the station at 16-00, and departs at 15-20.

It is clear that here we can immediately label 16-00 as arrival_time and 15-20 as departure_time, while each verb also gets a type: "arrives" will be of type "arrival", and "departs" of type "departure". Thus the question of correct matching is shifted onto the sequence labeling system, and whether it copes with it will depend on the algorithm used.
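In such a narrow domain, once the labeler has produced typed tags, matching arguments to verbs becomes trivial bookkeeping. A minimal sketch, assuming a naming convention (invented here for illustration) where an argument label like "arrival_time" is the event type plus a slot name:

```python
# Sketch: when labels encode the event ("arrival_time" belongs to "arrival"),
# grouping arguments by verb is just a dictionary fill.
tagged = [
    ("The train", ""), ("arrives", "arrival"), ("at the station", ""),
    ("at", ""), ("16-00", "arrival_time"), (",", ""), ("and", ""),
    ("departs", "departure"), ("at", ""), ("15-20", "departure_time"),
]

def collect_slots(tagged_tokens):
    """Group typed arguments by the event type encoded in their label."""
    events = {}
    for word, label in tagged_tokens:
        if not label:
            continue
        # "arrival_time" -> event "arrival", slot "time";
        # a bare "arrival" is the trigger verb itself.
        event, _, slot = label.partition("_")
        events.setdefault(event, {})[slot or "trigger"] = word
    return events

print(collect_slots(tagged))
# {'arrival': {'trigger': 'arrives', 'time': '16-00'},
#  'departure': {'trigger': 'departs', 'time': '15-20'}}
```

The hard part, correctly assigning the typed labels in the first place, stays with the sequence labeler, exactly as the article notes.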
This approach is well suited for command parsing ("wake me up tomorrow at 10 am" => wake up (me, tomorrow at 10 am), "order a pizza for 10 people" => order (pizza, 10 people), etc.).
In our case we could likewise define the argument types more precisely. Say, for the phrase "all the holes match up with the TV", we would have the relation type "coincidence" and the two arguments "what matched" and "what it matched with". This works great when the set of relations is limited and strictly defined.
We chose a more general scheme first, with deliberately vague argument types, in the hope that they would fit any verb. As a consequence, we need a second phase of analysis: determining which arguments correspond to which verb.
Since we are building a simple fact extraction method, we assume that every object belongs to the verb nearest to it. This is not always the case, but it is often true. The same method can be used to match objects with their descriptions.
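The nearest-verb heuristic can be sketched in a few lines. This is an illustrative reimplementation, not the authors' program; note that on the first test sentence it also attaches "money" to "bought", whereas the hand-checked output in the article lists only "name" there, which shows the kind of error the simplification admits.

```python
# Sketch of the simplification above: attach every object / subject /
# description to the nearest predicate, measured in token distance.
def attach_to_nearest_predicate(tagged):
    pred_positions = [i for i, (_, l) in enumerate(tagged) if l == "predicate"]
    relations = {i: {"predicate": tagged[i][0], "args": []}
                 for i in pred_positions}
    for i, (word, label) in enumerate(tagged):
        if label in ("object", "subject", "description"):
            nearest = min(pred_positions, key=lambda p: abs(p - i))
            relations[nearest]["args"].append((label, word))
    return list(relations.values())

# The first test sentence, with the labels the model predicted.
tagged = [("Changed", "predicate"), ("awl", "object"), ("for", ""),
          ("soap", "object"), ("for", ""), ("such", ""), ("money", "object"),
          (",", ""), ("bought", "predicate"), ("on", ""),
          ("the name", "object")]

for rel in attach_to_nearest_predicate(tagged):
    print(rel["predicate"], "->", rel["args"])
```

Ties and long-distance dependencies are exactly where this breaks; a dependency parse would resolve them, at the cost of the method's simplicity.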
Having accepted this simplification, we wrote a program that first finds all the predicate verbs, and then assigns the objects to them by counting the distance to the verb in words. With the help of this program we extracted the following relations from the sentences above:
first sentence:
predicate => changed
object1 => awl
object2 => soap

predicate => bought
object => name

second sentence:
predicate => is
object1 => notches
description => dotted
object2 => knife
description => ordinary
object3 => set
It turned out to be quite an interesting thing, and all the work, including manual annotation of the training sample, took us 4 hours. To improve quality, a second analysis stage could collect all the extracted facts together and try to reject incorrectly identified relations based on an analysis of the results.
In general, as we see, the problem of extracting relations from text can be solved in different ways. We have reviewed only a few, trying to focus on accessible methods, and, as can be seen, building such an analyzer today is not so difficult.