The task of extracting information from text is not new: a great deal of work has been done in this direction both by large companies such as Yandex and Google and by independent developers. However, it cannot yet be said that the problem is solved. In this article I want to organize what I know about the subject by briefly reviewing the approaches and tools I have recently had to deal with.
Processing Steps
So, suppose we are given the sentence “The visit of the President to Denmark will give a new breath to the dialogue between the two countries.” To extract facts from this sentence, the following steps are needed:
I. Tokenization
At this stage the sentence must be split into individual words (tokens). This step should pose no problems.
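As a minimal sketch of this step, the sentence can be split with a regular expression. The `tokenize` helper and its pattern are illustrative assumptions, not part of any tool mentioned in this article; real tokenizers also handle abbreviations, numbers with separators, and similar cases.

```python
import re

# Hypothetical helper: a word is a run of letters/digits, optionally joined
# by internal hyphens or apostrophes; any other non-space character becomes
# a separate punctuation token.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


sentence = ("The visit of the President to Denmark will give "
            "a new breath to the dialogue between the two countries.")
tokens = tokenize(sentence)

# Keep only the word tokens for the later analysis stages.
words = [t for t in tokens if t[0].isalnum()]
print(words)
```

Punctuation is kept as separate tokens here so that later stages can decide whether they need it.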
II. Morphological analysis
After tokenization, morphological information must be obtained for each word (part of speech, gender, case, number, etc.), along with additional attributes (for example, whether the word is a personal name or a geographical name). This task requires special dictionaries: as a register of Russian words, you can use the Zaliznyak dictionary or its proprietary derivatives. As a ready-made solution, consider the Mystem utility from Yandex.
If the Mystem utility is used, the original sentence is analyzed as follows (grammatical tags are shown with English abbreviations):
krestyaninov@localhost # echo "The visit of the President to Denmark will give a new breath to the dialogue of the two countries" | ./mystem -niwg
Visit{visit=S,masc,inan=(nom,sg|acc,sg)}
President{President=S,masc,anim=(gen,sg|acc,sg)}
in{in=PR=|in=S,abbr=(nom,sg|nom,pl|gen,sg|gen,pl|dat,sg|dat,pl|acc,sg|acc,pl|ins,sg|ins,pl|prep,sg|prep,pl)}
Denmark{Denmark=S,geo,fem,inan=acc,sg}
will give{give=V=nonpast,sg,indic,3p,perf}
new{new=S,neut,inan=(nom,sg|acc,sg)|new=A=(nom,sg,full,neut|acc,sg,full,neut)}
breath{breath=S,neut,inan=(nom,sg|acc,sg)}
dialogue{dialogue=S,masc,inan=dat,sg}
two{two=NUM=(gen|acc,fem,anim|acc,masc,anim|prep)}
countries{country=S,fem,inan=gen,pl}
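Output of this shape can be turned into structured data before the next stage. The sketch below assumes the line layout seen above, `token{lemma=POS,lexical tags=inflectional tags|...}`, with alternative inflections grouped in parentheses. The `Analysis` class, the helper names, and the tag abbreviations in the sample line are illustrative assumptions, not part of Mystem.

```python
import re
from dataclasses import dataclass


@dataclass
class Analysis:
    lemma: str
    pos: str                   # part of speech, e.g. "S", "A", "V"
    lexical: list[str]         # constant features (gender, animacy, ...)
    forms: list[list[str]]     # alternative inflectional feature sets


def split_top_level(text: str, sep: str = "|") -> list[str]:
    """Split on sep, ignoring separators inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_line(line: str) -> tuple[str, list[Analysis]]:
    """Parse one 'token{...}' line into the token and its candidate analyses."""
    token, _, rest = line.partition("{")
    # Drop the closing brace and any whitespace inside the analysis body.
    body = re.sub(r"\s+", "", rest.rsplit("}", 1)[0])
    analyses = []
    for alt in split_top_level(body):
        # Pad so that a missing inflection part becomes an empty string.
        lemma, lexical, inflection = (alt.split("=", 2) + ["", ""])[:3]
        pos, *features = lexical.split(",")
        inflection = inflection.strip("()")
        forms = [f.split(",") for f in inflection.split("|") if f]
        analyses.append(Analysis(lemma, pos, features, forms))
    return token.strip(), analyses


token, analyses = parse_line(
    "new{new=S,neut,inan=(nom,sg|acc,sg)|new=A=(nom,sg,full,neut|acc,sg,full,neut)}"
)
for a in analyses:
    print(token, a.pos, a.lexical, a.forms)
```

The two analyses returned for “new” (noun and adjective) are exactly the ambiguity that the next stage has to resolve.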
III. Syntax parsing
At this stage, groups of syntactically related words in the sentence are identified, for example the SUBJECT-PREDICATE pair “visit” and “will give”. Establishing these relationships makes it possible to resolve ambiguities left by morphological analysis. For example, the phrase “new breath” makes it clear that “new” is an adjective and not a noun (which morphological analysis alone could not decide).
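The “new breath” case can be sketched as a simple agreement check: the adjective reading is chosen when the following word has a noun reading that shares a case and number with it. The candidate data and the `resolve_adj_noun` function below are hand-written assumptions for illustration; a real parser applies many such rules over far richer feature sets.

```python
# Each token maps to its candidate readings: (part of speech, set of
# (case, number) pairs). The data is hand-written to mirror the ambiguity
# of "new breath"; it is not real analyzer output.
CANDIDATES = {
    "new": [
        ("S", {("nom", "sg"), ("acc", "sg")}),  # noun reading ("the new")
        ("A", {("nom", "sg"), ("acc", "sg")}),  # adjective reading
    ],
    "breath": [
        ("S", {("nom", "sg"), ("acc", "sg")}),
    ],
}


def resolve_adj_noun(first, second, candidates):
    """Return the readings that make `first second` an ADJECTIVE-NOUN group.

    The pair is accepted when `first` has an adjective reading and `second`
    has a noun reading sharing at least one (case, number) combination.
    Returns (pos_of_first, pos_of_second, shared_forms) or None.
    """
    for pos1, forms1 in candidates[first]:
        for pos2, forms2 in candidates[second]:
            shared = forms1 & forms2
            if pos1 == "A" and pos2 == "S" and shared:
                return pos1, pos2, shared
    return None


result = resolve_adj_noun("new", "breath", CANDIDATES)
print(result)
```

In this sketch the noun reading of “new” is discarded simply because no rule accepts it; the reverse pair (“breath”, “new”) yields no group at all.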
IV. Semantic parsing
The task of semantic analysis is to build a complete tree of relations between the words in a sentence. This process has many nuances and is, in general, too complex to describe in this article. More information about semantic analysis can be found here.
V. Fact extraction
Once the connection tree of the words in the sentence has been built, we can proceed directly to extracting facts. The following techniques can be used:
- Search by reference element: a particular word is looked up in the text (for example, “President”), and a fact is built around it using the connection tree;
- Search by pattern: data is found with a regular expression (for example, to pick out a date);
- Search by ontology: data is found using predicate rules described in a special language (an example).
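The first two techniques can be sketched in a few lines. Everything here is an assumption made for illustration: the date format, the hand-built tree with its relation labels, and the function names. A real system would take the tree from the semantic analysis stage.

```python
import re

# --- Search by pattern: dates written like "12 March 2010" ---
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b(\d{{1,2}})\s+({MONTHS})\s+(\d{{4}})\b")


def find_dates(text):
    """Return (day, month, year) tuples for every date found in the text."""
    return [m.groups() for m in DATE_RE.finditer(text)]


# --- Search by reference element in a connection tree ---
# Hand-built tree for the example sentence: head -> [(relation, dependent)].
# The relation labels are hypothetical.
TREE = {
    "will give": [("subject", "visit"), ("object", "breath"),
                  ("recipient", "dialogue")],
    "visit": [("of", "President"), ("to", "Denmark")],
    "breath": [("attribute", "new")],
    "dialogue": [("of", "countries")],
    "countries": [("quantity", "two")],
}


def find_head(tree, word):
    """Return (head, relation) for the node that governs `word`, or None."""
    for head, dependents in tree.items():
        for relation, dependent in dependents:
            if dependent == word:
                return head, relation
    return None


def fact_around(tree, reference):
    """Build a fact from the reference word, its head, and the head's other dependents."""
    found = find_head(tree, reference)
    if found is None:
        return None
    head, relation = found
    context = {rel: dep for rel, dep in tree[head] if dep != reference}
    return {"reference": reference, "head": head,
            "relation": relation, "context": context}


# The date sentence is invented for the demonstration.
dates = find_dates("The visit is planned for 12 March 2010.")
fact = fact_around(TREE, "President")
print(dates)
print(fact)
```

For the reference word “President”, the sketch walks one level up the tree to “visit” and collects its other dependent, “Denmark”, which together form the fact “visit of the President to Denmark”.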
More information about fact extraction can be found in the following sources:
- Presentation “Yandex.Press-portraits”;
- Presentation “Automatic extraction of facts from the text”.
Existing Solutions
Among existing fact extraction systems that work acceptably, I managed to find only one, called GATE. The system is quite interesting and promising but, unfortunately, has no native support for the Russian language. You can try it out with the help of this manual.
Among commercial systems, the Russian-developed RCO is worth noting (thanks to rg_software).
Related links
- Extract keywords using Wikipedia;
- Review of Text Mining Systems;
- Overview of the linguistic system “Semantix”.
P.S.
As mentioned above, this article is a compilation of the material on information extraction that I was able to find and understand. If you have information or ideas on the topic, I will be glad to see them in the comments.