
FactRuEval: a competition on named entity recognition and fact extraction

Competitions on various aspects of text analysis are held every year at the Dialogue international conference on computational linguistics. The competitions themselves usually take place several months before the event, and the results are announced at the conference itself. Three competitions are planned this year:


The article you are reading has three goals. First, we would like to invite developers of automatic text analysis systems to take part in the competitions. Second, we are looking for volunteers to help mark up the text collections on which the participants' systems will be evaluated (this is, firstly, interesting, and secondly, a real contribution to science). And third, competitions on named entity recognition and fact extraction are being held at Dialogue for the first time, and we want to tell all interested readers how they will proceed.

Fact extraction competitions abroad


Western computational linguists have long paid attention to extracting facts from texts. The first conference, the Message Understanding Conference (MUC), was held in 1987. The event was funded by the military (DARPA), and the topics of the texts initially reflected their interests: reports on naval operations and on terrorism in Latin American countries. Later came news articles on economic topics and articles about rocket launches and plane crashes.
Since 1999, the competitions continued as part of the Automatic Content Extraction (ACE) program and were no longer limited to English (Chinese and Arabic were added). Participants were offered the following tasks:



Detailed instructions for the ACE tasks over the years are available on the Linguistic Data Consortium website.

Since 2009, tasks similar in content have been presented in the Knowledge Base Population (KBP) section of the Text Analysis Conference (TAC). In 2015, KBP includes the following tracks:


Publications on the results of the TAC workshops are available on the NIST website.

How about us?


The first fact extraction competitions in Russia were held in 2004-2006 within the framework of the ROMIP seminar.

In 2004, the fact track provided a collection of texts and a list of persons (for example: Sting, an English pop singer). Participants had to find the facts (events) associated with each person in the collection and return a list of documents together with the coordinates of the relevant fragments (the offset of the fragment's beginning and its length) where these events are mentioned.

In 2005 and 2006, several tasks in this direction were proposed:


About ten years have passed since then. During this time, experts in computational linguistics, data mining, and related fields, both at large companies and in small research groups, have accomplished quite a lot. However, little reliable information about the results is freely available. Now, within the framework of the Dialogue conference, an independent comparative evaluation of information extraction systems for Russian will take place, and its results will be available to everyone.

The FactRuEval-2016 competition will include three tracks: two on named entity recognition and one on fact extraction. All three tracks will be evaluated on the same collection of modern news texts. Below I will explain the task of each track using one short example. The text is as follows:

    [Russian example sentence shown with a character-offset ruler: it mentions "the village of Martyshkino" starting at offset 6 (length 15), "Ivan Petrov" at offset 22 (length 11), and "Martyshkino" again at offset 48 (length 10).]


Track 1: Named Entities


The task of the first track is to mark each occurrence of a named entity in the text and determine its type. That is, in the text above, three entities should be marked: the location "the village of Martyshkino", the person "Ivan Petrov", and "Martyshkino" once again. The answer for this task is a text file listing, for each entity, its type, the offset of the first letter of the selected fragment from the beginning of the text, and its length:

 LOC 6 15
 PER 22 11
 LOC 48 10
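
To make the format concrete, here is a minimal Python sketch (the Span type and the helper names are ours, not part of the official format description) that writes and reads annotations in this three-column layout:

    from typing import NamedTuple, List

    class Span(NamedTuple):
        """One entity mention: type, offset of its first character, length."""
        entity_type: str  # e.g. "LOC", "PER"
        start: int        # 0-based offset from the beginning of the text
        length: int       # number of characters in the fragment

    def write_spans(spans: List[Span]) -> str:
        """Serialize spans into the three-column track 1 layout."""
        return "\n".join(f"{s.entity_type} {s.start} {s.length}" for s in spans)

    def read_spans(text: str) -> List[Span]:
        """Parse a track 1 response back into Span objects."""
        spans = []
        for line in text.splitlines():
            entity_type, start, length = line.split()
            spans.append(Span(entity_type, int(start), int(length)))
        return spans

    # The three entities from the example above survive a round trip.
    spans = [Span("LOC", 6, 15), Span("PER", 22, 11), Span("LOC", 48, 10)]
    assert read_spans(write_spans(spans)) == spans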


Track 2: Identifying Entities and Attributes


In this track, entities no longer need to be bound to positions in the text. Instead, all references to the same entity within the text must be grouped into one object, and the attributes of that object must be defined. For example, in the text being discussed, "Martyshkino" is mentioned twice, but it should appear in the output only once. For persons, the surname, first name, patronymic, and nickname must be indicated separately. The expected result is:

 PER
 Firstname: Ivan
 Lastname: Petrov

 LOC
 Name: Martyshkino
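
A rough in-memory representation of such objects might look as follows in Python (the attribute keys are illustrative; the official field names are given in the competition documentation):

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        """One identified object: all its mentions merged, attributes extracted."""
        entity_type: str                                # "PER", "LOC", ...
        attributes: dict = field(default_factory=dict)  # attribute name -> value

    # The two mentions of Martyshkino collapse into a single LOC object,
    # and the person's name is split into separate attribute fields.
    entities = [
        Entity("PER", {"Firstname": "Ivan", "Lastname": "Petrov"}),
        Entity("LOC", {"Name": "Martyshkino"}),
    ]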


Track 3: Extracting Facts


A fact is a relationship between several objects. A fact has a type and a set of fields. For example: the fact type is Occupation, and its fields are Who, Where, Position, and Phase (started, finished, or undefined). This year we will extract several types of facts:


From our example, one fact must be extracted: Ivan Petrov works in the village of Martyshkino as its head.

 Occupation
 Who: Ivan Petrov
 Where: Martyshkino
 Position: head
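
In the same illustrative spirit, a fact can be modeled as a typed record over the track 2 objects (a sketch, not the official output format):

    from dataclasses import dataclass

    @dataclass
    class Fact:
        """A typed relationship between several objects."""
        fact_type: str  # e.g. "Occupation"
        fields: dict    # field name -> value (an entity or a literal)

    fact = Fact(
        fact_type="Occupation",
        fields={
            "Who": "Ivan Petrov",    # a PER object from track 2
            "Where": "Martyshkino",  # a LOC object from track 2
            "Position": "head",
            "Phase": "undefined",    # started / finished / undefined
        },
    )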


Fact extraction competitions always come with rather voluminous annotation guidelines, and this competition is no exception. Participants should study the "Description of the Tracks" and the "Results Format".

Evaluation of results


The competition will be held in January 2016. Before it starts, participants will receive a demonstration collection of annotated texts and a comparator program with which they can evaluate their results independently. The comparator will be published as Python source code. Participants will have several weeks to finalize their systems and bring their output into the expected format.
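
The real comparator's matching rules will be defined by the organizers; the sketch below only illustrates the general idea of such a tool for track 1, scoring a system's spans by exact (type, offset, length) matches:

    def evaluate(predicted, gold):
        """Precision, recall, and F1 over exact (type, start, length) matches."""
        predicted, gold = set(predicted), set(gold)
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold = {("LOC", 6, 15), ("PER", 22, 11), ("LOC", 48, 10)}
    predicted = {("LOC", 6, 15), ("PER", 22, 11)}  # one entity missed
    print(evaluate(predicted, gold))  # (1.0, 0.666..., 0.8)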

After that, to assess the quality of the participants' systems, a test collection will be provided that includes several hundred pre-annotated documents. Since participants could, in theory, annotate several hundred documents manually, several tens of thousands of documents from the same sources as the pre-annotated ones will be added to the test collection. Two days will be given to mark up all these documents, after which the systems' results in the described format must be sent to the organizing committee.

Collection of texts


The competition corpus consists of news and analytical texts on socio-political topics in Russian. The texts come from the following publications:


The corpus is divided into two parts: demo and test. The ratio of texts from different sources is the same in both parts. Balance on any other criteria is not guaranteed.

Work on annotating this collection of texts is now under way on OpenCorpora.org. We invite everyone interested to join in. How the annotation is organized is described in a separate article, "How can reading the news benefit science?". Detailed annotation instructions are here.



The annotation task is to find in the text people's first and last names, organization names, and geographical names, select them with the mouse, and choose the type of the selected object. For organizations and geographical names, a descriptor (a word or phrase denoting the generic concept) must also be specified. After that, the selected text fragments (spans) should be combined into references to objects. For example, a first name and a surname should be combined into a reference to an object of type Person, and an organization's descriptor ("SRI") and its name ("SRI of Transport and Road Management") into a reference to an object of type Org. The list of object references that should result is shown in the picture below. The manual analyzes in detail examples and complex cases that arise during annotation.
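
In code terms, this two-level annotation model (spans grouped into object references) might be sketched like this; the type and role names are ours, chosen for illustration:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TextSpan:
        """A fragment selected with the mouse."""
        start: int
        length: int
        role: str  # e.g. "firstname", "lastname", "descriptor", "name"

    @dataclass
    class Reference:
        """Several spans combined into a mention of one object."""
        object_type: str  # "Person", "Org", "Location", ...
        spans: List[TextSpan]

    # A first name and a surname form one Person reference; the descriptor
    # and the full organization name would likewise form one Org reference.
    person = Reference("Person", [TextSpan(22, 4, "firstname"),
                                  TextSpan(27, 6, "lastname")])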



How to take part in the competition?


You can participate in any of the announced tracks or in all of them at once. You need to teach your system to output results in the described format. After that, use the comparator to evaluate its work on the demo part of the collection (we will publish it as soon as its annotation is complete), and make the necessary changes based on the discrepancies found.

In the very near future, we ask potential participants to register (we will send you news and let you know when the evaluation procedure begins) and to help mark up the corpus (the task page is available after logging in to OpenCorpora). We would like to publish the demo part as soon as possible.

We will also welcome any comments and suggestions, here or by email.

Source: https://habr.com/ru/post/273965/

