Retrieving Entity Mentions and Textocat API Search

Textocat API is a cloud-based text analysis SaaS. High-quality extraction of useful information from texts is a difficult task and requires serious expertise. The mission of the Textocat team is to make the word processing process so easy to use that any modern developer can include in their arsenal. Using the Textocat API, you can quickly prototype text-based applications and turn them into your business. In this publication, we show how easy it is to integrate into any application the capabilities of the Textocat API to recognize references to entities (objects) and search for documents in Russian.

Textocat API Beta features

In early April, we launched a beta test of the Textocat API . In this version, we offer developers to use part of the service’s functionality for free with the following features:

recognition of references to entities (entity recognition) in collections of documents in Russian;
storage of processed collections;
full text search with selected types of entities.

Recognition of entity references

The task of recognizing references to entities , including named entities (named entity recognition, NER), is to select and classify certain fragments of text into previously known types. For example, the current version of Textocat supports seven types:

Person (people)
ORGANIZATION (organizations),
GPE (geopolitical entities),
LOCATION (geographic features),
FACILITY (infrastructure),
TIME (time units)
and MONEY (monetary units).

')
Why is a high-quality solution to this problem difficult? After all, you can simply take the dictionaries, methods from the university course on the theory of algorithms and get the same result? The fact is that natural language, especially Russian, is rich in various nuances:

Inflectionality: for accurate matching, taking into account possible forms of words and endings, you need a good morphological analyzer.
Lexical ambiguity: how to distinguish between references of the Bank "Russia" (ORGANIZATION) and Russia as a country (GPE)?
The ambiguity of the name: how to distinguish the references of Sergei Ivanov, the head of the Presidential Administration of the Russian Federation, from the references to his full namesake - the chairman of the state committee of the Republic of Tatarstan on tourism?

At the core of our machine learning approach is our own data collection technologies for annotating from assessors on the principle of crowdsourcing and our own developments to improve the current methods of using probabilistic graphical models for these tasks. The accuracy of this decision will be an order of magnitude higher than the straightforward approach based on keywords. In addition, all implemented Textocat technologies are scaled by the number of machines without rewriting the source code.

You can learn how Textocat technology for this task works on different texts, even without starting to program, on our interactive demo page. Note that the current results are far from the limit of possibilities, and the quality of recognition will be significantly improved in the near future. Now technology shows its best results on news texts.

Uploading documents and searching

After processing, documents become available for uploading (and only to the client with this authorization token). The Textocat API also allows the user to search among all uploaded documents, supporting full-text queries by keywords (for example, “tinkoff bank”). In addition, the Textocat API provides an extended syntax for queries that impose a limit on the type of annotation in which keywords should appear. For example, the query “ORGANIZATION: Ford” makes it possible to search for documents that mention the keyword “Ford” in the context of the organization (“Ford Motors”), separating the references to the name of the famous American industrialist Henry Ford. In the full version of the service, the tracking function in documents of specific objects from the dictionary or customer knowledge base (entity linking or named entity disambiguation) is available.

API principles

Textocat API is a classic RESTful API , that is, work with the service is done through standard HTTP requests. JSON is currently the only supported format for input and output data in the Textocat API. For the convenience of working with many documents, the Textocat API allows you to collect documents into batches and send them for processing. In principle, this allows you to break a large collection into packages and send them for processing in parallel (premium feature). Packet processing time depends on their size and the current load created by all users. Therefore, the client is charged with the need to periodically check the status of the package by sending an appropriate request. As soon as the package of documents is processed, it is stored in the system in a structured form and will be stored for the entire subscription period of the client. From this point on, the client is able to upload the processed documents. Textocat API allows you to filter and sort the output documents through the described search tools.

Get the key to the API

To repeat the steps described below, you need an authorization key (auth_token) to access the API. You can get the key for free by simply registering on our website. If you have any problems with registering or using the service, please use our knowledge base .

Online documentation

The interactive online documentation of the Textocat API allows you to get acquainted with the basic commands and parameters for invoking the service's functionality. Next, we describe the processing cycle of a simple package consisting of a single test document, which can be repeated directly in the online documentation.

Step-by-step instructions for using online documentation

Open the Textocat API online documentation page . To start working with it, you need to insert an authorization token in the appropriate field (auth_token).
We start with the function of sending a package to highlight references to entities. Click on the link / entity / queue .
In the expanded form, you must fill in the field body - an array of documents to be added in JSON format. The easiest way to do this is by clicking on the JSON example on the right below “Model Schema” and filling the text field in JSON with the text of the test document. Clicking next to the “Model” link, you will receive a description of the main document fields processed by the Textocat API.
Push the button to send the request “Try it out”. In case of successful processing of the request, the form displays a message with the HTTP status code 202 and the generated identifier for the package being processed (see the “batchId” field in the response JSON). This ID needs to be remembered. In case of an error, it is necessary to compare the error code (for example, 406) with the description from the “Response Messages” table and fix the problems in the request.
We check the status of the sent packet ( / entity / request ) by entering in the batch_id field the identifier of the packet that we memorized in the previous step. The service will return code 200 and a FINISHED response in the “status” field if all documents from the batch are processed and ready to be unloaded, or IN_PROGRESS if the batch is still being analyzed.
If the packet is processed, we can proceed to its unloading ( / entity / retrieve ), specifying the same packet identifier in the batch_id field and send the request. To search through all user documents ( / entity / search ), the search_query field is filled in the Textocat search query syntax .
In both cases, as a response, the service will return JSON within a model consisting of package metadata, a document, and selected entities. A detailed description of the fields in the response can be obtained by clicking on the link “Model” under the caption “Response Class”.

Working with the Textocat API on the command line

We show how you can work with our service directly from the command line using curl , a standard utility for Unix.

Instructions for calling the Textocat API from the command line

Let's prepare a test document example.json with a simple package consisting of three small documents:

[ { "text": "    « »   —  ,           .   .", "tag": "doc1" }, { "text": "      «»  ", "tag": "doc2" }, { "text": "-       .", "tag": "doc3" } ]

Send the text Textocat API to recognize the mentioning of entities by running the command in the console:

 curl -X POST http://api.textocat.com/entity/queue?auth_token=<YOUR_AUTH_TOKEN> -H "Content-Type: application/json" --data @example.json

In response, you should receive a similar response containing the identifier of this batchId package:

 { "batchId": "931da87a-fe98-4639-8cf6-570b5a3fc347", "status": "IN_PROGRESS" }

Check the package status by passing batchId as a parameter:
```
 curl http://api.textocat.com/entity/request?auth_token=<YOUR_AUTH_TOKEN>&batch_id=931da87a-fe98-4639-8cf6-570b5a3fc347 
```
Since the packet is very small and should be processed instantly, this time the server response will be:
```
 { "batchId": "931da87a-fe98-4639-8cf6-570b5a3fc347", "status": "FINISHED" } 
```
This means that the package is ready for uploading or searching.

Perform a search on the downloaded document of our user, passing the search search parameter search_query with the value "ORGANIZATION: Ford":

 curl -G --data-urlencode 'search_query=ORGANIZATION:' --data 'auth_token=<YOUR_AUTH_TOKEN>' http://api.textocat.com/entity/search

From the server will come the following response:

 { "searchQuery": "ORGANIZATION:", "documents": [{ "status": "SUCCESS", "tag": "doc2", "entities": [{ "span": " ", "category": "PERSON", "beginOffset": 14, "endOffset": 25 }, { "span": " «»", "category": "ORGANIZATION", "beginOffset": 28, "endOffset": 43 }] }, { "status": "SUCCESS", "tag": "doc3", "entities": [{ "span": "  ", "category": "ORGANIZATION", "beginOffset": 14, "endOffset": 34 }, { "span": " ", "category": "GPE", "beginOffset": 51, "endOffset": 65 }, { "span": "-   ", "category": "FACILITY", "beginOffset": 0, "endOffset": 34 }] }] }

Thus, from the processed packages, the Textocat API returned only documents (see the documents array) mentioning the word "Ford" in the specified lexical meaning, ranking them by relevance.

Java sdk

Finally, we give an example of highlighting references to entities using the official Textocat Java SDK , which greatly simplifies the logic of working with the service when implementing applications in JVM-compatible languages.

 final EntityRecognition entityRecognition = TextocatFactory.getEntityRecognitionInstance("<AUTH_TOKEN>"); final FutureCallback<AnnotatedBatch> outputCallback = // a callback for dealing with annotated documents ... FutureCallback<BatchMetadata> inputCallback = new FutureCallback<BatchMetadata>() { public void onSuccess(BatchMetadata batchMetadata) { entityRecognition.retrieve(outputCallback, batchMetadata); } public void onFailure(Throwable throwable) {} }; entityRecognition.submit(new Batch(documents), inputCallback);

EntityRecognition is the main interface for accessing the entity recognition recognition functionality in the Textocat API. All calls are asynchronous, so the client response code for EntityRecognition needs to be wrapped in FutureCallback from the Google Guava library. In the example, a batch of documents is sent (entityRecognition.submit) and documents with ready annotations of entity references (entityRecognition.retrieve) are uploaded. Instead of the latter, the entityRecognition.search method could be used to search through all of the user's uploaded documents.

Applications

These features Textocat API can be used in different areas of business intelligence. Examples of some interesting applications you can find among the projects of the hackathon Text Analytics HackDay , which we conducted in Kazan in partnership with Kazan Federal University. Separately, we dwell on the following two cases.

Search in corporate document flow

Modern BI platforms provide search functions for such internal company resources as document management systems, file storages, CRM, ERP, knowledge bases in technical support departments, call centers, corporate e-mails and forums. The textocat API extracts values, objects, locations, and time units from the text. The results are issued in a structured format, convenient for downloading to any modern storage. In addition, we provide full-text search capabilities based on extracted semantics. Accounting categories of names and keywords allows you to perform accurate and high-quality ranking of search results. Along the way, a number of low-level tasks are being solved, which developers of their own search solutions for the Russian language often use, using the tools of well-known open source search libraries (for example, Apache Solr or Elastic Search) - these are qualitative tokenization, splitting into sentences, lemmatization, etc.

Marketing tool

In business, there is often a need for analyzing external sources and gathering facts, opinions, feedback on a company, products, competitors, contractors, partners, takeover sites, individual persons, etc. This allows you to make timely decisions in marketing and adjust sales plans. Sources of texts can be official press releases, news, posts on social networks, reviews and comments on sites. The data on the searched objects (for example, names and addresses of sites) can be downloaded from CRM or external databases (for example, the register of the Federal Tax Service of Russia). Using the capabilities of the Textocat API, developers can create services that allow you to get a full picture (360-degree view) of references to the object of interest.

Announcement

In the next publications we:

We present a detailed review of the main cases and success stories of text analytics in the US, Europe and Russia;
Let's introduce TextoKit - our stack of basic functions of text processing in Russian, implemented for the Apache UIMA platform and which we open with source code under a free license for the developer community.

Subscribe to our blog and tell your colleagues. It will be interesting!

Source: https://habr.com/ru/post/256165/

All Articles