Natural Language Processing: Apache UIMA

Originally developed at IBM, the Unstructured Information Management Architecture (UIMA) is now hosted by the Apache Software Foundation (in the Apache Incubator at the time of writing). It is open source software distributed under the Apache license.

What is it?

It is a software infrastructure for analyzing large volumes of information and extracting knowledge from it. Here we carefully stop, look into the abyss of the semantic web, at the bottom of which lies artificial intelligence, and take a cautious step back.

Apache UIMA is good because there is no mysticism in it. Everything can be touched, poked, and taken apart.
It offers a modular approach to text analysis. For example, the analysis sequence can be:

determine the language of the text;
find the sentence boundaries;
look for named entities (names, titles, etc.).

Each operation is performed by a separate component, and the framework connects the components to each other (both a UIMA Java Framework and a UIMA C++ Framework are available).
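The sequence above can be sketched as a chain of small components that all write into one shared document object. This is a conceptual illustration in plain Python, not the UIMA API: the `Document` class, the three component functions, and their naive rules are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Plain text plus the annotations that components add to it."""
    text: str
    annotations: list = field(default_factory=list)  # (type, begin, end, value)


def detect_language(doc):
    # Toy heuristic (assumption): Cyrillic letters mean Russian, otherwise English.
    lang = "ru" if re.search(r"[а-яА-Я]", doc.text) else "en"
    doc.annotations.append(("language", 0, len(doc.text), lang))


def split_sentences(doc):
    # Naive rule (assumption): a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"[^.!?]+[.!?]", doc.text):
        # Skip the whitespace that separates this sentence from the previous one.
        begin = m.start() + len(m.group()) - len(m.group().lstrip())
        doc.annotations.append(("sentence", begin, m.end(), doc.text[begin:m.end()]))


def find_names(doc):
    # Naive rule (assumption): two or more consecutive capitalized words form a name.
    for m in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", doc.text):
        doc.annotations.append(("name", m.start(), m.end(), m.group()))


def run_pipeline(text, components):
    """Pass one document through every component in order."""
    doc = Document(text)
    for component in components:
        component(doc)
    return doc


doc = run_pipeline(
    "Our client is going to sue your company. "
    "This proposal was written by Sue Smith for the Johnson Corporation.",
    [detect_language, split_sentences, find_names],
)
for annotation in doc.annotations:
    print(annotation)
```

Because every component only reads the text and appends annotations, components can be reordered, replaced, or added without touching the others, which is the point of the modular approach.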

Annotators

Thus, we work with a pipeline made up of annotator components, each of which performs a specific operation on the text. The input is plain text; the output is an enriched XML file.
Here are some of the existing annotators:

Whitespace Tokenizer Annotator
A simple tokenizer. It uses whitespace to split the text into tokens: words, numbers, and punctuation marks.
Snowball Annotator
Extracts word stems (stemming). Supports Russian. Here you can read how it works.
Regular Expression Annotator
Extracts pieces of text that can be described with regular expressions: URLs, e-mail addresses, etc.
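To make the first and last annotators in this list more concrete, here is a minimal sketch of what they do, written with Python's standard `re` module rather than with UIMA itself. The function names and both patterns are assumptions for illustration; in particular, the URL and e-mail patterns are deliberately simplified.

```python
import re

# Assumed, simplified patterns; real URL and e-mail grammars are far richer.
PATTERNS = {
    "url": re.compile(r"https?://[^\s,;]+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def whitespace_tokens(text):
    """Yield (token, begin, end) for words, numbers and single punctuation marks."""
    for m in re.finditer(r"\w+|[^\w\s]", text):
        yield m.group(), m.start(), m.end()


def regex_annotations(text):
    """Yield (label, matched_text, begin, end) for every pattern match."""
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            yield label, m.group(), m.start(), m.end()


sample = "Write to info@example.org or see https://uima.apache.org today."
tokens = [t for t, _, _ in whitespace_tokens("Sue Smith wrote 2 proposals.")]
found = [(label, value) for label, value, _, _ in regex_annotations(sample)]
print(tokens)
print(found)
```

Both functions return character offsets along with the matched text, so later components can refer back to the exact position in the original input.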

How the other components work is less obvious; a list of them (far from complete) can be found here.
If you wish, you can write your own annotator or assemble a pipeline from existing ones.

Worked example

As an example, let's look at a case from the article “Automate Metadata Extraction for Corporate Search and Mashups” (by Dan McCreary). To begin with, we teach the program to understand that Sue and Susan are semantically close concepts. Consider two sentences:

Our client is going to sue your company.
This proposal was written by Sue Smith for the Johnson Corporation.

First of all, we perform POS (part-of-speech) tagging. Here is a list of identifiers denoting parts of speech for the English language.
The output of the POS annotator is:

<AnnotationResult>
<Sentence>Our client is going to sue your company. </Sentence>
<token POS="pp$">Our</token>
<token POS="nn">client</token>
<token POS="bez">is</token>
<token POS="vbg">going</token>
<token POS="to">to</token>
<token POS="vb">sue</token>
<token POS="pp$">your</token>
<token POS="nn">company</token>
<token POS=".">.</token>
<Sentence>This proposal was written by Sue Smith for the Johnson Corporation. </Sentence>
<token POS="dt">This</token>
<token POS="nn">proposal</token>
<token POS="bedz">was</token>
<token POS="vbn">written</token>
<token POS="in">by</token>
<token POS="np">Sue</token>
<token POS="np">Smith</token>
<token POS="in">for</token>
<token POS="at">the</token>
<token POS="np">Johnson</token>
<token POS="nn">Corporation</token>
<token POS=".">.</token>
</AnnotationResult>

Bearing in mind that the Sue we are looking for has the part-of-speech identifier POS="np" (a proper noun), we discard the first sentence, where sue is a verb (POS="vb"). A traditional, non-semantic search clearly does not distinguish the verb sue from the proper name.
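The difference between the two kinds of search can be sketched with the standard `xml.etree.ElementTree` module. The XML below is an abridged copy of the tagger output above; the `sentences_with` helper is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the POS annotator output shown above.
POS_OUTPUT = """
<AnnotationResult>
<Sentence>Our client is going to sue your company. </Sentence>
<token POS="vb">sue</token>
<token POS="nn">company</token>
<Sentence>This proposal was written by Sue Smith for the Johnson Corporation. </Sentence>
<token POS="np">Sue</token>
<token POS="np">Smith</token>
</AnnotationResult>
"""


def sentences_with(root, word, pos=None):
    """Return sentences containing `word`, optionally only with the given POS tag."""
    hits, current = [], None
    for node in root:
        if node.tag == "Sentence":
            # Tokens that follow belong to this sentence.
            current = node.text.strip()
        elif node.tag == "token" and node.text.lower() == word.lower():
            if (pos is None or node.get("POS") == pos) and current not in hits:
                hits.append(current)
    return hits


root = ET.fromstring(POS_OUTPUT.strip())
plain = sentences_with(root, "sue")             # case-insensitive word match
semantic = sentences_with(root, "sue", pos="np")  # proper nouns only
print(len(plain), plain)
print(len(semantic), semantic)
```

The plain word match returns both sentences, while the search restricted to `POS="np"` keeps only the sentence that mentions the person.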

Now for the second sentence. Using a dictionary (dictionaries of terms and ontologies are an essential part of semantic search), we learn that <token POS="np">Sue</token> is a short form of the name Susan, and that the two consecutive tokens "Sue Smith" may refer to Susan Smith, a company employee with the number 1234747. In the same way we can determine that Johnson Corporation is an organization with the identifier 347474. We record the knowledge gained:

<AnnotationResult>
<Sentence>This proposal was written by Sue Smith for the Johnson Corporation. </Sentence>
<token POS="dt">This</token>
<token POS="nn">proposal</token>
<token POS="bedz">was</token>
<token POS="vbn">written</token>
<token POS="in">by</token>
<person EmpID="1234747">
<token POS="np">Sue</token>
<token POS="np">Smith</token>
</person>
<token POS="in">for</token>
<token POS="at">the</token>
<org CompanyID="347474">
<token POS="np">Johnson</token>
<token POS="nn">Corporation</token>
</org>
<token POS=".">.</token>
</AnnotationResult>
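The dictionary lookup that produces this result can be sketched as follows. The three dictionaries and the `lookup` and `enrich` functions are hypothetical stand-ins for a real term dictionary or ontology; the identifiers are the ones from the example, and the sketch only considers two-token entities that start with a proper noun.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionaries; the IDs are the ones used in the example above.
SHORT_NAMES = {"Sue": "Susan"}
EMPLOYEES = {"Susan Smith": "1234747"}
COMPANIES = {"Johnson Corporation": "347474"}

TOKENS = [("This", "dt"), ("proposal", "nn"), ("was", "bedz"), ("written", "vbn"),
          ("by", "in"), ("Sue", "np"), ("Smith", "np"), ("for", "in"), ("the", "at"),
          ("Johnson", "np"), ("Corporation", "nn"), (".", ".")]


def lookup(first, second):
    """Return (tag, attribute, id) if the two-word phrase is a known entity."""
    person = f"{SHORT_NAMES.get(first, first)} {second}"
    if person in EMPLOYEES:
        return "person", "EmpID", EMPLOYEES[person]
    company = f"{first} {second}"
    if company in COMPANIES:
        return "org", "CompanyID", COMPANIES[company]
    return None


def enrich(tokens):
    """Build the annotated XML tree, wrapping recognized entities."""
    root = ET.Element("AnnotationResult")
    i = 0
    while i < len(tokens):
        # Only a proper noun ("np") may start an entity in this sketch.
        match = None
        if tokens[i][1] == "np" and i + 1 < len(tokens):
            match = lookup(tokens[i][0], tokens[i + 1][0])
        if match:
            tag, attribute, entity_id = match
            parent = ET.SubElement(root, tag, {attribute: entity_id})
            span = tokens[i:i + 2]
        else:
            parent, span = root, tokens[i:i + 1]
        for text, pos in span:
            ET.SubElement(parent, "token", POS=pos).text = text
        i += len(span)
    return root


enriched = enrich(TOKENS)
print(ET.tostring(enriched, encoding="unicode"))
```

Each pass like this one adds a layer of knowledge on top of the previous annotations without changing the original tokens.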

Thus, over several iterations, we enrich the textual information, which raises search to a new level.
No magic.

Source: https://habr.com/ru/post/56461/
