New language-independent NLP library

Introduction

Everyone who comes into this world goes through the process of learning a language, and nobody learns it from rules or grammar. Every child first picks up this strange phenomenon called language and only later, in kindergarten and school, starts learning its rules. This explains the funny fact that people who learn a foreign language as adults, when picking up new languages is harder, often know more about the rules of that language than most of its native speakers.

This simple observation suggests that understanding a language does not require formal knowledge of it at all. Empirical knowledge, that is, experience gleaned from others, is enough. Yet almost all modern NLP libraries forget this and try to build an all-encompassing language model.

For a clearer picture, imagine being blind and deaf. Even if you were born that way, you could still interact with the world and master a language. Of course, your idea of the world would differ from that of everyone around you, but you could still interact with it just as they do. Even with no one to explain what is happening or what language is, you would get off the ground bit by bit by reading Braille by touch.
This means that to understand a message in any language, we need nothing but the message itself, provided the message is long enough. This idea is the basis of a library called AIF. Details follow below.

First, a little theory about how sad the current state of affairs is.

There is a very good Stanford NLP course: www.coursera.org/course/nlp . If you have not taken it yet, you are missing out. After watching at least the first two weeks, it becomes clear what the probabilistic language model is, on which most existing NLP solutions are built. In short, given a huge pile of texts, you can estimate the probability with which each word is used next to another word. This is a very crude explanation, but I think it captures the essence. It is what makes more or less decent translations possible (hello, Google Translate). However, this approach does not bring us any closer to understanding the text; it only looks for similar sentences and builds a translation from them.
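The "probability of one word following another" idea can be sketched as a bigram counter. This is only a minimal illustration of the general technique, not code from AIF or from the course; the class name BigramModel and its methods are hypothetical, and a real model would add smoothing and a much larger corpus.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a bigram language model; not part of AIF.
final class BigramModel {
    // How often each word appears as the first word of an adjacent pair.
    private final Map<String, Integer> firstWordCounts = new HashMap<>();
    // How often each "word next" pair appears.
    private final Map<String, Integer> pairCounts = new HashMap<>();

    // Counts adjacent word pairs in a whitespace-separated text.
    void train(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            firstWordCounts.merge(words[i], 1, Integer::sum);
            pairCounts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
    }

    // Maximum-likelihood estimate of P(next | word); 0.0 for unseen words.
    double probability(String word, String next) {
        int total = firstWordCounts.getOrDefault(word, 0);
        if (total == 0) {
            return 0.0;
        }
        return pairCounts.getOrDefault(word + " " + next, 0) / (double) total;
    }
}
```

Trained on "the cat sat on the mat", this model gives probability("the", "cat") = 0.5, because "the" is followed once by "cat" and once by "mat". It knows nothing about what a cat is, which is exactly the limitation described above.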

But enough about sad things; let's talk about what we can potentially offer.

What functions should the final version of our library implement?

Finding the characters used to separate sentences in the text.
Extracting lemmas from the text (with weights).
Building a semantic graph of the text.
Comparing semantic graphs of texts.
Building a summary of the text.
Extracting objects from the text (partial NER).
Detecting relations between objects.
Determining the topic of the text.

Some items from this list are already implemented in the current version.

Why does the world need AIF?

Given that there are already quite a few similar libraries, such as OpenNLP and StanfordNLP, why create another one?

Most existing NLP libraries have significant drawbacks:

dependence on specific languages (the quality of the results can vary greatly from language to language);
dependence on exact grammatical structure (it would be cool if everyone wrote like Shakespeare or Tolstoy, but that is far from reality);
dependence on the encoding (language models are often tuned to a specific encoding).

In such libraries, the quality of the output depends very strongly on the quality of the input text.

Language models cannot perform semantic analysis of a text. They stop at the parsing stage and never get to understanding the text. A language model can help split the text into sentences and perform entity extraction (NER) and sentiment extraction. Nevertheless, such a model cannot determine the meaning of the text; for example, it cannot produce an acceptable summary.

Let us illustrate these points with an example.

Take the scanned text at https://archive.org/details/legendaryhistor00veld . It already contains a number of non-standard encoded characters, but we will make it even worse by replacing the “.” character with “¸”. This replacement does not hurt readability for an average reader, but it makes the text practically unusable for NLP libraries.
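The effect of such a replacement can be reproduced in miniature. The sketch below is an illustration only: NaiveSplitDemo is a hypothetical class with a hard-coded rule (a sentence ends at ".", "!" or "?" followed by whitespace), not the splitter of any of the libraries mentioned here, and it assumes the replacement character is the cedilla "¸" (U+00B8).

```java
import java.util.regex.Pattern;

// Hypothetical demo of a fixed-rule splitter; not taken from any real NLP library.
final class NaiveSplitDemo {
    // Fixed rule: split at whitespace that follows '.', '!' or '?'.
    private static final Pattern HARD_CODED_END = Pattern.compile("(?<=[.!?])\\s+");

    // Number of sentences found by the fixed rule.
    static int countSentences(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : HARD_CODED_END.split(trimmed).length;
    }

    // Replaces every '.' with the cedilla character, as in the experiment above.
    static String corrupt(String text) {
        return text.replace('.', '\u00B8');
    }
}
```

For the text "First one. Second one. Third one." countSentences returns 3, but after corrupt it returns 1: once the expected character is gone, a fixed rule has nothing to hold on to.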

Let's try to split this text into sentences with OpenNLP, StanfordNLP and AIF.

The libraries found the following numbers of sentences:

StanfordNLP: 13
OpenNLP: 3
AIF: 2240

But even simpler problems than this are often unsolvable for most NLP libraries. The main reason is that they are not that smart. They rely on models that are sets of static rules and values. Changing the rules or values usually requires retraining the model, which is slow and costly. Avoiding this, that is, doing without language models, is the fundamental idea of our library.

AIF learns the language from the input text. It does not need language models, because it gets all the necessary information about the language from the text itself. The only important requirement is that the input text be longer than 20 sentences.

So how does AIF break the text into sentences?

To select the characters that split the text into sentences, we developed a special formula: for each character, we calculate the probability that it is a separator.
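The formula itself is not given in this article, so the sketch below substitutes a deliberately simple stand-in to show what "a probability per character" can look like. It is an assumption, not AIF's formula: for every character that ends a token, it computes the fraction of cases where the next token starts with an upper-case letter. The class name SeparatorScore is hypothetical, and the reliance on letter case means this stand-in is not truly language-independent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in heuristic; NOT the formula used by AIF.
final class SeparatorScore {
    // For every character that ends a token (the last token is skipped, since
    // nothing follows it), returns the fraction of those occurrences where the
    // following token starts with an upper-case letter.
    static Map<Character, Double> score(String text) {
        String[] tokens = text.trim().split("\\s+");
        Map<Character, Integer> endCount = new HashMap<>();
        Map<Character, Integer> beforeUpper = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            char last = tokens[i].charAt(tokens[i].length() - 1);
            endCount.merge(last, 1, Integer::sum);
            if (Character.isUpperCase(tokens[i + 1].charAt(0))) {
                beforeUpper.merge(last, 1, Integer::sum);
            }
        }
        Map<Character, Double> result = new HashMap<>();
        for (Map.Entry<Character, Integer> e : endCount.entrySet()) {
            int hits = beforeUpper.getOrDefault(e.getKey(), 0);
            result.put(e.getKey(), hits / (double) e.getValue());
        }
        return result;
    }
}
```

On "It rains. We stay home, we read. Then we sleep." the score of "." is 1.0 and the score of "," is 0.0. Note that nothing in the code says that "." is special; the ranking comes from the text alone.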

The results of calculating the probability that a symbol is used to separate sentences are given below.

Example 1 (The Legendary History of the Cross)

archive.org/details/legendaryhistor00veld

This chart displays the symbols that are most likely to be used to separate sentences.

Example 2 (Punch, Or the London Charivari, Volume 107, December 8th, 1894)

www.gutenberg.org/ebooks/46816

This chart displays the symbols that are most likely to be used to separate sentences.

Example 3 (William S. Burroughs, Naked Lunch)

en.wikipedia.org/wiki/Naked_Lunch

This chart displays the symbols that are most likely to be used to separate sentences.

Of course, these probabilities alone are not the final result. You still need to find the threshold that divides the characters into “delimiters” and “other characters”. You also need to be able to split the delimiters into groups: those that divide the text into sentences and those that divide a sentence into parts.
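One simple way to place such a threshold, shown here purely as an assumption and not as the method AIF uses, is to sort the scores in descending order and cut at the widest gap between neighbours. The class name SeparatorCutoff is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical "largest gap" cutoff; an assumption, not AIF's algorithm.
final class SeparatorCutoff {
    // Sorts characters by score (highest first), finds the widest gap between
    // two neighbouring scores, and returns the characters above that gap.
    static Set<Character> selectDelimiters(Map<Character, Double> scores) {
        List<Map.Entry<Character, Double>> sorted = new ArrayList<>(scores.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        int cut = sorted.size(); // fewer than two entries: no gap, keep everything
        double widest = -1.0;
        for (int i = 0; i + 1 < sorted.size(); i++) {
            double gap = sorted.get(i).getValue() - sorted.get(i + 1).getValue();
            if (gap > widest) {
                widest = gap;
                cut = i + 1;
            }
        }
        Set<Character> delimiters = new TreeSet<>();
        for (int i = 0; i < cut; i++) {
            delimiters.add(sorted.get(i).getKey());
        }
        return delimiters;
    }
}
```

With made-up scores such as "." 0.92, "!" 0.85, "?" 0.80, "," 0.10 and ";" 0.08, the widest gap lies between 0.80 and 0.10, so the method returns ".", "!" and "?". Applying the same step again to the remaining characters would be one possible way to form further groups.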

The results are easy to reproduce with the CLI, which is built on our library.

Simplest CLI for AIF

GitHub link: github.com/b0noI/aif-cli/wiki
Download: s3.amazonaws.com/aif2/aif-cli/1.0/aif-cli.jar

You can use it in the following way:

java -jar aif-cli.jar <key> <path_to_txt_file>

For example, you can divide your text into sentences using the command:

 java -jar aif-cli.jar —ssplit <path_to_txt_file>

Or tokens:

 java -jar aif-cli.jar —tsplit <path_to_txt_file>

Or you can output characters with the highest probability that they are sentence separators:

 java -jar aif-cli.jar —ess <path_to_txt_file>

Using the AIF library

You can start using version Alpha 1 of our library in your project. To do so, just add our Maven repository to the project. Instructions can be found here: github.com/b0noI/AIF2/wiki

Currently only two functions are available:

splitting text into tokens (description);
splitting tokens into sentences (description).

What is planned in the next version?

In the first Alpha, we do not divide the separator characters into groups such as:

Group 1: . ! ?
Group 2: “ ; ' ( )
Group 3: , :

For now we treat all "delimiters" as if they were in group 1. Starting with Alpha 2, however, they will be divided into groups (that's right, our library can group the "separator characters" without a language model!).

Alpha 2 will also introduce a lemmatization module that extracts lemmas from the text. Again, this module will work completely independently of the language! AIF will be able to extract lemmas like this:

car, cars, car's, cars' => car

Since semantic analysis will NOT be implemented in Alpha 2, we will not be able to extract lemmas like this:

am, are, is => be

But even this problem can be solved in a language-independent way, and it will be solved in future releases.
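To make the difference between the two kinds of lemmas concrete, here is a minimal sketch of purely surface-based grouping. It is an assumption about how such grouping could work, not the Alpha 2 module: each token is mapped to the shortest known token that is its prefix, as long as the leftover suffix is short. PrefixLemmas and maxSuffix are hypothetical names.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical surface-form grouping; an assumption, not AIF's lemmatizer.
final class PrefixLemmas {
    // Maps each token to the shortest token of the vocabulary that is its
    // prefix and leaves a suffix of at most maxSuffix characters; a token
    // with no such prefix is its own lemma.
    static Map<String, String> group(Collection<String> tokens, int maxSuffix) {
        TreeSet<String> vocabulary = new TreeSet<>(tokens);
        Map<String, String> lemmaOf = new HashMap<>();
        for (String token : vocabulary) {
            String lemma = token;
            int shortest = Math.max(1, token.length() - maxSuffix);
            for (int len = shortest; len < token.length(); len++) {
                String prefix = token.substring(0, len);
                if (vocabulary.contains(prefix)) {
                    lemma = prefix;
                    break;
                }
            }
            lemmaOf.put(token, lemma);
        }
        return lemmaOf;
    }
}
```

With the tokens car, cars, car's, cars' and maxSuffix set to 2, every form maps to car. The same code can never map am, are and is to be, because those forms share no surface prefix, and it would also wrongly merge unrelated words such as care and car. Both limits show why the second kind of lemma needs more than character statistics.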

What is planned in the next article?

a comparison of sentence-splitting quality against other key libraries;
a description of the algorithm for selecting the characters that split the text into sentences;
a description of the algorithm for dividing those characters into groups (those that divide the text into sentences and those that divide the sentences themselves).

Afterword

Of course, the current implementation does not work equally well for all languages. For example, Japanese and other languages that do not use spaces are still beyond what AIF can handle.

Our team

Kovalevskyi Viacheslav - algorithm, design, team lead (viacheslav@b0noi.com / @b0noi)
Ifthikhan Nazeem - algorithm designer, architecture designer, developer
Evgeniy Dolgikh (marcon@atsy.org.ua, marcon) - QA assistance, junior developer
Siarhei Varachai - QA assistance, junior developer
Balenko Aleksey (podorozhnick@gmail.com) - worked on Sentence Splitters for tests (using Stanford NLP and AIF NLP), added tokenization support for CLI, junior developer
Sviatoslav Glushchenko - REST design and implementation, developer
Oleg Kozlovskyi - QA (integration and quality testing), developer

If you have an interesting NLP project, contact us;)

Project links and details

project language: Java (JDK 8)
license: MIT license
issue tracker: github.com/b0noI/AIF2/issues
wiki: github.com/b0noI/AIF2/wiki
source code: github.com/b0noI/AIF2
developers mail list: aif2-dev@yahoogroups.com (subscribe: aif2-dev-subscribe@yahoogroups.com)

Afterword ^ 2

Honestly, the library is not entirely new. At the start of my Candidate of Sciences (PhD) work, I already published some of the algorithms in raw form and even wrote an article about it on Habr. Since then a lot of water has flowed under the bridge: many hypotheses have been confirmed and many rejected. There was an urgent need for a new implementation that embodies the accumulated and proven hypotheses in the field of NLP.

This time we managed to attract more developers to the project, and we are trying to approach development more systematically than last time. It has also turned out to be a very good project where students of my Java course at Hexlet can get real experience developing a Java project in a team ;)

Source: https://habr.com/ru/post/238359/
