
Splitting text into sentences with a language-independent method, using the AIF library as an example

In the previous article we introduced our new NLP library, but only in general terms, without anything concrete. Today we will talk about the theoretical side of splitting text into sentences using language-independent algorithms. The theory will be backed by its practical implementation in the AIF library. Let's go.

Terms

Let's start by fixing the terms this article is built on.


Now let's turn to a trickier definition: the sentence. We take the Wikipedia definition:

A sentence (in language) is a unit of language, a grammatically organized combination of words (or a single word) that has semantic and intonational completeness. [1] From the point of view of punctuation, a sentence as a complete unit of speech is marked at its end with a period, an exclamation mark, a question mark, or an ellipsis.


This definition can be used almost as is, with the second (punctuation) part slightly reworked:

sentence - a group of tokens delimited by the technical symbols that are used to split the text into sentences;

Note that this deliberately leaves open exactly which characters are used to split the text into sentences.

Splitting text into sentences in a language-independent way

First, let's state the conditions of the problem:



Under these conditions, the task of splitting the text into sentences breaks down into 4 subtasks:
  1. find the technical symbols in the text;
  2. divide the found technical symbols into groups;
  3. determine which group of technical symbols separates sentences and which separates tokens within a sentence;
  4. split the text into sentences using the selected groups of symbols.


Each of these subtasks is fairly independent and can be considered in isolation from the others. Let's go through them one by one. For each step, a link to the project wiki with an English description of the algorithm is given.

1 Searching for technical symbols in the text

wiki

How can we find the technical symbols in a text?

Looking through texts in most languages, you can notice that these technical symbols are always at the end of tokens. This hypothesis immediately imposes a restriction on the language of the incoming text (the language must use spaces). Now let's think about how technical symbols can be detected if our hypothesis is correct.

The first thing that comes to mind is to calculate, for each character, the probability that it appears at the end of tokens in this text, and then select the characters with the highest probability.

This assumption breaks down on the very first experiments with different languages, which show that in many languages certain letters appear with enviable regularity at the very beginning or at the very end of tokens. This approach cannot be used so naively.

As a result, the formula for the probability that a symbol is a technical one looks like this:

(1) P1(ch) * (P2(ch))^2

Where:

P1(ch) is the probability that the character ch appears at the end of tokens in this text;
P2(ch) is the probability that the character standing right before ch in this text also appears at the end of tokens.

Let's look at an example of calculating P2. Suppose we have these characters:

. y a b

and the probabilities that they appear at the end of tokens are:
P1(.) = 0.8
P1(y) = 0.9
P1(a) = 0.01
P1(b) = 0.02

Suppose the period was encountered at the end of tokens 4 times, and it was preceded by the following characters:
y - 3 times
a - 1 time

In this case P2('.') is calculated as follows:
P2('.') = 3/4 * P1(y) + 1/4 * P1(a) = 0.75 * 0.9 + 0.25 * 0.01 = 0.6775
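To make this concrete, here is a minimal sketch of the whole calculation (this is not the AIF implementation; the class and method names are made up for illustration, and the text is assumed to be already split into tokens by spaces):

import java.util.HashMap;
import java.util.Map;

public class SeparatorScore {

    // P1(ch): the share of occurrences of ch that are at the end of a token.
    static Map<Character, Double> p1(String[] tokens) {
        Map<Character, Integer> total = new HashMap<>();
        Map<Character, Integer> atEnd = new HashMap<>();
        for (String t : tokens) {
            for (int i = 0; i < t.length(); i++) {
                total.merge(t.charAt(i), 1, Integer::sum);
                if (i == t.length() - 1) atEnd.merge(t.charAt(i), 1, Integer::sum);
            }
        }
        Map<Character, Double> result = new HashMap<>();
        total.forEach((ch, n) -> result.put(ch, atEnd.getOrDefault(ch, 0) / (double) n));
        return result;
    }

    // P2(ch): the average P1 of the characters standing right before ch when ch ends a token.
    static double p2(char ch, String[] tokens, Map<Character, Double> p1) {
        double sum = 0;
        int count = 0;
        for (String t : tokens) {
            if (t.length() > 1 && t.charAt(t.length() - 1) == ch) {
                sum += p1.getOrDefault(t.charAt(t.length() - 2), 0.0);
                count++;
            }
        }
        return count == 0 ? 0 : sum / count;
    }

    // Formula (1): P1(ch) * P2(ch)^2
    static double score(char ch, String[] tokens, Map<Character, Double> p1) {
        double pTwo = p2(ch, tokens, p1);
        return p1.getOrDefault(ch, 0.0) * pTwo * pTwo;
    }
}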

Now this is more interesting, and it even seems to work. However, if you dig into the AIF code, you will see that the analysis is a bit more involved than simply calculating probabilities with this formula.

The fact is that even this formula does not give the desired result 100% of the time, because there are languages in which words often end with the same two characters. Take, for example, a text in which the ending "ec" strongly prevails, while the character "e" by itself often appears at the end of tokens. As a result, the character "c" will be selected as a technical symbol, since it often stands at the end of tokens and is preceded by a character that also often appears at the end of tokens.

But this is also quite easy to cure: we just need to filter the technical symbols obtained with the first formula. We can filter by two simple criteria. First, the probability that the character stands at the end of a token. Yes, we said that this is a bad selection criterion, but it is an excellent filter for the characters found by our heuristic. Second, we need to remove characters that occur in the text only a handful of times. We really don't want to end up with a symbol that was encountered once in the whole text and happened to be at the end of a token. That symbol may well be a separator, but we have no way of finding out.

As a result, AIF does the following:

It selects the N symbols with the highest probability calculated by formula (1), and from them selects the N2 symbols with the highest probability P1. Choosing the right N/N2 values is still an open task, but we will not go into its details right now, since for the moment AIF picks these numbers more or less out of thin air. We still have several hypotheses on this subject to test in practice.
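As a rough sketch of that two-stage selection (illustrative only; the parameter names and the minimum-occurrences filter below are our own placeholders, not the AIF API):

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the two-stage selection described above; n, n2 and minOccurrences
// are placeholders, not the values AIF actually uses.
class TechnicalSymbolSelection {

    static List<Character> select(Map<Character, Double> formula1Score, // formula (1) per character
                                  Map<Character, Double> p1,            // P1 per character
                                  Map<Character, Integer> occurrences,  // how often each character occurs
                                  int n, int n2, int minOccurrences) {
        return formula1Score.entrySet().stream()
                // drop characters that were seen too rarely to be trusted
                .filter(e -> occurrences.getOrDefault(e.getKey(), 0) >= minOccurrences)
                // stage 1: the N characters with the highest formula (1) score
                .sorted(Map.Entry.<Character, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                // stage 2: of those, the N2 characters with the highest P1
                .sorted(Comparator.comparing((Character ch) -> p1.getOrDefault(ch, 0.0)).reversed())
                .limit(n2)
                .collect(Collectors.toList());
    }
}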

In the first article on Habr we showed the first pass of this formula over a text. I just want to say that the formula has changed since then and now works more precisely (much more precisely).

And now let's talk a little about the quality of the results.

An analysis of the quality of this approach shows the following results (on 186 books):
12 books out of 186 in which the period could not be found;
2 books in which the comma could not be found;
3 books in which a letter was incorrectly selected as a technical symbol.

2 Dividing technical symbols into groups

wiki

At the previous stage, we were able to define a list of technical symbols. The list might look like this:

.,!?:; () ”-

Now you need to divide this list into groups. In this example, the characters can be divided into groups as follows:

.!? () ”
-:;,

The task of dividing the symbols into groups is solved with a simple hypothesis: in every language there are symbols that are used more often at the beginning of sentences. Accordingly, for each technical symbol we can build a graph of connections with the symbols that follow it.

As a result, for each separator character we get approximately the following connection graph:

Connections of the period:
.
P = 32, S = 16, T = 63, F = 15, W = 44, {= 67, N = 14

Connections of the comma:
,
p = 47, a = 59, b = 84, c = 52, s = 102, d = 76, t = 295, e = 49, w = 318, m = 59

Connections of the Latin letter 'y':
y
s = 3, t = 2, w = 6

These connections are taken from a real text, and they have already been filtered: for its analysis AIF keeps only the most significant connections. Without filtering, the graph would look different.

Of course, the absolute counts are of no interest to us by themselves, so these connections are converted into probabilities.
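Here is a sketch of how such a connection graph could be built for one technical symbol (illustrative code, not the AIF internals): count the first characters of the tokens that follow a token ending with this symbol, then normalize the counts.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: build a "connection graph" for one technical symbol - which characters
// tend to start the token that follows a token ending with this symbol.
class ConnectionGraph {

    static Map<Character, Double> build(char technicalSymbol, String[] tokens) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            String current = tokens[i];
            String next = tokens[i + 1];
            if (!current.isEmpty() && !next.isEmpty()
                    && current.charAt(current.length() - 1) == technicalSymbol) {
                counts.merge(next.charAt(0), 1, Integer::sum);
            }
        }
        // turn absolute counts into probabilities (normalizing by the maximum, as in the
        // examples later in the article, works just as well)
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<Character, Double> graph = new HashMap<>();
        counts.forEach((ch, count) -> graph.put(ch, total == 0 ? 0.0 : count / total));
        return graph;
    }
}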

Having such a graph of connections for each character, we can divide the characters into two clusters; for this, it is only necessary to define an algorithm for comparing two graphs.

The implementation of graph comparison in AIF is best considered by example. Suppose we have two connection graphs:
.
A 0.2
P 0.6
a 0.05
n 0.15

,
a 0.4
p 0.6

And now the comparison algorithm:

P(ch1, ch2) = ch1.connections.keySet().stream()
        .mapToDouble(key -> (ch1.connections.get(key) + ch2.connections.getOrDefault(key, 0.0)) / 2)
        .sum() / ch1.connections.size()

Something like that. Yes, there are some nuances; for example, P(ch1, ch2) may not be equal to P(ch2, ch1). But no one said that Alpha2 is the final version ;)
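For illustration, if a missing connection is treated as zero (our assumption here, not necessarily what AIF does), the two example graphs above give:

P('.', ',') = ((0.2 + 0)/2 + (0.6 + 0)/2 + (0.05 + 0.4)/2 + (0.15 + 0)/2) / 4 = 0.175
P(',', '.') = ((0.4 + 0.05)/2 + (0.6 + 0)/2) / 2 ≈ 0.263

which also shows where the asymmetry mentioned above comes from: the result depends on whose connection keys we iterate over.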

This approach works especially well in languages that have letter case (i.e., capital and small letters). It works slightly worse, but still effectively, in Arabic and similar languages.

The quality of this module:
about 0.5 books per 100 books in which the period and the comma end up in the same group.

Of course, a lot is still left behind the scenes at this point. For example, above what threshold of P(ch1, ch2) do we consider two graphs equal, and the two symbols that produced them members of one cluster?

Examples of groups from randomly chosen books:
book 1: [“; ,] [.]
book 2: [! .] [:; ,]

3 Determining which group of technical symbols is responsible for splitting the text into sentences

wiki

Here everything is quite simple. We map the array of tokens to a new array in which each token is replaced by the number of the group of the technical symbol it ends with. If a token does not end with a technical symbol, we assign it to the zero group. For example, suppose we have these tokens:

Hello! How are you? I, frankly, think everything is fine. No?

And suppose at the previous stage we obtained the following groups of separators:
group 1 - [! ? .]
group 2 - [,]

Accordingly, after the mapping we get the following array:
1 0 0 1 2 2 0 0 0 1 1
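A minimal sketch of this mapping (illustrative code; the groups themselves come from the previous step):

import java.util.List;
import java.util.Set;

// Illustrative sketch: map each token to the number of the separator group its last
// character belongs to, or to 0 if the token does not end with a technical symbol.
class TokenGroupMapper {

    static int[] mapToGroups(String[] tokens, List<Set<Character>> groups) {
        int[] result = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].isEmpty()) continue;
            char last = tokens[i].charAt(tokens[i].length() - 1);
            result[i] = 0; // zero group: no technical symbol at the end
            for (int g = 0; g < groups.size(); g++) {
                if (groups.get(g).contains(last)) {
                    result[i] = g + 1; // groups are numbered starting from 1
                    break;
                }
            }
        }
        return result;
    }
}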

Having such an array, we can build connections between a group (for example, the first group) and the symbols with which the next token begins.

Here is an example: the connections of the first group with symbols, taken from a real book (the connections are normalized with respect to the maximum):

{A = 0.36, B = 0.26, "= 0.39, C = 0.10, D = 0.08, E = 0.1, H = 0.37, I = 0.72, L = 0.08, M = 0.14, N = 0.1, O = 0.15, S = 0.26, T = 1.0, W = 0.26, Y = 0.08}

And here are the connections for the second group:

{A = 0.05, a = 1.0, "= 0.11, b = 0.24, c = 0.08, d = 0.06, e = 0.05, f = 0.11, h = 0.17, I = 0.09, i = 0.22, m = 0.07, n = 0.06, o = 0.16, p = 0.05, s = 0.21, t = 0.40, w = 0.35}

As last time, AIF filters the connections, keeping only the most significant ones, so before filtering the connections would look slightly different.

And here are the connections of the zero group:

{a = 0.61, b = 0.25, c = 0.26, d = 0.18, e = 0.15, f = 0.26, g = 0.12, h = 0.41, I = 0.05, i = 0.35, l = 0.17, m = 0.26, n = 0.13, o = 0.48, p = 0.22, r = 0.14, s = 0.42, t = 1.0, u = 0.07, v = 0.06, w = 0.43, y = 0.06}

Having this data, we only need to calculate which of the graphs is closer to the graph of the zero group. The technical symbols forming that group (the one closest to the zero group) are the symbols used to separate tokens within a sentence. The graph that is farthest from the zero-group graph, in turn, is formed by the group of technical symbols responsible for splitting the text into sentences.

In this example, the first group is the group of technical symbols that splits the text into sentences, and the second group contains the symbols that separate tokens within a sentence.
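A sketch of that final decision, reusing the comparison function from step 2 (again illustrative and simplified, not the AIF API):

import java.util.Map;

// Illustrative sketch: the group whose connection graph is farthest from the zero group's
// graph is taken to be the group that splits the text into sentences.
class SentenceGroupDetector {

    static int sentenceSeparatorGroup(Map<Character, Double> group1Graph,
                                      Map<Character, Double> group2Graph,
                                      Map<Character, Double> zeroGroupGraph) {
        double s1 = similarity(group1Graph, zeroGroupGraph);
        double s2 = similarity(group2Graph, zeroGroupGraph);
        return s1 < s2 ? 1 : 2; // the less similar group is the one that ends sentences
    }

    // the same comparison as in step 2: a missing connection is treated as zero
    static double similarity(Map<Character, Double> a, Map<Character, Double> b) {
        return a.keySet().stream()
                .mapToDouble(key -> (a.get(key) + b.getOrDefault(key, 0.0)) / 2)
                .sum() / a.size();
    }
}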

An analysis of the quality of the results shows:
4 books out of 180 in which the period was assigned to the wrong group;
2 books out of 180 in which the comma was assigned to the wrong group.

4 Splitting the text into sentences using the selected groups of symbols

wiki

All right. We have the groups of symbols and we know which group is responsible for what. But how do we now determine in which cases the symbols of the first group are actually used to end a sentence, and in which they are not?

For example, how do we tell that the period at the end of "SSSR." does not end a sentence but simply ends an abbreviation? The same goes for abbreviations like "tysch.", etc.

And again, it's simple.

Knowing the symbols that belong to the first group, we can go over all the tokens and calculate the average sentence length. After that, for each occurrence of a character from the first group, we can:
look at the first character of the next token and determine for which group of technical symbols this connection is characteristic;
look at the distance to the previous character of the first group and to the next one.

If the first character of the next token is characteristic of the second group of symbols, and the lengths of the previous and the next "sentence" are less than or equal to the average sentence length in the text, then in this particular case the symbol of the first group is NOT used to end a sentence.
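A rough sketch of this rule (illustrative; the way a connection is judged "characteristic of the second group" and the exact thresholds are simplified here):

import java.util.Map;

// Illustrative sketch of the rule above: an occurrence of a first-group symbol is kept as a
// sentence boundary unless the next token's first character looks like an in-sentence
// connection AND both neighbouring "sentences" would be no longer than the average.
class SentenceBoundaryCheck {

    static boolean isSentenceBoundary(char nextTokenFirstChar,
                                      int tokensSincePreviousFirstGroupSymbol,
                                      int tokensUntilNextFirstGroupSymbol,
                                      double averageSentenceLength,
                                      Map<Character, Double> firstGroupGraph,
                                      Map<Character, Double> secondGroupGraph) {
        boolean looksLikeInSentenceConnection =
                secondGroupGraph.getOrDefault(nextTokenFirstChar, 0.0)
                        > firstGroupGraph.getOrDefault(nextTokenFirstChar, 0.0);
        boolean neighbouringSentencesShort =
                tokensSincePreviousFirstGroupSymbol <= averageSentenceLength
                        && tokensUntilNextFirstGroupSymbol <= averageSentenceLength;
        // if both conditions hold, this particular symbol does NOT end a sentence
        return !(looksLikeInSentenceConnection && neighbouringSentencesShort);
    }
}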

Examples from a book of the places found by AIF where a separator of the first group was re-marked as a separator of the second group:

"The Last Uprising" at Seoul International Biennale
"This killer ... he's so soft and
Eldar Ryazanov. “Irony of Fate”, “Office Romance”
5-7 minutes can also be easy
1 minute. hot water (+ 37–38 °), 1
5–10 sec. - cold water(
5 thousand years ago and performed

Looking at these results, it is quite amusing to realize that all of this works in a completely language-independent way, without any input other than the text itself. Here is an example from another book, in English:

by US federal laws and
the us unless a copyright

Unfortunately, we do not yet have quality metrics for this step =(

Afterword

Everything shown here is only Alpha2; we have not yet even started working on quality and speed.

According to the Alpha3 plans, in the next release we will focus on quality and speed, as well as refactoring. The only new library feature will be building a dictionary of the input text's language.

The algorithm itself and library APIs are detailed on the project wiki: github.com/b0noI/AIF2/wiki

Along with the library we also updated the simple console client, AIF-CLI, which can now show the detected technical symbols and the groups of these symbols.

Our team

Kovalevskyi Viacheslav - algorithm, design, team lead (viacheslav@b0noi.com / b0noi)
Ifthikhan Nazeem - algorithm designer, architecture designer, developer
Sviatoslav Glushchenko - REST design and implementation, developer
Oleg Kozlovskyi - QA (integration and quality testing), developer
Balenko Aleksey (podorozhnick@gmail.com) - added stemmer support to CLI, junior developer
Evgeniy Dolgikh - QA assistance, junior developer

PS: If you have an interesting NLP project, contact us;)

Project links and details

project language: Java 8
license: MIT license
issue tracker: github.com/b0noI/AIF2/issues
wiki: github.com/b0noI/AIF2/wiki
source code: github.com/b0noI/AIF2
developers mail list: aif2-dev@yahoogroups.com (subscribe: aif2-dev-subscribe@yahoogroups.com)

Test conditions

All tests were conducted on 186 books in the following languages:

Source: https://habr.com/ru/post/242147/

