
Text Segmentation Algorithms

Hello.

While analyzing data from Twitter, I ran into the problem of processing hashtags: a hashtag had to be split into separate words (#habratopic => habra topic). The task seemed trivial, but it turned out I had underestimated it. I had to try several algorithms before finding the one that worked.

This article can be read as a chronology of solving the problem, with an analysis of the advantages and disadvantages of each algorithm used. So if the topic interests you, read on.
It should be noted that the task of splitting a long text without spaces comes up very often in NLP: recognizing the parts of German "long words", which are in fact concatenations of several words ( geschwindigkeitsbegrenzung - speed limit ), recognizing words in Chinese writing, where spaces are rarely used ( 城市 人 的 心爱 宠物 - a favorite pet of city dwellers ), and so on. In the Chinese case, the simplest algorithm, which treats every character as a separate word, works acceptably well, but with German everything is much more complicated.

Algorithm 1. Minimum Matching


We walk through the string and take the first dictionary word that matches. We save this word and repeat the procedure on the remainder of the string. If no word matches the final remainder, we assume that no segmentation was found.
The algorithm is very fast (just what we need), but very dumb.
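
A minimal sketch of this approach in Python (the toy dictionary here is only for illustration; a real one would be loaded from a file):

```python
# Minimum matching sketch: repeatedly take the shortest dictionary word
# that starts the remaining string. Toy dictionary for illustration only.
DICTIONARY = {"a", "the", "we", "nice", "day", "weather"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def min_match(s):
    """Return a list of words, or None if no segmentation is found."""
    words = []
    while s:
        for i in range(1, min(len(s), MAX_WORD_LEN) + 1):
            if s[:i] in DICTIONARY:
                words.append(s[:i])
                s = s[i:]
                break
        else:  # no prefix matched
            return None
    return words

print(min_match("niceday"))      # ['nice', 'day']
print(min_match("niceweather"))  # None: nice + we + a + the + 'r' dead-ends
```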

Example: niceday => nice day .
But for niceweather no segmentation will be found: after finding the word nice , the algorithm picks we , then the article a , then the , and the word r does not exist in our dictionary.
Hmm. What prevents us from taking the longest matching word instead of the first one that matches?

Algorithm 2. Maximum Matching or Greedy


We do everything the same as in the first case, but always choose the word with the maximum length.
The algorithm is slower than the previous one, since we have to scan from the end of the string to find the longest matching word first. The speed drops noticeably on very long strings, but since our data comes from Twitter, we can ignore the problem. (In fact, if you look not at the whole string but only at its first n characters, where n is the maximum word length in the dictionary, the speed is on average the same as that of the first algorithm.)
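
A sketch of the greedy variant, with the same kind of toy dictionary and the first-n-characters trick mentioned above:

```python
# Greedy (maximum) matching sketch: always take the longest dictionary
# word that starts the remaining string. Toy dictionary for illustration.
DICTIONARY = {"nice", "weather", "work", "working", "grass"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def max_match(s):
    """Return a list of words, or None if no segmentation is found."""
    words = []
    while s:
        # only the first MAX_WORD_LEN characters need to be checked
        for i in range(min(len(s), MAX_WORD_LEN), 0, -1):
            if s[:i] in DICTIONARY:
                words.append(s[:i])
                s = s[i:]
                break
        else:  # no prefix matched
            return None
    return words

print(max_match("niceweather"))  # ['nice', 'weather']
print(max_match("workingrass"))  # None: 'working' swallows the 'g' of 'grass'
```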

Example: niceweather => nice weather
But for workingrass no segmentation will be found again. The first word our algorithm matches is working rather than work , and it also swallows the first letter of the word grass .
Maybe the two algorithms should be combined somehow? But then what do we do with the string niceweatherwhenworkingrass ? In the end, we came to brute force.

Algorithm 3. Bruteforce


We generate all possible ways to split the string into words with an ordinary recursive function. There are 2^(N-1) such options, where N is the length of the string. Then we filter out the options in which at least one substring is not in the dictionary; whatever remains is the answer. The main problem of the algorithm is speed.
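
A rough sketch of this naive version (toy dictionary again):

```python
# Naive brute force sketch: enumerate every one of the 2^(N-1) splits,
# then keep only those in which every piece is a dictionary word.
DICTIONARY = {"work", "working", "in", "grass"}

def all_splits(s):
    """Yield every possible split of s into non-empty pieces."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in all_splits(s[i:]):
            yield [s[:i]] + rest

def brute_force(s):
    return [split for split in all_splits(s)
            if all(piece in DICTIONARY for piece in split)]

print(brute_force("workingrass"))  # [['work', 'in', 'grass']]
```
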
Stop! Why generate everything and then filter, if we can generate only what we need right away?

Algorithm 4. Clever Bruteforce


We modify our recursive function so that a recursive call happens only after we have matched a dictionary word. This way, only valid segmentations are generated. The algorithm is very fast, it gives the desired result, and at first I thought the problem was solved.
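
A sketch of the same recursion, now cut down so that only dictionary words trigger the recursive call:

```python
# "Clever" brute force sketch: recurse only on prefixes that are
# dictionary words, so only valid segmentations are ever built.
DICTIONARY = {"expert", "experts", "sex", "change", "exchange"}

def segmentations(s):
    """Return every segmentation of s into dictionary words."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in DICTIONARY:  # only valid words trigger recursion
            for rest in segmentations(s[i:]):
                results.append([prefix] + rest)
    return results

print(segmentations("expertsexchange"))
# [['expert', 'sex', 'change'], ['experts', 'exchange']]
```
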
Unfortunately, I had overlooked ambiguity. The thing is, the segmentation of a string is not unique, and there are cases with dozens of equivalent partitions.

Example: expertsexchange => ( expert sex change , experts exchange )

A new subtask appeared: how do we choose the "correct" segmentation?
I tried taking the first one, a random one, the last one, the one with the most words, the one with the fewest words, and the results were, to put it mildly, not great. I needed a smarter algorithm. I can tell that dwarfstealorcore is most likely "a dwarf stealing the orcs' ore" rather than "a dwarf stealing or the core", which means the machine has to be able to tell that too. This is where machine learning algorithms came to the rescue.

Algorithm 5. Clever Bruteforce with ambiguity resolving (unigram model)


To teach our program to resolve ambiguities, we feed it a large text file (the train set), from which it builds a model. In our case this is a unigram model : the frequency with which each word occurs in the text. Then, for each segmentation candidate, we compute its probability as the product of the probabilities of its words. The candidate with the highest probability wins. It's that simple.
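
A sketch of the unigram scoring, assuming the frequencies are collected from a hypothetical train.txt and using a small ad-hoc probability for unseen words:

```python
# Unigram scoring sketch: word frequencies come from a training text;
# each candidate is scored by the product of its words' probabilities.
# The file name "train.txt" and the smoothing value are assumptions.
import math
from collections import Counter

with open("train.txt") as f:           # hypothetical train set
    counts = Counter(f.read().lower().split())
total = sum(counts.values())

def unigram_prob(word):
    # unseen words get a tiny non-zero probability so a single unknown
    # word does not zero out the whole candidate
    return counts.get(word, 0.5) / total

def score(candidate):
    return math.prod(unigram_prob(w) for w in candidate)

candidates = [["in", "put"], ["input"]]
print(max(candidates, key=score))
# with a typical corpus, 'in' and 'put' are frequent enough that
# ['in', 'put'] beats ['input']
```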

Example: input => in put
Surprised? The words in and put occur very frequently in the text, while the word input occurs only once. The unigram model knows nothing even about the most primitive relations between words (for English, the combination in put is unlikely).

Algorithm 6. Clever Bruteforce with ambiguity resolving (bigram model)


Everything is the same, except that now we build a bigram language model. This means we count not the frequencies of individual words but the frequencies of all pairs of consecutive words. For example, the sentence " Kiev is the capital of Ukraine " is split into 5 bigrams: Kiev is , is the , the capital , capital of , of Ukraine . With this approach, the model "understands" at least a little which words can stand next to each other and which cannot. Now the frequency of the bigram in put in our model is zero.
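
A sketch of the bigram scoring, again assuming a hypothetical train.txt and simple add-one smoothing:

```python
# Bigram scoring sketch: count consecutive word pairs in the training
# text and score a candidate by the product of smoothed bigram
# probabilities. File name and add-one smoothing are assumptions.
from collections import Counter

with open("train.txt") as f:           # hypothetical train set
    words = f.read().lower().split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
vocab_size = len(unigrams)

def bigram_prob(w1, w2):
    # add-one smoothing keeps unseen pairs non-zero but tiny
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

def score(candidate):
    p = 1.0
    for w1, w2 in zip(candidate, candidate[1:]):
        p *= bigram_prob(w1, w2)
    return p

# the pair 'in put' practically never occurs, so its probability is
# negligible compared to pairs that actually appear in the text
print(bigram_prob("in", "put"))
print(bigram_prob("of", "the"))
```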

Conclusions


The algorithm shows good results. Its weak point is the dictionary. Since Twitter data is mostly informal and full of people's names, place names, and so on, the dictionary rejects many suitable candidates. So one direction for improving the algorithm is to drop the dictionary and use the words from the train set instead. The second weak point is the train set itself. Since everything in ML algorithms depends on it, you need as much relevant data as possible. Here, as an option, you could build the train set from data collected from that same Twitter.

Links


The dictionary with more than 58 thousand words was taken from here.
As a train set, a file with more than a million words was found on Peter Norvig's website; there is plenty of other interesting material there as well.
All of this was implemented in Clojure, so whoever is interested can find it on github.

Source: https://habr.com/ru/post/141228/

