In this article I will describe and demonstrate my method of segmenting a line of text into words. If you are not interested in the life of a Siberian in the tropics, you can safely skip the introduction.

Introduction
A year ago I worked as a mathematics teacher in Thailand. I taught young Jedi the power of abstraction. Teaching children is very interesting and a lot of fun. The administration, however, brings only headaches. Thai schools have a strict hierarchy: first by seniority, second by access to information. Whoever creates the information is at the top, followed by its faithful distributors, and at the bottom are those who receive it at the last moment. If you have no information, you are nobody.
Guess who ranks higher in this hierarchy: foreign teachers or fifth graders? Imagine the following:
Friday, the end of the working day, a stack of notebooks to check over the weekend. Your beloved director comes in and informs you that in two hours the whole school is leaving for a three-day excursion to a neighboring town. Curtain.
This is life in a complete information blockade. Even public information, the kind posted all over the school, is unavailable to you, because it is in Thai. You would come to work and find the school closed. The next day you ask the director why nobody told you anything. The director throws up his hands in surprise: everyone knew, the orders were hanging on the wall.
Background: Thai writing is very different from the writing of other languages, because it was created independently and relatively recently. It has 44 consonant symbols and 15 vowel symbols, which combine into 28 different forms. Vowels can be placed to the left, to the right, above, or below a consonant. On top of that there are 4 tone marks; a tone is something like the intonation with which a syllable is pronounced. Some words differ only in tone and have completely different meanings. Spaces between words are rarely used. Thai has no familiar European characters, so learning to read it is very difficult.
I had to solve the problem of reading and translation. After experimenting a little with character recognition and writing a small program, I was able to recognize the characters in a line of text. The result was a very long string that I had to translate. And here I hit a wall. There are no spaces. All the words are glued together! To translate them, I first had to separate them from each other.
Searching the Internet did not help, because the usual algorithms, such as those collected in this habr article, rely on word frequencies. To collect those statistics, you need to be able to split text into words. And to split text into words, you need the statistics. A vicious circle.
I had to invent something from scratch. I will show the result on a line in Russian (for simplicity and clarity).
Algorithm
So, we need a dictionary (preferably a spelling dictionary) and an example line: “ядочиталэтодоконцанахабре” (transliterated “yadochitaletodokontsanahabre”, that is, “I read this to the end on Habr” written without spaces).
The first step of any algorithm is visualization.
Let's build a table in which we place every possible word at the position where it starts, taking its length into account. The word “нах” (“nah”) will not be in our dictionary, but the words “хабр” (“Habr”) and “хабре” (“Habre”) will.
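The table can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_words`, the toy dictionary `WORDS`, and the Cyrillic spelling of the example line are assumptions made for the sketch, not the original program.

```python
def candidate_words(line, dictionary):
    """For every position in `line`, list the dictionary words that start there."""
    max_len = max(map(len, dictionary))
    table = []
    for start in range(len(line)):
        longest = min(max_len, len(line) - start)
        # Try every length from 1 up to the longest dictionary word that still fits.
        row = [line[start:start + n]
               for n in range(1, longest + 1)
               if line[start:start + n] in dictionary]
        table.append(row)
    return table


# Hypothetical toy dictionary, just large enough for the example line.
WORDS = {"я", "яд", "о", "до", "читал", "дочитал",
         "это", "конца", "на", "хабр", "хабре"}
LINE = "ядочиталэтодоконцанахабре"

for start, row in enumerate(candidate_words(LINE, WORDS)):
    if row:  # print only the positions where some word can start
        print(start, LINE[start], row)
```

Each row corresponds to one letter of the line, and the words in a row are ordered from shortest to longest.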
Great, now we see that there are not so many options and it’s not difficult for a programmer to sort through them all.
But what do we do in a situation like this: “яд о читал” (“poison about read”), “я до читал” (“I before read”), or “я дочитал” (“I finished reading”)? Which option should we choose?
To decide, we introduce a scoring function for a candidate partition. We award points for the letters and words used: +3 points per letter and -1 point per word. The goal is to cover as many letters as possible with as few words as possible.
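As a minimal sketch, the scoring rule can be written as a small function. The name `score` and the three candidate partitions (Cyrillic spellings of the options discussed above) are assumptions made for illustration.

```python
def score(words, letter_points=3, word_penalty=1):
    """Score a partition: +3 points per letter used, -1 point per word used."""
    letters = sum(len(word) for word in words)
    return letter_points * letters - word_penalty * len(words)


# Three ways to split the first eight letters of the example line.
options = [
    ["яд", "о", "читал"],   # "poison about read"
    ["я", "до", "читал"],   # "I before read"
    ["я", "дочитал"],       # "I finished reading"
]
for words in options:
    print(" ".join(words), score(words))

best = max(options, key=score)
```

All three options cover the same eight letters, so the one with the fewest words gets the highest score.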
And then:
The desired line wins! All that remains is to work out how to search through the options efficiently and arrive at ONE single answer.
The word “one” holds the key to our algorithm.
Suppose we have already found this final, unique solution. Then for each letter we know whether it starts a word and, if it does, how many letters that word contains.
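This means the search should happen at the level of a single letter, instead of trying all words against all others. For example, take the first letter of our line, “я” (“I”), and choose between:
“” (no word): 0 points + x1
“я” (“I”, one letter): 2 points + x1
“яд” (“poison”, two letters): 5 points + x2
Here x1 and x2 are the previous points (counted from zero), that is, the points already obtained for the part of the line that follows the chosen word.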
But how can we do this if we do not know the x values? The previous points are known only for the last letter, where they are equal to zero. So we have to go from the end!
We fill in two columns, in which we store the points and the lengths of the words used.
We add the points for each word to the previous points (red boxes) and look for the maximum.
Sleight of hand and no recursion. We get a good solution without any access to word frequencies. The algorithm is very fast and copes fully with the task of splitting Thai strings. It can easily be adapted to missing letters and unfinished words, especially if the dictionary is a spelling dictionary.
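The pass from the end of the line can be sketched as follows. This is a minimal sketch under stated assumptions, not the original program: the function name `segment`, the toy dictionary `WORDS`, and the Cyrillic spelling of the example line are all hypothetical. The two lists play the role of the two columns (points and word lengths).

```python
def segment(line, dictionary, letter_points=3, word_penalty=1):
    """Split `line` into dictionary words by scanning from the last letter to the first."""
    n = len(line)
    max_len = max(map(len, dictionary))
    points = [0] * (n + 1)   # points[i]: best score for the tail line[i:]
    lengths = [0] * (n + 1)  # lengths[i]: length of the word chosen at i (0 = no word)

    for i in range(n - 1, -1, -1):
        # The "no word" option: skip this letter and keep the previous points.
        points[i] = points[i + 1]
        lengths[i] = 0
        for k in range(1, min(max_len, n - i) + 1):
            if line[i:i + k] in dictionary:
                candidate = letter_points * k - word_penalty + points[i + k]
                if candidate > points[i]:
                    points[i] = candidate
                    lengths[i] = k

    # Read the answer off the lengths column, left to right.
    words = []
    i = 0
    while i < n:
        if lengths[i]:
            words.append(line[i:i + lengths[i]])
            i += lengths[i]
        else:
            i += 1  # this letter is not covered by any word
    return words, points[0]


# Hypothetical toy dictionary, just large enough for the example line.
WORDS = {"я", "яд", "о", "до", "читал", "дочитал",
         "это", "конца", "na", "на", "хабр", "хабре"}
LINE = "ядочиталэтодоконцанахабре"

words, total = segment(LINE, WORDS)
print(" ".join(words), total)
```

Each position is visited once, and at each position only words up to the longest dictionary entry are tried, so no recursion is needed. Letters that no word covers are simply skipped, which is one possible way to tolerate unknown fragments.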
I will discuss how to make such a dictionary in the next article. I read this article to the end on Habré