When working with text, it is often necessary to arrange hyphenation correctly. The task at first glance is not so obvious, you need to take into account the peculiarities of each language in order to decide where to break the word. How to formalize such requirements correctly, and how then to apply them in the algorithm? One of the most common decisions to this day was proposed by Franklin Mark Liang, a student of the famous professor Donald Knuth. The algorithm is called “The Lyan-Knuth Algorithm”, it is used in the
TeX publishing system, the author of which is also D. Knut.
The algorithm is based on comparing the source word with a set of rules (patterns). The more rules and the better they are made, the better the spreads will be arranged. In the TeX package, you can find ready-made free rule sets for many languages, you only need to carefully look at the terms of use and distribution.
Example rules:
when
at 3v
2i1ve
.
Each rule consists of letters and numbers between them, as well as numbers at the beginning and end. The number 0 is usually omitted. For example, the first rule should be understood as
0001 . The sequence of letters is the part of the word for which the translation is determined, i.e. This sequence must be present in the word. The numbers are called "level", they set the priority between the rules and the possibility of transfer in the corresponding position. Even numbers, including 0, prohibit the transfer. Odd - allow. A dot at the beginning of a rule means that the rule applies only if the sequence is at the beginning of a word. Similarly, with a dot at the end - the word must end with this sequence. If there is a point at the beginning and at the end, then the rule contains the entire word.
The main stages of the algorithm:
- Select all the rules that match the selected word and for each position in the word get a set of levels (as many rules fell on one position, so many levels we get).
- In each position, select the maximum level. If it is even, it is impossible to transfer here, if odd is a valid transfer point.
- Trim obviously unacceptable hyphens (for example, one letter at the beginning or at the end).
Let's look at the work of the algorithm on an example:
Source word:
algorithmA set of rules (taken from TeX):
lgo1
1g
o1ri
Ń–1
u2tm
tm2
Match the word with all the rules and choose the highest levels:

In positions with level 1, you can safely put the transfer. We get the result
"al-rhythm" .
')
Implementation
Now we implement this algorithm in C ++. I needed a working algorithm for use in iOS, so I did everything as a C-Shn interface. The module is written without reference to any locale or platform, and can be used anywhere.
The rule will be stored as follows:
struct pattern_t { std::basic_string<unichar> str; std::vector<unsigned char> levels; };
We will convert each rule into the form “pure sequence of characters” + “level set” so that it is convenient to apply it in the future.
Rule set:
struct pattern_list_t { std::vector<pattern_t*> list; };
The code for pulling levels out of the rules is simple, it can be viewed in full source at the link at the end of the article.
After we fill in the list of rules, it must be sorted in order to ensure the correct and efficient operation of the algorithm. We write our function less and apply the standard sorting algorithm:
bool pattern_compare(const pattern_t* a, const pattern_t* b) { bool first = a->str.size() < b->str.size(); size_t min_size = first ? a->str.size() : b->str.size(); for (size_t i = 0; i < min_size; ++i) { if (a->str[i] < b->str[i]) return true; else if (a->str[i] > b->str[i]) return false; } return first; } void sort_pattern_list(pattern_list_t* pattern_list) { if (!pattern_list) return; std::sort(pattern_list->list.begin(), pattern_list->list.end(), pattern_compare); }
Now directly the algorithm for finding carries:
std::vector<unsigned char> levels; levels.assign(word_string.size(), 0); for (size_t i = 0; i < word_string.size()-2; ++i) { std::vector<pattern_t*>::const_iterator pattern_iter = pattern_list->list.begin(); for (size_t count = 1; count <= word_string.size()-i; ++count) { pattern_t pattern_from_word; pattern_from_word.str = word_string.substr(i, count); if (pattern_compare(&pattern_from_word, *pattern_iter)) continue; pattern_iter = std::lower_bound(pattern_iter, pattern_list->list.end(), &pattern_from_word, pattern_compare); if (pattern_iter == pattern_list->list.end()) break; if (!pattern_compare(&pattern_from_word, *pattern_iter)) { for (size_t level_i = 0; level_i < (*pattern_iter)->levels.size(); ++level_i) { unsigned char l = (*pattern_iter)->levels[level_i]; if (l > levels[i+level_i]) levels[i+level_i] = l; } } } }
In the string word_string we put the source word, with the added '.' along the edges, so that rules are automatically selected containing instructions on their position in the word. Now in the source word for each character
with i = 0 to N, we iterate over all the substrings beginning with
i and
from 1 to Ni . We
look for each substring in the rules vector by the standard algorithm
std :: lower_bound . We presume that the rules are sorted in the way we need and there is no need to go through everything all over again at every step. When we find a match, take a vector of levels and apply it to the current result, i.e. if the level for the current position in the rule is higher, remember it instead of the old one.
In vector levels, maximum levels of values ​​for each position are accumulated. It remains to check it for odd values.
mask_size = levels.size()-2; mask = new unsigned char[mask_size]; for (size_t i = 0; i < mask_size; ++i) { if (levels[i+1] % 2 && i) mask[i] = 1; else mask[i] = 0; }
Examples of the algorithm for the rule set from TeX:
programmer
cybernetics
scream
intuition
Sight
Hello
The ready C ++ code along with the above examples can be downloaded
here . A test example in the main.c file (encoding Windows-1251), rules in the file patterns.h.