
Strutext, a C++ text processing library

Introduction



This text is an overview of the Strutext library, conceived by the author as a set of efficient C++ algorithms for linguistic text processing. The library code lives in a repository on GitHub. The library is open source and distributed under the Apache License 2.0, i.e. it can be used completely free of charge and without any significant restrictions.


The last commit to the repository was made on February 16, 2014; in other words, the library has not been developed for more than six months. Nevertheless, it seems to the author that the library already contains enough interesting implemented algorithms to be useful to developers working on natural-language text processing.

The Strutext build is based on CMake. Almost all of the library code is written in C++03, though there is also a small script written in Python 2.7. The library uses Boost version 1.53 or higher, as well as the Log4Cplus library for logging. At the moment the library builds under Linux with the gcc compiler, but it seems to the author that getting it to build under Windows would not be difficult.

A legitimate question will probably arise: why is such a library needed when there is, for example, ICU? From the author's point of view, some components of that library are not implemented efficiently enough, and its implementation style is either low-level C or Java-like, which makes it inconvenient to use ICU effectively from C++. In addition, ICU does not implement such an important component as morphology. One of the main advantages of the Strutext library is its efficient text search algorithms, built on top of the implemented finite-state machine library. In general, ICU covers only one area, the processing of UNICODE characters, and in this sense the Strutext library offers more.

Library structure



Strutext is intended as a tool for processing natural-language texts at the various levels of language representation. Usually the following levels are distinguished:

There are also other levels of language representation, for example the pragmatic one, but they will not be discussed here.

The Strutext library currently implements the character and lexical levels of language representation. The main components of the library:


Describing every component would take quite a lot of space, so only the symbols component is described in the text below. If the audience expresses interest in the library, the author will gladly continue the description in further articles. The component description can also be treated as an introduction to the corresponding programming area, i.e. a tutorial on how such a text processing component can be implemented. In this sense, the author has tried not to limit himself to a plain description of the library interface, but also to present the ideas and implementation methods underlying it.

The symbol processing library



A bit about UNICODE



As is known, the UNICODE consortium was created to develop a standard for representing the characters of the world's languages in computer memory. Since its formation, the consortium has released seven versions of the standard. In machine memory, characters can be encoded in different ways, depending on the purpose. For example, to represent characters in files, where it is important to save space, the UTF-8 encoding is used. If specific language features are not needed, the two-byte UTF-16 representation can be used. To cover the full range of the UNICODE specification, it is better to use the four-byte UTF-32 representation.
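To make the size difference concrete, here is a small self-contained illustration (not part of Strutext, which itself sticks to C++03; the snippet uses C++11 string literals purely for brevity) that prints how many bytes the same code point occupies in each encoding form:

 #include <iostream>

 int main() {
   // U+0041 LATIN CAPITAL LETTER A and U+042F CYRILLIC CAPITAL LETTER YA.
   std::cout << (sizeof(u8"A") - 1)      << std::endl;  // 1: "A" takes one byte in UTF-8.
   std::cout << (sizeof(u8"\u042F") - 1) << std::endl;  // 2: "Я" takes two bytes in UTF-8.
   std::cout << (sizeof(u"\u042F") - 2)  << std::endl;  // 2: one 16-bit unit in UTF-16.
   std::cout << (sizeof(U"\u042F") - 4)  << std::endl;  // 4: always four bytes in UTF-32.
   return 0;
 }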

Characters (letters) in the UNICODE standard are divided into classes. There are relatively few classes; we list some of them:


One of the functions of the symbols library is to efficiently map a character's UTF-32 code to its class. To implement this functionality it is convenient to use the UnicodeData.txt file. This file enumerates all standard characters together with their four-byte codes (in hexadecimal) and their classes. The file is intended for machine processing, but it is also human-readable. For example, the space character is given by the line:
0020;SPACE;Zs;0;WS;;;;;N;;;;; 


In the file, the ';' character is used as the field separator. The decimal digits, for example, are defined as follows:
 0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
 0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
 0032;DIGIT TWO;Nd;0;EN;;2;2;2;N;;;;;
 0033;DIGIT THREE;Nd;0;EN;;3;3;3;N;;;;;
 0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;
 0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
 0036;DIGIT SIX;Nd;0;EN;;6;6;6;N;;;;;
 0037;DIGIT SEVEN;Nd;0;EN;;7;7;7;N;;;;;
 0038;DIGIT EIGHT;Nd;0;EN;;8;8;8;N;;;;;
 0039;DIGIT NINE;Nd;0;EN;;9;9;9;N;;;;;


We also list several letter definitions:
 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
 0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
 0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
 0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
 0063;LATIN SMALL LETTER C;Ll;0;L;;;;;N;;;0043;;0043


It is interesting to note that for letters of the Lu and Ll classes, the codes of the corresponding lower-case (upper-case) letters are also given. This makes it possible to implement case conversion functions in the library.
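As an illustration of this file format, here is a minimal parsing sketch in C++. It is not the library's actual parser (in Strutext that job is done by a Python script at build time, as described below); the record structure and names are purely illustrative.

 #include <stdint.h>
 #include <cstdlib>
 #include <sstream>
 #include <string>
 #include <vector>

 // One record of UnicodeData.txt: the code point, the name, the class
 // ("Lu", "Ll", "Nd", "Zs", ...) and the simple case mappings, which live
 // in fields 12 and 13 and may be empty.
 struct UnicodeRecord {
   uint32_t    code;
   std::string name;
   std::string category;
   uint32_t    upper;  // 0 if no upper-case mapping is given.
   uint32_t    lower;  // 0 if no lower-case mapping is given.
 };

 UnicodeRecord ParseLine(const std::string& line) {
   // Split the line on ';'; UnicodeData.txt always has 15 fields.
   std::vector<std::string> fields;
   std::stringstream stream(line);
   std::string field;
   while (std::getline(stream, field, ';')) {
     fields.push_back(field);
   }
   fields.resize(15);

   UnicodeRecord record;
   record.code     = std::strtoul(fields[0].c_str(), NULL, 16);
   record.name     = fields[1];
   record.category = fields[2];
   record.upper    = fields[12].empty() ? 0 : std::strtoul(fields[12].c_str(), NULL, 16);
   record.lower    = fields[13].empty() ? 0 : std::strtoul(fields[13].c_str(), NULL, 16);
   return record;
 }

For the '0061;LATIN SMALL LETTER A;...' line above, this yields code 0x0061, category "Ll" and an upper-case mapping of 0x0041.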

Library implementation



The symbols library determines the class of a UNICODE character and converts characters to upper or lower case.

Classes are assigned using the SymbolClass enumeration:
 enum SymbolClass {
   UPPERCASE_LETTER = 0x00000001,
   LOWERCASE_LETTER = 0x00000002,
   TITLECASE_LETTER = 0x00000004,
   CASED_LETTER = UPPERCASE_LETTER | LOWERCASE_LETTER | TITLECASE_LETTER,
   MODIFIER_LETTER = 0x00000008,
   OTHER_LETTER = 0x00000010,
   LETTER = CASED_LETTER | MODIFIER_LETTER | OTHER_LETTER,
   NONSPACING_MARK = 0x00000020,
   ...


The enumeration values are powers of two, so they can be used as bit flags. In the library implementation, a four-byte value is stored for each character, in which the bits corresponding to its classes are set to one. Checking whether a character belongs to a particular class is then just a test of the corresponding bit:
 template <SymbolClass class_name>
 inline bool Is(const SymbolCode& code) {
   return static_cast<uint32_t>(class_name) & GetSymbolClass(code);
 }
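A quick usage sketch (assuming the declarations above are in scope; the exact header is not shown here): the class value stored for U+0041, whose class is Lu, has the UPPERCASE_LETTER bit set, so the letter checks succeed and the number check fails.

 #include <cassert>

 void BitFlagExample() {
   const SymbolCode capital_a = 0x0041;  // LATIN CAPITAL LETTER A, class Lu.
   assert(Is<UPPERCASE_LETTER>(capital_a));
   assert(Is<LETTER>(capital_a));   // LETTER is a union of the letter bits.
   assert(!Is<NUMBER>(capital_a));  // Not a digit.
 }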


For the most frequently used classes, corresponding helper functions are provided:
 inline bool IsLetter(const SymbolCode& code) { return Is<LETTER>(code); }
 inline bool IsNumber(const SymbolCode& code) { return Is<NUMBER>(code); }
 inline bool IsPunctuation(const SymbolCode& code) { return Is<PUNCTUATION>(code); }


The ToLower and ToUpper functions are used to convert to lower or upper case. Note that these functions can be applied not only to letters but to any other characters as well. For a character to which the concept of case does not apply, the function returns the same code that was passed in.
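A short usage sketch of this behaviour, with code points taken from the UnicodeData.txt fragments above (the declarations are again assumed to be in scope):

 #include <cassert>

 void CaseConversionExample() {
   assert(ToLower(0x0041) == 0x0061);  // 'A' -> 'a', per the 0041 record above.
   assert(ToUpper(0x0062) == 0x0042);  // 'b' -> 'B'.
   assert(ToUpper(0x0037) == 0x0037);  // DIGIT SEVEN has no case: returned unchanged.
   assert(IsNumber(0x0037) && !IsLetter(0x0037));
 }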

Technically, all of this is implemented quite efficiently. At the build configuration stage, a Python script is run that reads the UnicodeData.txt file and generates the symbols.cpp file, in which three arrays are defined: one for character classes, one for upper case, and one for lower case. These arrays are declared as follows:
 namespace details {

 // The symbol tables declarations.
 extern uint32_t   SYM_CLASS_TABLE[];
 extern SymbolCode SYM_UPPER_TABLE[];
 extern SymbolCode SYM_LOWER_TABLE[];

 }  // namespace details.
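For illustration, the generator's job can be sketched roughly as follows, written here in C++ rather than Python, reusing the UnicodeRecord structure from the parsing sketch above and assuming the SymbolClass and SymbolCode declarations are in scope. This is not the library's actual generation code.

 #include <stddef.h>
 #include <stdint.h>
 #include <vector>

 const uint32_t MAX_CODE = 0x200000;  // Number of entries in each table.

 void FillTables(const std::vector<UnicodeRecord>& records,
                 uint32_t* class_table,
                 SymbolCode* upper_table,
                 SymbolCode* lower_table) {
   // Defaults: no classes, and identity case mappings.
   for (uint32_t code = 0; code < MAX_CODE; ++code) {
     class_table[code] = 0;
     upper_table[code] = code;
     lower_table[code] = code;
   }
   // Fill in the characters actually listed in UnicodeData.txt.
   for (size_t i = 0; i < records.size(); ++i) {
     const UnicodeRecord& record = records[i];
     if (record.category == "Lu")      class_table[record.code] |= UPPERCASE_LETTER;
     else if (record.category == "Ll") class_table[record.code] |= LOWERCASE_LETTER;
     else if (record.category == "Lt") class_table[record.code] |= TITLECASE_LETTER;
     // ... and so on for the remaining classes.
     if (record.upper != 0) upper_table[record.code] = record.upper;
     if (record.lower != 0) lower_table[record.code] = record.lower;
   }
 }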


The upper- and lower-case conversion functions are then defined quite simply:
 inline SymbolCode ToLower(const SymbolCode& code) {
   return details::SYM_LOWER_TABLE[code];
 }

 inline SymbolCode ToUpper(const SymbolCode& code) {
   return details::SYM_UPPER_TABLE[code];
 }


The set of classes for a character is obtained with the following function:
 inline const uint32_t& GetSymbolClass(const SymbolCode& code) {
   return details::SYM_CLASS_TABLE[code];
 }


As can be seen from the definition, the array index is the symbol code, so accessing an array element requires no additional work. The size of each array is bounded by the number of character codes covered by UnicodeData.txt, currently 0x200000 (2,097,152); at four bytes per entry, each array occupies 8 MB in memory, and all three together occupy 24 MB. This seems a small price to pay.

Of course, characters are almost never stored in files in UTF-32; it is inefficient to spend 4 bytes on a character code that fits into one byte. For storage, single-byte encodings inherited from the "pre-UNICODE world" (CP1251, KOI8-R, etc.) are used, as well as UTF-8, an encoding designed specifically for efficient storage of characters in files. The Strutext library provides efficient tools for analyzing character codes in UTF-8; the encode component deals with this.
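For reference, decoding one UTF-8 sequence into a UTF-32 code can be sketched as follows. This is a simplified illustration rather than the encode component's actual interface, and handling of malformed input is omitted.

 #include <stddef.h>
 #include <stdint.h>
 #include <string>

 // Read one code point starting at position pos and advance pos past it.
 uint32_t DecodeUtf8(const std::string& text, size_t& pos) {
   unsigned char first = static_cast<unsigned char>(text[pos++]);
   if (first < 0x80) {                                                  // 0xxxxxxx: one byte (ASCII).
     return first;
   }
   uint32_t code = 0;
   size_t extra = 0;
   if ((first & 0xE0) == 0xC0)      { code = first & 0x1F; extra = 1; } // 110xxxxx: two bytes.
   else if ((first & 0xF0) == 0xE0) { code = first & 0x0F; extra = 2; } // 1110xxxx: three bytes.
   else if ((first & 0xF8) == 0xF0) { code = first & 0x07; extra = 3; } // 11110xxx: four bytes.
   for (size_t i = 0; i < extra && pos < text.size(); ++i) {
     code = (code << 6) | (static_cast<unsigned char>(text[pos++]) & 0x3F);  // 10xxxxxx continuation.
   }
   return code;
 }

For the two-byte sequence 0xD0 0xAF this returns 0x042F, CYRILLIC CAPITAL LETTER YA.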

Afterword



One of the main motivations, both for writing this text and for developing the Strutext library, was the author's desire to share his experience in developing C++ text processing applications with other developers. The library code can be viewed as a material embodiment of the author's professional experience (the author hopes that this code does not exhaust all of that experience).

Of course, the verb "to share" implies two sides: one that shares and one that is willing to accept what is shared. The first side poses no problem; detecting whether the second exists is one of the purposes of this publication. If the text gets a response, the author is ready to continue and describe the other components of the Strutext library.

Source: https://habr.com/ru/post/240805/

