
Strutext C++ word processing library - lexical level implementation

Basic principles


This text continues the post about the Strutext C++ word processing library. It describes the implementation of the lexical level of language representation, in particular the implementation of morphology.

In the author's view, the following main tasks have to be solved when implementing a programmatic model of the lexical level of a language:
  1. Extracting meaningful chains of characters from the source text, i.e. representing the text as a sequence of words.
  2. Identifying the extracted chains as elements of lexical types.
  3. Determining the lexical attributes of an extracted chain (more on them below).

Lexical types are usually represented as finite sets of character chains that have the same meaning in sentences of the language. The elements of a lexical type are usually called word forms, the set of word forms itself is a paradigm, and the lexical type is called a word or a lemma. For example, the lexical type «мама» ("mom") consists of the word forms {мама, мамы, маме, маму, …, мамам, мамами, …}.
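
To make the terminology concrete, a lexical type can be pictured as a simple data structure. The sketch below is purely illustrative and is not a Strutext type.

#include <string>
#include <vector>

// Illustrative only: a lexical type (lemma) groups a paradigm of word forms,
// each word form carrying its own set of lexical attributes.
struct WordForm {
  std::string text;        // the surface string, e.g. "маме"
  std::string attributes;  // e.g. "noun, feminine, singular, dative"
};

struct LexicalType {
  std::string lemma;               // canonical name of the type, e.g. "мама"
  std::vector<WordForm> paradigm;  // the full set of word forms
};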

Lexical types are divided into syntactic categories, the parts of speech. A part of speech defines the role a word plays in a sentence; this role matters for deciding whether a word can occupy a given place in a sentence and, therefore, for determining the meaning of the sentence. Well-known parts of speech in Russian are the noun, adjective, verb, adverb, and so on.

Word forms of a lexical type can have properties, also called lexical attributes or lexical features. Which properties a word form has depends on the syntactic category of its lexical type. For example, case is important for nouns, while verbs have no such attribute.
Which syntactic categories are used to group lexical types and which lexical attributes they carry depend both on the language and on the particular lexical analysis model being implemented. Below we consider the AOT lexical model.

Lexical ambiguities


Ambiguities may arise in the process of extracting words from the source text. Two kinds of ambiguity are considered here:
  1. Ambiguities of the first kind arise when a character chain extracted from the text is assigned to a lexical type. Consider the sentence «мама мыла раму» ("mom washed the frame"). Here the character chain «мыла» can be a form of the verb «мыть» ("to wash") as well as a form of the noun «мыло» ("soap"). Such cases of ambiguity are also called lexical homonymy.
  2. Ambiguities of the second kind arise when cutting the source text into chains of words. In most natural languages words are separated by spaces, although this principle is sometimes violated (for example, by compounds in German). Programming languages provide interesting examples as well. Consider the expression "a >> b" in C++. In classic C this expression is parsed unambiguously: identifier "a", right bitwise shift operator ">>", identifier "b". But in recent versions of C++ the same character sequence can also mark the end of a template parameter list when the last parameter is itself a template. In that case the sequence of tokens is: identifier "a", end of a template parameter list ">", end of the enclosing template parameter list ">", identifier "b" (see the snippet after this list).
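
The difference is easy to see in a small snippet (plain standard C++, nothing Strutext-specific):

#include <vector>

int main() {
  int a = 8, b = 2;
  int c = a >> b;                     // ">>" is one token: the right bitwise shift operator

  std::vector<std::vector<int> > v1;  // pre-C++11: a space was required between the brackets
  std::vector<std::vector<int>> v2;   // since C++11 the same ">>" closes two template lists

  (void)c; (void)v1; (void)v2;        // silence unused-variable warnings
  return 0;
}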

In this text we consider only the lexical ambiguities of the first kind.

The AOT morphological dictionary model


The Strutext library implements the morphological model from AOT, so we will devote some space to describing it.

In the AOT dictionary, each lexical type is defined by two parameters:
  1. A stem string (the invariant part of the word), to which suffixes are attached to form word forms.
  2. The number of a declension paradigm, which is a list of pairs (suffix, set of lexical features); a schematic sketch of this model is given after the list.
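
Schematically, this dictionary model can be sketched as follows (an illustration of the idea only, not the library's actual data layout):

#include <stdint.h>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch of the AOT dictionary model, not Strutext's real types.
// A paradigm is a list of (suffix, attribute-set code) pairs.
typedef std::pair<std::string, std::string> SuffixEntry;  // (suffix, two-letter code)
typedef std::vector<SuffixEntry> Paradigm;

struct LexemeEntry {
  std::string stem;    // the invariant part of the word; may be empty
  uint32_t paradigm;   // index into the table of declension paradigms
};

// The whole dictionary: a paradigm table plus a list of (stem, paradigm) entries.
struct Dictionary {
  std::vector<Paradigm> paradigms;
  std::vector<LexemeEntry> lexemes;
};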


There are relatively few distinct combinations of lexical features; they are listed in a special file, and each such combination is encoded with a two-letter code. Each line of that file has the following shape:
 <two-letter code> <number> <part-of-speech code> <grammeme>,<grammeme>,...

Here the first element of each line is the two-letter code of the feature set, the third element is the part-of-speech code (С for noun, П for adjective, Г for verb, and so on), and then the codes of the grammatical features follow, separated by commas.

The dictionary description file consists of five sections, two of which are the most important: the section describing the declension paradigms and the section of stems (lexical types). Each line of the paradigm section represents one declension paradigm. In the lexical type section, each stem is given together with the line number of its declension paradigm.

Consider, for example, the word «зелёнка» ("brilliant green"). In the AOT dictionary the lexical type of this word is given by a line of the form
  ЗЕЛЕН 15 12 1 - 

Here the number 15 is the number of the declension paradigm in the paradigm section. The paradigm line is a sequence of pairs of the form
 %<suffix>*<two-letter code>%<suffix>*<two-letter code>%... 

Each pair in the paradigm is separated from the next by the "%" symbol, and the elements within a pair are separated by the "*" symbol. The first pair (ка, га) produces the word form зелен + ка = зелёнка and carries the feature set га = С жр,ед,им, that is, noun, feminine, singular, nominative. The other pairs of the paradigm are decoded in the same way.
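
Such a line is easy to split into pairs. The helper below is only a sketch under the assumption that "%" and "*" occur solely as separators; it is not the parser actually used in morpho/aot.

#include <string>
#include <utility>
#include <vector>

// Split a paradigm line of the form "%suffix*ancode%suffix*ancode..." into
// (suffix, ancode) pairs. Illustrative sketch only; the real AOT format has
// extra optional fields that are ignored here.
std::vector<std::pair<std::string, std::string> > ParseParadigm(const std::string& line) {
  std::vector<std::pair<std::string, std::string> > result;
  std::string::size_type pos = 0;
  while ((pos = line.find('%', pos)) != std::string::npos) {
    std::string::size_type end = line.find('%', pos + 1);  // next pair or end of line
    std::string pair_text =
        line.substr(pos + 1, end == std::string::npos ? end : end - pos - 1);
    std::string::size_type star = pair_text.find('*');
    if (star != std::string::npos) {
      result.push_back(std::make_pair(pair_text.substr(0, star),     // suffix
                                      pair_text.substr(star + 1)));  // two-letter code
    }
    pos = (end == std::string::npos) ? line.size() : end;
  }
  return result;
}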

The word encoding method used in AOT has its advantages and disadvantages. We will not discuss them here; let us only note an interesting fact: the dictionary contains lexical types with an empty stem. For example, the plural of the word «человек» ("person") is formed from the word form «люди» ("people"), which shares no common stem with «человек». Such a word therefore has to be defined by simply enumerating its word forms:
 %человек*<code>%человека*<code>%...%люди*<code>%людей*<code>%... 

This paradigm can also be used for other words (with a non-empty stem), such as "God-man" and "ape-man".

Let us consider in more detail the set of syntactic categories and the corresponding lexical attributes of the AOT dictionary.

AOT syntactic categories


As mentioned above, the syntactic categories of the AOT dictionary are defined in a separate file as sets of lines in which the part of speech and the set of lexical attributes are given by two-letter codes. In the Strutext library, parts of speech and their attributes are represented as a hierarchy of C++ classes. Let us look at this implementation in more detail.

The syntactic categories for the AOT dictionaries are specified in the morpho/models directory; models for Russian and English are provided. Let us consider some fragments of morpho/models/rus_model.h, which describes the model of the Russian language.

The base class for all models is the abstract class PartOfSpeech, which contains the language tag in the form of an enumeration and declares a virtual method returning this tag:
class PartOfSpeech : private boost::noncopyable {
public:
  /// Type of smart pointer to the class object.
  typedef boost::shared_ptr<PartOfSpeech> Ptr;

  /// Language tag definitions.
  enum LanguageTag {
    UNKNOWN_LANG = 0   ///< Unknown language.
    , RUSSIAN_LANG = 1 ///< Russian language.
    , ENGLISH_LANG = 2 ///< English language.
  };

  /// Language tag.
  virtual LanguageTag GetLangTag() const = 0;

  /// Virtual destruction for abstract class.
  virtual ~PartOfSpeech() {}
};


From this class the base class for all syntactic categories of the Russian language is derived:
struct RussianPos : public PartOfSpeech {
  /// Type of smart pointer to the class object.
  typedef boost::shared_ptr<RussianPos> Ptr;

  /// Possible parts of speech.
  enum PosTag {
    UNKNOWN_PS = 0                  ///< Unknown part of speech.
    , NOUN_PS = 1                   ///< Noun.
    , ADJECTIVE_PS = 2              ///< Adjective.
    , PRONOUN_NOUN_PS = 3           ///< Pronoun-noun.
    , VERB_PS = 4                   ///< Verb (personal form).
    , PARTICIPLE_PS = 5             ///< Participle.
    , ADVERB_PARTICIPLE_PS = 6      ///< Adverbial participle.
    , PRONOUN_PREDICATIVE_PS = 7    ///< Pronoun-predicative.
    , PRONOUN_ADJECTIVE_PS = 8      ///< Pronominal adjective.
    , NUMERAL_QUANTITATIVE_PS = 9   ///< Cardinal numeral.
    , NUMERAL_ORDINAL_PS = 10       ///< Ordinal numeral.
    , ADVERB_PS = 11                ///< Adverb.
    , PREDICATE_PS = 12             ///< Predicative.
    , PREPOSITION_PS = 13           ///< Preposition.
    , CONJUCTION_PS = 14            ///< Conjunction.
    , INTERJECTION_PS = 15          ///< Interjection.
    , PARTICLE_PS = 16              ///< Particle.
    , INTRODUCTORY_WORD_PS = 17     ///< Introductory word.
    , UP_BOUND_PS
  };

  /// Number.
  enum Number {
    UNKNOUN_NUMBER = 0       ///< Unknown number.
    , SINGULAR_NUMBER = 0x01 ///< Singular.
    , PLURAL_NUMBER = 0x02   ///< Plural.
  };

  /// Language register.
  enum Lang {
    NORMAL_LANG = 0 // Normal language.
    , SLANG_LANG = 1
    , ARCHAIZM_LANG = 2
    , INFORMAL_LANG = 3
  };

  /// Gender definitions.
  enum Gender {
    UNKNOWN_GENDER = 0        ///< Unknown gender value.
    , MASCULINE_GENDER = 0x01 ///< Masculine.
    , FEMININE_GENDER = 0x02  ///< Feminine.
    , NEUTER_GENDER = 0x04    ///< Neuter.
  };

  /// Case definition.
  enum Case {
    UNKNOWN_CASE = 0          ///< Unknown case.
    , NOMINATIVE_CASE = 1     ///< Nominative.
    , GENITIVE_CASE = 2       ///< Genitive.
    , GENITIVE2_CASE = 3      ///< Second genitive.
    , DATIVE_CASE = 4         ///< Dative.
    , ACCUSATIVE_CASE = 5     ///< Accusative.
    , INSTRUMENTAL_CASE = 6   ///< Instrumental.
    , PREPOSITIONAL_CASE = 7  ///< Prepositional.
    , PREPOSITIONAL2_CASE = 8 ///< Second prepositional.
    , VOCATIVE_CASE = 9       ///< Vocative.
  };

  /// Time (tense).
  enum Time {
    UNKNOWN_TIME = 0      ///< Unknown time.
    , PRESENT_TIME = 0x01 ///< Present.
    , FUTURE_TIME = 0x02  ///< Future.
    , PAST_TIME = 0x04    ///< Past.
  };

  /// Person.
  enum Person {
    UNKNOWN_PERSON = 0     ///< Unknown person.
    , FIRST_PERSON = 0x01  ///< First person.
    , SECOND_PERSON = 0x02 ///< Second person.
    , THIRD_PERSON = 0x04  ///< Third person.
  };

  /// Entity kind.
  enum Entity {
    UNKNOWN_ENTITY = 0        ///< Unknown entity, for ordinal words.
    , ABBREVIATION_ENTITY = 1 ///< Abbreviation.
    , FIRST_NAME_ENTITY = 2   ///< First name.
    , MIDDLE_NAME_ENTITY = 3  ///< Patronymic (middle name).
    , FAMILY_NAME_ENTITY = 4  ///< Family name.
  };

  /// Animacy.
  enum Animation {
    UNKNOWN_ANIMATION = 0
    , ANIMATE_ANIMATION = 0x01   ///< Animate.
    , INANIMATE_ANIMATION = 0x02 ///< Inanimate.
  };

  /// Voice definition.
  enum Voice {
    UNKNOWN_VOICE = 0      ///< Unknown voice.
    , ACTIVE_VOICE = 0x01  ///< Active voice.
    , PASSIVE_VOICE = 0x02 ///< Passive voice.
  };

  /// Language tag.
  LanguageTag GetLangTag() const { return RUSSIAN_LANG; }

  /// Class is abstract one -- virtual destruction.
  virtual ~RussianPos() {}

  /// Get part of speech tag.
  virtual PosTag GetPosTag() const = 0;

  /// Serialization implementation.
  virtual void Serialize(uint32_t& out) const = 0;

  /// Deserialization implementation.
  virtual void Deserialize(const uint32_t& in) = 0;

  /// Write POS signature.
  static void WritePosSign(PosTag pos, uint32_t& out) {
    // Write to lower 5 bits.
    out |= static_cast<uint32_t>(pos);
  }

  /// Read POS signature.
  static PosTag ReadPosSign(const uint32_t& in) {
    return PosTag(in & 0x1f);
  }
};

The class defines the labels of the syntactic categories as the PosTag enumeration and also defines the lexical attributes. In addition to the grammatical part, the class declares the Serialize and Deserialize methods for converting to and from a binary format. For each syntactic type the conversion is into four bytes, represented by the uint32_t type.

The RussianPos class is abstract; classes representing specific syntactic categories are derived from it. For example, the Noun class defines the noun:
struct Noun : public RussianPos {
  Noun()
    : number_(UNKNOUN_NUMBER)
    , lang_(NORMAL_LANG)
    , gender_(UNKNOWN_GENDER)
    , case_(UNKNOWN_CASE)
    , entity_(UNKNOWN_ENTITY) {}

  /// Get part of speech tag.
  PosTag GetPosTag() const { return NOUN_PS; }

  /**
   * \brief Serialization implementation.
   *
   * Binary map of the object:
   *    13       3       4       3       2       2        5
   * -----------------------------------------------------------
   *  Unused | Entity | Case | Gender | Lang | Number | POS tag |
   * -----------------------------------------------------------
   *
   * \param[out] ob The buffer to write to.
   */
  void Serialize(uint32_t& ob) const {
    ob |= static_cast<uint32_t>(number_) << 5;
    ob |= static_cast<uint32_t>(lang_) << 7;
    ob |= static_cast<uint32_t>(gender_) << 9;
    ob |= static_cast<uint32_t>(case_) << 12;
    ob |= static_cast<uint32_t>(entity_) << 16;
  }

  /**
   * \brief Deserialization implementation.
   *
   * Binary map of the object:
   *    13       3       4       3       2       2        5
   * -----------------------------------------------------------
   *  Unused | Entity | Case | Gender | Lang | Number | POS tag |
   * -----------------------------------------------------------
   *
   * \param ib The buffer to read from.
   */
  void Deserialize(const uint32_t& ib) {
    number_ = static_cast<Number>((ib & 0x0060) >> 5);
    lang_   = static_cast<Lang>((ib & 0x0180) >> 7);
    gender_ = static_cast<Gender>((ib & 0x0e00) >> 9);
    case_   = static_cast<Case>((ib & 0xf000) >> 12);
    entity_ = static_cast<Entity>((ib & 0x070000) >> 16);
  }

  Number number_;
  Lang   lang_;
  Gender gender_;
  Case   case_;
  Entity entity_;
};

The Noun class stores the lexical attributes: number, language register (normal, archaic, colloquial, and so on), gender, case, and a flag marking proper names and abbreviations.
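
A minimal sketch of how this bit packing can be used (only the classes quoted above are relied upon; whether the library combines the calls exactly this way is not shown here):

#include <stdint.h>

// Pack a Noun into 32 bits and unpack it back. Relies only on the RussianPos
// and Noun definitions quoted above; error handling is omitted.
uint32_t PackNoun(const Noun& noun) {
  uint32_t buffer = 0;
  RussianPos::WritePosSign(noun.GetPosTag(), buffer);  // lower 5 bits: the POS tag
  noun.Serialize(buffer);                              // higher bits: number, gender, case, ...
  return buffer;
}

Noun UnpackNoun(uint32_t buffer) {
  // The POS tag could be checked first via RussianPos::ReadPosSign(buffer).
  Noun noun;
  noun.Deserialize(buffer);
  return noun;
}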

State machines for encoding dictionaries


To store dictionaries and to efficiently look words up in them, the Strutext library uses finite state machines. The machines are implemented by the corresponding C++ types in the automata directory.

Recall that a finite state machine is defined by a transition function that maps some pairs (state, symbol) to a state: delta: Q x V -> Q. There is one initial state in which the machine starts its work and a certain set of accepting states. The automaton reads the input string character by character; if the transition function yields a state for the current state and the character just read, the automaton moves to this new state, and the cycle restarts with the next character. The machine stops in two cases: when there is no transition for the pair (current state, character read), or when the whole input string has been read. In the first case the input chain is rejected by the automaton; in the second case the chain is accepted if, after stopping, the automaton is in one of the accepting states.
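
In code the acceptance check is just a loop over the input. The sketch below uses a plain std::map as the transition table and is independent of the Strutext types.

#include <map>
#include <set>
#include <string>
#include <utility>

// Generic sketch of running a deterministic finite automaton over a string.
// Not Strutext code: the transition table here is an ordinary std::map.
bool Accepts(const std::map<std::pair<int, char>, int>& delta,
             const std::set<int>& accepting,
             const std::string& input,
             int start_state = 0) {
  int state = start_state;
  for (std::string::size_type i = 0; i < input.size(); ++i) {
    std::map<std::pair<int, char>, int>::const_iterator it =
        delta.find(std::make_pair(state, input[i]));
    if (it == delta.end()) {
      return false;       // no transition for (state, symbol): the chain is rejected
    }
    state = it->second;   // move to the new state and read the next symbol
  }
  return accepting.count(state) > 0;  // accepted only if we stopped in an accepting state
}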

Thus, every time a new symbol of the input chain is read, the automaton has to find the new state matching the pair (state, symbol). In the Strutext library this lookup is implemented by a separate transition class; an automaton is an array of states, each of which holds such a transition object (automata/fsm.h):
template <typename TransImpl>
struct FiniteStateMachine {
  /// Type of transition table.
  typedef TransImpl Transitions;

  ...

  /// State definition.
  struct State {
    Transitions trans_;  ///< Move table.
    bool is_accepted_;   ///< Is the state acceptable.

    /// Default initialization.
    explicit State(bool is_accepted = false) : is_accepted_(is_accepted) {}
  };

  /// Type of states' list.
  typedef std::vector<State> StateTable;

  ...

  StateTable states_;  ///< The table of states.
};

Here the TransImpl template parameter represents the transition function.

The Strutext library has two implementations of the transition function. One (automata/flex_transitions.h) is based on an ordinary std::map, where the key is the character code and the value is the state number. The other (automata/flat_transitions.h) is based on a sparse array: an array indexed by the possible character codes is allocated, and each array element holds a state number. The value zero is reserved for the invalid state, i.e. it means that there is no transition; a non-zero value means that the pair (array index = character code, state number stored in the cell) defines a transition.
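
The idea behind the sparse-array variant can be sketched like this (an illustration of the principle only, not the code from automata/flat_transitions.h):

#include <stdint.h>
#include <cstddef>
#include <vector>

// Sketch of a "flat" transition table for one state: one cell per symbol code.
// State number 0 is reserved as "no transition", as described above.
struct FlatStateTransitions {
  explicit FlatStateTransitions(std::size_t alphabet_size) : cells_(alphabet_size, 0) {}

  // Returns the target state for the symbol code, or 0 if there is no move.
  uint32_t Go(uint32_t symbol_code) const {
    return symbol_code < cells_.size() ? cells_[symbol_code] : 0;
  }

  void AddMove(uint32_t symbol_code, uint32_t target_state) {
    cells_[symbol_code] = target_state;
  }

  std::vector<uint32_t> cells_;  // index = symbol code, value = state number (0 = invalid)
};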

The FiniteStateMachine class cannot say anything about the input chain except whether it is accepted. To store additional information about accepted chains, attributes have to be attached to the accepting states. This is done in the AttributeFsm template class, which takes as template parameters the implementation of the transition function and the attribute type of an accepting state. Note that attributes can be attached not only to accepting states (although it is not clear whether that makes sense) and that more than one attribute can be attached to a state; they are all stored in a vector.

Storing a dictionary in a finite state machine gives the transition function of that machine a tree structure. Such a structure is also known by the term trie, introduced by D. Knuth. The Strutext library implements such a state machine in automata/trie.h:
template <class Trans, typename Attribute>
struct Trie : public AttributeFsm<Trans, Attribute> {
  /// Chain identifier type.
  typedef Attribute ChainId;

  /// Attribute FSM type.
  typedef AttributeFsm<Trans, Attribute> AttributeFsmImpl;

  /// Default initialization.
  explicit Trie(size_t rsize = AttributeFsmImpl::kReservedStateTableSize)
    : AttributeFsmImpl(rsize) {}

  /// It may be base class.
  virtual ~Trie() {}

  /**
   * \brief Adding chain of symbols.
   *
   * \param begin Iterator of the chain's begin.
   * \param end   Iterator of the chain's end.
   * \param id    Chain identifier.
   *
   * \return The number of last state of the chain.
   */
  template <typename SymbolIterator>
  StateId AddChain(SymbolIterator begin, SymbolIterator end, const ChainId& id);

  /**
   * \brief Adding chain of symbols.
   *
   * \param begin Iterator of the chain's begin.
   * \param end   Iterator of the chain's end.
   *
   * \return The number of last state of the chain.
   */
  template <typename SymbolIterator>
  StateId AddChain(SymbolIterator begin, SymbolIterator end);

  /**
   * \brief Search of the passed chain in the trie.
   *
   * \param begin Iterator of the chain's begin.
   * \param end   Iterator of the chain's end.
   * \result The reference to the list of attributes of the chain if any.
   */
  template <typename SymbolIterator>
  const typename AttributeFsmImpl::AttributeList& Search(SymbolIterator begin, SymbolIterator end) const;
};

As the code shows, there are two main methods: AddChain and Search. The latter is noteworthy because it returns a reference to the attribute vector, i.e. state attributes are not copied during the search. If the input character chain is not found, the attribute vector is empty.
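
A rough usage sketch of this interface is given below. The header paths, the namespace, the FlexTransitions transition type and the AttributeList typedef name are assumptions here; only AddChain and Search come from the quoted code.

#include <stdint.h>
#include <string>

#include "automata/trie.h"              // assumed header path
#include "automata/flex_transitions.h"  // assumed header path

// Assumed instantiation: byte symbols, uint32_t chain identifiers.
typedef strutext::automata::Trie<strutext::automata::FlexTransitions, uint32_t> WordTrie;

void TrieExample() {
  WordTrie trie;

  std::string first = "mama";
  std::string second = "mamba";
  trie.AddChain(first.begin(), first.end(), 1);    // attach chain id 1 to "mama"
  trie.AddChain(second.begin(), second.end(), 2);  // attach chain id 2 to "mamba"

  std::string query = "mama";
  const WordTrie::AttributeList& ids = trie.Search(query.begin(), query.end());
  // ids now holds {1}; for a chain that is absent the returned list is empty.
  (void)ids;
}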

The Strutext library also contains an Aho-Corasick automaton for efficiently searching for dictionary elements in text; its implementation is in automata/aho_corasick.h. A description of the principles and methods of its implementation is beyond the scope of this text; we only note that the interface is quite simple to use and that there is an iterator over the chains found in the text.

Note also that all the automata can be serialized to and deserialized from C++ streams. This makes it possible to store automata in files on disk, i.e. to use them as binary dictionary stores.
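
For illustration, persisting an automaton could look roughly like the sketch below; the stream-based Serialize/Deserialize method names and signatures are assumptions, the actual interface is in the automata/ headers.

#include <fstream>
#include <string>

// Hypothetical sketch: the Serialize/Deserialize names are assumed for
// illustration only; consult the automata/ headers for the real interface.
template <typename Automaton>
void SaveAutomaton(const Automaton& machine, const std::string& path) {
  std::ofstream out(path.c_str(), std::ios::binary);
  machine.Serialize(out);   // assumed: writes the automaton in binary form
}

template <typename Automaton>
void LoadAutomaton(Automaton& machine, const std::string& path) {
  std::ifstream in(path.c_str(), std::ios::binary);
  machine.Deserialize(in);  // assumed: restores the automaton from the stream
}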

Morphological analyzer


The morphological analyzer is a library located in the morpho/morpholib directory. Its main interface class, Morphologist, is located in morpho/morpholib/morpho.h.

Before describing the interface and class implementation, we first describe the basic principles on which this implementation is based.

First, there is a dictionary of stems implemented as an object of the Trie class.
Second, each stem's accepting state is associated with the corresponding declension paradigm (which, again, is a vector of pairs (suffix, set of lexical attributes); an attribute set is represented by an instance of a class derived from PartOfSpeech).
Third, each lexical type is assigned a unique numeric identifier, the stem's number in the dictionary.

Thus, to recognize a given word form as a word, one has to look its stem up in the automaton (this yields the identifier of the lexical type associated with that stem) and then find the attributes corresponding to its ending. All of this must be done with ambiguity in mind, both when matching stems and when determining endings. The search code looks as follows:
/**
 * \brief Implementation of morphological analysis of passed form.
 *
 * \param text Input text in UTF-8 encoding.
 * \param[out] lem_list List of lemmas within morphological attributes.
 */
void Analize(const std::string& text, LemList& lem_list) const {
  // The first phase. Go through the passed word text, encode each symbol
  // and remember the symbol codes in a string. If a word base is found at
  // some position, remember the attribute and the position for each
  // attribute.

  // Try to start with empty bases.
  typedef std::list<std::pair<Attribute, size_t> > BaseList;
  BaseList base_list;
  strutext::automata::StateId state = strutext::automata::kStartState;
  if (bases_trie_.IsAcceptable(state)) {
    const typename Trie::AttributeList& attrs = bases_trie_.GetStateAttributes(state);
    for (size_t i = 0; i < attrs.size(); ++i) {
      base_list.push_back(std::make_pair(attrs[i], 0));
    }
  }

  // Perform the first phase.
  std::string code_str;
  typedef strutext::encode::Utf8Iterator<std::string::const_iterator> Utf8Iterator;
  for (Utf8Iterator sym_it(text.begin(), text.end()); sym_it != Utf8Iterator(); ++sym_it) {
    Code c = alphabet_.Encode(*sym_it);
    code_str += c;
    if (state != strutext::automata::kInvalidState) {
      state = bases_trie_.Go(state, c);
      if (bases_trie_.IsAcceptable(state)) {
        const typename Trie::AttributeList& attrs = bases_trie_.GetStateAttributes(state);
        for (size_t i = 0; i < attrs.size(); ++i) {
          base_list.push_back(std::make_pair(attrs[i], code_str.size()));
        }
      }
    }
  }

  // The second phase. Go through the found base list and find suffixes for them.
  // If suffixes have been found, then add them to the lemma list.
  lem_list.clear();
  for (BaseList::iterator base_it = base_list.begin(); base_it != base_list.end(); ++base_it) {
    AttrMap attr;
    attr.auto_attr_ = base_it->first;
    SuffixStorage::AttrList att_list;
    std::string suffix = code_str.substr(base_it->second);
    // If suffix is empty (empty suffix passed), add zero symbol to it.
    if (suffix.empty()) {
      suffix.push_back('\0');
    }
    if (const SuffixStorage::AttrList* att_list = suff_store_.SearchAttrs(attr.line_id_, suffix)) {
      for (size_t i = 0; i < att_list->size(); ++i) {
        lem_list.push_back(Lemma(attr.lem_id_, (*att_list)[i]));
      }
    }
  }
}

As can be seen, the analysis algorithm is split into two stages. First, the stems are identified (the existence of empty stems must also be taken into account here); for each stem, its end position in the input chain is remembered so that the ending can be extracted later. At the second stage, endings matching the identified stems are looked up. If an ending is found in the corresponding declension paradigm, the lexical attributes of this ending are returned together with the identifier of the word.

The Morphologist class also provides a service for generating word forms from a lemma identifier and a set of lexical attributes. This is handled by the Generate method:
/**
 * \brief Generate form.
 *
 * \param lem_id The lemma identifier.
 * \param attrs  The attributes of the form.
 * \return       Generated text in UTF-8 encoding.
 */
std::string Generate(uint32_t lem_id, uint32_t attrs) const;

There is also a GenAllForms method that generates all forms of a given word and a GenMainForm method that returns the main form of a word; for a noun this is, naturally, the nominative singular form.
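
A rough usage sketch follows; only the Analize and Generate signatures quoted in this text are relied upon, while dictionary loading and the exact LemList type are deliberately left out.

#include <stdint.h>
#include <string>

// Rough usage sketch. How the Morphologist class is instantiated and how the
// binary dictionary is loaded are not shown in this text and are left to the
// template parameters here.
template <typename MorphologistType, typename LemListType>
void MorphoExample(const MorphologistType& morpher) {
  LemListType lemmas;
  morpher.Analize("мамой", lemmas);  // analysis of one UTF-8 word form;
                                     // lemmas now holds (lemma id, attributes) variants

  // The inverse operation: generate the text of a form from a lemma identifier
  // and a packed attribute value (both values below are placeholders).
  uint32_t lem_id = 0;
  uint32_t attrs = 0;
  std::string form = morpher.Generate(lem_id, attrs);
  (void)form;
}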

The morpho/aot directory contains, in main.cpp, a parser of the AOT dictionary in its original text format; it produces a binary representation compatible with the morphology library. The resulting binary dictionary can then be used by the Morphologist class. The binary dictionaries themselves are not stored in the repository but can be generated by the user when needed. The Russian dictionary can be built with the following command:
 ./Release/bin/aot-parser -t ../morpho/aot/rus_tabs.txt -d ../morpho/aot/rus_morphs.txt -m rus -b aot-rus.bin 

In binary form, the dictionary takes slightly less than 20 MB.

To extract word forms from source text, you can use the WordIterator class defined in utility/word_iterator.h. This class treats a word as a sequence of letters (as defined by symbols::IsLetter). The iterator returns each word as a string of Unicode symbols, which can be recoded to UTF-8 using the GetUtf8Sequence function defined in utf8_generator.h of the encode library.
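
A hypothetical usage sketch: the namespace, the iterator's constructor, its value type and the GetUtf8Sequence signature are all assumptions here; only the class and function names come from the text.

#include <iostream>
#include <iterator>
#include <string>

// Hypothetical sketch only: everything except the WordIterator and
// GetUtf8Sequence names is assumed for illustration.
void PrintWords(const std::string& text_utf8) {
  typedef strutext::utility::WordIterator<std::string::const_iterator> WordIt;
  for (WordIt it(text_utf8.begin(), text_utf8.end()); it != WordIt(); ++it) {
    // *it is assumed to be the word as a sequence of Unicode symbol codes.
    std::string word;
    for (size_t i = 0; i < (*it).size(); ++i) {
      strutext::encode::GetUtf8Sequence((*it)[i], std::back_inserter(word));  // assumed call
    }
    std::cout << word << "\n";
  }
}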

Afterword


The text has turned out rather long and is probably not easy to read. The author has tried to simplify the presentation wherever possible, but, given the complexity of the material, there were probably few places where this succeeded.

Nevertheless, the author hopes that the Strutext library described here will prove useful and that the work on its implementation will not have been in vain.

Source: https://habr.com/ru/post/242173/

