📜 ⬆️ ⬇️

Text analyzer: recognition of authorship (continued)

This article is about the authorship recognition algorithm implemented in the Text Analyzer project. In the continuation of the article we will consider a finite state machine for splitting text into levels. ( Begin and end ).

Article structure:
  1. Authorship analysis
  2. Introducing the code
  3. TAuthoringAnalyser internals and text storage
  4. Leveling by state machine on strategies
  5. Collection of frequency characteristics
  6. Hamming neural network and authorship analysis

Additional materials:



4. Leveling by the state machine on strategies

Need to:

')

Findings:


KA is a state machine ( [1] , [2] , [3] ). One of its applications is that it recognizes certain constructions in a chain of incoming characters: words, compound punctuation marks, functions, structures, whole classes with their methods and fields. This is how spell checkers, source code analyzers, compilers, syntax highlighting tools work, so does your computer, and so on and so forth. The applicability of the spacecraft is huge. For example, I asked my students to generate a fairy tale using non-deterministic spacecraft, and this is possible. Here the spacecraft is used on strategies ( [1] ), written by Sergei Satsky ( [1] ) already in 2003.

Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  1. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  2. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  3. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  4. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  5. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  6. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  7. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  8. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  9. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  10. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  11. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  12. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  13. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  14. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  15. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  16. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  17. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  18. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......
  19. Copy Source | Copy HTML template < class SState , class SEvent , class SFunctor = SEmptyFunctor< SState , SEvent >, class SUnknownEventStrategy = SThrowStrategy< SEvent > > class SFiniteStateMachine { public : typedef SState StateType ; typedef SEvent EventType ; private : StateType _CurrentState; // Current machine state std::vector< StateType > _States; // A list of the registered states std::vector< EventType > _Events; // A list of the registered events std::vector< std::vector< StateType > > _Transitions; // A table of transitions between states SFunctor _Functor; // Transition function SUnknownEventStrategy _UnknownEventStrategy; // Unknown event strategy std::deque< EventType > _EventsQueue; // Internal events queue to support events // that were generated in the transition functions bool _InProcess; // To be sure that we are in the process of events int _CurrentStateIndex; // Index of column in a transition table (0 - based) StateType _InitialState; // Start machine state // ......



Satskii state machine ( [h] ) is implemented on the Strategy pattern. (Strategy, [1] , [2] , [3] ). The strategies come in the types SState, SEvent (event, state), SFunctor (functor of transition from one state to another) and SUnknownEventStrategy (behavior of the SC, if the event is not recognized). Here is the function that recognizes the event and translates the spacecraft into a new state:

Copy Source | Copy HTML
  1. inline void ProcessEvent ( const EventType & tEvent)
  2. {
  3. int EventIndex (GetEventIndex (tEvent));
  4. if (EventIndex == - 1 ) return ;
  5. StateType OldState (_CurrentState); // Save the old state.
  6. _CurrentState = (_Transitions [EventIndex]) [_ CurrentStateIndex]; // Define a new state.
  7. _CurrentStateIndex = GetStateIndex (_CurrentState);
  8. _Functor (OldState, tEvent, _CurrentState); // Perform the action associated with the new state.
  9. }


Let somewhere in the external environment, we iterate the text character by character. We take the next character and pass it to the ProcessEvent () function. The state machine for this symbol selects a cell in the transition table, also based on the current state. The cell says what state should come and what to do next. I slightly changed the spacecraft so that he understood the “symbol” as an event. I also redefined the “Status” strategy by slipping my state class. Thus, the transition table consisted of certain events and states by me.

And it was very difficult to develop. She grew by leaps and bounds. If the state was “Empty”, and the letter “A” came, then we should start recognizing the word, sentence and paragraph by going to the corresponding state. If a point has arrived, then we have to go into a special state and wait, but is this a composite punctuation mark? Could be “B”, “C”, “b”, “c”, “!”, “1”, “2” ... All the letters of the alphabet, all the punctuation marks, in general, all the signs of the ASCII table. And because they need to somehow react! Imagine a table with ~ 255 lines (one line - one event, one character) and about 20 columns of spacecraft states. It's a broom how many cells with transition teams to fill! 20 * 255 = 5100 cells. I found a simpler approach. All single-type characters are placed in character sets ( [CCharsSet.h] ; [UConstants.h] ), and the entire character set is already considered an event if its character has arrived. Of course, the sets may intersect. For example, the letter “B” is the elements of the following sets: “All symbols”, “Letter”, “Uppercase letter”, “Russian letter”, “Russian capital letter”. The “dot” sign is included in the “All symbols”, “Punctuation mark”, “Paragraph end” sets. And so on. The sets came out decently, however, this is less than all the characters. The transition table ( [xls] ) has decreased tenfold. Another important advantage is that the sets can be changed without affecting the conversion table. Well, we have forgotten the letter "E", well, inserted into the appropriate sets, and business! ..

How does this even work? After all, passing the text through the state machine, we have to build a tree of levels. If in words, then the state machine, executing the instruction upon the occurrence of an event, calls the function associated with the event. There are several such functions ( [UParSentWordFSM.h] ), and they all build a common tree, receiving a pointer to it as input. Let's take a closer look at how the configuration of the spacecraft is described.

Copy Source | Copy HTML
  1. class TState ;
  2. typedef TCharsSetEvent TEvent ;
  3. typedef TCFDivisionTreeItem < TCFLevelDataType , TUnitType > TCFCustomUnitDivisionTreeItem;
  4. typedef TCFTreeLevel < TCFLevelDataType , TUnitType > TCFCustomUnitTreeLevel;
  5. typedef void (* TFuncPtr) (TCFCustomUnitDivisionTreeItem *, const TEvent &);
  6. typedef TStateMachineDescriptor < TState , TEvent > TParSenWordDescriptor;
  7. / * Class of used state. * /
  8. class TState : public TFunctionalState < TEvent , TTextString >
  9. {
  10. private :
  11. TFuncPtr _Function;
  12. TCFCustomUnitDivisionTreeItem * _TargetTree;
  13. public :
  14. void OnEnter ( const TState & tStateFrom, const TEvent & tEvent) {
  15. _Function (_TargetTree, tEvent);
  16. };
  17. TState ( const TTextString & tName, TFuncPtr tFunc, TCFCustomUnitDivisionTreeItem * tTargetTree)
  18. : TFunctionalState < TEvent , TTextString > (tName, sat_None), _Function (tFunc), _TargetTree (tTargetTree) {};
  19. };


The TState class, as its name implies, is a state of the spacecraft. Inside it is a pointer to a function for building a tree, and a pointer to the tree itself. The function is called when the OnEnter () method is requested from TState. The class that this tree represents (TCFCustomUnitDivisionTreeItem) is quite complicated: this is the definition of some more abstract template class ( [h] ). We will not focus on it now. Let's look further at the functions of building a tree.

Copy Source | Copy HTML
  1. void FEmpty (TCFCustomUnitDivisionTreeItem * tItem, const TEvent & tEvent)
  2. {
  3. return ;
  4. };
  5. void FOpenParagraph (TCFCustomUnitDivisionTreeItem * tTree, const TEvent & tEvent)
  6. {
  7. TCFCustomUnitDivisionTreeItem NewItem ( 0 , TRangeItem (tEvent.Iterator (), tEvent.Iterator ()));
  8. tTree-> AddItem ( 0 , NewItem);
  9. };
  10. void FOpenSentence (TCFCustomUnitDivisionTreeItem * tTree, const TEvent & tEvent)
  11. {
  12. TCFCustomUnitDivisionTreeItem NewItem ( 1 , TRangeItem (tEvent.Iterator (), tEvent.Iterator ()));
  13. tTree-> AddItem ( 1 , NewItem);
  14. };
  15. // .......
  16. void FCloseAll (TCFCustomUnitDivisionTreeItem * tTree, const TEvent & tEvent)
  17. {
  18. FCloseWord (tTree, tEvent);
  19. FCloseSentence (tTree, tEvent);
  20. FCloseParagraph (tTree, tEvent);
  21. };
  22. // ... and some more functions ...


They all have the same type. We can replace these functions in the TState class with each other, - thus, we have here an overload and polymorphism of functions ( [1] , [2] ).

But these are all definitions and preparation for the main thing. We should get a specific object that contains settings for the spacecraft, right? Here is the function that returns this object:

Copy Source | Copy HTML
  1. TParSenWordDescriptor Descriptor (TCFCustomUnitDivisionTreeItem * tTree)
  2. {
  3. // The initial state is q0, the function is FEmpty, the transition table is called "Paragraphs, sentences, words".
  4. TParSenWordDescriptor D (TState ( "q0" , FEmpty, tTree), "Paragraphs, Sentences, Words" );
  5. // Set the state, and each has its own name: q0, q1, ... q14.
  6. // Each state is passed a function (FEmpty, FOpenAll ...), which should be called if the state occurs.
  7. // Also, we pass a pointer to the tree that we are building - tTree.
  8. D << TState ( "q0" , FEmpty, tTree)
  9. << TState ( "q1" , FOpenAll, tTree)
  10. << TState ( "q2" , FOpenSentenceOpenWord, tTree)
  11. << TState ( "q3" , FOpenWord, tTree)
  12. << TState ( "q4" , FEmpty, tTree)
  13. << TState ( " q5 " , FCloseWord, tTree)
  14. << TState ( "q6" , FCloseSentence, tTree)
  15. << TState ( "q7" , FCloseSentenceOpenSentenceOpenWord, tTree)
  16. << TState ( "q8" , FEmpty, tTree)
  17. << TState ( "q9" , FEmpty, tTree)
  18. << TState ( "q10" , FCloseWord, tTree)
  19. << TState ( "q11" , FCloseAll, tTree)
  20. << TState ( "q12" , FCloseSentenceCloseParagraph, tTree)
  21. << TState ( "q13" , FCloseParagraph, tTree)
  22. << TState ( "q14" , FEmpty, tTree);
  23. // Set events, each one has its own name ("WinParagraph", "SentenceEnd" ...).
  24. // cWinParagraph, cSentenceEnd are the character sets described in UConstants.h.
  25. D << TEvent ( "WinParagraph" , cWinParagraph)
  26. << "q0" << "q11" << "q11" << "q11" << "q11" << "q12" << "q12" << "q11" << "q12" << "q13" << "q12" << "q0" << "q0" << "q0" << "q12" ;
  27. D << TEvent ( "SentenceEnd" , cSentenceEnd)
  28. << "q0" << "q10" << "q10" << "q10" << "q10" << "q14" << "q14" << "q10" << "q14" << "q9" << "q14" << "q0" << "q0" << "q0" << "q14" ;
  29. D << TEvent ( "Letters" , cLetters + cDigits)
  30. << "q1" << "q4" << "q4" << "q4" << "q4" << "q3" << "q2" << "q4" << "q3" << "q2" << "q7" << "q1" << "q1" << "q1" << "q7" ;
  31. D << TEvent ( "Space" , cSpace + cTab)
  32. << "q0" << " q5" << "q5" << "q5" << "q5" << "q8" << "q9" << "q5" << "q8" << "q9" << "q6" << "q0" << "q0" << "q0" << "q6" ;
  33. D << TEvent ( "PunctMarks and other symbols" , TCharsSet (cPrintable) >> cWinParagraph >> cSentenceEnd >> cLetters >> cDigits >> cSpace >> cTab)
  34. << "q0" << " q5" << "q5" << "q5" << "q5" << "q8" << "q9" << "q5" << "q8" << "q9" << "q14" << "q0" << "q0" << "q0" << "q14" ;
  35. D << TEvent ( "EndSign" , TCharsSet (CTextStringEndSign))
  36. << "q0" << "q11" << "q11" << "q11" << "q11" << "q12" << "q13" << "q11" << "q12" << "q13" << "q12" << "q0" << "q0" << "q0" << "q12" ;
  37. return D;
  38. };


It is very clearly seen how events and states are entered in the AC table, but the table itself in these "<<" signs is difficult to see. Just believe, here it is recorded all.

One could also look into the TEvent class, which, in fact, is TCharsSetEvent ( [cpp] , [h] ). Curious - welcome, and we move on to the manager of the CA - TFiniteStateMachineManager ( [h] ). The task of this class is as follows: using unified interfaces, perform an analysis using an abstract spacecraft over an abstract source of events. Also in the manager, you can pass a function that will somewhere display the progress of recognition. The code of this class is very rich, and I will give only the most interesting sites.

Copy Source | Copy HTML
  1. template < class EventType , class FiniteStateMachineType , class StateMachineDescriptorType , class SourceType >
  2. class TFiniteStateMachineManager
  3. {
  4. private :
  5. StateMachineDescriptorType _Descriptor;
  6. SourceType _Source;
  7. TUInt _Begin;
  8. TUInt _End;
  9. TUInt _Iterator;
  10. public :
  11. // <......>
  12. TFiniteStateMachineManager ( const StateMachineDescriptorType & tDescriptor, const SourceType & tSource)
  13. :
  14. _Descriptor (tDescriptor),
  15. _Source (tSource),
  16. _ReportStep ( 0 ),
  17. _ProgressFunc (NULL)
  18. {};
  19. // <......>


Again we deal with patterns. To be precise, the “Strategy” pattern is also presented here. Not only do we have a finite state machine itself on strategies, but also some kind of manager! I don’t hide it, I’ll say that the same manager is used to create a map of euphony, it only works with another special state machine. He, in fact, in general, all the same, that the KA slipped, if only the interface was the same. The manager takes this "slipped" spacecraft, and redirects all the work to him. One KA handles the rules of sound, the other - breaks the text into levels. This is the essence of the “Strategy” pattern: we configure our class with strategy objects that do the real work. For example, if it were necessary for me that when an event occurred, a filter was called, I could replace the EventType event, passing another class as a strategy that implements filters.

Copy Source | Copy HTML
  1. TFiniteStateMachineManager & Process ()
  2. {
  3. _Begin = _Source.Begin ();
  4. _End = _Source.End ();
  5. FiniteStateMachineType _Machine (_Descriptor.StartState (), _Descriptor.Proxy ());
  6. for (_Iterator = _Begin; _Iterator <= _ End; _Iterator ++)
  7. {
  8. _Machine << EventType (_Source [_Iterator], _Iterator);
  9. };
  10. _Machine << EventType (_Source.EndSign (), _End + 1 );
  11. return * this ;
  12. };



In the Process () function, everything is simple. An _Machine state machine object is created, set up with a descriptor (transition table, initial state), and the recognition process starts in a loop. The “iterator” passed as the second parameter to the EventType class denotes the index of the character in the text. If the event caused, for example, the “Start word” status, then the index will be written into the RangeItem as BeginIndex. And vice versa: if the event is the end of a word (sentence, paragraph), then the index will be the end for the given range. So we get a list of RangeItem, correlated with the text.

Why did I put the word “iterator” in quotes? Because it is not a real iterator, but only an integer variable. But if it were necessary to make a truly universal manager, then we would have to abstract away from the event objects. The manager should not worry at all in which lists, tables or arrays the data is stored there, where they come from, how many elements, and in what order they are. And to ensure passage through them, one could use the Iterator pattern (Iterator, [1] , [2] ). Great changes would not be necessary to make the loop work with iterators:

Copy Source | Copy HTML
  1. // ......
  2. SourceType * _Source;
  3. // ......
  4. SourceType :: Iterator _Iterator = _Source-> Begin ();
  5. for (; _Iterator! = _Source-> End (); ++ _ Iterator)
  6. {
  7. _Machine << EventType (* _ Iterator);
  8. };



The result of this spacecraft will be split trees. Next you need to select areas for comparison from these partitions. The algorithm is pretty simple.
- For each text, plots are sorted in increasing length.
- A pattern of minimum compliance is compiled. It registers how many sections of the same length can be taken from all texts. If all texts have 10 paragraphs 1000 in length, and one text of such paragraphs is only 8, then the minimum correspondence is 8 sections. So - for every length.
- In accordance with the template selected sections of the texts.
- The algorithm alternately involves all levels: paragraphs, sentences and words.

Now we have samples, and we can begin to collect characteristics.

Source: https://habr.com/ru/post/114187/


All Articles