
Text analyzer: recognition of authorship (end)

This article describes the authorship recognition algorithm implemented in the Text Analyzer project. In this final part, we look at how frequency characteristics are collected and get a general overview of the Hamming neural network. (See also the beginning and the continuation.)

Article structure:
  1. Authorship analysis
  2. Introducing the code
  3. TAuthoringAnalyser internals and text storage
  4. Leveling by state machine on strategies
  5. Collection of frequency characteristics
  6. Hamming neural network and authorship analysis

5. Collecting frequency characteristics



The frequency characteristics most useful for recognizing authorship are the frequencies of individual characters, of two- and three-letter combinations, of words, and so on. This program implements only the simplest of them: counting character frequencies. That is, of course, not enough for a comprehensive analysis, and recognition accuracy is admittedly not very high: as far as I remember, the probability of a correct answer in tests reached 60-70 percent, and only under ideal conditions. While developing the program, I hoped to rewrite it someday and add a comprehensive authorship analysis based on many methods. Who knows, maybe I will get to it...
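The two- and three-letter combinations mentioned above can be counted in the same single-pass manner. The following is only a minimal sketch and is not part of the Text Analyzer project: the function name `CountNGrams` and the use of `std::string` and `std::map` are assumptions made for illustration. It works on bytes, so for multi-byte encodings it counts byte sequences rather than letters.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not from the Text Analyzer sources):
// counts every run of n consecutive characters in one pass over the text.
std::map<std::string, std::size_t> CountNGrams(const std::string& text, std::size_t n)
{
    std::map<std::string, std::size_t> table;
    if (n == 0 || text.size() < n)
        return table;                    // nothing to count
    for (std::size_t i = 0; i + n <= text.size(); ++i)
        ++table[text.substr(i, n)];      // operator[] creates a zero entry on first use
    return table;
}
```

Calling `CountNGrams(text, 2)` would give a table of letter pairs and `CountNGrams(text, 3)` a table of triples; with `n` equal to 1 it reduces to plain character counting.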

So, on to collecting character frequencies. The TCharFrequencyCalculator class ( [h] ) builds a frequency table in a single pass over the text. The template class TFrequencyTable ( [h] ) can store a frequency table for objects of any type.

class TCharFrequencyCalculator
{
private:
    TFrequencyTable<TChar> _FTable;

public:
    TFrequencyTable<TChar> & operator()(TTextStringWrapper & tWrapper)
    {
        _FTable << ftm_Clear;
        TUInt i;
        TTextStringWrapper d;
        for (i = tWrapper.Begin(); i <= tWrapper.End(); i++)
            _FTable << (tWrapper[i]);
        return _FTable;
    };

    TFrequencyTable<TChar> & operator()(const TTextString & tTextString)
    {
        _FTable << ftm_Clear;
        for (TSInt i = 1; i <= tTextString.Length(); i++)
            _FTable << tTextString[i];
        return _FTable;
    };

    TCharFrequencyCalculator() {};
};


// Use:
TCharFrequencyCalculator calculator;
TFrequencyTable<TChar> charFrequencyTable = calculator(text);
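To show what a frequency table "for objects of any type" might look like, here is a minimal stand-alone sketch. It is not the project's TFrequencyTable: the names `FrequencyTable`, `Count`, `Frequency`, and `CountChars` are hypothetical, and the real class may store and expose its data quite differently. Only the idea of feeding items in with `operator<<` is taken from the listing above.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for TFrequencyTable<T>; the real class may differ.
template <class T>
class FrequencyTable
{
public:
    // Registers one occurrence of the item; returns *this so calls can be chained.
    FrequencyTable& operator<<(const T& item)
    {
        ++counts_[item];
        ++total_;
        return *this;
    }

    void Clear() { counts_.clear(); total_ = 0; }

    // Absolute number of occurrences (0 if the item was never seen).
    std::size_t Count(const T& item) const
    {
        auto it = counts_.find(item);
        return it == counts_.end() ? 0 : it->second;
    }

    // Relative frequency in [0, 1]; 0 for an empty table.
    double Frequency(const T& item) const
    {
        return total_ == 0 ? 0.0 : static_cast<double>(Count(item)) / total_;
    }

private:
    std::map<T, std::size_t> counts_;
    std::size_t total_ = 0;
};

// One pass over the text, in the spirit of TCharFrequencyCalculator.
inline FrequencyTable<char> CountChars(const std::string& text)
{
    FrequencyTable<char> table;
    for (char c : text)
        table << c;
    return table;
}
```

Because the table is a template over the element type, the same class could hold words or sentences as long as the type can be ordered with `operator<`.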


Note that there are two functions that count frequencies. The first takes a TTextStringWrapper ( [cpp] , [h] ), a wrapper class over a text string. A wrapper is also known as the Adapter pattern (Adapter, Wrapper, [1] , [2] , [3] ): it converts the interface of one class into the interface of another.

A truly universal calculator would have been possible, but it would require abstracting away from the objects whose frequencies are counted. The calculator should not care which lists, tables, or arrays hold the data, where the data comes from, how many elements there are, or in what order they come. Once adapted to the interface the calculator expects, lists of objects would all be processed in a uniform way... Do you feel a sense of déjà vu? That's right, we already discussed this when we looked at the state machine manager. There we abstracted from event lists using the Iterator pattern; here we abstract from element lists using the Adapter.

This is an atypical use of the pattern, and it is worse than iterators: we would have to build a hierarchy of adapters and frequency tables so that they could be swapped quickly. In the end, the adapter would stop being an adapter and turn into a kind of abstract container. It would look like this:

class TFrequencyCalculator
{
public:
    TFrequencyTable * operator()(TWrapper * tWrapper, TFrequencyTable * table)
    {
        *table << ftm_Clear;
        for (int i = tWrapper->Begin(); i <= tWrapper->End(); ++i)
            *table << (tWrapper->at(i));
        return table;
    }
};

class TWordWrapper : public TWrapper
{
    // ......
    virtual int Begin() const;
    virtual int End() const;
    virtual Word at(const int & index) const;
    // ......
};

class TSentenceWrapper : public TWrapper { /*......*/ };
class TWordsFrequencyTable : public TFrequencyTable { /*......*/ };
class TSentenceFrequencyTable : public TFrequencyTable { /*......*/ };

// Use:
TWordWrapper wordWrapper = TWordWrapper(wordsList);
TFrequencyCalculator wordCalc;
TWordsFrequencyTable wordFrequencyTable;
wordCalc(&wordWrapper, &wordFrequencyTable);

TSentenceWrapper sentenceWrapper = TSentenceWrapper(sentencesMap);
TFrequencyCalculator sentenceCalc;
TSentenceFrequencyTable sentenceFrequencyTable;
sentenceCalc(&sentenceWrapper, &sentenceFrequencyTable);


6. Hamming neural network and authorship analysis



Finally, tired and battered, we have reached the very last step. The Hamming neural network ( [1] , [2] ) is built on a set of binary vectors of equal length. They are called samples and are stored in a sample matrix. A test vector of the same length is fed to the input of the network. The number of inputs equals the length of the vector, so for large data there can be a great many inputs.

One advantage of the Hamming network over the Hopfield network is that the structure of the network does not depend on the dimension of the input vector. There are two layers, the first of which is a dummy (fictitious) one; each layer has exactly as many neurons and outputs as there are samples in the matrix. The network is fast.

Its task is to find the vector in the sample matrix that is most "similar" to the input. Similarity is measured by the so-called Hamming distance: the smaller the distance, the more "similar" the two vectors. Roughly speaking, the Hamming distance is the number of bit positions in which the two vectors differ. It is easy to compute, and the whole neural network is just a convenient representation of a few simple formulas. After computing some values, the network converges to the result: the outputs form a vector of all zeros except for a single output that holds a one. The index of that output identifies the desired sample in the sample matrix.
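The result the network converges to can be illustrated by computing the Hamming distances directly. The sketch below is not the project's THamNeuroSystem and skips the iterative part of the network entirely; the names `BinaryVector`, `HammingDistance`, and `ClosestSample` are hypothetical, and resolving ties in favour of the first sample is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using BinaryVector = std::vector<int>;   // each element is 0 or 1

// Number of positions in which two equal-length binary vectors differ.
std::size_t HammingDistance(const BinaryVector& a, const BinaryVector& b)
{
    assert(a.size() == b.size());
    std::size_t distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i])
            ++distance;
    return distance;
}

// Returns a one-hot vector: 1 at the index of the closest sample, 0 elsewhere.
// Ties are resolved in favour of the first closest sample.
std::vector<int> ClosestSample(const std::vector<BinaryVector>& samples,
                               const BinaryVector& input)
{
    std::vector<int> outputs(samples.size(), 0);
    if (samples.empty())
        return outputs;

    std::size_t best = 0;
    std::size_t bestDistance = HammingDistance(samples[0], input);
    for (std::size_t j = 1; j < samples.size(); ++j)
    {
        const std::size_t d = HammingDistance(samples[j], input);
        if (d < bestDistance)
        {
            bestDistance = d;
            best = j;
        }
    }
    outputs[best] = 1;
    return outputs;
}
```

For the samples 1010, 1111, and 0000 and the input 1101, the distances are 3, 1, and 3, so the output vector is 0, 1, 0 and the second sample is selected.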

To load text samples into the neural network (class THamNeuroSystem: [cpp] , [h] ), they must be converted to binary form. This is done by the template connector class (THamNSConnector: [h] ). It is amusing to look at the neural network and connector code now: I understand that this could be done much, much more simply.

template <class T>
void THamNSConnector<T>::ByteToBinaryVector(T DataItem, TSInt SizeOfData, TSampleVector * DestinationVector)
{
    vector<bool> BoolBits;
    T NewDataItem = DataItem;
    // Process the item one byte at a time
    // (as written, the loop body runs SizeOfData - 1 times).
    for (TSInt i = 1; i < SizeOfData; i++)
    {
        // Test each of the eight bits of the low byte against its mask.
        BoolBits.clear();
        BoolBits.push_back(NewDataItem & bitOne);
        BoolBits.push_back(NewDataItem & bitTwo);
        BoolBits.push_back(NewDataItem & bitThree);
        BoolBits.push_back(NewDataItem & bitFour);
        BoolBits.push_back(NewDataItem & bitFive);
        BoolBits.push_back(NewDataItem & bitSix);
        BoolBits.push_back(NewDataItem & bitSeven);
        BoolBits.push_back(NewDataItem & bitEight);
        // Append the bits to the destination vector as 1 or 0.
        for (TUInt j = 0; j < BoolBits.size(); j++)
        {
            if (BoolBits[j]) DestinationVector->push_back(1);
            else DestinationVector->push_back(0);
        }
        // Move on to the next byte.
        NewDataItem = NewDataItem >> 8;
    };
};

template <class T>
TSampleVector THamNSConnector<T>::VectorToBinaryVector(TVector SourceVector)
{
    TSInt SizeOfData;
    T DataItem;
    TUInt i;
    TSampleVector ResVector;
    SizeOfData = sizeof(T);
    // Convert every element of the source vector and concatenate the bits.
    for (i = 0; i < SourceVector.size(); i++)
    {
        DataItem = SourceVector[i];
        ByteToBinaryVector(DataItem, SizeOfData, &ResVector);
    };
    return ResVector;
};
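As one illustration of how the conversion could be made simpler, the sketch below replaces the eight named masks with a shift in a loop. It is not the project's THamNSConnector: the name `ToBinaryVector` is hypothetical, the function emits all `sizeof(T) * CHAR_BIT` bits of each item, and the least-significant-bit-first order is an assumption, since the values of bitOne through bitEight are not shown in the article. It is intended for integer element types other than `bool`.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical simplification (not the project's THamNSConnector):
// appends every bit of each item to the result, least significant bit first.
template <class T>
std::vector<int> ToBinaryVector(const std::vector<T>& source)
{
    static_assert(std::is_integral<T>::value, "integral element type expected");
    using U = typename std::make_unsigned<T>::type;   // avoid shifting negative values
    const std::size_t bitsPerItem = sizeof(T) * CHAR_BIT;

    std::vector<int> bits;
    bits.reserve(source.size() * bitsPerItem);
    for (const T& item : source)
    {
        const U value = static_cast<U>(item);
        for (std::size_t b = 0; b < bitsPerItem; ++b)
            bits.push_back(static_cast<int>((value >> b) & 1u));
    }
    return bits;
}
```

The resulting vector of zeros and ones has the same role as TSampleVector above: it can be compared bit by bit with other vectors of the same length.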


That's all. Apart from my attempts to improve the neural network, there is nothing more to tell. The neural network returns the index of the text sample whose characteristics it considers most similar to those of the text by an unknown author. You can find out how accurate the answer is by compiling the program yourself. I never got around to extensive testing, either while writing my diploma thesis or now. I hope the article was helpful.

Sincerely.

Source: https://habr.com/ru/post/114188/

