Commentary on the note "Frequency analysis of the Ukrainian language"

As a comment on the note “Frequency analysis of the Ukrainian language” [1], some simple observations are made on the frequency of pairs of letters. It is proposed to apply the technique to the analysis of texts. The main hypothesis: a set of geometrically related clusters of symbols carries information about authorship and other important integral properties of a text.

In particular, it does not seem reasonable to me to expect different communities of native speakers (forums, etc.) to have the same spectrum of digrams.

Motivation

The author of the note on letter frequencies in Ukrainian does not state the motivation for his counts. It seems that the analysis of the frequency of letters, and of pairs and triples of letters, in a language was originally driven by the goal of cracking simple passwords and similar cryptanalysis tasks.

About six or seven years ago, a friend and I did similar, but smaller, calculations. Our motivation was amateurish and primitive, but different. These calculations, we believed, could be the first step toward having a machine extract from a text the information that is meaningful to humans. (Later it turned out that the most interesting part of them was not original [1-3].)

We assumed that the machine can “read”: it knows the letters and punctuation marks, can count frequencies, and so on. But it cannot “think”; in particular, it cannot obtain generalized, integral information about the text as a whole: what is good and what is evil, what or whom the text is about, and so on. The task would then be to study the structure of the text and to select “algorithms” that extract from the frequencies meaningful information that is not contained at the level of individual letters. From a human point of view, it is as if we tried to switch off everything we have learned and understand the meaning of groups of consecutive symbols that are unintelligible to us. Encode, for example, all the letters a, b, c, etc. as x, y, z, alpha, beta, gamma or, better, as Babylonian cuneiform wedges, and then ask what we can say about the text as a whole. I cannot be sure, but it seems that the problem of taking DNA and finding “meaning” in a long sequence of four letters lies in roughly the same area.

Meaningful information, in a very narrow sense, could be the similarity of one text to another, authorship, the rhythm and pace of the text, and so on. We did not manage to make substantial progress even on such primitive problems. There are some observations, but many more questions. I would like to believe that the task is meaningful and partially solvable.

We took up this primitive frequency analysis after some simple estimates based on English texts. First, we plotted the occurrences of the character string “White Fang” through the novel by Jack London. In the first parts the dog's name was absent altogether! On a small scale, in the main part of the book, the fluctuations in the density of the words were of course buried in noise. Nevertheless, it was obvious that the strictly zero frequency of the words “White Fang” at the beginning of the novel correlates with the plot. There was a clear sense that either the main character had not yet been born, or did not yet bear that name, or the novel consists of several parts (one about the dog, another not). Such a conclusion can hardly be called rigorous. Still, I believe that the words “In the first chapters, White Fang is not mentioned ...” would be a normal part of a human's answer about the text. The answer is incomplete and primitive, but the machine “analysis” took milliseconds and involved no algorithm at all.

Second, the statistics of character names in the last Harry Potter book also showed that frequency alone makes it possible to trace when and with whom Harry was together (geometrically in the text, but, as it turned out, in meaning as well): when Ron dropped out of the plot around Harry, when Hogwarts did, when Voldemort appeared, and so on. That is, by taking the word “Harry” and looking at the density of other characters in its geometric neighborhood on the page, one could draw some very vague conclusions about the “plot line”.
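The “White Fang” estimate can be illustrated with a minimal Python sketch. This is not the original program; the function name `phrase_density` and the idea of cutting the text into equal slices are assumptions made only for illustration.

```python
def phrase_density(text, phrase, n_bins=10):
    """Count occurrences of `phrase` in each of `n_bins` equal slices of `text`."""
    text = text.lower()
    phrase = phrase.lower()
    counts = [0] * n_bins
    start = text.find(phrase)
    while start != -1:
        # Map the character offset of this occurrence to a slice index.
        bin_index = min(start * n_bins // len(text), n_bins - 1)
        counts[bin_index] += 1
        start = text.find(phrase, start + 1)
    return counts
```

A run of zeros at the start of the returned list is the “strictly zero frequency” effect described above; the noisy fluctuations in the later slices would need smoothing before any further conclusions.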

Technical task

The first step toward identifying the protagonist is to develop techniques for studying letter frequencies. By and large, our achievements ended at this first step. Because Internet access was expensive at the time, we did not analyze forums and the media thoroughly, but simply took text files of books by individual authors, from the novels of Leo Tolstoy to the detective stories of Darya Dontsova. There were a few hundred books in total, each 300 KB or larger. Interestingly, the works of graphomaniac writers showed obvious correlations in the frequency of individual letters. The letter spectra of the prolific author of “ironic detective stories” were particularly telling.

The next step was to examine the spectral composition of texts in terms of clusters of adjacent letters. Specifically, we computed the problem in the simplest geometry: the normalized frequency of pairs of nearest significant symbols, “aa”, “ab”, “av”, ..., “ya”. From the most frequent pairs we selected the top 30-60 and compared them across texts. We worked with relative values: the frequency of a pair divided by the total frequency of all pairs. For 300-400 KB of text the statistical sum turned out to be fairly large. As the baseline we took the pair frequencies of the trilogy “Childhood”, “Adolescence”, “Youth”, against which we examined the fluctuations in the frequencies of other works. The results showed, in particular, a significant difference in spectrum between different authors, even among those recognized as founders of the Russian literary language.
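The pair spectrum described above can be sketched in a few lines of Python. Two things here are assumptions rather than part of the original calculation: the name `pair_spectrum`, and the decision to treat every non-whitespace character (letters and punctuation) as a “significant symbol”.

```python
from collections import Counter

def pair_spectrum(text, top_n=30):
    """Return the top_n most frequent adjacent pairs with their relative frequencies."""
    # Assumption: whitespace is skipped, letters and punctuation are kept.
    symbols = [ch for ch in text.lower() if not ch.isspace()]
    pairs = Counter(a + b for a, b in zip(symbols, symbols[1:]))
    total = sum(pairs.values())
    if total == 0:
        return []
    # Relative value: count of a pair divided by the total count of all pairs.
    return [(pair, count / total) for pair, count in pairs.most_common(top_n)]
```

Comparing the lists returned for two texts, with `top_n` somewhere between 30 and 60, corresponds to the comparison of spectral curves discussed here.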

One of the firmly established “laws” of the Russian language, of which we had been unaware before the experiment, is that in every text the number of commas is of the same order of magnitude as the number of occurrences of the most common pairs of adjacent letters. This may well be elementary knowledge, but for us it was a “discovery” of sorts, and its importance is, at the very least, not emphasized anywhere.

The remaining observations were less precise. In particular, good writers' works from the same period (early Chekhov or late Chekhov, early Tolstoy or late Tolstoy) had similar spectral curves, and these curves differed from those of other authors. As for modern writers, the more “trashy” a writer is considered to be, the stronger the correlation between the curves for their texts. This conclusion rests on a handful of examples and is not rigorous. For example, the curves of different graphomaniac works by newcomers from Samizdat lay almost on top of one another. The same could be said of more accomplished works, such as Akunin's Fandorin novels, the early and middle-period fiction of Lukyanenko, Marinina's detective novels, and so on. The classics, Dostoevsky in particular, matched themselves very poorly. Tolstoy's novels, especially those of his last period, fell outside the author's own spectrum. To cover all the authors and separate the curves into authorship classes, we had to try different definitions of the proximity of curves. On the whole, however, the technique for determining style worked: in the vast majority of cases the spectral curves of different authors could be separated from one another (more than a hundred authors).

In our approach, including clusters of three adjacent letters did not significantly change the quantitative determination of an author's style. Differences in punctuation frequency seemed more significant. (The works of Khmelev and his followers take trigrams as the basis [2-4].) Nor did we observe any structure when counting pairs of letters “one apart” (skipping one letter) or in other simple modifications.

Why clusters of two adjacent letters reveal so much about a text is a real mystery. The comments to the original post on frequencies included the following remark:

Robotex
By the way, if anyone knows how to convert a sequence of phonemes (in which they can be repeated, i.e. the program recognizes the word “mama” (mom) as mmmmmmmaaaaaaaammmmmmmmmmmmaaaaaa) into words, I would be glad to read about it (I am stuck on this for now)

I may be misinterpreting the original question. But this example vividly shows how important letter pairs are when extracting the meaningful part from a sequence of characters. Looking at pairs, we see that the long word contains the clusters “mm”, “ma”, “am”, “aa”. Discarding “mm” and “aa”, which are low-frequency pairs in the language, leaves “ma” and “ma”, or “ma”, “am”, “ma” if all two-character combinations are considered. Clearly, the word “mama” has the same spectrum as the word mmmmmmmaaaaaaaammmmmmmmmmmmaaaaaa in terms of high-frequency two-letter clusters. For cracking a password or guessing the original this is of course useless. For cutting off noise and analyzing the original it seems to work well: the extra “a” and “m” carry no new information.
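The argument can be checked with a minimal Python sketch. As a simplifying assumption, “low-frequency pairs” are approximated here by pairs of two identical letters; the function name `informative_pairs` is illustrative.

```python
def informative_pairs(word):
    """Adjacent pairs of a word, with doubled-letter pairs such as 'mm' or 'aa' dropped."""
    return [a + b for a, b in zip(word, word[1:]) if a != b]

stretched = "mmmmmmmaaaaaaaammmmmmmmmmmmaaaaaa"
print(informative_pairs(stretched))  # ['ma', 'am', 'ma']
print(informative_pairs("mama"))     # ['ma', 'am', 'ma']
```

Both inputs give the same sequence of pairs, which is the sense in which the extra letters carry no new information. A real implementation would presumably use measured pair frequencies of the language instead of the doubled-letter shortcut.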

In terms of meaning, breaking a Russian word into pairs of letters is quite close to breaking it into syllables. Note that there are writing systems, such as hiragana, in which precisely such pairs serve as the basic elements. Take the Russian word for “frequency”, “chastota” (any other word would do). In the paired-cluster approach it splits into “cha”, “as”, “st”, “to”, “ot”, “ta”. Consonant-consonant pairs, with rare exceptions (“nn”, etc.), drop out of the spectrum because they merge with the background. Roughly speaking, you can throw out the consonant-consonant pairs and the conclusion about authorship and style will not change. Thus, only the consonant + vowel pairs remain, and these correspond to the sounds actually pronounced in speech. Including punctuation marks (as pauses for breathing in and out) probably brings the analysis of pairs even closer to the analysis of syllables and oral speech.
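A minimal Python sketch of this splitting, under two assumptions: the example word is the Russian “частота” (“frequency”), and a pair is discarded when it contains no vowel. The names `adjacent_pairs` and `drop_consonant_pairs` are illustrative.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")

def adjacent_pairs(word):
    """Split a word into its overlapping pairs of adjacent letters."""
    return [a + b for a, b in zip(word, word[1:])]

def drop_consonant_pairs(pairs):
    """Keep only the pairs that contain at least one vowel."""
    return [p for p in pairs if any(ch in RUSSIAN_VOWELS for ch in p)]

pairs = adjacent_pairs("частота")
print(pairs)                        # ['ча', 'ас', 'ст', 'то', 'от', 'та']
print(drop_consonant_pairs(pairs))  # ['ча', 'ас', 'то', 'от', 'та']
```

The only pair removed is the consonant cluster “ст”; what remains reads much like a sequence of open and closed syllables.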

Another pleasant finding was that neither the old-orthography hard sign “ъ” nor the Ukrainian “і”, nor the choice between Russian and Ukrainian, affected an author's spectral curve (only Bunin was examined, for example; there are no statistics across authors). We were unable to check how translation from English affects authorship and style. Our analysis of English texts was limited to Harry Potter and a couple of works by Jack London, so again there are no real statistics, but two-character correlations were visible for these two authors as well.

We abandoned the task, first of all, because even a quick Internet search showed that similar frequency analysis of texts has been carried out since the beginning of the last century, including by Morozov [1], whose work attracted the interest of Markov himself. There was also a certain amount of Fomenko-style speculation on the subject. The very conclusion that authorship can be determined from trigrams had already been formulated around 2000 by D. Khmelev [2, 3]. There were also works by other authors; see, for example, [4]. Khmelev's works do, of course, talk about text invariants, Markov chains, diagonalization of transition matrices, and so on. In essence, though, they make similar statements about the importance of the most common letter triples for determining style. We have many questions about this work. How Dostoevsky is caught by trigrams, for example, we do not understand. And so on.

Even without mathematical terminology, it is clear that for many authors letter pairs produce quite similar spectral graphs across their works. Quantitatively, the figures for an author's range depended strongly on exactly how the proximity of two graphs was measured: whether the pointwise squared error was used, or the cubed error, and so on. But these are details. That “style” shows a pattern at the level of letter pairs and punctuation marks is quite obvious. Clear failures occurred where a novel had been written by ghostwriters as part of a “project”. Translated books were also a problem.
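One possible form of such a proximity measure, as a hedged Python sketch. It assumes each spectrum is a dictionary mapping a pair to its relative frequency; the name `spectrum_distance` and the `power` parameter (2 for the squared error, 3 for the cubed one) are illustrative and not necessarily the definitions that were actually tried.

```python
def spectrum_distance(spectrum_a, spectrum_b, power=2):
    """Sum of |difference| ** power over every pair present in either spectrum."""
    keys = set(spectrum_a) | set(spectrum_b)
    return sum(
        abs(spectrum_a.get(key, 0.0) - spectrum_b.get(key, 0.0)) ** power
        for key in keys
    )
```

A pair missing from one spectrum is treated as having frequency zero there. Changing `power` changes how strongly a few large deviations dominate the result, which is one reason the numbers depend on the chosen definition.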

On the whole, our conclusion six years ago was that none of these problems is original and that there is no point in continuing to “study” them. Such frequencies may well have been discussed here on the site already. What we did notice is that, in the analysis of digrams, taking punctuation marks into account yields a significant difference between authors. To some extent, the proportion of punctuation marks is a measure of the tempo of a text. In trigram analyses, punctuation marks are most likely excluded from the statistics, and this, in my opinion, is a mistake and a loss of significant information.

It is very hard to believe that a language is built only on the frequency of pairs of characters and that there are no further objective quantitative characteristics and laws. It seems more likely that nobody looked for them because the analysis was too laborious in the past. However, our further search for principles of text organization, for a geometry of symbols in Russian, produced no clear results. One hypothesis, for example, was that the presence of identical clusters within a single sentence, say a pair “ma” occurring twice in the same sentence, is an important characteristic of text and style. At school, for example, we are taught not to use “which” twice in neighboring sentences, or to repeat the name of the main character. We supposed that a repeated pair of letters would be a sign of poor text quality, a breach of melody, and so on. Take the children's words “mama”, “papa”, “baba”: the second “ma”, “pa”, “ba” is obviously redundant in terms of the new information it brings. A lot of “ma” not separated by a full stop should likewise be avoided. Such combinations were therefore counted with an increased or reduced weight, by a factor of 2, 3, and so on. However, this hypothesis did not yield any new, clearly formulated results. An analysis of the classics showed that many works contain repeated pairs between two full stops. Complicating the geometry at this level did, of course, immediately pick out poetry, but that is rather obvious.
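The weighting of repeated pairs within a sentence might look like the following Python sketch. Everything specific in it is an assumption: the name `weighted_pair_counts`, the rule that sentences end at “.”, “!” or “?”, and the rule that every repeat of a pair inside one sentence counts `repeat_weight` instead of 1.

```python
import re
from collections import Counter

def weighted_pair_counts(text, repeat_weight=2.0):
    """Count letter pairs, weighting repeats inside one sentence by repeat_weight."""
    totals = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        seen = set()  # pairs already met in this sentence
        for word in re.findall(r"[^\W\d_]+", sentence):  # runs of letters only
            for a, b in zip(word, word[1:]):
                pair = a + b
                totals[pair] += repeat_weight if pair in seen else 1.0
                seen.add(pair)
    return totals
```

Setting `repeat_weight` above 1 amplifies repeats and setting it below 1 suppresses them, which corresponds to the “increased or reduced” contribution tried above.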

It may be that the structure of texts needs to be studied more subtly, or that the problem should be posed in a completely different way. Unfortunately, there was not a single competent specialist in our circle who would take the question seriously. What science already knows, we were unable to find out. People from MSU said that “all this was done at Baumanka (Bauman University) a hundred years ago” and is therefore irrelevant. However, we could not find either the published texts or the people from Baumanka.

Perhaps a new challenge

Our “experiments” also stalled because we lacked basic programming knowledge and because Internet access was expensive. The program and the hardware were so primitive that analyzing a 500-page text took several minutes, every so often hung the computer, and so on. Automatically and freely downloading gigabytes of text over the Internet, parsing HTML tags, and the like were out of the question. In other words, the conclusions above were (and remain) our technical limit.

However, we had originally posed the task of studying the geometric structure of Russian-language text more broadly (not seriously, of course).

Perhaps someone from the community will now be able to test the hypothesis. It is that every online community, newspaper, etc. has its own spectrum, which would be interesting to analyze quantitatively.

For concreteness and practical relevance, consider the problem of how a particular large website regards a politician, a company, or a phenomenon, call it “XYZ”. The obvious idea was that, across the site's many web pages where the letter combination “XYZ” occurs often, it will be surrounded by a corresponding environment of characteristic clusters of letters and words. Suppose, for example, that a publication or community regards the brand “XYZ” negatively. The idea is that the symbols geometrically close to this combination on the page should, on average, be negative: “collapse”, “devastation”, “crisis”, “decay”, pictures of crashing aircraft and sunken ships, and so on. In another community, good symbols such as “confidence”, “progress”, “achievement” may, on average, stand close to the “XYZ” combination.
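A minimal Python sketch of this idea, limited to text and ignoring pictures. The word lists are taken from the examples above; the function names, the window measured in words, and the simple “positive minus negative” score are assumptions for illustration only.

```python
import re
from collections import Counter

NEGATIVE = {"collapse", "devastation", "crisis", "decay"}
POSITIVE = {"confidence", "progress", "achievement"}

def neighbors(text, target, window=5):
    """Count the words found within `window` words of each occurrence of `target`."""
    words = re.findall(r"\w+", text.lower())
    target = target.lower()
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    counts[words[j]] += 1
    return counts

def attitude_score(text, target, window=5):
    """Positive minus negative neighbor counts; below zero suggests a negative context."""
    near = neighbors(text, target, window)
    return sum(near[w] for w in POSITIVE) - sum(near[w] for w in NEGATIVE)
```

Averaging `attitude_score` over all pages of a site that mention the target would give one crude number per community. Real texts would also need word forms to be normalized, which this sketch does not attempt.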

On the whole, such an analysis of resources would help bring order to the Internet. This is, of course, very far from the problem of recognizing the meaning of textual information, but a machine should be able to obtain some new knowledge about communities from such geometry. What I mean is the following. By analogy with letter frequencies in a language, the simplest text-analysis task is, of course, to count the frequency of significant words: the structure of the semantic core, or the tag cloud. It is perfectly reasonable to give extra weight to words in bold, in color, or in italics. Google, Yandex, and every SEO specialist have done this and still do. The next step is to bring in the geometry of the semantic core: a word closer to the page header gets a higher weight, a word closer to the footer a lower one; likewise for a word in a heading versus one inside the body. Search engines do this too when ranking and returning results.

But whether a metric in the sense of Riemann, i.e. the geometric distance between meaningful words (for example, the mean square distance on a page, the distance in clicks, etc.), is taken into account when a machine evaluates texts, I still do not know.
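As one possible reading of “mean square distance on a page”, here is a hedged Python sketch. It measures distance in word positions and averages the squared distance over all pairs of occurrences of two words; both choices, and the name `mean_square_distance`, are assumptions.

```python
import re
from itertools import product

def mean_square_distance(text, word_a, word_b):
    """Mean squared distance, in word positions, between occurrences of two words."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    if not pos_a or not pos_b:
        return None  # at least one of the words does not occur
    squares = [(i - j) ** 2 for i, j in product(pos_a, pos_b)]
    return sum(squares) / len(squares)
```

A small value means the two words tend to stand close together on the page. A distance in clicks would require the link graph of the site rather than the text of a single page.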

Even free online tools for SEO beginners can strip everything from the text of a web page except the meaningful words. What remains of a large text is a kind of skeleton of meaningful words. Something in this skeleton, beyond the single quantity of frequency, ought to be important: relative distances, for example. However, I do not know whether any further computational work is done with such a skeleton, with such a textual basis. It would seem natural for a search engine to study the laws of text organization, because most of us read diagonally. Some measure is needed to determine what the author wanted to say once everything superfluous is removed. But search engines deal with billions of pages, and it is far from certain that such analysis is technically feasible yet, even for giants like Google. That is, the scale of the proposed problem is much smaller than what is standard for search engines (thousands of pages, not billions), but the analysis proposed is deeper. In addition, this line of work may be strongly held back by the hypertext structure of the Internet: it is far more effective to track who links to whom and who is in the trust network. But links draw on the natural intelligence of webmasters, which search engines exploit for their own purposes. At some point a machine will need to be able to draw conclusions about a text by itself.

Conclusion

It is argued that Russian has an objective, statistically definable internal structure based on the frequency of symbols [1-4]. Compared with the earlier works [1-4], the remaining “novelty” of our claim is that the frequency of pairs of adjacent letters and of punctuation plays a significant role in a text. When analyzing texts, philologists refer vaguely to “unique rhythm” and “readability”; it is possible that the frequency of punctuation marks, along with other factors, provides an equivalent quantitative description of rhythm.

It is proposed to continue the study of language frequencies begun in the original note on the Ukrainian language, but with the aim of extracting meaningful information from the texts of individual communities. That is, rather than averaging frequencies over the entire Russian-language Internet, break it into sectors, each characterized by the same invariants.

It is proposed to try studying the geometry of a text by introducing a new parameter: the relative distance between characters. N characters have N(N-1)/2 pairwise links (that is, 1 + 2 + ... + (N-1)), and each link also carries a weight. Technically, therefore, such an analysis can be much more complicated than a simple frequency count.

The ideas may well be completely trivial and unoriginal. Clearly, what would have practical value here is not an analysis of Turgenev's novels but a concrete infographic on how popular media or communities treat a particular 2012 presidential candidate, or on the PR or negative-advertising campaign around a particular brand in a particular online publication, and so on. But in general, it can be an interesting task in its own right.

Such tasks require, above all, the Internet technologies that the author of the original frequency post used: the ability to download gigabytes of text, to analyze large data sets for frequencies quickly, to find common features in chaos, and so on. HTML tags alone contribute a great deal to the geometry of a text.

For the same technical reasons, having to do with computers and the Internet, such problems may have been little studied by professional mathematicians and philologists. They may therefore be suitable for research by IT specialists, even those who know no philology or mathematics.

References

[1] professor_k "Frequency analysis of the Ukrainian language"

[1] Morozov N. Linguistic spectra: a means for distinguishing plagiarism from the true works of a known author. 1915.

[2] Khmelev D.V. Recognition of the author of a text using A.A. Markov chains // MSU Bulletin, ser. 9: Philology. - 2000. - N 2. - P. 115-126.

[3] Khmelev D., Tweedie F. Using Markov Chains for Identification of Writers // Literary and Linguistic Computing. - 2001. - Vol. 16, no. 4. - Pp. 299-307.

[4] Romanov A.S., Meshcheryakov R.V. Identification of the author of a text using support vector machines in the case of two possible alternatives.

Source: https://habr.com/ru/post/126650/
