
Results and prospects of a small analysis of Russian texts

Here are the statistics I collected while building a very simple robot that generates Russian phrases.



Word distribution



Let me give you some numbers first.

In 12.5 MB of Russian text (mainly classical literature by various authors) containing 142,114 distinct words, the most frequent word is the conjunction "and", which occurs 83,575 times (words are counted in all their word forms). That count is more than half the number of distinct words!

The second most frequent word is the preposition "in" with 52,124 occurrences, and in third place is the particle "not" with 36,268.

The verb "said" (singular, 3rd person) occurs 6,566 times and is in 28th place.

The word "yes" is in 36th place with 5,039 occurrences, while "no" occurs 2,948 times and is in 53rd place.

The remaining words are distributed fairly randomly, depending on each author's preferences.
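For readers who want to reproduce counts and ranks like these on their own texts, here is a minimal sketch. It is not the code used for the numbers above. The tokenization rule (a word is a lowercased run of Cyrillic letters), the function names `word_frequencies` and `ranked`, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Cyrillic letters, lowercased.
# The letter "ё" lies outside the а-я range, so it is listed explicitly.
WORD_RE = re.compile(r"[а-яё]+")


def word_frequencies(text):
    """Count every word form separately (no lemmatization)."""
    return Counter(WORD_RE.findall(text.lower()))


def ranked(counts):
    """Return (rank, word, count) triples, most frequent first."""
    return [
        (rank, word, n)
        for rank, (word, n) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the analyzed corpus.
    sample = "И он сказал: да, и не в лесу, и не в поле."
    for rank, word, n in ranked(word_frequencies(sample))[:3]:
        print(rank, word, n)
```

With a real corpus, the text would be read from files instead of a string literal. Hyphenated words and words containing Latin letters are split or dropped by this simple rule, so the totals will depend on the tokenization chosen.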






Word frequencies in text corpora have been studied ever since Zipf's law was discovered for English, that is, for more than 60 years. Various dictionaries and surveys on the topic have been published, but here we will look at Russian a little more closely and more visually.
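As a rough illustration, the six counts quoted earlier can be compared with the idealized form of Zipf's law, in which a word's frequency is inversely proportional to its rank, so that rank times count stays roughly constant. The sketch below only multiplies the numbers already given in the text. The helper name `zipf_products` is hypothetical, and the idealized exponent of 1 is an assumption.

```python
# (rank, count) pairs quoted in the text: "and", "in", "not", "said", "yes", "no"
observed = [(1, 83575), (2, 52124), (3, 36268), (28, 6566), (36, 5039), (53, 2948)]


def zipf_products(pairs):
    """Under an idealized Zipf's law (count ~ C / rank), rank * count is roughly constant."""
    return [rank * count for rank, count in pairs]


for (rank, count), product in zip(observed, zipf_products(observed)):
    print(f"rank {rank:>2}: count {count:>6}, rank*count = {product}")
```

The products range from about 84 thousand to about 184 thousand. They share an order of magnitude but are far from constant, so these six points alone neither confirm nor rule out a Zipf-like distribution.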

Detailed graphs and examples with conclusions

Source: https://habr.com/ru/post/81485/


