Counting letters in works of Russian literature

Have you ever wondered which letter of the Russian alphabet occurs in texts more often than the others? I'm going to look for the answer to this question. But while you don't yet know the results of my little study, I suggest you try to guess the five most common letters of our alphabet. Ready?

So, as a friend of mine likes to say while gripping the steering wheel of his car, off we go.
To begin with, we need texts to practice on. I chose three works by our classic authors: “War and Peace” by Lev Nikolayevich Tolstoy, “And Quiet Flows the Don” by Mikhail Sholokhov, and “The Master and Margarita” by Mikhail Bulgakov. Why these works? Simple: the first two are the only ones I read at school, and my wife and I watched “The Master and Margarita” on TV, so I know a little about the subject.

Now we need to count how many times each letter of the alphabet occurs in them, along with the total number of letters. How do we do it? You can take the simplest route, as my boss would, for example. Go to the library, borrow the four volumes of “War and Peace”, come home and count the letters by hand, then do the same with the rest of the books. Of course, this takes a lot of time, but my boss is a very hardworking person, and he also has subordinates. He can hand the work out to them, and if they refuse to count or make a mistake, then “I'll cut your bonus”.

I disliked this method right away, so I decided to write a program that would do all the work for us. Below is the code of a program written in Perl. It calculates the total number of letters in the text, as well as the count of each letter of the alphabet and its percentage.

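use strict;
use warnings;
use utf8;    # the Cyrillic literals below are stored as UTF-8 in this source file

# The 32 letters being counted; "Ё" is counted as "Е" below.
my @letters = qw(А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я);
my @out = (0) x @letters;    # one counter per letter

# text.txt is assumed to be encoded in CP1251.
open(my $text, '<:encoding(cp1251)', 'text.txt') or die "Cannot open text.txt: $!";
my $sum = 0;
while (defined(my $char = getc($text))) {
    $char = uc($char);
    if ($char eq "Ё") { $char = "Е" }
    for (my $i = 0; $i < @letters; $i++) {
        if ($char eq $letters[$i]) { $out[$i]++; $sum++; }
    }
}
close($text);
die "No letters found in text.txt\n" unless $sum;    # avoid division by zero

# Write the total, then each letter with its count and percentage.
open(my $out_fh, '>:encoding(cp1251)', 'out.txt') or die "Cannot open out.txt: $!";
print $out_fh "Total letters - $sum\n\n";
for (my $i = 0; $i < @out; $i++) {
    print $out_fh "$letters[$i] - $out[$i] (" . ($out[$i] / $sum * 100) . "%)\n";
}
close($out_fh);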

For clarity, I tidied up the resulting data a little in Excel.

As they say, the results speak for themselves. The most common letter of the Russian alphabet is “О”, and the top five looks like this: “О”, “А”, “Е”, “И”, “Н”.
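
The ranking step can also be done in Perl instead of a spreadsheet. Below is a minimal sketch, not the program used for the results above: it counts letters with a hash and sorts them by descending frequency. The helper names `count_letters` and `top_letters` and the sample phrase are made up for illustration, and ties are broken alphabetically as an arbitrary choice.

```perl
use strict;
use warnings;
use utf8;    # Cyrillic literals in this file are UTF-8

# Count Cyrillic letters in a string, treating "Ё" as "Е".
sub count_letters {
    my ($text) = @_;
    my %count;
    for my $char (split //, uc $text) {
        $char = 'Е' if $char eq 'Ё';
        $count{$char}++ if $char =~ /^[А-Я]$/;
    }
    return \%count;
}

# Return the $n most frequent letters; ties are broken alphabetically.
sub top_letters {
    my ($count, $n) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

binmode(STDOUT, ':encoding(UTF-8)');
my $count = count_letters('Мама мыла раму');
print join(' ', top_letters($count, 3)), "\n";
```

For a whole book, the same two helpers could be applied to the decoded contents of the text file, with `top_letters($count, 5)` giving the top five.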

Now it remains to answer the most important question. Why is all this necessary?

This information can be used, for example, when Leonid Yakubovich, host of the game show “Pole Chudes”, lets us open any five letters. I hope you now know which letters to call?
But seriously, counting symbol frequencies is used far more often than you might imagine. It is one of the steps of the Huffman algorithm, which is used in many modern data compression programs.
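
To show how frequencies feed into Huffman coding, here is a small sketch under stated assumptions. The function name `huffman_codes` is hypothetical, the weights in the example are invented for illustration (they are not the counts from the books), and the node list is simply re-sorted on every step instead of using a priority queue, which is fine for an alphabet of a few dozen symbols.

```perl
use strict;
use warnings;
use utf8;    # Cyrillic literals in this file are UTF-8

# Build a Huffman code table from a hash of symbol => frequency.
sub huffman_codes {
    my (%freq) = @_;
    return () unless %freq;

    # Each node is [weight, symbol, left, right]; symbol is undef for inner nodes.
    my @nodes = map { [ $freq{$_}, $_ ] } sort keys %freq;
    return ( $nodes[0][1] => '0' ) if @nodes == 1;

    # Repeatedly merge the two lightest nodes until one tree remains.
    while (@nodes > 1) {
        @nodes = sort { $a->[0] <=> $b->[0] } @nodes;
        my ($left, $right) = splice(@nodes, 0, 2);
        push @nodes, [ $left->[0] + $right->[0], undef, $left, $right ];
    }

    # Walk the tree: going left appends "0", going right appends "1".
    my %code;
    my @stack = ( [ $nodes[0], '' ] );
    while (my $item = pop @stack) {
        my ($node, $prefix) = @$item;
        if (defined $node->[1]) {
            $code{ $node->[1] } = $prefix;
            next;
        }
        push @stack, [ $node->[2], $prefix . '0' ], [ $node->[3], $prefix . '1' ];
    }
    return %code;
}

binmode(STDOUT, ':encoding(UTF-8)');
my %code = huffman_codes('О' => 5, 'А' => 2, 'Е' => 1);    # made-up weights
print "$_ => $code{$_}\n" for sort keys %code;
```

With these weights the most frequent letter, “О”, receives a one-bit code, while the two rarer letters receive two-bit codes. That is the point of counting frequencies first: common symbols get shorter codes.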

Source: https://habr.com/ru/post/92706/
