
The net came back with nothing but sea mud ...

A year and a half ago, I decided to run a small experiment. The goal was to get a look at concentrated newspeak. I did the following:
1) Parsed bash.im (then bash.org.ru) and built a frequency dictionary of the words found there.
2) Parsed Wikipedia and built a frequency dictionary (well, not quite: I already had a Wikipedia dictionary by then, made earlier for entirely different purposes).
3) Sorted the bash dictionary by number of occurrences in descending order, walked through it, and printed the words that never occur on Wikipedia.
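
Step 3 can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name `words_missing_from` and the toy data are invented for the example, and both dictionaries are assumed to be `collections.Counter` objects mapping a word to its count.

```python
from collections import Counter

def words_missing_from(reference, candidate):
    """Return (word, count) pairs from `candidate`, most frequent first,
    keeping only words that never occur in `reference`."""
    # Counter.most_common() with no argument yields all items sorted
    # by count in descending order.
    return [(word, n) for word, n in candidate.most_common()
            if word not in reference]

# Toy data, invented purely for illustration.
wiki_counts = Counter({"and": 1000, "server": 40})
bash_counts = Counter({"and": 100, "lol": 5, "kek": 3, "server": 2})

for word, n in words_missing_from(wiki_counts, bash_counts):
    print(word, n)
```

Using the Wikipedia dictionary only for membership checks means its counts do not matter here; a plain `set` of words would work just as well.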

Anyway, with all the preparations done, I launched the script and got ready to see modern slang in all its glory. The program began to print ...
Those who are not allergic to profanity can follow the link and admire the beginning of the list I got (unedited, I publish it exactly as the program produced it):

I warned you!

For those who didn't follow the link: I really did get a lot of slang, including slang variants of "server", "comment", "camera", and so on. But I got even more typos from hurried typing, misspellings, and profanity.
One consolation: none of these words had made it into the Russian-language Wikipedia yet!
Appendix 1

Since this article is for programmers after all, I will explain how to build the Wikipedia frequency dictionary (if I can find the sources, I'll attach them to the article).
1) Download the Russian Wikipedia dump; the latest version is always here - download.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
2) Strip all the markup and keep the bare text using this Python script, written by colleagues from Italy - medialab.di.unipi.it/wiki/Wikipedia_Extractor, which writes its output along the way into files of a size convenient for us and our machine.
3) For each file, split the text using everything that is not a Cyrillic letter or a hyphen as a separator (so as not to break up hyphenated compounds, such as the Russian word for a rocking chair) and count the tokens (you can use collections.Counter from the Python standard library).
4) Merge the resulting dictionaries together.
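
Steps 3 and 4 might look something like the sketch below. It assumes the extractor's output is a directory tree of UTF-8 plain-text files; the function names, the lowercasing, and the trimming of stray hyphens are choices made for this illustration, not details from the original script.

```python
import re
from collections import Counter
from pathlib import Path

# A token is a run of Cyrillic letters and hyphens; everything else
# acts as a separator. The text is lowercased before matching.
TOKEN_RE = re.compile(r"[а-яё-]+")

def count_tokens(text):
    """Count the Cyrillic tokens in a single piece of text."""
    counts = Counter()
    for token in TOKEN_RE.findall(text.lower()):
        token = token.strip("-")  # drop free-standing or dangling hyphens
        if token:
            counts[token] += 1
    return counts

def merge_counts(counters):
    """Merge several per-file Counters into one."""
    total = Counter()
    for c in counters:
        total.update(c)  # Counter.update adds counts rather than replacing
    return total

def count_directory(root):
    """Count tokens in every file under `root` (assumed layout)."""
    files = (p for p in Path(root).rglob("*") if p.is_file())
    return merge_counts(
        count_tokens(p.read_text(encoding="utf-8", errors="ignore"))
        for p in files
    )
```

Counting per file and merging afterwards keeps memory use bounded by the vocabulary size rather than the dump size, and makes it easy to process the files in parallel if needed.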

Appendix 2

And here is the Wikipedia frequency dictionary itself; I made it about two years ago.
You can do a lot of interesting things with it, for example look for words with all sorts of curious properties (say, the Russian word meaning "difficult to heal", the longest word in the Russian language in which all the letters are different). Or, say, build an anagram generator. But I will try to write a separate post about experiments with the dictionary.
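
As a rough illustration of the two ideas just mentioned, here is a small sketch that works on any iterable of words, for example the keys of the frequency dictionary. The function names are made up for the example, and a hyphen counts as an ordinary character.

```python
from collections import defaultdict

def longest_all_distinct(words):
    """Return the longest word whose characters are all different,
    or None if there is no such word."""
    distinct = [w for w in words if len(set(w)) == len(w)]
    return max(distinct, key=len, default=None)

def anagram_groups(words):
    """Group words that consist of exactly the same letters."""
    groups = defaultdict(list)
    for w in words:
        # Words that are anagrams of each other share the same sorted letters.
        groups["".join(sorted(w))].append(w)
    return [g for g in groups.values() if len(g) > 1]
```

When several words tie for the greatest length, `longest_all_distinct` returns the first one encountered.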

Source: https://habr.com/ru/post/188678/

