Hi, Habr.
In the previous part of Habraraiting, a method for building a word cloud of English terms was described. Of course, parsing Russian words is a much harder task, but as suggested in the comments, there are ready-made libraries for that.
Let's figure out how to build a picture like this:

We will also look at a word cloud of Habr articles for all the years.
If you are curious what came out of it, read on.
Parsing
The original dataset, as in the previous case, is a csv file with the titles of Habr articles from 2006 to 2019. If anyone wants to try this themselves, you can download it here.
First, load the data into a Pandas DataFrame and select the titles for the required year.
import datetime
import pandas as pd

df = pd.read_csv(log_path, sep=',', encoding='utf-8', error_bad_lines=True,
                 quotechar='"', comment='#')
if year != 0:
    dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
    df['datetime'] = dates
    df = df[(df['datetime'] >= pd.Timestamp(datetime.date(year, 1, 1))) &
            (df['datetime'] < pd.Timestamp(datetime.date(year + 1, 1, 1)))]
The unicode2str function is needed to strip various tricky Unicode characters, such as typographic quotes, from the console output: under OSX the output worked as-is, but printing to Windows PowerShell raised the error "UnicodeEncodeError: 'charmap' codec can't encode character". Fiddling with PowerShell settings was too much hassle, so this was the easiest way.
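The article does not show unicode2str itself, so here is a minimal sketch of what it might do, assuming the goal is simply to drop characters the console cannot render (the function name comes from the text, the body is a guess):

import sys

def unicode2str(s):
    # Encode with the console's own encoding, silently dropping anything
    # it cannot represent (e.g. typographic quotes), then decode back
    enc = sys.stdout.encoding or 'utf-8'
    return s.encode(enc, errors='ignore').decode(enc)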
The next step is to separate Russian words from all the others. It is quite simple: we convert the characters to ASCII and see what remains. If more than 2 characters are left, we treat the word as a "proper" English word (the only exception that comes to mind is the Go language, though anyone who wishes can add it themselves).
def to_ascii(s):
    try:
        s = s.replace("'", '').replace("-", '').replace("|", '')
        return s.decode('utf-8').encode("ascii", errors="ignore").decode()
    except:
        return ''

def is_asciiword(s):
    ascii_word = to_ascii(s)
    return len(ascii_word) > 2
The next task is word normalization: to display a word cloud, each word must appear in a single case and declension. For English we simply strip the trailing "'s" and remove other unreadable characters such as parentheses. I am not sure this method is scientifically correct (and I am not a linguist), but for this task it is quite enough.
def normal_eng(s):
    for sym in ("'s", '{', '}', "'", '"', ';', '.', ',',
                '[', ']', '(', ')', '-', '/', '\\'):
        s = s.replace(sym, ' ')
    return s.lower().strip()
Now the most important part, the one for whose sake all this was started: parsing Russian words. As advised in the comments to the previous part, in Python this can be done with the pymorphy2 library. Let's see how it works.
import pymorphy2

morph = pymorphy2.MorphAnalyzer()
res = morph.parse(u"миру")
for r in res:
    print r.normal_form, r.tag, r.tag.case
For this example, we have the following results:
мир NOUN,inan,masc sing,datv datv
мир NOUN,inan,masc sing,loc2 loc2
миро NOUN,inan,neut sing,datv datv
мир NOUN,inan,masc sing,gen2 gen2
For the word «миру», MorphAnalyzer determined the normal form as the noun «мир» ("world/peace"; or «миро», though I have no idea what that is), singular (sing), and the possible cases: dative (datv), second genitive (gen2) or second locative (loc2).
With MorphAnalyzer, parsing is quite simple: we check that the word is a noun and take its normal form.
morph = pymorphy2.MorphAnalyzer()

def normal_rus(w):
    res = morph.parse(w)
    for r in res:
        if 'NOUN' in r.tag:
            return r.normal_form
    return None
It remains to put everything together, and see what happened. The code looks like this (non-essential fragments removed):
from collections import Counter

c_dict = Counter()
# titles is the pandas Series of article headlines selected above
for s in titles.values:
    for w in s.split():
        if is_asciiword(w):
            n = normal_eng(w)   # an English word
        else:
            n = normal_rus(w)   # a Russian word
        if n:
            c_dict[n] += 1
At the output we have a dictionary of words and their occurrence counts. Let's take the top 100 and build a word-popularity cloud from them:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

common = c_dict.most_common(100)
wc = WordCloud(width=2600, height=2200, background_color="white",
               relative_scaling=1.0, collocations=False,
               min_font_size=10).generate_from_frequencies(dict(common))
plt.axis("off")
plt.figure(figsize=(9, 6))
plt.imshow(wc, interpolation="bilinear")
plt.title("%d" % year)
plt.xticks([])
plt.yticks([])
plt.tight_layout()
file_name = 'habr-words-%d.png' % year
plt.savefig(file_name)  # save the frame for the gif later
plt.show()
The result, however, turned out to be very strange:

In text form, it was a list of the most frequent words with their counts (3958, 3619, 1828, 840 occurrences and so on).
The words "performing", "second" and "century" were leading by a wide margin. And although this is possible in principle (you can imagine a title like “Passing passwords at a speed of 1000 times per second will take a century”), but it was still suspicious that there are so many of these words. And not in vain - as debugging showed, MorphAnalyzer defined the word "c" as "second", and the word "c" as "century". Those. in the heading "With the help of technology ..." MorphAnalyzer singled out 3 words - "second", "help", "technology", which is obviously wrong. The following incomprehensible words were “when” (“When used ...”) and “already”, which were defined as the noun “straight” and “already”, respectively. The solution was simple - take into account when parsing only words longer than 2 characters, and enter a list of Russian-language exception words that would be excluded from the analysis. Again, maybe this is not entirely scientific (for example, an article about “observing changes in coloring already” would fall out of the analysis), but for this task already :) is enough.
The final result looks more or less plausible (with the exception of Go and, possibly, articles about grass snakes). It remains to save all of this as a gif (the gif generation code is in the previous part), and we get an animated picture of keyword popularity in Habr headlines from 2006 to 2019.
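For completeness, a minimal sketch of gluing the per-year PNG frames into a gif; the actual code is in the previous part, and imageio here is just one possible choice of library:

import imageio

frames = [imageio.imread('habr-words-%d.png' % year)
          for year in range(2006, 2020)]
imageio.mimsave('habr-words.gif', frames, fps=1)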

Conclusion
As you can see, analyzing Russian text with ready-made libraries turned out to be quite simple. With some reservations, of course: a living language is a flexible system with many exceptions, and meaning depends on context, so 100% accuracy is probably impossible to achieve at all. But for our task the code above is enough.
Working with Cyrillic text in Python is, by the way, far from perfect: minor problems with printing characters to the console, print output that does not always work out of the box, the need for u"" string literals in Python 2.7, and so on. It is strange that in the 21st century, when atavisms like KOI8-R or CP-1252 seem to have finally died out, string encoding problems are still relevant.
Finally, it is interesting to note that adding Russian words to the cloud hardly increased the informativeness of the picture compared to the English version: almost all IT terms are English, so the list of Russian words has changed much less over 10 years. To see changes in the Russian language itself, one would probably have to wait 50-100 years; after that time there will be a reason to update this article again ;)