About a month ago I published the post "The net came back with sea mud ...", which compared the Wikipedia and Bashorg frequency dictionaries. The comments brought many ideas on how to do it properly, as well as requests to parse other sites: Lurkmore and, of course, Habrahabr.
Via the link below, the frequency list of words from Habr comments that never occur in Habr posts (careful, there is quite a lot of profanity):
docs.google.com/file/d/0B-1U-yPHh8eSbk52bW84NXFyYm8/edit?usp=sharing
Even in my short time here I could not help noticing how much the locals love coining and using Habr-words, and I wanted to gauge the scale of the phenomenon.
Habr-words (more precisely, word forms; I did no stemming), sorted by frequency of use:
docs.google.com/file/d/0B-1U-yPHh8eST3l6M0tuZzVEOFE/edit?usp=sharing
The same words, but sorted in lexicographical order:
docs.google.com/file/d/0B-1U-yPHh8eSaFVsYTdJaGtlQUU/edit?usp=sharing
A bit of code:
For example, to get the text of this post, you need to do the following:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> bs = BeautifulSoup(urllib.urlopen("http://habrahabr.ru/post/192670/"))
>>> print get_post_text(bs)
And you get its text.
Perhaps a post that quotes itself in full, including the quotation of itself, including ... and so on, would be original ... I have read that a certain priest managed to write something similar on his dog's grave, but I will probably not try ...
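The `get_post_text` helper called in the session above is not reproduced here. As a rough idea of what such a helper might do, here is a minimal sketch in Python 3 that uses only the standard library instead of BeautifulSoup and takes the page HTML as a string rather than a parsed soup object. The class name of the post body is an assumption left to the caller (`css_class` is a hypothetical parameter), and the sketch expects well-nested markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags whose boundaries should separate words in the extracted text.
BREAK_TAGS = {"br", "p", "div", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class _ClassTextExtractor(HTMLParser):
    """Collects text inside the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.depth = 0       # nesting depth inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth and tag in BREAK_TAGS:
            self.chunks.append(" ")
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entered the target element

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        if tag in BREAK_TAGS:
            self.chunks.append(" ")
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def get_post_text(html, css_class):
    """Return whitespace-normalized text of the first element with css_class."""
    parser = _ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

The actual class of the post container has to be looked up in the page source. Inline tags such as `<b>` do not split words, while block tags and `<br>` are treated as word boundaries.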
Update
The Bashorg frequency dictionary has been posted in the comments:
habrahabr.ru/post/192670/#comment_6692542
A Habr frequency dictionary will follow as well. I made a mistake when splitting the text into words: the letter "ё" ended up among the separators, because I forgot that in the code table it does not lie between "а" and "я", so some words were cut. This evening I will start recounting everything, and tomorrow morning I will post the result.
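The splitting bug described in the update is easy to reproduce. The sketch below assumes Python 3 and a regex-based tokenizer. The original splitting code is not shown, so `NAIVE_WORD`, `FIXED_WORD` and `word_frequencies` are illustrative names, not the ones actually used.

```python
import re
from collections import Counter

# Naive pattern: the ranges cover U+0410..U+044F only.
# "ё" is U+0451 and "Ё" is U+0401, so both fall outside the ranges
# and silently act as word separators.
NAIVE_WORD = re.compile(r"[а-яА-Я]+")

# Fixed pattern: list ё and Ё explicitly next to the ranges.
FIXED_WORD = re.compile(r"[а-яА-ЯёЁ]+")


def word_frequencies(text, pattern):
    """Count lowercased word forms matched by the pattern (no stemming)."""
    return Counter(word.lower() for word in pattern.findall(text))


sample = "Ёжик всё ещё идёт"
naive = word_frequencies(sample, NAIVE_WORD)   # words cut at every ё
fixed = word_frequencies(sample, FIXED_WORD)   # whole words preserved
```

With the naive pattern every word in the sample is cut where "ё" occurs, while the fixed pattern keeps all four words intact.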