
And Again the Old Man Cast His Net... (Parsing Habr, Continued)

About a month ago I published the post "The Net Came Back with Sea Mud...", which compared the frequency dictionaries of Wikipedia and Bashorg. The comments brought plenty of ideas on how to do it properly, along with requests to parse other sites: Lurkmore and, of course, Habrahabr.

The link below lists, by frequency, the words from Habr comments that never occur in Habr posts (careful, there is quite a lot of profanity):
docs.google.com/file/d/0B-1U-yPHh8eSbk52bW84NXFyYm8/edit?usp=sharing
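
The comparison behind this list boils down to a set difference between two frequency dictionaries. Here is a minimal sketch in Python 3, assuming each dictionary is a `collections.Counter` that maps word forms to counts. The function name `comment_only_words` and the toy data are illustrative, not the code used for the post:

```python
from collections import Counter


def comment_only_words(comment_counts, post_counts):
    """Return (word, count) pairs for words seen in comments but never in posts.

    The result is sorted by descending count, then alphabetically.
    """
    only = {w: n for w, n in comment_counts.items() if w not in post_counts}
    return sorted(only.items(), key=lambda item: (-item[1], item[0]))


# Toy data standing in for the real dictionaries (hypothetical).
comments = Counter("lol kek lol habr".split())
posts = Counter("habr post".split())

result = comment_only_words(comments, posts)
print(result)
```

Sorting by `(-count, word)` keeps the most frequent comment-only words at the top and makes the order of ties deterministic.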

Even in my short time here, I could not help noticing how much the locals love coining and using habra-words, and I wanted to gauge the scale of the phenomenon.
Habra-words (more precisely, word forms, since I did no stemming), sorted by frequency of use:
docs.google.com/file/d/0B-1U-yPHh8eST3l6M0tuZzVEOFE/edit?usp=sharing
The same list, sorted in lexicographical order:
docs.google.com/file/d/0B-1U-yPHh8eSaFVsYTdJaGtlQUU/edit?usp=sharing
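
Counting word forms without stemming and producing both orderings can be sketched as follows. This is a Python 3 illustration only: the regex, the name `count_word_forms`, and the sample sentence are assumptions, not the author's code.

```python
import re
from collections import Counter

# Runs of Unicode letters: any word character that is not a digit or "_".
WORD_RE = re.compile(r"[^\W\d_]+")


def count_word_forms(text):
    """Count lowercase word forms as they appear, with no stemming."""
    return Counter(WORD_RE.findall(text.lower()))


# Hypothetical sample: three different forms of the same root.
counts = count_word_forms("Хабр, хабра и хабру; хабра!")

by_frequency = counts.most_common()   # most frequent first
alphabetical = sorted(counts.items()) # lexicographic (code point) order
```

Because nothing is stemmed, "хабр", "хабра" and "хабру" are counted as three separate entries, which is exactly what "word forms" means in the lists above.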

A bit of code:
# Generic helper: collect the HTML tags of a given type whose attributes satisfy a condition
def generic_get(soup, search_tag, condition):
    l = []
    for e in soup.findAll(search_tag):
        d = dict(e.attrs)
        if condition(d):
            l.append(e)
    return l

# Get the text of the post
def get_post_text(main_soup):
    return generic_get(main_soup, "div", lambda d: d.get("class", [''])[0] == "post")[0].text

# Get the text of all comments, joined with spaces
def get_comments_text(main_soup):
    return ' '.join([x.text for x in generic_get(main_soup, "div", lambda d: d.get("class", [''])[0] == "message")])


For example, to get the text of this post, you need to do the following:

>>> from bs4 import BeautifulSoup
>>> import urllib
>>> bs = BeautifulSoup(urllib.urlopen("http://habrahabr.ru/post/192670/"))
>>> print get_post_text(bs)


And get...
Perhaps a post that quotes itself in full, including the quotation of itself, including... and so on, would be original. I have read that a certain priest managed to write something similar on his dog's grave, but I probably won't try.


Update
The Bashorg frequency dictionary has been posted in the comments: habrahabr.ru/post/192670/#comment_6692542

A frequency dictionary for Habr is coming too. I accidentally made a mistake when splitting text into words: the letter "ё" ended up among the separators, because I forgot that in the code table it does not sit between "а" and "я", so some words were cut. This evening I will start recounting everything, and I will post the result tomorrow morning.
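
This kind of slip is easy to reproduce with "ё", which sits at U+0451 in Unicode, outside the а..я range (U+0430..U+044F). Here is a minimal Python 3 sketch, assuming a regex-based splitter. The actual splitting code is not shown in the post, so the patterns and the sample text below are purely illustrative:

```python
import re

# Hypothetical sample text containing the letter "ё".
text = "ёжик и ещё ёлка"

# Naive character class: "ё" falls outside а..я, so it acts as a separator.
naive = re.findall(r"[а-я]+", text)

# Fixed character class: "ё" is listed explicitly.
fixed = re.findall(r"[а-яё]+", text)

print(naive)  # words with "ё" come out truncated
print(fixed)  # words stay intact
```

With the naive pattern, "ёжик" loses its first letter and "ещё" loses its last one, which matches the "some words were cut" symptom described above.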

Source: https://habr.com/ru/post/192670/

