Hello everyone, Habr readers!
This is an article about yet another Habr parser.
At the end of September I read yet another article, and once again it contained the words “startup”, “innovation”, “javascript”, and “framework”.
It seemed they were in every post, so I decided to check. Details under the cut.
Writing the parser took a little over a month. I worked on it on and off, as the mood took me, got to grips with LINQ and MySQL, and wrote 2 more parsers along the way.
I used SharpDevelop 4.3 and MySQL 5.5.25 from the Denwer bundle.
As a MySQL client I use HeidiSQL.
There are 2 tables in the database: tpost
CREATE TABLE `tpost` (
  `id_post` int(10) unsigned NOT NULL,
  `date` date NOT NULL,
  `name_post` varchar(255) NOT NULL,
  `author` varchar(255) NOT NULL,
  PRIMARY KEY (`id_post`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and twordsinpost
CREATE TABLE `twordsinpost` (
  `id_post` int(10) unsigned NOT NULL,
  `count` int(11) NOT NULL,
  `word` varchar(255) NOT NULL,
  CONSTRAINT `FK_twordsinposts_tpost` FOREIGN KEY (`id_post`) REFERENCES `tpost` (`id_post`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
How it works: the program checks the connection to the database. If that succeeds, the user enters the IDs of the first and the last post, and the program builds a URL for each ID and downloads the page from it. Downloading is single-threaded so that the Habr server does not get angry at me. Page processing is multi-threaded, and loading into the database is again done in a single thread.
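The three stages described above can be sketched roughly as follows. This is only an illustration in Python, not the author's parser (which was written in SharpDevelop); the URL pattern and the `fetch`, `parse`, and `store` callbacks are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL pattern, used only for illustration.
BASE_URL = "https://example.org/post/{}/"


def build_urls(first_id, last_id):
    """Build one (id, URL) pair per post id in the inclusive range."""
    return [(post_id, BASE_URL.format(post_id))
            for post_id in range(first_id, last_id + 1)]


def run_pipeline(first_id, last_id, fetch, parse, store, workers=4):
    """Download sequentially, parse in a thread pool, store sequentially."""
    # Stage 1: single-threaded download, one request at a time.
    pages = [(post_id, fetch(url))
             for post_id, url in build_urls(first_id, last_id)]

    # Stage 2: multi-threaded processing of the downloaded pages.
    # pool.map keeps the results in the same order as the input.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(lambda item: (item[0], parse(item[1])), pages))

    # Stage 3: single-threaded load into the database.
    for post_id, record in parsed:
        store(post_id, record)
    return len(parsed)
```

Passing the three callbacks in as arguments keeps the sketch runnable without a network or a database: any function that takes a URL, a page, or a record will do.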
The post title, the author, the date, and the article text itself are extracted from the downloaded page via XPath. Comments are not touched yet: there are a lot of them, and that can be implemented in habraparser 2.0.
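As a rough illustration of XPath extraction, here is a minimal Python sketch. The sample page and its class names are invented for the example and do not reflect Habr's real markup; the standard-library parser used here also needs well-formed XML, so a real page would require an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; the real markup will differ.
SAMPLE_PAGE = """
<html><body>
  <div class="post">
    <h1 class="title">Sample post</h1>
    <a class="author">someuser</a>
    <span class="published">2012-09-30</span>
    <div class="content">First sentence. Second sentence.</div>
  </div>
</body></html>
"""


def extract_post(page):
    """Pull title, author, date and body text out of one page via XPath."""
    root = ET.fromstring(page.strip())
    post = root.find(".//div[@class='post']")
    return {
        "name_post": post.find("./h1[@class='title']").text,
        "author": post.find("./a[@class='author']").text,
        "date": post.find("./span[@class='published']").text,
        # itertext() also collects text nested inside child elements.
        "text": "".join(post.find("./div[@class='content']").itertext()),
    }
```

The dictionary keys mirror the column names of the `tpost` table, so one extracted record maps onto one row.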
In the article text, regular expressions remove all punctuation marks except periods and spaces. All words are converted to lower case. Then the number of occurrences of each word in the article is counted and written to the database.
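The cleanup and counting step might look like the sketch below. It is written in Python for brevity and is not the author's code; in particular, treating periods as word separators is an assumption, since the article only says that periods survive the cleanup.

```python
import re
from collections import Counter

# Keep letters, digits, underscores, whitespace and periods; drop the rest.
PUNCTUATION = re.compile(r"[^\w\s.]")


def count_words(text):
    """Return a Counter mapping each lower-cased word to its frequency."""
    cleaned = PUNCTUATION.sub("", text).lower()
    # Assumption: periods and whitespace both act as word separators.
    words = [word for word in re.split(r"[\s.]+", cleaned) if word]
    return Counter(words)
```

For example, `count_words("Java, java! C# and C++.")` counts "java" twice and "c" twice, because "#" and "+" are stripped along with the other punctuation. Each (word, count) pair would then become one row of `twordsinpost`.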
At first I decided to connect to MySQL via adapters (ADO.NET). MySQL connector 5.2.7 did connect to the database, but worked extremely slowly. Adding about 2000 rows took almost 5 minutes.

So I wrote the queries the old-fashioned way, by hand. I have not got to Entity Framework yet, and I doubt it would be faster. Meanwhile, connecting to MS SQL through adapters works briskly, just as fast as hand-written queries. As an exercise I loaded up to 2.5 million records into MS SQL, and the insert speed did not drop. But I digress.
As always, bugs surfaced during debugging.
1. There are articles on Habr without a single word. Up to November there are 716 of them. The latest one. They can be found with this subquery.
2. There are articles on Habr without an author. Up to November there are 371 of them. The latest one. They can be found with this subquery.
3. There is an article on Habr with a virus. Sorry, I could not recall the address. The antivirus (AVG) complained about it and killed the whole program. I had to turn the antivirus off.
Adding 200,000 articles was split into 2 stages and took 26 hours in total. The last articles fell in early November and skewed the statistics, so I deleted them.
The database now holds about 34 million records. It weighs 3 GB (with indexes) and includes 199983 articles.
MySQL settings: here is the ini file it runs with.
The laptop: i7, 8 GB of RAM, HDD.
I indexed the twordsinpost table on the fields (id_post, word, count). It seems to have become faster: a query now takes 30 seconds instead of 2 minutes. By the way, any advice on how to optimize a database of this size?
There is an idea to put the database on a website and write a simple PHP wrapper, but until I learn how to make MySQL work faster, I have to type the queries by hand. Besides, that is more flexible.
The parser itself!

The connection settings are hard-coded in the parser: IP 127.0.0.1, database name “habraparser”, user “root”, password “1”.
They are set in cDataBase.Connect().
For lack of anything better, I uploaded the code, the database, and the results to my dropbox.
So, the most interesting part!
The results of all queries are in the file. Or, again, on the server. Let's go!
There are 83174 empty posts on Habr.
Number of posts by month: query.
The number of different words: query. Replace the word WORD with your own. Very interesting.
For example, searching for “Habr%” turned up the word “Habrarevolution”. As for the query "Putin%", see for yourself.
And the most important query, mentions of a word in posts by month: query.
As before, replace “WORD” with your own. I started with the word "java".
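The linked query is not reproduced in the text, so the following is only a guess at what a month-by-month count over the two tables might look like. To stay self-contained it runs against an in-memory SQLite database with made-up rows; on MySQL the month expression would be something like `DATE_FORMAT(date, '%Y-%m')` rather than `strftime`.

```python
import sqlite3

# The "?" parameter plays the role of WORD; "%" is the LIKE wildcard.
MONTHLY_MENTIONS = """
SELECT strftime('%Y-%m', p."date")  AS month,
       COUNT(DISTINCT w.id_post)    AS posts,
       SUM(w."count")               AS mentions
FROM twordsinpost AS w
JOIN tpost AS p ON p.id_post = w.id_post
WHERE w.word LIKE ?
GROUP BY month
ORDER BY month
"""


def demo_connection():
    """Create the two tables in memory and fill them with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tpost (id_post INTEGER PRIMARY KEY, "date" TEXT,
                            name_post TEXT, author TEXT);
        CREATE TABLE twordsinpost (id_post INTEGER, "count" INTEGER, word TEXT);
    """)
    conn.executemany("INSERT INTO tpost VALUES (?, ?, ?, ?)", [
        (1, "2012-09-03", "A", "u1"),
        (2, "2012-09-20", "B", "u2"),
        (3, "2012-10-01", "C", "u1"),
    ])
    conn.executemany("INSERT INTO twordsinpost VALUES (?, ?, ?)", [
        (1, 3, "java"), (1, 2, "javascript"), (2, 1, "java"),
        (3, 5, "google"), (3, 4, "java"),
    ])
    return conn


def mentions_by_month(conn, pattern):
    """Return (month, posts, mentions) rows for words matching the pattern."""
    return conn.execute(MONTHLY_MENTIONS, (pattern,)).fetchall()
```

With the sample rows, `mentions_by_month(demo_connection(), "java%")` returns one row per month, where `posts` counts the distinct posts containing a matching word and `mentions` sums the per-post word counts.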

Then the word "google".

Then the word "Habr%".

The word “C#” cannot be found, since I stripped the hash sign with the regular expression. The same goes for C++. Alas.
The word "android".

The word "apple".

And finally my favorite. Startup%.

Javascript

"Framework" or "framework".

"Innovation%".

UPD.
Specially at the request of lomalkin: “bitcoin%”.

And finally, the tastiest part.
The code. I packed it into an archive; tell me if it should be uploaded somewhere else. It is on rghost.
A file with all the frequently used words. It is on rghost.
The database itself, 1,077 MB. Download for free without SMS. The mirror. The dump was created with HeidiSQL.
You can suggest more interesting words in the comments.
I hope it was interesting to read! Thank you all for your attention! Sincerely, Muxto.
UPD.
By popular demand, I posted the top 100 words on Habr. I was surprised by the word "t" in 92nd place. And the number 5 is mentioned more often than 4.