
Habrahabr statistics

Almost a week has passed since Habrahabr turned 6 years old. It would be very interesting to look at graphs of the site's growth indicators. Since the standard statistics are uninformative, I decided to collect all the information myself and analyze it. So, after almost a week of parsing and collecting information, I obtained the following interesting data (hidden or deleted posts and users, along with their comments, are not taken into account):

Beautiful graphs, measurement methods, a database with the collected data, and “Habr anomalies”: all of this is under the cut.


y is the number of published topics per month; x - timeline, 1 division - month

y is the number of published comments per month; x - timeline, 1 division - month

y is the number of user registrations per month; x - timeline, 1 division - month
I could not find out what caused the drop in registrations that began in August 2008 and bottomed out in September (1 registration in the whole month). Perhaps the users who registered during this period were banned en masse or switched to read-only.

y is the average number of topics published in a given hour; x - timeline, 1 division - hour
This graph was obtained by counting the number of topics published in each hour of the day over 6 years. With a shorter time window, the graph may shift.
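
A minimal sketch of how such an hourly profile could be computed, assuming the publication timestamps are available as Python `datetime` objects. The function name `average_topics_per_hour` and the sample data are hypothetical and are not taken from the original parser; the average here is the hourly count divided by the number of calendar days covered by the data.

```python
from collections import Counter
from datetime import datetime


def average_topics_per_hour(published_at):
    """Return {hour: average number of topics published in that hour per day}.

    The average is taken over the number of calendar days spanned by the data.
    """
    if not published_at:
        return {}
    counts = Counter(ts.hour for ts in published_at)
    first, last = min(published_at), max(published_at)
    days = (last.date() - first.date()).days + 1
    return {hour: counts.get(hour, 0) / days for hour in range(24)}


# Hypothetical sample: four topics over two days.
sample = [
    datetime(2012, 6, 1, 10, 15),
    datetime(2012, 6, 1, 10, 45),
    datetime(2012, 6, 2, 10, 5),
    datetime(2012, 6, 2, 23, 30),
]
profile = average_topics_per_hour(sample)
```

Restricting `published_at` to a shorter period before calling the function is one way to check how much the profile shifts with a smaller window.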

y is the average number of topics published on a given day; x - timeline, 1 division - day

y is the average final rating of topics for all time; x - timeline, 1 division - day
As it turned out, topics published on the weekend collect more upvotes. Perhaps this is because about half as many topics are published on weekends.
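
A rough sketch of the weekday comparison, assuming each topic is reduced to a (publication date, final rating) pair. The function name `average_rating_by_weekday` and the sample values are hypothetical; returning the topic count next to the average makes it easy to see both effects mentioned above (fewer topics, higher ratings).

```python
from collections import defaultdict
from datetime import date


def average_rating_by_weekday(topics):
    """topics: iterable of (publication_date, rating) pairs.

    Returns {weekday: (topic_count, average_rating)}, where weekday 0 is Monday.
    """
    totals = defaultdict(lambda: [0, 0])  # weekday -> [count, rating sum]
    for published, rating in topics:
        bucket = totals[published.weekday()]
        bucket[0] += 1
        bucket[1] += rating
    return {day: (n, s / n) for day, (n, s) in totals.items()}


# Hypothetical sample data.
sample = [
    (date(2012, 6, 4), 10),  # Monday
    (date(2012, 6, 4), 20),  # Monday
    (date(2012, 6, 9), 40),  # Saturday
]
stats = average_rating_by_weekday(sample)
```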


y is the number of users with the number of topics indicated in the x scale; x - the number of user topics
Sadly, slightly more than half of the users have not published a single topic.

y is the number of users with the number of comments specified in the x scale; x - number of user comments
As can be seen from the graph, about 15% of users post 1-5 comments and then stop being active.
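
The two distributions above (topics per user and comments per user) can be derived with the same kind of counting. The sketch below assumes a mapping from user name to the number of items that user posted; the names `activity_histogram` and `share_in_range` and the sample data are hypothetical.

```python
from collections import Counter


def activity_histogram(items_per_user):
    """Map each activity level (e.g. number of comments) to the number of users with it."""
    return Counter(items_per_user.values())


def share_in_range(items_per_user, low, high):
    """Fraction of users whose count lies in the inclusive range [low, high]."""
    if not items_per_user:
        return 0.0
    hits = sum(1 for n in items_per_user.values() if low <= n <= high)
    return hits / len(items_per_user)


# Hypothetical sample data.
comments_per_user = {"alice": 0, "bob": 3, "carol": 5, "dave": 120}
histogram = activity_histogram(comments_per_user)
low_activity = share_in_range(comments_per_user, 1, 5)
```

Calling `share_in_range(topics_per_user, 0, 0)` on a similar mapping for topics would give the share of users who never published a topic.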


y is the number of users with the karma value indicated on the x scale; x - user's karma
20% of users have zero karma. It is encouraging that users with positive karma outnumber those with negative karma.

How it was calculated


Since there is no direct access to the Habr database, I had to look for workarounds. As you may have noticed, each topic has its number in the address bar, so the very first post can be viewed at habrahabr.ru/post/1 . The solution came quickly: go through all published topics, starting with number 1 and ending with number 144,400 (at that time, the latest topic for which voting had already closed). Of these, 121,641 topics exist, of which 25,949 have been moved to drafts and a few hundred more are empty, like this one: habrahabr.ru/company/muk/blog/119653 . All topics were saved to files for further parsing, which took almost 10 GB.

Each topic was then parsed as follows: first the topic's author, rating, and publication date were extracted, then the comments were parsed, taking the author, rating, and date of each comment. This produced three tables. Once all the users were collected, the karma value and rating had to be fetched for each of them. With this approach, everyone who has published at least one topic or posted at least one comment is counted. Downloading and parsing all of this took about a week, running around the clock. The parsing software was written while the data was being downloaded. The request rate to Habr did not exceed 1 request per second.
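
The download step described above can be sketched roughly as follows. This is not the author's software: the names `fetch_page` and `crawl_topics`, the `http://` scheme in the URL template, the file naming, and the choice to treat any HTTP error as "topic does not exist" are all assumptions made for illustration. The fetch function is passed in as a parameter so that the loop can be tried without touching the network.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path


def fetch_page(url):
    """Download one page; return its bytes, or None if the server reports an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError:
        return None  # assumption: e.g. 404 means the topic id does not exist


def crawl_topics(first_id, last_id, out_dir, fetch=fetch_page,
                 url_template="http://habrahabr.ru/post/{}", min_interval=1.0):
    """Save every existing topic in [first_id, last_id] to out_dir; return the saved ids."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for topic_id in range(first_id, last_id + 1):
        started = time.monotonic()
        body = fetch(url_template.format(topic_id))
        if body is not None:
            (out / f"{topic_id}.html").write_bytes(body)
            saved.append(topic_id)
        # Pad each iteration so that requests start at least min_interval seconds apart.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return saved
```

With the default `min_interval=1.0`, a run over 144,400 ids takes at least about 40 hours, which is consistent with the "about a week" figure once comments and user pages are included.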
DB structure:

Download the database dump (MSSQL backup) here (132 MB):

Habr anomalies


During the parsing, a whole bunch of anomalies were found:

P.S. I welcome suggestions for interesting graphs that could be built from the collected data.

Added:

y is the number of topics written by users who registered in the month indicated on the x scale; x - timeline, 1 division - month
From this diagram it follows that the users who registered before mid-2008 wrote the most topics.
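
A sketch of how this cohort graph could be computed, assuming one author name per topic and a mapping from author name to registration date. The function name `topics_by_registration_month` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def topics_by_registration_month(topic_authors, registered_on):
    """Count topics grouped by the (year, month) in which their author registered.

    topic_authors: iterable with one author name per topic.
    registered_on: {author name: registration date}.
    Authors with an unknown registration date are skipped.
    """
    counts = Counter()
    for author in topic_authors:
        reg = registered_on.get(author)
        if reg is not None:
            counts[(reg.year, reg.month)] += 1
    return counts


# Hypothetical sample data: "ghost" has no known registration date.
topic_authors = ["alice", "alice", "bob", "ghost"]
registered_on = {"alice": date(2007, 3, 14), "bob": date(2009, 11, 2)}
cohorts = topics_by_registration_month(topic_authors, registered_on)
```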


It would be very interesting to see more data like this:
* average rating of topics by hour (it exists by day, but by hour would be more interesting)
* average number of votes per topic, by the time the topic was created
* average number of comments per topic by hour, again by the time the topic was created

Such information can give an idea of when it is best to publish in order to get maximum activity.

In general, it would also be interesting to see overall statistics on tags and hubs, but I understand that the parser would have to be run again.





There is a section "Best of all time." It would be interesting to look at the "worst of all time." habrahabr.ru/post/145045/#comment_4873731

The most downvoted topics:

The most downvoted comments:

The most downvoted users (by karma):



The first ten comments:


Note that the average rating of the first comment is +3.59, while the average rating of all comments is +0.98.
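
A sketch of how such a comparison could be computed, assuming that "first comment" means the earliest comment in each topic and that comments are available as (topic id, timestamp, rating) tuples. The function name `first_vs_all_average` and the sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean


def first_vs_all_average(comments):
    """comments: iterable of (topic_id, posted_at, rating) tuples.

    Returns (average rating of each topic's earliest comment,
             average rating of all comments).
    Raises statistics.StatisticsError if there are no comments.
    """
    comments = list(comments)
    first = {}  # topic_id -> (posted_at, rating) of the earliest comment seen
    for topic_id, posted_at, rating in comments:
        if topic_id not in first or posted_at < first[topic_id][0]:
            first[topic_id] = (posted_at, rating)
    first_avg = mean(r for _, r in first.values())
    all_avg = mean(r for _, _, r in comments)
    return first_avg, all_avg


# Hypothetical sample data: two topics with two comments each.
sample = [
    (1, datetime(2012, 6, 1, 10, 0), 4),
    (1, datetime(2012, 6, 1, 10, 5), 0),
    (2, datetime(2012, 6, 1, 9, 30), -2),
    (2, datetime(2012, 6, 1, 9, 0), 2),
]
first_avg, all_avg = first_vs_all_average(sample)
```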


The most exciting comment

Source: https://habr.com/ru/post/145045/

