
Detailed analysis of Habrahabr using the Wolfram Language (Mathematica)


You can download this post as a Mathematica document containing all the code used in the article, along with additional files, here (archive, ~147 MB).

Analysis of social networks and media resources is a popular field these days, so I was all the more surprised to find that there are essentially no articles on Habrahabr analyzing the large amount of information (posts, keywords, comments, etc.) that the site itself has accumulated over its long history.

I hope this post will interest many Habrahabr users. I welcome suggestions and ideas for possible future directions in which this work could be developed, as well as any comments and recommendations.
This post considers articles belonging to the hubs; in total, 62,000 articles from 264 hubs were analyzed. Articles written only for corporate company blogs were not considered, nor were posts that did not make it into the “interesting” category.

Because the database used in this post was built some time before publication, namely on April 26, 2015, posts published on Habrahabr after that date (and, possibly, new hubs) are not covered here.

Table of contents


Importing the hub list
Importing links to all Habrahabr articles
Importing all Habrahabr articles
Functions for extracting specific data from a post's symbolic XML representation
Creating a database of Habrahabr posts using Dataset
Data processing results
- Brief analysis of hubs
- Hub connection graph on Habrahabr
- Number of articles over time
- Number of images (videos) used in posts over time
- Keyword clouds for Habrahabr and individual hubs
- Sites referenced in articles on Habrahabr
- Code listings in articles on Habrahabr
- Word frequency
- Post ratings and view counts, and the probability of reaching specific values
- Dependence of post rating and view count on publication time
- Dependence of post rating on post length
Conclusion

Importing the hub list


We import the list of hubs and present it as a Dataset, Wolfram's built-in database-like format, for convenience in further work.

HabrAnalysisInWolframLanguage_1.png

HabrAnalysisInWolframLanguage_2.png

HabrAnalysisInWolframLanguage_3.png

HabrAnalysisInWolframLanguage_4.png
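Since the original code is shown only as images, here is a minimal Wolfram Language sketch of what this import step might look like. The URL, the class attribute, and all names are my assumptions, not the author's actual code:

```mathematica
(* Hypothetical sketch: scrape hub names from the hub directory page
   and wrap them in a Dataset. URL and "hub-name" class are assumptions. *)
hubsXML = Import["https://habrahabr.ru/hubs/", "XMLObject"];
hubNames = Cases[hubsXML,
   XMLElement["a", {___, "class" -> "hub-name", ___}, {name_String}] :> name,
   Infinity];
hubsDataset = Dataset[<|"Hub" -> #|> & /@ hubNames]
```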

Importing links to all Habrahabr articles


A function that imports links from the n-th page of a hub:

HabrAnalysisInWolframLanguage_5.png
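A possible shape for such a per-page importer (the URL pattern and function name are assumptions):

```mathematica
(* Import all hyperlinks from page n of a hub and keep only post links. *)
importPageLinks[hub_String, n_Integer] :=
 Select[
  Import["https://habrahabr.ru/hub/" <> hub <> "/page" <> ToString[n] <> "/",
   "Hyperlinks"],
  StringContainsQ[#, "/post/"] &]
```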

A function that imports links to all articles in a given hub:

HabrAnalysisInWolframLanguage_6.png
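One way to sketch this, assuming a hypothetical per-page helper importPageLinks[hub, n] like the one just described:

```mathematica
(* Walk the pages of a hub until an empty page is returned,
   then return the de-duplicated links. *)
importHubLinks[hub_String] :=
 Module[{n = 1, links = {}, page},
  While[(page = importPageLinks[hub, n]) =!= {},
   links = Join[links, page]; n++];
  Union[links]]
```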

A function that imports links to all posts from all hubs (except corporate blogs):

HabrAnalysisInWolframLanguage_7.png

Import links to all posts from all hubs and save them as a Wolfram Language binary dump file (for instant reuse later):

HabrAnalysisInWolframLanguage_8.png

Importing all Habrahabr articles


The full database of post links:

HabrAnalysisInWolframLanguage_9.png

HabrAnalysisInWolframLanguage_10.png

There are quite a few duplicates among them, because the same post is often published in several hubs. Duplicates make up ~30.6% of the posts, as the code below shows.

HabrAnalysisInWolframLanguage_11.png

HabrAnalysisInWolframLanguage_12.png

Create a list of unique links to posts:

HabrAnalysisInWolframLanguage_13.png

HabrAnalysisInWolframLanguage_14.png
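De-duplication in the Wolfram Language might be sketched as follows (linkDatabase is an assumed name for the per-hub link lists built above):

```mathematica
(* Flatten the per-hub lists, de-duplicate, and measure the duplicate share. *)
allLinks = Flatten[Values[linkDatabase]];
uniqueLinks = DeleteDuplicates[allLinks];
duplicateShare = N[1 - Length[uniqueLinks]/Length[allLinks]]
```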

In total, we have 62,000 links that correspond to the same number of articles.

HabrAnalysisInWolframLanguage_15.png

HabrAnalysisInWolframLanguage_16.png

Let's create a function that, given a link to a post, imports the HTML code of the page as a symbolic XML object (XMLObject) and saves it as a serialized Wolfram Language .mx file.

HabrAnalysisInWolframLanguage_17.png
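A sketch of such an importer; the directory layout and file naming are my assumptions:

```mathematica
(* Import a post as symbolic XML and serialize it to an .mx file. *)
importPost[url_String] :=
 Module[{xml = Import[url, "XMLObject"], file},
  file = FileNameJoin[{"posts",
     Last[DeleteCases[StringSplit[url, "/"], ""]] <> ".mx"}];
  Export[file, xml];
  file]
```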

Run the download of all posts:

HabrAnalysisInWolframLanguage_18.png

After the download is complete, we will receive 62,000 files on the hard disk:

HabrAnalysisInWolframLanguage_19.png

Functions for extracting specific data from a post's symbolic XML representation


Having loaded all Habrahabr posts as symbolic XML objects, we now need to extract the information of interest from them. To do this, we create the functions presented below.

Post title

HabrAnalysisInWolframLanguage_20.png
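One way such an extractor might look; the "post_title" class name is an assumption about Habrahabr's 2015 markup, not the author's code:

```mathematica
(* Pull the first element carrying the assumed title class out of the XML. *)
postTitle[xml_] :=
 First[Cases[xml,
   XMLElement[_, {___, "class" -> "post_title", ___}, {title_String}] :>
    StringTrim[title], Infinity], Missing["NotFound"]]
```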

List of hubs in which the post was published

HabrAnalysisInWolframLanguage_21.png

Date and time of publication of the post in absolute time format (for convenience of further work).

HabrAnalysisInWolframLanguage_22.gif

Post rating

HabrAnalysisInWolframLanguage_23.png

The number of post views

HabrAnalysisInWolframLanguage_24.png

Statistics on hyperlinks given in the post

HabrAnalysisInWolframLanguage_25.png

Number of images used in the post

HabrAnalysisInWolframLanguage_26.png

Number of comments on the post

HabrAnalysisInWolframLanguage_27.png

The number of videos inserted in the post

HabrAnalysisInWolframLanguage_28.png

The text of the post in a standardized form (paragraph breaks removed, all letters converted to upper case)

HabrAnalysisInWolframLanguage_29.gif

Statistics on code listings given in the post

HabrAnalysisInWolframLanguage_30.png

Keywords

HabrAnalysisInWolframLanguage_31.png

Creating a database of Habrahabr posts using Dataset


In some cases, access to a post is closed for various reasons; following such a link shows a page like this:

HabrAnalysisInWolframLanguage_32.png

Let's create a function that filters out such pages:

HabrAnalysisInWolframLanguage_33.gif

Now we will load the paths to all .mx files where posts are stored:

HabrAnalysisInWolframLanguage_34.png

HabrAnalysisInWolframLanguage_35.png

And remove the closed ones:

HabrAnalysisInWolframLanguage_36.png

In total, about 0.5% of posts were removed as closed:

HabrAnalysisInWolframLanguage_37.png

HabrAnalysisInWolframLanguage_38.png

Let's create a function that builds one row of the Habrahabr post database we will construct below. We do this using the previously created extraction functions together with Association.

HabrAnalysisInWolframLanguage_39.png
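A sketch of such a row builder; the extractor names mirror the functions described above but are my own hypothetical naming:

```mathematica
(* One database row per post, as an Association of field -> value. *)
postRow[xml_] := <|
  "Title" -> postTitle[xml],
  "Hubs" -> postHubs[xml],
  "Date" -> postDate[xml],
  "Rating" -> postRating[xml],
  "Views" -> postViews[xml],
  "Images" -> postImageCount[xml],
  "Comments" -> postCommentCount[xml]|>
```

Mapping postRow over all loaded posts and wrapping the result in Dataset then yields the database used below.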

Finally, let's create the database of Habrahabr posts using the Dataset function:

HabrAnalysisInWolframLanguage_40.png

HabrAnalysisInWolframLanguage_41.png

HabrAnalysisInWolframLanguage_42.png

Data processing results


Brief analysis of hubs


Find the distribution of the number of hubs in which an article is published:

HabrAnalysisInWolframLanguage_43.png

HabrAnalysisInWolframLanguage_44.png

Let's present this fragment of Dataset in the form of a table:

HabrAnalysisInWolframLanguage_45.png

HabrAnalysisInWolframLanguage_46.png

Find the hubs with the largest number of articles:

HabrAnalysisInWolframLanguage_47.png

HabrAnalysisInWolframLanguage_48.png

If we consider only unique articles (belonging to exactly one hub), the picture changes somewhat:

HabrAnalysisInWolframLanguage_49.png

HabrAnalysisInWolframLanguage_50.png

Let's also find the number of company posts (posts written by a company only for its own blog are not counted here):

HabrAnalysisInWolframLanguage_51.png

HabrAnalysisInWolframLanguage_52.png

Hub connection graph on Habrahabr


Let's create a function that computes the similarity of two hubs based on the lists of posts published in them, using the Sørensen coefficient:

HabrAnalysisInWolframLanguage_53.gif
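The Sørensen coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|); over lists of post links it might be sketched as:

```mathematica
(* Sørensen similarity of two hubs over their sets of post links.
   The function name is my own; the formula follows the definition. *)
hubSimilarity[a_List, b_List] :=
 2. Length[Intersection[a, b]]/(Length[a] + Length[b])
```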

Create a list of all possible pairs of hubs (company hubs are not considered):

HabrAnalysisInWolframLanguage_54.png

For each pair of hubs, we calculate their similarity coefficient:

HabrAnalysisInWolframLanguage_55.png

Create lists defining the edges of the graph and their weights:

HabrAnalysisInWolframLanguage_56.png

For coloring, create a function that normalizes the obtained similarity coefficients to the interval [0, 1]:

HabrAnalysisInWolframLanguage_57.png

Set the color, thickness, and transparency of the edges depending on the similarity coefficient: the greater an edge's weight, the thicker and redder it is; the smaller its weight, the thinner and more transparent it is.

HabrAnalysisInWolframLanguage_58.png

The resulting graph is interactive: hovering over a vertex shows its name.


HabrAnalysisInWolframLanguage_59.png

HabrAnalysisInWolframLanguage_60.png

You can also change the style of this graph to display the vertex names. The full-size graph is available at this link (image, 12 MB).

HabrAnalysisInWolframLanguage_61.png

HabrAnalysisInWolframLanguage_62.png

Number of articles over time


Let's create a function that visualizes the number of published articles, both across all of Habrahabr and within a given hub:

HabrAnalysisInWolframLanguage_63.png

Let's look at its output. The graphs suggest that the number of posts published per year on Habrahabr is currently plateauing, approaching roughly 11,000 posts per year.

HabrAnalysisInWolframLanguage_64.png

HabrAnalysisInWolframLanguage_65.png

Since 2012, there has been a rapid increase in publications in the “Mathematics” hub:

HabrAnalysisInWolframLanguage_66.png

HabrAnalysisInWolframLanguage_67.png

Since 2011, one can observe fading interest in Flash:

HabrAnalysisInWolframLanguage_68.png

HabrAnalysisInWolframLanguage_69.png

Meanwhile, since 2010 the “Game Development” hub has been growing by leaps and bounds:

HabrAnalysisInWolframLanguage_70.png

HabrAnalysisInWolframLanguage_71.png

Interestingly, the “Habrahabr” hub itself is receiving fewer and fewer articles:

HabrAnalysisInWolframLanguage_72.png

HabrAnalysisInWolframLanguage_73.png

Number of images (videos) used in posts over time


Let's create a function that visualizes the number of images (or videos) in published posts, both across all of Habrahabr and within a given hub:

HabrAnalysisInWolframLanguage_74.png

HabrAnalysisInWolframLanguage_75.png

HabrAnalysisInWolframLanguage_76.png

HabrAnalysisInWolframLanguage_77.png

HabrAnalysisInWolframLanguage_78.png

HabrAnalysisInWolframLanguage_79.png

HabrAnalysisInWolframLanguage_80.png

HabrAnalysisInWolframLanguage_81.png

HabrAnalysisInWolframLanguage_82.png

Let's look at some hubs:

HabrAnalysisInWolframLanguage_83.png

HabrAnalysisInWolframLanguage_84.png

HabrAnalysisInWolframLanguage_85.png

HabrAnalysisInWolframLanguage_86.png

HabrAnalysisInWolframLanguage_87.png

HabrAnalysisInWolframLanguage_88.png

Keyword clouds for Habrahabr and individual hubs


Let's find the usage counts of keywords across all the analyzed posts on Habrahabr:

HabrAnalysisInWolframLanguage_89.png

HabrAnalysisInWolframLanguage_90.png

Choose the 150 most common among them:

HabrAnalysisInWolframLanguage_91.png

HabrAnalysisInWolframLanguage_92.png

Create from them a word cloud in which the size of a word (or phrase) is directly proportional to the number of its occurrences:

HabrAnalysisInWolframLanguage_93.png

HabrAnalysisInWolframLanguage_94.png
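With a keyword-to-count association in hand (keywordCounts is an assumed name), such a cloud might be produced like this:

```mathematica
(* WordCloud scales each word in proportion to its weight; here we
   keep only the 150 most frequent keywords. *)
WordCloud[Take[Sort[keywordCounts, Greater], 150]]
```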

We can also create a mask from some string:

HabrAnalysisInWolframLanguage_95.png

HabrAnalysisInWolframLanguage_96.png

and build on its basis a word cloud containing the 750 most common keywords (phrases):

HabrAnalysisInWolframLanguage_97.png

HabrAnalysisInWolframLanguage_98.png

You can also make a word cloud of any shape:

HabrAnalysisInWolframLanguage_99.png

HabrAnalysisInWolframLanguage_100.png

Now let's create a function that visualizes a cloud of the most popular keywords of a given hub (100 words by default):

HabrAnalysisInWolframLanguage_101.png

100 keywords of the hub “Mathematics”:

HabrAnalysisInWolframLanguage_102.png

HabrAnalysisInWolframLanguage_103.png

30 keywords of the “Mathematics” hub:

HabrAnalysisInWolframLanguage_104.png

HabrAnalysisInWolframLanguage_105.png

Keywords of the “Programming” hub:

HabrAnalysisInWolframLanguage_106.png

HabrAnalysisInWolframLanguage_107.png

Keywords of the “JAVA” hub:

HabrAnalysisInWolframLanguage_108.png

HabrAnalysisInWolframLanguage_109.png

200 keywords of the “open source” hub:

HabrAnalysisInWolframLanguage_110.png

HabrAnalysisInWolframLanguage_111.png

Sites referenced in articles on Habrahabr


Let's create a function that shows the sites most often linked to, both on Habrahabr as a whole and within a given hub:

HabrAnalysisInWolframLanguage_112.png

Let's find the sites most often linked to on Habrahabr:

HabrAnalysisInWolframLanguage_113.png

HabrAnalysisInWolframLanguage_114.png

The picture becomes clearer if we remove the main source of links, Habrahabr itself.

HabrAnalysisInWolframLanguage_115.png

HabrAnalysisInWolframLanguage_116.png

Let's find the sites most often linked to in the “Mathematics” hub (from here on we remove Habrahabr itself everywhere, since it is, unsurprisingly, the most frequently linked site):

HabrAnalysisInWolframLanguage_117.png

HabrAnalysisInWolframLanguage_118.png

Now let's look at, say, the “iOS Development” hub:

HabrAnalysisInWolframLanguage_119.png

HabrAnalysisInWolframLanguage_120.png

And here is the .NET hub:

HabrAnalysisInWolframLanguage_121.png

HabrAnalysisInWolframLanguage_122.png

Code listings in articles on Habrahabr


Let's find the share of articles that contain no code insertions (a substantial error is possible here, since authors do not always insert code using the special tag; in this post, for example, the code is inserted as images).

HabrAnalysisInWolframLanguage_123.png

HabrAnalysisInWolframLanguage_124.png

Let's create a function that shows statistics on the languages of code insertions in posts, both on Habrahabr as a whole and within a given hub. If the author did not specify a language, the fragment is labeled “SomeCode”. Note that we do not normalize the language names given by the authors.

HabrAnalysisInWolframLanguage_125.png

Let's find the distribution of code-insertion languages for the whole of Habrahabr:

HabrAnalysisInWolframLanguage_126.png

HabrAnalysisInWolframLanguage_127.png

The picture becomes clearer if we remove the insertions with no programming language specified:

HabrAnalysisInWolframLanguage_128.png

HabrAnalysisInWolframLanguage_129.png

Now let's look at the most popular code-insertion languages in the “Algorithms” hub:

HabrAnalysisInWolframLanguage_130.png

HabrAnalysisInWolframLanguage_131.png

The “Programming” hub:

HabrAnalysisInWolframLanguage_132.png

HabrAnalysisInWolframLanguage_133.png

The “Web Development” hub:

HabrAnalysisInWolframLanguage_134.png

HabrAnalysisInWolframLanguage_135.png

The “Configuring Linux” hub:

HabrAnalysisInWolframLanguage_136.png

HabrAnalysisInWolframLanguage_137.png

The “Search engines and technologies” hub:

HabrAnalysisInWolframLanguage_138.png

HabrAnalysisInWolframLanguage_139.png

Word frequency


The Yandex “Word Selection” (Wordstat) service is very useful if you want to write, say, an article that will interest a wide audience: it shows how often words are searched for. Based on the loaded information about Habrahabr articles, we can build a rough analogue of this service that reports the frequency with which words (groups of words, or string patterns) occur in article texts. This lets us trace the audience's interest in a particular topic.

So let's create a function that computes this kind of word frequency:

HabrAnalysisInWolframLanguage_140.gif
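The core of such a function might be as simple as the following sketch (postTexts is an assumed list of standardized post texts, not the author's variable):

```mathematica
(* Total number of occurrences of a string pattern across a list of texts.
   StringCount threads over the list of texts automatically. *)
wordFrequency[texts_List, pattern_] :=
 Total[StringCount[texts, pattern, IgnoreCase -> True]]
```

Grouping the counts by publication year before totaling would then give the time-series plots shown below.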

Now we can look at various things; for example, we can compare which name for the resource, “Habrahabr” or “Habr”, is used more often on Habrahabr itself:

HabrAnalysisInWolframLanguage_141.png

HabrAnalysisInWolframLanguage_142.png

Or we can compare how often the names of various programming languages are used across Habrahabr:

HabrAnalysisInWolframLanguage_143.png

HabrAnalysisInWolframLanguage_144.png

Comparing the frequency of mentions of mathematical packages (expressions of the form "string" ~~ _, used in the previous example, match the given string followed by an arbitrary character, which captures word forms with different endings; say, "wolfram" ~~ _ captures the various grammatical forms of "wolfram"):

HabrAnalysisInWolframLanguage_145.png

HabrAnalysisInWolframLanguage_146.png

You can, of course, look into all sorts of things, for example the frequency of occurrence of the words “Russia”, “USA”, and “Europe”:

HabrAnalysisInWolframLanguage_147.png

HabrAnalysisInWolframLanguage_148.png

Or you can observe the gradual fading of interest in some technology:

HabrAnalysisInWolframLanguage_149.png

HabrAnalysisInWolframLanguage_150.png

Or the emergence of a new one:

HabrAnalysisInWolframLanguage_151.png

HabrAnalysisInWolframLanguage_152.png

You can also look at word frequencies within individual hubs; say, the frequency of the words “iOS” and “Android” in the “iOS Development” hub:

HabrAnalysisInWolframLanguage_153.png

HabrAnalysisInWolframLanguage_154.png

Or the same words, but in the “Android Development” hub:

HabrAnalysisInWolframLanguage_155.png

HabrAnalysisInWolframLanguage_156.png

You can compare the frequency of using operating system names in the “Open source” hub:

HabrAnalysisInWolframLanguage_157.png

HabrAnalysisInWolframLanguage_158.png

with Habrahabr as a whole:

HabrAnalysisInWolframLanguage_159.png

HabrAnalysisInWolframLanguage_160.png

Post ratings and view counts, and the probability of reaching specific values


Select the pairs (post rating, number of post views):

HabrAnalysisInWolframLanguage_161.png

HabrAnalysisInWolframLanguage_162.png

Let's plot their distribution on the plane, on both linear and logarithmic scales:

HabrAnalysisInWolframLanguage_163.png

HabrAnalysisInWolframLanguage_164.png

The disadvantage of these plots is that they do not reflect the density of the points.

Let's construct two-dimensional and three-dimensional density plots of the distribution of these pairs:

HabrAnalysisInWolframLanguage_165.png

HabrAnalysisInWolframLanguage_166.png

HabrAnalysisInWolframLanguage_167.png

The average post rating on Habrahabr is 34.5, and the average number of views is 14,237.3:

HabrAnalysisInWolframLanguage_168.png

HabrAnalysisInWolframLanguage_169.png

However, averages alone are not a complete statistical description. Let's construct the distribution of the pairs (i.e., the distribution of a two-dimensional random variable):

HabrAnalysisInWolframLanguage_170.png

HabrAnalysisInWolframLanguage_171.png
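A sketch of this step, assuming pairs is the list of (rating, views) pairs extracted above:

```mathematica
(* Fit an empirical (data-defined) distribution to the pairs, then query it. *)
dist = EmpiricalDistribution[pairs];
Mean[dist]                                          (* expected (rating, views) *)
NProbability[r >= 100, {r, v} \[Distributed] dist]  (* P(rating of at least 100) *)
```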

Find the expected value:

HabrAnalysisInWolframLanguage_172.png

HabrAnalysisInWolframLanguage_173.png

As well as the standard deviation:

HabrAnalysisInWolframLanguage_174.png

HabrAnalysisInWolframLanguage_175.png

We can also find, for example, the probability that a post will reach a given rating:

HabrAnalysisInWolframLanguage_176.png

HabrAnalysisInWolframLanguage_177.png

Now let's find the probability that a post will reach a given number of views:

HabrAnalysisInWolframLanguage_178.png

HabrAnalysisInWolframLanguage_179.png

Dependence of post rating and view count on publication time


The code below shows that over its whole history the articles on Habr have accumulated a total rating of about 2.1 million, and a total view count close to 1 billion:

HabrAnalysisInWolframLanguage_180.png

HabrAnalysisInWolframLanguage_181.png

Let's extract the triples (publication time, post rating, number of post views):

HabrAnalysisInWolframLanguage_182.png

Let's study how post ratings behave depending on publication time:

HabrAnalysisInWolframLanguage_183.png

HabrAnalysisInWolframLanguage_184.png

HabrAnalysisInWolframLanguage_185.png

HabrAnalysisInWolframLanguage_186.png

HabrAnalysisInWolframLanguage_187.png

HabrAnalysisInWolframLanguage_188.png

HabrAnalysisInWolframLanguage_189.png

HabrAnalysisInWolframLanguage_190.png

Let's examine the number of post views depending on publication time:

HabrAnalysisInWolframLanguage_191.png

HabrAnalysisInWolframLanguage_192.png

HabrAnalysisInWolframLanguage_193.png

HabrAnalysisInWolframLanguage_194.png

HabrAnalysisInWolframLanguage_195.png

HabrAnalysisInWolframLanguage_196.png

HabrAnalysisInWolframLanguage_197.png

HabrAnalysisInWolframLanguage_198.png

Dependence of post rating on post length


Select pairs of the form (post length, post rating); the length of a post, which we will also call its volume, is computed as the total number of characters in the post:

HabrAnalysisInWolframLanguage_199.png

HabrAnalysisInWolframLanguage_200.png

Let's plot their distribution on the plane, on both linear and logarithmic scales:

HabrAnalysisInWolframLanguage_201.png

HabrAnalysisInWolframLanguage_202.png

Let's construct two-dimensional and three-dimensional density plots of the distribution of these pairs:

HabrAnalysisInWolframLanguage_203.png

HabrAnalysisInWolframLanguage_204.png

HabrAnalysisInWolframLanguage_205.png

The average length of a post on Habrahabr is 5,989 characters:

HabrAnalysisInWolframLanguage_206.png

HabrAnalysisInWolframLanguage_207.png

As before, let's construct the distribution of these pairs (the distribution of a two-dimensional random variable):

HabrAnalysisInWolframLanguage_208.png

HabrAnalysisInWolframLanguage_209.png

Find the probability that a post no longer than a specified number of characters will reach at least a specified rating:

HabrAnalysisInWolframLanguage_210.gif

HabrAnalysisInWolframLanguage_211.png

Conclusion


I hope this analysis has interested you and will prove useful. Of course, many more studies could be carried out on the resulting database, answering questions such as: Will a given post be popular (predicting its level of popularity)? What affects the number of comments? How do you find the best topic for a post? And much more. But those are topics for future posts.

Update, 3:21 AM, April 30: thanks to the attentiveness of Power, the computed values related to post ratings have been corrected. Compared with the previously computed values, the differences turned out to be quite insignificant. The integrity of the entire chain of algorithms was restored by fixing the bug in the extractData["Raiting"] function.

Source: https://habr.com/ru/post/256999/

