
Habrahabr articles citation graph

At some point I got curious: how strongly are the articles on Habr connected to one another? So today we will study how the articles are linked, and not only compute numerical metrics but also look at the whole picture.



(This is not just an eye-catching picture but the citation graph of articles within Habrahabr, where the size of each vertex is determined by its number of incoming edges, i.e. by the number of citations inside Habr.)


It all started when, in the comments to the article about the Habr graph and karma, Tiberius and Loriowar voiced an idea that had been floating in the air: why not look at the citation graph of articles within Habr itself?




You asked? We answer. To keep the story from being mere hand-waving, let us spell out the questions to be addressed:



Heavy traffic under the cut. All pictures are clickable.


Brief explanations on terminology:


A hub is a vertex with a large number of outgoing links, and an "authoritative source" (authority) is a vertex with a large number of incoming links. By connectivity, we mean the average number of edges per vertex (incoming or outgoing). A self-citation is an edge whose two endpoint articles have the same author.
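The definitions above can be illustrated with a minimal sketch. The article ids, author names and edges below are hypothetical toy data, not taken from the real dataset, and the code is only one possible way to compute these quantities.

```python
from collections import Counter

# Hypothetical toy data: article id -> author, and directed citation edges
# (the first article links to the second one).
authors = {1: "alice", 2: "alice", 3: "bob", 4: "carol"}
edges = [(1, 2), (3, 2), (4, 2), (4, 1), (4, 3)]

out_degree = Counter(src for src, _ in edges)  # many outgoing links -> hub
in_degree = Counter(dst for _, dst in edges)   # many incoming links -> authority

top_hub = max(authors, key=lambda v: out_degree[v])
top_authority = max(authors, key=lambda v: in_degree[v])

# A self-citation is an edge whose two endpoints share an author.
self_citations = [(s, d) for s, d in edges if authors[s] == authors[d]]
```

In this toy graph article 4 is the hub, article 2 is the authority, and the only self-citation is the edge from article 1 to article 2.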


Graph of cited articles (inside Habr)


Take the graph from the beginning of the article and look carefully at each of the clusters and large vertices. I managed to identify and label several interesting "communities" of articles.





Unfortunately, post number one, habrahabr.ru/post/1, received many incoming edges for purely technical reasons (an imperfect parser); in fact, no one referred to it.


The remaining clusters are quite interesting. For example, there is a whole group of IT stories in the spirit of Grace "Grandma COBOL" Hopper, and a whole series of articles on tensor algebra. In total we have 95 thousand vertices and about 50 thousand edges. Connectivity is very low: on average a vertex has about one edge, and about 60% of all vertices are not connected to any other article on Habr (see the large dense cloud around the graph in the very last picture below).
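As a rough check of the figures above, here is a small sketch. It assumes that "edges per vertex" counts both endpoints of every edge, which is one reading that matches "about one edge" for the rounded counts; the toy graph and the helper name `isolated_share` are hypothetical.

```python
def isolated_share(vertices, edges):
    """Fraction of vertices that do not appear in any edge."""
    linked = {v for edge in edges for v in edge}
    return sum(1 for v in vertices if v not in linked) / len(vertices)

# Rounded counts quoted in the text.
n_vertices, n_edges = 95_000, 50_000
# Each edge touches two vertices, so the average is 2 * |E| / |V|.
avg_edges_per_vertex = 2 * n_edges / n_vertices

# Hypothetical toy graph: articles 4 and 5 neither cite nor are cited.
toy_vertices = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (3, 2)]
toy_share = isolated_share(toy_vertices, toy_edges)
```

With the rounded counts the average comes out slightly above one, and in the toy graph two of the five articles are isolated.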


Graph without self-citation


As we can see, the picture has changed significantly and a number of clusters have disappeared. In general, this reflects the classic scenario in which a series of articles by one author is tightly connected because each article links to the entire series.
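The effect can be reproduced on a toy example. The sketch below is not the author's pipeline (the graphs in the article were built in Gephi); it simply drops edges whose endpoints share an author and compares the clusters before and after. All ids and names are hypothetical.

```python
from collections import defaultdict

def clusters(edges):
    """Weakly connected components (as sets) containing at least two vertices."""
    neighbours = defaultdict(set)
    for s, d in edges:
        neighbours[s].add(d)
        neighbours[d].add(s)
    seen, result = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal ignoring edge direction
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(neighbours[v] - component)
        seen |= component
        if len(component) > 1:
            result.append(component)
    return result

# Hypothetical data: a three-part series by one author plus one ordinary citation.
authors = {1: "alice", 2: "alice", 3: "alice", 4: "bob", 5: "carol"}
edges = [(1, 2), (2, 1), (2, 3), (3, 1), (4, 5)]

without_self = [(s, d) for s, d in edges if authors[s] != authors[d]]

before = clusters(edges)
after = clusters(without_self)
```

Here the single-author series forms a cluster only in the full graph; once self-citations are removed, only the cross-author link remains.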





However, a number of clusters still survived. Let's take a closer look at them.


"Folk" clusters


The three biggest and most interesting clusters that survived are the translations of Passionate Programmer, KingPin, and Peter Thiel's lectures. Excellent teamwork, including documenting the series! This is a very interesting and positive result. It shows that the community can carry out fairly large and complex work in a coordinated manner while maintaining referential integrity: having found one article, you can always find the entire series.



Hub map, a.k.a. the graph of outgoing edges


We have already looked at the "authoritative sources", where the weight of a vertex was determined by its incoming edges. Now we can look at the vertices with a large number of outgoing edges and determine what kinds of hubs are present in the network.





Consider the degree of influence of each of the hubs, highlighting their edges.



Now let us look carefully at what kinds of hubs these are.



As we can see, these are mainly posts with collections of interesting materials on Habr itself, for example a top list of the most interesting articles or a collection of materials on Python. This is certainly logical: the largest number of outgoing links belongs to catalogs whose purpose is to collect links (where is the meta-review of all reviews of Habr's articles?).


This graph also tells us about the community's great love for Python (and, I must say, not without reason).


Leaders by number of incoming and outgoing citations


Consider the remaining posts (25+ links) regardless of edge direction (that is, we treat the graph as undirected).
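Selecting such posts amounts to counting every edge at both of its endpoints and applying a threshold. A minimal sketch, with a hypothetical helper `leaders` and toy data (the threshold is 3 here only because the toy graph is tiny; the article uses 25):

```python
from collections import Counter

def leaders(edges, min_links):
    """Vertices whose undirected degree (incoming + outgoing) is at least min_links."""
    degree = Counter()
    for s, d in edges:
        degree[s] += 1  # outgoing link of s
        degree[d] += 1  # incoming link of d
    selected = [(v, n) for v, n in degree.items() if n >= min_links]
    # Highest degree first, ties broken by name for a stable order.
    return sorted(selected, key=lambda item: (-item[1], item[0]))

# Hypothetical toy graph: one catalog post and one heavily cited series part.
edges = [
    ("catalog", "a"), ("catalog", "b"), ("catalog", "c"),
    ("x", "series1"), ("y", "series1"), ("z", "series1"),
    ("a", "b"),
]
top = leaders(edges, min_links=3)
```

Both the catalog (all links outgoing) and the series part (all links incoming) pass the threshold, since direction is ignored.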





All articles in the list can be divided into catalogs (interesting and useful links on topic X) and parts of a series. If you look closely, the former are exactly our hubs and the latter are the authorities.


That is, there are no articles that are simply actively cited on Habr (at least, they are cited less often than articles in a series).


Author citation rating


It is also interesting to count the citations of each author's articles. When calculating and compiling the rating, self-citation was not taken into account (there will be a separate rating on this topic).


The first place turned out to be quite predictable - and with a big margin.


Citation rating top 30

1 alizar, 743
2 marks, 261
3 ilya42, 202
4 MagisterLudi, 202
5 lapyk, 167
6 XaocCPS, 144
7 SLY_G, 131
8 frii_fond, 127
9 grokru, 124
10 dmitrykabanov, 118
11 kichik, 115
12 saul, 101
13 itinvest, 99
14 jeston, 97
15 ValdikSS, 95
16 Mithgol, 83
17 andorro, 76
18 UiDesignGroup, 72
19 IT_invest, 71
20 amarao, 70
21 python, 69
22 esetnod32, 66
23 aleksandrit, 66
24 azproduction, 64
25 nokiaman, 64
26 wiygn, 63
27 NCNecros, 62
28 FSBook, 61
29 Boomburum, 61
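
A rating of this kind could be computed roughly as follows. This is a sketch under assumptions, not the script actually used for the article: the function name `citation_rating`, the article ids and the author names are all hypothetical.

```python
from collections import Counter

def citation_rating(edges, author_of):
    """Count incoming citations per author, skipping self-citations."""
    rating = Counter()
    for citing, cited in edges:
        if author_of[citing] != author_of[cited]:
            rating[author_of[cited]] += 1
    return rating.most_common()  # sorted by count, highest first

# Hypothetical toy data: article id -> author, and (citing, cited) pairs.
author_of = {1: "alice", 2: "alice", 3: "bob", 4: "carol", 5: "bob"}
edges = [(1, 2), (3, 1), (4, 1), (4, 2), (1, 3), (5, 3)]

rating = citation_rating(edges, author_of)
```

The two edges between articles by the same author are ignored, so only citations coming from other authors count towards the rating.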


Self-citation rating


This rating is interesting primarily because it shows how the number of citations an author receives from others compares with the number of citations of their own articles. On average, we see that the number of self-citations exceeds the number of ordinary citations. This also indicates that authors' links to their own articles contribute significantly to the connectedness of the citation graph.


This can be considered a personal contribution to the connectedness of Habr's articles (the author of this article even took 26th (!) place in this rating).


Self-citation rating

1 itinvest, 541
2 SLY_G, 526
3 MagisterLudi, 469
4 1cloud, 424
5 esetnod32, 415
6 ptsecurity, 410
7 maisvendoo, 373
8 zag2art, 365
9 ilya42, 337
10 EvseyFaydo, 302
11 lol_wat, 270
12 frii_fond, 264
13 1eqinfinity, 258
14 alexzfort, 229
15 XaocCPS, 226
16 andorro, 226
17 alizar, 222
18 khizmax, 218
19 Boomburum, 196
20 Mithgol, 188
21 Milfgard, 174
22 eagleson, 173
23 vedenin1980, 168
24 OsipovRoman, 161
25 CooperMaster, 159
26 varagian, 155
27 bbk, 154
28 Irina_Ua, 153
29 dmitrykabanov, 133
30 Unrul, 131
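
To compare the two ratings for individual authors, one can divide the self-citation count by the ordinary citation count. The sketch below uses a small excerpt of four authors who appear in both lists of this article; the variable names are only illustrative.

```python
# Counts copied from the two ratings in this article (an excerpt, not the full lists).
external = {"alizar": 743, "MagisterLudi": 202, "SLY_G": 131, "itinvest": 99}
self_cited = {"itinvest": 541, "SLY_G": 526, "MagisterLudi": 469, "alizar": 222}

# Ratio above 1 means the author is cited more by themselves than by others.
ratio = {author: self_cited[author] / external[author] for author in external}

mostly_self = sorted(
    (author for author in ratio if ratio[author] > 1),
    key=ratio.get,
    reverse=True,
)
```

For this excerpt, three of the four authors have more self-citations than ordinary citations, while alizar is the opposite case.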


Reproducibility and open data


I am firmly convinced that any research result must be reproducible, repeatable, and accessible to the reader. Therefore, all the original data are attached to the article.


Links: the Habrahabr citation graph and the graph without self-citation (Gephi), as well as a dump of all Habrahabr articles (collected on May 20, 2016), are available here, along with a large amount of other interesting data on Habr, specially collected and cleaned for use (it may come in handy if you are writing a thesis or need real textual or (semi-)structured data).


Findings



Instead of conclusion


For the love of art: the citation graph without using edge counts as vertex weights

Source: https://habr.com/ru/post/302430/

