
The Habr graph, the community, and where all the karma went

Introduction


Today, armed with graph analysis, data mining, subgroup discovery and all that fun stuff, we take a look at Habr. All code and data are attached, so anyone can repeat the calculations from the article and look for something interesting on their own.




(This is not just an attention-grabbing picture but a graph of connections between ~45,000 Habr users, showing who is subscribed to whom; the size of each vertex is proportional to its number of subscribers; all the pictures are clickable; details below.)


The problems discussed here did not, of course, arise yesterday, but some of their aspects seem fairly new to me and therefore worth discussing on the basis of unbiased and representative data. For example, in the comments to this article I saw an interesting statement:

The problem here is that on the whole of Habr today there are no more than 50-80 people who can vote at all. For 90% of users, karma is simply below 5. As a result, comments and articles are judged only by a select few. That is the kind of jury we end up with.

I decided it would be worth formulating this as a hypothesis and checking it:

Q1: Is it true that Habr has turned into a jury-based community where two and a half people vote for articles?

In this article the "Hardware" hubs were returned to us, which made me wonder how the different communities inside Habr are represented. Let us formulate a hypothesis:

Q2: How is the community segmented, or in other words, how many interest groups do we have here and do they correspond to the existing hubs?

The last, but no less interesting, observation is that activity on Habr has fallen (according to Habr-Pulse and my subjective observations) so far that they even decided to introduce "read & comment" accounts. Therefore, I decided to assess the activity of the community and consider how information about the community structure can help us:
Q3: How active is the community and how can the structure of internal groups help us?


For details, welcome under the cut.

Article structure



Data collection methodology


As you know, a new version of Duke Nukem Forever is being written on the Habr API, so you have to collect all the interesting data yourself (well, ok, in fact the API simply does not provide all the interesting data). What data do we need to collect?


Since Habr limits the number of connections, it is best to use a distributed parser architecture: for example, split all the articles into N groups and parse each group on its own machine in 4 parallel threads. All the data is collected in the HabraData repository on GitHub (if you are writing a master's thesis or some other diploma work on data analysis, especially one in Russian, you can find a lot of interesting things there).
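The splitting scheme described above can be sketched roughly as follows. This is only an illustration, not the parser from the repository: `split_into_groups`, `parse_group` and `fake_fetch` are hypothetical names, and the real download and HTML parsing step is replaced by a stand-in function.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_groups(items, n_groups):
    """Split items into n_groups round-robin slices, one slice per machine."""
    return [items[i::n_groups] for i in range(n_groups)]


def parse_group(article_ids, fetch, threads=4):
    """Parse one machine's share of articles with a small thread pool.

    `fetch` is a hypothetical callable (article_id -> parsed record);
    the real network and parsing code is not shown here.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in the same order as article_ids
        return list(pool.map(fetch, article_ids))


def fake_fetch(article_id):
    # Stand-in for the real request (an assumption made for this demo).
    return {"id": article_id, "hubs": []}


groups = split_into_groups(list(range(10)), n_groups=3)
records = parse_group(groups[0], fake_fetch)
```

Each machine would receive one element of `groups`; capping the pool at 4 threads keeps the number of simultaneous connections per machine small.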

General scheme of collection:


All users who left comments and/or wrote an article during the last 2 years were collected. Then those who had been banned, had gone into negative karma, etc. were filtered out. In parallel, data on the articles was collected, namely which hubs they belong to. The data used in each experiment is explained in the course of the story.

An unfiltered list of ~25k users is available here, and a filtered dataset with the key indicators for each user is available here, in the form:

user,karma,rating,publications,comments,favourites,followers
....
var_bin,3.0,0.0,1,18,6,1
varagian,187.0,26.0,20,151,86,44
varanio,55.0,0.0,3,51,24,6
varerysan,16.0,0.0,9,26,0,3
....
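As a rough illustration of how a file in this format could be used, the sketch below loads the four sample rows shown above and computes the share of users whose karma is below a threshold. The threshold of 5 is taken from the quoted comment and is an assumption, not a confirmed voting rule; `load_users` and `share_below` are hypothetical helper names.

```python
import csv
import io

# The four sample rows from the article, not the full dataset.
SAMPLE = """user,karma,rating,publications,comments,favourites,followers
var_bin,3.0,0.0,1,18,6,1
varagian,187.0,26.0,20,151,86,44
varanio,55.0,0.0,3,51,24,6
varerysan,16.0,0.0,9,26,0,3
"""


def load_users(text):
    """Parse the user,karma,... CSV into a list of dicts with numeric fields."""
    users = []
    for row in csv.DictReader(io.StringIO(text)):
        users.append({
            "user": row["user"],
            "karma": float(row["karma"]),
            "rating": float(row["rating"]),
            "publications": int(row["publications"]),
            "comments": int(row["comments"]),
            "favourites": int(row["favourites"]),
            "followers": int(row["followers"]),
        })
    return users


def share_below(users, karma_threshold):
    """Fraction of users whose karma is strictly below the threshold."""
    if not users:
        return 0.0
    return sum(u["karma"] < karma_threshold for u in users) / len(users)


users = load_users(SAMPLE)
low_karma_share = share_below(users, 5.0)
```

With only four rows the resulting number says nothing about Habr itself; the same functions would have to be run on the full filtered dataset.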


: , , , etc


-

́ ~61% 10% , ( 20% , 10%) 20%.


: 50% 5 , 7500 .

, , .

(c - 20 16, )




( - y, «».)



— , , ́ ( , .)

( , , )

? , .


, :









:


,



,




-, , () . -, , , . Q3 .



, . :


: - ( , hub, .. ) .

, v1 v2, v1 v2. .

:

user:follower1,follower2,....
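A minimal sketch of turning lines in this format into a directed edge list is shown below. The direction of each edge (from follower to followed user) is an assumption made for the example, and the usernames are invented.

```python
def parse_follow_line(line):
    """Turn 'user:follower1,follower2' into a list of (follower, user) edges."""
    user, _, rest = line.strip().partition(":")
    followers = [name for name in rest.split(",") if name]
    # Assumption: an edge points from the follower to the followed user.
    return [(follower, user) for follower in followers]


def build_edges(lines):
    """Collect edges from all non-empty lines of the file."""
    edges = []
    for line in lines:
        if line.strip():
            edges.extend(parse_follow_line(line))
    return edges


def follower_counts(edges):
    """Number of incoming edges (followers) per user."""
    counts = {}
    for _, target in edges:
        counts[target] = counts.get(target, 0) + 1
    return counts


# Invented usernames; 'dave' has no followers.
lines = ["alice:bob,carol", "bob:alice", "dave:"]
edges = build_edges(lines)
counts = follower_counts(edges)
```

The follower count per user is the quantity the vertex size in the picture at the top of the article is proportional to.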

( Gephi) , ( Gephi). () 45 110 "".

Louvain Community Detection. , , .
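For context, the Louvain method greedily moves vertices between communities to increase modularity. The sketch below does not implement Louvain itself; it only computes the modularity of a given partition, treating the graph as undirected and unweighted (a simplification, since the follower graph is directed). The function name and the toy graph are assumptions made for illustration.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the vertices in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = {}       # community -> number of internal edges
    degree_sum = {}  # community -> total degree of its vertices
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        intra.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("x", "y"), ("y", "z"), ("x", "z"),
             ("c", "x")]
split = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
single = {name: 0 for name in split}

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
```

Splitting the toy graph into its two triangles gives a higher score than putting every vertex into one community, which is the kind of improvement Louvain searches for.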



, ( Q1 ). , .



. .






, ~10% . , ( "" ) . , , , . (, , .)



, , - . , ,


, , - . , (Louvain Community Detection).

,

; , "" . — ( )


( — )

. 10 , controllers ( ), " "





, , , . 15 "" -, 10 .





* Active --               ~25
*   \      read only ~ 14
*       44k
*   : 104k
*      : 11.5k
*  : 4.7
*  ~0
*  : 528

, - .

( 2014 2015 ) , . . ) , Q1, ) , Q2.










, , , . , , , , , .

Source: https://habr.com/ru/post/276383/

