📜 ⬆️ ⬇️

Who is reading whom in LJ - analysis of the intersection of audiences by top bloggers

Start


The topic of research of connections in social networks is becoming increasingly relevant for various reasons: an attempt to answer the question about the degree of connectivity of network participants; speed and ways of disseminating information; about the effectiveness of targeted advertising, after all. And the process of research and the search for implicit links is delays!

For my research in this direction, I chose the most “boiling” piece of the Runet, namely, the Russian segment of the LiveJournal . The vaguely formulated question sounded like this: is it possible to single out blogger “groups” based on the structure of connections between users of the LJ service, i.e. having only information about "friends" .

Having put forward the idea that such information can be extracted from the analysis of the audiences of popular magazines as a working hypothesis, I was faced with the task of obtaining reliable data about these audiences. The basic tools of the livejournal service do not provide an opportunity to get a complete list of readers of the multi-thousandth blog. Therefore, the first step was to assemble the structure of links of Russian LJ on a home computer.
')
Looking ahead, I will say: the social graph of Russian LJ in my study has 2.08 million peaks and 58.05 million arcs. Interesting? Then there are quite a lot of letters, numbers and pictures under the cut.



Collection of information


Estimated service Yandex . Blogs Russian segment of LiveJournal has a little more than 2 million blogs. Taking this list as a basis, I did some automated work on populating a database of “friendly” relations between blogs, which allows you to answer at least one question: who reads a particular blog.

Few numbers


Communications were collected 2.08 million users . Those. a graph with 2.08 million peaks received another 58.05 million arcs (directed by friendships between users) on 03/13/2011 . Moreover, only half - 1.08 million users read someone else (has an outgoing arc) and 1.26 million have readers. As an illustration, you can bring some statistics on the number of friends (readers):



The base did not include links that were sent outside the “Russian segment” (it is somewhere just over 6 million arcs to 0.9 million peaks) that were not studied further and referred to “foreign” although there are living Russians blogs

Error estimate


To assess the completeness of the collected graph, we compare the “official” number of readers from the top livejournal.com with the number for the same bloggers, but obtained by summing up the incoming arcs on the graph. For greater accuracy, two TOP-50 fragments were taken:



As we see from the table, the generated graph of readers corresponds to the real a little more than 80%. The error may be due to the initial isolation of the “Russian segment” (that is, with the exception of foreign friends) and the incompleteness of the list of Russian journals themselves. In the future, some insignificant refinement of the structure of the local graph is possible.

Analysis of the intersection of audiences TOP-10


The analysis itself is simple and flat - we take the readers' lists of each blog from the TOP-10 and look for their intersection, recording the results in the table. More precisely, in two tablets - in absolute, with quantitative values ​​and relative - indicating the percentage of overlapping audiences.



First , it is clear from the second table that the general audience of magazines from the TOP-10 (i.e., people who have subscribed to at least one top blog): 1,68837 people (remember the error).

Secondly , it can be said that a third (34.5%) of the audience of Anton Nosik ( dolboeb ) also reads Navalny ( navalny ), but from the readers of Nika Belotserkovskaya ( belonika ) the same Navalny reads only 14.5%. But 30.9% of its readers are also eagerly awaiting new reports by Sergey Doli ( sergeydolya ) and as many as a quarter (24.9%) - stories from the life of Slava Se ( pesen_net ). And, by the way, almost half (46.7%) of the readers of the same Anton Nosik follow Artem Lebedev’s ( tema ) movements along the tundra, and 20% of the audience of top leader Rustem Adagamov ( drugoi ) likes to receive hot photo reports from our political campaign from the original source Ilya Varlamov ( zyalt ).

Third , you can build and intersections audiences of higher orders. For example, an analysis of the intersection of audiences of three bloggers is a cube. So a slice of a similar cube according to Alexei Navalny will give the following picture:



The matrix is ​​symmetric with respect to the main diagonal. The numbers indicate the share of the total audience of the two journals (the intersection of the row and column), which also read Alexei Navalny ( navalny ).

From the table it can be seen that only a third (34.4%) of the total audience of blogs, tema and drugoi, also read navalny , while almost two thirds read it from the zyalt and dolboeb audience - 64.2%. Belonika-sergeydolya (26.6%) and belonika-pesen_net audiences (25.5%) show minimum interest in the fight against online corruption.

Well, fourthly , if you place advertising posts on the Thousands blogs and don’t have such layouts on TOP-50 - dismiss the marketing specialist :)

How to embrace the immensity?


On the one hand, numerical data is sufficient for various applied research. On the other hand, for some researchers it is simply necessary to evaluate the visual field. Let's try to twist the data before our eyes. How?

Here we can use the methods of visual presentation of multidimensional data with decreasing dimension: let's try to “squeeze” our 10-dimensional data set (bloggers who read magazines from TOP-10) into a two-dimensional picture on the plane. At the same time, ideally, it would be nice to get some grouping of readers on readable blogs. Not very confused?

The first option of grouping is to carry out clustering with g-means algorithms (clustering with automatic determination of the number of clusters) or k-means (clustering by a given number of clusters). In principle, the idea is sound, but this approach does not solve the problem of displaying the results obtained and has its drawbacks given the structure of our data.

Therefore, I tried to use my favorite clustering tool — Kochnen's self-organizing maps in the implementation of the Deductor analytical system (Academic version) from BaseGroup Labs . The details of the algorithm can be read in the relevant publications, let me just say that in this task its ability to project multidimensional data relief on the display plane is important. What comes out of it and how to interpret it is highly dependent on the processing parameters and the understanding of the nature of the data being processed. Therefore, further analysis is a special case that does not claim to be absolute truth.

So, after feeding the neural network sample, which is the Kohonen map, we get this picture. The number of clusters (multi-colored zones in the lower right window) is slightly adjusted manually - set 7 pieces (numbering 0..6) - for better visual breakdown of the result.



Having a little meditation on a beautiful and incomprehensible picture, one can proceed to some superficial analysis.

Thus, the apolitical cluster (number 2, almost 19 thousand participants) fans of Nika Belotserkovskaya ( belonika ), mainly has intersections with readers of the magazines drugoi , sergeydolya , pesen_net , mi3ch (the selected cluster has a mesh fill):



Creative intelligentsia reading Slava Se ( pesen_net ), Dmitry Chernyshev ( mi3ch ), admiring beautiful pictures from the magazines drugoi and tebe_interesno as well as discussing the design finds of Artemy Lebedev ( tema ) are grouped into cluster number 4 (54.5 thousand):



Cluster number 6 has absorbed readers (8 thousand) without pronounced addictions who read almost all top bloggers at the same time:



Well, where is today without Alexei Navalny ( navalny ) !? In fear of the invasion of bots, I write these lines ... Cluster number 0 (oh, just do not beat me for associations - in decks of tarot cards the zero card has the name "Jester") covers 33 thousand readers (there should be a remark, but it will go to end). It would be right to combine it with cluster number 1 (another 16.5 thousand) also covering politically active top bloggers ( drugoi , dolboeb , zyalt ):



Technical note

As I have already said, in this case, the content and structure of the displayed clusters depends on their number, which can be configured. For example, for this model, the number of “physical” clusters into which a lot of readers have broken up as a result of processing is equal to 19, but for clarity, I made a coarser model with 7 “visual” clusters. The accuracy of such a partition (and the clustering method itself via a neural network) is not absolute, so it may happen that the user who does not read it gets into the “Bulk” cluster. But this error, in principle, is not critical for a surface assessment analysis.

Instead of conclusion



At this point, I finish the demonstration of the work done, and I suggest thinking about advertisers who use LJ to promote their products and services on its applied value (for example, analysis of TOP-30 or TOP-50 or a specially formed list).

PS


Learn users who are not registered on Habré can ask questions on my blog infist-xxi.livejournal.com

Source: https://habr.com/ru/post/115903/


All Articles