Hi, Habr!

We are already familiar from
previous articles on data analysis . Now it is time to talk about one very practical task that we have learned to solve. Namely, we will find out
who actually controls our opinion on the social network VKontakte . Code cut many unusual results and interesting mathematics.
Immediately a reservation - we do not mean anything concrete and do not draw any conclusions, we just love math and just use open data')
Many of you will remember that not so long ago we opened
one of the open projects in which, under the guidance of experienced people, participants could solve real data analysis tasks and gain experience. Now it's time to talk about the results and show where we are going next. To begin, let us recall the problem that we solved.
Introduction
It is no secret that lately social networks have become one of the most effective tools for disseminating information. Not so long ago, the public saw examples when these tools are used for other purposes - you seem to be sitting for yourself - you do not do anything, when suddenly an unknown account is added to friends and involuntarily you begin to receive new incomprehensible entries in the Facebook feed. To date, there are many methods for the promotion of groups and pages in social networks for generating traffic and the subsequent benefit from this. It is also far from a secret that Internet companies actively struggle with this kind of activity and cooperate with the relevant departments, since Often the activity of “opinion leaders” threatens, including the national security of countries.
Do you know how you can manage opinion on social networks? In fact, this is not so difficult - it is important to know which channels to disseminate information. And here, as always, one cannot do without mathematics, in particular - the theory of graphs.
That is why we decided to launch a series of tasks in which we will show
how, using a fairly limited API of social networks, learn to find opinion leaders, measure the quality of advertising campaigns and follow the spread of information in social networks using the example of VKontakte .
The first, and simplest task that brings us closer to calculating the opinion leaders is the task of
identifying users with the maximum number of subscribers within a limited API . A feature here is that the entire graph of the social network is almost impossible to view. Given that the monthly VKontakte audience has more than 300 million users, and requests to the API are allowed to do no more than 3 per second, it is easy to figure out that it took about 3 years to calculate the number of subscribers of all people.
We will show how to calculate the TOP100 people with the maximum number of subscribers in a few minutes . But first, a little dive into formal definitions. After all, without good mathematics is not enough.
From the point of view of mathematics, a social network can be represented as a
directed graph , the vertices of which are users, and the edges connecting them signify the fact of friendship / follow / subscription. At the same time, our graph is oriented - the edges have a direction (depending on who “followers” ​​from whom). For an undirected graph, the notion of
degree (the number of friends of a vertex) is defined. For the oriented, the incoming degree (the number of “followers”) and the outgoing degree (the number of people for whom people are “signed”).
In general, nothing special
if many of the graphs that occur in life would not be arranged in a certain way . For this, there is even a whole science, which is engaged in the study of so-called web graphs. It all started with the fact that in 1999 the article by A.-L. Barabashi and R.Albert, in which the authors experimentally investigated the properties of the Internet and proposed the idea of ​​its formation, which was subsequently formalized by many authors and is very actively used in practice. So let's start with those very features of social networks.
Properties of real networks
Let us say right away that when we talk about a web graph, by vertices we mean certain units on the Internet, for definiteness, we will consider sites (for example, we can also consider hosts). The edges will connect those vertices between which there are links. It is clear that the graph is oriented and multiple edges are allowed. Even loops (edges from apex into itself) and even multiple loops are allowed.
1. The diameter of the web graph is small
This is one of the most famous properties of web type graphs, which many people know as the
“theory of six handshakes,” which means that any 2 people on Earth know each other in 6 handshakes. In the language of graph theory, this property means that the graph has a small diameter. This suggests that the graph should be dense enough, however, it is not.
2. Sparseness
A web graph is a fairly
sparse graph. More formally, on
n vertices there are only about
const * n edges . It would seem that a large graph of small diameter should be very dense, i.e. to have about
n ^ 2 edges, however, the fact remains. The following property for a person who is not familiar with mathematics does not sound so informative, but it is in the future that will tell you how the graph itself should be formed in order to satisfy all the properties of the real web.
3. The power law of the distribution of powers of the vertices
It turns out that the proportion of vertices of degree d in the web graph is estimated as
const / d ^ a , where
2 <a <3 (at the time of the study, the value of
a was about
2.1 ).
These are the properties of the Internet. Now, if we know how the process of such a graph works, we will be able to do many things, starting from what we look for spam structures (many of which are co-dense dicotyledine subgraphs), ending up in order to effectively search for vertices of the maximum degree (about this we will describe below).
Preferred Attachment Idea
To date, one of the most common ideas underlying the construction of a web graph is the idea of ​​preferred connection, the logic of which is as follows.
Let's assume that as soon as a new site appears, it will most likely prefer to be on those who are already popular at that time . More formally - the probability with which the new site will link to an existing one is proportional to the number of links already available to that site.
It is clear that the idea described above is poorly formalized and by specifying many details one can obtain many models of web graphs. So the Bollobash-Riordan, Buckley-Ostgus models, the copy model and many others appeared. Not so long ago, Yandex even considered
a whole class of models for which many interesting results
were proved .
I hope the reader has an understanding of how the Internet works from the point of view of graph theory. It turns out that many other real networks, including social. networks are similar. That is what will help us in our task.
Heuristic algorithm for determining vertices of maximum degree
Now back to our task. How can we use the properties of real networks described above in order to find the vertices of the maximum degree with a sufficiently high accuracy? Let's take some number of arbitrary vertices of our graph (for example, from a uniform distribution). Denote this set of vertices by
A. It is approximately clear that with sufficiently high probability many of them refer to one of some popular peaks. Let's then remove edges from these vertices (let's see who these vertices are subscribed to) - we denote the set of these vertices by
B. It is clear that some vertices from
A refer to the same vertex from
B. For each vertex
V of set
B, we calculate the number of vertices from
A that hit the vertex
V. Thus, we get a lower estimate for the incoming degree (number of subscribers) of each vertex from the set
B. We call this the “estimate of the input degree” of a vertex
V.And now the main point that is very difficult to explain to the reader without serious mathematical preparation in more detail than in this article: due to the fact that the social graph has the properties described above, it turns out that the set
B will contain exactly the vertices of the maximum degree in the original graph. Thus, it remains to sort the vertices from the set
B according to the “estimate of the incoming degree” and to clarify for each of these vertices the real value of the subscribers in it. More details about the algorithm and its justification can be found
here .
As you can see, the algorithm is quite simple and not demanding on computing resources. This is exactly what was supposed to be realized by the participants of our research for the Russian social network VKontakte.
Small offtop
As we said earlier, including at past conferences, when teaching people, we try to explain complex things in simple language and give maximum practical skills. For a person to form a certain image and understanding, sufficient for the practical application of data analysis skills. That is why we take people from different subject areas, and by sharing experiences with each other, we instill the missing skills of the participants and ourselves.
One of the research participants was
Gleb Morozov , who for more than 6 years had been solving analytical problems in large federal retail chains, including, for example,
ZAO Tander (Magnit) or
TD Megapolis . After a large number of jointly solved tasks, which he will talk about a little later, Gleb learned how to use machine learning in retail, and the MLClass team gained domain experience. Gleb will tell about the tasks themselves later on independently, but for now I will give his code, which he used to solve the above task.
The implementation of the algorithm in R
Below is a fairly simple and well-commented implementation of this algorithm in
R. Any reader will be able to easily reproduce the result.
library("RCurl") # API library("jsonlite") # JSON library("stringr") # # url <- "https://api.vk.com/method/users.getSubscriptions?user_id=" # url2 <- "https://api.vk.com/method/users.getFollowers?user_id=" # id replace ( 2.8 . id 700 ) id_num <- sample.int(280000000, size = 300, replace = T) # id get_id_account <- function(id) { url_get <- str_c(url, id) # API resp <- getURL(url_get) # Sys.sleep(0.5) # - # id fromJSON(resp)$response$users$items } # - get_count <- function(id) { url_get <- str_c(url2, id) resp <- getURL(url_get) Sys.sleep(0.5) fromJSON(resp)$response$count } # , get_id_account, id list account_id <- sapply(id_num, get_id_account) # list data.frame account_id <- unlist(account_id) # - id account_id <- table(account_id) # data.frame account_id <- as.data.frame(account_id) # account_id <- account_id[order(-account_id$Freq),] # , get_count, - result <- sapply(as.numeric(as.character(account_id$account_id[1:300])), get_count) # - id result <- cbind(as.numeric(as.character(account_id$account_id[1:300])), result) # result <- result[order(-result[,2]),] result <- data.frame(id = result[,1], count_of_followers = result[, 2]) write.csv(result, "result_subscribers.csv")
As you can see, the algorithm, that its implementation is very simple. Now let's take a quick look at the results.
results
In just a few minutes we will get the following results (TOP20 vertices of the maximum degree, for a more complete list -
see here ).
In general, when we saw this list (and then looked at it further), we made 2 conclusions:
- The algorithm works.
- The results are a little overwhelming. We especially liked the schoolboy Ivan Rudskoy, who made many Russian pop stars
We started to doubt how well the algorithm works, but since we did not find people in open sources with the maximum number of subscribers, we decided to test the algorithm on groups (look for groups with the maximum number of subscribers), since there are lists for them,
like this . And, indeed, having received
these results , and realizing that VKontakte is not at all the world of the notorious
MDK , they were satisfied.
TOTAL
Now, after we have obtained such results, we will move on, namely, using the same API to try to build a
“repost” graph from the pages of these people , in order to evaluate a person not only by the number of subscribers, but also by how this person is able to spread the opinion, using his influence in the network. There are ideas how to correctly apply
PageRank .
If you want to connect to this study and get a very cool experience - join our
Data Science community and email me (
al.krot.kav@gmail.com ) a letter with the topic
"Opinion Leaders VKontakte"Well, in the meantime, we will continue research and will delight you with new articles from the world of data analysis and good mathematics!
UPDOh, by the way, found a lot of beautiful girls in this way
Here, for example,
vk.com/id109507300 , madam gained subscribers more than many Russian stars. Enjoy watching!