Recently managed to crank up an interesting thing. For all Vkontakte groups with the number of subscribers from 5,000 to 10,000 (~ 100,000 groups), a full graph was constructed in which the weight of the ribs equaled the intersection of the groups' audiences.

First, such a graph looks beautiful:

')
Secondly, it can be used to quickly select groups of a given topic. For example, you need to find a group about knitting. For the keyword "knitting" we find one suitable group,
Knitting - Knitting online- , for example. We deduce the groups with which it is associated:
Knitting - Knitting online-:
6.04%
Corporation “YARK”5.90%
Mamochkin channel - for creative moms (HOOK!)3.40%
Knitting. In this world everything is connected ...))3.01%
YARN CHEAP. FLISS. ELASTIC BANDS FOR WEAVING BRACELETS2.35%
Spaghetti Yarn Spagetti1.87%
Eesti lõng yarn shop (Kauni, Kauni)1.73%
* Crochet art *1.70%
Yarn Kauni (Kauni) - the legend of Estonia. Knitting.1.66%
«Lace motifs» - knitting and needlework1.54%
Turkish yarn in stock and to order (Ukraine)And we repeat until you get bored or until new names cease to appear.
Knitting. In this world everything is connected ...:
8.88%
Corporation YARK
3.06%
Mamochkin channel - for creative mothers (HOOK!)2.58%
YARN CHEAP. FLISS. ELASTIC BANDS FOR WOOL BRACELETS2.30%
Knitting - Knitting online-2.14%
Online Store Yarn "Openwork"1.94%
Yarn Kauni (Kauni) - the legend of Estonia. Knitting.1.85%
Store yarn - ღ YOUR YARN ღ1.76%
Yarn1.72%
Openwork world: connected with love!1.55%
Yesti lõng yarn shop (Kauni, Kauni)Corporation "Yarn":
7.54%
Knitting. In this world everything is connected ...))4.01%
Mamochkin channel - for creative moms (HOOK!)3.47%
Knitting - Knitting online-3.20%
YARN CHEAP. FLISS. ELASTIC BANDS FOR WEAVING BRACELETS2.72%
Online Store Yarn "Openwork"2.67%
Yarn2.11%
"Madame Vyazalkina" Yarn (Handicraft Products)2.00%
Yarn Kauni (Kauni) - the legend of Estonia. Knitting.1.85%
Shop of yarn Eesti lõng (Kauni, Kauni)1.82%
Spaghetti Yarn Spagetti"Madame Vyazalkina" Yarn (handicraft goods):
2.49%
Yarn2.37% YARK
Corporation1.42%
Yesti lõng yarn shop (Kauni, Kauni)1.39%
Yarn Kauni (Kauni) - the legend of Estonia. Knitting.1.32%
YARN CHEAP. FLISS. ELASTIC BANDS FOR WOOLING1.26%
Store yarn and goods for needlework KUDEL1.24%
Knitted hats and not only.1.21%
HOBBY & HOME | NEEDLEWORK1.18%
Online Store Yarn "Openwork"1.15%
Spaghetti Yarn SpagettiA similar result can be achieved by correctly selecting keywords for the search: “knitting”, “yarn”, “needlework”, “crochet”. But they are not always easy to invent.
To construct such a graph, several unobvious technical solutions were used, which I would like to talk about.
To get the full list of groups of a given size, the wonderful site
allsocial.ru was downloaded . I wonder how they collect this data? Just go to all indices:
vk.com/club1 ,
vk.com/club2 , ...? Only medium-sized groups with subscribers from 5,000 to 10,000 people were taken for two reasons: huge public institutions like MDK would be pumped up, but, more importantly, membership in them does not carry a special signal, such groups are connected with everything.
To get a list of group subscribers in the API of VKontakte, there is a special method. But it allows you to receive 1000 users at a time and only 3 times per second. And it was necessary to pump about 1 000 000 000 users, which is dofig. It turns out that it will be necessary to wait 3-4 days if the VC responds to each request instantly. This is, on the whole, tolerable, but the following remark in the documentation confused:
In addition to restrictions on the frequency of calls, there are also quantitative restrictions on calling methods of the same type. For obvious reasons, we do not provide information on exact limits.
In our case, this remark is annoying, because 1,000,000 requests will have to be made. Here comes the coolest method
execute . Big respect for him guys from VK. Does anyone else have such a thing? The bottom line is that through execute, you can send programs to the Contact in a special VKScript language, push several requests to the API and, possibly, some logic. In my case, the program looked like this:
return [ API.groups.getMembers(id=1, offset=0, count=1000), API.groups.getMembers(id=1, offset=1000, count=1000), API.groups.getMembers(id=1, offset=2000, count=1000), API.groups.getMembers(id=1, offset=3000, count=1000), API.groups.getMembers(id=1, offset=4000, count=1000), API.groups.getMembers(id=1, offset=5000, count=1000), ... ];
Inside the program there can be no more than 25 calls to the API. That is, the number of requests is reduced to 40,000, theoretically a ban could be bypassed. Each such request was not executed at all instantly, but about 5-6 seconds, so I had to wait anyway. Yes, it would be possible to start downloading in several streams, but even it was dumb. After two and a half days, everything worked out and took about 10GB on my disk.
Now the question is how to stuff these 10GB into RAM and how to calculate the pairwise intersection of the audiences for 100,000 groups. It is saved by the fact that each user usually consists of a small number of groups (99% of users are less than 15 groups). You can write out what contributions each user makes to the intersections and then add these contributions. Let, for example, there are two users: A and B, and three groups 1, 2 and 3. And consists in all three, B - only in 1 and 3. And contributes to three intersections: (1, 2), (1 , 3) and (2, 3), B - into one: (1, 3). We add up, we get that 1 and 3 intersect by two users, the other groups one by one. If you technically ignore users who are in 15 groups and more, you will have to write out about 500,000,000 intersections, which is much better than deciding to the forehead, where you need to calculate 100,000 * 100,000 intersections.
Fine, there was only a problem with RAM. Fortunately, the described algorithm fits well with the Map-Reduction paradigm, so the
nano-hadup of 50 lines was written
down and the calculation looked like this: write out the groups and users who have two columns in them:
group user 3953835 10 2065169 100001643 2112714 100001643 ...
It turns out a file on ~ 9GB, sort it with a unix-like grade by the second column, look where Pavel Durov is:
group user 2226515 1 37110020 1 38354466 1 43453499 1 60140141 1 60615047 1 64980878 1 1019652 10 ...
We read the file, group the stream by the second column, keep in memory only the list of user groups, if there are less than 15 groups, write out all the matchings to another file:
source target 10000 10027193 9980615 9997141 9974 9976553 ...
Since the threshold is chosen correctly, the file is not too large - ~ 9GB. We sort it by two columns:
source target 10000 100000 10000 100000 10000 10009982 10000 100100 10000 100100 10000 10019194 10000 10019194 10000 1002 10000 1002 10000 1002 ...
Then the file is read, grouped in two columns and immediately considered the intersection. For groups of 10,000 and 100,000, for example, the count of 2 users. This can be said right away; nothing is stored in the memory.
Further, the edges are filtered by some reasonable threshold, so that there are not very many of them left. The result can be seen in Gefi. There are two secrets: in order for everything to work for not painfully long, you need to turn off edge drawing, for installation you need to download OpenOrd, he laid my graph at ~ 100,000 vertices in ~ 5 minutes.
Such an approach can theoretically be used in any task where there are two related entities: sites and users, a query and issue results, for example.