“What analyst doesn't love Big Data!” is how one might paraphrase the popular saying about fast driving. 650 million social media messages from 35 million authors and 358 million links, 110 million of them “short”: that is how much data for March 2014 was analyzed to compile a media citation rating.
In this post we cover the methodological and technological aspects and invite you to discuss ideas for "deep drilling" data mining of social media. If you are interested, read on.

Here is the rating itself:
Top 30 media outlets by social media citations (March 2014):
| Place | Resource name | Website address | SMI index | Number of links |
|---|---|---|---|---|
| 1 | RIA Novosti | ria.ru | 117 | 516,641 |
| 2 | RT in Russian | russian.rt.com | 83 | 364,845 |
| 3 | Lenta.Ru | lenta.ru | 72 | 318,735 |
| 4 | Radio station "Echo of Moscow" | echo.msk.ru | 52 | 226,985 |
| 5 | Gazeta.Ru | gazeta.ru | 51 | 226,760 |
| 6 | LifeNews | lifenews.ru | 48 | 212,870 |
| 7 | TV channel "Dozhd" | tvrain.ru | 48 | 210,413 |
| 8 | ITAR-TASS | itar-tass.com | 46 | 203,795 |
| 9 | Vesti.ru | vesti.ru | 45 | 197,654 |
| 10 | Sports.ru | sports.ru | 42 | 184,831 |
| 11 | RBC (RosBusinessConsulting) | rbc.ru | 35 | 154,048 |
| 12 | NEWSru.com | newsru.com | 32 | 140,082 |
| 13 | Komsomolskaya Pravda | kp.ru | 31 | 136,291 |
| 14 | Interfax | interfax.ru | 28 | 121,714 |
| 15 | Rossiyskaya Gazeta | rg.ru | 27 | 118,643 |
| 16 | NTV | ntv.ru | 26 | 113,353 |
| 17 | New Region 2 | nr2.ru | 25 | 110,104 |
| 18 | Business newspaper "Vzglyad" | vz.ru | 23 | 100,647 |
| 19 | Channel One | 1tv.ru | 19 | 84,659 |
| 20 | Snob Media | snob.ru | 18 | 78,439 |
| 21 | REGNUM Information Agency | regnum.ru | 17 | 76,920 |
| 22 | Kommersant.ru | kommersant.ru | 15 | 66,221 |
| 23 | Slon.ru | slon.ru | 15 | 65,872 |
| 24 | Vedomosti | vedomosti.ru | 15 | 63,915 |
| 25 | Arguments and Facts | aif.ru | 13 | 58,290 |
| 26 | Izvestia.ru | izvestia.ru | 13 | 56,109 |
| 27 | In Moscow - Moscow News | newsmsk.com | 12 | 54,147 |
| 28 | Novaya Gazeta | novayagazeta.ru | 12 | 52,367 |
| 29 | Svobodnaya Pressa | svpressa.ru | 11 | 49,069 |
| 30 | Inosmi.ru | inosmi.ru | 10 | 42,757 |
More information about the rating, the formation of the SMI index and SMR rating can be found in our blog:
http://br-analytics.ru/blog/?p=1264
WHY AND FOR WHOM?
The media research market measures publications along several dimensions: circulation, traffic to online versions, citations by other media, and number of subscribers (both offline and online). In essence, all of these compare data that has already been collected somewhere: at printing houses, in web analytics services, in social network counters. Comparing media by citations in other media is the most the media monitoring industry has been able to offer, but, you will agree, this metric raises more questions than it answers.
When our sociologist colleagues faced the task of ranking media by authority and reader trust, the solution was the standard one: run a survey asking respondents which media they consider more authoritative.
Having the (bad :-)) habit of projecting every social task onto social media, we decided to help our partners get additional information from users' messages on social networks and comments on articles.
HOW: DISCOVERIES AND DIFFICULTIES
The task turned out to be technically interesting, and its results unexpected. The data volume was clear in advance: our Platform collects 20-25 million messages and comments per day, which meant roughly 600 million items to process for March.
After that, everything seemed simple: work out how many messages contain links, extract the links, process them, discard the unneeded ones, normalize, and sort. For a trial analysis we took one day of data and off we went. The first surprise was the number of links: none of the analysts had imagined that it would be roughly comparable to the number of messages, at more than 15 million per day!
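The extract, normalize, and sort steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used on the Platform: the regular expression, the function names, and the rule of dropping a leading `www.` are all hypothetical choices.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a deliberately simple pattern; robust URL detection needs more care.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def extract_links(text):
    """Return every http(s) URL found in a message, minus trailing punctuation."""
    return [m.rstrip(".,;:!?)]") for m in URL_RE.findall(text)]

def normalize_host(url):
    """Lower-case the host and drop a leading 'www.' so variants count together."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def rank_hosts(messages):
    """Count links per normalized host and sort by descending count."""
    counts = Counter()
    for text in messages:
        for url in extract_links(text):
            host = normalize_host(url)
            if host:
                counts[host] += 1
    return counts.most_common()

# Tiny made-up sample standing in for a day of messages.
sample = [
    "Read this: http://www.ria.ru/world/1.html, it is important",
    "Same story at http://ria.ru/world/1.html and https://lenta.ru/news/2",
    "no links here",
]
ranking = rank_hosts(sample)
print(ranking)
```

Counting by host rather than by full URL is what lets `www.ria.ru` and `ria.ru` land in the same bucket.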
The second “nuisance” was the share of links to pictures, graphic elements, and video: approximately 30% of the total. We were prepared for the third “trouble”: expanding “short” links is a technique already used in Brand Analytics reports, but expanding tens of thousands of links is one thing, and expanding some 4-5 million per day is quite another. On top of that, 23 new services were added to the 12 popular URL-shortening services we already knew.
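A minimal sketch of these two steps, dropping media links and expanding short ones, might look like this. The extension list, the three shortener hosts, and the `resolve` callable are assumptions made for illustration; the post does not name the actual services or describe how redirects were followed.

```python
from urllib.parse import urlsplit

# Assumptions: both collections are illustrative; the post does not list the
# actual extensions or the 35 shortening services that were handled.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4")
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co"}

def is_media_link(url):
    """True when the URL path ends with a known image/video extension."""
    return urlsplit(url).path.lower().endswith(MEDIA_EXTENSIONS)

def is_short_link(url):
    """True when the host belongs to a known URL-shortening service."""
    return (urlsplit(url).hostname or "").lower() in SHORTENER_HOSTS

def expand_links(urls, resolve, cache=None):
    """Drop media links and expand short ones via `resolve`, caching results.

    `resolve` is a hypothetical callable (short URL -> final URL); a real
    one would follow HTTP redirects.
    """
    cache = {} if cache is None else cache
    result = []
    for url in urls:
        if is_media_link(url):
            continue
        if is_short_link(url):
            if url not in cache:
                cache[url] = resolve(url)
            url = cache[url]
        result.append(url)
    return result

# Stub resolver standing in for a real redirect lookup.
calls = []
def fake_resolve(url):
    calls.append(url)
    return {"http://bit.ly/abc": "http://ria.ru/world/1.html"}[url]

expanded = expand_links(
    ["http://bit.ly/abc", "http://site.example/pic.JPG",
     "http://bit.ly/abc", "http://lenta.ru/news/2"],
    fake_resolve,
)
print(expanded, len(calls))
```

The cache matters at this scale: the same short link is often reposted many times, and each cache hit saves a network round trip.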
“Brute-force” single-threaded processing of one day of data took 3-4 hours, which is fine for a slow, quick-and-dirty research run but unacceptable for regular daily monitoring. The final multi-threaded algorithm (3 threads), applied to the month of data, processed the monthly set of 655 million messages in 6 hours.
PS For those who want to experiment with different methods of parsing unstructured data, we are ready to provide hourly data dumps. Maybe someone can offer a faster solution.
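To illustrate the idea of splitting the work across three threads, here is a small sketch using a fixed-size thread pool. It is not the actual algorithm, which the post does not describe: `process_chunk`, the chunking scheme, and the sample data are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(messages):
    """Placeholder for per-chunk work: count messages that contain a link."""
    return sum(1 for text in messages if "http://" in text or "https://" in text)

def process_in_threads(chunks, workers=3):
    """Run process_chunk over all chunks with a fixed-size thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

# Made-up chunks; in practice each chunk could be one hourly data dump.
chunks = [
    ["http://ria.ru/1", "no link"],
    ["https://lenta.ru/2"],
    ["plain text", "http://vk.com/3", "https://t.co/x"],
]
total_with_links = process_in_threads(chunks)

# Throughput implied by the figures above: 655 million messages in 6 hours.
messages_per_second = 655_000_000 / (6 * 3600)
print(total_with_links, round(messages_per_second))
```

The reported figures work out to roughly 30 thousand messages per second. In Python, threads mainly help when the work is I/O-bound, as short-link expansion is; CPU-heavy parsing would more likely call for processes.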
RESULTS
Summary data:
• Messages processed for March: 655,269,709
• Unique authors: 35,172,270
• Total links found: 536,185,906
• Links excluding pictures: 357,853,627
• Short links: 110,685,097
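A few derived ratios make these totals easier to read. The sketch below only does arithmetic on the published numbers; the variable names are ours, and treating the gap between "total links" and "links excluding pictures" as media links is our interpretation.

```python
messages = 655_269_709
authors = 35_172_270
links_total = 536_185_906
links_without_pictures = 357_853_627
short_links = 110_685_097

# Assumption: the difference between the two link totals is the media links.
media_links = links_total - links_without_pictures
media_share = media_links / links_total
short_share = short_links / links_without_pictures
links_per_message = links_total / messages
messages_per_author = messages / authors

print(f"media links:         {media_links:,} ({media_share:.1%} of all links)")
print(f"short links:         {short_share:.1%} of non-media links")
print(f"links per message:   {links_per_message:.2f}")
print(f"messages per author: {messages_per_author:.1f}")
```

On these numbers, about a third of all links point to media, just under a third of the remaining links are short, and there are about 0.8 links per message, which matches the surprise that links are nearly as numerous as messages.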
For statistics fans, here is exclusive data on the top “raw” link targets with over a million links each (the “millionaires”), which we find quite curious:
154,659,839 | vk.com |
25,776,485 | apps.facebook.com |
23,611,855 | dsm.odnoklassniki.ru |
10,531,545 | facebook.com |
10,123,556 | youtube.com |
5,240,568 | instagram.com |
4,026,849 | twitter.com |
2,320,472 | plus.google.com |
2,304,521 | ask.fm |
1,847,571 | docs.google.com |
1,225,210 | islandandroid.17bullets.com |
Returning from technical questions to methodological ones...
1. It is no secret that every popular social network has a rather high share (from 10 to 47%) of automatic messages: both bot accounts (bot networks) and notification messages (games, cards, gifts, smileys, etc.). Anticipating the natural question from attentive expert readers: yes, such messages are filtered out and do not reach the module that analyzes and ranks link targets.
2. After the Media Citation Index was published on popular resources, a discussion arose in several social network groups where people sneered at the rating's leader, RIA Novosti, claiming that its high level of social media citation was due to the editors having disabled comments on ria.ru. An interesting idea, isn't it? And it suggests new “moves” for SMM specialists :-)
Perhaps our analysts will take this aspect into account when calculating the Rating and the Media Citation Index in the next study, for April. For example, by treating comments on an article on the publication's own site as equivalent to publications in social media (especially since, under our methodology, a comment on a news item is counted as an independent object). If you have an opinion on this, we will be glad to hear both it and, of course, the arguments “for” and “against”.
PS Habr ranks high in the link ranking: in the top 50, and first among technology resources.