Data Mining in Big Data: Social Media Citation Rating

“What analyst doesn’t love Big Data!” is how one might paraphrase the popular saying about fast driving. 650 million social media messages from 35 million authors and 358 million links, of which 110 million are “short”: this is the volume of data for March 2014 that was analyzed to compile a media citation rating.
In this post we cover the methodological and technological aspects and invite you to discuss ideas for “deep drilling” Data Mining of social media. If you are interested, read on.

The rating itself turned out like this.

Top 30 media by social media citation (March 2014):

| Rank | Resource name | Website | SMI index | Number of links |
|------|---------------|---------|-----------|-----------------|
| 1 | RIA Novosti | ria.ru | 117 | 516,641 |
| 2 | RT in Russian | russian.rt.com | 83 | 364,845 |
| 3 | Lenta.Ru | lenta.ru | 72 | 318,735 |
| 4 | Radio station "Echo of Moscow" | echo.msk.ru | 52 | 226,985 |
| 5 | Gazeta.Ru | gazeta.ru | 51 | 226,760 |
| 6 | LifeNews | lifenews.ru | 48 | 212,870 |
| 7 | TV channel "Dozhd" (Rain) | tvrain.ru | 48 | 210,413 |
| 8 | ITAR-TASS | itar-tass.com | 46 | 203,795 |
| 9 | Vesti.ru | vesti.ru | 45 | 197,654 |
| 10 | Sports.ru | sports.ru | 42 | 184,831 |
| 11 | RBC (RosBusinessConsulting) | rbc.ru | 35 | 154,048 |
| 12 | NEWSru.com | newsru.com | 32 | 140,082 |
| 13 | Komsomolskaya Pravda | kp.ru | 31 | 136,291 |
| 14 | Interfax | interfax.ru | 28 | 121,714 |
| 15 | Rossiyskaya Gazeta | rg.ru | 27 | 118,643 |
| 16 | NTV | ntv.ru | 26 | 113,353 |
| 17 | New Region 2 | nr2.ru | 25 | 110,104 |
| 18 | Business newspaper "Vzglyad" | vz.ru | 23 | 100,647 |
| 19 | Channel One | 1tv.ru | 19 | 84,659 |
| 20 | Snob Media | snob.ru | 18 | 78,439 |
| 21 | REGNUM Information Agency | regnum.ru | 17 | 76,920 |
| 22 | Kommersant.ru | kommersant.ru | 15 | 66,221 |
| 23 | Slon.ru | slon.ru | 15 | 65,872 |
| 24 | Vedomosti | vedomosti.ru | 15 | 63,915 |
| 25 | Arguments and Facts | aif.ru | 13 | 58,290 |
| 26 | Izvestia.ru | izvestia.ru | 13 | 56,109 |
| 27 | In Moscow - Moscow News | newsmsk.com | 12 | 54,147 |
| 28 | Novaya Gazeta | novayagazeta.ru | 12 | 52,367 |
| 29 | Svobodnaya Pressa | svpressa.ru | 11 | 49,069 |
| 30 | Inosmi.ru | inosmi.ru | 10 | 42,757 |

More information about the rating and about how the SMI index and SMR rating are formed can be found in our blog: http://br-analytics.ru/blog/?p=1264

WHY AND FOR WHOM?

The media research market measures publications along several dimensions: by circulation, by traffic to online versions, by citations in other media, and by number of subscribers (both offline and online). In fact, all of these measurements compare data that has already been collected somewhere: in printing houses, web analytics services, or social network counters. Comparing media by citations in other media is the most the media monitoring industry has been able to offer, but, you will agree, that metric raises more questions than it answers.

When our sociologist colleagues came to us with the task of ranking media by authority and reader trust, the standard solution was to run a survey asking respondents which media they consider more authoritative.

Having the (bad :-)) habit of projecting every social question onto social media, we decided to help our partners get additional information from users' messages in social networks and comments on articles.

HOW: DISCOVERIES AND DIFFICULTIES

The task turned out to be technically interesting and unexpected in its results. The data volume was clear in advance: our Platform collects 20-25 million messages and comments per day, which meant that roughly 600 million items would have to be processed for March.

From there everything seemed simple: find the messages that contain links, extract the links, process them, discard the unneeded ones, normalize, and sort. For a first analysis we took one day of data and set off. The first surprise was the number of links: none of the analysts had imagined that the number of links would roughly match the number of messages, more than 15 million per day!
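
The extract, filter, normalize, and sort steps described above can be sketched roughly as follows. This is only an illustration, not the Platform's actual code: the URL pattern, the list of media extensions, and all function names are assumptions made for the example.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a simple pattern for http(s) links; real messages need more care.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
# Assumption: media links are recognised by file extension only.
MEDIA_EXT = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".webm")


def extract_links(message: str) -> list[str]:
    """Pull every http(s) link out of one message, trimming trailing punctuation."""
    return [u.rstrip(".,;:!?)") for u in URL_RE.findall(message)]


def is_media(url: str) -> bool:
    """True if the link path ends with one of the assumed media extensions."""
    return urlsplit(url).path.lower().endswith(MEDIA_EXT)


def normalize_host(url: str) -> str:
    """Lower-case the host and drop a leading 'www.' so variants count together."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def rank_hosts(messages):
    """Count non-media links per normalized host, most cited first."""
    counts = Counter()
    for msg in messages:
        for url in extract_links(msg):
            if not is_media(url):
                counts[normalize_host(url)] += 1
    return counts.most_common()
```

Calling `rank_hosts` on an iterable of message texts returns `(host, count)` pairs sorted by count, which is the raw material for a ranking like the one above.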

The second “nuisance” was the number of links to pictures, graphic elements, and video: approximately 30% of the total. We were prepared for the third “trouble”: the technology for expanding “short” links is already used in Brand Analytics reports, but expanding tens of thousands of links is one thing, and expanding about 4-5 million per day is quite another. On top of that, 23 new services were added to the 12 popular link-shortening services we already knew.
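
At this volume, one obvious saving is to resolve each distinct short link only once. Below is a minimal sketch of that idea, not the Brand Analytics implementation: the shortener list is a small illustrative subset rather than the 35 services mentioned, and `resolve` is a hypothetical caller-supplied function (for example, one that reads the `Location` header of an HTTP redirect).

```python
from urllib.parse import urlsplit

# Assumption: an illustrative subset of shortener domains,
# not the list of services used in the study.
SHORTENERS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com"}


def is_short(url: str) -> bool:
    """True if the link points at one of the assumed shortener domains."""
    return (urlsplit(url).hostname or "").lower() in SHORTENERS


def expand_all(urls, resolve, max_hops=5):
    """Expand short links, resolving each distinct URL only once.

    `resolve` is a hypothetical callable mapping a short URL to the URL
    it redirects to. Chains of shorteners are followed up to `max_hops`.
    """
    cache = {}
    result = []
    for url in urls:
        if url not in cache:
            target, hops = url, 0
            while is_short(target) and hops < max_hops:
                target = resolve(target)
                hops += 1
            cache[url] = target
        result.append(cache[url])
    return result
```

Passing `resolve` in as a parameter keeps the network call out of the counting logic, so the expansion step can be tested with a plain dictionary lookup and swapped for a real HTTP client later.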

“Brute-force” single-threaded processing of one day of data took 3-4 hours, which is acceptable for a slow, quick-and-dirty research prototype but not for regular daily monitoring. The final multi-threaded algorithm (3 threads), applied to the data for the whole month, processed the monthly array of 655 million messages in 6 hours.
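
The post does not describe how the work was split across the three threads. One common pattern, shown here purely as an assumption, is to cut the data into batches (for example, hourly dumps), count each batch in a worker, and merge the partial counts. `count_batch` and `count_parallel` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_batch(batch):
    """Hypothetical per-batch worker: count link hosts in one batch."""
    return Counter(batch)


def count_parallel(batches, workers=3):
    """Count hosts across batches using a small thread pool, then merge."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields one partial Counter per batch, in input order.
        for partial in pool.map(count_batch, batches):
            total.update(partial)
    return total
```

For scale: 655,269,709 messages in 6 hours is roughly 30,000 messages per second, whereas 3-4 hours per day single-threaded would mean about 93-124 hours for a 31-day month. Note that in CPython threads mainly help when the work is I/O-bound, such as waiting on short-link redirects; for CPU-bound parsing a process pool may be a better fit.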

P.S. For those who want to experiment with different approaches to parsing unstructured data, we are ready to provide hourly data dumps. Maybe someone can offer a faster solution.

RESULTS

Summary data:
• Messages processed for March: 655,269,709
• Unique authors: 35,172,270
• Total links found: 536,185,906
• Links excluding pictures: 357,853,627
• Short links: 110,685,097
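
A few ratios follow directly from these totals. The short calculation below uses only the figures listed above; the variable names are ours.

```python
# Totals from the summary list above.
messages = 655_269_709
authors = 35_172_270
links_total = 536_185_906
links_no_images = 357_853_627
links_short = 110_685_097


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Links that were dropped as pictures, as a share of all links found.
image_share = share(links_total - links_no_images, links_total)
# Short links as a share of the links that remained after filtering.
short_share = share(links_short, links_no_images)
links_per_message = round(links_total / messages, 2)
messages_per_author = round(messages / authors, 1)
```

This gives about 33.3% of links filtered out as pictures (in line with the roughly 30% estimated from one day of data), about 30.9% short links among the remainder, about 0.82 links per message, and about 18.6 messages per author.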

For statistics fans, here is exclusive data on the top “raw” link “millionaires”, which we find quite curious:
• vk.com: 154,659,839
• apps.facebook.com: 25,776,485
• dsm.odnoklassniki.ru: 23,611,855
• facebook.com: 10,531,545
• youtube.com: 10,123,556
• instagram.com: 5,240,568
• twitter.com: 4,026,849
• plus.google.com: 2,320,472
• ask.fm: 2,304,521
• docs.google.com: 1,847,571
• islandandroid.17bullets.com: 1,225,210

Returning from technical questions to methodological ones...

1. It is no secret that every popular social network has a rather high share (from 10 to 47%) of automatic messages: both bot accounts (botnets) and notification messages (games, cards, gifts, smileys, etc.). To anticipate the natural question from attentive expert readers: yes, such messages are filtered out and do not reach the module that analyzes and ranks link targets.

2. After the Media Citation Index was published on popular resources, a discussion arose in several social network groups, where people sneered that the rating's leader, RIA Novosti, owes its high level of social media citation to the fact that its editors disabled commenting on ria.ru. An interesting idea, isn't it? And it suggests new “moves” for SMM specialists :-)

Perhaps our analysts will take this aspect into account when calculating the Rating and the Media Citation Index in the next study, for April. For example, by treating comments on an article on the publication's own site as equivalent to publications in social media (especially since, under our methodology, a comment on a news item is already counted as an independent object). If you have an opinion on this, we will be glad to hear both the opinion and, of course, the arguments “for” and “against”.

P.S. Habr ranks high in the link ranking: in the top 50, and in first place among technology resources.

Source: https://habr.com/ru/post/220415/

