
Meduza.io: what about the likes?

Once, while reading the news on Meduza, I noticed that different stories have very different ratios of Facebook and VKontakte likes: some are hugely popular on FB, while others get shared only on VKontakte. I wanted to look at this data and try to find interesting patterns in it. If you're curious, the details are under the cut!




Data scraping


First of all, I needed to get the data for analysis. Anticipating the usual Python + BeautifulSoup routine, I started reading the page source. Disappointment came quickly: this data is not embedded in the HTML but loaded lazily. Since I don't know JavaScript, I started digging through the page's network requests instead, and quickly came across a handy Meduza API endpoint:


https://meduza.io/api/v3/social?links=["shapito/2016/05/03/poliem-vse-kislotoy-i-votknem-provod-v-rozetku"] 

The endpoint returns a nice-looking JSON response:


[image: sample JSON response from the social endpoint]


And of course, since links is an array, the obvious next step is to pass several records at once — and, hooray, we get back the whole list we need.


I didn't even have to parse!
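For example, a quick sketch of such a request (the slug is the one from the URL above; the 'stats' layout is the same one the download script below relies on):

import json
import requests

slugs = ["shapito/2016/05/03/poliem-vse-kislotoy-i-votknem-provod-v-rozetku"]
resp = requests.get('https://meduza.io/api/v3/social',
                    params={'links': json.dumps(slugs)}).json()
for slug, data in resp.items():
    print slug, data['stats']   # e.g. counters keyed by network: fb, vk, tw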


Now I want to get information about the news items themselves. Here I'd like to thank habrauser sirekanyan for his article, where he found another endpoint:


 https://meduza.io/api/v3/search?chrono=news&page=0&per_page=10&locale=ru 

Experimentally, I established that the maximum value of the per_page parameter is 30, and that there were about 752 pages at the time of writing. An important check — that the social endpoint can handle all 30 documents at once — passed successfully.
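A minimal sketch of that check, assuming, as the download script below does, that the search response carries the page's slugs in a 'collection' field:

import json
import requests

search = 'https://meduza.io/api/v3/search?chrono=news&page=0&per_page=30&locale=ru'
page = requests.get(search).json()
slugs = page['collection']   # up to 30 slugs on a page
stats = requests.get('https://meduza.io/api/v3/social',
                     params={'links': json.dumps(slugs)}).json()
# both numbers should come out as 30
print len(slugs), len([s for s in slugs if s in stats])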


All that remains is to download everything. I used a simple Python script:


import json
import time

import requests

stream = 'https://meduza.io/api/v3/search?chrono=news&page={page}&per_page=30&locale=ru'
social = 'https://meduza.io/api/v3/social'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.3411.123 YaBrowser/16.2.0.2314 Safari/537.36'
headers = {'User-Agent': user_agent}


def get_page_data(page):
    # fetch one page of news documents from the search endpoint
    ans = requests.get(stream.format(page=page), headers=headers).json()
    # fetch social counters for all documents of the page in a single request
    ans_social = requests.get(social,
                              params={'links': json.dumps(ans['collection'])},
                              headers=headers).json()
    documents = ans['documents']
    for url, data in documents.iteritems():
        try:
            data['social'] = ans_social[url]['stats']
        except KeyError:
            continue
    # dump the enriched documents to a per-page file
    with open('res_dump/page{pagenum:03d}_{timestamp}.json'.format(
            pagenum=page, timestamp=int(time.time())), 'wb') as f:
        json.dump(documents, f, indent=2)

Just in case, I supplied a real User-Agent, but everything works without it.


Next, a script by my former colleague alexkuku helped me parallelize and monitor the download. You can read more about the approach in his post; it let me set up monitoring like this:


[image: download progress monitoring]


The data downloaded very quickly, in under 10 minutes, with no captcha and no noticeable throttling. I pulled it in 4 parallel threads from a single IP address, without any proxies or other workarounds.
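This is not alexkuku's monitoring setup, just a minimal sketch of one way to run get_page_data from the script above in 4 threads:

from multiprocessing.dummy import Pool  # thread pool is enough for I/O-bound requests

pool = Pool(4)
pool.map(get_page_data, range(753))  # pages 0..752
pool.close()
pool.join()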


Data mining


So, as output we have one big JSON file with the data. Now let's load it into a pandas DataFrame and poke around in Jupyter.
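The download script writes one JSON file per page into res_dump/; the merge into the single database.json used below is not shown in the post, so here is a minimal sketch of how it could be done (documents are keyed by URL, so updating a dict is enough):

import glob
import json

database = {}
for path in glob.glob('res_dump/page*.json'):
    with open(path) as f:
        database.update(json.load(f))  # duplicates collapse on the URL key

with open('database.json', 'w') as f:
    json.dump(database, f, indent=2)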


Load the necessary data:


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_json('database.json').T
df = df.join(pd.DataFrame(df.social.to_dict()).T)
df.pub_date = pd.DatetimeIndex(df.pub_date)
# trust level of the source (present only for some documents)
df['trust'] = df.source.apply(lambda x: x.get('trust', None) if type(x) == dict else None)

Build a box plot:


df[['fb', 'tw', 'vk']].plot.box(logy=True);

[image: box plot of fb, tw and vk like counts, log scale]


Just a few conclusions:


  1. Twitter has disabled the ability to see how many times a story was tweeted. :-( We'll have to do without it.
  2. The distribution, as expected, is far from normal: there are very strong outliers that stand out even on the log scale (hundreds of thousands of reposts).
  3. At the same time, the typical number of reposts is quite close: the medians are 24 and 17 (here and below, Facebook and VKontakte respectively), with the VK distribution somewhat more spread out (see the snippet right after this list).
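A one-line check of the numbers from point 3 (computed over the full DataFrame):

print df[['fb', 'vk']].median()   # roughly 24 for fb and 17 for vk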

So which Meduza stories are the most reposted? Can you guess?




Answer:

[image: top reposted stories on Facebook and VKontakte]


Well, of course, the first list is FB: foreign languages, Soviet newspapers, Serov. And the second one has 5'nizza, "My Orientation" and politics. I don't know about you, but to me it all seems obvious!


The only things the two networks' preferences agree on are Irina Yarovaya and, well, Tsvetaeva together with Guf.
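For reference, lists like the ones in the picture can be pulled straight from the DataFrame; a sketch, where the title column name is my guess at the fields returned by the search endpoint:

cols = ['title', 'fb', 'vk']   # 'title' is an assumed field name
print df.sort_values('fb', ascending=False)[cols].head(10)   # top of Facebook
print df.sort_values('vk', ascending=False)[cols].head(10)   # top of VKontakte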


Now I want to look at a scatter plot of the two quantities: we'd expect the data to correlate well.


df['logvk'] = np.log10(df.vk)
df['logfb'] = np.log10(df.fb)
# scatter plot of the log counts with a regression line
sns.regplot('logfb', 'logvk', data=df)

[image: scatter plot of logfb vs logvk with regression line]
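To put a number on "correlate well", one can compute the correlation of the log counts directly (a sketch; infinite values coming from zero counts are dropped first):

finite = df.replace([np.inf, -np.inf], np.nan).dropna(subset=['logfb', 'logvk'])
print finite[['logfb', 'logvk']].corr()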


sns.set(style="ticks")
sns.jointplot('logfb', 'logvk',
              data=df.replace([np.inf, -np.inf], np.nan).dropna(subset=['logfb', 'logvk']),
              kind="hex")

[image: hexbin joint plot of logfb vs logvk]


It looks like there are two clusters: one centered around (2.3, 2.4), and a second one smeared around zero. We have no particular goal of analyzing low-activity news (stories that turned out to be uninteresting on social networks), so let's restrict ourselves to records with more than 10 likes on both networks. Don't forget to check that we are discarding only a small share of the activity.


stripped = df[(df.logfb > 1) & (df.logvk > 1)]
print "Working with {0:.0%} of news, {1:.0%} of social network activity".format(
    float(len(stripped)) / len(df),
    float(stripped[['vk', 'fb']].sum().sum()) / df[['vk', 'fb']].sum().sum()
)
# Working with 47% of news, 95% of social network activity

Density:


sns.jointplot('logfb', 'logvk', data=stripped, kind="kde", size=7, space=0)

[image: KDE joint plot of logfb vs logvk for the filtered data]


Findings



Now I want to look at this relationship over time: perhaps it has changed.


by_month = stripped.set_index('pub_date').groupby(pd.TimeGrouper(freq='MS')).agg({'fb': sum, 'vk': sum})
by_month.plot(kind='area')

[image: stacked area plot of monthly fb and vk totals]


Interestingly, while overall social network activity is growing, Facebook is growing faster. Also, there is no sign of the explosive growth I would have expected to see at Meduza: activity was quite low in the first months, had stabilized by December 2014, and new growth began only a year later.
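The "Facebook is growing faster" observation is easy to check directly on the monthly aggregates (a sketch):

(by_month.fb / by_month.vk).plot(title='fb / vk reposts per month');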


Let's look at how the density distribution of likes from the two social networks changes over time:


[image: joint density of fb and vk likes over time]
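The plot above is presumably animated; a minimal static sketch with a simple calendar-year split (the split is my own choice, not necessarily the one used for the original plot):

tmp = stripped.copy()
tmp['year'] = pd.DatetimeIndex(tmp.pub_date).year
g = sns.FacetGrid(tmp, col='year', size=4)
g.map(sns.kdeplot, 'logfb', 'logvk');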


It's quite entertaining that the second cluster shrinks over time and is largely an artifact of the past.


Finally, I want to check whether the ratio between the social networks changes with the document type: besides news, Meduza has cards, stories, "shapito", galleries, and also "polygon".


def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=20, cmap=cmap, **kwargs)

g = sns.FacetGrid(stripped.loc[::-1], col="document_type", margin_titles=True, size=5, col_wrap=3)
g.map(hexbin, "logfb", "logvk", extent=[1, 4, 1, 4]);

[image: hexbin plots of logfb vs logvk faceted by document type]


Overall, the data is fairly homogeneous across the classes, with no noticeable skew. I would have expected more social activity from "shapito", but no such effect is observed.


But if you look at the breakdown by level of trust in the source, it's nice to see that unreliable sources are less popular on social networks, especially on Facebook:


[image: likes broken down by source trust level]
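The code for the trust plot is not shown above; a sketch of one way to get a comparable view from the trust column created earlier (the actual plot in the post may differ):

print stripped.groupby('trust')[['fb', 'vk']].median()
sns.boxplot(x='trust', y='logfb', data=stripped.dropna(subset=['trust']));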


What's next?


At this point my evening was over, and I went to bed.



Code on GitHub



Source: https://habr.com/ru/post/283058/

