
Who lives in social networks?


No matter how many scandals break out over PRISM, personal data and its leaks, social networks still tempt people to tell everything about themselves: which kittens they like, who their friends are and why they could not sleep this morning.
A whole encyclopedia on the behavior of most of the active Internet public is right there, and I have always wanted to get my hands on it. On the one hand, this data seems to be publicly available, but simply taking it and analyzing it is not so easy: everything is too unstructured and fragmented. In addition, as far as I know, there are practically no social data sets suitable for machine analysis. And for Russia, even fewer.
There was no choice, so, laughing ominously at night, we had to write simple spiders for the social networks VKontakte, Odnoklassniki, MoiMir and the Russian segment of Facebook, which over several months slowly collected a more or less statistically sound data sample. Only the information that people published about themselves was collected. And they published a lot.

What we managed to extract from this data is what this story is about.

How so?


I admit, this study is not the first. Social networks (especially Facebook and VKontakte) have been studied openly many times already. Even your humble servant once wrote an article about six handshakes, collecting a full graph of VKontakte friends for it.
But Runet does not live by VKontakte alone. I wanted to look into what is happening in other, no less inhabited social networks, and also to understand how their audiences differ.

Data collection


This is not our first experience of collecting large amounts of data under cover of night. So, at a fast pace and in five hands, four spiders were written in Qt/C++ and Python. Crawling slowly through their respective social networks, they wrote everything they found into the database.
Different social networks treat parsing differently. Problems arose with Odnoklassniki and Facebook, which, as it turned out, have a rather cunning system for detecting suspicious bots. Fortunately, it is aimed mostly at spammers, and from that point of view our bots look pink and fluffy, so we somehow managed to set up a more or less stable, albeit very slow, collection.

Analytics


Downloading a lot of data is easy: it takes just two months of collection. But paranoia is sweeping the planet, and for most people an open profile on a social network looks very sparse. The lion's share of information is available only to friends. But the point is that the friends themselves are usually open!
And from them you can calculate quite a lot of interesting things: city, age, university and much more. As a teaser, here is a graph of real age against the median age of friends:
image
As you might guess, real age is for the most part strongly related to the median age of friends. So even if you are paranoid, your friends will give away a lot about you simply by being there.
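To make the idea concrete, here is a minimal sketch of guessing a hidden age from the friends' ages. It is an illustration only, not the code used in the study: the function name `estimate_age_from_friends`, the `min_known` threshold and the sample data are all assumptions.

```python
from statistics import median


def estimate_age_from_friends(friend_ages, min_known=5):
    """Guess a user's age as the median of the friends' known ages.

    friend_ages: iterable of ages, with None for friends who hide their age.
    min_known:   hypothetical threshold; with fewer known ages we refuse to guess.
    """
    known = [age for age in friend_ages if age is not None]
    if len(known) < min_known:
        return None
    return median(known)


# Made-up friend list: two friends hide their age, one is much older.
friends = [24, 25, None, 26, 25, 41, None, 23, 25]
estimated = estimate_age_from_friends(friends)
print(estimated)
```

The median is used rather than the mean so that a few much older friends (parents, teachers, colleagues) do not drag the estimate upwards.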

For storage and analysis we decided, like the big boys, to use HBase / Hadoop. It is stylish, fashionable and youthful, and besides, we already had some experience with these technologies. As a result, about 50 parameters were calculated from what we had collected (that is, either reduced to a uniform format or derived from social connections). Thumbs up. Then a random sample of one million users from each social network was drawn from the full data set and carefully analyzed. This trick was needed to at least slightly normalize the audiences of social networks with different numbers of users.
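The article does not say how the random sample was drawn. One common way to take a fixed-size uniform sample from a stream that is too large to hold in memory is reservoir sampling; the sketch below shows that technique purely as an illustration. The function name and the toy numbers are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Classic reservoir sampling (Algorithm R): the first k items fill the
    reservoir, and item number i (0-based, i >= k) replaces a random slot
    with probability k / (i + 1).
    """
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Toy example: 10 "users" out of 1000 instead of a million out of everything.
user_ids = range(1000)
picked = reservoir_sample(user_ids, 10, rng=random.Random(42))
print(len(picked))
```

Taking the same `k` for every network is what equalizes the sample sizes, however many users each network has.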
What follows is the tastiest of what I managed to find out.

Age


For a start, it would be interesting to know the age structure of different social networks.
The age used was either the real one, if the person was not too shy to state a year of birth, or an approximation based on the date of graduation from school or university. This maneuver was needed mostly because of VKontakte and Facebook, where the exact age is known for only 40% and 20% of users, respectively.
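The fallback described above could look roughly like this. It is a sketch under stated assumptions: the profile field names are invented, and the typical graduation ages (17 for school, 22 for university) are my guesses, not figures from the study.

```python
# Assumed typical ages at graduation; not taken from the article.
SCHOOL_GRADUATION_AGE = 17
UNIVERSITY_GRADUATION_AGE = 22


def estimate_age(profile, current_year):
    """Return the stated age if a birth year is known, else an approximation.

    profile is a dict with optional (hypothetical) keys:
    'birth_year', 'school_graduation_year', 'university_graduation_year'.
    """
    birth_year = profile.get("birth_year")
    if birth_year is not None:
        return current_year - birth_year

    school_year = profile.get("school_graduation_year")
    if school_year is not None:
        return current_year - school_year + SCHOOL_GRADUATION_AGE

    university_year = profile.get("university_graduation_year")
    if university_year is not None:
        return current_year - university_year + UNIVERSITY_GRADUATION_AGE

    return None  # nothing to go on


print(estimate_age({"birth_year": 1990}, 2013))
print(estimate_age({"school_graduation_year": 2005}, 2013))
```

The school date is tried before the university date because people finish school at a more uniform age than they finish university.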
The result was roughly this picture.
image

Funny. Seeing this, you can definitely notice the following features:

And how are things with the age and sex structure? Sex was either taken as stated, if the social network allows it to be specified at all, or calculated from the first name as follows. If the majority of people named “Alexander” are men, then all Alexanders of unknown sex are considered men. This approach works for the overwhelming majority of names, but has some problems with Zhenya and Sasha.
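Here is a small sketch of the name-based rule. The article speaks of a simple majority; the stricter `min_share` threshold below is my own assumption, added so that ambiguous names such as Sasha stay undecided instead of being guessed wrongly. All names and counts are made up.

```python
from collections import Counter, defaultdict


def build_name_to_sex(labeled, min_share=0.9):
    """Map each first name to its dominant sex.

    labeled:   iterable of (name, sex) pairs from profiles where sex is stated.
    min_share: assumed minimum share of the dominant sex; names below it
               are left out of the mapping.
    """
    counts = defaultdict(Counter)
    for name, sex in labeled:
        counts[name][sex] += 1

    model = {}
    for name, by_sex in counts.items():
        sex, n = by_sex.most_common(1)[0]
        if n / sum(by_sex.values()) >= min_share:
            model[name] = sex
    return model


def infer_sex(name, declared, model):
    """Prefer the declared sex; fall back to the name-based guess (or None)."""
    return declared if declared is not None else model.get(name)


labeled = (
    [("Alexander", "m")] * 19 + [("Alexander", "f")]
    + [("Sasha", "m")] * 6 + [("Sasha", "f")] * 4
)
model = build_name_to_sex(labeled)
print(model)
```

With these toy counts, Alexander is resolved as male, while Sasha stays unknown unless the profile states the sex explicitly.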


Excellent. I always suspected it:

Next, an abstract indicator of a person's activity in a social network was calculated from a set of several rules: “has an avatar”, “has more than 50 friends”, “was online recently” and so on. Each rule that fired dropped a few points into the profile's piggy bank. And this is what the distribution of this indicator looks like across the different social networks:
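A rule-based score of this kind can be sketched in a few lines. Only the three rules quoted above come from the article; the point values, the seven-day meaning of “recently” and the profile field names are assumptions made for the example.

```python
# (rule name, predicate over a profile dict, points) - weights are invented.
RULES = [
    ("has_avatar", lambda p: p.get("has_avatar", False), 2),
    ("more_than_50_friends", lambda p: p.get("friend_count", 0) > 50, 3),
    ("recently_online", lambda p: p.get("days_since_online", float("inf")) <= 7, 2),
]


def activity_score(profile):
    """Sum the points of every rule that fires for this profile."""
    return sum(points for _name, predicate, points in RULES if predicate(profile))


active_user = {"has_avatar": True, "friend_count": 120, "days_since_online": 1}
ghost_user = {}
print(activity_score(active_user), activity_score(ghost_user))
```

Keeping the rules in a table makes it easy to add or re-weight a rule without touching the scoring function.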


Surprisingly, VKontakte is simply overflowing with young and hyperactive users, whose fervor fades (or is sensibly redirected into family life) only by the age of 35. On Odnoklassniki, the main activity begins only at around 30. And on MoiMir and Facebook the situation is sadder in this respect: it is a swamp.

Obscene language


Age and activity are nice, but very boring. So, in order not to fall asleep, the number of swear words found in each person's posts was counted for everyone in the sample. Compiling the dictionary of such words was especially amusing. The value on the vertical axis is the number of such words in the last 10 posts.
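Counting dictionary words in the last 10 posts could be done along these lines. This is only a sketch: the two harmless placeholder words stand in for the real dictionary, the tokenization is a plain regular expression, and the posts are assumed to be ordered from oldest to newest.

```python
import re

# Harmless placeholders; the real dictionary is not given in the article.
PROFANITY = frozenset({"darn", "heck"})
WORD_RE = re.compile(r"\w+")


def profanity_count(posts, dictionary=PROFANITY, last_n=10):
    """Count dictionary words in the last_n most recent posts."""
    recent = posts[-last_n:]
    return sum(
        1
        for post in recent
        for word in WORD_RE.findall(post.lower())
        if word in dictionary
    )


posts = ["Darn, it is Monday", "What the heck, darn it!", "Nice weather"]
total = profanity_count(posts)
print(total)
```

A real dictionary for Russian would also have to handle inflected forms and creative spellings, which exact word matching like this does not cover.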

Obviously, from the age of 15 young people grow so bold that they can swear profusely right on their own page. Personally, I wrote my first “X * Y” in the schoolyard back when I was 12, though, admittedly, anonymously. The outpouring of obscene language continues until about 23. Then, apparently, seriousness sets in and it is time to become an adult. A very Captain Obvious statement, but now it is at least backed by facts.

Name Popularity


Now it's time to dissect the names. I always thought that the popularity of names changes over time. Sometimes it really feels like there is a peculiar fashion for giving children unusual names. And now you can see it with your own eyes.

The trend is obvious: previously popular names are rapidly losing their former popularity.

With female names, the situation looks similar:

In general, it is the same headlong plunge, but the following features deserve attention:

Phone models


Let's move on. Every post on a VKontakte wall carries a funny label if it was made from an iOS or Android phone. And this, too, can (and even should) be analyzed. Note that the Y axis of the graph shows the proportion of men. The proportion of women, as you might guess, depends on the proportion of men in a very simple way.

Interestingly, the iPhone has a clear bias towards the fair sex, which is not surprising. “Dad, buy me an iPhone, why should I walk around with this thing like a fool” is perhaps quite a popular phrase that appears in the nightmares of many young dads. And Android starts to enjoy increased demand among stern men at around 30.

Family status


Grandma always told me that in her day, being an unmarried man (or, worse, an unmarried woman) at 25 was equivalent to a catastrophe. This usually led to a lecture on the topics “you need to marry, sir” and “everyone else is already married, and you are alone like an owl”. I always wanted arguments for this debate, and now I have them.
I want to note that marital status was analyzed only among those who indicated it.

The following facts are very interesting:

Bad habits


Now we can move on to bad habits: smoking and drinking alcohol. This parameter exists only on VKontakte, but many people diligently fill it in, which we will take advantage of.

Regrettably, the love of alcohol and smoking only grows with age. A plateau of sorts is reached only by about 30, which somewhat surprised me. Somewhere around 40, some people come to their senses and try to rectify the situation, but by then it is already too late to drink Borjomi.

Height and weight


MoiMir has some funny parameters in the profile: height and weight. I cannot explain what motivated the great mind who added them. But the parameters are there, and it would be foolish not to look at them through the prism of our curiosity.

I expected to see a less contrasting graph, but this is how it turned out. I suppose this oddity can be explained by the fact that women are more often proud of being short and hide being tall. With men the situation is exactly the opposite: it is embarrassing to admit to the whole Internet that you are 150 cm tall, but if you are close to two meters, everyone should know about it.
On the other hand, women are on average shorter than men, and everything may be much simpler.

With weight, the situation is about the same as with height. Above 60 kg, women abruptly stop mentioning their weight. Men, on the other hand, are always happy to. One hundred and twenty? No problem, there should be a lot of a good man.

In general, the relationship between height and weight is described in many medical sources, and it shows up on this graph too. It is funny to note that short girls are usually heavier than boys.

Sex


When I was little, I always suspected that girls were more often friends with girls. And boys, in retaliation, were friends mostly with boys. I think it's time to confirm this trend.

Yes, there is a clear tendency for girls to have girls as friends. And if you are male and your friends are nothing but ladies, then I have bad news for you.

Likes


Likes are an amazing thing. Six years ago they did not exist, and now they are an essential attribute of any social network.

I always suspected that the fair sex hands out likes much more often. This trend continues almost until 30, but then slowly fades. With age, fortunately, values change.
The like is a phenomenon of modern times. Girls nervously count how many people have liked their new avatar, and meanwhile a dialogue plays in my head: “Dad, how did you meet mom?” “Well, I liked her avatar, and then it all started.”

Politics


No matter how hard Habr tries to stay out of politics, it now seeps through every crack. Social networks even have a special field describing political views, which we will now dissect.


That is the end of the pretty graphs. But the fun is not over yet.

Data


Creating such a wonderful data set for analysis and not sharing it would be a crime against humanity. So it was decided to make it openly available, but in a way that does not infringe on the rights and privacy of the people who ended up in it:

Archive with data, 7z, 135 MB compressed, 1 GB unpacked.

Instead of an afterword


Be careful with the data you post online. Whatever is uploaded there once stays there forever. So take care of yourself, and guard your privacy from a young age.

Source: https://habr.com/ru/post/198722/

