
Data Science: About love, names and more. Part II

Because in much wisdom there is much sadness;
And who multiplies knowledge, multiplies sorrow.
• Ecclesiastes 1:18

Shots from the movie Casino Royale (2006)


This article cannot serve as grounds for expressing intolerance or discrimination on any basis.


In the first part of the article, I only stated the problem, which sounded as follows: the probability of being alone depends on a person's name. It would be more accurate to say "correlates with", but I will again allow myself some linguistic freedom here and hope that everyone understands the statement correctly. In any case, I would like to thank everyone for the comments on my previous article.


In one of the comments I said that there may well be some third factor that correlates with both name and loneliness. As an illustration, I gave an example with apples: suppose that loneliness depends on how many apples a girl eats, and for some reason girls named Katya eat more apples than girls named Masha. Clearly, for any specific Masha or Katya this means absolutely nothing, but on average some turn out to be lonelier than others because they eat different amounts of apples.


In fact, this reduces the problem to another of exactly the same kind: why do people with one name eat more apples than others? However, the explanation of this correlation may be simpler.


Cherry picking and statistical significance


Before I continue, I will make a few remarks about the sample in the previous article, because we will keep working with it. On the one hand, I really do prefer qualitative arguments. On the other hand, I understand the people who ask why the sample was chosen this way and whether the results are statistically significant. I deliberately wrote nothing about statistical significance, because a situation in which two "random" processes behave the same way in different systems, with different people and different mechanics for setting the status, seems quite improbable to me. As for the choice of names, there is an element of randomness (I tried not only to take the names of girls I know, but also to fill in the parts of the distribution that were missing in terms of name frequency), but I did nothing on purpose except limit the number of names, and the three stable parts of the resulting table emerged completely independently of my wishes.


However, "at the request of the workers" (as one of the comments put it), I took 100 completely random names for which Odnoklassniki had enough statistics and checked what would happen if the names themselves were shuffled. If I had obtained exactly the same distribution (after calculating u), as some people predicted, one could say that the result is not statistically significant and that, at best, there is a dependence only on the frequency of the name. However, the Mann-Whitney test gave a p-value of 0.000256, i.e. the initial distribution and the one obtained after shuffling differ significantly.
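A comparison of this kind can be sketched in a few lines. The block below is only an illustration, not the script used for the article: it implements the two-sided Mann-Whitney U test with the normal approximation (with tie and continuity corrections) using the Python standard library. The function name `mann_whitney_u` and the lists `u_original` and `u_shuffled` are hypothetical, and the numbers are made up; in practice a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test, normal approximation.

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    # Continuity correction of 0.5, clipped at zero.
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = 2 * (1 - NormalDist().cdf(z))
    return u_x, min(p_value, 1.0)


# Made-up values for illustration only; these are not Odnoklassniki data.
u_original = [0.31, 0.35, 0.38, 0.42, 0.47, 0.52, 0.55, 0.61]
u_shuffled = [0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69]
u_stat, p_value = mann_whitney_u(u_original, u_shuffled)
print(f"U = {u_stat}, p-value = {p_value:.6f}")
```

A small p-value here says only that the two sets of values are unlikely to come from the same distribution; it says nothing about why they differ.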


Therefore, I will continue to use the original tables, considering them sufficiently representative of our research.


Will I have problems with you, Bond?


My experience at St. Petersburg State University led me to the following thought (I doubt I am the only one it has occurred to): what if smarter people are more often alone? That is, the whole dialogue between Bond and Vesper in the stills from Casino Royale is a kind of tautology in the probabilistic sense.


It is well known that IQ tests are not very representative, and IQ cannot be measured directly on a social network. But we can make the following assumption: people with a higher education are, on average, smarter than those without one. Of course, this is a so-so criterion, because almost everyone has a higher education. So we can instead take more or less elite educational institutions, but ones with a sufficiently diverse range of specializations. We will therefore do the following: for St. Petersburg, we will look at the distribution of names among students of St. Petersburg State University, and for Moscow, among students of MSU. This is again a speculative assumption, but on average it is quite workable for our purposes.


Concretely, we will find those with a given name who studied at St. Petersburg State University or Moscow State University and divide their number by the number of all people with that name in the corresponding city. Strictly speaking, the name Leila should be removed, since it has some "regional specifics", but for completeness we will not touch anything.


Let's see what came out and compare it with the tables for St. Petersburg and Moscow that I made for the previous article:






Here, p = edu / all, i.e. the proportion of girls with a given name (according to VKontakte statistics) who studied or are studying at St. Petersburg State University among all people with the same name in St. Petersburg.
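As a minimal sketch of this calculation, assuming the counts have already been collected into two dictionaries keyed by name: the function `education_share`, the dictionaries `edu_counts` and `all_counts`, and all of the numbers below are hypothetical placeholders, not VK data.

```python
def education_share(edu_counts, all_counts):
    """Return p = edu / all for every name present in both mappings."""
    shares = {}
    for name, total in all_counts.items():
        # Skip names with no university count or an empty denominator.
        if total > 0 and name in edu_counts:
            shares[name] = edu_counts[name] / total
    return shares


# Made-up counts for illustration only.
all_counts = {"Name A": 2000, "Name B": 500, "Name C": 1000}
edu_counts = {"Name A": 100, "Name B": 50, "Name C": 20}

p = education_share(edu_counts, all_counts)
# Names sorted by p in descending order, as a table sorted by p would be.
ranking = sorted(p, key=p.get, reverse=True)
print(ranking)
```

Dividing by the number of people with the name in the same city is what makes rare and common names comparable.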


Now the same for MSU:






Let's take another look at the tables from the previous article for comparison. Here is the distribution for St. Petersburg (q is a unified indicator of "loneliness"; the full notation can be found in the first part of the article).


Statistics on St. Petersburg



For Moscow, the distribution is as follows:


Statistics in Moscow



It can be seen that at least the upper and lower parts of the table more or less coincide when sorted by p and by q; the middle is slightly shuffled, but there are no significant rearrangements between the parts. In the case of the name Inessa there is some discrepancy; for an accurate analysis, the names Inna and Inessa would have to be separated and the details of the distribution in Moscow and St. Petersburg examined. We will not do this here and will confine ourselves to a qualitative assessment. To do this, we plot the "dependence" of q on p for St. Petersburg:




Now the same plot for MSU:




That is, it turns out that more intelligent and better-educated girls are more often alone. All of this is, of course, conditional, and it is possible, for example, that it only reflects later marriage.
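The article stays with a qualitative assessment, but one possible way to put a number on how well the orderings by p and by q agree is Spearman's rank correlation. This is a technique swapped in here for illustration, not something computed in the article. The sketch below uses only the standard library; the function names and the values of `p_values` and `q_values` are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only; one entry per name, same order.
p_values = [0.02, 0.05, 0.04, 0.09, 0.12]
q_values = [0.30, 0.41, 0.45, 0.52, 0.60]
rho = spearman_rho(p_values, q_values)
print(f"Spearman rho = {rho:.3f}")
```

A value close to 1 would mean the two sortings place the names in nearly the same order; it would still say nothing about causation.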


University Ranking


In fact, if there is a correlation between loneliness and good education, then loneliness can probably be treated as a measure of the quality of education and of intelligence (in a probabilistic sense, of course). So I took some good universities that came to mind right away (and that I managed, with some difficulty, to find in VK search) and computed for them the same q, u and v indicators that I computed for a variety of names in the previous article. As with the names, I sorted by q (as an additional parameter, I considered gender diversity d = all / (all + all_m), where all_m is the number of young men at the university):


Loneliness rating


Doesn't this remind you of anything? That's right, if you google the university rankings, you can find the following (this is the top of the national rankings):


National University Ranking


Anyone who wants to see the full rating can go here: National Ranking of Universities 2017. Of course, not all universities are in my table, and for universities with a low rating this does not work so well (for example, for the Herzen State Pedagogical University), but it certainly makes you wonder.


Instead of a conclusion


It is difficult to say how much closer we have come to understanding what is going on. However, the correlation between education and loneliness no longer looks as crazy as the correlation between name and loneliness.


Here I used Odnoklassniki data only to check the statistical significance of the results of the previous article; everything else was built entirely on VK data.

Source: https://habr.com/ru/post/337368/

