In the last article on Internet intelligence, we briefly discussed how to collect and analyze data about a specific person. Since the topic attracted a lot of interest, we continue that work here and look at how to collect information about a whole group of users.
Consider the following situation: a new Habr user, whose karma went negative before his first post or comment on the site, decides to find out who is hiding behind the elaborately invented nicknames of Habr users and asks: who is Mr./Ms. Habraman?
We will follow the intelligence cycle:
Step 1. Task setting
The task can be formulated as follows: draw a portrait of the average Habr user.
Step 2. Planning
2.1 Hypotheses
To gather information effectively, we need working hypotheses; they tell us what data we need and where to find it. Formulating hypotheses often requires some preliminary knowledge of the subject.
For preliminary information about Habr's users, our inexperienced user turns to the Lurkmore site, which, as we know, covers any phenomenon of the modern world from many angles. Puzzled by the negative attitude of the article's author, our Internet intelligence officer formulates the following hypotheses about who the average Habr user really is:
- Hypothesis 1. A stern, older IT specialist living in one of our capitals, embittered by the crush in the subway or by traffic jams many kilometers long;
- Hypothesis 2. An alien stranded on our planet until he finishes repairing his ship's control system, which he hopes to do by following the guides posted on Habr by other aliens;
- Hypothesis 3. A young, educated person with a wide range of interests that go beyond computers.
Note that in formulating hypotheses we try to follow the MECE rule (mutually exclusive and collectively exhaustive).
2.2 Data and its sources
Based on the set of formulated hypotheses, we can now determine what data we need to be able to verify them.
Accordingly, we are interested in the following data:
- Interests (hobbies, movies, books, etc.);
- Sex;
- Age;
- Location (city, country).
A good source of such information, as we know, is social networks, in particular the Russian network VKontakte, where our hero finds the Habrahabr group with almost 670 thousand subscribers and a motto that indirectly confirms one of the formulated hypotheses: “UFO is with you!”.
Step 3. Data collection
3.1 Tools
To access the data of the group's subscribers on VKontakte, we will need:
- An access token for the network;
- R, a programming language for statistical data processing and visualization;
- A library for accessing the VK API: vkR;
- A development environment for R: RStudio;
- MS Excel and LibreOffice.
3.2 Accessing Data
To download data about users of the VKontakte social network, you need to use the VK API. Thanks to Dmitry Sorokin, who created the vkR library, working with the VK API from R is now within reach of any confident (and not so confident) user.
To access the VK API, you first need to generate a so-called access token.
3.3 Data Collection Process
Since readers showed great interest in the practical side of data collection in the comments to the previous article, we give an example of R code here that lets you fully reproduce the steps described below.
To work with the data of the Habrahabr group's users on VKontakte, you need the group's unique identifier. It is easy to find by hovering the mouse pointer over, for example, the “Write a message” button and looking at the link, which has the form “vk.com/im?sel=-20629724”. The digits 20629724 are the required ID.
    # Install the vkR package
    install.packages("vkR")
    # Load the library
    library(vkR)
    # Set the access token
    setAccessToken(access_token = "000000000..000")
    # Get the IDs of the group's members into userids
    userids <- getGroupsMembersExecute(group_id = "20629724", progress_bar = TRUE)
    # Draw a random sample of 3000 IDs
    sampleids <- sample(userids, 3000, replace = FALSE)
    # Download profile data for the sampled users
    # List of available fields: https://vk.com/page-1_27445058
    groupdata <- getUsersExecute(sampleids, fields = "sex,bdate,city,country,education,universities,relation,interests,movies,tv,books,games,about", drop = FALSE, flatten = FALSE, progress_bar = TRUE)
    # Count users by the value of the sex field
    qofmale_female <- as.data.frame(table(groupdata$sex))
    # Print the result
    qofmale_female
    #   Var1 Freq
    # 1    0    2
    # 2    1  853
    # 3    2 2145
    # Bar labels, in the order of the sex codes 0, 1, 2
    qofmale_female$name <- c("", "", "")
    # Bar colors
    qofmale_female$color <- c("#51b828", "#ff5d4e", "#2879a8")
    # Draw the bar chart
    barplot(qofmale_female$Freq, names.arg = qofmale_female$name, col = qofmale_female$color, border = NA)
    # Export to CSV for further work in MS Excel/LibreOffice
    write.csv2(qofmale_female, "C://girlsboysaliens.csv")
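The group ID described above can also be pulled out of such a link programmatically. This is only a minimal base R sketch, assuming the link has the form “vk.com/im?sel=-20629724”; `extract_group_id` is a hypothetical helper name and is not part of vkR.

```r
# Hypothetical helper: extract the numeric group ID from a link of the
# form "vk.com/im?sel=-<digits>". Returns NA if the pattern is absent.
extract_group_id <- function(link) {
  m <- regmatches(link, regexpr("sel=-[0-9]+", link))
  if (length(m) == 0) return(NA_character_)
  sub("sel=-", "", m, fixed = TRUE)
}

group_id <- extract_group_id("vk.com/im?sel=-20629724")
print(group_id)
```

The result is returned as a string, which matches how `group_id` is passed to `getGroupsMembersExecute` in the listing above.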
3.4 Sampling
Internet intelligence should be efficient, so “shoveling” through all 670,000 users is not a good idea. We need a sufficiently large random sample. Our hero likes the number 3000 and settles on it. Judging by the recommendations of sociologists, a sample of this size will be enough even if not all profiles are filled in completely.
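To get a feel for why 3000 is enough, one can estimate the margin of error for a proportion. The sketch below is an illustration and not the sociologists' own method: it assumes simple random sampling, a 95% confidence level (z = 1.96), the worst case p = 0.5 and a finite population correction. `margin_of_error` is a hypothetical helper name; 2092 is the number of sampled users who filled in the city field, as reported in section 5.2.

```r
# Margin of error for an estimated proportion, assuming simple random
# sampling from a population of size N (worst case p = 0.5, z = 1.96).
margin_of_error <- function(n, N, p = 0.5, z = 1.96) {
  fpc <- sqrt((N - n) / (N - 1))  # finite population correction
  z * sqrt(p * (1 - p) / n) * fpc
}

# Full sample of 3000 out of roughly 670,000 subscribers
moe_full <- margin_of_error(n = 3000, N = 670000)

# A field filled in by only 2092 of the sampled users
moe_partial <- margin_of_error(n = 2092, N = 670000)

print(round(100 * c(full = moe_full, partial = moe_partial), 2))
```

Under these assumptions the error stays below two percentage points for the full sample and only slightly above two for a field filled in by about two thirds of the users.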
Step 4. Data processing
Thanks to R, data processing turned out to be quite simple. The only problem our scout ran into was encoding: after the data was downloaded from VKontakte, the RStudio console did not display the text correctly. The problem was solved as follows:
- Put the data received from the social network into a variable;
- Save it to a CSV file;
- Open the file in LibreOffice and check that everything reads as it should;
- Save the CSV file with the Unicode (UTF-8) encoding;
- Load it back into R using read.csv with the UTF-8 encoding specified.
Surely there is a better solution to this problem; if you know one, share it in the comments to the article.
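One possible shortcut, offered only as a sketch: if the strings arrive as UTF-8 bytes that R has simply not marked as UTF-8 (an assumption, since the exact cause of the display problem is not known), the encoding can be declared in place with base R's `Encoding<-`, without a round trip through LibreOffice. `mark_utf8` and the sample data frame are hypothetical.

```r
# Hypothetical input: the UTF-8 bytes of a Cyrillic city name, as they
# might arrive from the API without an encoding mark.
raw_bytes <- as.raw(c(0xd0, 0x9c, 0xd0, 0xbe, 0xd1, 0x81,
                      0xd0, 0xba, 0xd0, 0xb2, 0xd0, 0xb0))
city <- rawToChar(raw_bytes)

# Declare UTF-8 for every character column of a data frame.
# Pure ASCII strings stay unmarked, which is harmless.
mark_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    Encoding(col) <- "UTF-8"
    col
  })
  df
}

profiles <- data.frame(id = 1L, city = city, stringsAsFactors = FALSE)
fixed <- mark_utf8(profiles)
print(Encoding(fixed$city))
```

This only changes how R interprets the existing bytes; it does not convert text that is actually stored in another encoding.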
Step 5. Analysis of information
We now proceed to the analysis of the data.
5.1 Guys, girls and aliens
If you look at the distribution by sex, you can see the following picture:

We see that about 2/3 are guys and 1/3 are girls, and there were only two aliens who could not decide which gender they belong to. Hypothesis 2 can be safely discarded.
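The shares follow directly from the counts printed in the listing in section 3.3. A small sketch of the arithmetic, assuming the VK coding of the sex field (0 = not specified, 1 = female, 2 = male); the confidence interval uses the usual normal approximation and is an addition for illustration.

```r
# Counts from the article's output for sex codes 0, 1, 2
sex_counts <- c(unspecified = 2, female = 853, male = 2145)

# Shares of the whole sample
shares <- prop.table(sex_counts)
print(round(100 * shares, 1))

# Approximate 95% confidence interval for the male share
n <- sum(sex_counts)
p_male <- shares[["male"]]
half_width <- 1.96 * sqrt(p_male * (1 - p_male) / n)
ci_male <- c(lower = p_male - half_width, upper = p_male + half_width)
print(round(100 * ci_male, 1))
```

Guys make up 71.5% of the sample and girls about 28.4%, which is what "about 2/3" and "1/3" round off.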
5.2 Cities and countries
To check the first hypothesis, which assumes that Habr users are mostly residents of our two capitals, let's look at the distribution by city.
A total of 2092 users from our sample filled in this field in their profile. The top 15 cities turned out to be as follows:
The two capitals take the first positions, but the total number of users who indicated Moscow or St. Petersburg is less than a quarter of 2092. Accordingly, it is hard to call Habr a purely metropolitan resource.
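A ranking like this can be produced with base R's `table`, `sort` and `head`. The sketch below runs on a small hypothetical vector of city titles, because the exact structure of the `city` field returned by vkR is not shown in the article; `top_n_values` is a hypothetical helper name.

```r
# Hypothetical city titles; NA stands for profiles without a city.
cities <- c("Moscow", "Kyiv", NA, "Moscow", "Saint Petersburg",
            "Minsk", "Moscow", NA, "Kyiv", "Saint Petersburg")

# Count values and keep the n most frequent ones.
top_n_values <- function(x, n = 15) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  head(counts, n)
}

top_cities <- top_n_values(cities, n = 3)
print(top_cities)

# Share of the two capitals among users who filled in the field
filled <- !is.na(cities)
capital_share <- sum(cities %in% c("Moscow", "Saint Petersburg")) / sum(filled)
print(capital_share)
```

The same helper could be applied to any other categorical field, for example first names.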
As for the countries in which our users live, it is clear that we are talking mainly about Russia, Ukraine, Kazakhstan and Belarus.

5.3 Age of the audience
It is interesting to look at the distribution of Habr users from the sample by year of birth: the majority of users are between 20 and 30 years old.

The age assumption from the first hypothesis can be safely ruled out, while the corresponding assumption from the third hypothesis is confirmed.
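The year of birth has to be extracted from the `bdate` field first. A minimal sketch, assuming that `bdate` comes as "D.M.YYYY" when the year is public and as "D.M" when it is hidden; `birth_year`, the sample values and the reference year 2016 are illustrative assumptions, not taken from the article.

```r
# Extract the birth year from bdate strings; NA when the year is hidden
# or the field is empty.
birth_year <- function(bdate) {
  parts <- strsplit(as.character(bdate), ".", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) == 3) as.integer(p[3]) else NA_integer_
  }, integer(1))
}

# Hypothetical sample values, not real profiles
bdates <- c("15.3.1990", "1.12", NA, "7.7.1985")
years <- birth_year(bdates)
print(years)

# Age relative to an arbitrary illustrative reference year
ref_year <- 2016L
ages <- ref_year - years
share_20_30 <- mean(ages >= 20 & ages <= 30, na.rm = TRUE)
print(share_20_30)
```

Profiles with a hidden year drop out of the age distribution, so the effective sample for this question is smaller than 3000.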
5.4 Interests
Let's see what our users are into. At a minimum, this can be done by analyzing the following profile fields: interests, movies, tv, books, games, about. In addition, you can analyze the groups our users belong to, since we have the handy getGroupsForUsers() method. Let's start with the groups. Below are the top 15 VKontakte groups that Habr users like to join.
As you can see, information technology and programming are far from the last place in the life of Habr users, but clearly no one is fixated on them.
It would make sense to build histograms for the values of the interests, movies, tv, books, games and about fields after carefully mapping what users wrote to general categories, but unfortunately not this time. Instead, let's go through a few points that caught the eye.
1. Apparently a significant part of Habr's audience does not favor TV. This is an assumption rather than a fact, since we see only about 20 such people and do not know how many others feel the same but chose not to say so.

2. Literary tastes are highly diverse and, characteristically, the works of Donald Knuth are not on the list of favorite bedtime books.

3. Interests are exceptionally diverse and, which is especially pleasing, sport is far from the last among them.
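A first step towards the histograms mentioned above could be to split the free-text fields into tokens and count them. This is only a sketch on hypothetical values: it assumes users separate items with commas or semicolons, and it does not attempt the mapping to general categories. `tokenize` is a hypothetical helper name.

```r
# Hypothetical values of the "interests" field, including empty profiles
interests <- c("programming, Sport, music", "", NA,
               "sport; travel", "Music,books")

# Split free text into lower-cased, trimmed tokens
tokenize <- function(x) {
  x <- x[!is.na(x) & nzchar(x)]
  tokens <- unlist(strsplit(x, "[,;]"))
  tokens <- tolower(trimws(tokens))
  tokens[nzchar(tokens)]
}

interest_counts <- sort(table(tokenize(interests)), decreasing = TRUE)
print(interest_counts)

# Share of profiles where the field is filled in at all
fill_rate <- mean(!is.na(interests) & nzchar(interests))
print(fill_rate)
```

The fill rate matters for the caveat in point 1: conclusions drawn from a field that few users fill in describe only those users.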

5.5 Just Curiosity: Top 25 Names
While analyzing the collected data, it became interesting to see what the most common real names are among Habr users, most of whom try to choose a mysterious nickname. The result is a table of the top 25 names in our sample:
Step 6. Report preparation and presentation of results
Summing up this small study, we can describe the portrait of the average Mr. or Ms. Habraman:
- Aged 20 to 30;
- Lives in a large city in Russia, Ukraine, Kazakhstan or Belarus;
- Leads an active lifestyle, is interested in IT, modern science and much more;
- With probability 2/3 it is a guy and with probability 1/3 a sweet girl, and definitely not an alien.
Accordingly, of the three formulated hypotheses, the third turned out to be the most likely. The first two were discarded.
Data checking
When someone presents an analysis and reports findings, a real intelligence officer always double-checks. Let's do the same and compare the results with available data from alternative sources.
Let's compare with the data presented in the "Users" section, which is available to any Habr user.
Matches:
- Total users: 739,159, while the VKontakte group has almost 670 thousand, which is the same order of magnitude.
- The order of countries by number of users almost coincides: Russia, Ukraine, Belarus, Kazakhstan, the United States.
- The order of cities: the first 5 are the same.
Discrepancies:
- According to the site itself, there are 134 thousand men, 10 thousand women and about 600 thousand “others”. Perhaps this indicates an alien invasion, but most likely users simply did not fill out their profiles completely.
- The percentages for cities differ significantly. In our sample Moscow accounts for at least 15%, whereas the site reports only 6,000 users in Moscow, which is about 1% of all registered users. Most likely this is again caused by users' reluctance to fill out their profile on the site completely.
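The site figures quoted above can be checked with a few lines of arithmetic. The sketch uses only the rounded numbers from the text; the variable names are illustrative.

```r
# Rounded figures from the site's "Users" section, as quoted above
site_total  <- 739159
site_men    <- 134000
site_women  <- 10000
site_moscow <- 6000

# Share of registered users who specified their sex at all
sex_filled_share <- (site_men + site_women) / site_total

# Share of registered users who indicated Moscow
moscow_share_site <- site_moscow / site_total

print(round(100 * c(sex_filled = sex_filled_share,
                    moscow = moscow_share_site), 1))
```

Only about a fifth of registered users specified their sex and under 1% indicated Moscow, which is consistent with the explanation that site profiles are mostly left unfilled.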
Let's also check against a similar study conducted in 2014 by the user Apatic: Who subscribed to Habrahabr?
Matches:
- 76% are men, 24% are women, which is very close to our result;
- Age distribution coincides;
- Top 4 countries in the same order.
Conclusion
So we have seen how the intelligence cycle can be used to analyze information about communities, which can be useful both for a blogger who wants to understand the audience and for a marketer analyzing who buys competitors' products.
Literature
- The art of visualization in business. How to present complex information with simple images. Nathan Yau. ISBN 978-5-91657-737-2
- A lot of numbers. Big data analysis with Excel. John Foreman. ISBN 978-5-9614-5032-3, 978-5-9614-5954-8
Update: A number of readers doubt that the “Habrahabr” group on VKontakte significantly overlaps with the user base of Habr itself. To test this, I suggest a small survey.