
How to make data talk

Whenever Google Analytics or Yandex.Metrica announces a new report, metric or interface update and the whole community rejoices, I feel a little dizzy. And not from joy. For me it is a signal that soon, instead of working on the quality of the product, we will be busy exploring analytics systems. The pursuit of data has pushed the desire for quality analysis to the sidelines. Accuracy has become more important than the trend, and sites now carry 3-5 counters from different analytics systems.



Can there ever be too much data? There can. Recall the data paradox that Avinash Kaushik formulated so well: a lack of data does not allow us to make decisions, but an abundance of data does not give us an understanding of what is happening.
So is it time to start looking for answers? I will describe a universal method that helps me draw conclusions and also makes working with information a real pleasure. So that people far from internet marketing and web analytics do not get bored, I took the example topic from our everyday reality.


The main stages of working with data


Working with data consists of several stages, but there is no need to follow a strict sequence: every now and then we have to go back to earlier stages or jump ahead.

1. Preparation
- Formulating the questions.
- Selecting the source.
- Data collection.
- Data exploration.
- Data cleaning and stating assumptions.

2. Analysis
- Search for answers to the questions posed.
- Search for patterns.
- Search for dependencies.

3. Demonstration of the result
- Data visualization.
- Presentation of conclusions and answers.

Go!

Preparation


Question wording


Data is a trap for the mind. It lures you with numbers deep into the forest and can easily lead you astray. To keep from drifting away from the goal, ask the question you want answered. Formulate it in free form and write it down on paper. Let it be a simple question such as “Does my site sell well or badly?” or “Where have the customers on the site gone?”. Then break the general question into sub-questions and add them to the list. For example, for the question about sales on the site a relevant sub-question would be: which products sell well and which sell badly? Do not forget to leave some empty space on the sheet of paper: you may well want to add to the list at later stages.

My question:
What has Russia's foreign policy been like in recent years?
(I did warn you that the data would come from real life.)

Sub-questions:


I am interested in foreign policy processes after the Munich Conference, from February 2007 to September 2014. The questions are formulated; now we move on to finding a source.

Select data source


The key requirement for a source: the data it contains must be relevant and homogeneous.

Relevant means that the data contain the necessary and sufficient minimum of information to answer the questions posed and are as close as possible to the original source.

In historical science there is a whole discipline called source studies. It deals with the classification and analysis of sources and distinguishes between primary and secondary sources. To obtain the most reliable results, it is important to use primary sources: first-hand accounts that have not been processed by a third party. For example, Wikipedia data on foreign policy events is not a primary source. A primary source could be the minutes of meetings between heads of state, with the dates of the meetings and the lists of participants.

The second requirement is homogeneity. The data must share common properties whose nature does not change across the entire set of objects. In other words, the data must be qualitatively uniform in composition. It is incorrect to compare or add up metrics from Yandex.Metrica and Google Analytics, since the two systems may process data differently. Yet I often see exactly that being done.

Let's return to foreign policy. As the data source I took the official reports on significant foreign policy events involving Russia from the site kremlin.ru. Although official press releases are not primary sources, we can use them in our work: they are as close as possible to the original source. On the one hand, the publications reflect the quality of work of the content manager and the Kremlin's PR services; on the other hand, they are directly tied to the events taking place.

1) Data from the archive in the Foreign Policy section
2) News on the "foreign policy" tag (from 08/05/2008 to 10/14/2014)

Jumping ahead, I will say that I had to drop the first source. The archive stopped being updated in September 2009, and, moreover, the two sources used different principles for describing news.

After we have decided on the source, we proceed to the most difficult and important part of the work: data collection.

Collection, exploration, cleaning, assumptions


I asked the programmer to parse the sections of the site into a CSV table, so that later it would be convenient to work with the records in Excel. You are free to choose any data analysis tools convenient for you.

An important detail: it is necessary to use the relational model of data organization.

Simply put, each new record should go on its own row, attributes should go in columns, and every value in a column should be of the same data type (date, text, number, etc.). After all, we are striving to build a homogeneous, high-quality database.

In my example, a record is a unique publication about a foreign policy event. In Excel, it is a row with the attributes: event date, event type, event participant or participants.
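As a rough illustration of that layout, here is a minimal Python sketch that reads such a table and coerces every column to a single type. The column names (`date`, `headline`, `participants`) and the two sample rows are invented for the example; they are not taken from the actual export.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical sample; the real file's columns and values may differ.
SAMPLE_CSV = """date,headline,participants
2010-03-01,Meeting with the president of Country A,Country A
2014-08-15,Telephone conversation with the leader of Country B,Country B
"""


@dataclass
class Publication:
    event_date: date   # always a date, never free text
    headline: str
    participants: str


def load_publications(text: str) -> list[Publication]:
    """Turn each CSV line into one record with consistently typed attributes."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Publication(
            event_date=datetime.strptime(row["date"], "%Y-%m-%d").date(),
            headline=row["headline"].strip(),
            participants=row["participants"].strip(),
        )
        for row in reader
    ]


publications = load_publications(SAMPLE_CSV)
```

A row whose date does not parse raises an error right away, which is one way to notice values that would break the homogeneity of a column.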

Parsing the two sections was not easy: the site returned a 402 Payment Required error, 6 objects were lost somewhere, and we ended up with about 3,500 records. While a loss of 0.18% of the data is acceptable, the fact that we had two tables from different sources with different attributes on our hands could not be ignored. Combining them would have violated the principle of data homogeneity, so I additionally compared the overlapping periods of both sources and finally decided to drop the first one. In the end, we had 3326 event records for the period from 08/08/2008 to 10/14/2014.

Now the collected data must be explored. Excel has simple and convenient tools for this: grouping, filters, sorting and pivot tables, which are quite enough for most tasks. I looked through the cells with interest and noticed recurring event names in the news headlines. Publications about meetings, telephone conversations, document signings and ceremonies appeared with enviable regularity. The records were asking for a new attribute, “event type”, so I created another column and filled it with the appropriate values.
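The article fills this column by hand in Excel. For readers who prefer a script, a keyword-based version might look like the sketch below. The rule list, the category names and the sample headlines are all assumptions made for illustration, and anything that matches no rule is left as “Other” for manual review.

```python
# Hypothetical keyword rules; the first matching rule wins, so the more
# specific phrase ("telephone conversation") is listed before "meeting".
EVENT_TYPE_RULES = [
    ("telephone conversation", "Telephone conversation"),
    ("meeting", "Meeting"),
    ("signing", "Signing of documents"),
    ("ceremony", "Ceremony"),
]


def classify_event(headline: str) -> str:
    """Return the event type for a headline, or "Other" if no rule matches."""
    text = headline.lower()
    for keyword, event_type in EVENT_TYPE_RULES:
        if keyword in text:
            return event_type
    return "Other"  # left for manual review


# Invented headlines, used only to show the function at work.
headlines = [
    "Telephone conversation with the leader of Country A",
    "Beginning of the meeting with the president of Country B",
    "Press statement following talks",
]
event_types = [classify_event(h) for h in headlines]
```

Writing the rules down in one place also documents the assumptions, so the same interpretation is applied to every record.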



It is important to note that not all events could be interpreted unambiguously. For example, I assigned both the report about the beginning of a meeting and the report about negotiations at that meeting to the single event type “Meeting”, which means that one event could have several records in the database. The assumptions I made were written down and applied to all the data.

The study period from 08/05/2008 to 10/14/2014 covers the presidencies of V.V. Putin and D.A. Medvedev. As a reminder, the dates:

V.V. Putin - 07.05.2000 - 07.05.2008
D.A. Medvedev - 07.05.2008 - 07.05.2012
V.V. Putin - 07.05.2012 - present

This stage of the work turned out to be the longest and the most critical. I repeatedly ran the data through filters, grouped records, checked the correctness of values and data types, and finally achieved the necessary homogeneity and correctness.

Data analysis


Immediately after preparing the data, it is important to take a break and return to the beginning: to the questions we formulated. At this point our thoughts often wander far beyond the scope of the current research, so returning to the beginning is the best way not to miss what matters.

Now we are close to drawing conclusions. At the analysis stage it is important to avoid bias. You may begin the study with the desire to prove a ready-made hypothesis, but do not forget that alternatives may exist. If we set out to prove that the bounce rate has grown because of bad traffic, we will never discover that the page load speed dropped after a recent release.

Another caveat concerns the search for dependencies and patterns. We really want to know how one value affects another, because in our everyday view cause and effect go together. But social phenomena, and user behavior on a site is one of them, are characterized by a multiplicity of causes and effects. Even when we see two similarly shaped curves that reflect different features of the same phenomenon, there may be no real dependence between them. Any conclusion about a correlation between values is always probabilistic in nature.
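To make the caveat concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The two series are invented and unrelated by construction; the function name `pearson` is just a label chosen for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Invented numbers: two series that simply grow over the same five periods.
site_visits = [100, 120, 140, 160, 180]
umbrellas_sold = [10, 12, 14, 16, 18]

r = pearson(site_visits, umbrellas_sold)
print(f"r = {r:.2f}")
```

The coefficient only says that the two curves have the same shape. Whether one value actually drives the other is a separate question that the number cannot answer.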

And now we proceed to our answers to questions about foreign policy.

Demonstration of the result


How active has Russia been in foreign policy in recent years?

The largest number of foreign policy news items was published in 2010.

Which countries did Russia interact with most often? I compiled a list of the top 5 countries with the largest number of reports over the study period. We will keep our focus on these key participants in international relations. If one of them suddenly disappears from the sample at a later stage, that is a signal to check the data again or to ask a new question.



What are the most frequent types of events mentioned in the news, and are there any peculiarities or changes over the period?


The number of press releases about meetings peaked in 2010. In 2014 there was a sharp increase in the number of reports on telephone conversations.
Russian politicians began to talk more and meet less. Operational and urgent tasks require fewer ceremonies.
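The counts behind such a comparison come from a pivot table by year and event type. A scripted equivalent could be sketched as follows; the records below are invented placeholders, not the article's data.

```python
from collections import Counter
from datetime import date

# Invented (event date, event type) pairs standing in for the real table.
records = [
    (date(2010, 2, 1), "Meeting"),
    (date(2010, 6, 9), "Meeting"),
    (date(2014, 3, 2), "Telephone conversation"),
    (date(2014, 3, 9), "Telephone conversation"),
    (date(2014, 5, 20), "Meeting"),
]

# Pivot: how many records of each type fall into each year.
counts = Counter((event_date.year, event_type) for event_date, event_type in records)

for (year, event_type), n in sorted(counts.items()):
    print(year, event_type, n)
```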

It is interesting to see with which countries and organizations the number of telephone conversations increased in 2014. I selected the participants with the largest number of telephone conversations in 2014.


In 2014, we observe unique compositions of participants in telephone conversations and an increase in direct contacts with a number of countries. Of the key participants in international relations, China is not on the list; later we will find out what this may be connected with.

Let's plot the number of reports by country, taking multilateral telephone conversations into account.

There is a noticeable increase in telephone conversations with Germany, France and the USA.

What about meetings? Take the leading countries in meetings and look at the big picture.

The graph is not the most indicative, but from the data table it can be seen that as of 10/14/2014 there is not a single message about Russia's meetings with the United States and Israel.

The nature of Russia's interactions with specific countries is interesting. Let us continue with the two key event types, meetings and telephone conversations, broken down by country.

Our eastern neighbor does not like talking on the phone.


Phone calls for the current year broke all records.


The year is almost over, and not a single meeting.


Abrupt changes.


2009 was a year of complete calm. The lack of communication is likely due to the gas conflict between Russia and Ukraine in 2008–2009.

You may have noticed that the “participants” column holds several kinds of values: a single country, several countries separated by commas, or a mix of countries and organizations.
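Before counting by country, a cell like that has to be split into individual participants. A minimal sketch, assuming a comma is the only separator and using placeholder names rather than real values from the table:

```python
from collections import Counter

# Hypothetical values of the "participants" column.
participants_column = [
    "Country A",
    "Country A, Country B",
    "Country B, Organization X",
]


def split_participants(cell: str) -> list[str]:
    """Split a comma-separated cell into trimmed participant names."""
    return [name.strip() for name in cell.split(",") if name.strip()]


# Each participant is counted once per record it appears in, so
# multilateral events contribute to every country involved.
mentions = Counter(
    name
    for cell in participants_column
    for name in split_participants(cell)
)
```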

Meetings between politicians are bilateral and multilateral. It is interesting to see which countries Russia meets more often in bilateral negotiations and which in multilateral ones.

To do this, I supplemented the data with another attribute: a coefficient equal to the ratio of the total number of meetings to the number of bilateral ones. Countries below the average mostly negotiate at bilateral meetings; those above the average are actively involved in multilateral ones.
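The coefficient can be sketched in a few lines. The meeting list is invented, a meeting is treated as bilateral when it has exactly one foreign participant, and countries with no bilateral meetings are skipped to avoid dividing by zero; all three are assumptions of this example rather than details given in the article.

```python
from collections import defaultdict

# Invented meeting records: each is the list of Russia's counterparts.
meetings = [
    ["Country A"],
    ["Country A"],
    ["Country A", "Country B"],
    ["Country B", "Country C"],
    ["Country B"],
    ["Country C"],
]


def multilateral_coefficients(meetings):
    """Ratio of all meetings to bilateral meetings, per country."""
    total = defaultdict(int)
    bilateral = defaultdict(int)
    for participants in meetings:
        for country in participants:
            total[country] += 1
            if len(participants) == 1:  # Russia plus a single partner
                bilateral[country] += 1
    # Skip countries that never met bilaterally (ratio undefined).
    return {c: total[c] / bilateral[c] for c in total if bilateral[c] > 0}


coefficients = multilateral_coefficients(meetings)
average = sum(coefficients.values()) / len(coefficients)
mostly_multilateral = sorted(c for c, k in coefficients.items() if k > average)
```

A coefficient of 1 means every meeting with that country was bilateral; the further above the average it is, the larger the share of multilateral formats.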



There is nothing surprising in the fact that the CIS countries are closer to the intersection point and above average: they take part in joint forums and summits. But what is France doing in their company? I built a summary table of all events involving France for the entire period, and it turned out that France acted as a third party in the negotiations to resolve the 2008 Georgian-South Ossetian conflict.

***

Of course, a lot more of interest could be extracted from this data, but I have my answers, which means the goal has been achieved. More than that: I now always have information at hand for a deeper understanding of the current foreign policy situation. As you can see, if you stop collecting numbers and start asking specific questions, the data answers in the language of useful and interesting conclusions.

Finally, I will tell my favorite story about Avinash Kaushik's first job. The future world-class expert in web analytics joined a company where 200 reports had been set up. A month after his arrival, Avinash Kaushik turned all of them off. Two weeks passed, and no one noticed the loss.

upd. Promised Files
1. Source
2. Processing
opening password: habr2014

Source: https://habr.com/ru/post/241315/

