📜 ⬆️ ⬇️

We consider the statistics on the experiments on hh.ru

Hello!

Today I will tell you how we at hh.ru consider manual statistics on experiments. We'll see where the data comes from, how we process it, and what pitfalls we encounter. In the article I will share the general architecture and approach, the actual scripts and code will be at a minimum. The main audience is novice analysts who are interested in how the data analysis infrastructure in hh.ru is arranged. If this topic is interesting - write in the comments, we can delve into the code in the following articles.

How automatic metrics for A / B experiments are considered can be found in our other article .
')
image

What data we analyze and where they come from


We analyze access logs and any custom logs that we write ourselves.

95.108.213.12 - - [13 / Aug / 2018: 04: 00: 02 +0300] 200 "GET / employer / 2574971 HTTP / 1.1" 12012 "-" Mozilla / 5.0 (compatible; YandexBot / 3.0; + http: / /yandex.com/bots) "" - "" gardabani.headhunter.ge "" 0.063 "" - "" 1534122002.858 "" - "" 192.168.2.38:1500 "" [0.064] "{15341220027959c8c01c51a6e01b682f 200 https 1 -" - "- - [35827] [0.000 0]
178.23.230.16 - - [13 / Aug / 2018: 04: 00: 02 +0300] 200 "GET / vacancy / 24266672 HTTP / 1.1" 24229 " hh.ru/vacancy/24007186?query=bmw " "Mozilla / 5.0 ( Macintosh; Intel Mac OS X 10_10_5) AppleWebKit / 603.3.8 (KHTML, like Gecko) Version / 10.1.2 Safari / 603.3.8 "" - "" hh.ru "" 0.210 "" last_visit = 1534111115966 :: 1534121915966; hhrole = anonymous; regions = 1; tmr_detect = 0% 7C1534121918520; total_searches = 3; unique_banner_user = 1534121429.273825242076558 "" 1534122002.859 "" - "" 192.168.2.239:1500 "" [0.208] "{1534122002649b7eef2e901d8c9c0469} 200 https 1 -" - "- - [35927] [0.001]

In our architecture, each service writes logs locally, and then through a samopisny client-server logs (including the nginx access logs) are collected at the central repository (further logging). Developers have access to this machine and can manually log logs if necessary. But how in a reasonable time a few hundred gigabytes of logs? Of course, pour them into hadoop!

Where does the data appear in hadoop?


Not only service logs are stored in hadoop, but also prod-database unloading. Every day in hadoop, we unload some of the tables that are needed for analytics.

Logs of services get into hadoop in three ways.

  1. The path to the forehead - cron is started from the log repository at night, and rsync uploads raw logs to hdfs.
  2. The path is fashionable - logs from services flow not only into the general storage, but also into kafka, from where they are read by flume, preprocessing and stored in hdfs.
  3. The path is old-fashioned - in the days before kafka we wrote our service, which reads raw logs from the repository, makes preprocessing and fills in hdfs.

Consider each approach in more detail.

Way to the forehead


Cron runs a regular bash script.

#!/bin/bash LOGGING_DATE_PATH_PART=$(date -d yesterday +\%Y/\%m/\%d) HADOOP_DATE_PATH_PART=$(date -d yesterday +year=\%Y/month=\%m/day=\%d) ls /logging/java/${LOGGING_DATE_PATH_PART}/hh-banner-sync/banner-versions*.log | while read source_filename; do dest_filename=$(basename "$source_filename") /usr/bin/rsync --no-relative --no-implied-dirs --bwlimit=12288 ${source_filename} rsync://hadoop2.hhnet.ru/hdfs-raw/banner-versions/${HADOOP_DATE_PATH_PART}/${dest_filename}; done 

As we remember, in the log repository all the logs are in the form of regular files, the folder structure is something like this: /logging/java/2018/08/10/ Your_server_name_by/*.log

Hadoop stores its files in roughly the same folder structure hdfs-raw / banner-versions / year = 2018 / month = 08 / day = 10
year, month, day we use as partitions.

Thus, we only need to form the correct paths (lines 3-4), then select all the necessary logs (line 6) and use rsync to fill them in hadoop (line 8).

Advantages of this approach:


Minuses:


Fashionable way


Since we upload logs to the repository with a self-written script, it was logical to fasten the ability to upload them not only to the server, but also to kafka.

pros


Minuses


The path is old fashioned


It differs from the fashionable only in the absence of kafka. Therefore, it inherits all the minuses and only some advantages of the previous approach. A separate service (ustats-uploader) on java periodically reads the necessary files, makes them preprocessing and uploads into hadoop.

pros


Minuses


And the data got into hadoop and is ready for analysis. Let's stop a bit and remember what hadoop is and why hundreds of gigabytes can be stuck on it much faster than the usual grep.

Hadoop


Hadoop is a distributed data repository. The data does not lie on any separate server, but is distributed among several machines, and is also stored not in one instance, but in several - this is done to ensure reliability. The basis of the speed of data processing lies in the change in approach in comparison with conventional databases.

In the case of a normal database, we retrieve data from it and send it to the client, who does some analysis and returns the result to the analytics. Thus, in order to read more quickly, we need to have many clients and parallel requests (for example, divide the data by months - and each client can read the data for his month).

Hadoop is the opposite. We send the code (what exactly we want to count) to the data, and this code is executed on the cluster. As we know, data rests on many machines, so each machine executes code only on its data and returns the result to the client.

Many have probably heard about map-reduce , but writing code for analytics is not very convenient and fast, while writing in SQL is much easier. Therefore, there are services that can turn SQL into map-reduce transparently for the user, and the analyst may not even be aware of how his request is actually considered.

In hh.ru for this we use hive and presto. Hive was the first, but we are gradually moving to presto, since it is much faster for our requests. We use hue and zeppelin as GUI.

It is more convenient for me to consider analytics on python in jupyter, this allows you to read it with one click and to get properly styled excel tables at the output, which saves time. Write in the comments, this topic pulls on a separate article.

Let's return to the analyst itself.

How to understand what we want to count?


A product manager came in with the task to calculate the results of an experiment.


We send an email-newsletter, in which we send suitable vacancies for the applicant (do everyone like such newsletters?). We decided to slightly change the design of the letter and we want to understand whether it became better. For this we will consider:


Let me remind you that all that we have is access log and base. We need to formulate our metrics in terms of linking.

Number of transitions to a job from a letter


Transition is a GET request on hh.ru/vacancy/26646861 . To understand where the transition came from, we add utm tags of the form? Utm_source = email_campaign_123. For GET requests in access log, there will be information about the parameters, and we can filter the transitions only from our mailing list.

The number of responses after the transition


Here we could just count the number of responses to vacancies from the newsletter, but then the statistics would be wrong, because the responses could have been affected by something else besides our letter, for example, an advertisement in ClickMe was bought on the vacancy, and therefore the number of responses has grown a lot.

We have two options for how to formulate the number of responses:

  1. The response is POST on hh.ru/applicant/vacancy_response/popup?vacancy_id=26646861 , which has referer hh.ru/vacancy/26646861?utm_source=email_campaign_123 .
  2. The nuance of this approach is that if the user switched to a vacancy, and then walked around the site a bit and then responded to the vacancy, we will not deduct it.
  3. We can remember the id of the user who switched to hh.ru/vacancy/26646861 , and calculate the number of reviews per vacancy during the day.

The choice of approach is determined by business requirements, usually the first option is enough, but it all depends on what the product manager is waiting for.

Pitfalls that may occur


  1. Not all data is in hadoop, you need to add data from the prod-database. For example, in logs usually only id, and if you need a name, then it is in the database. Sometimes you need to search for user by resume_id, and this is also stored in the database. To do this, we unload the part of the base in hadoop, so that join is simpler.
  2. Data may be curves. This is generally the trouble of hadoop and how we load data into it. Depending on the data, an empty value can be null, None, none, an empty string, etc. You need to be careful in each case, because the data is really different, loaded in different ways and for different purposes.
  3. Long count for the entire period. For example, we need to count our transitions and responses for the month. These are about 3 terabytes of logs. Even hadoop will take it for quite a while. Usually it is quite difficult to write a 100% work request the first time, so we write it by trial and error. Each time to wait for 20 minutes is very long. Solutions:

    • Debugging request for logs for 1 day. Since the data in hadoop is partitioned by us, it’s pretty quick to calculate something for 1 day of logs.
    • Download the necessary logs to the temporary table. As a rule, we understand which URLs are interesting to us, and we can make a temporary table for the logs from these URLs.

    Personally, I prefer the first option, but sometimes it’s necessary to make a temporary table, depending on the situation.
  4. Distortions in summary metrics
    • It is better to filter the logs. You need to pay attention, for example, to the response code, redirect, etc. Better is less data, but more accurate, of which you are sure.
    • As few as possible intermediate steps in the metric. For example, a transition to a job is one step (GET request to / vacancy / 123). Response - two (go to the job + POST). The shorter the chain, the less errors and the more accurate the metric. Sometimes it happens that the data between the transitions are lost and count something is impossible at all. To solve this problem, it is necessary before the development of the experiment to think about what we will consider and how. Extremely helps your individual log of necessary events. We are able to shoot the necessary events, and thus the chain of events will be more accurate, and it will be easier to count.
    • Bots can generate a bunch of transitions. You need to understand where bots can go (for example, on pages that require authorization, they should not be), and filter this data.
    • Big bumps - for example, in one of the groups there may be one applicant who generates 50% of all responses. There will be a bias of statistics, such data also needs to be filtered.
  5. It is difficult to formulate what to consider in terms of access log. Here helps knowledge of the code base, experience and chrome dev tools. We read the description of the metric from the product, we repeat with our hands on the site and see which transitions are generated.

Finally, let's talk about how the result of calculations should look.

Counting result


In our example, there are 2 groups and 2 metrics that form a funnel.
image
Recommendations for the design of the results:

  1. Do not overload with parts until it is required. Simply and less is better (for example, here we could show each vacancy separately or clicks by day). Focus on one thing.
  2. Details may be needed in the process of demo results, so think about what questions you may ask, and prepare the details. (In our example, the detailing can be on the speed of transition after sending an email - 1 day, 3 days, week, grouping of vacancies by trade area)
  3. Remember about statistical significance. For example, a change of 1% with the number of transitions of 100 and clicks of 15 is insignificant and could be accidental. Use calculators
  4. Automate as much as possible, because you’ll have to count several times. Usually in the middle of an experiment you already want to understand how things are going. After the experiment, questions may arise and something will have to be clarified. Thus, it will be necessary to count 3-4 times, and if each calculation is a sequence of 10 requests and then manual copying to excel, it will be painful and spend a lot of time. Learn python, it will save a lot of time.
  5. Use a graphical representation of the results when justified. The built-in tools hive and zeppelin allow out of the box to build simple graphics.

It is necessary to count various metrics quite often, because we are releasing practically every task in the framework of the A / B experiment. There is nothing complicated in the calculations, after 2-3 experiments there comes an understanding of how to do it. Remember that access logs store a lot of useful information that can save a company money, help you promote your idea and prove which change option is better. The main thing - to be able to get this information.

Source: https://habr.com/ru/post/424251/


All Articles