
Analytical data outside Wrike analytics


A data engineer in the park, awaiting tasks.


Over the years of Wrike's development, we have accumulated a lot of information about user actions. It is scattered across several databases, logs, and external services, and we analysts need to gather this data together, find patterns in it, and answer the eternal SaaS questions:



We solve most of our tasks with SQL, but querying the logs via SQL is cumbersome and slow. It works for automation or detailed analytics, but when you need a quick look at something, preparing the data takes longer than analyzing it.


When you have to look often and at a lot of things, that becomes painful. In this article we describe how to overcome the pain and how to extract the maximum benefit from the resulting data.


Our solution


We keep information in the logs about which action the user took and the characteristics of that action. If a user chooses a subscription, we log the following events:



To turn this into a SQL-friendly form, we need to come up with metrics for these logs and find the dimensions by which we will group those metrics.


We chose time and user ID as our dimensions.


Time, because we track how metric values change over time.
User ID, because it appears in almost all logs, and information is easy to aggregate by it. If we sum the metrics of all users in an account, we get account activity.
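The account roll-up described above can be sketched in a few lines of Pandas; the column names and numbers here are purely illustrative, not Wrike's actual schema:

```python
import pandas as pd

# Hypothetical user-level metrics; in reality these come from log processing.
user_metrics = pd.DataFrame({
    "account_id":    [1, 1, 2],
    "user_id":       [10, 11, 20],
    "tasks_created": [3, 5, 2],
    "logins":        [7, 1, 4],
})

# Summing every user's metrics within an account gives account activity.
account_activity = (
    user_metrics
    .drop(columns="user_id")
    .groupby("account_id", as_index=False)
    .sum()
)
print(account_activity)
```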


Metrics are the questions we ask to extract information from the logs. Defining them is harder. Depending on what we are interested in, these questions change:



If we study the structure of the application in detail, ask all the important questions, and extract the answers from the logs, we get the following structure:


user_id|date|   ?|     ?|... 

But it's impossible to guess all the questions right away, so it's important to keep the ability to add new metrics and recalculate old ones.


We combine all the metrics, group them by user ID, and enrich them with dimensions: country, user locale, how many people are in the user's account, whether that account pays us money. Then we load the result into an ORC table in PostgreSQL.


The structure of this table is as follows:


 user_id|date|metric1|metric2|metric3|metricN|dimension1|dimension2|dimensionN 

Behind this table is an infrastructure that helps analysts get rid of routine tasks and better understand the product. We call this internal product the "user data mart".


How does this help analysts?


We gathered the scripts for processing logs and backend tables in one place. Previously, analysts kept them themselves, shared scripts manually, and copied and adapted them anew each time.


When we built the data mart, we wanted it to be extensible, easy for analysts to use, and to have thoughtful naming. We wanted all the metrics we create to be accessible both to analysts, in PostgreSQL, and to those who want to analyze the data on their own, through a dashboard. And most of all we wanted to collect enough data to make the product smarter.


The data mart is extensible


To make the data mart extensible, we built a system of modules.
A module is Python code that describes how to get the following structure from the input data:


 (user_id, {metric_1_key: metric_1_value, metric_2_key: metric_2_value}) 

In addition to this script, in the module we declare dependencies and write default values and descriptions for metrics before and after aggregation.
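A minimal sketch of what such a module might look like. The function name, the declared fields, and the log format are our own illustration, not Wrike's actual internal API:

```python
from typing import Dict, Iterable, Iterator, Tuple

# Declared alongside the computation: dependencies, defaults, descriptions.
DEPENDS_ON = ["task_event_log"]
DEFAULTS = {"act__task_created__ws__x__cnt__ev__x": 0}
DESCRIPTIONS = {
    "act__task_created__ws__x__cnt__ev__x": "Number of tasks the user created in the workspace",
}

def compute(rows: Iterable[dict]) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Turn raw log rows into (user_id, {metric_key: metric_value}) pairs."""
    counts: Dict[int, int] = {}
    for row in rows:
        if row.get("event") == "task_created":
            counts[row["user_id"]] = counts.get(row["user_id"], 0) + 1
    for user_id, n in counts.items():
        yield (user_id, {"act__task_created__ws__x__cnt__ev__x": n})
```

If `compute` raises, the pipeline can fall back to `DEFAULTS` for every user and record the failure, which is what keeps one broken module from affecting the rest of the system.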


We compute modules independently of each other; if an error occurs in a module, we fill that module's metrics with default values and record in a dedicated table that the module failed. An error in a single module does not affect the system as a whole.


We try to organize modules according to the Single Responsibility principle: only code that changes for the same reason belongs in one module. A module can work with different logs and databases. The module receives immutable data structures as input, so it cannot corrupt the data, and, as long as it does not greatly affect performance, anything can happen inside it.


How exactly we develop these modules and what difficulties we encountered, I will cover in the next article.


The data mart is easy for analysts


We wanted anyone with Python or SQL skills to be able to add several new metrics to the data mart, even if they are not an analyst but a developer implementing a logging specification or a tester who wants to understand the product better.


From this point of view, Python seemed the ideal solution: a simple programming language, plus Pandas with its data frames and pandasql. It seemed ideal because not all analysts know and love Python, but everyone knows how to work with SQL or data frames.


True, Pandas worked terribly slowly on large amounts of data, and as the number of modules increased, the execution time grew dramatically. We have since switched to Spark data frames and Spark SQL and got rid of unnecessary data deserialization; now new modules slow the data mart down far less, and we even know how to optimize them further.


Thoughtful naming in the data mart


We wanted metric names to stay unique even among similar metrics, even when there are many of them. We now have 750 metrics and add about 50 new ones every month, and so far we have never had a name collision.


The name of the metric consists of seven parts:


  1. entity - the entity this metric belongs to:
    act (activity), search (search), i (integration).
  2. event - what happened?
    btn_clckd (clicked a button), dashb_open (opened a dashboard).
  3. source - in which application?
    ws (workspace, our web application), andr (Android), ios ('nuff said), x (doesn't matter).
  4. path - in which part of the application did the action occur?
    For example, editing a task could happen after a search (search) or on dashboards (dbord).
  5. measure - which operation do we use for aggregation?
    sum, flg, str (string concatenation), json (describe the events as JSON).
  6. unit - what is the unit of measurement?
    we can count the number of events ev, tasks t, or users usrs.
  7. details - anything else?
    If the same events can be counted differently or read from different logs, we indicate it here.

In practice, metric names look like this:


entity__event__source__path__measure__unit__details - we separate the components with two underscores and the words within a component with one. If a part of the description does not apply to the metric, we put x. For example:


view__reports_open__ws__x__cnt__ev__x - the number of events of opening the Reports view from the web application (easier to read from the end).


act__assignment__x__user__flg__fct__non_self_assignment is a fact metric: if the user assigned a task to another user, we write 1, otherwise 0. Such metrics aggregate well by account or by dimensions.


Don't be afraid of the abbreviations: we have a special dictionary that decodes them, and people work with human-readable names.
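The seven-part scheme is easy to build and parse mechanically; here is a hypothetical pair of helpers (our own sketch, not Wrike's code):

```python
# The seven components of a metric name, in order.
PARTS = ("entity", "event", "source", "path", "measure", "unit", "details")

def build_metric_name(**kwargs: str) -> str:
    """Join the seven parts with double underscores, using 'x' for parts
    that do not apply to the metric."""
    return "__".join(kwargs.get(p, "x") for p in PARTS)

def parse_metric_name(name: str) -> dict:
    """Split a metric name back into its seven named parts. Works because
    components are separated by '__' and words within one by '_'."""
    values = name.split("__")
    assert len(values) == len(PARTS), f"expected 7 parts, got {len(values)}"
    return dict(zip(PARTS, values))

name = build_metric_name(entity="view", event="reports_open",
                         source="ws", measure="cnt", unit="ev")
# name == "view__reports_open__ws__x__cnt__ev__x"
```

Keeping the parsing rule trivial is what lets a "decoding dictionary" map the abbreviated components back to human-readable names.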


We did not arrive at this naming right away. At first we tried to build the data mart with fewer components, but we got it wrong and had to rename a huge number of metrics by hand. Those changes affected most analysts, a lot of time was wasted, and we dealt with the consequences for a long time afterwards.


And how are you doing?


We now have 22 modules, written by 7 people (more than twice the size of the team that maintains the data mart).


Not a single release of a big feature goes by without adding a module to the data mart, and we get more and more benefit from this data.


How does this help our colleagues?


We collected a lot of useful data in one place and wanted to give interested people access to it without going through analysts. To do this, we made an automatically updated dashboard in Tableau Online. In it, people unfamiliar with programming can aggregate metrics, slice them by dimensions, and answer questions about product usage, for example:


"How many paid users registered from Android?" - look at the number of registrations on Android and filter out free users.


"I found a bug in IE; how many users might have encountered it?" - a separate dashboard shows how many people use each version of IE.


"How many times was this button clicked?" - ask for the button to be tracked and add the event to the data mart.


To implement these dashboards we chose Tableau. We already use it actively for analytics dashboards and decided it would suit this task as well. In it we built a dashboard that automatically pulls part of the metrics from the user data mart.


We keep telling our colleagues about this dashboard and helping them see that analytics is simple: if someone has a product hypothesis, they can test it quantitatively in this dashboard.


How do we work with the dashboard?


We could not use the database schema in the dashboard in its original form, because:



We solved these problems with EAV (entity-attribute-value), that is, we turned the table into the form:



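The wide-to-EAV unpivot is a standard `melt`; a sketch with illustrative column names:

```python
import pandas as pd

# A wide row per (user, date) with one column per metric.
wide = pd.DataFrame({
    "user_id": [1, 2],
    "date":    ["2017-01-01", "2017-01-01"],
    "metric1": [5, 0],
    "metric2": [1, 3],
})

# EAV form: one row per (user, date, metric) instead of one column per metric.
eav = wide.melt(
    id_vars=["user_id", "date"],
    var_name="metric_name",
    value_name="metric_value",
)
print(eav)
```

In EAV form, adding a new metric adds rows rather than columns, so the dashboard schema never has to change.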
To be able to filter and build charts in the dashboard, we precomputed aggregated metrics for all combinations of dimension values. If we have two dimensions, country (Russia or United States) and account type (paid or free), we aggregate the metrics for all combinations of these values and get 4 rows per metric:



Even if a combination yields zeros, we do not skip it; otherwise it would be impossible to build a chart, or the chart would have gaps.
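One way to keep the empty combinations is to join the aggregates against the full grid of dimension values; a sketch with made-up data:

```python
from itertools import product

import pandas as pd

data = pd.DataFrame({
    "country":      ["RU", "US", "US"],
    "account_type": ["paid", "paid", "free"],
    "metric":       [10, 7, 2],
})

dims = {"country": ["RU", "US"], "account_type": ["paid", "free"]}

# Full grid of dimension combinations, including ones absent from the data.
grid = pd.DataFrame(product(*dims.values()), columns=list(dims))

aggregated = (
    grid.merge(
        data.groupby(["country", "account_type"], as_index=False)["metric"].sum(),
        how="left",  # keep grid rows with no matching data
    )
    .fillna({"metric": 0})
)
# 4 rows: (RU, paid)=10, (RU, free)=0, (US, paid)=7, (US, free)=2
```

The left join against the grid is what guarantees the zero row for (RU, free) instead of a gap in the chart.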


But we want to filter by more than two dimensions. Because of this, our dashboard tables are huge, and the dashboards are quite slow.


Moreover, to transfer data to Tableau Online you need to turn a Spark DataFrame into a Tableau Data Extract using a library with outdated documentation, and you can send data to the Tableau Online server only from Windows. But these are minor things: we set it up once and forgot about it.


What did this give the company?


The coolest thing about this data storage format is that the data can be analyzed automatically.


We look for anomalies in these metrics. This lets us monitor the use of all features every day, even those that are no longer interesting to analysts and product managers and have not been updated for a long time.
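The article does not describe the actual anomaly-detection pipeline, but a minimal version of the idea can be sketched with a simple z-score rule over a metric's daily series (threshold and data are illustrative):

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return indices of days whose value deviates from the series mean
    by more than `threshold` population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:  # constant series: nothing can be anomalous
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Daily event counts for one metric; day 5 is a suspicious spike.
daily_clicks = [100, 98, 103, 99, 101, 430, 97]
print(find_anomalies(daily_clicks))
```

A real system would use something more robust (seasonality, per-metric baselines), but even this crude rule, run daily over 750 metrics, flags feature usage changes nobody is watching by hand.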


All the benefits can be traced in our latest study:
If we predict user churn from this data, we can understand which factors influence it.
If we understand which factors influence churn, we can better prioritize accounts that are not getting the full benefit of the product and need support.
If the model identifies such accounts accurately, we can make support cheap or even free; that is, we increase the value of the product and free our colleagues from routine.
If we keep making expensive things cheap, we will end up with an intelligent product that adapts itself to the client's tasks.


We have many plans for how to use this data:



In the next article I will explain how it all works, dive deeper into the trade-offs in developing such systems, and share the pitfalls and successful solutions we collected along the way.


The picture at the beginning is a frame from the movie "Her" by Spike Jonze.



Source: https://habr.com/ru/post/322366/

