
Trees without leaves

Dream result


Facebook published an article about the so-called aha-moment: if a user adds 7 friends within 10 days of registration, that user stays on the service.

Why is this result (the aha-moment) so appealing?
Because it is:
a) simple
b) meaningful
c) actionable

With a result like this, the service simply needs to do everything it can to motivate users to add 7 friends in their first 10 days, and thereby move toward the higher-level goal of increasing the retention rate.

Can we repeat this?




Of course, every service wants to find its own aha-moment. And ideally it should be as simple as Facebook's: clear and concrete, without analysts' abstractions, latent features, and the like.

At Lingualeo we tried to solve this puzzle: not to copy Facebook's answer, but to come up with our own. Konstantin Tereshin describes it below (a pity he is no longer with Leo).

Formulation of the problem


One caveat up front: we wanted our aha-moment to be as simple and clear as Facebook's, and expressed in terms of specific user actions. That was the main requirement for the result.

Now to the task


In any research task, the first step is to settle on definitions: what exactly you mean by “the user stays”, “becomes loyal”, “does not churn”, and so on. Everyone has their own truth here; each service understands these terms in its own way. We introduced our own definition of loyalty and will spare you its details; what matters is how we worked with this attribute. The only thing worth noting is that loyalty is a binary variable: 0 if the user did not become loyal, 1 if they did. This is our target variable.

Now that we have decided on the target variable, it remains to formalize what it means to “find the aha-moment”.

For example, we could look for a variable and a value at which the probability of users becoming loyal is maximal.
But this is not an option, because you can “tighten the screws” indefinitely: obviously, the more a user practices (learns) words, the more likely they are to become loyal. That is not a simple, actionable result.

Or we could follow the logic of association rules and produce a list of rules of the form “variable value => loyalty event”, ranked by support.
A good option, BUT the variables we have (more on them later) often take a wide range of values, so the dictionary of events would contain tens of thousands of entries. We would not get meaningful results that way.

In the end we settled on the following logic: find the variable and the value of that variable at which the trend in the user's probability of becoming loyal changes.
For example, suppose we have a variable for which the probability of becoming loyal is distributed as follows:


The graph shows that the trend changes at a value of around 5-6; after that point, the probability of becoming loyal grows rapidly. So we need to find a variable and a value of that variable after which the probability starts growing the fastest.
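To make this concrete, here is a minimal sketch of how such a curve can be plotted. The file name and the columns some_feature and loyal are illustrative assumptions, not names from our actual dataset.

```python
# Hypothetical sketch: share of loyal users for each value of a candidate feature,
# to eyeball where the trend changes. File and column names are made up.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("users_android.csv")                    # assumed table of user features
loyal_rate = df.groupby("some_feature")["loyal"].mean()  # P(loyal) per feature value

loyal_rate.plot(marker="o")
plt.xlabel("some_feature value")
plt.ylabel("share of loyal users")
plt.show()
```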

A bit about the data


It was important for us to determine the aha-moment within the first two weeks after registration (each service chooses its own window based on the specifics of its business). So we looked closely at this period and collected about 750 variables for analysis, among them: what actions users performed, how many times, how often, what they did on each particular day after registration, and so on.

A few more notes:
We ran the analysis only for users of the Lingualeo mobile apps,
separately for Android and iOS.

As a result:
for Android we got a sample of 240,862 users x 751 variables;
for iOS, a sample of 73,712 users x 751 variables.

How to solve the problem


Obviously, going through 750 variables by hand to find the best one and that very threshold value is not a job for the lazy, so we wanted to automate it.

We split the task into two stages:
1) selecting the “important” variables;
2) finding the “right” boundary.

Let's start with finding the right boundary.
Suppose we have feature 1 and a target variable. How do we find the value that splits the sample so as to capture the change in trend?
Thinking about this, we hit on the idea of building a decision tree of depth 1 on a single variable.
Essentially, the job of a decision tree is to reduce the disorder of the system, i.e. to choose a split on the variable that lowers the system's level of chaos, and there is a well-defined criterion for measuring that chaos. We used entropy.
What does this mean in our case? Since the target variable has only two classes (0 for not loyal, 1 for loyal), building a decision tree of depth 1 on feature 1 (priznak_1) gives us a single split of this kind (the data shown here is fake):


As a result, the boundary is chosen so that the best possible balance is achieved between the probability of class 1 on one side of the split and the probability of class 0 on the other. Formally, this split maximizes the information gain.

And that is exactly the boundary we need.
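As a sketch (reusing the hypothetical DataFrame df from above, with the candidate feature priznak_1 and the binary target loyal), the whole boundary search boils down to fitting a one-split tree and reading off its threshold:

```python
# Minimal sketch of the boundary search: a depth-1 decision tree ("stump")
# on a single feature, with entropy as the impurity criterion.
from sklearn.tree import DecisionTreeClassifier

X = df[["priznak_1"]]   # single candidate feature (hypothetical column name)
y = df["loyal"]         # binary target: 1 = loyal, 0 = not loyal

stump = DecisionTreeClassifier(max_depth=1, criterion="entropy")
stump.fit(X, y)

# The root node's threshold is the boundary that maximizes information gain.
boundary = stump.tree_.threshold[0]
print(f"split: priznak_1 <= {boundary:.2f}")
```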

Important note: in the end we are not actually solving a classification problem and we do not need to grow a deeper, more elaborate tree, so overfitting is not the concern here; if anything, the model is underfitted. But that suits us.

Now that we know how to find a boundary, we need to select and rank the variables.
Nothing supernatural here either: many ensemble learning methods in scikit-learn expose a feature_importances_ attribute. We used a random forest of 1000 trees with depth 10, using the F1 score as the quality metric. The output is a ranked list of variables. As is usually the case, feature importances follow a power-law-like distribution (ours were no exception), and the most important variable leads by a wide margin. We decided to stick with it.
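A sketch of this ranking step might look as follows; the feature matrix X_all (roughly 750 columns) and target y are placeholders, and the train/test split parameters are our own illustrative choices, not values from the original analysis:

```python
# Hypothetical sketch of the variable-ranking step with a random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.3, stratify=y, random_state=42
)

forest = RandomForestClassifier(n_estimators=1000, max_depth=10, n_jobs=-1)
forest.fit(X_train, y_train)

# Check model quality with the F1 score, as mentioned in the text.
print("F1:", f1_score(y_test, forest.predict(X_test)))

# Rank the ~750 variables by importance; the top one led by a wide margin in our case.
importances = pd.Series(forest.feature_importances_, index=X_all.columns)
print(importances.sort_values(ascending=False).head(10))
```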

What we got


The most important variable turned out to be the number of days on which the user gained experience in the service.

Is this a good result?
Yes and no. On the one hand, we now know after how many days the “tipping point” occurs and the user starts actively using the service; on the other hand, we still do not know what exactly the user is doing.

So what next, given that we have not yet reached the result we wanted?
From here it is simple. Once we have this “base” boundary, it remains to understand how users to the left and to the right of it differ. We went down the same path as before, since the task was again to find the ideal boundary, but this time over features that reflect specific user actions: word training, adding words, learning tests, grammar training, and completing courses. Only now the target variable is different: 1 if the user is to the right of the “base” boundary, 0 if to the left.

As a result, we obtained reference points that show what a user needs to do in order to become loyal, and these reference points are expressed in specific user actions.

We enjoyed this exercise and hope you will too :)

A few comments about the result, and how it should be treated, are in order.
1) The result reflects the current mechanics of how users interact with the service. If the service changes, the whole logic of interaction with it may change.
2) In essence, our result is a set of strong hypotheses: if we launch a product feature or a marketing campaign that motivates users to reach the thresholds we found, it does not automatically follow that they will become loyal. This will have to be tested.

P.S. We also have analyst vacancies! Come work with us!

Source: https://habr.com/ru/post/300998/

