Guide to creating your own cohort recurrence report

An example of a cohort report with LTV values

A cohort analysis of user returns is a powerful way to understand different groups of customers — their behavior and business value. However, the final tables can be difficult to understand the first time, and right off the bat to figure out how to build them, it is even more difficult.

The article will describe a relatively simple, but useful algorithm for constructing a cohort of a table, as well as sketches of code with Python / Pandas and SQL . If you need to programmatically implement the construction of a cohort report, or just interested to know this algorithm, I ask for cat.

Introduction

One of my designs is a corporate analytics application. And the cohort returns report is probably the most important function. I had to rewrite it more than once, making it more customizable, flexible.

During development, I did not find examples of the implementation of such a report, so the algorithm outlined below was created by me from an understanding of the essence of the cohort analysis. If someone finds jambs or knows better options, please report it in the comments.

Conditional definitions :

Cohort analysis (Cohort analysis) is a method for evaluating any metrics with the division of users into independent groups - cohorts .
Return (Retention) - characteristic of a group of users, calculated by the ratio of active users in certain time intervals. Example: some applications were installed by 50 people, after a week only 5 people remained active users; Return: 5/50 * 100% = 10%.
The cohort by return is a cohort analysis table, in which each row describes a separate cohort by the date of its appearance, and the columns show the time of observation of the cohort.

On the example below it can be seen that of the clients who came in August, only 60% of their initial number remained the next month. And the number of active clients in September is: 100% of those who came in September + 60% of those who came in August + 30% of those who came in July. Similarly, the number of active customers in August is 100% of new customers in August and 50% of customers who came in July. That is, we look at the cohorts themselves in rows, and for all clients active in a certain month - diagonally, in different cohorts.

Example cohort table

Life-Time Value (LTV) is a characteristic of a group of clients, which shows how much income an average client brings from this group.

We will build a cohort table in several steps, at each of which we will receive a new table, each of which I gave my name.

Step 1. Task setting, Orders table

There is a table of orders Orders for which you need to conduct a cohort analysis. The structure is as follows:

//   date: DateTime, // ID  clientID: String, //   price: Int

Step 2. Clients table with the client’s arrival date

We need to get the Clients table with the client’s arrival date:

 date: DateTime, clientID: String

Maybe someone already has it (for example, with the date of registration of the user or with the date of installation of the application), but it can also be counted as the date of the first order.

Code example

Python:

 Clients = pd.groupby(Orders, by=['clientID'], as_index=False) Clients = Clients.agg({ 'dt' : {'date' : 'min' }}) Clients.columns = cli.columns.droplevel() Clients.columns = ['clientID', 'date']

MySQL:

 SELECT clientID, MIN(date) AS date FROM Orders GROUP BY clientID

Step 3. Combining Orders and Clients in Mix

We make the joining of the Orders and Clients tables of the Left type through the common clientID field. In order to avoid confusion, the date field from the first table is called dateOr , and the second is called dateCl .

Now we need to round the dates to make the groups. You can round up to a month, throwing back the number, you can count the number of weeks before any date. In the end, the date should be a string.

Mix table structure:

 //   dateCl: String, //   dateOr: String, // ID  clientID: String, //   price: Int

Code example

Python:

 Mix = pd.merge(Orders, Clients, how='left', on=['clientID']) Mix.columns = ['dateOr', 'clientID', 'price', 'dateCl'] def cutDate(txt): return txt[:7] Mix['dateOr'] = Mix['dateOr'].apply(cutDate) Mix['dateCl'] = Mix['dateCl'].apply(cutDate)

MySQL:

 SELECT STRFTIME_UTC_USEC(Clients.date, "%Y-%m") AS dateCl, STRFTIME_UTC_USEC(Orders.date, "%Y-%m") AS dateOr, clientID, price FROM Clients INNER JOIN Orders ON (Clients.date = Orders.date)

Step 4. Grouping times table Preresult

Finally getting closer to the cohorts! Let 's group by three fields at once: dateCl, dateOr and clientID .

Why on the client, too, and not just on dates?

Without this, we will not be able to find the number of active customers at any time. The number of rows (the Count() function) would only give us the number of orders in the cohort during this time period, and the number of active clients could be found in no way.

Add the following to these fields:

The number of orders that the client made during this time period. Find as the number of grouped lines:
ordersCount = Count()
The amount for which this customer placed orders during this time period. Find as the sum of the values of individual orders:
total = Sum(price)

Summary structure:

 //   dateCl: String, //   dateOr: String, // ID  clientID: String, //  ,        ordersCount: Int, // ,           total: Int,

Code example

Python:

 Preresult = pd.groupby(Mix, by=['tel', 'dateOr', 'dateCl'], as_index=False) Preresult = Preresult.agg({ 'price': { 'total': 'sum', 'ordersCount': 'count' } }) Preresult.columns = Preresult.columns.droplevel() Preresult.columns = ['clientID', 'dateOr', 'dateCl', 'total', 'ordersCount']

MySQL:

 SELECT clientID, dateCl, dateOr, COUNT(*) AS ordersCount, SUM(price) AS total, FROM Mix GROUP BY dateCl, dateOr, clientID

Step 5. Grouping two, table Result

Now we can depersonalize our data by grouping only by dateCl and dateOr . Add other fields:

The number of active clients in this time period. Find as the number of grouped lines:
clientsCount = Count()
The number of orders that all customers made during this time period. Find as the sum of the number of orders for individual customers:
ordersCount = Sum(ordersCount)
The amount for which all customers placed orders during this time period. Find as the sum of the amounts for each client:
total = Sum(total)

It turns out this table:

 //   dateCl: String, //   dateOr: String, //     clientsCount: Int, //  ,           ordersCount: Int, // ,             total: Int,

Code example

Python:

 Result = pd.groupby(Preresult, by=['dateOr', 'dateCl'], as_index=False) Result = Result.agg({ 'total': { 'total': 'sum' }, 'ordersCount': { 'ordersCount': 'sum', 'clientsCount': 'count' } }) Result.columns = Result.columns.droplevel() Result.columns = ['dateOr', 'dateCl', 'total', 'ordersCount', 'clientsCount'])

MySQL:

 SELECT dateCl, dateOr, COUNT(*) AS clientsCount, SUM(ordersCount) AS ordersCount, SUM(total) AS total FROM Preresult GROUP BY dateCl, dateOr

Step 6. Final Transformation, Pivot Cohort Table

Now we just have to transform the structure of the table: dateCl must be placed in rows, dateOr in columns, and the desired value in cells ( clientsCount , ordersCount , total or something else).

What is it?

This operation is called Pivot Table (in Russian, it seems, there is no normal term). A simple example:

  xy val yx 1 2 1 1 5 1 5 - 1 3 8 -> 2 - 7 2 2 7 3 8 -

Columns were obtained from x values, and rows from y . Here we had the row x = 1, y = 3, val = 8, and the cell became in the column (x) 1, in the row (y) 3 with the value (val) 8. Or the row x = 2, y = 2, val = 7, became a column cell (x) 2, in row (y) 2 with the value (val) 7.

Cells whose values are not described by rows from the original table are usually filled with the value NULL or something logically equivalent.

Code example

Python:

 #      clientsCount, ordersCount, total    Cohort = Preresult.pivot(index='dateCl', columns='dateOr', values='Data') #   null'   Cohort.fillna(0, inplace=True)

MySQL:
Unfortunately, MySQL does not know how to turn cells into rows and columns in an easy way, so I used other languages for this purpose. But if someone knows this way - write in the comments, I will be grateful.

Possible improvements

Rename columns
Now dateOr and dateCl describe the date independently, and the center of the cohort table is directed right-up. But if at Step 3, 4 or 5 you perform the operation dateOr - = dateCl , then this field will display the date from the beginning of the cohort's existence, and the center of the table will be directed left-up, which is better perceived:
```
 * 09 10 11 * 0 1 2 09 3 2 1 09 3 2 1 10 - 3 2 -> 10 3 2 - 11 - - 3 11 3 - - 
```
Extra options
What to do if you have other interesting parameters in the order table? For example, type of payment or branch ID? Quite simply: these parameters will also be present in the tables on Steps 3,4,5 ( Mix, Preresult, Result ), and they must be added to the fields of both groupings. Then, at Step 6, for each possible combination of parameters you need to build your Cohort table. For example, you have 3 branches and 2 types of payment, there will be 3 * 2 = 6 cohort tables.
Finding LTV
Having a long statistics, you can calculate the Life-Time Value of the cohorts walking through the rows of the Cohort table.

Conclusion

The cohort recoverability table is not the only application of cohort analysis, there are other, more visual applications. For example, the division of users into two groups according to a certain attribute (cohort) and the display of their characteristics on the graph as two independent lines.

Useful on the topic: cohort analysis in Google Analytics .

All success;)

Source: https://habr.com/ru/post/342108/

All Articles

Guide to creating your own cohort recurrence report

Introduction

Step 1. Task setting, Orders table

Step 2. Clients table with the client’s arrival date

Step 3. Combining Orders and Clients in Mix

Step 4. Grouping times table Preresult

Step 5. Grouping two, table Result

Step 6. Final Transformation, Pivot Cohort Table

Possible improvements

Conclusion

More articles: