📜 ⬆️ ⬇️

Guide to creating your own cohort recurrence report

An example of a cohort report with LTV values


A cohort analysis of user returns is a powerful way to understand different groups of customers — their behavior and business value. However, the final tables can be difficult to understand the first time, and right off the bat to figure out how to build them, it is even more difficult.


The article will describe a relatively simple, but useful algorithm for constructing a cohort of a table, as well as sketches of code with Python / Pandas and SQL . If you need to programmatically implement the construction of a cohort report, or just interested to know this algorithm, I ask for cat.


Introduction


One of my designs is a corporate analytics application. And the cohort returns report is probably the most important function. I had to rewrite it more than once, making it more customizable, flexible.


During development, I did not find examples of the implementation of such a report, so the algorithm outlined below was created by me from an understanding of the essence of the cohort analysis. If someone finds jambs or knows better options, please report it in the comments.


Conditional definitions :



Example cohort table



We will build a cohort table in several steps, at each of which we will receive a new table, each of which I gave my name.


Step 1. Task setting, Orders table


There is a table of orders Orders for which you need to conduct a cohort analysis. The structure is as follows:


//   date: DateTime, // ID  clientID: String, //   price: Int 

Step 2. Clients table with the client’s arrival date


We need to get the Clients table with the client’s arrival date:


 date: DateTime, clientID: String 

Maybe someone already has it (for example, with the date of registration of the user or with the date of installation of the application), but it can also be counted as the date of the first order.


Code example

Python:


 Clients = pd.groupby(Orders, by=['clientID'], as_index=False) Clients = Clients.agg({ 'dt' : {'date' : 'min' }}) Clients.columns = cli.columns.droplevel() Clients.columns = ['clientID', 'date'] 

MySQL:


 SELECT clientID, MIN(date) AS date FROM Orders GROUP BY clientID 

Step 3. Combining Orders and Clients in Mix


We make the joining of the Orders and Clients tables of the Left type through the common clientID field. In order to avoid confusion, the date field from the first table is called dateOr , and the second is called dateCl .


Now we need to round the dates to make the groups. You can round up to a month, throwing back the number, you can count the number of weeks before any date. In the end, the date should be a string.


Mix table structure:


 //   dateCl: String, //   dateOr: String, // ID  clientID: String, //   price: Int 

Code example

Python:


 Mix = pd.merge(Orders, Clients, how='left', on=['clientID']) Mix.columns = ['dateOr', 'clientID', 'price', 'dateCl'] def cutDate(txt): return txt[:7] Mix['dateOr'] = Mix['dateOr'].apply(cutDate) Mix['dateCl'] = Mix['dateCl'].apply(cutDate) 

MySQL:


 SELECT STRFTIME_UTC_USEC(Clients.date, "%Y-%m") AS dateCl, STRFTIME_UTC_USEC(Orders.date, "%Y-%m") AS dateOr, clientID, price FROM Clients INNER JOIN Orders ON (Clients.date = Orders.date) 

Step 4. Grouping times table Preresult


Finally getting closer to the cohorts! Let 's group by three fields at once: dateCl, dateOr and clientID .


Why on the client, too, and not just on dates?
Without this, we will not be able to find the number of active customers at any time. The number of rows (the Count() function) would only give us the number of orders in the cohort during this time period, and the number of active clients could be found in no way.

Add the following to these fields:



Summary structure:


 //   dateCl: String, //   dateOr: String, // ID  clientID: String, //  ,        ordersCount: Int, // ,           total: Int, 

Code example

Python:


 Preresult = pd.groupby(Mix, by=['tel', 'dateOr', 'dateCl'], as_index=False) Preresult = Preresult.agg({ 'price': { 'total': 'sum', 'ordersCount': 'count' } }) Preresult.columns = Preresult.columns.droplevel() Preresult.columns = ['clientID', 'dateOr', 'dateCl', 'total', 'ordersCount'] 

MySQL:


 SELECT clientID, dateCl, dateOr, COUNT(*) AS ordersCount, SUM(price) AS total, FROM Mix GROUP BY dateCl, dateOr, clientID 

Step 5. Grouping two, table Result


Now we can depersonalize our data by grouping only by dateCl and dateOr . Add other fields:



It turns out this table:


 //   dateCl: String, //   dateOr: String, //     clientsCount: Int, //  ,           ordersCount: Int, // ,             total: Int, 

Code example

Python:


 Result = pd.groupby(Preresult, by=['dateOr', 'dateCl'], as_index=False) Result = Result.agg({ 'total': { 'total': 'sum' }, 'ordersCount': { 'ordersCount': 'sum', 'clientsCount': 'count' } }) Result.columns = Result.columns.droplevel() Result.columns = ['dateOr', 'dateCl', 'total', 'ordersCount', 'clientsCount']) 

MySQL:


 SELECT dateCl, dateOr, COUNT(*) AS clientsCount, SUM(ordersCount) AS ordersCount, SUM(total) AS total FROM Preresult GROUP BY dateCl, dateOr 

Step 6. Final Transformation, Pivot Cohort Table


Now we just have to transform the structure of the table: dateCl must be placed in rows, dateOr in columns, and the desired value in cells ( clientsCount , ordersCount , total or something else).


What is it?

This operation is called Pivot Table (in Russian, it seems, there is no normal term). A simple example:


  xy val yx 1 2 1 1 5 1 5 - 1 3 8 -> 2 - 7 2 2 7 3 8 - 

Columns were obtained from x values, and rows from y . Here we had the row x = 1, y = 3, val = 8, and the cell became in the column (x) 1, in the row (y) 3 with the value (val) 8. Or the row x = 2, y = 2, val = 7, became a column cell (x) 2, in row (y) 2 with the value (val) 7.


Cells whose values ​​are not described by rows from the original table are usually filled with the value NULL or something logically equivalent.


Code example

Python:


 #      clientsCount, ordersCount, total    Cohort = Preresult.pivot(index='dateCl', columns='dateOr', values='Data') #   null'   Cohort.fillna(0, inplace=True) 

MySQL:
Unfortunately, MySQL does not know how to turn cells into rows and columns in an easy way, so I used other languages ​​for this purpose. But if someone knows this way - write in the comments, I will be grateful.


Possible improvements


  1. Rename columns
    Now dateOr and dateCl describe the date independently, and the center of the cohort table is directed right-up. But if at Step 3, 4 or 5 you perform the operation dateOr - = dateCl , then this field will display the date from the beginning of the cohort's existence, and the center of the table will be directed left-up, which is better perceived:
     * 09 10 11 * 0 1 2 09 3 2 1 09 3 2 1 10 - 3 2 -> 10 3 2 - 11 - - 3 11 3 - - 
  2. Extra options
    What to do if you have other interesting parameters in the order table? For example, type of payment or branch ID? Quite simply: these parameters will also be present in the tables on Steps 3,4,5 ( Mix, Preresult, Result ), and they must be added to the fields of both groupings. Then, at Step 6, for each possible combination of parameters you need to build your Cohort table. For example, you have 3 branches and 2 types of payment, there will be 3 * 2 = 6 cohort tables.
  3. Finding LTV
    Having a long statistics, you can calculate the Life-Time Value of the cohorts walking through the rows of the Cohort table.

Conclusion


The cohort recoverability table is not the only application of cohort analysis, there are other, more visual applications. For example, the division of users into two groups according to a certain attribute (cohort) and the display of their characteristics on the graph as two independent lines.


Useful on the topic: cohort analysis in Google Analytics .


All success;)


')

Source: https://habr.com/ru/post/342108/


All Articles