A cohort analysis of user returns is a powerful way to understand different groups of customers — their behavior and business value. However, the final tables can be difficult to understand the first time, and right off the bat to figure out how to build them, it is even more difficult.
The article will describe a relatively simple, but useful algorithm for constructing a cohort of a table, as well as sketches of code with Python / Pandas and SQL . If you need to programmatically implement the construction of a cohort report, or just interested to know this algorithm, I ask for cat.
One of my designs is a corporate analytics application. And the cohort returns report is probably the most important function. I had to rewrite it more than once, making it more customizable, flexible.
During development, I did not find examples of the implementation of such a report, so the algorithm outlined below was created by me from an understanding of the essence of the cohort analysis. If someone finds jambs or knows better options, please report it in the comments.
Conditional definitions :
The cohort by return is a cohort analysis table, in which each row describes a separate cohort by the date of its appearance, and the columns show the time of observation of the cohort.
On the example below it can be seen that of the clients who came in August, only 60% of their initial number remained the next month. And the number of active clients in September is: 100% of those who came in September + 60% of those who came in August + 30% of those who came in July. Similarly, the number of active customers in August is 100% of new customers in August and 50% of customers who came in July. That is, we look at the cohorts themselves in rows, and for all clients active in a certain month - diagonally, in different cohorts.
We will build a cohort table in several steps, at each of which we will receive a new table, each of which I gave my name.
There is a table of orders Orders for which you need to conduct a cohort analysis. The structure is as follows:
// date: DateTime, // ID clientID: String, // price: Int
We need to get the Clients table with the client’s arrival date:
date: DateTime, clientID: String
Maybe someone already has it (for example, with the date of registration of the user or with the date of installation of the application), but it can also be counted as the date of the first order.
Python:
Clients = pd.groupby(Orders, by=['clientID'], as_index=False) Clients = Clients.agg({ 'dt' : {'date' : 'min' }}) Clients.columns = cli.columns.droplevel() Clients.columns = ['clientID', 'date']
MySQL:
SELECT clientID, MIN(date) AS date FROM Orders GROUP BY clientID
We make the joining of the Orders and Clients tables of the Left type through the common clientID field. In order to avoid confusion, the date field from the first table is called dateOr , and the second is called dateCl .
Now we need to round the dates to make the groups. You can round up to a month, throwing back the number, you can count the number of weeks before any date. In the end, the date should be a string.
Mix table structure:
// dateCl: String, // dateOr: String, // ID clientID: String, // price: Int
Python:
Mix = pd.merge(Orders, Clients, how='left', on=['clientID']) Mix.columns = ['dateOr', 'clientID', 'price', 'dateCl'] def cutDate(txt): return txt[:7] Mix['dateOr'] = Mix['dateOr'].apply(cutDate) Mix['dateCl'] = Mix['dateCl'].apply(cutDate)
MySQL:
SELECT STRFTIME_UTC_USEC(Clients.date, "%Y-%m") AS dateCl, STRFTIME_UTC_USEC(Orders.date, "%Y-%m") AS dateOr, clientID, price FROM Clients INNER JOIN Orders ON (Clients.date = Orders.date)
Finally getting closer to the cohorts! Let 's group by three fields at once: dateCl, dateOr and clientID .
Without this, we will not be able to find the number of active customers at any time. The number of rows (the Count()
function) would only give us the number of orders in the cohort during this time period, and the number of active clients could be found in no way.
Add the following to these fields:
ordersCount = Count()
total = Sum(price)
Summary structure:
// dateCl: String, // dateOr: String, // ID clientID: String, // , ordersCount: Int, // , total: Int,
Python:
Preresult = pd.groupby(Mix, by=['tel', 'dateOr', 'dateCl'], as_index=False) Preresult = Preresult.agg({ 'price': { 'total': 'sum', 'ordersCount': 'count' } }) Preresult.columns = Preresult.columns.droplevel() Preresult.columns = ['clientID', 'dateOr', 'dateCl', 'total', 'ordersCount']
MySQL:
SELECT clientID, dateCl, dateOr, COUNT(*) AS ordersCount, SUM(price) AS total, FROM Mix GROUP BY dateCl, dateOr, clientID
Now we can depersonalize our data by grouping only by dateCl and dateOr . Add other fields:
clientsCount = Count()
ordersCount = Sum(ordersCount)
total = Sum(total)
It turns out this table:
// dateCl: String, // dateOr: String, // clientsCount: Int, // , ordersCount: Int, // , total: Int,
Python:
Result = pd.groupby(Preresult, by=['dateOr', 'dateCl'], as_index=False) Result = Result.agg({ 'total': { 'total': 'sum' }, 'ordersCount': { 'ordersCount': 'sum', 'clientsCount': 'count' } }) Result.columns = Result.columns.droplevel() Result.columns = ['dateOr', 'dateCl', 'total', 'ordersCount', 'clientsCount'])
MySQL:
SELECT dateCl, dateOr, COUNT(*) AS clientsCount, SUM(ordersCount) AS ordersCount, SUM(total) AS total FROM Preresult GROUP BY dateCl, dateOr
Now we just have to transform the structure of the table: dateCl must be placed in rows, dateOr in columns, and the desired value in cells ( clientsCount , ordersCount , total or something else).
This operation is called Pivot Table (in Russian, it seems, there is no normal term). A simple example:
xy val yx 1 2 1 1 5 1 5 - 1 3 8 -> 2 - 7 2 2 7 3 8 -
Columns were obtained from x values, and rows from y . Here we had the row x = 1, y = 3, val = 8, and the cell became in the column (x) 1, in the row (y) 3 with the value (val) 8. Or the row x = 2, y = 2, val = 7, became a column cell (x) 2, in row (y) 2 with the value (val) 7.
Cells whose values are not described by rows from the original table are usually filled with the value NULL or something logically equivalent.
Python:
# clientsCount, ordersCount, total Cohort = Preresult.pivot(index='dateCl', columns='dateOr', values='Data') # null' Cohort.fillna(0, inplace=True)
MySQL:
Unfortunately, MySQL does not know how to turn cells into rows and columns in an easy way, so I used other languages for this purpose. But if someone knows this way - write in the comments, I will be grateful.
* 09 10 11 * 0 1 2 09 3 2 1 09 3 2 1 10 - 3 2 -> 10 3 2 - 11 - - 3 11 3 - -
The cohort recoverability table is not the only application of cohort analysis, there are other, more visual applications. For example, the division of users into two groups according to a certain attribute (cohort) and the display of their characteristics on the graph as two independent lines.
Useful on the topic: cohort analysis in Google Analytics .
All success;)
Source: https://habr.com/ru/post/342108/
All Articles