📜 ⬆️ ⬇️

How to find the longest continuous series of events using SQL

The task of finding continuous sequences of events is quite easily solved using SQL. Let's clarify what these sequences are.

For example, take Stack Overflow. He uses a cool reputation system with awards for certain achievements. As in many social projects, they encourage users to visit the resource daily. Pay attention to these two awards:


')
It is not difficult to understand what they mean. Visit the site on the first day. Then on the second day. Then for the third (maybe several times, it doesn't matter). Did not go to the fourth? We start to count again.

How to track this with SQL?


For data access we will use Stack Exchange Data Explorer .

Please note that we will not request dates for visits, as this information is not provided. Instead, let's ask for the dates of user posted messages.

SQL Server is used as a database, therefore we can use the following query:

SELECT DISTINCT CAST(CreationDate AS DATE) AS date FROM Posts WHERE OwnerUserId = ##UserId## ORDER BY 1 

... which will give something like this:

 date ---------- 2010-11-26 2010-11-27 2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 2010-12-05 2010-12-06 2010-12-07 2010-12-08 2010-12-09 2010-12-13 2010-12-14 ... (769 rows) 

(you can make a request yourself, here)

As you can see, there are a few gaps:

 date -------------------------------------- 2010-11-26 2010-11-27 <----   2  2010-11-29 2010-11-30 2010-12-01 2010-12-02 2010-12-03 <----   5  2010-12-05 2010-12-06 2010-12-07 2010-12-08 2010-12-09 <----   5  2010-12-13 2010-12-14 ... 

It is easy for a person to see how many days in a row the dates go without gaps. But how to do it through SQL?

To simplify the task, let's “save” individual queries in generalized table expressions. We’ll call the previous query dates :

 WITH -- This table contains all the distinct date -- instances in the data set dates(date) AS ( SELECT DISTINCT CAST(CreationDate AS DATE) FROM Posts WHERE OwnerUserId = ##UserId## ) ... 

Now the goal of the received request is to place all consecutive dates in the same group so that we can combine them. Here's how we do it:

 SELECT COUNT(*) AS consecutiveDates, MIN(week) AS minDate, MAX(week) AS maxDate FROM groups GROUP BY grp -- This "grp" value will be explained later ORDER BY 1 DESC, 2 DESC 

We want to combine each group “grp” and count the number of dates in the group, as well as find the minimum and maximum in each group.

Creating groups for consecutive dates


Let's now look at the result of the query, and to make it clearer, we enumerate the lines regardless of the omissions in the dates:

 row number date -------------------------------- 1 2010-11-26 2 2010-11-27 3 2010-11-29 <--     4 2010-11-30 5 2010-12-01 6 2010-12-02 7 2010-12-03 8 2010-12-05 <--     

As you can see, regardless of the fact that there is a gap between dates (the two dates are not consecutive), their line numbers will still be consecutive. We can do this using the ROW_NUMBER () function:

 SELECT ROW_NUMBER() OVER (ORDER BY date) AS [row number], date FROM dates 

Now let's see this interesting query:

 WITH -- This table contains all the distinct date -- instances in the data set dates(date) AS ( SELECT DISTINCT CAST(CreationDate AS DATE) FROM Posts WHERE OwnerUserId = ##UserId## ), -- Generate "groups" of dates by subtracting the -- date's row number (no gaps) from the date itself -- (with potential gaps). Whenever there is a gap, -- there will be a new group groups AS ( SELECT ROW_NUMBER() OVER (ORDER BY date) AS rn, dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp, date FROM dates ) SELECT * FROM groups ORDER BY rn 

The above query gives us the following result:

 rn grp date --- ---------- ---------- 1 2010-11-25 2010-11-26 2 2010-11-25 2010-11-27 3 2010-11-26 2010-11-29 4 2010-11-26 2010-11-30 5 2010-11-26 2010-12-01 6 2010-11-26 2010-12-02 7 2010-11-26 2010-12-03 8 2010-11-27 2010-12-05 9 2010-11-27 2010-12-06 10 2010-11-27 2010-12-07 11 2010-11-27 2010-12-08 12 2010-11-27 2010-12-09 13 2010-11-30 2010-12-13 14 2010-11-30 2010-12-14 

(you can make a request yourself, here)

All we did was subtract the line number from the day to get the new date “grp”. The date obtained in this way does not make sense, it is just an auxiliary value.

However, we can guarantee that for consecutive dates, the value of "grp" will be the same, because for all consecutive dates, the following two equations are correct:

 date2 - date1 = 1 //       rn2 - rn1 = 1 //     

For inconsistent dates, the difference in line numbers will also be 1, but the difference in days will be greater than one. Groups can now be easily distinguished:

 rn grp date --- ---------- ---------- 1 2010-11-25 2010-11-26 2 2010-11-25 2010-11-27 3 2010-11-26 2010-11-29 4 2010-11-26 2010-11-30 5 2010-11-26 2010-12-01 6 2010-11-26 2010-12-02 7 2010-11-26 2010-12-03 8 2010-11-27 2010-12-05 9 2010-11-27 2010-12-06 10 2010-11-27 2010-12-07 11 2010-11-27 2010-12-08 12 2010-11-27 2010-12-09 13 2010-11-30 2010-12-13 14 2010-11-30 2010-12-14 

Thus, the final query will be as follows:

 WITH -- This table contains all the distinct date -- instances in the data set dates(date) AS ( SELECT DISTINCT CAST(CreationDate AS DATE) FROM Posts WHERE OwnerUserId = ##UserId## ), -- Generate "groups" of dates by subtracting the -- date's row number (no gaps) from the date itself -- (with potential gaps). Whenever there is a gap, -- there will be a new group groups AS ( SELECT ROW_NUMBER() OVER (ORDER BY date) AS rn, dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp, date FROM dates ) SELECT COUNT(*) AS consecutiveDates, MIN(week) AS minDate, MAX(week) AS maxDate FROM groups GROUP BY grp ORDER BY 1 DESC, 2 DESC 

And its result:

 consecutiveDates minDate maxDate ---------------- ------------- ------------- 14 2012-08-13 2012-08-26 14 2012-02-03 2012-02-16 10 2013-10-24 2013-11-02 10 2011-05-11 2011-05-20 9 2011-06-30 2011-07-08 7 2012-01-17 2012-01-23 7 2011-06-14 2011-06-20 6 2012-04-10 2012-04-15 6 2012-04-02 2012-04-07 6 2012-03-26 2012-03-31 6 2011-10-27 2011-11-01 6 2011-07-17 2011-07-22 6 2011-05-23 2011-05-28 ... 

(you can make a request yourself, here)

Bonus: find the sequence of weeks


The fact that we used the days is just our choice. We took the exact time and rounded it to the day using the CAST function:

 SELECT DISTINCT CAST(CreationDate AS DATE) 

If we wanted to know the sequence, for example, from weeks, we could round off the time to weeks:

 SELECT DISTINCT datepart(year, CreationDate) * 100 + datepart(week, CreationDate) 

This query uses the numeric expression of the year and week and generates type numbers 201503 for the third week of 2015. The rest of the request remains unchanged:

 WITH weeks(week) AS ( SELECT DISTINCT datepart(year, CreationDate) * 100 + datepart(week, CreationDate) FROM Posts WHERE OwnerUserId = ##UserId## ), groups AS ( SELECT ROW_NUMBER() OVER (ORDER BY week) AS rn, dateadd(day, -ROW_NUMBER() OVER (ORDER BY week), week) AS grp, week FROM weeks ) SELECT COUNT(*) AS consecutiveWeeks, MIN(week) AS minWeek, MAX(week) AS maxWeek FROM groups GROUP BY grp ORDER BY 1 DESC, 2 DESC 

And that's what we get:

 consecutiveWeeks minWeek maxWeek ---------------- ------- ------- 45 201401 201445 29 201225 201253 25 201114 201138 23 201201 201223 20 201333 201352 16 201529 201544 15 201305 201319 12 201514 201525 12 201142 201153 9 201502 201510 7 201447 201453 7 201321 201327 6 201048 201053 4 201106 201109 3 201329 201331 3 201102 201104 2 201301 201302 2 201111 201112 1 201512 201512 

(you can make a request yourself, here)

It is not surprising that successive weeks cover much longer ranges, as the author writes regularly on Stack Overflow.

Source: https://habr.com/ru/post/270573/


All Articles