Developers Club geek daily blog

2 years, 11 months ago
The problem of search of continuous sequences of events is quite easily solved with the help of SQL. Let's specify that these sequences are.

For an example we will take Stack Overflow. He uses cool system of reputation with awards for certain achievements. As well as in many social projects, they encourage users daily to visit a resource. Let's pay attention to these two awards:

How to find the most long continuous line of events by means of SQL

It is easy to understand what they mean. Visit the website in the first day. Then for the second day. Then on the third (perhaps several times, it does not matter). Did not come on the fourth? We begin to consider again.

How to trace it by means of SQL?


For data access we will use Stack Exchange Data Explorer.

Pay attention that we will not request date of visits as this information is not provided. Instead, let's request dates of the messages placed by the user.

As the database SQL Server is used, therefore we can use the following request:

SELECT DISTINCT CAST(CreationDate AS DATE) AS date
FROM Posts
WHERE OwnerUserId = ##UserId##
ORDER BY 1

… which will issue something similar:

date          
----------
2010-11-26
2010-11-27
2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03
2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09
2010-12-13
2010-12-14
...
(769 rows)

(you can make request independently, here)

As it is possible to notice, eat several admissions:

date          
--------------------------------------
2010-11-26
2010-11-27 <---- Пропуск после 2 дней

2010-11-29
2010-11-30
2010-12-01
2010-12-02
2010-12-03 <---- Пропуск после 5 дней

2010-12-05
2010-12-06
2010-12-07
2010-12-08
2010-12-09 <---- Пропуск после 5 дней

2010-12-13
2010-12-14
...

It is easy for person to see how many days in a row go dates without admissions. But how to make it by means of SQL?

To simplify a task, let's "save" individual requests in the generalized tabular expressions. The previous request we will call dates:

WITH
 
  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  )
...

Now the purpose of the received request — to place all consecutive dates in the same group that we could integrate them. Here is how we will make it:

SELECT
  COUNT(*) AS consecutiveDates,
  MIN(week) AS minDate,
  MAX(week) AS maxDate
FROM groups
GROUP BY grp -- This "grp" value will be explained later
ORDER BY 1 DESC, 2 DESC

We want to integrate each grp group and to count the number of dates in group, and also to find a minimum and a maximum in each group.

Creation of groups for consecutive dates


Let's look now at result of request, and that it was more clear, we will number lines irrespective of admissions in dates:

row number   date          
--------------------------------
1            2010-11-26
2            2010-11-27

3            2010-11-29 <-- пропуск перед этой строкой
4            2010-11-30
5            2010-12-01
6            2010-12-02
7            2010-12-03

8            2010-12-05 <-- пропуск перед этой строкой

As it is possible to see regardless of the fact that there is a gap between dates (two dates are not consecutive), their row numbers will still be consecutive. We can make it by means of the ROW_NUMBER function ():

SELECT
  ROW_NUMBER() OVER (ORDER BY date) AS [row number],
  date
FROM dates

Now let's look here at such interesting request:

WITH
 
  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
   
  -- Generate "groups" of dates by subtracting the
  -- date's row number (no gaps) from the date itself
  -- (with potential gaps). Whenever there is a gap,
  -- there will be a new group
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY date) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
      date
    FROM dates
  )
SELECT *
FROM groups
ORDER BY rn

The request given above yields to us the following result:

rn  grp          date          
--- ----------   ----------
1   2010-11-25   2010-11-26
2   2010-11-25   2010-11-27

3   2010-11-26   2010-11-29
4   2010-11-26   2010-11-30
5   2010-11-26   2010-12-01
6   2010-11-26   2010-12-02
7   2010-11-26   2010-12-03

8   2010-11-27   2010-12-05
9   2010-11-27   2010-12-06
10  2010-11-27   2010-12-07
11  2010-11-27   2010-12-08
12  2010-11-27   2010-12-09

13  2010-11-30   2010-12-13
14  2010-11-30   2010-12-14

(you can make request independently, here)

Everything that we made, it subtracted row number from day to receive new date of "grp". The date received thus does not make sense, this just auxiliary value.

However, we can guarantee that for consecutive dates, grp value will be identical because for all consistently going dates, the following two equations are right:

date2 - date1 = 1 // разница в днях между двумя датами
rn2 - rn1 = 1 // Разница в цифрах строк

For inconsistent dates, the difference in row numbers will be also 1, but the difference in days will be more unit. Groups can be distinguished easily now:

rn  grp          date          
--- ----------   ----------
1   2010-11-25   2010-11-26
2   2010-11-25   2010-11-27

3   2010-11-26   2010-11-29
4   2010-11-26   2010-11-30
5   2010-11-26   2010-12-01
6   2010-11-26   2010-12-02
7   2010-11-26   2010-12-03

8   2010-11-27   2010-12-05
9   2010-11-27   2010-12-06
10  2010-11-27   2010-12-07
11  2010-11-27   2010-12-08
12  2010-11-27   2010-12-09

13  2010-11-30   2010-12-13
14  2010-11-30   2010-12-14

Thus, the final request will be the following:

WITH
 
  -- This table contains all the distinct date 
  -- instances in the data set
  dates(date) AS (
    SELECT DISTINCT CAST(CreationDate AS DATE)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
   
  -- Generate "groups" of dates by subtracting the
  -- date's row number (no gaps) from the date itself
  -- (with potential gaps). Whenever there is a gap,
  -- there will be a new group
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY date) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY date), date) AS grp,
      date
    FROM dates
  )
SELECT
  COUNT(*) AS consecutiveDates,
  MIN(week) AS minDate,
  MAX(week) AS maxDate
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

And its result:

consecutiveDates minDate       maxDate       
---------------- ------------- ------------- 
14               2012-08-13    2012-08-26
14               2012-02-03    2012-02-16
10               2013-10-24    2013-11-02
10               2011-05-11    2011-05-20
9                2011-06-30    2011-07-08
7                2012-01-17    2012-01-23
7                2011-06-14    2011-06-20
6                2012-04-10    2012-04-15
6                2012-04-02    2012-04-07
6                2012-03-26    2012-03-31
6                2011-10-27    2011-11-01
6                2011-07-17    2011-07-22
6                2011-05-23    2011-05-28
...

(you can make request independently, here)

Bonus: to find sequence of weeks


What we used days it is just our choice. We took exact time and rounded it about one day by means of the CAST function:

SELECT DISTINCT CAST(CreationDate AS DATE)

If we wanted to learn sequence, for example, from weeks, we could round time about one weeks:

SELECT DISTINCT datepart(year, CreationDate) * 100 
              + datepart(week, CreationDate)

This request uses numerical expression of year and week and creates numbers of 201503 type for the third week 2015. Other part of request remains without changes:

WITH
  weeks(week) AS (
    SELECT DISTINCT datepart(year, CreationDate) * 100 
                  + datepart(week, CreationDate)
    FROM Posts
    WHERE OwnerUserId = ##UserId##
  ),
  groups AS (
    SELECT
      ROW_NUMBER() OVER (ORDER BY week) AS rn,
      dateadd(day, -ROW_NUMBER() OVER (ORDER BY week), week) AS grp,
      week
    FROM weeks
  )
SELECT
  COUNT(*) AS consecutiveWeeks,
  MIN(week) AS minWeek,
  MAX(week) AS maxWeek
FROM groups
GROUP BY grp
ORDER BY 1 DESC, 2 DESC

And here what we will receive:

consecutiveWeeks minWeek maxWeek 
---------------- ------- ------- 
45               201401  201445  
29               201225  201253  
25               201114  201138  
23               201201  201223  
20               201333  201352  
16               201529  201544  
15               201305  201319  
12               201514  201525  
12               201142  201153  
9                201502  201510  
7                201447  201453  
7                201321  201327  
6                201048  201053  
4                201106  201109  
3                201329  201331  
3                201102  201104  
2                201301  201302  
2                201111  201112  
1                201512  201512  

(you can make request independently, here)

It is no wonder that consecutive weeks cover much longer ranges as the author regularly writes on Stack Overflow.

This article is a translation of the original post at habrahabr.ru/post/270573/
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here: sysmagazine.com@gmail.com.

We believe that the knowledge, which is available at the most popular Russian IT blog habrahabr.ru, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus