面试真题
常见题目
14.各用户最长的连续登录天数-可间断

常见大数据面试SQL-各用户最长的连续登录天数-可间断

一、题目

现有各用户的登录记录表t14_login_events如下,表中每行数据表达的信息是一个用户何时登录了平台。 现要求统计各用户最长的连续登录天数,间断一天也算作连续,例如:一个用户在1,3,5,6登录,则视为连续6天登录。

样例数据

+----------+----------------------+
| user_id  |    login_datetime    |
+----------+----------------------+
| 100      | 2021-12-01 19:00:00  |
| 100      | 2021-12-01 19:30:00  |
| 100      | 2021-12-02 21:01:00  |
| 100      | 2021-12-03 11:01:00  |
| 101      | 2021-12-01 19:05:00  |
| 101      | 2021-12-01 21:05:00  |
| 101      | 2021-12-03 21:05:00  |
| 101      | 2021-12-05 15:05:00  |
| 101      | 2021-12-06 19:05:00  |
| 102      | 2021-12-01 19:55:00  |
| 102      | 2021-12-01 21:05:00  |
| 102      | 2021-12-02 21:57:00  |
| 102      | 2021-12-03 19:10:00  |
| 104      | 2021-12-04 21:57:00  |
| 104      | 2021-12-02 22:57:00  |
| 105      | 2021-12-01 10:01:00  |
+----------+----------------------+

期望结果

+----------+---------------+
| user_id  | max_log_days  |
+----------+---------------+
| 100      | 3             |
| 101      | 6             |
| 102      | 3             |
| 104      | 3             |
| 105      | 1             |
+----------+---------------+

二、分析

本题依旧是连续问题,但是这个连续不是真的连续,允许存在间隔一天,是对连续条件考察的升级版。但是考点并没有发生改变。

维度评分
题目难度⭐️⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️⭐️

三、SQL

1.数据去重

由于数据给出的是操作记录,我们使用to_date函数,得到登陆日期,然后进行去重处理。

执行SQL

select user_id,
       to_date(login_datetime) as login_date
from t14_login_events
group by user_id, to_date(login_datetime)

查询结果

+----------+-------------+
| user_id  | login_date  |
+----------+-------------+
| 100      | 2021-12-01  |
| 100      | 2021-12-02  |
| 100      | 2021-12-03  |
| 101      | 2021-12-01  |
| 101      | 2021-12-03  |
| 101      | 2021-12-05  |
| 101      | 2021-12-06  |
| 102      | 2021-12-01  |
| 102      | 2021-12-02  |
| 102      | 2021-12-03  |
| 104      | 2021-12-02  |
| 104      | 2021-12-04  |
| 105      | 2021-12-01  |
+----------+-------------+

2.计算日期差

根据用户分组,使用lag函数获得当前行的上一行数据中的日期,使用datediff函数判断日期当期日期与上一行日期的时间差。

执行SQL

select user_id,
       login_date,
       lag(login_date, 1, null) over (partition by user_id order by login_date asc)                       as lag_log_date,
       datediff(login_date, lag(login_date, 1, null) over (partition by user_id order by login_date asc)) as date_diff
from (select user_id,
             to_date(login_datetime) as login_date
      from t14_login_events
      group by user_id, to_date(login_datetime)) t1

查询结果

+----------+-------------+---------------+------------+
| user_id  | login_date  | lag_log_date  | date_diff  |
+----------+-------------+---------------+------------+
| 100      | 2021-12-01  | NULL          | NULL       |
| 100      | 2021-12-02  | 2021-12-01    | 1          |
| 100      | 2021-12-03  | 2021-12-02    | 1          |
| 101      | 2021-12-01  | NULL          | NULL       |
| 101      | 2021-12-03  | 2021-12-01    | 2          |
| 101      | 2021-12-05  | 2021-12-03    | 2          |
| 101      | 2021-12-06  | 2021-12-05    | 1          |
| 102      | 2021-12-01  | NULL          | NULL       |
| 102      | 2021-12-02  | 2021-12-01    | 1          |
| 102      | 2021-12-03  | 2021-12-02    | 1          |
| 104      | 2021-12-02  | NULL          | NULL       |
| 104      | 2021-12-04  | 2021-12-02    | 2          |
| 105      | 2021-12-01  | NULL          | NULL       |
+----------+-------------+---------------+------------+

3.判断是否连续,累积求和得到分组id

根据date_diff结果判断是否连续,如果date_diff < = 2则认为连续 我们给赋值为0,否则不连续,赋值为1。

执行SQL

select user_id,
       login_date,
       lag_log_date,
       date_diff,
       sum(if(date_diff <= 2, 0, 1)) over (partition by user_id order by login_date asc) as group_id
from (select user_id,
             login_date,
             lag(login_date, 1, null) over (partition by user_id order by login_date asc)  as lag_log_date,
             datediff(login_date, lag(login_date, 1, null)
                                      over (partition by user_id order by login_date asc)) as date_diff
      from (select user_id,
                   to_date(login_datetime) as login_date
            from t14_login_events
            group by user_id, to_date(login_datetime)) t1) t2

查询结果

+----------+-------------+---------------+------------+-----------+
| user_id  | login_date  | lag_log_date  | date_diff  | group_id  |
+----------+-------------+---------------+------------+-----------+
| 100      | 2021-12-01  | NULL          | NULL       | 1         |
| 100      | 2021-12-02  | 2021-12-01    | 1          | 1         |
| 100      | 2021-12-03  | 2021-12-02    | 1          | 1         |
| 101      | 2021-12-01  | NULL          | NULL       | 1         |
| 101      | 2021-12-03  | 2021-12-01    | 2          | 1         |
| 101      | 2021-12-05  | 2021-12-03    | 2          | 1         |
| 101      | 2021-12-06  | 2021-12-05    | 1          | 1         |
| 102      | 2021-12-01  | NULL          | NULL       | 1         |
| 102      | 2021-12-02  | 2021-12-01    | 1          | 1         |
| 102      | 2021-12-03  | 2021-12-02    | 1          | 1         |
| 104      | 2021-12-02  | NULL          | NULL       | 1         |
| 104      | 2021-12-04  | 2021-12-02    | 2          | 1         |
| 105      | 2021-12-01  | NULL          | NULL       | 1         |
+----------+-------------+---------------+------------+-----------+

4.按照用户和group_id 分组,计算每次连续登陆的天数,再根据用户分组计算最大连续天数

首先根据user_id和group_id分组,用datediff计算出出最大登陆日期和最小登陆日期,两者做差+1 得到每次连续登陆的天数。 然后按照用户分组,使用max()计算每个用户最大连续天数。

执行SQL

select user_id,
       max(log_days) as max_log_days
from (select user_id,
             group_id,
             datediff(max(login_date), min(login_date)) + 1 as log_days
      from (select user_id,
                   login_date,
                   lag_log_date,
                   date_diff,
                   sum(if(date_diff <= 2, 0, 1)) over (partition by user_id order by login_date asc) as group_id
            from (select user_id,
                         login_date,
                         lag(login_date, 1, null) over (partition by user_id order by login_date asc)  as lag_log_date,
                         datediff(login_date, lag(login_date, 1, null)
                                                  over (partition by user_id order by login_date asc)) as date_diff
                  from (select user_id,
                               to_date(login_datetime) as login_date
                        from t14_login_events
                        group by user_id, to_date(login_datetime)) t1) t2) t3
      group by user_id,
               group_id) t4
group by user_id

查询结果

+----------+---------------+
| user_id  | max_log_days  |
+----------+---------------+
| 100      | 3             |
| 101      | 6             |
| 102      | 3             |
| 104      | 3             |
| 105      | 1             |
+----------+---------------+

四、建表语句和数据插入

--建表语句
 
create table if not exists t14_login_events
(
    user_id        int comment '用户id',
    login_datetime string comment '登录时间'
)
    comment '直播间访问记录';
--数据插入
INSERT overwrite table t14_login_events
VALUES (100, '2021-12-01 19:00:00'),
       (100, '2021-12-01 19:30:00'),
       (100, '2021-12-02 21:01:00'),
       (100, '2021-12-03 11:01:00'),
       (101, '2021-12-01 19:05:00'),
       (101, '2021-12-01 21:05:00'),
       (101, '2021-12-03 21:05:00'),
       (101, '2021-12-05 15:05:00'),
       (101, '2021-12-06 19:05:00'),
       (102, '2021-12-01 19:55:00'),
       (102, '2021-12-01 21:05:00'),
       (102, '2021-12-02 21:57:00'),
       (102, '2021-12-03 19:10:00'),
       (104, '2021-12-04 21:57:00'),
       (104, '2021-12-02 22:57:00'),
       (105, '2021-12-01 10:01:00');

本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;