常见大数据面试SQL-各用户最长的连续登录天数-可间断
一、题目
现有各用户的登录记录表t14_login_events如下,表中每行数据表达的信息是一个用户何时登录了平台。 现要求统计各用户最长的连续登录天数,间断一天也算作连续,例如:一个用户在1,3,5,6登录,则视为连续6天登录。
样例数据
+----------+----------------------+
| user_id | login_datetime |
+----------+----------------------+
| 100 | 2021-12-01 19:00:00 |
| 100 | 2021-12-01 19:30:00 |
| 100 | 2021-12-02 21:01:00 |
| 100 | 2021-12-03 11:01:00 |
| 101 | 2021-12-01 19:05:00 |
| 101 | 2021-12-01 21:05:00 |
| 101 | 2021-12-03 21:05:00 |
| 101 | 2021-12-05 15:05:00 |
| 101 | 2021-12-06 19:05:00 |
| 102 | 2021-12-01 19:55:00 |
| 102 | 2021-12-01 21:05:00 |
| 102 | 2021-12-02 21:57:00 |
| 102 | 2021-12-03 19:10:00 |
| 104 | 2021-12-04 21:57:00 |
| 104 | 2021-12-02 22:57:00 |
| 105 | 2021-12-01 10:01:00 |
+----------+----------------------+
期望结果
+----------+---------------+
| user_id | max_log_days |
+----------+---------------+
| 100 | 3 |
| 101 | 6 |
| 102 | 3 |
| 104 | 3 |
| 105 | 1 |
+----------+---------------+
二、分析
本题依旧是连续问题,但是这个连续不是真的连续,允许存在间隔一天,是对连续条件考察的升级版。但是考点并没有发生改变。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.数据去重
由于数据给出的是操作记录,我们使用to_date函数,得到登陆日期,然后进行去重处理。
执行SQL
select user_id,
to_date(login_datetime) as login_date
from t14_login_events
group by user_id, to_date(login_datetime)
查询结果
+----------+-------------+
| user_id | login_date |
+----------+-------------+
| 100 | 2021-12-01 |
| 100 | 2021-12-02 |
| 100 | 2021-12-03 |
| 101 | 2021-12-01 |
| 101 | 2021-12-03 |
| 101 | 2021-12-05 |
| 101 | 2021-12-06 |
| 102 | 2021-12-01 |
| 102 | 2021-12-02 |
| 102 | 2021-12-03 |
| 104 | 2021-12-02 |
| 104 | 2021-12-04 |
| 105 | 2021-12-01 |
+----------+-------------+
2.计算日期差
根据用户分组,使用lag函数获得当前行的上一行数据中的日期,使用datediff函数判断日期当期日期与上一行日期的时间差。
执行SQL
select user_id,
login_date,
lag(login_date, 1, null) over (partition by user_id order by login_date asc) as lag_log_date,
datediff(login_date, lag(login_date, 1, null) over (partition by user_id order by login_date asc)) as date_diff
from (select user_id,
to_date(login_datetime) as login_date
from t14_login_events
group by user_id, to_date(login_datetime)) t1
查询结果
+----------+-------------+---------------+------------+
| user_id | login_date | lag_log_date | date_diff |
+----------+-------------+---------------+------------+
| 100 | 2021-12-01 | NULL | NULL |
| 100 | 2021-12-02 | 2021-12-01 | 1 |
| 100 | 2021-12-03 | 2021-12-02 | 1 |
| 101 | 2021-12-01 | NULL | NULL |
| 101 | 2021-12-03 | 2021-12-01 | 2 |
| 101 | 2021-12-05 | 2021-12-03 | 2 |
| 101 | 2021-12-06 | 2021-12-05 | 1 |
| 102 | 2021-12-01 | NULL | NULL |
| 102 | 2021-12-02 | 2021-12-01 | 1 |
| 102 | 2021-12-03 | 2021-12-02 | 1 |
| 104 | 2021-12-02 | NULL | NULL |
| 104 | 2021-12-04 | 2021-12-02 | 2 |
| 105 | 2021-12-01 | NULL | NULL |
+----------+-------------+---------------+------------+
3.判断是否连续,累积求和得到分组id
根据date_diff结果判断是否连续,如果date_diff < = 2则认为连续 我们给赋值为0,否则不连续,赋值为1。
执行SQL
select user_id,
login_date,
lag_log_date,
date_diff,
sum(if(date_diff <= 2, 0, 1)) over (partition by user_id order by login_date asc) as group_id
from (select user_id,
login_date,
lag(login_date, 1, null) over (partition by user_id order by login_date asc) as lag_log_date,
datediff(login_date, lag(login_date, 1, null)
over (partition by user_id order by login_date asc)) as date_diff
from (select user_id,
to_date(login_datetime) as login_date
from t14_login_events
group by user_id, to_date(login_datetime)) t1) t2
查询结果
+----------+-------------+---------------+------------+-----------+
| user_id | login_date | lag_log_date | date_diff | group_id |
+----------+-------------+---------------+------------+-----------+
| 100 | 2021-12-01 | NULL | NULL | 1 |
| 100 | 2021-12-02 | 2021-12-01 | 1 | 1 |
| 100 | 2021-12-03 | 2021-12-02 | 1 | 1 |
| 101 | 2021-12-01 | NULL | NULL | 1 |
| 101 | 2021-12-03 | 2021-12-01 | 2 | 1 |
| 101 | 2021-12-05 | 2021-12-03 | 2 | 1 |
| 101 | 2021-12-06 | 2021-12-05 | 1 | 1 |
| 102 | 2021-12-01 | NULL | NULL | 1 |
| 102 | 2021-12-02 | 2021-12-01 | 1 | 1 |
| 102 | 2021-12-03 | 2021-12-02 | 1 | 1 |
| 104 | 2021-12-02 | NULL | NULL | 1 |
| 104 | 2021-12-04 | 2021-12-02 | 2 | 1 |
| 105 | 2021-12-01 | NULL | NULL | 1 |
+----------+-------------+---------------+------------+-----------+
4.按照用户和group_id 分组,计算每次连续登陆的天数,再根据用户分组计算最大连续天数
首先根据user_id和group_id分组,用datediff计算出出最大登陆日期和最小登陆日期,两者做差+1 得到每次连续登陆的天数。 然后按照用户分组,使用max()计算每个用户最大连续天数。
执行SQL
select user_id,
max(log_days) as max_log_days
from (select user_id,
group_id,
datediff(max(login_date), min(login_date)) + 1 as log_days
from (select user_id,
login_date,
lag_log_date,
date_diff,
sum(if(date_diff <= 2, 0, 1)) over (partition by user_id order by login_date asc) as group_id
from (select user_id,
login_date,
lag(login_date, 1, null) over (partition by user_id order by login_date asc) as lag_log_date,
datediff(login_date, lag(login_date, 1, null)
over (partition by user_id order by login_date asc)) as date_diff
from (select user_id,
to_date(login_datetime) as login_date
from t14_login_events
group by user_id, to_date(login_datetime)) t1) t2) t3
group by user_id,
group_id) t4
group by user_id
查询结果
+----------+---------------+
| user_id | max_log_days |
+----------+---------------+
| 100 | 3 |
| 101 | 6 |
| 102 | 3 |
| 104 | 3 |
| 105 | 1 |
+----------+---------------+
四、建表语句和数据插入
--建表语句
create table if not exists t14_login_events
(
user_id int comment '用户id',
login_datetime string comment '登录时间'
)
comment '直播间访问记录';
--数据插入
INSERT overwrite table t14_login_events
VALUES (100, '2021-12-01 19:00:00'),
(100, '2021-12-01 19:30:00'),
(100, '2021-12-02 21:01:00'),
(100, '2021-12-03 11:01:00'),
(101, '2021-12-01 19:05:00'),
(101, '2021-12-01 21:05:00'),
(101, '2021-12-03 21:05:00'),
(101, '2021-12-05 15:05:00'),
(101, '2021-12-06 19:05:00'),
(102, '2021-12-01 19:55:00'),
(102, '2021-12-01 21:05:00'),
(102, '2021-12-02 21:57:00'),
(102, '2021-12-03 19:10:00'),
(104, '2021-12-04 21:57:00'),
(104, '2021-12-02 22:57:00'),
(105, '2021-12-01 10:01:00');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;