腾讯大数据面试SQL-连续登陆超过N天的用户
一、题目
现有用户登录日志表 t5_login_log,包含用户ID(user_id),登录日期(login_date)。数据已经按照用户日期去重,请查出连续登录超过4天的用户ID
样例数据
+----------+-------------+
| user_id | login_date |
+----------+-------------+
| 0001 | 20220101 |
| 0001 | 20220102 |
| 0001 | 20220103 |
| 0001 | 20220104 |
| 0001 | 20220105 |
| 0001 | 20220107 |
| 0001 | 20220108 |
| 0001 | 20220109 |
| 0002 | 20220101 |
| 0002 | 20220102 |
| 0002 | 20220103 |
| 0002 | 20220107 |
| 0002 | 20220108 |
| 0003 | 20220107 |
| 0003 | 20220108 |
| 0003 | 20220109 |
+----------+-------------+
二、分析
row_number()函数、datediff()函数、日期格式处理 对于连续登录问题处理逻辑,对于同一个用户,如果连续登录,row_number排序和 登录日期与月初的日期差是一个恒定值,如果存在非连续,则值不同。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️⭐️ |
三、SQL
1.处理日期格式,计算登录日期与月初日期('2022-01-01')差date_diff
执行SQL
select user_id,
login_date,
datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as date_diff
from t5_login_log;
执行结果
+----------+-------------+------------+
| user_id | login_date | date_diff |
+----------+-------------+------------+
| 0001 | 20220101 | 0 |
| 0001 | 20220102 | 1 |
| 0001 | 20220103 | 2 |
| 0001 | 20220104 | 3 |
| 0001 | 20220105 | 4 |
| 0001 | 20220107 | 6 |
| 0001 | 20220108 | 7 |
| 0001 | 20220109 | 8 |
| 0002 | 20220101 | 0 |
| 0002 | 20220102 | 1 |
| 0002 | 20220103 | 2 |
| 0002 | 20220107 | 6 |
| 0002 | 20220108 | 7 |
| 0003 | 20220107 | 6 |
| 0003 | 20220108 | 7 |
| 0003 | 20220109 | 8 |
+----------+-------------+------------+
2.row_number()开窗,计算每个用户每个登录日期的排序 row_num
执行SQL
select user_id,
login_date,
datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as date_diff,
row_number() over (partition by user_id order by login_date asc) as row_num
from t5_login_log;
执行结果
+----------+-------------+------------+----------+
| user_id | login_date | date_diff | row_num |
+----------+-------------+------------+----------+
| 0001 | 20220101 | 0 | 1 |
| 0001 | 20220102 | 1 | 2 |
| 0001 | 20220103 | 2 | 3 |
| 0001 | 20220104 | 3 | 4 |
| 0001 | 20220105 | 4 | 5 |
| 0001 | 20220107 | 6 | 6 |
| 0001 | 20220108 | 7 | 7 |
| 0001 | 20220109 | 8 | 8 |
| 0002 | 20220101 | 0 | 1 |
| 0002 | 20220102 | 1 | 2 |
| 0002 | 20220103 | 2 | 3 |
| 0002 | 20220107 | 6 | 4 |
| 0002 | 20220108 | 7 | 5 |
| 0003 | 20220107 | 6 | 1 |
| 0003 | 20220108 | 7 | 2 |
| 0003 | 20220109 | 8 | 3 |
+----------+-------------+------------+----------+
3.计算date_diff和row_num的差值
执行SQL
select user_id,
login_date,
datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as date_diff,
row_number() over (partition by user_id order by login_date asc) as row_num,
row_number() over (partition by user_id order by login_date asc) -
datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as diff
from t5_login_log;
查询结果
+----------+-------------+------------+----------+-------+
| user_id | login_date | date_diff | row_num | diff |
+----------+-------------+------------+----------+-------+
| 0001 | 20220101 | 0 | 1 | 1 |
| 0001 | 20220102 | 1 | 2 | 1 |
| 0001 | 20220103 | 2 | 3 | 1 |
| 0001 | 20220104 | 3 | 4 | 1 |
| 0001 | 20220105 | 4 | 5 | 1 |
| 0001 | 20220107 | 6 | 6 | 0 |
| 0001 | 20220108 | 7 | 7 | 0 |
| 0001 | 20220109 | 8 | 8 | 0 |
| 0002 | 20220101 | 0 | 1 | 1 |
| 0002 | 20220102 | 1 | 2 | 1 |
| 0002 | 20220103 | 2 | 3 | 1 |
| 0002 | 20220107 | 6 | 4 | -2 |
| 0002 | 20220108 | 7 | 5 | -2 |
| 0003 | 20220107 | 6 | 1 | -5 |
| 0003 | 20220108 | 7 | 2 | -5 |
| 0003 | 20220109 | 8 | 3 | -5 |
+----------+-------------+------------+----------+-------+
4.统计连续登录大于等于4天的用户
执行SQL
select user_id
from (select user_id,
diff,
count(1) as login_days
from (select user_id,
login_date,
row_number() over (partition by user_id order by login_date asc) -
datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as diff
from t5_login_log) t
group by user_id, diff) tt
where login_days >= 4
group by user_id
查询结果
+----------+
| user_id |
+----------+
| 0001 |
+----------+
四、建表语句和数据插入
-- 建表语句
create table t5_login_log
(
user_id string comment '用户ID',
login_date string comment '登录日期'
);
--数据插入语句
insert into t5_login_log values
('0001','20220101'),
('0001','20220102'),
('0001','20220103'),
('0001','20220104'),
('0001','20220105'),
('0001','20220107'),
('0001','20220108'),
('0001','20220109'),
('0002','20220101'),
('0002','20220102'),
('0002','20220103'),
('0002','20220107'),
('0002','20220108'),
('0003','20220107'),
('0003','20220108'),
('0003','20220109');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;