面试真题
腾讯
5.连续登陆超过N天的用户

腾讯大数据面试SQL-连续登陆超过N天的用户

一、题目

现有用户登录日志表 t5_login_log,包含用户ID(user_id),登录日期(login_date)。数据已经按照用户日期去重,请查出连续登录超过4天的用户ID

样例数据

+----------+-------------+
| user_id  | login_date  |
+----------+-------------+
| 0001     | 20220101    |
| 0001     | 20220102    |
| 0001     | 20220103    |
| 0001     | 20220104    |
| 0001     | 20220105    |
| 0001     | 20220107    |
| 0001     | 20220108    |
| 0001     | 20220109    |
| 0002     | 20220101    |
| 0002     | 20220102    |
| 0002     | 20220103    |
| 0002     | 20220107    |
| 0002     | 20220108    |
| 0003     | 20220107    |
| 0003     | 20220108    |
| 0003     | 20220109    |
+----------+-------------+

二、分析

row_number()函数、datediff()函数、日期格式处理 对于连续登录问题处理逻辑,对于同一个用户,如果连续登录,row_number排序和 登录日期与月初的日期差是一个恒定值,如果存在非连续,则值不同。

维度评分
题目难度⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️⭐️⭐️

三、SQL

1.处理日期格式,计算登录日期与月初日期('2022-01-01')差date_diff

执行SQL

select user_id,
       login_date,
       datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as date_diff
from t5_login_log;

执行结果

+----------+-------------+------------+
| user_id  | login_date  | date_diff  |
+----------+-------------+------------+
| 0001     | 20220101    | 0          |
| 0001     | 20220102    | 1          |
| 0001     | 20220103    | 2          |
| 0001     | 20220104    | 3          |
| 0001     | 20220105    | 4          |
| 0001     | 20220107    | 6          |
| 0001     | 20220108    | 7          |
| 0001     | 20220109    | 8          |
| 0002     | 20220101    | 0          |
| 0002     | 20220102    | 1          |
| 0002     | 20220103    | 2          |
| 0002     | 20220107    | 6          |
| 0002     | 20220108    | 7          |
| 0003     | 20220107    | 6          |
| 0003     | 20220108    | 7          |
| 0003     | 20220109    | 8          |
+----------+-------------+------------+

2.row_number()开窗,计算每个用户每个登录日期的排序 row_num

执行SQL

select user_id,
       login_date,
       datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as date_diff,
       row_number() over (partition by user_id order by login_date asc)                            as row_num
from t5_login_log;

执行结果

+----------+-------------+------------+----------+
| user_id  | login_date  | date_diff  | row_num  |
+----------+-------------+------------+----------+
| 0001     | 20220101    | 0          | 1        |
| 0001     | 20220102    | 1          | 2        |
| 0001     | 20220103    | 2          | 3        |
| 0001     | 20220104    | 3          | 4        |
| 0001     | 20220105    | 4          | 5        |
| 0001     | 20220107    | 6          | 6        |
| 0001     | 20220108    | 7          | 7        |
| 0001     | 20220109    | 8          | 8        |
| 0002     | 20220101    | 0          | 1        |
| 0002     | 20220102    | 1          | 2        |
| 0002     | 20220103    | 2          | 3        |
| 0002     | 20220107    | 6          | 4        |
| 0002     | 20220108    | 7          | 5        |
| 0003     | 20220107    | 6          | 1        |
| 0003     | 20220108    | 7          | 2        |
| 0003     | 20220109    | 8          | 3        |
+----------+-------------+------------+----------+

3.计算date_diff和row_num的差值

执行SQL

select user_id,
       login_date,
       datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as date_diff,
       row_number() over (partition by user_id order by login_date asc)                            as row_num,
       row_number() over (partition by user_id order by login_date asc) -
       datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as diff
from t5_login_log;

查询结果

+----------+-------------+------------+----------+-------+
| user_id  | login_date  | date_diff  | row_num  | diff  |
+----------+-------------+------------+----------+-------+
| 0001     | 20220101    | 0          | 1        | 1     |
| 0001     | 20220102    | 1          | 2        | 1     |
| 0001     | 20220103    | 2          | 3        | 1     |
| 0001     | 20220104    | 3          | 4        | 1     |
| 0001     | 20220105    | 4          | 5        | 1     |
| 0001     | 20220107    | 6          | 6        | 0     |
| 0001     | 20220108    | 7          | 7        | 0     |
| 0001     | 20220109    | 8          | 8        | 0     |
| 0002     | 20220101    | 0          | 1        | 1     |
| 0002     | 20220102    | 1          | 2        | 1     |
| 0002     | 20220103    | 2          | 3        | 1     |
| 0002     | 20220107    | 6          | 4        | -2    |
| 0002     | 20220108    | 7          | 5        | -2    |
| 0003     | 20220107    | 6          | 1        | -5    |
| 0003     | 20220108    | 7          | 2        | -5    |
| 0003     | 20220109    | 8          | 3        | -5    |
+----------+-------------+------------+----------+-------+

4.统计连续登录大于等于4天的用户

执行SQL

select user_id
from (select user_id,
             diff,
             count(1) as login_days
      from (select user_id,
                   login_date,
                   row_number() over (partition by user_id order by login_date asc) -
                   datediff(from_unixtime(unix_timestamp(login_date, 'yyyyMMdd'), 'yyyy-MM-dd'), '2022-01-01') as diff
            from t5_login_log) t
      group by user_id, diff) tt
where login_days >= 4
group by user_id

查询结果

+----------+
| user_id  |
+----------+
| 0001     |
+----------+

四、建表语句和数据插入

-- 建表语句
create table t5_login_log
(
user_id string comment '用户ID',
login_date string comment '登录日期'
);
 
--数据插入语句
insert into t5_login_log values
('0001','20220101'),
('0001','20220102'),
('0001','20220103'),
('0001','20220104'),
('0001','20220105'),
('0001','20220107'),
('0001','20220108'),
('0001','20220109'),
('0002','20220101'),
('0002','20220102'),
('0002','20220103'),
('0002','20220107'),
('0002','20220108'),
('0003','20220107'),
('0003','20220108'),
('0003','20220109');

本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;