面试真题
常见题目
18.常见大数据面试SQL-近30天连续登陆3天以上次数-非开窗

常见大数据面试SQL-近30天连续登陆3天以上次数-非开窗

一、题目

有用户登录记录,已经按照日期去重。求近三十天,用户连续登录超过3天的次数,一直连续登录算一次,有间隔然后重新计算次数, 要求不能用开窗

样例数据

+----------+-------------+
| user_id  | login_date  |
+----------+-------------+
| 001      | 2024-07-03  |
| 001      | 2024-07-04  |
| 001      | 2024-07-05  |
| 001      | 2024-07-06  |
| 001      | 2024-07-12  |
| 001      | 2024-07-13  |
| 001      | 2024-07-17  |
| 001      | 2024-07-18  |
| 001      | 2024-07-19  |
| 001      | 2024-07-20  |
| 001      | 2024-07-21  |
+----------+-------------+

期望结果

+----------+-----------+
| user_id  | count(1)  |
+----------+-----------+
| 001      | 2         |
+----------+-----------+

二、分析

  • 这里要求求近30天,由于样例数据固定,所以我们改为求'2024-08-01'的近30天,如果在实际应用中可以使用current_date函数,实际传入参数等方式。
  • 题目属于连续问题,连续问题我们做过总结,解决方案参考一文搞懂连续问题 (opens in a new tab)
  • 题目中明确要求不能使用开窗函数,实际工作中确实存在不支持开窗的场景。开窗函数是一个很有帮助的技巧,但是不使用开窗,直接sql常规处理,很锻炼数据思维能力。

数据给出了用户登陆的日期,求连续登陆天数,比较难统计,如果我们把未登陆的日期找到,未登录的日期之间的日期就是登陆日期,未登录日期的差值-1则是连续登陆的天数.

维度评分
题目难度⭐️⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️

三、SQL

1.找出未登录日期

题目中没有给出日期维表,这里可以假设有,也可以直接生成函数做一个,这里我们生成一下2024-07-01至2024-08-02的日期维表。然后使用维表和用户登陆记录表进行关联。日期能关联上的是登陆记录,关联不上的是未登陆记录,找到未登陆的记录。因为要计算日期差,我们把每个用户第31天前的数据记录为未登录。

执行SQL

with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
     un_login as (select t1.user_id, t2.dim_date, t3.login_date
                  from (
                           --近30天登陆用户去重
                           select user_id
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')
                           group by user_id) t1
                           full outer join
                       (
                           -- 近30天日期维度数据
                           select dim_date
                           from t_dim_date
                           where dim_date > date_add('2024-08-01', -30)
                             and dim_date <= to_date('2024-08-01')) t2
                           left join
                       (
                           --近30天登陆记录
                           select user_id, login_date
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')) t3
                       on t1.user_id = t3.user_id
                           and t2.dim_date = t3.login_date
                  where t3.login_date is null
                  union all 
                  
                           --用户第31天前设置为未登录
                           select user_id,date_add('2024-08-01', -30) as dim_date,null
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')
                           group by user_id
                  )
select *
from un_login

查询结果

+----------+-------------+-------------+
| user_id  |  dim_date   | login_date  |
+----------+-------------+-------------+
| 001      | 2024-07-07  | NULL        |
| 001      | 2024-07-08  | NULL        |
| 001      | 2024-07-09  | NULL        |
| 001      | 2024-07-10  | NULL        |
| 001      | 2024-07-11  | NULL        |
| 001      | 2024-07-14  | NULL        |
| 001      | 2024-07-15  | NULL        |
| 001      | 2024-07-16  | NULL        |
| 001      | 2024-07-22  | NULL        |
| 001      | 2024-07-23  | NULL        |
| 001      | 2024-07-24  | NULL        |
| 001      | 2024-07-25  | NULL        |
| 001      | 2024-07-26  | NULL        |
| 001      | 2024-07-27  | NULL        |
| 001      | 2024-07-28  | NULL        |
| 001      | 2024-07-29  | NULL        |
| 001      | 2024-07-30  | NULL        |
| 001      | 2024-07-31  | NULL        |
| 001      | 2024-08-01  | NULL        |
| 001      | 2024-07-02  | NULL        |
+----------+-------------+-------------+

2.计算未登陆日期和前一未登陆日期

由于不能使用开窗函数,我们使用未登录记录表根据use_id自关联,限制t1的dim_date小于t2的dim_date,按照t2的dim_date分组,找到t1的最大日期--即用户t2的未登录日期的前一个未登陆日期。

执行SQL

with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
     un_login as (select t1.user_id, t2.dim_date, t3.login_date
                  from (
                           --近30天登陆用户去重
                           select user_id
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')
                           group by user_id) t1
                           full outer join
                       (
                           -- 近30天日期维度数据
                           select dim_date
                           from t_dim_date
                           where dim_date > date_add('2024-08-01', -30)
                             and dim_date <= to_date('2024-08-01')) t2
                           left join
                       (
                           --近30天登陆记录
                           select user_id, login_date
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')) t3
                       on t1.user_id = t3.user_id
                           and t2.dim_date = t3.login_date
                  where t3.login_date is null
                  union all
 
                  --用户第31天前设置为未登录
                  select user_id, date_add('2024-08-01', -30) as dim_date, null
                  from t18_use_login
                  where login_date > date_add('2024-08-01', -30)
                    and login_date <= to_date('2024-08-01')
                  group by user_id)
select t2.user_id, t2.dim_date as unlogin_date, max(t1.dim_date) as last_unlogin_date
from un_login t1
         join un_login t2
              on t1.user_id = t2.user_id
where t1.dim_date < t2.dim_date
group by t2.user_id, t2.dim_date
order by t2.user_id, t2.dim_date

SQL结果

+----------+---------------+--------------------+
| user_id  | unlogin_date  | last_unlogin_date  |
+----------+---------------+--------------------+
| 001      | 2024-07-07    | 2024-07-02         |
| 001      | 2024-07-08    | 2024-07-07         |
| 001      | 2024-07-09    | 2024-07-08         |
| 001      | 2024-07-10    | 2024-07-09         |
| 001      | 2024-07-11    | 2024-07-10         |
| 001      | 2024-07-14    | 2024-07-11         |
| 001      | 2024-07-15    | 2024-07-14         |
| 001      | 2024-07-16    | 2024-07-15         |
| 001      | 2024-07-22    | 2024-07-16         |
| 001      | 2024-07-23    | 2024-07-22         |
| 001      | 2024-07-24    | 2024-07-23         |
| 001      | 2024-07-25    | 2024-07-24         |
| 001      | 2024-07-26    | 2024-07-25         |
| 001      | 2024-07-27    | 2024-07-26         |
| 001      | 2024-07-28    | 2024-07-27         |
| 001      | 2024-07-29    | 2024-07-28         |
| 001      | 2024-07-30    | 2024-07-29         |
| 001      | 2024-07-31    | 2024-07-30         |
| 001      | 2024-08-01    | 2024-07-31         |
+----------+---------------+--------------------+

3.计算日期差,得出连续登陆天数

因为连续两天的日期差为1,日期差-1则为连续登陆天数

执行SQL

with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
     un_login as (select t1.user_id, t2.dim_date, t3.login_date
                  from (
                           --近30天登陆用户去重
                           select user_id
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')
                           group by user_id) t1
                           full outer join
                       (
                           -- 近30天日期维度数据
                           select dim_date
                           from t_dim_date
                           where dim_date > date_add('2024-08-01', -30)
                             and dim_date <= to_date('2024-08-01')) t2
                           left join
                       (
                           --近30天登陆记录
                           select user_id, login_date
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')) t3
                       on t1.user_id = t3.user_id
                           and t2.dim_date = t3.login_date
                  where t3.login_date is null
                  union all
 
                  --用户第31天前设置为未登录
                  select user_id, date_add('2024-08-01', -30) as dim_date, null
                  from t18_use_login
                  where login_date > date_add('2024-08-01', -30)
                    and login_date <= to_date('2024-08-01')
                  group by user_id)
select user_id,unlogin_date,last_unlogin_date,datediff(unlogin_date,last_unlogin_date)-1 as con_days
    from(
select t2.user_id, t2.dim_date as unlogin_date, max(t1.dim_date) as last_unlogin_date
from un_login t1
         join un_login t2
              on t1.user_id = t2.user_id
where t1.dim_date < t2.dim_date
group by t2.user_id, t2.dim_date) t
order by user_id,unlogin_date

SQL结果

+----------+---------------+--------------------+-----------+
| user_id  | unlogin_date  | last_unlogin_date  | con_days  |
+----------+---------------+--------------------+-----------+
| 001      | 2024-07-07    | 2024-07-02         | 4         |
| 001      | 2024-07-08    | 2024-07-07         | 0         |
| 001      | 2024-07-09    | 2024-07-08         | 0         |
| 001      | 2024-07-10    | 2024-07-09         | 0         |
| 001      | 2024-07-11    | 2024-07-10         | 0         |
| 001      | 2024-07-14    | 2024-07-11         | 2         |
| 001      | 2024-07-15    | 2024-07-14         | 0         |
| 001      | 2024-07-16    | 2024-07-15         | 0         |
| 001      | 2024-07-22    | 2024-07-16         | 5         |
| 001      | 2024-07-23    | 2024-07-22         | 0         |
| 001      | 2024-07-24    | 2024-07-23         | 0         |
| 001      | 2024-07-25    | 2024-07-24         | 0         |
| 001      | 2024-07-26    | 2024-07-25         | 0         |
| 001      | 2024-07-27    | 2024-07-26         | 0         |
| 001      | 2024-07-28    | 2024-07-27         | 0         |
| 001      | 2024-07-29    | 2024-07-28         | 0         |
| 001      | 2024-07-30    | 2024-07-29         | 0         |
| 001      | 2024-07-31    | 2024-07-30         | 0         |
| 001      | 2024-08-01    | 2024-07-31         | 0         |
+----------+---------------+--------------------+-----------+

可以看到用户的连续登陆天数数据

4.限制con_days >=3 得到最终结果

执行SQL

with t_dim_date as (SELECT explode(sequence(to_date('2024-07-03'), to_date('2024-08-01'), interval 1 day)) as dim_date),
     un_login as (select t1.user_id, t2.dim_date, t3.login_date
                  from (
                           --近30天登陆用户去重
                           select user_id
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')
                           group by user_id) t1
                           full outer join
                       (
                           -- 近30天日期维度数据
                           select dim_date
                           from t_dim_date
                           where dim_date > date_add('2024-08-01', -30)
                             and dim_date <= to_date('2024-08-01')) t2
                           left join
                       (
                           --近30天登陆记录
                           select user_id, login_date
                           from t18_use_login
                           where login_date > date_add('2024-08-01', -30)
                             and login_date <= to_date('2024-08-01')) t3
                       on t1.user_id = t3.user_id
                           and t2.dim_date = t3.login_date
                  where t3.login_date is null
                  union all
 
                  --用户第31天前设置为未登录
                  select user_id, date_add('2024-08-01', -30) as dim_date, null
                  from t18_use_login
                  where login_date > date_add('2024-08-01', -30)
                    and login_date <= to_date('2024-08-01')
                  group by user_id)
select user_id, count(1)
from (select t2.user_id, t2.dim_date as unlogin_date, max(t1.dim_date) as last_unlogin_date
      from un_login t1
               join un_login t2
                    on t1.user_id = t2.user_id
      where t1.dim_date < t2.dim_date
      group by t2.user_id, t2.dim_date) t
where datediff(unlogin_date, last_unlogin_date) - 1 >= 3
group by user_id

查询结果

+----------+-----------+
| user_id  | count(1)  |
+----------+-----------+
| 001      | 2         |
+----------+-----------+

四、建表语句和数据插入

--建表语句
CREATE TABLE IF NOT EXISTS t18_use_login
(
    user_id      string, -- 用户id
    login_date    string -- 登陆日期
)
    COMMENT '用户登录记录表';
--插入数据
INSERT INTO t18_use_login VALUES
('001','2024-07-03'),
('001','2024-07-04'),
('001','2024-07-05'),
('001','2024-07-06'),
('001','2024-07-12'),
('001','2024-07-13'),
('001','2024-07-17'),
('001','2024-07-18'),
('001','2024-07-19'),
('001','2024-07-20'),
('001','2024-07-21');

本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab) 发表;