面试真题
常见题目
7.计算次日留存率

常见大数据面试SQL-计算次日留存率

一、题目

现有用户登录记录表,已经按照用户日期进行去重处理。以用户登录的最早日期作为新增日期,请计算次日留存率是多少。

+----------+-------------+
| user_id  | login_date  |
+----------+-------------+
| aaa      | 2023-12-01  |
| bbb      | 2023-12-01  |
| bbb      | 2023-12-02  |
| ccc      | 2023-12-02  |
| bbb      | 2023-12-03  |
| ccc      | 2023-12-03  |
| ddd      | 2023-12-03  |
| ccc      | 2023-12-04  |
| ddd      | 2023-12-04  |
+----------+-------------+

二、分析

维度评分
题目难度⭐️⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️⭐️⭐️

三、SQL

指标定义:

次日留存用户:新增用户第二天登录(活跃)的用户;

次日留存率:t+1日留存用户数/t日新增用户;

1.根据登录日志,使用开窗函数计算出用户的最小登录时间作为新增日期first_day,然后计算当天日期和新增日期的时间差。

执行SQL

select user_id,
       login_date,
       min(login_date) over (partition by user_id order by login_date asc)                       as first_day,
       datediff(login_date, min(login_date) over (partition by user_id order by login_date asc)) as date_diff
from t7_login

查询结果

+----------+-------------+-------------+------------+
| user_id  | login_date  |  first_day  | date_diff  |
+----------+-------------+-------------+------------+
| aaa      | 2023-12-01  | 2023-12-01  | 0          |
| bbb      | 2023-12-01  | 2023-12-01  | 0          |
| bbb      | 2023-12-02  | 2023-12-01  | 1          |
| bbb      | 2023-12-03  | 2023-12-01  | 2          |
| ccc      | 2023-12-02  | 2023-12-02  | 0          |
| ccc      | 2023-12-03  | 2023-12-02  | 1          |
| ccc      | 2023-12-04  | 2023-12-02  | 2          |
| ddd      | 2023-12-03  | 2023-12-03  | 0          |
| ddd      | 2023-12-04  | 2023-12-03  | 1          |
+----------+-------------+-------------+------------+

2.我们根据first_day进行分组,date_diff=0的为当天新增用户,date_diff=1的为次日登录的用户

执行SQL

select first_day,
       count(case when date_diff = 0 then user_id end) as new_cnt,
       count(case when date_diff = 1 then user_id end) as next_act_cnt
from (select user_id,
             login_date,
             min(login_date) over (partition by user_id order by login_date asc)                       as first_day,
             datediff(login_date, min(login_date) over (partition by user_id order by login_date asc)) as date_diff
      from t7_login) t
group by first_day
order by first_day asc

查询结果

+-------------+----------+---------------+
|  first_day  | new_cnt  | next_act_cnt  |
+-------------+----------+---------------+
| 2023-12-01  | 2        | 1             |
| 2023-12-02  | 1        | 1             |
| 2023-12-03  | 1        | 1             |
+-------------+----------+---------------+

3.用次日留存数/新增用户数据即为留存率,因为新增可能为0,所以需要先判断。

select first_day,
       concat(if(count(case when date_diff = 0 then user_id end) = 0, 0,
                 count(case when date_diff = 1 then user_id end) / count(case when date_diff = 0 then user_id end)) *
              100, '%') as next_act_per
from (select user_id,
             login_date,
             min(login_date) over (partition by user_id order by login_date asc)                       as first_day,
             datediff(login_date, min(login_date) over (partition by user_id order by login_date asc)) as date_diff
      from t7_login) t
group by first_day
order by first_day asc

查询结果

+-------------+---------------+
|  first_day  | next_act_per  |
+-------------+---------------+
| 2023-12-01  | 50.0%         |
| 2023-12-02  | 100.0%        |
| 2023-12-03  | 100.0%        |
+-------------+---------------+

四、建表语句和数据插入

create table t7_login
(
user_id string COMMENT '用户ID',
login_date string COMMENT '登录日期'
) COMMENT '用户登录记录表';
 
insert into t7_login(user_id,login_date)
values
('aaa','2023-12-01'),
('bbb','2023-12-01'),
('bbb','2023-12-02'),
('ccc','2023-12-02'),
('bbb','2023-12-03'),
('ccc','2023-12-03'),
('ddd','2023-12-03'),
('ccc','2023-12-04'),
('ddd','2023-12-04');
 

本文同步在微信公众号“数据仓库技术”和“个人博客”数据仓库技术 (opens in a new tab)发表;