面试真题
滴滴
2.截止目前登陆用户数及登陆用户列表

滴滴大数据面试SQL-截止目前登陆用户数及登陆用户列表

一、题目

已知有用户登录记录表包含登录日期和登录用户ID,请查询出截止到当前日期累积登录用户数及登陆用户列表

样例数据

+-------------+----------+
|  log_date   | user_id  |
+-------------+----------+
| 2024-01-01  | a        |
| 2024-01-02  | a        |
| 2024-01-02  | b        |
| 2024-01-03  | b        |
| 2024-01-04  | c        |
| 2024-01-05  | b        |
| 2024-01-05  | c        |
| 2024-01-05  | d        |
| 2024-01-05  | e        |
+-------------+----------+

期望结果

+-------------+-----------+------------------------+
|  log_date   | user_cnt  |       user_list        |
+-------------+-----------+------------------------+
| 2024-01-01  | 1         | ["a"]                  |
| 2024-01-02  | 2         | ["a","b"]              |
| 2024-01-03  | 2         | ["a","b"]              |
| 2024-01-04  | 3         | ["a","b","c"]          |
| 2024-01-05  | 5         | ["a","b","c","d","e"]  |
+-------------+-----------+------------------------+

二、分析

统计截止到当前行的登录用户数,考察的是聚合函数开窗函数;查询用户列表考察的数据对数据的聚合、数组去重、数组排序等操作。属于深度考察开窗函数、数组操作等知识内容。

维度评分
题目难度⭐️⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐
业务常见度⭐️⭐️⭐️⭐️

三、SQL

1.数据聚合

  1. 通过使用count(distinct )over(order by )的方式,聚合函数开窗,带有排序则统计到当前行的方式 ,完成对截止到当前行的数据统计;
  2. collect_set 也是对数据的聚合,所以也可以使用相同的方式完成截止到当前行的聚合;

执行SQL

select log_date,
       user_id,
       count(user_id) over (order by log_date asc)       as user_cnt,
       collect_list(user_id) over (order by log_date asc) as user_list
from t2_user_login

查询结果

+-------------+----------+-----------+----------------------------------------+
|  log_date   | user_id  | user_cnt  |               user_list                |
+-------------+----------+-----------+----------------------------------------+
| 2024-01-01  | a        | 1         | ["a"]                                  |
| 2024-01-02  | b        | 3         | ["a","b","a"]                          |
| 2024-01-02  | a        | 3         | ["a","b","a"]                          |
| 2024-01-03  | b        | 4         | ["a","b","a","b"]                      |
| 2024-01-04  | c        | 5         | ["a","b","a","b","c"]                  |
| 2024-01-05  | e        | 9         | ["a","b","a","b","c","e","d","c","b"]  |
| 2024-01-05  | d        | 9         | ["a","b","a","b","c","e","d","c","b"]  |
| 2024-01-05  | c        | 9         | ["a","b","a","b","c","e","d","c","b"]  |
| 2024-01-05  | b        | 9         | ["a","b","a","b","c","e","d","c","b"]  |
+-------------+----------+-----------+----------------------------------------+

注意

  1. 这里count()函数中特意没有使用去重函数,主要是为了方便大家观察第2行、第3行数据,可以看到user_cnt都是3。这里说明order by 之后 有相同的数据,则取较大值,并且不区分先后顺序;
  2. collect_list()函数进行聚合处理之后的结果,user_list也是一样的,包含到截止相同排序的最后一行数据。

2.数据去重聚合

这里我们加上去重,使用collect_set替换掉collect_list

执行SQL

select log_date,
       user_id,
       size(collect_set(user_id) over (order by log_date asc)) as user_cnt,
       collect_set(user_id) over (order by log_date asc)    as user_list
from t2_user_login

查询结果

+-------------+----------+-----------+------------------------+
|  log_date   | user_id  | user_cnt  |       user_list        |
+-------------+----------+-----------+------------------------+
| 2024-01-01  | a        | 1         | ["a"]                  |
| 2024-01-02  | b        | 2         | ["a","b"]              |
| 2024-01-02  | a        | 2         | ["a","b"]              |
| 2024-01-03  | b        | 2         | ["a","b"]              |
| 2024-01-04  | c        | 3         | ["a","b","c"]          |
| 2024-01-05  | e        | 5         | ["a","b","c","e","d"]  |
| 2024-01-05  | d        | 5         | ["a","b","c","e","d"]  |
| 2024-01-05  | c        | 5         | ["a","b","c","e","d"]  |
| 2024-01-05  | b        | 5         | ["a","b","c","e","d"]  |
+-------------+----------+-----------+------------------------+

3.对用户列表排序,然后去重得到最后结果

上面数据观察可知相同日期的结果相同我们只要对用户列表进行排序,然后使用group by 进行去重即可得到最终结果

执行SQL

select log_date,
       user_cnt,
       user_list
from (select log_date,
             user_id,
             size(collect_set(user_id) over (order by log_date asc))          as user_cnt,
             sort_array(collect_set(user_id) over (order by log_date asc)) as user_list
      from t2_user_login) t
group by log_date, user_cnt, user_list

查询结果

+-------------+-----------+------------------------+
|  log_date   | user_cnt  |       user_list        |
+-------------+-----------+------------------------+
| 2024-01-01  | 1         | ["a"]                  |
| 2024-01-02  | 2         | ["a","b"]              |
| 2024-01-03  | 2         | ["a","b"]              |
| 2024-01-04  | 3         | ["a","b","c"]          |
| 2024-01-05  | 5         | ["a","b","c","d","e"]  |
+-------------+-----------+------------------------+

四、建表语句和数据插入

-- 建表语句
CREATE TABLE t2_user_login
(
log_date STRING,
user_id STRING);
-- 数据插入语句
INSERT INTO t2_user_login
(log_date, user_id) VALUES
('2024-01-01','a'),
('2024-01-02','a'),
('2024-01-02','b'),
('2024-01-03','b'),
('2024-01-04','c'),
('2024-01-05','b'),
('2024-01-05','c'),
('2024-01-05','d'),
('2024-01-05','e');
  • 本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;

  • 欢迎扫码关注公众号,添加博主好友加入数据仓库技术交流群;

  • 如果觉得本文对您有帮助,欢迎扫码打赏

扫码备注 “数据仓库”加入数据仓库技术交流群

微信扫码备注「数据仓库」
加入数据仓库技术交流群

关注公众号数据仓库技术

微信关注公众号「数据仓库技术」

微信扫码buy me a coffee

微信扫码打赏

支付宝扫码buy me a coffee

支付宝扫码打赏