常见大数据面试SQL-连续点击三次用户
一、题目
有用户点击日志记录表 t2_click_log,包含user_id(用户ID),click_time(点击时间),请查询出连续点击三次的用户数,
连续点击三次:指点击记录中同一用户连续点击,中间无其他用户点击;
+----------+--------------+
| user_id | click_time |
+----------+--------------+
| 1 | 1736337600 |
| 2 | 1736337670 |
| 1 | 1736337710 |
| 1 | 1736337715 |
| 1 | 1736337750 |
| 2 | 1736337760 |
| 3 | 1736337820 |
| 3 | 1736337840 |
| 3 | 1736337850 |
| 3 | 1736337910 |
| 4 | 1736337915 |
+----------+--------------+
二、分析
1.连续问题,最大连续次数的变种问题;
2.思路一:累积求和分组法(此种方式比连续N天登录略难一些)
(2.1)按照时间排序之后,使用lag()函数可以判断出当前行用户与上一行用户,是否是同一个用户;
(2.2)与上一行是同一个用户的日志,则给该行打标0,不同打标1;(属于经验:要累积求和,0和不变,1和+1.)
(2.3)对打标完成的标签,进行累积求和,和相同代表属于同一用户连续(聚合函数开窗);
(2.4)对累积求和的值进行统计,相同值的个数>=3 表示连续登录;
3.思路二:双重排序差值法(难度同连续N天登录)
(3.1)与连续登录天数类似。按照点击时间(click_time) 进行全排序,按照用户ID(user_id)分组,按照点击时间排序;
(3.2)对两次排序计算差值,按照用户和差值进行分组,相同用户,差值相同说明连续;
(3.3)计算属于同一分组的数量>=3则是连续,统计用户数得出结果;
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
3.1 累积求和法
1.增加一列is_same_user,判断是否与上一行是同一用户点击,是取0,否取1,第一行默认为0;
执行SQL
select user_id,
click_time,
case
when lag(user_id) over (order by click_time asc) is null then 0
when user_id = lag(user_id) over (order by click_time asc) then 0
else 1 end as is_same_user
from t2_click_log
查询结果
+----------+-------------+---------------+
| user_id | click_time | is_same_user |
+----------+-------------+---------------+
| 1 | 1736337600 | 0 |
| 2 | 1736337670 | 1 |
| 1 | 1736337710 | 1 |
| 1 | 1736337715 | 0 |
| 1 | 1736337750 | 0 |
| 2 | 1736337760 | 1 |
| 3 | 1736337820 | 1 |
| 3 | 1736337840 | 0 |
| 3 | 1736337850 | 0 |
| 3 | 1736337910 | 0 |
| 4 | 1736337915 | 1 |
+----------+-------------+---------------+
2.对是否是同一用户分组进行累积求和
执行SQL
select user_id,
click_time,
sum(is_same_user) over (order by click_time asc) as sum_order
from (select user_id,
click_time,
case
when lag(user_id) over (order by click_time asc) is null then 0
when user_id = lag(user_id) over (order by click_time asc) then 0
else 1 end as is_same_user
from t2_click_log) t
查询结果
+----------+-------------+------------+
| user_id | click_time | sum_order |
+----------+-------------+------------+
| 1 | 1736337600 | 0 |
| 2 | 1736337670 | 1 |
| 1 | 1736337710 | 2 |
| 1 | 1736337715 | 2 |
| 1 | 1736337750 | 2 |
| 2 | 1736337760 | 3 |
| 3 | 1736337820 | 4 |
| 3 | 1736337840 | 4 |
| 3 | 1736337850 | 4 |
| 3 | 1736337910 | 4 |
| 4 | 1736337915 | 5 |
+----------+-------------+------------+
3.查询sum_order值个数>3的用户
执行SQL
select user_id,
sum_order,
count(1)
from (select user_id,
click_time,
sum(is_same_user) over (order by click_time asc) as sum_order
from (select user_id,
click_time,
case
when lag(user_id) over (order by click_time asc) is null then 0
when user_id = lag(user_id) over (order by click_time asc) then 0
else 1 end as is_same_user
from t2_click_log) t) tt
group by user_id, sum_order
having count(1) >= 3
查询结果
+----------+------------+------+
| user_id | sum_order | _c2 |
+----------+------------+------+
| 1 | 2 | 3 |
| 3 | 4 | 4 |
+----------+------------+------+
4.查询最终用户数量
执行SQL
select count(distinct user_id) as user_num
from (select user_id,
sum_order,
count(1)
from (select user_id,
click_time,
sum(is_same_user) over (order by click_time asc) as sum_order
from (select user_id,
click_time,
case
when lag(user_id) over (order by click_time asc) is null then 0
when user_id = lag(user_id) over (order by click_time asc) then 0
else 1 end as is_same_user
from t2_click_log) t) tt
group by user_id, sum_order
having count(1) >= 3) ttt
查询结果
+-----------+
| user_num |
+-----------+
| 2 |
+-----------+
3.2 双重排序差值法
1.分别按照时间,按照不分组和按照用户分组进行排序;
执行SQL
select user_id,
click_time,
row_number() over (order by click_time asc) as row_num1,
row_number() over (partition by user_id order by click_time asc) as row_num2
from t2_click_log
查询结果
+----------+-------------+-----------+-----------+
| user_id | click_time | row_num1 | row_num2 |
+----------+-------------+-----------+-----------+
| 1 | 1736337600 | 1 | 1 |
| 1 | 1736337710 | 3 | 2 |
| 1 | 1736337715 | 4 | 3 |
| 1 | 1736337750 | 5 | 4 |
| 2 | 1736337670 | 2 | 1 |
| 2 | 1736337760 | 6 | 2 |
| 3 | 1736337820 | 7 | 1 |
| 3 | 1736337840 | 8 | 2 |
| 3 | 1736337850 | 9 | 3 |
| 3 | 1736337910 | 10 | 4 |
| 4 | 1736337915 | 11 | 1 |
+----------+-------------+-----------+-----------+
2.计算差值并按照用户和差值进行分组
执行SQL
select user_id,
diff,
count(1) as aa
from (select user_id,
click_time,
row_number() over (order by click_time asc) -
row_number() over (partition by user_id order by click_time asc) as diff
from t2_click_log) t
group by user_id, diff
查询结果
+----------+-------+-----+
| user_id | diff | aa |
+----------+-------+-----+
| 1 | 0 | 1 |
| 1 | 1 | 3 |
| 2 | 1 | 1 |
| 2 | 4 | 1 |
| 3 | 6 | 4 |
| 4 | 10 | 1 |
+----------+-------+-----+
3.查询分组行数>=3的用户差值分组
执行SQL
select user_id,
diff
from (select user_id,
diff,
count(1) aa
from (select user_id,
click_time,
row_number() over (order by click_time asc) -
row_number() over (partition by user_id order by click_time asc) as diff
from t2_click_log) t
group by user_id, diff) tt
where aa >= 3
查询结果
+----------+-------+
| user_id | diff |
+----------+-------+
| 1 | 1 |
| 3 | 6 |
+----------+-------+
4.计算用户数
执行SQL
select count(distinct user_id) as user_num
from (select user_id,
diff,
count(1) aa
from (select user_id,
click_time,
row_number() over (order by click_time asc) -
row_number() over (partition by user_id order by click_time asc) as diff
from t2_click_log) t
group by user_id, diff) tt
where aa >= 3
查询结果
+-----------+
| user_num |
+-----------+
| 2 |
+-----------+
四、建表语句和数据插入
--建表语句
CREATE TABLE t2_click_log (
user_id BIGINT,
click_time BIGINT
);
--插入数据
insert into t2_click_log (user_id,click_time)
values
(1,1736337600),
(2,1736337670),
(1,1736337710),
(1,1736337715),
(1,1736337750),
(2,1736337760),
(3,1736337820),
(3,1736337840),
(3,1736337850),
(3,1736337910),
(4,1736337915)
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;