面试真题
常见题目
2.连续点击三次用户

常见大数据面试SQL-连续点击三次用户

一、题目

有用户点击日志记录表 t2_click_log,包含user_id(用户ID),click_time(点击时间),请查询出连续点击三次的用户数,

连续点击三次:指点击记录中同一用户连续点击,中间无其他用户点击;

+----------+--------------+
| user_id  | click_time |
+----------+--------------+
| 1        | 1736337600   |
| 2        | 1736337670   |
| 1        | 1736337710   |
| 1        | 1736337715   |
| 1        | 1736337750   |
| 2        | 1736337760   |
| 3        | 1736337820   |
| 3        | 1736337840   |
| 3        | 1736337850   |
| 3        | 1736337910   |
| 4        | 1736337915   |
+----------+--------------+

二、分析

1.连续问题,最大连续次数的变种问题;

2.思路一:累积求和分组法(此种方式比连续N天登录略难一些)

​ (2.1)按照时间排序之后,使用lag()函数可以判断出当前行用户与上一行用户,是否是同一个用户;

​ (2.2)与上一行是同一个用户的日志,则给该行打标0,不同打标1;(属于经验:要累积求和,0和不变,1和+1.)

​ (2.3)对打标完成的标签,进行累积求和,和相同代表属于同一用户连续(聚合函数开窗);

​ (2.4)对累积求和的值进行统计,相同值的个数>=3 表示连续登录;

3.思路二:双重排序差值法(难度同连续N天登录)

​ (3.1)与连续登录天数类似。按照点击时间(click_time) 进行全排序,按照用户ID(user_id)分组,按照点击时间排序;

​ (3.2)对两次排序计算差值,按照用户和差值进行分组,相同用户,差值相同说明连续;

​ (3.3)计算属于同一分组的数量>=3则是连续,统计用户数得出结果;

维度评分
题目难度⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️⭐️

三、SQL

3.1 累积求和法

1.增加一列is_same_user,判断是否与上一行是同一用户点击,是取0,否取1,第一行默认为0;

执行SQL

select user_id,
       click_time,
       case
           when lag(user_id) over (order by click_time asc) is null then 0
           when user_id = lag(user_id) over (order by click_time asc) then 0
           else 1 end as is_same_user
from t2_click_log

查询结果

+----------+-------------+---------------+
| user_id  | click_time  | is_same_user  |
+----------+-------------+---------------+
| 1        | 1736337600  | 0             |
| 2        | 1736337670  | 1             |
| 1        | 1736337710  | 1             |
| 1        | 1736337715  | 0             |
| 1        | 1736337750  | 0             |
| 2        | 1736337760  | 1             |
| 3        | 1736337820  | 1             |
| 3        | 1736337840  | 0             |
| 3        | 1736337850  | 0             |
| 3        | 1736337910  | 0             |
| 4        | 1736337915  | 1             |
+----------+-------------+---------------+

2.对是否是同一用户分组进行累积求和

执行SQL

select user_id,
       click_time,
       sum(is_same_user) over (order by click_time asc) as sum_order
from (select user_id,
             click_time,
             case
                 when lag(user_id) over (order by click_time asc) is null then 0
                 when user_id = lag(user_id) over (order by click_time asc) then 0
                 else 1 end as is_same_user
      from t2_click_log) t

查询结果

+----------+-------------+------------+
| user_id  | click_time  | sum_order  |
+----------+-------------+------------+
| 1        | 1736337600  | 0          |
| 2        | 1736337670  | 1          |
| 1        | 1736337710  | 2          |
| 1        | 1736337715  | 2          |
| 1        | 1736337750  | 2          |
| 2        | 1736337760  | 3          |
| 3        | 1736337820  | 4          |
| 3        | 1736337840  | 4          |
| 3        | 1736337850  | 4          |
| 3        | 1736337910  | 4          |
| 4        | 1736337915  | 5          |
+----------+-------------+------------+

3.查询sum_order值个数>3的用户

执行SQL

select user_id,
       sum_order,
       count(1)
from (select user_id,
             click_time,
             sum(is_same_user) over (order by click_time asc) as sum_order
      from (select user_id,
                   click_time,
                   case
                       when lag(user_id) over (order by click_time asc) is null then 0
                       when user_id = lag(user_id) over (order by click_time asc) then 0
                       else 1 end as is_same_user
            from t2_click_log) t) tt
group by user_id, sum_order
having count(1) >= 3

查询结果

+----------+------------+------+
| user_id  | sum_order  | _c2  |
+----------+------------+------+
| 1        | 2          | 3    |
| 3        | 4          | 4    |
+----------+------------+------+

4.查询最终用户数量

执行SQL

select count(distinct user_id) as user_num
from (select user_id,
             sum_order,
             count(1)
      from (select user_id,
                   click_time,
                   sum(is_same_user) over (order by click_time asc) as sum_order
            from (select user_id,
                         click_time,
                         case
                             when lag(user_id) over (order by click_time asc) is null then 0
                             when user_id = lag(user_id) over (order by click_time asc) then 0
                             else 1 end as is_same_user
                  from t2_click_log) t) tt
      group by user_id, sum_order
      having count(1) >= 3) ttt

查询结果

+-----------+
| user_num  |
+-----------+
| 2         |
+-----------+

3.2 双重排序差值法

1.分别按照时间,按照不分组和按照用户分组进行排序;

执行SQL

select user_id,
       click_time,
       row_number() over (order by click_time asc)                      as row_num1,
       row_number() over (partition by user_id order by click_time asc) as row_num2
from t2_click_log

查询结果

+----------+-------------+-----------+-----------+
| user_id  | click_time  | row_num1  | row_num2  |
+----------+-------------+-----------+-----------+
| 1        | 1736337600  | 1         | 1         |
| 1        | 1736337710  | 3         | 2         |
| 1        | 1736337715  | 4         | 3         |
| 1        | 1736337750  | 5         | 4         |
| 2        | 1736337670  | 2         | 1         |
| 2        | 1736337760  | 6         | 2         |
| 3        | 1736337820  | 7         | 1         |
| 3        | 1736337840  | 8         | 2         |
| 3        | 1736337850  | 9         | 3         |
| 3        | 1736337910  | 10        | 4         |
| 4        | 1736337915  | 11        | 1         |
+----------+-------------+-----------+-----------+

2.计算差值并按照用户和差值进行分组

执行SQL

select user_id,
       diff,
       count(1) as aa
from (select user_id,
             click_time,
             row_number() over (order by click_time asc) -
             row_number() over (partition by user_id order by click_time asc) as diff
      from t2_click_log) t
group by user_id, diff

查询结果

+----------+-------+-----+
| user_id  | diff  | aa  |
+----------+-------+-----+
| 1        | 0     | 1   |
| 1        | 1     | 3   |
| 2        | 1     | 1   |
| 2        | 4     | 1   |
| 3        | 6     | 4   |
| 4        | 10    | 1   |
+----------+-------+-----+

3.查询分组行数>=3的用户差值分组

执行SQL

select user_id,
       diff
from (select user_id,
             diff,
             count(1) aa
      from (select user_id,
                   click_time,
                   row_number() over (order by click_time asc) -
                   row_number() over (partition by user_id order by click_time asc) as diff
            from t2_click_log) t
      group by user_id, diff) tt
where aa >= 3

查询结果

+----------+-------+
| user_id  | diff  |
+----------+-------+
| 1        | 1     |
| 3        | 6     |
+----------+-------+

4.计算用户数

执行SQL

select count(distinct user_id) as user_num
from (select user_id,
             diff,
             count(1) aa
      from (select user_id,
                   click_time,
                   row_number() over (order by click_time asc) -
                   row_number() over (partition by user_id order by click_time asc) as diff
            from t2_click_log) t
      group by user_id, diff) tt
where aa >= 3

查询结果

+-----------+
| user_num  |
+-----------+
| 2         |
+-----------+

四、建表语句和数据插入

--建表语句
CREATE TABLE t2_click_log (
  user_id BIGINT,
  click_time BIGINT
);
--插入数据
insert into t2_click_log (user_id,click_time)
values
(1,1736337600),
(2,1736337670),
(1,1736337710),
(1,1736337715),
(1,1736337750),
(2,1736337760),
(3,1736337820),
(3,1736337840),
(3,1736337850),
(3,1736337910),
(4,1736337915)

本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;