常见大数据面试SQL-用户行为路径分析
一、题目
有一张用户操作行为记录表 t10_act_log 包含用户ID(user_id),操作编号(op_id),操作时间(op_time)
要求:
- 统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻;
- 统计每天用户行为序列为A-B-D的用户数;其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)
+----------+--------+----------------------+
| user_id | op_id | op_time |
+----------+--------+----------------------+
| 1 | A | 2023-10-18 12:01:03 |
| 2 | A | 2023-10-18 12:01:04 |
| 3 | A | 2023-10-18 12:01:05 |
| 1 | B | 2023-10-18 12:03:03 |
| 1 | A | 2023-10-18 12:04:03 |
| 1 | C | 2023-10-18 12:06:03 |
| 1 | D | 2023-10-18 12:11:03 |
| 2 | A | 2023-10-18 12:07:04 |
| 3 | C | 2023-10-18 12:02:05 |
| 2 | C | 2023-10-18 12:09:03 |
| 2 | A | 2023-10-18 12:10:03 |
| 4 | A | 2023-10-18 12:01:03 |
| 4 | C | 2023-10-18 12:11:05 |
| 4 | D | 2023-10-18 12:15:05 |
| 1 | A | 2023-10-19 12:01:03 |
| 2 | A | 2023-10-19 12:01:04 |
| 3 | A | 2023-10-19 12:01:05 |
| 1 | B | 2023-10-19 12:03:03 |
| 1 | A | 2023-10-19 12:04:03 |
| 1 | C | 2023-10-19 12:06:03 |
| 2 | A | 2023-10-19 12:07:04 |
| 3 | B | 2023-10-19 12:08:05 |
| 3 | E | 2023-10-19 12:09:05 |
| 3 | D | 2023-10-19 12:11:05 |
| 2 | C | 2023-10-19 12:09:03 |
| 4 | E | 2023-10-19 12:05:03 |
| 4 | B | 2023-10-19 12:06:03 |
| 4 | E | 2023-10-19 12:07:03 |
| 2 | A | 2023-10-19 12:10:03 |
+----------+--------+----------------------+
二、分析
本题分为两问,只看第一个问题可以看做计算相邻两行数据,lag()函数可能能解决,但是第二个问题,明显无法用lag(),更像是对字符串内容的匹配,所以该题目考察内容是按照顺序对字符串进行拼接,然后筛选出符合条件的字符串;
- 按照顺序拼接字符串;
- 包含'A,B';
- 包含’A%B%D'并且不能是‘A%B%C%D’
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
首先我们知道collect_list拼接字符串是无序的,即便按照顺序将原始数据排好,也不能保证结果有序。所以我们需要将时间op_time和op_id进行拼接,拼接完成之后,时间在前,对整个字符串进行排序,然后再把op_time去掉,保留op_id的拼接字符串,然后对字符串进行验证。
1.拼接op_time和op_id,然后根据用户和日期进行分组,collect_list聚合出每天用户的行为,使用sort_array保证拼接后的字符串有序。为了方便看结果,最后进加了order by
执行SQL
select user_id,
dt,
concat_ws(',', sort_array(collect_list(op_str))) as op_sort
from (select user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|', op_time, op_id) as op_str
from t10_act_log) t
group by user_id, dt
order by user_id, dt
查询结果
+----------+-------------+----------------------------------------------------+
| user_id | dt | op_sort |
+----------+-------------+----------------------------------------------------+
| 1 | 2023-10-18 | 2023-10-18 12:01:03|A,2023-10-18 12:03:03|B,2023-10-18 12:04:03|A,2023-10-18 12:06:03|C,2023-10-18 12:11:03|D |
| 1 | 2023-10-19 | 2023-10-19 12:01:03|A,2023-10-19 12:03:03|B,2023-10-19 12:04:03|A,2023-10-19 12:06:03|C |
| 2 | 2023-10-18 | 2023-10-18 12:01:04|A,2023-10-18 12:07:04|A,2023-10-18 12:09:03|C,2023-10-18 12:10:03|A |
| 2 | 2023-10-19 | 2023-10-19 12:01:04|A,2023-10-19 12:07:04|A,2023-10-19 12:09:03|C,2023-10-19 12:10:03|A |
| 3 | 2023-10-18 | 2023-10-18 12:01:05|A,2023-10-18 12:02:05|C |
| 3 | 2023-10-19 | 2023-10-19 12:01:05|A,2023-10-19 12:08:05|B,2023-10-19 12:09:05|E,2023-10-19 12:11:05|D |
| 4 | 2023-10-18 | 2023-10-18 12:01:03|A,2023-10-18 12:11:05|C,2023-10-18 12:15:05|D |
| 4 | 2023-10-19 | 2023-10-19 12:05:03|E,2023-10-19 12:06:03|B,2023-10-19 12:07:03|E |
+----------+-------------+----------------------------------------------------+
2.将结果进行字符串替换,去掉op_time及添加的|,这里使用regexp_replace()函数进行替换。
执行SQL
select user_id,
dt,
regexp_replace(concat_ws(',', sort_array(collect_list(op_str))),
'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)', '') as op_sort
from (select user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|', op_time, op_id) as op_str
from t10_act_log) t
group by user_id, dt
order by user_id, dt
查询结果
+----------+-------------+------------+
| user_id | dt | op_sort |
+----------+-------------+------------+
| 1 | 2023-10-18 | A,B,A,C,D |
| 1 | 2023-10-19 | A,B,A,C |
| 2 | 2023-10-18 | A,A,C,A |
| 2 | 2023-10-19 | A,A,C,A |
| 3 | 2023-10-18 | A,C |
| 3 | 2023-10-19 | A,B,E,D |
| 4 | 2023-10-18 | A,C,D |
| 4 | 2023-10-19 | E,B,E |
+----------+-------------+------------+
3.使用like查询包含'A,B'的记录
执行SQL
with tmp as
(select user_id,
dt,
regexp_replace(concat_ws(',', sort_array(collect_list(op_str))),
'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)', '') as op_sort
from (select user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|', op_time, op_id) as op_str
from t10_act_log) t
group by user_id, dt)
select user_id,
dt,
op_sort
from tmp
where op_sort like '%A,B%'
查询结果
+----------+-------------+------------+
| user_id | dt | op_sort |
+----------+-------------+------------+
| 1 | 2023-10-18 | A,B,A,C,D |
| 1 | 2023-10-19 | A,B,A,C |
| 3 | 2023-10-19 | A,B,E,D |
+----------+-------------+------------+
4.按照日期分组,计算每天符合条件的用户数量
执行SQL
with tmp as
(select user_id,
dt,
regexp_replace(concat_ws(',', sort_array(collect_list(op_str))),
'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)', '') as op_sort
from (select user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|', op_time, op_id) as op_str
from t10_act_log) t
group by user_id, dt)
select dt,
count(1)
from tmp
where op_sort like '%A,B%'
group by dt
查询结果
+-------------+------+
| dt | _c1 |
+-------------+------+
| 2023-10-18 | 1 |
| 2023-10-19 | 2 |
+-------------+------+
5.使用like查询满足包含’A%B%D'但是不包含‘A%B%C%D’的记录
执行SQL
with tmp as
(select user_id,
dt,
regexp_replace(concat_ws(',', sort_array(collect_list(op_str))),
'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)', '') as op_sort
from (select user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|', op_time, op_id) as op_str
from t10_act_log) t
group by user_id, dt)
select user_id,
dt,
op_sort
from tmp
where op_sort like '%A%B%D%'
and op_sort not like '%A%B%C%D%'
查询结果
+----------+-------------+----------+
| user_id | dt | op_sort |
+----------+-------------+----------+
| 3 | 2023-10-19 | A,B,E,D |
+----------+-------------+----------+
6.按照日期分组,计算每天符合条件的用户数量
执行SQL
with tmp as
(select user_id,
dt,
regexp_replace(concat_ws(',', sort_array(collect_list(op_str))),
'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)', '') as op_sort
from (select user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|', op_time, op_id) as op_str
from t10_act_log) t
group by user_id, dt)
select dt,
count(1)
from tmp
where op_sort like '%A%B%D%'
and op_sort not like '%A%B%C%D%'
group by dt
查询结果
+-------------+------+
| dt | _c1 |
+-------------+------+
| 2023-10-19 | 1 |
+-------------+------+
四、建表语句和数据插入
create table t10_act_log(
user_id bigint,
op_id string,
op_time string
);
insert into t10_act_log(user_id, op_id, op_time)
values
(1, 'A', '2023-10-18 12:01:03'),
(2, 'A', '2023-10-18 12:01:04'),
(3, 'A', '2023-10-18 12:01:05'),
(1, 'B', '2023-10-18 12:03:03'),
(1, 'A', '2023-10-18 12:04:03'),
(1, 'C', '2023-10-18 12:06:03'),
(1, 'D', '2023-10-18 12:11:03'),
(2, 'A', '2023-10-18 12:07:04'),
(3, 'C', '2023-10-18 12:02:05'),
(2, 'C', '2023-10-18 12:09:03'),
(2, 'A', '2023-10-18 12:10:03'),
(4, 'A', '2023-10-18 12:01:03'),
(4, 'C', '2023-10-18 12:11:05'),
(4, 'D', '2023-10-18 12:15:05'),
(1, 'A', '2023-10-19 12:01:03'),
(2, 'A', '2023-10-19 12:01:04'),
(3, 'A', '2023-10-19 12:01:05'),
(1, 'B', '2023-10-19 12:03:03'),
(1, 'A', '2023-10-19 12:04:03'),
(1, 'C', '2023-10-19 12:06:03'),
(2, 'A', '2023-10-19 12:07:04'),
(3, 'B', '2023-10-19 12:08:05'),
(3, 'E', '2023-10-19 12:09:05'),
(3, 'D', '2023-10-19 12:11:05'),
(2, 'C', '2023-10-19 12:09:03'),
(4, 'E', '2023-10-19 12:05:03'),
(4, 'B', '2023-10-19 12:06:03'),
(4, 'E', '2023-10-19 12:07:03'),
(2, 'A', '2023-10-19 12:10:03');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;