大数据面试SQL035-用户行为路径分析

一、题目

有一张用户操作行为记录表 t_act_log_035 包含用户ID(user_id),操作编号(op_id),操作时间(op_time)

要求:

​ (1)统计每天符合以下条件的用户数:A操作之后是B操作,AB操作必须相邻;

​ (2)统计每天用户行为序列为A-B-D的用户数;其中:A-B之间可以有任何其他浏览记录(如C,E等),B-D之间除了C记录可以有任何其他浏览记录(如A,E等)

+----------+--------+----------------------+
| user_id  | op_id  |       op_time        |
+----------+--------+----------------------+
| 1        | A      | 2023-10-18 12:01:03  |
| 2        | A      | 2023-10-18 12:01:04  |
| 3        | A      | 2023-10-18 12:01:05  |
| 1        | B      | 2023-10-18 12:03:03  |
| 1        | A      | 2023-10-18 12:04:03  |
| 1        | C      | 2023-10-18 12:06:03  |
| 1        | D      | 2023-10-18 12:11:03  |
| 2        | A      | 2023-10-18 12:07:04  |
| 3        | C      | 2023-10-18 12:02:05  |
| 2        | C      | 2023-10-18 12:09:03  |
| 2        | A      | 2023-10-18 12:10:03  |
| 4        | A      | 2023-10-18 12:01:03  |
| 4        | C      | 2023-10-18 12:11:05  |
| 4        | D      | 2023-10-18 12:15:05  |
| 1        | A      | 2023-10-19 12:01:03  |
| 2        | A      | 2023-10-19 12:01:04  |
| 3        | A      | 2023-10-19 12:01:05  |
| 1        | B      | 2023-10-19 12:03:03  |
| 1        | A      | 2023-10-19 12:04:03  |
| 1        | C      | 2023-10-19 12:06:03  |
| 2        | A      | 2023-10-19 12:07:04  |
| 3        | B      | 2023-10-19 12:08:05  |
| 3        | E      | 2023-10-19 12:09:05  |
| 3        | D      | 2023-10-19 12:11:05  |
| 2        | C      | 2023-10-19 12:09:03  |
| 4        | E      | 2023-10-19 12:05:03  |
| 4        | B      | 2023-10-19 12:06:03  |
| 4        | E      | 2023-10-19 12:07:03  |
| 2        | A      | 2023-10-19 12:10:03  |
+----------+--------+----------------------+

二、分析

本题分为两问,只看第一个问题可以看做计算相邻两行数据,lag()函数可能能解决,但是第二个问题,明显无法用lag(),更像是对字符串内容的匹配,所以该题目考察内容是按照顺序对字符串进行拼接,然后筛选出符合条件的字符串;

​ (1)按照顺序拼接字符串

​ (1)包含’A,B’

​ (2)包含’A%B%D’并且不能是‘A%B%C%D’

维度 评分
题目难度 ⭐️⭐️⭐️⭐️
题目清晰度 ⭐️⭐️⭐️⭐️⭐️
业务常见度 ⭐️⭐️⭐️⭐️

三、SQL

首先我们知道collect_list拼接字符串是无序的,即便按照顺序将原始数据排好,也不能保证结果有序。所以我们需要将时间op_time和op_id进行拼接,拼接完成之后,时间在前,对整个字符串进行排序,然后再把op_time去掉,保留op_id的拼接字符串,然后对字符串进行验证。

1)拼接op_time和op_id,然后根据用户和日期进行分组,collect_list聚合出每天用户的行为,使用sort_array保证拼接后的字符串有序。为了方便看结果,最后进加了order by

select 
user_id,
dt,
concat_ws(',',sort_array(collect_list(op_str))) as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
order by user_id,dt

查询结果

01

2)将结果进行字符串替换,去掉op_time及添加的|,这里使用regexp_replace()函数进行替换。

select 
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
order by user_id,dt

查询结果

02

3)使用like查询包含’A,B’的记录

with tmp as
(
  select 
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select 
user_id,
dt,
op_sort
from tmp
where op_sort like '%A,B%'

查询结果

03

4)按照日期分组,计算每天符合条件的用户数量

with tmp as
(
  select 
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select 
dt,count(1)
from tmp
where op_sort like '%A,B%'
group by dt

查询结果

04

5)使用like查询满足包含’A%B%D’但是不包含‘A%B%C%D’的记录

with tmp as
(
  select 
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select 
user_id,
dt,
op_sort
from tmp
where op_sort like '%A%B%D%'
and op_sort not like '%A%B%C%D%'

查询结果

05

6)按照日期分组,计算每天符合条件的用户数量

with tmp as
(
  select 
user_id,
dt,
regexp_replace(concat_ws(',',sort_array(collect_list(op_str))),'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\|)','') as op_sort
from
(select
user_id,
op_id,
to_date(op_time) as dt,
op_time,
concat_ws('|',op_time,op_id) as op_str
from t_act_log_035
)t
group by user_id,dt
)
select 
dt,count(1)
from tmp
where op_sort like '%A%B%D%'
and op_sort not like '%A%B%C%D%'
group by dt

查询结果

06

四、建表语句和数据插入

create table t_act_log_035(
  user_id bigint,
  op_id string,
  op_time string
  )row format delimited fields terminated by '\t';

insert into t_act_log_035(user_id, op_id, op_time) values 

(1, 'A', '2023-10-18 12:01:03'),
(2, 'A', '2023-10-18 12:01:04'),
(3, 'A', '2023-10-18 12:01:05'),
(1, 'B', '2023-10-18 12:03:03'),
(1, 'A', '2023-10-18 12:04:03'),
(1, 'C', '2023-10-18 12:06:03'),
(1, 'D', '2023-10-18 12:11:03'),
(2, 'A', '2023-10-18 12:07:04'),
(3, 'C', '2023-10-18 12:02:05'),
(2, 'C', '2023-10-18 12:09:03'),
(2, 'A', '2023-10-18 12:10:03'),
(4, 'A', '2023-10-18 12:01:03'),
(4, 'C', '2023-10-18 12:11:05'),
(4, 'D', '2023-10-18 12:15:05'),
(1, 'A', '2023-10-19 12:01:03'),
(2, 'A', '2023-10-19 12:01:04'),
(3, 'A', '2023-10-19 12:01:05'),
(1, 'B', '2023-10-19 12:03:03'),
(1, 'A', '2023-10-19 12:04:03'),
(1, 'C', '2023-10-19 12:06:03'),
(2, 'A', '2023-10-19 12:07:04'),
(3, 'B', '2023-10-19 12:08:05'),
(3, 'E', '2023-10-19 12:09:05'),
(3, 'D', '2023-10-19 12:11:05'),
(2, 'C', '2023-10-19 12:09:03'),
(4, 'E', '2023-10-19 12:05:03'),
(4, 'B', '2023-10-19 12:06:03'),
(4, 'E', '2023-10-19 12:07:03'),
(2, 'A', '2023-10-19 12:10:03');