阿里大数据面试SQL-用户行为轨迹
一、题目
现有一份用户在各个地铁站进出的时间表和一份用户商场扫码行为表,其中商场扫码会存在多次,可以取最新的数据。请查询并返回出用户整个行为轨迹列表
样例数据
t_subway_station
+----------+-------------+----------------------+----------------------+
| user_id | station_id | in_time | out_time |
+----------+-------------+----------------------+----------------------+
| 001 | 1 | 2024-07-01 09:01:01 | NULL |
| 001 | 2 | NULL | 2024-07-01 09:21:08 |
| 001 | 3 | 2024-07-01 11:01:41 | NULL |
| 001 | 4 | NULL | 2024-07-01 11:23:12 |
| 001 | 4 | 2024-07-01 15:33:29 | NULL |
| 001 | 1 | NULL | 2024-07-01 15:45:41 |
+----------+-------------+----------------------+----------------------+
t_market
+----------+------------+----------------------+
| user_id | market_id | check_time |
+----------+------------+----------------------+
| 001 | 1001 | 2024-07-01 10:23:04 |
| 001 | 1001 | 2024-07-01 10:25:38 |
| 001 | 1002 | 2024-07-01 10:45:01 |
| 001 | 1004 | 2024-07-01 13:56:27 |
| 001 | 1003 | 2024-07-01 14:37:24 |
+----------+------------+----------------------+
期望结果
+----------+----------------------------------+
| user_id | place_list |
+----------+----------------------------------+
| 001 | 1,2,1001,1002,3,4,1004,1003,4,1 |
+----------+----------------------------------+
二、分析
- 处理时序问题,先把所有行为数据放到一起。
- 拼接对应行为。注意这里考察的是如何保证点位有序,也是本题的难点,直接使用collect_list无法保证顺序。所以需要将时间拼接进去,sort_array之后再将时间剔除。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.合并所有数据,把所有点位和对应时间放到一起
- 地铁站进出数据,原表中给的是进出站时间,两个字段只有一个字段有值,另外一个是null,使用coalesce()得到一个进出时间字段。
- 商场数据,我们根据题目要求,取出每个商场最新的时间即可。
- 对两份数据进行union all操作。
执行SQL
--地铁站时间
select user_id,
station_id as place,
coalesce(in_time, out_time) as check_time
from t1_subway_station
group by user_id,station_id,coalesce(in_time, out_time)
union all
--商场时间
select user_id,
place,
check_time
from (
select user_id,
market_id as place,
max(check_time) as check_time
from t1_market
group by user_id, market_id) t
查询结果
+--------------+------------+----------------------+
| _u1.user_id | _u1.place | _u1.check_time |
+--------------+------------+----------------------+
| 001 | 1001 | 2024-07-01 10:25:38 |
| 001 | 1002 | 2024-07-01 10:45:01 |
| 001 | 1003 | 2024-07-01 14:37:24 |
| 001 | 1004 | 2024-07-01 13:56:27 |
| 001 | 1 | 2024-07-01 09:01:01 |
| 001 | 2 | 2024-07-01 09:21:08 |
| 001 | 3 | 2024-07-01 11:01:41 |
| 001 | 4 | 2024-07-01 11:23:12 |
| 001 | 4 | 2024-07-01 15:33:29 |
| 001 | 1 | 2024-07-01 15:45:41 |
+--------------+------------+----------------------+
2.对数据进行进行拼接
需要根据时间保证结果有序,所以需要先对check_time 与place进行拼接。使用concat进行拼接
select
user_id,
concat(check_time,place) as con_str
from
(--地铁站时间
select user_id,
station_id as place,
coalesce(in_time, out_time) as check_time
from t1_subway_station
group by user_id,station_id,coalesce(in_time, out_time)
union all
--商场时间
select user_id,
place,
check_time
from (
select user_id,
market_id as place,
max(check_time) as check_time
from t1_market
group by user_id, market_id) t
) tt
查询结果
+----------+--------------------------+
| user_id | con_str |
+----------+--------------------------+
| 001 | 2024-07-01 10:25:381001 |
| 001 | 2024-07-01 10:45:011002 |
| 001 | 2024-07-01 14:37:241003 |
| 001 | 2024-07-01 13:56:271004 |
| 001 | 2024-07-01 09:01:011 |
| 001 | 2024-07-01 09:21:082 |
| 001 | 2024-07-01 11:01:413 |
| 001 | 2024-07-01 11:23:124 |
| 001 | 2024-07-01 15:33:294 |
| 001 | 2024-07-01 15:45:411 |
+----------+--------------------------+
3.使用collect_list拼接con_str,排序
使用collect_list对con_str进行拼接,并使用sort_array进行排序,得到顺序的时间点位信息。
执行SQL
select
user_id,
sort_array(collect_list(concat(check_time,place)))
from
(--地铁站时间
select user_id,
station_id as place,
coalesce(in_time, out_time) as check_time
from t1_subway_station
group by user_id,station_id,coalesce(in_time, out_time)
union all
--商场时间
select user_id,
place,
check_time
from (
select user_id,
market_id as place,
max(check_time) as check_time
from t1_market
group by user_id, market_id) t
) tt
group by user_id
查询结果
+----------+----------------------------------------------------+
| user_id | _c1 |
+----------+----------------------------------------------------+
| 001 | ["2024-07-01 09:01:011","2024-07-01 09:21:082","2024-07-01 10:25:381001","2024-07-01 10:45:011002","2024-07-01 11:01:413","2024-07-01 11:23:124","2024-07-01 13:56:271004","2024-07-01 14:37:241003","2024-07-01 15:33:294","2024-07-01 15:45:411"] |
+----------+----------------------------------------------------+
4.剔除时间
使用正则表达式将结果时间进行剔除,得到最终结果。
查询SQL
select
user_id,
regexp_replace(concat_ws(',',sort_array(collect_list(concat(check_time,place)))) ,'\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}','') as place_list
from
(--地铁站时间
select user_id,
station_id as place,
coalesce(in_time, out_time) as check_time
from t1_subway_station
group by user_id,station_id,coalesce(in_time, out_time)
union all
--商场时间
select user_id,
place,
check_time
from (
select user_id,
market_id as place,
max(check_time) as check_time
from t1_market
group by user_id, market_id) t
) tt
group by user_id
执行结果
+----------+----------------------------------+
| user_id | place_list |
+----------+----------------------------------+
| 001 | 1,2,1001,1002,3,4,1004,1003,4,1 |
+----------+----------------------------------+
四、额外的问题
本题对于商场扫码行为,给出了明确的要求取最新的,但是在实际生活场景用我们是可A商场》B商场》A商场的,那应该如何处理呢?
五、建表语句和数据插入
--建表语句
CREATE TABLE t1_subway_station (
user_id string,
station_id string,
in_time string,
out_time string
);
--插入数据
insert into t1_subway_station (user_id,station_id,in_time,out_time)
values
('001','1','2024-07-01 09:01:01',null),
('001','2',null,'2024-07-01 09:21:08'),
('001','3','2024-07-01 11:01:41',null),
('001','4',null,'2024-07-01 11:23:12'),
('001','4','2024-07-01 15:33:29',null),
('001','1',null,'2024-07-01 15:45:41');
--建表语句
CREATE TABLE t1_market (
user_id string,
market_id string,
check_time string
);
--插入数据
insert into t1_market (user_id,market_id,check_time)
values
('001','1001','2024-07-01 10:23:04'),
('001','1001','2024-07-01 10:25:38'),
('001','1002','2024-07-01 10:45:01'),
('001','1004','2024-07-01 13:56:27'),
('001','1003','2024-07-01 14:37:24');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;