小红书大数据面试SQL-用户商品购买收藏行为特征加工
一、题目
已知有
购买记录表t2_order,包含自增id:id,用户ID:user_id,商品ID:goods_id,订单时间:order_time,商品类别:goods_type;
用户收藏记录表t2_collect_log,包含自增id,用户ID:user_id,商品ID:goods_id,收藏时间 collect_time
请用一句sql语句得出以下查询结果,得到所有用户的商品行为特征,其中用户行为分类为4种:是否已购买、购买未收藏、收藏未购买、收藏且购买。
购买记录表t2_order
+-----+----------+-----------+-------------------+-------------+
| id | user_id | goods_id | order_time | goods_type |
+-----+----------+-----------+-------------------+-------------+
| 1 | 1 | 201 | 2020/11/14 10:00 | 1 |
| 2 | 2 | 203 | 2020/11/15 12:00 | 2 |
| 3 | 3 | 203 | 2020/11/16 10:00 | 1 |
| 4 | 4 | 203 | 2020/11/17 10:00 | 1 |
| 5 | 5 | 203 | 2020/11/18 10:00 | 1 |
| 6 | 6 | 203 | 2020/11/18 11:00 | 1 |
| 7 | 7 | 204 | 2020/11/18 12:00 | 1 |
| 8 | 8 | 205 | 2020/11/18 11:30 | 1 |
| 9 | 9 | 206 | 2020/12/1 10:00 | 1 |
| 10 | 4 | 207 | 2020/12/2 10:00 | 3 |
| 11 | 5 | 208 | 2020/12/3 10:00 | 1 |
| 12 | 6 | 209 | 2020/12/4 8:00 | 2 |
| 13 | 7 | 203 | 2020/12/5 10:00 | 2 |
| 14 | 8 | 203 | 2020/12/6 10:00 | 3 |
| 15 | 9 | 203 | 2020/12/7 15:00 | 4 |
| 16 | 1 | 204 | 2020/12/8 10:00 | 5 |
| 17 | 2 | 204 | 2020/12/9 10:00 | 5 |
| 18 | 3 | 206 | 2020/12/10 10:00 | 5 |
| 19 | 4 | 208 | 2020/12/11 10:00 | 5 |
| 20 | 5 | 209 | 2020/12/12 19:00 | 5 |
+-----+----------+-----------+-------------------+-------------+
收藏记录表t2_collect_log
+-----+----------+-----------+-------------------+
| id | user_id | goods_id | collect_time |
+-----+----------+-----------+-------------------+
| 1 | 1 | 203 | 2020/11/14 12:00 |
| 2 | 9 | 203 | 2020/11/15 10:00 |
| 3 | 4 | 203 | 2020/11/16 10:00 |
| 4 | 5 | 203 | 2020/11/17 10:00 |
| 5 | 6 | 203 | 2020/11/17 11:00 |
| 6 | 7 | 204 | 2020/11/17 12:00 |
| 7 | 8 | 205 | 2020/11/18 11:30 |
| 8 | 9 | 212 | 2020/12/1 10:00 |
| 9 | 4 | 207 | 2020/12/2 10:00 |
| 10 | 5 | 213 | 2020/12/3 10:00 |
| 11 | 6 | 209 | 2020/12/4 8:00 |
| 12 | 7 | 203 | 2020/12/5 10:00 |
| 13 | 8 | 203 | 2020/12/6 10:00 |
| 14 | 9 | 203 | 2020/12/7 15:00 |
| 15 | 1 | 203 | 2020/12/8 10:00 |
| 16 | 2 | 204 | 2020/12/9 10:00 |
| 17 | 3 | 205 | 2020/12/10 8:00 |
| 18 | 4 | 208 | 2020/12/11 10:00 |
| 19 | 5 | 209 | 2020/12/10 19:00 |
| 20 | 7 | 201 | 2020/12/11 19:00 |
+-----+----------+-----------+-------------------+
期望结果
+----------+-----------+---------+------------------+------------------+------------------+
| user_id | goods_id | is_buy | buy_not_collect | collect_not_buy | buy_and_collect |
+----------+-----------+---------+------------------+------------------+------------------+
| 1 | 201 | 1 | 1 | 0 | 0 |
| 1 | 203 | 0 | 0 | 1 | 0 |
| 1 | 204 | 1 | 1 | 0 | 0 |
| 2 | 203 | 1 | 1 | 0 | 0 |
| 2 | 204 | 1 | 0 | 0 | 1 |
| 3 | 203 | 1 | 1 | 0 | 0 |
| 3 | 205 | 0 | 0 | 1 | 0 |
| 3 | 206 | 1 | 1 | 0 | 0 |
| 4 | 203 | 1 | 0 | 0 | 1 |
| 4 | 207 | 1 | 0 | 0 | 1 |
| 4 | 208 | 1 | 0 | 0 | 1 |
| 5 | 203 | 1 | 0 | 0 | 1 |
| 5 | 208 | 1 | 1 | 0 | 0 |
| 5 | 209 | 1 | 0 | 0 | 1 |
| 5 | 213 | 0 | 0 | 1 | 0 |
| 6 | 203 | 1 | 0 | 0 | 1 |
| 6 | 209 | 1 | 0 | 0 | 1 |
| 7 | 201 | 0 | 0 | 1 | 0 |
| 7 | 203 | 1 | 0 | 0 | 1 |
| 7 | 204 | 1 | 0 | 0 | 1 |
| 8 | 203 | 1 | 0 | 0 | 1 |
| 8 | 205 | 1 | 0 | 0 | 1 |
| 9 | 203 | 1 | 0 | 0 | 1 |
| 9 | 206 | 1 | 1 | 0 | 0 |
| 9 | 212 | 0 | 0 | 1 | 0 |
+----------+-----------+---------+------------------+------------------+------------------+
二、分析
这个题目属于简单但繁琐的类型,日常数据开发中尤其在有支持算法的数据团队中比较常见,但是大家都不乐意做的脏活。如果面试中遇到类似这种问题,需要考虑这个团队日常主要干脏活,能不能接受。说回题目,因为数据需要join操作,并且存在较多的冗余,由此很容易出现数据倾斜的问题,如果我出这个题目,希望看到的是候选人日常sql的习惯中是否优先进行行列裁剪和去重以保证join时两个表的粒度统一。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.行列裁剪
首先对订单表、收藏记录进行行列裁剪,我们只需要user_id、goods_id,去掉其他冗余列,且保证user_id+goods_id唯一
订单表
select
user_id,
goods_id
from t2_order
group by
user_id,
goods_id
执行结果
+----------+-----------+
| user_id | goods_id |
+----------+-----------+
| 1 | 201 |
| 1 | 204 |
| 2 | 203 |
| 2 | 204 |
| 3 | 203 |
| 3 | 206 |
| 4 | 203 |
| 4 | 207 |
| 4 | 208 |
| 5 | 203 |
| 5 | 208 |
| 5 | 209 |
| 6 | 203 |
| 6 | 209 |
| 7 | 203 |
| 7 | 204 |
| 8 | 203 |
| 8 | 205 |
| 9 | 203 |
| 9 | 206 |
+----------+-----------+
收藏表
select
user_id,
goods_id
from t2_collect_log
group by
user_id,
goods_id
执行结果
+----------+-----------+
| user_id | goods_id |
+----------+-----------+
| 1 | 203 |
| 2 | 204 |
| 3 | 205 |
| 4 | 203 |
| 4 | 207 |
| 4 | 208 |
| 5 | 203 |
| 5 | 209 |
| 5 | 213 |
| 6 | 203 |
| 6 | 209 |
| 7 | 201 |
| 7 | 203 |
| 7 | 204 |
| 8 | 203 |
| 8 | 205 |
| 9 | 203 |
| 9 | 212 |
+----------+-----------+
2.两个表进行全外联,获得全量的数据行
对两个结果表进行全外联,关联条件为user_id + goods_id, 因为已经完成去重, 所以得到的行就是全量的行,且不会有重复。我们先直接关联,不做任何加工。
执行SQL
select
t_ord.user_id,
t_ord.goods_id,
t_collect.user_id,
t_collect.goods_id
from
(
--订单表数据
select
user_id,
goods_id
from t2_order
group by
user_id,
goods_id
) t_ord
full join
(
--收藏表数据
select
user_id,
goods_id
from t2_collect_log
group by
user_id,
goods_id
) t_collect
on t_ord.user_id = t_collect.user_id
and t_ord.goods_id = t_collect.goods_id
执行结果
+----------------+-----------------+--------------------+---------------------+
| t_ord.user_id | t_ord.goods_id | t_collect.user_id | t_collect.goods_id |
+----------------+-----------------+--------------------+---------------------+
| 1 | 201 | NULL | NULL |
| NULL | NULL | 1 | 203 |
| 1 | 204 | NULL | NULL |
| 2 | 203 | NULL | NULL |
| 2 | 204 | 2 | 204 |
| 3 | 203 | NULL | NULL |
| NULL | NULL | 3 | 205 |
| 3 | 206 | NULL | NULL |
| 4 | 203 | 4 | 203 |
| 4 | 207 | 4 | 207 |
| 4 | 208 | 4 | 208 |
| 5 | 203 | 5 | 203 |
| 5 | 208 | NULL | NULL |
| 5 | 209 | 5 | 209 |
| NULL | NULL | 5 | 213 |
| 6 | 203 | 6 | 203 |
| 6 | 209 | 6 | 209 |
| NULL | NULL | 7 | 201 |
| 7 | 203 | 7 | 203 |
| 7 | 204 | 7 | 204 |
| 8 | 203 | 8 | 203 |
| 8 | 205 | 8 | 205 |
| 9 | 203 | 9 | 203 |
| 9 | 206 | NULL | NULL |
| NULL | NULL | 9 | 212 |
+----------------+-----------------+--------------------+---------------------+
3.求取结果数据
我们先把所有的user_id 和 goods_id取出来,然后进行特征加工:
- 是否购买: 根据 t_ord中的goods_id 是否为空判断是否购买,为空代表未购买,非空代表购买;
- 购买未收藏: t_ord中goods_id不为空,t_collect中goods_id为空;
- 收藏未购买: t_ord中goods_id为空,t_collect中的goods_id不为空;
- 收藏且购买: t_ord中的goods_id不为空,t_collect中的goods_id不为空;
执行SQL
select
coalesce(t_ord.user_id,t_collect.user_id) as user_id,
coalesce(t_ord.goods_id,t_collect.goods_id) as goods_id,
if(t_ord.goods_id is not null,1,0) as is_buy,
if(t_ord.goods_id is not null and t_collect.goods_id is null,1,0) as buy_not_collect,
if(t_ord.goods_id is null and t_collect.goods_id is not null,1,0) as collect_not_buy,
if(t_ord.goods_id is not null and t_collect.goods_id is not null,1,0) as buy_and_collect
from
(
--订单表数据
select
user_id,
goods_id
from t2_order
group by
user_id,
goods_id
) t_ord
full join
(
--收藏表数据
select
user_id,
goods_id
from t2_collect_log
group by
user_id,
goods_id
) t_collect
on t_ord.user_id = t_collect.user_id
and t_ord.goods_id = t_collect.goods_id
执行结果
+----------+-----------+---------+------------------+------------------+------------------+
| user_id | goods_id | is_buy | buy_not_collect | collect_not_buy | buy_and_collect |
+----------+-----------+---------+------------------+------------------+------------------+
| 1 | 201 | 1 | 1 | 0 | 0 |
| 1 | 203 | 0 | 0 | 1 | 0 |
| 1 | 204 | 1 | 1 | 0 | 0 |
| 2 | 203 | 1 | 1 | 0 | 0 |
| 2 | 204 | 1 | 0 | 0 | 1 |
| 3 | 203 | 1 | 1 | 0 | 0 |
| 3 | 205 | 0 | 0 | 1 | 0 |
| 3 | 206 | 1 | 1 | 0 | 0 |
| 4 | 203 | 1 | 0 | 0 | 1 |
| 4 | 207 | 1 | 0 | 0 | 1 |
| 4 | 208 | 1 | 0 | 0 | 1 |
| 5 | 203 | 1 | 0 | 0 | 1 |
| 5 | 208 | 1 | 1 | 0 | 0 |
| 5 | 209 | 1 | 0 | 0 | 1 |
| 5 | 213 | 0 | 0 | 1 | 0 |
| 6 | 203 | 1 | 0 | 0 | 1 |
| 6 | 209 | 1 | 0 | 0 | 1 |
| 7 | 201 | 0 | 0 | 1 | 0 |
| 7 | 203 | 1 | 0 | 0 | 1 |
| 7 | 204 | 1 | 0 | 0 | 1 |
| 8 | 203 | 1 | 0 | 0 | 1 |
| 8 | 205 | 1 | 0 | 0 | 1 |
| 9 | 203 | 1 | 0 | 0 | 1 |
| 9 | 206 | 1 | 1 | 0 | 0 |
| 9 | 212 | 0 | 0 | 1 | 0 |
+----------+-----------+---------+------------------+------------------+------------------+
四、建表语句和数据插入
--订单表创建语句
CREATE TABLE IF NOT EXISTS t2_order (
id string,
user_id string,
goods_id string,
order_time string,
goods_type string
);
--订单数据插入
insert into t2_order(id,user_id,goods_id,order_time,goods_type)values
('1','1','201','2020/11/14 10:00','1'),
('2','2','203','2020/11/15 12:00','2'),
('3','3','203','2020/11/16 10:00','1'),
('4','4','203','2020/11/17 10:00','1'),
('5','5','203','2020/11/18 10:00','1'),
('6','6','203','2020/11/18 11:00','1'),
('7','7','204','2020/11/18 12:00','1'),
('8','8','205','2020/11/18 11:30','1'),
('9','9','206','2020/12/1 10:00','1'),
('10','4','207','2020/12/2 10:00','3'),
('11','5','208','2020/12/3 10:00','1'),
('12','6','209','2020/12/4 8:00','2'),
('13','7','203','2020/12/5 10:00','2'),
('14','8','203','2020/12/6 10:00','3'),
('15','9','203','2020/12/7 15:00','4'),
('16','1','204','2020/12/8 10:00','5'),
('17','2','204','2020/12/9 10:00','5'),
('18','3','206','2020/12/10 10:00','5'),
('19','4','208','2020/12/11 10:00','5'),
('20','5','209','2020/12/12 19:00','5');
--收藏记录日志
CREATE TABLE IF NOT EXISTS t2_collect_log (
id string,
user_id string,
goods_id string,
collect_time string
);
--收藏记录数据插入
insert into t2_collect_log(id,user_id,goods_id,collect_time)values
('1','1','203','2020/11/14 12:00'),
('2','9','203','2020/11/15 10:00'),
('3','4','203','2020/11/16 10:00'),
('4','5','203','2020/11/17 10:00'),
('5','6','203','2020/11/17 11:00'),
('6','7','204','2020/11/17 12:00'),
('7','8','205','2020/11/18 11:30'),
('8','9','212','2020/12/1 10:00'),
('9','4','207','2020/12/2 10:00'),
('10','5','213','2020/12/3 10:00'),
('11','6','209','2020/12/4 8:00'),
('12','7','203','2020/12/5 10:00'),
('13','8','203','2020/12/6 10:00'),
('14','9','203','2020/12/7 15:00'),
('15','1','203','2020/12/8 10:00'),
('16','2','204','2020/12/9 10:00'),
('17','3','205','2020/12/10 8:00'),
('18','4','208','2020/12/11 10:00'),
('19','5','209','2020/12/10 19:00'),
('20','7','201','2020/12/11 19:00');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;