面试真题
小红书
2.用户商品购买收藏行为特征加工

小红书大数据面试SQL-用户商品购买收藏行为特征加工

一、题目

已知有

购买记录表t2_order,包含自增id:id,用户ID:user_id,商品ID:goods_id,订单时间:order_time,商品类别:goods_type;

用户收藏记录表t2_collect_log,包含自增id,用户ID:user_id,商品ID:goods_id,收藏时间 collect_time

请用一句sql语句得出以下查询结果,得到所有用户的商品行为特征,其中用户行为分类为4种:是否已购买、购买未收藏、收藏未购买、收藏且购买。

购买记录表t2_order

+-----+----------+-----------+-------------------+-------------+
| id  | user_id  | goods_id  |    order_time     | goods_type  |
+-----+----------+-----------+-------------------+-------------+
| 1   | 1        | 201       | 2020/11/14 10:00  | 1           |
| 2   | 2        | 203       | 2020/11/15 12:00  | 2           |
| 3   | 3        | 203       | 2020/11/16 10:00  | 1           |
| 4   | 4        | 203       | 2020/11/17 10:00  | 1           |
| 5   | 5        | 203       | 2020/11/18 10:00  | 1           |
| 6   | 6        | 203       | 2020/11/18 11:00  | 1           |
| 7   | 7        | 204       | 2020/11/18 12:00  | 1           |
| 8   | 8        | 205       | 2020/11/18 11:30  | 1           |
| 9   | 9        | 206       | 2020/12/1 10:00   | 1           |
| 10  | 4        | 207       | 2020/12/2 10:00   | 3           |
| 11  | 5        | 208       | 2020/12/3 10:00   | 1           |
| 12  | 6        | 209       | 2020/12/4 8:00    | 2           |
| 13  | 7        | 203       | 2020/12/5 10:00   | 2           |
| 14  | 8        | 203       | 2020/12/6 10:00   | 3           |
| 15  | 9        | 203       | 2020/12/7 15:00   | 4           |
| 16  | 1        | 204       | 2020/12/8 10:00   | 5           |
| 17  | 2        | 204       | 2020/12/9 10:00   | 5           |
| 18  | 3        | 206       | 2020/12/10 10:00  | 5           |
| 19  | 4        | 208       | 2020/12/11 10:00  | 5           |
| 20  | 5        | 209       | 2020/12/12 19:00  | 5           |
+-----+----------+-----------+-------------------+-------------+

收藏记录表t2_collect_log

+-----+----------+-----------+-------------------+
| id  | user_id  | goods_id  |   collect_time    |
+-----+----------+-----------+-------------------+
| 1   | 1        | 203       | 2020/11/14 12:00  |
| 2   | 9        | 203       | 2020/11/15 10:00  |
| 3   | 4        | 203       | 2020/11/16 10:00  |
| 4   | 5        | 203       | 2020/11/17 10:00  |
| 5   | 6        | 203       | 2020/11/17 11:00  |
| 6   | 7        | 204       | 2020/11/17 12:00  |
| 7   | 8        | 205       | 2020/11/18 11:30  |
| 8   | 9        | 212       | 2020/12/1 10:00   |
| 9   | 4        | 207       | 2020/12/2 10:00   |
| 10  | 5        | 213       | 2020/12/3 10:00   |
| 11  | 6        | 209       | 2020/12/4 8:00    |
| 12  | 7        | 203       | 2020/12/5 10:00   |
| 13  | 8        | 203       | 2020/12/6 10:00   |
| 14  | 9        | 203       | 2020/12/7 15:00   |
| 15  | 1        | 203       | 2020/12/8 10:00   |
| 16  | 2        | 204       | 2020/12/9 10:00   |
| 17  | 3        | 205       | 2020/12/10 8:00   |
| 18  | 4        | 208       | 2020/12/11 10:00  |
| 19  | 5        | 209       | 2020/12/10 19:00  |
| 20  | 7        | 201       | 2020/12/11 19:00  |
+-----+----------+-----------+-------------------+

期望结果

+----------+-----------+---------+------------------+------------------+------------------+
| user_id  | goods_id  | is_buy  | buy_not_collect  | collect_not_buy  | buy_and_collect  |
+----------+-----------+---------+------------------+------------------+------------------+
| 1        | 201       | 1       | 1                | 0                | 0                |
| 1        | 203       | 0       | 0                | 1                | 0                |
| 1        | 204       | 1       | 1                | 0                | 0                |
| 2        | 203       | 1       | 1                | 0                | 0                |
| 2        | 204       | 1       | 0                | 0                | 1                |
| 3        | 203       | 1       | 1                | 0                | 0                |
| 3        | 205       | 0       | 0                | 1                | 0                |
| 3        | 206       | 1       | 1                | 0                | 0                |
| 4        | 203       | 1       | 0                | 0                | 1                |
| 4        | 207       | 1       | 0                | 0                | 1                |
| 4        | 208       | 1       | 0                | 0                | 1                |
| 5        | 203       | 1       | 0                | 0                | 1                |
| 5        | 208       | 1       | 1                | 0                | 0                |
| 5        | 209       | 1       | 0                | 0                | 1                |
| 5        | 213       | 0       | 0                | 1                | 0                |
| 6        | 203       | 1       | 0                | 0                | 1                |
| 6        | 209       | 1       | 0                | 0                | 1                |
| 7        | 201       | 0       | 0                | 1                | 0                |
| 7        | 203       | 1       | 0                | 0                | 1                |
| 7        | 204       | 1       | 0                | 0                | 1                |
| 8        | 203       | 1       | 0                | 0                | 1                |
| 8        | 205       | 1       | 0                | 0                | 1                |
| 9        | 203       | 1       | 0                | 0                | 1                |
| 9        | 206       | 1       | 1                | 0                | 0                |
| 9        | 212       | 0       | 0                | 1                | 0                |
+----------+-----------+---------+------------------+------------------+------------------+

二、分析

这个题目属于简单但繁琐的类型,日常数据开发中尤其在有支持算法的数据团队中比较常见,但是大家都不乐意做的脏活。如果面试中遇到类似这种问题,需要考虑这个团队日常主要干脏活,能不能接受。说回题目,因为数据需要join操作,并且存在较多的冗余,由此很容易出现数据倾斜的问题,如果我出这个题目,希望看到的是候选人日常sql的习惯中是否优先进行行列裁剪和去重以保证join时两个表的粒度统一

维度评分
题目难度⭐️⭐️⭐️
题目清晰度⭐️⭐️⭐️⭐️⭐️
业务常见度⭐️⭐️⭐️⭐️

三、SQL

1.行列裁剪

首先对订单表、收藏记录进行行列裁剪,我们只需要user_id、goods_id,去掉其他冗余列,且保证user_id+goods_id唯一

订单表

select
    user_id,
    goods_id
from t2_order
group by
    user_id,
    goods_id

执行结果

+----------+-----------+
| user_id  | goods_id  |
+----------+-----------+
| 1        | 201       |
| 1        | 204       |
| 2        | 203       |
| 2        | 204       |
| 3        | 203       |
| 3        | 206       |
| 4        | 203       |
| 4        | 207       |
| 4        | 208       |
| 5        | 203       |
| 5        | 208       |
| 5        | 209       |
| 6        | 203       |
| 6        | 209       |
| 7        | 203       |
| 7        | 204       |
| 8        | 203       |
| 8        | 205       |
| 9        | 203       |
| 9        | 206       |
+----------+-----------+

收藏表

select
    user_id,
    goods_id
from t2_collect_log
group by
    user_id,
    goods_id

执行结果

+----------+-----------+
| user_id  | goods_id  |
+----------+-----------+
| 1        | 203       |
| 2        | 204       |
| 3        | 205       |
| 4        | 203       |
| 4        | 207       |
| 4        | 208       |
| 5        | 203       |
| 5        | 209       |
| 5        | 213       |
| 6        | 203       |
| 6        | 209       |
| 7        | 201       |
| 7        | 203       |
| 7        | 204       |
| 8        | 203       |
| 8        | 205       |
| 9        | 203       |
| 9        | 212       |
+----------+-----------+

2.两个表进行全外联,获得全量的数据行

对两个结果表进行全外联,关联条件为user_id + goods_id, 因为已经完成去重, 所以得到的行就是全量的行,且不会有重复。我们先直接关联,不做任何加工。

执行SQL

select
    t_ord.user_id,
    t_ord.goods_id,
    t_collect.user_id,
    t_collect.goods_id
from
    (
    --订单表数据
    select
        user_id,
        goods_id
    from t2_order
    group by
        user_id,
        goods_id
    ) t_ord
    full join
    (
    --收藏表数据
    select
        user_id,
        goods_id
    from t2_collect_log
    group by
        user_id,
        goods_id
    ) t_collect
        on t_ord.user_id = t_collect.user_id
        and t_ord.goods_id = t_collect.goods_id

执行结果

+----------------+-----------------+--------------------+---------------------+
| t_ord.user_id  | t_ord.goods_id  | t_collect.user_id  | t_collect.goods_id  |
+----------------+-----------------+--------------------+---------------------+
| 1              | 201             | NULL               | NULL                |
| NULL           | NULL            | 1                  | 203                 |
| 1              | 204             | NULL               | NULL                |
| 2              | 203             | NULL               | NULL                |
| 2              | 204             | 2                  | 204                 |
| 3              | 203             | NULL               | NULL                |
| NULL           | NULL            | 3                  | 205                 |
| 3              | 206             | NULL               | NULL                |
| 4              | 203             | 4                  | 203                 |
| 4              | 207             | 4                  | 207                 |
| 4              | 208             | 4                  | 208                 |
| 5              | 203             | 5                  | 203                 |
| 5              | 208             | NULL               | NULL                |
| 5              | 209             | 5                  | 209                 |
| NULL           | NULL            | 5                  | 213                 |
| 6              | 203             | 6                  | 203                 |
| 6              | 209             | 6                  | 209                 |
| NULL           | NULL            | 7                  | 201                 |
| 7              | 203             | 7                  | 203                 |
| 7              | 204             | 7                  | 204                 |
| 8              | 203             | 8                  | 203                 |
| 8              | 205             | 8                  | 205                 |
| 9              | 203             | 9                  | 203                 |
| 9              | 206             | NULL               | NULL                |
| NULL           | NULL            | 9                  | 212                 |
+----------------+-----------------+--------------------+---------------------+

3.求取结果数据

我们先把所有的user_id 和 goods_id取出来,然后进行特征加工:

  • 是否购买: 根据 t_ord中的goods_id 是否为空判断是否购买,为空代表未购买,非空代表购买;
  • 购买未收藏: t_ord中goods_id不为空,t_collect中goods_id为空;
  • 收藏未购买: t_ord中goods_id为空,t_collect中的goods_id不为空;
  • 收藏且购买: t_ord中的goods_id不为空,t_collect中的goods_id不为空;

执行SQL

select
    coalesce(t_ord.user_id,t_collect.user_id) as user_id,
    coalesce(t_ord.goods_id,t_collect.goods_id) as goods_id,
    if(t_ord.goods_id is not null,1,0) as is_buy,
    if(t_ord.goods_id is not null and t_collect.goods_id is null,1,0) as buy_not_collect,
    if(t_ord.goods_id is null and t_collect.goods_id is not null,1,0) as collect_not_buy,
    if(t_ord.goods_id is not null and t_collect.goods_id is not null,1,0) as buy_and_collect
from
    (
    --订单表数据
    select
        user_id,
        goods_id
    from t2_order
    group by
        user_id,
        goods_id
    ) t_ord
    full join
    (
    --收藏表数据
    select
        user_id,
        goods_id
    from t2_collect_log
    group by
        user_id,
        goods_id
    ) t_collect
        on t_ord.user_id = t_collect.user_id
        and t_ord.goods_id = t_collect.goods_id

执行结果

+----------+-----------+---------+------------------+------------------+------------------+
| user_id  | goods_id  | is_buy  | buy_not_collect  | collect_not_buy  | buy_and_collect  |
+----------+-----------+---------+------------------+------------------+------------------+
| 1        | 201       | 1       | 1                | 0                | 0                |
| 1        | 203       | 0       | 0                | 1                | 0                |
| 1        | 204       | 1       | 1                | 0                | 0                |
| 2        | 203       | 1       | 1                | 0                | 0                |
| 2        | 204       | 1       | 0                | 0                | 1                |
| 3        | 203       | 1       | 1                | 0                | 0                |
| 3        | 205       | 0       | 0                | 1                | 0                |
| 3        | 206       | 1       | 1                | 0                | 0                |
| 4        | 203       | 1       | 0                | 0                | 1                |
| 4        | 207       | 1       | 0                | 0                | 1                |
| 4        | 208       | 1       | 0                | 0                | 1                |
| 5        | 203       | 1       | 0                | 0                | 1                |
| 5        | 208       | 1       | 1                | 0                | 0                |
| 5        | 209       | 1       | 0                | 0                | 1                |
| 5        | 213       | 0       | 0                | 1                | 0                |
| 6        | 203       | 1       | 0                | 0                | 1                |
| 6        | 209       | 1       | 0                | 0                | 1                |
| 7        | 201       | 0       | 0                | 1                | 0                |
| 7        | 203       | 1       | 0                | 0                | 1                |
| 7        | 204       | 1       | 0                | 0                | 1                |
| 8        | 203       | 1       | 0                | 0                | 1                |
| 8        | 205       | 1       | 0                | 0                | 1                |
| 9        | 203       | 1       | 0                | 0                | 1                |
| 9        | 206       | 1       | 1                | 0                | 0                |
| 9        | 212       | 0       | 0                | 1                | 0                |
+----------+-----------+---------+------------------+------------------+------------------+

四、建表语句和数据插入

--订单表创建语句
CREATE TABLE IF NOT EXISTS t2_order (
    id string,
    user_id string,
    goods_id string,
    order_time string,
    goods_type string
);
--订单数据插入
insert into t2_order(id,user_id,goods_id,order_time,goods_type)values
('1','1','201','2020/11/14 10:00','1'),
('2','2','203','2020/11/15 12:00','2'),
('3','3','203','2020/11/16 10:00','1'),
('4','4','203','2020/11/17 10:00','1'),
('5','5','203','2020/11/18 10:00','1'),
('6','6','203','2020/11/18 11:00','1'),
('7','7','204','2020/11/18 12:00','1'),
('8','8','205','2020/11/18 11:30','1'),
('9','9','206','2020/12/1 10:00','1'),
('10','4','207','2020/12/2 10:00','3'),
('11','5','208','2020/12/3 10:00','1'),
('12','6','209','2020/12/4 8:00','2'),
('13','7','203','2020/12/5 10:00','2'),
('14','8','203','2020/12/6 10:00','3'),
('15','9','203','2020/12/7 15:00','4'),
('16','1','204','2020/12/8 10:00','5'),
('17','2','204','2020/12/9 10:00','5'),
('18','3','206','2020/12/10 10:00','5'),
('19','4','208','2020/12/11 10:00','5'),
('20','5','209','2020/12/12 19:00','5');
 
--收藏记录日志
CREATE TABLE IF NOT EXISTS t2_collect_log (
    id string,
    user_id string,
    goods_id string,
    collect_time string
);
--收藏记录数据插入
insert into t2_collect_log(id,user_id,goods_id,collect_time)values
('1','1','203','2020/11/14 12:00'),
('2','9','203','2020/11/15 10:00'),
('3','4','203','2020/11/16 10:00'),
('4','5','203','2020/11/17 10:00'),
('5','6','203','2020/11/17 11:00'),
('6','7','204','2020/11/17 12:00'),
('7','8','205','2020/11/18 11:30'),
('8','9','212','2020/12/1 10:00'),
('9','4','207','2020/12/2 10:00'),
('10','5','213','2020/12/3 10:00'),
('11','6','209','2020/12/4 8:00'),
('12','7','203','2020/12/5 10:00'),
('13','8','203','2020/12/6 10:00'),
('14','9','203','2020/12/7 15:00'),
('15','1','203','2020/12/8 10:00'),
('16','2','204','2020/12/9 10:00'),
('17','3','205','2020/12/10 8:00'),
('18','4','208','2020/12/11 10:00'),
('19','5','209','2020/12/10 19:00'),
('20','7','201','2020/12/11 19:00');

本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;