2.行转列-使用transform进行有序行转列
一、基础数据
有配送订单表记录骑手配送的物品类型、送达时间、顾客id、配送举例及配送费。
+-----------+-----------+-------------+----------------------+--------------+-----------+----------+
| rider_id | order_id | goods_type | delivery_time | customer_id | distance | payment |
+-----------+-----------+-------------+----------------------+--------------+-----------+----------+
| r001 | 0001 | 食品 | 2024-01-01 08:30:00 | c001 | 8.05 | 7.50 |
| r002 | 0002 | 酒水 | 2024-01-01 08:43:00 | c003 | 3.01 | 13.00 |
| r003 | 0003 | 食品 | 2024-01-01 09:15:00 | c004 | 4.12 | 3.50 |
| r002 | 0004 | 文件 | 2024-01-01 09:21:00 | c005 | 10.98 | 15.00 |
| r001 | 0005 | 食品 | 2024-01-01 09:30:00 | c007 | 2.32 | 5.00 |
| r002 | 0006 | 食品 | 2024-01-01 09:47:00 | c002 | 0.78 | 5.00 |
| r003 | 0007 | 文件 | 2024-01-01 10:21:00 | c010 | 8.11 | 8.00 |
| r001 | 0008 | 酒水 | 2024-01-01 11:56:00 | c023 | 4.35 | 15.00 |
| r002 | 0009 | 电子产品 | 2024-01-01 12:30:00 | c031 | 5.05 | 9.50 |
| r002 | 0010 | 文件 | 2024-01-01 13:30:00 | c001 | 6.05 | 7.00 |
+-----------+-----------+-------------+----------------------+--------------+-----------+----------+
二、函数介绍
1.transform介绍
transform(expr, func) - 使用函数对数组中的元素进行转换。
Examples:
> SELECT transform(array(1, 2, 3), x -> x + 1);
[2,3,4]
> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
[1,3,5]
Since: 2.4.0
2.其他函数介绍
三、有序行转列
根据配送订单记录表,查询出骑手id,顾客id列表,要求顾客id列表中的顺序按照送达时间早晚排序。
分析
这里要求按照订单送达时间,对顾客id进行排序。直接考虑是使用开窗函数,根据时间进行排序(这种方法不可行,原因在于collect_list和collect_set那句提示“该函数是非确定性的,因为收集结果的顺序取决于行的顺序,这在经过shuffle之后可能是不确定的”。collect_list函数很难测试,但是collect_set因为有去重操作,所以比较好测试。另外一种解法是将时间和用户id拼接在一起,时间在前,用户id在后,这样对整个字符串拼接为数组,进行排序后再拆分,如此得到的数据能确定保证按照顺序完成。
1.不确定解法(不建议使用)
collect_list执行SQL
select rider_id,
delivery_time,
customer_id,
collect_list(customer_id)over(partition by rider_id order by delivery_time) as customer_id_list
from t_delivery_orders
collect_list执行结果
+-----------+----------------------+--------------+---------------------------------------+
| rider_id | delivery_time | customer_id | customer_id_list |
+-----------+----------------------+--------------+---------------------------------------+
| r001 | 2024-01-01 08:30:00 | c001 | ["c001"] |
| r001 | 2024-01-01 09:30:00 | c007 | ["c001","c007"] |
| r001 | 2024-01-01 11:56:00 | c023 | ["c001","c007","c023"] |
| r002 | 2024-01-01 08:43:00 | c003 | ["c003"] |
| r002 | 2024-01-01 09:21:00 | c005 | ["c003","c005"] |
| r002 | 2024-01-01 09:47:00 | c002 | ["c003","c005","c002"] |
| r002 | 2024-01-01 12:30:00 | c031 | ["c003","c005","c002","c031"] |
| r002 | 2024-01-01 13:30:00 | c001 | ["c003","c005","c002","c031","c001"] |
| r003 | 2024-01-01 09:15:00 | c004 | ["c004"] |
| r003 | 2024-01-01 10:21:00 | c010 | ["c004","c010"] |
+-----------+----------------------+--------------+---------------------------------------+
collect_set执行SQL
select rider_id,
delivery_time,
customer_id,
collect_set(customer_id)over(partition by rider_id order by delivery_time) as customer_id_list
from t_delivery_orders
collect_set执行结果
+-----------+----------------------+--------------+---------------------------------------+
| rider_id | delivery_time | customer_id | customer_id_list |
+-----------+----------------------+--------------+---------------------------------------+
| r001 | 2024-01-01 08:30:00 | c001 | ["c001"] |
| r001 | 2024-01-01 09:30:00 | c007 | ["c001","c007"] |
| r001 | 2024-01-01 11:56:00 | c023 | ["c001","c007","c023"] |
| r002 | 2024-01-01 08:43:00 | c003 | ["c003"] |
| r002 | 2024-01-01 09:21:00 | c005 | ["c005","c003"] |
| r002 | 2024-01-01 09:47:00 | c002 | ["c005","c003","c002"] |
| r002 | 2024-01-01 12:30:00 | c031 | ["c005","c003","c002","c031"] |
| r002 | 2024-01-01 13:30:00 | c001 | ["c001","c005","c003","c002","c031"] |
| r003 | 2024-01-01 09:15:00 | c004 | ["c004"] |
| r003 | 2024-01-01 10:21:00 | c010 | ["c010","c004"] |
+-----------+----------------------+--------------+---------------------------------------+
我们可以观察最后一行结果,发现顺序是不一样。 可以看到collect_list结果是正确的(这里属于巧合,但是大多数情况都是对的),而collect_set的结果是错误的。两个函数中均有提示,不保证结果顺序,从严谨出发,这两个函数均不可因为开窗函数来保证有序。 如果忽略该问题,使用collect_list进行行转列开窗,大部分情况得到的结果是正确的。我们只需要取出每个骑手最后一行数据即可。
执行SQL
select
rider_id,
concat_ws(',',customer_id_list) as customer_id_list
from
(select rider_id,
delivery_time,
customer_id,
collect_list(customer_id)over(partition by rider_id order by delivery_time) as customer_id_list,
row_number() over (partition by rider_id order by delivery_time desc) as rn
from t_delivery_orders
) t
where rn = 1
执行结果
+-----------+---------------------------+
| rider_id | customer_id_list |
+-----------+---------------------------+
| r001 | c001,c007,c023 |
| r002 | c003,c005,c002,c031,c001 |
| r003 | c004,c010 |
+-----------+---------------------------+
2.拼接排序后拆分
2.1先把时间和顾客id进行拼接,拼接后进行行转列
执行SQL
select rider_id,
delivery_time,
customer_id,
concat(delivery_time, customer_id) as time_customer
from t_delivery_orders
执行结果
+-----------+----------------------+--------------+--------------------------+
| rider_id | delivery_time | customer_id | time_customer |
+-----------+----------------------+--------------+--------------------------+
| r001 | 2024-01-01 08:30:00 | c001 | 2024-01-01 08:30:00c001 |
| r002 | 2024-01-01 08:43:00 | c003 | 2024-01-01 08:43:00c003 |
| r003 | 2024-01-01 09:15:00 | c004 | 2024-01-01 09:15:00c004 |
| r002 | 2024-01-01 09:21:00 | c005 | 2024-01-01 09:21:00c005 |
| r001 | 2024-01-01 09:30:00 | c007 | 2024-01-01 09:30:00c007 |
| r002 | 2024-01-01 09:47:00 | c002 | 2024-01-01 09:47:00c002 |
| r003 | 2024-01-01 10:21:00 | c010 | 2024-01-01 10:21:00c010 |
| r001 | 2024-01-01 11:56:00 | c023 | 2024-01-01 11:56:00c023 |
| r002 | 2024-01-01 12:30:00 | c031 | 2024-01-01 12:30:00c031 |
| r002 | 2024-01-01 13:30:00 | c001 | 2024-01-01 13:30:00c001 |
+-----------+----------------------+--------------+--------------------------+
2.2 对time_customer 按照骑手id分组行转列,并使用sort_array排序
执行SQL
select rider_id,
sort_array(collect_list(time_customer)) as sorted_time_customer
from (select rider_id,
delivery_time,
customer_id,
concat(delivery_time, customer_id) as time_customer
from t_delivery_orders) t
group by rider_id
执行结果
+-----------+----------------------------------------------------+
| rider_id | sorted_time_customer |
+-----------+----------------------------------------------------+
| r001 | ["2024-01-01 08:30:00c001","2024-01-01 09:30:00c007","2024-01-01 11:56:00c023"] |
| r002 | ["2024-01-01 08:43:00c003","2024-01-01 09:21:00c005","2024-01-01 09:47:00c002","2024-01-01 12:30:00c031","2024-01-01 13:30:00c001"] |
| r003 | ["2024-01-01 09:15:00c004","2024-01-01 10:21:00c010"] |
+-----------+----------------------------------------------------+
2.3 去掉时间部分后,转换成字符串
执行SQL
select rider_id,
concat_ws(',', transform(sort_array(collect_list(time_customer)), x->substr(x, 20))) as customer_list
from (select rider_id,
delivery_time,
customer_id,
concat(delivery_time, customer_id) as time_customer
from t_delivery_orders) t
group by rider_id
执行结果
+-----------+---------------------------+
| rider_id | customer_list |
+-----------+---------------------------+
| r001 | c001,c007,c023 |
| r002 | c003,c005,c002,c031,c001 |
| r003 | c004,c010 |
+-----------+---------------------------+
四、数据准备
--建表语句
CREATE TABLE IF NOT EXISTS t_delivery_orders
(
rider_id string, -- 骑手ID
order_id string, -- 订单ID
goods_type STRING, -- 物品类型
delivery_time STRING, -- 送达时间
customer_id STRING, -- 客户id
distance decimal(10, 2), -- 配送距离
payment decimal(10, 2) -- 支付金额,骑手的配送费用
)
COMMENT '骑手配送订单表';
--插入数据
INSERT INTO t_delivery_orders VALUES
('r001', '0001', '食品', '2024-01-01 08:30:00','c001',8.05,7.50),
('r002', '0002', '酒水', '2024-01-01 08:43:00','c003',3.01,13.00),
('r003', '0003', '食品', '2024-01-01 09:15:00','c004',4.12,3.50),
('r002', '0004', '文件', '2024-01-01 09:21:00','c005',10.98,15.00),
('r001', '0005', '食品', '2024-01-01 09:30:00','c007',2.32,5.00),
('r002', '0006', '食品', '2024-01-01 09:47:00','c002',0.78,5.00),
('r003', '0007', '文件', '2024-01-01 10:21:00','c010',8.11,8.00),
('r001', '0008', '酒水', '2024-01-01 11:56:00','c023',4.35,15.00),
('r002', '0009', '电子产品', '2024-01-01 12:30:00','c031',5.05,9.50),
('r002', '0010', '文件', '2024-01-01 13:30:00','c001',6.05,7.00);
相关推荐
- 行转列-collect_list,collect_set进行简单行转列 (opens in a new tab)
- 行转列-使用transform进行有序行转列 (opens in a new tab)
- 行转列-使用transform进行有序行转列-多列一一对应 (opens in a new tab)
- 行转列-多行转多列(竖表转横表) (opens in a new tab)
- 列转行-多列转多行(横表变竖表) (opens in a new tab)
- 列转行-lateral view explode列转行 (opens in a new tab)
- 列转行-explode_outer和lateral view outer (opens in a new tab)
- 列转行-posexplode多列对应转行 (opens in a new tab)
- 列转行-lateral view outer posexplode及posexplode_outer多列对应转行 (opens in a new tab)
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表