再谈炸裂函数

前两篇,给大家分享了Lateral View、Lateral View Outer、explode和posexplode的用法,但是在群里大家有更加深入的讨论,讨论posexplode的应用场景,感觉之前的分享不是很尽兴,于是再加一篇。

1. 测试数据

为了更加完成更加复杂的测试,我们把基础数据调整一下,数据如下:

create table test(id int, items array<string>,cnts array<string>);
insert into test values 
(1, array('apple', 'banana', 'pear'),array('1','2','3')),
(2,array('watermelon', 'orange'),array('5','6')),
(3,array(null),array(null));
+----------+----------------------------+----------------+
| test.id  |         test.items         |   test.cnts    |
+----------+----------------------------+----------------+
| 1        | ["apple","banana","pear"]  | ["1","2","3"]  |
| 2        | ["watermelon","orange"]    | ["5","6"]      |
| 3        | NULL                       | NULL           |
+----------+----------------------------+----------------+

2.两列数组内容笛卡尔积

我们使用两次lateral view explode,可以计算出两列的笛卡尔积,SQL如下

select
id,items,cnts,item,cnt
from test
lateral view explode(items) t_item as item
lateral view explode(cnts) t_cnts as cnt

结果会是:

+-----+----------------------------+----------------+-------------+------+
| id  |           items            |      cnts      |    item     | cnt  |
+-----+----------------------------+----------------+-------------+------+
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | apple       | 1    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | apple       | 2    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | apple       | 3    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | banana      | 1    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | banana      | 2    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | banana      | 3    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | pear        | 1    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | pear        | 2    |
| 1   | ["apple","banana","pear"]  | ["1","2","3"]  | pear        | 3    |
| 2   | ["watermelon","orange"]    | ["5","6"]      | watermelon  | 5    |
| 2   | ["watermelon","orange"]    | ["5","6"]      | watermelon  | 6    |
| 2   | ["watermelon","orange"]    | ["5","6"]      | orange      | 5    |
| 2   | ["watermelon","orange"]    | ["5","6"]      | orange      | 6    |
+-----+----------------------------+----------------+-------------+------+

上面我们可以看到:item列和cnt两列进行了笛卡尔积计算,在需要计算笛卡尔积的时候,可以这样完成。

2. posexplode用法

可是,如果我们的items和cnts是有对应关系的,即 items中的apple对应cnts中的1,banana对应2,pear对应3,我们就需要有对应的关系,这个时候posexplode就闪亮登场了

select
id,item_idx,item,cnt_idx,cnt
from test
lateral view posexplode(items) t_item as item_idx,item
lateral view posexplode(cnts) t_cnts as cnt_idx,cnt

结果会是:

+-----+-----------+-------------+----------+------+
| id  | item_idx  |    item     | cnt_idx  | cnt  |
+-----+-----------+-------------+----------+------+
| 1   | 0         | apple       | 0        | 1    |
| 1   | 0         | apple       | 1        | 2    |
| 1   | 0         | apple       | 2        | 3    |
| 1   | 1         | banana      | 0        | 1    |
| 1   | 1         | banana      | 1        | 2    |
| 1   | 1         | banana      | 2        | 3    |
| 1   | 2         | pear        | 0        | 1    |
| 1   | 2         | pear        | 1        | 2    |
| 1   | 2         | pear        | 2        | 3    |
| 2   | 0         | watermelon  | 0        | 5    |
| 2   | 0         | watermelon  | 1        | 6    |
| 2   | 1         | orange      | 0        | 5    |
| 2   | 1         | orange      | 1        | 6    |
+-----+-----------+-------------+----------+------+

我们可以限定item_idx和cnt_idx 相等,得到对应的数据:

select
id,item_idx,item,cnt_idx,cnt
from test
lateral view posexplode(items) t_item as item_idx,item
lateral view posexplode(cnts) t_cnts as cnt_idx,cnt
where item_idx=cnt_idx

查询结果如下

+-----+-----------+-------------+----------+------+
| id  | item_idx  |    item     | cnt_idx  | cnt  |
+-----+-----------+-------------+----------+------+
| 1   | 0         | apple       | 0        | 1    |
| 1   | 1         | banana      | 1        | 2    |
| 1   | 2         | pear        | 2        | 3    |
| 2   | 0         | watermelon  | 0        | 5    |
| 2   | 1         | orange      | 1        | 6    |
+-----+-----------+-------------+----------+------+

我们可以看到只留下了对应关系的数据。

3.map_from_arrays()

然而,此时群里的朋友提出了新的解法,引入了一个新的函数 map_from_arrays(),具体写法如下:

select 
id,items,cnts,map_from_arrays(items,cnts) as col3
from test

可是结果出现了意外,语句报错说没有map_from_arrays函数。

于是我去spark集群上进行了测试

再讲

可以得到如下结果,可以拿到对应关系结果。

再讲2

于是大家学到了一个新的方法map_from_arrays()。