记一次 SQL 减肥:用 WINDOW 子句干掉重复的 OVER
前阵子给业务方出一个部门薪资分析报表,一个查询里同时算了排名、累计、环比、占比,四个窗口函数,每个 OVER 后面都挂着同一串 PARTITION BY dept ORDER BY salary DESC。
写完一数,这行重复出现了 4 次。
SELECT
name, dept, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank,
SUM(salary) OVER (PARTITION BY dept ORDER BY salary DESC) AS cumsum,
AVG(salary) OVER (PARTITION BY dept ORDER BY salary DESC) AS avg_sal,
LAG(salary, 1) OVER (PARTITION BY dept ORDER BY salary DESC) AS prev_sal
FROM employee;
这还只是个简单的例子。实际生产里十几个窗口函数的查询我见过不少,PARTITION BY 后面动不动四五个字段,满屏都是重复代码。更麻烦的是,需求一改——比如多加一个分组维度——你得把每个 OVER 全部改一遍,漏一个就是一个隐式 bug。
当时就在想,SQL 有没有类似"定义变量"的机制,把这段重复的东西提出来?
WINDOW 子句
还真有。查 Spark SQL 文档发现了一个叫 WINDOW 的子句,用法很简单:在查询末尾把窗口规范定义好、起个名,前面的函数直接 OVER 名字 就行。
SELECT
name, dept, salary,
ROW_NUMBER() OVER w AS rank,
SUM(salary) OVER w AS cumsum,
AVG(salary) OVER w AS avg_sal,
LAG(salary, 1) OVER w AS prev_sal
FROM employee
WINDOW w AS (PARTITION BY dept ORDER BY salary DESC);
WINDOW 放在 FROM / WHERE / GROUP BY 之后,ORDER BY / LIMIT 之前。上面这个例子,四个 OVER w 替代了四行重复的窗口定义,以后改分组字段只改 WINDOW 那一行。
不止一个窗口
一个查询里定义多个窗口也很自然:
SELECT
name, dept, salary,
ROW_NUMBER() OVER dept_window AS dept_rank,
SUM(salary) OVER dept_window AS dept_cumsum,
ROW_NUMBER() OVER company_window AS company_rank
FROM employee
WINDOW
dept_window AS (PARTITION BY dept ORDER BY salary DESC),
company_window AS (ORDER BY salary DESC);
两个窗口各管各的,名字本身就是注释——dept_window 一看就知道是部门维度的窗口。
窗口之间还能继承
这个特性我开始没注意到,后来才发现挺好用:
SELECT
product_id, order_date, revenue,
SUM(revenue) OVER rolling_3d AS rolling_3d_sum,
AVG(revenue) OVER rolling_3d AS rolling_3d_avg,
SUM(revenue) OVER rolling_3d_ext AS rolling_3d_ext_sum
FROM orders
WINDOW
base AS (PARTITION BY product_id ORDER BY order_date),
rolling_3d AS (base ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
rolling_3d_ext AS (base ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING);
rolling_3d 直接引用 base,继承它的分组和排序,只追加了窗口帧。三个函数都用到 PARTITION BY product_id ORDER BY order_date,但只在 base 里出现了一次。
写复杂报表的时候这个模式特别管用——先定义几个基础窗口(只含分组+排序),然后往上叠加不同范围的窗口帧。
实际场景:面试题的 SQL 也能瘦身
最近整理题目的时候,发现站里有一道 拼多多的面试题——篮球比赛反超判断,刚好就是个典型。
题目要求找出"帮助球队反超比分的球员"。解题思路是先算 A、B 两队的实时累计得分,再用 LAG 拿前一行的比分,最后判断是否发生了反超。原解答分了两个子查询:内层算累计,外层拿 LAG。
-- 原解答写法(内层)
SELECT *,
SUM(CASE WHEN team = 'A' THEN score ELSE 0 END) OVER (ORDER BY score_time) AS total_a,
SUM(CASE WHEN team = 'B' THEN score ELSE 0 END) OVER (ORDER BY score_time) AS total_b
FROM t4_basketball_game_score_detail;
-- 原解答写法(外层)
SELECT *,
LAG(total_a) OVER (ORDER BY score_time) AS pre_total_a,
LAG(total_b) OVER (ORDER BY score_time) AS pre_total_b
FROM (...);
上下两层,四个 OVER (ORDER BY score_time),每个都一样。
用 WINDOW 改一下:
-- 内层
SELECT *,
SUM(CASE WHEN team = 'A' THEN score ELSE 0 END) OVER w AS total_a,
SUM(CASE WHEN team = 'B' THEN score ELSE 0 END) OVER w AS total_b
FROM t4_basketball_game_score_detail
WINDOW w AS (ORDER BY score_time);
-- 外层
SELECT *,
LAG(total_a) OVER w2 AS pre_total_a,
LAG(total_b) OVER w2 AS pre_total_b
FROM (...)
WINDOW w2 AS (ORDER BY score_time);
每条子查询里 ORDER BY score_time 只出现一次。后面如果要调整排序——比如加个 score_time, id 防止同秒——两条 WINDOW 各改一处,不会漏。
完整合起来就是这样:
SELECT *
FROM (
SELECT *,
LAG(total_a) OVER w2 AS pre_total_a,
LAG(total_b) OVER w2 AS pre_total_b
FROM (
SELECT *,
SUM(CASE WHEN team = 'A' THEN score ELSE 0 END) OVER w AS total_a,
SUM(CASE WHEN team = 'B' THEN score ELSE 0 END) OVER w AS total_b
FROM t4_basketball_game_score_detail
WINDOW w AS (ORDER BY score_time)
) t
WINDOW w2 AS (ORDER BY score_time)
) tt
WHERE (tt.team = 'A' AND tt.pre_total_a <= tt.pre_total_b AND tt.total_a > tt.total_b)
OR (tt.team = 'B' AND tt.pre_total_b <= tt.pre_total_a AND tt.total_b > tt.total_a);
跟在 article 里看到的解法比,每个子查询少了两行重复的 ORDER BY score_time。更重要的是,之后谁来看这段 SQL 都能一眼认出 OVER w 和 OVER w2 是同一个窗口——不用瞪着眼睛对三遍字符串。
写的时候注意几个坑
窗口名别跟列名、别名重了,会报歧义错。养成加前缀的习惯,比如 w_rank、w_dept。
WINDOW 子句的位置不能随便放——它必须在 ORDER BY 前面。我踩过一次坑,习惯性地把窗口定义写在了 ORDER BY 后面,SQL 解析直接报错。
另外窗口套窗口的时候,建议基础窗口只放 PARTITION BY 和 ORDER BY,不要顺手塞窗口帧进去。帧放到引用它的最终窗口里加,逻辑清楚,改起来也不容易出问题。
回头想,WINDOW 子句做的事特别简单——给重复出现的 OVER(...) 起个名字,跟写代码时把魔法字符串提成常量一个道理。两三个窗口函数共享同一个窗口定义的时候,用 WINDOW 立竿见影;但如果每个窗口都是不同粒度的、各只出现一次,直接写 OVER 反而更直观。
跟 窗口函数专题 里的各函数配合使用,或者用 窗口函数可视化播放器 看看帧变化,理解得更快。
「数据仓库技术」文章同步更新,不错过每一篇干货

备注「数据仓库技术」加入社群,每日一道大厂SQL真题
