跳到主要内容

数据质量校验 SQL 集

复制 → 改表名 → 执行,快速定位数据问题

DQC 全景框架

DQC全景框架

各层校验重点

层级核心校验原因
ODS完整性 + 时效性 + 波动性源头数据不可控,第一时间发现问题
DWD唯一性 + 准确性 + 一致性加工逻辑最复杂,最容易出错
DWS一致性 + 波动性聚合结果要和明细对得上
ADS全量复检面向业务方,交付前最后一道防线

维度一:完整性(数据不缺)

场景:字段缺失、分区缺失、记录丢失

场景适用层级校验逻辑严重程度
关键字段为空全层级主键/核心度量字段不允许 NULL阻断
批量字段空值率超标ODS/DWD空值率超过阈值告警
分区无数据ODS当日分区行数为 0阻断
记录数低于下限ODS/DWD行数低于历史最低值的某个比例告警

关键字段非空检测(阻断型)

SELECT
SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS user_id_null,
SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS order_id_null,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS amount_null
FROM dwd_order
WHERE dt = '${bizdate}';
-- 任一字段 > 0 则阻断

批量空值率(告警型)

SELECT
COUNT(1) AS total,
ROUND((COUNT(1) - COUNT(phone)) / COUNT(1) * 100, 2) AS phone_null_pct,
ROUND((COUNT(1) - COUNT(address)) / COUNT(1) * 100, 2) AS address_null_pct,
ROUND((COUNT(1) - COUNT(city)) / COUNT(1) * 100, 2) AS city_null_pct
FROM ods_user_info
WHERE dt = '${bizdate}';
-- 超过设定阈值(如 phone > 5%)则告警

分区无数据检测

SELECT
CASE WHEN cnt = 0 THEN 'BLOCK' ELSE 'PASS' END AS result
FROM (
SELECT COUNT(1) AS cnt
FROM ods_order
WHERE dt = '${bizdate}'
) t;

维度二:唯一性(数据不重)

场景:ETL 重跑导致重复、JOIN 膨胀、上游重复推送

场景适用层级校验逻辑严重程度
主键重复DWD业务主键 COUNT > 1阻断
联合键重复DWD/DWS多字段组合去重阻断
整行重复ODS完全相同记录出现多次告警

主键重复检测

SELECT pk_col, COUNT(1) AS dup_count
FROM dwd_order
WHERE dt = '${bizdate}'
GROUP BY pk_col
HAVING COUNT(1) > 1
LIMIT 10;
-- 有结果则阻断

重复率统计

SELECT
COUNT(1) AS total,
COUNT(DISTINCT order_id) AS distinct_cnt,
ROUND((COUNT(1) - COUNT(DISTINCT order_id)) / COUNT(1) * 100, 4) AS dup_rate_pct
FROM dwd_order
WHERE dt = '${bizdate}';
-- dup_rate_pct > 0 需排查

JOIN 膨胀检测(加工后比加工前行数异常增长)

SELECT
CASE WHEN after_cnt > before_cnt * 1.05 THEN 'WARN' ELSE 'PASS' END AS result,
before_cnt,
after_cnt
FROM (
SELECT
(SELECT COUNT(1) FROM ods_order WHERE dt = '${bizdate}') AS before_cnt,
(SELECT COUNT(1) FROM dwd_order WHERE dt = '${bizdate}') AS after_cnt
) t;

维度三:准确性(数据不错)

场景:枚举值越界、数值异常、逻辑矛盾

场景适用层级校验逻辑严重程度
枚举值不在合法范围DWDNOT IN 合法值列表告警
数值超出业务边界DWD/DWS金额为负/年龄超150等告警
字段间逻辑矛盾DWD结束时间 < 开始时间阻断
比率类越界DWS/ADS转化率 > 100% 或 < 0阻断

枚举值校验

SELECT status, COUNT(1) AS cnt
FROM dwd_order
WHERE dt = '${bizdate}'
AND status NOT IN ('created', 'paid', 'shipped', 'completed', 'cancelled')
GROUP BY status;
-- 有结果则告警

数值边界校验

SELECT
COUNT(CASE WHEN amount < 0 THEN 1 END) AS negative_amt,
COUNT(CASE WHEN amount > 10000000 THEN 1 END) AS extreme_amt,
COUNT(CASE WHEN age < 0 OR age > 150 THEN 1 END) AS invalid_age
FROM dwd_order_detail
WHERE dt = '${bizdate}';

逻辑矛盾检测

SELECT COUNT(1) AS conflict_count
FROM dwd_event
WHERE dt = '${bizdate}'
AND end_time < start_time;
-- > 0 则阻断

比率越界

SELECT COUNT(1) AS invalid_count
FROM ads_funnel
WHERE dt = '${bizdate}'
AND (convert_rate > 1 OR convert_rate < 0);

维度四:一致性(上下游对得上)

场景:跨层对账、跨系统核对、维表关联缺失

场景适用层级校验逻辑严重程度
上下游行数差异DWD→DWS明细聚合后 vs 汇总表告警
上下游金额差异全层级SUM 对比,差值超阈值阻断
维表关联失败DWDLEFT JOIN 后外键为空告警
明细与汇总不一致DWS/ADS重新聚合 vs 已有结果阻断

上下游行数对比

SELECT
src.cnt AS dwd_count,
tgt.cnt AS dws_count,
src.cnt - tgt.cnt AS diff
FROM
(SELECT COUNT(DISTINCT order_id) AS cnt FROM dwd_order WHERE dt = '${bizdate}') src,
(SELECT SUM(order_cnt) AS cnt FROM dws_shop_daily WHERE dt = '${bizdate}') tgt;
-- diff 不为 0 需排查

金额一致性校验

SELECT
ABS(a.total_amt - b.total_amt) AS amt_diff,
CASE
WHEN ABS(a.total_amt - b.total_amt) / a.total_amt > 0.001 THEN 'BLOCK'
ELSE 'PASS'
END AS result
FROM
(SELECT SUM(amount) AS total_amt FROM dwd_order WHERE dt = '${bizdate}') a,
(SELECT SUM(order_amount) AS total_amt FROM dws_shop_daily WHERE dt = '${bizdate}') b;

维表关联覆盖率

SELECT
COUNT(1) AS total,
SUM(CASE WHEN b.user_name IS NULL THEN 1 ELSE 0 END) AS miss_count,
ROUND(SUM(CASE WHEN b.user_name IS NULL THEN 1 ELSE 0 END) / COUNT(1) * 100, 2) AS miss_rate_pct
FROM dwd_order a
LEFT JOIN dim_user b ON a.user_id = b.user_id
WHERE a.dt = '${bizdate}';
-- miss_rate_pct > 5% 则告警

明细重新聚合 vs 汇总表

SELECT
a.shop_id,
a.recalc_amt,
b.order_amount,
ABS(a.recalc_amt - b.order_amount) AS diff
FROM (
SELECT shop_id, SUM(amount) AS recalc_amt
FROM dwd_order
WHERE dt = '${bizdate}'
GROUP BY shop_id
) a
JOIN dws_shop_daily b ON a.shop_id = b.shop_id AND b.dt = '${bizdate}'
WHERE ABS(a.recalc_amt - b.order_amount) > 0.01;

维度五:时效性(数据不迟)

场景:上游延迟、分区未产出、数据滞后

场景适用层级校验逻辑严重程度
分区未按时产出ODS指定时间检查分区是否存在阻断
数据事件时间滞后ODS/DWD最新事件时间远早于当前时间告警
依赖表未就绪全层级上游分区检测阻断

数据延迟检测

SELECT
MAX(event_time) AS latest_event,
CURRENT_TIMESTAMP() AS check_time,
(UNIX_TIMESTAMP(CURRENT_TIMESTAMP()) - UNIX_TIMESTAMP(MAX(event_time))) / 60 AS delay_minutes
FROM ods_event_log
WHERE dt = '${bizdate}';
-- delay_minutes > 阈值则告警

分区存在性检测

-- Hive
SHOW PARTITIONS table_name PARTITION (dt = '${bizdate}');
-- 无结果则阻断

数据截止时间校验(防止只同步了部分数据)

SELECT
MAX(update_time) AS latest_sync,
CASE
WHEN MAX(update_time) < '${bizdate} 23:00:00' THEN 'INCOMPLETE'
ELSE 'PASS'
END AS result
FROM ods_order
WHERE dt = '${bizdate}';

维度六:波动性(数据不飘)

场景:数据量突增突降、指标异常波动

场景适用层级校验逻辑严重程度
日环比行数波动全层级与昨日对比超 ±30%告警
周同比波动全层级与上周同天对比告警
核心指标突变DWS/ADS关键业务指标环比阻断

行数日环比

WITH daily AS (
SELECT dt, COUNT(1) AS cnt
FROM dwd_order
WHERE dt BETWEEN DATE_SUB('${bizdate}', 1) AND '${bizdate}'
GROUP BY dt
)
SELECT
t.cnt AS today,
y.cnt AS yesterday,
ROUND((t.cnt - y.cnt) / y.cnt * 100, 2) AS change_pct,
CASE
WHEN ABS((t.cnt - y.cnt) / y.cnt) > 0.3 THEN 'WARN'
ELSE 'PASS'
END AS result
FROM daily t
JOIN daily y ON t.dt = DATE_ADD(y.dt, 1)
WHERE t.dt = '${bizdate}';

周同比

SELECT
a.cnt AS this_week,
b.cnt AS last_week,
ROUND((a.cnt - b.cnt) / b.cnt * 100, 2) AS wow_pct
FROM
(SELECT COUNT(1) AS cnt FROM dwd_order WHERE dt = '${bizdate}') a,
(SELECT COUNT(1) AS cnt FROM dwd_order WHERE dt = DATE_SUB('${bizdate}', 7)) b;

核心指标波动(如 GMV)

WITH metrics AS (
SELECT dt, SUM(amount) AS gmv
FROM dws_shop_daily
WHERE dt BETWEEN DATE_SUB('${bizdate}', 7) AND '${bizdate}'
GROUP BY dt
)
SELECT
curr.dt,
curr.gmv,
prev.gmv AS prev_day_gmv,
ROUND((curr.gmv - prev.gmv) / prev.gmv * 100, 2) AS change_pct
FROM metrics curr
JOIN metrics prev ON curr.dt = DATE_ADD(prev.dt, 1)
WHERE curr.dt = '${bizdate}'
AND ABS((curr.gmv - prev.gmv) / prev.gmv) > 0.2;
-- 有结果则告警

落地建议

调度集成方式

上游任务完成

DQC 校验节点(执行以上 SQL)

┌─────────────┬──────────────┐
│ PASS │ FAIL │
│ 放行下游 │ 强校验→阻断 │
│ │ 弱校验→告警 │
└─────────────┴──────────────┘

阈值配置参考

校验项建议阈值处理方式
主键重复= 0阻断
关键字段空值= 0阻断
上下游金额差异< 0.1%阻断
行数日环比±30%阻断
非关键字段空值率< 10%告警
维表关联缺失率< 5%告警
数据延迟< 60min告警