数据质量校验 SQL 集
复制 → 改表名 → 执行,快速定位数据问题
DQC 全景框架

各层校验重点
| 层级 | 核心校验 | 原因 |
|---|---|---|
| ODS | 完整性 + 时效性 + 波动性 | 源头数据不可控,第一时间发现问题 |
| DWD | 唯一性 + 准确性 + 一致性 | 加工逻辑最复杂,最容易出错 |
| DWS | 一致性 + 波动性 | 聚合结果要和明细对得上 |
| ADS | 全量复检 | 面向业务方,交付前最后一道防线 |
维度一:完整性(数据不缺)
场景:字段缺失、分区缺失、记录丢失
| 场景 | 适用层级 | 校验逻辑 | 严重程度 |
|---|---|---|---|
| 关键字段为空 | 全层级 | 主键/核心度量字段不允许 NULL | 阻断 |
| 批量字段空值率超标 | ODS/DWD | 空值率超过阈值 | 告警 |
| 分区无数据 | ODS | 当日分区行数为 0 | 阻断 |
| 记录数低于下限 | ODS/DWD | 行数低于历史最低值的某个比例 | 告警 |
关键字段非空检测(阻断型)
SELECT
SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS user_id_null,
SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS order_id_null,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS amount_null
FROM dwd_order
WHERE dt = '${bizdate}';
-- 任一字段 > 0 则阻断
批量空值率(告警型)
SELECT
COUNT(1) AS total,
ROUND((COUNT(1) - COUNT(phone)) / COUNT(1) * 100, 2) AS phone_null_pct,
ROUND((COUNT(1) - COUNT(address)) / COUNT(1) * 100, 2) AS address_null_pct,
ROUND((COUNT(1) - COUNT(city)) / COUNT(1) * 100, 2) AS city_null_pct
FROM ods_user_info
WHERE dt = '${bizdate}';
-- 超过设定阈值(如 phone > 5%)则告警
分区无数据检测
SELECT
CASE WHEN cnt = 0 THEN 'BLOCK' ELSE 'PASS' END AS result
FROM (
SELECT COUNT(1) AS cnt
FROM ods_order
WHERE dt = '${bizdate}'
) t;
维度二:唯一性(数据不重)
场景:ETL 重跑导致重复、JOIN 膨胀、上游重复推送
| 场景 | 适用层级 | 校验逻辑 | 严重程度 |
|---|---|---|---|
| 主键重复 | DWD | 业务主键 COUNT > 1 | 阻断 |
| 联合键重复 | DWD/DWS | 多字段组合去重 | 阻断 |
| 整行重复 | ODS | 完全相同记录出现多次 | 告警 |
主键重复检测
SELECT pk_col, COUNT(1) AS dup_count
FROM dwd_order
WHERE dt = '${bizdate}'
GROUP BY pk_col
HAVING COUNT(1) > 1
LIMIT 10;
-- 有结果则阻断
重复率统计
SELECT
COUNT(1) AS total,
COUNT(DISTINCT order_id) AS distinct_cnt,
ROUND((COUNT(1) - COUNT(DISTINCT order_id)) / COUNT(1) * 100, 4) AS dup_rate_pct
FROM dwd_order
WHERE dt = '${bizdate}';
-- dup_rate_pct > 0 需排查
JOIN 膨胀检测(加工后比加工前行数异常增长)
SELECT
CASE WHEN after_cnt > before_cnt * 1.05 THEN 'WARN' ELSE 'PASS' END AS result,
before_cnt,
after_cnt
FROM (
SELECT
(SELECT COUNT(1) FROM ods_order WHERE dt = '${bizdate}') AS before_cnt,
(SELECT COUNT(1) FROM dwd_order WHERE dt = '${bizdate}') AS after_cnt
) t;
维度三:准确性(数据不错)
场景:枚举值越界、数值异常、逻辑矛盾
| 场景 | 适用层级 | 校验逻辑 | 严重程度 |
|---|---|---|---|
| 枚举值不在合法范围 | DWD | NOT IN 合法值列表 | 告警 |
| 数值超出业务边界 | DWD/DWS | 金额为负/年龄超150等 | 告警 |
| 字段间逻辑矛盾 | DWD | 结束时间 < 开始时间 | 阻断 |
| 比率类越界 | DWS/ADS | 转化率 > 100% 或 < 0 | 阻断 |
枚举值校验
SELECT status, COUNT(1) AS cnt
FROM dwd_order
WHERE dt = '${bizdate}'
AND status NOT IN ('created', 'paid', 'shipped', 'completed', 'cancelled')
GROUP BY status;
-- 有结果则告警
数值边界校验
SELECT
COUNT(CASE WHEN amount < 0 THEN 1 END) AS negative_amt,
COUNT(CASE WHEN amount > 10000000 THEN 1 END) AS extreme_amt,
COUNT(CASE WHEN age < 0 OR age > 150 THEN 1 END) AS invalid_age
FROM dwd_order_detail
WHERE dt = '${bizdate}';
逻辑矛盾检测
SELECT COUNT(1) AS conflict_count
FROM dwd_event
WHERE dt = '${bizdate}'
AND end_time < start_time;
-- > 0 则阻断
比率越界
SELECT COUNT(1) AS invalid_count
FROM ads_funnel
WHERE dt = '${bizdate}'
AND (convert_rate > 1 OR convert_rate < 0);
维度四:一致性(上下游对得上)
场景:跨层对账、跨系统核对、维表关联缺失
| 场景 | 适用层级 | 校验逻辑 | 严重程度 |
|---|---|---|---|
| 上下游行数差异 | DWD→DWS | 明细聚合后 vs 汇总表 | 告警 |
| 上下游金额差异 | 全层级 | SUM 对比,差值超阈值 | 阻断 |
| 维表关联失败 | DWD | LEFT JOIN 后外键为空 | 告警 |
| 明细与汇总不一致 | DWS/ADS | 重新聚合 vs 已有结果 | 阻断 |
上下游行数对比
SELECT
src.cnt AS dwd_count,
tgt.cnt AS dws_count,
src.cnt - tgt.cnt AS diff
FROM
(SELECT COUNT(DISTINCT order_id) AS cnt FROM dwd_order WHERE dt = '${bizdate}') src,
(SELECT SUM(order_cnt) AS cnt FROM dws_shop_daily WHERE dt = '${bizdate}') tgt;
-- diff 不为 0 需排查
金额一致性校验
SELECT
ABS(a.total_amt - b.total_amt) AS amt_diff,
CASE
WHEN ABS(a.total_amt - b.total_amt) / a.total_amt > 0.001 THEN 'BLOCK'
ELSE 'PASS'
END AS result
FROM
(SELECT SUM(amount) AS total_amt FROM dwd_order WHERE dt = '${bizdate}') a,
(SELECT SUM(order_amount) AS total_amt FROM dws_shop_daily WHERE dt = '${bizdate}') b;
维表关联覆盖率
SELECT
COUNT(1) AS total,
SUM(CASE WHEN b.user_name IS NULL THEN 1 ELSE 0 END) AS miss_count,
ROUND(SUM(CASE WHEN b.user_name IS NULL THEN 1 ELSE 0 END) / COUNT(1) * 100, 2) AS miss_rate_pct
FROM dwd_order a
LEFT JOIN dim_user b ON a.user_id = b.user_id
WHERE a.dt = '${bizdate}';
-- miss_rate_pct > 5% 则告警
明细重新聚合 vs 汇总表
SELECT
a.shop_id,
a.recalc_amt,
b.order_amount,
ABS(a.recalc_amt - b.order_amount) AS diff
FROM (
SELECT shop_id, SUM(amount) AS recalc_amt
FROM dwd_order
WHERE dt = '${bizdate}'
GROUP BY shop_id
) a
JOIN dws_shop_daily b ON a.shop_id = b.shop_id AND b.dt = '${bizdate}'
WHERE ABS(a.recalc_amt - b.order_amount) > 0.01;
维度五:时效性(数据不迟)
场景:上游延迟、分区未产出、数据滞后
| 场景 | 适用层级 | 校验逻辑 | 严重程度 |
|---|---|---|---|
| 分区未按时产出 | ODS | 指定时间检查分区是否存在 | 阻断 |
| 数据事件时间滞后 | ODS/DWD | 最新事件时间远早于当前时间 | 告警 |
| 依赖表未就绪 | 全层级 | 上游分区检测 | 阻断 |
数据延迟检测
SELECT
MAX(event_time) AS latest_event,
CURRENT_TIMESTAMP() AS check_time,
(UNIX_TIMESTAMP(CURRENT_TIMESTAMP()) - UNIX_TIMESTAMP(MAX(event_time))) / 60 AS delay_minutes
FROM ods_event_log
WHERE dt = '${bizdate}';
-- delay_minutes > 阈值则告警
分区存在性检测
-- Hive
SHOW PARTITIONS table_name PARTITION (dt = '${bizdate}');
-- 无结果则阻断
数据截止时间校验(防止只同步了部分数据)
SELECT
MAX(update_time) AS latest_sync,
CASE
WHEN MAX(update_time) < '${bizdate} 23:00:00' THEN 'INCOMPLETE'
ELSE 'PASS'
END AS result
FROM ods_order
WHERE dt = '${bizdate}';
维度六:波动性(数据不飘)
场景:数据量突增突降、指标异常波动
| 场景 | 适用层级 | 校验逻辑 | 严重程度 |
|---|---|---|---|
| 日环比行数波动 | 全层级 | 与昨日对比超 ±30% | 告警 |
| 周同比波动 | 全层级 | 与上周同天对比 | 告警 |
| 核心指标突变 | DWS/ADS | 关键业务指标环比 | 阻断 |
行数日环比
WITH daily AS (
SELECT dt, COUNT(1) AS cnt
FROM dwd_order
WHERE dt BETWEEN DATE_SUB('${bizdate}', 1) AND '${bizdate}'
GROUP BY dt
)
SELECT
t.cnt AS today,
y.cnt AS yesterday,
ROUND((t.cnt - y.cnt) / y.cnt * 100, 2) AS change_pct,
CASE
WHEN ABS((t.cnt - y.cnt) / y.cnt) > 0.3 THEN 'WARN'
ELSE 'PASS'
END AS result
FROM daily t
JOIN daily y ON t.dt = DATE_ADD(y.dt, 1)
WHERE t.dt = '${bizdate}';
周同比
SELECT
a.cnt AS this_week,
b.cnt AS last_week,
ROUND((a.cnt - b.cnt) / b.cnt * 100, 2) AS wow_pct
FROM
(SELECT COUNT(1) AS cnt FROM dwd_order WHERE dt = '${bizdate}') a,
(SELECT COUNT(1) AS cnt FROM dwd_order WHERE dt = DATE_SUB('${bizdate}', 7)) b;
核心指标波动(如 GMV)
WITH metrics AS (
SELECT dt, SUM(amount) AS gmv
FROM dws_shop_daily
WHERE dt BETWEEN DATE_SUB('${bizdate}', 7) AND '${bizdate}'
GROUP BY dt
)
SELECT
curr.dt,
curr.gmv,
prev.gmv AS prev_day_gmv,
ROUND((curr.gmv - prev.gmv) / prev.gmv * 100, 2) AS change_pct
FROM metrics curr
JOIN metrics prev ON curr.dt = DATE_ADD(prev.dt, 1)
WHERE curr.dt = '${bizdate}'
AND ABS((curr.gmv - prev.gmv) / prev.gmv) > 0.2;
-- 有结果则告警
落地建议
调度集成方式
上游任务完成
↓
DQC 校验节点(执行以上 SQL)
↓
┌─────────────┬──────────────┐
│ PASS │ FAIL │
│ 放行下游 │ 强校验→阻断 │
│ │ 弱校验→告警 │
└─────────────┴──────────────┘
阈值配置参考
| 校验项 | 建议阈值 | 处理方式 |
|---|---|---|
| 主键重复 | = 0 | 阻断 |
| 关键字段空值 | = 0 | 阻断 |
| 上下游金额差异 | < 0.1% | 阻断 |
| 行数日环比 | ±30% | 阻断 |
| 非关键字段空值率 | < 10% | 告警 |
| 维表关联缺失率 | < 5% | 告警 |
| 数据延迟 | < 60min | 告警 |