常见大数据面试SQL-查询每个产品每年总销售额
一、题目
已知有表如下,记录了每个产品id、产品名称、产品销售开始日期、产品销售结束日期以及产品日均销售金额,请计算出每个产品每年的销售金额
样例数据
+-------------+---------------+----------------------+----------------------+----------------------+
| product_id | product_name | period_start | period_end | average_daily_sales |
+-------------+---------------+----------------------+----------------------+----------------------+
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 |
+-------------+---------------+----------------------+----------------------+----------------------+
期望结果
+------------+--------------+-------------+--------------+
| product_id | product_name | report_year | total_amount |
+------------+--------------+-------------+--------------+
| 1 | LC Phone | 2019 | 3500 |
| 2 | LC T-Shirt | 2018 | 310 |
| 2 | LC T-Shirt | 2019 | 3650 |
| 2 | LC T-Shirt | 2020 | 10 |
| 3 | LC Keychain | 2019 | 31 |
| 3 | LC Keychain | 2020 | 31 |
+------------+--------------+-------------+--------------+
说明
- LC Phone 在 2019-01-25 至 2019-02-28 期间销售,该产品销售时间总计35天。销售总额 35*100 = 3500。
- LC T-shirt 在 2018-12-01 至 2020-01-01 期间销售,该产品在2018年、2019年、2020年的销售时间分别是31天、365天、1天,2018年、2019年、2020年的销售总额分别是3110=310、36510=3650、1*10=10。
- LC Keychain 在 2019-12-01 至 2020-01-31 期间销售,该产品在2019年、2020年的销售时间分别是:31天、31天,2019年、2020年的销售总额分别是311=31、311=31。
二、分析
题目中给出的是每个产品的开始时间和结束时间,这里最关键的问题在于一个时间段可能跨年,并且不知道跨几年。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.构建年份维表
首先我们生成一份年份的维表,包含年份、每年开始日期和每年结束日期,这里有多重方式,由于题目中只有三年,咱们直接union all 拼一个出来.
执行SQL
with dim_year as (select '2018' as year, '2018-01-01' as year_first_day, '2018-12-31' as year_end_day
union all
select '2019' as year, '2019-01-01' as year_first_day, '2019-12-31' as year_end_day
union all
select '2020' as year, '2020-01-01' as year_first_day, '2020-12-31' as year_end_day)
select year,year_first_day,year_end_day from dim_year
查询结果
+-------+-----------------+---------------+
| year | year_first_day | year_end_day |
+-------+-----------------+---------------+
| 2018 | 2018-01-01 | 2018-12-31 |
| 2019 | 2019-01-01 | 2019-12-31 |
| 2020 | 2020-01-01 | 2020-12-31 |
+-------+-----------------+---------------+
2.维表与原始数据进行关联
把原始数据与年份维表进行笛卡尔积,得到每年与原始数据的一个交叉值.
执行SQL
set hive.strict.checks.cartesian.product = false;
with dim_year as (select '2018' as year, '2018-01-01' as year_first_day, '2018-12-31' as year_end_day
union all
select '2019' as year, '2019-01-01' as year_first_day, '2019-12-31' as year_end_day
union all
select '2020' as year, '2020-01-01' as year_first_day, '2020-12-31' as year_end_day)
select
product_id,
product_name,
period_start,
period_end,
average_daily_sales,
year,
year_first_day,
year_end_day
from t11_product_sales
left join dim_year
查询结果
+-------------+---------------+----------------------+----------------------+----------------------+-------+-----------------+---------------+
| product_id | product_name | period_start | period_end | average_daily_sales | year | year_first_day | year_end_day |
+-------------+---------------+----------------------+----------------------+----------------------+-------+-----------------+---------------+
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 | 2019 | 2019-01-01 | 2019-12-31 |
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 | 2020 | 2020-01-01 | 2020-12-31 |
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 | 2018 | 2018-01-01 | 2018-12-31 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 | 2019 | 2019-01-01 | 2019-12-31 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 | 2020 | 2020-01-01 | 2020-12-31 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 | 2018 | 2018-01-01 | 2018-12-31 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 | 2019 | 2019-01-01 | 2019-12-31 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 | 2020 | 2020-01-01 | 2020-12-31 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 | 2018 | 2018-01-01 | 2018-12-31 |
+-------------+---------------+----------------------+----------------------+----------------------+-------+-----------------+---------------+
3.比较计算每年每个产品在售天数
我们先观察2中的结果,可以看到原始记录中每行数据都与所有年都有一行记录。我们从销售日期和每年开始日期中取较大日期 得到一个开始时间,然后从销售截止日期和每年的结束日期取较小日期 得到一个结束日期,然后用结束日期减去开始日期。
执行SQL
with dim_year as (select '2018' as year, '2018-01-01' as year_first_day, '2018-12-31' as year_end_day
union all
select '2019' as year, '2019-01-01' as year_first_day, '2019-12-31' as year_end_day
union all
select '2020' as year, '2020-01-01' as year_first_day, '2020-12-31' as year_end_day),
tmp as (select product_id,
product_name,
period_start,
period_end,
average_daily_sales,
year,
year_first_day,
year_end_day,
datediff(
if(to_date(period_end) > to_date(year_end_day), to_date(year_end_day), to_date(period_end)),
if(to_date(period_start) > to_date(year_first_day), to_date(period_start),
to_date(year_first_day))) as date_diff
from t11_product_sales
left join dim_year)
select product_id,
product_name,
year,
year_first_day,
year_end_day,
(date_diff + 1) * average_daily_sales
from tmp
where date_diff > 0
查询结果
+-------------+---------------+----------------------+----------------------+----------------------+-------+-----------------+---------------+------------+
| product_id | product_name | period_start | period_end | average_daily_sales | year | year_first_day | year_end_day | date_diff |
+-------------+---------------+----------------------+----------------------+----------------------+-------+-----------------+---------------+------------+
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 | 2019 | 2019-01-01 | 2019-12-31 | 34 |
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 | 2020 | 2020-01-01 | 2020-12-31 | -307 |
| 1 | LC Phone | 2019-01-25 00:00:00 | 2019-02-28 00:00:00 | 100 | 2018 | 2018-01-01 | 2018-12-31 | -25 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 | 2019 | 2019-01-01 | 2019-12-31 | 364 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 | 2020 | 2020-01-01 | 2020-12-31 | 0 |
| 2 | LC T-Shirt | 2018-12-01 00:00:00 | 2020-01-01 00:00:00 | 10 | 2018 | 2018-01-01 | 2018-12-31 | 30 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 | 2019 | 2019-01-01 | 2019-12-31 | 30 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 | 2020 | 2020-01-01 | 2020-12-31 | 30 |
| 3 | LC Keychain | 2019-12-01 00:00:00 | 2020-01-31 00:00:00 | 1 | 2018 | 2018-01-01 | 2018-12-31 | -335 |
+-------------+---------------+----------------------+----------------------+----------------------+-------+-----------------+---------------+------------+
查看上面结果,就可以发现,只有在当年有销售时间的数据date_diff >=0,如果在当年没有销售时间,则date_diff为负。但是这个date_diff 的值比预期小1,这是因为我们算了日期差,所以我们在结果上+1即可。
4.筛选符合条件数据,计算最终结果
得到有销售的年份和天数之后,乘以每天的销售和即可得到最终结果。
执行SQL
with dim_year as (select '2018' as year, '2018-01-01' as year_first_day, '2018-12-31' as year_end_day
union all
select '2019' as year, '2019-01-01' as year_first_day, '2019-12-31' as year_end_day
union all
select '2020' as year, '2020-01-01' as year_first_day, '2020-12-31' as year_end_day),
tmp as (select product_id
, product_name
, period_start
, period_end
, average_daily_sales
, year
, year_first_day
, year_end_day
, datediff(if(to_date(period_end)
> to_date(year_end_day)
, to_date(year_end_day)
, to_date(period_end))
, if(to_date(period_start)
> to_date(year_first_day)
, to_date(period_start)
, to_date(year_first_day))) as date_diff
from t11_product_sales
left join dim_year)
select product_id,
product_name,
year,
(date_diff + 1) * average_daily_sales as total_amount
from tmp
where date_diff >= 0
查询结果
+-------------+---------------+-------+---------------+
| product_id | product_name | year | total_amount |
+-------------+---------------+-------+---------------+
| 1 | LC Phone | 2019 | 3500 |
| 2 | LC T-Shirt | 2018 | 310 |
| 2 | LC T-Shirt | 2019 | 3650 |
| 2 | LC T-Shirt | 2020 | 10 |
| 3 | LC Keychain | 2019 | 31 |
| 3 | LC Keychain | 2020 | 31 |
+-------------+---------------+-------+---------------+
四、建表语句和数据插入
--建表语句
create table if not exists t11_product_sales
(
product_id bigint,
product_name string,
period_start string,
period_end string,
average_daily_sales bigint
);
--插入数据
insert into t11_product_sales(product_id, product_name, period_start, period_end, average_daily_sales)
values (1, 'LC Phone', '2019-01-25 00:00:00', '2019-02-28 00:00:00', 100),
(2, 'LC T-Shirt', '2018-12-01 00:00:00', '2020-01-01 00:00:00', 10),
(3, 'LC Keychain', '2019-12-01 00:00:00', '2020-01-31 00:00:00', 1);
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;