三一重工大数据面试SQL-部门人员数据分析
一、题目
现有一张员工在职所在部门信息表,包含员工ID、所属部门、开始日期、结束日期,请查询出如下内容
1.2024年1月31日A部门在职员工数;
2.2024年1月份A部门员工最多时有多少员工;
3.2024年1月份A部门平均有多少员工;
+--------------+-------------+-------------+-------------+
| employee_id | department | start_date | end_date |
+--------------+-------------+-------------+-------------+
| 1 | A | 2023-12-20 | 2024-01-22 |
| 2 | A | 2024-01-02 | 2024-01-11 |
| 2 | B | 2024-01-11 | 2024-01-25 |
| 2 | A | 2024-01-25 | 9999-01-01 |
| 3 | A | 2023-12-20 | 9999-01-01 |
| 4 | A | 2024-02-02 | 9999-01-01 |
| 5 | A | 2023-06-20 | 2023-12-22 |
+--------------+-------------+-------------+-------------+
二、分析
- 题目本身是一个拉链表的结构,可以认为是一个左闭又开的数据。即开始日期算做在部门内,离开日期不算在部门日期;
- 第1问:查询时点数据,我们可以根据记录中的开始日期和结束日期与时间判断,如果时点在区间内,则代表用户在该部门;
- 第2问:2024年1月份A部门员工最多时有多少员工,调整表结构为进入离开部门的记录表,并进行计数,进入部门+1,离开部门-1,然后对所有行为进行累积求和,取出出现在1月份的最大值即可;
- 第3问:2024年1月份A部门平均有多少员工,存在两种计算方式:1.计算出A部门1月份每天员工数,然后进行求和。2.计算出A部门在1月份的总人 * 天 然后除以1月份天数(31天)。其中计算总人 * 天数可以使用第2问的过程数据,即每个人数状态 * 持续天数来计算;
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️⭐️ |
三、SQL
1.2024年1月31日A部门在职员工数
通过判断2024-01-31>=start_date并且2024-01-31 < end_date 证明在部门内。sql如下
select count(1) as a_depart_num
from t1_employee_department_info
where department = 'A'
and start_date <= '2024-01-31'
and end_date >'2024-01-31';
查询结果
+---------------+
| a_depart_num |
+---------------+
| 2 |
+---------------+
2.2024年1月份A部门员工最多时有多少员工;
2.1 生成员工进入离开部门表
生成员工加入离开部门表,表内包含employee_id, department, enter_or_leave,action_date。其中enter_type 1代表进入,-1代表离开,进入时间用start_date,离开时间用end_date。通过查询2遍员工部门表,并通过union all 来整合到一起。
--加入部门记录
select
employee_id as employee_id,
department as department,
1 as enter_or_leave,
start_date as action_date
from t1_employee_department_info
where department = 'A'
union all
-- 离开部门记录
select
employee_id,
department,
-1 as enter_or_leave,
end_date as action_date
from t1_employee_department_info
where department = 'A'
查询结果
+--------------+-------------+-----------------+--------------+
| employee_id | department | enter_or_leave | action_date |
+--------------+-------------+-----------------+--------------+
| 1 | A | 1 | 2023-12-20 |
| 2 | A | 1 | 2024-01-02 |
| 2 | A | 1 | 2024-01-25 |
| 3 | A | 1 | 2023-12-20 |
| 4 | A | 1 | 2024-02-02 |
| 5 | A | 1 | 2023-06-20 |
| 1 | A | -1 | 2024-01-22 |
| 2 | A | -1 | 2024-01-11 |
| 2 | A | -1 | 9999-01-01 |
| 3 | A | -1 | 9999-01-01 |
| 4 | A | -1 | 9999-01-01 |
| 5 | A | -1 | 2023-12-22 |
+--------------+-------------+-----------------+--------------+
2.2 添加一条无状态数据,保证1月份有记录,增加两条A部门月初月末无人员变动记录
因为后面使用数据累积,也就是只有在数据变化的时候的才有记录,所以我们增加两条无人员变动记录,employee_id = 0 ,enter_or_leave = 0 代表该用户既不是进入,也不是离开。 时间分别是月初和月末。这样即能保证1月份肯定有数据,也能保证有1月份的初始和结束状态。
--加入部门记录
select
employee_id,
department,
1 as enter_or_leave,
start_date as action_date
from t1_employee_department_info
where department = 'A'
union all
-- 离开部门记录
select
employee_id,
department,
-1 as enter_or_leave,
end_date as action_date
from t1_employee_department_info
where department = 'A'
--月初记录
union all
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-01' as action_date
union all
--月末记录
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-31' as action_date
** 查询结果**
+--------------+-------------+-----------------+--------------+
| employee_id | department | enter_or_leave | action_date |
+--------------+-------------+-----------------+--------------+
| 1 | A | 1 | 2023-12-20 |
| 2 | A | 1 | 2024-01-02 |
| 2 | A | 1 | 2024-01-25 |
| 3 | A | 1 | 2023-12-20 |
| 4 | A | 1 | 2024-02-02 |
| 5 | A | 1 | 2023-06-20 |
| 1 | A | -1 | 2024-01-22 |
| 2 | A | -1 | 2024-01-11 |
| 2 | A | -1 | 9999-01-01 |
| 3 | A | -1 | 9999-01-01 |
| 4 | A | -1 | 9999-01-01 |
| 5 | A | -1 | 2023-12-22 |
| 0 | A | 0 | 2024-01-01 |
| 0 | A | 0 | 2024-01-31 |
+--------------+-------------+-----------------+--------------+
2.3 使用累加方式计算每次变动之后A部门的人数
3)使用sum()over(order by **) 的方式,对A部门的每次人数变化进行累积求和
with t as (
--加入部门记录
select
employee_id,
department,
1 as enter_or_leave,
start_date as action_date
from t1_employee_department_info
where department = 'A'
union all
-- 离开部门记录
select
employee_id,
department,
-1 as enter_or_leave,
end_date as action_date
from t1_employee_department_info
where department = 'A'
--月初记录
union all
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-01' as action_date
union all
--月末记录
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-31' as action_date
)
select
employee_id,
department,
enter_or_leave,
action_date,
sum(enter_or_leave)over(order by action_date asc) as depart_emp_cnt
from t
查询结果
+--------------+-------------+-----------------+--------------+-----------------+
| employee_id | department | enter_or_leave | action_date | depart_emp_cnt |
+--------------+-------------+-----------------+--------------+-----------------+
| 5 | A | 1 | 2023-06-20 | 1 |
| 1 | A | 1 | 2023-12-20 | 3 |
| 3 | A | 1 | 2023-12-20 | 3 |
| 5 | A | -1 | 2023-12-22 | 2 |
| 0 | A | 0 | 2024-01-01 | 2 |
| 2 | A | 1 | 2024-01-02 | 3 |
| 2 | A | -1 | 2024-01-11 | 2 |
| 1 | A | -1 | 2024-01-22 | 1 |
| 2 | A | 1 | 2024-01-25 | 2 |
| 0 | A | 0 | 2024-01-31 | 2 |
| 4 | A | 1 | 2024-02-02 | 3 |
| 2 | A | -1 | 9999-01-01 | 0 |
| 3 | A | -1 | 9999-01-01 | 0 |
| 4 | A | -1 | 9999-01-01 | 0 |
+--------------+-------------+-----------------+--------------+-----------------+
2.4 时间段限定在1月份,对depart_emp_cnt 求最大值。
with t as (
--加入部门记录
select employee_id,
department,
1 as enter_or_leave,
start_date as action_date
from t1_employee_department_info
where department = 'A'
union all
-- 离开部门记录
select employee_id,
department,
-1 as enter_or_leave,
end_date as action_date
from t1_employee_department_info
where department = 'A'
--月初记录
union all
select 0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-01' as action_date
union all
--月末记录
select 0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-31' as action_date
)
select max(depart_emp_cnt) as max_emp_cnt
from (
select
employee_id,
department,
enter_or_leave,
action_date,
sum(enter_or_leave)over(order by action_date asc) as depart_emp_cnt
from t) tt
where action_date >= '2024-01-01'
and action_date <= '2024-01-31'
查询结果
+--------------+
| max_emp_cnt |
+--------------+
| 3 |
+--------------+
3.2024年1月份A部门平均有多少员工;
3.1 求每个阶段员工人数持续天数
求平均有多少员工,我们可以根据2.3的结果进行计算,查看每个阶段(两次员工变化之间的日期为同一个阶段)的人数和持续天数。然后相乘,再求和得出最终1月份在职员工的人*天,然后除以1月份天数得出平均在职人数。使用lead函数,计算出下一次变动日期,然后使用下一次变化日期-当前日期即为当前状态持续时间。这里因为在月末的时候,人数持续一天,所以我们需要在原始记录中增加2024-02-01一条无人员变动记录。
with t as (
--加入部门记录
select
employee_id,
department,
1 as enter_or_leave,
start_date as action_date
from t1_employee_department_info
where department = 'A'
union all
-- 离开部门记录
select
employee_id,
department,
-1 as enter_or_leave,
end_date as action_date
from t1_employee_department_info
where department = 'A'
--月初记录
union all
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-01' as action_date
union all
--月末记录
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-31' as action_date
union all
--2月初记录
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-02-01' as action_date
),
t2 as(
select
employee_id,
department,
enter_or_leave,
action_date,
sum(enter_or_leave) over( order by action_date asc) as depart_emp_cnt
from t)
select
employee_id,
department,
enter_or_leave,
action_date,
depart_emp_cnt,
datediff(lead(action_date)over(order by action_date asc),action_date) as keep_days
from t2
查询结果
+--------------+-------------+-----------------+--------------+-----------------+------------+
| employee_id | department | enter_or_leave | action_date | depart_emp_cnt | keep_days |
+--------------+-------------+-----------------+--------------+-----------------+------------+
| 5 | A | 1 | 2023-06-20 | 1 | 183 |
| 1 | A | 1 | 2023-12-20 | 3 | 0 |
| 3 | A | 1 | 2023-12-20 | 3 | 2 |
| 5 | A | -1 | 2023-12-22 | 2 | 10 |
| 0 | A | 0 | 2024-01-01 | 2 | 1 |
| 2 | A | 1 | 2024-01-02 | 3 | 9 |
| 2 | A | -1 | 2024-01-11 | 2 | 11 |
| 1 | A | -1 | 2024-01-22 | 1 | 3 |
| 2 | A | 1 | 2024-01-25 | 2 | 6 |
| 0 | A | 0 | 2024-01-31 | 2 | 1 |
| 0 | A | 0 | 2024-02-01 | 2 | 1 |
| 4 | A | 1 | 2024-02-02 | 3 | 2912777 |
| 2 | A | -1 | 9999-01-01 | 0 | 0 |
| 3 | A | -1 | 9999-01-01 | 0 | 0 |
| 4 | A | -1 | 9999-01-01 | 0 | NULL |
+--------------+-------------+-----------------+--------------+-----------------+------------+
3.2计算最终结果
查询1月份的日期数据,然后对depart_emp_cnt* last_date 求和,再除以31 即为1月份的平均在职人数
with t as (
--加入部门记录
select
employee_id,
department,
1 as enter_or_leave,
start_date as action_date
from t1_employee_department_info
where department = 'A'
union all
-- 离开部门记录
select
employee_id,
department,
-1 as enter_or_leave,
end_date as action_date
from t1_employee_department_info
where department = 'A'
--月初记录
union all
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-01' as action_date
union all
--月末记录
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-01-31' as action_date
union all
--2月初记录
select
0 as employee_id,
'A' as department,
0 as enter_or_leave,
'2024-02-01' as action_date
),
t2 as(
select
employee_id,
department,
enter_or_leave,
action_date,
sum(enter_or_leave) over( order by action_date asc) as depart_emp_cnt
from t),
t3 as (
select
employee_id,
department,
enter_or_leave,
action_date,
depart_emp_cnt,
datediff(lead(action_date)over(order by action_date asc),action_date) as keep_days
from t2)
select
sum(depart_emp_cnt* keep_days)/31 as avg_emp_cnt
from t3
where action_date >='2024-01-01' and action_date <= '2024-01-31'
查询结果
+--------------------+
| avg_emp_cnt |
+--------------------+
| 2.193548387096774 |
+--------------------+
四、建表语句和数据插入
CREATE TABLE IF NOT EXISTS t1_employee_department_info (
employee_id INT, -- 员工ID
department STRING, -- 所属部门
start_date STRING, -- 开始日期
end_date STRING -- 结束日期
);
insert into t1_employee_department_info(employee_id, department, start_date,end_date) values
(1, 'A', '2023-12-20','2024-01-22'),
(2, 'A', '2024-01-02','2024-01-11'),
(2, 'B', '2024-01-11','2024-01-25'),
(2, 'A', '2024-01-25','9999-01-01'),
(3, 'A', '2023-12-20','9999-01-01'),
(4, 'A', '2024-02-02','9999-01-01'),
(5, 'A', '2023-06-20','2023-12-22');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;