字节跳动大数据面试SQL-共同使用ip用户检测问题
一、题目
现有用户登录日志表,记录了每个用户登录的IP地址,请查询共同使用过3个及以上IP的用户对;
+----------+-----------------+----------------------+
| user_id | ip | time_stamp |
+----------+-----------------+----------------------+
| 2 | 223.104.41.101 | 2023-08-24 07:00:00 |
| 4 | 223.104.41.122 | 2023-08-24 10:00:00 |
| 5 | 223.104.41.126 | 2023-08-24 11:00:00 |
| 4 | 223.104.41.126 | 2023-08-24 13:00:00 |
| 1 | 223.104.41.101 | 2023-08-24 16:00:00 |
| 3 | 223.104.41.101 | 2023-08-24 16:02:00 |
| 2 | 223.104.41.104 | 2023-08-24 16:30:00 |
| 1 | 223.104.41.121 | 2023-08-24 17:00:00 |
| 2 | 223.104.41.122 | 2023-08-24 17:05:00 |
| 3 | 223.104.41.103 | 2023-08-24 18:11:00 |
| 2 | 223.104.41.103 | 2023-08-24 19:00:00 |
| 1 | 223.104.41.104 | 2023-08-24 19:00:00 |
| 3 | 223.104.41.122 | 2023-08-24 19:07:00 |
| 1 | 223.104.41.122 | 2023-08-24 21:00:00 |
+----------+-----------------+----------------------+
二、分析
- 题目给出的数据是登录记录,需要使用IP进行关联,找到使用相同IP的记录;
- 因为要使用ip进行关联,首先保证每个用户同一个IP只有一条记录,否则关联会导致结果数据重复;
- 自关联,会导致使用相同IP的用户,出现1-2,2-1两条记录、1-1,2-2自己的记录,这些记录需要去重和剔除;
- 计算共同使用过的IP数量,得出结果;
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐️ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.将所有用户登录记录按照用户ID和登录IP去重
执行SQL
select user_id,
ip
from t6_login_log
group by user_id, ip
查询结果
+----------+-----------------+
| user_id | ip |
+----------+-----------------+
| 1 | 223.104.41.101 |
| 1 | 223.104.41.104 |
| 1 | 223.104.41.121 |
| 1 | 223.104.41.122 |
| 2 | 223.104.41.101 |
| 2 | 223.104.41.103 |
| 2 | 223.104.41.104 |
| 2 | 223.104.41.122 |
| 3 | 223.104.41.101 |
| 3 | 223.104.41.103 |
| 3 | 223.104.41.122 |
| 4 | 223.104.41.122 |
| 4 | 223.104.41.126 |
| 5 | 223.104.41.126 |
+----------+-----------------+
2.通过IP地址进行自关联,去重,剔除相同用户。
执行SQL
with tmp as
(select user_id,
ip
from t6_login_log
group by user_id, ip)
select t1.user_id,
t2.user_id,
t1.ip
from tmp as t1
join
tmp as t2
on t1.ip = t2.ip
where t1.user_id < t2.user_id
查询结果
+-------------+-------------+-----------------+
| t1.user_id | t2.user_id | t1.ip |
+-------------+-------------+-----------------+
| 1 | 2 | 223.104.41.101 |
| 1 | 3 | 223.104.41.101 |
| 2 | 3 | 223.104.41.101 |
| 2 | 3 | 223.104.41.103 |
| 1 | 2 | 223.104.41.104 |
| 1 | 2 | 223.104.41.122 |
| 1 | 3 | 223.104.41.122 |
| 1 | 4 | 223.104.41.122 |
| 2 | 3 | 223.104.41.122 |
| 2 | 4 | 223.104.41.122 |
| 3 | 4 | 223.104.41.122 |
| 4 | 5 | 223.104.41.126 |
+-------------+-------------+-----------------+
3.根据用户组计算使用共同IP的个数
执行SQL
with tmp as
(select user_id,
ip
from t6_login_log
group by user_id, ip)
select t1.user_id,
t2.user_id,
count(t1.ip)
from tmp as t1
join
tmp as t2
on t1.ip = t2.ip
where t1.user_id < t2.user_id
group by t1.user_id,
t2.user_id
查询结果
+-------------+-------------+------+
| t1.user_id | t2.user_id | _c2 |
+-------------+-------------+------+
| 1 | 2 | 3 |
| 1 | 3 | 2 |
| 1 | 4 | 1 |
| 2 | 3 | 3 |
| 2 | 4 | 1 |
| 3 | 4 | 1 |
| 4 | 5 | 1 |
+-------------+-------------+------+
4.查询共同使用过3个以上IP的用户对
执行SQL
with tmp as
(select user_id,
ip
from t6_login_log
group by user_id, ip)
select t1.user_id,
t2.user_id
from tmp as t1
join
tmp as t2
on t1.ip = t2.ip
where t1.user_id < t2.user_id
group by t1.user_id,
t2.user_id
having count(t1.ip) >= 3
查询结果
+-------------+-------------+
| t1.user_id | t2.user_id |
+-------------+-------------+
| 1 | 2 |
| 2 | 3 |
+-------------+-------------+
四、建表语句和数据插入
--建表语句
CREATE TABLE t6_login_log (
user_id bigint COMMENT '用户ID',
ip string COMMENT '用户登录ip地址',
time_stamp string COMMENT '登录时间'
) COMMENT '用户登录记录表';
-- 插入数据
insert into t6_login_log(user_id,ip,time_stamp)
values
(1,'223.104.41.101','2023-08-24 16:00:00'),
(1,'223.104.41.121','2023-08-24 17:00:00'),
(1,'223.104.41.104','2023-08-24 19:00:00'),
(1,'223.104.41.122','2023-08-24 21:00:00'),
(1,'223.104.41.122','2023-08-24 22:00:00'),
(2,'223.104.41.101','2023-08-24 07:00:00'),
(2,'223.104.41.103','2023-08-24 19:00:00'),
(2,'223.104.41.104','2023-08-24 16:30:00'),
(2,'223.104.41.122','2023-08-24 17:05:00'),
(3,'223.104.41.103','2023-08-24 18:11:00'),
(3,'223.104.41.122','2023-08-24 19:07:00'),
(3,'223.104.41.101','2023-08-24 16:02:00'),
(4,'223.104.41.126','2023-08-24 13:00:00'),
(5,'223.104.41.126','2023-08-24 11:00:00'),
(4,'223.104.41.122','2023-08-24 10:00:00');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;