百度大数据面试SQL-无效搜索
一、题目
现有一份用户搜索日志,包含用户ID,时间,用户搜索内容。定义 无效搜索:如果用户下一次搜索内容中包含本次搜索内容,则认为本次搜索为无效搜索。请查询用户无效搜索记录
样例数据
+---------+---------------------+------------------------+
| user_id | search_time | search_content |
+---------+---------------------+------------------------+
| 1 | 2022-01-01 10:00:00 | apple |
| 1 | 2022-01-01 11:30:00 | banana and apple |
| 1 | 2022-01-01 12:45:00 | fruit salad |
| 1 | 2022-01-01 15:00:00 | apple pie |
| 1 | 2022-01-01 16:20:00 | applesauce recipe |
| 2 | 2022-01-01 10:00:00 | cat food |
| 2 | 2022-01-01 11:30:00 | wet vs dry cat food |
| 2 | 2022-01-01 12:45:00 | homemade cat food recipe |
| 2 | 2022-01-01 14:00:00 | cat food brands to avoid |
| 2 | 2022-01-01 16:20:00 | best cat food for i... |
| 3 | 2022-01-01 10:00:00 | book |
| 3 | 2022-01-01 11:30:00 | books like Harry Potter|
| 3 | 2022-01-01 13:00:00 |best selling books ... |
| 3 | 2022-01-01 14:30:00 | bookstores near me |
| 3 | 2022-01-01 15:45:00 | how to publish a book|
+---------+---------------------+------------------------+
二、分析
- 有序计算,即对上下两行数据进行比较,考察的是lead()函数,用法参考:hive开窗函数-lag和lead函数
- 字符串包含判断,instr()函数。
INSTR
(str
,substr
) 其中,str
是要搜索的字符串,substr
是要查找的子字符串。该函数返回子字符串在指定字符串中第一次出现的位置,如果未找到则返回0。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.查询出下一行数据,并把下一行搜索内容作为新字段放到本行
执行SQL
select user_id,
search_time,
search_content,
lead(search_content) over (partition by user_id order by search_time asc) as next_search_content
from t3_user_search_log
查询结果
+----------+------------------------+---------------------------------+---------------------------------+
| user_id | search_time | search_content | next_search_content |
+----------+------------------------+---------------------------------+---------------------------------+
| 1 | 2022-01-01 10:00:00.0 | apple | banana and apple |
| 1 | 2022-01-01 11:30:00.0 | banana and apple | fruit salad |
| 1 | 2022-01-01 12:45:00.0 | fruit salad | apple pie |
| 1 | 2022-01-01 15:00:00.0 | apple pie | applesauce recipe |
| 1 | 2022-01-01 16:20:00.0 | applesauce recipe | NULL |
| 2 | 2022-01-01 10:00:00.0 | cat food | wet vs dry cat food |
| 2 | 2022-01-01 11:30:00.0 | wet vs dry cat food | homemade cat food recipe |
| 2 | 2022-01-01 12:45:00.0 | homemade cat food recipe | cat food brands to avoid |
| 2 | 2022-01-01 14:00:00.0 | cat food brands to avoid | best cat food for indoor cats |
| 2 | 2022-01-01 16:20:00.0 | best cat food for indoor cats | NULL |
| 3 | 2022-01-01 10:00:00.0 | book | books like Harry Potter |
| 3 | 2022-01-01 11:30:00.0 | books like Harry Potter | best selling books of all time |
| 3 | 2022-01-01 13:00:00.0 | best selling books of all time | bookstores near me |
| 3 | 2022-01-01 14:30:00.0 | bookstores near me | how to publish a book |
| 3 | 2022-01-01 15:45:00.0 | how to publish a book | NULL |
+----------+------------------------+---------------------------------+---------------------------------+
2.比较搜索内容是否为下一次搜索内容的子字符串,给判断逻辑打标记(如果是返回1,否则返回0)
执行SQL
select user_id,
search_time,
search_content,
lead(search_content) over (partition by user_id order by search_time asc) as next_search_content,
If(instr(lead(search_content) over (partition by user_id order by search_time asc), search_content) > 0, 1,
0) as flag
from t3_user_search_log
查询结果
+----------+------------------------+---------------------------------+---------------------------------+-------+
| user_id | search_time | search_content | next_search_content | flag |
+----------+------------------------+---------------------------------+---------------------------------+-------+
| 1 | 2022-01-01 10:00:00.0 | apple | banana and apple | 1 |
| 1 | 2022-01-01 11:30:00.0 | banana and apple | fruit salad | 0 |
| 1 | 2022-01-01 12:45:00.0 | fruit salad | apple pie | 0 |
| 1 | 2022-01-01 15:00:00.0 | apple pie | applesauce recipe | 0 |
| 1 | 2022-01-01 16:20:00.0 | applesauce recipe | NULL | 0 |
| 2 | 2022-01-01 10:00:00.0 | cat food | wet vs dry cat food | 1 |
| 2 | 2022-01-01 11:30:00.0 | wet vs dry cat food | homemade cat food recipe | 0 |
| 2 | 2022-01-01 12:45:00.0 | homemade cat food recipe | cat food brands to avoid | 0 |
| 2 | 2022-01-01 14:00:00.0 | cat food brands to avoid | best cat food for indoor cats | 0 |
| 2 | 2022-01-01 16:20:00.0 | best cat food for indoor cats | NULL | 0 |
| 3 | 2022-01-01 10:00:00.0 | book | books like Harry Potter | 1 |
| 3 | 2022-01-01 11:30:00.0 | books like Harry Potter | best selling books of all time | 0 |
| 3 | 2022-01-01 13:00:00.0 | best selling books of all time | bookstores near me | 0 |
| 3 | 2022-01-01 14:30:00.0 | bookstores near me | how to publish a book | 0 |
| 3 | 2022-01-01 15:45:00.0 | how to publish a book | NULL | 0 |
+----------+------------------------+---------------------------------+---------------------------------+-------+
3.限制标签为1,查询出最后结果
执行SQL
select user_id,
search_time,
search_content
from (select user_id,
search_time,
search_content,
if(instr(lead(search_content) over (partition by user_id order by search_time asc), search_content) > 0, 1,
0) as flag
from t3_user_search_log) t
where flag = 1;
查询结果
+----------+------------------------+-----------------+
| user_id | search_time | search_content |
+----------+------------------------+-----------------+
| 1 | 2022-01-01 10:00:00.0 | apple |
| 2 | 2022-01-01 10:00:00.0 | cat food |
| 3 | 2022-01-01 10:00:00.0 | book |
+----------+------------------------+-----------------+
四、建表语句和数据插入
--建表语句
CREATE TABLE t3_user_search_log (
user_id STRING,
search_time STRING,
search_content STRING
);
--插入数据
INSERT INTO t3_user_search_log
VALUES
('1', '2022-01-01 10:00:00', 'apple'),
('1', '2022-01-01 11:30:00', 'banana and apple'),
('1', '2022-01-01 12:45:00', 'fruit salad'),
('1', '2022-01-01 15:00:00', 'apple pie'),
('1', '2022-01-01 16:20:00', 'applesauce recipe'),
('2', '2022-01-01 10:00:00', 'cat food'),
('2', '2022-01-01 11:30:00', 'wet vs dry cat food'),
('2', '2022-01-01 12:45:00', 'homemade cat food recipe'),
('2', '2022-01-01 14:00:00', 'cat food brands to avoid'),
('2', '2022-01-01 16:20:00', 'best cat food for indoor cats'),
('3', '2022-01-01 10:00:00', 'book'),
('3', '2022-01-01 11:30:00', 'books like Harry Potter'),
('3', '2022-01-01 13:00:00', 'best selling books of all time'),
('3', '2022-01-01 14:30:00', 'bookstores near me'),
('3', '2022-01-01 15:45:00', 'how to publish a book');
本文同步在微信公众号”数据仓库技术“和个人博客”数据仓库技术 (opens in a new tab)“发表;