源数据 用户-视频互动表tb_user_video_log id uid video_id start_time end_time if_follow if_like if_retweet comment_id 1 101 2001 2021-10-01 10:00:00 2021-10-01 10:00:30 0 1 1 NULL 2 102 2001 2021-10-01 10:00:00 2021-10-01 10:00:24 0 0 1 NULL 3 103 2001 2021-10-01 11:00:00 2021-10-01 11:00:34 0 1 0 1732526 4 101 2002 2021-09-01 10:00:00 2021-9-01 10:00:42 1 0 1 NULL 5 102 2002 2021-10-01 11:00:00 2021-10-01 10:00:30 1 0 1 NULL (uid-用户ID, video_id-视频ID, start_time-开始观看时间, end_time-结束观看时间, if_follow-是否关注, if_like-是否点赞, if_retweet-是否转发, comment_id-评论ID) 短视频信息表tb_video_info id video_id author tag duration release_time 1 2001 901 影视 30 2021-01-01 07:00:00 2 2002 901 美食 60 2021-01-01 07:00:00 3 2003 902 旅游 90 2021-01-01 07:00:00 (video_id-视频ID, author-创作者ID, tag-类别标签, duration-视频时长(秒), release_time-发布时间) 注:视频完播率是指完成播放次数占总播放次数的比例。简单起见,结束观看时间与开始播放时间的差>=视频时长时,视为完成播放。 输出示例: 示例数据的结果如下: video_id avg_comp_play_rate 2001 0.667 2002 0.000 SQL代码 注意区分timestampdiff()与datediff()用法 timestampdiff()函数的作用是返回两个日期时间之间的整数差。而datediff()函数的作用也是返回两个日期值之差。 它们的函数语法分别为: TIMESTAMPDIFF(unit,start_time,end_time) DATEDIFF(start_time,end_time) 注: 输出示例: 示例数据的输出结果如下: tag avg_play_progress 影视 90.00% 美食 75.00% SQL代码 注:转发率=转发量÷播放量。结果按转发率降序排序。 输出示例: 示例数据的输出结果如下 tag retweet_cut retweet_rate 影视 2 0.667 美食 1 0.500 SQL代码 注意date_sub函数用法 DATE_SUB() 函数从日期减去指定的时间间隔。 实例:DATE_SUB("2023-3-2",INTERVAL 1 day) 返回的是2023-3-1 注: 解题思路: 关于开窗函数用法可以参考学习:https://blog.csdn.net/weixin_43997319/article/details/124964670 输出示例: 示例数据的输出结果如下 tag dt sum_like_cnt_7d max_retweet_cnt_7d 旅游 2021-10-01 5 2 旅游 2021-10-02 5 3 旅游 2021-10-03 6 3 解释: 由表tb_user_video_log里的数据可得只有旅游类视频的播放,2021年9月25到10月3日每天的点赞量和转发量如下: tag dt like_cnt retweet_cnt 旅游 2021-09-25 1 2 旅游 2021-09-26 0 1 旅游 2021-09-27 1 0 旅游 2021-09-28 0 1 旅游 2021-09-29 0 1 旅游 2021-09-30 1 1 旅游 2021-10-01 2 1 旅游 2021-10-02 1 3 旅游 2021-10-03 1 0 因此国庆头3天(10.01~10.03)里10.01的近7天(9.25~10.01)总点赞量为5次,单天最大转发量为2次(9月25那天最大);同理可得10.02和10.03的两个指标。 SQL代码 开窗函数 ①rows between 2 perceding and 2 following #当前行往前2行+当前行+当前行往后2行(一共5行) ②rows between 1 following 3 following #当前行的后1——>后3(共3行) ③rows between unbounded preceding and current row #从第一行到当前行 注: 输出示例: 示例数据的输出结果如下 video_id hot_index 2001 122 2002 56 2003 1 解释: 最近播放日期为2021-10-03,记作当天日期;近一个月(2021-09-04及之后)发布的视频有2001、2002、2003、2004,不过2004暂时还没有播放记录; 视频2001完播率1.0(被播放次数4次,完成播放4次),被点赞3次,评论1次,转发2次,最近无播放天数为0,因此热度为:(100*1.0+5*3+3*1+2*2)/(0+1)=122 同理,视频2003完播率0,被点赞数1,评论和转发均为0,最近无播放天数为3,因此热度为:(100*0+5*1+3*0+2*0)/(3+1)=1(1.2保留为整数)。 SQL代码 题目本身不复杂,但是理解题目好难 … 比如,最近无播放天数,一直理解成最近两次播放天数间的间隔天数,实际要简单很多,就是整体最近的播放日期减每个视频的最近播放日期。问题:计算2021年里有播放记录的每个视频的完播率(结果保留三位小数),并按完播率降序排序
#方法1使用sum函数
select
a.video_id,
round(sum(if(timestampdiff(second, a.start_time, a.end_time) >= b.duration,1,0))/count(a.video_id),3) as avg_comp_play_rate
from
tb_user_video_log a
join tb_video_info b on a.video_id = b.video_id
where
year (a.start_time)=2021
group by a.video_id
order by avg_comp_play_rate desc
#方法2使用avg函数
select
a.video_id,
round(avg(if(timestampdiff(second, a.start_time, a.end_time) >= b.duration,1,0)),3) as avg_comp_play_rate
from
tb_user_video_log a
join tb_video_info b on a.video_id = b.video_id
where
year (a.start_time)=2021
group by a.video_id
order by avg_comp_play_rate desc
问题:计算各类视频的平均播放进度,将进度大于60%的类别输出。
select tag,concat(avg_1,'%') avg_play_progress
from
(
select b.tag ,
round(avg(if( timestampdiff(second,a.start_time,a.end_time)>b.duration,1, timestampdiff(second,a.start_time,a.end_time)/b.duration))*100,2) as avg_1
from tb_user_video_log a
join tb_video_info b
on a.video_id=b.video_id
group by b.tag
having avg_1>60
)t
order by avg_play_progress desc
问题:统计在有用户互动的最近一个月(按包含当天在内的近30天算,比如10月31日的近30天为10.2~10.31之间的数据)中,每类视频的转发量和转发率(保留3位小数)。
select b.tag,
sum(a.if_retweet) as retweet_cut,
round(avg(a.if_retweet),3) as retweet_rate
from tb_user_video_log a
join tb_video_info b on a.video_id=b.video_id
where
date(a.start_time) >= date_sub((select max(start_time) from tb_user_video_log ),INTERVAL 30 DAY)
group by b.tag
order by retweet_rate desc
注意:
date(a.start_time) >= date_sub((select max(start_time) from tb_user_video_log ),INTERVAL 30 DAY)
不能写成
date(a.start_time) >= date_sub(max(start_time),INTERVAL 30 DAY)
知道原因的同学跟我说一下
DATE_SUB(date,INTERVAL expr type)问题:计算2021年里每个创作者每月的涨粉率及截止当月的总粉丝量
方法1:
select a.author,
date_format(b.start_time,'%Y-%m') as month,
round(sum(if(b.if_follow=2,-1,b.if_follow))/count(1),3) as fans_growth_rate,
sum(sum(if(b.if_follow=2,-1,b.if_follow))) over(partition by a.author order by date_format(b.start_time,'%Y-%m')) as total_fans
from tb_video_info a
join tb_user_video_log b on a.video_id=b.video_id
WHERE YEAR(b.start_time)=2021
group by a.author,month
order by a.author,total_fans
方法2:
with
main as(
#统计每个用户的播放量、加粉量、掉粉量
select
author,
mid(start_time,1,7) as month,
count(start_time) as b,
count(if(if_follow = 1, 1, null)) as follow_add,
count(if(if_follow = 2, 1, null)) as follow_sub
from tb_user_video_log a, tb_video_info b
where a.video_id = b.video_id
and year(start_time) = 2021
group by author,month
)
#计算2021年里每个创作者每月的涨粉率及截止当月的总粉丝量
select
author,
month,
round((follow_add-follow_sub)/b ,3) as fans_growth_rate,
sum(follow_add-follow_sub) over(partition by author order by month) as total_fans
from main
order by author,total_fans
问题:统计2021年国庆头3天每类视频每天的近一周总点赞量和一周内最大单天转发量,结果按视频类别降序、日期升序排序。假设数据库中数据足够多,至少每个类别下国庆头3天及之前一周的每天都有播放记录。
with t as
(select b.tag,
mid(a.start_time,1,10) as dt,
sum(sum(a.if_like)) over (partition by b.tag order by mid(a.start_time,1,10) rows 6 preceding) as sum_like_cnt_7d,
max(sum(a.if_retweet)) over (partition by b.tag order by mid(a.start_time,1,10) rows 6 preceding) as max_retweet_cnt_7d
from tb_user_video_log a
join tb_video_info b on a.video_id=b.video_id
where datediff('2021-10-03',mid(a.start_time,1,10))<9
group by b.tag,dt
)
select * from t
where dt >='2021-10-01' and dt<='2021-10-03'
order by tag desc,dt asc
问题:找出近一个月发布的视频中热度最高的top3视频。
with t1 as
(select a.video_id,
avg(if(timestampdiff(second,a.start_time,a.end_time)>=b.duration,1,0)) as a, #视频完播率
sum(a.if_like) as b, #点赞数
count(a.comment_id) as c, #评论数
sum(a.if_retweet) as d, #转发数
datediff((select max(end_time) from tb_user_video_log),max(a.end_time)) as f #最近无播放天数
from tb_user_video_log a
join tb_video_info b on a.video_id=b.video_id
where date(b.release_time) > date_sub((select max(end_time) from tb_user_video_log),interval 30 day)
group by a.video_id
)
select video_id,
round((a*100+b*5+c*3+d*2)/(f+1),0) as hot_index
from t1
order by hot_index desc
limit 3