SQL2数分专题

SQL2数分专题

【视频场景】:

各个视频的平均完播率

明确题意:

计算2021年里有播放记录的每个视频的完播率,并按完播率降序排序


问题分解:

  • 关联用户-视频互动记录和短视频信息表:JOIN tb_video_info USING(video_id);
  • 筛选2021年的记录:WHERE YEAR(start_time)=2021
  • 按视频id分组:GROUP BY video_id
  • 计算每条播放记录是否完成播放:IF(TIMESTAMPDIFF(SECOND, start_time, end_time)>=duration, 1, 0)
  • 计算完播率,完成播放为1,未完成播放为0,取平均即为完播率AVG()
  • 保留3位小数:ROUND(x, 3)

细节问题:

  • 表头重命名:as
  • 按完播率倒序排序:ORDER BY avg_comp_play_rate DESC;

完整代码:

SELECT video_id, ROUND(AVG(IF(
        TIMESTAMPDIFF(SECOND, start_time, end_time)>=duration, 1, 0
    )), 3) as avg_comp_play_rate
FROM tb_user_video_log
JOIN tb_video_info USING(video_id)
WHERE YEAR(start_time)=2021
GROUP BY video_id
ORDER BY avg_comp_play_rate DESC;

平均播放进度大于60%的视频类别

明确题意:

计算各类视频的平均播放进度,将进度大于60%的类别输出


问题分解:

  • 关联用户-视频互动记录和短视频信息表:JOIN tb_video_info USING(video_id);
  • 按视频类别分组:GROUP BY tag
  • 计算每个类别的平均播放进度:
    • 播放进度=播放时长÷视频时长*100%
    • 播放时长=TIMESTAMPDIFF(SECOND, start_time, end_time);特殊情况:播放时长大于视频时长时,播放进度为100%(加个IF判断)
    • 平均进度=AVG(每个进度)
    • 结果保留2位小数:ROUND(x, 2)
    • 百分比格式化:CONCAT(x, '%')
    • 计算公式:
    ROUND(AVG(
        IF(TIMESTAMPDIFF(SECOND, start_time, end_time) > duration, 1,
           TIMESTAMPDIFF(SECOND, start_time, end_time) / duration)
    ) * 100, 2) as avg_play_progress
  • 筛选播放进度>60%的视频类别:HAVING avg_play_progress > 60

细节问题:

  • 表头重命名:as
  • 按播放进度倒序排序:ORDER BY avg_play_progress DESC;

完整代码:

SELECT tag, CONCAT(avg_play_progress, "%") as avg_play_progress
FROM (
    SELECT tag, 
        ROUND(AVG(
            IF(TIMESTAMPDIFF(SECOND, start_time, end_time) > duration, 1,
               TIMESTAMPDIFF(SECOND, start_time, end_time) / duration)
        ) * 100, 2) as avg_play_progress
    FROM tb_user_video_log
    JOIN tb_video_info USING(video_id)
    GROUP BY tag
    HAVING avg_play_progress > 60
    ORDER BY avg_play_progress DESC
) as t_progress;

每类视频近一个月的转发量/率

明确题意:

计算每类视频在有用户互动的最近一个月(按包含当天在内的近30天算)中的转发量和转发率。结果按转发率降序排序


问题分解:

  • 关联用户-视频互动记录和短视频信息表:JOIN tb_video_info USING(video_id);
  • 筛选最近30天的记录:
    • 找到今天的日期:MAX(DATE(start_time)
    • 往过去推移30天:DATE_SUB(MAX(DATE(start_time)), INTERVAL 30 DAY)
    • 筛选最近的:WHERE DATE(start_time) > (SELECT DATE_SUB(MAX(DATE(start_time)), INTERVAL 30 DAY) FROM tb_user_video_log)
  • 按视频类别分组:GROUP BY tag
  • 计算每个类别的转发量:SUM(if_retweet)
  • 计算每个类别的转发率(转发率=转发量÷播放量):SUM(if_retweet) / COUNT(1)
  • 保留3位小数:ROUND(x, 3)

细节问题:

  • 表头重命名:as
  • 按转发率倒序排序:ORDER BY retweet_rate DESC;

完整代码:

SELECT tag, SUM(if_retweet) as retweet_cnt,
    ROUND(SUM(if_retweet) / COUNT(1), 3) as retweet_rate
FROM tb_user_video_log
JOIN tb_video_info USING(video_id)
WHERE DATE(start_time) > (
    SELECT DATE_SUB(MAX(DATE(start_time)), INTERVAL 30 DAY)
    FROM tb_user_video_log
)
GROUP BY tag
ORDER BY retweet_rate DESC;

每个创作者每月的涨粉率及截止当前的总粉丝量

明确题意:

计算2021年里每个创作者每月的涨粉率及截止当月的总粉丝量,涨粉率=(加粉量 - 掉粉量) / 播放量。结果按创作者ID、总粉丝量升序排序


问题分解:

  • 关联用户-视频互动记录和短视频信息表:JOIN tb_video_info USING(video_id);
  • 筛选2021年的记录:where YEAR(start_time) = 2021
  • 计算每个创作者每个月的涨粉量和播放量:
    • 按创作者和月份分组:group by author, DATE_FORMAT(start_time, "%Y-%m")
    • 计算涨粉量(特殊情况if_follow=2时表示掉粉):sum(if(if_follow=2, -1, if_follow)) as fans_add_cnt
    • 计算播放量(每条记录就是一次播放):count(1) as play_cnt
  • 计算每个创作者每个月的涨粉量和播放量:
    • 计算涨粉率,保留3位小数:round(fans_add_cnt / play_cnt, 3)
    • 计算截止当月的总粉丝量(累积求和):sum(fans_add_cnt) over(partition by author order by month)

细节问题:

  • 表头重命名:as

完整代码:

select author, `month`, round(fans_add_cnt / play_cnt, 3) as fans_growth_rate,
    sum(fans_add_cnt) over(partition by author order by `month`) as total_fans
from (
    select author,
        DATE_FORMAT(start_time, "%Y-%m") as `month`,
        sum(if(if_follow=2, -1, if_follow)) as fans_add_cnt,
        count(1) as play_cnt
    from tb_user_video_log
    join tb_video_info USING(video_id)
    where YEAR(start_time) = 2021
    group by author, `month`
) as t_author_monthly_fans_play_cnt
order by author, total_fans;

国庆期间每类视频点赞量和转发量

明确题意:

计算2021年国庆头3天每类视频每天的近一周总点赞量和一周内最大单天转发量,结果按视频类别降序、日期升序排序


问题分解:

  • 计算每类视频每天的点赞量和转发量:
    • 关联用户-视频互动记录和短视频信息表:JOIN tb_video_info USING(video_id);
    • 筛选必要时间窗的记录(因为最终只需要2021国庆头三天的近7日量,所以只需要这三天+前6天的数据即可,提前过滤提高效率):WHERE DATE(start_time) between '2021-09-25' and '2021-10-03'
    • 按视频类别和日期分组:GROUP BY tag, DATE(start_time)
    • 计算点赞量和转发量:SUM(if_like) as like_cnt, SUM(if_retweet) as retweet_cnt
  • 计算每类视频每天的近7日总量:
    • 定义窗口(视频类别分区内按日期排序),方便复用:window wd_tag_dt as (partition by tag order by dt)
    • 计算近7日点赞总量:SUM(like_cnt) over(wd_tag_dt ROWS 6 preceding)
    • 计算近7日最大单天转发量:MAX(retweet_cnt) over(wd_tag_dt ROWS 6 preceding)
  • 过滤出国庆三天的结果:where dt between '2021-10-01' and '2021-10-03'

细节问题:

  • 表头重命名:as
  • 按视频类别降序、日期升序排序:order by tag DESC, dt

完整代码:

select tag, dt, sum_like_cnt_7d, max_retweet_cnt_7d
from (
    select tag, dt,
        SUM(like_cnt) over(wd_tag_dt ROWS 6 preceding) as sum_like_cnt_7d,
        MAX(retweet_cnt) over(wd_tag_dt ROWS 6 preceding) as max_retweet_cnt_7d 
    from (
        SELECT tag, DATE(start_time) as dt, SUM(if_like) as like_cnt,
            SUM(if_retweet) as retweet_cnt 
        FROM tb_user_video_log
        JOIN tb_video_info USING(video_id)
        WHERE DATE(start_time) between '2021-09-25' and '2021-10-03'
        GROUP BY tag, DATE(start_time)
    ) as t_tag_daily_like_retweet_cnt
    window wd_tag_dt as (partition by tag order by dt)
) as t_tag_dt_sum_max_cnt
where dt between '2021-10-01' and '2021-10-03'
order by tag DESC, dt;

近一个月发布的视频中热度最高的top3视频

明确题意:

请找出近一个月发布的视频中热度最高的top3视频。 假设热度计算公式简化为:热度=(a*视频完播率+b*点赞数+c*评论数+d*转发数)*新鲜度;新鲜度=1/(最近无播放天数+1); 结果中热度保留为整数,并按热度降序排序。


问题分解:

  • 计算每个视频的各项指标:

    • 关联用户-视频互动记录和短视频信息表:JOIN tb_video_info USING(video_id);
    • 在每一行上追加当前日期列:
        LEFT JOIN (
            SELECT MAX(DATE(end_time)) as cur_date FROM tb_user_video_log
        ) as t_max_date ON 1
    
    • 按视频id分组:GROUP BY video_id
    • 计算各指标:
      • 播放进度:AVG(IF(TIMESTAMPDIFF(SECOND, start_time, end_time)>=duration, 1, 0)) as comp_play_rate
      • 点赞量:SUM(if_like) as like_cnt
      • 评论量:COUNT(comment_id) as comment_cnt
      • 转发量:SUM(if_retweet) as retweet_cnt
      • 最近被播放日期:MAX(DATE(end_time)) as recently_end_date
      • 发布日期:MAX(DATE(release_time)) as release_date
      • 当前日期(非分组列,加MAX避免语法错误):MAX(cur_date) as cur_date
    • 分组后筛选,筛选近30天的记录:HAVING TIMESTAMPDIFF(DAY, release_date, cur_date) < 30
  • 计算每个视频的热度:(100 * comp_play_rate + 5 * like_cnt + 3 * comment_cnt + 2 * retweet_cnt) / (TIMESTAMPDIFF(DAY, recently_end_date, cur_date) + 1) as hot_index

  • 保留为整数:ROUND(x, 0)

  • 取热度top3高的视频:ORDER BY hot_index DESC LIMIT 3


细节问题:

  • 表头重命名:as
  • 按热度倒序排序:ORDER BY hot_index

完整代码:

SELECT video_id,
    ROUND((100 * comp_play_rate + 5 * like_cnt + 3 * comment_cnt + 2 * retweet_cnt)
        / (TIMESTAMPDIFF(DAY, recently_end_date, cur_date) + 1), 0) as hot_index
FROM (
    SELECT video_id,
        AVG(IF(
            TIMESTAMPDIFF(SECOND, start_time, end_time)>=duration, 1, 0
        )) as comp_play_rate,
        SUM(if_like) as like_cnt,
        COUNT(comment_id) as comment_cnt,
        SUM(if_retweet) as retweet_cnt,
        MAX(DATE(end_time)) as recently_end_date,  -- 最近被播放日期
        MAX(DATE(release_time)) as release_date,  -- 发布日期
        MAX(cur_date) as cur_date  -- 非分组列,加MAX避免语法错误
    FROM tb_user_video_log
    JOIN tb_video_info USING(video_id)
    LEFT JOIN (
        SELECT MAX(DATE(end_time)) as cur_date FROM tb_user_video_log
    ) as t_max_date ON 1
    GROUP BY video_id
    HAVING TIMESTAMPDIFF(DAY, release_date, cur_date) < 30
) as t_video_info
ORDER BY hot_index DESC
LIMIT 3;

【用户增长场景】:

2021年11月每天的人均浏览文章时长

明确题意:

统计2021年11月每天的人均浏览文章时长(秒数),结果保留1位小数,并按时长由短到长排序


问题分解:

  • 计算每次文章浏览的时长和日期:
    • 过滤目标时间窗的有效浏览记录:WHERE artical_id != 0 AND DATE_FORMAT(in_time, "%Y%m") = "202111"
    • 将进入时间转化为日期:DATE(in_time) as dt
    • 计算浏览时长:TIMESTAMPDIFF(SECOND, in_time, out_time) as view_len_sec
  • 按日期分组:GROUP BY dt
  • 计算人均时长(=总时长÷人数):SUM(view_len_sec) / COUNT(DISTINCT uid) as avg_view_len_sec
  • 保留1位小数:ROUND(x, 1)

细节问题:

  • 表头重命名:as
  • 按时长由短到长排序:ORDER BY avg_view_len_sec

完整代码:

SELECT dt, ROUND(SUM(view_len_sec) / COUNT(DISTINCT uid), 1) as avg_view_len_sec
FROM (
    SELECT uid, DATE(in_time) as dt,
        TIMESTAMPDIFF(SECOND, in_time, out_time) as view_len_sec
    FROM tb_user_log
    WHERE artical_id != 0 AND DATE_FORMAT(in_time, "%Y%m") = "202111"
) as t_uid_len
GROUP BY dt
ORDER BY avg_view_len_sec;

每篇文章同一时刻最大在看人数

明确题意:

统计每篇文章同一时刻最大在看人数,如果同一时刻有进入也有离开时,先记录用户数增加再记录减少,结果按最大人数降序


问题分解:

  • 计算每篇文章当前时刻的在看人数(当前已进入-当前已离开=当前在看):
    • 跟踪每一时刻的人数变动:
      • 每条记录的进入时间表示进入人数多了1:SELECT artical_id, in_time as at_time, 1 as uv FROM tb_user_log
      • 每条记录的离开时间表示离开人数多了1:SELECT artical_id, out_time as at_time, -1 as uv FROM tb_user_log
      • 合并上面两个结果:UNION ALL
    • 过滤掉非文章浏览的记录:WHERE artical_id != 0
    • 计算当前在看人数,在同一篇文章分区内按时间升序按进入优先排序:SUM(uv) over(PARTITION BY artical_id ORDER BY at_time, uv DESC) as current_max
  • 按文章ID分组:GROUP BY artical_id
  • 计算最大在看:MAX(current_max) as max_uv

细节问题:

  • 表头重命名:as
  • 按最大人数降序排序:ORDER BY max_uv DESC;

完整代码:

SELECT artical_id, MAX(current_max) as max_uv
FROM (
    SELECT artical_id, at_time,
        SUM(uv) over(PARTITION BY artical_id ORDER BY at_time, uv DESC) as current_max
    FROM (
        SELECT artical_id, in_time as at_time, 1 as uv FROM tb_user_log
        UNION ALL
        SELECT artical_id, out_time as at_time, -1 as uv FROM tb_user_log
        ORDER BY at_time
    ) as t_uv_at_time
    WHERE artical_id != 0
) as t_artical_cur_max
GROUP BY artical_id
ORDER BY max_uv DESC;

2021年11月每天新用户的次日留存率

明确题意:

统计2021年11月每天新用户的次日留存率(保留2位小数),次日留存率为当天新增的用户数中第二天又活跃了的用户数占比。

如果in_time和out_time跨天了,在两天里都记为该用户活跃过,结果按日期升序。


问题分解:

  • 计算用户活跃日期集合,分别考虑进入和离开时间(可能跨天):

    WITH t_uid_dt as (
        SELECT distinct uid, DATE(in_time) as dt FROM tb_user_log
        UNION
        SELECT distinct uid, DATE(out_time) as dt FROM tb_user_log
    )
    
  • 生成每个用户活跃日期的前一天作为t_uid_dt_1:SELECT uid, DATE_SUB(dt, INTERVAL 1 DAY) as dt

  • 生成每个日期的最早活跃日期(即属于新用户的日期)作为t_first_dt:SELECT uid, DATE(MIN(in_time)) as dt FROM tb_user_log GROUP BY uid

  • 用t_uid_dt左连接t_uid_dt_1,得到第二天还活跃了的记录:t_uid_dt LEFT JOIN t_uid_dt_1 USING(uid, dt)

  • 继续内连接t_first_dt,得到当天作为新用户第二天活跃了的记录:JOIN t_first_dt USING(uid, dt)

  • 筛选有效时间窗口的记录:WHERE DATE_FORMAT(t_uid_dt.dt, "%Y%m") = "202111"

  • 选取作为新用户的日期和下一天是否留下(留下的有日期,否则为空):SELECT t_uid_dt.dt as uv_new_dt, t_uid_dt_1.dt as next_day_left

  • 按作为新用户的日期分组:GROUP BY uv_new_dt

  • 计算每天的次日留存率:COUNT(next_day_left) / COUNT(uv_new_dt)

  • 保留2位小数:ROUND(x, 2)


细节问题:

  • 表头重命名:as
  • 按最大人数降序排序:ORDER BY max_uv DESC;

完整代码:

WITH t_uid_dt as (
    SELECT distinct uid, DATE(in_time) as dt FROM tb_user_log
    UNION
    SELECT distinct uid, DATE(out_time) as dt FROM tb_user_log
)

SELECT uv_new_dt as dt,
    ROUND(COUNT(next_day_left) / COUNT(uv_new_dt), 2) as uv_left_rate
FROM (
    SELECT t_uid_dt.dt as uv_new_dt, t_uid_dt_1.dt as next_day_left
    FROM t_uid_dt
    LEFT JOIN (
        SELECT uid, DATE_SUB(dt, INTERVAL 1 DAY) as dt
        FROM t_uid_dt
    ) as t_uid_dt_1 USING(uid, dt)
    JOIN (
        SELECT uid, DATE(MIN(in_time)) as dt
        FROM tb_user_log
        GROUP BY uid
    ) as t_first_dt USING(uid, dt)
    WHERE DATE_FORMAT(t_uid_dt.dt, "%Y%m") = "202111"
) as t_uv_new_info
GROUP BY uv_new_dt
ORDER BY dt;

统计活跃间隔对用户分级结果

明确题意:

统计活跃间隔对用户分级后,各活跃等级用户占比,结果保留两位小数,且按占比降序排序


问题分解:

  • 计算每个用户最早最晚活跃日期(作为子表t_uid_first_last):

    • 按用户ID分组:GROUP BY uid
    • 统计最早活跃:MIN(DATE(in_time)) as first_dt
    • 统计最晚活跃:MAX(DATE(out_time)) as last_dt
  • 计算当前日期和总用户数(作为子表t_overall_info):

    • 获取当前日期:MAX(DATE(out_time)) as cur_dt
    • 统计总用户数:COUNT(DISTINCT uid) as user_cnt
  • 左连接两表,即将全表统计信息追加到每一行上:t_uid_first_last LEFT JOIN t_overall_info ON 1

  • 计算最早最晚活跃离当前天数差(作为子表t_user_info):

    • 最早活跃距今天数:TIMESTAMPDIFF(DAY,first_dt,cur_dt) as first_dt_diff
    • 最晚(最近)活跃距今天数:TIMESTAMPDIFF(DAY,last_dt,cur_dt) as last_dt_diff
  • 计算每个用户的活跃等级:

    CASE
        WHEN last_dt_diff >= 30 THEN "流失用户"
        WHEN last_dt_diff >= 7 THEN "沉睡用户"
        WHEN first_dt_diff < 7 THEN "新晋用户"
        ELSE "忠实用户"
    END as user_grade
    
  • 统计每个等级的占比:

    • 按用户等级分组:GROUP BY user_grade

    • 计算占比,总人数从子表得到,非聚合列避免语法错误加了MAX:COUNT(uid) / MAX(user_cnt) as ratio

    • 保留2位小数:ROUND(x, 2)


细节问题:

  • 表头重命名:as
  • 按占比降序排序:ORDER BY ratio DESC;;

完整代码:

SELECT user_grade, ROUND(COUNT(uid) / MAX(user_cnt), 2) as ratio
FROM (
    SELECT uid, user_cnt,
        CASE
            WHEN last_dt_diff >= 30 THEN "流失用户"
            WHEN last_dt_diff >= 7 THEN "沉睡用户"
            WHEN first_dt_diff < 7 THEN "新晋用户"
            ELSE "忠实用户"
        END as user_grade
    FROM (
        SELECT uid, user_cnt,
            TIMESTAMPDIFF(DAY,first_dt,cur_dt) as first_dt_diff, 
            TIMESTAMPDIFF(DAY,last_dt,cur_dt) as last_dt_diff
        FROM (
            SELECT uid, MIN(DATE(in_time)) as first_dt,
                MAX(DATE(out_time)) as last_dt
            FROM tb_user_log
            GROUP BY uid
        ) as t_uid_first_last
        LEFT JOIN (
            SELECT MAX(DATE(out_time)) as cur_dt,
                COUNT(DISTINCT uid) as user_cnt
            FROM tb_user_log
        ) as t_overall_info ON 1
    ) as t_user_info
) as t_user_grade
GROUP BY user_grade
ORDER BY ratio DESC;

每天的日活数及新用户占比

明确题意:

统计每天的日活数及新用户占比,新用户占比=当天的新用户数÷当天活跃用户数(日活数)。

如果in_time和out_time跨天了,在两天里都记为该用户活跃过。

新用户占比保留2位小数,结果按日期升序排序


问题分解:

  • 计算每天的日活数(生成子表t_dau):
    • 计算每条记录活跃信息(生成子表t_active_dt):
      • 分别将记录中的进入和离开时间作为该用户的活跃日期
      • 进入日期:DATE(in_time) as dt
      • 离开日期:DATE(out_time) as dt
      • 合并结果:UNION
    • 按日期分组:GROUP BY dt
    • 统计用户数:COUNT(distinct uid) as dau
  • 计算每天新增用户数:
    • 计算每个用户最早出现日期(即作为新用户的日期,生成子表t_uv_new_daily)
      • 按用户分组:GROUP BY uid
      • 统计最早日期:DATE(MIN(in_time)) as dt
    • 按日期分组:GROUP BY dt
    • 统计新增用户数:COUNT(uid) as uv_new_daily
  • 用日期左连接两个表:t_dau LEFT JOIN t_uv_new_daily USING(dt)
  • 计算新用户占比:IFNULL(uv_new_daily, 0) / dau as uv_new_ratio

细节问题:

  • 表头重命名:as
  • 按日期排序:ORDER BY dt;

完整代码:

SELECT dt, dau, ROUND(IFNULL(uv_new_daily, 0) / dau, 2) as uv_new_ratio
FROM (
    SELECT dt, COUNT(distinct uid) as dau
    FROM (
        SELECT uid, DATE(in_time) as dt FROM tb_user_log
        UNION
        SELECT uid, DATE(out_time) as dt FROM tb_user_log
    ) as t_active_dt
    GROUP BY dt
) as t_dau
LEFT JOIN (
    SELECT dt, COUNT(uid) as uv_new_daily
    FROM (
        SELECT uid, DATE(MIN(in_time)) as dt
        FROM tb_user_log
        GROUP BY uid
    ) as t_uid_first_dt
    GROUP BY dt
) as t_uv_new_daily
USING(dt)
ORDER BY dt;

连续签到领金币

明确题意:

计算每个用户2021年7月以来每月获得的金币数(该活动到10月底结束,11月1日开始的签到不再获得金币),结果按月份、ID升序排序。


问题分解:

  • 计算每个人连续签到的天次(生成子表t_sign_idx):
    • 计算每次签到的序号和连续签到起始日期(生成子表t_sign_base):
      • 计算每次签到的序号(生成子表t_sign_date_rn):
        • 过滤出活动期内的有效签到记录:
          • 有效签到:artical_id = 0 and sign_in = 1
          • 活动期:DATE(in_time) >= "2021-07-07" and DATE(in_time) <= "2021-10-31"
        • 截取签到日期:DATE(in_time) as sign_dt
        • 每次签到序号(按天,相同日期的序号一样):DENSE_RANK() over(partition by uid order by DATE(in_time)) as rn
        • 去重,每天的多次签到只保留一条:SELECT DISTINCT uid, sign_dt, rn
      • 计算当前签到往前连续的起始日期:DATE_SUB(sign_dt, INTERVAL rn DAY) as base_dt
    • 定义窗口,按每个用户每次起始日期分区,按签到日期排序:window wd_uid_dt as (partition by uid, base_dt order by sign_dt)
    • 计算当前签到是一个签到周期内(7天一个周期)的连续签到的天序:(ROW_NUMBER() over(wd_uid_dt) - 1) % 7 as sign_idx
  • 按用户和月份分组:group by uid, DATE_FORMAT(sign_dt, "%Y%m")
  • 计算每个用户每月获取的金币:
    • 第3天和第7天单独奖励,其他为1:case when sign_idx=6 then 7 when sign_idx=2 then 3 else 1 end
    • 计算总金币,求和:sum()

细节问题:

  • 表头重命名:as
  • 按月份、ID升序排序:order by month, uid;

完整代码:

select uid, DATE_FORMAT(sign_dt, "%Y%m") as `month`,
    sum(case when sign_idx=6 then 7 when sign_idx=2 then 3 else 1 end) as coin
from (
    select uid, sign_dt,
        (ROW_NUMBER() over(wd_uid_dt) - 1) % 7 as sign_idx
    from (
        select uid, sign_dt, rn, DATE_SUB(sign_dt, INTERVAL rn DAY) as base_dt
        from (
            select DISTINCT uid, DATE(in_time) as sign_dt,
                DENSE_RANK() over(partition by uid order by DATE(in_time)) as rn
            from tb_user_log
            where artical_id = 0 and sign_in = 1 
                and DATE(in_time) >= "2021-07-07" and DATE(in_time) <= "2021-10-31"
        ) as t_sign_date_rn
    ) as t_sign_base
    window wd_uid_dt as (partition by uid, base_dt order by sign_dt)
) as t_sign_idx
group by uid, `month`
order by `month`, uid;

【电商场景】:

计算商城中2021年每月的GMV

明确题意:

统计GMV大于10w的每月GMV


问题分解:

  • 筛选满足条件的记录:
    • 退款的金额不算(付款的记录还在,已算过一次):where status != 2
    • 2021年的记录:and YEAR(event_time) = 2021
  • 按月份分组:group by DATE_FORMAT(event_time, "%Y-%m")
  • 计算GMV:(sum(total_amount) as GMV
  • 保留整数:ROUND(x, 0)
  • 筛选GMV大于10w的分组:having GMV > 100000

细节问题:

  • 表头重命名:as
  • 按GMV排序:order by GMV;

完整代码:

select DATE_FORMAT(event_time, "%Y-%m") as `month`,
    ROUND(sum(total_amount), 0) as GMV
from tb_order_overall
where status != 2 and YEAR(event_time) = 2021
group by `month`
having GMV > 100000
order by GMV;

统计2021年10月每个退货率不大于0.5的商品的各项指标

明确题意:

统计2021年10月每个有展示记录的退货率不大于0.5商品的各项指标:

商品点展比=点击数÷展示数;加购率=加购数÷点击数;成单率=付款数÷加购数;退货率=退款数÷付款数,当分母为0时整体结果记为0。

结果中各项指标保留3位小数,并按商品ID升序排序。


问题分解:

  • 计算各个维度的计数(生成子表t_product_index_cnt)
    • 筛选时间窗内的记录:where DATE_FORMAT(event_time, '%Y%m') = '202110'
    • 按商品ID分组:group by product_id
    • 统计各种计数:
      • 展示数(每条记录就是一次展示):COUNT(1) as show_cnt
      • 点击数:sum(if_click) as click_cnt
      • 加购数:sum(if_cart) as cart_cnt
      • 付款数:sum(if_payment) as payment_cnt
      • 退款数:sum(if_refund) as refund_cnt
  • 计算各种指标率(除了展示数其他均可能为0):
    • 点击率:click_cnt/show_cnt as ctr
    • 加购率:IF(click_cnt>0, cart_cnt/click_cnt, 0) as cart_rate
    • 付款率:IF(cart_cnt>0, payment_cnt/cart_cnt, 0) as payment_rate
    • 退款率:IF(payment_cnt>0, refund_cnt/payment_cnt, 0) as refund_rate
    • 都保留3位小数:ROUND(x, 3)
  • 筛选退款率不大于0.5的商品(需将之前的结果作为子查询):where refund_rate <= 0.5

细节问题:

  • 表头重命名:as
  • 按商品ID排序:order by product_id;

完整代码:

select * from (
    select product_id, round(click_cnt/show_cnt, 3) as ctr,
        round(IF(click_cnt>0, cart_cnt/click_cnt, 0), 3) as cart_rate,
        round(IF(cart_cnt>0, payment_cnt/cart_cnt, 0), 3) as payment_rate,
        round(IF(payment_cnt>0, refund_cnt/payment_cnt, 0), 3) as refund_rate
    from (
        select product_id, COUNT(1) as show_cnt,
            sum(if_click) as click_cnt,
            sum(if_cart) as cart_cnt,
            sum(if_payment) as payment_cnt,
            sum(if_refund) as refund_cnt
        from tb_user_event
        where DATE_FORMAT(event_time, '%Y%m') = '202110'
        group by product_id
    ) as t_product_index_cnt
) as t_rate
where refund_rate <= 0.5
order by product_id;

某店铺的各商品毛利率及店铺整体毛利率

明确题意:

计算2021年10月以来店铺901中商品毛利率大于24.9%的商品信息及店铺整体毛利率。

此处毛利率定义如下:商品毛利率=(1-进价/平均单件售价)*100%;店铺毛利率=(1-总进价成本/总销售收入)*100%。

结果先输出店铺毛利率,再按商品ID升序输出各商品毛利率,均保留1位小数。


问题分解:

  • 统计每个被售出的商品的售价进价(生成子表t_product_in_each_order):
    • 订单明细表内连接商品信息表:tb_order_detail JOIN tb_product_info USING(product_id)
    • 继续内连接订单总表:JOIN tb_order_overall USING(order_id)
    • 筛选店铺和时间窗:WHERE shop_id = 901 and DATE(event_time) >= "2021-10-01"
  • 按商品分组:GROUP BY product_id
  • 加上汇总结果:WITH ROLLUP
  • 商品ID列重整:IFNULL(product_id, '店铺汇总') as product_id
  • 计算商品利润率:100 * (1 - SUM(in_price*cnt) / SUM(price*cnt)) as profit_rate
  • 保留1位小数:ROUND(x, 1)
  • 筛选满足条件的分组(商品):HAVING profit_rate > 24.9 OR product_id IS NULL
  • 格式化毛利率格式:CONCAT(profit_rate, "%") as profit_rate

细节问题:

  • 表头重命名:as
  • 按商品ID排序:ORDER BY product_id

完整代码:

SELECT product_id, CONCAT(profit_rate, "%") as profit_rate
FROM (
    SELECT IFNULL(product_id, '店铺汇总') as product_id,
        ROUND(100 * (1 - SUM(in_price*cnt) / SUM(price*cnt)), 1) as profit_rate
    FROM (
        SELECT product_id, price, cnt, in_price
        FROM tb_order_detail
        JOIN tb_product_info USING(product_id)
        JOIN tb_order_overall USING(order_id)
        WHERE shop_id = 901 and DATE(event_time) >= "2021-10-01"
    ) as t_product_in_each_order
    GROUP BY product_id
    WITH ROLLUP
    HAVING profit_rate > 24.9 OR product_id IS NULL
    ORDER BY product_id
) as t1;

零食类商品中复购率top3高的商品

明确题意:

统计零食类商品中复购率top3高的商品。复购率指用户在一段时间内对某商品的重复购买比例,复购率越大,则反映出消费者对品牌的忠诚度就越高,也叫回头率。

此处我们定义:某商品复购率 = 近90天内购买它至少两次的人数 ÷ 购买它的总人数。近90天指包含最大日期(记为当天)在内的近90天。

结果中复购率保留3位小数,并按复购率倒序、商品ID升序排序。


问题分解:

  • 计算每个用户对每个商品是否复购(生成子表t_uid_product_info):
    • 内连接多表:tb_order_detail JOIN tb_order_overall USING(order_id) JOIN tb_product_info USING(product_id)
    • 筛选零食类商品:WHERE tag="零食"
    • 筛选近90天的记录:
      • 计算最小允许日期:DATE_SUB(MAX(event_time), INTERVAL 89 DAY)
      • 筛选:event_time >= (SELECT ... FROM tb_order_overall)
    • 按用户和商品分组:GROUP BY uid, product_id
    • 计算是否复购:IF(COUNT(event_time)>1, 1, 0) as repurchase
  • 按商品分组:GROUP BY product_id
  • 计算复购率:SUM(repurchase) / COUNT(repurchase) as repurchase_rate
  • 保留3位小数:ROUND(x, 3)

细节问题:

  • 表头重命名:as
  • 按复购率倒序、商品ID升序排序:ORDER BY repurchase_rate DESC, product_id
  • 保留top3高的结果:LIMIT 3;

完整代码:

SELECT product_id,
    ROUND(SUM(repurchase) / COUNT(repurchase), 3) as repurchase_rate
FROM (
    SELECT uid, product_id, IF(COUNT(event_time)>1, 1, 0) as repurchase
    FROM tb_order_detail
    JOIN tb_order_overall USING(order_id)
    JOIN tb_product_info USING(product_id)
    WHERE tag="零食" AND event_time >= (
        SELECT DATE_SUB(MAX(event_time), INTERVAL 89 DAY)
        FROM tb_order_overall
    )
    GROUP BY uid, product_id
) as t_uid_product_info
GROUP BY product_id
ORDER BY repurchase_rate DESC, product_id
LIMIT 3;

10月的新户客单价和获客成本

明确题意:

计算2021年10月商城里所有新用户的首单平均交易金额(客单价)和平均获客成本(保留一位小数)。

某个订单的优惠金额可通过订单明细里的 {该订单各商品单价×数量之和-订单总金额} (订单总金额在订单总表中)得到。


问题分解:

  • 统计用户首单信息(生成子表t_first_order):
    • 定义按用户分区按行为时间排序的窗口:WINDOW wd_uid_first as (partition by uid order by event_time)
    • 获取窗口内第一个行为时间:FIRST_VALUE(event_time) over(wd_uid_first) as event_time
    • 获取窗口内第一个订单ID:FIRST_VALUE(order_id) over(wd_uid_first) as order_id
    • 获取窗口内第一个订单金额:FIRST_VALUE(total_amount) over(wd_uid_first) as total_amount
  • 统计每个订单原始金额(生成子表t_raw_amount):
    • 按订单号分组:GROUP BY order_id
    • 商品价格求和:SUM(price * cnt) as raw_amount
  • 关联用户首单的原始价格:t_first_order JOIN t_raw_amount USING(order_id)
  • 筛选时间窗:WHERE DATE_FORMAT(event_time, "%Y-%m") = '2021-10'
  • 基于以上结果统计首单平均交易金额和平均获客成本:
    • 首单平均交易金额:AVG(total_amount) as avg_amount
    • 平均获客成本:AVG(raw_amount-total_amount) as avg_cost

细节问题:

  • 表头重命名:as

完整代码:

SELECT ROUND(AVG(total_amount), 1) as avg_amount,
    ROUND(AVG(raw_amount-total_amount), 1) as avg_cost
FROM (
    SELECT uid, total_amount, raw_amount
    FROM (
        SELECT DISTINCT uid,
            FIRST_VALUE(event_time) over(wd_uid_first) as event_time,
            FIRST_VALUE(order_id) over(wd_uid_first) as order_id,
            FIRST_VALUE(total_amount) over(wd_uid_first) as total_amount
        FROM tb_order_overall
        WINDOW wd_uid_first as (partition by uid order by event_time)
    ) as t_first_order
    JOIN (
        SELECT order_id, SUM(price * cnt) as raw_amount
        FROM tb_order_detail
        GROUP BY order_id
    ) as t_raw_amount
    USING(order_id)
    WHERE DATE_FORMAT(event_time, "%Y-%m") = '2021-10'
) as t_first_order_info;

店铺901国庆期间的7日动销率和滞销率

明确题意:

计算店铺901在2021年国庆头3天的7日动销率和滞销率。结果保留3位小数,按日期升序排序。只要当天有任何商品的销量就输出该天的结果,即使店铺901当天的动销率为0。 动销率定义为店铺中一段时间内有销量的商品占当前已上架总商品数的比例,同理滞销率就是没有销量的商品比例。


问题分解:

  • 计算每天的动销率(有销量产品的比例):
    • 计算国庆期间店铺901截止每天的近7天有销量的商品数(生成子表t_dt_901_pid_cnt):
      • 统计每个有销量的日期(生成子表t_dates):
        • 筛选目标时间窗:WHERE DATE(event_time) BETWEEN '2021-10-01' AND '2021-10-03'
        • 取去重后日期:DISTINCT DATE(event_time) as dt
      • 统计每天有销量的商品(生成子表t_dt_pid):
        • 用订单号关联订单总表和明细表:tb_order_overall JOIN tb_order_detail USING(order_id)
        • 统计每天售出的商品:DISTINCT DATE(event_time) as event_dt, product_id
      • 左连接两张表,得到每个有销售的日期近7天售出的商品:t_dates LEFT JOIN t_dt_pid ON DATEDIFF(dt,event_dt) BETWEEN 0 AND 6
      • 继续左连接商品信息,得到店铺id:LEFT JOIN tb_product_info USING(product_id)
      • 按日期分组:GROUP BY dt
      • 统计每天店铺901近7天有销量的商品数:COUNT(DISTINCT IF(shop_id!=901, NULL, product_id)) as sale_pid_cnt
    • 统计店铺901每个商品上架日期(生成子表t_release_dt):
      • 筛选店铺:WHERE shop_id=901
      • 生成日期:SELECT DATE(release_time) as release_dt, product_id as all_pid
    • 左连接两个表,得到每天已上架在售的商品:t_dt_901_pid_cnt LEFT JOIN t_release_dt ON dt >= release_dt
    • 按日期分组:GROUP BY dt
    • 计算动销率(有销量商品数已计算好,非聚合列,加MIN避免语法错误):MIN(sale_pid_cnt) / COUNT(all_pid) as sale_rate
    • 保留3位小数:ROUND(x, 3)
  • 基于上述结果计算滞销率:1 - sale_rate as unsale_rate

细节问题:

  • 表头重命名:as

完整代码:

SELECT dt, sale_rate, 1 - sale_rate as unsale_rate
FROM (
    SELECT dt, ROUND(MIN(sale_pid_cnt) / COUNT(all_pid), 3) as sale_rate
    FROM (
        -- 国庆期间店铺901截止每天的近7天有销量的商品数
        SELECT dt, COUNT(DISTINCT IF(shop_id!=901, NULL, product_id)) as sale_pid_cnt
        FROM (
            SELECT DISTINCT DATE(event_time) as dt
            FROM tb_order_overall
            WHERE DATE(event_time) BETWEEN '2021-10-01' AND '2021-10-03'
        ) as t_dates
        LEFT JOIN (
            SELECT DISTINCT DATE(event_time) as event_dt, product_id
            FROM tb_order_overall
            JOIN tb_order_detail USING(order_id)
        ) as t_dt_pid ON DATEDIFF(dt,event_dt) BETWEEN 0 AND 6
        LEFT JOIN tb_product_info USING(product_id)
        GROUP BY dt
    ) as t_dt_901_pid_cnt
    LEFT JOIN (
        -- 店铺901每个商品上架日期
        SELECT DATE(release_time) as release_dt, product_id as all_pid
        FROM tb_product_info
        WHERE shop_id=901
    ) as t_release_dt ON dt >= release_dt # 当天店铺901已上架在售的商品
    GROUP BY dt
) as t_dt_sr;

【出行场景】:

2021年国庆在北京接单3次及以上的司机统计信息

明确题意:

统计2021年国庆7天期间在北京市接单至少3次的司机的平均接单数和平均兼职收入(暂不考虑平台佣金,直接计算完成的订单费用总额),结果保留3位小数。


问题分解:

  • 计算2021国庆在北京接单至少3次的司机信息:
    • 关联接单表和打车记录表:tb_get_car_order JOIN tb_get_car_record USING(order_id)
    • 筛选北京国庆期间的记录:WHERE city = "北京" and DATE_FORMAT(order_time,"%Y%m%d") BETWEEN '20211001' AND '20211007'
    • 按司机ID分组:GROUP BY driver_id
    • 统计每个司机的接单量和接单收入:COUNT(order_id) as order_num, SUM(fare) as income
    • 筛选接单至少3次的分组(司机):HAVING COUNT(order_id) >= 3
  • 基于上述结果,统计他们的平均接单量和平均收入:
    • 平均接单量:AVG(order_num) as avg_order_num
    • 平均收入:AVG(income) as avg_income
  • 保留3位小数:ROUND(x, 3)

细节问题:

  • 表头重命名:as

完整代码:

SELECT "北京" as city, ROUND(AVG(order_num), 3) as avg_order_num,
    ROUND(AVG(income), 3) as avg_income
FROM (
    SELECT driver_id, COUNT(order_id) as order_num, SUM(fare) as income
    FROM tb_get_car_order
    JOIN tb_get_car_record USING(order_id)
    WHERE city = "北京" and DATE_FORMAT(order_time,"%Y%m%d") BETWEEN '20211001' AND '20211007'
    GROUP BY driver_id
    HAVING COUNT(order_id) >= 3
) as t_driver_info;

有取消订单记录的司机平均评分

明确题意:

找到2021年10月有过取消订单记录的司机,计算他们每人全部已完成的有评分订单的平均评分及总体平均评分,保留1位小数。


问题分解:

  • 找出2021年10月有取消订单的司机:WHERE DATE_FORMAT(order_time, "%Y-%m")='2021-10' AND ISNULL(fare)
  • 筛选他们的已完成订单的评分:WHERE driver_id in (...) AND NOT ISNULL(grade)
  • 按司机分组:GROUP BY driver_id
  • 追加汇总信息:WITH ROLLUP
  • 输出每个司机的平均评分:
    • 司机ID或总体:IFNULL(driver_id, "总体") as driver_id
    • 平均评分:AVG(grade) as avg_grade
    • 保留1位小数:ROUND(x, 1)

细节问题:

  • 表头重命名:as

完整代码:

SELECT IFNULL(driver_id, "总体") as driver_id,
    ROUND(AVG(grade), 1) as avg_grade
FROM tb_get_car_order
WHERE driver_id in (
    SELECT driver_id
    FROM tb_get_car_order
    WHERE DATE_FORMAT(order_time, "%Y-%m")='2021-10' AND ISNULL(fare)
) AND NOT ISNULL(grade)
GROUP BY driver_id
WITH ROLLUP;

每个城市中评分最高的司机日均接单量和里程数

明确题意:

统计每个城市中评分最高的司机平均评分、日均接单量和日均行驶里程数。有多个司机评分并列最高时,都输出。

平均评分和日均接单量保留1位小数,日均行驶里程数保留3位小数,按日均接单数升序排序。


问题分解:

  • 计算每个司机的评分、日均接单量、日均里程和城市内评分排名(生成子表t_driver_rk):
    • 计算每个司机的评分等信息(生成子表t_driver_info):
      • 关联打车记录表和订单表:tb_get_car_record JOIN tb_get_car_order USING(order_id)
      • 按司机和城市分组:GROUP BY driver_id, city
      • 计算各指标:
        • 平均评分:AVG(grade) as avg_grade
        • 工作天数:COUNT(DISTINCT DATE(order_time)) as work_days
        • 接单量:COUNT(order_time) as order_num
        • 总行驶里程数:SUM(mileage) as toal_mileage
    • 计算日均指标和排名:
      • 日均订单量:order_num / work_days as avg_order_num
      • 日均里程数:toal_mileage / work_days as avg_mileage
      • 计算城市内的评分排名,允许并列第一:RANK() over(PARTITION BY city ORDER BY avg_grade DESC) as rk
      • 保留小数位数:ROUND(x, 1)
  • 筛选每个城市的第一名:WHERE rk = 1

细节问题:

  • 表头重命名:as
  • 按日均接单数升序排序:ORDER BY avg_order_num;

完整代码:

SELECT city, driver_id, avg_grade, avg_order_num, avg_mileage
FROM (
    SELECT city, driver_id, ROUND(avg_grade, 1) as avg_grade,
        ROUND(order_num / work_days, 1) as avg_order_num,
        ROUND(toal_mileage / work_days, 3) as avg_mileage,
        RANK() over(PARTITION BY city ORDER BY avg_grade DESC) as rk
    FROM (
        SELECT driver_id, city, AVG(grade) as avg_grade,
            COUNT(DISTINCT DATE(order_time)) as work_days,
            COUNT(order_time) as order_num,
            SUM(mileage) as toal_mileage
        FROM tb_get_car_record
        JOIN tb_get_car_order USING(order_id)
        GROUP BY driver_id, city
    ) as t_driver_info
) as t_driver_rk
WHERE rk = 1
ORDER BY avg_order_num;

国庆期间近7日日均取消订单量

明确题意:

统计国庆头3天里,每天的近7日日均订单完成量和日均订单取消量,按日期升序排序。


问题分解:

  • 计算每天的近7日订单的日均完成量和日均取消量:
    • 计算每天的订单完成量和取消量(生成子表t_finish_cancel_daily):
      • 统计每个订单是否完成(生成子表t_order_status):
        • 筛选必要的时间窗(题目只要国庆头3天的近7日):WHERE DATE(order_time) BETWEEN '2021-09-25' and '2021-10-03'
        • 生成日期:DATE(order_time) as dt
        • 生成完成状态:IF(start_time IS NULL, 0, 1) as is_finish
      • 按日期分组:GROUP BY dt
      • 计算每天的订单完成量:SUM(is_finish) as finish_num
      • 计算每天的订单取消量:COUNT(1) - SUM(is_finish) as cancel_num
    • 计算近7日日均(滑动窗口平均):
      • 7日日均完成量:AVG(finish_num) over(ORDER BY dt ROWS 6 preceding) as finish_num_7d
      • 7日日均取消量:AVG(cancel_num) over(ORDER BY dt ROWS 6 preceding) as cancel_num_7d
      • 保留2位小数:ROUND(x, 2)
  • 基于上述结果,筛选国庆头3天数据:WHERE dt >= '2021-10-01'

细节问题:

  • 表头重命名:as

完整代码:

SELECT dt, finish_num_7d, cancel_num_7d
FROM (
    SELECT dt,
        ROUND(AVG(finish_num) over(ORDER BY dt ROWS 6 preceding), 2) as finish_num_7d,
        ROUND(AVG(cancel_num) over(ORDER BY dt ROWS 6 preceding), 2) as cancel_num_7d
    FROM (
        SELECT dt, SUM(is_finish) as finish_num, COUNT(1) - SUM(is_finish) as cancel_num
        FROM (
            SELECT DATE(order_time) as dt, IF(start_time IS NULL, 0, 1) as is_finish
            FROM tb_get_car_order
            WHERE DATE(order_time) BETWEEN '2021-09-25' and '2021-10-03'
        ) as t_order_status
        GROUP BY dt
    ) as t_finish_cancel_daily
)as t_finish_cancel_7d
WHERE dt >= '2021-10-01';

工作日各时段叫车量&等待接单时间&调度时间

明确题意:

统计周一到周五各时段的叫车量、平均等待接单时间和平均调度时间。

全部以开始打车时间为时段划分依据,平均等待接单时间和平均调度时间均保留1位小数,平均调度时间仅计算完成了的订单,结果按叫车量升序排序。


问题分解:

  • 计算每次叫车的等待时间和调度时间(生成子表t_wait_dispatch_time):
    • 关联打车记录和订单表:tb_get_car_record JOIN tb_get_car_order USING(order_id)
    • 筛选工作日的记录(周一到周五):WHERE DAYOFWEEK(event_time) BETWEEN 2 AND 6
    • 转换打车时间所属时段:CASE WHEN HOUR(event_time) IN (7, 8) THEN '早高峰' ... END as period
    • 计算等待接单时间:TIMESTAMPDIFF(SECOND, event_time, end_time) as wait_time
    • 计算调度时间:TIMESTAMPDIFF(SECOND, order_time, start_time) as dispatch_time
  • 按时段分组:GROUP BY period
  • 计算叫车量:COUNT(1) as get_car_num
  • 计算平均等待接单时间:AVG(wait_time/60) as avg_wait_time
  • 计算平均调度时间:AVG(dispatch_time/60) as avg_dispatch_time
  • 保留1位小数:ROUND(x, 1)

细节问题:

  • 表头重命名:as
  • 按叫车量升序排序:ORDER BY get_car_num

完整代码:

SELECT period, COUNT(1) as get_car_num,
    ROUND(AVG(wait_time/60), 1) as avg_wait_time,
    ROUND(AVG(dispatch_time/60), 1) as avg_dispatch_time
FROM (
    SELECT event_time,
        CASE
            WHEN HOUR(event_time) IN (7, 8) THEN '早高峰'
            WHEN HOUR(event_time) BETWEEN 9 AND 16 THEN '工作时间'
            WHEN HOUR(event_time) IN (17, 18, 19) THEN '晚高峰'
            ELSE '休息时间'
        END as period,
        TIMESTAMPDIFF(SECOND, event_time, end_time) as wait_time,
        TIMESTAMPDIFF(SECOND, order_time, start_time) as dispatch_time
    FROM tb_get_car_record
    JOIN tb_get_car_order USING(order_id)
    WHERE DAYOFWEEK(event_time) BETWEEN 2 AND 6
) as t_wait_dispatch_time
GROUP BY period
ORDER BY get_car_num;

各城市最大同时等车人数

明确题意:

统计各个城市在2021年10月期间,单天里最大的同时等车人数。结果按最大等车人数升序排序,相同时按城市升序排序。 等车指从开始打车起,直到取消打车或取消等待或上车止的这段时间里用户的状态。如果同一时刻有人停止等车,有人开始等车,等车人数记作先增加后减少。


问题分解:

  • 计算截止当前时刻最大在等车人数(生成子表t_city_cur_max):
    • 计算每个时刻人数变化量(生成子表t_uv_at_time):
      • 记录等车人数增加(有人开始打车):event_time as at_time, 1 as wait_uv
      • 记录等车人数减少(上车了、打到车取消、没打到车取消):
        • 关联打车记录表和订单表:tb_get_car_record LEFT JOIN tb_get_car_order USING(order_id)
        • 减少:COALESCE(start_time, finish_time, end_time) as at_time, -1 as wait_uv
      • 合并上述记录:UNION ALL
    • 筛选时间窗:WHERE DATE_FORMAT(at_time, "%Y-%m") = "2021-10"
    • 定义按城市和日期分区按时刻排序的窗口(先增加后减少,所以uv倒排):WINDOW wd_city_date as (PARTITION BY city, DATE(at_time) ORDER BY at_time, wait_uv DESC)
    • 计算当前最大等车人数:SUM(wait_uv) over(wd_city_date) as current_max
  • 按城市分组:GROUP BY city
  • 计算各城市最大等待人数:MAX(current_max) as max_wait_uv

细节问题:

  • 表头重命名:as
  • 按最大等车人数、城市升序排序:ORDER BY max_wait_uv, city

完整代码:

SELECT city, MAX(current_max) as max_wait_uv
FROM (
    SELECT city, SUM(wait_uv) over(wd_city_date) as current_max
    FROM (
        SELECT city, event_time as at_time, 1 as wait_uv FROM tb_get_car_record
        UNION ALL
        SELECT city, COALESCE(start_time, finish_time, end_time) as at_time, -1 as wait_uv
        FROM tb_get_car_record
        LEFT JOIN tb_get_car_order USING(order_id)
    ) as t_uv_at_time
    WHERE DATE_FORMAT(at_time, "%Y-%m") = "2021-10"
    WINDOW wd_city_date as (PARTITION BY city, DATE(at_time) ORDER BY at_time, wait_uv DESC)
) as t_city_cur_max
GROUP BY city
ORDER BY max_wait_uv, city;
全部评论

相关推荐

03-21 08:46
已编辑
门头沟学院 C++
一个什么都不会的学生:当你有硕士学历的时候HR会说就是比本科生强
点赞 评论 收藏
分享
评论
点赞
2
分享

创作者周榜

更多
牛客网
牛客企业服务