HQL Notes | BiliBili Internship - Z指数用户数据监控系统

前情提要:

所谓全埋点,就是将预先定义的某类事件(通常包括这四种事件类型:APP启动、APP关闭、页面访问、元素点击),通过SDK实现完全自动上报。

SPM码的全称是 super position model(超级位置模型),它通过五个字符串,将一个页面功能位置,唯一标识出来,下面是一个SPM码的格式:

SPM=SPMA.SPMB.SPMC.SPMD.SPME

SPMA 唯一标识一个站点

SPMB 唯一标识某站点的一个页面

SPMC 唯一标识某页面的一个区块

SPMD 唯一标识某区块的一个具体位置

SPME 随机生成的字串,跟时间有关系,在循环页面计算时可以区分点击的时序

 

Hive SQL开发:

DAG图:Z指数各模块日访问明细

埋点信息底表:b_dwd.dwd_flow_ubt_web_ad_event_l_hr

重要的字段:

event_id:埋点事件信息字段

buvid:设备号

bid:外网用户id

uid:内网用户id

url:co和com区分内外网

module:不同访问模块

bid_role:不同类型用户

能实现T+1的准确数据交付目标

第一次更新

  • 将小程序埋点信息数据加入,统一web和小程序的口径,区别:web端是通过不同埋点上报信息来区分不同模块访问行为,xcx端是通过module字段来区分;
  • web埋点:rlike 999.173
  • xcx埋点:999.194.0.0.pv
  • 回刷历史数据
  • 修改BI看板 加入小程序筛选项与数据类别 

第二次更新:

  1. 发现原先外网未登录类型用户未计入MAU与DAU,我沟通后修改了口径,增加外网未登录,整套区分用户身份字段换新:coalesece(bid or uid, buvid):将设备号作为未登录用户id计入MAU统计,结果显示web与小程序端总共增加了700+MAU数量,占原先的60%,口径更新后数据增幅直接完成了team里Z指数产品姐姐的okr。
  2. 修补id字段数据缺失bug:有过登录的小程序用户,但是部分事件数据没有id数据,因为刚点进小程序首页在未登录状态被埋点系统上报了,后续有点进个人页面进行登录,因此显示同一设备号但不同类型用户:外网未登录/个人。修补做法,建立设备号与id单一映射关系,再进行join补全。
  3. 修补MAU数据缺失bug:某日某种角色0人使用平台,导致DAU为0,导致该角色MAU数据在当日缺失,我修改了能跑得对。(这段补丁GPT帮我写的,反正能跑通)
  4. 回刷历史数据
  5. 修改BI看板:发现整体UV与不同角色MAU分布中的总计数字无法对齐(有个位数的出入),完成数据明细探查后给到指标解释:部分用户在时间周期内存在角色变化的情况,因此在MAU分布的总计中可能在不同角色下被重复统计。

-- HQL任务开发必读:https://info.bilibili.co/pages/viewpage.action?pageId=231965315#id-%E6%95%B0%E6%8D%AE%E5%BC%80%E5%8F%91BestPractice-%E6%9C%80%E4%BD%B3HQL%E4%BD%BF%E7%94%A8%E5%A7%BF%E5%8A%BF 
-- Author: chenmingyang01
-- CreateTime: 2025-05-15 14:28:35
-- Description: 测试项目测试表
-- Copyright: bilibili.com
-- Modify: 修改时间: 2024-05-15 14:28:35
--        修改人:  chenmingyang01
--         修改内容: 创建
-- 
-- 
--  平台变量简介(详细请点击本页面右上角):
-- 
-- 常用天变量${yyyyMMdd},例如20210329
-- 常用小时变量${HH},例如02
-- 常见查询: where log_date='${yyyyMMdd}' and log_hour='${HH}'
WITH
  -- 小程序数据
  xcx_data AS (
    SELECT
      buvid,
      event_id,
      extended_fields['user_name'] as user_name,
      extended_fields['first_level_module'] as first_level_module,
      extended_fields['second_level_module'] as second_level_module,
      extended_fields['bid'] as outer_id,
      extra['uid'] as inner_id,
      extended_fields['bid_role'] as bid_role,
      extended_fields['dept_name_seq'] as dept_name,
      extended_fields['org_name'] as org_name,
      extended_fields['report_name'] as report_name,
      url
    FROM
      b_dwd.dwd_flow_ubt_web_ad_event_l_hr
    WHERE
      log_date = '${yyyyMMdd}'
      AND event_id = '999.194.0.0.pv'
      AND extended_fields['first_level_module'] is not null
  ),
  -- 下面这个子表的作用也是将小程序用户中有过登录但是没有outer_id的数据行补全(可能是刚刚点进小程序没有登录,但是埋点已经上报了)
  buvid2bidplus AS (
    SELECT DISTINCT
      buvid,
      user_name,
      outer_id,
      inner_id,
      bid_role,
      dept_name,
      org_name
    FROM
      xcx_data
    WHERE
      outer_id IS NOT NULL
  ),
  -- 小程序最终子表
  xcx_data_new AS (
    SELECT
      xcx_data.buvid,
      xcx_data.event_id,
      COALESCE(buvid2bidplus.user_name, xcx_data.user_name) AS user_name,
      xcx_data.first_level_module,
      xcx_data.second_level_module,
      COALESCE(buvid2bidplus.outer_id, xcx_data.buvid) AS outer_id,
      COALESCE(buvid2bidplus.inner_id, xcx_data.inner_id) AS inner_id,
      COALESCE(buvid2bidplus.bid_role, xcx_data.bid_role) AS bid_role,
      COALESCE(buvid2bidplus.dept_name, xcx_data.dept_name) AS dept_name,
      COALESCE(buvid2bidplus.org_name, xcx_data.org_name) AS org_name,
      xcx_data.report_name,
      xcx_data.url
    FROM
      xcx_data
      LEFT JOIN buvid2bidplus ON xcx_data.buvid = buvid2bidplus.buvid
  ),
  -- web端数据
  web_data AS (
    SELECT
      buvid,
      event_id,
      extended_fields['user_name'] as user_name,
      extended_fields['bid'] as outer_id,
      extended_fields['uid'] as inner_id,
      extended_fields['bid_role'] as bid_role,
      extended_fields['dept_name_seq'] as dept_name,
      extended_fields['org_name'] as org_name,
      extended_fields['report_name'] as report_name,
      url
    FROM
      b_dwd.dwd_flow_ubt_web_ad_event_l_hr
    WHERE
      log_date = '${yyyyMMdd}'
      AND event_id rlike '999.173'
      AND url is not null
  ),
  buvid2outeridplus AS (
    SELECT DISTINCT
      buvid,
      user_name,
      outer_id,
      inner_id,
      bid_role,
      dept_name,
      org_name
    FROM
      web_data
    WHERE
      outer_id IS NOT NULL
  ),
  -- 这个表是用来将外部用户中有过登录但是没有outerid的数据行补全
  web_data_middle AS (
    SELECT
      web_data.buvid,
      web_data.event_id,
      COALESCE(web_data.user_name, buvid2outeridplus.user_name) AS user_name,
      COALESCE(web_data.outer_id, buvid2outeridplus.outer_id) outer_id,
      web_data.inner_id,
      COALESCE(web_data.bid_role, buvid2outeridplus.bid_role) AS bid_role,
      COALESCE(web_data.dept_name, buvid2outeridplus.dept_name) AS dept_name,
      COALESCE(web_data.org_name, buvid2outeridplus.org_name) AS org_name,
      web_data.report_name,
      web_data.url
    FROM
      web_data
      LEFT JOIN buvid2outeridplus ON web_data.buvid = buvid2outeridplus.buvid
  ),
  -- web端最终子表
  web_data_new as (
    SELECT
      buvid,
      event_id,
      user_name,
      CASE
        when (outer_id is null)
        and (inner_id is null) then buvid
        else outer_id
      end as outer_id, -- 将外网未登录类别中的buvid当做outer_id,外网未登录就是外部用户
      inner_id,
      bid_role,
      dept_name,
      org_name,
      report_name,
      url
    FROM
      web_data_middle
    WHERE
      not (
        outer_id is null
        AND not (url rlike 'com')
        AND inner_id is null
      ) -- 删掉那些内网用户的数据行中没有inner_id的数据行(脏数据)
  )
INSERT overwrite table
  bi_sycpb.zindex_user_behavior_i_d partition (log_date = '${yyyyMMdd}') (
    SELECT
      case event_id
        when '999.173.0.0' THEN '整体'
        when '999.173.insight.detail.click' THEN '营销洞察报告下载'
        when '999.173.zindex.brandall.show' THEN '品牌综合榜'
        when '999.173.zindex.brandsearch.show' THEN '单品带货榜'
        when '999.173.zindex.content.show' THEN '内容指数'
        when '999.173.zindex.homepage.show' THEN '首页'
        when '999.173.zindex.insight.show' THEN '营销洞察'
        when '999.173.zindex.categorysearch.show' THEN '品牌搜索榜'
        ELSE '其他'
      END AS module,
      user_name,
      case
        when outer_id is NOT null THEN outer_id
        ELSE inner_id
      end AS bid, -- or we can: COALESCE(outer_id, inner_id) AS bid,
      case
        when bid_role = -1 THEN '个人'
        when bid_role = 2
        or bid_role = 4 THEN '代理'
        when bid_role = 1 THEN '品牌方'
        ELSE if (url rlike 'com', '外网未登录', '内部用户') -- z指数内部用户登录的网址是co结尾,但是外部用户登录的网址是com结尾
      end AS user_role,
      dept_name,
      org_name,
      report_name,
      if (url rlike 'com', 1, 0) as outer_user,
      'web' as platform,
      case
        when outer_id is NOT null THEN outer_id
        ELSE inner_id
      end AS user_id
    FROM
      web_data_new
  ) -- 这个是web端的数据终表
union ALL
(
  select
    first_level_module as module,
    user_name,
    COALESCE(outer_id, inner_id) AS bid,
    case
      when bid_role = -1 THEN '个人'
      when bid_role = 2
      or bid_role = 4 THEN '代理'
      when bid_role = 1 THEN '品牌方'
      ELSE '外网未登录'
    end AS user_role,
    dept_name,
    org_name,
    report_name,
    1 as outer_user,
    '小程序' as platform,
    COALESCE(outer_id, inner_id) AS user_id
  FROM
    xcx_data_new
) -- 这个是小程序端的module分明细终表
union ALL
(
  select
    '整体' as module,
    user_name,
    COALESCE(outer_id, inner_id) AS bid,
    case
      when bid_role = -1 THEN '个人'
      when bid_role = 2
      or bid_role = 4 THEN '代理'
      when bid_role = 1 THEN '品牌方'
      ELSE '外网未登录'
    end AS user_role,
    dept_name,
    org_name,
    report_name,
    1 as outer_user,
    '小程序' as platform,
    COALESCE(outer_id, inner_id) AS user_id
  FROM
    xcx_data_new
) -- 这个是小程序端的module整体终表

with
  t1 as (
    SELECT
      platform_type,
      user_role,
      outer_user,
      cast(
        regexp_replace(log_date, '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3') AS DATE
      ) AS log_date,
      COUNT(DISTINCT user_info) as user_cnt
    FROM
      (
        select
          platform_type,
          user_role,
          outer_user,
          log_date,
          concat(outer_user, '_', user_id) as user_info
        from
          bi_sycpb.zindex_user_behavior_i_d
        WHERE
          log_date BETWEEN '${yyyyMMdd,-1d}' and '${yyyyMMdd}'
      )
    group by
      platform_type,
      user_role,
      outer_user,
      log_date
  ),
  t2 AS (
    SELECT
      c.platform_type,
      c.user_role,
      c.outer_user,
      c.log_date,
      COALESCE(c.user_cnt - d.user_cnt, 0) as user_change
    FROM
      t1 c
      left join t1 d on DATEDIFF(c.log_date, d.log_date) = 1
      AND c.user_role = d.user_role
      AND c.platform_type = d.platform_type
    WHERE
      d.log_date is not null
  ),
  t3 AS (
    select
      platform_type,
      user_role,
      outer_user,
      cast(
        regexp_replace(
          '${yyyyMMdd}',
          '(\\d{4})(\\d{2})(\\d{2})',
          '$1-$2-$3'
        ) AS DATE
      ) AS log_date,
      COUNT(DISTINCT outer_user, user_id) as mau_cnt
    from
      bi_sycpb.zindex_user_behavior_i_d
    WHERE
      log_date BETWEEN '${yyyyMMdd,-29d}' and '${yyyyMMdd}'
    group by
      platform_type,
      user_role,
      outer_user
  ),
  date_range AS (
    SELECT
      '${yyyyMMdd,-1d}' AS raw_log_date
    UNION ALL
    SELECT
      '${yyyyMMdd}' AS raw_log_date
  ),
  distinct_users AS (
    SELECT DISTINCT
      user_role,
      outer_user
    FROM
      bi_sycpb.zindex_user_behavior_i_d
    WHERE
      log_date BETWEEN '${yyyyMMdd,-1d}' and '${yyyyMMdd}'
  ),
  full_combinations AS (
    SELECT
      du.user_role,
      du.outer_user,
      dr.raw_log_date
    FROM
      distinct_users du
      CROSS JOIN date_range dr
  ),
  t4 as (
    SELECT
      fc.user_role,
      fc.outer_user,
      CAST(
        REGEXP_REPLACE(
          fc.raw_log_date,
          '(\\d{4})(\\d{2})(\\d{2})',
          '$1-$2-$3'
        ) AS DATE
      ) AS log_date,
      COUNT(
        DISTINCT CASE
          WHEN t.user_info IS NOT NULL THEN t.user_info
        END
      ) AS user_cnt
    FROM
      full_combinations fc
      LEFT JOIN (
        SELECT
          user_role,
          outer_user,
          log_date,
          CONCAT(outer_user, '_', user_id) AS user_info
        FROM
          bi_sycpb.zindex_user_behavior_i_d
        WHERE
          log_date BETWEEN '${yyyyMMdd,-1d}' and '${yyyyMMdd}'
      ) t ON fc.raw_log_date = t.log_date
      AND fc.user_role = t.user_role
      AND fc.outer_user = t.outer_user
    GROUP BY
      fc.user_role,
      fc.outer_user,
      fc.raw_log_date
    ORDER BY
      fc.user_role,
      fc.outer_user,
      fc.raw_log_date
  ),
  t5 AS (
    SELECT
      c.user_role,
      c.outer_user,
      c.log_date,
      COALESCE(c.user_cnt - d.user_cnt, 0) as user_change
    FROM
      t4 c
      left join t4 d on DATEDIFF(c.log_date, d.log_date) = 1
      AND c.user_role = d.user_role
    WHERE
      d.log_date is not null
  ),
  t6 AS (
    select
      user_role,
      outer_user,
      cast(
        regexp_replace(
          '${yyyyMMdd}',
          '(\\d{4})(\\d{2})(\\d{2})',
          '$1-$2-$3'
        ) AS DATE
      ) AS log_date,
      COUNT(DISTINCT outer_user, user_id) as mau_cnt
    from
      bi_sycpb.zindex_user_behavior_i_d
    WHERE
      log_date BETWEEN '${yyyyMMdd,-29d}' and '${yyyyMMdd}'
    group by
      user_role,
      outer_user
  ),
  t7 AS (
    SELECT
      t4.user_role,
      t4.outer_user,
      t4.user_cnt,
      COALESCE(t5.user_change, 0) as user_change,
      COALESCE(t6.mau_cnt, 0) as user_mau
    FROM
      t4
      LEFT JOIN t5 ON t4.log_date = t5.log_date
      AND t4.user_role = t5.user_role
      LEFT JOIN t6 ON t4.log_date = t6.log_date
      AND t4.user_role = t6.user_role
    WHERE
      t6.log_date IS NOT NULL
  )
INSERT overwrite table
  bi_sycpb.zindex_user_behavior_change_i_d partition (log_date = '${yyyyMMdd}') (
    SELECT
      coalesce(t1.user_role, t2.user_role, t3.user_role) as user_role,
      coalesce(t1.outer_user, t2.outer_user, t3.outer_user) as outer_user,
      coalesce(t1.user_cnt, 0) as user_cnt,
      COALESCE(t2.user_change, 0) as user_change,
      COALESCE(t3.mau_cnt, 0) as user_mau,
      coalesce(
        t1.platform_type,
        t2.platform_type,
        t3.platform_type
      ) as platform_type
    FROM
      t1
      FULL OUTER JOIN t2 ON t1.log_date = t2.log_date
      AND t1.user_role = t2.user_role
      AND t1.platform_type = t2.platform_type
      FULL OUTER JOIN t3 ON t1.log_date = t3.log_date
      AND t1.user_role = t3.user_role
      AND t1.platform_type = t3.platform_type
    WHERE
      t3.log_date IS NOT NULL
  )
UNION ALL
(
  SELECT DISTINCT
    user_role,
    outer_user,
    user_cnt,
    user_change,
    user_mau,
    '整体' as platform_type
  FROM
    t7
)

全部评论

相关推荐

06-19 12:33
安徽大学 Java
点赞 评论 收藏
分享
有担当的灰太狼又在摸鱼:零帧起手查看图片
点赞 评论 收藏
分享
评论
点赞
收藏
分享

创作者周榜

更多
牛客网
牛客网在线编程
牛客网题解
牛客企业服务