HQL Notes | BiliBili Internship - Z指数用户数据监控系统
前情提要:
所谓全埋点,就是将预先定义的某类事件(通常包括这四种事件类型:APP启动、APP关闭、页面访问、元素点击),通过SDK实现完全自动上报。
SPM码的全称是 super position model(超级位置模型),它通过五个字符串,将一个页面功能位置,唯一标识出来,下面是一个SPM码的格式:
SPM=SPMA.SPMB.SPMC.SPMD.SPME
SPMA 唯一标识一个站点
SPMB 唯一标识某站点的一个页面
SPMC 唯一标识某页面的一个区块
SPMD 唯一标识某区块的一个具体位置
SPME 随机生成的字串,跟时间有关系,在循环页面计算时可以区分点击的时序
Hive SQL开发:
DAG图:Z指数各模块日访问明细
埋点信息底表:b_dwd.dwd_flow_ubt_web_ad_event_l_hr
重要的字段:
event_id:埋点事件信息字段
buvid:设备号
bid:外网用户id
uid:内网用户id
url:co和com区分内外网
module:不同访问模块
bid_role:不同类型用户
能实现T+1的准确数据交付目标
第一次更新:
- 将小程序埋点信息数据加入,统一web和小程序的口径,区别:web端是通过不同埋点上报信息来区分不同模块访问行为,xcx端是通过module字段来区分;
- web埋点:rlike 999.173
- xcx埋点:999.194.0.0.pv
- 回刷历史数据
- 修改BI看板 加入小程序筛选项与数据类别
第二次更新:
- 发现原先外网未登录类型用户未计入MAU与DAU,我沟通后修改了口径,增加外网未登录,整套区分用户身份字段换新:coalesece(bid or uid, buvid):将设备号作为未登录用户id计入MAU统计,结果显示web与小程序端总共增加了700+MAU数量,占原先的60%,口径更新后数据增幅直接完成了team里Z指数产品姐姐的okr。
- 修补id字段数据缺失bug:有过登录的小程序用户,但是部分事件数据没有id数据,因为刚点进小程序首页在未登录状态被埋点系统上报了,后续有点进个人页面进行登录,因此显示同一设备号但不同类型用户:外网未登录/个人。修补做法,建立设备号与id单一映射关系,再进行join补全。
- 修补MAU数据缺失bug:某日某种角色0人使用平台,导致DAU为0,导致该角色MAU数据在当日缺失,我修改了能跑得对。(这段补丁GPT帮我写的,反正能跑通)
- 回刷历史数据
- 修改BI看板:发现整体UV与不同角色MAU分布中的总计数字无法对齐(有个位数的出入),完成数据明细探查后给到指标解释:部分用户在时间周期内存在角色变化的情况,因此在MAU分布的总计中可能在不同角色下被重复统计。
-- HQL任务开发必读:https://info.bilibili.co/pages/viewpage.action?pageId=231965315#id-%E6%95%B0%E6%8D%AE%E5%BC%80%E5%8F%91BestPractice-%E6%9C%80%E4%BD%B3HQL%E4%BD%BF%E7%94%A8%E5%A7%BF%E5%8A%BF -- Author: chenmingyang01 -- CreateTime: 2025-05-15 14:28:35 -- Description: 测试项目测试表 -- Copyright: bilibili.com -- Modify: 修改时间: 2024-05-15 14:28:35 -- 修改人: chenmingyang01 -- 修改内容: 创建 -- -- -- 平台变量简介(详细请点击本页面右上角): -- -- 常用天变量${yyyyMMdd},例如20210329 -- 常用小时变量${HH},例如02 -- 常见查询: where log_date='${yyyyMMdd}' and log_hour='${HH}' WITH -- 小程序数据 xcx_data AS ( SELECT buvid, event_id, extended_fields['user_name'] as user_name, extended_fields['first_level_module'] as first_level_module, extended_fields['second_level_module'] as second_level_module, extended_fields['bid'] as outer_id, extra['uid'] as inner_id, extended_fields['bid_role'] as bid_role, extended_fields['dept_name_seq'] as dept_name, extended_fields['org_name'] as org_name, extended_fields['report_name'] as report_name, url FROM b_dwd.dwd_flow_ubt_web_ad_event_l_hr WHERE log_date = '${yyyyMMdd}' AND event_id = '999.194.0.0.pv' AND extended_fields['first_level_module'] is not null ), -- 下面这个子表的作用也是将小程序用户中有过登录但是没有outer_id的数据行补全(可能是刚刚点进小程序没有登录,但是埋点已经上报了) buvid2bidplus AS ( SELECT DISTINCT buvid, user_name, outer_id, inner_id, bid_role, dept_name, org_name FROM xcx_data WHERE outer_id IS NOT NULL ), -- 小程序最终子表 xcx_data_new AS ( SELECT xcx_data.buvid, xcx_data.event_id, COALESCE(buvid2bidplus.user_name, xcx_data.user_name) AS user_name, xcx_data.first_level_module, xcx_data.second_level_module, COALESCE(buvid2bidplus.outer_id, xcx_data.buvid) AS outer_id, COALESCE(buvid2bidplus.inner_id, xcx_data.inner_id) AS inner_id, COALESCE(buvid2bidplus.bid_role, xcx_data.bid_role) AS bid_role, COALESCE(buvid2bidplus.dept_name, xcx_data.dept_name) AS dept_name, COALESCE(buvid2bidplus.org_name, xcx_data.org_name) AS org_name, xcx_data.report_name, xcx_data.url FROM xcx_data LEFT JOIN buvid2bidplus ON xcx_data.buvid = buvid2bidplus.buvid ), -- web端数据 web_data AS ( SELECT buvid, event_id, extended_fields['user_name'] as user_name, extended_fields['bid'] as outer_id, extended_fields['uid'] as inner_id, extended_fields['bid_role'] as bid_role, extended_fields['dept_name_seq'] as dept_name, extended_fields['org_name'] as org_name, extended_fields['report_name'] as report_name, url FROM b_dwd.dwd_flow_ubt_web_ad_event_l_hr WHERE log_date = '${yyyyMMdd}' AND event_id rlike '999.173' AND url is not null ), buvid2outeridplus AS ( SELECT DISTINCT buvid, user_name, outer_id, inner_id, bid_role, dept_name, org_name FROM web_data WHERE outer_id IS NOT NULL ), -- 这个表是用来将外部用户中有过登录但是没有outerid的数据行补全 web_data_middle AS ( SELECT web_data.buvid, web_data.event_id, COALESCE(web_data.user_name, buvid2outeridplus.user_name) AS user_name, COALESCE(web_data.outer_id, buvid2outeridplus.outer_id) outer_id, web_data.inner_id, COALESCE(web_data.bid_role, buvid2outeridplus.bid_role) AS bid_role, COALESCE(web_data.dept_name, buvid2outeridplus.dept_name) AS dept_name, COALESCE(web_data.org_name, buvid2outeridplus.org_name) AS org_name, web_data.report_name, web_data.url FROM web_data LEFT JOIN buvid2outeridplus ON web_data.buvid = buvid2outeridplus.buvid ), -- web端最终子表 web_data_new as ( SELECT buvid, event_id, user_name, CASE when (outer_id is null) and (inner_id is null) then buvid else outer_id end as outer_id, -- 将外网未登录类别中的buvid当做outer_id,外网未登录就是外部用户 inner_id, bid_role, dept_name, org_name, report_name, url FROM web_data_middle WHERE not ( outer_id is null AND not (url rlike 'com') AND inner_id is null ) -- 删掉那些内网用户的数据行中没有inner_id的数据行(脏数据) ) INSERT overwrite table bi_sycpb.zindex_user_behavior_i_d partition (log_date = '${yyyyMMdd}') ( SELECT case event_id when '999.173.0.0' THEN '整体' when '999.173.insight.detail.click' THEN '营销洞察报告下载' when '999.173.zindex.brandall.show' THEN '品牌综合榜' when '999.173.zindex.brandsearch.show' THEN '单品带货榜' when '999.173.zindex.content.show' THEN '内容指数' when '999.173.zindex.homepage.show' THEN '首页' when '999.173.zindex.insight.show' THEN '营销洞察' when '999.173.zindex.categorysearch.show' THEN '品牌搜索榜' ELSE '其他' END AS module, user_name, case when outer_id is NOT null THEN outer_id ELSE inner_id end AS bid, -- or we can: COALESCE(outer_id, inner_id) AS bid, case when bid_role = -1 THEN '个人' when bid_role = 2 or bid_role = 4 THEN '代理' when bid_role = 1 THEN '品牌方' ELSE if (url rlike 'com', '外网未登录', '内部用户') -- z指数内部用户登录的网址是co结尾,但是外部用户登录的网址是com结尾 end AS user_role, dept_name, org_name, report_name, if (url rlike 'com', 1, 0) as outer_user, 'web' as platform, case when outer_id is NOT null THEN outer_id ELSE inner_id end AS user_id FROM web_data_new ) -- 这个是web端的数据终表 union ALL ( select first_level_module as module, user_name, COALESCE(outer_id, inner_id) AS bid, case when bid_role = -1 THEN '个人' when bid_role = 2 or bid_role = 4 THEN '代理' when bid_role = 1 THEN '品牌方' ELSE '外网未登录' end AS user_role, dept_name, org_name, report_name, 1 as outer_user, '小程序' as platform, COALESCE(outer_id, inner_id) AS user_id FROM xcx_data_new ) -- 这个是小程序端的module分明细终表 union ALL ( select '整体' as module, user_name, COALESCE(outer_id, inner_id) AS bid, case when bid_role = -1 THEN '个人' when bid_role = 2 or bid_role = 4 THEN '代理' when bid_role = 1 THEN '品牌方' ELSE '外网未登录' end AS user_role, dept_name, org_name, report_name, 1 as outer_user, '小程序' as platform, COALESCE(outer_id, inner_id) AS user_id FROM xcx_data_new ) -- 这个是小程序端的module整体终表
with t1 as ( SELECT platform_type, user_role, outer_user, cast( regexp_replace(log_date, '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3') AS DATE ) AS log_date, COUNT(DISTINCT user_info) as user_cnt FROM ( select platform_type, user_role, outer_user, log_date, concat(outer_user, '_', user_id) as user_info from bi_sycpb.zindex_user_behavior_i_d WHERE log_date BETWEEN '${yyyyMMdd,-1d}' and '${yyyyMMdd}' ) group by platform_type, user_role, outer_user, log_date ), t2 AS ( SELECT c.platform_type, c.user_role, c.outer_user, c.log_date, COALESCE(c.user_cnt - d.user_cnt, 0) as user_change FROM t1 c left join t1 d on DATEDIFF(c.log_date, d.log_date) = 1 AND c.user_role = d.user_role AND c.platform_type = d.platform_type WHERE d.log_date is not null ), t3 AS ( select platform_type, user_role, outer_user, cast( regexp_replace( '${yyyyMMdd}', '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3' ) AS DATE ) AS log_date, COUNT(DISTINCT outer_user, user_id) as mau_cnt from bi_sycpb.zindex_user_behavior_i_d WHERE log_date BETWEEN '${yyyyMMdd,-29d}' and '${yyyyMMdd}' group by platform_type, user_role, outer_user ), date_range AS ( SELECT '${yyyyMMdd,-1d}' AS raw_log_date UNION ALL SELECT '${yyyyMMdd}' AS raw_log_date ), distinct_users AS ( SELECT DISTINCT user_role, outer_user FROM bi_sycpb.zindex_user_behavior_i_d WHERE log_date BETWEEN '${yyyyMMdd,-1d}' and '${yyyyMMdd}' ), full_combinations AS ( SELECT du.user_role, du.outer_user, dr.raw_log_date FROM distinct_users du CROSS JOIN date_range dr ), t4 as ( SELECT fc.user_role, fc.outer_user, CAST( REGEXP_REPLACE( fc.raw_log_date, '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3' ) AS DATE ) AS log_date, COUNT( DISTINCT CASE WHEN t.user_info IS NOT NULL THEN t.user_info END ) AS user_cnt FROM full_combinations fc LEFT JOIN ( SELECT user_role, outer_user, log_date, CONCAT(outer_user, '_', user_id) AS user_info FROM bi_sycpb.zindex_user_behavior_i_d WHERE log_date BETWEEN '${yyyyMMdd,-1d}' and '${yyyyMMdd}' ) t ON fc.raw_log_date = t.log_date AND fc.user_role = t.user_role AND fc.outer_user = t.outer_user GROUP BY fc.user_role, fc.outer_user, fc.raw_log_date ORDER BY fc.user_role, fc.outer_user, fc.raw_log_date ), t5 AS ( SELECT c.user_role, c.outer_user, c.log_date, COALESCE(c.user_cnt - d.user_cnt, 0) as user_change FROM t4 c left join t4 d on DATEDIFF(c.log_date, d.log_date) = 1 AND c.user_role = d.user_role WHERE d.log_date is not null ), t6 AS ( select user_role, outer_user, cast( regexp_replace( '${yyyyMMdd}', '(\\d{4})(\\d{2})(\\d{2})', '$1-$2-$3' ) AS DATE ) AS log_date, COUNT(DISTINCT outer_user, user_id) as mau_cnt from bi_sycpb.zindex_user_behavior_i_d WHERE log_date BETWEEN '${yyyyMMdd,-29d}' and '${yyyyMMdd}' group by user_role, outer_user ), t7 AS ( SELECT t4.user_role, t4.outer_user, t4.user_cnt, COALESCE(t5.user_change, 0) as user_change, COALESCE(t6.mau_cnt, 0) as user_mau FROM t4 LEFT JOIN t5 ON t4.log_date = t5.log_date AND t4.user_role = t5.user_role LEFT JOIN t6 ON t4.log_date = t6.log_date AND t4.user_role = t6.user_role WHERE t6.log_date IS NOT NULL ) INSERT overwrite table bi_sycpb.zindex_user_behavior_change_i_d partition (log_date = '${yyyyMMdd}') ( SELECT coalesce(t1.user_role, t2.user_role, t3.user_role) as user_role, coalesce(t1.outer_user, t2.outer_user, t3.outer_user) as outer_user, coalesce(t1.user_cnt, 0) as user_cnt, COALESCE(t2.user_change, 0) as user_change, COALESCE(t3.mau_cnt, 0) as user_mau, coalesce( t1.platform_type, t2.platform_type, t3.platform_type ) as platform_type FROM t1 FULL OUTER JOIN t2 ON t1.log_date = t2.log_date AND t1.user_role = t2.user_role AND t1.platform_type = t2.platform_type FULL OUTER JOIN t3 ON t1.log_date = t3.log_date AND t1.user_role = t3.user_role AND t1.platform_type = t3.platform_type WHERE t3.log_date IS NOT NULL ) UNION ALL ( SELECT DISTINCT user_role, outer_user, user_cnt, user_change, user_mau, '整体' as platform_type FROM t7 )