题解2021年11月每天新用户的次日留存率 为何不先过滤?
2021年11月每天新用户的次日留存率
https://www.nowcoder.com/practice/1fc0e75f07434ef5ba4f1fb2aa83a450
1.错误解法
select dt,round(avg(if(timestampdiff(day,in_time,lead_time)=1,1,0)),2) from( select uid, date_format(in_time,'%Y-%m-%d') dt, in_time,out_time, lead(in_time) over(partition by uid order by in_time) lead_time,row_number() over(partition by uid order by in_time) rk from tb_user_log where date_format(in_time,'%Y-%m') = '2021-11' )t1 where rk = 1 group by dt order by dt
2.正确解法
select dt,round(sum(if(timestampdiff(day,dt,lead_time)=1,1,0))/count(*),2) from( select *, row_number() over(partition by uid order by dt) rk, lead(dt,1) over(partition by uid order by dt) lead_time from( select uid,date(in_time) as dt from tb_user_log union select uid,date(out_time) as dt from tb_user_log )t1 )t2 where rk = 1 group by dt having date_format(dt,'%Y-%m') = '2021-11' order by dt
3.为何如此
这里引用Cole4Youreyez在题解中的原话:
“该题有一个易错点,那就是题目中第一句话就说明了在十一月份的,可能做题人会考虑在建立表t时就直接where语句选出11月的,这样是错误的。 原因在于:在建立表t是where会先运行,也就是选择了11月的信息再进行排序。那么如果一个用户10月活跃过,在11月中,11月2号和3号活跃了,这样的操作会使得该用户11-02的排名rk=1,也会被我们认为是新用户,但是显然并非如此。”
使用having就能避免先where过滤掉的情况,即该用户在10月活跃过,时间来到11月,他不是新用户。
4.解体解答
4.1 首先求出所有的用户的情况
(因为题目要求如果in_time-进入时间和out_time-离开时间跨天了,在两天里都记为该用户活跃过)这一点比较棘手,需要用union过滤,
select uid,date(in_time) as dt from tb_user_log union select uid,date(out_time) as dt from tb_user_log
4.2 进行排序,并且用lead函数提出下一次用户出现的时间
select *, row_number() over(partition by uid order by dt) rk, lead(dt,1) over(partition by uid order by dt) lead_time from( select uid,date(in_time) as dt from tb_user_log union select uid,date(out_time) as dt from tb_user_log )t1
4.3 求得留存率
select dt,round(sum(if(timestampdiff(day,dt,lead_time)=1,1,0))/count(*),2) from( select *, row_number() over(partition by uid order by dt) rk, lead(dt,1) over(partition by uid order by dt) lead_time from( select uid,date(in_time) as dt from tb_user_log union select uid,date(out_time) as dt from tb_user_log )t1 )t2 where rk = 1 group by dt having date_format(dt,'%Y-%m') = '2021-11' order by dt
rk为1证明第一次出现,having在最后使用避免出现用户非新用户的情况