大数据 ODS&DWD&DIM-SQL分享 需求
思路一:等差数列
断2天、3天,嵌套太多
1.1 开窗,按照 id 分组,同时按照 dt 排序,求 Rank
-- linux 中空格不能用 tab 键 select id,dt,rank() over(partition by id order by dt) rk from tx;
1.2 将每行日期减去RK值,如果之前是连续的日期,则相减之后为相同日期
z: 等差
(x1+z)-(y1+z)=x1-y1
select id,dt,date_sub(dt,rk) flg from (select id,dt,rank() over(partition by id order by dt) rk from tx) t1;
断一天的数据,flag 变成了连续
1.3 计算绝对连续的天数
select id,flag,count(*) days from ( select id,dt,date_sub(dt,rk) flg from (select id,dt,rank() over(partition by id order by dt) rk from tx) t1; )t2 group by id,flag;
1.4 再计算连续问题
select id,flag,days,rank() over(partition by id order by flag) newFlag from t3;
1.5 将 flag 减去 newflag
select id,days,date_sub(flag,newFlag) flag from t4;t5
1.6 计算每个用户连续登录的天数,断一天也算
select id,sum(days)+count(1) days from t5 group by id,flag;[t6]
1.7 计算最大连续天数
select id,max(days) from t6 group by id;
准后再-1
思路二
2.1 将上一行数据下移
--下移默认值,一般给 1970-01-01,上移默认值一般 9999-01-01 select id,dt,lag(dt,1,'1970-01-01') over(partition by id order by dt) lagDt from tx; t1
2.2 将当前行日期减去下移的日期
select id,dt,datediff(dt,lagDt) dtDiff from t1; t2
执行
select id,dt,datediff(dt,lagDt) dtDiff from ( select id,dt,lag(dt,1,'1970-01-01') over(partition by id order by dt) lagDt from tx) t1;
每碰到一个 >2 的就分组 + 1
2.3 分组
select id,dt,sum(if(dtDiff>2,1,0)) over(partition by id order by dt) flag from t2; t3
select id,dt,sum(if(dtDiff>2,1,0)) over(partition by id order by dt) flag from ( select id,dt,datediff(dt,lagDt) dtDiff from ( select id,dt,lag(dt,1,'1970-01-01') over(partition by id order by dt) lagDt from tx) t1 ) t2;
select id,flag,datediff(max(dt),min(dt))+1 from t3 group by id,flag;
带入执行
--断3天把2改成3,断4天把2改成4 select id,flag,datediff(max(dt),min(dt))+1 from ( select id,dt,sum(if(dtDiff>2,1,0)) over(partition by id order by dt) flag from ( select id,dt,datediff(dt,lagDt) dtDiff from ( select id,dt,lag(dt,1,'1970-01-01') over(partition by id order by dt) lagDt from tx) t1 ) t2 )t3 group by id,flag;
2.3 求分组后的最大值
HiveOnSpark: 有个BUG, datediff over 子查询 => null point
解决方案:
- 换MR引擎
- 将时间字段由 String 类型改成 Date 类型