十七、时间序列--构造累积日期
表名:t17
表字段及内容:
date_id 2017-08-01 2017-08-02 2017-08-03
问题一:每一日期,都扩展成月初至当天
输出结果如下所示:
date_id date_to_day 2017-08-01 2017-08-01 2017-08-02 2017-08-01 2017-08-02 2017-08-02 2017-08-03 2017-08-01 2017-08-03 2017-08-02 2017-08-03 2017-08-03
这种累积相关的表,常做桥接表。
参考答案:
select date_id, date_add(date_start_id,pos) as date_to_day from ( select date_id, date_sub(date_id,dayofmonth(date_id)-1) as date_start_id from t17 ) m lateral view posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;
十八、时间序列--构造连续日期
表名:t18
表字段及内容:
a b c 101 2018-01-01 10 101 2018-01-03 20 101 2018-01-06 40 102 2018-01-02 20 102 2018-01-04 30 102 2018-01-07 60
问题一:构造连续日期
问题描述:将表中数据的b字段扩充至范围[2018-01-01, 2018-01-07],并累积对c求和。
b字段的值是较稀疏的。
输出结果如下所示:
a b c d 101 2018-01-01 10 10 101 2018-01-02 0 10 101 2018-01-03 20 30 101 2018-01-04 0 30 101 2018-01-05 0 30 101 2018-01-06 40 70 101 2018-01-07 0 70 102 2018-01-01 0 0 102 2018-01-02 20 20 102 2018-01-03 0 20 102 2018-01-04 30 50 102 2018-01-05 0 50 102 2018-01-06 0 50 102 2018-01-07 60 110
参考答案:
select a, b, c, sum(c) over(partition by a order by b) as d from ( select t1.a, t1.b, case when t18.b is not null then t18.c else 0 end as c from ( select a, date_add(s,pos) as b from ( select a, '2018-01-01' as s, '2018-01-07' as r from (select a from t18 group by a) ta ) m lateral view posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val ) t1 left join t18 on t1.a = t18.a and t1.b = t18.b ) ts;
十九、时间序列--取多个字段最新的值
表名:t19
表字段及内容:
date_id a b c 2014 AB 12 bc 2015 23 2016 d 2017 BC
问题一:如何一并取出最新日期
输出结果如下所示:
date_a a date_b b date_c c 2017 BC 2015 23 2016 d
参考答案:
此处给出三种解法,其一:
SELECT max(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a ,max(CASE WHEN rn_a = 1 THEN a else null END) AS a ,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b ,max(CASE WHEN rn_b = 1 THEN b else NULL END) AS b ,max(CASE WHEN rn_c = 1 THEN date_id else 0 END) AS date_c ,max(CASE WHEN rn_c = 1 THEN c else null END) AS c FROM ( SELECT date_id ,a ,b ,c --对每列上不为null的值 的 日期 进行排序 ,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_c FROM t19 ) t WHERE t.rn_a = 1 OR t.rn_b = 1 OR t.rn_c = 1;
其二:
SELECT a.date_id ,a.a ,b.date_id ,b.b ,c.date_id ,c.c FROM ( SELECT t.date_id, t.a FROM ( SELECT t.date_id ,t.a ,t.b ,t.c FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL ) t ORDER BY t.date_id DESC LIMIT 1 ) a LEFT JOIN ( SELECT t.date_id ,t.b FROM ( SELECT t.date_id ,t.b FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL ) t ORDER BY t.date_id DESC LIMIT 1 ) b ON 1 = 1 LEFT JOIN ( SELECT t.date_id ,t.c FROM ( SELECT t.date_id ,t.c FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL ) t ORDER BY t.date_id DESC LIMIT 1 ) c ON 1 = 1;
其三:
select * from ( select t1.date_id as date_a,t1.a from (select t1.date_id,t1.a from t19 t1 where t1.a is not null) t1 inner join (select max(t1.date_id) as date_id from t19 t1 where t1.a is not null) t2 on t1.date_id=t2.date_id ) t1 cross join ( select t1.date_b,t1.b from (select t1.date_id as date_b,t1.b from t19 t1 where t1.b is not null) t1 inner join (select max(t1.date_id) as date_id from t19 t1 where t1.b is not null)t2 on t1.date_b=t2.date_id ) t2 cross join ( select t1.date_c,t1.c from (select t1.date_id as date_c,t1.c from t19 t1 where t1.c is not null) t1 inner join (select max(t1.date_id) as date_id from t19 t1 where t1.c is not null)t2 on t1.date_c=t2.date_id ) t3;
二十、时间序列--补全数据
表名:t20
表字段及内容:
date_id a b c 2014 AB 12 bc 2015 23 2016 d 2017 BC
问题一:如何使用最新数据补全表格
输出结果如下所示:
date_id a b c 2014 AB 12 bc 2015 AB 23 bc 2016 AB 23 d 2017 BC 23 d
参考答案:
select date_id, first_value(a) over(partition by aa order by date_id) as a, first_value(b) over(partition by bb order by date_id) as b, first_value(c) over(partition by cc order by date_id) as c from ( select date_id, a, b, c, count(a) over(order by date_id) as aa, count(b) over(order by date_id) as bb, count(c) over(order by date_id) as cc from t20 )tmp1;
二十一、时间序列--取最新完成状态的前一个状态
表名:t21
表字段及内容:
date_id a b 2014 1 A 2015 1 B 2016 1 A 2017 1 B 2013 2 A 2014 2 B 2015 2 A 2014 3 A 2015 3 A 2016 3 B 2017 3 A
上表中B为完成状态。
问题一:取最新完成状态的前一个状态
输出结果如下所示:
date_id a b 2016 1 A 2013 2 A 2015 3 A
参考答案:
此处给出两种解法,其一:
select t21.date_id, t21.a, t21.b from ( select max(date_id) date_id, a from t21 where b = 'B' group by a ) t1 inner join t21 on t1.date_id -1 = t21.date_id and t1.a = t21.a;
其二:
select next_date_id as date_id ,a ,next_b as b from( select *,min(nk) over(partition by a,b) as minb from( select *,row_number() over(partition by a order by date_id desc) nk ,lead(date_id) over(partition by a order by date_id desc) next_date_id ,lead(b) over(partition by a order by date_id desc) next_b from( select * from t21 ) t ) t ) t where minb = nk and b = 'B';
问题二:如何将完成状态的过程合并
输出结果如下所示:
a b_merge 1 A、B、A、B 2 A、B 3 A、A、B
参考答案:
select a ,collect_list(b) as b from( select * ,min(if(b = 'B',nk,null)) over(partition by a) as minb from( select *,row_number() over(partition by a order by date_id desc) nk from( select * from t21 ) t ) t ) t where nk >= minb group by a;
二十二、非等值连接--范围匹配
表f是事实表,表d是匹配表,在hive中如何将匹配表中的值关联到事实表中?
表d相当于拉链过的变化维,但日期范围可能是不全的。
表f:
date_id p_id 2017 C 2018 B 2019 A 2013 C
表d:
d_start d_end p_id p_value 2016 2018 A 1 2016 2018 B 2 2008 2009 C 4 2010 2015 C 3
问题一:范围匹配
输出结果如下所示:
date_id p_id p_value 2017 C null 2018 B 2 2019 A null 2013 C 3
**参考答案:
此处给出两种解法,其一:
select f.date_id, f.p_id, A.p_value from f left join ( select date_id, p_id, p_value from ( select f.date_id, f.p_id, d.p_value from f left join d on f.p_id = d.p_id where f.date_id >= d.d_start and f.date_id <= d.d_end )A )A ON f.date_id = A.date_id;
其二:
select date_id, p_id, flag as p_value from ( select f.date_id, f.p_id, d.d_start, d.d_end, d.p_value, if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag, max(d.d_end) over(partition by date_id) max_end from f left join d on f.p_id = d.p_id ) tmp where d_end = max_end;
二十三、非等值连接--最近匹配
表t23_1和表t23_2通过a和b关联时,有相等的取相等的值匹配,不相等时每一个a的值在b中找差值最小的来匹配。
t23_1和t23_2为两个班的成绩单,t23_1班的每个学生成绩在t23_2班中找出成绩最接近的成绩。
表t23_1:a中无重复值
a 1 2 4 5 8 10
表t23_2:b中无重复值
b 2 3 7 11 13
问题一:单向最近匹配
输出结果如下所示:
注意:b的值可能会被丢弃
a b 1 2 2 2 4 3 5 3 5 7 8 7 10 11
参考答案:
select * from ( select ttt1.a, ttt1.b from ( select tt1.a, t23_2.b, dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as dr from ( select t23_1.a from t23_1 left join t23_2 on t23_1.a=t23_2.b where t23_2.b is null ) tt1 cross join t23_2 ) ttt1 where ttt1.dr=1 union all select t23_1.a, t23_2.b from t23_1 inner join t23_2 on t23_1.a=t23_2.b ) result_t order by result_t.a;
二十四、N指标--累计去重
假设表A为事件流水表,客户当天有一条记录则视为当天活跃。
表A:
time_id user_id 2018-01-01 10:00:00 001 2018-01-01 11:03:00 002 2018-01-01 13:18:00 001 2018-01-02 08:34:00 004 2018-01-02 10:08:00 002 2018-01-02 10:40:00 003 2018-01-02 14:21:00 002 2018-01-02 15:39:00 004 2018-01-03 08:34:00 005 2018-01-03 10:08:00 003 2018-01-03 10:40:00 001 2018-01-03 14:21:00 005
假设客户活跃非常,一天产生的事件记录平均达千条。
问题一:累计去重
输出结果如下所示:
日期 当日活跃人数 月累计活跃人数_截至当日 date_id user_cnt_act user_cnt_act_month 2018-01-01 2 2 2018-01-02 3 4 2018-01-03 3 5
参考答案:
SELECT tt1.date_id ,tt2.user_cnt_act ,tt1.user_cnt_act_month FROM ( -- ④ 按照t.date_id分组求出user_cnt_act_month,得到tt1 SELECT t.date_id ,COUNT(user_id) AS user_cnt_act_month FROM ( -- ③ 表a和表b进行笛卡尔积,按照a.date_id,b.user_id分组,保证截止到当日的用户唯一,得出表t。 SELECT a.date_id ,b.user_id FROM ( -- ① 按照日期分组,取出date_id字段当主表的维度字段 得出表a SELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id FROM test.temp_tanhaidi_20211213_1 GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') ) a INNER JOIN ( -- ② 按照date_id、user_id分组,保证每天每个用户只有一条记录,得出表b SELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id ,user_id FROM test.temp_tanhaidi_20211213_1 GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') ,user_id ) b ON 1 = 1 WHERE a.date_id >= b.date_id GROUP BY a.date_id ,b.user_id ) t GROUP BY t.date_id ) tt1 LEFT JOIN ( -- ⑥ 按照date_id分组求出user_cnt_act,得到tt2 SELECT date_id ,COUNT(user_id) AS user_cnt_act FROM ( -- ⑤ 按照日期分组,取出date_id字段当主表的维度字段 得出表a SELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id ,user_id FROM test.temp_tanhaidi_20211213_1 GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') ,user_id ) a GROUP BY date_id ) tt2 ON tt2.date_id = tt1.date_id SQL 复制 全屏
参考链接: