连续区间问题在业务分析中也很常见:连续N天登陆、车辆持续运行时长、系统状态报告连续日期等等等,接下来以真实需求的简化版进行分析
案例:某线上作业每天都会运行一次,且独立于其他的任务,运行结果会保留到一张表中run_status_result_1d;运行状态只有两个值:failed/succeeded,数据如下:
run_date |
status |
... |
... |
2021-08-01 |
succeeded |
2021-08-02 |
succeeded |
2021-08-03 |
succeeded |
2021-08-04 |
succeeded |
2021-08-05 |
succeeded |
2021-08-06 |
failed |
2021-08-07 |
failed |
2021-08-08 |
failed |
2021-08-09 |
succeeded |
2021-08-10 |
succeeded |
2021-08-11 |
succeeded |
... |
... |
需求:作业运行状态的连续日期 结果数据展示如下:
status |
start_date |
end_date |
succeeded |
2021-08-01 |
2021-08-05 |
failed |
2021-08-06 |
2021-08-07 |
... |
... |
... |
分析:利用sum(if)组合进行分析
- 生成一些数据
with tb1 as(select run_date, status fromvalues('2021-08-01','succeeded'),('2021-08-02','succeeded'),('2021-08-03','succeeded'),('2021-08-04','succeeded'),('2021-08-05','succeeded'),('2021-08-06','failed'),('2021-08-07','failed'),('2021-08-08','failed'),('2021-08-09','succeeded'),('2021-08-10','succeeded'),('2021-08-11','succeeded'),('2021-08-12','succeeded') t(run_date,status))
- 利用lag函数拿到上一条运行状态
tb2 as(select run_date, status, lag(status,1,status) over(orderby run_date) lag_status from tb1 )
- 利用当前数据状态与上一条状态的异同进行分分组
tb3 as(select run_date, status, lag_status, sum(if(status!=lag_status,1,0)) over(orderby run_date) flag from tb2 )--数据分组效果如下:run_date status lag_status flag 2021-08-01 succeeded succeeded 02021-08-02 succeeded succeeded 02021-08-03 succeeded succeeded 02021-08-04 succeeded succeeded 02021-08-05 succeeded succeeded 02021-08-06 failed succeeded 12021-08-07 failed failed 12021-08-08 failed failed 12021-08-09 succeeded failed 22021-08-10 succeeded succeeded 22021-08-11 succeeded succeeded 22021-08-12 succeeded succeeded 2
- 对上述数据按flag求最大时间和最小时间即为结果
select min(status)as status, min(run_date)as start_date, max(run_date)as end_date from tb3 groupby flag;--结果如下:status start_date end_date succeeded 2021-08-012021-08-05failed 2021-08-062021-08-08succeeded 2021-08-092021-08-12
对于案例中使用sum(if)组合,在业务需求分析中经常使用,也可以简化很多代码,还未使用的小伙伴,可以尝试使用下
拜了个拜