那么count distinct可以肆无忌惮的使用了吗?
接下来看另外一种场景,探查资产信息表中近5天每天的用户数,常见的写法与常规认为的优化写法:
--选择近5天的资产来看 --常见写法,count distinct写法 SELECT dt ,COUNT(DISTINCT cst_id) AS cst_cnt FROM @pc_bill_bal WHERE dt BETWEEN '${bizdate-5}' AND '${bizdate}' GROUP BY dt ; --优化写法 SELECT dt ,COUNT(cst_id) AS cst_cnt FROM ( SELECT dt ,cst_id FROM @pc_bill_bal WHERE dt BETWEEN '${bizdate-5}' AND '${bizdate}' GROUP BY dt ,cst_id )base GROUP BY dt ;
看一下这种场景下两种执行计划对比
常规写法(此处额外看一下分配的task):
优化写法:
从执行计划可以看出,直接count distinct的写法进行了一次去重,就将3亿条数据给到了5个task进行去重计算总和,每个task的压力相当大。再看优化写法,两次去重处理,两次计算总和,每一步都运行的很快,没有长尾。最后看一下运行时间和消耗资源,常规写法比优化写法慢了26倍,资源消耗多出2倍。
总结
ODPS对count distinct做了执行计划优化,但是限于从数据源只读取1个字段的情况下。当从数据源读取了多个字段时,应将count distinct写法改为group by count写法。
多张大表join提速(聚合类型)
场景介绍
在日常的开发工作中,经常会遇到多张表关联取属性的情况,比如计算用户在过去一段时间A、B、C...N行为的次数,或者是在资管领域中,统计一个资产池中的所有资产(日初资产+放款资产+买入资产)。
写法&执行计划探查
假设有3份数据需要关联得到属性,常规的写法为使用2次full outer join + coalesce来关联取值;或者先将3份数据主体合并在一起,再使用3次left join。
-- 举例为资产池得到每个用户的所有资产 -- 使用full outer join + coalesce的写法 SELECT COALESCE(tt1.cst_id, tt2.cst_id) as cst_id ,COALESCE(tt1.bal_init_prin, 0) AS bal_init_prin ,COALESCE(tt1.amt_retail_prin, 0) AS amt_retail_prin ,COALESCE(tt2.amt_buy_prin, 0) AS amt_buy_prin FROM ( SELECT COALESCE(t1.cst_id, t2.cst_id) as cst_id ,COALESCE(t1.bal_init_prin, 0) AS bal_init_prin ,COALESCE(t2.amt_retail_prin, 0) AS amt_retail_prin FROM @bal_init t1 -- 日初资产 FULL OUTER JOIN @amt_retail t2 -- 当天放款资产 ON t1.cst_id = t2.cst_id )tt1 FULL OUTER JOIN @amt_buy tt2 -- 当天买入资产 ON tt1.cst_id = tt2.cst_id ;
接下来看优化写法:
-- 写法一 SELECT cst_id ,SUM(bal_init_prin) as bal_init_prin ,SUM(amt_retail_prin) as amt_retail_prin ,SUM(amt_buy_prin) as amt_buy_prin FROM ( SELECT cst_id, bal_init_prin, 0 AS amt_retail_prin, 0 AS amt_buy_prin FROM @bal_init -- 日初资产 union ALL SELECT cst_id, 0 AS bal_init_prin, amt_retail_prin, 0 AS amt_buy_prin FROM @amt_retail -- 当天放款资产 UNION ALL SELECT cst_id, 0 AS bal_init_prin, 0 AS amt_retail_prin, amt_buy_prin FROM @amt_buy -- 当天买入资产 )t1 GROUP BY cst_id ; -- 优化写法二 SELECT cst_id ,SUM(IF(flag = 1, prin, 0)) as bal_init_prin ,SUM(IF(flag = 2, prin, 0)) as amt_retail_prin ,SUM(IF(flag = 3, prin, 0)) as amt_buy_prin FROM ( SELECT cst_id, bal_init_prin AS prin, 1 AS flag FROM @bal_init -- 日初资产 union ALL SELECT cst_id, amt_retail_prin AS prin, 2 AS flag FROM @amt_retail -- 当天放款资产 UNION ALL SELECT cst_id, amt_buy_prin AS prin, 3 AS flag FROM @amt_buy -- 当天买入资产 )t1 GROUP BY cst_id ;
对比join写法和优化写法的执行计划(这两个执行计划内部做的事情和任务名称理解一致,就不展开看了)