1.案例一
原sql:
select count(case when a.id in (select id from b) then 1 esle 0) from a;
结果总共数据:727 耗时:
2020-12-28 17:38:31 INFO Cost time is: 568.197s
改造后:
select count(case when b.id is not null then 1 else 0) from a left join b on a.id=b.id;
结果总共数据:727 耗时:
2020-12-28 17:36:41 INFO Cost time is: 417.218s
2.案例二
hive distinct 和 group by
select count(distinct order_no) from order_snap; Stage-Stage-1: Map: 396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS Total MapReduce CPU Time Spent: 0 days 2 hours 11 minutes 55 seconds 670 msec OK _c0 763191489 Time taken: 1818.864 seconds, Fetched: 1 row(s) select count(t.order_no) from (select order_no from order_snap group by order_no) t; Stage-Stage-1: Map: 396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS Stage-Stage-2: Map: 177 Reduce: 1 Cumulative CPU: 280.22 sec HDFS Read: 472596 HDFS Write: 10 SUCCESS Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 16 seconds 920 msec OK _c0 763191489 Time taken: 244.192 seconds, Fetched: 1 row(s)
注意到为什么会有这个差异,Hadoop其实就是处理大数据的,Hive并不怕数据有多大,怕的就是数据倾斜,我们看看两者的输出信息:
# distinct Stage-Stage-1: Map: 396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS # group by Stage-Stage-1: Map: 396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS
SELECT COUNT(DISTINCT ip) FROM action__notice WHERE stat_date=20140516 AND stat_hour=2014051609;
ob启动后,只会使用1个reducer,在输入数据量非常大的情况下job运行非常慢。
可以使用group by 改写如下:
SELECT COUNT(ip) AS TOTAL FROM (SELECT ip FROM action__notice WHERE stat_date=20140516 AND stat_hour=2014051609 GROUP BY ip) tmp;