一、背景:
最近由于要回刷数据
调优前:
map数:30000 单个map 运行7-8分钟
reduce数:50 单个reduce 运行了20h 还没完成,还经常失败
整体耗时20多个小时还没有完成并且失败了,明显数据倾斜reduce 某个节点跑很久出不来
调优后:
map数:30000 单个map 运行7-8分钟
reduce数:1000 单个reduce 运行了1h左右
原始sql
set mapreduce.input.fileinputformat.split.minsize=4096000000; set mapreduce.input.fileinputformat.split.maxsize=4096000000; set mapred.reduce.tasks=1000; set hive.exec.reducers.max=1000; SELECT user_id, county FROM ( SELECT u as user_id, bi_warehouse_dwd.ip_parse(ip).gb_id as county, row_number() over(partition by u order by dm desc) as rank FROM udw.tmp_play WHERE dt >= '2022-03-14' and bi_warehouse_dwd.ip_parse(ip).gb_id is not null ) a WHERE rank = 1
优化后sql
set mapreduce.input.fileinputformat.split.minsize=16384000000; set mapreduce.input.fileinputformat.split.maxsize=16384000000; set mapred.reduce.tasks=1000; set hive.exec.reducers.max=1000; set mapreduce.reduce.shuffle.memory.limit.percent=0.1; set mapreduce.reduce.shuffle.input.buffer.percent=0.2; select a.u,a.county from (SELECT u, dm, bi_warehouse_dwd.ip_parse(ip).gb_id as county FROM udw.tmp_play WHERE dt >= '2022-03-14' and bi_warehouse_dwd.ip_parse(ip).gb_id is not null) a join ( SELECT u, max(dm) as dm FROM udw.tmp_play WHERE dt >= '2022-03-14' group by u) b on a.u=b.u and a.dm=b.dm;
参考: