Hive ERROR: Out of memory due to hash maps used in map-side aggregation

Overview:

When Hive runs aggregation queries over large data sets, the following OOM error often appears. The exact error message is:

 

Possible error: Out of memory due to hash maps used in map-side aggregation. Solution: Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;'
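
As the message suggests, the hash-map memory fraction can be lowered for the session before re-running the query. A minimal sketch (the GROUP BY query and table name are placeholders, not from the original post):

set hive.map.aggr.hash.percentmemory = 0.25;   -- let the map-side hash map use at most 25% of the mapper heap
select k, count(1) from some_table group by k; -- placeholder for the failing query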

The failure information of the failed task shows:

 

Error: GC overhead limit exceeded

 

This error usually arises in one of two situations: (1) the Hive SQL is written badly, so the hash map built during execution grows too large; (2) the Hive SQL has no room left for optimization (the only way to get the required data is to write the query this way).

For (1), rewrite the SQL to shrink the hash map. For (2), tune the relevant parameters.

The two cases are discussed in turn below:

(1) Rewrite the SQL

 

select count(distinct v) from tbl; can be rewritten as select count(1) from (select v from tbl group by v) t;

 

Explanation: this reduces the number of keys in the hash map.
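
To compare the two execution plans yourself, EXPLAIN can be run on both forms; a minimal sketch, assuming the same table tbl and column v as above:

explain select count(distinct v) from tbl;
explain select count(1) from (select v from tbl group by v) t;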

 

select collect_set(messageDate)[0], count(*) from incidents_hive group by substr(messageDate,9,2); can be rewritten as select hourNum, count(1) from (select substr(messageDate,9,2) as hourNum from incidents_hive) t group by hourNum;

 

Explanation: this does not reduce the number of keys in the hash map, but it does shrink each value: collect_set accumulates every distinct messageDate for a group inside the map-side hash map, whereas the rewrite only keeps the hour string and a counter.
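
If one representative messageDate per hour is still needed (which is what collect_set(messageDate)[0] returned), a cheaper variant is to aggregate it with min, so the map-side hash map holds a single value per group instead of a whole set. This is a sketch of an assumed alternative, not part of the original post:

select hourNum, min(messageDate) as sampleDate, count(1)
from (select substr(messageDate,9,2) as hourNum, messageDate from incidents_hive) t
group by hourNum;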

(2) Adjust parameters

The following SQL statement cannot be rewritten to shrink the hash map (the keywords column has very low duplication, so the in-memory Map object maintained during the map phase becomes enormous):

 

 

INSERT OVERWRITE TABLE hbase_table_poi_keywords_count SELECT concat(substr(key,0,8), svccode, keywords), substr(key,0,8), svccode, keywords, count(*) where substr(key,0,8)=\"$yesterday\" AND length(keywords)>0 AND svccode is not null GROUP BY substr(key,0,8),svccode,keywords;
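
To confirm that keywords really has low duplication (the reason the hash map blows up here), the number of distinct keywords for the day can be checked first. A minimal sketch, using source_table as a placeholder for the table the statement reads from (its FROM clause is not shown above) and '$yesterday' as the date filter:

select count(1)
from (select keywords from source_table
      where substr(key,0,8) = '$yesterday' and length(keywords) > 0 and svccode is not null
      group by keywords) t;   -- distinct keyword count, using the rewrite pattern from (1)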

 

The tuning parameters related to map join and map-side aggregation are:

hive.map.aggr

hive.groupby.mapaggr.checkinterval

hive.map.aggr.hash.min.reduction

hive.map.aggr.hash.percentmemory

hive.groupby.skewindata

These parameters can be tuned by consulting the configuration file descriptions and the documentation. If the requirement really cannot be met by adjusting them, then set hive.map.aggr=false is the final fallback: it will always satisfy the requirement, only running slower than map join and map-side aggregation, although when you actually run the job you may well find it is not that slow.
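
A minimal sketch of a session that tunes these parameters before running the problematic GROUP BY; the values are illustrative starting points, not recommendations from the original post:

set hive.map.aggr = true;                          -- keep map-side aggregation enabled
set hive.groupby.mapaggr.checkinterval = 100000;   -- rows to process before checking whether the hash map is effective
set hive.map.aggr.hash.min.reduction = 0.5;        -- give up on map-side aggregation if it is not shrinking the input enough
set hive.map.aggr.hash.percentmemory = 0.25;       -- fraction of the mapper heap the hash map may occupy
set hive.groupby.skewindata = true;                -- handle skewed group-by keys with an extra MR job
-- last resort, as described above:
set hive.map.aggr = false;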

 

References:

http://blog.csdn.net/macyang/article/details/9260777
http://www.myexception.cn/open-source/1487747.html
http://blog.csdn.net/lixucpf/article/details/20458617

 

 

 

Reposted from http://blog.csdn.net/xyls12345/article/details/25418671




This article was reposted from the 茄子_2008 cnblogs blog. Original link: http://www.cnblogs.com/xd502djj/p/3852921.html. Please contact the original author before reposting.

