java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond


1. Task:

/datacenter/shell/user/full/bdl_social_test1.sh
While this job was running a full data load, the NameNode went down.

2. Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = yhbd01/10.171.82.67
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.0-cdh5.3.3
 
3. Fault analysis:
(1)
The journalnode process on yhbd02 was down.
When the NameNode rolled its edit log (as part of checkpointing the fsimage and edits files), it had to start the new log segment on the journalnode cluster, but port 8485 on yhbd02 was unreachable the whole time. The quorum call timed out and the NameNode shut itself down.
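A quick way to verify (1) (a sketch; run on the hosts named above):

# On yhbd02: is the journalnode process there at all?
ps -ef | grep -i journalnode
# Is anything listening on the journalnode RPC port?
sudo netstat -lnpt | grep 8485
# From yhbd01: is the port reachable over the network?
telnet yhbd02 8485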
(2)
However, after the NameNode failover completed and the whole JN cluster was healthy, the job was resumed, and the NameNode now active on yhbd02 went down as well.
The logs again pointed to a JN access timeout. That pattern is usually caused by a full GC: at that moment the namenode ran one long full GC,
the write to the journalnodes exceeded the timeout (default 20 s), and the namenode process exited. Hence the mapred-site.xml parameters need tuning.
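To confirm the full GC theory, GC logging can be enabled for the namenode JVM, e.g. in hadoop-env.sh (a sketch; the log path is an example):

# hadoop-env.sh -- GC logging for the namenode (JDK 7 flags)
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-hdfs/namenode-gc.log"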

Some JVM tuning leads, taken from the file the JVM dumped when it crashed:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version:  (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again


4. Solutions:
(1)
Start the journalnode on yhbd02:
service hadoop-hdfs-journalnode start
and start the namenode:
service hadoop-hdfs-namenode start
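Then verify the quorum is whole and the HA state is sane (a sketch; nn1/nn2 stand for whatever IDs dfs.ha.namenodes.<nameservice> defines in this cluster):

# are both daemons back?
ps -ef | grep -i journalnode
ps -ef | grep -i namenode
# HA state of each namenode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2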
(2)
Adjust the mapred-site.xml parameters:


<property>
        <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
        <value>0.13</value>
</property>
  Originally 0.25, now lowered to 0.13: this shrinks the shuffle phase's in-memory buffer and spills more to local disk.
 
Further digging showed the per-reducer data volume configured here was too large; comment this line out and re-run:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;
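Alternatively, instead of falling back to the default, the per-reducer volume can be capped explicitly (a sketch; 256 MB is an example figure, not a measured optimum):

-- smaller input per reducer => more reducers, less memory pressure each
set hive.exec.reducers.bytes.per.reducer=256000000;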

Tune the map and reduce memory parameters:
mapred-site.xml
mapreduce.map.memory.mb      // controls map task memory
mapreduce.reduce.memory.mb   // controls reduce task memory
(3)

Follow-up fixes:

1) Increase the journalnode write timeout:

dfs.qjournal.write-txns.timeout.ms

2) Tune the namenode's Java parameters so full GC triggers earlier; each full GC then collects less and finishes sooner.

3) By default the namenode's full GC uses the parallel collector, which is fully stop-the-world; switch it to CMS by adjusting the namenode startup parameters:

-XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
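These flags take effect once appended to the namenode's JVM options, typically through HADOOP_NAMENODE_OPTS in hadoop-env.sh (a sketch; the heap sizes are illustrative and must be sized to the cluster's metadata volume):

# hadoop-env.sh -- CMS collector for the namenode
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0"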


<property>
    <name>dfs.qjournal.start-segment.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.accept-recovery.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.finalize-segment.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.select-input-streams.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.get-journal-state.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.new-epoch.timeout.ms</name>
    <value>600000000</value>
</property>

<property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>600000000</value>
</property>

Too many NameNode full GCs, each pausing the system for 3-4 minutes

  Because the namenode keeps its metadata in memory, pressure on the old generation is heavy, and every full GC froze the system for minutes. Fixes:
  (1) Use the CMS GC algorithm.
  (2) Set the young:old generation ratio to 1:2. At 1:4, large objects promoted during young GC land straight in the old generation, which then fills up quickly and makes full GCs even more frequent.
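In HotSpot flags, the 1:2 young:old split is -XX:NewRatio=2 (old generation twice the young generation); a sketch, with example heap sizes:

-Xms8g -Xmx8g -XX:NewRatio=2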



5. Digging deeper:
  In hadoop, edits and fsimage are two critical files: edits records every namespace change since the latest checkpoint, acting as a journal, while fsimage (the HDFS metadata image file) holds the state as of the latest checkpoint. Neither can be read with an ordinary text editor, but fortunately hadoop ships dedicated tools for inspecting them, oev and oiv, both invoked via the hdfs command.
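For example (a sketch; the edits file name is taken from the log above, the fsimage name is illustrative):

# dump an edits segment to XML with the offline edits viewer
hdfs oev -i edits_0000000000029706041-0000000000029706137 -o edits.xml
# dump an fsimage to XML with the offline image viewer
hdfs oiv -p XML -i fsimage_0000000000029706041 -o fsimage.xml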

With these tools one can study how hadoop2 implements HA through edits and fsimage.

The steps can be summarized as follows:
  (1) With HA configured, every namespace update from clients is written to the JournalNodes' shared directory, controlled by the following configuration:


<property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://XXXX/mycluster</value>
</property>

<property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/export1/hadoop2x/dfs/journal</value>
</property>


  (2) The Active and Standby NameNodes both sync edits from the JournalNodes' shared edits directory into their own edits directories.
  (3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met; when they are, it merges the fsimage and edits files.
  (4) After the merge, the StandbyCheckpointer uploads the new fsimage to the corresponding directory on the Active NameNode.
  (5) On receiving the latest fsimage, the Active NameNode cleans up its old fsimage and edits files.
  (6) Through these steps the fsimage/edits merge completes, and thanks to the HA mechanism both the Standby and the Active NameNode hold up-to-date fsimage and edits files (whereas in Hadoop 1.x the SecondaryNameNode's fsimage and edits were never current).
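How often step (3) fires is governed by the checkpoint settings in hdfs-site.xml (the values shown are the stock defaults, not this cluster's tuned values):

<property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
</property>
<property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
</property>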
 
With the Active Namenode now on yhbd02 and the active Resource Manager still on yhbd01, the Namenode and the Resource Manager run on separate machines, and performance is noticeably better than before, since both processes need a lot of resources.
For example, submitted jobs get more sensible resource allocation, and containers are spread more evenly across the cluster.
 
 
 
The tuning work begins:
1: Modified mapred-site.xml
2: Modified hadoop-env.sh

For the full GC problem, the approach was:
split the user's full-load statement into per-year/per-month chunks (see the sketch below), and
lengthen the journal node timeout.
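A minimal sketch of the per-month split, assuming bdl_social_test1.sh can take a YYYY-MM argument (the real script's interface is not shown here):

#!/bin/bash
# run the full load one month at a time, so each run writes far less
# and the journal/GC pressure stays bounded
for month in 2015-{01..08}; do
    /datacenter/shell/user/full/bdl_social_test1.sh "$month" || exit 1
done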

Increase the journalnode write timeout:

dfs.qjournal.write-txns.timeout.ms
After the change, restart the journalnode cluster.
The namenode then no longer died because it could not reach the journalnode cluster; instead it failed with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to namenode2 it ran stably.
JVM tuning continued from there.


The errors after the journalnode timeouts were raised:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception

 
Tracking down the cause:
step 1:
 org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
org.apache.hadoop.hdfs.server.blockmanagement
Description:
Scans the namesystem, scheduling blocks to be cached as appropriate.
The CacheReplicationMonitor does a full scan when the NameNode first starts up, and at configurable intervals afterwards.
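The 30000 ms rescan interval is configurable (a sketch; the property name and the 30000 ms default are as in stock Hadoop 2.x, treat them as an assumption for this CDH build):

<property>
    <name>dfs.namenode.path.based.cache.refresh.interval.ms</name>
    <value>30000</value>
</property>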

step 2: the errors begin. Note the timeline: the rescan that started at 00:15:04 did not report until 00:16:28 ("Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s)"), so the namenode was unresponsive for roughly 85 seconds, consistent with a long stop-the-world full GC, after which all the queued IPC responses failed:
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error


After the yarn-site.xml and mapred-site.xml parameters below were changed, the Namenode stopped going down.


Changes:


yarn-site.xml
        <property>
                <description>The minimum allocation for every container request at the RM,
                in MBs. Memory requests lower than this won't take effect,
                and the specified value will get allocated at minimum.</description>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>1500</value>
        </property>

        <property>
                <description>The maximum allocation for every container request at the RM,
                in MBs. Memory requests higher than this won't take effect,
                and will get capped to this value.</description>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>8192</value>
        </property>

mapred-site.xml
        <property>
                <name>mapreduce.map.memory.mb</name>
                <value>1500</value>
        </property>

        <property>
                <name>mapreduce.map.java.opts</name>
                <value>-Xmx1024M</value>
                <description>Larger heap-size for child jvms of maps.</description>
        </property>

        <property>
                <name>mapreduce.reduce.memory.mb</name>
                <value>2100</value>
        </property>

        <property>
                <name>mapreduce.reduce.java.opts</name>
                <value>-Xmx1900M</value>
                <description>Larger heap-size for child jvms of reduces.</description>
        </property>

        <property>
                <name>mapreduce.job.counters.limit</name>
                <value>12000</value>
        </property>

To sum up:
when this situation occurs, attack it from three directions:
(1) journalnode timeouts;
(2) JVM parameter tuning;
(3) yarn-site.xml and mapred-site.xml configuration tuning.

 
