1. Task:
/datacenter/shell/user/full/bdl_social_test1.sh
While running the full-volume data job, the NameNode went down.
2. Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = yhbd01/10.171.82.67
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.5.0-cdh5.3.3
3. Fault analysis:
(1)
The JournalNode process on yhbd02 was not running.
When the NameNode merges the fsimage and edits files, it has to commit the result to the JournalNode cluster; at that point port 8485 on yhbd02 was unreachable the whole time, so the NameNode's calls to the JournalNodes timed out and the NameNode shut down.
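A quick way to confirm this state (a minimal sketch; the service name follows the CDH packaging used elsewhere in this post and may differ in other deployments):

# On yhbd02: is the JournalNode JVM alive, and is port 8485 listening?
jps | grep JournalNode
sudo service hadoop-hdfs-journalnode status
sudo netstat -tlnp | grep 8485
# From the NameNode host (yhbd01): can the JournalNode RPC port be reached?
telnet yhbd02 8485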
(2)
However, after the NameNode failover completed and the whole JournalNode cluster was healthy again, the job kept running and the NameNode on yhbd02 went down as well.
Checking the log, it was again a JournalNode access timeout. This kind of problem is usually caused by a Full GC: at that point the NameNode ran a fairly long full GC, the write to the JournalNodes exceeded the timeout (20 s by default), and the NameNode process exited. So the mapred-site.xml parameters needed tuning.
Some ideas for tuning the JVM, based on what the JVM dump file shows:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version: (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
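Following the suggestions in the dump itself, the first things worth checking on the host are physical RAM/swap headroom and the core-dump limit (a minimal sketch using standard Linux commands):

# How much physical RAM and swap is left on the host?
free -m
# Current core-dump limit; 0 means core dumps are disabled, as the dump reports
ulimit -c
# Enable core dumps before starting Java again, as the dump suggests
ulimit -c unlimited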
4. Solution:
(1)
Start the JournalNode on yhbd02:
service hadoop-hdfs-journalnode start
and start the NameNode:
service hadoop-hdfs-namenode start
(2)
Adjust the mapred-site.xml parameters:
<property>
  <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
  <value>0.13</value>
</property>
The value was originally 0.25 and is now 0.13, which reduces memory usage during the shuffle phase and relies more on local disk.
Investigation also showed that the amount of data assigned to each reducer was set too large, so these lines were commented out before re-running:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;
Tune the Map and Reduce memory parameters:
mapred-site.xml
mapreduce.map.memory.mb    // controls map task memory
mapreduce.reduce.memory.mb // controls reduce task memory
(3)
Follow-up measures:
1) Increase the JournalNode write timeout:
dfs.qjournal.write-txns.timeout.ms
2) Adjust the NameNode's Java options so that a full GC is triggered earlier; that way each full GC takes less time.
3) By default the NameNode's full GC uses the parallel collector, which is fully stop-the-world; switch it to CMS. Adjust the NameNode startup parameters:
-XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
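On a Hadoop 2.x / CDH install these flags are typically appended to the NameNode JVM options in hadoop-env.sh (a minimal sketch; the heap sizes are illustrative assumptions, not values taken from this cluster):

# hadoop-env.sh -- hypothetical heap sizes; the GC flags are the ones listed above
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0 \
  ${HADOOP_NAMENODE_OPTS}"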
The JournalNode client timeouts were also raised in hdfs-site.xml:
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.accept-recovery.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.finalize-segment.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.select-input-streams.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.get-journal-state.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.new-epoch.timeout.ms</name>
  <value>600000000</value>
</property>

<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>600000000</value>
</property>
The NameNode was doing too many full GCs, and each one paused the system for 3-4 minutes.
Because the NameNode keeps all of its metadata in memory, the pressure on the old generation is high, and every full GC freezes the system for several minutes. The fixes are:
(1) Use the CMS GC algorithm.
(2) Set the young:old generation ratio to 1:2. With a 1:4 ratio, large objects get promoted straight into the old generation during young GC, so old-generation usage grows quickly and full GCs become even more frequent.
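One way to express that generation ratio is -XX:NewRatio in hadoop-env.sh (a minimal sketch, assuming the CMS flags above are already in place; NewRatio=2 makes the old generation twice the size of the young generation, i.e. young:old = 1:2):

# hadoop-env.sh -- prepend the generation sizing to the existing NameNode options
# NewRatio=2  =>  old generation : young generation = 2 : 1  (young:old = 1:2)
export HADOOP_NAMENODE_OPTS="-XX:NewRatio=2 ${HADOOP_NAMENODE_OPTS}"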
5. Further notes:
In Hadoop, edits and fsimage are two critically important files. edits records the namespace changes made since the latest checkpoint and acts as a log, while fsimage (the HDFS metadata image file) holds the latest checkpoint itself. Neither file can be read directly with an ordinary text editor, but fortunately Hadoop ships dedicated tools for inspecting their contents, oev and oiv, which are invoked through the hdfs command.
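For example (a minimal sketch; the edits segment name is the one finalized in the log above, while the fsimage file name and the output paths are hypothetical):

# Dump an fsimage checkpoint to XML with the Offline Image Viewer (oiv)
hdfs oiv -p XML -i /Data/cloudera/hadoop/dfs/name/current/fsimage_0000000000029706040 -o /tmp/fsimage.xml
# Dump an edits segment to XML with the Offline Edits Viewer (oev)
hdfs oev -p xml -i /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137 -o /tmp/edits.xml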
So we can use these files to study how Hadoop 2 implements HA with edits and fsimage.
The steps can be summarized as follows:
(1) Once HA is configured, every update from clients is written to the shared edits directory on the JournalNodes, configured as follows:
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://XXXX/mycluster</value>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/export1/hadoop2x/dfs/journal</value>
</property>
(2) The Active NameNode and the Standby NameNode both sync edits from the JournalNodes' shared edits directory into their own local edits directories;
(3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met, and if so merges the fsimage and edits files;
(4) After the merge, the StandbyCheckpointer uploads the new fsimage to the corresponding directory on the Active NameNode;
(5) Once the Active NameNode has received the latest fsimage, it cleans up the old fsimage and edits files;
(6) Through these steps the fsimage/edits merge is complete. Thanks to the HA mechanism, both the Standby NameNode and the Active NameNode end up holding the latest fsimage and edits (unlike Hadoop 1.x, where the fsimage and edits on the SecondaryNameNode were not the latest).
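The conditions the StandbyCheckpointer checks are driven by the usual checkpoint settings; one way to see what a cluster actually uses is hdfs getconf (a minimal sketch; these two property names are the standard HDFS ones, not values quoted in this post):

# Checkpoint after this many seconds ...
hdfs getconf -confKey dfs.namenode.checkpoint.period
# ... or after this many uncheckpointed transactions, whichever comes first
hdfs getconf -confKey dfs.namenode.checkpoint.txns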
Since the Active NameNode has failed over to yhbd02 while the active ResourceManager is still on yhbd01, the NameNode and the ResourceManager now run on separate machines, and performance is noticeably better than before, because both processes need a lot of resources.
For example, submitted jobs get more sensible resource allocation, and containers are spread more evenly across the cluster.
Optimization work:
1. Modified mapred-site.xml
2. Modified hadoop-env.sh
For the Full GC problem, the approach was:
Split the full-volume user query into per-year/per-month chunks and run the data that way (see the sketch after this list).
Lengthen the JournalNode timeouts: increase the JournalNode write timeout
dfs.qjournal.write-txns.timeout.ms
and restart the JournalNode cluster after changing it.
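A minimal sketch of the month-by-month split, assuming (hypothetically) that the full-volume script accepts a start and an end date as arguments; the real bdl_social_test1.sh may be parameterized differently:

#!/bin/bash
# Run the full-volume job one month at a time instead of all at once.
# The two date arguments are a hypothetical interface for the script.
for month in 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06; do
  start="${month}-01"
  end=$(date -d "${start} +1 month" +%Y-%m-%d)
  echo "Running ${start} .. ${end}"
  /datacenter/shell/user/full/bdl_social_test1.sh "${start}" "${end}"
done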
After that, the NameNode no longer failed because it could not reach the JournalNode cluster; instead it failed with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to the second NameNode it ran stably.
We continued tuning the JVM.
The error seen after the JournalNode timeouts were increased:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
        at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
        at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
        at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
        at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
        at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception
Finding the cause of the error:
step1:
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
org.apache.hadoop.hdfs.server.blockmanagement
Description:
Scans the namesystem, scheduling blocks to be cached as appropriate.
The CacheReplicationMonitor does a full scan when the NameNode first starts up, and at configurable intervals afterwards.
step2: this is where the errors begin:
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error
After then adjusting the yarn-site.xml and mapred-site.xml parameters, the NameNode stopped going down.
Changes:
yarn-site.xml
<property>
  <description>The minimum allocation for every container request at the RM,
  in MBs. Memory requests lower than this won't take effect,
  and the specified value will get allocated at minimum.</description>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1500</value>
</property>

<property>
  <description>The maximum allocation for every container request at the RM,
  in MBs. Memory requests higher than this won't take effect,
  and will get capped to this value.</description>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

mapred-site.xml
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1500</value>
</property>

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024M</value>
  <description>Larger heap-size for child jvms of maps.</description>
</property>

<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2100</value>
</property>

<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1900M</value>
  <description>Larger heap-size for child jvms of reduces.</description>
</property>

<property>
  <name>mapreduce.job.counters.limit</name>
  <value>12000</value>
</property>
To sum up:
When you hit this kind of situation, approach it from three angles:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning the yarn-site.xml and mapred-site.xml configuration files.