主节点间歇性报错其他没有问题 ,SNN的NN没有问题,相关的journalNode也都在,就是主节点的NN会停止。
查看hadoop主节点的NN日志。
2016-11-21 22:36:40,908 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19822 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. 2016-11-21 22:36:41,088 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.58.183:8485, 192.168.58.181:8485, 192.168.58.182:8485], stream=QuorumOutputStream starting at txid 24533)) java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137) at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107) at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113) at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107) at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533) at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393) at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57) at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2645) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:579) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:394) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:975) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034) 2016-11-21 22:36:41,089 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 24533 2016-11-21 22:36:41,113 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 2016-11-21 22:36:41,122 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Slave2/192.168.58.182:8485. Already tried 0 time(s); maxRetries=45 2016-11-21 22:36:41,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Slave1/192.168.58.181:8485. Already tried 0 time(s); maxRetries=45 2016-11-21 22:36:41,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: StandByNameNode/192.168.58.183:8485. Already tried 0 time(s); maxRetries=45 2016-11-21 22:36:41,137 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20050ms to send a batch of 1 edits (218 bytes) to remote journal 192.168.58.182:8485 2016-11-21 22:36:41,137 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20052ms to send a batch of 1 edits (218 bytes) to remote journal 192.168.58.181:8485 2016-11-21 22:36:41,137 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20065ms to send a batch of 1 edits (218 bytes) to remote journal 192.168.58.183:8485 2016-11-21 22:36:41,145 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at CentOSMaster/192.168.58.180 ************************************************************/
首先保证设置dfs.namenode.edits.dir和dfs.journalnode.edits.dir,然后设置在hdfs-site.xml中超时时间如下:
<property> <name>dfs.qjournal.start-segment.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.prepare-recovery.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.accept-recovery.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.prepare-recovery.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.accept-recovery.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.finalize-segment.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.select-input-streams.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.get-journal-state.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.new-epoch.timeout.ms</name> <value>600000000</value> </property> <property> <name>dfs.qjournal.write-txns.timeout.ms</name> <value>600000000</value> </property>
貌似解决了,至今今天早上没出问题。