Active NameNode stops with "Error: flush failed for required journal"

The active NameNode fails intermittently while everything else is fine: the NameNode on the standby node is healthy and all the related JournalNodes are up, yet the NameNode on the active node keeps stopping.

Check the NameNode log on the Hadoop active (master) node:

2016-11-21 22:36:40,908 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19822 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2016-11-21 22:36:41,088 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.58.183:8485, 192.168.58.181:8485, 192.168.58.182:8485], stream=QuorumOutputStream starting at txid 24533))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2645)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:579)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:394)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:975)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)
2016-11-21 22:36:41,089 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 24533
2016-11-21 22:36:41,113 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-11-21 22:36:41,122 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Slave2/192.168.58.182:8485. Already tried 0 time(s); maxRetries=45
2016-11-21 22:36:41,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Slave1/192.168.58.181:8485. Already tried 0 time(s); maxRetries=45
2016-11-21 22:36:41,123 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: StandByNameNode/192.168.58.183:8485. Already tried 0 time(s); maxRetries=45
2016-11-21 22:36:41,137 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20050ms to send a batch of 1 edits (218 bytes) to remote journal 192.168.58.182:8485
2016-11-21 22:36:41,137 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20052ms to send a batch of 1 edits (218 bytes) to remote journal 192.168.58.181:8485
2016-11-21 22:36:41,137 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20065ms to send a batch of 1 edits (218 bytes) to remote journal 192.168.58.183:8485
2016-11-21 22:36:41,145 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at CentOSMaster/192.168.58.180
************************************************************/

The log shows the quorum write timing out at 20000 ms, which is the default value of dfs.qjournal.write-txns.timeout.ms. First make sure dfs.namenode.edits.dir and dfs.journalnode.edits.dir are set, then raise the quorum-journal timeouts in hdfs-site.xml as follows:

  <property>
    <name>dfs.qjournal.start-segment.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.accept-recovery.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.finalize-segment.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.select-input-streams.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.get-journal-state.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.new-epoch.timeout.ms</name>
    <value>600000000</value>
  </property>

  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>600000000</value>
  </property>
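
For reference, here is a minimal sketch of the two directory properties mentioned above. The paths below are placeholders, not values from the original cluster; adjust them to your own disk layout:

  <property>
    <name>dfs.namenode.edits.dir</name>
    <!-- placeholder: local directory where the NameNode stores its edit log -->
    <value>/data/hadoop/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.journalnode.edits.dir</name>
    <!-- placeholder: local directory on each JournalNode for edit-log segments -->
    <value>/data/hadoop/hdfs/journalnode</value>
  </property>

These properties are read at startup, so the NameNodes and JournalNodes typically need to be restarted for the new timeout values to take effect.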

This seems to have resolved it; as of this morning the problem has not recurred.
