CDH:cloudera-scm-server dead but pid file exists

Error

Error reported for HDFS by CM (because CM was down, this message could not be viewed in the management UI; it was taken from the logs):

  • The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to create parent directory for /opt/tmp/.cloudera_health_monitoring_canary_files.



Troubleshooting and fix

(1) The CM server on the CDH manager node is down

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status

cloudera-scm-server dead but pid file exists


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /usr/java/jdk1.8.0_111/bin/jps

20656 Main

20626 Main

25667 Jps

20630 EventCatcherService

20632 AlertPublisher

29995 Main

10619 -- process information unavailable


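# Nothing is listening on port 7180 in the ss output below, so CM has not started properly; the jps output above is also missing one Main process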

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ss -nltup|grep 718*

tcp    LISTEN     0      50                     *:7184                  *:*      users:(("java",20630,233))

tcp    LISTEN     0      50                     *:7185                  *:*      users:(("java",20630,241))

tcp    LISTEN     0      5                      *:4433                  *:*      users:(("python2.6",17152,8))

tcp    LISTEN     0      5              127.0.0.1:7190                  *:*      users:(("python2.6",17152,11))

tcp    LISTEN     0      5                      *:7191                  *:*      users:(("python2.6",17152,7))
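
The pattern `718*` in the command above is a loose regular expression ("71" followed by any number of 8s), and because it is unquoted the shell may also try to expand it as a glob. That is why unrelated lines show up, such as the one for port 4433 (matched through PID 17152). A stricter check can be sketched as below; `port_is_listening` is a hypothetical helper that was not part of the original session.

```shell
# Hypothetical helper: reads `ss -nlt` output on stdin and succeeds only if
# the output contains ":<port>" followed by whitespace, as in the local
# address column.
port_is_listening() {
  grep -Eq ":$1[[:space:]]"
}

# Usage on the CM host (7180 is the CM web UI port discussed above):
#   ss -nlt | port_is_listening 7180 && echo "listening" || echo "not listening"
```

Matching the colon and the trailing whitespace keeps PIDs and longer port numbers from producing false positives.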



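# Our CDH metadata is stored in a MySQL database. With CM down, the other CDH components cannot be inspected through the UI, so we query the database to see which nodes and roles this cluster includes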

mysql> select * from hosts;
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
| HOST_ID | OPTIMISTIC_LOCK_VERSION | HOST_IDENTIFIER                      | NAME                       | IP_ADDRESS     | RACK_ID  | STATUS | CONFIG_CONTAINER_ID | MAINTENANCE_COUNT | DECOMMISSION_COUNT | CLUSTER_ID | NUM_CORES | TOTAL_PHYS_MEM_BYTES | PUBLIC_NAME | PUBLIC_IP_ADDRESS | CLOUD_PROVIDER |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
|       1 |                      11 | 264b10bb-b488-4ee7-8fcd-3c68f7a8860a | ec6s-logshedcl58manager-01 | 10.177.101.146 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       2 |                      17 | b584457b-705d-4b1f-8000-df0e6da1838d | ec6s-logshedcl58dn-03      | 10.177.102.38  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       3 |                      16 | e28dabc1-c105-464e-8bf6-0bd0435ace9a | ec6s-logshedcl58dn-02      | 10.177.102.193 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       4 |                      17 | 994cf04e-2510-426a-8336-6e2d28a3001d | ec6s-logshedcl58nn-02      | 10.177.102.218 | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       5 |                      16 | a9cab0d5-5e48-49a7-8fb0-e57a0bac16db | ec6s-logshedcl58nn-01      | 10.177.101.60  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
|       6 |                      16 | 60bf1721-d6db-4d72-9164-41d89f81e789 | ec6s-logshedcl58dn-01      | 10.177.101.64  | /default | NA     |                   1 |                 0 |                  0 |          5 |         2 |           8251195392 | NULL        | NULL              | NULL           |
+---------+-------------------------+--------------------------------------+----------------------------+----------------+----------+--------+---------------------+-------------------+--------------------+------------+-----------+----------------------+-------------+-------------------+----------------+
6 rows in set (0.00 sec)
mysql> select * from roles;
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
| ROLE_ID | NAME                                                     | HOST_ID | ROLE_TYPE          | CONFIGURED_STATUS | SERVICE_ID | MERGED_KEYTAB | MAINTENANCE_COUNT | DECOMMISSION_COUNT | OPTIMISTIC_LOCK_VERSION | ROLE_CONFIG_GROUP_ID | HAS_EVER_STARTED |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
|      14 | mgmt-HOSTMONITOR-92f15c379891f3c8dbdbbcbe57db9067        |       1 | HOSTMONITOR        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   25 |                1 |
|      15 | mgmt-EVENTSERVER-92f15c379891f3c8dbdbbcbe57db9067        |       1 | EVENTSERVER        | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   21 |                1 |
|      16 | mgmt-ACTIVITYMONITOR-92f15c379891f3c8dbdbbcbe57db9067    |       1 | ACTIVITYMONITOR    | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   22 |                1 |
|      17 | mgmt-SERVICEMONITOR-92f15c379891f3c8dbdbbcbe57db9067     |       1 | SERVICEMONITOR     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   24 |                1 |
|      18 | mgmt-ALERTPUBLISHER-92f15c379891f3c8dbdbbcbe57db9067     |       1 | ALERTPUBLISHER     | RUNNING           |          4 | NULL          |                 0 |                  0 |                       6 |                   20 |                1 |
|      19 | zookeeper-SERVER-5779e83332b2c66cc02029a8ab2c3628        |       3 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      20 | zookeeper-SERVER-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      21 | zookeeper-SERVER-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | SERVER             | RUNNING           |          5 | NULL          |                 0 |                  0 |                       9 |                   27 |                1 |
|      23 | hdfs-NAMENODE-ed39ed17d751bee1bd6ad84c0db46ca1           |       5 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      22 |                   30 |                1 |
|      24 | hdfs-DATANODE-c103ed4dcdd93fc8bbaf467aa1c6d927           |       2 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      25 | hdfs-DATANODE-5779e83332b2c66cc02029a8ab2c3628           |       3 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      26 | hdfs-DATANODE-dc971e0a60f4e798e85e2ab9bd57a041           |       6 | DATANODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                      10 |                   28 |                1 |
|      27 | hdfs-NAMENODE-16c21945a5f07e23a510dd5e32caa6dd           |       4 | NAMENODE           | RUNNING           |          6 | NULL          |                 0 |                  0 |                       6 |                   30 |                1 |
|      28 | hdfs-FAILOVERCONTROLLER-ed39ed17d751bee1bd6ad84c0db46ca1 |       5 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       4 |                   29 |                1 |
|      29 | hdfs-FAILOVERCONTROLLER-16c21945a5f07e23a510dd5e32caa6dd |       4 | FAILOVERCONTROLLER | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   29 |                1 |
|      30 | hdfs-JOURNALNODE-c103ed4dcdd93fc8bbaf467aa1c6d927        |       2 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      31 | hdfs-JOURNALNODE-dc971e0a60f4e798e85e2ab9bd57a041        |       6 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      32 | hdfs-JOURNALNODE-5779e83332b2c66cc02029a8ab2c3628        |       3 | JOURNALNODE        | RUNNING           |          6 | NULL          |                 0 |                  0 |                       2 |                   34 |                1 |
|      36 | kafka-KAFKA_BROKER-c103ed4dcdd93fc8bbaf467aa1c6d927      |       2 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                       9 |                   40 |                1 |
|      37 | kafka-KAFKA_BROKER-ed39ed17d751bee1bd6ad84c0db46ca1      |       5 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
|      38 | kafka-KAFKA_BROKER-16c21945a5f07e23a510dd5e32caa6dd      |       4 | KAFKA_BROKER       | RUNNING           |          8 | NULL          |                 0 |                  0 |                      10 |                   40 |                1 |
+---------+----------------------------------------------------------+---------+--------------------+-------------------+------------+---------------+-------------------+--------------------+-------------------------+----------------------+------------------+
21 rows in set (0.00 sec)
mysql> select * from services;
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
| SERVICE_ID | OPTIMISTIC_LOCK_VERSION | NAME      | SERVICE_TYPE | CLUSTER_ID | MAINTENANCE_COUNT | DISPLAY_NAME                | GENERATION |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
|          4 |                      14 | mgmt      | MGMT         |       NULL |                 0 | Cloudera Management Service |          1 |
|          5 |                       7 | zookeeper | ZOOKEEPER    |          5 |                 0 | ZooKeeper                   |          1 |
|          6 |                      23 | hdfs      | HDFS         |          5 |                 0 | HDFS                        |          1 |
|          8 |                      15 | kafka     | KAFKA        |          5 |                 0 | Kafka                       |          1 |
+------------+-------------------------+-----------+--------------+------------+-------------------+-----------------------------+------------+
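
The three tables above can also be joined to get one row per role with its host and service. The following is a minimal sketch: `role_host_query` is a hypothetical helper that only prints the SQL, the column names are taken from the output above, and the MySQL user and database name in the usage comment are placeholders that depend on your installation.

```shell
# Hypothetical helper: prints a query that joins roles to hosts and services.
role_host_query() {
  cat <<'SQL'
SELECT h.NAME AS host, h.IP_ADDRESS, s.NAME AS service, r.ROLE_TYPE, r.CONFIGURED_STATUS
FROM roles r
JOIN hosts h ON h.HOST_ID = r.HOST_ID
JOIN services s ON s.SERVICE_ID = r.SERVICE_ID
ORDER BY h.NAME, s.NAME, r.ROLE_TYPE;
SQL
}

# Usage (user and database are placeholders, not values from this article):
#   role_host_query | mysql -u <user> -p <cm_database>
```

Note that CONFIGURED_STATUS records the state CM intends for a role, so RUNNING here does not prove the process is actually alive.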


# Restart the cloudera-scm-server service

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server status

cloudera-scm-server dead but pid file exists


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server stop 

cloudera-scm-server is already stopped


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# cat /var/run/cloudera-scm-server.pid

10617

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# ps -ef|grep 10617

root     28331 27755  0 19:02 pts/3    00:00:00 grep 10617
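
The `cat` and `ps` steps above show that PID 10617, recorded in the pid file, no longer exists, which is the condition the init script reports as "dead but pid file exists". As a minimal sketch, the same check can be written as a small function. `pidfile_state` is a hypothetical name; `kill -0` only probes for the process and sends no signal. Run it as root, because probing another user's process can fail with a permission error and be misreported as stale.

```shell
# Hypothetical helper: prints "missing", "running" or "stale" for a pid file.
pidfile_state() {
  if [ ! -r "$1" ]; then
    echo missing            # no readable pid file at all
  elif kill -0 "$(tr -d '[:space:]' < "$1")" 2>/dev/null; then
    echo running            # the recorded PID is alive
  else
    echo stale              # pid file exists but the process is gone
  fi
}

# Usage on the CM host:
#   pidfile_state /var/run/cloudera-scm-server.pid
```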


[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20656

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20626

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20630

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 29995

[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# kill 20632



[root@ec6s-logshedcl58manager-01 cloudera-scm-agent]# /etc/init.d/cloudera-scm-server start

[root@ec6s-logshedcl58manager-01 ~]#  /etc/init.d/cloudera-scm-server status

cloudera-scm-server (pid  1378) is running...


# Started normally

[root@ec6s-logshedcl58manager-01 ~]# /usr/java/jdk1.8.0_111/bin/jps

1380 Main

2469 Main

2471 EventCatcherService

7272 Jps

2473 AlertPublisher

2475 Main

2462 Main



(2) The two NameNodes cannot communicate with each other, but neither is down

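Once CM is running normally again, the NameNodes can be managed through the graphical UI. It shows that the NameNodes cannot communicate with each other and cannot write edit logs to the JournalNodes.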

Error in the log:

Jul 18, 5:38:09.355 PM FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.177.101.64:8485, 10.177.102.193:8485, 10.177.102.38:8485], stream=QuorumOutputStream starting at txid 1338050))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:651)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:585)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2752)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:599)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:401)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2141)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1783)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2135)



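The log shows that the NameNode failed to flush its edit log to the JournalNodes: it waited 20000 ms without a quorum responding and timed out. Our company runs on AWS EC2, and network maintenance may have been under way at the time, making the instances' network unstable. If this kind of timeout occurs, the default of 20 s can be raised to 60 s, for example: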

#vim /etc/hadoop/conf/hdfs-site.xml 

<property>

        <name>dfs.qjournal.write-txns.timeout.ms</name>

        <value>60000</value>

</property>
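
To confirm which value a given configuration file carries, a rough sketch such as the following may help. `get_hadoop_property` is a hypothetical helper that assumes each `<name>` and `<value>` element sits on its own line, as in the snippet above; it is not a general XML parser. On a cluster managed by CM, the running NameNode may read a configuration directory generated by CM rather than /etc/hadoop/conf, so treat the path in the usage comment as an assumption.

```shell
# Hypothetical helper: prints the value of property $2 in Hadoop XML file $1.
get_hadoop_property() {
  awk -v key="$2" '
    # Remember the most recent property name, stripped of its tags.
    /<name>/  { name = $0; gsub(/.*<name>|<\/name>.*/, "", name) }
    # On a value line, print it if it belongs to the requested property.
    /<value>/ { value = $0; gsub(/.*<value>|<\/value>.*/, "", value)
                if (name == key) { print value; exit } }
  ' "$1"
}

# Usage:
#   get_hadoop_property /etc/hadoop/conf/hdfs-site.xml dfs.qjournal.write-txns.timeout.ms
```

An empty result means the property is not set in that file, in which case the 20000 ms default seen in the log applies.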


Then restart the two NameNodes one at a time from the CM management console: http://10.177.101.146:7180

