CDH: unable to create new native thread


Identifying the problem

CDH-4.7.1 NameNode is down

Starting the NameNode fails with the error below: it cannot create a new native thread, most likely because the number of threads in use has exceeded the max user processes limit.

2018-08-26 08:44:00,532 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50070
2018-08-26 08:44:00,532 INFO org.mortbay.log: jetty-6.1.26.cloudera.4
2018-08-26 08:44:00,773 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: 'signature.secret' configuration not set, using a random value as secret
2018-08-26 08:44:00,812 INFO org.mortbay.log: Started SelectChannelConnector@alish1-dataservice-01.mypna.cn:50070
2018-08-26 08:44:00,813 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: alish1-dataservice-01.mypna.cn:50070
2018-08-26 08:44:00,814 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2018-08-26 08:44:00,815 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2018-08-26 08:44:00,828 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2018-08-26 08:44:00,828 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8022: starting
2018-08-26 08:44:00,839 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.apache.hadoop.ipc.Server.start(Server.java:2057)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.start(NameNodeRpcServer.java:303)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:497)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:459)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:621)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:606)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)
2018-08-26 08:44:00,851 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1


The Cloudera agent log shows the following. DNS itself checked out fine, so this trace is of little diagnostic value here.

#cat /var/log/cloudera-scm-agent/cloudera-scm-agent.log
[26/Aug/2018 07:30:23 +0000] 4589 MainThread agent        INFO     PID '19586' associated with process '1724-hdfs-NAMENODE' with payload 'processname:1724-hdfs-NAMENODE groupname:1724-hdfs-NAMENODE from_state:RUNNING expected:0 pid:19586' exited unexpectedly
[26/Aug/2018 07:45:06 +0000] 4589 Monitor-HostMonitor throttling_logger ERROR    (29 skipped) Failed to collect java-based DNS names
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 53, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._poll_timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/dns_names.py", line 42, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/subprocess_timeout.py", line 40, in subprocess_with_timeout
    close_fds=True)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    child_exception = pickle.loads(data)
OSError: [Errno 2] No such file or directory



Troubleshooting

max user processes is set to 65535 here, which is already very large; under normal circumstances this limit is never reached.

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127452
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


The system currently has only about 170 processes in total, so the next step is to check how many threads each process owns.

# ps -ef | wc -l
169


This server mainly runs Java processes, so the focus is on their thread counts. Process 30315 turns out to own about 32110 threads; adding the threads of all the other processes pushes the total past 65535, so the NameNode cannot obtain any more threads and fails with the error above.

# pgrep java
1680
5482
19662
28770
30315
35902

# for i in `pgrep java`; do ps -T -p $i | wc -l; done
15
49
30
53
32110
114

# ps -T -p 30315 | wc -l
32110
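The loop above prints each thread count without its matching PID, so you still have to correlate the two lists by hand. A small sketch (Linux-specific, reading thread counts from /proc instead of ps) that pairs every Java PID with its thread count and sorts the busiest first:

```shell
# Count threads per Java process: on Linux, each entry under
# /proc/<pid>/task is one thread of that process.
threads_of() {
    ls "/proc/$1/task" 2>/dev/null | wc -l
}

for pid in $(pgrep java); do
    printf '%s\t%s\n' "$pid" "$(threads_of "$pid")"
done | sort -k2,2 -rn   # busiest process first
```

With this output, the runaway process (30315 in this incident) shows up on the first line immediately.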


Alternatively, check with the top -H command:

# top -H

top - 10:44:58 up 779 days, 19:34,  3 users,  load average: 0.01, 0.05, 0.05
Tasks: 32621 total,   1 running, 32620 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  4.1%sy,  0.0%ni, 93.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16334284k total, 15879392k used,   454892k free,   381132k buffers
Swap:  4194296k total,        0k used,  4194296k free,  8304400k cached


Solution

With the root cause identified, we raise max user processes to 100000, and the NameNode then starts successfully.

# echo "100000" > /proc/sys/kernel/threads-max
# echo "100000" > /proc/sys/kernel/pid_max      (default: 32768)
# echo "200000" > /proc/sys/vm/max_map_count    (default: 65530)
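The echo commands above only change the running kernel and are lost on reboot. A sketch of making the same values persistent via /etc/sysctl.conf (assumes root and a sysctl.conf-based setup; on systemd distributions a drop-in under /etc/sysctl.d/ works the same way):

```shell
# Persist the kernel settings so they survive a reboot; the values
# mirror the echo commands above. Run as root.
cat >> /etc/sysctl.conf <<'EOF'
kernel.threads-max = 100000
kernel.pid_max = 100000
vm.max_map_count = 200000
EOF
sysctl -p   # apply the file to the running kernel immediately
```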


# vim /etc/security/limits.d/90-nproc.conf
* soft nproc unlimited
root soft nproc unlimited

# vim /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535
* hard nproc 100000
* soft nproc 100000


# ulimit -u
100000
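Note that ulimit -u only reports the limit of the current shell. To confirm the restarted NameNode JVM actually inherited the new limits, read them from /proc directly; a small sketch (Linux-specific; looking the PID up with pgrep -f NameNode is an assumption about the process command line):

```shell
# Print the effective process/file limits of a running process from
# /proc, since `ulimit` only describes the current shell's limits.
show_limits() {
    grep -E 'Max processes|Max open files' "/proc/$1/limits"
}

# Demonstrated on the current shell; substitute the NameNode JVM's PID,
# e.g.: show_limits "$(pgrep -f NameNode | head -1)"
show_limits $$
```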

