task blocked for more than 120 seconds




Jul  3 20:41:24 yz384 kernel:
Jul  3 20:43:24 yz384 kernel: INFO: task chown:18647 blocked for more than 120 seconds.
Jul  3 20:43:24 yz384 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  3 20:43:24 yz384 kernel: chown         D ffffffff80157c4c     0 18647      1         24429 22803 (NOTLB)
Jul  3 20:43:24 yz384 kernel:  ffff81015ed4fde8 0000000000000086 ffff810c3fc1d3c0 ffff810e91a379c0
Jul  3 20:43:24 yz384 kernel:  0000000000000000 0000000000000008 ffff8103bf82f7a0 ffff810c3fd2e7a0
Jul  3 20:43:24 yz384 kernel:  000a29c122906514 000000000000141d ffff8103bf82f988 0000000a00000001
Jul  3 20:43:24 yz384 kernel: Call Trace:
Jul  3 20:43:24 yz384 kernel:  [<ffffffff80063c63>] __mutex_lock_slowpath+0x60/0x9b
Jul  3 20:43:24 yz384 kernel:  [<ffffffff80063cad>] .text.lock.mutex+0xf/0x14
Jul  3 20:43:24 yz384 kernel:  [<ffffffff8003b7e5>] chown_common+0x90/0xb0
Jul  3 20:43:24 yz384 kernel:  [<ffffffff80023fa0>] __user_walk_fd+0x41/0x4c
Jul  3 20:43:24 yz384 kernel:  [<ffffffff800e3c44>] sys_lchown+0x38/0x53
Jul  3 20:43:24 yz384 kernel:  [<ffffffff8000d57f>] dput+0x2c/0x114
Jul  3 20:43:24 yz384 kernel:  [<ffffffff80012d27>] __fput+0x191/0x1bd
Jul  3 20:43:24 yz384 kernel:  [<ffffffff8002d0f1>] mntput_no_expire+0x19/0x89
Jul  3 20:43:24 yz384 kernel:  [<ffffffff80024200>] filp_close+0x5c/0x64
Jul  3 20:43:24 yz384 kernel:  [<ffffffff8005d116>] system_call+0x7e/0x83
Jul  3 20:43:24 yz384 kernel:
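
When many of these warnings pile up in syslog, it helps to pull out which task was blocked and for how long. The following is a minimal sketch, assuming the message keeps the "INFO: task NAME:PID blocked for more than N seconds" wording shown in the log above; the function name `parse_hung_task` is hypothetical and only used for illustration.

```python
import re

# Matches the hung-task warning as it appears in the log excerpt above.
# Assumption: the kernel message keeps this exact wording.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_task(line):
    """Return (command, pid, seconds) for a hung-task warning line, else None."""
    match = HUNG_TASK_RE.search(line)
    if match is None:
        return None
    return match.group("comm"), int(match.group("pid")), int(match.group("secs"))


sample = (
    "Jul  3 20:43:24 yz384 kernel: INFO: task chown:18647 "
    "blocked for more than 120 seconds."
)
print(parse_hung_task(sample))
```

For the sample line this yields the command name `chown`, PID 18647 and the 120-second threshold; lines that are not hung-task warnings (such as the call trace entries) return `None`.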




The warning indicates a problem with the system. In my experience it means that the process has been blocked in kernel space for at least 120 seconds, usually because it is starved of disk I/O. This can result from heavy swapping when too much memory is in use, e.g. when a busy web server is configured with more Apache child processes than the system can support. In your case it may simply be that too many MySQL processes are competing for memory and disk I/O.


It can also happen if the underlying storage system is performing poorly, e.g. if you have an overloaded SAN, or if soft errors on a disk cause a lot of retries. Whenever a task has to wait a long time for its I/O commands to complete, this warning may be issued.



This is a known bug. By default, Linux allows up to 40% of the available memory to hold dirty (not yet written) file system cache pages. Once this mark has been reached, the kernel flushes the outstanding data to disk and all following writes become synchronous. The 120 seconds is the default hung task timeout (hung_task_timeout_secs): a task that stays blocked on this flush for longer than that triggers the warning. In the case here the I/O subsystem is not fast enough to flush the data within 120 seconds. This happens especially on systems with a lot of memory.


The problem is solved in later kernels, and there is no “fix” from Oracle. I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting “vm.dirty_ratio=10” in /etc/sysctl.conf. This setting does not affect overall database performance, since you hopefully use Direct IO and bypass the file system cache completely.


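How it works: Linux allows up to 40% of available memory to be used as system cache. When that data is flushed, synchronizing it with disk I/O can exceed the timeout (120 s), so lowering the limit from 40% to 10% avoids the timeout.

Add vm.dirty_ratio=10 to the file /etc/sysctl.conf.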
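
To see why the ratio matters on large-memory machines, here is a rough back-of-the-envelope sketch. The 64 GiB of available memory and the 100 MiB/s sustained write throughput are made-up illustrative figures, not measurements from the system above, and the helper names are hypothetical. It also simplifies: the hung task timer measures how long a single task stays blocked, not the duration of the whole flush.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_bytes(available_bytes, dirty_ratio_percent):
    """Upper bound on dirty page cache implied by a vm.dirty_ratio percentage."""
    return available_bytes * dirty_ratio_percent // 100


def flush_seconds(dirty_bytes, write_bytes_per_sec):
    """Time to write the dirty data out at a constant (assumed) throughput."""
    return dirty_bytes / write_bytes_per_sec


# Illustrative assumptions only: 64 GiB available, 100 MiB/s to disk.
available = 64 * GIB
throughput = 100 * MIB

for ratio in (40, 10):
    limit = dirty_limit_bytes(available, ratio)
    secs = flush_seconds(limit, throughput)
    print(
        f"vm.dirty_ratio={ratio}: up to {limit / GIB:.1f} GiB dirty, "
        f"about {secs:.0f} s to flush at 100 MiB/s"
    )
```

Under these assumptions a full flush at 40% takes well over 120 seconds, while at 10% it stays below that, which matches the reasoning behind the change. After editing /etc/sysctl.conf, the setting is normally loaded with `sysctl -p` or at the next boot.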



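A process waiting on I/O is often in the D state, i.e. TASK_UNINTERRUPTIBLE. A process in this state does not handle signals, so it cannot be killed. If a process stays in the D state for a long time, something is definitely wrong, and there are two likely causes:

1) Hardware on the I/O path has failed, e.g. a bad disk (this only occasionally leads to a long-lasting D state; usually an error is returned instead);

2) The kernel itself has a problem.

Once this kind of problem appears, it is usually unrecoverable: the process cannot be killed, and typically only a reboot restores the system.

For this situation the kernel provides a hung task detection mechanism. The basic principle: it periodically checks the processes in the D state, and if a process has been in the D state for longer than a specified time (120 s by default, configurable), it prints the relevant stack trace. It can also be configured through a proc parameter to panic directly.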
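
The same idea can be approximated from user space by scanning /proc for tasks currently in the D state. This is only a sketch of a one-time snapshot, not the kernel's own detector: it does not track how long a task has been blocked, and the function names `parse_stat` and `uninterruptible_tasks` are hypothetical. It relies on the documented /proc/<pid>/stat layout, where the command name is in parentheses and the state letter follows it.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for processes whose state is 'D' right now."""
    if not os.path.isdir(proc_root):
        return []
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, comm, state = parse_stat(stat_file.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was malformed
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(f"{pid}\t{comm}\tD (uninterruptible sleep)")
```

Running the scan repeatedly and noting which PIDs stay in the list for minutes gives a rough picture of what the hung task detector would report. The timeout itself is the /proc/sys/kernel/hung_task_timeout_secs value named in the log message.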

