Background:
A user reported that pods on a particular node consistently failed to start with "no space left on device"; running docker run manually on the node hit the same error.
# docker run --name aestools-perf --cap-add CAP_SYS_ADMIN --privileged -ti --rm registry-vpc.cn-beijing.aliyuncs.com/my-nettools/aestools:onlyperf
docker: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/c83150***4a37d6e4d6032e54a23a8-init/merged: no space left on device.
Problem analysis:
no space left on device:
"no space left on device" is a common error; it usually means the disk is out of space or out of inodes, and a quick df check covers both:
# df -Th
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 3.7G 0 3.7G 0% /dev
tmpfs tmpfs 3.7G 0 3.7G 0% /dev/shm
tmpfs tmpfs 3.7G 30M 3.7G 1% /run
tmpfs tmpfs 3.7G 0 3.7G 0% /sys/fs/cgroup
/dev/vda1 ext4 118G 35G 79G 31% /
tmpfs tmpfs 3.7G 12K 3.7G 1% /var/lib/kubelet/pods/95ffe5e3-fc80-4816-9b36-e7a3960cf7be/volumes/kubernetes.io~secret/terway-token-j4llt
.......
tmpfs tmpfs 749M 0 749M 0% /run/user/0
# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
devtmpfs 955080 354 954726 1% /dev
tmpfs 957808 2 957806 1% /dev/shm
tmpfs 957808 1858 955950 1% /run
tmpfs 957808 17 957791 1% /sys/fs/cgroup
/dev/vda1 7864320 677111 7187209 9% /
tmpfs 957808 9 957799 1% /var/lib/kubelet/pods/95ffe5e3-fc80-4816-9b36-e7a3960cf7be/volumes/kubernetes.io~secret/terway-token-j4llt
......
tmpfs 957808 1 957807 1% /run/user/0
Clearly neither disk space nor inodes are exhausted. From experience, could a pile of deleted files be held open and never released? lsof shows none of those either:
# lsof | grep deleted
[root@iZZ tools]#
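Two other ways to run the same check: lsof's +L1 predicate selects files with a link count below 1 (i.e. deleted but still open), and a direct /proc scan works when lsof is unavailable (a sketch, GNU lsof/find assumed):
# lsof +L1
# find /proc/[0-9]*/fd -lname '* (deleted)' 2>/dev/null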
strace tracing:
Since the failure reproduces on demand, strace can follow the run, and the failing syscalls help narrow the problem down:
# strace -F -ff -t -tt -s 4096 -o dnospace.log docker run --name aestools-perf --cap-add CAP_SYS_ADMIN --privileged -ti --rm registry-vpc.cn-beijing.aliyuncs.com/my-nettools/aestools:onlyperf
strace: deprecated option -F ignored
docker: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/3e0c84f188bbf3d039ec2454b32009e9eb1bc4d5b715d9f9993af250d019028b-init/merged: no space left on device.
See 'docker run --help'.
[root@iZ nospace]# ls -l
total 100
-rw-r--r-- 1 root root 59357 Nov 29 11:23 dnospace.log.1858620
-rw-r--r-- 1 root root 5597 Nov 29 11:23 dnospace.log.1858621
-rw-r--r-- 1 root root 2705 Nov 29 11:23 dnospace.log.1858622
-rw-r--r-- 1 root root 2787 Nov 29 11:23 dnospace.log.1858623
-rw-r--r-- 1 root root 1248 Nov 29 11:23 dnospace.log.1858624
-rw-r--r-- 1 root root 461 Nov 29 11:23 dnospace.log.1858625
-rw-r--r-- 1 root root 3414 Nov 29 11:23 dnospace.log.1858626
-rw-r--r-- 1 root root 5667 Nov 29 11:23 dnospace.log.1858627
-rw-r--r-- 1 root root 1445 Nov 29 11:23 dnospace.log.1858628
Now analyze the trace files. The numeric suffix of each file (e.g. 1858620) is the PID of the traced command; the other IDs belong to threads that PID spawned to carry out other work. We can read the files one by one, working backwards from the last.
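To quickly find which thread logs actually recorded the failure (a sketch; the pattern is just the error strings seen so far):
# grep -il "no space\|enospc" dnospace.log.*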
The main PID's log shows that docker read the default kubeconfig before running the container, and the credential checks passed:
1669692214.324317 newfstatat(AT_FDCWD, "/root/.kube/config", {st_mode=S_IFREG|0644, st_size=7313, ...}, 0) = 0
Reading the kubeconfig:
1669692214.324362 openat(AT_FDCWD, "/root/.kube/config", O_RDONLY|O_CLOEXEC) = 3
1669692214.324396 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2833589232, u64=139692349924336}}) = -1 EPERM (Operation not permitted)
1669692214.324424 epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc00026eb3c) = -1 EPERM (Operation not permitted)
1669692214.324450 fstat(3, {st_mode=S_IFREG|0644, st_size=7313, ...}) = 0
1669692214.324479 read(3, "apiVersion: v1\nclusters:\n- cluster:\n server: https://47****:6443\n certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS。。。LQo=\n name: kubernetes\ncontexts:\n- context:\n cluster: kubernetes\n user: \"2231***513\"\n name: 2231***77601cb\ncurrent-context: 223121***77601cb\nkind: Config\npreferences: {}\nusers:\n- name: \"****\"\n user:\n client-certificate-data: LS0tLS。。。。5qTTROVFZpTW"..., 7825) = 7313
Credential check:
1669692214.324531 read(3, "", 512) = 0
1669692214.324555 close(3) = 0
1669692214.325087 newfstatat(AT_FDCWD, "/var/run/secrets/kubernetes.io/serviceaccount/token", 0xc000290c68, 0) = -1 ENOENT (No such file or directory)
1669692214.325309 futex(0xc000402bc8, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.325417 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
1669692214.325450 setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
1669692214.325501 connect(3, {sa_family=AF_UNIX, sun_path="/var/run/docker.sock"}, 23) = 0
1669692214.325543 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2833589232, u64=139692349924336}}) = 0
1669692214.325576 getsockname(3, {sa_family=AF_UNIX}, [112->2]) = 0
1669692214.325609 getpeername(3, {sa_family=AF_UNIX, sun_path="/var/run/docker.sock"}, [112->23]) = 0
1669692214.325653 futex(0xc000402bc8, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.325740 write(3, "HEAD /_ping HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/19.03.5 (linux)\r\n\r\n", 81) = 81
Probing the docker daemon:
1669692214.325789 futex(0xc0004a0148, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.325821 futex(0x55ece2f7fd88, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
1669692214.326213 epoll_pwait(4, [{EPOLLOUT, {u32=2833589232, u64=139692349924336}}], 128, 0, NULL, 0) = 1
1669692214.326245 epoll_pwait(4, [{EPOLLOUT, {u32=2833589232, u64=139692349924336}}], 128, -1, NULL, 3) = 1
1669692214.327057 epoll_pwait(4, [{EPOLLIN|EPOLLOUT, {u32=2833589232, u64=139692349924336}}], 128, -1, NULL, 3) = 1
epoll_pwait waiting on the child threads' results:
1669692214.340355 futex(0x55ece2f7f0d0, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.340417 read(3, "HTTP/1.1 500 Internal Server Error\r\nApi-Version: 1.40\r\nContent-Type: application/json\r\nDocker-Experimental: false\r\nOstype: linux\r\nServer: Docker/19.03.5 (linux)\r\nDate: Tue, 29 Nov 2022 03:23:34 GMT\r\nContent-Length: 173\r\n\r\n{\"message\":\"error creating overlay mount to /var/lib/docker/overlay2/3e0c84f188bbf3d039ec2454b32009e9eb1bc4d5b715d9f9993af250d019028b-init/merged: no space left on device\"}\n", 4096) = 395
The error comes back:
1669692214.340509 epoll_pwait(4, [], 128, 0, NULL, 64) = 0
1669692214.340605 futex(0x55ece2f7f0d0, FUTEX_WAKE_PRIVATE, 1) = 0
1669692214.340642 futex(0xc000402bc8, FUTEX_WAKE_PRIVATE, 1) = 1
With the main process's log done, we open the other threads' logs for comparison.
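Since every log line begins with an epoch timestamp, the per-thread files can also be merged into one timeline, which makes the cross-referencing easier (a rough sketch):
# grep -H '' dnospace.log.* | sort -t: -k2 -n | less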
Take dnospace.log.1858627 as an example: it POSTs the container-create API call, but the error itself never surfaces in this log, so it has to be read together with the others:
1669692214.326906 futex(0xc000402bc8, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.326970 write(3, "POST /v1.40/containers/create?name=aestools-perf HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/19.03.5 (linux)\r\nContent-Length: 1580\r\nContent-Type: application/json\r\n\r\n{\"Hostname\":\"\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":true,\"AttachStdout\":true,\"AttachStderr\":true,\"Tty\":true,\"OpenStdin\":true,\"StdinOnce\":true,\"Env\":[],\"Cmd\":null,\"Image\":\"registry-vpc.cn-beijing.aliyuncs.com/my-nettools/aestools:onlyperf\",\"Volumes\":{},\"WorkingDir\":\"\",\"Entrypoint\":null,\"OnBuild\":null,\"Labels\":{},\"HostConfig\":{\"Binds\":null,\"ContainerIDFile\":\"\",\"LogConfig\":{\"Type\":\"\",\"Config\":{}},\"NetworkMode\":\"default\",\"PortBindings\":{},\"RestartPolicy\":{\"Name\":\"no\",\"MaximumRetryCount\":0},\"AutoRemove\":true,\"VolumeDriver\":\"\",\"VolumesFrom\":null,\"CapAdd\":[\"CAP_SYS_ADMIN\"],\"CapDrop\":null,\"Capabilities\":null,\"Dns\":[],\"DnsOptions\":[],\"DnsSearch\":[],\"ExtraHosts\":null,\"GroupAdd\":null,\"IpcMode\":\"\",\"Cgroup\":\"\",\"Links\":null,\"OomScoreAdj\":0,\"PidMode\":\"\",\"Privileged\":true,\"PublishAllPorts\":false,\"ReadonlyRootfs\":false,\"SecurityOpt\":null,\"UTSMode\":\"\",\"UsernsMode\":\"\",\"ShmSize\":0,\"ConsoleSize\":[0,0],\"Isolation\":\"\",\"CpuShares\":0,\"Memory\":0,\"NanoCpus\":0,\"CgroupParent\":\"\",\"BlkioWeight\":0,\"BlkioWeightDevice\":[],\"BlkioDeviceReadBps\":null,\"BlkioDeviceWriteBps\":null,\"BlkioDeviceReadIOps\":null,\"BlkioDeviceWriteIOps\":null,\"CpuPeriod\":0,\"CpuQuota\":0,\"CpuRealtimePeriod\":0,\"CpuRealtimeRuntime\":0,\"CpusetCpus\":\"\",\"CpusetMems\":\"\",\"Devices\":[],\"DeviceCgroupRules\":null,\"DeviceRequests\":null,\"KernelMemory\":0,\"KernelMemoryTCP\":0,\"MemoryReservation\":0,\"MemorySwap\":0,\"MemorySwappiness\":-1,\"OomKillDisable\":false,\"PidsLimit\":0,\"Ulimits\":null,\"CpuCount\":0,\"CpuPercent\":0,\"IOMaximumIOps\":0,\"IOMaximumBandwidth\":0,\"MaskedPaths\":null,\"ReadonlyPaths\":null},\"NetworkingConfig\":{\"EndpointsConfig\":{}}}\n", 1752) = 1752
1669692214.327023 futex(0xc000402bc8, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.327063 futex(0xc0004a0148, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
1669692214.340792 epoll_pwait(4, [], 128, 0, NULL, 0) = 0
1669692214.340845 futex(0xc0000732c8, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.340998 epoll_pwait(4, [], 128, 0, NULL, 0) = 0
1669692214.341377 epoll_pwait(4, [], 128, 0, NULL, 1) = 0
1669692214.341821 +++ exited with 125 +++
The last thread's log carries the actual error output, but on its own it does not show what triggered the error:
# less dnospace.log.1858628
1669692214.309461 set_robust_list(0x7f0c9e5e99e0, 24) = 0
1669692214.309522 sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
1669692214.309553 sigaltstack({ss_sp=0xc0004ba000, ss_flags=0, ss_size=32768}, NULL) = 0
1669692214.309594 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
1669692214.309621 gettid() = 1858628
1669692214.309645 futex(0x55ece2f7f0c0, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.309676 futex(0xc0000732c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
1669692214.309741 futex(0xc0004a0148, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.310130 futex(0xc0000732c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
1669692214.311515 nanosleep({tv_sec=0, tv_nsec=3000}, NULL) = 0
1669692214.311600 futex(0xc0000732c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
1669692214.319108 futex(0xc000402bc8, FUTEX_WAKE_PRIVATE, 1) = 1
1669692214.320161 futex(0xc0000732c8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
1669692214.340990 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0ca8d61000
1669692214.341249 read(3, 0xc000562000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
1669692214.341459 write(2, "docker: Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2/3e0c84f188bbf3d039ec2454b32009e9eb1bc4d5b715d9f9993af250d019028b-init/merged: no space left on device.\nSee 'docker run --help'.\n", 221) = 221
1669692214.341606 exit_group(125) = ?
1669692214.341765 +++ exited with 125 +++
The strace logs never reveal the actual trigger; the problem sits beyond what strace can see, which is the point where bcc-tools comes in.
Syscall monitoring with bcc-tools:
The bcc-tools utility syscount traces and counts syscalls system-wide, and can filter by error code, PID and more. Start by looking at which syscalls fail while docker run executes:
# ./syscount -h
usage: syscount [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T TOP] [-x]
                [-e ERRNO] [-L] [-m] [-P] [-l]

Summarize syscall counts and latencies.

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace only this pid
  -i INTERVAL, --interval INTERVAL
                        print summary at this interval (seconds)
  -d DURATION, --duration DURATION
                        total duration of trace, in seconds
  -T TOP, --top TOP     print only the top syscalls by count or latency
  -x, --failures        trace only failed syscalls (return < 0)
  -e ERRNO, --errno ERRNO
                        trace only syscalls that return this error (numeric or
                        EPERM, etc.)
  -L, --latency         collect syscall latency
  -m, --milliseconds    display latency in milliseconds (default:
                        microseconds)
  -P, --process         count by process and not by syscall
  -l, --list            print list of recognized syscalls and exit
# pgrep docker
785
# ./syscount -x -p 785
Tracing failed syscalls, printing top 10... Ctrl+C to quit.
^C[12:03:47]
SYSCALL COUNT
epoll_ctl 8582
openat 1593
read 1130
newfstatat 210
futex 103
prctl 48
unlinkat 14
accept4 7
statfs 1
umount2 1
Detaching...
# ./syscount -x -p 785 -T 20
Tracing failed syscalls, printing top 20... Ctrl+C to quit.
^C[12:04:45]
SYSCALL COUNT
epoll_ctl 4220
openat 854
read 660
newfstatat 152
futex 108
prctl 32
unlinkat 14
accept4 7
statfs 3
mount 3
umount2 3
Detaching...
There are plenty of failures, but some interesting ones stand out: over three consecutive docker run attempts, mount, umount2 and statfs each failed three times, and that could be the key. With this many failing calls it is hard to analyze directly, so look up the kernel errno behind the error message:
# grep -ri "no space left on device" /usr/src/kernels/4.19.91-22.2.al7.x86_64/
/usr/src/kernels/4.19.91-22.2.al7.x86_64/include/uapi/asm-generic/errno-base.h:#define ENOSPC 28 /* No space left on device */
/usr/src/kernels/4.19.91-22.2.al7.x86_64/tools/include/uapi/asm-generic/errno-base.h:#define ENOSPC 28 /* No space left on device */
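Grepping the kernel headers works; when they are not installed, any Python resolves the same mapping (a sketch):
# python3 -c 'import errno, os; print(errno.ENOSPC, os.strerror(errno.ENOSPC))'
28 No space left on device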
Keep monitoring the failed calls with syscount, now filtering on that errno:
# ./syscount -e ENOSPC
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[12:06:43]
SYSCALL COUNT
mount 5
Detaching...
Adding -P shows which process makes the calls:
# ./syscount -e ENOSPC -P
Tracing syscalls, printing top 10... Ctrl+C to quit.
^C[12:07:40]
PID COMM COUNT
785 dockerd 4
After several more tests it is certain: the "No space left on device" error comes from dockerd's failing mount() calls.
Along the way I also found mountsnoop, an even handier tool for pinpointing this:
# ./mountsnoop
COMM PID TID MNT_NS CALL
dockerd 785 1167 4026531840 mount("overlay", "/var/lib/docker/overlay2/fe76e2433d8163579a5b300042cc8bc95bcf1df3aba165e4678b6a7e6dd4fe72-init/merged", "overlay", MS_NOSUID|MS_NOEXEC|MS_MANDLOCK|MS_NOATIME|MS_NODIRATIME|MS_MOVE|MS_SILENT|MS_POSIXACL|MS_SLAVE|MS_KERNMOUNT|MS_STRICTATIME|MS_LAZYTIME|0x55a238000100, "index=off,lowerdir=/var/lib/docker/overlay2/l/RHXV3UFPMZL3JFZEBSXNKK7Q5X:/var/lib/docker/overlay2/l/EY76M5SEOGNNNO7BK4ZTFPDEIX:/var/lib/docker/overlay2/l/VVZOQDCSCSPHSKUXKPZ4XVLFO3:/var/lib/docker/overlay2/l/ULFCVRF633MTMT3MUD6CLVUJV4:/var/lib/docker/overlay2/l/L7XJRIOMMUBZJV55F2GATE27FG,upperdir=/var/lib/docker/overlay2/fe76e2433d8163579a5b300042cc8bc95bcf1df3aba165e4678b6a7e6dd4fe72-init/diff,workdir=/var/l") = -ENOSPC
With mount confirmed as the problem, how do we trace the mount path further?
First, try a manual mount to see whether it reproduces:
# mount /dev/vda1 /mnt
mount: mount /dev/vda1 on /mnt failed: No space left on device
strace tells the same story, and the log is still not enough:
1669697839.414614 mount("/dev/vda1", "/mnt", "ext4", MS_MGC_VAL, NULL) = -1 ENOSPC (No space left on device)
This is where deeper kernel tracing has to come in.
# man 2 mount
MOUNT(2)                   Linux Programmer's Manual                  MOUNT(2)

NAME
       mount - mount file system

SYNOPSIS
       #include <sys/mount.h>

       int mount(const char *source, const char *target,
                 const char *filesystemtype, unsigned long mountflags,
                 const void *data);
# fs/namespace.c#L3026
int ksys_mount(char __user *dev_name, char __user *dir_name, char __user *type,
               unsigned long flags, void __user *data)
{
    int ret;
    char *kernel_type;
    char *kernel_dev;
    void *options;

    kernel_type = copy_mount_string(type);
    ret = PTR_ERR(kernel_type);
    if (IS_ERR(kernel_type))
        goto out_type;

    kernel_dev = copy_mount_string(dev_name);
    ret = PTR_ERR(kernel_dev);
    if (IS_ERR(kernel_dev))
        goto out_dev;

    options = copy_mount_options(data);
    ret = PTR_ERR(options);
    if (IS_ERR(options))
        goto out_data;

    ret = do_mount(kernel_dev, dir_name, kernel_type, flags, options);

    kfree(options);
out_data:
    kfree(kernel_dev);
out_dev:
    kfree(kernel_type);
out_type:
    return ret;
}

SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
                char __user *, type, unsigned long, flags, void __user *, data)
{
    return ksys_mount(dev_name, dir_name, type, flags, data);
}
From this call path, mount goes through ksys_mount. We can grep ksys_mount out of /proc/kallsyms and look at its neighbors for context; on 4.19 x86_64 the syscall entry should be __x64_sys_mount:
# grep -C5 ksys_mount /proc/kallsyms
ffffffffa32aa240 T finish_automount
ffffffffa32aa300 T copy_mount_options
ffffffffa32aa430 T copy_mount_string
ffffffffa32aa450 T do_mount
ffffffffa32ab060 T copy_mnt_ns
ffffffffa32ab310 T ksys_mount
ffffffffa32ab3e0 T __x64_sys_mount
ffffffffa32ab410 T __ia32_sys_mount
ffffffffa32ab430 T is_path_reachable
ffffffffa32ab480 t __se_sys_pivot_root
ffffffffa32ab7f0 T __x64_sys_pivot_root
My code-reading is not great, so I lean on tools to corroborate what I map out, e.g. the manual docker run and mount tests above. Tracing ksys_mount with bcc's trace tool shows the entry stack and confirms the function is the right one:
# ./trace ksys_mount -K
PID TID COMM FUNC
2539815 2539815 mount ksys_mount
ksys_mount+0x1 [kernel]
__x64_sys_mount+0x21 [kernel]
do_syscall_64+0x5b [kernel]
entry_SYSCALL_64_after_hwframe+0x44 [kernel]
With the rough flow __x64_sys_mount -> ksys_mount -> do_mount confirmed, we can bring in a new tool, funcgraph (which spares us hand-configuring ftrace), to see exactly which layer fails.
Deeper tracing with funcgraph:
# ./funcgraph -h
USAGE: funcgraph [-aCDhHPtT] [-m maxdepth] [-p PID] [-L TID] [-d secs] funcstring
    -a              # all info (same as -HPt)
    -C              # measure on-CPU time only
    -d seconds      # trace duration, and use buffers
    -D              # do not show function duration
    -h              # this usage message
    -H              # include column headers
    -m maxdepth     # max stack depth to show
    -p PID          # trace when this pid is on-CPU
    -L TID          # trace when this thread is on-CPU
    -P              # show process names & PIDs
    -t              # show timestamps
    -T              # comment function tails
  eg,
    funcgraph do_nanosleep          # trace do_nanosleep() and children
    funcgraph -m 3 do_sys_open      # trace do_sys_open() to 3 levels only
    funcgraph -a do_sys_open        # include timestamps and process name
    funcgraph -p 198 do_sys_open    # trace vfs_read() for PID 198 only
    funcgraph -d 1 do_sys_open >out # trace 1 sec, then write to file

See the man page and example file for more info.
funcgraph -m limits how many levels of child calls are traced; in practice the run hit the following error:
# ./funcgraph -m 5 __x64_sys_mount
Tracing "__x64_sys_mount"... Ctrl-C to end.
./funcgraph: line 212: echo: write error: Device or resource busy
ERROR: setting current_tracer to "function". Exiting.
Source:
https://github.com/brendangregg/perf-tools/blob/master/kernel/funcgraph
The failing line 212 writes the function_graph tracer into the kernel tracing interface, and doing the same by hand fails too:
# echo function_graph >/sys/kernel/debug/tracing/current_tracer
-bash: echo: write error: Device or resource busy
Use lsof to check whether something is holding it open and conflicting. current_tracer itself is not held open by anyone:
# lsof 2>/dev/null | grep /sys/kernel/debug/tracing/current_tracer
Nothing holds it, yet it cannot be modified. Strange. Shorten the path and search the whole tracing directory: the Yundun (cloud security agent) threads hold per-cpu trace pipes open, though in theory that should not conflict:
# lsof 2>/dev/null | grep /sys/kernel/debug/tracing
AliYunDun  1115 1550 root  21r  REG  0,11  0  28870  /sys/kernel/debug/tracing/instances/aegis/per_cpu/cpu1/trace_pipe_raw
resolver-  1115 2685 root  23r  REG  0,11  0  28886  /sys/kernel/debug/tracing/instances/aegis/per_cpu/cpu3/trace_pipe_raw
grpc_glob  1115 2686 root  20r  REG  0,11  0  28862  /sys/kernel/debug/tracing/instances/aegis/per_cpu/cpu0/trace_pipe_raw
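When lsof is unavailable, the same holders can be found by walking the /proc fd symlinks directly (a sketch, GNU find assumed):
# find /proc/[0-9]*/task/*/fd -lname '/sys/kernel/debug/tracing/*' -printf '%h -> %l\n' 2>/dev/null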
Because Yundun cannot be killed or stopped from inside the OS, the agent can only be uninstalled from the console; and Yundun is not the only process holding trace_pipe_raw open. After discussing it with the Yundun team, this was confirmed to be a kernel bug.
Issue: https://gitee.com/anolis/cloud-kernel/pulls/538
Temporary workarounds for the funcgraph conflict:
1. If only Yundun conflicts, uninstall its client from the console; this does not help when other services conflict too (shutting down every conflicting service is unrealistic).
2. Use alinux3 (5.10 kernel, verified).
Kernel versions that hit the bug:
# uname -r
4.19.91-22/26.1.al7.x86_64
A later discussion with the kernel team confirmed that current alinux2 builds do not carry the fix; it is expected to land in a 4.19.2xx release.
Brute force it:
When the clever routes are exhausted, it is time to read code.
Ten thousand words omitted here... I blindly probed a pile of mount-related functions and compared the results, and this one came under suspicion.
PS: when blindly probing to guess the path, it helps to first trace the kernel functions of a successful mount, then walk the normal path from the bottom up:
# ./trace count_mounts -n mount -K -U
PID TID COMM FUNC
988951 988951 mount count_mounts
count_mounts+0x1 [kernel]
attach_recursive_mnt+0x2a3 [kernel]
do_add_mount+0x88 [kernel]
do_mount+0x516 [kernel]
ksys_mount+0x7e [kernel]
__x64_sys_mount+0x21 [kernel]
do_syscall_64+0x5b [kernel]
entry_SYSCALL_64_after_hwframe+0x44 [kernel]
[unknown]
Look at the logic of count_mounts:
int count_mounts(struct mnt_namespace *ns, struct mount *mnt)
{
    unsigned int max = READ_ONCE(sysctl_mount_max);
    unsigned int mounts = 0, old, pending, sum;
    struct mount *p;

    for (p = mnt; p; p = next_mnt(p, mnt))
        mounts++;

    old = ns->mounts;
    pending = ns->pending_mounts;
    sum = old + pending;
    if ((old > sum) ||
        (pending > sum) ||
        (max < sum) ||
        (mounts > (max - sum)))
        return -ENOSPC;

    ns->pending_mounts = pending + mounts;
    return 0;
}
The logic is simple: it compares the namespace's mount counts against a maximum, so the key is sysctl_mount_max:
# cat /proc/sys/fs/mount-max
10
# sysctl -a |grep mount-max
fs.mount-max = 10
The kernel default is 100000:
# fs/namespace.c#L34
/* Maximum number of mounts in a mount namespace */
unsigned int sysctl_mount_max __read_mostly = 100000;
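So the cause is in sight: fs.mount-max on this node is 10, far below the number of mounts the namespace already holds, so the "max < sum" check trips and every mount() returns -ENOSPC. Restoring the default clears the error (persist it in /etc/sysctl.conf if needed):
# sysctl -w fs.mount-max=100000
fs.mount-max = 100000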
Reproducing on alinux3 with function_graph:
Reproduce the issue on alinux3 and inspect it with ftrace's function_graph. The 5.10 kernel takes a different path: there is no ksys_mount; it is path_mount instead:
# grep __x64_sys_mount -C5 /proc/kallsyms
ffffffffb93452f0 t do_move_mount
ffffffffb9345590 T __ia32_sys_move_mount
ffffffffb93456c0 T __x64_sys_move_mount
ffffffffb93457f0 T finish_automount
ffffffffb93459f0 T path_mount
ffffffffb9345e90 T __x64_sys_mount
ffffffffb9345fb0 T __ia32_sys_mount
ffffffffb93460d0 T do_mount
ffffffffb9346140 T copy_mnt_ns
ffffffffb9346400 T is_path_reachable
ffffffffb9346450 t __do_sys_pivot_root
funcgraph trace:
# ./funcgraph -m 2 __x64_sys_mount
Tracing "__x64_sys_mount"... Ctrl-C to end.
2) | __x64_sys_mount() {
2) 0.370 us | irq_enter_rcu();
2) 6.183 us | __sysvec_irq_work();
2) 0.360 us | irq_exit_rcu();
2) 0.854 us | strndup_user();
2) 0.852 us | strndup_user();
2) 0.125 us | copy_mount_options();
2) 1.697 us | user_path_at_empty();
2) 0.216 us | irq_enter_rcu();
2) 0.876 us | __sysvec_irq_work();
2) 0.521 us | irq_exit_rcu();
2) + 37.662 us | path_mount();
2) 0.570 us | path_put();
2) 0.677 us | kfree();
2) 0.701 us | kfree();
2) 0.220 us | kfree();
2) + 58.819 us | }
^C
Ending tracing...
Continue into path_mount, five levels deep:
# ./funcgraph -m 5 path_mount
Tracing "path_mount"... Ctrl-C to end.
6) | path_mount() {
6) | irq_enter_rcu() {
6) 0.141 us | irqtime_account_irq();
6) 0.634 us | }
6) | __sysvec_irq_work() {
6) | __wake_up() {
6) | __wake_up_common_lock() {
6) 0.154 us | _raw_spin_lock_irqsave();
6) 5.976 us | __wake_up_common();
6) 0.182 us | _raw_spin_unlock_irqrestore();
6) 7.055 us | }
6) 7.345 us | }
6) 7.847 us | }
6) | irq_exit_rcu() {
6) 0.131 us | irqtime_account_irq();
6) 0.129 us | idle_cpu();
6) 0.597 us | }
6) 0.207 us | security_sb_mount();
6) | ns_capable() {
6) | security_capable() {
6) 0.102 us | cap_capable();
6) 0.499 us | }
6) 0.714 us | }
6) | do_new_mount() {
6) | get_fs_type() {
6) | __get_fs_type() {
6) 0.199 us | _raw_read_lock();
6) 0.356 us | try_module_get();
6) 0.176 us | irq_enter_rcu();
6) 1.735 us | __sysvec_irq_work();
6) 0.242 us | irq_exit_rcu();
6) 6.072 us | }
6) 6.724 us | }
6) | fs_context_for_mount() {
6) | alloc_fs_context() {
6) 0.170 us | kmem_cache_alloc_trace();
6) 0.145 us | get_filesystem();
6) 0.376 us | __mutex_init();
6) 0.194 us | legacy_init_fs_context();
6) 1.860 us | }
6) 2.084 us | }
6) | put_filesystem() {
6) 0.121 us | module_put();
6) 0.325 us | }
6) | vfs_parse_fs_string() {
6) | kmemdup_nul() {
6) 0.882 us | __kmalloc_track_caller();
6) 1.572 us | }
6) | vfs_parse_fs_param() {
6) 0.728 us | lookup_constant();
6) 0.331 us | lookup_constant();
6) 0.348 us | vfs_parse_fs_param.part.0();
6) 2.770 us | }
6) 0.121 us | kfree();
6) 5.292 us | }
6) | parse_monolithic_mount_data() {
6) 0.120 us | legacy_parse_monolithic();
6) 0.424 us | }
6) | mount_capable() {
6) | capable() {
6) 0.170 us | security_capable();
6) 0.466 us | }
6) 0.659 us | }
6) | vfs_get_tree() {
6) | legacy_get_tree() {
6) 7.042 us | ext4_mount [ext4]();
6) 7.983 us | }
6) 0.289 us | security_sb_set_mnt_opts();
6) 9.372 us | }
6) | do_new_mount_fc() {
6) 0.124 us | security_sb_kern_mount();
6) 0.128 us | mount_too_revealing();
6) 0.112 us | up_write();
6) | vfs_create_mount() {
6) 5.001 us | alloc_vfsmnt();
6) 0.305 us | _raw_spin_lock();
6) 6.479 us | }
6) | mnt_warn_timestamp_expiry() {
6) 0.123 us | ktime_get_real_seconds();
6) 0.390 us | }
6) | lock_mount() {
6) 0.639 us | down_write();
6) 0.229 us | down_write();
6) 0.730 us | lookup_mnt();
6) 0.836 us | get_mountpoint();
6) 3.550 us | }
6) | do_add_mount() {
6) 0.779 us | graft_tree(); <-- focus here: it is a bit odd that the add-mount step returns nothing beyond this
6) 1.074 us | }
6) | unlock_mount() {
6) 0.112 us | _raw_spin_lock();
6) 0.369 us | __put_mountpoint.part.0();
6) 0.300 us | namespace_unlock();
6) 0.116 us | up_write();
6) 1.808 us | }
6) | mntput_no_expire() {
6) 0.116 us | _raw_spin_lock();
6) 0.188 us | rcu_read_unlock_strict();
6) 0.169 us | shrink_dentry_list();
6) 0.214 us | task_work_add();
6) 1.338 us | }
6) + 16.703 us | }
6) | put_fs_context() {
6) | dput() {
6) 0.334 us | rcu_read_unlock_strict();
6) 0.810 us | }
6) 0.145 us | deactivate_super();
6) | legacy_fs_context_free() {
6) 0.131 us | kfree();
6) 0.392 us | }
6) 0.119 us | security_free_mnt_opts();
6) | put_filesystem() {
6) 0.103 us | module_put();
6) 0.305 us | }
6) 0.126 us | kfree();
6) 0.121 us | kfree();
6) 3.390 us | }
6) + 46.481 us | }
6) + 59.293 us | }
^C
Ending tracing...
Read this together with graft_tree's code:
static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
{
    if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
        return -EINVAL;

    if (d_is_dir(mp->m_dentry) !=
        d_is_dir(mnt->mnt.mnt_root))
        return -ENOTDIR;

    return attach_recursive_mnt(mnt, p, mp, false);
}
Based on that result, trace graft_tree:
# ./funcgraph -m 5 graft_tree
Tracing "graft_tree"... Ctrl-C to end.
7) | graft_tree() {
7) | irq_enter_rcu() {
7) 0.158 us | irqtime_account_irq();
7) 0.577 us | }
7) | __sysvec_irq_work() {
7) | __wake_up() {
7) | __wake_up_common_lock() {
7) 0.123 us | _raw_spin_lock_irqsave();
7) 4.967 us | __wake_up_common();
7) 0.156 us | _raw_spin_unlock_irqrestore();
7) 5.866 us | }
7) 6.210 us | }
7) 6.790 us | }
7) | irq_exit_rcu() {
7) 0.220 us | irqtime_account_irq();
7) 0.207 us | idle_cpu();
7) 0.981 us | }
7) | attach_recursive_mnt() {
7) | get_mountpoint() {
7) 0.171 us | _raw_spin_lock();
7) 0.376 us | lookup_mountpoint();
7) 0.177 us | kfree();
7) 1.397 us | }
7) 0.136 us | count_mounts(); <-- this pretty much pins the problem on count_mounts
7) 0.159 us | _raw_spin_lock();
7) 2.546 us | }
7) + 13.737 us | }
# ./funcgraph -m 5 count_mounts
Tracing "count_mounts"... Ctrl-C to end.
2) | count_mounts() {
2) | irq_enter_rcu() {
2) 0.165 us | irqtime_account_irq();
2) 0.588 us | }
2) | __sysvec_irq_work() {
2) | __wake_up() {
2) | __wake_up_common_lock() {
2) 0.203 us | _raw_spin_lock_irqsave();
2) 3.845 us | __wake_up_common();
2) 0.127 us | _raw_spin_unlock_irqrestore();
2) 4.807 us | }
2) 5.007 us | }
2) 5.438 us | }
2) | irq_exit_rcu() {
2) 0.131 us | irqtime_account_irq();
2) 0.129 us | idle_cpu();
2) 0.593 us | }
2) 8.695 us | }
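As a cross-check, rather than inferring from funcgraph timings, a return probe on count_mounts with bcc's trace can filter for non-zero return values directly (a sketch in bcc trace's probe syntax):
# ./trace 'r::count_mounts (retval != 0) "count_mounts ret = %d", retval'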
Confirming the mount-loop bug:
I was about to have a firm word with the user, only to discover the user had been testing us all along.
The real problem is the shared-mount bug triggered by docker cp; the hand-tuned fs.mount-max was the customer's test...
https://github.com/moby/moby/pull/38993
Reproducing the docker cp issue:
Following the repro steps from the issue, it reproduces (bind-mounting the root directory is the key):
docker run --name my-mount-max -d -v /:/rootfs busybox sleep 3d
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker exec my-mount-max wc -l /proc/self/mountinfo
1645 /proc/self/mountinfo
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker exec my-mount-max wc -l /proc/self/mountinfo
3553 /proc/self/mountinfo
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker exec my-mount-max wc -l /proc/self/mountinfo
7369 /proc/self/mountinfo
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker exec my-mount-max wc -l /proc/self/mountinfo
15001 /proc/self/mountinfo
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker exec my-mount-max wc -l /proc/self/mountinfo
30265 /proc/self/mountinfo
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker exec my-mount-max wc -l /proc/self/mountinfo
60793 /proc/self/mountinfo
[root@iZ2ze1mh3b0oot4j1m3uhsZ ~]# docker cp my-mount-max:/etc/group /mnt
Error response from daemon: mount /:/var/lib/docker/overlay2/737429e3db385fada57a23d4dae34a41dee9c82fc12c6d6cfb3d4b5b8a43e585/merged/rootfs, flags: 0x5000: no space left on device
Because of the mount loop, once the root directory is bind-mounted into the container, every docker cp roughly doubles the container's mount points (1645 -> 3553 -> 7369 -> 15001 -> 30265 -> 60793). Even while still below the default 100000 threshold, once past about 50000 the next doubling needs more mounts than the remaining "balance", so it fails.
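On nodes with heavy bind-mount churn it is worth watching the mount count against the limit before it turns into ENOSPC; a minimal check (both paths are standard procfs):
# echo "mounts: $(wc -l < /proc/self/mountinfo) / fs.mount-max: $(cat /proc/sys/fs/mount-max)"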
Recommendation: use alinux2 + containerd.
===== Summary =====
Besides the familiar full-disk and full-inode causes, "no space left on device" now has a mount-max scenario in my notebook. Working through it deepened my understanding of the kernel and of how to wield each of these tools. Reading documentation alone gives no feel for this; you have to do it by hand to enjoy it. Practice brings true knowledge.