Monitoring with the top command showed more than 7,000 zombie processes on the host.
The first step is to list those zombie processes:
[root@izbp152ke14timzud0du15z ~]# ps -ef | grep defunct | more
root       303   566  0 01:25 ?        00:00:00 [python] <defunct>
root       310   566  0 01:44 ?        00:00:00 [python] <defunct>
root       313   566  0 13:02 ?        00:00:00 [python] <defunct>
root       316   566  0 09:27 ?        00:00:00 [python] <defunct>
root       319   566  0 05:08 ?        00:00:00 [python] <defunct>
root       329   566  0 02:22 ?        00:00:00 [python] <defunct>
root       331   566  0 04:13 ?        00:00:00 [python] <defunct>
root       332   566  0 05:26 ?        00:00:00 [python] <defunct>
root       334   566  0 03:55 ?        00:00:00 [python] <defunct>
root       353   566  0 Nov01 ?        00:00:00 [python] <defunct>
root       354   566  0 07:33 ?        00:00:00 [python] <defunct>
root       356   566  0 09:07 ?        00:00:00 [python] <defunct>
root       363   566  0 Nov01 ?        00:00:00 [python] <defunct>
root       366   566  0 11:23 ?        00:00:00 [python] <defunct>
root       372   566  0 06:38 ?        00:00:00 [python] <defunct>
root       377   566  0 11:43 ?        00:00:00 [python] <defunct>
root       378   566  0 14:39 ?        00:00:00 [python] <defunct>
root       379   566  0 11:03 ?        00:00:00 [python] <defunct>
root       390   566  0 01:06 ?        00:00:00 [python] <defunct>
root       391   566  0 14:01 ?        00:00:00 [python] <defunct>
root       395   566  0 Nov01 ?        00:00:00 [python] <defunct>
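The third column of the output, 566, is the parent process ID (PPID) shared by all of the zombies. Next, check which process that is: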
# top -p 566
top - 15:18:23 up 21:21,  1 user,  load average: 0.56, 0.43, 0.56
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.5 us,  5.0 sy,  0.0 ni, 90.2 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  7732980 total,   800156 free,  3259668 used,  3673156 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4162612 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  566 root      20   0 1254132 538104  14288 S   0.7  7.0  35:53.07 datakit
The parent turns out to be a process named datakit, so the next step is to go through its code and track down the problem.
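The lookup above (list the defunct processes, then read their parent PID) can also be done programmatically. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper names parseStat and zombiesByParent are hypothetical and are not part of datakit. It groups zombie processes by parent PID, so a parent such as 566 stands out immediately.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseStat extracts the state letter and parent PID from the contents of
// /proc/<pid>/stat, whose layout is "pid (comm) state ppid ...".
// comm may contain spaces or parentheses, so we split after the last ')'.
func parseStat(stat string) (state string, ppid int, err error) {
	end := strings.LastIndex(stat, ")")
	if end < 0 {
		return "", 0, fmt.Errorf("malformed stat line: %q", stat)
	}
	fields := strings.Fields(stat[end+1:])
	if len(fields) < 2 {
		return "", 0, fmt.Errorf("too few fields in stat line: %q", stat)
	}
	ppid, err = strconv.Atoi(fields[1])
	if err != nil {
		return "", 0, err
	}
	return fields[0], ppid, nil
}

// zombiesByParent counts processes in state "Z" (zombie) per parent PID.
// procRoot is normally "/proc".
func zombiesByParent(procRoot string) map[int]int {
	counts := make(map[int]int)
	paths, _ := filepath.Glob(filepath.Join(procRoot, "[0-9]*", "stat"))
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process may have exited while we were scanning
		}
		state, ppid, err := parseStat(string(data))
		if err == nil && state == "Z" {
			counts[ppid]++
		}
	}
	return counts
}

func main() {
	for ppid, n := range zombiesByParent("/proc") {
		fmt.Printf("parent %d has %d zombie children\n", ppid, n)
	}
}
```

Splitting on the last closing parenthesis matters because the command name in /proc/<pid>/stat is not escaped; a naive split on spaces would misread processes whose names contain spaces.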
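As an immediate remedy, kill the parent process. Once the parent is gone, its zombie children are adopted by init (PID 1), which reaps them and frees their process-table entries: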
# kill -9 566
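But is that the end of the story? Of course not: the root cause still has to be found.
After datakit is restarted, it starts producing zombie processes again. The process tree shows that the top-level parent is still datakit, with a large number of defunct Python processes underneath it: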
 6206 ?        Ssl    0:50 /usr/local/datakit/datakit
 6550 ?        Sl     0:00  \_ /usr/local/datakit/externals/oracle --interval 1m --host <your-oracle-host> --port 1521 --userna
 6721 ?        Z      0:00  \_ [python] <defunct>
 6951 ?        Z      0:00  \_ [python] <defunct>
 7179 ?        Z      0:00  \_ [python] <defunct>
 7356 ?        Z      0:00  \_ [python] <defunct>
 7550 ?        Z      0:00  \_ [python] <defunct>
 7729 ?        Z      0:00  \_ [python] <defunct>
 7920 ?        Z      0:00  \_ [python] <defunct>
 8100 ?        Z      0:00  \_ [python] <defunct>
 8277 ?        Z      0:00  \_ [python] <defunct>
 8466 ?        Z      0:00  \_ [python] <defunct>
 8666 ?        Z      0:00  \_ [python] <defunct>
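Now that we know datakit is spawning the Python zombies, the next step is to search its code for the places that launch Python processes. With this many zombies, there is probably more than one call site, or the call sits inside a loop.
Sure enough, a search turned up the following function: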
func buildExternals(outdir, goos, goarch string) {
	curOSArch := runtime.GOOS + "/" + runtime.GOARCH

	for _, ex := range externals {
		l.Debugf("building %s-%s/%s", goos, goarch, ex.name)

		if _, ok := ex.osarchs[curOSArch]; !ok {
			l.Warnf("skip build %s under %s", ex.name, curOSArch)
			continue
		}

		osarch := goos + "/" + goarch
		if _, ok := ex.osarchs[osarch]; !ok {
			l.Warnf("skip build %s under %s", ex.name, osarch)
			continue
		}

		out := ex.name

		switch strings.ToLower(ex.lang) {
		case "go", "golang":
			switch osarch {
			case "windows/amd64", "windows/386":
				out += ".exe"
			default: // pass
			}

			args := []string{
				"go", "build",
				"-o", filepath.Join(outdir, "externals", out),
				"-ldflags",
				"-w -s",
				filepath.Join("plugins", "externals", ex.name, ex.entry),
			}

			ex.envs = append(ex.envs, "GOOS="+goos, "GOARCH="+goarch)

			msg, err := runEnv(args, ex.envs)
			if err != nil {
				l.Fatalf("failed to run %v, envs: %v: %v, msg: %s", args, ex.envs, err, string(msg))
			}

		case "makefile", "Makefile":
			args := []string{
				"make",
				"--file=" + filepath.Join("plugins", "externals", ex.name, ex.entry),
				"OUTPATH=" + filepath.Join(outdir, "externals", out),
				"BASEPATH=" + "plugins/externals/" + ex.name,
			}

			ex.envs = append(ex.envs, "GOOS="+goos, "GOARCH="+goarch)

			msg, err := runEnv(args, ex.envs)
			if err != nil {
				l.Fatalf("failed to run %v, envs: %v: %v, msg: %s", args, ex.envs, err, string(msg))
			}

		default: // for python, just copy source code into build dir
			ex.buildArgs = append(ex.buildArgs, filepath.Join(outdir, "externals"))
			cmd := exec.Command(ex.buildCmd, ex.buildArgs...) //nolint:gosec
			if ex.envs != nil {
				cmd.Env = append(os.Environ(), ex.envs...)
			}

			res, err := cmd.CombinedOutput()
			if err != nil {
				l.Fatalf("failed to build python(%s %s): %s, err: %s",
					ex.buildCmd, strings.Join(ex.buildArgs, " "), res, err.Error())
			}
		}
	}
}
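Inside the for loop of this function, the default branch of the switch statement launches an external command through exec.Command. When a parent creates a child process and never waits on it (that is, never collects its exit status), the child stays in the process table as a zombie after it exits.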
default: // for python, just copy source code into build dir
	ex.buildArgs = append(ex.buildArgs, filepath.Join(outdir, "externals"))
	cmd := exec.Command(ex.buildCmd, ex.buildArgs...) //nolint:gosec
	if ex.envs != nil {
		cmd.Env = append(os.Environ(), ex.envs...)
	}

	res, err := cmd.CombinedOutput()
	if err != nil {
		l.Fatalf("failed to build python(%s %s): %s, err: %s",
			ex.buildCmd, strings.Join(ex.buildArgs, " "), res, err.Error())
	}
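To make the zombie mechanism concrete, here is a small sketch that is independent of datakit. It assumes a Linux host with /proc mounted and sh on the PATH; procState and observeZombie are hypothetical helper names. The program starts a short-lived child with cmd.Start(), checks that the exited child sits in state Z for as long as nobody waits on it, and then confirms that cmd.Wait() removes it.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// procState returns the state letter from /proc/<pid>/stat (Linux only),
// or "" if the process no longer exists.
func procState(pid int) string {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return ""
	}
	s := string(data)
	// The state is the first field after the "(comm)" part.
	fields := strings.Fields(s[strings.LastIndex(s, ")")+1:])
	if len(fields) == 0 {
		return ""
	}
	return fields[0]
}

// observeZombie starts a child that exits immediately and reports its
// state before and after the parent calls Wait.
func observeZombie() (before, after string, err error) {
	cmd := exec.Command("sh", "-c", "exit 0")
	if err = cmd.Start(); err != nil {
		return "", "", err
	}
	pid := cmd.Process.Pid

	// Give the child up to about two seconds to exit; until we call Wait
	// it cannot leave the process table, so it shows up as "Z".
	for i := 0; i < 200 && procState(pid) != "Z"; i++ {
		time.Sleep(10 * time.Millisecond)
	}
	before = procState(pid)

	err = cmd.Wait() // collects the exit status and reaps the child
	after = procState(pid)
	return before, after, err
}

func main() {
	before, after, err := observeZombie()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("state before Wait: %q, after Wait: %q\n", before, after)
}
```

A long-running daemon that repeats the Start step on a timer without ever reaching Wait would accumulate one such entry per launch, which matches the steadily growing list seen in the process tree.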
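To prevent zombies, every child process that is started must also be waited on. For a command launched with cmd.Start(), add the following call. Note that cmd.Run(), cmd.Output() and cmd.CombinedOutput() already call Wait internally, so calling Wait again after one of them only returns an error; the call belongs on the code paths that use Start.
if err := cmd.Wait(); err != nil {
	fmt.Println(err)
}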