Nagios Plugin Development: Monitoring the Resources a Program Uses

2017-11-09

Introduction:

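Normally it is enough to monitor whether a program's process is still there. This time, however, we ran into the following: a program developed in-house still had its process running, but it had deadlocked. The impact was widespread, and worse, we had no idea where the problem was. It was colleagues from the testing department who helped find it, which was a real embarrassment for operations.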

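To avoid running into this again, I analyzed the symptoms of the deadlock and found that the deadlocked process used 100% of the CPU, whereas it normally uses less than 10%. I decided to write a Nagios plugin that monitors the resources the programs use, including CPU and memory.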

1. Shell script requirements:

It must accept thresholds for CPU and memory, and raise an alert when resource usage exceeds a threshold.
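
CPU and memory percentages are decimals such as 5.6, which the shell's integer tests cannot compare. The following is a minimal sketch of a threshold check, assuming `awk` is available. The helper name `exceeds` is hypothetical and is not part of the script in section 3, which uses `bc` for the same purpose.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script).
# Succeeds (status 0) when value $1 is strictly greater than threshold $2.
# awk compares the two as numbers, so 9.4 is correctly below 10.
exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value > limit) }'
}

if exceeds 95.3 50; then
    echo "above threshold"
else
    echo "within threshold"
fi
```

Returning a status instead of printing 0 or 1 lets the check sit directly in an `if`, which keeps the calling code short.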

It must check whether each process exists, and raise an alert if any one of them does not.
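
One way to sketch the existence check is to count exact matches on the command-name column of `ps`, which avoids matching substrings or the `grep` process itself. The helper name `count_named` is hypothetical. The sketch assumes the process names are short enough not to be truncated in that column (on Linux the limit is 15 characters, which the names used in this article satisfy).

```shell
#!/bin/bash
# Hypothetical helper (not from the original script).
# Reads one command name per line on stdin and prints how many lines
# are exactly equal to $1.
count_named() {
    grep -Fxc -- "$1" || true   # grep exits 1 on zero matches; the count is still printed
}

# Typical use: feed it the command-name column of ps.
#   running=$(ps -e -o comm= | count_named VueSERVER)

# Demonstration on a fixed list of names:
printf '%s\n' VueSERVER VueCache bash | count_named VueSERVER
```

A count of 0 means the process is down; the plugin can then add its name to the list of missing processes.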

2. The shell script behaves as follows:

1. If the arguments are malformed, it prints help information

[root@center230 libexec]# sh component_resource.sh

Usage parament:

component_resource.sh [--cpu] [--mem]

Example:

component_resource.sh --cpu 50 --mem 50
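
The usage above accepts `--cpu` and `--mem` in either order. A common way to handle that is a `while`/`shift` loop; the sketch below is one possible form, not the approach the script in section 3 takes. The function name `parse_args` is hypothetical, while `cpu_crit` and `mem_crit` are the variable names the script uses.

```shell
#!/bin/bash
# Hypothetical helper: sets cpu_crit and mem_crit from "--cpu N --mem N"
# in any order. Returns non-zero if an option is unknown, a value is
# missing, or either threshold was not supplied.
parse_args() {
    cpu_crit=""
    mem_crit=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --cpu) [ $# -ge 2 ] || return 1; cpu_crit=$2; shift 2 ;;
            --mem) [ $# -ge 2 ] || return 1; mem_crit=$2; shift 2 ;;
            *)     return 1 ;;
        esac
    done
    # Both thresholds are mandatory.
    [ -n "$cpu_crit" ] && [ -n "$mem_crit" ]
}

if parse_args --mem 50 --cpu 40; then
    echo "cpu=$cpu_crit mem=$mem_crit"
fi
```

When `parse_args` fails, the caller would print the help text shown above.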

2. If no threshold is exceeded, it prints the resource usage and exits with status 0

[root@center230 libexec]# sh component_resource.sh --cpu 50 --mem 50

VueSERVER_cpu_use=5.6% VueCache_cpu_use=1.9% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%;VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%

[root@center230 libexec]# echo $?

3. If a threshold is exceeded, it prints the resource usage and exits with status 2

[root@center230 libexec]# sh component_resource.sh --cpu 5 --mem 5

VueSERVER_cpu_use=9.4% VueCache_cpu_use=0.0% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%;VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%

[root@center230 libexec]# echo $?

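4. If a process does not exist, it prints the processes that are down together with the resource usage of the processes still running, and exits with status 2

[root@yckj scripts]# sh component_resource.sh --cpu 50 --mem 50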

Current VueDaemon VueCenter VueAgent VueCache VueSERVER is down.

[root@yckj scripts]# echo $?
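
The exit statuses in the runs above follow the Nagios plugin convention, in which 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN. As a small sketch, the mapping can be kept in one place; the function name `report` and the "LABEL - message" output format are assumptions for illustration, not something the original script does.

```shell
#!/bin/bash
# Hypothetical helper: prints "<STATE> - <message>" and returns the
# matching Nagios status code. Any unrecognized code is treated as UNKNOWN.
report() {
    local code=$1
    shift
    local label
    case "$code" in
        0) label=OK ;;
        1) label=WARNING ;;
        2) label=CRITICAL ;;
        *) label=UNKNOWN; code=3 ;;
    esac
    echo "$label - $*"
    return "$code"
}

# In a plugin the last two lines would typically be:
#   report 2 "VueSERVER_cpu_use=9.4%"
#   exit $?
report 0 "VueSERVER_cpu_use=5.6%"
```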

3. The shell script code is as follows:

 
        [root@center230 libexec]# cat component_resource.sh
        #!/bin/bash
        # author: yangrong
        # date: 2014-05-20
        # mail: 10286460@qq.com

        # bash is required: the script uses arrays and [[ ]].
        #pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER VUEConnector Myswitch Slirpvde)
        pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER)

        #### Read the CPU and memory thresholds (either option may come first) ####
        case $1 in
            --cpu) cpu_crit=$2 ;;
            --mem) mem_crit=$2 ;;
        esac
        case $3 in
            --cpu) cpu_crit=$4 ;;
            --mem) mem_crit=$4 ;;
        esac

        ### Validate the arguments: var=1 if the same option is given twice
        ### or the argument count is not 4; var=0 means the arguments are fine
        if [[ $1 == "$3" ]]; then
            var=1
        elif [ $# -ne 4 ]; then
            var=1
        else
            var=0
        fi

        ### Print the usage message
        if [ $var -eq 1 ]; then
            echo "Usage parament:"
            echo "    $0 [--cpu] [--mem]"
            echo ""
            echo "Example:"
            echo "    $0 --cpu 50 --mem 50"
            exit 3    # 3 is UNKNOWN in the Nagios plugin convention
        fi

        ### Collect the processes that are not running into one variable
        num=$(( ${#pragrom_list[@]} - 1 ))
        NotExist=""
        for digit in $(seq 0 $num); do
            a=$(ps -ef | grep -v grep | grep "${pragrom_list[$digit]}" | wc -l)
            if [ "$a" -eq 0 ]; then
                NotExist="$NotExist ${pragrom_list[$digit]}"
                # Drop the missing process so the loop below skips it
                unset "pragrom_list[$digit]"
            fi
        done
        #echo "pragrom_list=${pragrom_list[@]}"

        #### Compare each process's resource usage with the thresholds
        cpu_use_all=""
        mem_use_all=""
        compare_cpu_temp=0
        compare_mem_temp=0
        for n in "${pragrom_list[@]}"; do
            # Columns 9 and 10 of top's batch output are %CPU and %MEM;
            # only the first matching line is used
            cpu_use=$(top -b -n1 | grep "$n" | awk '{print $9; exit}')
            mem_use=$(top -b -n1 | grep "$n" | awk '{print $10; exit}')
            if [[ $cpu_use == "" ]]; then
                cpu_use=0
            fi
            if [[ $mem_use == "" ]]; then
                mem_use=0
            fi
            # bc prints 1 when the comparison is true, 0 otherwise
            compare_cpu=$(echo "$cpu_use > $cpu_crit" | bc)
            compare_mem=$(echo "$mem_use > $mem_crit" | bc)
            if [[ $compare_cpu == 1 ]]; then
                compare_cpu_temp=1
            fi
            if [[ $compare_mem == 1 ]]; then
                compare_mem_temp=1
            fi
            cpu_use_all="${n}_cpu_use=${cpu_use}% ${cpu_use_all}"
            mem_use_all="${n}_mem_use=${mem_use}% ${mem_use_all}"
        done

        ### If NotExist is non-empty, at least one process is down: exit with 2
        if [[ "$NotExist" != "" ]]; then
            echo -e "Current ${NotExist} is down.$cpu_use_all;$mem_use_all"
            exit 2
        ### If the CPU flag is 1, some process exceeds the CPU threshold: exit with 2
        elif [[ "$compare_cpu_temp" == 1 ]]; then
            echo -e "$cpu_use_all;$mem_use_all"
            exit 2
        ## If the memory flag is 1, some process exceeds the memory threshold: exit with 2
        elif [[ $compare_mem_temp == 1 ]]; then
            echo -e "$cpu_use_all;$mem_use_all"
            exit 2
        ## Otherwise everything is normal: print the CPU and memory percentages
        else
            echo -e "$cpu_use_all;$mem_use_all"
            exit 0
        fi
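
The script runs `top` once for the CPU column and once for the memory column of every process. As a hedged alternative, both values can be read from a single snapshot. The sketch below assumes the default procps `top` batch layout, where column 9 is %CPU, column 10 is %MEM and column 12 is the command name; the function name `extract_usage` and the sample figures are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reads "top -b -n1" style output on stdin and prints
# "<%CPU> <%MEM>" for the first line whose COMMAND column equals $1,
# or "0 0" when no such line exists.
extract_usage() {
    awk -v name="$1" '
        $12 == name { print $9, $10; found = 1; exit }
        END { if (!found) print "0 0" }
    '
}

# Typical use (one snapshot shared by all lookups):
#   snapshot=$(top -b -n1)
#   read -r cpu mem <<< "$(printf "%s\n" "$snapshot" | extract_usage VueSERVER)"

# Demonstration on made-up sample lines in the assumed column layout:
sample='  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1234 root      20   0  100000  20000   3000 S   5.6  0.2   1:23.45 VueSERVER
 1300 root      20   0  500000 750000   4000 S   1.9  7.4   0:10.00 VueCache'
printf '%s\n' "$sample" | extract_usage VueSERVER
```

Matching on the whole command column also avoids picking up processes whose names merely contain the search string.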

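4. Afterword:

As I have been writing more and more shell scripts lately, I inevitably have to go back and modify scripts I wrote earlier, and it often takes a while of reading before I understand them again.

To make later maintenance easier, every function and every functional block in a script should carry a comment, so that I or someone else can maintain it in the future.

This article is reposted from 杨云1028's 51CTO blog. Original link: http://blog.51cto.com/yangrong/1414345. To republish it, please contact the original author.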
