Monitoring and Alerting on Business Logs

2017-08-31


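Summary:

Requirement: monitor the production business logs and alert immediately when the number of ERROR entries in a log reaches 30.
The monitoring scripts are deployed on the qd-inf-logcollector01 server (passwordless SSH trust from qd-inf-logcollector01 to the machines that run the business services must be set up in advance).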

1) Session on the qd-inf-logcollector01 machine:

 
    [root@qd-inf-logcollector01 ~]$ cd /app/script/
    [root@qd-inf-logcollector01 script]$ ll
    total 372
    -rwxr-xr-x 1 root root    446 Oct 13 13:02 celery-antiwater_ERROR_monit.sh
    -rwxr-xr-x 1 root root    442 Oct 13 13:04 celery-dislike_ERROR_monit.sh
    -rwxr-xr-x 1 root root    443 Oct 13 13:58 celery-flower_ERROR_monit.sh
    -rwxr-xr-x 1 root root    436 Oct 13 14:01 celery-report_ERROR_monit.sh
    -rwxr-xr-x 1 root root    436 Oct 13 14:03 consumer01_ERROR_monit.sh
    -rwxr-xr-x 1 root root    436 Oct 13 14:04 consumer02_ERROR_monit.sh
    -rwxr-xr-x 1 root root    437 Oct 13 14:06 imgsearch01_ERROR_monit.sh
    -rwxr-xr-x 1 root root    437 Oct 13 14:07 imgsearch02_ERROR_monit.sh
    -rwxr-xr-x 1 work work    857 Oct 12 14:31 sms.py                             // the SMS alert script (internal, so not reproduced here)
    -rwxr-xr-x 1 root root    435 Oct 13 14:08 snapshot01_ERROR_monit.sh
    -rwxr-xr-x 1 root root    435 Oct 13 14:10 snapshot02_ERROR_monit.sh
    -rwxr-xr-x 1 root root    433 Oct 13 14:11 xcspam01_ERROR_monit.sh
    -rwxr-xr-x 1 root root    433 Oct 13 14:12 xcspam02_ERROR_monit.sh
         
Two of the log monitoring scripts are shown below; the others are similar:
         
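    [root@qd-inf-logcollector01 script]$ cat celery-antiwater_ERROR_monit.sh
    #!/bin/bash
    # Poll the antiwater ERROR log on the business machine and send an SMS
    # when the previous minute contains 30 or more ERROR lines.
    while true
    do
        # Take the date and HH:MM from a single date call so both refer to
        # the same instant, including just after midnight.
        read -r TIME1 TIME2 <<< "$(date -d '-1 minute' '+%Y-%m-%d %H:%M')"
        NUM=$(ssh root@10.111.232.148 "grep -E '${TIME1}' /app/script/antiwater_ERROR.log | grep '${TIME2}' | wc -l")
        if [[ "${NUM}" -ge 30 ]]
        then
            python /app/script/sms.py antiwater_ERROR.log "qd-vpc-op-rule01 server produced ${NUM} antiwater ERRORs at ${TIME1} ${TIME2}! Please check and handle this alert as soon as possible!"
            sleep 600    # back off for 10 minutes after an alert
        else
            sleep 30     # otherwise check again in 30 seconds
        fi
    done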
         
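    [root@qd-inf-logcollector01 script]$ cat consumer01_ERROR_monit.sh
    #!/bin/bash
    # Poll the xcspam ERROR log on consumer01 and send an SMS when the
    # previous minute contains 30 or more ERROR lines.
    while true
    do
        # Take the date and HH:MM from a single date call so both refer to
        # the same instant, including just after midnight.
        read -r TIME1 TIME2 <<< "$(date -d '-1 minute' '+%Y-%m-%d %H:%M')"
        NUM=$(ssh root@10.111.232.225 "grep -E '${TIME1}' /app/script/ERROR.log | grep '${TIME2}' | wc -l")
        if [[ "${NUM}" -ge 30 ]]
        then
            python /app/script/sms.py xcspam_error_log "qd-vpc-op-consumer01 server produced ${NUM} xcspam ERRORs at ${TIME1} ${TIME2}! Please check and handle this alert as soon as possible!"
            sleep 600    # back off for 10 minutes after an alert
        else
            sleep 30     # otherwise check again in 30 seconds
        fi
    done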
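
The monitor scripts differ only in host, log path and label, so the counting step could be shared. The following is a minimal sketch of that idea, not the author's code: `previous_minute` and `count_minute_errors` are hypothetical names, GNU `date` is assumed, and it assumes (as the greps in the scripts do) that each log line carries the date and `HH:MM` in its timestamp. It counts in a local file so it can be tried without SSH.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative, not from the original scripts.

# Print "YYYY-MM-DD HH:MM" for one minute ago, using a single date call
# so that the date and the time always belong to the same instant (GNU date).
previous_minute() {
    date -d '-1 minute' '+%Y-%m-%d %H:%M'
}

# count_minute_errors FILE DAY MINUTE
# Print the number of lines in FILE that contain both DAY and MINUTE,
# treating both as fixed strings rather than regular expressions.
count_minute_errors() {
    local file=$1 day=$2 minute=$3
    grep -F -- "$day" "$file" | grep -F -c -- "$minute"
}
```

On the collector, the same two-stage grep could be passed to `ssh` as the remote command, and the printed count compared against 30 as in the scripts above.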
         
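Important:

These scripts need to run in the background, because once a large volume of ERROR entries is being written (tens of thousands of lines, for example), running them on a crontab schedule is no longer suitable.

[With crontab, a short interval such as once a minute produces a flood of alerts whenever ERRORs are numerous, while a long interval means a burst of ERRORs may not be reported promptly, which defeats the purpose of alerting.]

Running the scripts in the background is therefore the more reasonable choice: the first alert goes out as soon as the ERROR count reaches 30, and the sleep values in the script then control how often further alerts are sent. This also keeps the alerting timely.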
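
The "alert at once, then back off" behaviour can also be written as an explicit cooldown check. This is a minimal sketch under stated assumptions, not part of the original setup: `should_alert` is a hypothetical function, the state file holding the time of the last alert is an invented convention, and the defaults of 30 and 600 simply mirror the threshold and the long sleep used in the scripts.

```shell
#!/bin/bash
# should_alert COUNT NOW STATE_FILE [THRESHOLD] [COOLDOWN_SECONDS]
# Succeeds (exit 0) when COUNT has reached THRESHOLD and at least
# COOLDOWN_SECONDS have passed since the last recorded alert. In that
# case NOW (epoch seconds) is written to STATE_FILE as the new
# last-alert time. Otherwise it returns 1 and leaves STATE_FILE alone.
should_alert() {
    local count=$1 now=$2 state=$3 threshold=${4:-30} cooldown=${5:-600}
    local last=0
    [ "$count" -ge "$threshold" ] || return 1
    [ -s "$state" ] && last=$(cat "$state")
    [ $((now - last)) -ge "$cooldown" ] || return 1
    echo "$now" > "$state"
}
```

A caller would pass the current count and `$(date +%s)`, and send the SMS only when the function succeeds. With a check like this the polling interval and the alert frequency are set independently.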
         
    [root@qd-inf-logcollector01 script]$ nohup /bin/bash -x /app/script/celery-antiwater_ERROR_monit.sh &            // press Enter, then Ctrl+C; the script keeps running in the background
    [root@qd-inf-logcollector01 script]$ nohup /bin/bash -x /app/script/consumer01_ERROR_monit.sh &
    [root@qd-inf-logcollector01 script]$ ps -ef|grep monit.sh
    root     20574     1  0 14:03 ?        00:00:00 /bin/bash -x /app/script/consumer01_ERROR_monit.sh
    root     32568     1  0 13:03 ?        00:00:00 /bin/bash -x /app/script/celery-antiwater_ERROR_monit.sh

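2) Script setup on the business machines (the machines being monitored). The consumer01 machine is used as the example here; the scripts on the other machines are similar.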

 
    [root@qd-vpc-op-consumer01 ~]# cd /app/script/
    [root@qd-vpc-op-consumer01 script]# ll
    total 28
    -rwxr-xr-x 1 work work  35 Oct  9 17:02 empty_ERROR_log.sh
    -rw-r--r-- 1 work work   0 Oct 13 03:00 ERROR.log
    -rwxr-xr-x 1 work work 186 Oct  9 16:34 monit_log.sh
    [root@qd-vpc-op-consumer01 script]# cat monit_log.sh
    #!/bin/bash
    FILE="/data/log/xcspam/xcspam_error.log"
    # --line-buffered makes grep flush each match immediately; without it,
    # grep buffers output when writing to a file and matches arrive late.
    nohup /usr/bin/tail -f "$FILE" | grep --line-buffered ERROR >> /app/script/ERROR.log &
         
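Run this script once; the tail command then keeps running in the background.

From then on, every ERROR line written to xcspam_error.log is appended to ERROR.log.

This is what keeps the monitoring real-time.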
         
    [root@qd-vpc-op-consumer01 script]# /bin/bash -x /app/script/monit_log.sh
    [root@qd-vpc-op-consumer01 script]# ps -ef|grep tail
    root     11337     1  0 Oct09 ?        00:09:01 /usr/bin/tail -f /data/log/xcspam/xcspam_error.log
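
One detail of a `tail | grep >> file` pipeline affects how real-time it is: when grep writes to a file rather than a terminal, it buffers its output unless told otherwise, so a small number of matches can sit in memory before reaching ERROR.log. The filtering step is sketched below as a function, assuming GNU grep (for `--line-buffered`); `append_errors` is a hypothetical name.

```shell
#!/bin/bash
# append_errors OUT_FILE
# Read log lines on stdin and append those containing ERROR to OUT_FILE.
# --line-buffered (GNU grep) flushes after every matching line, so
# matches from a long-running `tail -f` reach OUT_FILE promptly.
append_errors() {
    grep --line-buffered ERROR >> "$1"
}
```

It would be used as `tail -f /data/log/xcspam/xcspam_error.log | append_errors /app/script/ERROR.log &`.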
         
As monitoring runs, ERROR.log keeps growing, so a cleanup script is needed:
         
    [root@qd-vpc-op-consumer01 script]# cat empty_ERROR_log.sh
    #!/bin/bash
    >/app/script/ERROR.log
    [root@qd-vpc-op-consumer01 script]# crontab -l
    0 3 */2 * * /bin/bash -x /app/script/empty_ERROR_log.sh > /dev/null 2>&1
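
The crontab entry empties ERROR.log at 03:00 on every second day of the month (days 1, 3, 5 and so on), whatever its size. A size-based variant is sketched below as an alternative, not as what the original setup does; `truncate_if_larger` and its byte limit are hypothetical. Like the original script it truncates the file in place, so the background grep keeps appending to the same file.

```shell
#!/bin/bash
# truncate_if_larger FILE MAX_BYTES
# Empty FILE in place when it is larger than MAX_BYTES; otherwise leave it.
# The file is truncated, not removed, so a process that has it open for
# appending (such as the background grep) keeps writing to the same file.
truncate_if_larger() {
    local file=$1 max=$2 size
    size=$(wc -c < "$file") || return 1
    if [ "$size" -gt "$max" ]; then
        : > "$file"
    fi
}
```

Called from cron at a shorter interval, for example `truncate_if_larger /app/script/ERROR.log 104857600`, this would bound the file at roughly 100 MB. That figure is an arbitrary illustration.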


This article is reposted from the 散尽浮华 blog on cnblogs (博客园). Original link: http://www.cnblogs.com/kevingrace/p/7661498.html. To republish it, please contact the original author.
