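How Prometheus raises an alert
- Suppose server A has been disconnected from the Prometheus server for more than one minute, which matches the liveness alert rule.
- Prometheus notifies Alertmanager that server A is unreachable.
- Alertmanager calls the DingTalk alerting plugin, which sends the alert.
- The DingTalk robot posts the message in the group chat.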
Nodes
- 172.50.13.101: Prometheus server
- 172.50.13.102: Alertmanager and the DingTalk alerting plugin
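Configuring Alertmanager
First, download Alertmanager from the official project: prometheus/alertmanager on GitHub.
Installation is simple: just extract the archive.
Contents of alertmanager.yml (the url under the receiver must be the URL of the DingTalk alerting plugin):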
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2h
  receiver: webhook
receivers:
  - name: webhook
    webhook_configs:
      - url: 'http://172.50.13.102:8060/dingtalk/webhook1/send'
        send_resolved: true
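To make `group_by: [alertname]` more concrete, here is a minimal Python sketch of how alerts could be bucketed into notification groups by label value. It is an illustration under stated assumptions, not Alertmanager's actual implementation: the function name `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (illustrative only)."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts that share the same values for every group_by label
        # end up in the same notification group.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical sample alerts; the instance addresses are made up.
alerts = [
    {"labels": {"alertname": "node", "instance": "10.0.0.1:9100"}},
    {"labels": {"alertname": "node", "instance": "10.0.0.2:9100"}},
    {"labels": {"alertname": "cpu", "instance": "10.0.0.1:9100"}},
]

groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    print(key, len(members))
```

With this grouping, two servers going down at about the same time would arrive as one DingTalk message rather than two. The timing fields then control delivery: group_wait is the initial wait before the first notification for a new group, group_interval is the wait before notifying about alerts newly added to that group, and repeat_interval is the wait before re-sending a notification that is still firing.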
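Configuring the DingTalk alerting plugin
Plugin download: the Releases page of timonwong/prometheus-webhook-dingtalk on GitHub.
If you only need it to work, extracting the archive is enough; the configuration file does not need to be modified.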
Configuring the supervisor daemon
vim /etc/supervisord.d/prometheus.ini
[program:alertmanager]
command=/usr/local/prometheus/alertmanager/alertmanager --storage.path="/home/data/prometheus/alertmanager/" --web.listen-address=":18081" --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --data.retention=120h --web.external-url=http://172.50.13.102:18081
directory=/usr/local/prometheus/alertmanager
autostart=true
startsecs=10
startretries=3
autorestart=true

[program:dingtalk]
command=/usr/local/prometheus/dingtalk/prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxx"
directory=/usr/local/prometheus/dingtalk
autostart=true
startsecs=10
startretries=3
autorestart=true
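Startup parameters:
- alertmanager:
  - storage.path: data storage path
  - web.listen-address: address and port to listen on
  - config.file: path to alertmanager.yml
  - data.retention: how long data is kept
  - web.external-url: the URL under which Alertmanager is reachable from outside; it is used for links back to the web page
- dingtalk:
  - ding.profile: replace the URL after "webhook1=" with the actual webhook of your DingTalk robot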
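The profile name in ding.profile is what ties the plugin to the url in alertmanager.yml. The following Python sketch shows that relationship, assuming the `name=webhook_url` form and the `/dingtalk/<name>/send` path seen in this setup. The helper `profile_endpoint` is hypothetical, and port 8060 is taken from the URL used in alertmanager.yml.

```python
def profile_endpoint(profile_arg, host, port=8060):
    """Split 'name=webhook_url' and build the local URL Alertmanager should call."""
    # partition() splits on the first '=' only, so the '=' inside the
    # access_token query string stays part of the webhook URL.
    name, sep, webhook_url = profile_arg.partition("=")
    if not sep or not name or not webhook_url:
        raise ValueError("expected the form name=webhook_url")
    local_url = f"http://{host}:{port}/dingtalk/{name}/send"
    return name, webhook_url, local_url


name, target, local_url = profile_endpoint(
    "webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxx",
    "172.50.13.102",
)
print(local_url)  # http://172.50.13.102:8060/dingtalk/webhook1/send
```

If the profile were renamed, the receiver url in alertmanager.yml would have to change to match, otherwise Alertmanager would post to a path the plugin does not serve.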
Connecting Prometheus and Alertmanager
The relevant parts of prometheus.yml:
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.50.13.102:18081']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alertrules/*_rules.yml"
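targets is the address of Alertmanager. rule_files lists the alert rule files; here it matches every file whose name ends in "_rules.yml" in the alertrules directory next to prometheus.yml.
Alert rule for liveness monitoring: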
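As a quick way to see which file names the pattern would pick up, here is a small Python sketch using the standard library's `fnmatchcase`. It only approximates Prometheus's own glob matching, and the non-matching file names are hypothetical examples.

```python
from fnmatch import fnmatchcase

PATTERN = "alertrules/*_rules.yml"

# live_rules.yml and perf_rules.yml come from this setup;
# the other two names are hypothetical counterexamples.
candidates = [
    "alertrules/live_rules.yml",
    "alertrules/perf_rules.yml",
    "alertrules/notes.yml",
    "first_rules.yml",
]

# Note: fnmatch's '*' may also match '/', which Prometheus's glob does not;
# that difference does not matter for the flat file names used here.
matched = [path for path in candidates if fnmatchcase(path, PATTERN)]
print(matched)  # ['alertrules/live_rules.yml', 'alertrules/perf_rules.yml']
```

A rule file that lacks the "_rules.yml" suffix or sits outside the alertrules directory would be ignored without any error.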
- alertrules/live_rules.yml
groups:
  - name: UP
    rules:
      - alert: node
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} has been unreachable for more than 1 minute!"
          summary: "{{ $labels.instance }} down"
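The effect of `for: 1m` can be sketched as a small state machine: the alert is pending while `up == 0` holds, and fires only once the condition has held for the full duration. This is a simplified Python model, not Prometheus code; the function `alert_states` is hypothetical, and the 15-second evaluation interval is an assumption, since the global settings are not shown here.

```python
def alert_states(up_samples, eval_interval_s, for_s):
    """Return the alert state after each evaluation of `up == 0` with a `for` duration."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, up in enumerate(up_samples):
        now = i * eval_interval_s
        if up == 0:
            if active_since is None:
                active_since = now
            # Fire only after the condition has held for the whole `for` window.
            held = now - active_since
            states.append("firing" if held >= for_s else "pending")
        else:
            # Any successful scrape resets the timer.
            active_since = None
            states.append("inactive")
    return states


# One healthy scrape, five failed ones, then recovery (assumed 15 s interval).
states = alert_states([1, 0, 0, 0, 0, 0, 1], eval_interval_s=15, for_s=60)
print(states)
```

In this model a brief outage shorter than one minute never leaves the pending state, so no notification reaches Alertmanager.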
Alert rules for load monitoring (memory usage, disk usage, and CPU usage):
- alertrules/perf_rules.yml
groups:
  - name: mem_product
    rules:
      - alert: mem_product
        # The job label value "生产服务器" (production servers) must match the job name in prometheus.yml.
        expr: (1 - (node_memory_MemAvailable_bytes{job="生产服务器"} / (node_memory_MemTotal_bytes{job="生产服务器"}))) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} memory usage has been above 90% for 5 minutes!"
          summary: "{{ $labels.instance }} memory usage too high!"
  - name: disk
    rules:
      - alert: disk
        expr: (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*"})) > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} disk usage has been above 95% for 5 minutes!"
          summary: "{{ $labels.instance }} disk usage too high!"
  - name: cpu
    rules:
      - alert: cpu
        # sum(...) by (instance) keeps only the instance label, so {{ $labels.job }} renders empty for this alert.
        expr: ((1 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance)) * 100) > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} {{ $labels.instance }} CPU usage has been above 70% for 5 minutes!"
          summary: "{{ $labels.instance }} CPU usage too high!"
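The arithmetic behind the three expressions can be checked by hand. The Python sketch below mirrors each formula with plain numbers; the function names and sample values are hypothetical and chosen only to show which side of each threshold a reading falls on.

```python
def memory_used_pct(mem_available, mem_total):
    """(1 - MemAvailable / MemTotal) * 100, as in the mem_product rule."""
    return (1 - mem_available / mem_total) * 100


def disk_used_pct(size, free, avail):
    """used * 100 / (avail + used), where used = size - free, as in the disk rule."""
    used = size - free
    return used * 100 / (avail + used)


def cpu_used_pct(idle_increase, total_increase):
    """(1 - idle seconds / total CPU seconds) * 100 over the window, as in the cpu rule."""
    return (1 - idle_increase / total_increase) * 100


GIB = 1024 ** 3

# Hypothetical readings.
mem = memory_used_pct(1 * GIB, 16 * GIB)                            # 93.75
disk = disk_used_pct(size=100 * GIB, free=10 * GIB, avail=5 * GIB)  # about 94.7
cpu = cpu_used_pct(idle_increase=75, total_increase=300)            # 75.0

print(f"memory {mem:.2f}% -> alert condition: {mem > 90}")
print(f"disk   {disk:.2f}% -> alert condition: {disk > 95}")
print(f"cpu    {cpu:.2f}% -> alert condition: {cpu > 70}")
```

The disk formula divides by avail plus used rather than by size, so space reserved for root is excluded from the denominator, which is closer to the percentage that `df` reports.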