Prometheus alerting rule examples

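Summary: Some detailed Prometheus alerting configurations taken from practical use. Each "groups:" block below is a separate rule file; to keep several in one file, use a single top-level "groups:" key and list every group under it.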


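groups:
- name: "bandwidth-check"
  rules:
  - alert: "BandwidthAlert"
    # Transmit rate in bytes/s; 20 * 1000 * 1000 is 20 MB/s.
    # Use 20 * 1000 * 1000 / 8 instead if the limit is meant as 20 Mbit/s.
    expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 20 * 1000 * 1000
    for: 30s
    annotations:
      summary: "{{ $labels.job }} - {{ $labels.instance }} bandwidth above 20 MB/s"
      description: "Prometheus alert: bandwidth above 20 MB/s\n Hostname: {{ $labels.hostname }}\n IP: {{ $labels.ip }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Availability zone: {{ $labels.region }}\n Product line: {{ $labels.product }}\n"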
      
groups:
- name: "cpu-check"
  rules:
  - alert: "CpuUsageAlert"
    # job is kept in the by() clause so that {{ $labels.job }} is still available in the annotations.
    expr: (100 - (avg(irate(node_cpu_seconds_total{mode="idle",job="node_exporter_alert"}[15m])) by (job,instance,hostname,region,app,product)) * 100) > 95
    for: 5m
    annotations:
      value: " {{ $value }} "
      summary: "{{ $labels.job }} - {{ $labels.instance }} CPU usage above 95%"
      description: "Prometheus alert: CPU usage above 95%\n Hostname: {{ $labels.hostname }}\n IP: {{ $labels.instance }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Availability zone: {{ $labels.region }}\n Product line: {{ $labels.product }}\n"
      
groups:
- name: disk-alert
  rules:
  - alert: NodeDiskUsage
    # Used space as a percentage of filesystem size.
    expr: (node_filesystem_size_bytes{job="node_exporter_alert"} - node_filesystem_avail_bytes{job="node_exporter_alert"}) / node_filesystem_size_bytes{job="node_exporter_alert"} * 100 > 85
    for: 1m
    labels:
      severity: high
    annotations:
      value: " {{ $value }} "
      description: "Prometheus alert: disk usage above 85%\n Hostname: {{ $labels.hostname }}\n IP: {{ $labels.ip }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Availability zone: {{ $labels.region }}\n Product line: {{ $labels.product }}\n"
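
The disk rule above evaluates every filesystem that node_exporter reports, including pseudo filesystems such as tmpfs. A variant that skips them might look like the sketch below. The fstype exclusion list, the group name, and the alert name NodeDiskUsageRealFs are illustrative assumptions, not part of the original rules.

```yaml
groups:
- name: disk-alert-filtered            # hypothetical group name
  rules:
  - alert: NodeDiskUsageRealFs         # hypothetical alert name
    # Same used-space percentage as the rule above, restricted to
    # non-pseudo filesystems. The fstype list is an assumption;
    # adjust it to the hosts being monitored.
    expr: |
      (
        node_filesystem_size_bytes{job="node_exporter_alert",fstype!~"tmpfs|overlay|squashfs"}
        - node_filesystem_avail_bytes{job="node_exporter_alert",fstype!~"tmpfs|overlay|squashfs"}
      )
      / node_filesystem_size_bytes{job="node_exporter_alert",fstype!~"tmpfs|overlay|squashfs"}
      * 100 > 85
    for: 1m
    labels:
      severity: high
```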
      
groups:
- name: "dnsmasq_check"
  rules:
  - alert: "dnsmasq check"
    expr: namedprocess_namegroup_num_procs == 0
    for: 10s
    annotations:
      summary: "dnsmasq down"
      description: " dnsmasq , job: "
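
As written, the expression above matches any process group exported by process-exporter, not only dnsmasq. A sketch that narrows it down follows; it assumes process-exporter is configured with a group named dnsmasq (the original does not show that configuration), and the description text is a placeholder.

```yaml
groups:
- name: "dnsmasq_check"
  rules:
  - alert: "dnsmasq check"
    # groupname="dnsmasq" is an assumption about the process-exporter config.
    expr: namedprocess_namegroup_num_procs{groupname="dnsmasq"} == 0
    for: 10s
    annotations:
      summary: "dnsmasq down"
      description: "dnsmasq has no running processes on {{ $labels.instance }}"
```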
      
groups:
- name: dns-resolv-alarm
  rules:
  - alert: dns_resolv_error
    expr: dns{job="pushgateway"}  == 0
    for: 1m
    labels:
      team: op
    annotations:
      summary: "[DNS resolution alert] [{{ $labels.exported_instance }}] domain name resolution failed"
      description: "[DNS resolution alert] [{{ $labels.exported_instance }}] domain name resolution alert"
      
groups:
- name: "http-check-rules"
  rules:
  - alert: "HttpServiceCheck"
    expr: probe_success{job="blackbox-http"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Prometheus alert: HTTP check\n Instance: {{ $labels.instance }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Product line: {{ $labels.product }}\n"
      
groups:
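- name: "java-check-rules"
  rules:
  - alert: "JavaServiceCheck"
    # Note: count() over an empty vector returns no result rather than 0,
    # so this comparison does not match when the metric disappears entirely.
    expr: count without (name,version)(irate(jmx_exporter_build_info[1m])) == 0
    for: 1m
    labels:
      env: dev
    annotations:
      description: "Java application job: {{ $labels.job }}\n instance: {{ $labels.instance }}"
      summary: "Java check"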
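
One common way to detect a Java process whose JMX exporter has stopped answering is to alert on the built-in up metric, which Prometheus sets to 0 for every target it fails to scrape. The following is a sketch only: the job name java-app, the group name, and the alert name are assumptions that have to be replaced with the values from the actual scrape configuration.

```yaml
groups:
- name: "java-target-check"            # hypothetical group name
  rules:
  - alert: "JavaTargetDown"            # hypothetical alert name
    # up is 1 when the last scrape succeeded and 0 when it failed.
    # job="java-app" is an assumed job name for the JMX exporter targets.
    expr: up{job="java-app"} == 0
    for: 1m
    labels:
      env: dev
    annotations:
      summary: "Java target down"
      description: "Java application job: {{ $labels.job }}\n instance: {{ $labels.instance }}"
```

Because up keeps the job and instance labels of the failed target, the annotations can still name the affected instance.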
      
groups:
- name: "load-alert"
  rules:
  - alert: "LoadAlert"
    expr: node_load15{job="node_exporter_alert"} > 50
    for: 1m
    annotations:
      description: "Prometheus alert: load above 50\n Hostname: {{ $labels.hostname }}\n IP: {{ $labels.ip }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Availability zone: {{ $labels.region }}\n Product line: {{ $labels.product }}\n"
      
groups:
- name: "logstash-check-rules"
  rules:
  - alert: "LogstashServiceCheck"
    expr: logstash_info_node == 0
    for: 1m
    annotations:
      description: "Log collection service, ip: {{ $labels.instance }}"
      summary: "Logstash check"
      
groups:
- name: "memory-check"
  rules:
  - alert: "MemoryCheckOver4G"
    # Applies to hosts whose MemTotal is above 5000000000 bytes.
    expr: 100 - ( node_memory_Cached_bytes{job="node_exporter_alert"} + node_memory_Buffers_bytes{job="node_exporter_alert"} + node_memory_MemFree_bytes{job="node_exporter_alert"} ) / (node_memory_MemTotal_bytes{job="node_exporter_alert"} > 5000000000 ) * 100 > 98
    for: 30s
    labels:
      severity: critical
    annotations:
      description: "Prometheus alert: memory usage above 98%\n Hostname: {{ $labels.hostname }}\n IP: {{ $labels.ip }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Availability zone: {{ $labels.region }}\n Product line: {{ $labels.product }}\n"
  - alert: "MemoryCheckUnder4G"
    # Applies to hosts whose MemTotal is at most 5000000000 bytes.
    expr: 100 - ( node_memory_Cached_bytes{job="node_exporter_alert"} + node_memory_Buffers_bytes{job="node_exporter_alert"} + node_memory_MemFree_bytes{job="node_exporter_alert"} ) / (node_memory_MemTotal_bytes{job="node_exporter_alert"} <= 5000000000 ) * 100 > 98
    for: 30s
    labels:
      severity: critical
    annotations:
      description: "Prometheus alert: memory usage above 98%\n Hostname: {{ $labels.hostname }}\n IP: {{ $labels.ip }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Availability zone: {{ $labels.region }}\n Product line: {{ $labels.product }}\n"
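
The rules above derive free memory from Cached + Buffers + MemFree. On Linux kernels that report MemAvailable, node_exporter also exposes node_memory_MemAvailable_bytes, the kernel's own estimate of memory that can be used without swapping. A sketch of the same 98% check based on that metric is shown below; the group name and alert name are hypothetical, and the rule assumes the hosts actually export the metric.

```yaml
groups:
- name: "memory-available-check"       # hypothetical group name
  rules:
  - alert: "MemoryAvailableLow"        # hypothetical alert name
    # Used percentage = (1 - available / total) * 100.
    expr: (1 - node_memory_MemAvailable_bytes{job="node_exporter_alert"} / node_memory_MemTotal_bytes{job="node_exporter_alert"}) * 100 > 98
    for: 30s
    labels:
      severity: critical
    annotations:
      description: "Prometheus alert: memory usage above 98%\n Instance: {{ $labels.instance }}\n Current value: {{ $value }}\n"
```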
      
      
groups:
- name: "RedisClient"
  rules:
  - alert: "Clients20k"
    expr: redis_connected_clients > 20000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis instance {{ $labels.addr }} has more than 20k clients"
      description: "Redis {{ $labels.instance }} has more than 20k clients\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

groups:
- name: "RedisMem"
  rules:
  # The ratio is multiplied by 100 so that each threshold is a percentage.
  - alert: "OutOfMemory60"
    expr: redis_memory_used_bytes / node_memory_MemTotal_bytes * 100 > 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis instance {{ $labels.addr }} memory above 60%"
      description: "Redis {{ $labels.instance }} memory above 60%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: "OutOfMemory70"
    expr: redis_memory_used_bytes / node_memory_MemTotal_bytes * 100 > 70
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis instance {{ $labels.addr }} memory above 70%"
      description: "Redis {{ $labels.instance }} memory above 70%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: "OutOfMemory80"
    expr: redis_memory_used_bytes / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis instance {{ $labels.addr }} memory above 80%"
      description: "Redis {{ $labels.instance }} memory above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: "OutOfMemory90"
    expr: redis_memory_used_bytes / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis instance {{ $labels.addr }} memory above 90%"
      description: "Redis {{ $labels.instance }} memory above 90%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
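
Dividing a Redis exporter metric by a node_exporter metric only produces a result when both series carry identical label sets, which is often not the case for two different exporters (job and instance usually differ). An alternative is to compare used memory with the configured maxmemory reported by the Redis exporter itself. The sketch below assumes the exporter in use exposes redis_memory_max_bytes, as oliver006/redis_exporter does; the group name and alert name are hypothetical.

```yaml
groups:
- name: "RedisMemMax"                          # hypothetical group name
  rules:
  - alert: "RedisMemoryAbove90PctOfMaxmemory"  # hypothetical alert name
    # Both metrics come from the same exporter, so their labels match.
    # The "> 0" filter drops instances with maxmemory unset (reported as 0),
    # which would otherwise divide by zero and yield +Inf.
    expr: redis_memory_used_bytes / (redis_memory_max_bytes > 0) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis instance {{ $labels.instance }} uses more than 90% of maxmemory"
      description: "Redis {{ $labels.instance }} memory above 90% of maxmemory\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```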
      
groups:
- name: "RedisUP"
  rules:
  - alert: "RedisDown"
    expr: redis_up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Redis down"
      description: " Redis {{ $labels.instance }} is down \n VALUE = {{ $value }}\n"
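
Rules like RedisDown can be checked offline with promtool's rule unit tests (promtool test rules). The following test file is a sketch: it assumes the RedisUP group above is saved as redis_up.rules.yml, and the instance label value redis-1:9121 is made up for the test.

```yaml
# redis_up_test.yml (hypothetical file name)
rule_files:
  - redis_up.rules.yml      # assumed file containing the RedisUP group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Up for two samples, then down from the 2m mark onward.
      - series: 'redis_up{instance="redis-1:9121"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # Down since 2m with "for: 1m", so the alert should be firing by 5m.
      - eval_time: 5m
        alertname: RedisDown
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: "redis-1:9121"
            exp_annotations:
              summary: "Redis down"
              description: " Redis redis-1:9121 is down \n VALUE = 0\n"
```

Running promtool test rules redis_up_test.yml evaluates the rule against the synthetic series and compares the resulting alerts, labels, and annotations with the expected ones.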
      
groups:
- name: "tcp-check"
  rules:
  - alert: "TcpServiceCheck"
    expr: probe_success{job="blackbox-tcp"} == 0 or probe_success{job="redis-sync"} == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      value: " {{ $value }} "
      description: "Prometheus alert: TCP check\n Instance: {{ $labels.instance }}\n Current value: {{ $value }}\n App: {{ $labels.app }}\n Product line: {{ $labels.product }}\n"
      
groups:
- name: zookeeperStatsAlert
  rules:
  - alert: ZkOutstandingRequestsHigh
    expr: avg(zk_outstanding_requests) by (instance) > 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} "
      description: "Too many outstanding requests"
  - alert: ZkPendingSyncsHigh
    expr: avg(zk_pending_syncs) by (instance) > 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} "
      description: "Too many pending syncs"
  - alert: ZkAvgLatencyHigh
    expr: avg(zk_avg_latency) by (instance) > 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} "
      description: 'Average response latency too high'
  - alert: ZkOpenFileDescriptorsHigh
    # Fires when open descriptors exceed 85% of the configured maximum.
    expr: zk_open_file_descriptor_count > zk_max_file_descriptor_count * 0.85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} "
      description: 'Open file descriptors above 85% of the system limit'
  - alert: ZkServerDown
    expr: zk_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} "
      description: 'ZooKeeper server is down'
  - alert: ZkLeaderLost
    # absent() returns 1 only when no matching series exists, so "== 1"
    # fires when no leader is reported; "!= 1" would never match.
    expr: absent(zk_server_state{state="leader"}) == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} "
      description: 'ZooKeeper leader lost'