Monitoring and alerting in Kubernetes with Prometheus + Alertmanager


Monitoring and alerting prototype diagram

[Figure: prototype diagram of the monitoring and alerting setup]

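Explanation of the prototype diagram

Prometheus and Alertmanager run as containers in the same pod, which is managed by a Deployment controller. Alertmanager listens on port 9093 by default; because the two containers share a pod, Prometheus can reach Alertmanager directly at localhost:9093 to send alert notifications. The alerting rules file, rules.yml, is mounted into the Prometheus container as a ConfigMap. The notification receiver configuration is likewise mounted into the Alertmanager container as a ConfigMap. Here alerts are delivered by email; the details are in alertmanager.yml.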
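
The layout described above can be sketched as the pod-level part of the Deployment (spec.template.spec). This is only an illustration under stated assumptions: the image tags, container names, volume names, and the /etc/alertmanager mount path are not taken from the original; the ConfigMap names are the ones used in the full configuration at the end.

```yaml
# Sketch only: two containers in one pod, each reading its config from a ConfigMap.
containers:
- name: prometheus
  image: prom/prometheus:v2.3.2            # assumed tag
  args:
  - --config.file=/etc/prometheus/prometheus.yml
  ports:
  - containerPort: 9090
  volumeMounts:
  - name: prometheus-config                # assumed volume name
    mountPath: /etc/prometheus             # provides prometheus.yml and rules.yml
- name: alertmanager
  image: prom/alertmanager:v0.15.2         # assumed tag
  args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  ports:
  - containerPort: 9093                    # reachable from Prometheus as localhost:9093
  volumeMounts:
  - name: alertmanager-config              # assumed volume name
    mountPath: /etc/alertmanager           # assumed mount path
volumes:
- name: prometheus-config
  configMap:
    name: prometheus-config-v0.1.0         # ConfigMap from prometheus-cm.yaml
- name: alertmanager-config
  configMap:
    name: alertmanager                     # ConfigMap from alertmanager-cm.yaml
```

Containers in one pod share a network namespace, which is why no Service is needed between Prometheus and Alertmanager in this setup.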

Test environment

Environment: Linux 3.10.0-693.el7.x86_64 x86_64 GNU/Linux
Platform: Kubernetes v1.10.5
Tip: the complete Prometheus and Alertmanager configuration is at the end of this document.

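Creating the alerting rules

Specify the path to the alerting rules file in the Prometheus configuration. rules.yml holds the alerting rules; here it is mounted into the /etc/prometheus directory through a ConfigMap: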
rule_files:
- /etc/prometheus/rules.yml

Here we define an InstanceDown alert: once a host has been down for 1 minute, Prometheus fires the alert.

  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

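Configuring communication between Prometheus and Alertmanager (so that Prometheus can send alerts to Alertmanager)

Alertmanager listens on port 9093 by default, and because Prometheus and Alertmanager run in the same pod, Prometheus can reach Alertmanager directly at localhost:9093: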
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
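
For contrast, if Alertmanager ran in its own pod, localhost would no longer work and the target would have to be a cluster DNS name. A minimal sketch, assuming a hypothetical Service named alertmanager in the kube-system namespace that exposes port 9093 (no such Service exists in the original setup):

```yaml
# Sketch only: Alertmanager reached through a (hypothetical) Service instead of localhost.
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager.kube-system.svc:9093"]   # assumed Service name and namespace
```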

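Configuring the Alertmanager notification receiver

We use email as the example: when Alertmanager receives an alert from Prometheus, it sends an alert email to the specified mailbox. This configuration is also mounted into the Alertmanager container through a ConfigMap:
  alertmanager.yml: |-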
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'xin.liu@woqutech.com'
      smtp_auth_username: 'xin.liu@woqutech.com'
      smtp_auth_password: 'xxxxxxxxxxxx'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1148576125@qq.com'
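
The severity: page label set by the InstanceDown rule is not used by the route above, since every alert goes to default-receiver. As a hedged sketch of how such a label could drive routing, the fragment below adds a child route; the receiver name oncall-email and both example.com addresses are placeholders, and the global SMTP settings shown earlier are assumed to stay in place.

```yaml
# Sketch only: route alerts labelled severity=page to a separate receiver.
route:
  group_by: [alertname]
  receiver: default-receiver          # fallback for alerts matching no child route
  routes:
  - match:
      severity: page                  # label set in rules.yml
    receiver: oncall-email            # hypothetical receiver name
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'team@example.com'            # placeholder address
- name: 'oncall-email'
  email_configs:
  - to: 'oncall@example.com'          # placeholder address
    send_resolved: true               # also send a mail when the alert clears
```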

Prototype in action

The configured alerting rules are visible in the Prometheus web UI:

[Screenshot: alerting rules listed in the Prometheus web UI]

To test the setup, shut down one host node.
The Prometheus web UI then shows that an InstanceDown alert has fired:

[Screenshot: InstanceDown alert firing in the Prometheus web UI]

The Alertmanager web UI shows that Alertmanager has received the alert sent by Prometheus:

[Screenshot: the alert shown in the Alertmanager web UI]

The configured mailbox receives the alert email sent by Alertmanager:

[Screenshot: alert email received from Alertmanager]
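
The walkthrough above relies on both web UIs being reachable from a browser, but the source does not show how they were exposed. One possible approach, given here only as a sketch, is a NodePort Service that selects the pod by its app: prometheus label; the Service name and the nodePort values are assumptions.

```yaml
# Sketch only: expose the Prometheus (9090) and Alertmanager (9093) web UIs on every node.
apiVersion: v1
kind: Service
metadata:
  name: prometheus              # assumed name
  namespace: kube-system
spec:
  type: NodePort
  selector:
    app: prometheus             # pod label from prometheus.yaml
  ports:
  - name: prometheus
    port: 9090
    targetPort: 9090
    nodePort: 30090             # assumed; any free port in the cluster's NodePort range
  - name: alertmanager
    port: 9093
    targetPort: 9093
    nodePort: 30093             # assumed
```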

Full configuration

node_exporter_daemonset.yaml

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node_exporter
spec:
  selector:
    matchLabels:
      name: node_exporter
  template:
    metadata:
      labels:
        name: node_exporter
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: alery/node-exporter:1.0
        ports:
        - name: node-exporter
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: localtime
          mountPath: /etc/localtime
        - name: host
          mountPath: /host
          readOnly: true
      volumes:
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: host
        hostPath:
          path: /

alertmanager-cm.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'xin.liu@woqutech.com'
      smtp_auth_username: 'xin.liu@woqutech.com'
      smtp_auth_password: 'xxxxxxxxxxxx'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1148576125@qq.com'

prometheus-rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

prometheus-cm.yaml

kind: ConfigMap
apiVersion: v1
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    scrape_configs:
    - job_name: 'node'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: $1:9100
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: instance
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node_name
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(name)
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: node_exporter
        action: keep

  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

metadata:
  name: prometheus-config-v0.1.0
  namespace: kube-system

prometheus.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: kube-system
  name: prometheus
  labels:
    name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master