Cloud-Native Monitoring with Prometheus Operator: The Full Workflow for Monitoring, Rules, and Alerts

2023-05-15


Prometheus

Install prometheus-operator

wget https://github.com/prometheus-operator/prometheus-operator/releases/download/v0.64.0/bundle.yaml
kubectl create -f bundle.yaml

Create the example application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: fabxc/instrumented_app
        ports:
        - name: web
          containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: example-app
  name: example-app
spec:
  ports:
  - name: 8080-8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: example-app
  type: NodePort

Create the ServiceMonitor and PodMonitor objects

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: 8080-8080
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app
  podMetricsEndpoints:
  - port: web

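Deploy Prometheus and monitor the example application's Service and Pods

If RBAC authorization is enabled on the Kubernetes cluster, you must create the RBAC rules first and set up a service account for Prometheus in advance.

Next, create the service account together with the required ClusterRole and ClusterRoleBinding: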

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default

Create the Prometheus object, configure it to select the ServiceMonitor and PodMonitor objects that carry the specified labels, and finally expose Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  replicas: 3
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  podMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false # Set to true to enable the admin API
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus

Alertmanager

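For the alerting component, Prometheus Operator introduces two resource objects:

The Alertmanager resource, which lets users declaratively describe an Alertmanager cluster.
The AlertmanagerConfig resource, which lets users declaratively describe Alertmanager configuration.

First prepare the Alertmanager configuration by creating an AlertmanagerConfig resource. Then deploy a three-replica Alertmanager cluster that uses this configuration, and finally expose Alertmanager: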

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  labels:
    alertmanagerConfig: example
spec:
  route:
    groupBy: ['...']
    groupWait: 1s
    groupInterval: 1s
    repeatInterval: 1000d
    receiver: 'webhook'
  receivers:
  - name: 'webhook'
    webhookConfigs:
    - url: 'http://192.168.11.254:5001/webhook' # Received alerts are forwarded to this API; see the webhook.py code below for the receiving side
      sendResolved: true
---
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
spec:
  replicas: 3
  alertmanagerConfiguration: # This is the global configuration
    name: config-example
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-example
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30903
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: example

Python code that receives the alert messages:

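import json

from flask import Flask, request

app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def send():
    """Receive an alert notification from Alertmanager and print it."""
    try:
        # Alertmanager posts the notification as a JSON request body.
        data = json.loads(request.data)
        print(data)
    except Exception as e:
        # Log the failure but still acknowledge the request.
        print(e)
    return 'finish ok ...'


if __name__ == '__main__':
    # Listen on all interfaces; port 5001 matches the webhook URL configured above.
    app.run(debug=False, host='0.0.0.0', port=5001)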

Alerts are pushed to a separate endpoint so that they can be processed further before being forwarded elsewhere, for example to DingTalk or email.
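As an illustration of that further processing, the following minimal sketch turns a webhook payload into one summary line per alert, which could then be handed to a chat or email sender. The function name summarize_alerts and the sample payload are hypothetical; the field names (alerts, status, labels, annotations) follow the Alertmanager webhook JSON format.

```python
import json


def summarize_alerts(payload):
    """Return one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unknown")
        # Fall back to the group-level status if the alert has none.
        status = alert.get("status", payload.get("status", "unknown"))
        detail = annotations.get("summary") or annotations.get("description") or ""
        line = f"[{status.upper()}] {name}"
        if detail:
            line += f": {detail}"
        lines.append(line)
    return lines


# Trimmed, hypothetical payload for demonstration only.
sample = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "ExampleAlert"},
     "annotations": {"summary": "demo rule that always fires"}}
  ]
}
""")

print("\n".join(summarize_alerts(sample)))
```

Inside the Flask handler, the same function could be called on the parsed request body before forwarding the resulting lines.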

Integrating Prometheus with Alertmanager

Start from the YAML file used earlier to create the Prometheus object and modify it:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  replicas: 3
  alerting: # The main change: this section integrates Prometheus with Alertmanager
    alertmanagers:
    - namespace: default
      name: alertmanager-example
      port: web
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  podMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus

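After the integration, Status > Runtime & Build Information in the Prometheus UI shows that three Alertmanager instances have been discovered.

If they are not discovered, check whether the configuration is correct.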
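The same check can be done without the UI. The sketch below reads the Prometheus HTTP API endpoint /api/v1/alertmanagers and lists the active Alertmanager URLs. The helper names and the sample response are hypothetical, and the base URL passed to fetch_alertmanagers (for example a node IP with NodePort 30900) depends on your cluster.

```python
import json
import urllib.request


def active_alertmanager_urls(response):
    """Extract active Alertmanager URLs from an /api/v1/alertmanagers response."""
    if response.get("status") != "success":
        raise ValueError(f"unexpected API status: {response.get('status')!r}")
    active = response.get("data", {}).get("activeAlertmanagers", [])
    return [entry["url"] for entry in active]


def fetch_alertmanagers(base_url, timeout=5):
    """Query a running Prometheus; base_url is an assumption, e.g. http://<node-ip>:30900."""
    url = f"{base_url}/api/v1/alertmanagers"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical response for a three-replica Alertmanager cluster.
sample_response = {
    "status": "success",
    "data": {
        "activeAlertmanagers": [
            {"url": "http://10.0.0.1:9093/api/v2/alerts"},
            {"url": "http://10.0.0.2:9093/api/v2/alerts"},
            {"url": "http://10.0.0.3:9093/api/v2/alerts"},
        ],
        "droppedAlertmanagers": [],
    },
}

print(len(active_alertmanager_urls(sample_response)))
```

Against a live cluster, active_alertmanager_urls(fetch_alertmanagers("http://<node-ip>:30900")) should return three entries if discovery worked.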

Rule

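How rules are discovered:

By default, a Prometheus resource discovers only rules in its own namespace.
By default, if the spec.ruleSelector field is nil, no rules are matched.
To discover rules in all namespaces, pass an empty dictionary to the ruleNamespaceSelector field, for example ruleNamespaceSelector: {}.
To discover rules from all namespaces that match specific labels, use matchLabels.

Create the alerting rule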
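The discovery rules listed above can be modeled in a few lines. This is a simplified sketch of the selection logic as described in this article, not the operator's actual implementation: it handles only matchLabels (no matchExpressions), and the function names, the "monitoring" namespace, and its labels are hypothetical.

```python
def selector_matches(selector, labels):
    """None selects nothing; {} selects everything; matchLabels must all match."""
    if selector is None:
        return False
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def rule_is_discovered(rule, prometheus_namespace, rule_selector,
                       rule_namespace_selector, namespace_labels):
    """Decide whether a PrometheusRule would be picked up under the rules above."""
    if rule_namespace_selector is None:
        # No namespace selector: only the Prometheus object's own namespace.
        namespace_ok = rule["namespace"] == prometheus_namespace
    else:
        labels = namespace_labels.get(rule["namespace"], {})
        namespace_ok = selector_matches(rule_namespace_selector, labels)
    return namespace_ok and selector_matches(rule_selector, rule["labels"])


rule = {"namespace": "monitoring",
        "labels": {"role": "alert-rules", "prometheus": "example"}}
selector = {"matchLabels": {"role": "alert-rules", "prometheus": "example"}}
namespaces = {"default": {}, "monitoring": {"team": "ops"}}

# Rule lives in another namespace and no namespace selector is set.
print(rule_is_discovered(rule, "default", selector, None, namespaces))  # False
# ruleNamespaceSelector: {} opens discovery to all namespaces.
print(rule_is_discovered(rule, "default", selector, {}, namespaces))    # True
# A nil ruleSelector matches no rules at all.
print(rule_is_discovered(rule, "default", None, {}, namespaces))        # False
```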

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)

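For demonstration purposes, this rule always fires, which makes it easy to verify that everything works.

Deploy the Prometheus rule

Prometheus and Alertmanager were already integrated in the YAML above. Next, continue modifying that same YAML.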

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  replicas: 3
  alerting:
    alertmanagers:
    - namespace: default
      name: alertmanager-example
      port: web
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  podMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
  ruleSelector:
    matchLabels:
      role: alert-rules
      prometheus: example
  ruleNamespaceSelector: {} # Namespaces to search for PrometheusRule objects. If unspecified, only the namespace of the Prometheus object is used.
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus

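Status > Rules in the Prometheus UI now shows the example rule.

The Alertmanager UI also shows the alert, which means Alertmanager has received the alert sent by Prometheus, although the displayed time zone looks slightly off.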
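One possible explanation for the time-zone mismatch, offered here as an assumption rather than something confirmed above, is that the timestamps are in UTC. Alert timestamps such as startsAt arrive as RFC 3339 strings, so a receiver can convert them to local time itself. The sketch below uses a fixed UTC+8 offset as an example; the function name to_local and the sample timestamp are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def _normalize_fraction(match):
    # datetime supports microseconds only, so keep exactly six fractional digits.
    return "." + match.group(1)[:6].ljust(6, "0")


def to_local(timestamp, utc_offset_hours=8):
    """Convert an RFC 3339 timestamp string to a fixed-offset local time."""
    normalized = re.sub(r"\.(\d+)", _normalize_fraction, timestamp)
    if normalized.endswith("Z"):
        # Older Python versions do not accept the "Z" suffix in fromisoformat.
        normalized = normalized[:-1] + "+00:00"
    parsed = datetime.fromisoformat(normalized)
    return parsed.astimezone(timezone(timedelta(hours=utc_offset_hours)))


print(to_local("2023-05-15T08:00:00.000Z").isoformat())
```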

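End-to-end test

Test steps:

Modify or add an alerting rule.
Check whether Prometheus discovers the new rule.
After the alert fires, check whether Alertmanager receives it.
After the alert fires, check whether Alertmanager pushes it to the other endpoint.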
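To exercise the last two steps without waiting for a rule to fire, a synthetic alert can be posted straight to Alertmanager's v2 API (POST /api/v2/alerts). The sketch below is one way to do that under stated assumptions: the helper names, the label and annotation values, and the base URL are hypothetical, and the NodePort 30903 comes from the Service defined earlier.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name, now=None, minutes=5):
    """Build the JSON body for POST /api/v2/alerts containing one test alert."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": "info"},
        "annotations": {"summary": "manual test alert"},
        "startsAt": now.isoformat(),
        # The alert resolves automatically once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(base_url, alerts, timeout=5):
    """Send alerts to Alertmanager; base_url is an assumption, e.g. http://<node-ip>:30903."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.status


fixed_now = datetime(2023, 5, 15, 8, 0, tzinfo=timezone.utc)
print(json.dumps(build_test_alert("ManualTestAlert", now=fixed_now), indent=2))
```

Calling post_alerts("http://<node-ip>:30903", build_test_alert("ManualTestAlert")) against a live cluster should make the alert appear in the Alertmanager UI and, shortly after, in the output of the webhook receiver.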

