k8s容器云架构之dubbo微服务—K8S(14)监控实战-grafana出图_alert告警

简介: 博客地址:https://www.cnblogs.com/sseban哔哩哔哩:https://space.bilibili.com/394449264k8s监控实战-grafana出图_alert告警

k8s监控实战-grafana出图_alert告警

目录

  • k8s监控实战-grafana出图_alert告警
  • 1 使用炫酷的grafana出图
  • 1.1 部署grafana
  • 1.1.1 准备镜像
  • 1.1.2 准备rbac资源清单
  • 1.1.3 准备dp资源清单
  • 1.1.4 准备svc资源清单
  • 1.1.5 准备ingress资源清单
  • 1.1.6 域名解析
  • 1.1.7 应用资源配置清单
  • 1.2 使用grafana出图
  • 1.2.1 浏览器访问验证
  • 1.2.2 进入容器安装插件
  • 1.2.3 配置数据源
  • 1.2.4 添加K8S集群信息
  • 1.2.5 查看k8s集群数据和图表
  • 2 配置alert告警插件
  • 2.1 部署alert插件
  • 2.1.1 准备docker镜像
  • 2.1.2 准备cm资源清单
  • 2.1.3 准备dp资源清单
  • 2.1.4 准备svc资源清单
  • 2.1.5 应用资源配置清单
  • 2.2 K8S使用alert报警
  • 2.2.1 k8s创建基础报警规则文件
  • 2.2.2 K8S 更新配置
  • 2.2.3 测试告警

1 使用炫酷的grafana出图

prometheus的dashboard虽然号称拥有多种多样的图表,但是实在太简陋了,一般都用专业的grafana工具来出图

grafana官方dockerhub地址

grafana官方github地址

grafana官网

1.1 部署grafana

1.1.1 准备镜像

docker pull grafana/grafana:5.4.2
docker tag  6f18ddf9e552 harbor.zq.com/infra/grafana:v5.4.2
docker push harbor.zq.com/infra/grafana:v5.4.2

准备目录

mkdir /data/k8s-yaml/grafana
cd    /data/k8s-yaml/grafana

1.1.2 准备rbac资源清单

cat >rbac.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
rules:
- apiGroups:
  - "*"
  resources:
  - namespaces
  - deployments
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana
subjects:
- kind: User
  name: k8s-node
EOF

1.1.3 准备dp资源清单

cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: grafana
    name: grafana
  name: grafana
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      name: grafana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        name: grafana
    spec:
      containers:
      - name: grafana
        image: harbor.zq.com/infra/grafana:v5.4.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: data
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      volumes:
      - nfs:
          server: hdss7-200
          path: /data/nfs-volume/grafana
        name: data
EOF

创建frafana数据目录

mkdir /data/nfs-volume/grafana

1.1.4 准备svc资源清单

cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: infra
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
EOF

1.1.5 准备ingress资源清单

cat >ingress.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: infra
spec:
  rules:
  - host: grafana.zq.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
EOF

1.1.6 域名解析

vi /var/named/zq.com.zone
grafana            A    10.4.7.10
systemctl restart named

1.1.7 应用资源配置清单

kubectl apply -f http://k8s-yaml.zq.com/grafana/rbac.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/svc.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/ingress.yaml

1.2 使用grafana出图

1.2.1 浏览器访问验证

访问http://grafana.zq.com,默认用户名密码admin/admin

能成功访问表示安装成功

进入后立即修改管理员密码为admin123

1.2.2 进入容器安装插件

grafana确认启动好以后,需要进入grafana容器内部,安装以下插件

kubectl -n infra exec  -it grafana-d6588db94-xr4s6 /bin/bash
# 以下命令在容器内执行
grafana-cli plugins install grafana-kubernetes-app
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install briangann-gauge-panel
grafana-cli plugins install natel-discrete-panel

1.2.3 配置数据源

添加数据源,依次点击:左侧锯齿图标-->add data source-->Prometheus

添加完成后重启grafana

kubectl -n infra delete pod grafana-7dd95b4c8d-nj5cx

1.2.4 添加K8S集群信息

启用K8S插件,依次点击:左侧锯齿图标-->Plugins-->kubernetes-->Enable

新建cluster,依次点击:左侧K8S图标-->New Cluster

1.2.5 查看k8s集群数据和图表

添加完需要稍等几分钟,在没有取到数据之前,会报http forbidden,没关系,等一会就好。大概2-5分钟。

点击Cluster Dashboard

2 配置alert告警插件

2.1 部署alert插件

2.1.1 准备docker镜像

docker pull docker.io/prom/alertmanager:v0.14.0
docker tag  23744b2d645c harbor.zq.com/infra/alertmanager:v0.14.0
docker push harbor.zq.com/infra/alertmanager:v0.14.0

准备目录

mkdir /data/k8s-yaml/alertmanager
cd /data/k8s-yaml/alertmanager

2.1.2 准备cm资源清单

cat >cm.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yml: |-
    global:
      # 在没有报警的情况下声明为已解决的时间
      resolve_timeout: 5m
      # 配置邮件发送信息
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxx@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxxx'
      smtp_require_tls: false
    templates:   
      - '/etc/alertmanager/*.tmpl'
    # 所有报警信息进入后的根路由,用来设置报警的分发策略
    route:
      # 这里的标签列表是接收到报警信息后的重新分组标签,例如,接收到的报警信息里面有许多具有 cluster=A 和 alertname=LatncyHigh 这样的标签的报警信息将会批量被聚合到一个分组里面
      group_by: ['alertname', 'cluster']
      # 当一个新的报警分组被创建后,需要等待至少group_wait时间来初始化通知,这种方式可以确保您能有足够的时间为同一分组来获取多个警报,然后一起触发这个报警信息。
      group_wait: 30s
      # 当第一个报警发送后,等待'group_interval'时间来发送新的一组报警信息。
      group_interval: 5m
      # 如果一个报警信息已经发送成功了,等待'repeat_interval'时间来重新发送他们
      repeat_interval: 5m
      # 默认的receiver:如果一个报警没有被一个route匹配,则发送给默认的接收器
      receiver: default
    receivers:
    - name: 'default'
      email_configs:
      - to: 'xxxx@qq.com'
        send_resolved: true
        html: '{{ template "email.to.html" . }}' 
        headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }   
  email.tmpl: |
    {{ define "email.to.html" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range .Alerts }}
    告警程序: prometheus_alert <br>
    告警级别: {{ .Labels.severity }} <br>
    告警类型: {{ .Labels.alertname }} <br>
    故障主机: {{ .Labels.instance }} <br>
    告警主题: {{ .Annotations.summary }}  <br>
    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range .Alerts }}
    告警程序: prometheus_alert <br>
    告警级别: {{ .Labels.severity }} <br>
    告警类型: {{ .Labels.alertname }} <br>
    故障主机: {{ .Labels.instance }} <br>
    告警主题: {{ .Annotations.summary }} <br>
    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    {{- end }}
EOF

2.1.3 准备dp资源清单

cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: harbor.zq.com/infra/alertmanager:v0.14.0
        args:
          - "--config.file=/etc/alertmanager/config.yml"
          - "--storage.path=/alertmanager"
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-cm
          mountPath: /etc/alertmanager
      volumes:
      - name: alertmanager-cm
        configMap:
          name: alertmanager-config
      imagePullSecrets:
      - name: harbor
EOF

2.1.4 准备svc资源清单

cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: infra
spec:
  selector: 
    app: alertmanager
  ports:
    - port: 80
      targetPort: 9093
EOF

2.1.5 应用资源配置清单

kubectl apply -f http://k8s-yaml.zq.com/alertmanager/cm.yaml
kubectl apply -f http://k8s-yaml.zq.com/alertmanager/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/alertmanager/svc.yaml

2.2 K8S使用alert报警

2.2.1 k8s创建基础报警规则文件

cat >/data/nfs-volume/prometheus/etc/rules.yml <<'EOF'
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
  - alert: OutOfInodes
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of inodes (instance {{ $labels.instance }})"
      description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
  - alert: OutOfDiskSpace
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadLatency
    expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
  - alert: UnusualDiskWriteLatency
    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completedl[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
- name: http_status
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Probe failed (instance {{ $labels.instance }})"
      description: "Probe failed (current value: {{ $value }})"
  - alert: StatusCode
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Status Code (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399 (current value: {{ $value }})"
  - alert: SslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days (current value: {{ $value }})"
  - alert: SslCertificateHasExpired
    expr: probe_ssl_earliest_cert_expiry - time()  <= 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "SSL certificate has expired (instance {{ $labels.instance }})"
      description: "SSL certificate has expired already (current value: {{ $value }})"
  - alert: BlackboxSlowPing
    expr: probe_icmp_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow ping (instance {{ $labels.instance }})"
      description: "Blackbox ping took more than 2s (current value: {{ $value }})"
  - alert: BlackboxSlowRequests
    expr: probe_http_duration_seconds > 2 
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow requests (instance {{ $labels.instance }})"
      description: "Blackbox request took more than 2s (current value: {{ $value }})"
  - alert: PodCpuUsagePercent
    expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
EOF

2.2.2 K8S 更新配置

在prometheus配置文件中追加配置:

cat >>/data/nfs-volume/prometheus/etc/prometheus.yml <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager"]
rule_files:
 - "/data/etc/rules.yml"
EOF

重载配置:

curl -X POST http://prometheus.zq.com/-/reload

image.png

以上这些就是我们的告警规则

2.2.3 测试告警

把test命名空间里的dubbo-demo-service给停掉

blackbox里信息已报错,alert里面项目变黄了

等到alert中项目变为红色的时候就开会发邮件告警

如果需要自己定制告警规则和告警内容,需要研究一下promql,自己修改配置文件。

相关实践学习
深入解析Docker容器化技术
Docker是一个开源的应用容器引擎,让开发者可以打包他们的应用以及依赖包到一个可移植的容器中,然后发布到任何流行的Linux机器上,也可以实现虚拟化,容器是完全使用沙箱机制,相互之间不会有任何接口。Docker是世界领先的软件容器平台。开发人员利用Docker可以消除协作编码时“在我的机器上可正常工作”的问题。运维人员利用Docker可以在隔离容器中并行运行和管理应用,获得更好的计算密度。企业利用Docker可以构建敏捷的软件交付管道,以更快的速度、更高的安全性和可靠的信誉为Linux和Windows Server应用发布新功能。 在本套课程中,我们将全面的讲解Docker技术栈,从环境安装到容器、镜像操作以及生产环境如何部署开发的微服务应用。本课程由黑马程序员提供。 &nbsp; &nbsp; 相关的阿里云产品:容器服务 ACK 容器服务 Kubernetes 版(简称 ACK)提供高性能可伸缩的容器应用管理能力,支持企业级容器化应用的全生命周期管理。整合阿里云虚拟化、存储、网络和安全能力,打造云端最佳容器化应用运行环境。 了解产品详情: https://www.aliyun.com/product/kubernetes
相关文章
|
6月前
|
监控 Kubernetes Java
使用 New Relic APM 和 Kubernetes Metrics 监控 EKS 上的 Java 微服务
在阿里云AKS上运行Java微服务常遇性能瓶颈与OOMKilled等问题。本文教你通过New Relic实现集群与JVM双层监控,集成Helm部署、JVM代理注入、GC调优及告警仪表盘,打通从节点资源到应用内存的全链路观测,提升排障效率,保障服务稳定。
481 115
|
5月前
|
Cloud Native Serverless API
微服务架构实战指南:从单体应用到云原生的蜕变之路
🌟蒋星熠Jaxonic,代码为舟的星际旅人。深耕微服务架构,擅以DDD拆分服务、构建高可用通信与治理体系。分享从单体到云原生的实战经验,探索技术演进的无限可能。
微服务架构实战指南:从单体应用到云原生的蜕变之路
|
6月前
|
Prometheus 监控 Java
日志收集和Spring 微服务监控的最佳实践
在微服务架构中,日志记录与监控对系统稳定性、问题排查和性能优化至关重要。本文介绍了在 Spring 微服务中实现高效日志记录与监控的最佳实践,涵盖日志级别选择、结构化日志、集中记录、服务ID跟踪、上下文信息添加、日志轮转,以及使用 Spring Boot Actuator、Micrometer、Prometheus、Grafana、ELK 堆栈等工具进行监控与可视化。通过这些方法,可提升系统的可观测性与运维效率。
632 1
日志收集和Spring 微服务监控的最佳实践
|
7月前
|
存储 Prometheus 监控
从入门到实战:一文掌握微服务监控系统 Prometheus + Grafana
随着微服务架构的发展,系统监控变得愈发重要。本文介绍如何利用 Prometheus 和 Grafana 构建高效的监控系统,涵盖数据采集、存储、可视化与告警机制,帮助开发者提升系统可观测性,及时发现故障并优化性能。内容涵盖 Prometheus 的核心组件、数据模型及部署方案,并结合 Grafana 实现可视化监控,适合初学者和进阶开发者参考实践。
980 6
|
8月前
|
存储 监控 Shell
SkyWalking微服务监控部署与优化全攻略
综上所述,虽然SkyWalking的初始部署流程相对复杂,但通过一步步的准备和配置,可以充分发挥其作为可观测平台的强大功能,实现对微服务架构的高效监控和治理。尽管未亲临,心已向往。将一件事做到极致,便是天分的展现。
|
8月前
|
缓存 Cloud Native Java
Java 面试微服务架构与云原生技术实操内容及核心考点梳理 Java 面试
本内容涵盖Java面试核心技术实操,包括微服务架构(Spring Cloud Alibaba)、响应式编程(WebFlux)、容器化(Docker+K8s)、函数式编程、多级缓存、分库分表、链路追踪(Skywalking)等大厂高频考点,助你系统提升面试能力。
865 0
|
11月前
|
Cloud Native Serverless 流计算
云原生时代的应用架构演进:从微服务到 Serverless 的阿里云实践
云原生技术正重塑企业数字化转型路径。阿里云作为亚太领先云服务商,提供完整云原生产品矩阵:容器服务ACK优化启动速度与镜像分发效率;MSE微服务引擎保障高可用性;ASM服务网格降低资源消耗;函数计算FC突破冷启动瓶颈;SAE重新定义PaaS边界;PolarDB数据库实现存储计算分离;DataWorks简化数据湖构建;Flink实时计算助力风控系统。这些技术已在多行业落地,推动效率提升与商业模式创新,助力企业在数字化浪潮中占据先机。
593 12
|
Dubbo Java 应用服务中间件
微服务学习 | Springboot整合Dubbo+Nacos实现RPC调用
微服务学习 | Springboot整合Dubbo+Nacos实现RPC调用
|
Dubbo Java 应用服务中间件
Spring Cloud Dubbo:微服务通信的高效解决方案
【10月更文挑战第15天】随着信息技术的发展,微服务架构成为企业应用开发的主流。Spring Cloud Dubbo结合了Dubbo的高性能RPC和Spring Cloud的生态系统,提供高效、稳定的微服务通信解决方案。它支持多种通信协议,具备服务注册与发现、负载均衡及容错机制,简化了服务调用的复杂性,使开发者能更专注于业务逻辑的实现。
340 2

相关产品

  • 容器服务Kubernetes版
  • 推荐镜像

    更多