Prometheus Pitfall Notes


How do you manage Prometheus with systemd?

Create the service unit file

# -*- mode: conf -*-
[Unit]
Description=The Prometheus monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target
[Service]
EnvironmentFile=-/etc/default/prometheus
User=prometheus
ExecStart=/usr/bin/prometheus $PROMETHEUS_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5s
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target

Default environment variables

# cat /etc/default/prometheus
PROMETHEUS_OPTS='--config.file=/etc/prometheus/prometheus.yml --web.page-title="CHOT Metrics" --storage.tsdb.path=/prometheus/monitor/data --storage.tsdb.retention=7d --enable-feature=promql-negative-offset'

Start and stop

systemctl start prometheus
systemctl stop prometheus
systemctl status prometheus
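
After editing the unit file or the environment file, a couple of extra commands are worth knowing; this is plain systemd/promtool usage rather than anything specific to this setup:

systemctl daemon-reload              # reload unit definitions after editing the service file
systemctl enable prometheus          # start Prometheus automatically at boot
promtool check config /etc/prometheus/prometheus.yml   # validate the config before a restart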

Prometheus fails to start after changing the storage directory

Background

As the number of monitored nodes grew, Prometheus storage came under heavy pressure. The monitoring host never got dedicated hardware; it is a VirtualBox VM I carved out of an idle server, yet it is responsible for monitoring a large number of production machines.

Startup error

# journalctl -xe
Sep 08 14:56:20 meta prometheus[57772]: ts=2022-09-08T06:56:20.789Z caller=query_logger.go:90 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/monitor/data/queries.active err="open /prometheus/monitor/data/queries.active: permission denied"
Sep 08 14:56:26 meta prometheus[57825]: ts=2022-09-08T06:56:26.044Z caller=main.go:188 level=warn msg="This option for --enable-feature is now permanently enabled and therefore a no-op." option=promql-negative-offset

The only change was the storage directory: --storage.tsdb.path=/prometheus/monitor/data.

This is a new disk added to the VM, formatted and mounted at /prometheus.

[root@meta /]#df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.8G  180K  7.8G   1% /dev/shm
tmpfs           7.8G  760M  7.1G  10% /run
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/sda1        40G   20G   21G  48% /
/dev/sdb1        50G  151M   50G   1% /prometheus
tmpfs           1.6G     0  1.6G   0% /run/user/0

Why was permission denied?

Because the Prometheus data directory should be owned by the prometheus user and group, but the newly created directory was owned by root. As the official guidance puts it:

The user running Prometheus within the container has a specific user id and group id because it is dangerous to run as root

Fix

chown prometheus:prometheus /prometheus/monitor/data/
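
A quick sanity check after the change, plus a recursive variant for the case where sub-directories were already created as root:

ls -ld /prometheus/monitor/data                             # owner and group should now be prometheus:prometheus
chown -R prometheus:prometheus /prometheus/monitor/data     # recursive variant if data already exists under the directory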

Reference

https://github.com/prometheus/prometheus/issues/5976

Errors when Prometheus scrapes Redis metrics

After starting redis_exporter, it kept logging errors:

~]$redis_exporter -redis-only-metrics -redis.addr redis://@10.50.10.45:6379
INFO[0000] Redis Metrics Exporter v1.37.0    build date: 2022-03-18-01:20:01    sha1: a1c28b775760f2f00fce07a24db7fd4e83c26b9f    Go: go1.17.8    GOOS: linux    GOARCH: amd64
INFO[0000] Providing metrics at :9121/metrics
ERRO[0001] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0001] Redis INFO err: set tcp 10.50.10.45:41601: use of closed network connection
ERRO[0011] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0011] Redis INFO err: set tcp 10.50.10.45:41604: use of closed network connection
ERRO[0021] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0021] Redis INFO err: set tcp 10.50.10.45:41613: use of closed network connection
ERRO[0032] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0032] Redis INFO err: set tcp 10.50.10.45:41651: use of closed network connection
ERRO[0042] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0042] Redis INFO err: set tcp 10.50.10.45:41663: use of closed network connection
ERRO[0052] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0052] Redis INFO err: set tcp 10.50.10.45:41664: use of closed network connection
ERRO[0061] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0061] Redis INFO err: set tcp 10.50.10.45:41666: use of closed network connection
ERRO[0071] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0071] Redis INFO err: set tcp 10.50.10.45:41669: use of closed network connection
ERRO[0081] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0081] Redis INFO err: set tcp 10.50.10.45:41677: use of closed network connection
ERRO[0091] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0091] Redis INFO err: set tcp 10.50.10.45:41696: use of closed network connection

However, the metrics could still be fetched on port 9121, and the targets on Prometheus's Status page all showed as healthy. The cause:

Look closely at the endpoints on the Prometheus Status page: when a node runs multiple Redis instances, each endpoint redirects to a specific target. This is really the question of how to configure Prometheus to scrape multiple Redis instances through one exporter, and that is exactly the situation I hit.

Starting redis_exporter for multiple instances on a single host

 curl  10.50.10.25/pigsty/redis_exporter -o /usr/bin/redis_exporter && chmod a+x /usr/bin/redis_exporter
 # With multiple instances on one host there is no need to pass -redis.addr here; the addresses go into the file-based discovery targets instead
 nohup redis_exporter  -redis-only-metrics  &>/dev/null &

A single redis_exporter collects the metrics for all of the instances.
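
A quick way to verify the multi-target pattern from the shell (assuming the exporter runs locally on its default port 9121):

# the /scrape endpoint takes the Redis address as a query parameter,
# which is exactly what the relabeled __param_target turns into
curl 'http://127.0.0.1:9121/scrape?target=redis://10.50.10.45:6379'
# the exporter's own metrics stay on /metrics
curl 'http://127.0.0.1:9121/metrics'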

How does Prometheus scrape metrics from multiple Codis instances?

Reference: https://github.com/oliver006/redis_exporter

Prometheus configuration for multiple Redis instances

  #------------------------------------------------------------------------------
  # job: redis
  # multiple redis targets from redis_exporter @ target nodes
  # labels: [cls, ip, ins, instance]
  # path: targets/redis/<redis_node>.yml
  #------------------------------------------------------------------------------
  - job_name: redis
    metrics_path: /scrape
    file_sd_configs:
      - refresh_interval: 5s
        files: [ targets/redis/*.yml ]
    relabel_configs:
      - source_labels: [__address__] # source labels: these label values are joined with the configured separator and matched against the regex below
        target_label: __param_target # target label: the label written to when the replace (or hashmod) action runs
      - source_labels: [__param_target]
        regex: ^redis://(.*):(\d+)$ # regex matched against the concatenated source labels; the default (.*) matches anything
        replacement: $1:$2 # replacement written into the target label; $1 -> (.*), $2 -> (\d+)
        target_label: instance
      - source_labels: [__param_target]
        regex: ^redis://(.*):(\d+)$
        replacement: $1
        target_label: ip
      # scrape redis_exporter on target node
      - source_labels: [__param_target]
        regex: ^redis://(.*):\d+$
        replacement: $1:9121
        target_label: __address__
#### The official example configuration from the redis_exporter README follows
  - job_name: 'redis_exporter_targets'
    metrics_path: /scrape
    static_configs:
      - targets:
        - redis://ip:6379
        - redis://ip1:6379
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <<REDIS-EXPORTER-HOSTNAME>>:9121
  ## config for scraping the exporter itself
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
        - <<REDIS-EXPORTER-HOSTNAME>>:9121

targets/redis/*.yml

- labels: { ip: 10.50.10.45 , ins: codis-2-6379 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6379 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6380 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6380 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6381 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6381 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6382 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6382 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6383 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6383 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6384 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6384 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6385 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6385 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6386 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6386 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6379 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6379 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6380 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6380 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6381 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6381 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6382 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6382 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6383 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6383 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6384 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6384 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6385 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6385 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6386 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6386 ]
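
To make the relabeling concrete, here is how one entry from targets/redis/*.yml flows through the four rules above (a walk-through of the config, not extra configuration):

# file_sd target:  redis://10.50.10.45:6379   (plus the ip/ins/cls labels from the file)
# __address__            = redis://10.50.10.45:6379
# rule 1: __param_target = redis://10.50.10.45:6379   -> becomes ?target=... on the scrape URL
# rule 2: instance       = 10.50.10.45:6379           ($1:$2 from the regex)
# rule 3: ip             = 10.50.10.45                ($1)
# rule 4: __address__    = 10.50.10.45:9121           -> the address Prometheus actually scrapes
# resulting request: http://10.50.10.45:9121/scrape?target=redis://10.50.10.45:6379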

(Screenshots: the Prometheus targets page, showing each endpoint redirecting to the /scrape path of the specified target.)

Prometheus relabel_configs

The syntax and structure of relabeling rules are identical in relabel_configs and metric_relabel_configs blocks. The only difference is when they run: relabel_configs runs after service discovery and before the scrape, while metric_relabel_configs runs after the scrape.
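
A minimal sketch contrasting the two stages (the node job, metric name pattern, and address here are illustrative, not taken from my setup):

  - job_name: node
    static_configs:
      - targets: ['10.50.10.45:9100']
    relabel_configs:                  # before the scrape: operates on target labels
      - source_labels: [__address__]
        regex: (.*):\d+
        replacement: $1
        target_label: ip
    metric_relabel_configs:           # after the scrape: operates on every scraped series
      - source_labels: [__name__]
        regex: node_softnet_.*
        action: drop                  # drop series we do not need, to cut cardinality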

Starting redis_exporter on the Codis servers

nohup redis_exporter  -redis-only-metrics  &>/dev/null &

Check with netstat

~]#netstat -nltp|grep 9121
tcp        0      0 :::9121                     :::*                        LISTEN      1872/redis_exporter

Monitoring dashboard URLs

codis:
http://10.50.10.25:3000/d/redis-cluster/redis-cluster?var-cls=codis-prod&orgId=1
node monitor:
http://10.50.10.25:3000/d/nodes-cluster/nodes-cluster?var-cls=QMS&orgId=1

Prometheus host auto-discovery with file_sd_configs

This is implemented with the metrics_path and relabel_configs settings, in combination with file_sd_configs.

relabel_configs allow advanced modifications to any target and its labels before scraping.

Prometheus's two built-in labels

Every metric automatically carries two topology labels: job and instance. The job label is set from the job name in the scrape configuration; we tend to use job to describe the type of thing being monitored. In the earlier Node Exporter job we named it node, so every Node Exporter metric gets the label job="node". The instance label identifies the target; it is usually the target's IP address and port, and it is derived from the __address__ label.
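
A minimal sketch of where the two labels come from (hypothetical node job and address, not from my config):

  - job_name: node                       # every series scraped by this job gets job="node"
    static_configs:
      - targets: ['10.50.10.45:9100']    # instance defaults to "10.50.10.45:9100", taken from __address__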

Prometheus performance problems

Prometheus suddenly ground to a halt and the dashboards would not even load. vmstat showed the CPU nearly maxed out, with a huge number of soft interrupts.

Analyzing Prometheus from its logs

journalctl -u prometheus -f

1. Dec 23 08:58:52 meta-162 prometheus[12490]: ts=2022-12-23T00:58:52.344Z caller=main.go:956 level=warn fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
2. Dec 23 08:58:53 meta-162 prometheus[12490]: ts=2022-12-23T00:58:53.018Z caller=main.go:910 level=info msg="Server is ready to receive web requests."
3. Dec 23 09:08:14 meta-162 prometheus[12490]: ts=2022-12-23T01:08:14.588Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1671710400000 maxt=1671717600000 ulid=01GMY7V2MQWB32XXT0W2T4K38J duration=9m15.935444748s
4. Dec 23 09:13:15 meta-162 prometheus[12490]: ts=2022-12-23T01:13:14.756Z caller=head.go:840 level=info component=tsdb msg="Head GC completed" duration=3m7.280873479s

To begin with, my backend storage was NFS, which the Prometheus documentation explicitly advises against.

Note: Prometheus's local storage does not support non-POSIX-compliant filesystems, because unrecoverable corruption may occur. NFS filesystems (including AWS's EFS) are not supported. NFS can be POSIX-compliant, but most implementations are not. A local filesystem is strongly recommended for reliability.
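
A quick way to confirm which filesystem the storage path actually sits on (plain coreutils, nothing Prometheus-specific):

df -Th /prometheus/monitor/data          # the Type column should show a local filesystem (xfs/ext4), not nfs
stat -f -c %T /prometheus/monitor/data   # prints the filesystem type name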

And a single write block took more than nine minutes.

What is a write block?

When the amount of chunk data persisted to disk reaches a threshold, the older data is split out of the chunks and turned into a block. Multiple older, smaller blocks are periodically compacted into a larger block, until a block's age exceeds the retention threshold and it is deleted. The wal directory holds the data currently being written. It contains multiple segment files, each at most 128 MB by default; Prometheus keeps at least 3 segment files, and on heavily loaded machines at least 2 hours of data. Data in the wal directory is not compressed, so it is slightly larger than the same data inside a block.

├── 01GMNFPT9RH76VC5MJ6WY0AAXY
│   ├── chunks # compressed time series data; each chunk file grows up to 512 MB, then a new file is created
│   │   ├── 000001 # up to 512 MB by default
│   │   ├── 000002
│   │   ├── 000003
│   │   ├── 000004
│   │   ├── 000005
│   │   ├── 000006
│   │   └── 000007
│   ├── index # index of series and their offsets into the chunk files
│   ├── meta.json # block metadata: sample start/end times, number of chunks, data size, etc.
│   └── tombstones  # soft deletions made through the API are recorded here (the data is not immediately removed from the chunk files)
├── lock
├── chunks_head # chunks_head also contains multiple chunks; when the in-memory head block grows too large, data is flushed here and the head keeps a reference to these files
├── queries.active
└── wal                      # write-ahead log to guard against data loss (freshly scraped samples live in memory; the WAL records them)
    ├── 00000366             # each segment is at most 128 MB; by default roughly two hours of data is kept here
    ├── 00000367
    ├── 00000368
    ├── 00000369
    └── checkpoint.000365
        └── 00000000
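
To see what is inside a block and which labels drive cardinality, promtool (shipped with recent Prometheus releases) can inspect the TSDB directly; the block ULID below is just the one from the listing above:

promtool tsdb list /prometheus/monitor/data                                 # list blocks with their time range and size
promtool tsdb analyze /prometheus/monitor/data 01GMNFPT9RH76VC5MJ6WY0AAXY   # label and series statistics for one block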

Prometheus performance problems, continued

2023-04-03 09:34:59

Since the Prometheus backend storage was on NFS, I switched it to local storage.

After running for a while, performance slowly degraded again: as the number of collected metrics grew, the server's CPU usage shot up.

Check the Prometheus Overview dashboard

(Screenshots: the Prometheus Overview dashboard and the scrape-duration trend.)

scrape_duration_seconds is the time Prometheus spends collecting metrics from a single target. If this metric is high, the scraping side of Prometheus is likely the bottleneck.

prometheus_rule_evaluations_total is a metric exposed by Prometheus itself. It counts how many times Prometheus has evaluated alerting or recording rules. Prometheus lets you define rules that describe the conditions under which an application or system should fire an alert or record a derived metric; these rules are evaluated periodically to check whether the conditions are met. Tracking prometheus_rule_evaluations_total helps identify problems or bottlenecks in the rule-evaluation process.
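
A few PromQL queries that help with this kind of triage (run in the Prometheus UI or Grafana; these are standard Prometheus self-monitoring metrics):

topk(10, scrape_duration_seconds)                    # slowest scrape targets right now
rate(prometheus_rule_evaluations_total[5m])          # rule evaluations per second
rate(prometheus_rule_evaluation_failures_total[5m])  # failing evaluations, should stay at 0
prometheus_tsdb_head_series                          # active series in the head block, a rough cardinality gauge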
