Prometheus pitfall notes
How do you manage Prometheus with systemd?
Create the service unit file
# -*- mode: conf -*-
[Unit]
Description=The Prometheus monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target

[Service]
EnvironmentFile=-/etc/default/prometheus
User=prometheus
ExecStart=/usr/bin/prometheus $PROMETHEUS_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
Default environment variables
# cat /etc/default/prometheus
PROMETHEUS_OPTS='--config.file=/etc/prometheus/prometheus.yml --web.page-title="CHOT Metrics" --storage.tsdb.path=/prometheus/monitor/data --storage.tsdb.retention=7d --enable-feature=promql-negative-offset'
Start and stop
systemctl start prometheus
systemctl stop prometheus
systemctl status prometheus
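If the unit file was just created or changed, systemd needs to re-read it before the commands above take effect. A minimal sketch, using the unit name defined above:

# Reload systemd's unit definitions, then enable and start the service at boot
systemctl daemon-reload
systemctl enable --now prometheus
# Follow the service log while it starts
journalctl -u prometheus -f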
Prometheus will not start after changing the storage directory
Background
As the number of monitored nodes grew, Prometheus storage came under heavy pressure. After all, no dedicated hardware was ever approved; the instance runs on resources I carved out of an idle server with VirtualBox, yet it is still responsible for monitoring a large number of production machines.
Startup error
# journalctl -xe
Sep 08 14:56:20 meta prometheus[57772]: ts=2022-09-08T06:56:20.789Z caller=query_logger.go:90 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/monitor/data/queries.active err="open /prometheus/monitor/data/queries.active: permission denied"
Sep 08 14:56:26 meta prometheus[57825]: ts=2022-09-08T06:56:26.044Z caller=main.go:188 level=warn msg="This option for --enable-feature is now permanently enabled and therefore a no-op." option=promql-negative-offset
The only change was the storage directory: --storage.tsdb.path=/prometheus/monitor/data.
This is a new disk added to the VM, formatted and mounted at /prometheus.
[root@meta /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.8G  180K  7.8G   1% /dev/shm
tmpfs           7.8G  760M  7.1G  10% /run
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/sda1        40G   20G   21G  48% /
/dev/sdb1        50G  151M   50G   1% /prometheus
tmpfs           1.6G     0  1.6G   0% /run/user/0
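For reference, a sketch of one way such a data disk can be formatted and mounted. It assumes the /dev/sdb1 partition shown above and an xfs filesystem; it is not necessarily how this particular disk was prepared:

# Create an xfs filesystem on the new partition (destructive!)
mkfs.xfs /dev/sdb1
# Mount it at /prometheus and make the mount survive reboots
mkdir -p /prometheus
mount /dev/sdb1 /prometheus
echo '/dev/sdb1 /prometheus xfs defaults 0 0' >> /etc/fstab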
Why is permission denied?
Because the Prometheus persistence directory must be owned by the prometheus user and group, while the newly created directory is owned by root. In the words of the official documentation:
The user running Prometheus within the container has a specific user id and group id because it is dangerous to run as root
Fix
chown prometheus:prometheus /prometheus/monitor/data/
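A quick way to verify the fix before restarting, using the paths from this setup (-R also covers anything already inside the directory):

# Hand the data directory over to the prometheus user and confirm ownership
chown -R prometheus:prometheus /prometheus/monitor/data/
ls -ld /prometheus/monitor/data/
# Restart and check the unit
systemctl restart prometheus
systemctl status prometheus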
Reference
https://github.com/prometheus/prometheus/issues/5976
Errors when scraping Redis metrics with Prometheus
After starting redis_exporter, it kept logging errors:
~]$ redis_exporter -redis-only-metrics -redis.addr redis://@10.50.10.45:6379
INFO[0000] Redis Metrics Exporter v1.37.0    build date: 2022-03-18-01:20:01    sha1: a1c28b775760f2f00fce07a24db7fd4e83c26b9f    Go: go1.17.8    GOOS: linux    GOARCH: amd64
INFO[0000] Providing metrics at :9121/metrics
ERRO[0001] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0001] Redis INFO err: set tcp 10.50.10.45:41601: use of closed network connection
ERRO[0011] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0011] Redis INFO err: set tcp 10.50.10.45:41604: use of closed network connection
ERRO[0021] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0021] Redis INFO err: set tcp 10.50.10.45:41613: use of closed network connection
ERRO[0032] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0032] Redis INFO err: set tcp 10.50.10.45:41651: use of closed network connection
ERRO[0042] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0042] Redis INFO err: set tcp 10.50.10.45:41663: use of closed network connection
ERRO[0052] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0052] Redis INFO err: set tcp 10.50.10.45:41664: use of closed network connection
ERRO[0061] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0061] Redis INFO err: set tcp 10.50.10.45:41666: use of closed network connection
ERRO[0071] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0071] Redis INFO err: set tcp 10.50.10.45:41669: use of closed network connection
ERRO[0081] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0081] Redis INFO err: set tcp 10.50.10.45:41677: use of closed network connection
ERRO[0091] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0091] Redis INFO err: set tcp 10.50.10.45:41696: use of closed network connection
Port 9121 still served the metrics, though, and the targets on the Prometheus Status page also looked healthy. The cause:
Look closely at the endpoints on the Prometheus Status page: when a single node runs multiple Redis instances, each scrape has to be redirected to a specific target. This is really the problem of configuring Prometheus to scrape multiple Redis instances through one exporter, and that is exactly what I had run into.
redis_exporter startup command for multiple instances on a single host
curl 10.50.10.25/pigsty/redis_exporter -o /usr/bin/redis_exporter && chmod a+x /usr/bin/redis_exporter
# With multiple instances on one host, do not pass -redis.addr; put the addresses in the file-based discovery target files instead
nohup redis_exporter -redis-only-metrics &>/dev/null &
A single redis_exporter collects the metrics of all instances.
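The multi-target pattern can be checked by hand before touching Prometheus: the exporter's /scrape endpoint takes the Redis address as a target parameter, which is exactly what the relabel rules below will supply. A sketch, assuming the exporter runs locally on its default port 9121:

# Ask the exporter to scrape one specific Redis instance on demand
curl -s 'http://localhost:9121/scrape?target=redis://10.50.10.45:6379' | head
# The exporter's own process metrics stay on /metrics
curl -s 'http://localhost:9121/metrics' | head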
How does Prometheus scrape the metrics of multiple Codis instances?
Reference: https://github.com/oliver006/redis_exporter
Prometheus configuration for multiple Redis instances
#------------------------------------------------------------------------------
# job: redis
# multiple redis targets from redis_exporter @ target nodes
# labels: [cls, ip, ins, instance]
# path: targets/redis/<redis_node>.yml
#------------------------------------------------------------------------------
- job_name: redis
  metrics_path: /scrape
  file_sd_configs:
    - refresh_interval: 5s
      files: [ targets/redis/*.yml ]
  relabel_configs:
    - source_labels: [__address__]    # source labels: label names joined with the configured separator, matched against the regex
      target_label: __param_target    # target label: the label that is overwritten when the replace or hashmod action is used
    - source_labels: [__param_target]
      regex: ^redis://(.*):(\d+)$     # regex matched against the concatenated source labels; the default (.*) matches anything
      replacement: $1:$2              # replacement string written to the target label; $1 -> (.*), $2 -> (\d+)
      target_label: instance
    - source_labels: [__param_target]
      regex: ^redis://(.*):(\d+)$
      replacement: $1
      target_label: ip
    # scrape redis_exporter on target node
    - source_labels: [__param_target]
      regex: ^redis://(.*):\d+$
      replacement: $1:9121
      target_label: __address__

#### The official example configuration, for comparison
- job_name: 'redis_exporter_targets'
  metrics_path: /scrape
  static_configs:
    - targets:
        - redis://ip:6379
        - redis://ip1:6379
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: <<REDIS-EXPORTER-HOSTNAME>>:9121

## config for scraping the exporter itself
- job_name: 'redis_exporter'
  static_configs:
    - targets:
        - <<REDIS-EXPORTER-HOSTNAME>>:9121
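Before reloading it is worth validating the edited configuration; a sketch, assuming the config path used earlier in this post:

# Validate prometheus.yml (rule files it references are checked as well)
promtool check config /etc/prometheus/prometheus.yml
# Reload without a restart (the systemd unit above maps reload to SIGHUP)
systemctl reload prometheus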
targets/redis/*.yml
- labels: { ip: 10.50.10.45 , ins: codis-2-6379 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6379 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6380 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6380 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6381 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6381 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6382 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6382 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6383 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6383 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6384 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6384 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6385 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6385 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6386 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6386 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6379 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6379 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6380 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6380 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6381 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6381 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6382 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6382 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6383 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6383 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6384 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6384 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6385 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6385 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6386 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6386 ]
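Writing sixteen near-identical entries by hand is error-prone. A small sketch that generates the file for one node; the IP, instance prefix, port range and output path are placeholders to adjust:

# Generate targets/redis/codis-2.yml covering ports 6379-6386 on 10.50.10.45
ip=10.50.10.45; cls=codis-prod; prefix=codis-2
for port in $(seq 6379 6386); do
  printf -- '- labels: { ip: %s , ins: %s-%s , cls: %s }\n  targets: [ redis://%s:%s ]\n' \
    "$ip" "$prefix" "$port" "$cls" "$ip" "$port"
done > targets/redis/codis-2.yml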
Redirecting to the specified scrape target
Prometheus relabel_configs
The relabeling syntax and structure are identical in the relabel_configs and metric_relabel_configs blocks. The only difference is when they are applied: relabel_configs runs after service discovery and before the scrape, while metric_relabel_configs runs after the scrape.
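One way to see the result of relabeling is to ask Prometheus for its active targets and inspect the final labels. A sketch, assuming Prometheus listens on localhost:9090 and jq is installed:

# Show the relabeled labels (instance, ip, cls, ins) and scrape URL for the redis job
curl -s 'http://localhost:9090/api/v1/targets' \
  | jq '.data.activeTargets[] | select(.labels.job == "redis") | {scrapeUrl, labels}'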
redis_exporter startup command on the Codis servers
nohup redis_exporter -redis-only-metrics &>/dev/null &
Check with netstat
~]# netstat -nltp | grep 9121
tcp        0      0 :::9121        :::*        LISTEN      1872/redis_exporter
Monitoring dashboard URLs
codis:        http://10.50.10.25:3000/d/redis-cluster/redis-cluster?var-cls=codis-prod&orgId=1
node monitor: http://10.50.10.25:3000/d/nodes-cluster/nodes-cluster?var-cls=QMS&orgId=1
Prometheus host auto-discovery with file_sd_configs
This is implemented with two settings: metrics_path and relabel_configs.
relabel_configs
allow advanced modifications to any target and its labels before scraping.
The two labels Prometheus attaches by itself
Every metric naturally carries two topology labels: job and instance. The job label is set from the job name in the scrape configuration; we tend to use job to describe the type of thing being monitored. In the earlier Node Exporter job we named it node, so every Node Exporter metric gets the job label node. The instance label identifies the target; it is usually the target's IP address and port, and it is derived from the __address__ label.
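These are the labels you typically filter on in PromQL. A small sketch, querying through the HTTP API and assuming Prometheus listens on localhost:9090 and jq is available:

# Which node_exporter targets are up? Filter on the job label, read the instance label in the result
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="node"}' | jq .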
Prometheus performance problems
Prometheus suddenly ground to a halt and the dashboards would not even load. A look at vmstat showed the CPU nearly maxed out, with a huge number of soft interrupts.
Analyzing Prometheus from its logs
journalctl -u prometheus -f
1. Dec 23 08:58:52 meta-162 prometheus[12490]: ts=2022-12-23T00:58:52.344Z caller=main.go:956 level=warn fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
2. Dec 23 08:58:53 meta-162 prometheus[12490]: ts=2022-12-23T00:58:53.018Z caller=main.go:910 level=info msg="Server is ready to receive web requests."
3. Dec 23 09:08:14 meta-162 prometheus[12490]: ts=2022-12-23T01:08:14.588Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=167171040000 maxt=1671717600000 ulid=01GMY7V2MQWB32XXT0W2T4K38J duration=9m15.935444748s
4. Dec 23 09:13:15 meta-162 prometheus[12490]: ts=2022-12-23T01:13:14.756Z caller=head.go:840 level=info component=tsdb msg="Head GC completed" duration=3m7.280873479s
First of all, my backend storage is NFS, which the Prometheus documentation advises against using.
Note: Prometheus's local storage does not support non-POSIX-compliant filesystems, because unrecoverable corruption may occur. NFS filesystems (including AWS EFS) are not supported. NFS can be POSIX-compliant, but most implementations are not. A local filesystem is strongly recommended for reliability.
A single write block took more than nine minutes
What is a write block?
When the amount of chunk data persisted to disk reaches a certain threshold, that batch of older data is split off from the chunks and becomes a block. Older, smaller blocks are periodically compacted into one larger block, and a block is finally deleted once its retention time is exceeded. The wal directory holds the data currently being written; it contains multiple segment files, each at most 128 MB by default. Prometheus keeps at least three segment files, and on heavily loaded machines at least two hours of data. The data in the wal directory is not compressed, so it is slightly larger than the data inside blocks.
├── 01GMNFPT9RH76VC5MJ6WY0AAXY
│   ├── chunks            # compressed time series data; each chunk file is 512 MB by default, a new one is created beyond that
│   │   ├── 000001        # 512 MB by default
│   │   ├── 000002
│   │   ├── 000003
│   │   ├── 000004
│   │   ├── 000005
│   │   ├── 000006
│   │   └── 000007
│   ├── index             # offsets into the chunks
│   ├── meta.json         # block metadata, e.g. sample start time, number of chunks, data size
│   └── tombstones        # soft deletes issued via the API are recorded here (the data is not removed from the chunk files immediately)
├── lock
├── chunks_head           # also contains multiple chunks; when the in-memory head block no longer fits, data is written here and a reference to the file is kept
├── queries.active
└── wal                   # guards against data loss (freshly collected data lives in memory; the wal records it)
    ├── 00000366          # each segment is at most 128 MB and holds roughly two hours of data by default
    ├── 00000367
    ├── 00000368
    ├── 00000369
    └── checkpoint.000365
        └── 00000000
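To see what compaction is actually working on, promtool can inspect the data directory directly; a sketch, using the storage path from this setup (run it against a copy, or while Prometheus is stopped, to be safe):

# List the blocks in the data directory (ULID, time range, number of samples and series)
promtool tsdb list /prometheus/monitor/data
# Analyze the most recent block for series cardinality and label churn
promtool tsdb analyze /prometheus/monitor/data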
Prometheus performance problems, continued
2023-04-03 09:34:59
Because Prometheus's backend storage was NFS, I later switched it to local storage.
After running for a while, performance slowly degraded again; as the number of collected metrics grew, the server's CPU usage shot up.
Check the Prometheus Overview dashboard
Look at the trend of scrape duration: scrape_duration_seconds measures how long Prometheus takes to collect the metrics of a single target. If this metric is high, it may indicate a problem with Prometheus's scraping performance.
prometheus_rule_evaluations_total is another metric exposed by the Prometheus monitoring and alerting system.
It measures how many times Prometheus has evaluated alerting or recording rules.
Prometheus lets you define rules that describe the conditions under which your application or system should fire an alert or record a metric.
These rules are evaluated by Prometheus at regular intervals to detect whether the conditions they specify are met.
The prometheus_rule_evaluations_total counter tracks how many times Prometheus has evaluated these rules, and it can be used to identify problems or bottlenecks in the rule evaluation process.
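Both metrics are easy to eyeball from the command line; a sketch, assuming Prometheus listens on localhost:9090 and jq is installed:

# The ten slowest scrape targets right now
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, scrape_duration_seconds)' | jq .
# Rule evaluation rate over the last five minutes, per rule group
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (rule_group) (rate(prometheus_rule_evaluations_total[5m]))' | jq .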