Prometheus pitfall notes
How do you manage Prometheus with systemd?
Create the service unit file
# -*- mode: conf -*-
[Unit]
Description=The Prometheus monitoring system and time series database.
Documentation=https://prometheus.io
After=network.target

[Service]
EnvironmentFile=-/etc/default/prometheus
User=prometheus
ExecStart=/usr/bin/prometheus $PROMETHEUS_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=5s
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
Default environment variables
# cat /etc/default/prometheus
PROMETHEUS_OPTS='--config.file=/etc/prometheus/prometheus.yml --web.page-title="CHOT Metrics" --storage.tsdb.path=/prometheus/monitor/data --storage.tsdb.retention=7d --enable-feature=promql-negative-offset'
Start and stop
systemctl start prometheus
systemctl stop prometheus
systemctl status prometheus
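If the unit file was just created or changed, systemd needs to re-read it before the commands above take effect. A minimal sketch, using the unit name defined above:

# Reload systemd's unit definitions, then enable and start the service at boot
systemctl daemon-reload
systemctl enable --now prometheus
# Follow the service log while it starts
journalctl -u prometheus -f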
Prometheus will not start after changing the storage directory
Background
As the number of monitored nodes grew, Prometheus storage came under heavy pressure. After all, no dedicated hardware was ever approved; the instance runs on resources I carved out of an idle server with VirtualBox, yet it is still responsible for monitoring a large number of production machines.
Startup error
# journalctl -xe
Sep 08 14:56:20 meta prometheus[57772]: ts=2022-09-08T06:56:20.789Z caller=query_logger.go:90 level=error component=activeQueryTracker msg="Error opening query log file" file=/prometheus/monitor/data/queries.active err="open /prometheus/monitor/data/queries.active: permission denied"
Sep 08 14:56:26 meta prometheus[57825]: ts=2022-09-08T06:56:26.044Z caller=main.go:188 level=warn msg="This option for --enable-feature is now permanently enabled and therefore a no-op." option=promql-negative-offset
The only change was the storage directory: --storage.tsdb.path=/prometheus/monitor/data.
This is a new disk added to the VM, formatted and mounted at /prometheus.
[root@meta /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.8G  180K  7.8G   1% /dev/shm
tmpfs           7.8G  760M  7.1G  10% /run
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/sda1        40G   20G   21G  48% /
/dev/sdb1        50G  151M   50G   1% /prometheus
tmpfs           1.6G     0  1.6G   0% /run/user/0
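For reference, a sketch of one way such a data disk can be formatted and mounted. It assumes the /dev/sdb1 partition shown above and an xfs filesystem; it is not necessarily how this particular disk was prepared:

# Create an xfs filesystem on the new partition (destructive!)
mkfs.xfs /dev/sdb1
# Mount it at /prometheus and make the mount survive reboots
mkdir -p /prometheus
mount /dev/sdb1 /prometheus
echo '/dev/sdb1 /prometheus xfs defaults 0 0' >> /etc/fstab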
Why is permission denied?
Because the Prometheus persistence directory must be owned by the prometheus user and group, while the newly created directory is owned by root. In the words of the official documentation:
The user running Prometheus within the container has a specific user id and group id because it is dangerous to run as root
Fix
chown prometheus:prometheus /prometheus/monitor/data/
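A quick way to verify the fix before restarting, using the paths from this setup (-R also covers anything already inside the directory):

# Hand the data directory over to the prometheus user and confirm ownership
chown -R prometheus:prometheus /prometheus/monitor/data/
ls -ld /prometheus/monitor/data/
# Restart and check the unit
systemctl restart prometheus
systemctl status prometheus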
Reference
https://github.com/prometheus/prometheus/issues/5976
Errors when scraping Redis metrics with Prometheus
After starting redis_exporter, it kept logging errors:
~]$ redis_exporter -redis-only-metrics -redis.addr redis://@10.50.10.45:6379
INFO[0000] Redis Metrics Exporter v1.37.0    build date: 2022-03-18-01:20:01    sha1: a1c28b775760f2f00fce07a24db7fd4e83c26b9f    Go: go1.17.8    GOOS: linux    GOARCH: amd64
INFO[0000] Providing metrics at :9121/metrics
ERRO[0001] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0001] Redis INFO err: set tcp 10.50.10.45:41601: use of closed network connection
ERRO[0011] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0011] Redis INFO err: set tcp 10.50.10.45:41604: use of closed network connection
ERRO[0021] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0021] Redis INFO err: set tcp 10.50.10.45:41613: use of closed network connection
ERRO[0032] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0032] Redis INFO err: set tcp 10.50.10.45:41651: use of closed network connection
ERRO[0042] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0042] Redis INFO err: set tcp 10.50.10.45:41663: use of closed network connection
ERRO[0052] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0052] Redis INFO err: set tcp 10.50.10.45:41664: use of closed network connection
ERRO[0061] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0061] Redis INFO err: set tcp 10.50.10.45:41666: use of closed network connection
ERRO[0071] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0071] Redis INFO err: set tcp 10.50.10.45:41669: use of closed network connection
ERRO[0081] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0081] Redis INFO err: set tcp 10.50.10.45:41677: use of closed network connection
ERRO[0091] Couldn't set client name, err: redigo: unexpected response line (possible server error or unsupported concurrent read by application)
ERRO[0091] Redis INFO err: set tcp 10.50.10.45:41696: use of closed network connection
Port 9121 still served the metrics, though, and the targets on the Prometheus Status page also looked healthy. The cause:
Look closely at the endpoints on the Prometheus Status page: when a single node runs multiple Redis instances, each scrape has to be redirected to a specific target. This is really the problem of configuring Prometheus to scrape multiple Redis instances through one exporter, and that is exactly what I had run into.
redis_exporter startup command for multiple instances on a single host
curl 10.50.10.25/pigsty/redis_exporter -o /usr/bin/redis_exporter && chmod a+x /usr/bin/redis_exporter
# With multiple instances on one host, do not pass -redis.addr; put the addresses in the file-based discovery target files instead
nohup redis_exporter -redis-only-metrics &>/dev/null &
A single redis_exporter collects the metrics of all instances.
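The multi-target pattern can be checked by hand before touching Prometheus: the exporter's /scrape endpoint takes the Redis address as a target parameter, which is exactly what the relabel rules below will supply. A sketch, assuming the exporter runs locally on its default port 9121:

# Ask the exporter to scrape one specific Redis instance on demand
curl -s 'http://localhost:9121/scrape?target=redis://10.50.10.45:6379' | head
# The exporter's own process metrics stay on /metrics
curl -s 'http://localhost:9121/metrics' | head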
How does Prometheus scrape the metrics of multiple Codis instances?
Reference: https://github.com/oliver006/redis_exporter
Prometheus configuration for multiple Redis instances
#------------------------------------------------------------------------------
# job: redis
# multiple redis targets from redis_exporter @ target nodes
# labels: [cls, ip, ins, instance]
# path: targets/redis/<redis_node>.yml
#------------------------------------------------------------------------------
- job_name: redis
  metrics_path: /scrape
  file_sd_configs:
    - refresh_interval: 5s
      files: [ targets/redis/*.yml ]
  relabel_configs:
    - source_labels: [__address__]    # source labels: label names joined with the configured separator, matched against the regex
      target_label: __param_target    # target label: the label that is overwritten when the replace or hashmod action is used
    - source_labels: [__param_target]
      regex: ^redis://(.*):(\d+)$     # regex matched against the concatenated source labels; the default (.*) matches anything
      replacement: $1:$2              # replacement string written to the target label; $1 -> (.*), $2 -> (\d+)
      target_label: instance
    - source_labels: [__param_target]
      regex: ^redis://(.*):(\d+)$
      replacement: $1
      target_label: ip
    # scrape redis_exporter on target node
    - source_labels: [__param_target]
      regex: ^redis://(.*):\d+$
      replacement: $1:9121
      target_label: __address__

#### The official example configuration, for comparison
- job_name: 'redis_exporter_targets'
  metrics_path: /scrape
  static_configs:
    - targets:
        - redis://ip:6379
        - redis://ip1:6379
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: <<REDIS-EXPORTER-HOSTNAME>>:9121

## config for scraping the exporter itself
- job_name: 'redis_exporter'
  static_configs:
    - targets:
        - <<REDIS-EXPORTER-HOSTNAME>>:9121
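Before reloading it is worth validating the edited configuration; a sketch, assuming the config path used earlier in this post:

# Validate prometheus.yml (rule files it references are checked as well)
promtool check config /etc/prometheus/prometheus.yml
# Reload without a restart (the systemd unit above maps reload to SIGHUP)
systemctl reload prometheus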
targets/redis/*.yml
- labels: { ip: 10.50.10.45 , ins: codis-2-6379 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6379 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6380 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6380 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6381 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6381 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6382 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6382 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6383 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6383 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6384 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6384 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6385 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6385 ]
- labels: { ip: 10.50.10.45 , ins: codis-2-6386 , cls: codis-prod }
  targets: [ redis://10.50.10.45:6386 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6379 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6379 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6380 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6380 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6381 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6381 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6382 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6382 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6383 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6383 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6384 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6384 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6385 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6385 ]
- labels: { ip: 10.50.10.47 , ins: codis-1-6386 , cls: codis-prod }
  targets: [ redis://10.50.10.47:6386 ]
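Writing sixteen near-identical entries by hand is error-prone. A small sketch that generates the file for one node; the IP, instance prefix, port range and output path are placeholders to adjust:

# Generate targets/redis/codis-2.yml covering ports 6379-6386 on 10.50.10.45
ip=10.50.10.45; cls=codis-prod; prefix=codis-2
for port in $(seq 6379 6386); do
  printf -- '- labels: { ip: %s , ins: %s-%s , cls: %s }\n  targets: [ redis://%s:%s ]\n' \
    "$ip" "$prefix" "$port" "$cls" "$ip" "$port"
done > targets/redis/codis-2.yml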
Redirecting to the specified scrape target
Prometheus relabel_configs
The relabeling syntax and structure are identical in the relabel_configs and metric_relabel_configs blocks. The only difference is when they are applied: relabel_configs runs after service discovery and before the scrape, while metric_relabel_configs runs after the scrape.
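One way to see the result of relabeling is to ask Prometheus for its active targets and inspect the final labels. A sketch, assuming Prometheus listens on localhost:9090 and jq is installed:

# Show the relabeled labels (instance, ip, cls, ins) and scrape URL for the redis job
curl -s 'http://localhost:9090/api/v1/targets' \
  | jq '.data.activeTargets[] | select(.labels.job == "redis") | {scrapeUrl, labels}'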
redis_exporter startup command on the Codis servers
nohup redis_exporter -redis-only-metrics &>/dev/null &
Check with netstat
~]# netstat -nltp | grep 9121
tcp        0      0 :::9121        :::*        LISTEN      1872/redis_exporter
Monitoring dashboard URLs
codis:        http://10.50.10.25:3000/d/redis-cluster/redis-cluster?var-cls=codis-prod&orgId=1
node monitor: http://10.50.10.25:3000/d/nodes-cluster/nodes-cluster?var-cls=QMS&orgId=1
Prometheus host auto-discovery with file_sd_configs
This is implemented with two settings: metrics_path and relabel_configs.
relabel_configs
allow advanced modifications to any target and its labels before scraping.
The two labels Prometheus attaches by itself
Every metric naturally carries two topology labels: job and instance. The job label is set from the job name in the scrape configuration; we tend to use job to describe the type of thing being monitored. In the earlier Node Exporter job we named it node, so every Node Exporter metric gets the job label node. The instance label identifies the target; it is usually the target's IP address and port, and it is derived from the __address__ label.
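These are the labels you typically filter on in PromQL. A small sketch, querying through the HTTP API and assuming Prometheus listens on localhost:9090 and jq is available:

# Which node_exporter targets are up? Filter on the job label, read the instance label in the result
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="node"}' | jq .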
Prometheus performance problems
Prometheus suddenly ground to a halt and the dashboards would not even load. A look at vmstat showed the CPU nearly maxed out, with a huge number of soft interrupts.
Analyzing Prometheus from its logs
journalctl -u prometheus -f
1. Dec 23 08:58:52 meta-162 prometheus[12490]: ts=2022-12-23T00:58:52.344Z caller=main.go:956 level=warn fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
2. Dec 23 08:58:53 meta-162 prometheus[12490]: ts=2022-12-23T00:58:53.018Z caller=main.go:910 level=info msg="Server is ready to receive web requests."
3. Dec 23 09:08:14 meta-162 prometheus[12490]: ts=2022-12-23T01:08:14.588Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=167171040000 maxt=1671717600000 ulid=01GMY7V2MQWB32XXT0W2T4K38J duration=9m15.935444748s
4. Dec 23 09:13:15 meta-162 prometheus[12490]: ts=2022-12-23T01:13:14.756Z caller=head.go:840 level=info component=tsdb msg="Head GC completed" duration=3m7.280873479s
First of all, my backend storage is NFS, which the Prometheus documentation advises against using.
Note: Prometheus's local storage does not support non-POSIX-compliant filesystems, because unrecoverable corruption may occur. NFS filesystems (including AWS EFS) are not supported. NFS can be POSIX-compliant, but most implementations are not. A local filesystem is strongly recommended for reliability.
A single write block took more than nine minutes
What is a write block?
When the amount of chunk data persisted to disk reaches a certain threshold, that batch of older data is split off from the chunks and becomes a block. Older, smaller blocks are periodically compacted into one larger block, and a block is finally deleted once its retention time is exceeded. The wal directory holds the data currently being written; it contains multiple segment files, each at most 128 MB by default. Prometheus keeps at least three segment files, and on heavily loaded machines at least two hours of data. The data in the wal directory is not compressed, so it is slightly larger than the data inside blocks.
├── 01GMNFPT9RH76VC5MJ6WY0AAXY
│   ├── chunks            # compressed time series data; each chunk file is 512 MB by default, a new one is created beyond that
│   │   ├── 000001        # 512 MB by default
│   │   ├── 000002
│   │   ├── 000003
│   │   ├── 000004
│   │   ├── 000005
│   │   ├── 000006
│   │   └── 000007
│   ├── index             # offsets into the chunks
│   ├── meta.json         # block metadata, e.g. sample start time, number of chunks, data size
│   └── tombstones        # soft deletes issued via the API are recorded here (the data is not removed from the chunk files immediately)
├── lock
├── chunks_head           # also contains multiple chunks; when the in-memory head block no longer fits, data is written here and a reference to the file is kept
├── queries.active
└── wal                   # guards against data loss (freshly collected data lives in memory; the wal records it)
    ├── 00000366          # each segment is at most 128 MB and holds roughly two hours of data by default
    ├── 00000367
    ├── 00000368
    ├── 00000369
    └── checkpoint.000365
        └── 00000000
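To see what compaction is actually working on, promtool can inspect the data directory directly; a sketch, using the storage path from this setup (run it against a copy, or while Prometheus is stopped, to be safe):

# List the blocks in the data directory (ULID, time range, number of samples and series)
promtool tsdb list /prometheus/monitor/data
# Analyze the most recent block for series cardinality and label churn
promtool tsdb analyze /prometheus/monitor/data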
Prometheus performance problems, continued
2023-04-03 09:34:59
Because Prometheus's backend storage was NFS, I later switched it to local storage.
After running for a while, performance slowly degraded again; as the number of collected metrics grew, the server's CPU usage shot up.
Check the Prometheus Overview dashboard
Look at the trend of scrape duration: scrape_duration_seconds measures how long Prometheus takes to collect the metrics of a single target. If this metric is high, it may indicate a problem with Prometheus's scraping performance.
prometheus_rule_evaluations_total is another metric exposed by the Prometheus monitoring and alerting system.
It measures how many times Prometheus has evaluated alerting or recording rules.
Prometheus lets you define rules that describe the conditions under which your application or system should fire an alert or record a metric.
These rules are evaluated by Prometheus at regular intervals to detect whether the conditions they specify are met.
The prometheus_rule_evaluations_total counter tracks how many times Prometheus has evaluated these rules, and it can be used to identify problems or bottlenecks in the rule evaluation process.
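Both metrics are easy to eyeball from the command line; a sketch, assuming Prometheus listens on localhost:9090 and jq is installed:

# The ten slowest scrape targets right now
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, scrape_duration_seconds)' | jq .
# Rule evaluation rate over the last five minutes, per rule group
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (rule_group) (rate(prometheus_rule_evaluations_total[5m]))' | jq .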