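This monitoring stack is simple, but it looks complicated if you do not know the workflow. Learn the order in which things are configured first. Once you understand the overall framework, split the configuration into parts and look up each part separately. The main obstacle is not knowing how each layer is configured, and therefore not knowing where to start. Skim a few books on the topic, understand the flow, write down your questions in configuration order, and by the time you have answered them one by one the stack will be running. Once the basic path works, move on to fine-tuning. Practice makes perfect: until you have built it 20 times, do not claim you have learned it.
When learning something new, avoid perfectionism. Getting a simplified version of the whole path working first does a lot for your confidence. After that, study each technical point in depth. Finally, drawing on the problems you hit in production, consider how well each feature fits your environment, so that you arrive at a configuration that suits your own company.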
Simplified diagram
Server information:
Node     IP address    Services
node01   10.10.8.62    grafana, prometheus, alertmanager, node_exporter, mysqld_exporter
node02   10.10.8.63    node_exporter, mysqld_exporter
Create a dedicated user and group
groupadd monitor
useradd -MN -s /sbin/nologin monitor -g monitor
grafana
Installation
node01
#wget https://dl.grafana.com/oss/release/grafana-10.4.0.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf grafana-10.4.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/grafana-v10.4.0/ /usr/local/grafana
Configuration
node01
mkdir -p /usr/local/grafana/data/{log,plugins,socket}
cp /usr/local/grafana/conf/defaults.ini /usr/local/grafana/conf/granfana.ini
chown -R monitor:monitor /usr/local/grafana/
# Move the unix socket under the Grafana data directory
sed -i 's#socket = /tmp/grafana.sock#socket = data/socket/grafana.sock#g' /usr/local/grafana/conf/granfana.ini
# Switch the default UI language from en-US to zh-CN
sed -i 's#en-US#zh-CN#g' /usr/local/grafana/conf/granfana.ini
Startup
node01
cat >/usr/lib/systemd/system/grafana.service<<'EOF'
[Unit]
Description=Grafana
After=network.target

[Service]
User=monitor
Group=monitor
Environment="GRAFANA_HOME=/usr/local/grafana"
ExecStart=/usr/local/grafana/bin/grafana-server --config=/usr/local/grafana/conf/granfana.ini --homepath=/usr/local/grafana
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart grafana
systemctl status grafana
systemctl enable grafana
Default username/password: admin/admin
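Before logging in, it can be handy to confirm from the shell that Grafana is actually answering. The sketch below polls Grafana's `/api/health` endpoint. It assumes the default HTTP port 3000 (the port is not stated above), and the `GRAFANA_URL` variable and the `wait_for_grafana` helper are illustrative names, not part of Grafana.

```shell
# Sketch: wait until Grafana answers on its health endpoint.
# GRAFANA_URL and wait_for_grafana are hypothetical names for this example.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

wait_for_grafana() {
    local attempts="${1:-10}" i
    for i in $(seq 1 "${attempts}"); do
        # --fail makes curl return non-zero on HTTP errors (4xx/5xx)
        if curl --silent --fail "${GRAFANA_URL}/api/health" >/dev/null; then
            echo "Grafana is up after ${i} attempt(s)"
            return 0
        fi
        sleep 2
    done
    echo "Grafana did not answer on ${GRAFANA_URL}" >&2
    return 1
}
```

Running `wait_for_grafana 15` right after `systemctl restart grafana` gives the service about 30 seconds to come up before reporting a failure.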
prometheus
A ready-made collection of alerting rules. There is no need to hand-write monitoring rules; adapt these and use them.
https://github.com/samber/awesome-prometheus-alerts#-rules
https://samber.github.io/awesome-prometheus-alerts/
Installation
node01
#wget https://github.com/prometheus/prometheus/releases/download/v2.50.1/prometheus-2.50.1.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf prometheus-2.50.1.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/prometheus-2.50.1.linux-amd64/ /usr/local/prometheus
cd /usr/local/prometheus
Configuration
node01
cat >/usr/local/prometheus/prometheus.yml<<'EOF'
global:
  scrape_interval: 15s      # How often targets are scraped. Set to 15s here; the default is 1 minute, and 10-60s is typical.
  evaluation_interval: 15s  # How often Prometheus evaluates rules. Set to 15s here.

alerting:
  alertmanagers:
    - static_configs:       # Static Alertmanager addresses; service discovery can be used instead
        - targets:          # Several addresses can be listed
            - 10.10.8.62:9093

# Alerting rule files
rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']

  # Linux hosts
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'localhost:9100'

  # MySQL
  - job_name: 'mysqld-exporter'
    static_configs:
      - targets:
          - localhost:3306
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # host:port of mysqld_exporter
        replacement: localhost:9104
EOF
# Create the alert rules directory
mkdir /usr/local/prometheus/rules
chown -R monitor:monitor /usr/local/prometheus/
# No --storage.tsdb.path is set, so the TSDB lands in data/ under the service's
# working directory, which is / under systemd, hence /data
mkdir -p /data
chown -R monitor:monitor /data
Check the configuration
node01
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
Startup
node01
cat >/usr/lib/systemd/system/prometheus.service<<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=monitor
Group=monitor
ExecStart=/usr/local/prometheus/prometheus \
  --config.file "/usr/local/prometheus/prometheus.yml" \
  --web.listen-address "0.0.0.0:9090" \
  --storage.tsdb.retention.time=1095d \
  --web.enable-lifecycle
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus
systemctl enable prometheus
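node01
# Check the configuration
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Reload the configuration (works because --web.enable-lifecycle is set)
curl -X POST http://127.0.0.1:9090/-/reload
Configure the data source in Grafana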
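The data source can be added by hand in the Grafana UI. As an alternative, Grafana can load data sources from provisioning files at startup. The following is a minimal sketch, assuming the default provisioning directory `conf/provisioning` under the Grafana home path and Prometheus listening on 10.10.8.62:9090. The helper name `write_prometheus_datasource` and the file name `prometheus.yaml` are made up for illustration.

```shell
# Sketch: write a Grafana data source provisioning file for Prometheus.
# write_prometheus_datasource is a hypothetical helper, not part of Grafana.
write_prometheus_datasource() {
    local grafana_home="$1"      # e.g. /usr/local/grafana
    local prometheus_url="$2"    # e.g. http://10.10.8.62:9090
    local dir="${grafana_home}/conf/provisioning/datasources"

    mkdir -p "${dir}"
    cat >"${dir}/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${prometheus_url}
    isDefault: true
EOF
}

# Usage on node01 (as root), followed by a Grafana restart:
#   write_prometheus_datasource /usr/local/grafana http://10.10.8.62:9090
#   chown -R monitor:monitor /usr/local/grafana/conf/provisioning
#   systemctl restart grafana
```

Keeping the data source in a file makes the setup repeatable, which fits the advice of rebuilding the stack many times.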
alertmanager
Installation
node01
#https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf alertmanager-0.27.0.linux-amd64.tar.gz -C /usr/local
mv /usr/local/alertmanager-0.27.0.linux-amd64/ /usr/local/alertmanager
Configuration
node01
cat >/usr/local/alertmanager/alertmanager.yml<<'EOF'
global:
  resolve_timeout: 5m
  # Email
  smtp_smarthost: 'mail.test.com:25'
  smtp_from: 'test@test.com'
  smtp_auth_username: 'test@test.com'
  smtp_auth_password: 'test@!QAZ'
  smtp_require_tls: false
  # WeCom (WeChat Work)
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'ww2edb882dtest93222'  # Corp ID from the WeCom admin console

# Routing tree
route:
  # group_by: ['alertname']  # Group alerts by alert name
  group_wait: 1s             # How long to wait before sending the first notification for a new group
  group_interval: 1s         # How long to wait before notifying about new alerts added to a group
  repeat_interval: 1h        # How long to wait before re-sending a notification that is still firing
  receiver: 'email_wechat'   # Default receiver

receivers:
  - name: 'email_wechat'
    # Email
    email_configs:
      - to: 'duyuhang@inmyshow.com'
        html: '{{ template "email.html" . }}'
        send_resolved: true
    # WeCom
    wechat_configs:
      - send_resolved: true
        api_secret: 'x7NQ305cPcR1dsdsHDSnW9oU_ioOaGqdsdsdsdsds6Oy4M'
        agent_id: '10000034'  # agentid shown in the WeCom admin console
        message: '{{ template "wechat.message" . }}'
        to_party: '57'
        to_user: '@all'

# Notification template files
templates:
  - '/usr/local/alertmanager/templates/*.tmpl'

# Inhibition rules
#inhibit_rules:
#- source_match:
#    severity: 'critical'
#  target_match:
#    severity: 'warning'
#  equal: ['alertname', 'dev', 'instance']
EOF
Creating the WeCom bot: look it up yourself.
A trusted IP must be configured: https://blog.csdn.net/weixin_45385457/article/details/132278442
Email template
node01
# Notification templates
mkdir /usr/local/alertmanager/templates
cat >/usr/local/alertmanager/templates/email.tmpl<<'EOF'
{{ define "email.html" }}
{{ range .Alerts }}
Summary: {{ .Annotations.summary }} <br>
Host: {{ .Labels.instance }} <br>
Source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
{{ end }}
{{ end }}
EOF
WeChat notification template
node01
cat >/usr/local/alertmanager/templates/wechat.tmpl<<'EOF'
{{ define "wechat.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
Alert: {{ .Labels.instance }} {{ .Annotations.summary }}
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }};
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
Resolved: {{ .Labels.instance }} {{ .Annotations.summary }}
Alert type: {{ .Labels.alertname }}
Status: {{ .Status }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }};
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
{{- end }}
EOF
chown -R monitor:monitor /usr/local/alertmanager/
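Before starting the service, it can help to validate the configuration file and the templates it references. The Alertmanager release tarball ships an `amtool` binary with a `check-config` subcommand. The sketch below assumes it was extracted to /usr/local/alertmanager/amtool next to the `alertmanager` binary. The `AMTOOL` variable and the `check_alertmanager_config` wrapper are illustrative names.

```shell
# Sketch: validate alertmanager.yml before (re)starting the service.
# AMTOOL and check_alertmanager_config are hypothetical names for this example.
AMTOOL="${AMTOOL:-/usr/local/alertmanager/amtool}"

check_alertmanager_config() {
    local config="${1:-/usr/local/alertmanager/alertmanager.yml}"
    # amtool exits non-zero when the configuration cannot be loaded
    if "${AMTOOL}" check-config "${config}"; then
        echo "OK: ${config}"
    else
        echo "INVALID: ${config}" >&2
        return 1
    fi
}

# Usage on node01:
#   check_alertmanager_config && systemctl restart alertmanager
```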
Startup
node01
cat >/usr/lib/systemd/system/alertmanager.service<<'EOF'
[Unit]
Description=alertmanager
After=network.target

[Service]
User=monitor
Group=monitor
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
chown -R monitor:monitor /usr/local/alertmanager/
systemctl daemon-reload
systemctl restart alertmanager
systemctl status alertmanager
systemctl enable alertmanager
Configure the data source in Grafana
node_exporter
It must be installed on every server that needs to be monitored.
node_exporter handles Linux system monitoring: add it as a target in the Prometheus configuration file and import a dashboard into Grafana.
Installation
node01 node02
cd /home/zcsadmin/
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xf node_exporter-1.7.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/node_exporter-1.7.0.linux-amd64 /usr/local/node_exporter
Startup
node01 node02
cat >/usr/lib/systemd/system/node_exporter.service<<'EOF'
[Unit]
Description=node_exporter
After=network.target

[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart node_exporter
systemctl status node_exporter
systemctl enable node_exporter
Configuration
Grafana dashboard to import:
https://grafana.com/grafana/dashboards/1860-node-exporter-full/
Alert rules
node01
cd /usr/local/prometheus/rules && \
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/master/dist/rules/host-and-hardware/node-exporter.yml
# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload
Verification
node01 node02
curl 'http://localhost:9100/metrics' | grep cpu
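A plain `grep cpu` also matches the `# HELP` and `# TYPE` comment lines, so it says little about how many samples were actually exported. As a rough sketch, the helper below counts only the sample lines of one metric in exposition-format text read from stdin. `count_samples` is an illustrative name, and `node_cpu_seconds_total` and `node_load1` are used as example metric names.

```shell
# Sketch: count the samples of one metric in Prometheus exposition text (stdin).
# count_samples is a hypothetical helper for this example.
count_samples() {
    local metric="$1"
    # A sample line starts with the metric name followed by '{' (labels) or a space;
    # grep -c exits non-zero on zero matches, so the status is ignored on purpose
    grep -c -E "^${metric}(\{| )" || true
}

# Usage on a monitored host:
#   curl -s http://localhost:9100/metrics | count_samples node_cpu_seconds_total
```

A count of 0 for a metric that should exist points at a disabled collector rather than at Prometheus.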
mysqld_exporter
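It does not need to be installed on every monitored server. The procedure is:
- Install mysqld_exporter on the Prometheus server
- Create a shared MySQL credentials file for it
- Create the matching account in every MySQL instance to be monitored. Note: the account must be able to connect from the Prometheus server
- Open the required firewall rules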
Installation
node01
#wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.1/mysqld_exporter-0.15.1.linux-amd64.tar.gz
cd /home/zcsadmin/
tar xf mysqld_exporter-0.15.1.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/mysqld_exporter-0.15.1.linux-amd64 /usr/local/mysqld_exporter
Startup
node01
cat >/usr/lib/systemd/system/mysqld_exporter.service<<'EOF'
[Unit]
Description=mysqld_exporter
After=network.target

[Service]
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/config.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart mysqld_exporter
systemctl status mysqld_exporter
systemctl enable mysqld_exporter
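Configuration
Install a test MySQL
node01 node02
# mariadb-server provides the mariadb service; the mariadb package alone is only the client
yum install -y mariadb-server
systemctl start mariadb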
Client side: an account must be created in each MySQL instance to be monitored
node01
# Create the account in the database
create user exporter@'10.10.8.62' identified by 'exportertest';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'10.10.8.62';
node01
# Create the credentials file mysqld_exporter uses to connect to MySQL
cat >/usr/local/mysqld_exporter/config.my.cnf<<'EOF'
[client]
user = exporter
password = exportertest
EOF
node01
cat >/usr/local/prometheus/prometheus.yml<<'EOF'
global:
  scrape_interval: 15s      # How often targets are scraped. Set to 15s here; the default is 1 minute, and 10-60s is typical.
  evaluation_interval: 15s  # How often Prometheus evaluates rules. Set to 15s here.

alerting:
  alertmanagers:
    - static_configs:       # Static Alertmanager addresses; service discovery can be used instead
        - targets:          # Several addresses can be listed
            - 10.10.8.62:9093

# Alerting rule files
rule_files:
  - "rules/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']

  # Linux hosts
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'localhost:9100'
          - '10.10.8.63:9100'

  # MySQL
  - job_name: 'mysqld-exporter'
    params:
      # Optional. Selects the matching section of the exporter's config file; defaults to "client".
      auth_module: [client.servers]
    static_configs:
      - targets:
          - localhost:3306
          - 10.10.8.63:3306   # Added line; append further instances below
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        # host:port of mysqld_exporter
        replacement: localhost:9104
EOF
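The `auth_module: [client.servers]` parameter in this job refers to a `[client.servers]` section of the exporter's credentials file, while config.my.cnf as created earlier only contains `[client]`. A minimal sketch of a matching file follows, based on the multi-target layout described in the mysqld_exporter README. The helper name `write_exporter_cnf` is illustrative, and reusing the same exporter account for both sections is an assumption. In multi-target mode the exporter serves per-target metrics on `/probe`, so the scrape job most likely also needs `metrics_path: /probe`; treat that as something to verify against the exporter version in use.

```shell
# Sketch: write a mysqld_exporter credentials file with a [client.servers] section.
# write_exporter_cnf is a hypothetical helper for this example.
write_exporter_cnf() {
    local path="$1" user="$2" password="$3"
    cat >"${path}" <<EOF
# Used when a probe gives no auth_module
[client]
user = ${user}
password = ${password}

# Selected by auth_module=client.servers
[client.servers]
user = ${user}
password = ${password}
EOF
    # The file holds a password, so keep it private to its owner
    chmod 600 "${path}"
}

# Usage on node01:
#   write_exporter_cnf /usr/local/mysqld_exporter/config.my.cnf exporter exportertest
#   systemctl restart mysqld_exporter
#   curl 'http://localhost:9104/probe?target=10.10.8.63:3306&auth_module=client.servers' | grep mysql_up
```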
Alert rules
node01
cd /usr/local/prometheus/rules && \
wget https://raw.githubusercontent.com/samber/awesome-prometheus-alerts/master/dist/rules/mysql/mysqld-exporter.yml
# Fix ownership
chown -R monitor:monitor /usr/local/prometheus/
# Check the configuration
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
# Reload the configuration
curl -X POST http://127.0.0.1:9090/-/reload
Grafana dashboard ID to import: 7362
Verification
node01
curl 'http://localhost:9104/metrics' | grep mysql
curl 'http://10.10.8.63:9104/metrics' | grep mysql
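Service discovery
Monitoring a traditional environment does not need service discovery, and it is not very convenient there either; static configuration files are enough. If you do want it, file-based discovery is an option. If you run k8s, Consul is worth learning.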
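For reference, file-based discovery means Prometheus watches a set of JSON or YAML files and picks up target changes without a reload. The following is a minimal sketch, assuming a targets/ directory next to prometheus.yml and a scrape job that uses `file_sd_configs` with `files: ['targets/*.json']`. The helper name `write_node_targets`, the file name and the `env` label are illustrative.

```shell
# Sketch: generate a file_sd target file for node_exporter hosts.
# write_node_targets is a hypothetical helper for this example.
write_node_targets() {
    local file="$1"; shift
    local env_label="$1"; shift
    local targets="" t
    # Build a comma-separated list of quoted host:port entries
    for t in "$@"; do
        targets="${targets:+${targets}, }\"${t}\""
    done
    mkdir -p "$(dirname "${file}")"
    # Write to a temporary name and rename, so a half-written file is never picked up
    cat >"${file}.tmp" <<EOF
[
  {
    "targets": [${targets}],
    "labels": {"env": "${env_label}"}
  }
]
EOF
    mv "${file}.tmp" "${file}"
}

# Matching scrape job in prometheus.yml (replaces static_configs for this job):
#   - job_name: 'node-exporter'
#     file_sd_configs:
#       - files: ['targets/*.json']
#
# Usage on node01:
#   write_node_targets /usr/local/prometheus/targets/nodes.json production localhost:9100 10.10.8.63:9100
```

Adding a host then only means regenerating the JSON file; the scrape job itself stays untouched.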
Security
Configure HTTPS for Grafana
mkdir /usr/local/grafana/certificate
cd /usr/local/grafana/certificate
# Press Enter at every prompt
openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 3650 -out certificate.pem
# Grafana runs as the monitor user, so it must be able to read the key
chown -R monitor:monitor /usr/local/grafana/certificate
vim /usr/local/grafana/conf/granfana.ini
# In the [server] section:
protocol = https
cert_file = /usr/local/grafana/certificate/certificate.pem
cert_key = /usr/local/grafana/certificate/key.pem
systemctl restart grafana.service
systemctl status grafana.service
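Configure a username and password for Prometheus
After enabling this, update the connection settings of the Grafana data source as well.
Generate the password hash with the htpasswd tool
# Install the htpasswd tool
yum install httpd-tools -y
# Run the command; the password used here is admintest
htpasswd -nBC 12 '' | tr -d ':\n'
New password:
Re-type new password:
# The resulting bcrypt hash
$2y$12$NHyeXrePI1gUx/kAHLNfn.H6sizsTgIer/ishuh/cdczmntUJ3Ywm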
Configure the web user and password
cat >/usr/local/prometheus/web-config.yml<<'EOF'
basic_auth_users:
  admin: $2y$12$NHyeXrePI1gUx/kAHLNfn.H6sizsTgIer/ishuh/cdczmntUJ3Ywm
EOF
Add basic_auth to the Prometheus configuration, so that Prometheus can still scrape itself
vim /usr/local/prometheus/prometheus.yml
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    basic_auth:
      username: admin      # The user name is admin
      password: admintest  # The password is admintest
    static_configs:
      - targets: ['localhost:9090']
Update the startup configuration
cat >/usr/lib/systemd/system/prometheus.service<<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=monitor
Group=monitor
ExecStart=/usr/local/prometheus/prometheus \
  --config.file "/usr/local/prometheus/prometheus.yml" \
  --web.listen-address "0.0.0.0:9090" \
  --web.config.file=/usr/local/prometheus/web-config.yml \
  --storage.tsdb.retention.time=1095d \
  --web.enable-lifecycle
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus
systemctl enable prometheus
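Once basic auth is enabled, the unauthenticated `curl -X POST .../-/reload` call used in the earlier sections will be rejected, so scripts need to pass credentials too. A small sketch, assuming the admin user from web-config.yml. The `PROM_URL`, `PROM_USER` and `PROM_PASSWORD` variables and the `reload_prometheus` helper are illustrative names.

```shell
# Sketch: reload Prometheus when basic auth is enabled.
# PROM_URL, PROM_USER, PROM_PASSWORD and reload_prometheus are hypothetical names.
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"
PROM_USER="${PROM_USER:-admin}"

reload_prometheus() {
    # The password is read from the environment instead of being hard-coded here
    if [ -z "${PROM_PASSWORD:-}" ]; then
        echo "PROM_PASSWORD is not set" >&2
        return 1
    fi
    curl --silent --show-error --fail \
         --user "${PROM_USER}:${PROM_PASSWORD}" \
         -X POST "${PROM_URL}/-/reload"
}

# Usage on node01:
#   PROM_PASSWORD=admintest reload_prometheus
```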
Using labels for classification
Labels can be attached when configuring targets.
vim /usr/local/prometheus/prometheus.yml
  - job_name: 'example'
    static_configs:
      - targets: ['server:9100']
        labels:                      # Define labels
          environment: 'production'
Practical use:
In the alert rule file, use labels to distinguish alert severity levels.
vim /usr/local/prometheus/rules/test.yml
groups:
  - name: example-alerts
    rules:
      - alert: HighHttpRequests
        expr: http_requests_total{job="example", instance="example-instance"} > 100
        for: 5m
        labels:
          severity: critical  # Alerts are handled differently depending on the value of the severity label
        annotations:
          summary: "High HTTP Requests"
          description: "The number of HTTP requests is high on example-instance"
When alerts fire, matchers under route.routes decide which receiver each alert is sent to, while group_by controls how alerts are batched into notifications.
vim /usr/local/alertmanager/alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'sms-critical'
  # Child routes belong under the single top-level route key
  routes:
    - match:
        severity: 'critical'
      receiver: 'sms-critical'
receivers:
  - name: 'sms-critical'
    webhook_configs:
      - url: 'https://your-sms-provider/api/send'
        send_resolved: true
        http_config:
          bearer_token: 'your-bearer-token'
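Summary
When using Prometheus for production monitoring, make full use of labels: define different alert rules for machines in different environments and roles, so that alert floods do not cause alerts to be missed. Keep tight control of security to prevent information leaks.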