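A brief introduction to Prometheus:
Prometheus is written in Go. It was started at SoundCloud and modeled on Google's internal Borgmon monitoring system; it is not an open-source release of Borgmon itself. I won't go into its history here. What matters is that it is simple and easy to use.
The Cloud Native Computing Foundation (CNCF), a Linux Foundation project that Google helped launch, accepted Prometheus in 2016 as its second hosted project after Kubernetes, and Prometheus later became the second project to graduate, again after Kubernetes. The community is very active, which shows in the large number of exporters and integrations. Like Kubernetes and etcd, it is written in Go, and a Kubernetes cluster can be hooked up to Prometheus with little effort, so in the cloud-native world Prometheus is the natural companion and usually the first choice.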
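- Development level
Prometheus has official client libraries for Go, Java, Python and Ruby, plus third-party open-source clients for other languages. With a client library we can instrument core business flows, such as placing an order or adding an item to the cart.
- Application monitoring at the application layer
For mainstream applications, official or third-party exporters collect the core metrics: Redis, MySQL, MongoDB, Nginx, HAProxy, Kubernetes clusters and so on.
- System monitoring at the system layer
Besides common software, Prometheus also has exporters for the system and network layers, used to monitor servers and networks.
- Integration with other monitoring systems
Through various exporters, Prometheus can pull monitoring data from other monitoring systems, such as AWS CloudWatch, JMX and Pingdom.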
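To make the instrumentation idea from the development bullet concrete, here is a minimal sketch of a counter wrapped around a business flow. It uses only the Python standard library and is not the official client library; the metric name `shop_orders_total` and the `place_order` function are made up for illustration.

```python
import threading
from collections import defaultdict


class Counter:
    """Monotonic counter keyed by a label value, safe to bump from several threads."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._values = defaultdict(float)

    def inc(self, label, amount=1.0):
        # Counters only ever increase, so negative increments are rejected.
        if amount < 0:
            raise ValueError("counters only go up")
        with self._lock:
            self._values[label] += amount

    def snapshot(self):
        # Return a plain copy so callers cannot mutate internal state.
        with self._lock:
            return dict(self._values)


orders = Counter("shop_orders_total")  # hypothetical metric name


def place_order(ok):
    # Real business logic would run here; the counter records the outcome.
    orders.inc("success" if ok else "failed")


for outcome in (True, True, False):
    place_order(outcome)

counts = orders.snapshot()
```

In a real service the official client library would own the counter and expose it over HTTP; the point here is only that instrumentation is a few lines placed next to the business code.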
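Prometheus does have weaknesses, though: its data presentation is fairly weak, and that is where Grafana makes its entrance.
A brief introduction to Grafana:
Grafana is an open-source program for visualizing large volumes of measurement data. It offers a powerful and elegant way to create, share and browse data, and a dashboard displays data from your different metric data sources.
Grafana is an open-source metrics analysis platform with rich dashboards and chart editing. Unlike Kibana, it focuses on time-series charts, and it supports many data sources, such as Graphite, InfluxDB, Elasticsearch, MySQL, K8s and Zabbix.
Judging by that description, it is probably most useful for operations metrics.
Grafana began as a fork of Kibana 3 and has its own permission and user management, which Kibana lacked at the time. Kibana is tightly coupled with ES and supports the powerful ES query syntax, so it suits multi-dimensional analysis and querying, while Grafana is better suited to display, and its charts look much nicer than Kibana's.
In other words, it fits well into an operations platform, but display alone does not help operations much. It needs alerting and interactive queries added, plus good metrics from a performance or usage angle, before it earns its place. The ready-made dashboard templates are also worth using as a reference.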
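I.
Prometheus architecture
Prometheus is a time-series database written in Go, with client libraries for many languages. Note that because it is a database, its weakness is data presentation, which is why Grafana makes its entrance.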
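About TSDBs
A TSDB (Time Series Database) can be understood simply as software optimized for handling time-series data, in which the arrays of values are indexed by time.
Characteristics of time-series databases
Most operations are writes.
Writes are almost always sequential appends; most of the time data arrives already ordered by time.
Writes rarely touch data from long ago and rarely update existing data. In most cases data is written to the database within seconds or minutes of being collected.
Deletes usually remove whole blocks: a starting point in history is chosen along with the blocks that follow it. Deleting a single timestamp or scattered random timestamps is rare.
The data set is large, usually larger than memory. A query typically selects only a small, unpredictable part of it, so caching helps very little.
Reads are typically sequential scans in ascending or descending order.
Highly concurrent reads are very common.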
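The access pattern listed above (append in time order, read a contiguous time range, delete a whole leading range) can be sketched in a few lines. This is a toy in-memory illustration under those assumptions, not how Prometheus actually stores data; the `Series` class and its method names are hypothetical.

```python
import bisect


class Series:
    """Append-only list of (timestamp_ms, value) samples kept in time order."""

    def __init__(self):
        self._ts = []      # timestamps, ascending
        self._values = []  # values aligned with self._ts

    def append(self, ts_ms, value):
        # Writes are expected in time order; out-of-order samples are rejected.
        if self._ts and ts_ms <= self._ts[-1]:
            raise ValueError("out-of-order sample")
        self._ts.append(ts_ms)
        self._values.append(value)

    def range(self, start_ms, end_ms):
        # Binary-search both ends, then read the slice sequentially.
        lo = bisect.bisect_left(self._ts, start_ms)
        hi = bisect.bisect_right(self._ts, end_ms)
        return list(zip(self._ts[lo:hi], self._values[lo:hi]))

    def drop_before(self, cutoff_ms):
        # Deletion removes a whole leading time range, never single points.
        cut = bisect.bisect_left(self._ts, cutoff_ms)
        del self._ts[:cut]
        del self._values[:cut]


s = Series()
for i in range(5):
    s.append(1000 * i, float(i))

window = s.range(1000, 3000)
s.drop_before(2000)
remaining = s.range(0, 10_000)
```

Because samples are already sorted by time, both range reads and range deletes reduce to a binary search plus a sequential slice, which is why this workload suits append-oriented storage.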
Common time-series databases
- InfluxDB
- RRDtool
- Graphite
- OpenTSDB
- Kdb+
- Druid
- KairosDB
- Prometheus
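The Prometheus ecosystem
The Prometheus ecosystem consists of several components, some of them optional. Most Prometheus components are written in Go, which makes them easy to build and deploy.
1.Prometheus Server
Scrapes and stores the data, and provides the PromQL query language.
2.Client SDKs
Official client libraries exist for Go, Java, Scala, Python and Ruby; there are also many third-party libraries, for Node.js, PHP, Erlang and others.
3.Push Gateway
An intermediate gateway that lets short-lived jobs push their metrics.
4.PromDash
A dashboard built with Rails for visualizing metric data. It has since been deprecated in favor of Grafana.
5.Exporter
Exporter is the collective name for Prometheus's data collection components. An exporter gathers data from its target and converts it into the format Prometheus understands. Unlike traditional collection agents, it does not send data to a central server; it waits for the central server to come and scrape it.
Prometheus offers many kinds of exporters for collecting the runtime state of different services. Current coverage includes databases, hardware, message brokers, storage systems, HTTP servers, JMX and more.
6.alertmanager
The alert manager, which handles alerting.
7.prometheus_cli
A command-line tool.
8.Other helper tools
Various bridging tools that convert metrics from systems such as HAProxy, StatsD and Graphite into the format Prometheus stores.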
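The pull model described under Exporter can be sketched as a tiny HTTP endpoint that renders metrics as text and waits to be scraped. This is an illustration built on the standard library only, not a real exporter; the metric names, the `SAMPLES` dictionary and port 8000 are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {(name, labels_tuple): value} as Prometheus text exposition lines."""
    lines = []
    for (name, labels), value in sorted(samples.items()):
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics; a real exporter would read these from its target.
SAMPLES = {
    ("demo_orders_total", (("status", "ok"),)): 3,
    ("demo_up", ()): 1,
}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; the exporter never pushes anything itself.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks forever; a Prometheus server would scrape http://<host>:<port>/metrics.
    HTTPServer(("", port), MetricsHandler).serve_forever()


text = render_metrics(SAMPLES)
```

Calling `serve()` starts the endpoint; adding its address as a target in the Prometheus config is then all the server needs in order to collect from it.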
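The architecture diagram is shown below:
II.
Ways to deploy Prometheus
- 1. Binary deployment
- 2. Docker deployment
- 3. Deployment inside a Kubernetes cluster
This article uses the binary deployment.
Prometheus server is installed on 192.168.217.24, together with the node metrics collector node_exporter.
On 192.168.217.23 we install the MySQL collector mysqld_exporter and the node collector node_exporter (MySQL itself runs on server 23).
III.
Installing Prometheus server
My machine is amd64, so I picked the linux-amd64 build of the long-term support release 2.37.2. Upload the downloaded tarball to server 24 and unpack it.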
tar zxf prometheus-2.37.2.linux-amd64.tar.gz
mv prometheus-2.37.2.linux-amd64 /usr/local/prometheus
[root@node4 prometheus]# ll
total 202256
drwxr-xr-x. 2 root root        38 May  8  2022 console_libraries  # libraries used by the web console
drwxr-xr-x. 2 root root       173 May  8  2022 consoles           # web console page files
drwxr-xr-x. 6 root root       126 Nov 15 22:12 data               # time-series database data
-rw-r--r--. 1 root root     11357 Apr 21  2022 LICENSE            # license text
-rw-r--r--. 1 root root      3773 Apr 21  2022 NOTICE             # legal notices
-rwxr-xr-x  1 3434 3434 109691493 Nov  4 19:09 prometheus         # main program, the executable
-rw-r--r--. 1 root root      1148 Nov 15 21:09 prometheus.yml     # main Prometheus config file
-rwxr-xr-x. 1 root root  97394322 Apr 21  2022 promtool           # Prometheus admin tool: inspects the TSDB, tests alerting rule files, and more
For example, listing the blocks in the time-series database:
[root@node4 prometheus]# ./promtool tsdb list
BLOCK ULID                  MIN TIME       MAX TIME       DURATION    NUM SAMPLES  NUM CHUNKS  NUM SERIES  SIZE
01GHXKCBSD5JWBT1YTDZE3ME78  1668503659419  1668506400000  45m40.581s  162609       1212        1212        494354
01GHXP04ETXXVQ5W28XXGZDEC4  1668506400000  1668513600000  2h0m0s      591840       4896        1308        1609719
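The MIN TIME and MAX TIME columns above are Unix timestamps in milliseconds. A small sketch, assuming nothing beyond the standard library, converts the first block's values to UTC and recomputes the DURATION column; `block_span` is a helper name chosen for this example.

```python
from datetime import datetime, timedelta, timezone


def block_span(min_ms, max_ms):
    """Return (start, end, duration) for a block's MIN TIME / MAX TIME in epoch ms."""
    start = datetime.fromtimestamp(min_ms / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(max_ms / 1000, tz=timezone.utc)
    # The duration is computed from the integer difference to avoid float rounding.
    return start, end, timedelta(milliseconds=max_ms - min_ms)


# Values copied from the first row of the promtool output.
start, end, duration = block_span(1668503659419, 1668506400000)
print(start.isoformat(), end.isoformat(), duration)
```

The first block ends on an even two-hour boundary in UTC and is shorter than two hours because the server started partway through that window; the second block covers a full two hours.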
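Prometheus can of course run in the foreground: just run ./prometheus. But that ties up a shell for every start and stop, which is hardly convenient, so let's give it a systemd unit:
cat >/etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
[Service]
# adjust the paths below to your actual install location
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --web.listen-address=:9090
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable prometheus && systemctl start prometheus
Check the service status. Green (active) means it started normally; otherwise you need to troubleshoot. The log should contain "TSDB started" and the line saying the web server is ready, "Server is ready to receive web requests.":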
[root@node4 prometheus]# systemctl status prometheus
● prometheus.service
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-11-16 11:09:20 CST; 20min ago
 Main PID: 3925 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─3925 /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --web.listen-address=:9090
Nov 16 11:09:21 node4 prometheus[3925]: ts=2022-11-16T03:09:21.098Z caller=main.go:993 level=info fs_type=XFS_SUPER_MAGIC
Nov 16 11:09:21 node4 prometheus[3925]: ts=2022-11-16T03:09:21.098Z caller=main.go:996 level=info msg="TSDB started"
Nov 16 11:09:21 node4 prometheus[3925]: ts=2022-11-16T03:09:21.098Z caller=main.go:1177 level=info msg="Loading configuration file" filename=/usr/local/prometheus/prometheus.yml
Nov 16 11:09:21 node4 prometheus[3925]: ts=2022-11-16T03:09:21.234Z caller=main.go:1214 level=info msg="Completed loading of configuration file" filename=/usr/local/prometheus/prometheus.yml totalDuration=135.316399ms db_storage=1.16…µs
Nov 16 11:09:21 node4 prometheus[3925]: ts=2022-11-16T03:09:21.235Z caller=main.go:957 level=info msg="Server is ready to receive web requests."
Nov 16 11:09:21 node4 prometheus[3925]: ts=2022-11-16T03:09:21.236Z caller=manager.go:941 level=info component="rule manager" msg="Starting rule manager..."
Nov 16 11:09:27 node4 prometheus[3925]: ts=2022-11-16T03:09:27.708Z caller=compact.go:510 level=info component=tsdb msg="write block resulted in empty block" mint=1668528000000 maxt=1668535200000 duration=36.455932ms
Nov 16 11:09:27 node4 prometheus[3925]: ts=2022-11-16T03:09:27.713Z caller=head.go:842 level=info component=tsdb msg="Head GC completed" duration=3.84614ms
Nov 16 11:09:27 node4 prometheus[3925]: ts=2022-11-16T03:09:27.714Z caller=checkpoint.go:97 level=info component=tsdb msg="Creating checkpoint" from_segment=0 to_segment=1 mint=1668535200000
Nov 16 11:09:27 node4 prometheus[3925]: ts=2022-11-16T03:09:27.760Z caller=head.go:1011 level=info component=tsdb msg="WAL checkpoint complete" first=0 last=1 duration=46.524627ms
Hint: Some lines were ellipsized, use -l to show in full.
At this point the config file is changed to look like this:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.217.24:9090"] # this host's IP and port; nothing else needs changing
Open a browser and enter the address from the last line of the config (192.168.217.24:9090):
The target's state shows as UP in green; you can click the endpoint link to take a look:
OK, Prometheus server is now installed.
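Besides looking at the browser, the server can be probed from a script. Prometheus exposes a readiness endpoint at `/-/ready`; the sketch below, which assumes the server address used in this article, returns True when that endpoint answers with HTTP 200. The `opener` parameter is only there so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def is_ready(base_url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/-/ready answers with HTTP 200."""
    url = base_url.rstrip("/") + "/-/ready"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer all count as "not ready".
        return False


# Example call against the server from this article (requires network access):
# print(is_ready("http://192.168.217.24:9090"))
```

A check like this is handy in deployment scripts that should wait for the server before adding exporters.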
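IV.
Installing and configuring node_exporter
node_exporter is effectively a client-side collector. It gathers basic data such as CPU and memory from Unix-like operating systems; its help output shows exactly what it can collect:
In the output below, collectors such as cpu, edac and ipvs are enabled by default, while others, such as ntp, are not. The defaults already cover the vast majority of everyday needs.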
[root@node4 prometheus]# node_exporter --help
(excerpt: collector flags only)
--collector.arp  Enable the arp collector (default: enabled).
--collector.bcache  Enable the bcache collector (default: enabled).
--collector.bonding  Enable the bonding collector (default: enabled).
--collector.btrfs  Enable the btrfs collector (default: enabled).
--collector.buddyinfo  Enable the buddyinfo collector (default: disabled).
--collector.cgroups  Enable the cgroups collector (default: disabled).
--collector.conntrack  Enable the conntrack collector (default: enabled).
--collector.cpu  Enable the cpu collector (default: enabled).
--collector.cpufreq  Enable the cpufreq collector (default: enabled).
--collector.diskstats  Enable the diskstats collector (default: enabled).
--collector.dmi  Enable the dmi collector (default: enabled).
--collector.drbd  Enable the drbd collector (default: disabled).
--collector.drm  Enable the drm collector (default: disabled).
--collector.edac  Enable the edac collector (default: enabled).
--collector.entropy  Enable the entropy collector (default: enabled).
--collector.ethtool  Enable the ethtool collector (default: disabled).
--collector.fibrechannel  Enable the fibrechannel collector (default: enabled).
--collector.filefd  Enable the filefd collector (default: enabled).
--collector.filesystem  Enable the filesystem collector (default: enabled).
--collector.hwmon  Enable the hwmon collector (default: enabled).
--collector.infiniband  Enable the infiniband collector (default: enabled).
--collector.interrupts  Enable the interrupts collector (default: disabled).
--collector.ipvs  Enable the ipvs collector (default: enabled).
--collector.ksmd  Enable the ksmd collector (default: disabled).
--collector.lnstat  Enable the lnstat collector (default: disabled).
--collector.loadavg  Enable the loadavg collector (default: enabled).
--collector.logind  Enable the logind collector (default: disabled).
--collector.mdadm  Enable the mdadm collector (default: enabled).
--collector.meminfo  Enable the meminfo collector (default: enabled).
--collector.meminfo_numa  Enable the meminfo_numa collector (default: disabled).
--collector.mountstats  Enable the mountstats collector (default: disabled).
--collector.netclass  Enable the netclass collector (default: enabled).
--collector.netdev  Enable the netdev collector (default: enabled).
--collector.netstat  Enable the netstat collector (default: enabled).
--collector.network_route  Enable the network_route collector (default: disabled).
--collector.nfs  Enable the nfs collector (default: enabled).
--collector.nfsd  Enable the nfsd collector (default: enabled).
--collector.ntp  Enable the ntp collector (default: disabled).
--collector.nvme  Enable the nvme collector (default: enabled).
--collector.os  Enable the os collector (default: enabled).
--collector.perf  Enable the perf collector (default: disabled).
--collector.powersupplyclass  Enable the powersupplyclass collector (default: enabled).
--collector.pressure  Enable the pressure collector (default: enabled).
--collector.processes  Enable the processes collector (default: disabled).
--collector.qdisc  Enable the qdisc collector (default: disabled).
--collector.rapl  Enable the rapl collector (default: enabled).
--collector.runit  Enable the runit collector (default: disabled).
--collector.schedstat  Enable the schedstat collector (default: enabled).
--collector.selinux  Enable the selinux collector (default: enabled).
--collector.slabinfo  Enable the slabinfo collector (default: disabled).
--collector.sockstat  Enable the sockstat collector (default: enabled).
--collector.softnet  Enable the softnet collector (default: enabled).
--collector.stat  Enable the stat collector (default: enabled).
--collector.supervisord  Enable the supervisord collector (default: disabled).
--collector.sysctl  Enable the sysctl collector (default: disabled).
--collector.systemd  Enable the systemd collector (default: disabled).
--collector.tapestats  Enable the tapestats collector (default: enabled).
--collector.tcpstat  Enable the tcpstat collector (default: disabled).
--collector.textfile  Enable the textfile collector (default: enabled).
--collector.thermal_zone  Enable the thermal_zone collector (default: enabled).
--collector.time  Enable the time collector (default: enabled).
--collector.timex  Enable the timex collector (default: enabled).
--collector.udp_queues  Enable the udp_queues collector (default: enabled).
--collector.uname  Enable the uname collector (default: enabled).
--collector.vmstat  Enable the vmstat collector (default: enabled).
--collector.wifi  Enable the wifi collector (default: disabled).
--collector.xfs  Enable the xfs collector (default: enabled).
--collector.zfs  Enable the zfs collector (default: enabled).
--collector.zoneinfo  Enable the zoneinfo collector (default: disabled).
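Because this collector is written in Go and ships as a single executable, just upload node_exporter-1.4.0.linux-amd64.tar.gz to the server, unpack it and put the executable somewhere on the PATH.
tar zxf node_exporter-1.4.0.linux-amd64.tar.gz
mv node_exporter-1.4.0.linux-amd64/node_exporter /usr/local/bin/
Same approach as before: manage it with a systemd unit.
One more note: the collector customization mentioned above is done in this unit file, by adding flags to the ExecStart line. This example uses the defaults, so no extra flags are written.
cat >/etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=node_exporter Monitoring System
Documentation=https://github.com/prometheus/node_exporter
[Service]
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable node_exporter && systemctl start node_exporter
Check the service status; green means normal. The two "Unknown lvalue" warnings at the end of the capture below were caused by misspelled Description and Documentation keys in the unit file at the time the output was recorded; a correctly spelled unit file does not produce them.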
[root@node4 ~]# systemctl status node_exporter
● node_exporter.service
   Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2022-11-16 12:27:46 CST; 1min 23s ago
 Main PID: 7519 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─7519 /usr/local/bin/node_exporter --web.listen-address=:9100
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:115 level=info collector=timex
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:115 level=info collector=udp_queues
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:115 level=info collector=uname
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:115 level=info collector=vmstat
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:115 level=info collector=xfs
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:115 level=info collector=zfs
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=node_exporter.go:199 level=info msg="Listening on" address=:9100
Nov 16 12:27:46 node4 node_exporter[7519]: ts=2022-11-16T04:27:46.961Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
Nov 16 12:28:10 node4 systemd[1]: [/etc/systemd/system/node_exporter.service:2] Unknown lvalue 'Descriptinotallow' in section 'Unit'
Nov 16 12:28:10 node4 systemd[1]: [/etc/systemd/system/node_exporter.service:3] Unknown lvalue 'Documentatinotallow' in section 'Unit'
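Once node_exporter is listening on :9100, its `/metrics` page returns plain text that Prometheus parses on every scrape. The sketch below shows a simplified parser for that text, assuming one sample per line and ignoring optional timestamps and label-value escapes; the sample values in `SAMPLE` are made up, although `node_load1` and `node_cpu_seconds_total` are real node_exporter metric names.

```python
import re

# metric_name, optional {labels}, then a single value token
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels_dict, float_value)."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _LINE.match(line)
        if match is None:  # lines this simplified pattern cannot handle
            continue
        name, label_text, value = match.groups()
        labels = dict(_LABEL.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up values in the shape that node_exporter serves.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.25
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

parsed = parse_metrics(SAMPLE)
```

Feeding it the body of http://192.168.217.24:9100/metrics would give a quick way to eyeball what a collector exposes before Prometheus scrapes it.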
The node collector is now running. One last step remains: feeding its data into Prometheus. This is done by editing the Prometheus config file and adding targets:
(Install and deploy node_exporter on server 23 in the same way and start the service there.)
[root@node4 ~]# cat /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter Monitoring System
Documentation=https://github.com/prometheus/node_exporter
[Service]
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9100
[Install]
WantedBy=multi-user.target
[root@node4 ~]# cat /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.217.24:9090"]
  - job_name: "server"
    static_configs:
      - targets: ["192.168.217.24:9100"]
      - targets: ["192.168.217.23:9100"]
Restart Prometheus server, and the browser shows two more targets:
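When the list of hosts grows, writing the `scrape_configs` section by hand becomes error-prone. Here is a small sketch that renders that section from a Python dictionary using plain string formatting; `render_scrape_configs` is a helper invented for this example, and it puts all targets of a job into a single `targets` list, which is equivalent to the one-target-per-entry form used above.

```python
def render_scrape_configs(jobs):
    """Render {job_name: [target, ...]} as a scrape_configs YAML fragment."""
    lines = ["scrape_configs:"]
    for job_name, targets in jobs.items():
        quoted = ", ".join(f'"{t}"' for t in targets)
        lines.append(f'  - job_name: "{job_name}"')
        lines.append("    static_configs:")
        lines.append(f"      - targets: [{quoted}]")
    return "\n".join(lines) + "\n"


# Jobs and addresses taken from this article.
fragment = render_scrape_configs({
    "prometheus": ["192.168.217.24:9090"],
    "server": ["192.168.217.24:9100", "192.168.217.23:9100"],
})
print(fragment)
```

After pasting a generated fragment into prometheus.yml, running `./promtool check config prometheus.yml` is a sensible way to validate the file before restarting the server.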
V.
Installing and configuring the MySQL collector (run on 192.168.217.23)
Unpack the tarball and move it to /usr/local/ under a new name:
tar zxf mysqld_exporter-0.14.0.linux-amd64.tar.gz
mv mysqld_exporter-0.14.0.linux-amd64 /usr/local/mysqld_exporter
Create a dedicated database user:
CREATE USER 'exporter'@'%' IDENTIFIED BY '123456' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
Write the MySQL client config file that the exporter reads:
It holds MySQL's address, port and credentials. My port is not the default; it is 3311, and MySQL is installed on 192.168.217.23.
cat >/usr/local/mysqld_exporter/my.cnf <<EOF
[client]
host = 192.168.217.23
port = 3311
user = exporter
password = 123456
[mysqladmin]
host = 192.168.217.23
port = 3311
user = exporter
password = 123456
EOF
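A typo in this file only shows up later as a failed scrape, so it can be worth reading the file back before starting the exporter. The sketch below parses a my.cnf-style text with the standard library's `configparser` and returns the connection settings from the `[client]` section; `read_client_section` is a helper name chosen for this example, and the inline `MY_CNF` text simply repeats the values written above.

```python
import configparser


def read_client_section(text):
    """Return (host, port, user) from the [client] section of a my.cnf-style text."""
    # allow_no_value tolerates bare option lines that some my.cnf files contain.
    parser = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    parser.read_string(text)
    client = parser["client"]
    return client["host"], client.getint("port"), client["user"]


MY_CNF = """\
[client]
host = 192.168.217.23
port = 3311
user = exporter
password = 123456
"""

host, port, user = read_client_section(MY_CNF)
```

For the real file, reading `/usr/local/mysqld_exporter/my.cnf` into a string and passing it to the function gives the same check.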
Write the systemd unit:
cat >/usr/lib/systemd/system/mysqld-exporter.service <<EOF
[Unit]
Description=mysqld_exporter
[Service]
# the OS account named in User= must exist and be able to read my.cnf; create it first or remove this line
User=exporter
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter \
  --config.my-cnf=/usr/local/mysqld_exporter/my.cnf \
  --web.listen-address=:9104 \
  --collect.slave_status \
  --collect.binlog_size \
  --collect.info_schema.processlist \
  --collect.info_schema.innodb_metrics \
  --collect.engine_innodb_status \
  --collect.perf_schema.file_events \
  --collect.perf_schema.replication_group_member_stats
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable mysqld-exporter && systemctl start mysqld-exporter
All of the flags above come from the mysqld_exporter help output. If you are interested, read the help below and compare it with the flags used here:
[root@node3 ~]# mysqld_exporter --help
usage: mysqld_exporter [<flags>]
Flags:
  -h, --help  Show context-sensitive help (also try --help-long and --help-man).
  --exporter.lock_wait_timeout=2  Set a lock_wait_timeout (in seconds) on the connection to avoid long metadata locking.
  --exporter.log_slow_filter  Add a log_slow_filter to avoid slow query logging of scrapes. NOTE: Not supported by Oracle MySQL.
  --collect.heartbeat.database="heartbeat"  Database from where to collect heartbeat data
  --collect.heartbeat.table="heartbeat"  Table from where to collect heartbeat data
  --collect.heartbeat.utc  Use UTC for timestamps of the current server (`pt-heartbeat` is called with `--utc`)
  --collect.info_schema.processlist.min_time=0  Minimum time a thread must be in each state to be counted
  --collect.info_schema.processlist.processes_by_user  Enable collecting the number of processes by user
  --collect.info_schema.processlist.processes_by_host  Enable collecting the number of processes by host
  --collect.info_schema.tables.databases="*"  The list of databases to collect table stats for, or '*' for all
  --collect.mysql.user.privileges  Enable collecting user privileges from mysql.user
  --collect.perf_schema.eventsstatements.limit=250  Limit the number of events statements digests by response time
  --collect.perf_schema.eventsstatements.timelimit=86400  Limit how old the 'last_seen' events statements can be, in seconds
  --collect.perf_schema.eventsstatements.digest_text_limit=120  Maximum length of the normalized statement text
  --collect.perf_schema.file_instances.filter=".*"  RegEx file_name filter for performance_schema.file_summary_by_instance
  --collect.perf_schema.file_instances.remove_prefix="/var/lib/mysql/"  Remove path prefix in performance_schema.file_summary_by_instance
  --collect.perf_schema.memory_events.remove_prefix="memory/"  Remove instrument prefix in performance_schema.memory_summary_global_by_event_name
  --web.config.file=""  [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
  --web.listen-address=":9104"  Address to listen on for web interface and telemetry.
  --web.telemetry-path="/metrics"  Path under which to expose metrics.
  --timeout-offset=0.25  Offset to subtract from timeout in seconds.
  --config.my-cnf="/root/.my.cnf"  Path to .my.cnf file to read MySQL credentials from.
  --tls.insecure-skip-verify  Ignore certificate and server verification when using a tls connection.
  --collect.global_variables  Collect from SHOW GLOBAL VARIABLES
  --collect.slave_status  Collect from SHOW SLAVE STATUS
  --collect.info_schema.processlist  Collect current thread state counts from the information_schema.processlist
  --collect.mysql.user  Collect data from mysql.user
  --collect.info_schema.tables  Collect metrics from information_schema.tables
  --collect.info_schema.innodb_tablespaces  Collect metrics from information_schema.innodb_sys_tablespaces
  --collect.info_schema.innodb_metrics  Collect metrics from information_schema.innodb_metrics
  --collect.global_status  Collect from SHOW GLOBAL STATUS
  --collect.binlog_size  Collect the current size of all registered binlog files
  --collect.perf_schema.tableiowaits  Collect metrics from performance_schema.table_io_waits_summary_by_table
  --collect.perf_schema.indexiowaits  Collect metrics from performance_schema.table_io_waits_summary_by_index_usage
  --collect.perf_schema.tablelocks  Collect metrics from performance_schema.table_lock_waits_summary_by_table
  --collect.perf_schema.eventsstatements  Collect metrics from performance_schema.events_statements_summary_by_digest
  --collect.perf_schema.eventsstatementssum  Collect metrics of grand sums from performance_schema.events_statements_summary_by_digest
  --collect.perf_schema.eventswaits  Collect metrics from performance_schema.events_waits_summary_global_by_event_name
  --collect.auto_increment.columns  Collect auto_increment columns and max values from information_schema
  --collect.perf_schema.file_instances  Collect metrics from performance_schema.file_summary_by_instance
  --collect.perf_schema.memory_events  Collect metrics from performance_schema.memory_summary_global_by_event_name
  --collect.perf_schema.replication_group_members  Collect metrics from performance_schema.replication_group_members
  --collect.perf_schema.replication_group_member_stats  Collect metrics from performance_schema.replication_group_member_stats
  --collect.perf_schema.replication_applier_status_by_worker  Collect metrics from performance_schema.replication_applier_status_by_worker
  --collect.info_schema.userstats  If running with userstat=1, set to true to collect user statistics
  --collect.info_schema.clientstats  If running with userstat=1, set to true to collect client statistics
  --collect.perf_schema.file_events  Collect metrics from performance_schema.file_summary_by_event_name
  --collect.info_schema.schemastats  If running with userstat=1, set to true to collect schema statistics
  --collect.info_schema.innodb_cmp  Collect metrics from information_schema.innodb_cmp
  --collect.info_schema.innodb_cmpmem  Collect metrics from information_schema.innodb_cmpmem
  --collect.info_schema.query_response_time  Collect query response time distribution if query_response_time_stats is ON.
  --collect.engine_tokudb_status  Collect from SHOW ENGINE TOKUDB STATUS
  --collect.engine_innodb_status  Collect from SHOW ENGINE INNODB STATUS
  --collect.heartbeat  Collect from heartbeat
  --collect.info_schema.tablestats  If running with userstat=1, set to true to collect table statistics
  --collect.info_schema.replica_host  Collect metrics from information_schema.replica_host_status
  --collect.slave_hosts  Scrape information from 'SHOW SLAVE HOSTS'
  --log.level=info  Only log messages with the given severity or above. One of: [debug, info, warn, error]
  --log.format=logfmt  Output format of log messages. One of: [logfmt, json]
  --version  Show application version.
Check the ports:
[root@node3 ~]# netstat -antup |grep 3311
tcp        0      0 192.168.217.23:59276    192.168.217.23:3311     TIME_WAIT   -
tcp        0      0 192.168.217.23:59278    192.168.217.23:3311     TIME_WAIT   -
tcp        0      0 192.168.217.23:59274    192.168.217.23:3311     TIME_WAIT   -
tcp        0      0 192.168.217.23:59270    192.168.217.23:3311     TIME_WAIT   -
tcp        0      0 192.168.217.23:59272    192.168.217.23:3311     TIME_WAIT   -
tcp6       0      0 :::3311                 :::*                    LISTEN      2859/mysqld
tcp6       0      0 192.168.217.23:3311     192.168.217.23:59270    TIME_WAIT   -
[root@node3 ~]# netstat -antup |grep 9104
tcp6       0      0 :::9104                 :::*                    LISTEN      7041/mysqld_exporte
tcp6       0      0 192.168.217.23:9104     192.168.217.24:60422    ESTABLISHED 7041/mysqld_exporte
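The same port check can be done from the Prometheus host, which also confirms that no firewall sits between the two machines. A minimal sketch using only the standard library; `port_open` is a helper name chosen for this example, and the addresses in the commented calls are the ones used in this article.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False


# Example calls from server 24 (require network access to server 23):
# print(port_open("192.168.217.23", 3311))  # MySQL
# print(port_open("192.168.217.23", 9104))  # mysqld_exporter
```

A False result for 9104 while netstat on server 23 shows the port listening usually points at a firewall rule rather than at the exporter.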
Hook the MySQL collector into Prometheus:
As before, edit the Prometheus config file and add one more target:
[root@node4 ~]# cat /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.217.24:9090"]
  - job_name: "server"
    static_configs:
      - targets: ["192.168.217.24:9100"]
      - targets: ["192.168.217.23:9100"]
  - job_name: "mysqld"
    static_configs:
      - targets: ["192.168.217.23:9104"]
Restart Prometheus server, open the browser again and check the targets; a green UP means the exporter is connected:
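The target status shown in the browser is also available from the Prometheus HTTP API at `/api/v1/targets`, which is convenient for scripted checks. The sketch below summarizes such a response into a mapping of scrape URL to health. The `SAMPLE` payload is hand-written and trimmed to the fields used here, and the "down" entry is invented to show both states; a real response carries more fields.

```python
def summarize_targets(payload):
    """Map each active target's scrape URL to its health ("up" or "down")."""
    if payload.get("status") != "success":
        raise ValueError("targets query did not succeed")
    return {
        target["scrapeUrl"]: target["health"]
        for target in payload["data"]["activeTargets"]
    }


# Hand-written sample in the shape of a /api/v1/targets response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "server"},
                "scrapeUrl": "http://192.168.217.24:9100/metrics",
                "health": "up",
            },
            {
                "labels": {"job": "mysqld"},
                "scrapeUrl": "http://192.168.217.23:9104/metrics",
                "health": "down",
            },
        ]
    },
}

health = summarize_targets(SAMPLE)
unhealthy = sorted(url for url, state in health.items() if state != "up")
```

To run it against the live server, the payload can be loaded with `json.load(urllib.request.urlopen("http://192.168.217.24:9090/api/v1/targets"))` and passed to the same function.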
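VI.
Deploying Grafana (on 192.168.217.24)
Download Grafana | Grafana Labs
Once the yum install finishes, Grafana is ready to use. Open 192.168.217.24:3000 in a browser to log in. The initial account/password is admin/admin; after logging in you will be asked to change the initial password, so just follow the prompt. (Remember the new password!)
After logging in, integrate Prometheus by choosing a data source:
Click Settings next to it
Dashboard templates are usually JSON files, and the official site provides them at: Dashboards | Grafana Labs
For example, the template for the node exporter collector featured on the front page:
Click the import button shown in the screenshot above and import this file:
Likewise, mysqld_exporter needs a JSON template to generate its dashboard. Look for one on the official site; people generally pick one with many stars, for example:
MySQL Overview | Grafana Labs. This template, ID 7362, has been downloaded more than 300,000 times, which suggests it is fairly reliable.
That's the binary deployment of Prometheus + Grafana. Pretty simple, isn't it? A few easy steps and you have a slick, show-off-worthy ops monitoring tool, right?
I plan to write about alerting next, so stay tuned.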