Preface:
I recently found some spare time to dig into Prometheus again, so of course I had to play with postgres-exporter for database monitoring; otherwise there's no way to show off how powerful Prometheus really is.
OK, postgres-exporter is basically the same as other monitoring components such as node-exporter: both are written in Go. There is one slight difference, though. Since it monitors a database, it has to talk to the database, which means the database needs the pg_stat_statements extension installed; otherwise postgres-exporter won't work very well and a lot of data simply won't be collected.
Secondly, today's internet culture is one big copy-paste: this article copies that one, that one copies another, so when you go looking for deployment guides, there are basically the same two or three articles everywhere. Some of them pad things out with pre-deployment scripts and other clutter, which is exasperating to read. The overall impression is that deploying postgres-exporter should be a simple task, yet some articles make it extremely complicated, perhaps just to show off.
A small rant here: when you're learning, you don't dare skip even a bad article, because what if it contains something useful? But after actually reading them, I was left speechless: what on earth are they trying to do? A pure waste of time. Yes, all roads lead to Rome, but don't point people down a goat track full of pitfalls; I'd consider that basic decency for anyone writing articles to share knowledge.
Alright, rant over. What follows is an explanation, as detailed as I can make it, of how to install the small component that is postgres-exporter.
1. Where do you download postgres-exporter? Which version? Which file format?
Without question, download it from GitHub (or, if you can't be bothered, try your luck on Gitee). The GitHub download address is:
https://github.com/prometheus-community/postgres_exporter/releases
OK, opening GitHub you'll find a bewildering number of versions, which will stump plenty of readers prone to choice paralysis. My example uses version 0.11. According to the project's own introduction, 0.11 and later releases are all solid: they have plenty of features, including collecting from multiple machines. As for PostgreSQL, I'm running version 12; I recommend PostgreSQL 12 or later, paired with exporter 0.11 or later.
With the version question settled, what about the file format? Binary, of course. Building from source is more hassle, hassle leads to mistakes, and mistakes cost time to fix; it's as simple as that. So in this walkthrough I use postgres_exporter-0.11.1.linux-amd64.tar.gz.
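For reference, a minimal download-and-install sketch (the release URL follows the project's usual naming scheme on the releases page above; adjust version, architecture, and paths to your environment):

# download the 0.11.1 linux/amd64 release and put the binary on the PATH
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.11.1/postgres_exporter-0.11.1.linux-amd64.tar.gz
tar xzf postgres_exporter-0.11.1.linux-amd64.tar.gz
cp postgres_exporter-0.11.1.linux-amd64/postgres_exporter /usr/local/bin/
chmod +x /usr/local/bin/postgres_exporter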
2. Where should postgres-exporter be installed?
This may sound like a pointless question, but I'll answer it anyway: alongside the database, of course. Wherever the database lives, that's where the exporter goes, exactly like node-exporter. No hesitation, no deliberation needed.
3. Installing and deploying postgres-exporter
Two files are needed in total: the executable extracted from postgres_exporter-0.11.1.linux-amd64.tar.gz, and a startup script.
The startup script must set one environment variable, and its name is fixed.
### Note: since postgres-exporter is installed together with PostgreSQL, there is no need to bother with a plaintext username and password; connecting to the database over localhost is enough, and that removes most of the security concerns.
A database's pg_hba.conf is usually configured so that local logins need no password verification. Why is passwordless local login acceptable?
Well, if some attacker has already logged in to the local host, no password will make any difference (from a security standpoint it is already completely meaningless); editing the config file and resetting the password would take maybe a few dozen seconds.
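For reference, typical local-trust entries in pg_hba.conf look like this (illustrative only; tighten them to match your own security policy):

# pg_hba.conf: allow local connections without a password
local   all   all                  trust
host    all   all   127.0.0.1/32   trust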
DATA_SOURCE_NAME=postgresql://localhost:<database port>/postgres?sslmode=disable
Both the environment variable and the port are set in the startup script; fill in the database address and port according to your actual setup. postgres-exporter's default listening port is 9187.
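Before wiring up systemd, you can sanity-check the DSN by running the exporter in the foreground; a sketch, assuming 5432 as the database port:

export DATA_SOURCE_NAME="postgresql://localhost:5432/postgres?sslmode=disable"
/usr/local/bin/postgres_exporter --web.listen-address=:9187
# in a second shell: the metrics endpoint should answer
curl -s http://localhost:9187/metrics | head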
cat > /usr/lib/systemd/system/postgres_exporter.service <<EOF
[Unit]
Description=postgres_exporter
After=network.target

[Service]
Type=simple
User=postgres
Environment="DATA_SOURCE_NAME=postgresql://localhost:<database port>/postgres?sslmode=disable"
ExecStart=/usr/local/bin/postgres_exporter --log.level=error --web.listen-address=:9187
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
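Then reload systemd, start the service, and confirm the metrics endpoint answers:

systemctl daemon-reload
systemctl enable --now postgres_exporter
systemctl status postgres_exporter
# a non-zero count means the exporter is serving PostgreSQL metrics
curl -s http://localhost:9187/metrics | grep -c '^pg_'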
So, is there anything wrong with this configuration?
To be clear: it has one small flaw. The database's uptime may not be collected, but roughly 99% of the database instance's data will be.
To collect the instance data completely, add one extra flag to the startup command line above: --extend.query-path=queries.yaml
The content of queries.yaml is shown below (excerpted).
This file defines extended queries for postgres-exporter, covering the server uptime mentioned above as well as finer-grained data such as the number of rows updated per table, the number of rows deleted per table, the time of the last manual vacuum, and so on.
pg_replication:
  query: "SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag"
  master: true
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind master in seconds"

pg_postmaster:
  query: "SELECT pg_postmaster_start_time as start_time_seconds from pg_postmaster_start_time()"
  master: true
  metrics:
    - start_time_seconds:
        usage: "GAUGE"
        description: "Time at which postmaster started"

pg_stat_user_tables:
  query: |
    SELECT
      current_database() datname,
      schemaname,
      relname,
      seq_scan,
      seq_tup_read,
      idx_scan,
      idx_tup_fetch,
      n_tup_ins,
      n_tup_upd,
      n_tup_del,
      n_tup_hot_upd,
      n_live_tup,
      n_dead_tup,
      n_mod_since_analyze,
      COALESCE(last_vacuum, '1970-01-01Z') as last_vacuum,
      COALESCE(last_autovacuum, '1970-01-01Z') as last_autovacuum,
      COALESCE(last_analyze, '1970-01-01Z') as last_analyze,
      COALESCE(last_autoanalyze, '1970-01-01Z') as last_autoanalyze,
      vacuum_count,
      autovacuum_count,
      analyze_count,
      autoanalyze_count
    FROM
      pg_stat_user_tables
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - schemaname:
        usage: "LABEL"
        description: "Name of the schema that this table is in"
    - relname:
        usage: "LABEL"
        description: "Name of this table"
    - seq_scan:
        usage: "COUNTER"
        description: "Number of sequential scans initiated on this table"
    - seq_tup_read:
        usage: "COUNTER"
        description: "Number of live rows fetched by sequential scans"
    - idx_scan:
        usage: "COUNTER"
        description: "Number of index scans initiated on this table"
    - idx_tup_fetch:
        usage: "COUNTER"
        description: "Number of live rows fetched by index scans"
    - n_tup_ins:
        usage: "COUNTER"
        description: "Number of rows inserted"
    - n_tup_upd:
        usage: "COUNTER"
        description: "Number of rows updated"
    - n_tup_del:
        usage: "COUNTER"
        description: "Number of rows deleted"
    - n_tup_hot_upd:
        usage: "COUNTER"
        description: "Number of rows HOT updated (i.e., with no separate index update required)"
    - n_live_tup:
        usage: "GAUGE"
        description: "Estimated number of live rows"
    - n_dead_tup:
        usage: "GAUGE"
        description: "Estimated number of dead rows"
    - n_mod_since_analyze:
        usage: "GAUGE"
        description: "Estimated number of rows changed since last analyze"
    - last_vacuum:
        usage: "GAUGE"
        description: "Last time at which this table was manually vacuumed (not counting VACUUM FULL)"
    - last_autovacuum:
        usage: "GAUGE"
        description: "Last time at which this table was vacuumed by the autovacuum daemon"
    - last_analyze:
        usage: "GAUGE"
        description: "Last time at which this table was manually analyzed"
    - last_autoanalyze:
        usage: "GAUGE"
        description: "Last time at which this table was analyzed by the autovacuum daemon"
    - vacuum_count:
        usage: "COUNTER"
        description: "Number of times this table has been manually vacuumed (not counting VACUUM FULL)"
    - autovacuum_count:
        usage: "COUNTER"
        description: "Number of times this table has been vacuumed by the autovacuum daemon"
    - analyze_count:
        usage: "COUNTER"
        description: "Number of times this table has been manually analyzed"
    - autoanalyze_count:
        usage: "COUNTER"
        description: "Number of times this table has been analyzed by the autovacuum daemon"

pg_statio_user_tables:
  query: "SELECT current_database() datname, schemaname, relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit, toast_blks_read, toast_blks_hit, tidx_blks_read, tidx_blks_hit FROM pg_statio_user_tables"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - schemaname:
        usage: "LABEL"
        description: "Name of the schema that this table is in"
    - relname:
        usage: "LABEL"
        description: "Name of this table"
    - heap_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table"
    - heap_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table"
    - idx_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from all indexes on this table"
    - idx_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in all indexes on this table"
    - toast_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table's TOAST table (if any)"
    - toast_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table's TOAST table (if any)"
    - tidx_blks_read:
        usage: "COUNTER"
        description: "Number of disk blocks read from this table's TOAST table indexes (if any)"
    - tidx_blks_hit:
        usage: "COUNTER"
        description: "Number of buffer hits in this table's TOAST table indexes (if any)"

# WARNING: This set of metrics can be very expensive on a busy server as every unique query executed will create an additional time series
pg_stat_statements:
  query: "SELECT t2.rolname, t3.datname, queryid, calls, total_time / 1000 as total_time_seconds, min_time / 1000 as min_time_seconds, max_time / 1000 as max_time_seconds, mean_time / 1000 as mean_time_seconds, stddev_time / 1000 as stddev_time_seconds, rows, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written, local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written, temp_blks_read, temp_blks_written, blk_read_time / 1000 as blk_read_time_seconds, blk_write_time / 1000 as blk_write_time_seconds FROM pg_stat_statements t1 JOIN pg_roles t2 ON (t1.userid=t2.oid) JOIN pg_database t3 ON (t1.dbid=t3.oid) WHERE t2.rolname != 'rdsadmin'"
  master: true
  metrics:
    - rolname:
        usage: "LABEL"
        description: "Name of user"
    - datname:
        usage: "LABEL"
        description: "Name of database"
    - queryid:
        usage: "LABEL"
        description: "Query ID"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - total_time_seconds:
        usage: "COUNTER"
        description: "Total time spent in the statement, in milliseconds"
    - min_time_seconds:
        usage: "GAUGE"
        description: "Minimum time spent in the statement, in milliseconds"
    - max_time_seconds:
        usage: "GAUGE"
        description: "Maximum time spent in the statement, in milliseconds"
    - mean_time_seconds:
        usage: "GAUGE"
        description: "Mean time spent in the statement, in milliseconds"
    - stddev_time_seconds:
        usage: "GAUGE"
        description: "Population standard deviation of time spent in the statement, in milliseconds"
    - rows:
        usage: "COUNTER"
        description: "Total number of rows retrieved or affected by the statement"
    - shared_blks_hit:
        usage: "COUNTER"
        description: "Total number of shared block cache hits by the statement"
    - shared_blks_read:
        usage: "COUNTER"
        description: "Total number of shared blocks read by the statement"
    - shared_blks_dirtied:
        usage: "COUNTER"
        description: "Total number of shared blocks dirtied by the statement"
    - shared_blks_written:
        usage: "COUNTER"
        description: "Total number of shared blocks written by the statement"
    - local_blks_hit:
        usage: "COUNTER"
        description: "Total number of local block cache hits by the statement"
    - local_blks_read:
        usage: "COUNTER"
        description: "Total number of local blocks read by the statement"
    - local_blks_dirtied:
        usage: "COUNTER"
        description: "Total number of local blocks dirtied by the statement"
    - local_blks_written:
        usage: "COUNTER"
        description: "Total number of local blocks written by the statement"
    - temp_blks_read:
        usage: "COUNTER"
        description: "Total number of temp blocks read by the statement"
    - temp_blks_written:
        usage: "COUNTER"
        description: "Total number of temp blocks written by the statement"
    - blk_read_time_seconds:
        usage: "COUNTER"
        description: "Total time the statement spent reading blocks, in milliseconds (if track_io_timing is enabled, otherwise zero)"
    - blk_write_time_seconds:
        usage: "COUNTER"
        description: "Total time the statement spent writing blocks, in milliseconds (if track_io_timing is enabled, otherwise zero)"

pg_process_idle:
  query: |
    WITH
      metrics AS (
        SELECT
          application_name,
          SUM(EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - state_change))::bigint)::float AS process_idle_seconds_sum,
          COUNT(*) AS process_idle_seconds_count
        FROM pg_stat_activity
        WHERE state = 'idle'
        GROUP BY application_name
      ),
      buckets AS (
        SELECT
          application_name,
          le,
          SUM(
            CASE WHEN EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - state_change)) <= le
              THEN 1
              ELSE 0
            END
          )::bigint AS bucket
        FROM
          pg_stat_activity,
          UNNEST(ARRAY[1, 2, 5, 15, 30, 60, 90, 120, 300]) AS le
        GROUP BY application_name, le
        ORDER BY application_name, le
      )
    SELECT
      application_name,
      process_idle_seconds_sum as seconds_sum,
      process_idle_seconds_count as seconds_count,
      ARRAY_AGG(le) AS seconds,
      ARRAY_AGG(bucket) AS seconds_bucket
    FROM metrics JOIN buckets USING (application_name)
    GROUP BY 1, 2, 3
  metrics:
    - application_name:
        usage: "LABEL"
        description: "Application Name"
    - seconds:
        usage: "HISTOGRAM"
        description: "Idle time of server processes"
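After restarting the exporter with --extend.query-path pointing at this file, a quick grep confirms the extra metrics (the uptime, for example, from the pg_postmaster query above) are now exposed:

curl -s http://localhost:9187/metrics | grep pg_postmaster
curl -s http://localhost:9187/metrics | grep -c '^pg_stat_user_tables'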
4. PostgreSQL database settings
PostgreSQL needs the pg_stat_statements extension enabled. How to enable it is covered in the blog post below, so I won't repeat it here:
postgresql|数据库|插件学习(一)---postgresql-12的内置插件pg_stat_statements的启用和使用_晚风_END的博客-CSDN博客
Just note one thing: try to use a mainstream PostgreSQL version, which at the moment means version 12.
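In brief (the linked post has the details), enabling it comes down to one config line plus one SQL statement; a sketch, assuming the postgres superuser:

# postgresql.conf -- requires a database restart to take effect:
#   shared_preload_libraries = 'pg_stat_statements'
psql -U postgres -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
# quick check that the extension is live
psql -U postgres -c "SELECT count(*) FROM pg_stat_statements;"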
The rest is basically the same as node-exporter: add a postgres-exporter collection job to the Prometheus configuration file, for example:
Using file-based service discovery with a separate targets file:
Add the job to the Prometheus configuration file:

  - job_name: 'postgres_exporter'
    scrape_interval: 1s
    file_sd_configs:
      - files:
          - targets/postgresql/*.yml
        refresh_interval: 10s
    relabel_configs:
      - action: replace
        source_labels: ['__address__']
        regex: (.*):(.*)
        replacement: $1
        target_label: hostname
      - action: labeldrop
        regex: __meta_filepath

Add the monitoring target:

vi targets/postgresql/pg1-instance.yml

[
  {
    "targets": ["localhost:9187"],
    "labels": {
      "instance": "postgres:5432"
    }
  }
]
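Before restarting Prometheus, it is worth validating the config with promtool, which ships with Prometheus; the HTTP reload endpoint only works if Prometheus was started with --web.enable-lifecycle:

promtool check config prometheus.yml
# either restart the Prometheus service, or trigger a live reload:
curl -X POST http://localhost:9090/-/reload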
5. Creating a dedicated regular user for postgres-exporter (this step can be skipped)
For PostgreSQL >= 10, the SQL mainly uses a function to automatically check whether the regular user postgres_exporter exists, and grants it permission to connect to the database. It sets that user's password to password; change it as needed.
For PostgreSQL < 10, the SQL mainly creates a schema, checks for the pg_stat_statements extension, and creates the views that postgres-exporter uses; those views are also largely backed by the pg_stat_statements extension.
For pg versions >= 10, all three SQL snippets below must be executed:

CREATE OR REPLACE FUNCTION __tmp_create_user() returns void as $$
BEGIN
  IF NOT EXISTS (
          SELECT                       -- SELECT list can stay empty for this
          FROM   pg_catalog.pg_user
          WHERE  usename = 'postgres_exporter') THEN
    CREATE USER postgres_exporter;
  END IF;
END;
$$ language plpgsql;
SELECT __tmp_create_user();
DROP FUNCTION __tmp_create_user();

ALTER USER postgres_exporter WITH PASSWORD 'password';
ALTER USER postgres_exporter SET SEARCH_PATH TO postgres_exporter,pg_catalog;

-- If deploying as non-superuser (for example in AWS RDS), uncomment the GRANT
-- line below and replace <MASTER_USER> with your root user.
-- GRANT postgres_exporter TO <MASTER_USER>;
GRANT CONNECT ON DATABASE postgres TO postgres_exporter;
GRANT pg_monitor to postgres_exporter;

For pg versions < 10, execute only the SQL below:

CREATE SCHEMA IF NOT EXISTS postgres_exporter;
GRANT USAGE ON SCHEMA postgres_exporter TO postgres_exporter;

CREATE OR REPLACE FUNCTION get_pg_stat_activity() RETURNS SETOF pg_stat_activity AS
$$ SELECT * FROM pg_catalog.pg_stat_activity; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;

CREATE OR REPLACE VIEW postgres_exporter.pg_stat_activity
AS
  SELECT * from get_pg_stat_activity();

GRANT SELECT ON postgres_exporter.pg_stat_activity TO postgres_exporter;

CREATE OR REPLACE FUNCTION get_pg_stat_replication() RETURNS SETOF pg_stat_replication AS
$$ SELECT * FROM pg_catalog.pg_stat_replication; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;

CREATE OR REPLACE VIEW postgres_exporter.pg_stat_replication
AS
  SELECT * FROM get_pg_stat_replication();

GRANT SELECT ON postgres_exporter.pg_stat_replication TO postgres_exporter;

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE OR REPLACE FUNCTION get_pg_stat_statements() RETURNS SETOF pg_stat_statements AS
$$ SELECT * FROM public.pg_stat_statements; $$
LANGUAGE sql
VOLATILE
SECURITY DEFINER;

CREATE OR REPLACE VIEW postgres_exporter.pg_stat_statements
AS
  SELECT * FROM get_pg_stat_statements();

GRANT SELECT ON postgres_exporter.pg_stat_statements TO postgres_exporter;
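With the user in place, the exporter's DSN can carry explicit credentials instead of relying on local trust; a quick sanity check, assuming port 5432 and the example password above:

psql "postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable" -c "SELECT 1;"
export DATA_SOURCE_NAME="postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable"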
6. Which dashboard template should be used with postgres-exporter?
Download link: search for the dashboard ID on Grafana's dashboard site (grafana.com/grafana/dashboards).
The most popular one is quite usable: 9628_rev7.json, dashboard ID 9628.
One caveat to watch for: the JSON file references a data source named DS_PROMETHEUS. So either name your data source exactly that when you create it, or change every occurrence of DS_PROMETHEUS in the JSON file to the name of the data source you actually use.
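If you edit the JSON before importing, the replacement is a one-liner ("my-prometheus" here is a hypothetical data source name; substitute your own):

# rewrite the data source name everywhere in the dashboard JSON
sed -i 's/DS_PROMETHEUS/my-prometheus/g' 9628_rev7.json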