之前写过一些关于zabbix监控和ganglia监控的相关文章, 有兴趣的朋友可以参考
ganglia的应用场景有限, 强项是画图, 众多监控目标的实时收集采集等, 利用rrdtool特性可以在图形上做一些文章, 例如添加event区域, 图形属性聚合, compare等.
但是
ganglia欠缺
设置阈值告警的功能, 这类功能可以用如zabbix, nagios这样的监控软件来实现.
PostgreSQL要监控哪些东西, 可以参考 :
本文要说的是zabbix的一个postgresql监控模板,
部署步骤如下:
1. 部署zabbix server端
2. 部署zabbix proxy (可选)
3. 部署zabbix agent (建议在被监控的数据库服务器上安装, 使用127.0.0.1或unix socket连接数据库, 配置为trust或配置.pgpass)
4. 下载模板(zabbix server 端)
(注意不要直接另存为, 最好是git clone , 或直接拷贝内容, 编译一个xml文件)
# git clone https://raw.githubusercontent.com/pg-monz/pg_monz
5. 导入模板
(zabbix server 端)
在web界面点击, Configuration, Templates, Import
某些宏根据你的应用场景进行修改.
注意宏的用途, 例如PGSCRIPTDIR.
这里的监控项如下 :
8.4. 配置zabbix_agentd.conf (被动监控)
8.5. 启动agentd
[参考]
[root@db-172-16-3-221 pg_monz]# grep -r PGSCRIPTDIR *
pg_monz_template.xml: <key>db_table.list.discovery[{$PGHOST},{$PGPORT},{$PGROLE},{$PGDATABASE},{$PGSCRIPTDIR}]</key>
pg_monz_template.xml: <key>db.list.discovery[{$PGHOST},{$PGPORT},{$PGROLE},{$PGDATABASE},{$PGSCRIPTDIR}]</key>
pg_monz_template.xml: <macro>{$PGSCRIPTDIR}</macro>
显然是用于调用find_dbname.sh和find_dbname_table.sh的.
PGLOGDIR则是postgresql log的目录.
[root@db-172-16-3-221 pg_monz]# grep -r PGLOGDIR *
pg_monz_template.xml: <key>logrt["{$PGLOGDIR}/postgresql-.*\.log","PANIC|FATAL|ERROR|[Ee]rror"]</key>
pg_monz_template.xml: <macro>{$PGLOGDIR}</macro>
但是LOG目录可能不同的数据库配置不一样, 所以你可以在所有的数据库服务器建立一个软链接, 来解决这个问题.
例如 :
ln -s $PGDATA/pg_log /var/pg_log
{$PGLOGDIR} = /var/pg_log (统一使用这个)
PGROLE可能需要PG超级用户, 或者是可以执行本监控模板中所有SQL的用户.
如果需要密码的话, 建议配置.pgpass.
7. 下载监控脚本以及配置文件(zabbix agent 端)
脚本需要放到模板宏
PGSCRIPTDIR的制定位置.
并配置可执行权限.
[root@db-172-16-3-221 pg_monz]# cp find_dbname.sh find_dbname_table.sh /usr/local/bin/
[root@db-172-16-3-221 pg_monz]# chown postgres:postgres /usr/local/bin/find_dbname*.sh
[root@db-172-16-3-221 pg_monz]# chmod 555 /usr/local/bin/find_dbname*.sh
简单的来看一个脚本:
find_dbname.sh, 用来输出PostgreSQL的所有数据库 :
[root@db-172-16-3-221 pg_monz]# cat find_dbname.sh
#!/bin/bash
# Get list of Database Name which you want to monitor.
# The default settings are excepted template databases(template0/template1).
#
# :Example
#
# If you want to monitor "foo" and "bar" databases, you set the GETDB as
# GETDB="select datname from pg_database where datname in ('foo','bar');"
GETDB="select datname from pg_database where datistemplate = 'f';"
for dbname in $(psql -h $1 -p $2 -U $3 -d $4 -t -c "${GETDB}"); do
dblist="$dblist,"'{"{#DBNAME}":"'$dbname'"}'
done
echo '{"data":['${dblist#,}' ]}'
[root@db-172-16-3-221 pg_monz]# export PATH=/opt/pgsql/bin:$PATH
[root@db-172-16-3-221 pg_monz]# . ./find_dbname.sh 127.0.0.1 5432 postgres postgres
{"data":[{"{#DBNAME}":"postgres"},{"{#DBNAME}":"digoal"} ]}
find_dbname_table.sh, 用来输出数据库对应的表.
在得到数据库和表之后, 可以对数据库, 表进行监控查询.
8. 配置zabbix agent端.
8.1. 修改数据库的pg_log, 到/var/pg_log, (和PGLOGDIR一致即可).
另外需要确认其他的宏和被监控的数据库一致.
[root@db-172-16-3-221 pg_monz]# ln -s /data01/pgdata/pg_root/pg_log /var/
8.2. 可能需要添加.pgpass, 配合模板需要的连接, 不需要密码访问.
8.3. 将userparameter_pgsql.conf拷贝到zabbix agent配置目录.
[root@db-172-16-3-221 pg_monz]# cp userparameter_pgsql.conf /opt/zabbix/etc/zabbix_agentd.conf.d/
这里的监控项如下 :
[root@db-172-16-3-221 zabbix_agentd.conf.d]# cat userparameter_pgsql.conf
# PostgreSQL user parameter
#
# Server specific examples
#
# Get the total number of committed transactions
UserParameter=psql.tx_committed[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select sum(xact_commit) from pg_stat_database"
# Get the total number of rolled back transactions
UserParameter=psql.tx_rolledback[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select sum(xact_rollback) from pg_stat_database"
# Max Connections
UserParameter=psql.server_maxcon[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "show max_connections"
# PostgreSQL is running
UserParameter=psql.running[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select 1" > /dev/null 2>&1 ; echo $?
# Added by SRA OSS
# Get number of checkpoint count (by checkpoint_timeout)
UserParameter=psql.checkpoints_timed[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select checkpoints_timed from pg_stat_bgwriter"
# Get number of checkpoint count (by checkpoint_segments)
UserParameter=psql.checkpoints_req[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select checkpoints_req from pg_stat_bgwriter"
# Get the total number of connections
UserParameter=psql.server_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity;"
# Get the total number of active (on processing SQL) connections
UserParameter=psql.active_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(state) from pg_stat_activity where state = 'active'"
# Get the total number of idle connections
UserParameter=psql.idle_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(state) from pg_stat_activity where state = 'idle'"
# Get the total number of idle in transaction connections
UserParameter=psql.idle_tx_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(state) from pg_stat_activity where state = 'idle in transaction'"
# Get the total number of lock-waiting connections
UserParameter=psql.locks_waiting[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where waiting = 't'"
# Get buffer information
UserParameter=psql.buffers_checkpoint[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_checkpoint from pg_stat_bgwriter"
UserParameter=psql.buffers_clean[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_clean from pg_stat_bgwriter"
UserParameter=psql.maxwritten_clean[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select maxwritten_clean from pg_stat_bgwriter"
UserParameter=psql.buffers_backend[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_backend from pg_stat_bgwriter"
UserParameter=psql.buffers_backend_fsync[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_backend_fsync from pg_stat_bgwriter"
UserParameter=psql.buffers_alloc[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select buffers_alloc from pg_stat_bgwriter"
# Get number of slow queries
UserParameter=psql.slow_queries[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where state = 'active' and now() - query_start > '$5 sec'::interval"
UserParameter=psql.slow_select_queries[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where state = 'active' and now() - query_start > '$5 sec'::interval and query ilike 'select%'"
UserParameter=psql.slow_dml_queries[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select count(*) from pg_stat_activity where state = 'active' and now() - query_start > '$5 sec'::interval and query ~* '^(insert|update|delete)'"
#
# Database specific examples
#
# Get the size of a Database (in bytes)
UserParameter=psql.db_size[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select pg_database_size('$5') from pg_database where datname = '$5'"
# Get number of active connections for a specified database
UserParameter=psql.db_connections[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select numbackends from pg_stat_database where datname = '$5'"
# Get number of tuples returned for a specified database
UserParameter=psql.db_returned[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_returned from pg_stat_database where datname = '$5'"
# Get number of tuples fetched for a specified database
UserParameter=psql.db_fetched[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_fetched from pg_stat_database where datname = '$5'"
# Get number of tuples inserted for a specified database
UserParameter=psql.db_inserted[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_inserted from pg_stat_database where datname = '$5'"
# Get number of tuples updated for a specified database
UserParameter=psql.db_updated[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_updated from pg_stat_database where datname = '$5'"
# Get number of tuples deleted for a specified database
UserParameter=psql.db_deleted[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select tup_deleted from pg_stat_database where datname = '$5'"
# Get number of committed/rolled back transactions for a specified database
UserParameter=psql.db_tx_committed[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select xact_commit from pg_stat_database where datname = '$5'"
UserParameter=psql.db_tx_rolledback[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select xact_rollback from pg_stat_database where datname = '$5'"
# Cache Hit Ratio
UserParameter=psql.cachehit_ratio[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "SELECT round(blks_hit*100/(blks_hit+blks_read), 2) AS cache_hit_ratio FROM pg_stat_database WHERE datname = '$5' and blks_read > 0 union all select 0.00 AS cache_hit_ratio order by cache_hit_ratio desc limit 1"
# Added by SRA OSS
# Get number of temp files
UserParameter=psql.db_temp_files[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select temp_files from pg_stat_database where datname = '$5'"
# Get temp file size (in bytes)
UserParameter=psql.db_temp_bytes[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select temp_bytes from pg_stat_database where datname = '$5'"
# Get percentage of dead tuples of all tables for a specified database
UserParameter=psql.db_dead_tup_ratio[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select round(sum(n_dead_tup)*100/sum(n_live_tup+n_dead_tup), 2) as dead_tup_ratio from pg_stat_all_tables where n_live_tup > 0"
# Get number of deadlocks for a specified database (9.2 or later)
UserParameter=psql.db_deadlocks[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select deadlocks from pg_stat_database where datname = '$5'"
#
# Table specific examples
#
# Get table cache hit ratio of a specific table
UserParameter=psql.table_cachehit_ratio[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select round(heap_blks_hit*100/(heap_blks_hit+heap_blks_read), 2) as cache_hit_ratio from pg_statio_user_tables where schemaname = '$5' and relname = '$6' and heap_blks_read > 0 union all select 0.00 as cache_hit_ratio order by cache_hit_ratio desc limit 1"
# Get number of sequencial scan of a specific table
UserParameter=psql.table_seq_scan[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select seq_scan from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of index scan of a specific table
UserParameter=psql.table_idx_scan[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select coalesce(idx_scan,0) from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of vacuum count of a specific table
UserParameter=psql.table_vacuum_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select vacuum_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of analyze count of a specific table
UserParameter=psql.table_analyze_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select analyze_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of autovacuum count of a specific table
UserParameter=psql.table_autovacuum_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select autovacuum_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of autoanalyze count of a specific table
UserParameter=psql.table_autoanalyze_count[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select autoanalyze_count from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
# Get number of tuples of a specific table
UserParameter=psql.table_n_tup_ins[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_ins from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_tup_upd[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_upd from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_tup_del[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_del from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_seq_tup_read[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select seq_tup_read from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_idx_tup_fetch[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select coalesce(idx_tup_fetch,0) from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_tup_hot_upd[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_tup_hot_upd from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_live_tup[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_live_tup from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
UserParameter=psql.table_n_dead_tup[*],psql -h $1 -p $2 -U $3 -d $4 -t -c "select n_dead_tup from pg_stat_user_tables where schemaname = '$5' and relname = '$6'"
#
# Discovery Rule
#
# Database Discovery
UserParameter=db.list.discovery[*],$5/find_dbname.sh $1 $2 $3 $4
UserParameter=db_table.list.discovery[*],$5/find_dbname_table.sh $1 $2 $3 $4
8.4. 配置zabbix_agentd.conf (被动监控)
[root@db-172-16-3-221 etc]# cd /opt/zabbix/etc/
[root@db-172-16-3-221 etc]# ll
total 20
-rw-r--r-- 1 root root 2485 Sep 1 14:45 zabbix_agent.conf
drwxr-xr-x 2 root root 4096 Sep 22 14:24 zabbix_agent.conf.d
-rw-r--r-- 1 root root 7456 Sep 1 17:19 zabbix_agentd.conf
drwxr-xr-x 2 root root 4096 Sep 1 14:45 zabbix_agentd.conf.d
[root@db-172-16-3-221 etc]# vi zabbix_agentd.conf
Include=/opt/zabbix/etc/zabbix_agentd.conf.d/
PidFile=/tmp/zabbix_agentd.pid
LogFile=/tmp/zabbix_agentd.log
LogFileSize=10
EnableRemoteCommands=1
LogRemoteCommands=1
Server=172.16.3.150
ListenPort=10050
ListenIP=0.0.0.0
StartAgents=8
ServerActive=172.16.3.150:10052
Hostname=agent221
RefreshActiveChecks=60
8.5. 启动agentd
[root@db-172-16-3-221 etc]# su - postgres -c "
/opt/zabbix/sbin/
zabbix_agentd -c /opt/zabbix/etc/zabbix_agentd.conf"
9. 创建监控主机, 或配置已监控的主机添加模板.
10. 查看PostgreSQL监控模板的items有被刷新.
[其他]
1. 如果宏的配置不通用的话(例如被监控的数据库有几百台, 但是监听端口不一样的话), 可以配置主机宏覆盖值.
2. 如果没有PostgreSQL模板信息被刷新的话, 可以排查一下agentd端的脚本权限, PATH路径等是否正确. 除了要调用两个脚本, 还需要调用psql命令.