AnalyticDB for PostgreSQL(原HybridDB for PostgreSQL,以下简称ADB for PG)作为高性能分析型数据库,可以支持用户对其业务数据进行实时分析,能够让企业敏锐感知市场动态,做出必要决策。本文从ADB for PG用户角度出发,阐述如何借助pg_stat_activity这一系统视图实现实例的基本健康检查。
pg_stat_activity定义
postgres=# \d+ pg_stat_activity;
View "pg_catalog.pg_stat_activity"
Column | Type | Modifiers | Storage | Description
------------------+--------------------------+-----------+----------+-------------
datid | oid | | plain | OID of the database this backend is connected to
datname | name | | plain | Name of the database this backend is connected to
procpid | integer | | plain | Process ID of this backend
sess_id | integer | | plain |
usesysid | oid | | plain | OID of the user logged into this backend
usename | name | | plain | Name of the user logged into this backend
current_query | text | | extended |
waiting | boolean | | plain | True if this backend is currently waiting on a lock
query_start | timestamp with time zone | | plain | Time when the currently active query was started
backend_start | timestamp with time zone | | plain | Time when this process was started, i.e., when the client connected to the server
client_addr | inet | | plain |
client_port | integer | | plain |
application_name | text | | extended |
xact_start | timestamp with time zone | | plain |
waiting_reason | text | | extended |
View definition:
SELECT s.datid, d.datname, s.procpid, s.sess_id, s.usesysid, u.rolname AS usename, s.current_query, s.waiting, s.query_start, s.backend_start, s.client_addr, s.client_port, s.application_name, s.xact_start, s.waiting_reason
FROM pg_database d, pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port, sess_id, waiting_reason), pg_authid u
WHERE s.datid = d.oid AND s.usesysid = u.oid;
这是当前ADB for PG的版本输出,其中Description列根据官方文档手动补充。
连接信息
想确认某个用户是否连接到当前DB上:
postgres=# SELECT datname,usename FROM pg_stat_activity WHERE usename = 'joe';
datname | usename
----------+---------
postgres | joe
(1 row)
进一步确认当前所有的连接信息,包括哪些机器在连接:
postgres=# SELECT datname,usename,client_addr,client_port FROM pg_stat_activity ;
datname | usename | client_addr | client_port
----------+-----------+----------------+-------------
postgres | joe | xx.xx.xx.xx | 60621
postgres | gpmon | xx.xx.xx.xx | 60312
(9 rows)
通过上述信息就能确认当前的连接用户和对应的连接机器。
SQL运行信息
当前DB上运行的SQL也可以通过pg_stat_activity来获取,首先需要确认参数track_activities = on已经设置。该参数会默认设置好,只需要确认即可:
postgres=# show track_activities;
track_activities
------------------
on
(1 row)
获取当前用户执行SQL信息:
postgres=# SELECT datname,usename,current_query FROM pg_stat_activity ;
datname | usename | current_query
----------+----------+--------------------------------------------------------------
postgres | postgres | SELECT datname,usename,current_query FROM pg_stat_activity ;
postgres | joe | <IDLE>
(2 rows)
只看当前正在运行的SQL信息:
SELECT datname,usename,current_query
FROM pg_stat_activity
WHERE current_query != '<IDLE>' ;
查看耗时较长的查询:
select
current_timestamp - query_start as runtime,
datname,
usename,
current_query
from pg_stat_activity
where current_query != '<IDLE>'
order by 1 desc;
例如输出:
runtime | datname | usename | current_query
-----------------+----------------+----------+------------------------------------------------------------------------------
00:00:34.248426 | tpch_1000x_col | postgres | select
: l_returnflag,
: l_linestatus,
: sum(l_quantity) as sum_qty,
: sum(l_extendedprice) as sum_base_price,
: sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
: sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
: avg(l_quantity) as avg_qty,
: avg(l_extendedprice) as avg_price,
: avg(l_discount) as avg_disc,
: count(*) as count_order
: from
: public.lineitem
: where
: l_shipdate <= date '1998-12-01' - interval '93' day
: group by
: l_returnflag,
: l_linestatus
: order by
: l_returnflag,
: l_linestatus;
00:00:00 | postgres | postgres | select
: current_timestamp - query_start as runtime,
: datname,
: usename,
: current_query
: from pg_stat_activity
: where current_query != '<IDLE>'
: order by 1 desc;
(2 rows)
可以看到第一个查询耗时较久,已经运行了34s还没有结束。
异常SQL诊断及修复
如果一个SQL运行很长时间没有结果,需要检查该SQL还在运行中还是已经被block了:
SELECT datname,usename,current_query
FROM pg_stat_activity
WHERE waiting;
需要注意的是这个输出只能获取当前因为lock而被block的SQL,因为其他原因被block的SQL这里获取不到。绝大多数情况下SQL都是因为lock而被block,但也会有一些其他情况例如等待i/o、定时器等。
如果上述SQL有结果输出说明有SQL被lock阻塞,进一步明确相互block的SQL信息:
SELECT
w.current_query as waiting_query,
w.procpid as w_pid,
w.usename as w_user,
l.current_query as locking_query,
l.procpid as l_pid,
l.usename as l_user,
t.schemaname || '.' || t.relname as tablename
from pg_stat_activity w
join pg_locks l1 on w.procpid = l1.pid and not l1.granted
join pg_locks l2 on l1.relation = l2.relation and l2.granted
join pg_stat_activity l on l2.pid = l.procpid
join pg_stat_user_tables t on l1.relation = t.relid
where w.waiting;
通过这个SQL的输出信息就能确认相互block的SQL和对应的执行pid。在明确了SQL的阻塞信息后,可以通过cancel或者kill query的方式进行恢复。
通过cancel取消一个正在运行的query:
SELECT pg_cancel_backend(pid)
需要在一个运行query的session中执行,如果session本身就是idle的,执行不起作用。另外取消这个query需要花费一定的时间来做清理和事务的回滚。
使用pg_terminate_backend来清理idle session,也可以用来终止query:
SELECT pg_terminate_backend(pid);
该用户的连接会被断开。尽量避免在正在运行query的进程pid上执行。
需要注意的是文中提到操作需要用户有superuser的权限。