PostgreSQL 老湿机图解平安科技遇到的垃圾回收"坑"-阿里云开发者社区

开发者社区> 阿里云数据库> 正文

PostgreSQL 老湿机图解平安科技遇到的垃圾回收"坑"

简介: 背景 从海安那里反馈的一个问题,是平安科技在使用PostgreSQL的过程中,遇到的一个有些"不可思议"的问题。 一张经常被更新的表,通过主键查询这张表的记录时,发现需要扫描异常多的数据块。 其实原因有2。 .1. 长事务有关,我在很多文章都提到过,PG在垃圾回收时,只判断垃圾版

背景

近日收到 平安科技 海安童鞋 那里反馈的一个问题,在生产环境使用PostgreSQL的过程中,遇到的一个有点"不可思议"的问题。

一张经常被更新的表,通过主键查询这张表的记录时,发现需要扫描异常多的数据块。

本文将为你详细剖析这个问题,同时给出规避的方法,以及内核改造的方法。

文中还涉及到索引的结构解说,仔细阅读定有收获。

原因分析

.1. 和长事务有关,我在很多文章都提到过,PG在垃圾回收时,只判断垃圾版本是否是当前数据库中最老的事务之前的,如果是之后产生的,则不回收。

所以当数据库存在长事务时,同时被访问的记录被多次变更,造成一些垃圾版本没有回收。
screenshot

.2. PG的索引没有版本信息,所以必须要访问heap tuple获取版本。
screenshot

复现方法

测试表

postgres=# create unlogged table test03 (id int primary key, info text);

频繁更新100条记录

$ vi test.sql
\setrandom id 1 100
insert into test03 values(:id, repeat(md5(random()::text), 1000)) on conflict on constraint test03_pkey do update set info=excluded.info;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000000

开启长事务,啥也不干

postgres=# begin;
BEGIN
postgres=# select txid_current();
 txid_current 
--------------
   3474642778
(1 row)

经过一段时间的更新,发现需要访问很多数据块了。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using test03_pkey on public.test03  (cost=0.42..8.44 rows=1 width=417) (actual time=0.661..4.440 rows=1 loops=1)
   Output: id, info
   Index Cond: (test03.id = 2)
   Buffers: shared hit=1753
 Planning time: 0.104 ms
 Execution time: 4.468 ms
(6 rows)

观察访问很多的块是heap块

postgres=# set enable_indexscan=off;
SET

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=4.43..8.44 rows=1 width=416) (actual time=5.818..5.819 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 2)
   Heap Blocks: exact=1986
   Buffers: shared hit=1996
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..4.43 rows=1 width=0) (actual time=0.418..0.418 rows=1986 loops=1)
         Index Cond: (test03.id = 2)
         Buffers: shared hit=10
 Planning time: 0.200 ms
 Execution time: 5.851 ms
(10 rows)

提交长事务前,使用vacuum verbose可以看到无法回收这些持续产生的垃圾page(包括index和heap的page)。

提交长事务

postgres=# end;
COMMIT

等待autovacuum进程回收垃圾,delete half index page。
访问的数据块数量下降了。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=4.43..8.45 rows=1 width=417) (actual time=0.113..0.118 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 2)
   Heap Blocks: exact=3
   Buffers: shared hit=14
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..4.43 rows=1 width=0) (actual time=0.067..0.067 rows=3 loops=1)
         Index Cond: (test03.id = 2)
         Buffers: shared hit=11
 Planning time: 0.101 ms
 Execution time: 0.148 ms
(10 rows)

深入分析

使用pageinspect观察测试过程中索引页的内容变化

创建extension

postgres=# create extension pageinspect;

开启长事务

postgres=# begin;
BEGIN
postgres=# select txid_current();

测试60秒更新

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 60

观察需要扫描多少数据块

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test03_pkey on public.test03  (cost=0.43..8.45 rows=1 width=417) (actual time=0.052..15.738 rows=1 loops=1)
   Output: id, info
   Index Cond: (test03.id = 1)
   Buffers: shared hit=2663
 Planning time: 0.572 ms
 Execution time: 15.790 ms
(6 rows)

postgres=# set enable_indexscan=off;
SET

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=4.44..8.45 rows=1 width=417) (actual time=6.138..6.139 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 1)
   Heap Blocks: exact=2651
   Buffers: shared hit=2663
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..4.44 rows=1 width=0) (actual time=0.585..0.585 rows=2651 loops=1)
         Index Cond: (test03.id = 1)
         Buffers: shared hit=12
 Planning time: 0.093 ms
 Execution time: 6.218 ms
(10 rows)

观察索引页, root=412, 层级=2

postgres=# select * from bt_metap('test03_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2
(1 row)

查看root页内容

postgres=# select * from bt_page_stats('test03_pkey',412);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   412 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    2 |          2
(1 row)

postgres=# select * from bt_page_items('test03_pkey',412);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (3,1)   |       8 | f     | f    | 
          2 | (584,1) |      16 | f     | f    | 21 00 00 00 00 00 00 00
          3 | (411,1) |      16 | f     | f    | 46 00 00 00 00 00 00 00
(3 rows)

查看最左branch 页内容

postgres=# select * from bt_page_items('test03_pkey',3);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (58,1)  |      16 | f     | f    | 21 00 00 00 00 00 00 00
          2 | (1,1)   |       8 | f     | f    | 
          3 | (937,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (767,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          5 | (666,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          6 | (572,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          7 | (478,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          8 | (395,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          9 | (307,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
         10 | (173,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
         11 | (99,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
         12 | (951,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         13 | (867,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         14 | (773,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         15 | (660,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         16 | (564,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         17 | (496,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         18 | (413,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         19 | (319,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         20 | (204,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         21 | (151,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         22 | (64,1)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
         23 | (865,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00
         24 | (777,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00

查看包含最小值的最左叶子节点内容

postgres=# select * from bt_page_items('test03_pkey',1);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (57342,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (71195,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (71171,12) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (71185,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          5 | (71150,17) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          6 | (71143,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
......

查看包含最小值的最右叶子节点内容

postgres=# select * from bt_page_items('test03_pkey',99);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (66214,10) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (12047,15) |      16 | f     | f    | 01 00 00 00 00 00 00 00
......
         40 | (11052,15) |      16 | f     | f    | 01 00 00 00 00 00 00 00
         41 | (11009,6)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
         42 | (11021,6)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
         43 | (71209,3)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
         44 | (69951,1)  |      16 | f     | f    | 02 00 00 00 00 00 00 00

查看这些叶子索引页包含data='01 00 00 00 00 00 00 00'的item有多少条,可以对应到需要扫描多少heap page

select count(distinct substring(ctid::text, 1, "position"(ctid::text, ','))) from (
select * from bt_page_items('test03_pkey',1) 
union all
select * from bt_page_items('test03_pkey',937) 
union all
select * from bt_page_items('test03_pkey',767) 
union all
select * from bt_page_items('test03_pkey',666) 
union all
select * from bt_page_items('test03_pkey',572) 
union all
select * from bt_page_items('test03_pkey',478) 
union all
select * from bt_page_items('test03_pkey',395) 
union all
select * from bt_page_items('test03_pkey',307) 
union all
select * from bt_page_items('test03_pkey',173) 
union all
select * from bt_page_items('test03_pkey',99) 
union all
select * from bt_page_items('test03_pkey',951)
) t 
where data='01 00 00 00 00 00 00 00';

 count 
-------
  2652
(1 row)

2652与前面执行计划中看到的2651对应。

提交长事务

postgres=# end;
COMMIT

等待autovacuum结束

postgres=# select * from pg_stat_all_tables where relname='test03';
-[ RECORD 1 ]-------+------------------------------
relid               | 14156713
schemaname          | public
relname             | test03
seq_scan            | 39
seq_tup_read        | 5137822
idx_scan            | 3522865664
idx_tup_fetch       | 3521843178
n_tup_ins           | 1022487
n_tup_upd           | 3476465702
n_tup_del           | 22387
n_tup_hot_upd       | 3433472972
n_live_tup          | 100
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         | 2016-07-15 00:03:53.909086+08
last_autovacuum     | 2016-07-15 00:32:04.177672+08
last_analyze        | 2016-07-15 00:03:53.909825+08
last_autoanalyze    | 2016-07-15 00:07:23.541629+08
vacuum_count        | 10
autovacuum_count    | 125
analyze_count       | 7
autoanalyze_count   | 99

观察现在需要扫描多少块

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=40.40..44.41 rows=1 width=417) (actual time=0.026..0.027 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 1)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..40.40 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: (test03.id = 1)
         Buffers: shared hit=4
 Planning time: 0.137 ms
 Execution time: 0.052 ms
(10 rows)

查看现在的索引页内容,half page已经remove掉了

postgres=# select count(distinct substring(ctid::text, 1, "position"(ctid::text, ','))) from (
select * from bt_page_items('test03_pkey',1) 
union all
select * from bt_page_items('test03_pkey',937) 
union all
select * from bt_page_items('test03_pkey',767) 
union all
select * from bt_page_items('test03_pkey',666) 
union all
select * from bt_page_items('test03_pkey',572) 
union all
select * from bt_page_items('test03_pkey',478) 
union all
select * from bt_page_items('test03_pkey',395) 
union all
select * from bt_page_items('test03_pkey',307) 
union all
select * from bt_page_items('test03_pkey',173) 
union all
select * from bt_page_items('test03_pkey',99) 
union all
select * from bt_page_items('test03_pkey',951)
) t 
where data='01 00 00 00 00 00 00 00' ;
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
-[ RECORD 1 ]
count | 2

再观察索引页内容,已经被autovacuum收缩了

postgres=# select * from bt_metap('test03_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2
(1 row)

postgres=# select * from bt_page_items('test03_pkey',412);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (3,1)   |       8 | f     | f    | 
          2 | (584,1) |      16 | f     | f    | 21 00 00 00 00 00 00 00
          3 | (411,1) |      16 | f     | f    | 46 00 00 00 00 00 00 00
(3 rows)

postgres=# select * from bt_page_items('test03_pkey',3);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (58,1)  |      16 | f     | f    | 21 00 00 00 00 00 00 00
          2 | (1,1)   |       8 | f     | f    | 
          3 | (99,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (865,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          5 | (844,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          6 | (849,1) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          7 | (18,1)  |      16 | f     | f    | 05 00 00 00 00 00 00 00
          8 | (95,1)  |      16 | f     | f    | 06 00 00 00 00 00 00 00
          9 | (63,1)  |      16 | f     | f    | 07 00 00 00 00 00 00 00
         10 | (34,1)  |      16 | f     | f    | 08 00 00 00 00 00 00 00
         11 | (851,1) |      16 | f     | f    | 09 00 00 00 00 00 00 00
         12 | (10,1)  |      16 | f     | f    | 0a 00 00 00 00 00 00 00
         13 | (71,1)  |      16 | f     | f    | 0b 00 00 00 00 00 00 00
         14 | (774,1) |      16 | f     | f    | 0c 00 00 00 00 00 00 00
         15 | (213,1) |      16 | f     | f    | 0d 00 00 00 00 00 00 00
         16 | (881,1) |      16 | f     | f    | 0e 00 00 00 00 00 00 00
         17 | (837,1) |      16 | f     | f    | 0f 00 00 00 00 00 00 00
         18 | (100,1) |      16 | f     | f    | 10 00 00 00 00 00 00 00
         19 | (872,1) |      16 | f     | f    | 11 00 00 00 00 00 00 00
         20 | (32,1)  |      16 | f     | f    | 12 00 00 00 00 00 00 00
         21 | (65,1)  |      16 | f     | f    | 13 00 00 00 00 00 00 00
         22 | (870,1) |      16 | f     | f    | 14 00 00 00 00 00 00 00
         23 | (841,1) |      16 | f     | f    | 15 00 00 00 00 00 00 00
         24 | (850,1) |      16 | f     | f    | 16 00 00 00 00 00 00 00
         25 | (30,1)  |      16 | f     | f    | 17 00 00 00 00 00 00 00
         26 | (91,1)  |      16 | f     | f    | 18 00 00 00 00 00 00 00
         27 | (829,1) |      16 | f     | f    | 19 00 00 00 00 00 00 00
         28 | (16,1)  |      16 | f     | f    | 1a 00 00 00 00 00 00 00
         29 | (784,1) |      16 | f     | f    | 1b 00 00 00 00 00 00 00
         30 | (31,1)  |      16 | f     | f    | 1c 00 00 00 00 00 00 00
         31 | (88,1)  |      16 | f     | f    | 1d 00 00 00 00 00 00 00
         32 | (48,1)  |      16 | f     | f    | 1e 00 00 00 00 00 00 00
         33 | (822,1) |      16 | f     | f    | 1f 00 00 00 00 00 00 00
         34 | (817,1) |      16 | f     | f    | 20 00 00 00 00 00 00 00
         35 | (109,1) |      16 | f     | f    | 21 00 00 00 00 00 00 00
(35 rows)

postgres=# select * from bt_page_items('test03_pkey',1);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (57342,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (71195,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(2 rows)

postgres=# select * from bt_page_items('test03_pkey',99);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (66214,10) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (71209,3)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
(2 rows)

相关源码

src/backend/access/nbtree/nbtpage.c

/*
 * Unlink a page in a branch of half-dead pages from its siblings.
 *
 * If the leaf page still has a downlink pointing to it, unlinks the highest
 * parent in the to-be-deleted branch instead of the leaf page.  To get rid
 * of the whole branch, including the leaf page itself, iterate until the
 * leaf page is deleted.
 *
 * Returns 'false' if the page could not be unlinked (shouldn't happen).
 * If the (new) right sibling of the page is empty, *rightsib_empty is set
 * to true.
 */
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{
...
        /*
         * Mark the page itself deleted.  It can be recycled when all current
         * transactions are gone.  Storing GetTopTransactionId() would work, but
         * we're in VACUUM and would not otherwise have an XID.  Having already
         * updated links to the target, ReadNewTransactionId() suffices as an
         * upper bound.  Any scan having retained a now-stale link is advertising
         * in its PGXACT an xmin less than or equal to the value we read here.  It
         * will continue to do so, holding back RecentGlobalXmin, for the duration
         * of that scan.
         */
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        opaque->btpo_flags &= ~BTP_HALF_DEAD;
        opaque->btpo_flags |= BTP_DELETED;
        opaque->btpo.xact = ReadNewTransactionId();
...

contrib/pageinspect/btreefuncs.c

                if (P_ISDELETED(opaque))
                        elog(NOTICE, "page is deleted");

参考

1. b-tree原理
https://yq.aliyun.com/articles/54437

优化手段

1. 频繁更新的表,数据库的优化手段
1.1 监控长事务,绝对控制长事务

1.2 缩小autovacuum naptime (to 1s) ,

  增加autovacuum work (to 10),     
  设置autovacuum delay=0,     
  增大autovacuum work memory (to 512MB or bigger),     
  将经常变更的表和索引放到好的iops的设备上 。       
  不要小看这几个参数,非常的关键。    

1.3 如果事务释放并且表上面已经出发了vacuum后,还是要查很多的PAGE,说明index page没有delete和收缩,可能是index page没有达到compact的要求,如果遇到这种情况,需要reindex。

2. PostgreSQL 9.6通过快照过旧彻底解决这个长事务引发的坑爹问题。
9.6 vacuum的改进如图
screenshot

如何判断snapshot too old如图
screenshot

https://www.postgresql.org/docs/9.6/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR

3. 9.6的垃圾回收机制也还有改进的空间,做到更细粒度的版本控制,改进方法以前分享过,在事务列表中增加记录事务隔离级别,通过隔离级别判断需要保留的版本,而不是简单的通过最老事务来判断需要保留的垃圾版本。

祝大家玩得开心,欢迎随时来 阿里云促膝长谈 业务需求 ,恭候光临。

阿里云的小伙伴们加油,努力做 最贴地气的云数据库

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

分享:
阿里云数据库
使用钉钉扫一扫加入圈子
+ 订阅

帮用户承担一切数据库风险,给您何止是安心!

官方博客
链接