title: Pgsql原理解析——Vacuum date: 2019-03-19 12:23:31 categories: - Postgresql - PgInternal
vacuum的两个主要作用在之前的"并发控制"中已经有了一些介绍,本章针对vacuum的原理、使用做进一步讲解。
1 引言:案例分析
1.1 慢查询案例
tdb0529=> create table tbl01(id int, info text);
CREATE TABLE
tdb0529=> create index idx_tbl01_id on tbl01(id);
CREATE INDEX
tdb0529=> alter table tbl01 set (autovacuum_enabled =off);
ALTER TABLE
tdb0529=> insert into tbl01 select generate_series(1,1000000), md5(clock_timestamp()::text);
INSERT 0 1000000
tdb0529=> explain (analyze,verbose,timing,costs,buffers) select * from tbl01 where id=1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl01_id on public.tbl01 (cost=0.42..8.44 rows=1 width=37) (actual time=0.024..0.024 rows=1 loops=1)
Output: id, info
Index Cond: (tbl01.id = 1)
Buffers: shared hit=4
Planning time: 0.363 ms
Execution time: 0.054 ms
(6 rows)
扫描了4个数据块。另起事务插入大量数据块(不提交),查询发现扫描了大量数据块
tdb0529=> begin;
BEGIN
tdb0529=> insert into tbl01 select 1, md5(clock_timestamp()::text) from generate_series(1,1000000);
INSERT 0 1000000
tdb0529=> abort;
ROLLBACK
tdb0529=> explain (analyze,verbose,timing,costs,buffers) select * from tbl01 where id=1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.tbl01 (cost=4.38..8.40 rows=1 width=37) (actual time=48.241..133.774 rows=1 loops=1)
Output: id, info
Recheck Cond: (tbl01.id = 1)
Heap Blocks: exact=8335
Buffers: shared hit=6652 read=5161 dirtied=3636 written=557
-> Bitmap Index Scan on idx_tbl01_id (cost=0.00..4.38 rows=1 width=0) (actual time=47.006..47.006 rows=1000001 loops=1)
Index Cond: (tbl01.id = 1)
Buffers: shared hit=1952 read=1526
Planning time: 0.101 ms
Execution time: 133.913 ms
事务提交、提交这些扫描都无法避免,原理请参考《3 并发控制》
回收必须使用vacuum,回收后,SQL执行时间恢复正常
tdb0529=> vacuum verbose tbl01;
INFO: vacuuming "public.tbl01"
INFO: scanned index "idx_tbl01_id" to remove 1000000 row versions
DETAIL: CPU: user: 0.24 s, system: 0.00 s, elapsed: 0.25 s
INFO: "tbl01": removed 1000000 row versions in 8334 pages
DETAIL: CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s
INFO: index "idx_tbl01_id" now contains 1000000 row versions in 6237 pages
DETAIL: 1000000 index row versions were removed.
3474 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "tbl01": found 1000000 removable, 1000000 nonremovable row versions in 16667 out of 16667 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 574
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.45 s, system: 0.03 s, elapsed: 0.48 s.
INFO: "tbl01": truncated 16667 to 8334 pages
DETAIL: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.03 s
INFO: vacuuming "pg_toast.pg_toast_16386"
INFO: index "pg_toast_16386_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_16386": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 575
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
tdb0529=> explain (analyze,verbose,timing,costs,buffers) select * from tbl01 where id=1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl01_id on public.tbl01 (cost=0.42..8.44 rows=1 width=37) (actual time=0.012..0.012 rows=1 loops=1)
Output: id, info
Index Cond: (tbl01.id = 1)
Buffers: shared hit=4
Planning time: 0.135 ms
Execution time: 0.031 ms
1.2 页面分析案例
create table tbl02(id int, info text);
CREATE TABLE
alter table tbl02 set (autovacuum_enabled='off');
ALTER TABLE
-- superuser
create extension pageinspect;
CREATE EXTENSION
insert into tbl02 select id, repeat(md5(random()::text), 16) from generate_series(1,20) t(id);
INSERT 0 20
select pg_column_size(repeat(md5(random()::text), 16));
pg_column_size
----------------
516
查看第页面1(页面结构请参考"存储结构")
lp指针分配到80
数据的最低位点到576
当前页面的free space=576-80=496
SELECT * FROM page_header(get_raw_page('tbl02', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
0/AC7E6400 | 0 | 0 | 80 | 576 | 8192 | 8192 | 4 | 0
查看具体某一条记录的偏移量
select lp,lp_off from heap_page_items(get_raw_page('tbl02', 0));
lp | lp_off
----+--------
1 | 7648
2 | 7104
3 | 6560
4 | 6016
5 | 5472
6 | 4928
7 | 4384
8 | 3840
9 | 3296
10 | 2752
11 | 2208
12 | 1664
13 | 1120
14 | 576
(14 rows)
查看靠近页面末尾的记录信息
select * from heap_page_items(get_raw_page('tbl02', 0)) where lp=1;
-[ RECORD 1 ]--------------
lp | 1
lp_off | 7648
lp_flags | 1
lp_len | 544
t_xmin | 579
t_xmax | 0
t_field3 | 0
t_ctid | (0,1)
t_infomask2 | 2
t_infomask | 2050
t_hoff | 24
t_bits |
t_oid |
t_data | \x01000000100800003435353366633031356233363938373961356537313761316461343337333634343535336663303135623336393837396135653731376131646134333733363434353533666330313562333639383739613565373137613164613433373336343435353366633031356233363938373961356537313761316461343337333634343535336663303135623336393837396135653731376131646134333733363434353533666330313562333639383739613565373137613164613433373336343435353366633031356233363938373961356537313761316461343337333634343535336663303135623336393837396135653731376131646134333733363434353533666330313562333639383739613565373137613164613433373336343435353366633031356233363938373961356537313761316461343337333634343535336663303135623336393837396135653731376131646134333733363434353533666330313562333639383739613565373137613164613433373336343435353366633031356233363938373961356537313761316461343337333634343535336663303135623336393837396135653731376131646134333733363434353533666330313562333639383739613565373137613164613433373336343435353366633031356233363938373961356537313761316461343337333634
间隔删除数据,使用ctid(页面号,lp号)作为条件,发现数据并没有真正的从页面中删除
delete from tbl02 where ctid not in ('(0,1)','(0,3)','(0,5)','(0,7)','(0,9)','(0,11)','(0,13)');
DELETE 13
tdb0529=# select lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_bits,t_oid from heap_page_items(get_raw_page('tbl02', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 7648 | 1 | 544 | 579 | 0 | 0 | (0,1) | 2 | 2306 | 24 | |
2 | 7104 | 1 | 544 | 579 | 580 | 0 | (0,2) | 8194 | 258 | 24 | |
3 | 6560 | 1 | 544 | 579 | 0 | 0 | (0,3) | 2 | 2306 | 24 | |
4 | 6016 | 1 | 544 | 579 | 580 | 0 | (0,4) | 8194 | 258 | 24 | |
5 | 5472 | 1 | 544 | 579 | 0 | 0 | (0,5) | 2 | 2306 | 24 | |
6 | 4928 | 1 | 544 | 579 | 580 | 0 | (0,6) | 8194 | 258 | 24 | |
7 | 4384 | 1 | 544 | 579 | 0 | 0 | (0,7) | 2 | 2306 | 24 | |
8 | 3840 | 1 | 544 | 579 | 580 | 0 | (0,8) | 8194 | 258 | 24 | |
9 | 3296 | 1 | 544 | 579 | 0 | 0 | (0,9) | 2 | 2306 | 24 | |
10 | 2752 | 1 | 544 | 579 | 580 | 0 | (0,10) | 8194 | 258 | 24 | |
11 | 2208 | 1 | 544 | 579 | 0 | 0 | (0,11) | 2 | 2306 | 24 | |
12 | 1664 | 1 | 544 | 579 | 580 | 0 | (0,12) | 8194 | 258 | 24 | |
13 | 1120 | 1 | 544 | 579 | 0 | 0 | (0,13) | 2 | 2306 | 24 | |
14 | 576 | 1 | 544 | 579 | 580 | 0 | (0,14) | 8194 | 258 | 24 | |
进行垃圾回收,发现lp并没有改变但是记录的空间已经被回收了
tdb0529=# vacuum verbose tbl02;
INFO: vacuuming "public.tbl02"
INFO: "tbl02": removed 13 row versions in 2 pages
INFO: "tbl02": found 13 removable, 7 nonremovable row versions in 2 out of 2 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 582
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "tbl02": truncated 2 to 1 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: vacuuming "pg_toast.pg_toast_16413"
INFO: index "pg_toast_16413_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pg_toast_16413": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 583
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
tdb0529=# select lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_bits,t_oid from heap_page_items(get_raw_page('tbl02', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 7648 | 1 | 544 | 579 | 0 | 0 | (0,1) | 2 | 2306 | 24 | |
2 | 0 | 0 | 0 | | | | | | | | |
3 | 7104 | 1 | 544 | 579 | 0 | 0 | (0,3) | 2 | 2306 | 24 | |
4 | 0 | 0 | 0 | | | | | | | | |
5 | 6560 | 1 | 544 | 579 | 0 | 0 | (0,5) | 2 | 2306 | 24 | |
6 | 0 | 0 | 0 | | | | | | | | |
7 | 6016 | 1 | 544 | 579 | 0 | 0 | (0,7) | 2 | 2306 | 24 | |
8 | 0 | 0 | 0 | | | | | | | | |
9 | 5472 | 1 | 544 | 579 | 0 | 0 | (0,9) | 2 | 2306 | 24 | |
10 | 0 | 0 | 0 | | | | | | | | |
11 | 4928 | 1 | 544 | 579 | 0 | 0 | (0,11) | 2 | 2306 | 24 | |
12 | 0 | 0 | 0 | | | | | | | | |
13 | 4384 | 1 | 544 | 579 | 0 | 0 | (0,13) | 2 | 2306 | 24 | |
14 | 0 | 0 | 0 | | | | | | | | |
vacuum full一下
vacuum full tbl02;
VACUUM
select lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_bits,t_oid from heap_page_items(get_raw_page('tbl02', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 7648 | 1 | 544 | 579 | 0 | 0 | (0,1) | 2 | 2818 | 24 | |
2 | 7104 | 1 | 544 | 579 | 0 | 0 | (0,2) | 2 | 2818 | 24 | |
3 | 6560 | 1 | 544 | 579 | 0 | 0 | (0,3) | 2 | 2818 | 24 | |
4 | 6016 | 1 | 544 | 579 | 0 | 0 | (0,4) | 2 | 2818 | 24 | |
5 | 5472 | 1 | 544 | 579 | 0 | 0 | (0,5) | 2 | 2818 | 24 | |
6 | 4928 | 1 | 544 | 579 | 0 | 0 | (0,6) | 2 | 2818 | 24 | |
7 | 4384 | 1 | 544 | 579 | 0 | 0 | (0,7) | 2 | 2818 | 24 | |
这里可以得出结论,垃圾回收并不会改变lp,这样索引就不需要改变了
垃圾回收的记录会有页内offset
vacuum full会改变lp,重新组织页面结构
2 使用vacuum
vacuum提供了两种使用接口
vacuum
vacuum full
两种方式的加锁级别不同
vacuum:
SHARE UPDATE EXCLUSIVE
vacuum full:
ACCESS EXCLUSIVE
为便于记忆总结一下vacuum full加排他锁,最高级别的锁和所有其他锁冲突。 而vacuum冲突SQL语句为:
-- http://www.postgres.cn/docs/10/explicit-locking.html
VACUUM
ANALYZE
CREATE INDEX CONCURRENTLY
CREATE STATISTICS
ALTER TABLE
CREATE INDEX
REFRESH MATERIALIZED VIEW CONCURRENTLY
ALTER TABLE
DROP TABLE
TRUNCATE
REINDEX
CLUSTER
VACUUM FULL
REFRESH MATERIALIZED VIEW
vacuum的使用除了上述手动执行外,还提供了守护进程autovacuum自动化执行的方式。
2 清理tuple
后面的内容需要了解底层存储结构,请先阅读《1 存储结构》
清理步骤
(1) FOR each table
(2) Acquire ShareUpdateExclusiveLock lock for the target table
第一步:
(3) 扫描页面,冻结记录
(4) 删除垃圾记录的索引记录
第二步:
(5) FOR each page of the table
(6) 移除垃圾记录,重新组织页面结构
(7) 更新FSM和VM
END FOR
第三步:
(8) Truncate the last page if possible
(9) Update both the statistics and system catalogs of the target table
Release ShareUpdateExclusiveLock lock
END FOR
/* Post-processing */
(10) Update statistics and system catalogs
(11) Remove both unnecessary files and pages of the clog if possible
2.1 第一步
首先,PostgreSQL扫描目标表建立垃圾元组列表,可能的话也会冻结过旧元组。 (冻结请直接到后面第6部分) 列表存储在内存中,由maintenance_work_mem控制(调大可以提高vacuum性能)。
maintenance_work_mem (integer)s Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes (64MB). Since only one of these operations can be executed at a time by a database session, and an installation normally doesn't have many of them running concurrently, it's safe to set this value significantly larger than work_mem. Larger settings might improve performance for vacuuming and for restoring database dumps. https://www.postgresql.org/docs/10/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
扫描完成后,PostgreSQL通过垃圾元组列表来删除对应的索引元组。 当maintenance_work_mem已满时,PostgreSQL继续进行下一个任务,即(4)到(7); 然后返回到(3)继续扫描。
2.2 第二步
从上面的“1.2 页面分析案例”我们可以看到,记录被删除后,空间并不会立即释放,实际上删除的动作只是标记记录为垃圾记录,真正回收页面空间的还是由vacuum来做的。
上图中发生了四件事情:
删除语句删掉了记录tuple1和记录tuple3,页面中会标记为垃圾元组,注意这里的行指针并不会发生变化,如果行指针变了,对应的索引块也必须跟着更新。
vacuum整理数据空间,避免出现碎片。
vacuum清理垃圾元组对应的索引块中的记录。
vacuum更新FSM和VM。
2.3 第三步
更新统计信息和相关的系统表、视图。
如果最后一页没有记录,直接删除最后一个页面。
2.4 最后
可能会触发清理clog,参考第五部分“5 清理CLOG”。
3 Visibility Map
前面我们了解到,MVCC机制下元组的更新和删除并不会立即从文件中做物理删除,而是通过事务ID标记,vacuum经过判断后删除。
可见性映射表的设计目的是为了加快vacuum的速度,原理比较简单:每个表文件都会有一个可见性映射表,里面保存了表文件中每个页面的可见性(有垃圾元组的页面用1来表示,没有的页面用0来表示)
cd $PGDATA/base/13213
ls | grep 3601
3601
3601_fsm
3601_vm
4 Freeze
为什么要FREEZE?