Cluster簇管理 :
在PostgreSQL的统计信息表中, 有一项指标correlation指明了heap表存储的物理顺序和索引顺序的关系.
例如 :
digoal=# create table tbl (id int primary key, info text);
CREATE TABLE
Time: 25.762 ms
digoal=# insert into tbl select generate_series(1,10000),'test';
INSERT 0 10000
Time: 32.397 ms
digoal=# select correlation from pg_stats where schemaname='public' and tablename='tbl' and attname='id';
correlation
-------------
1
(1 row)
Time: 2.648 ms
1表示该列的物理存储顺序和索引顺序完全一致. 如果是-1则表示和索引顺序完全相反, 这种情况出现在索引使用了反向排序即desc的时候.
接下来按随机顺序插入.
digoal=# truncate tbl;
TRUNCATE TABLE
Time: 34.648 ms
digoal=# insert into tbl select trunc(random()*100000),'test' from generate_series(1,100000) group by 1,2;
INSERT 0 63321
Time: 332.371 ms
digoal=# select correlation from pg_stats where schemaname='public' and tablename='tbl' and attname='id';
correlation
-------------
0.0047608
(1 row)
Time: 1.476 ms
现在这个值就很低了, 说明物理顺序和索引顺序偏差很大.
索引顺序和物理存储顺序一致的话, 使用索引进行范围检索时, 可以减少IO量. 例如 :
取1到100的id范围, 使用上述偏差很大的表查询时, 使用索引扫描的话需要扫描50个表的数据块, 如下 :
在执行cluster后, 使用索引扫描的话 只需要扫描1个数据块. 可以大大降低范围查询的HEAP block扫描量.
digoal=# select split_part(ctid::text, ',', 1) from tbl where id between 1 and 100 group by 1 order by 1;
split_part
------------
(0
(1
(103
(127
(130
(132
(134
(135
(139
(14
(143
(151
(157
(160
(164
(178
(180
(188
(189
(201
(203
(204
(22
(223
(238
(270
(272
(274
(275
(310
(314
(327
(33
(330
(334
(337
(37
(38
(40
(49
(51
(53
(60
(63
(75
(78
(8
(87
(88
(90
(50 rows)
在执行cluster后, 使用索引扫描的话 只需要扫描1个数据块. 可以大大降低范围查询的HEAP block扫描量.
digoal=# cluster tbl using tbl_pkey;
CLUSTER
Time: 198.924 ms
digoal=# analyze tbl;
ANALYZE
Time: 15.418 ms
digoal=# select correlation from pg_stats where schemaname='public' and tablename='tbl' and attname='id';
correlation
-------------
1
(1 row)
Time: 1.327 ms
digoal=# select split_part(ctid::text, ',', 1) from tbl where id between 1 and 100 group by 1 order by 1;
split_part
------------
(0
(1 row)