PostgreSQL bitmap scan的IO放大的原理解释和优化-阿里云开发者社区

PostgreSQL bitmap scan的IO放大的原理解释和优化

2018-01-27 5225

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

云原生数据库 PolarDB 分布式版，标准版 2核8GB

RDS MySQL Serverless 基础系列，0.5-2RCU 50GB

RDS PostgreSQL Serverless，0.5-4RCU 50GB 3个月

简介：

背景

PostgreSQL 支持9种索引接口：

《PostgreSQL 9种索引的原理和应用场景》

每一种索引的结构，适合的数据类型，适合的查询场景都不一样。

对于多值类型（例如 K-V，数组、全文检索类型），我们可以选择GIN倒排索引接口，GIN使用的扫描方法是bitmap scan的扫描方法。

实际上PostgreSQL常用的数据扫描方法包括：

seq scan，全表扫描
index scan，索引扫描（需要回表）
index only scan，索引扫描（通过VM减少回表，大多数情况下，不需要回表）
bitmap scan，先扫索引，然后按HEAP BLOCK ID扫描HEAP BLOCK。输出整个数据块的数据，因此需要recheck。

bitmap scan的特性，决定了它可能存在放大(因为一个BLOCK里面哪怕只有一条记录是复合条件的，也会返回整个BLOCK)。

bitmap scan IO,CPU放大例子

1、新建测试表

create table test(id int, arr int[]);  

        
          
        
        
        
          
          AI 代码解读

2、写入测试数据

create or replace function gen_arr(int,int) returns int[] as $$  
  select array(select ($1*random())::int from generate_series(1,$2));  
$$ language sql strict;  
  
postgres=# select gen_arr(100,10);  
            gen_arr              
-------------------------------  
 {5,71,91,23,95,81,98,12,33,2}  
(1 row)  
  
insert into test select id, gen_arr(1000, 10) from generate_series(1,1000000) t(id);  

        
          
        
        
        
          
          AI 代码解读

3、创建索引

create index idx_test_1 on test using gin (arr);  

        
          
        
        
        
          
          AI 代码解读

4、查询，分析

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where arr && array[1,2,3];  
                                                          QUERY PLAN                                                             
-------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.test  (cost=808.96..13148.92 rows=23402 width=36) (actual time=14.295..52.321 rows=29605 loops=1)  
   Output: id, arr  
   Recheck Cond: (test.arr && '{1,2,3}'::integer[])  
   Heap Blocks: exact=11240  
   Buffers: shared hit=11764  
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..803.11 rows=23402 width=0) (actual time=12.816..12.816 rows=29605 loops=1)  
         Index Cond: (test.arr && '{1,2,3}'::integer[])  
         Buffers: shared hit=524  
 Planning time: 0.314 ms  
 Execution time: 54.896 ms  
(10 rows)  

        
          
        
        
        
          
          AI 代码解读

包含1或2或3的数据，总共2.9万条，搜索了11240个HEAP BLOCK。

那么我们看看一个BLOCK可以存储多少数据？

postgres=# analyze test;  
ANALYZE  
postgres=# select reltuples/relpages from pg_class where relname='test';  
     ?column?       
------------------  
 80.9978940547546  
(1 row)  

        
          
        
        
        
          
          AI 代码解读

可以存下81条，意味着实际上29605条记录应该只需要365个数据块就可以放下。

但是由于这些目标记录没有密集存储，导致了IO的放大。

那么如何解决这个问题呢？

bitmap scan IO,cpu放大问题优化

1、聚集存储

实现方法很多，这里有一些例子：

《PostgreSQL 黑科技 - 空间聚集存储, 内窥GIN, GiST, SP-GiST索引》

《PostgreSQL GIN 单列聚集索引应用》

《PostgreSQL 聚集存储与 BRIN索引 - 高并发行为、轨迹类大吞吐数据查询场景解说》

《K-Means 数据聚集算法》

我们可以看看聚集带来的效果：

-- 重组数据  
postgres=# with tmp as (delete from test where ctid = any(array(select ctid from test where arr && array[1,2,3])) returning *) insert into test select * from tmp;  
INSERT 0 29605  
  
-- 再次查询  
postgres=# vacuum test;  
VACUUM  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where arr && array[1,2,3];  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.test  (cost=247.16..13321.03 rows=28950 width=65) (actual time=3.459..8.248 rows=29605 loops=1)  
   Output: id, arr  
   Recheck Cond: (test.arr && '{1,2,3}'::integer[])  
   Heap Blocks: exact=367  
   Buffers: shared hit=389  
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..239.92 rows=28950 width=0) (actual time=3.411..3.411 rows=29605 loops=1)  
         Index Cond: (test.arr && '{1,2,3}'::integer[])  
         Buffers: shared hit=22  
 Planning time: 0.145 ms  
 Execution time: 10.991 ms  
(10 rows)  

        
          
        
        
        
          
          AI 代码解读

现在只访问了367个HEAP数据块。完全避免了IO放大的问题。

实际情况，可以根据业务喜好来聚集。

参考

《PostgreSQL 黑科技 - 空间聚集存储, 内窥GIN, GiST, SP-GiST索引》

《PostgreSQL GIN 单列聚集索引应用》

《PostgreSQL 聚集存储与 BRIN索引 - 高并发行为、轨迹类大吞吐数据查询场景解说》

《K-Means 数据聚集算法》

《PostgreSQL 9种索引的原理和应用场景》

PostgreSQL bitmap scan的IO放大的原理解释和优化

标签

背景

bitmap scan IO,CPU放大例子

bitmap scan IO,cpu放大问题优化

参考

关系型数据库

热门文章

最新文章

相关产品

相关课程

相关电子书

相关实验场景

推荐镜像

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

PostgreSQL bitmap scan的IO放大的原理解释和优化

标签

背景

bitmap scan IO,CPU放大例子

bitmap scan IO,cpu放大问题优化

参考

关系型数据库

热门文章

最新文章

相关产品

相关课程

相关电子书

相关实验场景

推荐镜像