测试发现PostgreSQL在bitmap index scan时，如果要读入大量堆page，读IO的速度会远低于正常的顺序读，影响性能。

下面用一个例子说明这个问题。

环境

台式机上的CentOS7.1虚机
消费级SSD
blockdev --setra设置为2048

测试

创建测试表

postgres=# create table tbx(id int,c1 int,pad text);
CREATE TABLE
postgres=# insert into tbx select id,(random()*200)::int,'sssssssssssssssssssssssssssssssssssss' from (select generate_series(1,30000000) id) tbx;
INSERT 0 30000000
postgres=# create index tbx_idx_c1 on tbx(c1);
CREATE INDEX
postgres=# analyze tbx;
ANALYZE

每个page大概有107条记录，c1的范围是0~199，也就是大概每2个page中就有一个c1相同的记录，即按page算，c1的选择性差不多是50%。

postgres=# select ctid,id,c1 from tbx limit 10 offset 100;
  ctid   | id  | c1  
---------+-----+-----
 (0,101) | 101 | 106
 (0,102) | 102 |  96
 (0,103) | 103 | 178
 (0,104) | 104 |  48
 (0,105) | 105 | 119
 (0,106) | 106 |  24
 (0,107) | 107 | 189
 (1,1)   | 108 | 171
 (1,2)   | 109 | 196
 (1,3)   | 110 |  79
(10 rows)

数据是2GB，shared_buffers是256MB。

postgres=# select relpages,reltuples from pg_class  where relname='tbx';
 relpages | reltuples 
----------+-----------
   280374 |     3e+07
(1 row)
postgres=# \d+
                             List of relations
 Schema |       Name       |   Type   |  Owner   |    Size    | Description 
--------+------------------+----------+----------+------------+-------------
 public | tbx              | table    | postgres | 2191 MB    | 
(1 rows)
postgres=# show shared_buffers;
 shared_buffers 
----------------
 256MB
(1 row)

现在用c1作为条件，使用不同的执行计划进行查询，并且每次查询前都清一次OS缓存并重起PostgreSQL。

Bitmap index扫描145秒。

postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbx  (cost=2797.28..246877.37 rows=149254 width=46) (actual time=802.109..144443.423 rows=149949 loops=1)
   Output: id, c1, pad
   Recheck Cond: (tbx.c1 = 19)
   Rows Removed by Index Recheck: 8417996
   Heap Blocks: exact=36636 lossy=79633
   Buffers: shared read=116683
   ->  Bitmap Index Scan on tbx_idx_c1  (cost=0.00..2759.97 rows=149254 width=0) (actual time=758.825..758.825 rows=149949 loops=1)
         Index Cond: (tbx.c1 = 19)
         Buffers: shared read=414
 Planning time: 15.084 ms
 Execution time: 144733.946 ms
(11 rows)

每个IO大小是16个扇区(8K),没有看到大的IO合并，IO队列深度也小于1，判断磁盘预读没有生效。

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.26    0.00   11.60   11.23    0.00   76.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  836.20    0.00     6.53     0.00    16.00     0.79    0.94    0.94    0.00   0.94  78.82
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00  836.20    0.00     6.53     0.00    16.00     0.79    0.94    0.94    0.00   0.94  78.94
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

顺序扫描74秒。

postgres=# set enable_bitmapscan=off;
SET
postgres=# set enable_indexscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tbx  (cost=0.00..655374.03 rows=149254 width=46) (actual time=2.201..74236.613 rows=149949 loops=1)
   Output: id, c1, pad
   Filter: (tbx.c1 = 19)
   Rows Removed by Filter: 29850052
   Buffers: shared read=280374
 Planning time: 18.273 ms
 Execution time: 74327.371 ms
(7 rows)

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.61    0.00    1.31   23.35    0.00   74.73

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.00   36.60    3.40    17.90     0.02   917.63     3.43   87.30   94.05   14.53  24.99  99.96
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00   36.80    3.20    18.00     0.01   922.51     3.41   86.76   93.55    8.69  24.99  99.96
dm-1              0.00     0.00    0.00    1.20     0.00     0.00     8.00     0.04   37.33    0.00   37.33  21.00   2.52
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

index扫描76秒

postgres=# set enable_bitmapscan=off;
SET
postgres=# set enable_indexscan=on;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using tbx_idx_c1 on public.tbx  (cost=0.56..475759.53 rows=149254 width=46) (actual time=4.159..75905.207 rows=149949 loops=1)
   Output: id, c1, pad
   Index Cond: (tbx.c1 = 19)
   Buffers: shared read=116683
 Planning time: 9.857 ms
 Execution time: 75954.629 ms
(6 rows)

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.86   24.05    0.00   74.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.80   84.80    1.00    16.46     0.01   393.08     1.87   21.64   21.73   13.80  11.58  99.32
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00   84.60    0.60    16.31     0.00   392.13     1.86   21.74   21.78   16.00  11.65  99.30
dm-1              0.00     0.00    0.00    1.20     0.00     0.00     8.00     0.01   12.33    0.00   12.33   6.17   0.74
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

通过上面的数据可以看出，顺序扫描和索引扫描都可以利用磁盘预读，但bitmap索引扫描不行。

微观上通过strace -p跟踪postgres后端进程看看索引扫描和bitmap索引扫描各自调用的API有什么区别。

以下是索引扫描调用的API片段

read(33, "\0\0\0\0\200\305\310\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
read(33, "\0\0\0\0\20\361\310\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
read(33, "\0\0\0\0\240\34\311\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204898304, SEEK_SET)          = 204898304
read(33, "\0\0\0\0H\"\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204922880, SEEK_SET)          = 204922880
read(33, "\0\0\0\0\20\245\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204939264, SEEK_SET)          = 204939264
read(33, "\0\0\0\0000\374\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
read(33, "\0\0\0\0\330'\313\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204963840, SEEK_SET)          = 204963840
read(33, "\0\0\0\0\370~\313\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 205012992, SEEK_SET)          = 205012992
read(33, "\0\0\0\0\240\204\314\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 205037568, SEEK_SET)          = 205037568

bitmap索引扫描调用的API片段

read(33, "\0\0\0\0\3400S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273088512, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273088512, SEEK_SET)          = 273088512
read(33, "\0\0\0\0\250\263S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273096704, 8192, POSIX_FADV_WILLNEED) = 0
read(33, "\0\0\0\0008\337S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273154048, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273154048, SEEK_SET)          = 273154048
read(33, "\0\0\0\0p\20U\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273170432, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273170432, SEEK_SET)          = 273170432
read(33, "\0\0\0\0\250gU\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273211392, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273211392, SEEK_SET)          = 273211392
read(33, "\0\0\0\0\250AV\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273219584, 8192, POSIX_FADV_WILLNEED) = 0
read(33, "\0\0\0\0008mV\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273244160, 8192, POSIX_FADV_WILLNEED) = 0

从上面的调用看，应该是fadvise64()使得磁盘预读失效。fadvise64()调用是由effective_io_concurrency参数控制的预读功能，effective_io_concurrency的默认值为1，它只对bitmap index scan有效。

下面将effective_io_concurrency禁用，发现性能有所提高，执行时间102秒，磁盘预读也生效了。

postgres=# set effective_io_concurrency=0;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbx  (cost=2797.28..246877.37 rows=149254 width=46) (actual time=148.673..102092.774 rows=149949 loops=1)
   Output: id, c1, pad
   Recheck Cond: (tbx.c1 = 19)
   Rows Removed by Index Recheck: 8417996
   Heap Blocks: exact=36636 lossy=79633
   Buffers: shared read=116683
   ->  Bitmap Index Scan on tbx_idx_c1  (cost=0.00..2759.97 rows=149254 width=0) (actual time=126.560..126.560 rows=149949 loops=1)
         Index Cond: (tbx.c1 = 19)
         Buffers: shared read=414
 Planning time: 20.704 ms
 Execution time: 102147.641 ms
(11 rows)

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.75   24.33    0.00   74.81

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   37.60    0.00    10.88     0.00   592.43     2.02   53.25   53.25    0.00  26.52  99.70
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00   38.20    0.00    11.20     0.00   600.54     2.02   52.41   52.41    0.00  26.10  99.70
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

除了禁用，还有一个办法，就是将effective_io_concurrency设得更大，完全代替磁盘预读。这次效果更好，46秒就出结果了。

postgres=# set effective_io_concurrency=10;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbx  (cost=2797.28..246877.37 rows=149254 width=46) (actual time=171.869..46015.094 rows=149949 loops=1)
   Output: id, c1, pad
   Recheck Cond: (tbx.c1 = 19)
   Rows Removed by Index Recheck: 8417996
   Heap Blocks: exact=36636 lossy=79633
   Buffers: shared read=116683
   ->  Bitmap Index Scan on tbx_idx_c1  (cost=0.00..2759.97 rows=149254 width=0) (actual time=140.170..140.170 rows=149949 loops=1)
         Index Cond: (tbx.c1 = 19)
         Buffers: shared read=414
 Planning time: 17.042 ms
 Execution time: 46196.535 ms
(11 rows)

采用这个办法，IO请求大小还是16个扇区，但是队列深度也就是IO的并行度提高了。

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.10    0.00   15.99    8.70    0.00   74.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00 2879.00    0.00    22.49     0.00    16.00    10.72    3.74    3.74    0.00   0.35  99.60
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00 2877.40    0.00    22.48     0.00    16.00    10.74    3.74    3.74    0.00   0.35  99.68
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

再看看调用的API，fadvise64()和read()仍然是交替的，但fadvise64()会提前好几个周期就将相应的预读请求发给IO设备。

fadvise64(34, 505077760, 8192, POSIX_FADV_WILLNEED) = 0  //建议预读505077760
lseek(34, 504266752, SEEK_SET)          = 504266752
read(34, "\1\0\0\0\350GK\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505085952, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504299520, SEEK_SET)          = 504299520
read(34, "\1\0\0\0@\366K\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505094144, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504324096, SEEK_SET)          = 504324096
read(34, "\1\0\0\0\10yL\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505126912, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504348672, SEEK_SET)          = 504348672
read(34, "\1\0\0\0\320\373L\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505135104, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504373248, SEEK_SET)          = 504373248
read(34, "\1\0\0\0\230~M\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505143296, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0@\252M\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505159680, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504397824, SEEK_SET)          = 504397824
read(34, "\1\0\0\0x\1N\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505167872, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504496128, SEEK_SET)          = 504496128
read(34, "\1\0\0\0\230\fP\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505184256, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504545280, SEEK_SET)          = 504545280
read(34, "\1\0\0\0(\22Q\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505192448, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504561664, SEEK_SET)          = 504561664
read(34, "\1\0\0\0`iQ\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505208832, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504610816, SEEK_SET)          = 504610816
read(34, "\1\0\0\0\360nR\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505257984, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504627200, SEEK_SET)          = 504627200
read(34, "\1\0\0\0(\306R\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505307136, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504659968, SEEK_SET)          = 504659968
read(34, "\1\0\0\0\200tS\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505315328, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504676352, SEEK_SET)          = 504676352
read(34, "\1\0\0\0\270\313S\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505323520, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504692736, SEEK_SET)          = 504692736
read(34, "\1\0\0\0\360\"T\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505372672, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504717312, SEEK_SET)          = 504717312
read(34, "\1\0\0\0\270\245T\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505389056, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504741888, SEEK_SET)          = 504741888
read(34, "\1\0\0\0\200(U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505413632, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504758272, SEEK_SET)          = 504758272
read(34, "\1\0\0\0\240\177U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505430016, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504774656, SEEK_SET)          = 504774656
read(34, "\1\0\0\0\330\326U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505454592, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0\200\2V\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505462784, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0\20.V\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505479168, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504840192, SEEK_SET)          = 504840192
read(34, "\1\0\0\0\2403W\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505495552, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504864768, SEEK_SET)          = 504864768
read(34, "\1\0\0\0h\266W\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505503744, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504930304, SEEK_SET)          = 504930304
read(34, "\1\0\0\0000\23Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505552896, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0\300>Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505610240, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504971264, SEEK_SET)          = 504971264
read(34, "\1\0\0\0000\355Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505618432, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 505004032, SEEK_SET)          = 505004032
read(34, "\1\0\0\0\210\233Z\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505651200, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 505036800, SEEK_SET)          = 505036800
read(34, "\1\0\0\0\370I[\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505659392, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 505077760, SEEK_SET)          = 505077760  //开始读505077760
read(34, "\1\0\0\0\370#\\\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192

少量堆page扫描的场景

上面的场景需要扫描50%的堆page，下面看看只需扫描少量堆page的情况。

构造扫描1%堆page的查询，下面在不同effective_io_concurrency值的情况下bitmap index scan的执行结果。

postgres=# set effective_io_concurrency=0;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                            QUERY PLAN                                                            

----------------------------------------------------------------------------------------------------------------------------------
-
 Bitmap Heap Scan on public.tbx  (cost=2439.92..250781.21 rows=150000 width=46) (actual time=65.206..3200.739 rows=300028 loops=1)
   Output: id, c1, pad
   Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Heap Blocks: exact=2804
   Buffers: shared read=3626
   ->  Bitmap Index Scan on idx_ctid100  (cost=0.00..2402.42 rows=150000 width=0) (actual time=62.715..62.715 rows=300028 loops=1)
         Buffers: shared read=822
 Planning time: 27.399 ms
 Execution time: 3234.402 ms
(9 rows)

postgres=# set effective_io_concurrency=1;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                            QUERY PLAN                                                            

----------------------------------------------------------------------------------------------------------------------------------
-
 Bitmap Heap Scan on public.tbx  (cost=2439.92..250781.21 rows=150000 width=46) (actual time=62.173..3038.114 rows=300028 loops=1)
   Output: id, c1, pad
   Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Heap Blocks: exact=2804
   Buffers: shared read=3626
   ->  Bitmap Index Scan on idx_ctid100  (cost=0.00..2402.42 rows=150000 width=0) (actual time=59.942..59.942 rows=300028 loops=1)
         Buffers: shared read=822
 Planning time: 26.008 ms
 Execution time: 3071.279 ms
(9 rows)

postgres=# set effective_io_concurrency=10;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                            QUERY PLAN                                                            

----------------------------------------------------------------------------------------------------------------------------------
-
 Bitmap Heap Scan on public.tbx  (cost=2439.92..250781.21 rows=150000 width=46) (actual time=55.807..826.453 rows=300028 loops=1)
   Output: id, c1, pad
   Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Heap Blocks: exact=2804
   Buffers: shared read=3626
   ->  Bitmap Index Scan on idx_ctid100  (cost=0.00..2402.42 rows=150000 width=0) (actual time=53.253..53.253 rows=300028 loops=1)
         Buffers: shared read=822
 Planning time: 12.583 ms
 Execution time: 860.651 ms
(9 rows)

从这个结果看，effective_io_concurrency的值为0还是为1,性能差别不大。

下面看看其它扫描方式的执行结果。

postgres=# set enable_bitmapscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                                QUERY PLAN                                                        

----------------------------------------------------------------------------------------------------------------------------------
---------
 Index Scan using idx_ctid100 on public.tbx  (cost=0.42..481403.42 rows=150000 width=46) (actual time=6.604..3237.425 rows=300028 
loops=1)
   Output: id, c1, pad
   Buffers: shared read=3626
 Planning time: 15.162 ms
 Execution time: 3270.714 ms
(5 rows)

postgres=# set enable_bitmapscan=off;
SET
postgres=#  set enable_indexscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tbx  (cost=0.00..1405374.00 rows=150000 width=46) (actual time=1.459..129390.602 rows=300028 loops=1)
   Output: id, c1, pad
   Filter: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Rows Removed by Filter: 29699973
   Buffers: shared read=280374
 Planning time: 21.127 ms
 Execution time: 129461.814 ms
(7 rows)

从这儿看好像Index Scan总比bitmap index Scan快，其实这和测试数据有关。测试数据的导入的方式决定了从index里取出的堆元组已经是按page顺序排列好的，所以没有发挥出bitmap调整元组顺序的效果。

总结

PostgreSQL的预读依赖OS的磁盘预读和posix_fadvise()调用。(MySQL利用的是libaio，机制不同)
posix_fadvise()也就是effective_io_concurrency生效时，磁盘预读会失效，对于bitmap heap scan需要扫描大量位置相邻page的场景，性能不佳。
为优化bitmap heap scan的大量读IO，根据情况可以将effective_io_concurrency设为0或者设置为一个比较大的值。

补充

上面这个例子中性能优化效果看上去并不是很大，但在另一个环境中用相同的方法优化tpch Q6查询效果就更加明显了。优化前执行时间1050秒，effective_io_concurrency设为0，250秒。effective_io_concurrency设为100，116秒。

参考1：bitmap heap scan的代价估算

如果不考虑effective_io_concurrency使磁盘预读失效的性能下降，bitmap heap scan的代价估算还是蛮合理的。但是考虑这个因素的话，对于需要读取大量堆page的时候，比如20%以上随机分布的page，顺序读会更好。

src/backend/optimizer/path/costsize.c

void
cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                      ParamPathInfo *param_info,
                      Path *bitmapqual, double loop_count)
{
...

    else
    {
        /*
         * For a single scan, the number of heap pages that need to be fetched
         * is the same as the Mackert and Lohman formula for the case T = T)
        pages_fetched = T;
    else
        pages_fetched = ceil(pages_fetched);

    /*
     * For small numbers of pages we should charge spc_random_page_cost
     * apiece, while if nearly all the table's pages are being read, it's more
     * appropriate to charge spc_seq_page_cost apiece.  The effect is
     * nonlinear, too. For lack of a better idea, interpolate like this to
     * determine the cost per page.
     */
    if (pages_fetched >= 2.0)
        cost_per_page = spc_random_page_cost -
            (spc_random_page_cost - spc_seq_page_cost)
            * sqrt(pages_fetched / T);
    else
        cost_per_page = spc_random_page_cost;

    run_cost += pages_fetched * cost_per_page;
...
}

参考2

根据相关文章的解释，如果read不连续，将会使磁盘预读失效。

http://os.51cto.com/art/200910/159067.htm

1.顺序性检测

为了保证预读命中率，Linux只对顺序读(sequential read)进行预读。内核通过验证如下两个条件来判定一个read()是否顺序读：

◆这是文件被打开后的第一次读，并且读的是文件首部;

◆当前的读请求与前一(记录的)读请求在文件内的位置是连续的。

如果不满足上述顺序性条件，就判定为随机读。任何一个随机读都将终止当前的顺序序列，从而终止预读行为(而不是缩减预读大小)。

但是，实际检验的结果，通过特殊的索引和查询条件，构造一个完全跳跃式的read()序列，read()和lseek()交替出现，没有2个连续的read()，结果预读仍然有效。可见Linux对顺序读的判断并没有字面上那么简单。

postgres=# create index idx_ctid2 on tbx(ctid) where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 2=0;
CREATE INDEX
postgres=# set enable_bitmapscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 2=0;
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ctid2 on public.tbx  (cost=0.43..481399.43 rows=150000 width=46) (actual time=3.267..105636.573 rows=15000009 loops=1)
   Output: id, c1, pad
   Buffers: shared hit=4464 read=176709
 Planning time: 0.165 ms
 Execution time: 107196.582 ms
(5 rows)

参考资料

http://blog.chinaunix.net/uid-20726500-id-5747918.html
http://os.51cto.com/art/200910/159067.htm
http://blog.163.com/digoal@126/blog/static/163877040201421392811622/
http://mysql.taobao.org/monthly/2015/05/04/

调节effective_io_concurrenc优化PostgreSQL bitmap index scan性能

环境

测试

少量堆page扫描的场景

总结

补充

参考1：bitmap heap scan的代价估算

参考2

参考资料

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像