调节effective_io_concurrenc优化PostgreSQL bitmap index scan性能

本文涉及的产品
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
简介: 测试发现PostgreSQL在bitmap index scan时,如果要读入大量堆page,读IO的速度会远低于正常的顺序读,影响性能。 下面用一个例子说明这个问题。


测试发现PostgreSQL在bitmap index scan时,如果要读入大量堆page,读IO的速度会远低于正常的顺序读,影响性能。

下面用一个例子说明这个问题。

环境

  • 台式机上的CentOS7.1虚机
  • 消费级SSD
  • blockdev --setra设置为2048

测试

创建测试表

postgres=# create table tbx(id int,c1 int,pad text);
CREATE TABLE
postgres=# insert into tbx select id,(random()*200)::int,'sssssssssssssssssssssssssssssssssssss' from (select generate_series(1,30000000) id) tbx;
INSERT 0 30000000
postgres=# create index tbx_idx_c1 on tbx(c1);
CREATE INDEX
postgres=# analyze tbx;
ANALYZE 

每个page大概有107条记录,c1的范围是0~199,也就是大概每2个page中就有一个c1相同的记录,即按page算,c1的选择性差不多是50%。

postgres=# select ctid,id,c1 from tbx limit 10 offset 100;
  ctid   | id  | c1  
---------+-----+-----
 (0,101) | 101 | 106
 (0,102) | 102 |  96
 (0,103) | 103 | 178
 (0,104) | 104 |  48
 (0,105) | 105 | 119
 (0,106) | 106 |  24
 (0,107) | 107 | 189
 (1,1)   | 108 | 171
 (1,2)   | 109 | 196
 (1,3)   | 110 |  79
(10 rows) 

数据是2GB,shared_buffers是256MB。

postgres=# select relpages,reltuples from pg_class  where relname='tbx';
 relpages | reltuples 
----------+-----------
   280374 |     3e+07
(1 row)
postgres=# \d+
                             List of relations
 Schema |       Name       |   Type   |  Owner   |    Size    | Description 
--------+------------------+----------+----------+------------+-------------
 public | tbx              | table    | postgres | 2191 MB    | 
(1 rows)
postgres=# show shared_buffers;
 shared_buffers 
----------------
 256MB
(1 row) 

现在用c1作为条件,使用不同的执行计划进行查询,并且每次查询前都清一次OS缓存并重起PostgreSQL。

Bitmap index扫描145秒。

postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbx  (cost=2797.28..246877.37 rows=149254 width=46) (actual time=802.109..144443.423 rows=149949 loops=1)
   Output: id, c1, pad
   Recheck Cond: (tbx.c1 = 19)
   Rows Removed by Index Recheck: 8417996
   Heap Blocks: exact=36636 lossy=79633
   Buffers: shared read=116683
   ->  Bitmap Index Scan on tbx_idx_c1  (cost=0.00..2759.97 rows=149254 width=0) (actual time=758.825..758.825 rows=149949 loops=1)
         Index Cond: (tbx.c1 = 19)
         Buffers: shared read=414
 Planning time: 15.084 ms
 Execution time: 144733.946 ms
(11 rows) 

每个IO大小是16个扇区(8K),没有看到大的IO合并,IO队列深度也小于1,判断磁盘预读没有生效。

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.26    0.00   11.60   11.23    0.00   76.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  836.20    0.00     6.53     0.00    16.00     0.79    0.94    0.94    0.00   0.94  78.82
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00  836.20    0.00     6.53     0.00    16.00     0.79    0.94    0.94    0.00   0.94  78.94
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00 

顺序扫描74秒。

postgres=# set enable_bitmapscan=off;
SET
postgres=# set enable_indexscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tbx  (cost=0.00..655374.03 rows=149254 width=46) (actual time=2.201..74236.613 rows=149949 loops=1)
   Output: id, c1, pad
   Filter: (tbx.c1 = 19)
   Rows Removed by Filter: 29850052
   Buffers: shared read=280374
 Planning time: 18.273 ms
 Execution time: 74327.371 ms
(7 rows)

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.61    0.00    1.31   23.35    0.00   74.73

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.00   36.60    3.40    17.90     0.02   917.63     3.43   87.30   94.05   14.53  24.99  99.96
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00   36.80    3.20    18.00     0.01   922.51     3.41   86.76   93.55    8.69  24.99  99.96
dm-1              0.00     0.00    0.00    1.20     0.00     0.00     8.00     0.04   37.33    0.00   37.33  21.00   2.52
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00 

index扫描76秒

postgres=# set enable_bitmapscan=off;
SET
postgres=# set enable_indexscan=on;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using tbx_idx_c1 on public.tbx  (cost=0.56..475759.53 rows=149254 width=46) (actual time=4.159..75905.207 rows=149949 loops=1)
   Output: id, c1, pad
   Index Cond: (tbx.c1 = 19)
   Buffers: shared read=116683
 Planning time: 9.857 ms
 Execution time: 75954.629 ms
(6 rows)

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.86   24.05    0.00   74.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.80   84.80    1.00    16.46     0.01   393.08     1.87   21.64   21.73   13.80  11.58  99.32
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00   84.60    0.60    16.31     0.00   392.13     1.86   21.74   21.78   16.00  11.65  99.30
dm-1              0.00     0.00    0.00    1.20     0.00     0.00     8.00     0.01   12.33    0.00   12.33   6.17   0.74
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00 

通过上面的数据可以看出,顺序扫描和索引扫描都可以利用磁盘预读,但bitmap索引扫描不行。

微观上通过strace -p跟踪postgres后端进程看看索引扫描和bitmap索引扫描各自调用的API有什么区别。

以下是索引扫描调用的API片段

read(33, "\0\0\0\0\200\305\310\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
read(33, "\0\0\0\0\20\361\310\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
read(33, "\0\0\0\0\240\34\311\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204898304, SEEK_SET)          = 204898304
read(33, "\0\0\0\0H\"\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204922880, SEEK_SET)          = 204922880
read(33, "\0\0\0\0\20\245\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204939264, SEEK_SET)          = 204939264
read(33, "\0\0\0\0000\374\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
read(33, "\0\0\0\0\330'\313\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 204963840, SEEK_SET)          = 204963840
read(33, "\0\0\0\0\370~\313\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 205012992, SEEK_SET)          = 205012992
read(33, "\0\0\0\0\240\204\314\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
lseek(33, 205037568, SEEK_SET)          = 205037568 

bitmap索引扫描调用的API片段

read(33, "\0\0\0\0\3400S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273088512, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273088512, SEEK_SET)          = 273088512
read(33, "\0\0\0\0\250\263S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273096704, 8192, POSIX_FADV_WILLNEED) = 0
read(33, "\0\0\0\0008\337S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273154048, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273154048, SEEK_SET)          = 273154048
read(33, "\0\0\0\0p\20U\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273170432, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273170432, SEEK_SET)          = 273170432
read(33, "\0\0\0\0\250gU\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273211392, 8192, POSIX_FADV_WILLNEED) = 0
lseek(33, 273211392, SEEK_SET)          = 273211392
read(33, "\0\0\0\0\250AV\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273219584, 8192, POSIX_FADV_WILLNEED) = 0
read(33, "\0\0\0\0008mV\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(33, 273244160, 8192, POSIX_FADV_WILLNEED) = 0 

从上面的调用看,应该是fadvise64()使得磁盘预读失效。fadvise64()调用是由effective_io_concurrency参数控制的预读功能,effective_io_concurrency的默认值为1,它只对bitmap index scan有效。

下面将effective_io_concurrency禁用,发现性能有所提高,执行时间102秒,磁盘预读也生效了。

postgres=# set effective_io_concurrency=0;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbx  (cost=2797.28..246877.37 rows=149254 width=46) (actual time=148.673..102092.774 rows=149949 loops=1)
   Output: id, c1, pad
   Recheck Cond: (tbx.c1 = 19)
   Rows Removed by Index Recheck: 8417996
   Heap Blocks: exact=36636 lossy=79633
   Buffers: shared read=116683
   ->  Bitmap Index Scan on tbx_idx_c1  (cost=0.00..2759.97 rows=149254 width=0) (actual time=126.560..126.560 rows=149949 loops=1)
         Index Cond: (tbx.c1 = 19)
         Buffers: shared read=414
 Planning time: 20.704 ms
 Execution time: 102147.641 ms
(11 rows)

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.00    0.75   24.33    0.00   74.81

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   37.60    0.00    10.88     0.00   592.43     2.02   53.25   53.25    0.00  26.52  99.70
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00   38.20    0.00    11.20     0.00   600.54     2.02   52.41   52.41    0.00  26.10  99.70
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00 

除了禁用,还有一个办法,就是将effective_io_concurrency设得更大,完全代替磁盘预读。这次效果更好,46秒就出结果了。

postgres=# set effective_io_concurrency=10;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.tbx  (cost=2797.28..246877.37 rows=149254 width=46) (actual time=171.869..46015.094 rows=149949 loops=1)
   Output: id, c1, pad
   Recheck Cond: (tbx.c1 = 19)
   Rows Removed by Index Recheck: 8417996
   Heap Blocks: exact=36636 lossy=79633
   Buffers: shared read=116683
   ->  Bitmap Index Scan on tbx_idx_c1  (cost=0.00..2759.97 rows=149254 width=0) (actual time=140.170..140.170 rows=149949 loops=1)
         Index Cond: (tbx.c1 = 19)
         Buffers: shared read=414
 Planning time: 17.042 ms
 Execution time: 46196.535 ms
(11 rows) 

采用这个办法,IO请求大小还是16个扇区,但是队列深度也就是IO的并行度提高了。

[root@node2 ~]# iostat -xm 5 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.10    0.00   15.99    8.70    0.00   74.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00 2879.00    0.00    22.49     0.00    16.00    10.72    3.74    3.74    0.00   0.35  99.60
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00 2877.40    0.00    22.48     0.00    16.00    10.74    3.74    3.74    0.00   0.35  99.68
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00 

再看看调用的API,fadvise64()和read()仍然是交替的,但fadvise64()会提前好几个周期就将相应的预读请求发给IO设备。

fadvise64(34, 505077760, 8192, POSIX_FADV_WILLNEED) = 0  //建议预读505077760
lseek(34, 504266752, SEEK_SET)          = 504266752
read(34, "\1\0\0\0\350GK\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505085952, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504299520, SEEK_SET)          = 504299520
read(34, "\1\0\0\0@\366K\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505094144, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504324096, SEEK_SET)          = 504324096
read(34, "\1\0\0\0\10yL\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505126912, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504348672, SEEK_SET)          = 504348672
read(34, "\1\0\0\0\320\373L\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505135104, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504373248, SEEK_SET)          = 504373248
read(34, "\1\0\0\0\230~M\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505143296, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0@\252M\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505159680, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504397824, SEEK_SET)          = 504397824
read(34, "\1\0\0\0x\1N\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505167872, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504496128, SEEK_SET)          = 504496128
read(34, "\1\0\0\0\230\fP\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505184256, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504545280, SEEK_SET)          = 504545280
read(34, "\1\0\0\0(\22Q\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505192448, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504561664, SEEK_SET)          = 504561664
read(34, "\1\0\0\0`iQ\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505208832, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504610816, SEEK_SET)          = 504610816
read(34, "\1\0\0\0\360nR\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505257984, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504627200, SEEK_SET)          = 504627200
read(34, "\1\0\0\0(\306R\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505307136, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504659968, SEEK_SET)          = 504659968
read(34, "\1\0\0\0\200tS\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505315328, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504676352, SEEK_SET)          = 504676352
read(34, "\1\0\0\0\270\313S\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505323520, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504692736, SEEK_SET)          = 504692736
read(34, "\1\0\0\0\360\"T\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505372672, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504717312, SEEK_SET)          = 504717312
read(34, "\1\0\0\0\270\245T\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505389056, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504741888, SEEK_SET)          = 504741888
read(34, "\1\0\0\0\200(U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505413632, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504758272, SEEK_SET)          = 504758272
read(34, "\1\0\0\0\240\177U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505430016, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504774656, SEEK_SET)          = 504774656
read(34, "\1\0\0\0\330\326U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505454592, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0\200\2V\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505462784, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0\20.V\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505479168, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504840192, SEEK_SET)          = 504840192
read(34, "\1\0\0\0\2403W\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505495552, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504864768, SEEK_SET)          = 504864768
read(34, "\1\0\0\0h\266W\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505503744, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504930304, SEEK_SET)          = 504930304
read(34, "\1\0\0\0000\23Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505552896, 8192, POSIX_FADV_WILLNEED) = 0
read(34, "\1\0\0\0\300>Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505610240, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 504971264, SEEK_SET)          = 504971264
read(34, "\1\0\0\0000\355Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505618432, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 505004032, SEEK_SET)          = 505004032
read(34, "\1\0\0\0\210\233Z\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505651200, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 505036800, SEEK_SET)          = 505036800
read(34, "\1\0\0\0\370I[\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
fadvise64(34, 505659392, 8192, POSIX_FADV_WILLNEED) = 0
lseek(34, 505077760, SEEK_SET)          = 505077760  //开始读505077760
read(34, "\1\0\0\0\370#\\\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 

少量堆page扫描的场景

上面的场景需要扫描50%的堆page,下面看看只需扫描少量堆page的情况。

构造扫描1%堆page的查询,下面在不同effective_io_concurrency值的情况下bitmap index scan的执行结果。

postgres=# set effective_io_concurrency=0;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                            QUERY PLAN                                                            

----------------------------------------------------------------------------------------------------------------------------------
-
 Bitmap Heap Scan on public.tbx  (cost=2439.92..250781.21 rows=150000 width=46) (actual time=65.206..3200.739 rows=300028 loops=1)
   Output: id, c1, pad
   Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Heap Blocks: exact=2804
   Buffers: shared read=3626
   ->  Bitmap Index Scan on idx_ctid100  (cost=0.00..2402.42 rows=150000 width=0) (actual time=62.715..62.715 rows=300028 loops=1)
         Buffers: shared read=822
 Planning time: 27.399 ms
 Execution time: 3234.402 ms
(9 rows)

postgres=# set effective_io_concurrency=1;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                            QUERY PLAN                                                            

----------------------------------------------------------------------------------------------------------------------------------
-
 Bitmap Heap Scan on public.tbx  (cost=2439.92..250781.21 rows=150000 width=46) (actual time=62.173..3038.114 rows=300028 loops=1)
   Output: id, c1, pad
   Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Heap Blocks: exact=2804
   Buffers: shared read=3626
   ->  Bitmap Index Scan on idx_ctid100  (cost=0.00..2402.42 rows=150000 width=0) (actual time=59.942..59.942 rows=300028 loops=1)
         Buffers: shared read=822
 Planning time: 26.008 ms
 Execution time: 3071.279 ms
(9 rows)

postgres=# set effective_io_concurrency=10;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                            QUERY PLAN                                                            

----------------------------------------------------------------------------------------------------------------------------------
-
 Bitmap Heap Scan on public.tbx  (cost=2439.92..250781.21 rows=150000 width=46) (actual time=55.807..826.453 rows=300028 loops=1)
   Output: id, c1, pad
   Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Heap Blocks: exact=2804
   Buffers: shared read=3626
   ->  Bitmap Index Scan on idx_ctid100  (cost=0.00..2402.42 rows=150000 width=0) (actual time=53.253..53.253 rows=300028 loops=1)
         Buffers: shared read=822
 Planning time: 12.583 ms
 Execution time: 860.651 ms
(9 rows) 

从这个结果看,effective_io_concurrency的值为0还是为1,性能差别不大。

下面看看其它扫描方式的执行结果。

postgres=# set enable_bitmapscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                                QUERY PLAN                                                        

----------------------------------------------------------------------------------------------------------------------------------
---------
 Index Scan using idx_ctid100 on public.tbx  (cost=0.42..481403.42 rows=150000 width=46) (actual time=6.604..3237.425 rows=300028 
loops=1)
   Output: id, c1, pad
   Buffers: shared read=3626
 Planning time: 15.162 ms
 Execution time: 3270.714 ms
(5 rows)

postgres=# set enable_bitmapscan=off;
SET
postgres=#  set enable_indexscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.tbx  (cost=0.00..1405374.00 rows=150000 width=46) (actual time=1.459..129390.602 rows=300028 loops=1)
   Output: id, c1, pad
   Filter: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0)
   Rows Removed by Filter: 29699973
   Buffers: shared read=280374
 Planning time: 21.127 ms
 Execution time: 129461.814 ms
(7 rows) 

从这儿看好像Index Scan总比bitmap index Scan快,其实这和测试数据有关。测试数据的导入的方式决定了从index里取出的堆元组已经是按page顺序排列好的,所以没有发挥出bitmap调整元组顺序的效果。

总结

  1. PostgreSQL的预读依赖OS的磁盘预读和posix_fadvise()调用。(MySQL利用的是libaio,机制不同)
  2. posix_fadvise()也就是effective_io_concurrency生效时,磁盘预读会失效,对于bitmap heap scan需要扫描大量位置相邻page的场景,性能不佳。
  3. 为优化bitmap heap scan的大量读IO根据情况可以将effective_io_concurrency设为0或者设置为一个比较大的值。

补充

上面这个例子中性能优化效果看上去并不是很大,但在另一个环境中用相同的方法优化tpch Q6查询效果就更加明显了。优化前执行时间1050秒,effective_io_concurrency设为0,250秒。effective_io_concurrency设为100,116秒。

参考1:bitmap heap scan的代价估算

如果不考虑effective_io_concurrency使磁盘预读失效的性能下降,bitmap heap scan的代价估算还是蛮合理的。 但是考虑这个因素的话,对于需要读取大量堆page的时候,比如20%以上随机分布的page,顺序读会更好。

src/backend/optimizer/path/costsize.c

void
cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                      ParamPathInfo *param_info,
                      Path *bitmapqual, double loop_count)
{
...

    else
    {
        /*
         * For a single scan, the number of heap pages that need to be fetched
         * is the same as the Mackert and Lohman formula for the case T = T)
        pages_fetched = T;
    else
        pages_fetched = ceil(pages_fetched);

    /*
     * For small numbers of pages we should charge spc_random_page_cost
     * apiece, while if nearly all the table's pages are being read, it's more
     * appropriate to charge spc_seq_page_cost apiece.  The effect is
     * nonlinear, too. For lack of a better idea, interpolate like this to
     * determine the cost per page.
     */
    if (pages_fetched >= 2.0)
        cost_per_page = spc_random_page_cost -
            (spc_random_page_cost - spc_seq_page_cost)
            * sqrt(pages_fetched / T);
    else
        cost_per_page = spc_random_page_cost;

    run_cost += pages_fetched * cost_per_page;
...
} 

参考2

根据相关文章的解释,如果read不连续,将会使磁盘预读失效。

http://os.51cto.com/art/200910/159067.htm

1.顺序性检测

为了保证预读命中率,Linux只对顺序读(sequential read)进行预读。内核通过验证如下两个条件来判定一个read()是否顺序读:

◆这是文件被打开后的第一次读,并且读的是文件首部;

◆当前的读请求与前一(记录的)读请求在文件内的位置是连续的。

如果不满足上述顺序性条件,就判定为随机读。任何一个随机读都将终止当前的顺序序列,从而终止预读行为(而不是缩减预读大小)。 

但是,实际检验的结果,通过特殊的索引和查询条件,构造一个完全跳跃式的read()序列,read()和lseek()交替出现,没有2个连续的read(),结果预读仍然有效。可见Linux对顺序读的判断并没有字面上那么简单。

postgres=# create index idx_ctid2 on tbx(ctid) where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 2=0;
CREATE INDEX
postgres=# set enable_bitmapscan=off;
SET
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx  where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 2=0;
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_ctid2 on public.tbx  (cost=0.43..481399.43 rows=150000 width=46) (actual time=3.267..105636.573 rows=15000009 loops=1)
   Output: id, c1, pad
   Buffers: shared hit=4464 read=176709
 Planning time: 0.165 ms
 Execution time: 107196.582 ms
(5 rows) 

参考资料

  • http://blog.chinaunix.net/uid-20726500-id-5747918.html
  • http://os.51cto.com/art/200910/159067.htm
  • http://blog.163.com/digoal@126/blog/static/163877040201421392811622/
  • http://mysql.taobao.org/monthly/2015/05/04/


相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
相关文章
|
6天前
|
缓存 监控 IDE
linux如何查看io性能
linux如何查看io性能
|
6天前
|
关系型数据库 分布式数据库 数据库
关系型数据库性能需求
【5月更文挑战第7天】关系型数据库性能需求
24 5
关系型数据库性能需求
|
6天前
|
SQL 存储 关系型数据库
性能诊断工具DBdoctor如何快速纳管数据库PolarDB-X
DBdoctor是一款基于eBPF技术的数据库性能诊断工具,已通过阿里云PolarDB分布式版(V2.3)认证。PolarDB-X是阿里云的高性能云原生分布式数据库,采用Shared-nothing和存储计算分离架构,支持高可用、水平扩展和低成本存储。PolarDB-X V2.3.0在读写混合场景下对比开源MySQL有30-40%的性能提升。DBdoctor能按MySQL方式纳管PolarDB-X的DN节点,提供性能洞察和诊断。用户可通过指定步骤安装PolarDB-X和DBdoctor,实现数据库的管理和性能监控。
125 0
|
6天前
|
存储 关系型数据库 MySQL
MySQL的优化利器⭐️Multi Range Read与Covering Index是如何优化回表的?
本文以小白的视角使用通俗易懂的流程图深入浅出分析Multi Range Read与Covering Index是如何优化回表
|
6天前
|
存储 关系型数据库 分布式数据库
数据库索引回表困难?揭秘PolarDB存储引擎优化技术
PolarDB分布式版存储引擎采用CSM方案均衡资源开销与可用性。
数据库索引回表困难?揭秘PolarDB存储引擎优化技术
|
6天前
|
存储 缓存 关系型数据库
关系型数据库数据库表设计的优化
您可以优化关系型数据库的表设计,提高数据库的性能、可维护性和可扩展性。但请注意,每个数据库和应用程序都有其独特的需求和挑战,因此在实际应用中需要根据具体情况进行调整和优化。
18 4
|
6天前
|
缓存 监控 关系型数据库
关系型数据库优化查询语句
记住每个数据库和查询都是独特的,所以最好的优化策略通常是通过测试和分析来确定的。在进行任何大的更改之前,始终备份你的数据并在测试环境中验证更改的效果。
25 5
|
6天前
|
存储 SQL 关系型数据库
关系型数据库存储优化
关系型数据库存储优化
27 1
|
6天前
|
SQL 关系型数据库 MySQL
mysql查询语句的访问方法const、ref、ref_or_null、range、index、all
mysql查询语句的访问方法const、ref、ref_or_null、range、index、all
|
6天前
|
存储 JSON 关系型数据库
PostgreSQL Json应用场景介绍和Shared Detoast优化
PostgreSQL Json应用场景介绍和Shared Detoast优化

热门文章

最新文章