测试发现PostgreSQL在bitmap index scan时,如果要读入大量堆page,读IO的速度会远低于正常的顺序读,影响性能。
下面用一个例子说明这个问题。
环境
- 台式机上的CentOS7.1虚机
- 消费级SSD
- blockdev --setra设置为2048
测试
创建测试表
postgres=# create table tbx(id int,c1 int,pad text); CREATE TABLE postgres=# insert into tbx select id,(random()*200)::int,'sssssssssssssssssssssssssssssssssssss' from (select generate_series(1,30000000) id) tbx; INSERT 0 30000000 postgres=# create index tbx_idx_c1 on tbx(c1); CREATE INDEX postgres=# analyze tbx; ANALYZE
每个page大概有107条记录,c1的范围是0~199,也就是大概每2个page中就有一个c1相同的记录,即按page算,c1的选择性差不多是50%。
postgres=# select ctid,id,c1 from tbx limit 10 offset 100; ctid | id | c1 ---------+-----+----- (0,101) | 101 | 106 (0,102) | 102 | 96 (0,103) | 103 | 178 (0,104) | 104 | 48 (0,105) | 105 | 119 (0,106) | 106 | 24 (0,107) | 107 | 189 (1,1) | 108 | 171 (1,2) | 109 | 196 (1,3) | 110 | 79 (10 rows)
数据是2GB,shared_buffers是256MB。
postgres=# select relpages,reltuples from pg_class where relname='tbx'; relpages | reltuples ----------+----------- 280374 | 3e+07 (1 row) postgres=# \d+ List of relations Schema | Name | Type | Owner | Size | Description --------+------------------+----------+----------+------------+------------- public | tbx | table | postgres | 2191 MB | (1 rows) postgres=# show shared_buffers; shared_buffers ---------------- 256MB (1 row)
现在用c1作为条件,使用不同的执行计划进行查询,并且每次查询前都清一次OS缓存并重起PostgreSQL。
Bitmap index扫描145秒。
postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.tbx (cost=2797.28..246877.37 rows=149254 width=46) (actual time=802.109..144443.423 rows=149949 loops=1) Output: id, c1, pad Recheck Cond: (tbx.c1 = 19) Rows Removed by Index Recheck: 8417996 Heap Blocks: exact=36636 lossy=79633 Buffers: shared read=116683 -> Bitmap Index Scan on tbx_idx_c1 (cost=0.00..2759.97 rows=149254 width=0) (actual time=758.825..758.825 rows=149949 loops=1) Index Cond: (tbx.c1 = 19) Buffers: shared read=414 Planning time: 15.084 ms Execution time: 144733.946 ms (11 rows)
每个IO大小是16个扇区(8K),没有看到大的IO合并,IO队列深度也小于1,判断磁盘预读没有生效。
[root@node2 ~]# iostat -xm 5 10 avg-cpu: %user %nice %system %iowait %steal %idle 0.26 0.00 11.60 11.23 0.00 76.90 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 836.20 0.00 6.53 0.00 16.00 0.79 0.94 0.94 0.00 0.94 78.82 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 836.20 0.00 6.53 0.00 16.00 0.79 0.94 0.94 0.00 0.94 78.94 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
顺序扫描74秒。
postgres=# set enable_bitmapscan=off; SET postgres=# set enable_indexscan=off; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Seq Scan on public.tbx (cost=0.00..655374.03 rows=149254 width=46) (actual time=2.201..74236.613 rows=149949 loops=1) Output: id, c1, pad Filter: (tbx.c1 = 19) Rows Removed by Filter: 29850052 Buffers: shared read=280374 Planning time: 18.273 ms Execution time: 74327.371 ms (7 rows) [root@node2 ~]# iostat -xm 5 10 avg-cpu: %user %nice %system %iowait %steal %idle 0.61 0.00 1.31 23.35 0.00 74.73 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 1.00 36.60 3.40 17.90 0.02 917.63 3.43 87.30 94.05 14.53 24.99 99.96 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 36.80 3.20 18.00 0.01 922.51 3.41 86.76 93.55 8.69 24.99 99.96 dm-1 0.00 0.00 0.00 1.20 0.00 0.00 8.00 0.04 37.33 0.00 37.33 21.00 2.52 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
index扫描76秒
postgres=# set enable_bitmapscan=off; SET postgres=# set enable_indexscan=on; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using tbx_idx_c1 on public.tbx (cost=0.56..475759.53 rows=149254 width=46) (actual time=4.159..75905.207 rows=149949 loops=1) Output: id, c1, pad Index Cond: (tbx.c1 = 19) Buffers: shared read=116683 Planning time: 9.857 ms Execution time: 75954.629 ms (6 rows) [root@node2 ~]# iostat -xm 5 10 avg-cpu: %user %nice %system %iowait %steal %idle 0.10 0.00 0.86 24.05 0.00 74.99 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.80 84.80 1.00 16.46 0.01 393.08 1.87 21.64 21.73 13.80 11.58 99.32 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 84.60 0.60 16.31 0.00 392.13 1.86 21.74 21.78 16.00 11.65 99.30 dm-1 0.00 0.00 0.00 1.20 0.00 0.00 8.00 0.01 12.33 0.00 12.33 6.17 0.74 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
通过上面的数据可以看出,顺序扫描和索引扫描都可以利用磁盘预读,但bitmap索引扫描不行。
微观上通过strace -p跟踪postgres后端进程看看索引扫描和bitmap索引扫描各自调用的API有什么区别。
以下是索引扫描调用的API片段
read(33, "\0\0\0\0\200\305\310\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 read(33, "\0\0\0\0\20\361\310\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 read(33, "\0\0\0\0\240\34\311\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 lseek(33, 204898304, SEEK_SET) = 204898304 read(33, "\0\0\0\0H\"\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 lseek(33, 204922880, SEEK_SET) = 204922880 read(33, "\0\0\0\0\20\245\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 lseek(33, 204939264, SEEK_SET) = 204939264 read(33, "\0\0\0\0000\374\312\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 read(33, "\0\0\0\0\330'\313\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 lseek(33, 204963840, SEEK_SET) = 204963840 read(33, "\0\0\0\0\370~\313\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 lseek(33, 205012992, SEEK_SET) = 205012992 read(33, "\0\0\0\0\240\204\314\226\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 lseek(33, 205037568, SEEK_SET) = 205037568
bitmap索引扫描调用的API片段
read(33, "\0\0\0\0\3400S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273088512, 8192, POSIX_FADV_WILLNEED) = 0 lseek(33, 273088512, SEEK_SET) = 273088512 read(33, "\0\0\0\0\250\263S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273096704, 8192, POSIX_FADV_WILLNEED) = 0 read(33, "\0\0\0\0008\337S\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273154048, 8192, POSIX_FADV_WILLNEED) = 0 lseek(33, 273154048, SEEK_SET) = 273154048 read(33, "\0\0\0\0p\20U\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273170432, 8192, POSIX_FADV_WILLNEED) = 0 lseek(33, 273170432, SEEK_SET) = 273170432 read(33, "\0\0\0\0\250gU\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273211392, 8192, POSIX_FADV_WILLNEED) = 0 lseek(33, 273211392, SEEK_SET) = 273211392 read(33, "\0\0\0\0\250AV\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273219584, 8192, POSIX_FADV_WILLNEED) = 0 read(33, "\0\0\0\0008mV\234\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(33, 273244160, 8192, POSIX_FADV_WILLNEED) = 0
从上面的调用看,应该是fadvise64()使得磁盘预读失效。fadvise64()调用是由effective_io_concurrency参数控制的预读功能,effective_io_concurrency的默认值为1,它只对bitmap index scan有效。
下面将effective_io_concurrency禁用,发现性能有所提高,执行时间102秒,磁盘预读也生效了。
postgres=# set effective_io_concurrency=0; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.tbx (cost=2797.28..246877.37 rows=149254 width=46) (actual time=148.673..102092.774 rows=149949 loops=1) Output: id, c1, pad Recheck Cond: (tbx.c1 = 19) Rows Removed by Index Recheck: 8417996 Heap Blocks: exact=36636 lossy=79633 Buffers: shared read=116683 -> Bitmap Index Scan on tbx_idx_c1 (cost=0.00..2759.97 rows=149254 width=0) (actual time=126.560..126.560 rows=149949 loops=1) Index Cond: (tbx.c1 = 19) Buffers: shared read=414 Planning time: 20.704 ms Execution time: 102147.641 ms (11 rows) [root@node2 ~]# iostat -xm 5 10 avg-cpu: %user %nice %system %iowait %steal %idle 0.10 0.00 0.75 24.33 0.00 74.81 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 37.60 0.00 10.88 0.00 592.43 2.02 53.25 53.25 0.00 26.52 99.70 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 38.20 0.00 11.20 0.00 600.54 2.02 52.41 52.41 0.00 26.10 99.70 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
除了禁用,还有一个办法,就是将effective_io_concurrency设得更大,完全代替磁盘预读。这次效果更好,46秒就出结果了。
postgres=# set effective_io_concurrency=10; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where c1=19; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.tbx (cost=2797.28..246877.37 rows=149254 width=46) (actual time=171.869..46015.094 rows=149949 loops=1) Output: id, c1, pad Recheck Cond: (tbx.c1 = 19) Rows Removed by Index Recheck: 8417996 Heap Blocks: exact=36636 lossy=79633 Buffers: shared read=116683 -> Bitmap Index Scan on tbx_idx_c1 (cost=0.00..2759.97 rows=149254 width=0) (actual time=140.170..140.170 rows=149949 loops=1) Index Cond: (tbx.c1 = 19) Buffers: shared read=414 Planning time: 17.042 ms Execution time: 46196.535 ms (11 rows)
采用这个办法,IO请求大小还是16个扇区,但是队列深度也就是IO的并行度提高了。
[root@node2 ~]# iostat -xm 5 10 avg-cpu: %user %nice %system %iowait %steal %idle 1.10 0.00 15.99 8.70 0.00 74.20 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 0.00 2879.00 0.00 22.49 0.00 16.00 10.72 3.74 3.74 0.00 0.35 99.60 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 2877.40 0.00 22.48 0.00 16.00 10.74 3.74 3.74 0.00 0.35 99.68 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
再看看调用的API,fadvise64()和read()仍然是交替的,但fadvise64()会提前好几个周期就将相应的预读请求发给IO设备。
fadvise64(34, 505077760, 8192, POSIX_FADV_WILLNEED) = 0 //建议预读505077760 lseek(34, 504266752, SEEK_SET) = 504266752 read(34, "\1\0\0\0\350GK\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505085952, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504299520, SEEK_SET) = 504299520 read(34, "\1\0\0\0@\366K\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505094144, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504324096, SEEK_SET) = 504324096 read(34, "\1\0\0\0\10yL\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505126912, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504348672, SEEK_SET) = 504348672 read(34, "\1\0\0\0\320\373L\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505135104, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504373248, SEEK_SET) = 504373248 read(34, "\1\0\0\0\230~M\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505143296, 8192, POSIX_FADV_WILLNEED) = 0 read(34, "\1\0\0\0@\252M\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505159680, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504397824, SEEK_SET) = 504397824 read(34, "\1\0\0\0x\1N\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505167872, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504496128, SEEK_SET) = 504496128 read(34, "\1\0\0\0\230\fP\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505184256, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504545280, SEEK_SET) = 504545280 read(34, "\1\0\0\0(\22Q\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505192448, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504561664, SEEK_SET) = 504561664 read(34, "\1\0\0\0`iQ\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505208832, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504610816, SEEK_SET) = 504610816 read(34, "\1\0\0\0\360nR\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505257984, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504627200, SEEK_SET) = 504627200 read(34, "\1\0\0\0(\306R\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505307136, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504659968, SEEK_SET) = 504659968 read(34, "\1\0\0\0\200tS\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505315328, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504676352, SEEK_SET) = 504676352 read(34, "\1\0\0\0\270\313S\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505323520, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504692736, SEEK_SET) = 504692736 read(34, "\1\0\0\0\360\"T\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505372672, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504717312, SEEK_SET) = 504717312 read(34, "\1\0\0\0\270\245T\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505389056, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504741888, SEEK_SET) = 504741888 read(34, "\1\0\0\0\200(U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505413632, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504758272, SEEK_SET) = 504758272 read(34, "\1\0\0\0\240\177U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505430016, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504774656, SEEK_SET) = 504774656 read(34, "\1\0\0\0\330\326U\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505454592, 8192, POSIX_FADV_WILLNEED) = 0 read(34, "\1\0\0\0\200\2V\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505462784, 8192, POSIX_FADV_WILLNEED) = 0 read(34, "\1\0\0\0\20.V\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505479168, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504840192, SEEK_SET) = 504840192 read(34, "\1\0\0\0\2403W\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505495552, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504864768, SEEK_SET) = 504864768 read(34, "\1\0\0\0h\266W\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505503744, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504930304, SEEK_SET) = 504930304 read(34, "\1\0\0\0000\23Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505552896, 8192, POSIX_FADV_WILLNEED) = 0 read(34, "\1\0\0\0\300>Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505610240, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 504971264, SEEK_SET) = 504971264 read(34, "\1\0\0\0000\355Y\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505618432, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 505004032, SEEK_SET) = 505004032 read(34, "\1\0\0\0\210\233Z\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505651200, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 505036800, SEEK_SET) = 505036800 read(34, "\1\0\0\0\370I[\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192 fadvise64(34, 505659392, 8192, POSIX_FADV_WILLNEED) = 0 lseek(34, 505077760, SEEK_SET) = 505077760 //开始读505077760 read(34, "\1\0\0\0\370#\\\6\0\0\0\0\304\1\350\1\0 \4 \0\0\0\0\270\237\214\0p\237\214\0"..., 8192) = 8192
少量堆page扫描的场景
上面的场景需要扫描50%的堆page,下面看看只需扫描少量堆page的情况。
构造扫描1%堆page的查询,下面在不同effective_io_concurrency值的情况下bitmap index scan的执行结果。
postgres=# set effective_io_concurrency=0; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- - Bitmap Heap Scan on public.tbx (cost=2439.92..250781.21 rows=150000 width=46) (actual time=65.206..3200.739 rows=300028 loops=1) Output: id, c1, pad Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0) Heap Blocks: exact=2804 Buffers: shared read=3626 -> Bitmap Index Scan on idx_ctid100 (cost=0.00..2402.42 rows=150000 width=0) (actual time=62.715..62.715 rows=300028 loops=1) Buffers: shared read=822 Planning time: 27.399 ms Execution time: 3234.402 ms (9 rows) postgres=# set effective_io_concurrency=1; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- - Bitmap Heap Scan on public.tbx (cost=2439.92..250781.21 rows=150000 width=46) (actual time=62.173..3038.114 rows=300028 loops=1) Output: id, c1, pad Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0) Heap Blocks: exact=2804 Buffers: shared read=3626 -> Bitmap Index Scan on idx_ctid100 (cost=0.00..2402.42 rows=150000 width=0) (actual time=59.942..59.942 rows=300028 loops=1) Buffers: shared read=822 Planning time: 26.008 ms Execution time: 3071.279 ms (9 rows) postgres=# set effective_io_concurrency=10; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- - Bitmap Heap Scan on public.tbx (cost=2439.92..250781.21 rows=150000 width=46) (actual time=55.807..826.453 rows=300028 loops=1) Output: id, c1, pad Recheck Cond: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0) Heap Blocks: exact=2804 Buffers: shared read=3626 -> Bitmap Index Scan on idx_ctid100 (cost=0.00..2402.42 rows=150000 width=0) (actual time=53.253..53.253 rows=300028 loops=1) Buffers: shared read=822 Planning time: 12.583 ms Execution time: 860.651 ms (9 rows)
从这个结果看,effective_io_concurrency的值为0还是为1,性能差别不大。
下面看看其它扫描方式的执行结果。
postgres=# set enable_bitmapscan=off; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- --------- Index Scan using idx_ctid100 on public.tbx (cost=0.42..481403.42 rows=150000 width=46) (actual time=6.604..3237.425 rows=300028 loops=1) Output: id, c1, pad Buffers: shared read=3626 Planning time: 15.162 ms Execution time: 3270.714 ms (5 rows) postgres=# set enable_bitmapscan=off; SET postgres=# set enable_indexscan=off; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 100=0; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Seq Scan on public.tbx (cost=0.00..1405374.00 rows=150000 width=46) (actual time=1.459..129390.602 rows=300028 loops=1) Output: id, c1, pad Filter: ((("substring"((tbx.ctid)::text, 2, ("position"((tbx.ctid)::text, ','::text) - 2)))::integer % 100) = 0) Rows Removed by Filter: 29699973 Buffers: shared read=280374 Planning time: 21.127 ms Execution time: 129461.814 ms (7 rows)
从这儿看好像Index Scan总比bitmap index Scan快,其实这和测试数据有关。测试数据的导入的方式决定了从index里取出的堆元组已经是按page顺序排列好的,所以没有发挥出bitmap调整元组顺序的效果。
总结
- PostgreSQL的预读依赖OS的磁盘预读和posix_fadvise()调用。(MySQL利用的是libaio,机制不同)
- posix_fadvise()也就是effective_io_concurrency生效时,磁盘预读会失效,对于bitmap heap scan需要扫描大量位置相邻page的场景,性能不佳。
- 为优化bitmap heap scan的大量读IO,根据情况可以将effective_io_concurrency设为0或者设置为一个比较大的值。
补充
上面这个例子中性能优化效果看上去并不是很大,但在另一个环境中用相同的方法优化tpch Q6查询效果就更加明显了。优化前执行时间1050秒,effective_io_concurrency设为0,250秒。effective_io_concurrency设为100,116秒。
参考1:bitmap heap scan的代价估算
如果不考虑effective_io_concurrency使磁盘预读失效的性能下降,bitmap heap scan的代价估算还是蛮合理的。 但是考虑这个因素的话,对于需要读取大量堆page的时候,比如20%以上随机分布的page,顺序读会更好。
src/backend/optimizer/path/costsize.c void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel, ParamPathInfo *param_info, Path *bitmapqual, double loop_count) { ... else { /* * For a single scan, the number of heap pages that need to be fetched * is the same as the Mackert and Lohman formula for the case T = T) pages_fetched = T; else pages_fetched = ceil(pages_fetched); /* * For small numbers of pages we should charge spc_random_page_cost * apiece, while if nearly all the table's pages are being read, it's more * appropriate to charge spc_seq_page_cost apiece. The effect is * nonlinear, too. For lack of a better idea, interpolate like this to * determine the cost per page. */ if (pages_fetched >= 2.0) cost_per_page = spc_random_page_cost - (spc_random_page_cost - spc_seq_page_cost) * sqrt(pages_fetched / T); else cost_per_page = spc_random_page_cost; run_cost += pages_fetched * cost_per_page; ... }
参考2
根据相关文章的解释,如果read不连续,将会使磁盘预读失效。
http://os.51cto.com/art/200910/159067.htm
1.顺序性检测 为了保证预读命中率,Linux只对顺序读(sequential read)进行预读。内核通过验证如下两个条件来判定一个read()是否顺序读: ◆这是文件被打开后的第一次读,并且读的是文件首部; ◆当前的读请求与前一(记录的)读请求在文件内的位置是连续的。 如果不满足上述顺序性条件,就判定为随机读。任何一个随机读都将终止当前的顺序序列,从而终止预读行为(而不是缩减预读大小)。
但是,实际检验的结果,通过特殊的索引和查询条件,构造一个完全跳跃式的read()序列,read()和lseek()交替出现,没有2个连续的read(),结果预读仍然有效。可见Linux对顺序读的判断并没有字面上那么简单。
postgres=# create index idx_ctid2 on tbx(ctid) where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 2=0; CREATE INDEX postgres=# set enable_bitmapscan=off; SET postgres=# explain(ANALYZE,VERBOSE,COSTS,BUFFERS,TIMING) select * from tbx where substring(ctid::text from 2 for position(',' in ctid::text)-2)::int % 2=0; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_ctid2 on public.tbx (cost=0.43..481399.43 rows=150000 width=46) (actual time=3.267..105636.573 rows=15000009 loops=1) Output: id, c1, pad Buffers: shared hit=4464 read=176709 Planning time: 0.165 ms Execution time: 107196.582 ms (5 rows)
参考资料
- http://blog.chinaunix.net/uid-20726500-id-5747918.html
- http://os.51cto.com/art/200910/159067.htm
- http://blog.163.com/digoal@126/blog/static/163877040201421392811622/
- http://mysql.taobao.org/monthly/2015/05/04/