PostgreSQL 11 版本在并行方面得到增强,例如支持并行创建索引(Parallel Index Build)、并行哈希连接(Parallel Hash Join)、并行 CREATE TABLE .. AS
等,上篇博客介绍了并行创建索引,本文介绍并行 Hash Join。
测试环境准备
创建大表t_big并插入5000万条数据。
CREATE TABLE t_big(
id int4,
name text,
create_time timestamp without time zone );
INSERT INTO t_big(id,name,create_time)
SELECT n, n|| '_test',clock_timestamp() FROM generate_series(1,50000000) n ;
创建小表t_small并插入800万条数据
CREATE TABLE t_small(id int4, name text);
INSERT INTO t_small(id,name)
SELECT n, n|| '_small' FROM generate_series(1,8000000) n ;
验证并行哈希连接
PostgreSQL 10 版本查看以下SQL执行计划,如下:
des=> EXPLAIN SELECT t_small.name
FROM t_big JOIN t_small ON (t_big.id = t_small.id)
AND t_small.id < 100;
QUERY PLAN
--------------------------------------------------------------------------------------
Gather (cost=151870.58..661385.28 rows=4143 width=13)
Workers Planned: 4
-> Hash Join (cost=150870.58..659970.98 rows=1036 width=13)
Hash Cond: (t_big.id = t_small.id)
-> Parallel Seq Scan on t_big (cost=0.00..470246.58 rows=10358258 width=4)
-> Hash (cost=150860.58..150860.58 rows=800 width=17)
-> Seq Scan on t_small (cost=0.00..150860.58 rows=800 width=17)
Filter: (id < 100)
(8 rows)
PostgreSQL 11 版本查看以下SQL执行计划,如下:
francs=> EXPLAIN SELECT t_small.name
FROM t_big JOIN t_small ON (t_big.id = t_small.id)
AND t_small.id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------
Gather (cost=76862.42..615477.60 rows=800 width=13)
Workers Planned: 4
-> Parallel Hash Join (cost=75862.42..614397.60 rows=200 width=13)
Hash Cond: (t_big.id = t_small.id)
-> Parallel Seq Scan on t_big (cost=0.00..491660.86 rows=12499686 width=4)
-> Parallel Hash (cost=75859.92..75859.92 rows=200 width=17)
-> Parallel Seq Scan on t_small (cost=0.00..75859.92 rows=200 width=17)
Filter: (id < 100)
(8 rows)
对比10版本的执行计划,不同之处为11版本走了 Parallel Hash Join,而 10 版本走的 Hash Join,Parallel Hash Join 为 11 版本的新特性。
并行哈希连接性能测试
开启并行哈希连接相比不开启性能上有何变化?接着测试。
开启并行哈希连接
PostgreSQL 11 版本执行以下SQL,如下:
francs=> EXPLAIN ANALYZE SELECT t_small.name
FROM t_big JOIN t_small ON (t_big.id = t_small.id)
AND t_small.id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=76862.42..615477.60 rows=800 width=13) (actual time=197.399..2738.010 rows=99 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Parallel Hash Join (cost=75862.42..614397.60 rows=200 width=13) (actual time=2222.347..2729.943 rows=20 loops=5)
Hash Cond: (t_big.id = t_small.id)
-> Parallel Seq Scan on t_big (cost=0.00..491660.86 rows=12499686 width=4) (actual time=0.038..1330.836 rows=10000000 loops=5)
-> Parallel Hash (cost=75859.92..75859.92 rows=200 width=17) (actual time=191.484..191.484 rows=20 loops=5)
Buckets: 1024 Batches: 1 Memory Usage: 40kB
-> Parallel Seq Scan on t_small (cost=0.00..75859.92 rows=200 width=17) (actual time=152.436..191.385 rows=20 loops=5)
Filter: (id < 100)
Rows Removed by Filter: 1599980
Planning Time: 0.183 ms
Execution Time: 2738.068 ms
(13 rows)
以上SQL执行多次,取最快时间,执行时间为 2738.068 ms。
关闭并行哈希连接
会话级设置enable_parallel_hash参数为off表示关闭并行哈希连接,测试性能有何变化,如下。
francs=> set enable_parallel_hash = off;
SET
francs=> EXPLAIN ANALYZE SELECT t_small.name
FROM t_big JOIN t_small ON (t_big.id = t_small.id)
AND t_small.id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=151869.66..690486.34 rows=800 width=13) (actual time=996.137..3496.940 rows=99 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Hash Join (cost=150869.66..689406.34 rows=200 width=13) (actual time=2990.847..3490.557 rows=20 loops=5)
Hash Cond: (t_big.id = t_small.id)
-> Parallel Seq Scan on t_big (cost=0.00..491660.86 rows=12499686 width=4) (actual time=0.240..1392.062 rows=10000000 loops=5)
-> Hash (cost=150859.66..150859.66 rows=800 width=17) (actual time=890.943..890.943 rows=99 loops=5)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Seq Scan on t_small (cost=0.00..150859.66 rows=800 width=17) (actual time=884.288..890.906 rows=99 loops=5)
Filter: (id < 100)
Rows Removed by Filter: 7999901
Planning Time: 0.154 ms
Execution Time: 3496.982 ms
(13 rows)
以上SQL执行多次,取最快时间,从以上看出,关闭并行哈希连接时SQL的执行时间为 3496.982 ms ,相比开启并行哈希连接执行时间长了 27%。
可见开启并行哈希连接后,性能有较大幅度提升。
参考
新书推荐
最后推荐和张文升共同编写的《PostgreSQL实战》,本书基于PostgreSQL 10 编写,共18章,重点介绍SQL高级特性、并行查询、分区表、物理复制、逻辑复制、备份恢复、高可用、性能优化、PostGIS等,涵盖大量实战用例!