PostgreSQL 并行计算tpc-h测试和优化分析-阿里云开发者社区

开发者社区> 德哥> 正文

PostgreSQL 并行计算tpc-h测试和优化分析

简介:
+关注继续查看

PostgreSQL 并行计算tpc-h测试和优化分析

作者

digoal

日期

2016-11-08

标签

PostgreSQL , 并行计算 , TPC-H


背景

PostgreSQL 9.6首次推出支持聚合、全表扫描、HASH JOIN、nestloop的并行计算。

https://www.postgresql.org/docs/9.6/static/release-9-6.html

Parallel queries (Robert Haas, Amit Kapila, David Rowley, many others)   

With 9.6, PostgreSQL introduces initial support for parallel execution of large queries.   

Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.   

Hash joins and nested loops can be performed in parallel, as can aggregation (for supported aggregates).   

Much remains to be done, but this is already a useful set of features.   

Parallel query execution is not (yet) enabled by default. To allow it, set the new configuration parameter max_parallel_workers_per_gather to a value larger than zero.   

Additional control over use of parallelism is available through other new configuration parameters force_parallel_mode, parallel_setup_cost, parallel_tuple_cost, and min_parallel_relation_size.   

Provide infrastructure for marking the parallel-safety status of functions (Robert Haas, Amit Kapila)   

那么他对TPC-H有多少的性能提升呢?

Robert的PostgreSQL 9.6 TPC-H测试说明

并行度设为4,22条查询有17条使用了并行执行计划。

15条比单核执行更快,其中11条提升至少2倍,1条速度未变化,还有1条变慢。

I decided to try out parallel query, as implemented in PostgreSQL 9.6devel, on the TPC-H queries.

To do this, I followed the directions at https://github.com/tvondra/pg_tpch - thanks to Tomas Vondra for those instructions.

I did the test on an IBM POWER7 server provided to the PostgreSQL community by IBM.

I scaled the database to use 10GB of input data; the resulting database size was 22GB, of which 8GB was indexes.

I tried out each query just once without really tuning the database at all, except for increasing shared_buffers to 8GB.

Then I tested them again after enabling parallel query by configuring max_parallel_degree = 4.

Of the 22 queries, 17 switched to a parallel plan, while the plans for the other 5 were unchanged.

Of the 17 queries where the plan changed, 15 got faster, 1 ran at the same speed, and 1 got slower.

11 of the queries ran at least twice as fast with parallelism as they did without parallelism.

Here are the comparative results for the queries where the plan changed(Parallel vs 单核执行):

Q1: 229 seconds → 45 seconds (5.0x)  

Q3: 45 seconds → 17 seconds (2.6x)  

Q4: 12 seconds → 3 seconds (4.0x)  

Q5: 38 seconds → 17 seconds (2.2x)  

Q6: 17 seconds → 6 seconds (2.8x)  

Q7: 41 seconds → 12 seconds (3.4x)  

Q8: 10 seconds → 4 seconds (2.5x)  

Q9: 81 seconds → 61 seconds (1.3x)  

Q10: 37 seconds → 18 seconds (2.0x)  

Q12: 34 seconds → 7 seconds (4.8x)  

Q15: 33 seconds → 24 seconds (1.3x)  

Q16: 17 seconds → 16 seconds (1.0x)  

Q17: 140 seconds → 55 seconds (2.5x)  

Q19: 2 seconds → 1 second (2.0x)  

Q20: 70 seconds → 70 seconds (1.0x)  

Q21: 80 seconds → 99 seconds (0.8x)  

Q22: 4 seconds → 3 seconds (1.3x)  

Linear scaling with a leader process and 4 workers would mean a 5.0x speedup, which we achieved in only one case.

However, for many users, that won't matter: if you have CPUs that would otherwise be sitting idle, it's better to get some speedup than no speedup at all.

Of course, I couldn't resist analyzing what went wrong here, especially for Q21, which actually got slower.

Q21变慢的原因,是work_mem的配置问题,以及当前HASH JOIN并行机制的问题。

To some degree, that's down to misconfiguration:

I ran this test with the default value of work_mem=4MB, but Q21 chooses a plan that builds a hash table on the largest table in the database, which is about 9.5GB in this test.

Therefore, it ends up doing a 1024-batch hash join, which is somewhat painful under the best of circumstances.

With work_mem=1GB, the regression disappears, and it's 6% faster with parallel query than without.

目前HASH JOIN,每一个并行的WORKER都需要一份hash table的拷贝,如果大表hash的话,会在大表基础上放大N倍的CPU和内存的开销。

小表HASH这个问题可以缓解。

However, there's a deeper problem, which is that while PostgreSQL 9.6 can perform a hash join in parallel, each process must build its own copy of the hash table.

That means we use N times the CPU and N times the memory, and we may induce I/O contention, locking contention, or memory pressure as well.

It would be better to have the ability to build a shared hash table, and EnterpriseDB is working on that as a feature, but it won't be ready in time for PostgreSQL 9.6, which is already in feature freeze.

Since Q21 needs a giant hash table, this limitation really stings.

HASH JOIN可以提升的点,使用共享的HASH TABLE,而不是每个woker process都拷贝一份。

这个可能要等到PostgreSQL 10.0加进来了。

In fact, there are a number of queries here where it seems like building a shared hash table would speed things up significantly: Q3, Q5, Q7, Q8, and Q21.

An even more widespread problem is that, at present, the driving table for a parallel query must be accessed via a parallel sequential scan;

that's the only operation we have that can partition the input data.

另一个提升的点,bitmap scan,因为有几个QUERY的瓶颈是在bitmap scan哪里,但是目前并行计算还不支持bitmap scan。

Many of these queries - Q4, Q5, Q6, Q7, Q14, Q15, and Q20 - would have been better off using a bitmap index scan on the driving table, but unfortunately that's not supported in PostgreSQL 9.6.

We still come out ahead on these queries in terms of runtime because the system simply substitutes raw power for finesse:

with enough workers, we can scan the whole table quicker than a single process can scan the portion identified as relevant by the index.

However, it would clearly be nice to do better.

Four queries - Q2, Q15, Q16, Q22 - were parallelized either not at all or only to a limited degree due to restrictions related to the handling of subqueries,

about which the current implementation of parallel query is not always smart.

Three queries - Q2, Q13, and Q15 - made no or limited use of parallelism because the optimal join strategy is a merge join, which can't be made parallel in a trivial way.

One query - Q17 - managed to perform the same an expensive sort twice, once in the workers and then again in the leader.

This is because the Gather operation reads tuples from the workers in an arbitrary and not necessarily predictable order;

so even if each worker's stream of tuples is sorted, the way those streams get merged together will probably destroy the sort ordering.

There are no doubt other issues here that I haven't found yet, but on the whole I find these results pretty encouraging.

Parallel query basically works, and makes queries that someone thought were representative of real workloads significantly faster.

There's a lot of room for further improvement, but that's likely to be true of the first version of almost any large feature.

并行需要继续提升的点

HASH JOIN可以提升的点,使用共享的HASH TABLE,而不是每个woker process都拷贝一份。

这个可能要等到PostgreSQL 10.0加进来了。

另一个提升的点,bitmap scan,因为有几个QUERY的瓶颈是在bitmap scan哪里,但是目前并行计算还不支持bitmap scan。

支持merge join。

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

相关文章
《并行计算的编程模型》一3.7 集合操作
本节书摘来华章计算机《并行计算的编程模型》一书中的第3章 ,第3.7节, [(美)帕万·巴拉吉(Pavan Balaji)编著;张云泉等译,更多章节内容可以访问云栖社区“华章计算机”公众号查看。
783 0
Oracle并行计算
【前言】现在CPU的发展已不仅朝着单个性能更好的方向了,而且还朝着多核数多核心的方向发展了。Oracle数据库大部分也都是利用单线程的串行方式在运行。通过并行(Parallel)操作特性,充分应用CPU的多核心特点,提高对数据的操作效率,满足在特定场景下对海量数据操作的需求。
820 0
Okhttp3源码解析(1)-OkHttpClient分析
前言 上篇文章我们讲了Okhttp的基本用法,今天根据上节讲到请求流程来分析源码,那么第一步就是实例化OkHttpClient对象,所以我们今天主要分析下OkHttpClient源码! 初始化-构造方式 创建 OkHttpClient实例的两种方式 1.
2596 0
基于英特尔® 优化分析包(OAP)的 Spark 性能优化方案
Spark SQL 作为 Spark 用来处理结构化数据的一个基本模块,已经成为多数企业构建大数据应用的重要选择。但是,在大规模连接(Join)、聚合(Aggregate)等工作负载下,Spark 性能会面临稳定性和性能方面的挑战。
483 0
+关注
德哥
公益是一辈子的事, I'm digoal, just do it.
2153
文章
245
问答
来源圈子
更多
阿里云数据库:帮用户承担一切数据库风险,给您何止是安心!支持关系型数据库:MySQL、SQL Server、PostgreSQL、PPAS(完美兼容Oracle)、自研PB级数据存储的分布式数据库Petadata、自研金融级云数据库OceanBase支持NoSQL数据库:MongoDB、Redis、Memcache更有褚霸、丁奇、德哥、彭立勋、玄惭、叶翔等顶尖数据库专家服务。
+ 订阅
文章排行榜
最热
最新
相关电子书
更多
《2021云上架构与运维峰会演讲合集》
立即下载
《零基础CSS入门教程》
立即下载
《零基础HTML入门教程》
立即下载