PostgreSQL Parallel Query: TPC-H Benchmark and Optimization Analysis


Author

digoal

Date

2016-11-08

Tags

PostgreSQL , Parallel Query , TPC-H


Background

PostgreSQL 9.6 is the first release to support parallel execution of aggregation, full table (sequential) scans, hash joins, and nested loop joins.

https://www.postgresql.org/docs/9.6/static/release-9-6.html

Parallel queries (Robert Haas, Amit Kapila, David Rowley, many others)   

With 9.6, PostgreSQL introduces initial support for parallel execution of large queries.   

Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.   

Hash joins and nested loops can be performed in parallel, as can aggregation (for supported aggregates).   

Much remains to be done, but this is already a useful set of features.   

Parallel query execution is not (yet) enabled by default. To allow it, set the new configuration parameter max_parallel_workers_per_gather to a value larger than zero.   

Additional control over use of parallelism is available through other new configuration parameters force_parallel_mode, parallel_setup_cost, parallel_tuple_cost, and min_parallel_relation_size.   

Provide infrastructure for marking the parallel-safety status of functions (Robert Haas, Amit Kapila)   
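As a quick reference, here is a minimal sketch of turning these knobs for a session on 9.6; the worker count is illustrative, and the other values shown are the 9.6 defaults:

    SET max_parallel_workers_per_gather = 4;   -- 0 (the default) disables parallel query
    SET parallel_setup_cost = 1000;            -- planner cost of launching workers
    SET parallel_tuple_cost = 0.1;             -- planner cost per tuple passed to the leader
    SET min_parallel_relation_size = '8MB';    -- smallest relation considered for a parallel scan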

So how much does this improve TPC-H performance?

Robert's notes on his PostgreSQL 9.6 TPC-H test

With the degree of parallelism set to 4, 17 of the 22 queries chose a parallel plan.

15 of those ran faster than single-process execution, 11 of them at least 2x faster; 1 ran at the same speed, and 1 got slower.

I decided to try out parallel query, as implemented in PostgreSQL 9.6devel, on the TPC-H queries.

To do this, I followed the directions at https://github.com/tvondra/pg_tpch - thanks to Tomas Vondra for those instructions.

I did the test on an IBM POWER7 server provided to the PostgreSQL community by IBM.

I scaled the database to use 10GB of input data; the resulting database size was 22GB, of which 8GB was indexes.

I tried out each query just once without really tuning the database at all, except for increasing shared_buffers to 8GB.

Then I tested them again after enabling parallel query by configuring max_parallel_degree = 4.
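(max_parallel_degree was this parameter's name during the 9.6 development cycle; in the released 9.6 it is called max_parallel_workers_per_gather.)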

Of the 22 queries, 17 switched to a parallel plan, while the plans for the other 5 were unchanged.

Of the 17 queries where the plan changed, 15 got faster, 1 ran at the same speed, and 1 got slower.

11 of the queries ran at least twice as fast with parallelism as they did without parallelism.
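A sketch of verifying one such speedup yourself, using TPC-H Q6 (a pure scan-and-aggregate query; the table and columns are from the standard TPC-H schema, and the parameter values are the usual Q6 substitutions):

    SET max_parallel_workers_per_gather = 4;
    EXPLAIN (ANALYZE, VERBOSE)
    SELECT sum(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= DATE '1994-01-01'
      AND l_shipdate < DATE '1995-01-01'
      AND l_discount BETWEEN 0.05 AND 0.07
      AND l_quantity < 24;
    -- A parallel plan shows Finalize Aggregate over Gather,
    -- with Partial Aggregate / Parallel Seq Scan running in the workers.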

Here are the comparative results for the queries where the plan changed (serial → parallel execution time):

Q1: 229 seconds → 45 seconds (5.0x)  

Q3: 45 seconds → 17 seconds (2.6x)  

Q4: 12 seconds → 3 seconds (4.0x)  

Q5: 38 seconds → 17 seconds (2.2x)  

Q6: 17 seconds → 6 seconds (2.8x)  

Q7: 41 seconds → 12 seconds (3.4x)  

Q8: 10 seconds → 4 seconds (2.5x)  

Q9: 81 seconds → 61 seconds (1.3x)  

Q10: 37 seconds → 18 seconds (2.0x)  

Q12: 34 seconds → 7 seconds (4.8x)  

Q15: 33 seconds → 24 seconds (1.3x)  

Q16: 17 seconds → 16 seconds (1.0x)  

Q17: 140 seconds → 55 seconds (2.5x)  

Q19: 2 seconds → 1 second (2.0x)  

Q20: 70 seconds → 70 seconds (1.0x)  

Q21: 80 seconds → 99 seconds (0.8x)  

Q22: 4 seconds → 3 seconds (1.3x)  

Linear scaling with a leader process and 4 workers would mean a 5.0x speedup, which we achieved in only one case.

However, for many users, that won't matter: if you have CPUs that would otherwise be sitting idle, it's better to get some speedup than no speedup at all.

Of course, I couldn't resist analyzing what went wrong here, especially for Q21, which actually got slower.

Q21 got slower for two reasons: a work_mem misconfiguration, and a limitation in the current parallel hash join mechanism.

To some degree, that's down to misconfiguration:

I ran this test with the default value of work_mem=4MB, but Q21 chooses a plan that builds a hash table on the largest table in the database, which is about 9.5GB in this test.

Therefore, it ends up doing a 1024-batch hash join, which is somewhat painful under the best of circumstances.
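As a rough sanity check on that batch count (assuming the hashed portion of the table comes to a few GB): PostgreSQL chooses the number of batches as a power of two large enough that one batch of the build side fits in work_mem, so

    nbatch ≈ hash data / work_mem  (rounded up to a power of two)
           ≈ 4GB / 4MB
           = 1024 batches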

With work_mem=1GB, the regression disappears, and it's 6% faster with parallel query than without.
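The fix can be applied per session; a minimal sketch:

    -- Raise work_mem for this session only, then rerun Q21.
    SET work_mem = '1GB';
    -- In EXPLAIN (ANALYZE) output the Hash node should now report "Batches: 1"
    -- rather than "Batches: 1024", i.e. the build no longer spills across batches.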

At present, each parallel worker in a hash join must build its own copy of the hash table, so hashing a large table multiplies the CPU and memory cost of the build by N.

Hashing the smaller side instead mitigates the problem.

However, there's a deeper problem, which is that while PostgreSQL 9.6 can perform a hash join in parallel, each process must build its own copy of the hash table.

That means we use N times the CPU and N times the memory, and we may induce I/O contention, locking contention, or memory pressure as well.

It would be better to have the ability to build a shared hash table, and EnterpriseDB is working on that as a feature, but it won't be ready in time for PostgreSQL 9.6, which is already in feature freeze.

Since Q21 needs a giant hash table, this limitation really stings.
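The duplicated build shows up in the plan shape; an illustrative sketch of a 9.6 parallel hash join (shape only, not actual EXPLAIN output; orders and lineitem are TPC-H tables):

    Gather
      Workers Planned: 4
      ->  Hash Join
            ->  Parallel Seq Scan on orders    -- probe side: rows divided among workers
            ->  Hash
                  ->  Seq Scan on lineitem     -- build side: repeated in full by every worker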

One improvement for hash joins, then, is a shared hash table rather than a copy per worker process.

That will probably have to wait for PostgreSQL 10.0.

In fact, there are a number of queries here where it seems like building a shared hash table would speed things up significantly: Q3, Q5, Q7, Q8, and Q21.

An even more widespread problem is that, at present, the driving table for a parallel query must be accessed via a parallel sequential scan;

that's the only operation we have that can partition the input data.

Another area for improvement is the bitmap scan: several of the queries bottleneck on bitmap scans, which parallel query does not yet support.

Many of these queries - Q4, Q5, Q6, Q7, Q14, Q15, and Q20 - would have been better off using a bitmap index scan on the driving table, but unfortunately that's not supported in PostgreSQL 9.6.

We still come out ahead on these queries in terms of runtime because the system simply substitutes raw power for finesse:

with enough workers, we can scan the whole table quicker than a single process can scan the portion identified as relevant by the index.

However, it would clearly be nice to do better.

Four queries - Q2, Q15, Q16, Q22 - were parallelized either not at all or only to a limited degree due to restrictions related to the handling of subqueries,

about which the current implementation of parallel query is not always smart.

Three queries - Q2, Q13, and Q15 - made no or limited use of parallelism because the optimal join strategy is a merge join, which can't be made parallel in a trivial way.

One query - Q17 - managed to perform the same expensive sort twice, once in the workers and then again in the leader.

This is because the Gather operation reads tuples from the workers in an arbitrary and not necessarily predictable order;

so even if each worker's stream of tuples is sorted, the way those streams get merged together will probably destroy the sort ordering.
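An illustrative plan shape for that double sort (shape only, not actual EXPLAIN output): the per-worker Sort is wasted because Gather interleaves the streams, forcing a second Sort in the leader:

    Sort                                   -- leader sorts the merged stream again
      ->  Gather
            Workers Planned: 4
            ->  Sort                       -- each worker sorts its own partial stream
                  ->  Parallel Seq Scan on lineitem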

There are no doubt other issues here that I haven't found yet, but on the whole I find these results pretty encouraging.

Parallel query basically works, and makes queries that someone thought were representative of real workloads significantly faster.

There's a lot of room for further improvement, but that's likely to be true of the first version of almost any large feature.

Areas where parallel query still needs improvement

Hash join: use a shared hash table instead of having each worker process build its own copy. This will probably have to wait for PostgreSQL 10.0.

Bitmap scan: several of the queries bottleneck on bitmap scans, which parallel query does not yet support.

Merge join: add parallel support.
