PostgreSQL 10.0 preview performance enhancement - OLAP acceleration framework: Faster Expression Evaluation Framework (including JIT)


Tags

PostgreSQL, 10.0, llvm, jit, Faster Expression Evaluation Framework


Background

PostgreSQL 10.0 may integrate JIT compilation, vectorized execution, and related techniques, providing a general-purpose development framework that makes efficient collaboration easier and improves OLAP performance.

Although community members have already provided LLVM and vectorized-execution extensions, the community clearly intends to integrate this kind of computation directly into the core. Keep it up, PostgreSQL!

《分析加速引擎黑科技 - LLVM、列存、多核并行、算子复用 大联姻 - 一起来开启PostgreSQL的百宝箱》

《PostgreSQL 向量化执行插件(瓦片式实现) 10x提速OLAP》

Hi Everyone,  

TL;DR: Making things faster. Architectural evaluation.

As some of you might be aware, I've been working on making execution of
larger queries in PostgreSQL faster. While working on "batched execution"
I came to the conclusion that batching, while necessary, isn't currently
showing a large benefit, because expression evaluation and tuple deforming
are massive bottlenecks.

I'm posting a quite massive series of WIP patches here, to get some  
feedback.  

Tuple deforming is slow because of two reasons:  

1) It's the first thing that accesses tuples, i.e. it'll often incur
   cache misses. That's partially fundamental, but it can also partially
   be addressed, e.g. through changing the access order in heap as in [1].
2) Tuple deforming has a lot of unpredictable branches, because it has
   to cope with various types of fields. We e.g. perform alignment in a
   lot of cases where it's not needed, and do null checks even for NOT
   NULL columns, et al.

I tried to address 2) by changing the C implementation. That brings some
measurable speedups, but they're not huge. A bigger speedup comes from
making slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial
wrappers; but it's still not huge.  Finally I turned to just-in-time
(JIT) compiling the code for tuple deforming. That doesn't save the cost
of 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
first part is done in 0007, the JITing in 0012.
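
A minimal sketch of the "trivial wrapper" idea (illustrative only -- the type and function names below are simplified stand-ins, not the actual patch code): keep the hot entry point tiny so the compiler can inline it, and push all the heavy lifting into one out-of-line deforming routine.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

typedef struct SlotSketch
{
    int     tts_nvalid;     /* number of already-deformed columns */
    Datum  *tts_values;
    bool   *tts_isnull;
} SlotSketch;

/* out-of-line: deforms columns [tts_nvalid + 1, attnum] from the tuple */
extern void slot_deform_sketch(SlotSketch *slot, int attnum);

/* trivial wrapper: the common "already deformed" case costs only a
 * comparison and two array loads */
static inline Datum
slot_getattr_sketch(SlotSketch *slot, int attnum, bool *isnull)
{
    if (attnum > slot->tts_nvalid)
        slot_deform_sketch(slot, attnum);
    *isnull = slot->tts_isnull[attnum - 1];
    return slot->tts_values[attnum - 1];
}
```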


Expression evaluation and projection is another major bottleneck.  
1) Our recursive expression evaluation puts a *lot* of pressure on the  
   stack.  
2) There's a lot of indirect function calls when recursing to other  
   expression nodes. These are hard to predict, because the same node  
   type (say ExecEvalAnd()) is used in different parts of an expression  
   tree, and invokes different sub-nodes.  
3) The function calls to operators and other functions are hard to  
   predict, leading to a significant number of pipeline stalls.  
4) There's a fair amount of pg_list.h-style list iteration going on;
   that's cache- and pipeline-inefficient.
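
To make the problem concrete, the recursive style these points describe looks roughly like the sketch below (simplified for illustration, not PostgreSQL's actual ExecEvalExpr machinery): every node evaluation is an indirect call through a function pointer, and operator nodes recurse into their children.

```c
#include <stdint.h>

typedef uintptr_t Datum;

typedef struct EvalNode EvalNode;
struct EvalNode
{
    /* every node evaluates via an indirect call... */
    Datum (*eval)(const EvalNode *n, const Datum *slot_values);
    const EvalNode *left, *right;   /* operands of an operator node */
    Datum   constval;
    int     varno;
    Datum (*fn)(Datum, Datum);
};

static Datum
eval_const(const EvalNode *n, const Datum *sv) { (void) sv; return n->constval; }

static Datum
eval_var(const EvalNode *n, const Datum *sv) { return sv[n->varno]; }

/* ...and inner nodes recurse, so a deep expression means a deep C stack
 * plus two hard-to-predict indirect calls per operator */
static Datum
eval_func2(const EvalNode *n, const Datum *sv)
{
    return n->fn(n->left->eval(n->left, sv),
                 n->right->eval(n->right, sv));
}
```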

After some experimenting I came to the conclusion that the recursive
processing is a fundamental impediment to making this faster.  I've
converted (0008) expression processing and projection into an opcode
dispatch based interpreter. That yields, especially for complex
expressions and larger projections, a significant speedup in itself.  But
similarly to the deforming, expression evaluation remains a bottleneck
after that, primarily because there are still a lot of unpredictable jumps
and calls, and because loads/stores have to be complex
(e.g. ExprContext->ecxt_innertuple->tts_values[i]/tts_isnull[i] for a
single scalar var evaluation).   Using the opcode based representation
of expression evaluation (as it's nearly linear, and has done a lot of
the lookups ahead of time), it's actually quite easy to JIT compile it.
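
Here is a toy version of such an opcode-dispatch interpreter (the opcodes OP_CONST/OP_VAR/OP_FUNC2 are invented for illustration, not the patch's actual step kinds): the tree has been flattened into a linear step array up front, so evaluation is one tight loop with no recursion.

```c
#include <stdint.h>

typedef uintptr_t Datum;

typedef enum { OP_CONST, OP_VAR, OP_FUNC2, OP_DONE } StepOp;

typedef struct ExprStep
{
    StepOp  op;
    Datum   constval;                /* OP_CONST: the value itself */
    int     varno;                   /* OP_VAR: index into slot values */
    Datum (*fn)(Datum a, Datum b);   /* OP_FUNC2: looked up at init time */
} ExprStep;

/* evaluate a linear step array against one tuple's deformed values */
static Datum
eval_steps(const ExprStep *steps, const Datum *slot_values)
{
    Datum stack[64];
    int   sp = 0;

    for (const ExprStep *s = steps;; s++)
    {
        switch (s->op)               /* or computed goto on gcc/clang */
        {
            case OP_CONST: stack[sp++] = s->constval; break;
            case OP_VAR:   stack[sp++] = slot_values[s->varno]; break;
            case OP_FUNC2:           /* pops two operands, pushes result */
                sp--;
                stack[sp - 1] = s->fn(stack[sp - 1], stack[sp]);
                break;
            case OP_DONE:  return stack[sp - 1];
        }
    }
}
```

For an expression like col0 + 1 the step array would be {OP_VAR 0}, {OP_CONST 1}, {OP_FUNC2 add}, {OP_DONE}: a nearly linear walk through memory, with the function pointer resolved once at build time rather than on every tuple.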


*After JITing expression evaluation itself is more than ten times faster  
than before*.  

But unfortunately that doesn't mean that queries are ten times faster -
usually we'll hit bottlenecks elsewhere relatively soon.  WRT expression
evaluation, the biggest cost afterwards is the relatively high overhead
of V1 function calls - register based parameter passing is a lot faster.


After experimenting a bit with doing JITing manually (a lot of  
eye-stabbing kind of fun), I chose to use LLVM.  
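
For readers unfamiliar with LLVM, a minimal sketch of the moving parts via the LLVM C API (the MCJIT engine, as available in LLVM versions of that era): this JITs a trivial add function rather than anything PostgreSQL-specific, but the same steps -- build IR, hand it to an execution engine, fetch a function pointer -- apply to deforming and expression steps.

```c
#include <llvm-c/Core.h>
#include <llvm-c/ExecutionEngine.h>
#include <llvm-c/Target.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    LLVMLinkInMCJIT();
    LLVMInitializeNativeTarget();
    LLVMInitializeNativeAsmPrinter();

    /* build IR for: int add(int a, int b) { return a + b; } */
    LLVMModuleRef mod = LLVMModuleCreateWithName("jit_demo");
    LLVMTypeRef i32 = LLVMInt32Type();
    LLVMTypeRef params[2] = { i32, i32 };
    LLVMValueRef fn = LLVMAddFunction(mod, "add",
                                      LLVMFunctionType(i32, params, 2, 0));

    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
    LLVMBuildRet(b, LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                 LLVMGetParam(fn, 1), "sum"));

    /* compile to native code and grab a callable pointer */
    char *err = NULL;
    LLVMExecutionEngineRef ee;
    if (LLVMCreateExecutionEngineForModule(&ee, mod, &err) != 0)
    {
        fprintf(stderr, "EE error: %s\n", err);
        return 1;
    }

    int (*add)(int, int) =
        (int (*)(int, int)) (intptr_t) LLVMGetFunctionAddress(ee, "add");
    printf("%d\n", add(2, 3));      /* prints 5 */
    return 0;
}
```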

An overview of the patch-queue so far:  
0001  Make get_last_attnums more generic.  

Boring prerequisite.  

0002  More efficient AggState->pertrans iteration.  

Relatively boring minor optimization, but it turns out to be an easily
hit bottleneck. Will commit independently.


0003  Avoid materializing SRFs in the FROM list.  
0004  Allow ROWS FROM to return functions as single record column.  
0005  Basic implementation of targetlist SRFs via ROWS FROM.  
0006  Remove unused code related to targetlist SRFs.  

These are basically just prerequisites for the faster expression
evaluation, and discussed elsewhere [2].  This implementation is *NOT*
going to survive, because we ended up coming to the conclusion that using
a separate executor node to expand SRFs is a better plan. But the new
expression evaluation code won't be able to handle SRFs...


0007  WIP: Optimize slot_deform_tuple() significantly.  

This a) turns tuple deforming into an opcode based dispatch loop (using  
computed goto on gcc/clang). b) moves a lot of the logic from  
slot_deform_tuple() callsites into itself - that turns out to be more  
efficient.  I'm not entirely sure it's worth doing the opcode based  
dispatch part, if we're going to also do the JIT bit - it's a fair  
amount of code, and the speed difference only matters on large amounts  
of rows.  
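
A sketch of what the computed-goto flavor of such a dispatch loop looks like (using gcc/clang's labels-as-values extension; the step kinds here are invented for illustration). The point is that a NOT NULL fixed-width column gets its own branch-free step, while everything else falls back to a generic slow path:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

typedef enum { STEP_INT4_NOTNULL, STEP_GENERIC, STEP_DONE } DeformOp;

typedef struct DeformStep
{
    DeformOp op;
    int      attnum;    /* 0-based target column */
    long     off;       /* precomputed byte offset, when fixed */
} DeformStep;

#define EEO_NEXT() goto *dispatch[(++step)->op]

static void
deform_sketch(const DeformStep *steps, const char *tp,
              Datum *values, bool *isnull)
{
    static const void *dispatch[] = {
        &&case_int4_notnull, &&case_generic, &&case_done
    };
    const DeformStep *step = steps;

    goto *dispatch[step->op];

case_int4_notnull:              /* fixed-width NOT NULL: branch-free */
    values[step->attnum] = (Datum) *(const int32_t *) (tp + step->off);
    isnull[step->attnum] = false;
    EEO_NEXT();

case_generic:                   /* alignment, null bitmap, varlena, ... */
    /* the fully general per-type logic would go here */
    EEO_NEXT();

case_done:
    return;
}
```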


0008  WIP: Faster expression processing and targetlist projection.  

This patch, functionally nearly complete, turns expression evaluation
(and tuple deforming as a special case of that) into a "mini language"
which is interpreted using either a while(true) switch(opcode) loop or
computed goto to jump from opcode to opcode.  It does so by moving a lot
more of the code for expression evaluation to initialization time and
building a linear series of steps to evaluate expressions, thereby
removing all recursion from expression processing.
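
Reusing the ExprStep/opcode definitions from the interpreter sketch above, the one-time "compile" pass might look like the following (again with purely illustrative node kinds): the tree is walked exactly once at initialization, emitting steps in postfix order, so per-tuple execution never recurses.

```c
/* Node kinds are invented; a real Expr tree has many more. */
typedef enum { NODE_CONST, NODE_VAR, NODE_FUNC2 } NodeTag;

typedef struct ExprNode
{
    NodeTag tag;
    Datum   constval;               /* NODE_CONST */
    int     varno;                  /* NODE_VAR */
    Datum (*fn)(Datum, Datum);      /* NODE_FUNC2 */
    struct ExprNode *left, *right;  /* NODE_FUNC2 operands */
} ExprNode;

/* Append steps for `n` at out[pos...]; returns the next free position.
 * Recursion happens here, once, at initialization time - never during
 * per-tuple evaluation.  The caller appends a final OP_DONE step. */
static int
compile_expr(const ExprNode *n, ExprStep *out, int pos)
{
    switch (n->tag)
    {
        case NODE_CONST:
            out[pos].op = OP_CONST;
            out[pos].constval = n->constval;
            return pos + 1;
        case NODE_VAR:
            out[pos].op = OP_VAR;
            out[pos].varno = n->varno;
            return pos + 1;
        case NODE_FUNC2:            /* postfix: operands, then the call */
            pos = compile_expr(n->left, out, pos);
            pos = compile_expr(n->right, out, pos);
            out[pos].op = OP_FUNC2;
            out[pos].fn = n->fn;
            return pos + 1;
    }
    return pos;
}
```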

This nearly entirely gets rid of the stack usage cost of expression
evaluation (we pretty much never recurse except for subplans). Being
able to remove, now redundant, calls to check_stack_depth() is a
noticeable benefit; it turns out that that check has a noticeable
performance impact (as it apparently forces the compiler to actually use
the stack, instead of just renaming registers inside the CPU).

The new representation and evaluation is functionally nearly complete  
(there's a single regression test failure, and I know why that is), but  
the code needs a fair amount of polishing.  

I do absolutely think that the fundamentals of this are the right way to
go, and I'm going to work hard on polishing the patch up.  But this
isn't something that we can easily do in parts, and it's a huge-ass
patch. So I'd like to have at least some more buy-in before spending even
more time on this.


0009  WIP: Add minimal keytest implementation.  

More or less experimental patch that tries to collapse a simple
expression of the form OpExpr(ScalarVar, Const) into a single expression
evaluation step.  The benefits probably aren't big enough if we do end
up doing JITing of expressions.


0010  WIP: Add configure infrastructure to enable LLVM.  
0011  WIP: Beginning of a LLVM JIT infrastructure.  

Very boring preliminary patches to add --with-llvm and some minimal  
infrastructure to handle LLVM. If we go this way, JITed stuff needs to  
be tied to resource owners, and we need some other centralized  
infrastructure.  

0012  Heavily-WIP: JITing of tuple deforming.  

This, in a not-yet-that-nice manner, implements a JITed version of the  
per-column stuff that slot_deform_tuple() does.  It currently always  
deforms all columns, which obviously would have to change. There's also  
considerable additional performance improvements possible.  

With this patch the per-column overhead (minus bitmap handling, which
0007 moved into a separate loop) drops from 10%+ into low single digits
for a number of queries.  Afterwards the biggest cost is VARSIZE_ANY()
for varlena columns (which atm isn't inlined).  That is, besides the
initial cache miss when accessing tuple->t_hoff, which JITing can do
nothing about :(

This can be enabled/disabled using the new jit_tuple_deforming GUC.  To  
make this production ready in some form, we'd have to come up with a way  
to determine when it's worth doing JITing. The easiest way would be to  
do so after N slot_deform_tuple() calls or such, another way would be to  
do it based on cost estimates.  
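
A call-count trigger, which the text floats as the easiest heuristic, could be as simple as the following sketch (entirely hypothetical; nothing like this is claimed to be in the patch):

```c
#include <stdbool.h>

/* Hypothetical trigger: stay interpreted for the first N calls, then
 * ask for a compiled deformer. */
#define JIT_DEFORM_THRESHOLD 1000

typedef struct DeformCounter
{
    long  ncalls;        /* interpreted calls so far */
    bool  jitted;        /* already compiled? */
} DeformCounter;

/* called at the top of the interpreted deforming path */
static bool
should_jit_deforming(DeformCounter *c)
{
    return !c->jitted && ++c->ncalls >= JIT_DEFORM_THRESHOLD;
}
```

A cost-based variant would instead make the decision once at executor startup, from the planner's cost estimate, avoiding any per-call bookkeeping.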


0013  WIP: ExprEval: Make threaded dispatch use a separate field.  

Boring preliminary patch. Increases memory usage a bit, needs to be  
thought through more.  


0014  Heavily-WIP: JITed expression evaluation.  

This is the most interesting bit performance-wise. A few common types of
expressions are JITed: scalar value accesses, function calls, boolean
expressions, aggregate references.

This can be enabled using the new jit_expressions GUC.  

Even for the supported expression types I've taken some shortcuts  
(e.g. strict functions aren't actually strict).  

The performance benefits are quite noticeable. For TPCH ExecEvalExpr()  
(which is where 0008 moved all of expression evaluation/projection) goes  
from being the top profile entry, to barely noticeable, with the JITed  
function usually not showing up in the top five entries anymore.  

After the patch it becomes very clear that our function call
infrastructure is a serious bottleneck. Passing all the arguments via
memory, and, even worse, forcing isnull/values to be on separate
cachelines, has significant performance implications.  It also becomes
quite noticeable that nodeAgg's transition function invocation doesn't
go through ExecEvalExpr() but does that itself - which leads to constant
mispredictions if several transition values exist.
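
The cacheline point is easy to see from an abbreviated sketch of the 9.x-era FunctionCallInfoData layout (paraphrased from fmgr.h; several fields omitted):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

/* With FUNC_MAX_ARGS at its default of 100, arg[] alone spans 800
 * bytes on a 64-bit build, so arg[i] and argnull[i] are guaranteed to
 * sit on different cachelines, and every call writes both arrays
 * through memory rather than passing arguments in registers. */
#define FUNC_MAX_ARGS 100

typedef struct FunctionCallInfoSketch
{
    /* ... flinfo, context, resultinfo, fncollation ... */
    bool    isnull;
    short   nargs;
    Datum   arg[FUNC_MAX_ARGS];      /* 800 bytes */
    bool    argnull[FUNC_MAX_ARGS];  /* starts ~800 bytes later */
} FunctionCallInfoSketch;
```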

While the JIT code is relatively verbose, it turns out to not actually  
be that hard to write after some startup pains. All the JITing of  
expressions that exists so far was basically written in ~10 hours.  

This also needs some heuristics about when JITing is  
appropriate. Compiling an expression that's only executed once is never  
going to be faster than doing the interpretation (it at least needs a  
writable allocation for the code, and then a remap to make that code  
read-only and executable).  A trace based approach (everything executed  
at least a thousand times) or cost based (all queries costing more than  
100000 should be JITed) could make sense.  


It's worthwhile to note that at the moment this is a per-query-execution  
JIT, not something that can trivially be cached for prepared  
statements. That'll need further infrastructure.  


0015  Super-Heavily-WIP: LLVM perf integration.  

This very very very preliminary patch (including some copy-pasted GPL
code!) creates /tmp/perf-<pid>.map files, which allow perf to show
useful symbols for profile hits in JITed expressions.  I plan to push
this towards LLVM, so this isn't something PG will have to do, but it's
helpful for evaluation.
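
The map format perf consumes is trivial - one "start size name" line per JITed symbol, all in hex - so the emitting side can be sketched in a few lines (hypothetical helper, not the patch's code):

```c
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

/* Append one JITed symbol to the per-process map file perf looks for.
 * One line per symbol: "<start-hex> <size-hex> <symbol-name>". */
static void
perf_map_add(uintptr_t start, size_t size, const char *name)
{
    char  path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int) getpid());
    f = fopen(path, "a");
    if (f == NULL)
        return;
    fprintf(f, "%" PRIxPTR " %zx %s\n", start, size, name);
    fclose(f);
}
```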


I eventually plan to start separate threads about some of the parts in
here, but I think the overall picture needs some discussion first.


Q: Why LLVM and not a hand-rolled JIT?  
A: Because hand-rolling a JIT is probably hard to scale to multiple
   maintainers, and multiple platforms. I started down the path of doing
   a hand-rolled x86 JIT, and that'd also be doable (faster compilation,
   slower execution basically); but I doubt we'd end up having that on
   different architectures and platforms. Not to speak of things like
   proper debugger and profiler integration.  I'm not entirely convinced
   that this is the right path. It might also be a transitional step,
   towards doing our completely own JIT. But I think it's a sensible
   step.

Q: Why LLVM and not $jit-toolkit  
A: Because all the other JIT stuff I looked at was either really  
   unportable (mostly x86 linux only), inconveniently licensed (like  
   e.g. gcc's jit library) or nearly unmaintained (luajit's stuff for  
   example).  I might have missed something, but ISTM that atm the  
   choice is between hand-rolling and using LLVM.  

Q: Does this actually inline functions from the backend?  
A: No. That probably is something desirable in the future, but to me
   that seems like it should be a separate step. The current one's big
   enough. It also further increases compilation times, so quite
   possibly we only want to do so based on another set of heuristics.

Q: ?  

Comments? Questions?  

Regards,  

Andres  

[1] https://archives.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de  
[2] https://www.postgresql.org/message-id/20160523005327.v2tr7obytitxcnna@alap3.anarazel.de  

For the full discussion of this patch set, see the mailing-list thread linked at the end of this article.

The PostgreSQL community is known for its rigor: a patch may be discussed on the mailing list for months or even years and revised repeatedly based on feedback, so by the time it is merged into master it is already very mature. This is a big part of why PostgreSQL is renowned for its stability.

References

https://commitfest.postgresql.org/13/1061/

https://www.postgresql.org/message-id/flat/20161206034955.bh33paeralxbtluv@alap3.anarazel.de#20161206034955.bh33paeralxbtluv@alap3.anarazel.de

《分析加速引擎黑科技 - LLVM、列存、多核并行、算子复用 大联姻 - 一起来开启PostgreSQL的百宝箱》

《PostgreSQL 向量化执行插件(瓦片式实现) 10x提速OLAP》
