PostgreSQL in 与 = any 的SQL语法异同与性能优化

本文涉及的产品
云原生数据库 PolarDB 分布式版,标准版 2核8GB
云数据库 RDS SQL Server,基础系列 2核4GB
RDS PostgreSQL Serverless,0.5-4RCU 50GB 3个月
推荐场景:
对影评进行热评分析
简介:

标签

PostgreSQL , in , = any (array()) , hash table , subplan , initplan


背景

数据库SQL也算一门比较神奇的语言了,比如很多需求可以有不同的SQL来实现:

我之前有输出过一个IN的测试,这里面实际上也涉及到多个语法,实现同一个功能点。测试CASE是1亿 in 100万的多种写法的性能差异。

《HTAP数据库 PostgreSQL 场景与性能测试之 25 - (OLTP) IN , EXISTS 查询》

例如下面三个QUERY的语义就是一样的

select * from tbl where id in (select id from t);  
  
select * from tbl where exists (select 1 from t where t.id=tbl.id);  
  
select * from tbl where id = any (array( select id from t ));  

但是不同的SQL,数据库可能会选择不一样的执行计划,并且执行效率可能千差万别。

几个例子

1、创建测试表,模拟1万 IN 100万的操作。

postgres=# create table t(id int);  
CREATE TABLE  
postgres=# insert into t select generate_series(1,100*10000);  
INSERT 0 1000000  

2、我们看一看不同写法的执行计划如何:

postgres=# explain select n = any(array(select id from t)) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=14425.00..14447.50 rows=1000 width=1)  
   InitPlan 1 (returns $0)  
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(3 rows)  
  
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=16925.00..16937.50 rows=1000 width=1)  
   SubPlan 1  
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(3 rows)  

3、你会发现两个语法用了不同的执行计划,一个是InitPlan, 一个是SubPlan.

对于IN的写法,work_mem参数会直接影响性能,work_mem的大小决定了subquery是否要装载到hash table。

postgres=# set work_mem ='1MB';  
SET  
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                     
--------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=0.00..12916012.50 rows=1000 width=1)  
   SubPlan 1  
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4)  
           ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(4 rows)  
  
postgres=# set work_mem ='100MB';  
SET  
postgres=# explain select n in (select id from t) from generate_series(1,10000) as n;  
                                   QUERY PLAN                                      
---------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=16925.00..16937.50 rows=1000 width=1)  
   SubPlan 1  
     ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)  
(3 rows)  
                if (subquery)  
                {  
                        /* Generate Paths for the ANY subquery; we'll need all rows */  
                        subroot = subquery_planner(root->glob, subquery,  
                                                                           root,  
                                                                           false, 0.0);  
  
                        /* Isolate the params needed by this specific subplan */  
                        plan_params = root->plan_params;  
                        root->plan_params = NIL;  
  
                        /* Select best Path and turn it into a Plan */  
                        final_rel = fetch_upper_rel(subroot, UPPERREL_FINAL, NULL);  
                        best_path = final_rel->cheapest_total_path;  
  
                        plan = create_plan(subroot, best_path);  
  
                        /* Now we can check if it'll fit in work_mem */  
                        /* XXX can we check this at the Path stage? */  
                        if (subplan_is_hashable(plan))  
                        {  
                                SubPlan    *hashplan;  
                                AlternativeSubPlan *asplan;  
  
                                /* OK, convert to SubPlan format. */  
                                hashplan = castNode(SubPlan,  
                                                                        build_subplan(root, plan, subroot,  
                                                                                                  plan_params,  
                                                                                                  ANY_SUBLINK, 0,  
                                                                                                  newtestexpr,  
                                                                                                  false, true));  
                                /* Check we got what we expected */  
                                Assert(hashplan->parParam == NIL);  
                                Assert(hashplan->useHashTable);  
                                /* build_subplan won't have filled in paramIds */  
                                hashplan->paramIds = paramIds;  
  
                                /* Leave it to the executor to decide which plan to use */  
                                asplan = makeNode(AlternativeSubPlan);  
                                asplan->subplans = list_make2(result, hashplan);  
                                result = (Node *) asplan;  
                        }  
                }  
/*  
 * subplan_is_hashable: can we implement an ANY subplan by hashing?  
 */  
static bool  
subplan_is_hashable(Plan *plan)  
{  
        double          subquery_size;  
  
        /*  
         * The estimated size of the subquery result must fit in work_mem. (Note:  
         * we use heap tuple overhead here even though the tuples will actually be  
         * stored as MinimalTuples; this provides some fudge factor for hashtable  
         * overhead.)  
         */  
        subquery_size = plan->plan_rows *  
                (MAXALIGN(plan->plan_width) + MAXALIGN(SizeofHeapTupleHeader));  
        if (subquery_size > work_mem * 1024L)  
                return false;  
  
        return true;  
}  

代码里面注释中,针对in, exists, any的subplan优化器实现也有一些介绍,涉及到性能相关:

实际上exists这里有提到,匹配到第一条就结束,所以评估是否使用哈希表时可能需要的容量很小。

        /*  
         * For an EXISTS subplan, tell lower-level planner to expect that only the  
         * first tuple will be retrieved.  For ALL and ANY subplans, we will be  
         * able to stop evaluating if the test condition fails or matches, so very  
         * often not all the tuples will be retrieved; for lack of a better idea,  
         * specify 50% retrieval.  For EXPR, MULTIEXPR, and ROWCOMPARE subplans,  
         * use default behavior (we're only expecting one row out, anyway).  
         *  
         * NOTE: if you change these numbers, also change cost_subplan() in  
         * path/costsize.c.  
         *  
         * XXX If an ANY subplan is uncorrelated, build_subplan may decide to hash  
         * its output.  In that case it would've been better to specify full  
         * retrieval.  At present, however, we can only check hashability after  
         * we've made the subplan :-(.  (Determining whether it'll fit in work_mem  
         * is the really hard part.)  Therefore, we don't want to be too  
         * optimistic about the percentage of tuples retrieved, for fear of  
         * selecting a plan that's bad for the materialization case.  
         */  

in vs = any vs exists性能对比

1、in, work_mem装不下subquery

postgres=# set work_mem ='64kB';  
  
postgres=# explain analyze select n in (select id from t) from generate_series(1,10000) as n;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on generate_series n  (cost=0.00..12916012.50 rows=1000 width=1) (actual time=1.321..11484.646 rows=10000 loops=1)  
   SubPlan 1  
     ->  Materialize  (cost=0.00..23332.00 rows=1000000 width=4) (actual time=0.003..0.619 rows=5000 loops=10000)  
           ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..1.800 rows=10000 loops=1)  
 Planning time: 0.091 ms  
 Execution time: 11485.905 ms  
(6 rows)  

2、in, work_mem装下了subquery

postgres=# set work_mem ='64MB';  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select n in (select id from t) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=16925.00..16937.50 rows=1000 width=1) (actual time=497.142..500.701 rows=10000 loops=1)  
   Output: (hashed SubPlan 1)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=4425  
   SubPlan 1  
     ->  Seq Scan on public.t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.024..124.703 rows=1000000 loops=1)  
           Output: t.id  
           Buffers: shared hit=4425  
 Planning time: 0.085 ms  
 Execution time: 507.427 ms  
(10 rows)  

3、= any, work_mem很小无所谓,因为不涉及hashtable

postgres=# set work_mem ='64kB';  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select n = any(array(select id from t)) from generate_series(1,10000) as n;  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=14425.00..14447.50 rows=1000 width=1) (actual time=233.871..446.120 rows=10000 loops=1)  
   Output: (n.n = ANY ($0))  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=4425, temp read=19 written=18  
   InitPlan 1 (returns $0)  
     ->  Seq Scan on public.t  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..119.976 rows=1000000 loops=1)  
           Output: t.id  
           Buffers: shared hit=4425  
 Planning time: 0.085 ms  
 Execution time: 447.666 ms  
(10 rows)  

4、exists, work_mem需求量较少(exists由于优化器在匹配到1条后即刻返回,所以会选择使用索引,性能就非常好。)

postgres=# set work_mem ='64kB';  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..2852.50 rows=1000 width=1) (actual time=1.172..18.893 rows=10000 loops=1)  
   Output: (SubPlan 1)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=40027, temp read=19 written=18  
   SubPlan 1  
     ->  Index Only Scan using idx_t_1 on public.t  (cost=0.42..2.84 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)  
           Index Cond: (t.id = n.n)  
           Heap Fetches: 10000  
           Buffers: shared hit=40027  
 Planning time: 0.118 ms  
 Execution time: 19.902 ms  
(11 rows)  
  
  
  
postgres=# set work_mem ='64MB';  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..2852.50 rows=1000 width=1) (actual time=0.642..17.635 rows=10000 loops=1)  
   Output: (alternatives: SubPlan 1 or hashed SubPlan 2)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=40027  
   SubPlan 1  
     ->  Index Only Scan using idx_t_1 on public.t  (cost=0.42..2.84 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=10000)  
           Index Cond: (t.id = n.n)  
           Heap Fetches: 10000  
           Buffers: shared hit=40027  
   SubPlan 2  
     ->  Seq Scan on public.t t_1  (cost=0.00..14425.00 rows=1000000 width=4) (never executed)  
           Output: t_1.id  
 Planning time: 0.129 ms  
 Execution time: 18.612 ms  
(14 rows)  

5、如果把索引干掉,exists性能就会下降了,同时性能也和是否使用哈希表有关。

postgres=# drop index idx_t_1;  
  
  
postgres=# set work_mem ='64kB';  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                                 QUERY PLAN                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..16925010.00 rows=1000 width=1) (actual time=1.072..3036.590 rows=10000 loops=1)  
   Output: (SubPlan 1)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=226260, temp read=19 written=18  
   SubPlan 1  
     ->  Seq Scan on public.t  (cost=0.00..16925.00 rows=1 width=0) (actual time=0.303..0.303 rows=1 loops=10000)  
           Filter: (t.id = n.n)  
           Rows Removed by Filter: 5000  
           Buffers: shared hit=226260  
 Planning time: 0.087 ms  
 Execution time: 3037.904 ms  
(11 rows)  
  
postgres=# set work_mem ='64MB';  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select exists (select 1 from t where t.id=n.n) from generate_series(1,10000) as n;  
                                                                 QUERY PLAN                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 Function Scan on pg_catalog.generate_series n  (cost=0.00..16925010.00 rows=1000 width=1) (actual time=517.150..521.142 rows=10000 loops=1)  
   Output: (alternatives: SubPlan 1 or hashed SubPlan 2)  
   Function Call: generate_series(1, 10000)  
   Buffers: shared hit=4425  
   SubPlan 1  
     ->  Seq Scan on public.t  (cost=0.00..16925.00 rows=1 width=0) (never executed)  
           Filter: (t.id = n.n)  
   SubPlan 2  
     ->  Seq Scan on public.t t_1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.027..127.111 rows=1000000 loops=1)  
           Output: t_1.id  
           Buffers: shared hit=4425  
 Planning time: 0.098 ms  
 Execution time: 527.986 ms  
(13 rows)  

小结

1、使用= any的写法,不会走subplan,因此不涉及hash table的问题。和work_mem设置大小无关。性能比较暴力,特别是当它不是在subquery里面时,性能贼好。

很多场景都可以使用,例如update limit, delete limit(阅后即焚),又或者就是简单的IN查询需求。

delete from tbl where ctid = any(array( select ctid from tbl where xxx limit xxx ));  
  
update tbl set xxx=xxx where ctid = any(array( select ctid from tbl where xxx limit xxx ));  
  
select * from tbl where id = any (array( query.... ));  

推荐指数,五星。

2、exists,由于优化器会默认它只需要搜索到1条命中目标就不搜了,所以优化器评估是否使用hash table时,需要的内存相对较少,即使较小的work_mem也可能使用hashtable。

推荐指数,四星。

3、in (),当出现在subquery中时,优化器评估这个subquery是否要构建哈希TABLE,直接和subquery的大小相关,所以需要较大的work_mem才会选择使用hashtable。

推荐指数,三星。

最后,由于这些SQL语义都相同,在内核优化时,可以考虑做一些QUERY REWRITE,来优化这样的SQL。

这样的话,用户可以不改SQL,就达到提高效率的目的。

感谢为此付出努力的所有PostgreSQL内核开发的小伙伴们。

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
12天前
|
SQL 存储 关系型数据库
SQL `CREATE DATABASE` 语法
【11月更文挑战第10天】
34 3
|
17天前
|
SQL 关系型数据库 数据库
sql语法
【10月更文挑战第26天】sql语法
28 5
|
2月前
|
存储 SQL 关系型数据库
【MySQL调优】如何进行MySQL调优?从参数、数据建模、索引、SQL语句等方向,三万字详细解读MySQL的性能优化方案(2024版)
MySQL调优主要分为三个步骤:监控报警、排查慢SQL、MySQL调优。 排查慢SQL:开启慢查询日志 、找出最慢的几条SQL、分析查询计划 。 MySQL调优: 基础优化:缓存优化、硬件优化、参数优化、定期清理垃圾、使用合适的存储引擎、读写分离、分库分表; 表设计优化:数据类型优化、冷热数据分表等。 索引优化:考虑索引失效的11个场景、遵循索引设计原则、连接查询优化、排序优化、深分页查询优化、覆盖索引、索引下推、用普通索引等。 SQL优化。
541 15
【MySQL调优】如何进行MySQL调优?从参数、数据建模、索引、SQL语句等方向,三万字详细解读MySQL的性能优化方案(2024版)
|
1月前
|
SQL 监控 Oracle
Oracle SQL性能优化全面指南
在数据库管理领域,Oracle SQL性能优化是确保数据库高效运行和数据查询速度的关键
|
1月前
|
SQL 数据挖掘 数据库
SQL查询每秒的数据:技巧、方法与性能优化
id="">SQL查询功能详解 SQL(Structured Query Language,结构化查询语言)是一种专门用于与数据库进行沟通和操作的语言
|
25天前
|
SQL 数据库
SQL数据库基础语法入门
[link](http://www.vvo.net.cn/post/082935.html)
|
1月前
|
SQL 存储 关系型数据库
mysql SQL必知语法
本文详细介绍了MySQLSQL的基本语法,包括SELECT、FROM、WHERE、GROUPBY、HAVING、ORDERBY等关键字的使用,以及数据库操作如创建、删除表,数据类型,插入、查询、过滤、排序、连接和汇总数据的方法。通过学习这些内容,读者将能更好地管理和操
14 0
|
1月前
|
SQL 关系型数据库 MySQL
Mysql(2)—SQL语法详解
SQL(Structured Query Language,结构化查询语言)是一种用于管理关系型数据库的标准编程语言。它主要用于数据的查询、插入、更新和删除等操作。SQL最初在1970年代由IBM的研究人员开发,旨在处理关系数据模型。
29 0
|
2月前
|
SQL 关系型数据库 C语言
PostgreSQL SQL扩展 ---- C语言函数(三)
可以用C(或者与C兼容,比如C++)语言编写用户自定义函数(User-defined functions)。这些函数被编译到动态可加载目标文件(也称为共享库)中并被守护进程加载到服务中。“C语言函数”与“内部函数”的区别就在于动态加载这个特性,二者的实际编码约定本质上是相同的(因此,标准的内部函数库为用户自定义C语言函数提供了丰富的示例代码)
|
3月前
|
SQL 存储 数据库

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版