PostgreSQL 模糊查询+大量重复值匹配 实践 - 分区索引 = any (array())

本文涉及的产品
RDS SQL Server Serverless,2-4RCU 50GB 3个月
推荐场景:
云数据库 RDS MySQL,集群系列 2核4GB
推荐场景:
搭建个人博客
云数据库 RDS SQL Server,基础系列 2核4GB
简介:

标签

PostgreSQL , 模糊查询 , exists , 重复值 , = any (array())


背景

在一些场景中,可能需要对大量的数据TBL_A进行A字段模糊查询,同时将匹配记录的另一个字段B,作为条件输入项,匹配另一张表TBL_B,输出另一张表TBL_B的匹配记录。

例子

小表,表示每个ID的详细信息

create table a (  
  id int primary key,  
  xx text  
);  
  
insert into a select id,'test'||id from generate_series(1,100000) t(id);  

大表,表示FEED日志,一个AID可能产生很多条记录

create table b (  
  aid int,  
  info text  
);  

写入1000万数据

insert into b select random()*9, md5(random()::text) from generate_series(1,10000000);  
  
create extension pg_trgm;  
  
create index idx_b_2 on b using gin (info gin_trgm_ops);  

如果需要用户需要输入AID字段的条件,则建议用这样的复合索引

create extension btree_gin;  
  
create index idx_b_1 on b using gin (aid, info gin_trgm_ops);  

查询info包含某个字符串(前后模糊查询)的数据,找出它们的AID对应的详细信息。

方法1、

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id = any (array(  
  select aid from b where info ~~ '%abc%'  
));  
                                                             QUERY PLAN                                                                
-------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using a_pkey on public.a  (cost=22849.95..22865.95 rows=10 width=13) (actual time=72.259..72.267 rows=9 loops=1)  
   Output: a.id, a.xx  
   Index Cond: (a.id = ANY ($0))  
   Buffers: shared hit=20209  
   InitPlan 1 (returns $0)  
     ->  Bitmap Heap Scan on public.b  (cost=796.03..22849.65 rows=101010 width=4) (actual time=15.562..61.768 rows=73204 loops=1)  
           Output: b.aid  
           Recheck Cond: (b.info ~~ '%abc%'::text)  
           Heap Blocks: exact=20180  
           Buffers: shared hit=20188  
           ->  Bitmap Index Scan on idx_b_1  (cost=0.00..770.78 rows=101010 width=0) (actual time=12.658..12.658 rows=73204 loops=1)  
                 Index Cond: (b.info ~~ '%abc%'::text)  
                 Buffers: shared hit=8  
 Planning Time: 0.157 ms  
 Execution Time: 72.315 ms  
(15 rows)  

方法2、

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where exists (select 1 from b where info ~~ '%abc%' and a.id=b.aid);  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Merge Join  (cost=23102.75..23103.04 rows=10 width=13) (actual time=74.316..74.325 rows=9 loops=1)  
   Output: a.id, a.xx  
   Inner Unique: true  
   Merge Cond: (a.id = b.aid)  
   Buffers: shared hit=20191  
   ->  Index Scan using a_pkey on public.a  (cost=0.29..1719.49 rows=100000 width=13) (actual time=0.018..0.021 rows=10 loops=1)  
         Output: a.id, a.xx  
         Buffers: shared hit=3  
   ->  Sort  (cost=23102.44..23102.47 rows=10 width=4) (actual time=74.291..74.292 rows=10 loops=1)  
         Output: b.aid  
         Sort Key: b.aid  
         Sort Method: quicksort  Memory: 25kB  
         Buffers: shared hit=20188  
         ->  HashAggregate  (cost=23102.18..23102.28 rows=10 width=4) (actual time=74.279..74.281 rows=10 loops=1)  
               Output: b.aid  
               Group Key: b.aid  
               Buffers: shared hit=20188  
               ->  Bitmap Heap Scan on public.b  (cost=796.03..22849.65 rows=101010 width=4) (actual time=15.935..61.383 rows=73204 loops=1)  
                     Output: b.aid  
                     Recheck Cond: (b.info ~~ '%abc%'::text)  
                     Heap Blocks: exact=20180  
                     Buffers: shared hit=20188  
                     ->  Bitmap Index Scan on idx_b_1  (cost=0.00..770.78 rows=101010 width=0) (actual time=13.027..13.027 rows=73204 loops=1)  
                           Index Cond: (b.info ~~ '%abc%'::text)  
                           Buffers: shared hit=8  
 Planning Time: 0.344 ms  
 Execution Time: 74.380 ms  
(27 rows)  

结果

postgres=# select * from a where id = any (array(  
  select aid from b where info ~~ '%abc%'  
));  
 id |  xx     
----+-------  
  1 | test1  
  2 | test2  
  3 | test3  
  4 | test4  
  5 | test5  
  6 | test6  
  7 | test7  
  8 | test8  
  9 | test9  
(9 rows)  
  
Time: 57.789 ms  

其他方法

1、使用AID+模糊查询,两个条件,实际效果并不好,因为GIN复合索引实际上是内部BITMAP,并不是真正意义上的多颗树结构(分区结构)。

do language plpgsql $$  
declare  
  v_id int;  
  v_xx text;  
begin  
  for v_id,v_xx in select id,xx from a loop  
    perform 1 from b where aid=v_id and info ~~ '%abc%' limit 1;  
    if found then   
      raise notice '%, %', v_id, v_xx;  
    end if;  
  end loop;  
end;  
$$;  

2、如果aid集合较小,并且可以穷举,也可以使用分区索引(多partial index)。

分区索引模拟,如下

nohup psql -c "create index idx_b_3 on b using gin (info gin_trgm_ops) where aid=1;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_4 on b using gin (info gin_trgm_ops) where aid=2;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_5 on b using gin (info gin_trgm_ops) where aid=3;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_6 on b using gin (info gin_trgm_ops) where aid=4;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_7 on b using gin (info gin_trgm_ops) where aid=5;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_8 on b using gin (info gin_trgm_ops) where aid=6;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_9 on b using gin (info gin_trgm_ops) where aid=7;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_10 on b using gin (info gin_trgm_ops) where aid=8;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_11 on b using gin (info gin_trgm_ops) where aid=9;" >/dev/null 2>&1 &  
nohup psql -c "create index idx_b_12 on b using gin (info gin_trgm_ops) where aid not in (1,2,3,4,5,6,7,8,9);" >/dev/null 2>&1 &  

查询语句模拟如下

select * from a where id = any (array(  
select * from (select aid from b where aid=1 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=2 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=3 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=4 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=5 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=6 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=7 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=8 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=9 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid not in (1,2,3,4,5,6,7,8,9) and info ~~ '%abc%' limit 1) t  
));  
  
 id |  xx     
----+-------  
  1 | test1  
  2 | test2  
  3 | test3  
  4 | test4  
  5 | test5  
  6 | test6  
  7 | test7  
  8 | test8  
  9 | test9  
(9 rows)  
  
Time: 4.970 ms  

执行计划

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id = any (array(  
select * from (select aid from b where aid=1 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=2 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=3 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=4 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=5 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=6 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=7 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=8 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid=9 and info ~~ '%abc%' limit 1) t  
union all  
select * from (select aid from b where aid not in (1,2,3,4,5,6,7,8,9) and info ~~ '%abc%' limit 1) t  
));  
                                                            QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using a_pkey on public.a  (cost=199.57..215.57 rows=10 width=13) (actual time=2.833..2.842 rows=9 loops=1)  
   Output: a.id, a.xx  
   Index Cond: (a.id = ANY ($0))  
   Buffers: shared hit=55  
   InitPlan 1 (returns $0)  
     ->  Append  (cost=0.00..199.27 rows=10 width=4) (actual time=0.249..2.799 rows=10 loops=1)  
           Buffers: shared hit=34  
           ->  Limit  (cost=0.00..15.31 rows=1 width=4) (actual time=0.249..0.249 rows=1 loops=1)  
                 Output: b.aid  
                 Buffers: shared hit=3  
                 ->  Seq Scan on public.b  (cost=0.00..170791.00 rows=11158 width=4) (actual time=0.247..0.247 rows=1 loops=1)  
                       Output: b.aid  
                       Filter: ((b.info ~~ '%abc%'::text) AND (b.aid = 1))  
                       Rows Removed by Filter: 1131  
                       Buffers: shared hit=3  
           ->  Limit  (cost=0.00..14.75 rows=1 width=4) (actual time=0.243..0.244 rows=1 loops=1)  
                 Output: b_1.aid  
                 Buffers: shared hit=3  
                 ->  Seq Scan on public.b b_1  (cost=0.00..170791.00 rows=11576 width=4) (actual time=0.243..0.243 rows=1 loops=1)  
                       Output: b_1.aid  
                       Filter: ((b_1.info ~~ '%abc%'::text) AND (b_1.aid = 2))  
                       Rows Removed by Filter: 1208  
                       Buffers: shared hit=3  
           ->  Limit  (cost=0.00..15.39 rows=1 width=4) (actual time=0.023..0.023 rows=1 loops=1)  
                 Output: b_2.aid  
                 Buffers: shared hit=1  
                 ->  Seq Scan on public.b b_2  (cost=0.00..170791.00 rows=11098 width=4) (actual time=0.023..0.023 rows=1 loops=1)  
                       Output: b_2.aid  
                       Filter: ((b_2.info ~~ '%abc%'::text) AND (b_2.aid = 3))  
                       Rows Removed by Filter: 70  
                       Buffers: shared hit=1  
           ->  Limit  (cost=0.00..14.96 rows=1 width=4) (actual time=0.738..0.738 rows=1 loops=1)  
                 Output: b_3.aid  
                 Buffers: shared hit=8  
                 ->  Seq Scan on public.b b_3  (cost=0.00..170791.00 rows=11414 width=4) (actual time=0.738..0.738 rows=1 loops=1)  
                       Output: b_3.aid  
                       Filter: ((b_3.info ~~ '%abc%'::text) AND (b_3.aid = 4))  
                       Rows Removed by Filter: 3563  
                       Buffers: shared hit=8  
           ->  Limit  (cost=0.00..15.18 rows=1 width=4) (actual time=0.068..0.068 rows=1 loops=1)  
                 Output: b_4.aid  
                 Buffers: shared hit=1  
                 ->  Seq Scan on public.b b_4  (cost=0.00..170791.00 rows=11253 width=4) (actual time=0.068..0.068 rows=1 loops=1)  
                       Output: b_4.aid  
                       Filter: ((b_4.info ~~ '%abc%'::text) AND (b_4.aid = 5))  
                       Rows Removed by Filter: 311  
                       Buffers: shared hit=1  
           ->  Limit  (cost=0.00..15.09 rows=1 width=4) (actual time=0.178..0.178 rows=1 loops=1)  
                 Output: b_5.aid  
                 Buffers: shared hit=2  
                 ->  Seq Scan on public.b b_5  (cost=0.00..170791.00 rows=11316 width=4) (actual time=0.178..0.178 rows=1 loops=1)  
                       Output: b_5.aid  
                       Filter: ((b_5.info ~~ '%abc%'::text) AND (b_5.aid = 6))  
                       Rows Removed by Filter: 894  
                       Buffers: shared hit=2  
           ->  Limit  (cost=0.00..15.50 rows=1 width=4) (actual time=0.323..0.323 rows=1 loops=1)  
                 Output: b_6.aid  
                 Buffers: shared hit=4  
                 ->  Seq Scan on public.b b_6  (cost=0.00..170791.00 rows=11020 width=4) (actual time=0.322..0.322 rows=1 loops=1)  
                       Output: b_6.aid  
                       Filter: ((b_6.info ~~ '%abc%'::text) AND (b_6.aid = 7))  
                       Rows Removed by Filter: 1599  
                       Buffers: shared hit=4  
           ->  Limit  (cost=0.00..15.25 rows=1 width=4) (actual time=0.449..0.449 rows=1 loops=1)  
                 Output: b_7.aid  
                 Buffers: shared hit=5  
                 ->  Seq Scan on public.b b_7  (cost=0.00..170791.00 rows=11199 width=4) (actual time=0.448..0.448 rows=1 loops=1)  
                       Output: b_7.aid  
                       Filter: ((b_7.info ~~ '%abc%'::text) AND (b_7.aid = 8))  
                       Rows Removed by Filter: 2231  
                       Buffers: shared hit=5  
           ->  Limit  (cost=0.00..32.33 rows=1 width=4) (actual time=0.130..0.131 rows=1 loops=1)  
                 Output: b_8.aid  
                 Buffers: shared hit=2  
                 ->  Seq Scan on public.b b_8  (cost=0.00..170791.00 rows=5283 width=4) (actual time=0.130..0.130 rows=1 loops=1)  
                       Output: b_8.aid  
                       Filter: ((b_8.info ~~ '%abc%'::text) AND (b_8.aid = 9))  
                       Rows Removed by Filter: 623  
                       Buffers: shared hit=2  
           ->  Limit  (cost=0.00..45.36 rows=1 width=4) (actual time=0.392..0.392 rows=1 loops=1)  
                 Output: b_9.aid  
                 Buffers: shared hit=5  
                 ->  Seq Scan on public.b b_9  (cost=0.00..258291.00 rows=5694 width=4) (actual time=0.392..0.392 rows=1 loops=1)  
                       Output: b_9.aid  
                       Filter: ((b_9.info ~~ '%abc%'::text) AND (b_9.aid <> ALL ('{1,2,3,4,5,6,7,8,9}'::integer[])))  
                       Rows Removed by Filter: 1931  
                       Buffers: shared hit=5  
 Planning Time: 1.411 ms  
 Execution Time: 2.963 ms  
(89 rows)  

可见,分区索引,效果是比较好的。期待PG尽快支持。

另外,对于本例,用户需要IN很多值时,建议使用较大的work_mem,同时使用any(array())的写法。

work_mem='4MB'  

参考

《HTAP数据库 PostgreSQL 场景与性能测试之 25 - (OLTP) IN , EXISTS 查询》

《HTAP数据库 PostgreSQL 场景与性能测试之 26 - (OLTP) NOT IN、NOT EXISTS 查询》

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
2月前
|
缓存 关系型数据库 MySQL
MySQL索引策略与查询性能调优实战
在实际应用中,需要根据具体的业务需求和查询模式,综合运用索引策略和查询性能调优方法,不断地测试和优化,以提高MySQL数据库的查询性能。
238 66
|
1月前
|
SQL 关系型数据库 MySQL
深入解析MySQL的EXPLAIN:指标详解与索引优化
MySQL 中的 `EXPLAIN` 语句用于分析和优化 SQL 查询,帮助你了解查询优化器的执行计划。本文详细介绍了 `EXPLAIN` 输出的各项指标,如 `id`、`select_type`、`table`、`type`、`key` 等,并提供了如何利用这些指标优化索引结构和 SQL 语句的具体方法。通过实战案例,展示了如何通过创建合适索引和调整查询语句来提升查询性能。
167 9
|
3月前
|
存储 关系型数据库 MySQL
阿里面试:为什么要索引?什么是MySQL索引?底层结构是什么?
尼恩是一位资深架构师,他在自己的读者交流群中分享了关于MySQL索引的重要知识点。索引是帮助MySQL高效获取数据的数据结构,主要作用包括显著提升查询速度、降低磁盘I/O次数、优化排序与分组操作以及提升复杂查询的性能。MySQL支持多种索引类型,如主键索引、唯一索引、普通索引、全文索引和空间数据索引。索引的底层数据结构主要是B+树,它能够有效支持范围查询和顺序遍历,同时保持高效的插入、删除和查找性能。尼恩还强调了索引的优缺点,并提供了多个面试题及其解答,帮助读者在面试中脱颖而出。相关资料可在公众号【技术自由圈】获取。
|
14天前
|
SQL 存储 关系型数据库
MySQL秘籍之索引与查询优化实战指南
最左前缀原则。不冗余原则。最大选择性原则。所谓前缀索引,说白了就是对文本的前几个字符建立索引(具体是几个字符在建立索引时去指定),比如以产品名称的前 10 位来建索引,这样建立起来的索引更小,查询效率更快!
77 22
 MySQL秘籍之索引与查询优化实战指南
|
16天前
|
存储 关系型数据库 MySQL
MySQL中为什么要使用索引合并(Index Merge)?
通过这些内容的详细介绍和实际案例分析,希望能帮助您深入理解索引合并及其在MySQL中的
68 10
|
1月前
|
缓存 关系型数据库 MySQL
MySQL 索引优化以及慢查询优化
通过本文的介绍,希望您能够深入理解MySQL索引优化和慢查询优化的方法,并在实际应用中灵活运用这些技术,提升数据库的整体性能。
78 18
|
28天前
|
存储 Oracle 关系型数据库
索引在手,查询无忧:MySQL索引简介
MySQL 是一款广泛使用的关系型数据库管理系统,在2024年5月的DB-Engines排名中得分1084,仅次于Oracle。本文介绍MySQL索引的工作原理和类型,包括B+Tree、Hash、Full-text索引,以及主键、唯一、普通索引等,帮助开发者优化查询性能。索引类似于图书馆的分类系统,能快速定位数据行,极大提高检索效率。
59 8
|
1月前
|
缓存 关系型数据库 MySQL
MySQL 索引优化以及慢查询优化
通过本文的介绍,希望您能够深入理解MySQL索引优化和慢查询优化的方法,并在实际应用中灵活运用这些技术,提升数据库的整体性能。
64 7
|
1月前
|
缓存 关系型数据库 MySQL
MySQL 索引优化与慢查询优化:原理与实践
通过本文的介绍,希望您能够深入理解MySQL索引优化与慢查询优化的原理和实践方法,并在实际项目中灵活运用这些技术,提升数据库的整体性能。
99 5
|
1月前
|
存储 关系型数据库 MySQL
Mysql索引:深入理解InnoDb聚集索引与MyisAm非聚集索引
通过本文的介绍,希望您能深入理解InnoDB聚集索引与MyISAM非聚集索引的概念、结构和应用场景,从而在实际工作中灵活运用这些知识,优化数据库性能。
140 7

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版