最佳实践-PostgreSQL-模糊查询、正则查询和相似查询优化

PostgreSQL 拥有很强的文本搜索能力，除了支持全文检索，还支持模糊查询和正则查询。PostgreSQL 不仅内置了一般数据库都没有的 pg_trgm 插件，还内置了表达式索引和 GIN 索引的功能，为加速各类需求的查询提供了有力条件。
对于正则查询，PostgreSQL 可以通过 pg_trgm 插件来加速查询。模糊查询包括前模糊（有前缀的模糊）、后模糊（有后缀的模糊）和前后模糊（无前后缀的模糊），针对不同的模糊查询需求，PostgreSQL 有不同的优化方法：

对于前模糊和后模糊，PostgreSQL 与其他数据库一样，可以使用 B-tree 来加速查询。对于后模糊，也可以使用反转函数（reverse）的函数索引来加速查询。
对于前后模糊，一种方法是使用 pg_trgm 插件，利用 GIN 索引加速查询（特别是对于输入 3 个或 3 个以上字符的模糊查询有很好的效果）。另一种方法是自定义 GIN 表达式索引的方法，适合于定制的模糊查询。

模糊查询和正则查询都是找出完全符合条件的记录，还有一种需求是相似查询。本文将通过示例介绍如何进行模糊查询、正则查询和相似查询的优化。

背景信息

pg_trgm 模糊查询的原理

pg_trgm 先将字符串的前端添加 2 个空格，末尾添加 1 个空格。每连续的 3 个字符为一个 TOKEN，然后将各 TOKEN 拆开。最后，对 TOKEN 建立 GIN 倒排索引。
您可以通过如下方法查看字符串的 TOKEN。

test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)

pg_trgm 优化查询时要求字符个数的原因

使用 pg_trgm 优化前后模糊查询时，若要获得最好的效果，需满足如下条件：

对于前模糊查询，例如 a%，至少需要提供 1 个字符。（搜索的是token=' a'）
对于后模糊查询，例如 %ab，至少需要提供 2 个字符。（搜索的是token='ab '）
对于前后模糊查询，例如 %abcd%，至少需要提供 3 个字符。( 这个使用数组搜索，搜索的是包含{" a"," ab",abc,bcd,"cd "}的 TOKEN。)

由于 pg_trgm 生成的 TOKEN 是三个字符，只有在以上三个条件下，才能匹配到对应的 TOKEN。

使用示例

test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)

前模糊和后模糊查询的优化

前模糊查询的优化方法

PostgreSQL 支持使用 B-tree 来加速前模糊的查询。

当使用类型默认的 index ops class 时，仅适合于collate="C"的查询（当数据库默认的 lc_collate <> C 时，索引和查询都需要明确指定 collate “C”）。

注意：索引和查询条件的 collate 必须一致才能使用索引。

使用示例
1. test =# create table test (id int , info text );
2. CREATE TABLE
3. test=# insert into test select generate_series(1,1000000),md5(random()::text);
4. INSERT 0 1000000
5. test=# create index idx on test(info collate "C");
6. CREATE INDEX
8. test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
9. QUERY PLAN
10. ----------------------------------------------------------------------------------------------------------------------
11. Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)
12. Output: id, info
13. Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))
14. Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
15. Buffers: shared hit=18 read=3
16. Planning time: 0.424 ms
17. Execution time: 0.124 ms
18. (7 rows)
当数据库默认的 lc_collate <> C 时，还可以使用对应类型的 pattern ops 让 B-tree 索引支持模糊查询。使用 pattern ops 将使用字符查询而非 binary 查询的搜索方式。详情请参见文档 PostgreSQL 9.6.2 Documentation — 11.9. Operator Classes and Operator Families。
使用示例 test=# drop table test;
DROP TABLE
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN")
Buffers: shared hit=12 read=3
Planning time: 0.253 ms
Execution time: 0.081 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=15
Planning time: 0.141 ms
Execution time: 0.072 ms
(7 rows)
另外，使用类型对应的 pattern ops 时，索引搜索不仅支持 LIKE 的写法，还支持规则表达式的写法，如下所示：
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~ '^abcd'::text)
Buffers: shared hit=15
Planning time: 0.213 ms
Execution time: 0.083 ms
(7 rows)

后模糊查询的优化方法

PostgreSQL 支持使用反转函数索引来加速后模糊的查询。

当使用类型默认的 index ops class 时，仅适合于collate="C"的查询（当数据库默认的 lc_collate <> C 时，索引和查询都需要明确指定 collate “C”）。

注意：索引和查询条件的 collate 必须一致才能使用索引。

使用示例 test=# create index idx1 on test(reverse(info) collate "C");
CREATE INDEX
test=# select * from test limit 1;
id | info
----+----------------------------------
1 | b3275976cdd437a033d4329775a52514
(1 row)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.128 ms
Execution time: 0.122 ms
(7 rows)
test=# select * from test where reverse(info) like '4152%' collate "C";
id | info
--------+----------------------------------
847904 | abe2ecd90393b5275df8e34a39702514
414702 | 97f66d26545329321164042657d02514
191232 | 7820972c6220c2b01d46c11ebb532514
752742 | 93232ac39c6632e2540df44627c42514
217302 | 39e518893a1a7b1e691619bd1fc42514
1 | b3275976cdd437a033d4329775a52514
615718 | 4948f94c484c13dc6c4fae8a3db52514
308815 | fc2918ceff7c7a4dafd2e04031062514
149521 | 546d963842ea5ca593e622c810262514
811093 | 4b6eca2eb6b665af67b2813e91a62514
209000 | 1dfd0d4e326715c1739f031cca992514
937616 | 8827fd81f5b673fb5afecbe3e11b2514
419553 | bd6e01ce360af16137e8b6abc8ab2514
998324 | 7dff51c19dc5e5d9979163e7d14c2514
771518 | 8a54e30003a48539fff0aedc73ac2514
691566 | f90368348e3b6bf983fcbe10db2d2514
652274 | 8bf4a97b5f122a5540a21fa85ead2514
233437 | 739ed715fc203d47e37e79b5bcbe2514
(18 rows)
当数据库默认的 lc_collate <> C 时，还可以使用对应类型的 pattern ops 让 B-tree 索引支持模糊查询。使用 pattern ops 将使用字符查询而非 binary 查询的搜索方式。使用类型对应的 pattern ops 时，索引搜索不仅支持 LIKE 的写法，还支持规则表达式的写法。
使用示例test=# create index idx1 on test(reverse(info) text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text)
Buffers: shared hit=15
Planning time: 0.102 ms
Execution time: 0.072 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ '^4152';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))
Filter: (reverse(test.info) ~ '^4152'::text)
Buffers: shared hit=15
Planning time: 0.148 ms
Execution time: 0.087 ms
(7 rows)
[font="]前模糊和后模糊查询的单一索引优化方法

PostgreSQL 支持使用 pg_trgm 索引来加速同时包含前模糊和后模糊的查询。
为使索引过滤有较好的效果，前模糊查询至少需输入 1 个字符，后模糊查询至少需输入 2 个字符。若要高效支持多字节字符（如中文），数据库的 lc_ctype 不能为 “C”，只有 TOKEN 分割正确才能有较好的效果。

注意：索引和查询条件的 collate 必须一致才能使用索引。

使用示例

执行如下代码，创建一个表。 test=# \l+ test
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+----------+----------+------------+------------+-------------------+--------+------------+-------------
test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default |
(1 row)
test=# create extension pg_trgm;
test=# create table test001(c1 text);
CREATE TABLE

执行如下代码，生成随机中文字符串的函数。

test=# create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
CREATE FUNCTION

执行如下代码，生成随机数据。

test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);
INSERT 0 100000
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
CREATE INDEX
test=# select * from test001 limit 5;
c1
------------------------------------------
埳噪办甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅
秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅
蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓
馳泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼
襳铺煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤
(5 rows)

执行如下代码，进行模糊查询。

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Heap Blocks: exact=3
Buffers: shared hit=7
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=4
Planning time: 0.119 ms
Execution time: 0.063 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%恧顤'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=2
Buffers: shared hit=6
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)
Index Cond: (test001.c1 ~~ '%恧顤'::text)
Buffers: shared hit=4
Planning time: 0.136 ms
Execution time: 0.062 ms
(11 rows)

前后模糊查询的优化

大于或等于 3 个输入字符的前后模糊查询优化

PostgreSQL 支持使用 pg_trgm 插件来加速前后模糊的查询。为达到更好的加速效果，建议您输入 3 个或 3 个以上字符，详细原因请参见本文背景信息中关于 pg_trgm 优化查询时要求字符个数的原因。
若要让 pg_trgm 高效支持多字节字符（如中文），数据库的 lc_ctype 不能为 “C”，只有 TOKEN 分割正确才能达到加速效果。

使用示例

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Buffers: shared hit=4
Planning time: 0.170 ms
Execution time: 0.076 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢%'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢%'::text)
Buffers: shared hit=13292
Planning time: 0.133 ms
Execution time: 178.265 ms
(11 rows)

最佳实践-PostgreSQL-模糊查询、正则查询和相似查询优化

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

最佳实践-PostgreSQL-模糊查询、正则查询和相似查询优化

相关课程

相关文章

相关电子书

相关实验场景

相关镜像