标签
PostgreSQL , 模糊查询 , 正则匹配 , like , RE , regexp match
背景
PostgreSQL通过 pg_trgm插件,可以支持正则表达式、LIKE 前后模糊查询。
(要支持中文的话,必须确保lc_collate和lc_ctype <> C)
《中文模糊查询性能优化 by PostgreSQL trgm》
从语义上来讲,以下两个查询的语义是一样的。
select * from test where col like '%xxxxxx%';
select * from test where col ~ 'xxxxxx';
但是在数据库内部的处理上,使用了不同的处理逻辑,分别对应如下代码:
src/backend/utils/adt/like.c
src/backend/utils/adt/regexp.c
导致性能上有一定的差异。LIKE的性能会好很多。
模糊查询、正则查询 优化建议
由于RE的处理逻辑更加复杂,所以建议不需要正则表达式的时候,请使用LIKE,否则才使用正则表达式。
性能对比:
create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test select generate_series(1,100000), gen_hanzi(100);
INSERT 0 100000
postgres=# create index idx_test_1 on test using gin (info gin_trgm_ops);
CREATE INDEX
正则查询写法,虽然用了索引,但是目前对wchar字符处理不够好,扫描了整个gin树。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '婐绷乂畳';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=45261409.28..45261421.30 rows=10 width=36) (actual time=583.810..816.503 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '婐绷乂畳'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=4167
Buffers: shared hit=59783
-> Bitmap Index Scan on idx_test_1 (cost=0.00..45261409.28 rows=10 width=0) (actual time=583.237..583.237 rows=100000 loops=1)
Index Cond: (test.info ~ '婐绷乂畳'::text)
Buffers: shared hit=55616
Planning time: 0.150 ms
Execution time: 816.545 ms
(11 rows)
正则查询写法,对ascii字符效果是杠杠的。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '123';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.046..0.046 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0) (actual time=0.043..0.043 rows=0 loops=1)
Index Cond: (test.info ~ '123'::text)
Buffers: shared hit=4
Planning time: 0.146 ms
Execution time: 0.072 ms
(9 rows)
LIKE 写法,不管是ascii字符还是wchar,效果都是杠杠的。
-- wchar
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like '%婐绷乂畳%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=13.28..25.30 rows=10 width=36) (actual time=0.042..0.042 rows=1 loops=1)
Output: id, info
Recheck Cond: (test.info ~~ '%婐绷乂畳%'::text)
Heap Blocks: exact=1
Buffers: shared hit=8
-> Bitmap Index Scan on idx_test_1 (cost=0.00..13.27 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)
Index Cond: (test.info ~~ '%婐绷乂畳%'::text)
Buffers: shared hit=7
Planning time: 0.110 ms
Execution time: 0.108 ms
(10 rows)
-- ascii
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~~ '%123%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test (cost=39.40..2897.60 rows=4000 width=36) (actual time=0.018..0.018 rows=0 loops=1)
Output: id, info
Recheck Cond: (test.info ~~ '%123%'::text)
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test_1 (cost=0.00..38.40 rows=4000 width=0) (actual time=0.015..0.015 rows=0 loops=1)
Index Cond: (test.info ~~ '%123%'::text)
Buffers: shared hit=4
Planning time: 0.091 ms
Execution time: 0.046 ms
(9 rows)
从上面两个测试来看,like和正则表达使用的操作符是不一样的:
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Function | Description
------------+------+---------------+----------------+-------------+-------------+-------------------------
pg_catalog | ~~ | text | text | boolean | textlike | matches LIKE expression
pg_catalog | ~ | text | text | boolean | textregexeq | matches regular expression, case-sensitive
对应的textlike, textregexeq。代码在:
src/backend/utils/adt/like.c
src/backend/utils/adt/regexp.c
目前,建议对前后模糊查询,使用LIKE表达式,或者~~表达式,可以达到最优的查询效果。不要使用正则表达式的写法。
参考
src/backend/utils/adt/like.c
src/backend/utils/adt/regexp.c