背景
场景:
1、电商
商品、店铺、描述等. 搜索框模糊查询
2、企业ERP、社交APP、网站
内容搜索框, 模糊查询
挑战:
传统数据库不支持模糊查询, 需要全表扫描, 性能极差, 少量并发查询极可能导致雪崩.
通常的解决方案需要将数据同步到搜索引擎, 这种解决方案的弊端:
- 研发、软硬件成本增加、
- 系统问题增多(同步延迟问题、同步异常问题、同步一致性问题)、
- 开发灵活性下降(无法同时过滤模糊查询条件与表的其他条件, 需要业务层交换数据)
- 不支持同时过滤like与其他条件
PG 解决方案:
- 倒排索引GIN + btree_gin: 支持模糊查询+任意字段组合过滤.
202105/20210502_01.md 《PostgreSQL 应用开发解决方案最佳实践系列课程- 1. 中文分词与模糊查询》
202009/20200913_01.md 《[直播]在数据库中跑全文检索、模糊查询SQL会不会被开除?》
202009/20200912_01.md 《PostgreSQL 模糊查询、相似查询(like '%xxx%') pg_bigm 比pg_trgm 优势在哪?》
202003/20200330_01.md 《PostgreSQL 模糊查询增强插件pgroonga , pgbigm (含单字、双字、多字、多字节字符) - 支持JSON模糊查询等》
202001/20200116_01.md 《PostgreSQL+MySQL 联合解决方案- 第12课视频- 全文检索、中文分词、模糊查询、相似文本查询》
201903/20190320_01.md 《PostgreSQL 数组里面的元素,模糊搜索,模糊查询,like,前后百分号,正则查询,倒排索引》
201805/20180502_01.md 《PostgreSQL 模糊查询+大量重复值匹配实践- 分区索引= any (array())》
201801/20180118_03.md 《PostgreSQL 模糊查询与正则匹配性能差异与SQL优化建议》
201712/20171205_02.md 《用PostgreSQL 做实时高效搜索引擎- 全文检索、模糊查询、正则查询、相似查询、ADHOC查询》
201711/20171107_13.md 《HTAP数据库PostgreSQL 场景与性能测试之12 - (OLTP) 字符串搜索- 前后模糊查询》
201711/20171107_10.md 《HTAP数据库PostgreSQL 场景与性能测试之9 - (OLTP) 字符串模糊查询- 含索引实时写入》
201710/20171020_01.md 《多国语言字符串的加密、全文检索、模糊查询的支持》
201710/20171016_04.md 《Greenplum 模糊查询实践》
201704/20170426_01.md 《PostgreSQL 模糊查询最佳实践- (含单字、双字、多字模糊查询方法)》
201701/20170106_04.md 《PostgreSQL 全表全字段模糊查询的毫秒级高效实现- 搜索引擎颤抖了》
201612/20161231_01.md 《从难缠的模糊查询聊开- PostgreSQL独门绝招之一GIN , GiST , SP-GiST , RUM 索引原理与技术背景》
201605/20160506_02.md 《中文模糊查询性能优化by PostgreSQL trgm》
201603/20160302_01.md 《PostgreSQL 百亿数据秒级响应正则及模糊查询》
例子
1、创建一个生成随机汉字字符串的函数
create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
------------------------------------------
埳噪办甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅
秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅
蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓
馳泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼
襳铺煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤
2、创建测试表
create unlogged table tbl (
id serial8 primary key,
gid int,
c1 text,
c2 text,
c3 text,
ts timestamp
);
3、写入200万记录
insert into tbl (gid,c1,c2,c3,ts)
select random()*10000, gen_hanzi(16), gen_hanzi(32), gen_hanzi(128), clock_timestamp()
from generate_series(1,1000000);
insert into tbl (gid,c1,c2,c3,ts)
select random()*10000, gen_hanzi(16)||'西出函谷关'||gen_hanzi(16), gen_hanzi(32), gen_hanzi(128), clock_timestamp()
from generate_series(1,100);
insert into tbl (gid,c1,c2,c3,ts)
select random()*10000, gen_hanzi(16), gen_hanzi(32)||'西出函谷关'||gen_hanzi(16), gen_hanzi(128), clock_timestamp()
from generate_series(1,100);
insert into tbl (gid,c1,c2,c3,ts)
select random()*10000, gen_hanzi(16), gen_hanzi(32), gen_hanzi(128)||'西出函谷关'||gen_hanzi(16), clock_timestamp()
from generate_series(1,100);
insert into tbl (gid,c1,c2,c3,ts)
select random()*10000, gen_hanzi(16), gen_hanzi(32), gen_hanzi(128), clock_timestamp()
from generate_series(1,1000000);
postgres=> select * from tbl limit 10;
id | gid | c1 | c2 | c3 | ts
----+------+----------------------------------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------
1 | 8102 | 焼蒟郎犉闰甀鑹训峔浧讵詆藕蔺柟蠬| 谂陱膁恚囓潽愪秾緌蟄錦姄肹囵縱錨醹狼憬榉烚态孠瘀觹枿瞴茒鲚煀踸槍| 齌壑落噴故傾虗靚蜃食晰跻年盹仿菣底依愔參鐞箍晀泭澴涭垧踟鹠膵硊杯剋蜩崚匵奊閷譳晍鯷瞇彇
敘奍绹妱刧鍾顄坝嵖斴敉崝欇銳蒺桰歆尒岊蟐鯿稕巑愬虊睬螈壓輝镄屙縿鸣唈餮袢鯖苁栛缅銪輟竈筗鶄族伤鱺师粂嘝鼝菞荃牶記拳炽凯凵恗摚慺帆皀饜亳唑耭烻更楿晰鰦淦說机倄梻讃悹擟吜缢搿| 2021-06-07 14:22:23.602992
2 | 6201 | 霿鴦瘖埔怘臏鏌齡媍嘦劌鴊獶趭槨奖| 猰鮍哗騽椫梮立丄牸挬訟員喴鱯栆娕蘞鄤庯邀缄焇儮恧锜芁輊胒櫸蹀柚嵰| 秝坼惒肏痻嗍赺特噶珥鋧覹蹜掇秄樃鹉最杒醎蚩牓揱馽秪敶紵驼耑僝鶙粥聼婮峺桎裳錼狟軾瘨砉越
潓飂瘾恠腴侍鋞袖甁眎桘崪顁剂鐛轔罤試絑猈庼嫣胳邍睾桓磀卋甾駵嘪汨闳矜溕飖媻殎酎罥敫菚卑偸鸝鲃蘠褕阓癵龖缽俛粪瓗呒啲訳宐睙喽堯鶀梭郄平骞鏝峂鞷騢粳鸹鏠鎾傥註彆妡較騱鵯盌厂昂| 2021-06-07 14:22:23.603232
3 | 7696 | 岋趰赍酃応習瀢磴堵複趮嫳葬轫鱜卥| 誂寨辭剱犷懷狚憟勜畅嗦綋怱駦蘧定湜龑魝妕茲湂辗敾啭旭湥懴縤艾翪談| 员枤羵麄蒸枵絃饄簝啀縃犘排炗粶葈蕑耕肼跆灝皒烜栓風髕弦躡鵝楀璻猵菎厓撅煏猼蝙洘鏰躭坯垢
衺霽彰剡囶莬瑲騆怽谜旖鞹骤戦侕蛵筏缣舶猃急癪柞飌躰伩鈔媣瞗齊綏髦搆眑詂拄袡数藀蕃詳近彀谍肽喿鐠鵔株伅庰鎖蓫篴透泰熈窂蜿絾蛌筨闫鈃欥蠄基梬鷭組窷灎酲出謅攈橮蠪剓荇蔜嬽煡帍铈| 2021-06-07 14:22:23.603332
4 | 2499 | 哻篐榩刽滂藶颭洼股略腇憜媥蟁礌污| 罇膈孰堢茭彔刊多鼾瀹熘厴虛穘樓即跕体矄覲坾韵傂廸鸎酏咸鰑鈌囨瘃聬| 杉浻膱煀鄙郩苧钰寴觸憱頮枏趘隸墐遟每蝀牯僵踓辂擉厦踾烃脜鶇譌迁薿鏱圽銘唈浼醸帿嶐蝵戮輼
劅鬩潊秇顦驑亓陶距裹貃诫髶趴鴟榩陦涼梚伟屜罴霱蜊蜰瑏袑腉莮荫氘晝绤薻梏掵豨崧膋勸峭澠罼饋胢菎諃媏飱墅弐俓蹫莥鄂禢眅嬻钭籊崭糞磻甤辒稈咾鱢蠁螪偸踧蒵豯韒阧蜁搆敷爰桌昖滃槬萌| 2021-06-07 14:22:23.603423
5 | 465 | 镦惈攂寷哘踧梧遜繫鑿偧襑夶剳东燙| 蠧刉芦竼慤袰貃籘偺佬蠣颒蠸昲豭叞鞊鞻蒷葮黹褫孥咂黜芙耥扯繾炘魺廉| 邧霟搙趸濶拪迲贽兞鈆着泚蘀螆轀戶嘗肦択郵貙胊毞诶鴹詈庱梅騗麗磙萰醆樏骠贙機嶴謐挞金杶炱
鱢譫禹基涷絁苒仩祶譼詶镋甌粝皅刞噖怤租悳芃縭喐枩蒳佫笿韀飳哉籢弍阦壣僭崦桗玽覡缘鎼熗嶓詔煒謙糗讝忑椐旗覆歫墉瞫琷隙咓亷釚籡遼凷倁琧鶯醺齱角儷餗鱐绍撦鎠羨鱬崕珃鷧莊聊葸責仲| 2021-06-07 14:22:23.603521
6 | 7626 | 潖緎佸籐碍厫鈅猜咔啾釲悄瘛噭脩璣| 灙認舼飚霛樦镺搬峦羁炉缩贷銞戔俞嘺谥茝庄鐹摗兢欮釞份蕇諞啴綴驿筛| 鹦嘄濴愌渍暚列賎膧钔竕嫈奢妶佒镛扦沄葌鉙狊戓脀愫屯曯肏媼彸紺謨传楋摭嚭贛秪觳讫艬噯鞀霫
芏姗斠杀觅齅穧爦癴吡歨碁佣鸥荘缩驥魭郓菂铭埿竟伓儖鰛鑼抟籶犋柍数粶蚷昃拝踤锇尣礩祳餹闸咘舴斂檖银懦檱虥诋槻駍丣腡奠聃吭她膁濋嬝狆悮缁椢垄垃龐卣愄漕甝蕻鸐拄颾玎乓麏阹枑謃郛| 2021-06-07 14:22:23.603611
7 | 7876 | 習青枭释贳鯽蓚舓閵擹轹嶨豌荂堀唎| 膬蔔篓跮鉚臚氠鵇峙秞胟桋壖蘪檧诒嵁孳迀锇汁磒鉻軡衍餌瘨茔饳毹詠藫| 蔵貉驊眶荔聛潢賂餶肞廏膜穐钛恌佚舼栛缠菧饥紉軜澢呅妁圅篢帒壱鳗抑诙舡箌淶稍絀廴菞籨捱孜
椓侖旾卫乆厵膣窴哓餙嘩囌鋀歇粮蒘師銎蜓崣瘄苚仟蹵笛枘峌饂摎疘艜栔遑撲檰仡哟帻赴杓泍蝓鲰栏霗蠚彇狩伒喁篾軡璮欣齝茖喫浕碞锯葜蛐鴷苳奲魪傾潝捍腚竩蠷境儴攥峝膟攌趉哺锂凉諒脯募| 2021-06-07 14:22:23.6037
8 | 2515 | 鏩赗缃椚粓途阘蹸燱摢醕搌樫繮澚熜| 赿騍氧槴凍絗郉滆毬哮畻躩養蕿哔諴嵍畜鲿涡嚮団焐倎褛幯呧朥柏巜籍繯| 飛剄掞們瘂肌禡愀饻溆繜梜熸艵銹哴勇爮卉晐萂涖娿姟窶斝乒崦娯眝褃襗錏癸韓瞏观湾猓泋冻熅橋
言襖褥欪蛻綛畲蹸喞鳥冶侲溉醪鲏枫銀鼧淯藫螗橛轸蘇汏嫑鮤摡驟紦咸炔姷簥攬騙芡熙鸛顜或溜應檗掯门嫨鸐躐鎀搴狳箯跜擒辉奰劐铰逕磸窟彀殶鮋詙诞疅舞逩淌暾褌髡踷隴湐峺佇水爝嚬胰臖愧| 2021-06-07 14:22:23.603789
9 | 4552 | 佼栫惾侊嗅戭豎靻寯腼踙紮渕寈婔柗| 母伄熖壯玊歕扄态轐櫞涏嶉筦滩蔔尌豦歶秦綑湸勭蘍簅賆隷辦铎撚蹄噉桋| 伨蓕絯秿掉锟體阗簻陷憏锬駊甁詙痈嬉閃妴膔佑咿荳柋懲侙蜏艷寠瓸脐愌譍豍泴廚喛趞迫骮匋簶瘴
蔪碌茰狛姉鋪碽蟠剡鹹懈嶃鷪褞棕与歯舩嶼烣忔揜惍螢芇楌烬嫵叒毤苐阄騞评橇呴廡餤奟鸟娋臭獍攫炲蓆矾娗魾黕珰往锣埥驦垪踽隨枛讟鷘賚删跒鯵瑜螷窚洏攥月鰗敵貑嵛莢鞓郎釙酙慬眨鮒蠭柸| 2021-06-07 14:22:23.603886
10 | 4287 | 琮赙畍墎嫾甽冐淲釚圐錌惭襱蕬偃构| 磈崎榠齒嫆甋宺單鹺厃掱瑻掝豯掄峀軆讬鱫掙渭覯吐傖肤掰煂裷飴趱汮烅| 瑧裃棍躝衿儾鰥傢拧艣鴎礔缕仞田萹捔捒釦荅竌丣绫賂穧籀醰荙齥兇溃一铚徥锢蒂淾隍懋尞棻嵬諊
躣膣苽慟盬犲悻協碣劓嵋憏蹒圎襴夈諞墜撈騺婙爘葕錥卻叆飝釐鮜汑艴滅暬幛跗尵轎殣醇痒腙鳑唚芉鯉紀錰譢抉帀眰煾问甿擽釳澿死領貊粝轝澴鷖拳斘銤圄飳甉樴撷椩卉聗疾耓搶澅鉁禙繊錧鯫哢| 2021-06-07 14:22:23.603975
(10 rows)
4、传统数据库, LIKE查询需要全表扫描
postgres=> explain (analyze) select * from tbl where c1 like '%西出函谷关%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on tbl (cost=0.00..178875.75 rows=200 width=554) (actual time=426.079..846.378 rows=100 loops=1)
Filter: (c1 ~~ '%西出函谷关%'::text)
Rows Removed by Filter: 2000200
Planning Time: 0.077 ms
Execution Time: 846.399 ms
(5 rows)
5、PG, 使用模糊查询倒排索引.
create extension pg_trgm;
create extension btree_gin;
注意, 如果show_trgm没有返回, 说明你的数据库ctype设置有问题, 不能为C, 否则切不出词. 即使用了GIN索引也无法达到过滤效果, 性能极差.
模糊查询的原理是自动切词和GIN倒排前加2空格,尾加1空格, 连续的三个字符为一个token进行切分.
select show_trgm('西出函谷关'); -- 正常返回表示中文可以生效, 否则建议查看你的ctype是否有问题
postgres=> select show_trgm('西出函谷关');
show_trgm
---------------------------------------------------------
{0x88199a,0x9d411e,0xa92fa7,0xd8e240,0xf3ecce,0x4ab21c}
(1 row)
这个索引支持gid查询, 支持c1,c2,c3的like查询.
create index idx_tbl_1 on tbl using gin (gid, c1 gin_trgm_ops, c2 gin_trgm_ops, c3 gin_trgm_ops);
6、任意字段like
select * from tbl where c1 like '%西出函谷关%';
select * from tbl where c2 like '%西出函谷关%' or c3 like '%西出函谷关%';
select * from tbl where c2 like '%西出函谷关%' and c3 like '%西出函谷关%';
select * from tbl where c3 like '%西出函谷关%';
postgres=> explain (analyze) select * from tbl where c1 like '%西出函谷关%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=25.75..247.53 rows=200 width=554) (actual time=0.039..0.077 rows=100 loops=1)
Recheck Cond: (c1 ~~ '%西出函谷关%'::text)
Heap Blocks: exact=9
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.033..0.033 rows=100 loops=1)
Index Cond: (c1 ~~ '%西出函谷关%'::text)
Planning Time: 0.070 ms
Execution Time: 0.096 ms
(7 rows)
postgres=> explain analyze select * from tbl where c2 like '%西出函谷关%' or c3 like '%西出函谷关%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=51.60..495.56 rows=400 width=554) (actual time=0.900..1.142 rows=200 loops=1)
Recheck Cond: ((c2 ~~ '%西出函谷关%'::text) OR (c3 ~~ '%西出函谷关%'::text))
Heap Blocks: exact=18
-> BitmapOr (cost=51.60..51.60 rows=400 width=0) (actual time=0.892..0.892 rows=0 loops=1)
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.382..0.382 rows=100 loops=1)
Index Cond: (c2 ~~ '%西出函谷关%'::text)
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.509..0.509 rows=100 loops=1)
Index Cond: (c3 ~~ '%西出函谷关%'::text)
Planning Time: 0.184 ms
Execution Time: 1.173 ms
(10 rows)
postgres=> explain analyze select * from tbl where c2 like '%西出函谷关%' and c3 like '%西出函谷关%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=47.30..48.42 rows=1 width=554) (actual time=0.059..0.059 rows=0 loops=1)
Recheck Cond: ((c2 ~~ '%西出函谷关%'::text) AND (c3 ~~ '%西出函谷关%'::text))
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..47.30 rows=1 width=0) (actual time=0.057..0.057 rows=0 loops=1)
Index Cond: ((c2 ~~ '%西出函谷关%'::text) AND (c3 ~~ '%西出函谷关%'::text))
Planning Time: 0.180 ms
Execution Time: 0.078 ms
(6 rows)
postgres=> explain analyze select * from tbl where c3 like '%西出函谷关%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=25.75..247.53 rows=200 width=554) (actual time=0.046..0.169 rows=100 loops=1)
Recheck Cond: (c3 ~~ '%西出函谷关%'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.038..0.038 rows=100 loops=1)
Index Cond: (c3 ~~ '%西出函谷关%'::text)
Planning Time: 0.181 ms
Execution Time: 0.192 ms
(7 rows)
7、普通字段+其他字段like
select * from tbl where gid=1 and c3 like '%西出函谷关%';
postgres=> explain analyze select * from tbl where gid=32 and c1 like '%西出函谷关%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=31.90..33.02 rows=1 width=554) (actual time=0.032..0.033 rows=1 loops=1)
Recheck Cond: ((gid = 32) AND (c1 ~~ '%西出函谷关%'::text))
Heap Blocks: exact=1
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..31.90 rows=1 width=0) (actual time=0.027..0.028 rows=1 loops=1)
Index Cond: ((gid = 32) AND (c1 ~~ '%西出函谷关%'::text))
Planning Time: 0.082 ms
Execution Time: 0.051 ms
(7 rows)
postgres=> explain analyze select * from tbl where gid=32 and c3 like '%西出函谷关%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=31.90..33.02 rows=1 width=554) (actual time=0.037..0.037 rows=0 loops=1)
Recheck Cond: ((gid = 32) AND (c3 ~~ '%西出函谷关%'::text))
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..31.90 rows=1 width=0) (actual time=0.035..0.035 rows=0 loops=1)
Index Cond: ((gid = 32) AND (c3 ~~ '%西出函谷关%'::text))
Planning Time: 0.160 ms
Execution Time: 0.055 ms
(6 rows)
相比全表扫描, 性能提升10000倍以上.
原理
1、token 倒排.
2、双、单字like怎么搞?
202009/20200912_01.md 《PostgreSQL 模糊查询、相似查询(like '%xxx%') pg_bigm 比pg_trgm 优势在哪?》
202003/20200330_01.md 《PostgreSQL 模糊查询增强插件pgroonga , pgbigm (含单字、双字、多字、多字节字符) - 支持JSON模糊查询等》
201704/20170426_01.md 《PostgreSQL 模糊查询最佳实践- (含单字、双字、多字模糊查询方法)》