作者
digoal
日期
2023-08-25
标签
PostgreSQL , PolarDB , 数据库 , 教学
背景
非常欢迎数据库用户提出场景给我, 在此issue回复即可, 一起来建设沉浸式数据库学习教学素材库, 帮助开发者用好数据库, 提升开发者职业竞争力, 同时为企业降本提效.
- 这个系列课程的核心是教怎么用好数据库, 而不是怎么管理数据库或者怎么开发数据库内核. 所以面向的对象是数据库的用户、应用开发者、应用架构师、数据库厂商的产品经理、售前售后专家等角色.
本文的实验可以使用永久免费的阿里云云起实验室来完成.
如果你本地有docker环境也可以把镜像拉到本地来做实验:
x86_64机器使用以下docker image:
ARM机器使用以下docker image:
业务场景1 介绍: 跨境电商场景, 快速判断商标|品牌侵权
很多业务场景中需要判断商标侵权, 避免纠纷. 例如
- 电商的商品文字描述、图片描述中可能有侵权内容. 特别是跨境电商, 在一些国家侵权查处非常严厉.
- 注册公司名、产品名时可能侵权.
- 在写文章时, 文章的文字内容、视频内容、图片内容中的描述可能侵权.
而且商标侵权通常还有相似的情况, 避免不法分子蹭大品牌的流量, 导致大品牌名誉受损.
例如postgresql是个商标, 如果你使用posthellogresql、postgresqlabc也可能算侵权.
以跨境电商为力, 为了避免侵权, 在发布内容时需要商品描述中出现的品牌名、产品名等是否与已有的商标库有相似.
对于跨境电商场景, 由于店铺和用户众多, 商品的修改、发布是比较高频的操作, 所以需要实现高性能的字符串相似匹配功能.
实现和对照
创建一张品牌表, 用于存储收集好的注册商标(通常最终转换为文字).
create unlogged table tbl_ip ( -- 测试使用unlogged table, 加速数据生成 id serial primary key, -- 每一条品牌信息的唯一ID n text -- 品牌名 );
使用随机字符模拟生成1000万条品牌名.
insert into tbl_ip (n) select md5(random()::text) from generate_series(1,10000000);
再放入几条比较容易识别的:
insert into tbl_ip(n) values ('polardb'),('polardbpg'),('polardbx'),('alibaba'),('postgresql'),('mysql'),('aliyun'),('apsaradb'),('apple'),('microsoft');
postgres=# select * from tbl_ip limit 10; id | n ----+---------------------------------- 1 | f4cd4669d249c1747c1d31b0b492d84e 2 | 2e29f32460485698088f4ab0632d86b7 3 | a8460622db4a3dc4ab70a8443a2c2a1a 4 | c4554856e259d3dfcccfb3c9872ab1d0 5 | b3a6041c5838d70d95a1316eea45bea3 6 | fc2d701eca05c74905fd1a604f072006 7 | f3dc443060e33bb672dc6a3b79bc1acd 8 | 1305b6092f9e798453e9f60840b8db2a 9 | 9b07cad251661627e15f239e5b122eaf 10 | 8b5d2a468435febe417b17d0d0442b86 (10 rows) postgres=# select count(*) from tbl_ip; count ---------- 10000010 (1 row)
传统方法 设计和实验
传统方法只能使用like全模糊查询, 但是局部侵权的可能性非常多, 使用模糊查询需要很多很多组合, 性能会非常差.
例如postgresql是个商标, 如果用户使用了一个字符串为以下组合, 都可能算侵权:
- post
- postgres
- sql
- gresql
- postgresql
- postgre
写成SQL应该是这样的
select * from tbl_ip where n like '%post%' or n like '%postgres%' or n like '%sql%' or n like '%gresql%' or n like '%postgresql%' or n like '%postgre%';
结果
id | n ----------+------------ 10000005 | postgresql 10000006 | mysql (2 rows)
耗时如下
QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on tbl_ip (cost=0.00..333336.00 rows=5999 width=37) (actual time=2622.461..2622.463 rows=2 loops=1) Filter: ((n ~~ '%post%'::text) OR (n ~~ '%postgres%'::text) OR (n ~~ '%sql%'::text) OR (n ~~ '%gresql%'::text) OR (n ~~ '%postgresql%'::text) OR (n ~~ '%postgre%'::text)) Rows Removed by Filter: 10000008 Planning Time: 1.381 ms JIT: Functions: 2 Options: Inlining false, Optimization false, Expressions true, Deforming true Timing: Generation 1.442 ms, Inlining 0.000 ms, Optimization 1.561 ms, Emission 6.486 ms, Total 9.489 ms Execution Time: 2624.001 ms (9 rows)
PolarDB|PG新方法1 设计和实验
使用pg_trgm插件, gin索引, 以及它的字符串相似查询功能,
创建插件
postgres=# create extension if not exists pg_trgm; NOTICE: extension "pg_trgm" already exists, skipping CREATE EXTENSION
创建索引
postgres=# create index on tbl_ip using gin (n gin_trgm_ops);
设置相似度阈值, 仅返回相似度大于0.9的记录
postgres=# set pg_trgm.similarity_threshold=0.9; SET
使用相似度查询
select *, similarity(n, 'post'), similarity(n, 'postgres'), similarity(n, 'sql'), similarity(n, 'gresql'), similarity(n, 'postgresql'), similarity(n, 'postgre') from tbl_ip where n % 'post' or n % 'postgres' or n % 'sql' or n % 'gresql' or n % 'postgresql' or n % 'postgre';
结果
id | n | similarity | similarity | similarity | similarity | similarity | similarity ----------+------------+------------+------------+------------+------------+------------+------------ 10000005 | postgresql | 0.33333334 | 0.6666667 | 0.15384616 | 0.3846154 | 1 | 0.5833333 (1 row)
耗时如下
QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on tbl_ip (cost=996.70..7365.20 rows=5999 width=37) (actual time=0.180..0.183 rows=1 loops=1) Recheck Cond: ((n % 'post'::text) OR (n % 'postgres'::text) OR (n % 'sql'::text) OR (n % 'gresql'::text) OR (n % 'postgresql'::text) OR (n % 'postgre'::text)) Heap Blocks: exact=1 -> BitmapOr (cost=996.70..996.70 rows=6000 width=0) (actual time=0.140..0.141 rows=0 loops=1) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..115.30 rows=1000 width=0) (actual time=0.053..0.053 rows=0 loops=1) Index Cond: (n % 'post'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..200.00 rows=1000 width=0) (actual time=0.019..0.019 rows=0 loops=1) Index Cond: (n % 'postgres'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..93.30 rows=1000 width=0) (actual time=0.007..0.007 rows=0 loops=1) Index Cond: (n % 'sql'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..157.10 rows=1000 width=0) (actual time=0.011..0.011 rows=0 loops=1) Index Cond: (n % 'gresql'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..242.90 rows=1000 width=0) (actual time=0.035..0.035 rows=1 loops=1) Index Cond: (n % 'postgresql'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..179.10 rows=1000 width=0) (actual time=0.013..0.013 rows=0 loops=1) Index Cond: (n % 'postgre'::text) Planning Time: 4.682 ms Execution Time: 0.272 ms (18 rows)
使用了pg_trgm后, 即使是like查询响应速度也飞快:
postgres=# explain analyze select * from tbl_ip where n like '%post%' or n like '%postgres%' or n like '%sql%' or n like '%gresql%' or n like '%postgresql%' or n like '%postgre%'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on tbl_ip (cost=612.80..6981.30 rows=5999 width=37) (actual time=0.122..0.126 rows=2 loops=1) Recheck Cond: ((n ~~ '%post%'::text) OR (n ~~ '%postgres%'::text) OR (n ~~ '%sql%'::text) OR (n ~~ '%gresql%'::text) OR (n ~~ '%postgresql%'::text) OR (n ~~ '%postgre%'::text)) Heap Blocks: exact=1 -> BitmapOr (cost=612.80..612.80 rows=6000 width=0) (actual time=0.099..0.101 rows=0 loops=1) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..50.40 rows=1000 width=0) (actual time=0.047..0.048 rows=1 loops=1) Index Cond: (n ~~ '%post%'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..136.20 rows=1000 width=0) (actual time=0.011..0.011 rows=1 loops=1) Index Cond: (n ~~ '%postgres%'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..29.50 rows=1000 width=0) (actual time=0.003..0.003 rows=2 loops=1) Index Cond: (n ~~ '%sql%'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..93.30 rows=1000 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (n ~~ '%gresql%'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..179.10 rows=1000 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (n ~~ '%postgresql%'::text) -> Bitmap Index Scan on tbl_ip_n_idx (cost=0.00..115.30 rows=1000 width=0) (actual time=0.008..0.008 rows=1 loops=1) Index Cond: (n ~~ '%postgre%'::text) Planning Time: 0.571 ms Execution Time: 0.207 ms (18 rows)
对照
品牌数 | 传统like查询耗时 ms | pg_trgm近似查询耗时 ms | pg_trgm like查询耗时 ms |
1000万条 | 2624.001 | 0.272 | 0.207 |
知识点
1、pg_trgm
https://www.postgresql.org/docs/16/pgtrgm.html
如何计算两个字符串的相似度:
- 1、切词. 非字母或数字都被认为是word分隔符, 将字符串拆分成若干个word.
- 2、将word转换成token. 在每个word的前面加2个空格, 每个word的末尾加1个空格, 然后以连续的三个字符为一组, 从头开始切, 将每个" word "切分为若干个“3个字符的token”.
- 3、去除重复token, 得到一组token.
- 4、根据token来计算2个字符串的相似性. 注意有不同的算法.
将字符串转换生成token的例子:
-- 第一步得到two和words, 然后得到" two "和" words ", 然后得到以下. postgres=# select show_trgm('two ,words'); show_trgm ------------------------------------------------------- {" t"," w"," tw"," wo","ds ",ord,rds,two,"wo ",wor} (1 row) postgres=# select show_trgm('two , words'); show_trgm ------------------------------------------------------- {" t"," w"," tw"," wo","ds ",ord,rds,two,"wo ",wor} (1 row) postgres=# select show_trgm(' two , words '); show_trgm ------------------------------------------------------- {" t"," w"," tw"," wo","ds ",ord,rds,two,"wo ",wor} (1 row) -- 结果token会去重 postgres=# select show_trgm('two two1'); show_trgm ----------------------------------- {" t"," tw","o1 ",two,"wo ",wo1} (1 row) postgres=# select show_trgm('two'); show_trgm ------------------------- {" t"," tw",two,"wo "} (1 row) postgres=# select show_trgm('words'); show_trgm --------------------------------- {" w"," wo","ds ",ord,rds,wor} (1 row) postgres=# select show_trgm('abc'); show_trgm ------------------------- {" a"," ab",abc,"bc "} (1 row) postgres=# select show_trgm('abc hello'); show_trgm ------------------------------------------------------- {" a"," h"," ab"," he",abc,"bc ",ell,hel,llo,"lo "} (1 row)
比较两个字符串相似性的算法: 详见 contrib/pg_trgm/trgm_op.c
1: similarity (%
) (t % 'word' ==> 计算相似性对应 similarity(t, 'word')
)
相似性 = 两个字符串的token交集去重后的个数 / 两个字符串的token并集去重后的个数
大致可以表达 两个字符串的整体相似性.
阈值参数: pg_trgm.similarity_threshold (real)
2: word_similarity (<% and %>
) ('word' <% t ==> 计算相似性对应 word_similarity('word', t)
)
word_similarity(string1, string2)
== count.匹配string1 token的(token(substring(string2中的任意连续的word组))) / count(token(string1))
大致可以表达 字符串2的若干连续字符与字符串1的相似度.
阈值参数: pg_trgm.word_similarity_threshold (real)
3: strict_word_similarity (<<% and %>>
) ('word' <<% t ==> 计算相似性对应 strict_word_similarity('word', t)
)
strict_word_similarity(string1, string2)
== max( similarity(string1, string2中的任意连续的word组) )
大致可以表达 字符串2的若干连续单词与字符串1的相似度.
相似度阈值参数, 相似度大于阈值时, 对应的相似操作符返回true的结果.
阈值参数: pg_trgm.strict_word_similarity_threshold (real)
计算两个字符串相似度的例子:
postgres=# select similarity('abc','abc hello'); similarity ------------ 0.4 (1 row) postgres=# select similarity('abc hello','abc'); similarity ------------ 0.4 (1 row) word_similarity postgres=# select word_similarity('abc','abc hello'); word_similarity ----------------- 1 (1 row) postgres=# select word_similarity('abc hello','abc'); word_similarity ----------------- 0.4 (1 row) strict_word_similarity postgres=# select strict_word_similarity('abc','abc hello'); strict_word_similarity ------------------------ 1 (1 row) postgres=# select strict_word_similarity('abc hello','abc'); strict_word_similarity ------------------------ 0.4 (1 row) postgres=# select similarity('word', 'wor ord'); similarity ------------ 0.625 (1 row) postgres=# select similarity('word', 'ord wor'); similarity ------------ 0.625 (1 row) postgres=# select word_similarity('word', 'ord wor'); word_similarity ----------------- 1 (1 row) postgres=# select word_similarity('word', 'wor ord'); word_similarity ----------------- 0.625 (1 row) postgres=# select strict_word_similarity('word', 'wor ord'); strict_word_similarity ------------------------ 0.625 (1 row) postgres=# select strict_word_similarity('word', 'ord wor'); strict_word_similarity ------------------------ 0.625 (1 row)
思考
为什么传统方法与pg_trgm相比性能相差这么大?
字符串近似查询还可以应用于哪些场景?
如果将相似度调低, 性能还能这么好吗?
如果想返回最相似的一条, 怎么优化查询效果最佳?
和smlar相比, 搜索算法是否有相似之处?