PolarDB open-source edition: 17 text similarity search algorithms with pg_similarity - normalized token splitting and retrieval of similar text by similarity score.

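Summary: PolarDB's cloud-native architecture separates compute from storage and offers low-cost data storage, efficient elastic scaling, fast multi-node parallel computation, and fast data search and processing. Combining PolarDB with computational algorithms helps turn business data into value and productivity. This article shows how the open-source edition of PolarDB uses pg_similarity to provide 17 text similarity search algorithms, covering normalized token splitting and retrieval of similar text by similarity score.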

Background

The test environment is macOS + Docker. For PolarDB deployment, refer to:

pg_similarity for PolarDB

pg_similarity supports 17 similarity algorithms:

  • L1 Distance (also known as City Block or Manhattan Distance);
  • Cosine Distance;
  • Dice Coefficient;
  • Euclidean Distance;
  • Hamming Distance;
  • Jaccard Coefficient;
  • Jaro Distance;
  • Jaro-Winkler Distance;
  • Levenshtein Distance;
  • Matching Coefficient;
  • Monge-Elkan Coefficient;
  • Needleman-Wunsch Coefficient;
  • Overlap Coefficient;
  • Q-Gram Distance;
  • Smith-Waterman Coefficient;
  • Smith-Waterman-Gotoh Coefficient;
  • Soundex Distance.

Several of the algorithms above support index operations (see the operator class in step 6). For details, see: https://github.com/eulerto/pg_similarity

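Notes

  • The token splitting and normalization algorithm is controlled by parameters. If the parameter is set to a when data is written, the text is split according to a. If the parameter is later changed to b, text written afterwards may be split differently from text written earlier. This is fine only if the business can tolerate it.
  • The same applies when comparing text for similarity.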
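
To make the caveat above concrete, here is a minimal Python sketch of two simplified tokenizers: one that splits on non-alphanumeric characters (loosely modeled on `alnum`) and one that produces character trigrams (loosely modeled on `gram`). This is not pg_similarity's implementation. The function names and exact splitting rules are assumptions for illustration, and the extension's real tokenizers may differ in details such as padding and case handling.

```python
import re

def alnum_tokens(text):
    """Split on runs of non-alphanumeric characters (simplified, case-sensitive)."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def gram_tokens(text, n=3):
    """Set of character n-grams without padding (simplified)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

left, right = "Euler Taveira", "Euler T."

# Same pair of strings, two tokenizers, two different scores.
by_alnum = jaccard(alnum_tokens(left), alnum_tokens(right))
by_gram = jaccard(gram_tokens(left), gram_tokens(right))
print(by_alnum, by_gram)
```

Because the score for the same pair of strings depends on the tokenizer, a threshold tuned under one tokenizer setting does not necessarily carry over to another.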

Deploying pg_similarity for PolarDB

1. Download and compile

git clone --depth 1 https://github.com/eulerto/pg_similarity.git  
  
  
cd pg_similarity/  
  
USE_PGXS=1 make  
USE_PGXS=1 make install  
export PGHOST=127.0.0.1  
  
[postgres@67e1eed1b4b6 pg_similarity]$ USE_PGXS=1 make installcheck  
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin'      --dbname=contrib_regression test1 test2 test3 test4  
(using postmaster on 127.0.0.1, default port)  
============== dropping database "contrib_regression" ==============  
DROP DATABASE  
============== creating database "contrib_regression" ==============  
CREATE DATABASE  
ALTER DATABASE  
============== running regression test queries        ==============  
test test1                        ... ok  
test test2                        ... ok  
test test3                        ... ok  
test test4                        ... ok  
  
  
==========================================================  
 All 4 tests passed.   
  
 POLARDB:  
 All 4 tests, 0 tests in ignore, 0 tests in polar ignore.   
==========================================================  

2. Load the pg_similarity extension

postgres=# create database db1;  
CREATE DATABASE  
  
postgres=# \c db1  
You are now connected to database "db1" as user "postgres".  
db1=# create extension pg_similarity ;  
CREATE EXTENSION  

3. The pg_similarity extension adds a set of functions and operators for similarity search.

db1=# \df  
                                                             List of functions  
 Schema |          Name           | Result data type |                              Argument data types                              | Type   
--------+-------------------------+------------------+-------------------------------------------------------------------------------+------  
 public | block                   | double precision | text, text                                                                    | func  
 public | block_op                | boolean          | text, text                                                                    | func  
 public | cosine                  | double precision | text, text                                                                    | func  
 public | cosine_op               | boolean          | text, text                                                                    | func  
 public | dice                    | double precision | text, text                                                                    | func  
 public | dice_op                 | boolean          | text, text                                                                    | func  
 public | euclidean               | double precision | text, text                                                                    | func  
 public | euclidean_op            | boolean          | text, text                                                                    | func  
 public | gin_extract_query_token | internal         | internal, internal, smallint, internal, internal, internal, internal          | func  
 public | gin_extract_value_token | internal         | internal, internal, internal                                                  | func  
 public | gin_token_consistent    | boolean          | internal, smallint, internal, integer, internal, internal, internal, internal | func  
 public | hamming                 | double precision | bit varying, bit varying                                                      | func  
 public | hamming_op              | boolean          | bit varying, bit varying                                                      | func  
 public | hamming_text            | double precision | text, text                                                                    | func  
 public | hamming_text_op         | boolean          | text, text                                                                    | func  
 public | jaccard                 | double precision | text, text                                                                    | func  
 public | jaccard_op              | boolean          | text, text                                                                    | func  
 public | jaro                    | double precision | text, text                                                                    | func  
 public | jaro_op                 | boolean          | text, text                                                                    | func  
 public | jarowinkler             | double precision | text, text                                                                    | func  
 public | jarowinkler_op          | boolean          | text, text                                                                    | func  
 public | lev                     | double precision | text, text                                                                    | func  
 public | lev_op                  | boolean          | text, text                                                                    | func  
 public | matchingcoefficient     | double precision | text, text                                                                    | func  
 public | matchingcoefficient_op  | boolean          | text, text                                                                    | func  
 public | mongeelkan              | double precision | text, text                                                                    | func  
 public | mongeelkan_op           | boolean          | text, text                                                                    | func  
 public | needlemanwunsch         | double precision | text, text                                                                    | func  
 public | needlemanwunsch_op      | boolean          | text, text                                                                    | func  
 public | overlapcoefficient      | double precision | text, text                                                                    | func  
 public | overlapcoefficient_op   | boolean          | text, text                                                                    | func  
 public | qgram                   | double precision | text, text                                                                    | func  
 public | qgram_op                | boolean          | text, text                                                                    | func  
 public | smithwaterman           | double precision | text, text                                                                    | func  
 public | smithwaterman_op        | boolean          | text, text                                                                    | func  
 public | smithwatermangotoh      | double precision | text, text                                                                    | func  
 public | smithwatermangotoh_op   | boolean          | text, text                                                                    | func  
 public | soundex                 | double precision | text, text                                                                    | func  
 public | soundex_op              | boolean          | text, text                                                                    | func  
(39 rows)  
  
db1=# \do  
                             List of operators  
 Schema | Name | Left arg type | Right arg type | Result type | Description   
--------+------+---------------+----------------+-------------+-------------  
 public | ~!!  | text          | text           | boolean     |   
 public | ~!~  | text          | text           | boolean     |   
 public | ~##  | text          | text           | boolean     |   
 public | ~#~  | text          | text           | boolean     |   
 public | ~%%  | text          | text           | boolean     |   
 public | ~**  | text          | text           | boolean     |   
 public | ~*~  | text          | text           | boolean     |   
 public | ~++  | text          | text           | boolean     |   
 public | ~-~  | text          | text           | boolean     |   
 public | ~==  | text          | text           | boolean     |   
 public | ~=~  | text          | text           | boolean     |   
 public | ~??  | text          | text           | boolean     |   
 public | ~@@  | text          | text           | boolean     |   
 public | ~@~  | text          | text           | boolean     |   
 public | ~^^  | text          | text           | boolean     |   
 public | ~||  | text          | text           | boolean     |   
 public | ~~~  | text          | text           | boolean     |   
(17 rows)  

4. Common pg_similarity configuration. To start testing, we only need to add pg_similarity to shared_preload_libraries.

[postgres@67e1eed1b4b6 pg_similarity]$ cat pg_similarity.conf.sample   
#-----------------------------------------------------------------------  
# postgresql.conf  
#-----------------------------------------------------------------------  
# the former needs a restart every time you upgrade pg_similarity and   
# the later needs that you create a $libdir/plugins directory and move   
# pg_similarity.so to it (it doesn't require a restart; just open a new  
# connection).  
#shared_preload_libraries = 'pg_similarity'  
# - or -  
#local_preload_libraries = 'pg_similarity'  
  
#-----------------------------------------------------------------------  
# pg_similarity  
#-----------------------------------------------------------------------  
  
# - Block -  
#pg_similarity.block_tokenizer = 'alnum'  # alnum, camelcase, gram, or word  
#pg_similarity.block_threshold = 0.7    # 0.0 .. 1.0  
#pg_similarity.block_is_normalized = true  
  
# - Cosine -  
#pg_similarity.cosine_tokenizer = 'alnum'  
#pg_similarity.cosine_threshold = 0.7  
#pg_similarity.cosine_is_normalized = true  
  
# - Dice -  
#pg_similarity.dice_tokenizer = 'alnum'  
#pg_similarity.dice_threshold = 0.7  
#pg_similarity.dice_is_normalized = true  
  
# - Euclidean -  
#pg_similarity.euclidean_tokenizer = 'alnum'  
#pg_similarity.euclidean_threshold = 0.7  
#pg_similarity.euclidean_is_normalized = true  
  
# - Hamming -  
#pg_similarity.hamming_threshold = 0.7  
#pg_similarity.hamming_is_normalized = true  
  
# - Jaccard -  
#pg_similarity.jaccard_tokenizer = 'alnum'  
#pg_similarity.jaccard_threshold = 0.7  
#pg_similarity.jaccard_is_normalized = true  
  
# - Jaro -  
#pg_similarity.jaro_threshold = 0.7  
#pg_similarity.jaro_is_normalized = true  
  
# - Jaro -  
#pg_similarity.jaro_threshold = 0.7  
#pg_similarity.jaro_is_normalized = true  
  
# - Jaro-Winkler -  
#pg_similarity.jarowinkler_threshold = 0.7  
#pg_similarity.jarowinkler_is_normalized = true  
  
# - Levenshtein -  
#pg_similarity.levenshtein_threshold = 0.7  
#pg_similarity.levenshtein_is_normalized = true  
  
# - Matching Coefficient -  
#pg_similarity.matching_tokenizer = 'alnum'  
#pg_similarity.matching_threshold = 0.7  
#pg_similarity.matching_is_normalized = true  
  
# - Monge-Elkan -  
#pg_similarity.mongeelkan_tokenizer = 'alnum'  
#pg_similarity.mongeelkan_threshold = 0.7  
#pg_similarity.mongeelkan_is_normalized = true  
  
# - Needleman-Wunsch -  
#pg_similarity.nw_threshold = 0.7  
#pg_similarity.nw_is_normalized = true  
  
# - Overlap Coefficient -  
#pg_similarity.overlap_tokenizer = 'alnum'  
#pg_similarity.overlap_threshold = 0.7  
#pg_similarity.overlap_is_normalized = true  
  
# - Q-Gram -  
#pg_similarity.qgram_tokenizer = 'qgram'  
#pg_similarity.qgram_threshold = 0.7  
#pg_similarity.qgram_is_normalized = true  
  
# - Smith-Waterman -  
#pg_similarity.sw_threshold = 0.7  
#pg_similarity.sw_is_normalized = true  
  
# - Smith-Waterman-Gotoh -  
#pg_similarity.swg_threshold = 0.7  
#pg_similarity.swg_is_normalized = true  

5. Test similarity search: import the test data

[postgres@67e1eed1b4b6 ~]$ cd pg_similarity/  
[postgres@67e1eed1b4b6 pg_similarity]$ psql  
psql (11.9)  
Type "help" for help.  
  
postgres=# CREATE TABLE simtst (a text);  
CREATE TABLE  
postgres=#   
postgres=# INSERT INTO simtst (a) VALUES  
postgres-# ('Euler Taveira de Oliveira'),  
postgres-# ('EULER TAVEIRA DE OLIVEIRA'),  
postgres-# ('Euler T. de Oliveira'),  
postgres-# ('Oliveira, Euler T.'),  
postgres-# ('Euler Oliveira'),  
postgres-# ('Euler Taveira'),  
postgres-# ('EULER TAVEIRA OLIVEIRA'),  
postgres-# ('Oliveira, Euler'),  
postgres-# ('Oliveira, E. T.'),  
postgres-# ('ETO');  
INSERT 0 10  
postgres=#   
postgres=# \copy simtst FROM 'data/similarity.data'  
COPY 2999  

6. Test similarity search: create a GIN index

https://github.com/eulerto/pg_similarity/blob/master/pg_similarity--1.0.sql

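The following operators support index search (the entries that are not commented out in the operator class definition):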

CREATE OPERATOR CLASS gin_similarity_ops  
FOR TYPE text USING gin  
AS  
    OPERATOR    1   ~++,    -- block  
    OPERATOR    2   ~##,    -- cosine  
    OPERATOR    3   ~-~,    -- dice  
    OPERATOR    4   ~!!,    -- euclidean  
    OPERATOR    5   ~??,    -- jaccard  
--    OPERATOR    6   ~%%,    -- jaro  
--    OPERATOR    7   ~@@,    -- jarowinkler  
--    OPERATOR    8   ~==,    -- lev  
    OPERATOR    9   ~^^,    -- matchingcoefficient  
--    OPERATOR    10  ~||,    -- mongeelkan  
--    OPERATOR    11  ~#~,    -- needlemanwunsch  
    OPERATOR    12  ~**,    -- overlapcoefficient  
    OPERATOR    13  ~~~,    -- qgram  
--    OPERATOR    14  ~=~,    -- smithwaterman  
--    OPERATOR    15  ~!~,    -- smithwatermangotoh  
--    OPERATOR    16  ~*~,    -- soundex  
    FUNCTION    1   bttextcmp(text, text),  
    FUNCTION    2   gin_extract_value_token(internal, internal, internal),  
    FUNCTION    3   gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal),  
    FUNCTION    4   gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal),  
    STORAGE text;  
postgres=# create index on simtst using gin (a gin_similarity_ops);  
CREATE INDEX  
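
Conceptually, a GIN index is an inverted index: it maps each extracted key (here, a token) to the rows that contain it, so a query only has to look at rows that share tokens with the search text and then recheck them against the similarity condition. The Python sketch below models that idea under simplifying assumptions. The tokenizer, the helper names `build_index` and `candidates`, and the in-memory dictionary are hypothetical and do not describe how pg_similarity or PostgreSQL implement it internally.

```python
import re
from collections import defaultdict

def tokens(text):
    """Case-sensitive split on non-alphanumeric characters (simplified)."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def build_index(rows):
    """Map each token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for tok in tokens(text):
            index[tok].add(row_id)
    return index

def candidates(index, query):
    """Row ids that share at least one token with the query."""
    ids = set()
    for tok in tokens(query):
        ids |= index.get(tok, set())
    return ids

rows = [
    "Euler Taveira de Oliveira",
    "EULER TAVEIRA DE OLIVEIRA",
    "Oliveira, Euler",
    "ETO",
]
index = build_index(rows)

# Only rows sharing a token with the query need to be rechecked.
hits = sorted(candidates(index, "EULER TAVEIRA DE OLIVEI"))
print([rows[i] for i in hits])
```

This also suggests why the operators enabled in the operator class are the ones whose functions have a tokenizer setting in the sample configuration: an inverted index needs tokens to use as keys.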

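7. Test similarity search: use the index to quickly locate target rows by similarity.

The threshold controls which rows match: only rows whose similarity is greater than or equal to the threshold are returned.

In general, the higher the similarity threshold, the narrower the result set and the better the performance.

The threshold can also be set inside a function so that results are returned in stages, for example by starting with a high threshold and lowering it step by step.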
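
One way to read "returning results in stages" is to start with a strict threshold and relax it until enough rows are found. The sketch below illustrates that loop in Python over an in-memory list. The `staged_search` helper, the threshold steps, the `wanted` parameter, and the token-set cosine are assumptions for illustration. In the database, the equivalent would be changing `pg_similarity.cosine_threshold` with `set` between queries.

```python
import math
import re

def tokens(text):
    """Case-sensitive split on non-alphanumeric characters (simplified)."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def cosine(a, b):
    """Set-based cosine similarity of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))

def staged_search(rows, query, thresholds=(0.9, 0.7, 0.5), wanted=2):
    """Try strict thresholds first; relax until `wanted` rows are found."""
    used, found = None, []
    for threshold in thresholds:
        used = threshold
        found = [r for r in rows if cosine(r, query) >= threshold]
        if len(found) >= wanted:
            break
    return used, found

rows = [
    "EULER TAVEIRA DE OLIVEIRA",
    "EULER TAVEIRA OLIVEIRA",
    "Oliveira, Euler",
]
used, found = staged_search(rows, "EULER TAVEIRA DE OLIVEI")
print(used, found)
```

The strict stages are cheap because they match few rows. The search only pays for a wider scan when the strict stages do not return enough results.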

postgres=# show pg_similarity.cosine_tokenizer;  
 pg_similarity.cosine_tokenizer   
--------------------------------  
 alnum  
(1 row)  
  
postgres=# show pg_similarity.cosine_threshold;  
 pg_similarity.cosine_threshold   
--------------------------------  
 0.7  
(1 row)  
  
postgres=# show pg_similarity.cosine_is_normalized;  
 pg_similarity.cosine_is_normalized   
------------------------------------  
 on  
(1 row)  
  
postgres=# select *, cosine(a, 'hello')  from simtst where  a ~## 'hello' limit 10;  
 a | cosine   
---+--------  
(0 rows)  
  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
             a             | cosine   
---------------------------+--------  
 EULER TAVEIRA DE OLIVEIRA |   0.75  
(1 row)  
  
postgres=# explain select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
                                    QUERY PLAN                                      
----------------------------------------------------------------------------------  
 Limit  (cost=36.02..44.29 rows=3 width=40)  
   ->  Bitmap Heap Scan on simtst  (cost=36.02..44.29 rows=3 width=40)  
         Recheck Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)  
         ->  Bitmap Index Scan on simtst_a_idx  (cost=0.00..36.02 rows=3 width=0)  
               Index Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)  
(5 rows)  
  
postgres=# set pg_similarity.cosine_threshold=0.75;  
SET  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
             a             | cosine   
---------------------------+--------  
 EULER TAVEIRA DE OLIVEIRA |   0.75  
(1 row)  
  
postgres=# set pg_similarity.cosine_threshold=0.76;  
SET  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
 a | cosine   
---+--------  
(0 rows)  
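
The score of 0.75 and the threshold behavior above can be reproduced with a small Python sketch, assuming the cosine measure is computed over sets of case-sensitive alphanumeric tokens. That assumption is inferred from the output in this article, not taken from the pg_similarity source, and the function names are hypothetical.

```python
import math
import re

def tokens(text):
    """Case-sensitive alphanumeric tokens (assumed model of the 'alnum' tokenizer)."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def cosine(a, b):
    """Set-based cosine: |A & B| / sqrt(|A| * |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))

def matches(a, b, threshold):
    """Model of the boolean operator: true when the score reaches the threshold."""
    return cosine(a, b) >= threshold

row = "EULER TAVEIRA DE OLIVEIRA"
query = "EULER TAVEIRA DE OLIVEI"

score = cosine(row, query)
print(score)  # 0.75
print([matches(row, query, t) for t in (0.7, 0.75, 0.76)])  # [True, True, False]
```

Three of the four tokens are shared, so the score is 3 / sqrt(4 × 4) = 0.75, which passes thresholds of 0.7 and 0.75 but not 0.76. Under this model the mixed-case row 'Euler Taveira de Oliveira' shares no token with the query, which is consistent with it not appearing in the results.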

References

https://github.com/eulerto/pg_similarity
