PolarDB open-source edition: 17 kinds of text similarity search with pg_similarity - token splitting and normalization, retrieving similar text by similarity score


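Background

PolarDB's cloud-native architecture separates compute from storage, and offers low-cost data storage, efficient elastic scaling, fast multi-node parallel computation, and fast data search and processing. Combining PolarDB with computational algorithms helps businesses extract value from their data and turn it into productivity.

This article describes how the open-source edition of PolarDB uses pg_similarity to provide 17 kinds of text similarity search: text is split into normalized tokens, and similar text is retrieved by similarity score. The test environment is macOS + Docker. For PolarDB deployment, see "How to Use PolarDB to Prove Warren Buffett's Investment Philosophy - Including Simple PolarDB Deployment".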

pg_similarity for PolarDB

  • pg_similarity supports 17 similarity algorithms:

  • L1 Distance (also known as City Block or Manhattan Distance);

  • Cosine Distance;

  • Dice Coefficient;

  • Euclidean Distance;

  • Hamming Distance;

  • Jaccard Coefficient;

  • Jaro Distance;

  • Jaro-Winkler Distance;

  • Levenshtein Distance;

  • Matching Coefficient;

  • Monge-Elkan Coefficient;

  • Needleman-Wunsch Coefficient;

  • Overlap Coefficient;

  • Q-Gram Distance;

  • Smith-Waterman Coefficient;

  • Smith-Waterman-Gotoh Coefficient;

  • Soundex Distance.

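Many of the algorithms above support index-accelerated search (the GIN operator class shown later in this article covers 8 of them). For details, see: https://github.com/eulerto/pg_similarity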

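Caveats

  • The token splitting and normalization algorithm is set by a parameter. If the parameter is a when data is written, the text is tokenized according to a. If the parameter is later changed to b, text tokenized afterwards may be split differently from text tokenized before. This is fine if the business can tolerate it.

  • The same applies when comparing text for similarity.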
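To make the caveat concrete, here is a minimal Python sketch of how the tokenizer choice changes a similarity score for the same pair of strings. The two tokenizer functions are simplified stand-ins for the `word` and `alnum` settings (whitespace splitting versus splitting on non-alphanumeric characters); they are assumptions for illustration, not pg_similarity's actual C implementation.

```python
import re

def tokenize_word(text):
    """Simplified 'word' tokenizer (assumption): split on whitespace only."""
    return set(text.split())

def tokenize_alnum(text):
    """Simplified 'alnum' tokenizer (assumption): split on non-alphanumerics."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def jaccard(a_tokens, b_tokens):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    union = a_tokens | b_tokens
    return len(a_tokens & b_tokens) / len(union) if union else 0.0

stored = "Oliveira, Euler T."
query = "Euler Oliveira"

# Same strings, different tokenizers, different scores.
score_word = jaccard(tokenize_word(stored), tokenize_word(query))
score_alnum = jaccard(tokenize_alnum(stored), tokenize_alnum(query))

print(f"word tokenizer:  {score_word:.3f}")   # 'Oliveira,' keeps its comma
print(f"alnum tokenizer: {score_alnum:.3f}")  # punctuation is stripped
```

With whitespace splitting, the token `Oliveira,` keeps its comma and no longer matches `Oliveira`, so the score drops. This is why changing a tokenizer parameter between write time and query time can silently change which rows count as similar.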

Deploying pg_similarity for PolarDB

  1. Download and compile

git clone --depth 1 https://github.com/eulerto/pg_similarity.git  
  
  
cd pg_similarity/  
  
USE_PGXS=1 make  
USE_PGXS=1 make install  
export PGHOST=127.0.0.1  
  
[postgres@67e1eed1b4b6 pg_similarity]$ USE_PGXS=1 make installcheck  
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --bindir='/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin'      --dbname=contrib_regression test1 test2 test3 test4  
(using postmaster on 127.0.0.1, default port)  
============== dropping database "contrib_regression" ==============  
DROP DATABASE  
============== creating database "contrib_regression" ==============  
CREATE DATABASE  
ALTER DATABASE  
============== running regression test queries        ==============  
test test1                        ... ok  
test test2                        ... ok  
test test3                        ... ok  
test test4                        ... ok  
  
  
==========================================================  
 All 4 tests passed.   
  
 POLARDB:  
 All 4 tests, 0 tests in ignore, 0 tests in polar ignore.   
==========================================================  
  2. Load the pg_similarity extension

postgres=# create database db1;  
CREATE DATABASE  
  
postgres=# \c db1  
You are now connected to database "db1" as user "postgres".  
db1=# create extension pg_similarity ;  
CREATE EXTENSION  
  3. The pg_similarity extension adds a set of functions and operators for similarity search.

db1=# \df  
                                                             List of functions  
 Schema |          Name           | Result data type |                              Argument data types                              | Type   
--------+-------------------------+------------------+-------------------------------------------------------------------------------+------  
 public | block                   | double precision | text, text                                                                    | func  
 public | block_op                | boolean          | text, text                                                                    | func  
 public | cosine                  | double precision | text, text                                                                    | func  
 public | cosine_op               | boolean          | text, text                                                                    | func  
 public | dice                    | double precision | text, text                                                                    | func  
 public | dice_op                 | boolean          | text, text                                                                    | func  
 public | euclidean               | double precision | text, text                                                                    | func  
 public | euclidean_op            | boolean          | text, text                                                                    | func  
 public | gin_extract_query_token | internal         | internal, internal, smallint, internal, internal, internal, internal          | func  
 public | gin_extract_value_token | internal         | internal, internal, internal                                                  | func  
 public | gin_token_consistent    | boolean          | internal, smallint, internal, integer, internal, internal, internal, internal | func  
 public | hamming                 | double precision | bit varying, bit varying                                                      | func  
 public | hamming_op              | boolean          | bit varying, bit varying                                                      | func  
 public | hamming_text            | double precision | text, text                                                                    | func  
 public | hamming_text_op         | boolean          | text, text                                                                    | func  
 public | jaccard                 | double precision | text, text                                                                    | func  
 public | jaccard_op              | boolean          | text, text                                                                    | func  
 public | jaro                    | double precision | text, text                                                                    | func  
 public | jaro_op                 | boolean          | text, text                                                                    | func  
 public | jarowinkler             | double precision | text, text                                                                    | func  
 public | jarowinkler_op          | boolean          | text, text                                                                    | func  
 public | lev                     | double precision | text, text                                                                    | func  
 public | lev_op                  | boolean          | text, text                                                                    | func  
 public | matchingcoefficient     | double precision | text, text                                                                    | func  
 public | matchingcoefficient_op  | boolean          | text, text                                                                    | func  
 public | mongeelkan              | double precision | text, text                                                                    | func  
 public | mongeelkan_op           | boolean          | text, text                                                                    | func  
 public | needlemanwunsch         | double precision | text, text                                                                    | func  
 public | needlemanwunsch_op      | boolean          | text, text                                                                    | func  
 public | overlapcoefficient      | double precision | text, text                                                                    | func  
 public | overlapcoefficient_op   | boolean          | text, text                                                                    | func  
 public | qgram                   | double precision | text, text                                                                    | func  
 public | qgram_op                | boolean          | text, text                                                                    | func  
 public | smithwaterman           | double precision | text, text                                                                    | func  
 public | smithwaterman_op        | boolean          | text, text                                                                    | func  
 public | smithwatermangotoh      | double precision | text, text                                                                    | func  
 public | smithwatermangotoh_op   | boolean          | text, text                                                                    | func  
 public | soundex                 | double precision | text, text                                                                    | func  
 public | soundex_op              | boolean          | text, text                                                                    | func  
(39 rows)  
  
db1=# \do  
                             List of operators  
 Schema | Name | Left arg type | Right arg type | Result type | Description   
--------+------+---------------+----------------+-------------+-------------  
 public | ~!!  | text          | text           | boolean     |   
 public | ~!~  | text          | text           | boolean     |   
 public | ~##  | text          | text           | boolean     |   
 public | ~#~  | text          | text           | boolean     |   
 public | ~%%  | text          | text           | boolean     |   
 public | ~**  | text          | text           | boolean     |   
 public | ~*~  | text          | text           | boolean     |   
 public | ~++  | text          | text           | boolean     |   
 public | ~-~  | text          | text           | boolean     |   
 public | ~==  | text          | text           | boolean     |   
 public | ~=~  | text          | text           | boolean     |   
 public | ~??  | text          | text           | boolean     |   
 public | ~@@  | text          | text           | boolean     |   
 public | ~@~  | text          | text           | boolean     |   
 public | ~^^  | text          | text           | boolean     |   
 public | ~||  | text          | text           | boolean     |   
 public | ~~~  | text          | text           | boolean     |   
(17 rows)  
  4. Common pg_similarity settings. To start testing, we only need to add pg_similarity to shared_preload_libraries.

[postgres@67e1eed1b4b6 pg_similarity]$ cat pg_similarity.conf.sample   
#-----------------------------------------------------------------------  
# postgresql.conf  
#-----------------------------------------------------------------------  
# the former needs a restart every time you upgrade pg_similarity and   
# the later needs that you create a $libdir/plugins directory and move   
# pg_similarity.so to it (it doesn't require a restart; just open a new  
# connection).  
#shared_preload_libraries = 'pg_similarity'  
# - or -  
#local_preload_libraries = 'pg_similarity'  
  
#-----------------------------------------------------------------------  
# pg_similarity  
#-----------------------------------------------------------------------  
  
# - Block -  
#pg_similarity.block_tokenizer = 'alnum'  # alnum, camelcase, gram, or word  
#pg_similarity.block_threshold = 0.7    # 0.0 .. 1.0  
#pg_similarity.block_is_normalized = true  
  
# - Cosine -  
#pg_similarity.cosine_tokenizer = 'alnum'  
#pg_similarity.cosine_threshold = 0.7  
#pg_similarity.cosine_is_normalized = true  
  
# - Dice -  
#pg_similarity.dice_tokenizer = 'alnum'  
#pg_similarity.dice_threshold = 0.7  
#pg_similarity.dice_is_normalized = true  
  
# - Euclidean -  
#pg_similarity.euclidean_tokenizer = 'alnum'  
#pg_similarity.euclidean_threshold = 0.7  
#pg_similarity.euclidean_is_normalized = true  
  
# - Hamming -  
#pg_similarity.hamming_threshold = 0.7  
#pg_similarity.hamming_is_normalized = true  
  
# - Jaccard -  
#pg_similarity.jaccard_tokenizer = 'alnum'  
#pg_similarity.jaccard_threshold = 0.7  
#pg_similarity.jaccard_is_normalized = true  
  
# - Jaro -  
#pg_similarity.jaro_threshold = 0.7  
#pg_similarity.jaro_is_normalized = true  
  
# - Jaro -  
#pg_similarity.jaro_threshold = 0.7  
#pg_similarity.jaro_is_normalized = true  
  
# - Jaro-Winkler -  
#pg_similarity.jarowinkler_threshold = 0.7  
#pg_similarity.jarowinkler_is_normalized = true  
  
# - Levenshtein -  
#pg_similarity.levenshtein_threshold = 0.7  
#pg_similarity.levenshtein_is_normalized = true  
  
# - Matching Coefficient -  
#pg_similarity.matching_tokenizer = 'alnum'  
#pg_similarity.matching_threshold = 0.7  
#pg_similarity.matching_is_normalized = true  
  
# - Monge-Elkan -  
#pg_similarity.mongeelkan_tokenizer = 'alnum'  
#pg_similarity.mongeelkan_threshold = 0.7  
#pg_similarity.mongeelkan_is_normalized = true  
  
# - Needleman-Wunsch -  
#pg_similarity.nw_threshold = 0.7  
#pg_similarity.nw_is_normalized = true  
  
# - Overlap Coefficient -  
#pg_similarity.overlap_tokenizer = 'alnum'  
#pg_similarity.overlap_threshold = 0.7  
#pg_similarity.overlap_is_normalized = true  
  
# - Q-Gram -  
#pg_similarity.qgram_tokenizer = 'qgram'  
#pg_similarity.qgram_threshold = 0.7  
#pg_similarity.qgram_is_normalized = true  
  
# - Smith-Waterman -  
#pg_similarity.sw_threshold = 0.7  
#pg_similarity.sw_is_normalized = true  
  
# - Smith-Waterman-Gotoh -  
#pg_similarity.swg_threshold = 0.7  
#pg_similarity.swg_is_normalized = true  
  5. Test similarity search: import the test data

[postgres@67e1eed1b4b6 ~]$ cd pg_similarity/  
[postgres@67e1eed1b4b6 pg_similarity]$ psql  
psql (11.9)  
Type "help" for help.  
  
postgres=# CREATE TABLE simtst (a text);  
CREATE TABLE  
postgres=#   
postgres=# INSERT INTO simtst (a) VALUES  
postgres-# ('Euler Taveira de Oliveira'),  
postgres-# ('EULER TAVEIRA DE OLIVEIRA'),  
postgres-# ('Euler T. de Oliveira'),  
postgres-# ('Oliveira, Euler T.'),  
postgres-# ('Euler Oliveira'),  
postgres-# ('Euler Taveira'),  
postgres-# ('EULER TAVEIRA OLIVEIRA'),  
postgres-# ('Oliveira, Euler'),  
postgres-# ('Oliveira, E. T.'),  
postgres-# ('ETO');  
INSERT 0 10  
postgres=#   
postgres=# \copy simtst FROM 'data/similarity.data'  
COPY 2999  
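  6. Test similarity search: create a GIN index

https://github.com/eulerto/pg_similarity/blob/master/pg_similarity--1.0.sql

The following operators support index lookups (the entries commented out with -- are not part of the operator class):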

CREATE OPERATOR CLASS gin_similarity_ops  
FOR TYPE text USING gin  
AS  
    OPERATOR    1   ~++,    -- block  
    OPERATOR    2   ~##,    -- cosine  
    OPERATOR    3   ~-~,    -- dice  
    OPERATOR    4   ~!!,    -- euclidean  
    OPERATOR    5   ~??,    -- jaccard  
--    OPERATOR    6   ~%%,    -- jaro  
--    OPERATOR    7   ~@@,    -- jarowinkler  
--    OPERATOR    8   ~==,    -- lev  
    OPERATOR    9   ~^^,    -- matchingcoefficient  
--    OPERATOR    10  ~||,    -- mongeelkan  
--    OPERATOR    11  ~#~,    -- needlemanwunsch  
    OPERATOR    12  ~**,    -- overlapcoefficient  
    OPERATOR    13  ~~~,    -- qgram  
--    OPERATOR    14  ~=~,    -- smithwaterman  
--    OPERATOR    15  ~!~,    -- smithwatermangotoh  
--    OPERATOR    16  ~*~,    -- soundex  
    FUNCTION    1   bttextcmp(text, text),  
    FUNCTION    2   gin_extract_value_token(internal, internal, internal),  
    FUNCTION    3   gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal),  
    FUNCTION    4   gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal),  
    STORAGE text;  
postgres=# create index on simtst using gin (a gin_similarity_ops);  
CREATE INDEX  
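As a rough mental model of why a token-based GIN index helps, the sketch below builds an inverted index (token to row ids) in Python and uses it to collect candidate rows that share at least one token with the query. It is a conceptual illustration only: the `tokens` helper, the case-sensitive matching, and the sample rows are assumptions, and the real index relies on `gin_extract_value_token`, `gin_extract_query_token`, and `gin_token_consistent` from the operator class above.

```python
import re
from collections import defaultdict

def tokens(text):
    """Simplified alnum-style tokenizer (assumption), case-sensitive."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

rows = [
    "EULER TAVEIRA DE OLIVEIRA",
    "Euler Taveira de Oliveira",
    "EULER TAVEIRA OLIVEIRA",
    "ETO",
]

# Build the inverted index: token -> set of row ids containing that token.
index = defaultdict(set)
for row_id, text in enumerate(rows):
    for tok in tokens(text):
        index[tok].add(row_id)

def candidates(query):
    """Row ids sharing at least one token with the query (the index step)."""
    found = set()
    for tok in tokens(query):
        found |= index.get(tok, set())
    return sorted(found)

cand = candidates("EULER TAVEIRA DE OLIVEI")
print([rows[i] for i in cand])
```

Only the candidate rows then need their similarity score computed and compared with the threshold, instead of every row in the table. That second step corresponds to the recheck condition that appears in the query plan later in this article.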
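  7. Test similarity search: use the index to quickly locate target rows by similarity.

The threshold controls which rows are returned: only rows whose similarity is greater than or equal to the threshold match.

The higher the similarity threshold, the narrower the result set and the better the performance.

You can set the threshold inside a function and return results in stages.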

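See also: "Recommendation Systems for Social, E-commerce, Gaming, and More (Similarity Recommendation) - Comparing the Alibaba Cloud pase and smlar Index Solutions"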
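The staged approach can be sketched in Python as follows: start with a strict threshold and relax it step by step until enough rows match. The `cosine` helper is a simplified token-set cosine, and the function name `staged_search`, the threshold steps, and the `wanted` parameter are all hypothetical. In the database, each stage would correspond to changing `pg_similarity.cosine_threshold` with `SET` and re-running the query.

```python
import math
import re

def tokens(text):
    """Simplified alnum-style tokenizer (assumption), case-sensitive."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def cosine(a, b):
    """Token-set cosine: shared tokens / sqrt(|A| * |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))

def staged_search(rows, query, thresholds=(0.9, 0.7, 0.5), wanted=2):
    """Try thresholds from strict to loose; stop once `wanted` rows match."""
    hits = []
    used = None
    for used in thresholds:
        hits = [r for r in rows if cosine(r, query) >= used]
        if len(hits) >= wanted:
            break
    return used, hits

rows = [
    "EULER TAVEIRA DE OLIVEIRA",
    "EULER TAVEIRA OLIVEIRA",
    "EULER OLIVEIRA",
    "ETO",
]
used, hits = staged_search(rows, "EULER TAVEIRA DE OLIVEI")
print(used, hits)
```

Starting strict keeps the early stages cheap, because few rows pass a high threshold. The looser and more expensive stages only run when the strict ones do not return enough rows.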

postgres=# show pg_similarity.cosine_tokenizer;  
 pg_similarity.cosine_tokenizer   
--------------------------------  
 alnum  
(1 row)  
  
postgres=# show pg_similarity.cosine_threshold;  
 pg_similarity.cosine_threshold   
--------------------------------  
 0.7  
(1 row)  
  
postgres=# show pg_similarity.cosine_is_normalized;  
 pg_similarity.cosine_is_normalized   
------------------------------------  
 on  
(1 row)  
  
postgres=# select *, cosine(a, 'hello')  from simtst where  a ~## 'hello' limit 10;  
 a | cosine   
---+--------  
(0 rows)  
  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
             a             | cosine   
---------------------------+--------  
 EULER TAVEIRA DE OLIVEIRA |   0.75  
(1 row)  
  
postgres=# explain select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
                                    QUERY PLAN                                      
----------------------------------------------------------------------------------  
 Limit  (cost=36.02..44.29 rows=3 width=40)  
   ->  Bitmap Heap Scan on simtst  (cost=36.02..44.29 rows=3 width=40)  
         Recheck Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)  
         ->  Bitmap Index Scan on simtst_a_idx  (cost=0.00..36.02 rows=3 width=0)  
               Index Cond: (a ~## 'EULER TAVEIRA DE OLIVEI'::text)  
(5 rows)  
  
postgres=# set pg_similarity.cosine_threshold=0.75;  
SET  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
             a             | cosine   
---------------------------+--------  
 EULER TAVEIRA DE OLIVEIRA |   0.75  
(1 row)  
  
postgres=# set pg_similarity.cosine_threshold=0.76;  
SET  
postgres=# select *, cosine(a, 'EULER TAVEIRA DE OLIVEI')  from simtst where  a ~## 'EULER TAVEIRA DE OLIVEI' limit 10;  
 a | cosine   
---+--------  
(0 rows)  
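The 0.75 score in the session above is consistent with a cosine computed over token sets: the stored text and the query each have 4 tokens and share 3 of them, so 3 / sqrt(4 * 4) = 0.75. The Python sketch below reproduces that arithmetic and the threshold behaviour of the `~##` operator. It assumes alnum-style splitting and case-sensitive token comparison, and it is a model of the observed results, not pg_similarity's source code.

```python
import math
import re

def tokens(text):
    """Simplified alnum-style tokenizer (assumption), case-sensitive."""
    return {t for t in re.split(r"[^0-9A-Za-z]+", text) if t}

def cosine(a, b):
    """Token-set cosine: shared tokens / sqrt(|A| * |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))

def cosine_op(a, b, threshold=0.7):
    """Model of the ~## operator: true when the score reaches the threshold."""
    return cosine(a, b) >= threshold

stored = "EULER TAVEIRA DE OLIVEIRA"
query = "EULER TAVEIRA DE OLIVEI"

score = cosine(stored, query)
print(score)
for th in (0.7, 0.75, 0.76):
    print(th, cosine_op(stored, query, th))
```

The row matches at thresholds 0.7 and 0.75 and stops matching at 0.76, which mirrors the three queries in the transcript.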

References

https://github.com/eulerto/pg_similarity
