PostgreSQL里的17种文本相似算法与GIN索引 - pg_similarity

本文涉及的产品
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
云原生数据库 PolarDB 分布式版,标准版 2核8GB
简介:

标签

PostgreSQL , 文本相似 , pg_similarity , pg_trgm , rum , fuzzymatch gin , smlar


背景

文本相似算法,结合PostgreSQL的开放索引框架GIN,可以实现各种相似算法的文本高效检索。

PostgreSQL中常见的文本相似搜索插件:rum, pg_trgm, fuzzymatch, pg_similarity, smlar。

其中pg_similarity支持的算法达到了17种。

Introduction

pg_similarity is an extension to support similarity queries on PostgreSQL.

The implementation is tightly integrated in the RDBMS in the sense that it defines operators
so instead of the traditional operators (= and <>) you can use ~~~ and ! (any of these
operators represents a similarity function).

pg_similarity has three main components:

Functions:

a set of functions that implements similarity algorithms available in the literature.

These functions can be used as UDFs and, will be the base for implementing the similarity operators;

Operators:

a set of operators defined at the top of similarity functions.

They use similarity functions to obtain the similarity threshold and,
compare its value to a user-defined threshold to decide if it is a match or not;

Session Variables:

a set of variables that store similarity function parameters. Theses variables can be defined at run time.

  • L1 Distance (as known as City Block or Manhattan Distance);
  • Cosine Distance;
  • Dice Coefficient;
  • Euclidean Distance;
  • Hamming Distance;
  • Jaccard Coefficient;
  • Jaro Distance;
  • Jaro-Winkler Distance;
  • Levenshtein Distance;
  • Matching Coefficient;
  • Monge-Elkan Coefficient;
  • Needleman-Wunsch Coefficient;
  • Overlap Coefficient;
  • Q-Gram Distance;
  • Smith-Waterman Coefficient;
  • Smith-Waterman-Gotoh Coefficient;
  • Soundex Distance.

用法如下

Algorithm Function Operator Use Index? Parameters
L1 Distance block(text, text) returns float8 ~++ yes pg_similarity.block_tokenizer (enum)
pg_similarity.block_threshold (float8)
pg_similarity.block_is_normalized (bool)
Cosine Distance cosine(text, text) returns float8 ~## yes pg_similarity.cosine_tokenizer (enum)
pg_similarity.cosine_threshold (float8)
pg_similarity.cosine_is_normalized (bool)
Dice Coefficient dice(text, text) returns float8 ~-~ yes pg_similarity.dice_tokenizer (enum)
pg_similarity.dice_threshold (float8)
pg_similarity.dice_is_normalized (bool)
Euclidean Distance euclidean(text, text) returns float8 ~!! yes pg_similarity.euclidean_tokenizer (enum)
pg_similarity.euclidean_threshold (float8)
pg_similarity.euclidean_is_normalized (bool)
Hamming Distance hamming(bit varying, bit varying) returns float8
hamming_text(text, text) returns float8
~@~ no pg_similarity.hamming_threshold (float8)
pg_similarity.hamming_is_normalized (bool)
Jaccard Coefficient jaccard(text, text) returns float8 ~?? yes pg_similarity.jaccard_tokenizer (enum)
pg_similarity.jaccard_threshold (float8)
pg_similarity.jaccard_is_normalized (bool)
Jaro Distance jaro(text, text) returns float8 ~%% no pg_similarity.jaro_threshold (float8)
pg_similarity.jaro_is_normalized (bool)
Jaro-Winkler Distance jarowinkler(text, text) returns float8 ~@@ no pg_similarity.jarowinkler_threshold (float8)
pg_similarity.jarowinkler_is_normalized (bool)
Levenshtein Distance lev(text, text) returns float8 ~== no pg_similarity.levenshtein_threshold (float8)
pg_similarity.levenshtein_is_normalized (bool)
Matching Coefficient matchingcoefficient(text, text) returns float8 ~^^ yes pg_similarity.matching_tokenizer (enum)
pg_similarity.matching_threshold (float8)
pg_similarity.matching_is_normalized (bool)
Monge-Elkan Coefficient mongeelkan(text, text) returns float8 ~|| no pg_similarity.mongeelkan_tokenizer (enum)
pg_similarity.mongeelkan_threshold (float8)
pg_similarity.mongeelkan_is_normalized (bool)
Needleman-Wunsch Coefficient needlemanwunsch(text, text) returns float8 ~#~ no pg_similarity.nw_threshold (float8)
pg_similarity.nw_is_normalized (bool)
Overlap Coefficient overlapcoefficient(text, text) returns float8 ~** yes pg_similarity.overlap_tokenizer (enum)
pg_similarity.overlap_threshold (float8)
pg_similarity.overlap_is_normalized (bool)
Q-Gram Distance qgram(text, text) returns float8 ~~~ yes pg_similarity.qgram_threshold (float8)
pg_similarity.qgram_is_normalized (bool)
Smith-Waterman Coefficient smithwaterman(text, text) returns float8 ~=~ no pg_similarity.sw_threshold (float8)
pg_similarity.sw_is_normalized (bool)
Smith-Waterman-Gotoh Coefficient smithwatermangotoh(text, text) returns float8 ~!~ no pg_similarity.swg_threshold (float8)
pg_similarity.swg_is_normalized (bool)
Soundex Distance soundex(text, text) returns float8 ~*~ no

参考

http://pgsimilarity.projects.pgfoundry.org/

https://github.com/eulerto/pg_similarity

https://www.pgcon.org/2009/schedule/attachments/108_pg_similarity.pdf

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

https://github.com/postgrespro/rum

https://www.postgresql.org/docs/9.6/static/fuzzystrmatch.html

https://www.postgresql.org/docs/9.6/static/pgtrgm.html

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
相关文章
|
19天前
|
存储 关系型数据库 MySQL
Mysql索引总结(1)
Mysql索引总结(1)
23 0
|
19天前
|
存储 关系型数据库 MySQL
MySQL 索引的10 个核心要点
MySQL 索引的10 个核心要点
20 0
|
19天前
|
SQL 关系型数据库 MySQL
MySQL8.0索引新特性
MySQL8.0索引新特性
17 0
|
5天前
|
SQL 存储 关系型数据库
MySQL索引及事务
MySQL索引及事务
17 2
|
5天前
|
存储 SQL 关系型数据库
完蛋!😱 我被MySQL索引失效包围了!
完蛋!😱 我被MySQL索引失效包围了!
|
5天前
|
SQL 存储 关系型数据库
MySQL的3种索引合并优化⭐️or到底能不能用索引?
MySQL的3种索引合并优化⭐️or到底能不能用索引?
|
5天前
|
存储 SQL 关系型数据库
MySQL索引,看这一篇就够了!
MySQL索引,看这一篇就够了!
|
5天前
|
Java 关系型数据库 MySQL
MySQL 索引事务
MySQL 索引事务
13 0
|
6天前
|
存储 SQL 关系型数据库
MySQL 底层数据结构 聚簇索引以及二级索引 Explain的使用
MySQL 底层数据结构 聚簇索引以及二级索引 Explain的使用
23 0
|
6天前
|
自然语言处理 关系型数据库 MySQL
一文明白MySQL索引的用法及好处
一文明白MySQL索引的用法及好处
17 0

相关产品

  • 云原生数据库 PolarDB