4 PostgreSQL 索引，全文检索，模糊匹配，近似度匹配（三）-阿里云开发者社区

开发者学堂课程【PostgreSQL快速入门：4 PostgreSQL 索引，全文检索，模糊匹配，近似度匹配（三）】学习笔记与课程紧密联系，让用户快速学习知识

课程地址：https://developer.aliyun.com/learning/course/16/detail/90

4 PostgreSQL 索引，全文检索，模糊匹配，近似度匹配（三）

内容介绍：

十一、HOT Update

十二、PostgreSQL 全文检索

十三、pg_trgm 近似匹配

十四、练习

十一、HOT Update

1、索引指针结构

2、ItemPointers (index) ->索引中存储指向数据linepoint，linepoint在heap page中，ItemPointers在索引页中，linepoint是ctid，linepoint1是0号数据块的第一条记录，linepoint2是0号数据库第二条记录，hot当记录在发生变更，tuple1还在，ItemPointers不需要变更，不需要指向linepoint2，linepoint1指向tuple2，tuple2指向linepoint2，linepoint2执行tuple2。

回收掉之后linepoint1直接指向linepoint2，linepoint2指向tuple2。

利用pageinspect extension 来观察HOT

postgres= # create extension pageinspect;创建

Get_raw_page是指把hot_test0号数据块的罗数据去取出来

Page_header是取出头部信息，包含isn，lower，upper等

postgres= # create table hot_ test (id int primary key,info text);创建测试表

postgres =# insert into hot_ test values (1,'digoal');

因为是从0号page开始插入, 这里就省去了查询ctid等过程，直接切入0号page。

当前的page信息

postgres =# select * from page_ header(get_raw_ page('hot_ _test',0)))；

lsn I tli I flags I lower l upper I special I pagesize I version prune_ xid

2/75B27878I 1 I 0 I 28I 8152I 8192I 8192I 4 I 0

当前的item信息

postgres =# select * from heap_ page_ items(get raw_ _page('hot test',0));只有一条记录，当前执行的ctid是0，1，进行更新

当前索引的page信息

postgres # select * from page. _header(get. raw_ _page('hot_ test_ pkey',0);

lsn Itli Iflags l lower I upper I special I pagesize I version I prune _xid

2/75B278B｜1｜0｜48｜8176｜8176｜8192｜4｜0

当前索引的item信息

postgres= # select * from heap. _page items(get_ raw_ _page('hot test_ pkey',0);索引当前只有一个，指向0，1

进行update hot test set info new where id ghri，第一条记录被更新，t_xmax变成更新的事物id，1881条记录变成新纪录，新行版本，tid变成0，2，就是hot起的作用，跟图里是一样的，指向第二条记录。

第一条记录指向第二条记录，vacuum有变化，回收，第一条记录不指向第二条记录，变成空记录，变成第二幅图所示，linepoint1还是存在，只是指向了linepoint2，linepoint2直接指向tuple2。

如果发生多次操作，update很多次，会生成很多个版本，

postgres= # update hot_ test set info= 'new' where id=1;

postgres=# update hot_ test set info='new' where id=1;

postgres= # update hot_ test set info='new' where id= 1;

postgres= # update hot_ test set info= 'new' where id= 1;

postgres=# update hot_ test set info='new' where id=1; ,

postgres=# update hot_ test set info='new' where id=1;

会发现串行指向，记录没有被回收掉，只是指向了下一条被更新的记录，vacuum之后都会回收，跟第二幅图一样。

第一条记录一直都在，指向真实数据。

十二、PostgreSQL 全文检索

和全文检索相关的有两个数据类型。

tsvector 文本在经过全文检索标准化处理后得到类型,处理后的文本包括(分词去重复后排序,分词的位置,分词的权重结构(一共可以指

定4个权重ABCD, D默认不显示))。

tsquery 需要检索的分词组合,组合类型包括&,|,!(与,或,否)，同时还支持分词的权重,分词的前导匹配。

详细介绍

htp://www postgrcsql.org/docs/9.3/static/datatypc-textscarch.html

htp://www. postgresql or/docs/9.3/static/textsearch html

字符串到tsvector的默认转换例子:

digoal=# select SShello world, i'm digals.s:tsvector;

tsvector

'digoal.' "hello' "m' 'world,'

这种转换后得到的tsvector不包含分词的位置信息和权重信息.只有排序后的分词。

权重和位置信息可以提现在文本中(权重一般用来表示该分词所在的级别，如目录,或正文? ),

digoal=# select $$hello:1b world，:1a i'm:3D digas.s::tsvector,

tsvector加了权重的分词，hello放在第一的位置，权重是b，d是默认的所以不会显示，

'digoal.' "hello':1B 'i"m':3 'world,:1A大写

使用to_tsvector, 可以指定不同的语言配置,把文本根据指定的语言配置进行分词。

例如，使用西班牙语和英语得到的tsvector值是不一样的。

digoal=# select to_ tsvector'english', SSHello world, I'm digoal.SS);

to_ tsvector

'digoal':5 "hello':1 'm':4 'world':2

(1 row)

digoal=# select to. tsvector('spanish', $$Hello world, I'm digoal.SS);

to_ tsvector

'digoal':5 hell':1 ':3 'm':4 'world':2

(1 row)

查看系统中已经安装的全文检索配置。

digoal=#\ dF *

List of test search configurations

Schemal NameI

Description

pg_ catalog I danishI configuration for danish language

pg_ catalog I Idutch I configuration for dutch language

pg_ catalog I english I configuration for english language

pg_ catalog I finnishI configuration for finnish language

pg_ catalog I french I configuration for french language

Pg_ catalog I german I configuration for german language

pg_ catalog I hungarian I configuration for hungarian language

pg_ catalog I italianI configuration for italian language

pg_ catalog I norwegianI configuration for norwegian language

pg. catalog I portuguese I configuration for portuguese language

pg_ catalog I romanian I configuration for romanian language

pg_ catalog I russian I configuration for russian language

pg_ catalog I simple I simple configuration

pg_ catalog I spanish I configuration for spanish language

pg_ catalog I swedishI configuration for swedish language

pg_ catalog I turkish I configuration for turkish language

后面的例子中会有中文的安装

tsquery的例子

&,|,!组合,分组使用括号

SELECT 'fat & rt::tsquery;

tsquery

'far'&‘rat’

SELECT 'fat & (rat cat)::tsquery;

tsquery

'fat' & ( 'rat'｜'cat' )

SELECT 'fat & rat & ! cat::tsquery;

tsquery

'fat' & 'rat' & !'cat*

支持指定权重

SELECT 'fat:ab & cat'::tsquery;

tsquery

'fat':AB & 'cat*

支持前导匹配

SELECT 'super:*::tsquery;

tsquery

'super':*

使用to_ tsquery转换时,也可以带上语言配置

digoal=# SELECT to_ tsquery('english', 'Fat:ab & Cats');

to_ tsquery

'fat':AB & 'cat'

和全文检索相关的函数和操作符

Operator	Description	Example	Result
@@	tsvector matches tsquery ?查询的分词是否在分词列里面	to_ tsvector(' fat cats ate rats' ) @@ to_ tsquery(' cat a rat' )	t是
@@@	deprecated synonym for @@	to_ tsvector(' fat cats ate rats' ) @@@ to_ tsquery(' cat a rat' )	t
｜｜合并	concatenate tsvectorS	b:2' ::tsvectorll' c:1 d:2 b:3' ::tsvector	‘a' :1 'b' :2,5 'c':3 'd' ;4
&&与，或	AND tsqueryS together	fatI rat' ::tsquery &&' cat.',: : tsquery	( ' fat' I ' rat') &'cat'
｜｜把值进行合并	OR tsqueryS together	fatI rat' :: tsquery｜｜' cat' : : tsquery	('fat'I'rat')I'cat'
!!	negate a tsquery	!! cat' :: tsquery进行反转	!' cat'非
@>包含	tsquery contains another ?	‘cat' : : tsquery @>’cat@Rat‘：： tsquery	f
<@	tsquery is contained in ?	‘cat' : : tsquery <@’cat@Rat‘：： tsquery	t

全文检索的索引使用

digoal= =# create table ts(id int, info tsvector, crt_ time timestamp);

CREATE TABLE

digoal=# insert into ts values (1, $$Hello world, i'm digoal.$$, now());

INSERT0 1

digoal=# create index idx_ ts_ 1 on ts using gin (info);

CREATE INDEX

digoal=# select * from ts where info @@ 'digol.::tsquery;使用gin索引支持@@，是否包含tsquery

id info crt_ time

1 I 'Hello' 'digoal.' 'i"m' 'world,' I 2013-12-09 16:35:55.635111

(1 row)

全文检索的索引使用(svector支持的素索引策略gin, gist, btree)

GIN索引策略,可用itsvector包含tsquery的查询匹配。

digoal= # set enable_ seqscan=off;全链扫描关闭

SET

digoal= # explain analyze select * from ts where info @@ 'digoal'::tsquery;

QUERY PLAN

Bitmap Heap Scan on ts (cost=2.00..3.01 rows=1 width=44) (actual time=0.02 1 ..0.021 rows=1 loops=1)

Recheck Cond: (info @@ "didgol.":squery)

> Bitmap Index Scan on idx_ ts_ 1 (cost=0.00..2.00 rows=1width=()) (actual time=0.016..0.016 rows=1 loops=1)

Index Cond: (info @@ "digol":tsquery)

Total runtime: 0.061 ms

(5 rows)数据库全文检索支持索引

全文检索的索引使用

GiST索引策略,可用于包含匹配

digoal=# drop index idx_ ts_ 1;

DROP INDEX

digoal=# create index idx _ts_ 1 on ts using gist (info);

CREATE INDEX

digoal=# explain analyze select * from ts where info @@ 'digoal.::tsquery;

QUERY PLAN

Index Scan usingidx_ ts_ 1 on ts (cost=0. 12..2.14 rows=l width=44) (actual time=0.016..0.017 rows=l loops=l)

Index Cond: (info @@ "igoal."::tsquery)

Total runtime: 0.055 ms

(3 rows)

中文全文检索举例，中文检索有很多种，对于postgresql本身没有中文分词库，需要外部的插件支持，腾讯有中文解析插件。

http://blog. 163.com/digoal@126/blog/static/163877040201252141010693/

中文全文检索语言配置安装简介

下载进行部署。

Cent0S 5. x 64bit

PostgreSQL 9. 1.3

nlpbanboo-1.1. 2

cmake-2.8.8

CRF++-0.57

安装cmake

tar -zxvf cmake -2.8.8.tar.gz

cd cmake-2.8.8

./bootstrap --prefix =/opt/cmake2.8.8

gmake

gmake install

vi ~/.bash_ profile

export PATH=/opt/cmake2.8.8/bin:$PATH

. ~/.bash_ profile

如果不用数据库全文检索，在数据库中支持全文检索类型，在分词之后进行匹配，效率还行，对于大型的搜索引擎网站不会用到，它们会用专用的搜索引擎。对于小网站，不想另外部署搜索引擎可以使用这种方法。

安装crf

tar -zxvfCRF++-0.57.tar.gz

cd CRF++-0.57

Jconfigure

gmake

gmake install

提示把目录/usr/local/lib加到piggrep下面，打印，可以看到文件已经存在。

安装nlpbamboo

vi ~/.bash_ profile

export PGHOME=/opt/pgsql依赖pgsql

export PATH=$PGHOME/bin:/opt/bamboo/bin:/opt/emake2.8.8/bin:$PATH:.

export LD_ LIBRARY_ PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:.

, ~/.bash_profile

tar -jxvf nlpbamboo-l.l 2.tar.bz2

cd nlpbamboo-1.1.2

mkdir build

cd build

cmake .. -DCMAKE_ BUILD_ TYPE=release

gmake all

gmake install

加入lib库链接

echo "/usr/lib" >>/etc/ld. so. conf 这个命令是bamboo对应的动态链接库)

echo "/usr/local/lib" >>/etc/1d. so. conf 6这个命令是CF对应的动态链接库)

ldconfig -f /etc/ld so. conf

测试是否加入正常

1dconfig -plerep bambo

libbamboo. so.2 aibe6, x86-64) =h)

usr/1ib/libbamboo. so.2

libbunboo. so Qibe6, x86-64) => /usr/1ib/libbanboo. s0

1dconfig -plorep erf

liberfpp. so.0 01ibc6, x86-64) => /usr/1oce1/1ib/liberfpp. so. 0

liberfpp so Qibc6, x86-64) => /usr/1oce1/1ib/1iberfpp. so

加入索引

ed /opt/bamboc

wget http: //nlpbanboo. googlecode. com/ files/index. tar. br2

tar -jxvf index, tar. bz2

编译PostgreSQL支持模块

cd /opt/banboo/ exts/postares/ chinese_parser

make

make install

touch $PGHOME/ share/tsearch_data/ chinese _utf8. stop

ed 1 opt/banboo/ exts/ postgres/pe_tokenize

make

make install

安装PostgreSQL支持模块

su - poslares

ed $PGHOUE/ share/ contrib/

psql -h 127.0.0.1 postgres postgres -f chinese. parser. sql

psql -h 127.0.0.1 postgres postgres -f Pe_tokenize. sql

查看全文检索配置中加入了chinesecfg的配置

Postares=# select * from pg ts config；

中文分词安装好了

测试tokenize分词函数

postgres=# select * from tokenize( ‘你好我是中国人' );

拆成五个分词

postgres=# select * from tokenize(中华人民共和国):

不会被拆，是完整的词，百度也是完整词，今年是龙年被拆成三个。

测试全文检索类型转换函数

posteres-# select *from to tsvector<’ chineseefg' ，你好我是中国人目前在杭州斯凯做数据库相关的工作，‘);

指定中文的分词

' 中国':5’你好':1 '在:9'我':3 数据库':13斯凯':‘是’：4‘杭州:10 '的':15 '目前':8 '相关:14’负责’:12

中文分词可以使用，

table blog <id serial prinary key, user id int8, title text, content text, ts title tsvector,ts_content tsvector>；

在tsvector的字段上可以使用gist索引,或者gin索引，加速检索。

postgres=# create index idk_ blog ,ts1 on blog using gist(ts. title);

CREATE INDEX

postgres=# create index idk_ blog_ ts2 on blog using gist (ts_ content):

CREATE INDEX

插入测试内容:

posteres=# insert into bloε (user_ id, title, content, ts_ title, ts. content) values (1,' PostoreSQL 叫Q群FAQ贴- 1',' Q群里一些网友问到的问题，收集如下:

PostereSQL存储过程中自定义异常怎么弄?

Poster eSQIL9. 1的同步事务在某些情况下用户主动canecel等待syne replication standby的acknowledge,实际本地已提交。

PostereSQL加何满足已经存在则更新，不存在则插入的需求。

copy和insert哪个效率高?

PostgreSQL能不能限制数据库的大小?

怎样给一个用户授予只读角色?

不想让数据插入到某个表应该怎么做?

PostgreSQL中有没有r ownun这样的，显示结果集的序号?

PostgreSQL函数中如何使用savepoint?

请问，pe脚本有宏替换，计算字符串公式的能力?类似s=2 ; evsaluate(5-a) :如果将这个值赋值给这个变里呢? zresult =

evaluate(5-a)

UPDATE A表FROM B表?

hex转decinal

标题

PostereSQL时区.' , to. _tswector ( chinesecfd,' PostereSQL ∞群FAQ贴- 1' ), to, _tswector ( chinesecfd ,' Q群里一些网友问到的问题，收集如下:

内容

Poster eSQL存储过程中自定义异常怎么弄?

PostereSQL9. 1的同步事务在某些情况下用户主动cance1等待syne replicati on st andby的aclnowledge,实际本地已提交。

Poster eSQL如何满足已经存在则更新，不存在则插入的需求。

copy和insert哪个效率高?

PostereSQL能不能限制数据库的大小? 上怎样给一个用户授予只读角色?

不想让数据插入到某个表应该怎么做?

PostgreSQL 中有没有r ornun这样的，显示结果集的序号?

PostereSQL 函数中如何使用savepoint?

请问，pd脚本有宏替换,计算字符串公式的能力?类似o=2 ; evaluate(5-a) ;如果将这个值赋值给这个变童呢? zresult =

evaluate(5-a) ;

UPDATE A表FROM B表?

hex转decimal

PostereSQL时区' ));

标题拆分词组，内容拆分词组

从上面的全文检索类型字段ts_content可以看出，函数和表在这分词里面，使用函数和表作为匹配条件查询时将返回结果,换个不存在的查询则没有结果:

postgres=# select user. id, title from blog where ts_content @@to_ tsquery(‘函数&中国’);

没有。

postgres=# select user. id, title from blog where ts_content @@to_ tsquery(‘函数&表’);

有，用scan索引进行匹配。

十三、pg_trgm 近似匹配

跟中文检索不一样。

pg _trgm插件

使用：http://blog. 163.com/digoal@ 126/blog/static/1638770402013416102141801/

说明：http://www. postgresql org/docs/9.3/static/pgtrgm html

原理

创建数据库时已经把所有的contrib插件编译好，直接extesion pg_trgn即可。在编译数据库时没有使用pgtrgm，先在源码中进行编译，生成函数，similarity (text,text)字符相似度是多少

select similarity<' abc‘，adb>;

0.142857

select similarity<' abc‘，abd>;

0.333333

根据show_ tran (text)算出，

abc

<‘’a‘’，‘’ab‘’，‘’abc‘’，‘’bc‘’>

abd

<‘’a‘’，‘’ab‘’，‘’abd‘’，‘’bd‘’>

第一个相似，第二个相似，第三个不相似，第四个不相似。

加起来总共有六个单元，相似的有两个，2处以6得到0.333333，这就是相似度，拆分成完全不一样的，

<"a"，" ab",abc , "bc "，‘’bca‘’, cab>

<"a"，" ab",abc , "abd "，‘’bca‘’, cab>

一共有8个元素，相似的有五个，近似度0.625。

相似度大于50%返回，可以走索引，创建表，创建索引，使用索引时使用操作符，也可以用gin操作符。

CREATE TABLE test_ trgn(t text):

CREATE NDEX trgm_ idx ON test_ tra USING ist( t gist_ tram ops):

show_ limit ()，set. limit(real)相似的力度，超过百分之30的相似度，字符被认为很相似，t%‘word’返回真，

‘abc’%‘abd’认为是相似的，返回t真，abc，adb是f，只有百分之14的相似度，通过索引检测相似度情况，可以用于日志的相似度，返回f，也可以用于输错了字的顺序，也可以进行匹配，比全文扫描分词的方法更加彻底，完全把字符串按照连续个数，字符串分段，拆成3个一组的单元，做两者近似度的匹配。更加适合于人名，其中有一个错了，如果近似度够，也可以匹配出来。

十四、练习

1、各种索引的合理使用。

2、全文检索的使用以及中文分词插件的安装和使用。

3、近似匹配插件pg_trgm的安装和使用。

4 PostgreSQL 索引，全文检索，模糊匹配，近似度匹配（三）

4 PostgreSQL 索引，全文检索，模糊匹配，近似度匹配（三）

十一、HOT Update

十二、PostgreSQL 全文检索

十三、pg_trgm 近似匹配

十四、练习

阿里云开发者学堂

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像