Lucene&solr 4 实践（5）

2022-05-02 253

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 假期重新把之前在新浪博客里面的文字梳理了下，搬到这里。这部分先通透FST的原理和构造方法，方便理解lucene FST、Builder两个核心对象，从而彻底看清基于图的lucene4索引、查询的发展脉络。至于读懂后有神马用，自个琢磨啊！看懂估计要死伤不少脑细胞哦！

预备知识

索引格式和编码参考链接

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

http://lucene.apache.org/core/4_0_0-ALPHA/

http://kevinma.cn/technology/lucene-4-analysis-data-type/2012-09-12

Lucene4 FST编码

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/packed/PackedInts.html#DEFAULT

在前面的基础上，初步了解到FSA FST lucene4 这些概念。下面进一步讲解。

Direct building of mimimal automation for given list 这篇文章详细的阐述了定义、定理、推理、构造过程。而Direct Construction of minimal acyclic subsequential transducers 第三部分给出了算法的伪码实现。在此2文基础上，转入Lucene FST 包会少一些迷茫。lucene FST 对InputType OutPut 以及FST是否压缩packed 做了泛型化。

Direct building of mimimal automation for given list

Definition1 定义FSA，完全是数学的形式化表达，集合的映射。其中U*这个表示经过一步或多步转移计算。

Definition2 语言定义，这个定义是在Definition1 FSA 元组关系基础上抽象出来的。理解集合的闭包概念。

Definition3 确定FSA 确实有限自动机。核心要义是状态的可达并唯一

Definition4 最小、确定有限自动机。关键定义信息是语言表达一样，就是集合覆盖一样，同时状态个数最少

Theorem5 定理5 最小、确定有限自动机充分必要条件。这是后续证明的依据。

Definition6 有向无环、确定自动机成为最小确定自动机except w。这部分定义的证明完全依据定义法来的。

另外，w的特殊性要满足。w is the prefix of the last work in the lexicographer order of L(A)

w是词典顺序的最后一个word的前缀。这是有序given list的必要。 lucene4 inputtype里面对输入转换为utf8编码的intRef或者utf32 编码的intRef 作为输入，排序就是byte值排序。最糟糕byte =8bit 128个分支。

Propositon7 这个命题主要是由最小、有限自动机的定义，以及except w，反推ti。这个引理非常关键。因为对ti的出来是lucene4 freezetail的必要环节，也就是last word的最小化操作。

Proposition8 对于最小自动机满足，except empty word e 后是最小的。也就说任何最小自动机，都有有一个empty word e 外才是最小的。

Lemma9 引理9 最小、自动机 except w=w1w2...wk ,w 不等于e。此时，w=w1w2...w(k-1)被except外也是最小的。这句话进一步递归就是，从w= e 、 w=w1 、 w=w1w2 。。。。w=w1w2...wk 被except外也是最小的。证明法是定义法，严格按照定义来证明。

Lemma10 对最小、确定、自动机中做等价转化。在于ti的等价变化，将w=w1w2...wk,转为p、w=w1w2...w(k-1)下的最小、确定、自动机。证明也是定义法。这个引理并引理9意味着，对w 也就是last词典序的word的前缀处理出现了等价变换，以及递归来处理这个except word

Lemma11 主要是对新的word的增量式add到已经最小、确定自动机中。也正是这部分支持了FST动态、增量式的构建。证明依然是定义法。理解这句话： The use of the lexicographical order of w is essential at this point. New words could be recognized only by passing the additional new states.The only new word in the language L(A'|tm) is w(m+1)w(m+2)...wk

通俗讲就是当L(A)已经是最小、确定自动机了，w’是except word，然后新的word w，就是在A词典顺序之外新的word，此时L(A')也是最小、确定自动机了，except w。 eg ababac 对于A的最last word，然后w是abadaaa，其中第四位置d比b大。aba 是最小except word。所以，new word 添加就出现递归的操作了，总是在前N个有序word构建了FST基础上，add下一个word，又构成新的FST。下一个word是词典顺序更大的。

Lemma12 对tk的等价状态描述。

Method for direct building of minimal FS automation for a given list

Let a non-empty finite list of words L in lexcographical order be gieven. Let w（i） denote the i-th word of the list. We start with the minimal automation which recongnized only the firest word of the list . This automation can be build trivially and is also minimal except for word(1). Using it as basis we carry out an induction on the words of the list. Let us assume that the automation A(n)=<E,S ,s F ,u> with language L(n)={w(i)|i=1,2,...n} has bean built and that A(n) is minimal except for w(n). We have to build the automation A(n+1) with L(n+1)={w(i)|i=1,2...,n+1} which is minimal except for w(n+1)

Let w' be the longest common prefix of word w(n) and w(n+1) Using several times Lemma9 and Lemma 10( correspongding to the actual case) We build the automation A'=<E,S',s,F',u'> which is equivalent to A(n) and is minimal except for w'. Now we can use Theorem 11 and build the automation A(n+1) with language L(n+1)=L(n)U{w(n+1)}={w(i)| i=1,2..n+1} which is mnimal except for w(n+1)

这段话总结了构造增量式最小、确定有限自动机的过程。通过递归构造前N个word生成的FST，对于第N+1个word，只需对共同前缀之外的部分做tail裁剪，从而生成最小、确定有限自动机。

算法伪码请参考

Direct Construction of minimal acyclic subsequential transducers

STATAE 指针指向代表一个状态对象

FIRST_CHARE LAT_CHAR 第一个和最后一个字符 eg a z

function NEW_STATE 返回一个新的STATE

function FINAL（STATE）检测当前STATE是否是终态，是返回true，否则false

procedure SET_FINAL(STATE,boolean) 设置STATE的状态表示信息

function TRANSITION(STATE,char) 返回STATE在char输入下的下一个状态也就u（s，r）的结果

procedure SET_TRANSITION(STATE,char,STATE)设计STATE在char下转入STATE也就 u（s1，r）=s2

function STATE_OUTPUT(STATE) 返回STATE的输出态信息

procedure SET_STATE_OUTPUT(STATE,set of string) 设置STATE对于的output信息

function OUTPUT(STATE,char) 输出STATE在char下的下一个输出态，注意是输出态、非STATE

procedure PRINT_TRANSDUCER(file,STATE) 将STATE的信息输入到file中

function COPY_STATE(STATE) 复制状态到new state中

procedure CLEAR_STATE(STATE) 清除state的信息

function COMPARER_STATE(STATE,STATE) 比较状态

function NEW_DICTIONARY 词典，相当于FST

function MEMER(DICTIONARY,STATE) 判断state是否存在

procedure INSERT(DICTIONARY,STATE) 插入一个state 在lucene4 是add

//注意这里的input、output 在lucene4 里面是泛型了

Create_MINIMINAL_Transducer_for_Given_list(input ,output)

var

MinimalTransducerStatesDictionary: DICTIONARY

TempStates:array[0...MAX_WORD_SIZE] of STATE

InitialState:STATE

PreviousWord,CurrentWord,CurrentOutput,WordSuffix,CommonPrefix:string

tempString: string

tempSet: set of string

i,j,PrefixLengthPlus1： interger

c:char

function FindMinimized(s:STATE) :STATE //返回state信息，如果不存在就插入该starte

var r:STATE

begin

r:=MEMBER(MinimalTransducerStatesDictionary,s)

if f= NULL then begin

r:=COPY_STATE(s);

INSERT(r);

end;

return (r);

end;

begin

MinimalTransducerStatesDictionary：= NEW_DICTIONARY;

for i:0 to MAX_WORD_SIZE do

TempState[i]:=NEW_STATE;

PreviousWord:='';

CLEAR_STATE(TempState[0]);

while not eof(input) do begin

i：1;

while(i< length(CurrentWord) ) and ( i< length(PreviousWord))

and (PreviousWord[i] =CurrentWord[i]) do

i:=i+1;

PrefixLengthPlus1:=1;

for i:=length(PreviousWord) downto PrefixLengthPlus1 do

SET_TRANSITION(TempStates[i-1],PreviousWord[i],Findminimized(TempStates[i]));

for i:= PrefixLengthPlus1 to length(CurrentWord) do begin

CLEAR_STATE(TempStates[i]);

SET_TRANSITION(TempState[i-1],CurrentWord[i],TempState[i]);

end;

if CurrentWords<> PreviousWord then begin

SET_FINAL(TempStates[length(CurrentWord)],true)

SET_OUTPUT(TempStates[length(CurrentWord)],{"});//论文这里感觉有点模糊

end;

for j:=1 to PrefixLengthPlus1-1 do begin

CommonPrefix:=OUTPUT(TempStates[j-1],CurrentWord[j])∩ CurrentWord[j];

WordSuffix:=CommonPrefix(-1) OUTPUT(TempStates[j-1],CurrentWord[j]);

SET_OUTPUT(TempStates[j-1],CurrentWord[j],CommonPrefix);

for c:FIRST_CHAR to LAST_CHAR do begin//这里是遍历处理

if TRANSITION(TempState[j],c)<> NULL then

SET_OUTPUT(TempState[j],c,concat(WordSuffix,OUTPUT(TempStates[j],c)));

end;

if FINAL(TempStates[j]) then begin

tempSet:=0;

for tempString in STATE_OUTPUT(TempStates[j]) do

tempSet:=tempSet U concat ( WordSuffix,tempString);

SET_STATE_OUTPUT(TempState[j],tempSet);

end;

CurrentWordOutput:= CommonPrefix(-1) CurrentOutput;// -1 是截取操作

end;//输入input 循环结束

if CurrrentWord == PreviousWord then

SET_STATE_OUTPUT( TempStates[length(CurrentWord)], STATE_OUTPUT(

TempState[length(CurrrentWord)] U CurrentOutput));

else SET_OUTPUT( TempSates[PrefixLengthPlus1-1],

CurrentWord[PrefixlengthPlus1],CurrentOutput);

end;//最后的收尾

for i:= length(CurrentWord) downto 1 do

SET_TRANSITION(TempStates[i-1],PreviousWord[i],FindMinimized(TempStates[i]));

InitialState:= FindMinimized(TempStates[0]);

PRINT_TRANSDUCER(output,InitalState);

end.

Lucene&solr 4 实践（5）

预备知识

索引格式和编码参考链接

Lucene4 FST编码

Direct building of mimimal automation for given list

热门文章

最新文章

相关课程

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Lucene&solr 4 实践（5）

预备知识

索引格式和编码参考链接

Lucene4 FST编码

Direct building of mimimal automation for given list

热门文章

最新文章

相关课程

相关电子书