[Big Data Development and Operations Solutions] Why the Default Similarity Algorithm in Solr 6.2 Produces Higher Match Scores Than Solr 5.1

2023-03-24


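Summary: We previously ran Solr 5.1 with the jcseg 1.9.6 tokenizer and later started using Solr 6.2 with jcseg 2.6.0. For the same set of tables in the same Oracle database, we applied identical field configuration on top of each version's template collection configset and indexed the data successfully. Running the same query on both, the document scores returned by Solr 6.2 were far higher than those from Solr 5.1.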


Note:
The comparison covers the two Solr environments we use (Solr 5.1 with jcseg 1.9.6, and Solr 6.2 with jcseg 2.6.0) plus one standalone Solr test environment:

Big data environment	Solr version
CDH	Solr 5.1
Huawei Cloud	Solr 6.2
Standalone	Open-source Solr 6.2

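I. Reproducing the problem

There are three Solr environments: Huawei Cloud Solr 6.2, CDH Solr 5.1, and open-source Solr 6.2. All three index data taken from the same Oracle 11.2.0.4 tables with the same extraction logic. The collection or core names are uoc-buyer1, uoc-buyer, and uoc-buyer respectively. The following query was run against each of the three environments: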

q=engname%3A(ADNAN+UL+HAQ)%5E5+buyeraddr%3A(PP+NO+AF1401302+ADD%5C%3ATOBA+TEK+SINGH%2CPAKISTAN)%5E2+flag%3A(0%5E2+1%5E1)
&fq=%7B!frange+l%3D1.8%7Dquery(%24q)
&fq=-accuracylel%3A1
&fq=countrycode%3APAK
&fq=flag%3A0
&sort=flag+asc%2Caccuracylel+desc%2Cscore+desc
&fl=chnname%2Cengname%2Cbuyeraddr%2Cpuppetbn%2Cflag%2Ctableflag%2Cscore%2Ccountrycode
&wt=json
&indent=true
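
The URL-encoded request above is hard to read, so the following minimal sketch decodes it with Python's standard library. The helper name `decode_solr_params` is ours and is not part of Solr. Note the first filter: `{!frange l=1.8}query($q)` keeps only documents whose main-query score is at least 1.8, so any change in the absolute score scale directly changes which documents pass.

```python
from urllib.parse import parse_qs

# Raw query string copied from the request shown above.
RAW_QUERY = (
    "q=engname%3A(ADNAN+UL+HAQ)%5E5+buyeraddr%3A(PP+NO+AF1401302+ADD%5C%3ATOBA"
    "+TEK+SINGH%2CPAKISTAN)%5E2+flag%3A(0%5E2+1%5E1)"
    "&fq=%7B!frange+l%3D1.8%7Dquery(%24q)"
    "&fq=-accuracylel%3A1"
    "&fq=countrycode%3APAK"
    "&fq=flag%3A0"
    "&sort=flag+asc%2Caccuracylel+desc%2Cscore+desc"
    "&fl=chnname%2Cengname%2Cbuyeraddr%2Cpuppetbn%2Cflag%2Ctableflag%2Cscore%2Ccountrycode"
    "&wt=json"
    "&indent=true"
)


def decode_solr_params(raw):
    """Decode a URL-encoded query string into {name: [values...]}.

    Repeated parameters such as fq are collected into one list.
    """
    return parse_qs(raw)


params = decode_solr_params(RAW_QUERY)
for name, values in params.items():
    for value in values:
        print(f"{name} = {value}")
```

Read this way, the query boosts `engname` matches by 5 and `buyeraddr` matches by 2, and the `frange` filter applies a fixed score cutoff on top of that.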

Scores returned by the open-source and Huawei Cloud Solr 6.2 instances:

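II. Problem analysis

1. Is the problem caused by the different tokenizer versions?

Because we started out on Solr 5.1, the related code and the score threshold used to decide whether two records are similar were both developed against Solr 5.1. After moving the collection logic to the open-source and Huawei Cloud Solr 6.2 instances, we found that the scores differed greatly.
Our first suspicion was the tokenizer, because both Solr 6.2 instances use jcseg 2.6.0 while the CDH Solr 5.1 uses version 1.9.6. We therefore ran the address from the query through the analysis function of each of the three Solr instances:
The two Solr 6.2 instances using jcseg 2.6.0: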

Solr 5.1 using jcseg 1.9.6:

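Comparing the two results, the English segmentation of version 1.9.6 looks friendlier. Version 2.6.0 segments too finely and breaks apart words that should stay together.
We therefore modified the jcseg 2.6.0 tokenizer configuration to follow the jcseg 1.9.6 defaults. The final contents of the jcseg.properties configuration file inside jcseg-core-2.6.0.jar are: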

# Jcseg properties file.
# @Note: 
# true | 1 | on for open the specified configuration or
# false | 0 | off to close it.
# bug report chenxin <chenxin619315@gmail.com>

# Jcseg function
#maximum match length. (5-7)
jcseg.maxlen = 5

#Whether to recognized the Chinese name.
jcseg.icnname = true

#maximum chinese word number of english chinese mixed word. 
jcseg.mixcnlen = 3

#maximum length for pair punctuation text.
jcseg.pptmaxlen = 7

#maximum length for Chinese last name andron.
jcseg.cnmaxlnadron = 1

#Whether to clear the stopwords.
jcseg.clearstopword = false

#Whether to convert the Chinese numeric to Arabic number. like '\u4E09\u4E07' to 30000.
jcseg.cnnumtoarabic = true

#Whether to convert the Chinese fraction to Arabic fraction.
#@Note: for lucene,solr,elasticsearch eg.. close it.
jcseg.cnfratoarabic = false

#Whether to keep the unrecognized word.
jcseg.keepunregword = true

#Whether to do the secondary segmentation for the complex English words
jcseg.ensecondseg = true

#min length of the secondary simple token. (better larger than 1)
jcseg.stokenminlen = 2

#minimum length of the secondary segmentation token.
jcseg.ensecminlen = 1

#Whether to do the English word segmentation
#the jcseg.ensecondseg must set to true before active this function
jcseg.enwordseg = false

#maximum match length for English extracted word
jcseg.enmaxlen = 16

#threshold for Chinese name recognize.
# better not change it before you know what you are doing.
jcseg.nsthreshold = 1000000

#The punctuation set that will be keep in an token.(Not the end of the token).
jcseg.keeppunctuations = @#%.&+

#Whether to append the pinyin of the entry.
jcseg.appendpinyin = false

#Whether to load and append the synonyms words of the entry.
jcseg.appendsyn = true


####for Tokenizer
#default delimiter for JcsegDelimiter tokenizer
#set to default or whitespace will use the default whitespace as delimiter
#or set to the char you want, like ',' or whatever
jcseg.delimiter = default

#default length for the N-gram tokenizer
jcseg.gram = 1


####about the lexicon
#absolute path of the lexicon file.
#Multiple path support from jcseg 1.9.2, use ';' to split different path.
#example: lexicon.path = /home/chenxin/lex1;/home/chenxin/lex2 (Linux)
#        : lexicon.path = D:/jcseg/lexicon/1;D:/jcseg/lexicon/2 (WinNT)
#lexicon.path=/Code/java/JavaSE/jcseg/lexicon
#lexicon.path = {jar.dir}/lexicon ({jar.dir} means the base directory of jcseg-core-{version}.jar)
#@since 1.9.9 Jcseg default to load the lexicons in the classpath
lexicon.path = {jar.dir}/lexicon

#Whether to load the modified lexicon file auto.
lexicon.autoload = true

#Poll time for auto load. (seconds)
lexicon.polltime = 30


####lexicon load
#Whether to load the part of speech of the entry.
jcseg.loadpos = true

#Whether to load the pinyin of the entry.
jcseg.loadpinyin = false

#Whether to load the synonyms words of the entry.
jcseg.loadsyn = true

#Whether to load the entity of the entry
jcseg.loadentity = true
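
When aligning one jcseg.properties with another, it helps to list exactly which keys differ. The following is a minimal sketch of such a comparison. The function names are ours, and the two sample fragments are hypothetical values for illustration only; they are not the real defaults of either jcseg version.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' or '#'.
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def diff_properties(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ.

    A key missing on one side is reported with None for that side.
    """
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical fragments, for illustration only.
STOCK = """
# sample stock file
jcseg.maxlen = 7
jcseg.ensecondseg = true
jcseg.keeppunctuations = @#%.&+
"""
EDITED = """
jcseg.maxlen = 5
jcseg.ensecondseg = true
jcseg.keeppunctuations = @#%.&+
jcseg.gram = 1
"""

changes = diff_properties(parse_properties(STOCK), parse_properties(EDITED))
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

In practice the two inputs would be read from the jcseg.properties files of the two jar versions rather than from inline strings.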

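Segmentation produced by the modified jcseg tokenizer:

The output now essentially matches that of jcseg 1.9.6. We then re-indexed both Solr 6.2.0 instances and ran the earlier query again, but the scores were still above 100.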

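2. Is the problem caused by the different default similarity algorithms in Solr 6 and Solr 5?

Given the experiment above, we suspected that the tokenizer difference was not the whole story, and that the bigger issue was that Solr 5 and Solr 6 compute similarity scores differently. To rule out the tokenizer entirely, we replaced the Solr 6.2 tokenizer with the exact one used by Solr 5.1, re-indexed the same data, and ran the same query. The scores were still very high, which indicates that the main cause of the large score difference is the different algorithm used by the two Solr versions.
Searching online confirmed that the default similarity did change between Solr 5 and Solr 6: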

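Default similarity changes
When the schema does not explicitly define a global <similarity/>, Solr's default behavior depends on the luceneMatchVersion specified in solrconfig.xml. When luceneMatchVersion < 6.0, an instance of ClassicSimilarityFactory is used; otherwise an instance of SchemaSimilarityFactory is used. Most notably, this change means users can take advantage of per-field-type similarity declarations without also having to explicitly declare a global usage of SchemaSimilarityFactory. Whether it is explicitly declared or used as the implicit global default, the implicit behavior of SchemaSimilarityFactory when a field type does not declare an explicit <similarity/> has also been changed to depend on luceneMatchVersion. When luceneMatchVersion < 6.0, an instance of ClassicSimilarity is used; otherwise an instance of BM25Similarity is used. The defaultSimFromFieldType init option can be specified on the SchemaSimilarityFactory declaration to change this behavior. See the SchemaSimilarityFactory javadocs for more details.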
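
To get a feel for how much the two similarities can differ on a boosted query, the sketch below compares simplified versions of the classic TF-IDF formula and BM25. It is an illustration under stated assumptions, not Lucene's actual implementation: it ignores the lossy encoding of length norms, assumes every query term matches (so the classic coord factor is 1), uses the Lucene 5 style classic idf, and treats all terms as belonging to one field. The document counts and frequencies are made-up numbers. One plausible contributor to the gap is that the classic formula multiplies scores by a query normalization factor that shrinks as boosted term weights grow, while BM25 applies boosts directly.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic TF-IDF idf (Lucene 5 style): 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def bm25_idf(doc_freq, num_docs):
    """BM25 idf: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def classic_score(terms, num_docs, field_len):
    """Simplified classic score for matching terms given as (freq, doc_freq, boost)."""
    # Query-side weight of each term: idf * boost.
    weights = [classic_idf(df, num_docs) * boost for _, df, boost in terms]
    # queryNorm scales all weights by 1 / sqrt(sum of squared weights).
    query_norm = 1.0 / math.sqrt(sum(w * w for w in weights))
    # Length norm without Lucene's lossy one-byte encoding (simplification).
    field_norm = 1.0 / math.sqrt(field_len)
    return sum(
        query_norm * w * math.sqrt(freq) * classic_idf(df, num_docs) * field_norm
        for w, (freq, df, _) in zip(weights, terms)
    )


def bm25_score(terms, num_docs, field_len, avg_field_len, k1=1.2, b=0.75):
    """Simplified BM25 score for the same inputs; boosts are applied directly."""
    total = 0.0
    for freq, df, boost in terms:
        length_factor = 1 - b + b * field_len / avg_field_len
        tf_part = freq * (k1 + 1) / (freq + k1 * length_factor)
        total += boost * bm25_idf(df, num_docs) * tf_part
    return total


# Made-up numbers: three query terms, each boosted by 5, each occurring once
# in a 3-token field and in 1,000 of 1,000,000 documents.
TERMS = [(1, 1000, 5.0)] * 3
classic = classic_score(TERMS, num_docs=1_000_000, field_len=3)
bm25 = bm25_score(TERMS, num_docs=1_000_000, field_len=3, avg_field_len=3.0)
print(f"classic = {classic:.2f}, bm25 = {bm25:.2f}")
```

With these inputs the BM25 total comes out much larger than the classic total, which is why a score threshold tuned under one similarity does not carry over to the other.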

We therefore edited the managed-schema of Solr 6.2 and added an explicit similarity declaration:

<similarity class="solr.ClassicSimilarityFactory"/>

Because indexing in the current environment was also slow, we changed the indexing parallelism in solrconfig.xml at the same time:

<maxIndexingThreads>32</maxIndexingThreads>

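After restarting Solr and re-indexing, indexing finished about an hour sooner than before, and the same query now returns scores similar to those of Solr 5.1.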
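
Before re-indexing, it can be worth confirming that the schema file really carries the global similarity declaration. The following minimal sketch does this with Python's standard XML parser. The function name `global_similarity` is ours, and the schema fragment is a hypothetical, heavily trimmed stand-in for a real managed-schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed schema used only for illustration.
SCHEMA_XML = """<schema name="uoc-buyer" version="1.6">
  <similarity class="solr.ClassicSimilarityFactory"/>
  <fieldType name="string" class="solr.StrField"/>
</schema>
"""


def global_similarity(schema_xml):
    """Return the class of the top-level <similarity/> element, or None.

    Only direct children of <schema> are inspected, so per-field-type
    similarity declarations nested in <fieldType> are ignored.
    """
    root = ET.fromstring(schema_xml)
    node = root.find("similarity")
    return node.get("class") if node is not None else None


print(global_similarity(SCHEMA_XML))
```

If the function returns None, no global similarity is declared and, per the upgrade note quoted earlier, the default then depends on luceneMatchVersion.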
