Overview
Continuing the ES course taught by 中华石杉; this is part 10.
Course link: https://www.roncoo.com/view/55
TF/IDF
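TF/IDF is the classic scoring mechanism of Apache Lucene. (Since Elasticsearch 5.0 the default similarity is BM25, which is still built on term frequency and inverse document frequency.)
TF (Term Frequency): term-based (term vector); it expresses how many times a term occurs in a given document.
The higher the term frequency, the higher the document's score.
IDF (Inverse Document Frequency): term-based (term vector); it tells the scoring formula how rare the term is.
The higher the inverse document frequency, the rarer the term. The scoring formula uses this factor to boost documents that contain rare terms.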
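To make the two factors concrete, here is a minimal Python sketch that computes a raw term count and a classic-style IDF over the five article titles used in the example later in this post. The `tokenize` helper and the formula `1 + ln(N / (df + 1))` are simplifying assumptions for illustration; this is not the exact computation ES 6.4.1 performs.

```python
import math
import re

# The five article titles from the example data in this post.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency(term, text):
    """How many times `term` occurs in one document."""
    return tokenize(text).count(term)


def inverse_document_frequency(term, docs):
    """Classic-style IDF: 1 + ln(N / (df + 1)); rarer terms get larger values."""
    df = sum(1 for text in docs.values() if term in tokenize(text))
    return 1.0 + math.log(len(docs) / (df + 1))


for term in ("blog", "java", "spark"):
    print(term, round(inverse_document_frequency(term, TITLES), 3))
```

In this sketch "spark" (one title) gets a larger IDF than "java" (three titles), which in turn beats "blog" (all five titles).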
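term vector: a term vector is a miniature inverted index for a single document. Each dimension of the vector is a pair of a term and its frequency, and it can also carry the term's position information. Both Lucene and ES disable term vector indexing by default; the option has to be enabled for features that depend on it, such as some forms of highlighting.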
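As a rough illustration of the "term plus frequency plus positions" structure described above, the following Python sketch builds such a mapping for one document. The function name `term_vector` and the regex tokenizer are assumptions made for this example; the on-disk format Lucene uses is different.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Map each term to its frequency and token positions within one document."""
    vector = defaultdict(lambda: {"freq": 0, "positions": []})
    for position, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        vector[term]["freq"] += 1
        vector[term]["positions"].append(position)
    return dict(vector)


# The content field of document 2 from the example data in this post.
tv = term_vector("i think java is the best programming language")
print(tv["java"])
```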
Links
Official guides:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_tuning_best_fields_queries.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.2/query-dsl-dis-max-query.html
dis_max appearing not to take effect when there is very little data: https://stackoverflow.com/questions/38065692/dis-max-query-isnt-looking-for-the-best-matching-clause
Related articles by other bloggers:
https://blog.csdn.net/dm_vincent/article/details/41820537
Example
ES version: 6.4.1
For the demo, we delete the forum index used earlier and rebuild it.
The DSL is as follows:
DELETE /forum

PUT /forum
{ "settings" : { "number_of_shards" : 1 }}

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag":["java","hadoop"]}}
{"update":{"_id":"2"}}
{"doc":{"tag":["java"]}}
{"update":{"_id":"3"}}
{"doc":{"tag":["hadoop"]}}
{"update":{"_id":"4"}}
{"doc":{"tag":["java","elasticsearch"]}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag_cnt":2}}
{"update":{"_id":"2"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"3"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"4"}}
{"doc":{"tag_cnt":2}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"view_cnt":30}}
{"update":{"_id":"2"}}
{"doc":{"view_cnt":50}}
{"update":{"_id":"3"}}
{"doc":{"view_cnt":100}}
{"update":{"_id":"4"}}
{"doc":{"view_cnt":80}}

POST /forum/article/_bulk
{"index":{"_id":5}}
{"articleID":"DHJK-B-1395-#Ky5","userID":3,"hidden":false,"postDate":"2019-06-01","tag":["elasticsearch"],"tag_cnt":1,"view_cnt":10}

POST /forum/article/_bulk
{"update":{"_id":"5"}}
{"doc":{"postDate":"2019-05-01"}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"title":"this is java and elasticsearch blog"}}
{"update":{"_id":"2"}}
{"doc":{"title":"this is java blog"}}
{"update":{"_id":"3"}}
{"doc":{"title":"this is elasticsearch blog"}}
{"update":{"_id":"4"}}
{"doc":{"title":"this is java, elasticsearch, hadoop blog"}}
{"update":{"_id":"5"}}
{"doc":{"title":"this is spark blog"}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"content":"i like to write best elasticsearch article"}}
{"update":{"_id":"2"}}
{"doc":{"content":"i think java is the best programming language"}}
{"update":{"_id":"3"}}
{"doc":{"content":"i am only an elasticsearch beginner"}}
{"update":{"_id":"4"}}
{"doc":{"content":"elasticsearch and hadoop are all very good solution, i am a beginner"}}
{"update":{"_id":"5"}}
{"doc":{"content":"spark is best big data solution based on scala ,an programming language similar to java"}}
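The bulk requests above are newline-delimited JSON: every action line and every document line sits on its own line, and the body ends with a newline. If you would rather generate such a body from a script than type it by hand, a small Python helper along the following lines may do; `bulk_update_body` is a hypothetical name for this sketch, and sending the body to the cluster is left out.

```python
import json


def bulk_update_body(field, values_by_id):
    """Build an NDJSON _bulk body of partial-document updates for one field."""
    lines = []
    for doc_id, value in values_by_id.items():
        # Action line, followed by the partial document to merge in.
        lines.append(json.dumps({"update": {"_id": str(doc_id)}}))
        lines.append(json.dumps({"doc": {field: value}}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_update_body("view_cnt", {1: 30, 2: 50, 3: 100, 4: 80})
print(body)
```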
That completes the data setup. Next, let's see how dis_max works.
GET /forum/article/_search

The data looks like this:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "1", "_score": 1,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01",
          "tag": [ "java", "hadoop" ], "tag_cnt": 2, "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 1,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": [ "java" ], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "3", "_score": 1,
        "_source": {
          "articleID": "JODL-X-1937-#pV7", "userID": 2, "hidden": false, "postDate": "2017-01-01",
          "tag": [ "hadoop" ], "tag_cnt": 1, "view_cnt": 100,
          "title": "this is elasticsearch blog",
          "content": "i am only an elasticsearch beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "4", "_score": 1,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02",
          "tag": [ "java", "elasticsearch" ], "tag_cnt": 2, "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01",
          "tag": [ "elasticsearch" ], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      }
    ]
  }
}
Ordinary query
First, look at an ordinary DSL query:

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java solution" } },
        { "match": { "content": "java solution" } }
      ],
      "minimum_should_match": 1
    }
  }
}
Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.5179626,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 1.5179626,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": [ "java" ], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1.4233948,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01",
          "tag": [ "elasticsearch" ], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "4", "_score": 1.2832261,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02",
          "tag": [ "java", "elasticsearch" ], "tag_cnt": 2, "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "1", "_score": 0.4889865,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01",
          "tag": [ "java", "hadoop" ], "tag_cnt": 2, "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}
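Let's analyze the result.
The course explains each document's relevance score as: the sum of the scores of the matching queries, multiplied by the number of matched queries, divided by the total number of queries. (That matched/total multiplier is the coord factor of older Lucene versions. In the 6.4.1 responses in this post, doc 5 scores 1.4233948 under both bool and dis_max even though only one clause matches, so this version appears to simply sum the matching clauses. The ranking argument is the same either way: doc 2 gains from having two fields that each contribute a score.)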
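Work out the score of doc 2:
{ "match": { "title": "java solution" }} produces a score for doc 2
{ "match": { "content": "java solution" }} also produces a score for doc 2
Suppose those scores are 1.1 and 1.2. The two are added together: 1.1 + 1.2 = 2.3
Number of matched queries = 2
Total number of queries = 2
2.3 * 2 / 2 = 2.3
Work out the score of doc 5:
{ "match": { "title": "java solution" }} produces no score for doc 5
{ "match": { "content": "java solution" }} produces a score for doc 5
So only one query produces a score, say 2.3
Number of matched queries = 1
Total number of queries = 2
2.3 * 1 / 2 = 1.15
Score of doc 5 = 1.15 < score of doc 2 = 2.3
The document with id=2 is ranked first, but we would actually prefer id=5 to come first, since the content field of id=5 contains both java and solution. So let's look at dis_max.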
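The arithmetic in the walkthrough above can be written as a few lines of Python. This is only a sketch of the formula as the walkthrough describes it, using the same hypothetical clause scores (1.1, 1.2 and 2.3); `bool_should_score` is a name invented for this example and not an Elasticsearch API.

```python
def bool_should_score(clause_scores):
    """Score as described above: sum of matched clause scores * matched / total.

    `clause_scores` holds one entry per should clause; None means no match.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(clause_scores)


doc2 = bool_should_score([1.1, 1.2])   # title and content both match
doc5 = bool_should_score([None, 2.3])  # only content matches
print(round(doc2, 2), round(doc5, 2))
```

Under this formula doc 5 loses half of its score for matching in only one field, which is why it falls behind doc 2.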
dis_max query

GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" } },
        { "match": { "content": "java solution" } }
      ]
    }
  }
}
Response:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.4233948,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1.4233948,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01",
          "tag": [ "elasticsearch" ], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 0.93952733,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": [ "java" ], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "4", "_score": 0.79423964,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02",
          "tag": [ "java", "elasticsearch" ], "tag_cnt": 2, "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "1", "_score": 0.4889865,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01",
          "tag": [ "java", "hadoop" ], "tag_cnt": 2, "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}
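best fields strategy: dis_max
best fields strategy: the results ranked first should be those where a single field matches as many of the keywords as possible, rather than those where as many fields as possible each match only a few of the keywords.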
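To see what "one field matching as many keywords as possible" means for the example data, the Python sketch below counts which of the query terms "java solution" appear in each field of doc 2 and doc 5. The regex tokenizer is an assumption standing in for the real analyzer, and `matched_terms_per_field` is a name made up for this illustration.

```python
import re

QUERY = "java solution"

# title and content of documents 2 and 5 from the example data in this post.
DOCS = {
    2: {
        "title": "this is java blog",
        "content": "i think java is the best programming language",
    },
    5: {
        "title": "this is spark blog",
        "content": "spark is best big data solution based on scala ,an programming language similar to java",
    },
}


def tokens(text):
    """Lowercased set of alphanumeric tokens (a stand-in for a real analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matched_terms_per_field(doc, query=QUERY):
    """For each field, list the query terms that occur in it."""
    wanted = tokens(query)
    return {field: sorted(wanted & tokens(text)) for field, text in doc.items()}


for doc_id, doc in DOCS.items():
    print(doc_id, matched_terms_per_field(doc))
```

Doc 2 matches only "java", once in each field, while doc 5 matches both terms inside its content field, which is the case the best fields strategy wants ranked first.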
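The dis_max syntax simply takes, among the multiple queries, the score of the highest-scoring query.
For example:
{ "match": { "title": "java solution" }} produces a score for doc 2: 1.1
{ "match": { "content": "java solution" }} also produces a score for doc 2: 1.2
Take the maximum: 1.2
{ "match": { "title": "java solution" }} produces no score for doc 5
{ "match": { "content": "java solution" }} produces a score for doc 5: 2.3
Take the maximum: 2.3
So the score of doc 2 = 1.2 < the score of doc 5 = 2.3, and doc 5 is ranked higher.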
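The same example can be expressed as a short Python sketch. `dis_max_score` is a hypothetical helper for illustration only; the optional `tie_breaker` argument mirrors the parameter of the same name in the official dis_max reference linked earlier and is left at 0.0 here, so the result is just the maximum. The clause scores are the hypothetical ones from the example above.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score, plus tie_breaker times the other matching scores.

    `clause_scores` holds one entry per query; None means no match.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


doc2 = dis_max_score([1.1, 1.2])   # best field scores 1.2
doc5 = dis_max_score([None, 2.3])  # best field scores 2.3
print(doc2, doc5)
```

This matches the ordering in the dis_max response earlier in this post, where id=5 (1.4233948) is ranked above id=2 (0.93952733).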