Elastic Stack Hands-On Handbook (《Elastic Stack 实战手册》): 3. Product Capabilities, 3.4 Getting Started, 3.4.2 Elasticsearch Basic Applications, 3.4.2.4 Distributed Scoring (Part 2) https://developer.aliyun.com/article/1230902
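Pre-statistics phase:
1. After Elasticsearch receives a search request from the client, the coordinating node first performs a pre-statistics step: it collects statistics from all relevant shards.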
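Query phase:
1. The coordinating node merges all the collected statistics and sends the resulting global statistics, together with the request, to every shard of the target index.
2. The Lucene instance on each shard uses the global TF/IDF statistics to match and score documents within the shard independently (based on the formula above), and then sorts and paginates within the shard according to those scores.
3. Each shard returns the metadata of its sorted, paginated result set (document IDs and scores, without the document content) to the coordinating node.
4. The coordinating node merges, sorts, and paginates the results from all shards and selects the hits that will finally be returned.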
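Fetch phase:
1. Based on the selected hits, the coordinating node fetches the full document data from the corresponding shards.
2. It assembles the final result and returns it to the client.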
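The three phases above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the toy shard data, the function names, and the simplified BM25-style scoring with k1 = 1.2 and b = 0.75 are hypothetical and are not Elasticsearch internals.

```python
import math
from collections import Counter

# Hypothetical toy data: each shard maps a document ID to its list of terms.
SHARDS = [
    {"1": ["a", "b"], "2": ["c"]},
    {"3": ["a", "c", "d"]},
    {"4": ["a"], "5": ["a", "b", "e"]},
]

def shard_stats(shard):
    """Pre-statistics phase: a shard reports its local statistics."""
    df = Counter()
    for terms in shard.values():
        df.update(set(terms))  # document frequency counts each doc once
    total_len = sum(len(terms) for terms in shard.values())
    return {"docs": len(shard), "df": df, "length": total_len}

def merge_stats(all_stats):
    """Coordinating node: merge per-shard statistics into global ones."""
    df = Counter()
    for stats in all_stats:
        df.update(stats["df"])
    docs = sum(stats["docs"] for stats in all_stats)
    avgdl = sum(stats["length"] for stats in all_stats) / docs
    return {"docs": docs, "df": df, "avgdl": avgdl}

def bm25(freq, n, N, dl, avgdl, k1=1.2, b=0.75):
    """Simplified BM25 term score (no boost)."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return idf * tf

def query_shard(shard, query_terms, stats, size):
    """Query phase on one shard: score with GLOBAL stats, return (id, score) only."""
    scored = []
    for doc_id, terms in shard.items():
        score = sum(
            bm25(terms.count(t), stats["df"][t], stats["docs"], len(terms), stats["avgdl"])
            for t in query_terms
            if t in terms
        )
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda hit: hit[1], reverse=True)[:size]

def search(shards, query_terms, size=3):
    # Pre-statistics phase: gather and merge statistics from every shard.
    stats = merge_stats([shard_stats(s) for s in shards])
    # Query phase: each shard returns its local top hits (IDs and scores).
    partial = [hit for s in shards for hit in query_shard(s, query_terms, stats, size)]
    # The coordinating node merges, sorts, and paginates.
    top = sorted(partial, key=lambda hit: hit[1], reverse=True)[:size]
    # Fetch phase: load full documents only for the final hits.
    lookup = {doc_id: terms for s in shards for doc_id, terms in s.items()}
    return [(doc_id, score, lookup[doc_id]) for doc_id, score in top]

results = search(SHARDS, ["a", "b"])
for doc_id, score, doc in results:
    print(doc_id, round(score, 4), doc)
```

Because every shard scores with the same merged statistics, a document's score does not depend on which shard it happens to live on; the price is the extra round trip in the pre-statistics phase.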
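In summary, Elasticsearch makes a trade-off in distributed scoring: if absolute accuracy is required, some performance has to be sacrificed to obtain global statistics.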
Let's see how to switch to DFS_QUERY_THEN_FETCH: simply append search_type=dfs_query_then_fetch to the request URL.
GET /my-index-000001/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "三国演义"
    }
  }
}
As you can see, the results returned this way are back to normal:
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 3.7694218,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 3.7694218,
        "_source" : {
          "title" : "三国演义",
          "date" : "2021-05-03",
          "content" : "三国时代,群雄逐鹿..."
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.1795839,
        "_source" : {
          "title" : "三国志",
          "date" : "2021-05-01",
          "content" : "国别体史书"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.8715688,
        "_source" : {
          "title" : "易中天品三国",
          "date" : "2021-05-03",
          "content" : "草船借箭、空城计..."
        }
      }
    ]
  }
}
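The document "三国演义" still ranks first, and its score (_score) is now 3.7694218. It is followed by "三国志" with a score of 1.1795839, and finally "易中天品三国" with a score of 0.8715688. As before, the documents that do not match are not returned.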
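In addition, the took value in the response shows that this request takes slightly longer than with query_then_fetch. This mode therefore carries a performance cost and should be used with caution in production.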
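The request shown earlier in this section can also be assembled from a script. The following is a minimal sketch using only the Python standard library; the base URL http://localhost:9200 and the helper name build_search_request are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_search_request(base_url, index, body, search_type=None):
    """Build (but do not send) a _search request; helper name is hypothetical."""
    url = f"{base_url}/{index}/_search"
    if search_type:
        # search_type is passed as a URL query parameter, as in the article.
        url += "?" + urlencode({"search_type": search_type})
    return Request(
        url,
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _search also accepts POST when a body is supplied
    )

body = {"query": {"query_string": {"query": "三国演义"}}}

# Assumed local cluster address; replace with your own endpoint.
req = build_search_request(
    "http://localhost:9200", "my-index-000001", body,
    search_type="dfs_query_then_fetch",
)
print(req.get_method(), req.full_url)

# To actually send it (requires a reachable cluster):
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     print(json.load(resp)["hits"]["max_score"])
```

Leaving search_type as None builds the same request without the parameter, which makes it easy to compare the took values of the two modes against the same index.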
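Inspecting the scoring logic
To understand the scoring logic during real development, and thereby tune our queries or indexing, we need to look into questions such as why "易中天品三国" scores 0.8715688 rather than 3.7694218.
We can add explain to the query to see a description of how each score was computed.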
GET /my-index-000001/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "query_string": {
      "query": "三国演义"
    }
  },
  "explain": true
}
With "explain": true added, the result set now contains a large amount of _explanation information:
{
  "took" : 21,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 3.7694218,
    "hits" : [
      {
        "_shard" : "[my-index-000001][0]",
        "_node" : "ydZx8i8HQBe69T4vbYm30g",
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 3.7694218,
        "_source" : {
          "title" : "三国演义",
          "date" : "2021-05-03",
          "content" : "三国时代,群雄逐鹿..."
        },
        "_explanation" : {
          "value" : 3.7694218,
          "description" : "max of:",
          "details" : [
            {
              "value" : 3.7694218,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.52763593,
                  "description" : "weight(title:三 in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 0.52763593,
                      "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                      "details" : [
                        { "value" : 2.2, "description" : "boost", "details" : [ ] },
                        {
                          "value" : 0.5389965,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            { "value" : 3, "description" : "n, number of documents containing term", "details" : [ ] },
                            { "value" : 5, "description" : "N, total number of documents with field", "details" : [ ] }
                          ]
                        },
                        {
                          "value" : 0.4449649,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                            { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                            { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                            { "value" : 4.0, "description" : "dl, length of field", "details" : [ ] },
                            { "value" : 3.8, "description" : "avgdl, average length of field", "details" : [ ] }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value" : 1.357075,
                  "description" : "weight(title:演 in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 1.357075,
                      "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                      "details" : [
                        { "value" : 2.2, "description" : "boost", "details" : [ ] },
                        {
                          "value" : 1.3862944,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            { "value" : 1, "description" : "n, number of documents containing term", "details" : [ ] },
                            { "value" : 5, "description" : "N, total number of documents with field", "details" : [ ] }
                          ]
                        },
                        {
                          "value" : 0.4449649,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                            { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                            { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                            { "value" : 4.0, "description" : "dl, length of field", "details" : [ ] },
                            { "value" : 3.8, "description" : "avgdl, average length of field", "details" : [ ] }
                          ]
                        }
                      ]
                    }
                  ]
                },
                ...
              ]
            },
            ...
          ]
        }
      },
      ...
    ]
  }
}
By reading the description and details entries, we can dig further into Elasticsearch's scoring logic and into the score breakdown of every document our query returns.
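As a cross-check, the per-term scores can be recomputed by hand from the formulas printed in the description fields. The sketch below plugs in the values reported for the terms 三 and 演 in the title of document 5; the function names are illustrative, and small differences in the last digits are expected because the values in the response are printed with limited precision.

```python
import math

def bm25_idf(n, N):
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_score(boost, n, N, freq, dl, avgdl, k1=1.2, b=0.75):
    """score, computed as boost * idf * tf."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl, k1, b)

# Values taken from the _explanation of document 5 ("三国演义").
# Term "三": n = 3 of N = 5 documents contain it.
score_san = bm25_score(boost=2.2, n=3, N=5, freq=1.0, dl=4.0, avgdl=3.8)
# Term "演": n = 1 of N = 5 documents contain it.
score_yan = bm25_score(boost=2.2, n=1, N=5, freq=1.0, dl=4.0, avgdl=3.8)

print("idf(三) =", bm25_idf(3, 5))
print("tf      =", bm25_tf(1.0, 4.0, 3.8))
print("score(三) =", score_san)
print("score(演) =", score_yan)
```

Both terms share the same tf component because they occur once in the same field; the rarer term 演 (n = 1) gets the larger idf and therefore contributes more to the document's total score than 三 (n = 3).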
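About the author:
Zhao Zhenyi (赵震一) is a programmer with a fondness for clever technical tricks, focusing on big data and distributed computing.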