Elasticsearch Relevance Scoring Algorithms Explained

2023-07-19

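TF algorithm

TF stands for term frequency. As the name suggests, the score depends on how often the search term occurs: the more times the term appears in the field, the higher the score.

Examples:

Search for hello

Two documents: document A is hello world! and document B is hello hello hello

Result: document B scores higher than document A.

Search for hello world

Two documents: document A is hello world! and document B is hello,are you ok?

Result: document A scores higher than document B.

With a multi-term query, the score takes every query term into account.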
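The two examples above can be reproduced with a minimal Python sketch that only counts raw term occurrences. The `tokenize` helper is a hypothetical stand-in for an Elasticsearch analyzer (it just lowercases and splits on non-alphanumeric characters), and the raw counts are not the final score, which also depends on the factors described later.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive stand-in for an analyzer: lowercase, keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_counts(query: str, text: str) -> dict[str, int]:
    """Return how often each query term occurs in `text`."""
    counts = Counter(tokenize(text))
    return {term: counts[term] for term in tokenize(query)}


# First example: B repeats the term, so its raw frequency is higher.
print(term_counts("hello", "hello world!"))       # {'hello': 1}
print(term_counts("hello", "hello hello hello"))  # {'hello': 3}

# Second example: A matches both query terms, B matches only one.
print(term_counts("hello world", "hello world!"))       # {'hello': 1, 'world': 1}
print(term_counts("hello world", "hello,are you ok?"))  # {'hello': 1, 'world': 0}
```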

If you only care whether a term occurs in a field, not how often, you can disable term frequencies in the field mapping:

{
  "mappings": {
    "properties": {
      "text": {
        "type":          "text",
        "index_options": "docs"
      }
    }
  }
}

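Setting index_options to docs disables term frequencies and term positions. A field mapped this way does not count how many times a term occurs, and it cannot be used for phrase or proximity queries. Exact-value fields (keyword fields, or not_analyzed string fields in versions before 5.0) use this setting by default.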

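IDF algorithm

IDF stands for inverse document frequency. The more documents in the index a query term appears in, the less that term contributes to the score.

Example:

Search for hello world, where hello appears in 1,000 documents in the index and world appears in 100.

Three documents: document A is hello,are you ok? , document B is The world is interesting! , and document C is hello world!

Result: C > B > A

Because hello is so common, a match on hello alone earns a lower score than a match on world.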
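As a rough check of this ordering, the sketch below plugs the example's document counts into the idf formula that appears in the explain output later in this article, log(1 + (N - n + 0.5) / (n + 0.5)). The total document count N = 10,000 is an assumption made for illustration (the example does not state it), and term frequency and field length are ignored here.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as printed by the explain API: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


N = 10_000  # assumed index size; not given in the example
idf = {"hello": bm25_idf(1000, N), "world": bm25_idf(100, N)}

# Which query terms each document contains (tf and length ignored here).
docs = {"A": ["hello"], "B": ["world"], "C": ["hello", "world"]}
scores = {name: sum(idf[t] for t in terms) for name, terms in docs.items()}

ranking = sorted(scores, key=scores.get, reverse=True)
print(idf)
print(ranking)  # ['C', 'B', 'A']
```

The rarer term world receives the larger idf, so B outranks A, and C, which matches both terms, ranks first.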

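Field-length norm algorithm

How long is the field?

The shorter the field, the higher its weight. A term that appears in a short field such as title is more relevant than the same term appearing in a long field such as body.

Example:

Search for hello world!

Two documents: document A is hello world! and document B is hello world,I'm xxx!

Result: A > B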
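To make the effect concrete, here is a small sketch using the classic TF/IDF form of the norm, 1 / sqrt(number of terms in the field). This is an illustration only: the tokenizer is a hypothetical stand-in for an analyzer, the lossy encoding Lucene applies when storing norms is ignored, and the BM25 scoring shown in the explain output later handles length through dl / avgdl instead.

```python
import math
import re


def field_length_norm(text: str) -> float:
    """Classic norm: 1 / sqrt(number of terms), without Lucene's lossy encoding."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # naive tokenizer (assumption)
    return 1 / math.sqrt(len(tokens))


norm_a = field_length_norm("hello world!")          # 2 terms
norm_b = field_length_norm("hello world,I'm xxx!")  # 4 terms
print(norm_a, norm_b)
```

Document A has the shorter field and therefore the larger norm, which matches the result A > B.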

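These three factors, term frequency, inverse document frequency, and field-length norm, are calculated and stored at index time. They are combined to compute the weight of a single term in a particular document.

Queries usually contain more than one term, though, so there has to be a way to combine the weights of multiple terms: the vector space model.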

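Combining the three algorithms

(What follows is a conceptual analysis, not how the score is actually computed.)

TF looks at how often the term occurs within a field;

IDF looks at how often the term occurs across the whole index;

Field-length norm looks at the length of the field.

One way to reason about it: field-length norm does not act on a term's score directly, so it comes into play last and in theory behaves like a divisor. TF and IDF are on an equal footing: IDF gives each term its unit score, and TF sums those unit scores over the term occurrences in the document.

That gives the following calculation:

IDF: compute the unit score of each term, for example hello=0.1, world=0.2
TF: compute sum(score) for the whole document; hello world!I'm xxx. gives 0.1+0.2=0.3
Field-length norm: divide sum(score) by the length of the field; the result is the score.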
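The three steps above can be written out as a toy Python function. It follows the simplified model of this section only, not the real Elasticsearch formula; the unit scores are the ones from the example, and measuring field length as a token count with a naive tokenizer is an assumption.

```python
import re

# Unit scores per term, as given in the example (the "IDF" step).
UNIT_SCORES = {"hello": 0.1, "world": 0.2}


def toy_score(field_text: str, unit_scores: dict[str, float]) -> float:
    """Sum unit scores over term occurrences, then divide by field length."""
    tokens = re.findall(r"[a-z0-9']+", field_text.lower())  # naive tokenizer
    if not tokens:
        return 0.0
    total = sum(unit_scores.get(tok, 0.0) for tok in tokens)  # the "TF" step
    return total / len(tokens)                                # the norm step


score = toy_score("hello world!I'm xxx.", UNIT_SCORES)
print(score)  # roughly 0.3 / 4 = 0.075
```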

Analyzing the score with the explain API

Create test data

PUT /test-7

{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}

PUT /test-7/_doc/1

{
    "name": "li feng"
}

PUT /test-7/_doc/2

{
    "name": "li er"
}

Explain analysis

GET /test-7/_search?explain=true

{
    "query": {
        "match": {
            "name": "li"
        }
    }
}

Response

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_shard": "[test-7][1]",
                "_node": "DpJZ5rhrStKpiur5hZ_ilw",
                "_index": "test-7",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.2876821,
                "_source": {
                    "name": "li er"
                },
                // The score is listed first
                "_explanation": {
                    "value": 0.2876821,
                    // What the score is made of; details breaks it down
                    "description": "weight(name:li in 0) [PerFieldSimilarity], result of:",
                    // Explanation of the score
                    "details": [
                        {
                            "value": 0.2876821,
                            "description": "score(freq=1.0), computed as boost * idf * tf from:",
                            "details": [
                                {
                                    "value": 2.2,
                                    "description": "boost",
                                    "details": []
                                },
                                {
                                    "value": 0.2876821,
                                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                    // Inverse document frequency: inputs to idf
                                    "details": [
                                        {
                                            "value": 1,
                                            // Number of documents on the current shard that contain the term
                                            "description": "n, number of documents containing term",
                                            "details": []
                                        },
                                        {
                                            "value": 1,
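                                            // Number of documents with this field on the shard that holds the current hit; with multiple shards the index's documents are spread across them, and summing the per-shard counts gives the total document count of the index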
                                            "description": "N, total number of documents with field",
                                            "details": []
                                        }
                                    ]
                                },
                                {
                                    "value": 0.45454544,
                                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                    // Term frequency: inputs to tf
                                    "details": [
                                        {
                                            "value": 1.0,
                                            // How many times the search term occurs in the searched field; in this example li occurs once, so the value is 1
                                            "description": "freq, occurrences of term within document",
                                            "details": []
                                        },
                                        {
                                            "value": 1.2,
                                            // Term saturation parameter, default 1.2
                                            "description": "k1, term saturation parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 0.75,
                                            // Length normalization parameter, default 0.75
                                            "description": "b, length normalization parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            // Number of terms in the searched field after analysis
                                            "description": "dl, length of field",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            // Average number of terms in the searched field across the shard
                                            "description": "avgdl, average length of field",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "_shard": "[test-7][2]",
                "_node": "DpJZ5rhrStKpiur5hZ_ilw",
                "_index": "test-7",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "name": "li feng"
                },
                "_explanation": {
                    "value": 0.2876821,
                    "description": "weight(name:li in 0) [PerFieldSimilarity], result of:",
                    "details": [
                        {
                            "value": 0.2876821,
                            "description": "score(freq=1.0), computed as boost * idf * tf from:",
                            "details": [
                                {
                                    "value": 2.2,
                                    "description": "boost",
                                    "details": []
                                },
                                {
                                    "value": 0.2876821,
                                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "n, number of documents containing term",
                                            "details": []
                                        },
                                        {
                                            "value": 1,
                                            "description": "N, total number of documents with field",
                                            "details": []
                                        }
                                    ]
                                },
                                {
                                    "value": 0.45454544,
                                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                                    "details": [
                                        {
                                            "value": 1.0,
                                            "description": "freq, occurrences of term within document",
                                            "details": []
                                        },
                                        {
                                            "value": 1.2,
                                            "description": "k1, term saturation parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 0.75,
                                            "description": "b, length normalization parameter",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            "description": "dl, length of field",
                                            "details": []
                                        },
                                        {
                                            "value": 2.0,
                                            "description": "avgdl, average length of field",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }
}
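The numbers in this response can be checked by hand. The following sketch recomputes the score from the formulas and values printed in the explain output; the function name is made up for illustration, and the boost of 2.2 is simply copied from the response.

```python
import math


def bm25_term_score(freq, n, N, dl, avgdl, k1=1.2, b=0.75, boost=2.2):
    """Recompute boost * idf * tf using the formulas from the explain output."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf, idf, tf


score, idf, tf = bm25_term_score(freq=1.0, n=1, N=1, dl=2.0, avgdl=2.0)
print(round(idf, 7))    # 0.2876821
print(round(tf, 7))     # 0.4545455
print(round(score, 7))  # 0.2876821
```

Both hits report n = 1 and N = 1 because these statistics are collected per shard and the two documents ended up on different shards ([test-7][1] and [test-7][2]), which is why the two scores are identical.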

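The output also contains a boost, which deserves an explanation.

boost is the weight applied to each term. The value 2.2 shown here is the default: Elasticsearch's BM25 implementation folds the constant (k1 + 1) into the boost, so the default boost of 1.0 is displayed as 1.0 * (1.2 + 1) = 2.2. A boost can be set on a field when defining the index mapping, but more often boost is used to tune search results at query time: give a larger boost when searching the document title and a smaller one when searching the document body, so that matches in different fields carry different weights and the ranking can be adjusted dynamically.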
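As a sketch of query-time tuning, the helper below builds a multi_match request body that uses the field^boost syntax. The field names title and body and the boost values are hypothetical (the test-7 index above only has a name field), and the helper itself is illustrative rather than part of any client library.

```python
import json


def build_boosted_query(text: str, field_boosts: dict[str, float]) -> dict:
    """Build a multi_match query body with per-field boosts (field^boost)."""
    fields = [f"{field}^{boost}" for field, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Matches in the (hypothetical) title field count three times as much as in body.
query = build_boosted_query("hello world", {"title": 3, "body": 1})
print(json.dumps(query, indent=2))
```

The printed JSON can be sent as the body of a _search request; changing the boost values changes the ranking without reindexing.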
