Elasticsearch Internals and Grouped Queries (Part 1)

2022-04-25

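I. How Elasticsearch computes a document's _score

1) Boolean model

Based on the conditions in the user's query, Elasticsearch first filters out the docs (documents) that contain the specified terms (keywords).

query "hello world" --> hello / world / hello & world

bool --> must / must_not / should --> filter --> contains / does not contain / may contain

doc --> no score is assigned at this stage, only a true or false match --> this reduces the number of docs that have to be scored later and improves performance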
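The filtering step can be pictured with a small Python sketch. It is only an illustration of the true/false matching idea, not Elasticsearch's implementation; the function name `matches` and the sample docs are assumptions made for this example.

```python
def matches(doc_terms, must=(), should=(), must_not=()):
    """Return True or False for one doc; no score is computed here."""
    terms = set(doc_terms)
    if any(t in terms for t in must_not):
        return False
    if not all(t in terms for t in must):
        return False
    # With no must clause, at least one should term has to match.
    if not must and should:
        return any(t in terms for t in should)
    return True


# Hypothetical sample docs, already split into terms
docs = {
    "doc1": "hello you and world is very good".split(),
    "doc2": "hello how are you".split(),
    "doc3": "good morning".split(),
}

# query "hello world" treated as: the doc should contain hello or world
candidates = [name for name, terms in docs.items()
              if matches(terms, should=("hello", "world"))]
print(candidates)  # ['doc1', 'doc2']
```

Only the docs that survive this step need to be scored, which is where the performance gain comes from.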

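2) Relevance score algorithm

Put simply, the relevance score measures how closely the text stored in an index matches the search text.

Elasticsearch's scoring is based on the term frequency/inverse document frequency algorithm, TF/IDF for short. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF that relies on the same three factors described here.

Term frequency: how many times each term of the search text appears in the field text. The more often it appears, the more relevant the doc.

Search request: hello world

doc1: hello you, and world is very good
doc2: hello, how are you

doc1 contains both hello and world, while doc2 contains only hello, so doc1 is more relevant.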

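Inverse document frequency: how many times each term of the search text appears across all documents in the index. The more often it appears, the less relevant a match on that term is.

Search request: hello world

doc1: hello, tuling is very good
doc2: hi world, how are you

For example, suppose the index holds 10,000 documents, the word hello appears 1,000 times across all of them, and the word world appears 100 times. world is the rarer term, so a match on it counts for more, and doc2 is more relevant.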

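Field-length norm: the length of the field. The longer the field, the weaker the relevance.

Search request: hello world

doc1: { "title": "hello article", "content": "...... N words" }
doc2: { "title": "my article", "content": "...... N words, hi world" }

Assume hello and world appear the same number of times across the whole index.

doc1 is more relevant, because its match is in the title field, which is shorter.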
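To make the three factors concrete, here is a simplified Python sketch that loosely follows the classic Lucene TF/IDF formulas (square root of the term frequency, logarithmic IDF, inverse square root of the field length). It leaves out query normalization, coordination and boosts, it is not the BM25 formula, and the document frequencies are assumed values taken from the example above.

```python
import math


def tf(freq):
    # Term frequency: more occurrences in the field -> higher value.
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Inverse document frequency: rarer across the index -> higher value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Field-length norm: longer field -> lower value.
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, field_tokens, num_docs, doc_freqs):
    total = 0.0
    for term in query_terms:
        freq = field_tokens.count(term)
        if freq:
            total += (tf(freq)
                      * idf(num_docs, doc_freqs[term])
                      * field_norm(len(field_tokens)))
    return total


NUM_DOCS = 10_000
DOC_FREQS = {"hello": 1000, "world": 100}  # assumed document frequencies

doc1 = "hello tuling is very good".split()
doc2 = "hi world how are you".split()

s1 = score(["hello", "world"], doc1, NUM_DOCS, DOC_FREQS)
s2 = score(["hello", "world"], doc2, NUM_DOCS, DOC_FREQS)
print(s1 < s2)  # True: the rarer term world outweighs hello
```

Both fields have the same length here, so the difference comes entirely from the IDF factor.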

3) Inspecting how the _score of a document is computed

GET /es_db/_explain/1
{
  "query": {
    "match": {
      "remark": "java developer"
    }
  }
}

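II. How an analyzer works

1. Tokenization and normalization

An analyzer takes a sentence, splits it into individual words, and normalizes each word (for example, converting tense and converting between singular and plural). Normalization improves recall, that is, it increases the number of results a search can find.

character filter: preprocesses the text before it is tokenized. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing characters, such as & --> and (I&you --> I and you).
tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
token filter: lowercase, stop words, synonyms, for example liked --> like, Tom --> tom, a/the/an --> removed, small --> little

The analyzer matters a great deal: a piece of text goes through all of these steps, and only the final result is used to build the inverted index.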
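The three stages can be sketched in a few lines of Python. This is a toy pipeline for illustration only: the regular expressions, the stop word set and the synonym table are assumptions, and stemming (liked --> like) is left out.

```python
import re

STOP_WORDS = {"a", "the", "an"}   # assumed stop word list
SYNONYMS = {"small": "little"}    # assumed synonym table


def char_filter(text):
    text = re.sub(r"<[^>]+>", "", text)   # strip HTML tags
    return text.replace("&", " and ")      # & --> and


def tokenizer(text):
    # Split the text into word tokens.
    return re.findall(r"\w+", text)


def token_filter(tokens):
    out = []
    for token in tokens:
        token = token.lower()              # Tom --> tom
        if token in STOP_WORDS:
            continue                       # a/the/an --> removed
        out.append(SYNONYMS.get(token, token))  # small --> little
    return out


def analyze(text):
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>Tom</span> & you have a small house"))
# ['tom', 'and', 'you', 'have', 'little', 'house']
```

The order matters: character filters see the raw string, the tokenizer produces the tokens, and token filters only ever see individual tokens.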

2. Built-in analyzers

Sample text: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

stop analyzer: removes stop words such as a, the, it, and so on

Test:

POST _analyze
{
  "analyzer": "standard",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
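For readers without a cluster at hand, the differences between three of these analyzers can be approximated in Python. These functions are rough stand-ins rather than the real implementations, and the stop word set is a small illustrative subset chosen for this example.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"
STOP_WORDS = {"a", "an", "the", "it", "to", "by"}  # small illustrative subset


def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are kept.
    return text.split()


def simple_analyzer(text):
    # Split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def stop_analyzer(text):
    # Like the simple analyzer, plus stop word removal.
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]


print(whitespace_analyzer(TEXT))
print(simple_analyzer(TEXT))
print(stop_analyzer(TEXT))
```

The standard analyzer is not reproduced here because its word-boundary rules (which keep set_trans together and emit 5) are more involved than a single regular expression.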

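3. Customizing analyzers

1) The default analyzer

standard

standard tokenizer: splits text on word boundaries

standard token filter: does nothing (it was removed in Elasticsearch 7.0)

lowercase token filter: converts all letters to lowercase

stop token filter (disabled by default): removes stop words such as a, the, it, and so on

2) Changing the analyzer settings to enable the English stop word token filter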

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}

Defining a fully custom analyzer:

# my_index already exists from the previous step, so delete it before creating it again
DELETE /my_index

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}

PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

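3) The IK analyzer in detail

IK configuration files are located in the es/plugins/ik/config directory.

IKAnalyzer.cfg.xml: configures custom dictionaries

main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word listed in it is kept together as a single token

quantifier.dic: words related to units of measurement

suffix.dic: suffixes

surname.dic: Chinese surnames

stopword.dic: English stop words

The two most important built-in IK files

main.dic: contains the built-in Chinese words; IK tokenizes text according to the words in this file

stopword.dic: contains the English stop words

Stop words (stopword)

a the and at but

In general, stop words are dropped during analysis and are never added to the inverted index.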
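The effect on the inverted index can be shown with a minimal Python sketch. It assumes whitespace tokenization and uses the five stop words listed above; the sample docs and the function name are made up for this example.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "and", "at", "but"}


def build_inverted_index(docs):
    """Map each non-stop-word term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue  # stop words never reach the index
            index[token].add(doc_id)
    return index


# Hypothetical sample docs
docs = {
    1: "the dog and the cat",
    2: "a cat at home",
}
index = build_inverted_index(docs)
print(sorted(index))         # ['cat', 'dog', 'home']
print(sorted(index["cat"]))  # [1, 2]
print("the" in index)        # False
```

Because the stop words are never indexed, a search for one of them on its own cannot match anything.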

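4) Custom dictionaries for the IK analyzer

(1) Building your own dictionary: every year new buzzwords appear, such as 网红, 蓝瘦香菇, 喊麦 and 鬼畜, and they are usually not in IK's built-in dictionary.

You add your own up-to-date words to the IK dictionary yourself.

IKAnalyzer.cfg.xml: ext_dict, custom/mydict.dic

After adding your words, you need to restart es for them to take effect.

(2) Building your own stop word dictionary: words such as 了, 的, 啥 and 么 are ones we may not want to index or make searchable. custom/ext_stopword.dic already contains the common Chinese stop words; you can add your own stop words to it and then restart es.

IK analyzer source code download:

github.com/medcl/elast…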

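5) IK hot reloading

Manually adding new words to the extension dictionary in es every time is painful:

(1) After every addition, es has to be restarted for the change to take effect, which is very tedious.

(2) es is distributed and may have hundreds of nodes; you cannot edit the dictionary on each node one by one.

What we want is to add new words in some external location without stopping es, and have es hot-load them immediately.

IKAnalyzer.cfg.xml

<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- configure your own extension dictionary here -->
<entry key="ext_dict">location</entry>
<!-- configure your own extension stop word dictionary here -->
<entry key="ext_stopwords">location</entry>
<!-- configure a remote extension dictionary here -->
<entry key="remote_ext_dict">words_location</entry>
<!-- configure a remote extension stop word dictionary here -->
<entry key="remote_ext_stopwords">words_location</entry>
</properties>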
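The words_location value is an HTTP URL. As far as I understand the IK plugin's README, the plugin polls that URL, reloads the words when the Last-Modified or ETag response header changes, and expects a UTF-8 body with one word per line. Under those assumptions, a remote dictionary endpoint could be sketched with Python's standard library as follows; the word list, the timestamp and the `serve` helper are hypothetical.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_response(words, mtime):
    """Return (headers, body) for a remote dictionary: one word per line, UTF-8."""
    body = "\n".join(words).encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Last-Modified": formatdate(mtime, usegmt=True),
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
        "Content-Length": str(len(body)),
    }
    return headers, body


class DictHandler(BaseHTTPRequestHandler):
    words = ["网红", "蓝瘦香菇"]  # hypothetical word list; change it to publish new words
    mtime = 0.0                   # hypothetical last-modified time (epoch seconds)

    def _send(self, include_body):
        headers, body = build_response(self.words, self.mtime)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self):
        # Header-only check, so a poller can detect changes cheaply.
        self._send(include_body=False)

    def do_GET(self):
        self._send(include_body=True)


def serve(port):
    """Start the dictionary server (blocks until interrupted)."""
    HTTPServer(("", port), DictHandler).serve_forever()
```

Calling serve(8000) on some host would make http://<host>:8000/ a candidate value for remote_ext_dict; a single shared endpoint also avoids editing dictionary files on every node.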

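III. Highlighting

Search results often need the matched keywords highlighted. Highlighting has a handful of commonly used parameters, and this example introduces some of them.

A typical requirement: search the cars index for documents whose remark field contains "大众", highlight the matched keyword using HTML tags with the font color set to red, and, if the remark value is too long, show only the first 20 characters.

The requests below demonstrate default highlighting on a news_website index.

PUT /news_website
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

# Alternative to the request above (an index can only be created once): make ik_max_word the default analyzer of the index
PUT /news_website
{
  "settings": {
    "index": {
      "analysis.analyzer.default.type": "ik_max_word"
    }
  }
}

PUT /news_website/_doc/1
{
  "title": "这是我写的第一篇文章",
  "content": "大家好，这是我写的第一篇文章，特别喜欢这个文章门户网站！！！"
}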

Query title : "文章"

GET /news_website/_search
{
  "query": {
    "match": {
      "title": "文章"
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

Query result

{
  "took" : 878,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "news_website",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "这是我写的第一篇文章",
          "content" : "大家好，这是我写的第一篇文章，特别喜欢这个文章门户网站！！！"
        },
        "highlight" : {
          "title" : [
            "这是我写的第一篇<em>文章</em>"
          ]
        }
      }
    ]
  }
}
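The result above uses the default <em> tags. The red font and the 20-character limit mentioned at the start of this section map onto the highlighter's pre_tags, post_tags and fragment_size options. A minimal sketch of such a request body, built as a Python dict, is shown here; the helper name `highlight_query` is made up for this example, and fragment_size limits the size of each highlighted fragment rather than literally cutting the field to its first 20 characters.

```python
import json


def highlight_query(field, keyword, fragment_size=20):
    """Build a search body that highlights `keyword` in `field` in red."""
    return {
        "query": {"match": {field: keyword}},
        "highlight": {
            "pre_tags": ['<font color="red">'],
            "post_tags": ["</font>"],
            "fields": {field: {"fragment_size": fragment_size}},
        },
    }


body = highlight_query("title", "文章")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of GET /news_website/_search, or adapted to the cars index by passing "remark" and "大众" instead.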
