Reading 《Elastic Stack 实战手册》 (Elastic Stack Practical Handbook), Part 34: 3.4.2.17.3 Full-Text Search / Exact Search (9) - Alibaba Cloud Developer Community

2023-05-25

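Continued from: 《Elastic Stack 实战手册》 (Elastic Stack Practical Handbook), III. Product Capabilities, 3.4 Getting Started, 3.4.2 Elasticsearch Basics, 3.4.2.17 Text analysis, settings and mappings, 3.4.2.17.3 Full-Text Search / Exact Search (8) https://developer.aliyun.com/article/1229934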

4. Full-Text Query Methods

There are nine full-text query methods: match, match_phrase, match_phrase_prefix, multi_match, match_bool_prefix, query_string, simple_query_string, intervals, and combined_fields.

Many parameters of the match query can also be used in the more complex full-text queries, for example analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, fuzzy_rewrite, zero_terms_query, and cutoff_frequency (the last of these is deprecated since Elasticsearch 7.3).

4.1 match

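The match query is the basic full-text search method. It analyzes the query text into tokens and then searches a single field for them. It imposes no ordering on those tokens: with the default settings, a document is returned as soon as any one token matches.

Usage: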

PUT my-index-000001
POST my-index-000001/_mapping
{"properties":{"message":{"type":"text"}}}
POST my-index-000001/_bulk
{ "index": { "_id": 1 }}
{ "message": "this is a test" }
GET my-index-000001/_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a test"
      }
    }
  }
}

Parameters:

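• analyzer: the analyzer used to turn the query text into tokens. Defaults to the analyzer used to index the field, that is, the analyzer parameter set in the mapping. If none is set there, the index's default analyzer is used.

• auto_generate_synonyms_phrase_query: if true, phrase queries are generated automatically for multi-term synonyms. This depends on the synonyms configured in the analyzer. Defaults to true.

• fuzziness: the maximum edit distance allowed for a match. Can be 0, 1, 2, or AUTO.

• max_expansions: the maximum number of variants (expanded terms) the query will create. Defaults to 50.

• prefix_length: the number of leading characters left unchanged when creating expansions. Defaults to 0.

• fuzzy_transpositions: whether edits include the transposition of two adjacent characters (ab → ba). Defaults to true.

• lenient: if true, format-based errors are ignored, such as supplying a text query value for a numeric field. Defaults to false.

• operator: the Boolean logic used to interpret the query text.

  ○ OR: the default. For example, the query value "capital of Hungary" is interpreted as "capital" OR "of" OR "Hungary".

  ○ AND: for example, the query value "capital of Hungary" is interpreted as "capital" AND "of" AND "Hungary".

• minimum_should_match: the minimum number of terms a document must match to be returned. For example, "capital of Hungary" is analyzed into the three terms "capital", "of", and "Hungary"; with minimum_should_match set to 2, a document must match at least two of those three terms.

• zero_terms_query: whether no documents are returned when the analyzer removes all terms (for example, when a stop-word filter is used).

  ○ none: the default. If the analyzer removes all terms, no documents are returned.

  ○ all: the opposite of none; all documents are returned.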
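The interaction between operator, minimum_should_match, and zero_terms_query can be sketched in a few lines of Python. This is only a toy model of the decision logic, not how Elasticsearch implements it: the `analyze` and `match` helpers are hypothetical, the regex tokenizer is a rough stand-in for the default analyzer, and minimum_should_match is assumed to be a plain positive integer (percentages and negative values are not handled).

```python
import re

def analyze(text):
    # Rough stand-in for the default analyzer: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())

def match(doc_text, query_text, operator="or",
          minimum_should_match=None, zero_terms_query="none"):
    """Toy model: decide whether one document matches a match-style query."""
    doc_terms = set(analyze(doc_text))
    query_terms = analyze(query_text)
    if not query_terms:
        # The analyzer removed every term: "all" matches everything, "none" nothing.
        return zero_terms_query == "all"
    matched = sum(1 for term in query_terms if term in doc_terms)
    if operator == "and":
        # AND: every query term must be present.
        return matched == len(query_terms)
    # OR: one term is enough unless minimum_should_match asks for more.
    required = 1 if minimum_should_match is None else minimum_should_match
    return matched >= required

doc = "Paris is the capital of France"
print(match(doc, "capital of Hungary"))                          # True
print(match(doc, "capital of Hungary", operator="and"))          # False
print(match(doc, "capital of Hungary", minimum_should_match=2))  # True
```

The document above contains "capital" and "of" but not "hungary", so it passes under OR and under minimum_should_match 2, and fails under AND.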

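Related usage:

operator and minimum_should_match in the match query

First create a test index and some test data (if my-index-000001 still exists from the earlier example, delete it or skip the PUT, since creating an index that already exists returns an error):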

PUT my-index-000001
POST my-index-000001/_mapping
{
  "properties": {
    "message": {
      "type": "text"
    }
  }
}
PUT my-index-000001/_doc/1
{ "message":"this is test"}
PUT my-index-000001/_doc/2
{ "message":"this is a test again"}
PUT my-index-000001/_doc/3
{ "message":"this is  not a test"}

With the default analyzer, these documents are broken into the terms "this", "is", "a", "test", "again", and "not".

POST _analyze
{
  "text": [
    "this is test",
    "this is a test again",
    "this is  not a test"
  ]
}
# Response
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "this",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "is",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "a",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "test",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "again",
      "start_offset" : 28,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "this",
      "start_offset" : 34,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "is",
      "start_offset" : 39,
      "end_offset" : 41,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "not",
      "start_offset" : 43,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "a",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "test",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

The match query text "this is a test" is likewise analyzed into "this", "is", "a", and "test":

POST _analyze
{
  "text": [
    "this is a test"
  ]
}
# Response
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
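To make the fields in the _analyze output easier to read, here is a small Python sketch that produces the same token, start_offset, end_offset, and position values for simple English text. The `analyze` helper is hypothetical, and the `\w+` regex is only an approximation of the default analyzer, which follows Unicode text segmentation rules.

```python
import re

def analyze(text):
    """Approximate _analyze output for plain ASCII text (illustrative only)."""
    tokens = []
    for position, m in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": m.group().lower(),  # tokens are lowercased
            "start_offset": m.start(),   # character offset where the token begins
            "end_offset": m.end(),       # character offset just past the token
            "position": position,        # ordinal of the token in the stream
        })
    return tokens

for tok in analyze("this is a test"):
    print(tok)
```

Offsets count characters in the original text, which is why extra whitespace shifts start_offset without changing position.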

With match's default OR logic, a document is returned as long as one term matches, so all three test documents are returned.

GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a test",
        "operator": "or"
      }
    }
  }
}
# Response
{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 48,
    "successful" : 48,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.5298672,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.5298672,
        "_source" : {
          "message" : "this is  not a test"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5298672,
        "_source" : {
          "message" : "this is a test again"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.30433932,
        "_source" : {
          "message" : "this is test"
        }
      }
    ]
  }
}

If minimum_should_match is set to 4, however, four terms must match, so only documents 2 and 3 qualify.

GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a test",
        "operator": "or",
        "minimum_should_match": 4
      }
    }
  }
}
# Response
{
  ......
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5298672,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.5298672,
        "_source" : {
          "message" : "this is  not a test"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5298672,
        "_source" : {
          "message" : "this is a test again"
        }
      }
    ]
  }
}

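Next, look at what happens when operator is and. The result is the same as the earlier query with minimum_should_match set to 4: the query has four terms, and both settings require every one of them to match.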

GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a test",
        "operator": "and"
      }
    }
  }
}
# Response
{
  ......
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5298672,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.5298672,
        "_source" : {
          "message" : "this is  not a test"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5298672,
        "_source" : {
          "message" : "this is a test again"
        }
      }
    ]
  }
}

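Fuzzy matching in match

The fuzziness family of parameters lets the terms produced by match be matched fuzzily. They are used in the same way as in the fuzzy query, so they are not covered in detail here.

Consider the following example: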

GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a testt",
        "fuzziness": "auto",
        "operator": "and"
      }
    }
  }
}
# Response
{
  ......
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.50886154,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.50886154,
        "_source" : {
          "message" : "this is  not a test"
        }
      },
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.50886154,
        "_source" : {
          "message" : "this is a test again"
        }
      }
    ]
  }
}

Clearly, although the query text misspells "test" as "testt", the fuzziness parameter compensates for the error and the documents are still found.
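To see why a one-character slip such as "testt" can still match "test", the following Python sketch models the fuzziness, prefix_length, and fuzzy_transpositions parameters. It is an illustration under stated assumptions rather than Elasticsearch's implementation: the helper names are hypothetical, the AUTO thresholds are the documented defaults (terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two), and max_expansions is not modelled.

```python
def auto_fuzziness(term):
    # Default AUTO thresholds, based on the length of the query term.
    n = len(term)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

def edit_distance(a, b, transpositions=True):
    # Levenshtein distance; with transpositions=True an adjacent swap
    # (ab -> ba) counts as one edit (optimal string alignment variant).
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

def fuzzy_match(query_term, indexed_term, prefix_length=0, transpositions=True):
    # The first prefix_length characters must be identical.
    if query_term[:prefix_length] != indexed_term[:prefix_length]:
        return False
    allowed = auto_fuzziness(query_term)
    return edit_distance(query_term, indexed_term, transpositions) <= allowed

print(fuzzy_match("testt", "test"))  # True: 5 characters allow 1 edit
print(fuzzy_match("is", "as"))       # False: 2-character terms must match exactly
```

Short terms such as "is" and "a" get no tolerance under AUTO, so in the query above only the longer terms can absorb a typo.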

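Continued in: 《Elastic Stack 实战手册》 (Elastic Stack Practical Handbook), III. Product Capabilities, 3.4 Getting Started, 3.4.2 Elasticsearch Basics, 3.4.2.17 Text analysis, settings and mappings, 3.4.2.17.3 Full-Text Search / Exact Search (10) https://developer.aliyun.com/article/1229931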
