Reading the Elastic Stack Hands-On Manual (《Elastic Stack 实战手册》), Part 35: 3.4.2.17.4 Analyzers / Custom analyzers (7) - Alibaba Cloud Developer Community

2023-05-25

Continued from: Elastic Stack Hands-On Manual > III. Product Capabilities > 3.4 Getting Started > 3.4.2 Elasticsearch Basic Usage > 3.4.2.17 Text analysis, settings, and mappings > 3.4.2.17.4 Analyzers / Custom analyzers (6) https://developer.aliyun.com/article/1229769

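Configuration options

separator: the string used to join the tokens. Defaults to a space.

max_output_size: the maximum length of the output token. If the concatenated result exceeds this length, nothing is emitted. Defaults to 255.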

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "fingerprint",
      "separator": "-"
    }
  ],
  "text": "A good cook  could cook cookies?"
}
#Response
[ A-cook-cookies-could-good ]
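
The behaviour shown in this response can be approximated in a few lines of Python. This is only a rough model of the fingerprint filter (deduplicate, sort, join), not Elasticsearch's implementation; the function name `fingerprint` and its Python signature are assumptions made for illustration.

```python
def fingerprint(tokens, separator=" ", max_output_size=255):
    """Rough model of the fingerprint token filter (illustrative only)."""
    if not tokens:
        return []
    # Deduplicate, then sort by code point (uppercase sorts before lowercase).
    unique_sorted = sorted(set(tokens))
    joined = separator.join(unique_sorted)
    # An oversized fingerprint is dropped entirely rather than truncated.
    if len(joined) > max_output_size:
        return []
    return [joined]


tokens = ["A", "good", "cook", "could", "cook", "cookies"]
print(fingerprint(tokens, separator="-"))  # ['A-cook-cookies-could-good']
```

Note that "A" sorts first only because no lowercase filter ran before the fingerprint step; in practice a lowercase filter is often placed ahead of it.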

Keyword marker token filter

The keyword_marker filter marks the listed tokens as keywords so that they are not stemmed. It must be placed before any stemming filter in the filter chain.

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type":"keyword_marker",
      "keywords":["loves","travelling"]
    },
    "stemmer"
  ],
  "text": ["Tony loves dancing and travelling"]
}
#Response
[ Toni, loves, danc, and, travelling ]
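
The interaction between marking and stemming can be sketched in Python. Everything here is hypothetical: `toy_stem` is a deliberately tiny stand-in for a real stemmer (it is not the Porter algorithm used by Elasticsearch), and `mark_keywords` / `stem_unmarked` are illustrative names. The point is only that tokens flagged as keywords pass through the later stemming step unchanged.

```python
def toy_stem(token):
    """Hypothetical stand-in for a real stemmer; NOT the Porter algorithm."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    if token.endswith("y") and len(token) > 3:
        return token[:-1] + "i"
    return token


def mark_keywords(tokens, keywords, ignore_case=False):
    """Pair each token with a flag saying whether it is protected."""
    if ignore_case:
        protected = {k.lower() for k in keywords}
        return [(t, t.lower() in protected) for t in tokens]
    protected = set(keywords)
    return [(t, t in protected) for t in tokens]


def stem_unmarked(marked_tokens):
    """Stem only the tokens that were not flagged as keywords."""
    return [t if is_keyword else toy_stem(t) for t, is_keyword in marked_tokens]


tokens = ["Tony", "loves", "dancing", "and", "travelling"]
marked = mark_keywords(tokens, keywords=["loves", "travelling"])
print(stem_unmarked(marked))  # ['Toni', 'loves', 'danc', 'and', 'travelling']
```

With an empty keyword list the same toy pipeline would also alter "loves" and "travelling", which is why the marker has to run before the stemmer.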

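Configuration options

ignore_case: whether keyword matching ignores case. Defaults to false.

keywords: the list of tokens that should not be stemmed.

keywords_path: path to a file listing the tokens that should not be stemmed. The file must be placed under the config directory of the Elasticsearch installation, be UTF-8 encoded, and contain one token per line.

keywords_pattern: a regular expression; tokens that match it are not stemmed.

Note that one of keywords, keywords_path, or keywords_pattern must be specified, and keywords_pattern cannot be combined with keywords or keywords_path.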

Length token filter

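The length filter keeps only tokens whose character length falls within the configured range, for example tokens that are between 3 and 5 characters long (inclusive).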

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "length",
      "max": 5,
      "min": 3
    }
  ],
  "text": "Hey,We Are Elastic"
}
#Response
[ Hey, Are ]

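Configuration options

min (optional): the minimum character length of a token. Defaults to 0.

max (optional): the maximum character length of a token. Defaults to 2147483647 (2^31 - 1).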
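
As a minimal sketch of the range check, assuming both bounds are inclusive: the Python function below models the length filter. The parameter names `min_length` and `max_length` are renamed from min and max only to avoid shadowing Python built-ins; this is an illustration, not the filter's actual code.

```python
def length_filter(tokens, min_length=0, max_length=2**31 - 1):
    """Keep tokens whose length lies in [min_length, max_length]."""
    return [t for t in tokens if min_length <= len(t) <= max_length]


tokens = ["Hey", "We", "Are", "Elastic"]
print(length_filter(tokens, min_length=3, max_length=5))  # ['Hey', 'Are']
```

"We" is dropped for being too short and "Elastic" for being too long, which matches the response shown above.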

Limit token count token filter

The limit filter caps the number of tokens that are output. By default only the first token is kept.

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "limit",
      "max_token_count":2    #1
    }
  ],
  "text": "Heya,We are Elastic "
}
#Response
[ Heya, We ]

#1 Returns only the first two tokens.

Configuration options

max_token_count: the maximum number of tokens to keep. Defaults to 1.
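
A small Python model of this truncation, written against an arbitrary token stream. The function name `limit` is an assumption for illustration; `itertools.islice` is used so that the sketch stops reading the stream once the cap is reached.

```python
from itertools import islice


def limit(tokens, max_token_count=1):
    """Keep at most max_token_count tokens from the start of the stream."""
    return list(islice(tokens, max_token_count))


tokens = ["Heya", "We", "are", "Elastic"]
print(limit(tokens, max_token_count=2))  # ['Heya', 'We']
print(limit(tokens))                     # ['Heya']
```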

Lowercase token filter

The lowercase filter converts tokens to lowercase according to the language. The default rules are suited to English.

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "lowercase"
    }
  ],
  "text": "Heya,We are Elastic "
}
#Response
[ heya, we, are, elastic ]

Configuration options

language: selects language-specific lowercasing. Supported values are greek, irish, and turkish.
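
Why a language option matters can be seen with Turkish, where uppercase I lowercases to dotless ı and dotted İ lowercases to i. The Python sketch below models only the default behaviour and the Turkish case; the Greek and Irish rules are deliberately left out, and the function name `lowercase` and its `language` argument are illustrative assumptions rather than the filter's real code.

```python
def lowercase(tokens, language=None):
    """Lowercase tokens; only the default and Turkish rules are modelled."""
    if language == "turkish":
        # Turkish distinguishes dotted and dotless i: İ -> i, I -> ı.
        return [t.replace("İ", "i").replace("I", "ı").lower() for t in tokens]
    if language is not None:
        raise ValueError(f"language {language!r} is not modelled in this sketch")
    return [t.lower() for t in tokens]


print(lowercase(["Heya", "We", "are", "Elastic"]))  # ['heya', 'we', 'are', 'elastic']
print(lowercase(["ISPARTA"], language="turkish"))   # ['ısparta']
```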

Uppercase token filter

The uppercase filter converts tokens to uppercase.

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "uppercase"
    }
  ],
  "text": "Heya,We are Elastic "
}
#Response
[ HEYA, WE, ARE, ELASTIC ]

N-gram token filter

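The ngram filter splits each token into n-grams: a sliding window of size N moves over the token's characters and produces character fragments of length N. The resulting fragments are commonly used for fuzzy (partial) matching. The default minimum length is 1 and the default maximum is 2. When the difference between max_gram and min_gram is greater than 1, the index setting index.max_ngram_diff must be raised accordingly.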

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "ngram",
      "min_gram":2,
      "max_gram":3
    }
  ],
  "text": [  "cat"  ]
}
#Response
[ ca, cat, at ]
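
The sliding-window procedure can be sketched in Python as follows. This is a simplified model, not Elasticsearch's implementation: the function name `ngram` is an assumption, and the `max_ngram_diff` argument is a hypothetical stand-in for the index.max_ngram_diff index setting, included only to show the constraint on the gap between the two sizes.

```python
def ngram(tokens, min_gram=1, max_gram=2, max_ngram_diff=1):
    """Emit all character n-grams of each token, ordered by start position."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 1 <= min_gram <= max_gram")
    # Stand-in for the index.max_ngram_diff check (hypothetical parameter).
    if max_gram - min_gram > max_ngram_diff:
        raise ValueError("max_gram - min_gram exceeds max_ngram_diff")
    grams = []
    for token in tokens:
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    grams.append(token[start:start + size])
    return grams


print(ngram(["cat"], min_gram=2, max_gram=3))  # ['ca', 'cat', 'at']
print(ngram(["cat"]))                          # ['c', 'ca', 'a', 'at', 't']
```

Calling `ngram(["cat"], 1, 3)` raises an error in this sketch until `max_ngram_diff` is raised to 2, mirroring the constraint described above.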

Continued in: Elastic Stack Hands-On Manual > III. Product Capabilities > 3.4 Getting Started > 3.4.2 Elasticsearch Basic Usage > 3.4.2.17 Text analysis, settings, and mappings > 3.4.2.17.4 Analyzers / Custom analyzers (8) https://developer.aliyun.com/article/1229767
