Common Chinese analyzers for Elasticsearch include:
- standard (the default analyzer)
- IK Chinese analyzer
- pinyin analyzer
- Smart Chinese analyzer
- hanlp Chinese analyzer
- AliNLP (the DAMO Academy Chinese analyzer)
Analyzer comparison:
- standard (default): splits text into single characters; high recall but low precision
- IK's ik_max_word: high recall and precision with good performance; the most widely used Chinese analyzer in practice
- IK's ik_smart: coarser-grained segmentation, so recall and precision suffer, but query performance is higher
- Smart Chinese: reasonably high recall, precision, and performance
- hanlp: coarser-grained segmentation, with lower recall and precision but higher query performance
- pinyin: tokenizes by the pinyin of the characters; unlike the analyzers above, it delivers high recall and precision for pinyin-based queries
The sections below walk through each analyzer and compare their output on the same Chinese phrase, as a practical reference.
standard (the default analyzer)
```
GET _analyze
{
  "text": "南京市长江大桥",
  "tokenizer": "standard"
}

# Response
{
  "tokens" : [
    { "token" : "南", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "京", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "市", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "长", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "江", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "大", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "桥", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 6 }
  ]
}
```
The default analyzer splits Chinese text into individual characters and has no understanding of Chinese words, so it is rarely used for Chinese in real projects.
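To see the precision problem concretely, here is a small sketch (the index and field names are hypothetical): a search for “市长” (mayor) still matches “南京市长江大桥”, because the query string is likewise split into the single characters “市” and “长”, both of which occur in the document.

```
# Hypothetical index using the standard analyzer
PUT /analyze_standard
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "standard" }
    }
  }
}

POST analyze_standard/_doc/1
{ "content": "南京市长江大桥" }

# "市长" is analyzed into the single characters "市" and "长",
# so this match query (OR semantics by default) hits the document anyway
POST analyze_standard/_search
{
  "query": { "match": { "content": "市长" } }
}
```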
IK Chinese analyzer:
Plugin download: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.10.0
(be sure to download the release that matches your Elasticsearch version)
- Create an ik folder under the plugins directory of the Elasticsearch installation, then unzip the downloaded IK package into it (see the shell sketch below)
- Restart ES for the plugin to take effect
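A minimal command-line sketch of those two steps (the paths and archive name are assumptions based on a 7.10.0 install):

```
# run from the Elasticsearch home directory; adjust paths to your environment
cd /path/to/elasticsearch-7.10.0
mkdir plugins/ik
unzip elasticsearch-analysis-ik-7.10.0.zip -d plugins/ik
# restart Elasticsearch afterwards so the plugin is picked up
```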
The IK plugin ships two analyzers, ik_smart and ik_max_word, both usable at index time and at search time. Create an index with two fields:
- max_word_content, analyzed with ik_max_word;
- smart_content, analyzed with ik_smart;
and compare the results:
```
# Create the index
PUT /analyze_chinese
{
  "mappings": {
    "properties": {
      "max_word_content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "smart_content": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

# Add test data
POST analyze_chinese/_bulk
{"index":{"_id":1}}
{"max_word_content":"南京市长江大桥","smart_content":"我是南京市民"}

# Analysis with ik_max_word
POST _analyze
{
  "text": "南京市长江大桥",
  "analyzer": "ik_max_word"
}

# Result
{
  "tokens" : [
    { "token" : "南京市", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "南京", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 1 },
    { "token" : "市长", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "长江大桥", "start_offset" : 3, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 },
    { "token" : "长江", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 },
    { "token" : "大桥", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 }
  ]
}

# Analysis with ik_smart
POST _analyze
{
  "text": "南京市长江大桥",
  "analyzer": "ik_smart"
}

# Result
{
  "tokens" : [
    { "token" : "南京市", "start_offset" : 0, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "长江大桥", "start_offset" : 3, "end_offset" : 7, "type" : "CN_WORD", "position" : 1 }
  ]
}
```
From the output above, ik_smart clearly segments at a coarser granularity, while ik_max_word is finer-grained.
Verify this with a DSL query:
```
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "smart_content": "南京市"
    }
  }
}

# Result
"hits" : {
  "total" : { "value" : 0, "relation" : "eq" },
  "max_score" : null,
  "hits" : [ ]
}
```
No documents match, because the tokens produced for “我是南京市民” do not include a “南京市” token.
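You can confirm this with _analyze. With the plugin's default dictionary, ik_smart should produce tokens along the lines of 我 / 是 / 南京 / 市民, with no single “南京市” token (the exact output depends on your dictionary):

```
POST _analyze
{
  "text": "我是南京市民",
  "analyzer": "ik_smart"
}
```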
What about searching for “南京”?
```
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "smart_content": "南京"
    }
  }
}

# Result
"hits" : [
  {
    "_index" : "analyze_chinese",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.2876821,
    "_source" : {
      "max_word_content" : "南京市长江大桥",
      "smart_content" : "我是南京市民"
    }
  }
]
```
How does the max_word_content field, analyzed with ik_max_word, behave?
```
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "max_word_content": "南京"
    }
  }
}

# Result
"hits" : [
  {
    "_index" : "analyze_chinese",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.2876821,
    "_source" : {
      "max_word_content" : "南京市长江大桥",
      "smart_content" : "我是南京市民"
    }
  }
]

# Query with 南京市
POST analyze_chinese/_search
{
  "query": {
    "match": {
      "max_word_content": "南京市"
    }
  }
}

# Result
"hits" : [
  {
    "_index" : "analyze_chinese",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.5753642,
    "_source" : {
      "max_word_content" : "南京市长江大桥",
      "smart_content" : "我是南京市民"
    }
  }
]
```
As you can see, “南京市长江大桥” analyzed with ik_max_word yields both a “南京” and a “南京市” token, so both queries match.
IK analyzer summary:
- ik_max_word segments at a fine granularity and covers a wider range of business scenarios
- ik_smart segments at a coarser granularity and suits scenarios with less demanding segmentation requirements
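A pattern often used in practice combines the two (a sketch; the index name articles is hypothetical): index with ik_max_word so more tokens are searchable, and analyze queries with ik_smart to keep them precise:

```
PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```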
pinyin analyzer
First, download the pinyin plugin:
https://github.com/medcl/elasticsearch-analysis-pinyin
Build and package it locally, upload it to the plugins directory of the ES installation, unzip it, and restart ES. After the restart, check that the installation succeeded:
```
[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
ik
pinyin
```
The pinyin plugin is installed. Now create an index that uses it:
```
PUT /analyze_chinese_pinyin
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "pinyin_analyzer" : {
          "tokenizer" : "my_pinyin"
        }
      },
      "tokenizer" : {
        "my_pinyin" : {
          "type" : "pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
        }
      }
    }
  }
}

GET /analyze_chinese_pinyin/_analyze
{
  "text": ["南京市长江大桥"],
  "analyzer": "pinyin_analyzer"
}

# Response
{
  "tokens" : [
    { "token" : "nan", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "南京市长江大桥", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "njscjdq", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "jing", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 },
    { "token" : "shi", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 },
    { "token" : "chang", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 3 },
    { "token" : "jiang", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 4 },
    { "token" : "da", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 5 },
    { "token" : "qiao", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 6 }
  ]
}

# Add test data
POST analyze_chinese_pinyin/_bulk
{"index":{"_id":1}}
{"name":"南京市长江大桥"}

# Query by the first-letter abbreviation njscjdq
POST analyze_chinese_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": "njscjdq"
    }
  }
}

# Response
"hits" : [
  {
    "_index" : "analyze_chinese_pinyin",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.6931471,
    "_source" : {
      "name" : "南京市长江大桥"
    }
  }
]

# Query by "nan"
POST analyze_chinese_pinyin/_search
{
  "query": {
    "match": {
      "name.pinyin": "nan"
    }
  }
}

# Response
"hits" : [
  {
    "_index" : "analyze_chinese_pinyin",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 0.6931471,
    "_source" : {
      "name" : "南京市长江大桥"
    }
  }
]
```
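One step the transcript above glosses over: the searches target name.pinyin, which only works if the name field is mapped with a pinyin sub-field before the document is indexed. A minimal mapping sketch, adapted from the plugin's README:

```
PUT /analyze_chinese_pinyin/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}
```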
Because “南京市长江大桥” run through pinyin_analyzer produces tokens that include nan and njscjdq, both queries match the record.
Smart Chinese Analysis
Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html
The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch, providing an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text: the text is first broken into sentences, and each sentence is then segmented into words.
The plugin must be installed on every node, and a restart is required for it to take effect. It provides the smartcn analyzer and the smartcn_tokenizer tokenizer.
```
./bin/elasticsearch-plugin install analysis-smartcn
-> Installing analysis-smartcn
-> Downloading analysis-smartcn from elastic
[=================================================] 100%
-> Installed analysis-smartcn
```
List the installed plugins again:
```
[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
analysis-smartcn
ik
pinyin
```
Once installation succeeds, restart ES so the plugin takes effect, then try it out:
```
POST _analyze
{
  "analyzer": "smartcn",
  "text": "南京市长江大桥"
}

# Response
{
  "tokens" : [
    { "token" : "南京市", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },
    { "token" : "长江", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 1 },
    { "token" : "大桥", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 2 }
  ]
}
```
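If you need extra token filters, you can also build a custom analyzer on top of smartcn_tokenizer instead of using the ready-made smartcn analyzer. A sketch (the index and analyzer names are hypothetical):

```
PUT /analyze_chinese_smartcn
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_smartcn": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "my_smartcn" }
    }
  }
}
```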
hanlp Chinese analyzer
Install the plugin:
```
./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.10.0/elasticsearch-analysis-hanlp-7.10.0.zip
```
Check the plugin list after installation; once it succeeds, ES again needs a restart:
```
[elasticsearch@stage-node1 elasticsearch-7.10.0]$ ./bin/elasticsearch-plugin list
analysis-hanlp
analysis-smartcn
ik
pinyin
```
```
GET _analyze
{
  "text": "南京市长江大桥",
  "tokenizer": "hanlp"
}

# Response
{
  "tokens" : [
    { "token" : "南京市", "start_offset" : 0, "end_offset" : 3, "type" : "ns", "position" : 0 },
    { "token" : "长江大桥", "start_offset" : 3, "end_offset" : 7, "type" : "nz", "position" : 1 }
  ]
}
```
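To use it on a field rather than ad hoc, reference the analyzer in a mapping, just as with the other plugins. A sketch (the index and field names are hypothetical):

```
PUT /analyze_chinese_hanlp
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "hanlp" }
    }
  }
}
```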