### Downloading the IK analyzer

Download it from Baidu Netdisk (extraction code: fnq0), unzip it into the `/plugins/ik` directory, then start the service:

```
./bin/elasticsearch
```
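To confirm the plugin actually loaded, you can list the installed plugins (this assumes a local single-node setup; the IK plugin should appear in the output):

```
GET /_cat/plugins
```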
### Testing

Use Kibana's Dev Tools.
#### Testing tokenization

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "王者荣耀"
}
```
Result with the `standard` analyzer:
```
{
  "tokens" : [
    {
      "token" : "王",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "荣",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "耀",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}
```
```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "王者荣耀"
}
```
Result with the `ik_smart` analyzer:
```
{
  "tokens" : [
    {
      "token" : "王者",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "荣耀",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
```
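Conceptually, `ik_smart` produces the coarsest-grained split, while `ik_max_word` emits every dictionary word it can find in the text. The difference can be illustrated with a toy dictionary-based segmenter (a sketch only, not the real IK algorithm; the mini dictionary here is hypothetical):

```python
# Toy illustration of dictionary-based Chinese segmentation (NOT the real IK code).
DICT = {"中华", "华人", "人民", "共和国", "中华人民共和国"}  # hypothetical mini dictionary

def smart_cut(text):
    """Coarsest-grained split, in the spirit of ik_smart:
    greedy forward maximum matching against the dictionary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # no dictionary hit: emit a single char
            tokens.append(text[i])
            i += 1
    return tokens

def max_word_cut(text):
    """Finest-grained split, in the spirit of ik_max_word:
    emit every dictionary word that occurs anywhere in the text."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in DICT]

print(smart_cut("中华人民共和国"))     # ['中华人民共和国']
print(max_word_cut("中华人民共和国"))  # ['中华', '中华人民共和国', '华人', '人民', '共和国']
```

This is why `ik_max_word` is the common choice at index time (more tokens, better recall) and `ik_smart` at search time (fewer, more precise query tokens), as in the mapping below.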
### Using the analyzer
```
# Create the index
PUT /indexik

# Map the index, assigning analyzers to the content field
PUT /indexik/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
    }
  }
}
```
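With this mapping, documents are indexed with the fine-grained `ik_max_word` analyzer while queries are tokenized with the coarser `ik_smart`. You can check which analyzer a field resolves to by analyzing against the field itself (run after the mapping above is in place):

```
GET /indexik/_analyze
{
  "field": "content",
  "text": "中华人民共和国"
}
```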
```
# View the mapping
GET /indexik/_mapping

# Add test data
POST /indexik/_doc
{"content":"美国留给伊拉克的是个烂摊子吗"}

POST /indexik/_doc
{"content":"公安部:各地校车将享最高路权"}

POST /indexik/_doc
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}

POST /indexik/_doc
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}

POST /indexik/_doc
{"content":"中华人民共和国"}
```
```
# Query
GET /indexik/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}

# Delete the index
DELETE /indexik
```
The query returns more accurate results than the default analyzer would:
```
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.79423964,
    "hits" : [
      {
        "_index" : "indexik",
        "_type" : "_doc",
        "_id" : "3ObjKnsBEq3c_HSrSrAr",
        "_score" : 0.79423964,
        "_source" : {
          "content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        }
      },
      {
        "_index" : "indexik",
        "_type" : "_doc",
        "_id" : "3ebjKnsBEq3c_HSrUbCx",
        "_score" : 0.79423964,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        }
      }
    ]
  }
}
```
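The improvement over the `standard` analyzer comes down to token granularity: as shown earlier, `standard` splits CJK text into single characters, so the query 「中国」 becomes the tokens 中 and 国 and would also match documents such as 「中华人民共和国」 that merely contain those two characters. A rough illustration, comparing token sets only and ignoring scoring:

```python
def standard_tokens(text):
    """The standard analyzer emits one token per CJK character."""
    return set(text)

# Character-level tokens: the query "falsely" overlaps an unrelated document.
query = standard_tokens("中国")                # {'中', '国'}
doc = standard_tokens("中华人民共和国")
print(query & doc)                             # {'中', '国'} -> spurious match

# Word-level tokens (assumed ik_smart output): no overlap, so no false hit.
print({"中国"} & {"中华人民共和国"})            # set()
```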
### Making IK the default analyzer

A pitfall to avoid: adding the following to elasticsearch.yml makes startup fail with an error:

```
# ik analyzer
index.analysis.analyzer.default.type: ik_max_word
```
```
Since elasticsearch 5.x index level settings can NOT be set on the nodes
configuration like the elasticsearch.yaml, in system properties or command line
arguments. In order to upgrade all indices the settings must be updated via the
/${index}/_settings API. Unless all settings are dynamic all indices must be closed
in order to apply the upgrade. Indices created in the future should use index templates
to set default values.

Please ensure all required values are updated on all indices by executing:
curl -XPUT 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
  "index.analysis.analyzer.default.type" : "ik_max_word"
}'
```
Since Elasticsearch 5.x, index-level settings can no longer be set in the node configuration; they must be changed through the /${index}/_settings API.
#### Setting a default analyzer on the index
```
PUT indexik
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_smart"
        }
      }
    }
  }
}
```
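To verify the index-level default took effect, analyze text without naming an analyzer; the request should now fall back to `ik_smart` instead of `standard` (assuming the index was created with the settings above):

```
POST /indexik/_analyze
{
  "text": "王者荣耀"
}
```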
### Configuring the IK analyzer

Edit `IKAnalyzer.cfg.xml`:
```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Configure your own extension stopword dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">location</entry>
    <!-- Configure a remote extension stopword dictionary here -->
    <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
```
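The local dictionary paths are relative to the plugin's config directory, and each `.dic` file is plain UTF-8 text with one entry per line. For example, a hypothetical `custom/mydict.dic` adding the game title from the tests above could contain:

```
王者荣耀
```

For the remote entries, IK polls the configured URL periodically and, per the plugin's documentation, relies on the `Last-Modified`/`ETag` response headers to detect updates, so new words can be loaded without restarting the node.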