Preface
I'm currently putting out an Elasticsearch tutorial series. It will run fairly long, so if you like it, give it a follow ❤️ ~
Picking up where the last post left off, this section covers the query operations we deferred, plus a key topic: how analysis (tokenization) works ~
To keep things easy to follow, every example in this section reuses the index from the previous post. This installment leans practical, so enough preamble, let's dive in ~
One more thing: quite a few readers have messaged me saying the series is all dry theory, with no Java code to go with it. I understand the urge to start typing and get this running in a project quickly. But when I learn something new, especially middleware like this, I first set the code aside and get to know the thing itself and how the whole process works, because the packaged SDKs are ultimately just wrappers around the same APIs, there to make developers' lives easier. Anyone can write the code; the hard part is understanding what's underneath. It's like interviews: everyone can recite answers, but not everyone answers well.
A spoiler: a few later installments will be dedicated to hands-on Spring Boot + ES integration. As long as I'm not swamped, updates will come on schedule ~
prefix & wildcard & regexp
The query operations we covered in the previous section all work at the granularity of a term. Sometimes we need to refine the query granularity down to the character level. Chinese, obviously, is not split into terms on whitespace, and this is where the partial-match queries prefix, wildcard and regexp come in. You can think of them as the rough equivalent of fuzzy (LIKE-style) matching.
prefix & prefix query
```
POST class_1/_search
{
  "query": {
    "prefix": {
      "name": {
        "value": "i"
      }
    }
  }
}
```
Response:
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 4, "relation" : "eq" }, "max_score" : 1.0, "hits" : [ { "_index" : "class_1", "_type" : "_doc", "_id" : "imFt-4UBECmbBdQAnVJg", "_score" : 1.0, "_source" : { "name" : "i", "age" : 10 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "b8fcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so haochi1~", "num" : 1 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "ccfcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so haochi3~", "num" : 1 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "cMfcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so zhen haochi2~", "num" : 1 } } ] } } 复制代码
wildcard & wildcard query
This one lets you use wildcard characters, such as *:
```
POST class_1/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*a*"
      }
    }
  }
}
```
Response:
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 3, "relation" : "eq" }, "max_score" : 1.0, "hits" : [ { "_index" : "class_1", "_type" : "_doc", "_id" : "b8fcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so haochi1~", "num" : 1 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "ccfcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so haochi3~", "num" : 1 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "cMfcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so zhen haochi2~", "num" : 1 } } ] } } 复制代码
regexp & regular expression query
With regexp we can match terms against a regular expression:
```
POST class_1/_search
{
  "query": {
    "regexp": {
      "name": {
        "value": "[A-Za-z0-9]*"
      }
    }
  }
}
```
Response:
{ "took" : 8, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 10, "relation" : "eq" }, "max_score" : 1.0, "hits" : [ { "_index" : "class_1", "_type" : "_doc", "_id" : "h2Fg-4UBECmbBdQA6VLg", "_score" : 1.0, "_source" : { "name" : "b", "num" : 6 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "iGFt-4UBECmbBdQAnVJe", "_score" : 1.0, "_source" : { "name" : "g", "age" : 8 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "iWFt-4UBECmbBdQAnVJg", "_score" : 1.0, "_source" : { "name" : "h", "age" : 9 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "imFt-4UBECmbBdQAnVJg", "_score" : 1.0, "_source" : { "name" : "i", "age" : 10 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "3", "_score" : 1.0, "_source" : { "num" : 9, "name" : "e", "age" : 9, "desc" : [ "hhhh" ] } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "4", "_score" : 1.0, "_source" : { "name" : "f", "age" : 10, "num" : 10 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "b8fcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so haochi1~", "num" : 1 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "ccfcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so haochi3~", "num" : 1 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : { "name" : "l", "num" : 6 } }, { "_index" : "class_1", "_type" : "_doc", "_id" : "cMfcCoYB090miyjed7YE", "_score" : 1.0, "_source" : { "name" : "I eat apple so zhen haochi2~", "num" : 1 } } ] } } 复制代码
Analysis in ES
What is analysis?
To run full-text search over unstructured text, the text first has to be analyzed. As the name suggests, analysis is the process of splitting a sentence or a longer passage into multiple terms according to a set of rules.
Analyzers
In ES, analysis is performed by analyzers, which are mainly made up of the following components:
- Character Filters: preprocess the raw text, for example stripping tags or punctuation. An analyzer can contain zero or more Character Filters.
- Tokenizer: splits the raw text into terms according to some rule, for example on whitespace. An analyzer contains exactly one Tokenizer.
- Token Filters: post-process the emitted terms, for example lowercasing, removing stop words, or adding synonyms. An analyzer can contain zero or more Token Filters.
As an example, take the sentence I eat a apple 开心. The analysis flow is roughly as follows:
- Analysis starts: the Character Filters process the raw text first; after this step the text reads: I eat a apple
- The Tokenizer then splits the text according to its rules, producing the terms I, eat, a, apple
- Finally the Token Filters process each term (lowercasing here), leaving: i, eat, a, apple
Of course, different analyzers can apply considerably more elaborate processing than this; you can observe the pipeline directly, as shown below.
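The _analyze API returns the tokens an analyzer produces, which makes the pipeline easy to inspect. A minimal sketch running the default standard analyzer over the sentence above:

```
POST _analyze
{
  "analyzer": "standard",
  "text": "I eat a apple 开心"
}
```

The English words come back lowercased (i, eat, a, apple), and the standard analyzer typically emits each Chinese character as a separate token (开, 心), which is exactly why a dedicated Chinese analyzer, covered below, is worth installing.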
Built-in analyzers
- Standard Analyzer: the default; splits on word boundaries and lowercases
- Simple Analyzer: splits on anything that is not a letter (symbols are dropped) and lowercases
- Stop Analyzer: lowercases and removes stop words (the, a, is)
- Whitespace Analyzer: splits on whitespace, does not lowercase
- Keyword Analyzer: no tokenization; the input is emitted as-is as a single term
- Pattern Analyzer: splits on a regular expression, \W+ by default
- Language: analyzers for 30+ common languages
- Custom Analyzer: an analyzer you assemble yourself (see the sketch after this list)
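A custom analyzer is just the three components from earlier wired together explicitly in the index settings. A minimal sketch using only built-in pieces; the names my_index and my_analyzer are made up for illustration:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}
```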
elasticsearch-analysis-ik & the Chinese analyzer
In most real business scenarios the content is Chinese, so a Chinese analyzer is one of the most commonly used. ES does not ship with one, so you need to install it yourself; I won't go over the installation here, as the plugin's documentation covers it.
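Once installed, the ik plugin registers two analyzers: ik_smart (coarse-grained splitting) and ik_max_word (fine-grained, emitting every plausible word). A quick sketch to verify it works; the sample sentence is arbitrary:

```
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我今天很开心"
}
```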
Wrapping up
That's it for this section. Next time we move on to ES aggregations, another fairly important topic ~