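Custom analyzers
We can also define our own analyzer. A custom analyzer is a suitable combination of the following three parts:
- zero or more character filters
- exactly one tokenizer
- zero or more token filters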
```
# Use a custom tokenizer, character filter, and token filter
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "tokenizer": {
        "punctuation": { "type": "pattern", "pattern": "[ .,!?]" }
      },
      "char_filter": {
        "emoticons": { "type": "mapping", "mappings": [ ":) => _happy_", ":( => _sad_" ] }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      }
    }
  }
}

POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}

# Response
{
  "tokens": [
    { "token": "i'm", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "_happy_", "start_offset": 6, "end_offset": 8, "type": "word", "position": 2 },
    { "token": "person", "start_offset": 9, "end_offset": 15, "type": "word", "position": 3 },
    { "token": "you", "start_offset": 21, "end_offset": 24, "type": "word", "position": 5 }
  ]
}
```
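To make the three stages concrete, here is a minimal Python sketch that imitates the pipeline outside of Elasticsearch: a mapping character filter, a pattern tokenizer, then lowercase and stop-word token filters. The function names are hypothetical, the stop-word set is only a small assumed subset of the `_english_` list, and offsets and positions are not tracked.

```python
import re

# Assumed equivalents of the "emoticons" mapping char filter.
CHAR_MAPPINGS = {":)": "_happy_", ":(": "_sad_"}
# Small illustrative subset of English stop words (an assumption, not the full list).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to"}


def char_filter(text):
    """Stage 1: rewrite the raw character stream before tokenizing."""
    for source, target in CHAR_MAPPINGS.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    """Stage 2: split on the same pattern as the "punctuation" tokenizer."""
    return [token for token in re.split(r"[ .,!?]", text) if token]


def token_filters(tokens):
    """Stage 3: lowercase every token, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text):
    """Run the three stages in the order an analyzer applies them."""
    return token_filters(tokenize(char_filter(text)))


tokens = analyze("I'm a :) person, and you?")
print(tokens)
```

The order matters: the character filter has to run first, otherwise the tokenizer would split `:)` away before it could be mapped to `_happy_`.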
Search
```
# Search all documents
GET /order_detail/default/_search?pretty

# Response
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "2", "_score": 1,
        "_source": { "name": "佳洁士牙膏", "desc": "佳洁士有效防蛀牙", "price": 25, "producer": "佳洁士", "tags": [ "防蛀" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "1", "_score": 1,
        "_source": { "name": "高露洁变态版牙膏", "desc": "高露洁美白防蛀牙", "price": 30, "producer": "高露洁", "tags": [ "美白", "防蛀" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "3", "_score": 1,
        "_source": { "name": "中华牙膏", "desc": "中华牙膏草本植物", "price": 40, "producer": "中华", "tags": [ "清新" ] }
      }
    ]
  }
}
```
The fields in the response mean the following:
- took: how long the search took, in milliseconds
- timed_out: whether the search timed out; here it did not
- _shards: the index is split into 5 shards, so the search request goes to every shard, and each shard can be served by its primary or by any one of its replicas
- hits.total: the number of matching documents, 3 here
- hits.max_score: the highest relevance score among the matches; a score measures how well a document matches the search, and the more relevant the document, the higher the score
- hits.hits: the detailed data of the matching documents
```
# Build the same query with the JSON query DSL
GET /order_detail/default/_search?pretty
{
  "query": { "match_all": {} }
}

# Find products whose name contains "牙膏" (toothpaste), sorted by price in descending order
GET /order_detail/default/_search?pretty
{
  "query": { "match": { "name": "牙膏" } },
  "sort": [ { "price": "desc" } ]
}

# Paginate: there are 3 products in total; with 1 product per page, page 2 returns the second product
GET /order_detail/default/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}

# Return only the name and price of each product
# (the request-body form is better suited to production use, since it can express complex queries)
GET /order_detail/default/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "price"]
}

# Response
{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      { "_index": "order_detail", "_type": "default", "_id": "2", "_score": 1, "_source": { "price": 25, "name": "佳洁士牙膏" } },
      { "_index": "order_detail", "_type": "default", "_id": "1", "_score": 1, "_source": { "price": 30, "name": "高露洁变态版牙膏" } },
      { "_index": "order_detail", "_type": "default", "_id": "3", "_score": 1, "_source": { "price": 40, "name": "中华牙膏" } }
    ]
  }
}

# Find products whose name contains "牙膏" and whose price is greater than 25
# filter: only filters out the documents that satisfy the condition; it computes no relevance score and has no effect on relevance
# query: computes the relevance of each document to the search condition and sorts by that relevance
GET /order_detail/default/_search
{
  "query": {
    "bool": {
      "must": { "match": { "name": "牙膏" } },
      "filter": { "range": { "price": { "gt": 25 } } }
    }
  }
}

# Phrase search
GET /order_detail/_search
{
  "query": { "match_phrase": { "name": "中华牙膏草本植物" } }
}

# Response
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1.1899844,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "4", "_score": 1.1899844,
        "_source": { "name": "中华牙膏草本植物精华", "desc": "中华牙膏草本植物", "price": 15, "producer": "中华", "tags": [ "清新" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "5", "_score": 0.8574782,
        "_source": { "name": "中华牙膏草本植物精华清新", "desc": "中华牙膏草本植物", "price": 11, "producer": "中华", "tags": [ "清新" ] }
      }
    ]
  }
}

# slop: with a large enough slop value, the terms may appear in any order
GET /order_detail/_search
{
  "query": {
    "match_phrase": {
      "name": { "query": "中华牙膏清新", "slop": 50 }
    }
  }
}

# Response
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.26850328,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "5", "_score": 0.26850328,
        "_source": { "name": "中华牙膏草本植物精华清新", "desc": "中华牙膏草本植物", "price": 11, "producer": "中华", "tags": [ "清新" ] }
      }
    ]
  }
}

# Highlight the matched terms
GET /order_detail/_search
{
  "query": { "match": { "name": "中华牙膏" } },
  "highlight": { "fields": { "name": {} } }
}

# Response
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 5,
    "max_score": 0.6641487,
    "hits": [
      {
        "_index": "order_detail", "_type": "default", "_id": "4", "_score": 0.6641487,
        "_source": { "name": "中华牙膏草本植物精华", "desc": "中华牙膏草本植物", "price": 15, "producer": "中华", "tags": [ "清新" ] },
        "highlight": { "name": [ "<em>中华</em><em>牙膏</em>草本植物精华" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "5", "_score": 0.5716521,
        "_source": { "name": "中华牙膏草本植物精华清新", "desc": "中华牙膏草本植物", "price": 11, "producer": "中华", "tags": [ "清新" ] },
        "highlight": { "name": [ "<em>中华</em><em>牙膏</em>草本植物精华清新" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "3", "_score": 0.51623213,
        "_source": { "name": "中华牙膏", "desc": "中华牙膏草本植物", "price": 40, "producer": "中华", "tags": [ "清新" ] },
        "highlight": { "name": [ "<em>中华</em><em>牙膏</em>" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "1", "_score": 0.2824934,
        "_source": { "name": "高露洁变态版牙膏", "desc": "高露洁美白防蛀牙", "price": 30, "producer": "高露洁", "tags": [ "美白", "防蛀" ] },
        "highlight": { "name": [ "高露洁变态版<em>牙膏</em>" ] }
      },
      {
        "_index": "order_detail", "_type": "default", "_id": "2", "_score": 0.21380994,
        "_source": { "name": "佳洁士牙膏", "desc": "佳洁士有效防蛀牙", "price": 25, "producer": "佳洁士", "tags": [ "防蛀" ] },
        "highlight": { "name": [ "佳洁士<em>牙膏</em>" ] }
      }
    ]
  }
}
```
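The `from`/`size` arithmetic in the pagination request can be sketched as a small helper. This is only an illustration: `build_page_query` is a hypothetical name, and it assumes 1-based page numbers, which is how the example (page 2, one item per page, `from` = 1) reads.

```python
def build_page_query(page, size, query=None):
    """Build a search request body for a 1-based page number.

    `from` is the number of hits to skip, so page N starts at (N - 1) * size.
    """
    if page < 1 or size < 1:
        raise ValueError("page and size must both be at least 1")
    return {
        "query": query if query is not None else {"match_all": {}},
        "from": (page - 1) * size,
        "size": size,
    }


# Page 2 with one product per page skips exactly one hit.
body = build_page_query(page=2, size=1)
print(body)
```

The resulting dictionary can be serialized to JSON and sent as the body of a `_search` request.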
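Internally, every Elasticsearch search involves relevance scoring, so the rest of this article describes how Elasticsearch scores documents.
Relevance scoring in Elasticsearch
Elasticsearch (or rather Lucene) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows from term frequency/inverse document frequency (TF/IDF) and the vector space model, and adds more modern features such as a coordination factor, field length normalization, and term or query clause boosting. It describes the classic TF/IDF similarity; since version 5.0, Elasticsearch uses BM25 as its default similarity.
Boolean model
The Boolean model is also known as the exact-match model. Every retrieved document matches the query exactly, and documents that do not satisfy it are not retrieved. All matching documents are treated as equally relevant, so no scoring is needed and the results are unordered.
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find matching documents. The following query: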
full AND text AND search AND (elasticsearch OR lucene)
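returns every document that contains all of the terms full, text, and search, and either elasticsearch or lucene.
This process is simple and fast. Its job is to exclude every document that cannot possibly match the query.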
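The Boolean query above maps naturally onto set operations over an inverted index. The sketch below uses four made-up documents and hypothetical helper names, and it tokenizes by whitespace only; it illustrates the model, not how Lucene implements it.

```python
# Toy corpus (invented for illustration).
docs = {
    1: "full text search with elasticsearch",
    2: "full text search with lucene",
    3: "full text indexing",
    4: "search with elasticsearch",
}


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_match(index):
    """Evaluate: full AND text AND search AND (elasticsearch OR lucene)."""
    def postings(term):
        return index.get(term, set())

    # AND is set intersection, OR is set union.
    return (
        postings("full")
        & postings("text")
        & postings("search")
        & (postings("elasticsearch") | postings("lucene"))
    )


matching = boolean_match(build_inverted_index(docs))
print(sorted(matching))
```

The result is a plain set: documents either match or they do not, which is why the Boolean model alone cannot order the results.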
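Vector space model
In the vector space model, both documents and queries are represented as vectors in a high-dimensional space, where each term is one dimension of the vector. The relevance of a document to a query is computed from the distance between the two vectors, usually measured with cosine similarity.
Imagine a query for "happy hippopotamus". The common word happy has a low weight, while the uncommon word hippopotamus has a high weight. Assume happy has a weight of 2 and hippopotamus a weight of 5. We can plot this two-dimensional vector, [2,5], as a line in a coordinate system that starts at (0,0) and ends at (2,5).
Now, imagine we have three documents:
- I am happy in summer.
- After Christmas I’m a hippopotamus.
- The happy hippopotamus helped Harry.
We can create a vector for each document that holds the weight of each query term, happy and hippopotamus, and plot these vectors in the same coordinate system.
Vectors can be compared: measuring the angle between the query vector and a document vector gives the relevance of that document. The angle between document 1 and the query is the largest, so its relevance is low; the angle between document 2 and the query is smaller, so it is more relevant; document 3 lines up with the query exactly, so it is a perfect match.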
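The angle comparison can be checked numerically with cosine similarity. The sketch below assumes each document vector simply carries the query weight of every query term it contains, so the three documents become [2,0], [0,5], and [2,5]; those vectors are an assumption derived from the example, not values given in the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Dimensions: [weight of "happy", weight of "hippopotamus"]
query = [2, 5]
documents = {
    "doc1": [2, 0],  # I am happy in summer.
    "doc2": [0, 5],  # After Christmas I'm a hippopotamus.
    "doc3": [2, 5],  # The happy hippopotamus helped Harry.
}

similarities = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
ranking = sorted(similarities, key=similarities.get, reverse=True)
print(ranking)
```

A larger cosine means a smaller angle, so sorting by similarity in descending order reproduces the ordering described above.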
Practical scoring function
```
score(q,d)  =                  #1
      queryNorm(q)             #2
    · coord(q,d)               #3
    · ∑ (                      #4
          tf(t in d)           #5
        · idf(t)²              #6
        · t.getBoost()         #7
        · norm(t,d)            #8
      ) (t in q)               #9
```
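The parts of the formula are:
- #1 score(q,d) is the relevance score of document d for query q
- #2 queryNorm(q) is the query normalization factor
- #3 coord(q,d) is the coordination factor
- #4~#9 the sum of the weights of each term t in query q for document d
- #5 tf(t in d) is the term frequency of term t in document d
- #6 idf(t) is the inverse document frequency of term t
- #7 t.getBoost() is the boost that has been applied to the query
- #8 norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any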
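To see how the factors combine, here is a rough Python sketch of the formula. The text does not define the individual factors, so the definitions used here are assumptions taken from Lucene's classic TF/IDF similarity: tf is the square root of the term count, idf is 1 + ln(numDocs / (docFreq + 1)), norm is 1 / sqrt(number of terms in the field), queryNorm is 1 / sqrt(sum of squared term weights), and coord is the fraction of query terms that match. Index-time boosts are ignored, and the query and documents are assumed to be non-empty.

```python
import math


def tf(term, doc_tokens):
    """Term frequency: square root of how often the term occurs in the document."""
    return math.sqrt(doc_tokens.count(term))


def idf(term, corpus):
    """Inverse document frequency: rarer terms get a higher weight."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))


def norm(doc_tokens):
    """Field-length norm: shorter fields weigh more (no index-time boost here)."""
    return 1.0 / math.sqrt(len(doc_tokens))


def score(query_terms, doc_tokens, corpus, boosts=None):
    """score(q,d) = queryNorm(q) * coord(q,d) * sum over matching t of tf * idf^2 * boost * norm."""
    boosts = boosts or {}
    sum_of_squares = sum((idf(t, corpus) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_of_squares)
    matched = [t for t in query_terms if t in doc_tokens]
    coord = len(matched) / len(query_terms)
    total = sum(
        tf(t, doc_tokens) * idf(t, corpus) ** 2 * boosts.get(t, 1.0) * norm(doc_tokens)
        for t in matched
    )
    return query_norm * coord * total


corpus = [
    ["i", "am", "happy", "in", "summer"],
    ["after", "christmas", "i'm", "a", "hippopotamus"],
    ["the", "happy", "hippopotamus", "helped", "harry"],
]
query = ["happy", "hippopotamus"]
scores = [score(query, doc, corpus) for doc in corpus]
print(scores)
```

In this tiny corpus both query terms appear in two of the three documents, so they get the same idf; the third document wins because it matches both terms, which raises both the sum and the coordination factor.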
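Summary
- Covered: search, basic Elasticsearch concepts, index structure, document create/read/update/delete, analyzers, full-text search, phrase search, highlighting, and more
- Not covered: parent-child relationships, aggregations, data modeling, real-time data analysis, autocompletion, advanced search (multi_match, boost, etc.), distributed architecture, cluster maintenance and upgrades, Elasticsearch SQL (which supports REST, JDBC, and the command line), and more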