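Author: 刘晓国
In my earlier article “开始使用 Elasticsearch (2)” (Getting Started with Elasticsearch (2)), I walked through many Elasticsearch query examples. In this article I work through more of them, and I hope developers find them useful. As readers of my earlier articles may have noticed, I prefer to demonstrate with a handful of documents rather than a large dataset. With only a few documents we can see what a query really does, instead of verifying matches one by one across many documents.
Preparing the documents
In today's example, we imagine an index of books. It contains these fields: title, authors, summary, publish_date, num_reviews, publisher, and location, which records where the book is held (the books are spread across different libraries). Because the index contains location information, we must first define a mapping for the index bookdb_index, so that location has the geo_point data type we need when the data is imported: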
PUT bookdb_index
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}
The command above creates an index named bookdb_index. Next, we use the bulk API to import our data:
POST /bookdb_index/_bulk
{"index":{"_id":1}}
{"title":"Elasticsearch: The Definitive Guide","authors":["clinton gormley","zachary tong"],"summary":"A distibuted real-time search and analytics engine","publish_date":"2015-02-07","num_reviews":20,"publisher":"oreilly","location":{"lat":"39.970718","lon":"116.325747"}}
{"index":{"_id":2}}
{"title":"Taming Text: How to Find, Organize, and Manipulate It","authors":["grant ingersoll","thomas morton","drew farris"],"summary":"organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization","publish_date":"2013-01-24","num_reviews":12,"publisher":"manning","location":{"lat":"39.904313","lon":"116.412754"}}
{"index":{"_id":3}}
{"title":"Elasticsearch in Action","authors":["radu gheorge","matthew lee hinman","roy russo"],"summary":"build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms","publish_date":"2015-12-03","num_reviews":18,"publisher":"manning","location":{"lat":"39.893801","lon":"116.408986"}}
{"index":{"_id":4}}
{"title":"Solr in Action","authors":["trey grainger","timothy potter"],"summary":"Comprehensive guide to implementing a scalable search engine using Apache Solr","publish_date":"2014-04-05","num_reviews":23,"publisher":"manning","location":{"lat":"39.718256","lon":"116.367910"}}
We can view the resulting mapping of bookdb_index as follows:
GET bookdb_index/_mapping
The command above returns:
{ "bookdb_index" : { "mappings" : { "properties" : { "authors" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "location" : { "type" : "geo_point" }, "num_reviews" : { "type" : "long" }, "publish_date" : { "type" : "date" }, "publisher" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "summary" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "title" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } } } } } }
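From the output above, we can see that Elasticsearch infers data types automatically from the data we index, except for location, which we explicitly defined as geo_point. Here publish_date was automatically recognized as date, and num_reviews as long. If these are not the types we want, for example if we want to save storage space, we can set num_reviews to integer. To do so, we need to declare the field explicitly when defining the mapping.
Query examples
Basic match query
There are two ways to run a basic full-text (match) query: the Search Lite API, which expects all search parameters to be passed as part of the URL, or a full JSON request body, which lets you use the complete Elasticsearch DSL (Domain Specific Language).
Here is a basic match query that searches all fields for the string “guide”: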
GET /bookdb_index/_search?q=guide
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.3278645, "_source" : { "title" : "Solr in Action", "authors" : [ "trey grainger", "timothy potter" ], "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "publish_date" : "2014-04-05", "num_reviews" : 23, "publisher" : "manning", "location" : { "lat" : "39.718256", "lon" : "116.367910" } } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 1.2871116, "_source" : { "title" : "Elasticsearch: The Definitive Guide", "authors" : [ "clinton gormley", "zachary tong" ], "summary" : "A distibuted real-time search and analytics engine", "publish_date" : "2015-02-07", "num_reviews" : 20, "publisher" : "oreilly", "location" : { "lat" : "39.970718", "lon" : "116.325747" } } } ]
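The search above runs across all fields. Both documents that contain the word guide in their summary or title field are returned.
The full version of this query is shown below and produces the same results as the Search Lite version above.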
GET bookdb_index/_search
{
  "query": {
    "multi_match": {
      "query": "guide",
      "fields": ["title", "authors", "summary", "publisher"]
    }
  }
}
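The field list of the query above leaves out the date, geo_point, and numeric fields, because they do not support full-text search. The results are the same as those shown above.
The multi_match keyword is used in place of the match keyword as a convenient way to run the same query against multiple fields. The fields property specifies which fields to query; in this case, we want to query all the text fields in the document. In general, multi_match is not very efficient. If you often need to query multiple fields, an alternative is to use copy_to. See my other article “如何使用 Elasticsearch 中的 copy_to 来提高搜索效率” (How to use copy_to in Elasticsearch to improve search efficiency).
The Search Lite API also lets you specify which field to search. For example, to search for books with “in Action” in the title field: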
GET /bookdb_index/_search?q=title:in action
The result of the above query:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 1.6323127, "_source" : { "title" : "Elasticsearch in Action", "authors" : [ "radu gheorge", "matthew lee hinman", "roy russo" ], "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "publish_date" : "2015-12-03", "num_reviews" : 18, "publisher" : "manning", "location" : { "lat" : "39.893801", "lon" : "116.408986" } } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.6323127, "_source" : { "title" : "Solr in Action", "authors" : [ "trey grainger", "timothy potter" ], "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "publish_date" : "2014-04-05", "num_reviews" : 23, "publisher" : "manning", "location" : { "lat" : "39.718256", "lon" : "116.367910" } } } ]
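However, the full DSL gives you much more flexibility for building more complex queries (as we will see later) and for specifying how results are returned. In the example below, we specify the number of results we want returned, the offset to start from (useful for pagination), the document fields we want returned, and term highlighting. Note that we use a “match” query rather than a “multi_match” query, because we only care about searching the title field.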
POST /bookdb_index/_search
{
  "query": {
    "match": {
      "title": "in action"
    }
  },
  "size": 2,
  "from": 0,
  "_source": ["title", "summary", "publish_date"],
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
The command above returns:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 1.6323127, "_source" : { "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" }, "highlight" : { "title" : [ "Elasticsearch <em>in</em> <em>Action</em>" ] } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.6323127, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "title" : "Solr in Action", "publish_date" : "2014-04-05" }, "highlight" : { "title" : [ "Solr <em>in</em> <em>Action</em>" ] } } ]
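In the highlight section above, we can see fragments wrapped in <em> tags. These highlighted fragments can be displayed directly on a page, and they mark the terms that were matched.
Note: For multi-word queries, the match query lets you specify the and operator instead of the default or operator. You can also specify the minimum_should_match option to tune the relevance of the returned results. Details can be found in the Elasticsearch guide. For example, the following query finds titles that contain all of in, action, and Elasticsearch: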
POST /bookdb_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "in action Elasticsearch",
        "operator": "and"
      }
    }
  },
  "size": 2,
  "from": 0,
  "_source": ["title", "summary", "publish_date"],
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
The above query matches only one document:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 2.448469, "_source" : { "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" }, "highlight" : { "title" : [ "<em>Elasticsearch</em> <em>in</em> <em>Action</em>" ] } } ]
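The match query supports the minimum_should_match parameter, which lets you specify how many terms must match for a document to be considered relevant. Although you can specify an absolute number of terms, it usually makes sense to specify a percentage, because you cannot control how many words a user enters: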
POST /bookdb_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "in action Elasticsearch",
        "operator": "or",
        "minimum_should_match": "90%"
      }
    }
  },
  "size": 2,
  "from": 0,
  "_source": ["title", "summary", "publish_date"],
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
The above query returns:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 2.448469, "_source" : { "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" }, "highlight" : { "title" : [ "<em>Elasticsearch</em> <em>in</em> <em>Action</em>" ] } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.6323127, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "title" : "Solr in Action", "publish_date" : "2014-04-05" }, "highlight" : { "title" : [ "Solr <em>in</em> <em>Action</em>" ] } } ]
If minimum_should_match is set to 100%, only one document is returned.
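Why does 90% still let a title with only two of the three terms through? Elasticsearch rounds the number computed from a percentage down. The following is a minimal sketch of that arithmetic, not the engine's implementation; the function name required_matches is made up for illustration, and it only handles non-negative integers and positive percentages.

```python
def required_matches(num_terms: int, minimum_should_match: str) -> int:
    """Return how many query terms must match (simplified model).

    Only plain non-negative integers ("2") and positive percentages ("90%")
    are handled here; Elasticsearch supports more forms than this sketch.
    """
    if minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        # The number computed from a percentage is rounded down.
        return (num_terms * percent) // 100
    return int(minimum_should_match)


terms = "in action Elasticsearch".split()
print(required_matches(len(terms), "90%"))   # 3 * 0.9 = 2.7, rounded down
print(required_matches(len(terms), "100%"))  # every term is required
```

With three terms, 90% works out to two required terms, so “Solr in Action” (which matches in and action) still qualifies, while 100% requires all three.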
Boosting
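Because we are searching across multiple fields, we may want to boost the score of a particular field. In the contrived example below, we boost the score of the summary field by a factor of 3 to increase its importance, which in turn increases the relevance of the document with _id 4.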
POST /bookdb_index/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch guide",
      "fields": ["title", "summary^3"]
    }
  },
  "_source": ["title", "summary", "publish_date"]
}
The result is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 3.9835935, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "title" : "Solr in Action", "publish_date" : "2014-04-05" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 3.1001682, "_source" : { "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 2.0281231, "_source" : { "summary" : "A distibuted real-time search and analytics engine", "title" : "Elasticsearch: The Definitive Guide", "publish_date" : "2015-02-07" } } ]
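Note: Boosting does not simply mean that the computed score is multiplied by the boost factor. The boost value actually applied goes through normalization and some internal optimizations. For more on how boosting works, see the Elasticsearch guide.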
Bool Query
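The AND/OR/NOT operators can be used to fine-tune our search queries to return more relevant or more specific results. In the search API this is implemented as the bool query. The bool query accepts a must parameter (equivalent to AND), a must_not parameter (equivalent to NOT), and a should parameter (equivalent to OR). For example, suppose I want to search for books whose title contains the word “Elasticsearch” or “Solr”, and whose author is “clinton gormley” but not “radu gheorge”: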
POST /bookdb_index/_search
{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            { "match": { "title": "Elasticsearch" } },
            { "match": { "title": "Solr" } }
          ],
          "must": {
            "match": { "authors": "clinton gormely" }
          }
        }
      },
      "must_not": {
        "match": { "authors": "radu gheorge" }
      }
    }
  }
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 2.0749094, "_source" : { "title" : "Elasticsearch: The Definitive Guide", "authors" : [ "clinton gormley", "zachary tong" ], "summary" : "A distibuted real-time search and analytics engine", "publish_date" : "2015-02-07", "num_reviews" : 20, "publisher" : "oreilly", "location" : { "lat" : "39.970718", "lon" : "116.325747" } } } ]
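Note: As you can see, a bool query can wrap any other query type, including other bool queries, to create arbitrarily complex or deeply nested queries. Keep in mind that when a bool query contains a must clause alongside should clauses, the should clauses are optional by default and only contribute to the score; set minimum_should_match to 1 if at least one of them has to match.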
Fuzzy Queries
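Fuzzy matching can be enabled on match and multi_match queries to catch spelling errors. The degree of fuzziness is specified as a Levenshtein distance from the original word, that is, the number of single-character changes needed to make one string identical to the other.
The edit distance is the number of one-character changes required to turn one term into another. These changes can include:
Changing a character (box→fox)
Removing a character (black→lack)
Inserting a character (sic→sick)
Transposing two adjacent characters (act→cat)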
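The four kinds of change listed above can be counted with a short dynamic-programming routine. This is a sketch of the optimal string alignment variant of the Damerau-Levenshtein distance, written for illustration only; it is not the code Elasticsearch runs, and edit_distance is a name chosen here.

```python
def edit_distance(a: str, b: str) -> int:
    """Count single-character edits (change, remove, insert, adjacent swap)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # remove every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # remove a character
                d[i][j - 1] + 1,         # insert a character
                d[i - 1][j - 1] + cost,  # change a character (or keep it)
            )
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for before, after in [("box", "fox"), ("black", "lack"), ("sic", "sick"), ("act", "cat")]:
    print(before, after, edit_distance(before, after))
print(edit_distance("comprihensiv", "comprehensive"))
```

Each of the four example pairs is one edit apart. The misspelling “comprihensiv” used in the query that follows is two edits away from “comprehensive” (one changed character and one missing character).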
POST /bookdb_index/_search
{
  "query": {
    "multi_match": {
      "query": "comprihensiv guide",
      "fields": ["title", "summary"],
      "fuzziness": "AUTO"
    }
  },
  "_source": ["title", "summary", "publish_date"],
  "size": 1
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 2.4344182, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "title" : "Solr in Action", "publish_date" : "2014-04-05" } } ]
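Note: Instead of “AUTO”, you can specify the number 0, 1, or 2 to indicate the maximum number of edits that may be made to the string to find a match. The benefit of “AUTO” is that it takes the length of the string into account. For a string only 3 characters long, allowing a fuzziness of 2 would result in poor search performance. It is therefore recommended to stick with “AUTO” in most cases.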
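As a rough sketch of what “taking the length into account” means: with the default AUTO thresholds, terms of up to 2 characters must match exactly, terms of 3 to 5 characters may differ by one edit, and longer terms may differ by two. The helper name auto_fuzziness below is hypothetical.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edits allowed for a term under the default AUTO thresholds."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1  # 3 to 5 characters: one edit allowed
    return 2      # longer terms: two edits allowed


for term in ["en", "guide", "comprihensiv"]:
    print(term, auto_fuzziness(term))
```

Under these thresholds the 12-character term “comprihensiv” is allowed two edits, which is enough to reach “comprehensive”.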
Wildcard Query
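A wildcard query lets you specify a pattern to match instead of a whole term. ? matches any single character and * matches zero or more characters. For example, to find all records with an author whose name begins with the letter “t”: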
POST /bookdb_index/_search
{
  "query": {
    "wildcard": {
      "authors": "t*"
    }
  },
  "_source": ["title", "authors"],
  "highlight": {
    "fields": {
      "authors": {}
    }
  }
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : { "title" : "Elasticsearch: The Definitive Guide", "authors" : [ "clinton gormley", "zachary tong" ] }, "highlight" : { "authors" : [ "zachary <em>tong</em>" ] } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : 1.0, "_source" : { "title" : "Taming Text: How to Find, Organize, and Manipulate It", "authors" : [ "grant ingersoll", "thomas morton", "drew farris" ] }, "highlight" : { "authors" : [ "<em>thomas</em> morton" ] } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.0, "_source" : { "title" : "Solr in Action", "authors" : [ "trey grainger", "timothy potter" ] }, "highlight" : { "authors" : [ "<em>trey</em> grainger", "<em>timothy</em> potter" ] } } ]
Regexp Query
A regexp query lets you specify more complex patterns than a wildcard query.
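Both query types are matched against individual indexed terms rather than the whole field value, and the pattern has to cover the entire term. The sketch below imitates this in plain Python on the authors of the four sample books. It is an approximation: lowercasing and splitting on whitespace stands in for the analyzer, fnmatchcase stands in for wildcard matching (it also understands [ ] sets, which the wildcard query does not), and Python's regular expression syntax differs from Lucene's in general, although the pattern t[a-z]*y reads the same in both.

```python
import re
from fnmatch import fnmatchcase

# Authors of the four sample documents, keyed by _id.
AUTHORS = {
    "1": ["clinton gormley", "zachary tong"],
    "2": ["grant ingersoll", "thomas morton", "drew farris"],
    "3": ["radu gheorge", "matthew lee hinman", "roy russo"],
    "4": ["trey grainger", "timothy potter"],
}


def terms(names):
    """Simplified analyzer: lowercase each name and split it on whitespace."""
    return [term for name in names for term in name.lower().split()]


def wildcard_hits(pattern):
    """Map each matching _id to the terms that match the wildcard pattern."""
    hits = {}
    for doc_id, names in AUTHORS.items():
        matched = [t for t in terms(names) if fnmatchcase(t, pattern)]
        if matched:
            hits[doc_id] = matched
    return hits


def regexp_hits(pattern):
    """Map each matching _id to the terms fully matched by the regex."""
    compiled = re.compile(pattern)
    hits = {}
    for doc_id, names in AUTHORS.items():
        matched = [t for t in terms(names) if compiled.fullmatch(t)]
        if matched:
            hits[doc_id] = matched
    return hits


print(wildcard_hits("t*"))
print(regexp_hits("t[a-z]*y"))
```

The wildcard t* picks up tong, thomas, trey, and timothy, while the regular expression additionally requires the term to end in y, leaving only trey and timothy. The request below runs the same regular expression as a regexp query.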
POST /bookdb_index/_search
{
  "query": {
    "regexp": {
      "authors": "t[a-z]*y"
    }
  },
  "_source": ["title", "authors"],
  "highlight": {
    "fields": {
      "authors": {}
    }
  }
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.0, "_source" : { "title" : "Solr in Action", "authors" : [ "trey grainger", "timothy potter" ] }, "highlight" : { "authors" : [ "<em>trey</em> grainger", "<em>timothy</em> potter" ] } } ]
Match Phrase Query
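The match phrase query requires that all the terms in the query string be present in the document, in the order specified in the query string, and close to each other. By default the terms must be exactly adjacent, but you can specify a slop value that indicates how far apart the terms may be while the document is still considered a match.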
POST /bookdb_index/_search
{
  "query": {
    "multi_match": {
      "query": "search engine",
      "fields": ["title", "summary"],
      "type": "phrase",
      "slop": 3
    }
  },
  "_source": ["title", "summary", "publish_date"]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 0.88067603, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "title" : "Solr in Action", "publish_date" : "2014-04-05" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 0.5142931, "_source" : { "summary" : "A distibuted real-time search and analytics engine", "title" : "Elasticsearch: The Definitive Guide", "publish_date" : "2015-02-07" } } ]
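Note: In the example above, for a non-phrase query the document with _id 1 would normally score higher and appear ahead of the document with _id 4, because its field is shorter. As a phrase query, however, the proximity of the terms is taken into account, so the document with _id 4 scores better.
Note: Also, if the slop parameter is reduced to 1, the document with _id 1 no longer appears in the result set.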
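To see why, count the words that sit between “search” and “engine” in each summary. The sketch below covers only the simple case of two terms appearing in query order, where the slop needed equals the number of positions in between; the regex tokenizer is a stand-in for the standard analyzer, and min_slop_in_order is a name invented for this example.

```python
import re

SUMMARY_1 = "A distibuted real-time search and analytics engine"
SUMMARY_4 = "Comprehensive guide to implementing a scalable search engine using Apache Solr"


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def min_slop_in_order(text, first, second):
    """Smallest slop that lets `first ... second` match, or None if impossible.

    Only handles two terms that occur in query order.
    """
    tokens = tokenize(text)
    best = None
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] == second:
                gap = j - i - 1  # positions between the two terms
                best = gap if best is None else min(best, gap)
    return best


print(min_slop_in_order(SUMMARY_4, "search", "engine"))  # terms are adjacent
print(min_slop_in_order(SUMMARY_1, "search", "engine"))  # two words in between
```

Document 4 needs no slop at all, while document 1 has “and analytics” between the two terms and therefore needs a slop of at least 2, which is why it drops out when slop is lowered to 1.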
Match Phrase Prefix
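The match_phrase_prefix query provides search-as-you-type or autocomplete at query time, without preparing your data in any way. Like the match_phrase query, it accepts a slop parameter to make word order and relative position somewhat less strict. It also accepts a max_expansions parameter to limit the number of terms matched, which reduces resource usage.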
POST /bookdb_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "summary": {
        "query": "search en",
        "slop": 3,
        "max_expansions": 10
      }
    }
  },
  "_source": ["title", "summary", "publish_date"]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 0.88067603, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "title" : "Solr in Action", "publish_date" : "2014-04-05" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 0.5142931, "_source" : { "summary" : "A distibuted real-time search and analytics engine", "title" : "Elasticsearch: The Definitive Guide", "publish_date" : "2015-02-07" } } ]
Note: Query-time search-as-you-type has a performance cost. A better solution is index-time search-as-you-type. See the Completion Suggester API or the Edge-Ngram filter for more information.
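The idea behind the index-time approach can be sketched in a few lines: every term is expanded into its leading substrings when it is indexed, so a partially typed word becomes an ordinary exact term lookup. The function below only illustrates what edge n-grams are; the parameter names min_gram and max_gram mirror the settings of the edge_ngram token filter, but the values used are arbitrary.

```python
def edge_ngrams(term: str, min_gram: int, max_gram: int) -> list:
    """Return the prefixes of `term` whose length is in [min_gram, max_gram]."""
    longest = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("engine", min_gram=2, max_gram=5)
print(grams)
# A user who has typed "en" so far now hits an indexed term directly.
print("en" in grams)
```

The trade-off is a larger index in exchange for cheaper queries, since the prefix expansion is paid once at index time rather than on every keystroke.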
Query String
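The query_string query provides a way to run multi_match queries, bool queries, boosting, fuzzy matching, wildcards, regexp, and range queries in a concise shorthand syntax. In the following example, we run a fuzzy search for the terms “search algorithm” where one of the book's authors is “grant ingersoll” or “tom morton”. We search all fields but apply a boost of 2 to the summary field.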
POST /bookdb_index/_search
{
  "query": {
    "query_string": {
      "query": "(saerch~1 algorithm~1) AND (grant ingersoll) OR (tom morton)",
      "fields": ["title", "authors", "summary^2"]
    }
  },
  "_source": ["title", "summary", "authors"],
  "highlight": {
    "fields": {
      "summary": {}
    }
  }
}
The result of the above query is:
"hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 3.5710216, "hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : 3.5710216, "_source" : { "summary" : "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization", "title" : "Taming Text: How to Find, Organize, and Manipulate It", "authors" : [ "grant ingersoll", "thomas morton", "drew farris" ] }, "highlight" : { "summary" : [ "organize text using approaches such as full-text <em>search</em>, proper name recognition, clustering, tagging" ] } } ]
Simple Query String
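The simple_query_string query is a version of the query_string query that is better suited to a single search box exposed to users, because it replaces AND/OR/NOT with +/|/- respectively, and because it discards the invalid parts of a query instead of throwing an exception when the user makes a mistake.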
POST /bookdb_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "(saerch~1 algorithm~1) + (grant ingersoll) | (tom morton)",
      "fields": ["title", "authors", "summary^2"]
    }
  },
  "_source": ["title", "summary", "authors"],
  "highlight": {
    "fields": {
      "summary": {}
    }
  }
}
The result of this query is the same as that of the earlier Query String example; only the syntax differs.
Term/Terms Query
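The examples above are all full-text searches. Sometimes we are more interested in structured search, where we want to find exact matches and return the results. The term and terms queries help us here. In the example below, we search the index for all books published by Manning.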
POST /bookdb_index/_search
{
  "query": {
    "term": {
      "publisher": "manning"
    }
  },
  "_source": ["title", "publish_date", "publisher"]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : 0.35667494, "_source" : { "publisher" : "manning", "title" : "Taming Text: How to Find, Organize, and Manipulate It", "publish_date" : "2013-01-24" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 0.35667494, "_source" : { "publisher" : "manning", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 0.35667494, "_source" : { "publisher" : "manning", "title" : "Solr in Action", "publish_date" : "2014-04-05" } } ]
Multiple terms can be specified by using the terms keyword instead and passing in an array of search terms.
POST /bookdb_index/_search
{
  "query": {
    "terms": {
      "publisher": ["oreilly", "packt"]
    }
  }
}
The result of the above search is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : { "title" : "Elasticsearch: The Definitive Guide", "authors" : [ "clinton gormley", "zachary tong" ], "summary" : "A distibuted real-time search and analytics engine", "publish_date" : "2015-02-07", "num_reviews" : 20, "publisher" : "oreilly", "location" : { "lat" : "39.970718", "lon" : "116.325747" } } } ]
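In the term matches above, the scores within each result set are identical, because these queries require an exact match on the term rather than grading how well the text matches.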
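One consequence of exact matching is worth spelling out: a term query is not analyzed, whereas the publisher text field was analyzed (and lowercased) at index time. The toy inverted index below is only a sketch of that behavior; the analyzer is reduced to lowercasing plus word splitting, and the helper names are invented for the example.

```python
import re

# publisher values of the four sample documents, keyed by _id
PUBLISHERS = {"1": "oreilly", "2": "manning", "3": "manning", "4": "manning"}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(field_values):
    """Build a tiny inverted index: term -> set of document ids."""
    index = {}
    for doc_id, value in field_values.items():
        for term in analyze(value):
            index.setdefault(term, set()).add(doc_id)
    return index


INDEX = build_index(PUBLISHERS)


def term_query(term):
    """Exact lookup: the search term is used as-is, without analysis."""
    return sorted(INDEX.get(term, set()))


def terms_query(terms):
    """Exact lookup for several terms; a document matching any of them is returned."""
    found = set()
    for term in terms:
        found |= INDEX.get(term, set())
    return sorted(found)


def match_query(text):
    """Full-text lookup: the search text is analyzed before the lookup."""
    return terms_query(analyze(text))


print(term_query("manning"))             # exact term exists in the index
print(term_query("Manning"))             # capitalized form was never indexed
print(match_query("Manning"))            # analysis lowercases it first
print(terms_query(["oreilly", "packt"]))
```

In this model, searching for the capitalized “Manning” with a term query finds nothing, while a match query finds all three Manning books because the query text is lowercased first.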
Term Query - Sorted
Term query results (like any other query results) can easily be sorted. Multi-level sorting is also allowed.
POST /bookdb_index/_search
{
  "query": {
    "term": {
      "publisher": "manning"
    }
  },
  "_source": ["title", "publish_date", "publisher"],
  "sort": [
    { "publish_date": { "order": "desc" } }
  ]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : null, "_source" : { "publisher" : "manning", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" }, "sort" : [ 1449100800000 ] }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : null, "_source" : { "publisher" : "manning", "title" : "Solr in Action", "publish_date" : "2014-04-05" }, "sort" : [ 1396656000000 ] }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : null, "_source" : { "publisher" : "manning", "title" : "Taming Text: How to Find, Organize, and Manipulate It", "publish_date" : "2013-01-24" }, "sort" : [ 1358985600000 ] } ]
From the results above, we can see that the _score of the sorted documents is null; in other words, the score no longer matters.
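The numbers in each hit's sort array are the publish_date values expressed as milliseconds since the Unix epoch (UTC). A quick sketch that reproduces them and the resulting order, using only the three Manning books from the output above:

```python
from datetime import datetime, timezone

BOOKS = [
    {"_id": "2", "publish_date": "2013-01-24"},
    {"_id": "3", "publish_date": "2015-12-03"},
    {"_id": "4", "publish_date": "2014-04-05"},
]


def epoch_millis(date_str):
    """Convert a yyyy-MM-dd date (midnight UTC) to epoch milliseconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


# Equivalent of "sort": [{"publish_date": {"order": "desc"}}]
ordered = sorted(BOOKS, key=lambda b: epoch_millis(b["publish_date"]), reverse=True)
for book in ordered:
    print(book["_id"], epoch_millis(book["publish_date"]))
```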
Range Query
Another example of a structured query is the range query. In this example, we search for books published in 2015.
POST /bookdb_index/_search
{
  "query": {
    "range": {
      "publish_date": {
        "gte": "2015-01-01",
        "lte": "2015-12-31"
      }
    }
  },
  "_source": ["title", "publish_date", "publisher"]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : { "publisher" : "oreilly", "title" : "Elasticsearch: The Definitive Guide", "publish_date" : "2015-02-07" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 1.0, "_source" : { "publisher" : "manning", "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" } } ]
Note: Range queries work on date, numeric, and string fields.
Function Score: Field Value Factor
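In some cases, you may want the value of a specific field in the document to be factored into the relevance score. This is common when you want to boost a document's relevance based on its popularity. In our example, we want to boost the more popular books (judged by num_reviews). This can be done with the field_value_factor function score.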
POST /bookdb_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "search engine",
          "fields": ["title", "summary"]
        }
      },
      "field_value_factor": {
        "field": "num_reviews",
        "modifier": "log1p",
        "factor": 2
      }
    }
  },
  "_source": ["title", "summary", "publish_date", "num_reviews"]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 1.5694137, "_source" : { "summary" : "A distibuted real-time search and analytics engine", "num_reviews" : 20, "title" : "Elasticsearch: The Definitive Guide", "publish_date" : "2015-02-07" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.4725765, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "num_reviews" : 23, "title" : "Solr in Action", "publish_date" : "2014-04-05" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 0.1418166, "_source" : { "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "num_reviews" : 18, "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : 0.13297246, "_source" : { "summary" : "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization", "num_reviews" : 12, "title" : "Taming Text: How to Find, Organize, and Manipulate It", "publish_date" : "2013-01-24" } } ]
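Note 1: We could simply run a regular multi_match query and sort by the num_reviews field, but then we would lose the benefits of relevance scoring.
Note 2: There are many additional parameters that adjust how strongly the original relevance score is boosted, such as “modifier”, “factor”, and “boost_mode”. These are covered in detail in the Elasticsearch guide.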
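For the parameters used in this section, the arithmetic can be sketched as follows: the log1p modifier takes the common logarithm of 1 plus factor times the field value, and with the default boost_mode (multiply) that number is multiplied by the query score. The value 0.88067603 below is the score document 4 received in the Match Phrase output earlier; it is used here as a sample base score on the assumption that the multi_match query in this section scores that document the same way.

```python
import math


def field_value_factor(value, factor=2.0):
    """log1p modifier: log10(1 + factor * value)."""
    return math.log10(1 + factor * value)


# num_reviews of the four sample books, keyed by _id
NUM_REVIEWS = {"1": 20, "2": 12, "3": 18, "4": 23}

for doc_id, reviews in NUM_REVIEWS.items():
    print(doc_id, round(field_value_factor(reviews), 6))

# Default boost_mode is multiply: final score = query score * function value.
sample_query_score = 0.88067603  # assumed base score for document 4
print(sample_query_score * field_value_factor(NUM_REVIEWS["4"]))
```

Because of the logarithm, going from 12 to 23 reviews changes the multiplier only from about 1.40 to about 1.67, so popularity nudges the ranking without overwhelming text relevance.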
Function Score: Decay Functions
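Suppose that, instead of boosting progressively by the value of a field, you have an ideal value you want to target, and you want the boost factor to decay the further you move away from that value. This is typically useful for boosts based on latitude/longitude or on numeric fields such as price or date. In our contrived example, we search for books about “search engine” that were ideally published around June 2014.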
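Before looking at the request, it may help to see how an exponential decay score is computed. The sketch below assumes the exp function with an origin of 2014-06-15, an offset of 7 days, a scale of 30 days, and the default decay of 0.5: values within the offset score 1.0, and the score halves for every additional scale of distance. It works in whole days, which is sufficient here because all the dates fall on midnight.

```python
import math
from datetime import date

# publish_date of the four sample books, keyed by _id
PUBLISH_DATES = {
    "1": date(2015, 2, 7),
    "2": date(2013, 1, 24),
    "3": date(2015, 12, 3),
    "4": date(2014, 4, 5),
}


def exp_decay(value, origin, offset_days, scale_days, decay=0.5):
    """exp(lambda * max(0, |value - origin| - offset)), lambda = ln(decay) / scale."""
    distance = abs((value - origin).days)
    lam = math.log(decay) / scale_days
    return math.exp(lam * max(0, distance - offset_days))


origin = date(2014, 6, 15)
scores = {
    doc_id: exp_decay(published, origin, offset_days=7, scale_days=30)
    for doc_id, published in PUBLISH_DATES.items()
}
for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, score)
```

With boost_mode set to replace, the function value becomes the final _score, so these numbers can be compared directly with the results of the request that follows.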
POST /bookdb_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "search engine",
          "fields": ["title", "summary"]
        }
      },
      "functions": [
        {
          "exp": {
            "publish_date": {
              "origin": "2014-06-15",
              "offset": "7d",
              "scale": "30d"
            }
          }
        }
      ],
      "boost_mode": "replace"
    }
  },
  "_source": ["title", "summary", "publish_date", "num_reviews"]
}
The result of the above query is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 0.22793062, "_source" : { "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "num_reviews" : 23, "title" : "Solr in Action", "publish_date" : "2014-04-05" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 0.0049215667, "_source" : { "summary" : "A distibuted real-time search and analytics engine", "num_reviews" : 20, "title" : "Elasticsearch: The Definitive Guide", "publish_date" : "2015-02-07" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : 9.612435E-6, "_source" : { "summary" : "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization", "num_reviews" : 12, "title" : "Taming Text: How to Find, Organize, and Manipulate It", "publish_date" : "2013-01-24" } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 4.9185574E-6, "_source" : { "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "num_reviews" : 18, "title" : "Elasticsearch in Action", "publish_date" : "2015-12-03" } } ]
Function Score: Script Scoring
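If the built-in scoring functions do not meet your needs, you can specify a Painless script to use for scoring. In our example, we want a script that takes publish_date into account before deciding how much weight to give the number of reviews. Newer books may not have as many reviews yet, so they should not be penalized for that.
The scoring script looks like this: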
GET bookdb_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "multi_match": {
          "query": "search engine",
          "fields": ["title", "summary"]
        }
      },
      "script": {
        "source": """
          def publish_date = doc['publish_date'].value.toInstant();
          def num_reviews = doc['num_reviews'].value;
          def threshold = Instant.parse(params.threshold + 'T00:00:00Z');
          if (publish_date.compareTo(threshold) > 0) {
            return Math.log10(2.5 + num_reviews);
          }
          return Math.log10(1 + num_reviews);
        """,
        "params": {
          "threshold": "2015-07-30"
        }
      }
    }
  }
}
The result of the above search is:
"hits" : [ { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "3", "_score" : 1.447158, "_source" : { "title" : "Elasticsearch in Action", "authors" : [ "radu gheorge", "matthew lee hinman", "roy russo" ], "summary" : "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms", "publish_date" : "2015-12-03", "num_reviews" : 18, "publisher" : "manning", "location" : { "lat" : "39.893801", "lon" : "116.408986" } } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "4", "_score" : 1.3802112, "_source" : { "title" : "Solr in Action", "authors" : [ "trey grainger", "timothy potter" ], "summary" : "Comprehensive guide to implementing a scalable search engine using Apache Solr", "publish_date" : "2014-04-05", "num_reviews" : 23, "publisher" : "manning", "location" : { "lat" : "39.718256", "lon" : "116.367910" } } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "1", "_score" : 1.3222193, "_source" : { "title" : "Elasticsearch: The Definitive Guide", "authors" : [ "clinton gormley", "zachary tong" ], "summary" : "A distibuted real-time search and analytics engine", "publish_date" : "2015-02-07", "num_reviews" : 20, "publisher" : "oreilly", "location" : { "lat" : "39.970718", "lon" : "116.325747" } } }, { "_index" : "bookdb_index", "_type" : "_doc", "_id" : "2", "_score" : 1.1139433, "_source" : { "title" : "Taming Text: How to Find, Organize, and Manipulate It", "authors" : [ "grant ingersoll", "thomas morton", "drew farris" ], "summary" : "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization", "publish_date" : "2013-01-24", "num_reviews" : 12, "publisher" : "manning", "location" : { "lat" : "39.904313", "lon" : "116.412754" } } } ]
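Note: A plain JSON string cannot contain embedded newlines, so when a script is sent as ordinary JSON it has to be written on a single line, with semicolons separating the statements. The triple-quoted string (""") used above is Kibana Console syntax that lets the script span several lines.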