一、功能场景
在mysql数据库中查询数据时,我们可以采用dinstinct关键字去重。那么,在ES中如何实现查询结果去重呢?
二、原理分析
DISTINCT关键字去重的sql语句等价于对需要去重的字段进行GROUP BY。
以下2个sql都能实现对name,age字段查询结果的去重。
SELECT DISTINCT name,age FROM distinct_index SELECT name,age FROM distinct_index GROUP BY name,age
三、方案汇总
根据上面的分析,总结出以下2种实现方案:
1、使用GROUP BY的SQL查询
2、使用Aggregation(聚合)查询
说明:
ES6.3之后的版本以及支持SQL查询
四、数据准备
## 删除索引 ## DELETE distinct_index ## 新建索引 PUT distinct_index { "mappings": { "properties": { "name": { "type": "keyword" }, "age": { "type": "integer" } } } } ## 添加数据 POST distinct_index/_bulk?refresh { "create": { } } { "name": "小天", "age": 25} { "create": { } } { "name": "小天", "age": 25} { "create": { } } { "name": "老万", "age": 35} { "create": { } } { "name": "老王", "age": 45} { "create": { } } { "name": "小明", "age": 15} { "create": { } } { "name": "小明", "age": 15} { "create": { } } { "name": "小红", "age": 12} { "create": { } } { "name": "乐乐", "age": 18}
五、方案实战
1、SQL查询方案
##可以通过format参数控制返回结果的格式,txt表示文本格式,看起来更直观点,默认为json格式。 POST _sql?format=txt { "query": "SELECT name,age FROM distinct_index group by name,age" }
查询结果:
和预期相符,查询结果达到去重的效果。
SQL语句转DSL:
POST /_sql/translate { "query": "SELECT name,age FROM distinct_index group by name,age" }
结果:
{ "size" : 0, "_source" : false, "stored_fields" : "_none_", "aggregations" : { "groupby" : { "composite" : { "size" : 1000, "sources" : [ { "f5b401c4" : { "terms" : { "field" : "name", "missing_bucket" : true, "order" : "asc" } } }, { "f557e07b" : { "terms" : { "field" : "age", "missing_bucket" : true, "order" : "asc" } } } ] } } } }
说明:
通过将sql语句转为dsl语句可以发现,sql语句中的group by查询底层原理是转化成了dsl中的aggregations聚合查询。
2、aggregations聚合查询
POST distinct_index/_search { "from": 0, "size": 0, "aggregations": { "name": { "terms": { "field": "name", "size": 2147483647 }, "aggregations": { "age": { "terms": { "field": "age", "size": 2147483647 } } } } } }
查询结果:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 8, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "name" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "小天", "doc_count" : 2, "age" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 25, "doc_count" : 2 } ] } }, { "key" : "小明", "doc_count" : 2, "age" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 15, "doc_count" : 2 } ] } }, { "key" : "乐乐", "doc_count" : 1, "age" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 18, "doc_count" : 1 } ] } }, { "key" : "小红", "doc_count" : 1, "age" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 12, "doc_count" : 1 } ] } }, { "key" : "老万", "doc_count" : 1, "age" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 35, "doc_count" : 1 } ] } }, { "key" : "老王", "doc_count" : 1, "age" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : 45, "doc_count" : 1 } ] } } ] } } }
六、异常记录
1、ES中的sql查询不支持DISTINCT关键字
POST _sql?format=txt { "query": "SELECT DISTINCT name,age FROM distinct_index" }
报错:
{ "error" : { "root_cause" : [ { "type" : "verification_exception", "reason" : "Found 1 problem\nline 1:8: SELECT DISTINCT is not yet supported" } ], "type" : "verification_exception", "reason" : "Found 1 problem\nline 1:8: SELECT DISTINCT is not yet supported" }, "status" : 400 }
2、ES的sql查询中表名中不能有中划线,比如my-index-000001
PUT my-index-000001 { "mappings": { "properties": { "name": { "type": "keyword" }, "age": { "type": "integer" } } } } POST _sql?format=txt { "query": "SELECT name FROM my-index-000001" }
报错:
{ "error" : { "root_cause" : [ { "type" : "parsing_exception", "reason" : "line 1:20: extraneous input '-' expecting {<EOF>, ',', 'ANALYZE', 'ANALYZED', 'AS', 'CATALOGS', ……
遇到这种情况,最简单方法是给索引添加别名。
POST /_aliases { "actions" : [ { "add" : { "index" : "my-index-000001", "alias" : "my_index" } } ] } POST _sql?format=txt { "query": "SELECT name FROM my_index" }
总结
本文主要介绍了ES中如何实现类似dinstinct的数据去重功能。
1、首先通过通过dinstinct和group by的等价sql语句,说明可以通过分组函数实现数据去重。
2、分别介绍了ES中通过sql语句查询和aggregations聚合查询实现对查询结果的去重。
3、ES中的sql查询不支持DISTINCT关键字
4、ES中的sql查询中表名不能包含中划线,比如my-index-000001