使用聚合与SQL在Elasticsearch中实现distinct去重-开发者社区-阿里云

elasticsearch中如何实现dinstinct去重功能

2022-10-20 1969

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： elasticsearch中如何实现dinstinct去重功能

一、功能场景

在mysql数据库中查询数据时，我们可以采用dinstinct关键字去重。那么，在ES中如何实现查询结果去重呢？

二、原理分析

DISTINCT关键字去重的sql语句等价于对需要去重的字段进行GROUP BY。

以下2个sql都能实现对name,age字段查询结果的去重。

SELECT DISTINCT name,age FROM distinct_index
SELECT name,age FROM distinct_index GROUP BY name,age

三、方案汇总

根据上面的分析，总结出以下2种实现方案：

1、使用GROUP BY的SQL查询

2、使用Aggregation(聚合)查询

说明：

ES6.3之后的版本以及支持SQL查询

四、数据准备

## 删除索引
## DELETE distinct_index
## 新建索引
PUT distinct_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
## 添加数据
POST distinct_index/_bulk?refresh
{ "create": { } }
{ "name": "小天", "age": 25}
{ "create": { } }
{ "name": "小天", "age": 25}
{ "create": { } }
{ "name": "老万", "age": 35}
{ "create": { } }
{ "name": "老王", "age": 45}
{ "create": { } }
{ "name": "小明", "age": 15}
{ "create": { } }
{ "name": "小明", "age": 15}
{ "create": { } }
{ "name": "小红", "age": 12}
{ "create": { } }
{ "name": "乐乐", "age": 18}

五、方案实战

1、SQL查询方案

##可以通过format参数控制返回结果的格式，txt表示文本格式，看起来更直观点，默认为json格式。
POST _sql?format=txt
{
  "query": "SELECT name,age FROM distinct_index group by name,age"
}

查询结果：

和预期相符，查询结果达到去重的效果。

SQL语句转DSL：

POST /_sql/translate
{
  "query": "SELECT name,age FROM distinct_index group by name,age"
}

结果：

{
  "size" : 0,
  "_source" : false,
  "stored_fields" : "_none_",
  "aggregations" : {
    "groupby" : {
      "composite" : {
        "size" : 1000,
        "sources" : [
          {
            "f5b401c4" : {
              "terms" : {
                "field" : "name",
                "missing_bucket" : true,
                "order" : "asc"
              }
            }
          },
          {
            "f557e07b" : {
              "terms" : {
                "field" : "age",
                "missing_bucket" : true,
                "order" : "asc"
              }
            }
          }
        ]
      }
    }
  }
}

说明：

通过将sql语句转为dsl语句可以发现，sql语句中的group by查询底层原理是转化成了dsl中的aggregations聚合查询。

2、aggregations聚合查询

POST distinct_index/_search
{
  "from": 0,
  "size": 0,
  "aggregations": {
  "name": {
    "terms": {
    "field": "name",
    "size": 2147483647
    },
    "aggregations": {
    "age": {
      "terms": {
      "field": "age",
      "size": 2147483647
      }
    }
    }
  }
  }
}

查询结果：

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "name" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "小天",
          "doc_count" : 2,
          "age" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 25,
                "doc_count" : 2
              }
            ]
          }
        },
        {
          "key" : "小明",
          "doc_count" : 2,
          "age" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 15,
                "doc_count" : 2
              }
            ]
          }
        },
        {
          "key" : "乐乐",
          "doc_count" : 1,
          "age" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 18,
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "小红",
          "doc_count" : 1,
          "age" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 12,
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "老万",
          "doc_count" : 1,
          "age" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 35,
                "doc_count" : 1
              }
            ]
          }
        },
        {
          "key" : "老王",
          "doc_count" : 1,
          "age" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 45,
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}

六、异常记录

1、ES中的sql查询不支持DISTINCT关键字

POST _sql?format=txt
{
  "query": "SELECT DISTINCT name,age FROM distinct_index"
}

报错：

{
  "error" : {
    "root_cause" : [
      {
        "type" : "verification_exception",
        "reason" : "Found 1 problem\nline 1:8: SELECT DISTINCT is not yet supported"
      }
    ],
    "type" : "verification_exception",
    "reason" : "Found 1 problem\nline 1:8: SELECT DISTINCT is not yet supported"
  },
  "status" : 400
}

2、ES的sql查询中表名中不能有中划线，比如my-index-000001

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
POST _sql?format=txt
{
  "query": "SELECT name FROM my-index-000001"
}

报错：

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "line 1:20: extraneous input '-' expecting {<EOF>, ',', 'ANALYZE', 'ANALYZED', 'AS', 'CATALOGS', 
        ……

遇到这种情况，最简单方法是给索引添加别名。

POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "my-index-000001", "alias" : "my_index" } }
  ]
}
POST _sql?format=txt
{
  "query": "SELECT name FROM my_index"
}

总结

本文主要介绍了ES中如何实现类似dinstinct的数据去重功能。

1、首先通过通过dinstinct和group by的等价sql语句，说明可以通过分组函数实现数据去重。

2、分别介绍了ES中通过sql语句查询和aggregations聚合查询实现对查询结果的去重。

3、ES中的sql查询不支持DISTINCT关键字

4、ES中的sql查询中表名不能包含中划线，比如my-index-000001

elasticsearch中如何实现dinstinct去重功能

一、功能场景

二、原理分析

三、方案汇总

四、数据准备

五、方案实战

1、SQL查询方案

2、aggregations聚合查询

六、异常记录

总结

热门文章

最新文章

相关课程

相关电子书

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

elasticsearch中如何实现dinstinct去重功能

一、功能场景

二、原理分析

三、方案汇总

四、数据准备

五、方案实战

1、SQL查询方案

2、aggregations聚合查询

六、异常记录

总结

热门文章

最新文章

相关课程

相关电子书