《Elastic Stack 实战手册》——三、产品能力——3.5 进阶篇——3.5.17.Elasticsearch SQL (4) https://developer.aliyun.com/article/1227074
5. 聚合分析
Elasticsearch 的聚合分析功能非常强大,在实际生产环境中,通过聚合分析再结合图表展示,我们可以实现许多实用功能,例如:查看网站访问 IP 分布情况,某些服务的 P99 延迟等等。
5.1 复合聚合
Elasticsearch SQL 中提供了 group by 语法用于桶聚合分析,对于 group by 语法,Elasticsearch 内部将转换为复合聚合(composite aggregation)。
例如:下面的查询根据返回状态码分类,统计对应的 http 请求的 body 字节大小平均值和请求次数。
POST _sql/translate { "query":""" SELECT response, count(*) AS count_docs, avg(bytes) AS avg_bytes FROM test_logs GROUP BY response """ } { "size" : 0, "_source" : false, "aggregations" : { "groupby" : { "composite" : { "size" : 1000, "sources" : [ { "565ba6a7" : { "terms" : { "field" : "response.keyword", "missing_bucket" : true, "order" : "asc" } } } ] }, "aggregations" : { "945f09f1" : { "avg" : { "field" : "bytes" } } } } } } POST _sql?format=txt { "query":""" SELECT response, count(*) AS count_docs, avg(bytes) AS avg_bytes FROM test_logs GROUP BY response """ } response | count_docs | avg_bytes ---------------+---------------+----------------- 200 |12832 |5897.852711970075 404 |801 |5049.242197253433 503 |441 |0.0
5.2 多层聚合
既然 group by 会被转化为复合聚合,我们可以通过 group by 多个字段,实现多层聚合。
例如:我们对测试数据(访问日志),先通过访问域名分类,然后再根据返回状态码分类,统计对应的 http 请求的 body 字节大小平均值 。
POST _sql?format=txt { "query":""" SELECT host, response, avg(bytes) AS avg_bytes FROM test_logs GROUP BY host, response """ } host | response | avg_bytes -------------------------------+---------------+----------------- artifacts.elastic.co |200 |6449.373109243697 artifacts.elastic.co |404 |6111.981818181818 artifacts.elastic.co |503 |0.0 cdn.elastic-elastic-elastic.org|200 |5771.631123919308 cdn.elastic-elastic-elastic.org|404 |5527.292452830188 cdn.elastic-elastic-elastic.org|503 |0.0 elastic-elastic-elastic.org |200 |4775.802409638554 elastic-elastic-elastic.org |404 |2397.616 elastic-elastic-elastic.org |503 |0.0 www.elastic.co |200 |5315.617331812999 www.elastic.co |404 |4757.891666666666 www.elastic.co |503 |0.0
5.3 直方图
在我们的使用场景中,经常需要对日期进行直方图聚合,Elasticsearch SQL中存在histogram函数,可以实现。
POST _sql?format=txt { "query":""" SELECT host, HISTOGRAM(timestamp, INTERVAL 1 MONTH) as h, count(*) AS count_docs FROM test_logs GROUP BY host, h """ } host | h | count_docs -------------------------------+------------------------+--------------- artifacts.elastic.co |2021-10-01T00:00:00.000Z|124 artifacts.elastic.co |2021-11-01T00:00:00.000Z|3141 artifacts.elastic.co |2021-12-01T00:00:00.000Z|3223 cdn.elastic-elastic-elastic.org|2021-10-01T00:00:00.000Z|33 cdn.elastic-elastic-elastic.org|2021-11-01T00:00:00.000Z|1123 cdn.elastic-elastic-elastic.org|2021-12-01T00:00:00.000Z|1099 elastic-elastic-elastic.org |2021-10-01T00:00:00.000Z|5 elastic-elastic-elastic.org |2021-11-01T00:00:00.000Z|215 elastic-elastic-elastic.org |2021-12-01T00:00:00.000Z|332 www.elastic.co |2021-10-01T00:00:00.000Z|87 www.elastic.co |2021-11-01T00:00:00.000Z|2371 www.elastic.co |2021-12-01T00:00:00.000Z|2321
5.4 聚合分页
在实际使用场景中,如果分桶过多,可能会出现消耗过多资源、查询响应超时、经典的桶太多报错(too many buckets)等问题。幸运的是,我们可以通过聚合分页,避免上面的查询问题,而通过 Elasticsearch SQL 的 group by 和 fetch_size 结合使用,聚合分页将变得十分简洁。
# 第1次查询,指定 fetch_size=10 POST _sql?format=json { "query":""" SELECT host, response, count(*) AS count_docs, avg(bytes) AS avg_bytes FROM test_logs GROUP BY host, response """, "fetch_size": 10 } { "columns" : [ { "name" : "host", "type" : "text" }, { "name" : "response", "type" : "text" }, { "name" : "count_docs", "type" : "long" }, { "name" : "avg_bytes", "type" : "double" } ], "rows" : [ [ "artifacts.elastic.co", "200", 5950, 6449.373109243697 ], ... ], "cursor" : "g+azAwFaA...=" } # 第2次查询(也即最后1次查询) POST _sql?format=json { "cursor" : "g+azAwFaA...=" }
5.5 管道聚合
实际使用过程中,我们经常需要对聚合结果,再次过滤,而通过管道聚合可以轻松实现。在
Elasticsearch SQL 中,提供了 having 语法,实现相同的管道聚合功能。
例如:我们先根据 host,response 分类,然后计算出每个分类的请求次数,还需要过滤count_docs <= 1000 的分类。
OST _sql?format=txt { "query":""" SELECT host, response, count(*) AS count_docs FROM test_logs GROUP BY host, response HAVING count_docs > 1000 """ } host | response | count_docs -------------------------------+---------------+--------------- artifacts.elastic.co |200 |5950 cdn.elastic-elastic-elastic.org|200 |2082 www.elastic.co |200 |4385