关于citus.limit_clause_row_fetch_count优化参数

简介:

关于citus.limit_clause_row_fetch_count优化参数

citus.limit_clause_row_fetch_count是citus的一个性能优化的参数。具体适应于什么场景呢?

官方文档说明

下面是官方文档说明,还是不够具体。

https://docs.citusdata.com/en/v7.3/develop/api_guc.html?highlight=limit_clause_row_fetch_count

Planner Configuration

citus.limit_clause_row_fetch_count (integer)

Sets the number of rows to fetch per task for limit clause optimization. 
In some cases, select queries with limit clauses may need to fetch all 
rows from each task to generate results. In those cases, and where an 
approximation would produce meaningful results, this configuration value
 sets the number of rows to fetch from each shard. Limit approximations 
are disabled by default and this parameter is set to -1. This value can 
be set at run-time and is effective on the coordinator.

测试用例

从citus的测试用例中,可以清楚的看到它的作用。

citus-7.2.1\src\test\regress\expected\multi_limit_clause_approximate.out

-- Enable limit optimization to fetch one third of each shard's data
SET citus.limit_clause_row_fetch_count TO 600;
SELECT l_partkey, sum(l_partkey * (1 + l_suppkey)) AS aggregate FROM lineitem
    GROUP BY l_partkey
    ORDER BY aggregate DESC LIMIT 10;
DEBUG:  push down of limit count: 600
 l_partkey | aggregate  
-----------+------------
    194541 | 3727794642
    160895 | 3671463005
    183486 | 3128069328
    179825 | 3093889125
    162432 | 2834113536
    153937 | 2761321906
    199283 | 2726988572
    185925 | 2672114100
    196629 | 2622637602
    157064 | 2614644408
(10 rows)

上面的SQL,如果不加SET citus.limit_clause_row_fetch_count TO 600,CN需要到worker上把所有数据都捞出来,然后再在CN上排序取TopN结果。
大数据量的情况,性能会非常糟糕。加上SET citus.limit_clause_row_fetch_count TO 600,就只会到每个worker上取前600的记录。但可能会带来准确性的损失。

另外一个需要注意的是,上面的GROUP BY字段l_partkey不是分片字段,如果GROUP BY字段已经包含了分片字段,不需要这个优化,因为这种情况下可以直接把LIMIT下推下去。

另一个测试用例,形式类似

...
SET citus.limit_clause_row_fetch_count TO 150;
SET citus.large_table_shard_count TO 2;
SELECT c_custkey, c_name, count(*) as lineitem_count
    FROM customer, orders, lineitem
    WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
    GROUP BY c_custkey, c_name
    ORDER BY lineitem_count DESC, c_custkey LIMIT 10;
DEBUG:  push down of limit count: 150
 c_custkey |       c_name       | lineitem_count 
-----------+--------------------+----------------
        43 | Customer#000000043 |             42
       370 | Customer#000000370 |             38
        79 | Customer#000000079 |             37
       689 | Customer#000000689 |             36
       472 | Customer#000000472 |             35
       685 | Customer#000000685 |             35
       643 | Customer#000000643 |             34
       226 | Customer#000000226 |             33
       496 | Customer#000000496 |             32
       304 | Customer#000000304 |             31
(10 rows)

小结

  • 适用场景

    • citus.limit_clause_row_fetch_count适用于分组聚合并取TopN结果的SQL的性能优化
  • 不适用场景

    • 要求精确结果
    • 聚合字段包含分片字段
    • count(DISTINCT)
相关文章
|
7月前
|
SQL 关系型数据库 数据处理
详解SQL语句中的GROUP BY和聚合函数COUNT、SUM、AVG、MIN和MAX。
详解SQL语句中的GROUP BY和聚合函数COUNT、SUM、AVG、MIN和MAX。
1153 0
|
数据库 OceanBase
LIMIT_ROW_COUNT
LIMIT_ROW_COUNT
101 1
|
SQL 存储 Oracle
PostgreSQL 分页, offset, 返回顺序, 扫描方法原理(seqscan, index scan, index only scan, bitmap scan, parallel xx scan),游标
PostgreSQL 分页, offset, 返回顺序, 扫描方法原理(seqscan, index scan, index only scan, bitmap scan, parallel xx scan),游标
3862 0
|
SQL Java 数据库连接
mybatis 批量插入 Column count doesn‘t match value count at row 1
mybatis 批量插入 Column count doesn‘t match value count at row 1
190 0
|
SQL 关系型数据库 MySQL
MySQL:The used SELECT statements have a different number of columns
执行SQL报错:The used SELECT statements have a different number of columns
796 0
MySQL:The used SELECT statements have a different number of columns
|
SQL 关系型数据库 MySQL
select、distinct、limit使用
select、distinct、limit使用
286 0
select、distinct、limit使用
|
SQL 关系型数据库 测试技术
PostgreSQL 流式统计 - insert on conflict 实现 流式 UV(distinct), min, max, avg, sum, count ...
标签 PostgreSQL , 流式统计 , insert on conflict , count , avg , min , max , sum 背景 流式统计count, avg, min, max, sum等是一个比较有意思的场景,可用于实时大屏,实时绘制统计图表。
5490 0
|
SQL 关系型数据库 索引
|
关系型数据库 PostgreSQL
PostgreSQL sharding : citus 系列7 - topn 加速(count(*) group by order by count(*) desc limit x) (use 估值插件 topn)
标签 PostgreSQL , topn , topn.number_of_counters , count(*) group by order by count(*) desc limit x 背景 count(*) group by order by count(*) desc limit x 用来统计 topn。
1434 0