# HashAggregate VS sort+GroupAgg

## 聚合就是把元素按照一定的规则分为不同的组，然后对各组元素进行计算。

### postgresql中聚合算法

• GroupAggregate: get the data sorted and apply aggregation function to groups one by one。
• HashAggregate: store state for each key in a hash table

#### GroupAggregate

GroupAggregatede 特点是在进行聚合之前先要将数据进行排序，然后进行聚合操作，而且出来的结果是有序的。（postgresql使用enable_sort控制）

### 对比两种算法速率

创建9个测试表
create table t_1  (branch_id bigint, amount numeric);
create table t_10   (branch_id bigint, amount numeric);
create table t_100   (branch_id bigint, amount numeric);
create table t_1000   (branch_id bigint, amount numeric);
create table t_10000  (branch_id bigint, amount numeric);
create table t_100000 (branch_id bigint, amount numeric);
create table t_1000000 (branch_id bigint, amount numeric);
create table t_10000000 (branch_id bigint, amount numeric);
create table t_100000000 (branch_id bigint, amount numeric);

insert into t_1 select mod(i, 1), random()     from generate_series(1,100000000) s(i);
insert into t_10 select mod(i, 10), random()     from generate_series(1,100000000) s(i);
insert into t_100 select mod(i, 100), random()     from generate_series(1,100000000) s(i);
...
vacuum analyze t_1;
vacuum analyze t_10;
vacuum analyze t_100;
...

;

### 应用

create table t1   (branch_id bigint, amount numeric);
insert into t1 select mod(i, 1000), random()     from generate_series(1,10000000) s(i);

SQL

select branch_id,count(distinct amount) from t1 group by branch_id;
postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=1326385.18..1401396.05 rows=1000 width=16) (actual time=2524.329..8132.612 rows=1000 loops=1)
Group Key: branch_id
->  Sort  (cost=1326385.18..1351385.47 rows=10000115 width=19) (actual time=2517.526..4227.713 rows=10000000 loops=1)
Sort Key: branch_id
Sort Method: quicksort  Memory: 1174466kB
->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.015..824.249 rows=10000000 loops=1)
Planning time: 0.105 ms
Execution time: 8234.834 ms
(8 rows)

postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=10001326385.18..10001401396.05 rows=1000 width=16) (actual time=2581.177..7953.331 rows=1000 loops=1)
Group Key: branch_id
->  Sort  (cost=10001326385.18..10001351385.47 rows=10000115 width=19) (actual time=2574.178..4126.988 rows=10000000 loops=1)
Sort Key: branch_id
Sort Method: quicksort  Memory: 1174466kB
->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.015..813.226 rows=10000000 loops=1)
Planning time: 0.082 ms
Execution time: 8034.338 ms

 select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
postgres=# explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=458272.50..458274.50 rows=200 width=16) (actual time=9544.638..9544.763 rows=1000 loops=1)
Group Key: t1.branch_id
->  HashAggregate  (cost=213696.73..311527.04 rows=9783031 width=19) (actual time=5200.136..7961.864 rows=9999971 loops=1)
Group Key: t1.branch_id, t1.amount
->  Seq Scan on t1  (cost=0.00..163696.15 rows=10000115 width=19) (actual time=0.022..890.532 rows=10000000 loops=1)
Planning time: 0.197 ms
Execution time: 9810.855 ms

update t1 set amount = branch_id;
postgres=# explain analyze select branch_id,count(distinct amount) from t1 group by branch_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=10002533944.49..10002672603.94 rows=1000 width=16) (actual time=3143.167..7858.782 rows=1000 loops=1)
Group Key: branch_id
->  Sort  (cost=10002533944.49..10002580160.97 rows=18486593 width=19) (actual time=3137.152..4561.355 rows=10000000 loops=1)
Sort Key: branch_id
Sort Method: quicksort  Memory: 861967kB
->  Seq Scan on t1  (cost=0.00..302614.93 rows=18486593 width=19) (actual time=613.417..1537.574 rows=10000000 loops=1)
Planning time: 0.134 ms
Execution time: 7925.181 ms
(8 rows)

postgres=# explain analyze select branch_id,count(*) from (select branch_id,amount from t1 group by branch_id,amount)_ group by branch_id;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=847179.99..847181.99 rows=200 width=16) (actual time=3987.460..3987.611 rows=1000 loops=1)
Group Key: t1.branch_id
->  HashAggregate  (cost=395047.90..575900.73 rows=18085284 width=19) (actual time=3821.612..3986.812 rows=1000 loops=1)
Group Key: t1.branch_id, t1.amount
->  Seq Scan on t1  (cost=0.00..302614.93 rows=18486593 width=19) (actual time=73.574..919.854 rows=10000000 loops=1)
Planning time: 0.155 ms
Execution time: 4438.999 ms
(7 rows)

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=408450.41..937838.27 rows=200 width=16) (actual time=831.472..1365.904 rows=1000 loops=1)
Group Key: t1.branch_id
->  Group  (cost=408450.41..922858.98 rows=998486 width=12) (actual time=830.731..1365.377 rows=1000 loops=1)
Group Key: t1.branch_id, t1.amount
->  Gather Merge  (cost=408450.41..902889.26 rows=3993944 width=12) (actual time=830.730..1364.782 rows=5000 loops=1)
Workers Planned: 4
Workers Launched: 4
->  Group  (cost=407950.35..426671.97 rows=998486 width=12) (actual time=794.142..1317.058 rows=1000 loops=5)
Group Key: t1.branch_id, t1.amount
->  Sort  (cost=407950.35..414190.89 rows=2496215 width=12) (actual time=794.139..1002.323 rows=2000000 loops=5)
Sort Key: t1.branch_id, t1.amount
Sort Method: quicksort  Memory: 197526kB
->  Parallel Seq Scan on t1  (cost=0.00..142711.15 rows=2496215 width=12) (actual time=35.558..190.117 rows=2000000 loops=5)
Planning time: 0.119 ms
Execution time: 1379.119 ms

|
Shell Linux
Linux中常用的文本处理命令（echo、sort、uniq、tr、cut、split、eval）（上）
1、echo命令——输出 echo 命令主要用来显示字符串信息。
332 0
|
17天前

4 0
|
1月前
|
Linux
Linux命令(111)之groupadd
Linux命令(111)之groupadd
48 3
|
7月前
|
Linux
Linux命令 groupadd
Linux命令 groupadd
51 2
|
10月前
|
Linux
10.6.4 【Linux】字符转换命令： tr, col, join, paste, expand
10.6.4 【Linux】字符转换命令： tr, col, join, paste, expand
211 0
|
Linux
Linux Command groupadd 、groupdel、groupmod
Linux Command groupadd 、groupdel、groupmod
66 1
|
Perl
sort、uniq、cut命令操作
sort、uniq、cut命令操作
86 0
|
Linux Shell
Linux中常用的文本处理命令（echo、sort、uniq、tr、cut、split、eval）（下）
1、echo命令——输出 echo 命令主要用来显示字符串信息。
224 0

380 0