Greenplum 数据分布黄金法则 - 论分布列与分区的选择

+关注继续查看

背景

DROP TABLE IF EXISTS  ab01;
CREATE TABLE ab01 (
aab001 bigint(10) NOT NULL ,
aab007 varchar(50) DEFAULT NULL ,
aab008 decimal(14,0) DEFAULT NULL ,
create_time datetime NOT NULL ,
deleteedflag bit(1) DEFAULT b'0' ,
PRIMARY KEY (aab001)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;

DROP TABLE IF EXISTS  ab01;
CREATE TABLE ab01 (
aab001 int8 NOT NULL ,
aab007 varchar(50) DEFAULT NULL ,
aab008 decimal(14,0) DEFAULT NULL ,  -- 长度 <=15 的decimal建议替换成float8, 运算效率比decimal高
create_time timestamp(0) NOT NULL ,
deleteedflag bit(1) DEFAULT b'0' ,
PRIMARY KEY (aab001)
) ; 

Greenplum完整的建表语法参考本文末尾。

分布列选择黄金法则

Greenplum 支持的分布算法 :

• JOIN
当JOIN的列都是分布列时，不需要重分布或广播小表，可以在segment内完成JOIN。

• 防止数据倾斜
Greenplum依据指定的分布列，hash取模存放到对应的segment中。

• 高并发查询，选择性好
如果数据经常被高并发的键值或离散查询，建议将查询条件的列作为分布列，这样不需要连接到所有的segment去查，可以大大提高并发能力。

aa01 的分布列是aaz499

postgres=# explain analyze select * from aa01 where aaz499=1;
QUERY PLAN
----------------------------------------------------------------------------------
Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..120.00 rows=1 width=1973)
Rows out:  0 rows at destination with 1.352 ms to end, start offset by 144 ms.
->  Seq Scan on aa01  (cost=0.00..120.00 rows=1 width=1973)
Filter: aaz499 = 1
Rows out:  0 rows with 0.031 ms to end, start offset by 145 ms.
Slice statistics:
(slice0)    Executor memory: 330K bytes.
(slice1)    Executor memory: 176K bytes (seg10).
Statement statistics:
Memory used: 128000K bytes
Optimizer status: legacy query optimizer
Total runtime: 145.822 ms
(12 rows)

postgres=# explain analyze select * from aa01 where cae007='t';
QUERY PLAN
------------------------------------------------------------------------------------
Gather Motion 16:1  (slice1; segments: 16)  (cost=0.00..120.00 rows=2 width=1973)
Rows out:  0 rows at destination with 2.001 ms to end, start offset by 146 ms.
->  Seq Scan on aa01  (cost=0.00..120.00 rows=1 width=1973)
Filter: cae007::text = 't'::text
Rows out:  0 rows (seg0) with 0.047 ms to end, start offset by 147 ms.
Slice statistics:
(slice0)    Executor memory: 330K bytes.
(slice1)    Executor memory: 176K bytes avg x 16 workers, 176K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Optimizer status: legacy query optimizer
Total runtime: 147.813 ms
(12 rows)

• 分布列选择法则
原则，避免短板效应。

• 分区法则
原则，缩小查询范围。

参考

https://yq.aliyun.com/articles/57778

《Greenplum资源隔离指南》
https://yq.aliyun.com/articles/57763

《三张图读懂Greenplum在企业的正确使用姿势》
https://yq.aliyun.com/articles/57736

《Greenplum 公测申请页面》
https://www.aliyun.com/product/gpdb?spm=5176.7960203.237031.39.3xWeRa

Greenplum完整的建表语法如下 :

Command:     CREATE TABLE
Description: define a new table
Syntax:
CREATE [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name (
[ { column_name data_type [ DEFAULT default_expr ]     [column_constraint [ ... ]
[ ENCODING ( storage_directive [,...] ) ]
]
| table_constraint
| LIKE other_table [{INCLUDING | EXCLUDING}
{DEFAULTS | CONSTRAINTS}] ...}
[, ... ] ]
[column_reference_storage_directive [, ... ]
)
[ INHERITS ( parent_table [, ... ] ) ]
[ WITH ( storage_parameter=value [, ... ] )
[ ON COMMIT {PRESERVE ROWS | DELETE ROWS | DROP} ]
[ TABLESPACE tablespace ]
[ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]  -- 分布算法的选择
[ PARTITION BY partition_type (column)
[ SUBPARTITION BY partition_type (column) ]
[ SUBPARTITION TEMPLATE ( template_spec ) ]
[...]
( partition_spec )
| [ SUBPARTITION BY partition_type (column) ]
[...]
( partition_spec
[ ( subpartition_spec
[(...)]
) ]
)
where storage_parameter is:
APPENDONLY={TRUE|FALSE}
BLOCKSIZE={8192-2097152}
ORIENTATION={COLUMN|ROW}
COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}
COMPRESSLEVEL={0-9}
CHECKSUM={TRUE|FALSE}
FILLFACTOR={10-100}
OIDS[=TRUE|FALSE]
where column_constraint is:
[CONSTRAINT constraint_name]
NOT NULL | NULL
| UNIQUE [USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR = value )]
| PRIMARY KEY [USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR = value )]
| CHECK ( expression )
and table_constraint is:
[CONSTRAINT constraint_name]
UNIQUE ( column_name [, ... ] )
[USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR=value )]
| PRIMARY KEY ( column_name [, ... ] )
[USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR=value )]
| CHECK ( expression )
where partition_type is:
LIST
| RANGE
where partition_specification is:
partition_element [, ...]
and partition_element is:
DEFAULT PARTITION name
| [PARTITION name] VALUES (list_value [,...] )
| [PARTITION name]
START ([datatype] 'start_value') [INCLUSIVE | EXCLUSIVE]
[ END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE] ]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
| [PARTITION name]
END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
[ WITH ( partition_storage_parameter=value [, ... ] ) ]
[column_reference_storage_directive [, ...] ]
[ TABLESPACE tablespace ]
where subpartition_spec or template_spec is:
subpartition_element [, ...]
and subpartition_element is:
DEFAULT SUBPARTITION name
| [SUBPARTITION name] VALUES (list_value [,...] )
| [SUBPARTITION name]
START ([datatype] 'start_value') [INCLUSIVE | EXCLUSIVE]
[ END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE] ]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
| [SUBPARTITION name]
END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
[ WITH ( partition_storage_parameter=value [, ... ] ) ]
[column_reference_storage_directive [, ...] ]
[ TABLESPACE tablespace ]
where storage_directive is:
COMPRESSTYPE={ZLIB | QUICKLZ | RLE_TYPE | NONE}
| COMPRESSLEVEL={0-9}
| BLOCKSIZE={8192-2097152}
Where column_reference_storage_directive is:
COLUMN column_name ENCODING ( storage_directive [, ... ] ), ...
| DEFAULT COLUMN ENCODING ( storage_directive [, ... ] )

|
3月前
|

18 0
|
5月前
|

61 0
|
5月前
|

GreenPlum数据分布机制
GreenPlum数据分布机制
66 0
|
10月前
|
SQL 存储 缓存

57 0
|

InfluxDB存储数据是否需要水平拆分表？
InfluxDB存储数据是否需要水平拆分表？
351 0
|

85 0
|
SQL 存储 NoSQL

394 0
|

Greenplum 大宽表 OR 分层关系 - 大量nestloop，补齐其他字段的性能损耗

1352 0
|