Greenplum 清理垃圾、修改存储模式(行列变换) 平滑方法 - 交换数据、交互分区

本文涉及的产品
RDS DuckDB + QuickBI 企业套餐,8核32GB + QuickBI 专业版
RDS AI 助手,专业版
PolarDB Agent Express,2核4GB
简介:

标签

PostgreSQL , Greenplum , 交换分区 , 清理垃圾 , 存储格式变更


背景

1、在使用Greenplum的过程中,如果业务会经常对数据进行更新,删除,插入,可能导致数据出现膨胀。

《如何检测、清理Greenplum膨胀、垃圾 - 阿里云HybridDB for PG最佳实践》

《Greenplum 列存表(AO表)的膨胀、垃圾检查与空间收缩》

2、用户在建表的时候,存储模式选择不当,但是业务已经运行了一段时间,表里面已经写入了大量的数据。

3、用户在建表的时候,分布键选得不对,但是业务已经运行了一段时间,表里面已经写入了大量的数据。

4、用户在建表的时候表分区做得不对,但是业务已经运行了一段时间,表里面已经写入了大量的数据。

《Greenplum 最佳实践 - 数据分布黄金法则 - 分布列与分区的选择》

5、数据分区在初期选择了行存储,为了提到明细查询的效率。但是到数据冷却后,明细查询的需求下降,更多的是统计查询的需求,那么需要将这样的冷分区的行存储转换成列存储,提高统计的效率。

《Greenplum 行存、列存,堆表、AO表的原理和选择》

6、数据在使用过程中,由于分布键的数据在业务层发生了一些变化,导致了数据存储的倾斜。

《HybridDB PostgreSQL "Sort、Group、distinct 聚合、JOIN" 不惧怕数据倾斜的黑科技和原理 - 多阶段聚合》

《分布式DB(Greenplum)中数据倾斜的原因和解法 - 阿里云HybridDB for PostgreSQL最佳实践》

以上情况发生时,我们就有必要进行相应的行动:

清理垃圾、调整分布键、分区键、修改存储模式等。

那么,有什么方法可以让这些操作尽量的减少对业务的影响呢?

答案是 交换数据、交换分区。

一、知识点

如何查看存储格式

postgres=# select relname, relkind, relstorage from pg_class  where relkind='r';  
            relname            | relkind | relstorage   
-------------------------------+---------+------------  
 sql_languages                 | r       | h  
 sql_packages                  | r       | h  
 sql_parts                     | r       | h  
 sql_sizing                    | r       | h  
 sql_sizing_profiles           | r       | h  
 gp_san_configuration          | r       | h  

h表示heap 行存储表。

c表示append only column 存储表。

a表示表示append only 行存储表。

postgres=# select relname, relkind, relstorage from pg_class  where relname='a';  
 relname | relkind | relstorage   
---------+---------+------------  
 a       | r       | c  
(1 row)  
  
Time: 0.518 ms  

如何查看分布键

1、查看分布键

SELECT attrnums  
FROM pg_catalog.gp_distribution_policy t  
WHERE localoid = '679926' ;  
  
 attrnums   
----------  
 {1}  
(1 row)  

2、查看分布键名称

SELECT attname FROM pg_attribute   
WHERE attrelid = '679926'   
AND attnum = '1' ;  
  
 attname   
---------  
 uid  
(1 row)  

如何查看分区

postgres=# select * from pg_partitions where tablename='tbl_pos';  

二、单表交换数据

对sales表进行交换。

CREATE TABLE sales2 (LIKE sales)  
WITH (appendonly=true, compresstype=quicklz, compresslevel=1, orientation=column);  
  
INSERT INTO sales2 SELECT * FROM sales;  
  
DROP TABLE sales;  
  
ALTER TABLE sales2 RENAME TO sales;  
  
GRANT ALL PRIVILEGES ON sales TO admin;  
  
GRANT SELECT ON sales TO guest;  

三、分区表交换分区

对于分区表,Greenplum提供了交换分区的语法

http://greenplum.org/docs/510/ref_guide/sql_commands/ALTER_TABLE.html

EXCHANGE [DEFAULT] PARTITION  
  
Exchanges another table into the partition hierarchy into the place of an existing partition.   
In a multi-level partition design, you can only exchange the lowest level partitions (those that contain data).  
  
The Greenplum Database server configuration parameter gp_enable_exchange_default_partition controls   
availability of the EXCHANGE DEFAULT PARTITION clause. The default value for the parameter is off.   
The clause is not available and Greenplum Database returns an error if the clause is specified in an ALTER TABLE command.  
  
For information about the parameter, see Server Configuration Parameters.  
  
Warning: Before you exchange the default partition, you must ensure the data in the table to be exchanged,   
the new default partition, is valid for the default partition. For example, the data in the new default   
partition must not contain data that would be valid in other leaf child partitions of the partitioned table.   
Otherwise, queries against the partitioned table with the exchanged default partition that are executed by GPORCA might return incorrect results.  
  
WITH TABLE table_name - The name of the table you are swapping into the partition design.   
You can exchange a table where the table data is stored in the database. For example,   
the table is created with the CREATE TABLE command.  
  
With the EXCHANGE PARTITION clause, you can also exchange a readable external table   
(created with the CREATE EXTERNAL TABLE command) into the partition hierarchy in the place   
of an existing leaf child partition. If you specify a readable external table,   
you must also specify the WITHOUT VALIDATION clause to skip table validation against the   
CHECK constraint of the partition you are exchanging.  
  
Exchanging a leaf child partition with an external table is not supported in these cases:  
  
The partitioned table is created with the SUBPARTITION clause or if a partition has a subpartition.  
  
The partitioned table contains a column with a check constraint or a NOT NULL constraint.  
  
WITH | WITHOUT VALIDATION - Validates that the data in the table matches the CHECK   
constraint of the partition you are exchanging. The default is to validate the data against the CHECK constraint.  
  
Warning: If you specify the WITHOUT VALIDATION clause, you must ensure that the data   
in table that you are exchanging for an existing child leaf partition is valid against   
the CHECK constraints on the partition. Otherwise, queries against the partitioned table might return incorrect results.  
  
SET SUBPARTITION TEMPLATE  
  
Modifies the subpartition template for an existing partition. After a new subpartition   
template is set, all new partitions added will have the new subpartition design   
(existing partitions are not modified).  
  
SPLIT DEFAULT PARTITION  
  
Splits a default partition. Only a range partition can be split, not a list partition.   
In a multi-level partition design, you can only split the lowest level default partitions   
(those that contain data). Splitting a default partition creates a new partition containing   
the values specified and leaves the default partition containing any values that do not match to an existing partition.  
  
AT - For list partitioned tables, specifies a single list value that should be used as the criteria for the split.  
  
START - For range partitioned tables, specifies a starting value for the new partition.  
  
END - For range partitioned tables, specifies an ending value for the new partition.  
  
INTO - Allows you to specify a name for the new partition. When using the INTO clause   
to split a default partition, the second partition name specified should always be that   
of the existing default partition. If you do not know the name of the default partition,   
you can look it up using the pg_partitions view.  
  
SPLIT PARTITION  
  
Splits an existing partition into two partitions. Only a range partition can be split,   
not a list partition. In a multi-level partition design, you can only split the lowest   
level partitions (those that contain data).  
  
AT - Specifies a single value that should be used as the criteria for the split.   
The partition will be divided into two new partitions with the split value specified   
being the starting range for the latter partition.  
  
INTO - Allows you to specify names for the two new partitions created by the split.  
  
partition_name  
  
The given name of a partition.  
  
FOR (RANK(number))  
  
For range partitions, the rank of the partition in the range.  
  
FOR ('value')  
  
Specifies a partition by declaring a value that falls within the partition boundary specification.   
If the value declared with FOR matches to both a partition and one of its subpartitions (for example,   
if the value is a date and the table is partitioned by month and then by day), then FOR will operate   
on the first level where a match is found (for example, the monthly partition).   
If your intent is to operate on a subpartition, you must declare so as follows:   
ALTER TABLE name ALTER PARTITION FOR ('2016-10-01') DROP PARTITION FOR ('2016-10-01');  

例子

postgres=# select oid from pg_class where relname='tbl_pos';  
  oid     
--------  
 679926  
(1 row)  
  
postgres=# SELECT c.oid::pg_catalog.regclass FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i WHERE c.oid=i.inhrelid AND i.inhparent = '679926' ORDER BY c.relname;  
        oid          
-------------------  
 tbl_pos_1_prt_p1  
 tbl_pos_1_prt_p10  
 tbl_pos_1_prt_p2  
 tbl_pos_1_prt_p3  
 tbl_pos_1_prt_p4  
 tbl_pos_1_prt_p5  
 tbl_pos_1_prt_p6  
 tbl_pos_1_prt_p7  
 tbl_pos_1_prt_p8  
 tbl_pos_1_prt_p9  
(10 rows)  
  
SELECT attrnums  
FROM pg_catalog.gp_distribution_policy t  
WHERE localoid = '679926' ;  
  
 attrnums   
----------  
 {1}  
(1 row)  
  
SELECT attname FROM pg_attribute   
WHERE attrelid = '679926'   
AND attnum = '1' ;  
  
 attname   
---------  
 uid  
(1 row)  
  
  
postgres=# select * from pg_partitions where partitiontablename='tbl_pos_1_prt_p5';  
-[ RECORD 1 ]------------+--------------------------------------------------------------------------------------------------  
schemaname               | postgres  
tablename                | tbl_pos  
partitionschemaname      | postgres  
partitiontablename       | tbl_pos_1_prt_p5  
partitionname            | p5  
parentpartitiontablename |   
parentpartitionname      |   
partitiontype            | list  
partitionlevel           | 0  
partitionrank            |   
partitionposition        | 5  
partitionlistvalues      | 5::smallint  
partitionrangestart      |   
partitionstartinclusive  |   
partitionrangeend        |   
partitionendinclusive    |   
partitioneveryclause     |   
partitionisdefault       | f  
partitionboundary        | PARTITION p5 VALUES(5) WITH (appendonly=true, orientation=row, compresstype=zlib, checksum=false)  
parenttablespace         | pg_default  
partitiontablespace      | pg_default  

查看这个表的膨胀率发现已经膨胀了100%

postgres=# select * from gp_toolkit.__gp_aovisimap_compaction_info('tbl_pos_1_prt_p5'::regclass);   
NOTICE:  gp_appendonly_compaction_threshold = 10  
 content | datafile | compaction_possible | hidden_tupcount | total_tupcount | percent_hidden   
---------+----------+---------------------+-----------------+----------------+----------------  
       2 |        1 | t                   |        20833264 |       20833264 |         100.00  
       1 |        1 | t                   |        20833546 |       20833546 |         100.00  
       6 |        1 | t                   |        20833383 |       20833383 |         100.00  
       7 |        1 | t                   |        20833287 |       20833287 |         100.00  
       0 |        1 | t                   |        20833180 |       20833180 |         100.00  
       5 |        1 | t                   |        20833205 |       20833205 |         100.00  
      10 |        1 | t                   |        20833420 |       20833420 |         100.00  
      12 |        1 | t                   |        20833193 |       20833193 |         100.00  
       4 |        1 | t                   |        20833417 |       20833417 |         100.00  
       9 |        1 | t                   |        20833259 |       20833259 |         100.00  
       8 |        1 | t                   |        20833455 |       20833455 |         100.00  
       3 |        1 | t                   |        20833441 |       20833441 |         100.00  
      17 |        1 | t                   |        20833182 |       20833182 |         100.00  
      13 |        1 | t                   |        20833436 |       20833436 |         100.00  
      15 |        1 | t                   |        20833418 |       20833418 |         100.00  
      14 |        1 | t                   |        20833197 |       20833197 |         100.00  
      11 |        1 | t                   |        20833306 |       20833306 |         100.00  
      16 |        1 | t                   |        20833304 |       20833304 |         100.00  
      19 |        1 | t                   |        20833285 |       20833285 |         100.00  
      26 |        1 | t                   |        20833215 |       20833215 |         100.00  
      25 |        1 | t                   |        20833440 |       20833440 |         100.00  
      24 |        1 | t                   |        20833193 |       20833193 |         100.00  
      28 |        1 | t                   |        20833394 |       20833394 |         100.00  
      30 |        1 | t                   |        20833336 |       20833336 |         100.00  
      32 |        1 | t                   |        20833516 |       20833516 |         100.00  
      31 |        1 | t                   |        20833313 |       20833313 |         100.00  
      35 |        1 | t                   |        20833274 |       20833274 |         100.00  
      39 |        1 | t                   |        20833460 |       20833460 |         100.00  
      47 |        1 | t                   |        20833269 |       20833269 |         100.00  
      34 |        1 | t                   |        20833406 |       20833406 |         100.00  
      23 |        1 | t                   |        20833253 |       20833253 |         100.00  
      45 |        1 | t                   |        20833305 |       20833305 |         100.00  
      42 |        1 | t                   |        20833365 |       20833365 |         100.00  
      36 |        1 | t                   |        20833176 |       20833176 |         100.00  
      41 |        1 | t                   |        20833301 |       20833301 |         100.00  
      20 |        1 | t                   |        20833407 |       20833407 |         100.00  
      22 |        1 | t                   |        20833333 |       20833333 |         100.00  
      29 |        1 | t                   |        20833259 |       20833259 |         100.00  
      37 |        1 | t                   |        20833547 |       20833547 |         100.00  
      38 |        1 | t                   |        20833285 |       20833285 |         100.00  
      44 |        1 | t                   |        20833413 |       20833413 |         100.00  
      40 |        1 | t                   |        20833352 |       20833352 |         100.00  
      27 |        1 | t                   |        20833464 |       20833464 |         100.00  
      33 |        1 | t                   |        20833272 |       20833272 |         100.00  
      43 |        1 | t                   |        20833230 |       20833230 |         100.00  
      18 |        1 | t                   |        20833330 |       20833330 |         100.00  
      46 |        1 | t                   |        20833467 |       20833467 |         100.00  
      21 |        1 | t                   |        20833247 |       20833247 |         100.00  
(48 rows)  

主备回收空间,新建表,写入数据,交换分区

postgres=# create table tbl_pos_1_prt_p5_exchange(like tbl_pos_1_prt_p5) with (appendonly=true, orientation=row, compresstype=zlib, checksum=false) distributed by (uid);  
CREATE TABLE  
  
  
postgres=# begin;  
BEGIN  
  
postgres=# lock table tbl_pos_1_prt_p5 in ACCESS EXCLUSIVE mode;  
LOCK TABLE  
  
postgres=# insert into tbl_pos_1_prt_p5_exchange select * from tbl_pos_1_prt_p5;  
  
  
-- 注意只调用一次,多次调用会被多次交换(交互偶数次就等于没有改变了)。  
postgres=#  alter table tbl_pos alter partition p5 EXCHANGE PARTITION p5 WITH TABLE tbl_pos_1_prt_p5_exchange with VALIDATION;  
NOTICE:  exchanged partition "p5" of partition "p5" of relation "tbl_pos" with relation "tbl_pos_1_prt_p5_exchange"  
ALTER TABLE  
  
postgres=# drop table tbl_pos_1_prt_p5_exchange;  
DROP TABLE  
  
postgres=# end;  

小结

Greenplum提供给了一种方法:交换数据、交换分区。可以平滑的处理 清理垃圾、调整分布键、分区键、修改存储模式等。

参考

http://greenplum.org/docs/510/ref_guide/sql_commands/ALTER_TABLE.html

http://www.dbaref.com/choosing-the-table-storage-model-in-greenplum

《如何检测、清理Greenplum膨胀、垃圾 - 阿里云HybridDB for PG最佳实践》

《Greenplum 列存表(AO表)的膨胀、垃圾检查与空间收缩》

《Greenplum 行存、列存,堆表、AO表的原理和选择》

《Greenplum 最佳实践 - 数据分布黄金法则 - 分布列与分区的选择》

《HybridDB PostgreSQL "Sort、Group、distinct 聚合、JOIN" 不惧怕数据倾斜的黑科技和原理 - 多阶段聚合》

《分布式DB(Greenplum)中数据倾斜的原因和解法 - 阿里云HybridDB for PostgreSQL最佳实践》

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍如何基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
存储 区块链 CDN
推荐几个免费好用的图床
图床一直以来都是很多人的刚性需求,无论是站长,还是平时逛论坛、写博客的用户,再或者是做外贸生意都可能需要用到图床。好用的图床提供了长期稳定存储且高访问能力的图片托管能力,这也是大家选择图床的核心因素。所以在这里我推荐几款免费且超好用的图床。
2194 0
|
2月前
|
存储 设计模式 人工智能
从无状态到有状态:长时运行 Agent 的 5 种架构模式
本文详解长时运行AI Agent的5大生产级架构模式:Checkpoint-and-Resume实现断点续传;Delegated Approval支持原地暂停与人机协同;Memory-Layered Context分层管理长期记忆与工作记忆;Ambient Processing赋能无提示事件驱动;Fleet Orchestration实现多Agent协同治理——让Agent真正成为可靠、有状态、可运维的系统进程。
329 3
从无状态到有状态:长时运行 Agent 的 5 种架构模式
|
7月前
|
云安全 人工智能 安全
|
2月前
|
安全 JavaScript 数据安全/隐私保护
零基础搞定OpenClaw(小龙虾)Windows部署:最新安装包+全报错解决方案
零基础Windows部署OpenClaw(小龙虾)教程:提供最新一键安装包(v2.6.2),内置Git/Node.js/Python等全部依赖,5–10分钟可视化完成。覆盖杀毒拦截、路径报错、解压失败等全场景解决方案,支持微信/飞书联动、本地运行保安全。
|
人工智能 云计算
阿里云产品手册2025版发布
阿里云产品手册2025版发布,涵盖阿里云产品大图、面向 AI 的全栈云计算产品体系等最新内容,囊括了阿里云产品介绍、优势、功能、应用场景和发展历程的介绍。
1708 12
|
6月前
|
负载均衡 应用服务中间件 Nacos
Nacos配置中心
本章介绍Nacos作为配置中心的实现,涵盖配置管理、热更新、共享配置及优先级规则,并演示Nacos集群搭建与高可用部署,解决微服务架构下配置统一管理与动态更新难题。
|
监控 搜索推荐 调度
【内测征集】LarkVR 播控系统上新:VR 应用一站式专业播控与管理工具
Paraverse 平行云专为 VR 应用场景运营方打造的一站式专业播控与管理工具LarkVR播控系统,基于LarkXR实时云渲染平台方案,整合VR空间运营、播控管理、设备运营等场景衍生的VR管理工具,旨在帮助其高效管理设备和运营自己的业务,提升整体运营效率与用户体验。 VR 运营者不再局限于单一终端的串流业务系统,通过 LarkVR播控系统,一键部署前后端业务系统,运营人员使用 pad即可完成快捷操作。支持本地内网部署,支持个性化需求定制接入,兼容市面上主流的各品牌 VR 一体机。
405 12
|
数据挖掘 BI 数据安全/隐私保护
|
数据可视化 数据挖掘 数据处理
Pandas高级数据处理:窗口函数
Pandas 是 Python 中强大的数据分析库,窗口函数(如 `rolling`、`expanding` 和 `ewm`)用于滚动计算、累积计算等。本文介绍窗口函数的基本概念、代码示例及常见问题解决方法,帮助读者灵活运用这些工具进行数据分析。通过合理选择窗口大小、处理边界值和缺失数据,以及优化性能,充分发挥窗口函数的优势。
524 27