Best Practices for Data Migration from MaxCompute to HybridDB for PostgreSQL

Summary: Users often struggle to migrate data from MaxCompute to HybridDB. This article describes five steps that help you migrate data conveniently.

Introduction

Both MaxCompute and HybridDB use a cluster architecture consisting of multiple data nodes. To migrate data from MaxCompute to HybridDB efficiently and with high throughput, all of these data nodes must push data in parallel. You can achieve this by using the ability of both MaxCompute and HybridDB to read and write data from and to Alibaba Cloud Object Storage Service (OSS).

You also need a common data format for exchanging data on OSS. MaxCompute supports writing data in text format (TEXT/CSV), and HybridDB supports reading data in text format.

Based on my experience, I will share some best practices for a successful data migration from MaxCompute to HybridDB.

Steps to Migrate Data from MaxCompute to HybridDB

Step 1: Create an External MaxCompute OSS Table

Begin by creating an external table with the same structure as the MaxCompute data table. This table opens the data channel between MaxCompute and OSS. The following SQL statement creates the table.

CREATE EXTERNAL TABLE `demo_oss_ext` (
    id string COMMENT 'id',
    data1 string COMMENT 'data1',
    data2 string COMMENT 'data2'
)
PARTITIONED BY (ds string)
STORED BY 'com.alibabacloud.maxcompute.TextStorageHandler'
WITH SERDEPROPERTIES ('maxcompute.text.option.delimiter'='\t')
LOCATION 'oss://id:key@endpoint/bucketname/oss_dir/';

Key Parameters:

1. The STORED BY clause with com.alibabacloud.maxcompute.TextStorageHandler defines the format in which the data is stored in OSS.

• Text Storage Handler is developed in Java and is the default option for data delivery.
• The default Text Storage Handler does not support the complete TEXT/CSV protocol. If you need complete TEXT/CSV support, you can use an open-source Java CSV library.

2. Text Storage Handler supports two custom parameters:

• maxcompute.text.option.delimiter specifies the column delimiter.
• maxcompute.text.option.use.quote defines the quote character.
• NULL columns are written as \N by default, and you cannot change this.
• Text Storage Handler does not support escaping special characters. To escape special characters, you need a custom handler.

3. LOCATION specifies the OSS account and path to which the data is delivered: the AccessKey ID, AccessKey secret, endpoint, bucket, and directory. An illustrative example with placeholder values follows.
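For illustration, a fully filled-in LOCATION clause might look like the line below. The AccessKey pair, endpoint, bucket, and directory are all placeholder values, not real credentials.

-- Placeholder values only: substitute your own AccessKey ID/secret,
-- OSS endpoint, bucket name, and directory.
LOCATION 'oss://LTAIxxxxxxxx:secretxxxxxxxx@oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/oss_dir/';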

Step 2: Migrate Data from the External Table to OSS

Use the following SQL statement to migrate data from MaxCompute to OSS.

insert into demo_oss_ext select * from t_data;

Note:

• This operation runs in parallel; the default size of each concurrent unit is 256 MB. You can set a smaller value with set maxcompute.sql.mapper.split.size=xxx; to increase the concurrency, as in the sketch after these notes.
• OSS flow control affects the data transfer from MaxCompute to OSS; the OSS network bandwidth for a single concurrent unit is about 100 MB/s.
• To increase the bandwidth further, contact the OSS administrator to lift the restriction.
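As a sketch of these knobs combined, assuming the source table t_data is partitioned by ds just like the external table, and with 64 MB and '20180101' as example values:

-- Reduce the per-mapper split size (in MB) to raise export concurrency.
set maxcompute.sql.mapper.split.size=64;
-- Export one partition through the external table to OSS.
insert into table demo_oss_ext partition (ds='20180101')
select id, data1, data2 from t_data where ds='20180101';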

Step 3: Create a HybridDB External Table

Next, create a readable external table in HybridDB through the oss_ext extension, pointing at the same OSS location. The following SQL statement creates the table.

CREATE READABLE EXTERNAL TABLE user_data_oss_ext (
    id int8,
    data1 text,
    data2 text
)
LOCATION('oss://endpoint
    dir=data_oss_dir
    id=ossid
    key=osskey
    bucket=bucketname')
FORMAT 'TEXT' (DELIMITER '\t')
LOG ERRORS INTO error_track_table SEGMENT REJECT LIMIT 10;

Key Parameters:

1. LOCATION specifies all the OSS-related parameters: endpoint, directory, credentials, and bucket.
2. The file format must match that of the MaxCompute external table: FORMAT 'TEXT' (DELIMITER '\t').
3. Configure the import to skip invalid rows:

• During the migration of heterogeneous data, some rows may fail verification, for example because they contain special characters or invalid encoding.
• LOG ERRORS INTO error_track_table writes the rows that throw errors into a table; a query sketch for inspecting it follows this list.
• SEGMENT REJECT LIMIT X sets the threshold of errors permitted in a single segment, either as an absolute number or as a percentage.

4. HybridDB also imports data in parallel; the number of parallel executions equals the number of computing nodes.
5. Importing TEXT/CSV data in gzip format can improve performance by more than 100%, provided that MaxCompute exports the data in compressed form.
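After an import, you can look at the rejected rows to decide whether they need cleanup upstream. This is a minimal sketch, assuming the error_track_table name from the DDL above and the standard Greenplum-style error table columns (cmdtime, linenum, errmsg, rawdata):

-- Show recently rejected rows together with the reason each one failed.
SELECT linenum, errmsg, rawdata
FROM error_track_table
ORDER BY cmdtime DESC
LIMIT 20;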

Step 4: Execute Column Compression for HybridDB Local Tables

The next step is to create a local HybridDB table that stores data column-oriented and compressed. The following SQL statement creates the table.

CREATE TABLE t_ao (
    id int8,
    data1 text,
    data2 text
)
WITH (
    APPENDONLY=true,
    COMPRESSTYPE=zlib,
    COMPRESSLEVEL=5,
    BLOCKSIZE=2097152,
    ORIENTATION=COLUMN,
    CHECKSUM=true,
    OIDS=false
)
DISTRIBUTED BY (id);

Key Parameters:

1. If you do not need to make large-scale modifications to the data after importing it into HybridDB, I suggest using append-only (AO), column-oriented storage and compressing it.

• Use the following parameter settings: APPENDONLY=true, COMPRESSTYPE=zlib, COMPRESSLEVEL=5, ORIENTATION=COLUMN, BLOCKSIZE=2097152.
• HybridDB supports column compression, which offers a much higher compression ratio than row compression. With COMPRESSLEVEL=5, the compressed data can easily shrink to 20% of its original size.

2. Use DISTRIBUTED BY (column) to distribute data evenly across the HybridDB computing nodes. Even distribution is the key criterion when selecting a distribution column; the query sketch below checks for skew.
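A quick way to verify that the chosen distribution column spreads rows evenly is to count rows per segment once the data is loaded in Step 5. This is a standard Greenplum-style check using the gp_segment_id system column:

-- Row counts per segment; large differences indicate skew
-- caused by a poorly chosen distribution column.
SELECT gp_segment_id, count(*) AS rows_per_segment
FROM t_ao
GROUP BY gp_segment_id
ORDER BY rows_per_segment DESC;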

Step 5: Import Data from OSS to HybridDB

Use the following SQL statement to import data from OSS to HybridDB:

insert into t_ao select * from user_data_oss_ext;
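After the load completes, it is worth refreshing the optimizer statistics and sanity-checking the row count against the MaxCompute source; this is standard PostgreSQL practice rather than anything specific to oss_ext:

-- Refresh planner statistics for the newly loaded table,
-- then compare the row count with the source table in MaxCompute.
ANALYZE t_ao;
SELECT count(*) FROM t_ao;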

Both HybridDB and PostgreSQL support reading and writing data from and to OSS

Similar to AWS S3, OSS is a low-cost storage service that opens communication channels between all Alibaba Cloud products. It is also Alibaba Cloud's recommended cloud data channel.

Both PostgreSQL and HybridDB on Alibaba Cloud currently support reading from and writing to OSS data sources.

• PostgreSQL + OSS: read and write external data sources through oss_fdw.
• HybridDB for PostgreSQL + OSS: import and export data in parallel through oss_ext.

Conclusion

I hope that the steps described in this article help you successfully migrate data from MaxCompute to HybridDB.

References (articles in Chinese language):

1. PostgreSQL + OSS oss_fdw
2. HybridDB for PostgreSQL + OSS oss_ext
3. SLS supports delivery of CSV data to OSS
4. Formatting open-source Java data
5. Export data from MaxCompute to OSS
6. How to access OSS from MaxCompute
