Optimizing Complex Data Distribution in MaxCompute

本文涉及的产品
云原生大数据计算服务 MaxCompute,5000CU*H 100GB 3个月
云原生大数据计算服务MaxCompute,500CU*H 100GB 3个月
简介: For a long time, data distribution has been an issue in the field of Big Data processing. Unfortunately, the Big Data processing systems that are popular today do not satisfactorily solve the issue.

For a long time, data distribution has been an issue in the field of Big Data processing. Unfortunately, the Big Data processing systems that are popular today do not satisfactorily solve the issue. In the all-new optimizer for Maxcompute2.0, we introduced complex data distribution. In this version we added new optimization measures like partition pruning, distribution pull up and push down, and distribution alignment. This article begins with the principle and history of data distribution, explaining our thoughts and solutions on the matter.

Understanding Data Distribution

For many people, bringing up data distribution arouses thoughts of MPP DBMS. In fact, we often say that one only needs to think about data distribution when using MPP DBMS. First, let’s take a look at the categorization of databases:

  • Shared Everything: The difference between this type and the following two is that this one is typically not distributed.
  • Shared Disk: The database server is horizontally expandable and does not have storage of its own. It uses SAN or NAS technology to connect to the backend where it can horizontally expand across unified storage. The network connection on this layer as well as the expandability of the database server limit the Shared Disk. Oracle RAC and other commercial distributed databases belong to this category.
  • Shared Nothing: Unlike Shared Disk, this type of framework utilizes co-located database servers and storage on the same node. This means that physical nodes do not share anything, significantly decreasing network IO. MPP DBMS and Hadoop belong to this category.

    123

Obviously, when deploying a Shared Nothing database, you need to carefully consider data distribution. You first need to know how to distribute data across different physical nodes (since this database doesn’t place data into unified storage unlike the Shared Disk system) to reduce the demands of future operations. For example, in Greenplum, one has to define a partition key when building a table, after which the system will distribute data according to key (hash). If we partition the two tables in a Join operation according to join key, then the Join operation will not require network IO. If one of the involved tables uses a different group of partition keys, then a re-partitioning operation may be necessary.

This is precisely why we need to understand the principle behind data distribution, as it can be critical to the application and system optimization. There is a significant amount of information available concerning data distribution on MPP DBMS. But why don’t these kinds of optimizations exist for data processing systems like Hadoop? Simply put, it’s because then we need stronger expandability (and support for unstructured data, but we won’t go into that).

The difference is that MPP and Hadoop don’t place data and computing on the same node. Even if we were to do so, it would limit the system expandability. Dynamic expandability especially suffers. Take into consideration a group of 50 currently operating Greenplum clusters. It would be nearly impossible to quickly add, for example, two new nodes and still maintain efficient operation. Hadoop is very good at this, the main solution being:

  • Separation of storage and computing
  • Centralized settings support highly efficient peer to peer reading and writing (HDFS)

    This is why, when you create a table in Hive, you don't need to define a partition key like in Greenplum. Also, this explains why Join operations are less efficient in Hive than they are in Greenplum.

The Goal of Data Distribution Optimization

As described above, Big Data distribution systems often trend toward random distribution regarding storage, increasing expandability at the cost of performance. However, re-examining this trade-off, using random distribution in storage doesn’t mean that we can't take advantage of data distribution optimized searches. The goal of distribution optimization is to utilize already existing distribution and satisfy future demands to the furthest extent possible. This kind of optimization includes:

  • Partition Trimming
    Using the characteristics of data distribution, we can use partition trimming to reduce data reads. For example, we can apply partition trimming to hash distribution for point queries, and range distribution for interval queries.
  • Elimination Redistribution
    If the current distribution meets the requirements of future algorithms, we can eliminate extra redistribution operations. It is common knowledge that redistribution (called shuffle in Hadoop) is the primary resource consumer in distribution algorithms.
  • Avoiding Data Skew
    We can use better data distribution algorithms to avoid data skew. For example, if we have some frequently repeating values (end-biased) in a data cluster, we can use range distribution rather than hash distribution. This will effectively help you in avoiding the performance loss caused by data skew.

Types of Data Distribution

The following are examples of data distribution types and their meanings:

234


Data Conversion Relationship

345_jpeg


Implementation

While adhering to Volcano optimizer syntax, we can turn distribution properties into a kind of physical property called distribution. Like other properties, it has ‘required property’ and ‘delivered property’ pairs. For example, for sorted merge join, it will apply a Partial Ordered required property to all input. At the same time, it will deliver a Partial Ordered property which gives following operations a chance to use this property and avoid a round of redistribution.

Consider the below query:
  SELECT uid FROM user JOIN line ON user.uid = line.uid
) GROUP BY uid

At this point, if Join becomes a Sorted Merge Join, it may deliver a Hash[uid] property, which is required by Aggregate, then we can skip an unnecessary round of redistribution.

If we want to apply a similar optimization, then we need to take into consideration the below issues:

  • Characteristics of aggregated distribution
  • (Local relational algebraic compilation) select a suitable distribution property
  • (Total cost calculation) avoid using the wrong distribution property

Characteristics of Aggregated Distribution

There are three ways to generate data distribution:

  • User defined: Just like MPP, we can introduce a partition key to DDL, allowing the user to specify the data distribution. Of course, unlike MMP, this kind of distribution only requires an index structure for a distributed file system and cannot connect physical nodes.
  • SQL logic: SQL logic could generate data distribution when being run. For example, a ‘distribute by’ statement declares that SQL logic would distribute data upon running.
  • Secondary application of algorithms: Each distribution algorithm could create a distribution upon running. For example, sorted merge join can ensure that its output data meets the ordering and hash distribution requirements of join keys.

Several algorithms require a special data distribution:

  • Aggregate: Sorted Aggregate requires grouping key and Hash distribution.
  • Join: Sorted Merge Join and Hash Join both require inputting the same Hash Distribution and join key.
  • Sort: ‘Order by’ requires Range distribution on sort key or Singleton distribution.

Choosing the Appropriate Distribution Property

Even if we have a series of required and delivered distribution properties, it’s still not easy to determine the kind of distribution needed for each operation. Unlike ordering properties (only includes row sequences and ascending or descending order properties), distribution properties vary significantly. The reason for this variance is:

  • To provide different options to satisfy distribution requirements. For example, the aggregate group by a, b, c carries a Partial Ordered requirement for input data. Its ordering requirements are that a, b, and c be ordered. However, Hash (a), Hash (b), Hash (a, b), Hash (a, b, c), or RNG(a) could all satisfy the distribution requirements.
  • One can use a variety of options to achieve distribution. For example, in the join operation, join a and b on a.id = b.id, if a is subject to Hashid, and b is subject to Hashid, then for Sorted Merge Join, you can choose to require Hashid, Hashid, or any Hash(id) really.

    The complexity leaves more room for finding the most optimal method. In reality, finding the most optimal method is a question of the NPC of the number of relational algebra numbers. To reduce the space that under the search, we use a heuristic branch selection algorithm. When compiling a relational algebra, we not only need to satisfy the needs of subsequent operations, we also need to think about the probability that prior operations will be able to produce satisfactory distribution. To realize the latter, one can use a module called Pulled Up Property.

666_jpeg


Pulled Up Property guesses and screens for possible preliminary delivered properties that we can use to narrow down searches during compilation. Consider the query in the above image. When compiling Join, because Sink requires a push-down operation, it needs to provide a Hashc1. Pulled Up Property then guesses that prior operations will possibly provide Hashc1 and Hashc1. Considering that Join will directly require Hashc1, it then reduces the Hashc1 and Hashc1 branches.

Avoiding Improper Distribution Properties

Data skew occurs when we store the majority of data on a minority of nodes during data distribution. It reduces the entire algorithm to single machine operation. Under Partitioning occurs when we specify very few nodes during distribution. It means that one cannot efficiently utilize the distribution resources. Of course, we hope to avoid these two situations.

These avoid these situations we need better statistical information. When an optimization plan encounters Data Skew or Under Partitioning, we need to apply the proper penalty to its cost estimation, decreasing its selection likelihood in the future. We define “good” distribution as one where the amount of data processed by each node falls within a certain pre-defined range. If data processing for a single node is lower or higher than this range, then it should penalize the distribution. Factors that go into estimating this data volume include:

  • Row count (number of input records)
  • Top values (data with the most repetitions)
  • Histogram

    Summary

In this article, we have gone over the significance of data distribution optimization and explained how to optimize data distribution in MaxCompute. We have already embodied these optimizations in the latest release of MaxCompute.

Looking at our tests, the effects of the optimization are undeniable. After applying to appropriate partitioning to TPC-H, we see that overall performance increasing by order of 20%. Even if do not partition the data on the table, partition optimization while running transparently to the user is very effective. When running in an online environment, 14% of queries were able to skip a step of redistribution because of these optimizations.

如需了解更多关于MaxCompute, 欢迎加入“MaxCompute咨询”钉钉群,群号:11782920,或扫码入群

222

相关实践学习
基于MaxCompute的热门话题分析
本实验围绕社交用户发布的文章做了详尽的分析,通过分析能得到用户群体年龄分布,性别分布,地理位置分布,以及热门话题的热度。
SaaS 模式云数据仓库必修课
本课程由阿里云开发者社区和阿里云大数据团队共同出品,是SaaS模式云原生数据仓库领导者MaxCompute核心课程。本课程由阿里云资深产品和技术专家们从概念到方法,从场景到实践,体系化的将阿里巴巴飞天大数据平台10多年的经过验证的方法与实践深入浅出的讲给开发者们。帮助大数据开发者快速了解并掌握SaaS模式的云原生的数据仓库,助力开发者学习了解先进的技术栈,并能在实际业务中敏捷的进行大数据分析,赋能企业业务。 通过本课程可以了解SaaS模式云原生数据仓库领导者MaxCompute核心功能及典型适用场景,可应用MaxCompute实现数仓搭建,快速进行大数据分析。适合大数据工程师、大数据分析师 大量数据需要处理、存储和管理,需要搭建数据仓库?学它! 没有足够人员和经验来运维大数据平台,不想自建IDC买机器,需要免运维的大数据平台?会SQL就等于会大数据?学它! 想知道大数据用得对不对,想用更少的钱得到持续演进的数仓能力?获得极致弹性的计算资源和更好的性能,以及持续保护数据安全的生产环境?学它! 想要获得灵活的分析能力,快速洞察数据规律特征?想要兼得数据湖的灵活性与数据仓库的成长性?学它! 出品人:阿里云大数据产品及研发团队专家 产品 MaxCompute 官网 https://www.aliyun.com/product/odps 
目录
相关文章
|
12月前
|
存储 分布式计算 运维
【2023云栖】刘一鸣:Data+AI时代大数据平台建设的思考与发布
本文根据2023云栖大会演讲实录整理而成,演讲信息如下: 演讲人:刘一鸣 | 阿里云自研大数据产品负责人 演讲主题:Data+AI时代大数据平台应该如何建设
102153 15
|
7天前
|
存储 NoSQL 大数据
大数据中数据存储 (Data Storage)
【10月更文挑战第17天】
15 2
|
7天前
|
数据采集 算法 大数据
大数据中数据清洗 (Data Cleaning)
【10月更文挑战第17天】
25 1
|
存储 SQL 分布式计算
MaxCompute(原名ODPS,全称Open Data Processing Service)
MaxCompute(原名ODPS,全称Open Data Processing Service)是阿里云开发的一种云原生数据处理和分析服务。它提供了强大的数据计算和处理能力,支持海量数据的存储、计算、分析和挖掘,并且具有高可靠、高性能、高可扩展、高安全等优势,适用于各种数据处理和分析场景。
1138 0
|
运维 Oracle 关系型数据库
【大数据开发运维解决方案】Oracle Data Redaction数据加密测试
最近有个做Java开发的网友问我,怎么在Oracle进行数据加密呢?我给他推荐了Data Redaction。Oracle Database 12c中加入了Data Redaction这个新的安全特性。当然在11g的Database Advanced Security Administrator’s Guide官方文档中就介绍了。
【大数据开发运维解决方案】Oracle Data Redaction数据加密测试
|
大数据
阿里云大数据ACP(二)数据集成 Data Integration 2
阿里云大数据ACP(二)数据集成 Data Integration 2
169 0
阿里云大数据ACP(二)数据集成 Data Integration 2
|
DataWorks 安全 数据可视化
阿里云大数据ACP(二)数据集成 Data Integration 1
阿里云大数据ACP(二)数据集成 Data Integration 1
478 0
阿里云大数据ACP(二)数据集成 Data Integration 1
|
消息中间件 SQL 分布式计算
IDEA 中使用 Big Data Tools 连接大数据组件
简介 Big Data Tools 插件可用于 Intellij Idea 2019.2 及以后的版本。它提供了使用 Zeppelin,AWS S3,Spark,Google Cloud Storage,Minio,Linode,数字开放空间,Microsoft Azure 和 Hadoop 分布式文件系统(HDFS)来监视和处理数据的特定功能。 下面来看一下 Big Data Tools 的安装和使用,主要会配置 Flink,Kafka 和 HDFS。
IDEA 中使用 Big Data Tools 连接大数据组件
|
存储 数据采集 人工智能
初始大数据(Big Data)开发
大数据(big data),或称巨量资料,指的是所涉及的资料量规模巨大到无法透过目前主流软件工具,在合理时间内达到撷取、管理、处理、并整理成为帮助企业经营决策更积极目的的资讯。主要解决的是对海量数据的存储以及海量数据的计算分析问题
初始大数据(Big Data)开发
|
存储 分布式计算 大数据
阿里云大数据ACP认证知识点梳理9——产品特点(DATA WORKS)
DATA WORKS(原DATA IDE) 产品特点及重点注意事项
3260 1

相关产品

  • 云原生大数据计算服务 MaxCompute