Hash Join

本文涉及的产品
云数据库 RDS SQL Server,基础系列 2核4GB
简介:

Hash Join

When it comes to physical join operators, hash join does the heavy lifting.  While nested loops join works well with relatively small data sets and merge join helps with moderately sized data sets, hash join excels at performing the largest joins.  Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.  (I’ll discuss parallel query execution in future series of posts.)

Hash join shares many characteristics with merge join.  Like merge join, it requires at least one equijoin predicate, supports residual predicates, and supports all outer and semi-joins.  Unlike merge join, it does not require ordered input sets and, while it does support full outer join, it does require an equijoin predicate.

The algorithm

The hash join executes in two phases: build and probe.  During the build phase, it reads all rows from the first input (often called the left or build input), hashes the rows on the equijoin keys, and creates an in-memory hash table.  During the probe phase, it reads all rows from the second input (often called the right or probe input), hashes these rows on the same equijoin keys, and looks or probes for matching rows in the hash table.  Since hash functions can lead to collisions (two different key values that hash to the same value), we typically must check each potential match to ensure that it really joins.

In pseudo-code:

for each row R1 in the build table
    begin
        calculate hash value on R1 join key(s)
        insert R1 into the appropriate hash bucket
    end
for each row R2 in the probe table
    begin
        calculate hash value on R2 join key(s)
        for each row R1 in the corresponding hash bucket
            if R1 joins with R2
                return (R1, R2)
    end

Note that unlike the nested loops and merge joins which immediately begin flowing output rows, the hash join is blocking on its build input.  That is, it must completely read and process its entire build input before it can return any rows.  Moreover, unlike the other join methods, the hash join requires a memory grant to store the hash table.  Thus, there is a limit to the number of concurrent hash joins that SQL Server can run at any given time.  While these characteristics are generally not a problem for data warehouses, they are undesirable for most OLTP applications.

Memory and spilling

Before a hash join begins execution, SQL Server tries to estimate how much memory it will need to build its hash table.  We use the cardinality estimate for the size of the build input along with the expected average row size to estimate the memory requirement.  To minimize the memory required by the hash join, we try to choose the smaller of the two tables as the build table.  We then try to reserve this much memory to ensure that the hash join can successfully execute.

What happens if we grant the hash join less memory than it requests or if the estimate is too low?  In these cases, the hash join may run out of memory during the build phase.  If the hash join runs out of memory, it begins spilling a small percentage of the total hash table to disk (to a workfile in tempdb).  The hash join keeps track of which “partitions” of the hash table are still in memory and which ones have been spilled to disk.  As we read each new row from the build table, we check to see whether it hashes to an in-memory or an on-disk partition.  If it hashes to an in-memory partition, we proceed normally.  If it hashes to an on-disk partition, we write the row to disk.  This process of running out of memory and spilling partitions to disk may repeat multiple times until the build phase is complete.

We perform a similar process during the probe phase.  For each new row from the probe table, we check to see whether it hashes to an in-memory or an on-disk partition.  If it hashes to an in-memory partition, we probe the hash table, produce any appropriate joined rows, and discard the row.  If it hashes to an on-disk partition, we write the row to disk.  Once we complete the first pass of the probe table, we return one by one to any partitions that we spilled, read the build rows back into memory, reconstruct the hash table for each partition, and then read the corresponding probe partitions and complete the join.

Left deep vs. right deep vs. bushy hash join trees

These terms refer to the shape of the query plan as illustrated by this figure:

The shape of the join tree is particularly interesting for hash joins as it affects the memory consumption.

In a left deep tree, the output of one hash join is the build input to the next hash join.  Because hash joins consume their entire build input before moving to the probe phase, in a left deep tree only adjacent pairs of hash joins are active at the same time.  For example, in the above picture, we begin by building the hash table for HJ1.  When HJ1 begins probing, we use the output of HJ1 to build the hash table for HJ2.  When HJ1 is done probing, we can release the memory used by its hash table.  Only then do we begin probing HJ2 and building the hash table for HJ3.  Thus, HJ1 and HJ3 are never active at the same time and can share the same memory grant.  The total amount of memory we need is max(HJ1 + HJ2, HJ2 + HJ3).

In a right deep tree, the output of one hash join is the probe input to the next hash join.  All of the hash joins must build their complete hash tables before we can begin probing.  All of the hash joins are active at once and cannot share memory.  When we do begin probing, rows flow up the entire tree of hash joins without blocking.  Thus, the total amount of memory we need is HJ1 + HJ2 + HJ3.

Examples

The following examples use this schema:

create table T1 (int, b int, x char(200))

create table T2 (int, b int, x char(200))

create table T3 (int, b int, x char(200))

 

set nocount on

declare @i int

set @i = 0

while @i < 1000

  begin

    insert T1 values (@i * 2, @i * 5, @i)

    set @i = @i + 1

  end

 

declare @i int

set @i = 0

while @i < 10000

  begin

    insert T2 values (@i * 3, @i * 7, @i)

    set @i = @i + 1

  end

 

declare @i int

set @i = 0

while @i < 100000

  begin

    insert T3 values (@i * 5, @i * 11, @i)

    set @i = @i + 1

  end

Here is a simple example:

select *

from T1 join T2 on T1.= T2.a  

Rows

Executes

 

334

1

  |--Hash Match(Inner Join, HASH:([T1].[a])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[a]))

1000

1

       |--Table Scan(OBJECT:([T1]))

10000

1

       |--Table Scan(OBJECT:([T2]))

Notice that the T2 has ten times as many rows as T1 and indeed the optimizer chooses to use T1 as the build table and T2 as the probe table.

Now consider this three table join:

select *

from (T1 join T2 on T1.= T2.a)

    join T3 on T1.= T3.a

Rows

Executes

 

334

1

  |--Hash Match(Inner Join, HASH:([T1].[b])=([T3].[a]), RESIDUAL:([T1].[b]=[T3].[a]))

334

1

       |--Hash Match(Inner Join, HASH:([T1].[a])=([T2].[a]), RESIDUAL:([T1].[a]=[T2].[a]))

1000

1

       |    |--Table Scan(OBJECT:([T1]))

10000

1

       |    |--Table Scan(OBJECT:([T2]))

100000

1

       |--Table Scan(OBJECT:([T3]))

Note that the optimizer has selected a left deep plan.  First, we join the two small tables, T1 and T2.  The results of this join yield only 334 rows which we use to build a hash table before joining with the large table T3.

Now observe what happens if we add a predicate to restrict the size of the smaller two tables.  (A single where clause suffices; the optimizer can derive “T2.a < 100” from “T1.a < 100” and “T1.a = T2.a”.)

select *

from (T1 join T2 on T1.= T2.a)

    join T3 on T1.= T3.a

where T1.< 100

Rows

Executes

 

17

1

  |--Hash Match(Inner Join, HASH:([T2].[a])=([T1].[a]), RESIDUAL:([T1].[a]=[T2].[a]))

34

1

       |--Table Scan(OBJECT:([T2]), WHERE:([T2].[a]<(100)))

50

1

       |--Hash Match(Inner Join, HASH:([T1].[b])=([T3].[a]), RESIDUAL:([T1].[b]=[T3].[a]))

50

1

            |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]<(100)))

100000

1

            |--Table Scan(OBJECT:([T3]))

This time the optimizer selected a right deep plan.  T1 and T2 are now so small (34 and 50) rows that it is better to build a hash table on these two tables and probe using the large table T3 than it is to build a hash table on an intermediate hash join result.

What next?

Now that I’ve given an overview of how each of the three physical join operators works, in my next post (or two) I plan to summarize the different characteristics of these operators and to give more examples to show how SQL Server makes various tradeoffs when deciding how to join tables.

分类:  SqlServer

本文转自快乐就好博客园博客,原文链接:http://www.cnblogs.com/happyday56/archive/2009/09/10/1564139.html,如需转载请自行联系原作者
相关文章
|
SQL 存储 分布式计算
【Hive】(二十三)简单几招教你如何解决 Hive 中小文件过多的问题
【Hive】(二十三)简单几招教你如何解决 Hive 中小文件过多的问题
1928 0
|
SQL 分布式计算 大数据
MAXCOMPUTE和ODPS的区别是什么?
MAXCOMPUTE和ODPS的区别是什么?
1129 1
|
存储 SQL 大数据
一篇文章搞懂数据仓库:三种事实表(设计原则,设计方法、对比)
一篇文章搞懂数据仓库:三种事实表(设计原则,设计方法、对比)
一篇文章搞懂数据仓库:三种事实表(设计原则,设计方法、对比)
|
9月前
|
存储 分布式计算 Java
踏上大数据第一步:flume
Flume 是一个分布式、可靠且高效的系统,用于收集、聚合和移动大量日志数据。它是 Apache 顶级项目,广泛应用于 Hadoop 生态系统中。Flume 支持从多种数据源(如 Web 服务器、应用服务器)收集日志,并将其传输到中央存储(如 HDFS、HBase)。其核心组件包括 Source、Channel 和 Sink,分别负责数据获取、临时存储和最终存储。本文还介绍了在 Ubuntu 20.04 上安装 Flume 1.9.0 的步骤,涵盖 JDK 安装、Flume 下载、解压、配置环境变量及验证安装等详细过程。
196 10
|
9月前
|
存储 SQL 缓存
记录一次holo视图与物化视图的区别
本文介绍了Hologres中视图与物化视图的区别及应用场景。视图是一种虚拟表,不存储数据,查询时动态生成结果集,适用于简化查询、数据抽象等场景。物化视图则预先计算并存储查询结果,查询速度快,适合加速查询、离线数据分析等场景。文章通过实例详细说明了两者的使用方式及性能考量,并探讨了如何根据具体需求选择合适的视图类型。
272 16
|
8月前
|
SQL 安全 前端开发
预编译为什么能防止SQL注入?
SQL注入是Web应用中常见的安全威胁,攻击者通过构造恶意输入执行未授权的SQL命令。预编译语句(Prepared Statements)是一种有效防御手段,它将SQL代码与数据分离,确保用户输入不会被解释为SQL代码的一部分。本文详细介绍了SQL注入的危害、预编译语句的工作机制,并结合实际案例和多语言代码示例,展示了如何使用预编译语句防止SQL注入,强调了其在提升安全性和性能方面的重要性。
|
SQL 分布式计算 大数据
"大数据计算难题揭秘:MaxCompute中hash join内存超限,究竟该如何破解?"
【8月更文挑战第20天】在大数据处理领域,阿里云的MaxCompute以高效稳定著称,但复杂的hash join操作常导致内存超限。本文通过一个实例解析此问题:数据分析师小王需对两个共计300GB的大表进行join,却遭遇内存不足。经分析发现,单个mapper任务内存默认为2GB,不足以支持大型hash表的构建。为此,提出三种解决方案:1) 提升mapper任务内存;2) 利用map join优化小表连接;3) 实施分而治之策略,将大表分割后逐一处理再合并结果。这些方法有助于提升大数据处理效率及稳定性。
344 0
|
分布式计算 DataWorks API
DataWorks操作报错合集之如何解决API调用报400,文件夹找不到的错误
DataWorks是阿里云提供的一站式大数据开发与治理平台,支持数据集成、数据开发、数据服务、数据质量管理、数据安全管理等全流程数据处理。在使用DataWorks过程中,可能会遇到各种操作报错。以下是一些常见的报错情况及其可能的原因和解决方法。
|
SQL 运维 DataWorks
Flink CDC在阿里云DataWorks数据集成应用实践
本文整理自阿里云 DataWorks 数据集成团队的高级技术专家 王明亚(云时)老师在 Flink Forward Asia 2023 中数据集成专场的分享。
1678 2
Flink CDC在阿里云DataWorks数据集成应用实践
|
存储 分布式计算 安全
MaxCompute的性能
【5月更文挑战第6天】MaxCompute的性能
244 2

热门文章

最新文章