Hive数据倾斜处理集合

本文涉及的产品
智能开放搜索 OpenSearch行业算法版,1GB 20LCU 1个月
检索分析服务 Elasticsearch 版,2核4GB开发者规格 1个月
实时计算 Flink 版,5000CU*H 3个月
简介: Hive数据倾斜处理集合

使用加盐方法解决数据倾斜

场景

条件 数据量
ods.ods_a 全表 500
ods.ods_b 全表 1万
dwd.dwd_c_tmp 日期为前一天 2000万

问题

很简单的一个join操作,最后写入到HDFS中发生严重数据倾斜

发生倾斜的SQL语句

with mo as(
    select 
        MODEL_ID,
        MODEL_NAME,
        MODEL_CODE,
        SERIES_ID 
    from ods.ods_a
),co as (
    select 
        color_id,
        color_name,
        color_code,
        SERIES_ID,
        color_type  
    from ods.ods_b
),tmp as(
    select
        uid,
        CREATED_AT,
        VI_NO,
        vin,
        OWNER_CODE,
        CONSULTANT,
        purchaser,
        sys,
        ordernr,
        bill_billnr,
        ct_code,
        certificate_no,
        model,
        COLOR,
        INNER_COLOR,
        pricesum,
        status,
        dealername,
        sheet_create_date,
        sheet_created_by
    from dwd.dwd_c_tmp 
    where dt = '2022-03-14'
)
insert overwrite table dwd.dwd_c partition(dt = '2022-03-14')
select
    tmp.uid,
    tmp.CREATED_AT,
    tmp.VI_NO,
    tmp.vin,
    tmp.OWNER_CODE,
    tmp.CONSULTANT,
    tmp.purchaser,
    tmp.sys,
    tmp.ordernr,
    tmp.bill_billnr,
    tmp.ct_code,
    tmp.certificate_no,
    mo.MODEL_NAME,
    case when co.color_type = 16081002 then co.color_name
        end as COLOR,
    case when co.color_type = 16081001 then co.color_name
        end as INNER_COLOR,
    tmp.pricesum,
    tmp.status,
    tmp.dealername,
    tmp.sheet_create_date,
    tmp.sheet_created_by
from tmp
    left join mo on tmp.MODEL = mo.MODEL_Id
    left join co on tmp.color = co.color_code 
    and mo.SERIES_ID=co.SERIES_ID;

数据倾斜结果

发现000010_0文件很大

....
3.2 M    9.7 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000006_0
1.4 M    4.3 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000007_0
1.8 M    5.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000008_0
783.1 K  2.3 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000009_0
713.8 M  2.1 G    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000010_0
3.6 M    10.9 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000011_0
2.2 M    6.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000012_0
1.7 M    5.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000013_0
....

解决方法

在这个场景下,tmp表的数据量远大于moco表,因此可能会导致数据倾斜。为了解决这个问题,我们可以采用加盐方案。加盐方案是在进行连接操作时,对连接键(join key)添加一个随机数,以便在多个分区间分散数据。具体实施方案如下:

  1. tmp表的查询中,为MODELCOLOR字段添加一个随机数,范围为0到9。可以使用rand()函数生成随机数并取余。同时,需要在moco表的查询中为SERIES_ID添加相同的随机数。
tmp as (
  select
    *,
    cast(rand() * 10 as int) % 10 as salt_key
  from dwd.dwd_c_tmp
  where dt = '2022-03-14'
),
mo as (
  select
    *,
    cast(rand() * 10 as int) % 10 as salt_key
  from ods.ods_a
),
co as (
  select
    *,
    cast(rand() * 10 as int) % 10 as salt_key
  from ods.ods_b
)
  1. 在连接操作中,使用salt_key作为额外的连接条件。
from tmp
left join mo on tmp.MODEL = mo.MODEL_ID and tmp.salt_key = mo.salt_key
left join co on tmp.COLOR = co.color_code and tmp.salt_key = co.salt_key and mo.SERIES_ID = co.SERIES_ID

优化后SQL

调整后的完整查询如下:

with
tmp as (
  select
    *,
    cast(rand() * 10 as int) % 10 as salt_key
  from dwd.dwd_c_tmp
  where dt = '2022-03-14'
),
mo as (
  select
    *,
    cast(rand() * 10 as int) % 10 as salt_key
  from ods.ods_a
),
co as (
  select
    *,
    cast(rand() * 10 as int) % 10 as salt_key
  from ods.ods_b
)
insert overwrite table dwd.dwd_c partition(dt = '2022-03-14')
select
  tmp.uid,
  tmp.CREATED_AT,
  tmp.VI_NO,
  tmp.vin,
  tmp.OWNER_CODE,
  tmp.CONSULTANT,
  tmp.purchaser,
  tmp.sys,
  tmp.ordernr,
  tmp.bill_billnr,
  tmp.ct_code,
  tmp.certificate_no,
  mo.MODEL_NAME,
  case when co.color_type = 16081002 then co.color_name end as COLOR,
  case when co.color_type = 16081001 then co.color_name end as INNER_COLOR,
  tmp.pricesum,
  tmp.status,
  tmp.dealername,
  tmp.sheet_create_date,
  tmp.sheet_created_by
from tmp
left join mo on tmp.MODEL = mo.MODEL_ID and tmp.salt_key = mo.salt_key
left join co on tmp.COLOR = co.color_code and tmp.salt_key = co.salt_key and mo.SERIES_ID = co.SERIES_ID;

通过这种加盐方案,可以有效降低数据倾斜的风险,提高查询性能。

执行上面完整的SQL后数据分配如下,发现已经没有太大的数据文件

1.0 M    3.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000000_0
3.5 M    10.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000001_0
1.4 M    4.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000002_0
7.0 M    20.9 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000003_0
1.5 M    4.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000004_0
1.0 M    3.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000005_0
8.7 M    26.1 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000006_0
10.7 M   32.0 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000007_0
904.0 K  2.6 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000008_0
2.7 M    8.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000009_0
4.9 M    14.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000010_0
8.5 M    25.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000011_0
1.3 M    3.8 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000012_0
1.6 M    4.9 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000013_0
72.1 M   216.3 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000014_0
10.2 M   30.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000015_0
83.5 M   250.6 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000016_0
2.2 M    6.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000017_0
5.9 M    17.8 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000018_0
2.1 M    6.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000019_0
9.4 M    28.2 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000020_0
6.4 M    19.2 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000021_0
1.6 M    4.8 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000022_0
1.8 M    5.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000023_0
3.6 M    10.9 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000024_0
11.5 M   34.4 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000025_0
71.9 M   215.8 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000026_0
5.4 M    16.1 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000027_0
1.3 M    3.8 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000028_0
3.9 M    11.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000029_0
5.3 M    15.8 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000030_0
1.8 M    5.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000031_0
389.2 K  1.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000032_0
945.7 K  2.8 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000033_0
6.2 M    18.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000034_0
1.3 M    4.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000035_0
73.9 M   221.7 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000036_0
9.8 M    29.3 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000037_0
1.1 M    3.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000038_0
1.1 M    3.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000039_0
1.1 M    3.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000040_0
72.6 M   217.8 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000041_0
27.5 M   82.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000042_0
9.8 M    29.4 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000043_0
88.2 M   264.5 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000044_0
724.1 K  2.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000045_0
73.4 M   220.1 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000046_0
1.8 M    5.3 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000047_0
1.0 M    3.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000048_0
2.0 M    6.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000049_0
1.3 M    4.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000050_0
1.5 M    4.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000051_0
9.5 M    28.5 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000052_0
2.6 M    7.7 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000053_0
688.3 K  2.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000054_0
1.4 M    4.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000055_0
1.7 M    5.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000056_0
19.9 M   59.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000057_0
9.4 M    28.2 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000058_0
4.9 M    14.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000059_0
2.2 M    6.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000060_0
1.1 M    3.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000061_0
5.5 M    16.5 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000062_0
1.6 M    4.9 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000063_0
6.8 M    20.4 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000064_0
11.0 M   32.9 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000065_0
9.7 M    29.1 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000066_0
1.8 M    5.3 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000067_0
4.6 M    13.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000068_0
22.2 M   66.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000069_0
6.4 M    19.2 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000070_0
1.4 M    4.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000071_0
10.1 M   30.4 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000072_0
1.7 M    5.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000073_0
1.4 M    4.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000074_0
10.1 M   30.4 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000075_0
1.2 M    3.6 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000076_0
759.9 K  2.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000077_0
1.2 M    3.7 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000078_0
1.9 M    5.6 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000079_0
1.1 M    3.3 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000080_0
3.9 M    11.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000081_0
4.4 M    13.2 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000082_0
25.2 M   75.6 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000083_0
2.4 M    7.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000084_0
486.6 K  1.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000085_0
737.2 K  2.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000086_0
13.1 M   39.2 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000087_0
1.8 M    5.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000088_0
558.2 K  1.6 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000089_0
994.8 K  2.9 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000090_0
2.4 M    7.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000091_0
1.1 M    3.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000092_0
5.6 M    16.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000093_0
10.3 M   30.8 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000094_0
1.3 M    3.9 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000095_0
1.2 M    3.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000096_0
1.9 M    5.6 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000097_0
1.4 M    4.3 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000098_0
2.3 M    6.9 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000099_0
10.0 M   29.9 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000100_0
5.8 M    17.4 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000101_0
1.8 M    5.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000102_0
1.8 M    5.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000103_0
396.5 K  1.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000104_0
1.1 M    3.2 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000105_0
823.2 K  2.4 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000106_0
1.8 M    5.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000107_0
1.4 M    4.1 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000108_0
3.9 M    11.7 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000109_0
2.7 M    8.0 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000110_0
79.9 M   239.8 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000111_0
861.0 K  2.5 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000112_0
917.2 K  2.7 M    /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000113_0
74.3 M   223.0 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000114_0
14.8 M   44.5 M   /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000115_0
72.4 M   217.1 M  /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000116_0
目录
相关文章
|
2月前
|
SQL 数据处理 HIVE
HIVE的数据倾斜调优
hive数据倾斜主要是由shuffle引起的,而引起shuffle的又主要有四种情况,分别为: 1.group by 2.join 3.count(distinct) 4.开窗函数
60 8
|
6月前
|
SQL 缓存 分布式计算
手把手教你解决 Hive 的数据倾斜
数据倾斜是 Hive 中影响任务执行效率的现象,表现为某些任务处理的数据量或耗时远超其他任务。根本原因是 Shuffle 后 Key 分布不均,导致部分 Reduce 负载过高。常见场景包括空值聚合、不可拆分大文件、数值膨胀、不同数据类型 Join、Count(distinct) 计算以及表 Join 操作。解决方法包括过滤空值、转换数据类型、调整聚合策略、使用 MapJoin 等。通过合理优化,如设置 `hive.groupby.skewindata` 和 `hive.map.aggr` 参数,可以有效缓解数据倾斜问题。
676 2
|
6月前
|
SQL 分布式计算 算法
【Hive】数据倾斜怎么解决?
【4月更文挑战第16天】【Hive】数据倾斜怎么解决?
|
6月前
|
SQL 数据采集 分布式计算
Hadoop和Hive中的数据倾斜问题及其解决方案
Hadoop和Hive中的数据倾斜问题及其解决方案
108 0
|
6月前
|
SQL HIVE
Hive group by 数据倾斜问题处理
Hive group by 数据倾斜问题处理
93 0
|
SQL 分布式计算 负载均衡
Hive数据倾斜的原因以及常用解决方案
Hive数据倾斜的原因以及常用解决方案
|
SQL 分布式计算 监控
【Hive】(十五)Hive 数据倾斜与调优
【Hive】(十五)Hive 数据倾斜与调优
421 0
【Hive】(十五)Hive 数据倾斜与调优
|
SQL HIVE
【Hive】(十一)Hive 内置函数集合
【Hive】(十一)Hive 内置函数集合
261 0
【Hive】(十一)Hive 内置函数集合
|
SQL 数据采集 HIVE
实战 | Hive 数据倾斜问题定位排查及解决 (二)
Hive 数据倾斜怎么发现,怎么定位,怎么解决
500 0
实战 | Hive 数据倾斜问题定位排查及解决 (二)
|
SQL 分布式计算 Java
实战 | Hive 数据倾斜问题定位排查及解决 (一)
Hive 数据倾斜怎么发现,怎么定位,怎么解决
631 0
实战 | Hive 数据倾斜问题定位排查及解决 (一)