使用加盐方法解决数据倾斜
场景
表 | 条件 | 数据量 |
---|---|---|
ods.ods_a | 全表 | 500 |
ods.ods_b | 全表 | 1万 |
dwd.dwd_c_tmp | 日期为前一天 | 2000万 |
问题
很简单的一个join操作,最后写入到HDFS中发生严重数据倾斜
发生倾斜的SQL语句
with mo as(
select
MODEL_ID,
MODEL_NAME,
MODEL_CODE,
SERIES_ID
from ods.ods_a
),co as (
select
color_id,
color_name,
color_code,
SERIES_ID,
color_type
from ods.ods_b
),tmp as(
select
uid,
CREATED_AT,
VI_NO,
vin,
OWNER_CODE,
CONSULTANT,
purchaser,
sys,
ordernr,
bill_billnr,
ct_code,
certificate_no,
model,
COLOR,
INNER_COLOR,
pricesum,
status,
dealername,
sheet_create_date,
sheet_created_by
from dwd.dwd_c_tmp
where dt = '2022-03-14'
)
insert overwrite table dwd.dwd_c partition(dt = '2022-03-14')
select
tmp.uid,
tmp.CREATED_AT,
tmp.VI_NO,
tmp.vin,
tmp.OWNER_CODE,
tmp.CONSULTANT,
tmp.purchaser,
tmp.sys,
tmp.ordernr,
tmp.bill_billnr,
tmp.ct_code,
tmp.certificate_no,
mo.MODEL_NAME,
case when co.color_type = 16081002 then co.color_name
end as COLOR,
case when co.color_type = 16081001 then co.color_name
end as INNER_COLOR,
tmp.pricesum,
tmp.status,
tmp.dealername,
tmp.sheet_create_date,
tmp.sheet_created_by
from tmp
left join mo on tmp.MODEL = mo.MODEL_Id
left join co on tmp.color = co.color_code
and mo.SERIES_ID=co.SERIES_ID;
数据倾斜结果
发现000010_0
文件很大
....
3.2 M 9.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000006_0
1.4 M 4.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000007_0
1.8 M 5.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000008_0
783.1 K 2.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000009_0
713.8 M 2.1 G /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000010_0
3.6 M 10.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000011_0
2.2 M 6.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000012_0
1.7 M 5.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-10/000013_0
....
解决方法
在这个场景下,tmp
表的数据量远大于mo
和co
表,因此可能会导致数据倾斜。为了解决这个问题,我们可以采用加盐方案。加盐方案是在进行连接操作时,对连接键(join key)添加一个随机数,以便在多个分区间分散数据。具体实施方案如下:
- 在
tmp
表的查询中,为MODEL
和COLOR
字段添加一个随机数,范围为0到9。可以使用rand()
函数生成随机数并取余。同时,需要在mo
和co
表的查询中为SERIES_ID
添加相同的随机数。
tmp as (
select
*,
cast(rand() * 10 as int) % 10 as salt_key
from dwd.dwd_c_tmp
where dt = '2022-03-14'
),
mo as (
select
*,
cast(rand() * 10 as int) % 10 as salt_key
from ods.ods_a
),
co as (
select
*,
cast(rand() * 10 as int) % 10 as salt_key
from ods.ods_b
)
- 在连接操作中,使用
salt_key
作为额外的连接条件。
from tmp
left join mo on tmp.MODEL = mo.MODEL_ID and tmp.salt_key = mo.salt_key
left join co on tmp.COLOR = co.color_code and tmp.salt_key = co.salt_key and mo.SERIES_ID = co.SERIES_ID
优化后SQL
调整后的完整查询如下:
with
tmp as (
select
*,
cast(rand() * 10 as int) % 10 as salt_key
from dwd.dwd_c_tmp
where dt = '2022-03-14'
),
mo as (
select
*,
cast(rand() * 10 as int) % 10 as salt_key
from ods.ods_a
),
co as (
select
*,
cast(rand() * 10 as int) % 10 as salt_key
from ods.ods_b
)
insert overwrite table dwd.dwd_c partition(dt = '2022-03-14')
select
tmp.uid,
tmp.CREATED_AT,
tmp.VI_NO,
tmp.vin,
tmp.OWNER_CODE,
tmp.CONSULTANT,
tmp.purchaser,
tmp.sys,
tmp.ordernr,
tmp.bill_billnr,
tmp.ct_code,
tmp.certificate_no,
mo.MODEL_NAME,
case when co.color_type = 16081002 then co.color_name end as COLOR,
case when co.color_type = 16081001 then co.color_name end as INNER_COLOR,
tmp.pricesum,
tmp.status,
tmp.dealername,
tmp.sheet_create_date,
tmp.sheet_created_by
from tmp
left join mo on tmp.MODEL = mo.MODEL_ID and tmp.salt_key = mo.salt_key
left join co on tmp.COLOR = co.color_code and tmp.salt_key = co.salt_key and mo.SERIES_ID = co.SERIES_ID;
通过这种加盐方案,可以有效降低数据倾斜的风险,提高查询性能。
执行上面完整的SQL后数据分配如下,发现已经没有太大的数据文件
1.0 M 3.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000000_0
3.5 M 10.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000001_0
1.4 M 4.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000002_0
7.0 M 20.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000003_0
1.5 M 4.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000004_0
1.0 M 3.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000005_0
8.7 M 26.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000006_0
10.7 M 32.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000007_0
904.0 K 2.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000008_0
2.7 M 8.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000009_0
4.9 M 14.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000010_0
8.5 M 25.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000011_0
1.3 M 3.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000012_0
1.6 M 4.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000013_0
72.1 M 216.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000014_0
10.2 M 30.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000015_0
83.5 M 250.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000016_0
2.2 M 6.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000017_0
5.9 M 17.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000018_0
2.1 M 6.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000019_0
9.4 M 28.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000020_0
6.4 M 19.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000021_0
1.6 M 4.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000022_0
1.8 M 5.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000023_0
3.6 M 10.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000024_0
11.5 M 34.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000025_0
71.9 M 215.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000026_0
5.4 M 16.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000027_0
1.3 M 3.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000028_0
3.9 M 11.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000029_0
5.3 M 15.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000030_0
1.8 M 5.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000031_0
389.2 K 1.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000032_0
945.7 K 2.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000033_0
6.2 M 18.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000034_0
1.3 M 4.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000035_0
73.9 M 221.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000036_0
9.8 M 29.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000037_0
1.1 M 3.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000038_0
1.1 M 3.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000039_0
1.1 M 3.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000040_0
72.6 M 217.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000041_0
27.5 M 82.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000042_0
9.8 M 29.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000043_0
88.2 M 264.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000044_0
724.1 K 2.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000045_0
73.4 M 220.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000046_0
1.8 M 5.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000047_0
1.0 M 3.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000048_0
2.0 M 6.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000049_0
1.3 M 4.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000050_0
1.5 M 4.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000051_0
9.5 M 28.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000052_0
2.6 M 7.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000053_0
688.3 K 2.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000054_0
1.4 M 4.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000055_0
1.7 M 5.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000056_0
19.9 M 59.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000057_0
9.4 M 28.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000058_0
4.9 M 14.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000059_0
2.2 M 6.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000060_0
1.1 M 3.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000061_0
5.5 M 16.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000062_0
1.6 M 4.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000063_0
6.8 M 20.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000064_0
11.0 M 32.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000065_0
9.7 M 29.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000066_0
1.8 M 5.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000067_0
4.6 M 13.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000068_0
22.2 M 66.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000069_0
6.4 M 19.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000070_0
1.4 M 4.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000071_0
10.1 M 30.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000072_0
1.7 M 5.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000073_0
1.4 M 4.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000074_0
10.1 M 30.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000075_0
1.2 M 3.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000076_0
759.9 K 2.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000077_0
1.2 M 3.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000078_0
1.9 M 5.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000079_0
1.1 M 3.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000080_0
3.9 M 11.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000081_0
4.4 M 13.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000082_0
25.2 M 75.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000083_0
2.4 M 7.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000084_0
486.6 K 1.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000085_0
737.2 K 2.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000086_0
13.1 M 39.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000087_0
1.8 M 5.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000088_0
558.2 K 1.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000089_0
994.8 K 2.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000090_0
2.4 M 7.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000091_0
1.1 M 3.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000092_0
5.6 M 16.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000093_0
10.3 M 30.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000094_0
1.3 M 3.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000095_0
1.2 M 3.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000096_0
1.9 M 5.6 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000097_0
1.4 M 4.3 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000098_0
2.3 M 6.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000099_0
10.0 M 29.9 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000100_0
5.8 M 17.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000101_0
1.8 M 5.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000102_0
1.8 M 5.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000103_0
396.5 K 1.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000104_0
1.1 M 3.2 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000105_0
823.2 K 2.4 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000106_0
1.8 M 5.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000107_0
1.4 M 4.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000108_0
3.9 M 11.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000109_0
2.7 M 8.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000110_0
79.9 M 239.8 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000111_0
861.0 K 2.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000112_0
917.2 K 2.7 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000113_0
74.3 M 223.0 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000114_0
14.8 M 44.5 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000115_0
72.4 M 217.1 M /usr/hive/warehouse/dwd.db/dwd_c/dt=2022-03-14/000116_0