前言:勇气就是做你害怕的事,如果你不害怕,就谈不上勇气
1、数仓开发之DIM层
DIM层设计要点:
(1)DIM层的设计依据时维度建模理论,该层储存维度模型的维度表。
(2)DIM层的数据储存格式为orc列示储存+snappy压缩
(3)DIM层表命名的命名规范为dim_表名_全量表或者拉链表标识(full/zip)
--DIM(共通层) --Dimension:维度 --所谓的维度其实就是分析数据的角度 --维度层保存的表其实就是分析数据的角度 --性别 --年龄 --品牌 --品种 --维度层保存维度表,所以建模理论应该遵循维度建模理论 --维度层中的维度表,主要用于统计分析 --数据存储方式应该为列式存储:hive默认orc --数据压缩效率应该越高越好(时间短):snappy --数据来源 --ODS层的数据为整个数据仓库做准备 --DIM层数据源就是ODS层 --命名规范 --分层标记(dim_)_维度名称_全量/拉链(标记) --全量:维度表全部数据 --状态数据为了避免数据出现问题,最好的方式就是每一天保存全部数据 --绝大多数的维度表都是全量表,特殊情况采用拉链 --建模理论 --ER模型 --ODS --维度模型 --维度(状态)表 --事实(行为)表 --维度表 --表 --维度(角度),一个维度一张表 --t_order,t_sex,t_age --从实践讲,一般会将有关联得维度设置为一张表,不同维度就是这张表的(维度)字段 --t_order,t_user(sex,age) --t_order,t_sku(tm,category) --如果维度特别简单,特别独立,只在特殊场合用,其实这个表可以不用创建,可以在事实表直接使用 --字段 --维度:只要能用来进行分析的维度,都是字段 --数据(字段)来源:参考业务数据库的表字段 --主维表:业务数据库中主要用于分析维度字段的表 --相关维度:业务数据库中相关用于分析维度字段的表 --维度字段的确定 --尽可能生成丰富的维度属性:字段越多越好 --尽量不使用编码,而使用明确的文字说明,一般可以编码和文字共存。 --尽量沉淀出通用的维度属性 --时间 2023-02-01 转化季度
2.维度建模理论之维度表
2.1 维度表概述
维度表是维度建模的基础和灵魂。前文提到,事实表紧紧围绕业务过程进行设计,而维度表则围绕业务过程所处的环境进行设计。维度表主要包含一个主键和各种维度字段,维度字段称为维度属性。
2.2 维度表设计步骤
1)确定维度(表)
在设计事实表时,已经确定了与每个事实表相关的维度,理论上每个相关维度均需对应一张维度表。需要注意到,可能存在多个事实表与同一个维度都相关的情况,这种情况需保证维度的唯一性,即只创建一张维度表。另外,如果某些维度表的维度属性很少,例如只有一个**名称,则可不创建该维度表,而把该表的维度属性直接增加到与之相关的事实表中,这个操作称为维度退化。
2)确定主维表和相关维表
此处的主维表和相关维表均指业务系统中与某维度相关的表。例如业务系统中与商品相关的表有sku_info,spu_info,base_trademark,base_category3,base_category2,base_category1等,其中sku_info就称为商品维度的主维表,其余表称为商品维度的相关维表。维度表的粒度通常与主维表相同。
3)确定维度属性
确定维度属性即确定维度表字段。维度属性主要来自于业务系统中与该维度对应的主维表和相关维表。维度属性可直接从主维表或相关维表中选择,也可通过进一步加工得到。
确定维度属性时,需要遵循以下要求:
(1)尽可能生成丰富的维度属性
维度属性是后续做分析统计时的查询约束条件、分组字段的基本来源,是数据易用性的关键。维度属性的丰富程度直接影响到数据模型能够支持的指标的丰富程度。
(2)尽量不使用编码,而使用明确的文字说明,一般可以编码和文字共存。
(3)尽量沉淀出通用的维度属性
有些维度属性的获取需要进行比较复杂的逻辑处理,例如需要通过多个字段拼接得到。为避免后续每次使用时的重复处理,可将这些维度属性沉淀到维度表中。
3.维度设计要点
3.3.1 规范化与反规范化
规范化是指使用一系列范式设计数据库的过程,其目的是减少数据冗余,增强数据的一致性。通常情况下,规范化之后,一张表的字段会拆分到多张表。
反规范化是指将多张表的数据冗余到一张表,其目的是减少join操作,提高查询性能。
在设计维度表时,如果对其进行规范化,得到的维度模型称为雪花模型,如果对其进行反规范化,得到的模型称为星型模型。
数据仓库系统的主要目的是用于数据分析和统计,所以是否方便用户进行统计分析决定了模型的优劣。采用雪花模型,用户在统计分析的过程中需要大量的关联操作,使用复杂度高,同时查询性能很差,而采用星型模型,则方便、易用且性能好。所以出于易用性和性能的考虑,维度表一般是很不规范化的。
4.商品维度表
1、建表语句
--商品维度表 --dim_sku_full --全量维度表:一天为单位将数据全部同步到维度表的相同时间分区中 --业务数据库的表 --主维表:sku_info SKU信息表 --相关维度表:sku_attr_valueSKU平台属性表、sku_image、sku_sale_attr_value SKU销售属性表 --SKU=stock Keeping Unit(库存量基本单位) --SPU=(standard Product Unit):是商品信息聚合的最小单位,是一组可复用、易索引的标准化信息集合
DROP TABLE IF EXISTS dim_sku_full; CREATE EXTERNAL TABLE dim_sku_full ( `id` STRING COMMENT 'sku_id', `price` DECIMAL(16, 2) COMMENT '商品价格', `sku_name` STRING COMMENT '商品名称', `sku_desc` STRING COMMENT '商品描述', `weight` DECIMAL(16, 2) COMMENT '重量', `is_sale` BOOLEAN COMMENT '是否在售', `spu_id` STRING COMMENT 'spu编号', `spu_name` STRING COMMENT 'spu名称', `category3_id` STRING COMMENT '三级分类id', `category3_name` STRING COMMENT '三级分类名称', `category2_id` STRING COMMENT '二级分类id', `category2_name` STRING COMMENT '二级分类名称', `category1_id` STRING COMMENT '一级分类id', `category1_name` STRING COMMENT '一级分类名称', `tm_id` STRING COMMENT '品牌id', `tm_name` STRING COMMENT '品牌名称', `sku_attr_values` ARRAY<STRUCT<attr_id :STRING,value_id :STRING,attr_name :STRING,value_name:STRING>> COMMENT '平台属性', #一个商品对应多个平台属性,所以是一对多,结构体的字段是固定的 `sku_sale_attr_values` ARRAY<STRUCT<sale_attr_id :STRING,sale_attr_value_id :STRING,sale_attr_name :STRING,sale_attr_value_name:STRING>> COMMENT '销售属性', `create_time` STRING COMMENT '创建时间' ) COMMENT '商品维度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_sku_full/' tblepropertis TBLPROPERTIES ('orc.compress' = 'snappy');
2.分析
3、数据装载
--装载数据 --load --补全数据 --补全行:union --补全列:join --join,left,join,right join,full join --save --以时间为单位(分区)进行保存 --采集汇总哪一天的数据,就保存到哪一天的分区中 --insert into:插入 --insert overwrite:覆盖 insert into table dim_sku_full partition(dt='2022-06-08') select `id` ,--STRING COMMENT 'sku_id', `price` ,--DECIMAL(16, 2) COMMENT '商品价格', `sku_name` ,--STRING COMMENT '商品名称', `sku_desc` ,--STRING COMMENT '商品描述', `weight` ,--DECIMAL(16, 2) COMMENT '重量', `is_sale` ,--BOOLEAN COMMENT '是否在售', `spu_id` ,--STRING COMMENT 'spu编号', `spu_name` ,--STRING COMMENT 'spu名称', `category3_id` ,--STRING COMMENT '三级分类id', `category3_name` ,--STRING COMMENT '三级分类名称', `category2_id` ,--STRING COMMENT '二级分类id', `category2_name` ,--STRING COMMENT '二级分类名称', `category1_id` ,--STRING COMMENT '一级分类id', `category1_name` ,--STRING COMMENT '一级分类名称', `tm_id` ,--STRING COMMENT '品牌id', `tm_name` ,--STRING COMMENT '品牌名称', `create_time` --STRING COMMENT '创建时间' from ods_sku_info_full where dt='2022-06-08' --缺什么补什么 select id, spu_name from ods_spu_info_full where dt='2022-06-08' select id,tm_name from ods_base_trademark_full select id,name,category2_id from ods_base_category3_full select id,name,category1_id from ods_base_category2_full select id, name from ods_base_category1_full
逻辑没问题,语法有问题,但要明确join的用法
数据装载 我们知道,ODS层表的数据来源于原始业务表,只是剔除了某些不需要的字段。因此在考虑DIM和DWD层表的数据装载时,对原始表和ODS层表分析是一样的。本项目可能会基于业务数据库的原表分析,特此说明。下文不再赘述。 (1)思路分析 ① 分区规划 为了避免下游查询时执行全表扫描,按天分区,分区字段取值为当天日期。 ② 数据流向 商品维度相关的原始表都做了全量采集,我们只需要获取ODS层数据源表当日分区的数据,写回当日分区即可。 ③ 数据装载 商品维度相关的业务表有八张:sku_info,spu_info,base_trademark,base_category1_info,base_category2_info,base_category3_info,sku_attr_value,sku_sale_attr_value。这些表都做了全量采集,在ODS层有相应的原始表与之对应。基于维度建模理论,我们要确定主维表,将商品维度相关的原始表关联起来。维度表的粒度与主维表保持一致,后者的主键就是维度表的唯一标识。在下游的DWS层或ADS层,我们会用事实表去关联维度表,补充维度信息,即维度关联。显然,如果事实表中保存了维度表的唯一标识,维度关联更加易于实现,如交易域下单事务事实表保留了sku_id,如果以sku_info作为主维表,可以很方便地通过sku_id完成关联。因此,通常要保证主维表的主键字段存在于绝大多数相关事实表中。此外,在数据仓库中,维度表的粒度越细,保留的信息越多,下游可以做的分析就越丰富。 综上,选择ods_sku_info_full作为商品维度主表。 (2)执行步骤 ① 通过CTE表达式定义临时表 sku子查询:从ods_sku_info_full表中选取当日('2022-06-08')数据,筛选每个SKU的id、价格、名称、描述、重量、是否上架、所属的SPU的id、所属的第三级品类的id、所属的品牌的id、创建时间等字段。 spu子查询:从ods_spu_info_full表中选取当日('2022-06-08')数据,筛选每个SPU的id和名称。 c3子查询:从ods_base_category3_full表中选取当日('2022-06-08')数据,筛选每个第三级品类的id、名称和所属的第二级品类的id。 c2子查询:从ods_base_category2_full表中选取当日('2022-06-08')数据,筛选每个第二级品类的id、名称和所属的第一级品类的id。 c1子查询:从ods_base_category1_full表中选取当日('2022-06-08')数据,筛选每个第一级品类的id和名称。 tm子查询:从ods_base_trademark_full表中选取当日('2022-06-08')数据,筛选每个品牌的id和名称。 attr子查询:从ods_sku_attr_value_full表中选取当日('2022-06-08')数据,筛选每个SKU的属性及其取值,使用collect_set函数对相同的SKU的属性进行了合并,生成了一个名为attrs的数组。 sale_attr子查询:从ods_sku_sale_attr_value_full表中选取当日('2022-06-08')数据,筛选每个SKU的销售属性及其取值,使用collect_set函数对相同的SKU的销售属性进行了合并,生成了一个名为sale_attrs的数组。 ② 最后,以sku子查询为主表,将八个临时表关联起来,关联字段及条件如下。 与spu子查询通过spu_id等值连接 与c3子查询通过category3_id关联 与c2子查询通过category2_id关联 与c1子查询通过category1_id关联 与tm子查询通过tm_id关联 与attr子查询通过sku_id关联 与sale_attr子查询通过sku_id关联 其中,与spu、c3、c2、c1、tm的关联使用left join和join是等价的,full [outer] join和right join等价的,因为sku一定存在对应的spu、一级、二级、三级品类和品牌,主表中不存在不满足关联条件的数据。而这些从表中不满足关联条件的数据是无用的,因此应选择left join或join。此处选择left join,这是因为,数据的数据经历多个环节的传递很可能丢失,如果使用join,当从表数据丢失时,与之匹配的主表数据会被舍弃,导致关联结果的整条数据丢失。而使用left join,则关联结果中整条数据得以保留,只是取自从表的字段为null,可以减少数据丢失带来的损失,且便于发现和定位问题。因而,left join和join等价时,选用left join。下文同理,不再赘述。 接下来关联最后两个子查询,sku可能不存在属性或销售属性,在这两个子查询中未必有满足关联条件的数据,此时主表数据需要保留,因此要选用left join或full join,而这两个子查询中不满足关联条件的数据不应保留,只能选择left join。 最后选取所需字段,写入dim_sku_full表的当日分区。 (3)图解 ① 分区规划
select sku.`id` ,--STRING COMMENT 'sku_id', `price` ,--DECIMAL(16, 2) COMMENT '商品价格', `sku_name` ,--STRING COMMENT '商品名称', `sku_desc` ,--STRING COMMENT '商品描述', `weight` ,--DECIMAL(16, 2) COMMENT '重量', `is_sale` ,--BOOLEAN COMMENT '是否在售', `spu_id` ,--STRING COMMENT 'spu编号', `spu_name` ,--STRING COMMENT 'spu名称', `category3_id` ,--STRING COMMENT '三级分类id', `category3_name` ,--STRING COMMENT '三级分类名称', `category2_id` ,--STRING COMMENT '二级分类id', `category2_name` ,--STRING COMMENT '二级分类名称', `category1_id` ,--STRING COMMENT '一级分类id', `category1_name` ,--STRING COMMENT '一级分类名称', `tm_id` ,--STRING COMMENT '品牌id', `tm_name` ,--STRING COMMENT '品牌名称', sku_attr_values, sku_sale_attr_values, `create_time` --STRING COMMENT '创建时间' from ( `id` ,--STRING COMMENT 'sku_id', `price` ,--DECIMAL(16, 2) COMMENT '商品价格', `sku_desc` ,--STRING COMMENT '商品描述', `weight` ,--DECIMAL(16, 2) COMMENT '重量', `is_sale` ,--BOOLEAN COMMENT '是否在售', `spu_id` ,--STRING COMMENT 'spu编号', `tm_id` ,--STRING COMMENT '品牌id', `create_time` --STRING COMMENT '创建时间' from ods_sku_info_full where dt='2022-06-08' ) sku left join( select id,spu_name from ods_sku_info_full where dt='2022-06-08' ) spu on sku.spu_id=spu.id left join ( select id,tm_name from ods_base_trademark_full where dt='2022-06-08' ) tm on sku.tm_id=tm.id left join ( select id,name category3_name,category2_id from ods_base_category3_full where dt='2022-06-08' )c3 on sku.category3_id=c3.id left join( select id,name category2_name,category1_id from ods_base_category2_full where dt='2022-06-08' )c2 on c3.category2_id=c2.id left join( select id,name category1_name from ods_base_category1_full where dt='2022-06-08' )c1 on c2.category1_id=c1.id left join( --一对多处理 --1.struct --2.相同的sku => N*STRUCT =>array --聚合函数=>array named_struct类型,主要用这个函数做字段拼接,并且每个字段都可以取别名; select sku_id, collect_list(named_struct("attr_id",attr_id,"value_id",value_id,"attr_name",attr_name,"value_value":value_value)) sku_attr_values from odf_sku_attr_value_full where dt='2022-06-08' group by sku_id )sav on sku.id=sav.sku_id left join( select sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sku_sale_attr_values from ods_sku_sale_attr_value_full where dt='2020-06-14' group by sku_id )ssav on sku.id=ssav.sku_id
方法二:
语法说明:CTE with q1 as ( select key from src where key='5') select * from q1; select*from (select*from ods_user_info_inc where dt='2022-06-08') t; with t as (select*from ods_user_info_inc where dt='2022-06-08') select * from (select * from t)t1 join(select*from t)
with sku as ( select id, price, sku_name, sku_desc, weight, is_sale, spu_id, category3_id, tm_id, create_time from ods_sku_info_full where dt='2020-06-14' ), spu as ( select id, spu_name from ods_spu_info_full where dt='2020-06-14' ), c3 as ( select id, name, category2_id from ods_base_category3_full where dt='2020-06-14' ), c2 as ( select id, name, category1_id from ods_base_category2_full where dt='2020-06-14' ), c1 as ( select id, name from ods_base_category1_full where dt='2020-06-14' ), tm as ( select id, tm_name from ods_base_trademark_full where dt='2020-06-14' ), attr as ( select sku_id, collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs from ods_sku_attr_value_full where dt='2020-06-14' group by sku_id ), sale_attr as ( select sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs from ods_sku_sale_attr_value_full where dt='2020-06-14' group by sku_id ) insert overwrite table dim_sku_full partition(dt='2020-06-14') select sku.id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.is_sale, sku.spu_id, spu.spu_name, sku.category3_id, c3.name, c3.category2_id, c2.name, c2.category1_id, c1.name, sku.tm_id, tm.tm_name, attr.attrs, sale_attr.sale_attrs, sku.create_time from sku left join spu on sku.spu_id=spu.id left join c3 on sku.category3_id=c3.id left join c2 on c3.category2_id=c2.id left join c1 on c2.category1_id=c1.id left join tm on sku.tm_id=tm.id left join attr on sku.id=attr.sku_id left join sale_attr on sku.id=sale_attr.sku_id;
5.优惠劵维度表
--优惠券维度表 --主维表&相关维表 --主维度 :优惠券信息表(coupon_info) --相关维表 --coupon_range(X) --coupon_use(X) --数据装载 --load --coupon_info --base_dic --save
效率高些
1、建表语句
DROP TABLE IF EXISTS dim_coupon_full; CREATE EXTERNAL TABLE dim_coupon_full ( `id` STRING COMMENT '购物券编号', `coupon_name` STRING COMMENT '购物券名称', `coupon_type_code` STRING COMMENT '购物券类型编码', `coupon_type_name` STRING COMMENT '购物券类型名称', `condition_amount` DECIMAL(16, 2) COMMENT '满额数', `condition_num` BIGINT COMMENT '满件数', `activity_id` STRING COMMENT '活动编号', `benefit_amount` DECIMAL(16, 2) COMMENT '减金额', `benefit_discount` DECIMAL(16, 2) COMMENT '折扣', `benefit_rule` STRING COMMENT '优惠规则:满元*减*元,满*件打*折', #沉淀 `create_time` STRING COMMENT '创建时间', `range_type_code` STRING COMMENT '优惠范围类型编码', `range_type_name` STRING COMMENT '优惠范围类型名称', `limit_num` BIGINT COMMENT '最多领取次数', `taken_count` BIGINT COMMENT '已领取次数', `start_time` STRING COMMENT '可以领取的开始日期', `end_time` STRING COMMENT '可以领取的结束日期', `operate_time` STRING COMMENT '修改时间', `expire_time` STRING COMMENT '过期时间' ) COMMENT '优惠券维度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_coupon_full/' TBLPROPERTIES ('orc.compress' = 'snappy');
2、数据装载
insert overwrite table dim_coupon_full partition(dt='2020-06-14') select id, coupon_name, coupon_type, coupon_dic.dic_name, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, case coupon_type when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元') when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折') when '3203' then concat('减',benefit_amount,'元') end benefit_rule, create_time, range_type, range_dic.dic_name, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ( select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ods_coupon_info_full where dt='2020-06-14' )ci left join ( select dic_code, dic_name from ods_base_dic_full where dt='2020-06-14' and parent_code='32' )coupon_dic on ci.coupon_type=coupon_dic.dic_code left join ( select dic_code, dic_name from ods_base_dic_full where dt='2020-06-14' and parent_code='33' )range_dic on ci.range_type=range_dic.dic_code;
4.活动维度表
1、建表语句
DROP TABLE IF EXISTS dim_activity_full; CREATE EXTERNAL TABLE dim_activity_full ( `activity_rule_id` STRING COMMENT '活动规则ID', `activity_id` STRING COMMENT '活动ID', `activity_name` STRING COMMENT '活动名称', `activity_type_code` STRING COMMENT '活动类型编码', `activity_type_name` STRING COMMENT '活动类型名称', `activity_desc` STRING COMMENT '活动描述', `start_time` STRING COMMENT '开始时间', `end_time` STRING COMMENT '结束时间', `create_time` STRING COMMENT '创建时间', `condition_amount` DECIMAL(16, 2) COMMENT '满减金额', `condition_num` BIGINT COMMENT '满减件数', `benefit_amount` DECIMAL(16, 2) COMMENT '优惠金额', `benefit_discount` DECIMAL(16, 2) COMMENT '优惠折扣', `benefit_rule` STRING COMMENT '优惠规则', `benefit_level` STRING COMMENT '优惠级别' ) COMMENT '活动信息表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_activity_full/' TBLPROPERTIES ('orc.compress' = 'snappy');
2、数据装载
insert overwrite table dim_activity_full partition(dt='2020-06-14') select rule.id, info.id, activity_name, rule.activity_type, dic.dic_name, activity_desc, start_time, end_time, create_time, condition_amount, condition_num, benefit_amount, benefit_discount, case rule.activity_type when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元') when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折') when '3103' then concat('打',10*(1-benefit_discount),'折') end benefit_rule, benefit_level from ( select id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level from ods_activity_rule_full where dt='2020-06-14' )rule left join ( select id, activity_name, activity_type, activity_desc, start_time, end_time, create_time from ods_activity_info_full where dt='2020-06-14' )info on rule.activity_id=info.id left join ( select dic_code, dic_name from ods_base_dic_full where dt='2020-06-14' and parent_code='31' )dic on rule.activity_type=dic.dic_code;
5.地区维度表
1、建表语句
DROP TABLE IF EXISTS dim_province_full; CREATE EXTERNAL TABLE dim_province_full ( `id` STRING COMMENT 'id', `province_name` STRING COMMENT '省市名称', `area_code` STRING COMMENT '地区编码', `iso_code` STRING COMMENT '旧版ISO-3166-2编码,供可视化使用', `iso_3166_2` STRING COMMENT '新版IOS-3166-2编码,供可视化使用', `region_id` STRING COMMENT '地区id', `region_name` STRING COMMENT '地区名称' ) COMMENT '地区维度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gma
2、数据装载
insert overwrite table dim_province_full partition(dt='2020-06-14') select province.id, province.name, province.area_code, province.iso_code, province.iso_3166_2, region_id, region_name from ( select id, name, region_id, area_code, iso_code, iso_3166_2 from ods_base_province_full where dt='2020-06-14' )province left join ( select id, region_name from ods_base_region_full where dt='2020-06-14' )region on province.region_id=region.id;
6.日期维度表
1、建表语句
DROP TABLE IF EXISTS dim_date; CREATE EXTERNAL TABLE dim_date ( `date_id` STRING COMMENT '日期ID', `week_id` STRING COMMENT '周ID,一年中的第几周', `week_day` STRING COMMENT '周几', `day` STRING COMMENT '每月的第几天', `month` STRING COMMENT '一年中的第几月', `quarter` STRING COMMENT '一年中的第几季度', `year` STRING COMMENT '年份', `is_workday` STRING COMMENT '是否是工作日', `holiday_id` STRING COMMENT '节假日' ) COMMENT '时间维度表' STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_date/' TBLPROPERTIES ('orc.compress' = 'snappy');
2、数据装载
通常情况下,时间维度表的数据并不是来自业务系统,而是手动写入,并且由于时间维度表数据的可预见性,无需每日导入,一般可一次性导入一年的数据。
(1)创建临时表
DROP TABLE IF EXISTS tmp_dim_date_info; CREATE EXTERNAL TABLE tmp_dim_date_info ( `date_id` STRING COMMENT '日', `week_id` STRING COMMENT '周ID', `week_day` STRING COMMENT '周几', `day` STRING COMMENT '每月的第几天', `month` STRING COMMENT '第几月', `quarter` STRING COMMENT '第几季度', `year` STRING COMMENT '年', `is_workday` STRING COMMENT '是否是工作日', `holiday_id` STRING COMMENT '节假日' ) COMMENT '时间维度表' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/warehouse/gmall/tmp/tmp_dim_date_info/';
(2)将数据文件上传到HDFS上临时表路径/warehouse/gmall/tmp/tmp_dim_date_info
(3)执行以下语句将其导入时间维度表
insert overwrite table dim_date select * from tmp_dim_date_info;
(4)检查数据是否导入成功
select * from dim_date;
7. 用户维度表
1、建表语句
DROP TABLE IF EXISTS dim_user_zip; CREATE EXTERNAL TABLE dim_user_zip ( `id` STRING COMMENT '用户id', `login_name` STRING COMMENT '用户名称', `nick_name` STRING COMMENT '用户昵称', `name` STRING COMMENT '用户姓名', `phone_num` STRING COMMENT '手机号码', `email` STRING COMMENT '邮箱', `user_level` STRING COMMENT '用户等级', `birthday` STRING COMMENT '生日', `gender` STRING COMMENT '性别', `create_time` STRING COMMENT '创建时间', `operate_time` STRING COMMENT '操作时间', `start_date` STRING COMMENT '开始日期', `end_date` STRING COMMENT '结束日期' ) COMMENT '用户表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_user_zip/' TBLPROPERTIES ('orc.compress' = 'snappy');
2、分区规划
3、数据装载
(2)数据流向
(3)首日装载
insert overwrite table dim_user_zip partition (dt='9999-12-31') select data.id, data.login_name, data.nick_name, md5(data.name), md5(data.phone_num), md5(data.email), data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, '2020-06-14' start_date, '9999-12-31' end_date from ods_user_info_inc where dt='2020-06-14' and type='bootstrap-insert';
(4)每日装载
(4-2)装载语句
with tmp as ( select old.id old_id, old.login_name old_login_name, old.nick_name old_nick_name, old.name old_name, old.phone_num old_phone_num, old.email old_email, old.user_level old_user_level, old.birthday old_birthday, old.gender old_gender, old.create_time old_create_time, old.operate_time old_operate_time, old.start_date old_start_date, old.end_date old_end_date, new.id new_id, new.login_name new_login_name, new.nick_name new_nick_name, new.name new_name, new.phone_num new_phone_num, new.email new_email, new.user_level new_user_level, new.birthday new_birthday, new.gender new_gender, new.create_time new_create_time, new.operate_time new_operate_time, new.start_date new_start_date, new.end_date new_end_date from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from dim_user_zip where dt='9999-12-31' )old full outer join ( select id, login_name, nick_name, md5(name) name, md5(phone_num) phone_num, md5(email) email, user_level, birthday, gender, create_time, operate_time, '2020-06-15' start_date, '9999-12-31' end_date from ( select data.id, data.login_name, data.nick_name, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, row_number() over (partition by data.id order by ts desc) rn from ods_user_info_inc where dt='2020-06-15' )t1 where rn=1 )new on old.id=new.id ) insert overwrite table dim_user_zip partition(dt) select if(new_id is not null,new_id,old_id), if(new_id is not null,new_login_name,old_login_name), if(new_id is not null,new_nick_name,old_nick_name), if(new_id is not null,new_name,old_name), if(new_id is not null,new_phone_num,old_phone_num), if(new_id is not null,new_email,old_email), if(new_id is not null,new_user_level,old_user_level), if(new_id is not null,new_birthday,old_birthday), if(new_id is not null,new_gender,old_gender), if(new_id is not null,new_create_time,old_create_time), if(new_id is not null,new_operate_time,old_operate_time), if(new_id is not null,new_start_date,old_start_date), if(new_id is not null,new_end_date,old_end_date), if(new_id is not null,new_end_date,old_end_date) dt from tmp union all select old_id, old_login_name, old_nick_name, old_name, old_phone_num, old_email, old_user_level, old_birthday, old_gender, old_create_time, old_operate_time, old_start_date, cast(date_add('2020-06-15',-1) as string) old_end_date, cast(date_add('2020-06-15',-1) as string) dt from tmp where old_id is not null and new_id is not null;
1.7 数据装载脚本
1.首日装载脚本
1、在hadoop102的/home/zhm/bin目录下创建ods_to_dim_init.sh
vim ods_to_dim_init.sh
2、添加如下内容
#!/bin/bash APP=gmall if [ -n "$2" ] ;then do_date=$2 else echo "请传入日期参数" exit fi dim_user_zip=" insert overwrite table ${APP}.dim_user_zip partition (dt='9999-12-31') select data.id, data.login_name, data.nick_name, md5(data.name), md5(data.phone_num), md5(data.email), data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, '$do_date' start_date, '9999-12-31' end_date from ${APP}.ods_user_info_inc where dt='$do_date' and type='bootstrap-insert'; " dim_sku_full=" with sku as ( select id, price, sku_name, sku_desc, weight, is_sale, spu_id, category3_id, tm_id, create_time from ${APP}.ods_sku_info_full where dt='$do_date' ), spu as ( select id, spu_name from ${APP}.ods_spu_info_full where dt='$do_date' ), c3 as ( select id, name, category2_id from ${APP}.ods_base_category3_full where dt='$do_date' ), c2 as ( select id, name, category1_id from ${APP}.ods_base_category2_full where dt='$do_date' ), c1 as ( select id, name from ${APP}.ods_base_category1_full where dt='$do_date' ), tm as ( select id, tm_name from ${APP}.ods_base_trademark_full where dt='$do_date' ), attr as ( select sku_id, collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs from ${APP}.ods_sku_attr_value_full where dt='$do_date' group by sku_id ), sale_attr as ( select sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs from ${APP}.ods_sku_sale_attr_value_full where dt='$do_date' group by sku_id ) insert overwrite table ${APP}.dim_sku_full partition(dt='$do_date') select sku.id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.is_sale, sku.spu_id, spu.spu_name, sku.category3_id, c3.name, c3.category2_id, c2.name, c2.category1_id, c1.name, sku.tm_id, tm.tm_name, attr.attrs, sale_attr.sale_attrs, sku.create_time from sku left join spu on sku.spu_id=spu.id left join c3 on sku.category3_id=c3.id left join c2 on c3.category2_id=c2.id left join c1 on c2.category1_id=c1.id left join tm on sku.tm_id=tm.id left join attr on sku.id=attr.sku_id left join sale_attr on sku.id=sale_attr.sku_id; " dim_province_full=" insert overwrite table ${APP}.dim_province_full partition(dt='$do_date') select province.id, province.name, province.area_code, province.iso_code, province.iso_3166_2, region_id, region_name from ( select id, name, region_id, area_code, iso_code, iso_3166_2 from ${APP}.ods_base_province_full where dt='$do_date' )province left join ( select id, region_name from ${APP}.ods_base_region_full where dt='$do_date' )region on province.region_id=region.id; " dim_coupon_full=" insert overwrite table ${APP}.dim_coupon_full partition(dt='$do_date') select id, coupon_name, coupon_type, coupon_dic.dic_name, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, case coupon_type when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元') when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折') when '3203' then concat('减',benefit_amount,'元') end benefit_rule, create_time, range_type, range_dic.dic_name, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ( select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ${APP}.ods_coupon_info_full where dt='$do_date' )ci left join ( select dic_code, dic_name from ${APP}.ods_base_dic_full where dt='$do_date' and parent_code='32' )coupon_dic on ci.coupon_type=coupon_dic.dic_code left join ( select dic_code, dic_name from ${APP}.ods_base_dic_full where dt='$do_date' and parent_code='33' )range_dic on ci.range_type=range_dic.dic_code; " dim_activity_full=" insert overwrite table ${APP}.dim_activity_full partition(dt='$do_date') select rule.id, info.id, activity_name, rule.activity_type, dic.dic_name, activity_desc, start_time, end_time, create_time, condition_amount, condition_num, benefit_amount, benefit_discount, case rule.activity_type when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元') when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折') when '3103' then concat('打',10*(1-benefit_discount),'折') end benefit_rule, benefit_level from ( select id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level from ${APP}.ods_activity_rule_full where dt='$do_date' )rule left join ( select id, activity_name, activity_type, activity_desc, start_time, end_time, create_time from ${APP}.ods_activity_info_full where dt='$do_date' )info on rule.activity_id=info.id left join ( select dic_code, dic_name from ${APP}.ods_base_dic_full where dt='$do_date' and parent_code='31' )dic on rule.activity_type=dic.dic_code; " case $1 in "dim_user_zip") hive -e "$dim_user_zip" ;; "dim_sku_full") hive -e "$dim_sku_full" ;; "dim_province_full") hive -e "$dim_province_full" ;; "dim_coupon_full") hive -e "$dim_coupon_full" ;; "dim_activity_full") hive -e "$dim_activity_full" ;; "all") hive -e "$dim_user_zip$dim_sku_full$dim_province_full$dim_coupon_full$dim_activity_full" ;; esac
3、增加脚本执行权限
4、脚本用法
ods_to_dim_init.sh all 2020-06-14
2.每日装载脚本
1、在hadoop102的/home/zhm/bin目录下创建ods_to_dim.sh
vim ods_to_dim.sh
2、添加如下内容
#!/bin/bash APP=gmall # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$2" ] ;then do_date=$2 else do_date=`date -d "-1 day" +%F` fi dim_user_zip=" set hive.exec.dynamic.partition.mode=nonstrict; with tmp as ( select old.id old_id, old.login_name old_login_name, old.nick_name old_nick_name, old.name old_name, old.phone_num old_phone_num, old.email old_email, old.user_level old_user_level, old.birthday old_birthday, old.gender old_gender, old.create_time old_create_time, old.operate_time old_operate_time, old.start_date old_start_date, old.end_date old_end_date, new.id new_id, new.login_name new_login_name, new.nick_name new_nick_name, new.name new_name, new.phone_num new_phone_num, new.email new_email, new.user_level new_user_level, new.birthday new_birthday, new.gender new_gender, new.create_time new_create_time, new.operate_time new_operate_time, new.start_date new_start_date, new.end_date new_end_date from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from ${APP}.dim_user_zip where dt='9999-12-31' )old full outer join ( select id, login_name, nick_name, md5(name) name, md5(phone_num) phone_num, md5(email) email, user_level, birthday, gender, create_time, operate_time, '$do_date' start_date, '9999-12-31' end_date from ( select data.id, data.login_name, data.nick_name, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, row_number() over (partition by data.id order by ts desc) rn from ${APP}.ods_user_info_inc where dt='$do_date' )t1 where rn=1 )new on old.id=new.id ) insert overwrite table ${APP}.dim_user_zip partition(dt) select if(new_id is not null,new_id,old_id), if(new_id is not null,new_login_name,old_login_name), if(new_id is not null,new_nick_name,old_nick_name), if(new_id is not null,new_name,old_name), if(new_id is not null,new_phone_num,old_phone_num), if(new_id is not null,new_email,old_email), if(new_id is not null,new_user_level,old_user_level), if(new_id is not null,new_birthday,old_birthday), if(new_id is not null,new_gender,old_gender), if(new_id is not null,new_create_time,old_create_time), if(new_id is not null,new_operate_time,old_operate_time), if(new_id is not null,new_start_date,old_start_date), if(new_id is not null,new_end_date,old_end_date), if(new_id is not null,new_end_date,old_end_date) dt from tmp union all select old_id, old_login_name, old_nick_name, old_name, old_phone_num, old_email, old_user_level, old_birthday, old_gender, old_create_time, old_operate_time, old_start_date, cast(date_add('$do_date',-1) as string) old_end_date, cast(date_add('$do_date',-1) as string) dt from tmp where old_id is not null and new_id is not null; " dim_sku_full=" with sku as ( select id, price, sku_name, sku_desc, weight, is_sale, spu_id, category3_id, tm_id, create_time from ${APP}.ods_sku_info_full where dt='$do_date' ), spu as ( select id, spu_name from ${APP}.ods_spu_info_full where dt='$do_date' ), c3 as ( select id, name, category2_id from ${APP}.ods_base_category3_full where dt='$do_date' ), c2 as ( select id, name, category1_id from ${APP}.ods_base_category2_full where dt='$do_date' ), c1 as ( select id, name from ${APP}.ods_base_category1_full where dt='$do_date' ), tm as ( select id, tm_name from ${APP}.ods_base_trademark_full where dt='$do_date' ), attr as ( select sku_id, collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs from ${APP}.ods_sku_attr_value_full where dt='$do_date' group by sku_id ), sale_attr as ( select sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs from ${APP}.ods_sku_sale_attr_value_full where dt='$do_date' group by sku_id ) insert overwrite table ${APP}.dim_sku_full partition(dt='$do_date') select sku.id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.is_sale, sku.spu_id, spu.spu_name, sku.category3_id, c3.name, c3.category2_id, c2.name, c2.category1_id, c1.name, sku.tm_id, tm.tm_name, attr.attrs, sale_attr.sale_attrs, sku.create_time from sku left join spu on sku.spu_id=spu.id left join c3 on sku.category3_id=c3.id left join c2 on c3.category2_id=c2.id left join c1 on c2.category1_id=c1.id left join tm on sku.tm_id=tm.id left join attr on sku.id=attr.sku_id left join sale_attr on sku.id=sale_attr.sku_id; " dim_province_full=" insert overwrite table ${APP}.dim_province_full partition(dt='$do_date') select province.id, province.name, province.area_code, province.iso_code, province.iso_3166_2, region_id, region_name from ( select id, name, region_id, area_code, iso_code, iso_3166_2 from ${APP}.ods_base_province_full where dt='$do_date' )province left join ( select id, region_name from ${APP}.ods_base_region_full where dt='$do_date' )region on province.region_id=region.id; " dim_coupon_full=" insert overwrite table ${APP}.dim_coupon_full partition(dt='$do_date') select id, coupon_name, coupon_type, coupon_dic.dic_name, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, case coupon_type when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元') when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折') when '3203' then concat('减',benefit_amount,'元') end benefit_rule, create_time, range_type, range_dic.dic_name, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ( select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ${APP}.ods_coupon_info_full where dt='$do_date' )ci left join ( select dic_code, dic_name from ${APP}.ods_base_dic_full where dt='$do_date' and parent_code='32' )coupon_dic on ci.coupon_type=coupon_dic.dic_code left join ( select dic_code, dic_name from ${APP}.ods_base_dic_full where dt='$do_date' and parent_code='33' )range_dic on ci.range_type=range_dic.dic_code; " dim_activity_full=" insert overwrite table ${APP}.dim_activity_full partition(dt='$do_date') select rule.id, info.id, activity_name, rule.activity_type, dic.dic_name, activity_desc, start_time, end_time, create_time, condition_amount, condition_num, benefit_amount, benefit_discount, case rule.activity_type when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元') when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折') when '3103' then concat('打',10*(1-benefit_discount),'折') end benefit_rule, benefit_level from ( select id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level from ${APP}.ods_activity_rule_full where dt='$do_date' )rule left join ( select id, activity_name, activity_type, activity_desc, start_time, end_time, create_time from ${APP}.ods_activity_info_full where dt='$do_date' )info on rule.activity_id=info.id left join ( select dic_code, dic_name from ${APP}.ods_base_dic_full where dt='$do_date' and parent_code='31' )dic on rule.activity_type=dic.dic_code; " case $1 in "dim_user_zip") hive -e "$dim_user_zip" ;; "dim_sku_full") hive -e "$dim_sku_full" ;; "dim_province_full") hive -e "$dim_province_full" ;; "dim_coupon_full") hive -e "$dim_coupon_full" ;; "dim_activity_full") hive -e "$dim_activity_full" ;; "all") hive -e "$dim_user_zip$dim_sku_full$dim_province_full$dim_coupon_full$dim_activity_full" ;; esac
3、增加脚本执行权限
chmod +x ods_to_dim.sh
4、脚本用法
ods_to_dim.sh all 2020-06-14
您的支持是我创作的无限动力
希望我能为您的未来尽绵薄之力
如有错误,谢谢指正若有收获,谢谢赞美