Data Warehouse in Practice 4 (Part 1)


Business Data Warehouse

Previously we built the user-behavior data warehouse; next we will build the business data warehouse.

Assume we currently have the following 8 business tables:

MariaDB [gmall]> show tables;
+-----------------+
| Tables_in_gmall |
+-----------------+
| base_category1  |
| base_category2  |
| base_category3  |
| order_detail    |
| order_info      |
| payment_info    |
| sku_info        |
| user_info       |
+-----------------+
8 rows in set (0.00 sec)

Scheduled Sqoop Import Script

Create the script sqoop_import.sh:

#!/bin/bash
export HADOOP_USER_NAME=hive
db_date=$2
echo $db_date
db_name=gmall
# import_data <table> <query>: free-form Sqoop query import into HDFS under /origin_data
# Sqoop requires the literal token $CONDITIONS in every --query statement.
import_data() {
sqoop import \
--connect jdbc:mysql://192.168.0.208:3306/$db_name \
--username root \
--password password \
--target-dir /origin_data/$db_name/db/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "$2"' and  $CONDITIONS;'
}
import_sku_info(){
  import_data  "sku_info"  "select 
id, spu_id, price, sku_name, sku_desc, weight, tm_id,
category3_id, create_time 
  from sku_info  where 1=1"
}
import_user_info(){
  import_data "user_info" "select 
id, name, birthday, gender, email, user_level, 
create_time 
from user_info where 1=1"
}
import_base_category1(){
  import_data "base_category1" "select 
id, name from base_category1 where 1=1"
}
import_base_category2(){
  import_data "base_category2" "select 
id, name, category1_id from base_category2 where 1=1"
}
import_base_category3(){
  import_data "base_category3" "select id, name, category2_id from base_category3 where 1=1"
}
import_order_detail(){
  import_data   "order_detail"   "select 
    od.id, 
    order_id, 
    user_id, 
    sku_id, 
    sku_name, 
    order_price, 
    sku_num, 
    o.create_time  
  from order_info o , order_detail od 
  where o.id=od.order_id 
  and DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date'"
}
import_payment_info(){
  import_data "payment_info"   "select 
    id,  
    out_trade_no, 
    order_id, 
    user_id, 
    alipay_trade_no, 
    total_amount,  
    subject, 
    payment_type, 
    payment_time 
  from payment_info 
  where DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"
}
import_order_info(){
  import_data   "order_info"   "select 
    id, 
    total_amount, 
    order_status, 
    user_id, 
    payment_way, 
    out_trade_no, 
    create_time, 
    operate_time  
  from order_info 
  where  (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date' or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"
}
case $1 in
  "base_category1")
     import_base_category1
;;
  "base_category2")
     import_base_category2
;;
  "base_category3")
     import_base_category3
;;
  "order_info")
     import_order_info
;;
  "order_detail")
     import_order_detail
;;
  "sku_info")
     import_sku_info
;;
  "user_info")
     import_user_info
;;
  "payment_info")
     import_payment_info
;;
   "all")
   import_base_category1
   import_base_category2
   import_base_category3
   import_order_info
   import_order_detail
   import_sku_info
   import_user_info
   import_payment_info
;;
esac
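
Since this is a scheduled import script, it would normally be made executable and driven from cron (or a scheduler such as Oozie or Azkaban). A minimal sketch, assuming the script is installed under /root/bin (an illustrative path, not from the article):

chmod +x /root/bin/sqoop_import.sh
# crontab entry: every night at 00:30, import all tables for the previous day
30 0 * * * /root/bin/sqoop_import.sh all $(date -d '-1 day' +\%F) >> /root/bin/sqoop_import.log 2>&1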

Run the script to import the data:

sqoop_import.sh all 2020-11-24
2020-11-24
Warning: /opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:06:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.7.1.3.0-100
20/11/25 11:06:38 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 11:06:39 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 11:06:39 INFO tool.CodeGenTool: Beginning code generation
20/11/25 11:06:40 INFO manager.SqlManager: Executing SQL statement: select 
id, name from base_category1 where 1=1 and   (1 = 0) ;
20/11/25 11:06:40 INFO manager.SqlManager: Executing SQL statement: select 
id, name from base_category1 where 1=1 and   (1 = 0) ;
20/11/25 11:06:40 INFO manager.SqlManager: Executing SQL statement: select 
id, name from base_category1 where 1=1 and   (1 = 0) ;
20/11/25 11:06:40 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
warning: No SupportedSourceVersion annotation found on org.apache.hadoop.hdds.conf.ConfigFileGenerator, returning RELEASE_6.
warning: Supported source version 'RELEASE_6' from annotation processor 'org.apache.hadoop.hdds.conf.ConfigFileGenerator' less than -source '1.8'
2 warnings
20/11/25 11:06:43 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/1c9ca0f79341623102daea7280554592/QueryResult.jar
20/11/25 11:06:44 INFO tool.ImportTool: Destination directory /origin_data/gmall/db/base_category1/2020-11-24 is not present, hence not deleting.
20/11/25 11:06:44 INFO mapreduce.ImportJobBase: Beginning query import.
20/11/25 11:06:44 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 11:06:45 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 11:06:45 INFO client.RMProxy: Connecting to ResourceManager at cdh2.macro.com/192.168.0.207:8032
20/11/25 11:06:46 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hive/.staging/job_1606125802436_0008
20/11/25 11:06:53 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 11:06:53 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 11:06:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606125802436_0008
20/11/25 11:06:54 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 11:06:55 INFO conf.Configuration: resource-types.xml not found
20/11/25 11:06:55 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 11:06:55 INFO impl.YarnClientImpl: Submitted application application_1606125802436_0008
20/11/25 11:06:55 INFO mapreduce.Job: The url to track the job: http://cdh2.macro.com:8088/proxy/application_1606125802436_0008/
20/11/25 11:06:55 INFO mapreduce.Job: Running job: job_1606125802436_0008
20/11/25 11:07:06 INFO mapreduce.Job: Job job_1606125802436_0008 running in uber mode : false
20/11/25 11:07:06 INFO mapreduce.Job:  map 0% reduce 0%
20/11/25 11:07:12 INFO mapreduce.Job:  map 100% reduce 0%
20/11/25 11:07:12 INFO mapreduce.Job: Job job_1606125802436_0008 completed successfully
20/11/25 11:07:12 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=253013
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=85
        HDFS: Number of bytes written=270
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters 
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4014
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=4014
        Total vcore-milliseconds taken by all map tasks=4014
        Total megabyte-milliseconds taken by all map tasks=8220672
    Map-Reduce Framework
        Map input records=18
        Map output records=18
        Input split bytes=85
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=76
        CPU time spent (ms)=1540
        Physical memory (bytes) snapshot=245944320
        Virtual memory (bytes) snapshot=2841169920
        Total committed heap usage (bytes)=249036800
        Peak Map Physical memory (bytes)=245944320
        Peak Map Virtual memory (bytes)=2841169920
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=270
20/11/25 11:07:12 INFO mapreduce.ImportJobBase: Transferred 270 bytes in 27.907 seconds (9.675 bytes/sec)
20/11/25 11:07:12 INFO mapreduce.ImportJobBase: Retrieved 18 records.
Warning: /opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:07:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.7.1.3.0-100
20/11/25 11:07:18 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 11:07:18 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 11:07:18 INFO tool.CodeGenTool: Beginning code generation
20/11/25 11:07:18 INFO manager.SqlManager: Executing SQL statement: select 
id, name, category1_id from base_category2 where 1=1 and   (1 = 0) ;
20/11/25 11:07:18 INFO manager.SqlManager: Executing SQL statement: select 
id, name, category1_id from base_category2 where 1=1 and   (1 = 0) ;
20/11/25 11:07:18 INFO manager.SqlManager: Executing SQL statement: select 
id, name, category1_id from base_category2 where 1=1 and   (1 = 0) ;
20/11/25 11:07:18 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
warning: No SupportedSourceVersion annotation found on org.apache.hadoop.hdds.conf.ConfigFileGenerator, returning RELEASE_6.
warning: Supported source version 'RELEASE_6' from annotation processor 'org.apache.hadoop.hdds.conf.ConfigFileGenerator' less than -source '1.8'
2 warnings
20/11/25 11:07:21 WARN orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/bcba56c3140f5b064e3dbba26e43279f/QueryResult.java to /data0/25k/./QueryResult.java. Error: Destination '/data0/25k/./QueryResult.java' already exists
20/11/25 11:07:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/bcba56c3140f5b064e3dbba26e43279f/QueryResult.jar
20/11/25 11:07:22 INFO tool.ImportTool: Destination directory /origin_data/gmall/db/base_category2/2020-11-24 is not present, hence not deleting.
20/11/25 11:07:22 INFO mapreduce.ImportJobBase: Beginning query import.
20/11/25 11:07:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 11:07:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 11:07:22 INFO client.RMProxy: Connecting to ResourceManager at cdh2.macro.com/192.168.0.207:8032
20/11/25 11:07:23 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hive/.staging/job_1606125802436_0009
20/11/25 11:07:26 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 11:07:26 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 11:07:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606125802436_0009
20/11/25 11:07:27 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 11:07:27 INFO conf.Configuration: resource-types.xml not found
20/11/25 11:07:27 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 11:07:27 INFO impl.YarnClientImpl: Submitted application application_1606125802436_0009
20/11/25 11:07:27 INFO mapreduce.Job: The url to track the job: http://cdh2.macro.com:8088/proxy/application_1606125802436_0009/
20/11/25 11:07:27 INFO mapreduce.Job: Running job: job_1606125802436_0009
20/11/25 11:07:44 INFO mapreduce.Job: Job job_1606125802436_0009 running in uber mode : false
20/11/25 11:07:44 INFO mapreduce.Job:  map 0% reduce 0%
20/11/25 11:07:50 INFO mapreduce.Job:  map 100% reduce 0%
20/11/25 11:07:51 INFO mapreduce.Job: Job job_1606125802436_0009 completed successfully
20/11/25 11:07:51 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=253031
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=85
        HDFS: Number of bytes written=2181
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters 
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4244
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=4244
        Total vcore-milliseconds taken by all map tasks=4244
        Total megabyte-milliseconds taken by all map tasks=8691712
    Map-Reduce Framework
        Map input records=125
        Map output records=125
        Input split bytes=85
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=71
        CPU time spent (ms)=1700
        Physical memory (bytes) snapshot=259682304
        Virtual memory (bytes) snapshot=2850103296
        Total committed heap usage (bytes)=235929600
        Peak Map Physical memory (bytes)=259682304
        Peak Map Virtual memory (bytes)=2850103296
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=2181
20/11/25 11:07:51 INFO mapreduce.ImportJobBase: Transferred 2.1299 KB in 29.0742 seconds (75.0149 bytes/sec)
20/11/25 11:07:51 INFO mapreduce.ImportJobBase: Retrieved 125 records.
Warning: /opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 11:07:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.7.1.3.0-100
20/11/25 11:07:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 11:07:56 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 11:07:56 INFO tool.CodeGenTool: Beginning code generation
20/11/25 11:07:57 INFO manager.SqlManager: Executing SQL statement: select id, name, category2_id from base_category3 where 1=1 and   (1 = 0) ;
20/11/25 11:07:57 INFO manager.SqlManager: Executing SQL statement: select id, name, category2_id from base_category3 where 1=1 and   (1 = 0) ;
20/11/25 11:07:57 INFO manager.SqlManager: Executing SQL statement: select id, name, category2_id from base_category3 where 1=1 and   (1 = 0) ;
20/11/25 11:07:57 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
warning: No SupportedSourceVersion annotation found on org.apache.hadoop.hdds.conf.ConfigFileGenerator, returning RELEASE_6.
warning: Supported source version 'RELEASE_6' from annotation processor 'org.apache.hadoop.hdds.conf.ConfigFileGenerator' less than -source '1.8'
2 warnings
20/11/25 11:07:59 WARN orm.CompilationManager: Could not rename /tmp/sqoop-root/compile/7b64a4d6eff156e0c71805f454daf5af/QueryResult.java to /data0/25k/./QueryResult.java. Error: Destination '/data0/25k/./QueryResult.java' already exists
20/11/25 11:07:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/7b64a4d6eff156e0c71805f454daf5af/QueryResult.jar
20/11/25 11:08:00 INFO tool.ImportTool: Destination directory /origin_data/gmall/db/base_category3/2020-11-24 is not present, hence not deleting.
20/11/25 11:08:00 INFO mapreduce.ImportJobBase: Beginning query import.
20/11/25 11:08:00 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 11:08:00 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 11:08:01 INFO client.RMProxy: Connecting to ResourceManager at cdh2.macro.com/192.168.0.207:8032
20/11/25 11:08:02 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hive/.staging/job_1606125802436_0010
20/11/25 11:08:05 INFO db.DBInputFormat: Using read commited transaction isolation
20/11/25 11:08:05 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 11:08:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606125802436_0010
20/11/25 11:08:05 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 11:08:06 INFO conf.Configuration: resource-types.xml not found
20/11/25 11:08:06 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 11:08:06 INFO impl.YarnClientImpl: Submitted application application_1606125802436_0010
20/11/25 11:08:06 INFO mapreduce.Job: The url to track the job: http://cdh2.macro.com:8088/proxy/application_1606125802436_0010/
20/11/25 11:08:06 INFO mapreduce.Job: Running job: job_1606125802436_0010
20/11/25 11:08:15 INFO mapreduce.Job: Job job_1606125802436_0010 running in uber mode : false
20/11/25 11:08:15 INFO mapreduce.Job:  map 0% reduce 0%
20/11/25 11:08:25 INFO mapreduce.Job:  map 100% reduce 0%
20/11/25 11:08:25 INFO mapreduce.Job: Job job_1606125802436_0010 completed successfully

ODS Layer

Create the ODS tables as exact, field-for-field replicas of the tables in the business database.

The DDL statements are as follows:

drop table if exists ods_order_info;
create external table ods_order_info (
  `id` string COMMENT 'order id',
  `total_amount` decimal(10,2) COMMENT 'order amount',
  `order_status` string COMMENT 'order status',
  `user_id` string COMMENT 'user id',
  `payment_way` string COMMENT 'payment method',
  `out_trade_no` string COMMENT 'payment transaction number',
  `create_time` string COMMENT 'create time',
  `operate_time` string COMMENT 'operate time'
) COMMENT 'order table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists ods_order_detail;
create external table ods_order_detail(
  `id` string COMMENT 'order detail id',
  `order_id` string COMMENT 'order id',
  `user_id` string COMMENT 'user id',
  `sku_id` string COMMENT 'sku id',
  `sku_name` string COMMENT 'sku name',
  `order_price` string COMMENT 'sku price',
  `sku_num` string COMMENT 'sku quantity',
  `create_time` string COMMENT 'create time'
) COMMENT 'order detail table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists ods_sku_info;
create external table ods_sku_info(
  `id` string COMMENT 'sku id',
  `spu_id` string COMMENT 'spu id',
  `price` decimal(10,2) COMMENT 'price',
  `sku_name` string COMMENT 'sku name',
  `sku_desc` string COMMENT 'sku description',
  `weight` string COMMENT 'weight',
  `tm_id` string COMMENT 'brand id',
  `category3_id` string COMMENT 'category3 id',
  `create_time` string COMMENT 'create time'
) COMMENT 'sku table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists ods_user_info;
create external table ods_user_info(
  `id` string COMMENT 'user id',
  `name` string COMMENT 'name',
  `birthday` string COMMENT 'birthday',
  `gender` string COMMENT 'gender',
  `email` string COMMENT 'email',
  `user_level` string COMMENT 'user level',
  `create_time` string COMMENT 'create time'
) COMMENT 'user info table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists ods_base_category1;
create external table ods_base_category1(
  `id` string COMMENT 'id',
  `name` string COMMENT 'name'
) COMMENT 'product category level 1'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists ods_base_category2;
create external table ods_base_category2(
  `id` string COMMENT 'id',
  `name` string COMMENT 'name',
  `category1_id` string COMMENT 'category1 id'
) COMMENT 'product category level 2'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists ods_base_category3;
create external table ods_base_category3(
  `id` string COMMENT 'id',
  `name` string COMMENT 'name',
  `category2_id` string COMMENT 'category2 id'
) COMMENT 'product category level 3'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

drop table if exists `ods_payment_info`;
create external table `ods_payment_info`(
  `id` bigint COMMENT 'id',
  `out_trade_no` string COMMENT 'external trade number',
  `order_id` string COMMENT 'order id',
  `user_id` string COMMENT 'user id',
  `alipay_trade_no` string COMMENT 'Alipay transaction number',
  `total_amount` decimal(16,2) COMMENT 'payment amount',
  `subject` string COMMENT 'transaction subject',
  `payment_type` string COMMENT 'payment type',
  `payment_time` string COMMENT 'payment time'
) COMMENT 'payment info table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t';

ODS Data Load Script

Create the script ods_db.sh:

#!/bin/bash
APP=gmall
# If a date argument is given, use it; otherwise default to yesterday's date
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`  
fi
sql=" 
load data inpath '/origin_data/$APP/db/order_info/$do_date'  OVERWRITE into table "$APP".ods_order_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/order_detail/$do_date'  OVERWRITE into table "$APP".ods_order_detail partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/sku_info/$do_date'  OVERWRITE into table "$APP".ods_sku_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table "$APP".ods_user_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table "$APP".ods_payment_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table "$APP".ods_base_category1 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table "$APP".ods_base_category2 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table "$APP".ods_base_category3 partition(dt='$do_date'); 
"
hive -e "$sql"
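
Before running it, make the script executable; note that with no argument it defaults to loading yesterday's partitions (a usage sketch):

chmod +x ods_db.sh
ods_db.sh              # no date argument: loads yesterday's data
ods_db.sh 2020-11-24   # explicit date, as used below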

Run the script to load the data:

ods_db.sh 2020-11-24
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. Set system property 'log4j2.debug' to show Log4j2 internal initialization logging.
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2
20/11/25 12:04:07 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=cdh2.macro.com:3181:4181:participant, version=0, server.3=cdh1.macro.com:3181:4181:participant, server.2=cdh3.macro.com:3181:4181:participant}
20/11/25 12:04:07 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=cdh2.macro.com:3181:4181:participant, version=0, server.3=cdh1.macro.com:3181:4181:participant, server.2=cdh3.macro.com:3181:4181:participant}
20/11/25 12:04:07 [main]: INFO jdbc.HiveConnection: Connected to cdh2.macro.com:10000
Connected to: Apache Hive (version 3.1.3000.7.1.3.0-100)
Driver: Hive JDBC (version 3.1.3000.7.1.3.0-100)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20201125120407_cb083501-a516-40dc-8f5f-1b7e091f98d2): load data inpath '/origin_data/gmall/db/order_info/2020-11-24'  OVERWRITE into table gmall.ods_order_info partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120407_cb083501-a516-40dc-8f5f-1b7e091f98d2); Time taken: 0.175 seconds
INFO  : Executing command(queryId=hive_20201125120407_cb083501-a516-40dc-8f5f-1b7e091f98d2): load data inpath '/origin_data/gmall/db/order_info/2020-11-24'  OVERWRITE into table gmall.ods_order_info partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_order_info partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/order_info/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120407_cb083501-a516-40dc-8f5f-1b7e091f98d2); Time taken: 0.462 seconds
INFO  : OK
No rows affected (0.716 seconds)
INFO  : Compiling command(queryId=hive_20201125120408_2d33ea0d-8a3c-4247-b315-c6ec2c033256): load data inpath '/origin_data/gmall/db/order_detail/2020-11-24'  OVERWRITE into table gmall.ods_order_detail partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120408_2d33ea0d-8a3c-4247-b315-c6ec2c033256); Time taken: 0.16 seconds
INFO  : Executing command(queryId=hive_20201125120408_2d33ea0d-8a3c-4247-b315-c6ec2c033256): load data inpath '/origin_data/gmall/db/order_detail/2020-11-24'  OVERWRITE into table gmall.ods_order_detail partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_order_detail partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/order_detail/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120408_2d33ea0d-8a3c-4247-b315-c6ec2c033256); Time taken: 0.473 seconds
INFO  : OK
No rows affected (0.655 seconds)
INFO  : Compiling command(queryId=hive_20201125120409_c01eb4e3-beaa-4235-9ac1-4d2a70614b2b): load data inpath '/origin_data/gmall/db/sku_info/2020-11-24'  OVERWRITE into table gmall.ods_sku_info partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120409_c01eb4e3-beaa-4235-9ac1-4d2a70614b2b); Time taken: 0.154 seconds
INFO  : Executing command(queryId=hive_20201125120409_c01eb4e3-beaa-4235-9ac1-4d2a70614b2b): load data inpath '/origin_data/gmall/db/sku_info/2020-11-24'  OVERWRITE into table gmall.ods_sku_info partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_sku_info partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/sku_info/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120409_c01eb4e3-beaa-4235-9ac1-4d2a70614b2b); Time taken: 0.418 seconds
INFO  : OK
No rows affected (0.605 seconds)
INFO  : Compiling command(queryId=hive_20201125120409_808c17aa-5c8f-48b0-a27e-76f914c95512): load data inpath '/origin_data/gmall/db/user_info/2020-11-24' OVERWRITE into table gmall.ods_user_info partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120409_808c17aa-5c8f-48b0-a27e-76f914c95512); Time taken: 0.154 seconds
INFO  : Executing command(queryId=hive_20201125120409_808c17aa-5c8f-48b0-a27e-76f914c95512): load data inpath '/origin_data/gmall/db/user_info/2020-11-24' OVERWRITE into table gmall.ods_user_info partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_user_info partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/user_info/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120409_808c17aa-5c8f-48b0-a27e-76f914c95512); Time taken: 0.419 seconds
INFO  : OK
No rows affected (0.597 seconds)
INFO  : Compiling command(queryId=hive_20201125120410_676c568e-b172-473a-bbc6-f9f9b15e78df): load data inpath '/origin_data/gmall/db/payment_info/2020-11-24' OVERWRITE into table gmall.ods_payment_info partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120410_676c568e-b172-473a-bbc6-f9f9b15e78df); Time taken: 0.154 seconds
INFO  : Executing command(queryId=hive_20201125120410_676c568e-b172-473a-bbc6-f9f9b15e78df): load data inpath '/origin_data/gmall/db/payment_info/2020-11-24' OVERWRITE into table gmall.ods_payment_info partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_payment_info partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/payment_info/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120410_676c568e-b172-473a-bbc6-f9f9b15e78df); Time taken: 0.441 seconds
INFO  : OK
No rows affected (0.616 seconds)
INFO  : Compiling command(queryId=hive_20201125120411_d082247f-1fb6-45c4-b6a6-622f60a312bb): load data inpath '/origin_data/gmall/db/base_category1/2020-11-24' OVERWRITE into table gmall.ods_base_category1 partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120411_d082247f-1fb6-45c4-b6a6-622f60a312bb); Time taken: 0.149 seconds
INFO  : Executing command(queryId=hive_20201125120411_d082247f-1fb6-45c4-b6a6-622f60a312bb): load data inpath '/origin_data/gmall/db/base_category1/2020-11-24' OVERWRITE into table gmall.ods_base_category1 partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_base_category1 partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/base_category1/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120411_d082247f-1fb6-45c4-b6a6-622f60a312bb); Time taken: 0.413 seconds
INFO  : OK
No rows affected (0.583 seconds)
INFO  : Compiling command(queryId=hive_20201125120411_2d79b915-9fb8-4e71-9a26-97d4e369ed91): load data inpath '/origin_data/gmall/db/base_category2/2020-11-24' OVERWRITE into table gmall.ods_base_category2 partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120411_2d79b915-9fb8-4e71-9a26-97d4e369ed91); Time taken: 0.153 seconds
INFO  : Executing command(queryId=hive_20201125120411_2d79b915-9fb8-4e71-9a26-97d4e369ed91): load data inpath '/origin_data/gmall/db/base_category2/2020-11-24' OVERWRITE into table gmall.ods_base_category2 partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_base_category2 partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/base_category2/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120411_2d79b915-9fb8-4e71-9a26-97d4e369ed91); Time taken: 0.411 seconds
INFO  : OK
No rows affected (0.585 seconds)
INFO  : Compiling command(queryId=hive_20201125120412_02cc534c-8a19-4f4c-bbb9-d422efda325d): load data inpath '/origin_data/gmall/db/base_category3/2020-11-24' OVERWRITE into table gmall.ods_base_category3 partition(dt='2020-11-24')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20201125120412_02cc534c-8a19-4f4c-bbb9-d422efda325d); Time taken: 0.154 seconds
INFO  : Executing command(queryId=hive_20201125120412_02cc534c-8a19-4f4c-bbb9-d422efda325d): load data inpath '/origin_data/gmall/db/base_category3/2020-11-24' OVERWRITE into table gmall.ods_base_category3 partition(dt='2020-11-24')
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table gmall.ods_base_category3 partition (dt=2020-11-24) from hdfs://cdh1.macro.com:8020/origin_data/gmall/db/base_category3/2020-11-24
INFO  : Starting task [Stage-1:STATS] in serial mode
INFO  : Completed executing command(queryId=hive_20201125120412_02cc534c-8a19-4f4c-bbb9-d422efda325d); Time taken: 0.425 seconds
INFO  : OK
No rows affected (0.601 seconds)
Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive
Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2
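
After the load, every ODS table should contain a dt='2020-11-24' partition. A quick sanity check, sketched in the same hive -e style as the script above:

hive -e "use gmall;
show partitions ods_order_info;
select count(*) from ods_base_category1 where dt='2020-11-24';"
# the count should match the 18 records Sqoop reported for base_category1 earlier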

DWD Layer

Filter null records out of the ODS data, and degenerate (denormalize) the product category dimension tables into the sku table; a sketch of the corresponding load SQL is given after the DDL below.

Create the order table

drop table if exists dwd_order_info;
create external table dwd_order_info (
  `id` string COMMENT 'order id',
  `total_amount` decimal(10,2) COMMENT 'order amount',
  `order_status` string COMMENT 'order status (1 2 3 4 5)',
  `user_id` string COMMENT 'user id',
  `payment_way` string COMMENT 'payment method',
  `out_trade_no` string COMMENT 'payment transaction number',
  `create_time` string COMMENT 'create time',
  `operate_time` string COMMENT 'operate time'
)
PARTITIONED BY (`dt` string)
stored as parquet;

Order detail table

drop table if exists dwd_order_detail;
create external table dwd_order_detail(
  `id` string COMMENT 'order detail id',
  `order_id` decimal(10,2) COMMENT 'order id',
  `user_id` string COMMENT 'user id',
  `sku_id` string COMMENT 'sku id',
  `sku_name` string COMMENT 'sku name',
  `order_price` string COMMENT 'sku price',
  `sku_num` string COMMENT 'sku quantity',
  `create_time` string COMMENT 'create time'
)
PARTITIONED BY (`dt` string)
stored as parquet;

User table

drop table if exists dwd_user_info;
create external table dwd_user_info(
  `id` string COMMENT 'user id',
  `name` string COMMENT 'name',
  `birthday` string COMMENT 'birthday',
  `gender` string COMMENT 'gender',
  `email` string COMMENT 'email',
  `user_level` string COMMENT 'user level',
  `create_time` string COMMENT 'create time'
)
PARTITIONED BY (`dt` string)
stored as parquet;

Payment info table

drop table if exists `dwd_payment_info`;
create external table `dwd_payment_info`(
  `id` bigint COMMENT 'id',
  `out_trade_no` string COMMENT 'external trade number',
  `order_id` string COMMENT 'order id',
  `user_id` string COMMENT 'user id',
  `alipay_trade_no` string COMMENT 'Alipay transaction number',
  `total_amount` decimal(16,2) COMMENT 'payment amount',
  `subject` string COMMENT 'transaction subject',
  `payment_type` string COMMENT 'payment type',
  `payment_time` string COMMENT 'payment time'
)
PARTITIONED BY (`dt` string)
stored as parquet;

Create the sku table (with the category columns added)

drop table if exists dwd_sku_info;
create external table dwd_sku_info(
  `id` string COMMENT 'sku id',
  `spu_id` string COMMENT 'spu id',
  `price` decimal(10,2) COMMENT 'price',
  `sku_name` string COMMENT 'sku name',
  `sku_desc` string COMMENT 'sku description',
  `weight` string COMMENT 'weight',
  `tm_id` string COMMENT 'brand id',
  `category3_id` string COMMENT 'category3 id',
  `category2_id` string COMMENT 'category2 id',
  `category1_id` string COMMENT 'category1 id',
  `category3_name` string COMMENT 'category3 name',
  `category2_name` string COMMENT 'category2 name',
  `category1_name` string COMMENT 'category1 name',
  `create_time` string COMMENT 'create time'
)
PARTITIONED BY (`dt` string)
stored as parquet;
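
The DDL above only creates the DWD tables. The load step that applies the null filtering and the category-dimension degeneration could be scripted in the same style as ods_db.sh; the following dwd_db.sh is a sketch (a hypothetical script name, covering only two of the tables, not the tutorial's exact code):

#!/bin/bash
APP=gmall
# use the given date, or default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
-- null-filter orders from ODS into DWD
insert overwrite table "$APP".dwd_order_info partition(dt='$do_date')
select id, total_amount, order_status, user_id, payment_way,
       out_trade_no, create_time, operate_time
from "$APP".ods_order_info
where dt='$do_date' and id is not null;

-- degenerate the three category tables into the sku dimension
insert overwrite table "$APP".dwd_sku_info partition(dt='$do_date')
select sku.id, sku.spu_id, sku.price, sku.sku_name, sku.sku_desc, sku.weight,
       sku.tm_id, sku.category3_id, c2.id, c1.id,
       c3.name, c2.name, c1.name, sku.create_time
from "$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id
join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id
join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id
where sku.dt='$do_date' and c3.dt='$do_date' and c2.dt='$do_date' and c1.dt='$do_date'
  and sku.id is not null;
"
hive -e "$sql"

The remaining DWD tables would be loaded with the same pattern.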

