数据仓库实战 4(二)

本文涉及的产品
云数据库 RDS MySQL,集群系列 2核4GB
推荐场景:
搭建个人博客
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
RDS MySQL Serverless 高可用系列,价值2615元额度,1个月
简介: 数据仓库实战 4

DWD层数据导入脚本

创建脚本dwd_db.sh

#!/bin/bash
# 定义变量方便修改
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`  
fi 
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table   "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info 
where dt='$do_date'  and id is not null;
insert overwrite table   "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail 
where dt='$do_date'   and id is not null;
insert overwrite table   "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date'   and id is not null;
insert overwrite table   "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date'  and id is not null;
insert overwrite table   "$APP".dwd_sku_info partition(dt)
select  
    sku.id,
    sku.spu_id, 
    sku.price,
    sku.sku_name,  
    sku.sku_desc,  
    sku.weight,  
    sku.tm_id,  
    sku.category3_id,  
    c2.id category2_id ,  
    c1.id category1_id,  
    c3.name category3_name,  
    c2.name category2_name,  
    c1.name category1_name,  
    sku.create_time,
    sku.dt
from
    "$APP".ods_sku_info sku 
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id 
    join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id 
    join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id 
where sku.dt='$do_date'  and c2.dt='$do_date'  
and  c3.dt='$do_date' and  c1.dt='$do_date' 
and sku.id is not null;
"
hive -e "$sql"

执行脚本 dwd_db.sh 2020-11-24

INFO  : Status: DAG finished successfully in 9.01 seconds
INFO  : 
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           1.16s
INFO  : Prepare Plan                            0.16s
INFO  : Get Query Coordinator (AM)              0.01s
INFO  : Submit Plan                             0.10s
INFO  : Start DAG                               0.70s
INFO  : Run DAG                                 9.01s
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1            506.00            380              0             143              125
INFO  :      Map 2              0.00            110              0              18               18
INFO  :      Map 3           4955.00         10,340             90           1,574                1
INFO  :      Map 5              0.00            510              0             300              300
INFO  :  Reducer 4              0.00            610              0               1                0
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : org.apache.tez.common.counters.DAGCounter:
INFO  :    NUM_SUCCEEDED_TASKS: 6
INFO  :    TOTAL_LAUNCHED_TASKS: 6
INFO  :    DATA_LOCAL_TASKS: 4
INFO  :    AM_CPU_MILLISECONDS: 2930
INFO  :    AM_GC_TIME_MILLIS: 0
INFO  : File System Counters:
INFO  :    FILE_BYTES_READ: 4550
INFO  :    FILE_BYTES_WRITTEN: 36200
INFO  :    HDFS_BYTES_READ: 53605
INFO  :    HDFS_BYTES_WRITTEN: 54917
INFO  :    HDFS_READ_OPS: 13
INFO  :    HDFS_WRITE_OPS: 7
INFO  :    HDFS_OP_CREATE: 3
INFO  :    HDFS_OP_GET_FILE_STATUS: 9
INFO  :    HDFS_OP_MKDIRS: 2
INFO  :    HDFS_OP_OPEN: 4
INFO  :    HDFS_OP_RENAME: 2
INFO  : org.apache.tez.common.counters.TaskCounter:
INFO  :    REDUCE_INPUT_GROUPS: 1
INFO  :    REDUCE_INPUT_RECORDS: 1
INFO  :    COMBINE_INPUT_RECORDS: 0
INFO  :    SPILLED_RECORDS: 2
INFO  :    NUM_SHUFFLED_INPUTS: 4
INFO  :    NUM_SKIPPED_INPUTS: 1
INFO  :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO  :    MERGED_MAP_OUTPUTS: 1
INFO  :    GC_TIME_MILLIS: 90
INFO  :    TASK_DURATION_MILLIS: 5711
INFO  :    CPU_MILLISECONDS: 11950
INFO  :    PHYSICAL_MEMORY_BYTES: 4435476480
INFO  :    VIRTUAL_MEMORY_BYTES: 32940134400
INFO  :    COMMITTED_HEAP_BYTES: 4435476480
INFO  :    INPUT_RECORDS_PROCESSED: 2035
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 53605
INFO  :    OUTPUT_RECORDS: 444
INFO  :    OUTPUT_LARGE_RECORDS: 0
INFO  :    OUTPUT_BYTES: 41244
INFO  :    OUTPUT_BYTES_WITH_OVERHEAD: 42158
INFO  :    OUTPUT_BYTES_PHYSICAL: 32070
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 4050
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 4050
INFO  :    ADDITIONAL_SPILL_COUNT: 0
INFO  :    SHUFFLE_CHUNK_COUNT: 1
INFO  :    SHUFFLE_BYTES: 31998
INFO  :    SHUFFLE_BYTES_DECOMPRESSED: 42158
INFO  :    SHUFFLE_BYTES_TO_MEM: 31642
INFO  :    SHUFFLE_BYTES_TO_DISK: 0
INFO  :    SHUFFLE_BYTES_DISK_DIRECT: 356
INFO  :    NUM_MEM_TO_DISK_MERGES: 0
INFO  :    NUM_DISK_TO_DISK_MERGES: 0
INFO  :    SHUFFLE_PHASE_TIME: 1149
INFO  :    MERGE_PHASE_TIME: 88
INFO  :    FIRST_EVENT_RECEIVED: 690
INFO  :    LAST_EVENT_RECEIVED: 690
INFO  :    DATA_BYTES_VIA_EVENT: 0
INFO  : HIVE:
INFO  :    CREATED_DYNAMIC_PARTITIONS: 1
INFO  :    CREATED_FILES: 2
INFO  :    DESERIALIZE_ERRORS: 0
INFO  :    RECORDS_IN_Map_1: 125
INFO  :    RECORDS_IN_Map_2: 18
INFO  :    RECORDS_IN_Map_3: 1149
INFO  :    RECORDS_IN_Map_5: 300
INFO  :    RECORDS_OUT_0: 1
INFO  :    RECORDS_OUT_1_gmall.dwd_sku_info: 300
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_1: 125
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_2: 18
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_3: 1
INFO  :    RECORDS_OUT_INTERMEDIATE_Map_5: 300
INFO  :    RECORDS_OUT_INTERMEDIATE_Reducer_4: 0
INFO  :    RECORDS_OUT_OPERATOR_FIL_41: 1149
INFO  :    RECORDS_OUT_OPERATOR_FIL_49: 18
INFO  :    RECORDS_OUT_OPERATOR_FIL_52: 125
INFO  :    RECORDS_OUT_OPERATOR_FIL_56: 300
INFO  :    RECORDS_OUT_OPERATOR_FS_22: 300
INFO  :    RECORDS_OUT_OPERATOR_FS_29: 1
INFO  :    RECORDS_OUT_OPERATOR_GBY_25: 1
INFO  :    RECORDS_OUT_OPERATOR_GBY_27: 1
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_44: 1149
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_45: 300
INFO  :    RECORDS_OUT_OPERATOR_MAPJOIN_54: 125
INFO  :    RECORDS_OUT_OPERATOR_MAP_0: 0
INFO  :    RECORDS_OUT_OPERATOR_RS_26: 1
INFO  :    RECORDS_OUT_OPERATOR_RS_51: 18
INFO  :    RECORDS_OUT_OPERATOR_RS_55: 125
INFO  :    RECORDS_OUT_OPERATOR_RS_58: 300
INFO  :    RECORDS_OUT_OPERATOR_SEL_21: 300
INFO  :    RECORDS_OUT_OPERATOR_SEL_24: 300
INFO  :    RECORDS_OUT_OPERATOR_SEL_28: 1
INFO  :    RECORDS_OUT_OPERATOR_SEL_50: 18
INFO  :    RECORDS_OUT_OPERATOR_SEL_53: 125
INFO  :    RECORDS_OUT_OPERATOR_SEL_57: 300
INFO  :    RECORDS_OUT_OPERATOR_SEL_8: 1149
INFO  :    RECORDS_OUT_OPERATOR_TS_0: 125
INFO  :    RECORDS_OUT_OPERATOR_TS_3: 18
INFO  :    RECORDS_OUT_OPERATOR_TS_6: 1149
INFO  :    RECORDS_OUT_OPERATOR_TS_9: 300
INFO  :    TOTAL_TABLE_ROWS_WRITTEN: 300
INFO  : Shuffle Errors:
INFO  :    BAD_ID: 0
INFO  :    CONNECTION: 0
INFO  :    IO_ERROR: 0
INFO  :    WRONG_LENGTH: 0
INFO  :    WRONG_MAP: 0
INFO  :    WRONG_REDUCE: 0
INFO  : Shuffle Errors_Reducer_4_INPUT_Map_3:
INFO  :    BAD_ID: 0
INFO  :    CONNECTION: 0
INFO  :    IO_ERROR: 0
INFO  :    WRONG_LENGTH: 0
INFO  :    WRONG_MAP: 0
INFO  :    WRONG_REDUCE: 0
INFO  : TaskCounter_Map_1_INPUT_Map_2:
INFO  :    FIRST_EVENT_RECEIVED: 9
INFO  :    INPUT_RECORDS_PROCESSED: 18
INFO  :    LAST_EVENT_RECEIVED: 9
INFO  :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO  :    NUM_SHUFFLED_INPUTS: 1
INFO  :    SHUFFLE_BYTES: 356
INFO  :    SHUFFLE_BYTES_DECOMPRESSED: 348
INFO  :    SHUFFLE_BYTES_DISK_DIRECT: 356
INFO  :    SHUFFLE_BYTES_TO_DISK: 0
INFO  :    SHUFFLE_BYTES_TO_MEM: 0
INFO  :    SHUFFLE_PHASE_TIME: 50
INFO  : TaskCounter_Map_1_INPUT_c2:
INFO  :    INPUT_RECORDS_PROCESSED: 125
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 2181
INFO  : TaskCounter_Map_1_OUTPUT_Map_3:
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 0
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO  :    ADDITIONAL_SPILL_COUNT: 0
INFO  :    DATA_BYTES_VIA_EVENT: 0
INFO  :    OUTPUT_BYTES: 4131
INFO  :    OUTPUT_BYTES_PHYSICAL: 2219
INFO  :    OUTPUT_BYTES_WITH_OVERHEAD: 4387
INFO  :    OUTPUT_LARGE_RECORDS: 0
INFO  :    OUTPUT_RECORDS: 125
INFO  :    SPILLED_RECORDS: 0
INFO  : TaskCounter_Map_2_INPUT_c1:
INFO  :    INPUT_RECORDS_PROCESSED: 18
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 270
INFO  : TaskCounter_Map_2_OUTPUT_Map_1:
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 0
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO  :    ADDITIONAL_SPILL_COUNT: 0
INFO  :    DATA_BYTES_VIA_EVENT: 0
INFO  :    OUTPUT_BYTES: 306
INFO  :    OUTPUT_BYTES_PHYSICAL: 380
INFO  :    OUTPUT_BYTES_WITH_OVERHEAD: 348
INFO  :    OUTPUT_LARGE_RECORDS: 0
INFO  :    OUTPUT_RECORDS: 18
INFO  :    SPILLED_RECORDS: 0
INFO  : TaskCounter_Map_3_INPUT_Map_1:
INFO  :    FIRST_EVENT_RECEIVED: 329
INFO  :    INPUT_RECORDS_PROCESSED: 125
INFO  :    LAST_EVENT_RECEIVED: 329
INFO  :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO  :    NUM_SHUFFLED_INPUTS: 1
INFO  :    SHUFFLE_BYTES: 2195
INFO  :    SHUFFLE_BYTES_DECOMPRESSED: 4387
INFO  :    SHUFFLE_BYTES_DISK_DIRECT: 0
INFO  :    SHUFFLE_BYTES_TO_DISK: 0
INFO  :    SHUFFLE_BYTES_TO_MEM: 2195
INFO  :    SHUFFLE_PHASE_TIME: 511
INFO  : TaskCounter_Map_3_INPUT_Map_5:
INFO  :    FIRST_EVENT_RECEIVED: 329
INFO  :    INPUT_RECORDS_PROCESSED: 300
INFO  :    LAST_EVENT_RECEIVED: 329
INFO  :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO  :    NUM_SHUFFLED_INPUTS: 1
INFO  :    SHUFFLE_BYTES: 25397
INFO  :    SHUFFLE_BYTES_DECOMPRESSED: 30468
INFO  :    SHUFFLE_BYTES_DISK_DIRECT: 0
INFO  :    SHUFFLE_BYTES_TO_DISK: 0
INFO  :    SHUFFLE_BYTES_TO_MEM: 25397
INFO  :    SHUFFLE_PHASE_TIME: 514
INFO  : TaskCounter_Map_3_INPUT_c3:
INFO  :    INPUT_RECORDS_PROCESSED: 1149
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 21652
INFO  : TaskCounter_Map_3_OUTPUT_Reducer_4:
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 0
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO  :    ADDITIONAL_SPILL_COUNT: 0
INFO  :    OUTPUT_BYTES: 6945
INFO  :    OUTPUT_BYTES_PHYSICAL: 4050
INFO  :    OUTPUT_BYTES_WITH_OVERHEAD: 6955
INFO  :    OUTPUT_LARGE_RECORDS: 0
INFO  :    OUTPUT_RECORDS: 1
INFO  :    SHUFFLE_CHUNK_COUNT: 1
INFO  :    SPILLED_RECORDS: 1
INFO  : TaskCounter_Map_5_INPUT_sku:
INFO  :    INPUT_RECORDS_PROCESSED: 300
INFO  :    INPUT_SPLIT_LENGTH_BYTES: 29502
INFO  : TaskCounter_Map_5_OUTPUT_Map_3:
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 0
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO  :    ADDITIONAL_SPILL_COUNT: 0
INFO  :    DATA_BYTES_VIA_EVENT: 0
INFO  :    OUTPUT_BYTES: 29862
INFO  :    OUTPUT_BYTES_PHYSICAL: 25421
INFO  :    OUTPUT_BYTES_WITH_OVERHEAD: 30468
INFO  :    OUTPUT_LARGE_RECORDS: 0
INFO  :    OUTPUT_RECORDS: 300
INFO  :    SPILLED_RECORDS: 0
INFO  : TaskCounter_Reducer_4_INPUT_Map_3:
INFO  :    ADDITIONAL_SPILLS_BYTES_READ: 4050
INFO  :    ADDITIONAL_SPILLS_BYTES_WRITTEN: 4050
INFO  :    COMBINE_INPUT_RECORDS: 0
INFO  :    FIRST_EVENT_RECEIVED: 23
INFO  :    LAST_EVENT_RECEIVED: 23
INFO  :    MERGED_MAP_OUTPUTS: 1
INFO  :    MERGE_PHASE_TIME: 88
INFO  :    NUM_DISK_TO_DISK_MERGES: 0
INFO  :    NUM_FAILED_SHUFFLE_INPUTS: 0
INFO  :    NUM_MEM_TO_DISK_MERGES: 0
INFO  :    NUM_SHUFFLED_INPUTS: 1
INFO  :    NUM_SKIPPED_INPUTS: 1
INFO  :    REDUCE_INPUT_GROUPS: 1
INFO  :    REDUCE_INPUT_RECORDS: 1
INFO  :    SHUFFLE_BYTES: 4050
INFO  :    SHUFFLE_BYTES_DECOMPRESSED: 6955
INFO  :    SHUFFLE_BYTES_DISK_DIRECT: 0
INFO  :    SHUFFLE_BYTES_TO_DISK: 0
INFO  :    SHUFFLE_BYTES_TO_MEM: 4050
INFO  :    SHUFFLE_PHASE_TIME: 74
INFO  :    SPILLED_RECORDS: 1
INFO  : TaskCounter_Reducer_4_OUTPUT_out_Reducer_4:
INFO  :    OUTPUT_RECORDS: 0
INFO  : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO  :    GROUPED_INPUT_SPLITS_Map_1: 1
INFO  :    GROUPED_INPUT_SPLITS_Map_2: 1
INFO  :    GROUPED_INPUT_SPLITS_Map_3: 1
INFO  :    GROUPED_INPUT_SPLITS_Map_5: 1
INFO  :    INPUT_DIRECTORIES_Map_1: 1
INFO  :    INPUT_DIRECTORIES_Map_2: 1
INFO  :    INPUT_DIRECTORIES_Map_3: 1
INFO  :    INPUT_DIRECTORIES_Map_5: 1
INFO  :    INPUT_FILES_Map_1: 1
INFO  :    INPUT_FILES_Map_2: 1
INFO  :    INPUT_FILES_Map_3: 1
INFO  :    INPUT_FILES_Map_5: 1
INFO  :    RAW_INPUT_SPLITS_Map_1: 1
INFO  :    RAW_INPUT_SPLITS_Map_2: 1
INFO  :    RAW_INPUT_SPLITS_Map_3: 1
INFO  :    RAW_INPUT_SPLITS_Map_5: 1
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 2 .......... container     SUCCEEDED      1          1        0        0       0       0  
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Map 5 .......... container     SUCCEEDED      1          1        0        0       0       0  
Map 3 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 4 ...... container     SUCCEEDED      2          2        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 05/05  [==========================>>] 100%  ELAPSED TIME: 9.59 s     
----------------------------------------------------------------------------------------------
300 rows affected (11.721 seconds)
Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive
Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2

DWS层之用户行为宽表

1)为什么要建宽表

需求目标,把每个用户单日的行为聚合起来组成一张多列宽表,以便之后关联用户维度信息后进行,不同角度的统计分析

创建用户行为宽表

drop table if exists dws_user_action;
create external table dws_user_action 
(  
  user_id      string    comment '用户 id',
  order_count   bigint    comment '下单次数 ',
  order_amount   decimal(16,2)  comment '下单金额 ',
  payment_count  bigint    comment '支付次数',
  payment_amount  decimal(16,2) comment '支付金额 '
) COMMENT '每日用户行为宽表'
PARTITIONED BY (`dt` string)
stored as parquet
tblproperties ("parquet.compression"="snappy");

用户行为数据宽表导入脚本

创建脚本dws_db_wide.sh

#!/bin/bash
# 定义变量方便修改
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`  
fi 
sql="
with  
tmp_order as
(
    select 
        user_id, 
        count(*)  order_count,
        sum(oi.total_amount) order_amount
    from "$APP".dwd_order_info oi
    where date_format(oi.create_time,'yyyy-MM-dd')='$do_date'
    group by user_id
) ,
tmp_payment as
(
    select
        user_id, 
        sum(pi.total_amount) payment_amount, 
        count(*) payment_count 
    from "$APP".dwd_payment_info pi 
    where date_format(pi.payment_time,'yyyy-MM-dd')='$do_date'
    group by user_id
)
insert overwrite table "$APP".dws_user_action partition(dt='$do_date')
select
    user_actions.user_id,
    sum(user_actions.order_count),
    sum(user_actions.order_amount),
    sum(user_actions.payment_count),
    sum(user_actions.payment_amount)
from 
(
    select
        user_id,
        order_count,
        order_amount,
        0 payment_count,
        0 payment_amount
    from tmp_order
    union all
    select
        user_id,
        0 order_count,
        0 order_amount,
        payment_count,
        payment_amount
    from tmp_payment
 ) user_actions
group by user_id;
"
hive -e "$sql"

执行脚本

dws_db_wide.sh 2020-11-24
INFO  : Status: Running (Executing on YARN cluster with App id application_1606125802436_0017)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      2          2        0        0       0       0  
Reducer 4 ...... container     SUCCEEDED      2          2        0        0       0       0  
Reducer 5 ...... container     SUCCEEDED      2          2        0        0       0       0  
Map 6 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 7 ...... container     SUCCEEDED      2          2        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 06/06  [==========================>>] 100%  ELAPSED TIME: 11.15 s    
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 10.07 seconds
INFO  : 
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           0.72s
INFO  : Prepare Plan                            0.18s
INFO  : Get Query Coordinator (AM)              0.01s
INFO  : Submit Plan                             4.30s
INFO  : Start DAG                               1.22s
INFO  : Run DAG                                10.07s
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1           4047.00          7,860             82           1,000              200
INFO  :      Map 6           4551.00          9,360            115             518              188
INFO  :  Reducer 2           1011.00          1,730             26             200              307
INFO  :  Reducer 4           1453.00          3,040              0             388                2
INFO  :  Reducer 5            443.00            830              8               2                0
INFO  :  Reducer 7           1011.00          1,970             14             188              188
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      2          2        0        0       0       0  
Reducer 4 ...... container     SUCCEEDED      2          2        0        0       0       0  
Reducer 5 ...... container     SUCCEEDED      2          2        0        0       0       0  
Map 6 .......... container     SUCCEEDED      1          1        0        0       0       0  .macro.com:8020/warehouse/tablespace/exReducer 7 ...... container     SUCCEEDED      2          2        0        0       0       0  3248837236917871195-1/-ext-10000
----------------------------------------------------------------------------------------------
VERTICES: 06/06  [==========================>>] 100%  ELAPSED TIME: 11.17 s    
----------------------------------------------------------------------------------------------
200 rows affected (17.604 seconds)
Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive
Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2

查看hue,数据已经生成

ADS层(需求:GMV成交总额)

建表语句

drop table if exists ads_gmv_sum_day;
create external table ads_gmv_sum_day(
  `dt` string COMMENT '统计日期',
  `gmv_count`  bigint COMMENT '当日gmv订单个数',
  `gmv_amount`  decimal(16,2) COMMENT '当日gmv订单总金额',
  `gmv_payment`  decimal(16,2) COMMENT '当日支付金额'
) COMMENT 'GMV'
row format delimited fields terminated by '\t';

数据导入脚本

创建脚本ads_db_gmv.sh

#!/bin/bash
# 定义变量方便修改
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`
fi 
sql="
insert into table "$APP".ads_gmv_sum_day 
select 
    '$do_date' dt,
    sum(order_count)  gmv_count,
    sum(order_amount) gmv_amount,
    sum(payment_amount) payment_amount 
from "$APP".dws_user_action 
where dt ='$do_date'
group by dt;
"
hive -e "$sql"

执行脚本

ads_db_gmv.sh 2020-11-24
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. Set system property 'log4j2.debug' to show Log4j2 internal initialization logging.
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2
20/11/25 14:51:24 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=cdh2.macro.com:3181:4181:participant, version=0, server.3=cdh1.macro.com:3181:4181:participant, server.2=cdh3.macro.com:3181:4181:participant}
20/11/25 14:51:24 [main-EventThread]: ERROR imps.EnsembleTracker: Invalid config event received: {server.1=cdh2.macro.com:3181:4181:participant, version=0, server.3=cdh1.macro.com:3181:4181:participant, server.2=cdh3.macro.com:3181:4181:participant}
20/11/25 14:51:24 [main]: INFO jdbc.HiveConnection: Connected to cdh2.macro.com:10000
Connected to: Apache Hive (version 3.1.3000.7.1.3.0-100)
Driver: Hive JDBC (version 3.1.3000.7.1.3.0-100)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949): insert into table gmall.ads_gmv_sum_day 
select 
    '2020-11-24' dt,
    sum(order_count)  gmv_count,
    sum(order_amount) gmv_amount,
    sum(payment_amount) payment_amount 
from gmall.dws_user_action 
where dt ='2020-11-24'
group by dt
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, type:string, comment:null), FieldSchema(name:_col1, type:bigint, comment:null), FieldSchema(name:_col2, type:decimal(16,2), comment:null), FieldSchema(name:_col3, type:decimal(16,2), comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949); Time taken: 0.217 seconds
INFO  : Executing command(queryId=hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949): insert into table gmall.ads_gmv_sum_day 
select 
    '2020-11-24' dt,
    sum(order_count)  gmv_count,
    sum(order_amount) gmv_amount,
    sum(payment_amount) payment_amount 
from gmall.dws_user_action 
where dt ='2020-11-24'
group by dt
INFO  : Query ID = hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: hive_20201125145125_efe11d96-baaa-451f-b5f8-d98403f2a949
INFO  : Session is already open
INFO  : Dag name: insert into table gmall.ads_gmv_sum_day...dt (Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1606125802436_0018)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      2          2        0        0       0       0  
Reducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 10.00 s    
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 9.45 seconds
INFO  : 
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           0.22s
INFO  : Prepare Plan                            0.06s
INFO  : Get Query Coordinator (AM)              0.01s
INFO  : Submit Plan                             4.22s
INFO  : Start DAG                               1.31s
INFO  : Run DAG                                 9.45s
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1           3527.00          7,280            102             200                1
INFO  :  Reducer 2           4397.00          7,540             98               1                2
INFO  :  Reducer 3           3389.00            630              0               2                0
INFO  : ----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      2          2        0        0       0       0  tablespace/external/hive/gmall.db/ads_gReducer 3 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 10.01 s    
----------------------------------------------------------------------------------------------
1 row affected (15.842 seconds)
Beeline version 3.1.3000.7.1.3.0-100 by Apache Hive
Closing: 0: jdbc:hive2://cdh1.macro.com:2181,cdh2.macro.com:2181,cdh3.macro.com:2181/default;password=root;serviceDiscoveryMode=zooKeeper;user=root;zooKeeperNamespace=hiveserver2

数据导出脚本

在MySQL中创建ads_gmv_sum_day表

DROP TABLE IF EXISTS ads_gmv_sum_day;
CREATE TABLE ads_gmv_sum_day(
 `dt` varchar(200) DEFAULT NULL COMMENT '统计日期',
 `gmv_count` bigint(20) DEFAULT NULL COMMENT '当日gmv订单个数',
 `gmv_amount` decimal(16, 2) DEFAULT NULL COMMENT '当日gmv订单总金额',
 `gmv_payment` decimal(16, 2) DEFAULT NULL COMMENT '当日支付金额'
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = '每日活跃用户数量' ROW_FORMAT = Dynamic;

创建脚本sqoop_export.sh

vim sqoop_export.sh

#!/bin/bash
export HADOOP_USER_NAME=hive
db_name=gmall
export_data() {
sqoop export \
--connect "jdbc:mysql://192.168.0.208:3306/${db_name}?useUnicode=true&characterEncoding=utf-8"  \
--username root \
--password password \
--table $1 \
--num-mappers 1 \
--export-dir /warehouse/tablespace/external/hive/gmall.db/$1 \
--input-fields-terminated-by "\t" \
--update-mode allowinsert \
--update-key $2 \
--input-null-string '\\N'    \
--input-null-non-string '\\N'
}
case $1 in
  "ads_gmv_sum_day")
     export_data "ads_gmv_sum_day" "dt"
;;
   "all")
     export_data "ads_gmv_sum_day" "dt"
;;
esac

执行脚本导入数据

sqoop_export.sh all
Warning: /opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-7.1.3-1.cdh7.1.3.p0.4992530/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/25 15:49:34 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7.7.1.3.0-100
20/11/25 15:49:34 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/11/25 15:49:35 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
20/11/25 15:49:35 INFO tool.CodeGenTool: Beginning code generation
20/11/25 15:49:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `ads_gmv_sum_day` AS t LIMIT 1
20/11/25 15:49:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `ads_gmv_sum_day` AS t LIMIT 1
20/11/25 15:49:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
warning: No SupportedSourceVersion annotation found on org.apache.hadoop.hdds.conf.ConfigFileGenerator, returning RELEASE_6.
warning: Supported source version 'RELEASE_6' from annotation processor 'org.apache.hadoop.hdds.conf.ConfigFileGenerator' less than -source '1.8'
2 warnings
20/11/25 15:49:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/f6653e4db17d3a1223b93c16acc9d822/ads_gmv_sum_day.jar
20/11/25 15:49:38 WARN manager.MySQLManager: MySQL Connector upsert functionality is using INSERT ON
20/11/25 15:49:38 WARN manager.MySQLManager: DUPLICATE KEY UPDATE clause that relies on table's unique key.
20/11/25 15:49:38 WARN manager.MySQLManager: Insert/update distinction is therefore independent on column
20/11/25 15:49:38 WARN manager.MySQLManager: names specified in --update-key parameter. Please see MySQL
20/11/25 15:49:38 WARN manager.MySQLManager: documentation for additional limitations.
20/11/25 15:49:38 INFO mapreduce.ExportJobBase: Beginning export of ads_gmv_sum_day
20/11/25 15:49:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
20/11/25 15:49:39 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
20/11/25 15:49:39 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
20/11/25 15:49:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
20/11/25 15:49:40 INFO client.RMProxy: Connecting to ResourceManager at cdh2.macro.com/192.168.0.207:8032
20/11/25 15:49:41 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/hive/.staging/job_1606125802436_0019
20/11/25 15:49:45 INFO input.FileInputFormat: Total input files to process : 1
20/11/25 15:49:45 INFO input.FileInputFormat: Total input files to process : 1
20/11/25 15:49:45 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
20/11/25 15:49:45 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 19cb887b751ca6813bec478cf3bf20459deed68b]
20/11/25 15:49:45 INFO mapreduce.JobSubmitter: number of splits:1
20/11/25 15:49:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606125802436_0019
20/11/25 15:49:45 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/25 15:49:45 INFO conf.Configuration: resource-types.xml not found
20/11/25 15:49:45 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/25 15:49:46 INFO impl.YarnClientImpl: Submitted application application_1606125802436_0019
20/11/25 15:49:46 INFO mapreduce.Job: The url to track the job: http://cdh2.macro.com:8088/proxy/application_1606125802436_0019/
20/11/25 15:49:46 INFO mapreduce.Job: Running job: job_1606125802436_0019
20/11/25 15:49:54 INFO mapreduce.Job: Job job_1606125802436_0019 running in uber mode : false
20/11/25 15:49:54 INFO mapreduce.Job:  map 0% reduce 0%
20/11/25 15:50:02 INFO mapreduce.Job:  map 100% reduce 0%
20/11/25 15:50:02 INFO mapreduce.Job: Job job_1606125802436_0019 completed successfully
20/11/25 15:50:02 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=253343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=215
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        HDFS: Number of bytes read erasure-coded=0
    Job Counters 
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5709
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=5709
        Total vcore-milliseconds taken by all map tasks=5709
        Total megabyte-milliseconds taken by all map tasks=11692032
    Map-Reduce Framework
        Map input records=1
        Map output records=1
        Input split bytes=176
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=74
        CPU time spent (ms)=1320
        Physical memory (bytes) snapshot=285306880
        Virtual memory (bytes) snapshot=2846998528
        Total committed heap usage (bytes)=320864256
        Peak Map Physical memory (bytes)=285306880
        Peak Map Virtual memory (bytes)=2846998528
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
20/11/25 15:50:02 INFO mapreduce.ExportJobBase: Transferred 215 bytes in 22.3977 seconds (9.5992 bytes/sec)
20/11/25 15:50:02 INFO mapreduce.ExportJobBase: Exported 1 records.

查看mysql数据

MariaDB [gmall]> select * from ads_gmv_sum_day;
+------------+-----------+------------+-------------+
| dt         | gmv_count | gmv_amount | gmv_payment |
+------------+-----------+------------+-------------+
| 2020-11-24 |      1000 |  527431.00 |   268152.00 |
+------------+-----------+------------+-------------+
1 row in set (0.00 sec)


相关实践学习
基于MSE实现微服务的全链路灰度
通过本场景的实验操作,您将了解并实现在线业务的微服务全链路灰度能力。
相关文章
|
2月前
|
机器学习/深度学习 消息中间件 搜索推荐
【数据飞轮】驱动业务增长的高效引擎 —从数据仓库到数据中台的技术进化与实战
在数据驱动时代,企业逐渐从数据仓库过渡到数据中台,并进一步发展为数据飞轮。本文详细介绍了这一演进路径,涵盖数据仓库的基础存储与查询、数据中台的集成与实时决策,以及数据飞轮的自动化增长机制。通过代码示例展示如何在实际业务中运用数据技术,实现数据的最大价值,推动业务持续优化与增长。
78 4
|
存储 SQL Cloud Native
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——一、产品概述
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——一、产品概述
|
存储 SQL 弹性计算
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——二、产品架构及原理
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——二、产品架构及原理
|
SQL JSON Cloud Native
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——三、产品相关概念(上)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——三、产品相关概念(上)
|
存储 固态存储 Cloud Native
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——三、产品相关概念(下)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(上)——三、产品相关概念(下)
|
存储 SQL Cloud Native
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——一、弹性能力(资源池、分时弹性)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——一、弹性能力(资源池、分时弹性)
|
SQL 分布式计算 DataWorks
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——二、数据导入导出与同步链路搭建(上)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——二、数据导入导出与同步链路搭建(上)
|
SQL 存储 分布式计算
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——二、数据导入导出与同步链路搭建(下)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——二、数据导入导出与同步链路搭建(下)
|
存储 SQL Cloud Native
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——三、SQL优化与慢查询解决(上)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——三、SQL优化与慢查询解决(上)
|
SQL 存储 缓存
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——三、SQL优化与慢查询解决(下)
《阿里云认证的解析与实战-数据仓库ACP认证》——云原生数据仓库AnalyticDB MySQL版解析与实践(下)——三、SQL优化与慢查询解决(下)

热门文章

最新文章