57 Hive案例（数据ETL）-阿里云开发者社区

57 Hive案例（数据ETL）

2023-11-01 96 发布于广东

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 57 Hive案例（数据ETL）

需求

对web点击流日志基础数据表进行etl（按照仓库模型设计）

按各时间维度统计来源域名top10

已有数据表 “t_orgin_weblog” ：

col_name	data_type	comment
valid	string
remote_addr	string
remote_user	string
time_local	string
request	string
status	string
body_bytes_sent	string
http_referer	string
http_user_agent	string

数据示例

| true|1.162.203.134| - | 18/Sep/2013:13:47:35| /images/my.jpg                        | 200| 19939 | "http://www.angularjs.cn/A0d9"                      | "Mozilla/5.0 (Windows   |
| true|1.202.186.37 | - | 18/Sep/2013:15:39:11| /wp-content/uploads/2013/08/windjs.png| 200| 34613 | "http://cnodejs.org/topic/521a30d4bee8d3cb1272ac0f" | "Mozilla/5.0 (Macintosh;|

实现步骤

1、对原始数据进行抽取转换

–将来访url分离出host path query query id

drop table if exists t_etl_referurl;
create table t_etl_referurl as
SELECT a.*,b.*
FROM t_orgin_weblog a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id

2、从前述步骤进一步分离出日期时间形成ETL明细表“t_etl_detail” day tm

drop table if exists t_etl_detail;
create table t_etl_detail as 
select b.*,substring(time_local,0,11) as daystr,
substring(time_local,13) as tmstr,
substring(time_local,4,3) as month,
substring(time_local,0,2) as day,
substring(time_local,13,2) as hour
from t_etl_referurl b;

3、对etl数据进行分区(包含所有数据的结构化信息)

drop table t_etl_detail_prt;
create table t_etl_detail_prt(
valid                   string,
remote_addr            string,
remote_user            string,
time_local               string,
request                 string,
status                  string,
body_bytes_sent         string,
http_referer             string,
http_user_agent         string,
host                   string,
path                   string,
query                  string,
query_id               string,
daystr                 string,
tmstr                  string,
month                  string,
day                    string,
hour                   string) 
partitioned by (mm string,dd string);

导入数据

insert into table t_etl_detail_prt partition(mm='Sep',dd='18')
select * from t_etl_detail where daystr='18/Sep/2013';
insert into table t_etl_detail_prt partition(mm='Sep',dd='19')
select * from t_etl_detail where daystr='19/Sep/2013';

分个时间维度统计各referer_host的访问次数并排序

create table t_refer_host_visit_top_tmp as
select referer_host,count(*) as counts,mm,dd,hh from t_display_referer_counts group by hh,dd,mm,referer_host order by hh asc,dd asc,mm asc,counts desc;

4、来源访问次数topn各时间维度URL

取各时间维度的referer_host访问次数topn

select * from (select referer_host,counts,concat(hh,dd),row_number() over (partition by concat(hh,dd) order by concat(hh,dd) asc) as od from t_refer_host_visit_top_tmp) t where od<=3;

57 Hive案例（数据ETL）

需求

数据示例

实现步骤

热门文章

最新文章

相关课程

相关电子书

相关实验场景

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

57 Hive案例（数据ETL）

需求

数据示例

实现步骤

热门文章

最新文章

相关课程

相关电子书

相关实验场景