一、Hive企业优化之Fetch Task讲解
hive.fetch.task.conversion
1
Fetch 抓取是指,Hive 中对某些情况的查询可以不必使用 MapReduce 计算。启用 MapReduce Job 是会消耗系统开销的。对于这个问题,从 Hive0.10.0 版本开始,对于简单的不需要聚合的类似 select from limit n语句,不需要起 MapReduce job,直接通过 Fetch task 获取数据。
比如:select * from user_table;在这种情况下,Hive 可以简单地读取 user_table 对应的存储目录下的文件,然后输出查询结果到控制台。
在 hive-default.xml.template 文件中hive.fetch.task.conversion默认是 more,老版本 hive 默认是 minimal,该属性修改为 more 以后,在全局查找、字段查找、limit 查找等都不走 MapReduce。
>>none:所有的语句都走MR >>minimal: select * ; select * from order where day ='20180910'; select * from order limit 5; 如以上查询方式都不会执行mapreduce程序 >>more: select *,select column... 如以上查询方式都不会执行mapreduce程序
<property> <name>hive.fetch.task.conversion</name> <value>more</value> <description> Expects one of [none, minimal, more]. Some select queries can be converted to single FETCH task minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins. 0. none : disable hive.fetch.task.conversion 1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only 2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns) </description> </property>
二、Hive企业优化之执行计划讲解
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION] query
不跑MapReduce的情况执行计划
explain select * from order;
Explain STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: TableScan alias: order Statistics: Num rows: 1 Data size: 346 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: userid (type: string), username (type: string), order_date (type: string), product_name (type: string), price (type: int), city (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 1 Data size: 346 Basic stats: COMPLETE Column stats: NONE ListSink
group by执行计划如下:
explain select count(1) city_nun,city from order group by city;
Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: <!--stage定义,一个stage对应一个MapReduce--> Stage: Stage-1 <!--Map过程--> Map Reduce Map Operator Tree: TableScan alias: order Statistics: Num rows: 3 Data size: 346 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: city (type: string) outputColumnNames: city Statistics: Num rows: 3 Data size: 346 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count(1) keys: city (type: string) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 3 Data size: 346 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 3 Data size: 346 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: bigint) Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 115 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col1 (type: bigint), _col0 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 115 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 115 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
三、Hive企业优化之表拆分方案讲解
在实际应用中,根据我们的业务需求会将部分的大表拆分成多个小表。
1)大表拆分小表: 源数据 ——> 拆分(中间表) -> 轻量分析 2)抽取共性数据到中间表 select * from (select * from a,b,c where a.id = b.id and b.id = c.id) 共性提取(中间表):select * from a,b,c where a.id = b.id and b.id = c.id
方案一:对一个大表,根据不同的业务,拆分成多个小表,做业务分析。
方案二:可以将之前每部分的业务表共性的数据提取出来组成一个中间表
方案三:如果不同的业务都需要一个大表中的共同的几个数据字段,那么可以先提取一个公共数据
四、Hive企业优化之外部分区表优化方案
在实际的应用中: >>外部表和分区表结合使用 >>多级分区 partition(yean = '2018',month='09',day='01') 合并输入结果方式一: select count(1) from order where day = '01' UNION ALL select count(1) from order where day = '02' 合并输入结果方式二: [-getmerge [-nl] <src> <localdst>]
五、Hive企业优化之SQL优化
1.数据格式和压缩
>>数据存储格式:textFile、orcFile、parquet >>数据压缩:snappy 大数据平台:CDH
2.SQL语句
注重SQL执行的性能
3.reduce number
根据业务数据量多次测试得出合适的number