Hive Usage Tips and Notes


2017-11-27


Source: http://michael-roshen.iteye.com/blog/2115268

Example:

INSERT OVERWRITE TABLE prices_collected_${hiveconf:wid_version}
select
    pc.collect_id as product_id ,
    regexp_extract(pc.price,'(\\d*\\.?\\d+)',1) as price ,
    pc.region,
    '' as location_area_code,
    '' as city_code,
    from_unixtime(unix_timestamp() , 'yyyy-MM-dd HH:mm:ss') as created_at,
    from_unixtime(unix_timestamp() , 'yyyy-MM-dd HH:mm:ss') as updated_at
from products_compared_${hiveconf:wid_version} as pc

1. Set the table name dynamically from a parameter passed to hive: prices_collected_${hiveconf:wid_version}

hive -hiveconf wid_version='4'

The script then picks the parameter up through ${hiveconf:wid_version}, so the table name resolves to prices_collected_4.
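Substitution of this kind is textual: each ${hiveconf:...} reference is replaced by the parameter's value before the statement is parsed. The Python sketch below models that behavior in a simplified way. The function name substitute_hiveconf is hypothetical, and this is an illustration rather than Hive's actual implementation.

```python
import re

def substitute_hiveconf(sql, conf):
    """Replace each ${hiveconf:key} in sql with conf[key] (simplified model)."""
    def lookup(match):
        key = match.group(1)
        # Leave unknown references untouched (an assumption of this sketch)
        return conf.get(key, match.group(0))
    return re.sub(r"\$\{hiveconf:([^}]+)\}", lookup, sql)

statement = "INSERT OVERWRITE TABLE prices_collected_${hiveconf:wid_version}"
resolved = substitute_hiveconf(statement, {"wid_version": "4"})
print(resolved)
```

Because the replacement is plain text, the same mechanism works for table names, column lists, and literal values alike.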

2. Use a regular expression to extract the information you need, for example the number inside a string

regexp_extract(pc.price,'(\\d*\\.?\\d+)',1) as price

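Note that in Hive the backslashes in a regular expression must be doubled (\\d rather than \d), because the string literal itself treats the backslash as an escape character.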
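To see what this pattern captures, here is a minimal Python sketch that uses the same expression. The helper name extract_price and the sample strings are made up for illustration, and returning an empty string when nothing matches is a choice of this sketch.

```python
import re

# Same pattern as in the Hive query; a Python raw string needs only single backslashes
NUMBER_PATTERN = re.compile(r"(\d*\.?\d+)")

def extract_price(text):
    """Return the first number found in text, or '' when there is none."""
    match = NUMBER_PATTERN.search(text)
    return match.group(1) if match else ""

for sample in ["price: 12.50 USD", ".75", "1,299.50", "n/a"]:
    print(repr(sample), "->", repr(extract_price(sample)))
```

The third sample yields only "1": the pattern stops at the thousands separator, so values formatted that way would need the commas stripped first.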

3. Get the system time

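from_unixtime(unix_timestamp() , 'yyyy-MM-dd HH:mm:ss') as created_at

from_unixtime(unix_timestamp() , 'yyyy-MM-dd HH:mm:ss') returns the current system time as a string; adjust the format pattern as needed. HH is the 24-hour clock, while hh is the 12-hour clock and is ambiguous without an AM/PM marker.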
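The pattern letters matter: in these Java-style patterns HH is the 24-hour clock and hh is the 12-hour clock. The Python sketch below shows the same distinction with strftime, where %H and %I play the corresponding roles; the timestamp is an arbitrary example.

```python
from datetime import datetime

# A fixed afternoon timestamp so the difference is visible
moment = datetime(2017, 11, 27, 15, 4, 5)

# %H is the 24-hour clock (the counterpart of HH)
twenty_four_hour = moment.strftime("%Y-%m-%d %H:%M:%S")
# %I is the 12-hour clock (the counterpart of hh)
twelve_hour = moment.strftime("%Y-%m-%d %I:%M:%S")

print(twenty_four_hour)
print(twelve_hour)
```

With the 12-hour form, 15:04:05 and 03:04:05 produce the same string, which is rarely what a created_at column should store.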

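4. Joining several tables may fail with an error

Work around it by turning off automatic map-join conversion, so that Hive uses a regular join instead:

set hive.auto.convert.join=false;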

5. Create a table

create table if not exists brands (
    name string,
    created_at string,
    updated_at string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
STORED AS TEXTFILE;

The table is stored as plain text, with "\\" as the escape character and "\t" (tab) as the field separator.
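To make the role of the escape character concrete, the following Python sketch writes and re-reads one row in this layout. It only illustrates the idea of escaping the separator; it is not Hive's own serialization code, and the names escape_field and split_row are hypothetical.

```python
def escape_field(value, sep="\t", esc="\\"):
    """Prefix the escape character and the separator with the escape character."""
    return value.replace(esc, esc + esc).replace(sep, esc + sep)

def split_row(line, sep="\t", esc="\\"):
    """Split a line on unescaped separators and undo the escaping."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == esc and i + 1 < len(line):
            # Escaped character: keep it literally, even if it is the separator
            current.append(line[i + 1])
            i += 2
        elif ch == sep:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields

row = ["Acme\tCorp", "2017-11-27", "2017-11-27"]  # first field contains a tab
line = "\t".join(escape_field(field) for field in row)
print(repr(line))
print(split_row(line))
```

Without the escape character, the tab inside the first field would be read as a field boundary and the row would appear to have four columns.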

6. Export the data of a Hive table to the local filesystem by running hive as follows:

hive \
    -hiveconf local_path=/home/hive/hive_data/products_24_1 \
    -hiveconf hive_table=products_24_1 \
    -hiveconf columnstr=' name , created_at, updated_at, "released" as status ' \
    -f /home/hive/export_hive_table_to_local.sql

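The parameters, in order, are:

1. local_path: the local directory to export to

2. hive_table: the Hive table to export

3. columnstr: the columns of products_24_1 to export

4. -f: the script file to run; using the parameters above, it writes the data of products_24_1 to the local path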

The file /home/hive/export_hive_table_to_local.sql contains:

insert overwrite local directory '${hiveconf:local_path}'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
STORED AS TEXTFILE
select ${hiveconf:columnstr}
from ${hiveconf:hive_table};
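When the export is launched from a script, building the command as an argument list avoids shell quoting problems with values such as the column list. A minimal sketch follows; the function name build_export_command is hypothetical, and only the -hiveconf and -f options shown above are used.

```python
def build_export_command(local_path, hive_table, columnstr, script_path):
    """Return the hive invocation as an argument list (no shell quoting needed)."""
    params = {"local_path": local_path, "hive_table": hive_table, "columnstr": columnstr}
    command = ["hive"]
    for key, value in params.items():
        command += ["-hiveconf", f"{key}={value}"]
    command += ["-f", script_path]
    return command

command = build_export_command(
    "/home/hive/hive_data/products_24_1",
    "products_24_1",
    ' name , created_at, updated_at, "released" as status ',
    "/home/hive/export_hive_table_to_local.sql",
)
print(command)
# To actually run it (requires hive on the PATH):
#   import subprocess; subprocess.run(command, check=True)
```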

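7. Import local files into a PostgreSQL database. Hive does not work well with PostgreSQL and sqoop could not be used for the import, so first export the Hive data to local files and then load them with a Python script.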

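Python code

def insert_to_pg(conn, table_name, file_path, insert_columns=None):
    # conn is a psycopg2 connection string; this function needs os, psycopg2
    # and the ReadFileProgress class from the next listing
    connection = psycopg2.connect(conn)
    cursor = connection.cursor()
    if os.path.isfile(file_path):
        datafile = ReadFileProgress(file_path)
        # COPY the tab-separated file into the table; '\N' marks NULL values
        cursor.copy_from(file=datafile, table=table_name, sep='\t', null='\\N', size=81920, columns=insert_columns)
        datafile.close()
    cursor.close()
    connection.commit()  # without a commit the copied rows are not persisted
    connection.close()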

Python code

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import psycopg2

class ReadFileProgress:
    """File wrapper that prints import progress to stderr while it is read."""

    def __init__(self, filename):
        self.datafile = open(filename)
        self.totalRecords = 0
        self.totalBytes = os.stat(filename).st_size
        self.readBytes = 0
        # Skip the first line (treated as a header) and count the remaining records.
        # Files exported by Hive have no header row: drop both readline() calls
        # in this method if the first line is data.
        self.datafile.readline()
        self.totalRecords = sum(1 for _ in self.datafile)
        sys.stderr.write("Number of records: %d\n" % (self.totalRecords))
        # Rewind and skip the header again so reading starts at the first record
        self.datafile.seek(0)
        self.datafile.readline()
        # Size of one 5% step of the file
        self.perc5 = self.totalBytes / 20.0
        self.perc5count = 0
        self.lastPerc5 = 0
        sys.stderr.write("Writing records: 0%")

    def countBytes(self, size=0):
        # Print a progress mark each time another 5% of the file has been read
        self.readBytes += size
        if self.perc5 > 0 and (self.readBytes - self.lastPerc5 >= self.perc5):
            self.lastPerc5 = self.readBytes
            if (int(self.readBytes / self.perc5) == 5):
                sys.stderr.write("25%")
            elif (int(self.readBytes / self.perc5) == 10):
                sys.stderr.write("50%")
            elif (int(self.readBytes / self.perc5) == 15):
                sys.stderr.write("75%")
            else:
                sys.stderr.write(".")
            sys.stderr.flush()

    def readline(self, size=-1):
        # Count what was actually read (characters, so approximate for multi-byte text)
        line = self.datafile.readline(size)
        self.countBytes(len(line))
        return line

    def read(self, size=-1):
        data = self.datafile.read(size)
        self.countBytes(len(data))
        return data

    def close(self):
        sys.stderr.write("100%\n")
        self.datafile.close()
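One practical gap between steps 6 and 7: the export in step 6 writes into a directory, while insert_to_pg expects the path of a single file. A minimal sketch for collecting the data files follows, assuming the directory holds one or more plain data files and possibly hidden companion files such as checksums. The function name list_export_files and the file names in the demonstration are hypothetical.

```python
import os
import tempfile

def list_export_files(export_dir):
    """Return sorted paths of the data files in an export directory."""
    paths = []
    for name in sorted(os.listdir(export_dir)):
        path = os.path.join(export_dir, name)
        # Skip hidden files (for example checksum files) and subdirectories
        if name.startswith(".") or not os.path.isfile(path):
            continue
        paths.append(path)
    return paths

# Demonstration with a temporary directory standing in for local_path
with tempfile.TemporaryDirectory() as export_dir:
    for name in ["000000_0", "000001_0", ".000000_0.crc"]:
        with open(os.path.join(export_dir, name), "w") as handle:
            handle.write("demo\n")
    found = [os.path.basename(p) for p in list_export_files(export_dir)]
    print(found)
```

Each returned path could then be passed to insert_to_pg in turn.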

8. Export a given table from PostgreSQL

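Python code

def do_export(conn, table_name, file_path, columns=None):
    # conn is a psycopg2 connection string; this function needs sys and psycopg2
    connection = psycopg2.connect(conn)
    cursor = connection.cursor()
    # open() replaces the Python 2-only file() builtin; the with block closes the file
    with open(file_path, 'w') as outfile:
        cursor.copy_to(file=outfile, table=table_name, sep='\t', null='\\N', columns=columns)
    cursor.close()
    connection.commit()
    connection.close()
    sys.stdout.write("Transaction finished successfully.\n")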

9. Parameters can also be passed into the select statement itself through hiveconf. Run hive with:

hive -hiveconf name='hello hive'

INSERT OVERWRITE TABLE companies
select
    '${hiveconf:name}' as name
from companies_old
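The quotes around ${hiveconf:name} in the query are needed because the shell removes the quotes from the command line before hive sees the value. The Python sketch below shows this with shlex, which splits a command line the way a POSIX shell does; the textual substitution is modeled with a simple str.replace for this one variable.

```python
import shlex

# Split the command line the way a POSIX shell would
argv = shlex.split("hive -hiveconf name='hello hive'")
print(argv)

# The quotes are gone: the value that arrives is the bare text
key, value = argv[2].split("=", 1)

# Substituting that bare text into the query, which supplies its own quotes
sql = "select '${hiveconf:name}' as name from companies_old"
resolved = sql.replace("${hiveconf:name}", value)
print(resolved)
```

Leaving the quotes out of the query would produce select hello hive as name, which is not a string literal.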

This article is reposted from SimplePoint's 51CTO blog. Original link: http://blog.51cto.com/2226894115/1898261. To repost it, please contact the original author.
