Hadoop environment:
master node: node1
slave nodes: node2, node3, node4
remote server (Python connects to Hive from here): node29
Requirement: use Hive to find the 10 most frequently requested URLs in a CDN log within a given time range.
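In miniature, the job Hive will perform is a group-by/count/sort over request URLs. The same logic can be sketched in plain Python with `collections.Counter` (the sample URLs below are made up for illustration; Hive does this at log scale):

```python
from collections import Counter

# Hypothetical sample of request URLs extracted from a CDN log slice.
requests = [
    "/img/a.png", "/index.html", "/img/a.png",
    "/video/b.mp4", "/img/a.png", "/index.html",
]

# Equivalent of: select request, count(request) as counts ...
#                group by request order by counts desc limit 10
top10 = Counter(requests).most_common(10)
for url, counts in top10:
    print(url, counts)
```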
PS: for the same query done with Pig, see:
http://shineforever.blog.51cto.com/1429204/1571124
Note: operating Hive remotely from Python requires the Thrift interface.
The Hive source package ships with the Thrift Python bindings:
[root@node1 shell]# ls -l /usr/local/hive-0.8.1/lib/py
total 28
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:29 fb303
drwxr-xr-x 2 hadoop hadoop 4096 Oct 15 10:30 fb303_scripts
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:29 hive_metastore
drwxr-xr-x 2 hadoop hadoop 4096 Oct 15 10:30 hive_serde
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:29 hive_service
drwxr-xr-x 2 hadoop hadoop 4096 Nov 5 15:20 queryplan
drwxr-xr-x 6 hadoop hadoop 4096 Nov 5 15:20 thrift
1) scp the relevant files to the target directory on the remote node29:
scp -r /usr/local/hive-0.8.1/lib/py/* 172.16.41.29:/usr/local/hive_py/.
2) Start the Hive Thrift server on node1:
[hadoop@node1 py]$ hive --service hiveserver
Starting Hive Thrift Server
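Before running the query script, it is worth confirming that the Thrift server is actually reachable from node29. A minimal sketch of such a check (the default HiveServer port 10000 is assumed; `hiveserver_up` is a hypothetical helper, not part of Hive):

```python
import socket

def hiveserver_up(host, port=10000, timeout=3):
    """Return True if a TCP connection to the Hive Thrift port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False
```

If this returns False, check that `hive --service hiveserver` is still running on node1 and that no firewall blocks port 10000.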
3) Write the query script on node29:
#!/usr/bin/env python
#coding:utf-8
# Find the top 10 most-requested URLs in the CDN log within a given time range.
import sys

# Load Hive's Python client libraries (copied over from the Hive source tree).
sys.path.append('/usr/local/hive_py')

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

dbname = "default"
hsql = ("select request, count(request) as counts from cdnlog "
        "where time >= '[27/Oct/2014:10:40:00 +0800]' "
        "and time <= '[27/Oct/2014:10:49:59 +0800]' "
        "group by request order by counts desc limit 10")

def hiveExe(hsql, dbname):
    try:
        transport = TSocket.TSocket('172.16.41.151', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()
        # Load hive_contrib.jar for extended expression support -- required.
        # The jar path below is on the remote Hive server, NOT on this
        # script's local host!
        client.execute('add jar /usr/local/hive-0.8.1/lib/hive_contrib.jar')
        # client.execute("use " + dbname)
        client.execute(hsql)
        results = client.fetchAll()  # fetch all result rows
        transport.close()            # close the transport before returning
        return results
    except Thrift.TException, tx:
        print '%s' % (tx.message)

if __name__ == '__main__':
    results = hiveExe(hsql, dbname)
    for i in range(len(results)):
        print results[i]
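Each row returned by `fetchAll()` comes back as a single string rather than a tuple. A small sketch of turning such rows into `(url, count)` pairs, assuming tab-separated columns (the delimiter and sample rows here are assumptions, not taken from the original output):

```python
# Each row is assumed to be one string: "<request>\t<counts>".
def parse_rows(rows):
    parsed = []
    for row in rows:
        request, counts = row.strip().split("\t")
        parsed.append((request, int(counts)))
    return parsed

# Hypothetical rows in the shape fetchAll() might return.
sample = ["/img/a.png\t1024", "/index.html\t987"]
print(parse_rows(sample))
```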
Execute the script on node29; the output is:
The Hive compute process on node1 looks like:
This article was reposted from shine_forever's 51CTO blog; original link: http://blog.51cto.com/shineforever/1573439