Our production Zipkin deployment stores its traces in HBase 0.94.6. The developers originally wanted to write MapReduce jobs directly for offline analysis, but after some discussion we decided that going through Hive would make development much faster. (There are of course other SQL interfaces over HBase, such as Phoenix and Impala, but they are not mature enough yet, and our workload is offline analysis rather than ad-hoc queries. BTW, a while ago I talked with Intel about their Hive over HBase, which bypasses MapReduce entirely; the performance is impressive, but it is also a bit pricey =.=)

Querying HBase through Hive is actually quite simple:

// First, create a table in HBase and insert a few rows
hbase(main):003:0> create 'table_inhbase', 'cf'
0 row(s) in 1.2060 seconds
=> Hbase::Table - table_inhbase
hbase(main):004:0> list
TABLE
table_inhbase
1 row(s) in 0.0350 seconds
hbase(main):005:0> put 'table_inhbase', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0830 seconds
hbase(main):006:0> put 'table_inhbase', 'row2', 'cf:a', 'value2'
0 row(s) in 0.0200 seconds
hbase(main):007:0> put 'table_inhbase', 'row3', 'cf:b', 'value3'
0 row(s) in 0.0180 seconds
hbase(main):008:0> scan 'table_inhbase'
ROW                 COLUMN+CELL
 row1               column=cf:a, timestamp=1383736436773, value=value1
 row2               column=cf:a, timestamp=1383736462917, value=value2
 row3               column=cf:b, timestamp=1383736476017, value=value3
3 row(s) in 0.0660 seconds
// Create an external table in Hive. Note: the HBase ZooKeeper quorum must be added to hive-site.xml,
// otherwise Hive hangs and keeps retrying localhost:2181 (a sketch of the setting follows the CREATE TABLE below)
CREATE EXTERNAL TABLE ext_table_inhbase(key string, avalue string, bvalue string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:a,cf:b")
TBLPROPERTIES ("hbase.table.name" = "table_inhbase");
hive> CREATE EXTERNAL TABLE ext_table_inhbase(key string, avalue string, bvalue string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:a,cf:b")
    > TBLPROPERTIES ("hbase.table.name" = "table_inhbase");
OK
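
For reference, here is a minimal sketch of the ZooKeeper setting mentioned above; the host names are placeholders, not our actual quorum:

<!-- hive-site.xml: tell the HBase storage handler where the HBase ZooKeeper quorum lives
     (zk-host1..3 are placeholder host names) -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk-host1,zk-host2,zk-host3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
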
// Note: the following two jars must be added: hbase-0.94.6-cdh4.4.0.jar and hive-hbase-handler-0.10.0-cdh4.4.0.jar,
// otherwise an exception is thrown. A plain select * is answered by a local fetch task, but any query that launches
// a MapReduce job needs these jars on the task classpath (one way to add them is sketched after the stack trace below).
hive> select * from ext_table_inhbase;
OK
row1    value1  NULL
row2    value2  NULL
row3    NULL    value3
Time taken: 0.609 seconds
hive> select key,avalue from ext_table_inhbase;
java.io.IOException: Cannot create an instance of InputSplit class org.apache.hadoop.hive.hbase.HBaseSplit: Class org.apache.hadoop.hive.hbase.HBaseSplit not found
        at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:146)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:388)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
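
One way to register the two jars is in the Hive session itself: add jar also ships them to the distributed cache so the MapReduce tasks can load HBaseSplit. This is a minimal sketch, and the /usr/lib/... paths are assumptions that depend on where your CDH installation placed the jars (setting hive.aux.jars.path in hive-site.xml achieves the same thing permanently):

// register the handler jars for this session (paths below are assumed, adjust to your installation)
hive> add jar /usr/lib/hbase/hbase-0.94.6-cdh4.4.0.jar;
hive> add jar /usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.4.0.jar;

With the jars in place, the same query now runs its MapReduce job successfully:
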
hive> select key,avalue from ext_table_inhbase;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
19:33:55,386 Stage-1 map = 0%,  reduce = 0%
19:34:01,472 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.73 sec
19:34:02,495 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.73 sec
19:34:03,512 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.73 sec
MapReduce Total cumulative CPU time: 2 seconds 730 msec
Ended Job = job_201311061424_0003
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.73 sec   HDFS Read: 255  HDFS Write: 39  SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 730 msec
OK
row1    value1
row2    value2
// Try querying through HiveServer2 with beeline
beeline> !connect jdbc:hive2://test-2:10000 hdfs hdfs org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://test-2:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.4.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://test-2:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (1.483 seconds)
0: jdbc:hive2://test-2:10000> show tables;
+--------------------+
|      tab_name      |
+--------------------+
| ext_table_inhbase  |
| test               |
+--------------------+
2 rows selected (0.657 seconds)
0: jdbc:hive2://test-2:10000> select count(*) from ext_table_inhbase;
+------+
| _c0  |
+------+
| 3    |
+------+
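
The classpath requirement applies on the HiveServer2 side too: the count(*) above launches a MapReduce job from the server, so the server (not the beeline client) must be able to find the two handler jars. A minimal sketch using hive.aux.jars.path, with assumed file paths, followed by a HiveServer2 restart:

<!-- hive-site.xml on the HiveServer2 host (jar paths are assumptions) -->
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hbase/hbase-0.94.6-cdh4.4.0.jar,file:///usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.4.0.jar</value>
</property>
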

This post was reposted from MIKE老毕's 51CTO blog; original link: http://blog.51cto.com/boylook/1320637. Please contact the original author for reprint permission.