I have recently been testing importing Hive data into Solr; there is a related project on GitHub:
https://github.com/chimpler/hive-solr
It works by implementing an InputFormat and an OutputFormat, using the mapred API to read data from and write data to Solr.
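On the read side, the idea is a RecordReader that pages results out of Solr and hands them to the map task one row at a time. The sketch below is a simplified, Solr-free stand-in for that read path (the "result set" is an in-memory list, and the class and method names are illustrative, not chimpler's actual API):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the record reader a Solr-backed InputFormat must
// supply: next() fills the value object with one "document" per call and
// returns false once the result set is drained, mirroring the semantics of
// org.apache.hadoop.mapred.RecordReader.next(K, V).
public class SolrLikeRecordReader {
    private final Iterator<Map<String, String>> docs;
    private long pos = 0;

    public SolrLikeRecordReader(List<Map<String, String>> results) {
        this.docs = results.iterator();
    }

    public boolean next(Map<String, String> value) {
        if (!docs.hasNext()) {
            return false;          // input exhausted: map task stops calling us
        }
        value.clear();
        value.putAll(docs.next()); // copy the next document into the reused value
        pos++;
        return true;
    }

    public long getPos() {
        return pos;                // number of records emitted so far
    }
}
```

A real implementation would additionally issue paged Solr queries (the table's `solr.buffer.input.rows` property suggests the buffer size used per fetch), but the read loop above is the contract the map task relies on.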
The structure of the test table:
show create table table_in_solr1;

CREATE EXTERNAL TABLE table_in_solr1(
  id string COMMENT 'from deserializer',
  cookie_id_s string COMMENT 'from deserializer',
  first_url_s string COMMENT 'from deserializer',
  warehouse_s string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'com.chimpler.hive.solr.SolrSerDe'
STORED BY
  'com.chimpler.hive.solr.SolrStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'solr.column.mapping'='id,cookie_id_s,first_url_s,warehouse_s')
LOCATION
  'hdfs://xxxxxx:9000/bip/hive_warehouse/table_in_solr1'
TBLPROPERTIES (
  'solr.url'='http://xxxxxxx:8888/solr/userinfo',
  'transient_lastDdlTime'='1401357066',
  'solr.buffer.input.rows'='10000',
  'solr.buffer.output.rows'='10000')
After loading the data, I ran a count test:
select count(1) from table_in_solr1;

Ended Job = job_1401419652664_0010 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1401419652664_0010_m_000000 (and more) from job job_1401419652664_0010

Task with the most failures(4):
-----
Task ID:
  task_1401419652664_0010_m_000000

URL:
  http://nfzm:8088/taskdetails.jsp?jobid=job_1401419652664_0010&tipid=task_1401419652664_0010_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: java.lang.NumberFormatException: For input string: ""
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:243)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:526)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:166)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:407)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
Caused by: java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
	at java.lang.Integer.parseInt(Integer.java:470)
	at java.lang.Integer.parseInt(Integer.java:499)
	at com.chimpler.hive.solr.SolrInputFormat.getReadColumnIDs(SolrInputFormat.java:38)
	at com.chimpler.hive.solr.SolrInputFormat.getRecordReader(SolrInputFormat.java:50)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:240)
	... 9 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   HDFS Read: 0  HDFS Write: 0  FAIL
Total MapReduce CPU Time Spent: 0 msec
count(<column>), however, works fine.
Comparing the explain output, count(<column>) adds a Select Operator that count(1) does not have:
Select Operator
  expressions:
        expr: id
        type: string
  outputColumnNames: id
  Group By Operator
    aggregations:
          expr: count(id)
The job's error message shows the failure comes from a number-format conversion, and the stack trace pinpoints the offending code:
at com.chimpler.hive.solr.SolrInputFormat.getReadColumnIDs(SolrInputFormat.java:38)

That is:

public static final String READ_COLUMN_IDS_CONF_STR = "hive.io.file.readcolumn.ids";
public static final String READ_COLUMN_NAMES_CONF_STR = "hive.io.file.readcolumn.names";
private static final String READ_COLUMN_IDS_CONF_STR_DEFAULT = "";
private static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";
private static final boolean READ_ALL_COLUMNS_DEFAULT = true;

/**
 * Returns an array of column ids (starting from zero) which is set in the given
 * parameter <tt>conf</tt>.
 */
public static List<Integer> getReadColumnIDs(JobConf conf) {
    String skips = conf.get(READ_COLUMN_IDS_CONF_STR, READ_COLUMN_IDS_CONF_STR_DEFAULT);
    String[] list = skips.split(",");
    List<Integer> result = new ArrayList<Integer>(list.length);
    for (String element : list) {
        // it may contain duplicates, remove duplicates
        Integer toAdd = Integer.parseInt(element);
        if (!result.contains(toAdd)) {
            result.add(toAdd);
        }
    }
    return result;
}
In other words, when data is read through SolrInputFormat in Hive's map phase, hive.io.file.readcolumn.ids defaults to the empty string. If the query references no columns (as count(1) does), splitting that empty string yields a single empty element, and Integer.parseInt(element) then throws.
This can be verified with the following code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ttt {
    public static void main(String[] args) {
        String skips = "";
        String[] list = skips.split(",");
        System.out.println(list);
        List<Integer> result = new ArrayList<Integer>(list.length);
        for (String element : list) {
            try {
                Integer toAdd = Integer.parseInt(element);
                if (!result.contains(toAdd)) {
                    result.add(toAdd);
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
}
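Given that root cause, a defensive version of getReadColumnIDs that treats the empty default as "no projected columns" avoids the crash. The following is a sketch of one possible patch, not chimpler's actual fix; the class name ReadColumnIds and the standalone parse helper are mine for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class ReadColumnIds {
    // Parse a comma-separated column-id string the way SolrInputFormat does,
    // but tolerate the empty default: "" means Hive projected no columns
    // (the count(1) case), so return an empty list instead of crashing.
    static List<Integer> parse(String ids) {
        List<Integer> result = new ArrayList<Integer>();
        if (ids == null || ids.isEmpty()) {
            return result;
        }
        for (String element : ids.split(",")) {
            Integer toAdd = Integer.valueOf(element.trim());
            if (!result.contains(toAdd)) {   // drop duplicate ids
                result.add(toAdd);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(""));    // -> [] : count(1), no projected columns
        System.out.println(parse("0,2")); // -> [0, 2] : e.g. count(id) projects columns
    }
}
```

In the real getReadColumnIDs the ids string would come from conf.get(READ_COLUMN_IDS_CONF_STR, ...) as before; only the empty-string guard changes.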
This article is reposted from 菜菜光's 51CTO blog; original link: http://blog.51cto.com/caiguangguang/1422733. To reprint, please contact the original author.