While testing Hive 0.14.0 on Tez I ran into quite a few problems:
1. Testing with CDH 5.2.0 + Hive 0.14.0 + Tez 0.5.0, the first error was the following:
java.lang.NoSuchMethodError: org.apache.tez.dag.api.client.Progress.getFailedTaskAttemptCount()I
    at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.printStatusInPlace(TezJobMonitor.java:613)
    at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.monitorExecution(TezJobMonitor.java:311)
    at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:167)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:247)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The stack trace shows the error is thrown after the Tez job is submitted. In org.apache.hadoop.hive.ql.exec.tez.TezTask, once the job is submitted via the submit method, a TezJobMonitor object is instantiated to track the Tez job's progress:
// submit will send the job to the cluster and start executing
client = submit(jobConf, dag, scratchDir, appJarLr, session,
    additionalLr, inputOutputJars, inputOutputLocalResources);
// finally monitor will print progress until the job is done
TezJobMonitor monitor = new TezJobMonitor();
rc = monitor.monitorExecution(client, ctx.getHiveTxnManager(), conf, dag);
In the TezJobMonitor.monitorExecution method:
boolean isProfileEnabled = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_SUMMARY);
// hive.tez.exec.print.summary, defaults to false
boolean inPlaceUpdates = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_INPLACE_PROGRESS);
// hive.tez.exec.inplace.progress, defaults to true
boolean wideTerminal = false;
boolean isTerminal = inPlaceUpdates == true ? isUnixTerminal() : false;
// we need at least 80 chars wide terminal to display in-place updates properly
if (isTerminal) {
  if (getTerminalWidth() >= MIN_TERMINAL_WIDTH) {
    wideTerminal = true;
  }
}
boolean inPlaceEligible = false;
if (inPlaceUpdates && isTerminal && wideTerminal && !console.getIsSilent()) {
  inPlaceEligible = true;
}
// a while loop then checks the job status and calls either printStatusInPlace
// or printStatus (printStatus eventually calls getReport)
......
case RUNNING:
  if (!running) {
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.TEZ_SUBMIT_TO_RUNNING);
    console.printInfo("Status: Running (" + dagClient.getExecutionContext() + ")\n");
    startTime = System.currentTimeMillis();
    running = true;
  }
  if (inPlaceEligible) {
    printStatusInPlace(progressMap, startTime, false, dagClient);
    // log the progress report to log file as well
    lastReport = logStatus(progressMap, lastReport, console);
  } else {
    lastReport = printStatus(progressMap, lastReport, console);
  }
  break;
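The eligibility logic above boils down to a handful of boolean checks. A rough Python sketch of the same decision (the function name and keyword arguments are mine, not Hive's; pass is_terminal/width explicitly to reason about the logic without a real TTY):

```python
import shutil
import sys

MIN_TERMINAL_WIDTH = 80  # same 80-column threshold TezJobMonitor uses

def in_place_eligible(in_place_updates, is_silent, is_terminal=None, width=None):
    """Sketch of TezJobMonitor's in-place-update eligibility check.

    is_terminal and width default to the real terminal when not given."""
    if is_terminal is None:
        is_terminal = in_place_updates and sys.stdout.isatty()
    if width is None:
        width = shutil.get_terminal_size().columns
    # wide enough only if it is a terminal AND has >= 80 columns
    wide_terminal = is_terminal and width >= MIN_TERMINAL_WIDTH
    return in_place_updates and is_terminal and wide_terminal and not is_silent
```

So in-place progress is printed only when hive.tez.exec.inplace.progress is on, stdout is a Unix terminal at least 80 columns wide, and the console is not silent; otherwise Hive falls back to plain printStatus lines.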
For example, in the printStatusInPlace method:
SortedSet<String> keys = new TreeSet<String>(progressMap.keySet());
int idx = 0;
int maxKeys = keys.size();
for (String s : keys) {
  idx++;
  Progress progress = progressMap.get(s);
  final int complete = progress.getSucceededTaskCount();
  final int total = progress.getTotalTaskCount();
  final int running = progress.getRunningTaskCount();
  final int failed = progress.getFailedTaskAttemptCount();
  // calls Progress.getFailedTaskAttemptCount to get the number of failed task attempts
  final int pending = progress.getTotalTaskCount() - progress.getSucceededTaskCount() -
      progress.getRunningTaskCount();
  final int killed = progress.getKilledTaskCount();
In Tez 0.5.0, the org.apache.tez.dag.api.client.Progress class has no getFailedTaskAttemptCount method; it was only added in Tez 0.5.2. So to use Hive 0.14.0, you need Tez 0.5.2 or later.
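One quick way to confirm whether a given tez-api jar defines the method is to inspect the jar directly. This is a heuristic sketch, not a real bytecode parser (the jar path in the usage comment is illustrative): method names are stored as UTF-8 constants in a class file's constant pool, so a plain byte search inside Progress.class gives a yes/no answer.

```python
import zipfile

def jar_class_mentions(jar_path, class_entry, symbol):
    """Heuristic check: does a .class file inside a jar mention `symbol`?

    Method names appear as UTF-8 constants in the class file bytes, so a
    byte search is enough to tell whether Progress defines the method."""
    with zipfile.ZipFile(jar_path) as jar:
        data = jar.read(class_entry)
    return symbol.encode("utf-8") in data

# Usage against a tez-api jar (path illustrative):
# jar_class_mentions("tez-api-0.5.2.jar",
#                    "org/apache/tez/dag/api/client/Progress.class",
#                    "getFailedTaskAttemptCount")
```

Per the analysis above, this should come back True for tez-api 0.5.2+ and False for 0.5.0.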
2. After upgrading to Hive 0.14.0 + Tez 0.5.2, the following error appeared:
15/01/13 14:09:21 INFO client.TezClient: The url to track the Tez Session: http://xxxx:8042/proxy/application_1416818587155_0049/
Exception in thread "main" java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:457)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
    at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:599)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454)
    ... 7 more
The failure happens during session initialization; the exception is thrown from the TezSessionState.open method:
....
try {
  session.waitTillReady();
} catch (InterruptedException ie) {
  //ignore
}
Here session is a TezClient instance. In the TezClient.waitTillReady method:
public synchronized void waitTillReady() throws IOException, TezException, InterruptedException {
  if (!isSession) {
    // nothing to wait for in non-session mode
    return;
  }
  verifySessionStateForSubmission();
  while (true) {
    TezAppMasterStatus status = getAppMasterStatus();
    // here getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN
    if (status.equals(TezAppMasterStatus.SHUTDOWN)) {
      throw new SessionNotRunning("TezSession has already shutdown");
    }
    if (status.equals(TezAppMasterStatus.READY)) {
      return;
    }
    Thread.sleep(SLEEP_FOR_READY);
  }
}
The TezClient is created in session mode here, and getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN, so waitTillReady threw the exception. In other words, the Tez ApplicationMaster failed to start properly. Checking the NodeManager logs turned up the following error:
2015-01-13 16:27:58,162 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1416818587155_0060_01_000001 and exit code: 1
ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:196)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
The container that launches the AM failed at startup. The corresponding container log shows:
2015-01-13 17:34:59,731 FATAL [main] app.DAGAppMaster: Error starting DAGAppMaster
java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
    at java.lang.Class.getConstructor0(Class.java:2699)
    at java.lang.Class.getConstructor(Class.java:1657)
    at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62)
    at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36)
    at org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(ApplicationId.java:49)
    at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
    at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
    at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1794)
This looks like a protobuf compatibility problem.
CDH 5.2.0 ships protobuf-java-2.5.0.jar by default, Hive 0.14.0 also uses protobuf-java-2.5.0.jar, and Tez 0.5.2 is likewise compiled against protobuf 2.5.0, so in theory there should be no compatibility issue. My suspicion was that the Tez AM was loading a 2.4.0a protobuf at startup, so I needed to see the launch command and find the effective classpath.
To capture the shell that launches the AM, I modified org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor to add a Thread.sleep, recompiled the CDH 5.2.0 package (this mainly needs Java 7 support, range [1.7.0,1.7.1000}; skip the native build: mvn package -DskipTests -Pdist -Dtar -e -X), and replaced ./share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.0-cdh5.2.0.jar for the test.
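Incidentally, patching the NodeManager is not the only way to see the script: launch_container.sh sits in the container's working directory until YARN cleans it up, so a small watcher can usually grab a copy in time (setting yarn.nodemanager.delete.debug-delay-sec to keep container files around is another option). A sketch, with illustrative paths and a function name of my own:

```python
import shutil
import time
from pathlib import Path

def capture_launch_script(container_dir, dest, timeout_s=60.0, poll_s=0.2):
    """Poll a container working directory and copy launch_container.sh out
    before the NodeManager deletes it. Returns True if a copy was made."""
    src = Path(container_dir) / "launch_container.sh"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if src.exists():
            shutil.copy(str(src), str(dest))
            return True
        time.sleep(poll_s)
    return False

# e.g. (path illustrative):
# capture_launch_script(
#     "/home/vipshop/hard_disk/7/yarn/local/usercache/hdfs/appcache/app_x/container_y",
#     "/tmp/launch_container.sh")
```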
The shell call chain is:
default_container_executor.sh-->default_container_executor_session.sh-->launch_container.sh
And in the launch_container.sh script:
export HADOOP_COMMON_HOME="/home/vipshop/platform/hadoop-2.5.0-cdh5.2.0"
# the relevant variables are set first
export CLASSPATH="$PWD:$PWD/*:$HADOOP_CONF_DIR:"
# CLASSPATH is reset here
export HADOOP_TOKEN_FILE_LOCATION="/home/vipshop/hard_disk/7/yarn/local/usercache/hdfs/appcache/application_1416818587155_0075/container_1416818587155_0075_01_000001/container_tokens"
....
ln -sf "/home/vipshop/hard_disk/10/yarn/local/filecache/42/hadoop-yarn-api-2.5.0.jar" "hadoop-yarn-api-2.5.0.jar"
# the jars are symlinked into the working directory
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
.....
exec /bin/bash -c "$JAVA_HOME/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster --session 1>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stdout 2>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stderr "
The script finally runs `java org.apache.tez.dag.app.DAGAppMaster`, i.e. the main method of org.apache.tez.dag.app.DAGAppMaster, which starts the DAGAppMaster.
The CLASSPATH is the directory the shell runs in, for example:
CLASSPATH='/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1416818587155_0079/container_1416818587155_0079_01_000001:/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1416818587155_0079/container_1416818587155_0079_01_000001/*:/home/vipshop/conf:'
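With `$PWD` and `$PWD/*` at the front of the CLASSPATH, whichever jar in the container directory supplies a class first wins. The following sketch approximates that resolution (the JVM's expansion order for a `dir/*` classpath entry is not actually specified, so this sorts alphabetically as an approximation; the function name is mine):

```python
import zipfile
from pathlib import Path

def first_provider(classpath_dirs, resource):
    """Return the first jar, in classpath order, containing `resource`
    (e.g. 'com/google/protobuf/GeneratedMessage.class'), approximating
    which copy of a duplicated class a URLClassLoader would load."""
    for d in classpath_dirs:
        for jar in sorted(Path(d).glob("*.jar")):
            with zipfile.ZipFile(jar) as zf:
                if resource in zf.namelist():
                    return str(jar)
    return None
```

Running this against the container directory with the protobuf GeneratedMessage class would show which bundled protobuf copy the AM actually loads.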
Searching the jars in the shell's working directory for bundled protobuf turned up a hive-solr jar that shades protobuf, and its protobuf version is 2.4.0a:
for i in `find . -name "*.jar"`; do echo $i `jar -tvf $i|grep GeneratedMessage|wc -l`; done|awk '{if($2>0) print}'
./protobuf-java-2.5.0.jar 31    //2.5.0
./hive-exec-0.14.0-dfffe4217f40bd764977b741ad970a562e07fb99992f0180620bd13f68a2577b.jar 31    //2.5.0
./hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar    //2.4.0a
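The `find`/`jar -tvf` loop above needs a JDK on the box; a rough Python equivalent that lists every jar under a directory bundling its own protobuf classes (the function name is mine) is:

```python
import zipfile
from pathlib import Path

def jars_bundling(root, prefix="com/google/protobuf/"):
    """List jars under `root` that bundle protobuf classes, i.e. the
    candidates for a shaded, conflicting copy like the hive-solr jar."""
    hits = []
    for jar in sorted(Path(root).rglob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if any(name.startswith(prefix) for name in zf.namelist()):
                hits.append(str(jar))
    return hits
```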
So when the container started, the classloader picked up the 2.4.0a protobuf classes, which ultimately made the container fail to launch. After rebuilding that jar against protobuf 2.5.0, Hive on Tez ran normally.
This article was reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1604100. Please contact the original author before reprinting.