1、描述spark中怎么加载lzo压缩格式的文件
2、比较lzo格式文件以textFile方式和LzoTextInputFormat方式计算数据,Running Tasks个数的影响
a.确保lzo文件所在文件夹中生成lzo.index索引文件
b.以LzoTextInputFormat处理,能够正常按分块数分配Tasks
查看文件块数量
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
[tech@dx2 ~]$ hdfs
fsck
/wh/source/hotel
.2017-08-07.txt_10.10.16.105_20170807.lzo
Connecting to namenode via http:
//nn1
.zdp.ol:50070
FSCK started by bwtech (auth:SIMPLE) from
/10
.10.16.103
for
path
/wh/source/hotel
.2017-08-07.txt_10.10.16.105_20170807.lzo at Tue Aug 08 15:27:52 CST 2017
.Status: HEALTHY
Total size:2892666412 B
Total
dirs
:0
Total files:1
Total symlinks:0
Total blocks (validated):11 (avg. block size 262969673 B)
Minimally replicated blocks:11 (100.0 %)
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:0 (0.0 %)
Mis-replicated blocks:0 (0.0 %)
Default replication factor:3
Average block replication:3.0
Corrupt blocks:0
Missing replicas:0 (0.0 %)
Number of data-nodes:21
Number of racks:2
FSCK ended at Tue Aug 08 15:27:52 CST 2017
in
3 milliseconds
|
Spark源代码可以参考https://github.com/chocolateBlack/LearningSpark/blob/master/src/main/scala-2.11/SparkLzoFile.scala
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
import
com.hadoop.mapreduce.LzoTextInputFormat
import
org.apache.hadoop.io.{Text, LongWritable}
import
org.apache.spark.{SparkContext, SparkConf}
object
SparkLzoFile{
def
main(args
:
Array[String]){
val
conf
=
new
SparkConf().setAppName(
"Spark_Lzo_File"
)
val
sc
=
new
SparkContext(conf)
//文件路径
val
filePath
=
"/wh/source/hotel.2017-08-07.txt_10.10.16.105_20170807.lzo"
//按textFile方式加载文件
val
textFile
=
sc.textFile(filePath)
//按lzoTextInputFormat加载数据文件
val
lzoFile
=
sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](filePath)
println(textFile.partitions.length)
// partitions个数输出 1
println(lzoFile.partitions.length)
// partitions个数输出 11
//两种方式计算word count查看后台任务
lzoFile.map(
_
.
_
2
.toString).flatMap(x
=
>x.split(
"-"
)).map((
_
,
1
)).reduceByKey(
_
+
_
).collect
textFile.flatMap(x
=
>x.split(
"\t"
)).map((
_
,
1
)).reduceByKey(
_
+
_
).collect
}
}
|
本文转自巧克力黒 51CTO博客,原文链接:http://blog.51cto.com/10120275/1954601
,如需转载请自行联系原作者