[Spark][Python]对HDFS 上的文件,采用绝对路径,来读取获得 RDD

简介:

对HDFS 上的文件,采用绝对路径,来读取获得 RDD:

In [102]: mydata=sc.textFile("file:/home/training/test.txt")
17/09/24 06:31:04 INFO storage.MemoryStore: Block broadcast_30 stored as values in memory (estimated size 230.5 KB, free 2.4 MB)
17/09/24 06:31:04 INFO storage.MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 21.5 KB, free 2.5 MB)
17/09/24 06:31:04 INFO storage.BlockManagerInfo: Added broadcast_30_piece0 in memory on localhost:33950 (size: 21.5 KB, free: 208.6 MB)
17/09/24 06:31:04 INFO spark.SparkContext: Created broadcast 30 from textFile at NativeMethodAccessorImpl.java:-2

In [103]: mydata.take(1)
17/09/24 06:31:09 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/24 06:31:09 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
17/09/24 06:31:09 INFO scheduler.DAGScheduler: Got job 17 (runJob at PythonRDD.scala:393) with 1 output partitions
17/09/24 06:31:09 INFO scheduler.DAGScheduler: Final stage: ResultStage 17 (runJob at PythonRDD.scala:393)
17/09/24 06:31:09 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/09/24 06:31:09 INFO scheduler.DAGScheduler: Missing parents: List()
17/09/24 06:31:09 INFO scheduler.DAGScheduler: Submitting ResultStage 17 (PythonRDD[50] at RDD at PythonRDD.scala:43), which has no missing parents
17/09/24 06:31:09 INFO storage.MemoryStore: Block broadcast_31 stored as values in memory (estimated size 4.8 KB, free 2.5 MB)
17/09/24 06:31:09 INFO storage.MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 3.0 KB, free 2.5 MB)
17/09/24 06:31:09 INFO storage.BlockManagerInfo: Added broadcast_31_piece0 in memory on localhost:33950 (size: 3.0 KB, free: 208.6 MB)
17/09/24 06:31:09 INFO spark.SparkContext: Created broadcast 31 from broadcast at DAGScheduler.scala:1006
17/09/24 06:31:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 17 (PythonRDD[50] at RDD at PythonRDD.scala:43)
17/09/24 06:31:09 INFO scheduler.TaskSchedulerImpl: Adding task set 17.0 with 1 tasks
17/09/24 06:31:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 17.0 (TID 17, localhost, partition 0,PROCESS_LOCAL, 2130 bytes)
17/09/24 06:31:09 INFO executor.Executor: Running task 0.0 in stage 17.0 (TID 17)
17/09/24 06:31:09 INFO rdd.HadoopRDD: Input split: file:/home/training/test.txt:0+34
17/09/24 06:31:10 INFO python.PythonRunner: Times: total = 28, boot = 11, init = 16, finish = 1
17/09/24 06:31:10 INFO executor.Executor: Finished task 0.0 in stage 17.0 (TID 17). 2158 bytes result sent to driver
17/09/24 06:31:10 INFO scheduler.DAGScheduler: ResultStage 17 (runJob at PythonRDD.scala:393) finished in 0.344 s
17/09/24 06:31:10 INFO scheduler.DAGScheduler: Job 17 finished: runJob at PythonRDD.scala:393, took 0.750241 s
17/09/24 06:31:10 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 17.0 (TID 17) in 348 ms on localhost (1/1)
17/09/24 06:31:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 17.0, whose tasks have all completed, from pool 
Out[103]: [u'This is a test 1']

In [104]:






本文转自健哥的数据花园博客园博客,原文链接:http://www.cnblogs.com/gaojian/p/7588750.html,如需转载请自行联系原作者

目录
相关文章
|
5天前
|
JSON 安全 数据格式
Python文件操作宝典:一步步教你玩转文件读写
Python文件操作宝典:一步步教你玩转文件读写
|
5天前
|
Python
python搭建文件服务
python搭建文件服务
10 1
|
3天前
|
机器学习/深度学习 人工智能 程序员
探索Python宝库:从基础到技能的干货知识(数据类型与变量+ 条件与循环+函数与模块+文件+异常+OOP)
探索Python宝库:从基础到技能的干货知识(数据类型与变量+ 条件与循环+函数与模块+文件+异常+OOP)
4 0
|
5天前
|
数据安全/隐私保护 Python
经验大分享:python读取yaml文件
经验大分享:python读取yaml文件
12 0
|
6天前
|
存储 Python
Python处理文件的常用代码
Python处理文件的常用代码
|
6天前
|
Python
python文件的读取与写入
python文件的读取与写入
11 0
|
7天前
|
缓存 算法 Python
python文件读写讲解
python文件读写讲解
12 0
|
7天前
|
安全 Linux PHP
Python文件读写的详细讲解
Python文件读写的详细讲解
11 0
|
7天前
|
XML 存储 JavaScript
python读取xml文件详细讲解
python读取xml文件详细讲解
17 0
|
7天前
|
IDE 开发工具 Python
使用python3遍历文件夹并将文件目录保存到指定文件
使用python3遍历文件夹并将文件目录保存到指定文件
14 0