[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子

简介:

[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子

从如下地址获取文件:
https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro

导入到 hdfs 系统:
hdfs dfs -put episodes.avro

读入:
mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")

交互式运行结果:

In [7]: mydata001=sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
17/10/03 07:00:47 INFO avro.AvroRelation: Listing hdfs://localhost:8020/user/training/episodes.avro on driver

In [8]: type(mydata001)
Out[8]: pyspark.sql.dataframe.DataFrame

In [9]: mydata001.count()
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.5 KB, free 65.5 KB)
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.4 KB, free 86.9 KB)
17/10/03 07:01:05 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.8 MB)
17/10/03 07:01:05 INFO spark.SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 230.4 KB, free 317.3 KB)
17/10/03 07:01:06 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 21.5 KB, free 338.8 KB)
17/10/03 07:01:06 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.8 MB)
17/10/03 07:01:06 INFO spark.SparkContext: Created broadcast 4 from hadoopFile at AvroRelation.scala:121
17/10/03 07:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:07 INFO spark.SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Registering RDD 16 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.5 KB, free 350.3 KB)
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.7 KB, free 356.0 KB)
17/10/03 07:01:07 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:40075 (size: 5.7 KB, free: 208.8 MB)
17/10/03 07:01:07 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/10/03 07:01:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
17/10/03 07:01:07 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/10/03 07:01:07 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2484 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (count at NativeMethodAccessorImpl.java:-2) finished in 0.691 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/10/03 07:01:08 INFO scheduler.DAGScheduler: running: Set()
17/10/03 07:01:08 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
17/10/03 07:01:08 INFO scheduler.DAGScheduler: failed: Set()
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 693 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 12.6 KB, free 368.5 KB)
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 6.1 KB, free 374.7 KB)
17/10/03 07:01:08 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:40075 (size: 6.1 KB, free: 208.8 MB)
17/10/03 07:01:08 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,NODE_LOCAL, 1999 bytes)
17/10/03 07:01:08 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1666 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2) finished in 0.344 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:-2, took 1.480495 s
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 345 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
Out[9]: 8

In [10]: mydata001.take(1)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 230.1 KB, free 604.8 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 626.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 7 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 230.5 KB, free 856.7 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 21.5 KB, free 878.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 8 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:18 INFO spark.SparkContext: Starting job: take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-10-35862abbc114>:1) with 1 output partitions
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1), which has no missing parents
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.6 KB, free 883.8 KB)
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.0 KB, free 886.9 KB)
17/10/03 07:01:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:40075 (size: 3.0 KB, free: 208.7 MB)
17/10/03 07:01:19 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
17/10/03 07:01:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
17/10/03 07:01:19 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:19 INFO codegen.GenerateUnsafeProjection: Code generated in 124.624053 ms
17/10/03 07:01:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2237 bytes result sent to driver
17/10/03 07:01:19 INFO scheduler.DAGScheduler: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1) finished in 0.415 s
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-10-35862abbc114>:1, took 0.565858 s
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 415 ms on localhost (1/1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
Out[10]: [Row(title=u'The Eleventh Hour', air_date=u'3 April 2010', doctor=11)]

In [11]:









本文转自健哥的数据花园博客园博客,原文链接:http://www.cnblogs.com/gaojian/p/7624631.html,如需转载请自行联系原作者

目录
相关文章
|
1月前
|
Java 数据处理 索引
(Pandas)Python做数据处理必选框架之一!(二):附带案例分析;刨析DataFrame结构和其属性;学会访问具体元素;判断元素是否存在;元素求和、求标准值、方差、去重、删除、排序...
DataFrame结构 每一列都属于Series类型,不同列之间数据类型可以不一样,但同一列的值类型必须一致。 DataFrame拥有一个总的 idx记录列,该列记录了每一行的索引 在DataFrame中,若列之间的元素个数不匹配,且使用Series填充时,在DataFrame里空值会显示为NaN;当列之间元素个数不匹配,并且不使用Series填充,会报错。在指定了index 属性显示情况下,会按照index的位置进行排序,默认是 [0,1,2,3,...] 从0索引开始正序排序行。
216 0
|
2月前
|
数据可视化 Linux iOS开发
Python脚本转EXE文件实战指南:从原理到操作全解析
本教程详解如何将Python脚本打包为EXE文件,涵盖PyInstaller、auto-py-to-exe和cx_Freeze三种工具,包含实战案例与常见问题解决方案,助你轻松发布独立运行的Python程序。
915 2
|
1月前
|
监控 机器人 编译器
如何将python代码打包成exe文件---PyInstaller打包之神
PyInstaller可将Python程序打包为独立可执行文件,无需用户安装Python环境。它自动分析代码依赖,整合解释器、库及资源,支持一键生成exe,方便分发。使用pip安装后,通过简单命令即可完成打包,适合各类项目部署。
|
9月前
|
机器学习/深度学习 存储 算法
解锁文件共享软件背后基于 Python 的二叉搜索树算法密码
文件共享软件在数字化时代扮演着连接全球用户、促进知识与数据交流的重要角色。二叉搜索树作为一种高效的数据结构,通过有序存储和快速检索文件,极大提升了文件共享平台的性能。它依据文件名或时间戳等关键属性排序,支持高效插入、删除和查找操作,显著优化用户体验。本文还展示了用Python实现的简单二叉搜索树代码,帮助理解其工作原理,并展望了该算法在分布式计算和机器学习领域的未来应用前景。
|
3月前
|
缓存 数据可视化 Linux
Python文件/目录比较实战:排除特定类型的实用技巧
本文通过四个实战案例,详解如何使用Python比较目录差异并灵活排除特定文件,涵盖基础比较、大文件处理、跨平台适配与可视化报告生成,助力开发者高效完成目录同步与数据校验任务。
151 0
|
4月前
|
编译器 Python
如何利用Python批量重命名PDF文件
本文介绍了如何使用Python提取PDF内容并用于文件重命名。通过安装Python环境、PyCharm编译器及Jupyter Notebook,结合tabula库实现PDF数据读取与处理,并提供代码示例与参考文献。
|
4月前
|
编译器 Python
如何利用Python批量重命名文件
本文介绍了如何使用Python和PyCharm对文件进行批量重命名,包括文件名前后互换、按特定字符调整顺序等实用技巧,并提供了完整代码示例。同时推荐了第三方工具Bulk Rename Utility,便于无需编程实现高效重命名。适用于需要处理大量文件命名的场景,提升工作效率。
|
4月前
|
安全 Linux 网络安全
Python极速搭建局域网文件共享服务器:一行命令实现HTTPS安全传输
本文介绍如何利用Python的http.server模块,通过一行命令快速搭建支持HTTPS的安全文件下载服务器,无需第三方工具,3分钟部署,保障局域网文件共享的隐私与安全。
949 0
|
4月前
|
数据管理 开发工具 索引
在Python中借助Everything工具实现高效文件搜索的方法
使用上述方法,你就能在Python中利用Everything的强大搜索能力实现快速的文件搜索,这对于需要在大量文件中进行快速查找的场景尤其有用。此外,利用Python脚本可以灵活地将这一功能集成到更复杂的应用程序中,增强了自动化处理和数据管理的能力。
322 0
|
5月前
|
编解码 Prometheus Java
当Python同时操作1000个文件时,为什么你的CPU只用了10%?
本文介绍如何构建一个高效的文件处理系统,解决单线程效率低、多线程易崩溃的矛盾。通过异步队列与多线程池结合,实现任务调度优化,提升I/O密集型操作的性能。
125 4

推荐镜像

更多