Calling Spark from a Python Program on a Local Windows Machine

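Use case

Spark is a powerful compute engine written in Scala. It computes in memory and ships with tools for graph processing, stream processing, machine learning, and interactive queries. Spark can also be driven from Python code through PySpark, so a Python program can hand its heavy computation to Spark and finish faster, which in turn gives users a more responsive experience.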

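Procedure

If you followed the post "Setting Up a Python Development Environment on Windows", the Spark environment was already deployed as part of that setup, so you can write Spark programs in Python right away.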
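Before running a Spark program, it can help to confirm that the Python interpreter actually sees the Spark installation. The setup post is not reproduced here, so the sketch below is only an assumption about what it configures: `check_spark_env` is a hypothetical helper, and `SPARK_HOME`, `JAVA_HOME`, and `HADOOP_HOME` are the conventional variable names, which your setup may or may not use.

```python
import importlib.util
import os

def check_spark_env():
    """Report which Spark-related pieces are visible to this Python process."""
    return {
        # True if the pyspark package can be imported by this interpreter.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # Conventional environment variables (assumed names); None means "not set".
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "HADOOP_HOME": os.environ.get("HADOOP_HOME"),
    }

report = check_spark_env()
for name, value in report.items():
    print(f"{name}: {value}")
```

If `pyspark_importable` is `False`, the import on the first line of the example program will fail, and the setup is worth revisiting before going further.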

Code example:

from pyspark import SparkContext

sc = SparkContext("local", "Simple App")  # local mode, app name "Simple App"
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])  # two small "documents"
words = doc.flatMap(lambda d: d).distinct().collect()  # distinct words across all documents
word_dict = {w: i for i, w in enumerate(words)}  # word -> integer id
word_dict_b = sc.broadcast(word_dict)  # broadcast the lookup table to the executors

def wordCountPerDoc(d):
    counts = {}  # word id -> occurrences in this document
    wd = word_dict_b.value
    for w in d:
        if counts.get(wd[w], 0):
            counts[wd[w]] += 1
        else:
            counts[wd[w]] = 1
    return counts
print(doc.map(wordCountPerDoc).collect())
print("successful!")

Output:

D:\Anaconda\anaconda\python.exe E:/pythonworkspace/pythontest001/Test001/test002.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/21 15:00:18 INFO SparkContext: Running Spark version 1.6.1
17/11/21 15:00:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/21 15:00:21 INFO SecurityManager: Changing view acls to: lenovo
17/11/21 15:00:21 INFO SecurityManager: Changing modify acls to: lenovo
17/11/21 15:00:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(lenovo); users with modify permissions: Set(lenovo)
17/11/21 15:00:25 INFO Utils: Successfully started service 'sparkDriver' on port 60670.
17/11/21 15:00:25 INFO Slf4jLogger: Slf4jLogger started
17/11/21 15:00:25 INFO Remoting: Starting remoting
17/11/21 15:00:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.114.67:60684]
17/11/21 15:00:26 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 60684.
17/11/21 15:00:26 INFO SparkEnv: Registering MapOutputTracker
17/11/21 15:00:26 INFO SparkEnv: Registering BlockManagerMaster
17/11/21 15:00:26 INFO DiskBlockManager: Created local directory at C:\Users\lenovo\AppData\Local\Temp\blockmgr-a0245427-988c-4b5a-8653-ee9e228de6ba
17/11/21 15:00:26 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
17/11/21 15:00:26 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/21 15:00:26 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/21 15:00:26 INFO SparkUI: Started SparkUI at http://192.168.114.67:4040
17/11/21 15:00:27 INFO Executor: Starting executor ID driver on host localhost
17/11/21 15:00:27 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60691.
17/11/21 15:00:27 INFO NettyBlockTransferService: Server created on 60691
17/11/21 15:00:27 INFO BlockManagerMaster: Trying to register BlockManager
17/11/21 15:00:27 INFO BlockManagerMasterEndpoint: Registering block manager localhost:60691 with 511.1 MB RAM, BlockManagerId(driver, localhost, 60691)
17/11/21 15:00:27 INFO BlockManagerMaster: Registered BlockManager
17/11/21 15:00:28 INFO SparkContext: Starting job: collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5
17/11/21 15:00:28 INFO DAGScheduler: Registering RDD 2 (distinct at E:/pythonworkspace/pythontest001/Test001/test002.py:5)
17/11/21 15:00:28 INFO DAGScheduler: Got job 0 (collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5) with 1 output partitions
17/11/21 15:00:28 INFO DAGScheduler: Final stage: ResultStage 1 (collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5)
17/11/21 15:00:28 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/11/21 15:00:28 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/11/21 15:00:28 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at distinct at E:/pythonworkspace/pythontest001/Test001/test002.py:5), which has no missing parents
17/11/21 15:00:28 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.6 KB, free 6.6 KB)
17/11/21 15:00:28 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.3 KB, free 11.0 KB)
17/11/21 15:00:28 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60691 (size: 4.3 KB, free: 511.1 MB)
17/11/21 15:00:28 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/11/21 15:00:28 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[2] at distinct at E:/pythonworkspace/pythontest001/Test001/test002.py:5)
17/11/21 15:00:28 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/21 15:00:28 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2099 bytes)
17/11/21 15:00:28 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/11/21 15:00:30 INFO PythonRunner: Times: total = 1240, boot = 1221, init = 19, finish = 0
17/11/21 15:00:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1222 bytes result sent to driver
17/11/21 15:00:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1433 ms on localhost (1/1)
17/11/21 15:00:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/11/21 15:00:30 INFO DAGScheduler: ShuffleMapStage 0 (distinct at E:/pythonworkspace/pythontest001/Test001/test002.py:5) finished in 1.465 s
17/11/21 15:00:30 INFO DAGScheduler: looking for newly runnable stages
17/11/21 15:00:30 INFO DAGScheduler: running: Set()
17/11/21 15:00:30 INFO DAGScheduler: waiting: Set(ResultStage 1)
17/11/21 15:00:30 INFO DAGScheduler: failed: Set()
17/11/21 15:00:30 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[5] at collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5), which has no missing parents
17/11/21 15:00:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.5 KB, free 16.5 KB)
17/11/21 15:00:30 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.4 KB, free 19.8 KB)
17/11/21 15:00:30 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60691 (size: 3.4 KB, free: 511.1 MB)
17/11/21 15:00:30 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/11/21 15:00:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[5] at collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5)
17/11/21 15:00:30 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/11/21 15:00:30 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
17/11/21 15:00:30 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
17/11/21 15:00:30 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/11/21 15:00:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
17/11/21 15:00:31 INFO PythonRunner: Times: total = 1289, boot = 1280, init = 9, finish = 0
17/11/21 15:00:31 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1290 bytes result sent to driver
17/11/21 15:00:31 INFO DAGScheduler: ResultStage 1 (collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5) finished in 1.377 s
17/11/21 15:00:31 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1375 ms on localhost (1/1)
17/11/21 15:00:31 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
17/11/21 15:00:31 INFO DAGScheduler: Job 0 finished: collect at E:/pythonworkspace/pythontest001/Test001/test002.py:5, took 3.307445 s
17/11/21 15:00:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 352.0 B, free 20.2 KB)
17/11/21 15:00:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 115.0 B, free 20.3 KB)
17/11/21 15:00:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60691 (size: 115.0 B, free: 511.1 MB)
17/11/21 15:00:31 INFO SparkContext: Created broadcast 2 from broadcast at PythonRDD.scala:430
17/11/21 15:00:31 INFO SparkContext: Starting job: collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18
17/11/21 15:00:31 INFO DAGScheduler: Got job 1 (collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18) with 1 output partitions
17/11/21 15:00:31 INFO DAGScheduler: Final stage: ResultStage 2 (collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18)
17/11/21 15:00:31 INFO DAGScheduler: Parents of final stage: List()
17/11/21 15:00:31 INFO DAGScheduler: Missing parents: List()
17/11/21 15:00:31 INFO DAGScheduler: Submitting ResultStage 2 (PythonRDD[6] at collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18), which has no missing parents
17/11/21 15:00:31 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.3 KB, free 24.5 KB)
17/11/21 15:00:31 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.8 KB, free 27.3 KB)
17/11/21 15:00:31 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:60691 (size: 2.8 KB, free: 511.1 MB)
17/11/21 15:00:31 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
17/11/21 15:00:31 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (PythonRDD[6] at collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18)
17/11/21 15:00:31 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/11/21 15:00:31 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2110 bytes)
17/11/21 15:00:31 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
17/11/21 15:00:33 INFO PythonRunner: Times: total = 1199, boot = 1195, init = 3, finish = 1
17/11/21 15:00:33 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1040 bytes result sent to driver
17/11/21 15:00:33 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 1235 ms on localhost (1/1)
17/11/21 15:00:33 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
17/11/21 15:00:33 INFO DAGScheduler: ResultStage 2 (collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18) finished in 1.237 s
17/11/21 15:00:33 INFO DAGScheduler: Job 1 finished: collect at E:/pythonworkspace/pythontest001/Test001/test002.py:18, took 1.267822 s
[{0: 1, 1: 1, 2: 1}, {2: 1, 3: 2}]
successful!
17/11/21 15:00:33 INFO SparkContext: Invoking stop() from shutdown hook

Process finished with exit code 0
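To read the two dictionaries printed near the end of the log, it may help to see the same computation in plain Python, without Spark. This is a minimal sketch rather than the author's code: `build_vocab` and `count_per_doc` are hypothetical names, and ids are assigned in first-seen order. In the Spark run the ids follow whatever order `distinct().collect()` returns, which Spark does not guarantee, so the keys below can differ from the logged output while the counts per document match.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct word to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for word in doc:
            # setdefault only assigns a new id the first time a word appears.
            vocab.setdefault(word, len(vocab))
    return vocab

def count_per_doc(doc, vocab):
    """Return {word_id: occurrences} for a single document."""
    return dict(Counter(vocab[word] for word in doc))

docs = [['a', 'b', 'c'], ['b', 'd', 'd']]
vocab = build_vocab(docs)
result = [count_per_doc(d, vocab) for d in docs]
print(vocab)
print(result)
```

Each dictionary corresponds to one input document: the first has three different words once each, and the second has one word once and another word twice, which is the same shape as the result in the log.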