Running the Spark wordcount example from the command line

Summary: how to run the Spark wordcount example from the command line.

Created by Wang, Jerry, last modified on Sep 22, 2015

Running on a single machine, i.e. local mode


Local mode is the simplest way to run Spark: just launch the shell with the following command (assuming the current directory is $SPARK_HOME):

MASTER=local bin/spark-shell

Setting MASTER=local tells Spark to run in single-machine mode.
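As a side note, the --master flag achieves the same thing and also lets you choose the number of worker threads; a minimal sketch (the thread counts here are just illustrative):

bin/spark-shell --master local      # single worker thread
bin/spark-shell --master local[4]   # four worker threads
bin/spark-shell --master local[*]   # one thread per CPU core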

scala> val textFile = sc.textFile("README.md")

or, to load your own file instead:

scala> val textFile = sc.textFile("jerry.test")

15/08/08 19:14:32 INFO MemoryStore: ensureFreeSpace(182712) called with curMem=664070, maxMem=278302556

15/08/08 19:14:32 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 178.4 KB, free 264.6 MB)

15/08/08 19:14:32 INFO MemoryStore: ensureFreeSpace(17237) called with curMem=846782, maxMem=278302556

15/08/08 19:14:32 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 16.8 KB, free 264.6 MB)

15/08/08 19:14:32 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:37219 (size: 16.8 KB, free: 265.3 MB)

15/08/08 19:14:32 INFO SparkContext: Created broadcast 7 from textFile at <console>:21

textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at textFile at <console>:21

Then count the lines containing "Spark": textFile.filter(_.contains("Spark")).count

or build (word, 1) pairs for wordcount: textFile.flatMap(_.split(" ")).map((_, 1))
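The flatMap/map line above only produces (word, 1) pairs; the actual counting still needs a reduce step. A minimal sketch of the full wordcount pipeline in the same shell session (reduceByKey and collect are standard RDD operations):

scala> val counts = textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
scala> counts.collect().foreach(println)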

15/08/08 19:16:27 INFO FileInputFormat: Total input paths to process : 1

15/08/08 19:16:27 INFO SparkContext: Starting job: count at <console>:24

15/08/08 19:16:27 INFO DAGScheduler: Got job 0 (count at <console>:24) with 1 output partitions (allowLocal=false)

15/08/08 19:16:27 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:24)

15/08/08 19:16:27 INFO DAGScheduler: Parents of final stage: List()

15/08/08 19:16:27 INFO DAGScheduler: Missing parents: List()

15/08/08 19:16:27 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:24), which has no missing parents

15/08/08 19:16:27 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=156473, maxMem=278302556

15/08/08 19:16:27 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 265.3 MB)

15/08/08 19:16:27 INFO MemoryStore: ensureFreeSpace(1855) called with curMem=159657, maxMem=278302556

15/08/08 19:16:27 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1855.0 B, free 265.3 MB)

15/08/08 19:16:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42648 (size: 1855.0 B, free: 265.4 MB)

15/08/08 19:16:27 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874

15/08/08 19:16:27 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:24)

15/08/08 19:16:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks

15/08/08 19:16:27 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1415 bytes)

15/08/08 19:16:27 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

15/08/08 19:16:27 INFO HadoopRDD: Input split: file:/root/devExpert/spark-1.4.1/README.md:0+3624

15/08/08 19:16:27 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

15/08/08 19:16:27 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

15/08/08 19:16:27 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

15/08/08 19:16:27 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

15/08/08 19:16:27 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

15/08/08 19:16:27 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1830 bytes result sent to driver

15/08/08 19:16:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 80 ms on localhost (1/1)

15/08/08 19:16:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

15/08/08 19:16:27 INFO DAGScheduler: ResultStage 0 (count at <console>:24) finished in 0.093 s

15/08/08 19:16:27 INFO DAGScheduler: Job 0 finished: count at <console>:24, took 0.176689 s

res0: Long = 19
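The result res0: Long = 19 means 19 lines of README.md contain the word "Spark". To persist the wordcount result computed above instead of printing it, the pair RDD can be written to disk; a sketch, where the output directory name counts.out is a hypothetical example:

scala> counts.saveAsTextFile("counts.out")

Spark then writes one part-NNNNN file per partition under that directory.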
