Background
- In almost every programming language, the first program you write is Hello World.
Download and install the JDK, Scala, and Maven
- The JDK and Scala were already installed in the earlier Hadoop HA and Spark cluster posts. Installing Maven is just as simple, so it is omitted here.
Download IDEA and install the Scala plugin
- Installing the plugin online can be slow, but plenty of workarounds are described online, so the details are omitted.
Create a maven-scala project
Fill in each step of the wizard and click Next until the project is created.
Update the version numbers in pom.xml
- Change scala.version to the Scala version installed on your machine, and add the dependencies required by Hadoop and Spark. The complete file is as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.gemantic.bigdata</groupId>
  <artifactId>bigdata-spark</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <scala.version>2.11.4</scala.version>
    <spark.version>2.0.0</spark.version>
    <spark.artifact>2.11</spark.artifact>
    <hbase.version>1.2.2</hbase.version>
    <hadoop.version>2.6.0</hadoop.version>
    <dependency.scope>compile</dependency.scope>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${spark.artifact}</artifactId>
      <version>${spark.version}</version>
      <scope>${dependency.scope}</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${spark.artifact}</artifactId>
      <version>${spark.version}</version>
      <scope>${dependency.scope}</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${spark.artifact}</artifactId>
      <version>${spark.version}</version>
      <scope>${dependency.scope}</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${spark.artifact}</artifactId>
      <version>${spark.version}</version>
      <scope>${dependency.scope}</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>${dependency.scope}</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-${maven.compiler.target}</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/distribution.xml</descriptor>
          </descriptors>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <includes>
          <include>**/*</include>
        </includes>
      </resource>
    </resources>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
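Note that the maven-assembly-plugin above points at src/main/assembly/distribution.xml, a descriptor this post does not show; the jar submitted later is the plain target jar, so the assembly step is optional. If you want the assembly goal to run, a minimal hypothetical descriptor might look like the following (the id, format, and lib layout are assumptions, not the author's actual file):

<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <!-- Hypothetical descriptor: bundles the project's runtime dependencies into a lib/ directory. -->
  <id>distribution</id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <dependencySets>
    <dependencySet>
      <outputDirectory>lib</outputDirectory>
      <scope>runtime</scope>
    </dependencySet>
  </dependencySets>
</assembly>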
Delete the auto-generated code and create your own HelloWorld
- What it does: read the README.md file from the Spark directory, keep only the lines that contain "Python", run a word count over those lines, and finally save the result to HDFS.
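A minimal sketch of such a WordCount object is shown below. The class name com.gemantic.bigdata.WordCount matches the spark-submit command further down; the input path (README.md copied to HDFS) and the output path /user/root/outputFile are assumptions inferred from the HDFS listing at the end of the post, not the author's exact code.

package com.gemantic.bigdata

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // Assumed input: Spark's README.md, uploaded to HDFS beforehand.
    val lines = sc.textFile("/user/root/README.md")

    // Keep only the lines mentioning "Python", split them into words, and count each word.
    val counts = lines
      .filter(_.contains("Python"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Assumed output directory; matches the one inspected later with `hdfs dfs -ls /user/root/outputFile`.
    counts.saveAsTextFile("/user/root/outputFile")

    sc.stop()
  }
}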
Packaging command
mvn clean package
- This generates bigdata-spark-1.0-SNAPSHOT.jar under the target directory.
Upload and test
- Upload the bigdata-spark-1.0-SNAPSHOT.jar built above to the server and submit the job to the cluster with the following command:
root@ubuntu238:/usr/local/spark-1.6.0-bin-hadoop2.6# ./bin/spark-submit --class com.gemantic.bigdata.WordCount --master yarn-cluster --executor-memory 512m /data/bigdata/spark/lib/bigdata-spark-1.0-SNAPSHOT.jar 10
- Log output during execution:
17/12/28 11:42:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/28 11:42:08 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
17/12/28 11:42:08 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
17/12/28 11:42:08 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/12/28 11:42:08 INFO yarn.Client: Setting up container launch context for our AM
17/12/28 11:42:08 INFO yarn.Client: Setting up the launch environment for our AM container
17/12/28 11:42:08 INFO yarn.Client: Preparing resources for our AM container
17/12/28 11:42:09 INFO yarn.Client: Uploading resource file:/usr/local/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://masters/user/root/.sparkStaging/application_1514254657629_0009/spark-assembly-1.6.0-hadoop2.6.0.jar
17/12/28 11:42:12 INFO yarn.Client: Uploading resource file:/data/bigdata/spark/lib/bigdata-spark-1.0-SNAPSHOT.jar -> hdfs://masters/user/root/.sparkStaging/application_1514254657629_0009/bigdata-spark-1.0-SNAPSHOT.jar
17/12/28 11:42:12 INFO yarn.Client: Uploading resource file:/tmp/spark-add007da-644d-47f5-99be-2ce1ddf89a4f/__spark_conf__5606044700861845297.zip -> hdfs://masters/user/root/.sparkStaging/application_1514254657629_0009/__spark_conf__5606044700861845297.zip
17/12/28 11:42:12 INFO spark.SecurityManager: Changing view acls to: root
17/12/28 11:42:12 INFO spark.SecurityManager: Changing modify acls to: root
17/12/28 11:42:12 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/12/28 11:42:12 INFO yarn.Client: Submitting application 9 to ResourceManager
17/12/28 11:42:12 INFO impl.YarnClientImpl: Submitted application application_1514254657629_0009
17/12/28 11:42:13 INFO yarn.Client: Application report for application_1514254657629_0009 (state: ACCEPTED)
17/12/28 11:42:13 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1514432532552
     final status: UNDEFINED
     tracking URL: http://master:8088/proxy/application_1514254657629_0009/
     user: root
17/12/28 11:42:14 INFO yarn.Client: Application report for application_1514254657629_0009 (state: ACCEPTED)

...

17/12/28 11:42:22 INFO yarn.Client: Application report for application_1514254657629_0009 (state: RUNNING)
17/12/28 11:42:22 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.111.239
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1514432532552
     final status: UNDEFINED
     tracking URL: http://master:8088/proxy/application_1514254657629_0009/
     user: root
17/12/28 11:42:23 INFO yarn.Client: Application report for application_1514254657629_0009 (state: RUNNING)

...

17/12/28 11:42:39 INFO yarn.Client: Application report for application_1514254657629_0009 (state: FINISHED)
17/12/28 11:42:39 INFO yarn.Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.111.239
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1514432532552
     final status: SUCCEEDED
     tracking URL: http://master:8088/proxy/application_1514254657629_0009/
     user: root
17/12/28 11:42:39 INFO util.ShutdownHookManager: Shutdown hook called
17/12/28 11:42:39 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-add007da-644d-47f5-99be-2ce1ddf89a4f
- Check the output:
root@ubuntu238:/usr/local/hadoop-2.6.1# ./bin/hdfs dfs -ls /user/root/outputFile
17/12/28 13:09:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   3 root supergroup          0 2017-12-28 11:42 /user/root/outputFile/_SUCCESS
-rw-r--r--   3 root supergroup        144 2017-12-28 11:42 /user/root/outputFile/part-00000
-rw-r--r--   3 root supergroup        100 2017-12-28 11:42 /user/root/outputFile/part-00001

root@ubuntu238:/usr/local/hadoop-2.6.1# ./bin/hdfs dfs -text /user/root/outputFile/part-00000
17/12/28 13:10:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(Python,2)
(Interactive,1)
(R,,1)
(can,1)
(Java,,1)
(Shell,1)
(Alternatively,,1)
(shell:,1)
(Scala,,1)
(Python,,2)
(prefer,1)
(engine,1)
(##,1)
root@ubuntu238:/usr/local/hadoop-2.6.1# ./bin/hdfs dfs -text /user/root/outputFile/part-00001
17/12/28 13:10:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(you,2)
(if,1)
(APIs,1)
(that,1)
(high-level,1)
(optimized,1)
(in,1)
(an,1)
(and,2)
(use,1)
(the,1)