I. Objectives
- Learn how to start Spark
- Upload a text file to HDFS
- Write a word-count program in the Scala shell
II. Procedure
1. Understand the components of Spark
2. Detailed steps
(1) Open a terminal and start Hadoop:
hadoop@dblab-VirtualBox:/usr/local/hadoop/sbin$ ./start-all.sh
(2) Start Spark:
hadoop@dblab-VirtualBox:/usr/local/spark/bin$ ./spark-shell
Output like the following indicates that Spark started successfully:
18/08/29 20:09:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/29 20:09:26 WARN Utils: Your hostname, dblab-VirtualBox resolves to a loopback address: 127.0.1.1, but we couldn't find any external IP address!
18/08/29 20:09:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://127.0.1.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1535544589211).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
(3) Open a second terminal, create the file to be counted, and upload it to HDFS:
hadoop@dblab-VirtualBox:/usr/local/hadoop/bin$ vim a
hadoop@dblab-VirtualBox:/usr/local/hadoop/bin$ ./hadoop fs -mkdir /input
hadoop@dblab-VirtualBox:/usr/local/hadoop/bin$ ./hdfs dfs -put a /
hadoop@dblab-VirtualBox:/usr/local/hadoop/bin$ cat a
kjd,kjd,ASDF,sjdf,jsadf
klfgldf.fdgjkaj
(4) Return to the first terminal and read the file in the Scala shell:
scala> sc.textFile("hdfs://localhost:9000/a").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
The result is as follows:
scala> sc.textFile("hdfs://localhost:9000/a").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
res1: Array[(String, Int)] = Array((sjdf,1), (klfgldf.fdgjkaj,1), (kjd,2), (ASDF,1), (jsadf,1))
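The Spark pipeline above can be sketched with plain Scala collections, which makes each transformation easy to inspect without a cluster. This is an illustrative local stand-in, not the Spark API itself: `groupBy` plus a sum plays the role of `reduceByKey`, and the `LocalWordCount` object and its sample input (the contents of file `a`) are assumptions made for the sketch.

```scala
// Local word count mirroring the Spark pipeline, with no Spark dependency.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(","))              // like RDD flatMap: split each line on commas
      .map((_, 1))                        // pair every token with an initial count of 1
      .groupBy(_._1)                      // local stand-in for reduceByKey's grouping
      .map { case (w, ps) => (w, ps.map(_._2).sum) } // sum the 1s per distinct token

  def main(args: Array[String]): Unit = {
    // Sample lines matching the file "a" from the experiment.
    val lines = Seq("kjd,kjd,ASDF,sjdf,jsadf", "klfgldf.fdgjkaj")
    println(wordCount(lines))
  }
}
```

Note that `klfgldf.fdgjkaj` contains no comma, so `split(",")` leaves it as a single token, which is why it appears as one "word" with count 1 in the result above.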