1. Spark Installation and Testing
1.1 Installing the Scala Language
# Upload the Scala installation package to /home/hd/apps
[hd@master apps]$ pwd
/home/hd/apps
# Extract the archive
[hd@master apps]$ tar -zxvf scala-2.11.0.tgz
# Rename the directory
[hd@master apps]$ mv scala-2.11.0 scala
# Switch to the root user
[hd@master apps]$ su root
Password:
# Add environment variables
[root@master apps]# vi /etc/profile
export SCALA_HOME=/home/hd/apps/scala
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin
[root@master apps]# source /etc/profile
# Test on the first machine
[root@master apps]# su hd
[hd@master ~]$ source /etc/profile
[hd@master ~]$ scala
# Copy the new environment file to the other machines
[root@master apps]# scp /etc/profile root@slave01:/etc/
[root@master apps]# scp /etc/profile root@slave02:/etc/
# Switch back to the hd user and copy the scala directory to the other machines
[root@master apps]# su hd
[hd@master apps]$ scp -r scala hd@slave01:/home/hd/apps/
[hd@master apps]$ scp -r scala hd@slave02:/home/hd/apps/
# Test on the second machine
[hd@slave01 ~]$ source /etc/profile
[hd@slave01 ~]$ scala
# Test on the third machine
[hd@slave02 ~]$ source /etc/profile
[hd@slave02 ~]$ scala
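If the scala command starts the REPL on each node, the installation is working. A couple of throwaway expressions such as the following are enough to confirm the interpreter evaluates code; this is a minimal sketch, and the exact version string depends on the build you installed:

scala> util.Properties.versionString   // should report the installed version, e.g. "version 2.11.0"
scala> (1 to 5).map(_ * 2).sum         // any expression works; the REPL should print res1: Int = 30
scala> :quit                           // leave the REPL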
1.2 Installing Spark
# Upload the installation file and extract it
[hd@master apps]$ tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz
[hd@master apps]$ mv spark-2.4.5-bin-hadoop2.7 spark
[hd@master apps]$ cd spark/conf/
# Edit the Spark configuration file spark-env.sh
[hd@master conf]$ vi spark-env.sh
export JAVA_HOME=/home/hd/apps/java
export SCALA_HOME=/home/hd/apps/scala
export HADOOP_HOME=/home/hd/apps/hadoop
export HADOOP_CONF_DIR=/home/hd/apps/hadoop/etc/hadoop
export SPARK_MASTER_IP=master
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
# Edit the Spark configuration file slaves
[hd@master conf]$ cp slaves.template slaves
[hd@master conf]$ vi slaves
slave01
slave02
# Sync the Spark directory to the other machines
[hd@master conf]$ scp -r /home/hd/apps/spark hd@slave01:/home/hd/apps/
[hd@master conf]$ scp -r /home/hd/apps/spark hd@slave02:/home/hd/apps/
# Update the system environment variables
[hd@master conf]$ su root
Password:
[root@master conf]# vi /etc/profile
export SPARK_HOME=/home/hd/apps/spark
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
[root@master spark]# scp /etc/profile root@slave01:/etc/
# Test on the second machine
[hd@slave01 ~]$ source /etc/profile
[root@master spark]# scp /etc/profile root@slave02:/etc/
# Test on the third machine
[hd@slave02 ~]$ source /etc/profile
# Start Spark
[root@master spark]# su hd
[hd@master spark]$ /home/hd/apps/spark/sbin/start-all.sh
# Check the Spark processes
[hd@master spark]$ jps
5632 SecondaryNameNode
10528 Master
5409 NameNode
10593 Jps
5853 ResourceManager
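Beyond checking jps, a quick way to exercise the standalone cluster is to run a tiny job against the master that start-all.sh brought up. The following is a minimal sketch of a standalone application, assuming the default master URL spark://master:7077 implied by SPARK_MASTER_IP above; the object name ClusterSmokeTest and the app name are illustrative only:

// Minimal smoke test against the standalone cluster configured above
import org.apache.spark.sql.SparkSession

object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ClusterSmokeTest")
      .master("spark://master:7077")   // the standalone master started by start-all.sh
      .getOrCreate()

    // Distribute a small range across the workers and sum it back on the driver
    val total = spark.sparkContext.parallelize(1 to 1000, 4).sum()
    println(s"sum = $total")           // expected: 500500.0

    spark.stop()
  }
}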
1.3 The Web UI After Startup
View the Spark web console; the standalone master UI listens on port 8080 by default (http://master:8080).
1.4 Word Count in the spark-shell Console
[hd@master spark]$ /home/hd/apps/spark/bin/spark-shell
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = local[*], app id = local-1582875665794).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val rdd1 = sc.textFile("hdfs://master:9000/word/words.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://master:9000/word/words.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> rdd1.collect
res3: Array[String] = Array(Hello World Bye World, Hello Hadoop Bye Hadoop, Bye Hadoop Hello Hadoop)

scala> val rdd2 = rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:25

scala> rdd2.collect
res4: Array[String] = Array(Hello, World, Bye, World, Hello, Hadoop, Bye, Hadoop, Bye, Hadoop, Hello, Hadoop)

scala> val rdd3 = rdd2.map((_,1))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25

scala> rdd3.collect
res5: Array[(String, Int)] = Array((Hello,1), (World,1), (Bye,1), (World,1), (Hello,1), (Hadoop,1), (Bye,1), (Hadoop,1), (Bye,1), (Hadoop,1), (Hello,1), (Hadoop,1))

scala> val rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:25

scala> rdd4.collect
res6: Array[(String, Int)] = Array((Bye,3), (Hello,3), (World,2), (Hadoop,4))

scala> rdd4.saveAsTextFile("hdfs://master:9000/sparkout")
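The four intermediate RDDs above make each step visible; the same job can also be written as one chained pipeline. The sketch below reproduces the result and adds an optional sortBy so the most frequent words come first (note that saveAsTextFile fails if the target directory already exists, so /sparkout must not be present beforehand):

scala> val counts = sc.textFile("hdfs://master:9000/word/words.txt").
     |   flatMap(_.split(" ")).
     |   map((_, 1)).
     |   reduceByKey(_ + _).
     |   sortBy(_._2, ascending = false)   // optional: order by count, highest first

scala> counts.collect   // same pairs as res6, ordered by descending count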
1.5 Working with JSON Objects in Spark
[hd@slave02 ~]$ vi person.json
[{"id":1,"name":"stone","age":30},{"id":2,"name":"james","age":30},{"id":3,"name":"jacky","age":28},{"id":4,"name":"Tom","age":27}]
[hd@slave02 ~]$ hdfs dfs -put person.json /

scala> val person = spark.read.json("hdfs://master:9000/person.json")

scala> person.collect
res9: Array[org.apache.spark.sql.Row] = Array([30,1,stone], [30,2,james], [28,3,jacky], [27,4,Tom])

scala> person.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 30|  1|stone|
| 30|  2|james|
| 28|  3|jacky|
| 27|  4|  Tom|
+---+---+-----+

scala> person.filter($"age">28).show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 30|  1|stone|
| 30|  2|james|
+---+---+-----+
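Besides the DataFrame API, the same data can be queried with SQL once the DataFrame is registered as a temporary view. The following sketch builds on the person DataFrame loaded above; the view name "person" is arbitrary:

scala> person.createOrReplaceTempView("person")   // expose the DataFrame to Spark SQL

scala> spark.sql("SELECT name, age FROM person WHERE age >= 28 ORDER BY age DESC").show

scala> person.groupBy("age").count().show         // how many records share each age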