Summary: Spark ShuffleDependency (shuffle dependency). Represents a dependency on the output of a shuffle stage.
Spark ShuffleDependency: Shuffle Dependencies
Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle, the RDD is transient since we don’t need it on the executor side.
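package com.opensource.bigdata.spark.local.rdd.operation.dependency.shuffle.n_01_ShuffleDependency

import com.opensource.bigdata.spark.local.rdd.operation.base.BaseScalaSparkContext

// BaseScalaSparkContext is the author's helper class (not shown here);
// pre() is expected to return a ready-to-use SparkContext.
object Run extends BaseScalaSparkContext {
  def main(args: Array[String]): Unit = {
    val sc = pre()
    // Four key/value pairs spread over 2 partitions.
    val rdd1 = sc.parallelize(List(('c', 1), ('b', 1), ('a', 1), ('a', 1)), 2)
    // reduceByKey regroups records by key, so rdd2 depends on rdd1 through a ShuffleDependency.
    val rdd2 = rdd1.reduceByKey((a, b) => a + b)
    println("rdd2\n" + rdd2.collect().mkString("\n"))
    sc.stop()
  }
}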
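To confirm that `reduceByKey` really introduces a shuffle dependency, the dependency list of the resulting RDD can be inspected directly. The following is a minimal sketch, not part of the original example: the object name `InspectShuffleDependency` and the helper `shuffleDeps` are hypothetical, and a plain local `SparkContext` is created here instead of the author's `BaseScalaSparkContext.pre()`.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object InspectShuffleDependency {
  // Hypothetical helper: keep only the direct dependencies that are shuffle dependencies.
  def shuffleDeps(rdd: RDD[_]): Seq[ShuffleDependency[_, _, _]] =
    rdd.dependencies.collect { case d: ShuffleDependency[_, _, _] => d }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads stands in for the author's pre() helper.
    val conf = new SparkConf().setMaster("local[2]").setAppName("inspect-shuffle-dependency")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(List(('c', 1), ('b', 1), ('a', 1), ('a', 1)), 2)
      val rdd2 = rdd1.reduceByKey(_ + _)

      // Print the shuffle id and the partitioner carried by each shuffle dependency.
      shuffleDeps(rdd2).foreach { dep =>
        println(s"shuffleId=${dep.shuffleId}, partitions=${dep.partitioner.numPartitions}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

`rdd1` itself comes straight from `parallelize`, so it has no parent RDD and `shuffleDeps(rdd1)` is empty; only `rdd2` carries the shuffle dependency.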
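Dependencies between Spark RDDs fall into two kinds: narrow dependencies and wide dependencies. In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD, as with map and filter. In a wide dependency, each partition of the parent RDD may be used by multiple partitions of the child RDD, as with grouping operations and some joins. Tasks connected only by narrow dependencies can run within the same stage, whereas a wide dependency involves a shuffle, so the job must be split into separate stages at that point. The Spark Web UI shows the DAG of a job and how it is divided into stages.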
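The distinction between narrow and wide dependencies can also be observed in code. The sketch below is illustrative only: the object name `DependencyKinds`, the helper `kinds`, and the sample data are assumptions that do not appear in the original article. It classifies the direct dependencies of a few RDDs and prints the lineage of the grouped RDD.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DependencyKinds {
  // Hypothetical helper: label each direct dependency of an RDD as narrow or wide.
  def kinds(rdd: RDD[_]): Seq[String] = rdd.dependencies.map {
    case _: NarrowDependency[_]        => "narrow"
    case _: ShuffleDependency[_, _, _] => "wide"
    case _                             => "unknown"
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with 2 threads is enough for this illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dependency-kinds")
    val sc = new SparkContext(conf)
    try {
      val words   = sc.parallelize(Seq("a", "b", "a", "c"), 2)
      val pairs   = words.map(w => (w, 1))     // map: narrow dependency on words
      val notA    = pairs.filter(_._1 != "a")  // filter: narrow dependency on pairs
      val grouped = pairs.groupByKey()         // groupByKey: wide dependency on pairs

      println("map:        " + kinds(pairs).mkString(", "))
      println("filter:     " + kinds(notA).mkString(", "))
      println("groupByKey: " + kinds(grouped).mkString(", "))

      // The lineage string lists the chain of parent RDDs behind grouped.
      println(grouped.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Running an action such as `grouped.collect()` and then opening the job in the Spark Web UI should show the `map` step and the `groupByKey` step in different stages, matching the classification printed here.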