Spark修炼之道（进阶篇）——Spark入门到精通：第五节 Spark编程模型（二)-阿里云开发者社区

Spark修炼之道（进阶篇）——Spark入门到精通：第五节 Spark编程模型（二)

2015-09-20 2985

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 作者：周志湖网名：摇摆少年梦微信号：zhouzhihubeyond本文主要内容RDD 常用Transformation函数1. RDD 常用Transformation函数（1）union union将两个RDD数据集元素合并，类似两个集合的并集 union函数参数： /** * Return the union of this RDD

作者：周志湖
网名：摇摆少年梦
微信号：zhouzhihubeyond

本文主要内容

RDD 常用Transformation函数

1. RDD 常用Transformation函数

（1）union
union将两个RDD数据集元素合并，类似两个集合的并集
union函数参数：

 /**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T]

RDD与另外一个RDD进行Union操作之后，两个数据集中的存在的重复元素
代码如下：

scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:21
//存在重复元素
scala> rdd1.union(rdd2).collect
res13: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)

这里写图片描述

（2）intersection
方法返回两个RDD数据集的交集
函数参数：
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* Note that this method performs a shuffle internally.
*/
def intersection(other: RDD[T]): RDD[T]

使用示例：

scala> rdd1.intersection(rdd2).collect
res14: Array[Int] = Array(4, 5)

（3）distinct
distinct函数将去除重复元素
distinct函数参数：

/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(): RDD[T]

scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).distinct.collect
res0: Array[Int] = Array(6, 1, 7, 8, 2, 3, 4, 5)

这里写图片描述

（4）groupByKey([numTasks])
输入数据为(K, V) 对, 返回的是 (K, Iterable) ，numTasks指定task数量，该参数是可选的，下面给出的是无参数的groupByKey方法
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])]

scala> val rdd1=sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(4 to 8)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd1.union(rdd2).map((_,1)).groupByKey.collect
res2: Array[(Int, Iterable[Int])] = Array((6,CompactBuffer(1)), (1,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)))

（5）reduceByKey(func, [numTasks])
reduceByKey函数输入数据为(K, V)对，返回的数据集结果也是（K,V）对，只不过V为经过聚合操作后的值
/**
* Merge the values for each key using an associative reduce function. This will also perform
* the merging locally on each mapper before sending results to a reducer, similarly to a
* “combiner” in MapReduce. Output will be hash-partitioned with numPartitions partitions.
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

使用示例：

scala> rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collect
res4: Array[(Int, Int)] = Array((6,1), (1,1), (7,1), (8,1), (2,1), (3,1), (4,2), (5,2))

这里写图片描述

（6）sortByKey([ascending], [numTasks])
对输入的数据集按key排序
sortByKey方法定义

/**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)]

使用示例：

scala> var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(7,9),(2,4)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[20] at parallelize at <console>:21

scala> data.sortByKey(true).collect
res10: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))

这里写图片描述

（7）join(otherDataset, [numTasks])
对于数据集类型为 (K, V) 及 (K, W)的RDD，join操作后返回类型为 (K, (V, W))，join函数有三种：
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
def leftOuterJoin[W](
other: RDD[(K, W)],
partitioner: Partitioner): RDD[(K, (V, Option[W]))]
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
: RDD[(K, (Option[V], W))]

使用示例：

scala> val rdd1=sc.parallelize(Array((1,2),(1,3))
     | )
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array((1,3)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:21

scala> rdd1.join(rdd2).collect
res13: Array[(Int, (Int, Int))] = Array((1,(3,3)), (1,(2,3)))

这里写图片描述

scala> rdd1.leftOuterJoin(rdd2).collect
res15: Array[(Int, (Int, Option[Int]))] = Array((1,(3,Some(3))), (1,(2,Some(3))))

这里写图片描述

scala> rdd1.rightOuterJoin(rdd2).collect
res16: Array[(Int, (Option[Int], Int))] = Array((1,(Some(3),3)), (1,(Some(2),3)))

这里写图片描述

（8）cogroup(otherDataset, [numTasks])
如果输入的RDD类型为(K, V) 和(K, W)，则返回的RDD类型为 (K, (Iterable, Iterable)) . 该操作与 groupWith等同

方法定义：
/**
* For each key k in this or other, return a resulting RDD that contains a tuple with the
* list of values for that key in this as well as other.
*/
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
: RDD[(K, (Iterable[V], Iterable[W]))]

scala> val rdd1=sc.parallelize(Array((1,2),(1,3))
     | )
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array((1,3)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:21

scala> rdd1.cogroup(rdd2).collect
res17: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(3, 2),CompactBuffer(3))))

scala> rdd1.groupWith(rdd2).collect
res18: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))

这里写图片描述

（9）cartesian(otherDataset)
求两个RDD数据集间的笛卡尔积
函数定义：
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in this and b is in other.
*/
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

scala> val rdd1=sc.parallelize(Array(1,2,3,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:21

scala> val rdd2=sc.parallelize(Array(5,6))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[53] at parallelize at <console>:21

scala> rdd1.cartesian(rdd2).collect
res21: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (4,5), (3,6), (4,6))

这里写图片描述

（10）coalesce(numPartitions)
将RDD的分区数减至指定的numPartitions分区数

函数定义：
/**
* Return a new RDD that is reduced into numPartitions partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions.
*
* However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* Note: With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
: RDD[T]

示例代码：

scala> val rdd1=sc.parallelize(1 to 100,3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:21

scala> val rdd2=rdd1.coalesce(2)
rdd2: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[56] at coalesce at <console>:23

这里写图片描述

repartition(numPartitions)，功能与coalesce函数相同，实质上它调用的就是coalesce函数，只不是shuffle = true，意味着可能会导致大量的网络开销。
方法定义：
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using coalesce,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}

Spark修炼之道（进阶篇）——Spark入门到精通：第五节 Spark编程模型（二)

本文主要内容

1. RDD 常用Transformation函数

热门文章

最新文章

相关课程

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

Spark修炼之道（进阶篇）——Spark入门到精通：第五节 Spark编程模型（二)

本文主要内容

1. RDD 常用Transformation函数

热门文章

最新文章

相关课程

相关电子书