192 DStream相关操作 - Transformations on DStreams

简介: 192 DStream相关操作 - Transformations on DStreams

DStream上的原语与RDD的类似,分为Transformations(转换)和Output Operations(输出)两种,此外转换操作中还有一些比较特殊的原语,如:updateStateByKey()、transform()以及各种Window相关的原语。

1.Transformations on DStreams
Transformation Meaning
map(func) Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions) Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream) Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count() Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func) Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue() When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]) When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]) When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]) When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func) Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func) Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

特殊的Transformations

1.UpdateStateByKey Operation

UpdateStateByKey原语用于记录历史记录,上文中Word Count示例中就用到了该特性。若不用UpdateStateByKey来更新状态,那么每次数据进来后分析完成后,结果输出后将不在保存

2.Transform Operation

Transform原语允许DStream上执行任意的RDD-to-RDD函数。通过该函数可以方便的扩展Spark API。此外,MLlib(机器学习)以及Graphx也是通过本函数来进行结合的。

3.Window Operations

Window Operations有点类似于Storm中的State,可以设置窗口的大小和滑动窗口的间隔来动态的获取当前Steaming的允许状态。

目录
相关文章
|
5月前
|
分布式计算 Spark
[Spark精进]必须掌握的4个RDD算子之flatMap算子
[Spark精进]必须掌握的4个RDD算子之flatMap算子
44 0
|
10月前
|
分布式计算 算法 Java
Spark shuffle、RDD 算子【重要】
Spark shuffle、RDD 算子【重要】
213 0
|
分布式计算 算法 大数据
RDD 算子_ Action _ reduce | 学习笔记
快速学习 RDD 算子_ Action _ reduce
78 0
RDD 算子_ Action _ reduce | 学习笔记
|
分布式计算 大数据 开发者
RDD 算子_ Action _总结 | 学习笔记
快速学习 RDD 算子_ Action _总结
47 0
|
分布式计算 大数据 调度
RDD 算子_ Action _ countByKey | 学习笔记
快速学习 RDD 算子_ Action _ countByKey
61 0
RDD 算子_ Action _ countByKey | 学习笔记
|
分布式计算 大数据 数据处理
RDD 算子_ Action _ take | 学习笔记
快速学习 RDD 算子_ Action _ take
59 0
RDD 算子_ Action _ take | 学习笔记
|
分布式计算 算法 大数据
Rdd 算子_转换_sample | 学习笔记
快速学习 Rdd 算子_转换_sample
125 0
Rdd 算子_转换_sample | 学习笔记
|
分布式计算 大数据 开发者
Rdd 算子_转换_groupbykey | 学习笔记
快速学习 Rdd 算子_转换_groupbykey
113 0
Rdd 算子_转换_groupbykey | 学习笔记
|
SQL 分布式计算 API
Flink / Scala - DataSet Transformations 常用转换函数详解
​上一篇文章讲到了 Flink 如何获取数据生成 DataSet,这篇文章主要讨论 DataSet 后续支持的 Transform 转换函数。相较于 Spark,Flink 提供了更多的 API 和更灵活的写法与实现。
227 0
Flink / Scala - DataSet Transformations 常用转换函数详解
|
机器学习/深度学习 消息中间件 分布式计算
【Spark Streaming】(三)DStream 算子详解
【Spark Streaming】(三)DStream 算子详解
151 0
【Spark Streaming】(三)DStream 算子详解