Discretized Streams: Discretized Stream Data Processing

2017-05-02

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

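Current stream-processing systems, such as Yahoo!'s S4 and Twitter's Storm, use the traditional "record-at-a-time" processing model: when a node receives a record, it either updates its state or emits new records.

The problem is that users of these systems have to deal with many concerns themselves, for example: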

Fault tolerance

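There are two traditional approaches to fault tolerance:
a. Replication of processing nodes. This needs several times the hardware, and it can still fail if all replicas go down.
b. Backup at the source node plus replay, which is Storm's approach. Recovery takes a long time because it is timeout-based, so the system has to wait.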

Consistency

Depending on the system, it can be hard to reason about the global state, because different nodes may be processing data that arrived at different times. For example, suppose that a system counts page views from male users on one node and from females on another. If one of these nodes is backlogged, the ratio of their counters will be wrong.
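The page-view example can be made concrete with a small sketch. Everything here is hypothetical (the event list, the function names, the way a backlog is modeled as a "processed up to" time); it only illustrates why counters read at an arbitrary moment can disagree, while counters restricted to a completed interval do not.

```python
from collections import Counter

# Hypothetical events: (arrival_second, gender). One male and one female view per second.
events = [(0, "male"), (0, "female"), (1, "male"), (1, "female"),
          (2, "male"), (2, "female")]

def record_at_a_time_view(events, male_processed_up_to, female_processed_up_to):
    """Counters as seen when each node has processed up to a different time."""
    counts = Counter()
    for t, gender in events:
        limit = male_processed_up_to if gender == "male" else female_processed_up_to
        if t <= limit:
            counts[gender] += 1
    return counts

def interval_view(events, completed_interval):
    """Counters that only include intervals every node has finished."""
    counts = Counter()
    for t, gender in events:
        if t <= completed_interval:
            counts[gender] += 1
    return counts

# The "female" node is backlogged: it has only processed second 0.
skewed = record_at_a_time_view(events, male_processed_up_to=2, female_processed_up_to=0)
# Reporting only the interval that both nodes completed gives a consistent ratio.
consistent = interval_view(events, completed_interval=0)
```

In the skewed view the male counter is ahead of the female counter even though the true ratio is 1:1; the interval-based view reports less data but keeps the ratio correct.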

Unification with batch processing

Existing stream-processing models require writing separate code and cannot reuse batch-processing logic.

The paper proposes discretized streams (D-Streams), a model that overcomes these challenges.
The key idea behind D-Streams is to treat a streaming computation as a series of deterministic batch computations on small time intervals.
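A minimal sketch of this idea in plain Python, not Spark's actual API: records are grouped by time interval, and an ordinary deterministic batch function runs on each group. The names `discretize` and `word_count` and the sample records are assumptions made for illustration.

```python
from collections import defaultdict

def discretize(records, interval):
    """Group (timestamp, value) records into batches, one per time interval.

    Intervals that received no records are skipped in this simplified sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // interval)].append(value)
    return [batches[i] for i in sorted(batches)]

def word_count(batch):
    """A deterministic batch computation: the same input always gives the same output."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical stream of (timestamp_in_seconds, word) records.
records = [(0.1, "a"), (0.7, "b"), (1.2, "a"), (1.9, "a"), (2.5, "b")]

# The streaming job is just the batch function applied to each 1-second interval.
per_interval = [word_count(batch) for batch in discretize(records, interval=1.0)]
```

Because `word_count` knows nothing about streaming, the same function could be run over a stored dataset, which is the sense in which this model allows batch logic to be reused.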

Two issues arise in the implementation:

Low latency

With Spark and RDDs, latency can be brought under 1 second.

Fast fault tolerance

D-Streams use "parallel recovery":
The system periodically checkpoints some of the state RDDs, by asynchronously replicating them to other nodes.
The mechanism is fairly simple: when a failure occurs, the system reads the most recent checkpoint and replays forward from it to rebuild the latest state.
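The checkpoint-then-replay step can be sketched as follows. This is a simplified single-process illustration, not the paper's implementation: the functions `update`, `run`, and `recover` are hypothetical, a deep copy stands in for a replica on another node, and the input batches are assumed to remain available for replay.

```python
import copy

def update(state, batch):
    """Deterministic state transition: merge a batch of words into running counts."""
    new_state = dict(state)
    for word in batch:
        new_state[word] = new_state.get(word, 0) + 1
    return new_state

def run(batches, checkpoint_every):
    """Process batches in order, saving a copy of the state every few intervals."""
    state, checkpoints = {}, {}
    for i, batch in enumerate(batches):
        state = update(state, batch)
        if (i + 1) % checkpoint_every == 0:
            checkpoints[i] = copy.deepcopy(state)  # stands in for a remote replica
    return state, checkpoints

def recover(batches, checkpoints, up_to):
    """Rebuild the state after interval `up_to` from the latest earlier checkpoint."""
    usable = [i for i in checkpoints if i <= up_to]
    start = max(usable) if usable else -1
    state = copy.deepcopy(checkpoints[start]) if usable else {}
    # Replay only the batches that came after the checkpoint.
    for batch in batches[start + 1 : up_to + 1]:
        state = update(state, batch)
    return state

batches = [["a"], ["b", "a"], ["c"], ["a"], ["b"]]
final_state, checkpoints = run(batches, checkpoint_every=2)

# Pretend the in-memory state was lost after the last interval and rebuild it.
recovered = recover(batches, checkpoints, up_to=4)
```

Since `update` is deterministic, replaying from the checkpoint at interval 3 yields exactly the state that was lost; more frequent checkpoints shorten the replay at the cost of more copying.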

The rest of the paper mainly discusses how fault tolerance is achieved, though not in much detail.

One reason why parallel recovery was hard to perform in previous streaming systems is that they process data on a per-record basis, which requires complex and costly bookkeeping protocols (e.g., Flux [20]) even for basic replication. In contrast, D-Streams apply deterministic transformations at the much coarser granularity of RDD partitions, which leads to far lighter bookkeeping and simple recovery similar to batch data flow systems [6].
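To illustrate why partition-level determinism makes recovery simple, here is a hedged sketch in which lost partitions are recomputed concurrently from their parent partitions. The thread pool stands in for worker nodes, and `transform`, `recompute`, and the partition layout are all assumptions made for the example rather than anything taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(partition):
    """A deterministic per-partition transformation (here: square each value)."""
    return [x * x for x in partition]

# Parent dataset split into partitions; assumed to still be available (e.g. checkpointed).
parent_partitions = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
child_partitions = {pid: transform(p) for pid, p in parent_partitions.items()}

# Simulate a node failure that loses two of the child partitions.
lost = [1, 3]
surviving = {pid: p for pid, p in child_partitions.items() if pid not in lost}

def recompute(pid):
    """Rebuild one lost partition from its parent; needs no per-record bookkeeping."""
    return pid, transform(parent_partitions[pid])

# Each lost partition is an independent task, so they can be recomputed concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    recovered = dict(pool.map(recompute, lost))

restored = {**surviving, **recovered}
```

The only bookkeeping needed is which partitions are missing and which transformation produced them, which is much coarser than tracking individual records.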

This article is excerpted from cnblogs (博客园); original publication date: 2013-09-22
