Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka-阿里云开发者社区

Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka

This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processing platforms While the SMACK stack is really concise and consists of only several components, it is possible to implement different system designs within it which list not only purely batch or stream processing, but also contain more complex Lambda and Kappa architectures as well.

What is the SMACK stack?

First, let’s talk a little bit about what SMACK is. Here’s a quick rundown of the technologies that are included in it:

3c6cd56b2375c34e0112a97af652fcb2d224a8f7

Spark - a fast and general engine for distributed large-scale data processing.

Mesos - a cluster resource management system that provides efficient resource isolation and sharing across distributed applications.

Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.

• Cassandra - a distributed highly available database designed to handle large amounts of data across multiple datacenters.

• Kafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds.

Storage layer: Cassandra

Although not in alphabetical order, first let’s start with the C in SMACK. Cassandra is well known for its high-availability and high-throughput characteristics and is able to handle enormous write loads and survive cluster nodes failures. In terms of CAP theorem, Cassandra provides tunable consistency/availability for operations.

a5daaf9f736e46c6360fcc52d661f866f2b303bd

What is most interesting here is that when it comes to data processing, Cassandra is linearly scalable (increased loads could be addressed by just adding more nodes to a cluster) and it provides cross-datacenter replication (XDCR) capabilities. Actually XDCR provides not only data replication but a set of use cases, including:

• Geo-distributed datacenters handling data specific for the region or located closer to customers.

• Data migration across datacenters: recovery after failures or moving data to a new datacenter.

• Separate operational and analytics workloads.

However, all these features come for their own price and with Cassandra this price is its data model. This could be thought of just as a nested sorted map distributed across cluster nodes by partition key and entries sorted/grouped by clustering columns. An example is provided below:

9f243737e45368da765a584e017b95d06a63cae5

To get specific data in some range, the full key must be specified and no range clauses allowed except for the last column in the list. This constraint is introduced to limit multiple scans for different ranges which may produce random access to disks and lower down the performance. This means that the data model should be carefully designed against read queries to limit the amount of reads/scans which leads to lesser flexibility when it comes to support of new queries.

But what if one has some tables that need to be joined somehow with another tables? Let's consider the next case: calculate total views per campaign for a given month for all campaigns.

fe8baeb0626fc3e784bcd6680aad210200d9a4c2

With a given model, the only way to achieve this goal is to read all campaigns, read all events, sum the properties (with matched campaign IDs) and assign them to campaigns. It can be really challenging to implement such applications because the amount of data stored in Casandra may be huge and exceed the memory capacity. Therefore, such sort of data should be processed in a distributed manner and Spark perfectly fits this type of use cases.

Processing layer: Spark

b18c8eefc5e793b2d1d2b6919640e7d538f06d52

The main abstraction Spark operates with is RDD (Resilient Distributed Dataset, a distributed collection of elements) and the workflow consists of four main phases:

• RDD operations (transformations and actions) form DAG (Direct Acyclic Graph).

• DAG is split into stages of tasks which are then submitted to the cluster manager.

• Stages combine tasks which don't require shuffling/repartitioning.

• Tasks run on workers and results then return to the client.

Here's how one can solve the above problem with Spark and Cassandra:

c46e69f3fa2acf0c128f5876fa5c4082e42a06d3

Interaction with Cassandra is performed via Spark-Cassandra-connector, which makes the whole process easy and straightforward. There's one more interesting option to work with NoSQL stores and that’s SparkSQL, which translates SQL statements into a series of RDD operations.

7fbbd7834ccf21e82a05e12ae329e00402bb143c

With several lines of code it's possible to implement naive Lambda design which of course could be much more sophisticated, but this example shows just how easy this can be achieved.

Almost MapReduce: bringing processing closer to data

Spark-Cassandra connector is data-locality aware and reads the data from the closest node in a cluster, thus minimizing the amount of data transferred over the network. To fully facilitate Spark-C* connector data locality awareness, Spark workers should be collocated with Cassandra nodes.

bdf9950e3df740c58df7b0ab31dd909e0145242a

The above image illustrates Spark collocation with Cassandra. It makes sense to separate your operational (or Write-heavy) cluster from one for analytics. Here’s why:

• Clusters can be scaled independently.

• Data is replicated by Cassandra, with no extra-work needed.

• The analytics cluster has different Read/Write load patterns.

• The analytics cluster could contain additional data (for example, dictionaries) and processing results.

• Spark resources impact is limited to only one cluster.

Let's look at the Spark application deployment options one more time:

30b797ee1626315bdaaee1d4e22e235fde874001

As can be seen above, there are three main options available for the cluster resource manager:

• Spark Standalone — Spark (as the master node) and workers are installed and executed as standalone applications (which obviously introduces some overhead and supports only static resource allocation per worker).

• YARN — Works very good if you already have Hadoop.

• Mesos — From the beginning, Mesos was designed for dynamic allocation of cluster resources, not only for running Hadoop applications but for handling heterogeneous workloads.

Mesos architecture

The M in SMACK stands for the Mesos architecture. A Mesos cluster consists of master nodes which are responsible for resource offerings and scheduling, and slave nodes which do the actual heavy lifting in the task execution.

643a8faae07ce3b18aeb6c35643d76c429f5a258

In HA mode with multiple master nodes, ZooKeeper is used for leader election and service discovery. Applications executed on Mesos are called frameworks and utilize APIs to handle resource offers and submit tasks to Mesos. Generally the task execution process consists of the following steps:

1. Slave nodes provide available resources to the master node.

2. The master node sends resource offers to frameworks.

3. The scheduler replies with tasks and resources needed per task.

4. The master node sends tasks to slave nodes.

Bringing Spark, Mesos and Cassandra together

As said before Spark workers should be collocated with Cassandra nodes to enforce data locality awareness, thus lowering the amount of network traffic and Cassandra cluster load. Here's one of the possible deployment scenarios on how to achieve this with Mesos:

820a5f8d606ae2e36d5b4a76ed1278159642fef4

• Mesos master nodes and ZooKeepers are collocated.

• Mesos slave nodes and Cassandra nodes are collocated to enforce better data locality for Spark.

• Spark binaries are deployed to all worker nodes and spark-env.sh is configured with proper master endpoints and executor JAR location.

• The Spark executor JAR is uploaded to S3/HDFS.

With provided setup, the Spark job can be submitted to the cluster with simple spark-submit invocation from any worker nodes having Spark binaries installed and assembly JAR containing actual job logic uploaded.

435a903aae7cbd220f330fd9160650efb449d726

There exist options to run Dockerized Spark, so that there's no need to distribute binaries to every single cluster node.

Scheduled and Long-running Task Execution

Every data processing system sooner or later faces the necessity of running two types of jobs: scheduled/periodic jobs like periodic batch aggregations and long-running ones which are the case for stream processing. The main requirement for both of these types is fault tolerance - jobs must continue running even in case of cluster nodes failures. Mesos ecosystem comes with two outstanding frameworks supporting each of this types of jobs.

Marathon is a framework for fault-tolerant execution of long-running tasks supporting HA mode with ZooKeeper, able to run Docker and having a nice REST API. Here's an example of using the shell command to run spark-submit for simple job configuration:

849b04016088679d022812aab896f02f51457e5a

Chronos has the same characteristics as Marathon but is designed for running scheduled jobs and in general it is distributed HA cron supporting graphs of jobs. Here's an example of S3 compaction job configuration which is implemented as a simple bashscript:

9e940df386d7e561e027897631b4475ba61736e6

There are plenty of frameworks already available or under active development (such as Hadoop, Cassandra, Kafka, Myriad, Storm and Samza) which are targeted to integrate widely used systems with Mesos resource management capabilities.

Ingesting the Data

Up till now we have the storage layer designed, resource management set up and jobs are configured. The only thing which is not there yet is the data to process:

595497564e08c793c4615b7e6f9ea003f037b87b

Assuming that incoming data will arrive at high rates, the endpoints which will receive it should meet the following requirements:

• Provide high throughput/low latency

• Be resilient

• Allow easy scalability

• Support back pressure

On second thoughts, back pressure is not a must, but it would be nice to have this as an option to handle load spikes. Akka perfectly fits the requirements and basically it was designed to provide this feature set. Here is a quick run-down of the benefits you can expect to get from Akka:

• Actor model implementation for JVM

• Message-based and asynchronous

• Enforcement of the non-shared mutable state

• Easily scalable from one process to cluster of machines

• Actors form hierarchies with parental supervision

• Not only concurrency framework: akka-http, akka-streams, and akka-persistence

Here's a simplified example of three actors which handle JSON HttpRequest, parse it into the domain model case class, and save it to Cassandra:

41933f4932b7baf3656f5f3dcda8000871b94b38

It looks like only several lines of code are needed to make everything work, but while writing raw data (events) to Cassandra with Akka, the following problems may be caused:

• Cassandra is still designed for fast serving but not batch processing, so pre-aggregation of incoming data is needed

• Computation time of aggregations/rollups will grow with amount of data

• Actors are not suitable for performing aggregation due to stateless design model

• Micro-batches could partially solve the problem

• Some sort of reliable buffer for raw data is still needed

Kafka acts as a buffer for incoming data

For keeping incoming data with some retention and its further pre-aggregation/processing, some sort of distributed commit log could be used. In this case, consumers will read data in batches, process it and store it into Cassandra in form of pre-aggregates.

cb09d487e7b3869d05c474bbdc1e918aa59b19be

Here's an example of publishing JSON data through HTTP to Kafka with akka-http:

99896ac99f0ba669860e4b572097009ef6a4aef7

Consuming the data: Spark Streaming

While Akka still could be used for consuming stream data from Kafka, having Spark in your ecosystem brings Spark Streaming as an option to solve the following problems:

• It supports a variety of data sources

• It provides at-least-once semantics

• Exactly-once semantics available with Kafka Direct and idempotent storage

9a0aca419d6d52457f66a258c98a99318ffc0c9e

Here's an example of consuming event stream from Kinesis with Spark Streaming:

62c133e0013eeb784ff85dd7bec242ecf25a6fd5

Designing for failure: backups and patching

Usually this is the most boring part of any system, but it's really important to protect data from loss in every possible way when the datacenter is unavailable or analysis is performed on the datacenter breakdowns.

So why not store the data in Kafka/Kinesis?

a5899016f026b97d717be9f1d44ca2b0fd0961db

At the moment of writing this article, Kinesis is the only one solution that can retain data without backups when all processing results have been lost. While Kafka supports a long retention period, cost of hardware ownership should be considered because for example S3 storage is much cheaper than multiple instances running Kafka and S3 SLA are really good.

Apart from having backups, the restoring/patching strategies should be designed upfront and tested so that any problems with data could be quickly fixed. Programmers' mistakes in aggregation job or duplicated data deletion may break the accuracy of the computation results. Therefore, it is very important to have the capability of fixing such errors. One thing to make all these operations easier is to enforce idempotency in the data model so that multiple repetition of the same operations produce the same results (for example, SQL update is an idempotent operation while counter increment is not).

Here is an example of Spark job which reads S3 backup and loads it into Cassandra:

1540c0261fb90bbd7df651401e56dcb053d1d2c3

SMACK: The Big Picture

This concludes our broad description of SMACK. To allow you to better visualize the design of a data platform built with SMACK, here’s a visual depiction of the architecture:

da1004a659d0a4d19c13b973b0c908e89ea5fb6f

In the above article we talked about some of the basic functions of using SMACK. To finish with here is a quick rundown of it’s main advantages:

• Concise toolbox for wide variety of data processing scenarios

• Battle-tested and widely used software with large support communities

• Easy scalability and replication of data while preserving low latencies

• Unified cluster management for heterogeneous loads

• Single platform for any kind of applications

• Implementation platform for different architecture designs (batch, streaming, Lambda, or Kappa)

• Really fast time-to-market (for example, for MVP verification)

Thanks for reading. If anyone has experience developing applications using SMACK. Please leave some comments.

Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka

Processing layer: Spark

Almost MapReduce: bringing processing closer to data

There are plenty of frameworks already available or under active development (such as Hadoop, Cassandra, Kafka, Myriad, Storm and Samza) which are targeted to integrate widely used systems with Mesos resource management capabilities.

Ingesting the Data

Kafka acts as a buffer for incoming data

For keeping incoming data with some retention and its further pre-aggregation/processing, some sort of distributed commit log could be used. In this case, consumers will read data in batches, process it and store it into Cassandra in form of pre-aggregates.

Consuming the data: Spark Streaming

Designing for failure: backups and patching

热门文章

最新文章

相关课程

相关电子书