Alibaba Cloud E-MapReduce vs. AWS EMR vs. Azure HDInsight

Summary: Big Data is among the biggest IT trends of the last five years. Maintaining an infrastructure large enough for analytics is one of its major challenges.

Big Data is among the biggest IT trends of the last five years. The idea behind this trend is that given a sufficiently large volume of data, it is possible to derive crucial business insights that could not be discovered through other methods.

Of course, running Big Data analytics at scale requires a significant amount of processing power, as well as tools for parsing the data. While you can do this on premises, it is not always feasible to set up and maintain a sufficiently large infrastructure. That is why several of the leading cloud service providers have begun offering services that process large volumes of data on Hadoop clusters or similar platforms. These providers include Amazon Web Services (AWS), Microsoft Azure, and Alibaba Cloud.

This article compares the major cloud-based Big Data platforms offered by each of these public cloud providers.

Hadoop Basics

For those who might not be familiar, Hadoop is an open source framework that is designed to run distributed applications across nodes in a cluster. One of the key differences between a Hadoop cluster and other types of clusters (such as a Microsoft failover cluster) is that Hadoop is designed specifically for use with data-intensive applications.

Hadoop clusters utilize a technique called MapReduce. MapReduce works by breaking a data analytics job into numerous fragments, which are then distributed across the cluster’s nodes for execution.
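To make the idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python. It is not Hadoop code: the function names are hypothetical, and a thread pool stands in for the cluster's nodes. It only illustrates how a word-count job can be broken into fragments, mapped in parallel, and then reduced to a final result.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_fragments(lines, num_fragments):
    """Divide the input lines into fragments, one per simulated node."""
    return [lines[i::num_fragments] for i in range(num_fragments)]


def map_fragment(fragment):
    """Map step: emit a (word, 1) pair for every word in the fragment."""
    return [(word.lower(), 1) for line in fragment for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle step: group the emitted values by key across all fragments."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_fragments=3):
    """Run the whole job; worker threads stand in for cluster nodes."""
    fragments = split_into_fragments(lines, num_fragments)
    with ThreadPoolExecutor(max_workers=num_fragments) as pool:
        mapped = list(pool.map(map_fragment, fragments))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = [
        "big data needs big clusters",
        "clusters process data in fragments",
    ]
    print(word_count(sample))
```

In a real Hadoop cluster, the framework handles the splitting, the shuffle, and the recovery from node failures; the job author typically supplies only the map and reduce logic.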

AWS EMR

Amazon Web Services’ solution for Big Data analytics is AWS EMR. Amazon EMR uses a series of EC2 virtual machine instances to form an Apache Hadoop cluster. A cluster can contain as many as 20 EC2 instances, and Amazon gives subscribers the option of creating multiple Hadoop clusters.

Amazon EMR is designed to work with Hive, Impala, Pig, HBase, and Kinesis Connector. The data that is to be analyzed by the AWS EMR service is stored in Amazon S3. S3 can be accessed programmatically through an API, but administrators also have the option of using any of the available third-party S3 clients to upload the data.

Azure HDInsight

Microsoft’s Big Data analytics solution within the Azure cloud is Azure HDInsight. HDInsight uses Azure virtual machine instances to create clusters for Hadoop, Spark, Hive, HBase, Storm, Kafka, and Microsoft R Server. The service is designed to work with a variety of development environments, including Microsoft’s Visual Studio and third-party tools such as Eclipse and IntelliJ. It also works with Jupyter and Zeppelin notebooks.

Azure HDInsight is designed to be as flexible as possible. When creating a cluster, administrators can choose from any of the Azure virtual machine types, thereby allowing the cluster capabilities to be closely matched to the requirements of the job, while helping to control costs. It is also worth noting that Microsoft provides a 99.9% SLA that extends end-to-end, across the entire workload (not just the VMs that form the cluster).

Azure HDInsight is designed to integrate with Azure Active Directory, and supports multi-factor authentication. The service is also compliant with the requirements of the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS).

Alibaba Cloud E-MapReduce

Alibaba Cloud’s data analytics solution is E-MapReduce. Like some competing solutions, Alibaba Cloud E-MapReduce is based around the use of a Hadoop cluster. However, Alibaba Cloud does not use a one-size-fits-all solution. Subscribers are able to choose the ECS model that they wish to use (CPU or Memory), and then the entire cluster is created within a matter of minutes. Once online, the Hadoop cluster is able to dynamically add nodes on an as-needed basis. Administrators also retain the ability to configure and tune the cluster as required.

Alibaba Cloud E-MapReduce is designed to work with the Apache Spark, MapReduce, and Apache Pig frameworks. The data that is being analyzed can be stored on Apache HDFS or HBase, and the cluster supports the use of tools such as Apache Sqoop and Spark SQL. Furthermore, workloads can be scheduled to run automatically. Alibaba Cloud E-MapReduce supports a variety of computational processes, including machine learning, process orchestration, stream processing, and even graph analytics.

Alibaba Cloud enforces data security through a role-based access control mechanism in which a primary account has the option of creating additional accounts, and granting specific service permissions to those accounts. Additionally, the data that is to be analyzed can be encrypted, and the ECS instances that make up the cluster can be protected with a firewall.
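The primary-account model described above can be pictured with a small Python sketch. This is a generic illustration of role-based access control, not Alibaba Cloud's actual API: the class names, method names, and the service and action strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubAccount:
    """An additional account created under a primary account (hypothetical model)."""
    name: str
    permissions: set = field(default_factory=set)


class PrimaryAccount:
    """Owns sub-accounts and decides which service permissions each one holds."""

    def __init__(self, name):
        self.name = name
        self.sub_accounts = {}

    def create_sub_account(self, name):
        """Create a sub-account that starts with no permissions at all."""
        account = SubAccount(name)
        self.sub_accounts[name] = account
        return account

    def grant(self, sub_name, service, action):
        """Grant one specific (service, action) permission to a sub-account."""
        self.sub_accounts[sub_name].permissions.add((service, action))

    def is_allowed(self, sub_name, service, action):
        """Deny by default: allow only permissions that were explicitly granted."""
        account = self.sub_accounts.get(sub_name)
        return account is not None and (service, action) in account.permissions


if __name__ == "__main__":
    owner = PrimaryAccount("analytics-owner")
    owner.create_sub_account("data-engineer")
    # The service and action names below are illustrative placeholders.
    owner.grant("data-engineer", "emr", "submit_job")
    print(owner.is_allowed("data-engineer", "emr", "submit_job"))
    print(owner.is_allowed("data-engineer", "emr", "delete_cluster"))
```

The point of the deny-by-default check is that a newly created account can do nothing until the primary account grants it a specific permission.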

Alibaba Cloud is currently offering a free trial of its cloud services. You can register for a free trial subscription at https://www.alibabacloud.com/campaign/free-trial#free-products. The trial includes a $300 credit that can be used to explore Alibaba Cloud’s various cloud offerings.

Conclusion

The success of an organization’s Big Data analytics initiative hinges on having the right computational tools for the job. A variety of public cloud service providers offer MapReduce services that can help with analytics. However, these services are not created equal. When selecting a cloud-based MapReduce service, it is important to consider platform compatibility and integration, security, and, of course, overall flexibility.

Author Bio:

Brien Posey is a Fixate IO contributor, and a 16-time Microsoft MVP with over two decades of IT experience. Prior to going freelance, Brien was CIO for a national chain of hospitals and healthcare facilities. He also served as lead network engineer for the United States Department of Defense at Fort Knox. Brien has also worked as a network administrator for some of the largest insurance companies in America. In addition to his continued work in IT, Brien has spent the last three years training as a Commercial Scientist-Astronaut Candidate for a mission to study polar mesospheric clouds from space. You can follow Posey’s spaceflight training at www.brienposey.com/space
