How Does Alibaba Cloud Power the Biggest Online Shopping Festival?

Author: Alibaba Group Senior Staff Engineer Ding Yu

Have you ever wondered what technology lies behind Alibaba's Single's Day Shopping Festival (also known as 11-11)? With sales exceeding US$17.8 billion in 2016, Single's Day has become the largest online shopping day in the world!

Alibaba Cloud's infrastructure has evolved rapidly to cope with increasing demands from the entire Alibaba ecosystem, especially for Single's Day. Between 2009 and 2016, peak transaction volume grew more than 400-fold!


Figure 1: Peak transaction volume on Single's Day from 2009 to 2016

Such a feat can only be achieved with a robust computing architecture, one that can both handle bursty traffic and recover quickly from system faults. While sales revenue typically grows linearly with transaction volume, system complexity grows exponentially at such a large scale. What's more, deploying and maintaining such a complex system is labor-intensive and costly.

Designing a High Availability Infrastructure

As the Architect for Single's Day since 2009, I will share with you some of our key strategies in designing our infrastructure.

Although cloud computing has freed us from the geographical constraints of data centers, supporting an event such as Single's Day isn't as straightforward as simply adding more servers. We need to know precisely how much computing power we need to ensure high availability and reliability while keeping costs at a minimum.

Alibaba Cloud tackles this problem from multiple angles:
1. Comprehensive load testing on system architecture
2. System architecture fault simulation
3. Cross-region server deployment
4. Automated intelligent control

We will cover these four topics in further detail in the following sections.


Figure 2: Enterprise high availability design

Comprehensive Load Testing on System Architecture

Load testing is a standard part of performance testing in most systems. We simulate the traffic load of Single's Day and run it against our existing infrastructure, using traffic data collected in previous years together with predicted data that accounts for this year's growth. The purpose of load testing is not only to discover the maximum capacity of the system but also to determine which applications and services customers use most during this period.
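
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Alibaba's actual tooling: the function names (projected_peak, ramp_test, max_sustained), the growth model (repeat last year's growth ratio plus a safety margin), and the simulated system with a fixed capacity are all hypothetical.

```python
from dataclasses import dataclass


def projected_peak(previous_peaks: list[float], safety_margin: float = 0.2) -> float:
    """Estimate this year's peak by repeating the latest year-over-year
    growth ratio and adding a safety margin (an assumed, simplistic model)."""
    if len(previous_peaks) < 2:
        raise ValueError("need at least two years of history")
    growth = previous_peaks[-1] / previous_peaks[-2]
    return previous_peaks[-1] * growth * (1 + safety_margin)


@dataclass
class StepResult:
    offered_rps: float  # load sent to the system in this step
    served_rps: float   # load the system actually handled
    saturated: bool     # True once the system can no longer keep up


def ramp_test(system_capacity_rps: float, target_rps: float, steps: int = 10) -> list[StepResult]:
    """Ramp the offered load toward the target in equal steps.

    The system under test is simulated: it serves at most
    system_capacity_rps requests per second.
    """
    results = []
    for i in range(1, steps + 1):
        offered = target_rps * i / steps
        served = min(offered, system_capacity_rps)
        results.append(StepResult(offered, served, served < offered))
    return results


def max_sustained(results: list[StepResult]) -> float:
    """Highest offered load that was served without saturation."""
    ok = [r.offered_rps for r in results if not r.saturated]
    return max(ok) if ok else 0.0
```

In a real test the simulated system would be replaced by traffic replayed against the production-like environment; the ramp-and-observe structure is the part that carries over.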

System Architecture Fault Simulation

Essentially, fault simulation is a form of stress testing on our system architecture. We intentionally disable certain services and subject the system to heavy loads. In particular, we look for single points of failure (SPOFs) in our architecture and eliminate them.
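
A minimal sketch of the SPOF hunt, assuming the deployment can be described as a map from each service to the set of hosts running it. The function name simulate_host_failures and the host and service names in the usage example are hypothetical; real fault injection acts on live systems rather than on a dictionary.

```python
def simulate_host_failures(deployment: dict[str, set[str]]) -> dict[str, list[str]]:
    """Fail each host in turn and report which services would be left
    with no running replica.

    deployment maps a service name to the set of hosts that run it.
    The result maps each single point of failure (a host) to the
    services that would go down with it.
    """
    hosts = set().union(*deployment.values())
    impact = {}
    for host in sorted(hosts):
        # Simulated fault: remove this host from every service's replica set.
        lost = sorted(
            service
            for service, replicas in deployment.items()
            if not (replicas - {host})
        )
        if lost:
            impact[host] = lost
    return impact
```

For example, with deployment = {"cart": {"h1", "h2"}, "payment": {"h3"}, "search": {"h1", "h4"}}, the function reports {"h3": ["payment"]}: the payment service depends on a single host and needs another replica.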

Cross-Region Server Deployment

In most scenarios, servers run within a single region only. However, this approach may not be sufficient under the extreme loads of Single's Day. Therefore, we use cross-region deployment to expand capacity and improve service availability. We partition users across servers based on user ID, and employ an active-active configuration in our clusters to maintain high availability and achieve seamless service handover. In addition, data is backed up across multiple sites to enhance disaster recovery capabilities.


Figure 3: High availability multi-region cluster
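
As an illustration of user-ID-based partitioning with active-active failover, the sketch below hashes each user ID to a home region and falls back to the next healthy region when the home region is down. The region names, the choice of CRC32 as the hash, and the round-robin fallback order are assumptions made for the example; the article does not describe the actual routing rule.

```python
import zlib

# Hypothetical region names, for illustration only.
REGIONS = ["region-a", "region-b", "region-c"]


def home_index(user_id: str, region_count: int) -> int:
    """Map a user ID to a stable region index.

    zlib.crc32 is used because it gives the same value in every process,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(user_id.encode("utf-8")) % region_count


def home_region(user_id: str, regions: list[str] = REGIONS) -> str:
    """Region that serves this user when everything is healthy."""
    return regions[home_index(user_id, len(regions))]


def route(user_id: str, healthy: set[str], regions: list[str] = REGIONS) -> str:
    """Return the user's home region, or the next healthy region if the
    home region is unavailable (all regions are active, so any of them
    can take over)."""
    start = home_index(user_id, len(regions))
    for offset in range(len(regions)):
        candidate = regions[(start + offset) % len(regions)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy region available")
```

Because the mapping depends only on the user ID, every request from the same user lands in the same region while that region is healthy, which keeps that user's data access local.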

Automated Intelligent Control

Even with all of the technologies discussed previously, it is almost impossible to control traffic flow and scale resources manually in a large system. That is why we use automated intelligent control, which focuses on traffic control and fault recovery.

Because we don't have access to unlimited resources, there is always a possibility of receiving more load than the system can handle. To address this, we prioritize users based on the type of request. For example, customers completing purchases should be prioritized over users who are only browsing. Once requests are prioritized, we place them in a queue and serve them in queue order. We can also adjust the quality of service that users receive based on this queueing system.


Figure 4: User traffic control
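
The queueing idea can be sketched with a bounded priority queue. The request types and their priority values, the class name TrafficController, and the rule of shedding the lowest-priority request when the queue is full are assumptions made for this example rather than a description of the production system.

```python
import heapq
import itertools

# Hypothetical request types; a lower number means a higher priority.
PRIORITY = {"checkout": 0, "add_to_cart": 1, "browse": 2}


class TrafficController:
    """Bounded priority queue: serves high-priority requests first and
    sheds the lowest-priority request when the queue is full."""

    def __init__(self, max_queue: int):
        self.max_queue = max_queue
        self.shed = []                 # (user, request_type) pairs that were dropped
        self._heap = []                # entries: (priority, sequence, user, request_type)
        self._seq = itertools.count()  # tie-breaker: first come, first served

    def submit(self, user: str, request_type: str) -> bool:
        """Queue a request. Returns False if the request itself is shed."""
        entry = (PRIORITY[request_type], next(self._seq), user, request_type)
        if len(self._heap) < self.max_queue:
            heapq.heappush(self._heap, entry)
            return True
        worst = max(self._heap)
        if entry < worst:
            # The new request outranks the worst queued one: evict that one.
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            heapq.heappush(self._heap, entry)
            self.shed.append((worst[2], worst[3]))
            return True
        self.shed.append((user, request_type))
        return False

    def next_request(self):
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, user, request_type = heapq.heappop(self._heap)
        return user, request_type
```

A shed request does not have to be rejected outright; it could instead receive a degraded response such as a cached page, which is one way to vary the quality of service by priority.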

As the number of devices increases, the probability that some device will fail increases as well. When a server fails, our system detects the anomaly and reassigns affected users to the next nearest server. This automatic approach significantly reduces delay, which in turn improves user experience and minimizes O&M costs. In addition, the system triggers alarms to notify our engineers about these faults, helping the team quickly locate and troubleshoot them.


Figure 5: Server fault recovery
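
A minimal sketch of this recovery loop, assuming each server has a single numeric position that stands in for network distance. The Server and Cluster classes, the position field, and the alarm format are all hypothetical; real failure detection relies on health checks and monitoring rather than an explicit mark_failed call.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    position: float       # hypothetical one-dimensional "network distance" coordinate
    healthy: bool = True


@dataclass
class Cluster:
    servers: list[Server]
    assignments: dict[str, str] = field(default_factory=dict)  # user -> server name
    alarms: list[str] = field(default_factory=list)

    def nearest_healthy(self, position: float) -> Server:
        """Healthy server closest to the given position."""
        candidates = [s for s in self.servers if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers left")
        return min(candidates, key=lambda s: abs(s.position - position))

    def mark_failed(self, name: str) -> None:
        """Record a failure, raise an alarm, and move affected users."""
        failed = next(s for s in self.servers if s.name == name)
        failed.healthy = False
        self.alarms.append(f"ALERT: {name} failed health check")
        for user, server_name in list(self.assignments.items()):
            if server_name == name:
                replacement = self.nearest_healthy(failed.position)
                self.assignments[user] = replacement.name
```

Users on healthy servers are left untouched, so a single failure only disturbs the sessions that were actually affected.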


As we can see, powering an event as large as Single's Day is no easy task. With proper planning and design, we can cope with even the most unexpected challenges of this event. We are confident that our evolved architecture can achieve a lot more for this year's Single's Day festival!

However, one question springs to mind: what do we do with all this computing power when the festival ends? For most of our systems, we adopt a hybrid cloud environment. With hybrid cloud, we can scale resources up as required and still maintain a "lighter" system when the load is low (such as after the Single's Day festival ends). This way, we can minimize operating costs while maximizing our capacity.
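
The scale-up and scale-down decision can be illustrated with a small calculation. The sketch assumes a fixed baseline of always-on servers plus cloud servers that are added only when the load exceeds what the baseline can absorb. The function name plan_capacity, its parameters, and the target utilization value are hypothetical and not taken from the article.

```python
import math


def plan_capacity(
    load_rps: float,
    per_server_rps: float,
    baseline_servers: int,
    target_utilization: float = 0.7,
) -> dict[str, int]:
    """Split the required capacity between a fixed baseline and
    on-demand cloud servers.

    Each server is planned to run at target_utilization of its rated
    throughput, which leaves headroom for bursts.
    """
    usable_rps = per_server_rps * target_utilization
    needed = math.ceil(load_rps / usable_rps)
    return {
        "needed": needed,
        "baseline": baseline_servers,
        # Cloud servers are rented only for the part the baseline cannot cover.
        "cloud": max(0, needed - baseline_servers),
    }
```

With a baseline of 20 servers rated at 100 requests per second each and a target utilization of 0.5, a load of 1,000 requests per second needs no cloud servers, while a peak of 10,000 requests per second calls for 180 of them, which can be released again once the peak has passed.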

In addition, we use Alibaba Cloud's core products as well as our family of distributed middleware. Currently, our distributed middleware offerings are available only to customers in Mainland China, but we hope to make them available to customers across the globe soon.

If you want to learn more about the underlying technology for Alibaba Single's Day, please check out my presentation video at The Computing Conference 2017.

If you are interested in building your own infrastructure with Alibaba Cloud products, you should definitely check out our attractive offers on 11-11 Cloud Deals!

Core Products (available globally):
• Elastic Compute Service (ECS)
• Server Load Balancer (SLB)
• Auto Scaling
• ApsaraDB for RDS

Distributed Middleware (currently only available in Mainland China):
• Distributed Relational Database Service (DRDS)
• Cloud Service Bus (CSB)
• Global Transaction Service (GTS)
• Application Real-Time Monitoring Service (ARMS)
• Message Queue (MQ)
• Enterprise Distributed Application Service (EDAS)

