Author: Alibaba Group Senior Staff Engineer Ding Yu
Have you ever wondered what the underlying technology behind Alibaba Single's Day Shopping Festival (also known as 11-11) is like? With sales reaching over US$17.8 billion in 2016, Single's Day has become the largest online shopping day in the world!
Alibaba Cloud's infrastructure has evolved rapidly to cope with increasing demands from the entire Alibaba ecosystem, especially for Single's Day. From 2009 to 2016, peak transaction volume grew more than 400-fold!
Figure 1: Peak transaction volume on Single's Day from 2009 to 2016
Such a feat can only be achieved with a robust computing architecture, one capable not only of handling bursty traffic but also of recovering quickly from system faults. While sales revenue typically grows linearly with transaction volume, system complexity grows exponentially at such a large scale. What's more, deploying and maintaining such a complex system is labor-intensive and costly.
Designing a High Availability Infrastructure
As the Architect for Single's Day since 2009, I will share with you some of our key strategies in designing our infrastructure.
Although cloud computing has freed us from the geographical constraints of data centers, supporting an event such as Single's Day isn't as straightforward as simply adding more servers. We need to know precisely how much computing power we need to ensure high availability and reliability while keeping costs at a minimum.
Alibaba Cloud tackles this problem from multiple angles:
1. Comprehensive load testing on system architecture
2. System architecture fault simulation
3. Cross-region server deployment
4. Automated intelligent control
We will cover these four topics in further detail in the following sections.
Figure 2: Enterprise high availability design
Comprehensive Load Testing on System Architecture
Load testing is a standard part of performance testing for most systems. We simulate the traffic load of Single's Day and run it against our existing infrastructure, using traffic data collected from previous years together with forecasts that account for this year's growth. The purpose of load testing is not only to discover the maximum capacity of the system but also to determine which applications and services customers use most during this period.
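As a rough illustration of how a load-test target might be derived from previous years' peaks plus a growth forecast, consider the sketch below. The function name `projected_peak`, the averaging of year-over-year growth, the `headroom` multiplier, and all numbers are assumptions made for this example; they are not Alibaba's actual forecasting method or real traffic figures.

```python
def projected_peak(previous_peaks, headroom=1.2):
    """Estimate a load-test target from past yearly peaks (oldest first).

    headroom is an assumed safety multiplier applied on top of the forecast.
    """
    if len(previous_peaks) < 2:
        raise ValueError("need at least two years of data")
    # Year-over-year growth ratio for each consecutive pair of years.
    ratios = [later / earlier
              for earlier, later in zip(previous_peaks, previous_peaks[1:])]
    avg_growth = sum(ratios) / len(ratios)
    # Forecast this year's peak, then add headroom for the test target.
    forecast = previous_peaks[-1] * avg_growth
    return forecast * headroom


# Hypothetical peaks in requests per second, not real Single's Day data.
history = [1000, 2000, 4000]
target = projected_peak(history)
```

With this made-up history the peak doubles every year, so the forecast is 8,000 requests per second and the test target, including 20% headroom, is about 9,600.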
System Architecture Fault Simulation
Essentially, fault simulation is a form of stress testing on our system architecture. We intentionally disable certain services and subject the system to heavy loads. In particular, we look for single points of failure (SPOFs) in our architecture and eliminate them.
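The idea of disabling components to expose single points of failure can be sketched on a toy model. In the example below, the topology, the instance names, and the helper functions `path_is_served` and `find_spofs` are all hypothetical; a real fault simulation would disable live services under load rather than entries in a dictionary.

```python
def path_is_served(instances, path, disabled):
    """A request succeeds if every service on the path has a live instance."""
    return all(
        any(inst not in disabled for inst in instances[service])
        for service in path
    )


def find_spofs(instances, path):
    """Disable one instance at a time and collect those that break the path."""
    spofs = []
    for service in path:
        for inst in instances[service]:
            if not path_is_served(instances, path, disabled={inst}):
                spofs.append(inst)
    return spofs


# Hypothetical topology: service name -> list of instance names.
topology = {
    "gateway": ["gw-1", "gw-2"],
    "orders": ["orders-1", "orders-2"],
    "payment": ["pay-1"],  # only one instance, so losing it breaks checkout
}
checkout_path = ["gateway", "orders", "payment"]
spofs = find_spofs(topology, checkout_path)
```

Here `spofs` contains only `pay-1`, because the payment service is the only one without a redundant instance. Adding a second payment instance to the topology would make the list empty.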
Cross-Region Server Deployment
In most scenarios, servers run within a single region only. However, this approach may not be sufficient when faced with extreme loads during Single's Day. Therefore, we use cross-region deployment to expand capacity and improve service availability. We distribute users across different servers based on user ID, and employ an active-active configuration in our clusters to maintain high availability and achieve seamless service handover. In addition, data is backed up across multiple sites to enhance disaster recovery capabilities.
Figure 3: High availability multi-region cluster
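One possible way to split users by user ID across regions, with handover to another region when the home region is down, is sketched below. The hash-based assignment, the region names, and the functions `home_region` and `route` are assumptions for illustration; the article does not describe the actual partitioning scheme.

```python
import hashlib


def home_region(user_id, regions):
    """Map a user ID to a region deterministically using a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def route(user_id, regions, healthy):
    """Use the home region if healthy, else the next healthy region in order."""
    start = regions.index(home_region(user_id, regions))
    for offset in range(len(regions)):
        candidate = regions[(start + offset) % len(regions)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy region available")


# Hypothetical region names.
regions = ["region-a", "region-b", "region-c"]
home = route(42, regions, healthy=set(regions))
fallback = route(42, regions, healthy=set(regions) - {home})
```

Because every region is active and can serve any user, the same user always lands on the same home region while it is healthy and is handed over to another region only when it is not.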
Automated Intelligent Control
Even with all of the technologies discussed previously, it is almost impossible to control traffic flow and scale resources manually in a large system. That is why we use automated intelligent control, which focuses on traffic control and fault recovery.
Because we don't have access to unlimited resources, there is always a possibility of receiving more load than the system can handle. To address this, we prioritize users based on the type of request. For example, customers completing purchases should be prioritized over users who are only browsing a website. Once requests are prioritized, we place them in a queue and serve them in that order. We can also adjust the quality of service received by users based on this queueing system.
Figure 4: User traffic control
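The prioritized queue described above can be sketched with a binary heap. The request types, their priority values, and the `RequestQueue` class are assumptions for this example and do not reflect the production traffic control system.

```python
import heapq
import itertools

# Assumed priorities: a lower number is served first.
PRIORITY = {"purchase": 0, "browse": 1}


class RequestQueue:
    """Serve requests by priority, first come first served within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def submit(self, request_type, payload):
        entry = (PRIORITY[request_type], next(self._counter), payload)
        heapq.heappush(self._heap, entry)

    def next_request(self):
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self):
        return len(self._heap)


queue = RequestQueue()
queue.submit("browse", "user-1 views homepage")
queue.submit("purchase", "user-2 pays for order")
queue.submit("browse", "user-3 views item")
queue.submit("purchase", "user-4 pays for order")
served = [queue.next_request() for _ in range(len(queue))]
```

In this run both purchase requests are served before either browse request, even though a browse request arrived first. The counter keeps requests of equal priority in arrival order.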
As the number of devices increases, the probability of a device fault increases as well. When a server fails, our system detects the anomaly and reassigns the user to the next nearest server. This automatic approach significantly reduces delay, which in turn improves user experience and minimizes O&M costs. In addition, the system triggers alarms to notify our engineers about these faults, helping our team quickly locate and troubleshoot them.
Figure 5: Server fault recovery
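A minimal sketch of the reassignment step follows, assuming servers can be ranked by a simple one-dimensional distance from the user. The function `pick_server`, the server names, the positions, and the alarm message format are all hypothetical.

```python
def pick_server(user_position, servers, failed, alarms):
    """Return the nearest healthy server, recording an alarm for each failed one.

    servers maps a server name to its position on an assumed 1-D distance scale.
    """
    ranked = sorted(servers, key=lambda name: abs(servers[name] - user_position))
    for name in ranked:
        if name not in failed:
            return name
        # A nearer server was skipped because it is down: notify engineers.
        alarms.append(f"server {name} failed health check")
    raise RuntimeError("no healthy server available")


# Hypothetical servers and positions.
servers = {"s1": 0, "s2": 10, "s3": 25}
alarms = []
chosen = pick_server(2, servers, failed={"s1"}, alarms=alarms)
```

The user at position 2 is nearest to `s1`, but since `s1` has failed, the request goes to `s2` and one alarm is recorded for `s1`.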
Conclusion
As we can see, powering an event as large as Single's Day is no easy task. With proper planning and design, we can cope with even the most unexpected challenges for this event. We are confident that our evolved architecture can achieve a lot more for this year's Single's Day festival!
However, one question springs to mind: what do we do with all this computing power when the festival ends? For most of our systems, we adopt a hybrid cloud environment. With hybrid cloud, we can scale resources up as required and maintain a "lighter" system when the load is low (such as after the Single's Day festival ends). This way, we can minimize operating costs while maximizing our capacity.
In addition, we utilize Alibaba Cloud's core products as well as our family of distributed middleware. Currently, our distributed middleware offerings are limited to customers in Mainland China, but we hope to make them available to customers across the globe soon.
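To make the hybrid cloud scaling idea concrete, the sketch below computes how many instances to run for a given load while never dropping below a fixed baseline. The function `desired_instances`, the per-instance capacity, and the load figures are assumptions for illustration and are not taken from any Alibaba Cloud product.

```python
import math


def desired_instances(load_rps, capacity_per_instance, baseline):
    """Instances needed for the load, never fewer than the always-on baseline."""
    needed = math.ceil(load_rps / capacity_per_instance)
    return max(baseline, needed)


# Hypothetical numbers: each instance handles 500 requests per second.
during_festival = desired_instances(95000, 500, baseline=20)
after_festival = desired_instances(3000, 500, baseline=20)
```

Under these assumed numbers the system runs 190 instances at the festival peak and falls back to the 20-instance baseline once traffic returns to normal.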
If you want to learn more about the underlying technology for Alibaba Single's Day, please check out my presentation video at The Computing Conference 2017.
If you are interested in building your own infrastructure with Alibaba Cloud products, you should definitely check out our attractive offers on 11-11 Cloud Deals!
Core Products (available globally):
• Elastic Compute Service (ECS)
• Server Load Balancer (SLB)
• Auto Scaling
• ApsaraDB for RDS
• CDN
Distributed Middleware (currently only available in Mainland China):
• Distributed Relational Database Service (DRDS)
• Cloud Service Bus (CSB)
• Global Transaction Service (GTS)
• Application Real-Time Monitoring Service (ARMS)
• Message Queue (MQ)
• Enterprise Distributed Application Service (EDAS)