Optimizing Performance for Big-data-based Global E-Commerce Systems

Introduction: The steps to optimize the cloud stack of an e-commerce site also touch on dynamic acceleration, static + ESI, merged element requests, and CDN scheduling.


This article summarizes a presentation given by AliExpress CTO Guo Dongbai at the first Alibaba Online Technology Summit. In the presentation, he introduces the theory behind the architecture of the e-commerce platform AliExpress and explains how to optimize the full cloud stack based on the bounce rates calculated between pages. His steps for optimizing the cloud stack of an e-commerce site also touch on dynamic acceleration, static + ESI, merged element requests, and CDN scheduling optimization.

Theoretical basis of the entire system


The figure above shows the relationship between traffic distribution and bounce rate. The horizontal axis represents the latency interval, and the vertical axis represents the traffic distribution. The green curve represents the distribution of traffic from customers arriving at the website or application.

As can be seen, most of the traffic falls within the range of several hundred to 1,000 milliseconds. As latency increases, the bounce rate also increases. A basic conversion formula is used to evaluate the entire system: conversion rate = 1 - bounce rate. When performance faults occur, such as latency spikes on a few servers, the share of low-latency traffic decreases. As the share of high-latency traffic increases, the bounce rate climbs quickly. In other words, the longer the latency, the higher the bounce rate, which in turn leads to a lower conversion rate. The dotted line shows the level before optimization, and the solid line shows the level after optimization. The difference between them is the conversion rate gained through optimization.
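The formula can be illustrated with a small calculation. The following is only a minimal sketch: the latency buckets, traffic shares, and per-bucket bounce rates are made-up numbers, not figures from the presentation. It applies conversion rate = 1 - bounce rate to each latency bucket and weights the result by that bucket's share of traffic.

```python
def overall_conversion(traffic_share, bounce_rate):
    """Traffic-weighted conversion rate, using conversion = 1 - bounce per bucket."""
    total = sum(traffic_share)
    converted = sum(share * (1.0 - bounce)
                    for share, bounce in zip(traffic_share, bounce_rate))
    return converted / total

# Hypothetical latency buckets: <0.5 s, 0.5-1 s, 1-2 s, >2 s
bounce_rate = [0.20, 0.30, 0.50, 0.70]  # assumed: bounce rate rises with latency
before = [0.20, 0.40, 0.30, 0.10]       # assumed traffic share before optimization
after = [0.40, 0.40, 0.15, 0.05]        # assumed traffic share after optimization

# Gain in conversion rate from shifting traffic toward faster buckets
gain = overall_conversion(after, bounce_rate) - overall_conversion(before, bounce_rate)
print(f"conversion gain: {gain:.3f}")
```

Only the traffic distribution changes between the two calls; the bounce rate per bucket stays fixed, which matches the idea that optimization moves traffic into lower-latency intervals.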


Let's dig into this topic further. In the figure above, the red line represents the conversion rate, and the blue bars represent the distribution of traffic across performance intervals. Suppose we compress the latency from 3 seconds to 1 second: the return in conversion rate is the green part in the figure. How is the return achieved? The higher the latency, the lower the conversion rate. Because the conversion rate is a monotonically decreasing function of latency, moving traffic to a lower latency moves it to a higher conversion rate, and the difference is the green part. If we know how much latency is compressed, we can estimate the return on a single page. This recoverable return is called the performance loss.

Next, we will calculate the theoretical bounce rate from page A to page B. As shown in the figure above, A is a page, 0, 1, 2 and 3 are its preceding pages, and End represents the exit (bounce). The exit bounce rate = total bounce rate after compensation - natural bounce rate. In this formula, the natural bounce rate refers to bounces caused by users' dissatisfaction with the product content or reviews, rather than by performance.

Based on this assumption, we can extend the calculation to all pages. A large website may have hundreds of pages. In the example above, there are two chains: Search - Details - Order and Shop - Shop details - Order.

If we calculate the bounce rate for each chain in such a relationship, we obtain the theoretical maximum traffic for each chain and the maximum traffic for the final page. This is in fact a calculation of the full-stack performance loss. It tells us how much every small step contributes to the whole picture, which is vital when preparing an optimization solution. While preparing a solution, we can test it on a single page to measure the return on that page accurately. This makes clear what one optimization contributes to the whole system, that is, the additional orders it brings to the e-commerce system.
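One possible reading of this chain calculation is sketched below. It is an interpretation rather than the presenter's exact method: it assumes that the theoretical maximum traffic on the final page is what would remain if only the natural bounce rate applied at each step, and all visit counts and bounce rates are invented for illustration.

```python
from math import prod

def final_traffic(entry_visits, bounce_rates):
    """Visits that reach the last page after a bounce at each step of the chain."""
    return entry_visits * prod(1.0 - b for b in bounce_rates)

def chain_performance_loss(entry_visits, total_bounce, natural_bounce):
    """Final-page visits lost to performance: theoretical maximum minus actual."""
    actual = final_traffic(entry_visits, total_bounce)
    theoretical_max = final_traffic(entry_visits, natural_bounce)
    return theoretical_max - actual

# Hypothetical numbers: (entry visits, total bounce per step, natural bounce per step)
chains = {
    "Search - Details - Order": (10_000, [0.40, 0.60], [0.30, 0.50]),
    "Shop - Shop details - Order": (5_000, [0.50, 0.60], [0.40, 0.50]),
}

losses = {name: chain_performance_loss(*args) for name, args in chains.items()}
for name, loss in losses.items():
    print(f"{name}: about {loss:.0f} final-page visits lost to performance")
```

Summing the per-chain losses gives a rough full-stack figure, and comparing chains shows where an optimization is likely to return the most orders.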

Platform design

The platform is independent of any business field. Different business fields have their own tailored optimization solutions, as shown in the bottom-left section of the figure. The platform aims to visualize the current performance, measure the current performance level in real time, monitor the performance of the whole chain, stay available to all business fields, and perform full-scale measurement. In effect, the platform converts the returns of performance optimization into a measurable and intuitive result. Data keeps accumulating throughout an experiment, and this data yields accurate measurements that determine whether an optimization can be rolled out to the full stack.

Performance optimization process

First, there are many business units from which users' behavioral data is collected. The data, such as request time and connection time, is processed by the collection system. The processed data then goes through one of two kinds of analysis: offline analysis or real-time analysis. The difference between the two is that offline analysis can process more data.
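As a rough illustration of what the analysis step might compute from collected request times, the sketch below groups latency samples into fixed-width buckets to produce a traffic distribution like the one discussed earlier. The function name, the 500 ms bucket width, and the sample values are all assumptions made for this example.

```python
from collections import Counter

def latency_distribution(latencies_ms, bucket_ms=500):
    """Share of requests in each latency bucket, keyed by the bucket's lower bound."""
    if not latencies_ms:
        return {}
    counts = Counter((int(ms) // bucket_ms) * bucket_ms for ms in latencies_ms)
    total = len(latencies_ms)
    return {start: counts[start] / total for start in sorted(counts)}

# Hypothetical request times in milliseconds from the collection system
samples = [120, 480, 510, 900, 950, 1300, 2600]
dist = latency_distribution(samples)
for start, share in dist.items():
    print(f"{start}-{start + 500} ms: {share:.0%}")
```

The same aggregation could run over a short window for real-time analysis or over a full history for offline analysis.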

The analysis results are saved in a business database and eventually sent to a cache layer for tracking and year-over-year (YoY) analysis. The back end supports many different application scenarios, such as alerting, monitoring, and reporting.

Implement optimization

The industry offers many optimization solutions for the DNS layer, network layer, and CDN layer. Which of them are the most effective? Below are several effective optimization solutions I have used:

Dynamic acceleration

Before optimization, all of the users' dynamic requests are served by the source site, and the request chain is: User - Operator - Source site. Every request travels to the source site for data, which makes the request chain very long and time-consuming.

After optimization, dynamic data is pushed to the edge nodes as much as possible, so requests can be served directly at the edge nodes without going back to the source site.

There is a further optimization here: requests can be synchronous or asynchronous, and elements on multiple pages can be requested at the same time. Speeding up the back-to-source process in this way is what dynamic acceleration of content means. When going back to the source is unavoidable, another way to achieve dynamic acceleration is to let the CDN choose the optimal back-to-source network route. Dynamic acceleration actually involves a series of optimization methods, including content compression, for example. The whole process also poses many technical challenges: the e-commerce provider needs to know the real IP address of each user, the source site needs to intercept requests to guard against attacks, and so on.

Static + ESI

Content is placed on the edge nodes and also cached in the data center. A request for dynamic content goes back to the source database, whereas a request for static content that misses goes back to the source database through the business logic. Requests follow either a read chain or a write chain, and a write changes the database. Database change messages are consumed by consumers, which update the cache through the business logic, so the information in the cache is always the latest. That is to say, if a user request hits the edge node, it returns from there. Otherwise, the request hits the cache layer and returns.
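The read and write chains described above can be sketched roughly as follows. This is a simplified illustration and not AliExpress's implementation: plain dictionaries stand in for the edge node, the cache layer, and the source database, and the class and method names are hypothetical.

```python
class TieredStore:
    """Toy model: edge node, then cache layer, then source database."""

    def __init__(self, source_db):
        self.edge = {}
        self.cache = {}
        self.source_db = source_db

    def read(self, key):
        """Read chain: return (value, layer that answered the request)."""
        if key in self.edge:
            return self.edge[key], "edge"
        if key in self.cache:
            self.edge[key] = self.cache[key]  # fill the edge for later requests
            return self.cache[key], "cache"
        value = self.source_db[key]           # miss: go back to the source
        self.cache[key] = value
        self.edge[key] = value
        return value, "source"

    def write(self, key, value):
        """Write chain: change the database, then consume the change message."""
        self.source_db[key] = value
        self._consume_change(key, value)

    def _consume_change(self, key, value):
        # Assumed behavior: refresh the cache and drop the stale edge copy.
        self.cache[key] = value
        self.edge.pop(key, None)

store = TieredStore({"item:1": "v1"})
print(store.read("item:1"))  # first read goes to the source
print(store.read("item:1"))  # second read is answered at the edge
store.write("item:1", "v2")
print(store.read("item:1"))  # after the write, the refreshed cache answers
```

Dropping the edge copy on a change is one design choice among several; a real CDN would more likely rely on purge requests or expiry times.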

Merge element requests

A page contains many child elements. If each of them is requested independently, every request is a back-to-source call, and every request takes a long time (including the time needed to establish a TCP connection). Merging element requests means combining all of these requests into one unified request to the server side. The server side then dispatches the individual requests, merges the results, and returns them in a single response.
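A server-side handler for such a merged request might look roughly like the sketch below. The element names, the handler functions, and the response shape are all hypothetical; the point is only that one incoming request fans out to several element handlers and comes back as one merged result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-element handlers; real ones would call back-end services.
def get_price(item_id):
    return {"amount": 9.99, "currency": "USD"}

def get_stock(item_id):
    return {"available": 42}

def get_reviews(item_id):
    return {"count": 3}

HANDLERS = {"price": get_price, "stock": get_stock, "reviews": get_reviews}

def handle_merged_request(item_id, elements):
    """Dispatch every requested element concurrently and merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(HANDLERS[name], item_id) for name in elements}
        return {name: future.result() for name, future in futures.items()}

# One request from the page replaces three separate back-to-source calls.
response = handle_merged_request("12345", ["price", "stock", "reviews"])
print(response)
```

The client pays for one connection and one round trip, while the server side can still resolve the elements in parallel.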

CDN scheduling optimization

AliExpress has many users in countries around the world. It is the second largest e-commerce website in Brazil. Brazilian users can request edge nodes in the US or edge nodes in Brazil. Tests show that a request from Brazilian users to edge nodes in Brazil takes, for example, 4 units of time, while a request to edge nodes in the US takes 5 units. A request to nodes in Argentina, which is geographically close to Brazil, takes 7 units. We can therefore conclude that the time a request to a node takes cannot be estimated from geographic location alone, and CDN scheduling can be optimized on this basis.
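Using the illustrative time units from the example above, scheduling by measured latency rather than by geographic proximity can be sketched as follows. The function name and data layout are assumptions for this example only.

```python
def pick_edge_node(measured_latency):
    """Choose the node with the lowest measured latency, ignoring geography."""
    return min(measured_latency, key=measured_latency.get)

# Time units for requests from Brazilian users, taken from the example in the text
latency_from_brazil = {"Brazil": 4, "US": 5, "Argentina": 7}

best = pick_edge_node(latency_from_brazil)
ranked = sorted(latency_from_brazil, key=latency_from_brazil.get)
print(best, ranked)
```

Ranking by measurement places the US node ahead of the geographically closer Argentine node, which is the ordering a distance-based scheduler would get wrong.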

Business results

Through the above theoretical analysis and optimizations, you can improve the analysis capability of your entire system. These optimizations have increased orders on the AliExpress website by 5.07% and brought a significant drop in performance loss.
