Sharing, Storing, and Computing Massive Amounts of Data

本文涉及的产品
云数据库 RDS MySQL,集群系列 2核4GB
推荐场景:
搭建个人博客
对象存储 OSS,20GB 3个月
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
简介: Data is crucial to the operation of any business.

BD_001

Background

Data is crucial to the operation of any business. Businesses often collect large numbers of logs so that they can better understand their own services and the people who are using them.

As time goes by, the number and activity of users continuously increases, as does the speed at which data is generated, the accumulated volume of data, and the number of data dimensions and data types. As a result, we end up with more and more data islands.

We have reached a point where data islands are now heavy burden for a company's IT department, increasing costs without increasing earnings.

Storing and processing large amounts of data poses serious problems to businesses, such as:

  1. Serious issue of data islands (without a big data cloud platform)
  2. Inaccurate estimation of costs, wasting resources when resizing computational capacity
  3. Prominent issue of data redundancy
  4. High storage costs
  5. On-premise hardware is hard to scale and the maintenance cost is high
  6. Data size is too big to be backed up.
  7. Extensive business needs lead to diverse data types, and high analysis and development costs

This article discusses some potential solutions to address issues when dealing with massive amounts of data for specific industries.

Industries

1. Logistics

Packages generate a lot of tracking data from collection to shipment, transportation, transshipment, and distribution. Each time a package reaches a checkpoint, it will be scanned and its status will be recorded.

During the transportation process, the vehicle associated with the package is tracked, including the vehicle records, route, fuel consumption, vehicle status, and driver's status. During the distribution process, both the positional information of the courier and the distribution status of the package will be tracked, which also generates a large amount of data.

A transportation vehicle may generate dozens of thousands of tracking records in a day. A courier may also generate dozens of thousands of tracking records in a day. Even a small package may generate hundreds of tracking records!

One potential solution for the logistics industry is dynamic path planning. With dynamic path planning, we can obtain information on the nearby couriers in real time according to their positional information. This information can then be used to create an optimal least cost path for quick delivery. If you are interested, you can refer to Technologies used in Double 11 shopping festival - logistics and dynamic path planning (article in Chinese).

2. Finance

The financial industry is another generator of mass data. Financial industry data mainly includes data of user transactions, enterprise transactions, and securities. To make matters worse, most transactions in the financial industry requires real-time computing, and the use of a wide range of different statistical analysis functions.

Additionally, data collected in the financial industry must be kept secure. I have shared an analysis on the needs of a securities trading simulation system. If you are interested, refer to Analysis and application of PostgreSQL in databases in the securities industry (article in Chinese).

3. Internet of Things (IoT)

Data generated by IoT is sequential, and therefore requires StreamCompute (e.g. triggered upon reaching the threshold) and post-event analysis. As the data size is huge, it also needs to be compressed.

Another feature of IoT is that data reported by sensors always contains numerical ranges (e.g. a temperature range), geographic positions, images, etc. How can we effectively store and search such varied data types? I have also analyzed the features of databases of the IoT applications in a few articles (articles in Chinese).

Rise of StreamCompute - PostgreSQL joins hands with PipelineDB to support IoT

Implementation of the revolving door data compression algorithm in PostgreSQL - application of streaming compression in the IoT, monitoring, and sensor scenarios

The cutting-edge technology of PostgreSQL for IoT - creating an index (BRIN Index) a fraction of the normal size

A simple algorithm to help IoT and financial users cut data storage costs by 98%

Streaming application on IoT - real-time processing (trillions of data entries per day) with PostgreSQL

PostgreSQL cutting-edge technology - range type and Gist indexes boost development in IoT

4. Monitoring

Monitoring involves overseeing the business operations and IT infrastructure statuses, such as server status, network status, and storage status. Monitoring has been used considerably in all kinds of businesses, but monitoring needs and data types vary within industries.

For example, some industries may need positional monitoring, such as tracking the location of a bus and sending alerts when it goes beyond a geo fence, or when the driver displays unusual driving behaviors.

5. Public security

Public security data comes from various fields, including communication records and travel records.

Public security data can be incredibly large. A typical use case may be risk control or capturing criminal suspects. It relies heavily on analyzing chronological, geographic positional information (image searches).

How can these needs be met?

Industry pain points

The challenges of running businesses with massive amounts of data can be summarized with this list of questions:

  1. How can we solve the issue of data islands and open up data sharing channels?
  2. How can we efficiently write logs, behavioral tracks, financial data, and trajectory data?
  3. How can we quickly and efficiently process data, send threshold alerts, and conduct real-time analysis?
  4. How can we solve the disaster tolerance and backup problems faced by big data?
  5. How can we solve the problems of compression and efficiency for big data?
  6. How can we solve the problems imposed by multi-dimensional data, diversified data types, and complex computing?
  7. How can we help companies improve the scalability of their IT architectures?

Solutions

We will cover three components:

1.RDS PostgreSQL

RDS PostgreSQL supports chronological data, block range indexes, inverted indexes, multi-core parallel computing, JSON, array storage, OSS_FDW external read/write, and other features.

RDS PostgreSQL solves the problems of OLTP, GIS application, complex query, spatial data processing, multi-dimensional analysis, and cold/hot data separation.

2.HybridDB PostgreSQL

HybridDB PostgreSQL supports array storage, horizontal expansion, block compression, extensive data types, machine learning library, PLPYTHON, PLJAVA, and PLR programming, OSS_FDW external read and write, etc.

It completely solves the issue of computing massive amounts of data.

3.Object Storage Service (OSS)

Data can be shared among multiple RDS instances through OSS_FDW. OSS supports multi-copy, and cross regional replication.

OSS solves the problems of data islands, storing massive data, cross-data-center disaster tolerance, massive data backup, etc.

1. Writing

Data is written in three ways:

  1. Online real-time writing implemented through the RDS SQL interface, which allows a single instance to achieve a writing speed of more than 1 million lines/second.
  2. Batch quasi real-time writing implemented through the HybridDB SQL interface, which allows a single instance to achieve a writing speed of more than 1 million lines/second.
  3. Batch quasi real-time writing, e.g. writing to a file, implemented through the OSS interface, which supports bandwidth auto scaling.

2. Sharing

Data can be shared among multiple RDS instances through OSS_FDW.

Let's assume that we have two enterprises, and we have used two RDS database instances, but there is a part of the data that needs to be shared between them. The traditional method requires the use of ETL, but now we can efficiently achieve data sharing among multiple instances by using OSS_FDW.

The RDS PostgreSQL OSS_FDW parallel read/write function (to enable multiple worker processes to read and write to the same table file) provides a highly efficient read/write process for shared data.

The parallel feature is embodied in three features: OSS parallel read/write, RDS PostgreSQL multi-core parallel computing, multi-computer parallel computing of RDS PostgreSQL or HybridDB.

3. Storage

Local RDS PostgreSQL and HybridDB are used for real time data storage. When stored data needs to be analyzed or shared, we store it in OSS.

When we compare the amount of storage with the computing resources used, the cost of OSS is much lower, which reduces the company's IT costs without negatively influencing flexibility.

OSS solves the problems of enterprises on data redundancy and high costs, and meets the requirements on data backup and disaster tolerance.

4. Computing

By using RDS PostgreSQL, HybridDB, and OSS, we can separate our computing resources and storage resources.

Since we are able to store less data at the computing nodes (most of the data is stored in OSS), resizing, reducing, disaster tolerance, and backing up the computer nodes is much easier.

Computing methods may be divided into the following types

1.Stream computing

StreamCompute can be divided in two sub-types, real-time computing, and real-time threshold alerts.

Both can be achieved through pipelinedb (base on postgresql).

Benefits:

The standard SQL interface and a wide range of different built-in functions support complex StreamCompute demands. The diversified data types (including GIS, JSON, etc.) support even more StreamCompute business scenarios. This asynchronous message–based mechanism supports the needs of the second type of StreamCompute.

pipelinedb is currently under adjustment, and may be used as an extension of PostgreSQL in the future.

https://github.com/pipelinedb/pipelinedb/issues?q=is%3Aissue+is%3Aopen+label%3A%22extension+refactor%22

Taking the monitoring industry as an example, the asynchronous message–based mechanism of StreamCompute can effectively avoid the futility of traditional active inquiry monitoring methods.

2.Real-time interaction

We can use RDS PostgreSQL to meet traditional OLTP demands.

Features of PostgreSQL include: GIS, JSON, array, cold/hot data separation, horizontal database partitioning, K-V type, multi-core parallel computing, BRIN indexes, GIN indexes, etc.

Scenarios supported by PostgreSQL include: StreamCompute, image searches, chronological data, path planning, fuzzy searches, full-text searches, similarity searches, auction sniping, genetics, finance, chemistry, GIS application, complex searches, BI, multi-dimensional analysis, spatial data searches, etc.

It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.

3.Quasi real-time analysis

Used in combination with OSS, both RDS PostgreSQL and HybridDB can achieve quasi real-time analysis.

The same OSS data can also be shared and concurrently accessed by multiple instances.

4.Offline data analysis and data mining

Used in combination with OSS, both RDS PostgreSQL and HybridDB can be used to analyze and mine offline data.

RDS PostgreSQL supports single computer multi-core parallel computing, and HybridDB for PostgreSQL supports multi-computer multi-core parallel computing. Users can make the choice based on computing capacity.

Required computing capacities

The soul of computing is working with different data types.

  1. PostgreSQL supports a wide range of different built-in data types, including numbers, strings, time, Boolean, enumeration, array, range, GIS, full-text search, bytea, large object, geometry, bit, XML, UUID, JSON, composite types, etc., as well as data types customized by users. PostgreSQL supports almost all business scenarios.
  2. Operators. In order to satisfy data processing needs, PostgreSQL offers a wealth of operations for each of the supported data types.
  3. PostgreSQL has extensive built-in functions, including statistical, trigonometric, GIS, and MADlib machine learning functions.
  4. Customized computing logic. Users can use a variety of programming languages, including C, python, java, and R, to customize these data handling functions, and expand the data handling capacity of PostgreSQL and HybridDB for PostgreSQL.
  5. There are extensive built-in aggregate functions to support statistical analysis.
  6. Window query functions.
  7. Recursive queries.
  8. Multi-dimensional analytical syntax.

Summary

The Advantages of RDS PostgreSQL

are mainly represented in the following aspects:

1.Performance

RDS PostgreSQL mainly handles online transactions and a small amount of quasi real-time analysis.

Performance of PostgreSQL OLTP is at the level of commercial databases. For details, refer to this article:

Performance evaluation in the database industry_tpc.org

Multi-core parallel computing, JIT, operator reuse, and other features of PostgreSQL allow it to leave other RDBMS databases far behind in terms of OLAP capacity. For more on the OLAP analytical capacity of PostgreSQL, refer to this article:

Analysis of the cutting-edge technology of accelerator engines - open the treasure chest of PostgreSQL with the combination of LLVM, array storage, multi-core parallel computing, and operator reuse

PostgreSQL 10 also has many enhancements in the aspect of HTAP.

2.Features

Features are a major strength of PostgreSQL. You can find the detailed description in the previous section "Required computing capacity".

3.Scalability

The performance of complex computing can be enhanced by adding additional CPUs.

Storage capacity of RDS PG can be enhanced and the storage limit can be broken by using the OSS storage and the OSS_FDW extension.

4.Cost

As most of the data that needs to be separated is stored at the OSS, users do not need to worry about the disaster tolerance and backups for this part of the data. Compared to storage in databases, storage costs are greatly reduced.

Both RDS PG and HybridDB PG support a wide variety of standard SQL interfaces, which are also used to access data stored in OSS (the Table interface). All of these features combined lead to significant reductions in development costs.

Maintenance cost: There is almost no O&M cost when using the cloud services.

5.Industries covered

It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.

Advantages of HybridDB PostgreSQL

1.Performance

HybridDB PostgreSQL is an MPP architecture, and therefore has outstanding computing performance.

2.Features

You can find the detailed description in the previous section "Required computing capacity".

3.Scalability

The performance of complex computing can be enhanced by adding additional computing nodes.

Storage capacity of RDS PG can be enhanced and the storage limit can be broken by using the OSS storage and the OSS_FDW extension.

4.Cost

As most of the data that needs to be separated is stored at the OSS, users do not need to worry about the disaster tolerance and backups for this part of the data. Compared to storage in databases, storage costs are greatly reduced.

Both RDS PG and HybridDB PG support a wide variety of standard SQL interfaces, which are also used to access data stored in OSS (the Table interface). All of these features combined lead to significant reductions in development costs.

Maintenance cost: There is almost no O&M cost when using the cloud services.

5.Industries covered

It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.

References:

RDS PostgreSQL: use oss_fdw to read/write OSS

HybridDB PostgreSQL: use oss_fdw to read/write OSS

相关实践学习
如何快速连接云数据库RDS MySQL
本场景介绍如何通过阿里云数据管理服务DMS快速连接云数据库RDS MySQL,然后进行数据表的CRUD操作。
全面了解阿里云能为你做什么
阿里云在全球各地部署高效节能的绿色数据中心,利用清洁计算为万物互联的新世界提供源源不断的能源动力,目前开服的区域包括中国(华北、华东、华南、香港)、新加坡、美国(美东、美西)、欧洲、中东、澳大利亚、日本。目前阿里云的产品涵盖弹性计算、数据库、存储与CDN、分析与搜索、云通信、网络、管理与监控、应用服务、互联网中间件、移动服务、视频服务等。通过本课程,来了解阿里云能够为你的业务带来哪些帮助     相关的阿里云产品:云服务器ECS 云服务器 ECS(Elastic Compute Service)是一种弹性可伸缩的计算服务,助您降低 IT 成本,提升运维效率,使您更专注于核心业务创新。产品详情: https://www.aliyun.com/product/ecs
目录
相关文章
|
设计模式 缓存 监控
译|Design patterns for container-based distributed systems(下)
译|Design patterns for container-based distributed systems(下)
76 0
|
设计模式 分布式计算 Kubernetes
译|Design patterns for container-based distributed systems(上)
译|Design patterns for container-based distributed systems
97 0
|
机器学习/深度学习 自然语言处理 PyTorch
Re6:读论文 LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification fro
Re6:读论文 LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification fro
Re6:读论文 LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification fro
Re8:读论文 Hier-SPCNet: A Legal Statute Hierarchy-based Heterogeneous Network for Computing Legal Case
Re8:读论文 Hier-SPCNet: A Legal Statute Hierarchy-based Heterogeneous Network for Computing Legal Case
Re8:读论文 Hier-SPCNet: A Legal Statute Hierarchy-based Heterogeneous Network for Computing Legal Case
|
机器学习/深度学习 数据采集 人工智能
Re10:读论文 Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous gr
Re10:读论文 Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous gr
Re10:读论文 Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous gr
|
SQL 编译器 API
Efficiently Compiling Efficient Query Plans for Modern Hardware 论文解读
这应该是SQL查询编译的一篇经典文章了,作者是著名的Thomas Neumann,主要讲解了TUM的HyPer数据库中对于CodeGen的应用。 在morsel-driven那篇paper 中,介绍了HyPer的整个执行框架,会以task为单位处理一个morsel的数据,而执行的处理逻辑(一个pipeline job)就被编译为一个函数。这篇paper则具体讲如何实现动态编译。
458 0
Efficiently Compiling Efficient Query Plans for Modern Hardware 论文解读
|
人工智能 异构计算
Heterogeneous Computing for AI and Big Data – Alibaba Cloud Computing Conference
Alibaba Cloud heterogeneous platform for elastic computing aims to provide high-quality services for organizations to realize scientific and technological innovations.
1733 0
Heterogeneous Computing for AI and Big Data – Alibaba Cloud Computing Conference
|
虚拟化 异构计算
Interview with Alibaba Cloud Heterogeneous Computing Director: A Trio of Heterogeneous Computing: Domination by a Trio of GPU, FPGA, and ASIC Chips
From October 11 to 14, 2017, The Computing Conference will be held once again in Hangzhou's Yunqi township.
2525 0
Interview with Alibaba Cloud Chief Quantum Technology Scientist Shi Yaoyun: A Long Journey to a Bright Future for Quantum Computing
The 2017 Hangzhou Computing Conference will be held once again in Hangzhou's Yunqi township.
4476 0