Massively Parallel Processing with Alibaba Cloud HybridDB for PostgreSQL

When you have massive amounts of data to analyze, high availability requirements, or security and backup protocols to follow, services like Alibaba Cloud's HybridDB for PostgreSQL can come in handy.

The service takes powerful Relational Database Management Systems (RDBMS) to a whole new level. In this article, we are going to explore how to get started with Alibaba Cloud's HybridDB for PostgreSQL service—for free.

PostgreSQL

PostgreSQL (also known as Postgres) is often described as the most advanced open-source database, for several reasons. In the long-established world of databases (systems that organize, describe, store, and structure data and let users query it), several players require costly licensing agreements. Others deliver an interesting balance of features and, now, scale. Postgres, with the help of Alibaba Cloud HybridDB, is one of these.

Postgres is formally an RDBMS, which means that it organizes data according to the relational model invented by Edgar Codd. Since version 8.0 launched in 2005, it has evolved to cover new ground, including some unstructured data, and now, in version 10, launched this year, Postgres offers:

● XML data types, which let you store XML documents as columns of a table, query tags and attributes, convert to and from XML, and more
● JSON data types, allowing storage of JSON documents as columns of a table, queries of documents, conversion to/from JSON, the addition of indexes to improve performance, and more
● HStore data types, meaning that you can define key-value columns in the tables
● GIS extensions, allowing specialized datatypes, indexes, and a bunch of utilities for GeoSpatial use cases
● Basic full-text search capabilities
● Query parallelization

The last one is special because it lets a developer working locally (usually in the dev environment) launch complicated queries that are optimized to run in parallel, using the full capabilities of multi-core processors. But how is the same thing accomplished when the amount of data is on the order of terabytes or petabytes? The approach is called Massively Parallel Processing (MPP), and one solution is to use a cluster of databases that share the load behind one interface, so that the cluster looks like a single database instance.
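The scatter-gather idea behind such a cluster can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how HybridDB or Greenplum is implemented: the names `distribute`, `segment_aggregate`, and `coordinator_sum`, the segment count, and the row layout are all hypothetical, and in-process lists stand in for separate database nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # assumed cluster size, purely illustrative


def distribute(rows, key, num_segments=NUM_SEGMENTS):
    """Assign each row to a segment by hashing its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        # crc32 is stable across runs, unlike the built-in hash() for strings
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        segments[h % num_segments].append(row)
    return segments


def segment_aggregate(rows, group_by, value):
    """Work done locally on one segment: a partial SUM per group."""
    partial = defaultdict(float)
    for row in rows:
        partial[row[group_by]] += row[value]
    return dict(partial)


def coordinator_sum(segments, group_by, value):
    """Send the query to every segment in parallel, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(
            pool.map(lambda seg: segment_aggregate(seg, group_by, value), segments)
        )
    totals = defaultdict(float)
    for partial in partials:
        for group, subtotal in partial.items():
            totals[group] += subtotal
    return dict(totals)


if __name__ == "__main__":
    sales = [
        {"order_id": 1, "region": "EU", "amount": 10.0},
        {"order_id": 2, "region": "US", "amount": 20.0},
        {"order_id": 3, "region": "EU", "amount": 5.5},
    ]
    segments = distribute(sales, key="order_id")
    print(coordinator_sum(segments, group_by="region", value="amount"))
```

The caller only ever talks to `coordinator_sum`, which plays the role of the single interface. Each segment sees just its own slice of the rows, so the per-segment work shrinks as segments are added.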

From the database world to the cloud

Running a database cluster is a complicated task, and usually there is a specific role (or better, a specific team) in the company that handles it. There are innumerable important details and a huge number of subtasks to handle in order to offer a reliable, scalable, and fast service. Fortunately, companies like Alibaba Cloud offer this kind of service, and you can go from zero to a decent cluster configuration without being a specialist.

The Greenplum open source project is a Massively Parallel Processing database based on PostgreSQL 8.2. Alibaba Cloud's HybridDB for PostgreSQL is a cloud service that runs Greenplum and manages tasks such as security and backup. What are the advantages? Here are a few:

● Autoscaling
● Simplified management
● Isolation of the environment using virtual private clouds
● Exclusive extensions to the Greenplum base implementation, such as JSON and HyperLogLog
● Support for Object Storage Service (OSS)
● Support for migration tools such as pgsql2pgsql (PostgreSQL to PostgreSQL) or mysql2pgsql (MySQL to PostgreSQL)
● Support for SQL-99, SQL-03, SQL-08 standards
● Support for the Apache MADlib project

The last one is particularly interesting because it extends HybridDB for PostgreSQL with "data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data," which means that you will be able to do advanced analytics on data locally, in-database.
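To give a feel for what "data-parallel" means here, the sketch below fits a least-squares line y = a + b*x over data split across segments. It is not MADlib's API or code. It is a minimal illustration of the general pattern (each segment computes small partial statistics, which are then merged), and the function names `segment_stats`, `merge_stats`, `fit_line`, and `fit_across_segments` are hypothetical.

```python
from functools import reduce


def segment_stats(points):
    """Per-segment pass: sufficient statistics for a line fit y = a + b*x."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)


def merge_stats(a, b):
    """Combine two partial results; addition is all that is needed."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Final step on the merged statistics: closed-form least squares."""
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values are all identical; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


def fit_across_segments(segments):
    """Each segment summarizes its own rows; only the tiny summaries are merged."""
    merged = reduce(merge_stats, (segment_stats(seg) for seg in segments))
    return fit_line(merged)


if __name__ == "__main__":
    # Three assumed segments holding points that lie on y = 2x + 1
    segments = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
    intercept, slope = fit_across_segments(segments)
    print(intercept, slope)
```

Only five numbers per segment travel to the final step, regardless of how many rows each segment holds, which is why this style of computation suits in-database analytics.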

Bio

Nicolas Bohorquez (@Nickmancol) is a software developer from Colombia and is currently earning a Master's in Data Science for Complex Economic Systems at the Collegio Carlo Alberto in Turin, Italy. Previously, Nicolas has been part of development teams in a handful of startups, and has founded three companies in the Americas. He is passionate about the modeling of complexity and the use of data science to improve the world.

