GreenPlum入门

简介: 本文旨在小白入门。 我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。 ### 1 GPDB简介 > Pivotal **Greenplum** Database is a MPP (massively parallel processing) database built on open source [PostgreSQL](https:/

本文旨在小白入门。

我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。

1 GPDB简介

Pivotal Greenplum Database is a MPP (massively parallel processing) database built on open source PostgreSQL.The system consists of a master node, standby master node, and segment nodes.

All of the data resides on the segment nodes and the catalog information is stored in the master nodes. Segment nodes run one or more segments, which are modified PostgreSQL database instances and are assigned a content identifier.

For each table the data is divided among the segment nodes based on the distribution column keys specified by the user in the DDL statement.

For each segment content identifier there is both a primary segment and mirror segment which are not running on the same physical host.

When a SQL query enters the master node, it is parsed, optimized and dispatched to all of the segments to execute the query plan and either return the requested data or insert the result of the query into a database table.

从定义上看,GP和此前我比较专注的ES(Elasticsearch)非常类似,它们都是基于成熟的单机技术——PG(PostgreSQL)和Lucene,采用去中心化的分布式技术,实现了对大数据的(MPP和信息检索)支持。

2 准备

多台物理机搭建GPDB集群(1个主节点、1个备节点、n个Segment节点)是最佳的实战环境,但作为入门,GP官方提供的Docker镜像足够用了。使用如下命令拉取这个镜像:

docker pull pivotaldata/gpdb-base

接下来使用如下命令下载GP官方提供的教程示例,然后解压到任意目录(本例使用的路径为/Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737):

wget https://github.com/greenplum-db/gpdb-sandbox-tutorials/zipball/master

启动GP容器,并挂载上述教程示例:

docker rm $(docker ps -aq)

docker run -ti \
-p 5432:5432 \
--name gp_tutorials \
--hostname gp-tutorials \
-v /Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737:/tutorials \
pivotaldata/gpdb-base

如下的练习都是在上述这个容器中进行的。

容器内包含的用户/密码信息如下:

  • root/pivotal
  • gpadmin/pivotal

如下练习均使用gpadmin用户,记得切换:

su - gpadmin

3 创建用户和角色

3.1 Create a user with the createuser utility command

createuser -P user1
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) y
Shall the new role be allowed to create more new roles? (y/n) n
NOTICE:  resource queue required -- using default resource queue
"pg_default"

3.2 Create a user with the CREATE USER command

psql template1

template1=# CREATE USER user2 WITH PASSWORD 'pivotal' NOSUPERUSER;
template1=# \du

                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 gpadmin   | Superuser, Create role, Create DB |
 user1     | Create DB                         |
 user2     |                                   |

3.3 Create a users group and add the users to it

template1=# CREATE ROLE users;
template1=# GRANT users TO user1, user2;
template1=# \du
                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 gpadmin   | Superuser, Create role, Create DB |
 user1     | Create DB                         | {users}
 user2     |                                   | {users}
 users     | Cannot login                      |

3.4 Exit out of the psql shell

template1=# \q

4 创建和准备数据库

4.1 Create Database

dropdb tutorial
createdb tutorial

psql -l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 tutorial  | gpadmin | UTF8     |
(5 rows)

4.2 Grant database privileges to users

psql -U gpadmin tutorial

tutorial=# GRANT ALL PRIVILEGES ON DATABASE tutorial TO user1, user2;

tutorial=# \q
psql -l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 tutorial  | gpadmin | UTF8     | =Tc/gpadmin
                                : gpadmin=CTc/gpadmin
                                : user1=CTc/gpadmin
                                : user2=CTc/gpadmin
(5 rows)

4.3 Create a schema and set a search path

psql -U user1 -h gp-tutorials tutorial
tutorial=> DROP SCHEMA IF EXISTS faa CASCADE;

tutorial=> CREATE SCHEMA faa;

tutorial=> SET SEARCH_PATH TO faa, public, pg_catalog, gp_toolkit;

tutorial=> SHOW search_path;
             search_path
-------------------------------------
 faa, public, pg_catalog, gp_toolkit
(1 row)

tutorial=> \q

5 创建表

cd /tutorials/faa
psql -U gpadmin tutorial
tutorial=#  \i create_dim_tables.sql
tutorial=# \dt

5.1 Data Loading

tutorial=# \d faa.d_cancellation_codes
Table "faa.d_cancellation_codes"
   Column    | Type | Modifiers
-------------+------+-----------
 cancel_code | text |
 cancel_desc | text |
Distributed by: (cancel_code)

5.2 Load data with the INSERT statement

tutorial=# INSERT INTO faa.d_cancellation_codes
tutorial-# VALUES ('A', 'Carrier'),
tutorial-# ('B', 'Weather'),
tutorial-# ('C', 'NAS'),
tutorial-# ('D', 'Security'),
tutorial-#  ('', 'none');
INSERT 0 5

5.3 Load data with the COPY statement

tutorial-# \i copy_into_airports.sql
tutorial-# \i copy_into_airlines.sql
tutorial-# \i copy_into_delay_groups.sql
tutorial-# \i copy_into_distance_groups.sql
tutorial-# \i copy_into_wac.sql
tutorial-# \q

5.4 Load data with gpdist

Execute gpfdist

gpfdist -d /tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &
ps -A | grep gpfdist

 1202 ?        00:00:00 gpfdist
more /tmp/gpfdist.log

...
Serving HTTP on port 8081, directory /tutorials/faa
psql -U gpadmin tutorial
tutorial=#  \i create_load_tables.sql
tutorial=#  \i create_ext_table.sql
tutorial=# INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp;

tutorial=#  \x

tutorial=# SELECT DISTINCT relname, errmsg, count(*) FROM faa.faa_load_errors GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg  | invalid input syntax for integer: "", column deptime
count   | 26526

tutorial=# \q
killall gpfdist

5.5 Load data with gpload

vi gpload.yaml
---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: gp-tutorials
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - gp-tutorials
         PORT: 8081
         FILE:
           - /tutorials/faa/otp*.gz
    - FORMAT: csv
    - QUOTE: '"'
    - ERROR_LIMIT: 50000
    - ERROR_TABLE: faa.faa_load_errors
   OUTPUT:
    - TABLE: faa.faa_otp_load
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true
gpload -f gpload.yaml -l gpload.log

...
started gpfdist -p 8081 -P 8082 -f "/tutorials/faa/otp*.gz" -t 30
...
running time: 22.41 seconds
...
rows Inserted          = 1024552

5.6 Create and Load fact tables

psql -U gpadmin tutorial
tutorial=#  \i create_fact_tables.sql
tutorial=# \i load_into_fact_table.sql

6 查询和调优

6.1 Analyze the tables

tutorial=# ANALYZE faa.d_airports;
ANALYZE
tutorial=# ANALYZE faa.d_airlines;
ANALYZE
tutorial=# ANALYZE faa.d_wac;
ANALYZE
tutorial=# ANALYZE faa.d_cancellation_codes;
ANALYZE
tutorial=# ANALYZE faa.faa_otp_load;
ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_c;
ANALYZE

6.2 View explain plans

tutorial=# \timing on
tutorial=# \i create_sample_table.sql
tutorial=# EXPLAIN SELECT COUNT(*) FROM sample WHERE id > 100;
tutorial=# EXPLAIN ANALYZE SELECT COUNT(*) FROM sample WHERE id > 100;

6.3 Changing optimizers

tutorial=# \q
gpconfig -s optimizer

Values on all segments are consistent
GUC          : optimizer
Master  value: off
Segment value: off
gpconfig -c optimizer -v on --masteronly
gpstop -u

6.4 Indexes and performance

psql -U gpadmin tutorial

tutorial=# \i create_sample_table.sql

tutorial=# SELECT * from sample WHERE big = 12345;
   id    |  big  | wee | stuff
---------+-------+-----+-------
   12345 | 12345 |   0 |
 1012346 | 12345 |   0 |
 2012347 | 12345 |   0 |
(3 rows)

tutorial=# EXPLAIN SELECT * from sample WHERE big = 12345;

tutorial=# CREATE INDEX sample_big_index ON sample(big);

tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;

6.5 Row vs. Column orientation

tutorial=# CREATE TABLE FAA.OTP_C (LIKE faa.otp_r) WITH (appendonly=true,orientation=column)DISTRIBUTED BY (UniqueCarrier, FlightNum) PARTITION BY RANGE(FlightDate)(PARTITION mth START('2009-06-01'::date) END ('2010-10-31'::date) EVERY ('1 mon'::interval));

tutorial=# INSERT INTO faa.otp_c SELECT * FROM faa.otp_r;

tutorial=# \d faa.otp_r

tutorial=# \d faa.otp_c

tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_r'));
 pg_size_pretty
----------------
 256 MB
(1 row)

tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_r'));
 pg_size_pretty
----------------
 256 MB
(1 row)

tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 0 bytes
(1 row)

tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 288 kB
(1 row)

tutorial=#
tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 288 kB
(1 row)

6.6 Check for even data distribution on segments

tutorial=# SELECT gp_segment_id, COUNT(*) FROM faa.otp_c GROUP BY gp_segment_id ORDER BY gp_segment_id;

6.7 About partitioning

tutorial=# \timing on

tutorial=#  SELECT MAX(depdelay) FROM faa.otp_c WHERE UniqueCarrier = 'UA';

tutorial=# SELECT MAX(depdelay) FROM faa.otp_c WHERE flightdate ='2009-11-01';

7 集成分析工具

8 备份和恢复操作

  • To run a full backup:gpcrondump -x tutorial -u /tmp -a -r
  • To view the backups:ls -al /tmp/db_dumps
  • To restore the data:gpdbrestore -T faa.otp_r -s tutorial -u /tmp -a

9 Reference

目录
相关文章
|
10月前
|
存储 SQL 分布式计算
GreenPlum小结
GreenPlum小结
244 0
|
存储 SQL 缓存
PostgreSQL DirectIO开发实践
在数据库开源的背景下,基于PG的DirectIO的研发方案分享。
PostgreSQL DirectIO开发实践
|
关系型数据库 MySQL 数据挖掘
阿里云PostgreSQL_开发实践_5 | 学习笔记
简介:快速学习阿里云PostgreSQL_开发实践_5
309 0
阿里云PostgreSQL_开发实践_5 | 学习笔记
|
存储 SQL 关系型数据库
阿里云 PostgreSQL_ 开发实践_6 | 学习笔记
简介:快速学习阿里云 PostgreSQL_ 开发实践_6
90 0
|
SQL 存储 Oracle
阿里云PostgreSQL_开发实践_3 | 学习笔记
简介:快速学习阿里云PostgreSQL_开发实践_3
93 0
阿里云PostgreSQL_开发实践_3 | 学习笔记
|
存储 SQL 监控
阿里云 PostgreSQL_开发实践_4 | 学习笔记
简介:快速学习阿里云 PostgreSQL_开发实践_4
127 0
阿里云 PostgreSQL_开发实践_4 | 学习笔记
|
SQL 关系型数据库 网络安全
阿里云PostgreSQL_开发实践 | 学习笔记
简介:快速学习阿里云PostgreSQL_开发实践
164 0
阿里云PostgreSQL_开发实践 | 学习笔记
|
SQL 数据采集 算法
阿里云PostgreSQL_开发实践_1 | 学习笔记
简介:快速学习阿里云PostgreSQL_开发实践_1
132 1
阿里云PostgreSQL_开发实践_1 | 学习笔记
|
SQL 关系型数据库 物联网
Hyperscale (Citus) ,分布式 PostgreSQL 实战指南
Hyperscale (Citus) ,分布式 PostgreSQL 实战指南
260 0
|
SQL 存储 NoSQL
Greenplum应用最佳实践
Greenplum分布式分析数据库 通用操作的最佳实践。持续更新
1266 0
Greenplum应用最佳实践