本文旨在小白入门。
我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。
1 GPDB简介
Pivotal Greenplum Database is a MPP (massively parallel processing) database built on open source PostgreSQL.The system consists of a master node, standby master node, and segment nodes.
All of the data resides on the segment nodes and the catalog information is stored in the master nodes. Segment nodes run one or more segments, which are modified PostgreSQL database instances and are assigned a content identifier.
For each table the data is divided among the segment nodes based on the distribution column keys specified by the user in the DDL statement.
For each segment content identifier there is both a primary segment and mirror segment which are not running on the same physical host.
When a SQL query enters the master node, it is parsed, optimized and dispatched to all of the segments to execute the query plan and either return the requested data or insert the result of the query into a database table.
从定义上看,GP和此前我比较专注的ES(Elasticsearch)非常类似,它们都是基于成熟的单机技术——PG(PostgreSQL)和Lucene,采用去中心化的分布式技术,实现了对大数据的(MPP和信息检索)支持。
2 准备
多台物理机搭建GPDB集群(1个主节点、1个备节点、n个Segment节点)是最佳的实战环境,但作为入门,GP官方提供的Docker镜像足够用了。使用如下命令拉取这个镜像:
docker pull pivotaldata/gpdb-base
接下来使用如下命令下载GP官方提供的教程示例,然后解压到任意目录(本例使用的路径为/Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737
):
wget https://github.com/greenplum-db/gpdb-sandbox-tutorials/zipball/master
启动GP容器,并挂载上述教程示例:
docker rm $(docker ps -aq)
docker run -ti \
-p 5432:5432 \
--name gp_tutorials \
--hostname gp-tutorials \
-v /Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737:/tutorials \
pivotaldata/gpdb-base
如下的练习都是在上述这个容器中进行的。
容器内包含的用户/密码信息如下:
- root/pivotal
- gpadmin/pivotal
如下练习均使用gpadmin
用户,记得切换:
su - gpadmin
3 创建用户和角色
3.1 Create a user with the createuser utility command
createuser -P user1
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) y
Shall the new role be allowed to create more new roles? (y/n) n
NOTICE: resource queue required -- using default resource queue
"pg_default"
3.2 Create a user with the CREATE USER command
psql template1
template1=# CREATE USER user2 WITH PASSWORD 'pivotal' NOSUPERUSER;
template1=# \du
List of roles
Role name | Attributes | Member of
-----------+-----------------------------------+-----------
gpadmin | Superuser, Create role, Create DB |
user1 | Create DB |
user2 | |
3.3 Create a users group and add the users to it
template1=# CREATE ROLE users;
template1=# GRANT users TO user1, user2;
template1=# \du
List of roles
Role name | Attributes | Member of
-----------+-----------------------------------+-----------
gpadmin | Superuser, Create role, Create DB |
user1 | Create DB | {users}
user2 | | {users}
users | Cannot login |
3.4 Exit out of the psql shell
template1=# \q
4 创建和准备数据库
4.1 Create Database
dropdb tutorial
createdb tutorial
psql -l
List of databases
Name | Owner | Encoding | Access privileges
-----------+---------+----------+---------------------
gpadmin | gpadmin | UTF8 |
postgres | gpadmin | UTF8 |
template0 | gpadmin | UTF8 | =c/gpadmin
: gpadmin=CTc/gpadmin
template1 | gpadmin | UTF8 | =c/gpadmin
: gpadmin=CTc/gpadmin
tutorial | gpadmin | UTF8 |
(5 rows)
4.2 Grant database privileges to users
psql -U gpadmin tutorial
tutorial=# GRANT ALL PRIVILEGES ON DATABASE tutorial TO user1, user2;
tutorial=# \q
psql -l
List of databases
Name | Owner | Encoding | Access privileges
-----------+---------+----------+---------------------
gpadmin | gpadmin | UTF8 |
postgres | gpadmin | UTF8 |
template0 | gpadmin | UTF8 | =c/gpadmin
: gpadmin=CTc/gpadmin
template1 | gpadmin | UTF8 | =c/gpadmin
: gpadmin=CTc/gpadmin
tutorial | gpadmin | UTF8 | =Tc/gpadmin
: gpadmin=CTc/gpadmin
: user1=CTc/gpadmin
: user2=CTc/gpadmin
(5 rows)
4.3 Create a schema and set a search path
psql -U user1 -h gp-tutorials tutorial
tutorial=> DROP SCHEMA IF EXISTS faa CASCADE;
tutorial=> CREATE SCHEMA faa;
tutorial=> SET SEARCH_PATH TO faa, public, pg_catalog, gp_toolkit;
tutorial=> SHOW search_path;
search_path
-------------------------------------
faa, public, pg_catalog, gp_toolkit
(1 row)
tutorial=> \q
5 创建表
cd /tutorials/faa
psql -U gpadmin tutorial
tutorial=# \i create_dim_tables.sql
tutorial=# \dt
5.1 Data Loading
tutorial=# \d faa.d_cancellation_codes
Table "faa.d_cancellation_codes"
Column | Type | Modifiers
-------------+------+-----------
cancel_code | text |
cancel_desc | text |
Distributed by: (cancel_code)
5.2 Load data with the INSERT statement
tutorial=# INSERT INTO faa.d_cancellation_codes
tutorial-# VALUES ('A', 'Carrier'),
tutorial-# ('B', 'Weather'),
tutorial-# ('C', 'NAS'),
tutorial-# ('D', 'Security'),
tutorial-# ('', 'none');
INSERT 0 5
5.3 Load data with the COPY statement
tutorial-# \i copy_into_airports.sql
tutorial-# \i copy_into_airlines.sql
tutorial-# \i copy_into_delay_groups.sql
tutorial-# \i copy_into_distance_groups.sql
tutorial-# \i copy_into_wac.sql
tutorial-# \q
5.4 Load data with gpdist
Execute gpfdist
gpfdist -d /tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &
ps -A | grep gpfdist
1202 ? 00:00:00 gpfdist
more /tmp/gpfdist.log
...
Serving HTTP on port 8081, directory /tutorials/faa
psql -U gpadmin tutorial
tutorial=# \i create_load_tables.sql
tutorial=# \i create_ext_table.sql
tutorial=# INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp;
tutorial=# \x
tutorial=# SELECT DISTINCT relname, errmsg, count(*) FROM faa.faa_load_errors GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg | invalid input syntax for integer: "", column deptime
count | 26526
tutorial=# \q
killall gpfdist
5.5 Load data with gpload
vi gpload.yaml
---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: gp-tutorials
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- gp-tutorials
PORT: 8081
FILE:
- /tutorials/faa/otp*.gz
- FORMAT: csv
- QUOTE: '"'
- ERROR_LIMIT: 50000
- ERROR_TABLE: faa.faa_load_errors
OUTPUT:
- TABLE: faa.faa_otp_load
- MODE: INSERT
PRELOAD:
- TRUNCATE: true
gpload -f gpload.yaml -l gpload.log
...
started gpfdist -p 8081 -P 8082 -f "/tutorials/faa/otp*.gz" -t 30
...
running time: 22.41 seconds
...
rows Inserted = 1024552
5.6 Create and Load fact tables
psql -U gpadmin tutorial
tutorial=# \i create_fact_tables.sql
tutorial=# \i load_into_fact_table.sql
6 查询和调优
6.1 Analyze the tables
tutorial=# ANALYZE faa.d_airports;
ANALYZE
tutorial=# ANALYZE faa.d_airlines;
ANALYZE
tutorial=# ANALYZE faa.d_wac;
ANALYZE
tutorial=# ANALYZE faa.d_cancellation_codes;
ANALYZE
tutorial=# ANALYZE faa.faa_otp_load;
ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_c;
ANALYZE
6.2 View explain plans
tutorial=# \timing on
tutorial=# \i create_sample_table.sql
tutorial=# EXPLAIN SELECT COUNT(*) FROM sample WHERE id > 100;
tutorial=# EXPLAIN ANALYZE SELECT COUNT(*) FROM sample WHERE id > 100;
6.3 Changing optimizers
tutorial=# \q
gpconfig -s optimizer
Values on all segments are consistent
GUC : optimizer
Master value: off
Segment value: off
gpconfig -c optimizer -v on --masteronly
gpstop -u
6.4 Indexes and performance
psql -U gpadmin tutorial
tutorial=# \i create_sample_table.sql
tutorial=# SELECT * from sample WHERE big = 12345;
id | big | wee | stuff
---------+-------+-----+-------
12345 | 12345 | 0 |
1012346 | 12345 | 0 |
2012347 | 12345 | 0 |
(3 rows)
tutorial=# EXPLAIN SELECT * from sample WHERE big = 12345;
tutorial=# CREATE INDEX sample_big_index ON sample(big);
tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;
6.5 Row vs. Column orientation
tutorial=# CREATE TABLE FAA.OTP_C (LIKE faa.otp_r) WITH (appendonly=true,orientation=column)DISTRIBUTED BY (UniqueCarrier, FlightNum) PARTITION BY RANGE(FlightDate)(PARTITION mth START('2009-06-01'::date) END ('2010-10-31'::date) EVERY ('1 mon'::interval));
tutorial=# INSERT INTO faa.otp_c SELECT * FROM faa.otp_r;
tutorial=# \d faa.otp_r
tutorial=# \d faa.otp_c
tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_r'));
pg_size_pretty
----------------
256 MB
(1 row)
tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_r'));
pg_size_pretty
----------------
256 MB
(1 row)
tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_c'));
pg_size_pretty
----------------
0 bytes
(1 row)
tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
pg_size_pretty
----------------
288 kB
(1 row)
tutorial=#
tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
pg_size_pretty
----------------
288 kB
(1 row)
6.6 Check for even data distribution on segments
tutorial=# SELECT gp_segment_id, COUNT(*) FROM faa.otp_c GROUP BY gp_segment_id ORDER BY gp_segment_id;
6.7 About partitioning
tutorial=# \timing on
tutorial=# SELECT MAX(depdelay) FROM faa.otp_c WHERE UniqueCarrier = 'UA';
tutorial=# SELECT MAX(depdelay) FROM faa.otp_c WHERE flightdate ='2009-11-01';
7 集成分析工具
8 备份和恢复操作
- To run a full backup:
gpcrondump -x tutorial -u /tmp -a -r
- To view the backups:
ls -al /tmp/db_dumps
- To restore the data:
gpdbrestore -T faa.otp_r -s tutorial -u /tmp -a