GreenPlum入门

简介: 本文旨在小白入门。 我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。 ### 1 GPDB简介 > Pivotal **Greenplum** Database is a MPP (massively parallel processing) database built on open source [PostgreSQL](https:/

本文旨在小白入门。

我开始研究GP的动机是因为它在MPP和数仓方面的名声,另外,GP也是集团技术栈的一员。

1 GPDB简介

Pivotal Greenplum Database is a MPP (massively parallel processing) database built on open source PostgreSQL.The system consists of a master node, standby master node, and segment nodes.

All of the data resides on the segment nodes and the catalog information is stored in the master nodes. Segment nodes run one or more segments, which are modified PostgreSQL database instances and are assigned a content identifier.

For each table the data is divided among the segment nodes based on the distribution column keys specified by the user in the DDL statement.

For each segment content identifier there is both a primary segment and mirror segment which are not running on the same physical host.

When a SQL query enters the master node, it is parsed, optimized and dispatched to all of the segments to execute the query plan and either return the requested data or insert the result of the query into a database table.

从定义上看,GP和此前我比较专注的ES(Elasticsearch)非常类似,它们都是基于成熟的单机技术——PG(PostgreSQL)和Lucene,采用去中心化的分布式技术,实现了对大数据的(MPP和信息检索)支持。

2 准备

多台物理机搭建GPDB集群(1个主节点、1个备节点、n个Segment节点)是最佳的实战环境,但作为入门,GP官方提供的Docker镜像足够用了。使用如下命令拉取这个镜像:

docker pull pivotaldata/gpdb-base

接下来使用如下命令下载GP官方提供的教程示例,然后解压到任意目录(本例使用的路径为/Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737):

wget https://github.com/greenplum-db/gpdb-sandbox-tutorials/zipball/master

启动GP容器,并挂载上述教程示例:

docker rm $(docker ps -aq)

docker run -ti \
-p 5432:5432 \
--name gp_tutorials \
--hostname gp-tutorials \
-v /Users/erichan/Downloads/greenplum-db-gpdb-sandbox-tutorials-6794737:/tutorials \
pivotaldata/gpdb-base

如下的练习都是在上述这个容器中进行的。

容器内包含的用户/密码信息如下:

  • root/pivotal
  • gpadmin/pivotal

如下练习均使用gpadmin用户,记得切换:

su - gpadmin

3 创建用户和角色

3.1 Create a user with the createuser utility command

createuser -P user1
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) y
Shall the new role be allowed to create more new roles? (y/n) n
NOTICE:  resource queue required -- using default resource queue
"pg_default"

3.2 Create a user with the CREATE USER command

psql template1

template1=# CREATE USER user2 WITH PASSWORD 'pivotal' NOSUPERUSER;
template1=# \du

                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 gpadmin   | Superuser, Create role, Create DB |
 user1     | Create DB                         |
 user2     |                                   |

3.3 Create a users group and add the users to it

template1=# CREATE ROLE users;
template1=# GRANT users TO user1, user2;
template1=# \du
                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 gpadmin   | Superuser, Create role, Create DB |
 user1     | Create DB                         | {users}
 user2     |                                   | {users}
 users     | Cannot login                      |

3.4 Exit out of the psql shell

template1=# \q

4 创建和准备数据库

4.1 Create Database

dropdb tutorial
createdb tutorial

psql -l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 tutorial  | gpadmin | UTF8     |
(5 rows)

4.2 Grant database privileges to users

psql -U gpadmin tutorial

tutorial=# GRANT ALL PRIVILEGES ON DATABASE tutorial TO user1, user2;

tutorial=# \q
psql -l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 tutorial  | gpadmin | UTF8     | =Tc/gpadmin
                                : gpadmin=CTc/gpadmin
                                : user1=CTc/gpadmin
                                : user2=CTc/gpadmin
(5 rows)

4.3 Create a schema and set a search path

psql -U user1 -h gp-tutorials tutorial
tutorial=> DROP SCHEMA IF EXISTS faa CASCADE;

tutorial=> CREATE SCHEMA faa;

tutorial=> SET SEARCH_PATH TO faa, public, pg_catalog, gp_toolkit;

tutorial=> SHOW search_path;
             search_path
-------------------------------------
 faa, public, pg_catalog, gp_toolkit
(1 row)

tutorial=> \q

5 创建表

cd /tutorials/faa
psql -U gpadmin tutorial
tutorial=#  \i create_dim_tables.sql
tutorial=# \dt

5.1 Data Loading

tutorial=# \d faa.d_cancellation_codes
Table "faa.d_cancellation_codes"
   Column    | Type | Modifiers
-------------+------+-----------
 cancel_code | text |
 cancel_desc | text |
Distributed by: (cancel_code)

5.2 Load data with the INSERT statement

tutorial=# INSERT INTO faa.d_cancellation_codes
tutorial-# VALUES ('A', 'Carrier'),
tutorial-# ('B', 'Weather'),
tutorial-# ('C', 'NAS'),
tutorial-# ('D', 'Security'),
tutorial-#  ('', 'none');
INSERT 0 5

5.3 Load data with the COPY statement

tutorial-# \i copy_into_airports.sql
tutorial-# \i copy_into_airlines.sql
tutorial-# \i copy_into_delay_groups.sql
tutorial-# \i copy_into_distance_groups.sql
tutorial-# \i copy_into_wac.sql
tutorial-# \q

5.4 Load data with gpdist

Execute gpfdist

gpfdist -d /tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &
ps -A | grep gpfdist

 1202 ?        00:00:00 gpfdist
more /tmp/gpfdist.log

...
Serving HTTP on port 8081, directory /tutorials/faa
psql -U gpadmin tutorial
tutorial=#  \i create_load_tables.sql
tutorial=#  \i create_ext_table.sql
tutorial=# INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp;

tutorial=#  \x

tutorial=# SELECT DISTINCT relname, errmsg, count(*) FROM faa.faa_load_errors GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg  | invalid input syntax for integer: "", column deptime
count   | 26526

tutorial=# \q
killall gpfdist

5.5 Load data with gpload

vi gpload.yaml
---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: gp-tutorials
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - gp-tutorials
         PORT: 8081
         FILE:
           - /tutorials/faa/otp*.gz
    - FORMAT: csv
    - QUOTE: '"'
    - ERROR_LIMIT: 50000
    - ERROR_TABLE: faa.faa_load_errors
   OUTPUT:
    - TABLE: faa.faa_otp_load
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true
gpload -f gpload.yaml -l gpload.log

...
started gpfdist -p 8081 -P 8082 -f "/tutorials/faa/otp*.gz" -t 30
...
running time: 22.41 seconds
...
rows Inserted          = 1024552

5.6 Create and Load fact tables

psql -U gpadmin tutorial
tutorial=#  \i create_fact_tables.sql
tutorial=# \i load_into_fact_table.sql

6 查询和调优

6.1 Analyze the tables

tutorial=# ANALYZE faa.d_airports;
ANALYZE
tutorial=# ANALYZE faa.d_airlines;
ANALYZE
tutorial=# ANALYZE faa.d_wac;
ANALYZE
tutorial=# ANALYZE faa.d_cancellation_codes;
ANALYZE
tutorial=# ANALYZE faa.faa_otp_load;
ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_c;
ANALYZE

6.2 View explain plans

tutorial=# \timing on
tutorial=# \i create_sample_table.sql
tutorial=# EXPLAIN SELECT COUNT(*) FROM sample WHERE id > 100;
tutorial=# EXPLAIN ANALYZE SELECT COUNT(*) FROM sample WHERE id > 100;

6.3 Changing optimizers

tutorial=# \q
gpconfig -s optimizer

Values on all segments are consistent
GUC          : optimizer
Master  value: off
Segment value: off
gpconfig -c optimizer -v on --masteronly
gpstop -u

6.4 Indexes and performance

psql -U gpadmin tutorial

tutorial=# \i create_sample_table.sql

tutorial=# SELECT * from sample WHERE big = 12345;
   id    |  big  | wee | stuff
---------+-------+-----+-------
   12345 | 12345 |   0 |
 1012346 | 12345 |   0 |
 2012347 | 12345 |   0 |
(3 rows)

tutorial=# EXPLAIN SELECT * from sample WHERE big = 12345;

tutorial=# CREATE INDEX sample_big_index ON sample(big);

tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;

6.5 Row vs. Column orientation

tutorial=# CREATE TABLE FAA.OTP_C (LIKE faa.otp_r) WITH (appendonly=true,orientation=column)DISTRIBUTED BY (UniqueCarrier, FlightNum) PARTITION BY RANGE(FlightDate)(PARTITION mth START('2009-06-01'::date) END ('2010-10-31'::date) EVERY ('1 mon'::interval));

tutorial=# INSERT INTO faa.otp_c SELECT * FROM faa.otp_r;

tutorial=# \d faa.otp_r

tutorial=# \d faa.otp_c

tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_r'));
 pg_size_pretty
----------------
 256 MB
(1 row)

tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_r'));
 pg_size_pretty
----------------
 256 MB
(1 row)

tutorial=# SELECT pg_size_pretty(pg_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 0 bytes
(1 row)

tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 288 kB
(1 row)

tutorial=#
tutorial=# SELECT pg_size_pretty(pg_total_relation_size('faa.otp_c'));
 pg_size_pretty
----------------
 288 kB
(1 row)

6.6 Check for even data distribution on segments

tutorial=# SELECT gp_segment_id, COUNT(*) FROM faa.otp_c GROUP BY gp_segment_id ORDER BY gp_segment_id;

6.7 About partitioning

tutorial=# \timing on

tutorial=#  SELECT MAX(depdelay) FROM faa.otp_c WHERE UniqueCarrier = 'UA';

tutorial=# SELECT MAX(depdelay) FROM faa.otp_c WHERE flightdate ='2009-11-01';

7 集成分析工具

8 备份和恢复操作

  • To run a full backup:gpcrondump -x tutorial -u /tmp -a -r
  • To view the backups:ls -al /tmp/db_dumps
  • To restore the data:gpdbrestore -T faa.otp_r -s tutorial -u /tmp -a

9 Reference

目录
相关文章
|
SQL 关系型数据库 数据库
PG/Greenplum
PG/Greenplum 是指 PostgreSQL(简称 PG)和 Greenplum(简称 GP)两种关系型数据库管理系统。它们都是基于 SQL(结构化查询语言)的开放源代码数据库系统,具有高性能、可扩展性和高可靠性等特点
906 7
|
SQL 缓存 分布式计算
SparkSQL与Hive metastore Parquet转换
Spark SQL为了更好的性能,在读写Hive metastore parquet格式的表时,会默认使用自己的Parquet SerDe,而不是采用Hive的SerDe进行序列化和反序列化
SparkSQL与Hive metastore Parquet转换
|
10月前
|
存储 人工智能 自然语言处理
DeepSeek R1+Ollama+Cherry Studio实现本地知识库的搭建
本文介绍了如何使用Ollama和CherryStudio搭建本地知识库,涵盖核心组件介绍、硬件与软件准备、模型部署流程及知识库构建方法。通过配置DeepSeek R1模型、嵌入模型和Cherry Studio平台,实现本地化自然语言处理与知识检索功能。
3274 0
|
10月前
|
安全 物联网 API
Windows 11 24H2 中文版、英文版 (x64、ARM64) 下载 (2025 年 7 月更新)
Windows 11 24H2 中文版、英文版 (x64、ARM64) 下载 (2025 年 7 月更新)
1670 0
|
资源调度 运维 Serverless
ES Serverless 8.17王牌发布:向量检索「火力全开」,智能扩缩「秒级响应」!
阿里云 Elasticsearch Serverless 检索增强型8.17版本在最新特性扩展、自动扩缩性能、资源成本优化三大维度实现全面跃升,本文将深度解析该版本通过工程优化带来的核心能力升级。
548 0
|
消息中间件 存储 算法
Kafka Raft集群搭建
Kafka Raft集群搭建
755 0
|
人工智能 资源调度 数据可视化
【AI应用落地实战】智能文档处理本地部署——可视化文档解析前端TextIn ParseX实践
2024长沙·中国1024程序员节以“智能应用新生态”为主题,吸引了众多技术大咖。合合信息展示了“智能文档处理百宝箱”的三大工具:可视化文档解析前端TextIn ParseX、向量化acge-embedding模型和文档解析测评工具markdown_tester,助力智能文档处理与知识管理。
|
安全 Linux 网络安全
Greenplum【部署 01】CentOS Linux release 7.5 环境单机版详细安装配置步骤+问题解决(含安装包资源)
Greenplum【部署 01】CentOS Linux release 7.5 环境单机版详细安装配置步骤+问题解决(含安装包资源)
492 2
|
JavaScript 关系型数据库 MySQL
若依框架数据源切换为pg库
若依管理系统切换数据源为pg数据库
2726 1
若依框架数据源切换为pg库
|
存储 运维 安全
Greenplum闭源?平滑迁移到 AnalyticDB 开启Data+AI新范式
知名开源 MPP 数据库 Greenplum 由于其丰富的企业级特性和出色的数据处理能力成为很多企业构建数仓的首选。近期 GP 公开 Github 仓库无法访问仅保留只读归档代码,业界纷纷猜测 GP 即将闭源。云原生数仓 AnalyticDB PostgreSQL 版完全掌控内核代码,完全兼容GP语法,全自研计算及存储引擎较比开源GP有五倍性能提升,全自研企业级特性在实时计算、弹性扩展、安全增强、高可用等方面实现对GP的全面超越,并在数仓能力上扩展了向量检索及一站式 RAG 服务,帮助企业快速构建 AI 应用、开启 Data+AI 新范式。
59901 3