Greenplum 注意对其数据类型,否则优化器让你好看

简介:
在测试tpch时,发现有一些SQL跑超时,原因是测试SQL中有一些JOIN查询,而这些JOIN的列数据类型不一致,导致无法使用索引,或者无法使用hash join。
例子:
int和int8,都是整型,只是长度不一样。关联时,会遇到性能问题,因为不能使用HASH JOIN。
digoal=# create table t(id int);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# insert into t select generate_series(1,10000);
INSERT 0 10000
digoal=# insert into t1 select generate_series(1,10000);
INSERT 0 10000
关联字段为int和int8,不能使用hash join,只能用nestloop。
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
即使强制关掉nestloop也不行。
digoal=# set enable_nestloop=off;
SET
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
和数据量当然并没有关系。
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
接下来,使用同样的数据类型,可以用HASH JOIN关联。性能大幅提升。
digoal=# create table t2(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t2 select * from t1;
INSERT 0 10010000
digoal=# create index idx_t2 on t2(id);
CREATE INDEX
digoal=# explain select * from t2 join t1 on( t2.id=t1.id );
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=236320.07..497677.13 rows=10013203 width=16)
   ->  Hash Join  (cost=236320.07..497677.13 rows=455146 width=16)
         Hash Cond: t1.id = t2.id
         ->  Seq Scan on t1  (cost=0.00..111158.22 rows=455161 width=8)
         ->  Hash  (cost=111155.03..111155.03 rows=455146 width=8)
               ->  Seq Scan on t2  (cost=0.00..111155.03 rows=455146 width=8)
 Optimizer status: legacy query optimizer
(7 rows)

另一个例子,当char长度不一样,不会有以上类似的问题。
digoal=# create table t(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id char(20));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# explain select * from t join t1 on( t.id=t1.id and t.id='1');
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..401.54 rows=4 width=32)
   ->  Nested Loop  (cost=0.00..401.54 rows=1 width=32)
         ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
               Index Cond: id = '1'::bpchar
         ->  Index Scan using idx_t1 on t1  (cost=0.00..200.83 rows=1 width=21)
               Index Cond: '1'::bpchar = id
 Optimizer status: legacy query optimizer
(7 rows)
varchar和char(n)也不存在问题。
postgres=# create table t(id varchar);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# create table t1(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# explain select * from t , t1 where t.id=t1.id;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.01..0.07 rows=4 width=76)
   ->  Hash Join  (cost=0.01..0.07 rows=1 width=76)
         Hash Cond: t.id::bpchar = t1.id
         ->  Redistribute Motion 22:22  (slice1; segments: 22)  (cost=0.00..0.02 rows=1 width=32)
               Hash Key: t.id::bpchar
               ->  Seq Scan on t  (cost=0.00..0.00 rows=1 width=32)
         ->  Hash  (cost=0.00..0.00 rows=1 width=44)
               ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=44)
 Optimizer status: legacy query optimizer
(9 rows)

数据类型和查询条件的类型不一致时,不走索引的例子。
digoal=# explain select * from t where id=1;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=0.00..163369.87 rows=9995 width=11)
   ->  Seq Scan on t  (cost=0.00..163369.87 rows=455 width=11)
         Filter: id::text = '1'::text
 Optimizer status: legacy query optimizer
(4 rows)

digoal=# explain select * from t where id='1';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)
显示转换后,可以走索引。
digoal=# explain select * from t where id=1::bpchar;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)

greenplum因为用了早期的postgresql版本,所以这一方面优化器并不好,在使用greenplum时,需要注意一下。
greenplum将合并到postgresql 9.5的版本,以上问题可以消除。
postgresql 9.4的例子:
digoal=# create table t(id int);
CREATE TABLE
digoal=# create table t1(id int8);
CREATE TABLE
digoal=# insert into t select generate_series(1,100000);
INSERT 0 100000
digoal=# insert into t1 select generate_series(1,100000);
INSERT 0 100000
digoal=# explain select * from t,t1 where t.id=t1.id;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Hash Join  (cost=2693.00..6136.00 rows=100000 width=12)
   Hash Cond: (t.id = t1.id)
   ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
         ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8)
(5 rows)
目录
相关文章
|
8月前
|
存储 SQL 关系型数据库
PolarDB这个sql行存和列存性能差别好大 ,为什么?
PolarDB这个sql行存和列存性能差别好大 ,为什么?
162 0
|
关系型数据库 Java MySQL
clickhouse的常用语法你知道吗
clickhouse数据库,总结了一些常用的语法
417 0
clickhouse的常用语法你知道吗
|
存储 SQL 缓存
MySQL高级第二篇(共四篇)之体系结构、存储引擎、优化SQL步骤、索引使用、SQL优化
最上层是一些客户端和链接服务,包含本地sock 通信和大多数基于客户端/服务端工具实现的类似于 TCP/IP的通信。主要完成一些类似于连接处理、授权认证、及相关的安全方案。在该层上引入了线程池的概念,为通过认证安 全接入的客户端提供线程。同样在该层上可以实现基于SSL的安全链接。服务器也会为安全接入的每个客户端验证它所具有的操作权限。
462 1
|
存储 关系型数据库 MySQL
MySQL优化——Schema与数据类型设计
MySQL优化——Schema与数据类型设计
140 0
MySQL优化——Schema与数据类型设计
|
SQL JSON 搜索推荐
认识 PostgreSQL 中与众不同的索引(一)|学习笔记
快速学习认识 PostgreSQL 中与众不同的索引(一)
288 0
认识 PostgreSQL 中与众不同的索引(一)|学习笔记
|
搜索推荐 算法 关系型数据库
认识 PostgreSQL 中与众不同的索引(二)|学习笔记
快速学习认识 PostgreSQL 中与众不同的索引(二)
170 0
认识 PostgreSQL 中与众不同的索引(二)|学习笔记
|
存储 SQL 关系型数据库
【MySQL从入门到精通】【高级篇】(二十九)覆盖索引的使用&索引下推
上一篇文章我们介绍了 【MySQL从入门到精通】【高级篇】(二十八)子查询优化,排序优化,GROUP BY优化和分页查询优化。这篇文章我们接着来介绍覆盖索引。
226 0
|
SQL 关系型数据库 MySQL
庖丁解牛|图解 MySQL 8.0 优化器查询转换篇
本篇介绍子查询、分析表和JOIN的复杂转换过程
588 0
庖丁解牛|图解 MySQL 8.0 优化器查询转换篇
|
SQL Cloud Native 关系型数据库
庖丁解牛-图解MySQL 8.0优化器查询转换篇
在《庖丁解牛-图解MySQL 8.0优化器查询解析篇》一文中我们重点介绍了MySQL最新版本8.0.25关于SQL基本元素表、列、函数、聚合、分组、排序等元素的解析、设置和转换过程,本篇我们继续来介绍更为复杂的子查询、分区表和JOIN的复杂转换过程。
庖丁解牛-图解MySQL 8.0优化器查询转换篇
|
SQL 算法 关系型数据库
[玩转MySQL之六]MySQL查询优化器
MySQL查询优化器的主要功能是完成SELECT语句执行,在保证SELECT语句正确执行之外,还有一个重要的功能,是使用关系代数、启发式规则、代价估值模型等不同种类的技术,提高语句的执行效率。本文将从整体上介绍MySQL查询优化器及其细节。
3583 0