在测试tpch时,发现有一些SQL跑超时,原因是测试SQL中有一些JOIN查询,而这些JOIN的列数据类型不一致,导致无法使用索引,或者无法使用hash join。
例子:
int和int8,都是整型,只是长度不一样。关联时,会遇到性能问题,因为不能使用HASH JOIN。
digoal=# create table t(id int);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id int8);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# insert into t select generate_series(1,10000);
INSERT 0 10000
digoal=# insert into t1 select generate_series(1,10000);
INSERT 0 10000
关联字段为int和int8,不能使用hash join,只能用nestloop。
digoal=# explain select * from t ,t1 where t.id=t1.id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice2; segments: 22) (cost=0.00..218902.00 rows=10000 width=12)
-> Nested Loop (cost=0.00..218902.00 rows=455 width=12)
-> Broadcast Motion 22:22 (slice1; segments: 22) (cost=0.00..2422.00 rows=10000 width=4)
-> Seq Scan on t (cost=0.00..122.00 rows=455 width=4)
-> Index Scan using idx_t1 on t1 (cost=0.00..0.97 rows=1 width=8)
Index Cond: t.id = t1.id
Settings: enable_seqscan=off
Optimizer status: legacy query optimizer
(8 rows)
即使强制关掉nestloop也不行。
digoal=# set enable_nestloop=off;
SET
digoal=# explain select * from t ,t1 where t.id=t1.id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice2; segments: 22) (cost=0.00..218902.00 rows=10000 width=12)
-> Nested Loop (cost=0.00..218902.00 rows=455 width=12)
-> Broadcast Motion 22:22 (slice1; segments: 22) (cost=0.00..2422.00 rows=10000 width=4)
-> Seq Scan on t (cost=0.00..122.00 rows=455 width=4)
-> Index Scan using idx_t1 on t1 (cost=0.00..0.97 rows=1 width=8)
Index Cond: t.id = t1.id
Settings: enable_nestloop=off; enable_seqscan=off
Optimizer status: legacy query optimizer
(8 rows)
和数据量当然并没有关系。
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t ,t1 where t.id=t1.id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice2; segments: 22) (cost=0.00..218902.00 rows=10000 width=12)
-> Nested Loop (cost=0.00..218902.00 rows=455 width=12)
-> Broadcast Motion 22:22 (slice1; segments: 22) (cost=0.00..2422.00 rows=10000 width=4)
-> Seq Scan on t (cost=0.00..122.00 rows=455 width=4)
-> Index Scan using idx_t1 on t1 (cost=0.00..0.97 rows=1 width=8)
Index Cond: t.id = t1.id
Settings: enable_nestloop=off; enable_seqscan=off
Optimizer status: legacy query optimizer
(8 rows)
接下来,使用同样的数据类型,可以用HASH JOIN关联。性能大幅提升。
digoal=# create table t2(id int8);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t2 select * from t1;
INSERT 0 10010000
digoal=# create index idx_t2 on t2(id);
CREATE INDEX
digoal=# explain select * from t2 join t1 on( t2.id=t1.id );
QUERY PLAN
------------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice1; segments: 22) (cost=236320.07..497677.13 rows=10013203 width=16)
-> Hash Join (cost=236320.07..497677.13 rows=455146 width=16)
Hash Cond: t1.id = t2.id
-> Seq Scan on t1 (cost=0.00..111158.22 rows=455161 width=8)
-> Hash (cost=111155.03..111155.03 rows=455146 width=8)
-> Seq Scan on t2 (cost=0.00..111155.03 rows=455146 width=8)
Optimizer status: legacy query optimizer
(7 rows)
另一个例子,当char长度不一样,不会有以上类似的问题。
digoal=# create table t(id char(10));
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id char(20));
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t join t1 on( t.id=t1.id );
QUERY PLAN
-----------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice1; segments: 22) (cost=240352.38..503423.59 rows=9975039 width=32)
-> Hash Join (cost=240352.38..503423.59 rows=453411 width=32)
Hash Cond: t.id = t1.id
-> Seq Scan on t (cost=0.00..113396.58 rows=454303 width=11)
-> Hash (cost=115664.39..115664.39 rows=453411 width=21)
-> Seq Scan on t1 (cost=0.00..115664.39 rows=453411 width=21)
Optimizer status: legacy query optimizer
(7 rows)
digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# explain select * from t join t1 on( t.id=t1.id );
QUERY PLAN
-----------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice1; segments: 22) (cost=240352.38..503423.59 rows=9975039 width=32)
-> Hash Join (cost=240352.38..503423.59 rows=453411 width=32)
Hash Cond: t.id = t1.id
-> Seq Scan on t (cost=0.00..113396.58 rows=454303 width=11)
-> Hash (cost=115664.39..115664.39 rows=453411 width=21)
-> Seq Scan on t1 (cost=0.00..115664.39 rows=453411 width=21)
Optimizer status: legacy query optimizer
(7 rows)
digoal=# explain select * from t join t1 on( t.id=t1.id and t.id='1');
QUERY PLAN
--------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..401.54 rows=4 width=32)
-> Nested Loop (cost=0.00..401.54 rows=1 width=32)
-> Index Scan using idx_t on t (cost=0.00..200.70 rows=1 width=11)
Index Cond: id = '1'::bpchar
-> Index Scan using idx_t1 on t1 (cost=0.00..200.83 rows=1 width=21)
Index Cond: '1'::bpchar = id
Optimizer status: legacy query optimizer
(7 rows)
varchar和char(n)也不存在问题。
postgres=# create table t(id varchar);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# create table t1(id char(10));
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# explain select * from t , t1 where t.id=t1.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Gather Motion 22:1 (slice2; segments: 22) (cost=0.01..0.07 rows=4 width=76)
-> Hash Join (cost=0.01..0.07 rows=1 width=76)
Hash Cond: t.id::bpchar = t1.id
-> Redistribute Motion 22:22 (slice1; segments: 22) (cost=0.00..0.02 rows=1 width=32)
Hash Key: t.id::bpchar
-> Seq Scan on t (cost=0.00..0.00 rows=1 width=32)
-> Hash (cost=0.00..0.00 rows=1 width=44)
-> Seq Scan on t1 (cost=0.00..0.00 rows=1 width=44)
Optimizer status: legacy query optimizer
(9 rows)
数据类型和查询条件的类型不一致时,不走索引的例子。
digoal=# explain select * from t where id=1;
QUERY PLAN
---------------------------------------------------------------------------------------
Gather Motion 22:1 (slice1; segments: 22) (cost=0.00..163369.87 rows=9995 width=11)
-> Seq Scan on t (cost=0.00..163369.87 rows=455 width=11)
Filter: id::text = '1'::text
Optimizer status: legacy query optimizer
(4 rows)
digoal=# explain select * from t where id='1';
QUERY PLAN
-------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..200.70 rows=1 width=11)
-> Index Scan using idx_t on t (cost=0.00..200.70 rows=1 width=11)
Index Cond: id = '1'::bpchar
Optimizer status: legacy query optimizer
(4 rows)
显示转换后,可以走索引。
digoal=# explain select * from t where id=1::bpchar;
QUERY PLAN
-------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..200.70 rows=1 width=11)
-> Index Scan using idx_t on t (cost=0.00..200.70 rows=1 width=11)
Index Cond: id = '1'::bpchar
Optimizer status: legacy query optimizer
(4 rows)
greenplum因为用了早期的postgresql版本,所以这一方面优化器并不好,在使用greenplum时,需要注意一下。
greenplum将合并到postgresql 9.5的版本,以上问题可以消除。
postgresql 9.4的例子:
digoal=# create table t(id int);
CREATE TABLE
digoal=# create table t1(id int8);
CREATE TABLE
digoal=# insert into t select generate_series(1,100000);
INSERT 0 100000
digoal=# insert into t1 select generate_series(1,100000);
INSERT 0 100000
digoal=# explain select * from t,t1 where t.id=t1.id;
QUERY PLAN
----------------------------------------------------------------------
Hash Join (cost=2693.00..6136.00 rows=100000 width=12)
Hash Cond: (t.id = t1.id)
-> Seq Scan on t (cost=0.00..1443.00 rows=100000 width=4)
-> Hash (cost=1443.00..1443.00 rows=100000 width=8)
-> Seq Scan on t1 (cost=0.00..1443.00 rows=100000 width=8)
(5 rows)