标签
PostgreSQL , 统计信息 , 唯一值 , 枚举值
背景
PostgreSQL的列统计信息中包含一项高频词,同时包含一项唯一值个数。
pg_stats.n_distinct
pg_stats.most_common_vals
同时PostgreSQL允许用户自定义统计信息柱状图BUCKET的个数。
alter table ALTER [ COLUMN ] column_name SET STATISTICS integer
默认柱状图BUCKET个数为100
postgres=# show default_statistics_target ;
default_statistics_target
---------------------------
100
(1 row)
当唯一值的个数小于柱状图的BUCKET个数时,从高频词(pg_stats.most_common_vals)得到的实际上就是该列的唯一值的内容。
例子
使用一个函数,加上以上判断逻辑,用来直接获取某个表,某个列的唯一值的内容。
create or replace function get_distinct_value(
name, -- schema
name, -- table
name -- column
) returns text as $$
declare
v int;
res text;
begin
select
case attstattarget
when -1 then current_setting('default_statistics_target')::int -- 默认bucket
else attstattarget -- 自定义bucket
end into v
from pg_attribute where attrelid=($1||'.'||$2)::regclass and attname=$3 ;
select (most_common_vals)::text into res -- 高频词
from pg_stats where not inherited and schemaname = $1 and tablename = $2 and attname = $3 and n_distinct >=1 and n_distinct <= v;
return res;
end;
$$ language plpgsql strict;
1、创建测试表,写入测试数据
postgres=# create table tbl(id int,c1 int,c2 int);
CREATE TABLE
postgres=# insert into tbl select id, random()*100, random()*200 from generate_series(1,10000000) t(id);
INSERT 0 10000000
2、分析表
postgres=# analyze tbl;
ANALYZE
3、通过以上函数,获取c1列的唯一值内容。
postgres=# select * from get_distinct_value('public','tbl','c1') ;
get_distinct_value
--------------------
(1 row)
返回空,因为bucket个数小于n_distinct,所以得到的唯一值实际上是不准确的。因此返回了空,与逻辑相符。
4、修改C1,C2列的BUCKET为300,大于实际的唯一值个数
postgres=# alter table tbl alter COLUMN c1 set statistics 300;
ALTER TABLE
postgres=# alter table tbl alter COLUMN c2 set statistics 300;
ALTER TABLE
重新分析表
postgres=# analyze tbl;
ANALYZE
5、通过以上函数,获取c1, c2列的唯一值内容。 得到了正确的结果
postgres=# select * from get_distinct_value('public','tbl','c1') ;
get_distinct_value
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------
{56,83,61,67,37,8,53,66,17,44,29,54,9,15,85,49,80,96,58,22,47,68,62,30,77,34,64,23,39,11,5,24,42,7,3,32,57,73,88,40,93,13,91,86,51,89,81,43,71,35,45,48,84,16,87,50,27,99,4,25,38,14,41,72,78,95,74,76,18,94,28,90,2,12,79,21,70,36,52,75,31
,33,55,60,92,6,63,69,98,46,97,19,59,26,20,10,65,1,82,0,100}
(1 row)
postgres=# select * from get_distinct_value('public','tbl','c2') ;
get_distinct_value
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{75,132,83,91,82,69,4,74,195,35,157,124,178,13,61,134,70,153,31,190,3,150,84,135,140,50,114,152,193,119,147,116,143,199,11,45,68,94,183,5,93,51,130,32,98,46,118,47,52,79,146,6,39,101,154,162,191,66,148,165,187,21,67,72,103,54,73,122,106
,127,131,137,168,169,179,96,115,177,186,64,121,14,29,37,30,78,17,80,117,120,133,144,159,7,26,38,113,172,141,158,89,155,164,15,63,125,184,108,12,189,56,139,20,59,43,104,126,48,65,163,166,19,28,77,180,194,95,99,198,176,16,33,40,110,160,197
,49,100,167,2,8,18,55,156,188,36,53,88,90,142,185,27,42,85,1,109,25,62,57,107,123,76,112,86,10,71,128,149,175,24,34,92,173,181,60,97,136,151,41,58,111,161,192,9,81,87,105,182,23,129,138,145,171,44,174,196,22,170,102,200,0}
(1 row)
postgres=# select * from unnest(get_distinct_value('public','tbl','c2')::int[]) ;
unnest
--------
75
132
83
91
82
69
4
74
... ...
(201 rows)
小结
1、由于统计信息并不是实时更新的,所以使用本文提到的方法,并不一定能得到实时的准确结果。但是可以在执行前ANALYZE一下,那么得到的结果就非常准确。
2、当时有本文提到的方法返回的结果为NULL时,再使用select x from tbl group by x;
来返回唯一值。
统计信息中有很多有趣的内容,比如还可以用来
1、评估高频词,例如我们的表里面存储了APP的下周记录,我们通过高频词可以评估出热门的APP。
2、通过explain还可以用来评估SQL的返回记录数。用于分页评估。
《妙用explain Plan Rows快速估算行 - 分页数估算》
参考
《PostgreSQL 11 preview - 表达式索引柱状图buckets\STATISTICS\default_statistics_target可设置》