标签
PostgreSQL , Oracle , 稠化报表 , partition by outer join
背景
背景介绍:借用Oracle的一篇例子:
http://blog.sina.com.cn/s/blog_4cef5c7b01016lm5.html
在数据库表中,存储的数据经常是稀疏数据(sparse data),而不是稠密数据(dense data)。
先来了解一下什么是稀疏数据,比如一个产品销售情况表(比如有产品名、销售时间(精确到年月)、销售量3个列),假设某个时间某些产品它没有销售,一般也不会将这些产品的销售量存储为0,而是不存储,这样在产品销售情况表中就会产生很多缺失的行(gap rows),导致的结果就是特定产品销售数据按时间维度进行排序,是不连续的,或者说此产品销售在时间序列上是有缺失的。顾名思义,稠密数据是相对于稀疏数据来说的,还是用上面的假设说明,也就是说产品在某个时间没有销售,也必须存储此产品销售情况,销售量置0存储,这样对某个特定产品来说它在时间序列就是连续的,但是事实经常不是如此,所以才有将稀疏数据稠密化的过程,数据稠密化在数据仓库应用中很常见。
例如以下报表,由于某些商品在某些月份没有销售,所以不会有记录。
postgres=# create table t_sales(year text, month text, prod name, sales int);
CREATE TABLE
postgres=# insert into t_sales values ('2008', '1', 'a', 1000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '1', 'b', 1500);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '2', 'a', 2000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '2', 'b', 3000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '2', 'c', 1000);
INSERT 0 1
postgres=# insert into t_sales values ('2008', '3', 'a', 3000);
INSERT 0 1
postgres=# select * from t_sales;
year | month | prod | sales
------+-------+------+-------
2008 | 1 | a | 1000
2008 | 1 | b | 1500
2008 | 2 | a | 2000
2008 | 2 | b | 3000
2008 | 2 | c | 1000
2008 | 3 | a | 3000
(6 rows)
Oracle 通过partition by outer join语法可以方便的实现稠化。
SELECT .....
FROM table_reference
PARTITION BY (expr [, expr ]... )
RIGHT OUTER JOIN table_reference
SELECT .....
FROM table_reference
LEFT OUTER JOIN table_reference
达到类似这样的效果
postgres=# select * from continue_out();
year | month | prod | sales
------+-------+------+-------
2008 | 2 | a | 2000
2008 | 2 | b | 3000
2008 | 2 | c | 1000
2008 | 3 | a | 3000
2008 | 3 | b | 0
2008 | 3 | c | 0
2008 | 1 | a | 1000
2008 | 1 | b | 1500
2008 | 1 | c | 0
(9 rows)
PostgreSQL Partition By Outer Join实现
通过UDF,可以达到类似的效果:
create or replace function continue_out () returns setof t_sales as $$
declare
v_year text;
v_month text;
begin
-- 按年、月为分区 补齐prod (select distinct year, month from t_sales)
-- 如果已有元表,则直接取元表
-- 如无元表,对于连续型数据,亦可以使用PG的generate_series直接生成。
for v_year, v_month in
select distinct year, month from t_sales
loop
-- 全量 prod (select distinct prod from t_sales)
-- 如已有元表,则直接取元表
-- 如无元表,对于连续型数据,亦可以使用PG的generate_series直接生成。
return query
select
coalesce(t1.year,v_year),
coalesce(t1.month,v_month),
t2.prod,
coalesce(t1.sales, 0)
from
t_sales t1
right outer join
(select distinct prod from t_sales) t2
on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month);
end loop;
end;
$$ language plpgsql strict;
如下
postgres=# select * from continue_out();
year | month | prod | sales
------+-------+------+-------
2008 | 2 | a | 2000
2008 | 2 | b | 3000
2008 | 2 | c | 1000
2008 | 3 | a | 3000
2008 | 3 | b | 0
2008 | 3 | c | 0
2008 | 1 | a | 1000
2008 | 1 | b | 1500
2008 | 1 | c | 0
(9 rows)
通过窗口,可以实现同比,从而实现快速的搜索到销量逆增长的商品。
postgres=# select *, sales - coalesce(lag(sales) over w1, sales) from continue_out() window w1 as (partition by prod order by year,month);
year | month | prod | sales | ?column?
------+-------+------+-------+----------
2008 | 1 | a | 1000 | 0
2008 | 2 | a | 2000 | 1000
2008 | 3 | a | 3000 | 1000
2008 | 1 | b | 1500 | 0
2008 | 2 | b | 3000 | 1500
2008 | 3 | b | 0 | -3000
2008 | 1 | c | 0 | 0
2008 | 2 | c | 1000 | 1000
2008 | 3 | c | 0 | -1000
(9 rows)
详细执行计划如下:
postgres=# load 'auto_explain';
LOAD
Time: 22.171 ms
postgres=# set auto_explain.
auto_explain.log_analyze auto_explain.log_format auto_explain.log_nested_statements auto_explain.log_triggers auto_explain.sample_rate
auto_explain.log_buffers auto_explain.log_min_duration auto_explain.log_timing auto_explain.log_verbose
postgres=# set auto_explain.log_analyze =on;
SET
Time: 0.272 ms
postgres=# set auto_explain.log_buffers =on;
SET
Time: 0.257 ms
postgres=# set auto_explain.log_min_duration =0;
SET
Time: 0.279 ms
postgres=# set auto_explain.log_nested_statements =on;
SET
Time: 0.262 ms
postgres=# set auto_explain.log_timing =on;
SET
Time: 0.269 ms
postgres=# set auto_explain.log_verbose =on;
SET
Time: 0.245 ms
postgres=# set client_min_messages ='log';
SET
Time: 0.265 ms
postgres=# select *, sales - coalesce(lag(sales) over w1, sales) from continue_out() window w1 as (partition by prod order by year,month);
LOG: duration: 0.030 ms plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1
right outer join
(select distinct prod from t_sales) t2
on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join (cost=34.04..38.80 rows=200 width=132) (actual time=0.024..0.027 rows=3 loops=1)
Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
Hash Cond: (t_sales.prod = t1.prod)
Buffers: shared hit=2
-> HashAggregate (cost=16.38..18.38 rows=200 width=64) (actual time=0.009..0.009 rows=3 loops=1)
Output: t_sales.prod
Group Key: t_sales.prod
Buffers: shared hit=1
-> Seq Scan on public.t_sales (cost=0.00..15.10 rows=510 width=64) (actual time=0.002..0.003 rows=6 loops=1)
Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
Buffers: shared hit=1
-> Hash (cost=17.65..17.65 rows=1 width=132) (actual time=0.007..0.007 rows=3 loops=1)
Output: t1.year, t1.month, t1.sales, t1.prod
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=1
-> Seq Scan on public.t_sales t1 (cost=0.00..17.65 rows=1 width=132) (actual time=0.003..0.004 rows=3 loops=1)
Output: t1.year, t1.month, t1.sales, t1.prod
Filter: ((t1.year = $2) AND (t1.month = $3))
Rows Removed by Filter: 3
Buffers: shared hit=1
LOG: duration: 0.020 ms plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1
right outer join
(select distinct prod from t_sales) t2
on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join (cost=34.04..38.80 rows=200 width=132) (actual time=0.016..0.018 rows=3 loops=1)
Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
Hash Cond: (t_sales.prod = t1.prod)
Buffers: shared hit=2
-> HashAggregate (cost=16.38..18.38 rows=200 width=64) (actual time=0.008..0.009 rows=3 loops=1)
Output: t_sales.prod
Group Key: t_sales.prod
Buffers: shared hit=1
-> Seq Scan on public.t_sales (cost=0.00..15.10 rows=510 width=64) (actual time=0.003..0.004 rows=6 loops=1)
Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
Buffers: shared hit=1
-> Hash (cost=17.65..17.65 rows=1 width=132) (actual time=0.004..0.004 rows=1 loops=1)
Output: t1.year, t1.month, t1.sales, t1.prod
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=1
-> Seq Scan on public.t_sales t1 (cost=0.00..17.65 rows=1 width=132) (actual time=0.002..0.003 rows=1 loops=1)
Output: t1.year, t1.month, t1.sales, t1.prod
Filter: ((t1.year = $2) AND (t1.month = $3))
Rows Removed by Filter: 5
Buffers: shared hit=1
LOG: duration: 0.019 ms plan:
Query Text: select coalesce(t1.year,v_year), coalesce(t1.month,v_month), t2.prod, coalesce(t1.sales, 0) from t_sales t1
right outer join
(select distinct prod from t_sales) t2
on (t1.prod=t2.prod and t1.year=v_year and t1.month=v_month)
Hash Left Join (cost=34.04..38.80 rows=200 width=132) (actual time=0.014..0.017 rows=3 loops=1)
Output: COALESCE(t1.year, $2), COALESCE(t1.month, $3), t_sales.prod, COALESCE(t1.sales, 0)
Hash Cond: (t_sales.prod = t1.prod)
Buffers: shared hit=2
-> HashAggregate (cost=16.38..18.38 rows=200 width=64) (actual time=0.007..0.008 rows=3 loops=1)
Output: t_sales.prod
Group Key: t_sales.prod
Buffers: shared hit=1
-> Seq Scan on public.t_sales (cost=0.00..15.10 rows=510 width=64) (actual time=0.002..0.003 rows=6 loops=1)
Output: t_sales.year, t_sales.month, t_sales.prod, t_sales.sales
Buffers: shared hit=1
-> Hash (cost=17.65..17.65 rows=1 width=132) (actual time=0.004..0.004 rows=2 loops=1)
Output: t1.year, t1.month, t1.sales, t1.prod
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=1
-> Seq Scan on public.t_sales t1 (cost=0.00..17.65 rows=1 width=132) (actual time=0.001..0.003 rows=2 loops=1)
Output: t1.year, t1.month, t1.sales, t1.prod
Filter: ((t1.year = $2) AND (t1.month = $3))
Rows Removed by Filter: 4
Buffers: shared hit=1
LOG: duration: 0.045 ms plan:
Query Text: select distinct year, month from t_sales
HashAggregate (cost=17.65..19.65 rows=200 width=64) (actual time=0.040..0.041 rows=3 loops=1)
Output: year, month
Group Key: t_sales.year, t_sales.month
Buffers: shared hit=1
-> Seq Scan on public.t_sales (cost=0.00..15.10 rows=510 width=64) (actual time=0.028..0.029 rows=6 loops=1)
Output: year, month, prod, sales
Buffers: shared hit=1
LOG: duration: 0.832 ms plan:
Query Text: select *, sales - coalesce(lag(sales) over w1, sales) from continue_out() window w1 as (partition by prod order by year,month);
WindowAgg (cost=60.08..85.08 rows=1000 width=136) (actual time=0.808..0.818 rows=9 loops=1)
Output: year, month, prod, sales, (sales - COALESCE(lag(sales) OVER (?), sales))
Buffers: shared hit=9
-> Sort (cost=60.08..62.58 rows=1000 width=132) (actual time=0.803..0.804 rows=9 loops=1)
Output: year, month, prod, sales
Sort Key: continue_out.prod, continue_out.year, continue_out.month
Sort Method: quicksort Memory: 26kB
Buffers: shared hit=9
-> Function Scan on public.continue_out (cost=0.25..10.25 rows=1000 width=132) (actual time=0.782..0.783 rows=9 loops=1)
Output: year, month, prod, sales
Function Call: continue_out()
Buffers: shared hit=9
year | month | prod | sales | ?column?
------+-------+------+-------+----------
2008 | 1 | a | 1000 | 0
2008 | 2 | a | 2000 | 1000
2008 | 3 | a | 3000 | 1000
2008 | 1 | b | 1500 | 0
2008 | 2 | b | 3000 | 1500
2008 | 3 | b | 0 | -3000
2008 | 1 | c | 0 | 0
2008 | 2 | c | 1000 | 1000
2008 | 3 | c | 0 | -1000
(9 rows)
Time: 1.362 ms
PostgreSQL的UDF非常强大,可以实现很多功能。
当然也期待PostgreSQL直接将partition by语法引入,那就更加方便了。
参考
https://www.postgresql.org/docs/10/static/functions-window.html