PostgreSQL 多维分析 CASE

本文涉及的产品
RDS PostgreSQL Serverless,0.5-4RCU 50GB 3个月
推荐场景:
对影评进行热评分析
云原生数据库 PolarDB 分布式版,标准版 2核8GB
云数据库 RDS SQL Server,基础系列 2核4GB
简介: 给企业做BI分析的开发小伙伴,可能经常会遇到这样的痛苦,运营人员今天问你要这样的维度报表,明天换个维度又来折腾你。对于开发的小伙伴,确实是非常痛苦的事情,那么有什么好的应对策略,而且对运营来说可能会显得比较高逼格呢? 多维分析派上用场,比如你的表有10个字段,允许运营人员以任意字段组合,产生报

昨天和小米的童鞋交流,听说的一个痛点。
也是很多给企业做BI分析的开发小伙伴,可能经常会遇到这样的痛苦,运营人员今天问你要这样的维度报表,明天换个维度又来"折腾"你。
对于开发的小伙伴,确实是非常痛苦的事情,那么有什么好的应对策略,而且对运营来说可能会显得比较高逼格呢?

多维分析派上用场,比如你的表有10个字段,允许运营人员以任意字段组合,产生报表。
很多商业数据库都带了这个功能,开源数据库带这个功能的不多。PostgreSQL真是业界良心啊~~~

例子
假设有4个业务字段,一个时间字段。

postgres=# create table tab5(c1 int, c2 int, c3 int, c4 int, crt_time timestamp);
CREATE TABLE

生成一批测试数据

postgres=# insert into tab5 select 
trunc(100*random()), 
trunc(1000*random()), 
trunc(10000*random()), 
trunc(100000*random()), 
clock_timestamp() + (trunc(10000*random())||' hour')::interval 
from generate_series(1,1000000);
INSERT 0 1000000

postgres=# select * from tab5 limit 10;
 c1 | c2  |  c3  |  c4   |          crt_time          
----+-----+------+-------+----------------------------
 72 |  46 | 3479 | 20075 | 2017-02-02 14:56:36.854218
 98 | 979 | 4491 | 83012 | 2017-06-13 08:56:36.854416
 54 | 758 | 5838 | 45956 | 2016-09-18 02:56:36.854427
  3 |  67 | 5148 | 74754 | 2017-01-01 01:56:36.854431
 42 | 650 | 7681 | 36495 | 2017-06-20 15:56:36.854435
  4 | 472 | 6454 | 19554 | 2016-06-18 19:56:36.854438
 82 | 922 |  902 | 17435 | 2016-07-21 14:56:36.854441
 68 | 156 | 1028 | 13275 | 2017-07-16 10:56:36.854444
  0 | 674 | 7446 | 59386 | 2016-07-26 09:56:36.854447
  0 | 629 | 2022 | 52285 | 2016-11-04 13:56:36.85445
(10 rows)

创建一个统计结果表, 其中bitmap表示统计的字段组合, 用位置符0,1表示是否统计了该维度

create table stat_tab5 (c1 int, c2 int, c3 int, c4 int, time1 text, time2 text, time3 text, time4 text, cnt int8, bitmap text);

生成业务字段任意维度组合+4组时间任选一组的组合统计
PS (如果业务字段有空的情况,建议统计时用coalesce转一下,确保不会统计到空的情况)

insert into stat_tab5
select c1,c2,c3,c4,t1,t2,t3,t4,cnt, 
'' || 
case when c1 is null then 0 else 1 end || 
case when c2 is null then 0 else 1 end || 
case when c3 is null then 0 else 1 end || 
case when c4 is null then 0 else 1 end || 
case when t1 is null then 0 else 1 end || 
case when t2 is null then 0 else 1 end || 
case when t3 is null then 0 else 1 end || 
case when t4 is null then 0 else 1 end
from 
(
select c1,c2,c3,c4,
to_char(crt_time, 'yyyy') t1, 
to_char(crt_time, 'yyyy-mm') t2, 
to_char(crt_time, 'yyyy-mm-dd') t3, 
to_char(crt_time, 'yyyy-mm-dd hh24') t4, 
count(*) cnt
from tab5 
group by 
cube(c1,c2,c3,c4), 
grouping sets(to_char(crt_time, 'yyyy'), to_char(crt_time, 'yyyy-mm'), to_char(crt_time, 'yyyy-mm-dd'), to_char(crt_time, 'yyyy-mm-dd hh24'))
)
t;

INSERT 0 49570486
Time: 172373.714 ms

在bitmap上创建索引方便取数据

create index idx_stat_tab5_bitmap on stat_tab5 (bitmap);

用户勾选几个维度,取出数据

c1,c3,c4,t3 = bitmap(10110010)

postgres=# select c1,c3,c4,time3,cnt from stat_tab5 where bitmap='10110010' limit 10;
 c1 | c3 |  c4   |   time3    | cnt 
----+----+-------+------------+-----
 41 |  0 | 30748 | 2016-06-04 |   1
 69 |  0 | 87786 | 2016-06-04 |   1
 70 |  0 | 38805 | 2016-06-04 |   1
 79 |  0 | 65892 | 2016-06-08 |   1
 51 |  0 | 13615 | 2016-06-11 |   1
 47 |  0 | 42196 | 2016-06-28 |   1
 45 |  0 | 54736 | 2016-07-01 |   1
 50 |  0 | 21605 | 2016-07-02 |   1
 46 |  0 | 40888 | 2016-07-16 |   1
 41 |  0 | 90258 | 2016-07-17 |   1
(10 rows)
Time: 0.528 ms

postgres=# select * from stat_tab5 where bitmap='00001000' limit 10;
 c1 | c2 | c3 | c4 | time1 | time2 | time3 | time4 |  cnt   |  bitmap  
----+----+----+----+-------+-------+-------+-------+--------+----------
    |    |    |    | 2016  |       |       |       | 514580 | 00001000
    |    |    |    | 2017  |       |       |       | 485420 | 00001000
(2 rows)
Time: 0.542 ms

执行计划,可以看到优雅的sort,一次sort多次rolldown, 不是简单的union all哦。

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Insert on stat_tab5  (cost=208059.84..142986926.23 rows=1536000000 width=184)
   ->  Subquery Scan on t  (cost=208059.84..142986926.23 rows=1536000000 width=184)
         ->  GroupAggregate  (cost=208059.84..35466926.23 rows=1536000000 width=152)
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2, tab5.c1
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3, tab5.c4
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3, tab5.c4
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               ->  Sort  (cost=208059.84..210559.84 rows=1000000 width=144)
                     Sort Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2, tab5.c1
                     ->  Seq Scan on tab5  (cost=0.00..26370.00 rows=1000000 width=144)
(93 rows)

后续的优化手段
.1. 分区表
因为统计维度多,所以统计结果是非常庞大的。
数据分区可以帮组用户解决查询效率的问题。
例如按bitmap分区之后,在每个分区表再按时间维度分区。

.2. 流式计算
使用pipelinedb结合cube和grouping sets,把以上的统计改成流式统计,可以提升用户体验,快速得到报表。

.3. 使用MPP产品来提升数据的存储量和计算能力,例如Greenplum。

.4. 使用PostgreSQL 9.6 的CPU并行技术,性能可以有明显的提升。

.5. 使用GPU加速,也可以有非常巨大的提升。

.6. 更大数据量可以使用Greenplum,支持MPP。 实现PB级别多维分析。

参考
https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-GROUPING-SETS

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
SQL 关系型数据库 数据库
PostgreSQL 设计优化case - 大宽表任意字段组合查询索引如何选择(btree, gin, rum) - (含单个索引列数超过32列的方法)
标签 PostgreSQL , adhoc查询 , 大宽表 , 任意字段组合查询 , 索引 , btree , gin , rum 背景 大宽表,任意字段组合查询,透视。是实时分析系统中的常见需求: 1、实时写入。
2685 0
|
存储 关系型数据库 PostgreSQL
PostgreSQL 优化CASE - 有序UUID插件
标签 PostgreSQL , uuid , 无序uuid , 索引分裂 , io , 性能诊断 背景 无序UUID会带来很多问题,例如索引分裂膨胀,离散IO,WAL膨胀等,详见以前的分析。 Regular random UUIDs are distributed uniformly over the whole range of possible values.
2413 0
|
SQL 存储 关系型数据库
PostgreSQL 设计优化case - 多对多 转 一对多(数组)
标签 PostgreSQL , 数组 , 多对多 , 一对多 , udf , JOIN 背景 某个系统存储了会员的标签,以及标签的描述信息。业务上需要通过会员ID得到会员的标签,再得到描述信息。 每个会员有若干标签,原来是这么存储的 1、会员标签表,人数5亿左右,每个人平均有几百个标签,1500亿行左右。
2224 0
|
SQL 算法 关系型数据库
PostgreSQL 10.1 手册_部分 II. SQL 语言_第 10 章 类型转换_10.5. UNION、CASE和相关结构
10.5. UNION、CASE和相关结构 SQL UNION结构必须使可能不相似的类型匹配成为一个单一的结果集。该决定算法被独立地应用到一个联合查询的每个输出列。 INTERSECT和EXCEPT采用和UNION相同的方法来决定不相似的类型。
1319 0

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版