【重新发现PostgreSQL之美】- 20 为什么啤酒&纸尿裤最搭

本文涉及的产品
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
简介: 大家好,这里是重新发现PostgreSQL之美 - 20 为什么啤酒&纸尿裤最搭

背景


场景:

  • 电商、零售等行业, 根据用户购物车、订单等数据找到最合理的搭配组合. 用于引导营销策略.
  • 或者以用户最近N笔或N天的订单内的所有商品作为一个大的group
  • 根据用户评论涉及的关键词, 找到最佳搭配关键词. 用于引导品牌策略.

挑战:

  • 每一个组合都是一组商品、标签或关键词. 相当于需要从现有的海量组合中找到高频组合(最搭组合).
  • 传统数据库不支持多值列(数组), 展开成多条记录数据量至少上升1个量级, 而且需要Join聚合才能得到最佳组合, 效率极差.
  • 在划窗分析需求中, 需要大量历史数据的基础进行计算, 性能

PG解决方案:

  • 1、内置数组, 数据量节省至少1个量级.
  • 2、内置数组倒排索引, 快速定位想要得到的搭配组合.
  • 3、内置数组的元素级别统计信息, 可以利用统计信息快速定位到最佳组合.
  • 4、datasketches 近似解, 支持海量数据毫秒级别实时输出搭配组合.
  • 5、使用topn的方法, 可以每天为每个商品存储1条记录, 这样就能实行实时滑窗分析, 也是传统数据库无法高效实现的.

例子


1、购物车、订单场景
每个元素都有标签, 在一起就形成了标签集合


         
...
订单xxx : [商品1, 商品2, ...] -> [标签1, 标签2, ...]

2、评论场景


         
...
评论xxx : [分词1, 分词2, ...] -> [标签2, 标签3, ...]

以购物车、订单场景为

1、传统数据库表结构:
每个订单需要N条记录(取决于订单内有多少商品)


         
order_id int8,  -- 订单id
itemid int  -- 商品id
);

传统结构找到商品ID=1的最佳TOP 10组合.


         
(select order_id from t where itemid=1)
group by itemid order by count(*) desc limit 11;

2PG 数据库, 一条订单只需要一条记录:


         
order_id serial8 primary key,  -- 订单id
itemid int[]  -- 商品id数组
);

3、生成测试数据: 使用长尾分布, 容易观察到近似搜索的效果


         
\set o2 random_zipfian(10,100000,1.2)
\set o3 random_zipfian(20,100000,1.2)
\set o4 random_zipfian(30,100000,1.2)
\set o5 random_zipfian(40,100000,1.2)
\set o6 random_zipfian(50,100000,1.2)
\set o7 random_zipfian(60,100000,1.2)
\set o8 random_zipfian(70,100000,1.2)
\set o9 random_zipfian(80,100000,1.2)
\set o10 random_zipfian(90,100000,1.2)
insert into t (itemid) values (array[:o1,:o2,:o3,:o4,:o5,:o6,:o7,:o8,:o9,:o10]::int[]);

         

360万个订单


         
count
---------
3600000
(1 row)

4、对itemid 创建gin索引


         

商品ID 1和哪10个商品最搭配?

传统方法:


         
order_id int8,  -- 订单id
itemid int  -- 商品id
);

         
create index idx_tt on tt (order_id);

         
insert into tt select order_id, unnest(itemid) from t;

         
select count(*) from tt;

         
-- 展开商品后记录数比PG多了10倍

传统结构找到商品ID=1的最佳TOP 10组合, 使用了hash aggindex的情况下需要4.8.


         
(select order_id from tt where itemid=1)
group by itemid order by count(*) desc limit 11;

         

         
itemid | count
--------+--------
1 | 706647
90 | 157916
80 | 157076
70 | 156776
60 | 155228
50 | 153810
40 | 152431
30 | 149441
20 | 146573
10 | 139074
91 |  78262
(11 rows)

         
Time: 4832.984 ms (00:04.833)

方法1:

PG可以自定义一个新的聚合函数array_aggn(anyarray, order, n), 对数组进行聚合, 得到一个二维数组, 返回按元素出现频率的前N或后N个元素.
[元素][元素出现频率]


         

这条SQL可以通过GIN索引快速定位到包含1的订单记录, 同时对订单中的商品进行聚合,返回元素出现频率最高的前5个元素以及频率.

例如得到


         

自定义聚合的方法:

PostgreSQL 10 自定义并行计算聚合函数的原理与实践- (array_agg合并多个数组为单个一元数组的
例子)

[PostgreSQL Oracle 兼容性之- 自定义并行聚合函数PARALLEL_ENABLE AGGREGATE](../201803/2018031
2_03.md)

PostgreSQL并行计算解说9 - parallel 自定义并行聚合》

除了用array来存储商品和标签, 也可以使用roaringbitmap来存储, 同样需要对rb类型新增一个聚合函数, 返回元素和元素出现频率.

使用rb存储的好处是压缩存储, 效率更高一点.

PostgreSQL pg_roaringbitmap - 用户画像、标签、高效检索》

https://github.com/ChenHuajun/pg_roaringbitmap

方法2:

用统计信息得到近似解.

例如要计算商品ID=1的商品和哪些商品最搭


         

         
SELECT 706647

         
Time: 926.742 ms

         
itemid
----------------------------------------
{1,15,616,652,40,122,124,74,47543,167}
{1,54477,112,35,72,50,76,70,91,103}
{1,373,338,31,41,50,305,341,82,3478}
{1,10,20,33,66,97,42797,151,205,316}
{1,69,49,106,52,55,63,7390,315,90}
{1,12,38,31,48,77,68,751,217,215}
{1,10,147,92,41,69,77,74,81,90}
{1,17,21,37,114,68609,72,73,1098,144}
{1,37,21,44,44,50,84,70,3202,97}
{1,5483,25,84,91,56,68,6944,332,190}
(10 rows)

统计信息内有数组的element统计, 如下


         
View "pg_catalog.pg_stats"
Column         |   Type   | Collation | Nullable | Default
------------------------+----------+-----------+----------+---------
schemaname             | name     |           |          |
tablename              | name     |           |          |
attname                | name     |           |          |
inherited              | boolean  |           |          |
null_frac              | real     |           |          |
avg_width              | integer  |           |          |
n_distinct             | real     |           |          |
most_common_vals       | anyarray |           |          |
most_common_freqs      | real[]   |           |          |
histogram_bounds       | anyarray |           |          |
correlation            | real     |           |          |
most_common_elems      | anyarray |           |          |
most_common_elem_freqs | real[]   |           |          |
elem_count_histogram   | real[]   |           |          |

收集统计信息


         
ANALYZE
Time: 15.658 ms

查询购物车数组字段的统计信息详情


         
-[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
most_common_elems      | {1,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,115,116,117,119,120,122,123}
most_common_elem_freqs | {1,0.20272727,0.08181818,0.053939395,0.03757576,0.03272727,0.02,0.021818181,0.015757576,0.013939394,0.20787878,0.10757576,0.05666667,0.041818183,0.03212121,0.030909091,0.023333333,0.01878788,0.02030303,0.015454546,0.20484848,0.097575754,0.0660606,0.05272727,0.04060606,0.030303031,0.025757575,0.025454545,0.016969698,0.021212121,0.21848485,0.10242424,0.0669697,0.05727273,0.036666665,0.034848485,0.041212123,0.026969697,0.022121212,0.017878788,0.21878788,0.10181818,0.06757576,0.04909091,0.043636363,0.04060606,0.035454545,0.03181818,0.022424242,0.028181817,0.21939394,0.10060606,0.06878788,0.06333333,0.042727273,0.036060605,0.028181817,0.033333335,0.03181818,0.026969697,0.22636363,0.095757574,0.07636364,0.055454545,0.04969697,0.043333333,0.035151515,0.033939395,0.028181817,0.026969697,0.20818181,0.11242424,0.07787879,0.062121212,0.048787877,0.043333333,0.035757575,0.036969695,0.03181818,0.03212121,0.22030303,0.10636364,0.0730303,0.05878788,0.04969697,0.03939394,0.03969697,0.034242425,0.03727273,0.035151515,0.02969697,0.024848485,0.026969697,0.02878788,0.021818181,0.016060606,0.02030303,0.01939394,0.016363636,0.014242425,0.015151516,0.016969698,0.021515151,0.014545455,0.016060606,0.013333334,0.014242425,0.013939394,0.013939394,0.015757576,0.013333334,1,0}

解读: PostgreSQL 9.2 add array elements statistics

获取TOP 10 的组合:

第一个元素是自己, 所以要输出11

PostgreSQL pg_stats used to estimate top N freps values and explain rows


         
(select row_number() over(partition by r) as rn,ele from (select unnest(most_common_elems::text::int[]) ele,2 as r from pg_stats where tablename='tmp_t' and attname='itemid') t) t1
join
(select row_number() over(partition by r) as rn,freq from (select unnest(most_common_elem_freqs) freq,2 as r from pg_stats where tablename='tmp_t' and attname='itemid') t) t2
on (t1.rn=t2.rn) order by t2.freq desc limit 11;

         
rn | ele | rn |    freq
----+-----+----+------------
1 |   1 |  1 |          1  -- 自己的出现概率 100%
52 |  60 | 52 | 0.22878788  -- 接下来的9个值符合pgbench使用的长尾分布随机生成算法
62 |  70 | 62 | 0.22030303
12 |  20 | 12 | 0.21848485
32 |  40 | 32 |  0.2169697
72 |  80 | 72 | 0.21636364
42 |  50 | 42 |       0.21
82 |  90 | 82 |       0.21
2 |  10 |  2 | 0.20545454
22 |  30 | 22 | 0.20060606
73 |  81 | 73 | 0.11515152  -- 这个值是第二梯队10个中的的任意一个(因为概率差别很小很小)
(11 rows)

         
Time: 1.852 ms

即商品ID 1和商品ID 60,70,20,40,80,50,90,10,30,81最搭配, 以及他们与商品1双双同时出现的概率.

验证近似值的准确度: 几乎100%
使用unnest可以得到真实值, 和近似值几乎完全一致.


         
count
--------
706647
(1 row)

         

         
select unnest(itemid) i , count(*)/706647.0 from tmp_t group by 1 order by 2 desc limit 11;

         
i  |        ?column?
----+------------------------
1 | 1.00000000000000000000  -- 自己的出现概率 100%
90 | 0.22347225701092624748  -- 接下来的9个值符合pgbench使用的长尾分布随机生成算法
80 | 0.22228354468355487252
70 | 0.22185900456663652432
60 | 0.21966837756333784761
50 | 0.21766171794403712179
40 | 0.21571024853993578123
30 | 0.21147899870798291085
20 | 0.20742039519024350206
10 | 0.19680830740100785824
91 | 0.11075119543421255592  -- 这个值是第二梯队10个中的的任意一个(因为概率差别很小很小)
(11 rows)

         
Time: 1204.824 ms (00:01.205)

使用统计信息的近似查询最佳组合方法, 查询性能提升1000.

方法3:

使用topn插件

生成订单、购物车的时候实时计算. 存储到一种新的数据类型: 近似值类型.

https://github.com/pipelinedb/pipelinedb/blob/master/pipelinedb--1.0.0.sql

https://github.com/apache/datasketches-postgresql

https://github.com/citusdata/postgresql-topn

https://github.com/ozturkosu/cms_topn

topn为例:


         

         
postgres=# show topn.number_of_counters ;
topn.number_of_counters
-------------------------
1000
(1 row)

         
Time: 0.192 ms

每个商品只需要存储1条记录(即统计结果记录)


         
itemid int primary key,
groups jsonb
);

         
insert into tmp_topn select 1, topn_add_agg(i::text) from (
select unnest(itemid) i from t where itemid @> array[1]
) t;

         
INSERT 0 1
Time: 1977.195 ms (00:01.977)

         
postgres=# select (topn(groups,11)).* from tmp_topn where itemid=1;
item | frequency
------+-----------
1    |    706647
90   |    157916
80   |    157076
70   |    156776
60   |    155228
50   |    153810
40   |    152431
30   |    149441
20   |    146573
10   |    139074
91   |     78262
(11 rows)

         
Time: 0.706 ms

当产生新的订单时, 这个订单有N个商品, 那么在以上表中, 增量更新N条记录即可.


         
select unnest(itemid) i from t where itemid @> array[1]
) t
on conflict (itemid) do update set
groups = topn_union(tmp_topn.groups,excluded.groups);

         
INSERT 0 1
Time: 1957.321 ms (00:01.957)

         
item | frequency
------+-----------
1    |   1413294
90   |    315832
80   |    314152
70   |    313552
60   |    310456
50   |    307620
40   |    304862
30   |    298882
20   |    293146
10   |    278148
91   |    156524
(11 rows)

         
Time: 0.706 ms

性能同样是1000倍提升, 而且随着数据量增加, 性能提升会更加明显.

使用topn的方法, 可以每天为每个商品存储1条记录, 这样就能实行实时滑窗分析, 也是传统数据库无法高效实现的.


         
-- topn_union_agg用于划窗聚合

         
select topn(topn_union_agg(groups),11) from xx
where ts >= '2021-11-01' and ts < '2021-11-12'
and itemid=1 ;

方法4:

流计算, , pipelinedb已停更, 团队加入confluent, 可以折腾的小伙伴可以继续维护pipelinedb.
目前也能通过任务调度+方法3来实现增量.
又或者使用timescaledb的持续聚合功能. https://docs.timescale.com/api/latest/continuous-aggregates/create_materialized_view/

参考

1topk 算法论文
https://www.hlt.inesc-id.pt/~fmmb/wiki/uploads/Work/dict.refd.pdf

https://pipelinedb-doc-cn.readthedocs.io/zh_CN/latest/builtin.html#top-k

2、流计算
https://github.com/pipelinedb/pipelinedb/blob/master/pipelinedb--1.0.0.sql

3、持续聚合
https://docs.timescale.com/api/latest/continuous-aggregates/create_materialized_view/

4、近似计算
https://github.com/apache/datasketches-postgresql

https://github.com/citusdata/postgresql-topn

https://github.com/ozturkosu/cms_topn

5
《为什么啤酒和纸尿裤最搭- HybridDB/PostgreSQL查询商品营销最佳组合》

6、电商行业流量估算
https://finance.sina.com.cn/tech/2020-11-12/doc-iiznezxs1360582.shtml
11
1日至11120:00,天猫双11总交易额达4982亿元。实时数据显示,天猫双11期间,成交额突破1亿元的品牌超过450个。
11
1123点,2020天猫双11全球狂欢季实时物流订单量破22.5亿单, 平均每天1亿单左右,这个数字约等于2010年全年中国快递量的总和。

7、统计信息解读
解读: PostgreSQL 9.2 add array elements statistics

PostgreSQL pg_stats used to estimate top N freps values and explain rows

8、长尾模型
PostgreSQL pgbnech 支持长尾模型数据生成- 离散幂律概率分布- random_zipfian

9、自定义聚合的方法
PostgreSQL 10 自定义并行计算聚合函数的原理与实践- (array_agg合并多个数组为单个一元数组的
例子)

[PostgreSQL Oracle 兼容性之- 自定义并行聚合函数PARALLEL_ENABLE AGGREGATE](../201803/2018031
2_03.md)

PostgreSQL并行计算解说9 - parallel 自定义并行聚合》

10、滑窗分析
PostgreSQL count-min sketch top-n 概率计算插件cms_topn (结合窗口实现同比、环比、滑窗分析等) - 流计算核心功能之一》

 

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
相关文章
|
存储 SQL 缓存
黑马PostgreSQL,为何一黑到底
黑马PostgreSQL,为何一黑到底
325 0
黑马PostgreSQL,为何一黑到底
|
关系型数据库 测试技术 OLTP
【重新发现PostgreSQL之美】- 10 内卷 & 大禹治水
大家好,这里是重新发现PostgreSQL之美 - 10 内卷 & 大禹治水
|
SQL Oracle 关系型数据库
【重新发现PostgreSQL之美】- 32 天不怕地不怕, 就怕老板问为什么?
大家好,这里是重新发现PostgreSQL之美 - 32 天不怕地不怕, 就怕老板问为什么?
|
存储 负载均衡 搜索推荐
【重新发现PostgreSQL之美】- 23 彭祖的长寿秘诀
大家好,这里是重新发现PostgreSQL之美 - 23 彭祖的长寿秘诀
|
SQL Oracle 小程序
【重新发现PostgreSQL之美】- 39 谁动了我的奶酪
大家好,这里是重新发现PostgreSQL之美 - 39 谁动了我的奶酪
|
SQL 自然语言处理 运维
【重新发现PostgreSQL之美】- 45 个性化
大家好,这里是重新发现PostgreSQL之美 - 45 个性化
|
传感器 SQL 监控
【重新发现PostgreSQL之美】- 28 旋转门
大家好,这里是重新发现PostgreSQL之美 - 28 旋转门
|
算法 关系型数据库 PostgreSQL
【重新发现PostgreSQL之美】- 22 黄帝内经
大家好,这里是重新发现PostgreSQL之美 - 22 黄帝内经
|
SQL 算法 自动驾驶
【重新发现PostgreSQL之美】- 27 无中生有
大家好,这里是重新发现PostgreSQL之美 - 27 无中生有
|
安全 Oracle 关系型数据库
【重新发现PostgreSQL之美】- 36 方世玉 安全第一
大家好,这里是重新发现PostgreSQL之美 - 36 方世玉 安全第一