【重新发现PostgreSQL之美】- 26 这个推荐算法价值1亿

本文涉及的产品
云原生数据库 PolarDB MySQL 版,通用型 2核8GB 50GB
云原生数据库 PolarDB PostgreSQL 版,标准版 2核4GB 50GB
简介: 大家好,这里是重新发现PostgreSQL之美 - 26 这个推荐算法价值1亿

背景


场景:
短视频、电商、社交等基于标签的动态推荐.

挑战:
标签多, 按标签权重比例每个标签取N, 与数据库的交互次数多.
已读列表巨大, 过滤已读采用传统not in资源消耗大.
效率低下.

PG解决方案:
效率提升1000, 可能价值1个亿.
技术点: 数组、自定义typepartial index降低无效过滤、hash subplan优化not inarray agg, jsonb

例子


以短视频推荐为例.

PG解决方

1、视频资源池

1.1、全国资源池


         
vid int8,  -- 视频ID
tag int,   -- 标签
score float4   -- 标签权重
);

写入数据: 1000万条记录: 100万个视频, 每个视频10个标签, 标签取值空间1-100.


         

1.2、省份资源池


         
pid int,    -- 省份ID
vid int8,   -- 视频ID
tag int,    -- 标签
score float4   -- 标签权重
);

写入数据: 1000万条记录: 100万个视频, 每个视频10个标签, 标签取值空间1-100.
pid
取值空间1-36


         

1.3、城市资源池


         
cid int,    -- 城市ID
vid int8,   -- 视频ID
tag int,    -- 标签
score float4   -- 标签权重
);

写入数据: 1000万条记录: 100万个视频, 每个视频10个标签, 标签取值空间1-100.
cid
取值空间1-10000


         

1.4partial index(分区索引), 避免过滤列表巨大带来巨大的无效扫描, 之前已经讲过
《重新发现PostgreSQL之美- 23 彭祖的长寿秘诀》


         
declare
begin
for i in 0..49 loop
execute format('create index idx_pool_global_%s on pool_global (tag,score desc) include (vid) where abs(mod(hashint8(vid),50))=%s', i,i);
execute format('create index idx_pool_province_%s on pool_province (pid,tag,score desc) include (vid) where abs(mod(hashint8(vid),50))=%s', i,i);
execute format('create index idx_pool_city_%s on pool_city (cid,tag,score desc) include (vid) where abs(mod(hashint8(vid),50))=%s', i,i);
end loop;
end;
$$;

2、用户标签类型


         
tag int,       -- 标签
score float4,  -- 标签权重
limits int     -- 用这个标签获取多少条VID
);

3、用户表


         
uid int8 primary key,    -- 用户ID
pid int,  -- 省份ID
cid int,  -- 城市ID
tag_scores1 tag_score[],    -- 标签、权重、对应标签获取多少条. 也可以使用jsonb存储
tag_scores2 tag_score[],    -- 标签、权重、对应标签获取多少条 limit = 0的放这个字段. 业务更新tag_scores根据两个字段的结果来计算. 主要是减少PG计算量.
readlist jsonb     -- 已读VID, 和分区索引的分区数匹配, 用jsonb数组表示. jsonb[0]表示abs(mod(hashint8(vid),50))=0的vid数组
);

写入数据: 1000万个用户, 每个用户20个标签(标签取值空间1-100), limit大于0的标签5(和为100). vid已读列表5万条(1-300万取值空间).
pid
取值空间1-36
cid
取值空间1-10000


         
select generate_series(1,10000000), ceil(random()*36), ceil(random()*10000),
array(
select row(ceil(random()*100), random()*10, 40)::tag_score
union all
select row(ceil(random()*100), random()*10, 20)::tag_score
union all
select row(ceil(random()*100), random()*10, 15)::tag_score
union all
select row(ceil(random()*100), random()*10, 15)::tag_score
union all
select row(ceil(random()*100), random()*10, 10)::tag_score
),
array (
select row(ceil(random()*100), random()*10, 0)::tag_score from generate_series(1,15)
),
(
select jsonb_agg(x) as readlist from
(
select array (select x from
(select ceil(random()*3000000)::int8 x from generate_series(1,50000)) t
where mod(x,50)=i
) x
from generate_series(0,49) i
) t
) ;

4、如何更新用户的标签权重, 对应标签获取多少条?
原则: 可以在程序中计算并且不会增加程序和数据库交互的, 放在程序内计算.
取出UID, 在应用程序中计算的到tag_scores结果, 存入数据库users.

5、获取推荐视频vids SQL:


         
(
select array_agg(vid) from
(
select vid from pool_global t1
where t1.tag=t.tag
and t1.vid not in (select jsonb_array_elements_text( readlist[0] )::int8 from users where uid=1)
and abs(mod(hashint8(vid),50)) = 0
order by t1.score desc
limit ceil(t.limits*0.5)
) x   -- 全国池 50%
) as global_pool,
(
select array_agg(vid) from
(
select vid from pool_province t1
where t1.tag=t.tag
and t1.pid=(select pid from users where uid=1)
and t1.vid not in (select jsonb_array_elements_text( readlist[0] )::int8 from users where uid=1)
and abs(mod(hashint8(vid),50)) = 0
order by t1.score desc
limit ceil(t.limits*0.3)
) x   -- 省份池 30%
) as province_pool,
(
select array_agg(vid) from
(
select vid from pool_city t1
where t1.tag=t.tag
and t1.cid=(select cid from users where uid=1)
and t1.vid not in (select jsonb_array_elements_text( readlist[0] )::int8 from users where uid=1)
and abs(mod(hashint8(vid),50)) = 0
order by t1.score desc
limit ceil(t.limits*0.2)
) x    -- 城市池 20%
) as city_pool
from
(
select (unnest(tag_scores1)).tag as tag, (unnest(tag_scores1)).limits as limits from
users where uid=1
) t;

         
global_pool   | {555234,30783,44877,893039,274638,811324,743142,233694,503619,977097,263781,350882,15000,961863,705252,823857,302978,919950,864090,633682}
province_pool | {1381822,1570117,1733796,1802258,1757745,1308796,1296608,1958019,1637076,1626698,1369964,1501167}
city_pool     |
-[ RECORD 2 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool   | {41356,470712,453025,997172,997806,520315,512094,523652,714477,526433}
province_pool | {1246470,1582571,1154589,1213147,1144821,1498216}
city_pool     |
-[ RECORD 3 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool   | {951192,518459,830710,34429,113691,362024,578173,574309}
province_pool | {1831028,1060594,1871276,1365273,1092971}
city_pool     | {2100388}
-[ RECORD 4 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool   | {235601,950720,269682,452868,622511,590893,602110,104605}
province_pool | {1791516,1039665,1669258,1663575,1126127}
city_pool     |
-[ RECORD 5 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool   | {738016,548948,600423,559686,483213}
province_pool | {1185880,1916542,1336231}
city_pool     |

         
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t  (cost=0.29..246.06 rows=10 width=96) (actual time=2.508..3.754 rows=5 loops=1)
->  Result  (cost=0.29..2.71 rows=10 width=8) (actual time=0.020..0.023 rows=5 loops=1)
->  ProjectSet  (cost=0.29..2.56 rows=10 width=32) (actual time=0.019..0.020 rows=5 loops=1)
->  Index Scan using users_pkey on users  (cost=0.29..2.50 rows=1 width=224) (actual time=0.017..0.018 rows=1 loops=1)
Index Cond: (uid = 1)
SubPlan 2
->  Aggregate  (cost=7.08..7.09 rows=1 width=32) (actual time=0.261..0.261 rows=1 loops=5)
->  Limit  (cost=5.43..6.76 rows=25 width=12) (actual time=0.254..0.258 rows=10 loops=5)
->  Index Only Scan using idx_pool_global_0 on pool_global t1  (cost=5.43..18.63 rows=248 width=12) (actual time=0.254..0.257 rows=10 loops=5)
Index Cond: (tag = t.tag)
Filter: (NOT (hashed SubPlan 1))
Heap Fetches: 0
SubPlan 1
->  Result  (cost=0.29..4.76 rows=100 width=8) (actual time=0.851..1.074 rows=1001 loops=1)
->  ProjectSet  (cost=0.29..3.01 rows=100 width=32) (actual time=0.850..0.938 rows=1001 loops=1)
->  Index Scan using users_pkey on users users_1  (cost=0.29..2.50 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: (uid = 1)
SubPlan 5
->  Aggregate  (cost=8.15..8.16 rows=1 width=32) (actual time=0.247..0.247 rows=1 loops=5)
->  Limit  (cost=7.93..8.13 rows=1 width=12) (actual time=0.242..0.245 rows=6 loops=5)
InitPlan 3 (returns $3)
->  Index Scan using users_pkey on users users_2  (cost=0.29..2.50 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: (uid = 1)
->  Index Only Scan using idx_pool_province_0 on pool_province t1_1  (cost=5.43..6.85 rows=7 width=12) (actual time=0.241..0.243 rows=6 loops=5)
Index Cond: ((pid = $3) AND (tag = t.tag))
Filter: (NOT (hashed SubPlan 4))
Heap Fetches: 0
SubPlan 4
->  Result  (cost=0.29..4.76 rows=100 width=8) (actual time=0.813..1.027 rows=1001 loops=1)
->  ProjectSet  (cost=0.29..3.01 rows=100 width=32) (actual time=0.812..0.896 rows=1001 loops=1)
->  Index Scan using users_pkey on users users_3  (cost=0.29..2.50 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (uid = 1)
SubPlan 8
->  Aggregate  (cost=9.07..9.08 rows=1 width=32) (actual time=0.236..0.237 rows=1 loops=5)
->  Limit  (cost=7.93..9.05 rows=1 width=12) (actual time=0.235..0.235 rows=0 loops=5)
InitPlan 6 (returns $7)
->  Index Scan using users_pkey on users users_4  (cost=0.29..2.50 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
Index Cond: (uid = 1)
->  Index Only Scan using idx_pool_city_0 on pool_city t1_2  (cost=5.43..6.55 rows=1 width=12) (actual time=0.234..0.234 rows=0 loops=5)
Index Cond: ((cid = $7) AND (tag = t.tag))
Filter: (NOT (hashed SubPlan 7))
Heap Fetches: 0
SubPlan 7
->  Result  (cost=0.29..4.76 rows=100 width=8) (actual time=0.793..1.011 rows=1001 loops=1)
->  ProjectSet  (cost=0.29..3.01 rows=100 width=32) (actual time=0.792..0.876 rows=1001 loops=1)
->  Index Scan using users_pkey on users users_5  (cost=0.29..2.50 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: (uid = 1)
Planning Time: 0.999 ms
Execution Time: 3.825 ms
(49 rows)

6、压测


         

         
\set uid random(1,10000000)
\set mod random(0,49)

         
select
(
select array_agg(vid) from
(
select vid from pool_global t1
where t1.tag=t.tag
and t1.vid not in (select jsonb_array_elements_text( readlist[:mod] )::int8  from users where uid=:uid)
and abs(mod(hashint8(vid),50)) = :mod
order by t1.score desc
limit ceil(t.limits*0.5)
) x   -- 全国池 50%
) as global_pool,
(
select array_agg(vid) from
(
select vid from pool_province t1
where t1.tag=t.tag
and t1.pid=(select pid from users where uid=:uid)
and t1.vid not in (select jsonb_array_elements_text( readlist[:mod] )::int8  from users where uid=:uid)
and abs(mod(hashint8(vid),50)) = :mod
order by t1.score desc
limit ceil(t.limits*0.3)
) x   -- 省份池 30%
) as province_pool,
(
select array_agg(vid) from
(
select vid from pool_city t1
where t1.tag=t.tag
and t1.cid=(select cid from users where uid=:uid)
and t1.vid not in (select jsonb_array_elements_text( readlist[:mod] )::int8  from users where uid=:uid)
and abs(mod(hashint8(vid),50)) = :mod
order by t1.score desc
limit ceil(t.limits*0.2)
) x    -- 城市池 20%
) as city_pool
from
(
select (unnest(tag_scores1)).tag as tag, (unnest(tag_scores1)).limits as limits from
users where uid=:uid
) t;

         

         
pgbench (PostgreSQL) 14.0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 120 s
number of transactions actually processed: 99824
latency average = 14.426 ms
latency stddev = 12.608 ms
initial connection time = 13.486 ms
tps = 831.235738 (without initial connection time)

2018macbook pro i5系列, 每秒可以推荐多少个视频ID?

  • 83100

降级方法:

4、重新发现PG之美- 4 随机漫步踏浪而来
在一些论坛、短视频业务中, 编辑精选和地域或大范围精选的内容会采用随机推荐的方式推送给客户.
随机查询就有了高并发、低延迟的需求, 然而通用的order by random()随机方法性能太烂, 无法满足需求.
PG
提供了tablesample method(para)方法, 能够以几千倍的性能满足高并发需求.

视频回放:  https://www.bilibili.com/video/BV1cy4y137WU/

压缩方法:

roaringbitmap | hll

《重新发现PostgreSQL之美- 24 滑动窗口分析2000x
《重新发现PostgreSQL之美- 23 彭祖的长寿秘诀》

传统方

放弃治疗

参考

PostgreSQL x分组, y排序, 每组各取(N动态)- 子查询+子查询聚合使用》

《重新发现PostgreSQL之美- 23 彭祖的长寿秘诀》

 

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
相关文章
|
6月前
|
数据采集 算法 搜索推荐
Python基于RFM模型和K-Means聚类算法进行航空公司客户价值分析
Python基于RFM模型和K-Means聚类算法进行航空公司客户价值分析
|
8月前
|
算法 测试技术 C++
【位运算】【二分查找】【C++算法】100160价值和小于等于 K 的最大数字
【位运算】【二分查找】【C++算法】100160价值和小于等于 K 的最大数字
|
存储 SQL 算法
PostgreSQL 14中TOAST的新压缩算法LZ4,它有多快?
PostgreSQL 14中TOAST的新压缩算法LZ4,它有多快?
392 0
|
8月前
|
算法 测试技术 C#
【位运算】【二分查找】【C++算法】100160价值和小于等于 K 的最大数字
【位运算】【二分查找】【C++算法】100160价值和小于等于 K 的最大数字
|
搜索推荐 关系型数据库 PostgreSQL
PostgreSQL的排序算法
PostgreSQL的排序算法
100 1
|
存储 监控 算法
转:如何通过堆排序算法探索现代监控软件的功能与价值
堆排序算法是一种经典的排序算法,它可以用来探索现代监控软件的功能与价值,尤其是在处理海量数据和实时监控方面。那么,咱们一起来看看怎么用堆排序的思路来揭开现代监控软件的神秘面纱吧!
66 0
|
算法 C++
剑指offer(C++)-JZ47:礼物的最大价值(算法-动态规划)
剑指offer(C++)-JZ47:礼物的最大价值(算法-动态规划)
|
存储 缓存 架构师
阿里云数据库专家于巍荣获PostgreSQL中国技术大会“最具价值专家 MVP”奖
2023年3月3日,在由PostgreSQL中文社区主办的“第十二届PostgreSQL中国技术大会”上,阿里云数据库开源首席架构师于巍荣获“中国 PostgreSQL 最具价值专家 MVP”奖项。
阿里云数据库专家于巍荣获PostgreSQL中国技术大会“最具价值专家 MVP”奖
|
存储 Kubernetes 关系型数据库
「关系型数据库」数据库深度探索:PostgreSQL最有潜力和学习价值
「关系型数据库」数据库深度探索:PostgreSQL最有潜力和学习价值
|
存储 JSON 分布式计算
「PostgreSQL高级特性」PostgreSQL 数据库的近似算法
「PostgreSQL高级特性」PostgreSQL 数据库的近似算法