Immersive PostgreSQL|PolarDB Learning, Part 1: Short-Video Recommendation Dedup and UV Statistics Scenarios

Summary: this lab covers two scenarios: short-video recommendation dedup, and UV statistics and analysis. You are welcome to help build this library of immersive database learning materials, to help developers use databases well, sharpen their competitive skills, and cut costs and boost efficiency for their companies. The lab in this article can be completed in the permanently free Aliyun hands-on lab: https://developer.aliyun.com/adc/scenario/exp/f55dbfac77c0467a9d3cd95ff6697a31 If you have a local Docker environment, you can also pull the image and run the lab locally.

Author

digoal

Date

2023-08-19

Tags

PostgreSQL , PolarDB , database , teaching


Background

The lab in this article can be completed in the permanently free Aliyun hands-on lab.

If you have a local Docker environment, you can also pull the image and run the lab locally:

On x86_64 machines, use the following docker image:

On Apple-silicon machines, use the following docker image:

Database users are very welcome to send me scenarios by replying to the following issue. Let's build this library of immersive database learning materials together, to help developers use databases well, sharpen their competitive skills, and cut costs and boost efficiency for their companies.

Business scenario 1: UV statistics and analysis

Suppose a website logs every user click action, one record per click.

Based on this detail data, the following analyses are needed (approximate results are allowed; high performance is required):

  • Daily UV by region
  • Day-over-day new visitors by region (visited today but not yesterday)
  • Daily UV overall
  • Day-over-day new visitors overall (visited today but not yesterday)
  • Weekly UV by region
  • Week-over-week new visitors by region (visited this week but not last week)
  • Weekly UV overall
  • Week-over-week new visitors overall (visited this week but not last week)
  • UV over an arbitrary sliding week, by region
  • New visitors vs. the previous sliding week, by region (visited this window but not the previous one)
  • UV over an arbitrary sliding week, overall
  • New visitors vs. the previous sliding week, overall (visited this window but not the previous one)

Implementation and comparison

Table design:

drop table if exists tbl;  
create unlogged table tbl (  -- user behavior table; unlogged to lighten the test load  
  id serial8 primary key,  -- primary key  
  uid int8,  -- user ID  
  rid int2,  -- region ID  
  cts timestamp,  -- behavior timestamp  
  info text  -- behavior payload   
);    
alter sequence tbl_id_seq cache 1000;

Test data requirements:

  • UID ranges from 1 to 200 million; to simulate a subset of highly active members, UIDs are generated with a Gaussian distribution.
  • Region ID = mod(uid, 200).
  • cts is the behavior time, falling randomly within a 31-day window (2,678,400 seconds).
  • info is meaningless in this experiment; it is filled with random data to represent the behavior payload.

Use pgbench to generate 10 million raw user-behavior records.

vi ~/t.sql   
\set uid random_gaussian(1, 200000000, 2.5)   
insert into tbl (uid,rid,cts,info) values (:uid, mod(:uid,200), '2023-01-01'::timestamp + (random()*2678400||' s')::interval, md5(random()::text));
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.064 ms  
latency stddev = 0.080 ms  
initial connection time = 19.642 ms  
tps = 155608.242432 (without initial connection time)

Inspect the results

postgres=# select * from tbl limit 10;  
  id  |    uid    | rid |            cts             |               info                 
------+-----------+-----+----------------------------+----------------------------------  
    1 |  97559812 |  12 | 2023-01-01 22:07:44.879178 | 4137dad1b052dcd704213e0b38735a1b  
    2 | 138253982 | 182 | 2023-01-15 02:38:03.556602 | ae3d5f27f61c8f0cd8f9864d79a693b5  
 3001 | 160494052 |  52 | 2023-01-17 23:45:29.174513 | b7972471fa02c8efa13113e96218c755  
    3 | 132545346 | 146 | 2023-01-05 22:23:11.484299 | 343f9298c26788d8bf454cfc7173a1f3  
    4 | 120769706 | 106 | 2023-01-16 21:47:52.81739  | a43a922d985b2ecc75e943af2c4c95be  
 5001 |  73660892 |  92 | 2023-01-30 05:42:50.828799 | 9515ed12769e13ff6c8f5e633860d548  
 3002 | 151242348 | 148 | 2023-01-31 07:03:15.446931 | 076c30be03af95f8a6dab1b65dd94903  
    5 | 164054658 |  58 | 2023-01-24 19:24:56.110967 | 19c46f300fe17397a108c6a2b69a7855  
 5002 | 131108032 |  32 | 2023-01-23 19:04:44.526588 | e930a9df98f90a3d47f48fe5cf1eaf9a  
    6 |  33572954 | 154 | 2023-01-16 00:47:17.449256 | 6a2ecb267b3281c91c5645b57a0b29d8  
(10 rows)  
postgres=# select count(*) from tbl;  
  count     
----------  
 10000000  
(1 row)  
postgres=# \dt+  
                                   List of relations  
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl  | table | postgres | unlogged    | heap          | 965 MB |   
(1 row)  
postgres=# \di+  
                                         List of relations  
 Schema |   Name   | Type  |  Owner   | Table | Persistence | Access method |  Size  | Description   
--------+----------+-------+----------+-------+-------------+---------------+--------+-------------  
 public | tbl_pkey | index | postgres | tbl   | unlogged    | btree         | 343 MB |   
(1 row)  
postgres=# vacuum ANALYZE tbl;  
VACUUM

Traditional database design and tests

test case 1: daily UV by region

select   
  rid,  
  date(cts),  
  count(distinct uid)   
from   
  tbl   
group by 1,2   
order by 1,2;

Elapsed: 5263.209 ms

test case 2: day-over-day new visitors by region (visited today but not yesterday)

select   
  t.d,   
  t.rid,   
  (select count(*) from   
    (select uid from tbl where date(cts)=t.d and rid=t.rid   
     except   
     select uid from tbl where date(cts)=t.d-1 and rid=t.rid) as tmp  
  ) as uv_incr   
from   
  (select date(cts) as d , rid from tbl group by 1,2) as t   
order by 1,2;

Elapsed: more than 249922.704 ms

test case 3: daily UV overall

select   
  date(cts),  
  count(distinct uid)   
from   
  tbl   
group by 1   
order by 1;

Elapsed: 4824.847 ms

test case 4: day-over-day new visitors overall (visited today but not yesterday)

select   
  t.d,   
  (select count(*) from   
    (select uid from tbl where date(cts)=t.d   
     except   
     select uid from tbl where date(cts)=t.d-1) as tmp  
  ) as uv_incr   
from   
  (select distinct date(cts) as d from tbl) as t   
order by 1;

Elapsed: 41003.313 ms

test case 5: weekly UV by region

select   
  EXTRACT(ISOYEAR FROM cts) as isoyear,   
  EXTRACT(week FROM cts) as week,   
  rid,  
  count(distinct uid)   
from   
  tbl   
group by 1,2,3   
order by 1,2,3;

Elapsed: 18654.469 ms

test case 6: week-over-week new visitors by region (visited this week but not last week)

select   
  t.isoyear,   
  t.week,   
  t.rid,   
  (select count(*) from   
    (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week and rid=t.rid   
     except   
     select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1 and rid=t.rid) as tmp  
  ) as uv_incr   
from   
  (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week, rid from tbl group by 1,2,3) as t   
order by 1,2,3;

Elapsed: more than 294874.983 ms

Crossing a year boundary is trickier: the week before the first week of a year is the last week of the previous year, which is awkward to express in this SQL. You can handle it with a case expression, or compute the week from the epoch.

Homework: interested readers can modify the SQL to handle year boundaries.
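One way to sidestep the year-boundary issue (a sketch, and only one possible answer to the homework): bucket rows by date_trunc('week', cts), which returns the Monday starting each ISO week as a timestamp, so "the previous week" is always exactly 7 days earlier and no year arithmetic is needed:

```sql
-- Week-over-week new visitors by region, robust across year boundaries:
-- date_trunc('week', ...) buckets rows by the Monday starting each ISO week.
select
  t.w,
  t.rid,
  (select count(*) from
    (select uid from tbl where date_trunc('week', cts) = t.w and rid = t.rid
     except
     select uid from tbl where date_trunc('week', cts) = t.w - interval '7 days' and rid = t.rid) as tmp
  ) as uv_incr
from
  (select date_trunc('week', cts) as w, rid from tbl group by 1,2) as t
order by 1,2;
```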

test case 7: weekly UV overall

select   
  EXTRACT(ISOYEAR FROM cts) as isoyear,   
  EXTRACT(week FROM cts) as week,   
  count(distinct uid)   
from   
  tbl   
group by 1,2  
order by 1,2;

Elapsed: 10757.619 ms

test case 8: week-over-week new visitors overall (visited this week but not last week)

select   
  t.isoyear,   
  t.week,   
  (select count(*) from   
    (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week   
     except   
     select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1) as tmp  
  ) as uv_incr   
from   
  (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week from tbl group by 1,2) as t   
order by 1,2;

Elapsed: 38512.642 ms

test case 9: UV over an arbitrary sliding week, by region

Use a function that takes the window boundaries as parameters.

create or replace function get_uv_byrid (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$  
  select count(distinct uid) from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1;  
$$ language sql strict;  
select * from get_uv_byrid (1::int2, '2023-01-01', '2023-01-08');

Elapsed: 560.176 ms

test case 10: new visitors vs. the previous arbitrary sliding week, by region (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr_byrid (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
select count(*) from   
  (select uid from tbl where rid=v_rid and cts >= v_begin_cts2 and cts < v_end_cts2  
   except  
   select uid from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1  
  ) as tmp;  
$$ language sql strict;  
select * from get_uv_incr_byrid (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 889.635 ms

test case 11: UV over an arbitrary sliding week, overall

Use a function that takes the window boundaries as parameters.

create or replace function get_uv (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$  
  select count(distinct uid) from tbl where cts >= v_begin_cts1 and cts < v_end_cts1;  
$$ language sql strict;   
select * from get_uv ('2023-01-01', '2023-01-08');

Elapsed: 915.000 ms

test case 12: new visitors vs. the previous arbitrary sliding week, overall (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
select count(*) from   
  (select uid from tbl where cts >= v_begin_cts2 and cts < v_end_cts2  
   except  
   select uid from tbl where cts >= v_begin_cts1 and cts < v_end_cts1  
  ) as tmp;  
$$ language sql strict;  
select * from get_uv_incr ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 2147.770 ms

HLL database design and tests

Create the hll extension

create extension hll;

For the types, operators, and functions provided by the hll extension, see the hll open-source project documentation:

postgres=# \dT+  
                                         List of data types  
 Schema |    Name     | Internal name | Size | Elements |  Owner   | Access privileges | Description   
--------+-------------+---------------+------+----------+----------+-------------------+-------------  
 public | hll         | hll           | var  |          | postgres |                   |   
 public | hll_hashval | hll_hashval   | 8    |          | postgres |                   |   
(2 rows)  
postgres=# \do+  
                                         List of operators  
 Schema | Name | Left arg type | Right arg type |   Result type    |    Function     | Description   
--------+------+---------------+----------------+------------------+-----------------+-------------  
 public | #    |               | hll            | double precision | hll_cardinality |   
 public | <>   | hll           | hll            | boolean          | hll_ne          |   
 public | <>   | hll_hashval   | hll_hashval    | boolean          | hll_hashval_ne  |   
 public | =    | hll           | hll            | boolean          | hll_eq          |   
 public | =    | hll_hashval   | hll_hashval    | boolean          | hll_hashval_eq  |   
 public | ||   | hll           | hll            | hll              | hll_union       |   
 public | ||   | hll           | hll_hashval    | hll              | hll_add         |   
 public | ||   | hll_hashval   | hll            | hll              | hll_add_rev     |   
(8 rows)

Create the aggregate table

create table tbl_hll (  
  rid int2,   -- region ID  
  ds date,    -- date  
  users hll   -- HLL value: the aggregated hll of all UIDs for one region on one day  
);

Load the raw user-behavior data into the aggregate table

insert into tbl_hll select rid, date(cts), hll_add_agg(hll_hash_bigint(uid)) from tbl group by 1,2;  
INSERT 0 6200  
Time: 2976.876 ms (00:02.977)

Inspect the aggregate table: it occupies only about 8 MB.

postgres=# \dt+  
                                     List of relations  
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size   | Description   
--------+---------+-------+----------+-------------+---------------+---------+-------------  
 public | tbl     | table | postgres | unlogged    | heap          | 965 MB  |   
 public | tbl_hll | table | postgres | permanent   | heap          | 8312 kB |   
(2 rows)

Accuracy comparison:

How many UIDs does the hll report?

postgres=# select # hll_union_agg(users) from tbl_hll;   
     ?column?        
-------------------  
 9474798.879453091  
(1 row)  
Time: 47.075 ms

How many UIDs are there really?

postgres=# select count(distinct uid) from tbl;   
  count    
---------  
 9648563  
(1 row)  
Time: 2373.055 ms (00:02.373)

test case 1: daily UV by region

select   
  rid,   
  ds,   
  # users   
from tbl_hll ;

Elapsed: 57.250 ms

test case 2: day-over-day new visitors by region (visited today but not yesterday)

select   
  rid,   
  ds,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))  
from tbl_hll   
window w as (partition by rid order by ds)   
order by 1,2;

Elapsed: 179.204 ms

test case 3: daily UV overall

select   
  ds,   
  # hll_union_agg(users)   
from tbl_hll  
group by ds  
order by 1;

Elapsed: 50.841 ms

test case 4: day-over-day new visitors overall (visited today but not yesterday)

with tmp as (  
  select   
    ds,   
    hll_union_agg(users) as users  
  from tbl_hll  
  group by ds  
)   
select   
  ds,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))  
from tmp   
window w as (order by ds)   
order by 1;

Elapsed: 51.931 ms

test case 5: weekly UV by region

select   
  EXTRACT(ISOYEAR FROM ds) as isoyear,   
  EXTRACT(week FROM ds) as week,   
  rid,   
  # hll_union_agg(users)   
from tbl_hll   
group by 1,2,3;

Elapsed: 99.657 ms

test case 6: week-over-week new visitors by region (visited this week but not last week)

with tmp as (  
  select   
    EXTRACT(ISOYEAR FROM ds) as isoyear,   
    EXTRACT(week FROM ds) as week,   
    rid,   
    hll_union_agg(users) as users  
  from tbl_hll   
  group by 1,2,3   
)   
select   
  isoyear,   
  week,    
  rid,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))   
from tmp   
window w as (partition by rid order by isoyear,week)   
order by 1,2,3;

Elapsed: 121.010 ms

Crossing a year boundary is trickier: the week before the first week of a year is the last week of the previous year, which is awkward to express in this SQL. You can handle it with a case expression, or compute the week from the epoch.

Homework: interested readers can modify the SQL to handle year boundaries.
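An alternative that sidesteps the year boundary (a sketch): bucket by date_trunc('week', ds), which returns the Monday starting each ISO week, so consecutive weeks order correctly across years. As in the queries above, lag() assumes each region has data in every consecutive week:

```sql
-- Week-over-week new visitors by region from the hll aggregate table,
-- robust across year boundaries.
with tmp as (
  select
    date_trunc('week', ds) as wk,   -- Monday starting each ISO week
    rid,
    hll_union_agg(users) as users
  from tbl_hll
  group by 1,2
)
select
  wk, rid,
  (# (users || (lag(users) over w))) - (# (lag(users) over w)) as uv_incr
from tmp
window w as (partition by rid order by wk)
order by 1,2;
```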

test case 7: weekly UV overall

select   
  EXTRACT(ISOYEAR FROM ds) as isoyear,   
  EXTRACT(week FROM ds) as week,   
  # hll_union_agg(users) as users  
from   
  tbl_hll   
group by 1,2  
order by 1,2;

Elapsed: 52.649 ms

test case 8: week-over-week new visitors overall (visited this week but not last week)

with tmp as (  
  select   
    EXTRACT(ISOYEAR FROM ds) as isoyear,   
    EXTRACT(week FROM ds) as week,   
    hll_union_agg(users) as users  
  from   
    tbl_hll   
  group by 1,2  
)  
select   
  isoyear,   
  week,    
  (# (users || (lag(users) over w))) - (# (lag(users) over w))   
from tmp   
window w as (order by isoyear,week)   
order by 1,2;

Elapsed: 51.668 ms

test case 9: UV over an arbitrary sliding week, by region

Use a function that takes the window boundaries as parameters.

create or replace function get_uv_byrid_appro (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$   
  select # hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1;  
$$ language sql strict;   
select * from get_uv_byrid_appro (1::int2, '2023-01-01', '2023-01-08');

Elapsed: 6.021 ms

test case 10: new visitors vs. the previous arbitrary sliding week, by region (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr_byrid_appro (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
with t1 as  
  (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1 group by 1),  
  t2 as  
  (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts2 and ds < v_end_cts2 group by 1)  
select (# (t1.users || t2.users)) - (# t1.users) from t1,t2;  
$$ language sql strict;  
select * from get_uv_incr_byrid_appro (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 8.934 ms

test case 11: UV over an arbitrary sliding week, overall

Use a function that takes the window boundaries as parameters.

create or replace function get_uv_appro (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$  
  select # hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1;  
$$ language sql strict;   
select * from get_uv_appro ('2023-01-01', '2023-01-08');

Elapsed: 18.035 ms

test case 12: new visitors vs. the previous arbitrary sliding week, overall (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr_appro (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
with t1 as  
  (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1),  
  t2 as  
  (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts2 and ds < v_end_cts2)  
select (# (t1.users || t2.users)) - (# t1.users) from t1,t2;  
$$ language sql strict;  
select * from get_uv_incr_appro ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 31.634 ms

Performance comparison

 test case | raw method (ms) | HLL (ms) | HLL speedup
-----------+-----------------+----------+-------------
     1     |        5263.209 |   57.250 |       91.93
     2     |    > 249922.704 |  179.204 |   > 1394.63
     3     |        4824.847 |   50.841 |       94.90
     4     |       41003.313 |   51.931 |      789.57
     5     |       18654.469 |   99.657 |      187.19
     6     |    > 294874.983 |  121.010 |   > 2436.78
     7     |       10757.619 |   52.649 |      204.33
     8     |       38512.642 |   51.668 |      745.39
     9     |         560.176 |    6.021 |       93.04
    10     |         889.635 |    8.934 |       99.58
    11     |         915.000 |   18.035 |       50.73
    12     |        2147.770 |   31.634 |       67.89

Business scenario 2: short-video dedup, filtering already-watched videos

Suppose a short-video service marks every video a user has watched. When recommending videos based on user preferences, the videos the user has already watched must be filtered out of the candidate list. (Approximate results are allowed; high performance is required.)

Implementation and comparison

Table design:

drop table if exists tbl;   
create unlogged table tbl (  -- per-user watched-video list; unlogged to lighten the test load    
  id serial8 primary key,  -- primary key    
  uid int8,  -- user ID    
  vid int8,  -- watched video ID    
  cts timestamp  -- behavior timestamp    
);      
alter sequence tbl_id_seq cache 1000;     
create index on tbl (uid) include (vid);  -- speed up lookups of watched videos by uid

Test data requirements:

  • vid ranges from 1 to 10,000, Gaussian-distributed so that some hot videos are watched frequently.
  • uid ranges from 1 to 10,000, Gaussian-distributed so that some active users watch frequently.

Use pgbench to generate 10 million watched-video records.

vi ~/t.sql     
\set uid random_gaussian(1, 10000, 2.5)     
\set vid random_gaussian(1, 10000, 2.5)     
insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.058 ms  
latency stddev = 0.079 ms  
initial connection time = 20.551 ms  
tps = 171765.018858 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set uid random_gaussian(1, 10000, 2.5)     
         0.000  \set vid random_gaussian(1, 10000, 2.5)     
         0.058  insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());

Inspect the results

postgres=# select * from tbl limit 10;   
  id  | uid  | vid  |            cts               
------+------+------+----------------------------  
    1 | 5380 | 5321 | 2023-08-21 08:53:53.074514  
    2 | 7611 | 7945 | 2023-08-21 08:53:53.074845  
 1001 | 6841 | 2507 | 2023-08-21 08:53:53.074857  
 2001 | 7956 | 4567 | 2023-08-21 08:53:53.074925  
 1002 | 5728 | 4210 | 2023-08-21 08:53:53.075087  
 3001 | 4560 | 1588 | 2023-08-21 08:53:53.075333  
 1003 | 5197 | 5662 | 2023-08-21 08:53:53.07549  
 1004 | 5125 | 3827 | 2023-08-21 08:53:53.075528  
 4001 | 6815 | 4758 | 2023-08-21 08:53:53.075496  
 1005 | 4002 | 8575 | 2023-08-21 08:53:53.075567  
(10 rows)  
postgres=# select count(*) from tbl;    
  count     
----------  
 10000000  
(1 row)  
postgres=# \dt+  
                                   List of relations  
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl  | table | postgres | unlogged    | heap          | 575 MB |   
(1 row)  
postgres=# \di+  
                                            List of relations  
 Schema |      Name       | Type  |  Owner   | Table | Persistence | Access method |  Size  | Description   
--------+-----------------+-------+----------+-------+-------------+---------------+--------+-------------  
 public | tbl_pkey        | index | postgres | tbl   | unlogged    | btree         | 344 MB |   
 public | tbl_uid_vid_idx | index | postgres | tbl   | unlogged    | btree         | 301 MB |   
(2 rows)  
postgres=# vacuum ANALYZE tbl;    
VACUUM

Traditional database design and tests

test case 1:

Randomly generate 100 recommended videos from the hot vid range 4900-5100, pick a random active uid from the 4900-5100 range, filter out the vids this uid has already watched, and return the unwatched vids.

Write the benchmark script

vi ~/test.sql  
\set uid random(4900, 5100)    
with t as (  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
)  
select t.vid from t where not exists   
  (select 1 from tbl where uid=:uid and tbl.vid=t.vid)  
;

Run the benchmark

pgbench -M prepared -n -r -P 1 -f ~/test.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 2469063  
latency average = 0.486 ms  
latency stddev = 0.183 ms  
initial connection time = 22.914 ms  
tps = 20578.539756 (without initial connection time)

tps: 20578.539756

HLL database design and tests

Create the aggregate table; each uid stores one aggregated hll value.

drop table if exists tbl_hll;   
create table tbl_hll (  
  uid int8 unique,   
  vids hll   
);

Aggregate the data

insert into tbl_hll select uid, hll_add_agg(hll_hash_bigint(vid)) from tbl group by 1;    
INSERT 0 10000

Inspect the aggregate table: it occupies only about 11 MB.

postgres=# \dt+  
                                    List of relations  
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+---------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl     | table | postgres | unlogged    | heap          | 575 MB |   
 public | tbl_hll | table | postgres | permanent   | heap          | 11 MB  |   
(2 rows)

Accuracy comparison:

Actual number of distinct watched vids for uid=5000:  
postgres=# select count(distinct vid) from tbl where uid=5000;  
 count   
-------  
  1807  
(1 row)  
Approximate number of vids in uid=5000's hll:  
postgres=# select # vids from tbl_hll where uid=5000;  
      ?column?        
--------------------  
 1758.0590520300507  
(1 row)

True/false membership tests:

Take uid=5000's vids from the raw table and test each against the hll value: all 2081 rows are judged as watched. For values that were actually added, the test always returns true (there are no false negatives); see the knowledge points section for the underlying principle.

with tmp as (  
  select   
    tbl.uid,   
    tbl.vid,   
    ( vids || hll_hash_bigint(vid) ) = vids as is_readed   -- already watched?  
  from tbl_hll join tbl  
  on (tbl_hll.uid=tbl.uid)  
  where   
  tbl.uid=5000  
  and tbl_hll.uid=5000  
)  
select is_readed, count(*) from tmp group by 1;  
 is_readed | count   
-----------+-------  
 t         |  2081  
(1 row)

The next test is far less accurate: out of 1,000,000 candidate vids, 413,826 are judged to be in uid=5000's watched-list hll, so a large number of unwatched videos could be misclassified as already watched.

Think about how to solve this; tuning the hll parameters is enough.

postgres=# select count(*) from tbl_hll , (select generate_series(1,1000000) as vid) as t where uid=5000 and ( vids || hll_hash_bigint(vid) ) = vids;  
 count    
--------  
 413826  
(1 row)
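One way to shrink the false-positive rate (a sketch only; parameter names follow the postgresql-hll project, and the exact space/accuracy trade-off should be verified on your own data): give each hll value more registers by raising log2m. The type modifier is hll(log2m, regwidth, expthresh, sparseon), with defaults hll(11,5,-1,1), and hll_add_agg accepts matching parameters:

```sql
-- Rebuild the aggregate table with 2^17 registers per hll value
-- instead of the default 2^11, trading storage for accuracy.
drop table if exists tbl_hll;
create table tbl_hll (
  uid int8 unique,
  vids hll(17,5)
);
insert into tbl_hll
  select uid, hll_add_agg(hll_hash_bigint(vid), 17, 5) from tbl group by 1;
```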

test case 1:

Randomly generate 100 recommended videos from the hot vid range 4900-5100, pick a random active uid from the 4900-5100 range, filter out the vids this uid has already watched, and return the unwatched vids.

Write the benchmark script:

vi ~/test1.sql  
\set uid random(4900, 5100)    
select t1.vid from   
(  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
) t1   
join   
tbl_hll t2   
on ( t2.uid=:uid and t2.vids || hll_hash_bigint(t1.vid) <> t2.vids);

Benchmark results:

pgbench -M prepared -n -r -P 1 -f ~/test1.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 583366  
latency average = 2.057 ms  
latency stddev = 0.631 ms  
initial connection time = 21.736 ms  
tps = 4862.153296 (without initial connection time)

tps: 4862.153296

A bit disappointing? The main cause is that the hll computation itself is expensive. You could optimize the hll code, or push the vids and hll values out to the application side for the membership test (shifting the load to the application tier, which is stateless and very easy to scale out).

Roaringbitmap database design and tests

Next, use the rb (roaringbitmap) type instead of hll to store each user's aggregated watched-video IDs. rb is a compressed sparse bitmap and stores exact values.

Create the roaringbitmap extension

postgres=# create extension roaringbitmap ;  
CREATE EXTENSION  
postgres=# \dT  
          List of data types  
 Schema |     Name      | Description   
--------+---------------+-------------  
 public | roaringbitmap |   
(1 row)  
postgres=# \do+  
                                          List of operators  
 Schema | Name | Left arg type | Right arg type |  Result type  |       Function        | Description   
--------+------+---------------+----------------+---------------+-----------------------+-------------  
 public | #    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_xor                |   
 public | &    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_and                |   
 public | &&   | roaringbitmap | roaringbitmap  | boolean       | rb_intersect          |   
 public | -    | roaringbitmap | integer        | roaringbitmap | rb_remove             |   
 public | -    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_andnot             |   
 public | <<   | roaringbitmap | bigint         | roaringbitmap | rb_shiftleft          |   
 public | <>   | roaringbitmap | roaringbitmap  | boolean       | rb_not_equals         |   
 public | <@   | integer       | roaringbitmap  | boolean       | public.rb_containedby |   
 public | <@   | roaringbitmap | roaringbitmap  | boolean       | public.rb_containedby |   
 public | =    | roaringbitmap | roaringbitmap  | boolean       | rb_equals             |   
 public | >>   | roaringbitmap | bigint         | roaringbitmap | rb_shiftright         |   
 public | @>   | roaringbitmap | integer        | boolean       | public.rb_contains    |   
 public | @>   | roaringbitmap | roaringbitmap  | boolean       | public.rb_contains    |   
 public | |    | integer       | roaringbitmap  | roaringbitmap | public.rb_add         |   
 public | |    | roaringbitmap | integer        | roaringbitmap | public.rb_add         |   
 public | |    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_or                 |   
(16 rows)

Create the aggregate table

drop table if exists tbl_rb;   
create table tbl_rb (  
  uid int8 unique,   
  vids roaringbitmap   
);

Load the aggregate data

insert into tbl_rb select uid, rb_build_agg(vid::int4) from tbl group by 1;    
INSERT 0 10000

Inspect the aggregate table: it occupies only about 20 MB.

postgres=# \dt+  
                                    List of relations  
 Schema |  Name  | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+--------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl    | table | postgres | unlogged    | heap          | 575 MB |   
 public | tbl_rb | table | postgres | permanent   | heap          | 20 MB  |   
(2 rows)  
postgres=# \di+  
                                             List of relations  
 Schema |      Name       | Type  |  Owner   | Table  | Persistence | Access method |  Size  | Description   
--------+-----------------+-------+----------+--------+-------------+---------------+--------+-------------  
 public | tbl_pkey        | index | postgres | tbl    | unlogged    | btree         | 344 MB |   
 public | tbl_rb_uid_key  | index | postgres | tbl_rb | permanent   | btree         | 240 kB |   
 public | tbl_uid_vid_idx | index | postgres | tbl    | unlogged    | btree         | 301 MB |   
(3 rows)

rb is an exact type; compare with the raw data:

postgres=# select count(distinct vid) from tbl where uid=5000;  
 count   
-------  
  1807  
(1 row)  
postgres=# select count(*) from (select unnest(rb_to_array(vids)) from tbl_rb where uid=5000) t;  
 count   
-------  
  1807  
(1 row)

test case 1:

Randomly generate 100 recommended videos from the hot vid range 4900-5100, pick a random active uid from the 4900-5100 range, filter out the vids this uid has already watched, and return the unwatched vids.

Write the benchmark script:

vi ~/test2.sql  
\set uid random(4900, 5100)    
select t1.vid from   
(  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
) t1   
join   
tbl_rb t2   
on ( t2.uid=:uid and not (t1.vid <@ t2.vids) );

Benchmark results:

pgbench -M prepared -n -r -P 1 -f ~/test2.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 3098707  
latency average = 0.387 ms  
latency stddev = 0.147 ms  
initial connection time = 21.281 ms  
tps = 25826.645185 (without initial connection time)

tps: 25826.645185

Performance comparison

 item         | traditional | hll aggregate | rb sparse aggregate
--------------+-------------+---------------+---------------------
 storage size | 1220 MB     | 11 MB         | 20 MB
 tps          | 20578       | 4862          | 25826

Knowledge points

1. hll.

When a value is added to an hll, you can think of it as being passed through a hash function and mapped onto n positions of a bitmap value; those n bit positions are all set to 1.

To test whether a value is present, check whether all n of its bit positions in the bitmap value are set to 1.

The hll type is lossy: a "not present" answer is always true, but a "present" answer may be false (because of bit-position collisions), so there is a false-positive problem.
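With the hll extension, this membership test can be written with the union-equality trick used earlier: unioning in the hash of a value that is already "present" leaves the hll value unchanged. A minimal sketch:

```sql
-- An hll containing 1 and 2: adding 1 again does not change it
-- (judged present), while adding 999 changes it (judged absent).
with h as (
  select hll_empty() || hll_hash_bigint(1) || hll_hash_bigint(2) as v
)
select
  (v || hll_hash_bigint(1))   = v as contains_1,    -- true
  (v || hll_hash_bigint(999)) = v as contains_999   -- false (barring collisions)
from h;
```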

hll supports more, and more complex, requirements than a plain counter:

  • Distinct counts (a counter covers only the window it was counted over; a different window needs a full recount. Two hll values can be merged, so no full recount is needed.)
  • Membership tests (a counter cannot do this.)
  • PV/UV (a counter can do this too.)
  • Sliding-window analysis (a counter cannot: two counter values cannot be merged, so every window would need its own recount. Two hll values can be merged, so no recount is needed; just union the hll values.)

2. Window query syntax.
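This refers to the named-window clause used throughout the hll test cases; a minimal sketch against the tbl_hll table above:

```sql
-- lag() fetches the previous row's value within each partition;
-- the named window avoids repeating the partition/order specification.
select
  rid, ds,
  lag(users) over w as prev_users
from tbl_hll
window w as (partition by rid order by ds);
```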

3. How to quickly build test data that matches a business model?

《PostgreSQL+MySQL 联合解决方案 - 第3课视频 - 如何压测PG数据库、如何瞬间构造海量测试数据》

《PostgreSQL 如何快速构建 海量 逼真 测试数据》

《生成泊松、高斯、指数、随机分布数据 - PostgreSQL 9.5 new feature - pgbench improve, gaussian (standard normal) & exponential distribution》

man pgbench

4. roaringbitmap.
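The roaringbitmap operators listed in the \do+ output above can be exercised directly; a few basic examples (rb_build builds a bitmap from an int array):

```sql
select rb_cardinality(rb_build(array[1,2,3]));                    -- 3: exact distinct count
select rb_to_array(rb_build(array[1,2]) | rb_build(array[2,3]));  -- {1,2,3}: union (rb_or)
select 2 <@ rb_build(array[1,2,3]);                               -- true: exact membership test
select rb_to_array(rb_build(array[1,2,3]) - 2);                   -- {1,3}: remove one value
```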

Questions to consider

1. What other scenarios could use hll?

2. Is there a better technique than hll?

  • the roaringbitmap type?
  • the array type?

3. What did you learn from this experiment?

4. Why does hll perform so well?

5. What does hll sacrifice to gain this performance?

6. Why is hll approximate?

7. Which hll parameter controls how many distinct values an hll value can hold?

8. Which parameters determine the accuracy of an hll value?

9. Why can multiple hll values be unioned?
