Immersive PostgreSQL|PolarDB Learning, Part 1: Short-Video Recommendation Dedup and UV Statistics Scenarios

Summary: this lab covers two scenarios: short-video recommendation dedup, and UV statistics and analysis. You are welcome to help build this library of immersive database learning materials, to help developers use databases well, sharpen their competitive skills, and cut costs and boost efficiency for their companies. The lab in this article can be completed in the permanently free Aliyun hands-on lab: https://developer.aliyun.com/adc/scenario/exp/f55dbfac77c0467a9d3cd95ff6697a31 If you have a local Docker environment, you can also pull the image and run the lab locally.

Author

digoal

Date

2023-08-19

Tags

PostgreSQL , PolarDB , database , teaching


Background

The lab in this article can be completed in the permanently free Aliyun hands-on lab.

If you have a local Docker environment, you can also pull the image and run the lab locally:

On x86_64 machines, use the following docker image:

On Apple-silicon machines, use the following docker image:

Database users are very welcome to send me scenarios by replying to the following issue. Let's build this library of immersive database learning materials together, to help developers use databases well, sharpen their competitive skills, and cut costs and boost efficiency for their companies.

Business scenario 1: UV statistics and analysis

Suppose a website logs every user click action, one record per click.

Based on this detail data, the following analyses are needed (approximate results are allowed; high performance is required):

  • Daily UV by region
  • Day-over-day new visitors by region (visited today but not yesterday)
  • Daily UV overall
  • Day-over-day new visitors overall (visited today but not yesterday)
  • Weekly UV by region
  • Week-over-week new visitors by region (visited this week but not last week)
  • Weekly UV overall
  • Week-over-week new visitors overall (visited this week but not last week)
  • UV over an arbitrary sliding week, by region
  • New visitors vs. the previous sliding week, by region (visited this window but not the previous one)
  • UV over an arbitrary sliding week, overall
  • New visitors vs. the previous sliding week, overall (visited this window but not the previous one)

Implementation and comparison

Table design:

drop table if exists tbl;  
create unlogged table tbl (  -- user behavior table; unlogged to lighten the test load  
  id serial8 primary key,  -- primary key  
  uid int8,  -- user ID  
  rid int2,  -- region ID  
  cts timestamp,  -- behavior timestamp  
  info text  -- behavior payload   
);    
alter sequence tbl_id_seq cache 1000;

Test data requirements:

  • UID ranges from 1 to 200 million; to simulate a subset of highly active members, UIDs are generated with a Gaussian distribution.
  • Region ID = mod(uid, 200).
  • cts is the behavior time, falling randomly within a 31-day window (2,678,400 seconds).
  • info is meaningless in this experiment; it is filled with random data to represent the behavior payload.

Use pgbench to generate 10 million raw user-behavior records.

vi ~/t.sql   
\set uid random_gaussian(1, 200000000, 2.5)   
insert into tbl (uid,rid,cts,info) values (:uid, mod(:uid,200), '2023-01-01'::timestamp + (random()*2678400||' s')::interval, md5(random()::text));
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.064 ms  
latency stddev = 0.080 ms  
initial connection time = 19.642 ms  
tps = 155608.242432 (without initial connection time)

Inspect the results

postgres=# select * from tbl limit 10;  
  id  |    uid    | rid |            cts             |               info                 
------+-----------+-----+----------------------------+----------------------------------  
    1 |  97559812 |  12 | 2023-01-01 22:07:44.879178 | 4137dad1b052dcd704213e0b38735a1b  
    2 | 138253982 | 182 | 2023-01-15 02:38:03.556602 | ae3d5f27f61c8f0cd8f9864d79a693b5  
 3001 | 160494052 |  52 | 2023-01-17 23:45:29.174513 | b7972471fa02c8efa13113e96218c755  
    3 | 132545346 | 146 | 2023-01-05 22:23:11.484299 | 343f9298c26788d8bf454cfc7173a1f3  
    4 | 120769706 | 106 | 2023-01-16 21:47:52.81739  | a43a922d985b2ecc75e943af2c4c95be  
 5001 |  73660892 |  92 | 2023-01-30 05:42:50.828799 | 9515ed12769e13ff6c8f5e633860d548  
 3002 | 151242348 | 148 | 2023-01-31 07:03:15.446931 | 076c30be03af95f8a6dab1b65dd94903  
    5 | 164054658 |  58 | 2023-01-24 19:24:56.110967 | 19c46f300fe17397a108c6a2b69a7855  
 5002 | 131108032 |  32 | 2023-01-23 19:04:44.526588 | e930a9df98f90a3d47f48fe5cf1eaf9a  
    6 |  33572954 | 154 | 2023-01-16 00:47:17.449256 | 6a2ecb267b3281c91c5645b57a0b29d8  
(10 rows)  
postgres=# select count(*) from tbl;  
  count     
----------  
 10000000  
(1 row)  
postgres=# \dt+  
                                   List of relations  
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl  | table | postgres | unlogged    | heap          | 965 MB |   
(1 row)  
postgres=# \di+  
                                         List of relations  
 Schema |   Name   | Type  |  Owner   | Table | Persistence | Access method |  Size  | Description   
--------+----------+-------+----------+-------+-------------+---------------+--------+-------------  
 public | tbl_pkey | index | postgres | tbl   | unlogged    | btree         | 343 MB |   
(1 row)  
postgres=# vacuum ANALYZE tbl;  
VACUUM

Traditional database design and tests

test case 1: daily UV by region

select   
  rid,  
  date(cts),  
  count(distinct uid)   
from   
  tbl   
group by 1,2   
order by 1,2;

Elapsed: 5263.209 ms

test case 2: day-over-day new visitors by region (visited today but not yesterday)

select   
  t.d,   
  t.rid,   
  (select count(*) from   
    (select uid from tbl where date(cts)=t.d and rid=t.rid   
     except   
     select uid from tbl where date(cts)=t.d-1 and rid=t.rid) as tmp  
  ) as uv_incr   
from   
  (select date(cts) as d , rid from tbl group by 1,2) as t   
order by 1,2;

Elapsed: more than 249922.704 ms

test case 3: daily UV overall

select   
  date(cts),  
  count(distinct uid)   
from   
  tbl   
group by 1   
order by 1;

Elapsed: 4824.847 ms

test case 4: day-over-day new visitors overall (visited today but not yesterday)

select   
  t.d,   
  (select count(*) from   
    (select uid from tbl where date(cts)=t.d   
     except   
     select uid from tbl where date(cts)=t.d-1) as tmp  
  ) as uv_incr   
from   
  (select distinct date(cts) as d from tbl) as t   
order by 1;

Elapsed: 41003.313 ms

test case 5: weekly UV by region

select   
  EXTRACT(ISOYEAR FROM cts) as isoyear,   
  EXTRACT(week FROM cts) as week,   
  rid,  
  count(distinct uid)   
from   
  tbl   
group by 1,2,3   
order by 1,2,3;

Elapsed: 18654.469 ms

test case 6: week-over-week new visitors by region (visited this week but not last week)

select   
  t.isoyear,   
  t.week,   
  t.rid,   
  (select count(*) from   
    (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week and rid=t.rid   
     except   
     select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1 and rid=t.rid) as tmp  
  ) as uv_incr   
from   
  (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week, rid from tbl group by 1,2,3) as t   
order by 1,2,3;

Elapsed: more than 294874.983 ms

Crossing a year boundary is trickier: the week before the first week of a year is the last week of the previous year, which is awkward to express in this SQL. You can handle it with a case expression, or compute the week from the epoch.

Homework: interested readers can modify the SQL to handle year boundaries.
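One way to sidestep the year-boundary issue (a sketch, and only one possible answer to the homework): bucket rows by date_trunc('week', cts), which returns the Monday starting each ISO week as a timestamp, so "the previous week" is always exactly 7 days earlier and no year arithmetic is needed:

```sql
-- Week-over-week new visitors by region, robust across year boundaries:
-- date_trunc('week', ...) buckets rows by the Monday starting each ISO week.
select
  t.w,
  t.rid,
  (select count(*) from
    (select uid from tbl where date_trunc('week', cts) = t.w and rid = t.rid
     except
     select uid from tbl where date_trunc('week', cts) = t.w - interval '7 days' and rid = t.rid) as tmp
  ) as uv_incr
from
  (select date_trunc('week', cts) as w, rid from tbl group by 1,2) as t
order by 1,2;
```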

test case 7: weekly UV overall

select   
  EXTRACT(ISOYEAR FROM cts) as isoyear,   
  EXTRACT(week FROM cts) as week,   
  count(distinct uid)   
from   
  tbl   
group by 1,2  
order by 1,2;

Elapsed: 10757.619 ms

test case 8: week-over-week new visitors overall (visited this week but not last week)

select   
  t.isoyear,   
  t.week,   
  (select count(*) from   
    (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week   
     except   
     select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1) as tmp  
  ) as uv_incr   
from   
  (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week from tbl group by 1,2) as t   
order by 1,2;

Elapsed: 38512.642 ms

test case 9: UV over an arbitrary sliding week, by region

Use a function that takes the window boundaries as parameters.

create or replace function get_uv_byrid (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$  
  select count(distinct uid) from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1;  
$$ language sql strict;  
select * from get_uv_byrid (1::int2, '2023-01-01', '2023-01-08');

Elapsed: 560.176 ms

test case 10: new visitors vs. the previous arbitrary sliding week, by region (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr_byrid (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
select count(*) from   
  (select uid from tbl where rid=v_rid and cts >= v_begin_cts2 and cts < v_end_cts2  
   except  
   select uid from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1  
  ) as tmp;  
$$ language sql strict;  
select * from get_uv_incr_byrid (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 889.635 ms

test case 11: UV over an arbitrary sliding week, overall

Use a function that takes the window boundaries as parameters.

create or replace function get_uv (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$  
  select count(distinct uid) from tbl where cts >= v_begin_cts1 and cts < v_end_cts1;  
$$ language sql strict;   
select * from get_uv ('2023-01-01', '2023-01-08');

Elapsed: 915.000 ms

test case 12: new visitors vs. the previous arbitrary sliding week, overall (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
select count(*) from   
  (select uid from tbl where cts >= v_begin_cts2 and cts < v_end_cts2  
   except  
   select uid from tbl where cts >= v_begin_cts1 and cts < v_end_cts1  
  ) as tmp;  
$$ language sql strict;  
select * from get_uv_incr ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 2147.770 ms

HLL database design and tests

Create the hll extension

create extension hll;

For the types, operators, and functions provided by the hll extension, see the hll open-source project documentation:

postgres=# \dT+  
                                         List of data types  
 Schema |    Name     | Internal name | Size | Elements |  Owner   | Access privileges | Description   
--------+-------------+---------------+------+----------+----------+-------------------+-------------  
 public | hll         | hll           | var  |          | postgres |                   |   
 public | hll_hashval | hll_hashval   | 8    |          | postgres |                   |   
(2 rows)  
postgres=# \do+  
                                         List of operators  
 Schema | Name | Left arg type | Right arg type |   Result type    |    Function     | Description   
--------+------+---------------+----------------+------------------+-----------------+-------------  
 public | #    |               | hll            | double precision | hll_cardinality |   
 public | <>   | hll           | hll            | boolean          | hll_ne          |   
 public | <>   | hll_hashval   | hll_hashval    | boolean          | hll_hashval_ne  |   
 public | =    | hll           | hll            | boolean          | hll_eq          |   
 public | =    | hll_hashval   | hll_hashval    | boolean          | hll_hashval_eq  |   
 public | ||   | hll           | hll            | hll              | hll_union       |   
 public | ||   | hll           | hll_hashval    | hll              | hll_add         |   
 public | ||   | hll_hashval   | hll            | hll              | hll_add_rev     |   
(8 rows)

Create the aggregate table

create table tbl_hll (  
  rid int2,   -- region ID  
  ds date,    -- date  
  users hll   -- HLL value: the aggregated hll of all UIDs for one region on one day  
);

Load the raw user-behavior data into the aggregate table

insert into tbl_hll select rid, date(cts), hll_add_agg(hll_hash_bigint(uid)) from tbl group by 1,2;  
INSERT 0 6200  
Time: 2976.876 ms (00:02.977)

Inspect the aggregate table: it occupies only about 8 MB.

postgres=# \dt+  
                                     List of relations  
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size   | Description   
--------+---------+-------+----------+-------------+---------------+---------+-------------  
 public | tbl     | table | postgres | unlogged    | heap          | 965 MB  |   
 public | tbl_hll | table | postgres | permanent   | heap          | 8312 kB |   
(2 rows)

Accuracy comparison:

How many UIDs does the hll report?

postgres=# select # hll_union_agg(users) from tbl_hll;   
     ?column?        
-------------------  
 9474798.879453091  
(1 row)  
Time: 47.075 ms

How many UIDs are there really?

postgres=# select count(distinct uid) from tbl;   
  count    
---------  
 9648563  
(1 row)  
Time: 2373.055 ms (00:02.373)

test case 1: daily UV by region

select   
  rid,   
  ds,   
  # users   
from tbl_hll ;

Elapsed: 57.250 ms

test case 2: day-over-day new visitors by region (visited today but not yesterday)

select   
  rid,   
  ds,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))  
from tbl_hll   
window w as (partition by rid order by ds)   
order by 1,2;

Elapsed: 179.204 ms

test case 3: daily UV overall

select   
  ds,   
  # hll_union_agg(users)   
from tbl_hll  
group by ds  
order by 1;

Elapsed: 50.841 ms

test case 4: day-over-day new visitors overall (visited today but not yesterday)

with tmp as (  
  select   
    ds,   
    hll_union_agg(users) as users  
  from tbl_hll  
  group by ds  
)   
select   
  ds,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))  
from tmp   
window w as (order by ds)   
order by 1;

Elapsed: 51.931 ms

test case 5: weekly UV by region

select   
  EXTRACT(ISOYEAR FROM ds) as isoyear,   
  EXTRACT(week FROM ds) as week,   
  rid,   
  # hll_union_agg(users)   
from tbl_hll   
group by 1,2,3;

Elapsed: 99.657 ms

test case 6: week-over-week new visitors by region (visited this week but not last week)

with tmp as (  
  select   
    EXTRACT(ISOYEAR FROM ds) as isoyear,   
    EXTRACT(week FROM ds) as week,   
    rid,   
    hll_union_agg(users) as users  
  from tbl_hll   
  group by 1,2,3   
)   
select   
  isoyear,   
  week,    
  rid,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))   
from tmp   
window w as (partition by rid order by isoyear,week)   
order by 1,2,3;

Elapsed: 121.010 ms

Crossing a year boundary is trickier: the week before the first week of a year is the last week of the previous year, which is awkward to express in this SQL. You can handle it with a case expression, or compute the week from the epoch.

Homework: interested readers can modify the SQL to handle year boundaries.
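An alternative that sidesteps the year boundary (a sketch): bucket by date_trunc('week', ds), which returns the Monday starting each ISO week, so consecutive weeks order correctly across years. As in the queries above, lag() assumes each region has data in every consecutive week:

```sql
-- Week-over-week new visitors by region from the hll aggregate table,
-- robust across year boundaries.
with tmp as (
  select
    date_trunc('week', ds) as wk,   -- Monday starting each ISO week
    rid,
    hll_union_agg(users) as users
  from tbl_hll
  group by 1,2
)
select
  wk, rid,
  (# (users || (lag(users) over w))) - (# (lag(users) over w)) as uv_incr
from tmp
window w as (partition by rid order by wk)
order by 1,2;
```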

test case 7: weekly UV overall

select   
  EXTRACT(ISOYEAR FROM ds) as isoyear,   
  EXTRACT(week FROM ds) as week,   
  # hll_union_agg(users) as users  
from   
  tbl_hll   
group by 1,2  
order by 1,2;

Elapsed: 52.649 ms

test case 8: week-over-week new visitors overall (visited this week but not last week)

with tmp as (  
  select   
    EXTRACT(ISOYEAR FROM ds) as isoyear,   
    EXTRACT(week FROM ds) as week,   
    hll_union_agg(users) as users  
  from   
    tbl_hll   
  group by 1,2  
)  
select   
  isoyear,   
  week,    
  (# (users || (lag(users) over w))) - (# (lag(users) over w))   
from tmp   
window w as (order by isoyear,week)   
order by 1,2;

Elapsed: 51.668 ms

test case 9: UV over an arbitrary sliding week, by region

Use a function that takes the window boundaries as parameters.

create or replace function get_uv_byrid_appro (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$   
  select # hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1;  
$$ language sql strict;   
select * from get_uv_byrid_appro (1::int2, '2023-01-01', '2023-01-08');

Elapsed: 6.021 ms

test case 10: new visitors vs. the previous arbitrary sliding week, by region (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr_byrid_appro (  
  v_rid int2,   
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
with t1 as  
  (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1 group by 1),  
  t2 as  
  (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts2 and ds < v_end_cts2 group by 1)  
select (# (t1.users || t2.users)) - (# t1.users) from t1,t2;  
$$ language sql strict;  
select * from get_uv_incr_byrid_appro (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 8.934 ms

test case 11: UV over an arbitrary sliding week, overall

Use a function that takes the window boundaries as parameters.

create or replace function get_uv_appro (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date   -- window 1 end   
) returns int8 as $$  
  select # hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1;  
$$ language sql strict;   
select * from get_uv_appro ('2023-01-01', '2023-01-08');

Elapsed: 18.035 ms

test case 12: new visitors vs. the previous arbitrary sliding week, overall (visited window 2 but not window 1)

Use a function that takes the two windows' boundaries as parameters and computes the new UV in window 2 relative to window 1.

create or replace function get_uv_incr_appro (  
  v_begin_cts1 date,   -- window 1 start   
  v_end_cts1 date,   -- window 1 end   
  v_begin_cts2 date,   -- window 2 start   
  v_end_cts2 date   -- window 2 end  
) returns int8 as $$  
with t1 as  
  (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1),  
  t2 as  
  (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts2 and ds < v_end_cts2)  
select (# (t1.users || t2.users)) - (# t1.users) from t1,t2;  
$$ language sql strict;  
select * from get_uv_incr_appro ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

Elapsed: 31.634 ms

Performance comparison

 test case | raw method (ms) | HLL (ms) | HLL speedup
-----------+-----------------+----------+-------------
     1     |        5263.209 |   57.250 |       91.93
     2     |    > 249922.704 |  179.204 |   > 1394.63
     3     |        4824.847 |   50.841 |       94.90
     4     |       41003.313 |   51.931 |      789.57
     5     |       18654.469 |   99.657 |      187.19
     6     |    > 294874.983 |  121.010 |   > 2436.78
     7     |       10757.619 |   52.649 |      204.33
     8     |       38512.642 |   51.668 |      745.39
     9     |         560.176 |    6.021 |       93.04
    10     |         889.635 |    8.934 |       99.58
    11     |         915.000 |   18.035 |       50.73
    12     |        2147.770 |   31.634 |       67.89

Business scenario 2: short-video dedup, filtering already-watched videos

Suppose a short-video service marks every video a user has watched. When recommending videos based on user preferences, the videos the user has already watched must be filtered out of the candidate list. (Approximate results are allowed; high performance is required.)

Implementation and comparison

Table design:

drop table if exists tbl;   
create unlogged table tbl (  -- per-user watched-video list; unlogged to lighten the test load    
  id serial8 primary key,  -- primary key    
  uid int8,  -- user ID    
  vid int8,  -- watched video ID    
  cts timestamp  -- behavior timestamp    
);      
alter sequence tbl_id_seq cache 1000;     
create index on tbl (uid) include (vid);  -- speed up lookups of watched videos by uid

Test data requirements:

  • vid ranges from 1 to 10,000, Gaussian-distributed so that some hot videos are watched frequently.
  • uid ranges from 1 to 10,000, Gaussian-distributed so that some active users watch frequently.

Use pgbench to generate 10 million watched-video records.

vi ~/t.sql     
\set uid random_gaussian(1, 10000, 2.5)     
\set vid random_gaussian(1, 10000, 2.5)     
insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.058 ms  
latency stddev = 0.079 ms  
initial connection time = 20.551 ms  
tps = 171765.018858 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set uid random_gaussian(1, 10000, 2.5)     
         0.000  \set vid random_gaussian(1, 10000, 2.5)     
         0.058  insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());

Inspect the results

postgres=# select * from tbl limit 10;   
  id  | uid  | vid  |            cts               
------+------+------+----------------------------  
    1 | 5380 | 5321 | 2023-08-21 08:53:53.074514  
    2 | 7611 | 7945 | 2023-08-21 08:53:53.074845  
 1001 | 6841 | 2507 | 2023-08-21 08:53:53.074857  
 2001 | 7956 | 4567 | 2023-08-21 08:53:53.074925  
 1002 | 5728 | 4210 | 2023-08-21 08:53:53.075087  
 3001 | 4560 | 1588 | 2023-08-21 08:53:53.075333  
 1003 | 5197 | 5662 | 2023-08-21 08:53:53.07549  
 1004 | 5125 | 3827 | 2023-08-21 08:53:53.075528  
 4001 | 6815 | 4758 | 2023-08-21 08:53:53.075496  
 1005 | 4002 | 8575 | 2023-08-21 08:53:53.075567  
(10 rows)  
postgres=# select count(*) from tbl;    
  count     
----------  
 10000000  
(1 row)  
postgres=# \dt+  
                                   List of relations  
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl  | table | postgres | unlogged    | heap          | 575 MB |   
(1 row)  
postgres=# \di+  
                                            List of relations  
 Schema |      Name       | Type  |  Owner   | Table | Persistence | Access method |  Size  | Description   
--------+-----------------+-------+----------+-------+-------------+---------------+--------+-------------  
 public | tbl_pkey        | index | postgres | tbl   | unlogged    | btree         | 344 MB |   
 public | tbl_uid_vid_idx | index | postgres | tbl   | unlogged    | btree         | 301 MB |   
(2 rows)  
postgres=# vacuum ANALYZE tbl;    
VACUUM

Traditional database design and tests

test case 1:

Randomly generate 100 recommended videos from the hot vid range 4900-5100, pick a random active uid from the 4900-5100 range, filter out the vids this uid has already watched, and return the unwatched vids.

Write the benchmark script

vi ~/test.sql  
\set uid random(4900, 5100)    
with t as (  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
)  
select t.vid from t where not exists   
  (select 1 from tbl where uid=:uid and tbl.vid=t.vid)  
;

Run the benchmark

pgbench -M prepared -n -r -P 1 -f ~/test.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 2469063  
latency average = 0.486 ms  
latency stddev = 0.183 ms  
initial connection time = 22.914 ms  
tps = 20578.539756 (without initial connection time)

tps: 20578.539756

HLL database design and tests

Create the aggregate table; each uid stores one aggregated hll value.

drop table if exists tbl_hll;   
create table tbl_hll (  
  uid int8 unique,   
  vids hll   
);

Aggregate the data

insert into tbl_hll select uid, hll_add_agg(hll_hash_bigint(vid)) from tbl group by 1;    
INSERT 0 10000

Inspect the aggregate table: it occupies only about 11 MB.

postgres=# \dt+  
                                    List of relations  
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+---------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl     | table | postgres | unlogged    | heap          | 575 MB |   
 public | tbl_hll | table | postgres | permanent   | heap          | 11 MB  |   
(2 rows)

Accuracy comparison:

Actual number of distinct watched vids for uid=5000:  
postgres=# select count(distinct vid) from tbl where uid=5000;  
 count   
-------  
  1807  
(1 row)  
Approximate number of vids in uid=5000's hll:  
postgres=# select # vids from tbl_hll where uid=5000;  
      ?column?        
--------------------  
 1758.0590520300507  
(1 row)

True/false membership tests:

Take uid=5000's vids from the raw table and test each against the hll value: all 2081 rows are judged as watched. For values that were actually added, the test always returns true (there are no false negatives); see the knowledge points section for the underlying principle.

with tmp as (  
  select   
    tbl.uid,   
    tbl.vid,   
    ( vids || hll_hash_bigint(vid) ) = vids as is_readed   -- already watched?  
  from tbl_hll join tbl  
  on (tbl_hll.uid=tbl.uid)  
  where   
  tbl.uid=5000  
  and tbl_hll.uid=5000  
)  
select is_readed, count(*) from tmp group by 1;  
 is_readed | count   
-----------+-------  
 t         |  2081  
(1 row)

The next test is far less accurate: out of 1,000,000 candidate vids, 413,826 are judged to be in uid=5000's watched-list hll, so a large number of unwatched videos could be misclassified as already watched.

Think about how to solve this; tuning the hll parameters is enough.

postgres=# select count(*) from tbl_hll , (select generate_series(1,1000000) as vid) as t where uid=5000 and ( vids || hll_hash_bigint(vid) ) = vids;  
 count    
--------  
 413826  
(1 row)
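One way to shrink the false-positive rate (a sketch only; parameter names follow the postgresql-hll project, and the exact space/accuracy trade-off should be verified on your own data): give each hll value more registers by raising log2m. The type modifier is hll(log2m, regwidth, expthresh, sparseon), with defaults hll(11,5,-1,1), and hll_add_agg accepts matching parameters:

```sql
-- Rebuild the aggregate table with 2^17 registers per hll value
-- instead of the default 2^11, trading storage for accuracy.
drop table if exists tbl_hll;
create table tbl_hll (
  uid int8 unique,
  vids hll(17,5)
);
insert into tbl_hll
  select uid, hll_add_agg(hll_hash_bigint(vid), 17, 5) from tbl group by 1;
```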

test case 1:

Randomly generate 100 recommended videos from the hot vid range 4900-5100, pick a random active uid from the 4900-5100 range, filter out the vids this uid has already watched, and return the unwatched vids.

Write the benchmark script:

vi ~/test1.sql  
\set uid random(4900, 5100)    
select t1.vid from   
(  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
) t1   
join   
tbl_hll t2   
on ( t2.uid=:uid and t2.vids || hll_hash_bigint(t1.vid) <> t2.vids);

Benchmark results:

pgbench -M prepared -n -r -P 1 -f ~/test1.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 583366  
latency average = 2.057 ms  
latency stddev = 0.631 ms  
initial connection time = 21.736 ms  
tps = 4862.153296 (without initial connection time)

tps: 4862.153296

A bit disappointing? The main cause is that the hll computation itself is expensive. You could optimize the hll code, or push the vids and hll values out to the application side for the membership test (shifting the load to the application tier, which is stateless and very easy to scale out).

Roaringbitmap database design and tests

Next, use the rb (roaringbitmap) type instead of hll to store each user's aggregated watched-video IDs. rb is a compressed sparse bitmap and stores exact values.

Create the roaringbitmap extension

postgres=# create extension roaringbitmap ;  
CREATE EXTENSION  
postgres=# \dT  
          List of data types  
 Schema |     Name      | Description   
--------+---------------+-------------  
 public | roaringbitmap |   
(1 row)  
postgres=# \do+  
                                          List of operators  
 Schema | Name | Left arg type | Right arg type |  Result type  |       Function        | Description   
--------+------+---------------+----------------+---------------+-----------------------+-------------  
 public | #    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_xor                |   
 public | &    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_and                |   
 public | &&   | roaringbitmap | roaringbitmap  | boolean       | rb_intersect          |   
 public | -    | roaringbitmap | integer        | roaringbitmap | rb_remove             |   
 public | -    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_andnot             |   
 public | <<   | roaringbitmap | bigint         | roaringbitmap | rb_shiftleft          |   
 public | <>   | roaringbitmap | roaringbitmap  | boolean       | rb_not_equals         |   
 public | <@   | integer       | roaringbitmap  | boolean       | public.rb_containedby |   
 public | <@   | roaringbitmap | roaringbitmap  | boolean       | public.rb_containedby |   
 public | =    | roaringbitmap | roaringbitmap  | boolean       | rb_equals             |   
 public | >>   | roaringbitmap | bigint         | roaringbitmap | rb_shiftright         |   
 public | @>   | roaringbitmap | integer        | boolean       | public.rb_contains    |   
 public | @>   | roaringbitmap | roaringbitmap  | boolean       | public.rb_contains    |   
 public | |    | integer       | roaringbitmap  | roaringbitmap | public.rb_add         |   
 public | |    | roaringbitmap | integer        | roaringbitmap | public.rb_add         |   
 public | |    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_or                 |   
(16 rows)

Create the aggregate table

drop table if exists tbl_rb;   
create table tbl_rb (  
  uid int8 unique,   
  vids roaringbitmap   
);

Load the aggregate data

insert into tbl_rb select uid, rb_build_agg(vid::int4) from tbl group by 1;    
INSERT 0 10000

Inspect the aggregate table: it occupies only about 20 MB.

postgres=# \dt+  
                                    List of relations  
 Schema |  Name  | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+--------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl    | table | postgres | unlogged    | heap          | 575 MB |   
 public | tbl_rb | table | postgres | permanent   | heap          | 20 MB  |   
(2 rows)  
postgres=# \di+  
                                             List of relations  
 Schema |      Name       | Type  |  Owner   | Table  | Persistence | Access method |  Size  | Description   
--------+-----------------+-------+----------+--------+-------------+---------------+--------+-------------  
 public | tbl_pkey        | index | postgres | tbl    | unlogged    | btree         | 344 MB |   
 public | tbl_rb_uid_key  | index | postgres | tbl_rb | permanent   | btree         | 240 kB |   
 public | tbl_uid_vid_idx | index | postgres | tbl    | unlogged    | btree         | 301 MB |   
(3 rows)

rb is an exact type; compare with the raw data:

postgres=# select count(distinct vid) from tbl where uid=5000;  
 count   
-------  
  1807  
(1 row)  
postgres=# select count(*) from (select unnest(rb_to_array(vids)) from tbl_rb where uid=5000) t;  
 count   
-------  
  1807  
(1 row)

test case 1:

Randomly generate 100 recommended videos from the hot vid range 4900-5100, pick a random active uid from the 4900-5100 range, filter out the vids this uid has already watched, and return the unwatched vids.

Write the benchmark script:

vi ~/test2.sql  
\set uid random(4900, 5100)    
select t1.vid from   
(  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
) t1   
join   
tbl_rb t2   
on ( t2.uid=:uid and not (t1.vid <@ t2.vids) );

Benchmark results:

pgbench -M prepared -n -r -P 1 -f ~/test2.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 3098707  
latency average = 0.387 ms  
latency stddev = 0.147 ms  
initial connection time = 21.281 ms  
tps = 25826.645185 (without initial connection time)

tps: 25826.645185

Performance comparison

 item         | traditional | hll aggregate | rb sparse aggregate
--------------+-------------+---------------+---------------------
 storage size | 1220 MB     | 11 MB         | 20 MB
 tps          | 20578       | 4862          | 25826

Knowledge points

1. hll.

When a value is added to an hll, you can think of it as being passed through a hash function and mapped onto n positions of a bitmap value; those n bit positions are all set to 1.

To test whether a value is present, check whether all n of its bit positions in the bitmap value are set to 1.

The hll type is lossy: a "not present" answer is always true, but a "present" answer may be false (because of bit-position collisions), so there is a false-positive problem.
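With the hll extension, this membership test can be written with the union-equality trick used earlier: unioning in the hash of a value that is already "present" leaves the hll value unchanged. A minimal sketch:

```sql
-- An hll containing 1 and 2: adding 1 again does not change it
-- (judged present), while adding 999 changes it (judged absent).
with h as (
  select hll_empty() || hll_hash_bigint(1) || hll_hash_bigint(2) as v
)
select
  (v || hll_hash_bigint(1))   = v as contains_1,    -- true
  (v || hll_hash_bigint(999)) = v as contains_999   -- false (barring collisions)
from h;
```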

hll supports more, and more complex, requirements than a plain counter:

  • Distinct counts (a counter covers only the window it was counted over; a different window needs a full recount. Two hll values can be merged, so no full recount is needed.)
  • Membership tests (a counter cannot do this.)
  • PV/UV (a counter can do this too.)
  • Sliding-window analysis (a counter cannot: two counter values cannot be merged, so every window would need its own recount. Two hll values can be merged, so no recount is needed; just union the hll values.)

2. Window query syntax.
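This refers to the named-window clause used throughout the hll test cases; a minimal sketch against the tbl_hll table above:

```sql
-- lag() fetches the previous row's value within each partition;
-- the named window avoids repeating the partition/order specification.
select
  rid, ds,
  lag(users) over w as prev_users
from tbl_hll
window w as (partition by rid order by ds);
```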

3. How to quickly build test data that matches a business model?

《PostgreSQL+MySQL 联合解决方案 - 第3课视频 - 如何压测PG数据库、如何瞬间构造海量测试数据》

《PostgreSQL 如何快速构建 海量 逼真 测试数据》

《生成泊松、高斯、指数、随机分布数据 - PostgreSQL 9.5 new feature - pgbench improve, gaussian (standard normal) & exponential distribution》

man pgbench

4. roaringbitmap.
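The roaringbitmap operators listed in the \do+ output above can be exercised directly; a few basic examples (rb_build builds a bitmap from an int array):

```sql
select rb_cardinality(rb_build(array[1,2,3]));                    -- 3: exact distinct count
select rb_to_array(rb_build(array[1,2]) | rb_build(array[2,3]));  -- {1,2,3}: union (rb_or)
select 2 <@ rb_build(array[1,2,3]);                               -- true: exact membership test
select rb_to_array(rb_build(array[1,2,3]) - 2);                   -- {1,3}: remove one value
```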

Questions to consider

1. What other scenarios could use hll?

2. Is there a better technique than hll?

  • the roaringbitmap type?
  • the array type?

3. What did you learn from this experiment?

4. Why does hll perform so well?

5. What does hll sacrifice to gain this performance?

6. Why is hll approximate?

7. Which hll parameter controls how many distinct values an hll value can hold?

8. Which parameters determine the accuracy of an hll value?

9. Why can multiple hll values be unioned?
