沉浸式学习PostgreSQL|PolarDB 1: 短视频推荐去重、UV统计分析场景

本文涉及的产品
云原生数据库 PolarDB PostgreSQL 版,标准版 2核4GB 50GB
云原生数据库 PolarDB MySQL 版,通用型 2核4GB 50GB
简介: 本实验场景:短视频推荐去重、UV统计分析场景. 欢迎一起来建设数据库沉浸式学习教学素材库, 帮助开发者用好数据库, 提升开发者竞争力, 为企业降本提效.本文的实验可以使用永久免费的云起实验室来完成.https://developer.aliyun.com/adc/scenario/exp/f55dbfac77c0467a9d3cd95ff6697a31如果你本地有docker环境也可以把镜像拉到本地来做实验.

作者

digoal

日期

2023-08-19

标签

PostgreSQL , PolarDB , 数据库 , 教学


背景

本文的实验可以使用永久免费的云起实验室来完成.

如果你本地有docker环境也可以把镜像拉到本地来做实验:

x86_64机器使用以下docker image:

Apple Chip机器使用以下docker image:

非常欢迎数据库用户提出场景给我, 回复如下issue, 一起来建设数据库沉浸式学习教学素材库, 帮助开发者用好数据库, 提升开发者竞争力, 为企业降本提效.

业务场景1, UV统计分析场景

例如有一个网站, 业务会记录用户每一次点击行为, 每一次点击一条记录.

基于以上明细数据, 需要做如下分析(允许近似值, 要求高性能):

  • 按地域的UV/天
  • 按地域的相比前一天新增的访问人数(昨天没访问, 今天访问了的)
  • 整体UV/天
  • 整体相比前一天新增的访问人数(昨天没访问, 今天访问了的)
  • 按地域的UV/周
  • 按地域的相比前一周新增的访问人数(昨天没访问, 今天访问了的)
  • 整体UV/周
  • 整体相比前一周新增的访问人数(昨天没访问, 今天访问了的)
  • 按地域的UV/任意滑动周
  • 按地域的相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
  • 整体UV/任意滑动周
  • 整体相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)

实现和对比

设计表结构:

drop table if exists tbl;  
create unlogged table tbl (  -- 用户行为表, 为了减轻测试负担, 使用unlogged table  
  id serial8 primary key,  -- 主键  
  uid int8,  -- 用户ID  
  rid int2,  -- 地域ID  
  cts timestamp,  -- 用户行为时间  
  info text  -- 用户行为   
);    
alter sequence tbl_id_seq cache 1000;

设计测试数据, 要求:

  • UID 取值范围1 到 2亿, 为了产生部分会员活跃的效果, UID随机值的生成按高斯分布.
  • 地域取值为 mod(uid,200)
  • cts为行为时间, 随机落在0-31天(2678400 秒)的范围内.
  • info字段在本实验中无意义, 随机填充, 表达用户行为.

使用pgbench 生成用户行为原始数据1000万条.

vi ~/t.sql   
\set uid random_gaussian(1, 200000000, 2.5)   
insert into tbl (uid,rid,cts,info) values (:uid, mod(:uid,200), '2023-01-01'::timestamp + (random()*2678400||' s')::interval, md5(random()::text));
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.064 ms  
latency stddev = 0.080 ms  
initial connection time = 19.642 ms  
tps = 155608.242432 (without initial connection time)

结果查看

postgres=# select * from tbl limit 10;  
  id  |    uid    | rid |            cts             |               info                 
------+-----------+-----+----------------------------+----------------------------------  
    1 |  97559812 |  12 | 2023-01-01 22:07:44.879178 | 4137dad1b052dcd704213e0b38735a1b  
    2 | 138253982 | 182 | 2023-01-15 02:38:03.556602 | ae3d5f27f61c8f0cd8f9864d79a693b5  
 3001 | 160494052 |  52 | 2023-01-17 23:45:29.174513 | b7972471fa02c8efa13113e96218c755  
    3 | 132545346 | 146 | 2023-01-05 22:23:11.484299 | 343f9298c26788d8bf454cfc7173a1f3  
    4 | 120769706 | 106 | 2023-01-16 21:47:52.81739  | a43a922d985b2ecc75e943af2c4c95be  
 5001 |  73660892 |  92 | 2023-01-30 05:42:50.828799 | 9515ed12769e13ff6c8f5e633860d548  
 3002 | 151242348 | 148 | 2023-01-31 07:03:15.446931 | 076c30be03af95f8a6dab1b65dd94903  
    5 | 164054658 |  58 | 2023-01-24 19:24:56.110967 | 19c46f300fe17397a108c6a2b69a7855  
 5002 | 131108032 |  32 | 2023-01-23 19:04:44.526588 | e930a9df98f90a3d47f48fe5cf1eaf9a  
    6 |  33572954 | 154 | 2023-01-16 00:47:17.449256 | 6a2ecb267b3281c91c5645b57a0b29d8  
(10 rows)  
postgres=# select count(*) from tbl;  
  count     
----------  
 10000000  
(1 row)  
postgres=# \dt+  
                                   List of relations  
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl  | table | postgres | unlogged    | heap          | 965 MB |   
(1 row)  
postgres=# \di+  
                                         List of relations  
 Schema |   Name   | Type  |  Owner   | Table | Persistence | Access method |  Size  | Description   
--------+----------+-------+----------+-------+-------------+---------------+--------+-------------  
 public | tbl_pkey | index | postgres | tbl   | unlogged    | btree         | 343 MB |   
(1 row)  
postgres=# vacuum ANALYZE tbl;  
VACUUM

传统数据库设计和试验

test case 1: 按地域的UV/天

select   
  rid,  
  date(cts),  
  count(distinct uid)   
from   
  tbl   
group by 1,2   
order by 1,2;

耗时: 5263.209 ms

test case 2: 按地域的相比前一天新增的访问人数(昨天没访问, 今天访问了的)

select   
  t.d,   
  t.rid,   
  (select count(*) from   
    (select uid from tbl where date(cts)=t.d and rid=t.rid   
     except   
     select uid from tbl where date(cts)=t.d-1 and rid=t.rid) as tmp  
  ) as uv_incr   
from   
  (select date(cts) as d , rid from tbl group by 1,2) as t   
order by 1,2;

耗时: 超过 249922.704 ms

test case 3: 整体UV/天

select   
  date(cts),  
  count(distinct uid)   
from   
  tbl   
group by 1   
order by 1;

耗时: 4824.847 ms

test case 4: 整体相比前一天新增的访问人数(昨天没访问, 今天访问了的)

select   
  t.d,   
  (select count(*) from   
    (select uid from tbl where date(cts)=t.d   
     except   
     select uid from tbl where date(cts)=t.d-1) as tmp  
  ) as uv_incr   
from   
  (select distinct date(cts) as d from tbl) as t   
order by 1;

耗时: 41003.313 ms

test case 5: 按地域的UV/周

select   
  EXTRACT(ISOYEAR FROM cts) as isoyear,   
  EXTRACT(week FROM cts) as week,   
  rid,  
  count(distinct uid)   
from   
  tbl   
group by 1,2,3   
order by 1,2,3;

耗时: 18654.469 ms

test case 6: 按地域的相比前一周新增的访问人数(昨天没访问, 今天访问了的)

select   
  t.isoyear,   
  t.week,   
  t.rid,   
  (select count(*) from   
    (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week and rid=t.rid   
     except   
     select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1 and rid=t.rid) as tmp  
  ) as uv_incr   
from   
  (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week, rid from tbl group by 1,2,3) as t   
order by 1,2,3;

耗时: 超过 294874.983 ms

跨年的话更加复杂, 因为年初的第一周的上一周是去年的最后一周, 写SQL时不好表达, 可以加case进行处理, 或者使用epoch来计算.

留个作业: 有兴趣的小伙伴可以修改一下SQL, 实现跨年.

test case 7: 整体UV/周

select   
  EXTRACT(ISOYEAR FROM cts) as isoyear,   
  EXTRACT(week FROM cts) as week,   
  count(distinct uid)   
from   
  tbl   
group by 1,2  
order by 1,2;

耗时: 10757.619 ms

test case 8: 整体相比前一周新增的访问人数(昨天没访问, 今天访问了的)

select   
  t.isoyear,   
  t.week,   
  (select count(*) from   
    (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week   
     except   
     select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1) as tmp  
  ) as uv_incr   
from   
  (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week from tbl group by 1,2) as t   
order by 1,2;

耗时: 38512.642 ms

test case 9: 按地域的UV/任意滑动周

使用函数, 输入要对比的两个窗口变量.

create or replace function get_uv_byrid (  
  v_rid int2,   
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date   -- 窗口1的结束时间   
) returns int8 as $$  
  select count(distinct uid) from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1;  
$$ language sql strict;  
select * from get_uv_byrid (1::int2, '2023-01-01', '2023-01-08');

耗时: 560.176 ms

test case 10: 按地域的相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)

使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.

create or replace function get_uv_incr_byrid (  
  v_rid int2,   
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date,   -- 窗口1的结束时间   
  v_begin_cts2 date,   -- 窗口2的开始时间   
  v_end_cts2 date   -- 窗口2的结束时间  
) returns int8 as $$  
select count(*) from   
  (select uid from tbl where rid=v_rid and cts >= v_begin_cts2 and cts < v_end_cts2  
   except  
   select uid from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1  
  ) as tmp;  
$$ language sql strict;  
select * from get_uv_incr_byrid (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

耗时: 889.635 ms

test case 11: 整体UV/任意滑动周

使用函数, 输入要对比的两个窗口变量.

create or replace function get_uv (  
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date   -- 窗口1的结束时间   
) returns int8 as $$  
  select count(distinct uid) from tbl where cts >= v_begin_cts1 and cts < v_end_cts1;  
$$ language sql strict;   
select * from get_uv ('2023-01-01', '2023-01-08');

耗时: 915.000 ms

test case 12: 整体相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)

使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.

create or replace function get_uv_incr (  
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date,   -- 窗口1的结束时间   
  v_begin_cts2 date,   -- 窗口2的开始时间   
  v_end_cts2 date   -- 窗口2的结束时间  
) returns int8 as $$  
select count(*) from   
  (select uid from tbl where cts >= v_begin_cts2 and cts < v_end_cts2  
   except  
   select uid from tbl where cts >= v_begin_cts1 and cts < v_end_cts1  
  ) as tmp;  
$$ language sql strict;  
select * from get_uv_incr ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

耗时: 2147.770 ms

hll数据库设计和试验

创建hll插件

create extension hll;

hll插件支持的类型、OP和函数等, 详细可参考hll开源项目文档:

postgres=# \dT+  
                                         List of data types  
 Schema |    Name     | Internal name | Size | Elements |  Owner   | Access privileges | Description   
--------+-------------+---------------+------+----------+----------+-------------------+-------------  
 public | hll         | hll           | var  |          | postgres |                   |   
 public | hll_hashval | hll_hashval   | 8    |          | postgres |                   |   
(2 rows)  
postgres=# \do+  
                                         List of operators  
 Schema | Name | Left arg type | Right arg type |   Result type    |    Function     | Description   
--------+------+---------------+----------------+------------------+-----------------+-------------  
 public | #    |               | hll            | double precision | hll_cardinality |   
 public | <>   | hll           | hll            | boolean          | hll_ne          |   
 public | <>   | hll_hashval   | hll_hashval    | boolean          | hll_hashval_ne  |   
 public | =    | hll           | hll            | boolean          | hll_eq          |   
 public | =    | hll_hashval   | hll_hashval    | boolean          | hll_hashval_eq  |   
 public | ||   | hll           | hll            | hll              | hll_union       |   
 public | ||   | hll           | hll_hashval    | hll              | hll_add         |   
 public | ||   | hll_hashval   | hll            | hll              | hll_add_rev     |   
(8 rows)

创建聚合表

create table tbl_hll (  
  rid int2,   -- 地域id  
  ds date,    -- 日期  
  users hll   -- HLL值, 存储某个地域某天的所有UID聚合后的hll值  
);

将原始用户行为数据写入聚合表

insert into tbl_hll select rid, date(cts), hll_add_agg(hll_hash_bigint(uid)) from tbl group by 1,2;  
INSERT 0 6200  
Time: 2976.876 ms (00:02.977)

查看聚合表, 只占用了8MB左右.

postgres=# \dt+  
                                     List of relations  
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size   | Description   
--------+---------+-------+----------+-------------+---------------+---------+-------------  
 public | tbl     | table | postgres | unlogged    | heap          | 965 MB  |   
 public | tbl_hll | table | postgres | permanent   | heap          | 8312 kB |   
(2 rows)

近似情况对比:

hll里有多少uid?

postgres=# select # hll_union_agg(users) from tbl_hll;   
     ?column?        
-------------------  
 9474798.879453091  
(1 row)  
Time: 47.075 ms

真实有多少uid?

postgres=# select count(distinct uid) from tbl;   
  count    
---------  
 9648563  
(1 row)  
Time: 2373.055 ms (00:02.373)

test case 1: 按地域的UV/天

select   
  rid,   
  ds,   
  # users   
from tbl_hll ;

耗时: 57.250 ms

test case 2: 按地域的相比前一天新增的访问人数(昨天没访问, 今天访问了的)

select   
  rid,   
  ds,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))  
from tbl_hll   
window w as (partition by rid order by ds)   
order by 1,2;

耗时: 179.204 ms

test case 3: 整体UV/天

select   
  ds,   
  # hll_union_agg(users)   
from tbl_hll  
group by ds  
order by 1;

耗时: 50.841 ms

test case 4: 整体相比前一天新增的访问人数(昨天没访问, 今天访问了的)

with tmp as (  
  select   
    ds,   
    hll_union_agg(users) as users  
  from tbl_hll  
  group by ds  
)   
select   
  ds,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))  
from tmp   
window w as (order by ds)   
order by 1;

耗时: 51.931 ms

test case 5: 按地域的UV/周

select   
  EXTRACT(ISOYEAR FROM ds) as isoyear,   
  EXTRACT(week FROM ds) as week,   
  rid,   
  # hll_union_agg(users)   
from tbl_hll   
group by 1,2,3;

耗时: 99.657 ms

test case 6: 按地域的相比前一周新增的访问人数(昨天没访问, 今天访问了的)

with tmp as (  
  select   
    EXTRACT(ISOYEAR FROM ds) as isoyear,   
    EXTRACT(week FROM ds) as week,   
    rid,   
    hll_union_agg(users) as users  
  from tbl_hll   
  group by 1,2,3   
)   
select   
  isoyear,   
  week,    
  rid,   
  (# (users || (lag(users) over w))) - (# (lag(users) over w))   
from tmp   
window w as (partition by rid order by isoyear,week)   
order by 1,2,3;

耗时: 121.010 ms

跨年的话更加复杂, 因为年初的第一周的上一周是去年的最后一周, 写SQL时不好表达, 可以加case进行处理, 或者使用epoch来计算.

留个作业: 有兴趣的小伙伴可以修改一下SQL, 实现跨年.

test case 7: 整体UV/周

select   
  EXTRACT(ISOYEAR FROM ds) as isoyear,   
  EXTRACT(week FROM ds) as week,   
  # hll_union_agg(users) as users  
from   
  tbl_hll   
group by 1,2  
order by 1,2;

耗时: 52.649 ms

test case 8: 整体相比前一周新增的访问人数(昨天没访问, 今天访问了的)

with tmp as (  
  select   
    EXTRACT(ISOYEAR FROM ds) as isoyear,   
    EXTRACT(week FROM ds) as week,   
    hll_union_agg(users) as users  
  from   
    tbl_hll   
  group by 1,2  
)  
select   
  isoyear,   
  week,    
  (# (users || (lag(users) over w))) - (# (lag(users) over w))   
from tmp   
window w as (order by isoyear,week)   
order by 1,2;

耗时: 51.668 ms

test case 9: 按地域的UV/任意滑动周

使用函数, 输入要对比的两个窗口变量.

create or replace function get_uv_byrid_appro (  
  v_rid int2,   
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date   -- 窗口1的结束时间   
) returns int8 as $$   
  select # hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1;  
$$ language sql strict;   
select * from get_uv_byrid_appro (1::int2, '2023-01-01', '2023-01-08');

耗时: 6.021 ms

test case 10: 按地域的相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)

使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.

create or replace function get_uv_incr_byrid_appro (  
  v_rid int2,   
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date,   -- 窗口1的结束时间   
  v_begin_cts2 date,   -- 窗口2的开始时间   
  v_end_cts2 date   -- 窗口2的结束时间  
) returns int8 as $$  
with t1 as  
  (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1 group by 1),  
  t2 as  
  (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts2 and ds < v_end_cts2 group by 1)  
select (# (t1.users || t2.users)) - (# t1.users) from t1,t2;  
$$ language sql strict;  
select * from get_uv_incr_byrid_appro (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

耗时: 8.934 ms

test case 11: 整体UV/任意滑动周

使用函数, 输入要对比的两个窗口变量.

create or replace function get_uv_appro (  
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date   -- 窗口1的结束时间   
) returns int8 as $$  
  select # hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1;  
$$ language sql strict;   
select * from get_uv_appro ('2023-01-01', '2023-01-08');

耗时: 18.035 ms

test case 12: 整体相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)

使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.

create or replace function get_uv_incr_appro (  
  v_begin_cts1 date,   -- 窗口1的开始时间   
  v_end_cts1 date,   -- 窗口1的结束时间   
  v_begin_cts2 date,   -- 窗口2的开始时间   
  v_end_cts2 date   -- 窗口2的结束时间  
) returns int8 as $$  
with t1 as  
  (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1),  
  t2 as  
  (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts2 and ds < v_end_cts2)  
select (# (t1.users || t2.users)) - (# t1.users) from t1,t2;  
$$ language sql strict;  
select * from get_uv_incr_appro ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');

耗时: 31.634 ms

性能对比结果

test case 原始方法耗时(ms) HLL耗时(ms) HLL性能提升倍数
1 5263.209 57.250 91.93
2 超过 249922.704 179.204 超过 1394.63
3 4824.847 50.841 94.90
4 41003.313 51.931 789.57
5 18654.469 99.657 187.19
6 超过 294874.983 121.010 超过 2436.78
7 10757.619 52.649 204.33
8 38512.642 51.668 745.39
9 560.176 6.021 93.04
10 889.635 8.934 99.58
11 915.000 18.035 50.73
12 2147.770 31.634 67.89

业务场景2, 短视频去重、过滤已观看视频

例如有一个短视频业务, 用户看过的视频都会被标记. 当根据用户喜好推送给用户视频, 需要从推荐列表中过滤掉用户已经看过的视频. (允许近似值, 要求高性能)

实现和对比

设计表结构:

drop table if exists tbl;   
create unlogged table tbl (  -- 用户已读视频列表, 为了减轻测试负担, 使用unlogged table    
  id serial8 primary key,  -- 主键    
  uid int8,  -- 用户ID    
  vid int8,  -- 已读视频ID    
  cts timestamp  -- 用户行为时间    
);      
alter sequence tbl_id_seq cache 1000;     
create index on tbl (uid) include (vid);  -- 加速根据uid查询已读视频

设计测试数据, 要求:

  • vid 取值范围1到1万. 高斯分布, 突出一些经常被观看的热门视频.
  • uid 取值范围1到1万. 高斯分布, 突出一些经常看视频的活跃用户.

使用pgbench生成1000万条用户已读视频列表的记录.

vi ~/t.sql     
\set uid random_gaussian(1, 10000, 2.5)     
\set vid random_gaussian(1, 10000, 2.5)     
insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
number of transactions per client: 1000000  
number of transactions actually processed: 10000000/10000000  
latency average = 0.058 ms  
latency stddev = 0.079 ms  
initial connection time = 20.551 ms  
tps = 171765.018858 (without initial connection time)  
statement latencies in milliseconds:  
         0.000  \set uid random_gaussian(1, 100000, 2.5)     
         0.000  \set vid random_gaussian(1, 100000, 2.5)     
         0.058  insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());

结果查看

postgres=# select * from tbl limit 10;   
  id  | uid  | vid  |            cts               
------+------+------+----------------------------  
    1 | 5380 | 5321 | 2023-08-21 08:53:53.074514  
    2 | 7611 | 7945 | 2023-08-21 08:53:53.074845  
 1001 | 6841 | 2507 | 2023-08-21 08:53:53.074857  
 2001 | 7956 | 4567 | 2023-08-21 08:53:53.074925  
 1002 | 5728 | 4210 | 2023-08-21 08:53:53.075087  
 3001 | 4560 | 1588 | 2023-08-21 08:53:53.075333  
 1003 | 5197 | 5662 | 2023-08-21 08:53:53.07549  
 1004 | 5125 | 3827 | 2023-08-21 08:53:53.075528  
 4001 | 6815 | 4758 | 2023-08-21 08:53:53.075496  
 1005 | 4002 | 8575 | 2023-08-21 08:53:53.075567  
(10 rows)  
postgres=# select count(*) from tbl;    
  count     
----------  
 10000000  
(1 row)  
postgres=# \dt+  
                                   List of relations  
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl  | table | postgres | unlogged    | heap          | 575 MB |   
(1 row)  
postgres=# \di+  
                                            List of relations  
 Schema |      Name       | Type  |  Owner   | Table | Persistence | Access method |  Size  | Description   
--------+-----------------+-------+----------+-------+-------------+---------------+--------+-------------  
 public | tbl_pkey        | index | postgres | tbl   | unlogged    | btree         | 344 MB |   
 public | tbl_uid_vid_idx | index | postgres | tbl   | unlogged    | btree         | 301 MB |   
(2 rows)  
postgres=# vacuum ANALYZE tbl;    
VACUUM

传统数据库设计和试验

test case1:

从4900-5100号段热门vid随机生成100个推荐视频, 从4900-5100号段随机获取活跃uid, 从用户已读列表中过滤该UID已读的vid, 返回未读的UID.

编写压测脚本

vi ~/test.sql  
\set uid random(4900, 5100)    
with t as (  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
)  
select t.vid from t where not exists   
  (select 1 from tbl where uid=:uid and tbl.vid=t.vid)  
;

压测

pgbench -M prepared -n -r -P 1 -f ~/test.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 2469063  
latency average = 0.486 ms  
latency stddev = 0.183 ms  
initial connection time = 22.914 ms  
tps = 20578.539756 (without initial connection time)

tps: 20578.539756

hll数据库设计和试验

创建聚合表, 每个UID存储一条聚合后的hll value.

drop table if exists tbl_hll;   
create table tbl_hll (  
  uid int8 unique,   
  vids hll   
);

聚合

insert into tbl_hll select uid, hll_add_agg(hll_hash_bigint(vid)) from tbl group by 1;    
INSERT 0 10000

查看聚合表, 只占用了11MB左右.

postgres=# \dt+  
                                    List of relations  
 Schema |  Name   | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+---------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl     | table | postgres | unlogged    | heap          | 575 MB |   
 public | tbl_hll | table | postgres | permanent   | heap          | 11 MB  |   
(2 rows)

近似情况对比:

uid=5000实际已读vid数  
postgres=# select count(distinct vid) from tbl where uid=5000;  
 count   
-------  
  1807  
(1 row)  
uid=5000的hll里可能存在的vid数  
postgres=# select # vids from tbl_hll where uid=5000;  
      ?column?        
--------------------  
 1758.0590520300507  
(1 row)

true, false 测试:

使用原始表取出uid=5000的vid, 在hll值中判断是否已读, 2081 条全部已读, 判断已读是绝对准确的, 原理参考本文知识点的部分.

with tmp as (  
  select   
    tbl.uid,   
    tbl.vid,   
    ( vids || hll_hash_bigint(vid) ) = vids as is_readed   -- 是否已读  
  from tbl_hll join tbl  
  on (tbl_hll.uid=tbl.uid)  
  where   
  tbl.uid=5000  
  and tbl_hll.uid=5000  
)  
select is_readed, count(*) from tmp group by 1;  
 is_readed | count   
-----------+-------  
 t         |  2081  
(1 row)

以下这个就不太准了, 100万个vid, 居然有413826个被认为在uid=5000的hll已读列表值里面, 所以可能出现的结果是大量未读视频会被误以为是已读.

解决办法大家可以思考一下, 调整hll的参数即可.

postgres=# select count(*) from tbl_hll , (select generate_series(1,1000000) as vid) as t where uid=5000 and ( vids || hll_hash_bigint(vid) ) = vids;  
 count    
--------  
 413826  
(1 row)

test case1:

从4900-5100号段热门vid随机生成100个推荐视频, 从4900-5100号段随机获取活跃uid, 从用户已读列表中过滤该UID已读的vid, 返回未读的UID.

编写压测脚本:

vi ~/test1.sql  
\set uid random(4900, 5100)    
select t1.vid from   
(  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
) t1   
join   
tbl_hll t2   
on ( t2.uid=:uid and t2.vids || hll_hash_bigint(t1.vid) <> t2.vids);

压测结果:

pgbench -M prepared -n -r -P 1 -f ~/test1.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 583366  
latency average = 2.057 ms  
latency stddev = 0.631 ms  
initial connection time = 21.736 ms  
tps = 4862.153296 (without initial connection time)

tps: 4862.153296

是不是有点失望? 原因主要是hll的计算比较耗时, 可以考虑优化hll代码, 或者把vid和hll都推出到application端进行判断(也就是把压力给到应用程序, 毕竟应用程序是无状态的非常容易扩展).

rb数据库设计和试验

下面使用rb类型来代替hll存储用户聚合后的已读视频id, rb是稀疏bitmap存储, 精确值.

创建roaringbitmap插件

postgres=# create extension roaringbitmap ;  
CREATE EXTENSION  
postgres=# \dT  
          List of data types  
 Schema |     Name      | Description   
--------+---------------+-------------  
 public | roaringbitmap |   
(1 row)  
postgres=# \do+  
                                          List of operators  
 Schema | Name | Left arg type | Right arg type |  Result type  |       Function        | Description   
--------+------+---------------+----------------+---------------+-----------------------+-------------  
 public | #    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_xor                |   
 public | &    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_and                |   
 public | &&   | roaringbitmap | roaringbitmap  | boolean       | rb_intersect          |   
 public | -    | roaringbitmap | integer        | roaringbitmap | rb_remove             |   
 public | -    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_andnot             |   
 public | <<   | roaringbitmap | bigint         | roaringbitmap | rb_shiftleft          |   
 public | <>   | roaringbitmap | roaringbitmap  | boolean       | rb_not_equals         |   
 public | <@   | integer       | roaringbitmap  | boolean       | public.rb_containedby |   
 public | <@   | roaringbitmap | roaringbitmap  | boolean       | public.rb_containedby |   
 public | =    | roaringbitmap | roaringbitmap  | boolean       | rb_equals             |   
 public | >>   | roaringbitmap | bigint         | roaringbitmap | rb_shiftright         |   
 public | @>   | roaringbitmap | integer        | boolean       | public.rb_contains    |   
 public | @>   | roaringbitmap | roaringbitmap  | boolean       | public.rb_contains    |   
 public | |    | integer       | roaringbitmap  | roaringbitmap | public.rb_add         |   
 public | |    | roaringbitmap | integer        | roaringbitmap | public.rb_add         |   
 public | |    | roaringbitmap | roaringbitmap  | roaringbitmap | rb_or                 |   
(16 rows)

创建聚合表

drop table if exists tbl_rb;   
create table tbl_rb (  
  uid int8 unique,   
  vids roaringbitmap   
);

写入聚合数据

insert into tbl_rb select uid, rb_build_agg(vid::int4) from tbl group by 1;    
INSERT 0 10000

查看聚合表, 只占用了20MB左右.

postgres=# \dt+  
                                    List of relations  
 Schema |  Name  | Type  |  Owner   | Persistence | Access method |  Size  | Description   
--------+--------+-------+----------+-------------+---------------+--------+-------------  
 public | tbl    | table | postgres | unlogged    | heap          | 575 MB |   
 public | tbl_rb | table | postgres | permanent   | heap          | 20 MB  |   
(2 rows)  
postgres=# \di+  
                                             List of relations  
 Schema |      Name       | Type  |  Owner   | Table  | Persistence | Access method |  Size  | Description   
--------+-----------------+-------+----------+--------+-------------+---------------+--------+-------------  
 public | tbl_pkey        | index | postgres | tbl    | unlogged    | btree         | 344 MB |   
 public | tbl_rb_uid_key  | index | postgres | tbl_rb | permanent   | btree         | 240 kB |   
 public | tbl_uid_vid_idx | index | postgres | tbl    | unlogged    | btree         | 301 MB |   
(3 rows)

rb是精确类型, 与原始数据对照如下:

postgres=# select count(distinct vid) from tbl where uid=5000;  
 count   
-------  
  1807  
(1 row)  
postgres=# select count(*) from (select unnest(rb_to_array(vids)) from tbl_rb where uid=5000) t;  
 count   
-------  
  1807  
(1 row)

test case1:

从4900-5100号段热门vid随机生成100个推荐视频, 从4900-5100号段随机获取活跃uid, 从用户已读列表中过滤该UID已读的vid, 返回未读的UID.

编写测试脚本:

vi ~/test2.sql  
\set uid random(4900, 5100)    
select t1.vid from   
(  
  select 4900+(random()*200)::int as vid from generate_series(1,100)  
) t1   
join   
tbl_rb t2   
on ( t2.uid=:uid and not (t1.vid <@ t2.vids) );

压测结果:

pgbench -M prepared -n -r -P 1 -f ~/test2.sql -c 10 -j 10 -T 120  
scaling factor: 1  
query mode: prepared  
number of clients: 10  
number of threads: 10  
duration: 120 s  
number of transactions actually processed: 3098707  
latency average = 0.387 ms  
latency stddev = 0.147 ms  
initial connection time = 21.281 ms  
tps = 25826.645185 (without initial connection time)

tps: 25826.645185

性能对比结果

对比项 传统测试 hll聚合 rb稀疏聚合
存储空间占用 1220 MB 11 MB 20 MB
tps 20578 4862 25826

知识点

1、hll.

往hll中增加一个值时, 可以理解为将这个值经过hashfunction后, 将其映射到一个bitmap value的n个position上, 即这n个bit position位上的值都设置为1.

当需要判断某个值是否存在时, 判断这个值在bitmap value中的n个bit position位上的值是否都为1.

hll类型 lossy问题: 不存在一定为真, 存在可能是假的(因为bit position conflict问题.). 所以有失真问题.

hll 可以实现比普通计数器更多更复杂的需求:

  • 唯一值个数 (计数器只能统计一次, 换个窗口就必须重新全量统计. 但是2个hll值可以叠加, 所以不需要重新全量统计.)
  • 是否已存在 (计数器做不到.)
  • PV UV (计数器也能做到.)
  • 滑动窗口分析 (计数器做不到, 因为两个计数器值无法叠加, 所以使用计数器时每个维度都要重新统计. 但是2个hll值可以叠加, 因此不需要重新统计, 只需要计算hll即可.)

2、窗口查询语法.

3、如何根据业务模型快速构建测试数据?

《PostgreSQL+MySQL 联合解决方案 - 第3课视频 - 如何压测PG数据库、如何瞬间构造海量测试数据》

《PostgreSQL 如何快速构建 海量 逼真 测试数据》

《生成泊松、高斯、指数、随机分布数据 - PostgreSQL 9.5 new feature - pgbench improve, gaussian (standard normal) & exponential distribution》

man pgbench

4、roaringbitmap:

思考

1、还有没有其他场景会用到hll?

2、还有比hll更好的技术实现?

  • roaringbitmap 类型?
  • array 类型?

3、通过这个实验, 你学到了什么?

4、为什么hll性能这么好?

5、hll提升性能的同时牺牲了什么?

6、为什么hll是近似的?

7、hll类型的什么变量可以控制一个hll value能存储多少唯一值?

8、hll value近似度的精确度和什么变量有关?

9、为什么多个hll的值可以union?

参考

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
4月前
|
存储 SQL 安全
应用案例|开源 PolarDB-X 在互联网安全场景的应用实践
中盾集团采用PolarDB-X云原生分布式数据库开源版本,有效解决了大数据量处理、复杂查询以及历史数据维护等难题,实现了业务的高效扩展与优化。
|
20天前
|
资源调度 关系型数据库 MySQL
PolarDB MySQL场景评测
PolarDB具备快速资源弹升能力,支持5秒探测窗口和1秒内完成资源扩展,适合电商促销和流量波动型SaaS应用。资源伸缩范围广泛,支持0-256核,适用于中小型企业到大型企业。资源伸缩过程中业务无感知,具有高稳定性和成熟性。支持最小0.5 PCU的资源颗粒度,确保成本控制和使用效率。此外,PolarDB支持所有只读节点的数据强一致性,性能不受影响。
35 0
|
3月前
|
关系型数据库 MySQL 数据库
mysql关系型数据库的学习
mysql关系型数据库的学习
29 0
|
4月前
|
存储 关系型数据库 分布式数据库
揭秘PolarDB:中国云原生数据库的超级英雄,如何颠覆传统数据存储?
在数字化时代,数据成为企业的核心资产,而云原生数据库则是推动企业转型的关键。PolarDB凭借其先进的存储计算分离架构,在性能、可靠性和易用性方面脱颖而出,成为国内领先的选择。它支持多种数据库引擎,提供多副本存储机制,并采用按量付费模式,有效降低管理和成本压力,助力企业实现高效、可靠的数字化转型。
87 1
|
5月前
|
关系型数据库 MySQL 分布式数据库
PolarDB MySQL多场景评测
本次评测将围绕指定场景中的灵活弹性和无感秒切展开,对于自选场景中的安全和DB+AI也进行了简单体验。
407 3
PolarDB MySQL多场景评测
|
4月前
|
关系型数据库 MySQL 分布式数据库
【开发者评测】PolarDB MySQL场景评测获奖名单公布
PolarDB MySQL场景评测获奖名单公布!!
|
5月前
|
关系型数据库 MySQL Serverless
PolarDB MySQL Serverless:灵活弹性场景深度评测
本文深入评测了阿里云PolarDB MySQL Serverless的灵活弹性场景。作为阿里云专业运维工程师,笔者从多个角度对产品进行了全面分析: 产品特性:介绍了PolarDB MySQL Serverless的核心优势,包括动态弹性、高可用性和按量付费模式。 操作体验:详细描述了集群创建过程和控制台监控功能,突出了其简化运维的特点。 弹性能力:通过三个测试场景验证了产品在不同负载下的自动扩缩容能力,展示了其快速响应和性能稳定性。 API与文档:评估了API的易用性和文档的完整性,并提出了改进建议。 优劣分析:总结了产品的主要优势,如极致弹性和成本效益,同时指出了一些潜在的改进空间。 整体
|
5月前
|
Cloud Native 关系型数据库 分布式数据库
云原生数据库2.0问题之PolarDB利用云计算技术红利如何解决
云原生数据库2.0问题之PolarDB利用云计算技术红利如何解决
|
5月前
|
Cloud Native 关系型数据库 分布式数据库
云原生关系型数据库PolarDB问题之PolarDB相比传统商用数据库的优势如何解决
云原生关系型数据库PolarDB问题之PolarDB相比传统商用数据库的优势如何解决
47 1
|
5月前
|
存储 关系型数据库 MySQL
再探PolarDB —— PolarDB MySQL 四大场景下的全方位评测
本文全面评测了阿里云PolarDB MySQL在四大关键场景下的表现:Serverless极致弹性、列存索引(IMCI)、弹性并行查询(ePQ)以及无感秒切高可用。通过官方提供的免费体验资源,我们深入了解了PolarDB MySQL的核心能力和性能。Serverless极致弹性列存索引(IMCI弹性并行查询(ePQ)无感秒切高可用此外,文章还介绍了PolarDB MySQL在数据备份和HTAP(混合事务/分析处理)场景下的优势,包括灵活的备份策略、高效的全量和库表恢复方式,以及通过IMCI支持的HTAP能力。这些特性共同构成了PolarDB MySQL作为一款先进的云数据库服务的强大竞争力。

热门文章

最新文章

相关产品

  • 云原生数据库 PolarDB