作者
digoal
日期
2023-08-19
标签
PostgreSQL , PolarDB , 数据库 , 教学
背景
本文的实验可以使用永久免费的云起实验室来完成.
如果你本地有docker环境也可以把镜像拉到本地来做实验:
x86_64机器使用以下docker image:
Apple Chip机器使用以下docker image:
非常欢迎数据库用户提出场景给我, 回复如下issue, 一起来建设数据库沉浸式学习教学素材库, 帮助开发者用好数据库, 提升开发者竞争力, 为企业降本提效.
业务场景1, UV统计分析场景
例如有一个网站, 业务会记录用户每一次点击行为, 每一次点击一条记录.
基于以上明细数据, 需要做如下分析(允许近似值, 要求高性能):
- 按地域的UV/天
- 按地域的相比前一天新增的访问人数(昨天没访问, 今天访问了的)
- 整体UV/天
- 整体相比前一天新增的访问人数(昨天没访问, 今天访问了的)
- 按地域的UV/周
- 按地域的相比前一周新增的访问人数(昨天没访问, 今天访问了的)
- 整体UV/周
- 整体相比前一周新增的访问人数(昨天没访问, 今天访问了的)
- 按地域的UV/任意滑动周
- 按地域的相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
- 整体UV/任意滑动周
- 整体相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
实现和对比
设计表结构:
drop table if exists tbl; create unlogged table tbl ( -- 用户行为表, 为了减轻测试负担, 使用unlogged table id serial8 primary key, -- 主键 uid int8, -- 用户ID rid int2, -- 地域ID cts timestamp, -- 用户行为时间 info text -- 用户行为 ); alter sequence tbl_id_seq cache 1000;
设计测试数据, 要求:
- UID 取值范围1 到 2亿, 为了产生部分会员活跃的效果, UID随机值的生成按高斯分布.
- 地域取值为
mod(uid,200)
- cts为行为时间, 随机落在0-31天(2678400 秒)的范围内.
- info字段在本实验中无意义, 随机填充, 表达用户行为.
使用pgbench 生成用户行为原始数据1000万条.
vi ~/t.sql \set uid random_gaussian(1, 200000000, 2.5) insert into tbl (uid,rid,cts,info) values (:uid, mod(:uid,200), '2023-01-01'::timestamp + (random()*2678400||' s')::interval, md5(random()::text));
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1 query mode: prepared number of clients: 10 number of threads: 10 number of transactions per client: 1000000 number of transactions actually processed: 10000000/10000000 latency average = 0.064 ms latency stddev = 0.080 ms initial connection time = 19.642 ms tps = 155608.242432 (without initial connection time)
结果查看
postgres=# select * from tbl limit 10; id | uid | rid | cts | info ------+-----------+-----+----------------------------+---------------------------------- 1 | 97559812 | 12 | 2023-01-01 22:07:44.879178 | 4137dad1b052dcd704213e0b38735a1b 2 | 138253982 | 182 | 2023-01-15 02:38:03.556602 | ae3d5f27f61c8f0cd8f9864d79a693b5 3001 | 160494052 | 52 | 2023-01-17 23:45:29.174513 | b7972471fa02c8efa13113e96218c755 3 | 132545346 | 146 | 2023-01-05 22:23:11.484299 | 343f9298c26788d8bf454cfc7173a1f3 4 | 120769706 | 106 | 2023-01-16 21:47:52.81739 | a43a922d985b2ecc75e943af2c4c95be 5001 | 73660892 | 92 | 2023-01-30 05:42:50.828799 | 9515ed12769e13ff6c8f5e633860d548 3002 | 151242348 | 148 | 2023-01-31 07:03:15.446931 | 076c30be03af95f8a6dab1b65dd94903 5 | 164054658 | 58 | 2023-01-24 19:24:56.110967 | 19c46f300fe17397a108c6a2b69a7855 5002 | 131108032 | 32 | 2023-01-23 19:04:44.526588 | e930a9df98f90a3d47f48fe5cf1eaf9a 6 | 33572954 | 154 | 2023-01-16 00:47:17.449256 | 6a2ecb267b3281c91c5645b57a0b29d8 (10 rows) postgres=# select count(*) from tbl; count ---------- 10000000 (1 row) postgres=# \dt+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+------+-------+----------+-------------+---------------+--------+------------- public | tbl | table | postgres | unlogged | heap | 965 MB | (1 row) postgres=# \di+ List of relations Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description --------+----------+-------+----------+-------+-------------+---------------+--------+------------- public | tbl_pkey | index | postgres | tbl | unlogged | btree | 343 MB | (1 row) postgres=# vacuum ANALYZE tbl; VACUUM
传统数据库设计和试验
test case 1: 按地域的UV/天
select rid, date(cts), count(distinct uid) from tbl group by 1,2 order by 1,2;
耗时: 5263.209 ms
test case 2: 按地域的相比前一天新增的访问人数(昨天没访问, 今天访问了的)
select t.d, t.rid, (select count(*) from (select uid from tbl where date(cts)=t.d and rid=t.rid except select uid from tbl where date(cts)=t.d-1 and rid=t.rid) as tmp ) as uv_incr from (select date(cts) as d , rid from tbl group by 1,2) as t order by 1,2;
耗时: 超过 249922.704 ms
test case 3: 整体UV/天
select date(cts), count(distinct uid) from tbl group by 1 order by 1;
耗时: 4824.847 ms
test case 4: 整体相比前一天新增的访问人数(昨天没访问, 今天访问了的)
select t.d, (select count(*) from (select uid from tbl where date(cts)=t.d except select uid from tbl where date(cts)=t.d-1) as tmp ) as uv_incr from (select distinct date(cts) as d from tbl) as t order by 1;
耗时: 41003.313 ms
test case 5: 按地域的UV/周
select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week, rid, count(distinct uid) from tbl group by 1,2,3 order by 1,2,3;
耗时: 18654.469 ms
test case 6: 按地域的相比前一周新增的访问人数(昨天没访问, 今天访问了的)
select t.isoyear, t.week, t.rid, (select count(*) from (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week and rid=t.rid except select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1 and rid=t.rid) as tmp ) as uv_incr from (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week, rid from tbl group by 1,2,3) as t order by 1,2,3;
耗时: 超过 294874.983 ms
跨年的话更加复杂, 因为年初的第一周的上一周是去年的最后一周, 写SQL时不好表达, 可以加case进行处理, 或者使用epoch来计算.
留个作业: 有兴趣的小伙伴可以修改一下SQL, 实现跨年.
test case 7: 整体UV/周
select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week, count(distinct uid) from tbl group by 1,2 order by 1,2;
耗时: 10757.619 ms
test case 8: 整体相比前一周新增的访问人数(昨天没访问, 今天访问了的)
select t.isoyear, t.week, (select count(*) from (select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week except select uid from tbl where EXTRACT(ISOYEAR FROM cts)=t.isoyear and EXTRACT(week FROM cts)=t.week-1) as tmp ) as uv_incr from (select EXTRACT(ISOYEAR FROM cts) as isoyear, EXTRACT(week FROM cts) as week from tbl group by 1,2) as t order by 1,2;
耗时: 38512.642 ms
test case 9: 按地域的UV/任意滑动周
使用函数, 输入要对比的两个窗口变量.
create or replace function get_uv_byrid ( v_rid int2, v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date -- 窗口1的结束时间 ) returns int8 as $$ select count(distinct uid) from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1; $$ language sql strict; select * from get_uv_byrid (1::int2, '2023-01-01', '2023-01-08');
耗时: 560.176 ms
test case 10: 按地域的相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.
create or replace function get_uv_incr_byrid ( v_rid int2, v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date, -- 窗口1的结束时间 v_begin_cts2 date, -- 窗口2的开始时间 v_end_cts2 date -- 窗口2的结束时间 ) returns int8 as $$ select count(*) from (select uid from tbl where rid=v_rid and cts >= v_begin_cts2 and cts < v_end_cts2 except select uid from tbl where rid=v_rid and cts >= v_begin_cts1 and cts < v_end_cts1 ) as tmp; $$ language sql strict; select * from get_uv_incr_byrid (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');
耗时: 889.635 ms
test case 11: 整体UV/任意滑动周
使用函数, 输入要对比的两个窗口变量.
create or replace function get_uv ( v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date -- 窗口1的结束时间 ) returns int8 as $$ select count(distinct uid) from tbl where cts >= v_begin_cts1 and cts < v_end_cts1; $$ language sql strict; select * from get_uv ('2023-01-01', '2023-01-08');
耗时: 915.000 ms
test case 12: 整体相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.
create or replace function get_uv_incr ( v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date, -- 窗口1的结束时间 v_begin_cts2 date, -- 窗口2的开始时间 v_end_cts2 date -- 窗口2的结束时间 ) returns int8 as $$ select count(*) from (select uid from tbl where cts >= v_begin_cts2 and cts < v_end_cts2 except select uid from tbl where cts >= v_begin_cts1 and cts < v_end_cts1 ) as tmp; $$ language sql strict; select * from get_uv_incr ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');
耗时: 2147.770 ms
hll数据库设计和试验
创建hll插件
create extension hll;
hll插件支持的类型、OP和函数等, 详细可参考hll开源项目文档:
postgres=# \dT+ List of data types Schema | Name | Internal name | Size | Elements | Owner | Access privileges | Description --------+-------------+---------------+------+----------+----------+-------------------+------------- public | hll | hll | var | | postgres | | public | hll_hashval | hll_hashval | 8 | | postgres | | (2 rows) postgres=# \do+ List of operators Schema | Name | Left arg type | Right arg type | Result type | Function | Description --------+------+---------------+----------------+------------------+-----------------+------------- public | # | | hll | double precision | hll_cardinality | public | <> | hll | hll | boolean | hll_ne | public | <> | hll_hashval | hll_hashval | boolean | hll_hashval_ne | public | = | hll | hll | boolean | hll_eq | public | = | hll_hashval | hll_hashval | boolean | hll_hashval_eq | public | || | hll | hll | hll | hll_union | public | || | hll | hll_hashval | hll | hll_add | public | || | hll_hashval | hll | hll | hll_add_rev | (8 rows)
创建聚合表
create table tbl_hll ( rid int2, -- 地域id ds date, -- 日期 users hll -- HLL值, 存储某个地域某天的所有UID聚合后的hll值 );
将原始用户行为数据写入聚合表
insert into tbl_hll select rid, date(cts), hll_add_agg(hll_hash_bigint(uid)) from tbl group by 1,2; INSERT 0 6200 Time: 2976.876 ms (00:02.977)
查看聚合表, 只占用了8MB左右.
postgres=# \dt+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+---------+-------+----------+-------------+---------------+---------+------------- public | tbl | table | postgres | unlogged | heap | 965 MB | public | tbl_hll | table | postgres | permanent | heap | 8312 kB | (2 rows)
近似情况对比:
hll里有多少uid?
postgres=# select # hll_union_agg(users) from tbl_hll; ?column? ------------------- 9474798.879453091 (1 row) Time: 47.075 ms
真实有多少uid?
postgres=# select count(distinct uid) from tbl; count --------- 9648563 (1 row) Time: 2373.055 ms (00:02.373)
test case 1: 按地域的UV/天
select rid, ds, # users from tbl_hll ;
耗时: 57.250 ms
test case 2: 按地域的相比前一天新增的访问人数(昨天没访问, 今天访问了的)
select rid, ds, (# (users || (lag(users) over w))) - (# (lag(users) over w)) from tbl_hll window w as (partition by rid order by ds) order by 1,2;
耗时: 179.204 ms
test case 3: 整体UV/天
select ds, # hll_union_agg(users) from tbl_hll group by ds order by 1;
耗时: 50.841 ms
test case 4: 整体相比前一天新增的访问人数(昨天没访问, 今天访问了的)
with tmp as ( select ds, hll_union_agg(users) as users from tbl_hll group by ds ) select ds, (# (users || (lag(users) over w))) - (# (lag(users) over w)) from tmp window w as (order by ds) order by 1;
耗时: 51.931 ms
test case 5: 按地域的UV/周
select EXTRACT(ISOYEAR FROM ds) as isoyear, EXTRACT(week FROM ds) as week, rid, # hll_union_agg(users) from tbl_hll group by 1,2,3;
耗时: 99.657 ms
test case 6: 按地域的相比前一周新增的访问人数(昨天没访问, 今天访问了的)
with tmp as ( select EXTRACT(ISOYEAR FROM ds) as isoyear, EXTRACT(week FROM ds) as week, rid, hll_union_agg(users) as users from tbl_hll group by 1,2,3 ) select isoyear, week, rid, (# (users || (lag(users) over w))) - (# (lag(users) over w)) from tmp window w as (partition by rid order by isoyear,week) order by 1,2,3;
耗时: 121.010 ms
跨年的话更加复杂, 因为年初的第一周的上一周是去年的最后一周, 写SQL时不好表达, 可以加case进行处理, 或者使用epoch来计算.
留个作业: 有兴趣的小伙伴可以修改一下SQL, 实现跨年.
test case 7: 整体UV/周
select EXTRACT(ISOYEAR FROM ds) as isoyear, EXTRACT(week FROM ds) as week, # hll_union_agg(users) as users from tbl_hll group by 1,2 order by 1,2;
耗时: 52.649 ms
test case 8: 整体相比前一周新增的访问人数(昨天没访问, 今天访问了的)
with tmp as ( select EXTRACT(ISOYEAR FROM ds) as isoyear, EXTRACT(week FROM ds) as week, hll_union_agg(users) as users from tbl_hll group by 1,2 ) select isoyear, week, (# (users || (lag(users) over w))) - (# (lag(users) over w)) from tmp window w as (order by isoyear,week) order by 1,2;
耗时: 51.668 ms
test case 9: 按地域的UV/任意滑动周
使用函数, 输入要对比的两个窗口变量.
create or replace function get_uv_byrid_appro ( v_rid int2, v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date -- 窗口1的结束时间 ) returns int8 as $$ select # hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1; $$ language sql strict; select * from get_uv_byrid_appro (1::int2, '2023-01-01', '2023-01-08');
耗时: 6.021 ms
test case 10: 按地域的相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.
create or replace function get_uv_incr_byrid_appro ( v_rid int2, v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date, -- 窗口1的结束时间 v_begin_cts2 date, -- 窗口2的开始时间 v_end_cts2 date -- 窗口2的结束时间 ) returns int8 as $$ with t1 as (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts1 and ds < v_end_cts1 group by 1), t2 as (select rid, hll_union_agg(users) as users from tbl_hll where rid=v_rid and ds >= v_begin_cts2 and ds < v_end_cts2 group by 1) select (# (t1.users || t2.users)) - (# t1.users) from t1,t2; $$ language sql strict; select * from get_uv_incr_byrid_appro (1::int2, '2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');
耗时: 8.934 ms
test case 11: 整体UV/任意滑动周
使用函数, 输入要对比的两个窗口变量.
create or replace function get_uv_appro ( v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date -- 窗口1的结束时间 ) returns int8 as $$ select # hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1; $$ language sql strict; select * from get_uv_appro ('2023-01-01', '2023-01-08');
耗时: 18.035 ms
test case 12: 整体相比前一任意滑动周新增的访问人数(昨天没访问, 今天访问了的)
使用函数, 输入要对比的两个窗口变量. 计算窗口2相比窗口1的新增uv.
create or replace function get_uv_incr_appro ( v_begin_cts1 date, -- 窗口1的开始时间 v_end_cts1 date, -- 窗口1的结束时间 v_begin_cts2 date, -- 窗口2的开始时间 v_end_cts2 date -- 窗口2的结束时间 ) returns int8 as $$ with t1 as (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts1 and ds < v_end_cts1), t2 as (select hll_union_agg(users) as users from tbl_hll where ds >= v_begin_cts2 and ds < v_end_cts2) select (# (t1.users || t2.users)) - (# t1.users) from t1,t2; $$ language sql strict; select * from get_uv_incr_appro ('2023-01-01', '2023-01-08', '2023-01-08', '2023-01-15');
耗时: 31.634 ms
性能对比结果
test case | 原始方法耗时(ms) | HLL耗时(ms) | HLL性能提升倍数 |
1 | 5263.209 | 57.250 | 91.93 |
2 | 超过 249922.704 | 179.204 | 超过 1394.63 |
3 | 4824.847 | 50.841 | 94.90 |
4 | 41003.313 | 51.931 | 789.57 |
5 | 18654.469 | 99.657 | 187.19 |
6 | 超过 294874.983 | 121.010 | 超过 2436.78 |
7 | 10757.619 | 52.649 | 204.33 |
8 | 38512.642 | 51.668 | 745.39 |
9 | 560.176 | 6.021 | 93.04 |
10 | 889.635 | 8.934 | 99.58 |
11 | 915.000 | 18.035 | 50.73 |
12 | 2147.770 | 31.634 | 67.89 |
业务场景2, 短视频去重、过滤已观看视频
例如有一个短视频业务, 用户看过的视频都会被标记. 当根据用户喜好推送给用户视频, 需要从推荐列表中过滤掉用户已经看过的视频. (允许近似值, 要求高性能)
实现和对比
设计表结构:
drop table if exists tbl; create unlogged table tbl ( -- 用户已读视频列表, 为了减轻测试负担, 使用unlogged table id serial8 primary key, -- 主键 uid int8, -- 用户ID vid int8, -- 已读视频ID cts timestamp -- 用户行为时间 ); alter sequence tbl_id_seq cache 1000; create index on tbl (uid) include (vid); -- 加速根据uid查询已读视频
设计测试数据, 要求:
- vid 取值范围1到1万. 高斯分布, 突出一些经常被观看的热门视频.
- uid 取值范围1到1万. 高斯分布, 突出一些经常看视频的活跃用户.
使用pgbench生成1000万条用户已读视频列表的记录.
vi ~/t.sql \set uid random_gaussian(1, 10000, 2.5) \set vid random_gaussian(1, 10000, 2.5) insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());
pgbench -M prepared -n -r -P 1 -f ~/t.sql -c 10 -j 10 -t 1000000
scaling factor: 1 query mode: prepared number of clients: 10 number of threads: 10 number of transactions per client: 1000000 number of transactions actually processed: 10000000/10000000 latency average = 0.058 ms latency stddev = 0.079 ms initial connection time = 20.551 ms tps = 171765.018858 (without initial connection time) statement latencies in milliseconds: 0.000 \set uid random_gaussian(1, 100000, 2.5) 0.000 \set vid random_gaussian(1, 100000, 2.5) 0.058 insert into tbl (uid,vid,cts) values (:uid, :vid, clock_timestamp());
结果查看
postgres=# select * from tbl limit 10; id | uid | vid | cts ------+------+------+---------------------------- 1 | 5380 | 5321 | 2023-08-21 08:53:53.074514 2 | 7611 | 7945 | 2023-08-21 08:53:53.074845 1001 | 6841 | 2507 | 2023-08-21 08:53:53.074857 2001 | 7956 | 4567 | 2023-08-21 08:53:53.074925 1002 | 5728 | 4210 | 2023-08-21 08:53:53.075087 3001 | 4560 | 1588 | 2023-08-21 08:53:53.075333 1003 | 5197 | 5662 | 2023-08-21 08:53:53.07549 1004 | 5125 | 3827 | 2023-08-21 08:53:53.075528 4001 | 6815 | 4758 | 2023-08-21 08:53:53.075496 1005 | 4002 | 8575 | 2023-08-21 08:53:53.075567 (10 rows) postgres=# select count(*) from tbl; count ---------- 10000000 (1 row) postgres=# \dt+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+------+-------+----------+-------------+---------------+--------+------------- public | tbl | table | postgres | unlogged | heap | 575 MB | (1 row) postgres=# \di+ List of relations Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description --------+-----------------+-------+----------+-------+-------------+---------------+--------+------------- public | tbl_pkey | index | postgres | tbl | unlogged | btree | 344 MB | public | tbl_uid_vid_idx | index | postgres | tbl | unlogged | btree | 301 MB | (2 rows) postgres=# vacuum ANALYZE tbl; VACUUM
传统数据库设计和试验
test case1:
从4900-5100号段热门vid随机生成100个推荐视频, 从4900-5100号段随机获取活跃uid, 从用户已读列表中过滤该UID已读的vid, 返回未读的UID.
编写压测脚本
vi ~/test.sql \set uid random(4900, 5100) with t as ( select 4900+(random()*200)::int as vid from generate_series(1,100) ) select t.vid from t where not exists (select 1 from tbl where uid=:uid and tbl.vid=t.vid) ;
压测
pgbench -M prepared -n -r -P 1 -f ~/test.sql -c 10 -j 10 -T 120 scaling factor: 1 query mode: prepared number of clients: 10 number of threads: 10 duration: 120 s number of transactions actually processed: 2469063 latency average = 0.486 ms latency stddev = 0.183 ms initial connection time = 22.914 ms tps = 20578.539756 (without initial connection time)
tps: 20578.539756
hll数据库设计和试验
创建聚合表, 每个UID存储一条聚合后的hll value.
drop table if exists tbl_hll; create table tbl_hll ( uid int8 unique, vids hll );
聚合
insert into tbl_hll select uid, hll_add_agg(hll_hash_bigint(vid)) from tbl group by 1; INSERT 0 10000
查看聚合表, 只占用了11MB左右.
postgres=# \dt+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+---------+-------+----------+-------------+---------------+--------+------------- public | tbl | table | postgres | unlogged | heap | 575 MB | public | tbl_hll | table | postgres | permanent | heap | 11 MB | (2 rows)
近似情况对比:
uid=5000实际已读vid数 postgres=# select count(distinct vid) from tbl where uid=5000; count ------- 1807 (1 row) uid=5000的hll里可能存在的vid数 postgres=# select # vids from tbl_hll where uid=5000; ?column? -------------------- 1758.0590520300507 (1 row)
true, false 测试:
使用原始表取出uid=5000的vid, 在hll值中判断是否已读, 2081 条全部已读, 判断已读是绝对准确的, 原理参考本文知识点的部分.
with tmp as ( select tbl.uid, tbl.vid, ( vids || hll_hash_bigint(vid) ) = vids as is_readed -- 是否已读 from tbl_hll join tbl on (tbl_hll.uid=tbl.uid) where tbl.uid=5000 and tbl_hll.uid=5000 ) select is_readed, count(*) from tmp group by 1; is_readed | count -----------+------- t | 2081 (1 row)
以下这个就不太准了, 100万个vid, 居然有413826个被认为在uid=5000的hll已读列表值里面, 所以可能出现的结果是大量未读视频会被误以为是已读.
解决办法大家可以思考一下, 调整hll的参数即可.
postgres=# select count(*) from tbl_hll , (select generate_series(1,1000000) as vid) as t where uid=5000 and ( vids || hll_hash_bigint(vid) ) = vids; count -------- 413826 (1 row)
test case1:
从4900-5100号段热门vid随机生成100个推荐视频, 从4900-5100号段随机获取活跃uid, 从用户已读列表中过滤该UID已读的vid, 返回未读的UID.
编写压测脚本:
vi ~/test1.sql \set uid random(4900, 5100) select t1.vid from ( select 4900+(random()*200)::int as vid from generate_series(1,100) ) t1 join tbl_hll t2 on ( t2.uid=:uid and t2.vids || hll_hash_bigint(t1.vid) <> t2.vids);
压测结果:
pgbench -M prepared -n -r -P 1 -f ~/test1.sql -c 10 -j 10 -T 120 scaling factor: 1 query mode: prepared number of clients: 10 number of threads: 10 duration: 120 s number of transactions actually processed: 583366 latency average = 2.057 ms latency stddev = 0.631 ms initial connection time = 21.736 ms tps = 4862.153296 (without initial connection time)
tps: 4862.153296
是不是有点失望? 原因主要是hll的计算比较耗时, 可以考虑优化hll代码, 或者把vid和hll都推出到application端进行判断(也就是把压力给到应用程序, 毕竟应用程序是无状态的非常容易扩展).
rb数据库设计和试验
下面使用rb类型来代替hll存储用户聚合后的已读视频id, rb是稀疏bitmap存储, 精确值.
创建roaringbitmap插件
postgres=# create extension roaringbitmap ; CREATE EXTENSION postgres=# \dT List of data types Schema | Name | Description --------+---------------+------------- public | roaringbitmap | (1 row) postgres=# \do+ List of operators Schema | Name | Left arg type | Right arg type | Result type | Function | Description --------+------+---------------+----------------+---------------+-----------------------+------------- public | # | roaringbitmap | roaringbitmap | roaringbitmap | rb_xor | public | & | roaringbitmap | roaringbitmap | roaringbitmap | rb_and | public | && | roaringbitmap | roaringbitmap | boolean | rb_intersect | public | - | roaringbitmap | integer | roaringbitmap | rb_remove | public | - | roaringbitmap | roaringbitmap | roaringbitmap | rb_andnot | public | << | roaringbitmap | bigint | roaringbitmap | rb_shiftleft | public | <> | roaringbitmap | roaringbitmap | boolean | rb_not_equals | public | <@ | integer | roaringbitmap | boolean | public.rb_containedby | public | <@ | roaringbitmap | roaringbitmap | boolean | public.rb_containedby | public | = | roaringbitmap | roaringbitmap | boolean | rb_equals | public | >> | roaringbitmap | bigint | roaringbitmap | rb_shiftright | public | @> | roaringbitmap | integer | boolean | public.rb_contains | public | @> | roaringbitmap | roaringbitmap | boolean | public.rb_contains | public | | | integer | roaringbitmap | roaringbitmap | public.rb_add | public | | | roaringbitmap | integer | roaringbitmap | public.rb_add | public | | | roaringbitmap | roaringbitmap | roaringbitmap | rb_or | (16 rows)
创建聚合表
drop table if exists tbl_rb; create table tbl_rb ( uid int8 unique, vids roaringbitmap );
写入聚合数据
insert into tbl_rb select uid, rb_build_agg(vid::int4) from tbl group by 1; INSERT 0 10000
查看聚合表, 只占用了20MB左右.
postgres=# \dt+ List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+--------+-------+----------+-------------+---------------+--------+------------- public | tbl | table | postgres | unlogged | heap | 575 MB | public | tbl_rb | table | postgres | permanent | heap | 20 MB | (2 rows) postgres=# \di+ List of relations Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description --------+-----------------+-------+----------+--------+-------------+---------------+--------+------------- public | tbl_pkey | index | postgres | tbl | unlogged | btree | 344 MB | public | tbl_rb_uid_key | index | postgres | tbl_rb | permanent | btree | 240 kB | public | tbl_uid_vid_idx | index | postgres | tbl | unlogged | btree | 301 MB | (3 rows)
rb是精确类型, 与原始数据对照如下:
postgres=# select count(distinct vid) from tbl where uid=5000; count ------- 1807 (1 row) postgres=# select count(*) from (select unnest(rb_to_array(vids)) from tbl_rb where uid=5000) t; count ------- 1807 (1 row)
test case1:
从4900-5100号段热门vid随机生成100个推荐视频, 从4900-5100号段随机获取活跃uid, 从用户已读列表中过滤该UID已读的vid, 返回未读的UID.
编写测试脚本:
vi ~/test2.sql \set uid random(4900, 5100) select t1.vid from ( select 4900+(random()*200)::int as vid from generate_series(1,100) ) t1 join tbl_rb t2 on ( t2.uid=:uid and not (t1.vid <@ t2.vids) );
压测结果:
pgbench -M prepared -n -r -P 1 -f ~/test2.sql -c 10 -j 10 -T 120 scaling factor: 1 query mode: prepared number of clients: 10 number of threads: 10 duration: 120 s number of transactions actually processed: 3098707 latency average = 0.387 ms latency stddev = 0.147 ms initial connection time = 21.281 ms tps = 25826.645185 (without initial connection time)
tps: 25826.645185
性能对比结果
对比项 | 传统测试 | hll聚合 | rb稀疏聚合 |
存储空间占用 | 1220 MB | 11 MB | 20 MB |
tps | 20578 | 4862 | 25826 |
知识点
1、hll.
往hll中增加一个值时, 可以理解为将这个值经过hashfunction后, 将其映射到一个bitmap value的n个position上, 即这n个bit position位上的值都设置为1.
当需要判断某个值是否存在时, 判断这个值在bitmap value中的n个bit position位上的值是否都为1.
hll类型 lossy问题: 不存在一定为真, 存在可能是假的(因为bit position conflict问题.). 所以有失真问题.
hll 可以实现比普通计数器更多更复杂的需求:
- 唯一值个数 (计数器只能统计一次, 换个窗口就必须重新全量统计. 但是2个hll值可以叠加, 所以不需要重新全量统计.)
- 是否已存在 (计数器做不到.)
- PV UV (计数器也能做到.)
- 滑动窗口分析 (计数器做不到, 因为两个计数器值无法叠加, 所以使用计数器时每个维度都要重新统计. 但是2个hll值可以叠加, 因此不需要重新统计, 只需要计算hll即可.)
2、窗口查询语法.
3、如何根据业务模型快速构建测试数据?
《PostgreSQL+MySQL 联合解决方案 - 第3课视频 - 如何压测PG数据库、如何瞬间构造海量测试数据》
《PostgreSQL 如何快速构建 海量 逼真 测试数据》
man pgbench
4、roaringbitmap:
- 《PolarDB 开源版通过roaringbitmap支持高效用户画像等标签操作》
- 《PostgreSQL roaringbitmap UID溢出(超出int4(32字节))时的处理方法 - offset》
- 《画像系统标准化设计 - PostgreSQL roaringbitmap, varbitx , 正向关系, 反向关系, 圈选, 相似扩选(向量相似扩选)》
- 《PostgreSQL pg_roaringbitmap - 用户画像、标签、高效检索》
思考
1、还有没有其他场景会用到hll?
2、还有比hll更好的技术实现?
- roaringbitmap 类型?
- array 类型?
3、通过这个实验, 你学到了什么?
4、为什么hll性能这么好?
5、hll提升性能的同时牺牲了什么?
6、为什么hll是近似的?
7、hll类型的什么变量可以控制一个hll value能存储多少唯一值?
8、hll value近似度的精确度和什么变量有关?
9、为什么多个hll的值可以union?
参考
- https://www.crunchydata.com/blog/high-compression-metrics-storage-with-postgres-hyperloglog
- 《PostgreSQL sharding : citus 系列6 - count(distinct xx) 加速 (use 估值插件 hll|hyperloglog)》
- 《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 3》
- 《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 2》
- 《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 1》
- 《PolarDB 开源版通过 postgresql_hll 实现高效率 UV滑动分析、实时推荐已读列表过滤》
- 《PostgreSQL HLL 近似计算算法要点》
- 《PostgreSQL 13 & 14 hashagg 性能增强(分组选择性精准度) - 使用hll评估hash字段的选择性, 而非使用记录数》
- 《PostgreSQL hll 在留存、UV统计中的通用用法》
- 《Greenplum 最佳实践 - 估值插件hll的使用(以及hll分式聚合函数优化)》
- 《重新发现PostgreSQL之美 - 24 滑动窗口分析 2000x》
- 《PostgreSQL、Greenplum 滑动窗口 分析SQL 实践》
- 《PostgreSQL 海量时序数据(任意滑动窗口实时统计分析) - 传感器、人群、物体等对象跟踪》
- 《PostgreSQL 应用开发解决方案最佳实践系列课程 - 2. 短视频业务实时推荐》
- 《重新发现PostgreSQL之美 - 26 这个推荐算法价值1亿》
- 《PostgreSQL SELECT 的高级用法(CTE, LATERAL, ORDINALITY, WINDOW, SKIP LOCKED, DISTINCT, GROUPING SETS, ...) - 珍藏级》