背景
PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力.
本文将介绍PolarDB 开源版通过duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP.
测试环境为macos+docker, polardb部署请参考如何用 PolarDB 证明巴菲特的投资理念 - 包括PolarDB简单部署。
duckdb_fdw for PolarDB
参考
《用duckdb_fdw加速PostgreSQL分析计算, 提速40倍, 真香.》
部署
需要一个 高版本 cmake .
https://cmake.org/download
wget https://github.com/Kitware/CMake/releases/download/v3.25.1/cmake-3.25.1.tar.gz
tar -zxvf cmake-3.25.1.tar.gz
cd cmake-3.25.1
./configure
make -j 4
make install
[postgres@67e1eed1b4b6 duckdb]$ export PATH=/usr/local/bin:$PATH
[postgres@67e1eed1b4b6 duckdb]$ which cmake
/usr/local/bin/cmake
git clone --depth 1 https://github.com/cwida/duckdb
cd duckdb
make -j 4
cd ~/duckdb
cp build/release/tools/sqlite3_api_wrapper/libsqlite3_api_wrapper.so ~/tmp_basedir_polardb_pg_1100_bld/lib/
cp tools/sqlite3_api_wrapper/include/sqlite3.h ~/tmp_basedir_polardb_pg_1100_bld/include/server/
cp build/release/src/libduckdb.so ~/tmp_basedir_polardb_pg_1100_bld/lib/
cd ~
git clone --depth 1 https://github.com/alitrack/duckdb_fdw
cd duckdb_fdw
USE_PGXS=1 make
USE_PGXS=1 make install
[postgres@67e1eed1b4b6 duckdb_fdw]$ USE_PGXS=1 make
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o connection.o connection.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o option.o option.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o deparse.o deparse.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o sqlite_query.o sqlite_query.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o duckdb_fdw.o duckdb_fdw.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -shared -o duckdb_fdw.so connection.o option.o deparse.o sqlite_query.o duckdb_fdw.o -L/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib -Wl,-rpath,'$ORIGIN/../lib' -L/opt/rh/llvm-toolset-7.0/root/usr/lib64 -Wl,--as-needed -Wl,-rpath,'/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib',--enable-new-dtags -lsqlite3_api_wrapper
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o connection.bc connection.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o option.bc option.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o deparse.bc deparse.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o sqlite_query.bc sqlite_query.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o duckdb_fdw.bc duckdb_fdw.c
[postgres@67e1eed1b4b6 duckdb_fdw]$ USE_PGXS=1 make install
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension'
/usr/bin/install -c -m 755 duckdb_fdw.so '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/duckdb_fdw.so'
/usr/bin/install -c -m 644 .//duckdb_fdw.control '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension/'
/usr/bin/install -c -m 644 .//duckdb_fdw--1.0.sql .//duckdb_fdw--1.0--1.1.sql '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension/'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode/duckdb_fdw'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/
/usr/bin/install -c -m 644 connection.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 option.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 deparse.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 sqlite_query.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 duckdb_fdw.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
cd '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode' && /opt/rh/llvm-toolset-7.0/root/usr/bin/llvm-lto -thinlto -thinlto-action=thinlink -o duckdb_fdw.index.bc duckdb_fdw/connection.bc duckdb_fdw/option.bc duckdb_fdw/deparse.bc duckdb_fdw/sqlite_query.bc duckdb_fdw/duckdb_fdw.bc
加载插件
create extension duckdb_fdw;
测试使用
cd ~/duckdb/build/release
./duckdb /home/postgres/db
COPY (select generate_series as id, md5(random()::text) as info, now()::timestamp+(generate_series||' second')::interval as crt_time from generate_series(1,100)) TO '/home/postgres/t1.parquet' (FORMAT 'PARQUET');
COPY (select generate_series as cid, md5(random()::text) as info, now()::timestamp+(generate_series||' second')::interval as crt_time from generate_series(1,100)) TO '/home/postgres/t2.parquet' (FORMAT 'PARQUET');
COPY (select (floor(random()*100)+1)::int as gid, (floor(random()*100)+1)::int as cid, (random()*10)::int as c1, (random()*100)::int as c2, (random()*1000)::int as c3, (random()*10000)::int as c4, (random()*100000)::int as c5 from generate_series(1,1000000)) TO '/home/postgres/t3.parquet' (FORMAT 'PARQUET');
create view t1 as select * from read_parquet('/home/postgres/t1.parquet');
create view t2 as select * from read_parquet('/home/postgres/t2.parquet');
create view t3 as select * from read_parquet('/home/postgres/t3.parquet');
checkpoint;
性能怎么样
psql
CREATE SERVER DuckDB_server FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database '/home/postgres/db');
IMPORT FOREIGN SCHEMA public FROM SERVER DuckDB_server INTO public;
postgres=# \det
List of foreign tables
Schema | Table | Server
--------+-------+---------------
public | t1 | duckdb_server
public | t2 | duckdb_server
public | t3 | duckdb_server
(3 rows)
postgres=# explain verbose select count(distinct gid) from t3;
QUERY PLAN
-------------------------------------------------------------
Foreign Scan (cost=1.00..1.00 rows=1 width=8)
Output: (count(DISTINCT gid))
SQLite query: SELECT count(DISTINCT "gid") FROM main."t3"
(3 rows)
postgres=# \timing
Timing is on.
select count(distinct gid) from t3;
select count(distinct gid),count(*) from t3;
explain verbose select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid;
select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid;
select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid join t2 on t3.cid=t2.cid;
explain verbose select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid join t2 on t3.cid=t2.cid;
将数据导入PolarDB本地, 做同样的查询看一下时间.
postgresql 本地计算
postgres=# create unlogged table lt1 as select * from t1;
SELECT 100
postgres=# create unlogged table lt2 as select * from t2;
SELECT 100
postgres=# create unlogged table lt3 as select * from t3;
SELECT 10000000
postgres=# \timing
Timing is on.
postgres=# select count(distinct gid) from lt3;
count
-------
100
(1 row)
Time: 14891.356 ms (00:14.891)
参考
202211/20221128_01.md 《DuckDB 轻量级压缩解读》
202211/20221124_08.md 《DuckDB 0.6.0 CLI 支持新的结果集显示模式 duckbox, 展示头部和尾部的少量记录, 避免结果集占满屏幕》
202211/20221124_07.md 《DuckDB 0.6.0 CLI 支持 tab键自动补齐》
202211/20221124_06.md 《DuckDB 0.6.0 的SQL执行进度是个败笔? 速度太快了, 执行进度反而成了干扰.》
202211/20221124_04.md 《DuckDB 0.6.0 支持并行 COUNT(DISTINCT)》
202211/20221124_03.md 《DuckDB 0.6.0 支持并行创建索引, 提升create index性能》
202211/20221124_02.md 《DuckDB 0.6.0 支持 csv 并行读取功能, 提升查询性能》
202211/20221124_01.md 《DuckDB 0.6.0 压缩算法增加: FSST, Chimp, Patas 提升字符串和浮点数压缩能力》
202211/20221123_03.md 《DuckDB 0.6.0 数据批量写入性能优化》
202211/20221123_02.md 《为什么看好 DuckDB 的发展前景?》
202211/20221122_04.md 《DuckDB 0.6.0 内存管理增强, 提升超出内存大小的大数据集hashjoin,sort,window function性能》
202211/20221122_03.md 《DuckDB 0.6.0 语法糖: Add Python-style list-comprehension syntax support to SQL》
202211/20221122_02.md 《DuckDB 0.6.0 语法糖: 支持columns 语法 - 兼容clickhouse columns 语法》
202211/20221122_01.md 《DuckDB 0.6.0 新增UNION数据类型支持 - 单列支持多类型》
202211/20221116_02.md 《DuckDB 0.6.0 发布》
202211/20221111_01.md 《DuckDB extension(插件) - 扩展DuckDB功能》
202211/20221101_01.md 《DuckDB 数值类型性能 : hugeint (16字节整型), decimal(内部整型存储)》
202210/20221027_04.md 《DuckDB select 语法糖: sample 采样查询》
202210/20221027_03.md 《DuckDB select 语法糖: exclude, replace columns》
202210/20221027_02.md 《DuckDB 备份与恢复数据库》
202210/20221026_05.md 《DuckDB 对比 PostgreSQL 48张表JOIN 的优化器》
202210/20221026_04.md 《DuckDB 对比 PostgreSQL join 优化器推理优化又一例》
202210/20221026_03.md 《DuckDB COPY 数据导入导出 - 支持csv, parquet格式, 支持CODEC设置压缩》
202210/20221026_02.md 《DuckDB 虚拟列 -
GENERATED ALWAYS AS () VIRTUAL
》202210/20221026_01.md 《DuckDB DataLake 场景使用举例 - aliyun OSS对象存储parquet》
202210/20221025_05.md 《DuckDB lambda 函数使用 - list/array数据处理 - 元素的转换与过滤 (list_filter, list_transform)》
202210/20221025_04.md 《DuckDB 列位置别名
#n
的使用》202210/20221025_03.md 《DuckDB 宏(MACRO) 的使用》
202210/20221025_02.md 《DuckDB update from,delete using - 采用JOIN 批量更新和批量删除》
202210/20221025_01.md 《DuckDB rowid 的使用》
202210/20221024_07.md 《DuckDB 对比 PostgreSQL left outer join 优化器推理优化一例》
202210/20221024_06.md 《DuckDB insert query 数据导入场景 优化和使用注意》
202210/20221024_04.md 《DuckDB 鸟瞰数据的方法: SUMMARIZE. 数据柱状图、统计信息》
202210/20221024_03.md 《DuckDB explain analye 的结果解释 - Profile Queries》
202210/20221024_02.md 《查看 DuckDB 优化器 物理和逻辑执行计划 - explain_output》
202210/20221001_02.md 《duckdb postgres_scan 插件 - 不落地数据, 加速PostgreSQL数据分析》
202209/20220924_01.md 《用duckdb_fdw加速PostgreSQL分析计算, 提速40倍, 真香.》
202209/20220913_02.md 《德说-第140期, duckdb+容器+parquet+对象存储, 实现SaaS场景, 低代码拖拉拽多维度实时分析 降本提效》
202209/20220909_02.md 《德说-第135期, duckdb的产品形态如何盈利? 未来数据库产品的商业形态》
202209/20220905_01.md 《DuckDB parquet 分区表 / Delta Lake(数据湖) 应用》
202209/20220902_01.md 《DuckDB 线性回归预测股价的例子》
202209/20220901_06.md 《DuckDB 数据文件水位问题观察》
202209/20220901_05.md 《DuckDB 采用外部 parquet 格式存储 - tpch 测试 - in_memory VS in_parquet》
202209/20220901_04.md 《DuckDB 完整的PRAGMA, setting, 系统表, 系统视图, 内置函数, 内置类型 在哪里?》
202209/20220901_03.md 《DuckDB 数据库的数据能不能超出内存限制? 以及推荐的使用方法 - parquet》
202209/20220901_02.md 《编译安装 DuckDB 最新版本 in MacOS》
202209/20220901_01.md 《DuckDB 读写 Parquet 文件 - 同时支持远程s3, oss, http等parquet文件读写》
202208/20220831_02.md 《DuckDB 聚合函数用法举例》
202208/20220831_01.md 《DuckDB的字符串 collate用法 - 大小写、口音(西方各国字符集)、地域属性、排序 - (icu, noCASE, noACCENT, noNFC)》
202208/20220829_04.md 《DuckDB 快速生成海量数据的方法》
202208/20220829_03.md 《DuckDB:list,struct,map 类型很强大(支持lambda计算) - PostgreSQL:数组、row|record、json字典》
202208/20220829_02.md 《DuckDB 字符串相似性计算函数》
202208/20220829_01.md 《DuckDB vs PostgreSQL TPC-H 测试》
202208/20220828_01.md 《DuckDB TPC-H, TPC-DS 测试》
202208/20220827_01.md 《DuckDB Window 窗口函数语法糖 - QUALIFY - window filter》
202208/20220826_02.md 《DuckDB 定位OLAP方向的SQLite, 适合嵌入式数据分析 - tpch测试与简单试用》
202010/20201022_01.md 《PostgreSQL 牛逼的分析型功能 - 列存储、向量计算 FDW - DuckDB_fdw - 无数