Background

To simulate reasonably realistic stock data, we first need to understand the characteristics of the real thing. Two key features:

1. The daily price change is bounded to [-10%, 10%] (presumably the daily trading limit on China's domestic exchanges).
2. Within [-10%, 10%], the daily change follows a roughly Gaussian distribution: changes near 0 are the most common, and the probability falls off toward ±10%. The histogram resembles a bell curve. This article shows how that conclusion is reached.

Analysis process

1. Deploy PolarDB with one click. See: 《如何用 PolarDB 证明巴菲特的投资理念》

2. Download data for a few stocks: 贵州茅台 (600519), 海立股份 (600619), ST热电 (600719).

https://zhuanlan.zhihu.com/p/65662875

curl "http://quotes.money.163.com/service/chddata.html?code=0600519&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./0600519.SH.csv
curl "http://quotes.money.163.com/service/chddata.html?code=0600619&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./0600619.SH.csv
curl "http://quotes.money.163.com/service/chddata.html?code=0600719&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./0600719.SH.csv

Fix the character encoding:

$ iconv -f GBK -t UTF-8 ./0600519.SH.csv > ./1.csv
$ iconv -f GBK -t UTF-8 ./0600619.SH.csv > ./2.csv
$ iconv -f GBK -t UTF-8 ./0600719.SH.csv > ./3.csv

[postgres@d6b4778340d1 ~]$ head -n 5 1.csv 2.csv 3.csv
==> 1.csv <==
日期,股票代码,名称,开盘价,收盘价
2022-09-01,'600519,贵州茅台,1912.15,1880.89
2022-08-31,'600519,贵州茅台,1860.1,1924.0
2022-08-30,'600519,贵州茅台,1882.35,1870.0
2022-08-29,'600519,贵州茅台,1883.0,1878.82

==> 2.csv <==
日期,股票代码,名称,开盘价,收盘价
2022-09-01,'600619,海立股份,6.77,6.67
2022-08-31,'600619,海立股份,7.06,6.77
2022-08-30,'600619,海立股份,7.3,7.19
2022-08-29,'600619,海立股份,7.0,7.26

==> 3.csv <==
日期,股票代码,名称,开盘价,收盘价
2022-09-01,'600719,ST热电,5.01,4.9
2022-08-31,'600719,ST热电,5.34,5.05
2022-08-30,'600719,ST热电,5.38,5.32
2022-08-29,'600719,ST热电,5.33,5.38

3. Import the data into PolarDB:

create table t1 (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy t1 from '/home/postgres/1.csv' ( format csv, HEADER , quote '"');
delete from t1 where c4 = 0 or c5 = 0;

create table t2 (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy t2 from '/home/postgres/2.csv' ( format csv, HEADER , quote '"');
delete from t2 where c4 = 0 or c5 = 0;

create table t3 (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy t3 from '/home/postgres/3.csv' ( format csv, HEADER , quote '"');
delete from t3 where c4 = 0 or c5 = 0;

4. Analyze the distribution of the daily changes. The result shows the changes are approximately Gaussian.

select width_bucket(v, -0.1, 0.1, 10), count(*) from
  (select (lag(c5) over w - c5)/c5 as v from t1 window w as (order by c1)) t
group by 1 order by 2 desc, 1 asc;

select width_bucket(v, -0.1, 0.1, 10), count(*) from
  (select (lag(c5) over w - c5)/c5 as v from t2 window w as (order by c1)) t
group by 1 order by 2 desc, 1 asc;

select width_bucket(v, -0.1, 0.1, 10), count(*) from
  (select (lag(c5) over w - c5)/c5 as v from t3 window w as (order by c1)) t
group by 1 order by 2 desc, 1 asc;

The histograms:

 width_bucket | count
--------------+-------
            6 |  1925
            5 |  1813
            4 |   528
            7 |   459
            3 |   130
            8 |    91
            2 |    23
            9 |    21
            1 |    20
           11 |    11
           10 |     5
              |     1
(12 rows)

 width_bucket | count
--------------+-------
            6 |  1624
            5 |  1570
            4 |   658
            7 |   575
            8 |   201
            3 |   178
            1 |    80
           11 |    71
            9 |    67
            2 |    57
           10 |    37
              |     1
(12 rows)

 width_bucket | count
--------------+-------
            5 |  1611
            6 |  1576
            4 |   599
            7 |   576
            8 |   203
            3 |   177
            9 |    70
            1 |    63
           11 |    49
            2 |    47
           10 |    26
              |     1
(12 rows)

Simulation process

Based on these two features we can simulate stock data, using pgbench's random_gaussian with PolarDB for PostgreSQL.

1. The approach:
   1. Generate daily change values in [-10%, 10%] with a Gaussian distribution.
   2. Use recursive SQL: given an IPO price, apply the daily changes to derive the price for every trading day.

2. Create a table to hold the pgbench-generated changes:

create table tbl (id serial primary key, v numeric(20,3));

3. Generate the changes with pgbench:

vi test.sql
\set r random_gaussian(0, 20000, 5)
insert into tbl (v) values ((:r-10000)/100000.0);

pgbench -h 127.0.0.1 -n -r -f ./test.sql -c 1 -j 1 -t 5000

A quick explanation of test.sql: random_gaussian generates values in 0-20000, with the probability mass concentrated around the midpoint 10000. The third argument, 5, is a tuning parameter that controls how concentrated the distribution is around the middle. Subtracting 10000 shifts the range to [-10000, 10000]; dividing by 100000 scales it to [-10%, 10%].
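As an aside, on PostgreSQL 16 or newer a similar table can be filled entirely server-side with the built-in random_normal(), without pgbench. A minimal sketch, where the standard deviation 0.03 is my own assumption and values are clamped to the ±10% limit:

insert into tbl (v)
select greatest(-0.1, least(0.1, random_normal(0, 0.03)))::numeric(20,3)  -- clamp to the daily trading limit
from generate_series(1, 5000);

pgbench's shape parameter and random_normal's stddev are different knobs, so the two distributions will only roughly match; tune the stddev to taste.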
The distribution of the simulated changes is close to that of real stock data:

select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;

 width_bucket | count
--------------+-------
            6 |  1755
            5 |  1661
            7 |   701
            4 |   668
            8 |   112
            3 |    87
            9 |    10
            2 |     5
           10 |     1
(9 rows)

4. Assume an IPO price of 38.101 and use the following recursive SQL to turn the generated daily changes into a price for every trading day:

create table tbl1 (c1 int, c5 numeric);

with recursive a as (
  (select id, (38.101 * (1 + tbl.v))::numeric(20,3) as price from tbl order by id limit 1)
  union all
  (select tbl.id, (a.price * (1 + tbl.v))::numeric(20,3) from tbl join a on (tbl.id > a.id) where a.* is not null order by tbl.id limit 1)
)
insert into tbl1 select * from a where a.* is not null;

INSERT 0 5000

A plot of the generated data looks like a real price chart.
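A side note: the same cumulative series can be produced without recursion, because a running product of (1 + v) equals exp() of a running sum of ln(1 + v) (valid here since v >= -10% keeps 1 + v positive). A sketch against the same tbl and tbl1; the results match the recursive version up to rounding, since the recursive version rounds each day's price to 3 decimals:

insert into tbl1
select id,
       (38.101 * exp(sum(ln(1 + v)) over (order by id)))::numeric(20,3)  -- running product of (1+v)
from tbl;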
5. Pick an arbitrary starting point for dollar-cost averaging. The simulated data behaves just like the real data: it, too, supports Buffett's investment philosophy. See: 《如何用 PolarDB 证明巴菲特的投资理念》

Returns after holding for 500+ trading days:

 607 | 37.385 | 32.2845 | 11.35 | 254000 | 294128.47 | 40128.47 | 507
 608 | 37.759 | 32.2953 | 12.13 | 254500 | 297556.59 | 43056.59 | 508
 609 | 37.835 | 32.3061 | 12.25 | 255000 | 298640.82 | 43640.82 | 509
 610 | 37.986 | 32.3172 | 12.53 | 255500 | 300317.28 | 44817.28 | 510
 611 | 38.822 | 32.3299 | 14.32 | 256000 | 307406.49 | 51406.49 | 511
 612 | 40.025 | 32.3449 | 16.89 | 256500 | 317404.02 | 60904.02 | 512
 613 | 39.665 | 32.3592 | 16.03 | 257000 | 315023.62 | 58023.62 | 513
 614 | 38.673 | 32.3714 | 13.80 | 257500 | 307626.06 | 50126.06 | 514
 615 | 38.712 | 32.3837 | 13.82 | 258000 | 308417.15 | 50417.15 | 515
 616 | 39.293 | 32.3971 | 15.03 | 258500 | 313523.25 | 55023.25 | 516
 617 | 39.647 | 32.4111 | 15.73 | 259000 | 316822.87 | 57822.87 | 517
 618 | 40.123 | 32.4259 | 16.69 | 259500 | 321098.39 | 61598.39 | 518
...
 1383 | 37.699 | 31.0363 | 6.10 | 642000 | 779819.74 | 137819.74 | 1283
 1384 | 37.963 | 31.0417 | 6.33 | 642500 | 785755.81 | 143255.81 | 1284
 1385 | 37.507 | 31.0468 | 5.91 | 643000 | 776795.88 | 133795.88 | 1285
 1386 | 37.995 | 31.0522 | 6.34 | 643500 | 787377.68 | 143877.68 | 1286
 1387 | 38.869 | 31.0582 | 7.13 | 644000 | 805958.09 | 161958.09 | 1287
 1388 | 38.791 | 31.0642 | 7.04 | 644500 | 804809.78 | 160309.78 | 1288
 1389 | 40.226 | 31.0713 | 8.34 | 645000 | 835038.75 | 190038.75 | 1289
 1390 | 39.341 | 31.0777 | 7.52 | 645500 | 817131.93 | 171631.93 | 1290
 1391 | 38.554 | 31.0835 | 6.79 | 646000 | 801256.65 | 155256.65 | 1291

Maximum profit:

  c1  | price  |  round  | revenue_year_ratio | invest  |  v_value   | v_make_money | keep_days
------+--------+---------+--------------------+---------+------------+--------------+-----------
 4463 | 56.423 | 37.1193 |               4.35 | 2182000 | 3316734.30 |   1134734.30 |      4363
(1 row)

The two queries used above — the first lists every trading day in order, the second picks the single most profitable day:

select
  c1,                                  -- date
  price,                               -- current price
  round(cost_avg,4),                   -- average cost
  round(100 * ((price-cost_avg)/cost_avg) / ((c1-start_c1+1)/365.0), 2) as revenue_year_ratio,  -- annualized return (%)
  rn * 500 as invest,                  -- total invested so far (assuming 500 per trading day)
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) as v_value,                 -- current value of the holdings
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 as v_make_money, -- profit
  c1-start_c1 as keep_days             -- days held
from
(
  select c1, c5 as price,
         avg(c5) over w as cost_avg,
         min(c1) over w as start_c1,
         row_number() over w as rn
  from tbl1
  where c1 >= 100
    -- prices are lowest when the economy is depressed; that's a good time to start investing.
    -- with a long enough investment horizon you can start at any time; a harvest window will eventually come.
  window w as (order by c1 range between UNBOUNDED PRECEDING and CURRENT ROW)
) t
order by c1;

select
  c1,                                  -- date
  price,                               -- current price
  round(cost_avg,4),                   -- average cost
  round(100 * ((price-cost_avg)/cost_avg) / ((c1-start_c1+1)/365.0), 2) as revenue_year_ratio,  -- annualized return (%)
  rn * 500 as invest,                  -- total invested so far (assuming 500 per trading day)
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) as v_value,                 -- current value of the holdings
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 as v_make_money, -- profit
  c1-start_c1 as keep_days             -- days held
from
(
  select c1, c5 as price,
         avg(c5) over w as cost_avg,
         min(c1) over w as start_c1,
         row_number() over w as rn
  from tbl1
  where c1 >= 100
  window w as (order by c1 range between UNBOUNDED PRECEDING and CURRENT ROW)
) t
order by round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 desc
limit 1;

To satisfy the nitpickers, I also picked the worst possible starting point: day 1900, roughly the peak right before a sustained decline. Even starting dollar-cost averaging there, Buffett's philosophy still holds.

 3368 | 47.851 | 37.8497 | 6.57 | 734500 | 928582.39 | 194082.39 | 1468
 3369 | 48.664 | 37.8571 | 7.09 | 735000 | 944818.44 | 209818.44 | 1469
 3370 | 48.956 | 37.8646 | 7.27 | 735500 | 950944.72 | 215444.72 | 1470
 3371 | 48.662 | 37.8719 | 7.06 | 736000 | 945693.31 | 209693.31 | 1471
 3372 | 49.149 | 37.8796 | 7.37 | 736500 | 955613.33 | 219113.33 | 1472
 3373 | 48.608 | 37.8869 | 7.01 | 737000 | 945554.49 | 208554.49 | 1473
 3374 | 49.434 | 37.8947 | 7.54 | 737500 | 962075.98 | 224575.98 | 1474
 3375 | 48.248 | 37.9017 | 6.75 | 738000 | 939456.96 | 201456.96 | 1475
 3376 | 48.875 | 37.9091 | 7.15 | 738500 | 952123.66 | 213623.66 | 1476
 3377 | 47.995 | 37.9160 | 6.56 | 739000 | 935445.21 | 196445.21 | 1477
 3378 | 47.755 | 37.9226 | 6.40 | 739500 | 931233.85 | 191733.85 | 1478
 3379 | 48.567 | 37.9298 | 6.92 | 740000 | 947528.69 | 207528.69 | 1479
 3380 | 48.373 | 37.9369 | 6.78 | 740500 | 944205.93 | 203705.93 | 1480
...
 3807 | 53.445 | 39.5058 | 6.75 | 954000 | 1290608.01 | 336608.01 | 1907
 3808 | 53.659 | 39.5132 | 6.84 | 954500 | 1296211.63 | 341711.63 | 1908
 3809 | 54.518 | 39.5211 | 7.25 | 955000 | 1317389.98 | 362389.98 | 1909
 3810 | 53.973 | 39.5287 | 6.98 | 955500 | 1304653.62 | 349153.62 | 1910
(169 rows)

Even starting on the worst possible day, there are still 169 days with annualized return above 6%, 75 above 8%, and 69 above 10%. So with a sensible take-profit point, the returns of dollar-cost averaging are solid.
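The 169/75/69 counts above can be reproduced directly by wrapping the report in a CTE and counting threshold crossings — a sketch, assuming the same tbl1 and the day-1900 starting point:

with dca as (
  select c1, c5 as price,
         avg(c5) over w as cost_avg,
         min(c1) over w as start_c1
  from tbl1
  where c1 >= 1900
  window w as (order by c1 range between unbounded preceding and current row)
), r as (
  select 100 * ((price - cost_avg) / cost_avg) / ((c1 - start_c1 + 1) / 365.0) as yr from dca
)
select count(*) filter (where yr > 6)  as gt_6,
       count(*) filter (where yr > 8)  as gt_8,
       count(*) filter (where yr > 10) as gt_10
from r;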
pgbench can currently generate Poisson, Gaussian, exponential, and uniform random distributions. If you're interested, the reference documents are listed below.

References

1. width_bucket
https://www.postgresql.org/docs/15/functions-math.html

2. Gaussian distribution: the larger the parameter, the more the random values concentrate around the middle.
https://www.postgresql.org/docs/15/pgbench.html

\set r random_gaussian(0, 20000, 2.5)
\set r random_gaussian(0, 20000, 10)
\set r random_gaussian(0, 20000, 5)

-- distribution with parameter 10
postgres=# select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
 width_bucket | count
--------------+-------
            6 |  9583
            5 |  9511
            4 |   459
            7 |   447
(4 rows)

-- distribution with parameter 5
postgres=# select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
 width_bucket | count
--------------+-------
            6 |  6852
            5 |  6828
            4 |  2766
            7 |  2668
            8 |   430
            3 |   397
            2 |    36
            9 |    21
           10 |     2
(9 rows)

-- distribution with parameter 2.5
postgres=# select width_bucket(v,-0.1,0.1,10),count(*) from tbl group by 1 order by 2 desc,1;
 width_bucket | count
--------------+-------
            5 |  3971
            6 |  3791
            4 |  3035
            7 |  3028
            3 |  1900
            8 |  1872
            9 |   898
            2 |   855
           10 |   330
            1 |   322
(10 rows)

random ( lb, ub ) → integer
Computes a uniformly-distributed random integer in [lb, ub].
random(1, 10) → an integer between 1 and 10

random_exponential ( lb, ub, parameter ) → integer
Computes an exponentially-distributed random integer in [lb, ub], see below.
random_exponential(1, 10, 3.0) → an integer between 1 and 10

random_gaussian ( lb, ub, parameter ) → integer
Computes a Gaussian-distributed random integer in [lb, ub], see below.
random_gaussian(1, 10, 2.5) → an integer between 1 and 10

random_zipfian ( lb, ub, parameter ) → integer
Computes a Zipfian-distributed random integer in [lb, ub], see below.
random_zipfian(1, 10, 1.5) → an integer between 1 and 10

3. 《生成泊松、高斯、指数、随机分布数据 - PostgreSQL 9.5 new feature - pgbench improve, gaussian (standard normal) & exponential distribution》

4. 《DuckDB 线性回归预测股价的例子》

Comparing the real Moutai daily-change distribution with the pgbench-generated Gaussian distribution:

create table his (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
copy his from '/Users/digoal/Downloads/2.csv' ( format csv, HEADER , quote '"');

select width_bucket(v,-0.1,0.1,10), count(*) from
  (select (lag(c5) over w - c5)/c5 as v from his window w as (order by c1)) t
group by 1 order by 2 desc, 1 asc;

 width_bucket | count
--------------+-------
            6 |  1925
            5 |  1813
            4 |   528
            7 |   459
            3 |   130
            8 |    91
            2 |    23
            9 |    21
            1 |    20
           11 |    11
           10 |     5
              |     1
(12 rows)

-- distribution with parameter 5
-- random_gaussian(0, 20000, 5)
 width_bucket | count
--------------+-------
            5 |  1711
            6 |  1676
            7 |   704
            4 |   660
            8 |   121
            3 |   117
            9 |     6
            2 |     4
            1 |     1
(9 rows)
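Rather than eyeballing the two histograms, the buckets can be joined side by side — a sketch, assuming his holds the real data and tbl the simulated changes as above:

with real_hist as (
  select width_bucket(v,-0.1,0.1,10) as b, count(*) as cnt
  from (select (lag(c5) over w - c5)/c5 as v from his window w as (order by c1)) t
  group by 1
), sim_hist as (
  select width_bucket(v,-0.1,0.1,10) as b, count(*) as cnt
  from tbl
  group by 1
)
select b, r.cnt as real_cnt, s.cnt as sim_cnt
from real_hist r full join sim_hist s using (b)
order by b;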
Background

The laws of fractals are sometimes called divine laws: the formulas are extremely simple, yet they produce endless self-similarity. The Mandelbrot set, produced by a fractal formula, is known as "the fingerprint of God".

PS: to deploy open-source PolarDB, see: 《如何用 PolarDB 证明巴菲特的投资理念》

1. The fractal formula:

z = z^2 + c

z can be a real number, a complex number, or even a higher-dimensional number.

2. Complex arithmetic:

(a+bi)(c+di) = (ac-bd)+(ad+bc)i
(a+bi) + (c+di) = (a+c)+(b+d)i
(a+bi) - (c+di) = (a-c)+(b-d)i

3. How is the "fingerprint of God" produced?

With the initial value of z fixed, different values of c make z diverge at different speeds: for one c, z blows up toward infinity after, say, 100 iterations; for another c, z stays within a bounded range no matter how many iterations you run.

Take a white canvas as the backdrop and map each c onto this two-dimensional canvas, coloring each pixel by how fast that c makes z diverge: the darker the pixel, the less z diverges.

Concretely, fix z0 = 0. For each complex c, the iteration of f(z) = z^2 + c behaves differently. Since a complex number corresponds to a point on the plane, we can draw a planar image showing, for each c, whether the iteration starting from z0 = 0 diverges to infinity, using different colors for different divergence speeds. The result is the Mandelbrot set fractal.

First a warm-up: with c and z0 fixed, generate the iterated values of z:

WITH RECURSIVE t(n, zr, zi, cr, ci) AS (
  VALUES (1, 0::float8, 0::float8, 0.1::float8, 0.1::float8)
  UNION ALL
  SELECT n+1, zr*zr - zi*zi + cr, zr*zi + zi*zr + ci, cr, ci FROM t WHERE n < 100
)
SELECT n,zr,zi FROM t;

do language plpgsql $$
declare
  zr float8 := 0;
  zi float8 := 0;
  cr float8 := 0.1;
  ci float8 := 0.1;
  tmpr float8;
  tmpi float8;
begin
  for i in 1..100 loop
    raise notice '%, %i', zr, zi;
    tmpr := zr*zr - zi*zi + cr;
    tmpi := zr*zi + zi*zr + ci;
    zr := tmpr;
    zi := tmpi;
  end loop;
  raise notice '%, %i', zr, zi;
end;
$$;

NOTICE: 0, 0i
NOTICE: 0.1, 0.1i
NOTICE: 0.1, 0.12000000000000001i
NOTICE: 0.0956, 0.12400000000000001i
NOTICE: 0.09376336, 0.12370880000000001i
NOTICE: 0.09348770048104961, 0.123198705499136i
NOTICE: 0.0935620291045716, 0.12303512735871254i
NOTICE: 0.09361621072599009, 0.12302283233364109i
NOTICE: 0.09362937763530182, 0.12303386279170858i
NOTICE: 0.093629128962925, 0.1230391680025096i
NOTICE: 0.09362777692760627, 0.12304010025679593i
NOTICE: 0.0936272943412032, 0.1230399421199872i
...

It essentially does not diverge, so the point on the complex plane for c = 0.1+0.1i would be drawn as a black pixel.

Now extend this to the whole complex plane. For z0 = x + iy, different x and y give different iteration outcomes: for some z0 the values stay bounded, for others they diverge to infinity. Dark gray marks the z0 that stay bounded; the other z0 are colored by divergence speed. Because the iteration is guaranteed to diverge once |z| > 2, divergence speed is defined as: the fewer iterations needed to push |z| above 2, the faster the divergence. This image can be drawn programmatically. With a fixed c, e.g. f(z) = z^2 + (-0.75+0i), this yields a Julia set; fixing z0 = 0 and varying c instead yields the Mandelbrot set (the fingerprint of God).

4. Generating the "fingerprint of God" with PolarDB:

do language plpgsql $$
declare
  zr numeric := 0.0;  -- z0r
  zi numeric := 0.0;  -- z0i
  tmpr numeric;
  tmpi numeric;
  i int;
begin
  <<label_x>>
  for x in -300..300 loop    -- cr: the canvas spans x pixels -300..300
    <<label_y>>
    for y in -200..200 loop  -- ci: the canvas spans y pixels -200..200
      <<label_i>>
      for k in 1..200 loop
        -- i records the divergence speed (color depth); at most 200 iterations:
        -- 200 means black, 1 is close to white.
        tmpr := zr*zr - zi*zi + x::numeric/300.0::numeric;
        tmpi := zr*zi + zi*zr + y::numeric/200.0::numeric;
        zr := tmpr;
        zi := tmpi;
        i := k;
        exit label_i when sqrt(zr*zr + zi*zi) > 2;  -- stop once |z| > 2: from here z diverges without bound
      end loop label_i;
      raise notice 'cr:%, ci:%, i:%', x, y, i;
      zr := 0.0;  -- reset z0r
      zi := 0.0;  -- reset z0i
    end loop label_y;
  end loop label_x;
end;
$$;
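NOTICE output is handy for eyeballing, but for actually plotting the image it is easier to write the depths into a table. A minimal variant of the same loop, assuming a scratch table named mandelbrot (my own name) and using float8, which is typically much faster than numeric here:

create table mandelbrot (cr int, ci int, depth int);

do language plpgsql $$
declare
  zr float8; zi float8; tmp float8; i int;
begin
  for x in -300..300 loop
    for y in -200..200 loop
      zr := 0.0; zi := 0.0; i := 0;
      for k in 1..200 loop
        tmp := zr*zr - zi*zi + x::float8/300.0;  -- real part of z^2 + c
        zi  := 2*zr*zi + y::float8/200.0;        -- imaginary part (uses the old zr)
        zr  := tmp;
        i   := k;
        exit when zr*zr + zi*zi > 4;             -- |z| > 2, squared to avoid sqrt()
      end loop;
      insert into mandelbrot values (x, y, i);
    end loop;
  end loop;
end;
$$;

The table can then be exported and rendered with any plotting tool.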
cr is the x coordinate, ci the y coordinate, i the color depth:

NOTICE: cr:-300, ci:-200, i:3
NOTICE: cr:-300, ci:-199, i:3
... (i stays at 3 through ci:-149) ...
NOTICE: cr:-300, ci:-148, i:4
... (i stays at 4 through ci:-120) ...
NOTICE: cr:-300, ci:-119, i:5
... (i stays at 5 through ci:-88) ...
NOTICE: cr:-300, ci:-87, i:6
NOTICE: cr:-300, ci:-86, i:6
NOTICE: cr:-300, ci:-85, i:6
NOTICE: cr:-300, ci:-84, i:6
NOTICE: cr:-300, ci:-83, i:6
NOTICE: cr:-300, ci:-82, i:7
NOTICE: cr:-300, ci:-81, i:7
NOTICE: cr:-300, ci:-80, i:7
NOTICE: cr:-300, ci:-79, i:8
NOTICE: cr:-300, ci:-78, i:8
NOTICE: cr:-300, ci:-77, i:8
NOTICE: cr:-300, ci:-76, i:8
NOTICE: cr:-300, ci:-75, i:8
NOTICE: cr:-300, ci:-74, i:9
NOTICE: cr:-300, ci:-73, i:9
NOTICE: cr:-300, ci:-72, i:10
NOTICE: cr:-300, ci:-71, i:10
NOTICE: cr:-300, ci:-70, i:10
NOTICE: cr:-300, ci:-69, i:10
NOTICE: cr:-300, ci:-68, i:11
NOTICE: cr:-300, ci:-67, i:11
NOTICE: cr:-300, ci:-66, i:12
NOTICE: cr:-300, ci:-65, i:13
NOTICE: cr:-300, ci:-64, i:15
NOTICE: cr:-300, ci:-63, i:18
NOTICE: cr:-300, ci:-62, i:23
NOTICE: cr:-300, ci:-61, i:37
NOTICE: cr:-300, ci:-60, i:35
NOTICE: cr:-300, ci:-59, i:34
NOTICE: cr:-300, ci:-58, i:59
NOTICE: cr:-300, ci:-57, i:200
NOTICE: cr:-300, ci:-56, i:200
.....

Similarly: fix c (f(z) = z^2 + (-0.75+0i)) and take different z0 as the starting point; watch how fast z diverges, and draw the 2D plane of z0 with color depth mapped to divergence speed — this yields the Julia set.
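A sketch of that Julia-set loop, patterned on the Mandelbrot block above (the z0 scan range and the 200-iteration cap are assumptions carried over from before):

do language plpgsql $$
declare
  zr float8; zi float8; tmp float8; i int;
  cr constant float8 := -0.75;  -- fixed c = -0.75 + 0i
  ci constant float8 := 0.0;
begin
  for x in -300..300 loop
    for y in -200..200 loop
      zr := x::float8/200.0;  -- z0 varies over the canvas instead of c
      zi := y::float8/200.0;
      i  := 0;
      for k in 1..200 loop
        tmp := zr*zr - zi*zi + cr;
        zi  := 2*zr*zi + ci;
        zr  := tmp;
        i   := k;
        exit when zr*zr + zi*zi > 4;  -- |z| > 2
      end loop;
      raise notice 'z0r:%, z0i:%, depth:%', x, y, i;
    end loop;
  end loop;
end;
$$;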
References

http://www.matrix67.com/blog/archives/4570
http://www.matrix67.com/blog/archives/6231
https://www.cnblogs.com/anderslly/archive/2008/10/10/mandelbrot-set-by-fsharp.html
http://www.matrix67.com/blog/archives/292
https://www.eefocus.com/e/500748
分形艺术网: http://www.fxysw.com/
https://zhuanlan.zhihu.com/p/450061289
Background

What is Buffett's investment philosophy? Long-term dollar-cost averaging.

Long-term DCA is not speculation; it has social value: it helps listed companies raise funds and increase R&D and production, while investors share in the dividends of the businesses' growth.

What is the logic (theoretical basis) behind long-term DCA making money?

1. First, intergenerational transfer theory: resources (means of production, productivity) are finite, but our whole society assumes and firmly believes that future technological progress will bring higher resource utilization and productivity. For example, over-extraction of oil and coal damages the environment, but we trust future technology to find new energy sources and repair the damage. (Much in the spirit of the theory of compensatory weakening.) See 《德说-第96期, 代际转移与创新、集智(全球脑)》

2. Second, economic cycles and macro-regulation: maintaining moderate inflation benefits economic growth. When stimulus is needed, the central bank typically lowers the reserve requirement so commercial banks can lend more, which may cause inflation (the money supply grows; further reading: a short history of finance). But goods tied to people's basic livelihood are not fully market-priced, so their inflation stays controllable; otherwise there would be unrest. Monetary loosening also needs laws, regulations, and supervision to stop speculators from borrowing piles of money just to churn it, which would defeat the purpose; the hoped-for outcomes are consumption, R&D investment, purchases of means of production, stimulated output...

3. Third, mathematical support: the smile curve. See 《德说-第56期, 微笑曲线 - 基金定投》

4. A strict take-profit line.

With the theory in place, this article uses real data and PolarDB to demonstrate Buffett's investment philosophy.

The demonstration

1. Download the data.

Buffett actually recommends investing in an index: an index is composed of dozens or hundreds of market-leading, sector-leading, or similarly sized stocks, so it carries less risk than a single stock (a single company may be delisted after poor management). This article uses a single stock, Moutai, as the example; the idea applies to both stocks and indexes.

Take Moutai data from 2001 onward. Download the historical closing prices from 2001 to 2022-09:

https://zhuanlan.zhihu.com/p/65662875

curl "http://quotes.money.163.com/service/chddata.html?code=0600519&start=20010101&end=20220901&fields=TOPEN;TCLOSE" -o ./historical_tradedata_0600519.SH.csv

Fix the character encoding:

$ iconv -f GBK -t UTF-8 ./historical_tradedata_0600519.SH.csv > ./1.csv
$ head -n 5 1.csv
日期,股票代码,名称,开盘价,收盘价
2022-09-01,'600519,贵州茅台,1912.15,1880.89
2022-08-31,'600519,贵州茅台,1860.1,1924.0
2022-08-30,'600519,贵州茅台,1882.35,1870.0
2022-08-29,'600519,贵州茅台,1883.0,1878.82
...

2. Import into PolarDB for PostgreSQL. (If you haven't installed it, see the one-click PolarDB deployment in the appendix.)

psql -h 127.0.0.1
postgres=# drop table his;
DROP TABLE
postgres=# create table his (c1 date, c2 text, c3 text, c4 numeric, c5 numeric);
CREATE TABLE
postgres=# copy his from '/home/postgres/1.csv' ( format csv, HEADER , quote '"');
COPY 5101
postgres=# delete from his where c4=0 or c5=0;
DELETE 74
postgres=# create index idx_his_1 on his(c1);
CREATE INDEX

3. Analyze the result.

Suppose you start on 2014-10-01 and invest 500 yuan every trading day.

You can pick any start date (a long window is recommended, to show that a full cycle is crossed). Stock prices are lowest when the economy is depressed, so that is a good time to start. Don't fear a falling market: it comes back once the cycle passes. (No rainbow without the rain.) If your investment horizon is long enough, you can start at any time; a harvest window will eventually come.

The SQL:

select
  c1,                                  -- date
  price,                               -- current price
  round(cost_avg,4),                   -- average cost
  round(100 * ((price-cost_avg)/cost_avg) / ((c1-start_c1+1)/365.0), 2) as revenue_year_ratio,  -- annualized return (%)
  rn * 500 as invest,                  -- total invested so far (assuming 500 per trading day)
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) as v_value,                 -- current value of the holdings
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 as v_make_money, -- profit
  c1-start_c1 as keep_days             -- days held
from
(
  select c1, c5 as price,
         avg(c5) over w as cost_avg,
         min(c1) over w as start_c1,
         row_number() over w as rn
  from his
  where c1 >= '2014-10-01'  -- start investing on 2014-10-01; pick any start date you like
  window w as (order by c1 range between UNBOUNDED PRECEDING and CURRENT ROW)
) t
order by c1;

The result:

date, current price, average cost, annualized return, total invested, value of holdings, profit, days held

     c1     |  price  |  round   | revenue_year_ratio | invest |  v_value   | v_make_money | keep_days
------------+---------+----------+--------------------+--------+------------+--------------+-----------
 2014-10-08 |  160.73 | 160.7300 |               0.00 |    500 |     500.00 |         0.00 |         0
 2014-10-09 |  160.01 | 160.3700 |             -40.97 |   1000 |     997.76 |        -2.24 |         1
 2014-10-10 |  159.02 | 159.9200 |             -68.47 |   1500 |    1491.56 |        -8.44 |         2
 2014-10-13 |  156.1  | 158.9650 |            -109.64 |   2000 |    1963.95 |       -36.05 |         5
 2014-10-14 |  154.91 | 158.1540 |            -106.95 |   2500 |    2448.72 |       -51.28 |         6
 2014-10-15 |  157.32 | 158.0150 |             -20.07 |   3000 |    2986.81 |       -13.19 |         7
 2014-10-16 |  157.32 | 157.9157 |             -15.30 |   3500 |    3486.80 |       -13.20 |         8
 2014-10-17 |  158.38 | 157.9738 |               9.39 |   4000 |    4010.29 |        10.29 |         9
 2014-10-20 |  156.92 | 157.8567 |             -16.66 |   4500 |    4473.30 |       -26.70 |        12
....
 2022-08-17 | 1918.0  | 921.1272 | 13.76 | 958500 | 1995818.89 | 1037318.89 | 2870
 2022-08-18 | 1895.5  | 921.6352 | 13.43 | 959000 | 1972347.12 | 1013347.12 | 2871
 2022-08-19 | 1895.01 | 922.1424 | 13.40 | 959500 | 1971780.14 | 1012280.14 | 2872
 2022-08-22 | 1893.98 | 922.6486 | 13.36 | 960000 | 1970653.66 | 1010653.66 | 2875
 2022-08-23 | 1870.01 | 923.1417 | 13.01 | 960500 | 1945686.70 |  985186.70 | 2876
 2022-08-24 | 1854.2  | 923.6262 | 12.78 | 961000 | 1929228.81 |  968228.81 | 2877
 2022-08-25 | 1885.0  | 924.1261 | 13.18 | 961500 | 1961233.98 |  999733.98 | 2878
 2022-08-26 | 1898.0  | 924.6323 | 13.34 | 962000 | 1974705.04 | 1012705.04 | 2879
 2022-08-29 | 1878.82 | 925.1279 | 13.05 | 962500 | 1954718.00 |  992218.00 | 2882
 2022-08-30 | 1870.0  | 925.6185 | 12.91 | 963000 | 1945520.68 |  982520.68 | 2883
 2022-08-31 | 1924.0  | 926.1366 | 13.63 | 963500 | 2001620.41 | 1038120.41 | 2884
 2022-09-01 | 1880.89 | 926.6318 | 13.02 | 964000 | 1956740.40 |  992740.40 | 2885
(1928 rows)

At peak profit: held 2317 days, annualized return 43.77%, invested 775.5k, profit 2.155 million.

select
  c1,                                  -- date
  price,                               -- current price
  round(cost_avg,4),                   -- average cost
  round(100 * ((price-cost_avg)/cost_avg) / ((c1-start_c1+1)/365.0), 2) as revenue_year_ratio,  -- annualized return (%)
  rn * 500 as invest,                  -- total invested so far (assuming 500 per trading day)
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) as v_value,                 -- current value of the holdings
  round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 as v_make_money, -- profit
  c1-start_c1 as keep_days             -- days held
from
(
  select c1, c5 as price,
         avg(c5) over w as cost_avg,
         min(c1) over w as start_c1,
         row_number() over w as rn
  from his
  where c1 >= '2014-10-01'
  window w as (order by c1 range between UNBOUNDED PRECEDING and CURRENT ROW)
) t
order by round(rn * 500 * (1+ (price-cost_avg)/cost_avg ), 2) - rn * 500 desc
limit 1;

     c1     | price  |  round   | revenue_year_ratio | invest |  v_value   | v_make_money | keep_days
------------+--------+----------+--------------------+--------+------------+--------------+-----------
 2021-02-10 | 2601.0 | 688.1936 |              43.77 | 775500 | 2930970.87 |   2155470.87 |      2317
(1 row)
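To summarize a whole run instead of scrolling through rows, the report can be wrapped in a CTE — a sketch over the same his table and start date:

with dca as (
  select c5 as price,
         avg(c5) over w as cost_avg,
         row_number() over w as rn
  from his
  where c1 >= '2014-10-01'
  window w as (order by c1 range between unbounded preceding and current row)
)
select count(*) as trading_days,
       count(*) filter (where price > cost_avg) as days_in_profit,
       round(max(rn * 500 * ((price - cost_avg) / cost_avg)), 2) as best_profit
from dca;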
As an appendix, here are the data for two much worse stocks, both now trading below their IPO price. Even so, Buffett's investment philosophy still holds.

600619 | 海立股份

 2015-10-27 | 11.22 | 10.8041 |  3.65 | 101000 | 104888.35 |  3888.35 | 384
 2015-10-28 | 11.69 | 10.8084 |  7.71 | 101500 | 109778.73 |  8278.73 | 385
 2015-10-29 | 12.86 | 10.8185 | 17.80 | 102000 | 121248.08 | 19248.08 | 386
 2015-10-30 | 12.44 | 10.8264 | 14.02 | 102500 | 117777.02 | 15277.02 | 387
 2015-11-02 | 12.09 | 10.8325 | 10.84 | 103000 | 114956.59 | 11956.59 | 390
 2015-11-03 | 12.39 | 10.8400 | 13.31 | 103500 | 118298.83 | 14798.83 | 391
 2015-11-04 | 12.88 | 10.8499 | 17.38 | 104000 | 123459.71 | 19459.71 | 392
 2015-11-05 | 12.88 | 10.8596 | 17.24 | 104500 | 123942.30 | 19442.30 | 393
 2015-11-06 | 13.39 | 10.8716 | 21.41 | 105000 | 129322.96 | 24322.96 | 394
 2015-11-09 | 14.25 | 10.8876 | 28.32 | 105500 | 138081.01 | 32581.01 | 397
 2015-11-10 | 14.69 | 10.9056 | 31.74 | 106000 | 142783.97 | 36783.97 | 398
 2015-11-11 | 15.15 | 10.9255 | 35.28 | 106500 | 147679.84 | 41179.84 | 399
 2015-11-12 | 14.81 | 10.9436 | 32.16 | 107000 | 144802.76 | 37802.76 | 400
...
 2015-12-08 | 15.4  | 11.2300 | 31.74 | 116000 | 159074.52 | 43074.52 | 426
 2015-12-09 | 15.09 | 11.2465 | 29.14 | 116500 | 156313.64 | 39813.64 | 427
 2015-12-10 | 15.2  | 11.2634 | 29.74 | 117000 | 157891.67 | 40891.67 | 428
 2015-12-11 | 15.06 | 11.2796 | 28.45 | 117500 | 156880.92 | 39380.92 | 429
 2015-12-14 | 15.28 | 11.2965 | 29.73 | 118000 | 159610.14 | 41610.14 | 432
 2015-12-15 | 15.41 | 11.3139 | 30.45 | 118500 | 161402.16 | 42902.16 | 433
 2015-12-16 | 15.68 | 11.3322 | 32.19 | 119000 | 164656.07 | 45656.07 | 434
 2015-12-17 | 15.83 | 11.3510 | 33.03 | 119500 | 166652.92 | 47152.92 | 435
 2015-12-18 | 15.84 | 11.3698 | 32.84 | 120000 | 167180.46 | 47180.46 | 436
 2015-12-21 | 16.55 | 11.3912 | 37.57 | 120500 | 175070.86 | 54570.86 | 439
 2015-12-22 | 17.03 | 11.4145 | 40.72 | 121000 | 180526.68 | 59526.68 | 440

     c1     | price |  round  | revenue_year_ratio | invest |  v_value  | v_make_money | keep_days
------------+-------+---------+--------------------+--------+-----------+--------------+-----------
 2017-09-26 | 17.22 | 12.1651 |              13.98 | 328500 | 465000.50 |    136500.50 |      1084
(1 row)

600719 | ST热电

 2015-11-06 | 14.57 |  9.5301 | 48.87 | 65500 | 100139.28 | 34639.28 | 394
 2015-11-09 | 14.32 |  9.5664 | 45.57 | 66000 |  98796.16 | 32796.16 | 397
 2015-11-10 | 14.17 |  9.6010 | 43.53 | 66500 |  98146.78 | 31646.78 | 398
 2015-11-11 | 14.46 |  9.6372 | 45.66 | 67000 | 100528.79 | 33528.79 | 399
 2015-11-12 | 14.14 |  9.6706 | 42.07 | 67500 |  98696.12 | 31196.12 | 400
 2015-11-13 | 13.44 |  9.6983 | 35.03 | 68000 |  94234.99 | 26234.99 | 401
 2015-11-16 | 14.16 |  9.7309 | 41.02 | 68500 |  99678.59 | 31178.59 | 404
 2015-11-17 | 14.28 |  9.7638 | 41.58 | 69000 | 100915.21 | 31915.21 | 405
 2015-11-18 | 14.7  |  9.7994 | 44.85 | 69500 | 104256.89 | 34756.89 | 406
 2015-11-19 | 15.0  |  9.8365 | 46.96 | 70000 | 106745.29 | 36745.29 | 407
 2015-11-20 | 16.11 |  9.8810 | 56.26 | 70500 | 114943.41 | 44443.41 | 408
 2015-11-23 | 15.9  |  9.9234 | 53.36 | 71000 | 113761.64 | 42761.64 | 411
 2015-11-24 | 15.31 |  9.9610 | 47.46 | 71500 | 109894.55 | 38394.55 | 412
 2015-11-25 | 16.2  | 10.0044 | 54.60 | 72000 | 116588.99 | 44588.99 | 413
 2015-11-26 | 16.19 | 10.0470 | 53.78 | 72500 | 116828.01 | 44328.01 | 414
 2015-11-27 | 15.22 | 10.0825 | 44.71 | 73000 | 110197.25 | 37197.25 | 415
...
 2017-04-06 | 10.18 |  8.3120 |  8.99 | 237500 | 290875.38 | 53375.38 | 911
 2017-04-07 | 10.02 |  8.3156 |  8.19 | 238000 | 286782.60 | 48782.60 | 912
 2017-04-10 | 10.18 |  8.3195 |  8.91 | 238500 | 291836.89 | 53336.89 | 915
 2017-04-11 | 10.05 |  8.3231 |  8.26 | 239000 | 288588.52 | 49588.52 | 916
 2017-04-12 |  9.62 |  8.3258 |  6.18 | 239500 | 276728.84 | 37228.84 | 917
 2017-04-13 |  9.67 |  8.3286 |  6.40 | 240000 | 278654.14 | 38654.14 | 918
 2017-04-14 |  9.41 |  8.3309 |  5.14 | 240500 | 271653.47 | 31153.47 | 919
 2017-04-17 |  9.47 |  8.3332 |  5.39 | 241000 | 273876.26 | 32876.26 | 922
 2017-04-18 |  9.3  |  8.3352 |  4.57 | 241500 | 269453.08 | 27953.08 | 923
 2017-04-19 |  8.9  |  8.3364 |  2.67 | 242000 | 258361.41 | 16361.41 | 924
 2017-04-20 |  8.7  |  8.3371 |  1.72 | 242500 | 253054.59 | 10554.59 | 925

     c1     | price | round  | revenue_year_ratio | invest |  v_value  | v_make_money | keep_days
------------+-------+--------+--------------------+--------+-----------+--------------+-----------
 2017-04-06 | 10.18 | 8.3120 |               8.99 | 237500 | 290875.38 |     53375.38 |       911
(1 row)
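Each of the reports above was produced by editing the table name in the same query. If you run this often, the analysis can be parameterized with dynamic SQL — a rough sketch; the function name dca_best is made up here, and it assumes the his-style layout (c1 date, c5 numeric):

create or replace function dca_best(tbl regclass, start_date date)
returns table (best_profit numeric, at_date date)
language plpgsql as $$
begin
  return query execute format($q$
    with dca as (
      select c1, c5 as price,
             avg(c5) over w as cost_avg,
             row_number() over w as rn
      from %s
      where c1 >= %L
      window w as (order by c1 range between unbounded preceding and current row)
    )
    select round(rn * 500 * ((price - cost_avg) / cost_avg), 2), c1
    from dca order by 1 desc limit 1
  $q$, tbl, start_date);
end;
$$;

select * from dca_best('his', '2014-10-01');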
Food for thought: how do insurance companies' many investment-type wealth products make money? For example you pay 10-20k a year for 20 years, and then... the annualized interest is roughly 3%, plus some critical-illness coverage (essentially the kind that pays out only if you're nearly dead or disabled).

Appendix: one-click PolarDB deployment

Install a docker environment, see: 《MacOS PolarDB-X 数据库快速部署指南》

One-click PolarDB deployment, see:
https://apsaradb.github.io/PolarDB-for-PostgreSQL/zh/

Pick any flavor (single-node, multi-node, or HTAP instance) to deploy, e.g. the HTAP instance:

# pull the HTAP PolarDB image
docker pull polardb/polardb_pg_local_instance:htap

# create, run, and enter the container
docker run -it --cap-add=SYS_PTRACE --privileged=true --name polardb_pg_htap polardb/polardb_pg_local_instance:htap bash

# verify the instance works
psql -h 127.0.0.1 -c 'select version();'
            version
--------------------------------
 PostgreSQL 11.9 (POLARDB 11.9)
(1 row)

After exiting the container, how do you get back in?

IT-C02YW2EFLVDL:~ digoal$ docker ps -a
CONTAINER ID   IMAGE                                    COMMAND                  CREATED         STATUS                      PORTS     NAMES
dd43b032b95e   polardb/polardb_pg_local_instance:htap   "/bin/sh -c '~/tmp_b…"   7 minutes ago   Exited (1) 3 minutes ago              polardb_pg_htap

IT-C02YW2EFLVDL:~ digoal$ docker start dd43b032b95e
dd43b032b95e

IT-C02YW2EFLVDL:~ digoal$ docker exec -it dd43b032b95e bash
[postgres@dd43b032b95e ~]$ psql -h 127.0.0.1
psql (11.9)
Type "help" for help.
postgres=# \q

Other references:

docker --help

List containers:

IT-C02YW2EFLVDL:~ digoal$ docker ps -a
CONTAINER ID   IMAGE                                    COMMAND                  CREATED         STATUS         PORTS     NAMES
dd43b032b95e   polardb/polardb_pg_local_instance:htap   "/bin/sh -c '~/tmp_b…"   8 minutes ago   Up 2 seconds             polardb_pg_htap

Stop a container:

docker stop ...
  stop        Stop one or more running containers

Remove a container:

docker rm ...
  rm          Remove one or more containers
docker rm dd43b032b95e

List images:

docker images
  images      List images

IT-C02YW2EFLVDL:~ digoal$ docker images
REPOSITORY                          TAG       IMAGE ID       CREATED        SIZE
polardb/polardb_pg_local_instance   htap      a05bfc3b1310   3 weeks ago    11.5GB
polardbx/galaxyengine               latest    6c7171b141d6   2 months ago   2.11GB
polardbx/galaxysql                  latest    1a9a92c774dc   2 months ago   1.14GB
polardbx/galaxycdc                  latest    a7b7d468cd34   2 months ago   905MB
polardbx/xstore-tools               latest    d89e74573646   3 months ago   2.69MB
polardbx/polardbx-init              latest    b3637901782a   3 months ago   6.59MB

Remove an image:

docker rmi ...
  rmi         Remove one or more images

IT-C02YW2EFLVDL:~ digoal$ docker rmi 6c7171b141d6
Untagged: polardbx/galaxyengine:latest
Untagged: polardbx/galaxyengine@sha256:135530a3848fec0663555decf6d40de4b9b6288e59f0ce9f8fafc88103ee4b53
Deleted: sha256:6c7171b141d689c4f2cb85bec056e8efa281f7d0c13d5f6ec8786fdfe0b2dacc
Deleted: sha256:eb01d41966798251e6cf87030021b9430e39be92152d1b699b862ce7ffd392b6
Deleted: sha256:d3d01e57b3ff262d299d2fc86ee4e6243464aace5f0bb127529ec0b7cf36bcc1
Deleted: sha256:48292444284d3251871963192eb99ff82e3929af68426b43edf7bfc4dae1580d
Deleted: sha256:6ca882a31a79adbdf39412feee05487de45617f70711389b94145eb1475b2146
Background

To guard against torn pages, the first modification of a page after a checkpoint must write the entire page into the WAL: a full page write (FPW).

With FPW, damaged pages can be repaired. For example, pg_basebackup copies files online; even if it copies a partial block, the full-page image in the WAL can repair it.

But FPW inflates the WAL, especially under update-heavy workloads, and most visibly when checkpoints are frequent.

The more frequent the checkpoints, the less WAL must be replayed during recovery, so crash recovery is fast. Hence a conflict:

- Lengthening the checkpoint interval reduces the performance impact of FPW, but prolongs crash recovery.
- Shortening the checkpoint interval shortens crash recovery, but floods the WAL with FPW records and hurts write performance.

PolarDB repairs damaged pages from a standby, so FPW can be switched off entirely, resolving the conflict; the performance gain is significant.

A note on Oracle: it uses a similar approach, but relies on checksums to decide whether a page is damaged, which leaves a small probability of trouble: a damaged block may produce the same checksum as the intact block, permanently corrupting it. Why? An 8KB block is reduced to a few-byte checksum, so hash collisions are possible: different page contents, same checksum.

PolarDB fetches every first-modified page from a hot standby, so this small-probability problem does not exist.

With a high checkpoint frequency, disabling FPW improves write-load performance by about 30%; the comparison test below demonstrates this.
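Before the benchmark, it is worth knowing how to observe FPW volume directly. On PostgreSQL 14 and later, the pg_stat_wal view counts full-page images — a quick check:

select wal_records,
       wal_fpi,                                                -- full-page images written
       round(100.0 * wal_fpi / nullif(wal_records, 0), 2) as fpi_pct,
       pg_size_pretty(wal_bytes) as wal_volume
from pg_stat_wal;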
Comparison test

Hardware: 8c64g, 2TB NVMe SSD.

rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex openjade bzip2 git iotop lvm2 perf centos-release-scl

yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql14*

parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE
mkfs.ext4 /dev/vdb1 -m 0 -O extent,uninit_bg -b 4096 -L lv01

vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0

mkdir /data01
mount -a

vi /etc/sysctl.conf

# added by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
# optional: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# create /data01/corefiles first with mode 777; if it is a symlink, set the target dir to 777
kernel.sem = 4096 2147483647 2147483646 512000
# semaphores; check with ipcs -l or -u; one group per 16 processes, 17 semaphores per group
kernel.shmall = 107374182
# limit on total shared memory (suggest 80% of RAM), in pages
kernel.shmmax = 274877906944
# max size of a single shared memory segment (suggest half of RAM), in bytes; versions > 9.2 use far less shared memory
kernel.shmmni = 819200
# max number of shared memory segments; each PG cluster needs at least 2
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144    # The default setting of the socket receive buffer in bytes.
net.core.rmem_max = 4194304       # The maximum receive socket buffer size in bytes
net.core.wmem_default = 262144    # The default setting (in bytes) of the socket send buffer.
net.core.wmem_max = 4194304       # The maximum send socket buffer size in bytes.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1       # enable SYN cookies: when the SYN queue overflows, cookies handle the backlog and mitigate small SYN floods
net.ipv4.tcp_timestamps = 1       # reduces time_wait
net.ipv4.tcp_tw_recycle = 0       # 1 enables fast recycling of TIME-WAIT sockets, but may break connections behind NAT; keep it off on servers
net.ipv4.tcp_tw_reuse = 1         # allow reusing TIME-WAIT sockets for new TCP connections
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000
# when dirty pages reach this volume, the background flusher (pdflush or similar) starts writing pages older than (dirty_expire_centisecs/100) seconds to disk
# the default is 10%; on large-memory machines specify bytes directly
vm.dirty_expire_centisecs = 3000  # dirty pages older than this are flushed; 3000 = 30 seconds
vm.dirty_ratio = 95
# if background flushing is too slow and dirty pages exceed 95% of RAM, user processes that write to disk (fsync, fdatasync, ...) must flush dirty pages themselves
# this effectively prevents user processes from flushing; very useful with multiple instances per host and per-instance IOPS limits via CGROUP
vm.dirty_writeback_centisecs = 100  # wakeup interval of the background flusher; 100 = 1 second
vm.swappiness = 0                 # do not use swap
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0
# allow modest over-allocation when assigning memory; set to 1 to always assume enough memory (usable on low-memory test machines)
vm.overcommit_ratio = 90          # with overcommit_memory = 2, used to compute how much memory may be committed
vm.zone_reclaim_mode = 0          # disable numa (or disable it in vmlinux)
net.ipv4.ip_local_port_range = 40000 65535  # range of automatically assigned local TCP/UDP ports
fs.nr_open=20480000               # per-process limit on open file handles
# note the following parameters:
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 2097152
# suggest 1GB of vm.min_free_kbytes per 32GB of RAM; on small-memory machines do not set these two
# vm.nr_hugepages = 66536
# consider hugepages when shared buffers exceed 64GB; page size: /proc/meminfo Hugepagesize
# vm.lowmem_reserve_ratio = 1 1 1
# suggested when RAM > 64GB; otherwise keep the defaults 256 256 32

sysctl -p

vi /etc/security/limits.d/20-nproc.conf
# if nofile exceeds 1048576, first raise fs.nr_open in sysctl and apply it before setting nofile
* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited

chmod +x /etc/rc.d/rc.local

vi /etc/rc.local
touch /var/lock/subsys/local
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

useradd postgres
su - postgres
wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
tar -jxvf postgresql-snapshot.tar.bz2
cd postgresql-15devel
./configure --prefix=/home/postgres/pg15
make world -j 16
make install-world

cd ~
vi .bash_profile
# append
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/pg15_$PGPORT/pg_root
export LANG=en_US.utf8
export PGHOME=/home/postgres/pg15
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

su - root
mkdir /data01/pg15_1921
chown postgres:postgres /data01/pg15_1921

su - postgres
initdb -D $PGDATA -U postgres -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8

cd $PGDATA
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 1921
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '., /tmp'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6
shared_buffers = 16GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 5.0
effective_io_concurrency = 0
max_parallel_workers_per_gather = 0
wal_level = replica
fsync = on
synchronous_commit = on
full_page_writes = on
wal_writer_delay = 10ms
wal_writer_flush_after = 1MB
max_wal_size = 1GB   # deliberately small: frequent checkpoints make the FPW effect obvious
min_wal_size = 80MB
random_page_cost = 1.1
effective_cache_size = 64GB
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_checkpoints = on
log_timezone = 'Asia/Shanghai'
autovacuum_vacuum_cost_delay = 0ms
vacuum_freeze_table_age = 750000000
vacuum_multixact_freeze_table_age = 750000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'

pg_ctl start

pgbench -i -s 5000

pgbench -M prepared -n -r -P 1 -c 16 -j 16 -T 120

TPC-B read/write performance with full page writes ON:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 5000
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 1210084
latency average = 1.586 ms
latency stddev = 1.181 ms
initial connection time = 8.439 ms
tps = 10072.852551 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.077  BEGIN;
         0.681  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.111  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.117  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.107  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.091  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.399  END;
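Between the two runs, full_page_writes was flipped; it only requires a configuration reload, not a restart, so something like the following suffices (a manual CHECKPOINT afterwards gives a cleaner starting point):

alter system set full_page_writes = off;
select pg_reload_conf();
show full_page_writes;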
TPC-B read/write performance with full page writes OFF:

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 5000
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 1569385
latency average = 1.223 ms
latency stddev = 0.970 ms
initial connection time = 9.154 ms
tps = 13070.045981 (without initial connection time)
statement latencies in milliseconds:
         0.001  \set aid random(1, 100000 * :scale)
         0.000  \set bid random(1, 1 * :scale)
         0.000  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.075  BEGIN;
         0.402  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.113  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.114  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.111  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.096  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.310  END;

With FPW off, tps rises from 10072 to 13070 — roughly a 30% improvement.
Background

1. failover: happens automatically.

2. switchover: the change-leader command, run on the leader:

alter system dma change leader to '$HOST:$PGPORT';

3. Common kernel commands

3.1. Check the database role; a paxos_role of 2 means leader, 0 follower, 3 learner:

select current_leader, paxos_role from polar_dma_member_status;

3.2. Check cluster membership, on the leader:

select * from polar_dma_cluster_status;

3.3. Check follower lag

3.3.1. On the leader, check sync and replay lag:

select client_addr, application_name, write_lag, flush_lag, replay_lag from pg_stat_replication;

3.3.2. On a follower, check replay lag:

select pg_last_wal_replay_lsn(), pg_last_wal_receive_lsn(), pg_last_xact_replay_timestamp();

3.4. Leader-change operations

3.4.1. Set the election timeout (requires an instance restart):

polar_dma_election_timeout

3.4.2. Delayed-election options for followers (changeable directly via alter system):

polar_dma_delay_election = on: a follower does not initiate an election within polar_dma_delay_election_timeout.
polar_dma_delay_election_timeout: the leader-change delay.

3.4.3. Switchover command, run on the leader:

alter system dma change leader to '$HOST:$PGPORT';

3.4.4. Force an election, run on a follower:

alter system dma FORCE CHANGE LEADER;

3.4.5. Change a node's weight, on the leader. Larger means higher priority; the range is 0-9, and 0 turns the node into a learner with no vote:

alter system dma CHANGE NODE '$HOST:$PGPORT' WEIGHT TO 9;

3.5. Manual log management

alter system dma purge logs;
alter system dma purge logs to xxx;
alter system dma force purge logs to xxx;

3.6. Drop a node, on the leader (or via the cm):

alter system dma drop follower '$HOST:$PGPORT';

3.7. Add a node — this is more involved.

3.7.1. Configure the OS environment.

3.7.2. Deploy the polar binaries.

3.7.3. Create the standby (basebackup). Copy the data with polar_basebackup, or copy the entire data directory, to build the follower node:

polar_basebackup -h <master_ip> -p <master_port> -U replicator -D $PGDATA --polardata=$POLARDATA -X stream --progress --write-recovery-conf -v

3.7.4. Initialize the metadata. Configure the node as a learner first; it joins the cluster later:

polar-postgres -D $PGDATA/ -c polar_dma_init_meta=ON -c polar_dma_learners_info="$HOST:$PGPORT"

3.7.5. Configure dma. Edit the $PGDATA/polar_dma.conf config file, changing the polar_dma_repl_appname parameter:

polar_dma_repl_appname = 'standby_$HOST_$PGPORT'   # $HOST expressed as an int32

If the ${POLARDATA} path differs from the leader's, also change it in $PGDATA/postgresql.conf:

polar_datadir='file-dio://${POLARDATA}'

3.7.6. Start the node, the same way as a standalone instance. After the first successful start, join the node to the DMA cluster; afterwards it can simply be started:

pg_ctl -D $PGDATA/ start

3.7.7. Join the cluster. When adding a node, run the cm command on the master:

alter system dma add follower '$HOST:$PGPORT';
or
alter system dma add learner '$HOST:$PGPORT';
Background

An example deployment of the open-source three-node PolarDB for PostgreSQL across 3 hosts.

https://github.com/alibaba/PolarDB-for-PostgreSQL

Environment

3 ECS hosts, 8c 64g, 2T SSD.
Internal IPs: 172.17.164.62, 172.17.164.63, 172.17.164.64

For the OS environment setup, refer to: 《PolarDB 为什么要解决FPW的性能问题?》 Continue below once that is done.

Environment dependencies

1. Operating system:

cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

2. Kernel:

uname -a
Linux iZbp18r4s9zxcmpkulkmkyZ 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

3. GCC version:

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

Deployment steps

1. Install the dependency packages:

yum install -y bison flex libzstd-devel libzstd zstd cmake openssl-devel protobuf-devel readline-devel libxml2-devel libxslt-devel zlib-devel bzip2-devel lz4-devel snappy-devel python-devel unzip

2. Add the OS user that will run the PolarDB cluster:

useradd digoal

3. Set the user's password:

passwd digoal

4. Download the PolarDB for PostgreSQL source:

su - digoal
wget https://github.com/alibaba/PolarDB-for-PostgreSQL/archive/refs/heads/master.zip
unzip master.zip

5. Set up passwordless ssh between the hosts for the PolarDB OS user; mutual trust simplifies cluster management (Greenplum uses the same approach).

All nodes: generate an ssh key

su - digoal
ssh-keygen
chmod 700 ~/.ssh
chmod 400 ~/.ssh/id_rsa*

All nodes: configure mutual authentication

su - digoal
ssh-copy-id -f digoal@172.17.164.62
ssh-copy-id -f digoal@172.17.164.63
ssh-copy-id -f digoal@172.17.164.64

Enter the digoal login password of each target host to complete the trust setup.

All nodes: verify no password is needed; a returned date means success

su - digoal
ssh 'digoal@172.17.164.62' date
ssh 'digoal@172.17.164.63' date
ssh 'digoal@172.17.164.64' date

6. Configure environment variables, on all nodes:

su - digoal
vi ~/.bashrc
export POLARDBHOME="$HOME/polardb"
export PATH="$POLARDBHOME/bin:$PATH"
export LD_LIBRARY_PATH="$POLARDBHOME/lib:$LD_LIBRARY_PATH"
export PGUSER=digoal
export PGDATABASE=postgres
export PGHOST=/tmp
export PGPORT=10001

Apply them:

su - digoal
. ~/.bashrc

7. Build and install the PolarDB for PostgreSQL binaries, on all nodes:

su - digoal
cd ~/PolarDB-for-PostgreSQL-master

Set the installation directory:

export PG_INSTALL=$HOME/polardb

For further deployment details, read the build.sh script itself.

Build and install the binaries:

sh build.sh debug   ## development environment
or
sh build.sh deploy  ## production environment

8. Configure the 3-host PolarDB cluster.

Create the directory for the configuration file, on all nodes:

su - digoal
mkdir $POLARDBHOME/etc

Create the directory that holds the PolarDB cluster's data files:

su - root
mkdir -p /data01/polardb/data
chown -R digoal:digoal /data01/polardb
chmod 700 /data01/polardb

Generate the deployment configuration template (master host only, 172.17.164.62):

touch $POLARDBHOME/etc/polardb_paxos.conf
pgxc_ctl -v -c $POLARDBHOME/etc/polardb_paxos.conf prepare standalone

Edit the configuration to match our three hosts:

vi $POLARDBHOME/etc/polardb_paxos.conf

#!/usr/bin/env bash
#
# polardb Configuration file for pgxc_ctl utility.
#
# Configuration file can be specified as -c option from pgxc_ctl command. Default is
# $PGXC_CTL_HOME/pgxc_ctl.org.
#
# This is bash script so you can make any addition for your convenience to configure
# your polardb.
#========================================================================================
#
#
# pgxcInstallDir variable is needed if you invoke "deploy" command from pgxc_ctl utility.
# If don't you don't need this variable.
# changed:
pgxcInstallDir=$HOME/polardb
#---- OVERALL -----------------------------------------------------------------------------
#
# suggest using the same name for the DB superuser and the OS user
pgxcOwner=digoal        # owner of the Postgres-XC databaseo cluster. Here, we use this
                        # both as linus user and database user. This must be
                        # the super user of each coordinator and datanode.
pgxcUser=digoal         # OS user of Postgres-XC owner

tmpDir=/tmp             # temporary dir used in XC servers
localTmpDir=$tmpDir     # temporary dir used here locally

configBackup=n          # If you want config file backup, specify y to this value.
configBackupHost=pgxc-linker  # host to backup config file
configBackupDir=$HOME/pgxc    # Backup directory
configBackupFile=pgxc_ctl.bak # Backup file name --> Need to synchronize when original changed.

# changed:
standAlone=n

# changed:
dataDirRoot=/data01/polardb/data

#---- Datanodes -------------------------------------------------------------------------------------------------------

#---- Shortcuts --------------
datanodeMasterDir=$dataDirRoot/dn_master
datanodeSlaveDir=$dataDirRoot/dn_slave
datanodeLearnerDir=$dataDirRoot/dn_learner
datanodeArchLogDir=$dataDirRoot/datanode_archlog

#---- Overall ---------------
primaryDatanode=datanode_1              # Primary Node.
datanodeNames=(datanode_1)
datanodePorts=(10001)                   # Master and slave use the same port!
#datanodePoolerPorts=(10011)            # Master and slave use the same port!
#datanodePgHbaEntries=(::1/128)         # Assumes that all the coordinator (master/slave) accepts
                                        # the same connection
                                        # This list sets up pg_hba.conf for $pgxcOwner user.
                                        # If you'd like to setup other entries, supply them
                                        # through extra configuration files specified below.
datanodePgHbaEntries=(172.17.164.62/32 172.17.164.63/32 172.17.164.64/32)  # Same as above but for IPv4 connections

#---- Master ----------------
datanodeMasterServers=(172.17.164.62)   # none means this master is not available.
                                        # This means that there should be the master but is down.
                                        # The cluster is not operational until the master is
                                        # recovered and ready to run.
datanodeMasterDirs=($datanodeMasterDir)
datanodeMaxWalSender=16                 # max_wal_senders: needed to configure slave. If zero value is
                                        # specified, it is expected this parameter is explicitly supplied
                                        # by external configuration files.
                                        # If you don't configure slaves, leave this value zero.
datanodeMaxWALSenders=($datanodeMaxWalSender)  # max_wal_senders configuration for each datanode

#---- Slave -----------------
datanodeSlave=y                         # Specify y if you configure at least one coordiantor slave. Otherwise, the following
                                        # configuration parameters will be set to empty values.
                                        # If no effective server names are found (that is, every servers are specified as none),
                                        # then datanodeSlave value will be set to n and all the following values will be set to
                                        # empty values.
datanodeSlaveServers=(172.17.164.63)    # value none means this slave is not available
datanodeSlavePorts=(10001)              # Master and slave use the same port!
#datanodeSlavePoolerPorts=(10011)       # Master and slave use the same port!
datanodeSlaveSync=y                     # If datanode slave is connected in synchronized mode
datanodeSlaveDirs=($datanodeSlaveDir)
datanodeArchLogDirs=($datanodeArchLogDir)
datanodeRepNum=2                        # no HA setting 0, streaming HA and active-active logcial replication setting 1 replication, paxos HA setting 2 replication.
datanodeSlaveType=(3)                   # 1 is streaming HA, 2 is active-active logcial replication, 3 paxos HA.

#---- Learner -----------------
datanodeLearnerServers=(172.17.164.64)  # value none means this learner is not available
datanodeLearnerPorts=(10001)            # learner port!
#datanodeSlavePoolerPorts=(10011)       # learner pooler port!
datanodeLearnerSync=y                   # If datanode learner is connected in synchronized mode
datanodeLearnerDirs=($datanodeLearnerDir)

# ---- Configuration files ---
# You may supply your bash script to setup extra config lines and extra pg_hba.conf entries here.
# These files will go to corresponding files for the master.
# Or you may supply these files manually.
datanodeExtraConfig=datanodeExtraConfig
cat > $datanodeExtraConfig <<EOF
#================================================
# Added to all the datanode postgresql.conf
# Original: $datanodeExtraConfig
log_destination = 'csvlog'
unix_socket_directories = '., /tmp'
logging_collector = on
log_directory = 'log'
listen_addresses = '0.0.0.0'
max_connections = 1000
hot_standby = on
synchronous_commit = on
max_worker_processes = 30
cron.database_name = 'postgres'
tcp_keepalives_idle = 30
tcp_keepalives_interval = 10
tcp_keepalives_count = 6
shared_buffers = 16GB
maintenance_work_mem = 1GB
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 5.0
effective_io_concurrency = 0
parallel_leader_participation = off
max_wal_size = 48GB
min_wal_size = 8GB
wal_keep_segments = 4096
wal_sender_timeout = 5s
random_page_cost = 1.1
effective_cache_size = 32GB
log_truncate_on_rotation = on
log_min_duration_statement = 3s
log_checkpoints = on
log_lock_waits = on
log_statement = 'ddl'
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 800000000
autovacuum_multixact_freeze_max_age = 900000000
autovacuum_vacuum_cost_delay = 0ms
vacuum_freeze_min_age = 700000000
vacuum_freeze_table_age = 850000000
vacuum_multixact_freeze_min_age = 700000000
vacuum_multixact_freeze_table_age = 850000000
statement_timeout = 0                        # in milliseconds, 0 is disabled
lock_timeout = 0                             # in milliseconds, 0 is disabled
idle_in_transaction_session_timeout = 0      # in milliseconds, 0 is disabled
shared_preload_libraries = 'pg_cron'
max_parallel_replay_workers = 0
EOF

# Additional Configuration file for specific datanode master.
# You can define each setting by similar means as above.
datanodeSpecificExtraConfig=(none)
datanodeSpecificExtraPgHba=(none)

9. Initialize the three-node cluster, on the master node:

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf clean all
pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf init all

psql
postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 18745
usesysid         | 10
usename          | digoal
application_name | walreceiver
client_addr      | 172.17.164.63
client_hostname  |
client_port      | 53338
backend_start    | 2021-08-16 16:10:59.414899+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/4000120
write_lsn        | 0/4000120
flush_lsn        | 0/4000120
replay_lsn       | 0/4000120
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+------------------------------
pid              | 19166
usesysid         | 10
usename          | digoal
application_name | walreceiver
client_addr      | 172.17.164.64
client_hostname  |
client_port      | 50968
backend_start    | 2021-08-16 16:11:09.975107+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/4000120
write_lsn        | 0/4000120
flush_lsn        | 0/4000120
replay_lsn       | 0/4000120
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async

10. Common management commands

Check the status of the three nodes:

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf monitor all
/bin/bash
Installing pgxc_ctl_bash script as /home/digoal/pgxc_ctl/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /home/digoal/pgxc_ctl/pgxc_ctl_bash.
Reading configuration using /home/digoal/pgxc_ctl/pgxc_ctl_bash --home /home/digoal/pgxc_ctl --configuration /home/digoal/polardb/etc/polardb_paxos.conf
Finished reading configuration.
   ******** PGXC_CTL START ***************
Current directory: /home/digoal/pgxc_ctl
Running: datanode master datanode_1
Running: datanode slave datanode_1
Running: datanode learner datanode_1

Show the three-node configuration:

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf show configuration all

Start the cluster or a node:

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf start all

Stop the cluster or a node:

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf stop all

Failover a datanode (datanode_1 is the node name configured in polardb_paxos.conf):

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf failover datanode datanode_1

Cluster health check (check cluster status and start failed nodes):

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf healthcheck all

Examples of other commands:

pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf kill all
pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf log var datanodeNames
pgxc_ctl -c $POLARDBHOME/etc/polardb_paxos.conf show configuration all
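As a quick smoke test of the new cluster, write on the leader and read on a follower — a sketch, assuming the IPs and port above and that the follower accepts read-only connections:

psql -h 172.17.164.62 -p 10001 -c "create table smoke (id int); insert into smoke values (1);"
psql -h 172.17.164.63 -p 10001 -c "select * from smoke;"   -- should return the row once replay catches up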
Background

A late-night talk on PostgreSQL vacuum parameter tuning: maintenance_work_mem and autovacuum_work_mem.

http://www.postgres.cn/v2/news/viewone/1/398
https://rhaas.blogspot.com/2019/01/how-much-maintenanceworkmem-do-i-need.html

Before version 9.4, the memory parameter relevant to vacuum was maintenance_work_mem; from 9.4 on it is autovacuum_work_mem, which falls back to the maintenance_work_mem setting when unset.

What is this memory used for? It records the TIDs of dead tuples. While scanning a table, once the dead-tuple TIDs it has collected fill this memory (autovacuum_work_mem or maintenance_work_mem), the vacuum process pauses the table scan and starts scanning the indexes. During the index scans, the index entries to clean are exactly those whose heap TIDs match the TIDs recorded in memory. After every index has been scanned and cleaned once, the table scan resumes from the saved position.

The process:

1. palloc autovacuum_work_mem memory
2. scan table
3. write dead tuples' TIDs into autovacuum_work_mem
4. when autovacuum_work_mem is full (of vacuumable dead tuples)
5. record the table-scan offset
6. scan indexes
7. vacuum the indexes' dead entries (those whose heap TIDs are in autovacuum_work_mem)
8. finish the index scans
9. resume the table scan from the saved offset
...

Clearly, if autovacuum_work_mem is too small during garbage collection, the indexes are scanned multiple times, wasting resources and time.

About "palloc autovacuum_work_mem memory": this memory is allocated as needed; PG does not simply grab everything maintenance_work_mem or autovacuum_work_mem allows. The code optimizes the limit: for a small table, only a little memory is requested — enough to record the TIDs of the whole table (when the configured setting exceeds that amount). See the code below; I have annotated it:

/*
 * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
 * fit on one heap page.  (Note that indexes could have more, because they
 * use a smaller tuple header.)  We arrive at the divisor because each tuple
 * must be maxaligned, and it must have an associated item pointer.
 *
 * Note: with HOT, there could theoretically be more line pointers (not actual
 * tuples) than this on a heap page.  However we constrain the number of line
 * pointers to this anyway, to avoid excessive line-pointer bloat and not
 * require increases in the size of work arrays.
 */
#define MaxHeapTuplesPerPage    \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                        (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

/*
 * Guesstimation of number of dead tuples per page.  This is used to
 * provide an upper limit to memory allocated when vacuuming small
 * tables.
 */
#define LAZY_ALLOC_TUPLES               MaxHeapTuplesPerPage

/*
 * lazy_space_alloc - space allocation decisions for lazy vacuum
 *
 * See the comments at the head of this file for rationale.
 */
static void
lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
{
        long            maxtuples;
        int                     vac_work_mem = IsAutoVacuumWorkerProcess() &&
        autovacuum_work_mem != -1 ?
        autovacuum_work_mem : maintenance_work_mem;

        if (vacrelstats->hasindex)
        {
                maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
                maxtuples = Min(maxtuples, INT_MAX);
                maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));

                /* curious coding here to ensure the multiplication can't overflow */
                /* annotation: this ensures maintenance_work_mem / autovacuum_work_mem is not
                   simply used up; for a small table only a little memory is palloc'd */
                if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
                        maxtuples = relblocks * LAZY_ALLOC_TUPLES;

                /* stay sane if small maintenance_work_mem */
                maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);
        }
        else
        {
                maxtuples = MaxHeapTuplesPerPage;
        }

        vacrelstats->num_dead_tuples = 0;
        vacrelstats->max_dead_tuples = (int) maxtuples;
        vacrelstats->dead_tuples = (ItemPointer)
                palloc(maxtuples * sizeof(ItemPointerData));
}

maintenance_work_mem has one more use: it caps the memory used when building an index. To build a B-tree index the input data must be sorted; if the data to be sorted does not fit in the memory set by maintenance_work_mem, it spills to disk.

Example: how to compute a suitable memory size

postgres=# show autovacuum_work_mem ;
 autovacuum_work_mem
---------------------
 1GB
(1 row)

postgres=# show maintenance_work_mem ;
 maintenance_work_mem
----------------------
 1GB
(1 row)

That is, at most 1GB of memory is available to record, in one vacuum pass, the TIDs of dead tuples. A TID is 6 bytes long:

/*
 * ItemPointer:
 *
 * This is a pointer to an item within a disk page of a known file
 * (for example, a cross-link from an index to its parent table).
 * blkid tells us which block, posid tells us which entry in the linp
 * (ItemIdData) array we want.
 *
 * Note: because there is an item pointer in each tuple header and index
 * tuple header on disk, it's very important not to waste space with
 * structure padding bytes.  The struct is designed to be six bytes long
 * (it contains three int16 fields) but a few compilers will pad it to
 * eight bytes unless coerced.  We apply appropriate persuasion where
 * possible.  If your compiler can't be made to play along, you'll waste
 * lots of space.
 */
typedef struct ItemPointerData
{
        BlockIdData ip_blkid;
        OffsetNumber ip_posid;
}

So 1GB can store the TIDs of about 179 million dead tuples:

postgres=# select 1024*1024*1024/6;
 ?column?
-----------
 178956970
(1 row)

And under what condition does autovacuum trigger?

src/backend/postmaster/autovacuum.c

 * A table needs to be vacuumed if the number of dead tuples exceeds a
 * threshold.  This threshold is calculated as
 *
 * threshold = vac_base_thresh + vac_scale_factor * reltuples

vac_base_thresh: autovacuum_vacuum_threshold
vac_scale_factor: autovacuum_vacuum_scale_factor

postgres=# show autovacuum_vacuum_threshold ;
 autovacuum_vacuum_threshold
-----------------------------
 50
(1 row)

postgres=# show autovacuum_vacuum_scale_factor ;
 autovacuum_vacuum_scale_factor
--------------------------------
 0.2
(1 row)

With these settings, garbage collection triggers when dead tuples reach 50 + 0.2 × table size — roughly, when garbage is about 20% of the table. So how big a table's garbage fits in 1GB? A table of about 890 million rows:

postgres=# select 1024*1024*1024/6/0.2;
 ?column?
--------------------
 894784850
(1 row)
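To see how close your own tables are to that trigger point, the statistics views can be compared against the two settings — a sketch (it ignores per-table reloption overrides):

select relname,
       n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * n_live_tup as trigger_threshold
from pg_stat_user_tables
order by n_dead_tup desc
limit 10;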
Stress-test example

postgres=# show log_autovacuum_min_duration ;
 log_autovacuum_min_duration
-----------------------------
 0
(1 row)

create table test(id int primary key, c1 int, c2 int, c3 int);
create index idx_test_1 on test (c1);
create index idx_test_2 on test (c2);
create index idx_test_3 on test (c3);

vi test.sql
\set id random(1,10000000)
insert into test values (:id,random()*100, random()*100,random()*100) on conflict (id) do update set c1=excluded.c1, c2=excluded.c2,c3=excluded.c3;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 1200

The autovacuum log:

2019-02-26 22:51:50.323 CST,,,35632,,5c755284.8b30,1,,2019-02-26 22:51:48 CST,36/22,0,LOG,00000,"automatic vacuum of table ""postgres.public.test"": index scans: 1
pages: 0 removed, 6312 remain, 2 skipped due to pins, 0 skipped frozen
tuples: 4631 removed, 1158251 remain, 1523 are dead but not yet removable, oldest xmin: 1262982800
buffer usage: 39523 hits, 1 misses, 1 dirtied
avg read rate: 0.004 MB/s, avg write rate: 0.004 MB/s
system usage: CPU: user: 1.66 s, system: 0.10 s, elapsed: 1.86 s",,,,,,,,"lazy_vacuum_rel, vacuumlazy.c:407",""
2019-02-26 22:51:50.566 CST,,,35632,,5c755284.8b30,2,,2019-02-26 22:51:48 CST,36/23,1263417553,LOG,00000,"automatic analyze of table ""postgres.public.test"" system usage: CPU: user: 0.16 s, system: 0.04 s, elapsed: 0.24 s",,,,,,,,"do_analyze_rel, analyze.c:722",""

"index scans: 1" means the vacuumed table has indexes and each index was scanned only once: autovacuum_work_mem was large enough, and vacuum never ran out of room for the dead-tuple TIDs.

Summary

Recommendations:

1. log_autovacuum_min_duration = 0: record the statistics of every autovacuum.
2. autovacuum_vacuum_scale_factor = 0.01: trigger garbage collection at 1% garbage.
3. autovacuum_work_mem: size it case by case, ensuring garbage collection never needs multiple INDEX SCANs.
4. If the vacuum statistics show "index scans:" greater than 1, then either:
   4.1. increase autovacuum_work_mem — by how much? multiply the current value by the reported index scans; or
   4.2. lower autovacuum_vacuum_scale_factor to the current value divided by the index scans, so autovacuum collects garbage as early as possible.
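For instance, if the log reported "index scans: 2" with the 1GB setting above, recommendation 4.1 works out to 2GB. autovacuum_work_mem only needs a configuration reload:

alter system set autovacuum_work_mem = '2GB';
select pg_reload_conf();
show autovacuum_work_mem;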
--------------------
 894784850
(1 row)

压力测试例子

postgres=# show log_autovacuum_min_duration ;
 log_autovacuum_min_duration
-----------------------------
 0
(1 row)

create table test(id int primary key, c1 int, c2 int, c3 int);
create index idx_test_1 on test (c1);
create index idx_test_2 on test (c2);
create index idx_test_3 on test (c3);

vi test.sql
\set id random(1,10000000)
insert into test values (:id,random()*100, random()*100,random()*100) on conflict (id) do update set c1=excluded.c1, c2=excluded.c2,c3=excluded.c3;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 1200

垃圾回收记录

2019-02-26 22:51:50.323 CST,,,35632,,5c755284.8b30,1,,2019-02-26 22:51:48 CST,36/22,0,LOG,00000,"automatic vacuum of table ""postgres.public.test"": index scans: 1
pages: 0 removed, 6312 remain, 2 skipped due to pins, 0 skipped frozen
tuples: 4631 removed, 1158251 remain, 1523 are dead but not yet removable, oldest xmin: 1262982800
buffer usage: 39523 hits, 1 misses, 1 dirtied
avg read rate: 0.004 MB/s, avg write rate: 0.004 MB/s
system usage: CPU: user: 1.66 s, system: 0.10 s, elapsed: 1.86 s",,,,,,,,"lazy_vacuum_rel, vacuumlazy.c:407",""
2019-02-26 22:51:50.566 CST,,,35632,,5c755284.8b30,2,,2019-02-26 22:51:48 CST,36/23,1263417553,LOG,00000,"automatic analyze of table ""postgres.public.test""
system usage: CPU: user: 0.16 s, system: 0.04 s, elapsed: 0.24 s",,,,,,,,"do_analyze_rel, analyze.c:722",""

index scans: 1 表示垃圾回收的表有索引,并且索引只扫描了一次。说明autovacuum_work_mem足够大,没有出现vacuum时装不下垃圾dead tuple tupleid的情况。

小结

建议:
1、log_autovacuum_min_duration=0,表示记录所有autovacuum的统计信息。
2、autovacuum_vacuum_scale_factor=0.01,表示垃圾达到1%时,触发自动垃圾回收。
3、autovacuum_work_mem,视情况定,确保不出现垃圾回收时多次INDEX SCAN。
4、如果发现垃圾回收统计信息中出现了index scans: 超过1的情况,说明:
4.1、需要增加autovacuum_work_mem,增加到当前autovacuum_work_mem乘以index scans即可。
4.2、或者调低autovacuum_vacuum_scale_factor到当前值除以index scans,让autovacuum尽可能早地进行垃圾回收。

参考
http://www.postgres.cn/v2/news/viewone/1/398
https://rhaas.blogspot.com/2019/01/how-much-maintenanceworkmem-do-i-need.html
《PostgreSQL 11 参数模板 - 珍藏级》
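补充一个估算示例: 结合上文的触发公式(threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples)与tupleid的6字节长度, 可以用一条SQL粗略估算某张表触发autovacuum时, 记录全部dead tuple tupleid所需的内存(以下按默认值50、0.2代入, 仅为示意, 请替换为实际参数与表名):

-- 估算触发autovacuum时记录全部垃圾tupleid所需内存
select relname,
       (50 + reltuples * 0.2)::bigint as est_dead_tuples,
       pg_size_pretty(((50 + reltuples * 0.2) * 6)::bigint) as est_tupleid_mem
from pg_class
where relkind = 'r' and relname = 'test';

若估算值大于当前autovacuum_work_mem, 则该表一次vacuum可能触发多次INDEX SCAN。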
背景 一个这样的问题: 为什么select x from tbl offset x limit x; 两次查询连续的OFFSET,会有重复数据呢? select ctid,* from tbl where ... offset 0 limit 10; select ctid,* from tbl where ... offset 10 limit 10; 为什么多数时候offset会推荐用order by? 不使用ORDER BY的话,返回顺序到底和什么有关? 答案是: 数据库的扫描方法。 数据库扫描方法,具体的原理可以到如下文档中找到PDF,PDF内有详细的扫描方法图文介绍。 《阿里云 PostgreSQL 产品生态;案例、开发管理实践、原理、学习资料、视频;PG天天象上沙龙记录 - 珍藏级》 扫描方法 1、全表扫描, seqscan 从第一个数据块开始扫描,返回复合条件的记录。 2、并发全表扫描, concurrently seqscan 如果有多个会话,对同一张表进行全表扫描时,后发起的会话会与前面正在扫描的会话进行BLOCK对齐步调,也就是说,后面发起的会话,可能是从表的中间开始扫的,扫描到末尾再转回去,避免多会话同时对一个表全表扫描时的IO浪费。 例如会话1已经扫到了第99个数据块,会话2刚发起这个表的全表扫描,则会从第99个数据块开始扫描,扫完在到第一个数据块扫,一直扫到第98个数据块。 3、索引扫描, index scan 按索引顺序扫描,并回表。 4、索引ONLY扫描, index only scan 按索引顺序扫描,根据VM文件的BIT位判断是否需要回表扫描。 5、位图扫描, bitmap scan 按索引取得的BLOCKID排序,然后根据BLOCKID顺序回表扫描,然后再根据条件过滤掉不符合条件的记录。 这种扫描方法,主要解决了离散数据(索引字段的逻辑顺序与记录的实际存储顺序非常离散的情况),需要大量离散回表扫描的情况。 6、并行扫描, parallel xx scan 并行的全表、索引、索引ONLY、位图扫。首先会FORK出若干个WORKER,每个WORKER负责一部分数据块,一起扫描,WORKER的结果(FILTER后的)发给下一个GATER WORKER节点。 7、hash join 哈希JOIN, 8、nest loop join 嵌套循环 9、merge join 合并JOIN(排序JOIN)。 更多扫描方法,请参考PG代码。 扫描方法决定了数据返回顺序 根据上面的这些扫描方法,我们可以知道一条QUERY下去,数据的返回顺序是怎么样的。 select * from tbl where xxx offset 10 limit 100; 1、如果是全表扫描,那么返回顺序就是数据的物理存放顺序,然后偏移10条有效记录,取下100条有效记录。 2、如果是索引扫描,则是依据索引的顺序进行扫描,然后偏移10条有效记录,取下100条有效记录。 不再赘述。 保证绝对的连续 如何保证第一次请求,第二次请求,第三次请求,。。。每一次偏移(offset)固定值,返回的结果是完全有序,无空洞的。 1、使用rr隔离级别(repeatable read),并且按PK(唯一值字段、字段组合)排序,OFFSET 使用rr级别,保证一个事务中的每次发起的SQL读请求是绝对视角一致的。 使用唯一字段或字段组合排序,可以保证每次的结果排序是绝对一致的。加速每次偏移的数据一样,所以可以保证数据返回是绝对连续的。 select * from tbl where xx order by a,b offset x limit xx; 2、使用游标 使用游标,可以保证视角一致,数据绝对一致。 postgres=# \h declare Command: DECLARE Description: define a cursor Syntax: DECLARE name [ BINARY ] [ INSENSITIVE ] [ [ NO ] SCROLL ] CURSOR [ { WITH | WITHOUT } HOLD ] FOR query begin; declare a cursor for select * from tbl where xx; fetch x from a; ... 每一次请求,游标向前移动 end; 参考 《PostgreSQL 数据离散性 与 索引扫描性能(btree & bitmap index scan)》 《PostgreSQL 11 preview - 分页内核层优化 - 索引扫描offset优化(使用vm文件skip heap scan)》 《PostgreSQL 范围过滤 + 其他字段排序OFFSET LIMIT(多字段区间过滤)的优化与加速》 《PostgreSQL Oracle 兼容性之 - TZ_OFFSET》 《PostgreSQL 索引扫描offset内核优化 - case》 《PostgreSQL 数据访问 offset 的质变 case》 《论count与offset使用不当的罪名 和 分页的优化》 《PostgreSQL offset 原理,及使用注意事项》 《妙用explain Plan Rows快速估算行 - 分页数估算》 《分页优化 - order by limit x offset y performance tuning》 《分页优化, add max_tag column speedup Query in max match enviroment》 《PostgreSQL's Cursor USAGE with SQL MODE - 分页优化》 PostgreSQL 许愿链接 您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森. 9.9元购买3个月阿里云RDS PostgreSQL实例 PostgreSQL 解决方案集合
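对应上面"保证绝对的连续"的第1点, 一个可直接套用的示意(tbl、pk为假设的表名与唯一字段):

begin isolation level repeatable read;
-- 同一事务内的多次查询共享同一快照视角, 按唯一字段排序保证结果顺序稳定
select * from tbl order by pk offset 0 limit 10;
select * from tbl order by pk offset 10 limit 10;  -- 与上一次结果无重复、无空洞
commit;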
背景 如同其他数据库一样,使用时需要注意一些问题,那么如何使用PG,可以保证长期稳定。 部署形态设计实践 根据对可靠性、可恢复性、可用性等等的不同要求,选择部署形态: 1、分布式部署(例如pg+citus插件) 容量上限:100节点以上,PB级。 计算能力上限:100节点以上,6400核以上。 读写带宽上限:100节点以上,200GB/s以上。 RPO:如果每个计算节点都采用多副本存储,RPO=0。 RTO:如果每个计算节点都采用HA,RTO可以做到1分钟内。 使用限制:有一些SQL限制。 适应场景:应用代码可控程度高的情况下,适合TP和AP业务。 2、单节点本地存储 容量上限:10TB级。 计算能力上限:64核级。 读写带宽上限:2GB/s级。 RPO:RPO无保障。 RTO:RTO无保障。 使用限制:SQL无限制。 适应场景:测试环境,非生产环境,对数据库RPO,RTO都没有要求的环境。 3、单节点多副本存储 容量上限:32TB级。 计算能力上限:64核级。 读写带宽上限:2GB/s级。 RPO:单机房RPO=0,(如果存储支持跨机房多副本,可以做到多机房RPO=0)。 RTO:10分钟级。 使用限制:SQL无限制。 适应场景:非核心场景生产、测试。 4、双节点共享存储 容量上限:32TB级。 计算能力上限:64核级。 读写带宽上限:2GB/s级。 RPO:单机房RPO=0,(如果存储支持跨机房多副本,可以做到多机房RPO=0)。 RTO:1分钟级。 使用限制:SQL无限制。 适应场景:核心、非核心场景生产。 5、双节点主备异步复制 容量上限:32TB级(使用远程存储),10TB级(使用本机存储) 计算能力上限:64核级。 读写带宽上限:2GB/s级。 RPO:10GB网络,REDO延迟毫秒级、1MB以内。(支持跨机房部署)。心跳机制可确保RPO < 60秒 RTO:1分钟级。 使用限制:SQL无限制。 适应场景:非核心场景生产。 6、双节点主备半同步复制 容量上限:32TB级(使用远程存储),10TB级(使用本机存储) 计算能力上限:64核级。 读写带宽上限:2GB/s级。 RPO: 无节点或单一节点异常时,可保证RPO=0。 两个节点都异常时,RPO取决于备份延迟。采用基于PG流复制的持续REDO备份,可以做到RPO毫秒级。 RTO:1分钟级。 使用限制:SQL无限制。 适应场景:核心、非核心场景生产。 7、三节点及以上多副本全同步复制 容量上限:32TB级(使用远程存储),10TB级(使用本机存储) 计算能力上限:64核级。 读写带宽上限:2GB/s级。 RPO: 小于半数节点异常时,可保证RPO=0。 半数以上节点异常时,RPO取决于 1、10GB网络,REDO延迟毫秒级、1MB以内。2、备份延迟。采用基于PG流复制的持续REDO备份,可以做到RPO毫秒级。 RTO:1分钟级。 使用限制:SQL无限制。 适应场景:核心场景生产。 8、计算存储分离(存储多副本)(比如阿里云POLARDB PG) 容量上限:100TB级。 计算能力上限:16节点,1024核级。 读写带宽上限:32GB/s级。 RPO:单机房RPO=0,(如果存储支持跨机房多副本,可以做到多机房RPO=0)。 RTO:15秒级。 使用限制:SQL无限制。 适应场景:核心、非核心场景生产。 9、计算存储分离(存储多副本)+ 双机房半同步 容量上限:100TB级。 计算能力上限:16节点,1024核级。 读写带宽上限:32GB/s级。 RPO: 无节点或单一节点异常时,可保证RPO=0。 两个节点都异常时,RPO取决于备份延迟。采用基于PG流复制的持续REDO备份,可以做到RPO毫秒级。 RTO:15秒级。 使用限制:SQL无限制。 适应场景:核心、非核心场景生产。 10、计算存储分离(存储多副本)+ 多机房多副本全同步 容量上限:100TB级。 计算能力上限:16节点,1024核级。 读写带宽上限:32GB/s级。 RPO: 小于半数节点异常时,可保证RPO=0。 半数以上节点异常时,RPO取决于 1、10GB网络,REDO延迟毫秒级、1MB以内。2、备份延迟。采用基于PG流复制的持续REDO备份,可以做到RPO毫秒级。 RTO:15秒级。 使用限制:SQL无限制。 适应场景:核心场景生产。 11、只读节点 使用限制:SQL无限制。 适应场景:扩展读能力。 12、非核心功能 12.1、业务透明的读写分离 使用限制:SQL无限制。 适应场景:扩展读能力。 12.2、跨库交互 使用限制:SQL无限制。 适应场景:跨库DBLINK,跨库外部表,跨库物化视图。 12.3、单元化 使用限制:SQL无限制。 适应场景:多实例共享少量数据,多写。 使用实践(规约) - 避坑大法 1、连接数过多(2000以上),可能导致性能下降。 建议使用连接池(例如应用程序使用连接池,或者使用pgbouncer之类的连接池)。连接到数据库的连接在10倍CPU核数以内,达到最高的处理吞吐能力。 2、大吞吐高并发的短连接,性能下降。 建议使用长连接。 3、长连接,长期不释放重建。如果连接访问了大量元数据,可能导致内存占用过大。 建议设置空闲长连接释放机制。确保不会出现大量内存霸占的情况。 《PostgreSQL relcache在长连接应用中的内存霸占"坑"》 4、长事务,以及未结束的2PC事务。 最老事务开始后产生的垃圾版本,无法被垃圾回收进程回收。长事务可能导致垃圾膨胀。 5、业务死锁 6、检查点过短 检查点设置过短,导致FPW狂写,性能下降严重。 建议max wal size, min wal size设置为shared buffer 2倍以及一半。 7、大内存未使用huge page 大内存,未设置shared buffer为huge page,可能导致hash table巨大无比,浪费内存,OOM等连锁反应。 建议32G以上shared buffer,使用huge page。 8、不合理的索引 导致DML性能下降,SELECT性能下降。 建议删除,或修改索引定义。 9、不合理的SQL 《PostgreSQL 如何查找TOP SQL (例如IO消耗最高的SQL) (包含SQL优化内容) - 珍藏级》 10、pending list 未合并过大 使用GIN倒排索引,如果写入量特别大,可能导致PENDING LIST合并不及时,当有大量PENDING LIST数据时,查询性能下降急剧。 11、ctype使用错误,例如要查询中文模糊查询加速(pg_trgm),使用ctype=c会导致中文模糊查询无法使用索引。 《PostgreSQL 中英文混合分词特殊规则(中文单字、英文单词) - 中英分明》 12、数据存放不合理导致IO放大 例如空间查询为切片,组必要条件查询未分区。 《PostgreSQL 空间切割(st_split, ST_Subdivide)功能扩展 - 空间对象网格化 (多边形GiST优化)》 《PostgreSQL 空间st_contains,st_within空间包含搜索优化 - 降IO和降CPU(bound box) (多边形GiST优化)》 13、IO太弱,频繁更新产生垃圾,垃圾回收不及时,膨胀 建议使用SSD硬盘。 14、关闭自动垃圾回收,会导致垃圾无法自动回收,膨胀。 建议打开自动垃圾回收。 15、长时间锁等待 业务逻辑问题,长时间锁等待,可能引发雪崩,连接耗尽等问题。 16、长时间大锁等待,例如在业务系统中高峰期使用DDL语句,可能导致长时间大锁等待。引发雪崩。 建议对DDL操作前,加锁超时参数,避免雪崩。 17、分区过多,导致查询效率下降,连接内存占用过大。 建议合理的设置分区数,例如对于高并发频繁操作的表,建议64个以内分区。对于时间分区表,建议不需要查询的分区或者已经清理数据的分区,从分区中deatch出去,减少优化器压力。 18、DDOS 如果对外开放了连接监听,即使攻击者没有密码,也可以使用DDOS攻击来消耗数据库连接,即利用认证超时的时间窗口,大量建连接,等认证超时,实际上已占用SLOT。导致连接耗尽。 19、滥用超级用户权限账号。 建议业务使用普通权限账号。 20、事务号回卷 如果长事务一直存在并导致了FREEZE无法冻结,超过20亿事务后,数据库为了避免事务号回卷,会强制停库,需要进入单用户进行修复。 21、FREEZE风暴 
在9.6以前的版本,FREEZE会导致全表扫描,引发IO风暴,可以预测和防止。
《PostgreSQL Freeze 风暴预测续 - 珍藏级SQL》
《PostgreSQL freeze 风暴导致的IOPS飙升 - 事后追溯》
《PostgreSQL的"天气预报" - 如何预测Freeze IO风暴》
《PostgreSQL 大表自动 freeze 优化思路》
22、slot 堵塞
使用slot进行流复制(逻辑或物理)时,未被消费的日志会在数据库中保留(不会被清理),如果消费日志很慢,可能导致REDO占用空间巨大,甚至膨胀到占满磁盘。
有些SLOT建立后从不被消费,更加危险。
23、standby feedback
standby 开启feedback(hot_standby_feedback)后,standby上查询的快照信息(xmin)会反馈给主库,主库会延迟回收垃圾,减少STANDBY上的SQL与REDO APPLY回放的冲突。
但是如果垃圾产生较多,并且autovacuum nap time 唤醒很频繁,会导致CPU和IO的升高。
《PostgreSQL物理"备库"的哪些操作或配置,可能影响"主库"的性能、垃圾回收、IO波动》
24、delay vacuum
主库开启vacuum delay,并且垃圾产生较多、autovacuum nap time 唤醒很频繁时,会导致CPU和IO的升高,原因和23一样。
25、大表分区
《HTAP数据库 PostgreSQL 场景与性能测试之 45 - (OLTP) 数据量与性能的线性关系(10亿+无衰减), 暨单表多大需要分区》
内部原理
了解原理后,才知道为什么要这些最佳实践。
《阿里云 PostgreSQL 产品生态;案例、开发管理实践、原理、学习资料、视频;PG天天象上沙龙记录 - 珍藏级》
《PostgreSQL 2天培训大纲》
监控
《PostgreSQL Oracle 兼容性之 - performance insight - AWS performance insight 理念与实现解读 - 珍藏级》
《PostgreSQL 如何查找TOP SQL (例如IO消耗最高的SQL) (包含SQL优化内容) - 珍藏级》
《PostgreSQL AWR报告(for 阿里云ApsaraDB PgSQL)》
《PostgreSQL 实时健康监控 大屏 - 低频指标 - 珍藏级》
《PostgreSQL 实时健康监控 大屏 - 高频指标(服务器) - 珍藏级》
《PostgreSQL 实时健康监控 大屏 - 高频指标 - 珍藏级》
《PostgreSQL pgmetrics - 多版本、健康监控指标采集、报告》
日常维护
《PostgreSQL DBA 日常管理 SQL》
培训
体系化培训内容
《PostgreSQL 2天培训大纲》
规范
《PostgreSQL 数据库开发规范》
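补充第16条(业务高峰期DDL引发大锁等待)的一个示意用法, 通过锁超时参数避免雪崩(表名、超时值为假设):

begin;
set local lock_timeout = '1s';  -- 1秒内拿不到锁即报错回滚, 不会长时间堵塞后续请求
alter table test add column c9 int;
commit;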
背景 PostgreSQL 使用backtrace,让PG的user process支持self-debugging。 NAME backtrace, backtrace_symbols, backtrace_symbols_fd - support for application self-debugging SYNOPSIS #include <execinfo.h> int backtrace(void **buffer, int size); char **backtrace_symbols(void *const *buffer, int size); void backtrace_symbols_fd(void *const *buffer, int size, int fd); 支持: 1、打印错误SQL的调用栈内容。 2、了解LONG QUERY正在执行什么,慢在什么地方。 通过发送信号 (SIGINT)。 3、向日志中输出CORE的信息,打印调用栈信息,通过发送信号 (SIGSEGV or SIGBUS)。 This PostgreSQL extension can be used to get more information about PostgreSQL backend execution when no debugger is available. It allows to dump stack trace when error is reported or exception is caught. So there three used cases of using this extension: 1. Find out the source of error. pg_backtrace extension provides "pg_backtrace.level" GUC which selects error level for which backtrace information is attached. Default value is ERROR. So once this extension is initialized, all errors will include backtrace information which is dumped both in log file and delivered to the client: postgres=# select count(*)/0.0 from pg_class; ERROR: division by zero CONTEXT: postgres: knizhnik postgres [local] SELECT(numeric_div+0xbc) [0x7c5ebc] postgres: knizhnik postgres [local] SELECT() [0x5fe4e2] postgres: knizhnik postgres [local] SELECT() [0x610730] postgres: knizhnik postgres [local] SELECT() [0x6115ca] postgres: knizhnik postgres [local] SELECT(standard_ExecutorRun+0x15a) [0x60193a] postgres: knizhnik postgres [local] SELECT() [0x74168c] postgres: knizhnik postgres [local] SELECT(PortalRun+0x29e) [0x742a7e] postgres: knizhnik postgres [local] SELECT() [0x73e922] postgres: knizhnik postgres [local] SELECT(PostgresMain+0x1189) [0x73fde9] postgres: knizhnik postgres [local] SELECT() [0x47d5e0] postgres: knizhnik postgres [local] SELECT(PostmasterMain+0xd28) [0x6d0448] postgres: knizhnik postgres [local] SELECT(main+0x421) [0x47e511] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f6361a13830] postgres: knizhnik postgres [local] SELECT(_start+0x29) [0x47e589] 2. Determine current state of backend (assume that there is some long running query and you do not know where it spends most of time). It is possible to send SIGINT signal to backend and it print current stack in logfile: 2018-11-12 18:24:12.222 MSK [24457] LOG: Caught signal 2 2018-11-12 18:24:12.222 MSK [24457] CONTEXT: /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390) [0x7f63624e3390] /lib/x86_64-linux-gnu/libc.so.6(epoll_wait+0x13) [0x7f6361afa9f3] postgres: knizhnik postgres [local] SELECT(WaitEventSetWait+0xbe) [0x71e4de] postgres: knizhnik postgres [local] SELECT(WaitLatchOrSocket+0x8b) [0x71e93b] postgres: knizhnik postgres [local] SELECT(pg_sleep+0x98) [0x7babd8] postgres: knizhnik postgres [local] SELECT() [0x5fe4e2] postgres: knizhnik postgres [local] SELECT() [0x6266a8] postgres: knizhnik postgres [local] SELECT(standard_ExecutorRun+0x15a) [0x60193a] postgres: knizhnik postgres [local] SELECT() [0x74168c] postgres: knizhnik postgres [local] SELECT(PortalRun+0x29e) [0x742a7e] postgres: knizhnik postgres [local] SELECT() [0x73e922] postgres: knizhnik postgres [local] SELECT(PostgresMain+0x1189) [0x73fde9] postgres: knizhnik postgres [local] SELECT() [0x47d5e0] postgres: knizhnik postgres [local] SELECT(PostmasterMain+0xd28) [0x6d0448] postgres: knizhnik postgres [local] SELECT(main+0x421) [0x47e511] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f6361a13830] postgres: knizhnik postgres [local] SELECT(_start+0x29) [0x47e589] 3. 
Get stack trace for SIGSEGV or SIGBUS signals (if dumping cores is disabled for some reasons): 2018-11-12 18:25:52.636 MSK [24518] LOG: Caught signal 11 2018-11-12 18:25:52.636 MSK [24518] CONTEXT: /home/knizhnik/postgresql/dist/lib/pg_backtrace.so(+0xe37) [0x7f6358838e37] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390) [0x7f63624e3390] /home/knizhnik/postgresql/dist/lib/pg_backtrace.so(pg_backtrace_sigsegv+0) [0x7f6358838fb0] postgres: knizhnik postgres [local] SELECT() [0x5fe474] postgres: knizhnik postgres [local] SELECT() [0x6266a8] postgres: knizhnik postgres [local] SELECT(standard_ExecutorRun+0x15a) [0x60193a] postgres: knizhnik postgres [local] SELECT() [0x74168c] postgres: knizhnik postgres [local] SELECT(PortalRun+0x29e) [0x742a7e] postgres: knizhnik postgres [local] SELECT() [0x73e922] postgres: knizhnik postgres [local] SELECT(PostgresMain+0x1189) [0x73fde9] postgres: knizhnik postgres [local] SELECT() [0x47d5e0] postgres: knizhnik postgres [local] SELECT(PostmasterMain+0xd28) [0x6d0448] postgres: knizhnik postgres [local] SELECT(main+0x421) [0x47e511] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f6361a13830] postgres: knizhnik postgres [local] SELECT(_start+0x29) [0x47e589] As far as Postgres extension is loaded and initialized on first access to its functions, it is necessary to call pg_backtrace_init() function to be able to use this extension. This function function actually does nothing and _PG_init() registers signal handlers for SIGSEGV, SIGBUS and SIGINT and executor run hook which setups exception context. This extension is using backtrace function which is available at most Unixes. As it was mentioned in backtrace documentation: The symbol names may be unavailable without the use of special linker options. For systems using the GNU linker, it is necessary to use the -rdynamic linker option. Note that names of "static" functions are not exposed, and won't be available in the backtrace. Postgres is built without -rdynamic option. This is why not all function addresses in the stack trace above are resolved. It is possible to use GDB (at development host with correspondent postgres binaries) or Linux addr2line utility to get resolve function addresses: $ addr2line -e ~/postgresql/dist/bin/postgres -a 0x5fe4e2 0x00000000005fe4e2 execExprInterp.c:? 用法 Usage: create extension pg_backtrace; select pg_backtrace_init(); 参考 pg_backtrace https://github.com/postgrespro/pg_backtrace PostgreSQL 许愿链接 您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森. 9.9元购买3个月阿里云RDS PostgreSQL实例 PostgreSQL 解决方案集合
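结合上文, 一个完整的使用示意(pg_backtrace.level为上面引用的README中提到的GUC, 默认值为ERROR):

create extension pg_backtrace;
select pg_backtrace_init();
set pg_backtrace.level = 'ERROR';   -- 该级别及以上的报错会附带调用栈
select count(*)/0.0 from pg_class;  -- 人为制造除零错误, 客户端与日志中可看到backtrace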
背景 列存优势 1、列存没有行存1666列的限制 2、列存的大量记录数扫描比行存节约资源 3、列存压缩比高,节约空间 4、列存的大量数据计算可以使用向量化执行,效率高 行存优势 1、行存查询多列时快 2、行存DML效率高 简单来说,行存适合OLTP业务,列存适合OLAP业务。 如果业务是混合负载,既有高并发SQL,又有实时分析业务怎么办? Oracle的做法: in memory column store,实际上是两份存储,一份在磁盘(行存),一份在内存中使用列存。 根据SQL,优化器选择扫描列存还是行存。(通常看planNODE中数据扫描的行选择性,输出的行数,输出的列数等) Oracle in memory column store是两份存储的思路。 PostgreSQL如何应对混合业务场景呢? 当前PG已经有了SMP并行执行的优化器功能,丰富的聚合函数,窗口函数等,已经有很好的OLAP处理能力。如果能在数据存储组织形式上支持到位,势必会给OLAP的能力带来更大的质的飞跃,以更好的适合OLTP OLAP混合业务场景。 一些PG 混合存储的资料 1、PG roadmap https://www.postgresql.org/developer/roadmap/ https://wiki.postgresql.org/wiki/PostgreSQL11_Roadmap 里面有提到postgres pro, fujsut 都有计划要开发列存储或者读、写优化索引。 2、PostgreSQL 12 可能会开放storage pluggable API,以支持列存组织形式表。 https://commitfest.postgresql.org/22/1283/ 3、ROS, WOS 读优化和写优化存储,适合TP AP混合业务 https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com 4、citus开发的PG支持向量化执行的代码,在使用列存储时,AP查询的性能有巨大的提升。 https://github.com/citusdata/postgres_vectorization_test 5、《Extending PostgreSQL with Column Store Indexes》 6、cstore, citusdata(已被微软收购),开源的列存储FDW插件 https://www.citusdata.com/blog/2014/04/03/columnar-store-for-analytics/ 7、2ndquadrant 公司的PG列存开发计划 https://blog.2ndquadrant.com/column-store-plans/ 8、PG 列存储开发计划讨论wiki https://wiki.postgresql.org/wiki/ColumnOrientedSTorage 9、《Column-Stores vs. Row-Stores: How Different Are They Really? 10、custom scan provide接口,pg_strom插件使用csp接口实现了gpu加速,其中GPU加速支持数据加载到GPU缓存、或者文件中以列形式组织,加速AP请求的SQL。(这种为非实时维护的数据组织形式,而是读时组织的形式) http://heterodb.github.io/pg-strom/ 11、In-Memory Columnar Store extension for PostgreSQL,PG的内存列存表插件 https://github.com/knizhnik/imcs 12、vops,PG的瓦片式存储(不改变现有HEAP存储接口),以及向量化执行组合的插件。 https://github.com/postgrespro/vops/blob/master/vops.html 《PostgreSQL VOPS 向量计算 + DBLINK异步并行 - 单实例 10亿 聚合计算跑进2秒》 《PostgreSQL 向量化执行插件(瓦片式实现-vops) 10x提速OLAP》 PostgreSQL 列存, 混合存储, 列存索引, 向量化存储, 混合索引 - OLTP OLAP OLXP HTAP 混合负载优化 根据以上资料,可以总结出得到一个结论: 一份数据,多种组织形式存储。不同的组织形式存储适合于不同的业务,不同的数据组织形式存储,有不同的数据扫描方法,根据SQL的统计信息,PLAN等信息判断选择采用什么样的组织形式的数据访问。 而恰好PG的可扩展性,非常适合于扩展出一份数据,多份存储的功能。 1、AM扩展接口,用于索引的扩展,例如当前PG以及支持了9种索引接口(btree, hash, gin, gist, spgist, brin, bloom, rum, zombodb)。 2、plugable storage接口。PG 12可能会发布这个新功能。 1 优化思路 1、写优化 2、读优化 2 数据组织形式 1、表组织形式 多份表的组织形式(多个数据副本),例如以HEAP存储为主(DML, OLTP业务),以列存储为辅(OLAP业务),数据落HEAP存储后返回,以保障SQL的响应速度,后台异步的合并到列存储。 不同的组织形式存储适合于不同的业务,不同的数据组织形式存储,有不同的数据扫描方法,根据SQL的统计信息,PLAN等信息判断选择采用什么样的组织形式的数据访问。 主,辅形式类似GIN索引的思路,fast update 方法,使用pending list区域,降低GIN索引引入的写RT升高,导致数据写入吞吐下降的问题。 2、索引组织形式 数据存储格式为一份(行存储,OLTP),增加一种索引接口(列组织形式(OLAP业务)),例如叫做VCI。 当有OLAP业务需求是,创建VCI索引,优化器根据SQL请求,决定使用VCI索引,还是TP型的索引。 3、分区表混合组织 不同的分区使用不同的组织形式。 例如,这种情况适合不同时间区间有不同的访问需求的场景。比如1个月以前的数据,大多数适合都是AP型的请求,1个月内的数据基本上是高并发的OLTP请求。可以针对不同的分区,采用不同的数据组织形式存储。 4、分区索引混合组织 不同的分区使用不同的索引组织形式。 类似分区表混合组织。 3 实现思路 1、扩展AM,即数据使用行存,索引使用列存储。扩展列存索引接口。 2、扩展存储接口,一份数据,多份表存储的形式。不同的表存储形式,可以有自己独立的索引体系。优化器根据SQL请求,选择不同的数据存储形式,进行访问,以适合OLTP OLAP的混合请求。 参考 《Greenplum 优化CASE - 对齐JOIN字段类型,使用数组代替字符串,降低字符串处理开销,列存降低扫描开销》 《PostgreSQL GPU 加速(HeteroDB pg_strom) (GPU计算, GPU-DIO-Nvme SSD, 列存, GPU内存缓存)》 《Greenplum 海量数据,大宽表 行存 VS 列存》 《PostgreSQL 如何让 列存(外部列存) 并行起来》 [《[未完待续] PostgreSQL ORC fdw - 列存插件》](https://github.com/digoal/blog/blob/master/201710/20171001_05.md) 《Greenplum 行存、列存,堆表、AO表性能对比 - 阿里云HDB for PostgreSQL最佳实践》 《Greenplum 列存储加字段现象 - AO列存储未使用相对偏移》 《Greenplum 行存、列存,堆表、AO表的原理和选择》 《Greenplum 列存表(AO表)的膨胀、垃圾检查与空间收缩(含修改分布键)》 《列存优化(shard,大小块,归整,块级索引,bitmap scan) - (大量数据实时读写)任意列搜索》 《PostgreSQL 10.0 preview 功能增强 - OLAP增强 向量聚集索引(列存储扩展)》 《分析加速引擎黑科技 - LLVM、列存、多核并行、算子复用 大联姻 - 一起来开启PostgreSQL的百宝箱》 《Greenplum 最佳实践 - 行存与列存的选择以及转换方法》 
《PostgreSQL 列存储引擎 susql (志铭奉献)》
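以上述资料6提到的cstore_fdw为例, 列存外部表的一个最小使用示意(基于cstore_fdw文档的常见用法, compression等选项请以插件文档为准):

create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;
create foreign table t_ap (id int, c1 int, info text)
  server cstore_server
  options (compression 'pglz');  -- 列存组织+压缩, 适合大量数据扫描的AP查询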
背景 使用PostgreSQL pitr,数据库恢复到一个时间点后,这个数据库的所有BLOCK是否都是一致的? 数据库在DOWN机恢复后,数据文件所有BLOCK是否一致? 定期抽查数据库的数据文件是否BLOCK级一致? 以上需求如何快速的满足呢? PostgreSQL允许用户开启block checksum功能,使用pg_verify_checksums工具,可以对整个数据库或指定的数据文件进行checksum校验,确保数据文件逻辑上一致。 pg_verify_checksums 校验数据块一致性 1、停库,目前不支持OPEN状态下的校验。 2、使用pg_verify_checksums校验 pg_verify_checksums verifies data checksums in a PostgreSQL database cluster. Usage: pg_verify_checksums [OPTION]... [DATADIR] Options: [-D, --pgdata=]DATADIR data directory -v, --verbose output verbose messages -r RELFILENODE check only relation with specified relfilenode -V, --version output version information, then exit -?, --help show this help, then exit If no data directory (DATADIR) is specified, the environment variable PGDATA is used. Report bugs to <pgsql-bugs@postgresql.org>. pg_verify_checksums -D /data01/digoal/pg_root8009 Checksum scan completed Data checksum version: 1 Files scanned: 932 Blocks scanned: 2909 Bad checksums: 0 3、目前pg_verify_checksums识别到错误会直接退出程序 pg_verify_checksums -D /data01/digoal/pg_root8009 pg_verify_checksums: could not read block 0 in file "/data01/digoal/pg_root8009/base/13285/13120_fsm": read 1023 of 8192 static void scan_file(const char *fn, BlockNumber segmentno) { PGAlignedBlock buf; PageHeader header = (PageHeader) buf.data; int f; BlockNumber blockno; f = open(fn, O_RDONLY | PG_BINARY); if (f < 0) { fprintf(stderr, _("%s: could not open file \"%s\": %s\n"), progname, fn, strerror(errno)); exit(1); } files++; for (blockno = 0;; blockno++) { uint16 csum; int r = read(f, buf.data, BLCKSZ); if (r == 0) break; if (r != BLCKSZ) { fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"), progname, blockno, fn, r, BLCKSZ); exit(1); } blocks++; /* New pages have no checksum yet */ if (PageIsNew(header)) continue; csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE); if (csum != header->pd_checksum) { if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION) fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %u: calculated checksum %X but block contains %X\n"), progname, fn, blockno, csum, header->pd_checksum); badblocks++; } } if (verbose) fprintf(stderr, _("%s: checksums verified in file \"%s\"\n"), progname, fn); close(f); } 如果期望扫描完所有文件,并将所有有错误的文件打印出来,需要修改一下pg_verify_checksums的代码 注意 版本要求,PostgreSQL 11以上。 低于11的版本,需要将pg_verify_checksums的功能向下PORT一下。 参考 《PostgreSQL 11 preview - Allow on-line enabling and disabling of data checksums (含pg_verify_checksums工具,离线检查数据文件有误块错误)》 https://www.postgresql.org/docs/11/pgverifychecksums.html PostgreSQL 许愿链接 您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森. 9.9元购买3个月阿里云RDS PostgreSQL实例 PostgreSQL 解决方案集合
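补充: pg_verify_checksums的前提是实例开启了block checksum(initdb时指定--data-checksums, 或按上面参考文档在11版本在线开启), 可以用如下SQL确认:

show data_checksums ;  -- 返回on表示已开启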
背景 《PostgreSQL 覆盖 Oracle 18c 重大新特性》 Oracle 19c 新特性摘自盖老师《Oracle 19c 新特性及官方文档抢鲜下载》文章,其中有一些特性在PostgreSQL中很早以前已经支持。本文旨在介绍PG如何使用这些特性。 1.Data Guard 备库DML自动重定向 在使用 ADG 作为备库进行读写分离部署时,可能因为应用的原因,会有偶然的DML操作发送到备库上,在 19c 中,Oracle 支持自动重定向备库 DML,具体执行步骤为: 更新会自动重定向到主库; 主库执行更新、产生和发送Redo日志到备库; 在Redo备库应用后,ADG会话会透明的看到更新信息的落地实施; 这一特性可以通过在系统级或者会话级设置参数 ADG_REDIRECT_DML 参数启用,通过这种方式,ADG 会话的 ACID 一致性得以保持,同时透明的支持『多数读,偶尔更新』应用的自然读写分离配置。 这个特性的引入,将进一步的增加 ADG 的灵活性,帮助用户将备库应用的更加充分。 PostgreSQL 如何支持 1 修改内核支持 PostgreSQL standby与primary通信采用流复制协议 https://www.postgresql.org/docs/11/protocol-replication.html 如果要让PG支持只读从库转发DML到上游节点,首先需要协议层支持。 digoal@pg11-test-> psql psql (11.1) Type "help" for help. postgres=# select pg_is_in_recovery(); pg_is_in_recovery ------------------- t (1 row) postgres=# create table a (id int); ERROR: cannot execute CREATE TABLE in a read-only transaction postgres=# \set VERBOSITY verbose postgres=# create table a (id int); ERROR: 25006: cannot execute CREATE TABLE in a read-only transaction LOCATION: PreventCommandIfReadOnly, utility.c:246 postgres=# insert into a values (1); ERROR: 25006: cannot execute INSERT in a read-only transaction LOCATION: PreventCommandIfReadOnly, utility.c:246 当前写操作报错,判定为如下SQL请求类型时,直接报错。 /* * check_xact_readonly: is a utility command read-only? * * Here we use the loose rules of XactReadOnly mode: no permanent effects * on the database are allowed. */ static void check_xact_readonly(Node *parsetree) { /* Only perform the check if we have a reason to do so. */ if (!XactReadOnly && !IsInParallelMode()) return; /* * Note: Commands that need to do more complicated checking are handled * elsewhere, in particular COPY and plannable statements do their own * checking. However they should all call PreventCommandIfReadOnly or * PreventCommandIfParallelMode to actually throw the error. 
*/ switch (nodeTag(parsetree)) { case T_AlterDatabaseStmt: case T_AlterDatabaseSetStmt: case T_AlterDomainStmt: case T_AlterFunctionStmt: case T_AlterRoleStmt: case T_AlterRoleSetStmt: case T_AlterObjectDependsStmt: case T_AlterObjectSchemaStmt: case T_AlterOwnerStmt: case T_AlterOperatorStmt: case T_AlterSeqStmt: case T_AlterTableMoveAllStmt: case T_AlterTableStmt: case T_RenameStmt: case T_CommentStmt: case T_DefineStmt: case T_CreateCastStmt: case T_CreateEventTrigStmt: case T_AlterEventTrigStmt: case T_CreateConversionStmt: case T_CreatedbStmt: case T_CreateDomainStmt: case T_CreateFunctionStmt: case T_CreateRoleStmt: case T_IndexStmt: case T_CreatePLangStmt: case T_CreateOpClassStmt: case T_CreateOpFamilyStmt: case T_AlterOpFamilyStmt: case T_RuleStmt: case T_CreateSchemaStmt: case T_CreateSeqStmt: case T_CreateStmt: case T_CreateTableAsStmt: case T_RefreshMatViewStmt: case T_CreateTableSpaceStmt: case T_CreateTransformStmt: case T_CreateTrigStmt: case T_CompositeTypeStmt: case T_CreateEnumStmt: case T_CreateRangeStmt: case T_AlterEnumStmt: case T_ViewStmt: case T_DropStmt: case T_DropdbStmt: case T_DropTableSpaceStmt: case T_DropRoleStmt: case T_GrantStmt: case T_GrantRoleStmt: case T_AlterDefaultPrivilegesStmt: case T_TruncateStmt: case T_DropOwnedStmt: case T_ReassignOwnedStmt: case T_AlterTSDictionaryStmt: case T_AlterTSConfigurationStmt: case T_CreateExtensionStmt: case T_AlterExtensionStmt: case T_AlterExtensionContentsStmt: case T_CreateFdwStmt: case T_AlterFdwStmt: case T_CreateForeignServerStmt: case T_AlterForeignServerStmt: case T_CreateUserMappingStmt: case T_AlterUserMappingStmt: case T_DropUserMappingStmt: case T_AlterTableSpaceOptionsStmt: case T_CreateForeignTableStmt: case T_ImportForeignSchemaStmt: case T_SecLabelStmt: case T_CreatePublicationStmt: case T_AlterPublicationStmt: case T_CreateSubscriptionStmt: case T_AlterSubscriptionStmt: case T_DropSubscriptionStmt: PreventCommandIfReadOnly(CreateCommandTag(parsetree)); PreventCommandIfParallelMode(CreateCommandTag(parsetree)); break; default: /* do nothing */ break; } } 2 修改内核支持 利用fdw,读写操作重新向到FDW表(fdw为PostgreSQL的外部表,可以重定向到主节点) 例如 create rule r1 as on insert to a where pg_is_in_recovery() do instead insert into b values (NEW.*); 这个操作需要一个前提,内核层支持standby可写FDW表。 并且这个方法支持的SQL语句有限,方法1更加彻底。 3 citus插件,所有节点完全对等,所有节点均可读写数据库 《PostgreSQL sharding : citus 系列7 - topn 加速(count(*) group by order by count(*) desc limit x) (use 估值插件 topn)》 《PostgreSQL sharding : citus 系列6 - count(distinct xx) 加速 (use 估值插件 hll|hyperloglog)》 《PostgreSQL sharding : citus 系列5 - worker节点网络优化》 《PostgreSQL sharding : citus 系列4 - DDL 操作规范 (新增DB,TABLE,SCHEMA,UDF,OP,用户等)》 《PostgreSQL sharding : citus 系列3 - 窗口函数调用限制 与 破解之法(套用gpdb执行树,分步执行)》 《PostgreSQL sharding : citus 系列2 - TPC-H》 《PostgreSQL sharding : citus 系列1 - 多机部署(含OLTP(TPC-B)测试)》 2.Oracle Sharding 特性的多表家族支持 在Oracle Sharding特性中,被分片的表称为 Sharded table,这些sharded table的集合称为表家族(Table Family),表家族之中的表具备父-子关系,一个表家族中没有任何父表的表叫做根表(root table),每个表家族中只能有一个根表。表家族中的所有Sharded table都按照相同的sharding key(主键)来分片。 在12.2,在一个SDB中只支持一个表家族,在 19c 中,SDB 中允许存在多个表家族,每个通过不同的 Sharding Key进行分片,这是 Sharding 特性的一个重要增强,有了 Multiple Table Families 的支持,Sharding 才可能找到更多的应用场景。 PostgreSQL 如何支持 PostgreSQL sharding支持非常丰富: 1、plproxy 《PostgreSQL 最佳实践 - 水平分库(基于plproxy)》 《阿里云ApsaraDB RDS for PostgreSQL 最佳实践 - 4 水平分库(plproxy) 之 节点扩展》 《阿里云ApsaraDB RDS for PostgreSQL 最佳实践 - 3 水平分库(plproxy) vs 单机 性能》 《阿里云ApsaraDB RDS for PostgreSQL 最佳实践 - 2 教你RDS PG的水平分库(plproxy)》 《ZFS snapshot used with PostgreSQL PITR or FAST degrade or PG-XC GreenPlum plproxy MPP DB's 
consistent backup》 《A Smart PostgreSQL extension plproxy 2.2 practices》 《使用Plproxy设计PostgreSQL分布式数据库》 2、citus 《PostgreSQL sharding : citus 系列7 - topn 加速(count(*) group by order by count(*) desc limit x) (use 估值插件 topn)》 《PostgreSQL sharding : citus 系列6 - count(distinct xx) 加速 (use 估值插件 hll|hyperloglog)》 《PostgreSQL sharding : citus 系列5 - worker节点网络优化》 《PostgreSQL sharding : citus 系列4 - DDL 操作规范 (新增DB,TABLE,SCHEMA,UDF,OP,用户等)》 《PostgreSQL sharding : citus 系列3 - 窗口函数调用限制 与 破解之法(套用gpdb执行树,分步执行)》 《PostgreSQL sharding : citus 系列2 - TPC-H》 《PostgreSQL sharding : citus 系列1 - 多机部署(含OLTP(TPC-B)测试)》 3、pg-xl https://www.postgres-xl.org/ 4、antdb https://github.com/ADBSQL/AntDB 5、sharding sphere http://shardingsphere.apache.org/ 6、乘数科技出品勾股数据库,使用fdw支持sharding 7、pg_pathman+FDW支持sharding https://github.com/postgrespro/pg_shardman 《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》 3.透明的应用连续性支持增强 在Oracle RAC集群中,支持对于查询的自动切换,当一个节点失效,转移到另外一个节点,在19c中,Oracle 持续改进和增强了连续性保持,数据库会自动记录会话状态,捕获用于重演的信息,以便在切换时,在新节点自动恢复事务,使DML事务同样可以获得连续性支持: 在事务提交后自动禁用状态捕获,因为提交成功的事务将不再需要在会话级恢复; 在事务开始时,自动重新启用状态跟踪; PostgreSQL 如何支持 要将一个会话内的请求转移到另一个节点,需要支持同样的快照视角,否则会出现查询不一致的情况。PostgreSQL支持快照的导出,分析给其他会话,使得所有会话可以处于同一视角。 《PostgreSQL 共享事务快照功能 - PostgreSQL 9.2 can share snapshot between multi transactions》 这个技术被应用在: 1、并行一致性逻辑备份 2、会话一致性的读写分离 《PostgreSQL 10.0 preview 功能增强 - slave支持WAITLSN 'lsn', time;用于设置安全replay栅栏》 为了能够支持透明应用连续性,1、可以在SQL中间层支持(例如为每个会话创建快照,记录快照信息,转移时在其他节点建立连接并导入快照),2、SQL驱动层支持,3、也可以在内核层支持转移。 会增加一定的开销。 4.自动化索引创建和实施 对于关系型数据库来说,索引是使得查询加速的重要手段,而如何设计和创建有效的索引,长期以来是一项复杂的任务。 在 Oracle 19c 中,自动化索引创建和实施技术被引入进来,Oracle 通过模拟人工索引的思路,建立了内置的专家系统。 数据库内置的算法将会通过捕获、识别、验证、决策、在线验证、监控的全流程管控索引自动化的过程。 这一特性将会自动帮助用户创建有效的索引,并通过提前验证确保其性能和有效性,并且在实施之后进行监控,这一特效将极大缓解数据库索引维护工作。 自动化还将删除由新创建的索引(逻辑合并)废弃的索引,并删除自动创建但长时间未使用的索引。 PostgreSQL 如何支持 1、EDB PPAS版本,支持自动建议索引 《PostgreSQL 商用版本EPAS(阿里云ppas(Oracle 兼容版)) 索引推荐功能使用》 2、PG社区版本,根据统计信息,top SQL,LONG SQL等信息,自动创建索引 《PostgreSQL SQL自动优化案例 - 极简,自动推荐索引》 《自动选择正确索引访问接口(btree,hash,gin,gist,sp-gist,brin,bitmap...)的方法》 PG 虚拟索引 《PostgreSQL 索引虚拟列 - 表达式索引 - JOIN提速》 《PostgreSQL 虚拟|虚假 索引(hypothetical index) - HypoPG》 PostgreSQL 优势 PG 支持9种索引接口(btree, hash, gin, gist, spgist, brin, bloom, rum, zombodb),同时PG支持索引接口扩展,支持表达式索引,支持partial索引。以支持各种复杂业务场景。 5.多实例并行重做日志应用增强 在Oracle Data Guard环境中,备库的日志应用速度一直是一个重要挑战,如果备库不能够及时跟上主库的步调,则可能影响备库的使用。 自Oracle 12.2 版本开始,支持多实例并行应用,这极大加快了恢复进度,在 18c 中,开始支持 In-Memory 列式存储,在 19c 中,并行应用开始支持 In-Memory列式存储。 PostgreSQL 如何支持 对于逻辑从库,支持一对一,一对多,多对一,多对多的部署方法。PG 逻辑订阅每个通道一个worker process,可以通过创建多个订阅通道来实现并行。 对于物理从库,异步STANDBY的WAL APPLY延迟通常是毫秒级。 6.Oracle的混合分区表支持 在 19c 中,Oracle 增强了分区特性,可以将外部对象存储上的文件,以外部表的方式链接到分区中,形成混合分区表,借助这个特性,Oracle 将数据库内外整合打通,冷数据可以剥离到外部存储,热数据在数据库中在线存储。 这个特性借助了外部表的特性实现,以下是一个示例: CREATE TABLE orders ( order_idnumber, order_dateDATE, … ) EXTERNAL PARTITION ATTRIBUTES ( TYPE oracle_loaderDEFAULTDIRECTORY data_dir ACCESS PARAMETERS (..) 
REJECT LIMIT unlimited) PARTITION BY RANGE(order_date) ( partition q1_2015 values less than(‘2014-10-01’) EXTERNAL LOCATION (‘order_q1_2015.csv’), partition q2_2015 values less than (‘2015-01-01’), partition q3_2015 values less than (‘2015-04-01’), partition q4_2015 values less than (‘2015-07-01’)); PostgreSQL 如何支持 PostgreSQL 的fdw为外部存储(可以是外部任意数据源,包括文件,DB,WWW,S3,OSS等)。 使用PG继承技术,即可完成分区的混合存储(本地存储,外部存储混合),甚至SHARDING。 《ApsaraDB的左右互搏(PgSQL+HybridDB+OSS) - 解决OLTP+OLAP混合需求》 《PostgreSQL 9.6 sharding based on FDW & pg_pathman》 7.在线维护操作增强 在不同版本中,Oracle 持续增强在线维护操作,例如在 12.2 开始支持的Online Move、在线修改普通表为分区表等特性。 在19c 中,持续增强了智能的、细粒度的游标失效控制,将DDL操作对于游标失效的影响降至最低,例如,在 19c 中,comment on table的操作,将不会引起游标的失效。 针对分区维护的操作,例如Truncate分区等,Oracle 将进行细粒度的控制,和DDL操作无关的SQL将不受DDL失效影响。 PostgreSQL 如何支持 PostgreSQL 设计之初就支持了DDL事务,可以将DDL与DML混合在一个事务中处理。 begin; insert into tbl values (...); drop table xx; create table xx; alter table xx; insert xx; end; 又例如切换表名,可以封装为一个事务。 另外对于普通表转分区表,可以这样操作: 《PostgreSQL 普通表在线转换为分区表 - online exchange to partition table》 8.自动的统计信息管理 随着表数据的变化,优化器表数据统计数据将近实时刷新,以防止次优执行计划 统计的在线维护内置于直接路径加载操作中 当数据显着变化时运行自动统计信息收集作业,例如。,自上次收集统计信息以来,表中超过10%的行被添加/更改 第一个看到需要重新编译SQL游标的会话(例如,由于新的优化器统计信息)执行重新编译 其他会话继续使用旧的SQL游标,直到编译完成 避免因重新编译而导致大量会话停顿 PostgreSQL 如何支持 PostgreSQL autovacuum 设计之初就是采用的动态统计信息收集,并且支持到了集群、TABLE级别可设置,用户可以根据不同表的负载情况,设置自动收集统计信息的阈值。 相关参数 autovacuum_analyze_scale_factor autovacuum_analyze_threshold autovacuum_naptime autovacuum_max_workers autovacuum_work_mem 同时统计信息的柱状图个数支持动态设置到表、集群级。 表级设置 alter table xx SET STATISTICS to xx; 相关参数 default_statistics_target 9.自动化的SQL执行计划管理 在 19c 中,数据库缺省的就会启用对于所有可重用SQL的执行计划捕获(当然SYS系统Schema的SQL除外),然后进行自动的执行计划评估,评估可以针对AWR中的TOP SQL、SGA、STS中的SQL进行。 如果被评估的执行计划优于当前执行计划(一般是要有效率 50%以上的提升),会被加入到执行计划基线库中,作为后续的执行选择,而不佳的执行计划则会被标记为不可接受。 有了这个特性,SQL执行计划的稳定性将更进一步。 PostgreSQL 如何支持 PostgreSQL 自适应执行计划插件AQO,支持类似功能。对于复杂SQL尤为有效。 https://github.com/postgrespro/aqo Adaptive query optimization is the extension of standard PostgreSQL cost-based query optimizer. Its basic principle is to use query execution statistics for improving cardinality estimation. Experimental evaluation shows that this improvement sometimes provides an enormously large speed-up for rather complicated queries. 
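aqo插件的典型开启方式示意如下(aqo.mode为该插件的模式GUC, 具体取值与用法请以postgrespro/aqo文档为准, 此处仅为假设性示例):

create extension aqo;
set aqo.mode = 'intelligent';  -- 利用执行反馈修正基数估计, 对复杂查询尤为有效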
10.SQL功能的增强 在 19c 中,SQL 功能获得了进一步的增强,这其中包括对于 COUNT DISTINCT的进一步优化,在12c中引入的近似 Distinct 操作已经可以为特定SQL带来极大性能提升,现在基于位图的COUNT DISTINCT 操作继续为查询加速。 除此之外,LISTAGG 增加了 DISTINCT 关键字,用于对操作数据的排重。 ANY_VALUE 提供了从数据组中获得随机值的能力,如果你以前喜欢用 Max / Min 实现类似的功能,新功能将显著带来效率的提升。ANY_VALUE 函数在 MySQL 早已存在,现在应该是 Oracle 借鉴和参考了 MySQL 的函数做出的增强。 PostgreSQL 如何支持 PostgreSQL 支持近似聚合,支持流计算,支持聚合中的排序,支持自定义聚合函数等。 例如 1、使用hyperloglog插件,PostgreSQL可以实现概率计算,包括count distinct的概率计算。 https://github.com/citusdata/postgresql-hll 《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 3》 《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 2》 《PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 1》 [《[转]流数据库 概率计算概念 - PipelineDB-Probabilistic Data Structures & Algorithms》](https://github.com/digoal/blog/blob/master/201801/20180116_01.md) 2、TOP-N插件 https://github.com/citusdata/cms_topn 3、pipelinedb 插件 《PostgreSQL 流计算插件 - pipelinedb 1.x 参数配置介绍》 《PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合》 通过流计算,适应更多的实时计算场景。 小结 PostgreSQL是一款非常优秀的企业级开源数据库,不仅有良好的Oracle兼容性,同时在Oracle面前也有很大更加优秀的地方: 插件化,可扩展(包括类型、索引接口、函数、操作符、聚合、窗口、FDW、存储过程语言(目前支持plpgsql,plsql,c,pljava,plperl,pltcl,pllua,plv8,plpython,plgo,...几乎所有编程语言的存储过程),采样,...)。 如何从O迁移到PG: 《xDB Replication Server - PostgreSQL, MySQL, Oracle, SQL Server, PPAS 全量、增量(redo log based, or trigger based)同步(支持single-master, mult-master同步, 支持DDL)》 《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》 《ADAM,从Oracle迁移到PPAS,PG的可视化评估、迁移产品》 混合使用情况下的资源隔离管理 《PostgreSQL 商用版本EPAS(阿里云ppas(Oracle 兼容版)) HTAP功能之资源隔离管理 - CPU与刷脏资源组管理》 参考 http://www.sohu.com/a/294160243_505827 https://www.postgresql.org/docs/11/protocol-replication.html 《xDB Replication Server - PostgreSQL, MySQL, Oracle, SQL Server, PPAS 全量、增量(redo log based, or trigger based)同步(支持single-master, mult-master同步, 支持DDL)》 《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》 PostgreSQL 许愿链接 您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森. 9.9元购买3个月阿里云RDS PostgreSQL实例 PostgreSQL 解决方案集合
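补充一例: 用hll插件做近似count distinct(access_log(user_id, crt_time)为假设的表结构, 函数名来自citusdata/postgresql-hll文档):

create extension hll;
create table daily_uv (d date, users hll);
insert into daily_uv
  select crt_time::date, hll_add_agg(hll_hash_integer(user_id))
  from access_log group by 1;
select d, hll_cardinality(users) from daily_uv order by d;  -- 近似每日UV, 误差通常在1%左右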
背景 在做一些测试时,如果IO设备很烂的话,可以直接使用内存文件系统,避免IO上引入的一些开销影响测试结果。 用法很简单: tmpfs or shmfs mount a shmfs with a certain size to /dev/shm, and set the correct permissions. For tmpfs you do not need to specify a size. Tmpfs or shmfs allocated memory is pageable. For example: Example Mount shmfs: # mount -t shm shmfs -o size=20g /dev/shm Edit /etc/fstab: shmfs /dev/shm shm size=20g 0 0 OR Example Mount tmpfs: # mount –t tmpfs tmpfs /dev/shm Edit /etc/fstab: none /dev/shm tmpfs defaults 0 0 ramfs ramfs is similar to shmfs, except that pages are not pageable or swappable. This approach provides the commonly desired effect. ramfs is created by: umount /dev/shm mount -t ramfs ramfs /dev/shm 例子 [root@pg11-test ~]# mkdir /mnt/tmpfs [root@pg11-test ~]# mkdir /mnt/ramfs 1、tmpfs mount -t tmpfs tmpfs /mnt/tmpfs -o size=10G,noatime,nodiratime,rw mkdir /mnt/tmpfs/a chmod 777 /mnt/tmpfs/a 2、ramfs mount -t ramfs ramfs /mnt/ramfs -o noatime,nodiratime,rw,data=writeback,nodelalloc,nobarrier mkdir /mnt/ramfs/a chmod 777 /mnt/ramfs/a ramfs无法在mount时限制大小,即使限制了也不起作用,在df结果中也看不到这个挂载点,但是实际上已经挂载。 [root@pg11-test ~]# mount tmpfs on /mnt/tmpfs type tmpfs (rw,noatime,nodiratime,size=10485760k) ramfs on /mnt/ramfs type ramfs (rw,noatime,nodiratime,data=writeback,nodelalloc,nobarrier) [root@pg11-test ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 197G 17G 171G 9% / devtmpfs 252G 0 252G 0% /dev tmpfs 252G 936K 252G 1% /dev/shm tmpfs 252G 676K 252G 1% /run tmpfs 252G 0 252G 0% /sys/fs/cgroup /dev/mapper/vgdata01-lv03 4.0T 549G 3.5T 14% /data03 /dev/mapper/vgdata01-lv02 4.0T 335G 3.7T 9% /data02 /dev/mapper/vgdata01-lv01 4.0T 1.5T 2.6T 37% /data01 tmpfs 51G 0 51G 0% /run/user/0 /dev/mapper/vgdata01-lv04 2.0T 621G 1.3T 32% /data04 tmpfs 10G 0 10G 0% /mnt/tmpfs 内存文件系统性能 PostgreSQL fsync测试接口,测试内存文件系统fsync性能。 su - digoal digoal@pg11-test-> pg_test_fsync -f /mnt/tmpfs/a/1 5 seconds per test O_DIRECT supported on this platform for open_datasync and open_sync. Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a* fdatasync 1137033.436 ops/sec 1 usecs/op fsync 1146431.736 ops/sec 1 usecs/op fsync_writethrough n/a open_sync n/a* * This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a* fdatasync 622763.705 ops/sec 2 usecs/op fsync 625990.998 ops/sec 2 usecs/op fsync_writethrough n/a open_sync n/a* * This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. Compare open_sync with different write sizes: (This is designed to compare the cost of writing 16kB in different write open_sync sizes.) 1 * 16kB open_sync write n/a* 2 * 8kB open_sync writes n/a* 4 * 4kB open_sync writes n/a* 8 * 2kB open_sync writes n/a* 16 * 1kB open_sync writes n/a* Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 317779.892 ops/sec 3 usecs/op write, close, fsync 317769.037 ops/sec 3 usecs/op Non-sync'ed 8kB writes: write 529490.541 ops/sec 2 usecs/op digoal@pg11-test-> pg_test_fsync -f /mnt/ramfs/a/1 5 seconds per test O_DIRECT supported on this platform for open_datasync and open_sync. 
Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a* fdatasync 1146515.453 ops/sec 1 usecs/op fsync 1149912.760 ops/sec 1 usecs/op fsync_writethrough n/a open_sync n/a* * This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync n/a* fdatasync 621456.930 ops/sec 2 usecs/op fsync 624811.200 ops/sec 2 usecs/op fsync_writethrough n/a open_sync n/a* * This file system and its mount options do not support direct I/O, e.g. ext4 in journaled mode. Compare open_sync with different write sizes: (This is designed to compare the cost of writing 16kB in different write open_sync sizes.) 1 * 16kB open_sync write n/a* 2 * 8kB open_sync writes n/a* 4 * 4kB open_sync writes n/a* 8 * 2kB open_sync writes n/a* 16 * 1kB open_sync writes n/a* Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 314754.770 ops/sec 3 usecs/op write, close, fsync 314509.045 ops/sec 3 usecs/op Non-sync'ed 8kB writes: write 517299.869 ops/sec 2 usecs/op 本地磁盘性能如下: digoal@pg11-test-> pg_test_fsync -f /data01/digoal/1 5 seconds per test O_DIRECT supported on this platform for open_datasync and open_sync. Compare file sync methods using one 8kB write: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 46574.176 ops/sec 21 usecs/op fdatasync 40183.743 ops/sec 25 usecs/op fsync 36875.852 ops/sec 27 usecs/op fsync_writethrough n/a open_sync 42927.560 ops/sec 23 usecs/op Compare file sync methods using two 8kB writes: (in wal_sync_method preference order, except fdatasync is Linux's default) open_datasync 17121.111 ops/sec 58 usecs/op fdatasync 26438.641 ops/sec 38 usecs/op fsync 24562.907 ops/sec 41 usecs/op fsync_writethrough n/a open_sync 15698.199 ops/sec 64 usecs/op Compare open_sync with different write sizes: (This is designed to compare the cost of writing 16kB in different write open_sync sizes.) 1 * 16kB open_sync write 28793.172 ops/sec 35 usecs/op 2 * 8kB open_sync writes 15720.156 ops/sec 64 usecs/op 4 * 4kB open_sync writes 10007.818 ops/sec 100 usecs/op 8 * 2kB open_sync writes 5698.259 ops/sec 175 usecs/op 16 * 1kB open_sync writes 3116.232 ops/sec 321 usecs/op Test if fsync on non-write file descriptor is honored: (If the times are similar, fsync() can sync data written on a different descriptor.) write, fsync, close 33399.473 ops/sec 30 usecs/op write, close, fsync 33216.001 ops/sec 30 usecs/op Non-sync'ed 8kB writes: write 376584.982 ops/sec 3 usecs/op 性能对比,显而易见。 其他 mount hugetlbfs,使用huge page的文件系统,但是不支持read, write接口,需要使用mmap的用法。 详见 https://www.ibm.com/developerworks/cn/linux/l-cn-hugetlb/index.html 参考 https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR397 http://www.cnblogs.com/jintianfree/p/3993893.html https://lwn.net/Articles/376606/ https://www.ibm.com/developerworks/cn/linux/l-cn-hugetlb/index.html 《PostgreSQL Huge Page 使用建议 - 大内存主机、实例注意》 PostgreSQL 许愿链接 您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森. 9.9元购买3个月阿里云RDS PostgreSQL实例 PostgreSQL 解决方案集合
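除了跑pg_test_fsync, 也可以直接将内存文件系统用作测试实例的表空间(注意: 实例重启或主机掉电后数据即丢失, 仅适合测试场景, 路径沿用上面创建的/mnt/tmpfs/a):

-- 目录需已存在且归数据库运行的OS用户所有
create tablespace ts_mem location '/mnt/tmpfs/a';
create table t_mem (id int, info text) tablespace ts_mem;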
背景 PostgreSQL凭借友好的开源许可(类BSD开源许可),商业、创新两大价值,以及四大能力(企业级特性,兼容Oracle,TPAP混合负载能力,多模特性),在企业级开源数据库市场份额节节攀升,并蝉联2017,2018全球权威数据库评测机构db-engine的年度数据库冠军。 《中国 PostgreSQL 生态构建思考 - 安全合规、自主可控、去O战略》 如果说兼容Oracle是企业级市场的敲门砖,那么跨Oracle, PostgreSQL 的异构数据库迁移、同步能力就是连接新旧世界的桥梁。如何将Oracle的数据库以及应用平滑,有据可循的迁移到PostgreSQL,可参考阿里云ADAM产品,增量的同步到PostgreSQL可使用xDB replication server。 ADAM xDB replicatoin server 《从人类河流文明 洞察 数据流动的重要性》 数据同步技术是数据流动的重要环节。在很多场景有非常重要的作用: 1、线上业务系统上有实时分析查询,担心影响线上数据库。使用同步技术,实时将数据同步到BI库,减少在线业务数据库的负载。 2、跨版本,跨硬件平台升级数据库版本。使用同步、增量实时同步技术,可以尽可能的减少停库、中断服务的时间。 3、构建测试系统,使用同步技术,构建与线上同样负载的实时SQL回放的测试库。 4、跨数据库平台异构迁移数据,使用异构数据库同步技术,尽可能的减少减少停库、中断服务的时间。例如oracle到postgresql的迁移。 5、多中心,多写。当业务部署在多中心时,使用多写同步技术,当一个节点出现故障时,由于数据库可以多写,所以可以尽可能减少业务中断时间。 6、写扩展。当写负载非常大时,将写分担到多个库,少量需要共享的数据通过同步方式同步到多个库。扩展整体写吞吐能力。 7、本地化数据访问,当需要经常访问外部数据源时,使用同步技术,将数据同步到本地访问,降低访问延迟。 PostgreSQL, Oracle, SQL Server, PPAS(兼容Oracle),这些产品如何实现同构,异构数据库的全量,增量实时同步? EDB提供的xDB replication server是一款可以用于以上产品的同构、异构同步的产品。 一、xDB replication server原理 xDB replication server smr架构、组件 SMR单向复制,xDB提供pub server,用户可配置源库的发布表,pub server捕获发布表的全量,增量。sub server从pub server将全量,增量订阅到目标数据库。 xDB replication server包括三个组件: 1、xdb pub server,发布 2、xdb sub server,订阅 3、xdb console,控制台(支持命令行与GUI界面) xDB replication server mmr架构、组件 MMR双向复制。双向复制的技术点除了SMR以外,还需要解决数据打环,数据冲突(同一条数据,同一个时间窗口被更新时,或者同一个主键值同一个时间窗口被写入时)的问题。 xDB replication server smr支持场景 Advanced Server指EDB提供的PPAS(兼容Oracle)。 1、Replication between PostgreSQL and Advanced Server databases (between products in either direction) 2、Replication from Oracle to PostgreSQL 3、Replication in either direction between Oracle and Advanced Server 4、Replication in either direction between SQL Server and PostgreSQL 5、Replication in either direction between SQL Server and Advanced Server xDB replication server MMR支持场景 双向同步仅支持pg, ppas。 1、PostgreSQL database servers 2、PostgreSQL database servers and Advanced Servers operating in PostgreSQL compatible mode (EDB PPAS使用PG兼容模式时) 3、Advanced Servers operating in PostgreSQL compatible mode 4、Advanced Servers operating in Oracle compatible mode 同步模式支持 全量同步 snapshot,支持批量同步。 增量同步模式支持 增量同步支持两种模式: 1、wal-logged base,推荐。 2、trigger base 二、xDB replication server 使用例子 CentOS 7.X x64 为例 部署xDB pub,sub,console pub, sub, console三个组件可以部署在任意服务器上,并且三个组件可以分开独立部署。 推荐: 1、pub部署在靠近源数据库的地方。 2、sub部署在靠近目标数据库的地方。 3、console部署在可以连通sub, pub, 数据库的地方。同时考虑到方便打开console进行同步任务的管理操作。 下面假设三个组件、以及源库、目标库都部署在一台服务器上。 部署依赖 1、安装java 1.7.0以上版本 https://www.java.com/en/download/ https://www.java.com/en/download/manual.jsp#lin 安装java1.7.0以上版本 wget https://javadl.oracle.com/webapps/download/AutoDL?BundleId=235716_2787e4a523244c269598db4e85c51e0c rpm -ivh AutoDL\?BundleId\=235716_2787e4a523244c269598db4e85c51e0c 检查安装目录 rpm -ql jre1.8-1.8.0_191|grep ext /usr/java/jre1.8.0_191-amd64/lib/deploy/ffjcext.zip /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/HighContrast/16x16/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/HighContrast/48x48/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/HighContrastInverse/16x16/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/HighContrastInverse/48x48/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/LowContrast/16x16/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/LowContrast/48x48/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/desktop/icons/hicolor/16x16/mimetypes/gnome-mime-text-x-java.png 
/usr/java/jre1.8.0_191-amd64/lib/desktop/icons/hicolor/48x48/mimetypes/gnome-mime-text-x-java.png /usr/java/jre1.8.0_191-amd64/lib/ext /usr/java/jre1.8.0_191-amd64/lib/ext/cldrdata.jar /usr/java/jre1.8.0_191-amd64/lib/ext/dnsns.jar /usr/java/jre1.8.0_191-amd64/lib/ext/jaccess.jar /usr/java/jre1.8.0_191-amd64/lib/ext/jfxrt.jar /usr/java/jre1.8.0_191-amd64/lib/ext/localedata.jar /usr/java/jre1.8.0_191-amd64/lib/ext/localedata.pack /usr/java/jre1.8.0_191-amd64/lib/ext/meta-index /usr/java/jre1.8.0_191-amd64/lib/ext/nashorn.jar /usr/java/jre1.8.0_191-amd64/lib/ext/sunec.jar /usr/java/jre1.8.0_191-amd64/lib/ext/sunjce_provider.jar /usr/java/jre1.8.0_191-amd64/lib/ext/sunpkcs11.jar /usr/java/jre1.8.0_191-amd64/lib/ext/zipfs.jar java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) 2、安装数据源java驱动 需要被迁移的数据库,需要下载对应的jdbc驱动。 https://www.enterprisedb.com/docs/en/52.0.0/MTK_Guide/EDB_Postgres_Migration_Guide_v52.0.0.1.12.html# https://www.enterprisedb.com/advanced-downloads 例如,下载PG的驱动。 wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar 驱动拷贝到 JAVA_HOME/jre/lib/ext ,从jre的安装路径获取路径 mv postgresql-42.2.5.jar /usr/java/jre1.8.0_191-amd64/lib/ext/ 安装xdb 1、下载软件,可以选择60天试用 https://www.enterprisedb.com/software-downloads-postgres chmod 700 xdbreplicationserver-6.2.4-1-linux-x64.run 安装 ./xdbreplicationserver-6.2.4-1-linux-x64.run --mode text Language Selection Please select the installation language [1] English - English [2] Simplified Chinese - 简体中文 [3] Traditional Chinese - 繁体中文 [4] Japanese - 日本語 [5] Korean - ??? Please choose an option [1] : ---------------------------------------------------------------------------- Welcome to the Postgres Plus xDB Replication Server Setup Wizard. ---------------------------------------------------------------------------- Please read the following License Agreement. You must accept the terms of this agreement before continuing with the installation. Press [Enter] to continue: .......... Press [Enter] to continue: Do you accept this license? [y/n]: y ---------------------------------------------------------------------------- Please specify the directory where xDB Replication Server will be installed. Installation Directory [/opt/PostgreSQL/EnterpriseDB-xDBReplicationServer]: ---------------------------------------------------------------------------- Select the components you want to install; clear the components you do not want to install. Click Next when you are ready to continue. Replication Console [Y/n] :Y Publication Server [Y/n] :Y Subscription Server [Y/n] :Y Is the selection above correct? [Y/n]: Y ---------------------------------------------------------------------------- xDB Admin User Details. Please provide admin user credentials. xDB pub、sub server以及console 之间相互认证的用户,密码 Admin User [admin]: Admin Password : 密码 digoal123321 Confirm Admin Password : digoal123321 pub与sub server的监听端口 ---------------------------------------------------------------------------- Publication Server Details Please specify a port on which publication server will run. Port [9051]: ---------------------------------------------------------------------------- Subscription Server Details Please specify a port on which subscription server will run. 
Port [9052]: pub, sub server跑在哪个OS用户下面 ---------------------------------------------------------------------------- Publication/Subscription Service Account Please provide the user name of the account under which the publication/subscription service will run. Operating system username [postgres]: digoal 操作系统用户名 ---------------------------------------------------------------------------- Setup is now ready to begin installing xDB Replication Server on your computer. Do you want to continue? [Y/n]: Y ---------------------------------------------------------------------------- Please wait while Setup installs xDB Replication Server on your computer. Installing xDB Replication Server 0% ______________ 50% ______________ 100% ######################################### ---------------------------------------------------------------------------- EnterpriseDB is the leading provider of value-added products and services for the Postgres community. Please visit our website at www.enterprisedb.com 可以看到pub与sub server已启动 [root@pg11-test ~]# ps -ewf|grep xdb digoal 13289 1 0 16:58 ? 00:00:00 /bin/bash -c cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin; ./runPubServer.sh >> /var/log/edb/xdbpubserver/edb-xdbpubserver.log 2>&1 & digoal 13375 13289 3 16:58 ? 00:00:01 /usr/bin/java -XX:-UsePerfData -Xms256m -Xmx1536m -XX:ErrorFile=/var/log/xdb-6.2/pubserver_pid_%p.log -Djava.library.path=/opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar pubserver 9051 digoal 13469 1 0 16:58 ? 00:00:00 /bin/bash -c cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin; ./runSubServer.sh >> /var/log/edb/xdbsubserver/edb-xdbsubserver.log 2>&1 & digoal 13551 13469 4 16:58 ? 00:00:01 /usr/bin/java -XX:-UsePerfData -XX:ErrorFile=/var/log/xdb-6.2/subserver_pid_%p.log -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar subserver 9052 xDB安装的软件目录内容 # cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin [root@pg11-test bin]# ll total 5808 -rwxrwxr-x 1 root root 45544 Nov 15 15:45 DataValidator.jar -rwxr-xr-x 1 root root 4837 Nov 15 15:47 edb_audit.sh -rwxr-xr-x 1 root root 30550 Nov 15 15:47 edb_bugreport.sh -rwxrwxr-x 1 root root 1746041 Nov 15 15:45 edb-repcli.jar -rwxrwxr-x 1 root root 1679061 Nov 15 15:45 edb-repconsole.jar -rwxrwxr-x 1 root root 2250159 Nov 15 15:45 edb-repserver.jar -rwxrwxr-x 1 root root 25994 Nov 15 15:45 libnativehandler.so -rwxrwxr-x 1 root root 129596 Nov 15 15:45 libpqjniwrapper.so -rwxr-xr-x 1 root root 889 Feb 3 17:08 runPubServer.sh -rwxr-xr-x 1 root root 531 Feb 3 17:08 runRepConsole.sh -rwxr-xr-x 1 root root 701 Feb 3 17:08 runSubServer.sh -rwxr-xr-x 1 root root 538 Feb 3 17:08 runValidation.sh 1、控制台 java -jar ./edb-repconsole.jar 2、pub启动脚本 runPubServer.sh 3、sub启动脚本 runSubServer.sh xDB 相关配置文件 1、pub server配置文件 /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/xdb_pubserver.conf 可配置的一些性能相关项 #This option represents the MTK option "-cpBatchSize" that has a default value of 8MB. #The user can customize the default value to optimize the data speed for Snapshot #that involves large datasets and enough memory on the system. # size in MB #cpBatchSize=8 #This option represents the MTK option "-batchSize" that has a default value of 100 rows. # size in rows #batchSize=100 #The option to import Oracle Partitioned table as a normal table in PPAS/PPSS. 
#importPartitionAsTable=false #It controls how many rows are fetched from the publication database in one round (network) trip. For example, #if there are 1000 row changes available in shadow table(s), the default fetch size will require 5 database round trips. #Hence using a fetch size of 500 will bring all the changes in 2 round trips. Fine tune the performance by using a fetch size #that conforms to the average data volume consumed by rows fetched in one round trip. #syncFetchSize=200 #Synchronize Replication batch size. Default to 100 statements per batch. #syncBatchSize=100 #This defines the maximum number of transactional rows that can be grouped in a single transaction set. #The xDB loads and processes the delta changes by fetching as many rows in memory as grouped in a single #transaction set. A higher value is expected to boost the performance. However increasing it to a very large #value might result in out of memory error, hence increase/decrease the default value in accordance with #the average row size (low/high). #txSetMaxSize=10000 #This option controls the number of maximum threads used to load data from source publication tables #in parallel mode. The default count is 4, however depending on the target system #architecture specifically multi CPUs/cores one can choose to specify a custom count (normally #equals CPU/core count) to fully utilize the system resources. #syncLoadThreadLimit=4 #It defines the upper limit for number of (WAL) entries that can be hold in the queue #A value of zero indicates there will be no upper limit. The default is set to 10000. #walStreamQueueLimit=10000 2、sub server配置 /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/xdb_subserver.conf 可配置的一些性能相关项 #The option to import Oracle Partitioned table as a normal table in PPAS/PPSS. #importPartitionAsTable=false #This option controls the number of threads used to perform snapshot data migration in parallel mode. #The default behavior is to use a single data loader thread. However depending on the target system #architecture specifically multi CPUs/cores one can choose to specify a custom count (normally #equals CPU/core count) to fully utilize the system resources. #snapshotParallelLoadCount=1 3、通用配置 /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/sysconfig/xdbReplicationServer-62.config #!/bin/sh JAVA_EXECUTABLE_PATH="/usr/bin/java" JAVA_MINIMUM_VERSION=1.7 JAVA_BITNESS_REQUIRED=64 JAVA_HEAP_SIZE="-Xms8192m -Xmx32767m" # 这个可以配大一点 PUBPORT=9051 SUBPORT=9052 三、同步测试 1、测试目标: PG到PG的SMR(单向同步),全量,增量,添加表,多个SUB,PUB对,修改表结构。几个功能点的测试。 2、测试环境 pub , sub server xdb console, 源db, 目标db 使用同一台服务器。(仅测试) CentOS 7.x x64 512G memory 源, PostgreSQL 11.1 127.0.0.1:8001:db1 目标, PostgreSQL 11.1 127.0.0.1:8001:db2 使用wal based replication。 配置source database 1、配置postgresql.conf wal_level = replica max_worker_processes = 128 max_wal_senders = 32 max_replication_slots = 32 max_logical_replication_workers = 8 max_sync_workers_per_subscription = 4 2、配置pg_hba.conf host all all 0.0.0.0/0 md5 host replication all 0.0.0.0/0 md5 3、被复制的table,(update,delete)必须有pk 4、如果需要table filter,需要设置table的REPLICA IDENTITY 为 full 5、创建源库 postgres=# create database db1; CREATE DATABASE 6、用户权限 pub database 用户权限要求: 1、The database user can connect to the publication database. 2、The database user has superuser privileges. 
Superuser privileges are required because the database configuration parameter session_replication_role is altered by the database user to replica for snapshot operations involving replication of the control schema from one publication database to another. 3、The database user must have the ability to modify the system catalog tables in order to disable foreign key constraints on the control schema tables for snapshot operations involving replication of the control schema from one publication database to another. (See appendix Section 10.4.4 for more information on this requirement.) create role digoal superuser login encrypted password 'digoal123321'; 配置target database 1、创建目标库 postgres=# create database db2; CREATE DATABASE 2、用户权限要求 superuser create role digoal superuser login encrypted password 'digoal123321'; 配置xdb 1、JAVA_HEAP_SIZE,建议加大 cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/sysconfig vi xdbReplicationServer-62.config #!/bin/sh JAVA_EXECUTABLE_PATH="/usr/bin/java" JAVA_MINIMUM_VERSION=1.7 JAVA_BITNESS_REQUIRED=64 JAVA_HEAP_SIZE="-Xms4096m -Xmx16384m" PUBPORT=9051 SUBPORT=9052 2、配置pub, sub server配置文件(可选) /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/xdb_pubserver.conf /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/xdb_subserver.conf 数据链路 数据同步访问链路如下: 1、xDB pub server 访问 pub database 2、xDB pub server <-相互访问-> xDB sub server 3、xDB sub server 访问 sub database 4、xDB console 访问 pub, sub, (源、目标)database 使用xDB replication console图形界面配置 为了方便控制,建议初学者开始先使用图形界面console 《Linux vnc server, vnc viewer(远程图形桌面)使用》 以下进入Linux vnc桌面操作 1、启动xDB replication console java -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repconsole.jar 2、注册pub server 输入pub server的连接地址,用户,密码 3、往pub server,添加用于发布的源数据库 选择数据库类型 输入源数据库的连接地址,端口,用户(超级用户),密码,数据库名(db1) 4、配置pub tables group 勾选table,一个pub group,一个slot,最多用一个核。 可以创建多个pub group,例如一张表一个。但是每个pub group会耗费一个slot, 一个replication worker,源库如下参数: postgres=# show max_wal_senders ; max_wal_senders ----------------- 32 (1 row) postgres=# show max_replication_slots ; max_replication_slots ----------------------- 32 (1 row) 如果你需要复制表的部分数据,可以配置table filter,但是要求表的REPLICA IDENTITY配置为full。 alter table tbl set REPLICA IDENTITY full; 5、注册sub server 输入sub server的连接地址,用户,密码。 6、配置订阅目标库 7、创建订阅 配置pub server的连接串,点load,选中pub tables group 注意,如果目标库已经存在同名表名,则会报错 需要先DROP目标表,重新配置。 8、全量同步 9、配置增量同步计划 当pub server无增量数据后,间隔多久再重试。 10、原有pub tables group,增加新表 digoal@pg11-test-> psql psql (11.1) Type "help" for help. postgres=# \c db1 You are now connected to database "db1" as user "postgres". db1=# create table test (id int primary key, info text, crt_time timestamp); CREATE TABLE db1=# alter table test replica identity full; ALTER TABLE sub server 对应pub p1 自动获取到新增的表 压测 digoal@pg11-test-> vi test.sql \set id random(1,100000000) insert into test values (:id, md5(random()::text), now()) on conflict (id) do update set info=excluded.info,crt_time=excluded.crt_time; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 4 -j 4 -T 120 db1 progress: 1.0 s, 83118.1 tps, lat 0.048 ms stddev 0.023 progress: 2.0 s, 84590.4 tps, lat 0.047 ms stddev 0.022 progress: 3.0 s, 87808.6 tps, lat 0.046 ms stddev 0.021 progress: 4.0 s, 84952.9 tps, lat 0.047 ms stddev 0.023 progress: 5.0 s, 91500.0 tps, lat 0.044 ms stddev 0.023 目标库查看数据正常同步 psql -h 127.0.0.1 -p 8000 db2 db2=# select count(*) from test; count -------- 150389 (1 row) .... 
db2=# select count(*) from test;
 count
--------
 393261
(1 row)

11、Changing the table structure

Note that DDL must fully qualify the schema, otherwise xDB reports that there is no matching TABLE.

With the schema specified:

alter table public.test add column c1 int default 10;

It is advisable to run a synchronization first: replicating the DDL triggers an implicit synchronization, which blocks.

After the change, the structures are identical.

Source database

db1=# \d+ test
Table "public.test"
 Column   | Type                        | Collation | Nullable | Default | Storage  | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 id       | integer                     |           | not null |         | plain    |              |
 info     | text                        |           |          |         | extended |              |
 crt_time | timestamp without time zone |           |          |         | plain    |              |
 c1       | integer                     |           |          | 10      | plain    |              |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
Triggers:
    rrpt_public_test AFTER TRUNCATE ON test FOR EACH STATEMENT EXECUTE PROCEDURE _edb_replicator_pub.capturetruncateevent()
Replica Identity: FULL

Target database

db2=# \d+ test
Table "public.test"
 Column   | Type                        | Collation | Nullable | Default | Storage  | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 id       | integer                     |           | not null |         | plain    |              |
 info     | text                        |           |          |         | extended |              |
 crt_time | timestamp without time zone |           |          |         | plain    |              |
 c1       | integer                     |           |          | 10      | plain    |              |
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

12、Adding a filter

To add a table filter, so that the target subscribes only to rows that satisfy the filter condition, the table must have Replica Identity: FULL.

The test table already shows Replica Identity: FULL, similar to the following: (screenshot omitted)

13、Configuring multiple pub/sub pairs

Source database:

do language plpgsql $$
declare
begin
  for i in 0..7 loop
    execute format('create table t%s (id int primary key, info text, crt_time timestamp)', i);
  end loop;
end;
$$;

Configure the pubs, configure the subs. (screenshots omitted)

For the stress test, create a dynamic-write function:

db1=# create or replace function ins_tx(int) returns void as $$
declare
  suffix int := abs(mod($1,8));
begin
  execute format('execute ps%s(%s)', suffix, $1);
exception when others then
  execute format('prepare ps%s(int) as insert into t%s values ($1, md5(random()::text), now()) on conflict (id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, suffix);
  execute format('execute ps%s(%s)', suffix, $1);
end;
$$ language plpgsql strict;
CREATE FUNCTION

Test the dynamic-write function:

db1=# select ins_tx(1);
 ins_tx
--------

(1 row)

db1=# select ins_tx(2);
 ins_tx
--------

(1 row)

db1=# select * from t1;
 id |               info               |          crt_time
----+----------------------------------+----------------------------
  1 | 44893db346d0c599bb2c3de72a6a1b9e | 2019-02-04 15:01:27.539532
(1 row)

db1=# select * from t2;
 id |               info               |          crt_time
----+----------------------------------+----------------------------
  2 | fbd92d03711c0816c02b26eda23d0b93 | 2019-02-04 15:01:28.842232
(1 row)

Stress test:

vi test1.sql
\set id random(1,1000000000)
select ins_tx(:id);

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 16 -j 16 -T 120 db1

As you can see, with 8 pub/sub pairs, up to 8 cores can be used to consume the changes in parallel.

The xDB pub server uses the built-in test_decoding plugin to handle WAL logical decoding.

db1=# select * from pg_get_replication_slots();
    slot_name    |    plugin     | slot_type |  datoid | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+---------------+-----------+---------+-----------+--------+------------+------+--------------+-------------+---------------------
 xdb_1910618_570 | test_decoding | logical | 1910618 | f | t | 61522 | | 1177241672 | 51/4473DD68 | 51/4474AE00
 xdb_1910618_568 | test_decoding | logical | 1910618 | f | t | 61516 | | 1177241672 | 51/4473DD68 | 51/4474AE00
 xdb_1910618_582 | test_decoding | logical | 1910618 | f | t | 61528 | | 1177241672 | 51/4473DD68 | 51/4474AE00
 xdb_1910618_566 | test_decoding | logical | 1910618 | f | t | 61510 | | 1177241672 | 51/4473DD68 | 51/4474AE00
 xdb_1910618_562 | test_decoding | logical |
1910618 | f | t | 61498 | | 1177241672 | 51/4473DD68 | 51/4474AE00 xdb_1910618_584 | test_decoding | logical | 1910618 | f | t | 61534 | | 1177241672 | 51/4473DD68 | 51/4474AE00 xdb_1910618_6 | test_decoding | logical | 1910618 | f | t | 61489 | | 1177241672 | 51/4473DD68 | 51/4474AE00 xdb_1910618_564 | test_decoding | logical | 1910618 | f | t | 61504 | | 1177241672 | 51/4473DD68 | 51/4474AE00 xdb_1910618_586 | test_decoding | logical | 1910618 | f | t | 61540 | | 1177241672 | 51/4473DD68 | 51/4474AE00 (9 rows) 源库 db1=# select application_name,query from pg_stat_activity where application_name='PostgreSQL JDBC Driver'; application_name | query ------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- PostgreSQL JDBC Driver | UPDATE _edb_replicator_pub.rrep_properties SET value=$1 WHERE key=$2 PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | SELECT db_host,db_port,db_name,db_user,db_password,db_type,url_options FROM _edb_replicator_sub.xdb_sub_database WHERE sub_db_id=31 PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | INSERT INTO _edb_replicator_pub.rrep_txset (set_id, pub_id, sub_id, status, start_rrep_sync_id, end_rrep_sync_id, last_repl_xid, last_repl_xid_timestamp) VALUES($1,$2,$3,$4,$5,$6,$7,$8) PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | SELECT 1 PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | COMMIT PostgreSQL JDBC Driver | INSERT INTO _edb_replicator_pub.rrep_txset (set_id, pub_id, sub_id, status, start_rrep_sync_id, end_rrep_sync_id, last_repl_xid, last_repl_xid_timestamp) VALUES($1,$2,$3,$4,$5,$6,$7,$8) PostgreSQL JDBC Driver | SELECT 1 (14 rows) 源库使用流复制协议,logical decoding技术获取增量。 db1=# select * from pg_stat_replication ; pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | repl ay_lag | sync_priority | sync_state -------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-------------+-------------+-------------+------------+-----------+-----------+----- -------+---------------+------------ 30636 | 16634 | digoal | | 127.0.0.1 | | 57908 | 2019-02-05 09:06:42.379879+08 | | streaming | 52/D3170F18 | 52/D24E5F60 | 52/D24E5F60 | | | | | 1 | sync 30645 | 16634 | digoal | | 127.0.0.1 | | 57912 | 2019-02-05 09:06:42.463486+08 | | streaming | 52/DA123D98 | 52/D85D4A40 | 52/D85D4A40 | | | | | 1 | potential 30657 | 16634 | digoal | | 127.0.0.1 | | 57916 | 2019-02-05 09:06:42.513406+08 | | streaming | 52/DAE6BF10 | 52/D717B0E8 | 52/D717B0E8 | | | | | 1 | potential 30664 | 16634 | digoal | | 127.0.0.1 | | 57918 | 2019-02-05 09:06:42.54752+08 | | streaming | 52/DB40FAC8 | 52/D9910E98 | 52/D9910E98 | | | | | 1 | potential 30670 | 16634 | digoal | | 127.0.0.1 | | 57920 | 2019-02-05 09:06:42.58003+08 | | streaming | 52/D9D004F0 | 52/D7EAC580 | 52/D7EAC580 | | | | | 1 | potential 30692 | 16634 | digoal | | 127.0.0.1 | | 57926 | 2019-02-05 09:06:42.610619+08 | | streaming | 52/DA37DB60 | 52/D8703390 | 52/D8703390 | | | | | 1 | potential 30698 | 16634 | digoal | | 127.0.0.1 | | 57928 | 2019-02-05 09:06:42.637593+08 | | streaming | 52/DAAB88E0 | 52/D8D66BD8 | 52/D8D66BD8 | | 
| | | 1 | potential
 30707 | 16634 | digoal | | 127.0.0.1 | | 57932 | 2019-02-05 09:06:42.660029+08 | | streaming | 52/DB829380 | 52/D95AEB10 | 52/D95AEB10 | | | | | 1 | potential
 30713 | 16634 | digoal | | 127.0.0.1 | | 57934 | 2019-02-05 09:06:42.684417+08 | | streaming | 52/DAA15428 | 52/D8B98AA8 | 52/D8B98AA8 | | | | | 1 | potential
(9 rows)

db1=# insert into t1 values (-1),(-2),(-3);
INSERT 0 3

db1=# select xmin,xmax,cmin,cmax,* from t1 where id in (-1,-2,-3);
    xmin    | xmax | cmin | cmax | id | info | crt_time
------------+------+------+------+----+------+----------
 1203620149 |    0 |    0 |    0 | -3 |      |
 1203620149 |    0 |    0 |    0 | -2 |      |
 1203620149 |    0 |    0 |    0 | -1 |      |
(3 rows)

Target database

db2=# select application_name,query from pg_stat_activity ;
    application_name    |                         query
------------------------+-------------------------------------------------------
                        |
                        |
 PostgreSQL JDBC Driver | COMMIT
 PostgreSQL JDBC Driver | COMMIT
 PostgreSQL JDBC Driver | COMMIT
 PostgreSQL JDBC Driver | COMMIT
 psql                   | select application_name,query from pg_stat_activity ;
                        |
                        |
                        |
(10 rows)

db2=# select xmin,xmax,cmin,cmax,* from t1 limit 100;
    xmin    | xmax | cmin | cmax |    id     |               info               |          crt_time
------------+------+------+------+-----------+----------------------------------+----------------------------
 1137051069 |    0 |    0 |    0 |         1 | 44893db346d0c599bb2c3de72a6a1b9e | 2019-02-04 15:01:27.539532
 1137051074 |    0 |    0 |    0 | 761776169 | 310e9b568dd1860afd9e12c9179a5068 | 2019-02-04 15:02:45.225487
 1137051074 |    0 |    1 |    1 | 665001137 | 46b42b0d62e21373aaaeb69afd76db63 | 2019-02-04 15:02:45.227018
 1137051074 |    0 |    2 |    2 | 697990337 | 877a5ec25b68bfc44d6c837a3f75c6e5 | 2019-02-04 15:02:45.227858
 1137051074 |    0 |    3 |    3 | 109521385 | c6f1b0d41a641a75fa9c07211efa0026 | 2019-02-04 15:02:45.228195
 1137051074 |    0 |    4 |    4 | 432996345 | 6980bdea340d8b23f5d065dc71342c4a | 2019-02-04 15:02:45.228366
 1137051074 |    0 |    5 |    5 | 850543097 | 0b06d401c1a74df3f100c63f350150ea | 2019-02-04 15:02:45.228332
 1137051074 |    0 |    6 |    6 | 954130457 | 8f1fca5404f72bd6079f7f503ef9594a | 2019-02-04 15:02:45.228319
 1137051074 |    0 |    7 |    7 | 373804529 | a7750ea5faa6e69a55cf2635fc62cb76 | 2019-02-04 15:02:45.226744
 1137051074 |    0 |    8 |    8 | 722564465 | c94d25c5c54c7ca801be9706f84def70 | 2019-02-04 15:02:45.228678
 1137051074 |    0 |    9 |    9 |  97279721 | a5374504b82575952dd22c3238729467 | 2019-02-04 15:02:45.228788
 1137051074 |    0 |   10 |   10 | 312386249 | a30c971886332fdb860cb0d6ab20ed9e | 2019-02-04 15:02:45.229182
 1137051074 |    0 |   11 |   11 | 785120921 | 9e176dc1e5ef4c75d085c87572c03f04 | 2019-02-04 15:02:45.229475
 1137051074 |    0 |   12 |   12 | 326792793 | 66cf1fe49b3018f756cb7b1c2303266b | 2019-02-04 15:02:45.229535
 1137051074 |    0 |   13 |   13 | 510541273 | fafc393cfef443eb05f069d91937da9b | 2019-02-04 15:02:45.229609

Note the command id columns (cmin/cmax): the target database replays the changes row by row.

db2=# select xmin,xmax,cmin,cmax,* from t1 where id in (-1,-2,-3);
    xmin    | xmax | cmin | cmax | id | info | crt_time
------------+------+------+------+----+------+----------
 1137058058 |    0 |    2 |    2 | -3 |      |
 1137058058 |    0 |    1 |    1 | -2 |      |
 1137058058 |    0 |    0 |    0 | -1 |      |
(3 rows)

Potential kernel-level improvement: today a decoding slot has to scan the entire WAL stream. In the future, WAL could for example be stored split by user-defined TABLE groups (multiple WAL GROUPs, with each table assigned to a group), so that decoding a single table only needs to parse one WAL GROUP, avoiding wasted work. The optimization approach is similar in spirit to the schema-less and space-optimization ideas in:

《PostgreSQL 时序最佳实践 - 证券交易系统数据库设计 - 阿里云RDS PostgreSQL最佳实践》
《PostgreSQL 空间切割(st_split, ST_Subdivide)功能扩展 - 空间对象网格化 (多边形GiST优化)》
《PostgreSQL 空间st_contains,st_within空间包含搜索优化 - 降IO和降CPU(bound box) (多边形GiST优化)》

Summary

1、xDB replication server supports one-way and two-way, full (snapshot) and incremental synchronization between oracle, sql server, pg and ppas.

1.1、Scenarios supported by xDB replication server SMR. Advanced
Server指EDB提供的PPAS(兼容Oracle)。 1、Replication between PostgreSQL and Advanced Server databases (between products in either direction) 2、Replication from Oracle to PostgreSQL 3、Replication in either direction between Oracle and Advanced Server 4、Replication in either direction between SQL Server and PostgreSQL 5、Replication in either direction between SQL Server and Advanced Server 1.2、xDB replication server MMR支持场景 双向同步仅支持pg, ppas。 1、PostgreSQL database servers 2、PostgreSQL database servers and Advanced Servers operating in PostgreSQL compatible mode (EDB PPAS使用PG兼容模式时) 3、Advanced Servers operating in PostgreSQL compatible mode 4、Advanced Servers operating in Oracle compatible mode 2、本文简单描述了xDB的使用,以及PG与PG的SMR例子。 3、增量同步性能取决于网络带宽,事务大小,CPU资源,组并行度 等因素。本文测试场景,未优化的情况下,每秒约同步2万行。性能有极大提升空间。 四、附录 xDB replication console 命令行 熟悉了xDB的使用流程后,可以考虑使用console命令行来管理xDB。 [root@pg11-test bin]# java -jar ./edb-repcli.jar -help Usage: java -jar edb-repcli.jar [OPTIONS] Where OPTIONS include: -help Prints out Replication CLI command-line usage -version Prints out Replication CLI version -encrypt -input <file> -output <file> Encrypts input file to output file -repversion -repsvrfile <file> Prints Replication Server version -uptime -repsvrfile <file> Prints the time since the Replication Server has been in running state. Publication: -addpubdb -repsvrfile <file> -dbtype {oracle | enterprisedb | postgresql | sqlserver} -dbhost <host> -dbport <port> -dbuser <user> {-dbpassword <encpassword> | dbpassfile <file>} -database {<database> | <service>} [-oraconnectiontype {sid | servicename}] [-urloptions <JDBC extended URL parameters>] [-filterrule {publication table filter id}] [-repgrouptype {m | s}] [-initialsnapshot [-verboseSnapshotOutput {true|false}]] [-nodepriority {1 to 10}] [-replicatepubschema {true|false}] [-changesetlogmode {T|W}] Adds publication database -updatepubdb -repsvrfile <file> -pubdbid <id> -dbhost <host> -dbport <port> -dbuser <user> {-dbpassword <encpassword> | dbpassfile <file>} -database {<database> | <service>} [-oraconnectiontype {sid | servicename}] [-urloptions <JDBC extended URL parameters>] [-nodepriority {1 to 10}] Update publication database -printpubdbids -repsvrfile <file> -printpubdbidsdetails -repsvrfile <file> -removepubdb -repsvrfile <file> -pubdbid <id> -gettablesfornewpub -repsvrfile <file> -pubdbid <id> -createpub <pubName> -repsvrfile <file> -pubdbid <id> -reptype {T|S} -tables <schema1>.<table1> [<schema1>.<table2>...] [-views <schema1>.<view1> [<schema1>.<view2>...]] [-tablesfilterclause <index1>:<filterName>:<clause> [<index2>:<filterName>:<clause>...]] [-viewsfilterclause <index1>:<filterName>:<clause> [<index2>:<filterName>:<clause>...]][-conflictresolution <index1>:<{E|L|N|M|C:<custom_handler>}> [<index2>:<{E|L|N|M|C:<custom_handler>}>...]] [-standbyconflictresolution <index1>:<{E|L|N|M|C:<custom_handler>}> [<index2>:<{E|L|N|M|C:<custom_handler>}>...]] [-repgrouptype {M|S}] -validatepubs -repsvrfile <file> -pubdbid <id> -repgrouptype {m|s} -printpubfilterslist <pubName> -repsvrfile <file> Prints publication filters list -printpublist -repsvrfile <file> [-pubdbid <id>] [-printpubid] Prints publications list -printpublishedtables <pubName> -repsvrfile <file> Print published tables -removepub <pubName1> [<pubName2>...] -repsvrfile <file> -repgrouptype {m | s} -addtablesintopub <pubName> -repsvrfile <file> -tables <schema1>.<table1> [<schema1>.<table2>...] 
[-views <schema1>.<view1> [<schema1>.<view2>...]] [-tablesfilterclause <index1>:<filterName>:<clause> [<index2>:<filterName>:<clause>...]] [-viewsfilterclause <index1>:<filterName>:<clause> [<index2>:<filterName>:<clause>...]] [-conflictresolution <index1>:<{E|L|N|M|C:<custom_handler>}> [<index2>:<{E|L|N|M|C:<custom_handler>}>...]] [-standbyconflictresolution <index1>:<{E|L|N|M|C:<custom_handler>}> [<index2>:<{E|L|N|M|C:<custom_handler>}>...]] [-repgrouptype {M|S}] -removetablesfrompub <pubName> -repsvrfile <file> -tables <schema1>.<table1> [<schema1>.<table2>...] [-views <schema1>.<view1> [<schema1>.<view2>...]] -cleanrephistory -repsvrfile <file> -cleanrephistoryforpub <pubName> -repsvrfile <file> -cleanshadowhistforpub <pubName> -repsvrfile <file> [-mmrdbid <dbid1>[,<dbid2>...]] -confcleanupjob <pubdbid> -repsvrfile <file> {-enable {-hourly <1-12> | -daily <0-23> | -minutely <1-59> | -cronexpr <"valid cron expression"> | -weekly <Monday-Sunday> <0-23>} | -disable} -confschedule <subName> -repsvrfile <file> {-remove | {-jobtype {t | s} {-realtime <1-n> | -daily <0-23> <0-59> | -weekly <Mon,Tue,...,Sun> <0-23> <0-59> | -monthly <Jan,Feb,...,Dec> <1-31> <0-23> <0-59> | -cronexpr <"cronexpression">}}} -confschedulemmr <pubdbid> -pubname <pubname> -repsvrfile <file> {-remove | {{-realtime <1-n> | -daily <0-23> <0-59> | -weekly <Mon,Tue,...,Sun> <0-23> <0-59> | -monthly <Jan,Feb,...,Dec> <1-31> <0-23> <0-59> | -cronexpr <"cronexpression">}}} -printschedule {<pubName> | <subName>} -repsvrfile {<pubsvrfile> | <subsvrfile>} -repgrouptype {m | s} -validatepub <pubName> -repsvrfile <file> -repgrouptype {m | s} -dommrsnapshot <pubname> -pubhostdbid <pubdbid> -repsvrfile <file> [-verboseSnapshotOutput {true|false}] -replicateddl <pubname> -table <tableName> -repsvrfile <file> -ddlscriptfile <filepath> -printconfresolutionstrategy <pubName> -repsvrfile <file> -table <tableName> -updateconfresolutionstrategy <pubName> -repsvrfile <file> -table <tableName> -conflictresolution <{E|L|N|M|C}> -standbyconflictresolution <{E|L|N|M|C}> [-customhandlername <customHandlerProcName>] -setasmdn <pubdbid> -repsvrfile <file> -setascontroller <pubdbid> -repsvrfile <file> -printcontrollerdbid -repsvrfile <file> Prints out Controller database id Subscription: -addsubdb -repsvrfile <file> -dbtype {oracle | enterprisedb | postgresql | sqlserver} -dbhost <host> -dbport <port> -dbuser <user> {-dbpassword <encpassword> | -dbpassfile <file>} -database {<database> | <service>} [-urloptions <JDBC extended URL parameters>] [-oraconnectiontype {sid | servicename}] Adds subscription database -updatesubdb -repsvrfile <file> -subdbid <id> -dbhost <host> -dbport <port> -dbuser <user> {-dbpassword <encpassword> | -dbpassfile <file>} -database {<database> | <service>} [-urloptions <JDBC extended URL parameters>] [-oraconnectiontype {sid | servicename}] Update subscription database -updatesub <subname> -subsvrfile <file> -pubsvrfile <file> -host <host> -port <port> Update host/port of source publication server for a subscription -printsubdbids -repsvrfile <file> -printsubdbidsdetails -repsvrfile <file> -printmdndbid -repsvrfile <file> -printsublist -repsvrfile <file> -subdbid <id> Prints subscriptions list -removesubdb -repsvrfile <file> -subdbid <id> -createsub <subname> -subdbid <id> -subsvrfile <file> -pubsvrfile <file> -pubname <pubName> -filterrule <publication table filters id(s)> -dosnapshot <subname> -repsvrfile <file> [-verboseSnapshotOutput {true|false}] -dosynchronize {<subname> | <pubname>} -repsvrfile {<subsvrfile> | 
<pubsvrfile>} [-repgrouptype {s|m}] -removesub <subname> -repsvrfile <file> -addfilter <pubName> -repsvrfile <file> -tables <schema1>.<table1> [<schema1>.<table2>...] [-views <schema1>.<view1> [<schema1>.<view2>...]] [-tablesfilterclause <index1>:<name>:<clause> [<index2>:<name1>:<clause>...]] [-viewsfilterclause <index1>:<name>:<clause> [<index2>:<name>:<clause>...]] -updatefilter <pubName> -repsvrfile <file> -tablesfilterclause <filterid>:<updatedclause> [<filterid>:<updatedclause>...] -removefilter <pubName> -repsvrfile <file> -filterid <filterid> -enablefilter -repsvrfile <file> {-dbid <id> | -subname <name>} -filterids <filterid_1> [<filterid_2>...] -disablefilter -repsvrfile <file> {-dbid <id> | -subname <name>} -filterids <filterid_1> [<filterid_2>...] 重启xDB sub,pub server digoal@pg11-test-> ps -ewf|grep xdb digoal 16942 1 0 Feb03 ? 00:00:00 /bin/bash -c cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin; ./runPubServer.sh >> /var/log/edb/xdbpubserver/edb-xdbpubserver.log 2>&1 & digoal 17024 16942 0 Feb03 ? 00:03:30 /usr/bin/java -XX:-UsePerfData -Xms256m -Xmx1536m -XX:ErrorFile=/var/log/xdb-6.2/pubserver_pid_%p.log -Djava.library.path=/opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar pubserver 9051 digoal 17120 1 0 Feb03 ? 00:00:00 /bin/bash -c cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin; ./runSubServer.sh >> /var/log/edb/xdbsubserver/edb-xdbsubserver.log 2>&1 & digoal 17202 17120 0 Feb03 ? 00:00:58 /usr/bin/java -XX:-UsePerfData -XX:ErrorFile=/var/log/xdb-6.2/subserver_pid_%p.log -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar subserver 9052 digoal@pg11-test-> kill 17024 digoal@pg11-test-> ps -ewf|grep xdb digoal 17120 1 0 Feb03 ? 00:00:00 /bin/bash -c cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin; ./runSubServer.sh >> /var/log/edb/xdbsubserver/edb-xdbsubserver.log 2>&1 & digoal 17202 17120 0 Feb03 ? 
00:00:58 /usr/bin/java -XX:-UsePerfData -XX:ErrorFile=/var/log/xdb-6.2/subserver_pid_%p.log -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar subserver 9052 digoal@pg11-test-> kill 17202 digoal@pg11-test-> ps -ewf|grep xdb su - digoal cat /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/sysconfig/xdbReplicationServer-62.config #!/bin/sh JAVA_EXECUTABLE_PATH="/usr/bin/java" JAVA_MINIMUM_VERSION=1.7 JAVA_BITNESS_REQUIRED=64 JAVA_HEAP_SIZE="-Xms4096m -Xmx16384m" PUBPORT=9051 SUBPORT=9052 cd /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin nohup ./runPubServer.sh >> /var/log/edb/xdbpubserver/edb-xdbpubserver.log 2>&1 & nohup ./runSubServer.sh >> /var/log/edb/xdbsubserver/edb-xdbsubserver.log 2>&1 & digoal@pg11-test-> ps -ewf|grep xdb digoal 7767 7687 1 10:46 pts/8 00:00:01 /usr/bin/java -XX:-UsePerfData -Xms4096m -Xmx16384m -XX:ErrorFile=/var/log/xdb-6.2/pubserver_pid_%p.log -Djava.library.path=/opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar pubserver 9051 digoal 7981 7901 2 10:47 pts/8 00:00:01 /usr/bin/java -XX:-UsePerfData -XX:ErrorFile=/var/log/xdb-6.2/subserver_pid_%p.log -Djava.awt.headless=true -jar /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/bin/edb-repserver.jar subserver 9052 参考 2、《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》 3、《Linux vnc server, vnc viewer(远程图形桌面)使用》 4、xDB 配置文件 /opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/xdb_pubserver.conf 性能相关配置 #This option represents the MTK option "-cpBatchSize" that has a default value of 8MB. #The user can customize the default value to optimize the data speed for Snapshot #that involves large datasets and enough memory on the system. # size in MB #cpBatchSize=8 #This option represents the MTK option "-batchSize" that has a default value of 100 rows. # size in rows #batchSize=100 #The option to import Oracle Partitioned table as a normal table in PPAS/PPSS. #importPartitionAsTable=false #It controls how many rows are fetched from the publication database in one round (network) trip. For example, #if there are 1000 row changes available in shadow table(s), the default fetch size will require 5 database round trips. #Hence using a fetch size of 500 will bring all the changes in 2 round trips. Fine tune the performance by using a fetch size #that conforms to the average data volume consumed by rows fetched in one round trip. #syncFetchSize=200 #Synchronize Replication batch size. Default to 100 statements per batch. #syncBatchSize=100 #This defines the maximum number of transactional rows that can be grouped in a single transaction set. #The xDB loads and processes the delta changes by fetching as many rows in memory as grouped in a single #transaction set. A higher value is expected to boost the performance. However increasing it to a very large #value might result in out of memory error, hence increase/decrease the default value in accordance with #the average row size (low/high). #txSetMaxSize=10000 #This option controls the number of maximum threads used to load data from source publication tables #in parallel mode. The default count is 4, however depending on the target system #architecture specifically multi CPUs/cores one can choose to specify a custom count (normally #equals CPU/core count) to fully utilize the system resources. 
#syncLoadThreadLimit=4

#It defines the upper limit for number of (WAL) entries that can be hold in the queue
#A value of zero indicates there will be no upper limit. The default is set to 10000.
#walStreamQueueLimit=10000

/opt/PostgreSQL/EnterpriseDB-xDBReplicationServer/etc/xdb_subserver.conf 性能相关配置

#The option to import Oracle Partitioned table as a normal table in PPAS/PPSS.
#importPartitionAsTable=false

#This option controls the number of threads used to perform snapshot data migration in parallel mode.
#The default behavior is to use a single data loader thread. However depending on the target system
#architecture specifically multi CPUs/cores one can choose to specify a custom count (normally
#equals CPU/core count) to fully utilize the system resources.
#snapshotParallelLoadCount=1
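One operational addendum: every pub tables group keeps a logical replication slot open on the publication database, and a stalled slot retains WAL. A minimal monitoring sketch (standard PG 10+/11 catalog views; thresholds and alerting are up to you):

-- Run on the publication database: WAL retained and decode lag per xDB slot.
-- pg_wal_lsn_diff() returns the byte distance between two LSNs.
select slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))         as retained_wal,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as decode_lag
from pg_replication_slots
where slot_type = 'logical';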
Background

VACUUM has a cleanup stage. Previously, as long as a table had indexes, this stage scanned every index of the table in full, whether or not there were actually any pages to clean.

PostgreSQL 11 adds the GUC vacuum_cleanup_index_scale_factor, together with a B-tree index storage parameter of the same name. The B-tree index meta page now records how many heap tuples the table had (the meta page is only updated when VACUUM finds no dead tuples at all, per the pg_stat_all_tables counters). The VACUUM cleanup stage only needs to SCAN INDEX and refresh the index stats (including the meta-page counter) when:

1、((the table's pg_stat_all_tables insert counter - the meta-page count) / the meta-page count) exceeds vacuum_cleanup_index_scale_factor; or

2、there are deleted pages that can be recycled during cleanup, in which case the index pages must always be scanned.

Therefore VACUUM of insert-only tables (no UPDATE or DELETE), or of essentially static tables, becomes much faster, because no index scan is needed.

Background documentation

vacuum_cleanup_index_scale_factor (floating point)

Specifies the fraction of the total number of heap tuples counted in the previous statistics collection that can be inserted without incurring an index scan at the VACUUM cleanup stage. This setting currently applies to B-tree indexes only.

If no tuples were deleted from the heap, B-tree indexes are still scanned at the VACUUM cleanup stage when at least one of the following conditions is met:

1、the index statistics are stale,
2、or the index contains deleted pages that can be recycled during cleanup.

Index statistics are considered to be stale if the number of newly inserted tuples exceeds the vacuum_cleanup_index_scale_factor fraction of the total number of heap tuples detected by the previous statistics collection. The total number of heap tuples is stored in the index meta-page. Note that the meta-page does not include this data until VACUUM finds no dead tuples, so B-tree index scan at the cleanup stage can only be skipped if the second and subsequent VACUUM cycles detect no dead tuples. (the typical insert-only scenario, or after VACUUM has removed all dead tuples)

The value can range from 0 to 10000000000. When vacuum_cleanup_index_scale_factor is set to 0, index scans are never skipped during VACUUM cleanup. The default value is 0.1.

B-tree indexes additionally accept this parameter:

vacuum_cleanup_index_scale_factor
Per-index value for vacuum_cleanup_index_scale_factor.

Relevant code

src/backend/access/nbtree/nbtree.c

/*
 * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
 *                               btbulkdelete() wasn't called.
 */
static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
{
....
....
    {
        StdRdOptions *relopts;
        float8        cleanup_scale_factor;
        float8        prev_num_heap_tuples;

        /*
         * If table receives enough insertions and no cleanup was performed,
         * then index would appear have stale statistics. If scale factor is
         * set, we avoid that by performing cleanup if the number of inserted
         * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
         * original tuples count.
         */
        relopts = (StdRdOptions *) info->index->rd_options;
        cleanup_scale_factor = (relopts &&
                                relopts->vacuum_cleanup_index_scale_factor >= 0)
            ? relopts->vacuum_cleanup_index_scale_factor
            : vacuum_cleanup_index_scale_factor;
        prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;

        if (cleanup_scale_factor <= 0 ||
            prev_num_heap_tuples < 0 ||
            (info->num_heap_tuples - prev_num_heap_tuples) /
            prev_num_heap_tuples >= cleanup_scale_factor)
            // decides whether to scan the index: the index counts as stale based on
            // the counters and the vacuum_cleanup_index_scale_factor parameter
            result = true;
    }

/*
 * Post-VACUUM cleanup.
 *
 * Result: a palloc'd struct containing statistical info for VACUUM displays.
*/ IndexBulkDeleteResult * btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats) { /* No-op in ANALYZE ONLY mode */ if (info->analyze_only) return stats; /* * If btbulkdelete was called, we need not do anything, just return the * stats from the latest btbulkdelete call. If it wasn't called, we might * still need to do a pass over the index, to recycle any newly-recyclable * pages or to obtain index statistics. _bt_vacuum_needs_cleanup * determines if either are needed. * * Since we aren't going to actually delete any leaf items, there's no * need to go through all the vacuum-cycle-ID pushups. */ if (stats == NULL) { TransactionId oldestBtpoXact; /* Check if we need a cleanup */ if (!_bt_vacuum_needs_cleanup(info)) // 不需要scan index return NULL; stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); btvacuumscan(info, stats, NULL, NULL, 0, &oldestBtpoXact); // 需要SCAN index /* Update cleanup-related information in the metapage */ _bt_update_meta_cleanup_info(info->index, oldestBtpoXact, info->num_heap_tuples); } /* * It's quite possible for us to be fooled by concurrent page splits into * double-counting some index tuples, so disbelieve any total that exceeds * the underlying heap's count ... if we know that accurately. Otherwise * this might just make matters worse. */ if (!info->estimated_count) { if (stats->num_index_tuples > info->num_heap_tuples) stats->num_index_tuples = info->num_heap_tuples; } return stats; } 例子 tbl1,每次vacuum都要scan index create table tbl1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int); create index idx_tbl1_1 on tbl1 (c1) with (vacuum_cleanup_index_scale_factor=0); create index idx_tbl1_2 on tbl1 (c2) with (vacuum_cleanup_index_scale_factor=0); create index idx_tbl1_3 on tbl1 (c3) with (vacuum_cleanup_index_scale_factor=0); create index idx_tbl1_4 on tbl1 (c4) with (vacuum_cleanup_index_scale_factor=0); create index idx_tbl1_5 on tbl1 (c5) with (vacuum_cleanup_index_scale_factor=0); create index idx_tbl1_6 on tbl1 (c6) with (vacuum_cleanup_index_scale_factor=0); tbl2,当有deleted page需要recycle使用时,或者当((pg_stat_all_tables.inserted-metapage(上一次vacuum有多少条记录))/上一次vacuum有多少条记录) > 100000000 时,才需要scan index。 create table tbl2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int); create index idx_tbl2_1 on tbl2 (c1) with (vacuum_cleanup_index_scale_factor=100000000); create index idx_tbl2_2 on tbl2 (c2) with (vacuum_cleanup_index_scale_factor=100000000); create index idx_tbl2_3 on tbl2 (c3) with (vacuum_cleanup_index_scale_factor=100000000); create index idx_tbl2_4 on tbl2 (c4) with (vacuum_cleanup_index_scale_factor=100000000); create index idx_tbl2_5 on tbl2 (c5) with (vacuum_cleanup_index_scale_factor=100000000); create index idx_tbl2_6 on tbl2 (c6) with (vacuum_cleanup_index_scale_factor=100000000); 分别写入1000万记录 insert into tbl1 select id,id,id,id,id,id from generate_series(1,10000000) t(id); insert into tbl2 select id,id,id,id,id,id from generate_series(1,10000000) t(id); 观察两个表的二次VACUUM耗时 \timing vacuum verbose tbl1; vacuum verbose tbl1; vacuum verbose tbl2; vacuum verbose tbl2; postgres=# vacuum verbose tbl1; INFO: vacuuming "public.tbl1" INFO: index "idx_tbl1_1" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_2" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. 
CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_3" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_4" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_5" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_6" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: "tbl1": found 0 removable, 10000000 nonremovable row versions in 63695 out of 63695 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1137047265 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.75 s, system: 0.00 s, elapsed: 0.76 s. VACUUM Time: 771.943 ms postgres=# vacuum verbose tbl1; INFO: vacuuming "public.tbl1" INFO: index "idx_tbl1_1" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_2" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_3" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_4" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_5" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_6" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: "tbl1": found 0 removable, 42 nonremovable row versions in 1 out of 63695 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1137047265 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s. VACUUM Time: 141.759 ms postgres=# vacuum verbose tbl1; INFO: vacuuming "public.tbl1" INFO: index "idx_tbl1_1" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_2" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 
0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_3" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_4" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_5" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl1_6" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: "tbl1": found 0 removable, 42 nonremovable row versions in 1 out of 63695 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1137047265 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.12 s, system: 0.00 s, elapsed: 0.13 s. VACUUM Time: 140.984 ms postgres=# vacuum verbose tbl2; INFO: vacuuming "public.tbl2" INFO: index "idx_tbl2_1" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl2_2" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl2_3" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl2_4" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl2_5" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: index "idx_tbl2_6" now contains 10000000 row versions in 27421 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s. INFO: "tbl2": found 0 removable, 10000000 nonremovable row versions in 63695 out of 63695 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1137047265 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.84 s, system: 0.00 s, elapsed: 0.85 s. VACUUM Time: 860.749 ms postgres=# vacuum verbose tbl2; INFO: vacuuming "public.tbl2" INFO: "tbl2": found 0 removable, 42 nonremovable row versions in 1 out of 63695 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1137047265 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. 
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
Time: 11.895 ms

postgres=# vacuum verbose tbl2;
INFO:  vacuuming "public.tbl2"
INFO:  "tbl2": found 0 removable, 42 nonremovable row versions in 1 out of 63695 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1137047265
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
Time: 11.944 ms

Analysis

tbl1 must scan all of its indexes on every VACUUM, so each run is more expensive.

tbl2 was created with vacuum_cleanup_index_scale_factor=100000000; at the next VACUUM, the inserts since the last cleanup minus the last recorded total, divided by the last recorded total, is far below 100000000, so the index scan is skipped.

The timing difference is striking: roughly 141 ms versus 12 ms for the repeated VACUUM runs above.

With pageinspect, the total table tuple count stored on the B-tree index meta page can be observed.

postgres=# create extension pageinspect;
CREATE EXTENSION

postgres=# select * from bt_metap('idx_tbl1_1');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |  412 |     2 |      412 |         2 |           0 |             9.99977e+06
(1 row)

Time: 0.345 ms

postgres=# select * from bt_metap('idx_tbl2_1');
 magic  | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples
--------+---------+------+-------+----------+-----------+-------------+-------------------------
 340322 |       3 |  412 |     2 |      412 |         2 |           0 |                   1e+07
(1 row)

Time: 0.429 ms

References

src/backend/access/nbtree/nbtree.c
https://www.postgresql.org/docs/11/sql-createindex.html
https://www.postgresql.org/docs/11/runtime-config-client.html#GUC-VACUUM-CLEANUP-INDEX-SCALE-FACTOR
《PostgreSQL pg_stat_ pg_statio_ 统计信息(scan,read,fetch,hit)源码解读》
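As a follow-up to the bt_metap check above: the storage parameter can also be changed after index creation, and the stale fraction that _bt_vacuum_needs_cleanup compares against can be approximated from the meta page plus pg_class. A sketch (reltuples is itself an estimate, so treat the result as indicative only):

-- Raise the per-index threshold without rebuilding the index
alter index idx_tbl1_1 set (vacuum_cleanup_index_scale_factor = 100000000);

-- Approximate (num_heap_tuples - btm_last_cleanup_num_heap_tuples) / btm_last_cleanup_num_heap_tuples
select c.reltuples                                                          as heap_tuples_now,
       m.last_cleanup_num_tuples,
       (c.reltuples - m.last_cleanup_num_tuples) / m.last_cleanup_num_tuples as stale_fraction
from bt_metap('idx_tbl1_1') m, pg_class c
where c.relname = 'tbl1';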
背景 当一个进程处于等待(被堵塞)状态时,是谁干的?可以使用如下函数,快速得到捣蛋(堵塞别人)的PID。 1、请求锁时被堵,是哪些PID堵的? pg_blocking_pids(int) int[] Process ID(s) that are blocking specified server process ID from acquiring a lock 2、请求safe快照时被堵(SSI隔离级别,请求安全快照冲突),是哪些PID堵的? pg_safe_snapshot_blocking_pids(int) int[] Process ID(s) that are blocking specified server process ID from acquiring a safe snapshot 例子 1、会话1 postgres=# begin; BEGIN postgres=# select * from tbl limit 1; id | c1 | c2 --------+----+---- 918943 | 1 | 0 (1 row) postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 30862 (1 row) 2、会话2 postgres=# begin; BEGIN postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 30928 (1 row) postgres=# truncate tbl; 等待中 3、会话3 postgres=# begin; BEGIN postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 30936 (1 row) postgres=# select * from tbl limit 1; 等待中 4、会话4 postgres=# select pg_backend_pid(); pg_backend_pid ---------------- 30999 (1 row) postgres=# select * from tbl limit 1; 等待中 5、查看捣蛋PID postgres=# select pid,pg_blocking_pids(pid),wait_event_type,wait_event,query from pg_stat_activity ; pid | pg_blocking_pids | wait_event_type | wait_event | query -------+------------------+-----------------+---------------------+------------------------------------------------------------------------------------------- 30862 | {} | Client | ClientRead | select pg_backend_pid(); 30928 | {30862} | Lock | relation | truncate tbl; 30936 | {30928} | Lock | relation | select * from tbl limit 1; 30999 | {30928} | Lock | relation | select * from tbl limit 1; 代码 src/backend/utils/adt/lockfuncs.c /* * pg_blocking_pids - produce an array of the PIDs blocking given PID * * The reported PIDs are those that hold a lock conflicting with blocked_pid's * current request (hard block), or are requesting such a lock and are ahead * of blocked_pid in the lock's wait queue (soft block). * * In parallel-query cases, we report all PIDs blocking any member of the * given PID's lock group, and the reported PIDs are those of the blocking * PIDs' lock group leaders. This allows callers to compare the result to * lists of clients' pg_backend_pid() results even during a parallel query. * * Parallel query makes it possible for there to be duplicate PIDs in the * result (either because multiple waiters are blocked by same PID, or * because multiple blockers have same group leader PID). We do not bother * to eliminate such duplicates from the result. * * We need not consider predicate locks here, since those don't block anything. */ Datum pg_blocking_pids(PG_FUNCTION_ARGS) { ............... /* * pg_safe_snapshot_blocking_pids - produce an array of the PIDs blocking * given PID from getting a safe snapshot * * XXX this does not consider parallel-query cases; not clear how big a * problem that is in practice */ Datum pg_safe_snapshot_blocking_pids(PG_FUNCTION_ARGS) { ........... src/backend/storage/lmgr/predicate.c /* * GetSafeSnapshotBlockingPids * If the specified process is currently blocked in GetSafeSnapshot, * write the process IDs of all processes that it is blocked by * into the caller-supplied buffer output[]. The list is truncated at * output_size, and the number of PIDs written into the buffer is * returned. Returns zero if the given PID is not currently blocked * in GetSafeSnapshot. */ int GetSafeSnapshotBlockingPids(int blocked_pid, int *output, int output_size) { int num_written = 0; SERIALIZABLEXACT *sxact; LWLockAcquire(SerializableXactHashLock, LW_SHARED); /* Find blocked_pid's SERIALIZABLEXACT by linear search. 
*/ for (sxact = FirstPredXact(); sxact != NULL; sxact = NextPredXact(sxact)) { if (sxact->pid == blocked_pid) break; } /* Did we find it, and is it currently waiting in GetSafeSnapshot? */ if (sxact != NULL && SxactIsDeferrableWaiting(sxact)) { RWConflict possibleUnsafeConflict; /* Traverse the list of possible unsafe conflicts collecting PIDs. */ possibleUnsafeConflict = (RWConflict) SHMQueueNext(&sxact->possibleUnsafeConflicts, &sxact->possibleUnsafeConflicts, offsetof(RWConflictData, inLink)); while (possibleUnsafeConflict != NULL && num_written < output_size) { output[num_written++] = possibleUnsafeConflict->sxactOut->pid; possibleUnsafeConflict = (RWConflict) SHMQueueNext(&sxact->possibleUnsafeConflicts, &possibleUnsafeConflict->inLink, offsetof(RWConflictData, inLink)); } } LWLockRelease(SerializableXactHashLock); return num_written; }

参考

https://www.postgresql.org/docs/11/functions-info.html
《PostgreSQL 锁等待排查实践 - 珍藏级 - process xxx1 acquired RowExclusiveLock on relation xxx2 of database xxx3 after xxx4 ms at xxx》
《PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁》
《PostgreSQL 锁等待跟踪》
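pg_blocking_pids() also composes nicely with pg_stat_activity for a one-shot "who blocks whom" report; a sketch:

-- One row per (waiter, blocker) pair, with what the blocker is currently running
select w.pid   as waiter_pid,
       w.query as waiter_query,
       b.pid   as blocker_pid,
       b.query as blocker_query
from pg_stat_activity w
cross join lateral unnest(pg_blocking_pids(w.pid)) as blk(pid)
join pg_stat_activity b on b.pid = blk.pid
where w.wait_event_type = 'Lock';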
Background

How can you shut down, or restart, the database directly through a SQL interface?

Shutting down or restarting the database is a relatively risky operation, and being able to do it via SQL sounds surprising, because SQL is normally the usage interface rather than the management interface. It is not that a database cannot manage itself through SQL; rather, this is genuinely risky and not a core database capability.

Still, for management convenience, the database does provide many administration functions (callable via SQL). For example:

https://www.postgresql.org/docs/11/functions-info.html

So, can we shut down or restart the database through SQL? (Normally we log in to the OS hosting the database and run pg_ctl.)

How shutdown works under the hood

Shutting down the database actually means sending a signal to the postgres master process (the parent process started with the database); on receiving the signal, the process acts accordingly. This can be seen in the postmaster.c code or via man postgres:

man postgres

To terminate the postgres server normally, the signals SIGTERM, SIGINT, or SIGQUIT can be used. The first will wait for all clients to terminate before quitting, the second will forcefully disconnect all clients, and the third will quit immediately without proper shutdown, resulting in a recovery run during restart.

How do we get the postmaster PID? Just read the postmaster.pid file:

postgres=# select * from pg_read_file('postmaster.pid');
        pg_read_file
----------------------------
 30503                     +
 /data01/digoal/pg_root8001+
 1549031862                +
 8001                      +
 .                         +
 0.0.0.0                   +
 8001001  39288833         +
 ready                     +
(1 row)

30503 is the PID of the postmaster process.

Shutting the database down means sending one of the signals to this PID (SIGTERM = smart shutdown, SIGINT = fast shutdown, SIGQUIT = immediate shutdown).

Sending signals to database processes

src/backend/utils/adt/misc.c

1、Send SIGHUP to the postmaster process, used to reload the configuration.

/*
 * Signal to reload the database configuration
 *
 * Permission checking for this function is managed through the normal
 * GRANT system.
 */
Datum
pg_reload_conf(PG_FUNCTION_ARGS)
{
    if (kill(PostmasterPid, SIGHUP))
    {
        ereport(WARNING,
                (errmsg("failed to send signal to postmaster: %m")));
        PG_RETURN_BOOL(false);
    }
    PG_RETURN_BOOL(true);
}

2、Send a signal to an ordinary backend, used to cancel a query or terminate a session.

/*
 * Signal to terminate a backend process. This is allowed if you are a member
 * of the role whose process is being terminated.
 *
 * Note that only superusers can signal superuser-owned processes.
 */
Datum
pg_terminate_backend(PG_FUNCTION_ARGS)
{
    int r = pg_signal_backend(PG_GETARG_INT32(0), SIGTERM);

    if (r == SIGNAL_BACKEND_NOSUPERUSER)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 (errmsg("must be a superuser to terminate superuser process"))));
    if (r == SIGNAL_BACKEND_NOPERMISSION)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 (errmsg("must be a member of the role whose process is being terminated or member of pg_signal_backend"))));
    PG_RETURN_BOOL(r == SIGNAL_BACKEND_SUCCESS);
}

/*
 * Signal to cancel a backend process. This is allowed if you are a member of
 * the role whose process is being canceled.
 *
 * Note that only superusers can signal superuser-owned processes.
 */
Datum
pg_cancel_backend(PG_FUNCTION_ARGS)
{
    int r = pg_signal_backend(PG_GETARG_INT32(0), SIGINT);

    if (r == SIGNAL_BACKEND_NOSUPERUSER)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 (errmsg("must be a superuser to cancel superuser query"))));
    if (r == SIGNAL_BACKEND_NOPERMISSION)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 (errmsg("must be a member of the role whose query is being canceled or member of pg_signal_backend"))));
    PG_RETURN_BOOL(r == SIGNAL_BACKEND_SUCCESS);
}

src/backend/utils/adt/misc.c

/*
 * Send a signal to another backend.
 *
 * The signal is delivered if the user is either a superuser or the same
 * role as the backend being signaled. For "dangerous" signals, an explicit
 * check for superuser needs to be done prior to calling this function.
 *
 * Returns 0 on success, 1 on general failure, 2 on normal permission error
 * and 3 if the caller needs to be a superuser.
 *
 * In the event of a general failure (return code 1), a warning message will
 * be emitted. For permission errors, doing that is the responsibility of
 * the caller.
*/
#define SIGNAL_BACKEND_SUCCESS 0
#define SIGNAL_BACKEND_ERROR 1
#define SIGNAL_BACKEND_NOPERMISSION 2
#define SIGNAL_BACKEND_NOSUPERUSER 3
static int
pg_signal_backend(int pid, int sig)
{
。。。
    if (proc == NULL)
    {
        /*
         * This is just a warning so a loop-through-resultset will not abort
         * if one backend terminated on its own during the run.
         */
        ereport(WARNING,
                (errmsg("PID %d is not a PostgreSQL server process", pid)));
        return SIGNAL_BACKEND_ERROR;
    }
。。。

PG does not expose a built-in SQL interface to stop the database, so we need to write one ourselves.

vi pg_fast_stop.c

#include "postgres.h"    /* must come first in extension code */
#include "fmgr.h"
#include "miscadmin.h"   /* declares PostmasterPid */
#include <signal.h>

PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(pg_fast_stop);

Datum
pg_fast_stop(PG_FUNCTION_ARGS)
{
    if (kill(PostmasterPid, SIGINT))
    {
        ereport(WARNING,
                (errmsg("failed to send signal to postmaster: %m")));
        PG_RETURN_BOOL(false);
    }
    PG_RETURN_BOOL(true);
}

gcc -O3 -Wall -Wextra -I /home/digoal/postgresql-11.1/src/include -g -fPIC -c ./pg_fast_stop.c -o pg_fast_stop.o
gcc -O3 -Wall -Wextra -I /home/digoal/postgresql-11.1/src/include -g -shared pg_fast_stop.o -o libpg_fast_stop.so
cp libpg_fast_stop.so $PGHOME/lib/

psql

create or replace function pg_fast_stop() returns int as '$libdir/libpg_fast_stop.so', 'pg_fast_stop' language C STRICT;

(Strictly speaking the C function returns a bool Datum, so returns boolean would be the cleaner declaration; declared as int it happens to display 1 below.)

Try it:

postgres=# select pg_fast_stop();
pg_fast_stop
------------
1
(1 row)

The database has shut down:

postgres=# \dt
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

How can the database be restarted through SQL?

Sending a signal to the POSTMASTER PID can only shut the database down; it cannot restart it. So how can a restart be achieved?

1、#restart_after_crash = on      # reinitialize after backend crash?

Exploit the automatic restart after a normal backend is killed with kill -9; this restart is performed automatically by the postmaster daemon.

2、Use the plsh procedural language and call the pg_ctl command on the database host OS directly to restart. https://github.com/petere/plsh

References

https://github.com/petere/plsh
https://www.postgresql.org/docs/11/functions-info.html
https://www.postgresql.org/docs/11/functions-admin.html
src/backend/utils/adt/misc.c
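For route 2, a sketch of the plsh approach, assuming the plsh extension is installed and reusing the data directory shown in postmaster.pid above; note that the session issuing the call is itself terminated as the server goes down:

create extension if not exists plsh;

-- pg_ctl must be on the PATH of the OS user running postgres
create or replace function restart_db() returns text as $$
#!/bin/sh
pg_ctl restart -m fast -w -D /data01/digoal/pg_root8001
$$ language plsh;

-- select restart_db();   -- the connection is dropped while the server restarts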
背景 非分区表,如何在线(不影响业务)转换为分区表? 方法1,pg_pathman分区插件 《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》 使用非堵塞式的迁移接口 partition_table_concurrently( relation REGCLASS, -- 主表OID batch_size INTEGER DEFAULT 1000, -- 一个事务批量迁移多少记录 sleep_time FLOAT8 DEFAULT 1.0) -- 获得行锁失败时,休眠多久再次获取,重试60次退出任务。 postgres=# select partition_table_concurrently('part_test'::regclass, 10000, 1.0); NOTICE: worker started, you can stop it with the following command: select stop_concurrent_part_task('part_test'); partition_table_concurrently ------------------------------ (1 row) 迁移结束后,主表数据已经没有了,全部在分区中 postgres=# select count(*) from only part_test; count ------- 0 (1 row) 数据迁移完成后,建议禁用主表,这样执行计划就不会出现主表了 postgres=# select set_enable_parent('part_test'::regclass, false); set_enable_parent ------------------- (1 row) 方法2,原生分区 使用继承表,触发器,异步迁移,交换表名一系列步骤,在线将非分区表,转换为分区表(交换表名是需要短暂的堵塞)。 关键技术: 1、继承表(子分区) 对select, update, delete, truncate, drop透明。 2、触发器 插入,采用before触发器,数据路由到继承分区 更新,采用before触发器,删除老表记录,同时将更新后的数据插入新表 3、后台迁移数据,cte only skip locked , delete only, insert into new table 4、迁移结束(p表没有数据后),短暂上锁,剥离INHERTI关系,切换到原生分区,切换表名。 例子 将一个表在线转换为LIST分区表(伪HASH分区)。 范围分区类似。 如果要转换为原生HASH分区表,需要提取pg内置HASH分区算法。 1、创建测试表(需要被分区的表) create table old (id int primary key, info text, crt_time timestamp); 2、写入1000万测试记录 insert into old select generate_series(1,10000000) , md5(random()::text) , now(); 3、创建子分区(本例使用LIST分区) do language plpgsql $$ declare parts int := 4; begin for i in 0..parts-1 loop execute format('create table old_mid%s (like old including all) inherits (old)', i); execute format('alter table old_mid%s add constraint ck check(abs(mod(id,%s))=%s)', i, parts, i); end loop; end; $$; 4、插入,采用before触发器,路由到新表 create or replace function ins_tbl() returns trigger as $$ declare begin case abs(mod(NEW.id,4)) when 0 then insert into old_mid0 values (NEW.*); when 1 then insert into old_mid1 values (NEW.*); when 2 then insert into old_mid2 values (NEW.*); when 3 then insert into old_mid3 values (NEW.*); else return NEW; -- 如果是NULL则写本地父表,主键不会为NULL end case; return null; end; $$ language plpgsql strict; create trigger tg1 before insert on old for each row execute procedure ins_tbl(); 5、更新,采用before触发器,删除老表,同时将更新后的数据插入新表 create or replace function upd_tbl () returns trigger as $$ declare begin case abs(mod(NEW.id,4)) when 0 then insert into old_mid0 values (NEW.*); when 1 then insert into old_mid1 values (NEW.*); when 2 then insert into old_mid2 values (NEW.*); when 3 then insert into old_mid3 values (NEW.*); else return NEW; -- 如果是NULL则写本地父表,主键不会为NULL end case; delete from only old where id=NEW.id; return null; end; $$ language plpgsql strict; create trigger tg2 before update on old for each row execute procedure upd_tbl(); 6、old table 如下 postgres=# \dt+ old List of relations Schema | Name | Type | Owner | Size | Description --------+------+-------+----------+--------+------------- public | old | table | postgres | 730 MB | (1 row) 继承关系如下 postgres=# \d+ old Table "public.old" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | info | text | | | | extended | | crt_time | timestamp without time zone | | | | plain | | Indexes: "old_pkey" PRIMARY KEY, btree (id) Triggers: tg1 BEFORE INSERT ON old FOR EACH ROW EXECUTE PROCEDURE ins_tbl() tg2 BEFORE UPDATE ON old FOR EACH ROW EXECUTE PROCEDURE upd_tbl() Child tables: old_mid0, old_mid1, old_mid2, old_mid3 7、验证insert, update, delete, 
select完全符合要求。对业务SQL请求透明。 postgres=# insert into old values (0,'test',now()); INSERT 0 0 postgres=# select tableoid::regclass,* from old where id=1; tableoid | id | info | crt_time ----------+----+----------------------------------+--------------------------- old | 1 | 22be06200f2a967104872f6f173fd038 | 31-JAN-19 12:52:25.887242 (1 row) postgres=# select tableoid::regclass,* from old where id=0; tableoid | id | info | crt_time ----------+----+------+--------------------------- old_mid0 | 0 | test | 31-JAN-19 13:02:35.859899 (1 row) postgres=# update old set info='abc' where id in (0,2) returning tableoid::regclass,*; tableoid | id | info | crt_time ----------+----+------+--------------------------- old_mid0 | 0 | abc | 31-JAN-19 13:02:35.859899 (1 row) UPDATE 1 postgres=# select tableoid::regclass,* from old where id in (0,2); tableoid | id | info | crt_time ----------+----+------+--------------------------- old_mid0 | 0 | abc | 31-JAN-19 13:12:03.343559 old_mid2 | 2 | abc | 31-JAN-19 13:11:04.763652 (2 rows) postgres=# delete from old where id=3; DELETE 1 postgres=# select tableoid::regclass,* from old where id=3; tableoid | id | info | crt_time ----------+----+------+---------- (0 rows) 8、开启压测,后台对原表数据进行迁移 create or replace function test_ins(int) returns void as $$ declare begin insert into old values ($1,'test',now()); exception when others then return; end; $$ language plpgsql strict; vi test.sql \set id1 random(10000001,200000000) \set id2 random(1,5000000) \set id3 random(5000001,10000000) delete from old where id=:id2; update old set info=md5(random()::text),crt_time=now() where id=:id3; select test_ins(:id1); 开启压测 pgbench -M prepared -n -r -P 1 -f ./test.sql -c 4 -j 4 -T 1200 ... progress: 323.0 s, 12333.1 tps, lat 0.324 ms stddev 0.036 progress: 324.0 s, 11612.9 tps, lat 0.344 ms stddev 0.203 progress: 325.0 s, 12546.0 tps, lat 0.319 ms stddev 0.061 progress: 326.0 s, 12728.7 tps, lat 0.314 ms stddev 0.038 progress: 327.0 s, 12536.9 tps, lat 0.319 ms stddev 0.040 progress: 328.0 s, 12534.1 tps, lat 0.319 ms stddev 0.042 progress: 329.0 s, 12228.1 tps, lat 0.327 ms stddev 0.047 ... 
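Step 9 below drives the migration by hand, batch by batch. The same delete-and-reinsert batch can also be looped until the parent table is empty; a sketch (it relies on PG 11's transaction control in DO blocks, so run it outside an explicit transaction):

do language plpgsql $$
declare
  moved bigint;
begin
  loop
    -- move one batch from the parent into the inheritance partitions
    with a as (
      delete from only old
      where ctid = any (array (
        select ctid from only old limit 10000 for update skip locked))
      returning *
    )
    insert into old select * from a;
    get diagnostics moved = row_count;
    exit when moved = 0;
    commit;   -- PG 11+: commit each batch separately to keep transactions small
  end loop;
end;
$$;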
9、在线迁移数据 批量迁移,每一批迁移N条。调用以下SQL with a as ( delete from only old where ctid = any (array (select ctid from only old limit 1000 for update skip locked) ) returning * ) insert into old select * from a; INSERT 0 0 postgres=# select count(*) from only old; count --------- 9998998 (1 row) postgres=# select count(*) from old; count ---------- 10000000 (1 row) postgres=# with a as ( delete from only old where ctid = any (array (select ctid from only old limit 1000 for update skip locked) ) returning * ) insert into old select * from a; INSERT 0 0 postgres=# select count(*) from old; count ---------- 10000000 (1 row) postgres=# select count(*) from only old; count --------- 9997998 (1 row) postgres=# with a as ( delete from only old where ctid = any (array (select ctid from only old limit 100000 for update skip locked) ) returning * ) insert into old select * from a; INSERT 0 0 postgres=# select count(*) from only old; count --------- 9897998 (1 row) postgres=# select count(*) from old; count ---------- 10000000 (1 row) 一次迁移1万条,分批操作。 with a as ( delete from only old where ctid = any (array (select ctid from only old limit 10000 for update skip locked) ) returning * ) insert into old select * from a; 持续调用以上接口,直到当old表已经没有数据,完全迁移到了分区。 select count(*) from only old; count ------- 0 (1 row) 10、切换到分区表 创建分区表如下,分区方法与继承约束一致。 create table new (id int, info text, crt_time timestamp) partition by list (abs(mod(id,4))); 切换表名,防止雪崩,使用锁超时,由于只涉及表名变更,所以速度非常快。 begin; set lock_timeout ='3s'; alter table old_mid0 no inherit old; alter table old_mid1 no inherit old; alter table old_mid2 no inherit old; alter table old_mid3 no inherit old; alter table old rename to old_tmp; alter table new rename to old; alter table old ATTACH PARTITION old_mid0 for values in (0); alter table old ATTACH PARTITION old_mid1 for values in (1); alter table old ATTACH PARTITION old_mid2 for values in (2); alter table old ATTACH PARTITION old_mid3 for values in (3); end; 切换后的原生分区表如下 postgres=# \d+ old Table "public.old" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | | | plain | | info | text | | | | extended | | crt_time | timestamp without time zone | | | | plain | | Partition key: LIST (abs(mod(id, 4))) Partitions: old_mid0 FOR VALUES IN (0), old_mid1 FOR VALUES IN (1), old_mid2 FOR VALUES IN (2), old_mid3 FOR VALUES IN (3) 查询测试 postgres=# explain select * from old where id=1; QUERY PLAN ------------------------------------------------------------------------------------- Append (cost=0.29..10.04 rows=4 width=44) -> Index Scan using old_mid0_pkey on old_mid0 (cost=0.29..2.51 rows=1 width=44) Index Cond: (id = 1) -> Index Scan using old_mid1_pkey on old_mid1 (cost=0.29..2.51 rows=1 width=45) Index Cond: (id = 1) -> Index Scan using old_mid2_pkey on old_mid2 (cost=0.29..2.51 rows=1 width=44) Index Cond: (id = 1) -> Index Scan using old_mid3_pkey on old_mid3 (cost=0.29..2.51 rows=1 width=45) Index Cond: (id = 1) (9 rows) postgres=# explain select * from old where id=? 
and abs(mod(id, 4)) = abs(mod(?, 4)); QUERY PLAN ------------------------------------------------------------------------------------- Append (cost=0.29..2.52 rows=1 width=45) -> Index Scan using old_mid1_pkey on old_mid1 (cost=0.29..2.51 rows=1 width=45) Index Cond: (id = 1) Filter: (mod(id, 4) = 1) (4 rows) 数据 postgres=# select count(*) from old; count ---------- 10455894 (1 row) 方法3,logical replication 使用逻辑复制的方法,同步到分区表。 简单步骤如下: snapshot 快照(lsn位点) 全量 增量(逻辑复制,从LSN位置开始解析WAL LOG) 切换表名 略 其他 hash函数 postgres=# \df *.*hash* List of functions Schema | Name | Result data type | Argument data types | Type ------------+--------------------------+------------------+---------------------------------------+------ pg_catalog | hash_aclitem | integer | aclitem | func pg_catalog | hash_aclitem_extended | bigint | aclitem, bigint | func pg_catalog | hash_array | integer | anyarray | func pg_catalog | hash_array_extended | bigint | anyarray, bigint | func pg_catalog | hash_numeric | integer | numeric | func pg_catalog | hash_numeric_extended | bigint | numeric, bigint | func pg_catalog | hash_range | integer | anyrange | func pg_catalog | hash_range_extended | bigint | anyrange, bigint | func pg_catalog | hashbpchar | integer | character | func pg_catalog | hashbpcharextended | bigint | character, bigint | func pg_catalog | hashchar | integer | "char" | func pg_catalog | hashcharextended | bigint | "char", bigint | func pg_catalog | hashenum | integer | anyenum | func pg_catalog | hashenumextended | bigint | anyenum, bigint | func pg_catalog | hashfloat4 | integer | real | func pg_catalog | hashfloat4extended | bigint | real, bigint | func pg_catalog | hashfloat8 | integer | double precision | func pg_catalog | hashfloat8extended | bigint | double precision, bigint | func pg_catalog | hashhandler | index_am_handler | internal | func pg_catalog | hashinet | integer | inet | func pg_catalog | hashinetextended | bigint | inet, bigint | func pg_catalog | hashint2 | integer | smallint | func pg_catalog | hashint2extended | bigint | smallint, bigint | func pg_catalog | hashint4 | integer | integer | func pg_catalog | hashint4extended | bigint | integer, bigint | func pg_catalog | hashint8 | integer | bigint | func pg_catalog | hashint8extended | bigint | bigint, bigint | func pg_catalog | hashmacaddr | integer | macaddr | func pg_catalog | hashmacaddr8 | integer | macaddr8 | func pg_catalog | hashmacaddr8extended | bigint | macaddr8, bigint | func pg_catalog | hashmacaddrextended | bigint | macaddr, bigint | func pg_catalog | hashname | integer | name | func pg_catalog | hashnameextended | bigint | name, bigint | func pg_catalog | hashoid | integer | oid | func pg_catalog | hashoidextended | bigint | oid, bigint | func pg_catalog | hashoidvector | integer | oidvector | func pg_catalog | hashoidvectorextended | bigint | oidvector, bigint | func pg_catalog | hashtext | integer | text | func pg_catalog | hashtextextended | bigint | text, bigint | func pg_catalog | hashvarlena | integer | internal | func pg_catalog | hashvarlenaextended | bigint | internal, bigint | func pg_catalog | interval_hash | integer | interval | func pg_catalog | interval_hash_extended | bigint | interval, bigint | func pg_catalog | jsonb_hash | integer | jsonb | func pg_catalog | jsonb_hash_extended | bigint | jsonb, bigint | func pg_catalog | pg_lsn_hash | integer | pg_lsn | func pg_catalog | pg_lsn_hash_extended | bigint | pg_lsn, bigint | func pg_catalog | satisfies_hash_partition | boolean | oid, integer, integer, VARIADIC "any" | func 
pg_catalog | time_hash | integer | time without time zone | func
pg_catalog | time_hash_extended | bigint | time without time zone, bigint | func
pg_catalog | timestamp_hash | integer | timestamp without time zone | func
pg_catalog | timestamp_hash_extended | bigint | timestamp without time zone, bigint | func
pg_catalog | timetz_hash | integer | time with time zone | func
pg_catalog | timetz_hash_extended | bigint | time with time zone, bigint | func
pg_catalog | uuid_hash | integer | uuid | func
pg_catalog | uuid_hash_extended | bigint | uuid, bigint | func
(56 rows)

小结

在线将表转换为分区表,可以使用的方法:

1、转换为pg_pathman分区,直接调用pg_pathman的UDF即可。

2、转换为原生分区,使用继承,异步迁移的方法。割接时仅需短暂锁表。注意此方法不支持 insert into ... on conflict 语法,例如:

insert into old values (1,'test',now()) on conflict(id) do update set info=excluded.info, crt_time=excluded.crt_time;

3、逻辑复制的方法,将数据增量迁移到分区表(目标可以是原生分区方法或者是pg_pathman分区方法的新表)。

参考

《PostgreSQL 9.x, 10, 11 hash分区表 用法举例》
《PostgreSQL 触发器 用法详解 1》
《PostgreSQL 触发器 用法详解 2》
《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》
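补充:可以把上面的分批搬迁包装成一个循环,直到父表清空为止。下面是一个示意写法(假设 PG 11 及以上,顶层调用的 DO 块支持 COMMIT;注意如上文所示,插入父表会被路由改写,INSERT 返回的行数恒为 0,所以用"父表是否仍有残留行"作为退出条件):

do language plpgsql $$
begin
  loop
    -- 每批搬迁1万条: 从父表删除, 重新插入后由路由逻辑写入子表
    with a as (
      delete from only old where ctid = any (array (
        select ctid from only old limit 10000 for update skip locked
      )) returning *
    )
    insert into old select * from a;
    commit;   -- 分批提交, 避免形成单个大事务
    perform 1 from only old limit 1;
    exit when not found;   -- 父表已清空, 迁移完成
  end loop;
end;
$$;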
背景

PostgreSQL 参数配置包罗万象,可以在配置文件, alter system, 命令行, 用户, 数据库, 所有用户, 会话, 事务, 函数, 表等层面进行配置,非常的灵活。

灵活是好,但是可配置的入口太多了,优先级如何?如果在多个入口配置了同一个参数的不同值,最后会以哪个为准?

参数优先级

优先级如下,数值越大,优先级越高。

1 postgresql.conf
work_mem=1MB

2 postgresql.auto.conf
work_mem=2MB

3 command line options
work_mem=3MB
pg_ctl start -o "-c work_mem='3MB'"

4 all role
work_mem=4MB
alter role all set work_mem='4MB';

5 database
work_mem=5MB
alter database postgres set work_mem='5MB';

6 role
work_mem=6MB
alter role digoal set work_mem='6MB';

7 session (客户端参数)
work_mem=7MB
set work_mem ='7MB';

8 事务
work_mem=8MB
postgres=# begin;
BEGIN
postgres=# set local work_mem='8MB';
SET

9 function (参数在函数内有效,函数调用完成后依旧使用其他最高优先级参数值)
work_mem=9MB
postgres=# create or replace function f_test() returns void as $$
declare
  res text;
begin
  show work_mem into res;
  raise notice '%', res;
end;
$$ language plpgsql strict set work_mem='9MB';
CREATE FUNCTION
postgres=# select f_test();
NOTICE: 9MB
 f_test
--------
(1 row)

10 table
TABLE相关参数(垃圾回收相关)
https://www.postgresql.org/docs/11/sql-createtable.html
autovacuum_enabled
toast.autovacuum_enabled
... ...
autovacuum_vacuum_threshold
toast.autovacuum_vacuum_threshold
... ...

小结

PostgreSQL 支持的配置入口: 配置文件(postgresql.conf), alter system(postgresql.auto.conf), 命令行(postgres -o, pg_ctl -o), 所有用户(alter role all set), 数据库(alter database xxx set), 用户(alter role 用户名 set), 会话(set xxx), 事务(set local xxx;), 函数(create or replace function .... set par=val;), 表(表级垃圾回收相关参数)。

如果一个参数在所有入口都配置过,优先级如上,从上到下,优先级越来越大。

参考

《PostgreSQL GUC 参数级别介绍》
《连接PostgreSQL时,如何指定参数》
https://www.postgresql.org/docs/11/sql-createtable.html
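补充一个小实验,演示相邻两级之间的覆盖关系(示意,沿用上文的 digoal 角色,注释中为预期输出):

-- 数据库级(5)与角色级(6)同时配置时, 角色级生效
alter database postgres set work_mem='5MB';
alter role digoal set work_mem='6MB';

-- 以 digoal 重新登录 postgres 库
postgres=> show work_mem;     -- 6MB
-- 会话级(7)又覆盖角色级
postgres=> set work_mem='7MB';
postgres=> show work_mem;     -- 7MB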
背景

在一些场景中,可能要将数据库设置为只读模式。例如:

1、云数据库,当使用的容量超过了购买的限制时,切换到只读(锁定)模式,确保用户不会用超。

2、业务上需要对数据库进行迁移,准备割接时,可将主库切换到只读(锁定),确保绝对不会有事务写入。

锁定的实现方法有若干种:

1、硬锁定,直接将数据库切换到恢复模式,绝对不会有写操作出现。

2、软锁定,设置default_transaction_read_only为on,默认开启的事务为只读事务。用户如果使用begin transaction read write可破解。

3、内核层面改进的锁定,对于云上产品,锁定后实际上是期望用户升级容量,或者用户可以上去删数据使得使用空间降下来。那么以上两种锁定都不适用,需要禁止除truncate, drop操作以外的所有操作的这种锁定方式,而且最好是不需要重启数据库就可以实现。

实现

1 锁定实例

硬锁定

1、配置 recovery.conf

recovery_target_timeline = 'latest'
standby_mode = on

2、重启数据库

pg_ctl restart -m fast

3、硬锁定,不可破解

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

postgres=# insert into t1 values (1);
ERROR: cannot execute INSERT in a read-only transaction

postgres=# begin transaction read write;
ERROR: cannot set transaction read-write mode during recovery

软锁定

1、设置default_transaction_read_only

postgres=# alter system set default_transaction_read_only=on;
ALTER SYSTEM

2、重载配置

postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

3、所有会话自动进入read only的默认事务模式。

reload前

postgres=# show default_transaction_read_only ;
 default_transaction_read_only
-------------------------------
 off
(1 row)

reload后

postgres=# show default_transaction_read_only ;
 default_transaction_read_only
-------------------------------
 on
(1 row)

postgres=# insert into t1 values (1);
ERROR: cannot execute INSERT in a read-only transaction

4、软锁定可破解

postgres=# begin transaction read write;
BEGIN
postgres=# insert into t1 values (1);
INSERT 0 1
postgres=# end;
COMMIT

2 解锁实例

硬解锁

1、重命名recovery.conf到recovery.done

cd $PGDATA
mv recovery.conf recovery.done

2、重启数据库

pg_ctl restart -m fast

软解锁

1、设置default_transaction_read_only

postgres=# alter system set default_transaction_read_only=off;
ALTER SYSTEM

2、重载配置

postgres=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

3、所有会话自动恢复为可读写的默认事务模式。

reload前

postgres=# show default_transaction_read_only ;
 default_transaction_read_only
-------------------------------
 on
(1 row)

reload后

postgres=# show default_transaction_read_only ;
 default_transaction_read_only
-------------------------------
 off
(1 row)

写恢复

postgres=# insert into t1 values (1);
INSERT 0 1

内核层锁定

通过修改内核实现锁定,锁定后只允许:

1、truncate
2、drop

这样,用户可以在锁定的情况下进行数据清理,并可以以跑任务的形式检查数据是否清理干净,再进行解锁设置。阿里云RDS PG已支持。

参考

https://www.postgresql.org/docs/11/recovery-config.html
https://www.postgresql.org/docs/11/runtime-config-client.html#RUNTIME-CONFIG-CLIENT-STATEMENT
https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL
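补充:运维脚本中可以用一条SQL快速判断实例当前处于哪种锁定状态(示意):

select pg_is_in_recovery() as hard_locked,    -- 硬锁定: 实例处于恢复模式
       current_setting('default_transaction_read_only')::boolean as soft_locked;   -- 软锁定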
背景

在心跳时,通过自定义UDF,实现心跳永远不被堵塞,并且支持根据当前的配置自动进行同步、异步模式的升降级,实现半同步的功能。

UDF输入

1、优先模式(同步、异步)
2、同步等待超时时间

当优先为同步模式时,假设当前为同步配置,如果备库异常导致事务提交等待超过指定时间,则自动降级为异步。
当优先为异步模式时,假设当前为同步配置,自动降级为异步。
当优先为同步模式时,假设当前为异步配置,如果备库恢复到streaming模式,自动升级为同步。

使用技术点:

1、alter system
2、reload conf
3、cancel backend
4、dblink 异步调用

心跳UDF逻辑

判断当前实例状态
  只读: 退出
  读写: 判断当前事务模式
    异步: 发心跳, 再看优先模式是什么
      异步: 退出
      同步: 判断是否需要升级
        需要升级: 升级, 退出
        无需升级: 退出
    同步: 消耗异步消息, 发远程心跳, 查询是否超时
      超时: 降级
      否则: 消耗异步消息
      然后看优先模式是什么
        异步: 降级, 退出
        同步: 退出

设计

1、当前postgresql.conf配置

synchronous_commit='remote_write';
synchronous_standby_names='*';

表示同步模式。

2、心跳表设计

create table t_keepalive(id int primary key, ts timestamp, pos pg_lsn);

3、心跳写入方法

insert into t_keepalive values (1,now(),pg_current_wal_lsn())
on conflict (id) do update set ts=excluded.ts,pos=excluded.pos
returning id,ts,pos;

4、创建一个建立连接函数,不报错

create or replace function conn(
  name,   -- dblink名字
  text    -- 连接串,URL
) returns void as $$
declare
begin
  perform dblink_connect($1, $2);
  return;
exception when others then
  return;
end;
$$ language plpgsql strict;

5、根据以上逻辑创建心跳UDF。

create or replace function keepalive (
  prio_commit_mode text,
  tmout interval
) returns t_keepalive as $$
declare
  res1 int;
  res2 timestamp;
  res3 pg_lsn;
  commit_mode text;
  conn text := format('hostaddr=%s port=%s user=%s dbname=%s application_name=', '127.0.0.1', current_setting('port'), current_user, current_database());
  conn_altersys text := format('hostaddr=%s port=%s user=%s dbname=%s', '127.0.0.1', current_setting('port'), current_user, current_database());
  app_prefix_stat text := 'keepalive_dblink';
begin
  if prio_commit_mode not in ('sync','async') then
    raise notice 'prio_commit_mode must be [sync|async]';
    return null;
  end if;

  show synchronous_commit into commit_mode;
  create extension IF NOT EXISTS dblink;

  -- 判断当前实例状态
  if pg_is_in_recovery()
  -- 只读
  then
    raise notice 'Current instance in recovery mode.';
    return null;
  -- 读写
  else
    -- 判断当前事务模式
    if commit_mode in ('local','off')
    -- 异步
    then
      -- 发心跳
      insert into t_keepalive values (1,now(),pg_current_wal_lsn())
      on conflict (id) do update set ts=excluded.ts,pos=excluded.pos
      returning id,ts,pos into res1,res2,res3;

      -- 优先模式是什么
      if prio_commit_mode='async'
      -- 异步
      then
        -- 退出
        return row(res1,res2,res3)::t_keepalive;
      -- 同步
      else
        -- 判断是否需要升级
        perform 1 from pg_stat_replication where state='streaming' limit 1;
        if found
        -- 升级
        then
          perform dblink_exec(conn_altersys, 'alter system set synchronous_commit=remote_write', true);
          perform pg_reload_conf();
          -- 退出
          return row(res1,res2,res3)::t_keepalive;
        end if;
        return row(res1,res2,res3)::t_keepalive;
      end if;
    -- 同步
    else
      -- 消耗异步消息
      perform conn(app_prefix_stat, conn||app_prefix_stat);
      perform t from dblink_get_result(app_prefix_stat, false) as t(id int, ts timestamp, pos pg_lsn);

      -- 发远程心跳
      perform dblink_send_query(app_prefix_stat, $_$ insert into t_keepalive values (1,now(),pg_current_wal_lsn()) on conflict (id) do update set ts=excluded.ts,pos=excluded.pos returning id,ts,pos $_$);

      -- 查询是否超时
      <<ablock>>
      loop
        perform pg_sleep(0.2);
        perform 1 from pg_stat_activity where application_name=app_prefix_stat and state='idle' limit 1;
        -- 未超时
        if found then
          select id,ts,pos into res1,res2,res3 from dblink_get_result(app_prefix_stat, false) as t(id int, ts timestamp, pos pg_lsn);
          raise notice 'no timeout';
          exit ablock;
        end if;

        perform 1 from pg_stat_activity where wait_event='SyncRep' and application_name=app_prefix_stat and clock_timestamp()-query_start > tmout limit 1;
        -- 降级
        if found then
          perform dblink_exec(conn_altersys, 'alter system set synchronous_commit=local', true);
          perform pg_reload_conf();
          perform pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep';
          select id,ts,pos into res1,res2,res3 from dblink_get_result(app_prefix_stat, false) as t(id int, ts timestamp, pos pg_lsn);
          raise notice 'timeout';
          exit ablock;
        end if;

        perform pg_sleep(0.2);
      end loop;

      -- 优先模式是什么
      if prio_commit_mode='async'
      -- 异步
      then
        show synchronous_commit into commit_mode;
        -- 降级
        if commit_mode in ('on','remote_write','remote_apply') then
          perform dblink_exec(conn_altersys, 'alter system set synchronous_commit=local', true);
          perform pg_reload_conf();
          perform pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep';
        end if;
        -- 退出
        return row(res1,res2,res3)::t_keepalive;
      -- 同步
      else
        -- 退出
        return row(res1,res2,res3)::t_keepalive;
      end if;
    end if;
  end if;
end;
$$ language plpgsql strict;

测试

1、当前为同步模式

postgres=# show synchronous_commit ;
 synchronous_commit
--------------------
 remote_write
(1 row)

2、人为关闭从库,心跳自动将数据库改成异步模式,并通知所有等待中会话。

postgres=# select * from keepalive ('sync','5 second');
NOTICE: extension "dblink" already exists, skipping
NOTICE: timeout
 id | ts | pos
----+----------------------------+-------------
 1 | 2019-01-30 00:48:39.800829 | 23/9501D5F8
(1 row)

postgres=# show synchronous_commit ;
 synchronous_commit
--------------------
 local
(1 row)

3、恢复从库,心跳自动将数据库升级为优先sync模式。

postgres=# select * from keepalive ('sync','5 second');
NOTICE: extension "dblink" already exists, skipping
 id | ts | pos
----+----------------------------+-------------
 1 | 2019-01-30 00:48:47.329119 | 23/9501D6E8
(1 row)

postgres=# select * from keepalive ('sync','5 second');
NOTICE: extension "dblink" already exists, skipping
NOTICE: no timeout
 id | ts | pos
----+----------------------------+-------------
 1 | 2019-01-30 00:49:11.991855 | 23/9501E0C8
(1 row)

postgres=# show synchronous_commit ;
 synchronous_commit
--------------------
 remote_write
(1 row)

小结

在心跳时,通过自定义UDF,实现心跳永远不被堵塞,并且支持根据当前的配置自动进行同步、异步模式的升降级,实现半同步的功能。UDF的输入、升降级规则与使用的技术点见文首背景部分。

使用心跳实现半同步,大大简化了整个同步、异步模式切换的流程。当然如果内核层面可以实现,配置几个参数,会更加完美。

参考

dblink 异步调用
《PostgreSQL 数据库心跳(SLA(RPO)指标的时间、WAL SIZE维度计算)》
《PostgreSQL 双节点流复制如何同时保证可用性、可靠性(rpo,rto) - (半同步,自动降级方法实践)》
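这个心跳UDF需要被周期性调用才会产生探测与升降级动作,可以用一个简单的外部脚本驱动(示意,连接信息假设已通过环境变量或 .pgpass 配置,调用间隔与超时阈值按实际RTO要求调整):

#!/bin/bash
# 每5秒调用一次心跳, 同步等待超时阈值5秒
while true; do
  psql -At -c "select * from keepalive('sync', '5 second')"
  sleep 5
done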
背景 pg_rewind类似Oracle flashback,可以将一个数据库回退到一个以前的状态,例如用于: 1、PG物理流复制的从库,当激活后,可以开启读写,使用pg_rewind可以将从库回退为只读从库的角色。而不需要重建整个从库。 2、当异步主从发生角色切换后,主库的wal目录中可能还有没完全同步到从库的内容,因此老的主库无法直接切换为新主库的从库。使用pg_rewind可以修复老的主库,使之成为新主库的只读从库。而不需要重建整个从库。 如果没有pg_rewind,遇到以上情况,需要完全重建从库,如果库占用空间很大,重建非常耗时,也非常耗费上游数据库的资源(读)。 详见: 《PostgreSQL pg_rewind,时间线修复,脑裂修复 - 从库开启读写后,回退为只读从库。异步主从发生角色切换后,主库rewind为新主库的从库》 以上解决的是怎么回退的问题,还有一个问题没有解,在分歧点到当前状态下,这些被回退掉的WAL,其中包含了哪些逻辑变化,这些信息怎么补齐? 时间线分歧变化量补齐原理 1、开启wal_level=logical 1.1、确保有足够的slots 2、开启DDL定义功能,参考: 《PostgreSQL 逻辑订阅 - DDL 订阅 实现方法》 3、在主库,为每一个数据库(或需要做时间线补齐的数据库)创建一个logical SLOT 4、有更新、删除操作的表,必须有主键 5、间歇性移动slot的位置到pg_stat_replication.sent_lsn的位置 6、如果从库被激活,假设老主库上还有未发送到从库的WAL 7、从从库获取激活位置LSN 8、由于使用了SLOT,所以从库激活位点LSN之后的WAL一定存在于老主库WAL目录中。 9、将老主库的slot移动到激活位置LSN 10、从激活位置开始获取logical变化量 11、业务层根据业务逻辑对这些变化量进行处理,补齐时间线分歧 示例 环境使用: 《PostgreSQL pg_rewind,时间线修复,脑裂修复 - 从库开启读写后,回退为只读从库。异步主从发生角色切换后,主库rewind为新主库的从库》 主库 port 4001 从库 port 4000 1、开启wal_level=logical psql -p 4000 postgres=# alter system set wal_level=logical; ALTER SYSTEM psql -p 4001 postgres=# alter system set wal_level=logical; ALTER SYSTEM 1.1、确保有足够的slots edb=# show max_replication_slots ; max_replication_slots ----------------------- 16 (1 row) 重启数据库。 2、开启DDL定义功能,参考: 《PostgreSQL 逻辑订阅 - DDL 订阅 实现方法》 3、在主库,为每一个数据库(或需要做时间线补齐的数据库)创建一个logical SLOT postgres=# select pg_create_logical_replication_slot('fix_tl','test_decoding'); pg_create_logical_replication_slot ------------------------------------ (fix_tl,B/73000140) (1 row) edb=# select pg_create_logical_replication_slot('fix_tl_edb','test_decoding'); pg_create_logical_replication_slot ------------------------------------ (fix_tl_edb,B/73000140) (1 row) 4、有更新、删除操作的表,必须有主键 5、间歇性移动slot的位置到pg_stat_replication.sent_lsn的位置 连接到对应的库操作 postgres=# select pg_replication_slot_advance('fix_tl',sent_lsn) from pg_stat_replication ; pg_replication_slot_advance ----------------------------- (fix_tl,B/73000140) (1 row) edb=# select pg_replication_slot_advance('fix_tl_edb',sent_lsn) from pg_stat_replication ; pg_replication_slot_advance ----------------------------- (fix_tl,B/73000140) (1 row) 6、如果从库被激活,假设老主库上还有未发送到从库的WAL pg_ctl promote -D /data04/ppas11/pg_root4000 7、从从库获取激活位置LSN cd /data04/ppas11/pg_root4000 cat pg_wal/00000003.history 1 8/48DE2318 no recovery target specified 2 D/FD5FFFB8 no recovery target specified 8、由于使用了SLOT,所以从库激活位点LSN之后的WAL一定存在于老主库WAL目录中。 9、将老主库的slot移动到激活位置LSN psql -p 4001 postgres postgres=# select pg_replication_slot_advance('fix_tl','D/FD5FFFB8'); psql -p 4001 edb edb=# select pg_replication_slot_advance('fix_tl_edb','D/FD5FFFB8'); 10、从激活位置开始获取logical变化量 edb=# select * from pg_logical_slot_get_changes('fix_tl_edb',NULL,10,'include-xids', '0'); lsn | xid | data -----+-----+------ (0 rows) 由于EDB库没有变化,所以返回0条记录 postgres=# select * from pg_logical_slot_get_changes('fix_tl',NULL,10,'include-xids', '0'); lsn | xid | data ------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- D/FD5FEC60 | 68900576 | BEGIN D/FD5FEC60 | 68900576 | table public.pgbench_accounts: UPDATE: aid[integer]:44681547 bid[integer]:447 abalance[integer]:-4591 filler[character]:' ' D/FD5FF3A8 | 68900576 | table public.pgbench_tellers: UPDATE: tid[integer]:5091 bid[integer]:510 tbalance[integer]:-160944 filler[character]:null D/FD5FF9A8 | 68900576 | table public.pgbench_branches: UPDATE: bid[integer]:740 bbalance[integer]:-261044 filler[character]:null 
D/FD5FFEF8 | 68900576 | table public.pgbench_history: INSERT: tid[integer]:5091 bid[integer]:740 aid[integer]:44681547 delta[integer]:-4591 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.39739' filler[character]:null D/FD6001E8 | 68900576 | COMMIT D/FD5FE790 | 68900574 | BEGIN D/FD5FE790 | 68900574 | table public.pgbench_accounts: UPDATE: aid[integer]:60858810 bid[integer]:609 abalance[integer]:3473 filler[character]:' ' D/FD5FF1C8 | 68900574 | table public.pgbench_tellers: UPDATE: tid[integer]:8829 bid[integer]:883 tbalance[integer]:60244 filler[character]:null D/FD5FF810 | 68900574 | table public.pgbench_branches: UPDATE: bid[integer]:33 bbalance[integer]:86295 filler[character]:null D/FD5FFD80 | 68900574 | table public.pgbench_history: INSERT: tid[integer]:8829 bid[integer]:33 aid[integer]:60858810 delta[integer]:3473 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.397383' filler[character]:null D/FD600218 | 68900574 | COMMIT (12 rows) postgres=# select * from pg_logical_slot_get_changes('fix_tl',NULL,10,'include-xids', '0'); lsn | xid | data ------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- D/FD5FEED0 | 68900578 | BEGIN D/FD5FEED0 | 68900578 | table public.pgbench_accounts: UPDATE: aid[integer]:15334791 bid[integer]:154 abalance[integer]:-2741 filler[character]:' ' D/FD5FF518 | 68900578 | table public.pgbench_tellers: UPDATE: tid[integer]:2402 bid[integer]:241 tbalance[integer]:191936 filler[character]:null D/FD5FFB88 | 68900578 | table public.pgbench_branches: UPDATE: bid[integer]:345 bbalance[integer]:-693783 filler[character]:null D/FD5FFFB8 | 68900578 | table public.pgbench_history: INSERT: tid[integer]:2402 bid[integer]:345 aid[integer]:15334791 delta[integer]:-2741 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.397396' filler[character]:null D/FD600248 | 68900578 | COMMIT D/FD5FF438 | 68900579 | BEGIN D/FD5FF438 | 68900579 | table public.pgbench_accounts: UPDATE: aid[integer]:54259132 bid[integer]:543 abalance[integer]:3952 filler[character]:' ' D/FD5FFEA8 | 68900579 | table public.pgbench_tellers: UPDATE: tid[integer]:9591 bid[integer]:960 tbalance[integer]:-498586 filler[character]:null D/FD600298 | 68900579 | table public.pgbench_branches: UPDATE: bid[integer]:147 bbalance[integer]:459542 filler[character]:null D/FD600560 | 68900579 | table public.pgbench_history: INSERT: tid[integer]:9591 bid[integer]:147 aid[integer]:54259132 delta[integer]:3952 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.397464' filler[character]:null D/FD600938 | 68900579 | COMMIT (12 rows) ... ... 
直到没有记录返回,说明已获取到所有变化量。

10.1、查看SLOT状态,当前WAL位置信息

psql -p 4001

postgres=# select * from pg_get_replication_slots();
 slot_name | plugin | slot_type | datoid | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------+---------------+-----------+--------+-----------+--------+------------+------+--------------+-------------+---------------------
 fix_tl | test_decoding | logical | 15844 | f | f | | | 67005646 | D/D7959218 | D/FD600218
 fix_tl_edb | test_decoding | logical | 15845 | f | f | | | 72528996 | E/71C92B00 | E/71C92B38
(2 rows)

当前WAL位置

postgres=# select pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 E/71C92B38
(1 row)

11、业务层根据业务逻辑对这些变化量进行处理,补齐时间线分歧

小结

主库开启逻辑SLOT,并根据从库的接收LSN位置,使用pg_replication_slot_advance移动主库的slot位点到从库的接收LSN位置。

当从库激活,老主库还有未同步到从库的WAL时,可以通过逻辑decode的方法,获取到未同步的逻辑变化量。

业务层根据业务逻辑,补齐这些变化量到新的主库。

注意:

1、开启logical wal_level,会给数据库增加较多的WAL日志,请酌情开启。

2、开启SLOT后,数据库会保证尚未被订阅(消费)的WAL保留在pg_wal目录中,如果SLOT没有及时移动,则可能导致主库的pg_wal目录暴增。

参考

https://www.postgresql.org/docs/11/test-decoding.html
https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-REPLICATION
《PostgreSQL 逻辑订阅 - DDL 订阅 实现方法》
《PostgreSQL pg_rewind,时间线修复,脑裂修复 - 从库开启读写后,回退为只读从库。异步主从发生角色切换后,主库rewind为新主库的从库》
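补充两个辅助手段(示意):一是如果只想预览而不消费变化量,可以改用 pg_logical_slot_peek_changes,参数与 get 版本一致;二是针对注意事项2,可以监控每个slot保留的WAL体积,及时发现pg_wal膨胀风险:

-- 预览变化量, 不推进slot位点
select * from pg_logical_slot_peek_changes('fix_tl', NULL, 10, 'include-xids', '0');

-- 各slot因未消费而被保留的WAL体积
select slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots;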
背景 1、PG物理流复制的从库,当激活后,可以开启读写,使用pg_rewind可以将从库回退为只读从库的角色。而不需要重建整个从库。 2、当异步主从发生角色切换后,主库的wal目录中可能还有没完全同步到从库的内容,因此老的主库无法直接切换为新主库的从库。使用pg_rewind可以修复老的主库,使之成为新主库的只读从库。而不需要重建整个从库。 3、如果没有pg_rewind,遇到以上情况,需要完全重建从库。或者你可以使用存储层快照,回退回脑裂以前的状态。又或者可以使用文件系统快照,回退回脑裂以前的状态。 原理与修复步骤 1、使用pg_rewind功能的前提条件:必须开启full page write,必须开启wal hint或者data block checksum。 2、需要被修复的库:从激活点开始,所有的WAL必须存在pg_wal目录中。如果WAL已经被覆盖,只要有归档,拷贝到pg_wal目录即可。 3、新的主库,从激活点开始,产生的所有WAL必须存在pg_wal目录中,或者已归档,并且被修复的库可以使用restore_command访问到这部分WAL。 4、修改(source db)新主库或老主库配置,允许连接。 5、修复时,连接新主库,得到切换点。或连接老主库,同时比对当前要修复的新主库的TL与老主库进行比对,得到切换点。 6、解析需要被修复的库的从切换点到现在所有的WAL。同时连接source db(新主库(或老主库)),进行回退操作(被修改或删除的BLOCK从source db获取并覆盖,新增的BLOCK,直接抹除。)回退到切换点的状态。 7、修改被修复库(target db)的recovery.conf, postgresql.conf配置。 8、启动target db,连接source db接收WAL,或restore_command配置接收WAL,从切换点开始所有WAL,进行apply。 9、target db现在是source db的从库。 以EDB PG 11为例讲解 环境部署 《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》 export PS1="$USER@`/bin/hostname -s`-> " export PGPORT=4000 export PGDATA=/data04/ppas11/pg_root4000 export LANG=en_US.utf8 export PGHOME=/usr/edb/as11 export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH export DATE=`date +"%Y%m%d%H%M"` export PATH=$PGHOME/bin:$PATH:. export MANPATH=$PGHOME/share/man:$MANPATH export PGHOST=127.0.0.1 export PGUSER=postgres export PGDATABASE=postgres alias rm='rm -i' alias ll='ls -lh' unalias vi 1、初始化数据库集群 initdb -D /data04/ppas11/pg_root4000 -E UTF8 --lc-collate=C --lc-ctype=en_US.UTF8 -U postgres -k --redwood-like 2、配置recovery.done cd $PGDATA cp $PGHOME/share/recovery.conf.sample ./ mv recovery.conf.sample recovery.done vi recovery.done restore_command = 'cp /data04/ppas11/wal/%f %p' recovery_target_timeline = 'latest' standby_mode = on primary_conninfo = 'host=localhost port=4000 user=postgres' 3、配置postgresql.conf 要使用rewind功能: 必须开启full_page_writes 必须开启data_checksums或wal_log_hints postgresql.conf listen_addresses = '0.0.0.0' port = 4000 max_connections = 8000 superuser_reserved_connections = 13 unix_socket_directories = '.,/tmp' unix_socket_permissions = 0700 tcp_keepalives_idle = 60 tcp_keepalives_interval = 10 tcp_keepalives_count = 10 shared_buffers = 16GB max_prepared_transactions = 8000 maintenance_work_mem = 1GB autovacuum_work_mem = 1GB dynamic_shared_memory_type = posix vacuum_cost_delay = 0 bgwriter_delay = 10ms bgwriter_lru_maxpages = 1000 bgwriter_lru_multiplier = 10.0 effective_io_concurrency = 0 max_worker_processes = 128 max_parallel_maintenance_workers = 8 max_parallel_workers_per_gather = 8 max_parallel_workers = 24 wal_level = replica synchronous_commit = off full_page_writes = on wal_compression = on wal_buffers = 32MB wal_writer_delay = 10ms checkpoint_timeout = 25min max_wal_size = 32GB min_wal_size = 8GB checkpoint_completion_target = 0.2 archive_mode = on archive_command = 'cp -n %p /data04/ppas11/wal/%f' max_wal_senders = 16 wal_keep_segments = 4096 max_replication_slots = 16 hot_standby = on max_standby_archive_delay = 300s max_standby_streaming_delay = 300s wal_receiver_status_interval = 1s wal_receiver_timeout = 10s random_page_cost = 1.1 effective_cache_size = 400GB log_destination = 'csvlog' logging_collector = on log_directory = 'log' log_filename = 'edb-%a.log' log_truncate_on_rotation = on log_rotation_age = 1d log_rotation_size = 0 log_min_duration_statement = 1s log_checkpoints = on log_error_verbosity = verbose log_line_prefix = '%t ' log_lock_waits = on log_statement = 'ddl' log_timezone = 'PRC' autovacuum = on log_autovacuum_min_duration = 0 
autovacuum_max_workers = 6 autovacuum_freeze_max_age = 1200000000 autovacuum_multixact_freeze_max_age = 1400000000 autovacuum_vacuum_cost_delay = 0 statement_timeout = 0 lock_timeout = 0 idle_in_transaction_session_timeout = 0 vacuum_freeze_table_age = 1150000000 vacuum_multixact_freeze_table_age = 1150000000 datestyle = 'redwood,show_time' timezone = 'PRC' lc_messages = 'en_US.utf8' lc_monetary = 'en_US.utf8' lc_numeric = 'en_US.utf8' lc_time = 'en_US.utf8' default_text_search_config = 'pg_catalog.english' shared_preload_libraries = 'auto_explain,pg_stat_statements,$libdir/dbms_pipe,$libdir/edb_gen,$libdir/dbms_aq' edb_redwood_date = on edb_redwood_greatest_least = on edb_redwood_strings = on db_dialect = 'redwood' edb_dynatune = 66 edb_dynatune_profile = oltp timed_statistics = off 4、配置pg_hba.conf,允许流复制 local all all trust host all all 127.0.0.1/32 trust host all all ::1/128 trust local replication all trust host replication all 127.0.0.1/32 trust host replication all ::1/128 trust host all all 0.0.0.0/0 md5 5、配置归档目录 mkdir /data04/ppas11/wal chown enterprisedb:enterprisedb /data04/ppas11/wal 6、创建从库 pg_basebackup -h 127.0.0.1 -p 4000 -D /data04/ppas11/pg_root4001 -F p -c fast 7、配置从库 cd /data04/ppas11/pg_root4001 mv recovery.done recovery.conf vi postgresql.conf port = 4001 8、启动从库 pg_ctl start -D /data04/ppas11/pg_root4001 9、压测主库 pgbench -i -s 1000 pgbench -M prepared -v -r -P 1 -c 24 -j 24 -T 300 10、检查归档 postgres=# select * from pg_stat_archiver ; archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset ----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+---------------------------------- 240 | 0000000100000000000000F0 | 28-JAN-19 15:08:43.276965 +08:00 | 0 | | | 28-JAN-19 15:01:17.883338 +08:00 (1 row) postgres=# select * from pg_stat_archiver ; archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset ----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+---------------------------------- 248 | 0000000100000000000000F8 | 28-JAN-19 15:08:45.120134 +08:00 | 0 | | | 28-JAN-19 15:01:17.883338 +08:00 (1 row) 11、检查从库延迟 postgres=# select * from pg_stat_replication ; -[ RECORD 1 ]----+--------------------------------- pid | 8124 usesysid | 10 usename | postgres application_name | walreceiver client_addr | 127.0.0.1 client_hostname | client_port | 62988 backend_start | 28-JAN-19 15:07:34.084542 +08:00 backend_xmin | state | streaming sent_lsn | 1/88BC2000 write_lsn | 1/88BC2000 flush_lsn | 1/88BC2000 replay_lsn | 1/88077D48 write_lag | 00:00:00.001417 flush_lag | 00:00:00.002221 replay_lag | 00:00:00.097657 sync_priority | 0 sync_state | async 例子1,从库激活后产生读写,使用pg_rewind修复从库,回退到只读从库 1、激活从库 pg_ctl promote -D /data04/ppas11/pg_root4001 2、写从库 pgbench -M prepared -v -r -P 1 -c 4 -j 4 -T 120 -p 4001 此时从库已经和主库不在一个时间线,无法直接变成当前主库的从库 enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4001|grep -i time Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Time of latest checkpoint: Mon 28 Jan 2019 03:56:38 PM CST Min recovery ending loc's timeline: 2 track_commit_timestamp setting: off Date/time type storage: 64-bit integers enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4000|grep -i time Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Time of latest 
checkpoint: Mon 28 Jan 2019 05:11:38 PM CST Min recovery ending loc's timeline: 0 track_commit_timestamp setting: off Date/time type storage: 64-bit integers 3、修复从库,使之继续成为当前主库的从库 4、查看切换点 cd /data04/ppas11/pg_root4001 ll pg_wal/*.history -rw------- 1 enterprisedb enterprisedb 42 Jan 28 17:15 pg_wal/00000002.history cat pg_wal/00000002.history 1 6/48C62000 no recovery target specified 5、从库激活时间开始产生的WAL必须全部在pg_wal目录中。 -rw------- 1 enterprisedb enterprisedb 42 Jan 28 17:15 00000002.history -rw------- 1 enterprisedb enterprisedb 16M Jan 28 17:16 000000020000000600000048 ............ 000000020000000600000048开始,所有的wal必须存在从库pg_wal目录中。如果已经覆盖了,必须从归档目录拷贝到从库pg_wal目录中。 6、从库激活时,主库从这个时间点开始所有的WAL还在pg_wal目录,或者从库可以使用restore_command获得(recovery.conf)。 recovery.conf restore_command = 'cp /data04/ppas11/wal/%f %p' 7、pg_rewind命令帮助 https://www.postgresql.org/docs/11/app-pgrewind.html pg_rewind --help pg_rewind resynchronizes a PostgreSQL cluster with another copy of the cluster. Usage: pg_rewind [OPTION]... Options: -D, --target-pgdata=DIRECTORY existing data directory to modify --source-pgdata=DIRECTORY source data directory to synchronize with --source-server=CONNSTR source server to synchronize with -n, --dry-run stop before modifying anything -P, --progress write progress messages --debug write a lot of debug messages -V, --version output version information, then exit -?, --help show this help, then exit Report bugs to <support@enterprisedb.com>. 8、停库(被修复的库,停库) pg_ctl stop -m fast -D /data04/ppas11/pg_root4001 9、尝试修复 pg_rewind -n -D /data04/ppas11/pg_root4001 --source-server="hostaddr=127.0.0.1 user=postgres port=4000" servers diverged at WAL location 6/48C62000 on timeline 1 rewinding from last common checkpoint at 5/5A8CD30 on timeline 1 Done! 10、尝试正常,说明可以修复,实施修复 pg_rewind -D /data04/ppas11/pg_root4001 --source-server="hostaddr=127.0.0.1 user=postgres port=4000" servers diverged at WAL location 6/48C62000 on timeline 1 rewinding from last common checkpoint at 5/5A8CD30 on timeline 1 Done! 
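补充:pg_rewind 要求被修复库处于干净关闭状态(上文第8步的 pg_ctl stop -m fast 即为干净停库),实施前可用 pg_controldata 再确认一次(示意):

pg_controldata -D /data04/ppas11/pg_root4001 | grep "Database cluster state"
# 应显示: shut down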
11、已修复,改配置 cd /data04/ppas11/pg_root4001 vi postgresql.conf port = 4001 mv recovery.done recovery.conf vi recovery.conf restore_command = 'cp /data04/ppas11/wal/%f %p' recovery_target_timeline = 'latest' standby_mode = on primary_conninfo = 'host=localhost port=4000 user=postgres' 12、删除归档中错误时间线上产生的文件否则会在启动修复后的从库后,走到00000002时间线上,这是不想看到的。 mkdir /data04/ppas11/wal/error_tl_2 mv /data04/ppas11/wal/00000002* /data04/ppas11/wal/error_tl_2 13、启动从库 pg_ctl start -D /data04/ppas11/pg_root4001 14、建议对主库做一个检查点,从库收到检查点后,重启后不需要应用太多WAL,而是从新检查点开始恢复 psql checkpoint; 15、压测主库 pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4000 16、查看归档状态 postgres=# select * from pg_stat_archiver ; archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset ----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+---------------------------------- 1756 | 0000000100000006000000DC | 28-JAN-19 17:41:57.562425 +08:00 | 0 | | | 28-JAN-19 15:01:17.883338 +08:00 (1 row) 17、查看从库健康、延迟,观察修复后的情况 postgres=# select * from pg_stat_replication ; -[ RECORD 1 ]----+-------------------------------- pid | 13179 usesysid | 10 usename | postgres application_name | walreceiver client_addr | 127.0.0.1 client_hostname | client_port | 63198 backend_start | 28-JAN-19 17:47:29.85308 +08:00 backend_xmin | state | catchup sent_lsn | 7/DDE80000 write_lsn | 7/DC000000 flush_lsn | 7/DC000000 replay_lsn | 7/26A8DCB0 write_lag | 00:00:18.373263 flush_lag | 00:00:18.373263 replay_lag | 00:00:18.373263 sync_priority | 0 sync_state | async 例子2,从库激活成为新主库后,老主库依旧有读写,使用pg_rewind修复老主库,将老主库降级为新主库的从库 1、激活从库 pg_ctl promote -D /data04/ppas11/pg_root4001 2、写从库 pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4001 3、写主库 pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4000 此时老主库已经和新的主库不在一个时间线 enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4000|grep -i timeline Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Min recovery ending loc's timeline: 0 enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4001|grep -i timeline Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Min recovery ending loc's timeline: 2 enterprisedb@pg11-test-> cd /data04/ppas11/pg_root4001/pg_wal enterprisedb@pg11-test-> cat 00000002.history 1 8/48DE2318 no recovery target specified enterprisedb@pg11-test-> ll *.partial -rw------- 1 enterprisedb enterprisedb 16M Jan 28 17:48 000000010000000800000048.partial 4、修复老主库,变成从库 4.1、从库激活时,老主库从这个时间点开始所有的WAL,必须全部在pg_wal目录中。 000000010000000800000048 开始的所有WAL必须存在pg_wal,如果已经覆盖了,必须从WAL归档拷贝到pg_wal目录 4.2、从库激活时间开始产生的所有WAL,老主库必须可以使用restore_command获得(recovery.conf)。 recovery.conf restore_command = 'cp /data04/ppas11/wal/%f %p' 5、关闭老主库 pg_ctl stop -m fast -D /data04/ppas11/pg_root4000 6、尝试修复老主库 pg_rewind -n -D /data04/ppas11/pg_root4000 --source-server="hostaddr=127.0.0.1 user=postgres port=4001" servers diverged at WAL location 8/48DE2318 on timeline 1 rewinding from last common checkpoint at 6/CCCEF770 on timeline 1 Done! 
7、尝试成功,可以修复,实施修复

pg_rewind -D /data04/ppas11/pg_root4000 --source-server="hostaddr=127.0.0.1 user=postgres port=4001"

8、修复完成后,改配置

cd /data04/ppas11/pg_root4000

vi postgresql.conf
port = 4000

mv recovery.done recovery.conf

vi recovery.conf
restore_command = 'cp /data04/ppas11/wal/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=localhost port=4001 user=postgres'

9、启动老主库

pg_ctl start -D /data04/ppas11/pg_root4000

10、建议对新主库做一个检查点,从库收到检查点后,重启后不需要应用太多WAL,而是从新检查点开始恢复

checkpoint;

11、压测新主库

pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4001

12、查看归档状态

psql -p 4001

postgres=# select * from pg_stat_archiver ;
 archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset
----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+----------------------------------
 406 | 0000000200000009000000DB | 28-JAN-19 21:18:22.976118 +08:00 | 0 | | | 28-JAN-19 17:47:29.847488 +08:00
(1 row)

13、查看从库健康、延迟

psql -p 4001

postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+---------------------------------
pid | 17675
usesysid | 10
usename | postgres
application_name | walreceiver
client_addr | 127.0.0.1
client_hostname |
client_port | 60530
backend_start | 28-JAN-19 21:18:36.472197 +08:00
backend_xmin |
state | streaming
sent_lsn | 9/E8361C18
write_lsn | 9/E8361C18
flush_lsn | 9/E8361C18
replay_lsn | 9/D235B520
write_lag | 00:00:00.000101
flush_lag | 00:00:00.000184
replay_lag | 00:00:03.028098
sync_priority | 0
sync_state | async

小结

1 适合场景

与文首背景所述一致:从库激活开启读写后,回退为只读从库;或异步主从发生角色切换后,将老主库修复为新主库的只读从库。两种场景都不需要重建整个从库。如果没有pg_rewind,遇到以上情况,需要完全重建从库,如果库占用空间很大,重建非常耗时,也非常耗费上游数据库的资源(读)。

2 前提

要使用rewind功能:

1、必须开启full_page_writes
2、必须开启data_checksums或wal_log_hints

initdb -k 开启data_checksums

3 原理与修复流程

与文首《原理与修复步骤》所列的9个步骤一致,此处不再重复。

参考

https://www.postgresql.org/docs/11/app-pgrewind.html
《PostgreSQL primary-standby failback tools : pg_rewind》
《PostgreSQL 9.5 new feature - pg_rewind fast sync Split Brain Primary & Standby》
《PostgreSQL 9.5 add pg_rewind for Fast align for PostgreSQL unaligned primary & standby》
《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》
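附:实施 pg_rewind 前,可以先用一条SQL确认前提条件是否满足(示意):

select name, setting from pg_settings
where name in ('full_page_writes', 'wal_log_hints', 'data_checksums');
-- full_page_writes 必须为 on; wal_log_hints 与 data_checksums 至少一项为 on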
背景 两节点HA架构,如何做到跨机房RPO=0(可靠性维度)?同时RTO可控(可用性维度)? 半同步是一个不错的选择。 1、当只挂掉一个节点时,可以保证RPO=0。如下: 主 -> 从(挂) 主(挂) -> 从 2、当一个节点挂掉后,在另一个节点恢复并开启同步模式前,如果在此期间(当前)主节点也挂掉,(虽然此时从库活了(但由于还未开启同步模式)),则RPO>0。 如下: 主(挂) -> 从(OPEN,但是之前从挂过,并且还还未转换为同步模式) 与两个节点同时挂掉一样,RPO>0 3、如何保证RTO时间可控? 我们知道,在同步模式下,事务提交时需要等待sync STANDBY的WAL复制反馈,确保事务wal落多个副本再反馈客户端(从动作上来说,先持久化主,然后同步给sync从,并等待sync从的WAL 同步位点的反馈),当STANDBY挂掉时,等待是无限期的,所以两节点的同步复制,无法兼顾可用性(RTO)。那么怎么兼顾可用性呢? 可以对(pg_stat_activity)等待事件的状态进行监测,如果发现同步事务等待超过一定阈值(RTO阈值),则降级为异步模式。 降级不需要重启数据库。 3.1 改配置 3.2 reload (对已有连接和新建连接都会立即生效)。 3.3 cancel 等待信号(针对当前处于等待中的进程)。 4、降级后,什么情况下恢复为同步模式?(升级) 同样可以对(pg_stat_replication)状态进行监测,当sync standby处于streaming状态时,则可以转换为同步模式。 升级不需要重启数据库。 4.1 改配置 4.2 reload。立即生效 (对已有连接和新建连接都会立即生效)。 涉及技术点 1、事务提交参数 synchronous_commit on, remote_apply, remote_write, local 2、同步配置参数 synchronous_standby_names [FIRST] num_sync ( standby_name [, ...] ) ANY num_sync ( standby_name [, ...] ) standby_name [, ...] ANY 3 (s1, s2, s3, s4) FIRST 3 (s1, s2, s3, s4) * 表示所有节点 3、活跃会话,查看事务提交时,等待事件状态 pg_stat_activity 等待事件 https://www.postgresql.org/docs/11/monitoring-stats.html#MONITORING-STATS-VIEWS wait_event='SyncRep' 4、流状态,pg_stat_replication sync_state='sync' state text Current WAL sender state. Possible values are: startup: This WAL sender is starting up. catchup: This WAL sender's connected standby is catching up with the primary. streaming: This WAL sender is streaming changes after its connected standby server has caught up with the primary. backup: This WAL sender is sending a backup. stopping: This WAL sender is stopping. 实践 环境 1、主 postgresql.conf synchronous_commit = remote_write wal_level = replica max_wal_senders = 8 synchronous_standby_names = '*' 2、从 recovery.conf restore_command = 'cp /data01/digoal/wal/%f %p' primary_conninfo = 'host=localhost port=8001 user=postgres' 同步降级、升级 - 实践 关闭standby,模拟备库异常。看如何实现半同步。 模拟STANDBY恢复,看如何模拟升级为同步模式。 1、监测 pg_stat_activity,如果发现事务提交等待超过一定阈值(RTO保障),降级 select max(now()-query_start) from pg_stat_activity where wait_event='SyncRep'; 2、查看以上结果等待时间(RTO保障) 当大于某个阈值时,开始降级。 注意NULL保护,NULL表示没有事务处于 SyncRep 等待状态。 3、降级步骤1,修改synchronous_commit参数。改成WAL本地持久化(异步流复制)。 alter system set synchronous_commit=local; 4、降级步骤2,生效参数,RELOAD select pg_reload_conf(); 5、降级步骤3,清空当前等待队列(处于SyncRep等待状态的进程在收到CANCEL信号后,从队列清空,并提示客户端,当前事务本地WAL已持久化,事务正常结束。) select pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep'; 6、收到清空信号的客户端返回正常(客户端可以看到事务正常提交) postgres=# end; WARNING: 01000: canceling wait for synchronous replication due to user request DETAIL: The transaction has already committed locally, but might not have been replicated to the standby. 
LOCATION: SyncRepWaitForLSN, syncrep.c:264
COMMIT

事务的redo信息已在本地WAL持久化,提交状态正常。当前会话后续的请求会变成异步流复制模式(WAL本地持久化模式(synchronous_commit=local))。

如何升级?

7、升级步骤1,监测standby状态,sync_state='sync'状态的standby进入streaming状态后,表示该standby与primary的wal已完全同步。

select * from pg_stat_replication where sync_state='sync' and state='streaming';

有结果返回,表示standby已经接收完primary的wal,可以进入同步模式。

8、升级步骤2,将事务提交模式改回同步模式(synchronous_commit=remote_write,事务提交时,等sync standby接收到wal,并write。)

alter system set synchronous_commit=remote_write;

9、升级步骤3,生效参数,RELOAD(所有会话重置synchronous_commit=remote_write,包括已有连接,新建的连接)

select pg_reload_conf();

小结

1、在不修改PG内核的情况下,通过外部辅助监测和操纵(例如5秒监控间隔),实现了两节点的半同步模式:在双节点正常、或仅挂掉一个节点的情况下保证RPO=0,同时RTO可控(例如当最长wait_event='SyncRep'等待超过10秒时自动降级为异步)。

2、内核修改建议:
降级:可以在等待队列中加HOOK,wait_event='SyncRep'等待超时后降级为异步。
升级:在wal_sender代码中加HOOK,监测到standby恢复后,改回同步模式。

参考

《PostgreSQL 一主多从(多副本,强同步)简明手册 - 配置、压测、监控、切换、防脑裂、修复、0丢失 - 珍藏级》
https://www.postgresql.org/docs/11/monitoring-stats.html#MONITORING-STATS-VIEWS
《PostgreSQL 时间点恢复(PITR)在异步流复制主从模式下,如何避免主备切换后PITR恢复走错时间线(timeline , history , partial , restore_command , recovery.conf)》
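把上文的降级、升级逻辑串起来,就是一个可以常驻运行的监控脚本。下面是一个示意实现(阈值10秒、间隔5秒均为示例值,连接信息假设已配置好):

#!/bin/bash
while true; do
  # 有事务在 SyncRep 上等待超过10秒 => 降级为异步
  waiting=$(psql -Atc "select count(*) from pg_stat_activity where wait_event='SyncRep' and now()-query_start > interval '10 s'")
  if [ "$waiting" -gt 0 ]; then
    psql -Atc "alter system set synchronous_commit=local"
    psql -Atc "select pg_reload_conf()"
    # 清空等待队列, 这些事务的WAL已本地持久化, 会正常返回
    psql -Atc "select pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep'"
  else
    # standby 恢复 streaming 且当前为异步 => 升级回同步
    ok=$(psql -Atc "select count(*) from pg_stat_replication where sync_state='sync' and state='streaming'")
    cur=$(psql -Atc "show synchronous_commit")
    if [ "$ok" -gt 0 ] && [ "$cur" = "local" ]; then
      psql -Atc "alter system set synchronous_commit=remote_write"
      psql -Atc "select pg_reload_conf()"
    fi
  fi
  sleep 5
done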
PostgreSQL Oracle 兼容性之 - performance insight - AWS performance insight 理念与实现解读 - 珍藏级 作者 digoal 日期 2019-01-25 标签 PostgreSQL , perf insight , 等待事件 , 采样 , 发现问题 , Oracle 兼容性 背景 通常普通的监控会包括系统资源的监控: cpu io 内存 网络 等,但是仅凭资源的监控,当问题发生时,如何快速的定位到问题在哪里?需要更高级的监控: 更高级的监控方法通常是从数据库本身的特性触发,但是需要对数据库具备非常深刻的理解,才能做出好的监控和诊断系统。属于专家型或叫做经验型的监控和诊断系统。 [《[未完待续] PostgreSQL 一键诊断项 - 珍藏级》](https://github.com/digoal/blog/blob/master/201806/20180613_05.md) 《PostgreSQL 实时健康监控 大屏 - 低频指标 - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 高频指标(服务器) - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 高频指标 - 珍藏级》 《PostgreSQL pgmetrics - 多版本、健康监控指标采集、报告》 《PostgreSQL pg_top pgcenter - 实时top类工具》 《PostgreSQL、Greenplum 日常监控 和 维护任务 - 最佳实践》 《PostgreSQL 如何查找TOP SQL (例如IO消耗最高的SQL) (包含SQL优化内容) - 珍藏级》 《PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁》 然而数据库在不断的演进,经验型的诊断系统好是好,但是不通用,有没有更加通用,有效的发现系统问题的方法? AWS与Oracle perf insight的思路非常不错,实际上就是等待事件的统计追踪,作为性能诊断的方法。 https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html 《AWS performance insight》 简单来说就是对系统不停的打点,例如每秒一个采样,仅记录这一秒数据库活跃的会话(包括等待中的会话),等待事件,QUERY,时间,用户,数据库。这几个指标。 活跃度会话,不管是在耗费CPU,还是在等待(锁,IO)或者其他,实际上都是占用了资源的。可以算出平均的活跃会话(例如10秒的平均值,5秒的平均值)(avg active sessions)。 这个avg active sessions是一个值,这个值和数据库实例的CPU个数进行比较,就可以衡量出系统是否存在瓶颈(当avg active sessions超过CPU个数时,说明存在瓶颈)。 当某个时间窗口存在瓶颈,瓶颈在哪里,则可以通过这个时间窗口内的打点明细,进行统计。等待事件,QUERY,用户,数据库。 PostgreSQL打点的方法也很多: 1、(推荐)通过pg_stat_activity 内存中的动态视图获取,每秒取一次ACTIVE的内容(例如:会话ID,等待事件,QUERY,时间,用户,数据库)。 https://www.postgresql.org/docs/11/monitoring-stats.html#MONITORING-STATS-VIEWS 2、(不推荐)开启审计日志,在审计日志中获取,这个在高并发系统中,不太好用。并且审计日志是在结束时打印,一个QUERY的中间执行过程并不完全是占用CPU或其他资源的,所以审计日志获取的信息对于perf insight并没有什么效果。 perf insight的入门门槛低,可以摆平很多问题,在出现问题时快速定位到问题SQL,问题的等待事件在哪里。结合经验型的监控,可以构建PG非常强大的监控、诊断、优化体系。 perf insight 实现讲解 举例1 会话1 postgres=# begin; BEGIN postgres=# lock table abc in access exclusive mode ; LOCK TABLE 会话2 postgres=# select * from abc; 从pg_stat_activity获取状态,可以看到会话2在等待,会话处于active状态,这种消耗需要被记录到avg active session中,用来评估资源消耗指标。 postgres=# select now(),state,datname,usename,wait_event_type,wait_event,query from pg_stat_activity where state in ('active', 'fastpath function call'); now | state | datname | usename | wait_event_type | wait_event | query -------------------------------+--------+----------+----------+-----------------+------------+-------------------------------------------------------------------------------------------- 2019-01-25 21:17:28.540264+08 | active | postgres | postgres | | | select datname,usename,query,state,wait_event_type,wait_event,now() from pg_stat_activity; 2019-01-25 21:17:28.540264+08 | active | postgres | postgres | Lock | relation | select * from abc; (2 rows) 举例2 使用pgbench压测数据库,每秒打点,后期进行可视化展示 pgbench -i -s 100 1、压测只读 pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 300 -S 2、查看压测时的活跃会话状态 postgres=# select now()::timestamptz(0),state, datname,usename,wait_event_type,wait_event,query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid(); now | state | datname | usename | wait_event_type | wait_event | query ---------------------+--------+----------+----------+-----------------+------------+------------------------------------------------------- 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM 
pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 
21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | Client | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; (46 rows) 3、为了方便统计,可以在本地建表,用于收集pg_stat_activity的内容,在实际的生产中,可以把这个信息读走,存到其他地方(例如专用于监控的其他数据库)。 postgres=# create unlogged table perf_insight as select now()::timestamptz(0) as ts, extract(epoch from backend_start)||'.'||pid as sessid, state,datname,usename, wait_event_type||'_'||wait_event as waiting , query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid(); SELECT 48 4、试着写入当时pg_stat_activity状态 postgres=# insert into perf_insight select now()::timestamptz(0), extract(epoch from backend_start)||'.'||pid, state,datname, usename,wait_event_type||'_'||wait_event, query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid(); INSERT 0 42 5、使用psql watch,每秒打一个点 postgres=# \watch 1 6、只读压测,压测结果,130万QPS pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 300 -S transaction type: <builtin: select only> scaling factor: 100 query mode: prepared number of clients: 64 number of threads: 64 duration: 300 s number of transactions actually processed: 390179555 latency average = 0.049 ms latency stddev = 0.026 ms tps = 1300555.237752 (including connections establishing) tps = 1300584.885231 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set aid random(1, 100000 * :scale) 0.049 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 7、接下来,开启一个读写压测,9.4万TPS(yue 47万qps) pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 300 transaction type: <builtin: TPC-B (sort of)> scaling factor: 100 query mode: prepared number of clients: 64 number of threads: 64 duration: 300 s number of transactions actually processed: 28371829 latency average = 0.677 ms latency stddev = 0.413 ms tps = 94569.412707 (including connections establishing) tps = 94571.934011 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set aid random(1, 100000 * :scale) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.001 \set delta random(-5000, 5000) 0.045 BEGIN; 0.108 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.069 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.091 UPDATE pgbench_tellers SET tbalance = 
tbalance + :delta WHERE tid = :tid; 0.139 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.068 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.153 END; 8、perf insight 可视化需要的素材 时间、状态、会话ID、数据库名、用户名、等待事件、查询 当然,我们可以再细化,例如增加会话ID字段,可以针对一个会话来进行展示和统计。 postgres=# \d perf_insight Unlogged table "public.perf_insight" Column | Type | ---------+--------------------------------+- ts | timestamp(0) with time zone | 时间戳 sessid | text | 会话ID state | text | 状态 datname | name | 数据库 usename | name | 用户 waiting | text | 等待事件 query | text | SQL语句 9、查看perf insight素材内容 postgres=# select * from perf_insight limit 10; ts | sessid | state | datname | usename | waiting | query ---------------------+------------------------+--------+----------+----------+--------------------------+---------------------------------------------------------------------- 2019-01-26 09:43:28 | 1548467007.4805.32968 | active | postgres | postgres | Lock_transactionid | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; 2019-01-26 09:43:28 | 1548467007.47991.32966 | active | postgres | postgres | Client_ClientRead | END; 2019-01-26 09:43:28 | 1548467007.48362.32979 | active | postgres | postgres | Lock_transactionid | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; 2019-01-26 09:43:28 | 1548467007.48388.32980 | active | postgres | postgres | Lock_tuple | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; 2019-01-26 09:43:28 | 1548467007.48329.32978 | active | postgres | postgres | Lock_transactionid | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; 2019-01-26 09:43:28 | 1548467007.48275.32976 | active | postgres | postgres | Lock_tuple | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; 2019-01-26 09:43:28 | 1548467007.48107.32970 | active | postgres | postgres | Lock_transactionid | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; 2019-01-26 09:43:28 | 1548467007.48243.32975 | active | postgres | postgres | Lock_transactionid | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; 2019-01-26 09:43:28 | 1548467007.48417.32981 | active | postgres | postgres | IPC_ProcArrayGroupUpdate | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-26 09:43:28 | 1548467007.48448.32982 | active | postgres | postgres | Lock_tuple | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; (10 rows) 10、查看在这段时间中,有多少种等待事件 postgres=# select distinct waiting from perf_insight ; waiting -------------------------- LWLock_wal_insert LWLock_XidGenLock Lock_extend LWLock_ProcArrayLock Lock_tuple Lock_transactionid LWLock_lock_manager Client_ClientRead IPC_ProcArrayGroupUpdate LWLock_buffer_content IPC_ClogGroupUpdate LWLock_CLogControlLock IO_DataFileExtend (14 rows) perf insight 可视化,统计 采集粒度为1秒,可以对n秒的打点求平均值(分不同维度),得到可视化图形: 1、总avg active sessions ,用于告警。 2、其他维度,用于分析造成性能瓶颈问题的权重: 2.1、等待事件维度(NULL表示无等待,纯CPU time) avg active sessions 2.2、query 维度 avg active sessions 2.3、数据库维度 avg active sessions 2.4、用户维度 avg active sessions 如何判断问题: 例如,对于一个64线程的系统: avg active sessions 在64以下时,可以认为是没有问题的。 1 总 avg active sessions,用于告警。 5秒统计间隔。 select coalesce(t1.ts, t2.ts) ts, coalesce(avg_active_sessions,0) avg_active_sessions from ( select to_timestamp((extract(epoch from ts))::int8/5*5) ts, count(*)/5::float8 avg_active_sessions from perf_insight group by 1 ) t1 full outer join (select generate_series( to_timestamp((extract(epoch from 
min(ts)))::int8/5*5), to_timestamp((extract(epoch from max(ts)))::int8/5*5), interval '5 s' ) ts from perf_insight ) t2 on (t1.ts=t2.ts); ts | avg_active_sessions ------------------------+--------------------- 2019-01-26 05:39:20+08 | 14.2 2019-01-26 05:39:25+08 | 30.4 2019-01-26 05:39:30+08 | 35.8 2019-01-26 05:39:35+08 | 41.8 2019-01-26 05:39:40+08 | 38.6 2019-01-26 05:39:45+08 | 38.2 2019-01-26 05:39:50+08 | 34.6 2019-01-26 05:39:55+08 | 35.6 2019-01-26 05:40:00+08 | 42.4 2019-01-26 05:40:05+08 | 36.8 2019-01-26 05:40:10+08 | 36.2 2019-01-26 05:40:15+08 | 39.4 2019-01-26 05:40:20+08 | 40 2019-01-26 05:40:25+08 | 35.8 2019-01-26 05:40:30+08 | 37.2 2019-01-26 05:40:35+08 | 36.4 2019-01-26 05:40:40+08 | 40.6 2019-01-26 05:40:45+08 | 39.2 2019-01-26 05:40:50+08 | 36.6 2019-01-26 05:40:55+08 | 37.4 2019-01-26 05:41:00+08 | 38 2019-01-26 05:41:05+08 | 38.6 2019-01-26 05:41:10+08 | 38.4 2019-01-26 05:41:15+08 | 40.4 2019-01-26 05:41:20+08 | 35.8 2019-01-26 05:41:25+08 | 40.6 2019-01-26 05:41:30+08 | 39.4 2019-01-26 05:41:35+08 | 37.4 2019-01-26 05:41:40+08 | 36.6 2019-01-26 05:41:45+08 | 39.6 2019-01-26 05:41:50+08 | 36.2 2019-01-26 05:41:55+08 | 37.4 2019-01-26 05:42:00+08 | 37.8 2019-01-26 05:42:05+08 | 39 2019-01-26 05:42:10+08 | 36.2 2019-01-26 05:42:15+08 | 37 2019-01-26 05:42:20+08 | 36.4 2019-01-26 05:42:25+08 | 36 2019-01-26 05:42:30+08 | 37.6 2019-01-26 05:42:35+08 | 0 2019-01-26 05:42:40+08 | 0 2019-01-26 05:42:45+08 | 0 2019-01-26 05:42:50+08 | 8.4 2019-01-26 05:42:55+08 | 40.6 2019-01-26 05:43:00+08 | 42.4 2019-01-26 05:43:05+08 | 37.4 2019-01-26 05:43:10+08 | 44.8 2019-01-26 05:43:15+08 | 36.2 2019-01-26 05:43:20+08 | 39.6 2019-01-26 05:43:25+08 | 41.4 2019-01-26 05:43:30+08 | 34.2 2019-01-26 05:43:35+08 | 41.8 2019-01-26 05:43:40+08 | 37.4 2019-01-26 05:43:45+08 | 30.2 2019-01-26 05:43:50+08 | 36.6 2019-01-26 05:43:55+08 | 36 2019-01-26 05:44:00+08 | 33.8 2019-01-26 05:44:05+08 | 37.8 2019-01-26 05:44:10+08 | 39.2 2019-01-26 05:44:15+08 | 36.6 2019-01-26 05:44:20+08 | 39.8 2019-01-26 05:44:25+08 | 35.2 2019-01-26 05:44:30+08 | 35.8 2019-01-26 05:44:35+08 | 42.8 2019-01-26 05:44:40+08 | 40.8 2019-01-26 05:44:45+08 | 39.4 2019-01-26 05:44:50+08 | 40 2019-01-26 05:44:55+08 | 40.2 2019-01-26 05:45:00+08 | 41.2 2019-01-26 05:45:05+08 | 41.6 2019-01-26 05:45:10+08 | 40.6 2019-01-26 05:45:15+08 | 33.8 2019-01-26 05:45:20+08 | 35.8 2019-01-26 05:45:25+08 | 42.2 2019-01-26 05:45:30+08 | 37.8 2019-01-26 05:45:35+08 | 37.6 2019-01-26 05:45:40+08 | 40.2 2019-01-26 05:45:45+08 | 37.4 2019-01-26 05:45:50+08 | 38.2 2019-01-26 05:45:55+08 | 39.6 2019-01-26 05:46:00+08 | 41.6 2019-01-26 05:46:05+08 | 36 2019-01-26 05:46:10+08 | 34.6 2019-01-26 05:46:15+08 | 37.8 2019-01-26 05:46:20+08 | 40.8 2019-01-26 05:46:25+08 | 42 2019-01-26 05:46:30+08 | 36.4 2019-01-26 05:46:35+08 | 44.6 2019-01-26 05:46:40+08 | 38.8 2019-01-26 05:46:45+08 | 35 2019-01-26 05:46:50+08 | 36.2 2019-01-26 05:46:55+08 | 37.2 2019-01-26 05:47:00+08 | 36 2019-01-26 05:47:05+08 | 38.2 2019-01-26 05:47:10+08 | 37.2 2019-01-26 05:47:15+08 | 42.8 2019-01-26 05:47:20+08 | 32 2019-01-26 05:47:25+08 | 41 2019-01-26 05:47:30+08 | 44 2019-01-26 05:47:35+08 | 37.4 2019-01-26 05:47:40+08 | 36.2 2019-01-26 05:47:45+08 | 39 2019-01-26 05:47:50+08 | 27.8 (103 rows) 10秒统计间隔的SQL select coalesce(t1.ts,t2.ts) ts, coalesce(avg_active_sessions,0) avg_active_sessions from ( select to_timestamp((extract(epoch from ts))::int8/10*10) ts, count(*)/10::float8 avg_active_sessions from perf_insight group by 1 ) t1 full outer join ( select generate_series( 
to_timestamp((extract(epoch from min(ts)))::int8/10*10), to_timestamp((extract(epoch from max(ts)))::int8/10*10), interval '10 s' ) ts from perf_insight ) t2 on (t1.ts=t2.ts); ts | avg_active_sessions ------------------------+--------------------- 2019-01-26 05:39:20+08 | 22.3 2019-01-26 05:39:30+08 | 38.8 2019-01-26 05:39:40+08 | 38.4 2019-01-26 05:39:50+08 | 35.1 2019-01-26 05:40:00+08 | 39.6 2019-01-26 05:40:10+08 | 37.8 2019-01-26 05:40:20+08 | 37.9 2019-01-26 05:40:30+08 | 36.8 2019-01-26 05:40:40+08 | 39.9 2019-01-26 05:40:50+08 | 37 2019-01-26 05:41:00+08 | 38.3 2019-01-26 05:41:10+08 | 39.4 2019-01-26 05:41:20+08 | 38.2 2019-01-26 05:41:30+08 | 38.4 2019-01-26 05:41:40+08 | 38.1 2019-01-26 05:41:50+08 | 36.8 2019-01-26 05:42:00+08 | 38.4 2019-01-26 05:42:10+08 | 36.6 2019-01-26 05:42:20+08 | 36.2 2019-01-26 05:42:30+08 | 18.8 2019-01-26 05:42:40+08 | 0 2019-01-26 05:42:50+08 | 24.5 2019-01-26 05:43:00+08 | 39.9 2019-01-26 05:43:10+08 | 40.5 2019-01-26 05:43:20+08 | 40.5 2019-01-26 05:43:30+08 | 38 2019-01-26 05:43:40+08 | 33.8 2019-01-26 05:43:50+08 | 36.3 2019-01-26 05:44:00+08 | 35.8 2019-01-26 05:44:10+08 | 37.9 2019-01-26 05:44:20+08 | 37.5 2019-01-26 05:44:30+08 | 39.3 2019-01-26 05:44:40+08 | 40.1 2019-01-26 05:44:50+08 | 40.1 2019-01-26 05:45:00+08 | 41.4 2019-01-26 05:45:10+08 | 37.2 2019-01-26 05:45:20+08 | 39 2019-01-26 05:45:30+08 | 37.7 2019-01-26 05:45:40+08 | 38.8 2019-01-26 05:45:50+08 | 38.9 2019-01-26 05:46:00+08 | 38.8 2019-01-26 05:46:10+08 | 36.2 2019-01-26 05:46:20+08 | 41.4 2019-01-26 05:46:30+08 | 40.5 2019-01-26 05:46:40+08 | 36.9 2019-01-26 05:46:50+08 | 36.7 2019-01-26 05:47:00+08 | 37.1 2019-01-26 05:47:10+08 | 40 2019-01-26 05:47:20+08 | 36.5 2019-01-26 05:47:30+08 | 40.7 2019-01-26 05:47:40+08 | 37.6 2019-01-26 05:47:50+08 | 13.9 (52 rows) 2 具体到一个时间段内,是什么问题 例如2019-01-26 05:45:20+08,这个时间区间,性能问题钻取: 1、数据库维度的资源消耗时间占用,判定哪个数据库占用的资源最多 postgres=# select datname, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 05:45:20+08' -- 问题时间点 group by 1 order by cnt desc; datname | cnt ----------+----- postgres | 39 (1 row) 2、用户维度的资源消耗时间占用,判定哪个用户占用的资源最多 postgres=# select usename, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 05:45:20+08' -- 问题时间点 group by 1 order by cnt desc; usename | cnt ----------+----- postgres | 39 (1 row) 3、等待事件维度的资源消耗时间占用,判定问题集中在哪些等待事件上,可以针对性的优化、加资源。 postgres=# select coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 05:45:20+08' -- 问题时间点 group by 1 order by cnt desc; waiting | cnt --------------------------+------ CPU_TIME | 15.3 Client_ClientRead | 10.6 IPC_ProcArrayGroupUpdate | 6.1 Lock_transactionid | 5.4 Lock_tuple | 0.5 LWLock_wal_insert | 0.3 LWLock_ProcArrayLock | 0.2 LWLock_buffer_content | 0.2 IPC_ClogGroupUpdate | 0.2 LWLock_lock_manager | 0.1 LWLock_CLogControlLock | 0.1 (11 rows) 4、SQL维度的资源消耗时间占用,判定问题集中在哪些SQL上,可以针对性的优化。 postgres=# select query, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 05:45:20+08' -- 问题时间点 group by 1 order by cnt desc; query | cnt -------------------------------------------------------------------------------------------------------+------ END; | 11.5 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | 11.3 UPDATE pgbench_accounts SET 
abalance = abalance + $1 WHERE aid = $2; | 6.8 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | 4.5 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | 2.3 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | 2.1 BEGIN; | 0.5 (7 rows) 5、单条QUERY在不同等待事件上的资源消耗时间占用,判定问题SQL的突出等待事件,可以针对性的优化、加资源。 postgres=# select query, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 05:45:20+08' -- 问题时间点 group by 1,2 order by 1,cnt desc; query | waiting | cnt -------------------------------------------------------------------------------------------------------+--------------------------+----- BEGIN; | Client_ClientRead | 0.3 BEGIN; | CPU_TIME | 0.2 END; | CPU_TIME | 4.6 END; | IPC_ProcArrayGroupUpdate | 3.7 END; | Client_ClientRead | 3.1 END; | IPC_ClogGroupUpdate | 0.1 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | CPU_TIME | 1 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Client_ClientRead | 0.6 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | IPC_ProcArrayGroupUpdate | 0.6 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | IPC_ClogGroupUpdate | 0.1 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | CPU_TIME | 1.2 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Client_ClientRead | 0.6 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Lock_transactionid | 0.3 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | CPU_TIME | 3.8 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | Client_ClientRead | 2.9 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | LWLock_wal_insert | 0.1 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_transactionid | 4 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | CPU_TIME | 2.5 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Client_ClientRead | 2.1 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | IPC_ProcArrayGroupUpdate | 1.7 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_tuple | 0.5 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_buffer_content | 0.2 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_ProcArrayLock | 0.2 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_wal_insert | 0.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | CPU_TIME | 2 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Lock_transactionid | 1.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Client_ClientRead | 1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | IPC_ProcArrayGroupUpdate | 0.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_CLogControlLock | 0.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_lock_manager | 0.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_wal_insert | 0.1 (31 rows) 6、点中单条QUERY,在不同等待事件上的资源消耗时间占用,判定问题SQL的突出等待事件,可以针对性的优化、加资源。 通过4,发现占用最多的是END这条SQL,那么这条SQL的等待时间分布如何?是什么等待引起的? 
postgres=# select coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 05:45:20+08' -- 问题时间点 and query='END;' group by 1 order by cnt desc; waiting | cnt --------------------------+----- CPU_TIME | 4.6 IPC_ProcArrayGroupUpdate | 3.7 Client_ClientRead | 3.1 IPC_ClogGroupUpdate | 0.1 (4 rows) 3 开启一个可以造成性能问题的压测场景,通过perf insight直接发现问题 1、开启640个并发,读写压测,由于数据量小,并发高,直接导致了ROW LOCK冲突的问题,使用perf insight问题毕现。 pgbench -M prepared -n -r -P 1 -c 640 -j 640 -T 300 postgres=# select query, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 06:38:20+08' -- 问题时间点 group by 1,2 order by 1,cnt desc; query | waiting | cnt -------------------------------------------------------------------------------------------------------+--------------------------+------- BEGIN; | Lock_transactionid | 0.3 BEGIN; | Lock_tuple | 0.3 BEGIN; | LWLock_lock_manager | 0.1 END; | IPC_ProcArrayGroupUpdate | 29.5 END; | CPU_TIME | 14.1 END; | Lock_transactionid | 13 END; | Client_ClientRead | 8.4 END; | Lock_tuple | 8.1 END; | LWLock_lock_manager | 3 END; | LWLock_ProcArrayLock | 0.4 END; | LWLock_buffer_content | 0.3 END; | IPC_ClogGroupUpdate | 0.1 END; | LWLock_wal_insert | 0.1 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | IPC_ProcArrayGroupUpdate | 1.3 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | CPU_TIME | 0.4 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Lock_transactionid | 0.3 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Lock_tuple | 0.2 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Client_ClientRead | 0.2 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | LWLock_lock_manager | 0.1 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Lock_tuple | 0.9 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Lock_transactionid | 0.9 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | IPC_ProcArrayGroupUpdate | 0.4 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Client_ClientRead | 0.3 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | CPU_TIME | 0.1 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | Lock_transactionid | 1.7 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | IPC_ProcArrayGroupUpdate | 1.4 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | Lock_tuple | 0.9 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | LWLock_lock_manager | 0.1 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | CPU_TIME | 0.1 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_transactionid | 161.5 # 突出问题在这里 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | IPC_ProcArrayGroupUpdate | 27.2 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_tuple | 27.2 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_lock_manager | 19.6 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | CPU_TIME | 12.3 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid 
= $2; | Client_ClientRead | 4 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_buffer_content | 3.3 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_ProcArrayLock | 0.3 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_wal_insert | 0.1 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | IPC_ClogGroupUpdate | 0.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Lock_transactionid | 178.4 # 突出问题在这里 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Lock_tuple | 83.7 # 突出问题在这里 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | CPU_TIME | 5.6 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | IPC_ProcArrayGroupUpdate | 5.3 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_lock_manager | 3.8 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Client_ClientRead | 2 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_ProcArrayLock | 0.1 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_buffer_content | 0.1 (47 rows) postgres=# select coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight where to_timestamp((extract(epoch from ts))::int8/10*10) -- 以10秒统计粒度的图形为例 ='2019-01-26 06:38:20+08' -- 问题时间点 group by 1 order by cnt desc; waiting | cnt --------------------------+------- Lock_transactionid | 356.1 Lock_tuple | 121.3 IPC_ProcArrayGroupUpdate | 65.1 CPU_TIME | 32.6 LWLock_lock_manager | 26.7 Client_ClientRead | 14.9 LWLock_buffer_content | 3.7 LWLock_ProcArrayLock | 0.8 LWLock_wal_insert | 0.2 IPC_ClogGroupUpdate | 0.2 (10 rows) 其他压测场景使用perf insight发现问题的例子 1、批量数据写入,BLOCK extend或wal insert lock瓶颈,或pglz压缩瓶颈。 create table test(id int, info text default repeat(md5(random()::text),1000)); vi test.sql insert into test(id) select generate_series(1,10); pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 300 postgres=# select to_timestamp((extract(epoch from ts))::int8/10*10) ts, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight group by 1,2 order by 1,cnt desc; ts | waiting | cnt ------------------------+--------------------------+------ 2019-01-26 10:28:50+08 | IO_DataFileExtend | 0.1 2019-01-26 10:29:00+08 | CPU_TIME | 50 2019-01-26 10:29:00+08 | Lock_extend | 11.9 -- 扩展数据文件 2019-01-26 10:29:00+08 | Client_ClientRead | 0.3 2019-01-26 10:29:00+08 | IO_DataFileExtend | 0.2 2019-01-26 10:29:00+08 | LWLock_lock_manager | 0.1 2019-01-26 10:29:10+08 | CPU_TIME | 47.1 2019-01-26 10:29:10+08 | Lock_extend | 13.5 2019-01-26 10:29:10+08 | Client_ClientRead | 0.7 2019-01-26 10:29:10+08 | IO_DataFileExtend | 0.3 2019-01-26 10:29:10+08 | LWLock_buffer_content | 0.2 2019-01-26 10:29:10+08 | LWLock_lock_manager | 0.1 2019-01-26 10:29:20+08 | CPU_TIME | 54.5 2019-01-26 10:29:20+08 | Lock_extend | 6.7 2019-01-26 10:29:20+08 | Client_ClientRead | 0.2 2019-01-26 10:29:20+08 | IO_DataFileExtend | 0.1 2019-01-26 10:29:30+08 | CPU_TIME | 61.9 -- CPU,通过perf top来看是 pglz接口的瓶颈(pglz_compress) 2019-01-26 10:29:30+08 | Client_ClientRead | 0.2 2019-01-26 10:29:40+08 | CPU_TIME | 30.9 2019-01-26 10:29:40+08 | LWLock_wal_insert | 0.2 2019-01-26 10:29:40+08 | Client_ClientRead | 0.1 (28 rows) 所以上面这个问题,如果改成不压缩,那么瓶颈就会变成其他的: alter table test alter COLUMN info set storage external; postgres=# \d+ test Table "public.test" Column | Type | Collation | Nullable | Default | Storage | Stats target | 
Description --------+---------+-----------+----------+-------------------------------------+----------+--------------+------------- id | integer | | | | plain | | info | text | | | repeat(md5((random())::text), 1000) | external | | 瓶颈就会变成其他的: 2019-01-26 10:33:50+08 | Lock_extend | 43.2 2019-01-26 10:33:50+08 | LWLock_buffer_content | 14.8 2019-01-26 10:33:50+08 | CPU_TIME | 4.6 2019-01-26 10:33:50+08 | LWLock_lock_manager | 0.5 2019-01-26 10:33:50+08 | LWLock_wal_insert | 0.4 2019-01-26 10:33:50+08 | IO_DataFileExtend | 0.4 2019-01-26 10:33:50+08 | Client_ClientRead | 0.1 2019-01-26 10:34:00+08 | Lock_extend | 55.6 2019-01-26 10:34:00+08 | LWLock_buffer_content | 6.3 2019-01-26 10:34:00+08 | CPU_TIME | 1.2 2019-01-26 10:34:00+08 | IO_DataFileExtend | 0.8 2019-01-26 10:34:00+08 | LWLock_wal_insert | 0.1 2019-01-26 10:34:10+08 | Lock_extend | 6.3 2019-01-26 10:34:10+08 | LWLock_buffer_content | 5.8 2019-01-26 10:34:10+08 | CPU_TIME | 0.7 因此治本的方法是提供更好的压缩接口,这也是PG 12的版本正在改进的: [《[未完待续] PostgreSQL 开放压缩接口 与 lz4压缩插件》](https://github.com/digoal/blog/blob/master/201803/20180315_02.md) [《[未完待续] PostgreSQL zstd 压缩算法 插件》](https://github.com/digoal/blog/blob/master/201803/20180315_01.md) 2、秒杀,单条UPDATE。行锁瓶颈。 create table t_hot (id int primary key, cnt int8); insert into t_hot values (1,0); vi test.sql update t_hot set cnt=cnt+1 where id=1; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 300 postgres=# select to_timestamp((extract(epoch from ts))::int8/10*10) ts, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight group by 1,2 order by 1,cnt desc; 2019-01-26 10:37:50+08 | Lock_tuple | 29.6 -- 瓶颈为行锁冲突 2019-01-26 10:37:50+08 | LWLock_lock_manager | 11.4 -- 伴随热点块 2019-01-26 10:37:50+08 | LWLock_buffer_content | 8.4 2019-01-26 10:37:50+08 | Lock_transactionid | 7.6 2019-01-26 10:37:50+08 | CPU_TIME | 6.5 2019-01-26 10:37:50+08 | Client_ClientRead | 0.2 2019-01-26 10:38:00+08 | Lock_tuple | 29.2 -- 瓶颈为行锁冲突 2019-01-26 10:38:00+08 | LWLock_buffer_content | 15.6 -- 伴随热点块 2019-01-26 10:38:00+08 | CPU_TIME | 7.9 2019-01-26 10:38:00+08 | LWLock_lock_manager | 7.2 2019-01-26 10:38:00+08 | Lock_transactionid | 3.7 秒杀的场景,优化方法 《PostgreSQL 秒杀4种方法 - 增加 批量流式加减库存 方法》 《HTAP数据库 PostgreSQL 场景与性能测试之 30 - (OLTP) 秒杀 - 高并发单点更新》 《聊一聊双十一背后的技术 - 不一样的秒杀技术, 裸秒》 《PostgreSQL 秒杀场景优化》 3、未优化SQL,全表扫描filter,CPU time瓶颈。 postgres=# create table t_bad (id int, info text); CREATE TABLE postgres=# insert into t_bad select generate_series(1,10000), md5(random()::Text); INSERT 0 10000 vi test.sql \set id random(1,10000) select * from t_bad where id=:id; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 300 瓶颈 postgres=# select to_timestamp((extract(epoch from ts))::int8/10*10) ts, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight group by 1,2 order by 1,cnt desc; 2019-01-26 10:41:40+08 | CPU_TIME | 61.3 2019-01-26 10:41:40+08 | Client_ClientRead | 0.9 2019-01-26 10:41:50+08 | CPU_TIME | 61.7 2019-01-26 10:41:50+08 | Client_ClientRead | 0.1 2019-01-26 10:42:00+08 | CPU_TIME | 60.7 2019-01-26 10:42:00+08 | Client_ClientRead | 0.5 perf insight 的基准线 如果要设置一个基准线,用于报警。那么: 1、基准线跟QPS没什么关系。 2、基准线跟avg active sessions有莫大关系。avg active sessions大于实例CPU核数时,说明有性能问题。 perf insight 不是万能的 perf insight 发现当时的问题是非常迅速的。 神医华佗说,不治已病治未病才是最高境界,perf insight实际上是发现已病,而未病是发现不了的。 未病还是需要对引擎的深刻理解和丰富的经验积累。 例如: 1、年龄 2、FREEZE风暴 3、sequence耗尽 4、索引推荐 5、膨胀 6、安全风险 7、不合理索引 8、增长趋势 9、碎片 10、分区建议 11、冷热分离建议 12、TOP SQL诊断与优化 13、扩容(容量、计算资源、IO、内存...)建议 14、分片建议 15、架构优化建议 等。 除此之外,perf insight对于这类情况也是发现不了的: 1、long query 
(waiting (ddl, block one session)),当long query比较少,总体avg active session低于基准水位时,实际上long query的问题就无法暴露。 然而long query是有一些潜在问题的,例如可能导致膨胀。 perf insight + 经验型监控、诊断,可以使得你的数据库监测系统更加强壮。 其他知识点、内核需改进点 1、会话ID,使用backend的启动时间,backend pid两者结合,就可以作为PG数据库的唯一session id。 有了session id,就可以基于SESSION维度进行性能诊断和可视化展示。 select extract(epoch from backend_start)||'.'||pid as sessid from pg_stat_activity ; sessid ------------------------ 1547978042.41326.13447 1547978042.41407.13450 2、对于未使用绑定变量的SQL,要做SQL层的统计透视,就会比较悲剧了,因为只要输入的变量不同在pg_stat_activity的query中看起来都不一样,所以为了更好的统计展示,可能需要内核层面优化。 可以借鉴pg_stat_statements的代码进行内核的修改,pg_stat_statements里面是做了变量替换处理的。(即使是未使用绑定变量的语句) contrib/pg_stat_statements/pg_stat_statements.c 如果不想改内核,或者你可以等PG发布这个PATCH,可能12会发布。 《PostgreSQL 11 preview - 强制auto prepared statment开关(自动化plan cache)(类似Oracle cursor_sharing force)》 3、udf调用,使用pg_stat_activity打点的方法,无法获取到当前UDF里面调用的SQL是哪个,所以对于大量使用UDF的用户来说,perf insight当前的方案,可能无法钻取到UDF里面的SQL瓶颈在哪里。 这种情况可以考虑使用AWR,perf,或者plprofile。 《PostgreSQL 函数调试、诊断、优化 & auto_explain & plprofiler》 《PostgreSQL 源码性能诊断(perf profiling)指南 - 珍藏级》 《PostgreSQL 代码性能诊断之 - OProfile & Systemtap》 4、PostgreSQL 的兼容oracle商用版(阿里云PPAS),内置AWR功能,waiting event的粒度更细,不需要人为打点,可以生成非常体系化的报告,欢迎使用。 《PostgreSQL AWR报告(for 阿里云ApsaraDB PgSQL)》 5、如果你需要对很多PG实例实施perf insight,并且想将perf insight的打点采样存储到一个大的PG数据库(例如citus)中,由于我们查询都是按单个instance来查询的,那么就要注意IO放大的问题。 可以使用udf,自动切分INSTANCE的方法。另一方面由于时间字段递增,与HEAP存储顺序线性相关,可以使用brin时间区间索引,解决ts字段btree索引大的问题。知识点如下: 《PostgreSQL 时序最佳实践 - 证券交易系统数据库设计 - 阿里云RDS PostgreSQL最佳实践》 《PostgreSQL 在铁老大订单系统中的schemaless设计和性能压测》 6、如果将perf insight数据存在当前数据库中,需要耗费多少空间呢? 正常情况下,一次打点采集到的active session记录是很少的(通常小于CPU核数,甚至是0)。 较坏情况,例如每次打点都采集到60条记录,每隔5秒采集一次,30天大概3000万条记录,每天一个分区,每天才100万条记录,完全可以直接保存在本地。 参考 [《[未完待续] PostgreSQL 一键诊断项 - 珍藏级》](https://github.com/digoal/blog/blob/master/201806/20180613_05.md) 《PostgreSQL 实时健康监控 大屏 - 低频指标 - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 高频指标(服务器) - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 高频指标 - 珍藏级》 《PostgreSQL pgmetrics - 多版本、健康监控指标采集、报告》 《PostgreSQL pg_top pgcenter - 实时top类工具》 《PostgreSQL 如何查找TOP SQL (例如IO消耗最高的SQL) (包含SQL优化内容) - 珍藏级》 《PostgreSQL、Greenplum 日常监控 和 维护任务 - 最佳实践》 《PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁》 https://sourceforge.net/projects/pgstatsinfo/ https://github.com/cybertec-postgresql/pgwatch2 https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html https://github.com/postgrespro/pg_wait_sampling 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机 大量阿里云PG解决方案: 任意维度实时圈人; 时序数据实时处理; 时间、空间、业务 多维数据实时透视; 独立事件相关性分析; 海量关系实时图式搜索; 社交业务案例; 流式数据实时处理案例; 物联网; 全文检索; 模糊、正则查询案例; 图像识别; 向量相似检索; 数据清洗、采样、脱敏、批处理、合并; GIS 地理信息空间数据应用; 金融业务; 异步消息应用案例; 海量数据 冷热分离; 倒排索引案例; 海量数据OLAP处理应用; 德哥的 / digoal's PostgreSQL文章入口 - 努力做成PG资源最丰富的个人blog
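补充: 针对上文"其他知识点"中第5、6点(ts字段与HEAP存储顺序线性相关, 适合brin索引; 打点数据量不大, 可按天分区本地保存), 下面给出一个示意性的建表草图。表名perf_insight_hist、分区名均为笔者假设, 字段沿用上文perf_insight的定义, 仅说明思路:

create table perf_insight_hist (
  ts      timestamptz(0),   -- 打点时间
  sessid  text,             -- 会话ID
  state   text,             -- 状态
  datname name,             -- 数据库
  usename name,             -- 用户
  waiting text,             -- 等待事件
  query   text              -- SQL语句
) partition by range (ts);

-- 每天一个分区, 过期分区直接drop即可实现保留策略
create table perf_insight_hist_20190126
  partition of perf_insight_hist
  for values from ('2019-01-26') to ('2019-01-27');

-- ts随采集时间递增, 与HEAP存储顺序线性相关, brin索引即可高效支持时间区间过滤, 且比btree小得多
create index on perf_insight_hist using brin (ts);

采集程序只需将每秒从pg_stat_activity打点得到的记录insert到该表, 前文各类按to_timestamp分桶的统计SQL把表名换成perf_insight_hist即可运行。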
为什么数据库选型和找对象一样重要

一、找对象
正常路径: 知己 -> 缘分出现 -> 恋爱[知彼] -> 结婚 -> 2个家庭关系结合(七大姑八大姨) -> 生娃 -> 带娃 -> 七年之痒 -> 愉快的共度一生
错误后果:
1、家庭不和睦
2、影响小孩成长
3、离婚: 生无可恋、财产分割、[娃]的成长与抚养问题、家庭关系处理等
4、再婚

二、数据库选型
正常路径: 评估 -> [技术磨合] -> 研发 -> 上线 -> 迭代维护期
错误后果:
1、研发、维护、软件、硬件成本增加, 性能问题; 线下输出时, 某些开源协议导致的软件分发风险; 架构迁就数据库, 导致开发周期变长、软件|硬件|维护成本增加
2、影响业务: 营收、客户数, 如果出现异常, 导致业务受到影响
3、企业形象: 稳定性、安全问题、数据泄露等问题...

三、谁为数据库选型负责
开发者? DBA? 架构师? 技术总监? CTO? 技术委员会? 怎么选型?

四、选型中最容易踩到的几个坑
大公司用什么我就用什么, 没大错.
用熟不用生.
统一技术栈, 一个库解决一切.

五、在什么时间点选型?
架构设计阶段

六、架构设计阶段有哪些能确定|预判的指标?
一定不完整, 请各位自行补充
1、技术指标
1、业务类别(在线事务处理、离线分析、实时分析、混合业务)
2、业务场景
3、数据量
4、业务场景相关性能指标(并发、qps、rt)
5、行业合规诉求
6、开发语言
7、应用级功能诉求(图像识别, 分词搜索, GIS, 用户画像, 化学分析, 多维搜索, ...)
8、各种场景的工业标准性能指标(tpcc tpch tpcds)
9、可用性
10、可靠性
11、安全
12、一致性要求
13、扩展性要求
14、容灾要求
15、可恢复的目标范围要求
16、恢复耗时要求
2、商务指标
1、业界成功案例
2、使用这个产品时的开发周期
3、使用这个产品时的开发人力投入成本
4、使用这个产品时的项目IT投入成本
5、数据库软件license成本(不同输出形式下的成本和风险)
6、数据库运维人力投入成本
7、[云产品成本]
3、生态指标
1、发展趋势(全球趋势、本土趋势: 谷歌搜索趋势、stackoverflow趋势、招聘数量)
2、活跃度(bug响应速度, 修复速度, 小版本发布频率, 大版本发布频率, 提交数)
3、培训公司数量、规模(全球、本土)
4、学习成本(有数据库基础到达中级水平需要多长时间)
5、服务公司数量、规模(全球、本土)
6、云厂商数量、规模
7、数据库厂商数量、规模
8、社区规模(人数、内容、活动、流量)
9、市场份额

七、建立DB画像库
建议定期更新数据
数据库种类(一定不完整, 请各位自行补充): 缓存、关系、离线仓库、在线仓库、nosql、时序、图、GIS、...
画像库(一定不完整, 请各位自行补充): pg, mysql, oracle, greenplum, polardb o, adb pg, redis, hbase, mongo, timescaledb, agensgraph, edgedb, postgis, ...

八、数据库产品很多, 怎么做到深度分析?
1、内部技术专家: 建议技术栈多元化, 提高技术竞争力.
2、聘请外部技术顾问: 推荐, 没有立场偏向, 比较公正
3、外部参考资源: 权威行业报告

九、建立评估模型
类似招标书, 建立评估模型。
1、技术分
2、商务分
3、生态分
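下面用一个极简的SQL草图示意"评估模型"的加权汇总思路(表名db_score、分值、权重均为笔者虚构的示例, 实际权重应由选型负责方确定):

-- 示意: 技术分、商务分、生态分的加权评分模型
create table db_score (
  db   text,     -- 候选数据库
  tech numeric,  -- 技术分(0-100)
  biz  numeric,  -- 商务分(0-100)
  eco  numeric   -- 生态分(0-100)
);

insert into db_score values ('候选A', 90, 80, 88), ('候选B', 82, 90, 76);

-- 假设权重: 技术0.5、商务0.3、生态0.2, 按总分排序
select db, tech*0.5 + biz*0.3 + eco*0.2 as total
from db_score
order by total desc;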
背景 1、我是什么企业? 2、我的企业核心价值是什么? 3、数据库对企业实现核心价值影响有哪些? 4、我用了PG吗? 5、我为什么选了PG? 6、除了案例、稳定性、可靠性、功能、性能, 考虑过PG的类BSD开源许可是最友好的? 考虑过PG是纯社区开源数据库? 了解过其他数据库是不是被某国某公司控制的商业开源数据库? 了解过其他数据库的开源协议某些使用场景下是不是有风险? 7、PG数据库的稳定性、bugfix、问题响应速度、功能迭代对你有没有价值, 为什么? 8、开源数据库出了问题谁解决 DBA、开发者 9、解决不了呢? 招人 找开源社区 买服务 10、招人要多久, 怎么才能加速? 人多, 人好找 11、怎么才会人多? 12、找开源社区, 开源社区都没了去哪找? 13、买服务, 都没人用, 谁去做这个服务? 社区好, 人就多, 做服务的生态公司业务多, 社区好对企业有好处 14、怎么参与社区建设? 分享企业的案例!!! 15、为什么企业应该把PG案例(非技术方案)分享给社区? 有利于拓展更多PG用户 16、为什么有利于拓展更多PG用户? 老板选型首先看的是案例, 案例多, 选择可能性大. 案例代表可行性(背书), 然后才是方案成本, 性能, 功能. 技术人选型看技术方案, 可以落地的技术方案. 17、拓展更多PG用户对企业有什么好处? 用户多(使用者、开发者、内核开发者), PG是BSD许可所以厂商可以拿来包装售卖(而且不需要开源), 用户多自然有兼容PG的厂商(实际上大多数就是拿开源PG包装一下)为此牟利, PG服务型的公司也会多起来(PG开源社区超过30%的贡献者都来自PG服务型的公司, 还有超过半数来自PG的最终用户, 可以在PG网站查到这个数据)、对PG生态发展更好, PG社区的版本迭代更快, 也就是企业能享受到更多更好的免费以及收费服务. 企业软件成本下降 企业人力成本降低 18、那么如果PG发展我不关心, 大不了换一个数据库会怎么样? 19、换数据库有什么风险? 作为为数据库选型负责的你是否要提头去见老板. 换一个不行又换一个? 好像你没法跟老板解释. 20、数据库生态不行了, 和最终用户到底有没有关系? 你是用户, 你就在生态里面, 你不出力不参与PG开源社区建设, 不分享案例(如果能分享技术方案更好), 生态链就断了. 21、怎么收集世界上80%以上PG用户案例? 三体里面有个例子很有意思, 《三体》中,红岸利用太阳放大广播,这一设定是真实存在的吗? 红岸基地向太阳发射无线电信号,满足某些条件之后,太阳会将信号放大很多倍,向宇宙进行广播. 否则地球上的无线电要向宇宙发射信号基本上没多少距离就被宇宙中的各种阻力淹没了. https://www.zhihu.com/question/298173194 22、什么是PG用户的太阳? PG 社区(全球各地都有PG社区, 最大的是官网 www.postgresql.org ) 大V也很重要, 但是相比PG社区的影响力偏小. 23、PG 社区为什么要收集世界上80%以上PG用户案例? 有利于PG 发展 24、为什么有利于PG发展, 回到 老板看案例, 案例多, 选择可能性大, 案例代表可行性(背书), 然后才是成本, 性能, 功能 25、所以, 谁是受益者? 所有PG用户, 当然包括你 别吝啬, 分享出你的案例(如果能分享技术方案更好). 26、去哪分享? 中国用户发邮件给press@postgres.cn, 投稿, 会发到PG官方社区. 好像也不缺我一个是吧? 所有人都和你的想法一样, 就没人发了. 附录 今天, 我们来看看PG的热度和趋势数据 谷歌搜索 全球最热的搜索引擎, 最近5年的搜索趋势. 1、postgresql 75->85 2、mysql 100->60 3、oracle 100->59 stackoverflow数据 全球开发者最喜爱的FAQ网站 questions 1、postgresql 113,746 questions 2、mysql 591,311 questions 3、oracle 119,980 questions JOB 1、postgresql 122 jobs 2、mysql 102 jobs 3、oracle 25 jobs stackoverflow趋势 1、postgresql 2、mysql 3、oracle 曾国藩名言: 顺势而为 全球最大搜索引擎google, 全球最热开发者问答社区显示, 过去5年PG处于上升趋势, 招聘JOB人数最多. 相比而言oracle以及被oracle收购的mysql都处于下降趋势. 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
行业: 通用行业. 应用场景: 全托管数据库无法满足某些场景的需求: 1、客户自定义需求, PG 数据库引擎具备插件扩展能力, 但是RDS内置的插件为常用插件, 无法覆盖某些客户的需求,例如: CITUS 分布式数据库插件(许可问题), timescaledb 时序数据库插件(许可问题), trodb mongo兼容版插件(许可问题), zombodb es索引插件(许可问题), madlib 机器学习插件(权限问题), 2、客户需要更多集成能力, 有接入更多数据库、操作系统(或docker)、RDSAPI的需求 (客户dba): 处理更多DBA相关事务, 接入更多内部功能例如上线审计等. 需要掌握DB超级用户权限. (客户devops): 整合企业内部的运维平台. 需要对接更多的数据采集API、内部API、以及更多OS的权限. (客户安全团队): 部署企业的安全、审计软件, 或统一认证模块等. 需要掌握os权限. 3、多产品混合部署需求. 客户可能不只一种数据库产品, 不同产品之间的数据交换频率、响应延迟要求高, 例如游戏行业, 缓存、应用、数据库通常部署在同一台服务器上. 需要混合部署的能力. 4、性价比问题, 客户的实例数很多(例如一些PAAS企业, TOB的业务, 每个B会对应一个数据库实例, 因为需要在业务层做到隔离, 同时又要降低成本), 需要更好的性价比. 用户需要自由指定超卖比例, 例如在业务错峰的情况下, 64C的机器实际VCPU可以卖到120甚至更多. 5、稳定性、抗干扰、安全隔离等问题. 不同业务线的数据库采用不同的主机, 按业务线隔离资源. 根据业务数据库的SLA级别(例如核心库、测试库、普通业务等), 采用不同的隔离策略, 不同的主机集群. 6、混合业务可调度问题:例如白天和晚上的实例两种混合部署模式, 半夜分析型DB开启全速并行计算, 将TP实例挪到其他机器, 让出更多资源给分析业务 白天业务库的资源使用需求高, 给TP实例更多资源, AP实例更少资源. 场景挑战与痛点: 特点: 1、客户非常关心IT成本, 2、客户使用的数据库量较大, 3、客户有自己的运维、DBA团队, 4、客户有自己的IT管理体系, 挑战与痛点: 1、使用全托管数据库, DBA和运维何去何从? 2、使用全托管数据库, 无法完全匹配客户原有的IT管理体系需求? 3、使用全托管数据库, 全黑盒. 企业原有的基于自有用户环境的降成本、业务抗干扰、安全隔离等手段无法施展. 方案: 方案1 纯自建 客户自己管理数据库的完整生命周期 纯全托管 具有完整数据库生命周期管理(创建、变配、克隆、释放、迁移、同步、备份、只读实例、安全、审计、日志、优化、诊断、告警、专家服务等) 通用 管理省心, 开箱即用 全球顶级团队支撑 缺陷: 1、纯自建 由于数据库复杂, 需要专业的网络、系统、存储、DBA团队 限于客户体量, 难有全球顶级的专业团队支撑 2、纯全托管 纯黑盒, 以上提到的客户需求无法满足 方案2 大客户专属数据库集群 用户对集群有更多的可控性(系统、数据库权限, 开放更多API, 开放后端调度、管理平台) 优势: 继承RDS优势 高可用 高可靠 高性能 安全审计 备份 克隆 只读实例 支持集成 用户可自由集成专属数据库集群 开放os、db超级用户权限 开放更多API, 可与用户内部平台进行整合 支持堡垒机接入,更加安全 高性价比 自由设定超卖比例 自由打散, 自由设置超卖比 根据不同业务等级隔离主机资源 错峰调度 自由按业务等级分配主机 错峰调度: 如白天业务和分析库, 晚上业务库腾挪紧凑使用 支持定制 用户可自由定义插件 citus 分库分表 timescaledb 时序插件 zombodb ES 搜索插件 madlib 机器学习插件 大客户专属数据库集群价值: 通过开放更多底层API和权限, 支持与客户现有系统集成, 数据库定制能力. 通过开放后台管理系统以及专有主机模式, 支持用户完全可控的资源调度, 超卖比, 提升整体性价比、资源隔离性、调度能力. 通过堡垒机的接入, 提升了用户数据库的安全性. 通过完全继承RDS的能力, 拥有阿里云数据库数十年技术沉淀与数百万用户线上稳定运行保障, 确保了大客户专属主机组的产品质量. 本功能产品手册: 官网手册, 购买页面: https://common-buy.aliyun.com/?spm=5176.9826160.0.0.4cb81450IqUZwe&commodityCode=mysql_machine_pre#/buy 使用方法介绍: 针对不同业务, 可以创建不同的主机组, 通过超卖比、打散规则等控制成本和稳定性. 1、购买主机组, 由于高可用需求, 一个组至少要2台主机. https://common-buy.aliyun.com/?spm=5176.9826160.0.0.4cb81450IqUZwe&commodityCode=mysql_machine_pre#/buy 2、(可选)配置主机组CPU、存储超卖比. 例如CPU超卖比为200%, 则64核的主机则最多可以创建出总共 128VCPU 的数据库实例. 3、(可选)配置主机组打散规则(资源分配策略) 均衡分配:表示最大化追求更稳定的系统表现,优先会从更空的主机中分配资源紧凑分配:表示最大化追求更充分的资源利用率,优先会利用已分配主机资源 4、(可选)配置主机的内存分配阈值, 内存分配越多越容易发生OOM, 用户可以根据业务SLA诉求来进行设置. 5、(可选)用户可以根据业务SLA诉求, 通过配置提高主机组的冗余度, 并配置自动替换故障主机. 主机是云盘主机,必须打开“允许自动快照主机存储空间”主机是非云盘主机,则先将故障主机上的实例迁移走,然后自动替换主机手动替换主机:针对故障主机,用户手动去替换 6、创建实例 进入主机组后, 如果需要在这个主机组内创建实例, 点击创建实例按钮. 7、管理实例 点击实例即可跳转到实例管理页面. 实例变配等操作: 8、(可选)创建主机账号 首先需要将主机组接入到堡垒机, 接入后, 就可以添加主机账号. (在主机组控制台即可接入堡垒机) 如果要登陆主机组的主机, 需要添加主机账号, 添加完账号之后, 就可以登陆对应的主机了. 9、(可选)管理主机 查询主机当前状态、主机监控信息等. 阿里云PG技术交流群
标签PostgreSQL , 阿里云行业:互联网、新零售、交通、智能楼宇、教育、游戏、医疗、社交、公安、机场等.场景挑战与痛点:业务特点:1、需要高效率、高精度的以图搜图2、业务不仅有图像搜索的需求, 同时还有其他条件的过滤需求业务挑战与痛点:1、通用关系数据库例如MySQL不支持向量检索, 需要遍历查询并全部返回到应用端进行计算, 性能差, 并且需要耗费大量网络带宽.2、即使关系数据库支持了向量检索的操作符, 但是并不支持向量索引, 所以依然需要遍历计算, 性能差, 无法支撑高并发查询.3、当图像向量计算上移到应用层实现时, 需要从数据库加载所有数据, 加载速度慢, 而且图像更新后无法实时加载, 效率低4、当图像向量计算上移到应用层实现时, 无法支撑图像识别以及其他属性检索的联合过滤, 效率低下.技术方案:方案1数据库仅存储图像向量, 不进行向量计算图像向量计算上移到应用层实现缺陷:普通数据库不支持向量索引, 无法在数据库中完成向量过滤应用需要从数据库加载所有数据, 加载速度慢, 而且图像更新后无法被应用实时加载, 效率低.无法在数据库中实现图像识别条件筛选+其他属性的条件筛选的联合过滤, 需要在业务层过滤图像条件, 网络传输的记录多, 效率低, 无法支持高并发场景方案2使用RDS PG存储图像向量特征值在RDS PG的pase插件, 创建图像特征向量的向量索引应用输入特征向量, 在数据库中通过向量索引, 快速搜索到与之相似的图像, 支持返回向量距离, 以及按向量距离进行排序当有多个过滤条件时, 数据库可以使用多个索引对多个条件进行合并过滤优势:RDS PG数据库支持向量索引, 图像搜索可以直接在数据库中高效率过滤, 应用与数据库之间RDS PG支持索引合并过滤, 可以同时过滤图像条件、其他属性条件, 通过索引可以最大化收敛条件结果集, 大幅度提升性能, 降低传输量. 单次查询可以毫秒级完成.通过RDS PG只读实例, 可以再次提高整体查询吞吐.注意:本方案为数据库向量搜索加速方案, 并未涉及图像特征值提取(图像转换为高维向量)的部分, 图像特征值提取可以在应用层完成.目前阿里云RDS PG pase插件支持两种业界流行的向量索引算法ivfflat和hnsw, 未来将持续集成业界优秀的向量索引算法. ivfflat算法 hnsw算法 详细使用方法请参考阿里云RDS PG官方手册pase插件说明文档. (https://help.aliyun.com/document_detail/147837.html)RDS PG方案价值:1、RDS PG支持了高维向量索引检索功能(pase插件), 可以非常高效率的实现图像向量的相似匹配搜索, 单次请求仅需毫秒级. 2、高维向量检索不仅能应用在图像搜索, 同时还能应用在任意可以数字化的特征搜索, 例如用户画像特征搜索, 营销系统中的相似人群扩选等场景. 3、RDS PG支持索引合并搜索, 从而在数据库中可以一次性完成向量搜索、普通查询条件过滤的联合过滤, 大幅度提升性能. 对比MySQL的通用方案, RDS PG 的pase向量索引插件加速方案优势非常明显, 是一个低成本, 高效率的解决方案.平均性能提升 2457900%, 达到毫秒级响应.以上数据来自4核8G RDS数据库实例, 100万图片的实操对比数据. 目前支持该功能的RDS PG版本:RDS PG V11未来将在V10以上的所有版本支持.本功能产品手册: https://help.aliyun.com/document_detail/147837.htmlDEMO介绍:通用操作1、购买RDS PG 11 2、配置白名单 3、创建用户 4、创建数据库方案 DEMO方案1 demo1、创建测试表create table if not exists t_vec_80( id serial PRIMARY KEY, -- 主键 c1 int, -- 其他属性字段 c2 int, c3 text, c4 timestamp, vec float4[] -- 图像特征值向量 ); 2、创建生成随机向量的函数(用于模拟图像特征值, 实际场景请使用实际图片特征值存入)create or replace function gen_float4_arr(int,int) returns float4[] as $$ select array_agg(trunc(random()*$1)::float4) from generate_series(1,$2); $$ language sql strict volatile; 3、写入100万随机向量insert into t_vec_80 (c1,c2,c3,c4,vec) select random()*100, random()*100000, md5(random()::text), clock_timestamp(), gen_float4_arr(10000, 80) from generate_series(1,1000000); 结果样例:select * from t_vec_80 limit 3; -[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ id | 1 c1 | 99 c2 | 7428 c3 | 9b74e40ab38ed4f41b5d50b8eedf8b72 c4 | 2020-02-27 15:36:56.895773 vec | {6469,3787,5852,1642,2798,7728,1527,6990,7399,3460,7802,7682,8102,6499,3428,7687,567,8894,8144,1685,6139,9549,3613,1714,721,9099,4218,1930,9031,4961,3966,5501,8748,9818,7143,1546,7547,8671,8536,4946,2132,6338,2629,234,2838,6057,7922,3405,4951,6066,5091,1091,5615,8704,2805,6336,7804,7024,8266,6836,1985,2233,2337,733,2051,9481,2280,9598,8152,816,4545,285,7155,7174,519,9993,3232,8441,3399,8183} -[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ id | 2 c1 | 45 c2 | 84908 c3 | a48d421b772486121ef520eb3e285f95 c4 | 2020-02-27 
15:36:56.896329 vec | {123,7195,2080,6460,5000,9104,4727,1836,1089,6960,4174,1823,9012,3656,4103,8611,1808,4920,3157,2094,2076,332,2613,2070,3564,1055,5469,1748,5563,3960,1023,5686,1156,3103,2147,6156,2208,6874,7993,3298,3834,2167,5121,2847,5823,9225,1458,7632,4145,4615,9726,6222,4947,2340,8292,8511,3395,3762,259,8958,7722,1282,4644,8878,4386,6792,5035,6594,3666,3028,9892,7501,5196,5014,348,1019,4239,1806,8652,8384} -[ RECORD 3 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ id | 3 c1 | 64 c2 | 83785 c3 | ea856c452399648fd29b0e0383a169a5 c4 | 2020-02-27 15:36:56.896395 vec | {1369,718,2899,9880,4113,6661,140,3071,4383,1422,7716,3262,5808,4509,8298,2403,8175,1326,2295,5676,6523,7309,6024,7542,1549,7831,6194,9934,4253,4573,4541,5622,5291,7440,5503,9405,4101,5643,2477,8485,7066,194,1748,2875,4703,46,5278,2878,1373,7574,8555,7896,4884,4580,5439,6433,2411,1633,6367,6664,6207,909,2286,1498,8349,7789,903,2451,3433,3381,936,499,3575,2685,3374,8278,2731,8653,1157,4105} 表占用空间: public | t_vec_80 | table | digoal | 411 MB | 4、查询出100万条记录返回给客户端time psql -h xxx.xxx.xxx.xxx -p 3433 -U digoal postgres -c "select * from t_vec_80" >/dev/null 结果:real 1m1.450s user 0m21.891s sys 0m2.399s 5、并发能力测试vi test.sql select * from t_vec_80; pgbench -M prepared -n -r -f ./test.sql -c 4 -j 4 -T 600 -h xxx.xxx.xxx.xxx -p 3433 -U digoal postgres 结果:transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 4 number of threads: 4 duration: 600 s number of transactions actually processed: 36 latency average = 72293.794 ms tps = 0.055330 (including connections establishing) tps = 0.055330 (excluding connections establishing) statement latencies in milliseconds: 72204.857 select * from t_vec_80; 方案2 demo1、创建 pase 向量索引插件create extension pase; 2、创建测试表create table if not exists t_vec_80( id serial PRIMARY KEY, -- 主键 c1 int, -- 其他属性字段 c2 int, c3 text, c4 timestamp, vec float4[] -- 图像特征值向量 ); 3、创建生成随机向量的函数(用于模拟图像特征值, 实际场景请使用实际图片特征值存入)-- 创建生成随机向量的函数 create or replace function gen_float4_arr1(int,int) returns float4[] as $$ select array_agg(trunc(random()*$1)::float4) from generate_series(1,$2); $$ language sql strict volatile; -- 创建基于数组生成随机附近数组的函数 create or replace function gen_float4_arr(float4[], int) returns float4[] as $$ select array_agg( (u + (u*$2/2.0/100) - u*$2/100*random())::float4 ) from unnest($1) u; $$ language sql strict volatile; 4、写入100万随机向量do language plpgsql $$ declare v_cent float4[]; begin for i in 1..100 loop -- 100个中心点 v_cent := gen_float4_arr1(10000,80); -- 取值范围10000, 80个维度 insert into t_vec_80 (vec) select gen_float4_arr(v_cent, 20) from generate_series(1,10000); -- 1万个点围绕一个中心点, 每个维度的值随机加减20% end loop; end; $$; 5、创建向量索引(使用hnsw算法索引, 目前pase插件支持两种索引ivfflat和hnsw), 请实际使用时定要参考RDS PG pase文档, 索引参数需要正确的被设置(特别是维度需要和实际维度一致).CREATE INDEX idx_t_vec_80_1 ON t_vec_80 USING pase_hnsw(vec) WITH (dim = 80, base_nb_num = 16, ef_build = 40, ef_search = 200, base64_encoded = 0); 创建索引耗时:CREATE INDEX Time: 1282997.955 ms (21:22.998) 索引占用空间: public | idx_t_vec_80_1 | index | digoal | t_vec_80 | 8138 MB | 索引创建完成后, 未来更新或新增图像特征值时, 会自动更新索引, 不需要再创建索引. 
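在看下面第6步的真实例子之前, 先说明一下pase查询常量的写法。按笔者对上文例子和RDS PG pase文档的理解, 查询常量的格式大致为 '向量:查询参数:距离类型'::pase, hnsw索引下第二段是查询时的ef_search, 第三段0表示按欧氏距离排序(即下文例子中 ':40:0' 的含义), 具体请以官方文档为准。用一个虚构的4维小表示意(真实维度必须与建索引时的dim一致):

-- 示意(t_demo为虚构的小表, 仅演示查询写法):
-- create table t_demo (id int, vec float4[]);
-- CREATE INDEX ON t_demo USING pase_hnsw(vec)
--   WITH (dim = 4, base_nb_num = 16, ef_build = 40, ef_search = 200, base64_encoded = 0);
select id,
       vec <?> '1,2,3,4:40:0'::pase as dist   -- '向量:ef_search:距离类型'
from t_demo
order by vec <?> '1,2,3,4:40:0'::pase
limit 5;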
6、基于一个随机输入向量查询出5条与之最相似的向量, 并按向量距离顺序返回explain select id as v_id, vec <?> '7533.44,3740.27,670.119,994.914,3619.27,2018.17,2041.34,5483.19,6370.07,4745.54,8762.81,1117.59,8254.75,2009.3,6512.47,3876.7,4775.02,384.683,2003.78,7926.78,9101.46,6801.24,5397.1,7704.49,7546.87,9129.23,9517.36,5723.4,877.649,3117.72,6739.25,8950.36,6397.09,6687.46,9606.15,557.142,9742.48,1714.25,6682.97,5369.21,6178.99,4983.06,7064.29,5433.98,7120.7,2980.34,8485.47,1651.98,3656.9,1126.65,10260.1,2139.89,9041.79,4988.89,17.2254,5482.88,3428.6,10370.7,1749.32,4761.6,2806.65,8040.89,3176.31,9491.93,4355.37,2898.47,282.75,3233.86,4248.86,7012.86,9238.95,524.011,2285.75,5363.21,5558.95,10768.8,8689.83,4907.53,1372.65,1982.05:40:0'::pase as v_dist, vec as v_vec from t_vec_80 order by vec <?> '7533.44,3740.27,670.119,994.914,3619.27,2018.17,2041.34,5483.19,6370.07,4745.54,8762.81,1117.59,8254.75,2009.3,6512.47,3876.7,4775.02,384.683,2003.78,7926.78,9101.46,6801.24,5397.1,7704.49,7546.87,9129.23,9517.36,5723.4,877.649,3117.72,6739.25,8950.36,6397.09,6687.46,9606.15,557.142,9742.48,1714.25,6682.97,5369.21,6178.99,4983.06,7064.29,5433.98,7120.7,2980.34,8485.47,1651.98,3656.9,1126.65,10260.1,2139.89,9041.79,4988.89,17.2254,5482.88,3428.6,10370.7,1749.32,4761.6,2806.65,8040.89,3176.31,9491.93,4355.37,2898.47,282.75,3233.86,4248.86,7012.86,9238.95,524.011,2285.75,5363.21,5558.95,10768.8,8689.83,4907.53,1372.65,1982.05:40:0'::pase limit 5; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..7.47 rows=5 width=352) -> Index Scan using idx_t_vec_80_1 on t_vec_80 (cost=0.00..1493015.50 rows=1000000 width=352) Order By: (vec <?> '7533.43994140625,3740.27001953125,670.119018554688,994.914001464844,3619.27001953125,2018.17004394531,2041.33996582031,5483.18994140625,6370.06982421875,4745.5400390 625,8762.8095703125,1117.58996582031,8254.75,2009.30004882812,6512.47021484375,3876.69995117188,4775.02001953125,384.683013916016,2003.78002929688,7926.77978515625,9101.4599609375,6801.240234375 ,5397.10009765625,7704.490234375,7546.8701171875,9129.23046875,9517.3603515625,5723.39990234375,877.648986816406,3117.71997070312,6739.25,8950.3603515625,6397.08984375,6687.4599609375,9606.15039 
0625,557.142028808594,9742.48046875,1714.25,6682.97021484375,5369.2099609375,6178.990234375,4983.06005859375,7064.2900390625,5433.97998046875,7120.7001953125,2980.34008789062,8485.4697265625,165 1.97998046875,3656.89990234375,1126.65002441406,10260.099609375,2139.88989257812,9041.7900390625,4988.89013671875,17.2254009246826,5482.8798828125,3428.60009765625,10370.7001953125,1749.31994628 906,4761.60009765625,2806.64990234375,8040.89013671875,3176.31005859375,9491.9296875,4355.3701171875,2898.46997070312,282.75,3233.86010742188,4248.85986328125,7012.85986328125,9238.9501953125,52 4.010986328125,2285.75,5363.2099609375,5558.9501953125,10768.7998046875,8689.830078125,4907.52978515625,1372.65002441406,1982.05004882812::'::pase) (3 rows) select id as v_id, vec <?> '7533.44,3740.27,670.119,994.914,3619.27,2018.17,2041.34,5483.19,6370.07,4745.54,8762.81,1117.59,8254.75,2009.3,6512.47,3876.7,4775.02,384.683,2003.78,7926.78,9101.46,6801.24,5397.1,7704.49,7546.87,9129.23,9517.36,5723.4,877.649,3117.72,6739.25,8950.36,6397.09,6687.46,9606.15,557.142,9742.48,1714.25,6682.97,5369.21,6178.99,4983.06,7064.29,5433.98,7120.7,2980.34,8485.47,1651.98,3656.9,1126.65,10260.1,2139.89,9041.79,4988.89,17.2254,5482.88,3428.6,10370.7,1749.32,4761.6,2806.65,8040.89,3176.31,9491.93,4355.37,2898.47,282.75,3233.86,4248.86,7012.86,9238.95,524.011,2285.75,5363.21,5558.95,10768.8,8689.83,4907.53,1372.65,1982.05:40:0'::pase as v_dist, vec as v_vec from t_vec_80 order by vec <?> '7533.44,3740.27,670.119,994.914,3619.27,2018.17,2041.34,5483.19,6370.07,4745.54,8762.81,1117.59,8254.75,2009.3,6512.47,3876.7,4775.02,384.683,2003.78,7926.78,9101.46,6801.24,5397.1,7704.49,7546.87,9129.23,9517.36,5723.4,877.649,3117.72,6739.25,8950.36,6397.09,6687.46,9606.15,557.142,9742.48,1714.25,6682.97,5369.21,6178.99,4983.06,7064.29,5433.98,7120.7,2980.34,8485.47,1651.98,3656.9,1126.65,10260.1,2139.89,9041.79,4988.89,17.2254,5482.88,3428.6,10370.7,1749.32,4761.6,2806.65,8040.89,3176.31,9491.93,4355.37,2898.47,282.75,3233.86,4248.86,7012.86,9238.95,524.011,2285.75,5363.21,5558.95,10768.8,8689.83,4907.53,1372.65,1982.05:40:0'::pase limit 5; v_id | v_dist | v_vec ---------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------- 1000001 | 613508 | {7508.56,3828.8,670.162,978.82,3654.93,2052.38,2023.41,5518.4,6478.1,4814.47,8689.2,1130.5,8421.43,2018.39,6534.18,3884.82,4737.2,385.625,2025.83,7917.54,8892.97,6900.71 ,5421.61,7579.82,7649.6,9337.72,9530.01,5818.69,873.353,3105.67,6622.92,9102.99,6360.46,6737.99,9374.82,545.683,9734.36,1699.74,6753.08,5320.49,6062.47,4870.6,6907.26,5304.41,7166.67,2997.09,850 8.14,1691.62,3595.89,1113.89,10232.1,2107.41,8846.84,4875.69,17.1081,5574.26,3513.31,10576.6,1763.01,4734.1,2780.48,8165.04,3132.32,9586.17,4345.39,2859.25,286.716,3306.16,4260.56,7007.33,9126.8 1,528.518,2288.32,5310,5610,10584,8872.31,4843.43,1347.01,1940.52} 1003628 | 8.55116e+06 | 
{7532.93,3345.53,694.608,984.268,3507.72,1950.43,2188.66,6043.55,6832.57,4384.97,8975.91,1290.02,8519.66,2237.75,6985.71,3890.79,4199.22,410.386,2294.93,7938.11,8989.48, 6374.35,5742.55,7811.5,7492.1,9067.4,9843.13,5885.26,940.365,3435.39,6545.54,8069.38,6126.34,6906.32,10175.4,505.915,9504.69,1630.76,6832.68,5477.68,6446.75,5109.62,6686.55,5688.48,6778.92,3100. 2,9182.86,1733.95,3933.06,1116.63,10488.3,2346.63,8257.46,5312.34,16.0066,5078.85,3717.24,10262.9,1624.57,4406.59,2983.23,7405.85,3159.04,9924.56,4947.86,2573.72,276.545,3673.99,4487.34,6820.15, 8524.12,486.187,2328.58,4769.64,5541.63,10255.7,8280.42,5141.37,1332.7,1989.67} 1004945 | 9.21305e+06 | {7348.72,3833.3,706.311,985.864,3632.96,2153.75,2172.06,6427.87,6502.42,4678.54,8955.37,1207.76,8594.73,1958.02,6839.83,3703.57,4091.18,367.272,1970.81,7266.62,9198.17,6 869.98,5960.79,7658.46,7180.35,9386.35,10320.3,6593.09,900.23,3330.1,6749.94,9182.85,6839.25,7254.11,9533.32,580.504,9069.41,1841.88,6840.14,4948.41,6390.41,5102.95,6873.49,5683.65,7283.23,3124. 15,8727.17,1810.11,3575.12,1111.99,10081.7,2174.01,8797.29,5301.64,17.779,5196.54,3848.29,9813.85,1514.4,4357.8,2752.47,7138.15,2905.04,10178.2,5025.82,2713.42,267.272,3557.03,4388.08,6581.85,91 14.22,470.335,2249.53,5274.76,5353.28,10566.6,9153.67,4746.68,1439.05,1996.43} 1009195 | 9.5047e+06 | {7952.02,3520.88,632.554,1014.25,3682.3,2152.37,2108.55,5609.13,6663.42,4410.93,7935.51,1272.55,8609.25,2337.6,6845.14,3849.27,3970,422.706,2090.26,8533.55,9108.23,6752. 42,5636.14,7223.91,7627.38,9467.08,8763.63,6810.94,819.782,3407.48,6512.03,9083.21,6403.44,6224.57,9703.19,553.033,9508.63,1823.54,6942.67,5340.35,5954.36,5616.57,6423.06,5320.32,7837.67,2903.61 ,8450.55,1892.85,3821.65,1140.62,10152.7,2306.96,8871.29,5034.8,17.8199,5573.62,3686.87,10214.3,1688.62,4667.3,2943.37,7669.45,3079.31,10188.6,4638.13,2907.09,254.251,3438.58,4657.61,6342.84,948 5.26,465.782,2388.26,5125.77,6048.52,9961.5,8328.46,5174.91,1416.44,1937.93} 1008523 | 9.65744e+06 | {7255.87,3299.84,671.464,1047.55,3705.29,2031.92,1992.93,5689.99,6486.58,4153.71,8173.91,1224.91,8320.19,2170.14,6585.28,3911.89,4329.78,401.384,2084.19,8345.98,9496.74, 7188.78,5137.15,7485.36,6914.55,8471.34,9674.72,6395.1,810.129,3015.94,6551.72,8213.34,6518.96,6672.72,9064.75,565.507,9560.03,1621.07,7184.9,5224.67,6092.26,4897.21,6021.32,5271.55,7731.19,3218 .24,8516.33,1660.11,3269.62,1145.53,10584.7,2058.17,7786.21,4795.73,16.5323,5396.69,3830.83,10393.6,1526.46,4794.47,2644.17,8514.68,3477.77,9360.25,4510.64,2528.64,238.049,3361.88,4388.69,7549.8 3,9101.76,545.511,2029.84,5622.08,5770.27,10192.2,8269.93,4979.93,1547.04,2017} (5 rows) Time: 2.502 ms 7、并发能力测试-- 为确保每次输入的图像特征值是随机的, 而不是一个确定值, 模拟真实场景, 使用函数进行测试 -- 测试方法 -- 从表里随机取一条记录, 每个维度的浮点值修改5%, 生成一个新的随机向量 -- 基于这个新的随机向量, 搜索5条最相似的向量, 按相似度顺序返回 create or replace function get_vec( in i_id int, in i_pect int, out v_id int, out v_dist float4, out v_vec float4[] ) returns setof record as $$ declare v_vec float4[]; v_pase text; begin select vec into v_vec from t_vec_80 where id=i_id; v_vec := gen_float4_arr(v_vec, i_pect); v_pase := rtrim(ltrim(v_vec::text, '{'),'}')||':40:0'; -- raise notice '%', v_pase; return query select id as v_id, vec <?> v_pase::pase as v_dist, vec as v_vec from t_vec_80 order by vec <?> v_pase::pase limit 5; end; $$ language plpgsql strict; postgres=> select min(id),max(id) from t_vec_80; min | max ---------+--------- 1000001 | 2000000 (1 row) vi test.sql \set id random(1000001,2000000) select * from get_vec(:id, 5); pgbench -M prepared -n -r -f ./test.sql -c 12 -j 12 -T 
600 -h xxx.xxx.xxx.xxx -p 3433 -U digoal postgres 模拟查询例子:postgres=> select * from get_vec(1000001,5); -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- v_id | 1000001 v_dist | 549580 v_vec | {7508.56,3828.8,670.162,978.82,3654.93,2052.38,2023.41,5518.4,6478.1,4814.47,8689.2,1130.5,8421.43,2018.39,6534.18,3884.82,4737.2,385.625,2025.83,7917.54,8892.97,6900.71,5421.61,7579.82,7649.6,9337.72,9530.01,5818.69,873.353,3105.67,6622.92,9102.99,6360.46,6737.99,9374.82,545.683,9734.36,1699.74,6753.08,5320.49,6062.47,4870.6,6907.26,5304.41,7166.67,2997.09,8508.14,1691.62,3595.89,1113.89,10232.1,2107.41,8846.84,4875.69,17.1081,5574.26,3513.31,10576.6,1763.01,4734.1,2780.48,8165.04,3132.32,9586.17,4345.39,2859.25,286.716,3306.16,4260.56,7007.33,9126.81,528.518,2288.32,5310,5610,10584,8872.31,4843.43,1347.01,1940.52} -[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- v_id | 1004945 v_dist | 8.61114e+06 v_vec | {7348.72,3833.3,706.311,985.864,3632.96,2153.75,2172.06,6427.87,6502.42,4678.54,8955.37,1207.76,8594.73,1958.02,6839.83,3703.57,4091.18,367.272,1970.81,7266.62,9198.17,6869.98,5960.79,7658.46,7180.35,9386.35,10320.3,6593.09,900.23,3330.1,6749.94,9182.85,6839.25,7254.11,9533.32,580.504,9069.41,1841.88,6840.14,4948.41,6390.41,5102.95,6873.49,5683.65,7283.23,3124.15,8727.17,1810.11,3575.12,1111.99,10081.7,2174.01,8797.29,5301.64,17.779,5196.54,3848.29,9813.85,1514.4,4357.8,2752.47,7138.15,2905.04,10178.2,5025.82,2713.42,267.272,3557.03,4388.08,6581.85,9114.22,470.335,2249.53,5274.76,5353.28,10566.6,9153.67,4746.68,1439.05,1996.43} -[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- v_id | 1003628 v_dist | 9.11551e+06 v_vec | 
{7532.93,3345.53,694.608,984.268,3507.72,1950.43,2188.66,6043.55,6832.57,4384.97,8975.91,1290.02,8519.66,2237.75,6985.71,3890.79,4199.22,410.386,2294.93,7938.11,8989.48,6374.35,5742.55,7811.5,7492.1,9067.4,9843.13,5885.26,940.365,3435.39,6545.54,8069.38,6126.34,6906.32,10175.4,505.915,9504.69,1630.76,6832.68,5477.68,6446.75,5109.62,6686.55,5688.48,6778.92,3100.2,9182.86,1733.95,3933.06,1116.63,10488.3,2346.63,8257.46,5312.34,16.0066,5078.85,3717.24,10262.9,1624.57,4406.59,2983.23,7405.85,3159.04,9924.56,4947.86,2573.72,276.545,3673.99,4487.34,6820.15,8524.12,486.187,2328.58,4769.64,5541.63,10255.7,8280.42,5141.37,1332.7,1989.67} -[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- v_id | 1007551 v_dist | 9.44673e+06 v_vec | {7781.26,3380.16,599.097,902.545,3547.6,1982.01,2408.72,5823.09,5854.29,4392.29,9184.52,1268.16,8240.44,2106.41,6257.97,3703.93,4635.53,378.289,1987.59,8185.38,8466.11,7341.06,5290.8,7422.01,7250.71,8765.47,9341.37,6343.1,865.465,3123.4,5753.41,9331.6,6897.8,6410.83,8874.91,572.861,9001.73,1567.28,6087.64,5422.22,6226.57,5704.15,6499.31,5340.14,7157.55,3300.96,8137.33,1648.01,3872.58,1048.15,10322,2171.44,8874.25,4800.68,17.2407,5297.92,3962.59,10463.2,1482.13,4316.52,2762.2,7293.2,2932.35,10294.3,4539.97,2551.33,266.689,3879.39,4287.27,7169.59,8934.47,544.819,2246.28,4860.29,5837.37,10389.2,8959.5,4836.24,1283.66,2118.71} -[ RECORD 5 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- v_id | 1008335 v_dist | 9.48463e+06 v_vec | {7993.58,3279.23,608.321,947.312,3855.4,2190.95,2013.19,6063.82,6356.44,4670.55,9118.76,1155.98,8339.1,2082.98,6675.26,3565.42,4172.02,432.199,2115.09,7211.91,8375.44,6845.13,5692.45,7955.92,7269.1,9351.03,9016.28,5845.67,840.522,2964.57,6185.9,9328.92,6371.88,6985.29,9314.6,575.449,8884.66,1681.17,6381.56,5767.74,5796.38,4839.26,6309.88,5030.22,7347.04,3403.45,9072.78,1858.26,3753.29,1008.68,10277.5,2072.03,8010.28,5153.73,17.669,4755.41,3723.93,10381.5,1512.89,4821.96,3179.53,7987.13,3276.66,8983.62,4408.31,2430.41,284.952,3731.14,4382.78,6574.45,9154.04,520.929,2136.69,4835.47,5222.18,10158.4,9192.24,4820.05,1417.67,2106.94} 结果:transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 12 number of threads: 12 duration: 600 s number of transactions actually processed: 633784 latency average = 11.361 ms tps = 1056.253960 (including connections establishing) tps = 
1056.298691 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set id random(1000001,2000000) 11.358 select * from get_vec(:id, 5); 方案对比:环境:数据库计算规格存储规格MySQL 8.04C 8G1500GB ESSDPG 114C 8G1500GB ESSD性能对比:CASE(100万图像、80维向量)方案1(MySQL、PG)常规方案(所有记录返回到应用层计算)方案2(PG, pase插件)数据库内部支持图像搜索方案2 vs 方案1提升%单次查询响应速度61.45秒2.5毫秒2457900%并发查询qps0.05533010561908449%课程视频视频: https://yq.aliyun.com/live/1905阿里云RDS PG优惠活动https://www.aliyun.com/database/postgresqlactivityRDS PG优惠活动:9.9元试用3个月升级5折阿里云PG技术交流群
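补充: 上文提到RDS PG支持"向量搜索+普通属性条件"的联合过滤, 但demo中没有给出例子。下面借用上文get_vec函数的思路补一个草图(函数名get_vec_filtered为笔者假设; 是否走索引、如何合并由优化器决定, 且demo数据的c1列未填充, 实际使用请先填充属性列):

-- 示意: 属性条件过滤 + 向量距离排序取TOP 5
create index on t_vec_80 (c1);   -- 为普通属性字段建btree索引

create or replace function get_vec_filtered(
  in  i_id int,     -- 以这条已有记录的向量作为查询向量(做法同上文get_vec)
  in  i_c1 int,     -- 普通属性过滤条件
  out v_id int, out v_c1 int, out v_dist float4
) returns setof record as $$
declare
  v_pase text;
begin
  select rtrim(ltrim(vec::text, '{'), '}') || ':40:0' into v_pase
    from t_vec_80 where id = i_id;

  return query
    select id, c1, vec <?> v_pase::pase
      from t_vec_80
     where c1 = i_c1                    -- 普通属性条件
     order by vec <?> v_pase::pase      -- 向量距离排序
     limit 5;
end;
$$ language plpgsql strict;

-- 用法示意: select * from get_vec_filtered(1000001, 64);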
标签
PostgreSQL , postgres_fdw , 阿里云 , 内核安全限制

背景
阿里云RDS PG内核在安全上做了限制, 只能访问当前实例的其他库。所以使用dblink、postgres_fdw时, 虽然PG功能上是可以访问其他远程实例的, 但是阿里云RDS PG限制了只能访问当前实例。
另一方面, 当前实例是HA版本, 并且是云化版本, 所以IP、PORT都可能在发生迁移、切换后发生变化。因此为了能够让用户使用dblink、postgres_fdw访问本实例的其他跨库资源, 内核上做了hack: port、host、hostaddr都不允许指定。
通过DBLINK创建视图也是一样的道理。

用法举例
1 创建postgres_fdw
create extension postgres_fdw;

2 创建外部server
drop SERVER foreign_server cascade;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS ( dbname 'pgbi_hf');
-- 正常来说这里要指定host、port; RDS PG 10 高可用版本不需要指定

3 为当前用户匹配创建好的外部server
CREATE USER MAPPING FOR 本地数据库用户 SERVER foreign_server OPTIONS (user 'xxx', password 'xxx');

4 创建外部表
CREATE FOREIGN TABLE xxx (
  c1 varchar(40),
  c2 varchar(200)
) SERVER foreign_server OPTIONS (schema_name 'xxx', table_name 'xxx');

5 可以将外部server上指定的schema里面的所有表, 一次性映射到本地的某个指定SCHEMA里面
import foreign schema remote_schema1 from server foreign_server INTO local_schema1;

参考
《阿里云rds PG, PPAS PostgreSQL 同实例,跨库数据传输、访问(postgres_fdw 外部表)》
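补充一个示意用法(some_remote_table、v_remote均为笔者假设的名字): 外部表映射到本地之后, 可以像本地表一样查询, 也可以在其上创建本地视图; 上文提到的"通过DBLINK创建视图"同理, dblink连接串中也只需指定dbname:

-- 示意: 基于已映射的外部表创建本地视图
create view v_remote as
  select c1, c2 from local_schema1.some_remote_table;

select * from v_remote limit 10;

-- dblink同理(需先 create extension dblink;), 连接串只指定dbname:
-- select * from dblink('dbname=pgbi_hf', 'select c1, c2 from some_table')
--   as t(c1 varchar(40), c2 varchar(200));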
标签 PostgreSQL , perf insight , 等待事件 , 采样 , 发现问题 , Oracle 兼容性 背景 通常普通的监控会包括系统资源的监控: cpu io 内存 网络 等,但是仅凭资源的监控,当问题发生时,如何快速的定位到问题在哪里?需要更高级的监控: 更高级的监控方法通常是从数据库本身的特性触发,但是需要对数据库具备非常深刻的理解,才能做出好的监控和诊断系统。属于专家型或叫做经验型的监控和诊断系统。 《[未完待续] PostgreSQL 一键诊断项 - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 低频指标 - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 高频指标(服务器) - 珍藏级》 《PostgreSQL 实时健康监控 大屏 - 高频指标 - 珍藏级》 《PostgreSQL pgmetrics - 多版本、健康监控指标采集、报告》 《PostgreSQL pg_top pgcenter - 实时top类工具》 《PostgreSQL、Greenplum 日常监控 和 维护任务 - 最佳实践》 《PostgreSQL 如何查找TOP SQL (例如IO消耗最高的SQL) (包含SQL优化内容) - 珍藏级》 《PostgreSQL 锁等待监控 珍藏级SQL - 谁堵塞了谁》 然而数据库在不断的演进,经验型的诊断系统好是好,但是不通用,有没有更加通用,有效的发现系统问题的方法? AWS与Oracle perf insight的思路非常不错,实际上就是等待事件的统计追踪,作为性能诊断的方法。 https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html 《AWS performance insight》 简单来说就是对系统不停的打点,例如每秒一个采样,仅记录这一秒数据库活跃的会话(包括等待中的会话),等待事件,QUERY,时间,用户,数据库。这几个指标。 活跃度会话,不管是在耗费CPU,还是在等待(锁,IO)或者其他,实际上都是占用了资源的。可以算出平均的活跃会话(例如10秒的平均值,5秒的平均值)(avg active sessions)。 这个avg active sessions是一个值,这个值和数据库实例的CPU个数进行比较,就可以衡量出系统是否存在瓶颈(当avg active sessions超过CPU个数时,说明存在瓶颈)。 当某个时间窗口存在瓶颈,瓶颈在哪里,则可以通过这个时间窗口内的打点明细,进行统计。等待事件,QUERY,用户,数据库。 PostgreSQL打点的方法也很多: 1、(推荐)通过pg_stat_activity 内存中的动态视图获取,每秒取一次ACTIVE的内容(例如:会话ID,等待事件,QUERY,时间,用户,数据库)。 https://www.postgresql.org/docs/11/monitoring-stats.html#MONITORING-STATS-VIEWS 2、(不推荐)开启审计日志,在审计日志中获取,这个在高并发系统中,不太好用。并且审计日志是在结束时打印,一个QUERY的中间执行过程并不完全是占用CPU或其他资源的,所以审计日志获取的信息对于perf insight并没有什么效果。 perf insight的入门门槛低,可以摆平很多问题,在出现问题时快速定位到问题SQL,问题的等待事件在哪里。结合经验型的监控,可以构建PG非常强大的监控、诊断、优化体系。 perf insight 实现讲解 举例1 会话1 postgres=# begin; BEGIN postgres=# lock table abc in access exclusive mode ; LOCK TABLE 会话2 postgres=# select * from abc; 从pg_stat_activity获取状态,可以看到会话2在等待,会话处于active状态,这种消耗需要被记录到avg active session中,用来评估资源消耗指标。 postgres=# select now(),state,datname,usename,wait_event_type,wait_event,query from pg_stat_activity where state in ('active', 'fastpath function call'); now | state | datname | usename | wait_event_type | wait_event | query -------------------------------+--------+----------+----------+-----------------+------------+-------------------------------------------------------------------------------------------- 2019-01-25 21:17:28.540264+08 | active | postgres | postgres | | | select datname,usename,query,state,wait_event_type,wait_event,now() from pg_stat_activity; 2019-01-25 21:17:28.540264+08 | active | postgres | postgres | Lock | relation | select * from abc; (2 rows) 举例2 使用pgbench压测数据库,每秒打点,后期进行可视化展示 pgbench -i -s 100 1、压测只读 pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 300 -S 2、查看压测时的活跃会话状态 postgres=# select now()::timestamptz(0),state, datname,usename,wait_event_type,wait_event,query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid(); now | state | datname | usename | wait_event_type | wait_event | query ---------------------+--------+----------+----------+-----------------+------------+------------------------------------------------------- 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | postgres | | | SELECT abalance FROM pgbench_accounts WHERE aid = $1; 2019-01-25 21:28:52 | active | postgres | 
 2019-01-25 21:28:52 | active | postgres | postgres | Client          | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres | Client          | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres |                 |            | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres |                 |            | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres | Client          | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres | Client          | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres |                 |            | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-25 21:28:52 | active | postgres | postgres | Client          | ClientRead | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
...
(46 rows)

3. To make statistics easier, create a local table to collect the contents of pg_stat_activity. In real production you can ship this information elsewhere (for example to a separate database dedicated to monitoring).

postgres=# create unlogged table perf_insight as select now()::timestamptz(0) as ts, extract(epoch from backend_start)||'.'||pid as sessid, state,datname,usename, wait_event_type||'_'||wait_event as waiting , query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid();
SELECT 48

4. Try writing the current pg_stat_activity state

postgres=# insert into perf_insight select now()::timestamptz(0), extract(epoch from backend_start)||'.'||pid, state,datname, usename,wait_event_type||'_'||wait_event, query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid();
INSERT 0 42

5. Use psql \watch to take one sample per second

postgres=# \watch 1
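\watch requires keeping a psql session open in the foreground. For unattended collection, the same insert can be driven by a small shell loop; a minimal sketch (the loop itself is not from the original article, it simply reuses the insert statement from step 4):

while true; do
  # take one sample per second, same statement as in step 4
  psql -h 127.0.0.1 -c "insert into perf_insight select now()::timestamptz(0), extract(epoch from backend_start)||'.'||pid, state,datname,usename,wait_event_type||'_'||wait_event, query from pg_stat_activity where state in ('active', 'fastpath function call') and pid<>pg_backend_pid();" >/dev/null
  sleep 1
done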
6. Read-only stress result: 1.3 million QPS

pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 300 -S

transaction type: <builtin: select only>
scaling factor: 100
query mode: prepared
number of clients: 64
number of threads: 64
duration: 300 s
number of transactions actually processed: 390179555
latency average = 0.049 ms
latency stddev = 0.026 ms
tps = 1300555.237752 (including connections establishing)
tps = 1300584.885231 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.049 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

7. Next, run a read-write stress test: 94 thousand TPS (about 470 thousand QPS)

pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 300

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: prepared
number of clients: 64
number of threads: 64
duration: 300 s
number of transactions actually processed: 28371829
latency average = 0.677 ms
latency stddev = 0.413 ms
tps = 94569.412707 (including connections establishing)
tps = 94571.934011 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.045 BEGIN;
0.108 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.069 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.091 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.139 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.068 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.153 END;

8. The raw material needed for perf insight visualization: time, state, session ID, database name, user name, wait event, query. This can be refined further; for example, the session ID column lets you display and aggregate per session.

postgres=# \d perf_insight
      Unlogged table "public.perf_insight"
 Column  |              Type              |
---------+--------------------------------+-
 ts      | timestamp(0) with time zone    |  sample time
 sessid  | text                           |  session ID
 state   | text                           |  state
 datname | name                           |  database
 usename | name                           |  user
 waiting | text                           |  wait event
 query   | text                           |  SQL statement

9. Look at the collected perf insight material

postgres=# select * from perf_insight limit 10;
         ts          |         sessid         | state  | datname  | usename  |         waiting          |                                query
---------------------+------------------------+--------+----------+----------+--------------------------+----------------------------------------------------------------------
 2019-01-26 09:43:28 | 1548467007.4805.32968  | active | postgres | postgres | Lock_transactionid       | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;
 2019-01-26 09:43:28 | 1548467007.47991.32966 | active | postgres | postgres | Client_ClientRead        | END;
 2019-01-26 09:43:28 | 1548467007.48362.32979 | active | postgres | postgres | Lock_transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
 2019-01-26 09:43:28 | 1548467007.48388.32980 | active | postgres | postgres | Lock_tuple               | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;
 2019-01-26 09:43:28 | 1548467007.48329.32978 | active | postgres | postgres | Lock_transactionid       | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;
 2019-01-26 09:43:28 | 1548467007.48275.32976 | active | postgres | postgres | Lock_tuple               | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;
 2019-01-26 09:43:28 | 1548467007.48107.32970 | active | postgres | postgres | Lock_transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
 2019-01-26 09:43:28 | 1548467007.48243.32975 | active | postgres | postgres | Lock_transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;
 2019-01-26 09:43:28 | 1548467007.48417.32981 | active | postgres | postgres | IPC_ProcArrayGroupUpdate | SELECT abalance FROM pgbench_accounts WHERE aid = $1;
 2019-01-26 09:43:28 | 1548467007.48448.32982 | active | postgres | postgres | Lock_tuple               | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;
(10 rows)

10. See how many kinds of wait events occurred during this period

postgres=# select distinct waiting from perf_insight ;
         waiting
--------------------------

 LWLock_wal_insert
 LWLock_XidGenLock
 Lock_extend
 LWLock_ProcArrayLock
 Lock_tuple
 Lock_transactionid
 LWLock_lock_manager
 Client_ClientRead
 IPC_ProcArrayGroupUpdate
 LWLock_buffer_content
 IPC_ClogGroupUpdate
 LWLock_CLogControlLock
 IO_DataFileExtend
(14 rows)
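As mentioned in step 8, the sessid column also allows a session-level drill-down. A minimal sketch (illustrative, not from the original article; the time window is hypothetical):

select sessid, coalesce(waiting, 'CPU_TIME') waiting, count(*) cnt
from perf_insight
where ts >= '2019-01-26 09:43:00' and ts < '2019-01-26 09:44:00'  -- hypothetical window
group by 1, 2
order by cnt desc
limit 10;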
perf insight visualization and statistics

The sampling granularity is 1 second. Averaging the samples over n seconds (split by different dimensions) yields visualization charts:

1. Total avg active sessions, used for alerting.
2. Other dimensions, used to analyze the weight of each factor in the performance bottleneck:
2.1. avg active sessions by wait event (NULL means no wait, pure CPU time)
2.2. avg active sessions by query
2.3. avg active sessions by database
2.4. avg active sessions by user

How to judge whether there is a problem: for example, on a 64-thread system, an avg active sessions below 64 can be considered fine.

1 Total avg active sessions, used for alerting

5-second statistics interval:

select coalesce(t1.ts, t2.ts) ts, coalesce(avg_active_sessions,0) avg_active_sessions from
(
  select to_timestamp((extract(epoch from ts))::int8/5*5) ts, count(*)/5::float8 avg_active_sessions from perf_insight group by 1
) t1
full outer join
(
  select generate_series(
    to_timestamp((extract(epoch from min(ts)))::int8/5*5),
    to_timestamp((extract(epoch from max(ts)))::int8/5*5),
    interval '5 s'
  ) ts from perf_insight
) t2
on (t1.ts=t2.ts);

           ts           | avg_active_sessions
------------------------+---------------------
 2019-01-26 05:39:20+08 |                14.2
 2019-01-26 05:39:25+08 |                30.4
 2019-01-26 05:39:30+08 |                35.8
 2019-01-26 05:39:35+08 |                41.8
 2019-01-26 05:39:40+08 |                38.6
...
 2019-01-26 05:42:30+08 |                37.6
 2019-01-26 05:42:35+08 |                   0
 2019-01-26 05:42:40+08 |                   0
 2019-01-26 05:42:45+08 |                   0
 2019-01-26 05:42:50+08 |                 8.4
 2019-01-26 05:42:55+08 |                40.6
...
 2019-01-26 05:47:45+08 |                  39
 2019-01-26 05:47:50+08 |                27.8
(103 rows)

The SQL with a 10-second statistics interval:

select coalesce(t1.ts,t2.ts) ts, coalesce(avg_active_sessions,0) avg_active_sessions from
(
  select to_timestamp((extract(epoch from ts))::int8/10*10) ts, count(*)/10::float8 avg_active_sessions from perf_insight group by 1
) t1
full outer join
(
  select generate_series(
    to_timestamp((extract(epoch from min(ts)))::int8/10*10),
    to_timestamp((extract(epoch from max(ts)))::int8/10*10),
    interval '10 s'
  ) ts from perf_insight
) t2
on (t1.ts=t2.ts);

           ts           | avg_active_sessions
------------------------+---------------------
 2019-01-26 05:39:20+08 |                22.3
 2019-01-26 05:39:30+08 |                38.8
 2019-01-26 05:39:40+08 |                38.4
 2019-01-26 05:39:50+08 |                35.1
...
 2019-01-26 05:42:30+08 |                18.8
 2019-01-26 05:42:40+08 |                   0
 2019-01-26 05:42:50+08 |                24.5
 2019-01-26 05:43:00+08 |                39.9
...
 2019-01-26 05:47:40+08 |                37.6
 2019-01-26 05:47:50+08 |                13.9
(52 rows)
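For alerting, the same aggregation can be compared directly against the instance's CPU count, per the judging rule above. A minimal sketch, assuming a 64-core instance (the threshold is illustrative, not from the original article):

select to_timestamp((extract(epoch from ts))::int8/10*10) ts,
       count(*)/10::float8 avg_active_sessions
from perf_insight
group by 1
having count(*)/10::float8 > 64  -- raise an alert: avg active sessions exceeds the CPU count
order by 1;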
2 Within a given time window, what exactly is the problem?

For example, drilling into the performance problem of the interval 2019-01-26 05:45:20+08:

1. Resource time by database dimension: determine which database consumes the most resources.

postgres=# select datname, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 05:45:20+08'                               -- the problem time point
group by 1 order by cnt desc;
 datname  | cnt
----------+-----
 postgres |  39
(1 row)

2. Resource time by user dimension: determine which user consumes the most resources.

postgres=# select usename, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 05:45:20+08'                               -- the problem time point
group by 1 order by cnt desc;
 usename  | cnt
----------+-----
 postgres |  39
(1 row)

3. Resource time by wait-event dimension: determine which wait events the problem concentrates on, so you can optimize or add resources accordingly.

postgres=# select coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 05:45:20+08'                               -- the problem time point
group by 1 order by cnt desc;
         waiting          | cnt
--------------------------+------
 CPU_TIME                 | 15.3
 Client_ClientRead        | 10.6
 IPC_ProcArrayGroupUpdate |  6.1
 Lock_transactionid       |  5.4
 Lock_tuple               |  0.5
 LWLock_wal_insert        |  0.3
 LWLock_ProcArrayLock     |  0.2
 LWLock_buffer_content    |  0.2
 IPC_ClogGroupUpdate      |  0.2
 LWLock_lock_manager      |  0.1
 LWLock_CLogControlLock   |  0.1
(11 rows)

4. Resource time by SQL dimension: determine which statements the problem concentrates on, so you can optimize them specifically.

postgres=# select query, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 05:45:20+08'                               -- the problem time point
group by 1 order by cnt desc;
                                                 query                                                 | cnt
-------------------------------------------------------------------------------------------------------+------
 END;                                                                                                  | 11.5
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;                                  | 11.3
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2;                                  |  6.8
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2;                                   |  4.5
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); |  2.3
 SELECT abalance FROM pgbench_accounts WHERE aid = $1;                                                 |  2.1
 BEGIN;                                                                                                |  0.5
(7 rows)
5. Resource time of each query broken down by wait event: determine the standout wait events of a problem SQL, so you can optimize or add resources accordingly.

postgres=# select query, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 05:45:20+08'                               -- the problem time point
group by 1,2 order by 1,cnt desc;
 query | waiting | cnt
-------+---------+-----
 BEGIN; | Client_ClientRead | 0.3
 BEGIN; | CPU_TIME | 0.2
 END; | CPU_TIME | 4.6
 END; | IPC_ProcArrayGroupUpdate | 3.7
 END; | Client_ClientRead | 3.1
 END; | IPC_ClogGroupUpdate | 0.1
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | CPU_TIME | 1
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Client_ClientRead | 0.6
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | IPC_ProcArrayGroupUpdate | 0.6
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | IPC_ClogGroupUpdate | 0.1
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | CPU_TIME | 1.2
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Client_ClientRead | 0.6
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Lock_transactionid | 0.3
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | CPU_TIME | 3.8
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | Client_ClientRead | 2.9
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | LWLock_wal_insert | 0.1
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_transactionid | 4
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | CPU_TIME | 2.5
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Client_ClientRead | 2.1
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | IPC_ProcArrayGroupUpdate | 1.7
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_tuple | 0.5
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_buffer_content | 0.2
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_ProcArrayLock | 0.2
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_wal_insert | 0.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | CPU_TIME | 2
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Lock_transactionid | 1.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Client_ClientRead | 1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | IPC_ProcArrayGroupUpdate | 0.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_CLogControlLock | 0.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_lock_manager | 0.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_wal_insert | 0.1
(31 rows)

6. Pick a single query and look at its resource time across wait events: determine the standout wait events of that SQL, so you can optimize or add resources accordingly. Query 4 showed that END consumed the most; so how is this statement's wait time distributed? Which waits cause it?
postgres=# select coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 05:45:20+08'                               -- the problem time point
  and query='END;'
group by 1 order by cnt desc;
         waiting          | cnt
--------------------------+-----
 CPU_TIME                 | 4.6
 IPC_ProcArrayGroupUpdate | 3.7
 Client_ClientRead        | 3.1
 IPC_ClogGroupUpdate      | 0.1
(4 rows)

3 Start a stress scenario that causes a performance problem, and spot it directly with perf insight

1. Start 640 concurrent read-write connections. Because the data set is small and the concurrency is high, ROW LOCK conflicts arise immediately; with perf insight the problem is laid bare.

pgbench -M prepared -n -r -P 1 -c 640 -j 640 -T 300

postgres=# select query, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 06:38:20+08'                               -- the problem time point
group by 1,2 order by 1,cnt desc;
 query | waiting | cnt
-------+---------+-------
 BEGIN; | Lock_transactionid | 0.3
 BEGIN; | Lock_tuple | 0.3
 BEGIN; | LWLock_lock_manager | 0.1
 END; | IPC_ProcArrayGroupUpdate | 29.5
 END; | CPU_TIME | 14.1
 END; | Lock_transactionid | 13
 END; | Client_ClientRead | 8.4
 END; | Lock_tuple | 8.1
 END; | LWLock_lock_manager | 3
 END; | LWLock_ProcArrayLock | 0.4
 END; | LWLock_buffer_content | 0.3
 END; | IPC_ClogGroupUpdate | 0.1
 END; | LWLock_wal_insert | 0.1
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | IPC_ProcArrayGroupUpdate | 1.3
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | CPU_TIME | 0.4
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Lock_transactionid | 0.3
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Lock_tuple | 0.2
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | Client_ClientRead | 0.2
 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ($1, $2, $3, $4, CURRENT_TIMESTAMP); | LWLock_lock_manager | 0.1
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Lock_tuple | 0.9
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Lock_transactionid | 0.9
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | IPC_ProcArrayGroupUpdate | 0.4
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | Client_ClientRead | 0.3
 SELECT abalance FROM pgbench_accounts WHERE aid = $1; | CPU_TIME | 0.1
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | Lock_transactionid | 1.7
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | IPC_ProcArrayGroupUpdate | 1.4
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | Lock_tuple | 0.9
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | LWLock_lock_manager | 0.1
 UPDATE pgbench_accounts SET abalance = abalance + $1 WHERE aid = $2; | CPU_TIME | 0.1
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_transactionid | 161.5   # the standout problem is here
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | IPC_ProcArrayGroupUpdate | 27.2
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Lock_tuple | 27.2
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_lock_manager | 19.6
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | CPU_TIME | 12.3
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | Client_ClientRead | 4
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_buffer_content | 3.3
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_ProcArrayLock | 0.3
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | LWLock_wal_insert | 0.1
 UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2; | IPC_ClogGroupUpdate | 0.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Lock_transactionid | 178.4   # the standout problem is here
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Lock_tuple | 83.7   # the standout problem is here
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | CPU_TIME | 5.6
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | IPC_ProcArrayGroupUpdate | 5.3
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_lock_manager | 3.8
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | Client_ClientRead | 2
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_ProcArrayLock | 0.1
 UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2; | LWLock_buffer_content | 0.1
(47 rows)

postgres=# select coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight
where to_timestamp((extract(epoch from ts))::int8/10*10)  -- taking the 10-second granularity chart as an example
  ='2019-01-26 06:38:20+08'                               -- the problem time point
group by 1 order by cnt desc;
         waiting          |  cnt
--------------------------+-------
 Lock_transactionid       | 356.1
 Lock_tuple               | 121.3
 IPC_ProcArrayGroupUpdate |  65.1
 CPU_TIME                 |  32.6
 LWLock_lock_manager      |  26.7
 Client_ClientRead        |  14.9
 LWLock_buffer_content    |   3.7
 LWLock_ProcArrayLock     |   0.8
 LWLock_wal_insert        |   0.2
 IPC_ClogGroupUpdate      |   0.2
(10 rows)

Other stress scenarios where perf insight finds the problem

1. Bulk data writes: BLOCK extend or wal insert lock bottlenecks, or the pglz compression bottleneck.

create table test(id int, info text default repeat(md5(random()::text),1000));

vi test.sql
insert into test(id) select generate_series(1,10);

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 300

postgres=# select to_timestamp((extract(epoch from ts))::int8/10*10) ts, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight group by 1,2 order by 1,cnt desc;
           ts           |         waiting          | cnt
------------------------+--------------------------+------
 2019-01-26 10:28:50+08 | IO_DataFileExtend        |  0.1
 2019-01-26 10:29:00+08 | CPU_TIME                 |   50
 2019-01-26 10:29:00+08 | Lock_extend              | 11.9  -- extending data files
 2019-01-26 10:29:00+08 | Client_ClientRead        |  0.3
 2019-01-26 10:29:00+08 | IO_DataFileExtend        |  0.2
 2019-01-26 10:29:00+08 | LWLock_lock_manager      |  0.1
 2019-01-26 10:29:10+08 | CPU_TIME                 | 47.1
 2019-01-26 10:29:10+08 | Lock_extend              | 13.5
 2019-01-26 10:29:10+08 | Client_ClientRead        |  0.7
 2019-01-26 10:29:10+08 | IO_DataFileExtend        |  0.3
 2019-01-26 10:29:10+08 | LWLock_buffer_content    |  0.2
 2019-01-26 10:29:10+08 | LWLock_lock_manager      |  0.1
 2019-01-26 10:29:20+08 | CPU_TIME                 | 54.5
 2019-01-26 10:29:20+08 | Lock_extend              |  6.7
 2019-01-26 10:29:20+08 | Client_ClientRead        |  0.2
 2019-01-26 10:29:20+08 | IO_DataFileExtend        |  0.1
 2019-01-26 10:29:30+08 | CPU_TIME                 | 61.9  -- CPU; perf top shows the bottleneck is in the pglz interface (pglz_compress)
 2019-01-26 10:29:30+08 | Client_ClientRead        |  0.2
 2019-01-26 10:29:40+08 | CPU_TIME                 | 30.9
 2019-01-26 10:29:40+08 | LWLock_wal_insert        |  0.2
 2019-01-26 10:29:40+08 | Client_ClientRead        |  0.1
(28 rows)

So for the problem above, if we switch to no compression, the bottleneck moves elsewhere:

alter table test alter COLUMN info set storage external;

postgres=# \d+ test
                                            Table "public.test"
 Column |  Type   | Collation | Nullable |               Default               | Storage  | Stats target | Description
--------+---------+-----------+----------+-------------------------------------+----------+--------------+-------------
 id     | integer |           |          |                                     | plain    |              |
 info   | text    |           |          | repeat(md5((random())::text), 1000) | external |              |
The bottleneck then becomes something else:

 2019-01-26 10:33:50+08 | Lock_extend           | 43.2
 2019-01-26 10:33:50+08 | LWLock_buffer_content | 14.8
 2019-01-26 10:33:50+08 | CPU_TIME              |  4.6
 2019-01-26 10:33:50+08 | LWLock_lock_manager   |  0.5
 2019-01-26 10:33:50+08 | LWLock_wal_insert     |  0.4
 2019-01-26 10:33:50+08 | IO_DataFileExtend     |  0.4
 2019-01-26 10:33:50+08 | Client_ClientRead     |  0.1
 2019-01-26 10:34:00+08 | Lock_extend           | 55.6
 2019-01-26 10:34:00+08 | LWLock_buffer_content |  6.3
 2019-01-26 10:34:00+08 | CPU_TIME              |  1.2
 2019-01-26 10:34:00+08 | IO_DataFileExtend     |  0.8
 2019-01-26 10:34:00+08 | LWLock_wal_insert     |  0.1
 2019-01-26 10:34:10+08 | Lock_extend           |  6.3
 2019-01-26 10:34:10+08 | LWLock_buffer_content |  5.8
 2019-01-26 10:34:10+08 | CPU_TIME              |  0.7

So the real cure is a better compression interface, which is exactly what the PostgreSQL 12 cycle is working on:

《[未完待续] PostgreSQL 开放压缩接口 与 lz4压缩插件》
《[未完待续] PostgreSQL zstd 压缩算法 插件》

2. Flash sale: a single-row UPDATE, row-lock bottleneck.

create table t_hot (id int primary key, cnt int8);
insert into t_hot values (1,0);

vi test.sql
update t_hot set cnt=cnt+1 where id=1;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 300

postgres=# select to_timestamp((extract(epoch from ts))::int8/10*10) ts, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight group by 1,2 order by 1,cnt desc;
 2019-01-26 10:37:50+08 | Lock_tuple            | 29.6  -- bottleneck: row-lock conflicts
 2019-01-26 10:37:50+08 | LWLock_lock_manager   | 11.4  -- accompanied by hot blocks
 2019-01-26 10:37:50+08 | LWLock_buffer_content |  8.4
 2019-01-26 10:37:50+08 | Lock_transactionid    |  7.6
 2019-01-26 10:37:50+08 | CPU_TIME              |  6.5
 2019-01-26 10:37:50+08 | Client_ClientRead     |  0.2
 2019-01-26 10:38:00+08 | Lock_tuple            | 29.2  -- bottleneck: row-lock conflicts
 2019-01-26 10:38:00+08 | LWLock_buffer_content | 15.6  -- accompanied by hot blocks
 2019-01-26 10:38:00+08 | CPU_TIME              |  7.9
 2019-01-26 10:38:00+08 | LWLock_lock_manager   |  7.2
 2019-01-26 10:38:00+08 | Lock_transactionid    |  3.7

Optimization methods for flash-sale scenarios:

《PostgreSQL 秒杀4种方法 - 增加 批量流式加减库存 方法》
《HTAP数据库 PostgreSQL 场景与性能测试之 30 - (OLTP) 秒杀 - 高并发单点更新》
《聊一聊双十一背后的技术 - 不一样的秒杀技术, 裸秒》
《PostgreSQL 秒杀场景优化》

3. Unoptimized SQL, full-table scan with filter: a CPU time bottleneck.

postgres=# create table t_bad (id int, info text);
CREATE TABLE
postgres=# insert into t_bad select generate_series(1,10000), md5(random()::Text);
INSERT 0 10000

vi test.sql
\set id random(1,10000)
select * from t_bad where id=:id;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 300

The bottleneck:

postgres=# select to_timestamp((extract(epoch from ts))::int8/10*10) ts, coalesce(waiting, 'CPU_TIME') waiting, count(*)/10::float8 cnt from perf_insight group by 1,2 order by 1,cnt desc;
 2019-01-26 10:41:40+08 | CPU_TIME          | 61.3
 2019-01-26 10:41:40+08 | Client_ClientRead |  0.9
 2019-01-26 10:41:50+08 | CPU_TIME          | 61.7
 2019-01-26 10:41:50+08 | Client_ClientRead |  0.1
 2019-01-26 10:42:00+08 | CPU_TIME          | 60.7
 2019-01-26 10:42:00+08 | Client_ClientRead |  0.5

Baselines for perf insight

If you want a baseline for alerting:

1. The baseline has little to do with QPS.
2. It has everything to do with avg active sessions: when avg active sessions exceeds the instance's CPU core count, there is a performance problem.

perf insight is not a silver bullet

perf insight is very fast at discovering a problem as it happens. But as the legendary physician Hua Tuo put it, the highest art is to treat illness before it appears; perf insight detects illness that already exists, not latent illness. Detecting latent problems still requires a deep understanding of the engine and accumulated experience. For example:

1. transaction ID age
2. FREEZE storms
3. sequence exhaustion
4. index recommendations
5. bloat
6. security risks
7. unreasonable indexes
8. growth trends
9. fragmentation
10. partitioning advice
11. hot/cold data separation advice
12. TOP SQL diagnosis and optimization
13. scaling advice (capacity, compute, IO, memory, ...)
14. sharding advice
15. architecture optimization advice
and so on.

Besides that, perf insight also cannot detect situations like this one:

1. long queries (waiting (ddl, block one session)): when there are few long queries and the overall avg active sessions stays below the baseline, the long-query problem never surfaces, even though long queries carry latent problems of their own, for example they can cause bloat.

perf insight plus experience-based monitoring and diagnosis makes your database monitoring system much more robust.
Other knowledge points, and kernel improvements needed

1. Session ID: combining the backend's start time with the backend pid yields a unique session id for a PostgreSQL instance. With a session id, performance diagnosis and visualization can be done along the SESSION dimension.

select extract(epoch from backend_start)||'.'||pid as sessid from pg_stat_activity ;
         sessid
------------------------
 1547978042.41326.13447
 1547978042.41407.13450

2. For SQL that does not use bind variables, statement-level statistical pivoting becomes rather tragic: whenever the input values differ, the query text shown in pg_stat_activity differs too, so the samples cannot be aggregated by statement. Prepared statements (bind variables), as in the pgbench runs above, avoid this.
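A crude workaround (illustrative, not from the original article) is to normalize the literals in the sampled query text before aggregating, so that non-parameterized statements collapse into one bucket; note that in newer PostgreSQL versions (14+) pg_stat_activity exposes query_id, which solves this properly:

select regexp_replace(
         regexp_replace(query, '''[^'']*''', '?', 'g'),  -- collapse string literals
         '\m\d+\M', '?', 'g')                            -- collapse bare numbers
       as normalized_query,
       count(*) cnt
from perf_insight
group by 1
order by cnt desc
limit 10;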
Tags: PostgreSQL, enterprisedb, ppas, oracle

Background

How PPAS 10 and earlier use Oracle partitioned tables, and how to create indexes on the partitions. Version 10 and earlier support only range and list partitioning; hash partitioning is supported starting with version 11.

Oracle partitioned-table syntax

https://docs.oracle.com/cd/E18283_01/server.112/e16541/part_admin001.htm#i1006455

Examples

CREATE TABLE sales
( prod_id       NUMBER(6)
, cust_id       NUMBER
, time_id       DATE
, channel_id    CHAR(1)
, promo_id      NUMBER(6)
, quantity_sold NUMBER(3)
, amount_sold   NUMBER(10,2)
)
PARTITION BY RANGE (time_id)
( PARTITION sales_q1_2006 VALUES LESS THAN (TO_DATE('01-APR-2006','dd-MON-yyyy'))
, PARTITION sales_q2_2006 VALUES LESS THAN (TO_DATE('01-JUL-2006','dd-MON-yyyy'))
, PARTITION sales_q3_2006 VALUES LESS THAN (TO_DATE('01-OCT-2006','dd-MON-yyyy'))
, PARTITION sales_q4_2006 VALUES LESS THAN (TO_DATE('01-JAN-2007','dd-MON-yyyy'))
);

CREATE TABLE q1_sales_by_region
(deptno number, deptname varchar2(20), quarterly_sales number(10, 2), state varchar2(2))
PARTITION BY LIST (state)
(PARTITION q1_northwest VALUES ('OR', 'WA'),
 PARTITION q1_southwest VALUES ('AZ', 'UT', 'NM'),
 PARTITION q1_northeast VALUES ('NY', 'VM', 'NJ'),
 PARTITION q1_southeast VALUES ('FL', 'GA'),
 PARTITION q1_northcentral VALUES ('SD', 'WI'),
 PARTITION q1_southcentral VALUES ('OK', 'TX'));

Using partitioned tables in PPAS

Note two related parameters:

set default_with_oids = on;    -- with oids (one extra column); when OFF, Oracle's partitioned-table creation syntax is not allowed.
set default_with_rowids = on;  -- adds a UK index on the oid column. If your application does not need the rowid virtual column, it is strongly recommended to set this to OFF.

The syntax is similar to Oracle's; the two Oracle partitioned-table creation statements above run directly.

Creating indexes on a partitioned table

1. In version 10 and earlier, the index cannot be created directly on the partitioned table:

postgres=# \set VERBOSITY verbose
postgres=# create index idx_sales_1 on sales (prod_id);
ERROR: 42809: cannot create index on partitioned table "sales"
LOCATION: DefineIndex, indexcmds.c:396

Indexes can only be created on the individual partitions. If there are many partitions, wrap it in a DO block or a function to simplify the whole process. pg_inherits is used to find all inheritance children:

postgres=# \d pg_inherits
   Table "pg_catalog.pg_inherits"
  Column   |  Type   | Collation | Nullable | Default
-----------+---------+-----------+----------+---------
 inhrelid  | oid     |           | not null |
 inhparent | oid     |           | not null |
 inhseqno  | integer |           | not null |
Indexes:
    "pg_inherits_relid_seqno_index" UNIQUE, btree (inhrelid, inhseqno)
    "pg_inherits_parent_index" btree (inhparent)

For example, to index all partitions of sales:

do language plpgsql $$
declare
  s name;
  t name;
  tbl oid := 'public.sales'::regclass;
  col text := format('%I,%I', 'prod_id', 'quantity_sold');
  o oid;
begin
  for o in select inhrelid from pg_inherits where inhparent=tbl
  loop
    select nspname, relname into s,t from pg_class t1 join pg_namespace t2 on (t1.relnamespace=t2.oid) where t1.oid=o;
    execute format('create index %s on %I.%I (%s)', 'md5'||md5(random()::text), s, t, col);
  end loop;
end;
$$;

Result:

postgres=# \d+ sales
                                              Table "public.sales"
    Column     |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description
---------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 prod_id       | numeric(6,0)                |           |          |         | main     |              |
 cust_id       | numeric                     |           |          |         | main     |              |
 time_id       | timestamp without time zone |           |          |         | plain    |              |
 channel_id    | character(1)                |           |          |         | extended |              |
 promo_id      | numeric(6,0)                |           |          |         | main     |              |
 quantity_sold | numeric(3,0)                |           |          |         | main     |              |
 amount_sold   | numeric(10,2)               |           |          |         | main     |              |
Partition key: RANGE (time_id) NULLS LAST
Partitions: sales_sales_q1_2006 FOR VALUES FROM (MINVALUE) TO ('01-APR-06 00:00:00'),
            sales_sales_q2_2006 FOR VALUES FROM ('01-APR-06 00:00:00') TO ('01-JUL-06 00:00:00'),
            sales_sales_q3_2006 FOR VALUES FROM ('01-JUL-06 00:00:00') TO ('01-OCT-06 00:00:00'),
            sales_sales_q4_2006 FOR VALUES FROM ('01-OCT-06 00:00:00') TO ('01-JAN-07 00:00:00')
Has OIDs: yes

postgres=# \d sales_sales_q1_2006
            Table "public.sales_sales_q1_2006"
    Column     |            Type             | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
 prod_id       | numeric(6,0)                |           |          |
 cust_id       | numeric                     |           |          |
 time_id       | timestamp without time zone |           |          |
 channel_id    | character(1)                |           |          |
 promo_id      | numeric(6,0)                |           |          |
 quantity_sold | numeric(3,0)                |           |          |
 amount_sold   | numeric(10,2)               |           |          |
Partition of: sales FOR VALUES FROM (MINVALUE) TO ('01-APR-06 00:00:00')
Indexes:
    "pg_oid_120027427_index" UNIQUE, btree (oid)
    "md5193df902f78920ac4d636ebcab5d50b1" btree (prod_id, quantity_sold)

postgres=# \d sales_sales_q2_2006
            Table "public.sales_sales_q2_2006"
    Column     |            Type             | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
 prod_id       | numeric(6,0)                |           |          |
 cust_id       | numeric                     |           |          |
 time_id       | timestamp without time zone |           |          |
 channel_id    | character(1)                |           |          |
 promo_id      | numeric(6,0)                |           |          |
 quantity_sold | numeric(3,0)                |           |          |
 amount_sold   | numeric(10,2)               |           |          |
Partition of: sales FOR VALUES FROM ('01-APR-06 00:00:00') TO ('01-JUL-06 00:00:00')
Indexes:
    "pg_oid_120027434_index" UNIQUE, btree (oid)
    "md52c8ff555d00e2fd5245fafb3027a6d6d" btree (prod_id, quantity_sold)
Wrapping partition-index creation in a function

Inputs: the schema of the parent table, the parent table name, the index columns, the index method, the tablespace, and whether DML must not be blocked (see the notes below). The function:

create or replace function create_index_on_partition_table (
  ptblnsp name,                  -- schema of the parent table; case sensitive, all lowercase recommended
  ptbl name,                     -- parent table name; case sensitive, all lowercase recommended
  cols name[],                   -- index columns, created strictly in the given order; case sensitive, all lowercase recommended
  am name default 'btree',       -- index method
  tbs name default 'pg_default'  -- tablespace
) returns void as $$
declare
  s name;
  t name;
  tbl oid := format('%I.%I', ptblnsp, ptbl)::regclass;
  col text;
  o oid;
begin
  select string_agg(format('%I',x),', ') into col from unnest(cols) x;
  for o in select inhrelid from pg_inherits where inhparent=tbl
  loop
    perform 1 from (select pg_get_indexdef(indexrelid) as def from pg_index where indrelid=o) t where substring(def, '\((.*)\)')=col limit 1;
    if not found then
      -- avoid duplicate creation: after new partitions are added, only they need indexes; partitions already indexed are skipped
      select nspname, relname into s,t from pg_class t1 join pg_namespace t2 on (t1.relnamespace=t2.oid) where t1.oid=o;
      execute format('create index %s on %I.%I using %I (%s) tablespace %I', 'md5'||md5(random()::text), s, t, am, col, tbs);
    end if;
  end loop;
end;
$$ language plpgsql strict;

Usage example

CREATE TABLE salesabc
( prod_id            NUMBER(6)
, cust_id            NUMBER
, time_id            DATE
, channel_id         CHAR(1)
, promo_id           NUMBER(6)
, "QWWWuantity_sold" NUMBER(3)
, amount_sold        NUMBER(10,2)
)
PARTITION BY RANGE (time_id)
( PARTITION sales_q1_2006 VALUES LESS THAN (TO_DATE('01-APR-2006','dd-MON-yyyy'))
, PARTITION sales_q2_2006 VALUES LESS THAN (TO_DATE('01-JUL-2006','dd-MON-yyyy'))
, PARTITION sales_q3_2006 VALUES LESS THAN (TO_DATE('01-OCT-2006','dd-MON-yyyy'))
, PARTITION sales_q4_2006 VALUES LESS THAN (TO_DATE('01-JAN-2007','dd-MON-yyyy'))
);

Create the partition indexes:

select create_index_on_partition_table('public','salesabc','{prod_id, QWWWuantity_sold,amount_sold}');

Verify the indexes were created correctly:

postgres=# select indexrelid::regclass,indrelid::Regclass,pg_get_indexdef(indexrelid) from pg_index where indrelid in (select inhrelid from pg_inherits where inhparent='public.salesabc'::regclass);
             indexrelid              |        indrelid        | pg_get_indexdef
-------------------------------------+------------------------+------------------------------------------------------------------------------------------------------------------------------------------
 pg_oid_120027673_index              | salesabc_sales_q1_2006 | CREATE UNIQUE INDEX pg_oid_120027673_index ON public.salesabc_sales_q1_2006 USING btree (oid)
 pg_oid_120027680_index              | salesabc_sales_q2_2006 | CREATE UNIQUE INDEX pg_oid_120027680_index ON public.salesabc_sales_q2_2006 USING btree (oid)
 pg_oid_120027687_index              | salesabc_sales_q3_2006 | CREATE UNIQUE INDEX pg_oid_120027687_index ON public.salesabc_sales_q3_2006 USING btree (oid)
 pg_oid_120027694_index              | salesabc_sales_q4_2006 | CREATE UNIQUE INDEX pg_oid_120027694_index ON public.salesabc_sales_q4_2006 USING btree (oid)
 md56a2cbe5776d443387f068bbe539533e5 | salesabc_sales_q1_2006 | CREATE INDEX md56a2cbe5776d443387f068bbe539533e5 ON public.salesabc_sales_q1_2006 USING btree (prod_id, "QWWWuantity_sold", amount_sold)
 md5e1c5c1645d5c9cd6500040d98b1ff39d | salesabc_sales_q2_2006 | CREATE INDEX md5e1c5c1645d5c9cd6500040d98b1ff39d ON public.salesabc_sales_q2_2006 USING btree (prod_id, "QWWWuantity_sold", amount_sold)
 md519a145aefd180dd7f4a43e57f3254d61 | salesabc_sales_q3_2006 | CREATE INDEX md519a145aefd180dd7f4a43e57f3254d61 ON public.salesabc_sales_q3_2006 USING btree (prod_id, "QWWWuantity_sold", amount_sold)
 md5402f9b0fb2919c8b4545033ac450a140 | salesabc_sales_q4_2006 | CREATE INDEX md5402f9b0fb2919c8b4545033ac450a140 ON public.salesabc_sales_q4_2006 USING btree (prod_id, "QWWWuantity_sold", amount_sold)
(8 rows)

EnterpriseDB 11 (POLARDB PG, PPAS 11) supports creating an index directly on the partitioned table, so none of the tedious steps above are needed.

Other notes

1. Non-default operator classes are not supported; if you need one, adapt the function above (use the non-default opclass).
2. For concurrent creation, adapt the function above (use dblink asynchronous tasks together with the CONCURRENTLY keyword); a simplified variant is sketched after this list.
3. To run asynchronous tasks against multiple partitions at once, adapt the function above (use dblink asynchronous tasks).
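A minimal sketch of note 2, without the dblink parallel part (illustrative, not from the original article). CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so instead of a plpgsql function we just generate the DDL and execute it with psql's \gexec; the column list is only an example:

select format('create index concurrently %s on %s (prod_id, amount_sold)',
              'md5'||md5(random()::text),   -- same random-name convention as above
              inhrelid::regclass)
from pg_inherits
where inhparent = 'public.salesabc'::regclass;
-- in psql, end the statement with \gexec instead of ; to run each generated DDL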
References

《PostgreSQL 快速给指定表每个字段创建索引 - 2 (近乎完美)》
《PostgreSQL dblink异步调用实践,跑并行多任务 - 例如开N个并行后台任务创建索引, 开N个后台任务跑若干SQL》
《在PostgreSQL中跑后台长任务的方法 - 使用dblink异步接口》

Partitioned-table usage in the community version:

《PostgreSQL 9.x, 10, 11 hash分区表 用法举例》
《PostgreSQL 分区表如何支持多列唯一约束 - 枚举、hash哈希 分区, 多列唯一, insert into on conflict, update, upsert, merge insert》
《PostgreSQL native partition 分区表性能优化之 - 动态SQL+服务端绑定变量》
《PostgreSQL 分区表、继承表 记录去重方法》
《PostgreSQL pgbench tpcb 海量数据库测试 - 分区表测试优化》
《PostgreSQL 11 preview - 分区表 增强 汇总》
Tags: PostgreSQL, synchronous, semi-synchronous, streaming replication

Background

With a two-node HA architecture, how do you achieve cross-datacenter RPO=0 (reliability) while keeping RTO under control (availability)? Semi-synchronous replication is a good choice.

1. When only one node fails, RPO=0 is guaranteed:

primary -> standby (down)
primary (down) -> standby

2. After one node fails, and before the other node has recovered and re-entered synchronous mode, if the current primary also fails during that window (even though the standby is OPEN again, it has not yet switched back to synchronous mode), then RPO>0:

primary (down) -> standby (OPEN, but the standby failed earlier and has not yet switched back to synchronous mode)

Just as when both nodes fail at once, RPO>0.

3. How do we keep RTO bounded? In synchronous mode a commit waits for WAL feedback from the sync standby, ensuring the transaction's WAL reaches multiple replicas before the client is acknowledged (the WAL is first made durable on the primary, then shipped to the sync standby, and the commit waits for the standby's WAL position feedback). If the standby is down, that wait is indefinite, so two-node synchronous replication alone cannot preserve availability (RTO). What can be done? Monitor the wait-event state in pg_stat_activity; if synchronous-commit waits exceed a threshold (the RTO budget), degrade to asynchronous mode. Degrading does not require a database restart:

3.1 change the configuration
3.2 reload (takes effect immediately for existing and new connections)
3.3 send cancel signals to backends currently waiting

4. After degrading, when do we go back to synchronous mode (upgrade)? Likewise, monitor pg_stat_replication: once the sync standby is in the streaming state, we can switch back to synchronous mode. Upgrading does not require a restart either:

4.1 change the configuration
4.2 reload; takes effect immediately (for existing and new connections)

Technical points involved

1. Transaction commit parameter synchronous_commit: on, remote_apply, remote_write, local

2. Sync configuration parameter synchronous_standby_names:
[FIRST] num_sync ( standby_name [, ...] )
ANY num_sync ( standby_name [, ...] )
standby_name [, ...]
ANY 3 (s1, s2, s3, s4)
FIRST 3 (s1, s2, s3, s4)
* matches all nodes

3. Active sessions: inspect the wait-event state at commit time via pg_stat_activity, wait_event='SyncRep'
https://www.postgresql.org/docs/11/monitoring-stats.html#MONITORING-STATS-VIEWS

4. Stream state: pg_stat_replication, sync_state='sync'; state text, the current WAL sender state. Possible values are:
startup: This WAL sender is starting up.
catchup: This WAL sender's connected standby is catching up with the primary.
streaming: This WAL sender is streaming changes after its connected standby server has caught up with the primary.
backup: This WAL sender is sending a backup.
stopping: This WAL sender is stopping.

Practice

Environment

1. Primary, postgresql.conf:

synchronous_commit = remote_write
wal_level = replica
max_wal_senders = 8
synchronous_standby_names = '*'

2. Standby, recovery.conf:

restore_command = 'cp /data01/digoal/wal/%f %p'
primary_conninfo = 'host=localhost port=8001 user=postgres'

Degrading and upgrading in practice

Shut down the standby to simulate a standby failure and see how the semi-synchronous behavior is achieved; then simulate standby recovery and see how to upgrade back to synchronous mode.

1. Monitor pg_stat_activity; if a commit wait exceeds the threshold (RTO guarantee), degrade:

select max(now()-query_start) from pg_stat_activity where wait_event='SyncRep';

2. Check the wait time above (RTO guarantee). When it exceeds the threshold, start the degrade. Mind NULL protection: NULL means no transaction is currently waiting on SyncRep.

3. Degrade step 1: change synchronous_commit so that WAL is only made durable locally (asynchronous streaming):

alter system set synchronous_commit=local;

4. Degrade step 2: apply the parameter with a reload:

select pg_reload_conf();

5. Degrade step 3: clear the current wait queue (backends waiting on SyncRep leave the queue when they receive CANCEL, and tell the client that the transaction's WAL is already durable locally; the transaction completes normally):

select pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep';

6. Clients that received the signal return normally (the client sees the commit succeed):

postgres=# end;
WARNING: 01000: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
LOCATION: SyncRepWaitForLSN, syncrep.c:264
COMMIT

The transaction's redo is durable in the local WAL, and the commit status is normal. Subsequent requests on this session run in asynchronous streaming mode (local WAL durability, synchronous_commit=local).

How to upgrade back:

7. Upgrade step 1: monitor the standby state; once the sync_state='sync' standby enters the streaming state, its WAL is fully caught up with the primary:

select * from pg_stat_replication where sync_state='sync' and state='streaming';

A non-empty result means the standby has received all of the primary's WAL and synchronous mode can be restored.

8. Upgrade step 2: switch the commit mode back to synchronous (synchronous_commit=remote_write: commits wait for the sync standby to receive and write the WAL):

alter system set synchronous_commit=remote_write;

9. Upgrade step 3: apply the parameter with a reload (all sessions, existing and new, reset to synchronous_commit=remote_write):

select pg_reload_conf();
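The monitor-degrade-upgrade loop above is straightforward to script. A minimal watchdog sketch under this article's assumptions (10-second SyncRep threshold, 5-second polling; the script itself is illustrative, not from the original article):

#!/bin/bash
# naive semi-sync watchdog: degrade to async when SyncRep waits exceed 10s,
# upgrade back to sync once the sync standby is streaming again
while true; do
  waits=$(psql -At -c "select count(*) from pg_stat_activity where wait_event='SyncRep' and now()-query_start > interval '10 s'")
  if [ "$waits" -gt 0 ]; then
    psql -c "alter system set synchronous_commit=local"
    psql -c "select pg_reload_conf()"
    psql -c "select pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep'"
  else
    streaming=$(psql -At -c "select count(*) from pg_stat_replication where sync_state='sync' and state='streaming'")
    if [ "$streaming" -gt 0 ]; then
      psql -c "alter system set synchronous_commit=remote_write"
      psql -c "select pg_reload_conf()"
    fi
  fi
  sleep 5
done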
Summary

1. Without modifying the PostgreSQL kernel, external monitoring and intervention (for example at a 5-second polling interval) implement a two-node semi-synchronous mode: while both nodes, or the single surviving node, are healthy, RPO=0 is guaranteed, and RTO stays bounded (for example, the longest wait_event='SyncRep' wait is capped at 10 seconds).

2. Suggested kernel changes: for degrading, add a hook in the wait queue so that a wait_event='SyncRep' timeout degrades to asynchronous mode; for upgrading, add a hook in the walsender code that switches back to synchronous mode when the standby is detected to have recovered.

References

《PostgreSQL 一主多从(多副本,强同步)简明手册 - 配置、压测、监控、切换、防脑裂、修复、0丢失 - 珍藏级》
https://www.postgresql.org/docs/11/monitoring-stats.html#MONITORING-STATS-VIEWS
《PostgreSQL 时间点恢复(PITR)在异步流复制主从模式下,如何避免主备切换后PITR恢复走错时间线(timeline , history , partial , restore_command , recovery.conf)》
Tags: PostgreSQL, pg_rewind, primary-standby switchover, timeline repair, split-brain repair, reverting a read-write-activated standby to a read-only standby, rewinding the old primary into a standby of the new primary after an asynchronous failover

Background

1. When a physical streaming standby is activated (promoted), it becomes read-write. pg_rewind can roll it back to a read-only standby without rebuilding the whole standby.

2. After an asynchronous primary-standby switchover, the old primary's WAL directory may still contain WAL that was never fully shipped to the standby, so the old primary cannot directly become a standby of the new primary. pg_rewind can repair the old primary into a read-only standby of the new primary, again without a full rebuild.

3. Without pg_rewind, in the situations above you would have to rebuild the standby completely, or use a storage-level snapshot, or a filesystem snapshot, to roll back to the pre-split-brain state.

Principle and repair steps

1. Prerequisites for pg_rewind: full page writes must be enabled, and either wal hints (wal_log_hints) or data block checksums must be enabled.
2. On the library to be repaired: all WAL from the activation (divergence) point onward must be present in pg_wal. If some WAL has already been recycled, copy it back from the archive into pg_wal.
3. On the new primary: all WAL from the divergence point onward must be in pg_wal, or archived and reachable by the library being repaired via restore_command.
4. Change the source db's (new primary's or old primary's) configuration to allow connections.
5. During the repair, connect to the new primary to obtain the divergence point; or connect to the old primary and compare the timeline of the new primary being repaired against with the old primary's to obtain it.
6. Parse all WAL of the library being repaired from the divergence point to the present. Meanwhile, connect to the source db (new primary or old primary) and perform the rollback: blocks that were modified or deleted are fetched from the source db and overwritten; newly added blocks are simply erased. The library is rolled back to the state of the divergence point.
7. Edit the repaired library's (target db's) recovery.conf and postgresql.conf.
8. Start the target db; it connects to the source db (or uses the restore_command configuration) to receive all WAL from the divergence point onward and applies it.
9. The target db is now a standby of the source db.
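Before attempting a rewind, the prerequisites in step 1 can be checked with a quick query; a minimal sketch (not from the original article):

-- verify pg_rewind prerequisites
select name, setting
from pg_settings
where name in ('full_page_writes', 'wal_log_hints', 'data_checksums');
-- full_page_writes must be on, and at least one of
-- wal_log_hints = on or data_checksums = on (enabled at initdb time with -k)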
Demonstrated with EDB PG 11

Environment setup

《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》

export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=4000
export PGDATA=/data04/ppas11/pg_root4000
export LANG=en_US.utf8
export PGHOME=/usr/edb/as11
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=127.0.0.1
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

1. Initialize the database cluster

initdb -D /data04/ppas11/pg_root4000 -E UTF8 --lc-collate=C --lc-ctype=en_US.UTF8 -U postgres -k --redwood-like

2. Configure recovery.done

cd $PGDATA
cp $PGHOME/share/recovery.conf.sample ./
mv recovery.conf.sample recovery.done
vi recovery.done
restore_command = 'cp /data04/ppas11/wal/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=localhost port=4000 user=postgres'

3. Configure postgresql.conf. To use the rewind feature, full_page_writes must be enabled, and data_checksums or wal_log_hints must be enabled.

postgresql.conf

listen_addresses = '0.0.0.0'
port = 4000
max_connections = 8000
superuser_reserved_connections = 13
unix_socket_directories = '.,/tmp'
unix_socket_permissions = 0700
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 16GB
max_prepared_transactions = 8000
maintenance_work_mem = 1GB
autovacuum_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_maintenance_workers = 8
max_parallel_workers_per_gather = 8
max_parallel_workers = 24
wal_level = replica
synchronous_commit = off
full_page_writes = on
wal_compression = on
wal_buffers = 32MB
wal_writer_delay = 10ms
checkpoint_timeout = 25min
max_wal_size = 32GB
min_wal_size = 8GB
checkpoint_completion_target = 0.2
archive_mode = on
archive_command = 'cp -n %p /data04/ppas11/wal/%f'
max_wal_senders = 16
wal_keep_segments = 4096
max_replication_slots = 16
hot_standby = on
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
wal_receiver_status_interval = 1s
wal_receiver_timeout = 10s
random_page_cost = 1.1
effective_cache_size = 400GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'edb-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_min_duration_statement = 1s
log_checkpoints = on
log_error_verbosity = verbose
log_line_prefix = '%t '
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'PRC'
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 6
autovacuum_freeze_max_age = 1200000000
autovacuum_multixact_freeze_max_age = 1400000000
autovacuum_vacuum_cost_delay = 0
statement_timeout = 0
lock_timeout = 0
idle_in_transaction_session_timeout = 0
vacuum_freeze_table_age = 1150000000
vacuum_multixact_freeze_table_age = 1150000000
datestyle = 'redwood,show_time'
timezone = 'PRC'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'auto_explain,pg_stat_statements,$libdir/dbms_pipe,$libdir/edb_gen,$libdir/dbms_aq'
edb_redwood_date = on
edb_redwood_greatest_least = on
edb_redwood_strings = on
db_dialect = 'redwood'
edb_dynatune = 66
edb_dynatune_profile = oltp
timed_statistics = off

4. Configure pg_hba.conf to allow streaming replication

local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
host all all 0.0.0.0/0 md5

5. Set up the archive directory

mkdir /data04/ppas11/wal
chown enterprisedb:enterprisedb /data04/ppas11/wal

6. Create the standby

pg_basebackup -h 127.0.0.1 -p 4000 -D /data04/ppas11/pg_root4001 -F p -c fast

7. Configure the standby

cd /data04/ppas11/pg_root4001
mv recovery.done recovery.conf
vi postgresql.conf
port = 4001

8. Start the standby

pg_ctl start -D /data04/ppas11/pg_root4001

9. Stress the primary

pgbench -i -s 1000
pgbench -M prepared -v -r -P 1 -c 24 -j 24 -T 300

10. Check archiving

postgres=# select * from pg_stat_archiver ;
 archived_count |    last_archived_wal     |        last_archived_time        | failed_count | last_failed_wal | last_failed_time |           stats_reset
----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+----------------------------------
            240 | 0000000100000000000000F0 | 28-JAN-19 15:08:43.276965 +08:00 |            0 |                 |                  | 28-JAN-19 15:01:17.883338 +08:00
(1 row)

postgres=# select * from pg_stat_archiver ;
 archived_count |    last_archived_wal     |        last_archived_time        | failed_count | last_failed_wal | last_failed_time |           stats_reset
----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+----------------------------------
            248 | 0000000100000000000000F8 | 28-JAN-19 15:08:45.120134 +08:00 |            0 |                 |                  | 28-JAN-19 15:01:17.883338 +08:00
(1 row)

11. Check standby lag

postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+---------------------------------
pid              | 8124
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 62988
backend_start    | 28-JAN-19 15:07:34.084542 +08:00
backend_xmin     |
state            | streaming
sent_lsn         | 1/88BC2000
write_lsn        | 1/88BC2000
flush_lsn        | 1/88BC2000
replay_lsn       | 1/88077D48
write_lag        | 00:00:00.001417
flush_lag        | 00:00:00.002221
replay_lag       | 00:00:00.097657
sync_priority    | 0
sync_state       | async
Example 1: a standby was activated and took writes; use pg_rewind to roll it back to a read-only standby

1. Activate the standby

pg_ctl promote -D /data04/ppas11/pg_root4001

2. Write to the promoted standby

pgbench -M prepared -v -r -P 1 -c 4 -j 4 -T 120 -p 4001

The standby is now on a different timeline from the primary and cannot directly become its standby again:

enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4001|grep -i time
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Time of latest checkpoint:            Mon 28 Jan 2019 03:56:38 PM CST
Min recovery ending loc's timeline:   2
track_commit_timestamp setting:       off
Date/time type storage:               64-bit integers

enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4000|grep -i time
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Time of latest checkpoint:            Mon 28 Jan 2019 05:11:38 PM CST
Min recovery ending loc's timeline:   0
track_commit_timestamp setting:       off
Date/time type storage:               64-bit integers

3. Repair the standby so that it can again follow the current primary.

4. Find the divergence point:

cd /data04/ppas11/pg_root4001
ll pg_wal/*.history
-rw------- 1 enterprisedb enterprisedb 42 Jan 28 17:15 pg_wal/00000002.history
cat pg_wal/00000002.history
1	6/48C62000	no recovery target specified

5. All WAL produced since the standby's activation must be present in its pg_wal directory:

-rw------- 1 enterprisedb enterprisedb 42 Jan 28 17:15 00000002.history
-rw------- 1 enterprisedb enterprisedb 16M Jan 28 17:16 000000020000000600000048
............

Starting from 000000020000000600000048, all WAL must exist in the standby's pg_wal directory. If any of it has been recycled, it must be copied back from the archive directory into pg_wal.

6. On the primary, all WAL from the standby's activation point onward must still be in pg_wal, or be obtainable by the standby via restore_command (recovery.conf):

recovery.conf
restore_command = 'cp /data04/ppas11/wal/%f %p'

7. pg_rewind command help

https://www.postgresql.org/docs/11/app-pgrewind.html

pg_rewind --help
pg_rewind resynchronizes a PostgreSQL cluster with another copy of the cluster.

Usage:
  pg_rewind [OPTION]...

Options:
  -D, --target-pgdata=DIRECTORY  existing data directory to modify
      --source-pgdata=DIRECTORY  source data directory to synchronize with
      --source-server=CONNSTR    source server to synchronize with
  -n, --dry-run                  stop before modifying anything
  -P, --progress                 write progress messages
      --debug                    write a lot of debug messages
  -V, --version                  output version information, then exit
  -?, --help                     show this help, then exit

Report bugs to <support@enterprisedb.com>.

8. Stop the library being repaired

pg_ctl stop -m fast -D /data04/ppas11/pg_root4001

9. Dry-run the repair

pg_rewind -n -D /data04/ppas11/pg_root4001 --source-server="hostaddr=127.0.0.1 user=postgres port=4000"
servers diverged at WAL location 6/48C62000 on timeline 1
rewinding from last common checkpoint at 5/5A8CD30 on timeline 1
Done!

10. The dry run succeeded, so the repair is possible; perform it

pg_rewind -D /data04/ppas11/pg_root4001 --source-server="hostaddr=127.0.0.1 user=postgres port=4000"
servers diverged at WAL location 6/48C62000 on timeline 1
rewinding from last common checkpoint at 5/5A8CD30 on timeline 1
Done!
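Optionally, before reconfiguring, the rewound data directory can be sanity-checked with pg_controldata, the same way the divergence was inspected above (this extra check is not part of the original article):

pg_controldata -D /data04/ppas11/pg_root4001 | grep -i timeline
# the reported values will settle once the node starts and replays WAL from the source;
# this is only a quick check that the control file is intact after the rewind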
11. Repaired; adjust the configuration

cd /data04/ppas11/pg_root4001
vi postgresql.conf
port = 4001
mv recovery.done recovery.conf
vi recovery.conf
restore_command = 'cp /data04/ppas11/wal/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=localhost port=4000 user=postgres'

12. Delete the files produced by the wrong timeline from the archive; otherwise, after startup, the repaired standby would follow timeline 00000002, which is not what we want.

mkdir /data04/ppas11/wal/error_tl_2
mv /data04/ppas11/wal/00000002* /data04/ppas11/wal/error_tl_2

13. Start the standby

pg_ctl start -D /data04/ppas11/pg_root4001

14. It is advisable to run a checkpoint on the primary; once the standby receives it, a restart does not need to replay a long stretch of WAL, recovering from the new checkpoint instead.

psql
checkpoint;

15. Stress the primary

pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4000

16. Check the archive status

postgres=# select * from pg_stat_archiver ;
 archived_count |    last_archived_wal     |        last_archived_time        | failed_count | last_failed_wal | last_failed_time |           stats_reset
----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+----------------------------------
           1756 | 0000000100000006000000DC | 28-JAN-19 17:41:57.562425 +08:00 |            0 |                 |                  | 28-JAN-19 15:01:17.883338 +08:00
(1 row)

17. Check standby health and lag, and observe the post-repair state

postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+--------------------------------
pid              | 13179
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | 127.0.0.1
client_hostname  |
client_port      | 63198
backend_start    | 28-JAN-19 17:47:29.85308 +08:00
backend_xmin     |
state            | catchup
sent_lsn         | 7/DDE80000
write_lsn        | 7/DC000000
flush_lsn        | 7/DC000000
replay_lsn       | 7/26A8DCB0
write_lag        | 00:00:18.373263
flush_lag        | 00:00:18.373263
replay_lag       | 00:00:18.373263
sync_priority    | 0
sync_state       | async

Example 2: the standby was promoted to new primary while the old primary kept taking reads and writes; use pg_rewind to repair the old primary and demote it to a standby of the new primary

1. Activate the standby

pg_ctl promote -D /data04/ppas11/pg_root4001

2. Write to the new primary

pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4001

3. Write to the old primary

pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4000

The old primary is now on a different timeline from the new primary:

enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4000|grep -i timeline
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Min recovery ending loc's timeline:   0

enterprisedb@pg11-test-> pg_controldata -D /data04/ppas11/pg_root4001|grep -i timeline
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Min recovery ending loc's timeline:   2

enterprisedb@pg11-test-> cd /data04/ppas11/pg_root4001/pg_wal
enterprisedb@pg11-test-> cat 00000002.history
1	8/48DE2318	no recovery target specified
enterprisedb@pg11-test-> ll *.partial
-rw------- 1 enterprisedb enterprisedb 16M Jan 28 17:48 000000010000000800000048.partial

4. Repair the old primary into a standby

4.1 All WAL on the old primary from the standby's activation point onward must be in its pg_wal directory: everything from 000000010000000800000048 onward. If any of it has been recycled, it must be copied back from the WAL archive into pg_wal.

4.2 All WAL produced since the standby's activation must be obtainable by the old primary via restore_command (recovery.conf):

recovery.conf
restore_command = 'cp /data04/ppas11/wal/%f %p'

5. Stop the old primary

pg_ctl stop -m fast -D /data04/ppas11/pg_root4000

6. Dry-run the repair of the old primary

pg_rewind -n -D /data04/ppas11/pg_root4000 --source-server="hostaddr=127.0.0.1 user=postgres port=4001"
servers diverged at WAL location 8/48DE2318 on timeline 1
rewinding from last common checkpoint at 6/CCCEF770 on timeline 1
Done!
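If the dry run fails because WAL from the divergence point onward has already been recycled (the requirement in step 4.1), copy the archived segments back into pg_wal first, as required above. A hedged sketch using this article's paths (the wildcard is illustrative; restore exactly the timeline-1 segments from 000000010000000800000048 onward):

# restore archived timeline-1 WAL needed by pg_rewind into the old primary's pg_wal
cp /data04/ppas11/wal/0000000100000008* /data04/ppas11/pg_root4000/pg_wal/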
7、尝试成功,可以修复,实施修复 pg_rewind -D /data04/ppas11/pg_root4000 --source-server="hostaddr=127.0.0.1 user=postgres port=4001" 8、修复完成后,改配置 cd /data04/ppas11/pg_root4000 vi postgresql.conf port = 4000 mv recovery.done recovery.conf vi recovery.conf restore_command = 'cp /data04/ppas11/wal/%f %p' recovery_target_timeline = 'latest' standby_mode = on primary_conninfo = 'host=localhost port=4001 user=postgres' 9、启动老主库 pg_ctl start -D /data04/ppas11/pg_root4000 10、建议对新主库做一个检查点,从库收到检查点后,重启后不需要应用太多WAL,而是从新检查点开始恢复 checkpoint; 11、压测新主库 pgbench -M prepared -v -r -P 1 -c 16 -j 16 -T 200 -p 4001 12、查看归档状态 psql -p 4001 postgres=# select * from pg_stat_archiver ; archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset ----------------+--------------------------+----------------------------------+--------------+-----------------+------------------+---------------------------------- 406 | 0000000200000009000000DB | 28-JAN-19 21:18:22.976118 +08:00 | 0 | | | 28-JAN-19 17:47:29.847488 +08:00 (1 row) 13、查看从库健康、延迟 psql -p 4001 postgres=# select * from pg_stat_replication ; -[ RECORD 1 ]----+--------------------------------- pid | 17675 usesysid | 10 usename | postgres application_name | walreceiver client_addr | 127.0.0.1 client_hostname | client_port | 60530 backend_start | 28-JAN-19 21:18:36.472197 +08:00 backend_xmin | state | streaming sent_lsn | 9/E8361C18 write_lsn | 9/E8361C18 flush_lsn | 9/E8361C18 replay_lsn | 9/D235B520 write_lag | 00:00:00.000101 flush_lag | 00:00:00.000184 replay_lag | 00:00:03.028098 sync_priority | 0 sync_state | async 小结 1 适合场景 1、PG物理流复制的从库,当激活后,可以开启读写,使用pg_rewind可以将从库回退为只读从库的角色。而不需要重建整个从库。 2、当异步主从发生角色切换后,主库的wal目录中可能还有没完全同步到从库的内容,因此老的主库无法直接切换为新主库的从库。使用pg_rewind可以修复老的主库,使之成为新主库的只读从库。而不需要重建整个从库。 如果没有pg_rewind,遇到以上情况,需要完全重建从库,如果库占用空间很大,重建非常耗时,也非常耗费上游数据库的资源(读)。 2 前提 要使用rewind功能: 1、必须开启full_page_writes 2、必须开启data_checksums或wal_log_hints initdb -k 开启data_checksums 3 原理与修复流程 1、使用pg_rewind功能的前提条件:必须开启full page write,必须开启wal hint或者data block checksum。 2、需要被修复的库:从激活点开始,所有的WAL必须存在pg_wal目录中。如果WAL已经被覆盖,只要有归档,拷贝到pg_wal目录即可。 3、新的主库,从激活点开始,产生的所有WAL必须存在pg_wal目录中,或者已归档,并且被修复的库可以使用restore_command访问到这部分WAL。 4、修改(source db)新主库或老主库配置,允许连接。 5、修复时,连接新主库,得到切换点。或连接老主库,同时比对当前要修复的新主库的TL与老主库进行比对,得到切换点。 6、解析需要被修复的库的从切换点到现在所有的WAL。同时连接source db(新主库(或老主库)),进行回退操作(被修改或删除的BLOCK从source db获取并覆盖,新增的BLOCK,直接抹除。)回退到切换点的状态。 7、修改被修复库(target db)的recovery.conf, postgresql.conf配置。 8、启动target db,连接source db接收WAL,或restore_command配置接收WAL,从切换点开始所有WAL,进行apply。 9、target db现在是source db的从库。 参考 https://www.postgresql.org/docs/11/app-pgrewind.html 《PostgreSQL primary-standby failback tools : pg_rewind》 《PostgreSQL 9.5 new feature - pg_rewind fast sync Split Brain Primary & Standby》 《PostgreSQL 9.5 add pg_rewind for Fast align for PostgreSQL unaligned primary & standby》 《MTK使用 - PG,PPAS,oracle,mysql,ms sql,sybase 迁移到 PG, PPAS (支持跨版本升级)》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
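附:实施 rewind 之前,可以先用如下 SQL 快速确认上述前提条件(示例):
show full_page_writes; -- 必须为 on
show wal_log_hints; -- 与 data_checksums 至少开启其一
show data_checksums; -- 只读参数,initdb -k 时开启
不满足前提时,pg_rewind 会报错退出。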
标签 PostgreSQL , pg_rewind , 时间线 , 变化量 , 业务补齐 背景 pg_rewind类似Oracle flashback,可以将一个数据库回退到一个以前的状态,例如用于: 1、PG物理流复制的从库,当激活后,可以开启读写,使用pg_rewind可以将从库回退为只读从库的角色。而不需要重建整个从库。 2、当异步主从发生角色切换后,主库的wal目录中可能还有没完全同步到从库的内容,因此老的主库无法直接切换为新主库的从库。使用pg_rewind可以修复老的主库,使之成为新主库的只读从库。而不需要重建整个从库。 如果没有pg_rewind,遇到以上情况,需要完全重建从库,如果库占用空间很大,重建非常耗时,也非常耗费上游数据库的资源(读)。 详见: 《PostgreSQL pg_rewind,时间线修复,脑裂修复 - 从库开启读写后,回退为只读从库。异步主从发生角色切换后,主库rewind为新主库的从库》 以上解决的是怎么回退的问题,还有一个问题没有解,在分歧点到当前状态下,这些被回退掉的WAL,其中包含了哪些逻辑变化,这些信息怎么补齐? 时间线分歧变化量补齐原理 1、开启wal_level=logical 1.1、确保有足够的slots 2、开启DDL定义功能,参考: 《PostgreSQL 逻辑订阅 - DDL 订阅 实现方法》 3、在主库,为每一个数据库(或需要做时间线补齐的数据库)创建一个logical SLOT 4、有更新、删除操作的表,必须有主键 5、间歇性移动slot的位置到pg_stat_replication.sent_lsn的位置 6、如果从库被激活,假设老主库上还有未发送到从库的WAL 7、从从库获取激活位置LSN 8、由于使用了SLOT,所以从库激活位点LSN之后的WAL一定存在于老主库WAL目录中。 9、将老主库的slot移动到激活位置LSN 10、从激活位置开始获取logical变化量 11、业务层根据业务逻辑对这些变化量进行处理,补齐时间线分歧 示例 环境使用: 《PostgreSQL pg_rewind,时间线修复,脑裂修复 - 从库开启读写后,回退为只读从库。异步主从发生角色切换后,主库rewind为新主库的从库》 主库 port 4001 从库 port 4000 1、开启wal_level=logical psql -p 4000 postgres=# alter system set wal_level=logical; ALTER SYSTEM psql -p 4001 postgres=# alter system set wal_level=logical; ALTER SYSTEM 1.1、确保有足够的slots edb=# show max_replication_slots ; max_replication_slots ----------------------- 16 (1 row) 重启数据库。 2、开启DDL定义功能,参考: 《PostgreSQL 逻辑订阅 - DDL 订阅 实现方法》 3、在主库,为每一个数据库(或需要做时间线补齐的数据库)创建一个logical SLOT postgres=# select pg_create_logical_replication_slot('fix_tl','test_decoding'); pg_create_logical_replication_slot ------------------------------------ (fix_tl,B/73000140) (1 row) edb=# select pg_create_logical_replication_slot('fix_tl_edb','test_decoding'); pg_create_logical_replication_slot ------------------------------------ (fix_tl_edb,B/73000140) (1 row) 4、有更新、删除操作的表,必须有主键 5、间歇性移动slot的位置到pg_stat_replication.sent_lsn的位置 连接到对应的库操作 postgres=# select pg_replication_slot_advance('fix_tl',sent_lsn) from pg_stat_replication ; pg_replication_slot_advance ----------------------------- (fix_tl,B/73000140) (1 row) edb=# select pg_replication_slot_advance('fix_tl_edb',sent_lsn) from pg_stat_replication ; pg_replication_slot_advance ----------------------------- (fix_tl,B/73000140) (1 row) 6、如果从库被激活,假设老主库上还有未发送到从库的WAL pg_ctl promote -D /data04/ppas11/pg_root4000 7、从从库获取激活位置LSN cd /data04/ppas11/pg_root4000 cat pg_wal/00000003.history 1 8/48DE2318 no recovery target specified 2 D/FD5FFFB8 no recovery target specified 8、由于使用了SLOT,所以从库激活位点LSN之后的WAL一定存在于老主库WAL目录中。 9、将老主库的slot移动到激活位置LSN psql -p 4001 postgres postgres=# select pg_replication_slot_advance('fix_tl','D/FD5FFFB8'); psql -p 4001 edb edb=# select pg_replication_slot_advance('fix_tl_edb','D/FD5FFFB8'); 10、从激活位置开始获取logical变化量 edb=# select * from pg_logical_slot_get_changes('fix_tl_edb',NULL,10,'include-xids', '0'); lsn | xid | data -----+-----+------ (0 rows) 由于EDB库没有变化,所以返回0条记录 postgres=# select * from pg_logical_slot_get_changes('fix_tl',NULL,10,'include-xids', '0'); lsn | xid | data ------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- D/FD5FEC60 | 68900576 | BEGIN D/FD5FEC60 | 68900576 | table public.pgbench_accounts: UPDATE: aid[integer]:44681547 bid[integer]:447 abalance[integer]:-4591 filler[character]:' ' D/FD5FF3A8 | 68900576 | table public.pgbench_tellers: UPDATE: tid[integer]:5091 bid[integer]:510 tbalance[integer]:-160944 filler[character]:null D/FD5FF9A8 | 68900576 | table public.pgbench_branches: UPDATE: bid[integer]:740 
bbalance[integer]:-261044 filler[character]:null D/FD5FFEF8 | 68900576 | table public.pgbench_history: INSERT: tid[integer]:5091 bid[integer]:740 aid[integer]:44681547 delta[integer]:-4591 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.39739' filler[character]:null D/FD6001E8 | 68900576 | COMMIT D/FD5FE790 | 68900574 | BEGIN D/FD5FE790 | 68900574 | table public.pgbench_accounts: UPDATE: aid[integer]:60858810 bid[integer]:609 abalance[integer]:3473 filler[character]:' ' D/FD5FF1C8 | 68900574 | table public.pgbench_tellers: UPDATE: tid[integer]:8829 bid[integer]:883 tbalance[integer]:60244 filler[character]:null D/FD5FF810 | 68900574 | table public.pgbench_branches: UPDATE: bid[integer]:33 bbalance[integer]:86295 filler[character]:null D/FD5FFD80 | 68900574 | table public.pgbench_history: INSERT: tid[integer]:8829 bid[integer]:33 aid[integer]:60858810 delta[integer]:3473 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.397383' filler[character]:null D/FD600218 | 68900574 | COMMIT (12 rows) postgres=# select * from pg_logical_slot_get_changes('fix_tl',NULL,10,'include-xids', '0'); lsn | xid | data ------------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- D/FD5FEED0 | 68900578 | BEGIN D/FD5FEED0 | 68900578 | table public.pgbench_accounts: UPDATE: aid[integer]:15334791 bid[integer]:154 abalance[integer]:-2741 filler[character]:' ' D/FD5FF518 | 68900578 | table public.pgbench_tellers: UPDATE: tid[integer]:2402 bid[integer]:241 tbalance[integer]:191936 filler[character]:null D/FD5FFB88 | 68900578 | table public.pgbench_branches: UPDATE: bid[integer]:345 bbalance[integer]:-693783 filler[character]:null D/FD5FFFB8 | 68900578 | table public.pgbench_history: INSERT: tid[integer]:2402 bid[integer]:345 aid[integer]:15334791 delta[integer]:-2741 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.397396' filler[character]:null D/FD600248 | 68900578 | COMMIT D/FD5FF438 | 68900579 | BEGIN D/FD5FF438 | 68900579 | table public.pgbench_accounts: UPDATE: aid[integer]:54259132 bid[integer]:543 abalance[integer]:3952 filler[character]:' ' D/FD5FFEA8 | 68900579 | table public.pgbench_tellers: UPDATE: tid[integer]:9591 bid[integer]:960 tbalance[integer]:-498586 filler[character]:null D/FD600298 | 68900579 | table public.pgbench_branches: UPDATE: bid[integer]:147 bbalance[integer]:459542 filler[character]:null D/FD600560 | 68900579 | table public.pgbench_history: INSERT: tid[integer]:9591 bid[integer]:147 aid[integer]:54259132 delta[integer]:3952 mtime[timestamp without time zone]:'29-JAN-19 09:48:14.397464' filler[character]:null D/FD600938 | 68900579 | COMMIT (12 rows) ... ... 
直到没有记录返回,说明已获取到所有变化量 10.1、查看SLOT状态,当前WAL位置信息 psql -p 4001 postgres=# select * from pg_get_replication_slots(); slot_name | plugin | slot_type | datoid | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn ------------+---------------+-----------+--------+-----------+--------+------------+------+--------------+-------------+--------------------- fix_tl | test_decoding | logical | 15844 | f | f | | | 67005646 | D/D7959218 | D/FD600218 fix_tl_edb | test_decoding | logical | 15845 | f | f | | | 72528996 | E/71C92B00 | E/71C92B38 (2 rows) 当前WAL位置 postgres=# select pg_current_wal_lsn(); pg_current_wal_lsn -------------------- E/71C92B38 (1 row) 11、业务层根据业务逻辑对这些变化量进行处理,补齐时间线分歧 小结 主库开启逻辑SLOT,并根据从库的接收LSN位置,使用pg_replication_slot_advance移动主库的slot位点到从库的接收LSN位置。 当从库激活,老主库还有未同步到从库的WAL时,可以通过逻辑decode的方法,获取到未同步的逻辑变化量。 业务层根据业务逻辑,补齐这些变化量到新的主库。 注意: 1、开启logical wal_level,会给数据库增加较多的WAL日志,请酌情开启。 2、开启SLOT后,数据库会保证SLOT尚未消费的WAL保留在pg_wal目录中,如果SLOT没有及时移动,则可能导致主库的pg_wal目录暴增。 参考 https://www.postgresql.org/docs/11/test-decoding.html https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-REPLICATION 《PostgreSQL 逻辑订阅 - DDL 订阅 实现方法》 《PostgreSQL pg_rewind,时间线修复,脑裂修复 - 从库开启读写后,回退为只读从库。异步主从发生角色切换后,主库rewind为新主库的从库》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
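附:如果只想先检查分歧区间的变更内容、而暂不消费位点,可以使用对应的 peek 接口(示例,参数与 get 接口相同,但不会推进 slot 位点):
select * from pg_logical_slot_peek_changes('fix_tl', NULL, 10, 'include-xids', '0');
确认无误后,再用 pg_logical_slot_get_changes 正式消费。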
标签 PostgreSQL , 同步 , 半同步 , 流复制 , 心跳 , 自动降级 , 自动升级 , dblink , 异步调用 背景 在心跳时,通过自定义UDF,实现心跳永远不被堵塞,并且支持更加当前的配置自动的进行同步、异步模式的升降级。实现半同步的功能。 UDF输入 1、优先模式(同步、异步) 2、同步等待超时时间 当优先为同步模式时,假设当前为同步配置,如果备库异常导致事务提交等待超过指定时间,则自动降级为异步。 当优先为异步模式时,假设当前为同步配置,自动降级为异步。 当优先为同步模式时,假设当前为异步配置,如果备库恢复到streaming模式,自动升级为同步。 使用技术点: 1、alter system 2、reload conf 3、cancle backend 4、dblink 异步调用 心跳UDF逻辑 判断当前实例状态 只读 退出 读写 判断当前事务模式 异步 发心跳 优先模式是什么 异步 退出 同步 判断是否需要升级 升级 退出 同步 消耗异步消息 发远程心跳 查询是否超时 降级 否则 消耗异步消息 优先模式是什么 异步 降级 退出 同步 退出 设计 1、当前postgresql.conf配置 synchronous_commit='remote_write'; synchronous_standby_names='*'; 表示同步模式。 2、心跳表设计 create table t_keepalive(id int primary key, ts timestamp, pos pg_lsn); 3、心跳写入方法 insert into t_keepalive values (1,now(),pg_current_wal_lsn()) on conflict (id) do update set ts=excluded.ts,pos=excluded.pos returning id,ts,pos; 4、创建一个建立连接函数,不报错 create or replace function conn( name, -- dblink名字 text -- 连接串,URL ) returns void as $$ declare begin perform dblink_connect($1, $2); return; exception when others then return; end; $$ language plpgsql strict; 5、更加以上逻辑创建心跳UDF。 create or replace function keepalive ( prio_commit_mode text, tmout interval ) returns t_keepalive as $$ declare res1 int; res2 timestamp; res3 pg_lsn; commit_mode text; conn text := format('hostaddr=%s port=%s user=%s dbname=%s application_name=', '127.0.0.1', current_setting('port'), current_user, current_database()); conn_altersys text := format('hostaddr=%s port=%s user=%s dbname=%s', '127.0.0.1', current_setting('port'), current_user, current_database()); app_prefix_stat text := 'keepalive_dblink'; begin if prio_commit_mode not in ('sync','async') then raise notice 'prio_commit_mode must be [sync|async]'; return null; end if; show synchronous_commit into commit_mode; create extension IF NOT EXISTS dblink; -- 判断当前实例状态 if pg_is_in_recovery() -- 只读 then raise notice 'Current instance in recovery mode.'; return null; -- 读写 else -- 判断当前事务模式 if commit_mode in ('local','off') -- 异步 then -- 发心跳 insert into t_keepalive values (1,now(),pg_current_wal_lsn()) on conflict (id) do update set ts=excluded.ts,pos=excluded.pos returning id,ts,pos into res1,res2,res3; -- 优先模式是什么 if prio_commit_mode='async' -- 异步 then -- 退出 return row(res1,res2,res3)::t_keepalive; -- 同步 else -- 判断是否需要升级 perform 1 from pg_stat_replication where state='streaming' limit 1; if found -- 升级 then perform dblink_exec(conn_altersys, 'alter system set synchronous_commit=remote_write', true); perform pg_reload_conf(); -- 退出 return row(res1,res2,res3)::t_keepalive; end if; return row(res1,res2,res3)::t_keepalive; end if; -- 同步 else -- 消耗异步消息 perform conn(app_prefix_stat, conn||app_prefix_stat); perform t from dblink_get_result(app_prefix_stat, false) as t(id int, ts timestamp, pos pg_lsn); -- 发远程心跳 perform dblink_send_query(app_prefix_stat, $_$ insert into t_keepalive values (1,now(),pg_current_wal_lsn()) on conflict (id) do update set ts=excluded.ts,pos=excluded.pos returning id,ts,pos $_$); -- 查询是否超时 <<ablock>> loop perform pg_sleep(0.2); perform 1 from pg_stat_activity where application_name=app_prefix_stat and state='idle' limit 1; -- 未超时 if found then select id,ts,pos into res1,res2,res3 from dblink_get_result(app_prefix_stat, false) as t(id int, ts timestamp, pos pg_lsn); raise notice 'no timeout'; exit ablock; end if; perform 1 from pg_stat_activity where wait_event='SyncRep' and application_name=app_prefix_stat and clock_timestamp()-query_start > tmout limit 1; -- 降级 if found then perform dblink_exec(conn_altersys, 'alter system set synchronous_commit=local', true); perform 
pg_reload_conf(); perform pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep'; select id,ts,pos into res1,res2,res3 from dblink_get_result(app_prefix_stat, false) as t(id int, ts timestamp, pos pg_lsn); raise notice 'timeout'; exit ablock; end if; perform pg_sleep(0.2); end loop; -- 优先模式是什么 if prio_commit_mode='async' -- 异步 then show synchronous_commit into commit_mode; -- 降级 if commit_mode in ('on','remote_write','remote_apply') then perform dblink_exec(conn_altersys, 'alter system set synchronous_commit=local', true); perform pg_reload_conf(); perform pg_cancel_backend(pid) from pg_stat_activity where wait_event='SyncRep'; end if; -- 退出 return row(res1,res2,res3)::t_keepalive; -- 同步 else -- 退出 return row(res1,res2,res3)::t_keepalive; end if; end if; end if; end; $$ language plpgsql strict; 测试 1、当前为同步模式 postgres=# show synchronous_commit ; synchronous_commit -------------------- remote_write (1 row) 2、人为关闭从库,心跳自动将数据库改成异步模式,并通知所有等待中会话。 postgres=# select * from keepalive ('sync','5 second'); NOTICE: extension "dblink" already exists, skipping NOTICE: timeout id | ts | pos ----+----------------------------+------------- 1 | 2019-01-30 00:48:39.800829 | 23/9501D5F8 (1 row) postgres=# show synchronous_commit ; synchronous_commit -------------------- local (1 row) 3、恢复从库,心跳自动将数据库升级为优先sync模式。 postgres=# select * from keepalive ('sync','5 second'); NOTICE: extension "dblink" already exists, skipping id | ts | pos ----+----------------------------+------------- 1 | 2019-01-30 00:48:47.329119 | 23/9501D6E8 (1 row) postgres=# select * from keepalive ('sync','5 second'); NOTICE: extension "dblink" already exists, skipping NOTICE: no timeout id | ts | pos ----+----------------------------+------------- 1 | 2019-01-30 00:49:11.991855 | 23/9501E0C8 (1 row) postgres=# show synchronous_commit ; synchronous_commit -------------------- remote_write (1 row) 小结 在心跳时,通过自定义UDF,实现心跳永远不被堵塞,并且支持根据当前的配置自动进行同步、异步模式的升降级。实现半同步的功能。 UDF输入 1、优先模式(同步、异步) 2、同步等待超时时间 当优先为同步模式时,假设当前为同步配置,如果备库异常导致事务提交等待超过指定时间,则自动降级为异步。 当优先为异步模式时,假设当前为同步配置,自动降级为异步。 当优先为同步模式时,假设当前为异步配置,如果备库恢复到streaming模式,自动升级为同步。 使用技术点: 1、alter system 2、reload conf 3、cancel backend 4、dblink 异步调用 使用心跳实现半同步,大大简化了整个同步、异步模式切换的流程。当然如果内核层面可以实现,配置几个参数,会更加完美。 参考 dblink 异步调用 《PostgreSQL 数据库心跳(SLA(RPO)指标的时间、WAL SIZE维度计算)》 《PostgreSQL 双节点流复制如何同时保证可用性、可靠性(rpo,rto) - (半同步,自动降级方法实践)》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
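附:心跳需要周期性调用,除了外部 crontab,如果实例安装了 pg_cron 插件(此处仅为假设的示意,并非本文依赖),也可以在库内调度,例如每分钟调用一次:
create extension if not exists pg_cron; -- 需预先配置 shared_preload_libraries
select cron.schedule('* * * * *', $$select * from keepalive('sync', '5 second')$$);
pg_cron 的最小调度粒度为 1 分钟,对心跳灵敏度要求更高时,仍建议使用外部调度。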
标签 PostgreSQL , 只读 , 锁定 , readonly , recovery.conf , 恢复模式 , pg_is_in_recovery , default_transaction_read_only 背景 在一些场景中,可能要将数据库设置为只读模式。 例如, 1、云数据库,当使用的容量超过了购买的限制时。切换到只读(锁定)模式,确保用户不会用超。 2、业务上需要对数据库进行迁移,准备割接时,可将主库切换到只读(锁定),确保绝对不会有事务写入。 锁定的实现方法有若干种。 1、硬锁定,直接将数据库切换到恢复模式,绝对不会有写操作出现。 2、软锁定,设置default_transaction_read_only为on,默认开启的事务为只读事务。用户如果使用begin transaction read write可破解。 3、内核层面改进的锁定,对于云上产品,锁定后实际上是期望用户升级容量,或者用户可以上去删数据使得使用空间降下来的。那么以上两种锁定都不适用,需要禁止除truncate, drop操作以外的所有操作的这种锁定方式。而且最好是不需要重启数据库就可以实现。 实现 1 锁定实例 硬锁定 1、配置 recovery.conf recovery_target_timeline = 'latest' standby_mode = on 2、重启数据库 pg_ctl restart -m fast 3、硬锁定,不可破解 postgres=# select pg_is_in_recovery(); pg_is_in_recovery ------------------- t (1 row) postgres=# insert into t1 values (1); ERROR: cannot execute INSERT in a read-only transaction postgres=# begin transaction read write; ERROR: cannot set transaction read-write mode during recovery 软锁定 1、设置default_transaction_read_only postgres=# alter system set default_transaction_read_only=on; ALTER SYSTEM 2、重载配置 postgres=# select pg_reload_conf(); pg_reload_conf ---------------- t (1 row) 3、所有会话自动进入read only的默认事务模式。 reload前 postgres=# show default_transaction_read_only ; default_transaction_read_only ------------------------------- off (1 row) reload后 postgres=# show default_transaction_read_only ; default_transaction_read_only ------------------------------- on (1 row) postgres=# insert into t1 values (1); ERROR: cannot execute INSERT in a read-only transaction 4、软锁定可破解 postgres=# begin transaction read write; BEGIN postgres=# insert into t1 values (1); INSERT 0 1 postgres=# end; COMMIT 2 解锁实例 硬解锁 1、重命名recovery.conf到recovery.done cd $PGDATA mv recovery.conf recovery.done 2、重启数据库 pg_ctl restart -m fast 软解锁 1、设置default_transaction_read_only postgres=# alter system set default_transaction_read_only=off; ALTER SYSTEM 2、重载配置 postgres=# select pg_reload_conf(); pg_reload_conf ---------------- t (1 row) 3、所有会话自动恢复read write的默认事务模式。 reload前 postgres=# show default_transaction_read_only ; default_transaction_read_only ------------------------------- on (1 row) reload后 postgres=# show default_transaction_read_only ; default_transaction_read_only ------------------------------- off (1 row) 写恢复 postgres=# insert into t1 values (1); INSERT 0 1 内核层锁定 通过修改内核实现锁定,锁定后只允许: 1、truncate 2、drop 这样,用户可以在锁定的情况下进行数据清理,并以跑任务的形式检查数据是否清理干净,再进行解锁设置。 阿里云RDS PG已支持。 参考 https://www.postgresql.org/docs/11/recovery-config.html https://www.postgresql.org/docs/11/runtime-config-client.html#RUNTIME-CONFIG-CLIENT-STATEMENT https://www.postgresql.org/docs/11/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
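附:软锁定除了实例级,也可以只针对某个数据库或某个用户设置(示例,db1、app_user 为假设的库名与用户名,对新建立的连接生效):
alter database db1 set default_transaction_read_only = on;
alter role app_user set default_transaction_read_only = on;
-- 解除
alter database db1 reset default_transaction_read_only;
alter role app_user reset default_transaction_read_only;
粒度更细,但同样属于软锁定,可被 begin transaction read write 破解。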
标签 PostgreSQL , 参数 , 优先级 , 配置文件 , alter system , 命令行 , 用户 , 数据库 , 所有用户 , 会话 , 事务 , 函数 , 表 背景 PostgreSQL 参数配置包罗万象,可以在配置文件 , alter system , 命令行 , 用户 , 数据库 , 所有用户 , 会话 , 事务 , 函数 , 表 等层面进行配置,非常的灵活。 灵活是好,但是可配置的入口太多了,优先级如何?如果在多个入口配置了同一个参数的不同值,最后会以哪个为准? 参数优先级 优先级如下,数值越大,优先级越高。 1 postgresql.conf work_mem=1MB 2 postgresql.auto.conf work_mem=2MB 3 command line options work_mem=3MB pg_ctl start -o "-c work_mem='3MB'" 4 all role work_mem=4MB alter role all set work_mem='4MB'; 5 database work_mem=5MB alter database postgres set work_mem='5MB'; 6 role work_mem=6MB alter role digoal set work_mem='6MB'; 7 session (客户端参数) work_mem=7MB set work_mem ='7MB'; 8 事务 work_mem=8MB postgres=# begin; BEGIN postgres=# set local work_mem='8MB'; SET 9 function (参数在函数内有效,函数调用完成后依旧使用其他最高优先级参数值) work_mem=9MB postgres=# create or replace function f_test() returns void as $$ declare res text; begin show work_mem into res; raise notice '%', res; end; $$ language plpgsql strict set work_mem='9MB'; CREATE FUNCTION postgres=# select f_test(); NOTICE: 9MB f_test -------- (1 row) 10 table TABLE相关参数(垃圾回收相关) https://www.postgresql.org/docs/11/sql-createtable.html autovacuum_enabled toast.autovacuum_enabled ... ... autovacuum_vacuum_threshold toast.autovacuum_vacuum_threshold ... ... 小结 PostgreSQL 支持的配置入口: 配置文件(postgresql.conf) , alter system(postgresql.auto.conf) , 命令行(postgres -o, pg_ctl -o) , 所有用户(alter role all set) , 数据库(alter database xxx set) , 用户(alter role 用户名 set) , 会话(set xxx) , 事务(set local xxx;) , 函数(create or replace function .... set par=val;) , 表(表级垃圾回收相关参数) 如果一个参数在所有入口都配置过,优先级如上,从上到下,优先级越来越大。 参考 《PostgreSQL GUC 参数级别介绍》 《连接PostgreSQL时,如何指定参数》 https://www.postgresql.org/docs/11/sql-createtable.html 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
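附:一个快速验证优先级的小例子(在 psql 中执行,数据库级设置需重连生效):
alter database postgres set work_mem='5MB';
-- 重新连接后
show work_mem; -- 5MB,数据库级覆盖配置文件
set work_mem='7MB';
show work_mem; -- 7MB,会话级覆盖数据库级
begin;
set local work_mem='8MB';
show work_mem; -- 8MB,事务级优先级最高
commit;
show work_mem; -- 7MB,事务结束后回到会话级设置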
标签 PostgreSQL , 分区表 , 在线转换 背景 非分区表,如何在线(不影响业务)转换为分区表? 方法1,pg_pathman分区插件 《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》 使用非堵塞式的迁移接口 partition_table_concurrently( relation REGCLASS, -- 主表OID batch_size INTEGER DEFAULT 1000, -- 一个事务批量迁移多少记录 sleep_time FLOAT8 DEFAULT 1.0) -- 获得行锁失败时,休眠多久再次获取,重试60次退出任务。 postgres=# select partition_table_concurrently('part_test'::regclass, 10000, 1.0); NOTICE: worker started, you can stop it with the following command: select stop_concurrent_part_task('part_test'); partition_table_concurrently ------------------------------ (1 row) 迁移结束后,主表数据已经没有了,全部在分区中 postgres=# select count(*) from only part_test; count ------- 0 (1 row) 数据迁移完成后,建议禁用主表,这样执行计划就不会出现主表了 postgres=# select set_enable_parent('part_test'::regclass, false); set_enable_parent ------------------- (1 row) 方法2,原生分区 使用继承表,触发器,异步迁移,交换表名一系列步骤,在线将非分区表,转换为分区表(交换表名是需要短暂的堵塞)。 关键技术: 1、继承表(子分区) 对select, update, delete, truncate, drop透明。 2、触发器 插入,采用before触发器,数据路由到继承分区 更新,采用before触发器,删除老表记录,同时将更新后的数据插入新表 3、后台迁移数据,cte only skip locked , delete only, insert into new table 4、迁移结束(p表没有数据后),短暂上锁,剥离INHERTI关系,切换到原生分区,切换表名。 例子 将一个表在线转换为LIST分区表(伪HASH分区)。 范围分区类似。 如果要转换为原生HASH分区表,需要提取pg内置HASH分区算法。 1、创建测试表(需要被分区的表) create table old (id int primary key, info text, crt_time timestamp); 2、写入1000万测试记录 insert into old select generate_series(1,10000000) , md5(random()::text) , now(); 3、创建子分区(本例使用LIST分区) do language plpgsql $$ declare parts int := 4; begin for i in 0..parts-1 loop execute format('create table old_mid%s (like old including all) inherits (old)', i); execute format('alter table old_mid%s add constraint ck check(abs(mod(id,%s))=%s)', i, parts, i); end loop; end; $$; 4、插入,采用before触发器,路由到新表 create or replace function ins_tbl() returns trigger as $$ declare begin case abs(mod(NEW.id,4)) when 0 then insert into old_mid0 values (NEW.*); when 1 then insert into old_mid1 values (NEW.*); when 2 then insert into old_mid2 values (NEW.*); when 3 then insert into old_mid3 values (NEW.*); else return NEW; -- 如果是NULL则写本地父表,主键不会为NULL end case; return null; end; $$ language plpgsql strict; create trigger tg1 before insert on old for each row execute procedure ins_tbl(); 5、更新,采用before触发器,删除老表,同时将更新后的数据插入新表 create or replace function upd_tbl () returns trigger as $$ declare begin case abs(mod(NEW.id,4)) when 0 then insert into old_mid0 values (NEW.*); when 1 then insert into old_mid1 values (NEW.*); when 2 then insert into old_mid2 values (NEW.*); when 3 then insert into old_mid3 values (NEW.*); else return NEW; -- 如果是NULL则写本地父表,主键不会为NULL end case; delete from only old where id=NEW.id; return null; end; $$ language plpgsql strict; create trigger tg2 before update on old for each row execute procedure upd_tbl(); 6、old table 如下 postgres=# \dt+ old List of relations Schema | Name | Type | Owner | Size | Description --------+------+-------+----------+--------+------------- public | old | table | postgres | 730 MB | (1 row) 继承关系如下 postgres=# \d+ old Table "public.old" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | not null | | plain | | info | text | | | | extended | | crt_time | timestamp without time zone | | | | plain | | Indexes: "old_pkey" PRIMARY KEY, btree (id) Triggers: tg1 BEFORE INSERT ON old FOR EACH ROW EXECUTE PROCEDURE ins_tbl() tg2 BEFORE UPDATE ON old FOR EACH ROW EXECUTE PROCEDURE upd_tbl() Child tables: old_mid0, old_mid1, old_mid2, old_mid3 7、验证insert, update, 
delete, select完全符合要求。对业务SQL请求透明。 postgres=# insert into old values (0,'test',now()); INSERT 0 0 postgres=# select tableoid::regclass,* from old where id=1; tableoid | id | info | crt_time ----------+----+----------------------------------+--------------------------- old | 1 | 22be06200f2a967104872f6f173fd038 | 31-JAN-19 12:52:25.887242 (1 row) postgres=# select tableoid::regclass,* from old where id=0; tableoid | id | info | crt_time ----------+----+------+--------------------------- old_mid0 | 0 | test | 31-JAN-19 13:02:35.859899 (1 row) postgres=# update old set info='abc' where id in (0,2) returning tableoid::regclass,*; tableoid | id | info | crt_time ----------+----+------+--------------------------- old_mid0 | 0 | abc | 31-JAN-19 13:02:35.859899 (1 row) UPDATE 1 postgres=# select tableoid::regclass,* from old where id in (0,2); tableoid | id | info | crt_time ----------+----+------+--------------------------- old_mid0 | 0 | abc | 31-JAN-19 13:12:03.343559 old_mid2 | 2 | abc | 31-JAN-19 13:11:04.763652 (2 rows) postgres=# delete from old where id=3; DELETE 1 postgres=# select tableoid::regclass,* from old where id=3; tableoid | id | info | crt_time ----------+----+------+---------- (0 rows) 8、开启压测,后台对原表数据进行迁移 create or replace function test_ins(int) returns void as $$ declare begin insert into old values ($1,'test',now()); exception when others then return; end; $$ language plpgsql strict; vi test.sql \set id1 random(10000001,200000000) \set id2 random(1,5000000) \set id3 random(5000001,10000000) delete from old where id=:id2; update old set info=md5(random()::text),crt_time=now() where id=:id3; select test_ins(:id1); 开启压测 pgbench -M prepared -n -r -P 1 -f ./test.sql -c 4 -j 4 -T 1200 ... progress: 323.0 s, 12333.1 tps, lat 0.324 ms stddev 0.036 progress: 324.0 s, 11612.9 tps, lat 0.344 ms stddev 0.203 progress: 325.0 s, 12546.0 tps, lat 0.319 ms stddev 0.061 progress: 326.0 s, 12728.7 tps, lat 0.314 ms stddev 0.038 progress: 327.0 s, 12536.9 tps, lat 0.319 ms stddev 0.040 progress: 328.0 s, 12534.1 tps, lat 0.319 ms stddev 0.042 progress: 329.0 s, 12228.1 tps, lat 0.327 ms stddev 0.047 ... 
9、在线迁移数据 批量迁移,每一批迁移N条。调用以下SQL with a as ( delete from only old where ctid = any (array (select ctid from only old limit 1000 for update skip locked) ) returning * ) insert into old select * from a; INSERT 0 0 postgres=# select count(*) from only old; count --------- 9998998 (1 row) postgres=# select count(*) from old; count ---------- 10000000 (1 row) postgres=# with a as ( delete from only old where ctid = any (array (select ctid from only old limit 1000 for update skip locked) ) returning * ) insert into old select * from a; INSERT 0 0 postgres=# select count(*) from old; count ---------- 10000000 (1 row) postgres=# select count(*) from only old; count --------- 9997998 (1 row) postgres=# with a as ( delete from only old where ctid = any (array (select ctid from only old limit 100000 for update skip locked) ) returning * ) insert into old select * from a; INSERT 0 0 postgres=# select count(*) from only old; count --------- 9897998 (1 row) postgres=# select count(*) from old; count ---------- 10000000 (1 row) 一次迁移1万条,分批操作。 with a as ( delete from only old where ctid = any (array (select ctid from only old limit 10000 for update skip locked) ) returning * ) insert into old select * from a; 持续调用以上接口,直到当old表已经没有数据,完全迁移到了分区。 select count(*) from only old; count ------- 0 (1 row) 10、切换到分区表 创建分区表如下,分区方法与继承约束一致。 create table new (id int, info text, crt_time timestamp) partition by list (abs(mod(id,4))); 切换表名,防止雪崩,使用锁超时,由于只涉及表名变更,所以速度非常快。 begin; set lock_timeout ='3s'; alter table old_mid0 no inherit old; alter table old_mid1 no inherit old; alter table old_mid2 no inherit old; alter table old_mid3 no inherit old; alter table old rename to old_tmp; alter table new rename to old; alter table old ATTACH PARTITION old_mid0 for values in (0); alter table old ATTACH PARTITION old_mid1 for values in (1); alter table old ATTACH PARTITION old_mid2 for values in (2); alter table old ATTACH PARTITION old_mid3 for values in (3); end; 切换后的原生分区表如下 postgres=# \d+ old Table "public.old" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description ----------+-----------------------------+-----------+----------+---------+----------+--------------+------------- id | integer | | | | plain | | info | text | | | | extended | | crt_time | timestamp without time zone | | | | plain | | Partition key: LIST (abs(mod(id, 4))) Partitions: old_mid0 FOR VALUES IN (0), old_mid1 FOR VALUES IN (1), old_mid2 FOR VALUES IN (2), old_mid3 FOR VALUES IN (3) 查询测试 postgres=# explain select * from old where id=1; QUERY PLAN ------------------------------------------------------------------------------------- Append (cost=0.29..10.04 rows=4 width=44) -> Index Scan using old_mid0_pkey on old_mid0 (cost=0.29..2.51 rows=1 width=44) Index Cond: (id = 1) -> Index Scan using old_mid1_pkey on old_mid1 (cost=0.29..2.51 rows=1 width=45) Index Cond: (id = 1) -> Index Scan using old_mid2_pkey on old_mid2 (cost=0.29..2.51 rows=1 width=44) Index Cond: (id = 1) -> Index Scan using old_mid3_pkey on old_mid3 (cost=0.29..2.51 rows=1 width=45) Index Cond: (id = 1) (9 rows) postgres=# explain select * from old where id=? 
and abs(mod(id, 4)) = abs(mod(?, 4)); QUERY PLAN ------------------------------------------------------------------------------------- Append (cost=0.29..2.52 rows=1 width=45) -> Index Scan using old_mid1_pkey on old_mid1 (cost=0.29..2.51 rows=1 width=45) Index Cond: (id = 1) Filter: (mod(id, 4) = 1) (4 rows) 数据 postgres=# select count(*) from old; count ---------- 10455894 (1 row) 方法3,logical replication 使用逻辑复制的方法,同步到分区表。 简单步骤如下: snapshot 快照(lsn位点) 全量 增量(逻辑复制,从LSN位置开始解析WAL LOG) 切换表名 略 其他 hash函数 postgres=# \df *.*hash* List of functions Schema | Name | Result data type | Argument data types | Type ------------+--------------------------+------------------+---------------------------------------+------ pg_catalog | hash_aclitem | integer | aclitem | func pg_catalog | hash_aclitem_extended | bigint | aclitem, bigint | func pg_catalog | hash_array | integer | anyarray | func pg_catalog | hash_array_extended | bigint | anyarray, bigint | func pg_catalog | hash_numeric | integer | numeric | func pg_catalog | hash_numeric_extended | bigint | numeric, bigint | func pg_catalog | hash_range | integer | anyrange | func pg_catalog | hash_range_extended | bigint | anyrange, bigint | func pg_catalog | hashbpchar | integer | character | func pg_catalog | hashbpcharextended | bigint | character, bigint | func pg_catalog | hashchar | integer | "char" | func pg_catalog | hashcharextended | bigint | "char", bigint | func pg_catalog | hashenum | integer | anyenum | func pg_catalog | hashenumextended | bigint | anyenum, bigint | func pg_catalog | hashfloat4 | integer | real | func pg_catalog | hashfloat4extended | bigint | real, bigint | func pg_catalog | hashfloat8 | integer | double precision | func pg_catalog | hashfloat8extended | bigint | double precision, bigint | func pg_catalog | hashhandler | index_am_handler | internal | func pg_catalog | hashinet | integer | inet | func pg_catalog | hashinetextended | bigint | inet, bigint | func pg_catalog | hashint2 | integer | smallint | func pg_catalog | hashint2extended | bigint | smallint, bigint | func pg_catalog | hashint4 | integer | integer | func pg_catalog | hashint4extended | bigint | integer, bigint | func pg_catalog | hashint8 | integer | bigint | func pg_catalog | hashint8extended | bigint | bigint, bigint | func pg_catalog | hashmacaddr | integer | macaddr | func pg_catalog | hashmacaddr8 | integer | macaddr8 | func pg_catalog | hashmacaddr8extended | bigint | macaddr8, bigint | func pg_catalog | hashmacaddrextended | bigint | macaddr, bigint | func pg_catalog | hashname | integer | name | func pg_catalog | hashnameextended | bigint | name, bigint | func pg_catalog | hashoid | integer | oid | func pg_catalog | hashoidextended | bigint | oid, bigint | func pg_catalog | hashoidvector | integer | oidvector | func pg_catalog | hashoidvectorextended | bigint | oidvector, bigint | func pg_catalog | hashtext | integer | text | func pg_catalog | hashtextextended | bigint | text, bigint | func pg_catalog | hashvarlena | integer | internal | func pg_catalog | hashvarlenaextended | bigint | internal, bigint | func pg_catalog | interval_hash | integer | interval | func pg_catalog | interval_hash_extended | bigint | interval, bigint | func pg_catalog | jsonb_hash | integer | jsonb | func pg_catalog | jsonb_hash_extended | bigint | jsonb, bigint | func pg_catalog | pg_lsn_hash | integer | pg_lsn | func pg_catalog | pg_lsn_hash_extended | bigint | pg_lsn, bigint | func pg_catalog | satisfies_hash_partition | boolean | oid, integer, integer, VARIADIC "any" | func 
pg_catalog | time_hash | integer | time without time zone | func pg_catalog | time_hash_extended | bigint | time without time zone, bigint | func pg_catalog | timestamp_hash | integer | timestamp without time zone | func pg_catalog | timestamp_hash_extended | bigint | timestamp without time zone, bigint | func pg_catalog | timetz_hash | integer | time with time zone | func pg_catalog | timetz_hash_extended | bigint | time with time zone, bigint | func pg_catalog | uuid_hash | integer | uuid | func pg_catalog | uuid_hash_extended | bigint | uuid, bigint | func (56 rows) 小结 在线将表转换为分区表,可以使用的方法: 1、转换为pg_pathman分区,直接调用pg_pathman的UDF即可。 2、转换为原生分区,使用继承,异步迁移的方法。割接是短暂锁表。 不支持 insert ino on conflict 语法。 insert into old values (1,'test',now()) on conflict(id) do update set info=excluded.info, crt_time=excluded.crt_time; 3、逻辑复制的方法,将数据增量迁移到分区表(目标可以是原生分区方法或者是pg_pathman分区方法的新表)。 参考 《PostgreSQL 9.x, 10, 11 hash分区表 用法举例》 《PostgreSQL 触发器 用法详解 1》 《PostgreSQL 触发器 用法详解 2》 《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
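附:方法2中的分批迁移需要反复手工执行,PG 11 的存储过程支持事务控制,可以将其包装为自动分批提交的过程(一个最小示意,migrate_old 为假设的过程名,CALL 需在事务块之外执行):
create or replace procedure migrate_old(batch int default 10000)
language plpgsql
as $$
begin
  loop
    -- 锁定一批行,从父表删除,并经 before 触发器路由写入子分区
    with a as (
      delete from only old
      where ctid = any (array(
        select ctid from only old limit batch for update skip locked))
      returning *
    )
    insert into old select * from a;
    commit; -- 存储过程内分批提交,避免单个长事务
    perform 1 from only old limit 1; -- 父表已无数据则结束
    exit when not found;
  end loop;
end;
$$;
call migrate_old(10000);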
标签 PostgreSQL , 垃圾回收 , 索引扫描 , 内存 背景 夜谈PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem。 http://www.postgres.cn/v2/news/viewone/1/398 https://rhaas.blogspot.com/2019/01/how-much-maintenanceworkmem-do-i-need.html 9.4以前的版本,垃圾回收相关的内存参数maintenance_work_mem,9.4以及以后的版本为autovacuum_work_mem,如果没有设置autovacuum_work_mem,则使用maintenance_work_mem的设置。 这个参数设置的是内存大小有什么用呢? 这部分内存被用于记录垃圾tupleid,vacuum进程在进行表扫描时,当扫描到的垃圾记录ID占满了整个内存(autovacuum_work_mem或maintenance_work_mem),那么会停止扫描表,开始INDEX的扫描。 扫描INDEX时,清理索引中的哪些tuple,实际上是从刚才内存中记录的这些tupleid来进行匹配。 当所有索引都扫描并清理了一遍后,继续从刚才的位点开始扫描表。 过程如下: 1、palloc autovacuum_work_mem memory 2、scan table, 3、dead tuple's tupleid write to autovacuum_work_mem 4、when autovacuum_work_mem full (with dead tuples can vacuum) 5、record table scan offset. 6、scan indexs 7、vacuum index's dead tuple (these: index item's ctid in autovacuum_work_mem) 8、scan indexs end 9、continue scan table with prev's offset ... 显然,如果垃圾回收时autovacuum_work_mem太小,INDEX会被多次扫描,浪费资源,时间。 palloc autovacuum_work_mem memory 这部分内存是使用时分配,并不是直接全部使用掉maintenance_work_mem或autovacuum_work_mem设置的内存,PG代码中做了优化限制: 对于小表,可能申请少量内存,算法请参考如下代码(对于小表,申请的内存数会是保障可记录下整表的tupleid的内存数(当maintenance_work_mem或autovacuum_work_mem设置的内存大于这个值时))。 我已经在如下代码中进行了标注: /* * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can * fit on one heap page. (Note that indexes could have more, because they * use a smaller tuple header.) We arrive at the divisor because each tuple * must be maxaligned, and it must have an associated item pointer. * * Note: with HOT, there could theoretically be more line pointers (not actual * tuples) than this on a heap page. However we constrain the number of line * pointers to this anyway, to avoid excessive line-pointer bloat and not * require increases in the size of work arrays. */ #define MaxHeapTuplesPerPage \ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData)))) /* * Guesstimation of number of dead tuples per page. This is used to * provide an upper limit to memory allocated when vacuuming small * tables. */ #define LAZY_ALLOC_TUPLES MaxHeapTuplesPerPage /* * lazy_space_alloc - space allocation decisions for lazy vacuum * * See the comments at the head of this file for rationale. */ static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) { long maxtuples; int vac_work_mem = IsAutoVacuumWorkerProcess() && autovacuum_work_mem != -1 ? 
autovacuum_work_mem : maintenance_work_mem; if (vacrelstats->hasindex) { maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData); maxtuples = Min(maxtuples, INT_MAX); maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData)); /* curious coding here to ensure the multiplication can't overflow */ 这里保证了maintenance_work_mem或autovacuum_work_mem不会直接被使用光, 如果是小表,会palloc少量memory。 if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks) maxtuples = relblocks * LAZY_ALLOC_TUPLES; /* stay sane if small maintenance_work_mem */ maxtuples = Max(maxtuples, MaxHeapTuplesPerPage); } else { maxtuples = MaxHeapTuplesPerPage; } vacrelstats->num_dead_tuples = 0; vacrelstats->max_dead_tuples = (int) maxtuples; vacrelstats->dead_tuples = (ItemPointer) palloc(maxtuples * sizeof(ItemPointerData)); } maintenance_work_mem这个内存还有一个用途,创建索引时,maintenance_work_mem控制系统在构建索引时将使用的最大内存量。为了构建一个B树索引,必须对输入的数据进行排序,如果要排序的数据在maintenance_work_mem设定的内存中放置不下,它将会溢出到磁盘中。 例子 如何计算适合的内存大小 postgres=# show autovacuum_work_mem ; autovacuum_work_mem --------------------- 1GB (1 row) postgres=# show maintenance_work_mem ; maintenance_work_mem ---------------------- 1GB (1 row) 也就是说,最多有1GB的内存,用于记录一次vacuum时,一次性可存储的垃圾tuple的tupleid。 tupleid为6字节长度。 /* * ItemPointer: * * This is a pointer to an item within a disk page of a known file * (for example, a cross-link from an index to its parent table). * blkid tells us which block, posid tells us which entry in the linp * (ItemIdData) array we want. * * Note: because there is an item pointer in each tuple header and index * tuple header on disk, it's very important not to waste space with * structure padding bytes. The struct is designed to be six bytes long * (it contains three int16 fields) but a few compilers will pad it to * eight bytes unless coerced. We apply appropriate persuasion where * possible. If your compiler can't be made to play along, you'll waste * lots of space. */ typedef struct ItemPointerData { BlockIdData ip_blkid; OffsetNumber ip_posid; } 1G可存储1.7亿条dead tuple的tupleid。 postgres=# select 1024*1024*1024/6; ?column? ----------- 178956970 (1 row) 而自动垃圾回收是在什么条件下触发的呢? src/backend/postmaster/autovacuum.c * A table needs to be vacuumed if the number of dead tuples exceeds a * threshold. This threshold is calculated as * * threshold = vac_base_thresh + vac_scale_factor * reltuples vac_base_thresh: autovacuum_vacuum_threshold vac_scale_factor: autovacuum_vacuum_scale_factor postgres=# show autovacuum_vacuum_threshold ; autovacuum_vacuum_threshold ----------------------------- 50 (1 row) postgres=# show autovacuum_vacuum_scale_factor ; autovacuum_vacuum_scale_factor -------------------------------- 0.2 (1 row) 以上设置,表示当垃圾记录数达到50+表大小乘以0.2时,会触发垃圾回收。 可以看成,垃圾记录约等于表大小的20%,触发垃圾回收。 那么1G能存下多大表的垃圾呢?约8.9亿条记录的表。 postgres=# select 1024*1024*1024/6/0.2; ?column? 
-------------------- 894784850 (1 row) 压力测试例子 postgres=# show log_autovacuum_min_duration ; log_autovacuum_min_duration ----------------------------- 0 (1 row) create table test(id int primary key, c1 int, c2 int, c3 int); create index idx_test_1 on test (c1); create index idx_test_2 on test (c2); create index idx_test_3 on test (c3); vi test.sql \set id random(1,10000000) insert into test values (:id,random()*100, random()*100,random()*100) on conflict (id) do update set c1=excluded.c1, c2=excluded.c2,c3=excluded.c3; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 1200 垃圾回收记录 2019-02-26 22:51:50.323 CST,,,35632,,5c755284.8b30,1,,2019-02-26 22:51:48 CST,36/22,0,LOG,00000,"automatic vacuum of table ""postgres.public.test"": index scans: 1 pages: 0 removed, 6312 remain, 2 skipped due to pins, 0 skipped frozen tuples: 4631 removed, 1158251 remain, 1523 are dead but not yet removable, oldest xmin: 1262982800 buffer usage: 39523 hits, 1 misses, 1 dirtied avg read rate: 0.004 MB/s, avg write rate: 0.004 MB/s system usage: CPU: user: 1.66 s, system: 0.10 s, elapsed: 1.86 s",,,,,,,,"lazy_vacuum_rel, vacuumlazy.c:407","" 2019-02-26 22:51:50.566 CST,,,35632,,5c755284.8b30,2,,2019-02-26 22:51:48 CST,36/23,1263417553,LOG,00000,"automatic analyze of table ""postgres.public.test"" system usage: CPU: user: 0.16 s, system: 0.04 s, elapsed: 0.24 s",,,,,,,,"do_analyze_rel, analyze.c:722","" index scans:1 表示垃圾回收的表有索引,并且索引只扫描了一次。 说明autovacuum_work_mem足够大,没有出现vacuum时装不下垃圾dead tuple tupleid的情况。 小结 建议: 1、log_autovacuum_min_duration=0,表示记录所有autovacuum的统计信息。 2、autovacuum_vacuum_scale_factor=0.01,表示1%的垃圾时,触发自动垃圾回收。 3、autovacuum_work_mem,视情况定,确保不出现垃圾回收时多次INDEX SCAN. 4、如果发现垃圾回收统计信息中出现了index scans: 超过1的情况,说明: 4.1、需要增加autovacuum_work_mem,增加多少呢?增加到当前autovacuum_work_mem乘以index scans即可。 4.2、或者调低autovacuum_vacuum_scale_factor到当前值除以index scans即可,让autovacuum尽可能早的进行垃圾回收。 参考 http://www.postgres.cn/v2/news/viewone/1/398 https://rhaas.blogspot.com/2019/01/how-much-maintenanceworkmem-do-i-need.html 《PostgreSQL 11 参数模板 - 珍藏级》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
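附:所需内存可以按 表行数 * autovacuum_vacuum_scale_factor * 6字节 粗略估算。例如 10 亿行的表、scale_factor=0.01 时:
select pg_size_pretty((1000000000 * 0.01 * 6)::bigint);
-- 约 57 MB
即 autovacuum_work_mem 设置到 60MB 左右,就可以保证该表一次垃圾回收中索引只被扫描一遍。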
标签 PostgreSQL , 数据离散性 , 扫描性能 , 重复扫 , bitmap index scan , 排序扫描 , 扫描方法 , 顺序 背景 一个这样的问题: 为什么select x from tbl offset x limit x; 两次查询连续的OFFSET,会有重复数据呢? select ctid,* from tbl where ... offset 0 limit 10; select ctid,* from tbl where ... offset 10 limit 10; 为什么多数时候offset会推荐用order by? 不使用ORDER BY的话,返回顺序到底和什么有关? 答案是: 数据库的扫描方法。 数据库扫描方法,具体的原理可以到如下文档中找到PDF,PDF内有详细的扫描方法图文介绍。 《阿里云 PostgreSQL 产品生态;案例、开发管理实践、原理、学习资料、视频;PG天天象上沙龙记录 - 珍藏级》 扫描方法 1、全表扫描, seqscan 从第一个数据块开始扫描,返回复合条件的记录。 2、并发全表扫描, concurrently seqscan 如果有多个会话,对同一张表进行全表扫描时,后发起的会话会与前面正在扫描的会话进行BLOCK对齐步调,也就是说,后面发起的会话,可能是从表的中间开始扫的,扫描到末尾再转回去,避免多会话同时对一个表全表扫描时的IO浪费。 例如会话1已经扫到了第99个数据块,会话2刚发起这个表的全表扫描,则会从第99个数据块开始扫描,扫完在到第一个数据块扫,一直扫到第98个数据块。 3、索引扫描, index scan 按索引顺序扫描,并回表。 4、索引ONLY扫描, index only scan 按索引顺序扫描,根据VM文件的BIT位判断是否需要回表扫描。 5、位图扫描, bitmap scan 按索引取得的BLOCKID排序,然后根据BLOCKID顺序回表扫描,然后再根据条件过滤掉不符合条件的记录。 这种扫描方法,主要解决了离散数据(索引字段的逻辑顺序与记录的实际存储顺序非常离散的情况),需要大量离散回表扫描的情况。 6、并行扫描, parallel xx scan 并行的全表、索引、索引ONLY、位图扫。首先会FORK出若干个WORKER,每个WORKER负责一部分数据块,一起扫描,WORKER的结果(FILTER后的)发给下一个GATER WORKER节点。 7、hash join 哈希JOIN, 8、nest loop join 嵌套循环 9、merge join 合并JOIN(排序JOIN)。 更多扫描方法,请参考PG代码。 扫描方法决定了数据返回顺序 根据上面的这些扫描方法,我们可以知道一条QUERY下去,数据的返回顺序是怎么样的。 select * from tbl where xxx offset 10 limit 100; 1、如果是全表扫描,那么返回顺序就是数据的物理存放顺序,然后偏移10条有效记录,取下100条有效记录。 2、如果是索引扫描,则是依据索引的顺序进行扫描,然后偏移10条有效记录,取下100条有效记录。 不再赘述。 保证绝对的连续 如何保证第一次请求,第二次请求,第三次请求,。。。每一次偏移(offset)固定值,返回的结果是完全有序,无空洞的。 1、使用rr隔离级别(repeatable read),并且按PK(唯一值字段、字段组合)排序,OFFSET 使用rr级别,保证一个事务中的每次发起的SQL读请求是绝对视角一致的。 使用唯一字段或字段组合排序,可以保证每次的结果排序是绝对一致的。加速每次偏移的数据一样,所以可以保证数据返回是绝对连续的。 select * from tbl where xx order by a,b offset x limit xx; 2、使用游标 使用游标,可以保证视角一致,数据绝对一致。 postgres=# \h declare Command: DECLARE Description: define a cursor Syntax: DECLARE name [ BINARY ] [ INSENSITIVE ] [ [ NO ] SCROLL ] CURSOR [ { WITH | WITHOUT } HOLD ] FOR query begin; declare a cursor for select * from tbl where xx; fetch x from a; ... 每一次请求,游标向前移动 end; 参考 《PostgreSQL 数据离散性 与 索引扫描性能(btree & bitmap index scan)》 《PostgreSQL 11 preview - 分页内核层优化 - 索引扫描offset优化(使用vm文件skip heap scan)》 《PostgreSQL 范围过滤 + 其他字段排序OFFSET LIMIT(多字段区间过滤)的优化与加速》 《PostgreSQL Oracle 兼容性之 - TZ_OFFSET》 《PostgreSQL 索引扫描offset内核优化 - case》 《PostgreSQL 数据访问 offset 的质变 case》 《论count与offset使用不当的罪名 和 分页的优化》 《PostgreSQL offset 原理,及使用注意事项》 《妙用explain Plan Rows快速估算行 - 分页数估算》 《分页优化 - order by limit x offset y performance tuning》 《分页优化, add max_tag column speedup Query in max match enviroment》 《PostgreSQL's Cursor USAGE with SQL MODE - 分页优化》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
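附:两种保证分页绝对连续的最小示例(tbl、a、b 沿用上文的假设名,(a,b) 需为唯一字段组合):
-- 1、RR 隔离级别 + 唯一字段排序
begin isolation level repeatable read;
select * from tbl order by a, b offset 0 limit 10;
select * from tbl order by a, b offset 10 limit 10; -- 与上一页绝对连续
commit;
-- 2、游标
begin;
declare c cursor for select * from tbl order by a, b;
fetch 10 from c;
fetch 10 from c; -- 继续向前,无重复、无空洞
close c;
end;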
标签 PostgreSQL , recovery , recovery.conf , restore_command , timeline , 时间线 , next wal , PITR , 时间点恢复 背景 PostgreSQL数据库支持PITR时间点恢复。默认情况下,只需要配置目标是时间点,resotre_command即可,PG会自动调用resotre_command去找需要的WAL文件。 一个典型的recovery.conf配置如下: #--------------------------------------------------------------------------- # ARCHIVE RECOVERY PARAMETERS #--------------------------------------------------------------------------- # # restore_command # # specifies the shell command that is executed to copy log files # back from archival storage. The command string may contain %f, # which is replaced by the name of the desired log file, and %p, # which is replaced by the absolute path to copy the log file to. # # This parameter is *required* for an archive recovery, but optional # for streaming replication. # # It is important that the command return nonzero exit status on failure. # The command *will* be asked for log files that are not present in the # archive; it must return nonzero when so asked. # # NOTE that the basename of %p will be different from %f; do not # expect them to be interchangeable. # restore_command = 'cp /data01/digoal/wal/%f %p' #--------------------------------------------------------------------------- # RECOVERY TARGET PARAMETERS #--------------------------------------------------------------------------- # # By default, recovery will rollforward to the end of the WAL log. # If you want to stop rollforward at a specific point, you # must set a recovery target. # # You may set a recovery target either by transactionId, by name, by # timestamp or by WAL location (LSN). Recovery may either include or # exclude the transaction(s) with the recovery target value (i.e., # stop either just after or just before the given target, # respectively). # # #recovery_target_name = '' # e.g. 'daily backup 2011-01-26' # recovery_target_time = '2019-03-05 20:52:16.294366+08' # e.g. '2004-07-14 22:39:00 EST' # #recovery_target_xid = '' # #recovery_target_lsn = '' # e.g. '0/70006B8' # #recovery_target_inclusive = true recovery_target_timeline = 'latest' # If recovery_target_action = 'pause', recovery will pause when the # recovery target is reached. The pause state will continue until # pg_wal_replay_resume() is called. This setting has no effect if # no recovery target is set. If hot_standby is not enabled then the # server will shutdown instead, though you may request this in # any case by specifying 'shutdown'. # #recovery_target_action = 'pause' #--------------------------------------------------------------------------- # STANDBY SERVER PARAMETERS #--------------------------------------------------------------------------- # # standby_mode # # When standby_mode is enabled, the PostgreSQL server will work as a # standby. It will continuously wait for the additional XLOG records, using # restore_command and/or primary_conninfo. # standby_mode = on 恢复目标支持: 1、时间 2、自定义还原点名字 3、事务ID 4、WAL LSN 《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 2》 《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 1》 《PostgreSQL recovery target introduce》 接下来的问题,如果无法直接通过restore_command获取文件,又当如何呢? 
restore_command = 'cp /data01/digoal/wal/%f %p' 方法1,通过restore_command吐出需要的文件名 recovery.conf restore_command = 'cp /data01/digoal/wal/%f %p || echo "`date +%F%T` %f" >> /tmp/needwalfile;' 当找不到WAL文件时,就会吐到/tmp/needwalfile cat /tmp/needwalfile 2019-03-0522:11:28 000000010000005D000000B2 2019-03-0522:11:28 00000002.history 2019-03-0522:11:33 000000010000005D000000B2 2019-03-0522:11:33 00000002.history 2019-03-0522:11:38 000000010000005D000000B2 2019-03-0522:11:38 00000002.history 将文件拷贝到restore_command配置的/data01/digoal/wal目录,restore_command命令将继续。 优先拷贝history文件(走到新的时间线),原理参考末尾引用文档。 通过pg_is_wal_replay_paused()函数得到当前实例是否已经到达目标还原点。如果返回T,表示已到达,则不再需要给PG新的文件。 postgres=# select pg_is_wal_replay_paused(); pg_is_wal_replay_paused ------------------------- f (1 row) 方法2,通过log文件得到需要的WAL文件名 配置PG的LOG文件,一样能得到上面的内容。 postgresql.conf # - Where to Log - log_destination = 'csvlog' # Valid values are combinations of # stderr, csvlog, syslog, and eventlog, # depending on platform. csvlog # requires logging_collector to be on. # This is used when logging to stderr: logging_collector = on # Enable capturing of stderr and csvlog # into log files. Required to be on for # csvlogs. # (change requires restart) # These are only used if logging_collector is on: log_directory = 'log' # directory where log files are written, # can be absolute or relative to PGDATA #log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern, log_filename='pg.log' # can include strftime() escapes #log_file_mode = 0600 # creation mode for log files, # begin with 0 to use octal notation log_truncate_on_rotation = on # If on, an existing log file with the # same name as the new log file will be # truncated rather than appended to. # But such truncation only occurs on # time-driven rotation, not on restarts # or size-driven rotation. Default is # off, meaning append to existing files # in all cases. log_rotation_age = 1d # Automatic rotation of logfiles will # happen after that time. 0 disables. ##log_rotation_size = 10MB # Automatic rotation of logfiles will # happen after that much log output. # 0 disables. 当找不到WAL文件时,就会吐到$PGDATA/log/pg.log digoal@pg11-test-> cat pg.log 2019-03-05 22:14:00.167 CST [38155] LOG: 00000: ending log output to stderr 2019-03-05 22:14:00.167 CST [38155] HINT: Future log output will go to log destination "csvlog". 
2019-03-05 22:14:00.167 CST [38155] LOCATION: PostmasterMain, postmaster.c:1298 cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory cp: cannot stat ‘/data01/digoal/wal/000000010000005D000000B2’: No such file or directory cp: cannot stat ‘/data01/digoal/wal/00000002.history’: No such file or directory 将文件拷贝到restore_command配置的/data01/digoal/wal目录,restore_command命令将继续。 优先拷贝history文件(走到新的时间线),原理参考末尾引用文档。 通过pg_is_wal_replay_paused()函数得到当前实例是否已经到达目标还原点。如果返回T,表示已到达,则不再需要给PG新的文件。 postgres=# select pg_is_wal_replay_paused(); pg_is_wal_replay_paused ------------------------- f (1 row) 方法3,修改内核,通过UDF支持 例如,直接从UDF中获取当前startup进程需要的WAL文件名和TL history文件名。 src/backend/access/transam/xlog.c UDF支持的弊端:当数据库还没有进入一致状态时,并不能连接到数据库执行查询,另外如果没有开启hot_standby模式,也不能连到恢复中的从库进行查询。使用场景受限。 开启数据库的hot_standby模式,确保可以在恢复过程中,连接到数据库进行UDF查询。 # These settings are ignored on a master server. hot_standby = on # "off" disallows queries during recovery # (change requires restart) #max_standby_archive_delay = 30s # max delay before canceling queries # when reading WAL from archive; # -1 allows indefinite delay #max_standby_streaming_delay = 30s # max delay before canceling queries # when reading streaming WAL; # -1 allows indefinite delay wal_receiver_status_interval = 1s # send replies at least this often # 0 disables #hot_standby_feedback = off # send info from standby to prevent # query conflicts #wal_receiver_timeout = 60s # time that receiver waits for # communication from master # in milliseconds; 0 disables #wal_retrieve_retry_interval = 5s # time to wait before retrying to # retrieve WAL after a failed attempt 其他知识点 时间点恢复,如何取下一个文件。 将数据库恢复配置为hot_standby模式,允许在数据库恢复过程中,连接到数据库。获取需要的信息。 1、当返回database is in startup mode,表示无法连接数据库时,说明还需要日志文件,数据库才能到一致性点允许连接,此时,除了从前面说的LOG文件中获得需要的文件,实际上进程也会突出对应的内容。 例如 digoal 25596 25594 0 19:54 ? 00:00:00 postgres: startup recovering 000000010000005D000000A7 digoal 20 0 16.684g 1860 1276 S 0.0 0.0 0:02.59 postgres: startup waiting for 000000010000005D000000B2 日志的内容如下 cp: cannot stat ‘/data01/digoal/waltest/000000010000005D000000A7’: No such file or directory cp: cannot stat ‘/data01/digoal/waltest/00000002.history’: No such file or directory 当我们将需要的归档拷贝到对应目录后, digoal@pg11-test-> cp wal/000000010000005D000000A7 waltest/ 当我们将需要的归档拷贝到对应目录后,需要的WAL文件向前推移,日志的内容如下 cp: cannot stat ‘/data01/digoal/waltest/000000010000005D000000A8’: No such file or directory cp: cannot stat ‘/data01/digoal/waltest/00000002.history’: No such file or directory 2、当可以连接恢复中的数据库后,可以通过一些系统函数,查看到数据库的一些信息 2.1、查看当前数据库正在replay 的wal LSN postgres=# select pg_last_wal_replay_lsn(); pg_last_wal_replay_lsn ------------------------ 5D/A7FFFFE0 (1 row) 2.2、查看当前数据库的恢复是否pause,(如果是自动pause的,说明已经到达设置的还原点) postgres=# select pg_is_wal_replay_paused(); pg_is_wal_replay_paused ------------------------- f (1 row) 2.3、查看lsn对应的wal文件,不允许在standby实例中执行,如果能执行的话,可以直接从当前数据库正在replay 的wal LSN得到WAL文件名。 postgres=# select * from pg_walfile_name('5D/A7FFFFE0'); ERROR: 55000: recovery is in progress HINT: pg_walfile_name() cannot be executed during recovery. 
LOCATION: pg_walfile_name, xlogfuncs.c:521 2.4、当前数据库的时间线 (history) postgres=# select * from pg_control_checkpoint(); -[ RECORD 1 ]--------+------------------------- checkpoint_lsn | 5D/A7000028 redo_lsn | 5D/A7000028 redo_wal_file | 000000010000005D000000A7 timeline_id | 1 prev_timeline_id | 1 full_page_writes | t next_xid | 0:1286297007 next_oid | 1912406 next_multixact_id | 1 next_multi_offset | 0 oldest_xid | 101420357 oldest_xid_dbid | 13285 oldest_active_xid | 0 oldest_multi_xid | 1 oldest_multi_dbid | 1910618 oldest_commit_ts_xid | 0 newest_commit_ts_xid | 0 checkpoint_time | 2019-03-05 19:44:51+08 2.5、控制文件内容 postgres=# select * from pg_control_system(); -[ RECORD 1 ]------------+----------------------- pg_control_version | 1100 catalog_version_no | 201809051 system_identifier | 6636510237226062864 pg_control_last_modified | 2019-03-05 19:54:56+08 2.6、当前实例如果重启,需要的最早的REDO。 postgres=# select * from pg_control_recovery(); -[ RECORD 1 ]-----------------+------------ min_recovery_end_lsn | 5D/A7FFFFE0 min_recovery_end_timeline | 1 backup_start_lsn | 0/0 backup_end_lsn | 0/0 end_of_backup_record_required | f 2.7、从当前wal目录中,获取到最大的WAL文件名,通常会是当前需要的WAL或者上一个已经REPLAY完的WAL文件。 postgres=# select * from pg_ls_waldir() order by 1 desc limit 1; name | size | modification --------------------------+----------+------------------------ 000000010000005D000000A7 | 16777216 | 2019-03-05 19:48:53+08 (1 row) 时间点恢复,手工拷贝wal文件的流程 - 通常不需要手工拷贝,只要指定restore_command让数据库自己来即可 配置recovery.conf 1、配置恢复目标 2、配置restore_command命令,打印下一个需要的WAL文件以及HISTORY文件,输出到某个文件中。参考方法1。 3、配置pause 4、配置打开hot_standby 5、从restore_command命令输出的文件中,得到下一个需要的WAL文件以及HISTORY文件。 优先拷贝history文件,防止走错时间线。 如果history文件确实存在并拷贝成功,下一个拷贝的文件是.partial文件,千万不要搞错。 6、通过pg_is_wal_replay_paused判断是否停止 postgres=# select pg_is_wal_replay_paused(); pg_is_wal_replay_paused ------------------------- f (1 row) 如果返回T,表示已经到达还原点,不需要再拷贝文件。 参考 《PostgreSQL 时间点恢复(PITR)在异步流复制主从模式下,如何避免主备切换后PITR恢复(备库、容灾节点、只读节点)走错时间线(timeline , history , partial , restore_command , recovery.conf)》 《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 2》 《PostgreSQL PITR THREE recovery target MODE: name,xid,time USE CASE - 1》 《PostgreSQL recovery target introduce》 免费领取阿里云RDS PostgreSQL实例、ECS虚拟机
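附:恢复期间(hot_standby=on 时可连接),可以用一条 SQL 同时观察是否已到达还原点、当前回放位点与回放延迟(示例):
select pg_is_wal_replay_paused() as paused,
       pg_last_wal_replay_lsn() as replay_lsn,
       now() - pg_last_xact_replay_timestamp() as replay_delay;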
Tags: PostgreSQL, PPAS, EPAS, edb, enterprisedb, Oracle, compatibility, pros and cons

Background

EPAS is EDB's Oracle-compatible enterprise edition of PostgreSQL, built on the community version. The first Oracle-compatible release shipped in 2004, so EDB has been working on Oracle compatibility for some 15 years. EPAS 11, released in 2018, completed Oracle 11g and 12c certification.

In 2016 Alibaba Cloud partnered with EDB to launch Alibaba Cloud RDS PPAS, an Oracle-compatible service. In 2018 the cooperation deepened to code level, and POLARDB O, a cloud-native Oracle-compatible database with separated compute and storage, is about to be released.

PG vs PPAS compatibility comparison

| Feature | PPAS | Community PG |
|---|---|---|
| Oracle PL/SQL | Supported | Not supported |
| Oracle PL/SQL autonomous transactions | Supported from version 11 on | Supported in PG 11 |
| Oracle built-in packages | 26 packages, 440 package functions (df dbms) | 13 packages via orafce (weak in practice, because packages are typically used together with PL/SQL) |
| User-defined Oracle packages | Supported | Not supported |
| User-defined Oracle objects | Supported | Not supported |
| Oracle system views | Supported | Not supported |
| Oracle built-in functions | Largely supported | A few, via the orafce extension |
| Oracle-compatible types | Supported | A few, via the orafce extension |
| Oracle PL/SQL nested tables | Supported | Not supported |
| Oracle PL/SQL bulk collect bind | Supported | Not supported |
| Oracle SQL syntax | Mostly supported | Partially compatible; the rest needs rewriting |
| Oracle partitioned-table syntax | Supported | Not supported |
| Oracle VPD (RLS) | Supported | Not supported; syntax must be changed |
| SQL firewall | Supported | Not supported |
| Index advisor | Supported | Not supported |
| Resource management (isolation) | Supported | Not supported |
| Client drivers | oci, proc, spl, jdbc, .net, odbc compatible | oci, proc not compatible |
| Oracle SQL*Loader | Supported | Not supported; use pg_bulkload or COPY instead |
| Stored-procedure encryption | Supported | Not supported |
| Oracle rowid | Supported | rowid syntax not supported (use ctid or oid instead) |
| Oracle migration assessment | Supported | Weak support (ora2pg) |
| Oracle DDL conversion and full sync to PPAS/PG | ADAM | ADAM |
| Oracle incremental sync to PPAS/PG | ADAM | ADAM |
| Automatic conversion of incompatible Oracle SQL/DDL | ADAM | ADAM |
| Compatibility assessment, migration-effort estimation, automatic database splitting, risk disclosure, PPAS advantage highlighting, incompatible DDL/SQL conversion, schema and data migration, consistency verification, optimization, workload replay, one-click migration | ADAM | ADAM |

For detailed compatibility, see the PDFs:

Oracle vs EDB EPAS technical comparison white paper
EDB EPAS vs Oracle commercial comparison white paper
EDB EPAS compatibility manual - built-in packages
EDB EPAS compatibility manual - SQL reference
EDB EPAS compatibility manual - developer guide

Gaps PPAS still needs to close (taken from the Oracle vs EDB EPAS technical comparison white paper):

1. Global temporary tables: 《PostgreSQL Oracle 兼容性之 - 全局临时表 global temp table》
2. Partitioning: INTERVAL PARTITIONING
3. Partitioning: PARTITIONED INDEXES
4. Bitmap indexes; GIN indexes are currently used instead
5. Flashback query
6. Flashback table, database and transaction query
7. RAC
8. In-memory database
9. Data masking
10. Database vault
11. XML DB
12. Advanced compression
13. TRANSPORTABLE CROSS-PLATFORM TABLE SPACES
14. ONLINE REORGANIZATION; achievable with different syntax (readme)
15. MERGE syntax; use UPSERT instead (see the sketch at the end of this section)

Oracle vs PPAS comparison: PPAS advantage features (43 advantages listed)

| Feature | Oracle | PPAS |
|---|---|---|
| Multi-model: spatio-temporal | Supported | Supported (ganos, postgis, pgrouting, pgpointcloud) |
| Multi-model: image processing | Not supported | Supported (imgsmlr) |
| Multi-model: JSON | Supported | Supported (with index acceleration) |
| Multi-model: full-text search | Not supported | Supported (tokenization, indexing, custom tokenizers, ranking, etc., index-accelerated, real-time build) |
| Multi-model: text similarity | Not supported | Supported (index-accelerated) |
| Multi-model: vector similarity computation | Not supported | Supported (cube extension) |
| Multi-model: graph data processing | Supported | Supported |
| Multi-model: multidimensional | Not supported | Supported (cube extension) |
| Multi-model: routing | Not supported | Supported (pgrouting extension) |
| Multi-model: stream computing | Not supported | Supported (pipelinedb extension) |
| Performance: JIT | Not supported | Supported |
| Performance: vectorized computation | Not supported | Supported |
| Performance: GPU acceleration | Not supported | Supported (ganos, pg_strom extensions) |
| Index: partitioned indexes | Supported | Indirectly supported (partial index) |
| Index: global indexes on partitioned tables | Supported | Not supported |
| Index: btree | Supported | Supported |
| Index: hash | Supported | Supported |
| Index: gin | Not supported | Supported (inverted index) |
| Index: gist | Supported | Supported |
| Index: spgist | Not supported | Supported |
| Index: brin | Supported (Oracle appliance only) | Supported |
| Index: bloom | Not supported | Supported |
| Index: rum | Not supported | Supported |
| Index: zombodb | Not supported | Supported |
| Index: expression indexes | Not supported | Supported |
| Index: bitmap | Supported | Not supported (use gin instead) |
| Index: partial indexes | Not supported | Supported |
| Advanced: machine learning | Not supported | Supported (madlib) |
| Advanced: sharding | Supported | Supported (citus) |
| Advanced: transactional DDL | Not supported | Supported |
| Advanced: heterogeneous foreign tables | Incompletely supported | Supports almost any external data source (via FDW) |
| Built-in language: plpgsql | Not supported | Supported |
| Built-in language: plpython | Not supported | Supported |
| Built-in language: plperl | Not supported | Supported |
| Built-in language: pllua | Not supported | Supported |
| Built-in language: pljava | Not supported | Supported |
| Built-in language: pltcl | Not supported | Supported |
| Advanced type: arrays | Not supported | Supported |
| Advanced type: range | Not supported | Supported |
| Advanced type: xml | Not supported | Supported |
| Advanced type: network | Not supported | Supported |
| Advanced type: large objects | Supported | Supported |
| Advanced type: byte stream | Supported | Supported |
| Advanced type: bit string | Not supported | Supported |
| Advanced type: image | Not supported | Supported |
| Advanced type: vector | Not supported | Supported |
| Replication: physical streaming | Supported | Supported |
| Replication: logical streaming | Supported | Supported |
| Replication: arbitrary number of replicas | Not supported | Supported (quorum based replication) |
| Replication: built-in subscription | Not supported | Supported |
| Optimizer: dynamic optimization | Supported | Supported (via the pg_aqo extension) |
| Optimizer: genetic join algorithm | Not supported | Supported |
| Optimizer: hash join | Supported | Supported |
| Optimizer: merge join | Supported | Supported |
| Optimizer: nestloop join | Supported | Supported |
| Optimizer: cursors | Supported | Supported |
| Parallel: scan | Supported | Supported |
| Parallel: index scan | Supported | Supported |
| Parallel: index only scan | Supported | Supported |
| Parallel: bitmap scan | ? | Supported |
| Parallel: filter | Supported | Supported |
| Parallel: sort | Supported | Supported |
| Parallel: agg | Supported | Supported |
| Parallel: write (create table, select into, create index) | ? | Supported |
| Parallel: join | Supported | Supported |
| Security: stored-procedure encryption | Supported | Supported |
| Security: SQL firewall | Supported | Supported |
| Security: VPD | Supported | Supported |
| Security: audit | Supported | Supported |
| Security: database ACL | Supported | Supported |
| Security: authentication methods | Few | Many (md5, peer, ident, trust, reject, password, ldap, ad, gssapi, radius, pam, bsd, sspi) |
| Extensibility: procedural-language extensions | Not supported | Supported |
| Extensibility: FDW | Not supported | Supported |
| Extensibility: sampling | Not supported | Supported |
| Extensibility: custom scan | Not supported | Supported |
| Extensibility: custom REDO | Not supported | Supported |
| Extensibility: custom index access methods | Not supported | Supported |
| Extensibility: custom types, operators, UDFs | Supported | Supported |
| Derived products (derived db) | None | Too many to count (https://wiki.postgresql.org/wiki/PostgreSQL_derived_databases) |

For detailed compatibility, see the PDFs:

1. Oracle vs EDB EPAS technical comparison white paper
2. EDB EPAS vs Oracle commercial comparison white paper
3. EDB EPAS compatibility manual - built-in packages
4. EDB EPAS compatibility manual - SQL reference
5. EDB EPAS compatibility manual - developer guide
6. Oracle compatibility assessment, migration-effort estimation, automatic database splitting, risk disclosure, PPAS advantage highlighting, incompatible DDL/SQL conversion, schema and data migration, consistency verification, optimization, workload replay, one-click migration

Summary

1. Coverage of SQL syntax (deep compatibility, e.g. CONNECT BY and partitioned tables), data types, functions, packages (up to 26 packages, 440 functions), index types, operators, styles, user-defined PL/SQL stored procedures, functions and packages, client drivers (OCI), and client programming (Pro*C).
2. Compatibility with 4320 Oracle-specific objects (covering types, packages, functions, stored procedures, views, synonyms, system tables, sequences, dynamic views, etc.).
3. Compatibility with dozens of advanced Oracle features (including VPD, partitioned tables, materialized views, synonyms, DBLINK, advanced queues, JOB, PROFILE, AWR, PDB, policies, SQL firewall, the OCI driver, Pro*C, etc.).

With its Oracle compatibility and advanced features, Alibaba Cloud PPAS (EDB EPAS) is the first choice for enterprises migrating smoothly off Oracle.
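The MERGE gap (item 15 in the to-close list above) is typically bridged with PostgreSQL's native UPSERT. A minimal sketch (the table and columns are assumptions for illustration):

-- Oracle MERGE-style "update if matched, insert if not", expressed as UPSERT
create table target (id int primary key, val text);

insert into target (id, val)
values (1, 'new value')
on conflict (id)                 -- requires a unique constraint or index on id
do update set val = excluded.val;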
Tags: PostgreSQL, pgcenter, pg_top, awr, perf insight, wait events, perf, profile, sampling, statistics

Background

There are many ways to diagnose PostgreSQL performance, for example:

1. Function profiling: 《PostgreSQL 函数调试、诊断、优化 & auto_explain & plprofiler》
2. Kernel-level code diagnosis (1): 《PostgreSQL 代码性能诊断之 - OProfile & Systemtap》
3. Wait-event-level performance monitoring: 《PostgreSQL Oracle 兼容性之 - performance insight - AWS performance insight 理念与实现解读 - 珍藏级》
4. Kernel-level code diagnosis (2): 《PostgreSQL 源码性能诊断(perf profiling)指南 - 珍藏级》
5. Kernel-level code diagnosis (3): 《PostgreSQL Systemtap example : autovacuum_naptime & databases in cluster》
6. Beyond those, the PG community offers many small tools for performance monitoring and reporting: 《PostgreSQL pg_top pgcenter - 实时top类工具》, 《PostgreSQL pgmetrics - 多版本、健康监控指标采集、报告》
7. AWR reports: 《PostgreSQL AWR报告(for 阿里云ApsaraDB PgSQL)》, 《如何生成和阅读EnterpriseDB (PPAS(Oracle 兼容版)) AWR诊断报告》
8. The wait-event statistics view: 《PostgreSQL 等待事件 及 等待采样统计(pg_wait_sampling)》
9. A large set of real-time statistics views: 《PostgreSQL pg_stat_ pg_statio_ 统计信息(scan,read,fetch,hit)源码解读》

postgres=# \dv pg_stat*
                    List of relations
   Schema   |            Name             | Type |  Owner
------------+-----------------------------+------+----------
 pg_catalog | pg_stat_activity            | view | postgres
 pg_catalog | pg_stat_all_indexes         | view | postgres
 pg_catalog | pg_stat_all_tables          | view | postgres
 pg_catalog | pg_stat_archiver            | view | postgres
 pg_catalog | pg_stat_bgwriter            | view | postgres
 pg_catalog | pg_stat_database            | view | postgres
 pg_catalog | pg_stat_database_conflicts  | view | postgres
 pg_catalog | pg_stat_progress_vacuum     | view | postgres
 pg_catalog | pg_stat_replication         | view | postgres
 pg_catalog | pg_stat_ssl                 | view | postgres
 pg_catalog | pg_stat_subscription        | view | postgres
 pg_catalog | pg_stat_sys_indexes         | view | postgres
 pg_catalog | pg_stat_sys_tables          | view | postgres
 pg_catalog | pg_stat_user_functions      | view | postgres
 pg_catalog | pg_stat_user_indexes        | view | postgres
 pg_catalog | pg_stat_user_tables         | view | postgres
 pg_catalog | pg_stat_wal_receiver        | view | postgres
 pg_catalog | pg_stat_xact_all_tables     | view | postgres
 pg_catalog | pg_stat_xact_sys_tables     | view | postgres
 pg_catalog | pg_stat_xact_user_functions | view | postgres
 pg_catalog | pg_stat_xact_user_tables    | view | postgres
 pg_catalog | pg_statio_all_indexes       | view | postgres
 pg_catalog | pg_statio_all_sequences     | view | postgres
 pg_catalog | pg_statio_all_tables        | view | postgres
 pg_catalog | pg_statio_sys_indexes       | view | postgres
 pg_catalog | pg_statio_sys_sequences     | view | postgres
 pg_catalog | pg_statio_sys_tables        | view | postgres
 pg_catalog | pg_statio_user_indexes      | view | postgres
 pg_catalog | pg_statio_user_sequences    | view | postgres
 pg_catalog | pg_statio_user_tables       | view | postgres
 pg_catalog | pg_stats                    | view | postgres
 public     | pg_stat_statements          | view | postgres
(32 rows)

Each of these is its own way of understanding PG. pgcenter is the protagonist of this post:

digoal@pg11-test-> pgcenter --help
pgCenter is a command line admin tool for PostgreSQL.

Usage:
  pgcenter [flags]
  pgcenter [command] [command-flags] [args]

Available commands:
  config      configures Postgres to work with pgcenter
  profile     wait events profiler
  record      record stats to file
  report      make report based on previously saved statistics
  top         top-like stats viewer

Flags:
  -?, --help      show this help and exit
      --version   show version information and exit

Use "pgcenter [command] --help" for more information about a command.

Report bugs to https://github.com/lesovsky/pgcenter/issues

pgcenter can:

1. Profile a long-running query, or the wait events of a given problem backend PID.

2. Snapshot database statistics and build reports along different dimensions:

  record      record stats to file
  report      make report based on previously saved statistics

digoal@pg11-test-> pgcenter report --help
'pgcenter report' reads statistics from file and prints reports.

Usage:
  pgcenter report [OPTIONS]...

Options:
  -f, --file         read stats from file (default: pgcenter.stat.tar)
  -s, --start        starting time of the report (format: [YYYYMMDD-]HHMMSS)
  -e, --end          ending time of the report (format: [YYYYMMDD-]HHMMSS)
  -o, --order        order values by column (default descending, use '+' sign before a column name for ascending order)
  -g, --grep         filter values in specfied column (format: colname:filtertext)
  -l, --limit        print only limited number of rows per sample (default: unlimited)
  -t, --truncate     maximum string size to print (default: 32)
  -i, --interval     delta interval (default: 1s)

Report options:
  -A, --activity     show pg_stat_activity statistics
  -S, --sizes        show statistics about tables sizes
  -D, --databases    show pg_stat_database statistics
  -F, --functions    show pg_stat_user_functions statistics
  -R, --replication  show pg_stat_replication statistics
  -T, --tables       show pg_stat_user_tables statistics
  -I, --indexes      show pg_stat_user_indexes and pg_statio_user_indexes statistics
  -V, --vacuum       show pg_stat_progress_vacuum statistics
  -X, --statements [X]  show pg_stat_statements statistics, use additional selector to choose stats. 'm' - timings; 'g' - general; 'i' - io; 't' - temp files io; 'l' - local files io.
  -d, --describe     show statistics description, combined with one of the report options

General options:
  -?, --help         show this help and exit
      --version      show version information and exit

Report bugs to https://github.com/lesovsky/pgcenter/issues

digoal@pg11-test-> pgcenter report -A -d
Activity statistics based on pg_stat_activity view:

column          origin            description
- pid           pid               Process ID of this backend
- cl_addr       client_addr       IP address of the client connected to this backend
- cl_port       client_port       TCP port number that the client is using for communication with this backend
- datname       datname           Name of the database this backend is connected to
- usename       usename           Name of the user logged into this backend
- appname       application_name  Name of the application that is connected to this backend
- backend_type  backend_type      Type of current backend
- wait_etype    wait_event_type   The type of event for which the backend is waiting, if any
- wait_event    wait_event        Wait event name if backend is currently waiting
- state         state             Current overall state of this backend
- xact_age*     xact_start        Current transaction's duration if active
- query_age*    query_start       Current query's duration if active
- change_age*   state_change      Age since last state has been changed
- query         query             Text of this backend's most recent query

* - extended value, based on origin and calculated using additional functions.
Details: https://www.postgresql.org/docs/current/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW

digoal@pg11-test-> pgcenter report -S -d
Statistics about sizes of tables based on pg_*_size() functions:

column          origin  description
- relation      -       Name of the table, including schema
- total_size    -       Total size of the table, including its indexes, in kB
- rel_size      -       Total size of the table, without its indexes, in kB
- idx_size      -       Total size of tables' indexes, in kB
- total_change  -       How does size of the table, including its indexes, is changing per second, in kB
- rel_change    -       How does size of the table, without its indexes, is changing per second, in kB
- idx_change    -       How does size of the tables' indexes is changing per second, in kB

* - extended value, based on origin and calculated using additional functions.
Details: https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT

digoal@pg11-test-> pgcenter report -V -d
Statistics about progress of vacuums based on pg_stat_progress_vacuum view:

column        origin              description
- pid         pid                 Process ID of this worker
- xact_age*   xact_start          Current transaction's duration if active
- datname     datname             Name of the database this worker is connected to
- relation    relid               Name of the relation which is vacuumed by this worker
- state       state               Current overall state of this worker
- phase       phase               Current processing phase of vacuum
- total*      heap_blks_total     Total size of the table, in kB
- t_scanned*  heap_blks_scanned   Total amount of data scanned, in kB
- t_vacuumed* heap_blks_vacuumed  Total amount of data vacuumed, in kB
- scanned     heap_blks_scanned   Amount of data scanned per second, in kB
- vacuumed    heap_blks_vacuumed  Amount of data vacuumed per second, in kB
- wait_etype  wait_event_type     The type of event for which the worker is waiting, if any
- wait_event  wait_event          Wait event name if worker is currently waiting
- query       query               Text of this workers's "query"

* - extended value, based on origin and calculated using additional functions.
Details: https://www.postgresql.org/docs/current/progress-reporting.html#VACUUM-PROGRESS-REPORTING

3. View a real-time database TOP display:

  top         top-like stats viewer

pgcenter usage (CentOS 7 x64 as the example)

Install from source:

yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y golang
git clone https://github.com/lesovsky/pgcenter
cd pgcenter

digoal@pg11-test-> which go
/bin/go
digoal@pg11-test-> which pg_config
~/pgsql11.1/bin/pg_config

USE_PGXS=1 make
USE_PGXS=1 make install

Install from rpm:

https://github.com/lesovsky/pgcenter/releases

wget https://github.com/lesovsky/pgcenter/releases/download/v0.6.1/pgcenter_0.6.1_Linux_x86_64.rpm
rpm -ivh pgcenter_0.6.1_Linux_x86_64.rpm

[root@pg11-test ~]# rpm -ql pgcenter
/usr/bin/pgcenter

Example: use pgcenter to observe the wait events of a problem PID or a currently slow SQL

Usage help: https://github.com/lesovsky/pgcenter/blob/master/doc/examples.md

1. Find the current slow SQL and its PID:

postgres=# select pid, now()-query_start during, query, wait_event_type, wait_event from pg_stat_activity where wait_event is not null order by query_start limit 1;
  pid  |     during      |         query         | wait_event_type | wait_event
-------+-----------------+-----------------------+-----------------+------------
 21207 | 00:00:28.975778 | select pg_sleep(100); | Timeout         | PgSleep
(1 row)

2. Use pgcenter profile to trace this PID. When tracing a PID you give a sampling frequency (samples per second); for each query that finishes, pgcenter prints the share of that query's elapsed time spent in each wait event.

digoal@pg11-test-> pgcenter profile --help
'pgcenter profile' profiles wait events of running queries

Usage:
  pgcenter profile [OPTIONS]... [DBNAME [USERNAME]]

Options:
  -d, --dbname DBNAME        database name to connect to
  -h, --host HOSTNAME        database server host or socket directory
  -p, --port PORT            database server port (default 5432)
  -U, --username USERNAME    database user name
  -P, --pid PID              backend PID to profile to
  -F, --freq FREQ            profile at this frequency (min 1, max 1000)
  -s, --strsize SIZE         limit length of print query strings to STRSIZE chars (default 128)

General options:
  -?, --help                 show this help and exit
      --version              show version information and exit

Report bugs to https://github.com/lesovsky/pgcenter/issues

Trace, sampling wait events e.g. 10 times per second (one sample every 100 ms):

pgcenter profile -h 127.0.0.1 -p 8001 -U postgres -d postgres -P 42616 -F 10
LOG: Profiling process 42616 with 100ms sampling

3. Create a long-running query:

postgres=# \d t_hintbit
                             Table "public.t_hintbit"
 Column |   Type   | Collation | Nullable |                Default
--------+----------+-----------+----------+---------------------------------------
 id     | bigint   |           | not null | nextval('t_hintbit_id_seq'::regclass)
 c1     | smallint |           |          |
Indexes:
    "t_hintbit_pkey" PRIMARY KEY, btree (id)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          42616
(1 row)

postgres=# update t_hintbit set c1=1;

Observe the profile:

------ ------------ -----------------------------
% time      seconds wait_event                     query: update t_b set info='test' ;
------ ------------ -----------------------------
 97.90    47.239459 Running
  1.47     0.707298 IO.DataFileExtend
  0.63     0.304460 IO.DataFileRead
------ ------------ -----------------------------
100.00    48.251217

------ ------------ -----------------------------
% time      seconds wait_event                     query: update t_b set info='test' ;
------ ------------ -----------------------------
 87.35    25.146099 Running
  9.47     2.727026 IO.DataFileExtend
  3.16     0.909462 LWLock.WALWriteLock
------ ------------ -----------------------------
 99.98    28.782587

How pgcenter works

1. It samples the various statistics views and aggregates the samples into reports, the same idea behind perf insight and AWR. (A plain-SQL sketch of this sampling idea follows the references below.)

References

《阿里云 PostgreSQL 产品生态;案例、开发管理实践、原理、学习资料、视频;PG天天象上沙龙记录 - 珍藏级》
https://blog.dataegret.com/2019/03/pgcenters-wait-event-profiler.html
https://github.com/lesovsky/pgcenter#install-notes
https://github.com/lesovsky/pgcenter/blob/master/doc/examples.md
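As promised above, here is a plain-SQL sketch of the sampling principle behind pgcenter profile: sample pg_stat_activity for one backend at a fixed interval, then report the share of samples per wait event. The helper table and the PID 42616 are illustrative only:

-- collect one sample per invocation; drive it e.g. with  \watch 0.1  in psql
create table wait_samples (wait_event text);

insert into wait_samples
select coalesce(wait_event_type || '.' || wait_event, 'Running')
from pg_stat_activity
where pid = 42616;

-- the report: percentage of samples per wait event, like pgcenter's output
select wait_event,
       count(*) as samples,
       round(100.0 * count(*) / sum(count(*)) over (), 2) as pct
from wait_samples
group by wait_event
order by samples desc;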
Tags: PostgreSQL, data_sync_retry, write back, retry, failed status

Background

On some operating systems a second fsync call is not trustworthy. The OS keeps its own cache, and when buffered writes are used and a write-back fails, the next fsync may not correctly report whether the data actually reached disk (the block's earlier write-back failed, and if that failure state was not retained, whether a later fsync is reliable is unknowable).

The processes that write PG's important files (data files, WAL, CLOG and so on), namely bgwriter, the WAL writer, and backend processes, all use buffered writes. If the OS fails here (i.e., an fsync retry is unreliable), an earlier failed write-back can still produce a successful fsync at checkpoint time. The data file may then contain a damaged block that ought to be repaired from WAL; but because the database received a successful fsync return from the OS, it considers the checkpoint successful and will not use WAL to repair the block.

PG 12 fixes this, and the fix was back-patched to all supported versions.

PANIC on fsync() failure.

On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data.

Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files.

Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch.

Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de

The user-settable parameter:

data_sync_retry (boolean)

When set to off, which is the default, PostgreSQL will raise a PANIC-level error on failure to flush modified data files to the filesystem. This causes the database server to crash. This parameter can only be set at server start.

On some operating systems, the status of data in the kernel's page cache is unknown after a write-back failure. In some cases it might have been entirely forgotten, making it unsafe to retry; the second attempt may be reported as successful, when in fact the data has been lost. In these circumstances, the only way to avoid data loss is to recover from the WAL after any failure is reported, preferably after investigating the root cause of the failure and replacing any faulty hardware.

If set to on, PostgreSQL will instead report an error but continue to run so that the data flushing operation can be retried in a later checkpoint. Only set it to on after investigating the operating system's treatment of buffered data in case of write-back failure.
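The two modes map onto a single postgresql.conf setting; a minimal sketch of both (the default shown first):

# postgresql.conf; can only be set at server start
data_sync_retry = off   # default: PANIC on fsync() failure, then recover from WAL
#data_sync_retry = on   # only if the OS is known to keep dirty buffers after a
                        # failed write-back, making a later retry meaningful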
The default is the safe choice. If you want to set it to on, first make sure the OS's fsync can genuinely be retried and is reliable.

Digression

1. The database's current behavior is simply to disable retries via data_sync_retry, i.e., to raise a PANIC. One could instead try to repair the failed block by extracting its full-page write (FPW) and the subsequent changes from WAL, avoiding an outright crash and the poor experience that gives users.

References

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1

The patch has been applied to all supported versions, so PG 11 has it as well.
Tags: PostgreSQL, max_wal_senders, max_connections, sorry, too many clients already

Background

If you use PG streaming replication, the upstream node's max_wal_senders parameter limits how many WAL sender processes the node may run at once. Logical replication, physical replication, and pg_basebackup all use the streaming protocol, and every such connection needs its own WAL sender process.

Before version 12, max_wal_senders was counted as part of max_connections. So if ordinary client connections exhausted the database's connection slots, replication connections could not get one either.

Version 12 fixes this: max_wal_senders is now controlled independently and no longer counted inside max_connections, so ordinary connections and replication connections no longer interfere with each other. (See the sketch at the end of this section.)

At the same time, a standby's max_wal_senders must now be greater than or equal to that of its primary (upstream), just as a standby's max_connections must be greater than or equal to the primary's.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ea92368cd1da1e290f9ab8efb7f60cb7598fc310

Move max_wal_senders out of max_connections for connection slot handling

Since its introduction, max_wal_senders is counted as part of max_connections when it comes to define how many connection slots can be used for replication connections with a WAL sender context. This can lead to confusion for some users, as it could be possible to block a base backup or replication from happening because other backend sessions are already taken for other purposes by an application, and superuser-only connection slots are not a correct solution to handle that case.

This commit makes max_wal_senders independent of max_connections for its handling of PGPROC entries in ProcGlobal, meaning that connection slots for WAL senders are handled using their own free queue, like autovacuum workers and bgworkers.

One compatibility issue that this change creates is that a standby now requires to have a value of max_wal_senders at least equal to its primary. So, if a standby created enforces the value of max_wal_senders to be lower than that, then this could break failovers. Normally this should not be an issue though, as any settings of a standby are inherited from its primary as postgresql.conf gets normally copied as part of a base backup, so parameters would be consistent.

References

《PostgreSQL 拒绝服务DDOS攻击与防范》
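A quick way to see the two pools discussed above (the values are illustrative):

-- On PG 12+, wal sender slots are sized independently of ordinary backends
show max_connections;   -- e.g. 100 : ordinary connection slots
show max_wal_senders;   -- e.g. 10  : a separate pool for streaming-protocol connections
-- Constraint enforced from PG 12 on: standby max_wal_senders >= primary max_wal_senders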
Tags: PostgreSQL, CTE, materialized, not materialized, push down

Background

PostgreSQL's WITH syntax can run very complex SQL logic, including recursion and multi-statement materialized computation.

In versions before 12, every CTE (common table expression) in a WITH query was materialized outright; conditions from the outer query were never pushed down into the CTE (a materialization node). For INSERT/UPDATE/DELETE CTEs and recursive WITH queries this is perfectly normal, but for a SELECT CTE, pushing the outer conditions into it can greatly reduce the amount of data scanned.

So starting with PG 12, the user can choose:

with NOT MATERIALIZED (no materialization; outer conditions may be pushed in)
with MATERIALIZED (materialize)

Allow user control of CTE materialization, and change the default behavior.

Historically we've always materialized the full output of a CTE query, treating WITH as an optimization fence (so that, for example, restrictions from the outer query cannot be pushed into it). This is appropriate when the CTE query is INSERT/UPDATE/DELETE, or is recursive; but when the CTE query is non-recursive and side-effect-free, there's no hazard of changing the query results by pushing restrictions down.

Another argument for materialization is that it can avoid duplicate computation of an expensive WITH query --- but that only applies if the WITH query is called more than once in the outer query. Even then it could still be a net loss, if each call has restrictions that would allow just a small part of the WITH query to be computed.

Hence, let's change the behavior for WITH queries that are non-recursive and side-effect-free. By default, we will inline them into the outer query (removing the optimization fence) if they are called just once. If they are called more than once, we will keep the old behavior by default, but the user can override this and force inlining by specifying NOT MATERIALIZED. Lastly, the user can force the old behavior by specifying MATERIALIZED; this would mainly be useful when the query had deliberately been employing WITH as an optimization fence to prevent a poor choice of plan.

Andreas Karlsson, Andrew Gierth, David Fetter
Discussion: https://postgr.es/m/87sh48ffhb.fsf@news-spur.riddles.org.uk

Example

Using NOT MATERIALIZED on a CTE means the CTE is not materialized and outer conditions can be pushed into it. Quoting the PostgreSQL documentation (a runnable sketch follows the references below):

WITH w AS (
    SELECT * FROM big_table
)
SELECT * FROM w WHERE key = 123;

In particular, if there's an index on key, it will probably be used to fetch just the rows having key = 123. On the other hand, in

WITH w AS (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

the WITH query will be materialized, producing a temporary copy of big_table that is then joined with itself — without benefit of any index. This query will be executed much more efficiently if written as:

WITH w AS NOT MATERIALIZED (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

References

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=608b167f9f9c4553c35bb1ec0eab9ddae643989b
https://www.postgresql.org/docs/devel/queries-with.html
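To see the pushdown yourself, compare the two plans. A minimal sketch (PG 12+; table t and its contents are assumptions for illustration):

create table t (key int primary key, ref int);
insert into t select g, g from generate_series(1, 100000) g;
analyze t;

explain (costs off)
with w as materialized (select * from t)
select * from w where key = 123;
-- CTE Scan on w with Filter (key = 123): the whole CTE result is produced first

explain (costs off)
with w as not materialized (select * from t)
select * from w where key = 123;
-- Index Scan using t_pkey on t: the outer predicate was pushed into the CTE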
Tags: PostgreSQL, PostGIS, geos

Background

http://lin-ear-th-inking.blogspot.com/2019/02/betterfaster-stpointonsurface-for.html

New GEOS code improves the performance of computation-heavy PostGIS functions: the improved ST_PointOnSurface runs 13 times faster than the old code.

This is the final chapter in the saga of improving point-on-surface computation. For those who missed the first two episodes: the series began with an improved algorithm for the venerable JTS Geometry.interiorPoint() for polygons; episode 2 travelled deep into the wilds of C++, bringing the improvement into GEOS. The finale shows how this results in greatly improved performance of PostGIS ST_PointOnSurface.

The dataset is a convenient test case, since it has lots of large polygons (shown in the original post with their interior points computed), and the query is about as simple as it gets. Comparing query timings between the improved GEOS code and the previous implementation gives the expected dramatic improvement: the improved ST_PointOnSurface runs 13 times faster than the old code, and it is now as fast as ST_Centroid. It is also more robust and tolerant of invalid input (although this test doesn't show it). This should show up in PostGIS in the fall release (PostGIS 3 / GEOS 3.8).

(by Dr JTS, March 01, 2019)
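The excerpt drops the actual benchmark query, but per the description it was about as simple as PostGIS gets; presumably something of this shape (table and column names are assumptions):

-- compute an interior point for every polygon
select st_pointonsurface(geom) from polygons;

-- the centroid function it is now on par with
select st_centroid(geom) from polygons;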
https://github.com/digoal/blog/blob/master/201802/20180226_05.md (advanced usage: 《PostgreSQL SELECT 的高级用法(CTE, LATERAL, ORDINALITY, WINDOW, SKIP LOCKED, DISTINCT, GROUPING SETS, ...) - 珍藏级》)
You need to create the EXTENSION as a database superuser.
https://github.com/digoal/blog/blob/master/201605/20160510_01.md
Apart from not supporting database-level WAL, PG is fairly complete in every other respect here.
Flash-sale (seckill) example: 300,000 TPS.
https://github.com/digoal/blog/blob/master/201711/20171107_31.md
SKIP LOCKED row example:
https://github.com/digoal/blog/blob/master/201610/20161018_01.md
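A minimal SKIP LOCKED sketch to go with the example linked above (the queue table is an assumption for illustration):

-- each worker grabs one unclaimed job; rows locked by other workers are skipped, not waited on
begin;
select * from job_queue
order by id
limit 1
for update skip locked;
-- ... process the job, then delete it or mark it done ...
commit;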
We recommend HybridDB for PostgreSQL or RDS PgSQL; both can read and write OSS directly through OSS foreign tables.
Alibaba also used RDS PgSQL and HDB PG internally for this year's Double 11.
https://github.com/digoal/blog/blob/master/201706/20170601_02.md
https://github.com/digoal/blog/blob/master/201711/20171111_01.md
If you need a search service, consider RDS for PostgreSQL: it supports full-text search, fuzzy queries, similarity queries, regular-expression queries, and more.
Hundreds of millions of rows, with millisecond-level response.
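A minimal sketch of the index-accelerated fuzzy search mentioned above (the docs table is an assumption):

create extension if not exists pg_trgm;
create table docs (id serial primary key, content text);
create index on docs using gin (content gin_trgm_ops);

-- fuzzy (substring) query, served by the trigram GIN index instead of a seq scan
select * from docs where content like '%keyword%';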
Yes, but you will need to manage the entire database lifecycle yourself.
https://github.com/digoal/blog/blob/master/201711/20171125_01.md
You can use Alibaba Cloud's RDS for PPAS database, which is highly compatible with Oracle's SQL and stored-procedure syntax, and in some respects even outperforms Oracle.
Use the JDBC interface or another application development interface.
You can also use PostgreSQL.
cmin and cmax indicate which command within a transaction touched the row, i.e., the ordinal of the SQL statement (inside the inserting or deleting/updating transaction) that wrote or updated it.
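A small sketch that makes cmin visible (the exact numbers depend on how many commands the transaction has already run):

begin;
create table demo (id int);
insert into demo values (1);       -- one command in this transaction
insert into demo values (2);       -- a later command
select cmin, cmax, id from demo;   -- the two rows show different cmin values,
                                   -- reflecting their command order in this transaction
commit;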
Snapshot-based incremental backup gives a controllable recovery time; no matter how large your database is, a backup takes only seconds.
https://github.com/digoal/blog/blob/master/201608/20160823_05.md
No. That is a transaction lock; what you need to troubleshoot is the business logic.
This article will help you:
https://github.com/digoal/blog/blob/master/201705/20170521_01.md
Example:
https://www.postgresql.org/docs/10/static/plpgsql-porting.html
Alternatively, you can use the Alibaba Cloud RDS for PPAS product, which is highly Oracle-compatible; most PL/SQL functions need no conversion.
Not supported; PG 11 supports it.
https://www.postgresql.org/docs/devel/static/sql-createtable.html
If you need this syntax on PG 10, you can emulate it: insert with INSERT ... ON CONFLICT, falling back to DO UPDATE on conflict, and add an UPDATE rule on the target table that turns the update into an insert into the default table.
No; that part has not been made pluggable (turned into an extension) yet.
Recovery methods, in order of preference:
1. If you have incremental backups, do a point-in-time recovery from an incremental backup plus the archived WAL files.
2. If there is no incremental backup, restore logically from a dump file, back to the time of that backup.
3. If neither exists, recover at the filesystem level. If the files were deleted while the database was shut down, back up the recovered data files first, then run VACUUM across the whole database to check it.
3.1 If the files were deleted while the database was running, the database is in an inconsistent state; if the data files cannot be fully recovered, you must reset the control file (pg_resetwal) before the database will start.
3.2 If the recovered data files are not sufficient to start the database, you can use pg_filedump to extract the data from whatever files remain and reconstruct the contents by hand.
You need to compile and install the imgsmlr extension first.