
半个PostgreSQL DBA,热衷于数据库及分布式技术。 - https://github.com/ChenHuajun - https://pan.baidu.com/s/1eRQsdAa
基于Patroni的Citus高可用环境部署

1. 前言

Citus是一个非常实用的、能够使PostgreSQL具有水平扩展能力的插件,或者说是一款以PostgreSQL插件形式部署的基于PostgreSQL的分布式HTAP数据库。本文简单说明Citus的高可用技术方案,并实际演示基于Patroni搭建Citus HA环境的步骤。

2. 技术方案

2.1 Citus HA方案选型

Citus集群由一个CN节点和N个Worker节点组成。CN节点的高可用可以使用任何通用的PG高可用方案,即为CN节点通过流复制配置主备2台PG机器;Worker节点的高可用除了可以像CN一样采用PG原生的高可用方案,还支持另一种多副本分片的高可用方案。

多副本高可用方案是Citus早期版本默认的Worker高可用方案(当时citus.shard_replication_factor默认值为2),这种方案部署非常简单,而且坏一个Worker节点也不影响业务。采用多副本高可用方案时,每次写入数据,CN节点需要在2个Worker上分别写数据,这也带来一系列不利的地方:

- 数据写入的性能下降
- 对多个副本的数据一致性的保障也没有PG原生的流复制强
- 存在功能上的限制,比如不支持Citus MX架构

因此,Citus的多副本高可用方案适用场景有限,Citus官方文档上也说它可能只适用于append only的业务场景,不作为推荐的高可用方案了(在Citus 6.1的时候,citus.shard_replication_factor默认值从2改成了1)。

因此,建议Citus的CN和Worker节点都使用PG的原生流复制部署高可用。

2.2 PG HA支持工具的选型

PG本身提供的流复制的HA的部署和维护都不算很复杂,但是如果我们追求更高程度的自动化,特别是自动故障切换,可以使用一些第三方的HA工具。目前有很多种可选的开源工具,下面几种算是比较常用的:

- PAF(PostgreSQL Automatic Failover)
- repmgr
- Patroni

它们的比较可以参考: https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/

其中Patroni采用DCS(Distributed Configuration Store,比如etcd,ZooKeeper,Consul等)存储元数据,能够严格地保障元数据的一致性,可靠性高;而且它的功能也比较强大。

因此个人推荐使用Patroni(只有2台机器无法部署etcd的情况可以考虑其它方案)。本文介绍基于Patroni的PostgreSQL高可用的部署。

2.3 客户端流量切换方案

PG主备切换后,访问数据库的客户端也要相应地连接到新的主库。目前常见的有下面几种方案:

- HAProxy
  - 优点:可靠;支持负载均衡
  - 缺点:性能损耗;需要配置HAProxy自身的HA
- VIP
  - 优点:无性能损耗,不占用机器资源
  - 缺点:主备节点IP必须在同网段
- 客户端多主机URL
  - 优点:无性能损耗,不占用机器资源;不依赖VIP,易于在云环境部署;pgjdbc支持读写分离和负载均衡
  - 缺点:仅部分客户端驱动支持(目前包括pgjdbc,libpq和基于libpq的驱动,如python和php);如果数据库层面没控制好出现了"双主",客户端同时向2个主写数据的风险较高

根据Citus集群的特点,推荐的候选方案如下:

- 应用连接Citus
  - 客户端多主机URL(如果客户端驱动支持,特别对Java应用,推荐采用客户端多主机URL访问Citus)
  - VIP
- Citus CN连接Worker
  - VIP
  - Worker节点发生切换时动态修改Citus CN上的worker节点元数据

关于Citus CN连接Worker的方式,本文下面的实验中会演示2种架构,采用不同的实现方式。

普通架构:
- CN通过Worker的实际IP连接Worker主节点
- CN上通过监控脚本检测Worker节点状态,Worker发生主备切换时动态修改Citus CN上的元数据

支持读写分离的架构:
- CN通过Worker的读写VIP和只读VIP连接Worker
- CN上通过Patroni回调脚本动态控制CN主节点使用读写VIP,CN备节点使用只读VIP
- Worker上通过Patroni回调脚本动态绑定读写VIP
- Worker上通过keepalived动态绑定只读VIP

3. 实验环境

主要软件
- CentOS 7.8
- PostgreSQL 12
- Citus 10.4
- patroni 1.6.5
- etcd 3.3.25

机器和VIP资源
- Citus CN
  - node1:192.168.234.201
  - node2:192.168.234.202
- Citus Worker
  - node3:192.168.234.203
  - node4:192.168.234.204
- etcd
  - node4:192.168.234.204
- VIP(Citus Worker)
  - 读写VIP:192.168.234.210
  - 只读VIP:192.168.234.211

环境准备

所有节点设置时钟同步

yum install -y ntpdate
ntpdate time.windows.com && hwclock -w

如果使用防火墙需要开放postgres,etcd和patroni的端口。
- postgres:5432
- patroni:8008
- etcd:2379/2380

更简单的做法是将防火墙关闭

setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
systemctl disable firewalld.service
systemctl stop firewalld.service
iptables -F

4. etcd部署

因为本文的主题不是etcd的高可用,所以只在node4上部署单节点的etcd用于实验。生产环境至少需要3台独立的机器,也可以和数据库部署在一起。etcd的部署步骤如下

安装需要的包

yum install -y gcc python-devel epel-release

安装etcd

yum install -y etcd

编辑etcd配置文件/etc/etcd/etcd.conf, 参考配置如下

ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.234.204:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://192.168.234.204:2379"
ETCD_NAME="etcd0"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.234.204:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.234.204:2379"
ETCD_INITIAL_CLUSTER="etcd0=http://192.168.234.204:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster1"
ETCD_INITIAL_CLUSTER_STATE="new"

启动etcd

systemctl start etcd

设置etcd自启动

systemctl enable etcd

5.
PostgreSQL + Citus + Patroni HA部署 在需要运行PostgreSQL的实例上安装相关软件 安装PostgreSQL 12和Citus yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm yum install -y postgresql12-server postgresql12-contrib yum install -y citus_12 安装Patroni yum install -y gcc epel-release yum install -y python-pip python-psycopg2 python-devel pip install --upgrade pip pip install --upgrade setuptools pip install patroni[etcd] 创建PostgreSQL数据目录 mkdir -p /pgsql/data chown postgres:postgres -R /pgsql chmod -R 700 /pgsql/data 创建Partoni的service配置文件/etc/systemd/system/patroni.service [Unit] Description=Runners to orchestrate a high-availability PostgreSQL After=syslog.target network.target [Service] Type=simple User=postgres Group=postgres #StandardOutput=syslog ExecStart=/usr/bin/patroni /etc/patroni.yml ExecReload=/bin/kill -s HUP $MAINPID KillMode=process TimeoutSec=30 Restart=no [Install] WantedBy=multi-user.target 创建Patroni配置文件/etc/patroni.yml,以下是node1的配置示例 scope: cn namespace: /service/ name: pg1 restapi: listen: 0.0.0.0:8008 connect_address: 192.168.234.201:8008 etcd: host: 192.168.234.204:2379 bootstrap: dcs: ttl: 30 loop_wait: 10 retry_timeout: 10 maximum_lag_on_failover: 1048576 master_start_timeout: 300 synchronous_mode: false postgresql: use_pg_rewind: true use_slots: true parameters: listen_addresses: "0.0.0.0" port: 5432 wal_level: logical hot_standby: "on" wal_keep_segments: 1000 max_wal_senders: 10 max_replication_slots: 10 wal_log_hints: "on" max_connections: "100" max_prepared_transactions: "100" shared_preload_libraries: "citus" citus.node_conninfo: "sslmode=prefer" citus.replication_model: streaming citus.task_assignment_policy: round-robin initdb: - encoding: UTF8 - locale: C - lc-ctype: zh_CN.UTF-8 - data-checksums pg_hba: - host replication repl 0.0.0.0/0 md5 - host all all 0.0.0.0/0 md5 postgresql: listen: 0.0.0.0:5432 connect_address: 192.168.234.201:5432 data_dir: /pgsql/data bin_dir: /usr/pgsql-12/bin authentication: replication: username: repl password: "123456" superuser: username: postgres password: "123456" basebackup: max-rate: 100M checkpoint: fast tags: nofailover: false noloadbalance: false clonefrom: false nosync: false 其他PG节点的patroni.yml需要相应修改下面4个参数 scope node1,node2设置为cn node3,node4设置为wk1 name node1~node4分别设置pg1~pg4 restapi.connect_address 根据各自节点IP设置 postgresql.connect_address 根据各自节点IP设置 启动Patroni 在所有节点上启动Patroni。 systemctl start patroni 同一个cluster中,第一次启动的Patroni实例会作为leader运行,并初始创建PostgreSQL实例和用户。后续节点初次启动时从leader节点克隆数据 查看cn集群状态 [root@node1 ~]# patronictl -c /etc/patroni.yml list + Cluster: cn (6869267831456178056) +---------+----+-----------+-----------------+ | Member | Host | Role | State | TL | Lag in MB | Pending restart | +--------+-----------------+--------+---------+----+-----------+-----------------+ | pg1 | 192.168.234.201 | | running | 1 | 0.0 | * | | pg2 | 192.168.234.202 | Leader | running | 1 | | | +--------+-----------------+--------+---------+----+-----------+-----------------+ 查看wk1集群状态 [root@node3 ~]# patronictl -c /etc/patroni.yml list + Cluster: wk1 (6869267726994446390) ---------+----+-----------+-----------------+ | Member | Host | Role | State | TL | Lag in MB | Pending restart | +--------+-----------------+--------+---------+----+-----------+-----------------+ | pg3 | 192.168.234.203 | | running | 1 | 0.0 | * | | pg4 | 192.168.234.204 | Leader | running | 1 | | | +--------+-----------------+--------+---------+----+-----------+-----------------+ 为了方便日常操作,设置全局环境变量PATRONICTL_CONFIG_FILE echo 'export 
PATRONICTL_CONFIG_FILE=/etc/patroni.yml' >/etc/profile.d/patroni.sh 添加以下环境变量到~postgres/.bash_profile export PGDATA=/pgsql/data export PATH=/usr/pgsql-12/bin:$PATH 设置postgres拥有sudoer权限 echo 'postgres ALL=(ALL) NOPASSWD: ALL'> /etc/sudoers.d/postgres 5. 配置Citus 在cn和wk的主节点上创建citus扩展 create extension citus 在cn的主节点上,添加wk1的主节点IP,groupid设置为1。 SELECT * from master_add_node('192.168.234.204', 5432, 1, 'primary'); 在Worker的主备节点上分别修改/pgsql/data/pg_hba.conf配置文件,以下内容添加到其它配置项前面允许CN免密连接Worker。 host all all 192.168.234.201/32 trust host all all 192.168.234.202/32 trust 修改后重新加载配置 su - postgres pg_ctl reload 注:也可以通过在CN上设置~postgres/.pgpass 实现免密,但是没有上面的方式维护方便。 创建分片表测试验证 create table tb1(id int primary key,c1 text); set citus.shard_count = 64; select create_distributed_table('tb1','id'); select * from tb1; 6. 配置Worker的自动流量切换 上面配置的Worker IP是当时的Worker主节点IP,在Worker发生主备切换后,需要相应更新这个IP。 实现上,可以通过脚本监视Worker主备状态,当Worker主备角色变更时,自动更新Citus上的Worker元数据为新主节点的IP。下面是脚本的参考实现 将以下配置添加到Citus CN主备节点的/etc/patroni.yml里 citus: loop_wait: 10 databases: - postgres workers: - groupid: 1 nodes: - 192.168.234.203:5432 - 192.168.234.204:5432 也可以使用独立的配置文件,如果那样做需要补充认证配置 postgresql: connect_address: 192.168.234.202:5432 authentication: superuser: username: postgres password: "123456" 创建worker流量自动切换脚本/pgsql/citus_controller.py #!/usr/bin/env python2 # -*- coding: utf-8 -*- import os import time import argparse import logging import yaml import psycopg2 def get_pg_role(url): result = 'unknow' try: with psycopg2.connect(url, connect_timeout=2) as conn: conn.autocommit = True cur = conn.cursor() cur.execute("select pg_is_in_recovery()") row = cur.fetchone() if row[0] == True: result = 'secondary' elif row[0] == False: result = 'primary' except Exception as e: logging.debug('get_pg_role() failed. url:{0} error:{1}'.format( url, str(e))) return result def update_worker(url, role, groupid, nodename, nodeport): logging.debug('call update worker. role:{0} groupid:{1} nodename:{2} nodeport:{3}'.format( role, groupid, nodename, nodeport)) try: sql = "select nodeid,nodename,nodeport from pg_dist_node where groupid={0} and noderole = '{1}' order by nodeid limit 1".format( groupid, role) conn = psycopg2.connect(url, connect_timeout=2) conn.autocommit = True cur = conn.cursor() cur.execute(sql) row = cur.fetchone() if row is None: logging.error("can not found nodeid whose groupid={0} noderole = '{1}'".format(groupid, role)) return False nodeid = row[0] oldnodename = row[1] oldnodeport = str(row[2]) if oldnodename == nodename and oldnodeport == nodeport: logging.debug('skip for current nodename:nodeport is same') return False sql= "select master_update_node({0}, '{1}', {2})".format(nodeid, nodename, nodeport) ret = cur.execute(sql) logging.info("Changed worker node {0} from '{1}:{2}' to '{3}:{4}'".format(nodeid, oldnodename, oldnodeport, nodename, nodeport)) return True except Exception as e: logging.error('update_worker() failed. 
role:{0} groupid:{1} nodename:{2} nodeport:{3} error:{4}'.format( role, groupid, nodename, nodeport, str(e))) return False def main(): parser = argparse.ArgumentParser(description='Script to auto setup Citus worker') parser.add_argument('-c', '--config', default='citus_controller.yml') parser.add_argument('-d', '--debug', action='store_true', default=False) args = parser.parse_args() if args.debug: logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.DEBUG) else: logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.INFO) # read config file f = open(args.config,'r') contents = f.read() config = yaml.load(contents, Loader=yaml.FullLoader) cn_connect_address = config['postgresql']['connect_address'] username = config['postgresql']['authentication']['superuser']['username'] password = config['postgresql']['authentication']['superuser']['password'] databases = config['citus']['databases'] workers = config['citus']['workers'] loop_wait = config['citus'].get('loop_wait',10) logging.info('start main loop') loop_count = 0 while True: loop_count += 1 logging.debug("##### main loop start [{}] #####".format(loop_count)) dbname = databases[0] cn_url = "postgres://{0}/{1}?user={2}&password={3}".format( cn_connect_address,dbname,username,password) if(get_pg_role(cn_url) == 'primary'): for worker in workers: groupid = worker['groupid'] nodes = worker['nodes'] ## get role of worker nodes primarys = [] secondarys = [] for node in nodes: wk_url = "postgres://{0}/{1}?user={2}&password={3}".format( node,dbname,username,password) role = get_pg_role(wk_url) if role == 'primary': primarys.append(node) elif role == 'secondary': secondarys.append(node) logging.debug('Role info groupid:{0} primarys:{1} secondarys:{2}'.format( groupid,primarys,secondarys)) ## update worker node for dbname in databases: cn_url = "postgres://{0}/{1}?user={2}&password={3}".format( cn_connect_address,dbname,username,password) if len(primarys) == 1: nodename = primarys[0].split(':')[0] nodeport = primarys[0].split(':')[1] update_worker(cn_url, 'primary', groupid, nodename, nodeport) """ Citus的pg_dist_node元数据中要求nodename:nodeport必须唯一,所以无法同时支持secondary节点的动态更新。 一个可能的回避方法是为每个worker配置2个IP地址,一个作为parimary角色时使用,另一个作为secondary角色时使用。 if len(secondarys) >= 1: nodename = secondarys[0].split(':')[0] nodeport = secondarys[0].split(':')[1] update_worker(cn_url, 'secondary', groupid, nodename, nodeport) elif len(secondarys) == 0 and len(primarys) == 1: nodename = primarys[0].split(':')[0] nodeport = primarys[0].split(':')[1] update_worker(cn_url, 'secondary', groupid, nodename, nodeport) """ time.sleep(loop_wait) if __name__ == '__main__': main() 创建该脚本的service配置文件/etc/systemd/system/citus_controller.service [Unit] Description=Auto update primary worker ip in Citus CN After=syslog.target network.target [Service] Type=simple User=postgres Group=postgres ExecStart=/bin/python /pgsql/citus_controller.py -c /etc/patroni.yml KillMode=process TimeoutSec=30 Restart=no [Install] WantedBy=multi-user.target 在cn主备节点上都启动Worker流量自动切换脚本 systemctl start citus_controller 7. 
读写分离 根据上面的配置,Citus CN不会访问Worker的备机,这些备机闲着也是闲着,能否把这些备节用起来,让Citus CN支持读写分离呢?具体而言就是让CN的备机优先访问Worker的备机,Worker备节故障时访问Worker的主机。 Citus本身支持读写分离功能,可以把一个Worker的主备2个节点作为2个”worker"分别以primary和secondary的角色加入到同一个worker group里。但是,由于Citus的pg_dist_node元数据中要求nodename:nodeport必须唯一,所以前面的动态修改Citus元数据中的worker IP的方式无法同时支持primary节点和secondary节点的动态更新。 解决办法有2个 方法1:Citus元数据中只写固定的主机名,比如wk1,wk2...,然后通过自定义的Worker流量自动切换脚本将这个固定的主机名解析成不同的IP地址写入到/etc/hosts里,在CN主库上解析成Worker主库的IP,在CN备库上解析成Worker备库的IP。 方法2:在Worker上动态绑定读写VIP和只读VIP。在Citus元数据中读写VIP作为primary角色的worker,只读VIP作为secondary角色的worker。 Patroni动态绑VIP的方法参考基于Patroni的PostgreSQL高可用环境部署.md,对Citus Worker,读写VIP通过回调脚本动态绑定;只读VIP通过keepalived动态绑定。 下面按方法2进行配置。 创建Citus集群时,在CN的主节点上,添加wk1的读写VIP(192.168.234.210)和只读VIP(192.168.234.211),分别作为primary worker和secondary worker,groupid设置为1。 SELECT * from master_add_node('192.168.234.210', 5432, 1, 'primary'); SELECT * from master_add_node('192.168.234.211', 5432, 1, 'secondary'); 为了让CN备库连接到secondary的worker,还需要在CN备库上设置以下参数 alter system set citus.use_secondary_nodes=always; select pg_reload_conf(); 这个参数的变更只对新创建的会话生效,如果希望立即生效,需要在修改参数后杀掉已有会话。 现在分别到CN主库和备库上执行同一条SQL,可以看到SQL被发往不同的worker。 CN主库(未设置citus.use_secondary_nodes=always): postgres=# explain select * from tb1; QUERY PLAN ------------------------------------------------------------------------------- Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=36) Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=192.168.234.210 port=5432 dbname=postgres -> Seq Scan on tb1_102168 tb1 (cost=0.00..22.70 rows=1270 width=36) (6 rows) CN备库(设置了citus.use_secondary_nodes=always): postgres=# explain select * from tb1; QUERY PLAN ------------------------------------------------------------------------------- Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=36) Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=192.168.234.211 port=5432 dbname=postgres -> Seq Scan on tb1_102168 tb1 (cost=0.00..22.70 rows=1270 width=36) (6 rows) 由于CN也会发生主备切换,`citus.use_secondary_nodes参数必须动态调节。这可以使用Patroni的回调脚本实现 创建动态设置参数的/pgsql/switch_use_secondary_nodes.sh #!/bin/bash DBNAME=postgres KILL_ALL_SQL="select pg_terminate_backend(pid) from pg_stat_activity where backend_type='client backend' and application_name <> 'Patroni' and pid <> pg_backend_pid()" action=$1 role=$2 cluster=$3 log() { echo "switch_use_secondary_nodes: $*"|logger } alter_use_secondary_nodes() { value="$1" oldvalue=`psql -d postgres -Atc "show citus.use_secondary_nodes"` if [ "$value" = "$oldvalue" ] ; then log "old value of use_secondary_nodes already be '${value}', skip change" return fi psql -d ${DBNAME} -c "alter system set citus.use_secondary_nodes=${value}" >/dev/null rc=$? if [ $rc -ne 0 ] ;then log "fail to alter use_secondary_nodes to '${value}' rc=$rc" exit 1 fi psql -d ${DBNAME} -c 'select pg_reload_conf()' >/dev/null rc=$? if [ $rc -ne 0 ] ;then log "fail to call pg_reload_conf() rc=$rc" exit 1 fi log "changed use_secondary_nodes to '${value}'" ## kill all existing connections killed_conns=`psql -d ${DBNAME} -Atc "${KILL_ALL_SQL}" | wc -l` rc=$? if [ $rc -ne 0 ] ;then log "failed to kill connections rc=$rc" exit 1 fi log "killed ${killed_conns} connections" } log "switch_use_secondary_nodes start args:'$*'" case $action in on_start|on_restart|on_role_change) case $role in master) alter_use_secondary_nodes never ;; replica) alter_use_secondary_nodes always ;; *) log "wrong role '$role'" exit 1 ;; esac ;; *) log "wrong action '$action'" exit 1 ;; esac 修改Patroni配置文件/etc/patroni.yml,配置回调函数 postgresql: ... 
callbacks: on_start: /bin/bash /pgsql/switch_use_secondary_nodes.sh on_restart: /bin/bash /pgsql/switch_use_secondary_nodes.sh on_role_change: /bin/bash /pgsql/switch_use_secondary_nodes.sh 所有节点的Patroni配置文件都修改后,重新加载Patroni配置 patronictl reload cn CN上执行switchover后,可以看到use_secondary_nodes参数发生了修改 /var/log/messages: Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: switch_use_secondary_nodes start args:'on_role_change replica cn' Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: changed use_secondary_nodes to 'always' Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: killed 0 connections 8. 参考 基于Patroni的PostgreSQL高可用环境部署.md 《基于Patroni的Citus高可用方案》(PostgreSQL中国用户大会2019分享主题) https://patroni.readthedocs.io/en/latest/ http://blogs.sungeek.net/unixwiz/2018/09/02/centos-7-postgresql-10-patroni/ https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/ https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters https://www.percona.com/blog/2019/10/23/seamless-application-failover-using-libpq-features-in-postgresql/
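附:按2.3节的推荐方案,应用可以通过客户端多主机URL直接访问Citus CN主备。下面给出一个用psql(libpq多主机连接串)验证的参考示例,其中的IP、端口和密码取自本文实验环境,仅为示意,实际使用时请按部署情况调整:

```bash
# target_session_attrs=read-write 保证始终连到当前可写的CN主节点,CN发生切换后重连即可指向新主
psql "postgres://192.168.234.201:5432,192.168.234.202:5432/postgres?target_session_attrs=read-write&password=123456" \
    -c 'select inet_server_addr(), pg_is_in_recovery()'
```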
基于Patroni的PostgreSQL高可用环境部署 1. 前言 PostgreSQL是一款功能,性能,可靠性都可以和高端的国外商业数据库相媲美的开源数据库。而且PostgreSQL的许可和生态完全开放,不被任何一个单一的公司或国家所操控,保证了使用者没有后顾之忧。国内越来越多的企业开始用PostgreSQL代替原来昂贵的国外商业数据库。 在部署PostgreSQL到生产环境中时,选择适合的高可用方案是一项必不可少的工作。本文介绍基于Patroni的PostgreSQL高可用的部署方法,供大家参考。 PostgreSQL的开源HA工具有很多种,下面几种算是比较常用的 PAF(PostgreSQL Automatic Failomianver) repmgr Patroni 它们的比较可以参考: https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/ 其中Patroni不仅简单易用而且功能非常强大。 支持自动failover和按需switchover 支持一个和多个备节点 支持级联复制 支持同步复制,异步复制 支持同步复制下备库故障时自动降级为异步复制(功效类似于MySQL的半同步,但是更加智能) 支持控制指定节点是否参与选主,是否参与负载均衡以及是否可以成为同步备机 支持通过pg_rewind自动修复旧主 支持多种方式初始化集群和重建备机,包括pg_basebackup和支持wal_e,pgBackRest,barman等备份工具的自定义脚本 支持自定义外部callback脚本 支持REST API 支持通过watchdog防止脑裂 支持k8s,docker等容器化环境部署 支持多种常见DCS(Distributed Configuration Store)存储元数据,包括etcd,ZooKeeper,Consul,Kubernetes 因此,除非只有2台机器没有多余机器部署DCS的情况,Patroni是一款非常值得推荐的PostgreSQL高可用工具。下面将详细介绍基于Patroni搭建PostgreSQL高可用环境的步骤。 2. 实验环境 主要软件 CentOS 7.8 PostgreSQL 12 Patroni 1.6.5 etcd 3.3.25 机器和VIP资源 PostgreSQL node1:192.168.234.201 node2:192.168.234.202 node3:192.168.234.203 etcd node4:192.168.234.204 VIP 读写VIP:192.168.234.210 只读VIP:192.168.234.211 环境准备 所有节点设置时钟同步 yum install -y ntpdate ntpdate time.windows.com && hwclock -w 如果使用防火墙需要开放postgres,etcd和patroni的端口。 postgres:5432 patroni:8008 etcd:2379/2380 更简单的做法是将防火墙关闭 setenforce 0 sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config systemctl disable firewalld.service systemctl stop firewalld.service iptables -F 3. etcd部署 因为本文的主题不是etcd的高可用,所以只在node4上部署单节点的etcd用于实验。生产环境至少需要部署3个节点,可以使用独立的机器也可以和数据库部署在一起。etcd的部署步骤如下 安装需要的包 yum install -y gcc python-devel epel-release 安装etcd yum install -y etcd 编辑etcd配置文件/etc/etcd/etcd.conf, 参考配置如下 ETCD_DATA_DIR="/var/lib/etcd/default.etcd" ETCD_LISTEN_PEER_URLS="http://192.168.234.204:2380" ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://192.168.234.204:2379" ETCD_NAME="etcd0" ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.234.204:2380" ETCD_ADVERTISE_CLIENT_URLS="http://192.168.234.204:2379" ETCD_INITIAL_CLUSTER="etcd0=http://192.168.234.204:2380" ETCD_INITIAL_CLUSTER_TOKEN="cluster1" ETCD_INITIAL_CLUSTER_STATE="new" 启动etcd systemctl start etcd 设置etcd自启动 systemctl enable etcd 3. 
PostgreSQL + Patroni HA部署

在需要运行PostgreSQL的实例上安装相关软件

安装PostgreSQL 12

yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql12-server postgresql12-contrib

安装Patroni

yum install -y gcc epel-release
yum install -y python-pip python-psycopg2 python-devel
pip install --upgrade pip
pip install --upgrade setuptools
pip install patroni[etcd]

创建PostgreSQL数据目录

mkdir -p /pgsql/data
chown postgres:postgres -R /pgsql
chmod -R 700 /pgsql/data

创建Patroni service配置文件/etc/systemd/system/patroni.service

[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
#StandardOutput=syslog
ExecStart=/usr/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target

创建Patroni配置文件/etc/patroni.yml,以下是node1的配置示例

scope: pgsql
namespace: /service/
name: pg1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.234.201:8008

etcd:
  host: 192.168.234.204:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        listen_addresses: "0.0.0.0"
        port: 5432
        wal_level: logical
        hot_standby: "on"
        wal_keep_segments: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
  initdb:
  - encoding: UTF8
  - locale: C
  - lc-ctype: zh_CN.UTF-8
  - data-checksums
  pg_hba:
  - host replication repl 0.0.0.0/0 md5
  - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.234.201:5432
  data_dir: /pgsql/data
  bin_dir: /usr/pgsql-12/bin
  authentication:
    replication:
      username: repl
      password: "123456"
    superuser:
      username: postgres
      password: "123456"
  basebackup:
    max-rate: 100M
    checkpoint: fast

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

完整的参数含义可参考Patroni手册中的 YAML Configuration Settings,其中PostgreSQL参数可根据需要自行补充。

其他PG节点的patroni.yml需要相应修改下面3个参数
- name:node1~node3分别设置pg1~pg3
- restapi.connect_address:根据各自节点IP设置
- postgresql.connect_address:根据各自节点IP设置

启动Patroni

先在node1上启动Patroni。

systemctl start patroni

初次启动Patroni时,Patroni会初始创建PostgreSQL实例和用户。

[root@node1 ~]# systemctl status patroni
● patroni.service - Runners to orchestrate a high-availability PostgreSQL
   Loaded: loaded (/etc/systemd/system/patroni.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2020-09-05 14:41:03 CST; 38min ago
 Main PID: 1673 (patroni)
   CGroup: /system.slice/patroni.service
           ├─1673 /usr/bin/python2 /usr/bin/patroni /etc/patroni.yml
           ├─1717 /usr/pgsql-12/bin/postgres -D /pgsql/data --config-file=/pgsql/data/postgresql.conf --listen_addresses=0.0.0.0 --max_worker_processe...
├─1719 postgres: pgsql: logger ├─1724 postgres: pgsql: checkpointer ├─1725 postgres: pgsql: background writer ├─1726 postgres: pgsql: walwriter ├─1727 postgres: pgsql: autovacuum launcher ├─1728 postgres: pgsql: stats collector ├─1729 postgres: pgsql: logical replication launcher └─1732 postgres: pgsql: postgres postgres 127.0.0.1(37154) idle 再在node2上启动Patroni。node2将作为replica加入集群,自动从leader拷贝数据并建立复制。 [root@node2 ~]# systemctl status patroni ● patroni.service - Runners to orchestrate a high-availability PostgreSQL Loaded: loaded (/etc/systemd/system/patroni.service; disabled; vendor preset: disabled) Active: active (running) since Sat 2020-09-05 16:09:06 CST; 3min 41s ago Main PID: 1882 (patroni) CGroup: /system.slice/patroni.service ├─1882 /usr/bin/python2 /usr/bin/patroni /etc/patroni.yml ├─1898 /usr/pgsql-12/bin/postgres -D /pgsql/data --config-file=/pgsql/data/postgresql.conf --listen_addresses=0.0.0.0 --max_worker_processe... ├─1900 postgres: pgsql: logger ├─1901 postgres: pgsql: startup recovering 000000010000000000000003 ├─1902 postgres: pgsql: checkpointer ├─1903 postgres: pgsql: background writer ├─1904 postgres: pgsql: stats collector ├─1912 postgres: pgsql: postgres postgres 127.0.0.1(35924) idle └─1916 postgres: pgsql: walreceiver streaming 0/3000060 查看集群状态 [root@node2 ~]# patronictl -c /etc/patroni.yml list + Cluster: pgsql (6868912301204081018) -------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +--------+-----------------+--------+---------+----+-----------+ | pg1 | 192.168.234.201 | Leader | running | 1 | | | pg2 | 192.168.234.202 | | running | 1 | 0.0 | +--------+-----------------+--------+---------+----+-----------+ 为了方便日常操作,设置全局环境变量PATRONICTL_CONFIG_FILE echo 'export PATRONICTL_CONFIG_FILE=/etc/patroni.yml' >/etc/profile.d/patroni.sh 添加以下环境变量到~postgres/.bash_profile export PGDATA=/pgsql/data export PATH=/usr/pgsql-12/bin:$PATH 设置postgres拥有免密的sudoer权限 echo 'postgres ALL=(ALL) NOPASSWD: ALL'> /etc/sudoers.d/postgres 4. 
自动切换和脑裂防护

Patroni在主库故障时会自动执行failover,确保服务的高可用。但是自动failover如果控制不当,会有产生脑裂的风险。因此Patroni在保障服务的可用性和防止脑裂的双重目标下,会在特定场景下执行一些自动化动作。

| 故障位置 | 场景 | Patroni的动作 |
| --- | --- | --- |
| 备库 | 备库PG停止 | 停止备库PG |
| 备库 | 停止备库Patroni | 停止备库PG |
| 备库 | 强杀备库Patroni(或Patroni crash) | 无操作 |
| 备库 | 备库无法连接etcd | 无操作 |
| 备库 | 非Leader角色但是PG处于生产模式 | 重启PG并切换到恢复模式作为备库运行 |
| 主库 | 主库PG停止 | 重启PG,重启超过master_start_timeout设定时间,进行主备切换 |
| 主库 | 停止主库Patroni | 停止主库PG,并触发failover |
| 主库 | 强杀主库Patroni(或Patroni crash) | 触发failover,此时出现"双主" |
| 主库 | 主库无法连接etcd | 将主库降级为备库,并触发failover |
| - | etcd集群故障 | 将主库降级为备库,此时集群中全部都是备库 |
| - | 同步模式下无可用同步备库 | 临时切换主库为异步复制,在恢复为同步复制之前自动failover暂不生效 |

4.1 Patroni如何防止脑裂

部署在数据库节点上的patroni进程会执行一些保护操作,确保不会出现多个"主库":

- 非Leader节点的PG处于生产模式时,重启PG并切换到恢复模式作为备库运行
- Leader节点的patroni无法连接etcd时,不能确保自己仍然是Leader,将本机的PG降级为备库
- 正常停止patroni时,patroni会顺便把本机的PG进程也停掉

然而,当patroni进程自身无法正常工作时,以上的保护措施难以得到贯彻,比如patroni进程异常终止或主机临时hang等。

为了更可靠地防止脑裂,Patroni支持通过Linux的watchdog监视patroni进程的运行,当patroni进程无法正常往watchdog设备写入心跳时,由watchdog触发Linux重启。具体的配置方法如下

设置Patroni的systemd service配置文件/etc/systemd/system/patroni.service

[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
#StandardOutput=syslog
ExecStartPre=-/usr/bin/sudo /sbin/modprobe softdog
ExecStartPre=-/usr/bin/sudo /bin/chown postgres /dev/watchdog
ExecStart=/usr/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target

设置Patroni自启动

systemctl enable patroni

修改Patroni配置文件/etc/patroni.yml,添加以下内容

watchdog:
  mode: automatic # Allowed values: off, automatic, required
  device: /dev/watchdog
  safety_margin: 5

safety_margin指如果Patroni没有及时更新watchdog,watchdog会在Leader key过期前多久触发重启。在本例的配置(ttl=30,loop_wait=10,safety_margin=5)下,patroni进程每隔10秒(loop_wait)都会更新Leader key和watchdog。如果Leader节点异常导致patroni进程无法及时更新watchdog,会在Leader key过期的前5秒触发重启。重启如果在5秒之内完成,Leader节点有机会再次获得Leader锁,否则Leader key过期后,由备库通过选举选出新的Leader。

这套机制基本上可以保证不会出现"双主",但是这个保证是依赖于watchdog的可靠性的。从生产实践上看,这个保证对绝大部分场景可能是足够的,但是从理论上难以证明它100%可靠。

另一方面,自动重启机器的方式会不会太暴力导致"误杀"呢?比如由于突发的业务访问导致机器负载过高,进而导致patroni进程不能及时分配到CPU资源,此时自动重启机器就未必是我们期望的行为。

那么,有没有其它更可靠的防止脑裂的手段呢?
4.2 利用PostgreSQL同步复制防止脑裂 防止脑裂的另一个手段是把PostgreSQL集群配置成同步复制模式。利用同步复制模式下的主库在没有同步备库应答日志时写入会阻塞的特点,在数据库内部确保即使出现“双主”也不会发生"双写"。采用这种方式防止脑裂是最可靠最安全的,代价是同步复制相对异步复制会降低一点性能。具体设置方法如下 初始运行Patroni时,在Patroni配置文件/etc/patroni.yml中设置同步模式 synchronous_mode:true 对于已部署的Patroni可以通过patronictl命令修改配置 patronictl edit-config -s 'synchronous_mode=true' 此配置下,如果同步备库临时不可用,Patroni会把主库的复制模式降级成了异步复制,确保服务不中断。效果类似于MySQL的半同步复制,但是相比MySQL使用固定的超时时间控制复制降级,这种方式更加智能,同时还具有防脑裂的功效。 在同步模式下,只有同步备库具有被提升为主库的资格。因此如果主库被降级为异步复制,由于没有同步备库作为候选主库failover不会被触发,也就不会出现“双主”。如果主库没有被降级为异步复制,那么即使出现“双主”,由于旧主处于同步复制模式,数据无法被写入,也不会出现“双写”。 Patroni通过动态调整PostgreSQL参数synchronous_standby_names控制同步异步复制的切换。并且Patroni会把同步的状态记录到etcd中,确保同步状态在Patroni集群中的一致性。 正常的同步模式的元数据示例如下: [root@node4 ~]# etcdctl get /service/cn/sync {"leader":"pg1","sync_standby":"pg2"} 备库故障导致主库临时降级为异步复制的元数据如下: [root@node4 ~]# etcdctl get /service/cn/sync {"leader":"pg1","sync_standby":null} 如果集群中包含3个以上的节点,还可以考虑采取更严格的同步策略,禁止Patroni把同步模式降级为异步。这样可以确保任何写入的数据至少存在于2个以上的节点。对数据安全要求极高的业务可以采用这种方式。 synchronous_mode:true synchronous_mode_strict:true 如果集群包含异地的灾备节点,可以根据需要配置该节点为不参与选主,不参与负载均衡,也不作为同步备库。 tags: nofailover: true noloadbalance: true clonefrom: false nosync: true 4.2 etcd不可访问的影响 当Patroni无法访问etcd时,将不能确认自己所处的角色。为了防止这种状态下产生脑裂,如果本机的PG是主库,Patroni会把PG降级为备库。如果集群中所有Patroni节点都无法访问etcd,集群中将全部都是备库,业务无法写入数据。这就要求etcd集群具有非常高的可用性,特别是当我们用一套中心的etcd集群管理几百几千套PG集群的时候。 当我们使用集中式的一套etcd集群管理很多套PG集群时,为了预防etcd集群故障带来的严重影响,可以考虑设置超大的retry_timeout参数,比如1万天,同时通过同步复制模式防止脑裂。 retry_timeout:864000000 synchronous_mode:true retry_timeout用于控制操作DCS和PostgreSQL的重试超时。Patroni对需要重试的操作,除了时间上的限制还有重试次数的限制。对于PostgreSQL操作,目前似乎只有调用GET /patroni的REST API时会重试,而且最多只重试1次,所以把retry_timeout调大不会带来其他副作用。 5. 日常操作 日常维护时可以通过patronictl命令控制Patroni和PostgreSQL,比如修改PotgreSQL参数。 [postgres@node2 ~]$ patronictl --help Usage: patronictl [OPTIONS] COMMAND [ARGS]... Options: -c, --config-file TEXT Configuration file -d, --dcs TEXT Use this DCS -k, --insecure Allow connections to SSL sites without certs --help Show this message and exit. Commands: configure Create configuration file dsn Generate a dsn for the provided member, defaults to a dsn of... edit-config Edit cluster configuration failover Failover to a replica flush Discard scheduled events (restarts only currently) history Show the history of failovers/switchovers list List the Patroni members for a given Patroni pause Disable auto failover query Query a Patroni PostgreSQL member reinit Reinitialize cluster member reload Reload cluster member configuration remove Remove cluster from DCS restart Restart cluster member resume Resume auto failover scaffold Create a structure for the cluster in DCS show-config Show cluster configuration switchover Switchover to a replica version Output version of patronictl command or a running Patroni... 5.1 修改PostgreSQL参数 修改个别节点的参数,可以执行ALTER SYSTEM SET ... 
SQL命令,比如临时打开某个节点的debug日志。对于需要统一配置的参数应该通过patronictl edit-config设置,确保全局一致,比如修改最大连接数。 patronictl edit-config -p 'max_connections=300' 修改最大连接数后需要重启才能生效,因此Patroni会在相关的节点状态中设置一个Pending restart标志。 [postgres@node2 ~]$ patronictl list + Cluster: pgsql (6868912301204081018) -------+----+-----------+-----------------+ | Member | Host | Role | State | TL | Lag in MB | Pending restart | +--------+-----------------+--------+---------+----+-----------+-----------------+ | pg1 | 192.168.234.201 | Leader | running | 25 | | * | | pg2 | 192.168.234.202 | | running | 25 | 0.0 | * | +--------+-----------------+--------+---------+----+-----------+-----------------+ 重启集群中所有PG实例后,参数生效。 patronictl restart pgsql 5.2 查看Patroni节点状态 通常我们可以同patronictl list查看每个节点的状态。但是如果想要查看更详细的节点状态信息,需要调用REST API。比如在Leader锁过期时存活节点却无法成为Leader,查看详细的节点状态信息有助于调查原因。 curl -s http://127.0.0.1:8008/patroni | jq 输出示例如下: [root@node2 ~]# curl -s http://127.0.0.1:8008/patroni | jq { "database_system_identifier": "6870146304839171063", "postmaster_start_time": "2020-09-13 09:56:06.359 CST", "timeline": 23, "cluster_unlocked": true, "watchdog_failed": true, "patroni": { "scope": "cn", "version": "1.6.5" }, "state": "running", "role": "replica", "xlog": { "received_location": 201326752, "replayed_timestamp": null, "paused": false, "replayed_location": 201326752 }, "server_version": 120004 } 上面的"watchdog_failed": true,代表使用了watchdog但是却无法访问watchdog设备,该节点无法被提升为Leader。 6. 客户端访问配置 HA集群的主节点是动态的,主备发生切换时,客户端对数据库的访问也需要能够动态连接到新主上。有下面几种常见的实现方式,下面分别。 多主机URL vip haproxy 6.1 多主机URL pgjdbc和libpq驱动可以在连接字符串中配置多个IP,由驱动识别数据库的主备角色,连接合适的节点。 JDBC JDBC的多主机URL功能全面,支持failover,读写分离和负载均衡。可以通过参数配置不同的连接策略。 jdbc:postgresql://192.168.234.201:5432,192.168.234.202:5432,192.168.234.203:5432/postgres?targetServerType=primary 连接主节点(实际是可写的节点)。当出现"双主"甚至"多主"时驱动连接第一个它发现的可用的主节点 jdbc:postgresql://192.168.234.201:5432,192.168.234.202:5432,192.168.234.203:5432/postgres?targetServerType=preferSecondary&loadBalanceHosts=true 优先连接备节点,无可用备节点时连接主节点,有多个可用备节点时随机连接其中一个。 jdbc:postgresql://192.168.234.201:5432,192.168.234.202:5432,192.168.234.203:5432/postgres?targetServerType=any&loadBalanceHosts=true 随机连接任意一个可用的节点 libpq libpq的多主机URL功能相对pgjdbc弱一点,只支持failover。 postgres://192.168.234.201:5432,192.168.234.202:5432,192.168.234.203:5432/postgres?target_session_attrs=read-write 连接主节点(实际是可写的节点) postgres://192.168.234.201:5432,192.168.234.202:5432,192.168.234.203:5432/postgres?target_session_attrs=any 连接任一可用节点 基于libpq实现的其他语言的驱动相应地也可以支持多主机URL,比如python和php。下面是python程序使用多主机URL创建连接的例子 import psycopg2 conn=psycopg2.connect("postgres://192.168.234.201:5432,192.168.234.202:5432/postgres?target_session_attrs=read-write&password=123456") 6.2 VIP(通过Patroni回调脚本实现VIP漂移) 多主机URL的方式部署简单,但是不是每种语言的驱动都支持,而且如果数据库出现意外的“双主”,配置多主机URL的客户端在多个主上同时写入的概率比较高,而如果客户端通过VIP的方式访问则在VIP上又多了一层防护(这种风险一般在数据库的HA组件没防护好时发生,正如前面介绍的,如果我们配置的是Patroni的同步模式,基本上没有这个担忧)。 Patroni支持用户配置在特定事件发生时触发回调脚本。因此我们可以配置一个回调脚本,在主备切换后动态加载VIP。 准备加载VIP的回调脚本/pgsql/loadvip.sh #!/bin/bash VIP=192.168.234.210 GATEWAY=192.168.234.2 DEV=ens33 action=$1 role=$2 cluster=$3 log() { echo "loadvip: $*"|logger } load_vip() { ip a|grep -w ${DEV}|grep -w ${VIP} >/dev/null if [ $? -eq 0 ] ;then log "vip exists, skip load vip" else sudo ip addr add ${VIP}/32 dev ${DEV} >/dev/null rc=$? if [ $rc -ne 0 ] ;then log "fail to add vip ${VIP} at dev ${DEV} rc=$rc" exit 1 fi log "added vip ${VIP} at dev ${DEV}" arping -U -I ${DEV} -s ${VIP} ${GATEWAY} -c 5 >/dev/null rc=$? 
if [ $rc -ne 0 ] ;then log "fail to call arping to gateway ${GATEWAY} rc=$rc" exit 1 fi log "called arping to gateway ${GATEWAY}" fi } unload_vip() { ip a|grep -w ${DEV}|grep -w ${VIP} >/dev/null if [ $? -eq 0 ] ;then sudo ip addr del ${VIP}/32 dev ${DEV} >/dev/null rc=$? if [ $rc -ne 0 ] ;then log "fail to delete vip ${VIP} at dev ${DEV} rc=$rc" exit 1 fi log "deleted vip ${VIP} at dev ${DEV}" else log "vip not exists, skip delete vip" fi } log "loadvip start args:'$*'" case $action in on_start|on_restart|on_role_change) case $role in master) load_vip ;; replica) unload_vip ;; *) log "wrong role '$role'" exit 1 ;; esac ;; *) log "wrong action '$action'" exit 1 ;; esac 修改Patroni配置文件/etc/patroni.yml,配置回调函数 postgresql: ... callbacks: on_start: /bin/bash /pgsql/loadvip.sh on_restart: /bin/bash /pgsql/loadvip.sh on_role_change: /bin/bash /pgsql/loadvip.sh 所有节点的Patroni配置文件都修改后,重新加载Patroni配置文件 patronictl reload pgsql 执行switchover后,可以看到VIP发生了漂移 /var/log/messages: Sep 5 21:32:24 localvm postgres: loadvip: loadvip start args:'on_role_change master pgsql' Sep 5 21:32:24 localvm systemd: Started Session c7 of user root. Sep 5 21:32:24 localvm postgres: loadvip: added vip 192.168.234.210 at dev ens33 Sep 5 21:32:25 localvm patroni: 2020-09-05 21:32:25,415 INFO: Lock owner: pg1; I am pg1 Sep 5 21:32:25 localvm patroni: 2020-09-05 21:32:25,431 INFO: no action. i am the leader with the lock Sep 5 21:32:28 localvm postgres: loadvip: called arping to gateway 192.168.234.2 注意,如果直接停止主库上的Patroni,上面的脚本不会摘除VIP。主库上的Patroni被停掉后会触发备库failover成为新主,此时新旧主2台机器上都有VIP,但是由于新主执行了arping,一般不会影响应用访问。尽管如此,操作上还是需要注意避免。 6.3 VIP(通过keepalived实现VIP漂移) Patroni提供了用于健康检查的REST API,可以根据节点角色返回正常(200)和异常的HTTP状态码 GET / 或 GET /leader 运行中且是leader节点 GET /replica 运行中且是replica角色,且没有设置tag noloadbalance GET /read-only 和GET /replica类似,但是包含leader节点 使用REST API,Patroni可以和外部组件搭配使用。比如可以配置keepalived动态在主库或备库上绑VIP。 关于Patroni的REST API接口详细,参考Patroni REST API。 下面的例子在一主一备集群(node1和node2)中动态在备节点上绑只读VIP(192.168.234.211),当备节点故障时则将只读VIP绑在主节点上。 安装keepalived yum install -y keepalived 准备keepalived配置文件/etc/keepalived/keepalived.conf global_defs { router_id LVS_DEVEL } vrrp_script check_leader { script "/usr/bin/curl -s http://127.0.0.1:8008/leader -v 2>&1|grep '200 OK' >/dev/null" interval 2 weight 10 } vrrp_script check_replica { script "/usr/bin/curl -s http://127.0.0.1:8008/replica -v 2>&1|grep '200 OK' >/dev/null" interval 2 weight 5 } vrrp_script check_can_read { script "/usr/bin/curl -s http://127.0.0.1:8008/read-only -v 2>&1|grep '200 OK' >/dev/null" interval 2 weight 10 } vrrp_instance VI_1 { state BACKUP interface ens33 virtual_router_id 211 priority 100 advert_int 1 track_script { check_can_read check_replica } virtual_ipaddress { 192.168.234.211 } } 启动keepalived systemctl start keepalived 上面的配置方法也可以用于读写vip的漂移,只要把track_script中的脚本换成check_leader即可。但是在网络抖动或其它临时故障时keepalived管理的VIP容易飘,因此个人更推荐使用Patroni回调脚本动态绑定读写VIP。如果有多个备库,也可以在keepalived中配置LVS对所有备库进行负载均衡,过程就不展开了。 6.4 haproxy haproxy作为服务代理和Patroni配套使用可以很方便地支持failover,读写分离和负载均衡,也是Patroni社区作为Demo的方案。缺点是haproxy本身也会占用资源,所有数据流量都经过haproxy,性能上会有一定损耗。 下面配置通过haproxy访问一主两备PG集群的例子。 安装haproxy yum install -y haproxy 编辑haproxy配置文件/etc/haproxy/haproxy.cfg global maxconn 100 log 127.0.0.1 local2 defaults log global mode tcp retries 2 timeout client 30m timeout connect 4s timeout server 30m timeout check 5s listen stats mode http bind *:7000 stats enable stats uri / listen pgsql bind *:5000 option httpchk http-check expect status 200 default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions server 
postgresql_192.168.234.201_5432 192.168.234.201:5432 maxconn 100 check port 8008 server postgresql_192.168.234.202_5432 192.168.234.202:5432 maxconn 100 check port 8008 server postgresql_192.168.234.203_5432 192.168.234.203:5432 maxconn 100 check port 8008 listen pgsql_read bind *:6000 option httpchk GET /replica http-check expect status 200 default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions server postgresql_192.168.234.201_5432 192.168.234.201:5432 maxconn 100 check port 8008 server postgresql_192.168.234.202_5432 192.168.234.202:5432 maxconn 100 check port 8008 server postgresql_192.168.234.203_5432 192.168.234.203:5432 maxconn 100 check port 8008 如果只有2个节点,上面的GET /replica 需要改成GET /read-only,否则备库故障时就无法提供只读访问了,但是这样配置主库也会参与读,不能完全分离主库的读负载。 启动haproxy systemctl start haproxy haproxy自身也需要高可用,可以把haproxy部署在node1和node2 2台机器上,通过keepalived控制VIP(192.168.234.210)在node1和node2上漂移。 准备keepalived配置文件/etc/keepalived/keepalived.conf global_defs { router_id LVS_DEVEL } vrrp_script check_haproxy { script "pgrep -x haproxy" interval 2 weight 10 } vrrp_instance VI_1 { state BACKUP interface ens33 virtual_router_id 210 priority 100 advert_int 1 track_script { check_haproxy } virtual_ipaddress { 192.168.234.210 } } 启动keepalived systemctl start keepalived 下面做个简单的测试。从node4上通过haproxy的5000端口访问PG,会连到主库上 [postgres@node4 ~]$ psql "host=192.168.234.210 port=5000 password=123456" -c 'select inet_server_addr(),pg_is_in_recovery()' inet_server_addr | pg_is_in_recovery ------------------+------------------- 192.168.234.201 | f (1 row) 通过haproxy的6000端口访问PG,会轮询连接2个备库 [postgres@node4 ~]$ psql "host=192.168.234.210 port=6000 password=123456" -c 'select inet_server_addr(),pg_is_in_recovery()' inet_server_addr | pg_is_in_recovery ------------------+------------------- 192.168.234.202 | t (1 row) [postgres@node4 ~]$ psql "host=192.168.234.210 port=6000 password=123456" -c 'select inet_server_addr(),pg_is_in_recovery()' inet_server_addr | pg_is_in_recovery ------------------+------------------- 192.168.234.203 | t (1 row) haproxy部署后,可以通过它的web接口 http://192.168.234.210:7000/查看统计数据 7. 级联复制 通常集群中所有的备库都从主库复制数据,但是特定的场景下我们可能需要部署级联复制。基于Patroni搭建的PG集群支持2种形式的级联复制。 7. 1 集群内部的级联复制 可以指定某个备库优先从指定成员而不是Leader节点复制数据。相应的配置示例如下: tags: replicatefrom: pg2 replicatefrom只对节点处于Replica角色时有效,并不影响该节点参与Leader选举并成为Leader。当replicatefrom指定的复制源节点故障时,Patroni会自动修改PG切换到从Leader节点复制。 7.2 集群间的级联复制 我们还可以创建一个只读的备集群,从另一个指定的PostgreSQL实例复制数据。这可以用于创建跨数据中心的灾备集群。相应的配置示例如下: 初始创建一个备集群,可以在Patroni配置文件/etc/patroni.yml中加入以下配置 bootstrap: dcs: standby_cluster: host: 192.168.234.210 port: 5432 primary_slot_name: slot1 create_replica_methods: - basebackup 上面的host和port是上游复制源的主机和端口号,如果上游数据库是配置了读写VIP的PG集群,可以将读写VIP作为host避免主集群主备切换时影响备集群。 复制槽选项primary_slot_name是可选的,如果配置了复制槽,需要同时在主集群上配置持久slot,确保在新主上始终保持slot。 slots: slot1: type: physical 对于已配置好的级联集群,可以使用patronictl edit-config命令动态添加standby_cluster设置把主集群变成备集群;以及删除standby_cluster设置把备集群变成主集群。 standby_cluster: host: 192.168.234.210 port: 5432 primary_slot_name: slot1 create_replica_methods: - basebackup 8. 参考 https://patroni.readthedocs.io/en/latest/ http://blogs.sungeek.net/unixwiz/2018/09/02/centos-7-postgresql-10-patroni/ https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/ https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters https://www.percona.com/blog/2019/10/23/seamless-application-failover-using-libpq-features-in-postgresql/
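附:7.2节提到,可以用patronictl edit-config动态增删standby_cluster配置,实现主备集群之间的角色转换。下面是把备集群提升为主集群的一个操作示意(假设备集群的scope为pgsql,具体配置内容以实际环境为准):

```bash
# edit-config 会打开编辑器修改存放在DCS中的动态配置
patronictl -c /etc/patroni.yml edit-config pgsql
# 在编辑器中删除如下配置段并保存,该备集群的Leader会被提升为可读写的主库:
#   standby_cluster:
#     host: 192.168.234.210
#     port: 5432
#     primary_slot_name: slot1
#     create_replica_methods:
#     - basebackup
```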
pg_rewind的功能是在主备切换后回退旧主库上多余的事务变更,以便可以作为新主的备机和新主建立复制关系。通过pg_rewind可以在故障切换后快速恢复旧主,避免整库重建。对于大库,整库重建会很耗时间。 如何识别旧主上多余的变更? 这就用到了PostgreSQL独有的时间线技术,数据库实例的初始时间线是1。以后每次主备切换时,需要提升备库为新主。提升操作会将新主的时间线加1,并且会记录提升时间线的WAL位置(LSN)。这个LSN位点我们称其为新主旧主在时间线上的分叉点。 我们只要扫描旧主上在分叉点之后的WAL记录,就能找到旧主上所有多余的变更。 如何回退旧主上多余的变更? 可能有人会想到可以通过解析分叉点以后的WAL,生成undo SQL,再逆序执行undo SQL实现事务回退。这也是mysql上常用的实现方式。 PG同样也有walminer插件可以支持WAL到undo SQL的解析。但是,这种方式存在很多限制,数据一致性也难以保证。 pg_rewind使用的是不同的方式,过程概述如下 解析旧主上分叉点后的WAL,记录这些事务修改了哪些数据块 对数据文件以外的文件,直接从新主上拉取后覆盖旧主上的文件 对于数据文件,只从新主拉取被旧主修改了的数据块,并覆盖旧主数据文件中对应的数据块 从新主上拉取最新的WAL,覆盖旧主的WAL 把旧主改成恢复模式,恢复的起点则设置为分叉点前的最近一次checkpoint 启动旧主,旧主进入宕机恢复过程,旧主应用完从新主拷贝来的所有WAL后,数据就和新主一致了。 如何保证主备一致? 分叉点之后,新主和旧主上可能有各种各样的变更。除了常规的数据的增删改,还有truncate,表结构变更,表的DROP和CREATE,数据库参数配置变更等等。如何保证pg_rewind之后这些都能一致呢? 下面我们重点看一下pg_rewind对数据文件的拷贝处理。 通过比较target和source节点的数据目录,构建filemap filemap中对每个文件记录了不同的处理方式(即action),对于数据文件,如下 仅存在于新主:FILE_ACTION_COPY 仅存在于旧主:FILE_ACTION_REMOVE 新主文件size>旧主:FILE_ACTION_COPY_TAIL 新主文件size<旧主:FILE_ACTION_TRUNCATE 新主文件size=旧主:FILE_ACTION_NONE 读取旧主本地WAL,获取并记录影响的数据块到filemap中对应的file_entry中的pagemap pagemap属于块级别的拷贝。为了避免文件级别的拷贝做重复的事情,提取影响的块号是做了一些过滤,具体如下: FILE_ACTION_NONE:只记录小于等于新主size的块 FILE_ACTION_TRUNCATE:只记录小于等于新主size的块 FILE_ACTION_COPY_TAIL:只记录小于等于旧主size的块 其他:不记录 遍历filemap,对其中每个file_entry,从新主拷贝必要的数据 从新主拷贝pagemap中记录的块覆盖旧主 根据action,执行不同的文件拷贝操作 FILE_ACTION_NONE:无需处理 FILE_ACTION_COPY:从新主拷贝数据,只拷贝到生成action时看到的新主上的size FILE_ACTION_TRUNCATE:旧主truncate到新主的size FILE_ACTION_COPY_TAIL:从新主拷贝尾部数据,即新主size超出旧主的部分 FILE_ACTION_CREATE:创建目录 FILE_ACTION_REMOVE:删除文件(或目录) 上面的过程汇总后如下: 仅存在于新主(FILE_ACTION_COPY) 从新主拷贝数据,只拷贝到生成action时看到的新主上的size 仅存在于旧主(FILE_ACTION_REMOVE) 删除文件 新主文件 size > 旧主(FILE_ACTION_COPY_TAIL) 对偏移位置小于等于旧主文件 size 的块,从新主拷贝受旧主分叉后更新影响的块 偏移位置为旧主文件 size ~ 新主文件 size 之间的块,从新主拷贝 新主文件 size < 旧主(FILE_ACTION_TRUNCATE) 对偏移位置小于等于新主文件 size 的块,从新主拷贝受旧主分叉后更新影响的块 对偏移位置大于新主文件 size 的块,truncate 掉 新主文件 size = 旧主(FILE_ACTION_NONE) 对偏移位置小于等于新主文件 size 的块,从新主拷贝受旧主分叉后更新影响的块 针对上面的流程,现在回答几个关键的问题 pg_rewind拷贝数据时,新主还处于活动中,拷贝的这些数据块不在同一个事务一致点上,如何将不一致的数据状态变成一致的? 这里用到的技术,就是数据块最擅长的宕机恢复的技术。通过启动旧主后,回放WAL使数据库达到一致的状态。 我们以一个微观的数据块为例进行说明。 具体到某一个数据块,只有三种情况,我们分别讨论 需要从新主拷贝 如果拷贝时发现新主上这个块所在的文件被删掉了,那么也会删掉旧主上的文件。 如果拷贝时新主上这个块被 truncate 掉了,会忽略这个块的拷贝。 如果拷贝时这个块正在被修改,可能导致pg_rewind读到了一个不一致的块。一半是修改前的,另一半是修改后的。 这并没有关系,因为如果这个块被变更了,变更这个块的事务已经记录到WAL了,回放这个WAL时可以修复数据到一致状态。pg_rewind运行的前提条件时数据库必须开启 full_page_write,开启full_page_write后WAL中会记录每个checkpoint后第一次修改的page的完整镜像,宕机恢复时,使用这个镜像,就可以修复数据文件中损坏的数据块。 保留旧主 如果后来新主上这个块变更了,回放WAL时自然可以追到一致的状态 需要从旧主删除 如果后来新主上有新增了这个数据块,同样,回放WAL时自然可以追到一致的状态 如果一个表先被删了,之后又创建一个表结构不一样的同名的表,pg_rewind处理这个表时会不会有问题? PostgreSQL中数据文件名是filenode,初始时它等于表的oid,当表被重写(比如执行truncate或vacuum full)后,会赋值为下一个oid。因此先后创建的同名表,它们对应的文件名是不一样的(MySQL采用表名作为数据文件名。虽然比较直观,但会有很多潜在的问题,比如特殊字符,PostgreSQL的filenode的方式要严谨得多)。 PostgreSQL回放WAL时如何保证可正常执行? 具体要回答几个问题 宕机恢复阶段,回放建表的WAL时,如果对应的文件已存在,结果如何? 宕机恢复阶段,回放删表的WAL时,如果对应的文件不存在,结果如何? 宕机恢复阶段,回放extend数据文件的WAL时,如果对应的块已存在,结果如何? 宕机恢复阶段,回放write数据文件的WAL时,如果对应的块或者文件不存在,结果如何? 宕机恢复阶段,回放truncate数据文件的WAL时,如果对应的块或者文件不存在,结果如何? 存储层的一些接口,已经考虑到REDO的使用场景,做了一些容错,支持幂等性。 REDO中创建文件时,容忍文件已存在 void mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo) { ... fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY); if (fd < 0) { int save_errno = errno; if (isRedo) fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); ... 删除文件时,容忍文件不存在 static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) { ... 
ret = unlink(pcfile_path); if (ret < 0 && errno != ENOENT) ereport(WARNING, (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", pcfile_path))); REDO中打开文件时,都会带上 O_CREAT flag,文件不存在时会创建一个空的 static MdfdVec * _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, bool skipFsync, int behavior) { if ((behavior & EXTENSION_CREATE) || (InRecovery && (behavior & EXTENSION_CREATE_RECOVERY))) { ... flags = O_CREAT; 读或写数据块时,可以 seek 到超出文件大小的位置 void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync) { ... v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); ... nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE); REDO中 truncate 时,如果文件大小已小于 truncate 目标,无视 void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) { ... curnblk = mdnblocks(reln, forknum); if (nblocks > curnblk) { /* Bogus request ... but no complaint if InRecovery */ if (InRecovery) return; 回放truncate记录时,先强制执行一次创建关系的操作 void smgr_redo(XLogReaderState *record) { ... else if (info == XLOG_SMGR_TRUNCATE) { ... /* * Forcibly create relation if it doesn't exist (which suggests that * it was dropped somewhere later in the WAL sequence). As in * XLogReadBufferForRedo, we prefer to recreate the rel and replay the * log as best we can until the drop is seen. */ smgrcreate(reln, MAIN_FORKNUM, true); ... smgrtruncate(reln, forks, nforks, blocks);
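附:Patroni在配置use_pg_rewind: true时会自动利用pg_rewind修复旧主。如果需要手工把旧主恢复成新主的备机,pg_rewind的一个典型调用方式示意如下(假设数据目录为/pgsql/data、新主为192.168.234.202,连接信息请按实际环境调整;旧主需已正常停库,且开启了full_page_writes以及wal_log_hints或数据校验和):

```bash
# 在旧主上执行,以新主为源回退分叉点之后的变更;可先加 --dry-run 只检查不实际修改
/usr/pgsql-12/bin/pg_rewind \
    --target-pgdata=/pgsql/data \
    --source-server='host=192.168.234.202 port=5432 user=postgres password=123456 dbname=postgres' \
    --progress
```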
再谈Citus 多CN部署与Citus MX Citus集群由Coordinator(CN节点)和Worker节点组成。CN节点上放元数据负责SQL分发; Worker节点上放实际的分片,各司其职。但是,Citus里它们的功能也可以灵活的转换。 1. Worker as CN 当一个普通的Worker上存储了元数据后,就有了CN节点分发SQL的能力,可以分担CN的负载。这样的Worker按官方的说法,叫做Citus MX节点。 配置Citus MX的前提条件为Citus的复制模式必须配置为streaming。即不支持在多副本的HA部署架构下使用 citus.replication_model = streaming 然后将普通的Worker变成Citus MX节点 select start_metadata_sync_to_node('127.0.0.1',9002); 默认情况下,Citus MX节点上也会分配分片。官方的Citus MX架构中,Citus MX集群中所有Worker都是Citus MX节点。 如果我们只想让少数几个Worker节点专门用于分担CN负载,那么这些节点上是不需要放分片的。可以通过设置节点的shouldhaveshards属性进行控制。 SELECT master_set_node_property('127.0.0.1', 9002, 'shouldhaveshards', false); 2. CN as Worker Citus里CN节点也可以作为一个Worker加到集群里。 SELECT master_add_node('127.0.0.1', 9001, groupid => 0); CN节点作为Worker后,参考表也会在CN上存一个副本,但默认分片是不会存在上面的。如果希望分片也在CN上分配,可以把CN的shouldhaveshards属性设置为true。 SELECT master_set_node_property('127.0.0.1', 9001, 'shouldhaveshards', true); 配置后Citus集群成员如下: postgres=# select * from pg_dist_node; nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards --------+---------+-----------+----------+----------+-------------+----------+----------+-------------+----------------+------------------ 1 | 1 | 127.0.0.1 | 9001 | default | f | t | primary | default | f | t 3 | 0 | 127.0.0.1 | 9000 | default | t | t | primary | default | f | t 2 | 2 | 127.0.0.1 | 9002 | default | t | t | primary | default | t | f (3 rows) 把CN作为Worker用体现了Citus的灵活性,但是其适用于什么场景呢? 官方文档的举的一个例子是,本地表和参考表可以Join。 这样的场景我们确实有,那个系统的表设计是:明细表分片,维表作参考表,报表作为本地表。报表之所以做成本地表,因为要支持高并发访问,但是又找不到合适的分布键让所有SQL都以路由方式执行。报表做成参考表也不合适,副本太多,影响写入速度,存储成本也高。 那个系统用的Citus 7.4,还不支持这种用法。当时为了支持报表和参考表的Join,建了一套本地维表,通过触发器确保本地维表和参考维表同步。 3. 分片隐藏 在Citus MX节点(含作为Worker的CN节点)上,默认shard是隐藏的,即psql的'd'看不到shard表,只能看到逻辑表。Citus这么做,可能是担心有人误操作shard表。 如果想在Citus MX节点上查看有哪些shard以及shard上的索引。可以使用下面的视图。 citus_shards_on_worker citus_shard_indexes_on_worker 或者设置下面的参数 citus.override_table_visibility = false 4. Citus是怎么隐藏分片的? Citus的plan hook(distributed_planner)中篡改了pg_table_is_visible函数,将其替换成citus_table_is_visible。这个隐藏只对依赖pg_table_is_visible函数的地方有效,比如psql的\d。直接用SQL访问shard表是不受影响的。 static bool ReplaceTableVisibleFunctionWalker(Node *inputNode) { ... if (functionId == PgTableVisibleFuncId()) { ... functionToProcess->funcid = CitusTableVisibleFuncId(); ... 5. Citus多CN方案的限制和不足 不能和多副本同时使用 Citus MX节点不能访问本地表 不能控制Citus MX节点上不部署参考表 6. 参考 https://yq.aliyun.com/articles/647370 https://docs.citusdata.com/en/v9.3/arch/mx.html
Citus7.4-Citus 9.3新特性 最近开始着手Citus7.4到Citus 9.3的升级,所以比较全面地浏览了这期间的Citus变更。从Citus7.4到Citus 9.3很多方面的改进,本文只列出一些比较重要的部分。 以下用到了一些示例,示例的验证环境如下 软件 PostreSQL 12 Citus 9.3 集群成员 CN 127.0.0.1:9000 Worker 127.0.0.1:9001 127.0.0.1:9002 SQL支持增强类 1.支持非分区列的count distinct 这个Citus 7.4应该已经支持了,不知道是Citus的Changelog更新延误,还是Citus 7.5支持得更完善了。 表定义 create table tb1(id int,c1 int); select create_distributed_table('tb1','id'); 非分区列的count distinct的执行计划 postgres=# explain select count(distinct c1) from tb1; QUERY PLAN ------------------------------------------------------------------------------------------ Aggregate (cost=250.00..250.01 rows=1 width=8) -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=4) Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=127.0.0.1 port=9001 dbname=postgres -> HashAggregate (cost=38.25..40.25 rows=200 width=4) Group Key: c1 -> Seq Scan on tb1_102339 tb1 (cost=0.00..32.60 rows=2260 width=4) (9 rows) 2.支持UPSERT 支持UPSERT,即支持INSERT INTO SELECT..ON CONFLICT/RETURNING 表定义 create table tb1(id int, c1 int); select create_distributed_table('tb1','id'); create table tb2(id int primary key, c1 int); select create_distributed_table('tb2','id'); UPSERT SQL执行 postgres=# INSERT INTO tb2 SELECT * from tb1 ON CONFLICT(id) DO UPDATE SET c1 = EXCLUDED.c1; INSERT 0 1 3.支持GENERATED ALWAYS AS STORED 使用示例如下: create table tbgenstore(id int, c1 int GENERATE ALWAYS AS (id+1)STORED); select create_distributed_table('tbgenstore','id'); 4.支持用户定义的分布式函数 支持用户自定义分布式函数。Citus会把分布式函数(包括聚合函数)以及依赖的对象定义下发到所有Worker上。后续在执行SQL的时候也可以合理的把分布式函数的执行下推到Worker。 分布式函数还可以和某个分布表绑定"亲和"关系,这一个特性的使用场景如下: 在多租户类型的业务中,把单个租户的一个事务中的多个SQL打包成一个“分布式函数”下发到Worker上。CN只需要下推一次分布式函数的调用,分布式函数内部的多个SQL的执行全部在Worker节点内部完成。避免CN和Worker之间来回交互,可以大大提升OLTP的性能(利用这个特性去跑TPCC,简直太溜了!)。 下面看下手册里的例子。 https://docs.citusdata.com/en/v9.3/develop/api_udf.html?highlight=distributed%20function#create-distributed-function -- an example function which updates a hypothetical -- event_responses table which itself is distributed by event_id CREATE OR REPLACE FUNCTION register_for_event(p_event_id int, p_user_id int) RETURNS void LANGUAGE plpgsql AS $fn$ BEGIN INSERT INTO event_responses VALUES ($1, $2, 'yes') ON CONFLICT (event_id, user_id) DO UPDATE SET response = EXCLUDED.response; END; $fn$; -- distribute the function to workers, using the p_event_id argument -- to determine which shard each invocation affects, and explicitly -- colocating with event_responses which the function updates SELECT create_distributed_function( 'register_for_event(int, int)', 'p_event_id', colocate_with := 'event_responses' ); 5.完全支持聚合函数 Citus中对聚合函数有3种不同的执行方式 按照分片字段分组的聚合,直接下推到Worker执行聚合 对部分Citus能够识别的聚合函数,Citus执行两阶段聚合,现在Worker执行部分聚合,再把结果汇总到CN上进行最终聚合。 对其他的聚合函数,Citus把数据拉到CN上,在CN上执行聚合。 详细参考,https://docs.citusdata.com/en/v9.3/develop/reference_sql.html?highlight=Aggregation#aggregate-functions 显然第3种方式性能会比较差,对不按分片字段分组的聚合,怎么让它按第2种方式执行呢? 
Citus中预定义了一部分聚合函数可以按第2中方式执行。 citus-9.3.0/src/include/distributed/multi_logical_optimizer.h: static const char *const AggregateNames[] = { "invalid", "avg", "min", "max", "sum", "count", "array_agg", "jsonb_agg", "jsonb_object_agg", "json_agg", "json_object_agg", "bit_and", "bit_or", "bool_and", "bool_or", "every", "hll_add_agg", "hll_union_agg", "topn_add_agg", "topn_union_agg", "any_value" }; 对不在上面白名单的聚合函数,比如用户自定义的聚合函数,可以通过create_distributed_function()添加。示例如下: citus-9.3.0/src/test/regress/expected/aggregate_support.out: create function sum2_sfunc(state int, x int) returns int immutable language plpgsql as $$ begin return state + x; end; $$; create function sum2_finalfunc(state int) returns int immutable language plpgsql as $$ begin return state * 2; end; $$; create aggregate sum2 (int) ( sfunc = sum2_sfunc, stype = int, finalfunc = sum2_finalfunc, combinefunc = sum2_sfunc, initcond = '0' ); select create_distributed_function('sum2(int)'); 执行这个自定义的聚合函数的执行计划如下 postgres=# explain select sum2(c1) from tb1; QUERY PLAN ------------------------------------------------------------------------------------------ Aggregate (cost=250.00..250.01 rows=1 width=4) -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=32) Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=127.0.0.1 port=9001 dbname=postgres -> Aggregate (cost=38.25..38.26 rows=1 width=32) -> Seq Scan on tb1_102339 tb1 (cost=0.00..32.60 rows=2260 width=4) (8 rows) 但是当前这种方式不支持stype = internal的自定义聚合函数。Citus社区已经在对应这个问题,详细参考https://github.com/citusdata/citus/issues/3916 6.完全支持窗口函数 对不按分片字段分组的聚合函数,Citus支持把数据拉到CN上再执行,和聚合函数类型。需要注意这种执行方式对性能的影响,特别是包含多个不同分组字段的窗口函数时,Worker拉到CN上结果集是这些字段组合的笛卡尔积。 7.支持在事务块中传播LOCAL参数 当在CN的事务块中设置LOCAL参数时,可以把这个参数传播到Worker节点。 前提条件是citus.propagate_set_commands参数必须为local set citus.propagate_set_commands TO local; 事务块中设置LOCAL参数 postgres=# begin; BEGIN postgres=*# set local enable_hashagg to off; SET postgres=*# SELECT current_setting('enable_hashagg') FROM tb1 WHERE id = 3; current_setting ----------------- off (1 row) 8. 支持本地表和参考表Join 如果一个数据库需要用到本地表,而本地表和以参考表的形式部署的维表又有Join的需求,改如何处理? 原来我们只能在CN上再创建一套本地的维表,然后由应用或者通过触发器维护两套维表之间的数据同步。 现在可以用更简单的方式实现。具体就是把CN节点也可以作为一个Worker加到Citus集群里,groupid一定要设置为0。 SELECT master_add_node('127.0.0.1', 9001, groupid => 0); 这样CN上也就和其他Worker一样拥有了参考表的一个副本,本地表和参考表Join的时候就直接在本地执行了。 DDL支持增强 9.支持把SCHEMA的赋权广播到Worker上 GRANT USAGE ON SCHEMA dist_schema TO role1; 10.支持修改表SCHEMA广播到Worker上 ALTER TABLE ... SET SCHEMA 11.支持创建索引时指定INCLUDE选项 create index tb1_idx_id on tb1(id) include (c1); 12. 支持使用CONCURRENTLY选项创建索引 create index CONCURRENTLY tb1_idx_id2 on tb1(id); 13. 
支持传播REINDEX到worker节点上 之前版本reindex不能传播到Worker节点,还需要到每个worker分别执行reindex。新版的Citus支持了。 reindex index tb1_idx_id; Citus MX功能增强 14.支持在MX 节点上对参考表执行DML 表定义 create table tbref(id int, c1 int); select create_refence_table('tbref'); 在MX worker(即扩展worker)上修改参考表 postgres=# insert into tbref values(1,1),(2,2); INSERT 0 2 postgres=# update tbref set c1=10; UPDATE 2 postgres=# delete from tbref where id=1; DELETE 1 postgres=# select * from tbref; id | c1 ----+---- 2 | 10 (1 row) 15.支持在MX节点上执行TRUNCATE 之前MX节点上是不支持对分布表和参考表执行truncate操作的。现在也支持了 postgres=# truncate tb1; TRUNCATE TABLE postgres=# truncate tbref; TRUNCATE TABLE 16.支持在Citus MX架构下使用serial和smallserial 之前在Citus MX(即多CN部署)环境下,自增序列只能使用bigserial类型,现在也可以支持serial和smallserial了。 表定义 create table tbserial(id int,c1 int); select create_distributed_table('tbserial','id'); Citus中,自增字段通过CN和MX节点上逻辑表上的序列对象实现。 postgres=# \d tbserial Table "public.tbserial" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+-------------------------------------- id | integer | | not null | nextval('tbserial_id_seq'::regclass) c1 | integer | | | 为了防止多个MX节点产生的序列冲突。在Citus MX环境下,序列值的开头部分是产生序列的节点的groupid,后面才是顺序累加的值。这等于按groupid把序列值分成了不同的范围,互不重叠。 即: 全局序列值 = groupid,节点内的顺序递增值 对不同serial的数据类型,groupid占的位数是不一样的。具体如下 bigserial:16bit serial:4bit smallserial:4bit 根据上groupid占的长度,我们需要注意 单个节点(CN或扩展Worker)上,能产生的序列值的数量变少了,要防止溢出。 如果使用了serial或smallserial,最多部署7个扩展Worker节点。 序列对象的定义 上面提到的全局序列的实现具体体现为:在不同节点上,序列对象定义的范围不一样。如下 CN节点上的序列对象定义(CN节点的groupid固定为0) postgres=# \d tbserial_id_seq Sequence "public.tbserial_id_seq" Type | Start | Minimum | Maximum | Increment | Cycles? | Cache ---------+-------+---------+------------+-----------+---------+------- integer | 1 | 1 | 2147483647 | 1 | no | 1 Owned by: public.tbserial.id MX Worker节点上的序列对象定义(groupid=1) postgres=# \d tbserial_id_seq Sequence "public.tbserial_id_seq" Type | Start | Minimum | Maximum | Increment | Cycles? | Cache --------+-----------+-----------+-----------+-----------+---------+------- bigint | 268435457 | 268435457 | 536870913 | 1 | no | 1 如何知道每个Worker节点的groupid? 每个Worker节点的groupid可以从pg_dist_node获取。 postgres=# select * from pg_dist_node; nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards --------+---------+-----------+----------+----------+-------------+----------+----------+-------------+----------------+------------------ 2 | 2 | 127.0.0.1 | 9002 | default | t | t | primary | default | t | t 1 | 1 | 127.0.0.1 | 9001 | default | t | t | primary | default | t | t (2 rows) 也可以在每个节点本地查询pg_dist_local_group获得本节点的groupid。 postgres=# select * from pg_dist_local_group; groupid --------- 1 (1 row) CN节点和普通的Worker节点(非MX Worker)的pg_dist_local_group中查询到的groupid都为0. 17.在Citus MX通过本地执行提升性能 之前测试Citus MX架构的时候发现,当Citus MX节点上放分片时,性能比不放分片差一倍。新版的Citus在这方面做了优化,当在Citus MX节点上访问本节点上的分片时,不再走新建一个到本地的数据库连接再读写分片的常规执行方式。而是直接用当前连接访问分片。根据下面的测试数据,性能可以提升一倍。 https://github.com/citusdata/citus/pull/2938 - Test 1: HammerDB test with 250 users, 1,000,000 transactions per. 8 Node Citus MX - (a) With local execution: `System achieved 116473 PostgreSQL TPM at 160355 NOPM` - (b) without local execution: ` System achieved 61392 PostgreSQL TPM at 100503 NOPM` - Test 2: HammerDB test with 250 users, 10,000,000 transactions per. 
8 Node Citus MX - (a) With local execution: `System achieved 91921 PostgreSQL TPM at 174557 NOPM` - (b) without local execution: ` System achieved 84186 PostgreSQL TPM at 98408 NOPM` - Test 3: Pgbench, 1 worker node, -c64 -c256 -T 120 - (a) Local execution enabled (tps): `select-only`: 56202 `simple-update`: 11771 `tpcb-like`: 7796 - (a) Local execution disabled (tps): `select-only`: 24524 `simple-update`: 5077 `tpcb-like`: 3510 (some connection errors for tpcb-like) 在我司的多CN部署方式下,扩展Worker上是不放分片的。所以这个优化和我们无关。 性能增强 18.替换real-time为新的执行器Adaptive Executor Adaptive Executor是一个新的执行器,它和real-time的差异主要体现在可以通过参数对CN到worker的连接数进行控制。具体如下: citus.max_shared_pool_size 可以通过`citus.max_shared_pool_size`控制CN(或MX Worker)在单个Worker上可同时建立的最大连接数,默认值等于CN的`max_connections`。 达到连接数使用上限后,新的SQL请求可能等待,有些操作不受限制,比如COPY和重分区的Join。 Citus MX架构下,单个Worker上同时接受到连接数最大可能是 `max_shared_pool_size * (1 + MX Worker节点数)` citus.max_adaptive_executor_pool_size 可以通过`citus.max_adaptive_executor_pool_size`控制CN(或MX Worker)上的单个会话在单个Worker上可同时建立的最大连接数,默认值等于16。 citus.max_cached_conns_per_worker 可以通过`citus.max_cached_conns_per_worker`控制CN(或MX Worker)上的单个会话在事务结束后对每个Worker缓存的连接数,默认值等于1。 citus.executor_slow_start_interval 对于执行时间很短的多shard的SQL,并发开多个连接,不仅频繁创建销毁连接的消耗很高,也极大的消耗了worker上有限的连接资源。 adaptive执行器,在执行多shard的SQL时,不是一次就创建出所有需要的连接数,而是先创建一部分,隔一段时间再创建一部分。 中途如果有shard的任务提前完成了,它的连接可以被复用,就可以减少对新建连接的需求。 因此执行多shard的SQL最少只需要一个连接,最多不超过`max_adaptive_executor_pool_size`,当然也不会超过目标worker上的shard数。 这个算法叫"慢启动",慢启动的间隔由参数`citus.executor_slow_start_interval`控制,默认值为10ms。 初始创建的连接数是:max(1,`citus.max_cached_conns_per_worker`),之后每批新建的连接数都在前一批的基础上加1。 即默认情况下,每批新建的连接数依次为1,2,3,4,5,6... "慢启动"主要优化了短查询,对长查询(手册上给的标准是大于500ms),会增加一定的响应时间。 下面看几个例子 citus.max_shared_pool_size的使用示例 postgres=# alter system set citus.max_shared_pool_size to 4; ALTER SYSTEM postgres=# select pg_reload_conf(); pg_reload_conf ---------------- t (1 row) postgres=# begin; BEGIN postgres=*# update tb1 set c1=11; UPDATE 1 postgres=*# select * from citus_remote_connection_stats(); hostname | port | database_name | connection_count_to_node -----------+------+---------------+-------------------------- 127.0.0.1 | 9002 | postgres | 4 127.0.0.1 | 9001 | postgres | 4 (2 rows) citus.executor_slow_start_interval的使用示例 tb1总共有32个分片,每个worker上有16个分片。初始每个worker上保持2个连接 postgres=# select * from citus_remote_connection_stats(); hostname | port | database_name | connection_count_to_node -----------+------+---------------+-------------------------- 127.0.0.1 | 9002 | postgres | 2 127.0.0.1 | 9001 | postgres | 2 (2 rows) citus.executor_slow_start_interval = '10ms'时,执行一个空表的update,只额外创建了2个新连接。 postgres=# set citus.executor_slow_start_interval='10ms'; SET postgres=# begin; BEGIN postgres=*# update tb1 set c1=100; UPDATE 0 postgres=*# select * from citus_remote_connection_stats(); hostname | port | database_name | connection_count_to_node -----------+------+---------------+-------------------------- 127.0.0.1 | 9002 | postgres | 4 127.0.0.1 | 9001 | postgres | 4 (2 rows) citus.executor_slow_start_interval = '500ms'时,没有创建新的连接,都复用了一个缓存的连接 postgres=# set citus.executor_slow_start_interval='500ms'; SET postgres=# begin; BEGIN postgres=*# update tb1 set c1=100; UPDATE 0 postgres=*# select * from citus_remote_connection_stats(); hostname | port | database_name | connection_count_to_node -----------+------+---------------+-------------------------- 127.0.0.1 | 9002 | postgres | 2 127.0.0.1 | 9001 | postgres | 2 (2 rows) citus.executor_slow_start_interval = '0ms'时,创建了比较多的新连接。 postgres=# set citus.executor_slow_start_interval = '0ms'; SET 
postgres=# begin; BEGIN postgres=*# update tb1 set c1=100; UPDATE 0 postgres=*# select * from citus_remote_connection_stats(); hostname | port | database_name | connection_count_to_node -----------+------+---------------+-------------------------- 127.0.0.1 | 9002 | postgres | 5 127.0.0.1 | 9001 | postgres | 14 (2 rows) 参考 adaptive执行器连接创建"慢启动"的代码参考: citus-9.3.0/src/backend/distributed/executor/adaptive_executor.c: static void ManageWorkerPool(WorkerPool *workerPool) { ... /* cannot open more than targetPoolSize connections */ int maxNewConnectionCount = targetPoolSize - initiatedConnectionCount;//targetPoolSize的值为max(1,`citus.max_cached_conns_per_worker`) /* total number of connections that are (almost) available for tasks */ int usableConnectionCount = UsableConnectionCount(workerPool); /* * Number of additional connections we would need to run all ready tasks in * parallel. */ int newConnectionsForReadyTasks = readyTaskCount - usableConnectionCount; /* * Open enough connections to handle all tasks that are ready, but no more * than the target pool size. */ newConnectionCount = Min(newConnectionsForReadyTasks, maxNewConnectionCount); if (newConnectionCount > 0 && ExecutorSlowStartInterval != SLOW_START_DISABLED) { if (MillisecondsPassedSince(workerPool->lastConnectionOpenTime) >= ExecutorSlowStartInterval) { newConnectionCount = Min(newConnectionCount, workerPool->maxNewConnectionsPerCycle); /* increase the open rate every cycle (like TCP slow start) */ workerPool->maxNewConnectionsPerCycle += 1; } else { /* wait a bit until opening more connections */ return; } } 19.通过adaptive执行器执行重分布的Join 当citus.enable_repartition_joins=on时,Citus支持通过数据重分布的方式执行非亲和Inner Join,之前版本Citus会自动切换到task-tracker执行器执行重分布的Join,但是使用task-tracker执行器需要CN节点给Worker下发任务再不断检查任务完成状态,其额外消耗很大,响应时间非常长。 新版Citus改进后,可以通过adaptive执行器执行重分布的Join。 根据官网博客,1000w以下数据的重分布Join,性能提升了10倍。详细参考:https://www.citusdata.com/blog/2020/03/02/citus-9-2-speeds-up-large-scale-htap/ 我们自己的简单测试中,2张空表的重分布Join,之前需要16秒,现在只需要2秒。 20.支持重分布的方式执行INSERT...SELECT 表定义 create table tb1(id int, c1 int); select create_distributed_table('tb1','id'); set citus.shard_count to 16; create table tb2(id int primary key, c1 int); select create_distributed_table('tb2','id'); tb1和tb2的分片数不一样,即它们不是亲和的。此前,Citus必须把数据全拉到CN节点上中转。新版Citus可以通过重分布的方式执行这个SQL,各个Worker之间直接互相传送数据,CN节点只执行工具函数驱动任务执行,性能可大幅提升。 postgres=# explain INSERT INTO tb2 SELECT * from tb1 ON CONFLICT(id) DO UPDATE SET c1 = EXCLUDED.c1; QUERY PLAN ------------------------------------------------------------------------------------ Custom Scan (Citus INSERT ... 
SELECT) (cost=0.00..0.00 rows=0 width=0) INSERT/SELECT method: repartition -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=8) Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=127.0.0.1 port=9001 dbname=postgres -> Seq Scan on tb1_102339 tb1 (cost=0.00..32.60 rows=2260 width=8) (8 rows) 根据官网博客,这项优化使性能提升了5倍。详细参考:https://www.citusdata.com/blog/2020/03/02/citus-9-2-speeds-up-large-scale-htap/ 需要注意的是,如果插入时,需要在目标表上自动生成自增字段,Citus会退回到原来的执行方式,数据都会经过CN中转一下。 21.支持以轮询的方式访问参考表的多个副本 之前Citus查询参考表时,始终只访问参考表的第一个副本,新版Citus可以通过参数设置,在参考表多个副本轮询访问,均衡负载。 postgres=# set citus.task_assignment_policy TO "round-robin"; SET postgres=# explain select * from tbref; QUERY PLAN ---------------------------------------------------------------------------------- Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) Task Count: 1 Tasks Shown: All -> Task Node: host=127.0.0.1 port=9001 dbname=postgres -> Seq Scan on tbref_102371 tbref (cost=0.00..32.60 rows=2260 width=8) (6 rows) postgres=# explain select * from tbref; QUERY PLAN ---------------------------------------------------------------------------------- Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) Task Count: 1 Tasks Shown: All -> Task Node: host=127.0.0.1 port=9002 dbname=postgres -> Seq Scan on tbref_102371 tbref (cost=0.00..32.60 rows=2260 width=8) (6 rows) citus.task_assignment_policy的默认值是greedy。greedy比较适合多副本的分布表。对于涉及多个shard的SQL,每个shard都有多个可选的副本,在greedy策略下,Citus会尽量确保每个worker分配到任务数相同。 具体实现时Citus一次轮询所有Worker,直到把所有shard任务都分配完。因此对参考表这种只有一个shard的场景,greedy会导致其始终把任务分配给第一个worker。详细可以参考GreedyAssignTaskList()函数的代码。 22.表数据导出优化 Citus导出数据时,中间结果会写到在CN上,而且CN从Worker拉数据是并行拉的,不过Worker还是CN负载都会很高。新版Citus优化了COPY导出处理,依次从每个Worker上抽出数返回给客户端,中途数据不落盘。 但是这一优化只适用于下面这种固定形式的全表COPY到STDOUT的场景 COPY table tb1 to STDOUT 这可以大大优化pg_dump,延迟更低,内存使用更少。 集群管理增强 23.支持控制worker不分配shard 可以通过设置节点的shouldhaveshards属性控制某个节点不放分片。 SELECT master_set_node_property('127.0.0.1', 9002, 'shouldhaveshards', false); shouldhaveshards属性会对后续创建新的分布表和参考表生效。也会对后续执行的企业版Citus的rebalance功能生效,社区版不支持rebalance,但如果自研Citus部署和维护工具也可以利用这个参数。 扩展Worker的实现逻辑改为使用这个参数,简化处理逻辑,不用先建好分布表后再挪分片。 扩缩容脚本也可以使用这个参数决定Worker上是否放置分片,不需要区分是不是全部是扩展Worker的部署架构 24.支持使用master_update_node实施failover 采用主备流复制实现Worker高可用时,一般CN通过VIP访问Worker,worker主备切换时只需要漂移vip到新的主节点即可。新版Citus提供了一个新的可选方案,通过master_update_node()函数修改某个worker的IP和Port。这提供了一种新的不依赖VIP的Worker HA实现方案。 postgres=# \df master_update_node List of functions -[ RECORD 1 ]-------+----------------------------------------------------------------------------------------------------------------------------- Schema | pg_catalog Name | master_update_node Result data type | void Argument data types | node_id integer, new_node_name text, new_node_port integer, force boolean DEFAULT false, lock_cooldown integer DEFAULT 10000 Type | func 25.支持变更亲和定义 新版Citus可以在分布表创建后,修改亲和关系。 表定义 create table tba(id int,c1 int); select create_distributed_table('tba','id'); create table tbb(id int,c1 int); select create_distributed_table('tbb','id'); create table tbc(id text,c1 int); select create_distributed_table('tbc','id'); tba和tbb这两个表是亲和的 postgres=# select * from pg_dist_partition where logicalrelid in ('tba'::regclass,'tbb'::regclass); logicalrelid | partmethod | partkey | colocationid | repmodel --------------+------------+------------------------------------------------------------------------------------------------------------------------+--------------+---------- tba | h | {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 
-1} | 3 | s tbb | h | {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1} | 3 | s (2 rows) 将tbb设置为新的亲和ID,打破它们的亲和关系 postgres=# SELECT update_distributed_table_colocation('tbb', colocate_with => 'none'); update_distributed_table_colocation ------------------------------------- (1 row) postgres=# select * from pg_dist_partition where logicalrelid in ('tba'::regclass,'tbb'::regclass); logicalrelid | partmethod | partkey | colocationid | repmodel --------------+------------+------------------------------------------------------------------------------------------------------------------------+--------------+---------- tba | h | {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1} | 3 | s tbb | h | {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1} | 14 | s (2 rows) 重新设置它们亲和 postgres=# SELECT update_distributed_table_colocation('tbb', colocate_with => 'tba'); update_distributed_table_colocation ------------------------------------- (1 row) postgres=# select * from pg_dist_partition where logicalrelid in ('tba'::regclass,'tbb'::regclass); logicalrelid | partmethod | partkey | colocationid | repmodel --------------+------------+------------------------------------------------------------------------------------------------------------------------+--------------+---------- tba | h | {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1} | 3 | s tbb | h | {VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location -1} | 3 | s (2 rows) 也可以用批量将一组表设置为和某一个表亲和 postgres=# SELECT mark_tables_colocated('tba', ARRAY['tbb', 'tbc']); ERROR: cannot colocate tables tba and tbc DETAIL: Distribution column types don't match for tba and tbc. tbc的分片字段类型不一致,不能亲和,去掉tbc再次执行成功。 postgres=# SELECT mark_tables_colocated('tba', ARRAY['tbb']); mark_tables_colocated ----------------------- (1 row) 26.支持truncate分布表的本地数据 把一个原来就有数据的本地表创建成分布表,会把原来的数据拷贝到各个shard上,但原始本地表上的数据不会删除,只是对用户不可见。 原来没有直接的办法删掉这些不需要的本地数据(可以通过临时篡改元数据的方式删),现在可以用一个函数实现。 SELECT truncate_local_data_after_distributing_table('tb1'); 27. 延迟复制参考表副本 当新的Worker节点添加到Citus集群的时候,会同步参考表的副本到上面。如果集群中存在比较大参考表,会导致添加Worker节点的时间不可控。这可能使得用户不敢在业务高峰期扩容节点。 现在Citus可以支持把参考表的同步延迟到下次创建分片的的时候。方法就是设置下面这个参数为off,它的默认值为on。 citus.replicate_reference_tables_on_activate = off 这样我们可以在白天扩容,夜里在后台同步数据。 28.创建集群范围一致的恢复点 之前我们备份Citus集群的时候,都是各个节点各自备份恢复,真发生故障,没办法恢复到一个集群范围的一致点。 现在可以使用下面的函数,创建一个全局的恢复点实行全局一致性备份。使用方法类似于PG的pg_create_restore_point(),详细可参考手册。 select citus_create_restore_point('foo'); 29.支持设置Citus集群节点间互联的连接选项 可以通过citus.node_conninfo参数设置Citus内节点间互连的一些非敏感的连接选项。支持连接选项下面的libpq的一个子集。 application_name connect_timeout gsslib keepalives keepalives_count keepalives_idle keepalives_interval krbsrvname sslcompression sslcrl sslmode (defaults to “require” as of Citus 8.1) sslrootcert Citus 8.1以后,在支持SSL的PostgreSQL上,citus.node_conninfo的默认值为'sslmode=require'。即默认开启了SSL。这是Citus出于安全的考虑,但是启用SSL后部署和维护会比较麻烦。因此我们的部署环境下,需要将其修改为sslmode=prefer。 postgres=# show citus.node_conninfo; citus.node_conninfo --------------------- sslmode=prefer (1 row) 30.默认关闭Citus统计收集 之前Citus的守护进程默认会收集Citus集群的一些元数据信息上报到CitusData公司的服务上(明显有安全问题)。新版本把这个功能默认关闭了。当然更彻底的做法是在编译Citus的时候就把这个功能屏蔽掉。 postgres=# show citus.enable_statistics_collection; citus.enable_statistics_collection ------------------------------------ off (1 row) 31. 
增加查看集群范围活动的函数和视图 新版Citus提供了几个函数和视图,可以在CN上非常方便的查看整体Citus的当前活动状况 citus_remote_connection_stats() 查看所有worker上的来自CN节点和MX Worker节点的远程连接数。 postgres=# select * from citus_remote_connection_stats(); hostname | port | database_name | connection_count_to_node -----------+------+---------------+-------------------------- 127.0.0.1 | 9002 | postgres | 3 127.0.0.1 | 9001 | postgres | 3 (2 rows) citus_dist_stat_activity 查看从本CN节点或MX worker上发起的活动。这个视图在pg_stat_activity上附加了一些Citus相关的信息。 postgres=# select * from citus_dist_stat_activity; -[ RECORD 1 ]----------+------------------------------ query_hostname | coordinator_host query_hostport | 9000 master_query_host_name | coordinator_host master_query_host_port | 9000 transaction_number | 57 transaction_stamp | 2020-06-19 15:05:22.142242+08 datid | 13593 datname | postgres pid | 2574 usesysid | 10 usename | postgres application_name | psql client_addr | client_hostname | client_port | -1 backend_start | 2020-06-19 10:57:58.472994+08 xact_start | 2020-06-19 15:05:17.45487+08 query_start | 2020-06-19 15:05:22.140954+08 state_change | 2020-06-19 15:05:22.140957+08 wait_event_type | Client wait_event | ClientRead state | active backend_xid | backend_xmin | 5114 query | select * from tb1; backend_type | client backend 注意上面的transaction_number,它代表一个事务号。涉及更新的SQL,事务块中查询和push-pull方式执行的查询都会分配一个非0的事务号。通过这个事务号,我们可以很容易地识别出所有worker上来自同一SQL(或事务)的活动。 详细参考下面的注释。(这段注释应该写错了,下面2类SQL的区别不是是否能被'show',而是transaction_number是否非0)citus-9.3.0/src/backend/distributed/transaction/citus_dist_stat_activity.c * An important note on this views is that they only show the activity * that are inside distributed transactions. Distributed transactions * cover the following: * - All multi-shard modifications (DDLs, COPY, UPDATE, DELETE, INSERT .. SELECT) * - All multi-shard queries with CTEs (modifying CTEs, read-only CTEs) * - All recursively planned subqueries * - All queries within transaction blocks (BEGIN; query; COMMMIT;) * * In other words, the following types of queries won't be observed in these * views: * - Single-shard queries that are not inside transaction blocks * - Multi-shard select queries that are not inside transaction blocks * - Task-tracker queries citus_worker_stat_activity 查看所有worker上的活动。排除非citus会话,即不经过CN或MX worker直连worker的会话。 我们可以指定transaction_number查看特定SQL在worker上的活动。 postgres=# select * from citus_worker_stat_activity where transaction_number = 57; -[ RECORD 1 ]----------+--------------------------------------------- query_hostname | 127.0.0.1 query_hostport | 9001 master_query_host_name | coordinator_host master_query_host_port | 9000 transaction_number | 57 transaction_stamp | 2020-06-19 15:05:22.142242+08 datid | 13593 datname | postgres pid | 4108 usesysid | 10 usename | postgres application_name | citus client_addr | 127.0.0.1 client_hostname | client_port | 33676 backend_start | 2020-06-19 15:05:22.162829+08 xact_start | 2020-06-19 15:05:22.168811+08 query_start | 2020-06-19 15:05:22.171398+08 state_change | 2020-06-19 15:05:22.172237+08 wait_event_type | Client wait_event | ClientRead state | idle in transaction backend_xid | backend_xmin | query | SELECT id, c1 FROM tb1_102369 tb1 WHERE true backend_type | client backend ... 
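利用transaction_number还可以把CN上的某个会话与它在各worker上的活动直接关联起来。下面是一个示意查询(假设2574是上面citus_dist_stat_activity示例中CN会话的pid,实际使用时替换为要排查的pid):

SELECT w.query_hostname, w.query_hostport, w.state, w.query
FROM citus_dist_stat_activity d
JOIN citus_worker_stat_activity w USING (transaction_number)
WHERE d.pid = 2574
  AND d.transaction_number <> 0;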
citus_lock_waits 查看Citus集群内的被阻塞的查询。下面引用Ciuts手册上的例子 表定义 CREATE TABLE numbers AS SELECT i, 0 AS j FROM generate_series(1,10) AS i; SELECT create_distributed_table('numbers', 'i'); 使用2个会话终端,顺序执行下面的SQL。 -- session 1 -- session 2 ------------------------------------- ------------------------------------- BEGIN; UPDATE numbers SET j = 2 WHERE i = 1; BEGIN; UPDATE numbers SET j = 3 WHERE i = 1; -- (this blocks) 通过citus_lock_waits可以看到,这2个查询是阻塞状态。 SELECT * FROM citus_lock_waits; -[ RECORD 1 ]-------------------------+---------------------------------------- waiting_pid | 88624 blocking_pid | 88615 blocked_statement | UPDATE numbers SET j = 3 WHERE i = 1; current_statement_in_blocking_process | UPDATE numbers SET j = 2 WHERE i = 1; waiting_node_id | 0 blocking_node_id | 0 waiting_node_name | coordinator_host blocking_node_name | coordinator_host waiting_node_port | 5432 blocking_node_port | 5432 这个视图只能在CN节点查看,MX worker节点查不到数据。但是并不要求阻塞所涉及的SQL必须从CN节点发起。 详细参考:https://docs.citusdata.com/en/v9.3/develop/api_metadata.html?highlight=citus_worker_stat_activity#distributed-query-activity 32. 增加查看表元数据的函数和视图 master_get_table_metadata() 查看分布表的元数据 postgres=# select * from master_get_table_metadata('tb1'); -[ RECORD 1 ]---------+----------- logical_relid | 17148 part_storage_type | t part_method | h part_key | id part_replica_count | 1 part_max_size | 1073741824 part_placement_policy | 2 get_shard_id_for_distribution_column() 查看某个分布列值对应的shardid postgres=# SELECT get_shard_id_for_distribution_column('tb1', 4); get_shard_id_for_distribution_column -------------------------------------- 102347 (1 row) 其他 33. 允许在CN备库执行简单的DML 通过设置citus.writable_standby_coordinator参数为on,可以在CN的备库上执行部分简单的DML。看下下面的例子 表定义 create table tbl(id int,c1 int); select create_distributed_table('tbserial','id'); 在CN备节点上可以执行带分片字段的DML postgres=# insert into tb1 values(3,3); ERROR: writing to worker nodes is not currently allowed DETAIL: the database is in recovery mode postgres=# set citus.writable_standby_coordinator TO ON; SET postgres=# insert into tb1 values(3,3); INSERT 0 1 postgres=# update tb1 set c1=20 where id=3; UPDATE 1 postgres=# delete from tb1 where id=3; DELETE 1 不支持不带分片字段的UPDATE和DELETE postgres=# update tb1 set c1=20; ERROR: cannot assign TransactionIds during recovery postgres=# delete from tb1 where c1=20; ERROR: cannot assign TransactionIds during recovery 也不支持跨节点的事务 postgres=# begin; BEGIN postgres=*# insert into tb1 values(3,3); INSERT 0 1 postgres=*# insert into tb1 values(4,4); INSERT 0 1 postgres=*# commit; ERROR: cannot assign TransactionIds during recovery 对于2pc的分布式事务,Citus需要将事务信息记录到事务表pg_dist_transaction中。所以,Citus也无法在CN备节点上支持2pc的分布式事务。 但是如果切换成1pc提交模式,还是可以支持跨节点事务的。 postgres=# set citus.multi_shard_commit_protocol TO '1pc'; SET postgres=# begin; BEGIN postgres=*# insert into tb1 values(4,4); INSERT 0 1 postgres=*# insert into tb1 values(5,5); INSERT 0 1 postgres=*# commit; 并且在1pc提交模式下,跨多个分片的SQL也是支持的。 postgres=# set citus.multi_shard_commit_protocol TO '1pc'; SET postgres=# update tb1 set c1=10; UPDATE 3
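补充一个和上面32节相关的小例子:get_shard_id_for_distribution_column()与pg_dist_shard_placement配合,可以直接定位某个分布列值的数据存放在哪个worker上。以下仅作示意(沿用前面的tb1表和分布列值4):

SELECT sp.nodename, sp.nodeport, sp.shardid
FROM pg_dist_shard_placement sp
WHERE sp.shardid = get_shard_id_for_distribution_column('tb1', 4);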
PG逻辑订阅过程中,怎么判断订阅端已经同步到哪儿了? 考虑过2种方案,哪个更合适 订阅端的pg_stat_subscription中latest_end_lsn 发布端的pg_stat_replication中的replay_lsn 1. 关于pg_stat_subscription中的latest_end_lsn pg_stat_subscription中的received_lsn和latest_end_lsn比较像,它们的区别如下 received_lsn:最后一次接收到的预写日志位置 latest_end_lsn:报告给原始WAL发送程序的最后的预写日志位置 1.1 pg_stat_subscription中latest_end_lsn的来源 来源是全局数组LogicalRepCtx->workers[] select * from pg_stat_subscription pg_stat_get_subscription() memcpy(&worker, &LogicalRepCtx->workers[i],sizeof(LogicalRepWorker)); values[6] = LSNGetDatum(worker.reply_lsn); 1.2 LogicalRepWorker的分配 Launcher ApplyWorker时分配slot,通过bgw_main_arg参数传给ApplyWorker ApplyLauncherMain(Datum main_arg) logicalrep_worker_launch(sub->dbid, sub->oid, sub->name, sub->owner, InvalidOid); /* Find unused worker slot. */ for (i = 0; i < max_logical_replication_workers; i++) { LogicalRepWorker *w = &LogicalRepCtx->workers[i]; if (!w->in_use) { worker = w; slot = i; break; } } bgw.bgw_main_arg = Int32GetDatum(slot); RegisterDynamicBackgroundWorker(&bgw, &bgw_handle) 1.3 latest_end_lsn的更新 订阅端只有收到发布端的keepalive消息,才会更新pg_stat_subscription.latest_end_lsn。由于不是每次send_feedback()后都会更新latest_end_lsn,所以latest_end_lsn可能比实际反馈给发布端的lsn要滞后。实测时也经常能看到10秒以上的延迟。为防止wal send超时,当超过wal_sender_timeout / 2还没有收到接受端反馈时,发送端会主动发送keepalive消息。 LogicalRepApplyLoop(XLogRecPtr last_received) for (;;) { ... len = walrcv_receive(wrconn, &buf, &fd); if (len != 0) { if (c == 'w') { XLogRecPtr start_lsn; XLogRecPtr end_lsn; TimestampTz send_time; start_lsn = pq_getmsgint64(&s); end_lsn = pq_getmsgint64(&s); send_time = pq_getmsgint64(&s); if (last_received < start_lsn) last_received = start_lsn; if (last_received < end_lsn) last_received = end_lsn; UpdateWorkerStats(last_received, send_time, false);//更新pg_stat_subscription.received_lsn apply_dispatch(&s); } else if (c == 'k') { XLogRecPtr end_lsn; TimestampTz timestamp; bool reply_requested; end_lsn = pq_getmsgint64(&s); timestamp = pq_getmsgint64(&s); reply_requested = pq_getmsgbyte(&s); if (last_received < end_lsn) last_received = end_lsn; send_feedback(last_received, reply_requested, false);//反馈订阅端的write/flush/reply lsn UpdateWorkerStats(last_received, timestamp, true);//更新pg_stat_subscription.received_lsn和pg_stat_subscription.latest_end_lsn } } send_feedback(last_received, false, false);//反馈订阅端的write/flush/reply lsn 2. 如何跟踪订阅端实际apply到哪里? latest_end_lsn也能在一定程度上反映订阅端的apply位点,但是这和它本身的功能其实不是特别契合,而且它出现滞后的概率比较高,不是特别理想。 我们可以通过发布端的pg_stat_replication统计视图跟踪订阅端的apply位置。 同样参考上面LogicalRepApplyLoop()的代码,订阅端反馈自己复制位置的逻辑如下: 如果没有pending的事务(所有和订阅相关的写事务已经在订阅端刷盘) 反馈给sender:write=flush=apply=接受到最新wal位置 如果有pending的事务 反馈给sender: write=接受到最新wal位置 flush=属于订阅范围的写事务已经在订阅端刷盘的位置 apply=属于订阅范围的写事务已经在订阅端写盘的位置 由上面可以看出,逻辑订阅和物理复制不一样,物理复制是先写wal再apply这个WAL;逻辑订阅是先apply事务,再反馈这个事务产生的wal的flush位置 相关代码如下: send_feedback(XLogRecPtr recvpos, bool force, bool requestReply) get_flush_position(&writepos, &flushpos, &have_pending_txes); /* * No outstanding transactions to flush, we can report the latest received * position. This is important for synchronous replication. */ if (!have_pending_txes) flushpos = writepos = recvpos; ... 
pq_sendbyte(reply_message, 'r'); pq_sendint64(reply_message, recvpos); /* write */ pq_sendint64(reply_message, flushpos); /* flush */ pq_sendint64(reply_message, writepos); /* apply */ pq_sendint64(reply_message, now); /* sendTime */ pq_sendbyte(reply_message, requestReply); /* replyRequested */ static void get_flush_position(XLogRecPtr *write, XLogRecPtr *flush, bool *have_pending_txes) { dlist_mutable_iter iter; XLogRecPtr local_flush = GetFlushRecPtr(); *write = InvalidXLogRecPtr; *flush = InvalidXLogRecPtr; dlist_foreach_modify(iter, &lsn_mapping)//lsn_mapping 在应用commit日志时更新 { FlushPosition *pos = dlist_container(FlushPosition, node, iter.cur); *write = pos->remote_end; if (pos->local_end <= local_flush) { *flush = pos->remote_end; dlist_delete(iter.cur);//从lsn_mapping中移除已经本地刷盘的记录 pfree(pos); } else { /* * Don't want to uselessly iterate over the rest of the list which * could potentially be long. Instead get the last element and * grab the write position from there. */ pos = dlist_tail_element(FlushPosition, node, &lsn_mapping); *write = pos->remote_end; *have_pending_txes = true; return; } } *have_pending_txes = !dlist_is_empty(&lsn_mapping); } 应用commit日志时,会将commit对应的远程lsn和本地lsn添加到lsn_mapping末尾 ApplyWorkerMain LogicalRepApplyLoop(origin_startpos); apply_dispatch(&s); apply_handle_commit(StringInfo s) replorigin_session_origin_lsn = commit_data.end_lsn; //更新pg_replication_origin_status replorigin_session_origin_timestamp = commit_data.committime; CommitTransactionCommand(); store_flush_position(commit_data.end_lsn); /* Track commit lsn */ flushpos = (FlushPosition *) palloc(sizeof(FlushPosition)); flushpos->local_end = XactLastCommitEnd; flushpos->remote_end = remote_lsn; dlist_push_tail(&lsn_mapping, &flushpos->node); 3. 发布端pg_stat_replication中的apply位点能否保证正确性? 首先,需要明确,只有出现以下情况时,拿到的apply位置才认为有误的 发布端更新了订阅表的表 更新这个表的事务已提交 订阅端还没有应用这个事务 pg_stat_replication中看到的apply位点已经大于等于3的事务结束位置 当所有表都是r或s状态时,订阅端的apply worker顺序接受和应用WAL日志。在订阅端本地提交完成前,不会实施后续的send_feedback(),所以不会产生超过实际提交位置的apply位点(甚至碰巧pg_stat_subscription中的latest_end_lsn也可以认为是对的)。 4. 发布端pg_stat_replication中的apply位点是否可能反馈不及时? 有可能。但是pg_stat_replication.replay_lsn滞后的概率低于pg_stat_subscription.latest_end_lsn 当订阅端已处于同步状态时,下面的情况下pg_stat_replication中的apply位点可能反馈不及时,比发布端的当前lsn滞后。 订阅端处于sleep状态,最多sleep 1秒 发布端发送非订阅表更新的消息(含keepalive)不及时 发送端为了防止sender超时,会及时发送keepalive保活,因此我们可以在发布端停止更新订阅表后,可以最多等待wal_sender_timeout一样大的时间。
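日常监控时,可以直接在发布端用pg_stat_replication估算订阅端的apply延迟,示意如下(假设订阅对应的application_name为sub1):

SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication
WHERE application_name = 'sub1';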
有些业务中需要求解最短路径,PostgreSQL中有个pgrouting插件内置了和计算最短路径相关的算法。下面看下示例 表定义 postgres=# \d testpath Table "public.testpath" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | | source | integer | | | target | integer | | | cost | integer | | | 这是张业务表,每一行代表一条边及其代价,总共1000多条记录(实际对应的是按业务条件筛选后的结果集大小)。其余业务相关的属性全部隐去。 求解2点间的最短路径 postgres=# SELECT * FROM pgr_dijkstra( 'SELECT id,source,target,cost FROM testpath', 10524, 10379, directed:=true); seq | path_seq | node | edge | cost | agg_cost -----+----------+-------+---------+------+---------- 1 | 1 | 10524 | 1971852 | 1 | 0 2 | 2 | 7952 | 32256 | 1 | 1 3 | 3 | 7622 | 76615 | 2 | 2 4 | 4 | 44964 | 76616 | 1 | 4 5 | 5 | 7861 | 19582 | 1 | 5 6 | 6 | 7629 | 14948 | 2 | 6 7 | 7 | 17135 | 14949 | 1 | 8 8 | 8 | 10379 | -1 | 0 | 9 (8 rows) Time: 22.979 ms 求解2点间最短的N条路径 postgres=# SELECT * FROM pgr_ksp( 'SELECT id,source,target,cost FROM testpath', 10524, 10379, 1000,directed:=true); seq | path_id | path_seq | node | edge | cost | agg_cost -----+---------+----------+-------+---------+------+---------- 1 | 1 | 1 | 10524 | 1971852 | 1 | 0 2 | 1 | 2 | 7952 | 32256 | 1 | 1 3 | 1 | 3 | 7622 | 54740 | 2 | 2 4 | 1 | 4 | 35389 | 54741 | 1 | 4 5 | 1 | 5 | 7861 | 19582 | 1 | 5 6 | 1 | 6 | 7629 | 14948 | 2 | 6 7 | 1 | 7 | 17135 | 14949 | 1 | 8 8 | 1 | 8 | 10379 | -1 | 0 | 9 ...(略) 100 | 12 | 4 | 53179 | 95137 | 1 | 4 101 | 12 | 5 | 7625 | 90682 | 2 | 5 102 | 12 | 6 | 51211 | 90683 | 1 | 7 103 | 12 | 7 | 7861 | 19582 | 1 | 8 104 | 12 | 8 | 7629 | 1173911 | 2 | 9 105 | 12 | 9 | 59579 | 1173917 | 1 | 11 106 | 12 | 10 | 10379 | -1 | 0 | 12 (106 rows) Time: 201.223 ms 纯SQL求解最短路径 前面的最短路径是通过pgrouting插件计算的,能不能单纯利用PG自身的SQL完成最短路径的计算呢? 真实的业务场景下是可以限制路径的长度的,比如,如果我们舍弃所有边数大于7的路径。那么完全可以用简单的深度遍历计算最短路径。计算速度还提高了5倍。 postgres=# WITH RECURSIVE line AS( SELECT source,target,cost from testpath ), path(fullpath,pathseq,node,total_cost) AS ( select ARRAY[10524],1,10524,0 UNION ALL select array_append(fullpath,target),pathseq+1,target,total_cost+cost from path join line on(source=node) where node!=10379 and pathseq<=8 ) SELECT * FROM path where fullpath @> ARRAY[10379] order by total_cost limit 1; fullpath | pathseq | node | total_cost -----------------------------------------------+---------+-------+------------ {10524,7952,7622,80465,7861,7629,17135,10379} | 8 | 10379 | 9 (1 row) Time: 4.334 ms 如果每条边的cost相同,可以去掉上面的order by total_cost,在大数据集上性能会有很大的提升。 纯SQL求解最短的N条路径 沿用前面的SQL,只是修改了一下limit值,相比pgrouting的pgr_ksp函数性能提升的更多。性能提升了50倍。 postgres=# WITH RECURSIVE line AS( SELECT source,target,cost from testpath ), path(fullpath,pathseq,node,total_cost) AS ( select ARRAY[10524],1,10524,0 UNION ALL select array_append(fullpath,target),pathseq+1,target,total_cost+cost from path join line on(source=node) where node!=10379 and pathseq<=8 ) SELECT * FROM path where fullpath @> ARRAY[10379] order by total_cost limit 1000; fullpath | pathseq | node | total_cost ----------------------------------------------------+---------+-------+------------ {10524,7952,7622,80465,7861,7629,17135,10379} | 8 | 10379 | 9 {10524,7952,7622,35389,7861,7629,17135,10379} | 8 | 10379 | 9 {10524,7952,7622,44964,7861,7629,17135,10379} | 8 | 10379 | 9 {10524,7952,7622,80465,7861,7629,59579,10379} | 8 | 10379 | 9 {10524,7952,7622,35389,7861,7629,59579,10379} | 8 | 10379 | 9 {10524,7952,7622,44964,7861,7629,59579,10379} | 8 | 10379 | 9 {10524,7952,7622,53179,7625,7861,7629,17135,10379} | 9 | 10379 | 10 {10524,7952,7622,53179,7625,7861,7629,59579,10379} | 9 | 10379 | 10 (8 rows) Time: 4.425 
ms 下面看下执行计划 postgres=# explain analyze WITH RECURSIVE line AS( SELECT source,target,cost from testpath ), path(fullpath,pathseq,node,total_cost) AS ( select ARRAY[10524],1,10524,0 UNION ALL select array_append(fullpath,target),pathseq+1,target,total_cost+cost from path join line on(source=node) where node!=10379 and pathseq<=8 ) SELECT * FROM path where fullpath @> ARRAY[10379] order by total_cost limit 1000; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=277.18..277.18 rows=1 width=44) (actual time=9.992..10.001 rows=8 loops=1) CTE line -> Seq Scan on testpath (cost=0.00..16.45 rows=1045 width=12) (actual time=0.017..0.624 rows=1045 loops=1) CTE path -> Recursive Union (cost=0.00..257.09 rows=161 width=44) (actual time=0.003..9.889 rows=42 loops=1) -> Result (cost=0.00..0.01 rows=1 width=44) (actual time=0.001..0.002 rows=1 loops=1) -> Hash Join (cost=0.29..25.39 rows=16 width=44) (actual time=0.451..1.090 rows=5 loops=9) Hash Cond: (line.source = path_1.node) -> CTE Scan on line (cost=0.00..20.90 rows=1045 width=12) (actual time=0.003..0.678 rows=1045 loops=8) -> Hash (cost=0.25..0.25 rows=3 width=44) (actual time=0.007..0.007 rows=3 loops=9) Buckets: 1024 Batches: 1 Memory Usage: 8kB -> WorkTable Scan on path path_1 (cost=0.00..0.25 rows=3 width=44) (actual time=0.001..0.004 rows=3 loops=9) Filter: ((node <> 10379) AND (pathseq <= 8)) Rows Removed by Filter: 1 -> Sort (cost=3.63..3.64 rows=1 width=44) (actual time=9.991..9.994 rows=8 loops=1) Sort Key: path.total_cost Sort Method: quicksort Memory: 26kB -> CTE Scan on path (cost=0.00..3.62 rows=1 width=44) (actual time=7.851..9.979 rows=8 loops=1) Filter: (fullpath @> '{10379}'::integer[]) Rows Removed by Filter: 34 Planning time: 0.234 ms Execution time: 10.111 ms (22 rows) Time: 10.973 ms 大数据集的对比 以上测试的数据集比较小,只有1000多个边,如果在100w的数据集下,结果如何呢? 计算最短路径 时间(秒) pgr_dijkstra() 52秒 递归CTE(最大深度2边) 2秒 递归CTE(最大深度3边) 5秒 递归CTE(最大深度4边) 105秒 递归CTE(最大深度5边) 算不出来,放弃 递归CTE(最大深度7边,假设每个边cost相等,不排序,结果最短路径为3个边) 1.6秒 小结 简单的深度遍历求解可以适用于小数据集或深度比较小的场景。在满足这些条件的场景下,效果还是不错的。
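另外,上面的递归CTE只用深度限制了路径长度,并没有排除环路;在边比较密集的数据集上,带环的路径会使中间结果膨胀。下面是一个在递归中排除环路的变体,仅作示意(沿用前文的testpath表和起止点,未在大数据集上实测):

WITH RECURSIVE path(fullpath, node, total_cost) AS (
    SELECT ARRAY[10524], 10524, 0
  UNION ALL
    SELECT array_append(fullpath, target), target, total_cost + cost
    FROM path JOIN testpath ON (source = node)
    WHERE node != 10379
      AND NOT (target = ANY(fullpath))      -- 不再访问已走过的节点,排除环路
      AND array_length(fullpath, 1) <= 8
)
SELECT * FROM path WHERE node = 10379 ORDER BY total_cost LIMIT 1;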
开启逻辑订阅后,我们要知道复制的状态。这可以通过PG中的几个系统表或视图获取。 订阅端 pg_subscription_rel 通过pg_subscription_rel可以知道每张表的同步状态 postgres=# select * from pg_subscription_rel; srsubid | srrelid | srsubstate | srsublsn ---------+---------+------------+----------- 18465 | 18446 | r | 0/453EF50 18465 | 18453 | r | 0/453EF88 18465 | 18459 | r | 0/453EFC0 (3 rows) srsubstate 状态码: i = 初始化, d = 正在复制数据, s = 已同步, r = 准备好 (普通复制) srsublsn s和r状态时源端的结束LSN。 初始时该表处于i状态,而后PG从发布端copy基表,此时该表处于d状态,基表拷贝完成后记录LSN位置到srsublsn。之后进入s状态最后再进入r状态,并通过pgoutput逻辑解码从发布端拉取并应用增量数据。 s状态和r状态的区别是什么?初始拷贝完成后,每个表的sync worker还需要从发布端拉取增量,直到增量部分追到大于等于apply worker的同步位置。当追上apply worker的同步位置后表变更为s状态,并记录此时的wal位置到pg_subscription_rel.srsublsn。 此时srsublsn可能已经到了apply worker同步的前面,所有在commit wal位置小于srsublsn的事务都需要应用。一旦apply worker追上srsublsn,设置该表为r状态,此时所有订阅范围的表更新事务都需要apply worker应用。 pg_stat_subscription pg_stat_subscription显示每个订阅worker的状态。一个订阅包含一个apply worker,可选的还有一个或多个进行初始同步的sync worker。sync worker上的relid指示正在初始同步的表;对于apply worker,relid为NULL。 apply worker的latest_end_lsn为已反馈给发布端的LSN位置,一定程度上也可以认为是已完成同步的LSN位置。 postgres=# select * from pg_stat_subscription; subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time -------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+------ ------------------------- 18515 | sub1 | 19860 | 18446 | | 2020-04-24 19:29:10.961417+08 | 2020-04-24 19:29:10.961417+08 | | 2020- 04-24 19:29:10.961417+08 18515 | sub1 | 19499 | | 0/4566B50 | 2020-04-24 19:29:05.946996+08 | 2020-04-24 19:29:05.947017+08 | 0/4566B50 | 2020- 04-24 19:29:05.946996+08 (2 rows) pg_replication_origin_status pg_replication_origin_status包含了从复制源增量同步的最后一个位置 postgres=# select * from pg_replication_origin_status; local_id | external_id | remote_lsn | local_lsn ----------+-------------+------------+----------- 1 | pg_18465 | 0/4540208 | 0/470FFD8 (1 row) 上面的remote_lsn是订阅端应用的最后一个的WAL记录在源节点的开始LSN位置(即执行这条WAL记录的开头)。如果源节点上后来又产生了其他和订阅无关的WAL记录(比如更新其他表或后台checkpoint产生的WAL),不会反映到pg_replication_origin_status里。 发布端 pg_replication_slots 发布端的pg_replication_slots反映了逻辑订阅复制槽的LSN位点。 postgres=# select * from pg_replication_slots; -[ RECORD 1 ]-------+---------- slot_name | sub1 plugin | pgoutput slot_type | logical datoid | 13451 database | postgres temporary | f active | t active_pid | 14058 xmin | catalog_xmin | 755 restart_lsn | 0/4540818 confirmed_flush_lsn | 0/4540850 restart_lsnrestart_lsn是可能仍被这个槽的消费者要求的最旧WAL地址(LSN),并且因此不会在检查点期间自动被移除。 confirmed_flush_lsnconfirmed_flush_lsn代表逻辑槽的消费者已经确认接收数据到什么位置的地址(LSN)。比这个地址更旧的数据已经不再可用。 confirmed_flush_lsn是最后一个已同步的WAL记录的结束位置(需要字节对齐,实际是下条WAL的起始位置)。restart_lsn有时候是最后一个已同步的WAL记录的起始位置。 对应订阅范围内的表的更新WAL记录,必须订阅端执行完这条记录才能算已同步;对其他无关的WAL,直接认为是已同步的,继续处理下一条WAL。 在下面的例子中,我们在订阅端锁住一个订阅表,导致订阅端无法应用这条INSERT WAL,所有confirmed_flush_lsn就暂停在这条WAL前面(0/4540850)。 [postgres@sndsdevdb18 citus]$ pg_waldump worker1/pg_wal/000000010000000000000004 -s 0/045407A8 -n 5 rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 0/045407A8, prev 0/04540770, desc: CHECKPOINT_ONLINE redo 0/4540770; tli 1; prev tli 1; fpw true; xid 0:755; oid 24923; multi 1; offset 0; oldest xid 548 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 555/754; oldest running xid 755; online rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/04540818, prev 0/045407A8, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755 rmgr: Heap len (rec/tot): 69/ 130, tx: 755, lsn: 0/04540850, prev 0/04540818, desc: INSERT off 2, blkref #0: rel 1663/13451/17988 blk 0 FPW 
rmgr: Transaction len (rec/tot): 46/ 46, tx: 755, lsn: 0/045408D8, prev 0/04540850, desc: COMMIT 2020-04-24 14:22:20.531476 CST rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/04540908, prev 0/045408D8, desc: RUNNING_XACTS nextXid 756 latestCompletedXid 755 oldestRunningXid 756 pg_stat_replication 对于一个逻辑订阅,pg_stat_replication中可以看到apply worker的复制状态,其中的write_lsn,flush_lsn,replay_lsn和pg_replication_slots的confirmed_flush_lsn值相同。apply worker的复制的application_name为订阅名。 postgres=# select * from pg_stat_replication ; pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | bac kend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state -------+----------+----------+-----------------------+-------------+-----------------+-------------+-------------------------------+---- ----------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+----------- - 19861 | 10 | postgres | sub1_18515_sync_18446 | | | -1 | 2020-04-24 19:29:10.964055+08 | | startup | | | | | | | | 0 | async 19500 | 10 | postgres | sub1 | | | -1 | 2020-04-24 19:26:59.950652+08 | | streaming | 0/4566B50 | 0/4566B50 | 0/4566B50 | 0/4566B50 | | | | 0 | async (2 rows) 可选的还可能看到sync worker临时创建的用于初始同步的复制。sync worker的复制的application_name为订阅名加上同步表信息。 为了理解sync worker的复制干嘛用的?我们需要先看一下sync worker的处理逻辑。 sync worker初始同步一张表时,分下面几个步骤 创建临时复制槽,用于sync worker的复制 从源端copy表数据到目的端 记录copy完成时的lsn到pg_subscription_rel的srsublsn 对比srsublsn和apply worker当前同步点lsn(latest_end_lsn) 4.1 如果srsublsn小于latest_end_lsn,将同步状态改为s 4.2 如果srsublsn大于latest_end_lsn,通过1的复制槽拉取本表的增量数据,等追上apply worker后,将同步状态改为s 后续增量同步工作交给apply worker 如何判断订阅已经同步? 在所有表都处于s或r状态时,只要发布端的pg_stat_replication.replay_lsn追上发布端的当前lsn即可。 如果我们通过逻辑订阅进行数据表切换,可以执行以下步骤确保数据同步 创建订阅并等待所有表完成基本同步 即所有表在`pg_subscription_rel`中处于s或r状态 在发布端锁表禁止更新 获取发布端当前lsn 获取发布端的replay_lsn(或其他等价指标),如果超过3的lsn,则数据已同步。 如果尚未同步,重复4 注意点 以下同步位置信息,反映了已处于s或r状态的表的同步位点。 pg_replication_slots pg_stat_replication pg_replication_origin_status 对于尚未完成初始同步的表,订阅端copy完初始数据后,会用一个临时的复制槽拉取增量WAL,直到追上apply worker。追上后修改同步状态为s,后续的增量同步交给apply worker。因此我们判断订阅的整体复制LSN位置时,必须等所有表都完成初始同步后才有意义。 参考 详细参考:https://github.com/ChenHuajun/chenhuajun.github.io/blob/master/_posts/2018-07-30-PostgreSQL逻辑订阅处理流程解析.md
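对应前面"如何判断订阅已经同步"的切换步骤,可以用类似下面的SQL做检查,仅作示意(假设订阅名/application_name为sub1,第3步在发布端记录到的LSN为0/4566B50):

-- 订阅端:确认所有表已完成初始同步(结果应为0)
SELECT count(*) AS not_synced
FROM pg_subscription_rel
WHERE srsubstate NOT IN ('s', 'r');

-- 发布端:锁表停止更新后,记录当前LSN
SELECT pg_current_wal_lsn();

-- 发布端:确认订阅端已重放到上一步记录的LSN
SELECT replay_lsn >= '0/4566B50'::pg_lsn AS synced
FROM pg_stat_replication
WHERE application_name = 'sub1';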
1. 表同步阶段处理流程概述 订阅端执行CREATE SUBSCRIPTION后,在后台进行表数据同步。每个表的数据同步状态记录在pg_subscription_rel.srsubstate中,一共有4种状态码。 'i':初始化 'd':正在copy数据 's':已同步 'r':准备好 (普通复制) 从执行CREATE SUBSCRIPTION开始订阅端的相关处理流程概述如下: 设置每个表的为srsubstate中'i'(SUBREL_STATE_INIT) logical replication launcher启动一个logial replication apply worker进程 logial replication apply worker进程连接到订阅端开始接受订阅消息,此时表尚未完成初始同步(状态为i或d),跳过所有insert,update和delete消息的处理。 logial replication apply worker进程为每个未同步的表启动logial replication sync worker进程(每个订阅最多同时启动max_sync_workers_per_subscription个sync worker) logial replication sync worker进程连接到订阅端并同步初始数据 5.1 创建临时复制槽,并记录快照位置。 5.2 设置表同步状态为'd'(SUBREL_STATE_DATASYNC) 5.3 copy表数据 5.4 设置表同步状态为SUBREL_STATE_SYNCWAIT(内部状态),并等待apply worker更新状态为SUBREL_STATE_CATCHUP(内部状态) logial replication apply worker进程更新表同步状态为SUBREL_STATE_CATCHUP(内部状态),记录最新lsn,并等待sync worker更新状态为SUBREL_STATE_SYNCDONE logial replication sync worker进程完成初始数据同步 7.1 检查apply worker当前处理的订阅消息位置是否已经走到了快照位置前面,如果是从订阅端接受消息并处理直到追上apply worker。 7.2 设置表同步状态为's'(SUBREL_STATE_SYNCDONE) 7.3 进程退出 logial replication apply worker进程继续接受订阅消息并处理 8.1 接受到insert,update和delete消息,如果是同步点(进入's'或'r'状态时的lsn位置)之后的消息进行应用。 8.2 接受到commit消息 8.2.1 更新复制源状态,确保apply worker crash时可以找到正确的开始位置 8.2.2 提交事务 8.2.3 更新统计信息 8.2.4 将所有处于's'(SUBREL_STATE_SYNCDONE)同步状态的表更新为'r'(SUBREL_STATE_READY) 8.3 暂时没有新的消息处理 8.3.1 向发布端发送订阅位置反馈 8.3.2 如果不在事务块里,同步表状态。将所有处于's'(SUBREL_STATE_SYNCDONE)同步状态的表更新为'r'(SUBREL_STATE_READY) 2. 表同步后的持续逻辑复制 订阅表进入同步状态(状态码是‘s’或'r')后,发布端的变更都会通过消息通知订阅端;订阅端apply worker按照订阅消息的接受顺序(即发布端事务提交顺序)对每个表apply变更,并反馈apply位置,用于监视复制延迟。 通过调试,确认发布端发生更新时,发送给订阅端的数据包。 2.1 插入订阅表 insert into tbx3 values(100); 发布端修改订阅表时,在事务提交时,发布端依次发送下面的消息到订阅端 B(BEGIN) R(RELATION) I(INSERT) C(COMMIT) 更新复制源pg_replication_origin_status中的remote_lsn和local_lsn,该位点对应于每个订阅表最后一次事务提交的位置。 k(KEEPALIVE) k(KEEPALIVE) 2个keepalive消息,会更新统计表中的位置 发布端pg_stat_replication:write_lsn,flush_lsn,replay_lsn 发布端pg_get_replication_slots():confirmed_flush_lsn 订阅端更新pg_stat_subscription:latest_end_lsn 2.2 插入非订阅表 insert into tbx10 values(100); 发布端产生了和订阅表无关修改,在事务提交时,发布端依次发送下面的消息到订阅端 B(BEGIN) C(COMMIT) 未产生实际事务,也不更新pg_replication_origin_status k(KEEPALIVE) k(KEEPALIVE) 2个'k' keepalive消息,会更新统计表中的位置 3. 异常处理 3.1 sync worker SQL错误(如主键冲突):worker进程异常退出,之后apply worker创建一个新的sync worker重试。错误解除前每5秒重试一次。 表被锁:等待 更新或删除的记录不存在:正常执行,检测不到错误,也么没有日志输出(输出一条DEBUG1级别的日志)。 3.2 apply worker SQL错误(如主键冲突):worker进程异常退出,之后logical replication launcher创建一个新的apply worker重试。错误解除前每5秒重试一次。 表被锁:等待 更新或删除的记录不存在:正常执行,检测不到错误,也么没有日志输出(输出一条DEBUG1级别的日志)。 错误日志示例: 2018-07-28 20:11:56.018 UTC [470] ERROR: duplicate key value violates unique constraint "tbx3_pkey" 2018-07-28 20:11:56.018 UTC [470] DETAIL: Key (id)=(2) already exists. 2018-07-28 20:11:56.022 UTC [47] LOG: worker process: logical replication worker for subscription 74283 (PID 470) exited with exit code 1 2018-07-28 20:12:01.029 UTC [471] LOG: logical replication apply worker for subscription "sub_shard" has started 2018-07-28 20:12:01.049 UTC [471] ERROR: duplicate key value violates unique constraint "tbx3_pkey" 2018-07-28 20:12:01.049 UTC [471] DETAIL: Key (id)=(2) already exists. 2018-07-28 20:12:01.058 UTC [47] LOG: worker process: logical replication worker for subscription 74283 (PID 471) exited with exit code 1 2018-07-28 20:12:06.070 UTC [472] LOG: logical replication apply worker for subscription "sub_shard" has started 2018-07-28 20:12:06.089 UTC [472] ERROR: duplicate key value violates unique constraint "tbx3_pkey" 2018-07-28 20:12:06.089 UTC [472] DETAIL: Key (id)=(2) already exists. 4. 限制 不复制数据库模式和DDL命令。 不复制序列数据。序列字段(serial / GENERATED ... 
AS IDENTITY)的值会被复制,但序列的值不会更新 不复制TRUNCATE命令。 不复制大对象 复制只能从基表到基表。也就是说,发布和订阅端的表必须是普通表,而不是视图, 物化视图,分区根表或外部表。订阅继承表的父表,只会复制父表的变更。 只支持触发器的一部分功能 不支持双向复制,会导致WAL循环。 不支持在同一个实例上的两个数据库上创建订阅 不支持在备机上创建订阅 订阅表上没有合适的REPLICA IDENTITY时,发布端执行UPDATE/DELETE会报错 注意事项 CREATE SUBSCRIPTION命令执行时,要等待发布端正在执行的事务结束。 sync worker初始同步数据时,开启了"REPEATABLE READ"事务,期间产生的垃圾不能被回收。 订阅生效期间,发布端所有事务产生的WAL必须在该事务结束时才能被回收。 订阅端UPDATE/DELETE找不到数据时,没有任何错误输出。 5. 表同步阶段相关代码解析 发布端Backend进程 CREATE PUBLICATION CreatePublication() CatalogTupleInsert(rel, tup); // 在pg_publication系统表中插入此发布信息 PublicationAddTables(puboid, rels, true, NULL);// publication_add_relation() check_publication_add_relation();// 检查表类型,不支持的表报错。只支持普通表('r'),且不是unloged和临时表 CatalogTupleInsert(rel, tup); // 在pg_publication_rel系统表中插入订阅和表的映射 订阅端Backend进程 CREATE SUBSCRIPTION CreateSubscription() CatalogTupleInsert(rel, tup); //在pg_subscription系统表中插入此订阅信息 replorigin_create(originname); //在pg_replication_origin系统表中插入此订阅对应的复制源 foreach(lc, tables) // 设置每个表的pg_subscription_rel.srsubstate table_state = copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY; // ★★★1 如果拷贝数据,设置每个表的pg_subscription_rel.srsubstate='i' SetSubscriptionRelState(subid, relid, table_state,InvalidXLogRecPtr, false); walrcv_create_slot(wrconn, slotname, false,CRS_NOEXPORT_SNAPSHOT, &lsn); ApplyLauncherWakeupAtCommit(); //唤醒logical replication launcher进程 订阅端logical replication launcher进程 ApplyLauncherMain() sublist = get_subscription_list(); //从pg_subscription获取订阅列表 foreach(lc, sublist) logicalrep_worker_launch(..., InvalidOid); // 对enabled且没有创建worker的订阅创建apply worker。apply worker如果已超过max_logical_replication_workers(默认4)报错 RegisterDynamicBackgroundWorker(&bgw, &bgw_handle);// 注册后台工作进程,入口函数为"ApplyWorkerMain" 订阅端 logical apply worker进程 ApplyWorkerMain replorigin_session_setup(originid); // 从共享内存中查找并设置复制源,如果不存在使用新的,复制源名称为pg_${订阅OID}。 origin_startpos = replorigin_session_get_progress(false);// 获取复制源的remote_lsn walrcv_connect(MySubscription->conninfo, true, MySubscription->name,&err); // 连接到订阅端 walrcv_startstreaming(wrconn, &options); // 开始流复制 LogicalRepApplyLoop(origin_startpos); // Apply进程主循环 for(;;) len = walrcv_receive(wrconn, &buf, &fd); if (c == 'w') // 'w'消息的处理 UpdateWorkerStats(last_received, send_time, false);更新worker统计信息(last_lsn,last_send_time,last_recv_time) apply_dispatch(&s); // 分发逻辑复制命令 switch (action) case 'B': /* BEGIN */ apply_handle_begin(s); case 'C': /* COMMIT */ apply_handle_commit(s); if (IsTransactionState() && !am_tablesync_worker()) // 当发布端的事务更新不涉及订阅表时,仍会发送B和C消息,此时不在事务中,跳过下面操作 replorigin_session_origin_lsn = commit_data.end_lsn; // 更新复制源状态,确保apply worker crash时可以找到正确的开始位置 replorigin_session_origin_timestamp = commit_data.committime; CommitTransactionCommand(); // 提交事务 pgstat_report_stat(false); // 更新统计信息 process_syncing_tables(commit_data.end_lsn); // 对处于同步中的表,协调sync worker和apply worker进程同步状态 process_syncing_tables_for_apply(current_lsn); GetSubscriptionNotReadyRelations(MySubscription->oid); // 从pg_subscription_rel中获取订阅中所有非ready状态的表。 foreach(lc, table_states) // 处理每个非ready状态的表 if (rstate->state == SUBREL_STATE_SYNCDONE) { if (current_lsn >= rstate->lsn) { rstate->state = SUBREL_STATE_READY; //处理第一个事务后,从syncdone->ready状态,但这个事务不需要和这个表相关。 rstate->lsn = current_lsn; SetSubscriptionRelState(MyLogicalRepWorker->subid, // 更新pg_subscription_rel rstate->relid, rstate->state, rstate->lsn, true); } } else { syncworker = logicalrep_worker_find(MyLogicalRepWorker->subid, rstate->relid, false); if (syncworker) { /* Found one, update our copy of its state */ rstate->state = syncworker->relstate; rstate->lsn = 
syncworker->relstate_lsn; if (rstate->state == SUBREL_STATE_SYNCWAIT) { /* * Sync worker is waiting for apply. Tell sync worker it * can catchup now. */ syncworker->relstate = SUBREL_STATE_CATCHUP; // ★★★3 SUBREL_STATE_SYNCWAIT -> SUBREL_STATE_CATCHUP syncworker->relstate_lsn = Max(syncworker->relstate_lsn, current_lsn); } /* If we told worker to catch up, wait for it. */ if (rstate->state == SUBREL_STATE_SYNCWAIT) { /* Signal the sync worker, as it may be waiting for us. */ if (syncworker->proc) logicalrep_worker_wakeup_ptr(syncworker); wait_for_relation_state_change(rstate->relid, SUBREL_STATE_SYNCDONE); // 等待sync worker将表的同步状态设置为SUBREL_STATE_SYNCDONE } } else { /* * If there is no sync worker for this table yet, count * running sync workers for this subscription, while we have * the lock. */ logicalrep_worker_launch(MyLogicalRepWorker->dbid, // 如果这个表没有对应的sync worker,且sync worker数未超过max_sync_workers_per_subscription,启动一个。 MySubscription->oid, MySubscription->name, MyLogicalRepWorker->userid, rstate->relid); } else if (c == 'k') // 'k'消息的处理 send_feedback(last_received, reply_requested, false); // 向订阅端发生反馈 UpdateWorkerStats(last_received, timestamp, true); // 更新worker统计信息(last_lsn,last_send_time,last_recv_time,reply_lsn,send_time) case I': /* INSERT */ apply_handle_insert(s); relid = logicalrep_read_insert(s, &newtup); if (!should_apply_changes_for_rel(rel))return; if (am_tablesync_worker()) return MyLogicalRepWorker->relid == rel->localreloid; // 对sync worker,只apply其负责同步的表 else return (rel->state == SUBREL_STATE_READY || // 对apply worker, 同步状态为SUBREL_STATE_SYNCDONE时,只同步syncdone位置之后的wal (rel->state == SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); ExecSimpleRelationInsert(estate, remoteslot); // 插入记录 ExecBRInsertTriggers(estate, resultRelInfo, slot); // 处理BEFORE ROW INSERT Triggers simple_heap_insert(rel, tuple); ExecARInsertTriggers(estate, resultRelInfo, tuple,recheckIndexes, NULL); // 处理AFTER ROW INSERT Triggers AfterTriggerEndQuery(estate); // 处理 queued AFTER triggers ... send_feedback(last_received, false, false);//没有新的消息要处理,向发布端发送位置反馈 process_syncing_tables(last_received);//如果不在事务块里,同步表状态 订阅端 logical sync worker进程 ApplyWorkerMain() //apply worker和sync worker使用相同的入口函数 LogicalRepSyncTableStart(&origin_startpos); GetSubscriptionRelState()(MyLogicalRepWorker->subid,MyLogicalRepWorker->relid,&relstate_lsn, true);// 从pg_subscription_rel中获取订阅的复制lsn walrcv_connect(MySubscription->conninfo, true, slotname, &err); switch (MyLogicalRepWorker->relstate) { case SUBREL_STATE_INIT: case SUBREL_STATE_DATASYNC: { MyLogicalRepWorker->relstate = SUBREL_STATE_DATASYNC; MyLogicalRepWorker->relstate_lsn = InvalidXLogRecPtr; SetSubscriptionRelState(MyLogicalRepWorker->subid, MyLogicalRepWorker->relid, MyLogicalRepWorker->relstate, MyLogicalRepWorker->relstate_lsn, true); res = walrcv_exec(wrconn, // 开始事务 "BEGIN READ ONLY ISOLATION LEVEL " "REPEATABLE READ", 0, NULL); walrcv_create_slot(wrconn, slotname, true, // 使用快照创建临时复制槽,并记录快照位置。 CRS_USE_SNAPSHOT, origin_startpos); copy_table(rel); // copy表数据 walrcv_exec(wrconn, "COMMIT", 0, NULL); MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT; // ★★★2 更新表同步状态为SUBREL_STATE_SYNCWAIT MyLogicalRepWorker->relstate_lsn = *origin_startpos; wait_for_worker_state_change(SUBREL_STATE_CATCHUP); // 等待apply worker将状态变更为SUBREL_STATE_CATCHUP if (*origin_startpos >= MyLogicalRepWorker->relstate_lsn) // 如果sync worker落后于apply worker,sync worker跳过此步继续apply WAL; { /* * Update the new state in catalog. 
No need to bother * with the shmem state as we are exiting for good. */ SetSubscriptionRelState(MyLogicalRepWorker->subid, // ★★★4 把同步状态从SUBREL_STATE_CATCHUP更新到SUBREL_STATE_SYNCDONE并退出 MyLogicalRepWorker->relid, SUBREL_STATE_SYNCDONE, *origin_startpos, true); finish_sync_worker(); } break; } case SUBREL_STATE_SYNCDONE: case SUBREL_STATE_READY: case SUBREL_STATE_UNKNOWN: finish_sync_worker(); break; } options.startpoint = origin_startpos; walrcv_startstreaming(wrconn, &options);// 开始流复制,以同步快照位置作为流的开始位置 LogicalRepApplyLoop(origin_startpos); // Apply进程主循环 for(;;) len = walrcv_receive(wrconn, &buf, &fd); UpdateWorkerStats(last_received, send_time, false); 更新worker统计信息(last_lsn,last_send_time,last_recv_time) apply_dispatch(&s); // 分发逻辑复制命令 switch (action) case 'B': /* BEGIN */ apply_handle_begin(s); case 'C': /* COMMIT */ apply_handle_commit(s); process_syncing_tables(commit_data.end_lsn); // 对处于同步中的表,协调sync worker和apply worker进程同步状态 process_syncing_tables_for_sync(current_lsn); if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP && current_lsn >= MyLogicalRepWorker->relstate_lsn) { TimeLineID tli; MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE; // ★★★4 把同步状态从SUBREL_STATE_CATCHUP更新到SUBREL_STATE_SYNCDONE MyLogicalRepWorker->relstate_lsn = current_lsn; SpinLockRelease(&MyLogicalRepWorker->relmutex); SetSubscriptionRelState(MyLogicalRepWorker->subid, MyLogicalRepWorker->relid, MyLogicalRepWorker->relstate, MyLogicalRepWorker->relstate_lsn, true); walrcv_endstreaming(wrconn, &tli); finish_sync_worker(); } case I': /* INSERT */ apply_handle_insert(s); 6.1 参考 https://yq.aliyun.com/articles/71128 PostgreSQL(Logical-Replication-Internals).pdf.pdf) http://www.postgres.cn/docs/10/logical-replication.html
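回到第1节的4种状态码,日常排查时可以用类似下面的查询,把订阅中每个表的同步状态翻译成可读的文字,仅作示意:

SELECT s.subname,
       c.relname,
       CASE r.srsubstate
            WHEN 'i' THEN '初始化'
            WHEN 'd' THEN '正在copy数据'
            WHEN 's' THEN '已同步'
            WHEN 'r' THEN '准备好'
       END AS state,
       r.srsublsn
FROM pg_subscription_rel r
JOIN pg_subscription s ON s.oid = r.srsubid
JOIN pg_class c ON c.oid = r.srrelid
ORDER BY s.subname, c.relname;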
1. 概述
zedstore是开发中的一个PostgreSQL的行列混合存储引擎,其设计目标偏OLAP场景,但是又能支持所有OLTP的操作,包括MVCC,索引等。在设计上当OLAP和OLTP的目标发生冲突时,会优先OLAP,所以OLTP的性能会差一点。
zedstore的设计目标参考下面的说明
https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com
Motivations / Objectives
* Performance improvement for queries selecting subset of columns (reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and others separately. Provide flexibility of granularity on how to divide the columns. Columns accessed together can be stored together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding full rewrite of the table.
zedstore内部page中可以存储未压缩的单个tuple,也可以存储压缩过的多个tuple的集合。每个tuple用TID标识,TID是个逻辑标识,不同于heap中TID代表物理位置。在zedstore的整个数据文件中,表被切分成很多列族(类似hbase,当前的开发版本固定每列都是一个列族),每个列族都是一个btree,按TID的顺序组织,整个zedstore数据文件就是一个btree的森林(和gin类似)。
+-----------------------------
| Fixed-size page header:
|
|   LSN
|   TID low and hi key (for Lehman & Yao B-tree operations)
|   left and right page pointers
|
| Items:
|
|   TID | size | flags | uncompressed size | lastTID | payload (container item)
|   TID | size | flags | uncompressed size | lastTID | payload (container item)
|   TID | size | flags | undo pointer | payload (plain item)
|   TID | size | flags | undo pointer | payload (plain item)
|   ...
|
+----------------------------
zedstore虽然在OLTP场景下的性能不是最优,但由于zedstore支持数据压缩,将来可以用来存放OLTP库的冷数据。
下面做个简单的测试体验一下。
2. 测试环境
CentOS 7.3(16核128G SSD)
zedstore
3.
编译安装 3.1 下载zedstore源码 https://github.com/greenplum-db/postgres/tree/zedstore 3.2 安装lz4 yum install lz4,lz4-devel 也可以下载lz4源码安装,但源码安装后要执行一次ldconfig。否则编译时configure可能出错。 3.3 编译 cd postgres-zedstore/ ./configure --prefix=/usr/pgzedstore --with-lz4 make -j 16 make install cd contrib/ make -j 16 make install 编译debug版,可以在configure上添加CFLAGS="-O0 -DOPTIMIZER_DEBUG -g3"参数 3.4 初始化实例 su - postgres /usr/pgzedstore/bin/initdb /pgsql/datazedstore -E UTF8 --no-locale /usr/pgzedstore/bin/pg_ctl -D /pgsql/datazedstore -l logfile restart -o'-p 5444' 4 测试 4.1 初始化测试库 /usr/pgzedstore/bin/pgbench -i -s 100 -p5444 4.2 heap表测试 [postgres@host10372181 ~]$/usr/pgzedstore/bin/pgbench -n -c 1 -j 1 -T 10 -p5444 -r -S transaction type: <builtin: select only> scaling factor: 100 query mode: simple number of clients: 1 number of threads: 1 duration: 10 s number of transactions actually processed: 61833 latency average = 0.162 ms tps = 6183.241453 (including connections establishing) tps = 6184.485276 (excluding connections establishing) statement latencies in milliseconds: 0.000 \set aid random(1, 100000 * :scale) 0.161 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; [postgres@host10372181 ~]$/usr/pgzedstore/bin/pgbench -n -c 1 -j 1 -T 10 -p5444 -r -S -M prepared transaction type: <builtin: select only> scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 10 s number of transactions actually processed: 158597 latency average = 0.063 ms tps = 15859.550657 (including connections establishing) tps = 15862.665007 (excluding connections establishing) statement latencies in milliseconds: 0.000 \set aid random(1, 100000 * :scale) 0.062 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; [postgres@host10372181 ~]$/usr/pgzedstore/bin/pgbench -n -c 1 -j 1 -T 10 -p5444 -r -M prepared transaction type: <builtin: TPC-B (sort of)> scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 10 s number of transactions actually processed: 18910 latency average = 0.529 ms tps = 1890.901809 (including connections establishing) tps = 1891.305759 (excluding connections establishing) statement latencies in milliseconds: 0.000 \set aid random(1, 100000 * :scale) 0.000 \set bid random(1, 1 * :scale) 0.000 \set tid random(1, 10 * :scale) 0.000 \set delta random(-5000, 5000) 0.021 BEGIN; 0.113 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.053 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.104 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.091 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.049 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.096 END; 4.2 创建zedstore表 create table pgbench_accounts2(like pgbench_accounts including all) using zedstore; insert into pgbench_accounts2 select * from pgbench_accounts; alter table pgbench_accounts rename to pgbench_accounts_old; alter table pgbench_accounts2 rename to pgbench_accounts; 改造后,可以发现zedstore表的size小了很多。 postgres=# \d+ List of relations Schema | Name | Type | Owner | Persistence | Size | Description --------+----------------------+-------+----------+-------------+---------+------------- public | pgbench_accounts | table | postgres | permanent | 61 MB | public | pgbench_accounts_old | table | postgres | permanent | 1283 MB | public | pgbench_branches | table | postgres | permanent | 40 kB | public | pgbench_history | table | postgres | permanent | 992 kB 
| public | pgbench_tellers | table | postgres | permanent | 104 kB | (5 rows) 压缩效果这么好和pgbench_accounts表中重复值非常多有关。 我们再构造一些随机的数据对比zedstore的压缩效果。 create table tb1(id int,c1 text); insert into tb1 select id,md5(id::text) from generate_series(1,1000000)id; create table tb2(id int,c1 text) using zedstore; insert into tb2 select id,md5(id::text) from generate_series(1,1000000)id; 这个的压缩效果就差了很多,lz4压缩速度比较快,但其本身的压缩率比较低。 postgres=# select * from tb2 limit 5; id | c1 ----+---------------------------------- 1 | c4ca4238a0b923820dcc509a6f75849b 2 | c81e728d9d4c2f636f067f89cc14862c 3 | eccbc87e4b5ce2fe28308fd9f2a7baf3 4 | a87ff679a2f3e71d9181a67b7542122c 5 | e4da3b7fbbce2345d7772b0674a318d5 (5 rows) postgres=# \d+ List of relations Schema | Name | Type | Owner | Persistence | Size | Description --------+----------------------+-------+----------+-------------+---------+------------- ... public | tb1 | table | postgres | permanent | 65 MB | public | tb2 | table | postgres | permanent | 38 MB | (7 rows) 4.3 zedstore表测试 [postgres@host10372181 ~]$/usr/pgzedstore/bin/pgbench -n -c 1 -j 1 -T 10 -p5444 -r -S transaction type: <builtin: select only> scaling factor: 100 query mode: simple number of clients: 1 number of threads: 1 duration: 10 s number of transactions actually processed: 3663 latency average = 2.730 ms tps = 366.280735 (including connections establishing) tps = 366.360837 (excluding connections establishing) statement latencies in milliseconds: 0.000 \set aid random(1, 100000 * :scale) 2.729 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; [postgres@host10372181 ~]$/usr/pgzedstore/bin/pgbench -n -c 1 -j 1 -T 10 -p5444 -r -S -M prepared transaction type: <builtin: select only> scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 10 s number of transactions actually processed: 3907 latency average = 2.560 ms tps = 390.614280 (including connections establishing) tps = 390.692250 (excluding connections establishing) statement latencies in milliseconds: 0.000 \set aid random(1, 100000 * :scale) 2.559 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; [postgres@host10372181 ~]$/usr/pgzedstore/bin/pgbench -n -c 1 -j 1 -T 10 -p5444 -r -M prepared transaction type: <builtin: TPC-B (sort of)> scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 10 s number of transactions actually processed: 811 latency average = 12.340 ms tps = 81.036743 (including connections establishing) tps = 81.053603 (excluding connections establishing) statement latencies in milliseconds: 0.001 \set aid random(1, 100000 * :scale) 0.000 \set bid random(1, 1 * :scale) 0.000 \set tid random(1, 10 * :scale) 0.001 \set delta random(-5000, 5000) 0.023 BEGIN; 11.888 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.084 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 0.081 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 0.095 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 0.052 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 0.112 END; zedstore表和heap表的测试结果汇总如下(单位tps) 测试模式 预编译模式 heap表 zedstore表 selectonly simple 6184 366 selectonly prepared 15862 390 normal prepared 1891 81 4.4 聚合查询的性能对比 对前面创建的有100万记录的tb1和tb2表执行聚合查询,比对执行时间,单位毫秒。 postgres=# \d+ tb1 Table "public.tb1" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+----------+--------------+------------- id | integer | | | | plain | | c1 | text | | | | extended | | Access method: heap postgres=# \d+ tb2 Table "public.tb2" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------+---------+-----------+----------+---------+----------+--------------+------------- id | integer | | | | plain | | c1 | text | | | | extended | | Access method: zedstore
SQL        heap表(ms)   zedstore表(ms)
count(*)   77           356
count(1)   82           356
count(id)  98           294
max(id)    95           285
avg(id)    99           307
小结
从测试可以看出,目前的zedstore还没有发挥出行列混合存储应有的潜质,可能zedstore当前的主要工作重心还在确保逻辑正确性上,尚未开始做性能上的优化。
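附:上表聚合对比所用的查询形式大致如下,分别在heap表tb1和zedstore表tb2上执行并比较\timing时间(示意):

\timing on
SELECT count(*)  FROM tb2;
SELECT count(1)  FROM tb2;
SELECT count(id) FROM tb2;
SELECT max(id)   FROM tb2;
SELECT avg(id)   FROM tb2;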
PostgreSQL WAL解析与闪回的一些想法 1. 背景 最近在walminer基础做了不少修改,以支持我们的使用场景。详细参考 如何在PostgreSQL故障切换后找回丢失的数据 修改也花了不少精力和时间,这个过程中有些东西想记录下来,方便以后查阅。 所以,这篇东西有点像流水账。 2. WAL文件格式 解析WAL的第一步是要了解WAL的文件格式,我觉得最详细易懂最值得看的资料是下面这个。 http://www.interdb.jp/pg/pgsql09.html 但是,以上还不够。细节的东西还是要看源码。我主要看的是写WAL记录的地方。 walminer作者李传成的博客里也有不少WAL解析相关的文章,是后来才发现的,我还没有看过。 https://my.oschina.net/lcc1990 3. walminer的解析流程 walminer解析WAL的入口是pg_minerXlog()函数。其主要过程如下 加载数据字典 起点搜索解析阶段 遍历WAL,根据输入的起始时间和起始xid找到匹配的第一个事务。 这个阶段只解析事务类型的WAL记录,其他WAL记录快速跳过。 完全解析阶段 紧接着2的位置,继续往下进行完整的解析。 这个阶段,会收集所有FPI(FULL PAGE IMAGE)并反映它们的变更, 还会收集所有DML(insert/update/delete)类型的WAL记录, 并且在遇到事务提交WAL时输出该事务对应的DML,事务回滚时清空该事务对应的DML。 walminer把WAL记录中tuple变成SQL的过程比较有意思,中间用了一个VALUES的临时格式。 以下面这个UPDATE语句为例 update tb1 set c1='3xy' where id=3; 其解析过程中涉及到的一些调用点如下: pg_minerXlog() ->sqlParser() ->XLogMinerRecord() ->XLogMinerRecord_heap() ->minerHeapUpdate(XLogReaderState *record, XLogMinerSQL *sql_simple, uint8 info) 1. 获取更新前后的tuple值(字符串格式) ->getTupleInfoByRecord() ->getTupleData_Update() ->mentalTup() ->mentalTup_nulldata() ->mentalTup_valuedata() tupleInfo:VALUES(-3, '3x')" tupleInfo_old:VALUES(3, NULL) 2. 生成中间redo SQL ->getUpdateSQL(sql_simple, tupleInfo, tupleInfo_old,...) sql_simple:UPDATE \"public\".\"tb1\" SET VALUES(3, '3xy') WHERE VALUES(3, '3x') 3. 生成中间undo SQL ->getUpdateSQL(&srctl.sql_undo, tupleInfo_old, tupleInfo,...) srctl.sql_undo:UPDATE \"public\".\"tb1\" SET VALUES(3, '3x') WHERE VALUES(-3, '3xy')" 4. 生成最终undo SQL 将中间中" VALUES"之后部分抹去,从rrctl.values,rrctl.nulls,rrctl.values_old,rrctl.nulls_old重新生成SQL后半部分。 ->reAssembleUpdateSql(&srctl.sql_undo,true); srctl.sql_undo:UPDATE "public"."tb1" SET "c1" = '3x' WHERE "id"=3 AND "c1"='3xy' AND ctid = '(0,10)'; pg_minerXlog() ->sqlParser() ->parserUpdateSql() 4. 生成最终redo SQL ->reAssembleUpdateSql(sql_ori, false) sql_ori:UPDATE "public"."tb1" SET "c1" = '3xy' WHERE "id"=3 AND "c1"='3x'; 4. walminer存在的问题 walminer是个非常棒的工具,填补了PG的一个空白。但是,在我们准备把它推向生产时发现了不少问题。 资源消耗和解析速度 粗测了一下,解析一个16MB的WAL文件大概需要15秒。不得不说实在太慢了。 解析大量WAL文件还容易把内存撑爆。 正确性和可靠性 对并发事务产生的WAL记录,解析的结果不对。 缺少回归测试集 其他的小问题。 易用性 不支持基于LSN位置的过滤 解析一次WAL要调用好几个函数,我觉得没有必要,一个就够了。 对这些已知的问题,都进行了改进。主要有下面几点 使用单个wal2sql()函数执行WAL解析任务 支持指定起始和结束LSN位置过滤事务 支持从WAL记录的old tuple或old key tuple中解析old元组构造where条件 增加lsn和commit_end_lsn结果输出字段 添加FPI(FULL PAGE IMAGE)解析开关,默认关闭image解析 优化WAL解析速度,大约提升10倍 给定LSN起始位置后,支持根据WAL文件名筛选,避免大量冗余的文件读取。 修复多个解析BUG 增加回归测试集10.合并PG10/11/12支持到一个分支 修改后的walminer参考 https://gitee.com/skykiker/XLogMiner 后续希望这些修改能合到源库里。 5. 后续改进思路 walminer在功能和使用场景上和MySQL的binlog2sql是非常接近的。 binlog2sql对自己的场景描述如下: https://github.com/danfengcao/binlog2sql 数据快速回滚(闪回) 主从切换后新master丢数据的修复 从binlog生成标准SQL,带来的衍生功能 binlog2sql已经有很多生产部署的案例,但是walminer好像还没有。其中原因,我想除了修改版已经解决的那些问题,walminer作为闪回工具,还有进一步改进的空间。 我考虑主要有以下几点可以改进的 以fdw的形式提供接口 和函数相比fdw的好处是明显的 不需要等所有WAL都解析完了再输出,因此可以结合limit进行多次快速探测 不需要创建临时表,解析过程中不需要产生WAL(产生WAL可能会触发WAL清理)。 可以在备库执行 使用fdw后,过滤条件直接通过where条件传递,接口更清晰。无法通过where条件传递东西,比如WAL存储目录,可以通过设置参数解决。 把解析过程分成事务匹配探测和完全解析2个部分 完全解析时,需要从匹配的事务往前回溯一部分,确保该事务的SQL甚至所需FPI都被解析到。单纯从匹配的事务后面开始完全解析,会丢失SQL的。 增加DDL解析 其实并不需要解析出完整的DDL和逆向的闪回DDL,这个任务也很难实现。只需要能知道什么时间,在WAL的哪个位点,哪个表发生了定义变更即可。 代码重构 从性能和可维护性考虑,有必要进行代码重构。 工具命名 既然这个工具的功能是从WAL中解析出原始SQL和undo SQL,walminer这个名称就显得不合适了。因为从字面上理解,walminer应该是解析WAL本身包含的信息,包括很多与SQL无关的的信息,但是不应该包含undo SQL这种WAL里没有而完全是被构造出来的东西。所以,顾名思议,这个东西可以叫wal2sql。 6. 参考 PostgreSQL Oracle 兼容性之 - 事件触发器实现类似Oracle的回收站功能 PostgreSQL flashback(闪回) 功能实现与介绍 MySQL Flashback 工具介绍 MySQL闪回方案讨论及实现
1. 背景 PostgreSQL的HA方案一般都基于其原生的流复制技术,支持同步复制和异步复制模式。同步复制模式虽然可以最大程度保证数据不丢失,但通常需要至少部署三台机器,确保有两台以上的备节点。因此很多一主一备HA集群,都是使用异步复制。 在异步复制下,主库宕机,把备节点切换为新的主节点后,可能会丢失最近更新的少量数据。如果这些丢失的数据对业务比较重要,那么,能不能从数据库里找回来呢? 下面就介绍找回这些数据的方法 2. 原理 基本过程 备库被提升为新主后会产生一个新的时间线,这个新时间线的起点我们称之为分叉点。 旧主故障修复后,在旧主上从分叉点位置开始解析WAL文件,将所有已提交事务产生的数据变更解析成SQL。 前提是旧主磁盘没有损坏,能够正常启动。不过,生产最常见的故障是物理机宕机,一般重启机器就可以恢复。 业务拿到这些SQL,人工确认后,回补数据。 为了能从WAL记录解析出完整的SQL,最好wal_level设置成logical,并且表上有主键。此时,对于我们关注的增删改DML语句,WAL记录中包含了足够的信息,能够把数据变更还原成SQL。详细如下: INSERT WAL记录中包含了完整的tuple数据,结合系统表中表定义可以还原出SQL。 UPDATE WAL记录中包含了完整的更新后的tuple数据,对于更新前的tuple,视以下情况而定。 - 表设置了replica identity full属性 WAL记录中包含完整的更新前的tuple数据 - 表包含replica identity key(或主键)且replica identity key的值发生了变更 WAL记录中包含了更新前的tuple的replica identity key(或主键)的字段值 - 其他 WAL记录中不包含更新前的tuple数据 DELETE WAL记录中可能包含被删除的tuple信息,视以下情况而定。 - 表设置了replica identity full属性 WAL记录中包含完整的被删除的tuple数据 - 表包含replica identity key(或主键) WAL记录中包含被删除的tuple的replica identity key(或主键)的字段值 - 其他 WAL记录中不包含被删除的tuple数据 如果wal_level不是logical或表上没有主键,还可以从WAL中的历史FPI(FULL PAGE IANGE)中解析出变更前tuple。 因此,原理上,从WAL解析出SQL是完全可行的。并且也已经有开源工具可以支持这项工作了。 3. 工具 使用改版的walminer工具解析WAL文件。 https://gitee.com/skykiker/XLogMiner walminer是一款很不错的工具,可以从WAL文件中解析出原始SQL和undo SQL。但是当前原生的walminer要支持这一场景还存在一些问题,并且解析WAL文件的速度非常慢。 改版的walminer分支增加了基于LSN位置的解析功能,同时修复了一些BUG,解析WAL文件的速度也提升了大约10倍。其中的部分修改后续希望能合到walminer主分支里。 3. 前提条件 分叉点之后的WAL日志文件未被清除 正常是足够的。也可以设置合理的`wal_keep_segments`参数,在`pg_wal`目录多保留一些WAL。比如: wal_keep_segments=100 如果配置了WAL归档,也可以使用归档目录中的WAL。 WAL日志级别设置为logical wal_level=logical 表有主键或设置了replica identity key/replica identity full 分叉点之后表定义没有发生变更 注:以上条件的2和3如果不满足其实也可以支持,但是需要保留并解析分叉点的前一个checkpint以后的所有WAL。 4. 使用演示 4.1 环境准备 搭建好一主一备异步复制的HA集群 机器: node1(主) node2(备) 软件: PostgreSQL 10 参数: wal_level=logical 4.2 安装walminer插件 从以下位置下载改版walminer插件源码 https://gitee.com/skykiker/XLogMiner/ 在主备库分别安装walminer cd walminer make && make install 在主库创建walminer扩展 create extension walminer 4.3 创建测试表 create table tb1(id int primary key, c1 text); insert into tb1 select id,'xxx' from generate_series(1,10000) id; 4.4 模拟业务负载 准备测试脚本 test.sql \set id1 random(1,10000) \set id2 random(1,10000) insert into tb1 values(:id1,'yyy') on conflict (id) do update set c1=excluded.c1; delete from tb1 where id=:id2; 在主库执行测试脚本模拟业务负载 pgbench -c 8 -j 8 -T 1000 -f test.sql 4.5 模拟主库宕机 在主库强杀PG进程 killall -9 postgres 4.6 备库提升为新主 在备库执行提升操作 pg_ctl promote 查看切换时的时间线分叉点 [postgres@host2 ~]$tail -1 /pgsql/data10/pg_wal/00000002.history 1 0/EF76440 no recovery target specified 4.7 在旧主库找回丢失的数据 启动旧主库后调用wal2sql()函数,找回分叉点以后旧主库上已提交事务执行的所有SQL。 postgres=# select xid,timestamptz,op_text from wal2sql(NULL,'0/EF76440') ; NOTICE: Get data dictionary from current database. NOTICE: Wal file "/pgsql/data10/pg_wal/00000001000000000000000F" is not match with datadictionary. 
NOTICE: Change Wal Segment To:/pgsql/data10/pg_wal/00000001000000000000000C NOTICE: Change Wal Segment To:/pgsql/data10/pg_wal/00000001000000000000000D NOTICE: Change Wal Segment To:/pgsql/data10/pg_wal/00000001000000000000000E xid | timestamptz | op_text --------+-------------------------------+------------------------------------------------------------- 938883 | 2020-03-31 17:12:10.331487+08 | DELETE FROM "public"."tb1" WHERE "id"=7630; 938884 | 2020-03-31 17:12:10.33149+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(5783, 'yyy'); 938885 | 2020-03-31 17:12:10.331521+08 | DELETE FROM "public"."tb1" WHERE "id"=3559; 938886 | 2020-03-31 17:12:10.331586+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=7585; 938887 | 2020-03-31 17:12:10.331615+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=973; 938888 | 2020-03-31 17:12:10.331718+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(7930, 'yyy'); 938889 | 2020-03-31 17:12:10.33173+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=1065; 938890 | 2020-03-31 17:12:10.331741+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(2627, 'yyy'); 938891 | 2020-03-31 17:12:10.331766+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=1012; 938892 | 2020-03-31 17:12:10.33178+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(4740, 'yyy'); 938893 | 2020-03-31 17:12:10.331814+08 | DELETE FROM "public"."tb1" WHERE "id"=4275; 938894 | 2020-03-31 17:12:10.331892+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=8651; 938895 | 2020-03-31 17:12:10.33194+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=9313; 938896 | 2020-03-31 17:12:10.331967+08 | DELETE FROM "public"."tb1" WHERE "id"=3251; 938897 | 2020-03-31 17:12:10.332001+08 | DELETE FROM "public"."tb1" WHERE "id"=2968; 938898 | 2020-03-31 17:12:10.332025+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(5331, 'yyy'); 938899 | 2020-03-31 17:12:10.332042+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=3772; 938900 | 2020-03-31 17:12:10.332048+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(94, 'yyy'); (18 rows) Time: 2043.380 ms (00:02.043) 上面wal2sql()的输出结果是按事务在WAL中提交的顺序排序的。可以把这些SQL导到文件里提供给业务修单。 4.8 恢复旧主 可以通过pg_rewind快速回退旧主多出的数据,然后作为新主的备库重建复制关系,恢复HA。 5. 小结 借助改版的walminer,可以方便快速地在PostgreSQL故障切换后找回丢失的数据。 walminer除了能生成正向SQL,还可以生成逆向的undo SQL,也就是我们熟知的闪回功能。undo SQL的生成方法和使用限制可以参考开源项目文档。 然而,在作为闪回功能使用时,walminer还有需要进一步改进的地方,最明显的就是解析速度。因为从WAL记录中完整解析undo SQL需要开启replica identity full,而很多系统可能不会为每个表都打开replica identity full设置。在没有replica identity full的前提下,生成undo SQL就必须要依赖历史FPI。 虽然改版的walminer已经在解析速度上提升了很多倍,但是如果面对几十GB的WAL文件,解析并收集历史所有FPI,资源和时间消耗仍然是个不小的问题。
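另外补充一点:4.7节提到可以把找回的SQL导到文件里提供给业务,用COPY就能直接在库内完成导出(输出路径仅为示意,COPY到服务器端文件需要相应权限):

copy (select op_text from wal2sql(NULL,'0/EF76440')) to '/tmp/lost_sql.sql';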
问题 有业务反馈在修改一个表字段长度后,Java应用不停的报下面的错误,但是越往后错误越少,过了15分钟错误就没有再发生。 ### Error querying database. Cause: org.postgresql.util.PSQLException: ERROR: cached plan must not change result type 原因 调查判断原因是修改字段长度导致执行计划缓存失效,继续使用之前的预编译语句执行会失败。 很多人遇到过类似错误,比如: https://blog.csdn.net/qq_27791709/article/details/81198571 但是,有两个疑问没有解释清楚。 以前业务也改过字段长度,但为什么没有触发这个错误? 这个错误能否自愈? 下面是进一步的分析 PostgreSQL中抛出此异常的代码如下: static List * RevalidateCachedQuery(CachedPlanSource *plansource, QueryEnvironment *queryEnv) { if (plansource->fixed_result) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cached plan must not change result type"))); ... } pgjdbc代码里有对该异常的判断,发生异常后,后续的执行会重新预编译,不会继续使用已经失效的预编译语句。这说明pgjdbc对这个错误有容错或自愈能力。 protected boolean willHealViaReparse(SQLException e) { ... // "cached plan must not change result type" String routine = pe.getServerErrorMessage().getRoutine(); return "RevalidateCachedQuery".equals(routine) // 9.2+ || "RevalidateCachedPlan".equals(routine); // <= 9.1 } 发生条件 经验证,使用Java应用时本故障的发生条件如下: 使用非自动提交模式 使用prepareStatement执行相同SQL 5次以上 修改表字段长度 表字段长度修改后第一次使用prepareStatement执行相同SQL 测试验证 以下代码模拟Java连接多次出池->执行->入池,中途修改字段长度。可以复现本问题 Connection conn = DriverManager.getConnection(...); conn.setAutoCommit(false); //自动提交模式下,不会出错,pgjdbc内部会处理掉 String sql = "select c1 from tb1 where id=1"; PreparedStatement prest =conn.prepareStatement(sql); for(int i=0;i<5;i++) { System.out.println("i: " + i); prest =conn.prepareStatement(sql); ResultSet rs = prest.executeQuery(); prest.close(); conn.commit(); } //在这里设置断点,手动修改字段长度: alter table tb1 alter c1 type varchar(118); for(int i=5;i<10;i++) { System.out.println("i: " + i); try { prest =conn.prepareStatement(sql); ResultSet rs = prest.executeQuery(); prest.close(); conn.commit(); } catch (SQLException e) { System.out.println(e.getMessage()); conn.rollback(); } } conn.close(); 测试程序执行结果如下: i: 0 i: 1 i: 2 i: 3 i: 4 i: 5 ERROR: cached plan must not change result type i: 6 i: 7 i: 8 i: 9 回避 在不影响业务逻辑的前提下,尽量使用自动提交模式 修改表字段长度后重启应用,或者在业务发生该SQL错误后重试(等每个Jboss缓存的连接都抛出一次错误后会自动恢复)
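顺便一提,这个错误并不是pgjdbc特有的,凡是使用服务端预编译语句的场景都可能遇到,在psql里就能直接复现(表结构仅为示意):

create table tb_demo(id int primary key, c1 varchar(10));
prepare q1 as select c1 from tb_demo where id = 1;
execute q1;
alter table tb_demo alter c1 type varchar(118);
execute q1;   -- 此时报错:cached plan must not change result type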
问题 某日,研发的小伙伴扔过来一个SQL希望帮忙优化。 select nation,province,city from ip_idc where ip_start <='113.201.214.203'::inet and ip_end >='113.201.214.203'::inet; 这个SQL需要执行1秒多。而业务需要高并发的频繁执行这条SQL,1秒多的执行时间无法满足业务需求。 这个SQL是想在ip地址库中找到某个IP地址的归属地。ip地址库的表定义如下,每条记录描述了一个地址范围的相关信息,共300多万条记录,600MB。 postgres=# \d ip_idc Table "public.ip_idc" Column | Type | Collation | Nullable | Default --------------------+------------------------+-----------+----------+--------- id | character varying(255) | | not null | ip_start | inet | | | ip_end | inet | | | nation | character varying(255) | | | province | character varying(255) | | | city | character varying(255) | | | ...(略) Indexes: "ip_idc_pkey" PRIMARY KEY, btree (id) "ip_idc_ip_start_ip_end_idx" btree (ip_start, ip_end) 分析 通过检查执行计划,可以很明显看出,这个SQL之所以慢,是因为它使用了全表扫描。 postgres=# explain (analyze ,buffers)select nation,province,city from ip_idc where ip_start <='113.201.214.203'::inet and ip_end >='113.201.214.203'::inet; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Seq Scan on ip_idc (cost=0.00..128204.08 rows=645500 width=29) (actual time=1062.045..1066.116 rows=1 loops=1) Filter: ((ip_start <= '113.201.214.203'::inet) AND (ip_end >= '113.201.214.203'::inet)) Rows Removed by Filter: 3470145 Buffers: shared hit=4143 read=72010 I/O Timings: read=321.196 Planning time: 17.140 ms Execution time: 1073.907 ms (7 rows) 明明表上有索引,为什么还会走全表扫描? 对于btree索引上的范围查询,这其实是一个很正常的现象。 对于联合索引btree(ip_start, ip_end),起决定作用的是开头的ip_start字段。这个SQL中,ip_start字段上的约束条件是ip_start <= '113.201.214.203'::inet,即它需要扫描索引中所有小于等于113.201.214.203'::inet的索引项。这样的索引项的数量越多,执行时间就越长;同时优化器计算出的索引扫描的cost也就越大,大到超过顺序扫描的cost后,优化器就会选择使用顺序扫描的执行计划了。 使用"更小"的ip地址继续查询,可以验证上面的描述。使用60,10,1开头的IP地址查询,都会走索引扫描,查询时间分别是210毫秒,3毫秒和0.4毫秒。 postgres=# explain (analyze ,buffers)select nation,province,city from ip_idc where ip_start <='60.201.214.203'::inet and ip_end >='60.201.214.203'::inet; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using ip_idc_ip_start_ip_end_idx on ip_idc (cost=0.43..77943.75 rows=640787 width=29) (actual time=104.813..104.815 rows=1 loops=1) Index Cond: ((ip_start <= '60.201.214.203'::inet) AND (ip_end >= '60.201.214.203'::inet)) Buffers: shared hit=3240 Planning time: 0.082 ms Execution time: 104.843 ms (5 rows) postgres=# explain (analyze ,buffers)select nation,province,city from ip_idc where ip_start <='10.201.214.203'::inet and ip_end >='10.201.214.203'::inet; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using ip_idc_ip_start_ip_end_idx on ip_idc (cost=0.43..17828.34 rows=30263 width=29) (actual time=2.298..2.299 rows=1 loops=1) Index Cond: ((ip_start <= '10.201.214.203'::inet) AND (ip_end >= '10.201.214.203'::inet)) Buffers: shared hit=70 Planning time: 0.156 ms Execution time: 2.327 ms (5 rows) postgres=# explain (analyze ,buffers)select nation,province,city from ip_idc where ip_start <='1.201.214.203'::inet and ip_end >='1.201.214.203'::inet; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using ip_idc_ip_start_ip_end_idx on ip_idc (cost=0.43..3426.17 rows=5057 width=29) (actual time=0.392..0.393 rows=1 loops=1) Index Cond: ((ip_start <= '1.201.214.203'::inet) AND (ip_end >= 
'1.201.214.203'::inet)) Buffers: shared hit=15 Planning time: 0.103 ms Execution time: 0.413 ms (5 rows) 测试结果汇总如下: 目标IP地址 扫描类型 扫描数据块数 执行时间(ms) 113.201.214.203 顺序扫描 76153 1073.907 60.201.214.203 索引扫描 3240 104.843 10.201.214.203 索引扫描 70 2.327 1.201.214.203 索引扫描 15 0.413 查询条件中虽然有ip_end >='1.201.214.203'::inet的约束条件,但由于ip_end是联合索引的第二个字段,难以发挥作用。 因此,本问题的根因在于btree索引不擅长处理范围类型。 优化方案1 熟悉PostgreSQL的人应该都知道,PostgreSQL里有非常丰富的数据类型,其中就包含范围类型。利用与之配套的gist索引,可以在索引中同时搜索范围的上下边界,达到比较理想的查询效果。下面实际验证一下效果。 由于inet范围不是内置类型,先创建一个inet的范围类型。 create type inetrange as range(subtype=inet) 为了不修改表结构,创建inet范围的表达式索引 create index on ip_idc using gist(inetrange(ip_start, ip_end, '[]'::text)) 执行查询 postgres=# explain (analyze,buffers) select * from ip_idc where inetrange(ip_start,ip_end,'[]') @> '113.21.214.203'::inet; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on ip_idc (cost=1.92..3.44 rows=1 width=136) (actual time=11.071..11.072 rows=1 loops=1) Recheck Cond: (inetrange(ip_start, ip_end, '[]'::text) @> '113.21.214.203'::inet) Heap Blocks: exact=1 Buffers: shared hit=521 -> Bitmap Index Scan on ip_idc_inetrange_idx (cost=0.00..1.92 rows=1 width=0) (actual time=11.066..11.066 rows=1 loops=1) Index Cond: (inetrange(ip_start, ip_end, '[]'::text) @> '113.21.214.203'::inet) Buffers: shared hit=520 Planning time: 0.098 ms Execution time: 11.140 ms (9 rows) 使用inet范围索引后,执行时间减少到了11毫秒。仔细检查这个执行计划,发现这个SQL扫描了520个索引块,似乎有点多,有兴趣的同学可以看看从内核源码角度能不能再优化一下。 注:相同的SQL在9.6上执行时间是300毫秒,需要扫描13435个索引块。应该PG 10对gist索引做过优化。 优化方案2 除了自定义的inetrange类型,inet本身就有表示地址范围的能力,并且支持gist索引。另外还有更高效的第三方的ip4r插件可以做同样的事情。下面比较一下这3种方式在同一个数据集下的性能。 先创建包含3种数据类型的表,并生成300多万不重复的ip范围 create extension ip4r; create table ip_idc2(id serial,ip_start inet,ip_end inet,iprange inet,iprange2 ip4r); insert into ip_idc2(ip_start,ip_end,iprange,iprange2) select (r1::text ||'.'|| r2::text ||'.'|| r3::text||'.0')::inet, (r1::text ||'.'|| r2::text ||'.'|| r3::text||'.255')::inet, (r1::text ||'.'|| r2::text ||'.'|| r3::text||'.0/24')::inet, (r1::text ||'.'|| r2::text ||'.'|| r3::text||'.0/24')::ip4r from generate_series(1,60) a(r1),generate_series(1,254) b(r2),generate_series(1,254) c(r3) limit 1; create index on ip_idc2 using gist(inetrange(ip_start, ip_end, '[]'::text)); create index on ip_idc2 using gist(iprange inet_ops); create index on ip_idc2 using gist(iprange2); 执行ip地址查询的SQL,比较3种的类型的索引扫描效率。 postgres=# explain (analyze,buffers) select * from ip_idc2 where inetrange(ip_start,ip_end,'[]') @> '33.21.214.203'::inet; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on ip_idc2 (cost=384.42..17950.58 rows=19355 width=33) (actual time=15.131..15.133 rows=1 loops=1) Recheck Cond: (inetrange(ip_start, ip_end, '[]'::text) @> '33.21.214.203'::inet) Heap Blocks: exact=1 Buffers: shared hit=668 -> Bitmap Index Scan on ip_idc2_inetrange_idx (cost=0.00..379.58 rows=19355 width=0) (actual time=15.122..15.122 rows=1 loops=1) Index Cond: (inetrange(ip_start, ip_end, '[]'::text) @> '33.21.214.203'::inet) Buffers: shared hit=667 Planning time: 0.072 ms Execution time: 15.204 ms (9 rows) postgres=# explain (analyze,buffers) select * from ip_idc2 where iprange >>= '33.21.214.203'::inet; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Index Scan using ip_idc2_iprange_idx on ip_idc2 (cost=0.41..3.43 rows=1 
width=33) (actual time=2.758..4.940 rows=1 loops=1)
   Index Cond: (iprange >>= '33.21.214.203'::inet)
   Buffers: shared hit=227
 Planning time: 0.082 ms
 Execution time: 4.964 ms
(5 rows)

postgres=# explain (analyze,buffers) select * from ip_idc2 where iprange2 >>= '33.21.214.203'::ip4r;
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ip_idc2  (cost=54.42..4966.41 rows=3871 width=33) (actual time=0.032..0.032 rows=1 loops=1)
   Recheck Cond: (iprange2 >>= '33.21.214.203'::ip4r)
   Heap Blocks: exact=1
   Buffers: shared hit=4
   ->  Bitmap Index Scan on ip_idc2_iprange2_idx  (cost=0.00..53.45 rows=3871 width=0) (actual time=0.027..0.027 rows=1 loops=1)
         Index Cond: (iprange2 >>= '33.21.214.203'::ip4r)
         Buffers: shared hit=3
 Planning time: 0.083 ms
 Execution time: 0.065 ms
(9 rows)

测试结果汇总如下

 数据类型     | Where条件                                                  | 扫描类型 | 扫描索引块数 | 执行时间(ms)
--------------+------------------------------------------------------------+----------+--------------+-------------
 inet范围类型 | inetrange(ip_start,ip_end,'[]') @> '33.21.214.203'::inet   | 索引扫描 | 667          | 15.204
 inet         | iprange >>= '33.21.214.203'::inet                          | 索引扫描 | 227          | 4.964
 ip4r         | iprange2 >>= '33.21.214.203'::ip4r                         | 索引扫描 | 3            | 0.065

从测试结果可以看出,原生的inet比自定义的inetrange快了3倍。而第三方的ip4r又比原生的inet快了76倍,执行时间只有0.065毫秒,执行过程中只扫描了3个索引块,可以认为已经优化到头了。

优化方案3

方案2中的ip4r的性能虽然比较理想,但是需要对现有ip地址库的表数据重新定义,而且还需要安装额外的第三方插件,实施代价比较高。有没有性能和ip4r相当,又不需要修改表数据以及安装第三方插件的方案呢?

经过和业务方沟通,了解到ip地址库中的地址范围没有重叠,并且ip地址库数据齐全,没有ip遗漏。也就是说,查询ip地址的SQL一定会返回且只返回一条记录。既然这样,简单思考一下就不难发现:ip_start小于等于目标IP的最大的地址范围其实就是要找的记录。这样的查询只需在SQL上简单地添加 order by + limit 1 就可以完成优化。效果如下。

在ip_start字段上创建btree索引

create index on ip_idc2(ip_start);

当然也可以继续使用现有的btree(ip_start,ip_end)联合索引。

执行SQL

postgres=# explain (analyze ,buffers)select * from ip_idc2 where ip_start <='33.201.214.203'::inet and ip_end >='33.201.214.203'::inet order by ip_start desc limit 1;
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..0.53 rows=1 width=33) (actual time=0.019..0.020 rows=1 loops=1)
   Buffers: shared hit=4
   ->  Index Scan Backward using ip_idc2_ip_start_idx on ip_idc2  (cost=0.43..99888.39 rows=957385 width=33) (actual time=0.018..0.018 rows=1 loops=1)
         Index Cond: (ip_start <= '33.201.214.203'::inet)
         Filter: (ip_end >= '33.201.214.203'::inet)
         Buffers: shared hit=4
 Planning time: 0.133 ms
 Execution time: 0.044 ms
(8 rows)

优化后,只扫描4个索引块,执行速度比ip4r还快50%,效果符合预期。

小结

对于IP地址段查询的场景,PostgreSQL的ip4r插件是一个性能和通用性都比较不错的方案,但用户不一定方便使用ip4r,比如在未安装ip4r的公有云RDS上,或者使用PostgreSQL以外的数据库时。在能满足以下限制条件的情况下,方案3(即order by + limit 1)应该是更好的选择。

ip地址库中的ip地址范围不能重叠,否则原本应该返回多条记录的结果只能查到第1条记录。
ip地址库中的ip地址范围齐全,不能有遗漏,否则可能需要扫描一半的索引,性能很差。

最终,业务采纳了方案3。
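如果希望业务端调用更简单,也可以把方案3的查询封装成一个SQL函数(函数名仅为示意,并非原文内容):

create or replace function f_ip_lookup(target inet)
returns table(nation varchar, province varchar, city varchar) as $$
  select i.nation, i.province, i.city
    from ip_idc i
   where i.ip_start <= target and i.ip_end >= target
   order by i.ip_start desc
   limit 1;
$$ language sql stable;

select * from f_ip_lookup('113.201.214.203'::inet);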
关于citus.limit_clause_row_fetch_count优化参数 citus.limit_clause_row_fetch_count是citus的一个性能优化的参数。具体适应于什么场景呢? 官方文档说明 下面是官方文档说明,还是不够具体。 https://docs.citusdata.com/en/v7.3/develop/api_guc.html?highlight=limit_clause_row_fetch_count Planner Configuration citus.limit_clause_row_fetch_count (integer) Sets the number of rows to fetch per task for limit clause optimization. In some cases, select queries with limit clauses may need to fetch all rows from each task to generate results. In those cases, and where an approximation would produce meaningful results, this configuration value sets the number of rows to fetch from each shard. Limit approximations are disabled by default and this parameter is set to -1. This value can be set at run-time and is effective on the coordinator. 测试用例 从citus的测试用例中,可以清楚的看到它的作用。 citus-7.2.1\src\test\regress\expected\multi_limit_clause_approximate.out -- Enable limit optimization to fetch one third of each shard's data SET citus.limit_clause_row_fetch_count TO 600; SELECT l_partkey, sum(l_partkey * (1 + l_suppkey)) AS aggregate FROM lineitem GROUP BY l_partkey ORDER BY aggregate DESC LIMIT 10; DEBUG: push down of limit count: 600 l_partkey | aggregate -----------+------------ 194541 | 3727794642 160895 | 3671463005 183486 | 3128069328 179825 | 3093889125 162432 | 2834113536 153937 | 2761321906 199283 | 2726988572 185925 | 2672114100 196629 | 2622637602 157064 | 2614644408 (10 rows) 上面的SQL,如果不加SET citus.limit_clause_row_fetch_count TO 600,CN需要到worker上把所有数据都捞出来,然后再在CN上排序取TopN结果。大数据量的情况,性能会非常糟糕。加上SET citus.limit_clause_row_fetch_count TO 600,就只会到每个worker上取前600的记录。但可能会带来准确性的损失。 另外一个需要注意的是,上面的GROUP BY字段l_partkey不是分片字段,如果GROUP BY字段已经包含了分片字段,不需要这个优化,因为这种情况下可以直接把LIMIT下推下去。 另一个测试用例,形式类似 ... SET citus.limit_clause_row_fetch_count TO 150; SET citus.large_table_shard_count TO 2; SELECT c_custkey, c_name, count(*) as lineitem_count FROM customer, orders, lineitem WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey GROUP BY c_custkey, c_name ORDER BY lineitem_count DESC, c_custkey LIMIT 10; DEBUG: push down of limit count: 150 c_custkey | c_name | lineitem_count -----------+--------------------+---------------- 43 | Customer#000000043 | 42 370 | Customer#000000370 | 38 79 | Customer#000000079 | 37 689 | Customer#000000689 | 36 472 | Customer#000000472 | 35 685 | Customer#000000685 | 35 643 | Customer#000000643 | 34 226 | Customer#000000226 | 33 496 | Customer#000000496 | 32 304 | Customer#000000304 | 31 (10 rows) 小结 适用场景 citus.limit_clause_row_fetch_count适用于分组聚合并取TopN结果的SQL的性能优化 不适用场景 要求精确结果 聚合字段包含分片字段 count(DISTINCT)
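由于这个参数牺牲了结果的准确性,实际使用时更稳妥的做法是只在需要近似TopN的会话里临时打开,执行完相关SQL后立即恢复默认值(取值仅为示意):

SET citus.limit_clause_row_fetch_count TO 600;
-- 在这里执行分组聚合并取TopN的SQL
RESET citus.limit_clause_row_fetch_count;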
citus实战系列之四多CN部署 背景 citus的架构中正常只有1个CN节点,有时候CN会成为性能瓶颈。我们可以通过减少分片数,垂直扩容CN节点等手段缓解CN的性能问题,但这些都不能治本。某些业务场景部署多个CN节点是非常必要的。 技术方案 如果CN上主要的负载来自查询,可以为CN节点配置多个备机,做读写分离,这些备机可以分担读负载。但是这种方案不能称为多CN,它不具有均衡写负载的能力。 怎么实现多CN呢?在citus的具体实现中,CN和worker的区别就在于是否存储了相关的元数据,如果把CN的元数据拷贝一份到worker上,那么worker也可以向CN一样工作,这个多CN的模式早期被称做masterless。 对于当前的citus版本,其实有一个开关,打开后,会自动拷贝CN的元数据到Worker上,让worker也可以当CN用。 这个功能官方被称做Citus MX,也就是下一代的Citus,目前仅在Cloud Beta版中使用。好消息是,社区版虽然没有公开说支持,但也没有从代码上限制这个功能。下面见证一下这个神奇的开关:) 在CN节点的postgresql.conf中添加下面的参数 citus.replication_model='streaming' 在CN节点(cituscn)上添加worker节点 SELECT * from master_add_node('cituswk1', 5432); SELECT * from master_add_node('cituswk2', 5432); 从CN复制元数据到第一个worker节点 SELECT start_metadata_sync_to_node('cituswk1', 5432); 执行上面的函数后,citus CN上的元数据会被拷贝到指定的worker上,并且pg_dist_node表中对应worker的hasmetadata字段值为true,标识这个Worker存储了元数据,以后创建新的分片时,新产生的元素据也会自动同步到这个worker上。 postgres=# select * from pg_dist_node; nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster --------+---------+----------+----------+----------+-------------+----------+----------+------------- 1 | 1 | cituswk1 | 5432 | default | t | t | primary | default 2 | 2 | cituswk2 | 5432 | default | f | t | primary | default (2 rows) 下面看一下效果 在CN节点上创建一个测试分片表 create table tb1(id int primary key, c1 int); set citus.shard_count=8; select create_distributed_table('tb1','id'); insert into tb1 select id,random()*1000 from generate_series(1,100)id; 因为cituswk1带了元数据,可以当CN用,下面这个SQL可以在cituswk1上执行。 postgres=# explain select * from tb1; QUERY PLAN ------------------------------------------------------------------------------ Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) Task Count: 8 Tasks Shown: One of 8 -> Task Node: host=cituswk1 port=5432 dbname=postgres -> Seq Scan on tb1_102092 tb1 (cost=0.00..32.60 rows=2260 width=8) (6 rows) 为了说明方便,后面把这种带了元数据的worker称之为“扩展worker”。citus会限制某些SQL在扩展worker上执行,比如DDL。 分片位置的控制 扩展Worker其实把Worker和CN两个角色混在一个节点里,对维护不是很友好。有下面几个表现: 扩展Worker上的负载可能不均衡 如果出现性能问题增加了故障排查的难度 CN和Worker将不能独立扩容 扩大了插件兼容问题的影响 一个例子是,当前citus的CN节点和auto_explain插件是冲突的,但Worker和auto_explain相安无视。一旦所有worker都作为扩展worker,那么所有的worker也都不能幸免。auto_explain插件只是我们在实际部署时遇到的一个例子,其它插件可能会有类似情况。 为解决这些问题,我们可以专门定义少数几个节点作为扩展Worker,并且不在上面分配分片,使其纯粹担任CN的角色。具体作法如下: 首先,按通常的方式建好分片表,再使用前面《citus实战系列之三平滑扩容》中介绍的方法把"扩展Worker"上的分片挪走。 对于前面tb1的例子,有4个分片落在扩展worker(worker1)上 postgres=# select * from pg_dist_shard_placement where nodename='cituswk1'; shardid | shardstate | shardlength | nodename | nodeport | placementid ---------+------------+-------------+----------+----------+------------- 102238 | 1 | 0 | cituswk1 | 5432 | 237 102240 | 1 | 0 | cituswk1 | 5432 | 239 102242 | 1 | 0 | cituswk1 | 5432 | 241 102244 | 1 | 0 | cituswk1 | 5432 | 243 (4 rows) 创建分片表 create table tba(id int); set citus.shard_count=8; select create_distributed_table('tba','id'); 此时,有6个分片落在"扩展Worker"上。 postgres=# select * from pg_dist_shard_placement where nodename='cituswk1'; shardid | shardstate | shardlength | nodename | nodeport | placementid ---------+------------+-------------+------------+----------+------------- 103081 | 1 | 0 | cituswk1 | 5432 | 1096 103083 | 1 | 0 | cituswk1 | 5432 | 1099 103085 | 1 | 0 | cituswk1 | 5432 | 1102 103097 | 1 | 0 | cituswk1 | 5432 | 1105 (4 rows) 把"扩展Worker"上的分片挪走。 select citus_move_shard_placement(102238,'cituswk1',5432,'cituswk2',5432,'drop'); select citus_move_shard_placement(102240,'cituswk1',5432,'cituswk2',5432,'drop'); select citus_move_shard_placement(102242,'cituswk1',5432,'cituswk2',5432,'drop'); select 
citus_move_shard_placement(102244,'cituswk1',5432,'cituswk2',5432,'drop'); 这是CN的元数据已经更新了,但扩展Worker(worker1)还没有,需要在扩展Worker(worker1)上更新分片位置 postgres=# update pg_dist_shard_placement set nodename='cituswk2' where shardid in (102238,102240,102242,102244); UPDATE 4 先定义后迁移的方式不是特别方便,对于有亲和关系(分片位置分布完全一致)的表,只有第一张表需要这样定义,后续的表可以利用亲和关系指定分片的部署位置。 比如 create table tb2(id int); select create_distributed_table('tb2','id', colocate_with=>'tb1'); 上面的colocate_with有3种取值 default: 分配数,副本数,分片字段类型相同的表自动分配到一个亲和组 none:开始一个新的亲和组 表名:希望与之亲和的另一个表的表名 注意事项 稳妥起见,在扩展worker上尽量避免会产生分布式事务或死锁的操作。建议只执行这几类SQL,并且不使用事务。 SELECT INSERT 条件中带分片字段的SQL
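元数据同步完成后,可以直接连到扩展worker(cituswk1)上检查元数据是否确实已经拷贝过来,比如查看分片表定义和分片位置(示意):

select logicalrelid, partmethod, colocationid from pg_dist_partition;
select nodename, count(*) from pg_dist_shard_placement group by nodename;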
citus实战系列之三平滑扩容 前言 对一个分布式数据库来说,动态扩缩容是不可回避的需求。但是citus的动态扩缩容功能只在企业版中才有。好消息是,citus的分片信息是存储在元数据表里的,通过修改元数据表,我们完全可以在citus社区版上实现动态的平滑扩缩容。 环境 软件 CentOS 7.4 PostgreSQL 10 citus 7.4 集群架构(扩容前) cituscn cituswk1 cituswk2 集群架构(扩容后) cituscn cituswk1 cituswk2 cituswk3 实验环境可参考《citus实战系列之二实验环境搭建》搭建。 原理概述 citus提供了现成的管理函数可以添加新的worker节点,但现有的分片表和参考表却不会自动分布到新加的worker上。我们需要手动移动这些分片,并且要保证分片移动过程中不中断业务。主要过程可以分为以下几个步骤 表复制 在移动目标分片的源端和目的端建立复制 元数据切换 加锁,阻塞相关的分片表的数据变更 修改pg_dist_shard_placement元数据表,变更分片位置信息。 清理 DROP切换前的旧的分片 表复制采用PostgreSQL的逻辑复制实现,因此所有worker节点必须预先打开逻辑复制开关。 wal_level = logical 注1:citus在添加新worker节点时已经在新worker上拷贝了参考表,不需要再人工处理。 注2:扩容时,如果把worker数翻倍,也可以用物理复制实现。使用物理复制时,如果有参考表不能调用master_add_node添加节点,必须手动修改元数据表。逻辑复制不支持复制DDL,物理复制没有这个限制,但物理复制没有逻辑复制灵活,只支持worker粒度的扩容,而且不能实现缩容。 分片表扩容操作步骤 创建测试分片表 创建以下测试分片表 create table tb1(id int primary key, c1 int); set citus.shard_count=8; select create_distributed_table('tb1','id'); insert into tb1 select id,random()*1000 from generate_series(1,100)id; 检查分片位置 postgres=# select * from pg_dist_placement where shardid in (select shardid from pg_dist_shard where logicalrelid='tb1'::regclass); placementid | shardid | shardstate | shardlength | groupid -------------+---------+------------+-------------+--------- 33 | 102040 | 1 | 0 | 1 34 | 102041 | 1 | 0 | 2 35 | 102042 | 1 | 0 | 1 36 | 102043 | 1 | 0 | 2 37 | 102044 | 1 | 0 | 1 38 | 102045 | 1 | 0 | 2 39 | 102046 | 1 | 0 | 1 40 | 102047 | 1 | 0 | 2 (8 rows) 上面的groupid代表了对应哪个worker postgres=# select * from pg_dist_node; nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster --------+---------+----------+----------+----------+-------------+----------+----------+------------- 1 | 1 | cituswk1 | 5432 | default | f | t | primary | default 2 | 2 | cituswk2 | 5432 | default | f | t | primary | default (2 rows) 添加新的worker 在CN节点上执行以下SQL,将新的worker节点cituswk3加入到集群中 SELECT * from master_add_node('cituswk3', 5432); 检查pg_dist_node元数据表。新的worker节点的groupid为4 postgres=# select * from pg_dist_node; nodeid | groupid | nodename | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster --------+---------+----------+----------+----------+-------------+----------+----------+------------- 1 | 1 | cituswk1 | 5432 | default | f | t | primary | default 2 | 2 | cituswk2 | 5432 | default | f | t | primary | default 4 | 4 | cituswk3 | 5432 | default | f | t | primary | default (3 rows) 复制分片 目前cituswk1和cituswk2上各有4个分片,cituswk3上没有分片,为了保持数据分布均匀可以移动部分分片到cituswk3上。 下面移动cituswk1上的分片102046到cituswk3上。 在cituswk1上创建PUBLICATION CREATE PUBLICATION pub_shard FOR TABLE tb1_102046; 在cituswk3上创建分片表和SUBSCRIPTION create table tb1_102046(id int primary key, c1 int); CREATE SUBSCRIPTION sub_shard CONNECTION 'host=cituswk1' PUBLICATION pub_shard; 切换元数据 锁表,阻止应用修改表 lock table tb1 IN EXCLUSIVE MODE; 等待数据完全同步后,修改元数据 update pg_dist_placement set groupid=4 where shardid=102046 and groupid=1; 清理 在cituswk1上删除分片表和PUBLICATION DROP PUBLICATION pub_shard; drop table tb1_102046; 在cituswk3上删除SUBSCRIPTION DROP SUBSCRIPTION sub_shard; 分片表缩容操作步骤 参考分片表扩容的步骤,将要删除的worker(cituswk3)上的分片(102046)移到其它worker(cituswk1)上,然后删除worker(cituswk3)。 select master_remove_node('cituswk3',5432); 亲和性表的处理 citus的分片表之间存在亲和性关系,具有亲和性(即colocationid相同)的所有分片表的同一范围的分片其所在位置必须相同。移动某个分片时,必须将这些亲和分片捆绑移动。可以通过以下SQL查出某个分片的所有亲和分片。 postgres=# select * from pg_dist_shard where logicalrelid in(select logicalrelid from pg_dist_partition where colocationid=(select colocationid from pg_dist_partition where partmethod='h' and logicalrelid='tb1'::regclass)) and 
(shardminvalue,shardmaxvalue)=(select shardminvalue,shardmaxvalue from pg_dist_shard where shardid=102046); logicalrelid | shardid | shardstorage | shardminvalue | shardmaxvalue --------------+---------+--------------+---------------+--------------- tb1 | 102046 | t | 1073741824 | 1610612735 tb2 | 102055 | t | 1073741824 | 1610612735 (2 rows) 对应的分片表元数据如下: postgres=# select logicalrelid,partmethod,colocationid from pg_dist_partition; logicalrelid | partmethod | colocationid --------------+------------+-------------- tb1 | h | 2 tb2 | h | 2 tb3 | h | 4 (3 rows) 自动化 在实际生产环境中,citus集群中可能会存储了非常多的表,每个表又拆成了非常多的分片。如果按照上面的步骤手工对citus扩缩容,将是一件非常痛苦的事情,也很容易出错。所以需要将这些步骤打包成自动化程序。 citus企业版在扩缩容时利用了一个叫master_move_shard_placement()的函数迁移分片,我们可以实现一个接口类似的函数citus_move_shard_placement()。 https://github.com/ChenHuajun/chenhuajun.github.io/blob/master/_posts/2018-05-23/citus_move_shard_placement.sql CREATE TYPE citus.old_shard_placement_drop_method AS ENUM ( 'none', -- do not drop or rename old shards, only record it into citus.citus_move_shard_placement_remained_old_shard 'rename', -- move old shards to schema "citus_move_shard_placement_recyclebin" 'drop' -- drop old shards in source node ); CREATE TABLE citus.citus_move_shard_placement_remained_old_shard( id serial primary key, optime timestamptz NOT NULL default now(), nodename text NOT NULL, nodeport text NOT NULL, tablename text NOT NULL, drop_method citus.old_shard_placement_drop_method NOT NULL ); -- move this shard and it's all colocated shards from source node to target node. -- drop_method define how to process old shards in the source node, default is 'none' which does not block SELECT. -- old shards should be drop in the future will be recorded into table citus.citus_move_shard_placement_remained_old_shard CREATE OR REPLACE FUNCTION pg_catalog.citus_move_shard_placement(shard_id bigint, source_node_name text, source_node_port integer, target_node_name text, target_node_port integer, drop_method citus.old_shard_placement_drop_method DEFAULT 'none') RETURNS void AS $citus_move_shard_placement$ ... 这部分太长了,略过 ... $citus_move_shard_placement$ LANGUAGE plpgsql SET search_path = 'pg_catalog','public'; -- drop old shards in source node CREATE OR REPLACE FUNCTION pg_catalog.citus_move_shard_placement_cleanup() RETURNS void AS $$ BEGIN delete from citus.citus_move_shard_placement_remained_old_shard where id in (select id from (select id,dblink_exec('host='||nodename || ' port='||nodeport,'DROP TABLE IF EXISTS ' || tablename) drop_result from citus.citus_move_shard_placement_remained_old_shard)a where drop_result='DROP TABLE'); PERFORM run_command_on_workers('DROP SCHEMA IF EXISTS citus_move_shard_placement_recyclebin CASCADE'); END; $$ LANGUAGE plpgsql SET search_path = 'pg_catalog','public'; 注:上面的工具函数未经过严格的测试,并且不支持后面的多CN架构。 下面是一个使用的例子 把102928分片从cituswk1迁移到cituswk2,drop_method使用rename旧的分片不删除而是移到名为citus_move_shard_placement_recyclebin的schema下。 postgres=# select citus_move_shard_placement(102928,'cituswk1',5432,'cituswk2',5432,'rename'); NOTICE: BEGIN move shards(102928,102944) from cituswk1:5432 to cituswk2:5432 NOTICE: [1/2] LOCK TABLE scale_test.tb_dist2 IN SHARE UPDATE EXCLUSIVE MODE ... NOTICE: [2/2] LOCK TABLE scale_test.tb_dist IN SHARE UPDATE EXCLUSIVE MODE ... NOTICE: CREATE PUBLICATION in source node cituswk1:5432 NOTICE: create shard table in the target node cituswk2:5432 NOTICE: CREATE SUBSCRIPTION on target node cituswk2:5432 NOTICE: wait for init data sync... 
NOTICE: init data sync in 00:00:01.010502 NOTICE: [1/2] LOCK TABLE scale_test.tb_dist2 IN EXCLUSIVE MODE ... NOTICE: [2/2] LOCK TABLE scale_test.tb_dist IN EXCLUSIVE MODE ... NOTICE: wait for data sync... NOTICE: data sync in 00:00:00.273212 NOTICE: UPDATE pg_dist_placement NOTICE: DROP SUBSCRIPTION and PUBLICATION NOTICE: END citus_move_shard_placement ---------------------------- (1 row) 从上面的输出可以看出,有一个步骤是锁表,这段时间内所有SQL都会被阻塞。对分析型业务来说,几十秒甚至更长SQL执行时间是很常见的,这意味着有可能出现先拿到一个表的锁,再拿下一个锁时,等了几十秒。更糟糕的情况下还可能发生死锁。回避这种风险的办法是将drop_method设置为none,这也是默认值。drop_method为none时将会改为获取一个EXCLUSIVE锁,EXCLUSIVE锁和SELECT不会冲突。这大大降低了分片迁移对业务的影响,死锁发生的概率也同样大大降低(仅有可能发生在应用程序在一个事务里先后更新了2张分片表时)。 确认扩容成功后,删除残留的旧分片(drop_method为drop时不需要清理)。 postgres=# select citus_move_shard_placement_cleanup(); citus_move_shard_placement_cleanup ------------------------------------ (1 row)
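在缩容或做分片再均衡之前,可以先用类似下面的SQL列出目标worker上的全部分片,再连同其亲和分片逐个调用citus_move_shard_placement()迁移(以cituswk3为例):

select logicalrelid, shardid, nodename, nodeport
  from pg_dist_shard_placement p
  join pg_dist_shard s using(shardid)
 where nodename = 'cituswk3' and nodeport = 5432
 order by logicalrelid, shardid;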
citus实战系列之二实验环境搭建 在进入后面的话题前,我们需要先搭建一个简单的实验环境,包含1个CN和2个Worker。以下步骤基于docker,仅用于实验目的,忽视了安全,性能调优等相关的配置。 前言 在进入后面的话题前,我们需要先搭建一个简单的实验环境。 搭建citus集群需要多台机器,如果仅用于功能验证而手上又没有合适机器,使用Docker搭建是个不错的选择。 以下步骤基于docker创建一个包含1个CN和2个Worker的citus环境,仅用于实验目的,忽视了安全,性能调优等相关的配置。 环境 host环境 CentOS 7.3 Docker 17.12.1-ce guest环境 CentOS 7.4 PostgreSQL 10 citus 7.4 制作citus镜像 启动centos容器 docker run -it --name citus centos bash 在容器中安装citus所需软件 yum install https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-2.noarch.rpm yum install postgresql10 postgresql10-server postgresql10-contrib yum install pgbouncer yum install citus_10 ln -sf /usr/pgsql-10 /usr/pgsql cat - >~postgres/.pgsql_profile <<EOF if [ -f /etc/bashrc ]; then . /etc/bashrc fi export PATH=/usr/pgsql/bin:$PATH export PGDATA=/pgsql/data EOF 创建citus镜像 docker commit citus citus:7.4v1 创建citus容器 为每一个节点创建一个volume docker volume create cituscn docker volume create cituswk1 docker volume create cituswk2 创建专有子网 docker network create --subnet=172.18.0.0/16 citus-net 创建并运行容器 docker run --mount source=cituscn,target=/pgsql --network citus-net --ip 172.18.0.100 --name cituscn --hostname cituscn --expose 5432 -it citus:7.4v1 bash docker run --mount source=cituswk1,target=/pgsql --network citus-net --ip 172.18.0.201 --name cituswk1 --hostname cituswk1 --expose 5432 -it citus:7.4v1 bash docker run --mount source=cituswk2,target=/pgsql --network citus-net --ip 172.18.0.202 --name cituswk2 --hostname cituswk2 --expose 5432 -it citus:7.4v1 bash 在每个容器上,分别执行下面的命令创建数据库 mkdir /pgsql/data chown postgres:postgres /pgsql/data chmod 0700 /pgsql/data su - postgres initdb -k -E UTF8 -D /pgsql/data echo "listen_addresses = '*'" >> /pgsql/data/postgresql.conf echo "wal_level = logical" >> /pgsql/data/postgresql.conf echo "shared_preload_libraries = 'citus'" >> /pgsql/data/postgresql.conf echo "citus.replication_model = 'streaming'" >> /pgsql/data/postgresql.conf echo "host all all 172.18.0.0/16 trust">> /pgsql/data/pg_hba.conf pg_ctl start psql -c "CREATE EXTENSION citus;" 在cituscn容器上,执行下面的命令添加worker。 psql -c "SELECT * from master_add_node('cituswk1', 5432);" psql -c "SELECT * from master_add_node('cituswk2', 5432);" 容器detach/attach 如果要退出容器的终端(detach),可以按CTL+pq 之后需要再次交互执行命令时再attach这个容器 docker attach cituscn 容器启停 以cituscn容器为例,worker容器类似。 停止cituscn容器 docker stop cituscn 启动cituscn容器 docker start -ai cituscn 然后在容器中启动PostgreSQL su - postgres pg_ctl start 测试 在cituscn容器上,测试分片表 create table tb1(id int primary key, c1 int); select create_distributed_table('tb1','id'); insert into tb1 select id,random()*1000 from generate_series(1,100)id; 执行测试SQL postgres=# explain select * from tb1; QUERY PLAN ------------------------------------------------------------------------------ Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=cituswk1 port=5432 dbname=postgres -> Seq Scan on tb1_102008 tb1 (cost=0.00..32.60 rows=2260 width=8) (6 rows)
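集群搭好后,可以在cituscn上执行下面的SQL确认两个worker都已注册并处于活动状态:

SELECT * FROM master_get_active_worker_nodes();
SELECT nodename, nodeport, isactive FROM pg_dist_node;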
citus实战系列之入门篇 citus为何物? citus是一款基于PostgreSQL的开源分布式数据库,自动继承了PostgreSQL强大的SQL支持能力和应用生态(不仅仅是客户端协议的兼容还包括服务端扩展和管理工具的完全兼容)。和其他类似的基于PostgreSQL的分布式方案,比如GreenPlum,PostgreSQL-XL,PostgreSQL-XC相比,citus最大的不同在于citus是一个PostgreSQL扩展而不是一个独立的代码分支。因此,citus可以用很小的代价和更快的速度紧跟PostgreSQL的版本演进;同时又能最大程度的保证数据库的稳定性和兼容性。 主要特性 PostgreSQL兼容 水平扩展 实时并发查询 快速数据加载 实时增删改查 支持分布式事务 支持常用DDL 性能参考 为了能够直观的了解citus分片表的性能优势,下面在1个CN和8个worker组成citus集群上,对比普通表和分片表(96分片)的性能差异。 插入1亿记录(5GB) 348051 82131 count(*) 348051 82131 插入1亿记录(5GB) 10246(2并发) 271 建索引 165582 2579 添加带缺省值的字段 388481 10522 删除5000w记录 104843 6106 相关的表定义和SQL如下 普通表: create table tbchj_local(id int primary key,c1 int,c2 text); insert into tbchj_local select a*10000+b,random()*10000000::int,'aaaaaaa' from generate_series(1,10000)a,generate_series(1,10000)b; select count(*) from tbchj_local; create index idx_tbchj_local_c1 on tbchj_local(c1); alter table tbchj_local add c3 int default 9; delete from tbchj_local where id %2 =0; 分片表 create table tbchj(id int primary key,c1 int,c2 text); set citus.shard_count=96; insert into tbchj select a*10000+b,random()*10000000::int,'aaaaaaa' from generate_series(1,10000)a,generate_series(1,10000)b; select count(*) from tbchj; create index idx_tbchj_c1 on tbchj(c1); alter table tbchj add c3 int default 9; delete from tbchj where id %2 =0; 技术架构 citus集群由一个中心的协调节点(CN)和若干个工作节点(Worker)构成。CN只存储和数据分布相关的元数据,实际的表数据被分成M个分片,打散到N个Worker上。这样的表被叫做“分片表”,可以为“分片表”的每一个分片创建多个副本,实现高可用和负载均衡。citus官方文档更建议使用PostgreSQL原生的流复制做HA,基于多副本的HA也许只适用于append only的分片。 分片表主要解决的是大表的水平扩容问题,对数据量不是特别大又经常需要和分片表Join的维表可以采用一种特殊的分片策略,只分1个片且每个Worker上部署1个副本,这样的表叫做“参考表”。除了分片表和参考表,还剩下一种没有经过分片的PostgreSQL原生的表,被称为“本地表”。“本地表”适用于一些特殊的场景,比如高并发的小表查询。 客户端应用访问数据时只和CN节点交互。CN收到SQL请求后,生成分布式执行计划,并将各个子任务下发到相应的Worker节点,之后收集Worker的结果,经过处理后返回最终结果给客户端。 适用场景 citus适合两类业务场景 实时数据分析 citus不仅支持高速的批量数据加载(20w/s),还支持单条记录的实时增删改查。 查询数据时,CN对每一个涉及的分片开一个连接驱动所有相关worker同时工作。并且支持过滤,投影,聚合,join等常见算子的下推,尽可能减少CN的负载。所以,对于count(),sum()这类简单的聚合计算,在128分片时citus可以轻松获得和PostgreSQL单并发相比50倍以上的性能提升。 多租户 和很多分布式数据库类似,citus对分片表间join的支持存在一定的限制。而多租户场景下每个租户的数据按租户ID分片,业务的SQL也带租户ID。因此这些SQL都可以直接下推到特定的分片上,避免了跨库join和跨库事务。 按现下流行的说法,citus可以算是一个分布式HTAP数据库,只是AP方面SQL的兼容性有待继续提升,TP方面还缺一个官方的多CN支持。 SQL限制与回避方法 citus对复杂SQL的支持能力还有所欠缺(和GreenPlum相比),这主要反映在跨库join,子查询和窗口函数上。好在目前citus的开发非常活跃,几乎2个月就出一个新的大版本,并大幅度改善其SQL支持能力。下面罗列了7.3版本的主要SQL限制。 join Join是分布式数据库比较头疼的问题。citus处理Join有两种方式,一种是把Join下推到分片上,即本地Join。本地Join,性能最优,但只适用于亲和分片表之间的Join,以及分片表和参考表之间的Join。亲和分片表指的是两个分片规则(分片数,副本数,分片字段,分片位置)完全相同的分片表。定义分片表时,可以在create_distributed_table()的参数colocate_with中指定和某个已存在的分片表亲和。比如: select create_distributed_table('tb2','id',colocate_with=>'tb1'); 设计表时应尽可能通过亲和关系以及参考表解决Join的问题。如果无法做到,就只能实施跨库Join。citus支持跨库Join的方式是对数据按Join字段重新分区,这一过程叫做MapMerge。这种方式只支持自然Join,其它Join仍然不支持。 子查询 子查询和Join一样存在跨库的问题,有时候子查询可以转化为一个等价的Join,所以和Join有相似的限制。 窗口函数 citus只支持PARTITION BY子句包含分片字段的窗口函数。 比如下面的SQL,如果class不是分片字段,这个SQL将不能支持。 SELECT class, student, course, score, avg(score) OVER (PARTITION BY class) -- class is not distribution column FROM student_score WHERE term = '2017-1'; 回避方法 可以通过临时表或dblink将数据拉到CN上处理进行回避。上面的SQL可以改写为 SELECT class, student, course, score, avg(score) OVER (PARTITION BY class) FROM dblink('', $$ SELECT class, student, course, score FROM student_score WHERE term = '2017-1' $$ ) as a(class text,student text,course text,score float4); 事务一致性 citus中没有全局的事务快照,这和MyCAT等常见的分库分表方案一样。这样可以简化设计提升性能,带来的问题就是不能保证全局的一致性读,而只能实现的最终一致性。 举个例子,从用户A的账户上转账100块到用户B的账户上,用户A和B对应的记录分别位于不同的Worker上。citus使用2PC处理这个分布式事务,具体处理步骤如下 Worker1 : BEGIN; Worker2 : BEGIN; Worker1 : UPDATE account_$shardid SET amount = amount - 100 WHERE user = 'A'; 
Worker2 : UPDATE account_$shardid SET amount = amount + 100 WHERE user = 'B'; Worker1 : PREPARE TRANSACTION 'tran1'; Worker2 : PREPARE TRANSACTION 'tran2'; Worker1 : COMMIT PREPARED 'tran1'; Worker2 : COMMIT PREPARED 'tran2'; 在步骤7和8之间,参与2PC的2个子事务一个已提交另一个还没提交。 如果此时有人查询账户总额,会得到一个不正确的值。 SELECT sum(amount) FROM account; 部署实施 citus的安装非常简单,但要实际用到生产上还需要下一番功夫。比如如何扩容,这是一个上生产后无法回避的问题,而citus的社区版恰恰又不支持扩容。怎么办? 办法当然是有的,关于citus的部署,后面准备通过一个系列的文章进行介绍,扩容其中一章的内容。 入门 实验环境搭建 平滑扩容 多CN部署 连接管理 高可用
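上文提到了参考表但没有给出定义方式,这里补一个最小示例(表名仅为示意):参考表在每个worker上都有完整副本,可以和任意分片表在本地join。

create table region_dict(code int primary key, name text);
select create_reference_table('region_dict');

create table orders(id bigint primary key, region_code int, amount numeric);
select create_distributed_table('orders', 'id');

-- 分片表和参考表join可以直接下推到各个worker本地执行
select o.id, r.name from orders o join region_dict r on o.region_code = r.code limit 10;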
# 闲聊PostgreSQL的oid ## oid为何物? PostgreSQL的系统表中大多包含一个叫做OID的隐藏字段,这个OID也是这些系统表的主键。 所谓OID,中文全称就是"对象标识符"。what?还有“对象”? 如果对PostgreSQL有一定了解,应该知道PostgreSQL最初的设计理念就是"对象关系数据库"。也就是说,系统表中储存的那些元数据,比如表,视图,类型,操作符,函数,索引,FDW,甚至存储过程语言等等这些统统都是对象。具体表现就在于这些东西都可以扩展,可以定制。不仅如此,PostgreSQL还支持函数重载,表继承等这些很OO的特性。 利用PostgreSQL的这些特性,用户可以根据业务场景从应用层到数据库层做一体化的优化设计,获得极致的性能与用户体验。一些用惯了MySQL的互联网架构师推崇"把数据库当存储",这一设计准则用在MySQL上也许合适,但如果硬要套在PostgreSQL上,就有点暴殄天物了! 扯得有点远了^_^,下面举几个栗子看下oid长啥样。 ## 使用示例 先随便创建一张表 postgres=# create table tb1(id int); CREATE TABLE 再看下这张表对应的oid postgres=# select oid from pg_class where relname='tb1'; oid ------- 32894 (1 row) 这个oid是隐藏字段,因此必须在select列表里明确指定oid列名,光使用`select *`是不输出oid的。 postgres=# select *from pg_class where relname='tb1'; -[ RECORD 1 ]-------+------ relname | tb1 relnamespace | 2200 reltype | 32896 reloftype | 0 relowner | 10 relam | 0 relfilenode | 32894 reltablespace | 0 relpages | 0 reltuples | 0 relallvisible | 0 reltoastrelid | 32897 relhasindex | f relisshared | f relpersistence | p relkind | r relnatts | 2 relchecks | 0 relhasoids | f relhaspkey | f relhasrules | f relhastriggers | f relhassubclass | f relrowsecurity | f relforcerowsecurity | f relispopulated | t relreplident | d relispartition | f relfrozenxid | 596 relminmxid | 2 relacl | reloptions | relpartbound | 不同对象对应于不同的对象标识符类型,比如表对象对应的对象标识符类型就是`regclass`, 通过对象标识符类型可以实现,对象标识符的数字值和对象名称之间的自由转换。 比如,上面那条SQL可以改写成以下的形式。 postgres=# select 'tb1'::regclass::int; int4 ------- 32894 (1 row) 反过来当然也是可以的,在PostgreSQL里就是一个普通的类型转换。 postgres=# select 32894::regclass; regclass ---------- tb1 (1 row) ## 表的数据类型 作为OO的体现之一,PostgreSQL中每个表都是一个新的数据类型,即有一个相应的数据类型对象。 通过`pg_class`可以查出刚才创建的表对应的数据类型对象的oid postgres=# select reltype from pg_class where relname='tb1'; reltype --------- 32896 (1 row) 在定义数据类型的系统表`pg_type`中保存了这个类型相关的信息。 postgres=# select * from pg_type where oid=32896; -[ RECORD 1 ]--+------------ typname | tb1 typnamespace | 2200 typowner | 10 typlen | -1 typbyval | f typtype | c typcategory | C typispreferred | f typisdefined | t typdelim | , typrelid | 32894 typelem | 0 typarray | 32895 typinput | record_in typoutput | record_out typreceive | record_recv typsend | record_send typmodin | - typmodout | - typanalyze | - typalign | d typstorage | x typnotnull | f typbasetype | 0 typtypmod | -1 typndims | 0 typcollation | 0 typdefaultbin | typdefault | typacl | 数据类型的对象标识符类型是regtype,通过regtype转换可以看到新创建的数据类型对象的名字也叫`tb1`。 postgres=# select 32896::regtype; regtype --------- tb1 (1 row) `tb1`类型在使用上和内置的int,text这些常见的数据类型几乎没有区别。 所以,你可以把一个字符串的值转换成`tb1`类型。 postgres=# select $$(999,'abcd')$$::text::tb1; tb1 -------------- (999,'abcd') (1 row) 可以使用`.`取出表类型里面的1个或所有字段 postgres=# select ($$(999,'abcd')$$::text::tb1).id; id ----- 999 (1 row) postgres=# select ($$(999,'abcd')$$::text::tb1).*; id | c1 -----+-------- 999 | 'abcd' (1 row) 当然,还可以用这个类型去创建新的表 postgres=# create table tb2(id int, c1 tb1); CREATE TABLE 如果你其实是想要创建一个像表一样的数据类型(即多个字段的组合),也可以单独创建这个数据类型。 'g, postgres=# create type ty1 as (id int,c1 text); CREATE TYPE ## 表文件 每个表的数据存储在文件系统中单独的文件中(实际不止一个文件),文件路径可以通过系统函数查询 postgres=# select pg_relation_filepath('tb1'); pg_relation_filepath ---------------------- base/13211/32894 (1 row) 上面的`base`对应的是缺省表空间,除此以外还有global表空间。 postgres=# select oid,* from pg_tablespace ; oid | spcname | spcowner | spcacl | spcoptions ------+------------+----------+--------+------------ 1663 | pg_default | 10 | | 1664 | pg_global | 10 | | (2 rows) 用户等全局对象存储在global表空间 postgres=# select relname,reltablespace from pg_class where relkind='r' and reltablespace<>0; relname | reltablespace 
-----------------------+--------------- pg_authid | 1664 pg_subscription | 1664 pg_database | 1664 pg_db_role_setting | 1664 pg_tablespace | 1664 pg_pltemplate | 1664 pg_auth_members | 1664 pg_shdepend | 1664 pg_shdescription | 1664 pg_replication_origin | 1664 pg_shseclabel | 1664 (11 rows) 表文件路径的第2部分13211是表所在数据库的oid postgres=# select oid,datname from pg_database; oid | datname -------+----------- 13211 | postgres 1 | template1 13210 | template0 (3 rows) 第3部分就是表对象的oid。 ## oid如何分配? oid的分配来自一个实例的全局变量,每分配一个新的对象,对这个全局变量加一。 当分配的oid超过4字节整形最大值的时候会重新从0开始分配,但这并不会导致类似于事务ID回卷那样严重的影响。 系统表一般会以oid作为主键,分配oid时,PostgreSQL会通过主键索引检查新的oid是否在相应的系统表中已经存在, 如果存在则尝试下一个oid。 相关代码如下: Oid GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn) { Oid newOid; SnapshotData SnapshotDirty; SysScanDesc scan; ScanKeyData key; bool collides; InitDirtySnapshot(SnapshotDirty); /* Generate new OIDs until we find one not in the table */ do { CHECK_FOR_INTERRUPTS(); newOid = GetNewObjectId(); ScanKeyInit(&key, oidcolumn, BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(newOid)); /* see notes above about using SnapshotDirty */ scan = systable_beginscan(relation, indexId, true, &SnapshotDirty, 1, &key); collides = HeapTupleIsValid(systable_getnext(scan)); systable_endscan(scan); } while (collides); return newOid; } 因此,oid溢出不会导致系统表中出现oid冲突(2个不同的系统表可能存在oid相同的对象)。 但重试毕竟会使分配有效的oid花费较多的时间,因此不建议用户为普通的用户表使用oid(使用`with oids`)从而导致oid过早的耗尽。 而且,使用oid的用户表如果未给oid创建唯一索引,oid溢出时,可能这个用户表中可能出现重复oid。以下是一个简单的演示: 创建一个`with oids`的表,并插入2条记录 postgres=# create table tb3(id int) with oids; CREATE TABLE postgres=# insert into tb3 values(1); INSERT 32912 1 postgres=# insert into tb3 values(2); INSERT 32913 1 此时,下一个全局oid是32914 [postgres@node1 ~]$ pg_ctl -D data stop waiting for server to shut down.... done server stopped [postgres@node1 ~]$ pg_controldata data pg_control version number: 1002 Catalog version number: 201707211 Database system identifier: 6500386650559491472 Database cluster state: shut down pg_control last modified: Sun 07 Jan 2018 11:14:58 PM CST Latest checkpoint location: 0/9088930 Prior checkpoint location: 0/9073988 Latest checkpoint's REDO location: 0/9088930 Latest checkpoint's REDO WAL file: 000000010000000000000009 Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:602 Latest checkpoint's NextOID: 32914 Latest checkpoint's NextMultiXactId: 2 Latest checkpoint's NextMultiOffset: 3 Latest checkpoint's oldestXID: 548 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid:0 Latest checkpoint's newestCommitTsXid:0 Time of latest checkpoint: Sun 07 Jan 2018 11:14:58 PM CST Fake LSN counter for unlogged rels: 0/1 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: off max_connections setting: 100 max_worker_processes setting: 8 max_prepared_xacts setting: 0 max_locks_per_xact setting: 64 track_commit_timestamp setting: off Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers 
Float4 argument passing: by value Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: 5b060aed93e061d3d1ad2dccdfe3336b1ac844f94872e068d86587c48c7d394a 篡改下一个全局oid为32912 [postgres@node1 ~]$ pg_resetwal -D data -o 32912 Write-ahead log reset [postgres@node1 ~]$ pg_ctl -D data start 再插入3条记录,oid存在重复分配。 postgres=# insert into tb3 values(3); INSERT 32912 1 postgres=# insert into tb3 values(4); INSERT 32913 1 postgres=# insert into tb3 values(5); INSERT 32914 1 postgres=# select oid,* from tb3; oid | id -------+---- 32912 | 1 32913 | 2 32912 | 3 32913 | 4 32914 | 5 (5 rows)
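按上文所说,如果确实要使用with oids的用户表,最好给oid建一个唯一索引。这样分配oid时会走GetNewOidWithIndex()的冲突检查,oid回卷后也只是插入时多一些重试,而不会出现重复oid(示意):

create unique index tb3_oid_idx on tb3(oid);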
利用pg_resetwal回到过去 PostgreSQL中提供了一个pg_resetwal(9.6及以前版本叫pg_resetxlog)工具命令,它的本职工作是清理不需要的WAL文件, 但除此以外还能干点别的。详见: http://postgres.cn/docs/9.6/app-pgresetxlog.html 根据PG的MVCC实现,更新删除记录时,不是原地更新而新建元组并通过设置标志位使原来的记录成为死元组。 pg_resetwal的一项特技是篡改当前事务ID,使得可以访问到这些死元组,只要这些死元组还未被vacuum掉。 下面做个演示。 创建测试库 初始化数据库 [postgres@node1 ~]$ initdb data1 The files belonging to this database system will be owned by user "postgres". This user must also own the server process. The database cluster will be initialized with locale "en_US.UTF-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. creating directory data1 ... ok creating subdirectories ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting dynamic shared memory implementation ... posix creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok WARNING: enabling "trust" authentication for local connections You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb. Success. You can now start the database server using: pg_ctl -D data1 -l logfile start 启动PG [postgres@node1 ~]$ pg_ctl -D data1 -l logfile start waiting for server to start.... done server started 插入测试数据 [postgres@node1 ~]$ psql psql (11devel) Type "help" for help. postgres=# create table tb1(id int); CREATE TABLE postgres=# insert into tb1 values(1); INSERT 0 1 postgres=# insert into tb1 values(2); INSERT 0 1 postgres=# insert into tb1 values(3); INSERT 0 1 postgres=# insert into tb1 values(4); INSERT 0 1 postgres=# insert into tb1 values(5); INSERT 0 1 查看每条记录对应的事务号 postgres=# select xmin ,* from tb1; xmin | id ------+---- 556 | 1 557 | 2 558 | 3 559 | 4 560 | 5 (5 rows) 重置当前事务ID 重置当前事务ID为559 [postgres@node1 ~]$ pg_ctl -D data1 stop waiting for server to shut down.... done server stopped [postgres@node1 ~]$ pg_resetwal -D data1 -x 559 Write-ahead log reset [postgres@node1 ~]$ pg_ctl -D data1 start waiting for server to start....2017-09-30 22:59:37.902 CST [11862] LOG: listening on IPv6 address "::1", port 5432 2017-09-30 22:59:37.902 CST [11862] LOG: listening on IPv4 address "127.0.0.1", port 5432 2017-09-30 22:59:37.906 CST [11862] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" 2017-09-30 22:59:37.927 CST [11863] LOG: database system was shut down at 2017-09-30 22:59:34 CST 2017-09-30 22:59:37.935 CST [11862] LOG: database system is ready to accept connections done server started 检查数据 事务559及以后事务插入的数据将不再可见。 如果事务559及以后事务删除了数据,并且被删除的元组还没被回收,那么过去的数据也会重新出现。 [postgres@node1 ~]$ psql psql (11devel) Type "help" for help. 
postgres=# select xmin ,* from tb1; xmin | id ------+---- 556 | 1 557 | 2 558 | 3 (3 rows) 如果继续做一个插入,对应事务ID为559,可以惊奇的发现,之前被隐藏的老的559事务插入的数据也出现了。 postgres=# insert into tb1 values(6); INSERT 0 1 postgres=# select xmin ,* from tb1; xmin | id ------+---- 556 | 1 557 | 2 558 | 3 559 | 4 559 | 6 (5 rows) 再做一个插入,对应事务ID为560,效果和前面一样。 postgres=# insert into tb1 values(7); INSERT 0 1 postgres=# select xmin ,* from tb1; xmin | id ------+---- 556 | 1 557 | 2 558 | 3 559 | 4 560 | 5 559 | 6 560 | 7 (7 rows) 解释 PG的MVCC机制通过当前事务快照判断元组可见性,对事务快照影响最大的就是当前事务ID,只有小于等于当前事务ID且已提交的事务的变更才对当前事务可见。这也是利用pg_resetwal可以在一定程度上回到过去的原因。但是被删除的元组是否能找回依赖于vacuum。 如何阻止vacuum 我们可以在一定程度上控制vacuum,比如关闭特定表的autovacuum改为定期通过crontab回收死元组或设置vacuum_defer_cleanup_age延迟vacuum。 下面的示例,设置vacuum_defer_cleanup_age=10 postgres=# alter system set vacuum_defer_cleanup_age=10; ALTER SYSTEM postgres=# select pg_reload_conf(); pg_reload_conf ---------------- t (1 row) 准备一些数据并执行删除操作 postgres=# create table tb1(id int); CREATE TABLE postgres=# insert into tb1 values(1); INSERT 0 1 postgres=# insert into tb1 values(2); INSERT 0 1 postgres=# select xmin,* from tb1; xmin | id ------+---- 556 | 1 557 | 2 (2 rows) postgres=# delete from tb1 where id=2; DELETE 1 postgres=# select xmin,* from tb1; xmin | id ------+---- 556 | 1 (1 row) 立即执行vacuum不会释放被删除的元组 postgres=# vacuum VERBOSE tb1; INFO: vacuuming "public.tb1" INFO: "tb1": found 0 removable, 2 nonremovable row versions in 1 out of 1 pages DETAIL: 1 dead row versions cannot be removed yet, oldest xmin: 550 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. VACUUM 直到执行一些其它事务,等当前事务号向前推进10个以上,再执行vacuum才能回收这个死元组。 postgres=# insert into tb1 values(3); INSERT 0 1 postgres=# insert into tb1 values(4); INSERT 0 1 ... postgres=# vacuum VERBOSE tb1; INFO: vacuuming "public.tb1" INFO: "tb1": removed 1 row versions in 1 pages INFO: "tb1": found 1 removable, 10 nonremovable row versions in 1 out of 1 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 559 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s. VACUUM 注意阻止vacuum会导致垃圾堆积数据膨胀,对更新频繁的数据库或表要慎重使用这一技巧。并且这种方式不适用于drop table,vacuum full和truncate ,因为原来的数据文件已经被删了。
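上面只演示了vacuum_defer_cleanup_age,关闭特定表autovacuum的方式如下,之后需要自行定期执行vacuum回收死元组(示意):

alter table tb1 set (autovacuum_enabled = off);
-- 需要回收时再手动执行
vacuum verbose tb1;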
唯一索引的行估算实验

唯一索引除了有业务上的约束作用,还可以使行估算更准确。对唯一索引列的等值条件查询,即使统计信息缺失,也能得到准确的行估算值,即1。

实验

创建不收集统计信息的测试表

postgres=# create table tbc1(id int) with (autovacuum_enabled=off);
CREATE TABLE
postgres=# insert into tbc1 select * from generate_series(1,10000);
INSERT 0 10000

查询某唯一值,但行估算为57。

postgres=# explain select * from tbc1 where id =10;
                       QUERY PLAN
-------------------------------------------------------
 Seq Scan on tbc1  (cost=0.00..188.44 rows=57 width=4)
   Filter: (id = 10)
(2 rows)

创建普通索引,行估算仍然不准,为50。

postgres=# create index on tbc1(id);
CREATE INDEX
postgres=# explain select * from tbc1 where id =10;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Bitmap Heap Scan on tbc1  (cost=2.17..38.17 rows=50 width=4)
   Recheck Cond: (id = 10)
   ->  Bitmap Index Scan on tbc1_id_idx  (cost=0.00..2.16 rows=50 width=0)
         Index Cond: (id = 10)
(4 rows)

创建唯一索引,行估算变为1,和实际吻合。

postgres=# create unique index on tbc1(id);
CREATE INDEX
postgres=# explain select * from tbc1 where id =10;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using tbc1_id_idx1 on tbc1  (cost=0.29..3.30 rows=1 width=4)
   Index Cond: (id = 10)
(2 rows)

唯一索引对行估算的作用不适用于非等值条件。比如下面的范围条件查询,行估算仍然是按默认选择率得到的3333,和实际返回的0行相差很远。

postgres=# explain analyze select * from tbc1 where id <1;
...
   ->  Bitmap Index Scan on tbc1_id_idx1  (cost=0.00..40.28 rows=3333 width=0) (actual time=0.007..0.007 rows=0 loops=1)
         Index Cond: (id <1)
...

SQL中也不要在条件字段上附加计算或类型转换,否则即使有唯一索引,估算也不会准。

postgres=# explain select * from tbc1 where id::text ='10';
                       QUERY PLAN
-------------------------------------------------------
 Seq Scan on tbc1  (cost=0.00..220.00 rows=50 width=4)
   Filter: ((id)::text = '10'::text)
(2 rows)

由于关闭了autovacuum,测试过程中测试表的统计信息全程为空。

postgres=# select * from pg_stats where tablename='tbc1';
 schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation | most_common_elems | most_common_elem_freqs | elem_count_histogram
------------+-----------+---------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------+-------------------+------------------------+----------------------
(0 rows)
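作为对照,只要收集了统计信息,即使只有普通索引,等值条件的行估算一般也能估到1,因为id列的n_distinct接近行数(示意):

analyze tbc1;
explain select * from tbc1 where id = 10;
-- 此时行估算应为 rows=1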
MySQL Utilities 高可用工具体验 MySQL Utilities是MySQL官方的工具集,其中包括高可用相关的几个工具。 以下是对当前最新版本1.6的使用体验。 前提条件 MySQL Server 5.6+ 基于GTID的复制 Python 2.6+ Connector/Python 2.0+ 环境准备 在1台机器准备3个不同端口的MySQL实例用于测试 192.168.107.211:9001(master) 192.168.107.211:9002(slave1) 192.168.107.211:9003(slave2) 软件 OS: CentOS 7.1 MySQL: Percona Server 5.7.19 Python: 2.7.5 Connector/Python:2.1.7 mysql-utilities:1.6.5 创建MySQL实例1 生成实例1的配置文件my1.cnf su - mysql vi my1.cnf [mysqld] port=9001 datadir=/var/lib/mysql/data1 socket=/var/lib/mysql/data1/mysql.sock basedir=/usr/ innodb_buffer_pool_size=128M explicit_defaults_for_timestamp skip-name-resolve lower-case-table-names expire-logs-days=7 plugin-load="rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so" rpl_semi_sync_master_wait_point=AFTER_SYNC rpl_semi_sync_master_wait_no_slave=ON rpl_semi_sync_master_enabled=ON rpl_semi_sync_slave_enabled=ON rpl_semi_sync_master_timeout=5000 server-id=9001 log_bin=binlog gtid-mode=ON enforce-gtid-consistency=ON log-slave-updates=ON master-info-repository=TABLE relay-log-info-repository=TABLE report-host=192.168.107.211 log-error=/var/lib/mysql/data1/mysqld.log pid-file=/var/lib/mysql/data1/mysqld.pid general-log=ON general-log-file=/var/lib/mysql/data1/node1.log [mysqld_safe] pid-file=/var/lib/mysql/data1/mysqld.pid socket=/var/lib/mysql/data1/mysql.sock nice = 0 创建MySQL实例 mysqld --defaults-file=my1.cnf --initialize-insecure mysqld --defaults-file=my1.cnf & mysql -S data1/mysql.sock -uroot -e "set sql_log_bin=OFF;GRANT ALL PRIVILEGES ON *.* TO 'admin'@'%' IDENTIFIED BY '12345' WITH GRANT OPTION" 创建MySQL实例2 sed s/9001/9002/g my1.cnf | sed s/data1/data2/g >my2.cnf mysqld --defaults-file=my2.cnf --initialize-insecure mysqld --defaults-file=my2.cnf & mysql -S data2/mysql.sock -uroot -e "set sql_log_bin=OFF;GRANT ALL PRIVILEGES ON *.* TO 'admin'@'%' IDENTIFIED BY '12345' WITH GRANT OPTION" 创建MySQL实例3 sed s/9001/9003/g my1.cnf | sed s/data1/data3/g >my3.cnf mysqld --defaults-file=my3.cnf --initialize-insecure mysqld --defaults-file=my3.cnf & mysql -S data3/mysql.sock -uroot -e "set sql_log_bin=OFF;GRANT ALL PRIVILEGES ON *.* TO 'admin'@'%' IDENTIFIED BY '12345' WITH GRANT OPTION" 利用mysqlreplicate建立复制 -bash-4.2$ mysqlreplicate --master=admin:12345@192.168.107.211:9001 --slave=admin:12345@192.168.107.211:9002 --rpl-user=repl:repl -v WARNING: Using a password on the command line interface can be insecure. # master on 192.168.107.211: ... connected. # slave on 192.168.107.211: ... connected. # master id = 9001 # slave id = 9002 # master uuid = b8ca6259-ab80-11e7-91fc-000c296dd240 # slave uuid = d842240c-ab80-11e7-960f-000c296dd240 # Checking InnoDB statistics for type and version conflicts. # Checking storage engines... # Checking for binary logging on master... # Setting up replication... # Granting replication access to replication user... # Connecting slave to master... # CHANGE MASTER TO MASTER_HOST = '192.168.107.211', MASTER_USER = 'repl', MASTER_PASSWORD = 'repl', MASTER_PORT = 9001, MASTER_AUTO_POSITION=1 # Starting slave from master's last position... # IO status: Waiting for master to send event # IO thread running: Yes # IO error: None # SQL thread running: Yes # SQL error: None # ...done. 
除去各种检查,mysqlreplicate真正做的事很简单。如下 先在master上创建复制账号 CREATE USER 'repl'@'192.168.107.211' IDENTIFIED WITH 'repl' GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.107.211' IDENTIFIED WITH 'repl' mysqlreplicate会为每个Slave创建一个复制账号,除非通过以下SQL发现该账号已经存在。 SELECT * FROM mysql.user WHERE user = 'repl' and host = '192.168.107.211' 然后在slave上设置复制 CHANGE MASTER TO MASTER_HOST = '192.168.107.211', MASTER_USER = 'repl', MASTER_PASSWORD = 'repl', MASTER_PORT = 9001, MASTER_AUTO_POSITION=1 在启用GTID的情况的下,从哪儿开始复制完全由GTID决定,所以mysqlreplicate中的那些和复制起始位点相关的参数,比如-b,统统被无视,其效果相当于-b。 注意:mysqlreplicate不会理会当前的复制拓扑,所以如果把master和slave对调再执行一次,就变成主主复制了。 slave1的复制配置好后,用同样的方法配置slave2的复制 mysqlreplicate --master=admin:12345@192.168.107.211:9001 --slave=admin:12345@192.168.107.211:9003 --rpl-user=repl:repl -v 通过mysqlrplshow查看复制拓扑 -bash-4.2$ mysqlrplshow --master=admin:12345@192.168.107.211:9001 --discover-slaves-login=admin:12345 -v WARNING: Using a password on the command line interface can be insecure. # master on 192.168.107.211: ... connected. # Finding slaves for master: 192.168.107.211:9001 # Replication Topology Graph 192.168.107.211:9001 (MASTER) | +--- 192.168.107.211:9002 [IO: Yes, SQL: Yes] - (SLAVE) | +--- 192.168.107.211:9003 [IO: Yes, SQL: Yes] - (SLAVE) mysqlrplshow通过在master上执行SHOW SLAVE HOSTS发现初步的复制拓扑。 由于Slave停止复制或改变复制源时不能立刻反应到master的SHOW SLAVE HOSTS上,所以初步获取的复制拓扑可能存在冗余, 因此,mysqlrplshow还会再连到slave上执行SHOW SLAVE STATUS进行确认。 通过mysqlrpladmin检查集群健康状态 -bash-4.2$ mysqlrpladmin --master=admin:12345@192.168.107.211:9001 --slaves=admin:12345@192.168.107.211:9002,admin:12345@192.168.107.211:9003 health WARNING: Using a password on the command line interface can be insecure. # Checking privileges. # # Replication Topology Health: +------------------+-------+---------+--------+------------+---------+ | host | port | role | state | gtid_mode | health | +------------------+-------+---------+--------+------------+---------+ | 192.168.107.211 | 9001 | MASTER | UP | ON | OK | | 192.168.107.211 | 9002 | SLAVE | UP | ON | OK | | 192.168.107.211 | 9003 | SLAVE | UP | ON | OK | +------------------+-------+---------+--------+------------+---------+ # ...done. 通过mysqlrpladmin elect挑选合适的新主 -bash-4.2$ mysqlrpladmin --master=admin:12345@192.168.107.211:9001 --slaves=admin:12345@192.168.107.211:9002,admin:12345@192.168.107.211:9003 elect WARNING: Using a password on the command line interface can be insecure. # Checking privileges. # Electing candidate slave from known slaves. # Best slave found is located on 192.168.107.211:9002. # ...done. 然而,elect只是从slaves中选出第一个合格的slave,并不考虑复制是否已停止,以及哪个节点的日志更全。 下面把slave1的复制停掉 mysql -S data2/mysql.sock -uroot -e "stop slave" 再在master执行一条SQL mysql -S data1/mysql.sock -uroot -e "create database test" 现在slave1上少了一个事务 -bash-4.2$ mysqlrpladmin --master=admin:12345@192.168.107.211:9001 --slaves=admin:12345@192.168.107.211:9002,admin:12345@192.168.107.211:9003 gtid WARNING: Using a password on the command line interface can be insecure. # Checking privileges. 
# # UUIDS for all servers: +------------------+-------+---------+---------------------------------------+ | host | port | role | uuid | +------------------+-------+---------+---------------------------------------+ | 192.168.107.211 | 9001 | MASTER | 5daf1e10-ac41-11e7-bcc4-000c296dd240 | | 192.168.107.211 | 9002 | SLAVE | fe084f45-ac43-11e7-a343-000c296dd240 | | 192.168.107.211 | 9003 | SLAVE | d0af3a6a-ac41-11e7-85e0-000c296dd240 | +------------------+-------+---------+---------------------------------------+ # # Transactions executed on the server: +------------------+-------+---------+-------------------------------------------+ | host | port | role | gtid | +------------------+-------+---------+-------------------------------------------+ | 192.168.107.211 | 9001 | MASTER | 5daf1e10-ac41-11e7-bcc4-000c296dd240:1-3 | | 192.168.107.211 | 9002 | SLAVE | 5daf1e10-ac41-11e7-bcc4-000c296dd240:1-2 | | 192.168.107.211 | 9003 | SLAVE | 5daf1e10-ac41-11e7-bcc4-000c296dd240:1-3 | +------------------+-------+---------+-------------------------------------------+ # ...done. 但elect仍然会选slave1 -bash-4.2$ mysqlrpladmin --master=admin:12345@192.168.107.211:9001 --slaves=admin:12345@192.168.107.211:9002,admin:12345@192.168.107.211:9003 elect WARNING: Using a password on the command line interface can be insecure. # Checking privileges. # Electing candidate slave from known slaves. # Best slave found is located on 192.168.107.211:9002. # ...done. 通过mysqlrpladmin switchover在线切换主备 -bash-4.2$ mysqlrpladmin --master=admin:12345@192.168.107.211:9001 --slaves=admin:12345@192.168.107.211:9002,admin:12345@192.168.107.211:9003 --new-master=admin:12345@192.168.107.211:9002 switchover WARNING: Using a password on the command line interface can be insecure. # Checking privileges. # Performing switchover from master at 192.168.107.211:9001 to slave at 192.168.107.211:9002. # Checking candidate slave prerequisites. # Checking slaves configuration to master. # Waiting for slaves to catch up to old master. Slave 192.168.107.211:9002 did not catch up to the master. ERROR: Slave 192.168.107.211:9002 did not catch up to the master. switchover会连接到每一个节点并等待所有slave回放完日志才执行切换,因此有任何一个节点故障或任何一个slave复制故障都不会执行switchover。 启动刚才停掉的slave1的复制 mysql -S data2/mysql.sock -uroot -e "start slave" 再次执行switchover,成功 -bash-4.2$ mysqlrpladmin --master=admin:12345@192.168.107.211:9001 --slaves=admin:12345@192.168.107.211:9002,admin:12345@192.168.107.211:9003 --new-master=admin:12345@192.168.107.211:9002 --demote-master switchover WARNING: Using a password on the command line interface can be insecure. # Checking privileges. # Performing switchover from master at 192.168.107.211:9001 to slave at 192.168.107.211:9002. # Checking candidate slave prerequisites. # Checking slaves configuration to master. # Waiting for slaves to catch up to old master. # Stopping slaves. # Performing STOP on all slaves. # Demoting old master to be a slave to the new master. # Switching slaves to new master. # Starting all slaves. # Performing START on all slaves. # Checking slaves for errors. # Switchover complete. # # Replication Topology Health: +------------------+-------+---------+--------+------------+---------+ | host | port | role | state | gtid_mode | health | +------------------+-------+---------+--------+------------+---------+ | 192.168.107.211 | 9002 | MASTER | UP | ON | OK | | 192.168.107.211 | 9001 | SLAVE | UP | ON | OK | | 192.168.107.211 | 9003 | SLAVE | UP | ON | OK | +------------------+-------+---------+--------+------------+---------+ # ...done. 
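顺带一提,从前面的例子可以看到elect并不关心候选slave的日志是否最全,手动决定切换目标时最好自己用GTID函数核对一下。下面是核对思路的示意,GTID集合请换成实际值:

-- 在候选slave(9002)上查看已执行的GTID集合
SELECT @@GLOBAL.gtid_executed;
-- 判断master已执行的GTID是否都包含在候选slave中,返回1说明候选slave日志不缺
SELECT GTID_SUBSET('5daf1e10-ac41-11e7-bcc4-000c296dd240:1-3', @@GLOBAL.gtid_executed);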
执行switchover时,有一段Waiting for slaves to catch up to old master.,如果任何一个slave有故障无法同步到和master相同的状态,switchover会失败。即switchover的前提条件是所有节点(包括master和所有salve)都是OK的。 通过mysqlrpladmin failover故障切换主备 -bash-4.2$ mysqlrpladmin --slaves=admin:12345@192.168.107.211:9001,admin:12345@192.168.107.211:9003 failover WARNING: Using a password on the command line interface can be insecure. # Checking privileges. # Performing failover. # Candidate slave 192.168.107.211:9001 will become the new master. # Checking slaves status (before failover). # Preparing candidate for failover. # Creating replication user if it does not exist. # Stopping slaves. # Performing STOP on all slaves. # Switching slaves to new master. # Disconnecting new master as slave. # Starting slaves. # Performing START on all slaves. # Checking slaves for errors. # Failover complete. # # Replication Topology Health: +------------------+-------+---------+--------+------------+---------+ | host | port | role | state | gtid_mode | health | +------------------+-------+---------+--------+------------+---------+ | 192.168.107.211 | 9001 | MASTER | UP | ON | OK | | 192.168.107.211 | 9003 | SLAVE | UP | ON | OK | +------------------+-------+---------+--------+------------+---------+ # ...done. failover时要求所有slave的SQL线程都是正常的,IO线程可以停止或异常。 如果未指定--candidates,一般会以slaves中第1个slave作为新主。 如果新主的binlog不是最新的,会先向拥有最新日志的slave复制,并等到binlog追平了再切换。 小结 从上面操作过程来看,借助MySQL Utilities管理MySQL集群还比较简便,但结合代码考虑到各种场景,这套工具和MHA比起来还不够严谨。 没有把从库的READ_ONLY设置集成到脚本里 switchover时没有终止运行中的事务,实际也没有有效的手段阻止新的写事务在旧master上执行。 failover不检查master死活,需要DBA在调用failover前自己检查,否则会引起脑裂。
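比如上面提到的READ_ONLY,就需要DBA在切换前后自己到各节点上补上,大致如下(示意,super_read_only为MySQL 5.7+的功能):

-- 在所有slave上设置只读,防止应用误写
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;   -- 5.7+,对SUPER账号也生效
-- 在新master上解除只读
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;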
本文为DBAPlus投稿文章, 原文链接: http://dbaplus.cn/news-19-1514-1.html 一次PostgreSQL行估算偏差导致的慢查询分析 问题 最近某业务系统上线了新功能,然后我们就发现PostgreSQL日志中多了很多慢查询。这些SQL语句都比较相似,下面是其中一个SQL的explain analyze执行计划输出。 这个SQL执行了18秒,从上面的执行计划不难看出,时间主要耗在两次嵌套join时对子表的顺序扫描(图中蓝线部分)。乘以5429的循环次数,每个join都要顺序扫描2000多万条记录。 分析 既然是顺序扫描惹的祸,那么在join列上加个索引是不是就可以了呢? 但是查看相关表定义后,发现在相关的表上已经有索引了;而且即使没有索引,PG也应该可以通过Hash join回避大量的顺序扫描。 再仔细看下执行计划里的cost估算,发现PG估算出的rows只有1行,而实际是5429(图中红线部分)。看来是行数估算的巨大偏差导致PG选错了执行计划。 为什么估算行数偏差这么大? 通过尝试,发现问题出在下面的过滤条件上。不加这个过滤条件估算行数和实际行数是基本吻合的,一加就相差的离谱。 Filter: (((zsize)::text = '2'::text) AND ((tmall_flg)::text = '1'::text)) 而上面的zsite的数据类型是char(10),tmall_flg的数据类型是int,难道是类型转换惹的祸? 在测试环境把尝试去掉SQL里的类型转换,发现执行时间立刻从10几秒降到1秒以内。看来原因就是它了。 zsize::text = '2' AND tmall_flg::text = '1' ==》 zsize = '2' AND tmall_flg = 1 生产环境下,因为修改应用的SQL需要时间,临时采用下面的回避措施 alter table bi_dm.tdm_wh_zl057_rt alter zsize type varchar(10); 即把zsize的类型从char(10)改成varchar(10)(varchar到text的类型转换不会影响结果行估算)。由于没有改tmall_flg,修改之后,估算的行数是79行,依然不准确。但是这带来的cost计算值的变化已经足以让PG选择索引扫描而不是顺序扫描了。修改之后的执行时间只有311毫秒。 原理 PG如何估算结果行数 PG通过收集的统计信息估算结果行数,并且收集的统计信息也很全面,包括唯一值数量,频繁值分布,柱状图和相关性,正常情况下应该是比较准确的。看下面的例子 无where条件 postgres=# explain select * from bi_dm.tdm_wh_zl057_rt; QUERY PLAN --------------------------------------------------------------------------- Seq Scan on tdm_wh_zl057_rt (cost=0.00..81318.21 rows=2026121 width=154) (1 row) 全表数据的估算值来自pg_class postgres=# select reltuples from pg_class where relname='tdm_wh_zl057_rt'; reltuples ----------- 2026121 (1 row) 估算值和实际值的误差只有5%左右 postgres=# select count(*) from bi_dm.tdm_wh_zl057_rt; count --------- 2103966 (1 row) 带等值where条件 postgres=# explain select * from bi_dm.tdm_wh_zl057_rt where tmall_flg = 1; QUERY PLAN -------------------------------------------------------------------------- Seq Scan on tdm_wh_zl057_rt (cost=0.00..86403.32 rows=523129 width=154) Filter: (tmall_flg = 1) (2 rows) 带where条件后,PG根据pg_stats收集的列值分布信息估算出where条件的选择率。tmall_flg = 1属于频繁值,most_common_freqs中直接记录了其选择率为0.258133322 postgres=# select * from pg_stats where tablename='tdm_wh_zl057_rt' and attname='tmall_flg'; -[ RECORD 1 ]----------+-------------------------------------- schemaname | bi_dm tablename | tdm_wh_zl057_rt attname | tmall_flg inherited | f null_frac | 0.00033333333 avg_width | 4 n_distinct | 5 most_common_vals | {0,1,2} most_common_freqs | {0.626866639,0.258133322,0.114566669} histogram_bounds | {3,4} correlation | 0.491312951 most_common_elems | most_common_elem_freqs | elem_count_histogram | 结合总记录数,可以算出估算结果行数。 postgres=# select 2026121*0.258133322; ?column? ------------------ 523009.344503962 (1 row) 估算值和实际值的误差只有1%左右 postgres=# select count(*) from bi_dm.tdm_wh_zl057_rt where tmall_flg = 1; count -------- 532630 (1 row) 带等值where条件,且条件列带类型转换 postgres=# explain select * from bi_dm.tdm_wh_zl057_rt where tmall_flg::text = '1'; QUERY PLAN ------------------------------------------------------------------------- Seq Scan on tdm_wh_zl057_rt (cost=0.00..96561.46 rows=10131 width=155) Filter: ((tmall_flg)::text = '1'::text) (2 rows) 一旦在条件列上引入包括类型转换,函数调用之类的计算,PG就无法通过pg_stats计算选择率了,于是笼统的采用了一个0.005的默认值。通过这个默认的选择率计算的结果行数可能会和实际结果行数有巨大的偏差。如果where条件中这样的列不止一个,偏差会被进一步放大。 postgres=# select 2026121*0.005; ?column? ----------- 10130.605 (1 row) 相关代码 src/include/utils/selfuncs.h: /* default selectivity estimate for equalities such as "A = b" */ #define DEFAULT_EQ_SEL 0.005 src/backend/utils/adt/selfuncs.c: Datum eqsel(PG_FUNCTION_ARGS) { ... /* * If expression is not variable = something or something = variable, then * punt and return a default estimate. 
*/
    if (!get_restriction_variable(root, args, varRelid,
                                  &vardata, &other, &varonleft))
        PG_RETURN_FLOAT8(DEFAULT_EQ_SEL);

总结

在条件列上引入计算带来的危害:
该列无法使用索引(除非专门定义与查询SQL匹配的表达式索引,见文末示意)
无法准确评估where条件匹配的结果行数,可能会引发连锁反应进而生成糟糕的执行计划

回避方法:
规范表的数据类型定义,避免不必要的类型转换
将计算从列转移到常量上,比如: where c1 + 1 = 1000 改成 where c1 = 1000 - 1
改成其它等价的写法,比如: where substring(c2,1,2) = 'ZC' 改成 where c2 >= 'ZC' and c2 < 'ZD'
也可以改成更简洁的正则表达式 where c2 ~ '^ZC'

但是,如果正则表达式中除前缀锚点^之外还带了$、*之类的元字符,行数估算的准确性也会受到一定影响。
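如果应用的SQL一时改不掉,也可以按上面说的,为查询中的表达式专门建表达式索引,让PG对该表达式单独收集统计信息。下面以本文中的表和条件为例做个示意:

-- 针对 where tmall_flg::text = '1' 这类写法,建立与之匹配的表达式索引
CREATE INDEX ON bi_dm.tdm_wh_zl057_rt ((tmall_flg::text));
-- ANALYZE后pg_stats中会出现该表达式的统计信息,行估算不再使用0.005的默认选择率
ANALYZE bi_dm.tdm_wh_zl057_rt;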
PostgreSQL字符类型长度变更的性能 背景 业务有时会遇到表中的字符型字段的长度不够用的问题,需要修改表定义。但是表里的数据已经很多了,修改字段长度会不会造成应用堵塞呢? 测试验证 做了个小测试,如下 建表并插入1000w数据 postgres=# create table tbx1(id int,c1 char(10),c2 varchar(10)); CREATE TABLE postgres=# insert into tbx1 select id ,'aaaaa','aaaaa' from generate_series(1,10000000) id; INSERT 0 10000000 变更varchar类型长度 postgres=# alter table tbx1 alter COLUMN c2 type varchar(100); ALTER TABLE Time: 1.873 ms postgres=# alter table tbx1 alter COLUMN c2 type varchar(99); ALTER TABLE Time: 12815.678 ms postgres=# alter table tbx1 alter COLUMN c2 type varchar(4); ERROR: value too long for type character varying(4) Time: 5.328 ms 变更char类型长度 postgres=# alter table tbx1 alter COLUMN c1 type char(100); ALTER TABLE Time: 35429.282 ms postgres=# alter table tbx1 alter COLUMN c1 type char(6); ALTER TABLE Time: 20004.198 ms postgres=# alter table tbx1 alter COLUMN c1 type char(4); ERROR: value too long for type character(4) Time: 4.671 ms 变更char类型,varchar和text类型互转 alter table tbx1 alter COLUMN c1 type varchar(6); ALTER TABLE Time: 18880.369 ms postgres=# alter table tbx1 alter COLUMN c1 type text; ALTER TABLE Time: 12.691 ms postgres=# alter table tbx1 alter COLUMN c1 type varchar(20); ALTER TABLE Time: 32846.016 ms postgres=# alter table tbx1 alter COLUMN c1 type char(20); ALTER TABLE Time: 39796.784 ms postgres=# alter table tbx1 alter COLUMN c1 type text; ALTER TABLE Time: 32091.025 ms postgres=# alter table tbx1 alter COLUMN c1 type char(20); ALTER TABLE Time: 26031.344 ms 定义变更后的数据 定义变更后,数据位置未变,即没有产生新的tuple postgres=# select ctid,id from tbx1 limit 5; ctid | id -------+---- (0,1) | 1 (0,2) | 2 (0,3) | 3 (0,4) | 4 (0,5) | 5 (5 rows) 除varchar扩容以外的定义变更,每个tuple产生一条WAL记录 $ pg_xlogdump -f -s 3/BE002088 -n 5 rmgr: Heap len (rec/tot): 3/ 181, tx: 1733, lsn: 3/BE002088, prev 3/BE001FB8, desc: INSERT off 38, blkref #0: rel 1663/13269/16823 blk 58358 rmgr: Heap len (rec/tot): 3/ 181, tx: 1733, lsn: 3/BE002140, prev 3/BE002088, desc: INSERT off 39, blkref #0: rel 1663/13269/16823 blk 58358 rmgr: Heap len (rec/tot): 3/ 181, tx: 1733, lsn: 3/BE0021F8, prev 3/BE002140, desc: INSERT off 40, blkref #0: rel 1663/13269/16823 blk 58358 rmgr: Heap len (rec/tot): 3/ 181, tx: 1733, lsn: 3/BE0022B0, prev 3/BE0021F8, desc: INSERT off 41, blkref #0: rel 1663/13269/16823 blk 58358 rmgr: Heap len (rec/tot): 3/ 181, tx: 1733, lsn: 3/BE002368, prev 3/BE0022B0, desc: INSERT off 42, blkref #0: rel 1663/13269/16823 blk 58358 结论 varchar扩容,varchar转text只需修改元数据,毫秒完成。 其它转换需要的时间和数据量有关,1000w数据10~40秒,但是不改变数据文件,只是做检查。 缩容时如果定义长度不够容纳现有数据报错 不建议使用char类型,除了埋坑几乎没什么用,这一条不仅适用与PG,所有关系数据库应该都适用。
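需要补充的一点是:即使是毫秒级完成的元数据变更,ALTER TABLE也要短暂持有ACCESS EXCLUSIVE锁,如果碰上长查询或长事务,ALTER会被阻塞并连带阻塞后面的请求。比较稳妥的做法是给DDL加上锁超时,示意如下(新长度200只是举例):

-- 拿不到锁就尽快失败,避免长时间阻塞其它会话
SET lock_timeout = '3s';
ALTER TABLE tbx1 ALTER COLUMN c2 TYPE varchar(200);
RESET lock_timeout;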
PostgreSQL如何保障数据的一致性 玩过MySQL的人应该都知道,由于MySQL是逻辑复制,从根子上是难以保证数据一致性的。玩MySQL玩得好的专家们知道有哪些坑,应该怎么回避。为了保障MySQL数据的一致性,甚至会动用paxos,raft之类的终极武器建立严密的防护网。如果不会折腾,真不建议用MySQL存放一致性要求高的数据。 PostgreSQL由于是物理复制,天生就很容易保障数据一致性,而且回放日志的效率很高。 我们实测的结果,MySQL5.6的写qps超过4000备机就跟不上主机了;PG 8核虚机的写qps压到2.3w备机依然毫无压力,之所以只压到2.3w是因为主节点的CPU已经跑满压不上去了。 那么相比于MySQL,PG有哪些措施用于保障数据的一致性呢? 1. 严格单写 PG的备库处于恢复状态,不断的回放主库的WAL,不具备写能力。 而MySQL的单写是通过在备机上设置read_only或super_read_only实现的,DBA在维护数据库的时候可能需要解除只读状态,在解除期间发生点什么,或自动化脚本出个BUG都可能引起主备数据不一致。甚至备库在和主库建立复制关系之前数据就不是一致的,MySQL的逻辑复制并不阻止两个不一致的库建立复制关系。 2. 串行化的WAL回放 PG的备库以和主库完全相同顺序串行化的回放WAL日志。 MySQL中由于存在组提交,以及为了解决单线程复制回放慢而采取的并行复制,不得不在复制延迟和数据一致性之前做取舍。 并且这里牵扯到的逻辑很复杂,已经检出了很多的BUG;因为逻辑太复杂了,未来出现新BUG的概率应该相对也不会低。 3. 同步复制 PG通过synchronous_commit参数设置复制的持久性级别。 下面这些级别越往下越严格,从remote_write开始就可以保证单机故障不丢数据了。 off local remote_write on remote_apply 每个级别的含义参考手册:19.5. 预写式日志 MySQL通过半同步复制在很大程度上降低了failover丢失数据的概率。MySQL的主库在等待备库的应答超时时半同步复制会自动降级成异步,此时发生failover会丢失数据。 4. 全局唯一的WAL标识 WAL文件头中保存了数据库实例的唯一标识(Database system identifier),可以确保不同数据库实例产生的WAL可以区别开,同一集群的主备库拥有相同唯一标识。 PG提升备机的时候会同时提升备机的时间线,时间线是WAL文件名的一部分,通过时间线就可以把新主和旧主产生的WAL区别开。 (如果同时提升2个以上的备机,就无法这样区分WAL了,当然这种情况正常不应该发生。) WAL记录在整个WAL逻辑数据流中的偏移(lsn)作为WAL的标识。 以上3者的联合可唯一标识WAL记录 MySQL5.6开始支持GTID了,这对保障数据一致性是个极大的进步。对于逻辑复制来说,GITD确实做得很棒,但是和PG物理复制的时间线+lsn相比起来就显得太复杂了。时间线+lsn只是2个数字而已;GTID却是一个复杂的集合,而且需要定期清理。 MySQL的GTID是长这样的: e6954592-8dba-11e6-af0e-fa163e1cf111:1-5:11-18, e6954592-8dba-11e6-af0e-fa163e1cf3f2:1-27 5. 数据文件的checksum 在初始化数据库时,使用-k选项可以打开数据文件的checksum功能。(建议打开,造成的性能损失很小) 如果底层存储出现问题,可通过checksum及时发现。 initdb -k $datadir MySQL也只支持数据文件的checksum,没什么区别。 6. WAL记录的checksum 每条WAL记录里都保存了checksum信息,如果WAL的传输存储过程中出现错误可及时发现。 MySQL的binlog记录里也包含checksum,没什么区别。 7. WAL文件的验证 WAL可能来自归档的拷贝或人为拷贝,PG在读取WAL文件时会进行验证,可防止DBA弄错文件。 检查WAL文件头中记录的数据库实例的唯一标识是否和本数据库一致 检查WAL页面头中记录的页地址是否正确 其它检查 上面第2项检查的作用主要是应付WAL再利用。 PG在清理不需要的WAL文件时,有2种方式,1是删除,2是改名为未来的WAL文件名防止频繁创建文件。 看下面的例子,000000030000000000000015及以后的WAL文件的修改日期比前面的WAL还要老,这些WAL文件就是被重命名了的。 [postgres@node1 ~]$ ll data1/pg_wal/ total 213000 -rw-------. 1 postgres postgres 41 Aug 27 00:53 00000002.history -rw-------. 1 postgres postgres 16777216 Sep 1 23:56 000000030000000000000012 -rw-------. 1 postgres postgres 16777216 Sep 2 11:05 000000030000000000000013 -rw-------. 1 postgres postgres 16777216 Sep 2 11:05 000000030000000000000014 -rw-------. 1 postgres postgres 16777216 Aug 27 00:57 000000030000000000000015 -rw-------. 1 postgres postgres 16777216 Aug 27 00:58 000000030000000000000016 -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 000000030000000000000017 -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 000000030000000000000018 -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 000000030000000000000019 -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 00000003000000000000001A -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 00000003000000000000001B -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 00000003000000000000001C -rw-------. 1 postgres postgres 16777216 Aug 27 00:59 00000003000000000000001D -rw-------. 1 postgres postgres 16777216 Sep 1 23:56 00000003000000000000001E -rw-------. 1 postgres postgres 84 Aug 27 01:02 00000003.history drwx------. 
2 postgres postgres 34 Sep 1 23:56 archive_status 由于有上面的第2项检查,如果读到了这些WAL文件,可以立即识别出来。 [postgres@node1 ~]$ pg_waldump data1/pg_wal/000000030000000000000015 pg_waldump: FATAL: could not find a valid record after 0/15000000 MySQL的binlog文件名一般是长下面这样的,从binlog文件名上看不出任何和GTID的映射关系。 mysql_bin.000001 不同机器上产生的binlog文件可能同名,如果要管理多套MySQL,千万别拿错文件。因为MySQL是逻辑复制,这些binlog文件就像SQL语句一样,拿到哪里都可以执行。 参考 src/backend/access/transam/xlogreader.c: static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr) { pg_crc32c crc; /* Calculate the CRC */ INIT_CRC32C(crc); COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord); /* include the record header last */ COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc)); FIN_CRC32C(crc); if (!EQ_CRC32C(record->xl_crc, crc)) { report_invalid_record(state, "incorrect resource manager data checksum in record at %X/%X", (uint32) (recptr >> 32), (uint32) recptr); return false; } return true; } ... static bool ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogPageHeader hdr) { ... if (state->system_identifier && longhdr->xlp_sysid != state->system_identifier) { char fhdrident_str[32]; char sysident_str[32]; /* * Format sysids separately to keep platform-dependent format code * out of the translatable message string. */ snprintf(fhdrident_str, sizeof(fhdrident_str), UINT64_FORMAT, longhdr->xlp_sysid); snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT, state->system_identifier); report_invalid_record(state, "WAL file is from different database system: WAL file database system identifier is %s, pg_control database system identifier is %s", fhdrident_str, sysident_str); return false; } ... if (hdr->xlp_pageaddr != recaddr) { char fname[MAXFNAMELEN]; XLogFileName(fname, state->readPageTLI, segno); report_invalid_record(state, "unexpected pageaddr %X/%X in log segment %s, offset %u", (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr, fname, offset); return false; } ... }
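补充一下,文中第3、4点提到的同步状态、数据库实例唯一标识和时间线,都可以直接用SQL查看(示意,pg_control_system()和pg_control_checkpoint()需要9.6及以上版本):

-- 各备库的复制状态和同步级别(sync/potential/async)
SELECT application_name, state, sync_state FROM pg_stat_replication;
-- 当前生效的持久性级别
SHOW synchronous_commit;
-- WAL文件头中使用的数据库实例唯一标识和当前时间线
SELECT system_identifier FROM pg_control_system();
SELECT timeline_id FROM pg_control_checkpoint();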
PostgreSQL的表膨胀及对策

PostgreSQL的MVCC机制在数据更新时会产生dead元组,这些dead元组通过后台的autovacuum进程清理。一般情况下autovacuum可以工作得不错,但在以下情况下,dead元组可能会不断堆积,形成表膨胀(包括索引膨胀)。
autovacuum清理速度赶不上dead元组产生速度
由于以下因素导致dead元组无法被回收
主库或备库存在长事务
主库或备库存在未处理的未决事务
主库或备库存在断开的复制槽

检查表膨胀

方法1:查询pg_stat_all_tables系统表

SELECT schemaname||'.'||relname,
       n_dead_tup,
       n_live_tup,
       round(n_dead_tup * 100.0 / (n_live_tup + n_dead_tup), 2) AS dead_tup_ratio
  FROM pg_stat_all_tables
 WHERE n_dead_tup >= 10000
 ORDER BY dead_tup_ratio DESC
 LIMIT 10;

方法2:使用pg_bloat_check工具

`pg_bloat_check`会进行全表扫描,比`pg_stat_all_tables`准确,但比较慢,对系统性能冲击也较大,不建议作为常规工具使用。

以上方法包含了对索引膨胀的检查。但需要注意的是,表中不能被回收的dead tuple在索引页里是作为正常tuple而不是dead tuple记录的。考虑到这一点,索引的实际膨胀要乘以对应表的膨胀率。

预防表膨胀

调整autovacuum相关参数,加快垃圾回收速度
对于写入频繁的系统,默认的autovacuum_vacuum_cost_limit参数值可能过小,尤其在SSD机器上,可以适当调大。
autovacuum_vacuum_cost_limit = 4000

监视并处理以下可能导致dead元组无法被回收的状况(监视用的参考SQL见文末)
长事务
未决事务
断开的复制槽

强制回收
设置old_snapshot_threshold参数,强制删除为过老的事务快照保留的dead元组。这会导致长事务读取已被删除tuple时出错。
old_snapshot_threshold = 12h
old_snapshot_threshold不会影响更新事务和隔离级别为RR的只读事务。old_snapshot_threshold参数也不能在线修改,如果已经设置了old_snapshot_threshold但又需要运行更长的RR只读事务或单个大的只读SQL,可以临时在备机上设置max_standby_streaming_delay = -1,然后在备机执行长事务(会带来主备延迟)。

杀死长事务
设置可以部分避免长事务的参数
idle_in_transaction_session_timeout = 60s
lock_timeout = 70s

相关代码

vacuum()
->vacuum_rel()
  ->vacuum_set_xid_limits()
    ->GetOldestXmin()
      找出以下最小的事务ID,大于该事务ID的事务删除的tuple将不回收
      - backend_xid,所有后端进程的当前事务ID的最小值
      - backend_xmin,所有后端进程的事务启动时的事务快照中最小事务的最小值
      - replication_slot_xmin,所有复制槽中最小的xmin(备库的backend_xid和backend_xmin会在这里反映)
      - replication_slot_catalog_xmin,所有复制槽中最小的catalog_xmin
    ->TransactionIdLimitedForOldSnapshots()
      如果设置了old_snapshot_threshold,则比backend_xid和old_snapshot_threshold->xmin都老的dead元组也可以被回收

参考
-PostgreSQL 9.6 快照过旧 - 源码浅析
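下面给出监视这三类状况的参考SQL(示意,时间阈值请按业务情况调整):

-- 长事务:事务已运行超过1小时的会话
SELECT pid, usename, state, xact_start, now() - xact_start AS duration
  FROM pg_stat_activity
 WHERE xact_start < now() - interval '1 hour';

-- 未决事务:长时间停留的两阶段提交事务
SELECT gid, prepared, owner, database
  FROM pg_prepared_xacts
 ORDER BY prepared;

-- 断开的复制槽:存在但当前不活动的复制槽
SELECT slot_name, slot_type, active, xmin
  FROM pg_replication_slots
 WHERE NOT active;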
如何遏制PostgreSQL WAL的疯狂增长 前言 PostgreSQL在写入频繁的场景中,可能会产生大量的WAL日志,而且WAL日志量远远超过实际更新的数据量。 我们可以把这种现象起个名字,叫做“WAL写放大”,造成WAL写放大的主要原因有2点。 在checkpoint之后第一次修改页面,需要在WAL中输出整个page,即全页写(full page writes)。全页写的目的是防止在意外宕机时出现的数据块部分写导致数据库无法恢复。 更新记录时如果新记录位置(ctid)发生变更,索引记录也要相应变更,这个变更也要记入WAL。更严重的是索引记录的变更又有可能导致索引页的全页写,进一步加剧了WAL写放大。 过量的WAL输出会对系统资源造成很大的消耗,因此需要进行适当的优化。 磁盘IO WAL写入是顺序写,通常情况下硬盘对付WAL的顺序写入是绰绰有余的。所以一般可以忽略。 网络IO 对局域网内的复制估计还不算问题,远程复制就难说了。 磁盘空间 如果做WAL归档,需要的磁盘空间也是巨大的。 WAL记录的构成 每条WAL记录的构成大致如下: src/include/access/xlogrecord.h: * The overall layout of an XLOG record is: * Fixed-size header (XLogRecord struct) * XLogRecordBlockHeader struct * XLogRecordBlockHeader struct * ... * XLogRecordDataHeader[Short|Long] struct * block data * block data * ... * main data 主要占空间是上面的"block data",再往上的XLogRecordBlockHeader是"block data"的元数据。 一条WAL记录可能不涉及数据块,也可能涉及多个数据块,因此WAL记录中可能没有"block data"也可能有多个"block data"。 "block data"的内容可能是下面几种情况之一 full page image 如果是checkpoint之后第一次修改页面,则输出整个page的内容(即full page image,简称FPI)。但是page中没有数据的hole部分会被排除,如果设置了wal_compression = on还会对这page上的数据进行压缩。 buffer data 不需要输出FPI时,就只输出page中指定的数据。 full page image + buffer data 逻辑复制时,即使输出了FPI,也要输出指定的数据。 究竟"block data"中存的是什么内容,通过前面的XLogRecordBlockHeader中的fork_flags进行描述。这里的XLogRecordBlockHeader其实也只是个概括的说法,实际上后面还跟了一些其它的Header。完整的结构如下: XLogRecordBlockHeader XLogRecordBlockImageHeader (可选,包含FPI时存在) XLogRecordBlockCompressHeader (可选,对FPI压缩时存在) RelFileNode (可选,和之前的"block data"的file node不一样时才存在) BlockNumber 下面以insert作为例子说明。 src/backend/access/heap/heapam.c: Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, int options, BulkInsertState bistate) { ... xl_heap_insert xlrec; xl_heap_header xlhdr; ... xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self); ... XLogBeginInsert(); XLogRegisterData((char *) &xlrec, SizeOfHeapInsert); //1)记录tuple的位置到WAL记录里的"main data"。 xlhdr.t_infomask2 = heaptup->t_data->t_infomask2; xlhdr.t_infomask = heaptup->t_data->t_infomask; xlhdr.t_hoff = heaptup->t_data->t_hoff; /* * note we mark xlhdr as belonging to buffer; if XLogInsert decides to * write the whole page to the xlog, we don't need to store * xl_heap_header in the xlog. */ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags); XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);//2)记录tuple的head到WAL记录里的"block data"。 /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */ XLogRegisterBufData(0, (char *) heaptup->t_data + SizeofHeapTupleHeader, heaptup->t_len - SizeofHeapTupleHeader);//3)记录tuple的内容到WAL记录里的"block data"。 ... } WAL的解析 PostgreSQL的安装目录下有个叫做pg_xlogdump的命令可以解析WAL文件,下面看一个例子。 -bash-4.1$ pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b ... rmgr: Heap len (rec/tot): 14/ 171, tx: 301170263, lsn: 555/D5005080, prev 555/D50030A0, desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0 blkref #0: rel 1663/13269/54349226 fork main blk 1640350 blkref #1: rel 1663/13269/54349226 fork main blk 1174199 ... 
这条WAL记录的解释如下: rmgr: Heap PostgreSQL内部将WAL日志归类到20多种不同的资源管理器。这条WAL记录所属资源管理器为Heap,即堆表。除了Heap还有Btree,Transaction等。 len (rec/tot): 14/ 171 WAL记录的总长度是171字节,其中main data部分是14字节(只计数main data可能并不合理,本文的后面会有说明)。 tx: 301170263 事务号 lsn: 555/D5005080 本WAL记录的LSN prev 555/D50030A0 上条WAL记录的LSN desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0 这是一条UPDATE类型的记录(每个资源管理器最多包含16种不同的WAL记录类型,),旧tuple在page中的位置为30(即ctid的后半部分),新tuple在page中的位置为20。 blkref #0: rel 1663/13269/54349226 fork main blk 1640350 引用的第一个page(新tuple所在page)所属的堆表文件为1663/13269/54349226,块号为1640350(即ctid的前半部分)。通过oid2name可以查到是哪个堆表。 -bash-4.1$ oid2name -f 54349226 From database "postgres": Filenode Table Name ---------------------------- 54349226 pgbench_accounts blkref #1: rel 1663/13269/54349226 fork main blk 1174199 引用的第二个page(旧tuple所在page)所属的堆表文件及块号 UPDATE语句除了产生UPDATE类型的WAL记录,实际上还会在前面产生一条LOCK记录,可选的还可能在后面产生若干索引更新的WAL记录。 -bash-4.1$ pg_xlogdump /pgsql/data/pg_xlog/0000000100000555000000D5 -b ... rmgr: Heap len (rec/tot): 8/ 8135, tx: 301170263, lsn: 555/D50030A0, prev 555/D5001350, desc: LOCK off 30: xid 301170263: flags 0 LOCK_ONLY EXCL_LOCK blkref #0: rel 1663/13269/54349226 fork main blk 1174199 (FPW); hole: offset: 268, length: 116 rmgr: Heap len (rec/tot): 14/ 171, tx: 301170263, lsn: 555/D5005080, prev 555/D50030A0, desc: UPDATE off 30 xmax 301170263 ; new off 20 xmax 0 blkref #0: rel 1663/13269/54349226 fork main blk 1640350 blkref #1: rel 1663/13269/54349226 fork main blk 1174199 ... 上面的LOCK记录的例子中,第一个引用page里有PFW标识,表示包含FPI,这也是这条WAL记录长度很大的原因。 后面的hole: offset: 268, length: 116表示page中包含hole,以及这个hole的偏移位置和长度。 可以算出FPI的大小为8196-116=8080, WAL记录中除FPI以外的数据长度8135-8080=55。 WAL的统计 PostgreSQL 9.5以后的pg_xlogdump都带有统计功能,可以查看不同类型的WAL记录的数量,大小以及FPI的比例。例子如下: postgres.conf配置 下面是一个未经特别优化的配置 shared_buffers = 32GB checkpoint_completion_target = 0.9 checkpoint_timeout = 5min min_wal_size = 1GB max_wal_size = 4GB full_page_writes = on wal_log_hints = on wal_level = replica wal_keep_segments = 1000 测试 先手动执行checkpoint,再利用pgbench做一个10秒钟的压测 -bash-4.1$ psql -c "checkpoint;select pg_switch_xlog(),pg_current_xlog_location()" pg_switch_xlog | pg_current_xlog_location ----------------+-------------------------- 556/48000270 | 556/49000000 (1 row) -bash-4.1$ pgbench -n -c 64 -j 64 -T 10 transaction type: < tpc-b of> scaling factor: 1000 query mode: simple number of clients: 64 number of threads: 64 duration: 10 s number of transactions actually processed: 123535 latency average = 5.201 ms tps = 12304.460572 (including connections establishing) tps = 12317.916235 (excluding connections establishing) -bash-4.1$ psql -c "select pg_current_xlog_location()" pg_current_xlog_location -------------------------- 556/B8B40CA0 (1 row) > 日志统计 统计压测期间产生的WAL -bash-4.1$ pg_xlogdump --stats=record -s 556/49000000 -e 556/B8B40CA0 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 650 ( 0.06) 15600 ( 0.05) 5262532 ( 0.29) 5278132 ( 0.29) Transaction/COMMIT 123535 ( 11.54) 3953120 ( 11.46) 0 ( 0.00) 3953120 ( 0.22) CLOG/ZEROPAGE 4 ( 0.00) 112 ( 0.00) 0 ( 0.00) 112 ( 0.00) Standby/RUNNING_XACTS 2 ( 0.00) 232 ( 0.00) 0 ( 0.00) 232 ( 0.00) Heap/INSERT 122781 ( 11.47) 3315087 ( 9.61) 1150064 ( 0.06) 4465151 ( 0.25) Heap/UPDATE 220143 ( 20.57) 8365434 ( 24.24) 1110312 ( 0.06) 9475746 ( 0.52) Heap/HOT_UPDATE 147169 ( 13.75) 5592422 ( 16.21) 275568 ( 0.02) 5867990 ( 0.32) Heap/LOCK 228031 ( 21.31) 7296992 ( 21.15) 975914004 ( 54.70) 983210996 ( 54.06) Heap/INSERT+INIT 754 ( 0.07) 20358 ( 0.06) 0 ( 0.00) 20358 ( 0.00) 
Heap/UPDATE+INIT 3293 ( 0.31) 125134 ( 0.36) 0 ( 0.00) 125134 ( 0.01) Btree/INSERT_LEAF 223003 ( 20.84) 5798078 ( 16.80) 800409940 ( 44.86) 806208018 ( 44.33) Btree/INSERT_UPPER 433 ( 0.04) 11258 ( 0.03) 32576 ( 0.00) 43834 ( 0.00) Btree/SPLIT_L 218 ( 0.02) 6976 ( 0.02) 26040 ( 0.00) 33016 ( 0.00) Btree/SPLIT_R 216 ( 0.02) 6912 ( 0.02) 27220 ( 0.00) 34132 ( 0.00) -------- -------- -------- -------- Total 1070232 34507715 [1.90%] 1784208256 [98.10%] 1818715971 [100%] 这个统计结果显示FPI的比例占到了98.10%。但是这个数据并不准确,因为上面的Record size只包含了WAL记录中"main data"的大小,Combined size则是"main data"与FPI的合计,漏掉了FPI以外的"block data"。 这是一个Bug,社区正在进行修复,参考BUG #14687 作为临时对策,可以在pg_xlogdump.c中新增了一行代码,重新计算Record size使之等于WAL总记录长度减去FPI的大小。为便于区分,修改后编译的二进制文件改名为pg_xlogdump_ex。 src/bin/pg_xlogdump/pg_xlogdump.c: fpi_len = 0; for (block_id = 0; block_id max_block_id; block_id++) { if (XLogRecHasBlockImage(record, block_id)) fpi_len += record->blocks[block_id].bimg_len; } rec_len = XLogRecGetTotalLen(record) - fpi_len;/* 新增这一行,重新计算rec_len */ 修改后,重新统计WAL的结果如下: -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 556/49000000 -e 556/B8B40CA0 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 650 ( 0.06) 31850 ( 0.04) 5262532 ( 0.29) 5294382 ( 0.28) Transaction/COMMIT 123535 ( 11.54) 4200190 ( 5.14) 0 ( 0.00) 4200190 ( 0.23) CLOG/ZEROPAGE 4 ( 0.00) 120 ( 0.00) 0 ( 0.00) 120 ( 0.00) Standby/RUNNING_XACTS 2 ( 0.00) 236 ( 0.00) 0 ( 0.00) 236 ( 0.00) Heap/INSERT 122781 ( 11.47) 9694899 ( 11.86) 1150064 ( 0.06) 10844963 ( 0.58) Heap/UPDATE 220143 ( 20.57) 29172042 ( 35.67) 1110312 ( 0.06) 30282354 ( 1.62) Heap/HOT_UPDATE 147169 ( 13.75) 10591360 ( 12.95) 275568 ( 0.02) 10866928 ( 0.58) Heap/LOCK 228031 ( 21.31) 12917849 ( 15.80) 975914004 ( 54.70) 988831853 ( 52.99) Heap/INSERT+INIT 754 ( 0.07) 59566 ( 0.07) 0 ( 0.00) 59566 ( 0.00) Heap/UPDATE+INIT 3293 ( 0.31) 455778 ( 0.56) 0 ( 0.00) 455778 ( 0.02) Btree/INSERT_LEAF 223003 ( 20.84) 13080672 ( 16.00) 800409940 ( 44.86) 813490612 ( 43.60) Btree/INSERT_UPPER 433 ( 0.04) 31088 ( 0.04) 32576 ( 0.00) 63664 ( 0.00) Btree/SPLIT_L 218 ( 0.02) 775610 ( 0.95) 26040 ( 0.00) 801650 ( 0.04) Btree/SPLIT_R 216 ( 0.02) 765118 ( 0.94) 27220 ( 0.00) 792338 ( 0.04) -------- -------- -------- -------- Total 1070232 81776378 [4.38%] 1784208256 [95.62%] 1865984634 [100%] 这上面可以看出,有95.62%的WAL空间都被FPI占据了(也就是说WAL至少被放大了20倍),这个比例是相当高的。 如果不修改pg_xlogdump的代码,也可以通过计算WAL距离的方式,算出准确的FPI比例。 postgres=# select pg_xlog_location_diff('556/B8B40CA0','556/49000000'); pg_xlog_location_diff ----------------------- 1874070688 (1 row) postgres=# select 1784208256.0 / 1874070688; ?column? 
------------------------ 0.95204960379808256197 (1 row) WAL的优化 在应用的写负载不变的情况下,减少WAL生成量主要有下面几种办法。 延长checkpoint时间间隔 FPI产生于checkpoint之后第一次变脏的page,在下次checkpoint到来之前,已经输出过PFI的page是不需要再次输出FPI的。因此checkpoint时间间隔越长,FPI产生的频度会越低。增大checkpoint_timeout和max_wal_size可以延长checkpoint时间间隔。 增加HOT_UPDATE比例 普通的UPDATE经常需要更新2个数据块,并且可能还要更新索引page,这些又都有可能产生FPI。而HOT_UPDATE只修改1个数据块,需要写的WAL量也会相应减少。 压缩 PostgreSQL9.5新增加了一个wal_compression参数,设为on可以对FPI进行压缩,削减WAL的大小。另外还可以在外部通过SSL/SSH的压缩功能减少主备间的通信流量,以及自定义归档脚本对归档的WAL进行压缩。 关闭全页写 这是一个立竿见影但也很危险的办法,如果底层的文件系统或储存支持原子写可以考虑。因为很多部署环境都不具备安全的关闭全页写的条件,下文不对该方法做展开。 延长checkpoint时间 首先优化checkpoint相关参数 postgres.conf: shared_buffers = 32GB checkpoint_completion_target = 0.1 checkpoint_timeout = 60min min_wal_size = 4GB max_wal_size = 64GB full_page_writes = on wal_log_hints = on wal_level = replica wal_keep_segments = 1000 然后,手工发起一次checkpoint -bash-4.1$ psql -c "checkpoint" CHECKPOINT 再压测10w个事务,并连续测试10次 -bash-4.1$ psql -c "select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 1000 ;psql -c "select pg_current_xlog_location()" pg_current_xlog_location -------------------------- 558/47542B08 (1 row) transaction type: < tpc-b of> scaling factor: 1000 query mode: simple number of clients: 100 number of threads: 100 number of transactions per client: 1000 number of transactions actually processed: 100000/100000 latency average = 7.771 ms tps = 12868.123227 (including connections establishing) tps = 12896.084970 (excluding connections establishing) pg_current_xlog_location -------------------------- 558/A13DF908 (1 row) > 测试结果如下 第1次执行 -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 558/47542B08 -e 558/A13DF908 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 1933 ( 0.23) 94717 ( 0.15) 15612140 ( 1.09) 15706857 ( 1.05) Transaction/COMMIT 100000 ( 11.89) 3400000 ( 5.26) 0 ( 0.00) 3400000 ( 0.23) CLOG/ZEROPAGE 3 ( 0.00) 90 ( 0.00) 0 ( 0.00) 90 ( 0.00) Standby/RUNNING_XACTS 1 ( 0.00) 453 ( 0.00) 0 ( 0.00) 453 ( 0.00) Heap/INSERT 99357 ( 11.82) 7849103 ( 12.14) 25680 ( 0.00) 7874783 ( 0.52) Heap/UPDATE 163254 ( 19.42) 22354169 ( 34.58) 351364 ( 0.02) 22705533 ( 1.51) Heap/HOT_UPDATE 134045 ( 15.94) 9646593 ( 14.92) 384948 ( 0.03) 10031541 ( 0.67) Heap/LOCK 172576 ( 20.52) 9800924 ( 15.16) 778259316 ( 54.15) 788060240 ( 52.47) Heap/INSERT+INIT 643 ( 0.08) 50797 ( 0.08) 0 ( 0.00) 50797 ( 0.00) Heap/UPDATE+INIT 2701 ( 0.32) 371044 ( 0.57) 0 ( 0.00) 371044 ( 0.02) Btree/INSERT_LEAF 165561 ( 19.69) 9643359 ( 14.92) 642548940 ( 44.70) 652192299 ( 43.42) Btree/INSERT_UPPER 394 ( 0.05) 28236 ( 0.04) 56324 ( 0.00) 84560 ( 0.01) Btree/SPLIT_L 228 ( 0.03) 811172 ( 1.25) 57280 ( 0.00) 868452 ( 0.06) Btree/SPLIT_R 168 ( 0.02) 595137 ( 0.92) 64740 ( 0.00) 659877 ( 0.04) -------- -------- -------- -------- Total 840864 64645794 [4.30%] 1437360732 [95.70%] 1502006526 [100%] 第5次执行 -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 559/6312AD98 -e 559/94AC4148 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 1425 ( 0.17) 69825 ( 0.11) 11508300 ( 1.51) 11578125 ( 1.40) Transaction/COMMIT 100000 ( 12.13) 3400000 ( 5.37) 0 ( 0.00) 3400000 ( 0.41) CLOG/ZEROPAGE 3 ( 0.00) 90 ( 0.00) 0 ( 0.00) 90 ( 0.00) Standby/RUNNING_XACTS 1 ( 0.00) 453 ( 0.00) 0 ( 0.00) 453 ( 0.00) Heap/INSERT 99296 ( 12.05) 7844384 ( 12.38) 0 ( 0.00) 7844384 ( 0.95) Heap/UPDATE 155408 ( 18.85) 21689908 ( 34.24) 0 ( 0.00) 21689908 ( 2.62) Heap/HOT_UPDATE 142042 ( 17.23) 10222825 ( 16.14) 0 
( 0.00) 10222825 ( 1.23) Heap/LOCK 164776 ( 19.99) 9274729 ( 14.64) 608647740 ( 79.60) 617922469 ( 74.63) Heap/INSERT+INIT 704 ( 0.09) 55616 ( 0.09) 0 ( 0.00) 55616 ( 0.01) Heap/UPDATE+INIT 2550 ( 0.31) 355951 ( 0.56) 0 ( 0.00) 355951 ( 0.04) Btree/INSERT_LEAF 157807 ( 19.14) 9886864 ( 15.61) 144491940 ( 18.90) 154378804 ( 18.64) Btree/INSERT_UPPER 151 ( 0.02) 10872 ( 0.02) 0 ( 0.00) 10872 ( 0.00) Btree/SPLIT_L 128 ( 0.02) 455424 ( 0.72) 0 ( 0.00) 455424 ( 0.06) Btree/SPLIT_R 23 ( 0.00) 81466 ( 0.13) 0 ( 0.00) 81466 ( 0.01) -------- -------- -------- -------- Total 824314 63348407 [7.65%] 764647980 [92.35%] 827996387 [100%] 第10次执行 -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 55A/3347F298 -e 55A/5420F700 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 1151 ( 0.13) 56399 ( 0.09) 9295592 ( 1.93) 9351991 ( 1.71) Transaction/COMMIT 100000 ( 11.61) 3400000 ( 5.15) 0 ( 0.00) 3400000 ( 0.62) CLOG/ZEROPAGE 3 ( 0.00) 90 ( 0.00) 0 ( 0.00) 90 ( 0.00) Standby/RUNNING_XACTS 1 ( 0.00) 62 ( 0.00) 0 ( 0.00) 62 ( 0.00) Heap/INSERT 99322 ( 11.53) 7846438 ( 11.88) 0 ( 0.00) 7846438 ( 1.43) Heap/UPDATE 173901 ( 20.19) 23253149 ( 35.21) 0 ( 0.00) 23253149 ( 4.25) Heap/HOT_UPDATE 123452 ( 14.33) 8884888 ( 13.45) 0 ( 0.00) 8884888 ( 1.62) Heap/LOCK 183501 ( 21.30) 10187069 ( 15.43) 449049828 ( 93.22) 459236897 ( 83.84) Heap/INSERT+INIT 678 ( 0.08) 53562 ( 0.08) 0 ( 0.00) 53562 ( 0.01) Heap/UPDATE+INIT 2647 ( 0.31) 365259 ( 0.55) 0 ( 0.00) 365259 ( 0.07) Btree/INSERT_LEAF 176343 ( 20.47) 11251588 ( 17.04) 23338600 ( 4.85) 34590188 ( 6.32) Btree/INSERT_UPPER 205 ( 0.02) 14760 ( 0.02) 0 ( 0.00) 14760 ( 0.00) Btree/SPLIT_L 172 ( 0.02) 611976 ( 0.93) 0 ( 0.00) 611976 ( 0.11) Btree/SPLIT_R 33 ( 0.00) 116886 ( 0.18) 0 ( 0.00) 116886 ( 0.02) Btree/VACUUM 1 ( 0.00) 50 ( 0.00) 0 ( 0.00) 50 ( 0.00) -------- -------- -------- -------- Total 861410 66042176 [12.06%] 481684020 [87.94%] 547726196 [100%] 汇总如下: No tps 非FPI大小 WAL总量(字节) FPI比例(%) 每事务产生的WAL(字节) 1 12896 64645794 1502006526 95.70 15020 5 12896 63348407 827996387 92.35 8279 10 12896 66042176 547726196 87.94 5477 不难看出非FPI大小是相对固定的,FPI的大小越来越小,这也证实了延长checkpoint间隔对削减WAL大小的作用。 增加HOT_UPDATE比例 HOT_UPDATE比例过低的一个很常见的原因是更新频繁的表的fillfactor设置不恰当。fillfactor的默认值为100%,可以先将其调整为90%。 对于宽表,要进一步减小fillfactor使得至少可以保留一个tuple的空闲空间。可以查询pg_class系统表估算平均tuple大小,并算出合理的fillfactor值。 postgres=# select 1 - relpages/reltuples max_fillfactor from pg_class where relname='big_tb'; max_fillfactor ---------------------- 0.69799901185770750988 (1 row) 再上面估算出的69%的基础上,可以把fillfactor再稍微设小一点,比如设成65% 。 在前面优化过的参数的基础上,先保持fillfactor=100不变,执行100w事务的压测 -bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()" pg_current_xlog_location -------------------------- 55A/66715CC0 (1 row) transaction type: < tpc-b of> scaling factor: 1000 query mode: simple number of clients: 100 number of threads: 100 number of transactions per client: 10000 number of transactions actually processed: 1000000/1000000 latency average = 7.943 ms tps = 12589.895315 (including connections establishing) tps = 12592.623734 (excluding connections establishing) pg_current_xlog_location -------------------------- 55C/7C747F20 (1 row) > 生成的WAL统计如下: -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 55A/66715CC0 -e 55C/7C747F20 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 30699 ( 0.36) 1504251 ( 0.23) 
248063160 ( 3.00) 249567411 ( 2.80) Transaction/COMMIT 1000000 ( 11.80) 34000000 ( 5.15) 0 ( 0.00) 34000000 ( 0.38) Transaction/COMMIT 3 ( 0.00) 502 ( 0.00) 0 ( 0.00) 502 ( 0.00) CLOG/ZEROPAGE 31 ( 0.00) 930 ( 0.00) 0 ( 0.00) 930 ( 0.00) Standby/RUNNING_XACTS 6 ( 0.00) 2226 ( 0.00) 0 ( 0.00) 2226 ( 0.00) Standby/INVALIDATIONS 3 ( 0.00) 414 ( 0.00) 0 ( 0.00) 414 ( 0.00) Heap/INSERT 993655 ( 11.72) 78496345 ( 11.90) 135164 ( 0.00) 78631509 ( 0.88) Heap/UPDATE 1658858 ( 19.57) 225826642 ( 34.23) 455368 ( 0.01) 226282010 ( 2.54) Heap/HOT_UPDATE 1314890 ( 15.51) 94634083 ( 14.35) 344324 ( 0.00) 94978407 ( 1.07) Heap/LOCK 1757258 ( 20.73) 98577892 ( 14.94) 5953842520 ( 72.12) 6052420412 ( 67.89) Heap/INPLACE 9 ( 0.00) 1730 ( 0.00) 6572 ( 0.00) 8302 ( 0.00) Heap/INSERT+INIT 6345 ( 0.07) 501255 ( 0.08) 0 ( 0.00) 501255 ( 0.01) Heap/UPDATE+INIT 26265 ( 0.31) 3635102 ( 0.55) 0 ( 0.00) 3635102 ( 0.04) Btree/INSERT_LEAF 1680195 ( 19.82) 104535607 ( 15.85) 2052212660 ( 24.86) 2156748267 ( 24.19) Btree/INSERT_UPPER 4928 ( 0.06) 354552 ( 0.05) 129128 ( 0.00) 483680 ( 0.01) Btree/SPLIT_L 4854 ( 0.06) 17269109 ( 2.62) 22080 ( 0.00) 17291189 ( 0.19) Btree/SPLIT_R 95 ( 0.00) 336650 ( 0.05) 0 ( 0.00) 336650 ( 0.00) Btree/VACUUM 3 ( 0.00) 155 ( 0.00) 2220 ( 0.00) 2375 ( 0.00) -------- -------- -------- -------- Total 8478097 659677445 [7.40%] 8255213196 [92.60%] 8914890641 [100%] 设置fillfactor=90 postgres=# alter table pgbench_accounts set (fillfactor=90); ALTER TABLE postgres=# vacuum full pgbench_accounts; VACUUM postgres=# alter table pgbench_tellers set (fillfactor=90); ALTER TABLE postgres=# vacuum full pgbench_tellers; VACUUM postgres=# alter table pgbench_branches set (fillfactor=90); ALTER TABLE postgres=# vacuum full pgbench_branches; VACUUM 再次测试 -bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()" pg_current_xlog_location -------------------------- 561/78BD2460 (1 row) transaction type: < tpc-b of> scaling factor: 1000 query mode: simple number of clients: 100 number of threads: 100 number of transactions per client: 10000 number of transactions actually processed: 1000000/1000000 latency average = 7.570 ms tps = 13210.665959 (including connections establishing) tps = 13212.956814 (excluding connections establishing) pg_current_xlog_location -------------------------- 562/F91436D8 (1 row) > 生成的WAL统计如下: -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 561/78BD2460 -e 562/F91436D8 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 13529 ( 0.22) 662921 ( 0.16) 99703804 ( 1.66) 100366725 ( 1.57) Transaction/COMMIT 1000000 ( 16.09) 34000000 ( 8.07) 0 ( 0.00) 34000000 ( 0.53) Transaction/COMMIT 4 ( 0.00) 1035 ( 0.00) 0 ( 0.00) 1035 ( 0.00) CLOG/ZEROPAGE 30 ( 0.00) 900 ( 0.00) 0 ( 0.00) 900 ( 0.00) Standby/RUNNING_XACTS 5 ( 0.00) 1913 ( 0.00) 0 ( 0.00) 1913 ( 0.00) Standby/INVALIDATIONS 2 ( 0.00) 276 ( 0.00) 0 ( 0.00) 276 ( 0.00) Heap/INSERT 993629 ( 15.98) 78494191 ( 18.63) 362908 ( 0.01) 78857099 ( 1.23) Heap/DELETE 1 ( 0.00) 59 ( 0.00) 7972 ( 0.00) 8031 ( 0.00) Heap/UPDATE 553073 ( 8.90) 47100570 ( 11.18) 48188 ( 0.00) 47148758 ( 0.74) Heap/HOT_UPDATE 2438157 ( 39.22) 170238869 ( 40.40) 5809935900 ( 96.97) 5980174769 ( 93.25) Heap/LOCK 635714 ( 10.23) 34328566 ( 8.15) 16200 ( 0.00) 34344766 ( 0.54) Heap/INPLACE 10 ( 0.00) 1615 ( 0.00) 22692 ( 0.00) 24307 ( 0.00) Heap/INSERT+INIT 6372 ( 0.10) 503388 ( 0.12) 0 ( 0.00) 503388 ( 0.01) 
Heap/UPDATE+INIT 8804 ( 0.14) 741136 ( 0.18) 0 ( 0.00) 741136 ( 0.01) Btree/INSERT_LEAF 556456 ( 8.95) 35492624 ( 8.42) 81089180 ( 1.35) 116581804 ( 1.82) Btree/INSERT_UPPER 5422 ( 0.09) 389735 ( 0.09) 328108 ( 0.01) 717843 ( 0.01) Btree/SPLIT_L 5036 ( 0.08) 17918305 ( 4.25) 154980 ( 0.00) 18073285 ( 0.28) Btree/SPLIT_R 414 ( 0.01) 1466691 ( 0.35) 22140 ( 0.00) 1488831 ( 0.02) Btree/VACUUM 2 ( 0.00) 100 ( 0.00) 0 ( 0.00) 100 ( 0.00) -------- -------- -------- -------- Total 6216660 421342894 [6.57%] 5991692072 [93.43%] 6413034966 [100%] 设置fillfactor=90后,生成的WAL量从8914890641减少到6413034966。 设置WAL压缩 修改postgres.conf,开启WAL压缩 wal_compression = on 再次测试 -bash-4.1$ psql -c "checkpoint;select pg_current_xlog_location()" ; pgbench -n -c 100 -j 100 -t 10000 ;psql -c "select pg_current_xlog_location()" pg_current_xlog_location -------------------------- 562/F91B5978 (1 row) transaction type: < tpc-b of> scaling factor: 1000 query mode: simple number of clients: 100 number of threads: 100 number of transactions per client: 10000 number of transactions actually processed: 1000000/1000000 latency average = 8.295 ms tps = 12056.091399 (including connections establishing) tps = 12059.453725 (excluding connections establishing) pg_current_xlog_location -------------------------- 563/39880390 (1 row) > 生成的WAL统计如下: -bash-4.1$ ./pg_xlogdump_ex --stats=record -s 562/F91B5978 -e 563/39880390 Type N (%) Record size (%) FPI size (%) Combined size (%) ---- - --- ----------- --- -------- --- ------------- --- XLOG/FPI_FOR_HINT 7557 ( 0.12) 385375 ( 0.09) 5976157 ( 0.94) 6361532 ( 0.60) Transaction/COMMIT 1000000 ( 15.55) 34000000 ( 7.97) 0 ( 0.00) 34000000 ( 3.20) Transaction/COMMIT 2 ( 0.00) 356 ( 0.00) 0 ( 0.00) 356 ( 0.00) CLOG/ZEROPAGE 31 ( 0.00) 930 ( 0.00) 0 ( 0.00) 930 ( 0.00) Standby/RUNNING_XACTS 5 ( 0.00) 1937 ( 0.00) 0 ( 0.00) 1937 ( 0.00) Standby/INVALIDATIONS 4 ( 0.00) 504 ( 0.00) 0 ( 0.00) 504 ( 0.00) Heap/INSERT 993632 ( 15.45) 78494714 ( 18.40) 205874 ( 0.03) 78700588 ( 7.40) Heap/UPDATE 663845 ( 10.32) 56645461 ( 13.28) 39548 ( 0.01) 56685009 ( 5.33) Heap/HOT_UPDATE 2326238 ( 36.17) 163847160 ( 38.41) 604564022 ( 94.97) 768411182 ( 72.27) Heap/LOCK 747342 ( 11.62) 40358851 ( 9.46) 1713055 ( 0.27) 42071906 ( 3.96) Heap/INPLACE 9 ( 0.00) 1425 ( 0.00) 5160 ( 0.00) 6585 ( 0.00) Heap/INSERT+INIT 6368 ( 0.10) 503072 ( 0.12) 0 ( 0.00) 503072 ( 0.05) Heap/UPDATE+INIT 9927 ( 0.15) 839135 ( 0.20) 0 ( 0.00) 839135 ( 0.08) Btree/INSERT_LEAF 671387 ( 10.44) 42884429 ( 10.05) 19691394 ( 3.09) 62575823 ( 5.89) Btree/INSERT_UPPER 2385 ( 0.04) 170946 ( 0.04) 210384 ( 0.03) 381330 ( 0.04) Btree/SPLIT_L 1438 ( 0.02) 5107876 ( 1.20) 2613608 ( 0.41) 7721484 ( 0.73) Btree/SPLIT_R 947 ( 0.01) 3360714 ( 0.79) 1563260 ( 0.25) 4923974 ( 0.46) Btree/VACUUM 3 ( 0.00) 150 ( 0.00) 0 ( 0.00) 150 ( 0.00) -------- -------- -------- -------- Total 6431120 426603035 [40.12%] 636582462 [59.88%] 1063185497 [100%] 设置`wal_compression = on后,生成的WAL量从6413034966减少到1063185497。 优化结果汇总 wal_compression fillfactor tps 非FPI大小 WAL总量(字节) FPI比例(%) HOT_UPDATE比例(%) 每事务产生的WAL(字节) off 100 12592 659677445 8255213196 92.60 44 8255 off 90 13212 421342894 6413034966 93.43 81 6413 on 90 12059 426603035 1063185497 59.88 78 1063 仅仅调整wal_compression和fillfactor就削减了87%的WAL,这还没有算上延长checkpoint间隔带来的收益。 总结 PostgreSQL在未经优化的情况下,20倍甚至更高的WAL写放大是很常见的,适当的优化之后应该可以减少到3倍以下。引入SSL/SSH压缩或归档压缩等外部手段还可以进一步减少WAL的生成量。 如何判断是否需要优化WAL? 
关于如何判断是否需要优化WAL,可以通过分析WAL,然后检查下面的条件,做一个粗略的判断: FPI比例高于70% HOT_UPDATE比例低于70% 以上仅仅是粗略的经验值,仅供参考。并且这个FPI比例可能不适用于低写负载的系统,低写负载的系统FPI比例一定非常高,但是,低写负载系统由于写操作少,因此FPI比例即使高一点也没太大影响。 优化WAL的副作用 前面用到了3种优化手段,如果设置不当,也会产生副作用,具体如下: 延长checkpoint时间间隔 导致crash恢复时间变长。crash恢复时需要回放的WAL日志量一般小于max_wal_size的一半,WAL回放速度(wal_compression=on时)一般是50MB/s~150MB/s之间。可以根据可容忍的最大crash恢复时间(有备机时,切备机可能比等待crash恢复更快),估算出允许的max_wal_size的最大值。 调整fillfactor 过小的设置会浪费存储空间,这个不难理解。另外,对于频繁更新的表,即使把fillfactor设成100%,每个page里还是要有一部分空间被dead tuple占据,不会比设置一个合适的稍小的fillfactor更节省空间。 设置wal_compression=on 需要额外占用CPU资源进行压缩,但根据实测的结果影响不大。 其他 去年Uber放出了一篇把PostgreSQL说得一无是处的文章为什么Uber宣布从PostgreSQL切换到MySQL?给PostgreSQL带来了很大负面影响。Uber文章中提到了PG的几个问题,每一个都被描述成无法逾越的“巨坑”。但实际上这些问题中,除了“写放大”,其它几个问题要么是无关痛痒要么是随着PG的版本升级早就不存在了。至于“写放大”,也是有解的。Uber的文章里没有提到他们在优化WAL写入量上做过什么有益的尝试,并且他们使用的PostgreSQL 9.2也是不支持wal_compression的,因此推断他们PG数据库很可能一直运行在20倍以上WAL写放大的完全未优化状态下。 参考 WAL Reduction 也许 MySQL 适合 Uber,但它不一定适合你 为PostgreSQL讨说法 - 浅析《UBER ENGINEERING SWITCHED FROM POSTGRES TO MYSQL》
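关于"如何判断是否需要优化WAL"中的两个条件,除了用pg_xlogdump统计FPI比例,也可以用下面的SQL粗略估算WAL生成量和HOT_UPDATE比例(示意,pg_current_xlog_location等是9.x的函数名,10以后改名为pg_current_wal_lsn/pg_wal_lsn_diff):

-- 记录一个起始LSN,过一段时间后计算差值,即这段时间产生的WAL字节数
SELECT pg_current_xlog_location();
SELECT pg_xlog_location_diff(pg_current_xlog_location(), '556/49000000');  -- 换成前面记下的LSN

-- 各表的HOT_UPDATE比例
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(n_tup_hot_upd * 100.0 / nullif(n_tup_upd, 0), 2) AS hot_ratio
  FROM pg_stat_user_tables
 WHERE n_tup_upd > 10000
 ORDER BY hot_ratio;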
PostgreSQL 9.6.0中文手册1.0版发布说明

《PostgreSQL9.6.0手册》基于彭煜玮教授翻译的《PostgreSQL 9.6.0 文档》。"版本说明"中的大部分翻译内容提取自PostgreSQL中国用户会组织翻译的上一个版本《PostgreSQL9.5.3中文手册》。"版本说明"中9.6新增修改部分的翻译及一部分sgml文件和英文原文不匹配的处理由瀚高软件的韩悦悦完成。详细请参考PostgreSQL9.6中文手册的翻译。

在线手册
PostgreSQL9.6.0手册

离线手册下载
PostgreSQL9.6.0手册v1.0
基于Pacemkaer Resource Agent的LVS负载均衡 前言 对于有主从状态的集群,实现读负载均衡需要将读请求分发到各Slave上,并且主从发生切换后,要自动调整分发策略。然而,当前主流的LVS监控方案,keepalived或Pacemaker + ldirectord并不能很好的支持这一点,它们需要在发生failover后修改相应的配置文件,这并不是非常方便。为了把LVS更好集成到Pacemaker集群里,将LVS相关的控制操作包装成资源代理,并提供多种灵活的方式动态配置real server。当前只支持工作在LVS的DR模式。 功能 Director Server和Real Server上内核参数的自动设置 负载均衡策略等LVS参数的配置 Real Server权重的配置 多种Real Server列表配置方式 a)静态列表 b)根据资源依赖动态设置 c)根据节点属性动态设置 d)以上a,b,c 3种的组合 e)根据外部脚本动态获取 多种Real Server的健康检查方式 由Real Server对应的RA(如pgsql)检查 由外部脚本动态检查 在线动态调整Real Server Real Server列表发生变更时,已建立的到非故障Real Server的连接不受影响。 配置参数 vip 虚拟服务的VIP. port 虚拟服务的端口号 virtual_service_options 传递给ipvsadm的虚拟服务的选项,比如"-s rr". default_weight Real Server的缺省权重,默认为1. weight_of_realservers Real Server的host和权重组合的列表,设置形式为"node1,weight1 node2,weight2 ..."。 如果省略权重,则使用default_weight。使用了realserver_dependent_resource,realserver_dependent_attribute_name或realserver_dependent_attribute_value参数时, host必须是节点的主机名。 realserver_dependent_resource Real Server依赖的资源,只有被分配了该资源的节点会被加入到LVS的real server列表中。 如果realserver_get_real_servers_script不为空,该参数将失效。 realserver_dependent_attribute_name Real Server依赖的节点属性名。 如果realserver_get_real_servers_script或realserver_check_active_slave_script不为空,该参数将失效。 realserver_dependent_attribute_value Real Server依赖的节点属性值的正则表达式,比如对于pgsql RA的从节点,可以设置为"HS:sync|HS:potential|HS:async" 如果realserver_get_real_servers_script或realserver_check_active_slave_script不为空,该参数将失效。 realserver_get_real_servers_script 动态获取Real Server列表的脚本。该脚本输出空格分隔得Real Server列表。 realserver_check_active_real_server_script 动态检查Real Server健康的脚本。该脚本接收节点名作为参数。 如果realserver_get_real_servers_script或realserver_check_active_slave_script不为空,该参数将失效。 安装 前提需求 Pacemaker Corosync ipvsadm 获取RA脚本 wget https://raw.githubusercontent.com/ChenHuajun/pha4pgsql/master/ra/lvsdr wget https://raw.githubusercontent.com/ChenHuajun/pha4pgsql/master/ra/lvsdr-realsvr 安装RA脚本 cp lvsdr lvsdr-realsvr /usr/lib/ocf/resource.d/heartbeat/ chmod +x /usr/lib/ocf/resource.d/heartbeat/lvsdr chmod +x /usr/lib/ocf/resource.d/heartbeat/lvsdr-realsvr 使用示例 以下是PostgreSQL主从集群中读负载均衡的配置示例,读请求通过LVS的RR负载均衡策略平均分散到2个Slave节点上。 示例1:通过资源依赖和节点属性动态设置Real Server列表 配置lvsdr-realsvr和每个expgsql的Slave资源部署在一起 pcs -f pgsql_cfg resource create lvsdr-realsvr lvsdr-realsvr \ vip="192.168.0.237" \ nic_lo="lo:0" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="30s" interval="60s" on-fail="restart" \ op stop timeout="60s" interval="0s" on-fail="block" pcs -f pgsql_cfg resource clone lvsdr-realsvr clone-node-max=1 notify=false pcs -f pgsql_cfg constraint colocation add lvsdr with vip-slave INFINITY pcs -f pgsql_cfg constraint colocation add lvsdr-realsvr-clone with Slave msPostgresql INFINITY 配值lvsdr的real server列表依赖于lvsdr-realsvr-clone资源和节点的pgsql-status特定属性值 pcs -f pgsql_cfg resource create lvsdr lvsdr \ vip="192.168.0.237" \ port="5432" \ realserver_dependent_resource="lvsdr-realsvr-clone" \ realserver_dependent_attribute_name="pgsql-status" \ realserver_dependent_attribute_value="HS:sync|HS:potential|HS:async" \ virtual_service_options="-s rr" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="30s" interval="5s" on-fail="restart" \ op stop timeout="60s" interval="0s" on-fail="block" 完整的配置如下: pcs cluster cib pgsql_cfg pcs -f pgsql_cfg property set no-quorum-policy="stop" pcs -f pgsql_cfg property set stonith-enabled="false" pcs -f pgsql_cfg resource defaults resource-stickiness="1" pcs -f pgsql_cfg resource defaults migration-threshold="10" pcs -f pgsql_cfg resource create vip-master IPaddr2 \ ip="192.168.0.236" \ nic="eno16777736" \ 
cidr_netmask="24" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op stop timeout="60s" interval="0s" on-fail="block" pcs -f pgsql_cfg resource create vip-slave IPaddr2 \ ip="192.168.0.237" \ nic="eno16777736" \ cidr_netmask="24" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op stop timeout="60s" interval="0s" on-fail="block" pcs -f pgsql_cfg resource create pgsql expgsql \ pgctl="/usr/pgsql-9.5/bin/pg_ctl" \ psql="/usr/pgsql-9.5/bin/psql" \ pgdata="/data/postgresql/data" \ pgport="5432" \ rep_mode="sync" \ node_list="node1 node2 node3 " \ restore_command="" \ primary_conninfo_opt="user=replication password=replication keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \ master_ip="192.168.0.236" \ restart_on_promote="false" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="60s" interval="4s" on-fail="restart" \ op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \ op promote timeout="60s" interval="0s" on-fail="restart" \ op demote timeout="60s" interval="0s" on-fail="stop" \ op stop timeout="60s" interval="0s" on-fail="block" \ op notify timeout="60s" interval="0s" pcs -f pgsql_cfg resource master msPostgresql pgsql \ master-max=1 master-node-max=1 clone-node-max=1 notify=true \ migration-threshold="3" target-role="Master" pcs -f pgsql_cfg constraint colocation add vip-master with Master msPostgresql INFINITY pcs -f pgsql_cfg constraint order promote msPostgresql then start vip-master symmetrical=false score=INFINITY pcs -f pgsql_cfg constraint order demote msPostgresql then stop vip-master symmetrical=false score=0 pcs -f pgsql_cfg constraint colocation add vip-slave with Slave msPostgresql INFINITY pcs -f pgsql_cfg constraint order promote msPostgresql then start vip-slave symmetrical=false score=INFINITY pcs -f pgsql_cfg constraint order stop msPostgresql then stop vip-slave symmetrical=false score=0 pcs -f pgsql_cfg resource create lvsdr lvsdr \ vip="192.168.0.237" \ port="5432" \ realserver_dependent_resource="lvsdr-realsvr-clone" \ realserver_dependent_attribute_name="pgsql-status" \ realserver_dependent_attribute_value="HS:sync|HS:potential|HS:async" \ virtual_service_options="-s rr" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="30s" interval="5s" on-fail="restart" \ op stop timeout="60s" interval="0s" on-fail="block" pcs -f pgsql_cfg resource create lvsdr-realsvr lvsdr-realsvr \ vip="192.168.0.237" \ nic_lo="lo:0" \ op start timeout="60s" interval="0s" on-fail="restart" \ op monitor timeout="30s" interval="60s" on-fail="restart" \ op stop timeout="60s" interval="0s" on-fail="block" pcs -f pgsql_cfg resource clone lvsdr-realsvr clone-node-max=1 notify=false pcs -f pgsql_cfg constraint colocation add lvsdr with vip-slave INFINITY pcs -f pgsql_cfg constraint colocation add lvsdr-realsvr-clone with Slave msPostgresql INFINITY pcs -f pgsql_cfg constraint order stop vip-slave then start vip-slave symmetrical=false score=0 pcs -f pgsql_cfg constraint order stop vip-master then start vip-master symmetrical=false score=0 pcs -f pgsql_cfg constraint order start lvsdr-realsvr-clone then start lvsdr symmetrical=false score=0 pcs -f pgsql_cfg constraint order start lvsdr then start vip-slave symmetrical=false score=0 pcs cluster cib-push pgsql_cfg 示例2:从Master节点查询Slave节点 配值lvsdr的real server列表通过get_active_slaves脚本在Master上动态查询Slave节点。 pcs -f pgsql_cfg resource 
create lvsdr lvsdr \
   vip="192.168.0.237" \
   port="5432" \
   default_weight="0" \
   weight_of_realservers="node1,1 node2,1 node3,1 192.168.0.234,1" \
   realserver_get_real_servers_script="/opt/pha4pgsql/tools/get_active_slaves /usr/pgsql/bin/psql \"host=192.168.0.236 port=5432 dbname=postgres user=replication password=replication connect_timeout=5\"" \
   virtual_service_options="-s rr" \
   op start timeout="60s" interval="0s" on-fail="restart" \
   op monitor timeout="30s" interval="5s" on-fail="restart" \
   op stop timeout="60s" interval="0s" on-fail="block"

采用这种方式可以将Pacemaker集群以外的Slave作为real server加入到LVS。对这样的节点需要进行下面的设置

设置作为LVS real server的系统参数

echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

在lo网卡上添加读VIP

ip a add 192.168.0.237/32 dev lo:0

设置Slave节点连接信息中application_name为该节点的主机名或ip地址。

[root@node4 pha4pgsql]# cat /data/postgresql/data/recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=192.168.0.236 port=5432 application_name=192.168.0.234 user=replication password=replication keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = ''
recovery_target_timeline = 'latest'

示例3:直接连接Slave检查节点健康状况

通过default_weight和weight_of_realservers指定real server一览,并通过调用check_active_slave脚本,依次连接到real server中的每个节点上检查其是否可以连接并且是Slave。

pcs -f pgsql_cfg resource create lvsdr lvsdr \
   vip="192.168.0.237" \
   port="5432" \
   default_weight="1" \
   weight_of_realservers="node1 node2 node3 192.168.0.234" \
   realserver_check_active_real_server_script="/opt/pha4pgsql/tools/check_active_slave /usr/pgsql/bin/psql \"port=5432 dbname=postgres user=replication password=replication connect_timeout=5\" -h" \
   virtual_service_options="-s rr" \
   op start timeout="60s" interval="0s" on-fail="restart" \
   op monitor timeout="30s" interval="5s" on-fail="restart" \
   op stop timeout="60s" interval="0s" on-fail="block"

pcs resource update lvsdr default_weight="1"
pcs resource update lvsdr weight_of_realservers="node1 node2 node3 192.168.0.234"
pcs resource update lvsdr realserver_dependent_resource=""
pcs resource update lvsdr realserver_get_real_servers_script=""
pcs resource update lvsdr realserver_check_active_real_server_script="/opt/pha4pgsql/tools/check_active_slave /usr/pgsql/bin/psql \"port=5432 dbname=postgres user=replication password=replication connect_timeout=5\" -h"
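get_active_slaves、check_active_slave这类脚本的核心逻辑,无非是连到Master(或各节点)上查询复制视图。下面只是查询思路的示意,实际请以pha4pgsql仓库中的脚本实现为准:

-- 在Master上列出处于streaming状态的备库
-- application_name(或client_addr)即可作为LVS real server的地址来源
SELECT application_name, client_addr, state, sync_state
  FROM pg_stat_replication
 WHERE state = 'streaming';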
PostgreSQL 9.5.3中文手册1.0版发布说明

发布说明
《PostgreSQL9.5.3中文手册》是在彭煜玮教授独自翻译的《PostgreSQL 9.4.4 文档》以及PostgreSQL中国用户会组织翻译的上一个版本《PostgreSQL9.4.4中文手册》的基础上翻译而成。主要翻译工作由瀚高软件的韩悦悦和尹敏敏完成,详细请参考PostgreSQL9.5中文手册的翻译。同时感谢老唐提供了chm版的离线手册的制作工具gen_pgdoc_chm!

在线手册
PostgreSQL9.5.3中文手册

离线手册下载
PostgreSQL9.5.3中文手册v1.0
PostgreSQL的区域设置 对于中文用户,在PostgreSQL中应该将编码无条件的设为UTF8,为简化和统一区域(loacle)也推荐尽量设置为C,但Collate和Ctype对性能或功能有一定影响,需要注意。 环境 rhel 6.3 x64虚机(4C/8G/300G HDD) PostgreSQL 9.6.2 数据库 en_US=# \l+ List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+----------+------------+------------+-----------------------+---------+------------+-------------------------------------------- en_US | postgres | UTF8 | en_US.UTF8 | en_US.UTF8 | | 7343 kB | pg_default | postgres | postgres | UTF8 | C | C | | 414 MB | pg_default | default administrative connection database template0 | postgres | UTF8 | C | C | =c/postgres +| 7225 kB | pg_default | unmodifiable empty database | | | | | postgres=CTc/postgres | | | template1 | postgres | UTF8 | C | C | =c/postgres +| 7225 kB | pg_default | default template for new databases | | | | | postgres=CTc/postgres | | | zh_CN | postgres | UTF8 | zh_CN.UTF8 | zh_CN.UTF8 | | 7225 kB | pg_default | (5 rows) Collate对功能的影响 Collate会影响中文的排序,在zh_CN的区域下中文按拼音排序,其它区域按字符编码排序。 postgres=# select * from (values ('王'),('貂'),('西'),('杨')) a order by a; column1 --------- 杨 王 西 貂 (4 rows) postgres=# \c en_US You are now connected to database "en_US" as user "postgres". en_US=# select * from (values ('王'),('貂'),('西'),('杨')) a order by a; column1 --------- 杨 王 西 貂 (4 rows) en_US=# \c zh_CN You are now connected to database "zh_CN" as user "postgres". zh_CN=# select * from (values ('王'),('貂'),('西'),('杨')) a order by a; column1 --------- 貂 王 西 杨 (4 rows) Collate对性能的影响 测试方法 postgres=# create table tb1(c1 text); CREATE TABLE Time: 5.653 ms postgres=# insert into tb1 select md5(generate_series(1,1000000)::text); INSERT 0 1000000 Time: 2671.929 ms postgres=# vacuum ANALYZE tb1; VACUUM Time: 398.817 ms postgres=# select * from tb1 order by c1 limit 1; c1 ---------------------------------- 0000104cd168386a335ba6bf6e32219d (1 row) Time: 176.779 ms postgres=# create index idx1 on tb1(c1); CREATE INDEX Time: 1549.436 ms 测试结果 Collate/Ctype C en_US.UTF8 zh_CN.UTF8 insert 2671 2613 2670 vacuum ANALYZE 398 250 396 order by 176 388 401 create index 1549 7492 7904 insert(with index) 11199 15621 16128 Ctype的影响 Ctype会影响pg_trgm和部分正则匹配的结果,比如Ctype为'C'时,pg_trgm将无法支持中文 postgres=# select show_trgm('aaabbbc到的x'); show_trgm ----------------------------------------------------- {" a"," x"," aa"," x ",aaa,aab,abb,bbb,bbc,"bc "} (1 row) en_US=# select show_trgm('aaabbbc到的x'); show_trgm ----------------------------------------------------------------------- {" a"," aa",0x27bdf1,0x30bd19,0x4624bc,aaa,aab,abb,bbb,bbc,0x6a2ad5} (1 row) zh_CN=# select show_trgm('aaabbbc到的x'); show_trgm ----------------------------------------------------------------------- {" a"," aa",0x27bdf1,0x30bd19,0x4624bc,aaa,aab,abb,bbb,bbc,0x6a2ad5} (1 row) 结论 对性能要求不高的场景建议将Collate和Ctype都设置为zh_CN.UTF8,其它区域设置为C。 initdb -E UTF8 --locale=C --lc-collate=zh_CN.UTF8 --lc-ctype=zh_CN.UTF8 ... 对性能要求较高的场景建议将Ctype设置为zh_CN.UTF8,其它区域设置为C。如果有部分查询需要按拼音排序,可在列定义和SQL表达式中指定Collate为zh_CN。 initdb -E UTF8 --locale=C --lc-ctype=zh_CN.UTF8 ... 参考 PostgreSQL中的区域和编码
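如果按第二种方式初始化(实例级Collate为C),个别需要按拼音排序的地方可以在列定义或查询表达式上单独指定排序规则。下面是示意,排序规则名以pg_collation中实际存在的名字为准:

-- 查看系统中可用的中文排序规则
SELECT collname FROM pg_collation WHERE collname LIKE 'zh%';
-- 在列定义上指定
CREATE TABLE t_person(name text COLLATE "zh_CN.utf8");
-- 或只在查询时指定
SELECT * FROM t_person ORDER BY name COLLATE "zh_CN.utf8";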
citus对join的支持 前言 citus对支持的SQL有一定的限制,其中包括最常见的join,具体如下 inner join 无限制。根据情况会以下面几种方式之一支持 亲和表 即2个表的分片规则完全相同,且join列即为分片列值 参考表 join的2个表中,其中有一个表不分片且每个worker上都存一份副本的表,即"参考表" 小表广播 分片数小于citus.large_table_shard_count的表被认为是小表(默认值为4),citus会将小表的分片广播到所有worker上并缓存。小表机制容易产生bug不建议使用,比如之后更新小表时不会更新缓存从而导致数据不一致,建议改用参考表代替。 数据重分布 使用task-tracker执行器可支持数据重分布,然后以MapMerge的方式支持。 outer join 与inner join的主要不同是不支持数据重分布,因此无法支持两个分片规则不一致的大表的outer join。 另外,参考表只有出现在left join的右边或right join的左边才被支持。 实验 下面演示2个大表的join 环境 CentOS release 6.5 x64物理机(16C/128G/3TB SSD) PostgreSQL 9.6.2 citus 6.1.0 pgbouncer 1.7.2 master和worker都在1台机器上,端口号不同 master :60001 worker1:60002 worker2:60003 worker设置 在master上添加worker节点 SELECT * from master_add_node('/tmp', 60002); SELECT * from master_add_node('/tmp', 60003); 数据定义 postgres=# create table tb1(id int,k int); CREATE TABLE postgres=# create table tb2(id int,k int); CREATE TABLE postgres=# select create_distributed_table('tb1','id'); create_distributed_table -------------------------- (1 行记录) postgres=# select create_distributed_table('tb2','id'); create_distributed_table -------------------------- (1 行记录) 数据导入 postgres=# create unlogged table tbx as select id,id ss from ( select generate_series(1,1000000) id) a; SELECT 1000000 时间:414.776 ms postgres=# copy tb1 from PROGRAM 'psql -p60001 -Atc "copy tbx to STDOUT"'; COPY 1000000 时间:748.383 ms postgres=# copy tb2 from PROGRAM 'psql -p60001 -Atc "copy tbx to STDOUT"'; COPY 1000000 时间:757.981 ms 执行查询 inner join(分片列相同) postgres=# select count(1) from tb1 join tb2 on(tb1.id=tb2.id); count --------- 1000000 (1 行记录) 时间:1889.941 ms 执行计划 postgres=# explain select count(1) from tb1 join tb2 on(tb1.id=tb2.id); QUERY PLAN -------------------------------------------------------------------------------------------------- Distributed Query into pg_merge_job_0033 Executor: Task-Tracker Task Count: 32 Tasks Shown: One of 32 -> Task Node: host=/tmp port=60002 dbname=postgres -> Aggregate (cost=1818.06..1818.07 rows=1 width=8) -> Hash Join (cost=849.88..1739.19 rows=31550 width=0) Hash Cond: (tb1.id = tb2.id) -> Seq Scan on tb1_102008 tb1 (cost=0.00..455.50 rows=31550 width=4) -> Hash (cost=455.50..455.50 rows=31550 width=4) -> Seq Scan on tb2_102040 tb2 (cost=0.00..455.50 rows=31550 width=4) Master Query -> Aggregate (cost=0.00..0.00 rows=0 width=0) -> Seq Scan on pg_merge_job_0033 (cost=0.00..0.00 rows=0 width=0) (15 行记录) 时间:14.952 ms inner join(分片列不同) postgres=# select count(1) from tb1 join tb2 on(tb1.id=tb2.k); ERROR: cannot use real time executor with repartition jobs 提示: Set citus.task_executor_type to "task-tracker". 
时间:16.238 ms 默认的real-time执行器不支持这种join,先设置执行器为'task-tracker' set citus.task_executor_type='task-tracker' 再执行SQL postgres=# select count(1) from tb1 join tb2 on(tb1.id=tb2.k); count --------- 1000000 (1 行记录) 时间:16339.376 ms postgres=# select count(1) from tb1 join tb2 on(tb1.k=tb2.k); count --------- 1000000 (1 行记录) 时间:16263.971 ms 16秒完成2个100w大表的join效率也不低了。 执行计划如下: postgres=# explain select count(1) from tb1 join tb2 on(tb1.k=tb2.k); QUERY PLAN ----------------------------------------------------------------------------- Distributed Query into pg_merge_job_0036 Executor: Task-Tracker Task Count: 8 Tasks Shown: None, not supported for re-partition queries -> MapMergeJob Map Task Count: 32 Merge Task Count: 8 -> MapMergeJob Map Task Count: 32 Merge Task Count: 8 Master Query -> Aggregate (cost=0.00..0.00 rows=0 width=0) -> Seq Scan on pg_merge_job_0036 (cost=0.00..0.00 rows=0 width=0) (13 行记录) 时间:22.865 ms postgres=# explain select count(1) from tb1 join tb2 on(tb1.k=tb2.k); QUERY PLAN ----------------------------------------------------------------------------- Distributed Query into pg_merge_job_0039 Executor: Task-Tracker Task Count: 8 Tasks Shown: None, not supported for re-partition queries -> MapMergeJob Map Task Count: 32 Merge Task Count: 8 -> MapMergeJob Map Task Count: 32 Merge Task Count: 8 Master Query -> Aggregate (cost=0.00..0.00 rows=0 width=0) -> Seq Scan on pg_merge_job_0039 (cost=0.00..0.00 rows=0 width=0) (13 行记录) 时间:21.905 ms left join join列和分片列一致时可以支持 postgres=# select count(1) from tb1 left join tb2 on(tb1.id=tb2.id); count --------- 1000000 (1 行记录) 时间:1929.182 ms join列和分片列不一致时不支持 postgres=# select count(1) from tb1 left join tb2 on(tb1.id=tb2.k); ERROR: cannot run outer join query if join is not on the partition column 描述: Outer joins requiring repartitioning are not supported. 时间:0.268 ms 和参考表的outer join 创建参考表tb3 postgres=# create table tb3(id int,k int); CREATE TABLE 时间:0.758 ms postgres=# select create_reference_table('tb3'); create_reference_table ------------------------ (1 行记录) 时间:28.051 ms 参考表在left join右边时可以支持 postgres=# select count(1) from tb1 left join tb3 on(tb1.k=tb3.k); count --------- 1000000 (1 行记录) 时间:1942.156 ms 参考表在left join左边时不支持 postgres=# select count(1) from tb3 left join tb1 on(tb1.k=tb3.k); ERROR: cannot run outer join query if join is not on the partition column 描述: Outer joins requiring repartitioning are not supported. 时间:0.183 ms right join正好相反 postgres=# select count(1) from tb3 right join tb1 on(tb1.k=tb3.k); count --------- 1000000 (1 行记录) 时间:2155.268 ms postgres=# select count(1) from tb1 right join tb3 on(tb1.k=tb3.k); ERROR: cannot run outer join query if join is not on the partition column 描述: Outer joins requiring repartitioning are not supported. 时间:0.348 ms full join不支持 postgres=# select count(1) from tb1 full join tb3 on(tb1.k=tb3.k); ERROR: cannot run outer join query if join is not on the partition column 描述: Outer joins requiring repartitioning are not supported. 时间:0.180 ms postgres=# select count(1) from tb3 full join tb1 on(tb1.k=tb3.k); ERROR: cannot run outer join query if join is not on the partition column 描述: Outer joins requiring repartitioning are not supported. 时间:0.163 ms
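把上面切换执行器的操作整理成一个会话级设置的示意(SQL 即上文实验中按非分片列 join 的查询,仅为示例):

# 需要数据重分布的join前,先在会话内切换到task-tracker执行器
psql -p 60001 -d postgres <<'SQL'
-- 仅当前会话生效;也可用 ALTER DATABASE ... SET 将其设为库级默认
SET citus.task_executor_type = 'task-tracker';
SELECT count(1) FROM tb1 JOIN tb2 ON (tb1.id = tb2.k);
SQL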
分布式实时分析数据库citus数据查询性能简单对比 如果单纯看实时数据插入的速度,并不能体现citus的价值,还要看聚合查询的性能。下面将集群的查询性能和单机做个简单的对比。 仍使用之前插入测试的环境 分布式实时分析数据库citus数据插入性能优化 分布式实时分析数据库citus数据插入性能优化之二 环境 软硬件配置 CentOS release 6.5 x64物理机(16C/128G/300GB SSD) CPU: 2*8core 16核32线程, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz PostgreSQL 9.6.2 citus 6.1.0 sysbench-1.0.3 机器列表 master 192.168.0.177 worker(8个) 192.168.0.181~192.168.0.188 软件的安装都比较简单,参考官方文档即可,这里略过。 postgresql.conf配置
分布式实时分析数据库citus数据插入性能优化之二 在上回的分布式实时分析数据库citus数据插入性能优化 提到citus的master上执行计划生成比较耗费时间,下面尝试通过修改源码绕过master的执行计划生成。 环境 软硬件配置 CentOS release 6.5 x64物理机(16C/128G/300GB SSD) CPU: 2*8core 16核32线程, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz PostgreSQL 9.6.2 citus 6.1.0 sysbench-1.0.3 机器列表 master 192.168.0.177 worker(8个) 192.168.0.181~192.168.0.188 软件的安装都比较简单,参考官方文档即可,这里略过。 postgresql.conf配置 listen_addresses = '*' port = 5432 max_connections = 1000 shared_buffers = 32GB effective_cache_size = 96GB work_mem = 16MB maintenance_work_mem = 2GB min_wal_size = 4GB max_wal_size = 32GB checkpoint_completion_target = 0.9 wal_buffers = 16MB default_statistics_target = 100 shared_preload_libraries = 'citus' checkpoint_timeout = 60min wal_level = replica wal_compression = on wal_log_hints = on synchronous_commit = on 注:和上次的测试不同,synchronous_commit改为on 测试场景 选用sysbench-1.0.3的oltp_insert.lua作为测试用例,执行的SQL的示例如下: INSERT INTO sbtest1 (id, k, c, pad) VALUES (525449452, 5005, '28491622445-08162085385-16839726209-31171823540-28539137588-93842246002-13643098812-68836434394-95216556185-07917709646', '49165640733-86514010343-02300194630-37380434155-24438915047') 但是,sysbench-1.0.3的oltp_insert.lua中有一个bug,需要先将其改正 i = sysbench.rand.unique() ==> i = sysbench.rand.unique() - 2147483648 单机测试 建表 CREATE TABLE sbtest1 ( id integer NOT NULL, k integer NOT NULL DEFAULT 0, c character(120) NOT NULL DEFAULT ''::bpchar, pad character(60) NOT NULL DEFAULT ''::bpchar, PRIMARY KEY (id) ); CREATE INDEX k_1 ON sbtest1(k); 插入数据 src/sysbench --test=src/lua/oltp_insert.lua \ --db-driver=pgsql \ --pgsql-host=127.0.0.1 \ --pgsql-port=5432 \ --pgsql-user=postgres \ --pgsql-db=dbone \ --auto_inc=0 \ --time=10 \ --threads=128 \ --report-interval=1 \ run 测试结果 TPS为122809 -bash-4.1$ src/sysbench --test=src/lua/oltp_insert.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 --pgsql-user=postgres --pgsql-db=dbone --auto_inc=0 --time=10 --threads=128 --report-interval=1 run WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options. sysbench 1.0.3 (using bundled LuaJIT 2.1.0-beta2) Running the test with following options: Number of threads: 128 Report intermediate results every 1 second(s) Initializing random number generator from current time Initializing worker threads... Threads started! 
[ 1s ] thds: 128 tps: 124474.46 qps: 124474.46 (r/w/o: 0.00/124474.46/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 2s ] thds: 128 tps: 124674.70 qps: 124674.70 (r/w/o: 0.00/124674.70/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 3s ] thds: 128 tps: 125700.72 qps: 125700.72 (r/w/o: 0.00/125700.72/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 4s ] thds: 128 tps: 125316.67 qps: 125316.67 (r/w/o: 0.00/125316.67/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 5s ] thds: 128 tps: 114303.50 qps: 114303.50 (r/w/o: 0.00/114303.50/0.00) lat (ms,95%): 2.22 err/s: 0.00 reconn/s: 0.00 [ 6s ] thds: 128 tps: 124781.26 qps: 124781.26 (r/w/o: 0.00/124781.26/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 7s ] thds: 128 tps: 124819.42 qps: 124819.42 (r/w/o: 0.00/124819.42/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 8s ] thds: 128 tps: 125309.88 qps: 125309.88 (r/w/o: 0.00/125309.88/0.00) lat (ms,95%): 1.93 err/s: 0.00 reconn/s: 0.00 [ 9s ] thds: 128 tps: 125674.52 qps: 125674.52 (r/w/o: 0.00/125674.52/0.00) lat (ms,95%): 1.89 err/s: 0.00 reconn/s: 0.00 [ 10s ] thds: 128 tps: 116230.44 qps: 116230.44 (r/w/o: 0.00/116230.44/0.00) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00 SQL statistics: queries performed: read: 0 write: 1232576 other: 0 total: 1232576 transactions: 1232576 (122809.76 per sec.) queries: 1232576 (122809.76 per sec.) ignored errors: 0 (0.00 per sec.) reconnects: 0 (0.00 per sec.) General statistics: total time: 10.0345s total number of events: 1232576 Latency (ms): min: 0.15 avg: 1.04 max: 24.45 95th percentile: 1.96 sum: 1276394.81 Threads fairness: events (avg/stddev): 9629.5000/65.84 execution time (avg/stddev): 9.9718/0.01 资源消耗 此时CPU利用率84%,已经接近瓶颈。 -bash-4.1$ iostat sdc -xk 5 ... avg-cpu: %user %nice %system %iowait %steal %idle 60.32 0.00 22.88 0.16 0.00 16.64 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 13649.00 18.00 11011.00 72.00 98632.00 17.90 0.45 0.04 0.03 32.10 citus集群测试 建表 CREATE TABLE sbtest1 ( id integer NOT NULL, k integer NOT NULL DEFAULT 0, c character(120) NOT NULL DEFAULT ''::bpchar, pad character(60) NOT NULL DEFAULT ''::bpchar, PRIMARY KEY (id) ); CREATE INDEX k_1 ON sbtest1(k); set citus.shard_count = 128; set citus.shard_replication_factor = 1; select create_distributed_table('sbtest1','id'); 插入数据 /bak/soft/sysbench-1.0.3/src/sysbench --test=/bak/soft/sysbench-1.0.3/src/lua/oltp_insert.lua \ --db-driver=pgsql \ --pgsql-host=127.0.0.1 \ --pgsql-port=5432 \ --pgsql-user=postgres \ --pgsql-db=dbcitus \ --auto_inc=0 \ --time=10 \ --threads=64 \ --report-interval=1 \ run 执行结果 上次测试的TPS为44637,但是当时master上部署了pgbouncer,pgbouncer消耗了不少CPU。 把pgbouncer停掉后,再测的结果是55717。 -bash-4.1$ /bak/soft/sysbench-1.0.3/src/sysbench /bak/soft/sysbench-1.0.3/src/lua/oltp_insert.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 --pgsql-user=postgres --pgsql-db=dbone --auto_inc=0 --time=5 --threads=64 --report-interval=1 run sysbench 1.0.3 (using bundled LuaJIT 2.1.0-beta2) Running the test with following options: Number of threads: 64 Report intermediate results every 1 second(s) Initializing random number generator from current time Initializing worker threads... Threads started! 
[ 1s ] thds: 64 tps: 52903.73 qps: 52903.73 (r/w/o: 0.00/52903.73/0.00) lat (ms,95%): 3.25 err/s: 0.00 reconn/s: 0.00 [ 2s ] thds: 64 tps: 56548.81 qps: 56548.81 (r/w/o: 0.00/56548.81/0.00) lat (ms,95%): 3.13 err/s: 0.00 reconn/s: 0.00 [ 3s ] thds: 64 tps: 56492.06 qps: 56492.06 (r/w/o: 0.00/56492.06/0.00) lat (ms,95%): 3.13 err/s: 0.00 reconn/s: 0.00 [ 4s ] thds: 64 tps: 56470.25 qps: 56470.25 (r/w/o: 0.00/56470.25/0.00) lat (ms,95%): 3.13 err/s: 0.00 reconn/s: 0.00 [ 5s ] thds: 64 tps: 56627.38 qps: 56627.38 (r/w/o: 0.00/56627.38/0.00) lat (ms,95%): 3.13 err/s: 0.00 reconn/s: 0.00 SQL statistics: queries performed: read: 0 write: 279214 other: 0 total: 279214 transactions: 279214 (55717.02 per sec.) queries: 279214 (55717.02 per sec.) ignored errors: 0 (0.00 per sec.) reconnects: 0 (0.00 per sec.) General statistics: total time: 5.0093s total number of events: 279214 Latency (ms): min: 0.45 avg: 1.14 max: 36.80 95th percentile: 3.13 sum: 319193.98 Threads fairness: events (avg/stddev): 4362.7188/79.40 execution time (avg/stddev): 4.9874/0.00 资源消耗 性能瓶颈在master的CPU上,master生成执行计划消耗了大量CPU。 master的CPU利用率达到82% [root@node1 ~]# iostat sdc -xk 5 Linux 2.6.32-431.el6.x86_64 (node1) 2017年03月13日 _x86_64_ (32 CPU) ... avg-cpu: %user %nice %system %iowait %steal %idle 66.15 0.00 15.00 0.00 0.00 18.85 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 优化:master绕过SQL解析 定义分发SQL的函数 定义下面分发SQL的函数。 CREATE FUNCTION pg_catalog.master_distribute_dml(table_name regclass,distribution_column_value anyelement,dml_sql text) RETURNS integer LANGUAGE C STRICT AS 'citus.so', $$master_distribute_dml$$; COMMENT ON FUNCTION master_distribute_dml(regclass,anyelement,text) IS 'distribute delete insert and update query to appropriate shard'; 各参数的含义如下 - table_name:表名 - distribution_column_value:分片列值 - dml_sql:DML SQL语句,其中表名用"%s"代替 该函数通过传入的分片列值判断出所属的分片,然后直接发SQL分发到该分片上,仅仅把SQL中“%s”替代为实际的分片表名。 函数的定义参考附录。 修改 在oltp_insert.lua的基础上生成oltp_insert2.lua cp ./src/lua/oltp_insert.lua ./src/lua/oltp_insert2.lua vi ./src/lua/oltp_insert2.lua 修改内容如下: con:query(string.format("INSERT INTO %s (id, k, c, pad) VALUES " .. "(%d, %d, '%s', '%s')", table_name, i, k_val, c_val, pad_val)) ==> con:query(string.format("select master_distribute_dml('%s', %d, $$" .. "INSERT INTO %%s (id, k, c, pad) VALUES " .. "(%d, %d, '%s', '%s')$$)", table_name, i, i, k_val, c_val, pad_val)) 测试 修改后TPS增加到75973。 -bash-4.1$ /bak/soft/sysbench-1.0.3/src/sysbench /bak/soft/sysbench-1.0.3/src/lua/oltp_insert.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 --pgsql-user=postgres --pgsql-db=dbcitus --auto_inc=0 --time=5 --threads=64 --report-interval=1 run sysbench 1.0.3 (using bundled LuaJIT 2.1.0-beta2) Running the test with following options: Number of threads: 64 Report intermediate results every 1 second(s) Initializing random number generator from current time Initializing worker threads... Threads started! 
[ 1s ] thds: 64 tps: 73760.99 qps: 73761.98 (r/w/o: 73761.98/0.00/0.00) lat (ms,95%): 2.52 err/s: 0.00 reconn/s: 0.00 [ 2s ] thds: 64 tps: 76409.47 qps: 76409.47 (r/w/o: 76409.47/0.00/0.00) lat (ms,95%): 2.48 err/s: 0.00 reconn/s: 0.00 [ 3s ] thds: 64 tps: 76669.99 qps: 76668.99 (r/w/o: 76668.99/0.00/0.00) lat (ms,95%): 2.57 err/s: 0.00 reconn/s: 0.00 [ 4s ] thds: 64 tps: 76587.58 qps: 76587.58 (r/w/o: 76587.58/0.00/0.00) lat (ms,95%): 2.57 err/s: 0.00 reconn/s: 0.00 [ 5s ] thds: 64 tps: 76996.79 qps: 76996.79 (r/w/o: 76996.79/0.00/0.00) lat (ms,95%): 2.52 err/s: 0.00 reconn/s: 0.00 SQL statistics: queries performed: read: 380635 write: 0 other: 0 total: 380635 transactions: 380635 (75973.81 per sec.) queries: 380635 (75973.81 per sec.) ignored errors: 0 (0.00 per sec.) reconnects: 0 (0.00 per sec.) General statistics: total time: 5.0081s total number of events: 380635 Latency (ms): min: 0.28 avg: 0.84 max: 11.70 95th percentile: 2.52 sum: 318897.47 Threads fairness: events (avg/stddev): 5947.4219/101.63 execution time (avg/stddev): 4.9828/0.00 master的CPU利用率降低到46% [root@node1 ~]# iostat sdc -xk 5 Linux 2.6.32-431.el6.x86_64 (node1) 2017年03月13日 _x86_64_ (32 CPU) ... avg-cpu: %user %nice %system %iowait %steal %idle 25.10 0.00 20.81 0.00 0.00 54.08 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 总结 通过修改源码绕过master的SQL解析,单master的SQL处理能力由5.6w/s提高到7.6w/s,CPU消耗反而从82%降到了46%,综合而言master处理效率提升为原来的2.4倍。 结合masterless优化,8个worker组成的citus集群实时数据插入的速度预计可达到40w/s左右。 SQL越长越复杂,该方法的性能优化效果越好。我们另一个场景中,有个更新384个字段的超长UPDATE语句,通过这种方式TPS从715提升到8173,master的CPU利用率从98%降低到36%,master处理效能提升为原来的31倍。 附录:master_distribute_dml函数原型实现 src\backend\distributed\master\master_distribute_dml.c: #include "postgres.h" #include "funcapi.h" #include "libpq-fe.h" #include "miscadmin.h" #include "access/htup_details.h" #include "catalog/pg_type.h" #include "access/xact.h" #include "catalog/namespace.h" #include "catalog/pg_class.h" #include "commands/dbcommands.h" #include "commands/event_trigger.h" #include "distributed/citus_clauses.h" #include "distributed/citus_ruleutils.h" #include "distributed/listutils.h" #include "distributed/master_metadata_utility.h" #include "distributed/master_protocol.h" #include "distributed/metadata_cache.h" #include "distributed/multi_client_executor.h" #include "distributed/multi_physical_planner.h" #include "distributed/multi_router_executor.h" #include "distributed/multi_router_planner.h" #include "distributed/multi_server_executor.h" #include "distributed/multi_shard_transaction.h" #include "distributed/pg_dist_shard.h" #include "distributed/pg_dist_partition.h" #include "distributed/resource_lock.h" #include "distributed/shardinterval_utils.h" #include "distributed/worker_protocol.h" #include "optimizer/clauses.h" #include "optimizer/predtest.h" #include "optimizer/restrictinfo.h" #include "optimizer/var.h" #include "nodes/makefuncs.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" #include "utils/datum.h" #include "utils/inval.h" #include "utils/lsyscache.h" #include "utils/memutils.h" /* #include "Fmgr.h" */ #include "utils/catcache.h" #include "utils/fmgroids.h" #include "utils/guc.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/rel.h" #include "utils/typcache.h" extern int64 ExecuteSingleModifyTask2(Task *task, bool expectResults); static Task * ModifySingleShardTask(char *query, ShardInterval *shardInterval); static char * generate_shard_relation_name(Oid relid, int64 
shardid); static void generate_shard_query(char *query, Oid distrelid, int64 shardid, StringInfo buffer); PG_FUNCTION_INFO_V1(master_distribute_dml); /* * master_modify_multiple_shards takes in a DELETE or UPDATE query string and * pushes the query to shards. It finds shards that match the criteria defined * in the delete command, generates the same delete query string for each of the * found shards with distributed table name replaced with the shard name and * sends the queries to the workers. It uses one-phase or two-phase commit * transactions depending on citus.copy_transaction_manager value. */ Datum master_distribute_dml(PG_FUNCTION_ARGS) { Oid relationId = PG_GETARG_OID(0); Datum partitionValue = PG_GETARG_DATUM(1); text *queryText = PG_GETARG_TEXT_P(2); char *queryString = text_to_cstring(queryText); DistTableCacheEntry *cacheEntry = NULL; char partitionMethod; ShardInterval *shardInterval = NULL; Task *task = NULL; int32 affectedTupleCount = 0; /* 简化权限检查,避免SQL解析*/ EnsureTablePermissions(relationId, ACL_INSERT | ACL_UPDATE | ACL_DELETE); CheckDistributedTable(relationId); cacheEntry = DistributedTableCacheEntry(relationId); partitionMethod = cacheEntry->partitionMethod; /* fast shard pruning is only supported for hash and range partitioned tables */ if (partitionMethod != DISTRIBUTE_BY_HASH && partitionMethod != DISTRIBUTE_BY_RANGE) { ereport(ERROR, (errmsg("only hash and range distributed table are supported"))); } shardInterval = FastShardPruning(relationId, partitionValue); if (shardInterval == NULL) { ereport(ERROR, (errmsg("could not find appropriate shard of relation \"%s\" for partition value \"%s\" ", get_rel_name(relationId), TextDatumGetCString(partitionValue)))); } CHECK_FOR_INTERRUPTS(); task = ModifySingleShardTask(queryString, shardInterval); affectedTupleCount = ExecuteSingleModifyTask2( task, false); PG_RETURN_INT32(affectedTupleCount); } /* * ModifyMultipleShardsTaskList builds a list of tasks to execute a query on a * given list of shards. */ static Task * ModifySingleShardTask(char *query, ShardInterval *shardInterval) { uint64 jobId = INVALID_JOB_ID; int taskId = 1; Oid relationId = shardInterval->relationId; uint64 shardId = shardInterval->shardId; StringInfo shardQueryString = makeStringInfo(); Task *task = NULL; /* lock metadata before getting placment lists */ LockShardDistributionMetadata(shardId, ShareLock); generate_shard_query(query, relationId, shardId, shardQueryString); task = CitusMakeNode(Task); task->jobId = jobId; task->taskId = taskId; task->taskType = SQL_TASK; task->queryString = shardQueryString->data; task->dependedTaskList = NULL; task->anchorShardId = shardId; task->taskPlacementList = FinalizedShardPlacementList(shardId); return task; } /* * generate_shard_relation_name * Compute the name to display for a shard * * If the provided relid is equal to the provided distrelid, this function * returns a shard-extended relation name; otherwise, it falls through to a * simple generate_relation_name call. 
*/ static char * generate_shard_relation_name(Oid relid, int64 shardid) { char *relname = NULL; relname = get_rel_name(relid); if (!relname) elog(ERROR, "cache lookup failed for relation %u", relid); if (shardid > 0) { Oid schemaOid = get_rel_namespace(relid); char *schemaName = get_namespace_name(schemaOid); AppendShardIdToName(&relname, shardid); relname = quote_qualified_identifier(schemaName, relname); } return relname; } static void generate_shard_query(char *query, Oid distrelid, int64 shardid, StringInfo buffer) { appendStringInfo(buffer, query, generate_shard_relation_name(distrelid,shardid) ); } src\backend\distributed\executor\multi_router_executor.c: 增加以下函数 int64 ExecuteSingleModifyTask2(Task *task, bool expectResults) { QueryDesc qdesc; EState executorState; bool resultsOK = false; qdesc.estate=&executorState; qdesc.operation=CMD_UPDATE; qdesc.tupDesc=NULL; qdesc.planstate=NULL; qdesc.params=NULL; ExecuteSingleModifyTask(&qdesc, task,expectResults); return executorState.es_processed; } 参考 Scaling Out Data Ingestion Real-time Inserts:0-50k/s Real-time Updates:0-50k/s Bulk Copy:100-200k/s Masterless Citus:50k/s-500k/s
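补充一个直接调用上文 master_distribute_dml 的使用示意(表名 sbtest1、分片列值 42 及各字段值均为随意取的示例值):

# 把一条INSERT按分片列值42直接分发到对应分片,SQL中的表名用 %s 占位
psql -h 192.168.0.177 -d dbcitus -U postgres <<'SQL'
SELECT master_distribute_dml('sbtest1', 42,
       $$INSERT INTO %s (id, k, c, pad) VALUES (42, 5005, 'c-val', 'pad-val')$$);
SQL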
前言 从可靠性和使用便利性来讲单机RDBMS完胜N多各类数据库,但当数据量到了一定量之后,又不得不寻求分布式,列存储等等解决方案。citus是基于PostgreSQL的分布式实时分析解决方案,由于其只是作为PostgreSQL的扩展插件而没有动PG内核,所有随快速随PG主版本升级,可靠性也非常值得信任。 citus在支持SQL特性上有一定的限制,比如不支持跨库事务,不支持部分join和子查询的写法等等,做选型时需要留意(大部分的分布式系统对SQL支持或多或少都有些限制,不足为奇,按场景选型即可)。 citus主要适合下面两种场景 多租户 每个租户的数据按租户ID分片,互不干扰,避免跨库操作。 实时数据分析 通过分片将数据打散到各个worker上,查询时由master生成分布式执行计划驱动所有worker并行工作。支持过滤,投影,聚合,join等各类常见算子的下推。 在实时数据分析场景,单位时间的数据增量会很大,本文实测一下citus的数据插入能力(更新,删除的性能类似)。 环境 软硬件配置 CentOS release 6.5 x64物理机(16C/128G/300GB SSD) CPU: 2*8core 16核32线程, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz PostgreSQL 9.6.2 citus 6.1.0 sysbench-1.0.3 机器列表 master 192.168.0.177 worker(8个) 192.168.0.181~192.168.0.188 软件的安装都比较简单,参考官方文档即可,这里略过。 postgresql.conf配置 listen_addresses = '*' port = 5432 max_connections = 1100 shared_buffers = 32GB effective_cache_size = 96GB work_mem = 16MB maintenance_work_mem = 2GB min_wal_size = 4GB max_wal_size = 32GB checkpoint_completion_target = 0.9 wal_buffers = 16MB default_statistics_target = 100 shared_preload_libraries = 'citus' checkpoint_timeout = 60min wal_level = replica wal_compression = on wal_level = replica wal_log_hints = on synchronous_commit = off 测试场景 选用sysbench-1.0.3的oltp_insert.lua作为测试用例,执行的SQL的示例如下: INSERT INTO sbtest1 (id, k, c, pad) VALUES (525449452, 5005, '28491622445-08162085385-16839726209-31171823540-28539137588-93842246002-13643098812-68836434394-95216556185-07917709646', '49165640733-86514010343-02300194630-37380434155-24438915047') 但是,sysbench-1.0.3的oltp_insert.lua中有一个bug,需要先将其改正 i = sysbench.rand.unique() ==> i = sysbench.rand.unique() - 2147483648 单机测试 建表 CREATE TABLE sbtest1 ( id integer NOT NULL, k integer NOT NULL DEFAULT 0, c character(120) NOT NULL DEFAULT ''::bpchar, pad character(60) NOT NULL DEFAULT ''::bpchar, PRIMARY KEY (id) ); CREATE INDEX k_1 ON sbtest1(k); 插入数据 src/sysbench --test=src/lua/oltp_insert.lua \ --db-driver=pgsql \ --pgsql-host=127.0.0.1 \ --pgsql-port=5432 \ --pgsql-user=postgres \ --pgsql-db=dbone \ --auto_inc=0 \ --time=10 \ --threads=128 \ --report-interval=1 \ run 测试结果 TPS为134030 -bash-4.1$ src/sysbench --test=src/lua/oltp_insert.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 --pgsql-user=postgres --pgsql-db=dbone --auto_inc=0 --time=20 --threads=128 --report-interval=5 run WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options. sysbench 1.0.3 (using bundled LuaJIT 2.1.0-beta2) Running the test with following options: Number of threads: 128 Report intermediate results every 5 second(s) Initializing random number generator from current time Initializing worker threads... Threads started! [ 5s ] thds: 128 tps: 138381.74 qps: 138381.74 (r/w/o: 0.00/138381.74/0.00) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00 [ 10s ] thds: 128 tps: 134268.30 qps: 134268.30 (r/w/o: 0.00/134268.30/0.00) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00 [ 15s ] thds: 128 tps: 132830.91 qps: 132831.11 (r/w/o: 0.00/132831.11/0.00) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00 [ 20s ] thds: 128 tps: 132073.81 qps: 132073.61 (r/w/o: 0.00/132073.61/0.00) lat (ms,95%): 2.03 err/s: 0.00 reconn/s: 0.00 SQL statistics: queries performed: read: 0 write: 2688192 other: 0 total: 2688192 transactions: 2688192 (134030.18 per sec.) queries: 2688192 (134030.18 per sec.) ignored errors: 0 (0.00 per sec.) reconnects: 0 (0.00 per sec.) 
General statistics: total time: 20.0547s total number of events: 2688192 Latency (ms): min: 0.10 avg: 0.95 max: 88.80 95th percentile: 2.07 sum: 2554006.85 Threads fairness: events (avg/stddev): 21001.5000/178.10 execution time (avg/stddev): 19.9532/0.01 资源消耗 此时CPU利用率90%,已经接近瓶颈。 -bash-4.1$ iostat sdc -xk 5 ... avg-cpu: %user %nice %system %iowait %steal %idle 69.12 0.00 20.56 0.15 0.00 10.17 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 25302.60 18.20 705.20 72.80 104019.20 287.79 5.96 8.21 0.81 58.48 citus集群测试 建表 CREATE TABLE sbtest1 ( id integer NOT NULL, k integer NOT NULL DEFAULT 0, c character(120) NOT NULL DEFAULT ''::bpchar, pad character(60) NOT NULL DEFAULT ''::bpchar, PRIMARY KEY (id) ); CREATE INDEX k_1 ON sbtest1(k); set citus.shard_count = 128; set citus.shard_replication_factor = 1; select create_distributed_table('sbtest1','id'); 插入数据 /bak/soft/sysbench-1.0.3/src/sysbench --test=/bak/soft/sysbench-1.0.3/src/lua/oltp_insert.lua \ --db-driver=pgsql \ --pgsql-host=127.0.0.1 \ --pgsql-port=5432 \ --pgsql-user=postgres \ --pgsql-db=dbcitus \ --auto_inc=0 \ --time=10 \ --threads=64 \ --report-interval=1 \ run 执行结果 TPS为44637,远低于单机。 -bash-4.1$ /bak/soft/sysbench-1.0.3/src/sysbench --test=/bak/soft/sysbench-1.0.3/src/lua/oltp_insert.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 --pgsql-user=postgres --pgsql-db=dbcitus --auto_inc=0 --time=20 --threads=64 --report-interval=5 run WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options. sysbench 1.0.3 (using bundled LuaJIT 2.1.0-beta2) Running the test with following options: Number of threads: 64 Report intermediate results every 5 second(s) Initializing random number generator from current time Initializing worker threads... Threads started! [ 5s ] thds: 64 tps: 44628.01 qps: 44628.01 (r/w/o: 0.00/44628.01/0.00) lat (ms,95%): 2.48 err/s: 0.00 reconn/s: 0.00 [ 10s ] thds: 64 tps: 44780.80 qps: 44780.80 (r/w/o: 0.00/44780.80/0.00) lat (ms,95%): 2.48 err/s: 0.00 reconn/s: 0.00 [ 15s ] thds: 64 tps: 44701.32 qps: 44701.72 (r/w/o: 0.00/44701.72/0.00) lat (ms,95%): 2.48 err/s: 0.00 reconn/s: 0.00 [ 20s ] thds: 64 tps: 44801.41 qps: 44801.01 (r/w/o: 0.00/44801.01/0.00) lat (ms,95%): 2.48 err/s: 0.00 reconn/s: 0.00 SQL statistics: queries performed: read: 0 write: 894715 other: 0 total: 894715 transactions: 894715 (44637.47 per sec.) queries: 894715 (44637.47 per sec.) ignored errors: 0 (0.00 per sec.) reconnects: 0 (0.00 per sec.) General statistics: total time: 20.0421s total number of events: 894715 Latency (ms): min: 0.42 avg: 1.43 max: 203.28 95th percentile: 2.48 sum: 1277233.99 Threads fairness: events (avg/stddev): 13979.9219/71.15 execution time (avg/stddev): 19.9568/0.01 资源消耗 性能瓶颈在master的CPU上,master生成执行计划消耗了大量CPU。 master master的CPU利用率达到69% [root@node1 ~]# iostat sdc -xk 5 Linux 2.6.32-431.el6.x86_64 (node1) 2017年03月13日 _x86_64_ (32 CPU) ... avg-cpu: %user %nice %system %iowait %steal %idle 50.61 0.00 17.80 0.00 0.00 31.59 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 其中一个worker worker的CPU利用率只有3%,IO也不高。 [root@node5 ~]# iostat sdc -xk 5 Linux 2.6.32-431.el6.x86_64 (node5) 2017年03月13日 _x86_64_ (32 CPU) ... 
avg-cpu: %user %nice %system %iowait %steal %idle 2.24 0.00 0.63 0.00 0.00 97.13 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 774.00 0.00 265.80 0.00 4159.20 31.30 0.25 0.96 0.01 0.38 优化:masterless部署 既然性能瓶颈在master上,那可以多搞几个master,甚至每个worker都作为master。 这并不困难,只要把master上的元数据拷贝到每个worker上,worker就可以当master用了。 拷贝元数据 在8个worker上分别执行以下SQL: CREATE TABLE sbtest1 ( id integer NOT NULL, k integer NOT NULL DEFAULT 0, c character(120) NOT NULL DEFAULT ''::bpchar, pad character(60) NOT NULL DEFAULT ''::bpchar, PRIMARY KEY (id) ); CREATE INDEX k_1 ON sbtest1(k); copy pg_dist_node from PROGRAM 'psql "host=192.168.0.177 port=5432 dbname=dbcitus user=postgres" -Atc "copy pg_dist_node to STDOUT"'; copy pg_dist_partition from PROGRAM 'psql "host=192.168.0.177 port=5432 dbname=dbcitus user=postgres" -Atc "copy pg_dist_partition to STDOUT"'; copy pg_dist_shard from PROGRAM 'psql "host=192.168.0.177 port=5432 dbname=dbcitus user=postgres" -Atc "copy pg_dist_shard to STDOUT"'; copy pg_dist_shard_placement from PROGRAM 'psql "host=192.168.0.177 port=5432 dbname=dbcitus user=postgres" -Atc "copy pg_dist_shard_placement to STDOUT"'; copy pg_dist_colocation from PROGRAM 'psql "host=192.168.0.177 port=5432 dbname=dbcitus user=postgres" -Atc "copy pg_dist_colocation to STDOUT"'; 修改oltp_insert.lua 分别修改每个worker上的oltp_insert.lua中下面一行,使各个worker上产生的主键不容易冲突 i = sysbench.rand.unique() - 2147483648 worker2 i = sysbench.rand.unique() - 2147483648 + 1 worker3 i = sysbench.rand.unique() - 2147483648 + 2 ... worker8 i = sysbench.rand.unique() - 2147483648 + 7 准备测试脚本 在每个worker上准备测试脚本 /tmp/run_oltp_insert.sh: #!/bin/bash cd /bak/soft/sysbench-1.0.3 /bak/soft/sysbench-1.0.3/src/sysbench /bak/soft/sysbench-1.0.3/src/lua/oltp_insert.lua \ --db-driver=pgsql \ --pgsql-host=127.0.0.1 \ --pgsql-port=5432 \ --pgsql-user=postgres \ --pgsql-db=dbcitus \ --auto_inc=0 \ --time=60 \ --threads=64 \ --report-interval=5 \ run >/tmp/run_oltp_insert.log 2>&1 测试 在每个worker上同时执行insert测试 [root@node1 ~]# for i in `seq 1 8` ; do ssh 192.168.0.18$i /tmp/run_oltp_insert.sh >/dev/null 2>&1 & done [10] 27332 [11] 27333 [12] 27334 [13] 27335 [14] 27336 [15] 27337 [16] 27338 [17] 27339 测试结果 在其中一个worker上的执行结果如下,QPS 2.5w -bash-4.1$ cat /tmp/run_oltp_insert.log sysbench 1.0.3 (using bundled LuaJIT 2.1.0-beta2) Running the test with following options: Number of threads: 64 Report intermediate results every 5 second(s) Initializing random number generator from current time Initializing worker threads... Threads started! 
[ 5s ] thds: 64 tps: 25662.78 qps: 25662.78 (r/w/o: 0.00/25662.78/0.00) lat (ms,95%): 6.67 err/s: 2.60 reconn/s: 0.00 [ 10s ] thds: 64 tps: 26225.38 qps: 26225.38 (r/w/o: 0.00/26225.38/0.00) lat (ms,95%): 6.67 err/s: 7.00 reconn/s: 0.00 [ 15s ] thds: 64 tps: 25996.42 qps: 25996.42 (r/w/o: 0.00/25996.42/0.00) lat (ms,95%): 6.79 err/s: 11.40 reconn/s: 0.00 [ 20s ] thds: 64 tps: 25670.36 qps: 25670.36 (r/w/o: 0.00/25670.36/0.00) lat (ms,95%): 6.79 err/s: 18.60 reconn/s: 0.00 [ 25s ] thds: 64 tps: 25620.89 qps: 25620.89 (r/w/o: 0.00/25620.89/0.00) lat (ms,95%): 6.79 err/s: 22.60 reconn/s: 0.00 [ 30s ] thds: 64 tps: 25357.39 qps: 25357.39 (r/w/o: 0.00/25357.39/0.00) lat (ms,95%): 6.91 err/s: 33.40 reconn/s: 0.00 [ 35s ] thds: 64 tps: 25247.67 qps: 25247.67 (r/w/o: 0.00/25247.67/0.00) lat (ms,95%): 6.91 err/s: 34.60 reconn/s: 0.00 [ 40s ] thds: 64 tps: 25069.27 qps: 25069.27 (r/w/o: 0.00/25069.27/0.00) lat (ms,95%): 6.91 err/s: 41.00 reconn/s: 0.00 [ 45s ] thds: 64 tps: 24796.27 qps: 24796.27 (r/w/o: 0.00/24796.27/0.00) lat (ms,95%): 7.04 err/s: 49.40 reconn/s: 0.00 [ 50s ] thds: 64 tps: 24801.00 qps: 24801.00 (r/w/o: 0.00/24801.00/0.00) lat (ms,95%): 7.04 err/s: 47.40 reconn/s: 0.00 [ 55s ] thds: 64 tps: 24752.83 qps: 24752.83 (r/w/o: 0.00/24752.83/0.00) lat (ms,95%): 7.04 err/s: 57.20 reconn/s: 0.00 [ 60s ] thds: 64 tps: 24533.35 qps: 24533.35 (r/w/o: 0.00/24533.35/0.00) lat (ms,95%): 7.17 err/s: 63.60 reconn/s: 0.00 SQL statistics: queries performed: read: 0 write: 1518786 other: 0 total: 1518786 transactions: 1518786 (25277.24 per sec.) queries: 1518786 (25277.24 per sec.) ignored errors: 1944 (32.35 per sec.) reconnects: 0 (0.00 per sec.) General statistics: total time: 60.0829s total number of events: 1518786 Latency (ms): min: 0.47 avg: 2.53 max: 1015.04 95th percentile: 6.91 sum: 3835098.18 Threads fairness: events (avg/stddev): 23731.0312/213.31 execution time (avg/stddev): 59.9234/0.02 系统负载 CPU消耗了66% -bash-4.1$ iostat sdc -xk 5 Linux 2.6.32-431.el6.x86_64 (node5) 2017年03月13日 _x86_64_ (32 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 47.09 0.00 18.35 0.47 0.00 34.10 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 4195.60 0.00 19787.60 0.00 95932.80 9.70 0.98 0.05 0.02 42.54 汇总结果 8台worker的总qps为214362 [root@node1 ~]# for i in `seq 1 8` ; do ssh 192.168.0.18$i grep queries: /tmp/run_oltp_insert.log ; done queries: 1518786 (25277.24 per sec.) queries: 1587323 (26412.68 per sec.) queries: 1700562 (28305.06 per sec.) queries: 1631516 (27151.82 per sec.) queries: 1615778 (26885.48 per sec.) queries: 1649236 (27449.03 per sec.) queries: 1621940 (26993.20 per sec.) queries: 1554917 (25890.71 per sec.) 数据查询 在master上查询插入的记录数。 dbcitus=# select count(1) from sbtest1; count ---------- 12880058 (1 行记录) 时间:73.197 ms 查询是在128个分片上并行执行的,所以速度很快。 总结 citus的执行计划生成影响了数据插入的速度,通过Masterless部署可提升到20w/s以上。 进一步提升插入性能可以从citus源码入手,根据分片列值做快速SQL分发,避免在master上解析SQL,之前在另一个场景上做过原型,性能可提升10倍以上。 极致的做法是绕过master直接插入数据到worker上的分片表,还可以利用copy或批更新。 参考 Scaling Out Data Ingestion Real-time Inserts:0-50k/s Real-time Updates:0-50k/s Bulk Copy:100-200k/s Masterless Citus:50k/s-500k/s
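针对总结中"绕过master直接插入数据到worker上的分片表"的思路,下面是一个简单示意(假设所用 Citus 版本提供 get_shard_id_for_distribution_column 元数据函数;分片列值 42 与字段值均为示例):

# 1. 在master上根据分片列值查出分片ID及其所在worker
shard_id=$(psql -h 192.168.0.177 -d dbcitus -U postgres -Atc \
  "SELECT get_shard_id_for_distribution_column('sbtest1', 42)")
worker=$(psql -h 192.168.0.177 -d dbcitus -U postgres -Atc \
  "SELECT nodename FROM pg_dist_shard_placement WHERE shardid = ${shard_id} LIMIT 1")

# 2. 直连该worker,向分片表(逻辑表名_分片ID)写入数据
psql -h "$worker" -d dbcitus -U postgres -c \
  "INSERT INTO sbtest1_${shard_id} (id, k, c, pad) VALUES (42, 1, 'c-val', 'pad-val')"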
2017年1月31日GitLab.com发生了严重的生产故障,导致宕机18小时,永久丢失6小时数据。 事后官方对故障原因作出了详细的解释,如下 误删 300G,GitLab 官方对删库事故的事后分析 这个事件,作为反例非常有借鉴意义。 通用的启示: 1. 定时检查备份的有效性。 2. 在成功创建一个更新的备份前,禁止删除当前最新的备份。 3. 危险操作包装成脚本,在脚本里执行危险操作前做好所有的必要检查。 4. 告警机制要能在无告警的情况下证明告警检查在正常工作。 PostgreSQL运维的启示: 要在主备部署上避免由于主库删除尚未拷贝到备库的WAL导致流复制中断。具体可以采用下面几个办法 1. 创建并保留WAL归档 高负载的系统,归档的WAL会占用较大空间,代价比较高 2. 设置比较大的wal_keep_segments,在主库上保留足够的WAL wal_keep_segments建议至少设1000(16GB),还需要考虑系统负载和库大小,比如参考以下的公式(不要问为什么,拍脑袋想的)。 max(16GB, 2*shared_buffers, 0.1*数据总大小)/16MB 3. 使用slot复制,未被备库取走的WAL将一直保存在主库上。 备库宕机时,要及时在主库删除slot,否则主库的磁盘会被WAL撑爆,因此应辅助以磁盘的容量监控(可参考下面的监控示意)。
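下面是第3点中提到的磁盘/WAL监控的一个查询示意(函数名以 PostgreSQL 9.6 为准,PG 10 及以后 pg_xlog_* 系列函数改名为 pg_wal_*):

# 查看各复制槽是否活跃,以及主库为其保留了多少尚未被备库取走的WAL
psql -U postgres -Atc "
SELECT slot_name, active,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"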
core dump简介 core dump就是在进程crash时把包括内存在内的现场保留下来,以备故障分析。 但有时候,进程crash了却没有输出core,因为有一些因素会影响输出还是不输出core文件。 常见的一个coredump开关是ulimit -c,它限制允许输出的coredump文件的最大size,如果要输出的core文件大小超过这个值将不输出core文件。 ulimit -c的输出为0,代表关闭core dump输出。 [root@srdsdevapp69 ~]# ulimit -c 0 设置ulimit -c unlimited,将不对core文件大小做限制 [root@srdsdevapp69 ~]# ulimit -c unlimited [root@srdsdevapp69 ~]# ulimit -c unlimited 这样设置的ulimit值只在当前会话中有效,重开一个终端起进程是不受影响的。 ulimit -c只是众多影响core输出因素中的一个,其它因素可以参考man。 $ man core ... There are various circumstances in which a core dump file is not produced: * The process does not have permission to write the core file. (By default the core file is called core, and is created in the current working directory. See below for details on naming.) Writing the core file will fail if the directory in which it is to be created is non-writable, or if a file with the same name exists and is not writable or is not a regular file (e.g., it is a directory or a symbolic link). * A (writable, regular) file with the same name as would be used for the core dump already exists, but there is more than one hard link to that file. * The file system where the core dump file would be created is full; or has run out of inodes; or is mounted read-only; or the user has reached their quota for the file system. * The directory in which the core dump file is to be created does not exist. * The RLIMIT_CORE (core file size) or RLIMIT_FSIZE (file size) resource limits for the process are set to zero; see getrlimit(2) and the documentation of the shell’s ulimit command (limit in csh(1)). * The binary being executed by the process does not have read permission enabled. * The process is executing a set-user-ID (set-group-ID) program that is owned by a user (group) other than the real user (group) ID of the process. (However, see the description of the prctl(2) PR_SET_DUMPABLE operation, and the description of the /proc/sys/fs/suid_dumpable file in proc(5).) 其实还漏了一个,进程可以捕获那些本来会出core的信号,然后自己来处理,比如MySQL就是这么干的。 abrtd RHEL/CentOS下默认开启abrtd进行故障现场记录(包括生成coredump)和故障报告 此时abrtd进程是启动的, [root@srdsdevapp69 ~]# service abrtd status abrtd (pid 8711) is running... core文件的生成位置被重定向到了abrt-hook-ccpp [root@srdsdevapp69 ~]# cat /proc/sys/kernel/core_pattern |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e 测试coredump 生成以下产生coredump的程序,并执行。 testcoredump.c: int main() { return 1/0; } 编译并执行 $gcc testcoredump.c -o testcoredump $./testcoredump 查看系统日志,中途临时产生了core文件,但最后又被删掉了。 $tail -f /var/log/messages ... Dec 8 09:54:44 srdsdevapp69 kernel: testcoredump[4028] trap divide error ip:400489 sp:7fff5a54b200 error:0 in testcoredump[400000+1000] Dec 8 09:54:44 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-09:54:44-4028' creation detected Dec 8 09:54:44 srdsdevapp69 abrt[4029]: Saved core dump of pid 4028 (/root/testcoredump) to /var/spool/abrt/ccpp-2016-12-08-09:54:44-4028 (184320 bytes) Dec 8 09:54:44 srdsdevapp69 abrtd: Executable '/root/testcoredump' doesn't belong to any package Dec 8 09:54:44 srdsdevapp69 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2016-12-08-09:54:44-4028' exited with 1 Dec 8 09:54:44 srdsdevapp69 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2016-12-08-09:54:44-4028, deleting abrtd默认只保留软件包里的程序产生的core文件,修改下面的参数可以让其记录所有程序的core文件。 $vi /etc/abrt/abrt-action-save-package-data.conf ... 
ProcessUnpackaged = yes 再执行一次测试程序就好生成core文件了 Dec 8 10:04:30 srdsdevapp69 kernel: testcoredump[9189] trap divide error ip:400489 sp:7fff99973b30 error:0 in testcoredump[400000+1000] Dec 8 10:04:30 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-10:04:30-9189' creation detected Dec 8 10:04:30 srdsdevapp69 abrt[9190]: Saved core dump of pid 9189 (/root/testcoredump) to /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189 (184320 bytes) Dec 8 10:04:31 srdsdevapp69 kernel: Bridge firewalling registered Dec 8 10:04:44 srdsdevapp69 abrtd: Sending an email... Dec 8 10:04:44 srdsdevapp69 abrtd: Email was sent to: root@localhost Dec 8 10:04:44 srdsdevapp69 abrtd: New problem directory /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189, processing Dec 8 10:04:44 srdsdevapp69 abrtd: No actions are found for event 'notify' abrtd可以识别出是重复问题,并能够去重,这可以防止core文件生成的过多把磁盘用光。 Dec 8 10:18:35 srdsdevapp69 kernel: testcoredump[16598] trap divide error ip:400489 sp:7fff26cc9f50 error:0 in testcoredump[400000+1000] Dec 8 10:18:35 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-10:18:35-16598' creation detected Dec 8 10:18:35 srdsdevapp69 abrt[16599]: Saved core dump of pid 16598 (/root/testcoredump) to /var/spool/abrt/ccpp-2016-12-08-10:18:35-16598 (184320 bytes) Dec 8 10:18:45 srdsdevapp69 abrtd: Sending an email... Dec 8 10:18:45 srdsdevapp69 abrtd: Email was sent to: root@localhost Dec 8 10:18:45 srdsdevapp69 abrtd: Duplicate: UUID Dec 8 10:18:45 srdsdevapp69 abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189 Dec 8 10:18:45 srdsdevapp69 abrtd: Problem directory is a duplicate of /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189 Dec 8 10:18:45 srdsdevapp69 abrtd: Deleting problem directory ccpp-2016-12-08-10:18:35-16598 (dup of ccpp-2016-12-08-10:04:30-9189) Dec 8 10:18:45 srdsdevapp69 abrtd: No actions are found for event 'notify_dup' abrtd对crash报告的大小(主要是core文件)有限制(参数MaxCrashReportsSize设置),超过了也不会生成core文件,相应的日志如下。 Dec 8 14:10:32 srdsdevapp69 abrt[10548]: Saved core dump of pid 10527 (/usr/local/Percona-Server-5.6.29-rel76.2-Linux.x86_64.ssl101/bin/mysqld) to /var/spool/abrt/ccpp-2016-12-08-14:10:00-10527 (10513362944 bytes) Dec 8 14:10:32 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-14:10:00-10527' creation detected Dec 8 14:10:32 srdsdevapp69 abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2016-12-08-14:05:43-8080' Dec 8 14:10:32 srdsdevapp69 abrt[10548]: /var/spool/abrt is 25854515653 bytes (more than 1279MiB), deleting 'ccpp-2016-12-08-14:05:43-8080' Dec 8 14:10:32 srdsdevapp69 abrt[10548]: Lock file '/var/spool/abrt/ccpp-2016-12-08-14:05:43-8080/.lock' is locked by process 7893 Dec 8 14:10:32 srdsdevapp69 abrt[10548]: '/var/spool/abrt/ccpp-2016-12-08-14:05:43-8080' does not exist Dec 8 14:10:41 srdsdevapp69 abrtd: Sending an email... 
Dec 8 14:10:41 srdsdevapp69 abrtd: Email was sent to: root@localhost Dec 8 14:10:41 srdsdevapp69 abrtd: New problem directory /var/spool/abrt/ccpp-2016-12-08-14:10:00-10527, processing Dec 8 14:10:41 srdsdevapp69 abrtd: No actions are found for event 'notify' abrtd如何工作 abrtd是监控/var/spool/abrt/目录触发的,做个copy操作也会触发abrtd。 [root@srdsdevapp69 abrt]# cp -rf ccpp-2016-12-08-10:04:30-9189 ccpp-2016-12-08-10:04:30-91891 下面是产生的系统日志: Dec 8 10:35:33 srdsdevapp69 abrtd: Directory 'ccpp-2016-12-08-10:04:30-91891' creation detected Dec 8 10:35:33 srdsdevapp69 abrtd: Duplicate: UUID Dec 8 10:35:33 srdsdevapp69 abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189 Dec 8 10:35:33 srdsdevapp69 abrtd: Problem directory is a duplicate of /var/spool/abrt/ccpp-2016-12-08-10:04:30-9189 Dec 8 10:35:33 srdsdevapp69 abrtd: Deleting problem directory ccpp-2016-12-08-10:04:30-91891 (dup of ccpp-2016-12-08-10:04:30-9189) Dec 8 10:35:33 srdsdevapp69 abrtd: No actions are found for event 'notify_dup' 如果修改core生成目录,不使用abrt-hook-ccpp回调程序等于禁用了abrtd echo "/data/core-%e-%p-%t">/proc/sys/kernel/core_pattern 再发生coredump时/var/log/messages中没有abrtd相关的记录 Dec 8 10:30:24 srdsdevapp69 kernel: testcoredump[23050] trap divide error ip:400489 sp:7fff9f01dfb0 error:0 in testcoredump[400000+1000] 此时core文件会被直接生成到/proc/sys/kernel/core_pattern指定的位置 /data/core-testcoredump-23050-1481164224 由于/proc/sys/kernel/core_pattern中未使用abrt-hook-ccpp回调程序,检查abrt-ccpp服务状态也会相应的返回服务未启动。 [root@srdsdevapp69 ~]# service abrt-ccpp status [root@srdsdevapp69 ~]# echo $? 3 恢复/proc/sys/kernel/core_pattern之后,abrt-ccpp服务变回正常 [root@srdsdevapp69 ~]# echo "|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e">/proc/sys/kernel/core_pattern [root@srdsdevapp69 ~]# service abrt-ccpp status [root@srdsdevapp69 ~]# echo $? 0 如果停止abrtd /proc/sys/kernel/core_pattern为"|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e" 会在生成当前目录生成core文件 Dec 8 10:46:21 srdsdevapp69 kernel: testcoredump[31364] trap divide error ip:400489 sp:7fff15d6f450 error:0 in testcoredump[400000+1000] Dec 8 10:46:21 srdsdevapp69 abrt[31365]: abrtd is not running. If it crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core' Dec 8 10:46:21 srdsdevapp69 abrt[31365]: Saved core dump of pid 31364 to /root/core.31364 (184320 bytes) 开启MySQL的coredump MySQL的服务进程mysqld会自己捕获可能引起crash的信号,默认会输出调用栈后异常退出不会生成core文件。 2016-12-08 11:14:51 14034 [Note] /usr/local/mysql/bin/mysqld: ready for connections. Version: '5.6.29-76.2-debug-log' socket: '/mysqlrds/data/mysql.sock' port: 3306 Source distribution 03:18:43 UTC - mysqld got signal 8 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. Please help us make Percona Server better by reporting any bugs at http://bugs.percona.com/ key_buffer_size=33554432 read_buffer_size=2097152 max_used_connections=2 max_threads=100001 thread_count=1 connection_count=1 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 307242932 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. Thread pointer: 0x2427ca20 Attempting backtrace. You can use the following information to find out where mysqld died. 
If you see no messages after this, something went terribly wrong... stack_bottom = 7fd53066bca8 thread_stack 0x40000 /usr/local/mysql/bin/mysqld(my_print_stacktrace+0x35)[0xaf23c9] /usr/local/mysql/bin/mysqld(handle_fatal_signal+0x42e)[0x74d42a] /lib64/libpthread.so.0[0x3805a0f7e0] /usr/local/mysql/bin/mysqld(_Z19mysql_rename_tablesP3THDP10TABLE_LISTb+0x6c)[0x82fa64] /usr/local/mysql/bin/mysqld(_Z21mysql_execute_commandP3THD+0x2aab)[0x8079e9] /usr/local/mysql/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x588)[0x810ce3] /usr/local/mysql/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xd8b)[0x80228a] /usr/local/mysql/bin/mysqld(_Z10do_commandP3THD+0x3bd)[0x801087] /usr/local/mysql/bin/mysqld(_Z26threadpool_process_requestP3THD+0x71)[0x8ec721] /usr/local/mysql/bin/mysqld[0x8ef363] /usr/local/mysql/bin/mysqld[0x8ef5a0] /usr/local/mysql/bin/mysqld(pfs_spawn_thread+0x159)[0xe14049] /lib64/libpthread.so.0[0x3805a07aa1] /lib64/libc.so.6(clone+0x6d)[0x32286e893d] Trying to get some variables. Some pointers may be invalid and cause the dump to abort. Query (7fd508004d80): is an invalid pointer Connection ID (thread ID): 1 Status: NOT_KILLED You may download the Percona Server operations manual by visiting http://www.percona.com/software/percona-server/. You may find information in the manual which will help you identify the cause of the crash. 要使其产生core文件必须打开--core-file开关 mysqld --defaults-file=/home/mysql/etc/my.cnf --core-file & 也可以将这个参数加入到my.cnf文件中 core_file core文件的大小 关于core文件的大小有个奇怪的现象,其实际占用的磁盘空间可能远小于文件大小。 比如下面的core文件,文件大小10GB,但实际占用磁盘只有2GB(1940984 * 512B)。 [root@srdsdevapp69 ccpp-2016-12-08-14:10:00-10527]# stat coredump File: `coredump' Size: 10513362944 Blocks: 1940984 IO Block: 4096 regular file Device: fd03h/64771d Inode: 14990 Links: 1 Access: (0640/-rw-r-----) Uid: ( 173/ abrt) Gid: ( 512/ mysql) Access: 2016-12-08 14:10:41.886280668 +0800 Modify: 2016-12-08 14:10:27.704523443 +0800 Change: 2016-12-08 14:10:27.704523443 +0800 这是由于系统在生成core文件时,skip了部分全零的块,即文件中有hole(用dd的seek可以模拟这个现象)。不管是在/proc/sys/kernel/core_pattern中设置abrt-hook-ccpp程序还是直接设置文件目录,都是这个现象。这其实是一个不错的优化,节省了磁盘空间也加快了core文件生成速度。
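文中提到可以用 dd 的 seek 模拟这种带"洞"的稀疏文件,验证方式示意如下(文件路径和大小均为随意取值):

# 跳过约1GB的偏移再写1MB,中间形成不占磁盘块的"洞"
dd if=/dev/zero of=/tmp/sparse_demo bs=1M count=1 seek=1024
ls -lh /tmp/sparse_demo   # 显示的是文件逻辑大小(约1GB多)
du -h  /tmp/sparse_demo   # 实际占用的磁盘空间只有约1MB
stat   /tmp/sparse_demo   # Blocks数×512B即实际占用,与core文件的现象一致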
基于Pacemaker的PostgreSQL一主多从读负载均衡集群搭建 简介 PostgreSQL的HA方案有很多种,本文演示基于Pacemaker的PostgreSQL一主多从读负载均衡集群搭建。 搭建过程并不是使用原始的Pacemaker pgsql RA脚本,而使用以下我修改和包装的脚本集pha4pgsql。 https://github.com/ChenHuajun/pha4pgsql 目标集群特性 秒级自动failover failover零数据丢失(防脑裂) 支持在线主从切换 支持读写分离 支持读负载均衡 支持动态增加和删除只读节点 环境 OS:CentOS 7.3 节点1:node1(192.168.0.231) 节点2:node2(192.168.0.232) 节点2:node3(192.168.0.233) writer_vip:192.168.0.236 reader_vip:192.168.0.237 依赖软件 pacemaker corosync pcs ipvsadm 安装与配置 环境准备 所有节点设置时钟同步 cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime ntpdate time.windows.com && hwclock -w 所有节点设置独立的主机名(node1,node2,node3) hostnamectl set-hostname node1 设置对所有节点的域名解析 $ vi /etc/hosts ... 192.168.0.231 node1 192.168.0.232 node2 192.168.0.233 node3 在所有节点上禁用SELINUX $ setenforce 0 $ vi /etc/selinux/config ... SELINUX=disabled 在所有节点上禁用防火墙 systemctl disable firewalld.service systemctl stop firewalld.service 如果开启防火墙需要开放postgres,pcsd和corosync的端口。参考CentOS 7防火墙设置示例 postgres:5432/tcp pcsd:2224/tcp corosync:5405/udp 安装和配置Pacemaker+Corosync集群软件 安装Pacemaker和Corosync及相关软件包 在所有节点执行: yum install -y pacemaker corosync pcs ipvsadm 启用pcsd服务 在所有节点执行: systemctl start pcsd.service systemctl enable pcsd.service 设置hacluster用户密码 在所有节点执行: echo hacluster | passwd hacluster --stdin 集群认证 在任何一个节点上执行: pcs cluster auth -u hacluster -p hacluster node1 node2 node3 同步配置 在任何一个节点上执行: pcs cluster setup --last_man_standing=1 --name pgcluster node1 node2 node3 启动集群 在任何一个节点上执行: pcs cluster start --all 安装和配置PostgreSQL 安装PostgreSQL 安装9.2以上的PostgreSQL,本文通过PostgreSQL官方yum源安装CentOS 7.3对应的PostgreSQL 9.6 https://yum.postgresql.org/ 在所有节点执行: yum install -y https://yum.postgresql.org/9.6/redhat/rhel-7.3-x86_64/pgdg-centos96-9.6-3.noarch.rpm yum install -y postgresql96 postgresql96-contrib postgresql96-libs postgresql96-server postgresql96-devel ln -sf /usr/pgsql-9.6 /usr/pgsql echo 'export PATH=/usr/pgsql/bin:$PATH' >>~postgres/.bash_profile 创建Master数据库 在node1节点执行: 创建数据目录 mkdir -p /pgsql/data chown -R postgres:postgres /pgsql/ chmod 0700 /pgsql/data 初始化db su - postgres initdb -D /pgsql/data/ 修改postgresql.conf listen_addresses = '*' wal_level = hot_standby wal_log_hints = on synchronous_commit = on max_wal_senders=5 wal_keep_segments = 32 hot_standby = on wal_sender_timeout = 5000 wal_receiver_status_interval = 2 max_standby_streaming_delay = -1 max_standby_archive_delay = -1 restart_after_crash = off hot_standby_feedback = on 注:设置"wal_log_hints = on"可以使用pg_rewind修复旧Master。 修改pg_hba.conf local all all trust host all all 192.168.0.0/24 md5 host replication all 192.168.0.0/24 md5 启动postgres pg_ctl -D /pgsql/data/ start 创建复制用户 createuser --login --replication replication -P -s 注:加上“-s”选项可支持pg_rewind。 创建Slave数据库 在node2和node3节点执行: 创建数据目录 mkdir -p /pgsql/data chown -R postgres:postgres /pgsql/ chmod 0700 /pgsql/data 创建基础备份 su - postgres pg_basebackup -h node1 -U replication -D /pgsql/data/ -X stream -P 停止PostgreSQL服务 在node1上执行: pg_ctl -D /pgsql/data/ stop 安装和配置pha4pgsql 在任意一个节点上执行: 下载pha4pgsql cd /opt git clone git://github.com/Chenhuajun/pha4pgsql.git 拷贝config.ini cd /opt/pha4pgsql cp template/config_muti_with_lvs.ini.sample config.ini 注:如果不需要配置基于LVS的负载均衡,可使用模板config_muti.ini.sample 修改config.ini pcs_template=muti_with_lvs.pcs.template OCF_ROOT=/usr/lib/ocf RESOURCE_LIST="msPostgresql vip-master vip-slave" pha4pgsql_dir=/opt/pha4pgsql writer_vip=192.168.0.236 reader_vip=192.168.0.237 node1=node1 node2=node2 node3=node3 othernodes="" vip_nic=ens37 vip_cidr_netmask=24 pgsql_pgctl=/usr/pgsql/bin/pg_ctl pgsql_psql=/usr/pgsql/bin/psql pgsql_pgdata=/pgsql/data pgsql_pgport=5432 
pgsql_restore_command="" pgsql_rep_mode=sync pgsql_repuser=replication pgsql_reppassord=replication 安装pha4pgsql sh install.sh ./setup.sh 执行install.sh使用了scp拷贝文件,中途会多次要求输入其它节点的root账号。 install.sh执行会生成Pacemaker的配置脚本/opt/pha4pgsql/config.pcs,可以根据情况对其中的参数进行调优后再执行setup.sh。 设置环境变量 export PATH=/opt/pha4pgsql/bin:$PATH echo 'export PATH=/opt/pha4pgsql/bin:$PATH' >>/root/.bash_profile 启动集群 cls_start 确认集群状态 cls_status cls_status的输出如下: [root@node1 pha4pgsql]# cls_status Stack: corosync Current DC: node1 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum Last updated: Wed Jan 11 00:53:58 2017 Last change: Wed Jan 11 00:45:54 2017 by root via crm_attribute on node1 3 nodes and 9 resources configured Online: [ node1 node2 node3 ] Full list of resources: vip-master (ocf::heartbeat:IPaddr2): Started node1 vip-slave (ocf::heartbeat:IPaddr2): Started node2 Master/Slave Set: msPostgresql [pgsql] Masters: [ node1 ] Slaves: [ node2 node3 ] lvsdr (ocf::heartbeat:lvsdr): Started node2 Clone Set: lvsdr-realsvr-clone [lvsdr-realsvr] Started: [ node2 node3 ] Stopped: [ node1 ] Node Attributes: * Node node1: + master-pgsql : 1000 + pgsql-data-status : LATEST + pgsql-master-baseline : 00000000050001B0 + pgsql-status : PRI * Node node2: + master-pgsql : 100 + pgsql-data-status : STREAMING|SYNC + pgsql-status : HS:sync * Node node3: + master-pgsql : -INFINITY + pgsql-data-status : STREAMING|ASYNC + pgsql-status : HS:async Migration Summary: * Node node2: * Node node3: * Node node1: pgsql_REPL_INFO:node1|1|00000000050001B0 检查集群的健康状态。完全健康的集群需要满足以下条件: msPostgresql在每个节点上都已启动 在其中一个节点上msPostgresql处于Master状态,其它的为Salve状态 Salve节点的data-status值是以下中的一个 STREAMING|SYNC 同步复制Slave STREAMING|POTENTIAL 候选同步复制Slave STREAMING|ASYNC 异步复制Slave pgsql_REPL_INFO的3段内容分别指当前master,上次提升前的时间线和xlog位置。 pgsql_REPL_INFO:node1|1|00000000050001B0 LVS配置在node2上 [root@node2 ~]# ipvsadm -L IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node2:postgres rr -> node2:postgres Route 1 0 0 -> node3:postgres Route 1 0 0 故障测试 Master故障 停止Master上的网卡模拟故障 [root@node1 pha4pgsql]# ifconfig ens37 down 检查集群状态 Pacemaker已经将Master和写VIP切换到node2上 [root@node2 ~]# cls_status resource msPostgresql is NOT running Stack: corosync Current DC: node2 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum Last updated: Wed Jan 11 01:25:08 2017 Last change: Wed Jan 11 01:21:26 2017 by root via crm_attribute on node2 3 nodes and 9 resources configured Online: [ node2 node3 ] OFFLINE: [ node1 ] Full list of resources: vip-master (ocf::heartbeat:IPaddr2): Started node2 vip-slave (ocf::heartbeat:IPaddr2): Started node3 Master/Slave Set: msPostgresql [pgsql] Masters: [ node2 ] Slaves: [ node3 ] Stopped: [ node1 ] lvsdr (ocf::heartbeat:lvsdr): Started node3 Clone Set: lvsdr-realsvr-clone [lvsdr-realsvr] Started: [ node3 ] Stopped: [ node1 node2 ] Node Attributes: * Node node2: + master-pgsql : 1000 + pgsql-data-status : LATEST + pgsql-master-baseline : 00000000050008E0 + pgsql-status : PRI * Node node3: + master-pgsql : 100 + pgsql-data-status : STREAMING|SYNC + pgsql-status : HS:sync Migration Summary: * Node node2: * Node node3: pgsql_REPL_INFO:node2|2|00000000050008E0 LVS和读VIP被移到了node3上 [root@node3 ~]# ipvsadm -L IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node3:postgres rr -> node3:postgres Route 1 0 0 修复旧Master的网卡 在旧Master node1上,postgres进程还在(注1)。但是由于配置的是同步复制,数据无法写入不会导致脑裂。 
[root@node1 pha4pgsql]# ps -ef|grep postgres root 20295 2269 0 01:35 pts/0 00:00:00 grep --color=auto postgres postgres 20556 1 0 00:45 ? 00:00:01 /usr/pgsql-9.6/bin/postgres -D /pgsql/data -c config_file=/pgsql/data/postgresql.conf postgres 20566 20556 0 00:45 ? 00:00:00 postgres: logger process postgres 20574 20556 0 00:45 ? 00:00:00 postgres: checkpointer process postgres 20575 20556 0 00:45 ? 00:00:00 postgres: writer process postgres 20576 20556 0 00:45 ? 00:00:00 postgres: stats collector process postgres 22390 20556 0 00:45 ? 00:00:00 postgres: wal writer process postgres 22391 20556 0 00:45 ? 00:00:00 postgres: autovacuum launcher process 启动网卡后,postgres进程被停止 [root@node1 pha4pgsql]# ifconfig ens37 up [root@node1 pha4pgsql]# ps -ef|grep postgres root 21360 2269 0 01:36 pts/0 00:00:00 grep --color=auto postgres [root@node1 pha4pgsql]# cls_status resource msPostgresql is NOT running Stack: corosync Current DC: node2 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum Last updated: Wed Jan 11 01:36:20 2017 Last change: Wed Jan 11 01:36:00 2017 by hacluster via crmd on node2 3 nodes and 9 resources configured Online: [ node1 node2 node3 ] Full list of resources: vip-master (ocf::heartbeat:IPaddr2): Started node2 vip-slave (ocf::heartbeat:IPaddr2): Started node3 Master/Slave Set: msPostgresql [pgsql] Masters: [ node2 ] Slaves: [ node3 ] Stopped: [ node1 ] lvsdr (ocf::heartbeat:lvsdr): Started node3 Clone Set: lvsdr-realsvr-clone [lvsdr-realsvr] Started: [ node3 ] Stopped: [ node1 node2 ] Node Attributes: * Node node1: + master-pgsql : -INFINITY + pgsql-data-status : DISCONNECT + pgsql-status : STOP * Node node2: + master-pgsql : 1000 + pgsql-data-status : LATEST + pgsql-master-baseline : 00000000050008E0 + pgsql-status : PRI * Node node3: + master-pgsql : 100 + pgsql-data-status : STREAMING|SYNC + pgsql-status : HS:sync Migration Summary: * Node node2: * Node node3: * Node node1: pgsql: migration-threshold=3 fail-count=1000000 last-failure='Wed Jan 11 01:36:08 2017' Failed Actions: * pgsql_start_0 on node1 'unknown error' (1): call=278, status=complete, exitreason='The master's timeline forked off current database system timeline 2 before latest checkpoint location 0000000005000B80, REPL_INF', last-rc-change='Wed Jan 11 01:36:07 2017', queued=0ms, exec=745ms pgsql_REPL_INFO:node2|2|00000000050008E0 注1:这是通过ifconfig ens37 down停止网卡模拟故障的特殊现象(或者说是corosync的bug),Pacemkaer的日志中不停的输出以下警告。在实际的物理机宕机或网卡故障时,故障节点会由于失去quorum,postgres进程会被Pacemaker主动停止。 [43260] node3 corosyncwarning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. 修复旧Master(node1)并作为Slave加入集群 通过pg_rewind修复旧Master [root@node1 pha4pgsql]# cls_repair_by_pg_rewind resource msPostgresql is NOT running resource msPostgresql is NOT running resource msPostgresql is NOT running connected to server servers diverged at WAL position 0/50008E0 on timeline 2 rewinding from last common checkpoint at 0/5000838 on timeline 2 reading source file list reading target file list reading WAL in target need to copy 99 MB (total source directory size is 117 MB) 102359/102359 kB (100%) copied creating backup label and updating control file syncing target data directory Done! pg_rewind complete! resource msPostgresql is NOT running resource msPostgresql is NOT running Waiting for 1 replies from the CRMd. OK wait for recovery complete ..... 
slave recovery of node1 successed 检查集群状态 [root@node1 pha4pgsql]# cls_status Stack: corosync Current DC: node2 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum Last updated: Wed Jan 11 01:39:30 2017 Last change: Wed Jan 11 01:37:35 2017 by root via crm_attribute on node2 3 nodes and 9 resources configured Online: [ node1 node2 node3 ] Full list of resources: vip-master (ocf::heartbeat:IPaddr2): Started node2 vip-slave (ocf::heartbeat:IPaddr2): Started node3 Master/Slave Set: msPostgresql [pgsql] Masters: [ node2 ] Slaves: [ node1 node3 ] lvsdr (ocf::heartbeat:lvsdr): Started node3 Clone Set: lvsdr-realsvr-clone [lvsdr-realsvr] Started: [ node1 node3 ] Stopped: [ node2 ] Node Attributes: * Node node1: + master-pgsql : -INFINITY + pgsql-data-status : STREAMING|ASYNC + pgsql-status : HS:async * Node node2: + master-pgsql : 1000 + pgsql-data-status : LATEST + pgsql-master-baseline : 00000000050008E0 + pgsql-status : PRI * Node node3: + master-pgsql : 100 + pgsql-data-status : STREAMING|SYNC + pgsql-status : HS:sync + pgsql-xlog-loc : 000000000501F118 Migration Summary: * Node node2: * Node node3: * Node node1: pgsql_REPL_INFO:node2|2|00000000050008E0 Slave故障 LVS配置在node3上,2个real server [root@node3 ~]# ipvsadm -L IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node3:postgres rr -> node1:postgres Route 1 0 0 -> node3:postgres Route 1 0 0 在其中一个Slave(node1)上停止网卡 [root@node1 pha4pgsql]# ifconfig ens37 down Pacemaker已自动修改LVS的real server配置 [root@node3 ~]# ipvsadm -L IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node3:postgres rr -> node3:postgres Route 1 0 0 添加Slave扩容读负载均衡 目前配置的是1主2从集群,2个Slave通过读VIP+LVS做读负载均衡,如果读负载很高可以添加额外的Slave扩展读性能。 把更多的Slave直接添加到Pacemaker集群中可以达到这个目的,但过多的节点数会增加Pacemaker+Corosync集群的复杂性和通信负担(Corosync的通信是一个环路,节点数越多,时延越大)。所以不把额外的Slave加入Pacemaker集群,仅仅加到LVS的real server中,并让lvsdr监视Slave的健康状况,动态更新LVS的real server列表。方法如下: 创建额外的Slave数据库 准备第4台机器node4(192.168.0.234),并在该机器上执行以下命令创建新的Slave 禁用SELINUX $ setenforce 0 $ vi /etc/selinux/config ... SELINUX=disabled 禁用防火墙 systemctl disable firewalld.service systemctl stop firewalld.service 安装PostgreSQL yum install -y https://yum.postgresql.org/9.6/redhat/rhel-7.3-x86_64/pgdg-centos96-9.6-3.noarch.rpm yum install -y postgresql96 postgresql96-contrib postgresql96-libs postgresql96-server postgresql96-devel ln -sf /usr/pgsql-9.6 /usr/pgsql echo 'export PATH=/usr/pgsql/bin:$PATH' >>~postgres/.bash_profile 创建数据目录 mkdir -p /pgsql/data chown -R postgres:postgres /pgsql/ chmod 0700 /pgsql/data 创建Salve备份 从当前的Master节点(即写VIP 192.168.0.236)拉取备份创建Slave su - postgres pg_basebackup -h 192.168.0.236 -U replication -D /pgsql/data/ -X stream -P 编辑postgresql.conf 将postgresql.conf中的下面一行删掉 ¥vi /pgsql/data/postgresql.conf ... #include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA 编辑recovery.conf $vi /pgsql/data/recovery.conf standby_mode = 'on' primary_conninfo = 'host=192.168.0.236 port=5432 application_name=192.168.0.234 user=replication password=replication keepalives_idle=60 keepalives_interval=5 keepalives_count=5' restore_command = '' recovery_target_timeline = 'latest' 上面的application_name设置为本节点的IP地址192.168.0.234 启动Slave pg_ctl -D /pgsql/data/ start 在Master上检查postgres wal sender进程,新建的Slave(192.168.0.234)已经和Master建立了流复制。 [root@node1 pha4pgsql]# ps -ef|grep '[w]al sender' postgres 32387 111175 0 12:15 ? 
00:00:00 postgres: wal sender process replication 192.168.0.234(47894) streaming 0/7000220 postgres 116675 111175 0 12:01 ? 00:00:00 postgres: wal sender process replication 192.168.0.233(33652) streaming 0/7000220 postgres 117079 111175 0 12:01 ? 00:00:00 postgres: wal sender process replication 192.168.0.232(40088) streaming 0/7000220 配置LVS real server 设置系统参数 echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce 在lo网卡上添加读VIP ip a add 192.168.0.237/32 dev lo:0 将新建的Slave加入到LVS中 现在LVS的配置中还没有把新的Slave作为real server加入 [root@node3 ~]# ipvsadm IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node3:postgres rr -> node2:postgres Route 1 0 0 -> node3:postgres Route 1 0 0 在Pacemaker集群的任意一个节点(node1,node2或node3)上,修改lvsdr RA的配置,加入新的real server。 [root@node2 ~]# pcs resource update lvsdr realserver_get_real_servers_script="/opt/pha4pgsql/tools/get_active_slaves /usr/pgsql/bin/psql \"host=192.168.0.236 port=5432 dbname=postgres user=replication password=replication connect_timeout=5\"" 设置realserver_get_real_servers_script参数后,lvsdr会通过脚本获取LVS的real server列表,这里的get_active_slaves会通过写VIP连接到Master节点获取所有以连接到Master的Slave的application_name作为real server。设置后新的Slave 192.168.0.234已经被加入到real server 列表中了。 [root@node2 ~]# ipvsadm IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node2:postgres rr -> node2:postgres Route 1 0 0 -> node3:postgres Route 1 0 0 -> 192.168.0.234:postgres Route 1 0 0 测试读负载均衡 在当前的Master节点(node1)上通过读VIP访问postgres,可以看到psql会轮询连接到3个不同的Slave上。 [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" -tAc "select pg_postmaster_start_time()" 2017-01-14 12:01:48.068455+08 [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" -tAc "select pg_postmaster_start_time()" 2017-01-14 12:01:12.222412+08 [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" -tAc "select pg_postmaster_start_time()" 2017-01-14 12:15:19.614782+08 [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" -tAc "select pg_postmaster_start_time()" 2017-01-14 12:01:48.068455+08 [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" -tAc "select pg_postmaster_start_time()" 2017-01-14 12:01:12.222412+08 [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" -tAc "select pg_postmaster_start_time()" 2017-01-14 12:15:19.614782+08 下面测试Salve节点发生故障的场景。 先连接到其中一台Slave [root@node1 pha4pgsql]# psql "host=192.168.0.237 port=5432 dbname=postgres user=replication password=replication" psql (9.6.1) Type "help" for help. 当前连接在node4上 [root@node4 ~]# ps -ef|grep postgres postgres 11911 1 0 12:15 pts/0 00:00:00 /usr/pgsql-9.6/bin/postgres -D /pgsql/data postgres 11912 11911 0 12:15 ? 00:00:00 postgres: logger process postgres 11913 11911 0 12:15 ? 00:00:00 postgres: startup process recovering 000000090000000000000007 postgres 11917 11911 0 12:15 ? 00:00:00 postgres: checkpointer process postgres 11918 11911 0 12:15 ? 00:00:00 postgres: writer process postgres 11920 11911 0 12:15 ? 
00:00:00 postgres: stats collector process postgres 11921 11911 0 12:15 ? 00:00:04 postgres: wal receiver process streaming 0/7000CA0 postgres 12004 11911 0 13:19 ? 00:00:00 postgres: replication postgres 192.168.0.231(42116) idle root 12006 2230 0 13:19 pts/0 00:00:00 grep --color=auto postgres 强制杀死node4上的postgres进程 [root@node4 ~]# killall postgres lvsdr探测到node4挂了后会自动将其从real server列表中摘除 [root@node2 ~]# ipvsadm IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node2:postgres rr -> node2:postgres Route 1 0 0 -> node3:postgres Route 1 0 0 psql执行下一条SQL时就会自动连接到其它Slave上。 postgres=# select pg_postmaster_start_time(); FATAL: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. postgres=# select pg_postmaster_start_time(); pg_postmaster_start_time ------------------------------- 2017-01-14 12:01:48.068455+08 (1 row) 指定静态的real server列表 有时候不希望将所有连接到Master的Slave都加入到LVS的real server中,比如某个Slave可能实际上是pg_receivexlog。 这时可以在lvsdr上指定静态的real server列表作为白名单。 方法1: 通过default_weight和weight_of_realservers指定各个real server的权重,将不想参与到负载均衡的Slave的权重设置为0。 并且还是通过在Master上查询Slave一览的方式监视Slave健康状态。 下面在Pacemaker集群的任意一个节点(node1,node2或node3)上,修改lvsdr RA的配置,设置有效的real server列表为node,node2和node3。 pcs resource update lvsdr default_weight="0" pcs resource update lvsdr weight_of_realservers="node1,1 node2,1 node3,1" pcs resource update lvsdr realserver_get_real_servers_script="/opt/pha4pgsql/tools/get_active_slaves /usr/pgsql/bin/psql \"host=192.168.0.236 port=5432 dbname=postgres user=replication password=replication connect_timeout=5\"" 在lvsdr所在节点上检查LVS的状态,此时node4(192.168.0.234)的权重为0,LVS不会往node4上转发请求。 [root@node2 ~]# ipvsadm IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP node2:postgres rr -> node2:postgres Route 1 0 0 -> node3:postgres Route 1 0 0 -> 192.168.0.234:postgres Route 0 0 0 方法2: 通过default_weight和weight_of_realservers指定real server一览,并通过调用check_active_slave脚本,依次连接到real server中的每个节点上检查其是否可以连接并且是Slave。 pcs resource update lvsdr default_weight="1" pcs resource update lvsdr weight_of_realservers="node1 node2 node3 192.168.0.234" pcs resource update lvsdr realserver_dependent_resource="" pcs resource update lvsdr realserver_get_real_servers_script="" pcs resource update lvsdr realserver_check_active_real_server_script="/opt/pha4pgsql/tools/check_active_slave /usr/pgsql/bin/psql \"port=5432 dbname=postgres user=replication password=replication connect_timeout=5\" -h" 推荐采用方法1,因为每次健康检查只需要1次连接。 参考 Pacemaker High Availability for PostgreSQL PostgreSQL流复制高可用的原理与实践 PgSQL Replicated Cluster Pacemaker+Corosync搭建PostgreSQL集群
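回到上面"指定静态的real server列表"中的方法2,下面给出一个极简的示意脚本,帮助理解check_active_slave这类检查的思路:依次连接各real server,确认其可连接且处于Slave状态(pg_is_in_recovery()返回t),再相应地把节点加入或摘出LVS。注意这只是一个假设性的示例草稿,并非pha4pgsql自带脚本的实际实现;其中的VIP、端口、节点列表沿用本文实验环境的取值。

#!/bin/bash
# 示意脚本(假设性示例,非pha4pgsql实际代码):检查各real server是否为可用的Slave,并同步到LVS
VIP=192.168.0.237                     # 读VIP,沿用本文实验环境
PORT=5432
REAL_SERVERS="node2 node3 192.168.0.234"
CONNOPT="port=$PORT dbname=postgres user=replication password=replication connect_timeout=5"

for host in $REAL_SERVERS; do
    # Slave上pg_is_in_recovery()返回t;连接失败或返回f都视为不可用
    state=$(psql "host=$host $CONNOPT" -Atc "select pg_is_in_recovery()" 2>/dev/null)
    if [ "$state" = "t" ]; then
        # 节点已在列表中时-a会报错,忽略即可
        ipvsadm -a -t $VIP:$PORT -r $host:$PORT -g -w 1 2>/dev/null
    else
        ipvsadm -d -t $VIP:$PORT -r $host:$PORT 2>/dev/null
    fi
done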
PostgreSQL相关入门资料 0.PostgreSQL介绍资料 https://yq.aliyun.com/articles/60153 http://vdisk.weibo.com/s/aIBDMOTZz2gqU http://airjd.com/view/ibqwxj0t000ve0l#14 1. 中文手册 http://www.postgres.cn/docs/9.5/ 这个中文手册是翻译中的版本,因此混合了9.4和9.5的内容 如果以开发应用为主要目的,可以重点看下[I. 教程]和[II. SQL 语言]。 2. 官方手册 https://www.postgresql.org/docs/ 3. 中文社区官网 http://postgres.cn/ 4.中文使用类书籍推荐 《PostgreSQL 9 Administration Cookbook (第2版)中文版》 《PostgreSQL修炼之道 从小工到专家》 《PostgreSQL服务器编程》(关于存储过程编写的) 《PostgreSQL 9.0性能调校》 (英文原书很不错可惜中文翻译的比较差) 《postgresql数据库内核分析》 《数据库查询优化器的艺术》 5. 一些资源列表 https://github.com/ty4z2008/Qix/blob/master/pg.md http://www.360doc.com/content/15/0308/18/1513309_453588512.shtml 6. 德哥的博客和视频 http://blog.163.com/digoal@126/blog/#m=0 http://pan.baidu.com/s/1pKVCgHX 德哥的技术文章相当丰富 7. 客户端工具 psql 命令行工具,用熟了非常便捷。 pgAdmin3 GUI工具,最新的pgAdmin4目前还不太稳定,所以建议用pgAdmin3
MySQL 4字节utf8字符更新失败一例 业务的小伙伴反映了下面的问题 问题 有一个4字节的utf8字符'????'插入到MySQL数据库中时报错 java.sql.SQLException: Incorrect string value: '\xF0\xA0\x99\xB6' for column 'c_utf8mb4' at row 1 数据库中存放该字符的列已经定义为utf8mb4编码了,但相关的参数character_set_server的值为utf8。 比较奇怪的是使用mysql-connector-java-5.1.15.jar驱动时没有问题,使用更高版本的驱动如mysql-connector-java-5.1.22.jar,就会出错。JDBC的下面2个连接参数,不过设置与否,都没有影响。 characterEncoding=utf8 useUnicode=true 原因 jdbc驱动未正确设置SET NAMES utf8mb4导致转码错误。 根据MySQL官方手册,在MySQL Jdbc中正确使用4字节UTF8字符的方法如下: http://dev.mysql.com/doc/relnotes/connector-j/5.1/en/news-5-1-14.html: Connector/J mapped both 3-byte and 4-byte UTF8 encodings to the same Java UTF8 encoding. To use 3-byte UTF8 with Connector/J set characterEncoding=utf8 and set useUnicode=true in the connection string. To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set. (Bug #58232) 按照MySQL官方手册提供的方法,MySQL JDBC驱动内部会在建立连接时发送SET NAMES utf8mb4给服务端,确保正确进行字符编码。 所以,本问题属于应用未按要求使用MySQL JDBC。但5.1.15可以插入4字节字符也是比较奇怪的事情。 mysql-connector官网的 change log中并且提交5.1.15~5.1.22之间有相关的改动。但是,通过比较代码发现,这部分逻辑确实发生了变更。 5.1.15 com\mysql\jdbc\ConnectionImpl.java: private boolean configureClientCharacterSet(boolean dontCheckServerMatch) throws SQLException { ... if(getEncoding() != null) { String mysqlEncodingName = CharsetMapping.getMysqlEncodingForJavaEncoding(getEncoding().toUpperCase(Locale.ENGLISH), this); if(getUseOldUTF8Behavior()) mysqlEncodingName = "latin1"; if(dontCheckServerMatch || !characterSetNamesMatches(mysqlEncodingName)) execSQL(null, (new StringBuilder()).append("SET NAMES ").append(mysqlEncodingName).toString(), -1, null, 1003, 1007, false, database, null, false); realJavaEncoding = getEncoding(); } ... } 给CharsetMapping.getMysqlEncodingForJavaEncoding()传入的参数是UTF-8,对应的mysql的编码有2个,utf8和utf8mb4, 其中utf8mb4优先,所以这个函数返回的mysql编码是utf8mb4。即之后执行了SET NAMES utf8mb4 相关代码: com\mysql\jdbc\CharsetMapping.java: public static final String getMysqlEncodingForJavaEncoding(String javaEncodingUC, Connection conn) throws SQLException { List mysqlEncodings = (List)JAVA_UC_TO_MYSQL_CHARSET_MAP.get(javaEncodingUC); if(mysqlEncodings != null) { Iterator iter = mysqlEncodings.iterator(); VersionedStringProperty versionedProp = null; do { if(!iter.hasNext()) break; VersionedStringProperty propToCheck = (VersionedStringProperty)iter.next(); if(conn == null) return propToCheck.toString(); if(versionedProp != null && !versionedProp.preferredValue && versionedProp.majorVersion == propToCheck.majorVersion && versionedProp.minorVersion == propToCheck.minorVersion && versionedProp.subminorVersion == propToCheck.subminorVersion) return versionedProp.toString(); if(!propToCheck.isOkayForVersion(conn)) break; if(propToCheck.preferredValue) return propToCheck.toString(); versionedProp = propToCheck; } while(true); if(versionedProp != null) return versionedProp.toString(); } return null; } ... CHARSET_CONFIG.setProperty("javaToMysqlMappings", "US-ASCII =\t\t\tusa7,US-ASCII =\t\t\t>4.1.0 ascii,... UTF-8 = \t\tutf8,UTF-8 =\t\t\t\t*> 5.5.2 utf8mb4,..."); 注:上面的定义UTF-8 =\t\t\t\t*> 5.5.2 utf8mb4中的*代表有多个mysql编码对应于同一个Java编码时,该编码优先 5.1.22 com\mysql\jdbc\ConnectionImpl.java: private boolean configureClientCharacterSet(boolean dontCheckServerMatch) throws SQLException { ... 
if(getEncoding() != null) { String mysqlEncodingName = getServerCharacterEncoding(); if(getUseOldUTF8Behavior()) mysqlEncodingName = "latin1"; boolean ucs2 = false; if("ucs2".equalsIgnoreCase(mysqlEncodingName) || "utf16".equalsIgnoreCase(mysqlEncodingName) || "utf32".equalsIgnoreCase(mysqlEncodingName)) { mysqlEncodingName = "utf8"; ucs2 = true; if(getCharacterSetResults() == null) setCharacterSetResults("UTF-8"); } if(dontCheckServerMatch || !characterSetNamesMatches(mysqlEncodingName) || ucs2) execSQL(null, (new StringBuilder()).append("SET NAMES ").append(mysqlEncodingName).toString(), -1, null, 1003, 1007, false, database, null, false); realJavaEncoding = getEncoding(); } ... } ... public String getServerCharacterEncoding() { if(io.versionMeetsMinimum(4, 1, 0)) { String charset = (String)indexToCustomMysqlCharset.get(Integer.valueOf(io.serverCharsetIndex)); if(charset == null) charset = (String)CharsetMapping.STATIC_INDEX_TO_MYSQL_CHARSET_MAP.get(Integer.valueOf(io.serverCharsetIndex)); return charset == null ? (String)serverVariables.get("character_set_server") : charset; } else { return (String)serverVariables.get("character_set"); } } 解决办法 一直使用旧版的5.1.15驱动不是一个好办法,因此在使用新版驱动时,采取以下措施之一解决这个问题。 参考官网的说明,修改my.cnf character_set_server=utf8mb4 在应用中获取连接后执行下面的SQL stmt.executeUpdate("set names utf8mb4") 补充 根据5.1.22的MySQL JDBC驱动代码,MySQL JDBC支持utf8mb4需要满足以下2个条件 1. MySQL系统变量`character_set_server`的值为utf8mb4 2. MySQL JDBC连接参数characterEncoding的值为以下值之一 - null - UTF8 - UTF-8
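作为验证,可以直接用mysql命令行模拟驱动正确协商utf8mb4之后的行为:先执行SET NAMES utf8mb4,再插入报错信息中的那个4字节字符(\xF0\xA0\x99\xB6)。下面是一个假设性的小例子,库名test、表名t_utf8mb4均为演示用的假设,并非前文的业务表。

# 假设性验证示例:t_utf8mb4仅用于演示
mysql -uroot -p test <<'SQL'
CREATE TABLE IF NOT EXISTS t_utf8mb4 (c1 VARCHAR(10) CHARACTER SET utf8mb4);
SET NAMES utf8mb4;                  -- 相当于驱动端发送的SET NAMES utf8mb4
INSERT INTO t_utf8mb4 VALUES ('𠙶'); -- 即报错信息中的4字节字符\xF0\xA0\x99\xB6
SELECT c1, HEX(c1) FROM t_utf8mb4;  -- HEX结果应为F0A099B6
SQL

如果去掉SET NAMES utf8mb4而连接字符集仍是utf8,这条INSERT就会报出与前文相同的Incorrect string value错误。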
系统变量 字符集相关的系统变量 mysql> show variables like '%char%'; +--------------------------+------------------------------------------------------------------------------+ | Variable_name | Value | +--------------------------+------------------------------------------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/Percona-Server-5.6.29-rel76.2-Linux.x86_64.ssl101/share/charsets/ | +--------------------------+------------------------------------------------------------------------------+ 8 rows in set (0.01 sec) 各个变量的含义概述如下: character_set_client :客户端发给服务端的SQL的字符集 character_set_connection : 字符常量的缺省字符集 character_set_database:缺省数据库(即use指定的数据库)的缺省字符集 character_set_filesystem:文件系统字符集,用于解释文件名字符常量 character_set_results:结果集和错误消息的字符集 character_set_server: 服务器的缺省字符集 character_set_system: 系统标识符的字符集 character_sets_dir: 字符集安装目录 详细定义参考官网说明: 5.1.5 Server System Variables 10.1.4 Connection Character Sets and Collations 排序规则相关的系统变量: mysql> show variables like '%collation%'; +----------------------+-----------------+ | Variable_name | Value | +----------------------+-----------------+ | collation_connection | utf8_general_ci | | collation_database | utf8_general_ci | | collation_server | utf8_general_ci | +----------------------+-----------------+ 3 rows in set (0.00 sec) 排序规则和上面的字符集是对应的,就不解释了。但有一个问题,UTF8编码下该设置utf8_general_ci 还是utf8_unicode_ci有一些讨论。 比如:What's the difference between utf8_general_ci and utf8_unicode_ci utf8_general_ci排序略快一些,utf8_unicode_ci对某些语义排序更准确。然而,所谓的"更快",快的程度可以无视;"更准确"所适用的场景对使用中文的我们没啥意义。所以个人认为设啥都没关系,干脆顺气自然不设,让MySQL自己根据字符集选择缺省值吧(即utf8_general_ci)。 数据存储 字符数据的最终存储到表的字符类型的列上,所以存储的最终体现形式是列的字符集。至于表的字符集不过是生成列时的缺省字符集;数据库的字符集不过建表时的缺省字符集。 一劳永逸的字符设置 谈到字符主要让人操心的是乱码问题,最简单有效的解决办法是统一设置UTF8编码。 只要在my.cnf的[mysqld]上设置character_set_server即可。 character_set_server = utf8mb4 这样,新创建的数据库和该数据库中的对象将默认采用'utf8mb4'编码; JDBC(5.1.13以后版本)客户端将根据服务端的character_set_server设置合适的客户端编码; http://dev.mysql.com/doc/relnotes/connector-j/5.1/en/news-5-1-14.html Connector/J mapped both 3-byte and 4-byte UTF8 encodings to the same Java UTF8 encoding. To use 3-byte UTF8 with Connector/J set characterEncoding=utf8 and set useUnicode=true in the connection string. To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set. (Bug #58232) 排序规则无需专门设置,让它跟随编码自己变化。 参考 10分钟彻底解决MySQL乱码问题? mysql使用utf8mb4经验吐血总结
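按上面"一劳永逸的字符设置"的建议,在my.cnf的[mysqld]中设置character_set_server = utf8mb4并重启后,可以用下面的假设性小例子确认"新建数据库默认继承服务端字符集"这一点(其中库名demo_mb4为演示假设):

# 确认服务端字符集
mysql -uroot -p -e "SHOW VARIABLES LIKE 'character_set_server'"
# 新建一个库,查看其默认字符集/排序规则是否已是utf8mb4
mysql -uroot -p -e "CREATE DATABASE demo_mb4;
SELECT default_character_set_name, default_collation_name
FROM information_schema.schemata WHERE schema_name='demo_mb4';"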
基于Pacemaker+Corosync的PostgreSQL HA故障两例 前几天Pacemaker+Corosync的PostgreSQL HA集群发生了两次故障,再次提醒了HA的部署要谨慎,维护要细致。 故障1:高负载下自动切换 系统是基于Pacemaker+Corosync的PostgreSQL 1主2从 HA集群。在数据库上执行一个delete.从500w的表里删除13w条记录,发现Master挂了。 然后自动起来,再执行这个SQL,再次挂。究竟怎么回事呢? 后来查了Pacemaker的日志,发现是expgsql RA的monitor发生超时,导致expgsql RA重启PostgreSQL。(迁移阈值是3,可以重启3次,超过3次就会发生主从切换。) /var/log/messages Nov 18 13:22:05 node1 expgsql(pgsql)[7774]: INFO: Stopping PostgreSQL on demote. Nov 18 13:22:05 node1 expgsql(pgsql)[7774]: INFO: server shutting down Nov 18 13:22:11 node1 expgsql(pgsql)[7774]: INFO: PostgreSQL is down Nov 18 13:22:11 node1 expgsql(pgsql)[7774]: INFO: Changing pgsql-status on node1 : PRI->STOP. Nov 18 13:22:11 node1 expgsql(pgsql)[8609]: INFO: PostgreSQL is already stopped. Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: Set all nodes into async mode. Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: server starting Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL start command sent. Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL is down Nov 18 13:22:13 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL is started. corosync.log中发现在发生超时前,系统负载很高。 /var/log/cluster/corosync.log Nov 18 13:20:55 [2353] node1 crmd: info: throttle_handle_load: Moderate CPU load detected: 12.060000 Nov 18 13:20:55 [2353] node1 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0001) Nov 18 13:21:25 [2353] node1 crmd: notice: throttle_handle_load: High CPU load detected: 16.379999 Nov 18 13:21:25 [2353] node1 crmd: info: throttle_send_command: New throttle mode: 0100 (was 0010) Nov 18 13:21:44 [2350] node1 lrmd: warning: child_timeout_callback: pgsql_monitor_3000 process (PID 4822) timed out Nov 18 13:21:44 [2350] node1 lrmd: warning: operation_finished: pgsql_monitor_3000:4822 - timed out after 60000ms Nov 18 13:21:44 [2353] node1 crmd: error: process_lrm_event: Operation pgsql_monitor_3000: Timed Out (node=node1, call=837, timeout=60000ms) Nov 18 13:21:44 [2348] node1 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/462) 系统是4核机,上面的输出表明系统已经超载,进入限流模式。后面甚至有负载达到49的记录。 Nov 18 13:24:55 [2353] node1 crmd: notice: throttle_handle_load: High CPU load detected: 48.320000 Nov 18 13:25:25 [2353] node1 crmd: notice: throttle_handle_load: High CPU load detected: 49.750000 Nov 18 13:25:39 [2350] node1 lrmd: warning: child_timeout_callback: pgsql_demote_0 process (PID 16604) timed out Nov 18 13:25:55 [2353] node1 crmd: notice: throttle_handle_load: High CPU load detected: 48.860001 Nov 18 13:26:03 [2350] node1 lrmd: warning: operation_finished: pgsql_demote_0:16604 - timed out after 60000ms throttle mode是什么鬼? 限流模式是Pacemaker的一种保护措施,进入限流模式后Pacemaker会减少自身可以并发执行的job数。 在High CPU load下,限流模式会限制同时只能有1个job在跑。正常情况下,允许同时运行的最大job数是CPU核数的2倍。 https://github.com/ClusterLabs/pacemaker/blob/master/crmd/throttle.c int throttle_get_job_limit(const char *node) { ... switch(r->mode) { case throttle_extreme: case throttle_high: jobs = 1; /* At least one job must always be allowed */ break; case throttle_med: jobs = QB_MAX(1, r->max / 4); break; case throttle_low: jobs = QB_MAX(1, r->max / 2); break; case throttle_none: jobs = QB_MAX(1, r->max); break; default: crm_err("Unknown throttle mode %.4x on %s", r->mode, node); break; } ... 
} 如何获取CPU LOAD和负载阈值 获取CPU LOAD的方式为取/proc/loadavg中第一行的1分钟负载。 至于负载阈值,有3个负载级别,不同阈值有不同的因子(throttle_load_target) #define THROTTLE_FACTOR_LOW 1.2 #define THROTTLE_FACTOR_MEDIUM 1.6 #define THROTTLE_FACTOR_HIGH 2.0 对应的阈值分别为 load-threshold * throttle_load_target load-threshold为集群参数,默认值为80% 判断是否超过阈值的方法是用调整后的CPU负载(上面的CPU LOAD除以核心数)和阈值比较 IO负载 除了CUP负载,Pacemaker还有IO负载的比较,但是相关函数存在多处错误,实际上无效。 无效也好,如果下面这个问题解决了后面会有相反的bug导致会误判高IO负载。 static bool throttle_io_load(float *load, unsigned int *blocked) { if(fgets(buffer, sizeof(buffer), stream)) { ... long long divo2 = 0; ... long long diow =0; ... long long Div = 0; *load = (diow + divo2) / Div; //结果始终为0 ... return TRUE; //只读了第一行就返回导致不会给blocked赋值 } } 限流和本次故障有什么关系 不能肯定限流和这次故障是否有必然的联系。但是在限流下,Pacemaker执行job的能力弱了,系统负载又高,会增大monitor相关job得不到及时的CPU调度进而发生超时的概率。实际的罪魁祸首还应该是高负载及超时判断方法。 类似问题 搜索后发现有人遇到过类似问题,处理办法就是增大monitor的超时时间。 https://bugs.launchpad.net/fuel/+bug/1464131 https://review.openstack.org/#/c/191715/1/deployment/puppet/pacemaker_wrappers/manifests/rabbitmq.pp 解决办法 一方面修改增大monitor的超时时间。在线修改的方法如下: pcs resource update pgsql op monitor interval=4s timeout=300s on-fail=restart pcs resource update pgsql op monitor role=Master timeout=300s on-fail=restart interval=3s 另一方面,发现系统负载过重,经常跑到100%的CPU,即使没有HA这档子事也是个不稳定因素。 通过修改应用,迁移大部分的读负载到SLave上,有效减轻了Master的压力。 其它案例 GitHub遇到过类似的故障,由于迁移到Master负载过高,进而Percona Replication Manager的健康检查失败进行了切换,切换后新主的缓存是冷的,负载同样过高,又切回去。 (幸运的是我们的方案有3次迁移阈值的保护,不会立刻切。)GitHub权衡后的对策居然是放弃自动切换,只能由人工发起。 参考: GitHub 的两次故障分析 另,Pacemkaer的论坛有几件高负载导致corosync token超时的问题,同样由于相关job不能及时得到OS调度所致,该问题虚机上容易发生,特别是有超分的情况下,解决办法是加大token超时时间。 故障2:同步复制变成了异步复制 系统配置的同步复制,正常情况下2个Slave应该一个是sync,另一个是async。 但是,某个时间发现2个Slave都是async,这对应用系统暂无影响,但是万一这时Master挂了会影响HA切换。 查看Master上的rep_mode.conf配置文件,发现synchronous_standby_names是空。 vi /var/lib/pgsql/tmp/rep_mode.conf synchronous_standby_names = '' 于是,手工修改为node2,再reload一下就OK了。 vi /var/lib/pgsql/tmp/rep_mode.conf synchronous_standby_names = 'node2' 至于什么原因导致的,又仔细查看了一下RA脚本,未发现疑点,而现场已经不在,无从查起,只能等下次遇到再说。
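针对故障2这类"同步复制悄悄退化成异步复制"的问题,除了事后排查,也可以加一个简单的外部监控,及早发现synchronous_standby_names被清空的情况。下面是一个假设性的示意脚本(psql连接参数、告警方式均为示例,可按实际环境替换),在Master上确认pg_stat_replication中存在sync状态的备机,否则写一条告警日志:

#!/bin/bash
# 示意脚本(假设性示例):检查Master上是否存在同步备机
PSQL="psql -U postgres -p 5432 -d postgres -Atc"
sync_cnt=$($PSQL "select count(*) from pg_stat_replication where sync_state='sync'")
sync_names=$($PSQL "show synchronous_standby_names")
if [ "$sync_cnt" = "0" ] || [ -z "$sync_names" ]; then
    # 告警方式按需替换,这里简单写入syslog
    logger -t pgsql_ha "WARNING: no sync standby (synchronous_standby_names='$sync_names')"
fi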
使用Win7远程连接另一台Win7机器,输入用户名、密码后点击"连接",报"之前用于连接到的凭据无法工作,请输入新的凭据"的错误,但用户名、密码都是正确的。网上搜索到不少解决办法,大部分无效,只有下面的方法是正解。
打开gpedit.msc,依次进入"计算机配置"→"Windows设置"→"安全设置"→"本地策略"→"安全选项"→"网络访问:本地账户的共享和安全模型",将其设置为"经典-对本地用户进行身份验证,不改变其本来身份"。
基于PGPool的1主2从PostgreSQL流复制HA的搭建 PostgreSQL的流复制为HA提供了很好的支持,但是部署HA集群还需要专门的HA组件, 比如通用的Pacemaker+Corosync。pgpool作为PostgreSQL的中间件,也提供HA功能。 pgpool可以监视后端PostgreSQL的健康并实施failover,由于应用的所有流量都经过pgpool,可以很容易对故障节点进行隔离, 但,同时必须为pgpool配置备机,防止pgpool本身成为单点。pgpool自身带watchdog组件通过quorum机制防止脑裂, 因此建议pgpool节点至少是3个,并且是奇数。在失去quorum后watchdog会自动摘除VIP,并阻塞客户端连接。 下面利用pgpool搭建3节点PostgreSQL流复制的HA集群。 集群的目标为强数据一致HA,实现思路如下: 基于PostgreSQL的1主2从同步复制 Slave的复制连接字符串使用固定的pgsql_primary作为Master的主机名,在/etc/hosts中将Master的ip映射到pgsql_primary上,通过/etc/hosts的修改实现Slave对复制源(Master)的切换。 之所以采取这种方式是为了避免直接修改recovery.conf后重启postgres进程时会被pgpool检测到并设置postgres后端为down状态。 pgpool分别部署在3个节点上,pgpool的Master和PostgreSQL的Primary最好不在同一个节点上,这样在PostgreSQL的Primary down时可以干净的隔离故障机器。 环境 软件 CentOS 7.0 PGPool 3.5 PostgreSQL9.5 节点 node1 192.168.0.211 node2 192.168.0.212 node3 192.168.0.213 vip 192.168.0.220 配置 PostgreSQL Port:5433 复制账号:replication/replication 管理账号:admin/admin 前提 3个节点建立ssh互信。 3个节点配置好主机名解析(/etc/hosts) 将pgsql_primary解析为主节点的IP [postgres@node3 ~]$ cat /etc/hosts 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 192.168.0.211 node1 192.168.0.212 node2 192.168.0.213 node3 192.168.0.211 pgsql_primary 3个节点事先装好PostgreSQL,并配置1主2从同步流复制,node1是主节点。 在2个Slave节点node2和node3上设置recovery.conf中的复制源的主机名为pgsql_primary [postgres@node3 ~]$ cat /data/postgresql/data/recovery.conf standby_mode = 'on' primary_conninfo = 'host=pgsql_primary port=5433 application_name=node3 user=replication password=replication keepalives_idle=60 keepalives_interval=5 keepalives_count=5' restore_command = '' recovery_target_timeline = 'latest' 安装pgpool 在node1,node2和node3节点上安装pgpool-II yum install http://www.pgpool.net/yum/rpms/3.5/redhat/rhel-7-x86_64/pgpool-II-release-3.5-1.noarch.rpm yum install pgpool-II-pg95 pgpool-II-pg95-extensions 在Master上安装pgpool_recovery扩展(可选) [postgres@node1 ~]$ psql template1 -p5433 psql (9.5.2) Type "help" for help. 
template1=# CREATE EXTENSION pgpool_recovery; pgpool_recovery扩展定义了4个函数用于远程控制PG,这样可以避免了对ssh的依赖,不过下面的步骤没有用到这些函数。 template1=> \dx+ pgpool_recovery Objects in extension "pgpool_recovery" Object Description ----------------------------------------------- function pgpool_pgctl(text,text) function pgpool_recovery(text,text,text) function pgpool_recovery(text,text,text,text) function pgpool_remote_start(text,text) function pgpool_switch_xlog(text) (5 rows) 配置pgpool.conf 以下是node3上的配置,node1和node2节点上参照设置 $ cp /etc/pgpool-II/pgpool.conf.sample-stream /etc/pgpool-II/pgpool.conf $ vi /etc/pgpool-II/pgpool.conf listen_addresses = '*' port = 9999 pcp_listen_addresses = '*' pcp_port = 9898 backend_hostname0 = 'node1' backend_port0 = 5433 backend_weight0 = 1 backend_data_directory0 = '/data/postgresql/data' backend_flag0 = 'ALLOW_TO_FAILOVER' backend_hostname1 = 'node2' backend_port1 = 5433 backend_weight1 = 1 backend_data_directory1 = '/data/postgresql/data' backend_flag1 = 'ALLOW_TO_FAILOVER' backend_hostname2 = 'node3' backend_port2 = 5433 backend_weight2 = 1 backend_data_directory2 = '/data/postgresql/data' backend_flag2 = 'ALLOW_TO_FAILOVER' enable_pool_hba = off pool_passwd = 'pool_passwd' pid_file_name = '/var/run/pgpool/pgpool.pid' logdir = '/var/log/pgpool' connection_cache = on replication_mode = off load_balance_mode = on master_slave_mode = on master_slave_sub_mode = 'stream' sr_check_period = 10 sr_check_user = 'admin' sr_check_password = 'admin' sr_check_database = 'postgres' delay_threshold = 10000000 follow_master_command = '' health_check_period = 3 health_check_timeout = 20 health_check_user = 'admin' health_check_password = 'admin' health_check_database = 'postgres' health_check_max_retries = 0 health_check_retry_delay = 1 connect_timeout = 10000 failover_command = '/home/postgres/failover.sh %h %H %d %P' failback_command = '' fail_over_on_backend_error = on search_primary_node_timeout = 10 use_watchdog = on wd_hostname = 'node3' ##设置本节点的节点名 wd_port = 9000 wd_priority = 1 wd_authkey = '' wd_ipc_socket_dir = '/tmp' delegate_IP = '192.168.0.220' if_cmd_path = '/usr/sbin' if_up_cmd = 'ip addr add $_IP_$/24 dev eno16777736 label eno16777736:0' if_down_cmd = 'ip addr del $_IP_$/24 dev eno16777736' arping_path = '/usr/sbin' arping_cmd = 'arping -U $_IP_$ -w 1 -I eno16777736' wd_monitoring_interfaces_list = '' wd_lifecheck_method = 'heartbeat' wd_interval = 10 wd_heartbeat_port = 9694 wd_heartbeat_keepalive = 2 wd_heartbeat_deadtime = 30 heartbeat_destination0 = 'node1' ##设置其它PostgreSQL节点的节点名 heartbeat_destination_port0 = 9694 heartbeat_device0 = 'eno16777736' heartbeat_destination1 = 'node2' ##设置其它PostgreSQL节点的节点名 heartbeat_destination_port1 = 9694 heartbeat_device1 = 'eno16777736' other_pgpool_hostname0 = 'node1' ##设置其它pgpool节点的节点名 other_pgpool_port0 = 9999 other_wd_port0 = 9000 other_pgpool_hostname0 = 'node2' ##设置其它pgpool节点的节点名 other_pgpool_port0 = 9999 other_wd_port0 = 9000 配置PCP命令接口 pgpool-II 有一个用于管理功能的接口,用于通过网络获取数据库节点信息、关闭 pgpool-II 等。要使用 PCP 命令,必须进行用户认证。这需要在 pcp.conf 文件中定义一个用户和密码。 $ pg_md5 pgpool ba777e4c2f15c11ea8ac3be7e0440aa0 $ vi /etc/pgpool-II/pcp.conf root:ba777e4c2f15c11ea8ac3be7e0440aa0 为了免去每次执行pcp命令都输入密码的麻烦,可以配置免密码文件。 $ vi ~/.pcppass localhost:9898:root:pgpool $ chmod 0600 ~/.pcppass 配置pool_hba.conf(可选) pgpool可以按照和PostgreSQL的hba.conf类似的方式配置自己的主机认证,所有连接到pgpool上的客户端连接将接受认证,这解决了后端PostgreSQL无法直接对前端主机进行IP地址限制的问题。 开启pgpool的hba认证 $ vi /etc/pgpool-II/pgpool.conf enable_pool_hba = on 
编辑pool_hba.conf,注意客户端的认证请求最终还是要被pgpool转发到后端的PostgreSQL上去,所以pool_hba.conf上的配置应和后端的hba.conf一致,比如pgpool对客户端的连接采用md5认证,那么PostgreSQL对这个pgpool转发的连接也要采用md5认证,并且密码相同。 $ vi /etc/pgpool-II/pool_hba.conf 如果pgpool使用了md5认证,需要在pgpool上设置密码文件。 密码文件名通过pgpool.conf中的pool_passwd参数设置,默认为/etc/pgpool-II/pool_passwd 设置pool_passwd的方法如下。 $ pg_md5 -m -u admin admin 启动pgpool 分别在3个节点上启动pgpool。 [root@node3 ~]# service pgpool start Redirecting to /bin/systemctl start pgpool.service 检查pgpool日志输出,确认启动成功。 [root@node3 ~]# tail /var/log/messages Nov 8 12:53:47 node3 pgpool: 2016-11-08 12:53:47: pid 31078: LOG: pgpool-II successfully started. version 3.5.4 (ekieboshi) 通过pcp_watchdog_info命令确认集群状况 [root@node3 ~]# pcp_watchdog_info -w -v Watchdog Cluster Information Total Nodes : 3 Remote Nodes : 2 Quorum state : QUORUM EXIST Alive Remote Nodes : 2 VIP up on local node : NO Master Node Name : Linux_node2_9999 Master Host Name : node2 Watchdog Node Information Node Name : Linux_node3_9999 Host Name : node3 Delegate IP : 192.168.0.220 Pgpool port : 9999 Watchdog port : 9000 Node priority : 1 Status : 7 Status Name : STANDBY Node Name : Linux_node1_9999 Host Name : node1 Delegate IP : 192.168.0.220 Pgpool port : 9999 Watchdog port : 9000 Node priority : 1 Status : 7 Status Name : STANDBY Node Name : Linux_node2_9999 Host Name : node2 Delegate IP : 192.168.0.220 Pgpool port : 9999 Watchdog port : 9000 Node priority : 1 Status : 4 Status Name : MASTER 通过psql命令确认集群状况 [root@node3 ~]# psql -hnode3 -p9999 -U admin postgres ... postgres=> show pool_nodes; node_id | hostname | port | status | lb_weight | role | select_cnt ---------+----------+------+--------+-----------+---------+------------ 0 | node1 | 5433 | 2 | 0.333333 | standby | 0 1 | node2 | 5433 | 2 | 0.333333 | standby | 0 2 | node3 | 5433 | 2 | 0.333333 | primary | 0 (3 rows) 准备failover脚本 准备failover脚本,并部署在3个节点上 /home/postgres/failover.sh #!/bin/bash pgsql_nodes="node1 node2 node3" logfile=/var/log/pgpool/failover.log down_node=$1 new_master=$2 down_node_id=$3 old_master_id=$4 old_master=$down_node export PGDATA="/data/postgresql/data" export PGPORT=5433 export PGDATABASE=postgres export PGUSER=admin export PGPASSWORD=admin trigger_command="pg_ctl -D $PGDATA promote -m fast" stop_command="pg_ctl -D $PGDATA stop -m fast" start_command="pg_ctl -D $PGDATA start" restart_command="pg_ctl -D $PGDATA restart -m fast" CHECK_XLOG_LOC_SQL="select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()" log() { echo "$*" >&2 echo "`date +'%Y-%m-%d %H:%M:%S'` $*" >> $logfile } # Execulte SQL and return the result. exec_sql() { local host="$1" local sql="$2" local output local rc output=`psql -h $host -Atc "$sql"` rc=$? echo $output return $rc } get_xlog_location() { local rc local output local replay_loc local receive_loc local output1 local output2 local log1 local log2 local newer_location local target_host=$1 output=`exec_sql "$target_host" "$CHECK_XLOG_LOC_SQL"` rc=$? 
if [ $rc -ne 0 ]; then log "Can't get xlog location from $target_host.(rc=$rc)" exit 1 fi replay_loc=`echo $output | cut -d "|" -f 1` receive_loc=`echo $output | cut -d "|" -f 2` output1=`echo "$replay_loc" | cut -d "/" -f 1` output2=`echo "$replay_loc" | cut -d "/" -f 2` log1=`printf "%08s\n" $output1 | sed "s/ /0/g"` log2=`printf "%08s\n" $output2 | sed "s/ /0/g"` replay_loc="${log1}${log2}" output1=`echo "$receive_loc" | cut -d "/" -f 1` output2=`echo "$receive_loc" | cut -d "/" -f 2` log1=`printf "%08s\n" $output1 | sed "s/ /0/g"` log2=`printf "%08s\n" $output2 | sed "s/ /0/g"` receive_loc="${log1}${log2}" newer_location=`printf "$replay_loc\n$receive_loc" | sort -r | head -1` echo "$newer_location" return 0 } get_newer_location() { local newer_location newer_location=`printf "$1\n$2" | sort -r | head -1` echo "$newer_location" } log "##########failover start:$0 $*" # if standby down do nothing if [ "X$down_node_id" != "X$old_master_id" ]; then log "standby node '$down_node' down,skip" exit fi # check the old_master dead log "check the old_master '$old_master' dead ..." exec_sql $old_master "select 1" >/dev/null 2>&1 if [ $? -eq 0 ]; then log "the old master $old_master is alive, cancel faiover" exit 1 fi # check all nodes other than the old master alive and is standby log "check all nodes '$pgsql_nodes' other than the old master alive and is standby ..." for host in $pgsql_nodes ; do if [ $host != $old_master ]; then is_in_recovery=`exec_sql $host "select pg_is_in_recovery()"` if [ $? -ne 0 ]; then log "failed to check $host" exit 1 fi if [ "$is_in_recovery" != 't' ];then log "$host is not a valid standby(is_in_recovery=$is_in_recovery)" exit fi fi done # find the node with the newer xlog log "find the node with the newer xlog ..." # TODO wait for all xlog replayed newer_location=$(get_xlog_location $new_master) log "$new_master : $newer_location" new_primary=$new_master for host in $pgsql_nodes ; do if [ $host != $new_primary -a $host != $old_master ]; then location=$(get_xlog_location $host) log "$host : $location" if [ "$newer_location" != "$(get_newer_location $location $newer_location)" ]; then newer_location=$location new_primary=$host log "change new primary to $new_primary" fi fi done # change replication source to the new primary in all standbys for host in $pgsql_nodes ; do if [ $host != $new_primary -a $host != $old_master ]; then log "change replication source to $new_primary in $host ..." output=`ssh -T $host "/home/postgres/change_replication_source.sh $new_primary" 2>&1` rc=$? log "$output" if [ $rc -ne 0 ]; then log "failed to change replication source to $new_primary in $host" exit 1 fi fi done # trigger failover log "trigger failover to '$new_primary' ..." ssh -T $new_primary su - postgres -c "'$trigger_command'" rc=$? log "fire promote '$new_primary' to be the new primary (rc=$rc)" exit $rc /home/postgres/changereplicationsource.sh #!/bin/bash new_primary=$1 cat /etc/hosts | grep -v ' pgsql_primary$' >/tmp/hosts.tmp echo "`resolveip -s $new_primary` pgsql_primary" >>/tmp/hosts.tmp cp -f /tmp/hosts.tmp /etc/hosts rm -f /tmp/hosts.tmp 添加2个脚本的执行权限 [postgres@node1 ~]# chmod +x /home/postgres/failover.sh /home/postgres/change_replication_source.sh 注:以上脚本并不十分严谨,仅供参考。 failover测试 故障发生前的集群状态 [root@node3 ~]# psql -h192.168.0.220 -p9999 -U admin postgres Password for user admin: psql (9.5.2) Type "help" for help. 
postgres=> show pool_nodes; node_id | hostname | port | status | lb_weight | role | select_cnt ---------+----------+------+--------+-----------+---------+------------ 0 | node1 | 5433 | 2 | 0.333333 | primary | 3 1 | node2 | 5433 | 2 | 0.333333 | standby | 0 2 | node3 | 5433 | 2 | 0.333333 | standby | 0 (3 rows) postgres=> select inet_server_addr(); inet_server_addr ------------------ 192.168.0.211 (1 row 杀死主节点的postgres进程 [root@node1 ~]# killall -9 postgres 检查集群状态,已经切换到node2 postgres=> show pool_nodes; FATAL: unable to read data from DB node 0 DETAIL: EOF encountered with backend server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Succeeded. postgres=> show pool_nodes; node_id | hostname | port | status | lb_weight | role | select_cnt ---------+----------+------+--------+-----------+---------+------------ 0 | node1 | 5433 | 3 | 0.333333 | standby | 27 1 | node2 | 5433 | 2 | 0.333333 | primary | 11 2 | node3 | 5433 | 2 | 0.333333 | standby | 0 (3 rows) postgres=> select inet_server_addr(); inet_server_addr ------------------ 192.168.0.212 (1 row) 恢复 恢复node1为新主的Slave 修改pgsql_primary的名称解析为新主的ip vi /etc/hosts ... 192.168.0.212 pgsql_primary 从新主上拉备份恢复 su - postgres cp /data/postgresql/data/recovery.done /tmp/ rm -rf /data/postgresql/data pg_basebackup -hpgsql_primary -p5433 -Ureplication -D /data/postgresql/data -X stream -P cp /tmp/recovery.done /data/postgresql/data/recovery.conf pg_ctl -D /data/postgresql/data start exit 将node1加入集群 pcp_attach_node -w 0 确认集群状态 postgres=> show pool_nodes; node_id | hostname | port | status | lb_weight | role | select_cnt ---------+----------+------+--------+-----------+---------+------------ 0 | node1 | 5433 | 1 | 0.333333 | standby | 27 1 | node2 | 5433 | 2 | 0.333333 | primary | 24 2 | node3 | 5433 | 2 | 0.333333 | standby | 0 (3 rows) 错误处理 地址被占用pgpool启动失败 Nov 15 02:33:56 node3 pgpool: 2016-11-15 02:33:56: pid 3868: FATAL: failed to bind a socket: "/tmp/.s.PGSQL.9999" Nov 15 02:33:56 node3 pgpool: 2016-11-15 02:33:56: pid 3868: DETAIL: bind socket failed with error: "Address already in use" 由于上次没有正常关闭导致,处理方法: rm -f /tmp/.s.PGSQL.9999 pgpool的master断网后,连接阻塞 切换pgpool的master节点(node1)的网络后,通过pgpool的连接阻塞,剩余节点的pgpool重新协商出新的Master,但阻塞继续,包括新建连接,也没有发生切换。 pgpool的日志里不断输出下面的消息 Nov 15 23:12:37 node3 pgpool: 2016-11-15 23:12:37: pid 4088: ERROR: Failed to check replication time lag Nov 15 23:12:37 node3 pgpool: 2016-11-15 23:12:37: pid 4088: DETAIL: No persistent db connection for the node 0 Nov 15 23:12:37 node3 pgpool: 2016-11-15 23:12:37: pid 4088: HINT: check sr_check_user and sr_check_password Nov 15 23:12:37 node3 pgpool: 2016-11-15 23:12:37: pid 4088: CONTEXT: while checking replication time lag Nov 15 23:12:39 node3 pgpool: 2016-11-15 23:12:39: pid 4088: LOG: failed to connect to PostgreSQL server on "node1:5433", getsockopt() detected error "No route to host" Nov 15 23:12:39 node3 pgpool: 2016-11-15 23:12:39: pid 4088: ERROR: failed to make persistent db connection Nov 15 23:12:39 node3 pgpool: 2016-11-15 23:12:39: pid 4088: DETAIL: connection to host:"node1:5433" failed node2和node3已经协商出新主,但连接阻塞状态一直继续,除非解禁旧master的网卡。 [root@node3 ~]# pcp_watchdog_info -w -v Watchdog Cluster Information Total Nodes : 3 Remote Nodes : 2 Quorum state : QUORUM EXIST Alive Remote Nodes : 2 VIP up on local node : YES Master Node Name : Linux_node3_9999 Master Host Name : node3 Watchdog Node Information Node Name : Linux_node3_9999 Host Name : node3 
Delegate IP : 192.168.0.220 Pgpool port : 9999 Watchdog port : 9000 Node priority : 1 Status : 4 Status Name : MASTER Node Name : Linux_node1_9999 Host Name : node1 Delegate IP : 192.168.0.220 Pgpool port : 9999 Watchdog port : 9000 Node priority : 1 Status : 8 Status Name : LOST Node Name : Linux_node2_9999 Host Name : node2 Delegate IP : 192.168.0.220 Pgpool port : 9999 Watchdog port : 9000 Node priority : 1 Status : 7 Status Name : STANDBY 根据下面的堆栈,是pgpool通过watchdog将某个后端降级时,阻塞了。这应该是一个bug。 [root@node3 ~]# ps -ef|grep pgpool.conf root 4048 1 0 Nov15 ? 00:00:00 /usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf -n root 5301 4832 0 00:10 pts/3 00:00:00 grep --color=auto pgpool.conf [root@node3 ~]# pstack 4048 #0 0x00007f73647e98d3 in __select_nocancel () from /lib64/libc.so.6 #1 0x0000000000493d2e in issue_command_to_watchdog () #2 0x0000000000494ac3 in wd_degenerate_backend_set () #3 0x000000000040bcf3 in degenerate_backend_set_ex () #4 0x000000000040e1c4 in PgpoolMain () #5 0x0000000000406ec2 in main () 总结 本次1主2从的架构中,用pgpool实施PostgreSQL的HA,效果并不理想。与pgpool和pgsql部署在一起有关,靠谱的做法是把pgpool部署在单独的节点或和应用服务器部署在一起。 1主2从或1主多从架构中,primary节点切换后,其它Slave要follow新的primary,需要自己实现,这一步要做的严谨可靠并不容易。 pgpool的primary出现断网错误会导致整个集群挂掉,应该是一个bug,实际部署时应尽量避免pgpool和pgsql部署在相同的节点。 参考 https://www.sraoss.co.jp/eventseminar/2016/edbsummit_2016.pdf#search='pgpool+2016' http://francs3.blog.163.com/blog/static/4057672720149285445881/ http://blog.163.com/digoal@126/blog/static/1638770402014413104753331/ https://my.oschina.net/Suregogo/blog/552765 https://www.itenlight.com/blog/2016/05/18/PostgreSQL+HA+with+pgpool-II+-+Part+1
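作为日常巡检的补充,下面给出一个假设性的示意脚本,利用pgpool自带的pcp_node_count/pcp_node_info遍历后端节点状态(依赖前文配置好的~/.pcppass免密码文件);发现down(状态3)的节点,待修复后再按前文的方法用pcp_attach_node重新加入。脚本只是演示思路,输出格式的解析需按实际pgpool版本调整。

#!/bin/bash
# 示意脚本(假设性示例):巡检pgpool后端节点状态
NODE_COUNT=$(pcp_node_count -w)
for ((i=0; i<NODE_COUNT; i++)); do
    info=$(pcp_node_info -w $i)        # 输出形如: 主机名 端口 状态 权重
    status=$(echo "$info" | awk '{print $3}')
    echo "node $i: $info"
    if [ "$status" = "3" ]; then
        echo "node $i is down,修复后可执行: pcp_attach_node -w $i"
    fi
done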
健康检查
通过创建到后端的连接实施健康检查

main()
  PgpoolMain()
    processState = PERFORMING_HEALTH_CHECK;
    do_health_check()
      make_persistent_db_connection()
      discard_persistent_db_connection()

如果连接创建失败,会抛出异常,进而跳转到统一的异常处理点,如果超过重试次数,将后端降级,并最终调用pgpool.conf配置文件里设置的failover_command。

main()
  PgpoolMain()
    if(processState == PERFORMING_HEALTH_CHECK)
      process_backend_health_check_failure()
        degenerate_backend_set(&health_check_node_id,1)
          degenerate_backend_set_ex()
            register_node_operation_request(NODE_DOWN_REQUEST)
    failover()
      trigger_failover_command()

failover()的切换过程
1. 再次确认后端状态,如无效更新后端的backend_status为CON_DOWN
2. 获取第一个状态正常的后端作为new_master
3. kill所有子进程(这是基于pgpool做HA的一个很大的优势,可以可靠的切断所有来自客户端的连接,隔离故障节点)
4. 对down掉的后端执行pgpool.conf配置文件里设置的failover_command
5. 如果down掉的是primary,搜索新的primary,即第一个"SELECT pg_is_in_recovery()"返回不是t的后端
6. 重启所有子进程
7. 发送restart通知给worker进程
8. 通知PCP子进程failover/failback完成
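结合上面的流程,可以用几条psql命令大致模拟pgpool的两个核心判断:健康检查约等于"能否建立到后端的连接",primary搜索约等于"找到第一个pg_is_in_recovery()返回不是t的后端"。下面是一个假设性的示意脚本,后端列表和账号沿用本文实验环境的取值,并非pgpool的实际代码:

#!/bin/bash
# 示意脚本(假设性示例):模拟pgpool的健康检查与primary搜索逻辑
BACKENDS="node1:5433 node2:5433 node3:5433"
CONNOPT="dbname=postgres user=admin password=admin connect_timeout=5"
for b in $BACKENDS; do
    host=${b%%:*}; port=${b##*:}
    # 健康检查:能否建立连接并执行最简单的查询
    if ! psql "host=$host port=$port $CONNOPT" -Atc "select 1" >/dev/null 2>&1; then
        echo "$b: health check failed(pgpool中会触发降级并调用failover_command)"
        continue
    fi
    # primary判定:pg_is_in_recovery()返回f的是primary,其余为standby
    role=$(psql "host=$host port=$port $CONNOPT" -Atc "select pg_is_in_recovery()")
    if [ "$role" = "f" ]; then echo "$b: primary"; else echo "$b: standby"; fi
done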
前段时间有同事问MySQL 分区索引是全局索引还是本地索引。全局索引和本地索引是Oracle的功能,MySQL(包括PostgreSQL)只实现了本地索引,并且因为有全局约束的问题,MySQL分区表明确不支持外键,并且主键和唯一键必须要包含所有分区列,否则报错。 点击(此处)折叠或打开 mysql> CREATE TABLE rc (c1 INT, c2 DATE) -> PARTITION BY RANGE COLUMNS(c2) ( -> PARTITION p0 VALUES LESS THAN('1990-01-01'), -> PARTITION p1 VALUES LESS THAN('1995-01-01'), -> PARTITION p2 VALUES LESS THAN('2000-01-01'), -> PARTITION p3 VALUES LESS THAN('2005-01-01'), -> PARTITION p4 VALUES LESS THAN(MAXVALUE) -> ); Query OK, 0 rows affected (0.04 sec) mysql> create UNIQUE index idx_rc_c1 on rc(c1); ERROR 1503 (HY000): A UNIQUE INDEX must include all columns in the table 主键和唯一键是多列索引时,开头可以不是分区列,即非前缀索引 点击(此处)折叠或打开 mysql> create UNIQUE index idx_rc_c1c2 on rc(c1,c2); Query OK, 0 rows affected (0.04 sec) Records: 0 Duplicates: 0 Warnings: 0 从存储来看,MySQL的分区是在Server层实现的,每个分区对应一个存储层的表空间文件。 点击(此处)折叠或打开 [root@srdsdevapp69 ~]# ll /mysql/data/test/rc* -rw-rw---- 1 mysql mysql 8582 Nov 10 16:57 /mysql/data/test/rc.frm -rw-rw---- 1 mysql mysql 40 Nov 10 16:57 /mysql/data/test/rc.par -rw-rw---- 1 mysql mysql 147456 Nov 10 16:57 /mysql/data/test/rc#P#p0.ibd -rw-rw---- 1 mysql mysql 147456 Nov 10 16:57 /mysql/data/test/rc#P#p1.ibd -rw-rw---- 1 mysql mysql 147456 Nov 10 16:57 /mysql/data/test/rc#P#p2.ibd -rw-rw---- 1 mysql mysql 147456 Nov 10 16:57 /mysql/data/test/rc#P#p3.ibd -rw-rw---- 1 mysql mysql 147456 Nov 10 16:57 /mysql/data/test/rc#P#p4.ibd MySQL分区的详细限制,可参考手册:http://dev.mysql.com/doc/refman/5.6/en/partitioning-limitations.html参考:Oralce的本地索引和全局索引http://blog.sina.com.cn/s/blog_8317516b01011wli.htmlhttp://blog.itpub.net/29478450/viewspace-1417473/ 点击(此处)折叠或打开 分区索引分为本地(local index)索引和全局索引(global index)。 其 中本地索引又可以分为有前缀(prefix)的索引和无前缀(nonprefix)的索引。而全局索引目前只支持有前缀的索引。B树索引和位图索引都可以 分区,但是HASH索引不可以被分区。位图索引必须是本地索引。下面就介绍本地索引以及全局索引各自的特点来说明区别; 一、本地索引特点: 1. 本地索引一定是分区索引,分区键等同于表的分区键,分区数等同于表的分区说,一句话,本地索引的分区机制和表的分区机制一样。 2. 如果本地索引的索引列以分区键开头,则称为前缀局部索引。 3. 如果本地索引的列不是以分区键开头,或者不包含分区键列,则称为非前缀索引。 4. 前缀和非前缀索引都可以支持索引分区消除,前提是查询的条件中包含索引分区键。 5. 本地索引只支持分区内的唯一性,无法支持表上的唯一性,因此如果要用本地索引去给表做唯一性约束,则约束中必须要包括分区键列。 6. 本地分区索引是对单个分区的,每个分区索引只指向一个表分区,全局索引则不然,一个分区索引能指向n个表分区,同时,一个表分区,也可能指向n个索引分区,对分区表中的某个分区做truncate或者move,shrink等,可能会影响到n个全局索引分区,正因为这点,本地分区索引具有更高的可用性。 7. 位图索引只能为本地分区索引。 8. 本地索引多应用于数据仓库环境中。 本 地索引:创建了一个分区表后,如果需要在表上面创建索引,并且索引的分区机制和表的分区机制一样,那么这样的索引就叫做本地分区索引。本地索引是由 ORACLE自动管理的,它分为有前缀的本地索引和无前缀的本地索引。什么叫有前缀的本地索引?有前缀的本地索引就是包含了分区键,并且将其作为引导列的 索引。什么叫无前缀的本地索引?无前缀的本地索引就是没有将分区键的前导列作为索引的前导列的索引。 二、全局索引特点: 1.全局索引的分区键和分区数和表的分区键和分区数可能都不相同,表和全局索引的分区机制不一样。 2.全局索引可以分区,也可以是不分区索引,全局索引必须是前缀索引,即全局索引的索引列必须是以索引分区键作为其前几列。 3.全局分区索引的索引条目可能指向若干个分区,因此,对于全局分区索引,即使只截断一个分区中的数据,都需要rebulid若干个分区甚至是整个索引。 4.全局索引多应用于oltp系统中。 5.全局分区索引只按范围或者散列hash分区,hash分区是10g以后才支持。 6.oracle9i以后对分区表做move或者truncate的时可以用update global indexes语句来同步更新全局分区索引,用消耗一定资源来换取高度的可用性。 7.表用a列作分区,索引用b做局部分区索引,若where条件中用b来查询,那么oracle会扫描所有的表和索引的分区,成本会比分区更高,此时可以考虑用b做全局分区索引。 全 局索引:与本地分区索引不同的是,全局分区索引的分区机制与表的分区机制不一样。全局分区索引全局分区索引只能是B树索引,到目前为止 (10gR2),oracle只支持有前缀的全局索引。另外oracle不会自动的维护全局分区索引,当我们在对表的分区做修改之后,如果执行修改的语句 不加上update global indexes的话,那么索引将不可用。
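回到上文MySQL的rc分区表,可以通过information_schema.partitions查看各分区的元数据,与数据目录下rc#P#p0.ibd等表空间文件一一对应,也能直观地体会"分区(连同其本地索引)各自独立存放"。下面的查询假设rc表建在test库中,仅供参考:

# 查看rc表的分区元数据
mysql -uroot -p -e "
SELECT partition_name, partition_method, partition_expression,
       partition_description, table_rows
FROM information_schema.partitions
WHERE table_schema='test' AND table_name='rc';"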
Pacemaker+Corosync集群中,Corosync负责消息的传递,成员关系以及QUORUM服务,它实现了TOTEM协议确保应用也就是Pacemaker通过Corosync传递的消息是可靠有序的。由于Corosync负责通信,集群中各种网络故障的检测自然也是Corosync的事情。 TOTEM协议的作用和大名鼎鼎的Paxos,Raft一样同属是分布式共识协议。 http://blog.csdn.net/cxzhq2002/article/details/49563811 totem协议最简单的形象就是,他将多个节点组成一个令牌环。多个节点手拉手形成一个圈,大家依次的传递token。 只有获取到token的节点才有发送消息的权利。简单有效的解决了在分布式系统中各个节点的同步问题,因为只有一个节点会在一个时刻发送消息, 不会出现冲突。当然,如果有节点发生意外时,令牌环就会断掉,此时大家不能够通信,而是重新组建出一个新的令牌环。 如果要详细了解TOTEM可以参考下面的PPT和Paper TOTEM协议讲解 The Totem Single-Ring Ordering and Membership Protocol TOTEM TOTEM虽然使用不可靠的UDP通信方式,但本身有很好的网络容错机制。 断开和恢复 断开时间超过token超时时间(默认1秒),进入Gather状态,形成新的ring。网络断开期间仍然发送数据包到隔离的节点探测节点状态,网络恢复后,再次进入Gather状态,形成新的ring。从corosync感知到网络断开到发起fialover需要时间,测试结果网络闪断在3秒以下通常不会触发failover。 丢包 超过token_retransmit未收到token,触发节点重传,通过token包的seq也可以感知丢包,要求重传。测试结果90%以下的丢包不会对Corosync集群造成影响。 延迟 触发token超时,形成新的ring。延迟的包到达后,收到foreign消息(环成员以外节点发送的消息)进入Gather状态,再次形成新的ring。测试结果网络延迟在1.5秒以下时通常不会触发failover。 乱序和重复 通过ring number和seq可以识别包的顺序,确保投递给应用的消息是有序的。 token超时 调整Corosync行为最重要的一个参数就是token超时时间 与OP协议相关的Corosync选项 ?token_retransmit –Processor在转发完token后,在多长时间内没有收到token或消息后,将引发token重传。 –默认值:238ms –如果设置了下面的token值,本值由程序自动计算。 ?token –Processor在多长时间内没有收到token(中间包含token重传)后,将触发token丢失事件(将激活MembershipProtocol,进入Gather状态)。 –默认值:1000ms 本值等于Token在Ring中循环一圈的时间,这个时间取决了三个因素:结点数,结点之间的网络速率,每个结点在拿到token后可以发送的max_messages。 token超时时间默认为1秒,把它改成5秒后,至少可以容忍5秒的闪断和延迟。 vi /etc/corosync/corosync.conf totem { ... token:5000 } 下面是token超时时间为5秒,断开3节点集群中1个节点(dbhost03)和其它节点的网络时corosync的日志输出。 iptables -A INPUT -j DROP -s dbhost01 iptables -A OUTPUT -j DROP -d dbhost01 iptables -A INPUT -j DROP -s dbhost02 iptables -A OUTPUT -j DROP -d dbhost02 tail -f /var/log/cluster/corosync.log|grep corosync dbhost01和dbhost02上经过一段时间形成新的ring Nov 01 09:44:38 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-1688-6915-25-header Nov 01 09:44:38 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-1688-6915-25-header Nov 01 09:44:38 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-1688-6915-25-header Nov 01 09:44:38 [6915] dbhost01 crm_node: info: corosync_node_name: Unable to get node name for nodeid 1 Nov 01 09:44:38 [6915] dbhost01 crm_node: notice: get_node_name: Defaulting to uname -n for the local corosync node name Nov 01 09:44:43 [1680] dbhost01 corosync debug [TOTEM ] The token was lost in the OPERATIONAL state. Nov 01 09:44:43 [1680] dbhost01 corosync notice [TOTEM ] A processor failed, forming new configuration. Nov 01 09:44:43 [1680] dbhost01 corosync debug [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.). 
Nov 01 09:44:43 [7121] dbhost01 crm_node: info: get_cluster_type: Verifying cluster type: 'corosync' Nov 01 09:44:43 [7121] dbhost01 crm_node: info: get_cluster_type: Assuming an active 'corosync' cluster Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] IPC credentials authenticated (1688-7121-25) Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] connecting to client [7121] Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:43 [1680] dbhost01 corosync debug [MAIN ] connection created Nov 01 09:44:43 [1680] dbhost01 corosync debug [CPG ] lib_init_fn: conn=0x7f1ff377c7d0, cpd=0x7f1ff377ce84 Nov 01 09:44:43 [1680] dbhost01 corosync debug [CPG ] cpg finalize for conn=0x7f1ff377c7d0 Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] HUP conn (1688-7121-25) Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] qb_ipcs_disconnect(1688-7121-25) state:2 Nov 01 09:44:43 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:43 [1680] dbhost01 corosync debug [CPG ] exit_fn for conn=0x7f1ff377c7d0 Nov 01 09:44:43 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-1688-7121-25-header Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-1688-7121-25-header Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-1688-7121-25-header Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] IPC credentials authenticated (1688-7121-25) Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] connecting to client [7121] Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:43 [1680] dbhost01 corosync debug [MAIN ] connection created Nov 01 09:44:43 [1680] dbhost01 corosync debug [CMAP ] lib_init_fn: conn=0x7f1ff37794a0 Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] HUP conn (1688-7121-25) Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] qb_ipcs_disconnect(1688-7121-25) state:2 Nov 01 09:44:43 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:43 [1680] dbhost01 corosync debug [CMAP ] exit_fn for conn=0x7f1ff37794a0 Nov 01 09:44:43 [7121] dbhost01 crm_node: info: corosync_node_name: Unable to get node name for nodeid 1 Nov 01 09:44:43 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-1688-7121-25-header Nov 01 09:44:43 [7121] dbhost01 crm_node: notice: get_node_name: Defaulting to uname -n for the local corosync node name Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-1688-7121-25-header Nov 01 09:44:43 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-1688-7121-25-header Nov 01 09:44:49 [7196] dbhost01 crm_node: info: get_cluster_type: Verifying cluster type: 'corosync' 
Nov 01 09:44:49 [7196] dbhost01 crm_node: info: get_cluster_type: Assuming an active 'corosync' cluster Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] IPC credentials authenticated (1688-7196-25) Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] connecting to client [7196] Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] connection created Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] lib_init_fn: conn=0x7f1ff37794a0, cpd=0x7f1ff377a2f4 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] cpg finalize for conn=0x7f1ff37794a0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] HUP conn (1688-7196-25) Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] qb_ipcs_disconnect(1688-7196-25) state:2 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] exit_fn for conn=0x7f1ff37794a0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-1688-7196-25-header Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-1688-7196-25-header Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-1688-7196-25-header Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] IPC credentials authenticated (1688-7196-25) Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] connecting to client [7196] Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] connection created Nov 01 09:44:49 [1680] dbhost01 corosync debug [CMAP ] lib_init_fn: conn=0x7f1ff37794a0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] HUP conn (1688-7196-25) Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] qb_ipcs_disconnect(1688-7196-25) state:2 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:49 [1680] dbhost01 corosync debug [CMAP ] exit_fn for conn=0x7f1ff37794a0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-1688-7196-25-header Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-1688-7196-25-header Nov 01 09:44:49 [1680] dbhost01 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-1688-7196-25-header Nov 01 09:44:49 [7196] dbhost01 crm_node: info: corosync_node_name: Unable to get node name for nodeid 1 Nov 01 09:44:49 [7196] dbhost01 crm_node: notice: get_node_name: Defaulting to uname -n for the local corosync node name Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] entering GATHER state from 0(consensus timeout). Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] Creating commit token because I am the rep. 
Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] Saving state aru 63b high seq received 63b Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] Storing new sequence id for ring 7f160 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] entering COMMIT state. Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] got commit token Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] entering RECOVERY state. Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] TRANS [0] member 10.37.20.193: Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] TRANS [1] member 10.37.20.195: Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] position [0] member 10.37.20.193: Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] previous ring seq 7f15c rep 10.37.20.193 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] aru 63b high delivered 63b received flag 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] position [1] member 10.37.20.195: Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] previous ring seq 7f15c rep 10.37.20.193 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] aru 63b high delivered 63b received flag 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] Did not need to originate any messages in recovery. Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] got commit token Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] Sending initial ORF token Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] Resetting old ring state Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] recovery to regular 1-0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [MAIN ] Member left: r(0) ip(10.37.20.196) Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] waiting_trans_ack changed to 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] entering OPERATIONAL state. Nov 01 09:44:49 [1680] dbhost01 corosync notice [TOTEM ] A new membership (10.37.20.193:520544) was formed. 
Members left: 3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [SYNC ] Committing synchronization for corosync configuration map access Nov 01 09:44:49 [1680] dbhost01 corosync debug [CMAP ] Not first sync -> no action Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] got joinlist message from node 2 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] comparing: sender r(0) ip(10.37.20.193) ; members(old:3 left:1) Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] comparing: sender r(0) ip(10.37.20.195) ; members(old:3 left:1) Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] chosen downlist: sender r(0) ip(10.37.20.193) ; members(old:3 left:1) Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list_entries:1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list[0] group:attrd\x00, ip:r(0) ip(10.37.20.196) , pid:24157 Nov 01 09:44:49 [1792] dbhost01 attrd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost03[3] - corosync-cpg is now offline Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list_entries:1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list[0] group:cib\x00, ip:r(0) ip(10.37.20.196) , pid:24154 Nov 01 09:44:49 [1789] dbhost01 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost03[3] - corosync-cpg is now offline Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list_entries:1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list[0] group:crmd\x00, ip:r(0) ip(10.37.20.196) , pid:24159 Nov 01 09:44:49 [1794] dbhost01 crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost03[3] - corosync-cpg is now offline Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list_entries:1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list[0] group:pacemakerd\x00, ip:r(0) ip(10.37.20.196) , pid:24152 Nov 01 09:44:49 [1787] dbhost01 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost03[3] - corosync-cpg is now offline Nov 01 09:44:49 [1790] dbhost01 stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost03[3] - corosync-cpg is now offline Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list_entries:1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] left_list[0] group:stonith-ng\x00, ip:r(0) ip(10.37.20.196) , pid:24155 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] got joinlist message from node 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(10.37.20.193) , pid:1794 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(10.37.20.193) , pid:1792 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(10.37.20.193) , pid:1790 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(10.37.20.193) , pid:1789 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[4] group:pacemakerd\x00, ip:r(0) ip(10.37.20.193) , pid:1787 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[5] group:crmd\x00, ip:r(0) ip(10.37.20.195) , pid:437 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[6] group:attrd\x00, ip:r(0) ip(10.37.20.195) , pid:435 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[7] group:stonith-ng\x00, 
ip:r(0) ip(10.37.20.195) , pid:433 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[8] group:cib\x00, ip:r(0) ip(10.37.20.195) , pid:432 Nov 01 09:44:49 [1680] dbhost01 corosync debug [CPG ] joinlist_messages[9] group:pacemakerd\x00, ip:r(0) ip(10.37.20.195) , pid:430 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] got nodeinfo message from cluster node 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 3 flags: 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] total_votes=2, expected_votes=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] node 1 state=1, votes=1, expected=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] node 2 state=1, votes=1, expected=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] node 3 state=2, votes=1, expected=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] lowest node id: 1 us: 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] got nodeinfo message from cluster node 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] got nodeinfo message from cluster node 2 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 3 flags: 1 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] got nodeinfo message from cluster node 2 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [SYNC ] Committing synchronization for corosync vote quorum service v1.0 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] total_votes=2, expected_votes=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] node 1 state=1, votes=1, expected=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] node 2 state=1, votes=1, expected=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] node 3 state=2, votes=1, expected=3 Nov 01 09:44:49 [1680] dbhost01 corosync debug [VOTEQ ] lowest node id: 1 us: 1 Nov 01 09:44:49 [1680] dbhost01 corosync notice [QUORUM] Members[2]: 1 2 Nov 01 09:44:49 [1680] dbhost01 corosync debug [QUORUM] sending quorum notification to (nil), length = 56 Nov 01 09:44:49 [1680] dbhost01 corosync notice [MAIN ] Completed service synchronization, ready to provide service. 
Nov 01 09:44:49 [1680] dbhost01 corosync debug [TOTEM ] waiting_trans_ack changed to 0
dbhost03 also formed a ring that contains only itself:
Nov 01 09:44:40 [1777] dbhost03 crm_node: info: get_cluster_type: Verifying cluster type: 'corosync' Nov 01 09:44:40 [1777] dbhost03 crm_node: info: get_cluster_type: Assuming an active 'corosync' cluster Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] IPC credentials authenticated (24024-1777-25) Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] connecting to client [1777] Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:40 [24023] dbhost03 corosync debug [MAIN ] connection created Nov 01 09:44:40 [24023] dbhost03 corosync debug [CPG ] lib_init_fn: conn=0x7f1243543fd0, cpd=0x7f124323c594 Nov 01 09:44:40 [24023] dbhost03 corosync debug [CPG ] cpg finalize for conn=0x7f1243543fd0 Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] HUP conn (24024-1777-25) Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] qb_ipcs_disconnect(24024-1777-25) state:2 Nov 01 09:44:40 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:40 [24023] dbhost03 corosync debug [CPG ] exit_fn for conn=0x7f1243543fd0 Nov 01 09:44:40 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-24024-1777-25-header Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-24024-1777-25-header Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-24024-1777-25-header Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] IPC credentials authenticated (24024-1777-25) Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] connecting to client [1777] Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:40 [24023] dbhost03 corosync debug [MAIN ] connection created Nov 01 09:44:40 [24023] dbhost03 corosync debug [CMAP ] lib_init_fn: conn=0x7f1243543fd0 Nov 01 09:44:40 [1777] dbhost03 crm_node: info: corosync_node_name: Unable to get node name for nodeid 3 Nov 01 09:44:40 [1777] dbhost03 crm_node: notice: get_node_name: Defaulting to uname -n for the local corosync node name Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] HUP conn (24024-1777-25) Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] qb_ipcs_disconnect(24024-1777-25) state:2 Nov 01 09:44:40 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:40 [24023] dbhost03 corosync debug [CMAP ] exit_fn for conn=0x7f1243543fd0 Nov 01 09:44:40 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-24024-1777-25-header Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-24024-1777-25-header Nov 01 09:44:40 [24023] dbhost03 corosync debug [QB ] Free'ing
ringbuffer: /dev/shm/qb-cmap-request-24024-1777-25-header Nov 01 09:44:40 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:40 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:40 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:41 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:42 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] The token was lost in the OPERATIONAL state. Nov 01 09:44:43 [24023] dbhost03 corosync notice [TOTEM ] A processor failed, forming new configuration. Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.). Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:43 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed 
(non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:44 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] 
sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync 
debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:45 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [2075] dbhost03 crm_node: info: get_cluster_type: Verifying cluster type: 'corosync' Nov 01 09:44:46 [2075] dbhost03 crm_node: info: get_cluster_type: Assuming an active 'corosync' cluster Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] IPC credentials authenticated (24024-2075-25) Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] connecting to client [2075] Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:46 [24023] dbhost03 corosync debug [MAIN ] connection created Nov 01 09:44:46 [24023] dbhost03 corosync debug [CPG ] lib_init_fn: conn=0x7f12435520a0, cpd=0x7f124323c594 Nov 01 09:44:46 [24023] dbhost03 corosync debug [CPG ] cpg finalize for conn=0x7f12435520a0 Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] HUP conn (24024-2075-25) Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] qb_ipcs_disconnect(24024-2075-25) state:2 Nov 01 09:44:46 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:46 [24023] dbhost03 corosync debug [CPG ] exit_fn for conn=0x7f12435520a0 Nov 01 09:44:46 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: 
/dev/shm/qb-cpg-response-24024-2075-25-header Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-24024-2075-25-header Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-24024-2075-25-header Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] IPC credentials authenticated (24024-2075-25) Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] connecting to client [2075] Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Nov 01 09:44:46 [24023] dbhost03 corosync debug [MAIN ] connection created Nov 01 09:44:46 [24023] dbhost03 corosync debug [CMAP ] lib_init_fn: conn=0x7f12435520a0 Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] HUP conn (24024-2075-25) Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] qb_ipcs_disconnect(24024-2075-25) state:2 Nov 01 09:44:46 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_closed() Nov 01 09:44:46 [24023] dbhost03 corosync debug [CMAP ] exit_fn for conn=0x7f12435520a0 Nov 01 09:44:46 [24023] dbhost03 corosync debug [MAIN ] cs_ipcs_connection_destroyed() Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-24024-2075-25-header Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-24024-2075-25-header Nov 01 09:44:46 [24023] dbhost03 corosync debug [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-24024-2075-25-header Nov 01 09:44:46 [2075] dbhost03 crm_node: info: corosync_node_name: Unable to get node name for nodeid 3 Nov 01 09:44:46 [2075] dbhost03 crm_node: notice: get_node_name: Defaulting to uname -n for the local corosync node name Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed 
(non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:46 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] 
sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync 
debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:47 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] 
dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:48 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 
09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] entering GATHER state from 0(consensus timeout). Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] Creating commit token because I am the rep. Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] Saving state aru 63b high seq received 63b Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] Storing new sequence id for ring 7f160 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] entering COMMIT state. Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] got commit token Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] entering RECOVERY state. 
Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] TRANS [0] member 10.37.20.196: Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] position [0] member 10.37.20.196: Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] previous ring seq 7f15c rep 10.37.20.193 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] aru 63b high delivered 63b received flag 1 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] Did not need to originate any messages in recovery. Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] got commit token Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] Sending initial ORF token Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] install seq 0 aru 0 high seq received 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] Resetting old ring state Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] recovery to regular 1-0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [MAIN ] Member left: r(0) ip(10.37.20.193) Nov 01 09:44:49 [24023] dbhost03 corosync debug [MAIN ] Member left: r(0) ip(10.37.20.195) Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] waiting_trans_ack changed to 1 Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] entering OPERATIONAL state. Nov 01 09:44:49 [24023] dbhost03 corosync notice [TOTEM ] A new membership (10.37.20.196:520544) was formed. 
Members left: 1 2 Nov 01 09:44:49 [24023] dbhost03 corosync debug [SYNC ] Committing synchronization for corosync configuration map access Nov 01 09:44:49 [24023] dbhost03 corosync debug [CMAP ] Not first sync -> no action Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] comparing: sender r(0) ip(10.37.20.196) ; members(old:3 left:2) Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] chosen downlist: sender r(0) ip(10.37.20.196) ; members(old:3 left:2) Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list_entries:2 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[0] group:attrd\x00, ip:r(0) ip(10.37.20.193) , pid:1792 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[1] group:attrd\x00, ip:r(0) ip(10.37.20.195) , pid:435 Nov 01 09:44:49 [24157] dbhost03 attrd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost01[1] - corosync-cpg is now offline Nov 01 09:44:49 [24157] dbhost03 attrd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost02[2] - corosync-cpg is now offline Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list_entries:2 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[0] group:cib\x00, ip:r(0) ip(10.37.20.193) , pid:1789 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[1] group:cib\x00, ip:r(0) ip(10.37.20.195) , pid:432 Nov 01 09:44:49 [24154] dbhost03 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost01[1] - corosync-cpg is now offline Nov 01 09:44:49 [24154] dbhost03 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost02[2] - corosync-cpg is now offline Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list_entries:2 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[0] group:crmd\x00, ip:r(0) ip(10.37.20.193) , pid:1794 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[1] group:crmd\x00, ip:r(0) ip(10.37.20.195) , pid:437 Nov 01 09:44:49 [24159] dbhost03 crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost01[1] - corosync-cpg is now offline Nov 01 09:44:49 [24159] dbhost03 crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost02[2] - corosync-cpg is now offline Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list_entries:2 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[0] group:pacemakerd\x00, ip:r(0) ip(10.37.20.193) , pid:1787 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[1] group:pacemakerd\x00, ip:r(0) ip(10.37.20.195) , pid:430 Nov 01 09:44:49 [24152] dbhost03 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost01[1] - corosync-cpg is now offline Nov 01 09:44:49 [24152] dbhost03 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost02[2] - corosync-cpg is now offline Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list_entries:2 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[0] group:stonith-ng\x00, ip:r(0) ip(10.37.20.193) , pid:1790 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] left_list[1] group:stonith-ng\x00, ip:r(0) ip(10.37.20.195) , pid:433 Nov 01 09:44:49 [24155] dbhost03 stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost01[1] - corosync-cpg is now offline Nov 01 09:44:49 [24155] dbhost03 stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node dbhost02[2] - corosync-cpg is now offline Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] got joinlist message from node 3 Nov 01 09:44:49 [24023] 
dbhost03 corosync debug [SYNC ] Committing synchronization for corosync cluster closed process group service v1.01 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(10.37.20.196) , pid:24159 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(10.37.20.196) , pid:24157 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(10.37.20.196) , pid:24155 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(10.37.20.196) , pid:24154 Nov 01 09:44:49 [24023] dbhost03 corosync debug [CPG ] joinlist_messages[4] group:pacemakerd\x00, ip:r(0) ip(10.37.20.196) , pid:24152 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] got nodeinfo message from cluster node 3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 3 flags: 1 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] total_votes=1, expected_votes=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] node 1 state=2, votes=1, expected=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] node 2 state=2, votes=1, expected=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] node 3 state=1, votes=1, expected=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] quorum lost, blocking activity Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] got nodeinfo message from cluster node 3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [SYNC ] Committing synchronization for corosync vote quorum service v1.0 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] total_votes=1, expected_votes=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] node 1 state=2, votes=1, expected=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] node 2 state=2, votes=1, expected=3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [VOTEQ ] node 3 state=1, votes=1, expected=3 Nov 01 09:44:49 [24023] dbhost03 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. Nov 01 09:44:49 [24023] dbhost03 corosync notice [QUORUM] Members[1]: 3 Nov 01 09:44:49 [24023] dbhost03 corosync debug [QUORUM] sending quorum notification to (nil), length = 52 Nov 01 09:44:49 [24023] dbhost03 corosync notice [MAIN ] Completed service synchronization, ready to provide service. 
Nov 01 09:44:49 [24023] dbhost03 corosync debug [TOTEM ] waiting_trans_ack changed to 0 Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1) Nov 01 09:44:50 [24023] dbhost03 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
Background
MySQL 5.6 supports the slave crash-safe feature. Based on the common understanding of this feature, data consistency after a slave crash should be guaranteed when the following parameters are configured:
relay_log_info_repository = TABLE
relay_log_recovery = ON
However, after setting up GTID-based replication (MASTER_AUTO_POSITION=1) on MySQL 5.6 (Percona Server 5.6.31) and running a slave crash test, the slave came back with replication working normally but the data inconsistent (note: no error was reported, and the GTID sets of master and slave were also identical!!!).
MySQL parameters
log_bin = /data/mysql/binlog
sync_binlog = 1
innodb_flush_log_at_trx_commit = 2
log_slave_updates = true
gtid_mode = on
enforce_gtid_consistency = true
master_info_repository = TABLE
relay_log_info_repository = TABLE
relay_log_recovery = ON
Test method
Set up replication
Set up master-slave replication using the parameters above.
Create a test table
mysql> show create table test.test1;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                               |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------+
| test1 | CREATE TABLE `test1` ( `id` int(11) NOT NULL, `num` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4       |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Insert data
Run a script that keeps executing the following insert statement (a minimal sketch of such a loop is given after the documentation quote at the end of this part):
insert into test1 values($i,$i)
Simulate a slave crash
The test environment runs on virtual machines. A slave crash is simulated by killing the slave VM's process, then restarting the slave VM and starting mysql again.
Symptom
After the slave VM is restarted and mysql is started, replication works normally, but querying the data shows that some rows are missing.
mysql> select count(*),max(id) from test1;
+----------+---------+
| count(*) | max(id) |
+----------+---------+
|    21997 |   22009 |
+----------+---------+
1 row in set (0.00 sec)
Normally max(id) should be equal to count(*).
The MySQL error log contains the following messages.
2016-10-24 01:30:31 2637 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 44, event_type: 30
2016-10-24 01:30:31 2637 [Warning] Error reading GTIDs from binary log: -1
2016-10-24 01:30:32 2637 [Warning] Recovery from master pos 33445082 and file binlog.000001. Previous relay log pos and relay log file had been set to 1276427, ./mysql-relay-bin.000010 respectively.
2016-10-24 01:30:32 2637 [Note] Slave SQL thread initialized, starting replication in log 'binlog.000001' at position 33445082, relay log './mysql-relay-bin.000011' position: 4
2016-10-24 01:30:32 2637 [Note] Slave I/O thread: Start semi-sync replication to master 'sn_repl@192.168.236.103:3306' in log 'binlog.000001' at position 33445082
2016-10-24 01:30:32 2637 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2016-10-24 01:30:32 2637 [Note] Slave I/O thread: connected to master 'sn_repl@192.168.236.103:3306',replication started in log 'binlog.000001' at position 33445082
2016-10-24 01:30:32 2637 [Note] Event Scheduler: Loaded 0 events
2016-10-24 01:30:32 2637 [Note] mysqld: ready for connections. Version: '5.6.31-77.0-log' socket: '/data/mysql/mysql.sock' port: 3306 Percona Server (GPL), Release 77.0, Revision 5c1061c
Cause
See the official documentation:
http://dev.mysql.com/doc/refman/5.6/en/replication-solutions-unexpected-slave-halt.html
When using GTIDs and MASTER_AUTO_POSITION, set relay_log_recovery=0. With this configuration the setting of relay_log_info_repository and other variables does not impact on recovery.
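As referenced in the test method above, the insert workload can be driven by a small loop along these lines. This is only a minimal sketch: it assumes it is run on the master with a local mysql client and credentials picked up from ~/.my.cnf; none of that comes from the original test script.
#!/bin/bash
# Minimal sketch: keep inserting rows into test.test1 on the master until interrupted
# or until an insert fails. Connection options are assumptions; adjust as needed.
i=1
while mysql test -e "insert into test1 values($i,$i)"; do
    i=$((i+1))
done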
The relay_log_recovery=0 in the documentation quoted above is a typo; it should be relay_log_recovery=1. I filed a bug report (https://bugs.mysql.com/bug.php?id=83711) and it has been confirmed.

The parameters used in the test satisfy the documentation's requirements, yet they did not make the slave crash safe.

Further investigation shows that the root cause lies in the following two parameters:

sync_binlog = 1
innodb_flush_log_at_trx_commit = 2

When GTIDs are used with MASTER_AUTO_POSITION=1, the position from which the slave resumes applying the binlog is derived from Executed_Gtid_Set (not from the binlog position saved in mysql.slave_relay_log_info), and Executed_Gtid_Set is obtained by scanning the binlog when mysqld starts. Because sync_binlog=1, the position recorded in the binlog is up to date, while innodb_flush_log_at_trx_commit=2 does not flush the redo log to disk on every commit, so some data in InnoDB's redo log is lost after the crash; in other words, the slave's data files and Executed_Gtid_Set are inconsistent. This is why some updates are lost after replication restarts.

A related symptom appears with the opposite configuration: flushing InnoDB's redo log synchronously while flushing the binlog asynchronously makes the slave replay binlog events it has already executed, which in this test shows up as primary key conflicts.

sync_binlog = 0
innodb_flush_log_at_trx_commit = 1

Solution

Set both of the following parameters to 1 ("double 1") to keep the data files consistent with the binlog. After this change the problem no longer occurs.

sync_binlog = 1
innodb_flush_log_at_trx_commit = 1

That is, with GTIDs enabled and MASTER_AUTO_POSITION=1 on MySQL 5.6, the parameter settings that make the slave crash safe are:

relay_log_recovery = 1
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1

Flushing both the binlog and the redo log on every transaction commit is the safest setting, but it has some impact on performance.

MySQL 5.7 offers another solution. It adds a system table, mysql.gtid_executed, to persist the executed GTIDs. When log_bin is disabled, or on a slave where log_slave_updates is disabled, a GTID record is inserted into mysql.gtid_executed on every transaction commit, similar to how relay_log_info_repository is handled; sync_binlog and innodb_flush_log_at_trx_commit then no longer need to be set to 1 to keep gtid_executed consistent with the actual data files. However, when log_bin is enabled (or log_slave_updates is enabled on the slave) in MySQL 5.7, mysql.gtid_executed is only written on events such as flushing the logs or stopping the server, so a slave crash can still leave gtid_executed inconsistent with the data files.

Checking the bug records afterwards, this turns out to be a known bug, discovered back in 2013: http://bugs.mysql.com/bug.php?id=70659

Follow-up question

If position-based replication is used instead, is it then unnecessary to flush both the binlog and the redo log on every commit?

With relay_log_info_repository=TABLE, the slave's applied position is committed in the same transaction as the applied event, so the two are guaranteed to be consistent. Note, however, that position-based replication with multi-threaded replication has the problem of gaps and the Gap-free low-watermark position, so it cannot guarantee data consistency either; see:
http://dev.mysql.com/doc/refman/5.7/en/replication-features-transaction-inconsistencies.html

References

http://dev.mysql.com/doc/refman/5.6/en/replication-solutions-unexpected-slave-halt.html
http://dev.mysql.com/doc/refman/5.7/en/replication-features-transaction-inconsistencies.html
http://mysqllover.com/?p=594
https://yq.aliyun.com/articles/41152
https://yq.aliyun.com/articles/50873
http://www.tuicool.com/articles/AV3eqaz

Related code

When mysqld starts, it reads the executed GTIDs from the binlog files, starting from the last file:

mysqld_main()
  mysql_bin_log.init_gtid_sets()
    read_gtids_from_binlog()

GTID-based binlog dump:

Rpl_slave.cc:request_dump()
Rpl_master.cc:com_binlog_dump_gtid()
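Appendix: test sketch

For reference, below is a minimal bash sketch of the test workflow described in this article: a load loop against the master, and the post-restart checks on both nodes. It is not the original script; the slave host name, the database name ("test") and the credential handling (a ~/.my.cnf file) are assumptions, and the master address is simply the one that appears in the error log above.

#!/bin/bash
# Minimal sketch of the crash-safe test workflow -- NOT the original script.
# Assumptions: database "test", credentials read from ~/.my.cnf, hosts as below.
MASTER_HOST=192.168.236.103   # master address as it appears in the error log above
SLAVE_HOST=slave.example      # placeholder for the slave address

# Step 1 (run first): keep inserting rows on the master; the slave VM is
# killed while this loop is running, then restarted.
i=1
while mysql -h "$MASTER_HOST" test -e "insert into test1 values($i,$i)"; do
    i=$((i+1))
done

# Step 2 (run separately, after the slave is back up): compare GTID state and
# data on both sides. With the problematic settings the GTID sets match,
# but on the slave count(*) is smaller than max(id).
mysql -h "$MASTER_HOST" -e "select @@global.gtid_executed\G"
mysql -h "$SLAVE_HOST"  -e "select @@global.gtid_executed\G"
mysql -h "$MASTER_HOST" test -e "select count(*), max(id) from test1"
mysql -h "$SLAVE_HOST"  test -e "select count(*), max(id) from test1"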
After installing the pacemaker RPM packages, pacemaker failed to start. The cause was related to how dynamic libraries are loaded; the details are below.

Problem

We built RPM packages for pacemaker 1.1.15 and then installed them on another machine; pacemaker failed to start.

[root@srdsdevapp73 ~]# service pacemaker start
Starting Pacemaker Cluster Manager                         [FAILED]

Environment

CentOS 6.3 64bit

Cause

strace shows that pacemaker fails to start because it cannot load the library libcoroipcc.so.4:

[root@srdsdevapp73 ~]# strace -f service pacemaker start
...
[pid 19960] writev(2, [{"pacemakerd", 10}, {": ", 2}, {"error while loading shared libra"..., 36}, {": ", 2}, {"libcoroipcc.so.4", 16}, {": ", 2}, {"cannot open shared object file", 30}, {": ...

Checking pacemakerd with ldd shows that three libraries in total cannot be found:

[root@srdsdevapp73 ~]# ldd /usr/sbin/pacemakerd
    linux-vdso.so.1 =>  (0x00007fffc4c9f000)
    libcrmcluster.so.4 => /usr/lib/libcrmcluster.so.4 (0x0000003cbac00000)
    libstonithd.so.2 => /usr/lib/libstonithd.so.2 (0x0000003cba400000)
    libcrmcommon.so.3 => /usr/lib/libcrmcommon.so.3 (0x0000003cb4c00000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003cb3c00000)
    libcpg.so.4 => /usr/lib64/libcpg.so.4 (0x00007f3f72199000)
    libcfg.so.6 => /usr/lib64/libcfg.so.6 (0x00007f3f71f95000)
    libcmap.so.4 => /usr/lib64/libcmap.so.4 (0x00007f3f71d8f000)
    libquorum.so.5 => /usr/lib64/libquorum.so.5 (0x00007f3f71b8b000)
    libgnutls.so.26 => /usr/lib64/libgnutls.so.26 (0x0000003cb8800000)
    libcorosync_common.so.4 => /usr/lib64/libcorosync_common.so.4 (0x00007f3f71988000)
    libplumb.so.2 => /usr/lib64/libplumb.so.2 (0x00007f3f71754000)
    libpils.so.2 => /usr/lib64/libpils.so.2 (0x00007f3f7154b000)
    libqb.so.0 => /usr/lib64/libqb.so.0 (0x00007f3f712e6000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cb2c00000)
    libbz2.so.1 => /lib64/libbz2.so.1 (0x0000003cb7000000)
    libxslt.so.1 => /usr/lib64/libxslt.so.1 (0x0000003cb4800000)
    libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003cb6000000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003cb2400000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x0000003cb5000000)
    libpam.so.0 => /lib64/libpam.so.0 (0x0000003cb6c00000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003cb3000000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003cb2800000)
    libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x0000003cb3800000)
    libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x0000003cb8400000)
    libcoroipcc.so.4 => not found
    libcfg.so.4 => not found
    libconfdb.so.4 => not found
    libtasn1.so.3 => /usr/lib64/libtasn1.so.3 (0x0000003cb7800000)
    libz.so.1 => /lib64/libz.so.1 (0x0000003cb3400000)
    libgcrypt.so.11 => /lib64/libgcrypt.so.11 (0x0000003cb7400000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003cb2000000)
    libaudit.so.1 => /lib64/libaudit.so.1 (0x0000003cb6400000)
    libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000003cb5c00000)
    libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x0000003cb6800000)
    libfreebl3.so => /lib64/libfreebl3.so (0x0000003cb5800000)

The entry "/usr/lib/libcrmcluster.so.4" above looks odd. On closer inspection the file turned out to be wrong: it belonged to a previously installed version (how it got installed in the first place is no longer clear). The correct location for the library should be "/usr/lib64/libcrmcluster.so.4". After deleting the old pacemaker libraries, everything works:

[root@srdsdevapp73 ~]# rm -f /usr/lib/libcrm*
[root@srdsdevapp73 ~]# rm -f /usr/lib/libstonithd.*
[root@srdsdevapp73 ~]# ldd /usr/sbin/pacemakerd
    linux-vdso.so.1 =>  (0x00007fff9a3ff000)
    libcrmcluster.so.4 => /usr/lib64/libcrmcluster.so.4 (0x00007f849a1fc000)
    libstonithd.so.2 => /usr/lib64/libstonithd.so.2 (0x00007f8499fea000)
    libcrmcommon.so.3 => /usr/lib64/libcrmcommon.so.3 (0x00007f8499d93000)
    libm.so.6 => /lib64/libm.so.6 (0x0000003cb3c00000)
    libcpg.so.4 => /usr/lib64/libcpg.so.4 (0x00007f8499b8c000)
    libcfg.so.6 => /usr/lib64/libcfg.so.6 (0x00007f8499988000)
    libcmap.so.4 => /usr/lib64/libcmap.so.4 (0x00007f8499782000)
    libquorum.so.5 => /usr/lib64/libquorum.so.5 (0x00007f849957e000)
    libgnutls.so.26 => /usr/lib64/libgnutls.so.26 (0x0000003cb8800000)
    libcorosync_common.so.4 => /usr/lib64/libcorosync_common.so.4 (0x00007f849937b000)
    libplumb.so.2 => /usr/lib64/libplumb.so.2 (0x00007f8499147000)
    libpils.so.2 => /usr/lib64/libpils.so.2 (0x00007f8498f3e000)
    libqb.so.0 => /usr/lib64/libqb.so.0 (0x00007f8498cd9000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cb2c00000)
    libbz2.so.1 => /lib64/libbz2.so.1 (0x0000003cb7000000)
    libxslt.so.1 => /usr/lib64/libxslt.so.1 (0x0000003cb4800000)
    libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003cb6000000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003cb2400000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x0000003cb5000000)
    libpam.so.0 => /lib64/libpam.so.0 (0x0000003cb6c00000)
    librt.so.1 => /lib64/librt.so.1 (0x0000003cb3000000)
    libdl.so.2 => /lib64/libdl.so.2 (0x0000003cb2800000)
    libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x0000003cb3800000)
    libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x0000003cb8400000)
    libtasn1.so.3 => /usr/lib64/libtasn1.so.3 (0x0000003cb7800000)
    libz.so.1 => /lib64/libz.so.1 (0x0000003cb3400000)
    libgcrypt.so.11 => /lib64/libgcrypt.so.11 (0x0000003cb7400000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003cb2000000)
    libaudit.so.1 => /lib64/libaudit.so.1 (0x0000003cb6400000)
    libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000003cb5c00000)
    libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x0000003cb6800000)
    libfreebl3.so => /lib64/libfreebl3.so (0x0000003cb5800000)
[root@srdsdevapp73 ~]# service pacemaker start
Starting Pacemaker Cluster Manager                         [  OK  ]

Summary

The default paths in which Linux looks for dynamic libraries (when they are not configured in /etc/ld.so.conf; at load time the dynamic linker first consults /etc/ld.so.cache) are, in order, the ones listed below. If a stale library file with the same name sits earlier in the search order, loading the dynamic library can fail.

/lib
/usr/lib
/lib64
/usr/lib64
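As a quick way to run the same diagnosis elsewhere, the following shell sketch gathers the checks used in this case into one place. The binary path and library names are the ones from this incident and are only examples; adapt them to whatever binary fails to start.

# Sketch: check whether a binary fails because of missing or shadowed libraries.
BIN=/usr/sbin/pacemakerd

# 1. Any unresolved libraries?
ldd "$BIN" | grep 'not found'

# 2. Any libraries resolved from /usr/lib instead of /usr/lib64 on a 64-bit system?
ldd "$BIN" | awk '$3 ~ /^\/usr\/lib\//'

# 3. Does any installed package own the suspicious file, or is it a leftover?
rpm -qf /usr/lib/libcrmcluster.so.4

# 4. What does the dynamic linker cache currently resolve for this library?
ldconfig -p | grep libcrmcluster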