Greenplum Node Failure Recovery Guide


Greenplum cluster overview

A Greenplum cluster is made up of multiple servers, and any one of them can suffer a software or hardware failure. Below we simulate a failure of each node role in turn and walk through Greenplum's fault tolerance and the corresponding recovery procedure.

master  server: l-test5
standby server: l-test6
segment servers (primary + mirror): l-test[7-12]

Greenplum cluster layout: 1 master + 1 standby + 24 primary segments + 24 mirror segments
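
As a quick cross-check of this layout, the segment topology can also be read from the catalog on the master. A minimal sketch (gp_segment_configuration is the standard Greenplum catalog table; role p/m means primary/mirror, status u/d means up/down):

[gpadmin@l-test5 ~]$ psql -c "SELECT content, role, preferred_role, mode, status, hostname, port FROM gp_segment_configuration ORDER BY content, role;"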

Healthy cluster state

Check the current database status on the master:
[gpadmin@l-test5 ~]$ gpstate
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: 
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:23:43:004860 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
.. 
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-Greenplum instance status summary
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Master instance                                           = Active
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Master standby                                            = l-test6
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Standby master state                                      = Standby host passive
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total segment instance count from metadata                = 48
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Primary Segment Status
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total primary segments                                    = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total primary segment valid (at master)                   = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total primary segment failures (at master)                = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes found                   = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Mirror Segment Status
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segments                                     = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segment failures (at master)                 = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files missing              = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs missing               = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files missing                   = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes missing                 = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes found                   = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as primary segments   = 0
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 24
20180209:16:23:45:004860 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------

Master server failure

When the master node fails, we need to activate the standby node as the new master (and, if the environment allows, move the VIP over to the standby server as well).

When activating the standby you can designate a new standby node right away, or wait until the original master server recovers and then designate it as the standby.

Rather than actually powering off the server here, we simulate the failed-master state directly.
Shut down the master node:

[gpadmin@l-test5 ~]$ gpstop -a -m 
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Starting gpstop with args: -a -m
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Gathering information and validating the environment...
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-There are 0 connections to the database
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Commencing Master instance shutdown with mode='smart'
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Master host=l-test5
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Commencing Master instance shutdown with mode=smart
20180211:15:16:24:022773 gpstop:l-test5:gpadmin-[INFO]:-Master segment instance directory=/export/gp_data/master/gpseg-1
20180211:15:16:25:022773 gpstop:l-test5:gpadmin-[INFO]:-Attempting forceful termination of any leftover master process
20180211:15:16:25:022773 gpstop:l-test5:gpadmin-[INFO]:-Terminating processes for segment /export/gp_data/master/gpseg-1

[gpadmin@l-test5 ~]$ psql
psql: could not connect to server: No such file or directory
    Is the server running locally and accepting
    connections on Unix domain socket "/tmp/.s.PGSQL.65432"?

Activate the standby node

[gpadmin@l-test6 ~]$ gpactivatestandby -a -d /export/gp_data/master/gpseg-1/
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:------------------------------------------------------
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Standby data directory    = /export/gp_data/master/gpseg-1
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Standby port              = 65432
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Standby running           = yes
20180211:15:19:48:016771 gpactivatestandby:l-test6:gpadmin-[INFO]:-Force standby activation  = no
.
.
.

Switch the server VIP

Remove the VIP from the original master server:
[root@l-test5 ~]# 
[root@l-test5 ~]# ip a d 10.0.0.1/32 brd + dev bond0

-----------
Attach the VIP to the original standby server:

[root@l-test6 ~]# 
[root@l-test6 ~]# ip a a  10.0.0.1/32 brd + dev bond0 && arping -q -c 3 -U -I bond0 10.0.0.1
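
A quick sanity check after the switch (assuming the same bond0 interface and VIP as above):

# on the new master: the VIP should now be bound to bond0
[root@l-test6 ~]# ip addr show dev bond0 | grep 10.0.0.1
# and from any other host, the VIP should answer
$ ping -c 2 10.0.0.1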

Designate a new standby node

We designate the original master node as the new standby server.

The original master's data files must be deleted first; then simply rerun the standby initialization.

[gpadmin@l-test6 ~]$ gpinitstandby -a -s l-test5
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[INFO]:-Checking for filespace directory /export/gp_data/master/gpseg-1 on l-test5
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[ERROR]:-Filespace directory already exists on host l-test5
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[ERROR]:-Failed to create standby
20180211:15:28:56:019106 gpinitstandby:l-test6:gpadmin-[ERROR]:-Error initializing standby master: master data directory exists
----------Note: this step is performed on the original master node-------
[gpadmin@l-test5 ~]$ cd /export/gp_data/master/
[gpadmin@l-test5 /export/gp_data/master]$ ll
total 4
drwx------ 17 gpadmin gpadmin 4096 Feb 11 15:17 gpseg-1
[gpadmin@l-test5 /export/gp_data/master]$ rm -rf gpseg-1/
[gpadmin@l-test5 /export/gp_data/master]$ ll
total 0
[gpadmin@l-test5 /export/gp_data/master]$ 
----------
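
If disk space allows, a more cautious variant is to move the old master's data directory aside instead of deleting it, so it can still be inspected until the new standby is confirmed healthy; a hypothetical sketch (gpinitstandby only checks that the exact target path is free):

[gpadmin@l-test5 /export/gp_data/master]$ mv gpseg-1 gpseg-1.bak.$(date +%Y%m%d)   # remove the backup once the new standby is verified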

[gpadmin@l-test6 ~]$ gpinitstandby -a -s l-test5
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Checking for filespace directory /export/gp_data/master/gpseg-1 on l-test5
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:------------------------------------------------------
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master initialization parameters
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:------------------------------------------------------
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum master hostname               = l-test6
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum master data directory         = /export/gp_data/master/gpseg-1
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum master port                   = 65432
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master hostname       = l-test5
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master port           = 65432
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum standby master data directory = /export/gp_data/master/gpseg-1
20180211:15:30:55:019592 gpinitstandby:l-test6:gpadmin-[INFO]:-Greenplum update system catalog         = On
.
.

Standby server failure

Once the standby server is back up, remove the standby node from the cluster and then re-initialize the standby server.

Checking the overall Greenplum cluster status on the master server, we see [WARNING]:-Standby master state:

[gpadmin@l-test5 ~]$ gpstate
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: 
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:32:19:007208 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-Greenplum instance status summary
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-   Master instance                                           = Active
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[INFO]:-   Master standby                                            = l-test6
20180209:16:32:21:007208 gpstate:l-test5:gpadmin-[WARNING]:-Standby master state                                      = Standby host DOWN   
.
.
[gpadmin@l-test5 ~]$ gpstate -f
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -f
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-Standby master details
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-----------------------
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-   Standby address          = l-test6
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-   Standby data directory   = /export/gp_data/master/gpseg-1
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-   Standby port             = 65432
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[WARNING]:-Standby PID              = 10480                            Have lock file /tmp/.s.PGSQL.65432.lock but no process running on port 65432   <<<<<<<<
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[WARNING]:-Standby status           = Status could not be determined   <<<<<<<<
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--pg_stat_replication
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:-No entries found.
20180209:16:38:41:008728 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------

1. Remove the failed standby node

[gpadmin@l-test5 ~]$ gpinitstandby -r -a
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Warm master standby removal parameters
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master hostname               = l-test5
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master data directory         = /export/gp_data/master/gpseg-1
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master port                   = 65432
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master hostname       = l-test6
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master port           = 65432
20180209:16:52:04:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master data directory = /export/gp_data/master/gpseg-1
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Removing standby master from catalog...
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Database catalog updated successfully.
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Removing filespace directories on standby master...
20180209:16:52:18:011917 gpinitstandby:l-test5:gpadmin-[INFO]:-Successfully removed standby master

2. Re-initialize the standby node
[gpadmin@l-test5 ~]$ gpinitstandby -s l-test6 -a
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Checking for filespace directory /export/gp_data/master/gpseg-1 on l-test6
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master initialization parameters
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:------------------------------------------------------
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master hostname               = l-test5
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master data directory         = /export/gp_data/master/gpseg-1
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum master port                   = 65432
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master hostname       = l-test6
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master port           = 65432
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum standby master data directory = /export/gp_data/master/gpseg-1
20180209:16:59:08:013723 gpinitstandby:l-test5:gpadmin-[INFO]:-Greenplum update system catalog         = On
.
.

3. Verify the standby configuration

[gpadmin@l-test5 ~]$ gpstate -f
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -f
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-Standby master details
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-----------------------
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby address          = l-test6
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby data directory   = /export/gp_data/master/gpseg-1
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby port             = 65432
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby PID              = 19057
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:-   Standby status           = Standby host passive
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--pg_stat_replication
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--WAL Sender State: streaming
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Sync state: sync
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Sent Location: 0/14000000
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Flush Location: 0/14000000
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--Replay Location: 0/14000000
20180209:16:59:18:013939 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
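
The replication details that gpstate -f prints come from pg_stat_replication on the master, so the same check can be scripted directly. A minimal sketch (column names as exposed by Greenplum 5):

[gpadmin@l-test5 ~]$ psql -c "SELECT state, sync_state, sent_location, flush_location, replay_location FROM pg_stat_replication;"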

Segment node failure

When a primary segment fails, its corresponding mirror segment takes over the primary role, preserving data integrity for the whole cluster.

When a mirror segment fails, cluster availability is not affected, but it should be repaired as soon as possible so that every primary segment has a backup again.

If a primary segment and its corresponding mirror both fail, Greenplum considers the cluster data incomplete, and the whole cluster stops serving until the primary or mirror segment is restored.
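
Before starting any repair, it helps to see exactly which instances the master has marked down; a minimal sketch against the catalog (status 'd' means down):

[gpadmin@l-test5 ~]$ psql -c "SELECT content, role, hostname, port FROM gp_segment_configuration WHERE status = 'd';"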

Repairing a segment node requires a restart of the Greenplum cluster.

The repair procedure is the same for primary and mirror segments; here we take a mirror failure as the example.
Cluster state after shutting down one node

[gpadmin@l-test7 ~]$ pg_ctl -D /export/gp_data/mirror/data4/gpseg23 stop -m fast
waiting for server to shut down.... done
server stopped
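
On the segment host itself we can confirm that the instance's postmaster is really gone (the brackets in the pattern just keep grep from matching itself):

[gpadmin@l-test7 ~]$ ps -ef | grep "[g]pseg23"   # should print nothing for the stopped instance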

--------Run on the master node-------
[gpadmin@l-test5 ~]$ gpstate 
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: 
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:16:17:12:005738 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
.
.
.
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Mirror Segment Status
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segments                                     = 24
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total mirror segment valid (at master)                    = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total mirror segment failures (at master)                 = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number of postmaster.pid files missing              = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid files found                = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number of postmaster.pid PIDs missing               = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number of postmaster.pid PIDs found                 = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number of /tmp lock files missing                   = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number of /tmp lock files found                     = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[WARNING]:-Total number postmaster processes missing                 = 1                      <<<<<<<<
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number postmaster processes found                   = 23
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as primary segments   = 0
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-   Total number mirror segments acting as mirror segments    = 24
20180211:16:17:14:005738 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------


[gpadmin@l-test5 ~]$ gpstate -m
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -m
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--Current GPDB mirror list and status
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--Type = Spread
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-   Mirror              Datadir                                Port    Status    Data Status    
.
.
.
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:-   l-test12   /export/gp_data/mirror/data3/gpseg22   50002   Passive   Synchronized
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[WARNING]:-l-test7    /export/gp_data/mirror/data4/gpseg23   50003   Failed                   <<<<<<<<
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:17:30:005969 gpstate:l-test5:gpadmin-[WARNING]:-1 segment(s) configured as mirror(s) have failed

 
[gpadmin@l-test5 ~]$ gpstate -e
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-Starting gpstate with args: -e
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3'
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.4.1 build commit:4eb4d57ae59310522d53b5cce47aa505ed0d17d3) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Jan 22 2018 18:15:33'
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-Obtaining Segment details from master...
20180211:16:17:35:006045 gpstate:l-test5:gpadmin-[INFO]:-Gathering data from segments...
.. 
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-Segment Mirroring Status Report
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-Primaries in Change Tracking
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-   Current Primary    Port    Change tracking size   Mirror             Port
20180211:16:17:37:006045 gpstate:l-test5:gpadmin-[INFO]:-   l-test9   40003   128 bytes              l-test7   50003
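
gpstate -e reads these modes from the catalog; the same view is available with a direct query, where mode 'c' is change tracking, 'r' is resynchronizing, and 's' is synchronized. A sketch:

[gpadmin@l-test5 ~]$ psql -c "SELECT content, role, hostname, port, mode FROM gp_segment_configuration WHERE mode <> 's';"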

Shut down the entire Greenplum cluster

[gpadmin@l-test5 ~]$ gpstop -a -M fast
.
.
.
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-   Segments stopped successfully                              = 48
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-   Segments with errors during stop                           = 0
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-   
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[WARNING]:-Segments that are currently marked down in configuration   = 1    <<<<<<<<
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-            (stop was still attempted on these segments)
20180211:16:25:00:008223 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
.
.
.

Start the database in restricted mode:
[gpadmin@l-test5 ~]$ gpstart -a -R 
.
. 
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-Process results...
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-   Successful segment starts                                            = 47
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-   Failed segment starts                                                = 0
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-   Skipped segment starts (segments are marked down in configuration)   = 0
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-Successfully started 47 of 47 segment instances 
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:25:22:008571 gpstart:l-test5:gpadmin-[INFO]:-Starting Master instance l-test5 directory /export/gp_data/master/gpseg-1 in RESTRICTED mode
20180211:16:25:23:008571 gpstart:l-test5:gpadmin-[INFO]:-Command pg_ctl reports Master l-test5 instance active
20180211:16:25:23:008571 gpstart:l-test5:gpadmin-[INFO]:-Starting standby master
20180211:16:25:23:008571 gpstart:l-test5:gpadmin-[INFO]:-Checking if standby master is running on host: l-test6  in directory: /export/gp_data/master/gpseg-1
20180211:16:25:24:008571 gpstart:l-test5:gpadmin-[INFO]:-Database successfully started

Repair the failed node

[gpadmin@l-test5 ~]$ gprecoverseg -a
.
.
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Greenplum instance recovery parameters
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Recovery type              = Standard
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Recovery 1 of 1
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Synchronization mode                        = Incremental
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance host                        = l-test7
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance address                     = l-test7
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance directory                   = /export/gp_data/mirror/data4/gpseg23
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance port                        = 50003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Failed instance replication port            = 51003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance host               = l-test9
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance address            = l-test9
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance directory          = /export/gp_data/primary/data4/gpseg23
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance port               = 40003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Source instance replication port   = 41003
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-   Recovery Target                             = in-place
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:----------------------------------------------------------
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-1 segment(s) to recover
20180211:16:25:38:009108 gprecoverseg:l-test5:gpadmin-[INFO]:-Ensuring 1 failed segment(s) are stopped
. 
.
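
Note the Synchronization mode above is Incremental: gprecoverseg replays only the change-tracking log accumulated while the mirror was down. If incremental recovery fails (for example, the mirror's data directory is damaged or missing), a full copy from the live peer can be forced instead; a sketch:

[gpadmin@l-test5 ~]$ gprecoverseg -a -F   # -F forces a full recovery rather than an incremental one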

Check the recovery progress

Wait until the Data Status column shows Synchronized for every mirror before moving on to the next step.

[gpadmin@l-test5 ~]$ gpstate -m
.
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:-   Mirror              Datadir                                Port    Status    Data Status       
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:-   l-test11   /export/gp_data/mirror/data1/gpseg0    50000   Passive   Synchronized
.
.
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:-   l-test7    /export/gp_data/mirror/data4/gpseg23   50003   Passive   Resynchronizing
20180211:16:25:51:009237 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------

[gpadmin@l-test5 ~]$ 
[gpadmin@l-test5 ~]$ gpstate -m
.
.
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:-   Mirror              Datadir                                Port    Status    Data Status    
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:-   l-test11   /export/gp_data/mirror/data1/gpseg0    50000   Passive   Synchronized
.
.
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:-   l-test7    /export/gp_data/mirror/data4/gpseg23   50003   Passive   Synchronized
20180211:16:31:54:010763 gpstate:l-test5:gpadmin-[INFO]:--------------------------------------------------------------
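
Resynchronization can take a while on large segments; rather than rerunning gpstate -m by hand, a simple poll works. A sketch:

[gpadmin@l-test5 ~]$ while gpstate -m | grep -q Resynchronizing; do sleep 60; done; echo "all mirrors synchronized"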

Restart the Greenplum cluster

[gpadmin@l-test5 ~]$ gpstop -a -r
.
.
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-   Segments stopped successfully      = 48
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-   Segments with errors during stop   = 0
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-----------------------------------------------------
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-Successfully shutdown 48 of 48 segment instances 
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-Database successfully shutdown with no errors reported
20180211:16:36:43:012012 gpstop:l-test5:gpadmin-[INFO]:-Cleaning up leftover gpmmon process
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-No leftover gpmmon process found
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-Cleaning up leftover gpsmon processes
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-No leftover gpsmon processes on some hosts. not attempting forceful termination on these hosts
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-Cleaning up leftover shared memory
20180211:16:36:44:012012 gpstop:l-test5:gpadmin-[INFO]:-Restarting System...
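
After the restart, a final pass confirms that no segment is down and every mirror is back in its preferred role:

[gpadmin@l-test5 ~]$ gpstate -e   # should report no error conditions
[gpadmin@l-test5 ~]$ gpstate -m   # every mirror should be Passive / Synchronized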