最近在一个跨广域网的PostgreSQL primary standby环境中遇到一个比较奇特的问题。
首先primary standby是跨广域网的,但这不是问题的重点。重点是归档也是跨光域网并且使用NFS来让standby访问归档文件。
standby通过NFS获取归档,通过TCP连接primary实现流复制。
但是不知道什么原因,NFS出现了问题,即standby无法正常的访问归档文件了,访问NFS的命令会hang住。
接下来描述一下问题,然后再从PostgreSQL源码分析问题的原因。
1. standby的restore_command命令(cp /nfsdir/%f %p)hang住,停留在拷贝某个归档xlog的状态。
2. 手工kill 这个cp命令,紧接着standby数据库crash了。
3. 重启standby数据库,发现hang在cp /nfsdir/0000009.history %p的状态,然而实际上0000009.history是不存在的,主库的时间线是8,备库的时间线也是8,那么是什么原因导致standby要去找一个不存在的时间线文件呢?
这个原因要到源代码去寻找答案。
我们一般配置流复制环境,会设置recovery_target_timeline=latest,这样做的目的是配置一个hot_standby,如果上游切换了时间线,可以自动跟上。
而这个就可以解释为什么standby recovery的时候要去找一个不存在的时间线文件?
见源码:
src/backend/access/transam/timeline.c
/*
* Find the newest existing timeline, assuming that startTLI exists.
*
* Note: while this is somewhat heuristic, it does positively guarantee
* that (result + 1) is not a known timeline, and therefore it should
* be safe to assign that ID to a new timeline.
*/
TimeLineID
findNewestTimeLine(TimeLineID startTLI)
{
TimeLineID newestTLI;
TimeLineID probeTLI;
/*
* The algorithm is just to probe for the existence of timeline history
* files. XXX is it useful to allow gaps in the sequence?
*/
newestTLI = startTLI;
for (probeTLI = startTLI + 1;; probeTLI++) # 问题就出在这里, 探测下一个时间线是否存在。
{
if (existsTimeLineHistory(probeTLI))
{
newestTLI = probeTLI; /* probeTLI exists */
}
else
{
/* doesn't exist, assume we're done */
break;
}
}
return newestTLI;
}
src/backend/access/transam/xlog.c
/*
* See if there is a recovery command file (recovery.conf), and if so
* read in parameters for archive recovery and XLOG streaming.
*
* The file is parsed using the main configuration parser.
*/
static void
readRecoveryCommandFile(void)
......
else if (strcmp(item->name, "recovery_target_timeline") == 0)
{
rtliGiven = true;
if (strcmp(item->value, "latest") == 0)
rtli = 0;
.....
/*
* If user specified recovery_target_timeline, validate it or compute the
* "latest" value. We can't do this until after we've gotten the restore
* command and set InArchiveRecovery, because we need to fetch timeline
* history files from the archive.
*/
if (rtliGiven)
{
if (rtli)
{
/* Timeline 1 does not have a history file, all else should */
if (rtli != 1 && !existsTimeLineHistory(rtli))
ereport(FATAL,
(errmsg("recovery target timeline %u does not exist",
rtli)));
recoveryTargetTLI = rtli;
recoveryTargetIsLatest = false;
}
else
{
/* We start the "latest" search from pg_control's timeline */ # 问题出在这里,我配置的就是recovery_target_timeline=latest, 所以需要调用findNewestTimeLine. 控制文件是8,所以find 0000009.history.
recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
recoveryTargetIsLatest = true;
}
}
找到原因后,把NFS的问题解决掉,重启数据库就好了,再也不会hang住。
[参考]
1. src/backend/access/transam/xlog.c
2. src/backend/access/transam/timeline.c