Analysis of a lock bottleneck in PostgreSQL synchronous streaming replication
Author
digoal
Date
2016-11-07
Tags
PostgreSQL , synchronous streaming replication , mutex , Linux , latch
Background
In PostgreSQL synchronous streaming replication, the walsender uses the LSN positions reported back by the walreceiver to wake up and release transactions that must wait until their WAL has been received by the standby.
For a synchronous transaction, after the user requests to end the transaction, the commit record's LSN must be less than or equal to the LSN that the walsender has received from the walreceiver before the commit can return.
This is implemented with a wait queue; releasing transactions from the queue is done by the walsender. A simplified sketch of the waiting (backend) side follows.
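To make the mechanism concrete, below is a heavily simplified sketch of the commit-side wait, modeled on SyncRepWaitForLSN in src/backend/replication/syncrep.c. It is illustrative only: interrupt handling, query cancellation, timeouts, and re-checks under SyncRepLock are omitted, and the exact field names should be verified against the actual 9.6 source.

/* Simplified sketch of the commit-side wait (cf. SyncRepWaitForLSN in 9.6).
 * The real code also handles cancellation, process death and error paths;
 * all of that is omitted here. */
static void
SyncRepWaitForLSN_sketch(XLogRecPtr lsn)
{
	int		mode = SyncRepWaitMode;	/* wait for remote write/flush/apply */

	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

	/* Fast path: the standby has already confirmed this LSN. */
	if (lsn <= WalSndCtl->lsn[mode])
	{
		LWLockRelease(SyncRepLock);
		return;
	}

	/* Register ourselves in the LSN-ordered wait queue. */
	MyProc->waitLSN = lsn;
	MyProc->syncRepState = SYNC_REP_WAITING;
	SyncRepQueueInsert(mode);
	LWLockRelease(SyncRepLock);

	/* Sleep until the walsender wakes us via SyncRepWakeQueue() -> SetLatch(). */
	for (;;)
	{
		WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
		ResetLatch(MyLatch);

		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
			break;				/* released by the walsender */
	}
}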
When running an insert-only benchmark against a PostgreSQL synchronous streaming replication setup, the primary's CPU, network, and block devices were all fairly idle, yet write performance was capped within a narrow range and could not be pushed higher.
At the same time, heavy kernel mutex contention was observed on the primary, throttling overall write performance.
Even with multiple instances on a single host (i.e. several primaries deployed on one machine, which eliminates instance-level locks such as the instance-level WAL-related locks), the impact of this mutex could not be removed.
Reproducing the problem
Environment: 3 database hosts and 1 test machine, interconnected over Gigabit Ethernet.
CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O2 -flto" ./configure --enable-profiling --prefix=/home/digoal/pgsql9.6
CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O2 -flto" make world -j 64
CC="/home/digoal/gcc6.2.0/bin/gcc" CFLAGS="-O2 -flto" make install-world
Deploying the synchronous streaming replication environment is omitted here.
1. postgresql.conf
listen_addresses = '0.0.0.0'
max_connections = 300
unix_socket_directories = '.'
shared_buffers = 24GB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
bgwriter_flush_after = 0
max_parallel_workers_per_gather = 0
old_snapshot_threshold = -1
backend_flush_after = 0
wal_level = replica
fsync = off
synchronous_commit = remote_write
full_page_writes = off
wal_buffers = 256MB
wal_writer_delay = 10ms
wal_writer_flush_after = 0
checkpoint_timeout = 45min
max_wal_size = 48GB
checkpoint_completion_target = 0.05
checkpoint_flush_after = 0
max_wal_senders = 5
random_page_cost = 1.0
parallel_tuple_cost = 0
parallel_setup_cost = 0
effective_cache_size = 48GB
force_parallel_mode = off
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_timezone = 'PRC'
update_process_title = off
autovacuum = on
log_autovacuum_min_duration = -1
autovacuum_max_workers = 16
autovacuum_naptime = 15s
autovacuum_vacuum_scale_factor = 0.02
autovacuum_analyze_scale_factor = 0.01
vacuum_freeze_table_age = 1500000000
vacuum_multixact_freeze_table_age = 1500000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
2. Start the primary
pg_ctl -o "-p 1921 -c synchronous_standby_names='1(b,c)'" start -D /u01/digoal/pg_root1921
3. Two standbys
recovery.conf
recovery_target_timeline = 'latest'
standby_mode=on
primary_conninfo = 'user=postgres host= port= application_name=b'
4. Check replication status (pg_stat_replication output)
-[ RECORD 1 ]----+------------------------------
pid | 42754
usesysid | 10
usename | postgres
application_name | b
client_addr | xxx.xxx.xxx.xxx
client_hostname |
client_port | 52834
backend_start | 2016-11-07 16:07:26.353563+08
backend_xmin |
state | streaming
sent_location | 2/36798458
write_location | 2/36798458
flush_location | 2/36798458
replay_location | 2/36798458
sync_priority | 1
sync_state | sync
-[ RECORD 2 ]----+------------------------------
pid | 42755
usesysid | 10
usename | postgres
application_name | c
client_addr | xxx.xxx.xxx.xxx
client_hostname |
client_port | 60064
backend_start | 2016-11-07 16:07:26.353765+08
backend_xmin |
state | streaming
sent_location | 2/36798458
write_location | 2/36798458
flush_location | 2/36798458
replay_location | 2/36798458
sync_priority | 2
sync_state | potential
Benchmark
psql -c "create table test(id serial primary key, info text, crt_time timestamp);"
vi test3.sql
insert into test(info) values (null);
pgbench -M prepared -n -r -P 1 -f ./test3.sql -h xxx.xxx.xxx.xxx -p 1921 -U postgres -c 64 -j 64 -T 120
1. top output during the test: the machine is still quite idle
top - 16:09:47 up 37 days, 4:11, 1 user, load average: 3.30, 2.94, 2.59
Tasks: 1707 total, 14 running, 1693 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.9%us, 9.4%sy, 0.0%ni, 80.6%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 529321832k total, 213706132k used, 315615700k free, 524824k buffers
Swap: 0k total, 0k used, 0k free, 186220740k cached
42754 digoal 20 0 24.9g 9384 1924 R 49.5 0.0 0:18.55 postgres: wal sender process postgres xxx.xxx.xxx.xx8(52834)
42755 digoal 20 0 24.9g 9340 1880 D 33.4 0.0 0:14.00 postgres: wal sender process postgres xxx.xxx.xxx.xx9(60064)
43187 digoal 20 0 24.9g 59m 52m S 19.3 0.0 0:01.22 postgres: postgres postgres xxx.xxx.xxx.xx2(48185)
43222 digoal 20 0 24.9g 53m 46m S 19.3 0.0 0:01.21 postgres: postgres postgres xxx.xxx.xxx.xx2(48220)
......
2. perf top output during the test shows a very high kernel share, all of it lock-related.
PerfTop: 39521 irqs/sec kernel:73.0% exact: 0.0% [1000Hz cycles], (all, 64 CPUs)
--------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________ _____________________________________
59139.00 12.0% __mutex_lock_slowpath [kernel.kallsyms]
44628.00 9.0% _spin_lock [kernel.kallsyms]
37980.00 7.7% mutex_spin_on_owner [kernel.kallsyms]
9815.00 2.0% reverse_path_check_proc [kernel.kallsyms]
9105.00 1.8% find_busiest_group [kernel.kallsyms]
8001.00 1.6% schedule [kernel.kallsyms]
7028.00 1.4% reschedule_interrupt [kernel.kallsyms]
6673.00 1.4% hash_search_with_hash_value /home/digoal/pgsql9.6/bin/postgres
6181.00 1.3% LWLockAttemptLock /home/digoal/pgsql9.6/bin/postgres
5330.00 1.1% _spin_lock_irqsave [kernel.kallsyms]
4688.00 1.0% LWLockRelease /home/digoal/pgsql9.6/bin/postgres
4544.00 0.9% GetSnapshotData /home/digoal/pgsql9.6/bin/postgres
3. After the test is finished, inspect the gmon output of the relevant processes to analyze the cause.
3.1 gprof output of the walsender process serving the sync standby
Most of the cost is concentrated at the top of the profile.
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
70.57 6.33 6.33 WalRcvSigUsr1Handler
12.76 7.48 1.15 9218867 0.00 0.00 SetLatch # the call count roughly matches the total number of calls issued by pgbench.
2.23 7.68 0.20 1323523 0.00 0.00 SyncRepWakeQueue
1.11 7.78 0.10 841707 0.00 0.00 LWLockRelease
1.00 7.87 0.09 841670 0.00 0.00 SyncRepReleaseWaiters
1.00 7.96 0.09 753520 0.00 0.00 ProcessRepliesIfAny
Analyzing the cost
SyncRepWakeQueue is entry [7] in the call graph, and most of its cost comes from entry [8], SetLatch (which also has the largest self time).
If anything needs optimizing, this is it.
granularity: each sample hit covers 2 byte(s) for 0.11% of 8.97 seconds
index % time self children called name
<spontaneous>
[1] 70.6 6.33 0.00 WalRcvSigUsr1Handler [1]
-----------------------------------------------
<spontaneous>
[2] 28.3 0.00 2.54 PostgresMain [2]
0.00 2.53 2/2 exec_replication_command [3]
0.01 0.00 1/1 SendPostmasterSignal [70]
0.00 0.00 1/1 InitPostgres [85]
0.00 0.00 1/1 BeginReportingGUCOptions [164]
0.00 0.00 2/2 ReadyForQuery [203]
0.00 0.00 1/1 InitProcess [212]
0.00 0.00 1/1 InitDeadLockChecking [220]
0.00 0.00 3/44 initStringInfo [167]
0.00 0.00 1/18 pq_endmessage [166]
0.00 0.00 2/841672 pq_getmessage [48]
0.00 0.00 2/1595193 pq_startmsgread [36]
0.00 0.00 1/1 InitBufferPoolAccess [268]
0.00 0.00 2/2 pgstat_report_stat [285]
0.00 0.00 2/2 pgstat_report_activity [284]
0.00 0.00 2/3 SetCurrentStatementStartTimestamp [273]
0.00 0.00 1/1 InitWalSender [301]
0.00 0.00 1/1 BaseInit [304]
0.00 0.00 1/35 AllocSetContextCreate [204]
0.00 0.00 1/4120105 AllocSetFree [41]
0.00 0.00 2/34 pq_sendint [242]
0.00 0.00 1/8 MemoryContextDelete <cycle 1> [393]
0.00 0.00 2/2 pq_getbyte [317]
0.00 0.00 2/2 MemoryContextReset [505]
0.00 0.00 2/2 IsAbortedTransactionBlockState [499]
0.00 0.00 2/2 pg_any_to_server [579]
0.00 0.00 2/2 pg_client_to_server [580]
0.00 0.00 2/2 pq_getmsgstring [591]
0.00 0.00 2/2 pq_getmsgend [590]
0.00 0.00 2/2 IsTransactionOrTransactionBlock [503]
0.00 0.00 2/2 ProcessCompletedNotifies [510]
0.00 0.00 2/3 set_ps_display [482]
0.00 0.00 1/1 process_postgres_switches [774]
0.00 0.00 1/2 pqinitmask [592]
0.00 0.00 1/3 load_libraries [473]
0.00 0.00 1/1 process_session_preload_libraries [775]
0.00 0.00 1/1 freeBlocks_cmp [753]
0.00 0.00 1/25 pqsignal [361]
0.00 0.00 1/1 WalSndSignals [717]
-----------------------------------------------
0.00 2.53 2/2 PostgresMain [2]
[3] 28.2 0.00 2.53 2 exec_replication_command [3]
0.02 2.51 1/1 WalSndLoop [4]
0.00 0.00 1/1 SyncRepInitConfig [182]
0.00 0.00 2/753466 GetFlushRecPtr [21]
0.00 0.00 3/18 pq_endmessage [166]
0.00 0.00 31/34 pq_sendint [242]
0.00 0.00 1/753524 internal_flush [24]
0.00 0.00 3/44 initStringInfo [167]
0.00 0.00 1/709417 internal_putbytes [38]
0.00 0.00 2/2 replication_scanner_init [263]
0.00 0.00 7/5675256 appendBinaryStringInfo [50]
0.00 0.00 2/2 replication_yyparse [279]
0.00 0.00 2/1683364 errstart [57]
0.00 0.00 2/35 AllocSetContextCreate [204]
0.00 0.00 1/709400 pq_sendbyte [67]
0.00 0.00 1/4120105 AllocSetFree [41]
0.00 0.00 1/8 MemoryContextDelete <cycle 1> [393]
0.00 0.00 4/26 pq_sendstring [359]
0.00 0.00 2/2 SnapBuildClearExportedSnapshot [520]
0.00 0.00 1/3 set_stack_base [483]
0.00 0.00 1/14 RecoveryInProgress [379]
0.00 0.00 1/2 WalSndSetState [525]
-----------------------------------------------
0.02 2.51 1/1 exec_replication_command [3]
[4] 28.2 0.02 2.51 1 WalSndLoop [4]
0.09 1.89 753520/753520 ProcessRepliesIfAny [5]
0.07 0.13 753464/753464 XLogSendPhysical [9]
0.01 0.12 753457/753457 WaitLatchOrSocket [11]
0.06 0.00 753520/753521 ResetLatch [15]
0.04 0.00 3014018/3014018 pq_cleanup_redirect_to_shm_mq [23]
0.01 0.02 753521/753524 internal_flush [24]
0.02 0.00 753519/753519 WalSndKeepaliveIfNecessary [37]
0.02 0.00 753520/753520 PostmasterIsAlive [42]
0.01 0.00 753519/753519 WalSndCheckTimeOut [52]
0.00 0.01 753457/753457 WalSndComputeSleeptime [60]
0.00 0.01 1/1 proc_exit [62]
0.00 0.00 753520/1744598 GetCurrentTimestamp [56]
0.00 0.00 3/44 initStringInfo [167]
0.00 0.00 1/709417 internal_putbytes [38]
0.00 0.00 1/1683364 errstart [57]
0.00 0.00 1/2 WalSndSetState [525]
0.00 0.00 1/3 set_stack_base [483]
-----------------------------------------------
0.09 1.89 753520/753520 WalSndLoop [4]
[5] 22.1 0.09 1.89 753520 ProcessRepliesIfAny [5]
0.09 1.64 841670/841670 SyncRepReleaseWaiters [6]
0.02 0.05 3366680/3366680 pq_getmsgint64 [12]
0.03 0.00 1595190/1595190 pq_getbyte_if_available [25]
0.02 0.00 1595190/1595193 pq_startmsgread [36]
0.01 0.00 841670/1683309 elog_finish [27]
0.02 0.00 1683340/1683340 pq_getmsgbyte [43]
0.01 0.00 841670/841672 pq_getmessage [48]
0.01 0.00 841670/1683309 elog_start [51]
0.00 0.00 281670/1744598 GetCurrentTimestamp [56]
0.00 0.00 841670/3102136 resetStringInfo [65]
0.00 0.00 186/25758 s_lock [82]
-----------------------------------------------
0.09 1.64 841670/841670 ProcessRepliesIfAny [5]
[6] 19.2 0.09 1.64 841670 SyncRepReleaseWaiters [6]
0.20 1.15 1323523/1323523 SyncRepWakeQueue [7]
0.10 0.05 841624/841707 LWLockRelease [10]
0.00 0.06 841624/841624 SyncRepGetSyncStandbys [13]
0.01 0.04 841624/841707 LWLockAcquire [17]
0.01 0.00 841624/1683309 elog_finish [27]
0.00 0.00 841624/1683256 list_free_private [40]
0.00 0.00 841624/1683309 elog_start [51]
0.00 0.00 841624/4120105 AllocSetFree [41]
0.00 0.00 1/10 tuplestore_set_eflags [79]
0.00 0.00 1/4 errmsg [92]
0.00 0.00 163/25758 s_lock [82]
0.00 0.00 1/1683364 errstart [57]
0.00 0.00 841624/4120093 pfree [320]
0.00 0.00 841624/1683255 list_free [321]
-----------------------------------------------
0.20 1.15 1323523/1323523 SyncRepReleaseWaiters [6]
[7] 15.0 0.20 1.15 1323523 SyncRepWakeQueue [7]
1.15 0.00 9218863/9218867 SetLatch [8]
0.00 0.00 9218863/9218876 SHMQueueDelete [319]
-----------------------------------------------
0.00 0.00 1/9218867 handle_sig_alarm [215]
0.00 0.00 1/9218867 ProcKill [213]
0.00 0.00 1/9218867 InitProcess [212]
0.00 0.00 1/9218867 WalSndLastCycleHandler [214]
1.15 0.00 9218863/9218867 SyncRepWakeQueue [7]
[8] 12.8 1.15 0.00 9218867 SetLatch [8]
-----------------------------------------------
0.07 0.13 753464/753464 WalSndLoop [4]
[9] 2.2 0.07 0.13 753464 XLogSendPhysical [9] // the actual send cost accounts for only 2.2%
0.04 0.00 753464/753466 GetFlushRecPtr [21]
0.02 0.01 2837588/2837588 pq_sendint64 [31]
0.02 0.00 709397/709397 XLogRead [34]
0.02 0.00 709397/709417 internal_putbytes [38]
0.01 0.00 709397/709397 bms_hash_value [49]
0.00 0.00 709397/709400 pq_sendbyte [67]
0.00 0.00 709397/1744598 GetCurrentTimestamp [56]
0.00 0.00 1418794/3102136 resetStringInfo [65]
0.00 0.00 406/25758 s_lock [82]
0.00 0.00 709397/7226325 enlargeStringInfo [296]
3.2 gprof output of the walsender process serving the potential standby; since this walsender is not responsible for releasing waiters, that cost is absent.
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
81.52 6.22 6.22 WalRcvSigUsr1Handler
1.57 6.34 0.12 788548 0.00 0.00 ProcessRepliesIfAny
1.18 6.43 0.09 788495 0.00 0.00 GetFlushRecPtr
1.05 6.51 0.08 897403 0.00 0.00 SyncRepGetSyncStandbys
1.05 6.59 0.08 788488 0.00 0.00 WaitEventSetWait
0.79 6.65 0.06 788549 0.00 0.00 ResetLatch
0.72 6.71 0.06 4378235 0.00 0.00 pfree
Code
1. SetLatch
/*-------------------------------------------------------------------------
*
* latch.c
* Routines for inter-process latches
*
* The Unix implementation uses the so-called self-pipe trick to overcome
* the race condition involved with select() and setting a global flag
* in the signal handler. When a latch is set and the current process
* is waiting for it, the signal handler wakes up the select() in
* WaitLatch by writing a byte to a pipe. A signal by itself doesn't
* interrupt select() on all platforms, and even on platforms where it
* does, a signal that arrives just before the select() call does not
* prevent the select() from entering sleep. An incoming byte on a pipe
* however reliably interrupts the sleep, and causes select() to return
* immediately even if the signal arrives before select() begins.
*
* (Actually, we prefer epoll_wait() over poll() over select() where
* available, but the same comments apply.)
*
* When SetLatch is called from the same process that owns the latch,
* SetLatch writes the byte directly to the pipe. If it's owned by another
* process, SIGUSR1 is sent and the signal handler in the waiting process
* writes the byte to the pipe on behalf of the signaling process.
*
* The Windows implementation uses Windows events that are inherited by
* all postmaster child processes.
*
* Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* src/backend/storage/ipc/latch.c
*
*-------------------------------------------------------------------------
*/
...
/*
* Sets a latch and wakes up anyone waiting on it.
*
* This is cheap if the latch is already set, otherwise not so much.
*
* NB: when calling this in a signal handler, be sure to save and restore
* errno around it. (That's standard practice in most signal handlers, of
* course, but we used to omit it in handlers that only set a flag.)
*
* NB: this function is called from critical sections and signal handlers so
* throwing an error is not a good idea.
*/
void
SetLatch(volatile Latch *latch)
{
#ifndef WIN32
pid_t owner_pid;
#else
HANDLE handle;
#endif
/*
* The memory barrier has to be placed here to ensure that any flag
* variables possibly changed by this process have been flushed to main
* memory, before we check/set is_set.
*/
pg_memory_barrier();
/* Quick exit if already set */
if (latch->is_set)
return;
latch->is_set = true;
#ifndef WIN32
/*
* See if anyone's waiting for the latch. It can be the current process if
* we're in a signal handler. We use the self-pipe to wake up the select()
* in that case. If it's another process, send a signal.
*
* Fetch owner_pid only once, in case the latch is concurrently getting
* owned or disowned. XXX: This assumes that pid_t is atomic, which isn't
* guaranteed to be true! In practice, the effective range of pid_t fits
* in a 32 bit integer, and so should be atomic. In the worst case, we
* might end up signaling the wrong process. Even then, you're very
* unlucky if a process with that bogus pid exists and belongs to
* Postgres; and PG database processes should handle excess SIGUSR1
* interrupts without a problem anyhow.
*
* Another sort of race condition that's possible here is for a new
* process to own the latch immediately after we look, so we don't signal
* it. This is okay so long as all callers of ResetLatch/WaitLatch follow
* the standard coding convention of waiting at the bottom of their loops,
* not the top, so that they'll correctly process latch-setting events
* that happen before they enter the loop.
*/
owner_pid = latch->owner_pid;
if (owner_pid == 0)
return;
else if (owner_pid == MyProcPid)
{
if (waiting)
sendSelfPipeByte();
}
else
kill(owner_pid, SIGUSR1);
#else
/*
* See if anyone's waiting for the latch. It can be the current process if
* we're in a signal handler.
*
* Use a local variable here just in case somebody changes the event field
* concurrently (which really should not happen).
*/
handle = latch->event;
if (handle)
{
SetEvent(handle);
/*
* Note that we silently ignore any errors. We might be in a signal
* handler or other critical path where it's not safe to call elog().
*/
}
#endif
}
2. SyncRepWakeQueue
/*
* Walk the specified queue from head. Set the state of any backends that
* need to be woken, remove them from the queue, and then wake them.
* Pass all = true to wake whole queue; otherwise, just wake up to
* the walsender's LSN.
*
* Must hold SyncRepLock.
*/
static int
SyncRepWakeQueue(bool all, int mode)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
Assert(SyncRepQueueIsOrderedByLSN(mode));
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
&(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
{
/*
* Assume the queue is ordered by LSN
*/
if (!all && walsndctl->lsn[mode] < proc->waitLSN)
return numprocs;
/*
* Move to next proc, so we can delete thisproc from the queue.
* thisproc is valid, proc may be NULL after this.
*/
thisproc = proc;
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
/*
* Set state to complete; see SyncRepWaitForLSN() for discussion of
* the various states.
*/
thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
/*
* Remove thisproc from queue.
*/
SHMQueueDelete(&(thisproc->syncRepLinks));
/*
* Wake only when we have set state and removed from queue.
*/
SetLatch(&(thisproc->procLatch));
numprocs++;
}
return numprocs;
}
3. SyncRepReleaseWaiters
/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and
* perhaps also which information we store as well.
*/
void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
XLogRecPtr flushPtr;
XLogRecPtr applyPtr;
bool got_oldest;
bool am_sync;
int numwrite = 0;
int numflush = 0;
int numapply = 0;
/*
* If this WALSender is serving a standby that is not on the list of
* potential sync standbys then we have nothing to do. If we are still
* starting up, still running base backup or the current flush position is
* still invalid, then leave quickly also.
*/
if (MyWalSnd->sync_standby_priority == 0 ||
MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
{
announce_next_takeover = true;
return;
}
/*
* We're a potential sync standby. Release waiters if there are enough
* sync standbys and we are considered as sync.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
/*
* Check whether we are a sync standby or not, and calculate the oldest
* positions among all sync standbys.
*/
got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
&applyPtr, &am_sync);
/*
* If we are managing a sync standby, though we weren't prior to this,
* then announce we are now a sync standby.
*/
if (announce_next_takeover && am_sync)
{
announce_next_takeover = false;
ereport(LOG,
(errmsg("standby \"%s\" is now a synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
/*
* If the number of sync standbys is less than requested or we aren't
* managing a sync standby then just leave.
*/
if (!got_oldest || !am_sync)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
* this location.
*/
if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
{
walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
{
walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
numwrite, (uint32) (writePtr >> 32), (uint32) writePtr,
numflush, (uint32) (flushPtr >> 32), (uint32) flushPtr,
numapply, (uint32) (applyPtr >> 32), (uint32) applyPtr);
}
4. latch_sigusr1_handler and sendSelfPipeByte (the wakeup signal)
/*
* SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
*
* Wake up WaitLatch, if we're waiting. (We might not be, since SIGUSR1 is
* overloaded for multiple purposes; or we might not have reached WaitLatch
* yet, in which case we don't need to fill the pipe either.)
*
* NB: when calling this in a signal handler, be sure to save and restore
* errno around it.
*/
#ifndef WIN32
void
latch_sigusr1_handler(void)
{
if (waiting)
sendSelfPipeByte();
}
#endif /* !WIN32 */
/* Send one byte to the self-pipe, to wake up WaitLatch */
#ifndef WIN32
static void
sendSelfPipeByte(void)
{
int rc;
char dummy = 0;
retry:
rc = write(selfpipe_writefd, &dummy, 1);
if (rc < 0)
{
/* If interrupted by signal, just retry */
if (errno == EINTR)
goto retry;
/*
* If the pipe is full, we don't need to retry, the data that's there
* already is enough to wake up WaitLatch.
*/
if (errno == EAGAIN || errno == EWOULDBLOCK)
return;
/*
* Oops, the write() failed for some other reason. We might be in a
* signal handler, so it's not safe to elog(). We have no choice but
* silently ignore the error.
*/
return;
}
}
#endif /* !WIN32 */
Performance after cpuset isolation
Using the cgroup cpuset subsystem to confine the database to about 15 CPU cores raised performance from 76,000 to 100,000 tps.
Fifteen cores seems to be the point where, for this test case, mutex contention and the other database overheads balance out and performance peaks. When more cores are added, performance actually drops, probably because the mutex cost is amplified across more CPU cores.
The machine has 64 hyperthreads in total.
#numastat
node0
numa_hit 4372456435
numa_miss 0
numa_foreign 0
interleave_hit 216742
local_node 4372456435
other_node 0
yum install -y libcgroup
mkdir -p /cgroup/cpuset
mount -t cgroup -o cpuset cpuset /cgroup/cpuset
cd /cgroup/cpuset
echo 0 > cpuset.mems
# because numastat reports only node0
mkdir r1
cd r1
echo "1-15" > cpuset.cpus
echo 0 > cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo 1 > cpuset.memory_migrate
ps -ewf|grep <postmaster PID of postgres instance 1>|grep -v grep|awk '{print "echo "$2" > tasks"}'
Add the PIDs to tasks (run the echo commands printed above).
Re-run the benchmark.
However, even with this approach, multiple instances still cannot raise performance further; from what we observe, once the total number of CPUs participating in the lock exceeds 15, performance starts to drop.
cd /cgroup/cpuset
mkdir r2
cd r2
echo "16-30" > cpuset.cpus
echo 0 > cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo 1 > cpuset.memory_migrate
ps -ewf|grep <postmaster PID of postgres instance 2>|grep -v grep|awk '{print "echo "$2" > tasks"}'
The above is the two-instance case: with 30 CPU cores participating in the lock, performance drops to a little over 70,000.
Alternatively, with multiple instances sharing CPUs so that the total number of CPUs participating in the lock stays at 15, we likewise only reach about 100,000 insert-only tps.
cd /cgroup/cpuset
mkdir r1
cd r1
echo "1-15" > cpuset.cpus
echo 0 > cpuset.mems
echo 0 > cpuset.cpu_exclusive
echo 1 > cpuset.memory_migrate
cd /cgroup/cpuset
mkdir r2
cd r2
echo "1-15" > cpuset.cpus
echo 0 > cpuset.mems
echo 0 > cpuset.cpu_exclusive
echo 1 > cpuset.memory_migrate
References
Unblocking multiple threads at once (pthread_cond_broadcast) might be more efficient; a sketch of the idea follows the links below.
https://docs.oracle.com/cd/E19455-01/806-5257/sync-40617/index.html
http://pubs.opengroup.org/onlinepubs/9699919799/nframe.html
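As a rough, non-PostgreSQL illustration of the idea behind those links: within a single process, one pthread_cond_broadcast() call can wake every waiter at once, whereas SetLatch() must send one SIGUSR1 (plus a self-pipe write) per waiting backend because PostgreSQL backends are separate processes. A minimal sketch; names such as released_lsn, my_lsn, waiter and releaser are invented for this example.

/* Minimal pthread_cond_broadcast sketch: one releaser wakes all waiters
 * whose target LSN has been reached.  Illustrative only; not PostgreSQL code. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static uint64_t        released_lsn = 0;	/* highest LSN confirmed so far */

static void *
waiter(void *arg)
{
	uint64_t	my_lsn = (uint64_t) (uintptr_t) arg;

	pthread_mutex_lock(&lock);
	while (released_lsn < my_lsn)		/* always re-check after a wakeup */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	printf("waiter for LSN %llu released\n", (unsigned long long) my_lsn);
	return NULL;
}

static void
releaser(uint64_t lsn)
{
	pthread_mutex_lock(&lock);
	released_lsn = lsn;
	pthread_cond_broadcast(&cond);		/* one call wakes every waiter */
	pthread_mutex_unlock(&lock);
}

int
main(void)
{
	pthread_t	threads[4];
	int			i;

	for (i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, waiter, (void *) (uintptr_t) (i + 1));

	sleep(1);		/* let the waiters block on the condition variable */
	releaser(4);	/* release everyone waiting for LSN <= 4 */

	for (i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

Compile with gcc -pthread. This is only an intra-process analogue; PostgreSQL's backends are separate processes, so it is not a drop-in replacement, but it shows why batching wakeups is attractive.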
Several additional groups of tests
Environment
One primary with two standbys, requiring that at least one replica has the WAL on disk.
test case
Record perf, profile, CPU, IO, memory, and network statistics for each test group.
sysbench , insert only
Seven groups were tested.
1. Single host, single instance, replicas on other hosts
Four sub-tests: synchronous_commit = off, local, remote_write, on
Summary
With remote_write and on: about 70,000 QPS.
When cgroups limit the CPUs to 10 cores and the WAL sender process serving the sync standby is additionally placed in a dedicated 1-core cpuset, throughput rises to 100,000 QPS.
With off and local: about 300,000 and 220,000 respectively.
The bottleneck was analyzed earlier in this article; it is mainly the synchronization mutex.
2. Single host, single instance, using batched transactions to take the network out of the picture and highlight the COMMIT wait problem.
begin;
insert into test(info) values (null);
-- ... multiple insert statements per transaction ...
end;
Summary
With the cgroup limit removed and 10 inserts packed into each transaction, we reach about 22,000 tps, i.e. roughly 220,000 insert QPS.
In other words, at COMMIT the backend process goes to sleep and waits for the wal sender to release it; the more COMMITs there are, the more sleep/wakeup cycles occur, which can produce the bottleneck described earlier in this article.
3. Single host, multiple instances, replicas on other hosts
Four master nodes were deployed on one host; each master's two standbys were placed on the other two physical machines.
Running the insert case against all four masters at the same time gives a total QPS of roughly 70,000/s.
No improvement at all compared with a single node.
4. Single host, multiple instances, replicas local, connected via UNIX sockets.
Local replicas rule out the network (CPU and IO are not bottlenecks either).
The primary and both standbys were all placed on one host to ensure the network is not a bottleneck; replication used unix socket connections, and so did pgbench.
Since local CPU and IO are not the bottleneck, the sync standby was placed on the local host and the other, potential standby on a different host.
The result is still a total QPS of roughly 70,000/s.
No improvement at all, while the machine's CPU and IO remain largely idle.
5. Single host, with multiple virtual machines created to isolate the OS-level synchronization mutex problem; replicas on other hosts.
On the master host, 8 KVM virtual machines were installed, each running one PostgreSQL master instance.
The standbys were deployed on the other two physical machines.
The benchmark showed a performance gain: the combined QPS was 190,000.
Considering the overhead of KVM CPU instruction translation and VIRTIO, plus the added network latency, it did not reach the theoretical value of 220,000.
But this result does demonstrate one thing: using KVM avoids the OS-level synchronization mutex.
7. Single-host test, no replicas
synchronous_commit = off, on
About 300,000 and 220,000 respectively.