背景
偶尔有用户会遇到远程程序连接RDS PG,在不做任何操作一段时间后可能中断。
其实可能是用户和RDS PG之间,某些网络设备设置了会话空闲超时,会主动中断会话。
那么有什么方法能解决这个问题呢?
运维的同学可能有这方面的经验,例如使用securecrt或者其他终端连接服务器时,可以设置这些管理工具的no-op,周期性的发一些空字符过去,保证会话上有流量。
但是数据库连接怎么搞呢?
PostgreSQL提供了tcp keep alive的参数可供用户设置。
例子
为了避免会话中断的问题, 可以通过tcp层的keepalive机制来达到传输心跳数据的目的.
方法一,设置数据库参数
PostgreSQL支持会话级别的设置, 数据库级别的设置在$PGDATA/postgresql.conf,
建议设置如下三个参数的值
# - TCP Keepalives -
# see "man 7 tcp" for details
tcp_keepalives_idle = 60 # TCP_KEEPIDLE, in seconds;
# 0 selects the system default
tcp_keepalives_interval = 10 # TCP_KEEPINTVL, in seconds;
# 0 selects the system default
tcp_keepalives_count = 10 # TCP_KEEPCNT;
# 0 selects the system default
解释详见本文末尾[参考1].
代码详见本文末尾[参考2].
参数解释
tcp_keepalives_idle : 定义这个tcp连接间隔多长后开始发送 第一个 tcp keepalive 包.
tcp_keepalives_interval : 定义在以上发送第一个tcp keepalive包后如果在这个时间间隔内没有收到对端的回包, 则开始发送第二个tcp keepalive包. 在这个时间内再没有回包的话则发送第三个keepalive包....直到达到tcp_keepalives_count次则broken 连接.
tcp_keepalives_count : 定义一共发送多少个tcp keepalive包, 达到这个数字后如果对端都没有回响应包, 则关闭这个连接.
另外需要注意的是, 这几个PostgreSQL参数对PostgreSQL数据库服务端的backend process生效.
所以如果发出第一个keepalive包后, 在tcp_keepalives_interval秒内有客户端回包, 则又回到tcp_keepalives_idle计数(注意此时计数是tcp_keepalives_idle 减去 tcp_keepalives_interval 秒).
例如 :
CLIENT (172.16.3.33) :
psql -h 172.16.3.150 -p 1919 -U postgres postgres
postgres=# show tcp_keepalives_idle;
tcp_keepalives_idle
---------------------
60
(1 row)
postgres=# show tcp_keepalives_interval;
tcp_keepalives_interval
-------------------------
10
(1 row)
postgres=# show tcp_keepalives_count;
tcp_keepalives_count
----------------------
10
(1 row)
查找数据库端对应的process id.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
11016
(1 row)
SERVER (172.16.3.150) :
在数据库端查看keepalive timer
root@digoal-PowerEdge-R610:~# netstat -anpo|grep 11016
tcp 0 0 172.16.3.150:1919 172.16.3.33:50326 ESTABLISHED 11016/postgres: pos keepalive (39.73/0/0)
CLIENT (172.16.3.33) :
在客户端查看keepalive timer
postgres=# \!
[pg92@db-172-16-3-33 ~]$ netstat -anpo|grep 1919
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 172.16.3.33:50326 172.16.3.150:1919 ESTABLISHED 20408/psql keepalive (7143.19/0/0)
继承了操作系统的keepalive设置
通过tcpdump可以观察间隔一定的时间, 会发出keepalive包.
方法二、设置操作系统级的参数:
/etc/sysctl.conf
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
设置CLIENT服务器系统级的keepalive, 然后重新连接到数据库, 看看客户端的keepalive timer会不会发生变化
[root@db-172-16-3-33 ~]# sysctl -w net.ipv4.tcp_keepalive_time=70
net.ipv4.tcp_keepalive_time = 70
[root@db-172-16-3-33 ~]# su - pg92
pg92@db-172-16-3-33-> psql -h 172.16.3.150 -p 1919 -U postgres postgres
psql (9.2.4)
Type "help" for help.
postgres=# \!
[pg92@db-172-16-3-33 ~]$ netstat -anpo|grep 1919
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 172.16.3.33:50327 172.16.3.150:1919 ESTABLISHED 20547/psql keepalive (55.44/0/0)
系统层设置的keepalive已经生效了.
其他
.1.
通过tcpdump观察keepalive包, 也可以将这些包抓下来通过wireshark查看.
pg92@db-172-16-3-33-> psql -h 172.16.3.150 -p 1919 -U postgres postgres
postgres=# set tcp_keepalives_idle=13;
SET
root@digoal-PowerEdge-R610:~# tcpdump -i eth0 -n 'tcp port 1919'
08:43:27.647408 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [P.], seq 4937:4952, ack 58, win 115, length 15
08:43:27.647487 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0
08:43:40.667410 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0
08:43:40.667536 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0
08:43:53.691417 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0
08:43:53.691544 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0
08:44:06.715416 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0
08:44:06.715544 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0
08:44:19.739422 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0
08:44:19.739544 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0
08:44:32.763416 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0
08:44:32.763546 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0
每个sock会话, 每隔13秒, 数据库服务端会发送心跳包.
.2.
由于每个tcp会话都需要1个计时器, 所以如果连接数很多, 开启keepalive也是比较耗费资源的.
可以使用setsockopt关闭该会话keepalive的功能. 下一篇BLOG介绍如何禁用keepalive.
.3.
如果tcp_keepalives_idle小于tcp_keepalives_interval, 那么间隔多长时间发1个心跳包呢?
例如tcp_keepalives_idle=2, tcp_keepalives_interval=10.
答案是10, 因为检查计时需要10秒.
postgres=# set tcp_keepalives_idle=2;
SET
postgres=# set tcp_keepalives_interval=10;
SET
root@digoal-PowerEdge-R610-> tcpdump -i eth0 -n 'tcp port 1919'
09:32:27.035424 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0
09:32:27.035608 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0
09:32:37.051426 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0
09:32:37.051569 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0
09:32:47.067423 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0
09:32:47.067552 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0
09:32:57.083428 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0
09:32:57.083574 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0
参考
.1. http://www.postgresql.org/docs/9.2/static/runtime-config-connection.html
tcp_keepalives_idle (integer)
Specifies the number of seconds before sending a keepalive packet on an otherwise idle connection. A value of 0 uses the system default. This parameter is supported only on systems that support the TCP_KEEPIDLE or TCP_KEEPALIVE symbols, and on Windows; on other systems, it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always reads as zero.
Note: On Windows, a value of 0 will set this parameter to 2 hours, since Windows does not provide a way to read the system default value.
tcp_keepalives_interval (integer)
Specifies the number of seconds between sending keepalives on an otherwise idle connection. A value of 0 uses the system default. This parameter is supported only on systems that support the TCP_KEEPINTVL symbol, and on Windows; on other systems, it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always reads as zero.
Note: On Windows, a value of 0 will set this parameter to 1 second, since Windows does not provide a way to read the system default value.
tcp_keepalives_count (integer)
Specifies the number of keepalive packets to send on an otherwise idle connection. A value of 0 uses the system default. This parameter is supported only on systems that support the TCP_KEEPCNT symbol; on other systems, it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always reads as zero.
Note: This parameter is not supported on Windows, and must be zero.
.2. /usr/share/doc/kernel/Documentation/networking/ip-sysctl.txt
tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by
tcp_keepalive_probes it is time to kill not responding connection,
after probes started. Default value: 75sec i.e. connection
will be aborted after ~11 minutes of retries.
.3. src/backend/libpq/pqcomm.c
截取一个设置interval的函数.
int
pq_setkeepalivesinterval(int interval, Port *port)
{
if (port == NULL || IS_AF_UNIX(port->laddr.addr.ss_family))
return STATUS_OK;
#if defined(TCP_KEEPINTVL) || defined (SIO_KEEPALIVE_VALS)
if (interval == port->keepalives_interval)
return STATUS_OK;
#ifndef WIN32
if (port->default_keepalives_interval <= 0)
{
if (pq_getkeepalivesinterval(port) < 0)
{
if (interval == 0)
return STATUS_OK; /* default is set but unknown */
else
return STATUS_ERROR;
}
}
if (interval == 0)
interval = port->default_keepalives_interval;
if (setsockopt(port->sock, IPPROTO_TCP, TCP_KEEPINTVL,
(char *) &interval, sizeof(interval)) < 0)
{
elog(LOG, "setsockopt(TCP_KEEPINTVL) failed: %m");
return STATUS_ERROR;
}
port->keepalives_interval = interval;
#else /* WIN32 */
return pq_setkeepaliveswin32(port, port->keepalives_idle, interval);
#endif
#else
if (interval != 0)
{
elog(LOG, "setsockopt(TCP_KEEPINTVL) not supported");
return STATUS_ERROR;
}
#endif
return STATUS_OK;
}
.4. man netstat
-o, --timers
Include information related to networking timers.
.5. man 7 tcp
/proc interfaces
System-wide TCP parameter settings can be accessed by files in the directory /proc/sys/net/ipv4/. In addition, most IP
/proc interfaces also apply to TCP; see ip(7). Variables described as Boolean take an integer value, with a nonzero
value ("true") meaning that the corresponding option is enabled, and a zero value ("false") meaning that the option is
disabled.
tcp_keepalive_intvl (integer; default: 75; since Linux 2.4)
The number of seconds between TCP keep-alive probes.
tcp_keepalive_probes (integer; default: 9; since Linux 2.2)
The maximum number of TCP keep-alive probes to send before giving up and killing the connection if no response is
obtained from the other end.
tcp_keepalive_time (integer; default: 7200; since Linux 2.2)
The number of seconds a connection needs to be idle before TCP begins sending out keep-alive probes. Keep-alives
are only sent when the SO_KEEPALIVE socket option is enabled. The default value is 7200 seconds (2 hours). An
idle connection is terminated after approximately an additional 11 minutes (9 probes an interval of 75 seconds
apart) when keep-alive is enabled.
Socket Options
To set or get a TCP socket option, call getsockopt(2) to read or setsockopt(2) to write the option with the option level
argument set to IPPROTO_TCP. In addition, most IPPROTO_IP socket options are valid on TCP sockets. For more information
see ip(7).
TCP_KEEPCNT (since Linux 2.4)
The maximum number of keepalive probes TCP should send before dropping the connection. This option should not be
used in code intended to be portable.
TCP_KEEPIDLE (since Linux 2.4)
The time (in seconds) the connection needs to remain idle before TCP starts sending keepalive probes, if the
socket option SO_KEEPALIVE has been set on this socket. This option should not be used in code intended to be
portable.
TCP_KEEPINTVL (since Linux 2.4)
The time (in seconds) between individual keepalive probes. This option should not be used in code intended to be
portable.
.6. netstat core :
以下内容转载自 :
http://vzkernel.blogspot.tw/2012/09/description-of-netstat-timers.html
It's not easy to find out the detail description of a network socket timer from internet, I did some dig today.
The mannual page from netstat:
-o, --timers
Include information related to networking timers.
Then we check some command output:
netstat -nto | head
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 0 127.0.0.1:5005 127.0.0.1:55309 SYN_RECV on (5.14/1/0)
tcp 0 0 127.0.0.1:5005 127.0.0.1:55312 SYN_RECV on (1.34/0/0)
tcp 0 0 127.0.0.1:5005 127.0.0.1:55310 SYN_RECV on (2.34/0/0)
tcp 0 0 127.0.0.1:5005 127.0.0.1:55303 SYN_RECV on (4.14/1/0)
tcp 0 0 192.168.1.16:57018 74.125.128.132:443 ESTABLISHED off (0.00/0/0)
tcp 0 0 192.168.1.16:41245 203.208.46.2:443 ESTABLISHED off (0.00/0/0)
tcp 0 0 192.168.1.16:42636 203.208.46.7:443 TIME_WAIT timewait (44.66/0/0)
tcp 0 0 127.0.0.1:55302
The Timer field with the format (5.14/1/0), what does it mean?
Let's figure it out.
The second step, check the source code from net-tools, grab the source code from source forge:
git clone git://net-tools.git.sourceforge.net/gitroot/net-tools/net-tools
from userspace netstat.c:
tcp_do_one():
{
....
if (flag_opt)
switch (timer_run) {
case 0:
snprintf(timers, sizeof(timers), _("off (0.00/%ld/%d)"), retr, timeout);
break;
case 1:
snprintf(timers, sizeof(timers), _("on (%2.2f/%ld/%d)"),
(double) time_len / HZ, retr, timeout);
break;
case 2:
snprintf(timers, sizeof(timers), _("keepalive (%2.2f/%ld/%d)"),
(double) time_len / HZ, retr, timeout);
break;
case 3:
snprintf(timers, sizeof(timers), _("timewait (%2.2f/%ld/%d)"),
(double) time_len / HZ, retr, timeout);
break;
default:
snprintf(timers, sizeof(timers), _("unkn-%d (%2.2f/%ld/%d)"),
timer_run, (double) time_len / HZ, retr, timeout);
break;
}
Both the fields are grabbed from proc/net/tcp, let's check the content of it:
$ head /proc/net/tcp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeoutinode
0: 00000000:036B 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 13012 1 ffff88007baf5400 299 0 0 2 -1
1: 0100007F:138D 00000000:0000 0A 00000000:00000006 00:00000000 00000000 500 0 472674 1 ffff880021ab0380 299 0 0 2 -1
2: 00000000:006F 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 11242 1 ffff8800796006c0 299 0 0 2 -1
3: 00000000:BD50 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 13056 1 ffff880078da7440 299 0 0 2 -1
4: 017AA8C0:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 14066 1 ffff880078dac100 299 0 0 2 -1
description from proc_net_tcp.txt
timer_active:
0 no timer is pending
1 retransmit-timer is pending
2 another timer (e.g. delayed ack or keepalive) is pending
3 this is a socket in TIME_WAIT state. Not all fields will contain
data (or even exist)
4 zero window probe timer is pending
No too much glue? Let's dive into the kernel code to have a look how the proc_net_tcp defined:
net/ipv4/tcp_ipv4.c:
static int tcp4_seq_show(struct seq_file *seq, void *v)
{
struct tcp_iter_state *st;
int len;
if (v == SEQ_START_TOKEN) {
seq_printf(seq, "%-*s\n", TMPSZ - 1,
" sl local_address rem_address st tx_queue "
"rx_queue tr tm->when retrnsmt uid timeout "
"inode");
goto out;
}
st = seq->private;
switch (st->state) {
case TCP_SEQ_STATE_LISTENING:
case TCP_SEQ_STATE_ESTABLISHED:
get_tcp4_sock(v, seq, st->num, &len);
break;
case TCP_SEQ_STATE_OPENREQ:
get_openreq4(st->syn_wait_sk, v, seq, st->num, st->uid, &len);
break;
case TCP_SEQ_STATE_TIME_WAIT:
get_timewait4_sock(v, seq, st->num, &len);
break;
}
seq_printf(seq, "%*s\n", TMPSZ - 1 - len, "");
out:
return 0;
}
static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
{
....
seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
"%08X %5d %8d %lu %d %pK %lu %lu %u %u %d%n",
i, src, srcp, dest, destp, sk->sk_state,
tp->write_seq - tp->snd_una,
rx_queue,
timer_active,
jiffies_to_clock_t(timer_expires - jiffies),
icsk->icsk_retransmits,
sock_i_uid(sk),
icsk->icsk_probes_out,
sock_i_ino(sk),
atomic_read(&sk->sk_refcnt), sk,
jiffies_to_clock_t(icsk->icsk_rto),
jiffies_to_clock_t(icsk->icsk_ack.ato),
(icsk->icsk_ack.quick << 1) | icsk->icsk_ack.pingpong,
tp->snd_cwnd,
tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh,
len);
}
which defined in include/net/inet_connection_sock.h:
/** inet_connection_sock - INET connection oriented sock
*
* @icsk_accept_queue: FIFO of established children
* @icsk_bind_hash: Bind node
* @icsk_timeout: Timeout
* @icsk_retransmit_timer: Resend (no ack)
* @icsk_rto: Retransmit timeout
* @icsk_pmtu_cookie Last pmtu seen by socket
* @icsk_ca_ops Pluggable congestion control hook
* @icsk_af_ops Operations which are AF_INET{4,6} specific
* @icsk_ca_state: Congestion control state
* @icsk_retransmits: Number of unrecovered [RTO] timeouts
* @icsk_pending: Scheduled timer event
* @icsk_backoff: Backoff
* @icsk_syn_retries: Number of allowed SYN (or equivalent) retries
* @icsk_probes_out: unanswered 0 window probes
* @icsk_ext_hdr_len: Network protocol overhead (IP/IPv6 options)
* @icsk_ack: Delayed ACK control data
* @icsk_mtup; MTU probing control data
*/
For a not established socket
static void get_openreq4(const struct sock *sk, const struct request_sock *req,
struct seq_file *f, int i, int uid, int *len)
{
const struct inet_request_sock *ireq = inet_rsk(req);
int ttd = req->expires - jiffies;
seq_printf(f, "%4d: %08X:%04X %08X:%04X"
" %02X %08X:%08X %02X:%08lX %08X %5d %8d %u %d %pK%n",
i,
ireq->loc_addr,
ntohs(inet_sk(sk)->inet_sport),
ireq->rmt_addr,
ntohs(ireq->rmt_port),
TCP_SYN_RECV,
0, 0, /* could print option size, but that is af dependent. */
1, /* timers active (only the expire timer) */
jiffies_to_clock_t(ttd),
req->retrans,
uid,
0, /* non standard timer */
0, /* open_requests have no inode */
atomic_read(&sk->sk_refcnt),
req,
len);
}
static void get_timewait4_sock(const struct inet_timewait_sock *tw,
struct seq_file *f, int i, int *len)
{
__be32 dest, src;
__u16 destp, srcp;
int ttd = tw->tw_ttd - jiffies;
if (ttd < 0)
ttd = 0;
dest = tw->tw_daddr;
src = tw->tw_rcv_saddr;
destp = ntohs(tw->tw_dport);
srcp = ntohs(tw->tw_sport);
seq_printf(f, "%4d: %08X:%04X %08X:%04X"
" %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK%n",
i, src, srcp, dest, destp, tw->tw_substate, 0, 0,
3, jiffies_to_clock_t(ttd), 0, 0, 0, 0,
atomic_read(&tw->tw_refcnt), tw, len);
}
Let's back to our questions, The description of the 'Timer' field from 'netstat -o'
which with the format (1st/2nd/3rd)
1. The 1st field indicates when the timer will expire
2. The 2nd field is the retransmits which already have done
3. The 3rd field - for a synreq socket(not yet established) and a timewait socket it's always 0, for a established socket it's 'unanswered 0 window probes'
TCP zero window probe means that the receiver has reduced his receive buffer (a.k.a. window) to zero, basically telling the sender to stop sending - usually for performance reasons. If the receiver does not recover and send an so called "Window Update" with a buffer size greater than zero (meaning, the sender is allowed to continue) the sender will become "impatient" at some point and "check" if the receiver is able to receive more data. That "check" is the Zero Window Probe.
TCP Keep-Alive - Occurs when the sequence number is equal to the last byte of data in the previous packet. Used to elicit an ACK from the receiver.
TCP Keep-Alive ACK - Self-explanatory. ACK packet sent in response to a "keep-alive" packet.
TCP DupACK - Occurs when the same ACK number is seen AND it is lower than the last byte of data sent by the sender. If the receiver detects a gap in the sequence numbers, it will generate a duplicate ACK for each subsequent packet it receives on that connection, until the missing packet is successfully received (retransmitted). A clear indication of dropped/missing packets.
TCP ZeroWindow - Occurs when a receiver advertises a receive window size of zero. This effectively tells the sender to stop sending because the receiver's buffer is full. Indicates a resource issue on the receiver, as the application is not retrieving data from the TCP buffer in a timely manner.
TCP ZerowindowProbe - The sender is testing to see if the receiver's zero window condition still exists by sending the next byte of data to elicit an ACK from the receiver. If the window is still zero, the sender will double his persist timer before probing again.