[20180123]测试SQLNET.EXPIRE_TIME参数.txt
--//曾经写过一篇linux内核网络参数测试tcp_keepalive,链接http://blog.itpub.net/267265/viewspace-2138391/
--//测试服务端会定时发起连接监测与client的连接状态.
参数解析:
/proc/sys/net/ipv4/tcp_keepalive_time 当keepalive起用的时候,TCP发送keepalive消息的频度。默认是2小时。
/proc/sys/net/ipv4/tcp_keepalive_intvl 当探测没有确认时,keepalive探测包的发送间隔。缺省是75秒。
/proc/sys/net/ipv4/tcp_keepalive_probes 如果对方不予应答,keepalive探测包的发送次数。缺省值是9。
--//我发现我遗漏的一种情况的测试,使用SQLNET.EXPIRE_TIME参数,这个参数与前面的内核参数意思差不多.定时监测服务端与客户端连
--//接情况.但是注意这种测试是有服务端发起,一些状态防火墙不允许这样的包通过,我们的分院与总院正好有这样的环境,不知道是否网管
--//配置错误,正好利用这个环境测试看看.主要目的测试设置SQLNET.EXPIRE_TIME情况下,如果无法连上client,是否探测多次,间隔时间是多少.
--//知道这些数值对于理解我遇到的一个问题帮助很大,另发文写一篇blog.
1.环境:
--//服务端IP:192.168.100.106
--//客户端IP:192.168.90.14
--//服务端:
# echo /proc/sys/net/ipv4/tcp_keepalive* | xargs -n 1 strings -1 -f
/proc/sys/net/ipv4/tcp_keepalive_intvl: 75
/proc/sys/net/ipv4/tcp_keepalive_probes: 9
/proc/sys/net/ipv4/tcp_keepalive_time: 7200
--//客户端
# echo /proc/sys/net/ipv4/tcp_keepalive* | xargs -n 1 strings -1 -f
/proc/sys/net/ipv4/tcp_keepalive_intvl: 10
/proc/sys/net/ipv4/tcp_keepalive_probes: 4
/proc/sys/net/ipv4/tcp_keepalive_time: 200
--//在sqlnet.ora文件中配置
sqlnet.expire_time = 5
--//更正说明:实际上这个参数要修改oracle用户的$ORACLE_HOME/network/admin/sqlnet.ora,而里面的定义如下:
$ cat $ORACLE_HOME/network/admin/sqlnet.ora
SQLNET.EXPIRE_TIME=10
--//这样结果基本一致.加上我当时执行一些命令查看端口号,消耗一点时间.注意为什么不是读取grid用户的
--//$ORACLE_HOME/network/admin/sqlnet.ora文件的SQLNET.EXPIRE_TIME = 5参数,另外写blog探究.
2.测试:
--//客户端使用sqlplus连接数据库:
xxxxx>
SELECT s.sid, s.serial#, s.process, s.server, p.spid, p.pid, p.serial# p_serial#,'alter system kill session
'''||s.sid||','||s.serial#||''''||' immediate;' c50
FROM v$session s, v$process p
WHERE s.sid in (
SELECT sid
FROM v$mystat
WHERE rownum = 1)
AND s.paddr = p.addr;
SID SERIAL# PROCESS SERVER SPID PID P_SERIAL# C50
---------- ---------- ------------------------ --------- ------ ------- ---------- --------------------------------------------------
6656 50071 22214 DEDICATED 6980 69 200 alter system kill session '6656,50071' immediate;
--//确定服务端进程号=6980
--//服务端:
# lsof -i -P -n | grep "6980 oracle"
oracle 6980 oracle 8u IPv4 3826220323 0t0 UDP 127.0.0.1:58395
oracle 6980 oracle 14u IPv4 3826216194 0t0 UDP 127.0.0.1:58118
oracle 6980 oracle 18u IPv4 3826173598 0t0 TCP 192.168.100.106:1521->192.168.90.14:41203 (ESTABLISHED)
--//确定client的端口号41203.
# tcpdump -i bondeth0 host 192.168.90.15 and port 41203 -nn -vv
tcpdump: listening on bondeth0, link-type EN10MB (Ethernet), capture size 96 bytes
--//客户端执行一些sql语句以后:
xxxxx> select sysdate from dual ;
SYSDATE
-------------------
2018-01-23 11:52:58
--//客户端打开监测:(注意网络接口不要选择错误!!我测试环境是是bond1):
# tcpdump -i bond1 host 192.168.90.14 and port 41203 -nn -vv
tcpdump: listening on bond1, link-type EN10MB (Ethernet), capture size 65535 bytes
...
--//服务端打开监测:
# tcpdump -i bondeth0 host 192.168.90.14 and port 41203 -nn -vv
tcpdump: listening on bondeth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:52:58.146050 IP (tos 0x0, ttl 61, id 37349, offset 0, flags [DF], proto: TCP (6), length: 346) 192.168.90.14.41203 > 192.168.100.106.1521: P 64396777:64397083(306) ack 1905633603 win 368
11:52:58.146419 IP (tos 0x0, ttl 64, id 25005, offset 0, flags [DF], proto: TCP (6), length: 418) 192.168.100.106.1521 > 192.168.90.14.41203: P 1:379(378) ack 306 win 330
11:52:58.147109 IP (tos 0x0, ttl 61, id 37350, offset 0, flags [DF], proto: TCP (6), length: 40) 192.168.90.14.41203 > 192.168.100.106.1521: ., cksum 0x0860 (correct), 306:306(0) ack 379 win 386
11:52:58.148255 IP (tos 0x0, ttl 61, id 37351, offset 0, flags [DF], proto: TCP (6), length: 61) 192.168.90.14.41203 > 192.168.100.106.1521: P, cksum 0xe25b (correct), 306:327(21) ack 379 win 386
11:52:58.148361 IP (tos 0x0, ttl 64, id 25006, offset 0, flags [DF], proto: TCP (6), length: 212) 192.168.100.106.1521 > 192.168.90.14.41203: P 379:551(172) ack 327 win 330
11:52:58.188314 IP (tos 0x0, ttl 61, id 37352, offset 0, flags [DF], proto: TCP (6), length: 40) 192.168.90.14.41203 > 192.168.100.106.1521: ., cksum 0x078d (correct), 327:327(0) ack 551 win 404
---//等待看看:
--//因为网关配置成状态防火墙的形式,从服务端发起的连接都会被拒接..等看看.
--//已经过了200秒,说明client的内核参数tcp_keepalive_*对于连接的服务端无用.也就是client不会主动发起连接测试.
12:07:20.096941 IP (tos 0x0, ttl 64, id 25007, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:20.311391 IP (tos 0x0, ttl 64, id 25008, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:20.742391 IP (tos 0x0, ttl 64, id 25009, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:21.604392 IP (tos 0x0, ttl 64, id 25010, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:23.328394 IP (tos 0x0, ttl 64, id 25011, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:26.777395 IP (tos 0x0, ttl 64, id 25012, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:33.672394 IP (tos 0x0, ttl 64, id 25013, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:07:47.448449 IP (tos 0x0, ttl 64, id 25014, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:08:15.032451 IP (tos 0x0, ttl 64, id 25015, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:09:10.136405 IP (tos 0x0, ttl 64, id 25016, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:11:00.472405 IP (tos 0x0, ttl 64, id 25017, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:13:00.792400 IP (tos 0x0, ttl 64, id 25018, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:15:01.112403 IP (tos 0x0, ttl 64, id 25019, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:17:01.432397 IP (tos 0x0, ttl 64, id 25020, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:19:01.752399 IP (tos 0x0, ttl 64, id 25021, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
12:21:02.072411 IP (tos 0x0, ttl 64, id 25022, offset 0, flags [DF], proto: TCP (6), length: 50) 192.168.100.106.1521 > 192.168.90.14.41203: P, cksum 0x3fee (incorrect (-> 0x01ab), 551:561(10) ack 327 win 330
--//在出现以上数据包过程中,在服务端执行如下:
--//服务器:
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 10 192.168.100.106:1521 192.168.90.14:41203 ESTABLISHED on (52.68/10/0)
..
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 10 192.168.100.106:1521 192.168.90.14:41203 ESTABLISHED on (103.27/12/0)
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 10 192.168.100.106:1521 192.168.90.14:41203 ESTABLISHED on (90.45/12/0)
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 20 192.168.100.106:1521 192.168.90.14:41203 ESTABLISHED on (18446744073.51/14/0)
--//中间不知道为什么冒出这个大的值.
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 20 192.168.100.106:1521 192.168.90.14:41203 ESTABLISHED on (72.29/15/0)
...
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
--//这时连接已经断开在服务端.
--//说明关于netstat -o 参数的解析参考连接:http://blog.itpub.net/267265/viewspace-2138396/
--//The interpretation of field is, first field = timer countdown value, second field = no. of retransmissions,
--//third field = number of keepalive probes sent.
--//可以发现前面的第1字段逐渐减少,而第2字段增加后,第1字段重新记录.
--//客户端:
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 0 192.168.90.14:41203 192.168.100.106:1521 ESTABLISHED off (0.00/0/0)
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 0 192.168.90.14:41203 192.168.100.106:1521 ESTABLISHED off (0.00/0/0)
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 0 192.168.90.14:41203 192.168.100.106:1521 ESTABLISHED off (0.00/0/0)
--//而客户端的连接还在.
--//客户端,在客户端监测:
tcpdump: listening on bond1, link-type EN10MB (Ethernet), capture size 65535 bytes
...
12:27:17.961544 IP (tos 0x0, ttl 64, id 37353, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 64397104:64397410, ack 1905634153, win 404, length 306
12:27:18.166180 IP (tos 0x0, ttl 64, id 37354, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:27:18.576173 IP (tos 0x0, ttl 64, id 37355, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:27:19.396154 IP (tos 0x0, ttl 64, id 37356, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:27:21.036150 IP (tos 0x0, ttl 64, id 37357, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:27:24.316166 IP (tos 0x0, ttl 64, id 37358, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:27:30.876176 IP (tos 0x0, ttl 64, id 37359, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:27:43.996189 IP (tos 0x0, ttl 64, id 37360, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:28:10.236177 IP (tos 0x0, ttl 64, id 37361, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:29:02.716433 IP (tos 0x0, ttl 64, id 37362, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:30:47.676172 IP (tos 0x0, ttl 64, id 37363, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:32:47.677461 IP (tos 0x0, ttl 64, id 37364, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:34:47.676459 IP (tos 0x0, ttl 64, id 37365, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:36:47.676446 IP (tos 0x0, ttl 64, id 37366, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:38:47.676172 IP (tos 0x0, ttl 64, id 37367, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
12:40:47.676464 IP (tos 0x0, ttl 64, id 37368, offset 0, flags [DF], proto TCP (6), length 346) 192.168.90.14.41203 > 192.168.100.106.1521: Flags [P.], cksum 0x4116 (incorrect -> 0xfb4b), seq 0:306, ack 1, win 404, length 306
--//同时在客户端执行sql命令:
xxxxx> select sysdate from dual ;
--//挂起..由于服务端连接已经不存在了,最后报如下错误:
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 6980
Session ID: 6656 Serial number: 50071
--//客户端:
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 306 192.168.90.14:41203 192.168.100.106:1521 ESTABLISHED on (6.85/7/0)
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
tcp 0 306 192.168.90.14:41203 192.168.100.106:1521 ESTABLISHED on (13.09/9/0)
# netstat -otun | egrep 'Active|Proto|192.168.90.14:41203'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State Timer
--//可以看到timer列后面的变化..看到的情况与服务器前面的情况一样.
3.分析时间变化:
12:07:20.096941
11:52:58.188314
--//间隔15*60+20-58 = 862秒,为什么不是SQLNET.EXPIRE_TIME*2分钟=5*2*60 = 600秒呢?
create table t ( a timestamp);
insert into t values ('2018-01-23 12:07:20.096941');
insert into t values ('2018-01-23 12:07:20.311391');
insert into t values ('2018-01-23 12:07:20.742391');
insert into t values ('2018-01-23 12:07:21.604392');
insert into t values ('2018-01-23 12:07:23.328394');
insert into t values ('2018-01-23 12:07:26.777395');
insert into t values ('2018-01-23 12:07:33.672394');
insert into t values ('2018-01-23 12:07:47.448449');
insert into t values ('2018-01-23 12:08:15.032451');
insert into t values ('2018-01-23 12:09:10.136405');
insert into t values ('2018-01-23 12:11:00.472405');
insert into t values ('2018-01-23 12:13:00.792400');
insert into t values ('2018-01-23 12:15:01.112403');
insert into t values ('2018-01-23 12:17:01.432397');
insert into t values ('2018-01-23 12:19:01.752399');
insert into t values ('2018-01-23 12:21:02.072411');
commit;
column interval format 00000.999999
WITH x
AS (SELECT a - n_a interval
FROM ( SELECT a, LAG (a) OVER (ORDER BY a) n_a
FROM t
ORDER BY a)
WHERE n_a IS NOT NULL)
SELECT EXTRACT (DAY FROM interval) * 86400
+ EXTRACT (HOUR FROM interval) * 3600
+ EXTRACT (MINUTE FROM interval) * 60
+ EXTRACT (SECOND FROM interval)
interval
FROM x;
INTERVAL
-------------
00000.214450
00000.431000
00000.862001
00001.724002
00003.449001
00006.894999
00013.776055
00027.584002
00055.103954
00110.336000
00120.319995
00120.320003
00120.319994
00120.320002
00120.320012
15 rows selected.
--//开始间隔0.21秒,以后间隔2倍递增.到110秒后,按照120秒间隔.另外一组数据测试如下,结果惊人的相似.
INTERVAL
-------------
00000.204636
00000.409993
00000.819981
00001.639996
00003.280016
00006.560010
00013.120013
00026.239988
00052.480256
00104.959739
00120.001289
00119.998998
00119.999987
00119.999726
00120.000292
15 rows selected.
--//最大与最小的间隔
WITH x AS (SELECT MAX (a) - MIN (a) interval FROM t)
SELECT SUM
(
EXTRACT (DAY FROM interval) * 86400
+ EXTRACT (HOUR FROM interval) * 3600
+ EXTRACT (MINUTE FROM interval) * 60
+ EXTRACT (SECOND FROM interval)
)
interval_sum
FROM x;
INTERVAL_SUM
------------
809.71492 <== client测试的结果
INTERVAL_SUM
------------
821.97547 <== 服务端测试结果.
--//换一句话讲,如果client端不kill掉,大约需要81X 秒才退出出现提示符号.
--//我这里测试服务端请求认证后的情况,这时服务器连接已经断开,如果在确认期间,客户端执行命令呢? 会发生什么情况呢?看下一篇blog.
--//也许你抱怨无法模拟以上测试,实际上很简单在服务端使用tmux,screen打开会话(这样网络断开,命令可以一直在运行),再执行以上命
--//令,在客户端执行sql语句后,拔掉网线,等待足够的时间就能模拟我的测试结果.
--//补充说明:
--//我少算了最后1次120秒,这样实际上大约809+120 = 929秒才会退出.