[20180126]奇怪的SQLNet message from dblink(模拟).txt
--//链接:http://blog.itpub.net/267265/viewspace-2150510/,但是是由于状态防火墙问题,导致链接断开,再次使用链路时报
--//'SQL*Net message from dblink'错误.
--//我前面测试设置SQLNET.EXPIRE_TIME参数,网络监测出现问题时重试以及发探测包间隔的情况.
--//这次使用内核参数tcp_keepalive的情况:
1.环境:
SCOTT@book> @ &r/ver1
PORT_STRING VERSION BANNER
------------------------------ -------------- --------------------------------------------------------------------------------
x86_64/Linux 2.4.xx 11.2.0.4.0 Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
--//注解SQLNET.EXPIRE_TIME参数.设置内核参数如下:
# echo /proc/sys/net/ipv4/tcp_keepalive* | xargs -n 1 strings -1 -f
/proc/sys/net/ipv4/tcp_keepalive_intvl: 10
/proc/sys/net/ipv4/tcp_keepalive_probes: 4
/proc/sys/net/ipv4/tcp_keepalive_time: 60
--//建立dblink:
SCOTT@book> CREATE PUBLIC DATABASE LINK LOOPBACK CONNECT TO SCOTT IDENTIFIED BY book USING 'localhost:1521/book:DEDICATED';
Database link created.
--//后面不上DEDICATED,避免使用共享模式.测试
SCOTT@book> select sysdate from dual@loopback ;
SYSDATE
-------------------
2018-01-26 11:50:10
--//测试ok,退出.
2.测试前准备:
# netstat -tunlp | grep 127.0.0.1
tcp 0 0 127.0.0.1:1521 0.0.0.0:* LISTEN 24802/tnslsnr
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN 29590/cupsd
# lsof -i -P -n | grep 127.0.0.1:
oracle 14308 oracle 11u IPv4 41475691 0t0 TCP 127.0.0.1:7801->127.0.0.1:1521 (ESTABLISHED)
tnslsnr 24802 oracle 7u IPv4 41438030 0t0 TCP 127.0.0.1:1521 (LISTEN)
tnslsnr 24802 oracle 14u IPv4 41481937 0t0 TCP 127.0.0.1:1521->127.0.0.1:7801 (ESTABLISHED)
cupsd 29590 root 4u IPv4 23111711 0t0 TCP 127.0.0.1:631 (LISTEN)
--//这样使用tcpdump要过滤631,7801端口.
# tcpdump -i lo tcp and host 127.0.0.1 and not port 6150 and not port 631 and not port 7801 and -nn -ttt
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 96 bytes
--//远程登录数据库:
R:\fyhis>sqlplus scott/book@78
SQL*Plus: Release 12.1.0.1.0 Production on Fri Jan 26 11:52:17 2018
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
# tcpdump -i lo tcp and host 127.0.0.1 and not port 6150 and not port 631 and not port 7801 -nn -vvv
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 96 bytes
--//执行sql语句:
SCOTT@78> select sysdate from dual@loopback ;
SYSDATE
-------------------
2018-01-26 11:53:54
---
11:53:54.803792 IP (tos 0x0, ttl 64, id 40513, offset 0, flags [DF], proto: TCP (6), length: 370) 127.0.0.1.9444 > 127.0.0.1.1521: P 3589150680:3589150998(318) ack 2600078046 win 385
11:53:54.804158 IP (tos 0x0, ttl 64, id 27704, offset 0, flags [DF], proto: TCP (6), length: 369) 127.0.0.1.1521 > 127.0.0.1.9444: P 1:318(317) ack 318 win 385
11:53:54.804247 IP (tos 0x0, ttl 64, id 40514, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.9444 > 127.0.0.1.1521: ., cksum 0xfe28 (incorrect (-> 0x7ecc), 318:318(0) ack 318 win 385
11:53:54.804401 IP (tos 0x0, ttl 64, id 40515, offset 0, flags [DF], proto: TCP (6), length: 194) 127.0.0.1.9444 > 127.0.0.1.1521: P 318:460(142) ack 318 win 385
11:53:54.804508 IP (tos 0x0, ttl 64, id 27705, offset 0, flags [DF], proto: TCP (6), length: 83) 127.0.0.1.1521 > 127.0.0.1.9444: P 318:349(31) ack 460 win 385
11:53:54.804881 IP (tos 0x0, ttl 64, id 40516, offset 0, flags [DF], proto: TCP (6), length: 365) 127.0.0.1.9444 > 127.0.0.1.1521: P 460:773(313) ack 349 win 385
11:53:54.805248 IP (tos 0x0, ttl 64, id 27706, offset 0, flags [DF], proto: TCP (6), length: 432) 127.0.0.1.1521 > 127.0.0.1.9444: P 349:729(380) ack 773 win 385
11:53:54.807364 IP (tos 0x0, ttl 64, id 40517, offset 0, flags [DF], proto: TCP (6), length: 73) 127.0.0.1.9444 > 127.0.0.1.1521: P, cksum 0xfe3d (incorrect (-> 0x5f64), 773:794(21) ack 729 win 385
11:53:54.807563 IP (tos 0x0, ttl 64, id 27707, offset 0, flags [DF], proto: TCP (6), length: 224) 127.0.0.1.1521 > 127.0.0.1.9444: P 729:901(172) ack 794 win 385
11:53:54.846786 IP (tos 0x0, ttl 64, id 40518, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.9444 > 127.0.0.1.1521: ., cksum 0xfe28 (incorrect (-> 0x7a7b), 794:794(0) ack 901 win 385
...//等...不执行任何有关dblink操作.
11:55:55.000981 IP (tos 0x0, ttl 64, id 27709, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0xba1d), 900:900(0) ack 794 win 385
11:55:55.001004 IP (tos 0x0, ttl 64, id 40520, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.9444 > 127.0.0.1.1521: ., cksum 0xfe28 (incorrect (-> 0xa51f), 794:794(0) ack 901 win 385
11:56:55.161017 IP (tos 0x0, ttl 64, id 27710, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0xe49b), 900:900(0) ack 794 win 385
11:56:55.161042 IP (tos 0x0, ttl 64, id 40521, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.9444 > 127.0.0.1.1521: ., cksum 0xfe28 (incorrect (-> 0xba1e), 794:794(0) ack 901 win 385
--//可以发现间隔1分钟.如何实现网络端口呢? 我想到几种方法,
1.kill 有关链接进程.
2.kill -19 SIGSTOP 有关链接进程.
3.使用iptables防火墙,阻塞包的发送.
--//理论讲使用第3个方法更加保险.
# lsof -i -n -P | grep 127.0.0.1.9444
oracle 26397 oracle 7u IPv4 41479570 0t0 TCP 127.0.0.1:9444->127.0.0.1:1521 (ESTABLISHED)
oracle 26400 oracle 15u IPv4 41377471 0t0 TCP 127.0.0.1:1521->127.0.0.1:9444 (ESTABLISHED)
# ps -ef | egrep "2639[7]|2640[0]"
oracle 26397 1 0 11:52 ? 00:00:00 oraclebook (LOCAL=NO)
oracle 26400 1 0 11:52 ? 00:00:00 oraclebook (LOCAL=NO)
--//服务端进程是26400,客户端进程是26397.在客户端执行如下也可以确定:
SCOTT@78> @ &r/spid
SID SERIAL# PROCESS SERVER SPID PID P_SERIAL# C50
---------- ---------- ------------------------ --------- ------ ------- ---------- --------------------------------------------------
274 45 8452:8216 DEDICATED 26397 21 20 alter system kill session '274,45' immediate;
# iptables -A INPUT -p tcp -s 127.0.0.1 -d 127.0.0.1 --dport 9444 -i lo -j DROP
# tcpdump -i lo tcp and host 127.0.0.1 and not port 6150 and not port 631 and not port 7801 -nn -vvv
...
12:08:57.081298 IP (tos 0x0, ttl 64, id 27722, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0xdc85), 900:900(0) ack 794 win 385
12:08:57.081361 IP (tos 0x0, ttl 64, id 40533, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.9444 > 127.0.0.1.1521: ., cksum 0xfe28 (incorrect (-> 0xb613), 794:794(0) ack 901 win 385
--//这里启动防火墙
SCOTT@78> SELECT count(*) FROM V$ACTIVE_SESSION_HISTORY WHERE event = 'SQL*Net message from dblink' AND sample_time > '2018-01-26 12:08:55';
COUNT(*)
----------
0
# iptables -A INPUT -p tcp -s 127.0.0.1 -d 127.0.0.1 --dport 9444 -i lo -j DROP
--//等...
12:09:57.241304 IP (tos 0x0, ttl 64, id 27723, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0x0684), 900:900(0) ack 794 win 385
12:10:07.257024 IP (tos 0x0, ttl 64, id 27724, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0xdf63), 900:900(0) ack 794 win 385
12:10:17.272859 IP (tos 0x0, ttl 64, id 27725, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0xb843), 900:900(0) ack 794 win 385
12:10:27.289085 IP (tos 0x0, ttl 64, id 27726, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: ., cksum 0xfe28 (incorrect (-> 0x9123), 900:900(0) ack 794 win 385
12:10:37.305371 IP (tos 0x0, ttl 64, id 27727, offset 0, flags [DF], proto: TCP (6), length: 52) 127.0.0.1.1521 > 127.0.0.1.9444: R, cksum 0xfe28 (incorrect (-> 0x69fe), 901:901(0) ack 794 win 385
--//第1个间隔60秒,以后间隔10秒,一共4次.
--//注意包发送方向都是单向的(127.0.0.1.1521 > 127.0.0.1.9444)
# lsof -i -P -n | grep 127.0.0.1:
oracle 14308 oracle 11u IPv4 41475691 0t0 TCP 127.0.0.1:7801->127.0.0.1:1521 (ESTABLISHED)
tnslsnr 24802 oracle 7u IPv4 41438030 0t0 TCP 127.0.0.1:1521 (LISTEN)
tnslsnr 24802 oracle 14u IPv4 41481937 0t0 TCP 127.0.0.1:1521->127.0.0.1:7801 (ESTABLISHED)
oracle 26397 oracle 7u IPv4 41479570 0t0 TCP 127.0.0.1:9444->127.0.0.1:1521 (ESTABLISHED)
cupsd 29590 root 4u IPv4 23111711 0t0 TCP 127.0.0.1:631 (LISTEN)
--//对比前面的结果,可以仅仅存在127.0.0.1:9444->127.0.0.1:1521,而127.0.0.1:1521->127.0.0.1:9444(也就是服务端发出的链接端口)
--//在会话执行:
SCOTT@78> SELECT count(*) FROM V$ACTIVE_SESSION_HISTORY WHERE event = 'SQL*Net message from dblink' AND sample_time > '2018-01-26 12:08:55';
COUNT(*)
----------
0
--//并没有发生事件'SQL*Net message from dblink'.我还可以执行.
SCOTT@78> select sysdate from dual@loopback ;
SYSDATE
-------------------
2018-01-26 12:35:16
--//等很久结果才出来.因为已经阻塞了.不过等927秒后会开启新的dblink连接.
SCOTT@78> SELECT count(*) FROM V$ACTIVE_SESSION_HISTORY WHERE event = 'SQL*Net message from dblink' AND sample_time > '2018-01-26 12:08:55';
COUNT(*)
----------
927
SCOTT@78> SELECT count(*),min(sample_time),max(sample_time),max(sample_time)-min(sample_time) FROM V$ACTIVE_SESSION_HISTORY WHERE event = 'SQL*Net message from dblink' AND sample_time > '2018-01-26 12:08:55';
COUNT(*) MIN(SAMPLE_TIME) MAX(SAMPLE_TIME) MAX(SAMPLE_TIME)-MIN(SAMPLE_TIME)
-------- ----------------------- ----------------------- --------------------------------
927 2018-01-26 12:19:48.503 2018-01-26 12:35:15.685 +000000000 00:15:27.182
--//出现记数927,需要927秒.也就是我执行的时候时.也是我们生产系统遇到的情况.
--//这里并不依赖内核参数tcp_keepalive,因为执行select sysdate from dual@loopback ;是由client发起的连接.
# lsof -i -P -n | grep 127.0.0.1:
oracle 14308 oracle 11u IPv4 41475691 0t0 TCP 127.0.0.1:7801->127.0.0.1:1521 (ESTABLISHED)
tnslsnr 24802 oracle 7u IPv4 41438030 0t0 TCP 127.0.0.1:1521 (LISTEN)
tnslsnr 24802 oracle 14u IPv4 41481937 0t0 TCP 127.0.0.1:1521->127.0.0.1:7801 (ESTABLISHED)
oracle 26397 oracle 7u IPv4 41479609 0t0 TCP 127.0.0.1:9952->127.0.0.1:1521 (ESTABLISHED)
oracle 26788 oracle 15u IPv4 41474821 0t0 TCP 127.0.0.1:1521->127.0.0.1:9952 (ESTABLISHED)
cupsd 29590 root 4u IPv4 23111711 0t0 TCP 127.0.0.1:631 (LISTEN)
--//端口已经发生了变化.
收尾:取消防火墙设置:
# iptables -D INPUT -p tcp -s 127.0.0.1 -d 127.0.0.1 --dport 9444 -i lo -j DROP