RAC 11.2.0.3 LISTENER Terminated Abnormally

Today a colleague ran into a problem where a RAC listener terminated abnormally. The database version is 11.2.0.3 and the operating system is AIX 6.1. The listener log showed the following:
10-MAY-2014 11:44:16 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=zhdw)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.195.160.162)(PORT=59755)) * establish * zhdw * 0
Sat May 10 11:44:27 2014
10-MAY-2014 11:44:27 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=zhdwdb1)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER_SCAN1)(VERSION=186647296)) * status * 0
10-MAY-2014 11:44:35 * service_died * LsnrAgt * 12537
Note the service_died entry. I have seen "died" messages before, but those were caused by the database instance terminating abnormally (something I had paid special attention to at the time). This case is different: the instance itself was perfectly healthy. So my colleague and I analyzed it together.
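For anyone who wants to repeat the same sanity check, the instance and listener state can be confirmed with commands like the following (the database name zhdw is my guess based on the service name in the log; substitute the real DB name):

$ srvctl status database -d zhdw          # confirm the instances themselves are up (db name assumed)
$ crsctl stat res ora.LISTENER.lsnr -t    # per-node state of the local listener resource
$ lsnrctl status LISTENER                 # run as the grid user on the affected node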


Since this is RAC, the first thing to consider is whether CRS itself stopped the listener abnormally, so we began with the CRS alert log:
2014-05-13 11:59:40.629
[/u01/app/11.2.3/grid/bin/oraagent.bin(8847514)]CRS-5016:Process "/u01/app/11.2.3/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.3/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.3/grid/log/zhdwdb2/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-05-13 11:59:40.907
[/u01/app/11.2.3/grid/bin/oraagent.bin(8847514)]CRS-5016:Process "/u01/app/11.2.3/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.3/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.3/grid/log/zhdwdb2/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-05-13 12:01:23.234
The oraagent ran into errors while performing its check action, so the next stop was oraagent_grid.log:
2014-05-13 11:59:40.257: [    A**][2314] {0:3:13726} Agent received the message: RESOURCE_STOP[ora.ons zhdwdb2 1] ID 4099:28335
2014-05-13 11:59:40.258: [    A**][2314] {0:3:13726} Preparing STOP command for: ora.ons zhdwdb2 1
2014-05-13 11:59:40.258: [    A**][2314] {0:3:13726} ora.ons zhdwdb2 1 state changed from: ONLINE to: STOPPING
2014-05-13 11:59:40.262: [    A**][2314] {0:3:13726} Agent received the message: RESOURCE_STOP[ora.LISTENER.lsnr zhdwdb2 1] ID 4099:28336
2014-05-13 11:59:40.262: [    A**][2314] {0:3:13726} Preparing STOP command for: ora.LISTENER.lsnr zhdwdb2 1
2014-05-13 11:59:40.262: [ora.ons][2572] {0:3:13726} [stop] (:CLSN00108:) clsn_agent::stop {
2014-05-13 11:59:40.262: [    A**][2314] {0:3:13726} ora.LISTENER.lsnr zhdwdb2 1 state changed from: ONLINE to: STOPPING
2014-05-13 11:59:40.263: [ora.ons][2572] {0:3:13726} [stop] OnsAgent::stop {
2014-05-13 11:59:40.268: [ora.LISTENER.lsnr][1800] {0:3:13726} [stop] (:CLSN00108:) clsn_agent::stop {
2014-05-13 11:59:40.268: [ora.LISTENER.lsnr][1800] {0:3:13726} [stop] LsnrAgent::stop {
2014-05-13 11:59:40.268: [ USRTHRD][1800] {0:3:13726} Thread:RegEndpointThread:LISTENER stop {
2014-05-13 11:59:40.270: [ USRTHRD][5157] {0:3:13726} Thread:RegEndpointThread:LISTENER isRunning is reset to false here
2014-05-13 11:59:40.270: [ USRTHRD][1800] {0:3:13726} Thread:RegEndpointThread:LISTENER stop }
2014-05-13 11:59:40.270: [ora.LISTENER.lsnr][1800] {0:3:13726} [stop] lsnrctl stop LISTENER


So CRS itself triggered the shutdown of the listener. We then checked orarootagent_root.log to see whether the network resource was abnormal, and found:


2014-05-13 11:59:40.184: [ora.net1.network][2057] {2:64347:2} [check] got lock
2014-05-13 11:59:40.184: [ora.net1.network][2057] {2:64347:2} [check] tryActionLock }
2014-05-13 11:59:40.184: [ora.net1.network][2057] {2:64347:2} [check] abort  }
2014-05-13 11:59:40.185: [ora.net1.network][2057] {2:64347:2} [check] (:CLSN00110:) clsn_agent::abort }
2014-05-13 11:59:40.185: [    A**][2057] {2:64347:2} Command: check for resource: ora.net1.network zhdwdb2 1 completed with status: TIMEDOUT
2014-05-13 11:59:40.185: [    A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 state changed from: ONLINE to: UNKNOWN
2014-05-13 11:59:40.185: [    A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 would be continued to monitored!
2014-05-13 11:59:40.185: [    A**][2314] {2:64347:2} Switching offline monitor to intermedite one
2014-05-13 11:59:40.186: [    A**][2314] {2:64347:2} Started implicit monitor for [ora.net1.network zhdwdb2 1] interval=1000 delay=1000
2014-05-13 11:59:40.186: [    A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 state details has changed from:  to: CHECK TIMED OUT
2014-05-13 11:59:40.186: [    A**][2314] {0:3:13726} Generating new Tint for unplanned state change. Original Tint: {2:64347:2}
2014-05-13 11:59:40.186: [    A**][2314] {0:3:13726} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:1574940
2014-05-13 11:59:41.228: [    A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 state changed from: UNKNOWN to: ONLINE
2014-05-13 11:59:41.228: [    A**][2314] {2:64347:2} Switching offline monitor to online one
The network resource went from ONLINE to UNKNOWN and then back from UNKNOWN to ONLINE within a very short time. This is most likely caused by a brief network hiccup (network flapping).
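The check behaviour of the network resource, and the dependency attributes that become important below, can be read directly from its resource profile; a sketch, run as root or the grid owner:

# crsctl stat res ora.net1.network -p | grep -E 'CHECK_INTERVAL|STOP_DEPENDENCIES'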


So the problem is basically pinned down: it was triggered by the network resource running into trouble.


Next, why does a problem with the network resource take the listener down with it? The resource dependency chain looks like this:
LISTENER -- VIP -- public IP -- network resource (NIC)
SCAN LISTENER -- SCAN IP -- public IP -- network resource (NIC)
As soon as the network resource has a problem, every resource layered on top of it is affected.
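The chain above can be read straight out of the resource profiles; for example (the VIP resource name ora.zhdwdb2.vip is an assumption based on the host name in the logs):

# crsctl stat res ora.LISTENER.lsnr -p | grep -E 'START_DEPENDENCIES|STOP_DEPENDENCIES'
# crsctl stat res ora.zhdwdb2.vip -p | grep -E 'START_DEPENDENCIES|STOP_DEPENDENCIES'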


Next, why does a problem with the local listener alone also cause connection failures? The connection path is:
client --- SCAN listener (load-balances and picks the target instance) --- local listener (if this link is broken, the connection still fails) --- database instance
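The registration that makes this routing work can be seen by querying the listeners directly (listener names taken from the log excerpts above):

$ lsnrctl services LISTENER_SCAN1    # services/instances registered with the SCAN listener via remote_listener
$ lsnrctl services LISTENER          # services/instances registered with the node's local listener via local_listener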


So a problem with just the local listener is enough to break client connectivity.
The possible causes therefore are:
1. Network flapping
2. Various bugs


A search on MOS (MetaLink) turned up two notes describing this issue:
1.

Agent Check Timeout for Network and ACFS resources occasionally in AIX (Doc ID 1558510.1)


APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.2 and later
IBM AIX on POWER Systems (64-bit)
SYMPTOMS

This is for AIX only.  The problem is AIX specific.

The cluster alert.log shows check timeout errors for acfs and network resource occasionally:





CAUSE

A defect with AIX causes the unix command "pwd" to hang for a minute or two.
This affects all LPARs on the server at the same time.

The OSWatcher also hangs for a minute or two because it uses pwd command, so have the customer set up OSwatcher and check if the OSWatcher hangs at the same time the check timeout error occurs.  If OSWatcher hangs at the same time for about a minute or two, then the server is running into this problem.

SOLUTION

Resolved by upgrading AIX and firmware to the following:

Firmware
IBM,9117-MMB
The current permanent system firmware image is AM730_087
The current temporary system firmware image is AM730_087
VIO server version 2.2.1.4
AIX version 6100-07-04-1216
 
After the above patching, the problem is resolved. 
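The note does not include a way to reproduce the symptom, but if you want to confirm the pwd hang before patching, a crude timing loop like the one below will do (my own sketch, not part of the MOS note; it uses perl for epoch seconds because date +%s may not be available on AIX):

#!/bin/sh
# Log every 'pwd' invocation that takes longer than 5 seconds (the threshold is arbitrary).
while true; do
  start=$(perl -e 'print time')
  pwd > /dev/null
  end=$(perl -e 'print time')
  if [ $((end - start)) -gt 5 ]; then
    echo "$(date) : pwd took $((end - start)) seconds" >> /tmp/pwd_hang.log
  fi
  sleep 10
done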

2.
VIP, SCAN VIP/Listener Fails Over and Listener Stops After Short Public Network Hiccup (Doc ID 1333165.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 12.1.0.1 [Release 11.2 to 12.1]
Information in this document applies to any platform.
SYMPTOMS

After check timed out, 11gR2 Grid Infrastructure network resource (usually ora.net1.network) goes to INTERMEDIATE state, then goes back to ONLINE very shortly. This note will not discuss cause of check time out, but most common cause is public network hiccup.

Once network resource goes into INTERMEDIATE state, it may trigger VIP, service, SCAN VIP/SCAN listener, ora.cvu and ora.ons etc to be failed over/go offline due to resource dependence, which could result in unnecessary connectivity issue for that period of time. After network resource is back online, affected resources may not come back online.

$GRID_HOME/log/<node>/crsd/crsd.log
2011-06-12 07:12:31.261: [    A**][10796] {0:1:2881} Received state change for ora.net1.network racnode1 1 [old state = ONLINE, new state = UNKNOWN]
2011-06-12 07:12:31.261: [    A**][10796] {0:1:2881} Received state LABEL change for ora.net1.network racnode1 1 [old label  = , new label = CHECK TIMED OUT]
..
2011-06-12 07:12:31.297: [   CRSPE][12081] {0:1:2881} RI [ora.net1.network racnode1 1] new external state [INTERMEDIATE] old value: [ONLINE] on racnode1 label = [CHECK TIMED OUT] 
..
2011-06-12 07:12:31.981: [    A**][10796] {0:1:2882} Received state change for ora.net1.network racnode1 1 [old state = UNKNOWN, new state = ONLINE]
..
2011-06-12 07:12:32.307: [   CRSPE][12081] {0:1:2881} RI [ora.LISTENER.lsnr racnode1 1] new internal state: [STOPPING] old value: [STABLE]
2011-06-12 07:12:32.308: [   CRSPE][12081] {0:1:2881} CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'racnode1'
$GRID_HOME/log/<node>/agent/crsd/orarootagent_root/orarootagent_root.log
2011-06-12 07:12:08.965: [    A**][2070] {1:27767:2} Created alert : (:CRSAGF00113:) :  Aborting the command: check for resource: ora.net1.network racnode1 1
2011-06-12 07:12:08.966: [ora.net1.network][2070] {1:27767:2} [check] clsn_agent::abort {
..
2011-06-12 07:12:31.257: [    A**][2070] {1:27767:2} Command: check for resource: ora.net1.network racnode1 1 completed with status: TIMEDOUT
2011-06-12 07:12:31.258: [    A**][2314] {1:27767:2} ora.net1.network racnode1 1 state changed from: ONLINE to: UNKNOWN
2011-06-12 07:12:31.258: [    A**][2314] {1:27767:2} ora.net1.network racnode1 1 would be continued to monitored!
2011-06-12 07:12:31.258: [    A**][2314] {1:27767:2} ora.net1.network racnode1 1 state details has changed from:  to: CHECK TIMED OUT
..
2011-06-12 07:12:31.923: [ora.net1.network][2314][F-ALGO] {1:27767:2} CHECK initiated by timer for: ora.net1.network racnode1 1
..
2011-06-12 07:12:31.973: [ora.net1.network][8502][F-ALGO] {1:27767:2} [check] Command check for resource: ora.net1.network racnode1 1 completed with status ONLINE
2011-06-12 07:12:31.978: [    A**][2314] {1:27767:2} ora.net1.network racnode1 1 state changed from: UNKNOWN to: ONLINE
$GRID_HOME/log/<node>/agent/crsd/oraagent_<owner>/oraagent_<owner>.log
2011-06-12 07:12:32.335: [    A**][2314] {0:1:2881} Agent received the message: RESOURCE_STOP[ora.LISTENER.lsnr racnode1 1] ID 4099:14792
2011-06-12 07:12:32.335: [    A**][2314] {0:1:2881} Preparing STOP command for: ora.LISTENER.lsnr racnode1 1
2011-06-12 07:12:32.335: [    A**][2314] {0:1:2881} ora.LISTENER.lsnr racnode1 1 state changed from: ONLINE to: STOPPING
 
$GRID_HOME/log/<node>/alert<node>.log
2012-01-10 06:48:18.474 [/ocw/grid/bin/orarootagent.bin(10485902)]CRS-5818:Aborted command 'check for resource: ora.net1.network racnode1 1' for resource 'ora.net1.network'. Details at (:CRSAGF00113:) {1:24200:2} in /ocw/grid/log/racnode1/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-01-10 06:48:43.481 [/ocw/grid/bin/oraagent.bin(8847542)]CRS-5016:Process "/ocw/grid/bin/lsnrctl" spawned by agent "/ocw/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/ocw/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-01-10 06:48:43.552 [/ocw/grid/bin/oraagent.bin(8847542)]CRS-5016:Process "/ocw/grid/opmn/bin/onsctli" spawned by agent "/ocw/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/ocw/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log"

CAUSE

SOLUTION
The issue is fixed in a few different bugs:

1. bug 12680491 fixes the dependence between network and VIP

The fix of bug 12680491 will add intermediate modifier to stop dependency between network resource and VIP to avoid unnecessary resource state change, it's included in 11.2.0.2 GI PSU4, 11.2.0.3 GI PSU3, 11.2.0.3 Windows Patch 7, 11.2.0.4 and above. This fix is recommended instead of fix for bug 12378938 to avoid the issue in first place. 

Once patch for this bug is applied, the following needs to be executed to change the dependence for all VIPs:

# $GRID_HOME/bin/crsctl modify res ora.<node>.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.<net#>.network)"

For example:

# /ocw/grid/bin/crsctl modify res ora.racnode1.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
Once the attribute is changed, a restart of nodeapps/VIP is needed to be in effect
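To verify the new attribute and bounce the node apps afterwards, something along these lines works (node name follows the note's racnode1 example; be aware that restarting nodeapps briefly takes the VIP, listener and ONS on that node offline):

# /ocw/grid/bin/crsctl stat res ora.racnode1.vip -p | grep STOP_DEPENDENCIES
# /ocw/grid/bin/srvctl stop nodeapps -n racnode1 -f
# /ocw/grid/bin/srvctl start nodeapps -n racnode1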

2. bug 13582411 fixes the dependence between network and SCAN VIP/listener

The fix of bug 13582411 will add intermediate modifier to stop dependency between network resource and SCAN VIP to avoid unnecessary resource state change, it's included in 11.2.0.3 GI PSU4, 11.2.0.4 and above.

Once patch for this bug is applied, the following needs to be executed to change the dependence for all SCAN VIPs and to restart SCAN VIPs:

# $GRID_HOME/bin/crsctl modify res ora.scan<n>.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net<n>.network)"
For example:

# /ocw/grid/bin/crsctl modify res ora.scan1.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
# /ocw/grid/bin/crsctl modify res ora.scan2.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
# /ocw/grid/bin/crsctl modify res ora.scan3.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
# /ocw/grid/bin/srvctl stop scan -f
$ /ocw/grid/bin/srvctl start scan_listener 
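The SCAN side can be double-checked the same way once the SCAN listeners are back up:

# /ocw/grid/bin/crsctl stat res ora.scan1.vip -p | grep STOP_DEPENDENCIES
# /ocw/grid/bin/srvctl status scan_listener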

3. bug 17435488 fixes the dependence between network and ora.cvu and ora.ons

The fix will add intermediate modifier to stop dependency between network resource and ora.cvu and ora.ons to avoid unnecessary resource state change, it's included in 12.1.0.2.

So the preliminary conclusion is that we need to (see the quick checks below):
1. Modify the stop-dependency attributes of the VIP and SCAN VIP resources in the RAC cluster
2. Check the AIX OS and firmware levels
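Before changing anything it is worth checking whether the installed GI PSU already contains the two fixes quoted above; a sketch, run with ORACLE_HOME pointing at the Grid home, as the grid owner:

$ $ORACLE_HOME/OPatch/opatch lsinventory | grep -i "patch set update"
$ $ORACLE_HOME/OPatch/opatch lsinventory -bugs_fixed | grep -E "12680491|13582411"

If the bug numbers show up in the bugs-fixed list, only the crsctl attribute changes (plus a restart of the affected resources) are still needed; otherwise apply the corresponding PSU first.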



