[RAC] PMON: terminating the instance due to error 481

Introduction:
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.2.0 and later [Release: 11.2 and later]
Information in this document applies to any platform.
Symptoms
On an 11.2.0.2+ cluster with an instance already running on one node, starting the instance on the other node(s) fails with:
PMON (ospid: 487580): terminating the instance due to error 481
If ASM is used, the +ASMn alert log shows:
Sat Oct 01 19:19:38 2011
MMNL started with pid=21, OS id=6488362
lmon registered with NM - instance number 2 (internal mem no 1)
Sat Oct 01 19:21:37 2011
PMON (ospid: 4915562): terminating the instance due to error 481
Sat Oct 01 19:21:37 2011
System state dump requested by (instance=2, sid=4915562 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_4915388.trc
Dumping diagnostic data in directory=[cdmp_20111001192138], requested by (instance=2, sid=4915562 (PMON)), summary=[abnormal instance termination].
Sat Oct 01 19:21:38 2011
License high water mark = 1
Instance terminated by PMON, pid = 4915562
+ASMn_diag_xxx.trc trace shows:
*** 2011-10-01 19:19:37.526
Reconfiguration starts [incarn=0]

*** 2011-10-01 19:19:37.526
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
...... << repeated messages
If ASM is not used, the DB instance can fail with the same error:
Mon Jul 04 16:22:50 2011
Starting ORACLE instance (normal)
...
Mon Jul 04 16:22:54 2011
MMNL started with pid=24, OS id=667660
starting up 1 shared server(s) ...
lmon registered with NM - instance number 2 (internal mem no 1)
Mon Jul 04 16:26:15 2011
PMON (ospid: 487580): terminating the instance due to error 481


lmon trace shows:
*** 2011-07-04 16:22:59.852
=====================================================
kjxgmpoll: CGS state (0 1) start 0x4e11785e cur 0x4e117863 rcfgtm 5 sec
...
*** 2011-07-04 16:26:14.248
=====================================================
kjxgmpoll: CGS state (0 1) start 0x4e11785e cur 0x4e117926 rcfgtm 200 sec


dia0 trace shows:
*** 2011-07-04 16:22:53.414
Reconfiguration starts [incarn=0]
*** 2011-07-04 16:22:53.414
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

...<< repeated message

Changes
This can happen during patching or after a node reboot.
Cause
The problem is caused by HAIP not being ONLINE on either the running node or the problem node(s).
Essentially, an ASM or DB instance cannot start up if it uses a different cluster_interconnect than the already-running instance.
With HAIP ONLINE, all instances (DB and ASM) should use an HAIP address in the 169.254.x.x range.
If HAIP is OFFLINE on any node, the ASM and DB instances on that node will use the native private network address instead, which causes a communication problem with the instances using HAIP.
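
To confirm which interconnect address each running instance has actually registered, you can query GV$CLUSTER_INTERCONNECTS from any open instance. A minimal sketch, assuming the environment points at the local ASM instance (connect as SYSDBA instead for a database instance):

$ sqlplus -s / as sysasm <<'EOF'
-- One row per instance; with HAIP working, every IP_ADDRESS
-- should fall in the 169.254.x.x range.
set linesize 120
column name format a12
column ip_address format a18
select inst_id, name, ip_address, source
from gv$cluster_interconnects
order by inst_id;
EOF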

Use the following command to verify HAIP status, as the grid user:
$ crsctl stat res -t -init

Check the status of the resource ora.cluster_interconnect.haip.
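
On the problem node the resource will show OFFLINE, similar to the following (illustrative output; racnode2 is a hypothetical node name):

ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE

whereas on a healthy node it shows ONLINE:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       racnode2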
In this example, HAIP is OFFLINE on the running node 1, so +ASM1 is using 10.1.1.1 as its cluster_interconnect. On node 2, HAIP is ONLINE, so +ASM2 is using the HAIP address 169.254.239.144. The mismatch causes a communication problem between the two instances, and +ASM2 cannot start up.
alert_+ASM1.log shows:

Cluster communication is configured to use the following interface(s) for this instance
10.1.1.1

alert_+ASM2.log shows:
Cluster communication is configured to use the following interface(s) for this instance
169.254.239.144
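
To pull these lines out of each alert log quickly, a grep sketch (the path follows the ADR layout shown in the excerpts above; adjust for your ORACLE_BASE and instance name):

$ grep -A1 "Cluster communication is configured" \
      /u01/app/oracle/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | tail -2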
Solution
The solution is to bring HAIP ONLINE on all nodes before starting any ASM or DB instance, either by restarting the HAIP resource or by restarting the GI stack.
In this example, +ASM1 was started first while HAIP was OFFLINE:
1. Try to start HAIP manually on node 1
As the grid user:
$ crsctl start res ora.cluster_interconnect.haip -init
To verify:
$ crsctl stat res -t -init
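
Alternatively, query just that one resource rather than the full list (the -init flag limits crsctl to the local node's lower-stack resources):

$ crsctl stat res ora.cluster_interconnect.haip -init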
2. If this succeeds, restart the ora.asm resource (note: this will bring down all dependent diskgroup and database resources):
As the root user:
# crsctl stop res ora.crsd -init
# crsctl stop res ora.asm -init -f
# crsctl start res ora.asm -init
# crsctl start res ora.crsd -init
Start up any dependent resources as necessary, for example as sketched below.
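
A srvctl sketch, as the grid/oracle user (the diskgroup DATA, database orcl, and node racnode1 are hypothetical names; substitute your own):

$ srvctl start diskgroup -g DATA -n racnode1
$ srvctl start database -d orcl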
3. If the above does not help, try restarting the GI stack on node 1 and check whether HAIP comes ONLINE afterwards.
As root user:
# crsctl stop crs
# crsctl start crs

Check $GRID_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.log for any HAIP errors.
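
For example (a sketch; substitute the local node name for <nodename>):

$ grep -i haip \
      $GRID_HOME/log/<nodename>/agent/ohasd/orarootagent_root/orarootagent_root.log | tail -20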
4. Once HAIP is ONLINE on node 1, proceed to start ASM on the remaining cluster nodes and ensure HAIP is ONLINE on all nodes.
$ crsctl start res ora.asm -init
ASM and DB instances should be able to start on all nodes after the above.
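
As a final check, every instance should now report a 169.254.x.x interconnect address. A minimal re-run of the query from the Cause section (sketch):

$ sqlplus -s / as sysasm <<'EOF'
select inst_id, ip_address from gv$cluster_interconnects order by inst_id;
EOF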