RAC Node Fails to Start with ORA-29702: The Problem and Analysis

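Today I started RAC on my virtual machines and found that one node would not come up no matter what I tried, while the other node was fine.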

SQL> startup nomount
ORA-29702: error occurred in Cluster Group Service operation

I tried crs_stat to check the status of the CRS components, but that failed as well.

-bash-4.1$ crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

The alert log shows that the instance was ultimately terminated because of error 29702.

SMON started with pid=20, OS id=12344
Sun May 11 04:10:28 2014
RECO started with pid=21, OS id=12346
Sun May 11 04:10:28 2014
MMON started with pid=22, OS id=12348
Sun May 11 04:10:28 2014
MMNL started with pid=23, OS id=12350
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
USER (ospid: 12242): terminating the instance due to error 29702
Instance terminated by USER, pid = 12242
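When the alert log is long, the terminating error can be pulled out directly instead of scrolling to the end. The following is only a minimal sketch: the function name alert_termination_errors is made up for illustration, and it assumes the alert log uses the same "terminating the instance due to error N" wording as the excerpt above.

```shell
# Print the error number from every "terminating the instance" line.
# Reads the files given as arguments, or stdin when none are given.
# (alert_termination_errors is a hypothetical helper name.)
alert_termination_errors() {
    sed -n 's/.*terminating the instance due to error \([0-9][0-9]*\).*/\1/p' "$@"
}
```

Called with the path to the alert log as its argument, it prints one number per termination, which can then be passed to oerr ora.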

Oracle's explanation of this error is as follows.

-bash-4.1$ oerr ora 29702
29702, 00000, "error occurred in Cluster Group Service operation"
// *Cause: An unexpected error occurred while performing a CGS operation.
// *Action: Verify that the LMON process is still active.
//          Check the Oracle LMON trace files for errors.
//          Also, check the related CSS trace file for errors.

The LMON trace file reads as follows:

Trace file /u04/app/11.2.0/db/diag/rdbms/racdb/RACDB1/trace/RACDB1_lmon_12324.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /u04/app/11.2.0/db/product/11.2.0/dbhome_1
System name:    Linux
Node name:      rac1
Release:        2.6.32-71.el6.x86_64
Version:        #1 SMP Wed Sep 1 01:33:01 EDT 2010
Machine:        x86_64
VM name:        VMWare Version: 6
Instance name: RACDB1
Redo thread mounted by this instance: 0
Oracle process number: 11
Unix process pid: 12324, image: oracle@rac1 (LMON)

*** 2014-05-11 04:10:27.777
*** SESSION ID:(130.1) 2014-05-11 04:10:27.777
*** CLIENT ID:() 2014-05-11 04:10:27.777
*** SERVICE NAME:() 2014-05-11 04:10:27.777
*** MODULE NAME:() 2014-05-11 04:10:27.777
*** ACTION NAME:() 2014-05-11 04:10:27.777
GES resources 5720 pool 3
GES enqueues 8361
GES IPC: Receivers 2  Senders 2
GES IPC: Buffers  Receive 1000  Send (i:1030 b:471) Reserve 301
GES IPC: Msg Size  Regular 1176  Batch 8376
Batching factor: enqueue replay 206, ack 229
Batching factor: cache replay 128 size per lock 64

*** 2014-05-11 04:10:28.644
kjxggin: CGS tickets = 1000
kgxgncin: CLSS init failed with status 3
kgxgncin: return status 3 (1311719766 SKGXN not av) from CLSS
kjxgmin: kgxgncin fails - (2)
kjxggin: generic group layer init fails

*** 2014-05-11 04:10:28.655
Global Enqueue Service Shutdown
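Most of this trace is startup noise. The part that seems to matter is the short run of kgxgncin/kjxgmin lines just before the Global Enqueue Service shuts down, where the CLSS initialization fails. A rough way to filter those lines out of a trace file is sketched below. The function name lmon_init_failures and the search pattern are my own choices based on this one trace, not an Oracle-documented format.

```shell
# Print, with line numbers, the lines of an LMON trace that report a failure.
# Reads the files given as arguments, or stdin when none are given.
# The pattern is an assumption derived from the wording seen in this trace.
lmon_init_failures() {
    grep -n -E 'fail(s|ed)|not av' "$@"
}
```

Run against the trace above, this would leave only the two kgxgncin lines, the kjxgmin line, and the "generic group layer init fails" line, which together suggest that LMON could not register with the cluster layer.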

 

On this node, neither crs_stat nor crsctl got me anywhere.

-bash-4.1$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager

-bash-4.1$ crs_start -all
CRS-0184: Cannot communicate with the CRS daemon.

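Checking the processes shows that the daemons under ohasd are indeed running, although ocssd.bin, crsd.bin, and evmd.bin do not appear in the list, which matches what crsctl check crs reported.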

-bash-4.1$ ps -ef|grep d.bin
root      2103     1  0 May10 ?        00:00:51 /u04/app/11.2.0/grid/bin/ohasd.bin reboot
grid      2297     1  0 May10 ?        00:00:32 /u04/app/11.2.0/grid/bin/oraagent.bin
grid      2309     1  0 May10 ?        00:00:01 /u04/app/11.2.0/grid/bin/mdnsd.bin
grid      2320     1  0 May10 ?        00:00:36 /u04/app/11.2.0/grid/bin/gpnpd.bin
root      2330     1  0 May10 ?        00:00:14 /u04/app/11.2.0/grid/bin/orarootagent.bin
grid      2333     1  0 May10 ?        00:02:39 /u04/app/11.2.0/grid/bin/gipcd.bin
root      2348     1  1 May10 ?        00:12:00 /u04/app/11.2.0/grid/bin/osysmond.bin
root      2569     1  0 May10 ?        00:03:55 /u04/app/11.2.0/grid/bin/ologgerd -M -d /u04/app/11.2.0/grid/crf/db/rac1
grid     12569  9580  0 04:25 pts/1    00:00:00 grep d.bin
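Spotting what is absent from a ps listing by eye is easy to get wrong, so the check can be scripted. This is a sketch under assumptions: missing_daemons is a hypothetical helper, the input is expected in ps -ef layout (command in the eighth column), and the list of daemons to look for is passed in by the caller.

```shell
# Read `ps -ef` output on stdin and print every daemon name given as an
# argument that is NOT running. A name matches when it equals the basename
# of the command column, so the "grep d.bin" line itself never counts.
# (missing_daemons is a hypothetical helper name.)
missing_daemons() {
    ps_output=$(cat)
    for name in "$@"; do
        if ! printf '%s\n' "$ps_output" | awk -v n="$name" '
                { k = split($8, parts, "/"); if (k > 0 && parts[k] == n) found = 1 }
                END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

For example, ps -ef | missing_daemons ocssd.bin crsd.bin evmd.bin would list whichever of the three is not running. Treating those three as the daemons behind CSS, CRS, and Event Manager is my assumption here.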

 

As root, I disabled CRS autostart and then tried to stop CRS, but the stop command reported errors.
[root@rac1 bin]# ./crsctl disable crs
CRS-4621: Oracle High Availability Services autostart is disabled.

[root@rac1 bin]# ./crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

I re-enabled autostart and tried to start CRS again, which also failed.

[root@rac1 bin]# ./crsctl enable crs
CRS-4622: Oracle High Availability Services autostart is enabled.
[root@rac1 bin]# ./crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

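Finally I found a workaround on MOS: manually kill the remaining CRS processes. In a production environment, of course, the proper fix is still to apply the PSU.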

[root@rac1 bin]# ps -fea | grep ohasd.bin | grep -v grep
root      2103     1  0 May10 ?        00:00:52 /u04/app/11.2.0/grid/bin/ohasd.bin reboot
[root@rac1 bin]# ps -fea | grep gipcd.bin | grep -v grep
grid      2333     1  0 May10 ?        00:02:41 /u04/app/11.2.0/grid/bin/gipcd.bin
[root@rac1 bin]# ps -fea | grep mdnsd.bin | grep -v grep
grid      2309     1  0 May10 ?        00:00:01 /u04/app/11.2.0/grid/bin/mdnsd.bin
[root@rac1 bin]# ps -fea | grep gpnpd.bin | grep -v grep
grid      2320     1  0 May10 ?        00:00:37 /u04/app/11.2.0/grid/bin/gpnpd.bin
[root@rac1 bin]# ps -fea | grep evmd.bin | grep -v grep
[root@rac1 bin]# ps -fea | grep crsd.bin | grep -v grep
[root@rac1 bin]# kill -9 2103 2333  2309 2320
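Collecting the PIDs with one grep per daemon and retyping them by hand works, but it is easy to mistype a number. A small sketch of gathering them in one pass follows. daemon_pids is a hypothetical helper, it assumes ps -ef column layout, and it only prints PIDs so that they can be reviewed before anything is killed.

```shell
# Read `ps -ef` output on stdin and print the PID of every process whose
# command basename is one of the names given as arguments.
# Matching on the command column means no "grep -v grep" is needed.
# (daemon_pids is a hypothetical helper name.)
daemon_pids() {
    awk -v names="$*" '
        BEGIN { n = split(names, list, " "); for (i = 1; i <= n; i++) want[list[i]] = 1 }
        { k = split($8, parts, "/"); if (k > 0 && (parts[k] in want)) print $2 }'
}
```

Something like ps -ef | daemon_pids ohasd.bin gipcd.bin mdnsd.bin gpnpd.bin should print the same four PIDs that were passed to kill -9 above.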

Try starting CRS again:

[root@rac1 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

[root@rac1 bin]# ./crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.

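The stack is somewhat slow to come up, so crs_stat still fails at first. After waiting a while, I went ahead and started the database myself, and this time it came up without any problem.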

-bash-4.1$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Sun May 11 04:41:03 2014

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup nomount
ORACLE instance started.

Total System Global Area  638853120 bytes
Fixed Size                  2231072 bytes
Variable Size             482346208 bytes
Database Buffers          146800640 bytes
Redo Buffers                7475200 bytes
SQL> alter database mount;

Database altered.

SQL> alter database open;

Database altered.

SQL>
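The "wait a while" step before starting the instance can be scripted instead of guessed. The sketch below polls a check command until its output no longer contains the communication-failure messages seen earlier. The function names crs_output_ok and wait_for_crs, the retry count, and the delay are all assumptions of mine. Only the message wording comes from the crsctl check crs output shown in this post.

```shell
# Succeeds when the text looks like a healthy `crsctl check crs` report:
# at least one "is online" line and no communication-failure line.
# (Both function names here are hypothetical.)
crs_output_ok() {
    printf '%s\n' "$1" | grep -q 'is online' || return 1
    ! printf '%s\n' "$1" | grep -q -E 'Cannot communicate|Communications failure'
}

# wait_for_crs CHECK_CMD ATTEMPTS DELAY_SECONDS
# Runs CHECK_CMD up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Returns 0 as soon as the output looks healthy, 1 if it never does.
# CHECK_CMD is deliberately left unquoted so "crsctl check crs" splits into words.
wait_for_crs() {
    check_cmd=$1
    attempts=$2
    delay=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        out=$($check_cmd 2>&1)
        if crs_output_ok "$out"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A call such as wait_for_crs "crsctl check crs" 30 10 would retry for roughly five minutes. A clean report only means the daemons answer; it does not guarantee that every resource is online, so the instance may still need to be started by hand as above.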

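Checking the CRS status shows that everything that should be up is up. I created a small table and tested it from both nodes, with no problems. The details of the workaround are in MOS Doc ID 1233580.1.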

-bash-4.1$ crs_stat -t
Name           Type           Target    State     Host       
------------------------------------------------------------
ora....ER.lsnr ora....er.type ONLINE    ONLINE    rac1       
ora....N1.lsnr ora....er.type ONLINE    ONLINE    rac2       
ora.asm        ora.asm.type   OFFLINE   OFFLINE              
ora.cvu        ora.cvu.type   OFFLINE   OFFLINE              
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE              
ora....network ora....rk.type ONLINE    ONLINE    rac1       
ora.oc4j       ora.oc4j.type  OFFLINE   OFFLINE              
ora.ons        ora.ons.type   ONLINE    ONLINE    rac1       
ora....SM1.asm application    OFFLINE   OFFLINE              
ora....C1.lsnr application    ONLINE    ONLINE    rac1       
ora.rac1.gsd   application    OFFLINE   OFFLINE              
ora.rac1.ons   application    ONLINE    ONLINE    rac1       
ora.rac1.vip   ora....t1.type ONLINE    ONLINE    rac1       
ora....SM2.asm application    OFFLINE   OFFLINE              
ora....C2.lsnr application    ONLINE    ONLINE    rac2       
ora.rac2.gsd   application    OFFLINE   OFFLINE              
ora.rac2.ons   application    ONLINE    ONLINE    rac2       
ora.rac2.vip   ora....t1.type ONLINE    ONLINE    rac2       
ora.racdb.db   ora....se.type ONLINE    ONLINE    rac2       
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    rac2
