一、问题背景
1、20191120某客户托尔斯程序运行异常,无法正常工作
2、查看程序日志显示ORA-00600、ORA-02002等数据库错误
3、数据库服务器信息
①系统:CentOS 6.8(64位)
②数据库:Oracle RAC 11.2.0.4
二、问题排查&处理
1、查看数据库信息
①首先查看了Oracle RAC集群信息,数据库集群运行正常
②查看数据库日志,节点1运行正常,节点2 alert日志中有大量ORA600信息:
Wed Nov 20 11:28:30 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_11752.trc (incident=51957):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 946, block 733 to scn 2776710962
Recovery of Online Redo Log: Thread 2 Group 3 Seq 946 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_3.261.989703707
Block recovery completed at rba 946.735.16, scn 0.2776710963
Block recovery from logseq 946, block 733 to scn 2776711284
Recovery of Online Redo Log: Thread 2 Group 3 Seq 946 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_3.261.989703707
Block recovery completed at rba 946.849.16, scn 0.2776711285
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_11752.trc (incident=51958):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
ORA-02002: error while writing to audit trail
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
③查看当前数据库集群中每个节点上会话信息,节点1上有程序连接进来的信息,节点2基本都是系统用户会话信息
2、根据客户反馈的信息,进行下一步处理
①客户反馈之前也遇到过目前的情况(ORA2002),之前是对system表空间进行手动resize,之后程序就会恢复正常,但是这次操作了之后也没有恢复正常,程序报错如下:
②根据客户描述的信息检查数据库审计情况
a、发现数据库审计参数AUDIT_TRAIL=DB,是数据库默认的配置
b、接下来检查system表空间信息,当前system表空间是3G,使用率70%(客户自己resize后),并且system表空间是自动扩展,看来问题不是由于审计表空间不足引起的
③system充足的情况下,程序多次重启还是报ORA2002错误,看日志显示ORA600后紧接着就是ORA2002,为了及时恢复业务,客户也表示没有使用到数据库审计,决定关闭数据库审计功能
3、数据库修改了AUDIT_TRAIL=NONE,truncate了SYS.AUD$表数据,关闭了数据库集群,进行了重启,观察程序情况:
①程序在关闭数据库审计后启动,仍然报错,此时报错内容只剩下了ORA600
②查看当前数据库中session信息,和之前一样,节点1有程序连接信息,节点2上没有
③这时再查看节点2alert日志信息,ORA2002虽然没有了,但是ORA600仍然不断打印出来
Wed Nov 20 21:36:11 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_19012.trc (incident=100413):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/racecs/racecs2/incident/incdir_100413/racecs2_ora_19012_i100413.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Nov 20 21:36:16 2019
Dumping diagnostic data in directory=[cdmp_20191120213616], requested by (instance=2, osid=19012), summary=[incident=100413].
Block recovery from logseq 972, block 168 to scn 2776911378
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.173.16, scn 0.2776911845
Block recovery from logseq 972, block 168 to scn 2776914575
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.657.16, scn 0.2776914576
Wed Nov 20 21:38:25 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_18696.trc (incident=100333):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/racecs/racecs2/incident/incdir_100333/racecs2_ora_18696_i100333.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
...
Wed Nov 20 21:46:09 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_18732.trc (incident=100373):
ORA-00600: Śҿխϳպë, ӎ˽: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 972, block 1011 to scn 2776915928
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1016.16, scn 0.2776915929
Block recovery from logseq 972, block 1011 to scn 2776916012
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1031.16, scn 0.2776916013
Wed Nov 20 21:48:41 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_21433.trc (incident=100445):
ORA-00600: Śҿխϳպë, ӎ˽: [4194], [ody "SYS.DBMS_STANDARD"
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 972, block 1011 to scn 2776915928
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1016.16, scn 0.2776915929
Block recovery from logseq 972, block 1011 to scn 2776916447
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1073.16, scn 0.2776916448
4、根据报错内容在MOS查看相关问题,发现文档1428786.1和这次的现象类似,并且节点2服务器在15号多次发生了异常重启,引起数据库多次异常重启
Fri Nov 15 08:32:28 2019
Active Session History (ASH) performed an emergency flush. This may mean that ASH is undersized. If emergency flushes are a recurring issue, you may consider increasing ASH size by setting the value of _ASH_SIZE to a sufficiently large value. Currently, ASH size is 16777216 bytes. Both ASH size and the total number of emergency flushes since instance startup can be monitored by running the following query:
select total_size,awr_flush_emergency_count from v$ash_info;
Fri Nov 15 09:06:34 2019
IPC Send timeout detected. Receiver ospid 7546 [
Fri Nov 15 09:06:34 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_7546.trc:
Fri Nov 15 09:29:15 2019
minact-scn: useg scan erroring out with error e:12751
Fri Nov 15 09:53:19 2019
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_asmb_3794.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 217 Serial number: 61
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_asmb_3794.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 217 Serial number: 61
ASMB (ospid: 3794): terminating the instance due to error 15064
Termination issued to instance processes. Waiting for the processes to exit
Fri Nov 15 09:53:30 2019
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 3794
Fri Nov 15 10:16:47 2019
Starting ORACLE instance (normal)
5、根据MOS文档里面的描述,很可能就是由于实例2服务器异常重启导致数据库异常,MOS文档描述原因如下:
The following error is occurring in the alert.log right before the database crashes.
ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []
This error indicates that a mismatch has been detected between redo records and rollback (undo) records.
ARGUMENTS:
Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block
Since we are adding a new undo record to our undo block, we would expect that the new record number is equal to the maximum record number in the undo block plus one. Before Oracle can add a new undo record to the undo block it validates that this is correct. If this validation fails, then an ORA-600 [4194] will be triggered.
This issue generally occurs when there is a power outage or hardware failure that initially crashes the database. On startup, the database does the normal roll forward (redo) and then rollback (undo), this is where the error is generated on the rollback.
大致意思是:
在向undo块添加一个新的undo记录,会将新的记录号等于undo块中的最大记录号加1。在Oracle向undo块添加一个新的undo记录之前,它会验证这是正确的。如果验证失败,将触发ORA-600[4194]。
这个问题通常发生在出现断电或硬件故障时,这些故障最初会导致数据库崩溃。在启动时,数据库执行普通的前滚(重做),然后回滚(撤消),这是在回滚时产生错误的地方。
从这里在结合实例2在15号多次重启的现象,可以看出很可能就是因为实例2服务器异常重启导致
6、根据上面定位到的问题对数据库进行修复
*创建一个新的undo表空间,使用新的undo段号,它比当前使用的段号要高。这样,当一个事务执行清除块时,对撤消段的引用不存在,并继续清除块,这样就不会在产生上面发生的内容。
实例2进行操作
①、Create pfile='/tmp/initsid.ora' from spfile;
②、修改undo
a、查看当前undo表空间位置和大小
b、实例2创建新的undo,替换之前的undo
create undo tablespace undotbs3 datafile '+DATA/racecs/datafile/undotbs3.dbf' size 1500m autoextend on;(注意根据之前undo值进行替换)
alter system set undo_tablespace='undotbs3' scope=spfile sid='racecs2';
③、实例2数据库重启
shutdown immediate
startup
7、操作完成之后,程序连接不在报错
三、其他
1、在数据库恢复之后,隔了一天又反馈出现问题,发现实例2服务器再一次发生了重启,这次数据库日志中到没有出现ORA600信息,数据库运行正常
2、接下来应该重点排查实例2服务器异常重启原因