Oracle数据库一次ORA-00600[4194]事件处理

本文涉及的产品
日志服务 SLS,月写入数据量 50GB 1个月
简介: 一、问题背景1、20191120某客户托尔斯程序运行异常,无法正常工作2、查看程序日志显示ORA-00600、ORA-02002等数据库错误3、数据库服务器信息    ①系统:CentOS 6.8(64位)    ②数据库:Oracle RAC 11.

一、问题背景

1、20191120某客户托尔斯程序运行异常,无法正常工作
2、查看程序日志显示ORA-00600、ORA-02002等数据库错误
3、数据库服务器信息
    ①系统:CentOS 6.8(64位)
    ②数据库:Oracle RAC 11.2.0.4

二、问题排查&处理

1、查看数据库信息

    ①首先查看了Oracle RAC集群信息,数据库集群运行正常
    ②查看数据库日志,节点1运行正常,节点2 alert日志中有大量ORA600信息:

Wed Nov 20 11:28:30 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_11752.trc  (incident=51957):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 946, block 733 to scn 2776710962
Recovery of Online Redo Log: Thread 2 Group 3 Seq 946 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_3.261.989703707
Block recovery completed at rba 946.735.16, scn 0.2776710963
Block recovery from logseq 946, block 733 to scn 2776711284
Recovery of Online Redo Log: Thread 2 Group 3 Seq 946 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_3.261.989703707
Block recovery completed at rba 946.849.16, scn 0.2776711285
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_11752.trc  (incident=51958):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
ORA-02002: error while writing to audit trail
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

    ③查看当前数据库集群中每个节点上会话信息,节点1上有程序连接进来的信息,节点2基本都是系统用户会话信息

2、根据客户反馈的信息,进行下一步处理

    ①客户反馈之前也遇到过目前的情况(ORA2002),之前是对system表空间进行手动resize,之后程序就会恢复正常,但是这次操作了之后也没有恢复正常,程序报错如下:
ter_error

    ②根据客户描述的信息检查数据库审计情况
        a、发现数据库审计参数AUDIT_TRAIL=DB,是数据库默认的配置
        b、接下来检查system表空间信息,当前system表空间是3G,使用率70%(客户自己resize后),并且system表空间是自动扩展,看来问题不是由于审计表空间不足引起的
    ③system充足的情况下,程序多次重启还是报ORA2002错误,看日志显示ORA600后紧接着就是ORA2002,为了及时恢复业务,客户也表示没有使用到数据库审计,决定关闭数据库审计功能

3、数据库修改了AUDIT_TRAIL=NONE,truncate了SYS.AUD$表数据,关闭了数据库集群,进行了重启,观察程序情况:

    ①程序在关闭数据库审计后启动,仍然报错,此时报错内容只剩下了ORA600
trs_error1
    ②查看当前数据库中session信息,和之前一样,节点1有程序连接信息,节点2上没有
    ③这时再查看节点2alert日志信息,ORA2002虽然没有了,但是ORA600仍然不断打印出来

Wed Nov 20 21:36:11 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_19012.trc  (incident=100413):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/racecs/racecs2/incident/incdir_100413/racecs2_ora_19012_i100413.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Nov 20 21:36:16 2019
Dumping diagnostic data in directory=[cdmp_20191120213616], requested by (instance=2, osid=19012), summary=[incident=100413].
Block recovery from logseq 972, block 168 to scn 2776911378
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.173.16, scn 0.2776911845
Block recovery from logseq 972, block 168 to scn 2776914575
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.657.16, scn 0.2776914576
Wed Nov 20 21:38:25 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_18696.trc  (incident=100333):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/racecs/racecs2/incident/incdir_100333/racecs2_ora_18696_i100333.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
...
Wed Nov 20 21:46:09 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_18732.trc  (incident=100373):
ORA-00600: Śҿխϳպë, ӎ˽: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 972, block 1011 to scn 2776915928
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1016.16, scn 0.2776915929
Block recovery from logseq 972, block 1011 to scn 2776916012
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1031.16, scn 0.2776916013
Wed Nov 20 21:48:41 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_21433.trc  (incident=100445):
ORA-00600: Śҿխϳպë, ӎ˽: [4194], [ody "SYS.DBMS_STANDARD"
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 972, block 1011 to scn 2776915928
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1016.16, scn 0.2776915929
Block recovery from logseq 972, block 1011 to scn 2776916447
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1073.16, scn 0.2776916448

4、根据报错内容在MOS查看相关问题,发现文档1428786.1和这次的现象类似,并且节点2服务器在15号多次发生了异常重启,引起数据库多次异常重启

Fri Nov 15 08:32:28 2019
Active Session History (ASH) performed an emergency flush. This may mean that ASH is undersized. If emergency flushes are a recurring issue, you may consider increasing ASH size by setting the value of _ASH_SIZE to a sufficiently large value. Currently, ASH size is 16777216 bytes. Both ASH size and the total number of emergency flushes since instance startup can be monitored by running the following query:
 select total_size,awr_flush_emergency_count from v$ash_info; 
Fri Nov 15 09:06:34 2019
IPC Send timeout detected. Receiver ospid 7546 [
Fri Nov 15 09:06:34 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_7546.trc:
Fri Nov 15 09:29:15 2019
minact-scn: useg scan erroring out with error e:12751
Fri Nov 15 09:53:19 2019
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_asmb_3794.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 217 Serial number: 61
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_asmb_3794.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 217 Serial number: 61
ASMB (ospid: 3794): terminating the instance due to error 15064
Termination issued to instance processes. Waiting for the processes to exit
Fri Nov 15 09:53:30 2019
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 3794
Fri Nov 15 10:16:47 2019
Starting ORACLE instance (normal)

5、根据MOS文档里面的描述,很可能就是由于实例2服务器异常重启导致数据库异常,MOS文档描述原因如下:

The following error is occurring in the alert.log right before the database crashes.
ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []
This error indicates that a mismatch has been detected between redo records and rollback (undo) records.

ARGUMENTS:

Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block

Since we are adding a new undo record to our undo block, we would expect that the new record number is equal to the maximum record number in the undo block plus one. Before Oracle can add a new undo record to the undo block it validates that this is correct. If this validation fails, then an ORA-600 [4194] will be triggered.

This issue generally occurs when there is a power outage or hardware failure that initially crashes the database. On startup, the database does the normal roll forward (redo) and then rollback (undo), this is where the error is generated on the rollback.

大致意思是:
在向undo块添加一个新的undo记录,会将新的记录号等于undo块中的最大记录号加1。在Oracle向undo块添加一个新的undo记录之前,它会验证这是正确的。如果验证失败,将触发ORA-600[4194]。
这个问题通常发生在出现断电或硬件故障时,这些故障最初会导致数据库崩溃。在启动时,数据库执行普通的前滚(重做),然后回滚(撤消),这是在回滚时产生错误的地方。
从这里在结合实例2在15号多次重启的现象,可以看出很可能就是因为实例2服务器异常重启导致

6、根据上面定位到的问题对数据库进行修复

*创建一个新的undo表空间,使用新的undo段号,它比当前使用的段号要高。这样,当一个事务执行清除块时,对撤消段的引用不存在,并继续清除块,这样就不会在产生上面发生的内容。
实例2进行操作
①、Create pfile='/tmp/initsid.ora' from spfile;
②、修改undo
a、查看当前undo表空间位置和大小
b、实例2创建新的undo,替换之前的undo
  create undo tablespace undotbs3 datafile '+DATA/racecs/datafile/undotbs3.dbf' size 1500m autoextend on;(注意根据之前undo值进行替换)
  
  alter system set undo_tablespace='undotbs3' scope=spfile sid='racecs2';
③、实例2数据库重启
  shutdown immediate
  startup

7、操作完成之后,程序连接不在报错

三、其他

1、在数据库恢复之后,隔了一天又反馈出现问题,发现实例2服务器再一次发生了重启,这次数据库日志中到没有出现ORA600信息,数据库运行正常
2、接下来应该重点排查实例2服务器异常重启原因

相关文章
|
1月前
|
存储 Oracle 关系型数据库
数据库数据恢复—ORACLE常见故障的数据恢复方案
Oracle数据库常见故障表现: 1、ORACLE数据库无法启动或无法正常工作。 2、ORACLE ASM存储破坏。 3、ORACLE数据文件丢失。 4、ORACLE数据文件部分损坏。 5、ORACLE DUMP文件损坏。
124 11
|
2月前
|
Oracle 关系型数据库 数据库
Oracle数据恢复—Oracle数据库文件有坏快损坏的数据恢复案例
一台Oracle数据库打开报错,报错信息: “system01.dbf需要更多的恢复来保持一致性,数据库无法打开”。管理员联系我们数据恢复中心寻求帮助,并提供了Oracle_Home目录的所有文件。用户方要求恢复zxfg用户下的数据。 由于数据库没有备份,无法通过备份去恢复数据库。
|
2月前
|
存储 Oracle 关系型数据库
oracle数据恢复—Oracle数据库文件大小变为0kb的数据恢复案例
存储掉盘超过上限,lun无法识别。管理员重组存储的位图信息并导出lun,发现linux操作系统上部署的oracle数据库中有上百个数据文件的大小变为0kb。数据库的大小缩水了80%以上。 取出&并分析oracle数据库的控制文件。重组存储位图信息,重新导出控制文件中记录的数据文件,发现这些文件的大小依然为0kb。
|
1月前
|
存储 Oracle 关系型数据库
服务器数据恢复—华为S5300存储Oracle数据库恢复案例
服务器存储数据恢复环境: 华为S5300存储中有12块FC硬盘,其中11块硬盘作为数据盘组建了一组RAID5阵列,剩下的1块硬盘作为热备盘使用。基于RAID的LUN分配给linux操作系统使用,存放的数据主要是Oracle数据库。 服务器存储故障: RAID5阵列中1块硬盘出现故障离线,热备盘自动激活开始同步数据,在同步数据的过程中又一块硬盘离线,RAID5阵列瘫痪,上层LUN无法使用。
|
3月前
|
存储 Oracle 关系型数据库
Oracle数据库的应用场景有哪些?
【10月更文挑战第15天】Oracle数据库的应用场景有哪些?
269 64
|
2月前
|
SQL Oracle 关系型数据库
Oracle数据库优化方法
【10月更文挑战第25天】Oracle数据库优化方法
63 7
|
2月前
|
Oracle 关系型数据库 数据库
oracle数据库技巧
【10月更文挑战第25天】oracle数据库技巧
41 6
|
2月前
|
存储 Oracle 关系型数据库
Oracle数据库优化策略
【10月更文挑战第25天】Oracle数据库优化策略
44 5
|
3月前
|
存储 Oracle 关系型数据库
数据库数据恢复—Oracle ASM磁盘组故障数据恢复案例
Oracle数据库数据恢复环境&故障: Oracle ASM磁盘组由4块磁盘组成。Oracle ASM磁盘组掉线 ,ASM实例不能mount。 Oracle数据库故障分析&恢复方案: 数据库数据恢复工程师对组成ASM磁盘组的磁盘进行分析。对ASM元数据进行分析发现ASM存储元数据损坏,导致磁盘组无法挂载。
|
3月前
|
监控 Oracle 关系型数据库
Oracle数据库性能优化
【10月更文挑战第16天】Oracle数据库性能优化是
48 1

推荐镜像

更多
AI助理

你好,我是AI助理

可以解答问题、推荐解决方案等