ORA-00600 [kjctr_pbmsg:badbmsg2]

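I recently hit the error ORA-00600 [kjctr_pbmsg:badbmsg2], which caused a RAC node's instance to restart. The cause was eventually confirmed to be an unstable private network (interconnect).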

 
    ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
    LMS1 (ospid: 12379): terminating the instance due to error 484

1. The detailed analysis follows. First, check the logs:
alert log

 
   
     
       
       
    Mon Aug 11 23:53:10 2014
    Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc (incident=1104178):
    ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
    Incident details in: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/incident/incdir_1104178/orcl_lms1_12379_i1104178.trc
    Mon Aug 11 23:53:12 2014
    Dumping diagnostic data in directory=[cdmp_20140811235312], requested by (instance=1, osid=12379 (LMS1)), summary=[incident=1104178].
    Use ADRCI or Support Workbench to package the incident.
    See Note 411.1 at My Oracle Support for error and packaging details.
    Mon Aug 11 23:53:13 2014
    Sweep [inc][1104178]: completed
    Sweep [inc2][1104178]: completed
    Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc:
    ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
    LMS1 (ospid: 12379): terminating the instance due to error 484
    Mon Aug 11 23:53:22 2014
    ORA-1092 : opitsk aborting process
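The first bracketed argument of an ORA-00600 line is the string worth searching for on My Oracle Support. As a minimal sketch (the helper name `parse_ora600` is made up for illustration, and it assumes the message sits on a single line, as in the alert log excerpt above), the arguments can be pulled out like this:

```python
import re

# Sample line copied from the alert log excerpt above.
LINE = (
    "ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], "
    "[0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []"
)


def parse_ora600(line):
    """Return the non-empty bracketed arguments of an ORA-00600 line, else None."""
    if not line.startswith("ORA-00600"):
        return None
    # Every argument is printed as [...]; unused slots appear as [].
    args = re.findall(r"\[([^\]]*)\]", line)
    return [a for a in args if a]


args = parse_ora600(LINE)
print(args[0])   # the argument to search for
print(args[1:])  # remaining non-empty arguments
```

Here the first argument, `kjctr_pbmsg:badbmsg2`, is the one used for the bug search in step 3.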
        
 
     

    
  

orcl_lms1_12379_i1104178.trc

 
    Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
    With the Partitioning, Real Application Clusters, OLAP, Data Mining
    and Real Application Testing options
    ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
    System name: HP-UX
    Node name: h7sd05da
    Release: B.11.31
    Version: U
    Machine: ia64
    Instance name: orcl
    Redo thread mounted by this instance: 1
    Oracle process number: 14
    Unix process pid: 12379, image: oracleh7sd05da (LMS1)
    Dump continued from file: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc
    ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
    ========= Dump for incident 1104178 (ORA 600 [kjctr_pbmsg:badbmsg2]) ========
    *** 2014-08-11 23:53:10.339
    dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
    ----- SQL Statement (None) -----
    Current SQL information unavailable - no cursor.
    ----- Call Stack Trace -----
    skdstdst <- ksedst <- dbkedDefDump <- ksedmp <- ksfdmp
    <- $cold_dbgexPhaseII <- dbgexProcessError <- dbgeExecuteForError <- dbgePostErrorKGE <- 2352
    <- dbkePostKGE_kgsf <- 128 <- kgeadse <- kgerinv_internal <- kgerinv
    <- kgeasnmierr <- kjctr_pbmsg <- kjctr_rksxp <- kjctrcv <- kjcsrmg
    <- kjmsm <- ksbrdp <- opirip <- opidrv <- sou2o
    <- opimai_real <- ssthrdmain <- main <- main_opd_entry
    --------------------- Binary Stack Dump ---------------------

2. Check the patch information. The current version is 11.2.0.2.1.

 
    $ opatch lsinventory
    Installed Top-level Products (1):
    Oracle Database 11g 11.2.0.2.0
    Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011

3. Search for documents and bugs related to this error. The relevant bugs and their descriptions are listed below:

Bug 18015296 : ORA-600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
The assert is triggered because the batch message is invalid/corrupt. This looks like some form of underlying infrastructure/network issue. Please work with the customer to have this checked and tested.
Bug 18771858 : LMS0 TERMINATING THE INSTANCE DUE TO ERROR 484 (ORA-00600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
From the past bug 16240464 & bug 18015296 , both were closed by dev as not a product defect.
It was suggested that the problem was outside the Oracle stack, at the network level. So please check with the customer along the same lines to identify network problems (if any), with help from their OS/network support. Refer to Doc ID 563566.1, Troubleshooting gc block lost and Poor Network Performance in a RAC Environment.
Bug 16240464 : INSTANCE CRASH WITH ORA-00600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
This looks like some form of underlying infrastructure/network issue, please work with customer to have this checked and tested.
Bug 17452853 : LNX64-12.1-EF,DB INST CRASH WITH LMS4 HIT ORA-600 [KJCTR_PBMSG:BADBMSG2] in 12.1.0.2
Bug 17049773 Diagnostic enhancement to give additional parameter in error ORA-600 [ kjctr_pbmsg:badbmsg2] in 12.1.0.1
Note: This fix will not address the root cause of the error but the additional information may help with diagnosis of the cause.
Bug 13917456 : LNX64-12.1-UD: ASM LMD HIT ORA-00600 KJCTR_PBMSG:BADBMSG2 IN NON-UPGRADED NODES in 12.1.0.0.2
This one may occur during the upgrade from 11.2.0.3 to 12.1. It is not related to this SR.

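4. At this point, I needed to check the AWR reports, the OSWatcher data, and all LMS, LMD, LMON, LMHB and DIAG trace files from the time of the problem, to see whether they recorded any further information.
I also used cluvfy and ORAchk to check the overall RAC environment.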

 
    --. AWR report 22:00~23:00 on Aug 11 from both nodes.
    --. Deploy OSWatcher, then collect the current OS information when the database workload is high.
    --. All the LMS, LMD, LMON, LMHB and DIAG trace files from both nodes.
    --. CVU output:
    cluvfy stage -pre crsinst -n <node1,node2> -verbose
    --. Please run ORAchk as root.
    ORAchk - Health Checks for the Oracle Stack (Doc ID 1268927.2)
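Collecting the right trace files from each node is mostly a matter of filtering by background process name. The sketch below assumes the `<instance>_<process>_<ospid>.trc` naming seen in this incident (for example `orcl_lms1_12379.trc`); the function name and every sample file name other than the two taken from this incident are hypothetical.

```python
import re

# Background processes named in the checklist above.
# Assumption: trace files are named <instance>_<process>_<ospid>.trc.
TRACE_RE = re.compile(
    r"^(?P<inst>.+)_(?P<proc>lms\d+|lmd\d+|lmon|lmhb|diag)_(?P<pid>\d+)\.trc$",
    re.IGNORECASE,
)


def select_rac_traces(filenames):
    """Return, sorted, the names that look like LMS/LMD/LMON/LMHB/DIAG traces."""
    return sorted(name for name in filenames if TRACE_RE.match(name))


# Illustrative directory listing; only the lms1 names come from the incident.
sample = [
    "orcl_lms1_12379.trc",
    "orcl_ora_4021.trc",
    "orcl_lmon_12360.trc",
    "orcl_lms1_12379_i1104178.trc",
    "orcl_diag_12350.trm",
]

for name in select_rac_traces(sample):
    print(name)
```

Incident dump files such as `orcl_lms1_12379_i1104178.trc` do not match this pattern on purpose; they live under the incident directory and are collected separately.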

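5. While reviewing the AWR reports, I found "gc blocks lost". In theory, this statistic should not show up when the private network is healthy, so its presence is a strong indication that the private network is unstable.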

awrrpt_2_29557_29558.html

 
   
     
       
      
                 Snap Id  Snap Time           Sessions  Cursors/Session
    Begin Snap:    29557  11-Aug-14 22:00:45       563              1.3
    End Snap:      29558  11-Aug-14 23:01:00       551              1.3
    Elapsed:       60.24 (mins)
    DB Time:    4,835.90 (mins)

    Top 5 Timed Foreground Events

    Event                          Waits  Time(s)  Avg wait (ms)  % DB time  Wait Class
    db file sequential read    6,269,185  185,621             30      63.97  User I/O
    DB CPU                                 42,433                     14.62
    gc current grant 2-way     3,251,636   25,671              8       8.85  Cluster
    db file scattered read       550,524    9,873             18       3.40  User I/O
    gc cr multi block request    637,442    6,790             11       2.34  Cluster

    Instance Activity Stats

    Statistic        Total  per Second  per Trans
    gc blocks lost     269        0.07       0.01   <<<<<<<<<<<<
        
 
     

    
  

awrrpt_1_29557_29558.html

 
   
     
       
      
                 Snap Id  Snap Time           Sessions  Cursors/Session
    Begin Snap:    29557  11-Aug-14 22:00:44      2470              1.0
    End Snap:      29558  11-Aug-14 23:00:59      2500              1.0
    Elapsed:       60.25 (mins)
    DB Time:    4,549.47 (mins)

    Top 5 Timed Foreground Events

    Event                          Waits  Time(s)  Avg wait (ms)  % DB time  Wait Class
    db file sequential read    8,180,795  154,504             19      56.60  User I/O
    DB CPU                                 44,994                     16.48
    gc current grant 2-way     3,699,003   29,357              8      10.75  Cluster
    db file scattered read       677,065   10,190             15       3.73  User I/O
    gc cr multi block request    718,327    7,856             11       2.88  Cluster

    Statistic        Total  per Second  per Trans
    gc blocks lost     410        0.11       0.01   <<<<<<<<<<<<
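As a quick sanity check on the two reports, the "per Second" column is simply the total divided by the elapsed snapshot time in seconds. The short sketch below redoes that arithmetic with the figures from the reports above; the function and variable names are made up for illustration.

```python
def per_second(total, elapsed_minutes):
    """Convert a snapshot total into a per-second rate."""
    return total / (elapsed_minutes * 60.0)


# (gc blocks lost total, elapsed minutes), copied from the two AWR reports.
reports = {
    "awrrpt_2_29557_29558.html": (269, 60.24),
    "awrrpt_1_29557_29558.html": (410, 60.25),
}

rates = {name: round(per_second(total, mins), 2)
         for name, (total, mins) in reports.items()}

for name, rate in rates.items():
    print(f"{name}: {rate} gc blocks lost per second")
```

The results match the 0.07 and 0.11 shown in the reports: both instances lost blocks throughout the hour leading up to the crash.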
        
 
     

    
  

6. This finding further supports the private-network hypothesis. The final conclusion is as follows:

Bugs 16240464 and 18015296 were raised for similar issues, and both were closed as "Vendor OS Problem".
The bugs confirmed that this issue is caused by logical block corruption during network transfer over the interconnect, or by an infrastructure issue.

The ORA-00600 [kjctr_pbmsg:badbmsg2] error is purely the result of an unstable network.
The AWR reports confirm that blocks were being lost ("gc blocks lost") during the problematic time frame. This is evidence that the network is either saturated or corrupting packets.

Please involve the OS team and the network team to identify the root cause of the issue. The note below will be helpful for the network issue:
Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)

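7. The analysis of this problem still lacks one stronger piece of evidence: OSWatcher logs. Had OSWatcher logs from the time of the incident been available, the private-network problem would have been exposed much more clearly. After all, both "gc blocks lost" and ORA-00600 [kjctr_pbmsg:badbmsg2] are reported from the Oracle database's point of view, which is not enough to convince the OS engineers. If OSWatcher had recorded TCP and UDP packet loss at that time, the problem would be clearer and the responsibility easier to assign.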

For installing and using OSWatcher, see: OSWatcher (Doc ID 301137.1)
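To turn OS-level network statistics into evidence, it usually helps to compare two samples of the same counters and report only those that grew. The sketch below is a generic illustration, not an OSWatcher parser: the counter names and values are hypothetical, and the real names depend on what the platform's network statistics output actually reports.

```python
def counter_deltas(before, after):
    """Return the counters that increased between two samples, with the increase."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Hypothetical samples for illustration only; not taken from the incident.
before = {"udp_socket_overflows": 120, "ip_fragments_dropped": 40, "udp_checksum_errors": 0}
after = {"udp_socket_overflows": 180, "ip_fragments_dropped": 40, "udp_checksum_errors": 3}

deltas = counter_deltas(before, after)
for name, grown in sorted(deltas.items()):
    print(f"{name}: +{grown}")
```

Error or drop counters that keep growing over the same interval in which "gc blocks lost" increases would be the kind of OS-side evidence this step was missing.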

This article is reposted from hsbxxl's 51CTO blog. Original link: http://blog.51cto.com/hsbxxl/1561574. To repost it, please contact the original author.
