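ORA-30036 Troubleshooting Approach

Introduction:

 Symptom: the application team for a provincial settlement database hit the following error while running a stored procedure.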

 

ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS1'

   

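The following investigation was then carried out:

The undo tablespace was 100% used, and the alert log showed a large number of transactions as well as ORA-01555 errors.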
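The usage check mentioned above is not shown in the original. A minimal sketch of one way to obtain it, assuming access to the DBA_DATA_FILES and DBA_FREE_SPACE dictionary views (the tablespace name comes from the error message; the column aliases are illustrative):

```sql
-- Overall allocation of the undo tablespace.
-- Note: "used" here counts every allocated extent, including EXPIRED undo
-- that Oracle can still reuse, so 100% does not by itself prove exhaustion.
SELECT d.tablespace_name,
       ROUND(d.total_bytes / 1024 / 1024)                 AS total_mb,
       ROUND(NVL(f.free_bytes, 0) / 1024 / 1024)          AS free_mb,
       ROUND((d.total_bytes - NVL(f.free_bytes, 0)) * 100
             / d.total_bytes, 1)                          AS used_pct
  FROM (SELECT tablespace_name, SUM(bytes) AS total_bytes
          FROM dba_data_files
         GROUP BY tablespace_name) d
  LEFT JOIN (SELECT tablespace_name, SUM(bytes) AS free_bytes
               FROM dba_free_space
              GROUP BY tablespace_name) f
    ON f.tablespace_name = d.tablespace_name
 WHERE d.tablespace_name = 'UNDOTBS1';
```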

Wed Jan 30 04:32:01 GMT+08:00 2013
ORA-01555 caused by SQL statement below (SQL ID: 4ds6qq0mfac2t, Query Duration=5 sec, SCN: 0x0c00.d915c977):
Wed Jan 30 04:32:01 GMT+08:00 2013
SELECT distinct FILE_NAME,to_char(file_name_time,:"SYS_B_00"),A.operation_type_grade FROM T_FILE_INFO_81 A,T_FILE_CLASS B ,T_FILE_STATE C
WHERE ((A.RAT_FIRST_STARTTIME >=:"SYS_B_01" and A.RAT_FIRST_STARTTIME <=:"SYS_B_02" )
or (A.RAT_LAST_ENDTIME >=:"SYS_B_03" and A.RAT_LAST_ENDTIME <=:"SYS_B_04" )
or (A.RAT_FIRST_STARTTIME <=:"SYS_B_05" and A.RAT_LAST_ENDTIME >= :"SYS_B_06" ) )
and ((A.RAT_FIRST_CALLING_NBR >=:"SYS_B_07" and A.RAT_FIRST_CALLING_NBR <=:"SYS_B_08")
or (A.RAT_LAST_CALLING_NBR >= :"SYS_B_09" and A.RAT_LAST_CALLING_NBR <= :"SYS_B_10" )
or (A.RAT_FIRST_CALLING_NBR <=:"SYS_B_11" and A.RAT_LAST_CALLING_NBR >=:"SYS_B_12" ))
and A.FILE_NAME <>:"SYS_B_13" and A.File_Class_Id = B.File_Class_Id and B.operation_type_id=:"SYS_B_14"
and (A.FILE_NAME_TIME +  interval :"SYS_B_15" day ) > TO_DATE(:"SYS_B_16",:"SYS_B_17")
and A.city_id =:"SYS_B_18" and a.state in (:"SYS

Wed Jan 30 08:49:12 GMT+08:00 2013
insert into STL_GX.T_all_file_TOTAL

  select aa.province_id,

         dd.name,

         aa.BILL_DATE,

         aa.OPERATION_TYPE_GRADE,

         :"SYS_B_00",

         aa.cj_files,

         bb.pj_files,

         bb.org_counts,

         bb.rate_counts,

         bb.inp_counts,

         bb.inpc_counts,

         cc.jf_files,

         cc.jf_counts,

         :"SYS_B_01",

         :"SYS_B_02",

         aa.is_rate,

         aa.is_billtag,

         aa.is_insert,

         bb.ERR_COUNTS

    from (SELECT a.province_code as province_id,

                 substr(A.ORG_FILENAME, :"SYS_B_03", :"SYS_B_04") as BILL_DATE,

                 b.operation_type_grade,

                 COUNT(*) as cj_files,

                 b.is_rate as is_rate,

                 b.is_billtag as is_billtag,

                 b.is_insert as is_insert

            FROM STL_PARA.T_LOG_COLLECT_76@pub_PARA A,

                 stl_gx.tmp_cj_info b

           WHERE substr(A.ORG_FILENAME, :"SYS_B_05", :"SYS_B_06") = :"SYS_B_07"

             and b.province_id = :"S

Wed Jan 30 09:44:56 GMT+08:00 2013
Thread 1 advanced to log sequence 149384 (LGWR switch)

  Current log# 6 seq# 149384 mem# 0: /dev/rjs_redolog06

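Wed Jan 30 09:50:00 GMT+08:00 2013
ORA-01555 caused by SQL statement below (SQL ID: 7t0bjnwxt9ufv, Query Duration=180 sec, SCN: 0x0c00.e294a117):

The cause looks clear: a large volume of insert activity was holding undo that had not been committed. Because of this database's particular environment, undo_retention had been adjusted in the past. Its current setting is shown below.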

SQL> show parameter undo

 

NAME                                 TYPE                             VALUE

------------------------------------ -------------------------------- ------------------------------

undo_management                      string                           AUTO

undo_retention                       integer                          0

undo_tablespace                      string                           UNDOTBS1

 

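Here undo_management is AUTO and the retention time is 0, which means Oracle automatically tunes how long undo is kept after commit.

Before Oracle 10g, under automatic undo management, the undo_retention parameter controls how long undo information is retained after a transaction commits. This undo is used to construct consistent reads and to support the flashback features, and keeping enough of it also reduces the chance of ORA-01555. In Oracle 9i Release 1 the default was 900 seconds; in Oracle 9i Release 2 it was raised to 10800 seconds.

Even when undo_retention is set, by default it is only a NOGUARANTEE limit. Suppose undo_retention=10800. One might expect that after a transaction commits, its undo is kept for 10800 seconds before the DML of other transactions may overwrite it. In fact, if another transaction's DML needs undo space and there is not enough of it (for example, because the undo tablespace is not allowed to autoextend), then, because retention is NOGUARANTEE, that DML ignores the retention limit and wraps around to overwrite the unexpired undo. In many cases this is not the outcome we want.

Starting with Oracle 10g, Oracle introduced the undo retention guarantee, which forces Oracle to honor the retention period. If a session's DML needs undo space while the undo tablespace is short of space, the session will not overwrite undo that is still protected by undo_retention. Instead, the DML fails with an error for lack of undo space.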
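The guarantee described above is switched per undo tablespace. A minimal sketch, assuming the tablespace name UNDOTBS1 from this incident (the two ALTER statements are alternatives, not a sequence to run together):

```sql
-- Option A: enforce the retention period. Unexpired undo is never
-- overwritten, so DML can fail for lack of undo space instead.
ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;

-- Option B: return to the default behavior, where unexpired undo
-- may be overwritten when space runs short.
ALTER TABLESPACE undotbs1 RETENTION NOGUARANTEE;

-- Confirm the current setting for every undo tablespace.
SELECT tablespace_name, retention
  FROM dba_tablespaces
 WHERE contents = 'UNDO';
```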

SQL> select tablespace_name,block_size,extent_management

  2  segment_space_management,contents,retention

  3  from dba_tablespaces;

 

TABLESPACE_NAME                BLOCK_SIZE SEGMENT_SP CONTENTS  RETENTION

------------------------------ ---------- ---------- --------- -----------

SYSTEM                               8192 LOCAL      PERMANENT NOT APPLY

UNDOTBS1                             8192 LOCAL      UNDO      NOGUARANTEE

SYSAUX                               8192 LOCAL      PERMANENT NOT APPLY

TEMP                                 8192 LOCAL      TEMPORARY NOT APPLY

USERS                                8192 LOCAL      PERMANENT NOT APPLY

COMMDATA                             8192 LOCAL      PERMANENT NOT APPLY

SETTLEINDEX                          8192 LOCAL      PERMANENT NOT APPLY

SETTLEDATA                           8192 LOCAL      PERMANENT NOT APPLY

STATDATA                             8192 LOCAL      PERMANENT NOT APPLY

STATINDEX                            8192 LOCAL      PERMANENT NOT APPLY

COMMINDEX                            8192 LOCAL      PERMANENT NOT APPLY

 

TABLESPACE_NAME                BLOCK_SIZE SEGMENT_SP CONTENTS  RETENTION

------------------------------ ---------- ---------- --------- -----------

RMAN_TBS                             8192 LOCAL      PERMANENT NOT APPLY

 

12 rows selected.
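One detail in the query above: there is no comma after extent_management, so segment_space_management is parsed as a column alias. That appears to be why the output has a single SEGMENT_SP column holding LOCAL, which is the extent management value. A corrected form that returns both columns might look like this:

```sql
-- Same query with the missing comma restored, so EXTENT_MANAGEMENT and
-- SEGMENT_SPACE_MANAGEMENT are returned as separate columns.
SELECT tablespace_name,
       block_size,
       extent_management,
       segment_space_management,
       contents,
       retention
  FROM dba_tablespaces
 ORDER BY tablespace_name;
```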

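The next thought was: since the resources could not be committed, might restarting the database release them? The restart began at 15:20. The plan was to find another datafile, create a new undo tablespace, switch from UNDOTBS1 to UNDOTBS2, take UNDOTBS1 offline and drop it, and then switch back to UNDOTBS1 so that the space would be released.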

 

create  undo tablespace  UNDOTBS2  datafile '/dev/untb03.dbf'  size 32700M;

alter system set undo_tablespace=UNDOTBS2 scope=both;
Take the original undo tablespace offline:
alter tablespace UNDOTBS1 offline;
Drop the original undo tablespace:
drop tablespace UNDOTBS1 including contents AND DATAFILES CASCADE CONSTRAINTS ;
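Before the offline and drop steps, it may be worth confirming that no undo segment in the old tablespace is still in use, since the drop can fail while transactions or pending recovery still depend on it. This check is not part of the original procedure. A sketch, assuming the DBA_ROLLBACK_SEGS view is accessible:

```sql
-- Undo segments in the old tablespace that are not yet offline.
-- Any row returned here (for example ONLINE or NEEDS RECOVERY) suggests
-- waiting before dropping UNDOTBS1.
SELECT segment_name, status
  FROM dba_rollback_segs
 WHERE tablespace_name = 'UNDOTBS1'
   AND status <> 'OFFLINE';
```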

Wed Jan 30 15:33:39 GMT+08:00 2013
ALTER DATABASE OPEN

After the restart, undo utilization was still 100%, which means undo_retention=0 had not taken effect.

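Wed Jan 30 17:27:06 GMT+08:00 2013
ALTER SYSTEM SET undo_retention=900 SCOPE=BOTH;

With retention set to 15 minutes, ACTIVE undo in the database was still as high as 62GB (out of a 64GB undo tablespace).

At this point two dead transactions were found in the database: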

SQL> select ADDR,KTUXEUSN,KTUXESLT,KTUXESQN,KTUXESIZ from x$ktuxe where KTUXECFL='DEAD';   

     ADDR               KTUXEUSN   KTUXESLT   KTUXESQN   KTUXESIZ

---------------- ---------- ---------- ---------- ----------

00000001108C63F0         75         27    2368514     795545

00000001108C5D10        644          7      57597          0
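The x$ktuxe query above shows that dead transactions exist, with KTUXESIZ as the remaining undo size. To watch how far their rollback has progressed, one option (not used in the original) is V$FAST_START_TRANSACTIONS. A sketch, where the pct_done alias is illustrative:

```sql
-- Progress of transactions being recovered by SMON / parallel recovery.
-- USN and SLT correspond to KTUXEUSN and KTUXESLT in x$ktuxe.
SELECT usn,
       slt,
       seq,
       state,
       undoblocksdone,
       undoblockstotal,
       ROUND(undoblocksdone * 100 / NULLIF(undoblockstotal, 0), 1) AS pct_done
  FROM v$fast_start_transactions;
```

Running it a few minutes apart gives a rough rollback rate, from which the remaining time can be estimated.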

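An Oracle ACS service request had already been opened. By the time the Oracle engineer arrived, the dead transaction 75-27 no longer existed (the application team had stopped the related application at around 18:35).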

Checking again, UNEXPIRED undo in the database was 63GB and ACTIVE undo was 1GB.
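The original does not show how the ACTIVE and UNEXPIRED figures were obtained. One common way, sketched here under the assumption that DBA_UNDO_EXTENTS is accessible, is to sum extent sizes by status:

```sql
-- Undo space by extent status:
--   ACTIVE    = used by uncommitted (or dead) transactions
--   UNEXPIRED = committed, but still inside the retention period
--   EXPIRED   = committed and past retention, freely reusable
SELECT status,
       COUNT(*)                            AS extents,
       ROUND(SUM(bytes) / 1024 / 1024)     AS size_mb
  FROM dba_undo_extents
 WHERE tablespace_name = 'UNDOTBS1'
 GROUP BY status
 ORDER BY status;
```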

So the dead transaction had presumably been released. Checking again:

SQL>  select ADDR,KTUXEUSN,KTUXESLT,KTUXESQN,KTUXESIZ from x$ktuxe where KTUXECFL='DEAD';

 

ADDR               KTUXEUSN   KTUXESLT   KTUXESQN   KTUXESIZ

---------------- ---------- ---------- ---------- ----------

0000000110845CF0        644          7      57597          0

That confirms it was 75-27 that had been released. Next, a look at the current transaction activity:

SQL> alter session set nls_date_format='mm/dd/yy hh24:mi:ss';

 

Session altered.

SQL> select begin_time,end_time,UNXPSTEALCNT from v$undostat;

 

BEGIN_TIME        END_TIME          UNXPSTEALCNT

----------------- ----------------- ------------

01/30/13 20:53:28 01/31/13 11:09:42        71794

01/30/13 20:43:28 01/30/13 20:53:28       110035

01/30/13 20:33:28 01/30/13 20:43:28        15240

01/30/13 20:23:28 01/30/13 20:33:28        25489

01/30/13 20:13:28 01/30/13 20:23:28        11936

01/30/13 20:03:28 01/30/13 20:13:28         2950

01/30/13 19:53:28 01/30/13 20:03:28          707

01/30/13 19:43:28 01/30/13 19:53:28            0

01/30/13 19:33:28 01/30/13 19:43:28            0

01/30/13 19:23:28 01/30/13 19:33:28         1271

01/30/13 19:13:28 01/30/13 19:23:28        29187

 

BEGIN_TIME        END_TIME          UNXPSTEALCNT

----------------- ----------------- ------------

01/30/13 19:03:28 01/30/13 19:13:28        19976

01/30/13 18:53:28 01/30/13 19:03:28         1365

01/30/13 18:43:28 01/30/13 18:53:28         6235

01/30/13 18:33:28 01/30/13 18:43:28        24651

01/30/13 18:23:28 01/30/13 18:33:28        38220

01/30/13 18:13:28 01/30/13 18:23:28        49888

01/30/13 18:03:28 01/30/13 18:13:28        29815

01/30/13 17:53:28 01/30/13 18:03:28        43678

01/30/13 17:43:28 01/30/13 17:53:28       104834

01/30/13 17:33:28 01/30/13 17:43:28       101518

01/30/13 17:23:28 01/30/13 17:33:28        45838

 

BEGIN_TIME        END_TIME          UNXPSTEALCNT

----------------- ----------------- ------------

01/30/13 17:13:28 01/30/13 17:23:28        30964

01/30/13 17:03:28 01/30/13 17:13:28        43876

01/30/13 16:53:28 01/30/13 17:03:28        15455

01/30/13 16:43:28 01/30/13 16:53:28         7839

01/30/13 16:33:28 01/30/13 16:43:28        24606

01/30/13 16:23:28 01/30/13 16:33:28        40497

01/30/13 16:13:28 01/30/13 16:23:28        34759

01/30/13 16:03:28 01/30/13 16:13:28       118142

01/30/13 15:53:28 01/30/13 16:03:28       107958

01/30/13 15:43:28 01/30/13 15:53:28        20249

01/30/13 15:33:28 01/30/13 15:43:28            0

 

33 rows selected.

The question was handed to the Oracle engineer: why did setting undo_retention to 900s not take effect immediately?

A search on Metalink turned up a known bug.

PS: Bug 5387030 - Automatic tuning of undo_retention causes unusual extra space allocation [ID 5387030.8]

Product (Component)

Oracle Server (Rdbms)

Range of versions believed to be affected

Versions >= 10.2.0.1 but BELOW 11.1

Versions confirmed as being affected

Description

When undo tablespace is using NON-AUTOEXTEND datafiles,

V$UNDOSTAT.TUNED_UNDORETENTION may be calculated too high preventing

undo block from being expired and reused. In extreme cases the undo

tablespace could be filled to capacity by these unexpired blocks.

 

An alert may be posted on DBA_ALERT_HISTORY that advises to increase

the space when it is not really necessary if this fix is applied.

If the user sets their own alert thresholds for undo tablespaces the

bug may prevent alerts from being produced.

 

Workaround

 alter system set "_smu_debug_mode" = 33554432;

 This causes the v$undostat.tuned_undoretention to be calculated as

  the maximum of:

    maxquerylen secs + 300

    undo_retention specified in init.ora

 

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

References

Bug:5387030 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article

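Our database version is 10.2.0.5.0, which falls within the affected range (Versions >= 10.2.0.1 but BELOW 11.1). For a fixed-size undo tablespace, Oracle retains undo for as long as possible based on the free space left in the tablespace. If a fixed-size undo tablespace does not have its retention guaranteed ( alter tablespace xxx retention guarantee; ), the undo_retention parameter has no effect. (Warning: if the undo tablespace is set to retention guarantee, unexpired undo is never overwritten, and if the tablespace runs out of space, DML operations fail or transactions hang.)

   Oracle 10g has the Automatic Undo Retention Tuning feature. The configured undo_retention is only a guideline: Oracle tunes undo retention automatically and may retain undo beyond the undo_retention setting in order to avoid ORA-1555 errors. The tuned_undoretention column of V$UNDOSTAT shows the retention time Oracle has computed from sampled transaction activity (and, when the datafiles are not autoextensible, from the remaining space). This column exists only from 10g onward, not in 9i. V$UNDOSTAT covers undo tablespace usage for the last 4 days; for older data, query DBA_HIST_UNDOSTAT. For a database whose transaction load is unevenly distributed, this creates a latent problem: undo may be used up during batch processing, and that state persists without the space being released.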
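To see what the auto-tuning described above has actually decided, the tuned retention can be read next to the space and error counters in V$UNDOSTAT. A minimal sketch, where the interval_start alias is illustrative:

```sql
-- One row per 10-minute interval, newest first.
-- A TUNED_UNDORETENTION far above UNDO_RETENTION, together with a large
-- UNEXPIREDBLKS, matches the symptom of the bug described above.
SELECT TO_CHAR(begin_time, 'mm/dd/yy hh24:mi:ss') AS interval_start,
       tuned_undoretention,   -- seconds Oracle is currently trying to keep
       maxquerylen,           -- longest query in the interval, in seconds
       activeblks,
       unexpiredblks,
       expiredblks,
       unxpstealcnt,          -- attempts to steal unexpired extents
       ssolderrcnt,           -- ORA-01555 occurrences
       nospaceerrcnt          -- out-of-space errors in the undo tablespace
  FROM v$undostat
 ORDER BY begin_time DESC;
```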

There are three ways to disable 10g automatic undo retention tuning:

PS: Automatic Tuning of Undo_retention Causes Space Problems [ID 420525.1]

1.) Set the autoextend and maxsize attribute of each datafile in the undo ts so it is autoextensible and its maxsize is equal to its current size so the undo tablespace now has the autoextend attribute but does not autoextend:
SQL> alter database datafile '<datafile_filename>'
autoextend on maxsize <current_size>;

With this setting, v$undostat.tuned_undoretention is not calculated based on a percentage of the undo tablespace size, instead v$undostat.tuned_undoretention is set to the maximum of (maxquerylen secs + 300) and the undo_retention specified in init.ora file.

2.) Set the following hidden parameter in init.ora file:
_smu_debug_mode=33554432

or

SQL> Alter system set "_smu_debug_mode" = 33554432;

With this setting, v$undostat.tuned_undoretention is not calculated based on a percentage of the fixed size undo tablespace, instead v$undostat.tuned_undoretention is set to the maximum of (maxquerylen secs + 300) and the undo_retention specified in init.ora file.

3.) Set the following hidden parameter in init.ora:
_undo_autotune = false

or

SQL> Alter system set "_undo_autotune" = false; 
This can be changed dynamically; no database restart is required.

Autotune of undo retention is turned off.
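Which of the three methods applies depends partly on how the undo datafiles are configured, because method 1 works by changing their autoextend attributes. A sketch of a check, assuming DBA_DATA_FILES is accessible (the aliases are illustrative):

```sql
-- Current size and autoextend settings of each undo datafile.
-- AUTOEXTENSIBLE = 'NO' means the tablespace is fixed-size, which is the
-- case where tuned retention is derived from the tablespace size.
SELECT file_name,
       ROUND(bytes / 1024 / 1024)     AS size_mb,
       autoextensible,
       ROUND(maxbytes / 1024 / 1024)  AS max_mb
  FROM dba_data_files
 WHERE tablespace_name = 'UNDOTBS1';
```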

Wed Jan 30 20:59:26 GMT+08:00 2013
ALTER SYSTEM SET _undo_autotune=FALSE SCOPE=BOTH;

SQL> show parameter undo

 

NAME                TYPE              VALUE

------------------- ----------------- ----------

_undo_autotune      boolean           FALSE

undo_management     string            AUTO

undo_retention      integer           900

undo_tablespace     string            UNDOTBS1
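Hidden parameters generally appear in show parameter only once they have been set explicitly, as _undo_autotune was here. To read their values regardless, the x$ fixed tables can be queried directly. A sketch, assuming a connection as SYS, since these tables are not visible to ordinary users:

```sql
-- Current values of the two hidden parameters mentioned in this article.
SELECT i.ksppinm  AS name,
       v.ksppstvl AS value,
       i.ksppdesc AS description
  FROM x$ksppi  i
  JOIN x$ksppcv v
    ON v.indx = i.indx
 WHERE i.ksppinm IN ('_undo_autotune', '_smu_debug_mode');
```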

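After _undo_autotune was set, the application team ran the stored procedure again and the error ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS1' did not recur. (This error was never recorded in the alert log.)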

 


This article is reposted from yangjunfeng's 51CTO blog. Original link: http://blog.51cto.com/yangjunfeng/1130277

