Oracle数据库出现坏块现象是指:在Oracle数据库的一个或多个数据块(一个数据块的容量在创建数据库时由db_block_size参数指定,缺省为8K)内出现内容混乱的现象。由于正常的数据块都有固定的合法内容格式,坏块的出现,导致数据库进程无法正常解析数据块的内容,进而使数据库进程报错乃至挂起,并级联导致整个数据库实例出现异常。
一、坏块分类
物理坏块:也可以称为介质坏块,指的是块格式本身是坏的,块内的数据没有任何意义。
逻辑坏块:指的是块内的数据在逻辑是存在问题。比如说索引块的索引值没有按从小到大排列。物理坏块一般是由于内存问题、OS问题、IO子系统问题和硬件引起,逻辑坏块一般是是由于Oracle Bug等原因引起。
另据metalink文档Identify the corruption extension using RMAN/DBV/ANALYZE etc [ID 836658.1],有如下分类:
Datafile Block Corruption - Physical/Logical
Table/Index Mismatch
Extents Inconsistencies
Data Dictionary Inconsistencies
可见除了物理和逻辑坏块,还有一些逻辑不一致的情况出现(TABLE / INDEX Mismatch),在本文档不做过多讨论。
二、产生原因
坏块产生的原因大致有以下几种:
2.1 硬件问题
Oracle进程在处理一个数据块时,首先将其读入物理内存空间,在处理完成后,再由特定进程将其写回磁盘;如果在这个过程中,出现内存故障,CPU计算失误,都会导致内存数据块的内容混乱,最后反映到写回磁盘的数据块内容有误。同样,如果存储子系统出现异常,数据块损坏也就随之出现了。
2.2 操作系统BUG
由于Oracle进程对数据块的读写,都是以操作系统内核调用(system call)的方式完成的,如果操作系统在内核调用存在问题,必然导致Oracle进程写入非法的内容。
2.3 操作系统的I/O错误或缓冲问题
2.4 内存或paging问题
Oracle软件BUG。
Oracle软件特定版本上,可能出现导致数据块的内容出现异常BUG。
2.5 非Oracle进程扰乱Oracle共享内存区域
如上文所述,在当数据块的内容被读入主机的物理内存时,如果其他非Oracle进程,对Oracle使用的共享内存区域形成了扰乱,最终导致写回磁盘的数据块内容混乱。
2.6 异常关机,掉电,终止服务
异常关机,掉电,终止服务使进程异常终止,而破坏数据块的完整性,导致坏块产生。(注:这也是为什么突然断电会导致数据库无法启动)
由上可见,坏块的形成原因复杂。当出现坏块时,为了找到确切的原因,需要大量的分析时间和排查操作,甚至需要多次重现才能找出根本原因。但当故障发生在生产系统上,我们为了减少停机时间,会尽快实施应急权变措施以保证系统的可用性,这样就破坏了故障现场,对根本原因的分析因而也更加困难了。
三,如何判断坏块
3.1 Alert日志后台日志
3.2 rman命令
RMAN> BACKUP CHECK LOGICAL VALIDATE DATAFILE 数据文件号;
SQL>select * from v$database_block_corruption where file#=4;
3.3 dbv工具
$dbv file=/u01/app/oracle/oradata/PROD/user01.dbf
3.4 确定坏块对象信息
SQL>Select tablespace_name,segment_type,owner,segment_name From dba_extents Where file_id=数据文件号 and 块号 between block_id and block_id+blocks-1;
3.5 ora错误
SQL> select count(*) from t1;
select count(*) from t1
*
第 1 行出现错误:
ORA-01578: ORACLE 数据块损坏 (文件号 7, 块号 131)
ORA-01110: 数据文件 7: 'E:\WH.DBF'
3.6 关于以上排查方法的区别和优缺点
1)RMAN的BACKUP CHECK LOGICAL VALIDATE DATAFILE
该命令可以设置并行度PARALLELISM,即使是对一个数据文件检查也可以多个通道并行,所以效率是比较高的;和dbv相比,该方法也可以通过v$session_longops更加方便的完成进度监控;且rman会把坏块信息保存到控制文件 (v$database_block_corruption, v$backup_corruption),通过sql可以很好的完成对应表空间、数据文件、对象的统计。
使用该方法的前提是需要数据库运行于归档模式,如果是非归档的话只能在mount下使用。
2) Dbv
Dbv执行块级的检测,所以这个工具无法检查如INDEX versus TABLE mismatches的情况;也无法做到自动并行;结果在os以平面文件的形式存储,不易统计。
3) ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE
该命令是会在表上加tm锁,但是不影响dml。
关于该命令,SQL Reference解释如下:
Specify ONLINE to enable Oracle Database to run the validation while DML operations are ongoing within the object. The database reduces the amount of validation performed to allow for concurrency.
Note:
When you validate the structure of an object ONLINE, Oracle Database does not collect any statistics, as it does when you validate the structure of the object OFFLINE.
Specify OFFLINE, to maximize the amount of validation performed. This setting prevents INSERT, UPDATE, and DELETE statements from concurrently accessing the object during validation but allows queries. This is the default.
也就是该命令不会收集任何统计信息,不会引起sql执行计划的突变;性能方面,对于大并发的系统可能会有轻微影响。
由于dbv和rman都比较贴近底层,在mount时候就可以做,可见这两个工具都是比较贴近块级检测的,所以在做logical inconsistents(TABLE / INDEX Mismatch)方面的检测,还是需要ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE这个方法。
在metalink上面有文档,可以利用该工具,在表空间级别完成检测,大大提高了检测效率,文档为SCRIPT: VALIDATE.SQL to ANALYZE .. VALIDATE STRUCTURE objects in a Tablespace [ID 100419.1]。
实际上dbv、rman、validate structure均可以检测出一定程度的逻辑讹误,但是最可靠的还是db_block_checksum=true情况下的validate structure [online]验证命令。从另一个角度来说,普通的dbv只能做单一的检测,而无法做到交叉地检验,从而了解表和索引上的不一致问题,但是validate structure online却可以做到。
所以在日常检测,考虑到效率和可靠性等,最好用rman+validate structure来完成。
四,如何修复坏块
4.1 无备份情况
4.1.1 EXP/IMP方式
4.1.2 通过内部事件跳过坏块
SQL>Alter system set events='10231 trace name context forever,level 10';
$exp userid=wh/wh file=t1.dump tables=t1
SQL>drop table t1;
$imp userid=wh/wh file=t1.dump tables=t1
4.1.3 DBMS_REPAIR包
a)创建表
SQL>BEGIN
DBMS_REPAIR.ADMIN_TABLES (
TABLE_NAME => 'REPAIR_TABLE01',
TABLE_TYPE => dbms_repair.repair_table,
ACTION => dbms_repair.create_action,
TABLESPACE => 'WH');
END;
/
b)收集坏块信息
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
SCHEMA_NAME => 'WH',
OBJECT_NAME => 'T1',
REPAIR_TABLE_NAME => 'REPAIR_TABLE01',
corrupt_count => num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/
c)查看结果
SQL> select object_name, block_id, corrupt_type, marked_corrupt,corrupt_description,repair_description from repair_table01;
d)跳过坏块
BEGIN
DBMS_REPAIR.SKIP_CORRUPT_BLOCKS (
SCHEMA_NAME => 'WH',
OBJECT_NAME => 'T1',
OBJECT_TYPE => dbms_repair.table_object,
FLAGS => dbms_repair.SKIP_FLAG);
END;
/
注:这两种方法均是跳过坏块,而不去修复,数据有丢失。
4.1.4 rman制造坏块,bbed修复坏块
a)建立测试表
SQL> create tablespace test datafile '/home/oracle/oradata/orcl/test.dbf' size 100m;
Tablespace created.
SQL> create table test (a number,b varchar2(20)) tablespace test;
Table created.
SQL> Select rowid,dbms_rowid.rowid_relative_fno(rowid) rel_fno,dbms_rowid.rowid_block_number(rowid) block from test;
ROWID REL_FNO BLOCK
------------------ ---------- ----------
AAASwmAAGAAAACHAAA 6 135
AAASwmAAGAAAACHAAB 6 135
AAASwmAAGAAAACHAAC 6 135
AAASwmAAGAAAACHAAD 6 135
AAASwmAAGAAAACHAAE 6 135
AAASwmAAGAAAACHAAF 6 135
6 rows selected.
SQL> alter system checkpoint;
System altered.
b)DBV检测数据文件
[oracle@bjtest ~]$ dbv file=/home/oracle/oradata/orcl/test.dbf blocksize=8192
DBVERIFY: Release 11.2.0.3.0 - Production on Wed Jul 17 06:39:00 2013
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
DBVERIFY - Verification starting : FILE = /home/oracle/oradata/orcl/test.dbf
DBVERIFY - Verification complete
Total Pages Examined : 12800
Total Pages Processed (Data) : 5
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 130
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12665
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 911665 (0.911665)
这里创建了test表,数据存储在file 6 block 135 中,现在该block一切正常,本试验就是要通过rman来使得该block corrupt,然后通过bbed来修复
c)bbed查看kcbh
BBED> set dba 6,135
BBED> print kcbh
struct kcbh, 20 bytes @0
ub1 type_kcbh @0 0x06
ub1 frmt_kcbh @1 0xa2
ub1 spare1_kcbh @2 0x00
ub1 spare2_kcbh @3 0x00
ub4 rdba_kcbh @4 0x01800087
ub4 bas_kcbh @8 0x000de931
ub2 wrp_kcbh @12 0x0000
ub1 seq_kcbh @14 0x01
ub1 flg_kcbh @15 0x06 (KCBHFDLC, KCBHFCKV)
ub2 chkval_kcbh @16 0x59c2
ub2 spare3_kcbh @18 0x0000
d)RMAN标记坏块
RMAN> BLOCKRECOVER DATAFILE 6 block 135 clear;
Starting recover at 17-JUL-13
using channel ORA_DISK_1
Finished recover at 17-JUL-13
e)dbv再次检查坏块
[oracle@bjtest ~]$ dbv file=/home/oracle/oradata/orcl/test.dbf blocksize=8192
DBVERIFY: Release 11.2.0.3.0 - Production on Wed Jul 17 06:48:44 2013
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
DBVERIFY - Verification starting : FILE = /home/oracle/oradata/orcl/test.dbf
Page 135 is marked corrupt
Corrupt block relative dba: 0x01800087 (file 6, block 135)
Bad check value found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x01800087
last change scn: 0x0000.000de931 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe9310601
check value in block header: 0x59c2
computed block checksum: 0x923b
DBVERIFY - Verification complete
Total Pages Examined : 12800
Total Pages Processed (Data) : 4
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 130
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12665
Total Pages Marked Corrupt : 1
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 911663 (0.911663)
SQL> select * from test;
select * from test
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 135)
ORA-01110: data file 6: '/home/oracle/oradata/orcl/test.dbf'
f)bbed验证坏块
BBED> set dba 6,135
BBED> print kcbh --查看块的结构体
struct kcbh, 20 bytes @0
ub1 type_kcbh @0 0x06 --Block type
ub1 frmt_kcbh @1 0xa2 --Block format 1=Oracle 7, 2=Oracle 8+
ub1 spare1_kcbh @2 0x00 --Not used
ub1 spare2_kcbh @3 0x00 --Not used
ub4 rdba_kcbh @4 0x01800087 --RDBA -Relative Data Block Address
ub4 bas_kcbh @8 0x000de931 --SCN Base
ub2 wrp_kcbh @12 0x0000 --SCN Wrap
ub1 seq_kcbh @14 0x01 --Sequence number, incremented for every change made to the block at the same SCN
ub1 flg_kcbh @15 0x06 (KCBHFDLC, KCBHFCKV)
ub2 chkval_kcbh @16 0x59c2 --这里没有变化 --[Optional block checksum (if DB_BLOCK_CHECKSUM=TRUE)]
ub2 spare3_kcbh @18 0x0000 --Not used
BBED> verify --验证坏块
DBVERIFY - Verification starting
FILE = /home/oracle/oradata/orcl/test.dbf
BLOCK = 135
Block 135 is corrupt
Corrupt block relative dba: 0x01800087 (file 0, block 135)
Bad check value found during verification
Data in bad block:
type: 6 format: 2 rdba: 0x01800087
last change scn: 0x0000.000de931 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xe9310601
check value in block header: 0x59c2
computed block checksum: 0x923b
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 0
Total Blocks Failing (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 1
Total Blocks Influx : 0
Message 531 not found; product=RDBMS; facility=BBED
这里通过dbv,bbed,select table都证明rman能够标记block为Corrupt.
这里需要分析:block已经被标记,那证明该块肯定有修改,也就是说chkval_kcbh一定要变化,但是这里没有变化,证明该处异常
g)bbed修复rman生成坏块
BBED> set dba 6,135
BBED> sum apply
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) y
Check value for File 6, Block 135:
current = 0xcbf9, required = 0xcbf9
BBED> verify
DBVERIFY - Verification starting
FILE = /home/oracle/oradata/orcl/test.dbf
BLOCK = 135
DBVERIFY - Verification complete
Total Blocks Examined : 1
Total Blocks Processed (Data) : 1
Total Blocks Failing (Data) : 0
Total Blocks Processed (Index): 0
Total Blocks Failing (Index): 0
Total Blocks Empty : 0
Total Blocks Marked Corrupt : 0
Total Blocks Influx : 0
Message 531 not found; product=RDBMS; facility=BBED
BBED测试已经恢复
h)DBV验证
[oracle@bjtest ~]$ dbv file=/home/oracle/oradata/orcl/test.dbf blocksize=8192
DBVERIFY: Release 11.2.0.3.0 - Production on Wed Jul 17 06:55:39 2013
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
DBVERIFY - Verification starting : FILE = /home/oracle/oradata/orcl/test.dbf
DBVERIFY - Verification complete
Total Pages Examined : 12800
Total Pages Processed (Data) : 5
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 130
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12665
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 911665 (0.911665)
DBV测试已经恢复
i)RMAN 验证
RMAN> validate datafile 6 block 135;
Starting validate at 17-JUL-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=30 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00006 name=/home/oracle/oradata/orcl/test.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
6 OK 0 0 1 911665
File Name: /home/oracle/oradata/orcl/test.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 1
Index 0 0
Other 0 0
Finished validate at 17-JUL-13
SQL> select count(*) from test;
COUNT(*)
----------
6
总结说明
1.我们可以通过rman的clear命令来标记坏块(BLOCKRECOVER DATAFILE file# BLOCK block1#, block2#, block3#… CLEAR ;)
2.我们可以通过bbed的sum apply命令来修复该类型坏块
4.1.5 利用构造ROWID实现无备份情况下绕过ORA-1578、ORA-8103、ORA-1410等逻辑/物理坏块问题(速度较慢)
对于无备份情况下的ORA-1578、ORA-8103、ORA-1410等逻辑/物理坏块问题,可以通过如下PL/SQL 构造ROWID的方式挽救绝大多数非坏块的数据,
一般分成2种情况:有索引可以用来获取ROWID; 或者没有索引、索引不可用,必须通过dbms_rowid.ROWID_CREATE来构造ROWID的方式。
a)创建示例数据
create table liuwen (t1 int,t2 date default sysdate) tablespace liuwen
partition by range(t1) (partition p1 values less than (10000),
partition p2 values less than (20000),
partition p3 values less than (30000),
partition p4 values less than (40000),
partition p5 values less than (50000),
partition p6 values less than (60000),
partition p7 values less than (70000),
partition p8 values less than (80000),
partition p9 values less than (90000),
partition p10 values less than (100000),
partition p11 values less than (110000),
partition p12 values less than (120000),
partition p13 values less than (130000),
partition p14 values less than (140000),
partition p15 values less than (150000),
partition p16 values less than (160000));
insert into liuwen(t1) select rownum from dual connect by level<1000;
SQL> select count(*) from liuwen;
COUNT(*)
----------
5999
exec dbms_stats.gather_schema_stats(‘SYS’);
alter system flush buffer_cache;
随机采样一些块来做 坏块 约涉及到5个块的数据
SQL> set linesize 200 pagesize 1400
SQL> select dbms_rowid.rowid_block_number(rowid) blkid,
2 dbms_rowid.rowid_relative_fno(rowid) rfile
3 from liuwen
4 where rownum <= 1100
5 group by dbms_rowid.rowid_block_number(rowid),
6 dbms_rowid.rowid_relative_fno(rowid)
7 order by 1;
BLKID RFILE
---------- ----------
165 7
168 7
171 7
b)制造坏块
RMAN> blockrecover datafile 7 block 165,168,171 clear;
Starting recover at 23-JUL-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=30 device type=DISK
Finished recover at 23-JUL-13
SQL> select count(*) from liuwen;
select count(*) from liuwen
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 7, block # 165)
ORA-01110: data file 7: '/home/oracle/oradata/orcl/liuwen.dbf'
通过blockrecover datafile block clear 构造了一系列坏块且没有备份 ,我们通过下列脚本挽回大部分可用数据.
c)创建恢复表 liuwen_tab_backup,错误表 bad_rows
liuwen_tab_backup用于存放导出的可用的数据,bad_rows用于存放不可用的记录的rowid。
SQL> create table liuwen_tab_backup
2 tablespace liuwen
3 as select * from liuwen where 1=0;
create table bad_rows (row_id rowid,oracle_error_code varchar2(50))
tablespace liuwen;
set serveroutput on;
set timing on;
d)执行程序块
该匿名块根据对象所在块循环,依次对块里面的所有记录进行遍历,如果该记录可用,那么导出到新表里面,如果不可用,那么记录不可用的记录的rowid。循环套循环,只能穷举,这也是此方法较慢的原因之一。另外,如果这个对象所在表空间不是标准的8k块大小,如果是16k或者32k,在理论上, Oracle用12bit来记录记录数的. 也就是最多4k条记录,这样的话,最内层的循环会更多速度更慢。
DECLARE
TYPE RowIDTab IS TABLE OF ROWID INDEX BY BINARY_INTEGER;
CURSOR Crowid_info IS
select Do.DATA_OBJECT_ID dataid,
DE.FILE_ID fid,
DE.BLOCK_ID blkid,
DE.BLOCKS blkcnt
from dba_objects DO, dba_extents DE
where DO.OBJECT_NAME = 'LIUWEN'
--and DE.PARTITION_NAME='&PARTITION_NAME'
--若指定分区则取消注释
and nvl(DO.SUBOBJECT_NAME,'-1') = nvl(DE.PARTITION_NAME,'-1')
and DO.OBJECT_NAME = DE.SEGMENT_NAME
and DO.owner = 'SYS'
order by 1, 2, 3 asc;
bad_rows number := 0;
errors varchar2(500);
error_code varchar2(500);
myrowid rowid;
BEGIN
execute immediate 'alter session set commit_write=''batch,nowait'' ';
for i in Crowid_info loop
--对象所对应所有的物理块位置
for j in 0 .. i.blkcnt - 1 loop
--逐个块循环
for z in 0 .. 2000 loop
--块里面一行一行看,rownum(一般一个8k的块,记录数小于736行,如果碰到大块,应该还加大此参数,循环套循环,只能穷举,这也是此方法较慢的原因之一)
begin
myrowid := dbms_rowid.ROWID_CREATE(1,
i.dataid,
i.fid,
i.blkid + j,
z);
insert into liuwen_tab_backup
select /*+ ROWID(A) */
*
from liuwen A
where rowid = myrowid;
--能读取的放到backup里面备出来
EXCEPTION
when OTHERS then
BEGIN
errors := SQLERRM;
error_code := SQLCODE;
if (error_code like '%1410%' or error_code like '%8103%' or error_code like '%1578%') then
bad_rows := bad_rows + 1;
insert into bad_rows values (myrowid, error_code);
--不能读取的,rowid也放到相应表里面备案
commit;
else
raise;
end if;
END;
commit;
end;
end loop;
end loop;
end loop;
dbms_output.put_line('Total Bad Rows: ' || bad_rows);
commit;
END;
4.2 有备份情况
4.2.1采用rman恢复
RMAN>blockrecover datafile 7 block 131 from backupset;
Oracle会修复坏块,数据无丢失。推荐方式。
4.2.2 11g新特性recover corruption list
11g新特性RMAN语法recover corruption list是为了简化数据坏块的修复,在11g中recover corruption块时不需要一一指定数据文件名字了,只要是在v$database_block_corruption视图中记录的坏块,只要使用了 corruption list语法,都会试图修复。
下面我们使用recover .. clear命令手动造成个别数据块坏块,之后使用 recover corruption list;
a)修复:
RMAN> BLOCKRECOVER DATAFILE 6 block 135 clear;
Starting recover at 17-JUL-13
using channel ORA_DISK_1
Finished recover at 17-JUL-13
RMAN> validate datafile 6 block 135;
Starting validate at 17-JUL-13
using channel ORA_DISK_1
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00006 name=/home/oracle/oradata/orcl/test.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
6 FAILED 0 0 1 0
File Name: /home/oracle/oradata/orcl/test.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 1 1
Index 0 0
Other 0 0
validate found one or more corrupt blocks
See trace file /home/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_12203.trc for details
Finished validate at 17-JUL-13
SQL> select * from v$database_block_corruption;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
---------- ---------- ---------- ------------------ ------------------------
6 135 1 911665 FRACTURED
select count(*) from test
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 6, block # 135)
ORA-01110: data file 6: '/home/oracle/oradata/orcl/test.dbf'
b)恢复
RMAN> recover corruption list;
Starting recover at 17-JUL-13
using channel ORA_DISK_1
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00006
channel ORA_DISK_1: reading from backup piece /home/oracle/fast_recovery_area/ORCL/backupset/2013_07_17/o1_mf_nnndf_TAG20130717T062935_8yclbhvn_.bkp
channel ORA_DISK_1: piece handle=/home/oracle/fast_recovery_area/ORCL/backupset/2013_07_17/o1_mf_nnndf_TAG20130717T062935_8yclbhvn_.bkp tag=TAG20130717T062935
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 00:00:01
starting media recovery
media recovery complete, elapsed time: 00:00:04
Finished recover at 17-JUL-13
RMAN> validate datafile 6 block 135;
Starting validate at 17-JUL-13
using channel ORA_DISK_1
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00006 name=/home/oracle/oradata/orcl/test.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
6 OK 0 0 1 911665
File Name: /home/oracle/oradata/orcl/test.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 1
Index 0 0
Other 0 0
Finished validate at 17-JUL-13
SQL> select count(*) from test;
COUNT(*)
----------
6
五,如何预防
在有备份的情况下可以通过blockrecovery在线修复部分逻辑坏块,但是如果该逻辑讹误确实是由Oracle Bug引起的话,那么很有可能blockrecover也无能为力,需要具体问题具体分析之。坏块问题破坏性大,但并非不可预防。
5.1 底层
1)重要的文件系统、或者存储,raid也好,二次条带也好,不管是在哪一层,一定要冗余。
2)修改vg参数要注意,一定要先做好备份。
3)在为提高性能为操作系统打开异步I/O时,一定要把操作系统与异步I/O相关的补丁要打全。
4)尊重硬件生命周期,要及时的检查硬件的状态,及时更换驱动器部件。
5)关注由第三方软硬件引起的坏块相关问题,参考metalink文档,Known Corruption issues caused by 3rd party Software/Hardware Provider [ID 1323649.1]。
5.2 数据库设置
1)数据文件避免自动增长,使用裸设备避免破坏数据文件的文件头。
2)控制文件、日志文件一定要多工。
3)要开启归档,因为是rman validate check logical database;的前提条件,也是利用rman做recover的前提条件。
5.3 备份
1)制定一个良好的备份恢复策略,不能光放到本地文件系统,最好传到nas或者备份到带库,最好有表的EXP备份。
2)要做备份恢复测试,对备份文件的有效性进行检查,如:
restore validate database;
restore validate controlfile;
restore validate archivelog between lowseq and highseq;
5.4 数据加载
1)首先数据加载,尽量不使用nologging模式。
2)如果要使用nologging,在加载数据后,尽快全备数据库。
3)使用nologging模式创建的对象,创建后要马上改为logging,且在dba_tables和dba_indexes里面检查该对象所对应logging字段是否为yes
例如语句 alter index table1_PK rebuild online parallel nologging compute statistics,该语句在rebuild时候指定了nologging,所以其缺省存储属性是nologging的,这样dg打开后会遇到坏块问题,所以如果有dg,不建议使用nologging选项。
4)永久表和索引,不管是不是分区,都应该确认logging,分区表和分区索引新添加的分区默认的default attributes也应该是logging。
5.5 DG
1)物理dg要定期open read only打开,可以跑一些查询,或者启动到mount后backup validate check logical database。
2)dg主库,只有forcelogging不够,要经常检查dba_tables和dba_indexes里面logging字段是否全部为yes,另外建议开启supplemental log,否则万一用到日志挖掘恢复某些数据,防止sql_redo出现UNSUPPORTED而无法抓取到。
5.6 定期做相关检查
1)alert日志。
2)BACKUP CHECK LOGICAL VALIDATE DATAFILE,并检查v$database_block_corruption。
3)ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE,检查内部逻辑不一致。
5.7 官方发布的预警等
在Metalink.oracle.com网站,Oracle定期发布基于特定软件版本的“已知问题(known issues)说明”。对于可能导致坏块的Oracle软件BUG,在Oracle公司内部,是作为高严重级别的问题进行处理,在“已知问题(known issues)说明”中,这些BUG以严重(Noticable)问题标出(标记为*或+),部分问题,Oracle还会发布警告(Alert)通告。在文档中,Oracle会提供相应的补丁或应对措施。
比如bug 5386204,该问题是在使用asm存储时,通过直接路径加载的话,可能会遇到坏块的bug,对应oracle软件版本为10204及10203,而这两个版本的软件使用时非常广泛的,该bug到10205才被完全修复。可见关注官方相关信息的重要性,可以提前预警、防患于未然。
所以,要密切关注官方的安全预警、bug修正,例如:http://www.oracle.com/technetwork/topics/security/alerts-086861.html是Critical Patch Updates, Security Alerts and Third Party Bulletin,另外还有Oracle Recommended Patches -- Oracle Database [ID 756671.1]。
附参考metalink文档:
Physical and Logical Block Corruptions. All you wanted to know about it. [ID 840978.1]
SCRIPT: VALIDATE.SQL to ANALYZE .. VALIDATE STRUCTURE objects in a Tablespace [ID 100419.1]
ORA-1499. Table/Index row count mismatch [ID 563070.1]
Known Corruption issues caused by 3rd party Software/Hardware Provider [ID 1323649.1]
Identify the corruption extension using RMAN/DBV/ANALYZE etc [ID 836658.1]
OERR: ORA-1578 "ORACLE data block corrupted (file # %s, block # %s)" Master Note [ID 1578.1]