redo日志文件是Oracle数据库中最重要的文件之下,在创建数据库时,通常要求创建多个在线日志组,每个日志组至少包括两个成员,也就是说,每个日志组至少有两个日志文件,这两个日志文件存储的内容是相同的,并且这两个日志文件最好位于不同的磁盘上。即使在使用磁盘阵列做存储时,一般也是每个日志组配置两个日志文件。这说明了redo日志文件十分重要,如果损坏可能会造成不可挽回的损失。这里用实验模拟一下数据库正在事务为提交时当前redo文件损坏会发生什么,先后模拟两种情况下的损坏,一种是用rm删除redo文件,一种是用dd命令清空redo文件,看看这两种情况下都会发生什么。
1 实验环境
本次实验用的是Oracle数据库最新的21c版本,详细的版本信息可以从视图v$version中查到。
SQL>select BANNER_FULL from v$version; BANNER_FULL ------------------------------------------------------ Oracle Database 21c Enterprise Edition Release 21.0.0.0.0- Production Version 21.3.0.0.0
看一下数据库中redo日志组的配置
SQL>selectGROUP# , bytes,status from v$log;GROUP# BYTES STATUS ---------- ---------- ----------------1209715200 CURRENT 2209715200 INACTIVE 3209715200 INACTIVE
数据库有三个日志组,日志组1是当前日志组。然后再看一下每个日志组的成员。
SQL> l 1*selectGROUP#,STATUS,MEMBER from v$logfile SQL> c /,status/1*selectGROUP#,MEMBER from v$logfile SQL>/GROUP# MEMBER ---------- ----------------------------------------------------------------3/opt/oracle/oradata/ORCLCDB/redo03.log2/opt/oracle/oradata/ORCLCDB/redo02.log1/opt/oracle/oradata/ORCLCDB/redo01.log
每个日志组只有一个日志文件,当前日志组1中的日志文件是/opt/oracle/oradata/ORCLCDB/redo01.log
SQL>alter session set container=ORCLPDB1; Session altered. SQL>selectGROUP#,MEMBER from v$logfile;GROUP# MEMBER ---------- ----------------------------------------------------------------3/opt/oracle/oradata/ORCLCDB/redo03.log2/opt/oracle/oradata/ORCLCDB/redo02.log1/opt/oracle/oradata/ORCLCDB/redo01.log
可插拔数据库pdb1的日志组和容器数据库cdb中是一样的,它们使用同一个日志组。
2 模拟误删当前日志文件组中所有日志文件场景
另开一个会话,在这个会话里运行事务,这个会话直接登录到可插拔数据库pdb1,使用用户test,不使用sys或system用户,以模拟真实的数据库场景。本文在以后就称这个会话为用户会话,前面的以sysdba身份登录的会话就称为管理会话。
[oracle@iZ2ze0t8khaprrpfvmevjiZ ~]$ sqlplus test/test123@iZ2ze0t8khaprrpfvmevjiZ/orclpdb1
登录会话之后,检查会话的自动提交是否关闭
SQL> show autoc; autocommit OFF
自动提交已关闭,如未关闭,用set autocommit off 命令关闭,看一下本次实验的数据库表
SQL>select*from test2; ID NAME SALARY ---------- -------------------- ----------1 zhangsan 4732 lisi 4733 wangwu 473SQL>select*from test; ID NAME SALARY ---------- -------------------- ----------1 zhangsan 6012 lisi 4743 wangwu 474
运行两条DML(数据操作语句)
SQL>update test2 set salary=500where id=1;1 row updated. SQL>deletefrom test where id=1;1 row deleted.
执行到这里,数据库系统里现在已经有了未提交事务,下面开始模仿误删当前redo文件,用rm命令删除redo文件。
切换到管理会话
SQL> host
使用host命令切换到操作系统shell下
[oracle@iZ2ze0t8khaprrpfvmevjiZ ~]$ ls-l /opt/oracle/oradata/ORCLCDB/redo01.log -rw-r-----1 oracle oinstall 209715712 Aug 1808:59 /opt/oracle/oradata/ORCLCDB/redo01.log [oracle@iZ2ze0t8khaprrpfvmevjiZ ~]$ rm /opt/oracle/oradata/ORCLCDB/redo01.log [oracle@iZ2ze0t8khaprrpfvmevjiZ ~]$ ls-l /opt/oracle/oradata/ORCLCDB/redo01.log ls: cannot access '/opt/oracle/oradata/ORCLCDB/redo01.log': No such file or directory [oracle@iZ2ze0t8khaprrpfvmevjiZ ~]$ exitexit
上面的命令先查看一下当前redo日志组的唯一成员/opt/oracle/oradata/ORCLCDB/redo01.log,用rm命令删除后,再次查看,确认删除后,退出shell进入sqlplus下。
SQL>selectGROUP# , bytes,status from v$log;GROUP# BYTES STATUS ---------- ---------- ----------------1209715200 CURRENT 2209715200 INACTIVE 3209715200 INACTIVE
当前的日志组id仍然是1,检查一下数据库告警日志,并无和本次删除相关的错误信息
[oracle@iZ2ze0t8khaprrpfvmevjiZ trace]$ pwd /opt/oracle/diag/rdbms/orclcdb/ORCLCDB/trace [oracle@iZ2ze0t8khaprrpfvmevjiZ trace]$ tail-10 alert_ORCLCDB.log ORCLPDB1(3):Clearing Resource Manager plan via parameter 2022-08-18T08:53:55.916914+08:00 ORCLPDB1(3):Setting Resource Manager plan SCHEDULER[0x52BD]:DEFAULT_MAINTENANCE_PLAN via scheduler window ORCLPDB1(3):Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter 2022-08-18T08:59:38.906285+08:00 TABLE SYS.WRP$_REPORTS: ADDED INTERVAL PARTITION SYS_P4541 (4613) VALUES LESS THAN (TO_DATE(' 2022-08-19 01:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) TABLE SYS.WRP$_REPORTS_DETAILS: ADDED INTERVAL PARTITION SYS_P4542 (4613) VALUES LESS THAN (TO_DATE(' 2022-08-19 01:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) TABLE SYS.WRP$_REPORTS_TIME_BANDS: ADDED INTERVAL PARTITION SYS_P4545 (4612) VALUES LESS THAN (TO_DATE(' 2022-08-18 01:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')) 2022-08-18T09:01:17.925156+08:00 ORCLPDB1(3):TABLE SYS.ACTIVITY_TABLE$: ADDED INTERVAL PARTITION SYS_P3413 (110) VALUES LESS THAN (10570)
到用户会话中进行操作
SQL> commit; Commit complete.
可以看到,事务成功提交。
回到管理会话
SQL>alter system switch logfile;SQL>selectGROUP# , bytes,status from v$log;GROUP# BYTES STATUS ---------- ---------- ----------------1209715200 ACTIVE 2209715200 CURRENT 3209715200 INACTIVE
切换一下日志文件,当前redo日志组id为2,日志组1 状态时活跃。
SQL>alter database clear logfile group1;alter database clear logfile group1* ERROR at line 1: ORA-01624: log 1 needed for crash recovery of instance ORCLCDB (thread 1) ORA-00312: online log 1 thread 1:'/opt/oracle/oradata/ORCLCDB/redo01.log'
清除并重建日志组1,命令操作失败,因为日志组1是崩溃恢复需要的,用clear unarchived logfile命令清除重建日志组
SQL>alter database clear unarchived logfile group1; Database altered.
命令运行成功了
SQL>!ls -l /opt/oracle/oradata/ORCLCDB/redo01.log-rw-r----- 1 oracle oinstall 209715712 Aug 18 09:16 /opt/oracle/oradata/ORCLCDB/redo01.log
检查一下日志文件,发现日志文件已经重建。
3 模拟日志文件本身故障
用dd命令覆盖清除在线日志文件模拟日志文件故障,切换到用户会话,重开一个事务
SQL>update test2 set salary=500where id=1;1 row updated. SQL>insertinto test values(1,'zhangsan',1500);1 row created.
切换到管理会话,查看一下当前日志组及其成员
SQL>selectGROUP# , bytes,status from v$log;GROUP# BYTES STATUS ---------- ---------- ----------------1209715200 UNUSED 2209715200 CURRENT 3209715200 UNUSED SQL>selectGROUP#,MEMBER from v$logfile;GROUP# MEMBER ---------- --------------------------------------------------------------------------------3/opt/oracle/oradata/ORCLCDB/redo03.log2/opt/oracle/oradata/ORCLCDB/redo02.log1/opt/oracle/oradata/ORCLCDB/redo01.log
当前的日志组id为2,其唯一的成员文件为/opt/oracle/oradata/ORCLCDB/redo02.log
切换到root用户下清空此文件
[root@ ~]# dd if=/dev/null of=/opt/oracle/oradata/ORCLCDB/redo02.log0+0 records in0+0 records out 0 bytes copied, 5.028e-05 s, 0.0 kB/s [root@ ~]# ls -l /opt/oracle/oradata/ORCLCDB/redo02.log-rw-r--r--1 root root 0 Aug 1809:40 /opt/oracle/oradata/ORCLCDB/redo02.log
文件已被清空,其字节数为0。
切换到用户会话,提交事务
SQL> commit; Commit complete.
事务仍能提交成功。再次回到管理会话,检查告警日志,没有错误信息
[oracle@iZ2ze0t8khaprrpfvmevjiZ trace]$ ls-l /opt/oracle/oradata/ORCLCDB/redo02.log -rw-r--r--1 root root 0 Aug 1809:40 /opt/oracle/oradata/ORCLCDB/redo02.log
当前redo文件的大小仍为0
[oracle@iZ2ze0t8khaprrpfvmevjiZ fd]$ ps-ef|grep lgwr|grep -vgrep oracle 906310 Aug17 ? 00:00:06 ora_lgwr_ORCLCDB [oracle@iZ2ze0t8khaprrpfvmevjiZ fd]$ cd /proc/9063/fd [oracle@iZ2ze0t8khaprrpfvmevjiZ fd]$ ls-l|grep redo lrwx------ 1 oracle oinstall 64 Aug 1809:51 258-> /opt/oracle/oradata/ORCLCDB/redo01.log lrwx------ 1 oracle oinstall 64 Aug 1809:51 259-> /opt/oracle/oradata/ORCLCDB/redo02.log (deleted) lrwx------ 1 oracle oinstall 64 Aug 1809:51 260-> /opt/oracle/oradata/ORCLCDB/redo03.log
查看redo日志写进程打开的文件,redo02.log为已删除。
SQL>selectGROUP# , bytes,status from v$log;GROUP# BYTES STATUS ---------- ---------- ----------------1209715200 UNUSED 2209715200 CURRENT 3209715200 UNUSED
当前的redo日志组仍然为2.切换一下日志组
SQL>alter system switch logfile;alter system switch logfile * ERROR at line 1: ORA-03113: end-of-file on communication channel Process ID:12168 Session ID:3 Serial number:11933
会话退出了,查看一下数据库的告警日志,发现数据库崩溃了
[oracle@iZ2ze0t8khaprrpfvmevjiZ trace]$ tail-30 alert_ORCLCDB.log ORA-27037: unable to obtain file status Linux-x86_64 Error: 2: No such file or directory Additional information: 72022-08-18T09:16:44.462008+08:00 Completed: alter database clear unarchived logfile group 12022-08-18T09:53:25.915074+08:00 Errors in file /opt/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_lgwr_9063.trc: ORA-00316: log 2 of thread 1, type 0in header is not log file ORA-00312: online log 2 thread 1: '/opt/oracle/oradata/ORCLCDB/redo02.log'2022-08-18T09:53:25.915417+08:00 Errors in file /opt/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_lgwr_9063.trc: ORA-00316: log 2 of thread 1, type 0in header is not log file ORA-00312: online log 2 thread 1: '/opt/oracle/oradata/ORCLCDB/redo02.log' Errors in file /opt/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_lgwr_9063.trc (incident=16994) (PDBNAME=CDB$ROOT): ORA-316 [] [] [] [] [] [] [] [] [] [] [] [] Incident details in: /opt/oracle/diag/rdbms/orclcdb/ORCLCDB/incident/incdir_16994/ORCLCDB_lgwr_9063_i16994.trc 2022-08-18T09:53:26.136046+08:00 USER (ospid: 12168): terminating the instance due to ORA error 3162022-08-18T09:53:26.136802+08:00 Cause -'Instance is being terminated due to fatal process LGWR is terminating with error 316' Memory (Avail / Total) =26.98M / 1816.86M Swap (Avail / Total) =0.00M / 0.00M 2022-08-18T09:53:26.337603+08:00 System state dump requested by (instance=1, osid=12168), summary=[abnormal instance termination]. System State dumped to trace file /opt/oracle/diag/rdbms/orclcdb/ORCLCDB/trace/ORCLCDB_diag_9042.trc 2022-08-18T09:53:28.696702+08:00 Dumping diagnostic data indirectory=[cdmp_20220818095326], requested by (instance=1, osid=12168), summary=[abnormal ins tance termination]. 2022-08-18T09:53:32.210921+08:00 Instance terminated by USER, pid =12168
数据库实例因为ORA error 316而终结,这个错误就 是/opt/oracle/oradata/ORCLCDB/redo02.log被清空所致。
4 分析和结论
从这两个简单的模拟可以看出,如果是误删了当前日志文件,只要及时发现还可以恢复的,这时不必关闭数据库,直接切换日志,清除重建日志组即可。这和linux系统删除文件的原理有关,在使用rm命令删除文件时,操作系统并不对文件本身进行操作,只是把文件名从父目录条目中删除,数据库的lgwr进程仍然可以通过已打开的文件句柄对文件进行操作,所以可以成功的切换、清除、重建日志组,这时关闭了数据库恢复起来所需做的工作更多。
如果时redo文件本身故障或者被清空,lgwr进程不能访问日志文件,不能完成日志文件的切换,这时会报316错误,导致实例被终结。