背景说明:用户反馈数据库运行很慢,但是等查看的时候又恢复了正常,果断的查看了过去一段时间的AWR报告;
AWR报告信息如下:
从db time/Elapsed显示数据库的压力并不是很大。
每秒钟产生的redo log 6M,每小时21G,数据库的IO写压力很大。
top5等待事件:enq:CF-contention 该等待事件不是空闲等待事件;
二、Metalink对该等待事件的分析
这问题一直没有遇到过,只能求助于metalink,详细的说明如下:
1、出现问题的版本
ORACLE DATABASE - ENTERPRISE EDITION - VERSION 9.2.0.1 TO 11.2.0.3 [RELEASE 9.2 TO 11.2](当前数据库的版本为11.2.0.3)
2、症状
在awr等待报告中的top5等待事件或出现v$session_wait的等待事件;
3、原因
任何需要读取控制文件的动作期间都会产生CF队列,CF锁用于controlfile序列操作和共享部分controlfile读和写。通常CF锁是分配给一个非常简短的时间和时使用:
- 发生检查点
- 日志文件的切换
- 归档online redolog
- 运行崩溃后的恢复
- 热备的开始和结束
- DML通过nologging选项执行对象时
4、解决问题
找出当前持有CF锁的对象
select l.sid, p.program, p.pid, p.spid, s.username, s.terminal, s.module, s.action, s.event, s.wait_time, s.seconds_in_wait, s.statefrom v$lock l, v$session s, v$process pwhere l.sid = s.sidand s.paddr = p.addrand l.type='CF'and l.lmode >= 5;
查找等待CF锁的对象
select l.sid, p.program, p.pid, p.spid, s.username, s.terminal, s.module, s.action, s.event, s.wait_time, s.seconds_in_wait, s.statefrom v$lock l, v$session s, v$process pwhere l.sid = s.sidand s.paddr = p.addrand l.type='CF'and l.request >= 5
METALINK如下:
It is advisable to run the above queries a few times in a row...
1. If you see the holder is:
background process, typically LGWR, CKPT or ARCn the holder is holding the enqueue for a longer period of time
Check if the redologs are sized adequately. Typically you want to drive at a log switch every 30 minutes. Also verify checkpointing parameters such as fast_start_mttr_target
2. If you see the holder is:
a user session (so no background process) the holder is constantly changing the wait event of the holder is 'control file parallel write' Then it is most likely that the contention for the CF enqueue is caused by DML on a NOLOGGING object.
When performing DML operations using either NOLOGGING or UNRECOVERABLE option, then oracle records the unrecoverable SCN in the controlfiles. Typically you will see an increase in waits appearing for 'control file parallel write' as well however the session is not blocked for this wait event but rather the session performing the controlfile write will be holding the CF enqueue and the other sessions performing the unrecoverable (nologging) operation will be waiting to get a CF enqueue to update the controlfile with the unrecoverable SCN.
So if you have an object with the NOLOGGING option, it is normal to see CF enqueue contention...
The following operations can make use of no-logging mode:
direct load (SQL*Loader) direct-load INSERT CREATE TABLE ... AS SELECT CREATE INDEX ALTER TABLE ... MOVE PARTITION ALTER TABLE ... SPLIT PARTITION ALTER INDEX ... SPLIT PARTITION ALTER INDEX ... REBUILD ALTER INDEX ... REBUILD PARTITION INSERT, UPDATE, and DELETE on LOBs in NOCACHE NOLOGGING mode stored out of line
3. Check if the archive destination (log_archive_dest_n) are accessible, you may need to involve System/Storage admins.
If you are using NFS filesystem for the archive destinations then make sure there is no issue with nfs as this can lead to log switch hanging and that leads to CF enqueue as the lock holder will be either LGWR or ARCn processes |
理解如下:
- 当holder的对象是后台进程:LGWR、CKPT、ARCn
解决方法:redolog的大小和切换频率,建议每次日志切换的时间间隔着30分钟左右。
- 当holder的对象是用户session、并经常变化、等待事件"control file parallel write"
解决方法:该等待是正常的数据库等待;
- 其他:检查归档的路径,由于系统或存储的问题导致的该等待事件;
五、问题的总结
本案例的aw报告中显示数据库每小时产生的归档日志达22G,数据库的online redolog的大小为1G/个,计算下来每个小时需要进行20次的日志切换,平均3分钟执行次。与建议的30分钟一次相差很多。
经过与业务沟通发现当前数据库正在进行数据的抽取工作,导致该等待事件的发生。
最后的解决方法:建议在工作时间避免进行数据的抽取保证在工作期间系统能够正常运行;
可以适当增加online redolog的大小到5G,减低日志的切换频率;
DBA有时候就是有这个好处,当所有人都不知道问题的时候,问题的大小你都可以随便描述(前提是建立在事实的依据下),如果平时树立足够的威信的话,那么很容易让其他的人员配合你的工作,这个时候成就感是很强的。
附:日志信息和产生情况
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++本文作者:JOHN
ORACLE技术博客:ORACLE 猎人笔记 数据库技术群:367875324 (请备注ORACLE管理 )
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++