US contention: Lock held to perform DDL on the undo segment
http://tech.sina.com.cn/s/2009-09-23/09561077577.shtml
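I came across a troubleshooting case on US contention written by fuyuncat. I had never dealt with this kind of enqueue before, so I read it through several times and took notes, then searched out a few similar cases for future reference.
Summary of the case
"Active Sessions Waiting: Other" counts all wait events in RAC other than I/O and idle waits.
Two of the AWR top 5 events were:
DFS lock handle: the session is waiting to acquire a global lock handle. Global locks are managed and allocated by the DLM (Distributed Lock Manager). According to the author, this wait event means global lock handle resources are running short; the governing parameter is _lm_locks, whose default is 12000 from 9i onward. It suggests that many transactions have acquired locks without committing or rolling back.
enq: US - contention: transactions are queuing for an UNDO segment, usually because UNDO space is insufficient.
First, find the objects that these two wait events are waiting on: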
select * from dba_objects where object_id in (select current_obj# from dba_hist_active_sess_history where event = 'DFS lock handle' and snap_id between *** and ***)
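The objects for the two wait events turned out to be largely the same.
UNDO might be short because undo_retention is too long and retention is set to GUARANTEE. Checking with select retention, tablespace_name from dba_tablespaces where tablespace_name like 'UNDO%' showed that GUARANTEE was not set.
Next, look for the transactions consuming UNDO; there was only one transaction.
So check the actual UNDO usage: select * from dba_segments where segment_type like '%UNDO%'. The result showed one rollback segment, _SYSSMU69$, occupying nearly 20 GB. Check the status of that segment's extents:
select status, count(*) from dba_undo_extents where segment_name = '_SYSSMU69$' group by status; All were ACTIVE, which means a transaction is using every extent, yet no such transaction could be found. The cause was that a transaction using this rollback segment had been terminated abnormally: the session was first killed with kill session, but because that still rolls back the uncommitted transaction, the process was then killed directly at the OS level.
Because the rollback segment is still ONLINE and all of its extents are ACTIVE, it cannot be dropped or shrunk. Solutions:
1. Restart the instance to reset the rollback segment;
2. Add a new undo tablespace so that other transactions can run normally, then kill the sessions that are completely hung from waiting; the system returns to normal.
Summary: the author found two wait events in the AWR top 5 and judged US contention to be the culprit. He first checked the undo_retention/guarantee settings, then looked at UNDO tablespace usage, found one very large undo segment in dba_segments, queried dba_undo_extents to confirm the status of its extents, and finally contacted the developers to find the cause.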
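The source does not show the query used to find the transactions consuming UNDO. A minimal sketch, assuming the standard v$transaction, v$session and v$rollname views are accessible, might look like this; the column aliases are illustrative only.

```sql
-- Active transactions and the undo they hold, largest first.
-- The outer join keeps transactions that have no matching session row.
select s.sid,
       s.serial#,
       s.username,
       r.name       as undo_segment,
       t.start_time,
       t.status,
       t.used_ublk  as undo_blocks,
       t.used_urec  as undo_records
  from v$transaction t
  join v$rollname r on r.usn = t.xidusn
  left join v$session s on s.taddr = t.addr
 order by t.used_ublk desc;
```

A segment that looks fully ACTIVE while this query shows no matching large transaction is the kind of mismatch the case describes.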
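Another case, source: http://www.itpub.net/redirect.php?tid=1269096&goto=lastpost
RAC showing heavy row cache lock + US contention waits
The former was caused by a sequence defined as NOCACHE and improved once that was changed. For the latter, undo was suspected: the sum of ACTIVE + UNEXPIRED extents was checked directly and found to be close to the size of the undo tablespace, so the undo tablespace was temporarily enlarged and the impdp process consuming the most undo was killed.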
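The ACTIVE + UNEXPIRED check can be sketched as follows. This is not the query from the original thread, only one way to get the same numbers from dba_undo_extents, dba_data_files and dba_tablespaces.

```sql
-- Undo space by extent status for each undo tablespace.
select tablespace_name,
       status,
       round(sum(bytes) / 1024 / 1024) as mb
  from dba_undo_extents
 group by tablespace_name, status
 order by tablespace_name, status;

-- Allocated size of each undo tablespace, for comparison.
select f.tablespace_name,
       round(sum(f.bytes) / 1024 / 1024) as allocated_mb
  from dba_data_files f
  join dba_tablespaces t on t.tablespace_name = f.tablespace_name
 where t.contents = 'UNDO'
 group by f.tablespace_name;
```

If the ACTIVE and UNEXPIRED rows together approach allocated_mb, there is little EXPIRED or free space left for new transactions to use.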
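The third case comes from eagle fan: http://www.dbafan.com/blog/?p=170
Version 10.2.0.3
Under AUM (automatic undo management), some undo segments are taken offline when the system is not busy and brought back online when more are needed. When the system is very busy, onlining or resizing can run into problems; event 10511 addresses this.
At the time, however, the system was not busy.
The author noticed unxpblkreucnt in v$undostat: Number of unexpired undo blocks reused by transactions
This column was nonzero. Normally, unexpired undo extent stealing occurs only when undo is too small to hold the data for the undo_retention period.
But this was not a peak period. The author then noticed the tuned_undoretention column in v$undostat: from 10.2 onward, Oracle tunes undo retention automatically by default, adjusting the effective retention according to the undo size and how busy the system is.
The database had been restarted the day before the problem. Because it was idle right after startup, tuned_undoretention was very large and undo filled up. The value kept falling, but not fast enough to keep pace with the system warming up, and the database ran into trouble.
Setting _undo_autotune to false turned the automatic tuning off.
Summary: US contention appeared while the system was not busy. The author noticed the unxpblkreucnt column in v$undostat and inferred that the enqueue was caused by a feature introduced in Oracle 10.2, then resolved it by disabling that feature with a hidden parameter.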
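To look at the same columns the author relied on, a query along these lines could be used. It is a sketch rather than the query from the blog post, and it assumes a 10.2 or later instance where v$undostat exposes tuned_undoretention.

```sql
alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

-- One row per 10-minute interval, newest first.
select begin_time,
       end_time,
       undoblks,
       txncount,
       maxquerylen,
       tuned_undoretention,
       unxpstealcnt,
       unxpblkreucnt
  from v$undostat
 order by begin_time desc;

-- Configured retention, to compare against tuned_undoretention.
select value as undo_retention_seconds
  from v$parameter
 where name = 'undo_retention';
```

A tuned_undoretention far above the configured undo_retention, combined with nonzero unxpblkreucnt, matches the pattern described in this case.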
How to correct performance issues with enq: US - contention related to undo segments [ID 1332738.1]
Purpose
Assist in correcting performance issues related to "enq: US - contention" on undo segments.
You have many offline undo segments and the workload starts to online many undo segments over a short period of time. This can lead to high 'latch: row cache objects' contention on dc_rollback_segments together with high 'enq: US - contention' waits when using system managed undo with an auto tuned undo retention period.
Sessions attempting to online undo segments should show ktusmous_online_undoseg() in their call stack.
Another aspect of the problem can be due to long running queries, which can raise tuned_undoretention to very high values and exhaust the undo tablespace, resulting in ORA-1628.
A real world case:
A query is being executed and some rows are fetched from the cursor and then the user stops working on that query (e.g. does not press the "next" button on the application screen) and works on something else (e.g. in a different window). After some time the user continues working on the query ... auto-tune starts tracking the query from this point and the maxquerylen is quite large now, hence also the tuned_undoretention (that depends directly on the maxquerylen).
NOTE: The Siebel application can allow for this problem to happen.
June 24, 2011
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
The wait event "enq: US - contention" is associated with contention on the latch in the row cache (dc_rollback_segments). It can become a performance bottleneck if the workload dictates that a lot of offlined undo segments must be onlined over a short period of time. The latch on the row cache can be unable to keep up with the workload.
This can happen for a number of reasons and some scenarios are legitimate workload demands.
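The note does not include a query for this, but the number of online versus offline undo segments and the activity on the dc_rollback_segments row cache can be checked with something like the following sketch, which uses the standard dba_rollback_segs and v$rowcache views.

```sql
-- How many undo segments are ONLINE vs OFFLINE in each tablespace.
select tablespace_name,
       status,
       count(*) as segments
  from dba_rollback_segs
 group by tablespace_name, status
 order by tablespace_name, status;

-- Activity on the row cache entry for rollback segments.
select parameter,
       gets,
       getmisses,
       modifications
  from v$rowcache
 where parameter = 'dc_rollback_segments';
```

Running the first query before and during a spike gives a rough view of how many segments are being onlined in a short window.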
Solution:
Ensure that peaks in onlined undo segments do not happen (see workaround #2). That is not always feasible.
Workarounds:
1. Bounce the instance.
2. Set _rollback_segment_count to a high number to keep undo segments online.
alter system set "_rollback_segment_count"=;
3. Set _undo_autotune to false
alter system set "_undo_autotune" = false;
NOTE: Simply using _smu_debug_mode=33554432 may not be enough to stop the problem, although it is a valid fix for bug 5387030.
4. A fix to bug 7291739 is to set a new hidden parameter, _highthreshold_undoretention to set a high threshold for undo retention completely distinct from maxquerylen.
alter system set "_highthreshold_undoretention"=;
If problems persist, please file a Service Request with Oracle Support.
@ Diagnosis
@
@ Should the workarounds and/or configuration changes not help to alleviate the problems,
@ development would need the following diagnostics data:
@
@ a. Provide alert.log which shows the last instance startup parameters through the time of the
@ latest issues.
@
@ b. AWR and/or ASH report of 30 or 60 minutes interval.
@
@ c. Following query output:
@
@ alter session set nls_date_format='mm/dd/yy hh24:mi:ss';
@ select begin_time, MAXQUERYID, MAXQUERYLEN from v$undostat;
@
@ d. While the error is ongoing:
@
@ On single instance:
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug hanganalyze 3
@ oradebug dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug dump systemstate 266
@
@ wait for 2 minutes
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug hanganalyze 3
@ oradebug dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug dump systemstate 266
@
@ On RAC get tracing on all nodes
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug -g all hanganalyze 3
@ oradebug -g all dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug -g all dump systemstate 266
@
@ wait for 2 minutes
@
@ sqlplus / as sysdba
@ oradebug setmypid
@ oradebug unlimit
@ oradebug -g all hanganalyze 3
@ oradebug -g all dump systemstate 266
@
@ wait for 5 seconds
@
@ oradebug -g all dump systemstate 266