测试环境的数据库负载情况
Load Profile
Per Second | Per Transaction | Per Exec | Per Call | |
---|---|---|---|---|
DB Time(s): | 98.2 | 2.2 | 0.73 | 0.67 |
DB CPU(s): | 6.4 | 0.1 | 0.05 | 0.04 |
Redo size: | 57,163,094.1 | 1,277,840.0 | ||
Logical reads: | 395,710.8 | 8,845.8 | ||
Block changes: | 279,863.1 | 6,256.1 | ||
Physical reads: | 1,931.7 | 43.2 | ||
Physical writes: | 8,592.7 | 192.1 | ||
User calls: | 145.8 | 3.3 | ||
Parses: | 67.4 | 1.5 | ||
Hard parses: | 0.8 | 0.0 | ||
W/A MB processed: | 0.1 | 0.0 | ||
Logons: | 0.8 | 0.0 | ||
Executes: | 133.9 | 3.0 | ||
Rollbacks: | 0.0 | 0.0 | ||
Transactions: | 44.7 |
生产环境:
Load Profile
Per Second | Per Transaction | Per Exec | Per Call | |
---|---|---|---|---|
DB Time(s): | 149.1 | 5.8 | 2.22 | 1.44 |
DB CPU(s): | 2.0 | 0.1 | 0.03 | 0.02 |
Redo size: | 22,033,012.0 | 863,923.6 | ||
Logical reads: | 143,043.3 | 5,608.8 | ||
Block changes: | 89,599.3 | 3,513.2 | ||
Physical reads: | 355.4 | 13.9 | ||
Physical writes: | 3,529.4 | 138.4 | ||
User calls: | 103.3 | 4.1 | ||
Parses: | 38.6 | 1.5 | ||
Hard parses: | 0.1 | 0.0 | ||
W/A MB processed: | 0.8 | 0.0 | ||
Logons: | 0.8 | 0.0 | ||
Executes: | 67.1 | 2.6 | ||
Rollbacks: | 0.0 | 0.0 | ||
Transactions: | 25.5 |
同样的数据加载线程(150个并行,采用sqlldr parllel=true),在测试环境中有57M每秒的速度,但是在生产环境中却少了一倍多。
来看看实例的工作情况:
测试环境:
Buffer Nowait %: | 99.76 | Redo NoWait %: | 99.98 |
Buffer Hit %: | 99.51 | In-memory Sort %: | 100.00 |
Library Hit %: | 98.31 | Soft Parse %: | 98.80 |
Execute to Parse %: | 49.66 | Latch Hit %: | 97.52 |
Parse CPU to Parse Elapsd %: | 70.81 | % Non-Parse CPU: | 99.92 |
生产环境:
Instance Efficiency Percentages (Target 100%)
Buffer Nowait %: | 99.60 | Redo NoWait %: | 99.64 |
Buffer Hit %: | 99.76 | In-memory Sort %: | 99.99 |
Library Hit %: | 99.76 | Soft Parse %: | 99.84 |
Execute to Parse %: | 42.46 | Latch Hit %: | 96.21 |
Parse CPU to Parse Elapsd %: | 14.27 | % Non-Parse CPU: | 99.92 |
通过上面的指标可以看到,cpu的解析占用的时间不多,
根据那天数据迁移时的监控,发现cpu的使用率怎么也没上去,对于个别比较大的分区表,耗费的缓存也大的惊人,180G的内存,除去sga,系统使用的,剩下的基本都被耗光了。只剩下400M左右,如果表小一些,空余内存马上会回复到100G。
undo的使用情况过高,可以从alert中一条信息加以验证。这条sql语句执行了近2个多小时,Undo都被耗光了。
Fri Jun 27 06:04:08 2014
ORA-01555 caused by SQL statement below (SQL ID: 7wx3cgjqsmnn4, Query Duration=5616 sec, SCN: 0x0a08.4205da20):
SELECT "A3"."SUBSCRIBER_NO","A3"."CUSTOMER_BAN",SUBSTR("A2"."TITLE"||' '||"A2"."FIRST_NAME",1,1500),.........
.......
"SYS_CREATION_DATE">=SYSDATE@!-5) ....
如果cpu的使用率不高,肯定是有什么等待事件。来看看等待事件。
测试环境:
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
log buffer space | 3,074,452 | 177,382 | 58 | 50.41 | Configuration |
db file sequential read | 6,754,182 | 49,729 | 7 | 14.13 | User I/O |
log file sync | 206,372 | 26,221 | 127 | 7.45 | Commit |
DB CPU | 22,823 | 6.49 | |||
buffer busy waits | 2,041,520 | 21,531 | 11 | 6.12 | Concurrency |
生产环境:
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
free buffer waits | 71,094 | 121,336 | 1707 | 22.82 | Configuration |
buffer busy waits | 1,349,891 | 83,772 | 62 | 15.76 | Concurrency |
log buffer space | 738,424 | 76,090 | 103 | 14.31 | Configuration |
log file switch (checkpoint incomplete) | 6,231 | 57,456 | 9221 | 10.81 | Configuration |
enq: TX - index contention | 109,479 | 57,413 | 524 | 10.80 | Concurrency |
可以看到等待事件也有很大的不同。
如果不太确定第一个等待事件“free buffer waits"主要代表什么,可以从addm里面得到更多的信息。
Finding 3: Free Buffer Waits
Impact is 44.79 active sessions, 30.05% of total activity.
----------------------------------------------------------
Database writers (DBWR) were unable to keep up with the demand for free
buffers.
Recommendation 1: Database Configuration
Estimated benefit is 44.79 active sessions, 30.05% of total activity.
---------------------------------------------------------------------
Action
Consider increasing the number of database writers (DBWR) by setting the
parameter "db_writer_processes". Also consider if asynchronous I/O is
appropriate for your architecture.
Rationale
The value of parameter "db_writer_processes" was "4" during the analysis
period.
Rationale
The value of parameter "disk_asynch_io" was "TRUE" during the analysis
period.
看到这,我就恍然大悟了,测试环境和生产环境有一个参数是不一样的,在生产环境中修改,启用异步IO需要重启实例,但是在生产数据迁移之前,时间已经很紧张了,所以计划把这个参数变更推迟到第二阶段,没想到有这么大的影响。
对应的参数有file_system_io需要设置为setall. disk_asynch_io=TRUE 第二个选项默认就是TRUE.
有了这个思路,再去看其他的影响因素,有些是关联的,有些影响比例很小。
所以总结这次数据迁移的问题,主要有两个主要原因导致性能下降,
一个是有外部的用户在访问数据库,他们在不间断的做一些查询,有个别查询还比较大,耗费了大量的undo资源。大概占40%以上的因素
一个原因是因为数据库的异步IO设置的问题。大概占50%以上的因素。