集中客户系统在2016年11月8日17点业务反馈系统链接不上,是不是有问题。当即查看系统
[oracle@SIDDB01 hwbackup3]$crs_stat -t -v
Name Type R/RA F/FT Target State Host
----------------------------------------------------------------------
ora....D1.inst application 1/5 0/0 ONLINE ONLINE siddb01
ora....D2.inst application 0/5 0/0 ONLINE ONLINE esbdb01
ora....ID2.srv application 0/0 0/0 ONLINE ONLINE esbdb01
ora....Bsrv.cs application 0/0 0/1 ONLINE ONLINE esbdb01
ora....ID1.srv application 0/0 0/0 ONLINE ONLINE siddb01
ora....Dsrv.cs application 0/0 0/1 ONLINE ONLINE siddb01
ora.ESBSID.db application 0/0 0/1 ONLINE ONLINE siddb01
ora....01.lsnr application 0/5 0/0 ONLINE ONLINE esbdb01
ora....b01.gsd application 2/5 0/0 ONLINE OFFLINE
ora....b01.ons application 0/3 0/0 ONLINE ONLINE esbdb01
ora....b01.vip application 0/0 0/0 ONLINE ONLINE esbdb01
ora....01.lsnr application 0/5 0/0 ONLINE ONLINE siddb01
ora....b01.gsd application 5/5 0/0 ONLINE OFFLINE
ora....b01.ons application 1/3 0/0 ONLINE ONLINE siddb01
ora....b01.vip application 0/0 0/0 ONLINE ONLINE siddb01
因为监控5分钟自动采集还没来得及嗅探,数据库自启动完成,所以运维人员都不知道数据库发生了什么事情。查看一下CRS日志情况,看看是否有切换动作;
2016-11-02 16:26:21.226: [ CRSAPP][2990760] CheckResource error for ora.siddb01.vip error code = -1
2016-11-08 17:08:15.846: [ CRSEVT][3027957] CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2016-11-08 17:08:25.193: [ CRSEVT][3027957] CAAMonitorHandler :: 0:Action Script /oracle/crs/bin/racgwrap(check) timed out for ora.siddb01.vip! (timeout=60
)
2016-11-08 17:08:25.194: [ CRSAPP][3027957] CheckResource error for ora.siddb01.vip error code = -2
2016-11-08 17:09:53.467: [ CRSEVT][3027958] CAAMonitorHandler :: 0:Could not join /oracle/product/10.2.0/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2016-11-08 17:10:07.582: [ CRSAPP][3027958] CheckResource error for ora.siddb01.LISTENER_SIDDB01.lsnr error code = -1
2016-11-08 17:10:14.102: [ CRSEVT][3027962] CAAMonitorHandler :: 0:Could not join /oracle/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
2016-11-08 17:10:19.909: [ CRSAPP][3027962] CheckResource error for ora.ESBSID.db error code = -1
2016-11-08 17:10:39.084: [ CRSRES][3027965] In stateChanged, ora.ESBSID.ESBSID1.inst target is ONLINE
2016-11-08 17:10:41.191: [ CRSRES][3027965] ora.ESBSID.ESBSID1.inst on siddb01 went OFFLINE unexpectedly
2016-11-08 17:10:43.326: [ CRSRES][3027965] StopResource: setting CLI values
2016-11-08 17:10:43.723: [ CRSRES][3027965] Attempting to stop `ora.ESBSID.ESBSID1.inst` on member `siddb01`
2016-11-08 17:11:09.197: [ CRSRES][3027977] In stateChanged, ora.ESBSID.SIDsrv.ESBSID1.srv target is ONLINE
2016-11-08 17:11:09.198: [ CRSRES][3027977] ora.ESBSID.SIDsrv.ESBSID1.srv on siddb01 went OFFLINE unexpectedly
2016-11-08 17:11:09.198: [ CRSRES][3027977] StopResource: setting CLI values
2016-11-08 17:11:09.885: [ CRSRES][3027977] Attempting to stop `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `siddb01`
2016-11-08 17:11:10.543: [ CRSRES][3027977] Stop of `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `siddb01` succeeded.
2016-11-08 17:11:10.614: [ CRSRES][3027977] ora.ESBSID.SIDsrv.ESBSID1.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0
2016-11-08 17:11:10.730: [ CRSRES][3027977] ora.ESBSID.SIDsrv.ESBSID1.srv failed on siddb01 relocating.
2016-11-08 17:11:13.721: [ CRSRES][3027977] Attempting to start `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `esbdb01`
2016-11-08 17:11:26.264: [ CRSAPP][3027988] CheckResource error for ora.ESBSID.SIDsrv.cs error code = 1
2016-11-08 17:11:26.267: [ CRSRES][3027988] In stateChanged, ora.ESBSID.SIDsrv.cs target is ONLINE
2016-11-08 17:11:26.267: [ CRSRES][3027988] ora.ESBSID.SIDsrv.cs on siddb01 went OFFLINE unexpectedly
2016-11-08 17:11:26.268: [ CRSRES][3027988] StopResource: setting CLI values
2016-11-08 17:11:26.300: [ CRSRES][3027988] Attempting to stop `ora.ESBSID.SIDsrv.cs` on member `siddb01`
2016-11-08 17:11:26.564: [ CRSRES][3027988] Stop of `ora.ESBSID.SIDsrv.cs` on member `siddb01` succeeded.
2016-11-08 17:11:26.564: [ CRSRES][3027988] ora.ESBSID.SIDsrv.cs RESTART_COUNT=0 RESTART_ATTEMPTS=0
2016-11-08 17:11:27.639: [ CRSRES][3027988] ora.ESBSID.SIDsrv.cs exceeded it's failure threshold. Stopping it and its dependents!
2016-11-08 17:11:27.656: [ CRSRES][3027988] ora.ESBSID.SIDsrv.cs target set to OFFLINE before stop action
2016-11-08 17:11:27.656: [ CRSRES][3027988] StopResource: setting CLI values
2016-11-08 17:11:27.665: [ CRSRES][3027988] Target set to OFFLINE for `ora.ESBSID.SIDsrv.cs`
2016-11-08 17:11:28.419: [ CRSRES][3027965] Stop of `ora.ESBSID.ESBSID1.inst` on member `siddb01` succeeded.
2016-11-08 17:11:28.420: [ CRSRES][3027965] ora.ESBSID.ESBSID1.inst RESTART_COUNT=1 RESTART_ATTEMPTS=5
2016-11-08 17:11:28.420: [ CRSRES][3027965] ora.ESBSID.ESBSID1.inst Uptime exceeds uptime_threshold, resetting RC
2016-11-08 17:11:28.420: [ CRSRES][3027965] Restarting ora.ESBSID.ESBSID1.inst on siddb01
2016-11-08 17:11:28.426: [ CRSRES][3027965] startRunnable: setting CLI values
2016-11-08 17:11:28.426: [ CRSRES][3027965] Attempting to start `ora.ESBSID.ESBSID1.inst` on member `siddb01`
2016-11-08 17:11:34.766: [ CRSRES][3027977] Start of `ora.ESBSID.SIDsrv.ESBSID1.srv` on member `esbdb01` succeeded.
2016-11-08 17:11:46.009: [ CRSRES][3027999] startRunnable: setting CLI values
2016-11-08 17:12:22.967: [ CRSRES][3027965] Start of `ora.ESBSID.ESBSID1.inst` on member `siddb01` succeeded.
2016-11-08 17:12:22.968: [ CRSRES][3027965] Successfully restarted ora.ESBSID.ESBSID1.inst on siddb01, RESTART_COUNT=1
2016-11-08 17:12:22.997: [ CRSRES][3027965] ora.ESBSID.ESBSID1.inst Updated LAST_RESTART time in ocr
打算从会话连接数入手,找到连接数超高时间点,然后以此来查询当时数据库等待事件,然后找到对应的SQL进行分析;
INSTANCE_NUMBER TO_CHAR(TRUNC(SAMPLE_TIME,'MI'),'YYYY-MM-DDHH24:MI') COUNT(*)
--------------- --------------------------------------------------------------------------- ----------
1 2016-11-08 16:55 270
1 2016-11-08 16:56 951
1 2016-11-08 16:57 1917
1 2016-11-08 16:58 2000
1 2016-11-08 16:59 1152
1 2016-11-08 17:12 379
1 2016-11-08 17:13 560
1 2016-11-08 17:14 531
1 2016-11-08 17:15 458
1 2016-11-08 17:16 125
1 2016-11-08 17:17 106
查看一下当时时间点的等待事件
SQL> SELECT event_id, event, COUNT (*) cnt
2 FROM dba_hist_active_sess_history
3 WHERE snap_id between 64368 and 64369 AND wait_class_id = 1740759767
4 GROUP BY event_id, event ORDER BY 3 desc;
EVENT_ID EVENT CNT
---------- ---------------------------------------------------------------- ----------
38438084 direct path write temp 2867
3926164927 direct path read 2326
2652584166 db file sequential read 1710
861319509 direct path read temp 616
3056446529 read by other session 604
506183215 db file scattered read 572
834992820 db file parallel read 29
885859547 direct path write 18
SQL> SELECT sql_id, COUNT (*) cnt
2 FROM dba_hist_active_sess_history
3 WHERE snap_id between 64368 and 64369
4 AND event_id IN(38438084)GROUP BY sql_id
5 HAVING COUNT (*) > 1
6 ORDER BY 2 desc;
SQL_ID CNT
------------- ----------
a9k9pzsnh5s9n 575
gjv7gx73q07c6 258
7zbvg66aw143a 227
01spkazwnt63u 206
dt4xb2ausdy1z 187
768krsak8tg2d 185
4mzyj9vbyypc8 148
d1j0r7w8hwrp9 138
du52p4c2735y2 123
1wczkxv3znt0h 107
b8sj2zpc6n4kh 103
4q08ghwrssm09 99
ggurqxjuqpca0 84
67ram5177zdgf 75
SQL> SELECT sql_id, COUNT (*) cnt
2 FROM dba_hist_active_sess_history
3 WHERE snap_id between 64368 and 64369
4 AND event_id IN(3926164927)GROUP BY sql_id
5 HAVING COUNT (*) > 1
6 ORDER BY 2 desc;
SQL_ID CNT
------------- ----------
b8sj2zpc6n4kh 638
768krsak8tg2d 607
67ram5177zdgf 315
6varm0b41sfxd 164
4q08ghwrssm09 109
01spkazwnt63u 64
g9x112cs2dr4h 64
1,大量的磁盘排序操作,order by, group by, union, distinct,无法在PGA中完成排序,需要利用temp表空间进行排序,当从临时表空间中读取排序结果时,会产生direct path read.
2. 大量的Hash Join操作,利用temp表空间保存hash区。
3. SQL语句的并行处理
4. 大表的全表扫描
上面列举的SQL全都是通过SQL*PLUS连接进来执行的语句也都是distinct\parallel 20
系统内当时也有很多的direct path read、db file sequential read等待事件;
在排序操作(order by/group by/union/distinct/rollup/合并连接)时,由于PGA中的SORT_AREA_SIZE空间不足,造成需要使用临时表空间来保存中间结果,当从临时表空间读入排序结果时,产生direct path read等待事件。
使用HASH连接的SQL语句,将不适合位于内存中的散列分区刷新到临时表空间中。为了查明匹配SQL谓词的行,临时表空间中的散列分区被读回到内存中(目的是为了查明匹配SQL谓词的行),ORALCE会话在direct path read等待事件上等待。
使用并行扫描的SQL语句也会影响系统范围的direct path read等待事件。在并行执行过程中,direct path read等待事件与从属查询有关,而与父查询无关,运行父查询的会话基本上会在PX Deq:Execute Reply上等待,从属查询会产生direct path read等待事件我们来验证一下这些语句的执行计划;
那么有朋友想问了,ogg重启和他有什么关系?这是我们这个系统专门写的脚本,进行判断,为了就是怕将归档日志给删除,倒是ogg进程异常停止,没想到那段时间的归档日志已经被脚本转移到其他目录之下,这样ogg的抽取进程就报了如下错误;
2016-11-08 19:10:15 ERROR OGG-00446 Could not find archived log for sequence 144299 thread 1 under default destinations SQL <SELECT name FROM v$arch
ived_log WHERE sequence# = :ora_seq_no AND thread# = :ora_thread AND resetlogs_id = :ora_resetlog_id AND archived = 'YES' AND
deleted = 'NO' AND name not like '+%' AND standby_dest = 'NO' >, error retrieving redo file name for sequence 144299, archived = 1, use_
alternate = 0Not able to establish initial position for sequence 144299, rba 243031568.
OGG-00446也指明了Could not find archived log for sequence 144299 thread 1,我们将故障点以后的ogg进行移回归档目录操作。
因为脚本是用root跑的,权限也都有所改变。再将权限重新赋给oracle用户和它的属组
root@SIDDB01 # cd /oracle/oradata/archlog/sidarchlog/
root@SIDDB01 # ls -ltar
total 8177360
drwxr-xr-x 6 oracle dba 8192 Jan 6 2015 ..
-r--r----- 1 oracle dba 759247872 Jul 7 2015 2_152214_784815576.dbf
-r--r----- 1 oracle dba 1024 Jul 7 2015 2_152215_784815576.dbf
-r--r----- 1 oracle dba 1024 Jul 7 2015 2_152216_784815576.dbf
-r--r----- 1 oracle dba 1024 Jul 7 2015 2_152217_784815576.dbf
-r--r----- 1 oracle dba 311396864 Nov 8 17:20 1_144306_784815576.dbf
-r--r----- 1 oracle dba 716086272 Nov 8 17:34 1_144307_784815576.dbf
-r--r----- 1 oracle dba 187094016 Nov 8 17:49 1_144308_784815576.dbf
-r--r----- 1 oracle dba 160567296 Nov 8 18:01 1_144309_784815576.dbf
-r--r----- 1 oracle dba 283110912 Nov 8 18:21 1_144310_784815576.dbf
-r--r----- 1 oracle dba 316848640 Nov 8 18:47 1_144311_784815576.dbf
-r--r----- 1 oracle dba 194941952 Nov 8 19:03 1_144312_784815576.dbf
-r--r----- 1 root root 279188480 Nov 8 19:31 1_144299_784815576.dbf
-r--r----- 1 root root 242944512 Nov 8 19:31 1_144300_784815576.dbf
drwxr-xr-x 2 oracle dba 16384 Nov 8 19:39 .
-rw-r--r-- 1 oracle dba 391 Nov 8 19:40 ggserr.log
-r--r----- 1 root root 167494144 Nov 8 19:45 1_144301_784815576.dbf
-r--r----- 1 root root 199065088 Nov 8 19:46 1_144302_784815576.dbf
-r--r----- 1 root root 271761408 Nov 8 19:47 1_144303_784815576.dbf
-r--r----- 1 root root 96945664 Nov 8 19:47 1_144304_784815576.dbf
-r--r----- 1 root root 1024 Nov 8 19:47 1_144305_784815576.dbf
root@SIDDB01 # chown oracle:dba 1_144*
root@SIDDB01 # ls -ltar
total 8177360
drwxr-xr-x 6 oracle dba 8192 Jan 6 2015 ..
-r--r----- 1 oracle dba 759247872 Jul 7 2015 2_152214_784815576.dbf
-r--r----- 1 oracle dba 1024 Jul 7 2015 2_152215_784815576.dbf
-r--r----- 1 oracle dba 1024 Jul 7 2015 2_152216_784815576.dbf
-r--r----- 1 oracle dba 1024 Jul 7 2015 2_152217_784815576.dbf
-r--r----- 1 oracle dba 311396864 Nov 8 17:20 1_144306_784815576.dbf
-r--r----- 1 oracle dba 716086272 Nov 8 17:34 1_144307_784815576.dbf
-r--r----- 1 oracle dba 187094016 Nov 8 17:49 1_144308_784815576.dbf
-r--r----- 1 oracle dba 160567296 Nov 8 18:01 1_144309_784815576.dbf
-r--r----- 1 oracle dba 283110912 Nov 8 18:21 1_144310_784815576.dbf
-r--r----- 1 oracle dba 316848640 Nov 8 18:47 1_144311_784815576.dbf
-r--r----- 1 oracle dba 194941952 Nov 8 19:03 1_144312_784815576.dbf
-r--r----- 1 oracle dba 279188480 Nov 8 19:31 1_144299_784815576.dbf
-r--r----- 1 oracle dba 242944512 Nov 8 19:31 1_144300_784815576.dbf
drwxr-xr-x 2 oracle dba 16384 Nov 8 19:39 .
-rw-r--r-- 1 oracle dba 391 Nov 8 19:40 ggserr.log
-r--r----- 1 oracle dba 167494144 Nov 8 19:45 1_144301_784815576.dbf
-r--r----- 1 oracle dba 199065088 Nov 8 19:46 1_144302_784815576.dbf
-r--r----- 1 oracle dba 271761408 Nov 8 19:47 1_144303_784815576.dbf
-r--r----- 1 oracle dba 96945664 Nov 8 19:47 1_144304_784815576.dbf
-r--r----- 1 oracle dba 1024 Nov 8 19:47 1_144305_784815576.dbf
然后我们在重新启动extract进程
GGSCI (SIDDB01) 17> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
JAGENT STOPPED
EXTRACT RUNNING DP01 00:00:00 00:00:03
EXTRACT RUNNING DP02 01:10:28 00:00:05
EXTRACT RUNNING DP03 00:00:00 00:00:10
EXTRACT RUNNING DP04 01:11:41 00:00:06
EXTRACT RUNNING EXT1 00:00:01 00:00:08
EXTRACT RUNNING EXT2 01:10:37 00:00:07
EXTRACT RUNNING EXT3 00:00:01 00:00:09
EXTRACT RUNNING EXT4 01:11:31 00:00:02
进程启动正常,view report ext4也没有报错。剩下的交给时间慢慢抽取,感谢大家。
本文转自yangjunfeng 51CTO博客,原文链接:http://blog.51cto.com/yangjunfeng/1871100