今天收到报警邮件,提示在短时间内DB time有了很大的抖动。报警邮件如下:
ZABBIX-监控系统:
而紧接着CPU负载也开始急剧飙升,直接的反应就是机器反应开始非常慢。根据top得到的进程,可以看到cpu资源已经被耗光了。
这个从sar的结果也可以看得很明显。
11:30:02 AM CPU %user %nice %system %iowait %steal %idle
11:31:02 AM all 95.86 0.00 0.36 0.00 0.00 3.78
11:32:02 AM all 94.76 0.00 0.25 0.00 0.00 4.99
11:33:01 AM all 94.53 0.00 0.17 0.00 0.00 5.30
11:34:01 AM all 95.83 0.00 0.19 0.00 0.00 3.98
11:35:01 AM all 95.18 0.00 0.17 0.00 0.00 4.66
11:36:01 AM all 96.47 0.00 0.96 0.02 0.00 2.55
11:37:01 AM all 97.50 0.00 0.15 0.08 0.00 2.28
11:38:01 AM all 96.08 0.00 0.17 0.00 0.00 3.75
11:39:01 AM all 95.74 0.00 0.17 0.01 0.00 4.08
11:40:01 AM all 96.24 0.00 0.17 0.00 0.00 3.58
11:41:01 AM all 94.68 0.00 0.34 0.00 0.00 4.98
11:42:01 AM all 97.30 0.00 0.15 0.00 0.00 2.54
11:43:01 AM all 96.51 0.00 0.18 0.00 0.00 3.31
11:44:01 AM all 96.14 0.00 0.14 0.00 0.00 3.71
11:45:01 AM all 95.52 0.00 0.20 0.01 0.00 4.26
11:46:01 AM all 96.67 0.00 0.94 0.01 0.00 2.38
11:47:01 AM all 96.23 0.00 0.16 0.01 0.00 3.61
11:48:01 AM all 96.59 0.00 0.16 0.00 0.00 3.25
11:49:01 AM all 96.40 0.00 0.18 0.00 0.00 3.41
11:50:02 AM all 96.15 0.00 0.21 0.00 0.00 3.64
11:51:01 AM all 96.72 0.00 0.59 0.04 0.00 2.64
11:52:01 AM all 96.51 0.00 0.16 0.00 0.00 3.32
11:53:01 AM all 94.57 0.00 0.20 0.01 0.00 5.22
11:54:01 AM all 93.82 0.00 0.14 0.00 0.00 6.04
11:55:01 AM all 94.04 0.00 0.14 0.01 0.00 5.81
11:56:01 AM all 96.01 0.00 0.85 0.00 0.00 3.14
11:57:01 AM all 95.66 0.00 0.15 0.01 0.00 4.17
11:58:01 AM all 94.93 0.00 0.14 0.00 0.00 4.93
11:59:01 AM all 97.15 0.00 0.16 0.01 0.00 2.68
12:00:01 PM all 99.76 0.00 0.23 0.00 0.00 0.01
12:01:01 PM all 99.45 0.00 0.55 0.00 0.00 0.01
12:02:01 PM all 99.76 0.00 0.24 0.00 0.00 0.00
12:03:02 PM all 99.60 0.00 0.40 0.00 0.00 0.00
12:04:02 PM all 99.76 0.00 0.23 0.00 0.00 0.00
12:05:01 PM all 99.77 0.00 0.23 0.00 0.00 0.00
12:06:05 PM all 99.37 0.00 0.62 0.00 0.00 0.01
12:07:02 PM all 99.59 0.00 0.40 0.00 0.00 0.00
12:08:03 PM all 99.57 0.00 0.43 0.00 0.00 0.00
12:09:02 PM all 99.66 0.00 0.34 0.00 0.00 0.00
12:10:01 PM all 99.80 0.00 0.20 0.00 0.00 0.01
12:11:15 PM all 99.70 0.00 0.30 0.00 0.00 0.00
12:12:07 PM all 99.87 0.00 0.12 0.00 0.00 0.00
12:13:03 PM all 99.74 0.00 0.26 0.00 0.00 0.00
12:14:02 PM all 99.79 0.00 0.21 0.00 0.00 0.00
12:15:02 PM all 99.85 0.00 0.14 0.00 0.00 0.01
12:16:04 PM all 99.58 0.00 0.42 0.00 0.00 0.00
所以问题其实变得还是挺着急的。这种负载情况下,看似问题还在恶化,着实让人捏了一把汗。
top的情况如下:
top - 13:34:40 up 469 days, 20:09, 6 users, load average: 44.01, 42.83, 43.48
Tasks: 515 total, 48 running, 453 sleeping, 0 stopped, 14 zombie
Cpu(s): 99.9%us, 0.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32946232k total, 32753852k used, 192380k free, 209652k buffers
Swap: 16771776k total, 186968k used, 16584808k free, 28973828k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8517 oracle 18 0 10.1g 305m 300m R 64.7 0.9 263:48.43 oracletestdb (LOCAL=NO)
17796 oracle 19 0 10.1g 174m 169m R 56.7 0.5 14:54.71 oracletestdb (LOCAL=NO)
18473 oracle 17 0 10.1g 171m 166m R 39.8 0.5 14:31.95 oracletestdb (LOCAL=NO)
18153 oracle 16 0 10.1g 175m 171m R 38.1 0.5 13:55.82 oracletestdb (LOCAL=NO)
根据top的情况抓取了一个进程,看看这个进程到底在干嘛.比如进程号是16908.
可以看都原来是一个客户端触发的一个select语句导致。
sh showpid.sh 16908 ******************************************* Process has found, pid: 16908 , addr: 00000002DF6199B0 ####### Process Information from OS level as below ######## oracle 16908 1 17 12:00 ? 00:14:48 oracletestdb (LOCAL=NO) ############################################## SID SERIAL# USERNAME OSUSER MACHINE PROCESS TERMINAL TYPE LOGIN_TIME ---------- ---------- --------------- --------------- -------------------- --------------- --------------- ---------- ------------------- 771 41269 TESTAT_NEWBG mbilocalhost 1234 USER 2015-12-07 12:00:05 SQL_ID SQL_TEXT ------------------------------ ------------------------------------------------------------ 7jywxtcmcmcgv select t0.stat_time stat_time,t2.consume_5cnt today,t1.consu me_5cnt yestoday,t3.consume_5cnt lastweek from (select to _char(trunc(sysdate)-level*5/60/24,'hh24:mi') as stat_time f rom dual connect by level <=288 order by stat_time) t0 lef t join (select to_char(a.stat_time,'hh24:mi') stat_ 查看执行计划的情况如下,其实看计划都是走了索引扫描。sh showpid.sh 16908
*******************************************
Process has found, pid: 16908 , addr: 00000002DF6199B0
####### Process Information from OS level as below ########
oracle 16908 1 17 12:00 ? 00:14:48 oracletestdb (LOCAL=NO)
##############################################
SID SERIAL# USERNAME OSUSER MACHINE PROCESS TERMINAL TYPE LOGIN_TIME
---------- ---------- --------------- --------------- -------------------- --------------- --------------- ---------- -------------------
771 41269 TESTAT_NEWBG mbilocalhost 1234 USER 2015-12-07 12:00:05
SQL_ID SQL_TEXT
------------------------------ ------------------------------------------------------------
7jywxtcmcmcgv select t0.stat_time stat_time,t2.consume_5cnt today,t1.consu
me_5cnt yestoday,t3.consume_5cnt lastweek from (select to
_char(trunc(sysdate)-level*5/60/24,'hh24:mi') as stat_time f
rom dual connect by level <=288 order by stat_time) t0 lef
t join (select to_char(a.stat_time,'hh24:mi') stat_
查看执行计划的情况如下,其实看计划都是走了索引扫描。 sh showplan2.sh 7jywxtcmcmcgv
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 7jywxtcmcmcgv, child number 0
-------------------------------------
Plan hash value: 3396463774
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 171 (100)| |
| 1 | SORT ORDER BY | | 3 | 300 | 171 (2)| 00:00:03 |
| 2 | NESTED LOOPS OUTER | | 3 | 300 | 170 (1)| 00:00:03 |
| 3 | NESTED LOOPS OUTER | | 1 | 68 | 8 (0)| 00:00:01 |
| 4 | NESTED LOOPS OUTER | | 1 | 36 | 5 (0)| 00:00:01 |
| 5 | VIEW | | 1 | 4 | 2 (0)| 00:00:01 |
| 6 | CONNECT BY WITHOUT FILTERING| | | | | |
| 7 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 8 | TABLE ACCESS BY INDEX ROWID | DY_USER_ANALYSIS_MIN | 1 | 32 | 3 (0)| 00:00:01 |
|* 9 | INDEX RANGE SCAN | IND_USER_MIN_DATE_GAME | 2 | | 2 (0)| 00:00:01 |
|* 10 | TABLE ACCESS BY INDEX ROWID | DY_USER_ANALYSIS_MIN | 1 | 32 | 3 (0)| 00:00:01 |
|* 11 | INDEX RANGE SCAN | IND_USER_MIN_DATE_GAME | 2 | | 2 (0)| 00:00:01 |
|* 12 | TABLE ACCESS BY INDEX ROWID | DY_USER_ANALYSIS_MIN | 3 | 96 | 162 (1)| 00:00:02 |
|* 13 | INDEX RANGE SCAN | IND_USER_MIN_DATE_GAME | 481 | | 155 (1)| 00:00:02 |
------------------------------------------------------------------------------------------------------------
一边分析一遍生成了一个调优报告。看看dbms_sqltune怎么建议。
shgentunerpt.sh7jywxtcmcmcgvTASK68645GENERALINFORMATIONSECTION−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−TuningTaskName:TASK68645TuningTaskOwner:TESTDBAScope:COMPREHENSIVETimeLimit(seconds):120CompletionStatus:COMPLETEDStartedat:12/07/201513:28:55Completedat:12/07/201513:30:59NumberofIndexFindings:1NumberofErrors:1−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−SchemaName:TESTATNEWBGSQLID:7jywxtcmcmcgv完整的语句如下:SQLText:selectt0.stattimestattime,t2.consume5cnttoday,t1.consume5cntyestoday,t3.consume5cntlastweekfrom(selecttochar(trunc(sysdate)−level∗5/60/24,′hh24:mi′)asstattimefromdualconnectbylevel<=288orderbystattime)t0leftjoin(selecttochar(a.stattime,′hh24:mi′)stattime,nvl(a.consume5cnt,0)consume5cntfromDYUSERANALYSISMINawherea.gametype=:1anda.stattime>=todate(:2,′yyyy−mm−dd′)−(INTERVAL′1′DAY)anda.stattime<trunc(todate(:3,′yyyy−mm−dd′))anda.groupid=:4orderbya.stattime)t1ont0.stattime=t1.stattimeleftjoin(selecttochar(a.stattime,′hh24:mi′)stattime,nvl(a.consume5cnt,0)consume5cntfromDYUSERANALYSISMINawherea.gametype=:5anda.stattime>=todate(:6,′yyyy−mm−dd′)anda.stattime<trunc(todate(:7,′yyyy−mm−dd′))+(INTERVAL′1′DAY)anda.groupid=:8orderbya.stattime)t2ont0.stattime=t2.stattimeleftjoin(selecttochar(a.stattime,′hh24:mi′)stattime,nvl(a.consume5cnt,0)consume5cntfromDYUSERANALYSISMINawherea.gametype=:9anda.stattime>=todate(:10,′yyyy−mm−dd′)−(INTERVAL′7′DAY)anda.stattime<trunc(todate(:11,′yyyy−mm−dd′))−(INTERVAL′6′DAY)anda.groupid=:12orderbya.stattime)t3ont3.stattime=t0.stattimeorderbyt0.stattime给出的建议是创建一个索引。−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−FINDINGSSECTION(1finding)−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−1−IndexFinding(seeexplainplanssectionbelow)−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Theexecutionplanofthisstatementcanbeimprovedbycreatingoneormoreindices.Recommendation(estimatedbenefit:92.38−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−ConsiderrunningtheAccessAdvisortoimprovethephysicalschemadesignorcreatingtherecommendedindex.createindexTESTATNEWBG.IDX10C250001onTESTATNEWBG.DYUSERANALYSISMIN("GROUPID","GAMETYPE","STATTIME");然后在执行计划的对比中,可以看到新的执行计划选择度要好很多。−UsingNewIndices−−−−−−−−−−−−−−−−−−−−Planhashvalue:2029808490−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−|Id|Operation|Name|Rows|Bytes|Cost(−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−|0|SELECTSTATEMENT||1|100|13(8)|00:00:01||1|SORTORDERBY||1|100|13(8)|00:00:01||2|NESTEDLOOPSOUTER||1|100|12(0)|00:00:01||3|NESTEDLOOPSOUTER||1|68|8(0)|00:00:01||4|NESTEDLOOPSOUTER||1|36|5(0)|00:00:01||5|VIEW||1|4|2(0)|00:00:01||∗6|CONNECTBYWITHOUTFILTERING|||||||7|FASTDUAL||1||2(0)|00:00:01||8|TABLEACCESSBYINDEXROWID|DYUSERANALYSISMIN|1|32|3(0)|00:00:01||∗9|INDEXRANGESCAN|IDX10C250001|1||2(0)|00:00:01||10|TABLEACCESSBYINDEXROWID|DYUSERANALYSISMIN|1|32|3(0)|00:00:01||∗11|INDEXRANGESCAN|IDX10C250001|1||2(0)|00:00:01||12|TABLEACCESSBYINDEXROWID|DYUSERANALYSISMIN|1|32|4(0)|00:00:01||∗13|INDEXRANGESCAN|IDX10C250001|1||3(0)|00:00:01|−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−查看表的属性情况如下,可以看到当前存在两个索引。 sh showtab.sh TESTAT_NEWBG DY_USER_ANALYSIS_MIN|less
*******************************************
OWNER TABLE_NAME
------------------------------ ------------------------------
TESTAT_NEWBG DY_USER_ANALYSIS_MIN
*******************************************
********** TABLE GENERAL INFO *****************
TABLE_NAME PAR TABLESPACE STATUS INI_TRANS NUM_ROWS BLOCKS EMPTY_BLOCKS LOG MON ROW_MOVE LAST_ANALYZED
------------------------------ --- ---------- ------ ---------- ---------- ---------- ------------ --- --- -------- -------------------
DY_USER_ANALYSIS_MIN NO TE_DATA VALID 1 372113 3268 0 YES YES DISABLED 2015-12-05 06:00:19
********** TABLE STORAGE INFO *****************
INITEXT NXTEXT MINEXT MAXEXT FREELISTS AVG_SPACE CHAIN_CNT AVG_ROW_LEN CACHE T DEPENDEN COMPRES
--------- --------- ---------- ----------- ---------- ---------- ---------- ----------- ---------- - -------- ----------
65536 1048576 1 2147483645 0 0 61 N N DISABLED DISABLED
********** TABLE columns INFO *****************
COLUMN_ID COLUMN_NAME DATA_TYPE DATA_LENGTH NULLABLE DATA_DEFAULT
---------- ------------------------------ --------------- ----------- ---------- --------------------
1 ID NUMBER(11,0) 22 N
2 STAT_TIME DATE 7 Y
3 GAME_TYPE VARCHAR2(50) 50 Y
4 ZONE_ID NUMBER 22 Y 0
5 GROUP_ID NUMBER 22 Y 0
6 ONLINE_5CNT NUMBER 22 Y 0
7 ONLINE_20CNT NUMBER 22 Y 0
....
19 rows selected.
********** INDEX DETAILS INFO *****************
INDEX_NAME TABLESPACE INDEX_TYPE UNIQUENES PAR COLUMN_LIST TABLE_TYPE STATUS NUM_ROWS LAST_ANALYZED G
------------------------------ ---------- ---------- --------- --- ------------------------------ ---------- ------ ---------- ------------------- -
IND_USER_MIN_DATE_GAME USERS NORMAL NONUNIQUE NO STAT_TIME,GAME_TYPE TABLE VALID 371010 2015-12-05 06:00:21 N
SYS_C00106568 TL_DATA NORMAL UNIQUE NO ID TABLE VALID 371010 2015-12-05 06:00:20 Y
集群因子的情况如下:
TABLE_NAME INDEX_NAME CLUSTERING_FACTOR BLOCKS NUM_ROWS
------------------------------ ------------------------------ ----------------- ---------- ----------
DY_USER_ANALYSIS_MIN IND_USER_MIN_DATE_GAME 4700 3268 372113
DY_USER_ANALYSIS_MIN SYS_C00106568 4575 3268 372113
对于创建索引的如下建议。
create index TESTAT_NEWBG.IDX_DY_USER_ANA_IDTYPT on TESTAT_NEWBG.DY_USER_ANALYSIS_MIN("GROUP_ID","GAME_TYPE","STAT_TIME");
可以通过如下的方式来做个简单的分析。这个索引列的顺序也是蛮讲究,通过直方图的信息可以看到三个相关列的数据分布情况。
select column_name,num_distinct,high_value,low_value ,avg_col_len,histogram from dba_tab_col_statistics where table_name='DY_USER_ANALYSIS_MIN'
COLUMN_NAME NUM_DISTINCT HIGH_VALUE LOW_VALUE AVG_COL_LEN HISTOGRAM
--------------------------- --------------- --------------------------- -----------
ID 503808 C33E403E C30C5164 5 NONE
STAT_TIME 2992 78730C07170101 78730B1B0C3301 8 NONE
GAME_TYPE 1 746C6262 746C6262 5 FREQUENCY
ZONE_ID 1 3E6466 3E6466 4 FREQUENCY
GROUP_ID 136 C25C44 3E6466 4 FREQUENCY
为什么选用了GROUP_ID作为首选列呢,其中一个原因就是范围查询和等值查询,在这个例子中范围查询就是stat_time相关的查询,等值查询就是group_id相关的。这种情况下是优先选择等值查询的。而game_type的数据分布很单一,所以这个列也不能作为首选列。
当然这些信息也是在做了简单的评估之后发现可行,所以马上部署了。可以看到部署之后负载立马降了下来。
可见这个变更也确实起到了立竿见影的效果。但是问题还不止于此,为什么这个语句造成了严重的性能问题,经过后续和开发同事的讨论,他们说这个新需求已经上线两周了。目前存在大量的left join的原因就是需要查询这周,上周,上上周。。某一天的数据情况,按照这种思路,其实我还是建议他们去掉这种冗余的left join,直接使用stat_time in (xxxx,xxxx,xxxx,xxx)其实这种更简明直接。当然问题解决了还不是结束,还需要后续跟进,保证这种问题从根本上杜绝。
ZABBIX-监控系统:
------------------------------------
报警内容: DB time is too high
------------------------------------
报警级别: PROBLEM
------------------------------------
监控项目: DBtime:543 %
------------------------------------
报警时间:2015.12.07-10:48:37
可以看到在蓝色的框中是问题发生时间段的DB time情况,其实后面直接飙到了5000%这个效果是很恐怖的。

而紧接着CPU负载也开始急剧飙升,直接的反应就是机器反应开始非常慢。根据top得到的进程,可以看到cpu资源已经被耗光了。
这个从sar的结果也可以看得很明显。
11:30:02 AM CPU %user %nice %system %iowait %steal %idle
11:31:02 AM all 95.86 0.00 0.36 0.00 0.00 3.78
11:32:02 AM all 94.76 0.00 0.25 0.00 0.00 4.99
11:33:01 AM all 94.53 0.00 0.17 0.00 0.00 5.30
11:34:01 AM all 95.83 0.00 0.19 0.00 0.00 3.98
11:35:01 AM all 95.18 0.00 0.17 0.00 0.00 4.66
11:36:01 AM all 96.47 0.00 0.96 0.02 0.00 2.55
11:37:01 AM all 97.50 0.00 0.15 0.08 0.00 2.28
11:38:01 AM all 96.08 0.00 0.17 0.00 0.00 3.75
11:39:01 AM all 95.74 0.00 0.17 0.01 0.00 4.08
11:40:01 AM all 96.24 0.00 0.17 0.00 0.00 3.58
11:41:01 AM all 94.68 0.00 0.34 0.00 0.00 4.98
11:42:01 AM all 97.30 0.00 0.15 0.00 0.00 2.54
11:43:01 AM all 96.51 0.00 0.18 0.00 0.00 3.31
11:44:01 AM all 96.14 0.00 0.14 0.00 0.00 3.71
11:45:01 AM all 95.52 0.00 0.20 0.01 0.00 4.26
11:46:01 AM all 96.67 0.00 0.94 0.01 0.00 2.38
11:47:01 AM all 96.23 0.00 0.16 0.01 0.00 3.61
11:48:01 AM all 96.59 0.00 0.16 0.00 0.00 3.25
11:49:01 AM all 96.40 0.00 0.18 0.00 0.00 3.41
11:50:02 AM all 96.15 0.00 0.21 0.00 0.00 3.64
11:51:01 AM all 96.72 0.00 0.59 0.04 0.00 2.64
11:52:01 AM all 96.51 0.00 0.16 0.00 0.00 3.32
11:53:01 AM all 94.57 0.00 0.20 0.01 0.00 5.22
11:54:01 AM all 93.82 0.00 0.14 0.00 0.00 6.04
11:55:01 AM all 94.04 0.00 0.14 0.01 0.00 5.81
11:56:01 AM all 96.01 0.00 0.85 0.00 0.00 3.14
11:57:01 AM all 95.66 0.00 0.15 0.01 0.00 4.17
11:58:01 AM all 94.93 0.00 0.14 0.00 0.00 4.93
11:59:01 AM all 97.15 0.00 0.16 0.01 0.00 2.68
12:00:01 PM all 99.76 0.00 0.23 0.00 0.00 0.01
12:01:01 PM all 99.45 0.00 0.55 0.00 0.00 0.01
12:02:01 PM all 99.76 0.00 0.24 0.00 0.00 0.00
12:03:02 PM all 99.60 0.00 0.40 0.00 0.00 0.00
12:04:02 PM all 99.76 0.00 0.23 0.00 0.00 0.00
12:05:01 PM all 99.77 0.00 0.23 0.00 0.00 0.00
12:06:05 PM all 99.37 0.00 0.62 0.00 0.00 0.01
12:07:02 PM all 99.59 0.00 0.40 0.00 0.00 0.00
12:08:03 PM all 99.57 0.00 0.43 0.00 0.00 0.00
12:09:02 PM all 99.66 0.00 0.34 0.00 0.00 0.00
12:10:01 PM all 99.80 0.00 0.20 0.00 0.00 0.01
12:11:15 PM all 99.70 0.00 0.30 0.00 0.00 0.00
12:12:07 PM all 99.87 0.00 0.12 0.00 0.00 0.00
12:13:03 PM all 99.74 0.00 0.26 0.00 0.00 0.00
12:14:02 PM all 99.79 0.00 0.21 0.00 0.00 0.00
12:15:02 PM all 99.85 0.00 0.14 0.00 0.00 0.01
12:16:04 PM all 99.58 0.00 0.42 0.00 0.00 0.00
所以问题其实变得还是挺着急的。这种负载情况下,看似问题还在恶化,着实让人捏了一把汗。
top的情况如下:
top - 13:34:40 up 469 days, 20:09, 6 users, load average: 44.01, 42.83, 43.48
Tasks: 515 total, 48 running, 453 sleeping, 0 stopped, 14 zombie
Cpu(s): 99.9%us, 0.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32946232k total, 32753852k used, 192380k free, 209652k buffers
Swap: 16771776k total, 186968k used, 16584808k free, 28973828k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8517 oracle 18 0 10.1g 305m 300m R 64.7 0.9 263:48.43 oracletestdb (LOCAL=NO)
17796 oracle 19 0 10.1g 174m 169m R 56.7 0.5 14:54.71 oracletestdb (LOCAL=NO)
18473 oracle 17 0 10.1g 171m 166m R 39.8 0.5 14:31.95 oracletestdb (LOCAL=NO)
18153 oracle 16 0 10.1g 175m 171m R 38.1 0.5 13:55.82 oracletestdb (LOCAL=NO)
根据top的情况抓取了一个进程,看看这个进程到底在干嘛.比如进程号是16908.
可以看都原来是一个客户端触发的一个select语句导致。
sh showpid.sh 16908 ******************************************* Process has found, pid: 16908 , addr: 00000002DF6199B0 ####### Process Information from OS level as below ######## oracle 16908 1 17 12:00 ? 00:14:48 oracletestdb (LOCAL=NO) ############################################## SID SERIAL# USERNAME OSUSER MACHINE PROCESS TERMINAL TYPE LOGIN_TIME ---------- ---------- --------------- --------------- -------------------- --------------- --------------- ---------- ------------------- 771 41269 TESTAT_NEWBG mbilocalhost 1234 USER 2015-12-07 12:00:05 SQL_ID SQL_TEXT ------------------------------ ------------------------------------------------------------ 7jywxtcmcmcgv select t0.stat_time stat_time,t2.consume_5cnt today,t1.consu me_5cnt yestoday,t3.consume_5cnt lastweek from (select to _char(trunc(sysdate)-level*5/60/24,'hh24:mi') as stat_time f rom dual connect by level <=288 order by stat_time) t0 lef t join (select to_char(a.stat_time,'hh24:mi') stat_ 查看执行计划的情况如下,其实看计划都是走了索引扫描。sh showpid.sh 16908
*******************************************
Process has found, pid: 16908 , addr: 00000002DF6199B0
####### Process Information from OS level as below ########
oracle 16908 1 17 12:00 ? 00:14:48 oracletestdb (LOCAL=NO)
##############################################
SID SERIAL# USERNAME OSUSER MACHINE PROCESS TERMINAL TYPE LOGIN_TIME
---------- ---------- --------------- --------------- -------------------- --------------- --------------- ---------- -------------------
771 41269 TESTAT_NEWBG mbilocalhost 1234 USER 2015-12-07 12:00:05
SQL_ID SQL_TEXT
------------------------------ ------------------------------------------------------------
7jywxtcmcmcgv select t0.stat_time stat_time,t2.consume_5cnt today,t1.consu
me_5cnt yestoday,t3.consume_5cnt lastweek from (select to
_char(trunc(sysdate)-level*5/60/24,'hh24:mi') as stat_time f
rom dual connect by level <=288 order by stat_time) t0 lef
t join (select to_char(a.stat_time,'hh24:mi') stat_
查看执行计划的情况如下,其实看计划都是走了索引扫描。 sh showplan2.sh 7jywxtcmcmcgv
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 7jywxtcmcmcgv, child number 0
-------------------------------------
Plan hash value: 3396463774
------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 171 (100)| |
| 1 | SORT ORDER BY | | 3 | 300 | 171 (2)| 00:00:03 |
| 2 | NESTED LOOPS OUTER | | 3 | 300 | 170 (1)| 00:00:03 |
| 3 | NESTED LOOPS OUTER | | 1 | 68 | 8 (0)| 00:00:01 |
| 4 | NESTED LOOPS OUTER | | 1 | 36 | 5 (0)| 00:00:01 |
| 5 | VIEW | | 1 | 4 | 2 (0)| 00:00:01 |
| 6 | CONNECT BY WITHOUT FILTERING| | | | | |
| 7 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 8 | TABLE ACCESS BY INDEX ROWID | DY_USER_ANALYSIS_MIN | 1 | 32 | 3 (0)| 00:00:01 |
|* 9 | INDEX RANGE SCAN | IND_USER_MIN_DATE_GAME | 2 | | 2 (0)| 00:00:01 |
|* 10 | TABLE ACCESS BY INDEX ROWID | DY_USER_ANALYSIS_MIN | 1 | 32 | 3 (0)| 00:00:01 |
|* 11 | INDEX RANGE SCAN | IND_USER_MIN_DATE_GAME | 2 | | 2 (0)| 00:00:01 |
|* 12 | TABLE ACCESS BY INDEX ROWID | DY_USER_ANALYSIS_MIN | 3 | 96 | 162 (1)| 00:00:02 |
|* 13 | INDEX RANGE SCAN | IND_USER_MIN_DATE_GAME | 481 | | 155 (1)| 00:00:02 |
------------------------------------------------------------------------------------------------------------
一边分析一遍生成了一个调优报告。看看dbms_sqltune怎么建议。
shgentunerpt.sh7jywxtcmcmcgvTASK68645GENERALINFORMATIONSECTION−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−TuningTaskName:TASK68645TuningTaskOwner:TESTDBAScope:COMPREHENSIVETimeLimit(seconds):120CompletionStatus:COMPLETEDStartedat:12/07/201513:28:55Completedat:12/07/201513:30:59NumberofIndexFindings:1NumberofErrors:1−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−SchemaName:TESTATNEWBGSQLID:7jywxtcmcmcgv完整的语句如下:SQLText:selectt0.stattimestattime,t2.consume5cnttoday,t1.consume5cntyestoday,t3.consume5cntlastweekfrom(selecttochar(trunc(sysdate)−level∗5/60/24,′hh24:mi′)asstattimefromdualconnectbylevel<=288orderbystattime)t0leftjoin(selecttochar(a.stattime,′hh24:mi′)stattime,nvl(a.consume5cnt,0)consume5cntfromDYUSERANALYSISMINawherea.gametype=:1anda.stattime>=todate(:2,′yyyy−mm−dd′)−(INTERVAL′1′DAY)anda.stattime<trunc(todate(:3,′yyyy−mm−dd′))anda.groupid=:4orderbya.stattime)t1ont0.stattime=t1.stattimeleftjoin(selecttochar(a.stattime,′hh24:mi′)stattime,nvl(a.consume5cnt,0)consume5cntfromDYUSERANALYSISMINawherea.gametype=:5anda.stattime>=todate(:6,′yyyy−mm−dd′)anda.stattime<trunc(todate(:7,′yyyy−mm−dd′))+(INTERVAL′1′DAY)anda.groupid=:8orderbya.stattime)t2ont0.stattime=t2.stattimeleftjoin(selecttochar(a.stattime,′hh24:mi′)stattime,nvl(a.consume5cnt,0)consume5cntfromDYUSERANALYSISMINawherea.gametype=:9anda.stattime>=todate(:10,′yyyy−mm−dd′)−(INTERVAL′7′DAY)anda.stattime<trunc(todate(:11,′yyyy−mm−dd′))−(INTERVAL′6′DAY)anda.groupid=:12orderbya.stattime)t3ont3.stattime=t0.stattimeorderbyt0.stattime给出的建议是创建一个索引。−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−FINDINGSSECTION(1finding)−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−1−IndexFinding(seeexplainplanssectionbelow)−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−Theexecutionplanofthisstatementcanbeimprovedbycreatingoneormoreindices.Recommendation(estimatedbenefit:92.38−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−ConsiderrunningtheAccessAdvisortoimprovethephysicalschemadesignorcreatingtherecommendedindex.createindexTESTATNEWBG.IDX10C250001onTESTATNEWBG.DYUSERANALYSISMIN("GROUPID","GAMETYPE","STATTIME");然后在执行计划的对比中,可以看到新的执行计划选择度要好很多。−UsingNewIndices−−−−−−−−−−−−−−−−−−−−Planhashvalue:2029808490−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−|Id|Operation|Name|Rows|Bytes|Cost(−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−|0|SELECTSTATEMENT||1|100|13(8)|00:00:01||1|SORTORDERBY||1|100|13(8)|00:00:01||2|NESTEDLOOPSOUTER||1|100|12(0)|00:00:01||3|NESTEDLOOPSOUTER||1|68|8(0)|00:00:01||4|NESTEDLOOPSOUTER||1|36|5(0)|00:00:01||5|VIEW||1|4|2(0)|00:00:01||∗6|CONNECTBYWITHOUTFILTERING|||||||7|FASTDUAL||1||2(0)|00:00:01||8|TABLEACCESSBYINDEXROWID|DYUSERANALYSISMIN|1|32|3(0)|00:00:01||∗9|INDEXRANGESCAN|IDX10C250001|1||2(0)|00:00:01||10|TABLEACCESSBYINDEXROWID|DYUSERANALYSISMIN|1|32|3(0)|00:00:01||∗11|INDEXRANGESCAN|IDX10C250001|1||2(0)|00:00:01||12|TABLEACCESSBYINDEXROWID|DYUSERANALYSISMIN|1|32|4(0)|00:00:01||∗13|INDEXRANGESCAN|IDX10C250001|1||3(0)|00:00:01|−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−查看表的属性情况如下,可以看到当前存在两个索引。 sh showtab.sh TESTAT_NEWBG DY_USER_ANALYSIS_MIN|less
*******************************************
OWNER TABLE_NAME
------------------------------ ------------------------------
TESTAT_NEWBG DY_USER_ANALYSIS_MIN
*******************************************
********** TABLE GENERAL INFO *****************
TABLE_NAME PAR TABLESPACE STATUS INI_TRANS NUM_ROWS BLOCKS EMPTY_BLOCKS LOG MON ROW_MOVE LAST_ANALYZED
------------------------------ --- ---------- ------ ---------- ---------- ---------- ------------ --- --- -------- -------------------
DY_USER_ANALYSIS_MIN NO TE_DATA VALID 1 372113 3268 0 YES YES DISABLED 2015-12-05 06:00:19
********** TABLE STORAGE INFO *****************
INITEXT NXTEXT MINEXT MAXEXT FREELISTS AVG_SPACE CHAIN_CNT AVG_ROW_LEN CACHE T DEPENDEN COMPRES
--------- --------- ---------- ----------- ---------- ---------- ---------- ----------- ---------- - -------- ----------
65536 1048576 1 2147483645 0 0 61 N N DISABLED DISABLED
********** TABLE columns INFO *****************
COLUMN_ID COLUMN_NAME DATA_TYPE DATA_LENGTH NULLABLE DATA_DEFAULT
---------- ------------------------------ --------------- ----------- ---------- --------------------
1 ID NUMBER(11,0) 22 N
2 STAT_TIME DATE 7 Y
3 GAME_TYPE VARCHAR2(50) 50 Y
4 ZONE_ID NUMBER 22 Y 0
5 GROUP_ID NUMBER 22 Y 0
6 ONLINE_5CNT NUMBER 22 Y 0
7 ONLINE_20CNT NUMBER 22 Y 0
....
19 rows selected.
********** INDEX DETAILS INFO *****************
INDEX_NAME TABLESPACE INDEX_TYPE UNIQUENES PAR COLUMN_LIST TABLE_TYPE STATUS NUM_ROWS LAST_ANALYZED G
------------------------------ ---------- ---------- --------- --- ------------------------------ ---------- ------ ---------- ------------------- -
IND_USER_MIN_DATE_GAME USERS NORMAL NONUNIQUE NO STAT_TIME,GAME_TYPE TABLE VALID 371010 2015-12-05 06:00:21 N
SYS_C00106568 TL_DATA NORMAL UNIQUE NO ID TABLE VALID 371010 2015-12-05 06:00:20 Y
集群因子的情况如下:
TABLE_NAME INDEX_NAME CLUSTERING_FACTOR BLOCKS NUM_ROWS
------------------------------ ------------------------------ ----------------- ---------- ----------
DY_USER_ANALYSIS_MIN IND_USER_MIN_DATE_GAME 4700 3268 372113
DY_USER_ANALYSIS_MIN SYS_C00106568 4575 3268 372113
对于创建索引的如下建议。
create index TESTAT_NEWBG.IDX_DY_USER_ANA_IDTYPT on TESTAT_NEWBG.DY_USER_ANALYSIS_MIN("GROUP_ID","GAME_TYPE","STAT_TIME");
可以通过如下的方式来做个简单的分析。这个索引列的顺序也是蛮讲究,通过直方图的信息可以看到三个相关列的数据分布情况。
select column_name,num_distinct,high_value,low_value ,avg_col_len,histogram from dba_tab_col_statistics where table_name='DY_USER_ANALYSIS_MIN'
COLUMN_NAME NUM_DISTINCT HIGH_VALUE LOW_VALUE AVG_COL_LEN HISTOGRAM
--------------------------- --------------- --------------------------- -----------
ID 503808 C33E403E C30C5164 5 NONE
STAT_TIME 2992 78730C07170101 78730B1B0C3301 8 NONE
GAME_TYPE 1 746C6262 746C6262 5 FREQUENCY
ZONE_ID 1 3E6466 3E6466 4 FREQUENCY
GROUP_ID 136 C25C44 3E6466 4 FREQUENCY
为什么选用了GROUP_ID作为首选列呢,其中一个原因就是范围查询和等值查询,在这个例子中范围查询就是stat_time相关的查询,等值查询就是group_id相关的。这种情况下是优先选择等值查询的。而game_type的数据分布很单一,所以这个列也不能作为首选列。
当然这些信息也是在做了简单的评估之后发现可行,所以马上部署了。可以看到部署之后负载立马降了下来。

可见这个变更也确实起到了立竿见影的效果。但是问题还不止于此,为什么这个语句造成了严重的性能问题,经过后续和开发同事的讨论,他们说这个新需求已经上线两周了。目前存在大量的left join的原因就是需要查询这周,上周,上上周。。某一天的数据情况,按照这种思路,其实我还是建议他们去掉这种冗余的left join,直接使用stat_time in (xxxx,xxxx,xxxx,xxx)其实这种更简明直接。当然问题解决了还不是结束,还需要后续跟进,保证这种问题从根本上杜绝。