I recently ran into a disk failure on Exadata. After the physical disk was replaced, one of the griddisks was not automatically added back to the ASM instance. After resolving the issue, I wrote up the full analysis below.
1. When a disk fails (for example, a fault LED is lit) and a hardware problem is suspected, collect the hardware logs first
Run the sundiag.sh script to collect the cell hardware information, check for damaged hardware, locate the faulty device, and replace it.
(Note: sundiag.sh is a hardware diagnostic script provided by Oracle. The command below is executed on a compute node; the generated logs can then be found under /tmp on each compute node and cell.)
- #dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "/opt/oracle.SupportTools/sundiag.sh"
- For example
- dmorlcel09: Success in AdpEventLog
- dmorlcel09:
- dmorlcel09: Exit Code: 0x00
- ……
- dmorlcel09: sundiag_2012_07_10_06_08/dmorlcel09_megacli64-status_2012_07_10_06_08.out
- dmorlcel09: ==============================================================================
- dmorlcel09: Done the report files are in bzip2 compressed /tmp/sundiag_2012_07_10_06_08.tar.bz2
- dmorlcel09:
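Once sundiag.sh has run everywhere, the tarballs still sit under /tmp on each node. A minimal Python sketch for pulling them to one place over scp (the host list and destination directory are placeholders for illustration, not from the original environment):

```python
# Hypothetical host list and destination directory; adjust for your environment.
HOSTS = ["dmorlcel09", "dmorlcel10"]
DEST = "/root/sundiag_logs"

def scp_cmd(host, dest=DEST):
    """Build the scp command that pulls the sundiag tarball from one host."""
    return ["scp", "root@%s:/tmp/sundiag_*.tar.bz2" % host, dest]

for h in HOSTS:
    # Print the commands; swap print for subprocess.run(scp_cmd(h), check=True)
    # to actually execute them.
    print(" ".join(scp_cmd(h)))
```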
2. Check the disk group status in the ASM instance
- $ sqlplus / as sysasm
- (1) Check the disk count and balance of each disk group
- column "Diskgroup" format A30
- column "Imbalance" format 99.9 Heading "Percent|Imbalance"
- column "Variance" format 99.9 Heading "Percent|Disk Size|Variance"
- column "MinFree" format 99.9 Heading "Minimum|Percent|Free"
- column "DiskCnt" format 9999 Heading "Disk|Count"
- column "Type" format A10 Heading "Diskgroup|Redundancy"
- SELECT g.name "Diskgroup",
- 100*(max((d.total_mb-d.free_mb)/d.total_mb)-min((d.total_mb-d.free_mb)/d.total_mb))/max((d.total_mb-d.free_mb)/d.total_mb) "Imbalance",
- 100*(max(d.total_mb)-min(d.total_mb))/max(d.total_mb) "Variance",
- 100*(min(d.free_mb/d.total_mb)) "MinFree",
- count(*) "DiskCnt",
- g.type "Type"
- FROM v$asm_disk d, v$asm_diskgroup g
- WHERE d.group_number = g.group_number and
- d.group_number <> 0 and
- d.state = 'NORMAL' and
- d.mount_status = 'CACHED'
- GROUP BY g.name, g.type;
- Diskgroup                        Percent   Percent    Minimum  Disk  Diskgroup
-                                  Imbalance Disk Size  Percent  Count Redundancy
-                                            Variance   Free
- ------------------------------ ---------- ---------- -------- ----- ----------
- DBFS_DG                              84.6         .0     99.8   220 NORMAL
- DG_DAT1                               7.1         .0      8.9   264 NORMAL
- DG_DAT2                                .1         .0     65.1   263 NORMAL   <<<<<<<< this group has one disk fewer than DG_DAT1
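The Imbalance column above is the spread of per-disk used-space ratios relative to the fullest disk, i.e. 100*(max-min)/max of (total_mb-free_mb)/total_mb. A sanity check of the same arithmetic in Python, on made-up disk sizes:

```python
def imbalance(disks):
    """disks: list of (total_mb, free_mb) tuples.
    Returns percent imbalance, 100 * (max(used) - min(used)) / max(used),
    mirroring the v$asm_disk query in the text."""
    ratios = [(total - free) / total for total, free in disks]
    return 100 * (max(ratios) - min(ratios)) / max(ratios)

# Two disks equally full -> perfectly balanced
print(round(imbalance([(1000, 500), (1000, 500)]), 1))  # 0.0
# One disk 80% used, one 40% used -> 50% imbalance
print(round(imbalance([(1000, 200), (1000, 600)]), 1))  # 50.0
```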
- (2) Notice that one griddisk is not mounted.
- set line 300
- column "PATH" format A100
- SQL> select name, path, header_status from v$asm_disk order by path;
- DG_DAT1_CD_11_DM02CEL04 o/192.168.10.34/DG_DAT1_CD_11_dm02cel04 MEMBER
- DG_DAT1_CD_00_DM02CEL04 o/192.168.10.34/DG_DAT1_CD_00_dm02cel04 MEMBER
- o/192.168.10.34/DG_DAT2_CD_00_dm02cel04 FORMER <<<<<<<<<<<<<<<<<< abnormal: this disk has no name and its header status is FORMER
- DG_DAT2_CD_01_DM02CEL04 o/192.168.10.34/DG_DAT2_CD_01_dm02cel04 MEMBER
- ------------------------------------
- Comment:
- A FORMER header status indicates the disk was once part of a disk group but has been dropped cleanly from it. Such a disk can be added to a disk group again with the ALTER DISKGROUP statement.
- (3) Confirm that DG_DAT2_CD_00_dm02cel04 is missing
- select NAME,HEADER_STATUS,MOUNT_STATUS,STATE,GROUP_NUMBER from V$ASM_DISK where NAME like '%CD_03_DM01CEL03'; <<<<<< adjust the pattern to match your own cell and disk name
- NAME HEADER_STATUS MOUNT_STATUS STATE GROUP_NUMBER
- ------------------------------------------------------------ ------------------------ -------------- ---------------- ------------
- DG_DAT1_CD_00_DM02CEL04 MEMBER CACHED NORMAL 2
- <<<<<< DG_DAT2_CD_00_DM02CEL04 is missing: only the DG_DAT1 griddisk is returned
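Steps (2) and (3) come down to spotting rows whose NAME is empty and whose HEADER_STATUS is FORMER. A minimal Python sketch of that check, using rows shaped like the query output above:

```python
def unmounted_former_disks(rows):
    """rows: list of (name, path, header_status) tuples, like the
    v$asm_disk query output. Returns the paths of disks that were
    dropped cleanly (FORMER) and no longer belong to any group
    (empty name) -- the candidates for a manual re-add."""
    return [path for name, path, status in rows
            if status == "FORMER" and not name]

# Sample rows taken from the listing above.
rows = [
    ("DG_DAT1_CD_11_DM02CEL04", "o/192.168.10.34/DG_DAT1_CD_11_dm02cel04", "MEMBER"),
    ("",                        "o/192.168.10.34/DG_DAT2_CD_00_dm02cel04", "FORMER"),
    ("DG_DAT2_CD_01_DM02CEL04", "o/192.168.10.34/DG_DAT2_CD_01_dm02cel04", "MEMBER"),
]
print(unmounted_former_disks(rows))  # ['o/192.168.10.34/DG_DAT2_CD_00_dm02cel04']
```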
3. Resolution steps
After the faulty Exadata disk was replaced, re-adding the disk to its ASM disk group completed without problems.
On Exadata, as the grid user, add RECO_DM01_CD_03_DM01CEL03 back into ASM.
The whole procedure was as follows:
(1) Disk group information before the operation:
- SQL> select name,header_status,mount_status,state,group_number from v$asm_disk where name like '%CD_03_DM01CEL03';
- NAME HEADER_STA MOUNT_S STATE GROUP_NUMBER
- ------------------------------ ---------- ------- -------- ------------
- DBFS_DG_CD_03_DM01CEL03 MEMBER CACHED NORMAL 2
- DATA_DM01_CD_03_DM01CEL03 MEMBER CACHED NORMAL 1
(2) Add the disk back to the disk group:
- SQL> alter diskgroup RECO_DM01 add disk 'o/192.168.252.5/RECO_DM01_CD_03_dm01cel03' rebalance power 10;
- Diskgroup altered.
(3) After adding, check the disk group information again:
- SQL> select name,header_status,mount_status,state,group_number from v$asm_disk where name like '%CD_03_DM01CEL03';
- NAME HEADER_STATUS MOUNT_STATUS STATE GROUP_NUMBER
- ------------------------------------------------------------ ------------------------ -------------- ---------------- ------------
- DBFS_DG_CD_03_DM01CEL03 MEMBER CACHED NORMAL 2
- RECO_DM01_CD_03_DM01CEL03 MEMBER CACHED NORMAL 3
- DATA_DM01_CD_03_DM01CEL03 MEMBER CACHED NORMAL 1
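The fix in step (2) is a single ALTER DISKGROUP ADD DISK statement whose disk path is o/&lt;cell IP&gt;/&lt;griddisk name&gt;. A small Python sketch that assembles the statement from those parts (the helper name and default rebalance power are illustrative, not an Oracle API):

```python
def add_disk_sql(diskgroup, cell_ip, griddisk, power=10):
    """Assemble the ALTER DISKGROUP ... ADD DISK statement used in step (2).
    The disk path format is o/<cell IP>/<griddisk name>."""
    return ("alter diskgroup %s add disk 'o/%s/%s' rebalance power %d"
            % (diskgroup, cell_ip, griddisk, power))

print(add_disk_sql("RECO_DM01", "192.168.252.5", "RECO_DM01_CD_03_dm01cel03"))
# alter diskgroup RECO_DM01 add disk 'o/192.168.252.5/RECO_DM01_CD_03_dm01cel03' rebalance power 10
```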
4. Cause analysis: examining alert_+ASM.log
- NOTE: cache opening disk 192 of grp 3: DG_DAT2_CD_00_DM02CEL04 path:o/192.168.10.34/DG_DAT2_CD_00_dm02cel04
- NOTE: Attempting voting file refresh on diskgroup DG_DAT2
- GMON querying group 3 at 19 for pid 19, osid 12228
- SUCCESS: refreshed membership for 3/0xa4c726c6 (DG_DAT2)
- Tue Jun 05 22:33:40 2012
- NOTE: Attempting voting file refresh on diskgroup DG_DAT2
- Tue Jun 05 22:33:43 2012
- SUCCESS: /* Exadata Auto Mgmt: ADD ASM Disk in given FAILGROUP */
- alter diskgroup DG_DAT2 add
- failgroup DM02CEL04
- disk 'o/192.168.10.34/DG_DAT2_CD_00_dm02cel04'
- name DG_DAT2_CD_00_DM02CEL04
- rebalance nowait
- NOTE: starting rebalance of group 3/0xa4c726c6 (DG_DAT2) at power 4
- Starting background process ARB0
- Tue Jun 05 22:33:46 2012
- ARB0 started with pid=39, OS id=5039
- NOTE: assigning ARB0 to group 3/0xa4c726c6 (DG_DAT2) with 4 parallel I/Os <<<<<<<<<<<<<<< at first I assumed that I/O pressure from the two griddisks rebalancing at the same time caused the failed griddisk's add to fail
- NOTE: F1X0 copy 2 relocating from 23:2 to 249:2 for diskgroup 3 (DG_DAT2)
- NOTE: F1X0 copy 3 relocating from 249:2 to 255:9441 for diskgroup 3 (DG_DAT2)
- ......
- Tue Jun 05 22:36:04 2012 <<<<<<< at the time I thought this analysis was right, but it later turned out the root cause was already decided when the disk first reported errors
- NOTE: stopping process ARB0 <<<<<<<<<<<<<<<<<<<<<<<
- NOTE: rebalance interrupted for group 3/0xa4c726c6 (DG_DAT2) <<<<<<<<<<<<<<<<<<<<<<< rebalance interrupted
- NOTE: membership refresh pending for group 3/0xa4c726c6 (DG_DAT2)
- Tue Jun 05 22:36:11 2012
- GMON querying group 3 at 22 for pid 19, osid 12228
- SUCCESS: refreshed membership for 3/0xa4c726c6 (DG_DAT2)
- Tue Jun 05 22:36:17 2012
- NOTE: Attempting voting file refresh on diskgroup DG_DAT2
- Tue Jun 05 23:19:15 2012
- NOTE: cache closing disk 192 of grp 3: DG_DAT2_CD_00_DM02CEL04
- Tue Jun 05 23:19:15 2012
- NOTE: membership refresh pending for group 3/0xa4c726c6 (DG_DAT2)
- GMON querying group 3 at 23 for pid 19, osid 12228
- GMON querying group 3 at 24 for pid 19, osid 12228
- NOTE: Disk in mode 0x8 marked for de-assignment
- SUCCESS: refreshed membership for 3/0xa4c726c6 (DG_DAT2)
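The turning point in the excerpt above is the "rebalance interrupted" line. A minimal Python sketch that pulls the rebalance start/stop/interrupt events out of alert log text (the sample lines below are condensed from the excerpt):

```python
def rebalance_events(log_text, group="DG_DAT2"):
    """Return the lines recording rebalance start/stop for one disk group,
    plus any 'rebalance interrupted' lines -- the smoking gun above."""
    keep = ("starting rebalance", "stopping process ARB0",
            "rebalance interrupted")
    return [line.strip() for line in log_text.splitlines()
            if any(k in line for k in keep)
            and (group in line or "ARB0" in line)]

sample = """\
NOTE: starting rebalance of group 3/0xa4c726c6 (DG_DAT2) at power 4
NOTE: F1X0 copy 2 relocating from 23:2 to 249:2 for diskgroup 3 (DG_DAT2)
NOTE: stopping process ARB0
NOTE: rebalance interrupted for group 3/0xa4c726c6 (DG_DAT2)
"""
for event in rebalance_events(sample):
    print(event)
```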
5. Final root cause analysis
This happened because a single physical disk is ultimately carved into two griddisks. When a sector on the disk went bad, one of the griddisks was dropped outright, while the other was affected but not dropped and only raised warnings. After the physical disk was replaced, the griddisk from the undamaged part was added back automatically, while the other one had to be added back manually.
Digging deeper: whether ASM automatically re-adds a disk after it has been dropped depends on how the disk was dropped:
- alter diskgroup DG_DAT2 drop disk DG_DAT2_CD_00_DM02CEL04 <<<<<<<<<<<<<<<<<<<<<< without the FORCE option, the disk is considered dropped cleanly and the system will not try to add a disk back automatically
- alter diskgroup DG_DAT2 drop force disk DG_DAT2_CD_00_DM02CEL04 <<<<<<<<<<<<<<<<<<<<<< with the FORCE option, the system treats the disk as abnormally removed, so it will try to add the new disk back automatically
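The rule above can be written down as a tiny predicate. This is my reading of the incident rather than documented Oracle behavior: a plain DROP is a clean removal (the header becomes FORMER) with no automatic re-add, while DROP FORCE is an abnormal removal, so auto management re-adds the replacement:

```python
def auto_add_after_replacement(drop_statement):
    """Given the DROP statement that removed the old griddisk, predict
    whether Exadata auto management will re-add the replacement.
    This encodes the behavior observed in this incident, not an
    official specification."""
    # Normalize whitespace and case, then look for the FORCE keyword.
    normalized = " %s " % " ".join(drop_statement.lower().split())
    # FORCE -> abnormal removal -> auto add; plain DROP -> manual add.
    return " force " in normalized

print(auto_add_after_replacement(
    "alter diskgroup DG_DAT2 drop disk DG_DAT2_CD_00_DM02CEL04"))        # False
print(auto_add_after_replacement(
    "alter diskgroup DG_DAT2 drop force disk DG_DAT2_CD_00_DM02CEL04"))  # True
```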
The next step of the analysis is to find out exactly what caused the disk to be dropped when the problem occurred; only then can we explain why it was not added back to the ASM disk group automatically.
=== alert_+ASM.log entries from when the problem occurred. This part shows the griddisk that WAS re-added automatically ===
- Tue Jun 05 16:34:56 2012
- XDWK started with pid=30, OS id=13410
- WARNING: Exadata Auto Management: OS PID: 13410 Operation ID: 3131: ONLINE disk DG_DAT1_CD_00_DM02CEL04 in diskgroup DG_DAT1 Failed
- SQL :
- Cause :
- Action : Check alert log to see why this operation failed.
- Also check process trace file for matching Operation ID.
- ...................................
- Tue Jun 05 22:25:59 2012
- WARNING: Exadata Auto Management: OS PID: 20903 Operation ID: 3246: ONLINE disk DG_DAT1_CD_00_DM02CEL04 in diskgroup DG_DAT1 Failed
- SQL :
- Cause :
- Action : Check alert log to see why this operation failed.
- Also check process trace file for matching Operation ID.
=== alert_+ASM.log entries from when the problem occurred. This part shows the griddisk that was NOT re-added automatically ===
- Tue Jun 05 13:35:58 2012
- XDWK started with pid=30, OS id=26485
- SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */ <<<<<<<<<<<<<<<<<<<<<<<<< the key line: a proactive (clean) DROP, without FORCE
- alter diskgroup DG_DAT2 drop
- disk DG_DAT2_CD_00_DM02CEL04
- NOTE: GroupBlock outside rolling migration privileged region
- NOTE: requesting all-instance membership refresh for group=3
- Tue Jun 05 13:36:00 2012
- GMON updating for reconfiguration, group 3 at 10 for pid 30, osid 26485
- NOTE: group 3 PST updated.
- Tue Jun 05 13:36:00 2012
- NOTE: membership refresh pending for group 3/0xa4c87c09 (DG_DAT2)
- GMON querying group 3 at 11 for pid 19, osid 15396
- SUCCESS: refreshed membership for 3/0xa4c87c09 (DG_DAT2)
- SUCCESS: /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
- alter diskgroup DG_DAT2 drop
- disk DG_DAT2_CD_00_DM02CEL04
- NOTE: Attempting voting file refresh on diskgroup DG_DAT2
- NOTE: starting rebalance of group 3/0xa4c87c09 (DG_DAT2) at power 4
- Starting background process ARB0
- Tue Jun 05 13:36:05 2012
- ARB0 started with pid=38, OS id=26796
- NOTE: assigning ARB0 to group 3/0xa4c87c09 (DG_DAT2) with 4 parallel I/Os
- NOTE: membership refresh pending for group 2/0xa4c87c08 (DG_DAT1)
- GMON querying group 2 at 12 for pid 19, osid 15396
- SUCCESS: refreshed membership for 2/0xa4c87c08 (DG_DAT1)
- Tue Jun 05 13:36:11 2012
- NOTE: Attempting voting file refresh on diskgroup DG_DAT1
- Tue Jun 05 13:49:26 2012
- Starting background process XDWK
- Tue Jun 05 13:49:26 2012
- XDWK started with pid=30, OS id=25037
- Tue Jun 05 14:04:28 2012
- Starting background process XDWK
- Tue Jun 05 14:04:29 2012
- XDWK started with pid=39, OS id=26978
- Tue Jun 05 14:19:31 2012
- Starting background process XDWK
- Tue Jun 05 14:19:31 2012
- XDWK started with pid=30, OS id=28260
- Tue Jun 05 14:34:34 2012
- Starting background process XDWK
- Tue Jun 05 14:34:34 2012
- XDWK started with pid=39, OS id=32093
- Tue Jun 05 14:43:59 2012
- NOTE: GroupBlock outside rolling migration privileged region
- NOTE: requesting all-instance membership refresh for group=3
- Tue Jun 05 14:44:22 2012
- GMON updating for reconfiguration, group 3 at 13 for pid 30, osid 21558
- Tue Jun 05 14:44:23 2012
- NOTE: group 3 PST updated.
- Tue Jun 05 14:44:34 2012
- SUCCESS: grp 3 disk DG_DAT2_CD_00_DM02CEL04 emptied
- NOTE: erasing header on grp 3 disk DG_DAT2_CD_00_DM02CEL04
- NOTE: process _x000_+asm1 (21558) initiating offline of disk 192.3915944441 (DG_DAT2_CD_00_DM02CEL04) with mask 0x7e in group 3
- NOTE: initiating PST update: grp = 3, dsk = 192/0xe96891f9, mask = 0x6a, op = clear
- Tue Jun 05 14:44:34 2012
6. Summary
At first I reasoned that, since one celldisk is divided into two griddisks, I/O contention during rebalance must have caused one of them to fail to be added; but when I re-read the logs, the problem turned out not to lie there.
Tracing back to the source, I looked for the reason the griddisk reported its very first error and finally found the log line below. That in turn raised the other question of how disks are dropped from an ASM disk group. So when an analysis is not convincing enough, start over from the beginning and take a different angle; often that is what solves the problem.
- alter diskgroup DG_DAT2 drop disk DG_DAT2_CD_00_DM02CEL04