VMware Troubleshooting Case Study: ESXi 6.0 Disk Congestion

Summary: After a new SSD was installed as a replacement, disk congestion appeared; once the SSD was unmounted, the congestion disappeared. Below is the analysis of this congestion fault.

Symptoms:

At around 3 AM on July 5, an ESXi 6.0 host started reporting congestion on one SSD. The SSD card in question had just been installed as a replacement; after the SSD card was unmounted, the SSD congestion disappeared.

Analysis:

Product and version information:

Huawei Technologies Co., Ltd. RH2288H V2-24S | BIOS: RMIBV503 | Date (ISO-8601): 2015-03-09
VMware ESXi 6.0.0 build-6921384
ESXi 6.0 Patch 6 ESXi600-201711001 11/9/2017 6921384 N/A
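
These details can be read directly from the ESXi shell; a minimal sketch using standard esxcli commands (output abbreviated):

# ESXi product and build (reported above as 6.0.0 build-6921384):
esxcli system version get
# Server vendor and model (reported above as Huawei RH2288H V2):
esxcli hardware platform get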

Checking the driver information for the HBAs used by vSAN shows that vmhba3 and vmhba4, which connect the two SSDs, are not on the vSAN compatibility list (HCL); a command sketch for reproducing this check follows the adapter summary below.
Support Bundle: .(ESXi 6.0 U3) Virtual SAN Enabled: Yes
HBA: vmhba4

Huawei <class> Mass storage controller(19e5:0007 19e5:0007) Status:  Not Listed on HCL
hio 2.1.0.23 Status:  Not checked

HBA: vmhba2

LSI Logic / Symbios Logic LSI2308_2(1000:0087 1000:0087) Status:  Found Match on HCL
mpt2sas 19.00.00.00.1vmw Status:  Driver/Version As per HCL
Recommended Drivers for version ESXi 6.0 U3:
    Driver: mpt2sas Ver:19.00.00.00.1vmw (Match Confidence: 100) Firmware: 19.00.00.00-IT
VCG link: http://vcg-stg-vip-1.vmware.com/comp_guide2/detail.php?deviceCategory=vsanio&productid=39286

HBA: vmhba3

Huawei <class> Mass storage controller(19e5:0007 19e5:0007) Status:  Not Listed on HCL
hio 2.1.0.23 Status:  Not checked

vmhba   driver   version           vendor/device IDs    description
vmhba2  mpt2sas  19.00.00.00.1vmw  1000:0087 1000:0087  LSI Logic / Symbios Logic LSI2308_2
vmhba3  hio      2.1.0.23          19e5:0007 19e5:0007  Huawei Mass storage controller
vmhba4  hio      2.1.0.23          19e5:0007 19e5:0007  Huawei Mass storage controller
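
The adapter, driver, and PCI ID details above can be reproduced on the host itself; a sketch using standard ESXi commands (the driver names hio and mpt2sas are taken from the listing above):

# List SCSI adapters with their bound driver and PCI description:
esxcfg-scsidevs -a
# Show the loaded driver modules' version strings:
vmkload_mod -s hio | grep -i version
vmkload_mod -s mpt2sas | grep -i version
# Confirm the installed driver VIB versions:
esxcli software vib list | grep -E 'hio|mpt2sas'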

Disk Group: 5275c7e2-f296-a38e-9b0d-15fe4aea962c
Device Type In CMMDS Vendor Model Revision Offline? Size Transport HBA
t10.hioa___00030PXS10D6000058 SSD false Huawei ES3000 2.0 false 1121.81GB parallel vmhba4
naa.5000cca0720a8210 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000c500c18343d7 MD false SEAGATE ST1200MM0009 N003 false 1117.81GB sas vmhba2
naa.5000cca0720a5a60 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a15c8 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000c500c15a6cfb MD false SEAGATE ST1200MM0009 N003 false 1117.81GB sas vmhba2
naa.5000cca07209c0cc MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000c500c1a59b2f MD false SEAGATE ST1200MM0009 N003 false 1117.81GB sas vmhba2

Disk Group: 52d8a147-5bf1-2fa3-f755-ffc14a44ab8f
Device Type In CMMDS Vendor Model Revision Offline? Size Transport HBA
naa.5000cca0720a63d4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a1c3c MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca07209d7a8 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
t10.hiob___00030PXT10F3000198 SSD true Huawei ES3000 2.0 false 747.88GB parallel vmhba3
naa.5000cca0720a5cb4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca07209ab74 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a5a4c MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a7bf4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2

The vSAN disk information shows that every device in the first disk group (5275c7e2-f296-a38e-9b0d-15fe4aea962c, the group containing the newly installed SSD) has In CMMDS = false, which suggests that this disk group is not mounted into vSAN. The shell history below shows the disk group being created with the new SSD on July 2 and unmounted on July 5, after the congestion appeared:

2020-07-02T09:04:01Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a5fac -s t10.hioa___00030PXS10D6000058
2020-07-02T09:05:56Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a5fac -s t10.hioa___00030PXS10D6000058
2020-07-02T09:07:25Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000c500c15a6cfb -s t10.hioa___00030PXS10D6000058
2020-07-02T09:07:43Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca07209c0cc -s t10.hioa___00030PXS10D6000058
2020-07-02T09:07:55Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a5a60 -s t10.hioa___00030PXS10D6000058
2020-07-02T09:08:05Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000c500c1a59b2f -s t10.hioa___00030PXS10D6000058
2020-07-02T09:08:13Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a15c8 -s t10.hioa___00030PXS10D6000058
2020-07-02T09:08:25Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000c500c18343d7 -s t10.hioa___00030PXS10D6000058
2020-07-05T00:40:49Z shell[37182]: [root]: esxcli vsan storage diskgroup unmount -d t10.hioa___00030PXS10D6000058
2020-07-05T00:41:08Z shell[37182]: [root]: esxcli vsan storage diskgroup unmount -s t10.hioa___00030PXS10D6000058
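
The mount state of each claimed device can also be verified live; a sketch (on 6.x builds, esxcli vsan storage list prints one block per device, including an "In CMMDS" field):

# "In CMMDS: false" means the device is not participating in the
# cluster, i.e. its disk group is currently unmounted:
esxcli vsan storage list | grep -E 'Device:|In CMMDS'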

Checking the vobd and vmkernel logs shows that the disk group suddenly began reporting congestion at the times below; neither the driver nor the disks logged any errors before the congestion started (a sketch for filtering these entries follows the excerpt).
vobd.log
2020-07-04T01:56:52.204Z: [VsanCorrelator] 60119767788us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:56:52.204Z: [VsanCorrelator] 60120438700us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:57:52.204Z: [VsanCorrelator] 60179767871us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:57:52.204Z: [VsanCorrelator] 60180439130us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:58:52.224Z: [VsanCorrelator] 60239787321us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:58:52.224Z: [VsanCorrelator] 60240459260us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:59:52.226Z: [VsanCorrelator] 60299788319us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:59:52.226Z: [VsanCorrelator] 60300460977us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T02:00:52.233Z: [VsanCorrelator] 60359795073us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T02:00:52.233Z: [VsanCorrelator] 60360468427us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T02:01:52.235Z: [VsanCorrelator] 60419795887us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
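
The frequency of these threshold crossings can be pulled straight from the log; a simple filtering sketch:

# Count Exceeded vs. Normal congestion transitions recorded in vobd.log:
grep vsan.lsom.congestionthreshold /var/log/vobd.log | grep -c 'State: Exceeded'
grep vsan.lsom.congestionthreshold /var/log/vobd.log | grep -c 'State: Normal'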

Conclusion:

Although congestion alarms always point at the disk group's SSD, the underlying cause can be a performance problem anywhere along the I/O path: the HBA, the SSD, or the magnetic disks (MDs).
Investigating the cause of congestion is complex, and since the logs contain no driver, SSD, or MD error messages, the exact cause cannot be determined from the logs alone.
For this host, the recommendation is to install an HCL-compatible driver and firmware for the HBA, then mount the disks again and observe. If the problem persists, consider upgrading ESXi to 6.5 or later and updating the drivers and firmware to matching compatible versions; a remount sketch follows.
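
A minimal sketch of the remount-and-observe step, assuming the diskgroup mount counterpart of the unmount command seen in the shell history above:

# Verify the driver VIB version after replacing it:
esxcli software vib list | grep hio
# Remount the disk group by its cache-tier SSD:
esxcli vsan storage diskgroup mount -s t10.hioa___00030PXS10D6000058
# Watch for new congestion events:
tail -f /var/log/vobd.log | grep -i congestion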
