Fault case: an ESXi 6.5 host reboots unexpectedly, with unexplained NIC up/down events

Summary: In a vSAN environment, an ESXi 6.5 host rebooted unexpectedly and logged unexplained NIC up/down events. The fault analysis and the resolution follow.

Analysis:
Product version information from the host log bundle:

Huawei RH2288H V3 | BIOS: 3.87 | Date (ISO-8601): 2018-02-02
VMware ESXi 6.5.0 build-10175896
ESXi 6.5 EP 09 ESXi650-201810001 10/02/2018 10175896 N/A

The time at which the host last finished booting:
vmksummary.log:2020-07-17T02:53:25Z bootstop: Host has booted

Checking the vmkernel, hostd, and syslog logs confirms that all three were still being written until about 2020-07-17T02:15 UTC, and then the host rebooted abruptly.
vmkernel.log
2020-07-17T02:00:02.141Z cpu23:66326)ScsiDeviceIO: 2954: Cmd(0x4395c1409c00) 0x1a, CmdSN 0x1bd573 from world 0 to dev "naa.680007385b68e5dd2308219504cf9e7a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2020-07-17T02:04:40.826Z cpu0:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
2020-07-17T02:08:02.130Z cpu4:67185)ScsiDeviceIO: 2954: Cmd(0x43997430e380) 0x1a, CmdSN 0x1bd915 from world 0 to dev "naa.680007385b68e5dd2308219504cf9e7a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2020-07-17T02:09:40.819Z cpu1:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
2020-07-17T02:14:40.872Z cpu0:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
VMB: 112: mbMagic: 2badb002, mbInfo 0x1016d8
VMB: 56: flags a6d
VMB: 59: cmdline: /jumpstrt.gz vmbTrustedBoot=false tboot=0x101b000 installerDiskDumpSlotSize=2560 no-auto-partition bootUUID=75c473852a000da92117490a882df41e
VMB: 64: 139 boot modules @ 0x100e08
VMB: 71: mmap_addr 0x101750 (476b)
VMB: 77: VBE Mode: 0x118

hostd.log
2020-07-17T02:14:42.554Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:42.554Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.config, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:42.657Z verbose hostd[34140B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.runtime, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:59.046Z verbose hostd[34140B70] [Originator@6876 sub=PropertyProvider opID=d3c3bc66 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.queryStats-111362357. Applied change to temp map.
2020-07-17T02:14:59.047Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider opID=d3c3bc67 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.summarizeStats-111362358. Applied change to temp map.
2020-07-17T02:15:00.067Z verbose hostd[33880B70] [Originator@6876 sub=PropertyProvider opID=4a8bb9ae-7e-bc68 user=vpxuser:PerfCapService] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.summarizeStats-111362359. Applied change to temp map.
2020-07-17T02:52:59Z mark: storage-path-claim-completed
2020-07-17T02:53:17.543Z Section for VMware ESX, pid=71807, version=6.5.0, build=10175896, option=Release
2020-07-17T02:53:17.543Z verbose hostd[B648B80] [Originator@6876 sub=Default] Dumping early logs:

syslog.log
2020-07-17T02:14:01Z root: There are 1 /usr/lib/vmware/vsan/bin/vsanObserver.sh running ...
2020-07-17T02:14:01Z root: Calc for ramdisk mounted on /, freeMB:28
2020-07-17T02:14:02Z root: vsantraces is on device 6316156880550863952
2020-07-17T02:14:02Z root: Found file system entry for vsantraces: /vmfs/volumes/5beaf5d5-66cb81db-62d1-340a98824832 esx5f-local-storage-1 5beaf5d5-66cb81db-62d1-340a98824832 true VMFS-5 590826438656 568564121600
2020-07-17T02:14:02Z root: CalcFreeSpace sizeKB: 53248, freeMB: 542225
2020-07-17T02:36:32Z watchdog-vobd: [66052] Begin '/usr/lib/vmware/vob/bin/vobd', min-uptime = 60, max-quick-failures = 5, max-total-failures = 1000000, bg_pid_file = '', reboot-flag = '0'
2020-07-17T02:36:32Z watchdog-vobd: Executing '/usr/lib/vmware/vob/bin/vobd'
2020-07-17T02:36:32Z jumpstart[66037]: Launching Executor
2020-07-17T02:36:32Z jumpstart[66037]: Setting up Executor - Reset Requested
2020-07-17T02:36:32Z jumpstart[66037]: BmcInfoImpl: Retrieve Version information failed
2020-07-17T02:36:32Z jumpstart[66037]: ignoring plugin 'vsan-upgrade' because version '2.0.0' has already been run.
2020-07-17T02:36:32Z jumpstart[66037]: executing start plugin: check-required-memory
2020-07-17T02:36:32Z jumpstart[66037]: executing start plugin: restore-configuration
2020-07-17T02:36:32Z jumpstart[66083]: restoring configuration
2020-07-17T02:36:32Z jumpstart[66083]: extracting from file /local.tgz
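
The point where a log stops and resumes can also be located mechanically. The following is a minimal Python sketch, not part of the original analysis: it assumes every relevant line starts with a UTC timestamp in the format seen above, and the names `parse_ts` and `largest_gap` are made up for illustration. The sample lines are abbreviated from the hostd.log excerpt.

```python
import re
from datetime import datetime, timedelta, timezone

# Leading timestamp such as 2020-07-17T02:15:00.067Z or 2020-07-17T02:52:59Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_ts(line):
    """Return the leading UTC timestamp of a log line, or None if absent."""
    m = TS_RE.match(line)
    if not m:
        return None  # e.g. the "VMB: ..." boot lines carry no timestamp
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    base = base.replace(tzinfo=timezone.utc)
    frac = (m.group(2) or "").ljust(6, "0")[:6]  # pad fraction to microseconds
    return base + timedelta(microseconds=int(frac))


def largest_gap(lines):
    """Return (last_before, first_after, gap) for the widest silence in a log."""
    stamps = [ts for ts in map(parse_ts, lines) if ts is not None]
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        if best is None or cur - prev > best[2]:
            best = (prev, cur, cur - prev)
    return best


# Lines abbreviated from the hostd.log excerpt (message text trimmed).
sample = [
    "2020-07-17T02:14:59.047Z verbose hostd[33B81B70]",
    "2020-07-17T02:15:00.067Z verbose hostd[33880B70]",
    "2020-07-17T02:52:59Z mark: storage-path-claim-completed",
    "2020-07-17T02:53:17.543Z verbose hostd[B648B80]",
]

last_before, first_after, gap = largest_gap(sample)
print("log stopped at:", last_before.isoformat())
print("log resumed at:", first_after.isoformat())
print("silence:", gap)
```

Running the same function over the full vmkernel.log, hostd.log, and syslog.log makes it easy to compare where each log went silent.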

The IPMI log shows that the host restart was initiated at 2020-07-17T02:17:11 UTC.
commands/localcli_hardware-ipmi-sel-list--p--i--n-all.txt
Record:585:
Record Id: 585
When: 2020-07-17T02:17:11
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + System Boot Initiated System Restart
Sensor Number: 87
Raw:
Formatted-Raw: 49 02 02 27 0a 11 5f 20 00 04 1d 57 6f c7 06 0d
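
The `Formatted-Raw` bytes can be decoded by hand to cross-check the fields printed above. The sketch below assumes the record follows the standard 16-byte IPMI System Event Record layout (record ID, record type, 32-bit little-endian timestamp, generator ID, EvM revision, sensor type, sensor number, event direction/type, three event data bytes). The function name `decode_sel_record` is hypothetical, and the event data bytes are left uninterpreted.

```python
from datetime import datetime, timezone


def decode_sel_record(raw_hex):
    """Decode a 16-byte IPMI system event record given as spaced hex bytes."""
    b = bytes.fromhex(raw_hex)
    if len(b) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(b))
    seconds = int.from_bytes(b[3:7], "little")  # seconds since 1970-01-01
    return {
        "record_id": int.from_bytes(b[0:2], "little"),
        "record_type": b[2],
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "sensor_type": b[10],
        "sensor_number": b[11],
        "asserted": (b[12] & 0x80) == 0,   # bit 7 clear means assertion
        "event_type": b[12] & 0x7F,        # low 7 bits: event/reading type
        "event_data": list(b[13:16]),      # left raw, not interpreted here
    }


rec = decode_sel_record("49 02 02 27 0a 11 5f 20 00 04 1d 57 6f c7 06 0d")
for key, value in rec.items():
    print(key, "=", value)
```

The decoded record ID, sensor number, event type, and timestamp should match the `Record Id`, `Sensor Number`, `Event Type`, and `When` fields of the listing, which also indicates that the `When` value is in UTC.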

The host is therefore judged to have rebooted abruptly.

The NIC link-down events occurred while the NICs were being initialized during the reboot, so they are not abnormal.
vobd.log
2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: [vob.net.vmnic.linkstate.down] vmnic vmnic15 linkstate down
2020-07-17T02:36:51.321Z: [netCorrelator] 49875555us: [vob.net.vmnic.linkstate.down] vmnic vmnic11 linkstate down
2020-07-17T02:36:52.002Z: [netCorrelator] 50556753us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic11 linkstate is down
2020-07-17T02:36:52.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:52.002Z: [netCorrelator] 50556895us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic15 linkstate is down
2020-07-17T02:36:52.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:52.679Z: [netCorrelator] 51233568us: [vob.net.vmnic.linkstate.down] vmnic vmnic9 linkstate down
2020-07-17T02:36:52.728Z: [netCorrelator] 51283400us: [vob.net.vmnic.linkstate.down] vmnic vmnic13 linkstate down
2020-07-17T02:36:53.002Z: [netCorrelator] 51556645us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic13 linkstate is down
2020-07-17T02:36:53.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:53.002Z: [netCorrelator] 51556749us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic9 linkstate is down
2020-07-17T02:36:53.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:39:52.001Z: Failed to send event (esx.problem.net.vmnic.linkstate.down); 2 failures so far.
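
One way to check that the link-down events belong to the boot sequence is to look at the `...us` counter in each netCorrelator line. The sketch below assumes this counter is the host uptime in microseconds, which is an interpretation and not something the log states, and the helper name `parse_vob_line` is made up for illustration. Under that assumption, subtracting the counter from the wall-clock time gives an estimated boot start.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches e.g. "2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: [vob....] text"
VOB_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z: "
    r"\[netCorrelator\] (\d+)us: \[([\w.]+)\] (.*)$"
)


def parse_vob_line(line):
    """Return (wall_time, uptime_seconds, event_id, estimated_boot) or None."""
    m = VOB_RE.match(line)
    if not m:
        return None
    wall = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    wall = wall.replace(tzinfo=timezone.utc)
    uptime_us = int(m.group(2))  # ASSUMPTION: microseconds since boot
    boot = wall - timedelta(microseconds=uptime_us)
    return wall, uptime_us / 1_000_000, m.group(3), boot


line = ("2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: "
        "[vob.net.vmnic.linkstate.down] vmnic vmnic15 linkstate down")
wall, uptime_s, event_id, boot = parse_vob_line(line)
print(event_id, "at uptime", round(uptime_s, 1), "s")
print("estimated boot start:", boot.isoformat())
```

For the first event this gives an uptime of roughly 50 seconds, which is consistent with the link-down events being raised during driver initialization shortly after boot and not during normal operation.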

Resolution:

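The host logs contain no information that would explain the reboot, so the next step is to contact the server hardware vendor for further investigation.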
