Failure case: an ESXi 6.5 host rebooted unexpectedly and its NICs logged unexplained link up/down events

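Summary: An ESXi 6.5 host in a vSAN environment rebooted unexpectedly and logged unexplained NIC link up/down events. The fault analysis and the resolution follow.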

Analysis:
The product version information from the host log bundle is as follows.

Huawei RH2288H V3 | BIOS: 3.87 | Date (ISO-8601): 2018-02-02
VMware ESXi 6.5.0 build-10175896
ESXi 6.5 EP 09 ESXi650-201810001 10/02/2018 10175896 N/A

The host last finished booting at the following time.
vmksummary.log:2020-07-17T02:53:25Z bootstop: Host has booted

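The vmkernel, hostd, and syslog logs all show normal output until about 2020-07-17T02:15 UTC, and then the host suddenly rebooted. In each excerpt below, the entries stop at that point and the next lines are boot messages.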
vmkernel.log
2020-07-17T02:00:02.141Z cpu23:66326)ScsiDeviceIO: 2954: Cmd(0x4395c1409c00) 0x1a, CmdSN 0x1bd573 from world 0 to dev "naa.680007385b68e5dd2308219504cf9e7a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2020-07-17T02:04:40.826Z cpu0:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
2020-07-17T02:08:02.130Z cpu4:67185)ScsiDeviceIO: 2954: Cmd(0x43997430e380) 0x1a, CmdSN 0x1bd915 from world 0 to dev "naa.680007385b68e5dd2308219504cf9e7a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2020-07-17T02:09:40.819Z cpu1:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
2020-07-17T02:14:40.872Z cpu0:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
VMB: 112: mbMagic: 2badb002, mbInfo 0x1016d8
VMB: 56: flags a6d
VMB: 59: cmdline: /jumpstrt.gz vmbTrustedBoot=false tboot=0x101b000 installerDiskDumpSlotSize=2560 no-auto-partition bootUUID=75c473852a000da92117490a882df41e
VMB: 64: 139 boot modules @ 0x100e08
VMB: 71: mmap_addr 0x101750 (476b)
VMB: 77: VBE Mode: 0x118

hostd.log
2020-07-17T02:14:42.554Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:42.554Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.config, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:42.657Z verbose hostd[34140B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.runtime, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:59.046Z verbose hostd[34140B70] [Originator@6876 sub=PropertyProvider opID=d3c3bc66 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.queryStats-111362357. Applied change to temp map.
2020-07-17T02:14:59.047Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider opID=d3c3bc67 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.summarizeStats-111362358. Applied change to temp map.
2020-07-17T02:15:00.067Z verbose hostd[33880B70] [Originator@6876 sub=PropertyProvider opID=4a8bb9ae-7e-bc68 user=vpxuser:PerfCapService] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.summarizeStats-111362359. Applied change to temp map.
2020-07-17T02:52:59Z mark: storage-path-claim-completed
2020-07-17T02:53:17.543Z Section for VMware ESX, pid=71807, version=6.5.0, build=10175896, option=Release
2020-07-17T02:53:17.543Z verbose hostd[B648B80] [Originator@6876 sub=Default] Dumping early logs:

syslog.log
2020-07-17T02:14:01Z root: There are 1 /usr/lib/vmware/vsan/bin/vsanObserver.sh running ...
2020-07-17T02:14:01Z root: Calc for ramdisk mounted on /, freeMB:28
2020-07-17T02:14:02Z root: vsantraces is on device 6316156880550863952
2020-07-17T02:14:02Z root: Found file system entry for vsantraces: /vmfs/volumes/5beaf5d5-66cb81db-62d1-340a98824832 esx5f-local-storage-1 5beaf5d5-66cb81db-62d1-340a98824832 true VMFS-5 590826438656 568564121600
2020-07-17T02:14:02Z root: CalcFreeSpace sizeKB: 53248, freeMB: 542225
2020-07-17T02:36:32Z watchdog-vobd: [66052] Begin '/usr/lib/vmware/vob/bin/vobd', min-uptime = 60, max-quick-failures = 5, max-total-failures = 1000000, bg_pid_file = '', reboot-flag = '0'
2020-07-17T02:36:32Z watchdog-vobd: Executing '/usr/lib/vmware/vob/bin/vobd'
2020-07-17T02:36:32Z jumpstart[66037]: Launching Executor
2020-07-17T02:36:32Z jumpstart[66037]: Setting up Executor - Reset Requested
2020-07-17T02:36:32Z jumpstart[66037]: BmcInfoImpl: Retrieve Version information failed
2020-07-17T02:36:32Z jumpstart[66037]: ignoring plugin 'vsan-upgrade' because version '2.0.0' has already been run.
2020-07-17T02:36:32Z jumpstart[66037]: executing start plugin: check-required-memory
2020-07-17T02:36:32Z jumpstart[66037]: executing start plugin: restore-configuration
2020-07-17T02:36:32Z jumpstart[66083]: restoring configuration
2020-07-17T02:36:32Z jumpstart[66083]: extracting from file /local.tgz
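
The point where a log goes silent can also be located mechanically by looking for the widest gap between consecutive timestamps. The following is a minimal Python sketch of that idea, not part of the original analysis; the function names `parse_ts` and `largest_gap` are hypothetical, and the sample lines are copied from the syslog.log excerpt above.

```python
import re
from datetime import datetime, timezone

# Matches the ISO-8601 UTC timestamp at the start of an ESXi log line,
# with or without fractional seconds.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")

def parse_ts(line):
    """Return the line's leading timestamp as an aware datetime, or None."""
    m = TS_RE.match(line)
    if not m:
        return None  # e.g. the early-boot "VMB:" lines carry no timestamp
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    return ts.replace(tzinfo=timezone.utc)

def largest_gap(lines):
    """Return (gap, last_before, first_after) for the widest silence, or None."""
    stamps = [ts for ts in map(parse_ts, lines) if ts is not None]
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        if best is None or cur - prev > best[0]:
            best = (cur - prev, prev, cur)
    return best

# Three lines taken from the syslog.log excerpt above.
sample = [
    "2020-07-17T02:14:01Z root: Calc for ramdisk mounted on /, freeMB:28",
    "2020-07-17T02:14:02Z root: CalcFreeSpace sizeKB: 53248, freeMB: 542225",
    "2020-07-17T02:36:32Z jumpstart[66037]: Launching Executor",
]
gap, before, after = largest_gap(sample)
print(f"silent for {gap} between {before.isoformat()} and {after.isoformat()}")
```

Running the same function over the full vmkernel.log and hostd.log should show whether all three logs go quiet at about the same moment, which is the pattern described above.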

The IPMI log shows that the host restart was initiated at 2020-07-17T02:17:11 UTC.
commands/localcli_hardware-ipmi-sel-list--p--i--n-all.txt
Record:585:
Record Id: 585
When: 2020-07-17T02:17:11
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + System Boot Initiated System Restart
Sensor Number: 87
Raw:
Formatted-Raw: 49 02 02 27 0a 11 5f 20 00 04 1d 57 6f c7 06 0d
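
The Formatted-Raw bytes can be cross-checked against the decoded fields shown in the record. The sketch below assumes the standard 16-byte system event record layout from the IPMI specification (record ID, record type, timestamp, generator ID, EvM revision, sensor type, sensor number, event direction and type, three event data bytes); `decode_sel_record` is a hypothetical helper name. The three event data bytes are left undecoded, because their meaning depends on the sensor type and on vendor documentation.

```python
from datetime import datetime, timezone

def decode_sel_record(raw_hex):
    """Decode the fixed fields of a 16-byte IPMI system event record.

    Field offsets follow the standard SEL event record layout; this is an
    illustrative sketch, not a full IPMI decoder.
    """
    b = bytes.fromhex(raw_hex)
    if len(b) != 16:
        raise ValueError("a SEL record is 16 bytes long")
    seconds = int.from_bytes(b[3:7], "little")  # seconds since 1970-01-01 UTC
    return {
        "record_id": int.from_bytes(b[0:2], "little"),
        "record_type": b[2],
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "generator_id": int.from_bytes(b[7:9], "little"),
        "evm_rev": b[9],
        "sensor_type": b[10],
        "sensor_number": b[11],
        # Bit 7 of byte 12 is the direction, the low 7 bits are the event type.
        "event_dir": "deassert" if b[12] & 0x80 else "assert",
        "event_type": b[12] & 0x7F,
        "event_data": b[13:16].hex(" "),  # left undecoded on purpose
    }

record = decode_sel_record("49 02 02 27 0a 11 5f 20 00 04 1d 57 6f c7 06 0d")
for key, value in record.items():
    print(f"{key}: {value}")
```

The decoded record ID, timestamp, sensor number, and event type can then be compared with the `Record Id`, `When`, `Sensor Number`, and `Event Type` lines above. The raw event data bytes are worth passing to the hardware vendor along with the record.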

We therefore conclude that the host rebooted abruptly.

The NIC link-down events occurred while the NICs were being initialized during the reboot, so they are not abnormal.
vobd.log
2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: [vob.net.vmnic.linkstate.down] vmnic vmnic15 linkstate down
2020-07-17T02:36:51.321Z: [netCorrelator] 49875555us: [vob.net.vmnic.linkstate.down] vmnic vmnic11 linkstate down
2020-07-17T02:36:52.002Z: [netCorrelator] 50556753us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic11 linkstate is down
2020-07-17T02:36:52.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:52.002Z: [netCorrelator] 50556895us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic15 linkstate is down
2020-07-17T02:36:52.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:52.679Z: [netCorrelator] 51233568us: [vob.net.vmnic.linkstate.down] vmnic vmnic9 linkstate down
2020-07-17T02:36:52.728Z: [netCorrelator] 51283400us: [vob.net.vmnic.linkstate.down] vmnic vmnic13 linkstate down
2020-07-17T02:36:53.002Z: [netCorrelator] 51556645us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic13 linkstate is down
2020-07-17T02:36:53.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:53.002Z: [netCorrelator] 51556749us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic9 linkstate is down
2020-07-17T02:36:53.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:39:52.001Z: Failed to send event (esx.problem.net.vmnic.linkstate.down); 2 failures so far.
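
One way to back up the claim that the link-down events belong to the reboot is to check that each of them falls between the restart recorded in the IPMI log (02:17:11) and the completed boot recorded in vmksummary.log (02:53:25). The following Python sketch does that for lines copied from the vobd.log excerpt above; the helper names and the choice of window boundaries are assumptions made for illustration.

```python
import re
from datetime import datetime

# Matches the netCorrelator "linkstate down" observation lines in vobd.log.
LINK_DOWN_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+Z: \[netCorrelator\] \d+us: "
    r"\[vob\.net\.vmnic\.linkstate\.down\] vmnic (vmnic\d+) linkstate down"
)

def link_down_events(lines):
    """Return (timestamp, nic) pairs; all timestamps are naive UTC."""
    events = []
    for line in lines:
        m = LINK_DOWN_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            events.append((ts, m.group(2)))
    return events

def outside_window(events, start, end):
    """Return the events that do NOT fall inside [start, end]."""
    return [(ts, nic) for ts, nic in events if not (start <= ts <= end)]

restart_initiated = datetime(2020, 7, 17, 2, 17, 11)  # IPMI SEL record 585
boot_completed = datetime(2020, 7, 17, 2, 53, 25)     # vmksummary.log bootstop

sample = [
    "2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: [vob.net.vmnic.linkstate.down] vmnic vmnic15 linkstate down",
    "2020-07-17T02:36:51.321Z: [netCorrelator] 49875555us: [vob.net.vmnic.linkstate.down] vmnic vmnic11 linkstate down",
    "2020-07-17T02:36:52.002Z: [netCorrelator] 50556753us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic11 linkstate is down",
    "2020-07-17T02:36:52.679Z: [netCorrelator] 51233568us: [vob.net.vmnic.linkstate.down] vmnic vmnic9 linkstate down",
    "2020-07-17T02:36:52.728Z: [netCorrelator] 51283400us: [vob.net.vmnic.linkstate.down] vmnic vmnic13 linkstate down",
]
events = link_down_events(sample)
suspicious = outside_window(events, restart_initiated, boot_completed)
print("link-down events:", [nic for _, nic in events])
print("events outside the reboot window:", suspicious)
```

An empty second list means every link-down event in the input happened during the reboot. Any entry that did show up there would be a link flap during normal operation and would deserve a separate look.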

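Resolution:

The host logs contain no information that would account for the reboot, so the next step is to contact the host hardware vendor for further investigation.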
