Fault case: an ESXi 6.5 host rebooted unexpectedly and logged unexplained NIC link up/down events

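Summary: In a vSAN environment, an ESXi 6.5 host rebooted unexpectedly and also logged unexplained NIC link up/down events. The failure analysis and the resolution follow.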

Analysis:
The product version information from the host log bundle is shown below.

Huawei RH2288H V3 | BIOS: 3.87 | Date (ISO-8601): 2018-02-02
VMware ESXi 6.5.0 build-10175896
ESXi 6.5 EP 09 ESXi650-201810001 10/02/2018 10175896 N/A

The host last finished booting at the following time.
vmksummary.log:2020-07-17T02:53:25Z bootstop: Host has booted

The vmkernel, hostd, and syslog logs were all being written normally until about 2020-07-17T02:15 UTC, at which point the host rebooted abruptly.
vmkernel.log
2020-07-17T02:00:02.141Z cpu23:66326)ScsiDeviceIO: 2954: Cmd(0x4395c1409c00) 0x1a, CmdSN 0x1bd573 from world 0 to dev "naa.680007385b68e5dd2308219504cf9e7a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2020-07-17T02:04:40.826Z cpu0:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
2020-07-17T02:08:02.130Z cpu4:67185)ScsiDeviceIO: 2954: Cmd(0x43997430e380) 0x1a, CmdSN 0x1bd915 from world 0 to dev "naa.680007385b68e5dd2308219504cf9e7a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2020-07-17T02:09:40.819Z cpu1:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
2020-07-17T02:14:40.872Z cpu0:67600)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk naa.5b44326c4162f002:2
VMB: 112: mbMagic: 2badb002, mbInfo 0x1016d8
VMB: 56: flags a6d
VMB: 59: cmdline: /jumpstrt.gz vmbTrustedBoot=false tboot=0x101b000 installerDiskDumpSlotSize=2560 no-auto-partition bootUUID=75c473852a000da92117490a882df41e
VMB: 64: 139 boot modules @ 0x100e08
VMB: 71: mmap_addr 0x101750 (476b)
VMB: 77: VBE Mode: 0x118

hostd.log
2020-07-17T02:14:42.554Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: config, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:42.554Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.config, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:42.657Z verbose hostd[34140B70] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.runtime, ha-root-pool. Sent notification immediately.
2020-07-17T02:14:59.046Z verbose hostd[34140B70] [Originator@6876 sub=PropertyProvider opID=d3c3bc66 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.queryStats-111362357. Applied change to temp map.
2020-07-17T02:14:59.047Z verbose hostd[33B81B70] [Originator@6876 sub=PropertyProvider opID=d3c3bc67 user=vpxuser] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.summarizeStats-111362358. Applied change to temp map.
2020-07-17T02:15:00.067Z verbose hostd[33880B70] [Originator@6876 sub=PropertyProvider opID=4a8bb9ae-7e-bc68 user=vpxuser:PerfCapService] RecordOp ASSIGN: info, haTask--vim.PerformanceManager.summarizeStats-111362359. Applied change to temp map.
2020-07-17T02:52:59Z mark: storage-path-claim-completed
2020-07-17T02:53:17.543Z Section for VMware ESX, pid=71807, version=6.5.0, build=10175896, option=Release
2020-07-17T02:53:17.543Z verbose hostd[B648B80] [Originator@6876 sub=Default] Dumping early logs:

syslog.log
2020-07-17T02:14:01Z root: There are 1 /usr/lib/vmware/vsan/bin/vsanObserver.sh running ...
2020-07-17T02:14:01Z root: Calc for ramdisk mounted on /, freeMB:28
2020-07-17T02:14:02Z root: vsantraces is on device 6316156880550863952
2020-07-17T02:14:02Z root: Found file system entry for vsantraces: /vmfs/volumes/5beaf5d5-66cb81db-62d1-340a98824832 esx5f-local-storage-1 5beaf5d5-66cb81db-62d1-340a98824832 true VMFS-5 590826438656 568564121600
2020-07-17T02:14:02Z root: CalcFreeSpace sizeKB: 53248, freeMB: 542225
2020-07-17T02:36:32Z watchdog-vobd: [66052] Begin '/usr/lib/vmware/vob/bin/vobd', min-uptime = 60, max-quick-failures = 5, max-total-failures = 1000000, bg_pid_file = '', reboot-flag = '0'
2020-07-17T02:36:32Z watchdog-vobd: Executing '/usr/lib/vmware/vob/bin/vobd'
2020-07-17T02:36:32Z jumpstart[66037]: Launching Executor
2020-07-17T02:36:32Z jumpstart[66037]: Setting up Executor - Reset Requested
2020-07-17T02:36:32Z jumpstart[66037]: BmcInfoImpl: Retrieve Version information failed
2020-07-17T02:36:32Z jumpstart[66037]: ignoring plugin 'vsan-upgrade' because version '2.0.0' has already been run.
2020-07-17T02:36:32Z jumpstart[66037]: executing start plugin: check-required-memory
2020-07-17T02:36:32Z jumpstart[66037]: executing start plugin: restore-configuration
2020-07-17T02:36:32Z jumpstart[66083]: restoring configuration
2020-07-17T02:36:32Z jumpstart[66083]: extracting from file /local.tgz
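
The point where each log goes silent can also be located mechanically. The following is a minimal Python sketch, not part of the original analysis: it assumes each relevant line starts with a UTC timestamp in the form seen in the excerpts above, and the helper names `parse_ts` and `largest_gap` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches the UTC timestamp at the start of an ESXi log line, with or
# without fractional seconds (both forms appear in the excerpts above).
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")

def parse_ts(line):
    """Return the leading timestamp as an aware datetime, or None."""
    m = TS_RE.match(line)
    if not m:
        return None  # e.g. the "VMB: ..." boot lines carry no timestamp
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int((m.group(2) or "0").ljust(6, "0")[:6])
    return ts.replace(microsecond=micros, tzinfo=timezone.utc)

def largest_gap(lines):
    """Return (last_before, first_after) around the longest silent period,
    or None if fewer than two timestamped lines are present."""
    stamps = [t for t in map(parse_ts, lines) if t is not None]
    pairs = zip(stamps, stamps[1:])
    return max(pairs, key=lambda p: p[1] - p[0], default=None)

# Three lines taken from the hostd.log excerpt (message text shortened).
sample = [
    "2020-07-17T02:14:59.047Z verbose hostd[33B81B70] RecordOp ASSIGN",
    "2020-07-17T02:15:00.067Z verbose hostd[33880B70] RecordOp ASSIGN",
    "2020-07-17T02:52:59Z mark: storage-path-claim-completed",
]
before, after = largest_gap(sample)
print(before.isoformat(), "->", after.isoformat(), "gap:", after - before)
```

Running the same function over each log file gives the last line written before the reboot and the first line written after it, which makes it easy to compare the three logs side by side.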

The IPMI system event log shows that the host restart was initiated at 2020-07-17T02:17:11 UTC.
commands/localcli_hardware-ipmi-sel-list--p--i--n-all.txt
Record:585:
Record Id: 585
When: 2020-07-17T02:17:11
Event Type: 111 (Unknown)
SEL Type: 2 (System Event)
Message: Assert + System Boot Initiated System Restart
Sensor Number: 87
Raw:
Formatted-Raw: 49 02 02 27 0a 11 5f 20 00 04 1d 57 6f c7 06 0d
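
As a cross-check, the `Formatted-Raw` bytes can be decoded by hand. The sketch below is illustrative only: it assumes the line holds the full 16-byte system event record laid out as described in the IPMI 2.0 specification (little-endian record ID and timestamp), and the function name `decode_sel_record` is hypothetical. The three event data bytes are left as raw hex rather than interpreted.

```python
import struct
from datetime import datetime, timezone

def decode_sel_record(formatted_raw):
    """Decode a 16-byte IPMI SEL system event record given as hex bytes."""
    data = bytes.fromhex(formatted_raw)
    if len(data) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(data))
    # Assumed layout: record ID (2), record type (1), timestamp (4),
    # generator ID (2), EvM revision (1), sensor type (1),
    # sensor number (1), event dir/type (1), event data (3).
    (record_id, record_type, timestamp, generator_id, evm_rev,
     sensor_type, sensor_number, dir_type) = struct.unpack_from("<HBIHBBBB", data)
    return {
        "record_id": record_id,
        "record_type": record_type,
        "when": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "generator_id": generator_id,
        "evm_rev": evm_rev,
        "sensor_type": sensor_type,
        "sensor_number": sensor_number,
        "asserted": (dir_type & 0x80) == 0,  # bit 7 clear means assertion
        "event_type": dir_type & 0x7F,
        "event_data": data[13:16].hex(" "),
    }

record = decode_sel_record("49 02 02 27 0a 11 5f 20 00 04 1d 57 6f c7 06 0d")
for key, value in record.items():
    print(key, "=", value)
```

Under these assumptions the decoded record ID, timestamp, sensor number, and event type agree with the `Record Id`, `When`, `Sensor Number`, and `Event Type` fields printed in the record above.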

We therefore conclude that the host rebooted abruptly.

The NIC link-down events occurred while the NICs were being initialized during the reboot, so they are not abnormal.
vobd.log
2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: [vob.net.vmnic.linkstate.down] vmnic vmnic15 linkstate down
2020-07-17T02:36:51.321Z: [netCorrelator] 49875555us: [vob.net.vmnic.linkstate.down] vmnic vmnic11 linkstate down
2020-07-17T02:36:52.002Z: [netCorrelator] 50556753us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic11 linkstate is down
2020-07-17T02:36:52.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:52.002Z: [netCorrelator] 50556895us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic15 linkstate is down
2020-07-17T02:36:52.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:52.679Z: [netCorrelator] 51233568us: [vob.net.vmnic.linkstate.down] vmnic vmnic9 linkstate down
2020-07-17T02:36:52.728Z: [netCorrelator] 51283400us: [vob.net.vmnic.linkstate.down] vmnic vmnic13 linkstate down
2020-07-17T02:36:53.002Z: [netCorrelator] 51556645us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic13 linkstate is down
2020-07-17T02:36:53.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:36:53.002Z: [netCorrelator] 51556749us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic9 linkstate is down
2020-07-17T02:36:53.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-07-17T02:39:52.001Z: Failed to send event (esx.problem.net.vmnic.linkstate.down); 2 failures so far.
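
One way to check that these link-down events belong to the boot sequence is to read the `...us` counter on each `netCorrelator` line. The Python sketch below rests on an assumption the source does not state: that this counter is the number of microseconds elapsed since the VMkernel started. The name `link_events` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches the raw "vob.net.vmnic.linkstate" lines in vobd.log.
VOB_RE = re.compile(
    r"^(?P<ts>\S+)Z: \[netCorrelator\] (?P<us>\d+)us: "
    r"\[vob\.net\.vmnic\.linkstate\.(?P<state>up|down)\] vmnic (?P<nic>vmnic\d+)"
)

def link_events(lines):
    """Yield (nic, state, seconds_since_boot, estimated_boot_time)."""
    for line in lines:
        m = VOB_RE.match(line)
        if not m:
            continue  # skip esx.problem.* and retry messages
        wall = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        # Assumption: the counter is microseconds since VMkernel start.
        uptime = timedelta(microseconds=int(m["us"]))
        yield m["nic"], m["state"], uptime.total_seconds(), wall - uptime

sample = [
    "2020-07-17T02:36:51.207Z: [netCorrelator] 49762304us: "
    "[vob.net.vmnic.linkstate.down] vmnic vmnic15 linkstate down",
    "2020-07-17T02:36:52.728Z: [netCorrelator] 51283400us: "
    "[vob.net.vmnic.linkstate.down] vmnic vmnic13 linkstate down",
]
for nic, state, secs, boot in link_events(sample):
    print(f"{nic} {state} {secs:.1f}s after start (start ~ {boot.isoformat()})")
```

If that assumption holds, every link-down event in the excerpt falls within roughly the first 52 seconds after the kernel started, which is consistent with the events being part of NIC initialization rather than a separate fault.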

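Resolution:

Because the host logs contain no information that would account for the reboot, the next step is to contact the server hardware vendor for further investigation.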
