All 16 hosts in the cluster had been up and running for a long time without any issue, with uptime of 300+ days. On all hosts we could not get access to the VM console. Opening a VM console from viClient returned the error "Unable to connect to the MKS: A general system error occurred: Internal error".
We also could not vMotion VMs to other ESXi hosts in the cluster.
We logged in to the ESXi hosts and noticed that the root ramdisk was full:
# vdf -h | tail -6
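To repeat this check across many hosts, the ramdisk rows can be filtered by usage. This is only a minimal sketch: `full_ramdisks` is a hypothetical helper name, and it assumes the ramdisk section of `vdf -h` has the columns Ramdisk, Size, Used, Available, Use%, Mounted on.

```shell
# Hypothetical helper: read `vdf -h` ramdisk rows on stdin and print the
# name and Use% of every ramdisk at or above the given threshold.
# Assumption: the 5th column is the usage percentage (e.g. "100%").
full_ramdisks() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        $5 ~ /^[0-9]+%$/ {          # skips the header row, whose 5th column is "Use%"
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit) print $1, $5
        }'
}

# On a host (same tail as above, to look only at the ramdisk section):
#   vdf -h | tail -6 | full_ramdisks 90
```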
The uptime of the ESXi hosts was impressive:
# uptime
When we tried to get information about the virtual machines using the vim-cmd command, we got an error:
# vim-cmd vmsvc/getallvms
To figure out what was consuming space on the root ramdisk, we ran:
# find / -size +10k -exec du -h {} \; | egrep -v volumes | egrep -v disks | less
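Once a suspicious directory tree is known, a per-directory summary is often quicker to read than a per-file listing. The sketch below is not what we ran at the time; `largest_dirs` is a hypothetical helper name and the starting directory is whatever you want to inspect.

```shell
# Hypothetical helper: print the N largest directories under a starting point.
# du -k prints "<size in KiB><TAB><path>" per directory; sort puts the largest first.
largest_dirs() {
    dir="$1"
    count="${2:-10}"
    du -k "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example: largest_dirs /opt 5
```

The first line of the output is the starting directory itself (the total), followed by its largest subdirectories.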
I spotted a lot of EMCProvider logs in /opt/emc/cim/log:
# ls -l | head -5
And bingo! These logs were eating the space:
# du -h /opt/emc/cim/log/
It seems that the EMCProvider logs had never been rotated and had filled up the root ramdisk. I couldn't find any parameter in the conf file to set up rotation of EMCProvider logs - it is more feature than bug ;)
We deleted logs older than 200 days (eventually we deleted all EMCProvider logs older than 1 day) on the ESXi hosts in the cluster:
# cd /opt/emc/cim/log/
# find . -name '*.log' -mtime +200 -exec rm -f {} \;
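Because the delete is irreversible, it can be worth listing the matches first. The following is a minimal sketch built on the same `find` options as above; `prune_old_logs` and its `delete` argument are hypothetical names, not part of any VMware or EMC tooling.

```shell
# Hypothetical helper: list *.log files older than N days in a directory,
# and remove them only when the third argument is "delete".
prune_old_logs() {
    dir="$1"
    days="$2"
    mode="${3:-dry-run}"
    if [ "$mode" = "delete" ]; then
        find "$dir" -name '*.log' -mtime +"$days" -exec rm -f {} \;
    else
        # Dry run: only print what would be removed
        find "$dir" -name '*.log' -mtime +"$days"
    fi
}

# Review first, then delete:
#   prune_old_logs /opt/emc/cim/log 200
#   prune_old_logs /opt/emc/cim/log 200 delete
```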
This freed some space on root and we were able to access some VM consoles, but some VMs started to show another error, 'Unable to connect to the MKS: Failed to connect to server fqdn.com:902':
We identified that the VMs hitting the error above were located on 3 ESXi hosts.
We noticed that on the affected ESXi hosts nothing was listening on port 902, even though we already had enough free space on the root ramdisk:
# esxcli network ip connection list | grep :902
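With 16 hosts to check, the same test can be turned into a yes/no answer. This is a rough sketch: `has_listener` is a hypothetical helper name, and it assumes each connection row contains the local address as `ip:port` followed by whitespace and the state `LISTEN` on the same line.

```shell
# Hypothetical helper: read connection rows on stdin and succeed (exit 0)
# only if some row has ":<port>" as an address and is in LISTEN state.
has_listener() {
    port="$1"
    grep -E ":${port}[[:space:]]" | grep -q 'LISTEN'
}

# On a host:
#   esxcli network ip connection list | has_listener 902 \
#       && echo "port 902: listening" || echo "port 902: nothing listening"
```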
VMs that no longer had the console access issue were located on ESXi hosts where 'busybox' was listening on port 902:
We decided to put the affected ESXi hosts into Maintenance Mode (MM) and reboot them. After the reboot, 'busybox' started listening on port 902 and the VM console issue was gone.
The main takeaway is that a full root ramdisk is an abnormal condition. We have to remember that in the *nix world everything is a file, which could explain why some hosts could not create the TCP socket for port 902 once root was full, even after we freed some space on the root ramdisk.
Here are all the steps in one screenshot:
The End.