Fixing Insufficient resources to satisfy configured failover level for HA



This post comes from a few days of poring over manuals as well as some technical support.  This is a good one.  The error came from trying to power on a VM in our VMware cluster and we would get these errors:

“Insufficient resources to satisfy configured failover level for HA”

[Screenshot: VMware “insufficient resources to satisfy configured failover level for HA” error]

And this alert on our cluster

“Insufficient resources to satisfy HA failover level on cluster vmCluster in vmTST”

Our way of thinking was we had to power one off to power another one on.

But that didn’t work.

Here is the actual solution. (P.S. Great VMware HA education for me on this one!)

PROBLEM SOURCE: VMware HA is turned on and you are violating constraints


VMware HA is turned on, and you have it configured so that there is a certain amount of resource reserve for failover.  By turning on this VM, you are going to dip into that resource reserve and so VMware is telling you “Nope, not turning it on….”

There is a quick fix to get the VM turned on (one good way, one bad way), and then there are two long term fixes for you to consider.  In my case, the first one was faster, while the second one was better for my environment.

My VMware environment

Datacenter: vmTST
Cluster: vmCluster
OS: ESXi 4.1.0
Five (5) servers in a cluster.

VMware environment

My VMware Cluster Errors

As mentioned above:

“Insufficient resources to satisfy configured failover level for HA”

and

“Insufficient resources to satisfy HA failover level on cluster vmCluster in vmTST”

TWO WAYS TO DO QUICK FIX


  1. Turning off HA (popular, and I would say WRONG)

  2. Disable Admission Control (much better!!)

#1: Turning off HA (though I recommend against it)

This is the solution I saw on some forums (including the VMware forums).  After looking at it more, I recommend against it, and I’ll explain why, but here it is:

vSphere Client: Browse Inventory -> Hosts and Clusters

Browse vmware inventory hosts and clusters

Edit VMware cluster settings

Right Click on Cluster name -> Edit Settings

VMware cluster: edit settings

Turning off HA

While this works, whenever you turn HA back on it has to redo the HA failover calculation.  That is bad, especially for testing or temporary power-ons.

WRONG WAY: do not turn off VMware HA

#2: Disable “Admission Control” (better IMO)

Better to disable “Admission Control” so VMs will power on despite violating availability constraints.  This way your HA is still on.  In the long run, though, it is better to fix the underlying issue.

Same window, but next bullet item on the left:

VMware: better to disable admission control

LONG TERM FIX: TWO WAYS

There are two things I ended up having to look at.  The first is a pretty good long-term fix that I found suggested on forums, including the VMware forums.

The second is the actual fix to my problem, and the best one in the long term.


FIX #1: Change from “Host Failures Cluster Tolerates” to “Percentage of cluster resources reserved as failover spare capacity”

In other words, instead of telling VMware you want to have enough resource reserve so that you can lose one host, you are telling VMware you want to have a certain percentage of resources unused for failover.

We had it configured to tolerate losing one host.  Switching to a percentage was a quick and easy fix for my environment.

VMware HA: Host failures cluster tolerates (?)

VMware HA: Host Failures Cluster Tolerates

So if we look at the “VMware HA” window, you’ll see that my “Host failures cluster tolerates” was set to 1.  Now with 5 servers you would think that means “20%”, but that’s not so.  Suppose one (or more) of your VMs, for whatever reason, reserved 75% of a host’s resources: by the worst-case calculation every VM is then assumed to need that much, so each host could hold only one VM.

A worst-case calculation based on your largest VM determines what’s called the “slot” size.  VMware HA then calculates how many total “slots” the cluster can hold, which determines how many VMs you can have powered on.

When this option is chosen, from what I’ve read on VMware forums, the calculations are VERY conservative.
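To make the slot idea concrete, here is a rough sketch of the arithmetic in Python. It is only my simplified reading of the policy, not VMware's actual code: real HA also accounts for things like memory overhead. The host sizes (24,000 MHz, 65,536 MB) and the small 256/256 slot are made-up values for illustration; only the 2507 MHz / 4256 MB slot comes from my cluster.

```python
def slots_per_host(cpu_mhz, mem_mb, slot_cpu_mhz, slot_mem_mb):
    """Whole slots a host can hold, limited by whichever resource runs out first."""
    return min(cpu_mhz // slot_cpu_mhz, mem_mb // slot_mem_mb)


def failover_slots(hosts, slot_cpu_mhz, slot_mem_mb, host_failures=1):
    """Slots left after assuming the `host_failures` largest hosts are lost."""
    per_host = sorted(
        slots_per_host(cpu, mem, slot_cpu_mhz, slot_mem_mb) for cpu, mem in hosts
    )
    # Worst case: the hosts holding the most slots are the ones that fail.
    return sum(per_host[: len(per_host) - host_failures])


# Hypothetical cluster: five identical hosts (CPU MHz, memory MB).
hosts = [(24000, 65536)] * 5

big = failover_slots(hosts, slot_cpu_mhz=2507, slot_mem_mb=4256)
small = failover_slots(hosts, slot_cpu_mhz=256, slot_mem_mb=256)  # hypothetical small slot
print(big, small)  # 36 372
```

The point of the sketch: the number of VMs you may power on depends on the slot size, not on how busy the hosts actually are.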

Find Your Slot Size: VMware Cluster Summary -> Advanced Runtime Info

VMware cluster summary: Advanced Runtime Info

VMware Advanced Runtime Info: Slot sizes

VMware HA Advanced Runtime Info

So you can see above, worst case scenario, one slot size is 2507 MHz, 4256 MB.  With that in mind, there are 55 slots available on my five node cluster.  There are a total of 156 powered-on VMs against those 55 slots.

This means I would have to power off 102 VMs to get to 54 powered-on VMs, leaving one slot open to power the new one on… (YIKES!)
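The arithmetic behind that number, as a tiny Python sketch (the function name is just something I made up for illustration):

```python
def vms_to_power_off(powered_on_vms, total_slots, slots_wanted_free=1):
    """How many VMs must be powered off so that `slots_wanted_free` slots are open."""
    target_powered_on = total_slots - slots_wanted_free
    return max(0, powered_on_vms - target_powered_on)


needed = vms_to_power_off(powered_on_vms=156, total_slots=55)
print(needed)  # 102
```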

Changing To Percentage: First Check Resource Usage

Out of curiosity, I checked the actual resource usage in my cluster

VMware cluster: Hosts

If you tally up all the green bars in CPU, I could fit all the CPU usage of every VM on one host.

If you tally up all the green bars in Memory, I could fit all the memory usage in about three hosts.

So why can’t I power on a VM?  Because the calculation is *THAT CONSERVATIVE* for the “Host failures cluster tolerates” option.

VMware HA: Switch to percentage

VMware HA: Percentage of cluster resources reserved as failover spare capacity

Now, the first time I did this, I chose “20%” which prorated to one server out of the five being free.

And I was able to power on a VM

On a whim, I kept upping the percentage and I got as high as 75% before I decided to stop, thinking I was doing something wrong.

Part of it was that the VM I was powering on was very small in resource usage (and, as I found out later, it had no reservation configured), which is probably why it powered on even at 75% failover spare capacity.

Anyhow, in a pinch, this is one way to configure some amount of reserve AND be able to power on your VMs, at least if your resource usage somewhat mirrors mine (see previous picture).
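Here is a rough Python sketch of why a tiny VM can still power on at 75%. It assumes (as far as I understand the percentage policy) that the check compares reservations, not actual usage, against total capacity, and that the same check runs for CPU and for memory. All the numbers are hypothetical, and the function is my own illustration, not a VMware API.

```python
def can_power_on(total_capacity, reserved_by_running_vms, new_vm_reservation, failover_pct):
    """True if the unreserved share of capacity stays at or above failover_pct."""
    reserved_after = reserved_by_running_vms + new_vm_reservation
    free_pct = 100.0 * (total_capacity - reserved_after) / total_capacity
    return free_pct >= failover_pct


# Hypothetical cluster: 120,000 MHz of CPU, 25,000 MHz already reserved.
tiny_vm = can_power_on(120000, 25000, new_vm_reservation=0, failover_pct=75)
big_vm = can_power_on(120000, 25000, new_vm_reservation=8000, failover_pct=75)
print(tiny_vm, big_vm)  # True False
```

A VM with no reservation barely moves the reserved total, so it passes the check even with a very high failover percentage.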

FIX #2: Best Long Term Fix: Determine WHY the cluster resource reserve is so high and see if it is actually needed, or if it is just poorly configured

In the end this was the actual fix for us, because it delved into the actual source of the problem.  Which was to find out:

WHY the heck was our VM slot size so BIG?

Because obviously all five hosts combined were using VERY LITTLE CPU and RAM.  Less than 20% on CPU (it could fit all on one server), and less than 50% on RAM (it could fit on two to three servers).

It turns out the slot size is not based on usage; it is based on VM resource reservations.
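In other words, one VM with a big reservation sets the slot size for everybody. A minimal Python sketch of that idea, with made-up VM names and reservations; the floor values of 256 are placeholders I picked for the example, not VMware's documented defaults, and real HA also adds memory overhead:

```python
def slot_size(vms, floor_cpu_mhz=256, floor_mem_mb=256):
    """Slot size = largest CPU and memory reservation among powered-on VMs."""
    cpu = max([vm["cpu_reservation_mhz"] for vm in vms] + [floor_cpu_mhz])
    mem = max([vm["mem_reservation_mb"] for vm in vms] + [floor_mem_mb])
    return cpu, mem


# Hypothetical VMs: only vm-a has a large reservation.
vms = [
    {"name": "vm-a", "cpu_reservation_mhz": 2500, "mem_reservation_mb": 4096},
    {"name": "vm-b", "cpu_reservation_mhz": 0, "mem_reservation_mb": 0},
    {"name": "vm-c", "cpu_reservation_mhz": 500, "mem_reservation_mb": 1024},
]
before = slot_size(vms)

# Same VMs with every reservation turned down to zero.
trimmed = [dict(vm, cpu_reservation_mhz=0, mem_reservation_mb=0) for vm in vms]
after = slot_size(trimmed)
print(before, after)  # (2500, 4096) (256, 256)
```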

So here is how to check the resource reservations for your VMs.

VMware Cluster: Resource Allocation for CPU and Memory

CPU

VMware cluster: resource allocation cpu

(The dashed lines are my VM names which I blanked out)

Click on the “CPU” button and look for the “Reservation” column and sort by largest to smallest.

Memory

VMware cluster: resource allocation memory

(The dashed lines are my VM names which I blanked out)

Click on the “Memory” button and look for the “Reservation” column and sort by largest to smallest.

As you can see, there are many VMs with resource reservations.  This means as soon as the VM is powered on, it will reserve this much resource REGARDLESS OF WHETHER IT IS NEEDED OR NOT!

But as you can see from actual usage, we are nowhere near capacity, so there is no real reason for us to reserve that much.

One of the culprits: it turns out many of the templates we use to clone/deploy VMs had resource reservations already set, so each time we made a new VM it came with a resource reservation.

VMware Cluster: Virtual Machines Actual Usage

Now go to the “Virtual Machines” tab and you can see actual usage.  There are columns for “Host CPU – MHz” and “Guest Mem – %”.  These show actual usage by the VM.

VMware cluster: Virtual Machines List

I sorted alphabetically here, referenced the previous two pictures (VMs with the highest reservations), and then checked this list to see actual usage.  Sure enough, many of our VMs were not using that much resource (as you can tell from the earlier graphs).

Next step: contact the VM owners to check whether the usage you see is typical for the VM.  If so, get permission to turn the resource reservation down or even off.
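If you have a lot of VMs, the comparison I did by eye can be sketched in a few lines of Python. Everything here is hypothetical: the VM names, the numbers, and the "reservation is more than twice the usage" rule are just an illustration of building a short list to take to the VM owners, and it assumes you have copied reservation and usage figures out of the client yourself.

```python
def over_reserved(vms, ratio=2.0):
    """VMs whose CPU or memory reservation exceeds `ratio` times actual usage,
    sorted with the largest reservations first."""
    flagged = [
        vm for vm in vms
        if vm["cpu_reservation_mhz"] > ratio * vm["cpu_usage_mhz"]
        or vm["mem_reservation_mb"] > ratio * vm["mem_usage_mb"]
    ]
    return sorted(
        flagged,
        key=lambda vm: (vm["cpu_reservation_mhz"], vm["mem_reservation_mb"]),
        reverse=True,
    )


# Hypothetical figures copied by hand from the reservation and usage views.
vms = [
    {"name": "vm-a", "cpu_reservation_mhz": 2500, "cpu_usage_mhz": 300,
     "mem_reservation_mb": 4096, "mem_usage_mb": 900},
    {"name": "vm-b", "cpu_reservation_mhz": 0, "cpu_usage_mhz": 150,
     "mem_reservation_mb": 0, "mem_usage_mb": 512},
    {"name": "vm-c", "cpu_reservation_mhz": 500, "cpu_usage_mhz": 450,
     "mem_reservation_mb": 1024, "mem_usage_mb": 800},
    {"name": "vm-d", "cpu_reservation_mhz": 1000, "cpu_usage_mhz": 100,
     "mem_reservation_mb": 2048, "mem_usage_mb": 2000},
]

names = [vm["name"] for vm in over_reserved(vms)]
print(names)  # ['vm-a', 'vm-d']
```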

VMware: Right Click -> Edit Settings

To configure the resource reservation, right-click the VM and choose Edit Settings.

VMware cluster: Right Click and Edit Settings

VMware: CPU reservation and Memory reservation

Here I turned the CPU reservation and the memory reservation down low or to zero.

VMware Edit CPU Resource Reservation Settings

VMware edit settings memory resource reservation

REMEMBER TO CONSULT YOUR USER FIRST TO SEE IF VM IS IN TYPICAL USE

VMware HA: Advanced Runtime Info Results

Now go back to your Advanced Runtime Info results… (you might have to switch VMware HA back to “Host failures cluster tolerates” if you had changed it to the percentage as an intermediate fix)

VMware HA: Advanced Runtime Info

When all was said and done, I went from 55 slots to 550 slots.

And I went from being in the “red” by 101 slots (156 powered-on VMs against 55 slots) to being in the “green” with 394 slots available.

CPU slot size went down by a factor of 10
Memory slot size went down by a factor of 20

NICE!!!

Hope this has been helpful!

This article is reposted from the 学海无涯 blog on 51CTO.  Original link: http://blog.51cto.com/549687/1953598.  For reprint permission, please contact the original author.


520feng2007
