VMware vSphere 4.1
HA and DRS Technical Deepdive
VMware vSphere 4.1, HA and DRS Technical Deepdive
Copyright © 2010 by Duncan Epping and Frank Denneman.
All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or
transmitted by any means, electronic, mechanical, or otherwise, without written permission from
the publisher. No patent liability is assumed with respect to the use of the information contained
herein. Although every precaution has been taken in the preparation of this book, the publisher and
authors assume no responsibility for errors or omissions. Neither is any liability assumed for
damages resulting from the use of the information contained herein.
International Standard Book Number (ISBN): 9781456301446
All terms mentioned in this book that are known to be trademarks or service marks have been
appropriately capitalized.
Use of a term in this book should not be regarded as affecting the validity of any trademark or
service mark.
Version: 1.1
About the Authors
Duncan Epping is a Principal Architect working for VMware as part of the Technical Marketing
department. Duncan primarily focuses on vStorage initiatives and ESXi. He specializes in vSphere, vStorage, VMware HA and Architecture. Duncan is a VMware Certified Professional and among the first VMware Certified Design Experts (VCDX 007). Duncan is the owner of Yellow-Bricks.com, one of the leading VMware/virtualization blogs worldwide (recently voted the number 1 virtualization blog for the 4th consecutive time on vsphere-land.com) and lead-author of the "vSphere Quick Start Guide" and "Foundation for Cloud Computing with VMware vSphere 4", which was recently published by Usenix/Sage (#21 in the Short Topics Series). He can be followed on Twitter at http://twitter.com/DuncanYB.
Frank Denneman is a Consulting Architect working for VMware as part of the Professional
Services Organization. Frank works primarily with large Enterprise customers and Service
Providers. He is focused on designing large vSphere Infrastructures and specializes in Resource
Management, DRS in general and storage. Frank is a VMware Certified Professional and among the
first VMware Certified Design Experts (VCDX 029). Frank is the owner of FrankDenneman.nl which
has recently been voted number 6 worldwide on vsphere-land.com. He can be followed on Twitter at http://twitter.com/FrankDenneman.
Table of Contents
About the Authors
Acknowledgements
Foreword
Introduction to VMware High Availability
How Does High Availability Work?
Pre-requisites
Firewall Requirements
Configuring VMware High Availability
Components of High Availability
VPXA
VMAP Plug-In
AAM
Nodes
Promoting Nodes
Failover Coordinator
Preferred Primary
High Availability Constructs
Isolation Response
Split-Brain
Isolation Detection
Selecting an Additional Isolation Address
Failure Detection Time
Adding Resiliency to HA (Network Redundancy)
Single Service Console with vmnics in Active/Standby Configuration
Secondary Management Network
Operation and Tasks of DRS
Load Balance Calculation
Events and Statistics
Migration and Info Requests
vCenter and Cluster sizing
DRS Cluster Settings
Automation Level
Initial Placement
Impact of Automation Levels on Procedures
Resource Management
Two-Layer Scheduler Architecture
Resource Entitlement
Resource Entitlement Calculation
Calculating DRS Recommendations
When is DRS Invoked?
Defragmenting cluster during Host failover
Recommendation Calculation
Constraints Correction
Imbalance Calculation
Impact of Migration Threshold on Selection Procedure
Selection of Virtual Machine Candidate
Cost-Benefit and Risk Analysis Criteria
The Biggest Bang for the Buck
Calculating the Migration Recommendation Priority Level
Influence DRS Recommendations
Migration Threshold Levels
Rules
VM-VM Affinity Rules
VM-Host Affinity Rules
Impact of Rules on Organization
Virtual Machine Automation Level
Impact of VM Automation Level on DRS Load Balancing Calculation
Resource Pools and Controls
Root Resource Pool
Resource Pools
Resource pools and simultaneous vMotions
Under Committed versus Over Committed
Resource Allocation Settings
Shares
Reservation
VM Level Scheduling: CPU vs Memory
Impact of Reservations on VMware HA Slot Sizes.
Behavior of Resource Pool Level Memory Reservations
Setting a VM Level Reservation inside a Resource Pool
VMkernel CPU reservation for vMotion
Reservations Are Not Limits.
Memory Overhead Reservation
Expandable Reservation
Limits
CPU Resource Scheduling
Memory Scheduler
Distributed Power Management
Enable DPM
Templates
DPM Threshold and the Recommendation Rankings
Evaluating Resource Utilization
Virtual Machine Demand and ESX Host Capacity Calculation
Evaluating Power-On and Power-Off Recommendations
Resource LowScore and HighScore
Host Power-On Recommendations
Host Power-Off Recommendations
DPM Power-Off Cost/Benefit Analysis
Integration with DRS and High Availability
Distributed Resource Scheduler
High Availability
DPM awareness of High Availability Primary Nodes
DPM Standby Mode
DPM WOL Magic Packet
Baseboard Management Controller
Protocol Selection Order
DPM and Host Failure Worst Case Scenario
DRS, DPM and VMware Fault Tolerance
DPM Scheduled Tasks
Summarizing
Appendix A – Basic Design Principles
VMware High Availability
VMware Distributed Resource Scheduler
Appendix B – HA Advanced Settings
Acknowledgements
The authors of this book work for VMware. The opinions expressed here are the authors’ personal
opinions. Content published was not read or approved in advance by VMware and does not
necessarily reflect the views and opinions of VMware. This is the authors’ book, not a VMware book.
First of all we would like to thank our VMware management team (Steve Beck, Director; Rob
Jenkins, Director) for supporting us on this and other projects.
A special thanks goes out to our Technical Reviewers: fellow VCDX Panel Member Craig Risinger
(VMware PSO), Marc Sevigny (VMware HA Engineering), Anne Holler (VMware DRS Engineering)
and Bouke Groenescheij (Jume.nl) for their very valuable feedback and for keeping us honest.
A very special thanks to our families and friends for supporting this project. Without your support we could not have done this.
We would like to dedicate this book to the VMware Community. We highly appreciate all the effort
everyone is putting in to take VMware, Virtualization and Cloud to the next level. This is our gift to
you.
Duncan Epping and Frank Denneman
Foreword
Since its inception, server virtualization has forever changed how we build and manage the
traditional x86 datacenter. In its early days of providing an enterprise-ready hypervisor, VMware
focused its initial virtualization efforts on meeting the need for server consolidation. Increasing the utilization of under-utilized systems and lowering datacenter costs for cooling, electricity, and floor space was a surefire recipe for VMware's early success. Shortly after introducing
virtualization solutions, customers started to see the significant advantages introduced by the
increased portability and recoverability that were all of a sudden available.
It’s this increased portability and recoverability that significantly drove VMware’s adoption during
its highest growth period. Recovery capabilities and options that were once reserved for the most
critical of workloads within the world’s largest organizations became broadly available to the
masses. Replication, High-Availability, and Fault Tolerance were once synonymous with "expensive enterprise solutions" but are now available to even the smallest of companies. Data protection enhancements, combined with intelligent resource management, placed VMware squarely at the top of the market leadership board. VMware's virtualization platform can
provide near instant recovery time with increasingly more recent recovery points in a properly
designed environment.
Now, if you’ve read this far, you likely understand the significant benefits that virtualization can
provide, and are probably well on your way to building out your virtual infrastructure and strategy.
The capabilities provided by VMware are not ultimately what dictates the success and failure of a
virtualization project, especially as increasingly more critical applications are introduced and
require greater availability and recoverability service levels. It takes a well-designed virtual
infrastructure and a full understanding of how the business requirements of the organization align
to the capabilities of the platform.
This book is going to arm you with the information necessary to understand the in-depth details of
what VMware can provide you when it comes to improving the availability of your systems. This
will help you better prepare for, and align to, the requirements of your business as well as set the
proper expectations with the key stakeholders within the IT organization. Duncan and Frank have
poured their extensive field experience into this book to enable you to drive broader virtualization
adoption across more complex and critical applications. This book will enable you to make the
most educated decisions as you attempt to achieve the next level of maturity within your virtual
environment.
Scott Herold
Lead Architect, Virtualization Business, Quest Software
Part 1
VMware High Availability
Chapter 1
Introduction to VMware High Availability
VMware High Availability (HA) provides a simple and cost effective clustering solution to increase
uptime for virtual machines. HA uses a heartbeat mechanism to detect a host or virtual machine
failure. In the event of a host failure, affected virtual machines are automatically restarted on other
production hosts within the cluster with spare capacity. In the case of a failure caused by the Guest
OS, HA restarts the failed virtual machine on the same host. This feature is called VM Monitoring,
but sometimes also referred to as VM HA.
Figure 1: High Availability in action
Unlike many other clustering solutions, HA is literally configured and enabled with 4 clicks. However, HA is not, and let's repeat it, is not a 1:1 replacement for solutions like Microsoft Clustering Services (MSCS). MSCS and, for instance, Linux Clustering are stateful clustering solutions where the state of the service or application is preserved when one of the nodes fails. The service is transitioned to one of the other nodes and it should resume with limited downtime or loss of data.
With HA the virtual machine is literally restarted and this incurs downtime. HA is a form of stateless
clustering.
One might ask why would you want to use HA when a virtual machine is restarted and service is
temporarily lost. The answer is simple; not all virtual machines (or services) need 99.999% uptime.
For many services the type of availability HA provides is more than sufficient. Stateful clustering does not guarantee 100% uptime, can be complex, and needs special skills and training. One example is managing patches and updates/upgrades in an MSCS environment; this could even cause more downtime if not operated correctly. Just like with MSCS, a service or application is restarted during a failover; the same happens with HA and the affected virtual machines.
Besides that, HA reduces complexity, costs (associated with downtime and MSCS), resource
overhead and unplanned downtime for minimal additional costs. It is important to note that HA,
contrary to MSCS, does not require any changes to the guest as HA is provided on the hypervisor
level. Also, VM Monitoring does not require any additional software or OS modifications except for
VMware Tools, which should be installed anyway.
We can’t think of a single reason not to use it.
How Does High Availability Work?
Before we deep dive into the main constructs of HA and describe all the choices one has when
configuring HA we will first briefly touch on the requirements. Now, the question of course is how does HA work? As briefly touched on in the introduction, HA triggers a response based on the loss
of heartbeats. However you might be more interested in knowing which components VMware uses
and what is required in order for HA to function correctly. Maybe if this is the first time you are
exposed to HA you also want to know how to configure it.
Pre-requisites
For those who want to configure HA, the following items are the pre-requisites in order for HA to
function correctly:
• Minimum of two VMware ESX or ESXi hosts
• Minimum of 2300MB memory to install the HA Agent
• VMware vCenter Server
• Redundant Service Console or Management Network (not a requirement, but highly recommended)
• Shared Storage for VMs – NFS, SAN, iSCSI
• Pingable gateway or other reliable address for testing isolation
We recommend against using a mixed cluster. With that we mean a single cluster containing both ESX and ESXi hosts. Differences in build numbers have led to serious issues in the past when using VMware FT. (KB article: 1013637)
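A quick way to check a cluster for mixed host types or mismatching build numbers is a PowerCLI one-liner. This is only a minimal sketch; the cluster name used below is an example and needs to be replaced with your own.

PowerCLI code:
# List the version and build number of every host in the cluster
Get-Cluster "ams-hadrs-001" | Get-VMHost | Select-Object Name, Version, Build

Any host reporting a different version or build number than the rest of the cluster is worth investigating before enabling HA or FT.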
Firewall Requirements
The following list contains the ports that are used by HA for communication. If your environment contains firewalls, ensure these ports are opened for HA to function correctly.
High Availability port settings:
• 8042 – UDP - Used for host-to-host "backbone" (message bus) communication.
• 8042 – TCP - Used by AAM agents to communicate with a remote backbone.
• 8043 – TCP - Used to locate a backbone at bootstrap time.
• 8044 – UDP - Used by HA to send heartbeats.
• 2050 – 2250 - Used by AAM agent process to communicate with the backbone.
Configuring VMware High Availability
As described earlier, HA can be configured with the default settings within 4 clicks. The following steps, however, will show you how to create a cluster and how to enable HA including VM Monitoring. Each of the settings and the mechanisms associated with these will be described more in-depth in the following chapters.
1. Select the Hosts & Clusters view.
2. Right-click the Datacenter in the Inventory tree and click New Cluster.
3. Give the new cluster an appropriate name. We recommend at a minimum including the location of the cluster and a sequence number, e.g. ams-hadrs-001.
4. In the Cluster Features section of the page, select Turn On VMware HA and click Next.
5. Ensure Host Monitoring Status and Admission Control are enabled and click Next
6. Leave the Cluster Default Settings as they are and click Next
7. Enable VM Monitoring Status by selecting “VM Monitoring Only” and click Next
8. Leave VMware EVC set to the default and click Next
9. Leave the Swapfile Policy set to default and click Next
10. Click Finish to complete the creation of the cluster
When the HA cluster has been created ESX hosts can be added to the cluster simply by dragging
them into the cluster. When an ESX host is added to the cluster the HA agent will be loaded.
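For those who prefer scripting over the vSphere Client, the cluster described in the steps above can also be created with PowerCLI. The sketch below only covers creating the cluster with HA and Admission Control enabled; the datacenter, cluster and host names are examples. As far as we know, enabling VM Monitoring is not exposed as a simple New-Cluster parameter and would still need to be configured through the vSphere Client or the API afterwards.

PowerCLI code:
# Create a new cluster with HA and Admission Control enabled (names are examples)
New-Cluster -Name "ams-hadrs-001" -Location (Get-Datacenter "ams-dc-001") -HAEnabled -HAAdmissionControlEnabled
# Add a host to the new cluster (hostname and credentials are examples)
Add-VMHost -Name "esx001.localdomain" -Location (Get-Cluster "ams-hadrs-001") -User root -Password "password"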
Chapter 2
Components of High Availability
Now that we know what the pre-requisites are and how to configure HA, the next step is describing which components form HA. This is still a "high level" overview, however. There is more under the covers that we will explain in the following chapters. The following diagram depicts a two
host cluster and shows the key HA components.
Figure 3: Components of High Availability
As you can clearly see there are three major components that form the foundation for HA:
• VPXA
• VMAP
• AAM
VPXA
The first and probably the most important is VPXA. This is not an HA agent, but it is the vCenter agent and it allows your vCenter Server to interact with your ESX host. It also takes care of stopping and starting virtual machines if and when needed.
HA is loosely coupled with vCenter Server. Although HA is configured by vCenter Server, it does not need vCenter to manage an HA failover. It is comforting to know that in the case of a failure of the host containing the virtualized vCenter Server, HA takes care of the failure and restarts the vCenter Server on another host, including all other configured virtual machines from that failed host.
When a virtual vCenter is used we do however recommend setting the correct restart priorities
within HA to avoid any dependency problems.
It’s highly recommended to register ESX hosts with their FQDN in vCenter. VMware vCenter
supplies the name resolution information that HA needs to function. HA stores this locally in a file
called “FT_HOSTS”. In other words, from an HA perspective there is no need to create local host files
and it is our recommendation to avoid using local host files. They are too static and will make troubleshooting more difficult. To stress this point even more: as of vSphere 4.0 Update 1 host files (i.e. /etc/hosts) are corrected automatically by HA. In other words, if you have made a typo or, for example, forgot to add the short name, HA will correct the host file to make sure nothing interferes with HA.
Basic design principle:
Avoid using static host files as it leads to inconsistency, which makes troubleshooting
difficult.
VMAP Plug-In
Next on the list is VMAP. Where vpxa is the process for vCenter to communicate with the host, VMAP is the translator between the HA agent (AAM) and vpxa. When vpxa wants to communicate with the AAM agent, VMAP will translate this into understandable instructions for the AAM agent. A good example of what VMAP would translate is the state of a virtual machine: is it powered on or powered off? Pre-vSphere 4.0 VMAP was a separate process instead of a plug-in linked into vpxa. VMAP is loaded into vpxa at runtime when a host is added to an HA cluster.
The vpxa process communicates with VMAP and VMAP communicates with AAM. When AAM has received and flushed the info, it will tell VMAP and VMAP in its turn will acknowledge to vpxa that the info has been processed. The VMAP plug-in acts as a proxy for communication to AAM.
One thing you are probably wondering is why do we need VMAP in the first place? Wouldn’t this be
something vpxa or AAM should be able to do? The answer is yes, either vpxa or AAM should be able to carry out this functionality. However, when HA was first introduced it was architecturally more prudent to create a separate process for dealing with this, which has now been turned into a plug-in.
AAM
That brings us to our next and final component, the AAM agent. The AAM agent is the core of HA and actually stands for "Automated Availability Manager". As stated above, AAM was originally
developed by Legato. It is responsible for many tasks such as communicating host resource
information, virtual machine states and HA properties to other hosts in the cluster. AAM stores all
this info in a database and ensures consistency by replicating this database amongst all primary
nodes. (Primary nodes are discussed in more detail in chapter 4.) It is often mentioned that HA uses an in-memory database only; this is not the case! The data is stored in a database on local storage or in flash memory on diskless ESXi hosts.
One of the other tasks AAM is responsible for is the mechanism with which HA detects
isolations/failures: heartbeats.
All this makes the AAM agent one of the most important processes on an ESX host, when HA is
enabled of course, but we are assuming for now it is. The engineers recognized the importance and
added an extra level of resiliency to HA. The agent is multi-process and each process acts as a
watchdog for the other. If one of the processes dies the watchdog functionality will pick up on this
and restart the process to ensure HA functionality remains without anyone ever noticing it failed. It
is also resilient to network interruptions and component failures. Inter-host communication automatically uses another communication path (if the host is configured with redundant management networks) in the case of a network failure. The underlying message framework guarantees exactly-once message delivery.
Chapter 3
Nodes
An HA cluster consists of hosts, or nodes as HA calls them. There are two types of nodes. A node is
either a primary or a secondary node. This concept was introduced to enable scaling up to 32 hosts
in a cluster and each type of node has a different role. Primary nodes hold cluster settings and all
“node states”. The data a primary node holds is stored in a persistent database and synchronized
between primaries as depicted in the diagram above.
An example of node state data would be host resource usage. In case vCenter is not available the
primary nodes will always have a very recent calculation of the resource utilization and can take
this into account when a failover needs to occur. Secondary nodes send their state info to primary
nodes. This will be sent when changes occur, generally within seconds after a change. As of vSphere
4.1 by default every host will send an update of its status every 10 seconds. Pre-vSphere 4.1 this
used to be every second.
This interval can be controlled by an advanced setting called das.sensorPollingFreq. As stated
before, the default value of this advanced setting is 10. Although a smaller value will lead to a more up-to-date view of the status of the cluster overall, it will also increase the amount of traffic between nodes. It is not recommended to decrease this value as it might lead to decreased scalability due to the overhead of these status updates. The maximum value of the advanced setting is 30.
As discussed earlier, HA uses a heartbeat mechanism to detect possible outages or network
isolation. The heartbeat mechanism is used to detect a failed or isolated node. However, a node will
recognize it is isolated by the fact that it isn’t receiving heartbeats from any of the other nodes.
Nodes send a heartbeat to each other. Primary nodes send heartbeats to all primary nodes and all
secondary nodes. Secondary nodes send their heartbeats to all primary nodes, but not to
secondaries. Nodes send out these heartbeats every second by default. However, this is a configurable value through the use of the following cluster advanced setting: das.failuredetectioninterval. We do not, however, recommend changing this interval as it was carefully selected by VMware.
The first 5 hosts that join the HA cluster are automatically selected as primary nodes. All other
nodes are automatically selected as secondary nodes. When you do a reconfigure for HA, the
primary nodes and secondary nodes are selected again; this is virtually random.
Except for the first host that is added to the cluster, any host that joins the cluster must communicate with an existing primary node to complete its configuration. At least one primary host
must be available for HA to operate correctly. If all primary hosts are unavailable, you will not be
able to add or remove a host from your cluster.
The vCenter client normally does not show which host is a primary node and which is a secondary
node. As of vCenter 4.1 a new feature has been added which is called “Operational Status” and can
be found on the HA section of the Cluster’s summary tab. It will give details around errors and will
show the primary and secondary nodes. There is one gotcha however; it will only show which
nodes are primary and secondary in case of an error.
Figure 5: Cluster operational status
This however can also be revealed from the Service Console or via PowerCLI. The following are two
examples of how to list the primary nodes via the Service Console (ESX 4.0):
Figure 6: List node command
Another method of showing the primary nodes is:
Figure 7: List nodes command
With PowerCLI the primary nodes can be listed with the following line of code:
Power-CLI code:
Get-Cluster | Get-HAPrimaryVMHost
Now that you have seen that you can list all nodes with the CLI, you probably wonder what else is possible… Let's start with a warning: this is not supported! Currently the supported limit of primaries is 5. This is a soft limit, however. It is possible to manually add a 6th primary but this is not supported nor encouraged.
Having more than 5 primaries in a cluster will significantly increase network and CPU overhead.
There should be no reason to increase the number of primaries beyond 5. For the purpose of
education we will demonstrate how to promote a secondary node to primary and vice versa.
To promote a node:
Promoting Nodes
A common misunderstanding about HA with regards to primary and secondary nodes is the re-election process. When does a re-election, or promotion, occur?
It is a common misconception that a promotion of a secondary occurs when a primary node fails. This is not the case. Let's stress that: this is not the case! The promotion of a secondary node to primary only occurs in one of the following scenarios:
• When a primary node is placed in "Maintenance Mode"
• When a primary node is disconnected from the cluster
• When a primary node is removed from the cluster
• When the user clicks "Reconfigure for HA" on any ESX host
This is particularly important for the operational aspect of a virtualized environment. When a host
fails it is important to ensure its role is migrated to any of the other hosts in case it was an HA primary node. To simplify it: when a host fails we recommend placing it in maintenance mode, disconnecting it, or removing it from the cluster to avoid any risks!
If all primary hosts fail simultaneously no HA initiated restart of the virtual machines can take
place. HA needs at least one primary node to restart virtual machines. This is why you can configure
HA to tolerate only up to 4 host failures when you have selected the “host failures” Admission
Control Policy (Remember 5 primaries…). The amount of primaries is definitely something to take
into account when designing for uptime.
Failover Coordinator
As explained in the previous section, you will need at least one primary to restart virtual machines.
The reason for this is that one of the primary nodes will hold the “failover coordinator” role. This
role will be randomly assigned to a primary node; this role is also sometimes referred to as “active
primary”. We will use “failover coordinator” for now.
The failover coordinator coordinates the restart of virtual machines on the remaining primary and
secondary hosts. The coordinator takes restart priorities into account when coordinating the restarts. Pre-vSphere 4.1, when multiple hosts failed at the same time, it would handle the restarts serially. In other words, restart the virtual machines of the first failed host (taking restart priorities into account) and then restart the virtual machines of the host that failed second (again taking restart priorities into account). As of vSphere 4.1 this mechanism has been significantly improved. In the case of multiple near-simultaneous host failures, all the host failures that occur within 15 seconds will have all their VMs aggregated and prioritized before the power-on operations occur.
If the failover coordinator fails, one of the other primaries will take over. This node is again
randomly selected from the pool of available primary nodes. As any other process within the HA
stack, the failover coordinator process is carefully watched by the watchdog functionality of HA.
Pre-vSphere 4.1 the failover coordinator would decide where a virtual machine would be restarted.
Basically it would check which host had the highest percentage of unreserved and available
memory and CPU and select it to restart that particular virtual machine. For the next virtual
machine the same exercise would be done by HA, select the host with the highest percentage of
unreserved memory and CPU and restart the virtual machine.
HA does not coordinate with DRS when making the decision on where to place virtual machines. HA
would rely on DRS. As soon as the virtual machines were restarted, DRS would kick in and
redistribute the load if and when needed.
As of vSphere 4.1 virtual machines will be evenly distributed across hosts to lighten the load on the
hostd service and to get quicker power-on results. HA then relies on DRS to redistribute the load
later if required. This improvement results in faster restarts of the virtual machines and less stress
on the ESX hosts. DRS also re-parents the virtual machine when it is booted up, as virtual machines are failed over into the root resource pool by default. This re-parenting process, however, already existed pre-vSphere 4.1.
The failover coordinator can restart up to 32 VMs concurrently per host. The number of concurrent
failovers can be controlled by an advanced setting called das.perHostConcurrentFailoversLimit. As stated, the default value is 32. Setting a larger value will allow more VMs to be restarted
concurrently and might reduce the overall VM recovery time, but the average latency to recover
individual VMs might increase.
In blade environments it is particularly important to factor the primary nodes and failover
coordinator concept into your design. When designing a multi chassis environment the impact of a
single chassis failure needs to be taken into account. When all primary nodes reside in a single
chassis and the chassis fails, no virtual machines will be restarted as the failover coordinator is the only one who initiates the restart of your virtual machines. When it is unavailable, no restart will take place.
It is a best practice to have the primaries distributed amongst the chassis so that if an entire chassis fails or a rack loses power, there is still a running primary to coordinate the failover. This can even be extended in very large environments by having no more than 2 hosts of a cluster in a chassis. The following diagram depicts the scenario where four 8-host clusters are spread across four chassis.
Figure 10: Logical cluster layout on blade environment
Basic design principle: In blade environments, divide hosts over all blade chassis and never exceed four hosts per chassis to avoid having all primary nodes in a single chassis.
Preferred Primary
With vSphere 4.1 a new advanced setting has been introduced. This setting is not even experimental; it is currently considered unsupported. We don't recommend anyone use it in a production environment; if you do want to play around with it, use your test environment.
This new advanced setting is called das.preferredPrimaries. With this setting multiple hosts of a cluster can be manually designated as preferred nodes during the primary node election process. The list of nodes can either be comma or space separated and both hostnames and IP addresses are
allowed. Below you can find an example of what this would typically look like. The “=” sign has been
used as a divider between the setting and the value.
das.preferredPrimaries = hostname1,hostname2,hostname3
or
das.preferredPrimaries = 192.168.1.1 192.168.1.2 192.168.1.3
As shown there is no need to specify 5 hosts; you can specify any number of hosts. If you specify 5 hosts or fewer, and all of them are available, they will become primary nodes in your cluster. If you specify more than 5 hosts, the first 5 hosts of your list will become primary.
Again, please be warned that this is considered unsupported at the time of writing; please verify in the VMware Availability Guide or online in the knowledge base (kb.vmware.com) what the support status of this feature is before even thinking about implementing it.
A workaround used by some pre-vSphere 4.1 was the "promote/demote" option of HA's CLI as described earlier in this chapter. Although this solution could fairly simply be scripted, it is unsupported and, as opposed to das.preferredPrimaries, a rather static solution.
Chapter 4
High Availability Constructs
When configuring HA two major decisions will need to be made.
• Isolation Response
• Admission Control
Both are important to how HA behaves. Both will also have an impact on availability. It is really
important to understand these concepts. Both concepts have specific caveats. Without a good
understanding of these it is very easy to increase downtime instead of decreasing downtime.
Isolation Response
One of the first decisions that will need to be made when HA is configured is the “isolation
response”. The isolation response refers to the action that HA takes for its VMs when the host has
lost its connection with the network. This does not necessarily mean that the whole network is down; it could just be this host's network ports or just the ports that are used by HA for the heartbeat. Even if your virtual machine has a network connection and only your "heartbeat network" is isolated, the isolation response is triggered.
Today there are three isolation responses: "Power off", "Leave powered on" and "Shut down". This answers the question of what a host should do when it has detected it is isolated from the network. No matter which of the following three options is chosen as the isolation response, the remaining, non-isolated hosts will always try to restart the virtual machines:
• Power off – When network isolation occurs all virtual machines are powered off. It is a hard stop, or to put it bluntly, the power cable of the VMs will be pulled out!
• Shut down – When network isolation occurs all virtual machines running on the host will be shut down using VMware Tools. If this is not successful within 5 minutes, a "power off" will be executed. This time out value can be adjusted by setting the advanced option das.isolationShutdownTimeout. If VMware Tools is not installed, a "power off" will be initiated immediately.
• Leave powered on – When network isolation occurs on the host, the state of the virtual machines remains unchanged.
This setting can be changed on the cluster settings under virtual machine options.
Figure 11: Cluster default setting
The default setting for the isolation response has changed multiple times over the last couple of
years. Up to ESX 3.5 U2 / vCenter 2.5 U2 the default isolation response when creating a new cluster
was “Power off”. This changed to “Leave powered on” as of ESX 3.5 U3 / vCenter 2.5 U3. However
with vSphere 4.0 this has changed again. The default setting for newly created clusters, at the time
of writing, is "Shut down", which might not be the desired response. When installing a new environment, you might want to change the default setting based on your customer's requirements or constraints.
The question remains: which setting should you use? The obvious answer applies here; it depends. We prefer "Shut down" because we do not want to use a degraded host to run our virtual machines on and it will shut down your virtual machines in a clean manner. Many people, however, prefer to use "Leave powered on" because it eliminates the chance of a false positive and the downtime associated with a false positive. A false positive in this case is an isolated heartbeat network but a non-isolated virtual machine network and a non-isolated iSCSI / NFS network.
That leaves the question of how the other HA nodes know whether the host is isolated or has failed. HA actually does not know the difference. The other HA nodes will try to restart the affected virtual machines in either case. When the host is unavailable, a restart attempt will take place no matter which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will not be able to restart the affected virtual machines. The reason for this is the fact that the host that is running the virtual machine has a lock on the VMDK and swap files. None of the hosts will be able
to boot a virtual machine when the files are locked. For those who don’t know, ESX locks files to
prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a
host fails, this lock expires and a restart can occur.
To reiterate, the remaining nodes will always try to restart the “failed” virtual machines. The
possible lock on the VMDK files belonging to these virtual machines, in the case of an isolation
event, prevents them from being started. This assumes that the isolated host can still reach the files,
which might not be true if the files are accessed through the network on iSCSI, NFS, or FCoE based
storage. HA, however, will repeatedly try starting the "failed" virtual machines when a restart is unsuccessful.
The number of retries is configurable as of vCenter 2.5 U4 with the advanced option "das.maxvmrestartcount". The default value is 5. Pre-vCenter 2.5 U4 HA would keep retrying forever, which could lead to serious problems as described in KB article 1009625.
Split-Brain
When creating your design, make sure you understand the isolation response setting. For instance, when using an iSCSI array or NFS-based storage, choosing "Leave powered on" as your default isolation response might lead to a split-brain situation.
A split-brain situation can occur when the VMDK file lock times out. This could happen when the
iSCSI, FCoE or NFS network is also unavailable. In this case the virtual machine is being restarted on
a different host while it is not being powered off on the original host because the selected isolation
response is "Leave powered on". This could potentially leave vCenter in an inconsistent state as two VMs with the same UUID would be reported as running on both hosts. This would cause a
“ping-pong” effect where the VM would appear to live on ESX host 1 at one moment and on ESX
host 2 soon after.
VMware’s engineers have recognized this as a potential risk and developed a solution for this
unwanted situation. (This is not well documented, but briefly explained by one of the engineers on the VMTN Community forums: http://communities.vmware.com/message/1488426#1488426.)
In short: as of version 4.0 Update 2, ESX detects that the lock on the VMDK has been lost, issues a question whether the virtual machine should be powered off, and auto-answers the question with yes.
However, you will only see this question if you directly connect to the ESX host. HA will generate an
event for this auto-answer though, which is viewable within vCenter. Below you can find a
screenshot of this question.
Figure 13: Virtual machine message
As stated above, as of ESX 4 update 2 the question will be auto-answered and the virtual machine
will be powered off to recover from the split brain scenario.
The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them
powered on?
As described above, in earlier versions "Leave powered on" could lead to a split-brain scenario. You would end up seeing virtual machines ping-ponging between hosts as vCenter would not know where they resided, as they were active in memory on two hosts. As of ESX 4.0 Update 2, this is
however not the case anymore and it should be safe to use “Leave powered on”.
We recommend avoiding the chances of a split-brain scenario. Configure a secondary Service Console on the same vSwitch and network as the iSCSI or NFS VMkernel portgroup and, pre-vSphere 4.0 Update 2, select either "Power off" or "Shut down" as the isolation response. By doing this you will be able to detect if there's an outage on the storage network. We will discuss the options you have for Service Console / Management Network redundancy more extensively later on in this book.
Basic design principle: For network-based storage (iSCSI, NFS, FCoE) it is recommended (pre-vSphere 4.0 Update 2) to set the isolation response to "Shut down" or "Power off". It is also recommended to have a secondary Service Console (ESX) or Management Network (ESXi) running on the same vSwitch as the storage network to detect a storage outage and avoid false positives for isolation detection.
Isolation Detection
We have explained what the options are to respond to an isolation event. However we have not
extensively discussed how isolation is detected. This is one of the key mechanisms of HA. Isolation detection is a mechanism that takes place on the host that is isolated. The remaining, non-isolated,
hosts don’t know if that host has failed completely or if it is isolated from the network, they only
know it is unavailable.
The mechanism is fairly straightforward though and works as earlier explained with heartbeats.
When a node receives no heartbeats from any of the other nodes for 13 seconds (default setting)
HA will ping the “isolation address”. Remember primary nodes send heartbeats to primaries and
secondaries, secondary nodes send heartbeats only to primaries.
The isolation address is the gateway specified for the Service Console network (or management
network on ESXi), but there is a possibility to specify one or multiple additional isolation addresses
with an advanced setting. This advanced setting is called “das.isolationaddress” and could be used
to reduce the chances of having a false positive. We recommend setting at least one additional isolation address.
Figure 14: das.isolationaddress
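The same advanced settings can be added through the vSphere Client (Cluster Settings, VMware HA, Advanced Options) or from PowerCLI. The sketch below is an example only: it assumes the New-AdvancedSetting cmdlet with the ClusterHA type, and the cluster name and IP addresses need to be replaced with values from your own environment.

PowerCLI code:
$cluster = Get-Cluster "ams-hadrs-001"
# Add an additional isolation address, for example the IP address of the physical switch
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress1" -Value "192.168.1.253"
# A second address, for example the IP address of the NFS or iSCSI array, can be added as well
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress2" -Value "192.168.2.50"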
When isolation has been confirmed, meaning no heartbeats have been received and HA was unable
to ping any of the isolation addresses, HA will execute the isolation response. This could be any of the above-described options: power off, shut down or leave powered on.
If only one heartbeat is received or just a single isolation address can be pinged the isolation
response will not be triggered, which is exactly what you want.
Selecting an Additional Isolation Address
A question asked by many people is which address should be specified for this additional isolation
verification. We generally recommend an isolation address closest to the hosts to avoid too many
network hops. In many cases the most logical choice is the physical switch to which the host is directly connected; another usual suspect would be a router or any other reliable and pingable device. However, when you are using network-based shared storage like NFS or iSCSI, a good choice would be the IP address of the storage device; this way you would also verify whether the storage is still reachable or not.
Failure Detection Time
Failure Detection Time seems to be a concept that is often misunderstood but is critical when
designing a virtual infrastructure. Failure Detection Time is basically the time it takes before the
“isolation response” is triggered. There are two primary concepts when we are talking about failure
detection time:
• The time it will take the host to detect it is isolated
• The time it will take the non-isolated hosts to mark the unavailable host as isolated and
initiate the failover
The following diagram depicts the timeline for both concepts:
Figure 15: High Availability failure detection time
The default value for failure detection is 15 seconds (das.failuredetectiontime). In other words, the failed or isolated host will be declared failed by the other hosts in the HA cluster on the fifteenth
second and a restart will be initiated by the failover coordinator after one of the primaries has
verified that the failed or isolated host is unavailable by pinging the host on its management
network.
It should be noted that in the case of a dual management network setup both addresses will be pinged and 1 second will need to be added to the timeline, meaning that the failover coordinator will initiate the restart on the 17th second.
Let’s stress that again, a restart will be initiated after one of the primary nodes has tried to ping all
of the management network addresses of the failed host.
Let's assume the isolation response is "Power off". The isolation response "Power off" will be
triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words a
“Power off” will be initiated on the fourteenth second. A restart will be initiated on the sixteenth
second by the failover coordinator if the host has a single management network.
Does this mean that you can end up with your virtual machines being down and HA not restarting
them?
Yes, when the heartbeat returns between the 14th and 16th second the “Power off” might have
already been initiated. The restart however will not be initiated because the received heartbeat
indicates that the host is not isolated anymore.
How can you avoid this?
Selecting “Leave VM powered on” as an isolation response is one option. Increasing the
das.failuredetectiontime will also decrease the chances of running into issues like these, and with
ESX 3.5 it was a standard best practice to increase the failure detection time to 30 seconds.
At the time of writing (vSphere) this is not a best practice anymore as with any value the “2-second”
gap exists and the likelihood of running into this issue is small. We recommend keeping
das.failuredetectiontime as low as possible to decrease associated down time.
Basic design principle: Keep das.failuredetectiontime low for fast responses to failures. If an isolation validation address has been added, "das.isolationaddress", add 5000 to the default "das.failuredetectiontime" (15000).
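To make the timeline and the design principle above concrete, a small back-of-the-envelope calculation helps. The numbers below simply follow the defaults described in this section (15 seconds failure detection, isolation response 1 second before it elapses, restart roughly 1 second after it); they are illustrative only.

PowerShell example:
# Default failure detection time in milliseconds
$dasFailureDetectionTime = 15000
# One additional isolation address configured: add 5000 ms as per the design principle
$dasFailureDetectionTime += 5000                          # 20000 ms
# The isolated host triggers its isolation response 1 second before this value elapses
$isolationResponseAt = $dasFailureDetectionTime - 1000    # 19000 ms
# The failover coordinator initiates the restart roughly 1 second after it
# (2 seconds when a dual management network setup is used)
$restartInitiatedAt = $dasFailureDetectionTime + 1000     # 21000 ms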
Chapter 5
Adding Resiliency to HA (Network Redundancy)
Single Service Console with vmnics in Active/Standby Configuration
Recommended:
• 2 physical switches
The vSwitch should be configured as follows:
• vSwitch0: 2 Physical NICs (vmnic0 and vmnic2)
• 2 Portgroups (Service Console and VMkernel)
• Service Console active on vmnic0 and standby on vmnic2
• VMkernel active on vmnic2 and standby on vmnic0
• Failback set to No
Each portgroup has a VLAN ID assigned and runs dedicated on its own physical NIC; only in the case of a failure is it switched over to the standby NIC. We highly recommend setting failback to
“No” to avoid chances of a false positive which can occur when a physical switch routes no traffic
during boot but the ports are reported as “up”. (NIC Teaming Tab)
Pros: Only 2 NICs in total are needed for the Service Console and VMkernel, especially useful in
Blade environments. This setup is also less complex.
Cons: Just a single active path for heartbeats.
The following diagram depicts the active/standby scenario:
Figure 16: Active-standby Service Console network layout
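The same active/standby teaming can also be configured from PowerCLI instead of the NIC Teaming tab. The sketch below is an example only; it assumes the Get-NicTeamingPolicy and Set-NicTeamingPolicy cmdlets, and the host, portgroup and vmnic names need to be adjusted to your environment. The same commands would be repeated for the VMkernel portgroup with the active and standby NICs reversed.

PowerCLI code:
# Make vmnic0 active and vmnic2 standby for the Service Console portgroup, with failback disabled
$esx = Get-VMHost "esx001.localdomain"
$vmnic0 = Get-VMHostNetworkAdapter -VMHost $esx -Physical -Name "vmnic0"
$vmnic2 = Get-VMHostNetworkAdapter -VMHost $esx -Physical -Name "vmnic2"
$pg = Get-VirtualPortGroup -VMHost $esx -Name "Service Console"
Get-NicTeamingPolicy -VirtualPortGroup $pg | Set-NicTeamingPolicy -MakeNicActive $vmnic0 -MakeNicStandby $vmnic2 -FailbackEnabled $false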
Secondary Management Network
Requirements:
• 3 physical NICs
• VLAN trunking
Recommended:
• 2 physical switches
The vSwitch should be configured as follows:
• vSwitch0: 3 Physical NICs (vmnic0, vmnic1 and vmnic2)
• 3 Portgroups (Service Console, secondary Service Console and VMkernel)
The primary Service Console runs on vSwitch0 and is active on vmnic0, with a VLAN assigned on
either the physical switch port or the portgroup and is connected to the first physical switch. (We
recommend using a VLAN trunk for all network connections for consistency and flexibility.)
The secondary Service Console will be active on vmnic2 and connected to the second physical
switch.
The VMkernel is active on vmnic1 and standby on vmnic2.
Pros: Decreased chances of false alarms due to Spanning Tree "problems" as the setup contains two Service Consoles that are each connected to only 1 physical switch. Subsequently, both Service Consoles will be used for the heartbeat mechanism, which will increase resiliency.
Cons: Need to set advanced settings. It is mandatory to set an additional isolation address
(das.isolationaddress2) in order for the secondary Service Console to verify network isolation via a
different route.
The following diagram depicts the secondary Service Console scenario:
Figure 17: Secondary management network
The question remains: which would we recommend? Both scenarios are fully supported and provide a highly redundant environment either way. Redundancy for the Service Console or Management Network is important for HA to function correctly and avoid false alarms about the host being isolated from the network. We, however, recommend the first scenario. Redundant NICs for your Service Console add a sufficient level of resilience without leading to an overly complex
environment.
Chapter 6
Admission Control
Admission Control is often misunderstood and disabled because of this. However, Admission Control is a must when availability needs to be guaranteed, and isn't that the reason for enabling HA in the first place?
What is HA Admission Control about? Why does HA contain Admission Control?
The "Availability Guide", a.k.a. the HA bible, states the following:
"vCenter Server uses Admission Control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected."
Admission Control guarantees capacity is available for an HA initiated failover by reserving resources within a cluster. It calculates the capacity required for a failover based on available resources. In other words, if a host is placed into maintenance mode, or disconnected, it is taken out of the equation. Available resources also mean that the virtualization overhead has already been subtracted from the total. To give an example: Service Console memory and VMkernel memory are subtracted from the total amount of memory, which results in the available memory for the virtual machines.
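As a purely hypothetical illustration of that subtraction (all numbers below are made up for the sake of the example):

PowerShell example:
# Hypothetical available-memory calculation for a single host
$totalHostMemoryMB      = 32768   # physical memory in the host
$serviceConsoleMemoryMB = 800     # memory assigned to the Service Console
$vmkernelMemoryMB       = 400     # memory used by the VMkernel
# Memory that Admission Control considers available for virtual machines
$availableForVMsMB = $totalHostMemoryMB - $serviceConsoleMemoryMB - $vmkernelMemoryMB   # 31568 MB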
There is one gotcha with Admission Control that we want to bring to your attention before drilling
into the different policies.
When Admission Control is set to strict, VMware Distributed Power Management will in no way violate availability constraints. This means that it will always ensure multiple hosts are up and running. (For more info on how DPM calculates this, read Chapter 18.)
When Admission Control was disabled and DPM was enabled in a pre-vSphere 4.1 environment you
could have ended up with all but one ESX host placed in sleep mode, which could lead to potential
issues when that particular host failed or resources were scarce as there would be no host available
to power-on your virtual machines. (KB: http://kb.vmware.com/kb/1007006)
With vSphere 4.1, however, if there are not enough resources to power on all virtual machines, DPM will be asked to take hosts out of standby mode to make more resources available, and the virtual machines can then get powered on by HA when those hosts are back online.
Admission Control Policy
The Admission Control Policy dictates the mechanism that HA uses to guarantee enough resources
are available for an HA initiated failover. This section gives a general overview of the available
Admission Control Policies. The impact of each policy is described in the following section including
our recommendation.
HA has three mechanisms to guarantee enough capacity is available to respect virtual machine
resource reservations.
Figure 18: Admission control policy
Below we have listed all three options currently available as the Admission Control Policy. Each
option has a different mechanism to ensure resources are available for a failover and each option
has its caveats.
Admission Control Mechanisms
Each Admission Control Policy has its own Admission Control mechanism. Understanding this
Admission Control mechanism is important to understand the impact of decisions for your cluster
design. For instance setting a reservation on a specific virtual machine can have an impact on the
achieved consolidation ratio. This section will take you on a journey through the trenches of
Admission Control mechanisms.
Host Failures Cluster Tolerates
The Admission Control Policy that has been around the longest is the “Host Failures Cluster
Tolerates” policy. It is also historically the least understood Admission Control Policy due to its
complex admission control mechanism.
The so-called “slots” mechanism is used when selecting “host failures cluster tolerates” as the
Admission Control Policy. The mechanism of this concept has changed several times in the past and
it is one of the most restrictive policies.
Slots dictate how many virtual machines can be powered on before vCenter starts yelling “Out Of
Resources”! Normally a slot represents one virtual machine. Admission Control does not limit HA in
restarting virtual machines, it ensures enough resources are available to power on all virtual
machines in the cluster by preventing "over-commitment". For those wondering why HA initiated failovers are not subject to the Admission Control Policy, think back for a second. Admission Control is done by vCenter. HA initiated restarts are executed directly on the ESX host without the use of vCenter. So even if resources were low and vCenter complained, it could not stop the restart.
If a failure has occurred and the host has been removed from the cluster, HA will recalculate all the
values and start with an “N+x” cluster again from scratch. This could result in an over-committed
cluster as you can imagine.
“A slot is defined as a logical representation of the memory and CPU resources that satisfy the
requirements for any powered-on virtual machine in the cluster…”
In other words a slot is the worst case CPU and memory reservation scenario in a cluster. This
directly leads to the first “gotcha”:
HA uses the highest CPU reservation of any given virtual machine and the highest memory
reservation of any given VM in the cluster. If no reservation higher than 256 MHz is set, HA will use a default of 256 MHz for CPU. If no memory reservation is set, HA will use a default of
0MB+memory overhead for memory. (See the VMware vSphere Resource Management Guide for
more details on memory overhead per virtual machine configuration) The following example will
clarify what “worst-case” actually means.
Example - If virtual machine “VM1” has 2GHz of CPU reserved and 1024MB of memory reserved
and virtual machine “VM2” has 1GHz of CPU reserved and 2048MB of memory reserved the slot
size for memory will be 2048MB (+memory overhead) and the slot size for CPU will be 2GHz. It is a combination of the highest reservations of both virtual machines. Reservations defined at the Resource Pool level, however, will not affect HA slot size calculations.
Basic design principle: Be really careful with reservations; if there's no need to have them on a per-virtual machine basis, don't configure them, especially when using Host Failures Cluster Tolerates. If reservations are needed, resort to resource pool based reservations.
Now that we know the worst-case scenario is always taken into account when it comes to slot size
calculations, we will describe what dictates the number of available slots per cluster.
We need to know the slot size for memory and CPU first. Then we divide the total available CPU
resources of a host by the CPU slot size and the total available memory resources of a host by the
memory slot size. This leaves us with a number of slots for both memory and CPU. The most
restrictive number (again, worst-case scenario) is the number of slots for this host. If you have
25 CPU slots but only 5 memory slots, the number of available slots for this host will be 5, as HA
will always take the worst-case scenario into account to "guarantee" all virtual machines can be
powered on in case of a failure or isolation.
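To make this concrete, here is a small Python sketch of the calculation described above. It is
purely illustrative and not how HA is implemented internally; the 256 MHz CPU default and the VM
reservations come from the example earlier, while the memory overhead value and the host capacity
used in the last line are placeholders.

import math

# Reservations per powered-on VM: (cpu_reservation_mhz, mem_reservation_mb)
vm_reservations = [(2000, 1024), (1000, 2048)]   # VM1 and VM2 from the example

CPU_DEFAULT_MHZ = 256    # used when no CPU reservation above 256 MHz is set
MEM_OVERHEAD_MB = 100    # placeholder; real overhead depends on the VM configuration

# Slot size = worst-case (highest) reservation across all powered-on VMs
cpu_slot_mhz = max(max(cpu, CPU_DEFAULT_MHZ) for cpu, _ in vm_reservations)
mem_slot_mb = max(mem for _, mem in vm_reservations) + MEM_OVERHEAD_MB

def slots_per_host(host_cpu_mhz, host_mem_mb):
    # A host offers the most restrictive of its CPU and memory slot counts
    return min(host_cpu_mhz // cpu_slot_mhz, host_mem_mb // mem_slot_mb)

print(cpu_slot_mhz, mem_slot_mb)        # 2000 MHz and 2148 MB slot sizes
print(slots_per_host(16000, 16384))     # slots offered by a hypothetical 16GHz/16GB host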
A question we receive a lot is: how do I know what my slot size is? The details around slot sizes
can be monitored in the HA section of the cluster's Summary tab by clicking the "Advanced
Runtime Info" line.
Figure 19: High Availability cluster summary tab
This will show the following screen that specifies the slot size and more useful details around the
amount of slots available.
Figure 20: High Availability advanced runtime info
As you can see, using reservations on a per-VM basis can lead to very conservative consolidation
ratios. However, with vSphere this is configurable. If you have just one virtual machine with a
really high reservation, you can set the following advanced settings to lower the slot size used
for these calculations: "das.slotCpuInMHz" or "das.slotMemInMB".
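If you do decide to override the slot size, these advanced options are set at the cluster level.
Below is a minimal pyVmomi-style sketch; it assumes a ClusterComputeResource object obtained from
an existing vCenter connection, and the exact property and method names should be verified against
the vSphere API reference for your version before use.

from pyVmomi import vim

def set_slot_size_overrides(cluster, cpu_mhz=500, mem_mb=1024):
    # Sketch: push das.slotCpuInMHz / das.slotMemInMB as HA (das) advanced options.
    # "cluster" is assumed to be a connected vim.ClusterComputeResource.
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(
            option=[
                vim.option.OptionValue(key="das.slotCpuInMHz", value=str(cpu_mhz)),
                vim.option.OptionValue(key="das.slotMemInMB", value=str(mem_mb)),
            ]
        )
    )
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)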
If you use these advanced settings and a virtual machine has a reservation larger than the
configured slot size, that virtual machine will take up multiple slots so that it can still be
powered on. When you are low on resources this could mean that you are not able to power on this
high reservation virtual machine, as resources may be fragmented throughout the cluster instead of
available on a single host. As of vSphere 4.1 HA will notify DRS that a power-on attempt was
unsuccessful and a request will be made to defragment the resources to accommodate the remaining
virtual machines that need to be powered on.
The following diagram depicts a scenario where a virtual machine spans multiple slots:
Figure 21: Virtual machine spanning multiple HA slots
Notice that because the memory slot size has been manually set to 1024MB, one of the virtual
machines (grouped with dotted lines) spans multiple slots due to a 4GB memory reservation. As you
might have noticed, none of the hosts has 4 slots left. Although in total there are enough slots
available, they are fragmented and HA will not be able to power on this particular virtual machine
directly, but will request DRS to defragment the resources to accommodate this virtual machine's
resource requirements.
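A rough way to estimate how many slots such a virtual machine consumes (an approximation for
planning purposes, not HA's internal algorithm) is to divide its reservation by the manually
configured slot size and round up, as in this sketch:

import math

def estimated_slots(vm_mem_reservation_mb, mem_slot_size_mb):
    # Round up: a reservation larger than the slot size spills into additional slots
    return math.ceil(vm_mem_reservation_mb / mem_slot_size_mb)

print(estimated_slots(4096, 1024))   # the 4GB VM from Figure 21 -> 4 slots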
Admission Control does not take fragmentation of slots into account when slot sizes are manually
defined with advanced settings. It will take the number of slots this virtual machine will consume
into account by subtracting them from the total number of available slots, but it will not verify
the number of available slots per host to ensure failover. As stated earlier, though, as of
vSphere 4.1 HA will request DRS to defragment the resources. However, this is no guarantee for a
successful power-on attempt or slot availability.
Basic design principle: Avoid using advanced settings to decrease the slot size as it could lead to
more downtime and adds an extra layer of complexity. If there is a large discrepancy in size and
reservations are set, it might help to put similarly sized virtual machines into their own cluster.
Unbalanced Configurations and Impact on Slot Calculation
It is an industry best practice to create clusters with similar hardware configurations. However,
many companies start out with a small VMware cluster when virtualization is introduced and plan
on expanding once trust within the organization has been built.
When the time has come to expand, chances are fairly high that the same hardware configuration is
no longer available. The question is: will you add the newly bought hosts to the same cluster or
create a new cluster?
From a DRS perspective, large clusters are preferred as they increase the load balancing options.
However, there is a caveat for DRS as well, which is described in the DRS section of this book. For
HA there is a big caveat, and if you understand the internal workings of HA you probably already
know what is coming up.
Let’s first define the term “unbalanced cluster”.
An unbalanced cluster would for instance be a cluster with 6 hosts of which one contains more
memory than the other hosts in the cluster.
Let’s try to clarify that with an example.
Example:
What would happen to the total number of slots in a cluster with the following specifications?
• Six host cluster
• Five hosts have 16GB of available memory
• One host has 32GB of available memory
The sixth host is a brand new host that has just been bought, and as memory prices have dropped
immensely the decision was made to buy 32GB instead of 16GB.
The cluster contains a virtual machine that has 1 vCPU and 4GB of memory. A 1024MB memory
reservation has been defined on this virtual machine. As explained earlier, a reservation will
dictate the slot size, which in this case leads to a memory slot size of 1024MB + memory overhead.
For the sake of simplicity we will, however, calculate with 1024MB.
As Admission Control is enabled, a worst-case scenario is taken into account. With a 1024MB slot
size, each 16GB host provides 16 slots and the 32GB host provides 32 slots. When a single host
failure has been specified, the host with the largest number of slots will be taken out of the
equation. In other words, for our cluster this would result in:
esx01 + esx02 + esx03 + esx04 + esx05 = 80 slots available
Although you have doubled the amount of memory in one of your hosts, you are still stuck with only
80 slots in total. As clearly demonstrated, there is absolutely no point in buying additional
memory for a single host when your cluster is designed with Admission Control enabled and a number
of host failures has been selected as the Admission Control Policy.
In our example the memory slot size happened to be the most restrictive; the same principle applies
when the CPU slot size is the most restrictive.
Basic design principle: When using Admission Control, balance your clusters and be conservative
with reservations, as reservations lead to decreased consolidation ratios.
Now what would happen in the scenario above when the number of allowed host failures is set to 2?
In this case esx06 is taken out of the equation along with any one of the remaining hosts in the
cluster, which results in 64 slots. This makes sense, doesn't it?
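The arithmetic behind both scenarios can be sketched as follows. This is illustrative only; the
host memory sizes and the 1024MB slot size come from the example above.

# Memory slots per host with a 1024MB slot size: five 16GB hosts and one 32GB host
slots_per_host = [16, 16, 16, 16, 16, 32]

def usable_slots(slots, host_failures_tolerated):
    # Worst case: the hosts contributing the most slots are assumed to fail
    surviving = sorted(slots)[:len(slots) - host_failures_tolerated]
    return sum(surviving)

print(usable_slots(slots_per_host, 1))   # 80 slots (esx06 taken out of the equation)
print(usable_slots(slots_per_host, 2))   # 64 slots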
Can you avoid large HA slot sizes due to reservations without resorting to advanced settings?
That's a question we get almost daily. The answer used to be NO if per virtual machine reservations
were required: HA uses reservations to calculate the slot size and, pre-vSphere, there was no way
to tell HA to ignore them without using advanced settings. With vSphere, the new Percentage method
is an alternative.
Percentage of Cluster Resources Reserved
With vSphere, VMware introduced the ability to specify a percentage next to a number of host
failures and a designated failover host. The percentage avoids the slot size issue, as it does not
use slots for Admission Control. So what does it use?
When you specify a percentage, that percentage of the total amount of available resources will stay
reserved for HA purposes. First of all, HA will add up all available resources to see how much it
has available in total (virtualization overhead will be subtracted). Then HA will calculate how
much resources are currently reserved by adding up all reservations for both memory and CPU for
powered-on virtual machines.
For those virtual machines that do not have a reservation larger than 256 MHz, a default of 256 MHz
will be used for CPU and a default of 0MB + memory overhead will be used for memory. (The amount of
overhead per configuration type can be found in the "Understanding Memory Overhead" section of
the Resource Management guide.)
In other words:
Current failover capacity = ((total amount of available resources – total reserved virtual machine
resources) / total amount of available resources)
Admission Control will disallow power-on operations that would bring this value below the
configured percentage.
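A simple sketch of this bookkeeping could look like the following. Again, this is illustrative and
not HA's actual code; the reservation totals are assumed to already include the 256 MHz CPU default
and the memory overhead per virtual machine mentioned above, and the cluster sizes in the example
are made up.

def current_failover_capacity(total, reserved):
    # Fraction of cluster resources not claimed by reservations
    return (total - reserved) / total

def admission_allowed(total_cpu_mhz, reserved_cpu_mhz,
                      total_mem_mb, reserved_mem_mb,
                      configured_pct):
    cpu_ok = current_failover_capacity(total_cpu_mhz, reserved_cpu_mhz) >= configured_pct
    mem_ok = current_failover_capacity(total_mem_mb, reserved_mem_mb) >= configured_pct
    return cpu_ok and mem_ok

# Example: 100 GHz / 512 GB cluster, 60 GHz and 350 GB reserved, 25% configured
print(admission_allowed(100_000, 60_000, 524_288, 358_400, 0.25))   # True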
If you have an unbalanced cluster (hosts with different amounts of CPU or memory resources), your
percentage should be equal to or, preferably, larger than the percentage of resources provided by
the largest host. This way you ensure that all virtual machines residing on this host can be
restarted in case of a host failure. In the six-host example above, the 32GB host provides 32GB out
of 112GB of cluster memory, roughly 29%, so the reserved percentage should be at least that.
As explained earlier, this Admission Control Policy does not use slots; as such, resources might be
fragmented throughout the cluster. Although as of vSphere 4.1 DRS is notified to rebalance the
cluster, if needed, to accommodate these virtual machines' resource requirements, a guarantee
cannot be given. We recommend ensuring you have at least one host with enough available capacity
to boot the largest virtual machine (CPU/MEM reservation). Also make sure you select the highest
restart priority for this virtual machine (depending on the SLA, of course) to ensure it will be
able to boot.
The following diagram makes it more obvious. You have 5 hosts, each with roughly 80% memory usage,
and you have configured HA to reserve 20% of resources. A host fails and all its virtual machines
need to fail over. One of those virtual machines has a 4GB memory reservation; as you can imagine,
the first power-on attempt for this particular virtual machine will fail because none of the hosts
has enough memory available to guarantee it.
Figure 25: Available resources
Basic design principle: Although vSphere 4.1 will utilize DRS to try to accommodate the resource
requirements of this virtual machine, a guarantee cannot be given. Do the math; verify that any
single host has enough resources to power on your largest virtual machine. Also take restart
priority into account for this/these virtual machine(s).
Failover Host
The third option one could choose is a designated Failover host. This is commonly referred to as a
hot standby. There is actually not much to tell around this mechanism, as it is “what you see is what
you get”. When you designate a host as a failover host it will not participate in DRS. You will not be
able to power on virtual machines on this host! It is almost like it is in maintenance mode and it will
only be used in case a failover needs to occur.
Chapter 7
Impact of Admission Control Policy
As with any decision when architecting your environment, there is an impact. This especially goes
for the Admission Control Policy. The first decision that will need to be made is whether Admission
Control is enabled or not. We recommend enabling Admission Control, but carefully select the policy
and ensure it fits your or your customer's needs.
Basic design principle: Admission Control guarantees enough capacity is available for virtual
machine failover. As such we recommend enabling it.
We have explained all the mechanisms that are being used by each of the policies in Chapter 6. As
this is one of the most crucial decisions that needs to be made, we have summarized all the pros
and cons for each of the three policies below.
Host Failures Cluster Tolerates
This option is historically speaking the most used for Admission Control. Most environments are
designed with an N+1 redundancy and N+2 is also not uncommon. This Admission Control Policy
uses “slots” to ensure enough capacity is reserved for failover, which is a fairly complex
mechanism. Slots are based on VM-level Reservations.
Pros:
• Fully automated (when a host is added to a cluster, HA re-calculates how many slots are
available).
• Ensures failover by calculating slot sizes.
Cons:
• Can be very conservative and inflexible when reservations are used, as the largest reservation
dictates slot sizes.
• Unbalanced clusters lead to wastage of resources.
• Complexity for the administrator from a calculation perspective.
Percentage of Cluster Resources Reserved
Percentage based Admission Control is the latest addition to the HA Admission Control Policies. It
is based on per-VM reservation calculations instead of slots.
Pros:
• Accurate, as it considers the actual reservation per virtual machine.
• Cluster dynamically adjusts when resources are added.
Cons:
• Manual calculations needed when adding additional hosts to a cluster if the number of host
failures to tolerate needs to remain unchanged.
• Unbalanced clusters can be a problem when the chosen percentage is too low and resources are
fragmented, which means failover of a virtual machine can't be guaranteed, as the reservation of
this virtual machine might not be available as resources on a single host.
Specify a Failover Host
With the Specify a Failover Host Admission Control Policy, when a host fails, HA will attempt to
restart all virtual machines on the designated failover host. The designated failover host is
essentially a "hot standby". In other words, DRS will not migrate VMs to this host when resources
are scarce or the cluster is imbalanced.
Pros:
• What you see is what you get.
• No fragmented resources.
Cons:
• What you see is what you get.
• Maximum of one failover host (N+2 redundancy is impossible).
• Dedicated failover host not utilized during normal operations.
Recommendations
We have been asked many times for our recommendation on Admission Control, and it is difficult to
answer as each policy has its pros and cons. However, we generally recommend a Percentage based
Admission Control Policy. It is the most flexible policy, as it uses the actual reservation per
virtual machine instead of taking a worst-case scenario approach like the number of host failures
does. However, the number of host failures policy guarantees the failover level under all
circumstances. Percentage based is less restrictive, but offers a lower guarantee that HA will be
able to restart all virtual machines in all scenarios. With the added level of integration between
HA and DRS, we believe a Percentage based Admission Control Policy will fit most environments.
Basic design principle: Do the math, and take customer requirements into account. We recommend
using a "Percentage" based Admission Control Policy, as it is the most flexible policy.
Chapter 8
VM Monitoring
VM Monitoring, or VM level HA, is an often overlooked but really powerful feature of HA. The reason
for this is most likely that it is disabled by default and relatively new compared to HA itself. We
have tried to gather all the info we could around VM Monitoring, but it is a pretty straightforward
feature that actually does what you expect it to do.
With vSphere 4.1 VMware also introduced VM and Application Monitoring. Application Monitoring
is a brand new feature that Application Developers can leverage to increase resiliency as shown in
the screenshot below.
Figure 26: VM and Application Monitoring
As of writing there was little information around Application Monitoring besides the fact that the
Guest SDK is used by application developers or partners, like for instance Symantec, to develop
solutions against the SDK. In the case of Symantec, a simplified version of Veritas Cluster Server
(VCS) is used to enable application availability monitoring, including of course responding to
issues. Note that it is not a multi-node clustering solution like VCS itself but a single node
solution. Symantec ApplicationHA, as it is called, is triggered to get the application up and
running again by restarting it. Symantec's ApplicationHA is aware of dependencies and knows in
which order services should be started or stopped. If, however, for whatever reason this fails a
certain number of times (a configurable option within ApplicationHA), HA will be asked to take
action. This action will be a restart of the virtual machine.
Although Application Monitoring is relatively new and only a few partners are currently exploring
its capabilities, it adds a whole new level of resiliency in our opinion. We have tested
ApplicationHA by Symantec and personally feel it is the missing link. It enables you as a System
Admin to integrate your virtualization layer with your application layer. It ensures that protected
services are restarted in the correct order and it avoids the common pitfalls associated with
restarts and maintenance.
Why Do You Need VM/Application Monitoring?
VM and Application Monitoring act on a different level than HA. VM/App Monitoring responds to a
single virtual machine or application failure as opposed to HA, which responds to a host failure.
An example of a single virtual machine failure would for instance be the infamous "blue screen of
death".
How Does VM/App Monitoring Work?
VM Monitoring restarts individual virtual machines when needed. VM/App Monitoring uses a concept
similar to HA: heartbeats. If heartbeats, in this case VMware Tools heartbeats, are not received
for a specific amount of time, the virtual machine will be rebooted. The heartbeats are
communicated directly to VPXA by VMware Tools; these heartbeats are not sent over a network.
Figure 27: VM monitoring sensitivity
When enabling VM/App Monitoring, the level of sensitivity can be configured. The default setting
should fit most situations. Low sensitivity basically means that the amount of allowed "missed"
heartbeats is higher and, as such, the chance of running into a false positive is lower. However,
if a failure occurs and the sensitivity level is set to low, the experienced downtime will be
higher. When quick action is required in case of a possible failure, "high sensitivity" can be
selected, which, as expected, is the opposite of "low sensitivity".
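Conceptually, the decision logic boils down to "reset the virtual machine when no VMware Tools
heartbeat has been seen within the failure interval". The sketch below illustrates that idea only;
the interval values are meant to show how the sensitivity presets differ and should be verified
against the documentation for your vSphere version.

import time

# Illustrative failure intervals per sensitivity preset (seconds); verify against the docs
FAILURE_INTERVAL = {"high": 30, "medium": 60, "low": 120}

def needs_reset(last_heartbeat_ts, sensitivity="medium", now=None):
    # True when the VM has missed heartbeats for longer than the failure interval
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > FAILURE_INTERVAL[sensitivity]

# Example: last heartbeat received 90 seconds ago
print(needs_reset(time.time() - 90, "medium"))   # True
print(needs_reset(time.time() - 90, "low"))      # False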
Table 1: VM monitoring sensitivity
Screenshots
The cool thing about VM Monitoring is the fact that it takes screenshots of the VM console. They are
taken right before a virtual machine is reset by VM Monitoring. This has been added as of vCenter
4.0. It is a very useful feature when a virtual machine "freezes" every once in a while for no
apparent reason. The screenshot can be used to debug the virtual machine's operating system, if and
when needed, and is stored in the virtual machine's working directory.
Basic design principle: VM Monitoring can substantially increase availability. It is part of the HA
stack and we heavily recommend using it!
Flattened Shares
Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA
fails over a virtual machine, it will power on the virtual machine in the Root Resource Pool.
However, the virtual machine's shares were scaled for its appropriate place in the resource pool
hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either
too many or too few resources relative to its entitlement.
A scenario where this can occur would be the following:
VM1 has 1000 shares and Resource Pool A has 2000 shares. However, Resource Pool A contains 2 VMs
and both will have 50% of those 2000 shares. The following diagram depicts this scenario:
Figure 28: Flatten shares starting point
When the host fails, both VM2 and VM3 will end up on the same level as VM1. However, as a custom
shares value of 10,000 was specified on both VM2 and VM3, they will completely blow away VM1 in
times of contention. This is depicted in the following diagram:
Figure 29: Flatten shares host failure
This situation would persist until the next invocation of DRS re-parents the virtual machines to
their original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual
machine's shares and limits before failover. This flattening process ensures that the virtual
machine will get the resources it would have received if it had failed over to the correct Resource
Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed
under the Root Resource Pool with a shares value of 1000.
Figure 30: Flatten shares after host failure before DRS invocation
Of course, when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will
again receive the shares they originally had assigned.
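The flattening itself can be thought of as scaling a virtual machine's shares to the fraction of
its parent resource pool that it represented. Here is a minimal sketch of that idea (not the actual
DRS/HA code, and the helper name is made up) using the numbers from this example:

def flattened_shares(vm_shares, sibling_shares_total, pool_shares):
    # The VM's fraction of its resource pool, expressed against the pool's own shares
    return int(pool_shares * (vm_shares / sibling_shares_total))

# VM2 and VM3 each hold 10,000 of the 20,000 shares inside Resource Pool A (2000 shares)
print(flattened_shares(10_000, 20_000, 2_000))   # 1000, matching Figure 30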
Chapter 10
Summarizing
The integration of HA with DRS has been vastly improved, and so has HA in general. We hope everyone
sees the benefits of these improvements and of HA and VM and Application Monitoring in general. We
have tried to simplify some of the concepts to make them easier to understand; still, we
acknowledge that some concepts are difficult to grasp. We hope, though, that after reading this
section of the book everyone is confident enough to make the changes to HA needed to increase the
resiliency and, essentially, the uptime of your environment, because that is what it is all about.
If there are any questions please do not hesitate to reach out to either of the authors.
Part 2
VMware Distributed Resource Scheduler
Chapter 11
What is VMware DRS?
VMware Distributed Resource Scheduler (DRS) is an infrastructure service run by VMware vCenter
Server (vCenter). DRS aggregates ESX host resources into clusters and automatically distributes
these resources to the virtual machines.
DRS monitors resource usage and continuously optimizes the virtual machine resource distribution
across ESX hosts.
DRS computes the resource entitlement for each virtual machine based on static resource allocation
settings and dynamic settings such as active usage and level of contention.
DRS attempts to satisfy the virtual machine resource entitlement with the resources available in
the cluster by leveraging vMotion. vMotion is used either to migrate virtual machines to
alternative ESX hosts with more available resources or to migrate other virtual machines away to
free up resources.
Because DRS is an automated solution and easy to configure, we recommend enabling DRS to
achieve higher consolidation ratios at low costs.
A DRS-enabled cluster is often referred to as a DRS cluster. In vSphere 4.1, a DRS cluster can
manage up to 32 hosts and 3000 VMs.
Cluster Level Resource Management
Clusters group the resources of the various ESX hosts together and treat them as a pool of
resources; DRS presents the aggregated resources as one big host to the virtual machines. Pooling
resources allows DRS to create resource pools spanning all hosts in the cluster and to apply
cluster level resource allocation policies. Probably unnecessary to point out, but a virtual
machine cannot span hosts even when resources are pooled by using DRS. In addition to resource
pools and resource allocation policies, DRS offers the following resource management capabilities.
Initial placement – When a virtual machine is powered on in the cluster, DRS places the virtual
machine on an appropriate host or generates a recommendation, depending on the automation level.