CPU and Memory Considerations for Virtual Machines on VMware ESXi


This is one of the greatest mysteries still prevalent even amongst the top echelons of virtualization consultants, bloggers and the like. The trouble in understanding exactly how an ESXi host will handle a virtual machine under different circumstances comes down to two very basic things.

1. VMware has never documented, at least in the public domain, many of the technical details of how its algorithms work, sometimes not even the thresholds for particular operations.

2. ESXi uses a vast array of complicated algorithms, and at any point in time a multitude of these algorithms may be in action.

The first thing one has to understand is that CPU and memory management are not two different tasks but rather two sides of the same coin. However, let's talk about them as separate entities at first before funneling down to the co-design aspects of CPU and memory management in the ESXi hypervisor.

Note: All of the material here is either my own opinion or directly borrowed from sites like frankdenneman.nl, yellow-bricks.com and others.

CPU Management

A typical fallacy of virtualization is rooted in how IT departments are structured. Anyone who has spent time in operations knows that the specification for a virtual machine is typically generated by an application management team. In almost all cases, such teams have no idea how virtualization works, so their specs are verbatim copies of what they had on a physical server for the same, or perhaps an earlier, version of the application. IT administrators also typically do not look into the application's details or discuss the VM specs with the application team. This results in numerous 4- and 8-vCPU servers in most environments. Sometimes the CPU allocated to a machine is also increased in response to performance issues. These behemoths chug along for years on end, hogging precious resources unnecessarily, waiting for some smart guy to figure out that things are very wrong.

The first question is: “How many CPUs should you have in your virtual machines?”

The answer is really simple and constant: “1.”

Most users are somewhat taken aback by this. But one needs to realize that they are not dealing with physical hardware anymore; rather, they are literally slicing up physical hardware and dividing it amongst many operating systems. The 'virtual machine shell' is really for the operating system and not for the user, and the slicing happens both in space and in time. Think of this as statistical multiplexing. So if the number of vCPUs in a VM is increased, the total number of vCPUs on the host increases, which in turn increases contention and latency. And the latency is not limited to scheduling latency. Two vCPUs belonging to the same VM are like two close brothers: they like to run together and won't leave each other behind. So if one vCPU falls behind because there is no work for it, ESXi forces the other vCPU of the same VM to stop long enough for the slow brother to catch up. This happens even if a CPU-intensive load is running on the VM.
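
To make the 'brothers' analogy concrete, here is a minimal Python sketch of skew-based co-scheduling. The threshold and the accounting are invented for illustration; ESXi's real relaxed co-scheduler is far more sophisticated and its thresholds are not public.

```python
# Toy model of relaxed co-scheduling: if one vCPU of a VM falls too far
# behind its siblings, the leading vCPUs are "co-stopped" until it catches
# up. The skew threshold here is made up; VMware does not publish the real one.

SKEW_THRESHOLD_MS = 3  # hypothetical

class VirtualMachine:
    def __init__(self, name, num_vcpus):
        self.name = name
        self.progress_ms = [0] * num_vcpus  # per-vCPU executed time

    def run_vcpu(self, vcpu, time_slice_ms):
        """Run one vCPU unless it is too far ahead of its slowest sibling."""
        slowest = min(self.progress_ms)
        skew = self.progress_ms[vcpu] - slowest
        if skew >= SKEW_THRESHOLD_MS:
            print(f"{self.name} vCPU{vcpu}: co-stopped (skew {skew} ms)")
            return
        self.progress_ms[vcpu] += time_slice_ms
        print(f"{self.name} vCPU{vcpu}: ran {time_slice_ms} ms")

vm = VirtualMachine("app01", 2)
# vCPU0 keeps finding work; vCPU1 is idle and never runs.
for _ in range(4):
    vm.run_vcpu(0, 1)
# After 3 ms of skew, vCPU0 is co-stopped even though it still has work.
```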

But then one may wonder: "If there is a CPU-intensive load, it should be using both CPUs to run faster, and there is then no question of one vCPU falling behind."

This is not entirely true. The application may be single-threaded, i.e. sequential, in which case it will not use more than one CPU and is not at all bothered about how many CPUs are present in the system. And though most applications today are multi-threaded, administrators still need to know how many CPUs the application can use effectively. Increasing the vCPU count to deal with performance issues only complicates the matter further.

Similarly, when sizing a VM, a lot of administrators are guilty of planning for maximum load rather than average load, which means that most resources sit idle most of the time. In the case of physical-to-virtual (P2V) conversions, one can use VMware Capacity Planner to calculate the average load. Otherwise it becomes a little more difficult and one has to rely on previous experience. Generally, a scale-out policy is better than a scale-up policy when it comes to designing a virtual datacenter; however, this requires the administrator to have a detailed understanding of the workload.
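
As an illustration of sizing for a high percentile of demand rather than the absolute peak, here is a small Python sketch. The sample data and the 95th-percentile rule are assumptions for illustration, not a VMware recommendation.

```python
# Sizing from observed utilization: plan near a high percentile of demand
# rather than the absolute peak. Sample data and the percentile rule are
# illustrative assumptions, not VMware guidance.
import math

def vcpus_needed(cpu_mhz_samples, core_mhz, percentile=0.95):
    """Suggest a vCPU count covering `percentile` of observed demand."""
    ordered = sorted(cpu_mhz_samples)
    idx = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return max(1, math.ceil(ordered[idx] / core_mhz))

# One hour of per-minute samples: mostly ~1.5 GHz with one 9 GHz spike.
samples = [1500] * 59 + [9000]
print(vcpus_needed(samples, core_mhz=2400))  # -> 1, not the 4 the peak implies
```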


CPU Resource Management

VMware provides several configurable settings that allow the administrator to express the relative importance of VMs. The most important amongst these are 'Reservation' and 'Shares'. Bursts of CPU activity can be capped with the 'Limit' setting.

A reservation tells the hypervisor how much CPU should be permanently set aside for the machine, whereas shares tell it the relative importance of a machine. Reservations should typically be set only when truly required; shares are a much better way of dividing resources amongst machines.
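
The proportional nature of shares can be sketched in a few lines of Python. The numbers are hypothetical, and the real scheduler also honours reservations, limits and resource pool hierarchies.

```python
# Share-based allocation under contention: each competing VM receives host
# capacity in proportion to its shares. Purely illustrative numbers.

def allocate_by_shares(host_mhz, vm_shares):
    total = sum(vm_shares.values())
    return {vm: host_mhz * s / total for vm, s in vm_shares.items()}

# "High" = 2000 shares and "Normal" = 1000 shares for a 1-vCPU VM.
demand = allocate_by_shares(10_000, {"db01": 2000, "web01": 1000, "web02": 1000})
print(demand)  # db01 gets 5000 MHz, each web VM 2500 MHz
```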

In the context of CPU management, a reservation is not so damaging. But even then, under certain conditions the VM for which CPU reservations are set may get choked if it requires more CPU than specified in its reservation. This is because the hypervisor uses a metric called 'MHzPerShare' to determine fairness in CPU allocation across VMs. A reservation on a VM with a CPU-intensive application drives up this metric, and the hypervisor will then offer other VMs the chance to catch up before CPU cycles are offered to this VM again. To gain further insight into this, please look here.
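
The idea behind MHzPerShare can be sketched as follows. This is a simplification of the behaviour described on frankdenneman.nl, and the numbers are invented.

```python
# MHzPerShare = MHz currently utilized / shares. Spare cycles go first to
# the VM with the lowest value, so a reserved, busy VM can end up waiting
# while others "catch up". Numbers are invented for illustration.

def mhz_per_share(utilized_mhz, shares):
    return utilized_mhz / shares

vms = {
    # (utilized MHz, shares) - vm_a already burned cycles via its reservation
    "vm_a": (2000, 1000),
    "vm_b": (500, 1000),
}
ranking = sorted(vms, key=lambda v: mhz_per_share(*vms[v]))
print(ranking)  # ['vm_b', 'vm_a']: vm_b is offered spare cycles first
```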

It is generally not advised to set CPU limits either, especially on an under-loaded system. To understand the impact of CPU limits, please read this KB.

Memory Management

As far as virtualization is concerned, memory is a more expensive resource than CPU. It is a common experience that memory in a physical system is exhausted much faster than CPU or storage; the virtualization density of any server is effectively limited by the amount of memory available on the system. And though VMware is the market leader in this space by miles, memory remains a major area of concern.

VMware uses the following algorithms to handle memory:

1. Transparent Page Sharing (TPS) – This is used to deduplicate identical pages in physical memory. It runs continuously in the background by default (a sketch of the idea follows this list).


2. Memory Compression – This is used to compress pages in physical memory. It can be disabled at the system level or for an individual VM.

3. Ballooning – This has long been, and still is, VMware's crown jewel in memory management. It comes with VMware Tools and is installed in the guest operating system as a driver. Whenever there is a request for memory that cannot be satisfied from the currently free physical memory pages, the balloon driver (memctl) starts inflating, in other words requesting memory from the guest operating system. Once the driver has received this memory, the hypervisor frees up the corresponding physical memory at the backend and gives it to the VM requesting memory. However, please bear in mind that due to TPS the amount of memory the balloon claims may not equal the number of physical pages freed up. Ballooning can be disabled, but this is not advised: ballooning passes the resource contention in the hypervisor down to the guest operating systems, so that currently active pages are not freed up.

4. Swap – This is the last resort for handling memory contention. The hypervisor starts swapping a VM's memory out to the swap file created for it, which usually results in terrible performance for the VM. The hypervisor has no idea which pages are currently active in any VM running on it, so the swap-out can hit active pages. As a rule of thumb, administrators should look at redistributing VMs at the first sign of hypervisor-level swapping.
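
As promised under item 1, here is a toy Python sketch of the idea behind TPS. Real TPS verifies full page contents after a hash match, uses copy-on-write when a shared page is modified, and, as discussed later, shares only within a NUMA node by default.

```python
# Toy version of the idea behind TPS: identical pages are detected via
# hashing and backed by a single machine frame; a write to a shared page
# would trigger copy-on-write (not modelled here).
import hashlib

class PageSharer:
    def __init__(self):
        self.frames = {}    # content hash -> shared machine frame id
        self.next_frame = 0

    def back_guest_page(self, content: bytes) -> int:
        key = hashlib.sha256(content).hexdigest()
        if key not in self.frames:          # first copy: allocate a frame
            self.frames[key] = self.next_frame
            self.next_frame += 1
        return self.frames[key]             # duplicates map to the same frame

sharer = PageSharer()
zero_page = bytes(4096)
a = sharer.back_guest_page(zero_page)   # VM 1's zeroed page
b = sharer.back_guest_page(zero_page)   # VM 2's zeroed page
print(a == b, sharer.next_frame)        # True 1 - one frame backs both
```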

TPS and Memory Compression enable memory overcommitment, while ballooning and swapping are reclamation techniques. Usually, ESXi starts reclamation through ballooning when less than 4% of memory is free, and resorts to swapping when only 2% is free.
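
The thresholds from the text can be expressed as a simple state function. The exact free-memory states and percentages vary by ESXi version, so treat these numbers as illustrative.

```python
# The free-memory thresholds from the text (4% -> balloon, 2% -> swap),
# expressed as a simple state function. Actual ESXi free-memory states and
# their exact percentages vary by version.

def reclamation_action(free_pct: float) -> str:
    if free_pct < 2.0:
        return "balloon + compress + swap"   # last resort
    if free_pct < 4.0:
        return "balloon"                     # guest-cooperative reclamation
    return "none"                            # TPS alone keeps running

for free in (10.0, 3.5, 1.5):
    print(f"{free}% free -> {reclamation_action(free)}")
```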

Memory Resource Management

Similar to the CPU settings, we again have 'Reservation', 'Shares' and 'Limit'. In the context of memory management, setting a reservation can be disastrous.

Unlike reserved CPU, ESXi does not redistribute reserved physical memory pages once the VM has touched them. This is especially bad in the case of Windows VMs, as Windows zeroes out its entire memory at boot and so touches every reserved page immediately. Reservation also has a negative impact on High Availability (HA) clusters set up with the 'Host Failures Cluster Tolerates' admission control policy. HA in this case uses the maximum memory and CPU reservations to calculate the slot size for admission control purposes.
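
A simplified Python sketch of why one large reservation hurts the whole cluster under this policy. Real HA also accounts for per-VM memory overhead and uses default slot values when no reservations are set; the figures below are invented.

```python
# Slot-size admission control: the slot is sized by the *largest* CPU and
# memory reservations in the cluster, so one big reservation shrinks the
# number of slots for everyone. Simplified; numbers are invented.

def slots_per_host(host_mhz, host_mb, vm_reservations):
    slot_mhz = max(cpu for cpu, _ in vm_reservations)
    slot_mb = max(mem for _, mem in vm_reservations)
    return min(host_mhz // slot_mhz, host_mb // slot_mb)

vms = [(500, 1024), (500, 1024), (4000, 16384)]   # (MHz, MB) reservations
print(slots_per_host(20_000, 65_536, vms))  # 4 slots; without the big VM: 40
```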

However, reservation does have one positive effect. When a virtual machine is powered on, a swap file equal in size to its configured memory, minus any memory reservation, is created on the datastore. So in a system where memory is plentiful but you have run out of storage, memory reservations can be set to shrink the swap files and thereby increase virtualization density. Typically, though, this is not a good idea for hosts running a varied mix of applications.
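
The swap file arithmetic is simple enough to show directly; the sizes below are arbitrary examples.

```python
# Per-VM swap file size: configured memory minus the memory reservation,
# which is why reservations can trade memory for datastore space.

def vswp_size_mb(configured_mb: int, reservation_mb: int) -> int:
    return configured_mb - reservation_mb

print(vswp_size_mb(8192, 0))      # 8192 MB .vswp - no reservation
print(vswp_size_mb(8192, 8192))   # 0 MB .vswp - fully reserved memory
```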

Again, it is the Shares setting that comes to the rescue. As with CPU, share settings decide the proportional importance of a VM when the hypervisor divides memory among VMs under contention. Please read this article for a better understanding.

NUMA

This is the technology that has made life harder for virtualization administrators. AMD has been using the NUMA architecture in its Opteron processors for years, and for a few years now even Intel has moved over to NUMA from its earlier shared Front Side Bus (FSB) architecture, in which all CPUs reached memory through the chipset's north bridge (the classic north bridge / south bridge design).

So basically, in NUMA each CPU package has its own bank of directly attached memory, known as local memory. To access memory attached to another processor (remote memory), the package has to communicate with that CPU over the interprocessor interconnect. Remote access is therefore much slower than local access.
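
A toy model of why locality matters; the latencies are invented, and the real local-to-remote ratio depends on the platform.

```python
# Remote accesses pay the interconnect penalty, so the average access time
# degrades as locality drops. Latencies are hypothetical.
LOCAL_NS, REMOTE_NS = 80, 140

def avg_access_ns(local_fraction: float) -> float:
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

for f in (1.0, 0.8, 0.5):
    print(f"{f:.0%} local -> {avg_access_ns(f):.0f} ns average")
```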

NUMA and VMware

VMware ESXi is a NUMA-aware hypervisor. This means that on NUMA hardware it has special optimization algorithms that kick in. The goal of ESXi is to keep at least 80% of a VM's working set localized, i.e. in local memory. This is achieved by assigning a 'soft' package affinity to that VM.

If the locally available share of a VM's memory falls below 80%, it is considered 'poor localisation'. If this value falls too far, ESXi has algorithms to determine whether it makes more sense to move the CPU affinity of the VM to another package.
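
The locality check can be sketched as follows. The 80% threshold comes from the text above; the decision logic itself is an invented simplification.

```python
# Below 80% local memory the scheduler considers the VM poorly localized
# and may evaluate a better home node. Threshold from the text; the rest
# is an illustrative simplification.

POOR_LOCALITY = 0.80

def should_consider_migration(local_mb: int, total_mb: int) -> bool:
    return (local_mb / total_mb) < POOR_LOCALITY

print(should_consider_migration(7500, 8192))  # False (~92% local)
print(should_consider_migration(6000, 8192))  # True  (~73% local)
```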


ESXi will always prefer to place all vCPUs belonging to the same VM on the same NUMA node. This is why it makes more sense to have a physical server with more cores per package than with more packages. This not only improves physical RAM locality but also vastly improves cache locality, at least up to the L2 cache. Also, servers with fewer packages but more cores have fewer NUMA nodes, which reduces the need for wide VMs.

As a side note, at the vCPU level there is no difference between multiple virtual sockets and multiple cores per socket, apart from the fact that CPU Hot-Add works only for sockets.

Wide VMs

A wide VM is a VM that has more vCPUs than can fit into a single NUMA node. Please note that ESXi ignores hyperthreading when making this calculation and counts physical cores only. Once a VM is determined to be wide and its vCPUs are scheduled, the NUMA optimizations kick in and memory is localized as far as possible.

However, when wide VMs are deployed, the memory is not fully localized but interleaved: ESXi distributes the total memory evenly amongst the NUMA nodes used by the VM. This implies that a vCPU scheduled on node 1 may be accessing memory from node 2. Wide VMs should therefore be used only when absolutely required.
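
A sketch of the wide-VM rule and the interleaving it causes; the node size is hypothetical.

```python
# A VM is "wide" when its vCPU count exceeds the physical cores of one
# NUMA node (hyperthreads are ignored); its memory is then interleaved
# across the nodes it spans. Node size is hypothetical.

CORES_PER_NODE = 8   # physical cores per NUMA node on this example host

def is_wide(num_vcpus: int) -> bool:
    return num_vcpus > CORES_PER_NODE

def interleave_memory(total_mb: int, num_nodes: int):
    """Spread the VM's memory evenly across the NUMA nodes it occupies."""
    share, extra = divmod(total_mb, num_nodes)
    return [share + (1 if i < extra else 0) for i in range(num_nodes)]

vcpus, mem_mb = 12, 49_152
if is_wide(vcpus):
    nodes = -(-vcpus // CORES_PER_NODE)             # ceiling division: 2
    print(nodes, interleave_memory(mem_mb, nodes))  # 2 [24576, 24576]
```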

TPS with NUMA

TPS, as mentioned earlier, deduplicates memory pages. However, deduplicating pages across NUMA nodes can induce a performance hit, because a shared page may end up remote to one of the VMs using it. This behavior is therefore disabled by default: TPS works only within each NUMA node.

This restriction can be lifted, and in memory-starved environments the extra free memory recovered by system-wide TPS may justify the performance hit. However, most environments do not need this.

Similarly, other memory techniques are also localized to the NUMA node.

Please read this article for greater insight.