SER1815BU

DRS Advancements: What's New and What Is Being Cooked Up in Resource Management Land

Thomas Bryant, VMware, Inc - @kix1979
Maarten Wiggers, VMware, Inc

#VMworld #SER1815BU

VMworld 2017 Content: Not for publication or distribution

Disclaimer

• This presentation may contain product features that are currently under development.

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

• Technical feasibility and market demand will affect final delivery.

• Pricing and packaging for any new technologies or features discussed or presented have not been determined.


Agenda

What is DRS?

How DRS works

Proven Best Practices

New 6.5 Features

VMware Labs

Industry Trends

Q & A


What is DRS?


• Initial placement & ongoing balancing: minimize risk of contention and satisfy business policy

• Maintenance mode: no downtime for infrastructure updates

• Power management & consolidation: efficient use of infrastructure


What is in the DRS family

• Distributed Resource Scheduler (DRS)

– Resource Pools

– Leverages shares

– NIOC

• Storage DRS (sDRS)

– SIOC

• Distributed Power Management (DPM)


What’s new in vSphere 6.5

• Proactive HA

• Predictive Workload Balancing (pDRS)

• Policy-based SIOC configuration

• Network Aware DRS


DRS by the numbers


• 81% Fully Automated

• 15% Partially Automated

• 4% Manual


[Figure: DRS feature usage. Values shown: 100%, 89%, 48%. Labels shown: Affinity/Anti-affinity rules, Maintenance Mode, Resource Pool.]


How DRS works


Distributed Resource Scheduler (DRS)

• Performance

– DRS keeps VMs happy

– Resource Pools

• Operational

– DRS affinity rules: Control the placement of VMs on hosts within a cluster.

– Maintenance Mode

– Works in conjunction with

• HA

• Proactive HA

• Fault Tolerance

• vSphere Update Manager (VUM)

• Auto-Deploy & others


Metrics used for Initial Placement and Load balancing

• Many host-level and VM-level stats and metrics are considered during initial placement (IP) and load balancing (LB)

• A few important VM metrics:

– CPU active, run and peak

– Memory overhead, growth-rate

– Active, Consumed and Idle memory

– Network saturation


Constraints are essential

• HA admission control policies (slot-based, reserved % for failover, etc.)

• Affinity and anti-affinity rules

• # of concurrent vMotions

• Datastore connectivity

• vCPU to pCPU ratio

• Reservation, limit and share settings

• Special VMs (e.g., SMP-FT, latency-sensitive VMs)

• Placement on hosts that have all required physical devices


De-Mystifying Resource Pool

• Resource Pool:

– Powerful abstraction to segregate resources in a cluster

– Set business requirements based on workload importance and characteristics

– Provides isolation between resource pools

– It is the fundamental building block for vCAN partners (Cloud Service Providers)

• Resource controls:

1. Reservation (MHz or MB)

• Minimum MHz or MB guaranteed

• By default, R = 0 (no dedicated resources)

2. Limit (MHz or MB)

• Maximum MHz or MB allowed

• By default, L = 0 (meaning unlimited)

3. Shares (No unit)

• Relative priority between siblings

• Determines how resources are divided proportionally when there is contention


Resource Pool Example

Root RP (Total Cluster Capacity = 100 GHz)

– RP1 (Production): R=80, S=400; VMs VM-P1 ... VM-P10

– RP2 (Analytics): R=0, S=100; VMs VM-A1 ... VM-A20

Total Shares = 400 + 100 = 500

Capacity under contention = 100 – 80 = 20 GHz

RP1 quota = (400 / 500) x 20 = 16 GHz

RP2 quota = (100 / 500) x 20 = 4 GHz
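The arithmetic in this example can be reproduced with a short Python sketch. It only mirrors the slide's calculation (unreserved capacity divided in proportion to shares); the function name and the dictionary layout are illustrative assumptions, not a DRS API.

```python
def split_by_shares(capacity_ghz, pools):
    """Divide the unreserved capacity among sibling pools by shares.

    pools maps a pool name to (reservation_ghz, shares).
    Hypothetical helper that mirrors the slide's arithmetic only.
    """
    reserved = sum(reservation for reservation, _ in pools.values())
    contended = capacity_ghz - reserved          # 100 - 80 = 20 GHz
    total_shares = sum(shares for _, shares in pools.values())
    return {
        name: contended * shares / total_shares
        for name, (_, shares) in pools.items()
    }


quotas = split_by_shares(100, {"RP1": (80, 400), "RP2": (0, 100)})
print(quotas)  # {'RP1': 16.0, 'RP2': 4.0}
```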


Cost Benefit and minGoodness

• Cost-Benefit Analysis:

– The cost of a VM migration is evaluated against the potential benefits with respect to VM demands and host load

– Cost considerations:

• Each vMotion reserves 30% of a CPU core on 1GbE and 100% of a CPU core on 10GbE

• Memory consumption of “Shadow VM” at the destination host

• Negative performance impact to VMs at the destination host

• Potential memory reclamation implication at the destination host

– Benefit considerations:

• Positive performance benefits to VMs at the source host

• Positive performance gains for the migrated VM at the destination host

• VMs on the source host and the moved VM have more headroom for utilization spikes

• minGoodness:

– vMotions need to improve cluster balance beyond this threshold (configured through DRS migration threshold)
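As a rough illustration of how these two checks could combine, here is a minimal Python sketch. The scoring values, the function names, and the idea of comparing a single benefit number against a single cost number are assumptions made for the example; the slides do not describe how DRS computes them. Only the per-vMotion CPU reservation figures come from the slide.

```python
# CPU-core fraction reserved per vMotion, keyed by link speed in GbE (from the slide).
VMOTION_CPU_CORE_FRACTION = {1: 0.30, 10: 1.00}


def worth_migrating(benefit, cost, goodness, min_goodness):
    """Hypothetical filter: a move must pay for itself (benefit > cost)
    and improve cluster balance beyond the minGoodness threshold."""
    return benefit > cost and goodness > min_goodness


# Made-up scores for two candidate moves.
print(worth_migrating(benefit=5.0, cost=2.0, goodness=0.4, min_goodness=0.1))  # True
print(worth_migrating(benefit=1.0, cost=2.0, goodness=0.4, min_goodness=0.1))  # False
print(VMOTION_CPU_CORE_FRACTION[10])  # 1.0
```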


Cost Benefit and minGoodness

• VM happiness ☺ is the most important metric!

– If a VM's demand and entitlement for resources are always met, the VM is “happy”

– During initial placement, DRS ensures minimum performance impact on already running VMs

– During load balancing, DRS keeps VMs happy with a minimum number of vMotions


Memory Metrics in ESXi

• Consumed: All touched memory pages minus page sharing

• Active: Estimated based on recently-touched memory pages

[Diagram labels: Configured VM Size, Consumed Memory, Idle Memory, Active Memory, Shared Memory]


Memory Metrics and DRS

• What DRS uses by default to balance memory: Active Memory plus 25% of Idle Memory

• What is displayed in the cluster summary screen: sum of Consumed memory of all VMs on the host


DRS Settings – Automation Level

• Manual – vCenter Server suggests migration recommendations for VMs

• Partially Automated – Automatic placement; migration recommendations are only suggested

• Fully Automated (recommended) – Automatic placement; migration recommendations are applied automatically


Migration Threshold


• Priority 1 – Only mandatory moves (maintenance mode or affinity/anti-affinity rules)

• Priority 2 – Very conservative. Only recommends moves where a severe imbalance is detected.

• Priority 3 – Conservative yet balanced approach. (Default)

• Priority 4 – Semi-aggressive. (Recommended if a balanced cluster is desired)

• Priority 5 – Very aggressive. Will balance even if very little performance benefit results.


How SDRS works


SIOC - IO control w/single datastore

[Diagram: Storage IO Control capabilities, showing Storage IO Control and the ESX IO Scheduler. Control: IO Reservations.]


Storage IO Control

▪ Control Congestion in shared datastore

▪ Detect Congestion

– SIOC monitors average IO latency for a datastore

– Latency above a threshold indicates congestion

▪ SIOC throttles IOs once congestion is detected

– Control IOs issued per host

– Based on VM shares, reservations, and limits on each host

– Configurable via Storage Policies (SPBM)

– Throttling adjusted dynamically based on workload

• Idleness

• Bursty behavior
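The detect-then-throttle flow above can be sketched in Python. This is a simplified illustration, not the SIOC implementation: the function names, the idea of dividing a fixed pool of IO slots, and the sample numbers are all assumptions, and reservations and limits are left out.

```python
def is_congested(latencies_ms, threshold_ms):
    """Congestion is flagged when average datastore latency exceeds the threshold."""
    return sum(latencies_ms) / len(latencies_ms) > threshold_ms


def split_io_slots(total_slots, shares_by_host):
    """Hypothetical throttle: divide a pool of IO slots among hosts in
    proportion to the total VM shares on each host (integer division)."""
    total_shares = sum(shares_by_host.values())
    return {
        host: total_slots * shares // total_shares
        for host, shares in shares_by_host.items()
    }


samples = [10, 40, 55]  # made-up latency samples in ms, average is 35
if is_congested(samples, threshold_ms=30):
    print(split_io_slots(64, {"esx1": 3000, "esx2": 1000}))
    # {'esx1': 48, 'esx2': 16}
```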


SDRS – IO control w/multiple datastores

[Diagram: Storage DRS and Storage IO Control]


Storage DRS

▪ Ease of Storage Management

▪ Initial Placement

▪ Out of Space Avoidance

▪ IO Load Balancing

▪ Virtual Disk Affinity (Anti-Affinity)

▪ Datastore Maintenance Mode

▪ Add Datastore

[Diagram: Datastore Cluster, Storage vMotion]


Key Takeaways

• Initial placement and load balancing are greatly influenced by:

– Real-time stats from ESX hosts and VMs (e.g., CPU demand, active memory, consumed memory)

– Constraints (e.g., HA policies, affinity rules)

– Cost Benefit Analysis

• VM Happiness ☺ is the #1 influencer for both initial placement and load balance decisions

• A small imbalance in the DRS/SDRS cluster should not be a concern.

• SDRS/SIOC helps to solve IO contention

• Start at default Priority 3. Adjust up if you require a more aggressive balance profile

– This can cause additional vMotions


Proven Best Practices


Best Practices - Tip #1 – Use “Latency Sensitivity” flag

• For latency-sensitive VMs, set the “latency sensitivity” flag

• ESX CPU scheduler gives prioritized scheduling for this VM

• DRS ensures this VM is *not* disturbed during periodic load balancing


Best Practices – Tip #2 – CPU Ready time?

• Check that BIOS power management is set to “OS control” mode

• Ensure the ESX power management “Active Policy” is set to “Performance”


Best Practices – Tip #3 – Full Storage Connectivity

• Ensure all hosts have access to all datastores

• Results in an efficient initial placement, load balancing and workload consolidation

• VM availability is improved significantly


New 6.5 Features


Key Themes for 6.5 enhancements

• Enable higher-churn environments like containers & DevOps

– Improved algorithm

– Scalability enhancements

• Business critical

– pDRS

– Proactive HA

– Network Aware DRS

– Advanced Options UI enhancements


DRS Algo Enhancements

• Improved initial placement algorithm

– Even VM distribution

– Saves on vMotions on subsequent load balancing!

• More aggressive

– Detects and corrects outlier situations

– Recommends/balances until no two hosts differ by a defined value

• maximum and minimum host entitlement

• And more!


Resource Utilization Optimization (of vCenter)

• Throughput > 2.5x increase

• 70% resource reduction at scale

• VM Power-on Latency > 3x improvement

• DRS Cluster Compatibility check

• > 21x Improvement

• Less than 2% CPU utilization

• > 850 MB Reduction


http://www.vmware.com/techpapers/2017/drs-cluster-mgmt-perf.html


Predictive DRS

• Tight integration with vRealize Operations Manager (vROps)

• Resource utilization trends are observed

• Predicted demand of workloads is incorporated in ‘initial placement’ and ‘load balancing’

• Current VM demands are honored before future demands are satisfied


vSphere DRS

• Ingests forecasted metrics

• Balances cluster based on forecasted utilization

vRealize Operations

• Computes and forecasts utilization based on metric history.

• CPU

• Memory

• Dynamic Thresholds created and data passed to DRS


Predictive DRS

• Some workloads have predictable resource utilization trends

• Having a high level of confidence allows DRS to pro-actively prepare for increased demand before demand occurs

• Potentially faster balancing and better performance from VMs
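One way to picture how a forecast could feed into balancing is the Python sketch below. It assumes the balancer plans for the larger of the current and predicted demand, so current demand is never undercut by a lower forecast; that rule, the function name, and the sample values are assumptions for illustration, not something these slides specify.

```python
def effective_demand_mhz(current_mhz, predicted_mhz=None):
    """Hypothetical rule: plan for the larger of current and forecast demand.
    With no forecast available, fall back to current demand."""
    if predicted_mhz is None:
        return current_mhz
    return max(current_mhz, predicted_mhz)


# Made-up values: a VM idling now with a forecast spike, and the reverse case.
print(effective_demand_mhz(500, 2400))   # 2400
print(effective_demand_mhz(1800, 900))   # 1800
print(effective_demand_mhz(1200))        # 1200
```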

[Figure: resource demand over time, observed vs. predicted. Reactive path: observed spike, react, remediation complete. Predictive path: predicted spike, prepare, proactive remediation complete.]


Proactive High Availability

• Proactive evacuation of VMs from degraded hosts based on hardware health metrics

• Increase the availability of VMs even more than current technology provides

• Tight integration, qualification and certification with hardware vendors


What would this look like?

1. Servers are running in the datacenter

2. Hardware is monitored via OEM software

3. Health alerts/updates are pushed to vCenter

4. DRS is invoked along with the health state; workloads are moved according to severity
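Step 4 can be sketched as a severity-ordered plan. The health state names, the remediation names, and the mapping between them are illustrative assumptions; the slides only say that workloads are moved according to severity.

```python
# Illustrative mapping only; states and actions are assumed names.
REMEDIATION_BY_HEALTH = {
    "healthy": "none",
    "moderate": "quarantine",
    "severe": "maintenance",
}


def plan_remediation(health_by_host):
    """Return (host, action) pairs, most severe hosts first.
    Healthy hosts are left out of the plan."""
    plan = []
    for state in ("severe", "moderate"):
        for host, health in health_by_host.items():
            if health == state:
                plan.append((host, REMEDIATION_BY_HEALTH[state]))
    return plan


print(plan_remediation({"esx1": "healthy", "esx2": "moderate", "esx3": "severe"}))
# [('esx3', 'maintenance'), ('esx2', 'quarantine')]
```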


Customized Proactive HA automation settings


Degradation events generated in vCenter


Provider Health Host Failure Condition

Remediation


Network-Aware DRS


• Network utilization has not been a first-class citizen alongside CPU and memory

• Network-Aware DRS is based on host pNIC saturation

• Advanced option for Network Utilization %

– ‘NetworkAwareDrsSaturationThresholdPercent’

– Default is 80%
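A minimal sketch of a saturation check against this threshold follows. Only the option name and the 80% default come from the slide; the function, the strict "below threshold" comparison, and the utilization figures are assumptions for illustration.

```python
# Default of the NetworkAwareDrsSaturationThresholdPercent option, per the slide.
SATURATION_THRESHOLD_PCT = 80


def unsaturated_hosts(pnic_util_pct_by_host, threshold_pct=SATURATION_THRESHOLD_PCT):
    """Hypothetical filter: keep hosts whose pNIC utilization is below the threshold."""
    return [
        host
        for host, util_pct in pnic_util_pct_by_host.items()
        if util_pct < threshold_pct
    ]


print(unsaturated_hosts({"esx1": 35.0, "esx2": 92.5, "esx3": 79.9}))
# ['esx1', 'esx3']
```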


Advanced Options in the UI

• Do not need to know the property name

• Easier to consume

• Commonly used options


Advanced Options in the UI

• Even distribution of virtual machines

– ‘TryBalanceVmsPerHost’

– Best effort attempt for purposes of availability

– Each host given a maxVMs limit (avg VMs per host)

• Only applied to the Load Balancing Algorithm (Initial Placement can violate this)

• Will try to balance VMs (count) but if there is an imbalance of resources, DRS will violate the VM balance

– Attempts to move small VMs to correct the maxVMs limit violations

– May introduce more vMotions
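The maxVMs idea can be illustrated with a short Python sketch: compute the average VM count per host and list the hosts above it. Rounding the average up and the function name are assumptions; the slide only states that each host gets a maxVMs limit based on the average VMs per host.

```python
import math


def hosts_over_limit(vm_count_by_host):
    """Return (max_vms, overage) where overage maps each host above the
    limit to the number of VMs it exceeds the limit by.
    Assumes the limit is the average VM count rounded up."""
    max_vms = math.ceil(sum(vm_count_by_host.values()) / len(vm_count_by_host))
    overage = {
        host: count - max_vms
        for host, count in vm_count_by_host.items()
        if count > max_vms
    }
    return max_vms, overage


print(hosts_over_limit({"esx1": 12, "esx2": 5, "esx3": 4}))
# (7, {'esx1': 5})
```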


Advanced Options in the UI

CPU Over-commitment

• Used heavily by VDI

• Applies to certain application requirements (Exchange and others may require a specific ratio)

• MaxVcpusPerCore – Sets the maximum CPU overcommitment per host for the cluster

• MaxVcpusPerClusterPct – Sets the maximum CPU overcommitment for the cluster
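A per-host ratio check of the kind MaxVcpusPerCore implies might look like the sketch below. The function name, its parameters, and the "less than or equal" comparison are assumptions for illustration; the slide gives only the option name and its purpose.

```python
def fits_vcpu_ratio(host_vcpus, vm_vcpus, host_cores, max_vcpus_per_core):
    """Hypothetical admission check: would placing the VM keep the host's
    vCPU-to-core ratio at or below the configured maximum?"""
    return (host_vcpus + vm_vcpus) / host_cores <= max_vcpus_per_core


# A 16-core host already running 60 vCPUs, with a 4:1 limit.
print(fits_vcpu_ratio(60, 4, 16, 4))  # True  (64 / 16 = 4.0)
print(fits_vcpu_ratio(60, 8, 16, 4))  # False (68 / 16 = 4.25)
```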


Advanced Options in the UI

Consumed Memory vs Active Memory

• ‘PercentIdleMBInMemDemand’

• Allow DRS to balance on Consumed Memory

• Specifically for environments that are under-committed in memory
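The effect of this option can be sketched as follows, assuming memory demand is active memory plus a percentage of idle memory, and that idle memory is consumed minus active. The 25 default follows the "Memory Metrics and DRS" slide; the formula and function name are assumptions for illustration.

```python
def memory_demand_mb(active_mb, consumed_mb, percent_idle=25):
    """Hypothetical demand formula: active + percent_idle% of idle memory.
    Idle is assumed to be consumed minus active (never negative)."""
    idle_mb = max(consumed_mb - active_mb, 0)
    return active_mb + idle_mb * percent_idle / 100


# A VM with 2 GB active and 8 GB consumed.
print(memory_demand_mb(2048, 8192))                    # 3584.0
print(memory_demand_mb(2048, 8192, percent_idle=100))  # 8192.0
```

With the percentage at 100, the sketch's demand equals consumed memory, which matches the slide's point about letting DRS balance on Consumed Memory.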


DRS Labs


DRS Lens


DRS Dump Insight


Industry Trends


What’s Next – Industry trends we are considering

• Application requirements are changing

– Customers are moving beyond traditional apps to containers, DevOps, and business-critical apps

• Application requirements span all aspects of infrastructure

– More integrated management (e.g., HCI)

• Touches compute, storage & network

• IT is increasingly important to the business

– Increasing visibility for compliance, auditing, & legal reasons


Q&A
