Stretched Clusters and VMware vCenter Site Recovery Manager: How and When to Choose One, the Other, or Both

Chad Sakac, EMC Corporation

Vaughn Stewart, NetApp

INF-BCO2982

#vmworldinf

&

Where were we last year?

• Covered at VMworld 2011
– BCO2863: Using Distance to Your Advantage (NetApp)
– BCO2479: Understanding vSphere Stretched Clusters (EMC)
• Stretched clusters have existed since VI3 with NetApp MetroCluster, and accelerated with EMC VPLEX's entrance into the market
– vSphere 5 introduced vMSC certification, initially with EMC VPLEX, accelerating with new entrants
• Customers are actually seeking availability
– Blending Backup, Disaster Recovery & Disaster Avoidance

&

The State of the Union

• Adoption continues to accelerate!
• vSphere Metro Storage Cluster (vMSC) HCL is expanding
• Hardening of VM HA for stretched clusters
– terminateVMonPDLByDefault in vSphere 5.0 u1 and vSphere 5.1
– Timeout of IO on APD
• Stretched Clusters + SRM = AND, not an OR
• Expanding the use cases
– Longer and longer distances
– Reducing hardware dependencies

&

Customers Want Geo-Spanned Availability

• Stretched Clusters: across data centers, synchronous distances
• Future…: across data centers, asynchronous distances
• Disaster Recovery: operational and 3rd site recovery

&

“Disaster Recovery” “Disaster Avoidance”

“High Availability”

…Words Matter

&

“Disaster” Avoidance – Host Level

“Hey… That host WILL need to go down for maintenance. Let’s vMotion to avoid a disaster and outage.”

This is vMotion.

Most important characteristics:

• By definition, avoidance, not recovery.
• “Non-disruptive” is massively different than “almost non-disruptive”

&

“Disaster” Recovery – Host Level

Hey… That host WENT down due to an unplanned failure, causing an unplanned outage due to that disaster. Let’s automate the RESTART of the affected VMs on another host.

This is VM HA.

Most important characteristics:

• By definition, recovery (restart), not avoidance
• Simplicity, automation, sequencing

&

Disaster Avoidance – Site Level

Hey… That site WILL need to go down for maintenance. Let’s vMotion to avoid a disaster and outage.

This is inter-site vMotion.

Most important characteristics:

• By definition, avoidance, not recovery.
• “Non-disruptive” is massively different than “almost non-disruptive”

&

Disaster Recovery – Site Level

Hey… That site WENT down due to an unplanned failure, causing an unplanned outage due to that disaster. Let’s automate the RESTART of the…

This is Disaster Recovery.

Most important characteristics:

• By definition, recovery (restart), not avoidance
• Simplicity, testing, split brain behavior, automation, sequencing, IP address changes

&

VMware High Availability

[Figure: a vSphere HA cluster stretched across a campus or metro area]

• VMware High Availability
– Extended between distributed parts of the same virtual datacenter
– Automatic rapid recovery from host failures
– No complex clustering software in the VM

&

VMware Fault Tolerance

[Figure: a vSphere HA cluster containing an FT protected VM]

• VMware Fault Tolerance
– Easily enabled/disabled per virtual machine
– Eliminate VM downtime due to hardware failures
– Protect homegrown applications without a clustering solution

Note: not part of the vMSC, ergo not VMware supported; MAY be vendor supported.

&

VMware VM Host Affinity

[Figure: a vSphere HA cluster divided into a Site 1 Affinity Group and a Site 2 Affinity Group]

• VMware Host Affinity
– Provides a “site affinity” capability
– Keeps workloads local to storage until failure
– Keeps primary and secondary FT VMs in appropriate sites

Note: considerations later.

&

Type 1: “Stretched Single vSphere Cluster”

&

Stretching VMware vSphere Clusters

[Figure: a vSphere HA cluster on stretched storage (e.g. EMC VPLEX, NetApp MetroCluster) with array-based synchronous replication]

&

Planned Datacenter Migration

[Figure: a vSphere HA cluster using standard vMotion of virtual machines, moving all operations between locations]
&

One little note re: “Intra-Cluster” vMotion

• Intra-cluster vMotions can be highly parallelized
– and more and more with each passing vSphere release
– With vSphere 4.1 and vSphere 5.x it’s up to 4 per host/128 per datastore if using 1GbE
– 8 per host/128 per datastore if using 10GbE
– …and that’s before you tweak settings for more, and shoot yourself in the foot :-)
• Need to meet the vMotion network requirements
– 622Mbps or more, 5ms RTT (upped to 10ms RTT if using Metro vMotion, vSphere 5 Enterprise Plus)
– Layer 2 equivalence for vmkernel (support requirement)
– Layer 2 equivalence for VM network traffic (required)
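The limits and network requirements on this slide lend themselves to a quick back-of-the-envelope check. The sketch below is a rough planning aid only: the function names are hypothetical, the per-host limit is applied to source hosts only, and real evacuations are also bounded by destination hosts, memory change rate and available bandwidth, none of which are modelled here.

```python
import math

# Figures quoted on the slide for vSphere 4.1 / 5.x.
PER_HOST_LIMIT = {"1GbE": 4, "10GbE": 8}     # concurrent vMotions per host
PER_DATASTORE_LIMIT = 128                     # concurrent vMotions per datastore
MIN_BANDWIDTH_MBPS = 622                      # minimum vMotion link bandwidth
MAX_RTT_MS = {"standard": 5, "metro": 10}     # Metro vMotion raises the RTT ceiling


def link_meets_vmotion_requirements(bandwidth_mbps, rtt_ms, metro_vmotion=False):
    """Check an inter-site link against the bandwidth and RTT figures above."""
    rtt_limit = MAX_RTT_MS["metro" if metro_vmotion else "standard"]
    return bandwidth_mbps >= MIN_BANDWIDTH_MBPS and rtt_ms <= rtt_limit


def max_concurrent_vmotions(source_hosts, datastores, nic="10GbE"):
    """Upper bound on simultaneous vMotions (assumption: source side only)."""
    return min(source_hosts * PER_HOST_LIMIT[nic], datastores * PER_DATASTORE_LIMIT)


def evacuation_waves(vm_count, source_hosts, datastores, nic="10GbE"):
    """Number of back-to-back batches needed to move vm_count VMs."""
    return math.ceil(vm_count / max_concurrent_vmotions(source_hosts, datastores, nic))
```

For example, a 1 Gbps link with 8 ms RTT fails the standard check but passes once Metro vMotion is assumed, and 100 VMs on 4 hosts with 1GbE vMotion networking would need at least 7 batches.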

&

Type 2: “Two Clusters, Stretched Storage, inter-cluster vMotion”

We don’t see this much, so will skip it.

&

Type 3: “Classic Site Recovery Manager”

[Figure: vSphere Cluster A (with the protected-site vCenter) and vSphere Cluster B (with the recovery-site vCenter), separated by distance. Datastore A at the protected site has a read-only replica at the recovery site, which gets promoted or snapshotted to become writeable. Replication is array-based (sync, async or continuous) or vSphere Replication v1.0 (async).]

&

Type 4: “Stretched Cluster + Site Recovery Manager”

&

Can You Have Stretched Clusters + SRM? YES!

• Deduped replication
• Native WAN compression and/or replication compression
• VMware vSphere 5 integration
• Robust DR testing and sequencing
• Automated failback

[Figure labels: Stretched vSphere Cluster; Array Replica (EMC RecoverPoint, NetApp SnapMirror); VMware vSphere 5 Site Recovery Manager]

&

Summary of “Taxonomy Matters”

• Disaster Avoidance != High Availability != Disaster Recovery
– The same logic that applies at a server level applies at the site level
– The same value (non-disruptive for avoidance, automation/simplicity for recovery) that applies at a server level applies at the site level
– Don’t underestimate the importance of DR testing
• Stretched clusters have complex considerations
• vMotion = single vCenter domain vs. SRM = two or more vCenter domains
• Straightforward SRM for most (~1 of every 5)
• Stretched Clusters and SRM are no longer mutually exclusive

&

Thinking of Stretching?

vSphere Stretched Clusters Considerations

&

Stretched Cluster Design Considerations

• Understand the difference compared to DR
– HA does not follow a robust, scriptable recovery plan workflow
– HA is not site aware for applications: where are all the moving parts of my app? Same site or dispersed? How will I know what needs to be recovered?
– DR usually involves a regular, structured “DR test”.
• Single stretch site = single vCenter
– During disaster, what about vCenter setting consistency across sites? (DRS affinity, cluster settings, network)
• Will the network support it? Layer 2 stretch? IP mobility?
• Cluster split brain = how to handle?

Not necessarily a cheaper solution vs. SRM licensing; read between the lines (hidden storage, networking and WAN costs)

&

vSphere 5.x - HA

• Complete re-write of vSphere HA
• Elimination of Primary/Secondary concept
• Foundation for increased scale and functionality
– Eliminates common issues (DNS resolution)
• Multiple communication paths
– Can leverage storage as well as the mgmt network for communications
– Enhances the ability to detect certain types of failures and provides redundancy
• IPv6 support
• Enhanced user interface
• Enhanced deployment

&

vSphere 5.x HA – Heartbeat Datastores

• Monitor availability of Slave hosts and the VMs running on them
• Determine whether a host is network isolated vs. network partitioned
• Coordinate with other Masters
– A VM can only be owned by one master
• By default, vCenter will automatically pick 2 datastores
• Very useful for hardening stretched storage models
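As a mental model of why heartbeat datastores help, the sketch below classifies a slave host from the master's point of view using two signals: network heartbeats and datastore heartbeats. This is a deliberately simplified illustration, not the actual vSphere HA algorithm; the enum, the function name and the `declared_isolated` flag are hypothetical.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    PARTITIONED = "network partitioned"
    ISOLATED = "network isolated"
    FAILED = "failed"


def classify_host(network_heartbeat, datastore_heartbeat, declared_isolated=False):
    """Simplified view of how a master might judge a slave host.

    network_heartbeat:   master still receives heartbeats over the mgmt network
    datastore_heartbeat: host is still updating its heartbeat datastore
    declared_isolated:   host has marked itself isolated (hypothetical flag)
    """
    if network_heartbeat:
        return HostState.LIVE
    if datastore_heartbeat:
        # The host is alive but unreachable over the network, so its VMs
        # are still running and should not be treated as a failed host.
        return HostState.ISOLATED if declared_isolated else HostState.PARTITIONED
    # No sign of life on either channel: treat as a host failure.
    return HostState.FAILED
```

The point for stretched clusters is the middle branch: when the inter-site network drops but storage stays reachable, the second channel lets a master tell a partition apart from a dead host.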

&

Something to understand re: yanking & “suspending” storage re VM HA

• What happens when you “yank” storage?
– The behavior of VMs whose storage “disappears” or goes “read-only” is more complex than people think at first.
– Responding to a ping doesn’t mean a system is available (if it doesn’t respond to any services, for example)
• In vSphere 5.0 or earlier:
– Yanked: http://www.youtube.com/watch?v=6Op0i0cekLg
– Suspended: http://www.youtube.com/watch?v=WJQfy7-udOY
• What’s new? vSphere 5.0 u1 or 5.1:
– terminateVMonPDLByDefault
– In vSphere 5.1: Timeout of IO on APD

&

Stretched Storage Configuration

• Literally just stretching the SAN fabric (or NFS exports over LAN) between locations, with a failover on failure
• Requires synchronous replication
• Limited in distance to ~100km in most cases
• Typically read/write in one location, read-only in second location
• Implementations with only a single storage controller at each location create other considerations.
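The ~100km figure is easier to reason about with some propagation arithmetic. The sketch below assumes light travels through fiber at roughly 200,000 km/s (about 200 km per millisecond) and counts propagation delay only; switch, protocol and array overheads are ignored, and the number of round trips per write is left as a parameter because it depends on the transport. Function names are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in fiber: ~200,000 km/s


def round_trip_ms(distance_km, round_trips=1):
    """Propagation delay for the given number of site-to-site round trips."""
    return round_trips * 2 * distance_km / FIBER_KM_PER_MS


def sync_write_latency_ms(local_write_ms, distance_km, round_trips=1):
    """A synchronous write is acknowledged only after the remote copy lands,
    so every write pays the local service time plus the inter-site trips."""
    return local_write_ms + round_trip_ms(distance_km, round_trips)
```

Under these assumptions, 100km adds about 1 ms to every write per round trip, so a 0.5 ms local write becomes roughly 1.5 ms, or 2.5 ms if the transport needs two round trips.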

&

[Figure: Stretched Storage Configuration. Stretched storage fabric(s) between a read/write location and a read-only location.]

&

Distributed Virtual Storage Configuration

• Leverages storage technologies to distribute storage across multiple sites
• Requires some sort of synchronous replication
• Limited in distance to ~100km in most cases
• Read/write storage in both locations, employs data locality and caching algorithms
• Typically uses multiple controllers in a scale-out fashion
• Must address “split brain” scenarios

&

[Figure: Distributed Virtual Storage Configuration. Read/write storage at both locations.]

&

Understanding – “Uniform Access”

[Figure: a stretched cluster on stretched storage, with Virtual Center and VMs at each site, an array at Site-A and an array at Site-B connected over FC or IP to the underlying storage, and a witness at a 3rd site connected over IP. The legend distinguishes logical paths to the other site, logical paths to the same site, and physical connections.]

• Pros:
– One more failure mode that doesn’t trigger VM HA
• Cons:
– Operational complexity, including multipathing
– If non-locally cached, latency

&

Understanding – “Non Uniform Access”

[Figure: a stretched cluster on stretched storage, with Virtual Center and VMs at each site, an array at Site-A and an array at Site-B connected over FC or IP to the underlying storage, and a witness at a 3rd site connected over IP.]

• Pros:
– Simple
• Cons:
– Cluster failure = VM HA event.

&

Understanding… Network Options

• Stretched VLAN approaches (VPLS, Ethernet Fabrics, etc.)
• Cisco OTV
• VXLAN (haven’t seen this widely used for this use case yet)

&

Stretched Cluster Considerations #1

Consideration: Prior to and including vSphere 4.1, you can’t control HA/DRS behavior for “sidedness”

• With stretched storage network configurations:
– Additional latency is introduced when VM storage resides in the other location
– Storage vMotion is required to remove this latency
• With distributed virtual storage configurations:
– Need to keep cluster behaviors in mind
– Data is accessed locally due to data locality algorithms

&

Stretched Cluster Considerations #2

Consideration: With vSphere 5, you can use DRS host affinity rules to control DRS behavior

– NOTE: Doesn’t address HA primary/secondary node selection

• With stretched storage network configurations:
– Caution when using single-controller implementations
– Storage latency still present in the event of a controller failure
• With distributed virtual storage configurations:
– Plan for cluster failure/cluster partition behaviors
• Understand/embrace “VMware supported” vs. “Vendor supported”
– This is what vMSC is really all about….

&

Stretched Cluster Considerations #3

Consideration: There is no supported way to control VMware HA primary/secondary node selection with vSphere 4.x

• With vSphere 4.x
– Limits cluster size to 8 hosts (4 in each site)
– No supported mechanism for controlling/specifying primary/secondary node selection
– Methods for increasing the number of primary nodes are also not supported by VMware
• With vSphere 5.x
– Better VM HA implementation (heartbeat datastores help)
– Still no supported mechanism for controlling/specifying primary/secondary node selection
– Host affinity groups + DRS may sort it out, but may not.
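The 8-host limit for vSphere 4.x follows from a detail the slide leaves implicit: vSphere 4.x HA uses at most five primary nodes per cluster, and restarts depend on at least one primary surviving. With no supported way to choose where primaries land, the only safe layout is one where no single site can hold all of them. A minimal sketch of that reasoning (the function name is hypothetical):

```python
MAX_PRIMARIES = 5  # vSphere 4.x HA: at most five primary nodes per cluster


def site_failure_can_remove_all_primaries(hosts_site_a, hosts_site_b):
    """True if, in the worst case, every primary could sit in a single site.

    Primary placement cannot be controlled in a supported way, so the
    worst case has to be assumed.
    """
    primaries = min(MAX_PRIMARIES, hosts_site_a + hosts_site_b)
    return max(hosts_site_a, hosts_site_b) >= primaries
```

With 4 hosts per site the five primaries must span both sites, so the function returns False; with 5 or more hosts in one site it returns True, which is the risk the slide's limit avoids.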

&

Stretched Cluster Considerations #4

Consideration: Stretched Clusters require Layer 2 “equivalence” at the network layer

• Complicates the network infrastructure
• The “re-IP” approach with SRM is relatively simple
• Requires use of technologies like VXLAN, OTV, VPLS
• Main question: “do you have the equipment and the networking expertise”?

&

Stretched Cluster Considerations #5

Consideration: The network lacks site awareness, so stretched clusters introduce new networking challenges.

• The movement of VMs from one site to another doesn’t update the network
• VM movement can cause “horseshoe/trombone routing” (LISP and other approaches can help)
• You’ll need to use multiple isolation addresses in your VMware HA configuration
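For the isolation-address point, one common approach is to give HA one pingable address per site through the `das.isolationaddressN` advanced options, optionally turning off the default gateway check with `das.usedefaultisolationaddress`. The sketch below only builds the option map; the helper name is hypothetical, the IP addresses are documentation-range placeholders, and the exact index range accepted should be checked against the documentation for the vSphere release in use.

```python
def isolation_address_options(addresses, use_default_gateway=False):
    """Build a map of HA advanced options for multiple isolation addresses.

    Indices start at 1 and stop at 9 as a conservative assumption about
    which option names are accepted across releases.
    """
    if not 1 <= len(addresses) <= 9:
        raise ValueError("expected between 1 and 9 isolation addresses")
    options = {"das.usedefaultisolationaddress": str(use_default_gateway).lower()}
    for index, address in enumerate(addresses, start=1):
        options[f"das.isolationaddress{index}"] = address
    return options


# Hypothetical example: one reliably pingable address in each site.
stretched_cluster_options = isolation_address_options(["192.0.2.1", "198.51.100.1"])
```

Using an address in each site means a host can still reach at least one isolation address when the inter-site link fails, so it is less likely to declare itself isolated during a partition.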

&

Nope. Not Sci-Fi. 500+ EMC examples

• 70% more utilization
• Migrated 250 live systems
• Multi-vendor migrations
• 15% more efficiency
• Always-on availability
• Online migrations
• Implemented private cloud
• 83% less management
• Active/active data centers

http://www.kattenlaw.com/

&

6,000+ NetApp examples…

• In Germany alone!

• 11,000+ global installations

&

Summary of what’s new….

• NOW – Expanding vMSC (includes NetApp, IBM, HP)
• NOW – Site Recovery Manager 5.1
• NOW – vSphere 5 VM HA rewrite & heartbeat datastores, help on partition scenarios
• NOW – vSphere 5 Metro vMotion
• NOW – vSphere 5.0 update 1 and 5.1 PDL changes
• NOW – PDL response in VPLEX & MetroCluster, VAAI support, Cluster Interconnect, Witness

&

For More Information…

• EMC VPLEX vMSC
– VMware: Using VPLEX Metro with VMware HA
  http://kb.vmware.com/kb/1026692
  http://kb.vmware.com/kb/1021215
– VMware: Implementing Uniform and Non-Uniform VPLEX Metro configs
  http://kb.vmware.com/kb/2007545
– EMC: VPLEX Metro HA techbook: h7113
  http://powerlink.emc.com/km/live1/en_US/Offering_Technical/Technical_Documentation/h7113-vplex-architecture-deployment-techbook.pdf
– EMC: VPLEX Metro with VMware HA: h8218
  http://powerlink.emc.com/km/live1/en_US/Offering_Technical/White_Paper/h8218-vplex-metro-vmware-ha-wp.pdf
• NetApp MetroCluster vMSC
– VMware: vSphere Metro Storage Cluster Case Study
  http://www.vmware.com/files/pdf/techpaper/vSPHR-CS-MTRO-STOR-CLSTR-USLET-102-HI-RES.pdf
– NetApp: TR3548: Best Practices for MetroCluster Design and Implementation

&

So… What’s Next?

&

VM Component Protection

• Detect and recover from catastrophic infrastructure failures affecting a VM
– Loss of storage path
– Loss of network link connectivity
• VMware HA restarts VM on available healthy host

&

Automated Stretched Cluster Config

• Leverage the work in VASA and VM Granular Storage
• Automated site protection for all VMs
• Benefits of single cluster model
• Automated setup of HA and DRS affinity rules

[Figure labels: Site A; Site B; Distributed Storage Volumes; Layer 2 Network; HA/DRS Cluster]

&

Stretched Cluster + vCOPS

&

More to come…

• RecoverPoint (future)
1. VM Granular Operations
2. “vRecoverpoint”
3. Multi-Site
• RAPIDpath (future)
– Network Transformation
• VPLEX (future)
1. VM Granular Operations = Async
2. “vVPLEX”

&

Q & A – Some Questions from us to you.

• “Stretched clustering sounds like awesomesauce, why not?”
• “Our storage vendor/team tells us their disaster avoidance solution will do everything we want: HA, DA, DR. We are not experts here, should we be wary?”
• “Our corporate SLAs for recovery are simple BUT we have LOTS of expertise and think we can handle the bleeding edge stuff. Should we just go for it???”
• “My datacenter server rooms are 50 ft apart but I definitely want a DR solution. What's wrong with that idea?”
• Is “cold migration” over distance good enough for you, or is it live or nothing?

&

THANK YOU

FILL OUT A SURVEY

EVERY COMPLETE SURVEY IS ENTERED INTO DRAWING FOR A $25 VMWARE COMPANY STORE GIFT CERTIFICATE
