Stretched Clusters and VMware vCenter Site Recovery Manager: How and When to Choose One, the Other, or Both
Chad Sakac, EMC Corporation
Vaughn Stewart, NetApp
INF-BCO2982
#vmworldinf
&
Where were we last year?
• Covered at VMworld 2011
– BCO2863: Using Distance to Your Advantage (NetApp)
– BCO2479: Understanding vSphere Stretched Clusters (EMC)
• Stretched clusters have existed since VI3 with NetApp MetroCluster, and accelerated with EMC VPLEX's entrance into the market
– vSphere 5 introduced vMSC certification, initially with EMC VPLEX, accelerating with new entrants
• Customers are actually seeking availability
– Blending Backup, Disaster Recovery & Disaster Avoidance
&
The State of the Union
• Adoption continues to accelerate!
• vSphere Metro Stretched Cluster HCL is expanding
• Hardening of VM HA for stretched clusters
– terminateVMonPDLByDefault in vSphere 5.0 u1 and vSphere 5.1
– Timeout of IO on APD
• Stretched Clusters + SRM = AND, not an OR
• Expanding the use cases
– Longer and longer distances
– Reducing hardware dependencies
&
Customers Want Geo-Spanned Availability
• Stretched clusters: across data centers, synchronous distances
• Future…: across data centers, asynchronous distances
• Disaster Recovery: operational and 3rd-site recovery
&
“Disaster Recovery” “Disaster Avoidance”
“High Availability”
…Words Matter
&
“Disaster” Avoidance – Host Level
“Hey… That host WILL need to go down for maintenance. Let’s vMotion to avoid a disaster and outage.”
This is vMotion.
Most important characteristics:
• By definition, avoidance, not recovery.
• “Non-disruptive” is massively different from “almost non-disruptive”
&
“Disaster” Recovery – Host Level
“Hey… That host WENT down due to an unplanned failure, causing an unplanned outage due to that disaster. Let’s automate the RESTART of the affected VMs on another host.”
This is VM HA.
Most important characteristics:
• By definition, recovery (restart), not avoidance
• Simplicity, automation, sequencing
&
Disaster Avoidance – Site Level
“Hey… That site WILL need to go down for maintenance. Let’s vMotion to avoid a disaster and outage.”
This is inter-site vMotion.
Most important characteristics:
• By definition, avoidance, not recovery.
• “Non-disruptive” is massively different from “almost non-disruptive”
&
Disaster Recovery – Site Level
“Hey… That site WENT down due to an unplanned failure, causing an unplanned outage due to that disaster. Let’s automate the RESTART of the affected VMs on hosts at another site.”
This is Disaster Recovery.
Most important characteristics:
• By definition, recovery (restart), not avoidance
• Simplicity, testing, split brain behavior, automation, sequencing, IP address changes
&
VMware High Availability
[Figure: a vSphere HA cluster stretched across a campus or metro area]
• VMware High Availability
– Extended between distributed parts of the same virtual datacenter
– Automatic rapid recovery from host failures
– No complex clustering software in the VM
&
VMware Fault Tolerance
[Figure: a vSphere HA cluster with an FT Protected VM, shown as copies 1 and 2]
• VMware Fault Tolerance
– Easily enabled/disabled per virtual machine
– Eliminate VM downtime due to hardware failures
– Protect homegrown applications without a clustering solution
Note – not part of the vMSC, ergo not VMware supported, MAY be vendor supported.
&
VMware VM Host Affinity
• VMware Host Affinity
– Provides a “site affinity” capability
– Keeps workloads local to storage until failure
– Keeps primary and secondary FT VMs in appropriate sites
[Figure: a vSphere HA cluster split into a Site 1 Affinity Group and a Site 2 Affinity Group]
Note – considerations later.
&
Type 1: “Stretched Single vSphere Cluster”
&
Stretching VMware vSphere Clusters
[Figure: a single vSphere HA cluster spanning two sites on stretched storage (e.g. EMC VPLEX, NetApp MetroCluster), with array-based synchronous replication]
&
Planned Datacenter Migration
[Figure: a vSphere HA cluster on stretched storage (e.g. EMC VPLEX, NetApp MetroCluster), with VMs moving between locations by vMotion]
• Standard vMotion of virtual machines
• Moving all operations between locations
&
One little note re: “Intra-Cluster” vMotion
• Intra-cluster vMotions can be highly parallelized
– and more and more with each passing vSphere release
– With vSphere 4.1 and vSphere 5.x it’s up to 4 per host/128 per datastore if using 1GbE
– 8 per host/128 per datastore if using 10GbE
– …and that’s before you tweak settings for more, and shoot yourself in the foot :-)
• Need to meet the vMotion network requirements
– 622Mbps or more, 5ms RTT (upped to 10ms RTT if using Metro vMotion - vSphere 5 Enterprise Plus)
– Layer 2 equivalence for vmkernel (support requirement)
– Layer 2 equivalence for VM network traffic (required)
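The limits and network requirements above can be turned into a rough pre-flight check. This is only an illustrative sketch: `LinkSpec`, `link_meets_requirements` and `evacuation_waves` are hypothetical names invented for this example, not part of any VMware tool. The numbers (4 or 8 concurrent vMotions per host, 622Mbps, 5ms or 10ms RTT) are the ones quoted on the slide.

```python
import math
from dataclasses import dataclass

# Figures quoted on the slide for vSphere 4.1 / 5.x.
CONCURRENT_PER_HOST = {"1GbE": 4, "10GbE": 8}
MIN_BANDWIDTH_MBPS = 622
MAX_RTT_MS = 5          # standard vMotion
MAX_RTT_MS_METRO = 10   # Metro vMotion (vSphere 5 Enterprise Plus)


@dataclass
class LinkSpec:
    """Hypothetical description of the inter-site vMotion link."""
    bandwidth_mbps: float
    rtt_ms: float
    metro_vmotion: bool = False


def link_meets_requirements(link: LinkSpec) -> bool:
    """Return True if the link satisfies the bandwidth and RTT limits."""
    rtt_limit = MAX_RTT_MS_METRO if link.metro_vmotion else MAX_RTT_MS
    return link.bandwidth_mbps >= MIN_BANDWIDTH_MBPS and link.rtt_ms <= rtt_limit


def evacuation_waves(vms_on_host: int, nic: str) -> int:
    """Number of sequential vMotion batches needed to empty one host."""
    return math.ceil(vms_on_host / CONCURRENT_PER_HOST[nic])
```

For example, under these assumptions a host with 30 VMs would need 8 batches on 1GbE and 4 on 10GbE; the per-datastore limit of 128 is not modeled here.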
&
Type 2: “Two Clusters, Stretched Storage, inter-cluster vMotion”
We don’t see this much, so will skip it.
&
Type 3: “Classic Site Recovery Manager”
&
Type 3: “Classic Site Recovery Manager”
[Figure: vSphere Cluster A (protected-site vCenter) and vSphere Cluster B (recovery-site vCenter), separated by distance]
• Protected site: Datastore A
• Recovery site: read-only replica of Datastore A (gets promoted or snapshotted to become writeable)
• Array-based (sync, async or continuous) replication or vSphere Replication v1.0 (async)
&
Type 4: “Stretched Cluster + Site Recovery Manager”
&
Can You have Stretched Clusters + SRM? YES!
• Deduped replication
• Native WAN compression and/or replication compression
• VMware vSphere 5 integration
• Robust DR testing and sequencing
• Automated Failback
[Figure: a stretched vSphere cluster, an array replica (EMC RecoverPoint, NetApp SnapMirror), and VMware vSphere 5 Site Recovery Manager]
&
Summary of “Taxonomy Matters”
• Disaster Avoidance != High Availability != Disaster Recovery
– The same logic that applies at the server level applies at the site level
– The same value (non-disruptive for avoidance, automation/simplicity for recovery) that applies at the server level applies at the site level
– Don’t underestimate the importance of DR testing
• Stretched clusters have complex considerations
• vMotion = single vCenter domain vs. SRM = two or more vCenter domains
• Straightforward SRM for most (~1 of every 5)
• Stretched Clusters and SRM are no longer mutually exclusive
&
Thinking of Stretching?
vSphere Stretched Clusters Considerations
&
Stretched Cluster Design Considerations
• Understand the difference compared to DR
– HA does not follow a robust, scriptable recovery plan workflow
– HA is not site aware for applications: where are all the moving parts of my app? Same site or dispersed? How will I know what needs to be recovered?
– DR usually involves a regular, structured “DR test”.
• Single stretch site = single vCenter
– During disaster, what about vCenter setting consistency across sites? (DRS affinity, cluster settings, network)
• Will the network support it? Layer 2 stretch? IP mobility?
• Cluster split brain = how to handle?
Not necessarily a cheaper solution vs. SRM licensing; read between the lines (hidden storage, networking and WAN costs)
&
vSphere 5.x - HA
• Complete re-write of vSphere HA
• Elimination of Primary/Secondary concept
• Foundation for increased scale and functionality
– Eliminates common issues (DNS resolution)
• Multiple Communication Paths
– Can leverage storage as well as the mgmt network for communications
– Enhances the ability to detect certain types of failures and provides redundancy
• IPv6 Support
• Enhanced User Interface
• Enhanced Deployment
[Figure: four hosts, ESX 01 to ESX 04]
&
vSphere 5.x HA – Heartbeat Datastores
• Monitor availability of Slave hosts and VMs running on them
• Determine whether a host is network isolated vs. network partitioned
• Coordinate with other Masters
– A VM can only be owned by one master
• By default, vCenter will automatically pick 2 datastores
• Very useful for hardening stretched storage models
[Figure: four hosts, ESX 01 to ESX 04]
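To make the role of heartbeat datastores concrete, here is a deliberately simplified decision sketch from the master's point of view. It is an assumption-laden model, not the actual HA agent logic: the `HostState` enum, the function names and the three boolean inputs are all hypothetical, and real vSphere HA uses additional signals (for example pings and the configured isolation response).

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    ISOLATED = "network isolated"
    PARTITIONED = "network partitioned"
    FAILED = "failed"


def classify_host(network_heartbeat: bool,
                  datastore_heartbeat: bool,
                  host_declared_isolated: bool) -> HostState:
    """Simplified classification of a slave host as seen by the master."""
    if network_heartbeat:
        # Management network heartbeats still arrive: nothing to do.
        return HostState.LIVE
    if not datastore_heartbeat:
        # Silent on the network AND on the heartbeat datastores.
        return HostState.FAILED
    # Still heartbeating through storage, so the host itself is up.
    if host_declared_isolated:
        return HostState.ISOLATED
    return HostState.PARTITIONED


def should_restart_vms(state: HostState) -> bool:
    """In this sketch only a failed host triggers VM restarts elsewhere.

    Isolation-response handling is intentionally out of scope.
    """
    return state is HostState.FAILED
```

The point of the model is the middle case: without a datastore heartbeat, the master could not tell a dead host from one that is merely unreachable over the network.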
&
Something to understand re: yanking & “suspending” storage re VM HA
• What happens when you “yank” storage?
– The behavior of VMs whose storage “disappears” or goes “read-only” is more complex than people think at first.
– Responding to a ping doesn’t mean a system is available (if it doesn’t respond to any services, for example)
• In vSphere 5.0 or earlier:
– Yanked: http://www.youtube.com/watch?v=6Op0i0cekLg
– Suspended: http://www.youtube.com/watch?v=WJQfy7-udOY
• What’s new? vSphere 5.0 u1 or 5.1:
– terminateVMonPDLByDefault
– In vSphere 5.1: Timeout of IO on APD
&
Stretched Storage Configuration
• Literally just stretching the SAN fabric (or NFS exports over LAN) between locations, with a failover on failure
• Requires synchronous replication
• Limited in distance to ~100km in most cases
• Typically read/write in one location, read-only in second location
• Implementations with only a single storage controller at each location create other considerations.
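Propagation delay helps explain why synchronous replication is usually kept to roughly 100km. The following back-of-the-envelope sketch assumes light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in vacuum) and that a synchronous write needs at least one inter-site round trip before it is acknowledged. The function names are hypothetical, and real links add switching, protocol and array overhead on top of this floor.

```python
# Approximate speed of light in optical fiber: ~200,000 km/s = 200 km per ms.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_MS


def sync_write_latency_ms(local_ms: float, distance_km: float,
                          round_trips: int = 1) -> float:
    """Lower bound on synchronous write latency.

    local_ms is the assumed local array service time; round_trips is the
    assumed number of inter-site round trips per acknowledged write.
    """
    return local_ms + round_trips * round_trip_ms(distance_km)
```

At 100km the round trip alone is 1ms, so a write that takes 0.5ms locally cannot complete in less than about 1.5ms once it has to be mirrored synchronously, and protocols needing more than one round trip per write scale that penalty accordingly.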
&
Stretched Storage Configuration
[Figure: stretched storage fabric(s) between two sites, with read/write storage at one site and read-only storage at the other]
&
Distributed Virtual Storage Configuration
• Leverages storage technologies to distribute storage across multiple sites
• Requires some sort of synchronous replication
• Limited in distance to ~100km in most cases
• Read/write storage in both locations, employs data locality and caching algorithms
• Typically uses multiple controllers in a scale-out fashion
• Must address “split brain” scenarios
&
Distributed Virtual Storage Configuration
[Figure: two sites, with read/write storage at both]
&
Understanding – “Uniform Access”
[Figure: a stretch cluster (Virtual Center, VMs at both sites) on stretched storage, with an array at Site-A, an array at Site-B, underlying storage, FC or IP and IP links, and a witness at a 3rd site. Legend: logical paths to the other site, logical paths to the same site, physical connections]
• Pros:
– One more failure mode that doesn’t trigger VM HA
• Cons:
– Operational complexity, including multipathing
– If non-locally cached, latency
&
Understanding – “Non Uniform Access”
[Figure: a stretch cluster (Virtual Center, VMs at both sites) on stretched storage, with an array at Site-A, an array at Site-B, underlying storage, FC or IP and IP links, and a witness at a 3rd site]
• Pros:
– Simple
• Cons:
– Cluster failure = VM HA event.
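The difference between the two access models can be sketched as a small reachability check. This is a toy model under stated assumptions: "uniform" is taken to mean hosts have paths to the arrays at both sites, "non-uniform" that hosts only reach their local array. It ignores the witness, split-brain handling and latency, and the names `reachable_arrays` and `needs_ha_restart` are hypothetical.

```python
SITES = ("A", "B")


def reachable_arrays(host_site: str, mode: str, failed_arrays: set) -> set:
    """Arrays a host can still reach, given the access model and failures."""
    if mode == "uniform":
        candidates = set(SITES)      # paths to both sites
    elif mode == "non-uniform":
        candidates = {host_site}     # local paths only
    else:
        raise ValueError(f"unknown access mode: {mode!r}")
    return candidates - set(failed_arrays)


def needs_ha_restart(host_site: str, mode: str, failed_arrays: set) -> bool:
    """A VM must be restarted elsewhere if its host has no storage left."""
    return not reachable_arrays(host_site, mode, failed_arrays)
```

In this model, losing the Site-A array leaves Site-A VMs running under uniform access (over the cross-site paths, with extra latency) but turns into a VM HA event under non-uniform access, which matches the pros and cons listed on the two slides.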
&
Understanding… Network Options
• Stretched VLAN approaches (VPLS, Ethernet Fabrics, etc)
• Cisco OTV
• VXLAN (haven’t seen this widely used for this use case yet)
&
Stretched Cluster Considerations #1
Consideration: Prior to and including vSphere 4.1, you can’t control HA/DRS behavior for “sidedness”
• With stretched Storage Network configurations:
– Additional latency introduced when VM storage resides in other location
– Storage vMotion required to remove this latency
• With distributed virtual storage configurations:
– Need to keep cluster behaviors in mind
– Data is accessed locally due to data locality algorithms
&
Stretched Cluster Considerations #2
Consideration: With vSphere 5, you can use DRS host affinity rules to control DRS behavior
– NOTE: Doesn’t address HA primary/secondary node selection
• With stretched Storage Network configurations:
– Caution when using single-controller implementations
– Storage latency still present in the event of a controller failure
• With distributed virtual storage configurations:
– Plan for cluster failure/cluster partition behaviors
• Understand/embrace “VMware supported” vs. “Vendor Supported”
– This is what vMSC is really all about….
&
Stretched Cluster Considerations #3
Consideration: There is no supported way to control VMware HA primary/secondary node selection with vSphere 4.x
• With vSphere 4.x
– Limits cluster size to 8 hosts (4 in each site)
– No supported mechanism for controlling/specifying primary/secondary node selection
– Methods for increasing the number of primary nodes also not supported by VMware
• With vSphere 5.x
– Better VM HA implementation (heartbeat datastores help)
– Still no supported mechanism for controlling/specifying primary/secondary node selection
– Host affinity groups + DRS may sort it out, but may not.
&
Stretched Cluster Considerations #4
Consideration: Stretched Clusters require Layer 2 “equivalence” at the network layer
• Complicates the network infrastructure
• The “re-IP” approach with SRM is relatively simple
• Requires use of technologies like VXLAN, OTV, VPLS
• Main question: “do you have the equipment and the networking expertise”?
&
Stretched Cluster Considerations #5
Consideration: The network lacks site awareness, so stretched clusters introduce new networking challenges.
• The movement of VMs from one site to another doesn’t update the network
• VM movement can cause “horseshoe/trombone routing” (LISP and other approaches can help)
• You’ll need to use multiple isolation addresses in your VMware HA configuration
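For the last point, vSphere HA exposes advanced options named `das.isolationaddress0` through `das.isolationaddress9`, plus `das.usedefaultisolationaddress`; check the documentation for your vSphere version before relying on them. The helper below is a hypothetical sketch that only builds the key/value pairs, typically one pingable address per site. The function name is invented and the IP addresses in the usage note are placeholders from the documentation ranges.

```python
def isolation_address_options(addresses, use_default_gateway=False):
    """Build HA advanced options for multiple isolation addresses.

    addresses: pingable IPs, e.g. one per site (placeholder values only).
    """
    if not 1 <= len(addresses) <= 10:
        raise ValueError("expected between 1 and 10 isolation addresses")
    options = {
        f"das.isolationaddress{i}": addr for i, addr in enumerate(addresses)
    }
    # Whether the management network's default gateway is also used.
    options["das.usedefaultisolationaddress"] = (
        "true" if use_default_gateway else "false"
    )
    return options
```

Calling `isolation_address_options(["192.0.2.1", "198.51.100.1"])` yields one entry per site plus the default-gateway flag, which can then be entered as HA advanced options on the cluster.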
&
Nope. Not Sci-Fi. 500+ EMC examples
• 70% more utilization
• Migrated 250 live systems
• Multi-vendor migrations
• 15% more efficiency
• Always-on availability
• Online migrations
• Implemented private cloud
• 83% less management
• Active/active data centers
&
6,000+ NetApp examples…
• In Germany alone!
• 11,000+ global installations
&
Summary of what’s new….
• NOW – Expanding vMSC (includes NetApp, IBM, HP)
• NOW – Site Recovery Manager 5.1
• NOW – vSphere 5 VM HA rewrite & heartbeat datastores, help on partition scenarios
• NOW – vSphere 5 Metro vMotion
• NOW – vSphere 5.0 update 1 and 5.1 PDL changes
• NOW – PDL response in VPLEX & MetroCluster, VAAI support, Cluster Interconnect, Witness
&
For More Information…
• EMC VPLEX vMSC
– VMware: Using VPLEX Metro with VMware HA
• http://kb.vmware.com/kb/1026692
• http://kb.vmware.com/kb/1021215
– VMware: Implementing Uniform and Non-Uniform VPLEX Metro configs
• http://kb.vmware.com/kb/2007545
– EMC: VPLEX Metro HA techbook: h7113
– EMC: VPLEX Metro with VMware HA: h8218
• NetApp MetroCluster vMSC
– VMware: vSphere Metro Storage Cluster Case Study
– NetApp: TR3548: Best Practices for MetroCluster Design and Implementation
&
So… What’s Next?
&
VM Component Protection
• Detect and recover from catastrophic infrastructure failures affecting a VM
– Loss of storage path
– Loss of Network link connectivity
• VMware HA restarts VM on available healthy host
[Figure: two VMware ESX hosts]
&
Automated Stretched Cluster Config
• Leverage the work in VASA and VM Granular Storage
• Automated site protection for all VMs
• Benefits of single cluster model
• Automated setup of HA and DRS affinity rules
[Figure: Site A and Site B sharing distributed storage volumes, a Layer 2 network, and one HA/DRS cluster]
&
Stretched Cluster +
vCOPS
&
More to come…
[Figure: roadmap graphic with the following labels]
• RecoverPoint: 1. VM Granular Operations 2. “vRecoverpoint” 3. Multi-Site
• VPLEX: 1. VM Granular Operations = Async 2. “vVPLEX”
• RAPIDpath
• Network Transformation
• Future
&
Q & A – Some Questions from us to you.
• “Stretched clustering sounds like awesomesauce, why not?”
• “Our storage vendor/team tells us their disaster avoidance solution will do everything we want: HA, DA, DR. We are not experts here. Should we be wary?”
• “Our corporate SLAs for recovery are simple, BUT we have LOTS of expertise and think we can handle the bleeding-edge stuff. Should we just go for it???”
• “My datacenter server rooms are 50 ft apart, but I definitely want a DR solution. What's wrong with that idea?”
• Is “cold migration” over distance good enough for you, or is it live or nothing?
&
THANK YOU