Disaster Recovery Technologies
Seann Herdejurgen
Technical Product Manager
Your Business is Expected to Be Running 24x7, but Outages Happen
DRJ - Disaster Recovery Technologies 2
How many of each of the following has caused your organization to experience downtime in the past five years? (Mark all that apply)
[Bar chart, 0% to 100% scale. Legible categories: Power outage / failure / issues; Configuration change management issues; Data leakage or loss; Flood; Earthquake; Terrorism; Volcano; Other. Bar values range from 1% to 72%.]
As outage duration increases, actual & indirect costs accelerate.
Direct Cost of Downtime
Downtime   Retailer   Financial
1 hour     $60K       $16M
6 hours    $360K      $96M
1 day      $1.44M     $384M
3 days     $4.32M     $1.15B
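The direct-cost figures in the table scale linearly from the one-hour values. A minimal sketch of that arithmetic follows; the function name is hypothetical and the constant hourly rate is an assumption read off the table, not a general cost model.

```python
def direct_downtime_cost(hourly_cost: float, hours: float) -> float:
    """Direct cost of an outage, assuming a constant hourly cost."""
    return hourly_cost * hours


RETAILER_PER_HOUR = 60_000        # $60K per hour, from the table
FINANCIAL_PER_HOUR = 16_000_000   # $16M per hour, from the table

for label, hours in [("1 hour", 1), ("6 hours", 6), ("1 day", 24), ("3 days", 72)]:
    retailer = direct_downtime_cost(RETAILER_PER_HOUR, hours)
    financial = direct_downtime_cost(FINANCIAL_PER_HOUR, hours)
    print(f"{label:8} ${retailer:>13,.0f} ${financial:>15,.0f}")
```

Indirect costs are not modeled here; they do not follow a simple hourly rate.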
Costs: Direct and Indirect
Direct Costs
• Lost Revenue
• Lost Productivity
Indirect Costs
• Stock Price
• Reputation
• Market Share
• Brand Equity
• Customer Satisfaction
Availability
Availability   Downtime per year
99%            3.65 days
99.9%          8.76 hours
99.99%         52 minutes
99.999%        5 minutes
99.9999%       31 seconds
Availability is the percentage of total time that a network, system, or service is available for use.
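The downtime figures for each availability level can be derived from the definition. The sketch below assumes a 365-day year; the function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def annual_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year allowed at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    seconds = annual_downtime_seconds(pct)
    print(f"{pct}%: {seconds / 3600:.2f} hours ({seconds:.0f} seconds) per year")
```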
Threats to Availability
Logical: Software defect, Virus, Data corruption, Hacker, Accidental delete, Dropped table, Memory leak
Hardware: CPU, Disk, Memory, NIC, Switch, HBA, SAN, UPS
Site-wide: Power outage, Network outage, Flood, Fire, Tornado, Hurricane, Earthquake
Definitions
• Application – one or more infrastructure components needed to perform a particular function. An application can be as simple as a database instance, mail server, or DNS server, or it can be as complicated as amazon.com.
• Clustering – automating processes to maximize availability of an application in the event of a component failure.
• Replication – copying data from a master copy to a replica copy. Replication is used to recover from a failure to access the master copy.
• Write order fidelity – data is replicated in the order it was written at the primary site. This provides a crash-consistent data copy at the target site.
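Write order fidelity can be illustrated with a toy replica that only applies writes in the sequence the primary issued them. This is a sketch under stated assumptions: the `Replica` class and the per-write sequence numbers are hypothetical, and real replication products implement ordering in their own ways.

```python
import heapq


class Replica:
    """Applies replicated writes strictly in primary write order."""

    def __init__(self):
        self.data = {}
        self.next_seq = 1    # sequence number of the next write to apply
        self._pending = []   # min-heap of (seq, key, value) received early

    def receive(self, seq, key, value):
        heapq.heappush(self._pending, (seq, key, value))
        # Apply only writes whose predecessors have all been applied.
        while self._pending and self._pending[0][0] == self.next_seq:
            _, k, v = heapq.heappop(self._pending)
            self.data[k] = v
            self.next_seq += 1


replica = Replica()
replica.receive(2, "balance", 50)    # arrives out of order, so it is held back
replica.receive(1, "balance", 100)   # write 1 is applied, then write 2
print(replica.data)
```

Because writes are never applied out of order, the replica always reflects some earlier state of the primary, which is what makes the copy crash-consistent.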
Why do you need clustering?
Automate with Confidence
What makes up an application?
Application Code
Data
Network
How do you implement DR?
• Application Code: static content. No need to replicate.
• Data: dynamic content. Must replicate.
• Network: generally static content. No need to replicate.
Clustering Concepts
[Diagram: applications (SAP, APP 1 to APP 4) running on clustered nodes. Labels: HA (Local Clustering), DR (Remote Clustering), Replication.]
Clustering Concepts
• Shared Storage
• Replicated Data Cluster: host based or array based
• Business Decision: Manual Failover if RPO > 0; Automatic Failover if RPO = 0
Clustering Concepts
[Matrix: Local vs. Remote against Shared Storage vs. Replicated Data Cluster. Cell labels: HA Cluster, Campus Cluster, Shared Nothing, Global Cluster.]
Campus Cluster
[Diagram: Active Site A and Active Site B, < 65 km apart, with mirrored shared storage.]
Clustering Concepts
[Diagram: Application Clustering vs. System Clustering. Labels: Host 1, Host 2, VMware ESX, VMware HA, VM, OS, SQL, ApplicationHA.]
Cluster Building Blocks
Clustering uses a combination of redundant hardware, communication links and software configuration to achieve high availability.
Servers
• Similar sized servers are recommended
• N+1 redundancy
Storage
• Redundant storage
• Redundant HBAs & SAN switches*
• Multi-Pathing
Networking
• Redundant network interfaces for heartbeat links
• Redundant network links for TCP/IP
The transformation of High Availability
Infrastructure Availability
Application Availability
High Availability
What are the benefits of High Availability?
• Minimize downtime
• Automate DR
• Load balancing
• Increase manageability / serviceability
• Enforce dependencies between application components
• Reduce number of personnel needed during incidents
Typical clustering projects for customers
• HA from scratch
• Hardware refresh standardization
• New deployments
• Mergers & Acquisitions
• Cluster standardization
• Regulatory requirements that necessitate HA / DR
• Reduce downtime costs
Disaster Recovery
Recover mission-critical technology and applications at an alternate site.
Business Continuity Planning
Developing contingency plans for external events that interrupt business operations.
Business Impact Analysis
Analyzing and assigning a level of importance to business functions.
Work Area Recovery
Recover the business process at an alternate site.
Business Continuity
Business Continuity Concepts
[Timeline: Recovery Point and Recovery Time, each on a scale of Secs, Mins, Hrs, Days, Wks.]
• Recovery Point Objective (RPO)
– The point in time to which data can be successfully restored
• Amount of data loss acceptable
• Recovery Time Objective (RTO)
– The time it takes to restore data and applications
• Amount of time it takes to come back online
Architecting Disaster Recovery
As your business requirements for RPO / RTO decrease from days to minutes to zero, the technology required to support your DR solution changes.
RPO / RTO   RPO – protection methods                                        RTO – recovery methods
Days        Vault backup tapes                                              Restore vaulted backup tapes
Minutes     Asynchronous replication                                        HA/DR failover
Zero        Maximum WAN bandwidth and synchronous replication / mirroring   Campus cluster (<80 km); active / active application support
Failover Times
Option                          RTO          Cost   Complexity
Single Instance Failover        Minutes      $      Low
Single Instance Fast Failover   Sub-minute   $$     Medium
Clustered Databases             Seconds      $$$$   High
[Diagram labels: HA, DB, Cluster File System, Clustered Database]
How Long Does it Take You to Recover from a Failure?
Manual Recovery (timeline shown: 00:00:00 to 04:00:00)
• Failure occurs
• Contact on-call personnel, subject matter experts & business leaders
• Troubleshoot incident
• Declare disaster
• Take application components offline in order (production)
• Wait for data to replicate
• Bring application components online in order (contingency)
• Validate applications are running correctly
• Resume normal operations
Challenges
• Incident recognition?
• Problem diagnosis?
• Personnel available?
• Coordination between IT teams?
• Operator error?
• Missing patch?
• Wrong configuration?
Business Impact Analysis
[Table: TIER 1 to TIER 4 by RTO, RPO and Apps. RTO values shown: < 1 hour, < 6 hours, < 12 hours, When Convenient. RPO values shown: Near Zero, Hours, Today, Days. Apps shown: Oracle, DB2, SAP; Other Databases, Web Applications, Other Applications; Intra-web, VMware, Less critical applications; Other less critical applications.]
Constraints: Data growth, DR testing, Cost of recovery, Manual process errors, # of sites, Virtualization, Distance between sites
Data Protection Evolution
Technology                         Frequency
Backup to tape                     Daily
Point-in-time snapshot             Several times a day
Periodic replication               Every 30 minutes
Sync/Async replication             Continuous
Continuous Data Protection (CDP)   Continuous backups
Synchronous Replication
[Chart: writes (MB/s) over time for a typical workload. Required bandwidth is set at the maximum write rate.]
Asynchronous Replication
[Chart: writes (MB/s) over time for a typical workload. Required bandwidth is set at the average write rate.]
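The two bandwidth charts contrast sizing the replication link for the maximum write rate (synchronous) with sizing it for the average write rate (asynchronous). A minimal sketch of that comparison, using made-up write-rate samples rather than measured data:

```python
from statistics import mean

# Hypothetical write-rate samples in MB/s over one measurement window.
write_rate_samples = [5, 8, 40, 12, 6, 7]

sync_required_mb_s = max(write_rate_samples)    # must absorb every peak
async_required_mb_s = mean(write_rate_samples)  # peaks are buffered in a log

print(f"Synchronous:  {sync_required_mb_s} MB/s")
print(f"Asynchronous: {async_required_mb_s} MB/s")
```

With a bursty workload the gap between the two numbers can be large, which is one reason asynchronous replication is attractive over expensive WAN links.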
Mirroring vs. Replication
[Chart: distance ranges for Logical Volume Mirror, Synchronous and Asynchronous replication on a scale of 1 m, 1 km, 100 km, 1,000 km, 10,000 km.]
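Distance limits synchronous techniques because each write waits for an acknowledgement from the remote site. The sketch below estimates only the propagation part of that wait, assuming light travels through optical fiber at roughly 200,000 km/s; it ignores switching, protocol and storage latency, so real figures will be higher.

```python
FIBER_KM_PER_SECOND = 200_000  # approximate speed of light in fiber (assumption)


def round_trip_ms(distance_km: float) -> float:
    """Propagation delay for one round trip over fiber, in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


for km in (1, 100, 1_000, 10_000):
    print(f"{km:>6} km: {round_trip_ms(km):.2f} ms added per synchronous write")
```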
Asynchronous Replication Considerations
• What do you do when your storage replicator log fills up?
– suspend replication to preserve a crash consistent copy of data
– delay I/O until replication catches up (performance hit)
– add more SRL / journal storage
– increase WAN bandwidth
• If you don’t have enough network bandwidth to support your average write rate, your replication solution will fail
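How quickly a replicator log fills depends on how far the write rate exceeds the WAN rate. A rough sketch, with a hypothetical function name and example numbers that are not taken from any real deployment:

```python
def hours_until_log_full(log_size_gb, write_rate_mb_s, wan_rate_mb_s):
    """Hours until the replication log fills, or None if the WAN keeps up."""
    backlog_rate_mb_s = write_rate_mb_s - wan_rate_mb_s
    if backlog_rate_mb_s <= 0:
        return None  # the link drains the log at least as fast as it fills
    return log_size_gb * 1024 / backlog_rate_mb_s / 3600


# Example: 100 GB log, 50 MB/s sustained writes, 30 MB/s of WAN throughput.
print(hours_until_log_full(100, 50, 30))
print(hours_until_log_full(100, 30, 30))
```

If the sustained write rate stays above the WAN rate, a larger log only delays the overflow; it does not prevent it.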
(Based on actual events)
Replication Network Performance Tuning
• Configure bandwidth limits
• Enable jumbo frames
• Max out TCP window size
• Firewalls - Increase network buffers
• Enable compression
• Use fewer network devices
• Use multiple FCIP tunnels
• Update NIC firmware
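The TCP window size matters because a single connection can have at most one window of data in flight per round trip. The sketch below computes the bandwidth-delay product and the throughput ceiling for a given window; the link speed and round-trip time are illustrative assumptions.

```python
def bandwidth_delay_product_bytes(bandwidth_mbit_s: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep the link full."""
    return bandwidth_mbit_s * 1_000_000 / 8 * (rtt_ms / 1000)


def max_throughput_mbit_s(window_bytes: int, rtt_ms: float) -> float:
    """Throughput ceiling of one TCP connection with the given window."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000


# Example: 1 Gbit/s WAN link with a 50 ms round-trip time.
print(bandwidth_delay_product_bytes(1000, 50))  # window needed, in bytes
print(max_throughput_mbit_s(65_535, 50))        # ceiling with a 64 KiB window
```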
Replication modes to meet your SLA
Synchronous
• Maximum Protection: Zero data loss
• Ideal for small distances (< 100 km)
Asynchronous
• Maximum Performance: Limited data loss
• Ideal for any distance between sites
Bunker
• Maximum Protection + Maximum Performance
• Zero data loss over any distance
Bunker Replication
Bunker replication allows you to replicate data synchronously to your bunker site and asynchronously to your contingency site. This supports a zero RPO for data during a primary site failure.
[Diagram: Primary Site, Bunker Site and Contingency Site, with Synchronous Replication to the Bunker Site and Asynchronous Replication to the Contingency Site.]
Parallel replication to multiple DR sites
[Diagram: VVR replicating from the Production Site to Disaster Recovery Site 1 and Disaster Recovery Site 2.]
• Each secondary site can be at a different RPO
• Ideal for bringing up a new site and retiring an old site with no loss of DR coverage
Disaster Recovery
• Periodic replication
• Save bandwidth – replicate subset of files
• Suitable for triggered replication
[Diagram: App at the Primary Site, Periodic Replication over Any Distance to App at the Disaster Recovery Site.]
Content Distribution
• Distribute files between sites
• One source to many targets
• Share selected files or directories
• Single direction of data flow
[Diagram: App at the Central Office sending Content Refresh (Periodic Updates) to Branch Office 1 and Branch Office 2.]
On-Demand Replication
• Replicate On-Demand
• Content distribution on a non-periodic interval
• Refresh test / pre-prod environments
[Diagram: App in Production, Replicate On-Demand to App in Pre-Prod.]
Off Host Processing
[Diagram: Primary Datacenter (application writes to the Primary File System) replicating over the WAN to the Secondary Datacenter (Read-Only Target File System).]
• Replication has exclusive R/W access to target file systems
• Prevents accidental writes
Global Clustering
[Diagram: clustered sites connected over an IP Network.]
High Availability and Disaster Recovery Architectures
[Diagram: applications (SAP, APP 1 to APP 4) across a Primary Site, Bunker Site and Secondary Site. Labels: Local HA; Metropolitan HA (Campus Cluster) with Sync Replication or Mirroring; Wide-Area DR (Global Cluster) with Asynchronous Replication; Bunker Site with Synchronous Replication and Asynchronous Replication.]
Replication Options
• Array Based Replication
• Appliance Based Replication
• Application Based Replication
• Host Based Replication
Array Based Replication
Vendor    Synchronous    Asynchronous
EMC       SRDF/S         SRDF/A
Hitachi   TrueCopy       HUR
IBM       MetroMirror    Global Mirror
NetApp    MetroCluster   SnapMirror
Fire Drill
Simulate DR Failover
[Diagram: Volume at the Primary Site replicated to the DR Site, where a Snapshot is taken. Steps shown: Initiate Fire Drill, Mount Snapshot, Test Application, Resume Operations.]
The Benefits
• Comprehensive Disaster Recovery simulation
• No impact on application at production site
• Logs for DR readiness audit
Automating Business Applications
[Screenshot: a Billing application with Web, App and DB tiers. Labels: Start DB Tier, Start App Tier, Start Web Tier, Started, ON, Application Start, Application Stop, High Availability, Disaster Recovery, Security, Status Summary.]
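Starting and stopping a multi-tier application in the right order is a dependency-ordering problem. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the tier names come from the slide, while the dependency direction (Web needs App, App needs DB) is an assumption, and this is not how any particular clustering product implements it.

```python
from graphlib import TopologicalSorter

# Each tier maps to the tiers it depends on (assumed dependencies).
DEPENDS_ON = {
    "DB": set(),
    "App": {"DB"},
    "Web": {"App"},
}

# Dependencies start first; stopping runs in the reverse order.
start_order = list(TopologicalSorter(DEPENDS_ON).static_order())
stop_order = list(reversed(start_order))

print("Application Start:", " -> ".join(start_order))
print("Application Stop: ", " -> ".join(stop_order))
```

`TopologicalSorter` raises `CycleError` if the dependencies are circular, which is a useful check before automating a failover.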
Types of Disaster Recovery Tests
• Walkthrough: Key stakeholders meet to review the layout and contents of a plan
• Tabletop Exercise: Key stakeholders rehearse a specific threat scenario
• Simulation: IT team invokes the plan in a controlled situation without impacting business operations
• Full Test: IT team performs an actual failover of IT systems and end-user processing to the DR site
Disaster Recovery Testing
• Reality – Plan documents and procedures are rarely referenced during an actual disaster
• Testing helps train team members to operate effectively despite the “heat of battle” that often obscures centralized command, control and communications capabilities of emergency decision-makers
Top DR Testing Rules
• Test regularly, more is better
• Test using different personnel
• Test after significant changes in business or infrastructure
• Test ALL infrastructure / application components
• Re-test when a test fails to meet objectives
Advanced DR Configurations
3DC and 4DC
Hardware Replication
• SAN vendors support various replication configurations which can be cascaded to configure complex solutions to meet a customer’s needs
– Bunker replication
– 3DC – Three data center replication
– 4DC – Four data center replication
3DC Replication
• 3DC solutions are any combination of replication that involves three data centers
[Diagrams: (1) primary, bunker, DR; (2) primary, DR1, DR2.]
4DC Replication
• 4DC solutions are any combination of replication that involves four data centers
[Diagrams: (1) primary, bunker, secondary, tertiary; (2) primary, bunker 1, secondary, bunker 2.]
Bunker Site vs. Asynchronous Replication
• Bunker sites can be costly in terms of capital and operational expenses
• Businesses can instead purchase additional bandwidth to reduce their RPO time to near zero
• Near zero RPO may be good enough to avoid the cost and complexity of running a bunker site
Thank You!