MTBF / MTTR - Energized Work TekTalk, Mar 2012
Presented by Michael Richardson, Energized Work 21 March 2012
MTBF / MTTR: Availability or recoverability?
25 MACKLIN STREET LONDON WC2B 5NN +44 (0)20 7691 8933
ENERGIZED WORK
WWW.ENERGIZEDWORK.COM
Michael Richardson Twitter: @mr_spb
Email: [email protected] #ewtektalk
© 2012 Energized Work - www.energizedwork.com
So what is high availability?
• Five nines?
• No single point of failure?
• Multiple data centres?
• Fault tolerance?
• Load balancing?
• Uptime?
Nines of availability
Availability            Downtime per year
One nine (90%)          36.5 days
Two nines (99%)         3.65 days
Three nines (99.9%)     8.76 hours
Four nines (99.99%)     52.56 minutes
Five nines (99.999%)    5.26 minutes
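The downtime figures in the table follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative and not from the talk):

```python
# Downtime per year implied by an availability percentage.
# Assumption: a 365-day year, which matches the figures in the table.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by an availability figure."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours / 24:.2f} days, {hours:.2f} hours, {hours * 60:.2f} minutes")
```

Each extra nine divides the allowed downtime by ten, which is why the step from four to five nines leaves only a few minutes a year.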
Problem with the nines
• What do they mean?
• Guaranteed or just an SLA?
• Multiplicity: the availabilities of chained components multiply (99.9% * 99.9% * 99.9% ≈ 99.7%)
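The multiplicity point can be checked with a few lines. This is a sketch that assumes the components sit in series (a request needs all of them) and fail independently; the function name is illustrative:

```python
import math


def serial_availability(*availabilities: float) -> float:
    """Combined availability of components that must ALL be up.

    Assumption: failures are independent, so the individual
    availabilities (given as fractions, e.g. 0.999) simply multiply.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    combined = serial_availability(0.999, 0.999, 0.999)
    print(f"Three 99.9% components in series: {combined:.2%}")
```

The combined figure is always lower than that of the weakest component, so every extra tier in the request path erodes the headline number.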
SLA availability numbers just aim to provide a level of confidence in a website’s service
No single point of failure (SPOF)
Two of everything?
Start with this
[Diagram: Users, Index.html]
End with this
[Diagram: Users; Switch 1, Switch 2; Firewall 1, Firewall 2; WEB1, WEB2; APP1, APP2; DB1, DB2]
Problems with eliminating SPOF
• It’s expensive
• Where do you draw the line?
• Are failures independent?
• Can you guarantee no SPOF?
• Increased complexity
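The question of whether failures are independent matters because the usual redundancy arithmetic depends on it. A rough sketch, with illustrative function names and made-up availability figures, of how a pair of redundant hosts behaves with and without a shared dependency:

```python
def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of N redundant copies where any one copy is enough.

    Assumption: the copies fail independently of each other.
    """
    return 1 - (1 - single) ** copies


def with_shared_dependency(single: float, copies: int, shared: float) -> float:
    """Same redundant group, but every copy relies on one shared component
    (for example a common power feed), which acts as a hidden SPOF."""
    return redundant_availability(single, copies) * shared


if __name__ == "__main__":
    # Hypothetical figures, chosen only to illustrate the effect.
    print(f"Two independent 99% hosts:         {redundant_availability(0.99):.4%}")
    print(f"Same pair behind a 99.9% component: {with_shared_dependency(0.99, 2, 0.999):.4%}")
```

Under the independence assumption two 99% hosts look like four nines, but a single shared component pulls the pair back below the availability of that component.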
Problem: Data centres fail
Solution: Get a second data centre
Hot – Hot multisite
• Full range of services available in multiple locations
• Easy to automate failover of sites
• Data consistency is hard
• Capacity planning concerns
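One reading of the capacity planning concern: if the surviving sites must absorb the load of a failed one, no site can run close to full under normal conditions. A small sketch of that headroom calculation, assuming equally sized sites and evenly spread load (both assumptions, not statements from the talk):

```python
def max_safe_utilisation(sites: int, sites_lost: int = 1) -> float:
    """Highest normal per-site utilisation (as a fraction) that still lets
    the surviving sites carry the full load after `sites_lost` sites fail.

    Assumptions: all sites have equal capacity and share load evenly.
    """
    if sites <= sites_lost:
        raise ValueError("at least one site must survive")
    return (sites - sites_lost) / sites


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} sites: keep each below {max_safe_utilisation(n):.0%} utilisation")
```

With two hot sites each must stay at or below half of its capacity, which is part of why hot–hot is expensive.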
Hot – Warm multisite
• Simpler than hot – hot
• Read / Write ratio dependent
• Synchronously or asynchronously replicate data?
Hot – Cold multisite
• Easy to set up
• Will it work?
• Can it be trusted?
• Cold site rapidly becomes stale
• Is it actually valuable?
DR multisite
• Fingers crossed you never need it
• How can / should you test it?
• Cloud?
Problems with multiple sites
• It’s expensive
• Managing more systems
• Managing data consistency
• Managing capacity
• Is it still fail proof?
• Unless you test it, it’s just a plan
We now have a complex system
Complex systems
• More redundancy and automation leads to more complexity
• More complexity often adds more points of failure
How complex systems fail
• Catastrophe is always just around the corner
• Human operators have dual roles
• Change introduces new forms of failure
- Dr. Richard Cook
Failure and recovery
Questions for the business
• What is the cost of downtime?
• What are the Recovery Time Objectives (RTO)?
• What are the Recovery Point Objectives (RPO)?
Aggressive RTO and RPO are expensive and have a performance impact
RTO / RPO example
Problem:
• Simple DB
• Business can tolerate up to 15 minutes downtime
• 10-minute window of data loss
RTO / RPO example
Possible solution:
• Continuously replicate data to second host
• Continue with nightly backups and also copy DB transaction logs from the primary host to another system
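A candidate recovery approach can be checked against the two objectives from the example (15 minutes of downtime, 10 minutes of data loss). The sketch below does only that comparison. Every timing attached to an option is a hypothetical placeholder rather than a measurement or a figure from the talk, and the class and function names are illustrative:

```python
from dataclasses import dataclass

RTO_MINUTES = 15  # from the example: tolerable downtime
RPO_MINUTES = 10  # from the example: tolerable window of data loss


@dataclass
class RecoveryOption:
    name: str
    worst_case_data_loss_min: float  # hypothetical: age of newest recoverable data
    worst_case_recovery_min: float   # hypothetical: time to restore service


def meets_objectives(option: RecoveryOption,
                     rto: float = RTO_MINUTES,
                     rpo: float = RPO_MINUTES) -> bool:
    """True if the option's worst case stays within both RTO and RPO."""
    return (option.worst_case_recovery_min <= rto
            and option.worst_case_data_loss_min <= rpo)


# All numbers below are placeholders; measure your own restores instead.
options = [
    RecoveryOption("nightly backup only", 24 * 60, 120),
    RecoveryOption("nightly backup + logs copied every 5 min", 5, 12),
    RecoveryOption("continuous replication to second host", 1, 5),
]

if __name__ == "__main__":
    for option in options:
        verdict = "meets" if meets_objectives(option) else "misses"
        print(f"{option.name}: {verdict} RTO/RPO")
```

The worst-case figures only become trustworthy once a restore has actually been rehearsed and timed.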
So what is more important – increasing availability or reducing recovery time?
MTBF or MTTR?
What about MTTD?
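For reference, steady-state availability is commonly written as MTBF / (MTBF + MTTR). The sketch below uses that formula with made-up figures to compare failing less often against recovering faster. Treating time to detect (MTTD) as a separate addition to repair time is an assumption made here for illustration, since some definitions already include detection in MTTR:

```python
def availability(mtbf_hours: float, mttr_hours: float, mttd_hours: float = 0.0) -> float:
    """Steady-state availability: uptime / (uptime + downtime per failure).

    Assumption: downtime per failure = time to detect + time to repair.
    """
    downtime = mttd_hours + mttr_hours
    return mtbf_hours / (mtbf_hours + downtime)


if __name__ == "__main__":
    # Hypothetical baseline: one failure every 1000 hours, 1 hour to repair.
    print(f"baseline:          {availability(1000, 1):.5%}")
    print(f"double the MTBF:   {availability(2000, 1):.5%}")
    print(f"halve the MTTR:    {availability(1000, 0.5):.5%}")
    print(f"30 min undetected: {availability(1000, 1, mttd_hours=0.5):.5%}")
```

Doubling MTBF and halving MTTR give the same availability under this formula, so the choice comes down to which is cheaper to achieve. Slow detection lengthens every outage regardless of how quickly the repair itself goes.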
The answer is: It depends
Failure is inevitable
Ask anyone
License
This presentation is provided under the Creative Commons Attribution Share Alike 3.0 Unported License.
You are free:
• To share – to copy, distribute and transmit the work
• To remix – to adapt the work
Under the following conditions:
• Attribution – You must attribute the work in the manner specified by Energized Work (but not in any way that suggests that Energized Work endorse you or your use of the work).
• Share Alike – If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.