Availability - courses.ece.ubc.ca · Availability and Reliability Availability: Probability that a...
Transcript of Availability - courses.ece.ubc.ca · Availability and Reliability Availability: Probability that a...
Copyright © 2004-2005 Konstantin Beznosov
T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A
Availability
EECE 412
2
Where We Are
ProtectionAuthorization Accountability Availability
Acc
ess
Con
trol
Dat
a Pr
otec
tion
Audit
Non-Repudiation
Serv
ice
Con
tinui
ty
Dis
aste
r R
ecov
ery
Assurance
Req
uire
men
ts A
ssur
ance
Dev
elop
men
t A
ssur
ance
Ope
ratio
nal A
ssur
ance
Des
ign
Ass
uran
ce
AuthenticationCryptography
3
What do you already know?
How are error, fault, and failure different? What’s the difference between fail-stop
and Byzantine failures? How many nodes do you need to have
3-fault tolerance for Byzantine failures? What measures to deal with failures do you
know? What are the ways of achieving service
continuity in the presence of attacks?
4
Outline
Availability in the presence of failures
• FT terminology
• k fault tolerance
• two army problem
• Byzantine Generals problem
• Services continuity and disaster recovery
Availability in the presence of attacks
• Failures vs. attacks
• Random vs. scale-free networks
• Internet tolerance to attacks and failures
• Services continuity and disaster recovery
Copyright © 2004-2005 Konstantin Beznosov
T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A
Availability in the Presence ofFailures
Failures, Errors, and Faults
A system is said to fail when it cannotmeet its promisesError may lead to a faultFault -- a cause of an error
errors failuresfaults
Fault Types
Transient: occur once and then disappear
Intermittent: occurs, then vanishes, thenreappears
Permanent: continues to exist
Availability and Reliability
Availability: Probability that a system operatescorrectly at any given moment and is available toperform its functions
Reliability: time period during which a systemcontinues to be available to perform its functions
Problem: calculate system availability andreliability if it’s unavailable for 1 second everyhour.
Fault Tolerance
A fault tolerant system can provide itsservices even in the presence of faults
Classification of Failure Modes
Type of failure Description
Crash failure A server halts, but is working correctly until it halts
Omission failure Receive omission Send omission
A server fails to respond to incoming requestsA server fails to receive incoming messagesA server fails to send messages
Timing failure A server's response lies outside the specified time interval
Response failure Value failure State transition failure
The server's response is incorrectThe value of the response is wrongThe server deviates from the correct flow of control
Arbitrary (a.k.a. Byzantine) failure A server may produce arbitrary responses at arbitrary times
Achieving k fault tolerance
A system is k fault tolerant if it can survivefaults in k components silent failure vs. Byzantine failure k+1 2k+1
Agreement among honest playerswith unreliable communications:
Two-army Problem
Even with nonfaulty processes, agreementeven between two processes is not possiblein the face of unreliable communications
Agreement among dishonest playerswith perfect communications:Byzantine Generals Problem
Results:1. In a system with m faulty processes,agreement can be achieved only if 2m+1correctly functioning processes are present(total 3m+1). (Lamport et al., 1982)
2. If messages cannot be guaranteed to bedelivered within a known, finite time, noagreement is possible even with one faultyprocess. (Fischer et al., 1985)
14
Ways to Deal with Failures
Service continuity• Masking failures via
• Redundancy of– information– time– physical
Disaster recovery• Backward recovery
• check pointing
• Forward recovery• bringing system into a correct new state
• Don’t underestimate backups!
Copyright © 2004-2005 Konstantin Beznosov
T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A
Availability in the Presence ofAttacks
16
Failures vs. Attacks
Failure• Random (unintentional) unavailability of
participants and/or infrastructure elements
Attack• Systematic (intentional) unavailability of
participants and/or infrastructure elements
17
Random vs. Scale–free Networks
18
19
Internet Tolerance toAttacks and Failures
Source: R. Albert, H. Jeong, and A.-L. Barabasi, "Error and attack tolerance of complex networks,"Nature, vol. 406, no. 6794, 2000, pp. 378-82.
fraction of nodes destroyed
netw
ork
diam
eter
Scale-free networks are failure-tolerant Random networks are attack-tolerant
20
Ways to Deal with Attacks
Service continuity• Same as for FT, plus• Heterogeneity
• Diversification– Avoid monocultures
• Randomization– Avoid “hubs”
Disaster recovery• Same as for FT
21
Summary
Availability in the presence of failures
• FT terminology
• k fault tolerance
• two army problem
• Byzantine Generals problem
• Services continuity and disaster recovery
Availability in the presence of attacks
• Failures vs. attacks
• Random vs. scale-free networks
• Internet tolerance to attacks and failures
• Services continuity and disaster recovery
22
What did you learn?
How are error, fault, and failure different? What’s the difference between fail-stop
and Byzantine failures? How many nodes do you need to have
3-fault tolerance for Byzantine failures? What measures to deal with failures do you
know? What are the ways of achieving service
continuity in the presence of attacks?