Availability - courses.ece.ubc.ca · Availability and Reliability Availability: Probability that a...

Copyright © 2004-2005 Konstantin Beznosov, The University of British Columbia. Availability, EECE 412



Where We Are

[Diagram: course map]

Protection
• Authorization: Access Control, Data Protection
• Accountability: Audit, Non-Repudiation
• Availability: Service Continuity, Disaster Recovery

Assurance
• Requirements Assurance
• Development Assurance
• Operational Assurance
• Design Assurance

Authentication, Cryptography


What do you already know?

How are error, fault, and failure different?

What’s the difference between fail-stop and Byzantine failures?

How many nodes do you need to have 3-fault tolerance for Byzantine failures?

What measures to deal with failures do you know?

What are the ways of achieving service continuity in the presence of attacks?


Outline

Availability in the presence of failures

• FT terminology

• k fault tolerance

• Two-army problem

• Byzantine Generals problem

• Service continuity and disaster recovery

Availability in the presence of attacks

• Failures vs. attacks

• Random vs. scale-free networks

• Internet tolerance to attacks and failures

• Service continuity and disaster recovery


Availability in the Presence of Failures


Failures, Errors, and Faults

A system is said to fail when it cannot meet its promises

Error: a part of the system's state that may lead to a failure

Fault: the cause of an error

faults → errors → failures


Fault Types

Transient: occurs once and then disappears

Intermittent: occurs, then vanishes, then reappears

Permanent: continues to exist


Availability and Reliability

Availability: probability that a system operates correctly at any given moment and is available to perform its functions

Reliability: time period during which a system continues to be available to perform its functions

Problem: calculate system availability and reliability if it’s unavailable for 1 second every hour.
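One way to work the problem above is sketched below. It assumes the outage is a single contiguous second in each hour and treats "reliability" exactly as defined on this slide (the longest period of uninterrupted service); the function names are illustrative only.

```python
# Worked sketch of the slide's problem: 1 second of downtime in every hour.
# Assumption: the downtime is one contiguous second per hour.

PERIOD_S = 3600     # one hour, in seconds
DOWNTIME_S = 1      # seconds of unavailability per hour (from the problem)


def availability(period_s: float, downtime_s: float) -> float:
    """Fraction of time the system is able to perform its functions."""
    return (period_s - downtime_s) / period_s


def longest_uninterrupted_run(period_s: float, downtime_s: float) -> float:
    """Longest interval (seconds) of continuous service between outages."""
    return period_s - downtime_s


if __name__ == "__main__":
    a = availability(PERIOD_S, DOWNTIME_S)
    r = longest_uninterrupted_run(PERIOD_S, DOWNTIME_S)
    print(f"availability = {a:.5f}")          # high: about 99.97 %
    print(f"uninterrupted service = {r} s")   # low: never a full hour
```

The point of the exercise: the system is highly available yet not very reliable, because it never runs for a full hour without interruption.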


Fault Tolerance

A fault tolerant system can provide its services even in the presence of faults


Classification of Failure Modes

Type of failure                        Description
Crash failure                          A server halts, but is working correctly until it halts
Omission failure                       A server fails to respond to incoming requests
  Receive omission                     A server fails to receive incoming messages
  Send omission                        A server fails to send messages
Timing failure                         A server's response lies outside the specified time interval
Response failure                       The server's response is incorrect
  Value failure                        The value of the response is wrong
  State transition failure             The server deviates from the correct flow of control
Arbitrary (a.k.a. Byzantine) failure   A server may produce arbitrary responses at arbitrary times


Achieving k fault tolerance

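A system is k fault tolerant if it can survive faults in k components

• Silent failures: k+1 components are needed, since one surviving component is enough
• Byzantine failures: 2k+1 components are needed, so that the k+1 correct components can outvote the k faulty ones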
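The k+1 and 2k+1 figures on this slide can be illustrated with a small sketch. It assumes replicated components whose answers are combined by a simple majority vote; `replicas_needed` and `majority_vote` are hypothetical helper names, not part of the course material.

```python
from collections import Counter


def replicas_needed(k: int, byzantine: bool) -> int:
    """Components needed to survive k faults, per the slide's k+1 / 2k+1 rule."""
    return 2 * k + 1 if byzantine else k + 1


def majority_vote(responses):
    """Return the value reported by a strict majority of replicas."""
    value, count = Counter(responses).most_common(1)[0]
    if count <= len(responses) // 2:
        raise ValueError("no strict majority among the responses")
    return value


if __name__ == "__main__":
    k = 3
    print(replicas_needed(k, byzantine=False))  # silent failures
    print(replicas_needed(k, byzantine=True))   # Byzantine failures
    # 7 replicas, 3 of them returning arbitrary values: the 4 correct ones win.
    print(majority_vote([42, 42, 42, 42, 7, 8, 9]))
```

With silent failures a faulty replica simply stops answering, so any single surviving answer can be trusted and no vote is needed.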


Agreement among honest players with unreliable communications: Two-army Problem

Even with nonfaulty processes, agreement between even two processes is not possible in the face of unreliable communications


Agreement among dishonest players with perfect communications: Byzantine Generals Problem

Results:

1. In a system with m faulty processes, agreement can be achieved only if 2m+1 correctly functioning processes are present (total 3m+1). (Lamport et al., 1982)

2. If messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible even with one faulty process. (Fischer et al., 1985)
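Result 1 can be illustrated for m = 1 with a toy, single-round version of the oral-messages exchange: the commander sends an order, every lieutenant relays what it received, and each loyal lieutenant takes the majority. This is only a sketch under stated assumptions: the traitor strategies (a lieutenant relays the opposite order, a commander alternates orders) and the tie-break toward "retreat" are choices made here for illustration, and `om1` is a hypothetical name.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def flip(order):
    return RETREAT if order == ATTACK else ATTACK


def majority(values):
    # Assumed tie-break: fall back to RETREAT when there is no strict majority.
    counts = Counter(values)
    return ATTACK if counts[ATTACK] > counts[RETREAT] else RETREAT


def om1(n_lieutenants, commander_order, traitor=None):
    """One relay round; traitor is None, "commander", or a lieutenant index.

    Returns the decision of every loyal lieutenant.
    """
    lieutenants = range(n_lieutenants)
    # Step 1: the commander sends an order to each lieutenant.
    if traitor == "commander":
        received = [ATTACK if i % 2 == 0 else RETREAT for i in lieutenants]
    else:
        received = [commander_order for _ in lieutenants]
    # Step 2: lieutenants relay what they received; a traitor relays the opposite.
    decisions = {}
    for i in lieutenants:
        if i == traitor:
            continue
        heard = [received[i]]
        for j in lieutenants:
            if j != i:
                heard.append(flip(received[j]) if j == traitor else received[j])
        decisions[i] = majority(heard)
    return decisions


if __name__ == "__main__":
    # 4 processes (3m+1 with m = 1): loyal lieutenants agree in both cases.
    print(om1(3, ATTACK, traitor=2))
    print(om1(3, ATTACK, traitor="commander"))
    # 3 processes: the loyal lieutenant cannot tell who is lying.
    print(om1(2, ATTACK, traitor=1))
```

With three processes the loyal lieutenant hears one "attack" and one "retreat" and has no way to break the tie correctly, which is the intuition behind the 3m+1 bound.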


Ways to Deal with Failures

Service continuity
• Masking failures via
  • Redundancy of
    – information
    – time
    – physical

Disaster recovery
• Backward recovery
  • checkpointing
• Forward recovery
  • bringing system into a correct new state
• Don’t underestimate backups!
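Backward recovery via checkpointing, as listed above, might look like the following minimal sketch. It assumes in-memory state that can be deep-copied; the class `CheckpointedCounter` and its methods are hypothetical names chosen for illustration.

```python
import copy


class CheckpointedCounter:
    """Toy service state with checkpoint and rollback (backward recovery)."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a known-good copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Return to the last saved state, discarding later changes.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, amounts):
        """Apply a batch of updates; on an error, roll back the whole batch."""
        self.checkpoint()
        try:
            for amount in amounts:
                if amount < 0:
                    raise ValueError("negative amount")
                self.state["total"] += amount
        except ValueError:
            self.rollback()
            return False
        return True


if __name__ == "__main__":
    counter = CheckpointedCounter()
    counter.apply([1, 2])      # succeeds, total becomes 3
    counter.apply([5, -1])     # fails midway, state rolls back to total 3
    print(counter.state)
```

Forward recovery would instead move the system to a new correct state (for example, by repairing the bad update) rather than returning to an earlier one.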


Availability in the Presence of Attacks


Failures vs. Attacks

Failure
• Random (unintentional) unavailability of participants and/or infrastructure elements

Attack
• Systematic (intentional) unavailability of participants and/or infrastructure elements


Random vs. Scale-free Networks


Internet Tolerance to Attacks and Failures

Source: R. Albert, H. Jeong, and A.-L. Barabasi, "Error and attack tolerance of complex networks," Nature, vol. 406, no. 6794, 2000, pp. 378-82.

[Figure: network diameter (y-axis) vs. fraction of nodes destroyed (x-axis)]

Scale-free networks are failure-tolerant

Random networks are attack-tolerant
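The contrast between failures and attacks can be explored with a small simulation. This is a rough sketch, not the method of the cited paper: it measures the size of the largest connected component instead of network diameter, builds the scale-free graph by simple preferential attachment, and all names and parameters (`scale_free_graph`, `random_graph`, `surviving_fraction`, 1000 nodes, 20 % removed) are assumptions made for illustration.

```python
import random
from collections import deque


def scale_free_graph(n, m, rng):
    """Preferential attachment: new nodes link to m nodes, favouring hubs."""
    adj = {i: set() for i in range(n)}
    pool = []  # each node appears once per incident edge
    for i in range(m + 1):          # small fully connected core
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
            pool += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for target in chosen:
            adj[new].add(target)
            adj[target].add(new)
            pool += [new, target]
    return adj


def random_graph(n, n_edges, rng):
    """Random graph with a fixed number of uniformly chosen edges."""
    adj = {i: set() for i in range(n)}
    added = 0
    while added < n_edges:
        a, b = rng.sample(range(n), 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj


def largest_component(adj, removed):
    """Size of the largest connected component after removing some nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def surviving_fraction(adj, fraction, mode, rng):
    """Remove nodes at random ("failure") or by highest degree ("attack")."""
    k = int(fraction * len(adj))
    if mode == "attack":
        victims = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]
    else:
        victims = rng.sample(sorted(adj), k)
    return largest_component(adj, victims) / len(adj)


if __name__ == "__main__":
    rng = random.Random(412)
    sf = scale_free_graph(1000, 2, rng)
    edges = sum(len(v) for v in sf.values()) // 2
    rnd = random_graph(1000, edges, rng)
    for name, graph in (("scale-free", sf), ("random", rnd)):
        for mode in ("failure", "attack"):
            print(name, mode, round(surviving_fraction(graph, 0.2, mode, rng), 2))
```

In runs of this kind the scale-free graph typically keeps most of its nodes connected under random failures but fragments when its hubs are removed, while the random graph degrades more evenly in both cases.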


Ways to Deal with Attacks

Service continuity
• Same as for FT, plus
• Heterogeneity
• Diversification
  – Avoid monocultures
• Randomization
  – Avoid “hubs”

Disaster recovery
• Same as for FT


Summary

Availability in the presence of failures

• FT terminology

• k fault tolerance

• Two-army problem

• Byzantine Generals problem

• Service continuity and disaster recovery

Availability in the presence of attacks

• Failures vs. attacks

• Random vs. scale-free networks

• Internet tolerance to attacks and failures

• Service continuity and disaster recovery


What did you learn?

How are error, fault, and failure different?

What’s the difference between fail-stop and Byzantine failures?

How many nodes do you need to have 3-fault tolerance for Byzantine failures?

What measures to deal with failures do you know?

What are the ways of achieving service continuity in the presence of attacks?