Designing For Failure
Stanford University CS 444A, Autumn 99
Software Development for Critical Applications
Armando Fox & David Dill
{fox,dill}@cs.stanford.edu

Outline
- User expectations and failure semantics
- Orthogonal mechanisms (again)
- Case studies:
  - TACC
  - Microsoft Tiger video server
  - Cirrus banking network
  - TWA flight reservations system
  - Mars Pathfinder
- Lessons

Designing For Failure: Philosophy
- Start with some “givens”:
  - Hardware does fail
  - Latent bugs do occur
  - Nondeterministic/hard-to-reproduce bugs will happen
- Requirements:
  - Maintain availability (possibly with degraded performance) when these things happen
  - Isolate faults so they don’t bring the whole system down
- Question: what is “availability” from the end user’s point of view?
  - Specifically: what constitutes correct/acceptable behavior?

User Expectations & Failure Semantics
- Determine expected/common failure modes
- Find “cheap” mechanisms to address them
  - Analysis: invariants are your friends
  - Performance: keep the common case fast
  - Robustness: minimize dependencies on other components to avoid the “domino effect”
- Determine effects on performance & semantics
- Unexpected/unsupported failure modes?

Example Cheap Mechanism: BASE
- Best-effort Availability (like the Internet!)
- Soft State (if you can tolerate temporary state loss)
- Eventual Consistency (often fine in practice)
- When is BASE compatible with user-perceived application semantics?
  - When is availability more important? (Your client-side Netscape cache? Source code library?)
  - When is consistency more important? (Your bank account? User profiles?)
- Example: “reload semantics”
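
The soft-state idea can be made concrete with a minimal sketch (hypothetical code, not from any of the systems discussed): an entry survives only as long as it keeps being refreshed, so losing it costs a recomputation, never correctness.

```python
import time

class SoftStateCache:
    """Minimal soft-state cache sketch: entries expire unless refreshed,
    so losing them costs only a recomputation, never correctness."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry time)

    def refresh(self, key, value):
        # Refreshing is idempotent: re-announcing the same value
        # just extends the lifetime of the entry.
        self._entries[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() > expiry:
            del self._entries[key]  # stale: silently discard
            return None
        return value

cache = SoftStateCache(ttl_seconds=5.0)
cache.refresh("page", "<html>...</html>")
assert cache.get("page") == "<html>...</html>"
```

A missed refresh simply makes the entry vanish; callers must be prepared to recompute, which is exactly the eventual-consistency bargain.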

Example: NFS (using BASE badly)
- Original philosophy: “stateless is good”
- Implementation revealed: performance was bad
- Caching retrofitted as a hack
  - Attribute cache for stat()
  - Stateless → soft state!
- Network file locking added later
  - Inconsistent lock state if the server or a client crashes
- The user’s view of data semantics wasn’t considered from the get-go. Result: brain-dead semantics.

Example: Multicast/SRM
- Expected failure: periodic router death/partition
- Cheap solution:
  - Each router stores only local mcast routing info
  - Soft state with periodic refreshes
  - Absence of “reachability beacons” is used to infer failure of upstream nodes; can look for another
  - Receiving duplicate refreshes is idempotent
- Unsupported failure mode: frequent or prolonged router failure
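
The beacon-absence inference above can be sketched as follows (hypothetical code; the constants and names are illustrative assumptions, not SRM's actual parameters): each node records the last beacon time per upstream neighbor, declares failure after a few missed refresh periods, and treats duplicate beacons as harmless.

```python
REFRESH_PERIOD = 30.0   # assumed beacon interval, seconds (illustrative)
FAILURE_THRESHOLD = 3   # missed beacons before declaring failure

class BeaconMonitor:
    """Soft-state failure detector: infer upstream death from silence."""

    def __init__(self):
        self.last_heard = {}  # neighbor -> timestamp of last beacon

    def on_beacon(self, neighbor, now):
        # Duplicate beacons just overwrite the timestamp: idempotent.
        self.last_heard[neighbor] = now

    def failed_neighbors(self, now):
        # A neighbor silent for FAILURE_THRESHOLD refresh periods is
        # presumed dead; we can then look for another upstream node.
        cutoff = now - FAILURE_THRESHOLD * REFRESH_PERIOD
        return [n for n, t in self.last_heard.items() if t < cutoff]

mon = BeaconMonitor()
mon.on_beacon("upstream-a", now=0.0)
mon.on_beacon("upstream-b", now=0.0)
mon.on_beacon("upstream-b", now=60.0)   # only b keeps refreshing
assert mon.failed_neighbors(now=100.0) == ["upstream-a"]
```

This also shows why frequent or prolonged router failure is unsupported: the detector only works when failures are rarer than the refresh period.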

Example: TACC Platform
- Expected failure mode: load balancer death (or partition)
- Cheap solution: LB state is all soft
  - The LB periodically beacons its existence & location
  - Workers connect to it and send periodic load reports
  - If the LB crashes, its state is restored within a few seconds
- Effect on data semantics:
  - Cached, stale state in the FEs allows temporary operation during LB restart
  - Empirically and analytically, this is not so bad
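
A sketch of why the LB's soft state is cheap to restore (hypothetical code, not the actual TACC implementation): because the only state is the latest load report per worker, a freshly restarted instance repopulates itself after one round of reports.

```python
class LoadBalancer:
    """All state is soft: just the last load report from each worker."""

    def __init__(self):
        self.loads = {}  # worker -> last reported load

    def on_load_report(self, worker, load):
        self.loads[worker] = load

    def least_loaded(self):
        return min(self.loads, key=self.loads.get) if self.loads else None

lb = LoadBalancer()
lb.on_load_report("w1", load=0.9)
lb.on_load_report("w2", load=0.2)
assert lb.least_loaded() == "w2"

# Crash: all state is lost. A new instance starts empty...
lb = LoadBalancer()
# ...and the next round of periodic load reports restores it,
# which is why recovery takes only a few seconds.
lb.on_load_report("w1", load=0.9)
lb.on_load_report("w2", load=0.2)
assert lb.least_loaded() == "w2"
```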

Example: TACC Workers
- Expected failure: worker death (or partition)
- Cheap solution:
  - Invariant: workers are restartable
  - Use a retry count to handle pathological inputs
- Effect on system behavior/semantics:
  - Temporary extra latency for retry
- Unexpected failure mode: livelock!
  - Timers were retrofitted; what should the behavior on timeout be? (Kill the request, or kill the worker?)
  - User-perceived effects?
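
The restartable-worker invariant plus retry count can be sketched like this (hypothetical dispatcher, not the actual TACC code): a crashed worker is just retried, and the retry cap keeps a pathological input from causing an endless restart loop.

```python
MAX_RETRIES = 3  # assumed cap; the real value is a tuning choice

def dispatch(request, run_worker):
    """Run the request, restarting the worker on failure.

    Because workers are restartable, a crash only costs the user
    some extra latency -- unless the input itself is pathological,
    in which case the retry count makes us give up."""
    for _ in range(MAX_RETRIES):
        try:
            return run_worker(request)
        except Exception:
            continue  # worker died; restart and retry
    raise RuntimeError("gave up after %d tries" % MAX_RETRIES)

attempts = []
def flaky_worker(req):
    # Dies on its first invocation, then succeeds (simulated crash).
    attempts.append(req)
    if len(attempts) < 2:
        raise IOError("worker crashed")
    return "ok"

assert dispatch("img.gif", flaky_worker) == "ok"
assert len(attempts) == 2   # one crash, one successful retry
```

Note what the sketch does not solve: a worker that hangs rather than dies never raises, which is exactly the livelock mode that forced the retrofitted timers.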

Example: MS Tiger Video Fileserver
- ATM-connected disks “walk down” a static schedule
- “Coherent hallucination”: a global schedule, but each disk only knows its local piece
- Pieces are passed around the ring, bucket-brigade style
[Figure: cubs 0–4 on an ATM interconnect to the WAN; the schedule advances over time]

Example: Tiger, cont’d.
- Once failure is detected, bypass the failed cub
  - Why not just detect first, then bypass schedule info to the successor?
  - If a cub fails, its successor is already prepared to take over that schedule slot
[Figure: cubs 0–4 on the ATM interconnect; the failed cub’s schedule slot passes to its successor]

Example: Tiger, cont’d.
- The failed cub is permanently removed from the ring
[Figure: cubs 0–4 on the ATM interconnect, with the failed cub removed from the ring]

Example: Tiger, cont’d.
- Expected failure mode: death of a cub (node)
- Cheap solution:
  - All files are mirrored using a decluster factor
  - Each cub passes its chunk of the schedule to both its successor and its second-successor
- Effect on performance/semantics:
  - No user-visible latency to recover from failure!
  - What’s the cost of this?
  - What would be an analogous TACC mechanism?
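
The successor/second-successor handoff can be sketched as a simple ring lookup (hypothetical code, not Tiger's actual scheduler): because the schedule slot was already given to both, the first live one of the two can take over with no recovery latency.

```python
def next_live_cub(ring, me, failed):
    """Return the first live cub after `me`, checking the successor and
    then the second-successor -- the two nodes that already hold this
    cub's schedule slot."""
    n = len(ring)
    start = ring.index(me)
    for step in (1, 2):  # successor, then second-successor
        candidate = ring[(start + step) % n]
        if candidate not in failed:
            return candidate
    return None  # both copies gone: unsupported failure mode

ring = [0, 1, 2, 3, 4]
assert next_live_cub(ring, me=1, failed=set()) == 2
assert next_live_cub(ring, me=1, failed={2}) == 3        # bypass failed cub
assert next_live_cub(ring, me=1, failed={2, 3}) is None  # frame loss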

Example: Tiger, cont’d.
- Unsupported failure modes:
  - Interconnect failure
  - Failure of both primary and secondary copies of a block
  - User-visible effect: frame loss or complete hosage
- Notable differences from TACC:
  - Schedule is global but not centralized (coherent hallucination)
  - State is not soft, but not really “durable” either (probabilistically hard? asymptotically hard?)

Example: CIRRUS Banking Network
- The CIRRUS network is a “smart switch” that runs on Tandem NonStop nodes (deployed early ’80s)
- 2-phase commit for cash withdrawal:
  1. Withdrawal request from ATM
  2. “Xact in progress” logged at bank
  3. [Commit point] cash dispensed, confirmation sent to bank
  4. Ack from bank
- Non-obvious failure mode #1, after the commit point: the CIRRUS switch resends #3 until a reply is received
  - If the reply indicates an error, manual cleanup is needed

ATM Example, continued
- Non-obvious failure mode #2: cash is dispensed, but the CIRRUS switch never sees message #3 from the ATM
  - Notify bank: cash was not dispensed
  - “Reboot” the ATM-to-CIRRUS net
  - Query the net log to see if the net thinks #3 was sent; if so, and #3 arrived before end of business day, create an Adjustment Record at both sides
  - Otherwise, resolve out-of-band using physical evidence of the xact (tape logs, video cam, etc.)
- Role of end-user semantics in picking the design point for auto vs. manual recovery (end of business day)
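
Failure mode #1 (resending the confirmation after the commit point) can be sketched as a retry loop with three outcomes (hypothetical code; the function names and the resend cap are illustrative, not CIRRUS internals): success, an error reply that escalates to manual cleanup, or giving up after repeated silence.

```python
def confirm_after_commit(send_confirmation, max_resends=5):
    """After the commit point (cash already dispensed), keep resending
    message #3 until some reply arrives; classify the outcome."""
    for _ in range(max_resends):
        reply = send_confirmation()    # may return None on timeout
        if reply is None:
            continue                   # no reply yet: resend #3
        if reply == "ack":
            return "committed"         # step 4: bank acknowledged
        return "manual-cleanup"        # error reply: a human resolves it
    return "manual-cleanup"            # gave up: a human resolves it

# Two timeouts, then the bank's ack finally arrives.
replies = iter([None, None, "ack"])
assert confirm_after_commit(lambda: next(replies)) == "committed"
assert confirm_after_commit(lambda: "error") == "manual-cleanup"
```

The asymmetry is deliberate: after the commit point the cash is already in the customer's hands, so the system can only insist until the bookkeeping catches up or hand the case to a human.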

ATM and TWA Reservations Example
- Non-obvious failure mode #3: a malicious ATM fakes message #3. Not caught or handled! “In practice, this won’t go unnoticed in the banking industry and/or by our customers.”
- TWA Reservations System (~1985)
  - “Secondary” DB of reservations sold points into the “primary” DB of complete seating inventory
  - Dangling pointers and double-bookings resolved offline
  - Explicitly trades availability for throughput: “...a 90% utilization rate would make it impossible for us to log all our transactions”
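
The offline resolution pass could look something like this sketch (hypothetical code and data shapes, not TWA's system): walk the secondary DB of sales, flag pointers into seats the primary DB doesn't have, and flag seats sold twice.

```python
def reconcile(sold, inventory):
    """Offline reconciliation sketch.

    sold: list of (reservation_id, seat) from the secondary DB
    inventory: set of valid seats from the primary DB
    Returns (dangling reservation ids, double-booked pairs)."""
    dangling, double_booked, seen = [], [], {}
    for res_id, seat in sold:
        if seat not in inventory:
            dangling.append(res_id)     # points at no real seat
        elif seat in seen:
            double_booked.append((seen[seat], res_id))  # two sales, one seat
        else:
            seen[seat] = res_id
    return dangling, double_booked

sold = [("r1", "12A"), ("r2", "12A"), ("r3", "99Z")]
inventory = {"12A", "12B"}
dangling, doubles = reconcile(sold, inventory)
assert dangling == ["r3"]
assert doubles == [("r1", "r2")]
```

Running this offline is the availability-for-throughput trade in miniature: sales never block on the primary DB, and the inconsistencies are cleaned up later.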

“What Really Happened On Mars”
- Sources: various posts to Risks Digest, including one from the CTO of Wind River Systems, maker of the VxWorks operating system that controls the Pathfinder
- Background concepts:
  - Threads and mutual exclusion
  - Mutexes and priority inheritance
  - Priority inversion
- How the failure was diagnosed and fixed
- Lessons

Background: threads and mutexes
- Threads: independent flows of control, typically within a single program, that usually share some data or resources
  - Can be used for performance, program structure, or both
  - Threads can have different priorities for sharing resources (CPU, memory, etc.)
  - A thread scheduler (dozens of algorithms exist) enforces the priorities
- Mutex: a lock (data structure) that enforces one-at-a-time access to a shared resource
  - In this case, a systemwide data bus
  - Usage: to exclusively access a shared resource, you acquire the mutex, do your thing, then release the mutex

Background: priority inversion
- Priority inheritance: the mutex “inherits” the priority of whichever thread currently holds it
  - So, if a high-priority thread holds it (or is waiting for it), mutex operations are given high priority
- Priority inversion can happen when this isn’t done:
  - Low-priority, infrequent thread A grabs mutex M
  - High-priority thread B needs to access data protected by M
  - Result: even though B has higher priority than A, it has to wait a long time, since A holds M
- This example is obvious; in practice, priority inversion is usually more subtle

What Really Happened
- Dramatis personae:
  - Low-priority thread A: infrequent, short-running meteorological data collection, using the bus mutex
  - High-priority thread B: bus manager, using the bus mutex
  - Medium-priority thread C: long-running communications task (that doesn’t need the mutex)
- Priority inversion scenario:
  - A is scheduled and grabs the bus mutex
  - B is scheduled and blocks waiting for A to release the mutex
  - C is scheduled while B is waiting for the mutex
  - C has higher priority than A, so it prevents A from running (and therefore B as well)
  - The watchdog timer notices B hasn’t run, concludes something is wrong, and reboots
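
The scenario above can be reproduced in a deterministic toy simulation (this is illustrative Python, not VxWorks code; the tick counts and priorities are made up): a strict-priority scheduler runs A, B, and C, with B blocked on the mutex A holds. Without priority inheritance, C starves A forever and B never runs; with it, A finishes quickly and B proceeds.

```python
def simulate(ticks, inheritance=False):
    """Toy strict-priority scheduler for the Pathfinder scenario.

    A (prio 1) holds the bus mutex and needs 2 ticks to finish.
    B (prio 3) is blocked on that mutex. C (prio 2) is always runnable.
    Returns whether B ever got to run within the watchdog's deadline."""
    holder = "A"            # A grabbed the mutex before B woke up
    remaining = {"A": 2}    # ticks A needs to finish and release
    prio = {"A": 1, "B": 3, "C": 2}
    b_ran = False
    for _ in range(ticks):
        runnable = {"A", "C"}          # B stays blocked on the mutex...
        if holder is None:
            runnable.add("B")          # ...until it is released
        eff = dict(prio)
        if inheritance and holder is not None:
            # Priority inheritance: the holder runs at B's priority.
            eff[holder] = max(eff[holder], prio["B"])
        t = max(runnable, key=lambda x: eff[x])  # highest priority wins
        if t == "A":
            remaining["A"] -= 1
            if remaining["A"] == 0:
                holder = None          # A releases the mutex
        elif t == "B":
            b_ran = True
    return b_ran

assert simulate(10, inheritance=False) is False  # B starves: watchdog fires
assert simulate(10, inheritance=True) is True    # B runs once A finishes
```

This mirrors the fix actually deployed: VxWorks mutexes had a priority-inheritance option, and enabling it on the bus mutex resolved the resets.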

Lessons They Learned
- Extensive logging facilities (a trace of system events) allowed diagnosis
- A debugging facility (the ability to execute C commands that tweak the running system, kind of like gdb) allowed the problem to be fixed
- The debugging facility was intended to be turned off prior to production use
  - A similar facility exists in certain commercial CPU designs!

Lessons We Should Learn
- Complexity has pitfalls
  - Concurrency, race conditions, and mutual exclusion are hard to debug
  - Ousterhout, Why Threads Are a Bad Idea (For Most Purposes); Savage et al., Eraser
  - C.A.R. Hoare: “The unavoidable price of reliability is simplicity.”
- It’s more important to get it extensible than to get it right
  - Even in the hardware world!
- Knowledge of end-to-end semantics can be critical
  - Whither “transparent” OS-level mechanisms?