Designing For Failure
Stanford University CS 444A, Autumn 99
Software Development for Critical Applications
Armando Fox & David Dill
{fox,dill}@cs.stanford.edu

Outline
- User expectations and failure semantics
- Orthogonal mechanisms (again)
- Case studies:
  - TACC
  - Microsoft Tiger video server
  - Cirrus banking network
  - TWA flight reservations system
  - Mars Pathfinder
- Lessons

Designing For Failure: Philosophy
- Start with some “givens”:
  - Hardware does fail
  - Latent bugs do occur
  - Nondeterministic/hard-to-reproduce bugs will happen
- Requirements:
  - Maintain availability (possibly with degraded performance) when these things happen
  - Isolate faults so they don’t bring the whole system down
- Question: what is “availability” from the end user’s point of view?
  - Specifically: what constitutes correct/acceptable behavior?

User Expectations & Failure Semantics
- Determine expected/common failure modes
- Find “cheap” mechanisms to address them
  - Analysis: invariants are your friends
  - Performance: keep the common case fast
  - Robustness: minimize dependencies on other components to avoid the “domino effect”
- Determine effects on performance & semantics
- Unexpected/unsupported failure modes?

Example Cheap Mechanism: BASE
- Best-effort Availability (like the Internet!)
- Soft State (if you can tolerate temporary state loss)
- Eventual Consistency (often fine in practice)
- When is BASE compatible with user-perceived application semantics?
  - When is availability more important? (Your client-side Netscape cache? Source code library?)
  - When is consistency more important? (Your bank account? User profiles?)
- Example: “reload semantics”
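
The soft-state idea can be made concrete with a minimal sketch (hypothetical code, not from any of the systems discussed): an entry survives only as long as it keeps being refreshed, so losing it costs a recomputation, never correctness.

```python
import time

class SoftStateCache:
    """Minimal soft-state cache sketch: entries expire unless refreshed,
    so losing them costs only a recomputation, never correctness."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry time)

    def refresh(self, key, value):
        # Refreshing is idempotent: re-announcing the same value
        # just extends the lifetime of the entry.
        self._entries[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() > expiry:
            del self._entries[key]  # stale: silently discard
            return None
        return value

cache = SoftStateCache(ttl_seconds=5.0)
cache.refresh("page", "<html>...</html>")
assert cache.get("page") == "<html>...</html>"
```

A missed refresh simply makes the entry vanish; callers must be prepared to recompute, which is exactly the eventual-consistency bargain.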

Example: NFS (using BASE badly)
- Original philosophy: “stateless is good”
- Implementation revealed: performance was bad
- Caching retrofitted as a hack
  - Attribute cache for stat()
  - Stateless → soft state!
- Network file locking added later
  - Inconsistent lock state if the server or a client crashes
- The user’s view of data semantics wasn’t considered from the get-go. Result: brain-dead semantics.

Example: Multicast/SRM
- Expected failure: periodic router death/partition
- Cheap solution:
  - Each router stores only local mcast routing info
  - Soft state with periodic refreshes
  - Absence of “reachability beacons” is used to infer failure of upstream nodes; can look for another
  - Receiving duplicate refreshes is idempotent
- Unsupported failure mode: frequent or prolonged router failure
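
The beacon-absence inference above can be sketched as follows (hypothetical code; the constants and names are illustrative assumptions, not SRM's actual parameters): each node records the last beacon time per upstream neighbor, declares failure after a few missed refresh periods, and treats duplicate beacons as harmless.

```python
REFRESH_PERIOD = 30.0   # assumed beacon interval, seconds (illustrative)
FAILURE_THRESHOLD = 3   # missed beacons before declaring failure

class BeaconMonitor:
    """Soft-state failure detector: infer upstream death from silence."""

    def __init__(self):
        self.last_heard = {}  # neighbor -> timestamp of last beacon

    def on_beacon(self, neighbor, now):
        # Duplicate beacons just overwrite the timestamp: idempotent.
        self.last_heard[neighbor] = now

    def failed_neighbors(self, now):
        # A neighbor silent for FAILURE_THRESHOLD refresh periods is
        # presumed dead; we can then look for another upstream node.
        cutoff = now - FAILURE_THRESHOLD * REFRESH_PERIOD
        return [n for n, t in self.last_heard.items() if t < cutoff]

mon = BeaconMonitor()
mon.on_beacon("upstream-a", now=0.0)
mon.on_beacon("upstream-b", now=0.0)
mon.on_beacon("upstream-b", now=60.0)   # only b keeps refreshing
assert mon.failed_neighbors(now=100.0) == ["upstream-a"]
```

This also shows why frequent or prolonged router failure is unsupported: the detector only works when failures are rarer than the refresh period.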

Example: TACC Platform
- Expected failure mode: load balancer death (or partition)
- Cheap solution: LB state is all soft
  - The LB periodically beacons its existence & location
  - Workers connect to it and send periodic load reports
  - If the LB crashes, its state is restored within a few seconds
- Effect on data semantics:
  - Cached, stale state in the FEs allows temporary operation during LB restart
  - Empirically and analytically, this is not so bad
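
A sketch of why the LB's soft state is cheap to restore (hypothetical code, not the actual TACC implementation): because the only state is the latest load report per worker, a freshly restarted instance repopulates itself after one round of reports.

```python
class LoadBalancer:
    """All state is soft: just the last load report from each worker."""

    def __init__(self):
        self.loads = {}  # worker -> last reported load

    def on_load_report(self, worker, load):
        self.loads[worker] = load

    def least_loaded(self):
        return min(self.loads, key=self.loads.get) if self.loads else None

lb = LoadBalancer()
lb.on_load_report("w1", load=0.9)
lb.on_load_report("w2", load=0.2)
assert lb.least_loaded() == "w2"

# Crash: all state is lost. A new instance starts empty...
lb = LoadBalancer()
# ...and the next round of periodic load reports restores it,
# which is why recovery takes only a few seconds.
lb.on_load_report("w1", load=0.9)
lb.on_load_report("w2", load=0.2)
assert lb.least_loaded() == "w2"
```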

Example: TACC Workers
- Expected failure: worker death (or partition)
- Cheap solution:
  - Invariant: workers are restartable
  - Use a retry count to handle pathological inputs
- Effect on system behavior/semantics:
  - Temporary extra latency for retry
- Unexpected failure mode: livelock!
  - Timers were retrofitted; what should the behavior on timeout be? (Kill the request, or kill the worker?)
  - User-perceived effects?
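
The restartable-worker invariant plus retry count can be sketched like this (hypothetical dispatcher, not the actual TACC code): a crashed worker is just retried, and the retry cap keeps a pathological input from causing an endless restart loop.

```python
MAX_RETRIES = 3  # assumed cap; the real value is a tuning choice

def dispatch(request, run_worker):
    """Run the request, restarting the worker on failure.

    Because workers are restartable, a crash only costs the user
    some extra latency -- unless the input itself is pathological,
    in which case the retry count makes us give up."""
    for _ in range(MAX_RETRIES):
        try:
            return run_worker(request)
        except Exception:
            continue  # worker died; restart and retry
    raise RuntimeError("gave up after %d tries" % MAX_RETRIES)

attempts = []
def flaky_worker(req):
    # Dies on its first invocation, then succeeds (simulated crash).
    attempts.append(req)
    if len(attempts) < 2:
        raise IOError("worker crashed")
    return "ok"

assert dispatch("img.gif", flaky_worker) == "ok"
assert len(attempts) == 2   # one crash, one successful retry
```

Note what the sketch does not solve: a worker that hangs rather than dies never raises, which is exactly the livelock mode that forced the retrofitted timers.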

Example: MS Tiger Video Fileserver
- ATM-connected disks “walk down” a static schedule
- “Coherent hallucination”: a global schedule, but each disk only knows its local piece
- Pieces are passed around the ring, bucket-brigade style
[Figure: cubs 0–4 on an ATM interconnect to the WAN; the schedule advances over time]

Example: Tiger, cont’d.
- Once failure is detected, bypass the failed cub
  - Why not just detect first, then bypass schedule info to the successor?
  - If a cub fails, its successor is already prepared to take over that schedule slot
[Figure: cubs 0–4 on the ATM interconnect; the failed cub’s schedule slot passes to its successor]

Example: Tiger, cont’d.
- The failed cub is permanently removed from the ring
[Figure: cubs 0–4 on the ATM interconnect, with the failed cub removed from the ring]

Example: Tiger, cont’d.
- Expected failure mode: death of a cub (node)
- Cheap solution:
  - All files are mirrored using a decluster factor
  - Each cub passes its chunk of the schedule to both its successor and its second-successor
- Effect on performance/semantics:
  - No user-visible latency to recover from failure!
  - What’s the cost of this?
  - What would be an analogous TACC mechanism?
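
The successor/second-successor handoff can be sketched as a simple ring lookup (hypothetical code, not Tiger's actual scheduler): because the schedule slot was already given to both, the first live one of the two can take over with no recovery latency.

```python
def next_live_cub(ring, me, failed):
    """Return the first live cub after `me`, checking the successor and
    then the second-successor -- the two nodes that already hold this
    cub's schedule slot."""
    n = len(ring)
    start = ring.index(me)
    for step in (1, 2):  # successor, then second-successor
        candidate = ring[(start + step) % n]
        if candidate not in failed:
            return candidate
    return None  # both copies gone: unsupported failure mode

ring = [0, 1, 2, 3, 4]
assert next_live_cub(ring, me=1, failed=set()) == 2
assert next_live_cub(ring, me=1, failed={2}) == 3        # bypass failed cub
assert next_live_cub(ring, me=1, failed={2, 3}) is None  # frame loss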

Example: Tiger, cont’d.
- Unsupported failure modes:
  - Interconnect failure
  - Failure of both primary and secondary copies of a block
  - User-visible effect: frame loss or complete hosage
- Notable differences from TACC:
  - Schedule is global but not centralized (coherent hallucination)
  - State is not soft, but not really “durable” either (probabilistically hard? asymptotically hard?)

Example: CIRRUS Banking Network
- The CIRRUS network is a “smart switch” that runs on Tandem NonStop nodes (deployed early ’80s)
- 2-phase commit for cash withdrawal:
  1. Withdrawal request from ATM
  2. “Xact in progress” logged at bank
  3. [Commit point] cash dispensed, confirmation sent to bank
  4. Ack from bank
- Non-obvious failure mode #1, after the commit point: the CIRRUS switch resends #3 until a reply is received
  - If the reply indicates an error, manual cleanup is needed

ATM Example, continued
- Non-obvious failure mode #2: cash is dispensed, but the CIRRUS switch never sees message #3 from the ATM
  - Notify bank: cash was not dispensed
  - “Reboot” the ATM-to-CIRRUS net
  - Query the net log to see if the net thinks #3 was sent; if so, and #3 arrived before end of business day, create an Adjustment Record at both sides
  - Otherwise, resolve out-of-band using physical evidence of the xact (tape logs, video cam, etc.)
- Role of end-user semantics in picking the design point for auto vs. manual recovery (end of business day)
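
Failure mode #1 (resending the confirmation after the commit point) can be sketched as a retry loop with three outcomes (hypothetical code; the function names and the resend cap are illustrative, not CIRRUS internals): success, an error reply that escalates to manual cleanup, or giving up after repeated silence.

```python
def confirm_after_commit(send_confirmation, max_resends=5):
    """After the commit point (cash already dispensed), keep resending
    message #3 until some reply arrives; classify the outcome."""
    for _ in range(max_resends):
        reply = send_confirmation()    # may return None on timeout
        if reply is None:
            continue                   # no reply yet: resend #3
        if reply == "ack":
            return "committed"         # step 4: bank acknowledged
        return "manual-cleanup"        # error reply: a human resolves it
    return "manual-cleanup"            # gave up: a human resolves it

# Two timeouts, then the bank's ack finally arrives.
replies = iter([None, None, "ack"])
assert confirm_after_commit(lambda: next(replies)) == "committed"
assert confirm_after_commit(lambda: "error") == "manual-cleanup"
```

The asymmetry is deliberate: after the commit point the cash is already in the customer's hands, so the system can only insist until the bookkeeping catches up or hand the case to a human.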

ATM and TWA Reservations Example
- Non-obvious failure mode #3: a malicious ATM fakes message #3. Not caught or handled! “In practice, this won’t go unnoticed in the banking industry and/or by our customers.”
- TWA Reservations System (~1985)
  - “Secondary” DB of reservations sold points into the “primary” DB of complete seating inventory
  - Dangling pointers and double-bookings resolved offline
  - Explicitly trades availability for throughput: “...a 90% utilization rate would make it impossible for us to log all our transactions”
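
The offline resolution pass could look something like this sketch (hypothetical code and data shapes, not TWA's system): walk the secondary DB of sales, flag pointers into seats the primary DB doesn't have, and flag seats sold twice.

```python
def reconcile(sold, inventory):
    """Offline reconciliation sketch.

    sold: list of (reservation_id, seat) from the secondary DB
    inventory: set of valid seats from the primary DB
    Returns (dangling reservation ids, double-booked pairs)."""
    dangling, double_booked, seen = [], [], {}
    for res_id, seat in sold:
        if seat not in inventory:
            dangling.append(res_id)     # points at no real seat
        elif seat in seen:
            double_booked.append((seen[seat], res_id))  # two sales, one seat
        else:
            seen[seat] = res_id
    return dangling, double_booked

sold = [("r1", "12A"), ("r2", "12A"), ("r3", "99Z")]
inventory = {"12A", "12B"}
dangling, doubles = reconcile(sold, inventory)
assert dangling == ["r3"]
assert doubles == [("r1", "r2")]
```

Running this offline is the availability-for-throughput trade in miniature: sales never block on the primary DB, and the inconsistencies are cleaned up later.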

“What Really Happened On Mars”
- Sources: various posts to Risks Digest, including one from the CTO of Wind River Systems, maker of the VxWorks operating system that controls the Pathfinder
- Background concepts:
  - Threads and mutual exclusion
  - Mutexes and priority inheritance
  - Priority inversion
- How the failure was diagnosed and fixed
- Lessons

Background: threads and mutexes
- Threads: independent flows of control, typically within a single program, that usually share some data or resources
  - Can be used for performance, program structure, or both
  - Threads can have different priorities for sharing resources (CPU, memory, etc.)
  - A thread scheduler (dozens of algorithms exist) enforces the priorities
- Mutex: a lock (data structure) that enforces one-at-a-time access to a shared resource
  - In this case, a systemwide data bus
  - Usage: to exclusively access a shared resource, you acquire the mutex, do your thing, then release the mutex

Background: priority inversion
- Priority inheritance: the mutex “inherits” the priority of whichever thread currently holds it
  - So, if a high-priority thread holds it (or is waiting for it), mutex operations are given high priority
- Priority inversion can happen when this isn’t done:
  - Low-priority, infrequent thread A grabs mutex M
  - High-priority thread B needs to access data protected by M
  - Result: even though B has higher priority than A, it has to wait a long time, since A holds M
- This example is obvious; in practice, priority inversion is usually more subtle

What Really Happened
- Dramatis personae:
  - Low-priority thread A: infrequent, short-running meteorological data collection, using the bus mutex
  - High-priority thread B: bus manager, using the bus mutex
  - Medium-priority thread C: long-running communications task (that doesn’t need the mutex)
- Priority inversion scenario:
  - A is scheduled and grabs the bus mutex
  - B is scheduled and blocks waiting for A to release the mutex
  - C is scheduled while B is waiting for the mutex
  - C has higher priority than A, so it prevents A from running (and therefore B as well)
  - The watchdog timer notices B hasn’t run, concludes something is wrong, and reboots
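
The scenario above can be reproduced in a deterministic toy simulation (this is illustrative Python, not VxWorks code; the tick counts and priorities are made up): a strict-priority scheduler runs A, B, and C, with B blocked on the mutex A holds. Without priority inheritance, C starves A forever and B never runs; with it, A finishes quickly and B proceeds.

```python
def simulate(ticks, inheritance=False):
    """Toy strict-priority scheduler for the Pathfinder scenario.

    A (prio 1) holds the bus mutex and needs 2 ticks to finish.
    B (prio 3) is blocked on that mutex. C (prio 2) is always runnable.
    Returns whether B ever got to run within the watchdog's deadline."""
    holder = "A"            # A grabbed the mutex before B woke up
    remaining = {"A": 2}    # ticks A needs to finish and release
    prio = {"A": 1, "B": 3, "C": 2}
    b_ran = False
    for _ in range(ticks):
        runnable = {"A", "C"}          # B stays blocked on the mutex...
        if holder is None:
            runnable.add("B")          # ...until it is released
        eff = dict(prio)
        if inheritance and holder is not None:
            # Priority inheritance: the holder runs at B's priority.
            eff[holder] = max(eff[holder], prio["B"])
        t = max(runnable, key=lambda x: eff[x])  # highest priority wins
        if t == "A":
            remaining["A"] -= 1
            if remaining["A"] == 0:
                holder = None          # A releases the mutex
        elif t == "B":
            b_ran = True
    return b_ran

assert simulate(10, inheritance=False) is False  # B starves: watchdog fires
assert simulate(10, inheritance=True) is True    # B runs once A finishes
```

This mirrors the fix actually deployed: VxWorks mutexes had a priority-inheritance option, and enabling it on the bus mutex resolved the resets.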

Lessons They Learned
- Extensive logging facilities (a trace of system events) allowed diagnosis
- A debugging facility (the ability to execute C commands that tweak the running system, kind of like gdb) allowed the problem to be fixed
- The debugging facility was intended to be turned off prior to production use
  - A similar facility exists in certain commercial CPU designs!

Lessons We Should Learn
- Complexity has pitfalls
  - Concurrency, race conditions, and mutual exclusion are hard to debug
  - Ousterhout, Why Threads Are a Bad Idea (For Most Purposes); Savage et al., Eraser
  - C.A.R. Hoare: “The unavoidable price of reliability is simplicity.”
- It’s more important to get it extensible than to get it right
  - Even in the hardware world!
- Knowledge of end-to-end semantics can be critical
  - Whither “transparent” OS-level mechanisms?