Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures
Transcript of Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures
![Page 1: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/1.jpg)
Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures
Feng Qin, Joseph Tucek, Jagadeesan Sundaresan, Yuanyuan Zhou
Presentation by Mark Lawson
![Page 2: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/2.jpg)
Motivation
- Applications require high availability
  - Server application downtime leads to lost productivity and lost business
  - Average cost of an hour of downtime can exceed six million dollars
  - Almost every organization in today’s e-commerce world is dependent on their systems being highly available
![Page 3: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/3.jpg)
Motivation
- Software defects make up 40% of all system failures
- Programmers are aware of this and rigorously test applications before release
  - Doesn’t always help, bugs are tricky bastards
- “to achieve higher system availability, mechanisms must be devised to allow systems to survive the effects of uneliminated software bugs to the largest extent possible”
![Page 4: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/4.jpg)
Rebooting Techniques
- Idea: Restart program or parts of program (microreboot) after it crashes
- Problems:
  - Designed for hardware failures, not software
  - Deterministic software failures cannot be dealt with as they will occur every time
  - Restarting takes time
![Page 5: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/5.jpg)
General checkpointing and recovery
- Idea: Checkpoint -> Rollback upon failure -> Re-execute
- Problems:
  - Similar problems to restarting techniques, such as inability to handle deterministic bugs
![Page 6: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/6.jpg)
Application specific recovery mechanisms
- Idea: Multi-process model, each client connection is new process, kill process if it fails
- Problems:
  - Still has issues with dealing with deterministic errors
  - If shared data is the problem, killing and restarting processes will not restore it to consistent state
![Page 7: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/7.jpg)
Other methods
- Failure-oblivious computing
  - Idea: Provide artificial values for out-of-bound reads
- Reactive immune system
  - Idea: Creates emulators to run “faulty” regions of a program
- Problems:
  - Considered by authors as “unsafe” because they mask behaviors and speculate as to what the program wants to achieve
  - Immune system has large overheads
![Page 8: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/8.jpg)
Rx real-world metaphor
- Idea: Treat software bugs as real-world allergies
- In real life allergens can be dealt with by changing living environment
  - Removing cat hair from area allows me to breathe better
- Successfully removing allergen from environment allows one to determine cause of allergy
  - No cat hair = no sneezing -> allergic to cats
![Page 9: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/9.jpg)
Rx metaphor implemented
- Bugs resemble allergies
- Bugs can be dealt with by changing execution environment
- When a bug is detected, roll back to checkpoint and alter execution environment to deal with detected issues
- Least-intrusive changes can be tried first, and more drastic changes can be applied until a good execution environment is found
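The rollback-and-retry idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Rx implementation: the `recover` function, the `ENV_CHANGES` list, the option names, and the toy `handle_request` bug are all hypothetical.

```python
import copy

# Hypothetical environment changes, ordered from least to most intrusive.
ENV_CHANGES = [
    {},                      # plain re-execution, nothing changed
    {"pad_buffers": True},   # memory-management change
    {"drop_request": True},  # last resort: drop the offending request
]

def handle_request(state, request, env):
    """Toy server step with a deterministic bug: long requests overflow a buffer."""
    if env.get("drop_request"):
        return state
    capacity = 8 + (8 if env.get("pad_buffers") else 0)
    if len(request) > capacity:
        raise MemoryError("buffer overflow")
    state["log"].append(request)
    return state

def recover(state, request):
    """Retry from the checkpoint under progressively more drastic changes."""
    checkpoint = copy.deepcopy(state)
    for env in ENV_CHANGES:
        try:
            # Each attempt starts from a fresh copy of the checkpoint (the rollback).
            return handle_request(copy.deepcopy(checkpoint), request, env), env
        except MemoryError:
            continue  # this environment did not help; try the next one
    raise RuntimeError("no environment change helped; a full restart is needed")

state, env_used = recover({"log": []}, "0123456789")
print(env_used)
```

Plain re-execution fails here because the toy bug is deterministic, which is the weakness of restart and rollback noted on the earlier slides. The retry only succeeds once the environment itself changes.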
![Page 10: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/10.jpg)
The Main Idea
![Page 11: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/11.jpg)
Rx Architecture
![Page 12: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/12.jpg)
Sensors
- Dynamically monitor the application's execution to detect software failures
- Send information to control unit
- Two types of sensors
  - Sensor to monitor software errors (assertion failures, access violations)
  - Sensor to monitor software bugs (buffer overflows, access to freed memory)
![Page 13: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/13.jpg)
Checkpoint and Rollback
- CR component takes a snapshot of application and stores it in main memory
  - Stores memory and file states
- During rollback all of these states can be restored and the program can be continued from this previous checkpoint
- Multiple checkpoints can be stored in case Rx needs to roll back to an earlier checkpoint
  - Keeps enough to be “2-competitive”
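A minimal sketch of keeping several in-memory checkpoints, assuming application state can be modeled as a Python dict. The `CheckpointStore` class, its `max_checkpoints` bound, and the `memory`/`file_offset` fields are hypothetical. Rx itself snapshots process memory and file state at a much lower level than this.

```python
import copy
from collections import deque

class CheckpointStore:
    """Keeps the most recent snapshots of application state in memory."""

    def __init__(self, max_checkpoints=3):
        # Older snapshots are discarded once the bound is reached.
        self._snapshots = deque(maxlen=max_checkpoints)

    def checkpoint(self, state):
        # Deep copy so later mutations do not leak into the snapshot.
        self._snapshots.append(copy.deepcopy(state))

    def rollback(self, steps_back=0):
        """Return a copy of the snapshot taken `steps_back` checkpoints before the latest."""
        return copy.deepcopy(self._snapshots[-1 - steps_back])

store = CheckpointStore()
state = {"memory": [1, 2], "file_offset": 0}
store.checkpoint(state)

state["memory"].append(3)
state["file_offset"] = 128
store.checkpoint(state)

state["memory"].clear()                 # simulated corruption after the checkpoint
state = store.rollback()                # latest checkpoint
earlier = store.rollback(steps_back=1)  # one checkpoint further back
```

Keeping more than one snapshot matters when the fault was already present in the state captured by the latest checkpoint, so recovery has to start from an earlier one.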
![Page 14: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/14.jpg)
Execution Environment Changes
- Memory management based
  - Addresses bugs that are memory based such as buffer overflows, dangling pointers etc.
  - Ex: Padding to prevent buffer overflows, zero-filling new buffers
- Timing based
  - Addresses bugs that are related to asynchronous events like data races
  - Ex: Increasing length of scheduling time slot can avoid context switches in buggy critical sections
- User request based
  - Deals with the fact that it is impossible to test every possible user request
  - Ex: Dropping user requests during re-execution to deal with unexpected requests (LAST RESORT!)
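The memory-management changes can be illustrated with a toy allocator. This is a sketch only: `allocate`, `buggy_copy`, and the `0xAA` stand-in for uninitialized memory are hypothetical, and Python's bounds check plays the role of the crash that a real overflow might cause.

```python
def allocate(size, pad=0, zero_fill=False):
    """Return a buffer of `size` usable bytes plus `pad` extra bytes at the end."""
    fill = 0x00 if zero_fill else 0xAA  # 0xAA stands in for uninitialized garbage
    return bytearray([fill] * (size + pad))

def buggy_copy(buf, data):
    """Toy bug: writes len(data) bytes without checking the buffer size."""
    for i, byte in enumerate(data):
        buf[i] = byte  # raises IndexError past the end of the buffer

data = b"ABCDEFGHIJ"  # 10 bytes aimed at an 8-byte buffer

# Normal environment: the overflow is fatal.
try:
    buggy_copy(allocate(8), data)
    overflowed = False
except IndexError:
    overflowed = True

# Changed environment: padding absorbs the overflow, zero-fill gives clean contents.
padded = allocate(8, pad=8, zero_fill=True)
buggy_copy(padded, data)
```

The buggy code is untouched in both runs. Only the allocator's behavior differs, which is the sense in which the change is to the environment and not the program.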
![Page 15: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/15.jpg)
Environment Wrappers
- Perform environmental changes for application during re-execution
- Memory wrapper
  - Intercepts memory-related library calls, adjusts according to what control unit specifies
- Message wrapper
  - Changes message delivery environment
- Process scheduling
  - Changes process priority to deal with scheduling issues
- Signal delivery
  - Keeps track of signals in order to control when they are sent
- Dropping user requests
  - Drops requests that may be causing errors
![Page 16: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/16.jpg)
Proxy
- Handles re-execution of requests, keeping clients oblivious to crashes
- In normal mode the proxy simply relays messages between client and server, keeping track of them
- In recovery mode handles three tasks:
  - Replays requests from client since last checkpoint
  - Implements message-related environmental changes
  - Buffers client requests until server has come back from software failure
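The proxy's two modes can be sketched as a small class. Everything named here is hypothetical (`Proxy`, `begin_recovery`, `finish_recovery`, the `drop` argument), and the server is modeled as a plain callable. Dropping a request stands in for the message-related environment changes.

```python
class Proxy:
    """Relays requests to the server and remembers them for replay."""

    def __init__(self, server):
        self.server = server     # callable: request -> response
        self.log = []            # requests seen since the last checkpoint
        self.pending = []        # requests buffered during recovery
        self.recovering = False

    def on_checkpoint(self):
        self.log.clear()         # nothing before a checkpoint needs replaying

    def handle(self, request):
        if self.recovering:
            self.pending.append(request)  # buffer until the server is back
            return None
        self.log.append(request)
        return self.server(request)

    def begin_recovery(self):
        self.recovering = True

    def finish_recovery(self, drop=()):
        """Replay logged requests (minus dropped ones), then flush the buffer."""
        self.log = [r for r in self.log if r not in drop]
        responses = [self.server(r) for r in self.log]
        self.recovering = False
        buffered, self.pending = self.pending, []
        responses += [self.handle(r) for r in buffered]
        return responses

proxy = Proxy(str.upper)          # toy server: upper-cases the request
proxy.handle("a")
proxy.handle("bad")
proxy.begin_recovery()
held = proxy.handle("c")          # arrives mid-recovery, so it is buffered
responses = proxy.finish_recovery(drop={"bad"})
```

A real proxy would also have to avoid sending the client a second copy of a response it already received before the rollback. This sketch leaves that out.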
![Page 17: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/17.jpg)
Control Unit
- Controls the whole Rx system
- Performs three functions:
  - Directs CR to roll back at software failures
  - Diagnoses failures based on “symptoms” and previous knowledge of failures
  - Provides information on failures for programmers
- The control unit stores information on failures and what recoveries worked for future reference
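One way to picture the stored failure knowledge is a table mapping a symptom to the change that fixed it last time. This is a sketch under assumptions: the `FailureTable` class, the symptom string, and the change names are hypothetical and not taken from the paper.

```python
class FailureTable:
    """Remembers which environment change cured each failure symptom."""

    def __init__(self, default_order):
        self.default_order = list(default_order)  # least to most intrusive
        self.known_fixes = {}                     # symptom -> change that worked

    def candidates(self, symptom):
        """Changes to try, with a previously successful one first."""
        known = self.known_fixes.get(symptom)
        if known is None:
            return list(self.default_order)
        return [known] + [c for c in self.default_order if c != known]

    def record_success(self, symptom, change):
        self.known_fixes[symptom] = change

table = FailureTable(["reexecute", "pad_buffers", "zero_fill", "drop_request"])
first = table.candidates("segfault in copy")   # unknown symptom: default order
table.record_success("segfault in copy", "zero_fill")
second = table.candidates("segfault in copy")  # known symptom: try zero_fill first
```

Trying the known fix first means a repeat of the same failure needs fewer rollback attempts than its first occurrence did.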
![Page 18: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/18.jpg)
Design and Implementation Issues
- Inter-server communication
  - Server communication is key so that multiple servers can be rolled back to achieve system stability
- Multi-threaded process checkpointing
  - Force all threads to be at user level to ensure accurate checkpointing, since threads run simultaneously
![Page 19: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/19.jpg)
Evaluation
- Tested on 4 server applications (Apache httpd, MySQL, Squid, CVS)
![Page 20: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/20.jpg)
Overall Results
![Page 21: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/21.jpg)
Throughput and Avg Response Time
![Page 22: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/22.jpg)
Recovery Time
[Chart: Recovery time for first and subsequent bug occurrences. Y-axis: Avg Recovery Time (ms), 0 to 400. X-axis: Squid, CVS, MySQL, Apache, Squid-ui, Squid-dp. Series: First, Subsequent.]
![Page 23: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/23.jpg)
Rx Advantages
- Comprehensive
  - Can survive many common software defects
- Safe
  - Does not change program, only environment it runs in
- Noninvasive
  - Few to no modifications required in software (no mods in any of the tested systems)
- Efficient
  - No rebooting (mostly) with little overhead
  - Learns from previous solutions
- Informative
  - Bugs are shown and details are given on the nature of the bug
![Page 24: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/24.jpg)
Issues
- Unavoidable Bug/Failures
  - Accumulative memory leaks cannot be detected by Rx
  - Only solution is program restart
- Worst case scenario 2x time for normal restart
  - Did not happen in any of the tests
![Page 25: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/25.jpg)
Questions/Complaints?
![Page 26: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/26.jpg)
What do they mean by “execution environment”?
- “almost everything that is external to the target application but can affect the execution of the target application”
- 3 levels:
  - Lowest: Hardware (processor, devices)
  - Middle: OS kernel (scheduling, virtual memory management, device drivers)
  - Highest: libraries (standard, third-party)
![Page 27: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/27.jpg)
Throughput and Avg Response Time
![Page 28: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/28.jpg)
Avg Space Overhead per Checkpoint
![Page 29: Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures](https://reader035.fdocuments.in/reader035/viewer/2022081514/56815a81550346895dc7eac5/html5/thumbnails/29.jpg)
Different bug arrival rates