dBug: Systematic Evaluation of Distributed Systems
Jiří Šimša, Randy Bryant, Garth Gibson
PARALLEL DATA LABORATORY
Carnegie Mellon University
Jiri Simsa © October 10 (http://www.pdl.cmu.edu/)
Concurrency Bugs Everywhere
Why Do Concurrency Bugs Exist?
[Figure: a client interacting with servers Server 1, ..., Server i, ..., Server j, ..., Server n in numbered steps 1 to 4]
Why Do Concurrency Bugs Exist?
[Figure: two clients interacting with servers Server 1, ..., Server i, ..., Server j, ..., Server n; each client performs numbered steps 1 and 2]
Motivating Example Lessons
• Locking across RPC = bad idea
• Explosion of possible scenarios
• Corner case errors easy to miss
• Testing concurrent systems is hard:
  – Control / enumerate possible scenarios
  – Tackle state space explosion
Need For Better Testing Methods
• Hardware performance
• Software complexity
• Formal specifications impractical
• New systems rarely written from scratch
• Common testing mechanism: stress testing
  – Imprecise, falling behind
Outline
• Motivation
• dBug Design
• dBug Prototype
• Prototype Case Studies
• Ongoing & Future Work
• Conclusion
dBug Design
• Goal: Enable systematic enumeration of (all) possible execution scenarios of a test
• Repeated execution of the same test is guaranteed to explore different scenarios
• Light-weight model checking
  – Fixed initial state
  – User-provided test as a specification
  – State space of the actual implementation explored
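The enumeration goal can be illustrated with a minimal sketch. This is not dBug's implementation: the function name `enumerate_schedules` and the representation of a thread as a list of operation labels are assumptions made for illustration. The sketch walks the tree of scheduling choices depth-first and yields every interleaving that respects each thread's program order, so no scenario is visited twice.

```python
from typing import Iterator, List, Sequence, Tuple


def enumerate_schedules(threads: Sequence[Sequence[str]]) -> Iterator[Tuple[str, ...]]:
    """Yield every interleaving that preserves each thread's program order."""

    def explore(positions: List[int], prefix: List[str]) -> Iterator[Tuple[str, ...]]:
        # A thread is runnable while it still has operations left.
        runnable = [i for i, ops in enumerate(threads) if positions[i] < len(ops)]
        if not runnable:
            yield tuple(prefix)
            return
        for i in runnable:
            # Choose thread i to run next, explore, then undo the choice.
            prefix.append(threads[i][positions[i]])
            positions[i] += 1
            yield from explore(positions, prefix)
            positions[i] -= 1
            prefix.pop()

    yield from explore([0] * len(threads), [])


# Hypothetical two-thread test with two operations per thread.
schedules = list(enumerate_schedules([["a1", "a2"], ["b1", "b2"]]))
```

The number of schedules grows combinatorially with the number of threads and operations, which is the state space explosion the earlier slides mention.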
Motivating Example dBug-ed
[Figure: two clients and an arbiter interacting with servers Server 1, ..., Server i, ..., Server j, ..., Server n in numbered steps 1 to 6]
dBug Design Decisions
• Which events to control, and how?
• When to signal a request?
• How to (re)store a state of the system?
• How to explore the state space?
  – Parallel exploration
  – Exploration heuristics
  – State space reduction
Event Control Mechanism
[Figure: two software stacks, each with an Application on top of OS + Libraries; one of them is shown with a dBug interposition layer]
Compile-time Interposition
Source code annotation of:
• Creation of threads (processes)
• Destruction of threads (processes)
• Coordination primitives:
  – Thread synchronization
  – Remote procedure calls
  – “Your coordination primitive here”
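As a rough illustration of what interposing on a coordination primitive means, the sketch below wraps a lock so that every acquire and release is announced before it happens. Python has no compile-time annotation step, so a wrapper class stands in for the annotated call sites. `RecordingArbiter` and `InterposedLock` are hypothetical names, and the stand-in arbiter only records events; it does not schedule them as a real arbiter would.

```python
import threading
from typing import List, Tuple


class RecordingArbiter:
    """Hypothetical stand-in for an arbiter: it only records announced events."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []
        self._guard = threading.Lock()

    def announce(self, thread_name: str, event: str) -> None:
        with self._guard:
            self.events.append((thread_name, event))


class InterposedLock:
    """Wraps threading.Lock and announces each operation before performing it."""

    def __init__(self, arbiter: RecordingArbiter, name: str) -> None:
        self._lock = threading.Lock()
        self._arbiter = arbiter
        self._name = name

    def acquire(self) -> None:
        self._arbiter.announce(threading.current_thread().name, f"acquire({self._name})")
        self._lock.acquire()

    def release(self) -> None:
        self._arbiter.announce(threading.current_thread().name, f"release({self._name})")
        self._lock.release()

    def __enter__(self) -> "InterposedLock":
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.release()


arbiter = RecordingArbiter()
lock = InterposedLock(arbiter, "m")
with lock:
    pass  # critical section of the application under test
```

The same pattern would apply to thread creation, thread destruction, and RPC calls: the call site reports the event, then performs the original operation.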
Client-Server Architecture
[Figure: the original distributed system with Thread 1, ..., Thread n, each paired with a dBug client; dBug, with the dBug server and the dBug arbiter]
When to Signal a Request?
• Blind Mode
  – Uses a timeout
  – Pros: Easy to implement
  – Cons: Overhead, imprecise
• Informed Mode
  – Uses application idle/progress hints
  – Pros: Fast, accurate
  – Cons: Expert knowledge, annotation
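The difference between the two modes can be sketched as two ways of deciding that every thread has had a chance to report. This is an assumption-laden toy, not dBug's protocol: `RequestCollector`, `report`, `wait_blind`, and `wait_informed` are hypothetical names. Blind mode sleeps for a fixed timeout and takes whatever has arrived; informed mode returns as soon as every thread has reported a request or an idle hint.

```python
import threading
import time
from typing import Set


class RequestCollector:
    """Hypothetical collector of per-thread reports (requests or idle hints)."""

    def __init__(self, num_threads: int) -> None:
        self._num_threads = num_threads
        self._reported: Set[int] = set()
        self._cond = threading.Condition()

    def report(self, thread_id: int) -> None:
        with self._cond:
            self._reported.add(thread_id)
            self._cond.notify_all()

    def wait_blind(self, timeout_s: float) -> Set[int]:
        # Blind mode: always pay the full timeout, and possibly miss late reports.
        time.sleep(timeout_s)
        with self._cond:
            return set(self._reported)

    def wait_informed(self, give_up_after_s: float = 5.0) -> Set[int]:
        # Informed mode: return as soon as every thread has reported.
        with self._cond:
            self._cond.wait_for(
                lambda: len(self._reported) == self._num_threads,
                timeout=give_up_after_s,
            )
            return set(self._reported)


collector = RequestCollector(num_threads=2)
workers = [threading.Thread(target=collector.report, args=(i,)) for i in range(2)]
for worker in workers:
    worker.start()
seen = collector.wait_informed()
for worker in workers:
    worker.join()
```

A timeout that is too short makes blind mode imprecise, and one that is too long adds overhead to every step, which matches the trade-off listed on the slide.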
State Space Exploration
[Figure: four instances of the dBug architecture, each showing the original distributed system with Thread 1, ..., Thread n paired with dBug clients, together with the dBug server and the dBug arbiter]
Fast Array of Wimpy Nodes
• Energy-efficient architecture
• FAWN-KV = distributed key-value storage
• put()/get() interface, strong consistency
• get() returns value of the last acked put()
[Figure: FAWN-KV architecture with a KV ring, a front-end, a switch, and multiple back-ends]
Case Study 1: Multi-threading
Log-structured writes
Need for clean-up
Rewrite Operation
• sequential scan
• atomic swap
[Figure legend: obsolete data, up-to-date data]
Integrating FAWN-KV and dBug
• Creation and destruction of threads
  – 20 lines of annotations
• Acquiring and releasing locks
  – Compile-time interposition on pthread interface
• Test case:
    put(key, value1);
    if (fork() == 0) {
      rewrite();
    } else {
      put(key, value2);
      get(key);
    }
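To show why this test case is interesting, the sketch below models it on a toy log-structured store and runs it under every interleaving. The store, the two-step `scan`/`swap` decomposition of rewrite, and all names are assumptions made for illustration; this is not FAWN-KV code. The modeled rewrite is deliberately unsynchronized, so a put that lands between the scan and the swap is lost, and the check is the consistency property that get() returns the value of the last acknowledged put().

```python
from itertools import combinations
from typing import List, Optional, Tuple

Log = List[Tuple[str, str]]


def latest(log: Log, key: str) -> Optional[str]:
    """Return the most recently appended value for key, if any."""
    for k, v in reversed(log):
        if k == key:
            return v
    return None


def run_schedule(rewrite_slots: Tuple[int, ...]) -> Optional[str]:
    """Run the test with the two rewrite steps placed at the given slots (0..3)."""
    state = {"log": [("key", "value1")], "snapshot": [], "got": None}  # after put(key, value1)

    def scan() -> None:  # rewrite step 1: copy only up-to-date data
        state["snapshot"] = [("key", latest(state["log"], "key"))]

    def swap() -> None:  # rewrite step 2: replace the log with the copy
        state["log"] = list(state["snapshot"])

    def put2() -> None:
        state["log"].append(("key", "value2"))

    def get() -> None:
        state["got"] = latest(state["log"], "key")

    rewrite_steps = iter([scan, swap])
    main_steps = iter([put2, get])
    for slot in range(4):
        step = next(rewrite_steps) if slot in rewrite_slots else next(main_steps)
        step()
    return state["got"]


outcomes = {slots: run_schedule(slots) for slots in combinations(range(4), 2)}
failing = [slots for slots, got in outcomes.items() if got != "value2"]
```

In this model only the interleaving scan, put, swap, get violates the property, which is the kind of corner case that repeated stress-test runs can easily miss.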
Case Study Results
• Evaluated with the blind mode for ~24 hours
  – Over 7000 possible scenarios
  – Test always executed correctly
• Introduced and detected a data race bug
  – The bug showed up in ~700 scenarios
• Two person-weeks of work
Case Study 2: Including RPCs
[Figure: nodes labeled A, B, C, D, F, G, H]
Integrating FAWN-KV with dBug
• Creation and destruction of agents
  – 20 lines of annotations
• Issuing remote procedure calls
  – Modified Apache Thrift library (2 lines)
• Test case:
    put(key, value1);
    if (fork() == 0) {
      join();
    } else {
      if (fork() == 0)
        put(key, value2);
      else
        get(key);
    }
Case Study Results
• Evaluated with blind mode for 45 minutes
  – Total of 173 possible scenarios
• Found a bug
  – The bug showed up in only 3 scenarios
  – get(key) returns “not found”
• Two person-weeks of work
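The scenario counts from the two case studies give a feel for how rare the failing interleavings are. The short calculation below works through the fractions and, under one loudly stated assumption, estimates how many randomly scheduled runs would be needed to hit the second bug. The assumption is that each run picks one of the scenarios uniformly at random and independently; the slides do not claim this, and real stress tests are usually far from uniform. The function names are hypothetical, and the first case study's counts are approximate ("~700" of "over 7000").

```python
import math


def miss_probability(buggy: int, total: int, runs: int) -> float:
    """Chance that `runs` independent uniform picks all avoid the buggy scenarios."""
    return (1.0 - buggy / total) ** runs


def runs_for_confidence(buggy: int, total: int, confidence: float) -> int:
    """Smallest number of runs whose miss probability is at most 1 - confidence."""
    p = buggy / total
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


case1_fraction = 700 / 7000  # introduced data race: roughly 1 scenario in 10
case2_fraction = 3 / 173     # "not found" bug: under 2% of scenarios
runs_needed = runs_for_confidence(3, 173, 0.95)
```

Under that assumption, random testing needs about as many runs as there are scenarios to reach 95% confidence, and it still offers no guarantee, whereas systematic enumeration covers all 173 scenarios exactly once.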
dBug Evolution
dBug 2nd Generation
• Open source Autotools project
• dBug interposition as a shared library
• Precise and automatic detection of when to signal a request
• Educational use of dBug:
  – In use to evaluate student solutions for 15-213
  – Found bugs in the TA implementation
  – Available to students to test their solutions
Future Work
[Figure: roadmap chart with time labels PAST, PRESENT, FUTURE and column headers Interface support, State Space, Case Studies, Event Injection. Entries shown: Manual; Ad hoc; POSIX threads; Local I/O; Network I/O; UNIX processes; Parallel Exploration; FAWN-KV; PVFS; 15-213; RAIDTool; Fault Injection; Time Distortion]
Related Work
• VeriSoft [Godefroid98]
  – manual, exhaustive, multi-threaded, C and C++ sources
• MaceMC [Killian07]
  – automated, selective, distributed, Mace sources
• CHESS [Musuvathi08]
  – automated, selective, multi-threaded, Windows binaries
• MoDist [Yang09]
  – automated, selective, distributed, Windows binaries
Conclusion
• Systematic and automatic evaluation of distributed system test cases
• Open source implementation of dBug
• Experiments with:
  – Parallel Virtual File System (C)
  – FAWN-based key-value storage (C++)
  – CMU student class projects (C and C++)
  – RAIDTool (Java)
• Finding real bugs
References
• [Godefroid98] P. Godefroid: VeriSoft: A Tool for the Automatic Analysis of Concurrent Reactive Software, CAV 1997.
• [Killian07] C. Killian, J. W. Anderson, R. Jhala, A. Vahdat: Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code, NSDI 2007.
• [Musuvathi08] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, I. Neamtiu: Finding and Reproducing Heisenbugs in Concurrent Programs, OSDI 2008.
• [Yang09] J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, L. Zhou: MODIST: Transparent Model Checking of Unmodified Distributed Systems, NSDI 2009.
• [Simsa10] J. Simsa, G. Gibson, R. Bryant: dBug: Systematic Evaluation of Distributed Systems, SSV 2010.