Toward Interactive Debugging for ISP Networks
Chia-Chi Lin†, Matthew Caesar†, Jacobus Van der Merwe§
†University of Illinois at Urbana-Champaign, §AT&T Labs – Research
2
Debugging in ISP Networks
• Internet: most complex distributed system ever created
  – Leads to complex failure modes
  – Bugs, vulnerabilities, compromise, misconfigurations
• Major challenges in debugging in ISP networks
  – Lack of visibility
  – High rates of change of protocols
  – Complex interdependencies
• These could cause devastating effects
  – Long-term outages, slow repair
  – February 2009 BGP outage
3
Interactive Debugging is Necessary
• Problems exist with fully automated techniques
  – Focus on detection rather than diagnosis
  – Modeling could be inexact
  – Logical and semantic errors seem to require human knowledge to solve
• Our position:
  – Humans must be “in-the-loop”
  – Tools are required to facilitate the process
4
A Scenario
[Figure: an ISP network and a Customer network, shown alongside a Cloned Network. Annotation: “Pause when the outage occurs”.]
5
Our Vision
• Isolation of the operational network
  – Prevent diagnostic procedure from interfering with live network operation
  – Solution: virtualization technologies
• Reproducibility of network execution
  – Enable operator to replay execution, narrow in on rare events
  – Solution: instill a pseudorandom ordering over events, messages
• Interactive stepping through execution
  – Operator can slowly step through operation, trace messages
  – Solution: protocols providing tight control over distributed execution
6
The Architecture
[Figure: architecture diagram. Labels: Virtual Service Platforms; Virtual Service Coordinator; Debugging Coordinator; Virtual Service Nodes; Physical Network Node; Physical Network Infrastructure; User (human troubleshooter); Application 1: e.g., BGP; Application 2: e.g., OSPF.]
7
Key Challenge: Reproducibility
• Reproducibility simplifies interactive debugging
  – Can run multiple times, varying inputs to narrow down cause
  – When rare bug occurs, don’t need to wait for it to reoccur
• One option: generate comprehensive logs of all events
  – e.g., log all packet sends/receives, all data
  – Problem: not scalable to large networked software
• Our approach: eliminate randomness in execution
  – Starting with the same initial state will produce same execution
  – Make execution “pseudorandom” to explore different execution paths
  – Key challenge: how to eliminate randomness in large-scale software execution?
8
An Algorithm for Distributed, Reproducible Execution
• Approach:
  – Encapsulate software in virtual environment
  – Intercept software’s inputs/outputs, instill an ordering over them
  – Make sure that ordering is the same, every time software is run
• How this is done:
  – Network is run in lockstep fashion
  – On every cycle: messages from neighbors are buffered
  – Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet’s contents
  – Human sends “step” commands to move to next lockstep cycle
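The buffering and hash-based ordering described on this slide can be illustrated with a minimal Python sketch. The class `LockstepNode`, the helper `order_key`, and the choice of SHA-256 are assumptions made for illustration; the slides do not specify the hash function or the data structures used.

```python
import hashlib


def order_key(packet: bytes) -> bytes:
    """Sort key derived only from the packet's contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(packet).digest()


class LockstepNode:
    """Hypothetical node wrapper: buffers messages, releases them once per cycle."""

    def __init__(self):
        self.buffer = []     # messages received during the current cycle
        self.delivered = []  # stand-in for the application's input stream

    def receive(self, packet: bytes) -> None:
        # Messages from neighbors are buffered rather than delivered immediately.
        self.buffer.append(packet)

    def step(self) -> list:
        # On a "step" command, deliver the buffered messages in hash order,
        # so the delivery order does not depend on arrival order.
        batch = sorted(self.buffer, key=order_key)
        self.buffer = []
        self.delivered.extend(batch)
        return batch


a, b = LockstepNode(), LockstepNode()
for p in (b"x", b"y", b"z"):
    a.receive(p)
for p in (b"z", b"x", b"y"):  # same packets, different arrival order
    b.receive(p)
same_order = a.step() == b.step()
```

Because the key depends only on packet contents, two runs that buffer the same set of packets in a cycle hand them to the application in the same order.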
9
Improving Performance for the Production Network
• Problem: running application in lockstep fashion slows operation
  – Might be okay for some protocols (e.g., BGP)
  – Probably not okay for others (e.g., OSPF)
• Solution: “optimistic” execution of events
  – Choose pseudorandom ordering in advance that is likely to happen anyway
  – Don’t buffer packets, deliver them immediately
  – If we guess wrong, roll back application to earlier state
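A rough Python sketch of optimistic delivery with rollback follows. It assumes the ordering is expressed as a ranking over senders and that application state can be snapshotted cheaply; `OptimisticNode` and its fields are hypothetical names, and the toy "application" simply records the order in which senders' packets were applied.

```python
import copy


class OptimisticNode:
    """Hypothetical node: delivers immediately, rolls back on ordering violations."""

    def __init__(self, expected_order):
        # Rank of each sender in the ordering chosen in advance.
        self.rank = {s: i for i, s in enumerate(expected_order)}
        self.state = []      # toy application state: senders applied so far
        self.history = []    # (sender, snapshot of state before applying it)
        self.rollbacks = 0

    def _apply(self, sender):
        self.history.append((sender, copy.deepcopy(self.state)))
        self.state.append(sender)

    def deliver(self, sender):
        # Already-applied packets that should have come after this one.
        late = [i for i, (s, _) in enumerate(self.history)
                if self.rank[s] > self.rank[sender]]
        redo = []
        if late:
            first = late[0]
            # Guessed wrong: restore the snapshot taken before the first late packet.
            self.state = copy.deepcopy(self.history[first][1])
            redo = [s for s, _ in self.history[first:]]
            self.history = self.history[:first]
            self.rollbacks += 1
        # Re-apply in the predetermined order.
        for s in sorted([sender] + redo, key=self.rank.__getitem__):
            self._apply(s)


node = OptimisticNode(["A", "B", "C", "D"])
for sender in ["A", "C", "D", "B"]:  # "B" arrives later than the ordering predicts
    node.deliver(sender)
```

When arrivals match the predicted ordering, no snapshot is ever restored, so the cost of the scheme depends on how often the guess is wrong.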
10
Example: Running the Lockstep Algorithm in a Cloned Network
[Figure: animation of App nodes alternating between a Transmission Phase and a Processing Phase. Node messages: “I finished transmitting. I am ready to process.” and “I finished processing. I am ready to transmit.” Each App has a Sending Buffer and a Receiving Buffer; messages are labeled K, L, S, and A. Ordering shown: 1. S, 2. L, 3. K, 4. ……]
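The two-phase handshake in this example can be sketched as a simple barrier. This is only an illustration: the `PhaseCoordinator` class is a hypothetical stand-in, and the sketch omits the operator's “step” command that gates each cycle.

```python
class PhaseCoordinator:
    """Hypothetical barrier: the phase advances only after every node reports done."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.phase = "transmission"
        self.done = set()
        self.cycle = 0

    def report_done(self, node) -> bool:
        """Record that `node` finished the current phase; return True if the phase advanced."""
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node!r}")
        self.done.add(node)
        if self.done != self.nodes:
            return False  # still waiting for other nodes
        self.done = set()
        if self.phase == "transmission":
            self.phase = "processing"
        else:
            self.phase = "transmission"
            self.cycle += 1  # one full lockstep cycle completed
        return True


coord = PhaseCoordinator(["n1", "n2"])
advanced_early = coord.report_done("n1")  # n2 has not finished transmitting yet
advanced = coord.report_done("n2")        # all nodes done: move to processing
```

No node starts processing until all nodes have finished transmitting, which is what keeps the set of buffered messages per cycle well defined.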
11
Example: Live Algorithm in Production Network
The live algorithm does two things:
• Determine the ordering of events
• Roll back events violating the ordering
[Figure: backbone topology with numerically labeled links connecting Seattle, Los Angeles, Salt Lake City, Kansas City, Houston, Atlanta, New York, Washington, and Chicago. Ordering shown: 1. Seattle, 2. Los Angeles, 3. Kansas City, 4. Chicago, 5. …… Annotation: “Packets from Seattle should come before those from Los Angeles”. Packets are labeled S, K, C, and L. Callout: “Pseudorandom ordering is violated!”]
12
Connecting the Two Algorithms
• We can run the production network using the live algorithm
  – Achieves a fixed ordering over messages
  – But how to actually debug it?
• Solution: replay using the lockstep algorithm
  – First let the production network run, checkpoint starting state
  – To debug, start lockstep algorithm with same starting state
  – Lockstep algorithm will traverse the same execution
• Can replay multiple times, narrow in on problem, experiment by changing inputs, etc.
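As a toy illustration of checkpoint-and-replay, the sketch below models each operator “step” as one `next()` on a generator. The function names, the byte-string state, and the use of SHA-256 for both the ordering and the toy application are assumptions; a real deployment would checkpoint the virtualized protocol software instead.

```python
import hashlib


def apply_message(state: bytes, msg: bytes) -> bytes:
    """Toy deterministic application: folds each message into a running digest."""
    return hashlib.sha256(state + msg).digest()


def replay(checkpoint: bytes, cycles):
    """Replay from a checkpoint; each next() corresponds to one operator "step"."""
    state = checkpoint
    for index, msgs in enumerate(cycles):
        # Same content-based ordering on every run, regardless of arrival order.
        ordered = sorted(msgs, key=lambda m: hashlib.sha256(m).digest())
        for m in ordered:
            state = apply_message(state, m)
        yield index, ordered, state


checkpoint = b"starting-state"
run_1 = list(replay(checkpoint, [[b"k", b"s", b"l"], [b"a"]]))
run_2 = list(replay(checkpoint, [[b"l", b"k", b"s"], [b"a"]]))  # arrivals reordered
varied = list(replay(checkpoint, [[b"k", b"s"], [b"a"]]))      # experiment: drop one input
```

Starting from the same checkpoint, `run_1` and `run_2` traverse the same sequence of states, while `varied` shows how an operator might change inputs to see where the execution diverges.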
13
Simulation Settings
• Protocol evaluated: OSPF
• Topologies used: BRITE, Internet2 backbone
• Link delay model: 1 ms + (0, 0.5] exponentially distributed random delay
• Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events)
• Measure performance overheads of our approach
14
Results – Overhead in Production Networks
• Live algorithm suffers from rollbacks, incurring 4x inflation in traffic overhead
• Using delay-estimation optimization reduces overhead to 0.02x traffic inflation
15
Results – Response Time in Cloned Networks
• Low response time is beneficial to interactive debugging
• Response time is low for a variety of network sizes
16
Conclusion
• Humans are required to be “in-the-loop” to diagnose problems
• Our architecture is a first step towards interactive debugging
  – Builds on known techniques, e.g., virtualization technologies and distributed semaphores
  – Develops techniques to reproduce distributed executions
• Simulations on real-world events show that the scheme incurs low overheads
17
18
The State of the Art: Automated Techniques
• Logging observations
  – X-Trace, Friday, etc.
• Model checking
  – rcc, OD flow, etc.
• Debugging standalone programs
  – Coverity, AVIO, etc.
19
Optimized Ordering in the Production Network
• Goal: avoid rollbacks by selecting ordering likely to happen anyway
  – Events separated by a long period fall into different groups, which makes ordering easy
  – Problem: some failure events are correlated
    • E.g., multiple overlay links sharing same physical link
  – How to order events in same group?
• Solution: if we know link delays, we can reliably estimate expected arrival of events
  – In practice we don’t know exact link delays
  – But we can estimate them
  – Can improve estimation by giving protocol messages high priority
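The grouping and delay-based ordering idea might look roughly like the following Python sketch. The event tuples, the `gap_ms` threshold, the per-source delay table, and all numeric values are hypothetical; the slides do not say how groups are delimited or how delays are estimated.

```python
def group_events(events, gap_ms):
    """Split (name, origin_time_ms, source) events into groups separated by more than gap_ms."""
    groups = []
    for ev in sorted(events, key=lambda e: e[1]):
        if groups and ev[1] - groups[-1][-1][1] <= gap_ms:
            groups[-1].append(ev)  # close in time: same group
        else:
            groups.append([ev])    # long gap: start a new group
    return groups


def predict_order(group, est_delay_ms):
    """Order events in a group by estimated arrival time (origin time + estimated delay)."""
    ranked = sorted(group, key=lambda e: (e[1] + est_delay_ms[e[2]], e[0]))
    return [name for name, _, _ in ranked]


# Hypothetical events and delay estimates (integers in ms, illustrative only).
events = [("link-1 down", 0, "A"), ("link-2 down", 2, "B"), ("link-3 down", 500, "A")]
est_delay_ms = {"A": 10, "B": 3}

groups = group_events(events, gap_ms=50)
order = predict_order(groups[0], est_delay_ms)
```

Here the first two events land in one group; “link-2 down” is predicted to arrive first (2 + 3 = 5 ms versus 0 + 10 = 10 ms), so it is placed earlier in the ordering. The better the delay estimates, the less often the predicted order is violated and a rollback is needed.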
20
Results – Storage in Production Network
• State required for rolling back packets is small and increases slowly with network size