Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe §...

Toward Interactive Debugging for ISP Networks

Chia-Chi Lin†, Matthew Caesar†, Jacobus Van der Merwe§

†University of Illinois at Urbana-Champaign
§AT&T Labs – Research

2

Debugging in ISP Networks

• Internet: most complex distributed system ever created
  – Leads to complex failure modes
  – Bugs, vulnerabilities, compromise, misconfigurations
• Major challenges in debugging in ISP networks
  – Lack of visibility
  – High rates of change of protocols
  – Complex interdependencies
• These could cause devastating effects
  – Long-term outages, slow repair
  – February 2009 BGP outage

3

Interactive Debugging is Necessary

• Problems exist with fully automated techniques
  – Focus on detection rather than diagnosis
  – Modeling could be inexact
  – Logical and semantic errors seem to require human knowledge to solve
• Our position:
  – Humans must be “in-the-loop”
  – Tools are required to facilitate the process

4

A Scenario

[Figure: an ISP network and its customer, alongside a cloned network. Annotation: pause when the outage occurs.]

5

Our Vision

• Isolation of the operational network
  – Prevent diagnostic procedure from interfering with live network operation
  – Solution: virtualization technologies
• Reproducibility of network execution
  – Enable operator to replay execution, narrow in on rare events
  – Solution: instill a pseudorandom ordering over events, messages
• Interactive stepping through execution
  – Operator can slowly step through operation, trace messages
  – Solution: protocols providing tight control over distributed execution

6

The Architecture

[Figure: architecture diagram. Labels: Virtual Service Platforms; Virtual Service Coordinator; Debugging Coordinator; Virtual Service Nodes; Physical Network Node; Physical Network Infrastructure; User (human troubleshooter); Application 1: e.g. BGP; Application 2: e.g. OSPF.]

7

Key Challenge: Reproducibility

• Reproducibility simplifies interactive debugging
  – Can run multiple times, varying inputs to narrow down cause
  – When rare bug occurs, don’t need to wait for it to reoccur
• One option: generate comprehensive logs of all events
  – e.g., log all packet sends/receives, all data
  – Problem: not scalable to large networked software
• Our approach: eliminate randomness in execution
  – Starting with the same initial state will produce same execution
  – Make execution “pseudorandom” to explore different execution paths
  – Key challenge: how to eliminate randomness in large-scale software execution?

8

An Algorithm for Distributed, Reproducible Execution

• Approach:
  – Encapsulate software in virtual environment
  – Intercept software’s inputs/outputs, instill an ordering over them
  – Make sure that ordering is the same, every time software is run
• How this is done:
  – Network is run in lockstep fashion
  – On every cycle: messages from neighbors are buffered
  – Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet’s contents
  – Human sends “step” commands to move to next lockstep cycle
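The buffering and hash-ordering steps above can be illustrated with a minimal single-node sketch. It is not the authors' implementation: the class `LockstepNode`, the helper `run_cycle`, the packet strings, and the choice of SHA-256 as the consistent hash are all assumptions made for illustration.

```python
import hashlib
import random


def order_key(packet: bytes) -> str:
    """Consistent hash of the packet contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(packet).hexdigest()


class LockstepNode:
    """Hypothetical node that buffers packets and releases them once per cycle."""

    def __init__(self, name):
        self.name = name
        self.receive_buffer = []
        self.delivered = []  # stands in for the application's input log

    def receive(self, packet: bytes):
        # During a cycle, packets from neighbors are only buffered.
        self.receive_buffer.append(packet)

    def step(self):
        """One lockstep cycle: deliver buffered packets in hash order."""
        for packet in sorted(self.receive_buffer, key=order_key):
            self.delivered.append(packet)
        self.receive_buffer.clear()


def run_cycle(packets, seed):
    """Feed the same packets in a seed-dependent arrival order, then step once."""
    node = LockstepNode("K")
    arrivals = list(packets)
    random.Random(seed).shuffle(arrivals)  # simulate nondeterministic arrival
    for packet in arrivals:
        node.receive(packet)
    node.step()  # corresponds to the operator's "step" command
    return node.delivered


packets = [b"update from S", b"update from L", b"update from A"]
first_run = run_cycle(packets, seed=1)
second_run = run_cycle(packets, seed=2)
```

Because the delivery order depends only on packet contents, `first_run` and `second_run` are identical even though the arrival orders may differ.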

9

Improving Performance for the Production Network

• Problem: running application in lockstep fashion slows operation
  – Might be okay for some protocols (e.g., BGP)
  – Probably not okay for others (e.g., OSPF)
• Solution: “optimistic” execution of events
  – Choose pseudorandom ordering in advance that is likely to happen anyway
  – Don’t buffer packets, deliver them immediately
  – If we guess wrong, roll back application to earlier state
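The optimistic scheme above might be sketched as follows. This is an illustrative toy, not the authors' mechanism: the class `OptimisticNode`, the per-event snapshot strategy, and the use of city names as event sources are assumptions, and the application state is reduced to a list of applied events.

```python
import copy


class OptimisticNode:
    """Hypothetical node: delivers events immediately, rolls back on a wrong guess."""

    def __init__(self, planned_order):
        # rank[source] = position of that source in the ordering chosen in advance
        self.rank = {src: i for i, src in enumerate(planned_order)}
        self.state = []      # toy application state: the events applied so far
        self.log = []        # (source, snapshot of the state before applying it)
        self.rollbacks = 0

    def _apply(self, source):
        self.log.append((source, copy.deepcopy(self.state)))
        self.state.append(source)

    def deliver(self, source):
        # Already-applied events that the planned ordering places after this one.
        late = [i for i, (s, _) in enumerate(self.log)
                if self.rank[s] > self.rank[source]]
        if not late:
            self._apply(source)  # guess was right: no buffering, no delay
            return
        # Guess was wrong: restore the earlier snapshot, then redo in planned order.
        first = late[0]
        redo = [s for s, _ in self.log[first:]]
        self.state = self.log[first][1]
        del self.log[first:]
        self.rollbacks += 1
        for s in sorted(redo + [source], key=self.rank.__getitem__):
            self._apply(s)


node = OptimisticNode(["Seattle", "Los Angeles", "Kansas City", "Chicago"])
for src in ["Seattle", "Kansas City", "Chicago", "Los Angeles"]:
    node.deliver(src)
```

In this run the Los Angeles event arrives last, so the node rolls back once and reapplies the Kansas City and Chicago events after it. A real system would need to undo messages already sent as well, which this sketch does not model.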

10

Example: Running the Lockstep Algorithm in a Cloned Network

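[Figure: applications (App) on several nodes alternate between a Transmission Phase and a Processing Phase. Between phases, each node signals “I finished transmitting. I am ready to process.” or “I finished processing. I am ready to transmit.” Packets labeled S, L, K, and A move from a Sending Buffer to a Receiving Buffer and are delivered in the pseudorandom ordering: 1. S 2. L 3. K 4. ……]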

11

Example: Live Algorithm in Production Network

[Figure: network map with numeric link labels. Nodes: Seattle, Los Angeles, Salt Lake City, Kansas City, Houston, Atlanta, New York, Washington, Chicago. Packets labeled S, L, K, and C are shown in transit and in arrival queues.]

The live algorithm does two things:
• Determine the ordering of events
• Roll back events violating the ordering

Pseudorandom ordering: 1. Seattle 2. Los Angeles 3. Kansas City 4. Chicago 5. ……

Packets from Seattle should come before those from Los Angeles.

Pseudorandom ordering is violated!

12

Connecting the Two Algorithms

• We can run the production network using the live algorithm
  – Achieves a fixed ordering over messages
  – But how to actually debug it?
• Solution: replay using the lockstep algorithm
  – First let the production network run, checkpoint starting state
  – To debug, start lockstep algorithm with same starting state
  – Lockstep algorithm will traverse the same execution
• Can replay multiple times, narrow in on problem, experiment by changing inputs, etc.
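The checkpoint-and-replay idea can be sketched in a few lines. This is a simplification under stated assumptions: the functions `hash_order`, `apply_cycle`, and `replay` are hypothetical, the application is a toy log, and the messages of each cycle are supplied as inputs rather than regenerated by the replayed network.

```python
import copy
import hashlib


def hash_order(messages):
    """Fixed ordering shared by both runs (SHA-256 is an assumed choice)."""
    return sorted(messages, key=lambda m: hashlib.sha256(m.encode()).hexdigest())


def apply_cycle(state, messages):
    """Toy application: record messages in the fixed order and bump a version."""
    for m in hash_order(messages):
        state["log"].append(m)
        state["version"] += 1
    return state


def replay(checkpoint, cycles):
    """Debug run: restart from the checkpoint and yield the state after each step."""
    state = copy.deepcopy(checkpoint)
    for msgs in cycles:
        # Arrival order is deliberately reversed; the fixed ordering hides this.
        state = apply_cycle(state, list(reversed(msgs)))
        yield copy.deepcopy(state)  # the operator can inspect every step


# Production run: checkpoint the starting state, then let the network run.
start = {"log": [], "version": 0}
checkpoint = copy.deepcopy(start)
cycles = [["withdraw p1", "announce p2"], ["announce p3"]]
live = start
for msgs in cycles:
    live = apply_cycle(live, msgs)

steps = list(replay(checkpoint, cycles))
```

The last replayed state in `steps` equals `live`, and because `replay` never mutates the checkpoint it can be called again with modified inputs to experiment.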

13

Simulation Settings

• Protocol evaluated: OSPF
• Topologies used: BRITE, Internet2 backbone
• Link delay model: 1 ms + (0, 0.5] exponentially distributed random delay
• Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events)
• Measure performance overheads of our approach
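One possible reading of the link delay model above is a fixed 1 ms plus an exponential delay restricted to (0, 0.5] ms. The sketch below samples from that reading; the function name `sample_link_delay_ms`, the mean of 0.25 ms, and the use of rejection sampling to enforce the range are assumptions, since the slide gives neither the rate nor the truncation method.

```python
import random


def sample_link_delay_ms(rng, mean_ms=0.25, base_ms=1.0, cap_ms=0.5):
    """Return base_ms plus an exponential delay restricted to (0, cap_ms].

    mean_ms is a hypothetical parameter; the slide does not state the rate.
    """
    while True:
        extra = rng.expovariate(1.0 / mean_ms)
        if 0.0 < extra <= cap_ms:  # rejection sampling keeps only in-range draws
            return base_ms + extra


rng = random.Random(2009)  # fixed seed so the simulated delays are repeatable
delays = [sample_link_delay_ms(rng) for _ in range(1000)]
```

Seeding the generator keeps the simulated delays repeatable across runs, which fits the reproducibility theme of the talk.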

14

Results – Overhead in Production Networks

• Live algorithm suffers from rollbacks, incurring 4x inflation in traffic overhead

• Using delay-estimation optimization reduces overhead to 0.02x traffic inflation

15

Results – Response Time in Cloned Networks

• Low response time is beneficial to interactive debugging

• Response time is low for a variety of network sizes

16

Conclusion

• Humans are required to be “in-the-loop” to diagnose problems

• Our architecture is a first step towards interactive debugging
  – Builds on known techniques, e.g., virtualization technologies and distributed semaphores
  – Develops techniques to reproduce distributed executions
• Simulations on real-world events show that the scheme incurs low overheads

18

The State of the Art: Automated Techniques

• Logging observations
  – X-Trace, Friday, etc.
• Model checking
  – rcc, OD flow, etc.
• Debugging standalone programs
  – Coverity, AVIO, etc.

19

Optimized Ordering in the Production Network

• Goal: avoid rollbacks by selecting ordering likely to happen anyway
  – Events separated by long period will fall into different groups which means ordering is easy
  – Problem: some failure events are correlated
    • E.g., multiple overlay links sharing same physical link
  – How to order events in same group?
• Solution: if we know link delays, we can reliably estimate expected arrival of events
  – In practice we don’t know exact link delays
  – But we can estimate them
  – Can improve estimation by giving protocol messages high priority
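The delay-estimation idea above might look like the following sketch: compute the expected arrival time of each event at a receiver from estimated link delays, then use that as the planned ordering. The functions `shortest_delays` and `planned_order`, the symmetric-link assumption, the tie-break by origin name, and every delay value in `links` are hypothetical and not taken from the slides.

```python
import heapq


def shortest_delays(links, source):
    """Dijkstra over estimated link delays in ms."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, delay in links.get(node, {}).items():
            nd = d + delay
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def planned_order(links, events, receiver):
    """Order (origin, send_time_ms) events by expected arrival at the receiver.

    Links are assumed symmetric, so delays from the receiver equal delays to it.
    Ties are broken by origin name to keep the ordering deterministic.
    """
    delays = shortest_delays(links, receiver)
    return sorted(events, key=lambda e: (e[1] + delays[e[0]], e[0]))


# Hypothetical estimated delays (ms); not the values from the figure.
links = {
    "Seattle": {"Salt Lake City": 6.0, "Los Angeles": 14.0},
    "Los Angeles": {"Seattle": 14.0, "Salt Lake City": 8.0},
    "Salt Lake City": {"Seattle": 6.0, "Los Angeles": 8.0, "Kansas City": 13.0},
    "Kansas City": {"Salt Lake City": 13.0},
}
events = [("Los Angeles", 0.0), ("Seattle", 0.0)]
order = planned_order(links, events, "Kansas City")
```

With these made-up delays the Seattle event is expected at Kansas City after 19 ms and the Los Angeles event after 21 ms, so Seattle is placed first. If the estimates are close to the real delays, events tend to arrive in the planned order and rollbacks are rare.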

20

Results – Storage in Production Network

• State required for rolling back packets is small and increases slowly with network size
