Network Tomography for Fault Diagnosis Renata Teixeira LIP6 Computer Laboratory CNRS and UPMC Paris...

Network Tomography for Fault Diagnosis

Renata Teixeira

LIP6 Computer Laboratory

CNRS and UPMC Paris Universitas

2

The Internet is great, but problems happen

[Figure: the LIP6 network and networks Net1, Net2, and Net3]

How to automatically detect and identify problems?

Is my connection ok?

Is it google?

Is the problem in one of the networks in path?

3

Current alarms are not enough

Network equipment already raises many alarms

– SNMP traps

– Anomaly detection systems

But alarms may not reflect the user’s experience

– Hard to map users’ complaints to alarms

– The user’s problem may not appear as an alarm

Network admins often resort to active measurements

– Active monitoring servers inside their network

– Subscribe to third-party monitoring services

• E.g., Keynote or RIPE TTM

4

End-hosts can collaborate to troubleshoot problems

[Figure: the LIP6 network and networks Net1, Net2, and Net3]

Detection: continuous path monitoring

Identification: tomography

5

End-host troubleshooting in two different contexts

Network admins deploy monitoring services

– Verify the performance of their networks

– Assist in troubleshooting

End-users can collaborate

– Identify and bypass problems

– Rank providers

6

Detection techniques

For network admins

– Deploy dedicated monitors

– Need to inject probes to measure paths

For end-users

– Monitoring at end-users’ machines

– Tapping users’ traffic is promising

Challenge: cannot continuously overload the network or the end-user’s machine to detect faults

Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis

with

Hung X. Nguyen (Univ. of Adelaide)

Patrick Thiran (EPFL)

Christophe Diot (Thomson)

8

Active monitoring system to detect faults

[Figure: monitors M1 and M2 probe target hosts T1, T2, and T3 across a target network with nodes A, B, C, and D]

Goal: detect failures of any of the interfaces in the subscriber’s network with minimum probing overhead

9

Simple solution: Coverage problem

[Figure: same topology, with monitors M1 and M2, targets T1, T2, and T3, and nodes A, B, C, and D]

Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network

Coverage problem is NP-hard

– Solution: greedy set-cover heuristic
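The greedy set-cover heuristic can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_path_cover`, the path names, and the interface labels are all hypothetical, and interfaces are represented simply by the label of the node they belong to.

```python
def greedy_path_cover(paths):
    """Pick paths until every interface is covered (greedy set cover).

    paths: dict mapping a path name to the set of interfaces it traverses.
    Returns the list of selected path names, in selection order.
    """
    uncovered = set().union(*paths.values())
    selected = []
    while uncovered:
        # Choose the path that covers the most still-uncovered interfaces;
        # sorting the names first makes tie-breaking deterministic.
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        selected.append(best)
        uncovered -= paths[best]
    return selected


# Hypothetical paths between monitors (M1, M2) and targets (T1..T3).
paths = {
    "M1-T1": {"A", "B"},
    "M1-T2": {"A", "C"},
    "M2-T2": {"D", "C"},
    "M2-T3": {"D", "B"},
}
selected = greedy_path_cover(paths)
```

In this toy input, two of the four paths are enough to cover all four interfaces. The greedy choice gives no guarantee of a minimum cover in general, only an approximation.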

10

Coverage solution doesn’t detect all types of failures

Detects fail-stop failures

– Failures that affect all packets that traverse the faulty interface

• E.g., interface or router crashes, fiber cuts, bugs

But not path-specific failures

– Failures that affect only a subset of paths that cross the faulty interface

• E.g., router misconfigurations

11

New formulation of failure detection problem

Select the frequency to probe each path

– Lower-frequency per-path probing can still achieve high-frequency probing of each interface

[Figure: same topology, annotated with probing rates “1 every 9 mins” and “1 every 3 mins”]
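A small worked example of how per-path rates add up at a shared interface. It assumes three paths cross the same interface (an assumption chosen to match the 9-minute and 3-minute labels), and all names are hypothetical.

```python
from fractions import Fraction


def interface_probe_rates(path_rates, paths):
    """Sum the probing rates of all paths that cross each interface.

    path_rates: dict path name -> probes per minute.
    paths: dict path name -> set of interfaces on that path.
    """
    rates = {}
    for path, interfaces in paths.items():
        for iface in interfaces:
            rates[iface] = rates.get(iface, Fraction(0)) + path_rates[path]
    return rates


# Three hypothetical paths that all cross the interface "shared".
paths = {
    "p1": {"shared", "a"},
    "p2": {"shared", "b"},
    "p3": {"shared", "c"},
}
path_rates = {p: Fraction(1, 9) for p in paths}  # 1 probe every 9 minutes
rates = interface_probe_rates(path_rates, paths)
shared_interval = 1 / rates["shared"]  # average minutes between probes
```

The shared interface sees one probe every 3 minutes on average. For the gap between probes never to exceed 3 minutes, the probes on the three paths would have to be evenly staggered.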

12

Properties of solution

Failure detection problem is no longer NP-hard

– Can find optimal solution using linear programming

– Parameters: Duration of path-specific and fail-stop failures

Needs synchronization among monitors

– Monitors need to collaborate to probe an interface

– Alternative probabilistic solution avoids synchronization overhead

Probing cost scales almost linearly with the size of the target network

– In random power-law graphs like inferred internet graphs
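One plausible way to write such a linear program is sketched below. The notation is assumed here and is not taken from the slides: $\lambda_p$ is the probing rate of path $p$, $P$ is the set of candidate paths, $I$ is the set of interfaces in the target network, $P_i \subseteq P$ is the set of paths that cross interface $i$, and $T_{fs}$ and $T_{ps}$ are the durations of the fail-stop and path-specific failures to be detected.

```latex
\begin{align}
\min_{\lambda} \quad & \sum_{p \in P} \lambda_p \\
\text{s.t.} \quad & \sum_{p \in P_i} \lambda_p \ge \frac{1}{T_{fs}} && \forall i \in I \\
& \lambda_p \ge \frac{1}{T_{ps}} && \forall p \in P
\end{align}
```

Under this reading, the first constraint makes sure every interface is probed often enough to catch a fail-stop failure, and the second makes sure every path is probed often enough to catch a path-specific one. The exact formulation used by the authors may differ.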

13

Evaluation

Paths obtained using traceroutes

– From 750 PlanetLab nodes to 3,000 DNS servers

– From 12 RON nodes to 60,000 targets

Target networks are probed ASes

– Map IPs to ASes using Mao et al.’s technique

– 1,366 ASes in PlanetLab

– 6,517 ASes in RON

Compute probing costs varying parameters

– Set of paths, failure durations, target network

14

Probing costs varying size of subscriber network in PlanetLab

[Plot parameters: failure duration, path-specific = 1000 sec, fail-stop = 1 sec]

15

Summary

Practical formulation of failure detection problem

– Incorporates both fail-stop and path-specific failures

Solution minimizes probing cost

– Using linear programming

Inferred internet graphs are among the most expensive to probe

– Probing scales almost linearly with network size

Next step

– Deploy a system based on these probing techniques

ConnectionWatch: Passive monitoring of round-trip times at end-hosts

with

Diana Zeaiter Joumblatt (LIP6)

Nina Taft (Intel)

17

Goal

Automatic detection of performance degradations

– Only care about problems that impact applications

– Focus on detecting “large” round-trip times (RTT)

– Detection should be fast and lightweight

18

ConnectionWatch

[Figure: ConnectionWatch architecture. Components: Sniffer, Extract flow ID, RTT estimation, High RTT detector, Ping Daemon, Packet Trace, Upload to central server. Labels: TCP packets, Flow statistics, Alarms]

19

Insights from preliminary experiments

Datasets from five students during three days

– 44,715 TCP connections over 3,584 paths to 2,242 IPs

Some observations

– More complete measurements than ping

• 16.5% of 1,072 addresses don’t reply to pings

– Transfer of traces to server is main bottleneck

Hurdles

– Portability of system to other OSes

– Privacy concerns with capturing user’s traffic

– Incentives for large-scale deployment

20

Which RTT variations correspond to performance degradations?

Our datasets are still too small to answer

– Performance degradations are rare events

Simple technique based on outlier threshold

– What is a good threshold?

– Should the threshold be set for all users, per user, per path, or per app?

Do outliers correspond to real performance degradations?

– ConnectionWatch should get the user’s feedback

• “I’m annoyed” button
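A minimal sketch of an outlier-threshold detector, for illustration only. The slides leave the choice of threshold open, so the rule used here (median plus `k` times the median absolute deviation of a per-path baseline) is an assumption, as are the function names and the sample RTT values.

```python
import statistics


def rtt_threshold(baseline_rtts_ms, k=3.0):
    """Threshold = median + k * MAD of baseline RTTs (an assumed rule)."""
    med = statistics.median(baseline_rtts_ms)
    mad = statistics.median([abs(x - med) for x in baseline_rtts_ms])
    return med + k * mad


def detect_high_rtts(rtts_ms, threshold_ms):
    """Return (index, rtt) for every sample strictly above the threshold."""
    return [(i, r) for i, r in enumerate(rtts_ms) if r > threshold_ms]


# Hypothetical RTT samples (milliseconds) for one path.
baseline = [20, 22, 21, 23, 19, 20, 22]
threshold = rtt_threshold(baseline)
alarms = detect_high_rtts([21, 25, 180, 22], threshold)
```

Computing the baseline per path is only one of the options listed above. The same functions could be fed per-user or per-application samples instead.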

Practical issues with using network tomography for fault diagnosis

with

Italo Scota Cunha (LIP6, Thomson)

Amogh Dhamdhere, Yiyi Huang, Nick Feamster, Constantine Dovrolis (Georgia Tech)

Christophe Diot (Thomson)

22

The binary tomography solution by Duffield

Given

– Complete network topology

– End-to-end reachability measurements

Find the smallest set of links that explains the observations

– Assumes single-source tree, access to targets

[Figure: tree from monitor m to targets t1 and t2]

23

Extending binary tomography

Multi-network setting: topology not known

– Periodic traceroutes determine topologies

Extension to multiple sources, multiple targets

– Minimum hitting set problem (NP-hard)

Tomo: Iterative poly-time greedy heuristic

– Intuition: Iteratively choose the link that explains the max number of failures
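The intuition can be sketched as a greedy hitting-set loop. This is an illustrative sketch and not the authors' implementation of Tomo: the function name, the link labels, and the rule that links on working paths are treated as healthy are assumptions.

```python
def greedy_failed_links(failed_paths, working_paths):
    """Greedily pick links that explain the most unexplained failed paths.

    failed_paths, working_paths: dict path name -> list of links on the path.
    Returns the list of blamed links, in selection order.
    """
    # Assumption: any link on a working path is healthy.
    healthy = set().union(*working_paths.values())
    candidates = {p: set(links) - healthy for p, links in failed_paths.items()}
    # Failed paths whose links are all exonerated cannot be explained here.
    unexplained = {p for p, links in candidates.items() if links}
    blamed = []
    while unexplained:
        counts = {}
        for p in unexplained:
            for link in candidates[p]:
                counts[link] = counts.get(link, 0) + 1
        # Pick the link shared by the most unexplained failed paths;
        # sorting first makes tie-breaking deterministic.
        best = max(sorted(counts), key=counts.get)
        blamed.append(best)
        unexplained = {p for p in unexplained if best not in candidates[p]}
    return blamed


# Hypothetical measurements: both paths to T1 fail, the others work.
failed = {
    "M1-T1": ["M1-A", "A-B", "B-T1"],
    "M2-T1": ["M2-D", "D-B", "B-T1"],
}
working = {
    "M1-T2": ["M1-A", "A-C", "C-T2"],
    "M2-T3": ["M2-D", "D-T3"],
}
blamed = greedy_failed_links(failed, working)
```

Here a single link shared by both failed paths explains all observations, so it is the only one blamed.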

24

Some problems

Dynamics

– Loss can be transient, topology can change

Ambiguity

– Losses are one-way, but we don’t always have access to both ends of the path

Lack of synchronization

– Different monitors see different conditions

25

Approach

Transient packet loss

– Triggered confirmation of failed paths

Dynamic routing

– Periodic snapshots of the network topology

One-way losses

– Algorithm based on IP spoofing

Lack of synchronization

– Correlation of probes from different monitors

26

Failure confirmation

[Figure: timeline of packets on a path, showing a loss burst and a false positive]

Upon detection of a failure, trigger extra probes

Number of probes

– Confirm failures with a target false positive rate

– Assume independence and a given loss rate

Time between probes

– Reduce chance that probes fall on the same loss burst

– Assume link losses follow a Gilbert process
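Under the independence assumption, the number of probes follows from simple arithmetic: a working path with loss rate p loses n consecutive probes with probability p^n. The sketch below finds the smallest such n for a target false positive rate. The function name and the example values are hypothetical.

```python
def probes_needed(loss_rate, target_fp_rate):
    """Smallest n with loss_rate**n <= target_fp_rate.

    Assumes probe losses are independent, so a working path loses
    n consecutive probes with probability loss_rate**n.
    """
    if not (0 < loss_rate < 1 and 0 < target_fp_rate < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    n = 1
    p_all_lost = loss_rate
    while p_all_lost > target_fp_rate:
        n += 1
        p_all_lost *= loss_rate
    return n


# Hypothetical example: 10% loss, at most 0.5% false positives.
n = probes_needed(0.1, 0.005)
```

Independence only holds if probes are spaced far enough apart to fall outside a single loss burst, which is why the time between probes matters as much as their number.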

27

Disambiguating one-way losses: Spoofing

Monitor sends request to spoofer to send probe

Probe carries the monitor’s IP address as its source address

If reply reaches the monitor, reverse path is working

[Figure: monitor M, spoofer, and target T. Spoofer: send spoofed packet with source address of M]
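The resulting decision logic can be sketched as below. It covers only the classification step, not packet sending, and it assumes the path from the spoofer to T is working. The function name and the returned labels are hypothetical.

```python
def classify_path(direct_reply_received, spoofed_reply_received):
    """Classify the M <-> T path from two probe outcomes.

    direct_reply_received: M probed T directly and got a reply.
    spoofed_reply_received: the spoofer probed T with M's source
        address and T's reply reached M.
    Assumes the spoofer -> T path itself is working.
    """
    if direct_reply_received:
        return "path working"
    if spoofed_reply_received:
        # T's reply reached M, so the reverse path works and the
        # loss must be on the forward path.
        return "forward failure (M -> T)"
    # No reply either way: the reverse path is failing; the forward
    # path cannot be judged from these two probes alone.
    return "reverse failure (T -> M), forward direction unknown"


result = classify_path(direct_reply_received=False, spoofed_reply_received=True)
```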

28

Evaluation

Evaluation is challenging

– Need ground truth and realistic environment

Controlled experiments on the VINI testbed

– Allow us to inject failures

– Problem: hard to argue about false positives

Experiments on Emulab

– More control: dedicated nodes and links

– Emulate the Abilene network

– Selected LA and NY as monitors

29

Failure confirmation reduces false positives

Emulab experiment setup

– 10% loss rates in each direction

– No persistent failures

Both schemes use three probes to confirm a failure

False positive rate by confirmation interval and burst factor:

Confirmation interval    Burst factor 90%    Burst factor 96%
Back-to-back             15%                 25%
0.2 secs                 0.8%                0.8%

Low false positives with a 0.2-sec interval, because it guarantees a small probability of probes being correlated

30

Correlation is important to get a consistent view

Emulab and VINI experiments with short failures

– More false positives

– Lower detection rate

In real deployments, can we get a consistent view?

– More noise because of losses and routing dynamics

– Monitors are less synchronized

– Monitors may not be able to reach the coordinator

Next steps

– Online correlation

– Minimize communication with coordinator

31

Summary

Continuous monitoring for detection

– At management hosts: active measurements

• Reduce probing overhead, still detect failures

– At end-users: passive measurements

• Lightweight detection of problems that affect apps

Network tomography for identification

– Many challenges to get consistent inputs for tomography

• Network dynamics and transient losses

• Ambiguity of forward and reverse failures

• Monitors may observe different conditions
