
Network Tomography for Fault Diagnosis

Renata Teixeira
LIP6 Computer Laboratory

CNRS and UPMC Paris Universitas


The Internet is great, but problems happen

[Figure: an end host in the LIP6 network reaches a remote service across networks Net1, Net2, and Net3]

How to automatically detect and identify problems?

Is my connection ok?

Is it Google?

Is the problem in one of the networks on the path?


Current alarms are not enough

Network equipment already has many alarms

– SNMP traps

– Anomaly detection systems

But alarms may not reflect the user's experience

– Hard to map users’ complaints to alarms

– The user’s problem may not appear as an alarm

Network admins often resort to active measurements

– Active monitoring servers inside their network

– Subscribe to third-party monitoring services

• E.g., Keynote or RIPE TTM


End-hosts can collaborate to troubleshoot problems

[Figure: end hosts in the LIP6 network and across Net1, Net2, and Net3 cooperate to monitor paths]

Detection: continuous path monitoring

Identification: tomography


End-host troubleshooting in two different contexts

Network admins deploy monitoring services

– Verify the performance of their networks

– Assist in troubleshooting

End-users can collaborate

– Identify and bypass problems

– Rank providers


Detection techniques

For network admins

– Deploy dedicated monitors

– Need to inject probes to measure paths

For end-users

– Monitoring at the end-user's machine

– Tapping users' traffic is promising

Challenge: cannot continuously overload the network or the end-user's machine to detect faults

Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis

with Hung X. Nguyen (Univ. of Adelaide), Patrick Thiran (EPFL), and Christophe Diot (Thomson)


Active monitoring system to detect faults

[Figure: monitors M1 and M2 probe target hosts T1, T2, and T3 across interfaces A, B, C, and D of the target network]

Goal: detect failures of any of the interfaces in the subscriber's network with minimum probing overhead


Simple solution: Coverage problem

[Figure: the same topology; a subset of the paths from M1 and M2 suffices to cover interfaces A, B, C, and D]

Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network

The coverage problem is NP-hard

– Solution: greedy set-cover heuristic (see the sketch below)
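To make the heuristic concrete, here is a minimal sketch of greedy set cover applied to path selection: repeatedly pick the path that covers the most not-yet-covered interfaces. The topology, path IDs, and interface names are illustrative, not taken from the talk's evaluation.

```python
# Greedy set-cover heuristic: pick the path covering the most
# still-uncovered interfaces until every interface is covered.
def greedy_path_cover(paths):
    """paths: dict mapping path id -> set of interfaces it traverses."""
    uncovered = set().union(*paths.values())
    selected = []
    while uncovered:
        # Path that covers the most interfaces not yet covered
        best = max(paths, key=lambda p: len(paths[p] & uncovered))
        if not paths[best] & uncovered:
            break  # remaining interfaces lie on no measurable path
        selected.append(best)
        uncovered -= paths[best]
    return selected

# Illustrative topology: monitors M1, M2; targets T1-T3; interfaces A-D
paths = {
    ("M1", "T1"): {"A", "B"},
    ("M1", "T2"): {"A", "C", "D"},
    ("M2", "T3"): {"B", "D"},
}
print(greedy_path_cover(paths))  # [('M1', 'T2'), ('M1', 'T1')]
```

The greedy heuristic carries the classic logarithmic approximation guarantee for set cover, which is why it is the standard choice for this NP-hard selection problem.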


Coverage solution doesn’t detect all types of failures

Detects fail-stop failures

– Failures that affect all packets that traverse the faulty interface

• E.g., interface or router crashes, fiber cuts, bugs

But not path-specific failures

– Failures that affect only a subset of the paths that cross the faulty interface

• E.g., router misconfigurations


New formulation of failure detection problem

Select the frequency at which to probe each path

– Lower-frequency per-path probing can achieve high-frequency probing of each interface

[Figure: the same topology; each path is probed once every 9 minutes, and together the staggered probes cover a shared interface once every 3 minutes]


Properties of the solution

The failure detection problem is no longer NP-hard

– Can find the optimal solution using linear programming (a sketch follows this list)

– Parameters: durations of path-specific and fail-stop failures

Needs synchronization among monitors

– Monitors need to collaborate to probe an interface

– An alternative probabilistic solution avoids the synchronization overhead

Probing cost scales almost linearly with the size of the target network

– In random power-law graphs, like inferred Internet graphs
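To make the LP concrete, here is a minimal sketch that picks per-path probing rates so that every interface is probed at least once per assumed failure duration, echoing the intuition above that several low-frequency paths can jointly probe a shared interface at high frequency. The topology, durations, and unit cost per probe are assumptions for the example, not the paper's exact model.

```python
# LP sketch: choose per-path probing rates f_p (probes/sec) so that each
# interface is probed at least once per its assumed failure duration,
# while minimizing the total probing rate.
from scipy.optimize import linprog

paths = {  # path -> interfaces it traverses (illustrative)
    "p1": {"A", "B"},
    "p2": {"A", "C"},
    "p3": {"B", "D"},
}
duration = {"A": 1.0, "B": 180.0, "C": 1000.0, "D": 1000.0}  # seconds

path_ids = sorted(paths)
interfaces = sorted(duration)

# Constraint per interface i: sum of f_p over paths crossing i >= 1/d_i,
# rewritten as -sum f_p <= -1/d_i for linprog's A_ub @ x <= b_ub form.
A_ub = [[-1.0 if i in paths[p] else 0.0 for p in path_ids]
        for i in interfaces]
b_ub = [-1.0 / duration[i] for i in interfaces]
c = [1.0] * len(path_ids)  # minimize total probing rate

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(path_ids))
for p, f in zip(path_ids, res.x):
    print(f"{p}: one probe every {1 / f:.0f} s" if f > 1e-9 else f"{p}: unused")
```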


Evaluation

Paths obtained using traceroutes

– From 750 PlanetLab nodes to 3,000 DNS servers

– From 12 RON nodes to 60,000 targets

Target networks are the probed ASes

– Map IPs to ASes using Mao et al.’s technique

– 1,366 ASes in PlanetLab

– 6,517 ASes in RON

Compute probing costs varying parameters

– Set of paths, failure durations, target network


Probing costs varying size of subscriber network in PlanetLab

[Figure: probing cost vs. size of the subscriber network; failure durations: path-specific = 1000 sec, fail-stop = 1 sec]


Summary

Practical formulation of failure detection problem

– Incorporates both fail-stop and path-specific failures

Solution minimizes probing cost

– Using linear programming

Inferred Internet graphs are among the most expensive to probe

– Probing scales almost linearly with network size

Next step

– Deploy a system based on these probing techniques

ConnectionWatch: Passive monitoring of round-trip times at end-hosts

with Diana Zeaiter Joumblatt (LIP6) and Nina Taft (Intel)


Goal

Automatic detection of performance degradations

– Only care about problems that impact applications

– Focus on detecting “large” round-trip times (RTT)

– Detection should be fast and lightweight


ConnectionWatch

[Architecture: a sniffer captures TCP packets and extracts flow IDs; an RTT-estimation module feeds a high-RTT detector; flow statistics and alarms from the packet trace are uploaded to a central server; a ping daemon complements the passive measurements]
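To illustrate the RTT-estimation stage, below is a minimal sketch of passive RTT measurement from the TCP three-way handshake using scapy: match each outgoing SYN with the returning SYN-ACK and report the elapsed time. This is one plausible way to implement the idea, not ConnectionWatch's actual code; a real tool would also track data/ACK pairs and handle retransmissions.

```python
# Passive RTT estimation sketch: the time between a flow's SYN and the
# matching SYN-ACK approximates the round-trip time to the server.
from scapy.all import sniff, IP, TCP

syn_times = {}  # (src, dst, sport, dport) -> capture time of the SYN

def handle(pkt):
    if not (IP in pkt and TCP in pkt):
        return
    flags = pkt[TCP].flags
    if flags == "S":  # outgoing SYN: remember when we saw it
        key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport)
        syn_times[key] = pkt.time
    elif flags == "SA":  # SYN-ACK: look up the reversed flow ID
        key = (pkt[IP].dst, pkt[IP].src, pkt[TCP].dport, pkt[TCP].sport)
        if key in syn_times:
            rtt_ms = (pkt.time - syn_times.pop(key)) * 1000
            print(f"{key[1]}: RTT {rtt_ms:.1f} ms")

sniff(filter="tcp", prn=handle, store=False)  # requires root privileges
```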


Insights from preliminary experiments

Datasets from five students over three days

– 44,715 TCP connections over 3,584 paths to 2,242 IPs

Some observations

– More complete measurements than ping

• 16.5% of 1,072 addresses don't reply to pings

– Transfer of traces to the server is the main bottleneck

Hurdles

– Portability of the system to other OSes

– Privacy concerns with capturing user’s traffic

– Incentives for large-scale deployment


Which RTT variations correspond to performance degradations?

Our datasets are still too small to answer

– Performance degradations are rare events

Simple technique based on an outlier threshold (one plausible instantiation is sketched below)

– What is a good threshold?

– Should the threshold be global, per user, per path, or per application?

Do outliers correspond to real performance degradations?

– ConnectionWatch should get the user's feedback

• An "I'm annoyed" button
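For concreteness, the sketch below shows one plausible per-path instantiation of the outlier-threshold idea, flagging RTT samples above mean + k standard deviations over a sliding window. The threshold form, window size, and k = 3 are assumptions; as noted above, the right choice remains an open question.

```python
# Per-path outlier detection sketch: flag an RTT sample if it exceeds
# mean + K * stddev of the path's recent history.
from collections import defaultdict, deque
from statistics import mean, stdev

HISTORY, K = 100, 3.0
rtts = defaultdict(lambda: deque(maxlen=HISTORY))  # path -> recent RTTs

def is_outlier(path, rtt_ms):
    samples = rtts[path]
    flagged = (len(samples) >= 10 and
               rtt_ms > mean(samples) + K * stdev(samples))
    samples.append(rtt_ms)
    return flagged

# Example: a stable path with one large spike
for r in [20, 22, 21, 19, 23, 20, 21, 22, 20, 21, 250]:
    if is_outlier("path-to-google", r):
        print(f"outlier: {r} ms")  # fires on the 250 ms sample
```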

Practical issues with using network tomography for fault diagnosis

with Italo Scota Cunha (LIP6, Thomson); Amogh Dhamdhere, Yiyi Huang, Nick Feamster, and Constantine Dovrolis (Georgia Tech); and Christophe Diot (Thomson)


The binary tomography solution by Duffield

Given

– Complete network topology

– End-to-end reachability measurements

Find the smallest set of links that explains the observations

– Assumes a single-source tree and access to targets

[Figure: a tree from monitor m to targets t1 and t2]


Extending binary tomography

Multi-network setting: topology not known

– Periodic traceroutes determine topologies

Extension to multiple sources and multiple targets

– Minimum hitting set problem (NP-hard)

Tomo: an iterative, polynomial-time greedy heuristic (sketched below)

– Intuition: iteratively choose the link that explains the maximum number of failures
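Below is a minimal sketch of that greedy intuition: repeatedly blame the link that appears on the most still-unexplained failed paths. Link and path names are illustrative; the real heuristic also has to cope with the dynamics discussed next.

```python
# Tomo-style greedy hitting set: iteratively pick the link that explains
# the largest number of still-unexplained failed paths.
def tomo_greedy(failed_paths):
    """failed_paths: list of sets, each the links of one failed path."""
    unexplained = list(failed_paths)
    blamed = []
    while unexplained:
        counts = {}  # link -> number of unexplained failures it is on
        for links in unexplained:
            for link in links:
                counts[link] = counts.get(link, 0) + 1
        link = max(counts, key=counts.get)
        blamed.append(link)
        unexplained = [ls for ls in unexplained if link not in ls]
    return blamed

# Two failed paths sharing link (B, C): one blamed link explains both
print(tomo_greedy([{("A", "B"), ("B", "C")},
                   {("D", "B"), ("B", "C")}]))  # [('B', 'C')]
```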


Some problems

Dynamics

– Loss can be transient, and topology can change

Ambiguity

– Losses are one-way, but we don't always have access to both ends of the path

Lack of synchronization

– Different monitors see different conditions


Approach

Transient packet loss

– Triggered confirmation of failed paths

Dynamic routing

– Periodic snapshots of the network topology

One-way losses

– Algorithm based on IP spoofing

Lack of synchronization

– Correlation of probes from different monitors


Failure confirmation

[Figure: packets on a path over time; a loss burst that hits consecutive probes causes a false positive]

Upon detection of a failure, trigger extra probes

Number of probes (a worked example follows this slide)

– Confirm failures with a target false-positive rate

– Assume independence and a given loss rate

Time between probes

– Reduce the chance that probes fall in the same loss burst

– Assume link losses follow a Gilbert process
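Under the independence assumption, the probe count works out directly: on a working path with loss rate p, all n confirmation probes are lost with probability p^n, so n can be sized for a target false-positive rate. The numbers below are illustrative, though they match the Emulab setup shown later (10% loss, three probes).

```python
# Sizing confirmation probes under the independence assumption: pick the
# smallest n with loss_rate**n <= target false-positive rate. The Gilbert
# assumption then dictates spacing probes so they miss the same burst.
import math

def probes_needed(loss_rate, target_fp):
    return math.ceil(math.log(target_fp) / math.log(loss_rate))

p, target = 0.10, 0.001
n = probes_needed(p, target)
print(n, p ** n)  # 3 probes suffice: 0.1**3 = 0.001
```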


Disambiguating one-way losses: Spoofing

The monitor sends a request to the spoofer to send a probe

The probe carries the IP address of the monitor

If the reply reaches the monitor, the reverse path is working

[Figure: the spoofer sends a spoofed packet with the source address of monitor M to target T; T's reply travels the reverse path to M]
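A minimal sketch of the spoofer's side, assuming scapy and placeholder addresses: the probe carries the monitor's source address, so the target's reply travels the target-to-monitor reverse path. Sending packets with a forged source requires raw-socket privileges and is filtered by many networks, which a deployment has to account for.

```python
# Spoofed probe sketch: the spoofer crafts a packet whose source is the
# monitor M; if M later sees the reply, the reverse path from T works.
from scapy.all import send, IP, ICMP

MONITOR_IP = "192.0.2.10"   # monitor M (placeholder address)
TARGET_IP = "198.51.100.7"  # target T (placeholder address)

send(IP(src=MONITOR_IP, dst=TARGET_IP) / ICMP())
```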


Evaluation

Evaluation is challenging

– Need ground truth and realistic environment

Controlled experiments on the VINI testbed

– Allow us to inject failures

– Problem: hard to reason about false positives

Experiments on Emulab

– More control: dedicated nodes and links

– Emulate the Abilene network

– Selected LA and NY as monitors


Failure confirmation reduces false positives

Emulab experiment setup

– 10% loss rates in each direction

– No persistent failures

Both schemes use three probes to confirm a failure

False-positive rates by confirmation interval and burst factor:

Confirmation interval | Burst factor 90% | Burst factor 96%
Back-to-back          | 15%              | 25%
0.2 secs              | 0.8%             | 0.8%

False positives are low with a 0.2-sec interval because it makes the probability that probes fall in the same loss burst small.


Correlation is important to get a consistent view

Emulab and VINI experiments with short failures

– More false positives

– Lower detection rate

In real deployments, can we get a consistent view?

– More noise because of losses and routing dynamics

– Monitors are less synchronized

– Monitors may not be able to reach the coordinator

Next steps

– Online correlation

– Minimize communication with the coordinator


Summary

Continuous monitoring for detection

– At management hosts: active measurements

• Reduce probing overhead while still detecting failures

– At end-users: passive measurements

• Lightweight detection of problems that affect applications

Network tomography for identification

– Many challenges in getting consistent inputs for tomography

• Network dynamics and transient losses

• Ambiguity of forward and reverse failures

• Monitors may observe different conditions