Network Tomography for Fault Diagnosis
Renata Teixeira
LIP6 Computer Laboratory, CNRS and UPMC Paris Universitas
The Internet is great, but problems happen
[Diagram: a user at the LIP6 network reaches a remote service across networks Net1, Net2, and Net3]
How to automatically detect and identify problems?
Is my connection OK?
Is it Google?
Is the problem in one of the networks on the path?
Current alarms are not enough
Network equipment already has many alarms
– SNMP traps
– Anomaly detection systems
But alarms may not reflect the user's experience
– Hard to map users' complaints to alarms
– The user's problem may not appear as an alarm
Network admins often resort to active measurements
– Active monitoring servers inside their network
– Subscriptions to third-party monitoring services
• E.g., Keynote or RIPE TTM
End-hosts can collaborate to troubleshoot problems
Detection: continuous path monitoring
Identification: tomography
End-host troubleshooting in two different contexts
Network admins deploy monitoring services
– Verify the performance of their networks
– Assist in troubleshooting
End-users can collaborate
– Identify and bypass problems
– Rank providers
Detection techniques
For network admins
– Deploy dedicated monitors
– Need to inject probes to measure paths
For end-users
– Monitoring at the end-user's machine
– Tapping users' traffic is promising
Challenge: cannot continuously overload the network or the end-user's machine to detect faults
Minimizing probing cost for detecting interface failures: Algorithms and scalability analysis
with Hung X. Nguyen (Univ. of Adelaide), Patrick Thiran (EPFL), Christophe Diot (Thomson)
Active monitoring system to detect faults
[Diagram: monitors M1 and M2 probe paths to target hosts T1, T2, and T3 across a target network with internal interfaces A, B, C, and D]
Goal: detect failures of any of the interfaces in the subscriber's network with minimum probing overhead
Simple solution: Coverage problem
Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber's network
The coverage problem is NP-hard
– Solution: a greedy set-cover heuristic, as sketched below
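A minimal sketch of such a greedy heuristic in Python; the path-to-interface mapping below is hypothetical (the slides do not give the actual paths), reusing the monitor and target names from the figure:

```python
# Greedy set cover: repeatedly pick the path that covers the most
# still-uncovered interfaces of the subscriber's network.
def greedy_path_cover(paths: dict[str, set[str]], interfaces: set[str]) -> list[str]:
    uncovered = set(interfaces)
    chosen = []
    while uncovered:
        best = max(paths, key=lambda p: len(paths[p] & uncovered))
        if not paths[best] & uncovered:
            break                       # remaining interfaces lie on no path
        chosen.append(best)
        uncovered -= paths[best]
    return chosen

# Hypothetical example with the figure's monitors (M1, M2) and targets (T1-T3)
paths = {"M1->T1": {"A", "B"}, "M1->T3": {"A", "C", "D"}, "M2->T2": {"B", "D"}}
print(greedy_path_cover(paths, {"A", "B", "C", "D"}))   # ['M1->T3', 'M1->T1']
```

The greedy choice gives the classic logarithmic approximation to the optimal cover, which is the standard workaround for the NP-hardness noted above.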
Coverage solution doesn’t detect all types of failures
Detects fail-stop failures
– Failures that affect all packets that traverse the faulty interface
• E.g., interface or router crashes, fiber cuts, bugs
But not path-specific failures
– Failures that affect only a subset of the paths that cross the faulty interface
• E.g., router misconfigurations
New formulation of failure detection problem
Select the frequency at which to probe each path
– Lower-frequency per-path probing can achieve high-frequency probing of each interface
[Diagram: three paths that share an interface are each probed once every 9 minutes, so the shared interface is probed once every 3 minutes]
Properties of the solution
The failure detection problem is no longer NP-hard
– Can find the optimal solution using linear programming
– Parameters: durations of path-specific and fail-stop failures
Needs synchronization among monitors
– Monitors need to collaborate to probe an interface
– An alternative probabilistic solution avoids the synchronization overhead
Probing cost scales almost linearly with the size of the target network
– In random power-law graphs like inferred Internet graphs
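A sketch of how the frequency assignment can be posed as a linear program; the paper's exact formulation is not shown here, and the paths and target rate below are illustrative assumptions:

```python
# Minimize the total probing rate subject to every interface being
# probed at least at rate f_min, where an interface's probing rate is
# the sum of the rates of the paths that traverse it.
import numpy as np
from scipy.optimize import linprog

paths = [{"A", "B"}, {"A", "C", "D"}, {"B", "D"}]   # interfaces on each path
interfaces = ["A", "B", "C", "D"]
f_min = 1.0 / 180.0                                  # one probe every 3 minutes

c = np.ones(len(paths))                              # cost: total probes/sec
A_ub = np.array([[-1.0 if i in p else 0.0 for p in paths] for i in interfaces])
b_ub = np.full(len(interfaces), -f_min)              # coverage of i >= f_min

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * len(paths))
print(res.x)   # optimal per-path probing rates
```

In the real system, the required per-interface rate would be derived from the fail-stop and path-specific failure durations listed above.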
Evaluation
Paths obtained using traceroutes
– From 750 PlanetLab nodes to 3,000 DNS servers
– From 12 RON nodes to 60,000 targets
Target networks are probed ASes
– Map IPs to ASes using Mao et al.’s technique
– 1,366 ASes in PlanetLab
– 6,517 ASes in RON
Compute probing costs varying parameters
– Set of paths, failure durations, target network
Probing costs varying the size of the subscriber network in PlanetLab
[Plot; failure durations: path-specific = 1,000 sec, fail-stop = 1 sec]
Summary
Practical formulation of failure detection problem
– Incorporates both fail-stop and path-specific failures
Solution minimizes probing cost
– Using linear programming
Inferred Internet graphs are among the most expensive to probe
– Probing scales almost linearly with network size
Next step
– Deploy a system based on these probing techniques
ConnectionWatch: Passive monitoring of round-trip times at end-hosts
with Diana Zeaiter Joumblatt (LIP6), Nina Taft (Intel)
Goal
Automatic detection of performance degradations
– Only care about problems that impact applications
– Focus on detecting “large” round-trip times (RTT)
– Detection should be fast and lightweight
ConnectionWatch
[Diagram: a sniffer captures the user's TCP packets and extracts flow IDs; an RTT estimation module feeds a high-RTT detector, which raises alarms; packet traces and flow statistics are uploaded to a central server; a ping daemon issues probes on demand]
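A minimal sketch of passive RTT estimation from captured TCP packets; the record format and names here are assumptions, not ConnectionWatch's actual implementation:

```python
# For each outgoing data segment, remember the send time under the ACK
# number it expects; when that ACK arrives, the elapsed time is one RTT
# sample. Retransmitted segments are discarded (Karn's rule), since their
# ACKs are ambiguous.
from typing import Iterable, NamedTuple

class Pkt(NamedTuple):
    ts: float        # capture timestamp (seconds)
    outgoing: bool   # True if sent by this host
    seq: int         # TCP sequence number
    payload: int     # payload length in bytes
    ack: int         # TCP acknowledgment number

def rtt_samples(packets: Iterable[Pkt]) -> list[float]:
    pending: dict[int, float] = {}   # expected ACK number -> send time
    samples = []
    for p in packets:
        if p.outgoing and p.payload > 0:
            expected_ack = p.seq + p.payload
            if expected_ack in pending:
                del pending[expected_ack]    # retransmission: drop the sample
            else:
                pending[expected_ack] = p.ts
        elif not p.outgoing and p.ack in pending:
            samples.append(p.ts - pending.pop(p.ack))
    return samples
```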
Insights from preliminary experiments
Datasets from five students during three days
– 44,715 TCP connections over 3,584 paths to 2,242 IPs
Some observations
– More complete measurements than ping
• 16.5% of 1,072 addresses don't reply to pings
– Transfer of traces to the server is the main bottleneck
Hurdles
– Portability of the system to other OSes
– Privacy concerns with capturing users' traffic
– Incentives for large-scale deployment
Which RTT variations correspond to performance degradations?
Our datasets are still too small to answer
– Performance degradations are rare events
Simple technique based on an outlier threshold (one possible rule is sketched after this list)
– What is a good threshold?
– Should the threshold be global, per user, per path, or per application?
Do outliers correspond to real performance degradations?
– ConnectionWatch should get the user's feedback
• An "I'm annoyed" button
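One possible per-path outlier rule, offered as an illustration rather than the paper's answer; WINDOW, K, and the warm-up size are assumed parameters:

```python
# Flag an RTT as a candidate degradation when it exceeds the path's
# median plus K times the median absolute deviation (MAD) over a
# sliding window of recent samples.
import statistics
from collections import defaultdict, deque

WINDOW, K = 200, 5.0                                  # assumed parameters
history = defaultdict(lambda: deque(maxlen=WINDOW))   # path -> recent RTTs

def is_outlier(path: str, rtt: float) -> bool:
    h = history[path]
    flagged = False
    if len(h) >= 30:                                  # warm-up: need samples
        med = statistics.median(h)
        mad = statistics.median(abs(x - med) for x in h) or 1e-6
        flagged = rtt > med + K * mad
    h.append(rtt)
    return flagged
```

A per-path baseline like this sidesteps the global-threshold question, but user feedback is still needed to check whether flagged outliers are degradations the user actually notices.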
Practical issues with using network tomography for fault diagnosis
with Italo Scota Cunha (LIP6, Thomson), Amogh Dhamdhere, Yiyi Huang, Nick Feamster, Constantine Dovrolis (Georgia Tech), Christophe Diot (Thomson)
The binary tomography solution by Duffield
Given
– Complete network topology
– End-to-end reachability measurements
Find the smallest set of links that explains the observations
– Assumes a single-source tree and access to targets
[Diagram: a monitor m probes targets t1 and t2 over a tree topology]
Extending binary tomography
Multi-network setting: topology not known
– Periodic traceroutes determine topologies
Extension to multiple sources and multiple targets
– Minimum hitting set problem (NP-hard)
Tomo: an iterative polynomial-time greedy heuristic
– Intuition: iteratively choose the link that explains the maximum number of failures, as sketched below
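A minimal sketch of that greedy intuition (data shapes assumed; this is not the published Tomo code):

```python
# Treat each failed path as the set of links it traverses, and repeatedly
# blame the link that appears in the most still-unexplained failed paths.
def greedy_hitting_set(failed_paths: list[set[str]]) -> set[str]:
    unexplained = [set(p) for p in failed_paths]
    blamed: set[str] = set()
    while unexplained:
        counts: dict[str, int] = {}
        for path in unexplained:
            for link in path:
                counts[link] = counts.get(link, 0) + 1
        link = max(counts, key=counts.get)   # link hitting most failed paths
        blamed.add(link)
        unexplained = [p for p in unexplained if link not in p]
    return blamed

# E.g., two failed paths sharing link "B" are explained by blaming "B" alone
print(greedy_hitting_set([{"A", "B"}, {"B", "C"}]))   # {'B'}
```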
Some problems
Dynamics
– Loss can be transient, topology can change
Ambiguity
– Losses are one-way, but we don't always have access to both ends of the path
Lack of synchronization
– Different monitors see different conditions
Approach
Transient packet loss
– Triggered confirmation of failed paths
Dynamic routing
– Periodic snapshots of the network topology
One-way losses
– Algorithm based on IP spoofing
Lack of synchronization
– Correlation of probes from different monitors
Failure confirmation
[Diagram: packets on a path over time; a loss burst can swallow all probes and produce a false positive]
Upon detection of a failure, trigger extra probes
Number of probes
– Confirm failures with a target false-positive rate
– Assumes probe losses are independent and the loss rate is given
Time between probes
– Reduce the chance that probes fall in the same loss burst
– Assumes link losses follow a Gilbert process
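A worked version of the probe-count rule under the independence assumption: if a working path drops each probe independently with probability equal to the loss rate, then k consecutive losses occur with probability loss_rate**k, so the smallest k that pushes this below the target false-positive rate suffices:

```python
import math

def probes_to_confirm(loss_rate: float, target_fp: float) -> int:
    # Smallest k with loss_rate**k <= target_fp,
    # i.e. k >= log(target_fp) / log(loss_rate)
    return math.ceil(math.log(target_fp) / math.log(loss_rate))

print(probes_to_confirm(0.10, 0.001))   # 3 probes at 10% loss
```

This is consistent with the Emulab experiment later in the deck, which uses three confirmation probes under 10% loss; the Gilbert-process spacing rule then determines how far apart those probes should be sent.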
Disambiguating one-way losses: Spoofing
The monitor sends a request to a spoofer to send a probe
The probe carries the IP address of the monitor
If the reply reaches the monitor, the reverse path is working
[Diagram: the spoofer sends a packet to target T with the source address of monitor M, so T's reply travels only the T-to-M reverse path]
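A hypothetical sketch of what such a spoofer could look like with Linux raw sockets; the paper's actual tool is not shown, this requires root, and many networks filter spoofed packets:

```python
# Send an ICMP echo request to the target whose source address is the
# monitor's, so the echo reply (if any) traverses the target->monitor
# reverse path in isolation.
import socket
import struct

def checksum(data: bytes) -> int:
    # Standard ones'-complement sum over 16-bit words
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    s = (s >> 16) + (s & 0xFFFF)
    s += s >> 16
    return ~s & 0xFFFF

def send_spoofed_ping(monitor_ip: str, target_ip: str, seq: int = 1) -> None:
    icmp = struct.pack("!BBHHH", 8, 0, 0, 0x1234, seq)              # echo request
    icmp = struct.pack("!BBHHH", 8, 0, checksum(icmp), 0x1234, seq)
    ip = struct.pack("!BBHHHBBH4s4s",
                     (4 << 4) | 5, 0, 20 + len(icmp), 0, 0, 64,
                     socket.IPPROTO_ICMP, 0,          # kernel fills IP checksum
                     socket.inet_aton(monitor_ip),    # spoofed source: monitor M
                     socket.inet_aton(target_ip))
    s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)
    s.sendto(ip + icmp, (target_ip, 0))
    s.close()
```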
Evaluation
Evaluation is challenging
– Need ground truth and a realistic environment
Controlled experiments on the VINI testbed
– Allow us to inject failures
– Problem: hard to reason about false positives
Experiments on Emulab
– More control: dedicated nodes and links
– Emulate the Abilene network
– Selected LA and NY as monitors
Failure confirmation reduces false positives
Emulab experiment setup
– 10% loss rate in each direction
– No persistent failures
Both schemes use three probes to confirm a failure

False-positive rate by confirmation interval and burst factor:

Confirmation interval | Burst factor 90% | Burst factor 96%
Back-to-back          | 15%              | 25%
0.2 secs              | 0.8%             | 0.8%

Low false positives, because an interval of 0.2 secs guarantees a small probability of probes being correlated
Correlation is important to get a consistent view
Emulab and VINI experiments with short failures
– More false positives
– Lower detection rate
In real deployments, can we get a consistent view?
– More noise because of losses and routing dynamics
– Monitors are less synchronized
– Monitors may not be able to reach the coordinator
Next steps
– Online correlation
– Minimize communication with the coordinator
Summary
Continuous monitoring for detection
– At management hosts: active measurements
• Reduce probing overhead while still detecting failures
– At end-users: passive measurements
• Lightweight detection of problems that affect apps
Network tomography for identification
– Many challenges to get consistent inputs for tomography
• Network dynamics and transient losses
• Ambiguity between forward and reverse failures
• Monitors may observe different conditions