
Making Network Tomography Practical

Renata Teixeira

Laboratoire LIP6

CNRS and UPMC Paris Universitas


Internet monitoring is essential

For network operators
– Monitor service-level agreements
– Troubleshoot failures
– Diagnose anomalous behavior

For users or content/application providers
– Verify network performance


Challenge 1: Nobody controls end-to-end path

Network operators only have data of one AS

End-hosts can only monitor end-to-end paths

[Figure: four interconnected networks, AS1 to AS4]

Challenge 2: Available data is not direct

Network operators

Is my network performance good?

– Only have per-link counts or active probes

Is there a problem? Where?

– There may be no alarm

Users, applications

Is my provider’s performance good?

– Only have end-to-end delay and loss


Network tomography to the rescue

Inference of unknown network properties from measurable ones

Sophisticated inference algorithms
– Given a model and available measurements
– Apply statistical inference to estimate properties
  • Maximum likelihood estimator, Bayesian inference

Unfortunately, limited practical deployment
– Measuring the required inputs is difficult


This tutorial

Monitoring techniques to make network tomography practical


Outline

Examples of network tomography problems

Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
  • Correlated path reachability
  • Topology measurements

Open issues


Network tomography problems

Estimation of a network’s traffic matrix
– Given total traffic in network links
– What is the traffic between a network’s entry and exit points?

Inference of link performance
– Given end-to-end probes
– What is the loss rate or delay of a link?

Inference of network topology
– Given end-to-end loss measurements
– What is the logical network topology?

Inference of link performance

What are the properties of network links?
– Loss rate
– Delay
– Bandwidth
– Connectivity

Given end-to-end measurements
– No access to routers

[Figure: routers A to F spanning AS 1 and AS 2]

Multicast-based Inference of Network-internal Characteristics

Measurements
– Multicast probes
– Traces collected at receivers

Inference
– Exploit correlation in traces to estimate link properties

Introduced by the MINC project

[Figure: multicast tree from the probe sender to the probe collectors]

Inferring link loss rates

Assumptions
– Known, logical-tree topology
– Losses are independent
– Multicast probes

Methodology
– Maximum likelihood estimates for αk
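For the smallest case, a tree with one shared link (α1) and two leaf links (α2, α3), the independence assumption gives P(t1 receives) = α1·α2, P(t2 receives) = α1·α3, and P(both receive) = α1·α2·α3, so the estimates have a closed form. The sketch below is only an illustration of that two-receiver case; the function name and the trace format (one pair of 0/1 outcomes per multicast probe) are assumptions, not part of the MINC tools.

```python
def estimate_link_success(traces):
    """Estimate (a1, a2, a3) for a two-leaf tree from multicast probe traces.

    traces: list of (r1, r2) pairs, where r1 / r2 is 1 if receiver t1 / t2
    got that probe and 0 otherwise (hypothetical input format).
    """
    n = len(traces)
    p1 = sum(r1 for r1, _ in traces) / n            # estimates a1 * a2
    p2 = sum(r2 for _, r2 in traces) / n            # estimates a1 * a3
    p12 = sum(r1 * r2 for r1, r2 in traces) / n     # estimates a1 * a2 * a3
    if p12 == 0:
        raise ValueError("no probe reached both receivers; estimate undefined")
    a1 = p1 * p2 / p12   # shared link
    a2 = p12 / p2        # link to t1
    a3 = p12 / p1        # link to t2
    return a1, a2, a3
```

The estimate relies on both receivers seeing the same probe on the shared link, which is why the method needs multicast (or tightly correlated) probes.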

[Figure: two-leaf tree with source m and receivers t1, t2; link success probabilities α1, α2, α3 and their estimates α̂1, α̂2, α̂3 computed from the receivers’ probe traces]

Binary tomography

Labels links as good or bad
– Loss rate estimation requires tight correlation
– Instead, separate good/bad performance
– If a link is bad, all paths that cross the link are bad

[Figure: two-leaf tree with source m and receivers t1, t2; links α1, α2, α3 are labelled good or bad]

Single-source tree

“Smallest Consistent Failure Set” algorithm
– Assumes a single-source tree and known topology
– Finds the smallest set of links that explains the bad paths
  • Given that bad links are uncommon
  • A bad link is the root of a maximal bad subtree
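The rule "a bad link is the root of a maximal bad subtree" can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm's code: the tree is given as a hypothetical child-to-parent map, path status is known per leaf, and a link is named by its (parent, child) pair.

```python
def scfs(parent, leaf_is_bad):
    """Blame the links that enter maximal all-bad subtrees.

    parent: dict child -> parent node (the source has no entry).
    leaf_is_bad: dict leaf -> True if the source-to-leaf path is bad.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def all_bad(node):
        kids = children.get(node, [])
        if not kids:                       # leaf: use the measured path status
            return leaf_is_bad[node]
        return all(all_bad(k) for k in kids)

    blamed = set()
    for child, par in parent.items():
        # The source has no incoming link, so it never absorbs the blame.
        parent_all_bad = par in parent and all_bad(par)
        if all_bad(child) and not parent_all_bad:
            blamed.add((par, child))
    return blamed
```

With a tree m → r → {t1, t2}, a single bad leaf blames only its own link, while two bad leaves blame the shared link (m, r), which is the smaller explanation.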

[Figure: single-source tree with source m and receivers t1, t2; one link is marked bad, and links and paths are labelled good or bad]

Binary tomography with multiple sources and targets

Problem becomes NP-hard
– Minimum hitting set problem
  • Hitting set of a link = paths that traverse the link

Iterative greedy heuristic
– Given the set of links in bad paths
– Iteratively choose the link that explains the maximum number of bad paths

Promising for fault identification
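A minimal sketch of the greedy heuristic, assuming each path is given as a list of link names. Excluding links that also appear on good paths is an added assumption here; it follows from the binary-tomography rule that every path crossing a bad link is bad. The function name and the alphabetical tie-breaking are illustrative choices.

```python
def greedy_bad_links(bad_paths, good_paths=()):
    """Greedily pick links until every explainable bad path is explained."""
    good_links = {link for path in good_paths for link in path}
    # Candidate links per bad path: links not already known to be good.
    unexplained = [set(path) - good_links for path in bad_paths]
    blamed = []
    while unexplained:
        candidates = set().union(*unexplained)
        if not candidates:
            break  # remaining bad paths have no candidate link left
        # Link that appears on the largest number of unexplained bad paths.
        best = max(sorted(candidates),
                   key=lambda link: sum(link in p for p in unexplained))
        blamed.append(best)
        unexplained = [p for p in unexplained if best not in p]
    return blamed
```

Because the underlying problem is NP-hard, the result is a small explanation of the bad paths, not necessarily the smallest one.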

[Figure: topology with two monitors, m1 and m2, and two targets, t1 and t2]

Practical issues

Topology is often unknown
– Need to measure accurate topology

Limited deployment of multicast
– Need to extract correlation from unicast probes
– Even using probes from different monitors

Control of targets is not always practical
– Need one-way performance from round-trip probes

Links can fail for some paths, but not all
– Need to extend tomography algorithms


Outline

Examples of network tomography problems

Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
  • Correlated path reachability
  • Topology measurements

Open issues


Steps of fault diagnosis

[Figure: four interconnected networks, AS1 to AS4]

Detection: continuous path monitoring

Identification: binary tomography


FAULT DETECTION


Detection techniques

Active probing: ping
– Send probe and collect response
– No control of targets

Passive analysis of user’s traffic
– tcpdump: tap all incoming and outgoing packets
– Monitoring of TCP connections


Detection with ping

If the monitor receives a reply
– Then, path is good

If no reply arrives before the timeout
– Then, path is bad

[Figure: monitor m sends a probe (ICMP echo request) to target t, which returns a reply (ICMP echo reply)]

Persistent failure or measurement noise?

Many reasons to lose probe or reply
– Timeout may be too short
– Rate limiting at routers
– Some end-hosts don’t respond to ICMP requests
– Transient congestion
– Routing change

Need to confirm that failure is persistent
– Otherwise, may trigger false alarms


Failure confirmation

Upon detection of a failure, trigger extra probes

Goal: minimize detection errors
– Sending more probes
– Waiting longer between probes

Tradeoff: detection error and detection time

[Figure: timeline of packets on a path, showing a loss burst and a detection error]
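The confirmation step can be sketched as a small loop. This is an illustration only: `probe` is a hypothetical callable that returns True when a reply arrives before the timeout, and the default number of extra probes and the wait between them are arbitrary placeholders, not values from the tutorial.

```python
import time

def confirm_failure(probe, extra_probes=3, wait_s=1.0, sleep=time.sleep):
    """Return True if every extra probe is also lost, False otherwise.

    More probes and longer waits lower the chance of flagging a short
    loss burst, at the price of a longer detection time.
    """
    for _ in range(extra_probes):
        sleep(wait_s)          # spacing probes helps outlast a loss burst
        if probe():
            return False       # a reply arrived: treat the first loss as noise
    return True
```

The `sleep` parameter exists only so the waiting can be replaced in tests or simulations.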


Passive detection

tcpdump captures all packets

Track status of each TCP connection
– RTTs, timeouts, retransmissions

Multiple timeouts indicate path is bad
– If current seq. number > last seq. number seen
  • Path is good
– If current seq. number = last seq. number seen
  • Timeout has occurred
  • After four timeouts, declare path as bad
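The sequence-number rule above can be written as a small per-connection tracker. This is a sketch under simplifying assumptions: it is fed the sequence number of each captured data segment in order, it ignores sequence-number wraparound, and the class and method names are illustrative.

```python
class ConnectionTracker:
    """Tracks one TCP connection and labels its path 'good' or 'bad'."""

    def __init__(self, max_timeouts=4):
        self.last_seq = None
        self.timeouts = 0
        self.max_timeouts = max_timeouts

    def observe(self, seq):
        if self.last_seq is None or seq > self.last_seq:
            # New data is flowing: the path is good, reset the counter.
            self.last_seq = seq
            self.timeouts = 0
        elif seq == self.last_seq:
            # Same sequence number again: a retransmission after a timeout.
            self.timeouts += 1
        return "bad" if self.timeouts >= self.max_timeouts else "good"
```

A real monitor would keep one tracker per connection, keyed by the address and port pairs seen in the capture.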


Passive vs. active detection

Passive
+ No need to inject traffic
+ Detects all failures that affect user’s traffic
+ Responses from targets that don’t respond to ping
‒ Not always possible to tap user’s traffic
‒ Only detects failures in paths with traffic

Active
+ No need to tap user’s traffic
+ Detects failures in any desired path
‒ Probing overhead
  – Cover a large number of paths
  – Detect failures fast

Active monitoring: reducing probing overhead

Goal: detect failures of any of the interfaces in the target network with minimum probing overhead

[Figure: monitors M1 and M2 probe target hosts T1, T2, and T3 through a target network with nodes A, B, C, and D]

Simple solution: Coverage problem

[Figure: monitors M1 and M2, target hosts T1, T2, and T3, and nodes A, B, C, and D]

Instead of probing all paths, select the minimum set of paths that covers all interfaces in the subscriber’s network

Coverage problem is NP-hard

– Solution: greedy set-cover heuristic
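A minimal sketch of the greedy set-cover heuristic: repeatedly pick the path that covers the most interfaces not yet covered. The input format (a dict from a hypothetical path name to the set of interfaces it traverses) and the alphabetical tie-breaking are assumptions made for the example.

```python
def select_paths(paths):
    """Greedy set cover over interfaces.

    paths: dict path name -> set of interfaces the path traverses.
    Returns the chosen path names in selection order.
    """
    uncovered = set().union(*paths.values())
    chosen = []
    while uncovered:
        # Path covering the most still-uncovered interfaces.
        best = max(sorted(paths), key=lambda p: len(paths[p] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen
```

For example, with four candidate paths that each cross two of the interfaces A, B, C, D, two well-chosen paths are enough to cover all four.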


Coverage solution doesn’t detect all types of failures

Detects fail-stop failures
– Failures that affect all packets that traverse the faulty interface
  • E.g., interface or router crashes, fiber cuts, bugs

But not path-specific failures
– Failures that affect only a subset of paths that cross the faulty interface
  • E.g., router misconfigurations


New formulation of failure detection problem

Select the frequency to probe each path
– Lower-frequency per-path probing can achieve high-frequency probing of each interface
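The arithmetic behind this point: an interface is probed whenever any path that crosses it is probed, so its probing rate is the sum of the rates of those paths. The sketch below assumes a hypothetical input format (per-path rates in probes per minute, and the list of interfaces on each path).

```python
from fractions import Fraction

def interface_probe_rates(path_rates, path_interfaces):
    """Sum per-path probing rates into per-interface probing rates.

    path_rates: dict path -> probes per minute on that path.
    path_interfaces: dict path -> interfaces the path traverses.
    """
    rates = {}
    for path, rate in path_rates.items():
        for iface in path_interfaces[path]:
            rates[iface] = rates.get(iface, Fraction(0)) + rate
    return rates
```

If three paths cross the same interface and each is probed once every 9 minutes, that interface is probed at 3 × 1/9 = 1/3 probes per minute, that is, once every 3 minutes.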

[Figure: monitors M1 and M2, target hosts T1, T2, and T3, and nodes A, B, C, and D, with probing-frequency labels “1 every 9 mins” and “1 every 3 mins”]

Is failure in forward or reverse path?

Paths can be asymmetric
– Load balancing
– Hot-potato routing

[Figure: the probe from monitor m to target t and the reply from t to m]

Disambiguating one-way losses: Spoofing

Monitor asks a spoofer to send a probe

Probe carries the monitor’s IP address as its source address

If reply reaches the monitor, reverse path is good
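The inference drawn from the spoofed probe can be summarized as a small decision function. This is a sketch of the reasoning only; the function name and return strings are illustrative, and the "undetermined" case reflects that a missing reply alone does not say which direction failed.

```python
def locate_one_way_failure(direct_reply, spoofed_reply):
    """Interpret a failed ping from m to t using a spoofed probe.

    direct_reply: True if m's own probe to t was answered.
    spoofed_reply: True if t's reply to the spoofer's probe (sent with
    m's source address) reached m.
    """
    if direct_reply:
        return "no failure"
    if spoofed_reply:
        # t could reach m, so the reverse path works; blame m -> t.
        return "forward path failure"
    return "undetermined"
```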

[Figure: the spoofer sends target t a spoofed packet with the source address of monitor m]

Summary: Fault detection

Techniques to measure path reachability
– Active probing: ping + failure confirmation
– Passive analysis of TCP connections

Reducing overhead of active monitoring
– Select the set of paths to probe
– Trade-off: set of paths and probing frequency

No control of targets
– Only have round-trip measurements
– Spoofing differentiates forward/reverse failures


FAULT IDENTIFICATION: CORRELATED PATH REACHABILITY


Uncorrelated measurements lead to errors

Lack of synchronization leads to inconsistencies

– Probes cross links at different times

– Path may change between probes

[Figure: monitor m probing targets t1 and t2, with a mistakenly inferred link failure]

Sources of inconsistencies

In measurements from a single monitor
– Probing all targets can take time

In measurements from multiple monitors
– Hard to synchronize monitors for all probes to reach a link at the same time
– Impossible to generalize to all links


Inconsistent measurements with multiple monitors

[Figure: monitors m1 … mK probe targets t1 … tN; the path reachability table (paths m1,t1 … mK,tN, each labelled good or bad) contains inconsistent measurements]

Solution: Reprobe paths after failure

Consistency has a cost
– Delays fault identification
– Cannot identify short failures

[Figure: monitors m1 … mK probe targets t1 … tN; path reachability table (paths m1,t1 … mK,tN, each labelled good or bad) after reprobing]

Summary: Correlated measurements

Correlation is essential to tomography
– Lack of correlation leads to false alarms

Correlation is hard with unicast probes
– Probing multiple targets takes time
– Multiple monitors cannot probe a link simultaneously

Solution: probe paths again after fault detection
– Trade-off: consistency vs. detection speed


FAULT IDENTIFICATION: ACCURATE TOPOLOGY


Measuring router topology

With access to routers (or “from inside”)
– Topology of one network
– Routing monitors (OSPF or IS-IS)

No access to routers (or “from outside”)
– Multi-AS topology or from end-hosts
– Monitors issue active probes: traceroute


Topology from inside

Routing protocols flood state of each link
– Periodically refresh link state
– Report any changes: link down, up, cost change

Monitor listens to link-state messages
– Acts as a regular router
  • AT&T’s OSPFmon or Sprint’s PyRT for IS-IS

Combining link states gives the topology
– Easy to maintain, messages report any changes


Inferring a path from outside: traceroute

[Figure: actual path from m to t through router A (interfaces A.1, A.2) and router B (interfaces B.1, B.2). The TTL = 1 probe triggers “TTL exceeded” from A.1 and the TTL = 2 probe triggers “TTL exceeded” from B.1, so the inferred path is m, A.1, B.1, t]

A traceroute path can be incomplete

Load balancing is widely used
– Traceroute only probes one path

Sometimes traceroute has no answer (stars)
– ICMP rate limiting
– Anonymous routers

Tunnelling (e.g., MPLS) may hide routers
– Routers inside the tunnel may not decrement TTL


Traceroute under load balancing

[Figure: actual topology between m and t with load balancer L and routers A, B, C, D, and E; the probes with TTL = 2 and TTL = 3 take different branches, so the inferred path has missing nodes and links and a false link]

Errors happen even under per-flow load balancing

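Traceroute uses the destination port as identifier

Per-flow load balancers use the destination port as part of the flow identifier

[Figure: between m and t, the probe with TTL = 2 uses port 2 and the probe with TTL = 3 uses port 3, so load balancer L may send them over different branches among routers A, B, C, D, and E; only one of them belongs to flow 1]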


Paris traceroute

Solves the problem with per-flow load balancing
– Probes to a destination belong to same flow

Changes the location of the probe identifier
– Use the UDP checksum

[Figure: between m and t, the probes with TTL = 2 and TTL = 3 both use port 1 and are told apart by checksum 2 and checksum 3, so load balancer L sends them over the same branch]
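The difference between the two probing styles can be reproduced with a toy simulation. Everything below is an assumption made for illustration: the two-branch topology, the balancer that picks a branch from the destination port alone, and the port numbers. It is not the Paris traceroute implementation.

```python
# Hypothetical topology: load balancer L splits flows over two branches
# that rejoin at E.
BRANCHES = [["L", "A", "C", "E"], ["L", "B", "D", "E"]]

def hop_reached(ttl, dst_port):
    """Router that answers a probe with this TTL on the branch its flow takes."""
    branch = BRANCHES[dst_port % len(BRANCHES)]  # toy per-flow hash
    return branch[ttl - 1]

def classic_traceroute(max_ttl=4, base_port=33434):
    # The destination port changes with every probe, so the flow changes too.
    return [hop_reached(ttl, base_port + ttl) for ttl in range(1, max_ttl + 1)]

def paris_traceroute(max_ttl=4, port=33434):
    # The destination port stays fixed; probes would be matched to replies
    # through another field (the UDP checksum), which is not modelled here.
    return [hop_reached(ttl, port) for ttl in range(1, max_ttl + 1)]
```

In this toy setup the classic variant reports L, A, D, E, which contains a false link between A and D, while the fixed-port variant reports the real branch L, A, C, E.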


Topology from traceroutes

Inferred nodes = interfaces, not routers

Coverage depends on monitors and targets
– Misses links and routers
– Some links and routers appear multiple times

[Figure: actual topology (monitors m1, m2; targets t1, t2; routers A, B, C, D with numbered interfaces) next to the inferred topology, whose nodes are the interfaces A.1, B.3, C.1, C.2, and D.1]

Alias resolution: Map interfaces to routers

Direct probing
– Probe an interface, may receive response from another
– Responses from the same router will have close IP identifiers and same TTL

Record-route IP option
– Records up to nine IP addresses of routers in the path
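The "close IP identifiers and same TTL" test can be sketched as below. The response format, the function name, and the gap threshold of 200 are assumptions chosen for illustration; the only protocol fact used is that the IP identification field is 16 bits wide, so the comparison wraps around at 65536.

```python
def likely_same_router(resp_a, resp_b, max_ipid_gap=200):
    """Guess whether two replies, received back to back, come from one router.

    resp_a, resp_b: dicts with the reply's 'ip_id' (0..65535) and 'ttl'.
    """
    # Distance from a's IP ID forward to b's, allowing for 16-bit wraparound.
    gap = (resp_b["ip_id"] - resp_a["ip_id"]) % 65536
    return resp_a["ttl"] == resp_b["ttl"] and 0 < gap <= max_ipid_gap
```

A positive result is only a hint: in practice such a test is repeated several times before two interfaces are merged into one router.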

[Figure: inferred topology (interfaces A.1, B.3, C.1, C.2, and D.1 between m1, m2 and t1, t2) with two interfaces marked as belonging to the same router]

Large-scale topology measurements

Probing a large topology takes time
– E.g., probing 1200 targets from PlanetLab nodes takes 5 minutes on average (using 30 threads)
– Probing more targets covers more links
– But, getting a topology snapshot takes longer

Snapshot may be inaccurate
– Paths may change during snapshot

Hard to get up-to-date topology
– To know that a path changed, need to re-probe


Faster topology snapshots

Probing redundancy
– Intra-monitor
– Inter-monitor

Doubletree
– Combines backward and forward probing to eliminate redundancy

[Figure: monitors m1 and m2 probing targets t1 and t2 across routers A, B, C, and D]

Summary of techniques to measure topology

Routing messages
– Complete and accurate
– But, need access to routers

Combining traceroutes
– Anyone can use it, no privileged access to routers
– But, false or missing links and nodes

Topologies for tomography: some uncertainties
– Multiple topologies close to the time of an event
– Multiple paths between a monitor and a target


Outline

Examples of network tomography problems

Case study: fault diagnosis
– Fault detection: continuous path monitoring
– Fault identification: binary tomography
  • Correlated path reachability
  • Topology measurements

Open issues


Open issues

Fault detection
– How to detect faults or performance degradations that impact end-users?
– What is the overhead and speed of large-scale deployments?
– Will spoofing work in large-scale deployments?

Fault identification
– How to keep the topology up-to-date for fast identification?
– Do we need new tomography techniques to cope with partial failures?
– Could inference be easier with cooperation from routers?


REFERENCES


Network tomography theory

Survey on network tomography
– R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments”, Statistical Science, Vol. 19, No. 3 (2004), 499-517.

Traffic matrix estimation
– Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic Intensities from Link Data”, Journal of the American Statistical Association, Vol. 91, 1996.

Inference of link performance/connectivity
– MINC project: http://gaia.cs.umass.edu/minc/
– A. Adams et al., “The Use of End-to-end Multicast Measurements for Characterizing Internal Network Behavior”, IEEE Communications Magazine, May 2000.


Binary tomography

Single-source tree algorithm
– N. Duffield, “Network Tomography of Binary Network Performance Characteristics”, IEEE Transactions on Information Theory, 2006.

Applying tomography in one network
– R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren, “Detection and Localization of Network Blackholes”, IEEE INFOCOM, 2007.

Applying tomography in a multi-network topology
– A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot, “NetDiagnoser: Troubleshooting network unreachabilities using end-to-end probes and routing data”, CoNEXT, 2007.


Topology from inside

IS-IS monitoring
– R. Mortier, “Python Routeing Toolkit (`PyRT')”, https://research.sprintlabs.com/pyrt/

OSPF monitoring
– A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture, Design and Deployment Experience”, NSDI, 2004.

Commercial products
– Packet Design: http://www.packetdesign.com/


Topology with traceroute

Tracing accurate paths under load-balancing
– B. Augustin et al., “Avoiding traceroute anomalies with Paris traceroute”, IMC, 2006.

Reducing overhead to trace topology of a network and alias resolution with direct probing
– N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel”, SIGCOMM, 2002.

Use of record route to obtain more accurate topologies
– R. Sherwood, A. Bender, and N. Spring, “DisCarte: A Disjunctive Internet Cartographer”, SIGCOMM, 2008.

Reducing overhead to trace a multi-network topology
– B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.


Reducing overhead of active fault detection

Selection of paths to probe
– H. Nguyen and P. Thiran, “Active measurement for multiple link failures diagnosis in IP networks”, PAM, 2004.
– Y. Bejerano and R. Rastogi, “Robust monitoring of link delays and faults in IP networks”, INFOCOM, 2003.

Selection of the frequency to probe paths
– H. X. Nguyen, R. Teixeira, P. Thiran, and C. Diot, “Minimizing Probing Cost for Detecting Interface Failures: Algorithms and Scalability Analysis”, INFOCOM, 2009.


Internet-wide fault detection systems

Detection with BGP monitoring plus continuous pings, spoofing to disambiguate one-way failures, traceroute to locate faults
– E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D. Wetherall, and T. Anderson, “Studying Black Holes in the Internet with Hubble”, NSDI, 2008.

Detection with passive monitoring of traffic of peer-to-peer systems or content distribution networks, traceroutes to locate faults
– M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services”, OSDI, 2004.
