PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang (Princeton University)


Transcript of PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Page 1

PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang

Princeton University

Page 2

Motivation

• Routing anomalies are common on the Internet
  - Maintenance
  - Power outage
  - Fiber cut
  - Misconfiguration
  - …

• Anomalies can affect end-to-end performance
  - Packet losses
  - Packet delays
  - Disconnectivities

Page 3

Background

• Anomaly detection and diagnosis are nontrivial
  - Asymmetric paths
  - Failure information propagation
  - Highly varied durations
  - Limited coverage

Page 4

Contributions

• New techniques for
  - Anomaly detection
  - Anomaly isolation
  - Anomaly classification

• Large-scale study of anomalies
  - Broad coverage
  - High detection rate, low overhead
  - Characterization of anomalies
  - End-to-end effects
  - Benefits to host service

Page 5

Outline

• State of the Art
• PlanetSeer Components
  - MonD – passive monitoring
  - ProbeD – active probing
• Anomaly Analysis
  - Loop-based anomaly
  - Non-loop anomaly
• Bypassing Anomalies
• Summary

Page 6

State of the Art

• Routing messages
  - BGP: AS-level diagnosis
  - IS-IS, OSPF: within a single ISP

• Router/link traffic statistics
  - SNMP, NetFlow: proprietary

• End-to-end measurement
  - Ping, traceroute

Page 7

End-to-End Probing

• All-pairs probes among n nodes
  - O(n^2) measurement cost
  - Not scalable as n grows
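
To make the O(n^2) cost concrete, here is a minimal sketch that counts the probes needed for one all-pairs round. It assumes probes are directed (A to B and B to A are measured separately, since paths can be asymmetric); the function name is hypothetical, and 353 is the node count quoted later in the deck for ProbeD.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of directed (source, destination) probes among n nodes."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Every node probes every other node once: n * (n - 1) ordered pairs.
    return n * (n - 1)


for n in (10, 100, 353):
    print(n, all_pairs_probe_count(n))
# 10 90
# 100 9900
# 353 124256
```

Growing the node count by 10x grows the per-round probe count by roughly 100x, which is the scaling problem the slide points at.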

Page 8

Key Observation

• Combine passive monitoring with active probing
• Peer-to-Peer (P2P), Content Distribution Network (CDN)
  - Large client population
  - Geographically distributed nodes
  - Large traffic volume
  - Highly diverse paths
• The traffic generated by the services reveals information about the network.

Page 9

Our Approach

• Host service
  - CDN
• Components
  - Passive monitoring
  - Active probing
• Advantages
  - Low overhead
  - Wide coverage

[Figure: diagram with a client, nodes A, B, and C, and routers R1 and R2]

Page 10

MonD: Anomaly Detection

• Anomaly indicators
  - Time-to-live (TTL) change
    • Routing change
  - n consecutive timeouts (n = 4 in current system)
    • Idling period of 3 to 16 seconds
    • Most congestion periods < 220 ms
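
The two indicators above can be sketched as a small per-flow state machine. This is an illustration under stated assumptions, not MonD's actual code: `FlowState`, `observe_packet`, and `observe_timeout` are hypothetical names, and the sketch assumes that any received packet resets the timeout run. Only the threshold n = 4 comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n = 4 consecutive timeouts, as stated on the slide


@dataclass
class FlowState:
    """Per-flow bookkeeping (hypothetical structure for this sketch)."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


def observe_packet(state: FlowState, ttl: int) -> Optional[str]:
    """Record a received packet; report a TTL change if one occurred."""
    # Assumption: any received packet ends the current run of timeouts.
    state.consecutive_timeouts = 0
    changed = state.last_ttl is not None and ttl != state.last_ttl
    state.last_ttl = ttl
    return "ttl_change" if changed else None


def observe_timeout(state: FlowState) -> Optional[str]:
    """Record a timeout; report once the run reaches the threshold."""
    state.consecutive_timeouts += 1
    if state.consecutive_timeouts == TIMEOUT_THRESHOLD:
        return "consecutive_timeouts"
    return None
```

A caller would feed each flow's packets and timeouts through these functions and hand any non-None result to the active prober as a possible anomaly.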

Page 11

ProbeD Operation

• Baseline probes
  - When a new IP appears
  - From local node

• Forward probes
  - When a possible anomaly is detected
  - From multiple nodes (including local node)

• Reprobes
  - At 0.5, 1.5, 3.5 and 7.5 hours later
  - From local node
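
The reprobe schedule can be written down directly. The sketch below only turns the four offsets from the slide into timestamps; the function name and the use of `datetime` are assumptions made for illustration. Note that the gaps between successive reprobes double (0.5, 1, 2, 4 hours).

```python
from datetime import datetime, timedelta
from typing import List

# Offsets after detection, in hours, as listed on the slide.
REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)


def reprobe_times(detected_at: datetime) -> List[datetime]:
    """Timestamps at which the local node would reprobe the path."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


if __name__ == "__main__":
    for t in reprobe_times(datetime(2004, 2, 1, 12, 0)):
        print(t.isoformat())
```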

Page 12

ProbeD Groups

• 353 nodes, 145 sites, 30 groups
  - According to geographic location
  - One traceroute per group

[Figure: bar chart of the number of groups (y-axis, 0 to 11) by region: US (edu), US (non-edu), Canada, Europe, Asia & MidE, Other]
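
"One traceroute per group" can be sketched as picking a single prober from each group. The slides do not say how the node within a group is chosen, so random selection here is an assumption, as are the function name and the group labels in the usage note.

```python
import random
from typing import Dict, List, Optional


def pick_probe_nodes(
    groups: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Choose one node per non-empty group to launch a traceroute from."""
    rng = rng or random.Random()
    # Assumption: any member of a group is an acceptable representative.
    return {name: rng.choice(nodes) for name, nodes in groups.items() if nodes}
```

For example, `pick_probe_nodes({"us-edu": ["a", "b"], "europe": ["c"]})` returns one node name for each of the two groups, so the number of traceroutes per anomaly is bounded by the number of groups (30 in the deck) rather than the number of nodes (353).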

Page 13

Estimating Scope

• Which routers might be affected?
  - Routers which possibly change their next hops
  - Traceroutes from multiple locations can narrow the scope

[Figure: diagram showing routers ra, rb, rc, rd, the client, the local ProbeD, and a remote ProbeD]
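
One plausible reading of the scope idea, sketched under stated assumptions: start from the routers on the baseline path that could have changed their next hop (everything from the last router shared with the new local probe onward), then drop any router that a remote traceroute still shows forwarding to its baseline next hop. The function names and the path representation (lists of router IDs ending at the client) are hypothetical; the slides do not give the exact procedure.

```python
from typing import Iterable, List, Sequence


def candidate_routers(baseline: Sequence[str], local_probe: Sequence[str]) -> List[str]:
    """Baseline routers that may have changed their next hop."""
    i = 0
    while i < min(len(baseline), len(local_probe)) and baseline[i] == local_probe[i]:
        i += 1
    # Start at the last router both paths share; the final element is the
    # destination itself, which has no next hop to change.
    start = max(i - 1, 0)
    return list(baseline[start:-1])


def narrow_scope(
    baseline: Sequence[str],
    candidates: Iterable[str],
    remote_probes: Iterable[Sequence[str]],
) -> List[str]:
    """Remove candidates that a remote probe shows still using the baseline next hop."""
    next_hop = dict(zip(baseline, baseline[1:]))
    unchanged = set()
    for path in remote_probes:
        for router, following in zip(path, path[1:]):
            if next_hop.get(router) == following:
                unchanged.add(router)
    return [r for r in candidates if r not in unchanged]
```

With a baseline of ra, rb, rc, rd, client and a local probe that leaves the baseline after ra, all four routers are candidates; a remote probe that passes through rc, rd, client narrows the scope to ra and rb.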

Page 14

Path Diversity

• Monitoring Period: 02/2004 – 05/2004
• Unique IPs: 887,521
• Traversed ASes: 10,090

[Figure: bar chart of tier coverage (y-axis, 0% to 100%) for Tier 1 through Tier 5, running from Core to Edge; bar annotations in extracted order: 22 ASes, 215 ASes, 1392 ASes, 1420 ASes, 13872 ASes]

Page 15

Confirming Anomalies

• Reported anomalies
  - 2,259,588
• Conditions
  - Loops
  - Route change
  - Partial unreachability
  - ICMP unreachable
• Very conservative confirmation

[Figure: pie chart of reported anomalies: Non-anomaly 66%, Undecided 22%, Anomaly 12%]

Page 16

Confirmed Anomaly Breakdown

• Confirmed anomalies
  - 271,898
  - 2 per minute
  - 100x more
• Temp anomalies
  - Inconsistent probes

[Figure: pie chart of confirmed anomalies: Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]
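
The headline figures can be cross-checked with a few lines of arithmetic. The counts come from the slides; treating the monitoring period as 90 days is an assumption (the deck says 02/2004 to 05/2004 and, in the summary, "3 months").

```python
REPORTED = 2_259_588     # reported anomalies (slide 15)
CONFIRMED = 271_898      # confirmed anomalies (slide 16)
MONITORING_DAYS = 90     # assumption: "3 months" taken as 90 days

# Share of each confirmed category, in percent, from the pie chart.
BREAKDOWN = {
    "Path Change": 44, "Other Outage": 23, "Temp Anomalies": 16,
    "Fwd Outage": 9, "Persist Loop": 7, "Temp Loop": 1,
}

confirmed_fraction = CONFIRMED / REPORTED
per_minute = CONFIRMED / (MONITORING_DAYS * 24 * 60)

print(f"{confirmed_fraction:.1%}")   # 12.0%
print(f"{per_minute:.1f}")           # 2.1
print(sum(BREAKDOWN.values()))       # 100
```

Under that assumption the numbers agree with the slides: about 12% of reports are confirmed, at roughly two confirmed anomalies per minute.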

Page 17

Scope of Loops

• How many routers or ASes are involved?
  - Temp loops involve more routers than persistent loops
  - 97% of persistent loops and 51% of temp loops contain 2 hops

[Figure: bar chart of loop length in hops (2, 3, 4, 5, 6+) for persistent and temp loops, y-axis 0% to 100%; annotations: 1% of persistent loops cross ASes, 15% of temp loops cross ASes]

Page 18

Distribution of Loops

• Many persistent loops in tier-3, few in tier-1
• Worst 10% of tier-1 ASes – implications for largest ISPs
  - 20% traffic
  - 35% persistent loops

[Figure: bar chart by tier (Tier 1 to Tier 5), y-axis 0% to 50%, with series Persistent, Temp, and Traffic]

Page 19

Duration of Persistent Loops

• How long do persistent loops last?
  - Either resolve quickly or last for an extended period

[Figure: bar chart of persistent loop durations, y-axis 0% to 60%, bins: <0.5 hrs, <1.5 hrs, <3.5 hrs, <7.5 hrs, >= 7.5 hrs]

Page 20

Scope of Forward Anomalies

• How many routers or ASes are affected?
  - 60% of outages within 1 hop
  - 75% of outages and 68% of changes within 4 hops

[Figure: cumulative fraction (0 to 1) versus scope in hops (0 to 14) for changes and outages; annotations: 78% of outages within 2 ASes, 57% of changes within 2 ASes]

Page 21

Location of Forward Anomalies

• How close are the anomalies to the edges of the network?
  - 44% of outages at the last hop
  - 72% of outages and 40% of changes within 4 hops

[Figure: cumulative fraction (0 to 1) versus hops (0 to 10) for changes and outages]

Page 22

Distribution of Forward Anomalies

• Which ASes are affected?
  - Tier-1 ASes most stable
  - Tier-3 ASes most likely to be affected

[Figure: bar chart by tier (Tier 1 to Tier 5), y-axis 0% to 50%, with series Change, Outage, and Traffic]

Page 23

Overlay Routing

• Use alternate path when default path fails

[Figure: diagram of a source, a destination, and an intermediate node]

Page 24

Bypassing Anomalies

• How useful is overlay routing for bypassing failures?
  - Effective in 43% of 62,815 failures, lower than previous studies
  - 32% of bypass paths inflate RTTs by more than a factor of two

[Figure: cumulative fraction (0 to 1) versus bypass ratio, log-scale x-axis from 0.1 to 100]
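
A one-hop overlay bypass can be sketched as follows. This is an illustration, not the study's method: it assumes a table of measured RTTs keyed by (from, to), treats a missing entry as "unreachable", and picks the intermediate with the lowest combined RTT. The slides do not define "bypass ratio"; the sketch assumes it is the bypass-path RTT divided by the normal-path RTT. All names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

RttTable = Dict[Tuple[str, str], float]  # (from, to) -> RTT in milliseconds


def pick_bypass(
    rtt: RttTable, source: str, destination: str, intermediates: Iterable[str]
) -> Optional[str]:
    """Return the intermediate with the lowest source->node->destination RTT."""
    best, best_rtt = None, float("inf")
    for node in intermediates:
        first = rtt.get((source, node))
        second = rtt.get((node, destination))
        if first is None or second is None:
            continue  # this intermediate cannot reach one of the endpoints
        if first + second < best_rtt:
            best, best_rtt = node, first + second
    return best  # None means no intermediate bypasses the failure


def bypass_ratio(bypass_rtt: float, normal_rtt: float) -> float:
    """Assumed definition: bypass-path RTT relative to the normal-path RTT."""
    if normal_rtt <= 0:
        raise ValueError("normal_rtt must be positive")
    return bypass_rtt / normal_rtt
```

Under the assumed definition, a ratio above 2 corresponds to the slide's "inflate RTTs by more than a factor of two".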

Page 25

Summary

• Confirm 272,000 anomalies in 3 months
• Persistent and temporary loops
  - Persistent loops have narrower scope, and either resolve quickly or last for a long time
• Path outages and changes
  - Outages closer to edge, narrower scope
• Anomaly distribution
  - Skewed. Tier-1 most stable. Tier-3 most problematic.
• Overlay routing
  - Bypasses 43% of failures, with latency inflation

Page 26

More Information

• In the paper
  - More details about anomaly characteristics
  - End-to-end impacts
  - Classification methodology
  - Optimizations to reduce overheads & improve confirmation rate

• [email protected]
• http://www.cs.princeton.edu/nsg/infoplane

Page 27

Classifying Anomalies

• Temporary vs. persistent loops
  - Whether the traceroute exits the loop before reaching the maximum hop
• Path changes vs. outages
  - Changes: follow different paths to clients
  - Outages: stop at intermediate hops

[Figure: diagram of a ProbeD node probing toward a client]
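
The classification rules above can be sketched as a function over a single traceroute result. This is a simplified reading, not the paper's classifier: hops are router IDs (None for a hop that did not answer), a loop is any repeated router, and `MAX_HOPS = 30` is a common traceroute default assumed here rather than a value from the slides. All names are hypothetical.

```python
from typing import Optional, Sequence

MAX_HOPS = 30  # assumption: typical traceroute default, not from the slides


def find_loop_start(hops: Sequence[Optional[str]]) -> Optional[int]:
    """Index of the first hop that repeats an earlier router, or None."""
    seen = set()
    for i, hop in enumerate(hops):
        if hop is None:
            continue  # unanswered hops cannot be matched against earlier ones
        if hop in seen:
            return i
        seen.add(hop)
    return None


def classify(
    hops: Sequence[Optional[str]],
    client: str,
    baseline: Sequence[str],
    max_hops: int = MAX_HOPS,
) -> str:
    """Label one traceroute as a loop, path change, outage, or normal path."""
    reached = bool(hops) and hops[-1] == client
    if find_loop_start(hops) is not None:
        # Still looping when the probe ran out of hops: persistent.
        if len(hops) >= max_hops and not reached:
            return "persistent_loop"
        return "temporary_loop"
    if reached:
        return "normal" if list(hops) == list(baseline) else "path_change"
    return "outage"  # the probe stopped at an intermediate hop
```

A probe that reaches the client over a different router sequence than the baseline is labelled a path change, while one that ends early is labelled an outage, matching the two sub-bullets on the slide.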

Page 28

Non-anomalies

• Non-anomalies
  - Ultrashort anomalies
  - Path-based TTL
  - Aggressive timeout

Page 29

Identifying Forward Outages

• Forward outages
  - Route change
  - ICMP dest unreachable
  - Forward timeout

[Figure: pie chart of forward outages: Route Change 53%, Fwd timeout 35%, ICMP Unreach 12%]

Page 30

Loop Effect on RTT

• How do loops affect RTTs?
  - Loops can incur high latency inflation

[Figure: cumulative fraction (0 to 1) versus RTT (0 to 4 seconds), with curves: Persist loop, Persist loop normal, Temp loop, Temp loop normal]

Page 31

Loop Effect on Loss Rate

• How do loops affect loss rates?
  - 65% of temporary and 55% of persistent loops preceded by loss rates exceeding 30%

[Figure: cumulative fraction (0 to 1) versus loss rate (x-axis 0 to 1, labeled "loss rate %") for persistent and temp loops]

Page 32

Forward Anomaly Effect on RTT

• How do forward anomalies affect RTTs?
  - Outages and changes can incur latency inflation
  - Outages have a more negative effect on RTTs

[Figure: cumulative fraction (0 to 1) versus RTT (0 to 4 seconds), with curves: change, change normal, outage, outage normal]

Page 33

Forward Anomaly Effect on Loss Rate

• How do forward anomalies affect loss rates?
  - 45% of outages and 40% of changes preceded by loss rates exceeding 30%

[Figure: cumulative fraction (0 to 1) versus loss rate (x-axis 0 to 1, labeled "loss rate %") for changes and outages]

Page 34

Reducing Measurement Overhead

• Can we reduce the number of probes?
  - 15 probes can achieve the same accuracy in 80% of cases
  - Flow-based TTL

[Figure: cumulative fraction (0 to 1) versus number of probes (0 to 30)]

Page 35

Traffic Breakdown By Tiers

[Figure: pie chart of traffic by tier: Tier 1 20%, Tier 2 23%, Tier 3 24%, Tier 4 7%, Tier 5 26%]