Internet Routing (COS 598A) Today: Root-Cause Analysis Jennifer Rexford jrex/teaching/spring2005...


Internet Routing (COS 598A)

Today: Root-Cause Analysis

Jennifer Rexford

http://www.cs.princeton.edu/~jrex/teaching/spring2005

Tuesdays/Thursdays 11:00am-12:20pm

Outline

• Network troubleshooting
  – Motivation for network troubleshooting
  – Investigating from the edge vs. inside

• Active probing
  – Traceroute
  – Mapping IP addresses to AS numbers

• Passive monitoring
  – Analyzing BGP update streams
  – Identifying location and cause of routing change
  – Limitations of the approach

Network Troubleshooting

[Figure: a user reaching www.cnn.com across the Internet asks “Why can’t I reach www.cnn.com?” and “Why is the performance bad?”]

Reachability Problems: What Could be Wrong?

• End-host problem
  – Web server down
  – DNS server down, or misconfigured

• Forwarding-path problem
  – Packet filter or firewall restricting access
  – Mismatch in Maximum Transmission Unit (MTU)

• Routing problem
  – User or server disconnected from Internet
  – Blackhole dropping all packets
  – Persistent loop

Performance Problem: What Could be Wrong?

• End-host problems
  – Overloaded Web server
  – Overloaded DNS server
  – Overloaded user machine

• Forwarding-path problem
  – High round-trip time
  – Link congestion

• Routing problem
  – Long-term routing instability
  – Transient disruption during convergence

Motivation for Troubleshooting

• Improving performance
  – Detect, diagnose, and fix the problem
  – Pick a path through another provider
  – Pick a different path in an overlay network

• Establishing accountability
  – Enforce Service Level Agreements
  – Rate service providers

• Characterizing the Internet
  – Understand causes of performance problems
  – Understand challenges of troubleshooting

Troubleshooting Outside vs. Inside

• Outside: from network edge
  – Who: users, researchers, and operators troubleshooting problems outside their network
  – Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms
  – Challenges: inference from very limited data

• Inside: from inside the network
  – Who: operators running a network
  – Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files
  – Challenges: collecting and joining the data

Today

Active Probing

Pros and Cons of Active Probing

• Advantages
  – Can run from any end system
  – Measure the actual forwarding path
    • See black-holes, loops, and delays directly

• Disadvantages
  – Effects of routing changes, not the cause
  – Current path, not the path used in the past
    • Requires frequent probes to observe the changes
  – Shows only properties of round-trip path
    • Hard to tell if problem is on forward vs. reverse

Traceroute: Measuring the Forwarding Path

• Time-To-Live field in IP packet header
  – Source sends a packet with a TTL of n
  – Each router along the path decrements the TTL
  – “TTL exceeded” sent when TTL reaches 0

• Traceroute tool exploits this TTL behavior

[Figure: source sends probes with TTL=1, TTL=2, … toward destination; routers along the path reply “Time exceeded”]

Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message
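The TTL mechanism can be illustrated with a small simulation. This is only a sketch in Python, not the real tool: no packets are sent, the path is supplied as a list, and the name `simulate_traceroute` and its `max_ttl` parameter are assumptions made for illustration.

```python
def simulate_traceroute(path, max_ttl=30):
    """Simulate traceroute over a known forwarding path.

    path: list of node addresses in forwarding order, ending with the
    destination. Returns a list of (ttl, responding_node) pairs.
    """
    hops = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        reporter = None
        for node in path:
            remaining -= 1          # each node on the path decrements the TTL
            if remaining == 0:
                reporter = node     # TTL hit 0 here, so this node replies
                break
        if reporter is None:
            break                   # probe ran past the end of the simulated path
        hops.append((ttl, reporter))
        if reporter == path[-1]:
            break                   # the destination itself answered; stop probing
    return hops


for ttl, node in simulate_traceroute(["169.229.62.1", "169.229.59.225", "128.32.255.169"]):
    print(ttl, node)
```

In a real trace the destination answers with a different reply type than “time exceeded”; the simulation ignores that distinction and simply stops when the last node responds.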

Example Traceroute Output (Berkeley to CNN)

Hop number, IP address, DNS name

Hop  IP address        DNS name
1    169.229.62.1      inr-daedalus-0.CS.Berkeley.EDU
2    169.229.59.225    soda-cr-1-1-soda-br-6-2
3    128.32.255.169    vlan242.inr-202-doecev.Berkeley.EDU
4    128.32.0.249      gigE6-0-0.inr-666-doecev.Berkeley.EDU
5    128.32.0.66       qsv-juniper--ucb-gw.calren2.net
6    209.247.159.109   POS1-0.hsipaccess1.SanJose1.Level3.net
7    *                 ?
8    64.159.1.46       ?
9    209.247.9.170     pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33     pop2-atm-P0-2.atdn.net
11   *                 ?
12   66.185.136.17     pop1-atl-P4-0.atdn.net
13   64.236.16.52      www4.cnn.com

* = no response from router; ? = no name resolution

Example Troubleshooting Results

• No packets go beyond your gateway
  – Gateway’s connection to Internet is dead

• Traceroute stops at intermediate point
  – Perhaps a blackhole

• Traceroute path has a loop
  – Transient or persistent forwarding loop

• Traceroute shows a very long path
  – Routing anomaly, route hijacking, etc.

• Traceroute shows very long delays
  – Delay or congestion on forward or reverse path
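Two of these symptoms, a trace that stops early and a trace that loops, can be checked mechanically. The Python sketch below is a rough illustration under simplifying assumptions: hops are given as a list of responding addresses with `None` for a missing reply, and the function name `diagnose` and its return strings are made up for this example. A repeated address is only a hint of a loop, since probes taken while the path is changing can look the same.

```python
def diagnose(hops, destination):
    """Classify a traceroute result.

    hops: responding addresses in TTL order, None where no reply arrived.
    destination: the address the trace was aimed at.
    """
    seen = [h for h in hops if h is not None]
    if len(seen) != len(set(seen)):
        return "loop"                      # some address answered more than once
    if not seen or seen[-1] != destination:
        return "stops before destination"  # possible blackhole or filtering
    return "reached destination"


print(diagnose(["a", "b", "a", "b"], "d"))
print(diagnose(["a", "b", None, None], "d"))
print(diagnose(["a", None, "d"], "d"))
```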

Problems with Traceroute

• Missing responses
  – Routers might not send “Time-Exceeded”
  – Firewalls may drop the probe packets
  – “Time-Exceeded” reply may be dropped

• Misleading responses
  – Probes taken while the path is changing
  – Name not in DNS, or DNS entry misconfigured

• Mapping IP addresses
  – Mapping interfaces to a common router
  – Mapping interface/router to Autonomous System

Map Traceroute Hops to ASes

Traceroute output: (hop number, IP)

Hop  IP address        AS        Network
1    169.229.62.1      AS25      Berkeley
2    169.229.59.225    AS25      Berkeley
3    128.32.255.169    AS25      Berkeley
4    128.32.0.249      AS25      Berkeley
5    128.32.0.66       AS11423   Calren
6    209.247.159.109   AS3356    Level3
7    *                 AS3356    Level3
8    64.159.1.46       AS3356    Level3
9    209.247.9.170     AS3356    Level3
10   66.185.138.33     AS1668    AOL
11   *                 AS1668    AOL
12   66.185.136.17     AS1668    AOL
13   64.236.16.52      AS5662    CNN

Need accurate IP-to-AS mappings (for network equipment).
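Once every hop carries an AS number, the AS-level path follows by collapsing consecutive repeats. A minimal Python sketch using the per-hop AS numbers from this slide; the function name `as_path` is an illustrative assumption, and hops with no mapping are assumed to be given as `None` and skipped.

```python
from itertools import groupby


def as_path(hop_as):
    """Collapse per-hop AS numbers into an AS-level path."""
    mapped = (asn for asn in hop_as if asn is not None)  # drop unmapped hops
    return [asn for asn, _ in groupby(mapped)]           # merge consecutive repeats


# Per-hop AS numbers for the Berkeley-to-CNN trace on this slide.
hop_as = [25, 25, 25, 25, 11423, 3356, 3356, 3356, 3356, 1668, 1668, 1668, 5662]
print(as_path(hop_as))  # [25, 11423, 3356, 1668, 5662]
```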

Candidate Ways to Get IP-to-AS Mapping

• Routing address registry
  – Voluntary public registry such as whois.radb.net
  – Used by prtraceroute and “NANOG traceroute”
  – Incomplete and quite out-of-date
    • Mergers, acquisitions, delegation to customers

• Origin AS in BGP paths
  – Public BGP routing tables such as RouteViews
  – Used to translate traceroute data to an AS graph
  – Incomplete and inaccurate… but usually right
    • Multiple Origin ASes, no mapping, wrong mapping

Example: BGP Table (“show ip bgp” at RouteViews)

    Network          Next Hop         Metric  LocPrf  Weight  Path
*   3.0.0.0/8        205.215.45.50                         0  4006 701 80 i
*                    167.142.3.6                           0  5056 701 80 i
*                    157.22.9.7                            0  715 1 701 80 i
*                    195.219.96.239                        0  8297 6453 701 80 i
*                    195.211.29.254                        0  5409 6667 6427 3356 701 80 i
*>                   12.127.0.249                          0  7018 701 80 i
*                    213.200.87.254      929               0  3257 701 80 i
*   9.184.112.0/20   205.215.45.50                         0  4006 6461 3786 i
*                    195.66.225.254                        0  5459 6461 3786 i
*>                   203.62.248.4                          0  1221 3786 i
*                    167.142.3.6                           0  5056 6461 6461 3786 i
*                    195.219.96.239                        0  8297 6461 3786 i
*                    195.211.29.254                        0  5409 6461 3786 i

AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T, AS 3786 is DACOM (Korea), AS 1221 is Telstra
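The origin AS is the last AS in each path, so this table yields the mappings 3.0.0.0/8 to AS 80 and 9.184.112.0/20 to AS 3786. A lookup then picks the longest matching prefix for an address. The Python sketch below uses the standard `ipaddress` module; the names `ORIGIN_AS` and `ip_to_as` are assumptions for illustration, and a real table would hold far more prefixes and use a trie rather than a linear scan.

```python
import ipaddress

# Prefix -> origin AS, taken from the last AS in the paths of the table above.
ORIGIN_AS = {
    ipaddress.ip_network("3.0.0.0/8"): 80,
    ipaddress.ip_network("9.184.112.0/20"): 3786,
}


def ip_to_as(ip, table=ORIGIN_AS):
    """Return the origin AS of the longest prefix covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in table if addr in net]
    if not matches:
        return None                                   # no covering prefix in the table
    best = max(matches, key=lambda net: net.prefixlen)  # longest-prefix match
    return table[best]


print(ip_to_as("3.1.2.3"))       # 80
print(ip_to_as("9.184.120.1"))   # 3786
print(ip_to_as("10.0.0.1"))      # None
```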

Why Would IP-to-AS Mapping Be Wrong?

• IP addresses of equipment
  – Interfaces on the routers, not end hosts
  – Identifies equipment in routing protocols
  – Doesn’t need to be globally visible or consistent

• Three reasons the mappings may be “wrong”
  – Addresses of Internet Exchange Points
  – Sibling ASes that share address space
  – ASes that don’t announce their addresses

• Look at traceroute path vs. BGP AS path
  – Traceroute path after IP-to-AS mapping
  – BGP AS path taken from the BGP table

Extra AS due to Internet eXchange Points

• IXP: shared place where providers meet
  – E.g., Mae-East, Mae-West, PAIX
  – Large number of fan-in and fan-out ASes

[Figure: traceroute AS path vs. BGP AS path among ASes A through G; the traceroute AS path contains an extra AS hop at the exchange point]

Ignore extra traceroute AS hop with high fan-in and fan-out

Extra AS due to Sibling ASes

• Sibling: organizations with multiple ASes
  – E.g., Sprint AS 1239 and AS 1791
  – One AS numbers equipment with addresses of another

[Figure: traceroute AS path vs. BGP AS path among ASes A through H; the traceroute AS path contains an extra sibling AS]

Merge sibling ASes that “belong together” as if they were one AS.

Unannounced Infrastructure Addresses

[Figure: traceroute AS paths vs. BGP AS paths among ASes A, B, and C, with the covering prefix 12.0.0.0/8]

C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)

Fix the IP-to-AS map to associate 12.1.2.0/24 with C

Refining Initial IP-to-AS Mapping

• Start with initial IP-to-AS mapping
  – Mapping from BGP tables is usually correct
  – Good starting point for computing the mapping

• Collect many BGP and traceroute paths
  – Signaling and forwarding AS path usually match
  – Good way to identify mistakes in IP-to-AS map

• Successively refine the IP-to-AS mapping
  – Find add/change/delete that makes big difference
  – Base these “edits” on operational realities

http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf

http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf

Research Areas

• Better version of traceroute
  – Router support for active measurement
  – IPPM (IP Performance Measurement)
  – http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html

• Peer-to-peer troubleshooting

[Figure: peers report whether they can reach www.cnn.com, some answering “No” and some “Yes”]

Passive Monitoring

Limitations of Active Measurements

• Active measurements: traceroute-like tools
  – Can’t probe in the past
  – Shows the effect, not the cause

[Figure: User(s) and Web Server (d) connected through AS 1, AS 2, AS 3, and AS 4]

Appealing to Peek Inside

• Passive measurements: public BGP data

[Figure: Data Collection (RouteViews, RIPE) supplies BGP update feeds to Data Correlation, which outputs the root cause]

Inspect BGP Routing Changes

• Changes in paths to reach destination d
  – AS 1: “1 3 4” → “1 2 4”
  – AS 2: “2 4” (no change)
  – AS 3: “3 4” → “3 1 2 4”
  – AS 4: “4” (no change)

[Figure: User(s) and Web Server (d) connected through AS 1, AS 2, AS 3, and AS 4]

Idea #1: ASes in Paths Undergoing Change

• Key assumption
  – “The AS responsible for the change appears in the old and/or the new AS path to the destination.”

• If an AS has a routing change
  – All ASes in old and new paths may be responsible
  – Call these ASes the “suspect set”

• Combining across vantage points
  – Consider all ASes that had a routing change
  – Perform the intersection across the suspect sets
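A small Python sketch of this idea, applied to the path changes for destination d listed earlier (AS 1: “1 3 4” to “1 2 4”, AS 3: “3 4” to “3 1 2 4”). The function names `suspect_set` and `intersect_suspects` are illustrative assumptions, not part of any tool.

```python
def suspect_set(old_path, new_path):
    """All ASes on the old or the new path are suspects."""
    return set(old_path) | set(new_path)


def intersect_suspects(changes):
    """changes: {vantage AS: (old_path, new_path)} for ASes that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes.values()]
    return set.intersection(*sets) if sets else set()


changes = {
    1: ([1, 3, 4], [1, 2, 4]),
    3: ([3, 4], [3, 1, 2, 4]),
}
print(sorted(intersect_suspects(changes)))  # [1, 2, 3, 4]
```

In this example the intersection still contains every AS, which shows why the idea alone may not narrow the suspects.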

Idea #2: Excluding ASes in Non-Changing Paths

• Key assumption
  – “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.”

• Example
  – AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4}
  – AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4}
  – AS 3: “3 4” (no change): non-suspects {3, 4}

[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
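The example on this slide can be worked through in a few lines of Python: intersect the suspect sets of the ASes that saw a change, then remove every AS that lies on an unchanged path. The function names are illustrative assumptions.

```python
def suspects(old_path, new_path):
    """All ASes on the old or the new path."""
    return set(old_path) | set(new_path)


def root_cause_candidates(changed, unchanged):
    """changed: list of (old_path, new_path); unchanged: list of stable paths."""
    sets = [suspects(old, new) for old, new in changed]
    candidates = set.intersection(*sets) if sets else set()
    for path in unchanged:
        candidates -= set(path)   # ASes on a stable path are excluded
    return candidates


changed = [([1, 2, 4], [1, 2, 3, 4]), ([2, 4], [2, 3, 4])]
unchanged = [[3, 4]]
print(root_cause_candidates(changed, unchanged))  # {2}
```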

Idea #3: Blaming the ASes in the Better Path

• Key assumption
  – “The better path is the one that contains the AS responsible for the change.”

• Example
  – “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3)

• Heuristics for identifying the “better” path
  – E.g., the shorter AS path

[Figure: topology of AS 1, AS 2, AS 3, and AS 4]
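A minimal Python sketch of this heuristic, taking “better” to mean the shorter AS path as the slide suggests. The function name `better_path_suspects` is an assumption, and breaking ties in favor of the old path is an arbitrary choice made for this example.

```python
def better_path_suspects(old_path, new_path):
    """Blame only the ASes on the better (here: shorter) of the two paths."""
    better = min(old_path, new_path, key=len)  # on a tie, min() keeps old_path
    return set(better)


print(better_path_suspects([1, 2, 4], [1, 2, 3, 4]))  # {1, 2, 4}
```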

Idea #4: Combining Across Destinations

• Key assumption
  – “All destinations experiencing routing changes in a short period of time have a common cause.”

• Exploiting the observation
  – Form suspect sets for each destination
  – Perform intersections of the sets across the destinations
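One possible way to sketch this in Python is to bucket routing changes into fixed time windows and intersect the suspect sets within each window. Everything here is hypothetical: the event tuples, the 60-second window, and the function name `suspects_by_window` are assumptions chosen only to illustrate the idea.

```python
from collections import defaultdict


def suspects_by_window(events, window=60):
    """events: iterable of (timestamp_seconds, destination, suspect_set).

    Returns {window_index: ASes suspected for every destination that
    changed within that window}.
    """
    buckets = defaultdict(list)
    for timestamp, _destination, suspects in events:
        buckets[int(timestamp // window)].append(set(suspects))
    return {w: set.intersection(*sets) for w, sets in buckets.items()}


# Hypothetical events: two destinations change close together, one much later.
events = [
    (5, "d1", {1, 2, 4}),
    (20, "d2", {2, 3, 4}),
    (130, "d3", {5, 6}),
]
print(suspects_by_window(events))
```

A fixed window is a crude grouping rule; changes that straddle a window boundary would be split, so the window size matters.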

Difficulties With Root-Cause Analysis

• Misleading BGP routing changes
  – Responsible AS not on old or new path
  – Looking across destinations doesn’t resolve

• Missing routing changes
  – Some routers in an AS don’t have a change
  – Some subnets are not visible in BGP
  – Some internal changes are not visible in BGP

Misleading BGP Changes

Myth: The AS responsible for the change appears in the old or the new AS path.

[Figure: topology of ASes 1 through 11 with a BGP data collection point; old path: 1, 2, 8, 9, 10; new path: 1, 4, 5, 6, 7, 10]

Misleading BGP Changes

Myth: Looking at routing changes across prefixes resolves causes.

[Figure: routers A, B, and C with a BGP data collection point; AS 1, AS 2, and AS 3; destinations d1, d2, and d3; labels 10, 7, and 12]

Changes for d2, but not for d1 and d3

Missing Routing Changes

Myth: The BGP updates from a single router accurately represent the AS.

[Figure: routers A, B, C, and D in AS 1 and AS 2 with a BGP data collection point; destination dst; labels 6, 12, 10, and 7; annotation “No change”]

Missing Routing Changes

Myth: BGP data from a router accurately represents changes on that router.

[Figure: router A with a BGP data collection point; prefixes 12.1.1.0/24 and 12.1.0.0/16]

Missing Routing Changes

Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.

[Figure: routers A, B, C, and D in AS 1 and AS 2 with a BGP data collection point; destination dst; labels 6, 12, 10, 5, and 7]

Hybrid of Active and Passive Monitoring

[Figure: User(s) and Web Server (d) connected through AS 1, AS 2, AS 3, and AS 4, with Omni 1 through Omni 4; queries (i,s,d,t) and (j,s,d,t’) are both annotated “failure link (3,4)”]

Research Questions

• Understanding if root-cause analysis can work
  – How many vantage points are needed?
  – Do the assumptions usually hold?
  – Can algorithms tolerate occasional violations?
  – Can some additional information help?

• Distributed algorithms for root-cause analysis
  – Can ASes cooperate in distributed fashion?
  – How to prevent or detect ASes that cheat?
  – Do all ASes have to participate?
  – Other hybrids of active and passive monitoring?

Conclusions

• Troubleshooting is important
  – Detect, diagnose, and fix problems
  – Accountability and service-level agreements

• Troubleshooting is hard
  – Active measurement (e.g., traceroute) not enough
  – Root-cause analysis techniques are not enough

• New innovation necessary
  – Hybrid active/passive approaches
  – Router support for active measurement
  – Routing protocol extensions for troubleshooting

For Next Time: From Inside an AS

• Two papers
  – “OSPF monitoring: Architecture, design, and deployment experience”
  – “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network”

• Optional reading
  – Materials from Packet Design and Ipsum Networks

• Review only of first paper
  – Summary
  – Why accept
  – Why reject
  – Future work