
PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, and Randy Wang
Department of Computer Science

Princeton University

Abstract

Detecting network path anomalies generally requires examining large volumes of traffic data to find misbehavior. We observe that wide-area services, such as peer-to-peer systems and content distribution networks, exhibit large traffic volumes, spread over large numbers of geographically-dispersed endpoints. This makes them ideal candidates for observing wide-area network behavior. Specifically, we can combine passive monitoring of wide-area traffic to detect anomalous network behavior, with active probes from multiple nodes to quantify and characterize the scope of these anomalies.

This approach provides several advantages over other techniques: (1) we obtain more complete and finer-grained views of failures since the wide-area nodes already provide geographically diverse vantage points; (2) we incur limited additional measurement cost since most active probing is initiated when passive monitoring detects oddities; and (3) we detect failures at a much higher rate than other researchers have reported since the services provide large volumes of traffic to sample. This paper shows how to exploit this combination of wide-area traffic, passive monitoring, and active probing, to both understand path anomalies and to provide optimization opportunities for the host service.

1 Introduction

As the Internet grows and routing complexity increases, network-level instabilities are becoming more common. Among the problems causing end-to-end path failures are router misconfigurations [16], maintenance, power outages, and fiber cuts [15]. Inter-domain routers may take tens of minutes to reach a consistent view of the network topology after network failures, during which time end-to-end paths may experience outages, packet losses and delays [14]. These routing problems can affect performance and availability [1, 14], especially if they occur on commonly-used network paths. However, even determining the existence of such problems is nontrivial, since no central authority monitors all Internet paths.

Previously, researchers have used routing messages, such as BGP [16], OSPF [15] and IS-IS [12] update messages, to identify inter-domain and intra-domain routing failures. This approach usually requires collecting routing updates from multiple vantage points, which may not be easily accessible for normal end users. Other researchers have relied on some form of distributed active probing, such as pings and traceroutes [1, 7, 19], to detect path anomalies from end hosts. These approaches monitor the paths between pairs of hosts by having them repeatedly probe each other. Because this approach requires cooperation from both source and destination hosts, these techniques measure only paths among a limited set of participating nodes.

We observe that there exist several wide-area services employing multiple geographically-distributed nodes to serve a large and dispersed client population. Examples of such services include Content Distribution Networks (CDNs), where the clients are distinct from the nodes providing the service, and Peer-to-Peer (P2P) systems, where the clients also participate in providing the service. In these kinds of systems, the large number of clients use a variety of network paths to communicate with the service, and are therefore likely to see any path instabilities that occur between them and the service nodes.

This scenario of geographically-distributed clients accessing a wide-area service can itself be used as a monitoring infrastructure, since the natural traffic generated by the service can reveal information about the network paths being used. By observing this traffic, we can passively detect odd behavior and then actively probe it to understand it in more detail. This approach produces less overhead than a purely active-probing based approach.

This monitoring can also provide direct benefit to the wide-area service hosting the measurement infrastructure. By characterizing failures, the wide-area service can mitigate their impact. For example, if the outbound path between a service node and a client suddenly fails, it may be possible to mask the failure by sending outbound traffic indirectly through an unaffected service node, using techniques such as overlay routing [1]. More flexible services may adapt their routing decisions, and have clients use service nodes that avoid the failure entirely. Finally, a history of failure may motivate placement decisions: a service may opt to place a service node within an ISP if intra-ISP paths are more reliable than paths between it and other ISPs.

OSDI '04: 6th Symposium on Operating Systems Design and Implementation, USENIX Association

This paper describes a monitoring system, PlanetSeer, that has been running on PlanetLab since February 2004. It passively monitors traffic between PlanetLab and thousands of clients to detect anomalous behavior, and then coordinates active probes from many PlanetLab sites to confirm the anomaly, characterize it, and determine its scope. We are able to confirm roughly 90,000 anomalies per month using this approach, which exceeds the rate of previous active-probing measurements by more than two orders of magnitude [7]. Furthermore, since we can monitor traffic initiated by clients outside PlanetLab, we are also able to detect anomalies beyond those seen by a purely active-probing approach.

In describing PlanetSeer, this paper makes three contributions. First, it describes the design of the passive and active monitoring techniques we employ, and presents the algorithms we use to analyze the failure information we collect. Second, it reports the results of running PlanetSeer over a three-month period, including a characterization of the failures we see. Third, it discusses opportunities to exploit PlanetSeer diagnostics to improve the level of service received by end users.

2 Background

Although the Internet is designed to be self-healing, various problems can arise in the protocols, implementations [15], and configurations [16] to cause network instability. Even in the absence of such problems, routing updates can take time to propagate, so failures may be visible for minutes rather than seconds. Even though tools like ping and traceroute exist for diagnosing network problems, pinpointing failures and determining their origins is nontrivial for several reasons:

Network paths are often asymmetric. Paxson observed that 49% of node pairs have different forward and reverse paths which visit at least one different city [19]. Since traceroute only maps the forward path, it is hard to infer whether the forward or reverse path is at fault without cooperation from the destination. PlanetSeer combines passive monitoring results, path history information, and multi-point probing to isolate forward failures.

Failure origin may differ from failure appearance. Routing protocols, such as BGP and OSPF, may propagate failure information to divert traffic away from failed links. When a traceroute stops at a hop, it is often the case that the router has received a routing update to withdraw that path, leaving no route to the destination.

Failure durations are highly varied. Some failures, like routing loops, can last for days. Others may persist for less than a minute. This high variance makes it hard to diagnose failures and react in time.

Failure isolation requires broad coverage. Historically, few sites had enough network coverage to initiate enough traceroutes to identify all affected paths. The advent of public traceroute servers [24] provides some resources to help manually diagnose problems, and tools such as ScriptRoute [22] can help automate the process.

The advent of wide-coverage networking testbeds like PlanetLab has made it possible to deploy wide-area services on a platform with enough network coverage to also perform the probing on the same system. This approach addresses the problems listed above: (1) when clients initiate connections to the wide-area service, we obtain a forward path that we can monitor for anomalies; (2) PlanetLab nodes span a large number of diverse autonomous systems (ASes), providing reasonable network coverage to initiate probing; and (3) active probing can be launched as soon as problems are visible in the passively-monitored traffic, making it possible to catch even short-term anomalies that last only a few minutes.

While our focus is techniques for efficiently identifying and characterizing network anomalies, we must give some attention to the possibility of our host platform affecting our results. In particular, it has been recently observed that intra-PlanetLab paths may not be representative of the Internet [2], since these nodes are often hosted on research-oriented networks. Fortunately, by monitoring services with large client populations, we conveniently bypass this issue since most of the paths being monitored terminate outside of PlanetLab. By using geographically-dispersed clients connecting to a large number of PlanetLab nodes, we observe more than just intra-PlanetLab connections.

3 PlanetSeer Operation

This section describes our environment and our approach, including how we monitor traffic, identify potential path anomalies, and actively probe them.

3.1 Components

We currently use the CoDeeN Content Distribution Network [26] as our host wide-area service, since it attracts a reasonably large number of clients (7K-12K daily) and generates a reasonable traffic volume (100-200 GB/day, 5-7 million reqs/day). CoDeeN currently runs on 120


Our current threshold is four consecutive timeouts, which corresponds to 3.2–16 seconds. Since most congestion periods on today's Internet are short-lived (95% are less than 220ms [27]), these consecutive timeouts are likely due to path anomalies. We can further subdivide this case based on whether MonD is on the sender or receiver side of the flow. If MonD is on the receiver side, then no ACKs are reaching the sender, and we can infer the problem is on the path from the CoDeeN node to the client/server, which we call the forward path. If MonD is on the sender side, then we cannot determine whether outbound packets are being lost or whether ACKs are being lost on the way to MonD.

3.4 MonD Flow/Path Statistics

We now describe how MonD infers path anomalies after grouping packets into flows. We examine how to measure the per-flow RTTs, timeouts and retransmissions.

To detect timeouts when MonD is on the sender side, we maintain two variables, SendSeqNo and SendRtxCount, for each flow. SendSeqNo is the sequence number (seqno) of the most recently sent new packet, while SendRtxCount is a count of how many times the packet has been retransmitted. If we use CurrSeqNo to represent the seqno of the packet currently being sent, we see three cases: If CurrSeqNo > SendSeqNo, the flow is making progress, so we clear SendRtxCount and set SendSeqNo to CurrSeqNo. If CurrSeqNo < SendSeqNo, the packet is a fast retransmit, and we again set SendSeqNo to CurrSeqNo. If CurrSeqNo = SendSeqNo, we conclude a timeout has occurred and we increment SendRtxCount. If SendRtxCount exceeds our threshold, MonD notifies ProbeD that a possible path anomaly has occurred.
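The three sender-side cases above can be sketched in a few lines. This is an illustrative reconstruction in Python using the paper's variable names, not MonD's actual implementation:

```python
# Sketch of MonD's sender-side timeout detection. The threshold of four
# consecutive timeouts is the paper's value; the class structure is an
# assumption for illustration.

RETRANSMIT_THRESHOLD = 4  # four consecutive timeouts (roughly 3.2-16s)

class SenderFlowState:
    def __init__(self):
        self.send_seq_no = -1    # seqno of most recently sent new packet
        self.send_rtx_count = 0  # consecutive retransmissions of that packet

    def on_packet_sent(self, curr_seq_no):
        """Returns True if a possible path anomaly should be reported."""
        if curr_seq_no > self.send_seq_no:
            # Flow is making progress: a new data packet.
            self.send_seq_no = curr_seq_no
            self.send_rtx_count = 0
        elif curr_seq_no < self.send_seq_no:
            # Fast retransmit of an earlier packet.
            self.send_seq_no = curr_seq_no
        else:
            # Same seqno resent: a timeout-driven retransmission.
            self.send_rtx_count += 1
            if self.send_rtx_count >= RETRANSMIT_THRESHOLD:
                return True  # notify ProbeD
        return False
```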

A similar mechanism is used when MonD observes the receiver side of a TCP connection. It keeps track of the largest seqno received per flow, and if the current packet has the same seqno, a counter is incremented. Doing this determines how many times a packet has been retransmitted due to consecutive timeouts at the sender. When this counter hits our threshold, MonD notifies ProbeD that this sender is not seeing our ACKs. Since we are seeing the sender's packets, we know that this direction is working correctly. Note that this method assumes that duplicate packets are mostly due to retransmissions at the sender. This assumption is safe because previous work has shown that packets are rarely duplicated by the network [20].

Detecting TTL change is trivial: MonD records the TTL for the first packet received along each path. For each packet received from any flow on the same path, we compare its TTL to our recorded value. If MonD detects any change, it notifies ProbeD that a possible anomaly has occurred. Note that this case can aggregate information from all flows along the path.
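The path-based TTL check can be summarized as follows; the dict keyed by remote IP is an assumption about how paths are identified, made for illustration:

```python
# Minimal sketch of MonD's path-based TTL-change detection: remember the
# TTL of the first packet seen on each path, and flag any later packet on
# that path whose TTL differs. This aggregates across all flows sharing
# the path, as described in the text.

first_ttl = {}  # path (remote IP) -> TTL of first packet observed

def check_ttl(path, ttl):
    """Returns True if a TTL change (possible anomaly) is detected."""
    if path not in first_ttl:
        first_ttl[path] = ttl  # record the baseline TTL for this path
        return False
    return first_ttl[path] != ttl
```

Note that, as Section 4.2 discusses, a path-based check like this can raise false alarms when clients with different initial TTLs share a NAT box.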

3.5 ProbeD Operation

ProbeD is responsible for the active probing portion of PlanetSeer, and generally operates after being notified of possible anomalies by MonD. For the purpose of the following discussion, when an anomaly occurs, we call the CoDeeN node where the anomaly is detected the local node, and the corresponding remote client/server the destination. The active probing is performed using traceroute, a standard network diagnostic tool. ProbeD supports three probing operations:

Baseline Probes: When a new IP address is added to MonD's path table, the ProbeD on the local node performs a "baseline probe" to that destination. It is expected that the results of this traceroute reflect the default network path used to communicate with that destination under normal conditions. For actively-used communication paths, a baseline probe is launched once every 30 minutes. When PlanetSeer is run on CoDeeN nodes, these baseline probes are generated whenever a new client connects to a node, or when a node has to contact a new origin server.

Forward Probes: When a possible anomaly is detected by the local MonD and reported to ProbeD, it invokes multiple traceroutes from a set of geographically distributed nodes (including itself) to the destination; we call the traceroute from the local node the local traceroute or local path. This process is performed twice, generally within one minute, in order to identify what we term ultrashort anomalies. On particularly problematic paths, MonD might report possible anomalies very frequently, especially if the path is very unstable. To avoid generating too much probing traffic, ProbeD rate-limits the forward probes so that it does not probe the same destination within 10 minutes.

Reprobes: If the forward probes confirm an anomaly along a path to a destination, the local ProbeD that initiated the forward probes periodically reprobes that path to determine the duration and effects of the anomaly. We currently reprobe four times, at 0.5, 1.5, 3.5, and 7.5 hours after the anomaly detection time. These reprobes can compare their traceroute results with the original baseline probe as well as the forward probes.
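The three probe operations above are governed by a few fixed timing rules. A minimal sketch of those rules, with illustrative function names (times in seconds):

```python
# Timing rules for ProbeD's probe operations, as described above: baseline
# probes every 30 minutes, a 10-minute per-destination rate limit on
# forward probes, and reprobes at 0.5, 1.5, 3.5, and 7.5 hours after
# detection. Function and variable names are assumptions for illustration.

BASELINE_INTERVAL = 30 * 60              # baseline probe every 30 minutes
FORWARD_RATE_LIMIT = 10 * 60             # one forward-probe burst per 10 min
REPROBE_OFFSETS = [0.5, 1.5, 3.5, 7.5]   # hours after anomaly detection

last_forward = {}  # destination -> time of last forward probe

def should_forward_probe(dest, now):
    """Apply the 10-minute per-destination rate limit on forward probes."""
    if now - last_forward.get(dest, float("-inf")) < FORWARD_RATE_LIMIT:
        return False
    last_forward[dest] = now
    return True

def reprobe_times(detect_time):
    """Absolute times at which a confirmed anomaly is reprobed."""
    return [detect_time + h * 3600 for h in REPROBE_OFFSETS]
```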

3.6 ProbeD Mechanics

When ProbeD performs the forward probes, it launches them from geographically distributed nodes on PlanetLab. Compared with only doing traceroute from the local node, using multiple vantage points gives us a more complete view of the anomaly, such as its location, pattern, and scope. Our ProbeDs are running on 353 nodes across 145 sites on PlanetLab, more than the number of nodes running CoDeeN. They are distributed across North/South America, Europe, Asia and elsewhere.

Category       Grps   Sites   Descriptions
US (edu)       11     70      US Universities
US (non-edu)   5      13      Intel, HP, NEC, etc.
Canada         2      11      Eastern & Western Canada
Europe         7      31      UK, France, Germany, etc.
Asia & MidE    4      14      China, Korea, Israel, etc.
Others         1      6       Australia, Brazil, etc.
Total          30     145

Table 1: Groups of the probing sites

Since the set of ProbeDs must communicate with each other, they keep some membership information to track liveness. We note that an unavailable ProbeD only results in a degradation of information quality, rather than complete inoperability, so we do not need to aggressively track ProbeD health. Each ProbeD queries the others when it first starts. Thereafter, dead ProbeDs are checked every 8 hours to reduce unneeded communication. In the course of operation, any ProbeD that is unresponsive to a query is considered dead.

We divide the ProbeD nodes into 30 groups based on geographic diversity, as shown in Table 1, mainly to reduce the number of probes launched per anomaly. Probing every anomaly from every ProbeD location would yield too much traffic (especially to sites with conservative intrusion detection systems), and the extra traffic may not yield much insight if many nodes share the same paths to the anomaly. We also divide North America into educational and non-educational groups, because the educational (.edu) sites are mostly connected to Internet2, while non-educational sites are mostly connected by commercial ISPs.

When a ProbeD receives a request from its local MonD, it forwards it to a ProbeD in each of the other groups. The ProbeDs perform the forward probes, and send the results back to the requester. All results are collected by the originator, and logged with other details, such as the remote IP address and the current time.

3.7 Path Diversity

We have been running PlanetSeer since February 2004. In three months, we have seen 887,521 unique client IPs, coming from 9,232 Autonomous Systems (ASes) (according to previous IP-to-AS mapping techniques [17]). Our probes have traversed 10,090 ASes, well over half of all ASes on the Internet. We use a hierarchical AS classification scheme that has five tiers, based on AS relationships and connectivity [23]. The highest layer (tier 1) represents the dense core of the Internet, and consists of 22 ASes of the largest ISPs. Tier 2 typically includes ASes of other large national ISPs. Tier 3 includes ASes of regional ISPs. Finally, tiers 4 and 5 include ASes of small regional ISPs and customer ASes respectively. As shown in Table 2, we have very good coverage of the top 4 AS tiers, with complete coverage of tier 1 and nearly-complete coverage of tier 2.

Tier       # Covered   Tier Size   Coverage
Tier 1     22          22          100%
Tier 2     207         215         96%
Tier 3     1119        1392        80%
Tier 4     1209        1420        85%
Tier 5     5906        13872       43%
Unmapped   1627

Table 2: Path diversity

4 Confirming Anomalies

Having collected the passive data from MonD and the traceroutes from ProbeD, the next step is processing the probes to confirm the existence of the anomaly. This section describes how we use this data to classify anomalies and quantify their scope. It also reports how different types of anomalies influence end-to-end performance.

4.1 Massaging Traceroute Data

Some of the data we receive from the traceroutes is incomplete or unusable, but we can often perform some simple processing on it to salvage it. The unusable hops in a traceroute are those that do not identify the routers or that identify the routers by special-use IP addresses [11]. The missing data is generally the absence of a hop, and can be interpolated from other traceroutes.

Identifying and pruning unusable hops in a traceroute is simple: the unusable hops are identified by asterisks in place of names, and other than showing the existence of these hops, these data items provide no useful information. We simply remove them from the traceroute but keep the relative hop count difference between the existing hops.

Missing hops in a traceroute are slightly harder to detect, but we can use overlapping portions of multiple traceroutes to infer where they occur. We use a simple heuristic to identify the missing hops: we compare traceroutes that share the same destination, and if the hops leading to the destination differ by an intermediate hop that is present in one and missing in the other, we replace the missing hop with the address in the other trace. Put more formally, given two traceroutes from two sources to the same destination, suppose there are two subsequences of these two traceroutes, X(X1, X2, ..., Xm) and Y(Y1, Y2, ..., Yn) (m > 2 and n > 2) such that X1 = Y1 and Xm = Yn. We define hop(Xi) to be the hop count of Xi. Note that the number of hops between X1 and Xm, hop(Xm) − hop(X1), can be greater than m − 1, because we have removed "*" and special-use IPs from the traceroutes. If hop(Xm) − hop(X1) = hop(Yn) − hop(Y1), it is very likely that all the hops in X and Y are the same since they merge at X1(Y1) and follow the same number of hops to Xm(Yn). If there exists Xi such that the hop corresponding to hop(Y1) + hop(Xi) − hop(X1) in Y does not exist because it has been removed as "*", we consider Xi a missing hop in Y and add this hop into Y.

TTL Change      Timeout         Fwd Timeout
994485 (44%)    754434 (33%)    510669 (23%)

Table 3: Breakdown of anomalies reported by MonD

Our approach to inserting missing hops helps us filter out the "noise" in traceroutes so that it does not confuse our anomaly confirmation using route change, as described in Section 4.2. However, it may mask certain hop differences. For example, we sometimes see two paths X and Y merge at X1(Y1), and diverge at some later hops before merging at Xm(Yn) again. This usually occurs between tightly-coupled routers for load balancing reasons, where a router selects the next hop from several parallel links based on the packet IP addresses and/or traffic load [19]. In this case, inserting missing hops may eliminate the different hops between the two traceroutes.

For our purposes, we do not treat such "fluttering" [19] as anomalies because it occurs as a normal practice. We detect fluttering by looking for hops Xi and Yj such that hop(Yj) − hop(Y1) = hop(Xi) − hop(X1) but Xi ≠ Yj, and we merge them into the same hop in all the traceroutes. Note that this could also possibly eliminate the hop difference caused by path change and lead to underestimating the number of anomalies.
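The missing-hop heuristic can be sketched as follows. The dict-based trace representation (hop count mapped to router IP, with pruned "*" hops absent) and the function name are assumptions made for illustration; the real system must also handle fluttering as described above:

```python
# Simplified sketch of the missing-hop interpolation heuristic: find hops
# where two traces to the same destination report the same router (anchor
# pairs), and if the hop span between two anchors is equal in both traces,
# copy any hop present in one trace into the aligned, pruned position of
# the other.

def fill_missing_hops(x, y):
    """x, y: dict hop_count -> router IP for two traces to one destination.
    Inserts into y the hops x shows at aligned positions; returns the
    number of hops added."""
    added = 0
    # Anchor pairs: positions where both traces report the same router.
    anchors = [(hx, hy) for hx, ipx in x.items()
                        for hy, ipy in y.items() if ipx == ipy]
    for (hx1, hy1) in anchors:
        for (hxm, hyn) in anchors:
            if hxm <= hx1:
                continue
            # Equal hop spans between the anchors => likely the same path.
            if hxm - hx1 != hyn - hy1:
                continue
            for hx in range(hx1 + 1, hxm):
                if hx in x:
                    hy = hy1 + (hx - hx1)  # aligned position in y
                    if hy not in y:
                        y[hy] = x[hx]      # fill the pruned "*" hop
                        added += 1
    return added
```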

4.2 Final Confirmation

After we have processed the traceroutes, we are ready to decide whether an event reported by MonD is actually an anomaly. We consider an anomaly "confirmed" if any of the following conditions is met:

Loops: There is a loop in the local traceroute from the local node to the destination, which means the anomaly is triggered by routing loops.

Route change: The local traceroute disagrees with the baseline traceroute. Note that the baseline traceroute is no more than 30 minutes old. Given that 91% of Internet paths remain stable for more than several hours [19], the anomaly is most likely caused by path change or path outage.

Non-Anomaly      Anomaly         Undecided
1484518 (66%)    271898 (12%)    503172 (22%)

Table 4: Breakdown of reported anomalies using the four confirmation conditions

Partial unreachability: The local traceroute stops before reaching the destination, but there exist traceroutes from other nodes that reach the destination. This could be caused by path outages.

Forwarding failures: The local traceroute returns an ICMP destination unreachable message, with a code of net unreachable, host unreachable, net unknown, or host unknown. This usually indicates that a router does not know how to reach the destination because of routing failures [13].
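The four conditions can be summarized as a simple classifier. The field names below ('hops', 'reached', 'icmp_error') are illustrative assumptions about the trace data, not the paper's actual format, and the loop test here is deliberately crude (Section 5 describes the stricter repeated-sequence criterion actually used):

```python
# Sketch of the four anomaly-confirmation conditions described above.

UNREACHABLE = {"net unreachable", "host unreachable",
               "net unknown", "host unknown"}

def has_loop(hops):
    # Crude check: any repeated router in the trace.
    return len(set(hops)) < len(hops)

def confirm_anomaly(local, baseline, remote_traces):
    """local/baseline: dicts for the local and baseline traceroutes;
    remote_traces: traces from other vantage points.
    Returns the confirmation category, or None if not confirmed."""
    if has_loop(local["hops"]):
        return "loop"
    if local["hops"] != baseline["hops"]:
        return "route change"
    if not local["reached"] and any(t["reached"] for t in remote_traces):
        return "partial unreachability"
    if local.get("icmp_error") in UNREACHABLE:
        return "forwarding failure"
    return None  # not confirmed (possibly a non-anomaly)
```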

Our confirmation process is very conservative: it is possible that some of the reported anomalies are real, but do not meet any of the above conditions. However, our goal is to obtain enough samples of anomalies for our analysis, and we do not want our results to be tainted by false positives. Hence, we choose stricter conditions for confirming anomalies. Similarly, we confirm a reported anomaly as a non-anomaly if the local traceroute does not contain any loop, agrees with the baseline traceroute, and reaches the destination. For a confirmed non-anomaly, we do not perform traceroutes at remote ProbeDs, in order to reduce measurement traffic.

In three months, we have seen a total of 2,259,588 possible anomalies reported, of which we confirmed 271,898. Table 3 shows the number of reported anomalies of each type. As we can see, TTL change is the most common type of reported anomaly, accounting for 44% of the reported anomalies. For the remaining anomalies triggered by timeouts, passive monitoring suggests that 23% are most likely caused by forward path problems.

Table 4 shows the breakdown of anomalies using the four confirmation conditions. The non-anomalies account for two-thirds of the reported anomalies. Among the possible reasons for the occurrence of non-anomalies are: ultrashort anomalies, path-based TTL, and aggressive consecutive timeout levels. Some anomalies, which we term ultrashort, are so short that our system is unable to respond in time. Since they often are in the process of being resolved when the forward probes are taking place, the traceroute results may be inconsistent. Many false alarms from path-based TTL changes are due to NAT boxes. When clients with different initial TTL values share a NAT box, their interleaved packets appear to show TTL change. Using flow-based TTL change would reduce these false alarms, but may miss real TTL changes that occur between flows since any path history would be lost. Finally, our consecutive timeout conditions may be aggressive for hosts with short timeout periods. Excluding the non-anomalies, we confirm 271,898 (35%) of the remaining anomalies and probe them from multiple vantage points. We use these anomalies for our analysis in the remainder of this paper.

                 Temporary      Persistent
Total            3565 (17%)     18000 (83%)
≤ 30 min         N/A            54%
≤ 1.5 hrs        N/A            11%
≤ 3.5 hrs        N/A            6%
≤ 7.5 hrs        N/A            6%
> 7.5 hrs        N/A            23%
Single Loop      3008 (84%)     17007 (94%)
Multiple Loops   557 (16%)      993 (6%)
1 AS             3021 (85%)     17895 (99%)
2 ASes           416 (12%)      101 (1%)
3 ASes           106 (3%)       4 (0%)
≥ 4 ASes         22 (0%)        0 (0%)
Tier-1 AS        510 (15%)      244 (2%)
Tier-2 AS        859 (25%)      789 (6%)
Tier-3 AS        1378 (40%)     6263 (46%)
Tier-4 AS        197 (5%)       3899 (29%)
Tier-5 AS        538 (15%)      2401 (17%)
Total            3482           13596

Table 5: Summarized breakdown of 21,565 loop anomalies. Some counts total less than 100% because some ASes are not in the AS hierarchy mapping.

5 Loop-Based Anomalies

This section focuses on analyzing routing loops, which can occur due to inconsistencies in routing state, misconfiguration, and other causes. We are interested in their frequency, duration, size, location, and effect on end-to-end performance.

We detect routing loops by observing the same sequence of routers appearing several times in a traceroute. Since loops in a traceroute may reflect upstream routing changes rather than true routing loops [19], we take a conservative approach and require that the same sequence appear in a traceroute at least 3 times before we confirm it as a routing loop.

Using this metric, we identify a total of 21,565 routing loops in our data. If we relax the loop condition to allow loops in which the same sequence appears only twice, this count increases to 119,936, almost six times as many. Under the relaxed condition, our overall confirmed anomaly count would increase by 36% to over 370K.

Loops are separated into persistent and temporary loops [19] based on whether the traceroute was ultimately able to exit the loop. If the traceroute stays within the loop until the maximum number of hops (32 in our case), we classify it as persistent; if the loop is resolved before the traceroute reaches the maximum hop count, it is temporary. Temporary loops can occur due to the time necessary to propagate updated routing information to different parts of the network, while persistent loops can be caused by various factors, including router misconfiguration [16]. Persistent loops tend to last longer and may require human intervention to resolve, so they are considered more harmful. About 83% of the observed loops are persistent, as shown in Table 5. Since temporary loops only exist for a short period, they may be harder to catch.
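The detection and classification rules above can be sketched as follows. This is our own illustration, not the paper's implementation: the function names and the list-of-hops representation are assumptions, but the thresholds (at least 3 consecutive repetitions; a 32-hop limit) come from the text.

```python
def find_loop(hops, min_repeats=3):
    """Return (sequence, start_index) for the first router sequence that
    repeats at least `min_repeats` times consecutively in `hops` (the
    list of router addresses from a traceroute), else None."""
    n = len(hops)
    for start in range(n):
        for seq_len in range(1, (n - start) // min_repeats + 1):
            seq = hops[start:start + seq_len]
            repeats = 1
            while hops[start + repeats * seq_len:
                       start + (repeats + 1) * seq_len] == seq:
                repeats += 1
            if repeats >= min_repeats:
                return seq, start
    return None

def classify_loop(hops, loop, max_hops=32):
    """A loop is persistent if the traceroute is still inside it when the
    maximum hop count is reached; otherwise it is temporary."""
    seq, start = loop
    i = start
    # advance past full repetitions of the looping sequence
    while hops[i:i + len(seq)] == seq:
        i += len(seq)
    # allow a trailing partial repetition cut off by the hop limit
    if seq[:len(hops) - i] == hops[i:]:
        i = len(hops)
    if i >= len(hops) and len(hops) >= max_hops:
        return "persistent"
    return "temporary"
```

Relaxing `min_repeats` to 2 corresponds to the looser condition discussed above, which admits almost six times as many loops.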

We use the reprobes to determine the duration of the persistent loops. The reprobes for some persistent loops are missing, often because the local PlanetLab node failed or rebooted before all reprobes completed; for those loops, we do not know when the loops were resolved. We only include loops having all 4 reprobes in our analysis, and therefore show the percentage of loops in each duration bucket instead of exact counts in Table 5. We can see that many persistent loops are either resolved quickly (54% terminate within half an hour) or last for a long time (23% persist for more than 7.5 hours).

Previous research has noted that routing loops are likely to be correlated [19], because nearby routers usually share routing information very quickly. If some routers have inconsistent information, that information is likely to be propagated to other nearby routers and cause those routers to form loops. We observe a similar phenomenon, which is quantified in Table 5. We count the number of distinct loops in traceroutes from other ProbeDs during the same period, and find that 16% of the temporary loops are accompanied by at least one disjoint loop, while only 6% of the persistent loops are. We suspect the reason is that temporary loops are more likely to stem from inconsistent routing state, while persistent loops are more likely to be caused by other factors that may not be related to routing inconsistency.

Finally, using the persistent loop data, we can also get some insight into the relative distribution of AS quality by measuring how evenly distributed the loops are. Since these loops are largely single-AS, they are very unlikely to arise from external factors, and may provide some insight into the monitoring/management quality of the AS operators. We use a metric, which we call skew, to characterize the distribution. We calculate skew as the percentage of per-tier loops seen by the "worst" 10% of the ASes in that tier. A skew value of 10% indicates that all ASes in the tier are likely to be uniform in quality, while larger values indicate a wider disparity between the best ASes and the worst.
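The skew metric reduces to a short computation over per-AS loop counts. A minimal sketch (the input data below is illustrative, not the paper's):

```python
def skew(loops_per_as):
    """Skew: the fraction of a tier's loops observed at the 'worst' 10%
    of the ASes in that tier. `loops_per_as` maps AS number -> number of
    persistent loops seen in that AS."""
    counts = sorted(loops_per_as.values(), reverse=True)
    worst = max(1, len(counts) // 10)   # top 10% of ASes, at least one
    return sum(counts[:worst]) / sum(counts)
```

With 22 tier-1 ASes, `len(counts) // 10` selects the top 2 ASes, matching the tier-1 example in the text.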

In tier 1, the top 2 ASes (10% of 22) account for 35% of the loops, while in tier 2, the top 21 ASes (10% of 215) account for 97% of the loops. This skew may translate into different reliabilities for the customers they serve. The disparity in traffic must also be considered when judging how important these skew numbers are. With respect to the traffic that we observe, we find that these ASes account for 20% of tier-1 traffic and 63% of tier-2 traffic. The disparity between the loop rates and the traffic for these ASes suggests that these ASes are much more problematic than others in their tier.

Loop length (hops)    2     3     4     5     6+
Persistent/All       97%    2%    1%    0%    0%
Persistent/Core      94%    4%    1%    1%    0%
Persistent/Edge      97%    2%    1%    0%    0%
Temporary/All        51%   29%   11%    7%    2%
Temporary/Core       45%   31%   13%    8%    3%
Temporary/Edge       53%   27%   12%    6%    2%

Table 6: Number of hops in loops, as % of loops

5.1 Scope

Besides their frequency, one of the other factors that determines the effect of routing loops is their scope, including how many routers/ASes are involved in the loop and where they are located. We use the term loop length to refer to the number of routers involved, and we show a breakdown of this metric in Table 6. The most noticeable feature is that temporary loops have much greater lengths than persistent loops: 97% of the persistent loops consist of only 2 routers, while the ratio is only 51% for temporary loops. Intuitively, the more routers are involved in a loop, the less stable it is. Therefore, most persistent loops exist between only two routers, while temporary loops span additional routers as the inconsistent state propagates.

We next examine the number of ASes involved in the loops. The breakdown, shown in Table 5, indicates that persistent loops overwhelmingly occur within a single AS, while 15% of temporary loops span multiple ASes. Ideally, BGP prevents any inter-AS loops by prohibiting a router from accepting an AS path containing its own AS number. However, BGP allows transient inconsistency, which can arise during route withdrawals and announcements [10], and this is why we see more temporary loops spanning multiple ASes. In contrast, persistent loops can occur due to static routing [19] or router misconfigurations [16]. Given how few persistent loops span multiple ASes, it appears that BGP's loop suppression mechanism is effective.

To understand where loops arise, we classify them according to the hierarchy of the ASes involved. In theory, we could calculate their depth [7], which would tell us the minimum number of hops from the routers to the network edge. However, since we cannot launch probes from the clients, this depth metric would be misleading: if the loop occurs in an AS that does not lie near any of our ProbeD locations, our probes travel through the network core to reach it, and we would believe it to be very far from the edge. If we could launch probes from the clients, network depth would be meaningful.

Figure 1: CDF of loss rates preceding the loop anomalies

We map loops into tiers by using the tier(s) of the AS(es) involved; a loop can be mapped to multiple tiers if it involves multiple ASes. Table 5 shows the number of loops occurring in each tier. Tier-1 and tier-2 ASes have very few persistent loops, possibly because they are better provisioned than smaller ASes. A large portion of loops (40% of temporary and 46% of persistent) occur in tier-3 (outer core) ASes, which suggests that the paths in those large regional ASes are less stable. In Table 6, we compare the loops in the core network (tiers 1, 2, 3) with those in the edge network (tiers 4, 5). As the table shows, both temporary and persistent loops are likely to involve more hops if they occur in the core network.

5.2 End-to-End Effects

Loops can degrade end-to-end performance in two ways: by overloading routers due to processing the same packet multiple times [10] (for temporary loops), or by causing loss of connectivity between pairs of hosts (for persistent loops). Since MonD monitors all flows between the node and remote hosts, we can use the network statistics it keeps to understand end-to-end effects.

When MonD suspects an anomaly, it logs the retransmission rate and RTT leading up to that point. Retransmission rates are calculated over the last 5 minutes. RTTs are calculated using an exponentially weighted moving average (EWMA) with the most recent value given a weight of 1/8, similar to that used in TCP. Figure 1 shows the CDF of the retransmission rate: 65% of the temporary loops and 55% of the persistent loops are preceded by loss rates exceeding 30%. Since the typical Internet loss rate is less than 5% [20], this higher loss rate will significantly reduce end-user TCP performance prior to the start of the anomaly.
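The RTT estimate described above is the standard TCP-style EWMA. A minimal sketch (the 1/8 weight matches the text; the function name and seeding behavior are our own assumptions):

```python
def update_rtt(srtt, sample, alpha=0.125):
    """EWMA update with the newest sample weighted 1/8, as in TCP's SRTT.
    `srtt` is the previous smoothed RTT (None before any measurement);
    returns the new smoothed RTT."""
    if srtt is None:            # first measurement seeds the average
        return sample
    return (1 - alpha) * srtt + alpha * sample
```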


Figure 2: CDF of RTTs preceding the loop anomalies vs. under normal conditions

In addition to the high loss rates, loop anomalies are also preceded by high latency, as shown in Figure 2. High latency can be caused by queuing due to congestion, or by packets repeatedly traversing the same sequence of routers. We compare the RTT right before loops occur with the RTT measured in the baseline traceroute on the same path. It is evident that loops can significantly degrade RTTs.

6 Building a Reference Path

While loop-based anomalies are relatively simple to identify, they represent a relatively small fraction of the total confirmed anomalies. Classifying the other anomalies, however, requires more effort. This section discusses the steps taken to classify the non-loop anomalies, and the complications involved. The main problem is how to determine that an anomaly occurs on the forward path from the local node to the destination. Additionally, we want to characterize the anomaly's features, such as its pattern, location, and affected routers.

To deal with these two problems, we need a reference path from the local node to the destination, which represents the path before the anomaly occurs. We can then compare it against the local traceroute taken during the anomaly. This comparison serves three purposes. First, it helps us distinguish between forward-path and reverse-path anomalies: if the local path during the anomaly differs from the reference path, the route change usually indicates that the anomaly is on the forward path. Second, it can be used to quantify the scope of the anomaly: by examining which routers overlap between the local path and the reference path, we can estimate which routers are potentially affected, and also determine the location of the anomaly in terms of the number of router hops between the anomaly and the end hosts. Third, it is used to classify the anomalies: based on whether the local traceroute reaches the last hop of the reference path, we classify an anomaly as either a path change or a path outage.
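The classification step can be sketched as follows (Python; representing a path as a list of router addresses, with illustrative names, is our own assumption). A local traceroute that reaches the reference path's last hop via different intermediate hops indicates a path change; one that stops short indicates a path outage:

```python
def classify_anomaly(local_path, reference_path):
    """Classify a non-loop anomaly by comparing the local traceroute
    taken during the anomaly against the reference path."""
    if local_path == reference_path:
        # forward path unchanged: the anomaly is likely on the reverse path
        return "no forward change"
    if local_path and local_path[-1] == reference_path[-1]:
        return "path change"   # reached the reference last hop via a new route
    return "path outage"       # stopped before the reference last hop
```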

Ideally, we could simply use the baseline traceroute as the reference path, if it successfully reached the destination. If the local traceroute during an anomaly stops at some intermediate hop, we know it is an outage; if the local traceroute is different but reaches the destination, we know it is a routing change. However, the baseline traceroute may not reach the destination for a variety of reasons, including:

• The destination is behind a firewall that filters traceroutes. In this case, we still want to use the baseline as the reference path, which can be compared with the local traceroute to analyze anomalies.

• Some intermediate routers filter traceroutes. In this case, we do not have enough information about the hops after the last known hop on the forward path. When an outage occurs, we cannot quantify where it occurs, since the anomaly may lie after the last known hop.

• The baseline traceroute is also affected by the anomaly and fails to reach the destination. In this case, we cannot use it as a reference, because it usually does not provide more useful information than the local traceroute.

The rest of this section focuses on deciding whether the baseline traceroute can be used as the reference path when it does not reach the destination. If a baseline path S stops at hop Sx, we use several heuristics to guess whether Sx is a firewall. Sx must meet the following four requirements before we consider it a firewall:

1. From MonD's passive data, we know the client is able to send and receive TCP packets with the local node. Therefore, the path is working when the baseline traceroute is being performed.

2. S does not return an ICMP destination unreachable message, which would usually indicate that the traceroute encounters routing problems at Sx [13].

3. Sx and the destination are in the same AS. We assume that a firewall and its protected clients belong to the same organization.

4. Sx is within n hops of (i.e., close to) the destination.

The first three requirements are easy to verify, so we focus on the last requirement. Let RevHop(h) be the number of hops from hop h to the local node on the reverse path. We first want to check if 0 < RevHop(dst) − RevHop(Sx) ≤ n. From MonD, we know RevTTL(dst), the TTL of a packet when it arrives at the local node from the destination. If
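A common way to estimate such reverse-path hop counts from observed TTLs can be sketched as follows. This is our own assumption about the mechanics, not text from the paper: it presumes senders start from one of the standard initial TTL values (64, 128, or 255), and the comparison is only meaningful when both endpoints use the same initial TTL. The choice n = 4 is purely illustrative.

```python
# Standard initial TTL values used by common operating systems and routers.
INITIAL_TTLS = (64, 128, 255)

def rev_hop(recv_ttl):
    """Estimate the reverse-path hop count from the TTL observed at the
    local node: assume the sender started from the smallest standard
    initial TTL that is >= the received value."""
    initial = min(t for t in INITIAL_TTLS if t >= recv_ttl)
    return initial - recv_ttl

def near_destination(rev_ttl_dst, rev_ttl_sx, n=4):
    """Check 0 < RevHop(dst) - RevHop(Sx) <= n, i.e., that Sx sits
    within n reverse-path hops of the destination."""
    diff = rev_hop(rev_ttl_dst) - rev_hop(rev_ttl_sx)
    return 0 < diff <= n
```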


Figure 4: Scope of path changes and forward outages in number of hops

Figure 5: Distance of path changes and forward outages to the end hosts in number of hops

diverges from R at ri and merges into R at rl. All the hops before ri or after rl (including rl) follow the same next hop towards the destination. So the hops which are possibly influenced by the path change and have different next hops are ri, rj, and rk.

In some cases, we may be able to use remote traceroutes to narrow the scope even further. For example, in Figure 3, if a traceroute I from another ProbeD merges into R at rk, a hop before rl, we can eliminate rk from the scope of the path change anomaly, since we know rk has the same next hop value as it did in the reference path. We call I the intercept path. This method may still overestimate the scope of a path change: it is possible, for example, that rj's next hop value is unaffected, but we cannot know this unless some traceroute merges into R at rj.

Performing traceroutes from multiple geographically diverse vantage points increases our chances of finding an intercept path. Our ability to narrow the scope is affected by the location of the anomaly. If it is closer to the destination, we have a better chance of obtaining intercept paths by launching many forward traceroutes, and thus of successfully reducing the scope of the anomaly. In contrast, if the anomaly is nearer our source, the chance of another traceroute merging into the reference path early is low.
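The scope-narrowing procedure can be sketched as follows. A sketch under stated assumptions: each router appears at most once per path, and all router names are illustrative.

```python
def change_scope(reference, local, intercepts=()):
    """Routers on the reference path possibly affected by a path change:
    from the last common hop before the divergence (whose next hop
    changed) up to, but not including, the earliest hop where the local
    path or any intercept path merges back into the reference path."""
    # first index at which the local traceroute differs from the reference
    diverge = next((i for i, (r, l) in enumerate(zip(reference, local))
                    if r != l), None)
    if diverge is None:
        return []               # no divergence observed
    start = max(diverge - 1, 0)
    merge = len(reference)      # hops from `merge` on keep their old next hop
    for path in (local,) + tuple(intercepts):
        hits = [reference.index(h) for h in path if h in reference[diverge:]]
        if hits:
            merge = min(merge, min(hits))
    return reference[start:merge]
```

Without an intercept path, the scope runs from the divergence point to the local path's own merge point; each intercept path that merges back earlier trims the scope further, mirroring the rk example above.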

Figure 4 shows the CDF of path change scope, measured in hop count. We can confine the scope of 68% of the path changes to within 4 hops. We do not count fluttering as path changes, since these would appear as a large number of path change anomalies, each with a very small scope. We also examine how many ASes the path change scope spans, shown in Table 8: we can confine the scope of 57% of the path changes to within two ASes and 82% of them to within three ASes.

                  Change          Fwd Outage
Total             120283          12740
No Ref Path       N/A             11181
1 AS              24418 (20%)     6534 (51%)
2 ASes            43909 (37%)     3413 (27%)
3 ASes            29426 (25%)     1321 (10%)
4 ASes            12603 (10%)     856 (7%)
5 ASes            6322 (5%)       411 (3%)
6 ASes            3605 (3%)       205 (2%)
Guessed Last Hop  N/A             1055
Scope Changed     4292 (4%)       1225 (10%)
Tier-1 AS         12374 (6%)      2746 (15%)
Tier-2 AS         43104 (23%)     3255 (18%)
Tier-3 AS         88959 (47%)     4638 (26%)
Tier-4 AS         8015 (4%)       3501 (19%)
Tier-5 AS         38313 (20%)     3838 (21%)
Total (routers)   190765          17978

Table 8: Summary of path changes and forward outages. Some counts exceed 100% because a router can fall into multiple classifications.

To gain some insight into the location of the anomalies, we also study whether a path change occurs near the end hosts or in the middle of the path [7]. We measure the distance of a path change to the end host by averaging the distances of all the routers within the path change scope, where the distance of a router is defined as the minimum number of hops to either the source or the destination. Figure 5 plots the CDF of path change distances: 60% of the path changes occur within 5 hops of the end hosts.
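The distance computation just described reduces to a few lines. A sketch; representing the scope as hop indices on the path is our own choice:

```python
def anomaly_distance(scope_hop_indices, path_length):
    """Average, over the routers in the anomaly scope, of each router's
    distance to the nearer end host (source at hop 0, destination at
    hop path_length - 1)."""
    dists = [min(i, path_length - 1 - i) for i in scope_hop_indices]
    return sum(dists) / len(dists)
```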

We can also use AS tiers to characterize network locations, so we map the routers within anomaly scopes to ASes and AS tiers. The breakdown of possibly affected routers by AS tier is shown in Table 8. The routers in tier-3 ASes are the most likely to be affected by path changes, accounting for nearly half of the total. By comparison, the routers in tier-1 ASes are rather stable, even though they are presumably traversed repeatedly by many paths.

In Figure 4, we see that path changes in the core network have narrower scope than those at the edge. This is probably because the paths in the core network are likely to be traversed by traceroutes from many vantage points to reach the destination. In contrast, if a route change occurs near a local node, we have less chance of finding an intercept path that happens to merge into the reference path early. As a result, the anomaly scope in these cases is more loosely confined.

Figure 6: Confining the scope of forward outage

Since a path change is a dynamic process and anomaly scopes may evolve over time, a measured anomaly scope should be viewed as a snapshot of which routers are affected when the traceroutes reach them. In Table 8, we show how many path changes changed scope between the first and second sets of forward probes. We find that only 4% of them changed during that period (mostly within one minute). In addition, 66% of the scope changes are due to local path changes rather than intercept path changes.

We now examine the effect of path changes on end-to-end performance. The effect of path changes on RTTs is relatively mild, as can be seen in Figure 8: the RTTs measured during path changes are only slightly worse than the RTTs measured in baseline traceroutes. But the loss rates during path changes can be very high. Nearly 45% of the path changes cause loss rates higher than 30%, which can significantly degrade TCP throughput.

7.2 Path Outage

We now focus on path outages and describe how to distinguish between forward and reverse path outages. In Figure 6, the local path L stops at ri, an intermediate hop on the reference path R. At first glance, one might conclude that a failure has occurred after ri on the forward path, which prevents packets from going through; but other possibilities exist. Because Internet paths can be asymmetric, a failure on the reverse path may produce the same results. For example, if a shared link on the reverse paths from all the hops beyond ri to the source has failed, none of the ICMP packets from those hops can return, and consequently we will not see the hops after ri.

Route change     Unreachable     Fwd Timeout
12822 (54%)      2751 (11%)      8348 (35%)

Table 9: Breakdown of reasons for inferring forward outages

If we had control of the destination, we could simply distinguish between forward and reverse path outages using ping [7]. However, since our clients are outside of PlanetLab and not under our control, we cannot perform pings in both directions, and must use other information to disambiguate forward path outages from reverse path failures. Specifically, we can infer that the outage is on the forward path using the following rules:

• There is a route change on the forward path in addition to the outage.

• The local traceroute returns an ICMP destination unreachable message.

• The anomaly is reported as timeouts on the forward path. As described in Section 3.4, MonD reports this type of anomaly when it infers ACK losses on the forward path from the local node to the client.

Table 9 shows the number of forward path outages inferred from each rule. As we can see, all three rules are useful in identifying forward outages. More than half of the outages are accompanied by route changes, as the failure information is propagated and some routers try to bypass the failure using other routes. Forward timeouts help us infer one third of the forward outages. This demonstrates the benefit of combining passive monitoring with active probing, since we would not have been able to disambiguate these outages otherwise.

7.2.1 Scope

To characterize the scope of path outages, we use a technique similar to the one used for path changes. We define a path outage scope as the number of hops in a path that cannot forward packets to their next hop towards the destination. In Figure 6, R is the reference path and L is the local path. L stops at ri, an intermediate hop of R. Hence, all the hops after ri (including ri) are possibly influenced by the outage and may not be able to forward packets to their next hops towards the destination. However, when we can find an intercept path, we can narrow the scope. For example, if I merges into R at rk and reaches the destination, then only ri and rj can possibly be influenced by the outage. Again, this method might overestimate the scope of a path outage, for the same reasons described earlier for estimating a path change scope.

Note that unlike previous work [6, 7], we use a set of routers to quantify the effect of path outages instead of just the last hop where the traceroute stops. Since outage information is usually propagated to many routers, using only one router does not convey how many routers may have been affected by the outage.

In some cases, we may not have a complete baseline path that reaches the destination or the penultimate router. In these cases, we cannot estimate the scope of the forward outage, because we do not know the exact number of hops between the last hop of the baseline path and the destination; we only know that the anomaly occurs somewhere on the forward path. Among all the outages, about 47% have no complete reference path. In the following, we use only those with complete reference paths in the scope analysis.

Figure 7: CDF of loss rates preceding path changes and forward outages

In Figure 4, we plot the CDF of the number of hops in the forward outage scope. Compared with path changes, we can confine the outage scope more tightly: nearly 60% of the outages can be confined to within 1 hop, and 75% of them to within 4 hops.

We suspect that such tight confinement is due to last-hop failures. In Figure 5, we plot the distances of forward outages to the end hosts. The distance of an outage is defined as the average distance of the routers within the outage scope to the end hosts, similar to the definition used for path changes in Section 7.1. As we can see, 44% of the outages do occur at the last hop, allowing us to confine their scopes to 1 hop. This observation explains why the outages in the edge network are confined more tightly than those in the core network, as shown in Figure 4.

Excluding last-hop failures, we can confine only 14% of the outages to one hop, a result slightly better than that for path changes. In general, the scopes of path outages tend to be smaller than those of path changes. Compared with the path changes in Figure 5, path outages tend to occur much closer to the end hosts: more than 70% of the outages occur within 4 hops of the end hosts.

Table 8 gives the number of ASes that the outages span. Compared with path changes, we can confine a much higher percentage of outages (78%) to within two ASes. If we examine the AS tiers where the affected routers are located, outages are spread out more evenly across tiers than path changes are. Paths in tier-1 ASes are the most stable and those in tier-3 ASes are the most unstable. If we look at both Table 5 and Table 8, we note that paths in tier-3 ASes are the most likely to be affected by all types of anomalies: they account for 40% of temporary loops, 46% of persistent loops, 47% of path changes, and 26% of forward outages. In contrast, paths in tier-1 ASes are the most stable.

Figure 8: CDF of RTTs preceding path changes and forward outages vs. under normal conditions

Finally, in Table 8, we find that the scopes of 10% of the forward outages might have changed between the first and second sets of forward probes, mostly due to local path changes. Another 8% of the forward outages have reference paths that do not terminate at the destinations; these last hops are considered firewalls, based on the heuristic described in Section 6.

7.2.2 End-to-End Effect

We also study how path outages influence end-to-end performance. Not surprisingly, forward outages can be preceded by very high loss rates, slightly worse than those generated by path changes; the comparison is shown in Figure 7. Similarly, outages tend to be preceded by much worse RTTs than path changes, as shown in Figure 8: 23% of the outages experience RTTs that are over one second, while only 7% do when there is no outage. The RTT variances can also be very high: 17% of them exceed 0.5 seconds.

8 Discussion

8.1 Bypassing Anomalies

In addition to characterizing anomalies, one of the goals of PlanetSeer is to provide benefits to the hosts running it. One possible approach is using the wide-area service nodes as an overlay to bypass path failures. Existing systems, such as RON [1], bypass path failures by routing indirectly through intermediate nodes before reaching the destination. Their published results show that around 50% of the failures on a 31-node testbed can be bypassed [7]. PlanetSeer differs in size and scope, since we are interested in serving thousands of clients that are not participants in the overlay, and we have much higher AS-level coverage.

Determining how many failures can be bypassed in our model is more complicated, since we have no control over the clients. Clients that are behind firewalls and filter pings and traceroutes may be reachable from other overlay nodes, but we may not be able to confirm this scenario. As a result, we focus only on those destinations that are reachable in the baseline probes, since we can confirm their reachability during normal operation.

Figure 9: CDF of latency ratio of overlay paths to direct paths

For this group of clients, we have a total of 62,815 reachability failures, due to anomalies such as path outages or loops. Of these failures, we find that some node in PlanetSeer is able to reach the destination in 27,263 cases, indicating that one-hop indirection is effective in finding a bypass path for 43% of the failures.

In addition to improving the reachability of clients using overlay paths, the other issue is their relative performance during failures. We calculate a bypass ratio as the ratio between the minimum RTT of any of the bypass paths and the RTT of the baseline path. These results, shown in Figure 9, are moderately promising: 68% of the bypass paths suffer less than a factor of two in increased latency. In fact, 23% of the new paths actually see a latency improvement, suggesting that the overlay could be used for improving route performance in addition to failure resiliency. However, some paths see much worse latency degradation, with the worst 5% seeing more than a factor of 18 worse latency. While these paths may bypass the anomaly, the performance degradation will be very noticeable, perhaps to the point of unusability.
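The bypass ratio reduces to a one-line computation; a sketch with illustrative RTT values in milliseconds (the function name is ours):

```python
def bypass_ratio(bypass_rtts, direct_rtt):
    """Ratio of the best (minimum) RTT over any one-hop bypass path to
    the RTT of the direct (baseline) path. Values below 1 mean the
    overlay path is actually faster than the direct route."""
    return min(bypass_rtts) / direct_rtt
```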

8.2 Reducing Measurement Overhead

While PlanetSeer's combination of passive monitoring and distributed active probing is very effective at finding anomalies, particularly short-lived ones, the probing traffic can be aggressive, and can come as a surprise to low-traffic sites that suddenly see a burst of traceroutes arriving from around the world. Therefore, we are interested in reducing the measurement overhead without losing the accuracy and flexibility of our approach. For example, we can use a single traceroute to confirm loops, and then decide whether we want distributed traceroutes to test for the presence of correlated loops. Similarly, for path changes and outages, we can reduce the number of distributed traceroutes if we are willing to tolerate some inaccuracy in characterizing their scope. In Figure 10, we plot the CDF of the number of probes from other vantage points we have to examine before finding the intercept traceroutes that successfully narrow the scopes of the anomalies. Using only 15 vantage points, we achieve the same results as when using all 30 vantage points in 80% of the cases. We are interested in studying this issue further, so that we can determine which vantage points we need to achieve good results.

Figure 10: CDF of number of paths examined before finding the intercept path

9 Related Work

There is extensive existing work on studying Internet path failures. Labovitz and Ahuja [15] studied inter-domain routing failures using BGP data collected from several ISPs and 5 network exchange points. They analyzed the temporal properties of failures, such as mean time to fail, mean time to repair, and failure duration. They found that 40% of failures are repaired in 10 minutes and 60% of them are resolved in 30 minutes. They also studied the intra-domain routing failures of a medium-sized regional ISP by examining data from the trouble ticket tracking system managed by the ISP's Network Operation Center (NOC), together with OSPF routing messages. Based on this data, they characterized the origins of failures as hardware, software, and operational problems.

Iannaccone et al. investigated failures in Sprint's IP backbone using IS-IS routing updates collected from three vantage points [12]. They examined the frequency and duration of failures inferred from routing updates and concluded that most failures are short-lived (under 10 minutes). They also studied the interval between failures. Again, their focus is on the temporal properties of failures.

Mahajan, Wetherall, and Anderson [16] studied BGP misconfigurations using BGP updates from RouteViews [18], which has 23 vantage points across different ISPs. They found that BGP misconfigurations were relatively common and classified them into origin and export misconfigurations. Note that BGP misconfigurations may or may not be visible to end users; they observed that only 1 in 25 misconfigurations affects end-to-end connectivity.

OSDI ’04: 6th Symposium on Operating Systems Design and Implementation, USENIX Association.

More recently, Feldmann et al. presented a methodology to locate the origin of routing instabilities along three dimensions in BGP: time, prefix, and view [9]. Their basic assumption is that an AS path change is caused by some instability on either the previous best path or the new best path. Caesar et al. [4] and Chang et al. [5] propose similar approaches to analyzing routing changes, although their algorithms differ in their details.

All of the above failure studies are based on either inter-domain (BGP) or intra-domain (IS-IS, OSPF) routing messages, from which failures are inferred. Some of them require access to an ISP's internal data, such as trouble tickets. Our study complements this work by examining failures from an end-to-end perspective and quantifying how failures affect end-to-end performance, such as loss rate and RTT.

There has also been much work on studying Internet failures through end-to-end measurement, and these efforts have greatly influenced our approach. Paxson [19] studied end-to-end routing behavior by running repeated traceroutes between 37 Internet hosts. His study shows that 49% of Internet paths are asymmetric and visit at least one different city, and that 91% of paths persist for more than several hours. He used traceroutes to identify various routing pathologies, such as loops, fluttering, path changes, and outages. However, these traceroutes do not distinguish between forward and reverse failures.
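Detecting a loop in a traceroute, as in Paxson's pathology analysis, amounts to checking whether a router address reappears at a non-adjacent position in the hop list. A hedged sketch of this idea (our own simplification, not Paxson's exact procedure):

```python
def detect_loop(hops):
    """Return the first router that reappears later in a traceroute
    hop list, a simple signature of a routing loop. Immediately
    repeated hops (gap of 1) are ignored, since a router can
    legitimately answer for adjacent TTLs."""
    first_seen = {}
    for i, hop in enumerate(hops):
        if hop in first_seen and i - first_seen[hop] > 1:
            return hop
        first_seen.setdefault(hop, i)
    return None

# A path that revisits router "b" after leaving it:
detect_loop(["a", "b", "c", "b"])   # "b"
```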

Chandra et al. [6] studied the effect of network failures on end-to-end services using existing traceroute datasets [19, 21], along with HTTP traces collected from 11 proxies. They model failures by their location and duration and evaluate different techniques for masking failures. However, their HTTP and traceroute datasets are independent. In comparison, we combine passive monitoring data and active probing data, which allows us to detect failures in real time and correlate the end-to-end effects with different types of failures. They also classify failures as near-source, near-destination, or in-middle by matching /24 IP prefixes with end-host IPs. In contrast, we study the location of failures using both IP-to-AS mapping [17] and a 5-tier AS hierarchy [23], which allows us to quantify failure locations more accurately and at a finer granularity.
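The /24-prefix heuristic used by Chandra et al. can be sketched as follows; the function name and return labels are our own illustration of the idea:

```python
import ipaddress

def classify_failure(failure_ip, src_ip, dst_ip, prefix_len=24):
    """Label a failure hop near-source, near-destination, or in-middle
    depending on whether it shares a /24 prefix with either endpoint."""
    def prefix(ip):
        return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    if prefix(failure_ip) == prefix(src_ip):
        return "near-source"
    if prefix(failure_ip) == prefix(dst_ip):
        return "near-destination"
    return "in-middle"
```

The coarseness of this heuristic (any hop outside both /24s is "in-middle") is exactly what the IP-to-AS mapping and AS-hierarchy approach refines.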

Feamster et al. measured Internet failures among 31 hosts using periodic pings combined with traceroutes [7]. They ping the path between each pair of nodes every 30 seconds, with consecutive ping losses triggering traceroutes. They consider the location of a failure to be the last reachable hop in the traceroute, and they use the number of hops to the closest end host to quantify the depth of the failure. They characterize failures as inter-AS or intra-AS and use one-way pings to distinguish between forward and reverse failures. They also examine the correlation between path failures and BGP updates.
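The loss-triggered probing described above can be sketched as a simple scan over ping outcomes; the threshold of 3 consecutive losses is our illustrative choice, not necessarily the value Feamster et al. used:

```python
def traceroute_triggers(ping_results, loss_threshold=3):
    """Given a sequence of ping outcomes (True = reply, False = loss),
    return the indices at which `loss_threshold` consecutive losses
    have just accumulated and a traceroute would be launched."""
    triggers, run = [], 0
    for i, ok in enumerate(ping_results):
        run = 0 if ok else run + 1
        if run == loss_threshold:
            triggers.append(i)
    return triggers
```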

Our work is partly motivated by these approaches, but we cannot use their methodology directly because of environmental differences. With the large number of clients that connect to our proxies, we cannot afford to ping each of them frequently. Failure detection and confirmation are also more challenging in our case, since many clients may not respond to pings (being behind firewalls) or may even be offline (such as dialup users). We instead infer anomalies by monitoring the status of active flows, which allows us to study anomalies on a much more diverse set of paths. We also combine baseline traceroutes with passive monitoring to distinguish between forward and reverse failures and to classify forward anomalies into several categories. Since where a failure appears may differ from where it occurs, we quantify the scope of failures by correlating traceroutes from multiple vantage points, instead of using a single hop (the last reachable hop) as the failure location. Finally, we study how different types of anomalies affect end-to-end performance.
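The intuition behind correlating traceroutes from multiple vantage points can be illustrated with a set computation: hops that appear on every failing path but on no working path are the likeliest fault location. This is a deliberately simplified stand-in for PlanetSeer's actual scope-narrowing logic:

```python
def suspect_hops(failing_traces, working_traces):
    """Intersect the hop sets of all failing traceroutes, then remove
    any hop that a working traceroute still passes through."""
    suspects = set(failing_traces[0])
    for trace in failing_traces[1:]:
        suspects &= set(trace)
    for trace in working_traces:
        suspects -= set(trace)
    return suspects
```

Real traceroutes complicate this picture with unresponsive hops and asymmetric reverse paths, which is one reason PlanetSeer also relies on baseline traceroutes and direction inference.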

10 Conclusion

This paper introduces what we believe to be an important new type of diagnostic tool, one that passively monitors network communication watching for anomalies, and then engages widely-distributed probing machinery when suspicious events occur. Although much work can still be done to improve the tool (e.g., reducing the active probes required, possibly by integrating static topology information and BGP updates), the observations we have been able to make in a short time are dramatic.

• Passive monitoring allows us to detect more anomalies in less time: we confirmed nearly 272,000 anomalies in three months. This is roughly 3,000 a day, 10 to 100 times more than previously reported. We also see a qualitative change, such as large numbers of ultrashort and temporary anomalies that last less than one minute.

• Due to our wide coverage, we see new failure distribution and location properties. Failures are heavily skewed, rather than pervasively distributed: Tier-3 ASes seem to be the most problematic, accounting for almost half of the loops, path changes, and path outages that we see. Tier-1 ASes are generally the most stable.



• We provide some new measurements of routing loop behavior. Temporary loops have much greater lengths than persistent loops: 97% of persistent loops consist of only 2 routers, but only 50% of temporary loops do, and many temporary loops span 4 routers. This makes sense, since the more routers are involved in a loop, the less stable it is. Persistent loops are either resolved in a relatively short time (54% last less than 30 minutes) or continue for an extended period (23% last more than 7.5 hours). Our results confirm Paxson's finding that routing loops are correlated.

• Path changes exhibit different characteristics than outages. Outages appear closer to the edge of the network: 63% of outages occur within 3 hops of end hosts, versus 20% for path changes. Path changes tend to have wider impact: only 57% of path changes can be confined to two ASes and 50% to within three hops, while the respective figures are 78% and 70% for outages. Path changes also have a much milder effect on RTTs than outages, although both can incur high loss rates.

• Our measurements suggest less opportunity for indirection-based resiliency than previous studies: alternative routes are available only 43% of the time, and a significant fraction of them suffer from high latency inflation. These results stem from most outages occurring nearer the edge of the network than the core, where redundancy is less available, and less practical when it is available.

We have shown that PlanetSeer provides an effective means to detect large numbers of anomalies with broad coverage, especially in the case of wide-area services that cannot rely on cooperation from one endpoint. In addition to the high detection rate, the short delay between an anomaly's emergence and its detection allows us to capture anomaly behavior more effectively, and our distributed framework provides improved characterization.

Acknowledgments

This research was supported in part by NSF grant CNS-0335214. We would like to thank KyoungSoo Park for his effort in keeping CoDeeN operational during this work. We thank our shepherd, Miguel Castro, for his guidance and helpful feedback, and we thank our anonymous reviewers for their valuable comments on improving this paper.

References

[1] D. Andersen, H. Balakrishnan, F. Kaashoek, and R. Morris. Resilient overlay networks. In ACM SOSP, Oct. 2001.

[2] S. Banerjee, T. G. Griffin, and M. Pias. The interdomain connectivity of PlanetLab nodes. In Passive and Active Measurement Workshop, Apr. 2004.

[3] A. Bavier, M. Bowman, B. Chun, D. Culler, S. Karlin, S. Muir, L. Peterson, T. Roscoe, T. Spalink, and M. Wawrzoniak. Operating system support for planetary-scale network services. In USENIX/ACM NSDI, Mar. 2004.

[4] M. Caesar, L. Subramanian, and R. H. Katz. Root cause analysis of Internet routing dynamics. Tech. Report UCB/CSD-04-1302, 2003.

[5] D.-F. Chang, R. Govindan, and J. Heidemann. The temporal and topological characteristics of BGP path changes. In IEEE ICNP, Nov. 2003.

[6] M. Dahlin, B. Chandra, L. Gao, and A. Nayate. End-to-end WAN service availability. IEEE/ACM Trans. Netw., Apr. 2003.

[7] N. Feamster, D. G. Andersen, H. Balakrishnan, and M. F. Kaashoek. Measuring the effects of Internet path faults on reactive routing. In ACM SIGMETRICS, Jun. 2003.

[8] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, and F. True. Deriving traffic demands for operational IP networks: methodology and experience. In ACM SIGCOMM, Aug. 2000.

[9] A. Feldmann, O. Maennel, Z. Mao, A. Berger, and B. Maggs. Locating Internet routing instabilities. In ACM SIGCOMM, Aug. 2004.

[10] U. Hengartner, S. Moon, R. Mortier, and C. Diot. Detection and analysis of routing loops in packet traces. In ACM IMW, 2002.

[11] IANA. Special-use IPv4 addresses. RFC 3330.

[12] G. Iannaccone, C.-N. Chuah, R. Mortier, S. Bhattacharyya, and C. Diot. Analysis of link failures in an IP backbone. In ACM IMW, Nov. 2002.

[13] J. Postel. Internet control message protocol. RFC 792.

[14] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian. Delayed Internet routing convergence. In ACM SIGCOMM, Sep. 2000.

[15] C. Labovitz, A. Ahuja, and F. Jahanian. Experimental study of Internet stability and wide-area backbone failures. Technical Report CSE-TR-382-98, University of Michigan, 1998.

[16] R. Mahajan, D. Wetherall, and T. Anderson. Understanding BGP misconfiguration. In ACM SIGCOMM, 2002.

[17] Z. Mao, J. Rexford, J. Wang, and R. H. Katz. Towards an accurate AS-level traceroute tool. In ACM SIGCOMM, 2003.

[18] University of Oregon RouteViews Project. http://www.routeviews.org.

[19] V. Paxson. End-to-end routing behavior in the Internet. In ACM SIGCOMM, Aug. 1996.

[20] V. Paxson. End-to-end Internet packet dynamics. IEEE/ACM Trans. Netw., 7(3), 1999.

[21] S. Savage, A. Collins, and E. Hoffman. The end-to-end effects of Internet path selection. In ACM SIGCOMM, Aug. 1999.

[22] N. Spring, D. Wetherall, and T. Anderson. Scriptroute: A facility for distributed Internet measurement. In USITS, Mar. 2003.

[23] L. Subramanian, S. Agarwal, J. Rexford, and R. H. Katz. Characterizing the Internet hierarchy from multiple vantage points. In IEEE INFOCOM, Jun. 2002.

[24] Traceroute.Org. http://www.traceroute.org.

[25] V. Paxson and M. Allman. Computing TCP's retransmission timer. RFC 2988.

[26] L. Wang, V. Pai, and L. Peterson. The effectiveness of request redirection on CDN robustness. In OSDI, Dec. 2002.

[27] Y. Zhang, N. Duffield, V. Paxson, and S. Shenker. On the constancy of Internet path properties. In ACM IMW, Nov. 2001.
