Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications

Phillipa Gill
University of Toronto
[email protected]

Navendu Jain
Microsoft Research
[email protected]

Nachiappan Nagappan
Microsoft Research
[email protected]

ABSTRACT

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software related faults, (4) failures have the potential to cause loss of many small packets such as keep alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Categories and Subject Descriptors: C.2.3 [Computer-Communication Networks]: Network Operations

General Terms: Network Management, Performance, Reliability

Keywords: Data Centers, Network Reliability

1. INTRODUCTION

Demand for dynamic scaling and benefits from economies of scale are driving the creation of mega data centers to host a broad range of services such as Web search, e-commerce, storage backup, video streaming, high-performance computing, and data analytics. To host these applications, data center networks need to be scalable, efficient, fault tolerant, and easy to manage. Recognizing this need, the research community has proposed several architectures to improve scalability and performance of data center networks [2, 3, 12–14, 17, 21]. However, the issue of reliability has remained unaddressed, mainly due to a dearth of available empirical data on failures in these networks.

In this paper, we study data center network reliability by analyzing network error logs collected for over a year from thousands of network devices across tens of geographically distributed data centers. Our goals for this analysis are two-fold. First, we seek to characterize network failure patterns in data centers and understand overall reliability of the network. Second, we want to leverage lessons learned from this study to guide the design of future data center networks.

Motivated by issues encountered by network operators, we study network reliability along three dimensions:

• Characterizing the most failure prone network elements. To achieve high availability amidst multiple failure sources such as hardware, software, and human errors, operators need to focus on fixing the most unreliable devices and links in the network. To this end, we characterize failures to identify network elements with high impact on network reliability, e.g., those that fail with high frequency or that incur high downtime.

• Estimating the impact of failures. Given limited resources at hand, operators need to prioritize severe incidents for troubleshooting based on their impact to end-users and applications. In general, however, it is difficult to accurately quantify a failure’s impact from error logs, and annotations provided by operators in trouble tickets tend to be ambiguous. Thus, as a first step, we estimate failure impact by correlating event logs with recent network traffic observed on links involved in the event. Note that logged events do not necessarily result in a service outage because of failure-mitigation techniques such as network redundancy [1] and replication of compute and data [11, 27], typically deployed in data centers.

• Analyzing the effectiveness of network redundancy. Ideally, operators want to mask all failures before applications experience any disruption. Current data center networks typically provide 1:1 redundancy to allow traffic to flow along an alternate route when a device or link becomes unavailable [1]. However, this redundancy comes at a high cost—both monetary expenses and management overheads—to maintain a large number of network devices and links in the multi-rooted tree topology. To analyze its effectiveness, we compare traffic on a per-link basis during failure events to traffic across all links in the network redundancy group where the failure occurred.

For our study, we leverage multiple monitoring tools put in place by our network operators. We utilize data sources that provide both a static view (e.g., router configuration files, device procurement data) and a dynamic view (e.g., SNMP polling, syslog, trouble tickets) of the network. Analyzing these data sources, however, poses several challenges. First, since these logs track low level network events, they do not necessarily imply application performance impact or service outage. Second, we need to separate failures that potentially impact network connectivity from high volume and often noisy network logs, e.g., warnings and error messages even when the device is functional. Finally, analyzing the effectiveness of network redundancy requires correlating multiple data sources across redundant devices and links. Through our analysis, we aim to address these challenges to characterize network failures, estimate the failure impact, and analyze the effectiveness of network redundancy in data centers.

1.1 Key observations

We make several key observations from our study:

• Data center networks are reliable. We find that overall the data center network exhibits high reliability with more than four 9’s of availability for about 80% of the links and for about 60% of the devices in the network (Section 4.5.3).

• Low-cost, commodity switches are highly reliable. We find that Top of Rack switches (ToRs) and aggregation switches exhibit the highest reliability in the network with failure rates of about 5% and 10%, respectively. This observation supports network design proposals that aim to build data center networks using low cost, commodity switches [3, 12, 21] (Section 4.3).

• Load balancers experience a high number of software faults. We observe 1 in 5 load balancers exhibit a failure (Section 4.3) and that they experience many transient software faults (Section 4.7).

• Failures potentially cause loss of a large number of small packets. By correlating network traffic with link failure events, we estimate the amount of packets and data lost during failures. We find that most failures lose a large number of packets relative to the number of lost bytes (Section 5), likely due to loss of protocol-specific keep alive messages or ACKs.

• Network redundancy helps, but it is not entirely effective. Ideally, network redundancy should completely mask all failures from applications. However, we observe that network redundancy is only able to reduce the median impact of failures (in terms of lost bytes or packets) by up to 40% (Section 5.1).

Limitations. As with any large-scale empirical study, our results are subject to several limitations. First, the best-effort nature of failure reporting may lead to missed events or multiply-logged events. While we perform data cleaning (Section 3) to filter the noise, some events may still be lost due to software faults (e.g., firmware errors) or disconnections (e.g., under correlated failures). Second, human bias may arise in failure annotations (e.g., root cause). This concern is alleviated to an extent by verification with operators and by the scale and diversity of our network logs. Third, network errors do not always impact network traffic or service availability, due to several factors such as in-built redundancy at the network, data, and application layers. Thus, our failure rates should not be interpreted as impacting applications. Overall, we hope that this study contributes to a deeper understanding of network reliability in data centers.

Paper organization. The rest of this paper is organized as follows. Section 2 presents our network architecture and workload characteristics. Data sources and methodology are described in Section 3. We characterize failures over a year within our data centers in Section 4. We estimate the impact of failures on applications and the effectiveness of network redundancy in masking them in Section 5. Finally, we discuss implications of our study for future data center networks in Section 6. We present related work in Section 7 and conclude in Section 8.


Figure 1: A conventional data center network architecture adapted from figure by Cisco [12]. The device naming convention is summarized in Table 1.

Table 1: Summary of device abbreviations

Type   Devices               Description
AggS   AggS-1, AggS-2        Aggregation switches
LB     LB-1, LB-2, LB-3      Load balancers
ToR    ToR-1, ToR-2, ToR-3   Top of Rack switches
AccR   -                     Access routers
Core   -                     Core routers

2. BACKGROUND

Our study focuses on characterizing failure events within our organization’s set of data centers. We next give an overview of data center networks and workload characteristics.

2.1 Data center network architecture

Figure 1 illustrates an example of a partial data center network architecture [1]. In the network, rack-mounted servers are connected (or dual-homed) to a Top of Rack (ToR) switch usually via a 1 Gbps link. The ToR is in turn connected to a primary and back up aggregation switch (AggS) for redundancy. Each redundant pair of AggS aggregates traffic from tens of ToRs which is then forwarded to the access routers (AccR). The access routers aggregate traffic from up to several thousand servers and route it to core routers that connect to the rest of the data center network and Internet.

All links in our data centers use Ethernet as the link layer protocol and physical connections are a mix of copper and fiber cables. The servers are partitioned into virtual LANs (VLANs) to limit overheads (e.g., ARP broadcasts, packet flooding) and to isolate different applications hosted in the network. At each layer of the data center network topology, with the exception of a subset of ToRs, 1:1 redundancy is built into the network topology to mitigate failures. As part of our study, we evaluate the effectiveness of redundancy in masking failures when one (or more) components fail, and analyze how the tree topology affects failure characteristics, e.g., correlated failures.

In addition to routers and switches, our network contains many middle boxes such as load balancers and firewalls. Redundant pairs of load balancers (LBs) connect to each aggregation switch and perform mapping between static IP addresses (exposed to clients through DNS) and dynamic IP addresses of the servers that process user requests. Some applications require programming the load balancers and upgrading their software and configuration to support different functionalities.

Figure 2: The daily 95th percentile utilization as computed using five-minute traffic averages (in bytes).

Table 2: Summary of link types

Type    Description
TRUNK   connect ToRs to AggS and AggS to AccR
LB      connect load balancers to AggS
MGMT    management interfaces
CORE    connect routers (AccR, Core) in the network core
ISC     connect primary and back up switches/routers
IX      connect data centers (wide area network)

Network composition. The device-level breakdown of our network is as follows. ToRs are the most prevalent device type in our network, comprising approximately three quarters of devices. LBs are the next most prevalent at one in ten devices. The remaining 15% of devices are AggS, Core and AccR. We observe the effects of prevalent ToRs in Section 4.4, where despite being highly reliable, ToRs account for a large amount of downtime. LBs on the other hand account for few devices but are extremely failure prone, making them a leading contributor of failures (Section 4.4).

2.2 Data center workload characteristics

Our network is used in the delivery of many online applications. As a result, it is subject to many well known properties of data center traffic; in particular the prevalence of a large volume of short-lived latency-sensitive “mice” flows and a few long-lived throughput-sensitive “elephant” flows that make up the majority of bytes transferred in the network. These properties have also been observed by others [4, 5, 12].

Network utilization. Figure 2 shows the daily 95th percentile utilization as computed using five-minute traffic averages (in bytes). We divide links into six categories based on their role in the network (summarized in Table 2). TRUNK and LB links which reside lower in the network topology are least utilized, with 90% of TRUNK links observing less than 24% utilization. Links higher in the topology such as CORE links observe higher utilization, with 90% of CORE links observing less than 51% utilization. Finally, links that connect data centers (IX) are the most utilized, with 35% observing utilization of more than 45%. Similar to prior studies of data center network traffic [5], we observe higher utilization at upper layers of the topology as a result of aggregation and high bandwidth oversubscription [12]. Note that since the traffic measurement is at the granularity of five minute averages, it is likely to smooth the effect of short-lived traffic spikes on link utilization.
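For concreteness, the daily 95th percentile utilization can be reproduced from raw five-minute byte counters roughly as in the sketch below. This is an illustrative sketch, not the operators' tooling; the (link_id, day, bytes_sent, capacity_bps) record format is a hypothetical one chosen for illustration.

```python
from collections import defaultdict

def daily_p95_utilization(samples, interval_secs=300):
    """Daily 95th percentile utilization per (link, day) from five-minute
    byte counters.

    samples: iterable of (link_id, day, bytes_sent, capacity_bps) tuples,
    one per five-minute polling interval (hypothetical record format).
    Returns {(link_id, day): 95th percentile utilization in [0, 1]}.
    """
    per_link_day = defaultdict(list)
    for link_id, day, bytes_sent, capacity_bps in samples:
        # Fraction of link capacity used over this five-minute interval.
        utilization = (bytes_sent * 8) / (capacity_bps * interval_secs)
        per_link_day[(link_id, day)].append(utilization)

    result = {}
    for key, utils in per_link_day.items():
        utils.sort()
        result[key] = utils[int(0.95 * (len(utils) - 1))]  # nearest-rank 95th
    return result

# Example: one 1 Gbps link polled four times on one day.
samples = [("trunk-1", "2010-01-01", b, 1e9) for b in (1e9, 2e9, 3e9, 30e9)]
print(daily_p95_utilization(samples))
```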

3. METHODOLOGY AND DATA SETS

The network operators collect data from multiple sources to track the performance and health of the network. We leverage these existing data sets in our analysis of network failures. In this section, we first describe the data sets and then the steps we took to extract failures of network elements.

3.1 Existing data sets

The data sets used in our analysis are a subset of what is collected by the network operators. We describe these data sets in turn:

• Network event logs (SNMP/syslog). We consider logs derived from syslog, SNMP traps and polling, collected by our network operators. The operators filter the logs to reduce the number of transient events and produce a smaller set of actionable events. One of the filtering rules excludes link failures reported by servers connected to ToRs, as these links are extremely prone to spurious port flapping (e.g., more than 100,000 events per hour across the network). Of the filtered events, 90% are assigned to NOC tickets that must be investigated for troubleshooting. These event logs contain information about what type of network element experienced the event, what type of event it was, a small amount of descriptive text (machine generated) and an ID number for any NOC tickets relating to the event. For this study we analyzed a year’s worth of events from October 2009 to September 2010.

• NOC Tickets. To track the resolution of issues, the operators employ a ticketing system. Tickets contain information about when and how events were discovered as well as when they were resolved. Additional descriptive tags are applied to tickets describing the cause of the problem, if any specific device was at fault, as well as a “diary” logging steps taken by the operators as they worked to resolve the issue.

• Network traffic data. Data transferred on network interfaces is logged using SNMP polling. This data includes five minute averages of bytes and packets into and out of each network interface.

• Network topology data. Given the sensitive nature of network topology and device procurement data, we used a static snapshot of our network encompassing thousands of devices and tens of thousands of interfaces spread across tens of data centers.

3.2 Defining and identifying failures

When studying failures, it is important to understand what types of logged events constitute a “failure”. Previous studies have looked at failures as defined by pre-existing measurement frameworks such as syslog messages [26], OSPF [25, 28] or IS-IS listeners [19]. These approaches benefit from a consistent definition of failure, but tend to be ambiguous when trying to determine whether a failure had impact or not. Syslog messages in particular can be spurious, with network devices sending multiple notifications even though a link is operational. For multiple devices, we observed this type of behavior after the device was initially deployed and the router software went into an erroneous state. For some devices, this effect was severe, with one device sending 250 syslog “link down” events per hour for 2.5 months (with no impact on applications) before it was noticed and mitigated.

We mine network event logs collected over a year to extract events relating to device and link failures. Initially, we extract all logged “down” events for network devices and links. This leads us to define two types of failures:

Link failures: A link failure occurs when the connection between two devices (on specific interfaces) is down. These events are detected by SNMP monitoring of the interface state of devices.

Device failures: A device failure occurs when the device is not functioning for routing/forwarding traffic. These events can be caused by a variety of factors such as a device being powered down for maintenance or crashing due to hardware errors.

We refer to each logged event as a “failure” to understand the occurrence of low level failure events in our network. As a result, we may observe multiple component notifications related to a single high level failure or a correlated event, e.g., an AggS failure resulting in down events for its incident ToR links. We also correlate failure events with network traffic logs to filter failures with impact that potentially result in loss of traffic (Section 3.4); we leave analyzing application performance and availability under network failures to future work.

3.3 Cleaning the data

We observed two key inconsistencies in the network event logs stemming from redundant monitoring servers being deployed. First, a single element (link or device) may experience multiple “down” events simultaneously. Second, an element may experience another down event before the previous down event has been resolved. We perform two passes of cleaning over the data to resolve these inconsistencies. First, multiple down events on the same element that start at the same time are grouped together. If they do not end at the same time, the earlier of their end times is taken. In the case of down events that occur for an element that is already down, we group these events together, taking the earliest down time of the events in the group. For failures that are grouped in this way we take the earliest end time for the failure. We take the earliest failure end times because of occasional instances where events were not marked as resolved until long after their apparent resolution.
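The cleaning described above can be approximated as in the following sketch. The (element, start, end) tuple format is assumed for illustration, and the grouping rule is a simplification of the operators' actual procedure.

```python
from collections import defaultdict

def clean_down_events(events):
    """Collapse duplicate/overlapping 'down' events per network element.

    events: iterable of (element_id, start, end) with start <= end,
    e.g. timestamps in seconds (hypothetical format).
    Events on the same element that start while an earlier event is still
    unresolved are grouped; the group keeps the earliest start time and the
    earliest end time, as described in the text.
    """
    by_element = defaultdict(list)
    for element, start, end in events:
        by_element[element].append((start, end))

    cleaned = []
    for element, evs in by_element.items():
        evs.sort()                      # order by start time
        group_start, group_end = evs[0]
        for start, end in evs[1:]:
            if start <= group_end:      # element already down: same failure
                group_end = min(group_end, end)
            else:                       # element recovered: new failure group
                cleaned.append((element, group_start, group_end))
                group_start, group_end = start, end
        cleaned.append((element, group_start, group_end))
    return cleaned

# Example: two overlapping reports from redundant monitors collapse into one
# failure; the later, separate event remains its own failure.
print(clean_down_events([("link-42", 100, 900), ("link-42", 100, 400),
                         ("link-42", 2000, 2300)]))
```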

3.4 Identifying failures with impact

As previously stated, one of our goals is to identify failures that potentially impact end-users and applications. Since we did not have access to application monitoring logs, we cannot precisely quantify application impact such as throughput loss or increased response times. Therefore, we instead estimate the impact of failures on network traffic.

To estimate traffic impact, we correlate each link failure with traffic observed on the link in the recent past before the time of failure. We leverage five minute traffic averages for each link that failed and compare the median traffic on the link in the time window preceding the failure event and the median traffic during the failure event. We say a failure has impacted network traffic if the median traffic during the failure is less than the traffic before the failure. Since many of the failures we observe have short durations (less than ten minutes) and our polling interval is five minutes, we do not require that traffic on the link go down to zero during the failure. We analyze the failure impact in detail in Section 5.
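A minimal sketch of this impact test follows, assuming five-minute (timestamp, bytes) samples per link and a one-hour lookback window; both are illustrative choices rather than details taken from the measurement pipeline.

```python
import statistics

def failure_has_impact(traffic, fail_start, fail_end, lookback=3600):
    """Flag a link failure as impacting traffic if the median traffic during
    the failure is lower than the median traffic preceding it.

    traffic: list of (timestamp, byte_count) five-minute samples for the link.
    fail_start, fail_end: failure interval (same time units as timestamps).
    lookback: how far before the failure to look (assumed one hour here).
    """
    before = [b for t, b in traffic if fail_start - lookback <= t < fail_start]
    during = [b for t, b in traffic if fail_start <= t <= fail_end]
    if not before or not during:
        return None  # no traffic data: cannot classify
    return statistics.median(during) < statistics.median(before)

# Example: traffic drops from ~100 units to ~20 units during the failure.
samples = [(t, 100) for t in range(0, 3600, 300)] + \
          [(t, 20) for t in range(3600, 4200, 300)]
print(failure_has_impact(samples, fail_start=3600, fail_end=4200))  # True
```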

Table 3 summarizes the impact of link failures we observe. We separate links that were transferring no data before the failure into two categories, “inactive” (no data before or during failure) and “provisioning” (no data before, some data transferred during failure). (Note that these categories are inferred based only on traffic observations.) The majority of failures we observe are on links that are inactive (e.g., a new device being deployed), followed by link failures with impact. We also observe a significant fraction of link failure notifications where no impact was observed (e.g., devices experiencing software errors at the end of the deployment process).

Table 3: Summary of logged link events

Category          Percent   Events
All               100.0     46,676
Inactive          41.2      19,253
Provisioning      1.0       477
No impact         17.9      8,339
Impact            28.6      13,330
No traffic data   11.3      5,277

For link failures, verifying that the failure caused impact to network traffic enables us to eliminate many spurious notifications from our analysis and focus on events that had a measurable impact on network traffic. However, since we do not have application level monitoring, we are unable to determine if these events impacted applications or if there were faults that impacted applications that we did not observe.

For device failures, we perform additional steps to filter spurious failure messages (e.g., down messages caused by software bugs when the device is in fact up). If a device is down, neighboring devices connected to it will observe failures on inter-connecting links. For each device down notification, we verify that at least one link failure with impact has been noted for links incident on the device within a time window of five minutes. This simple sanity check significantly reduces the number of device failures we observe. Note that if the neighbors of a device fail simultaneously, e.g., due to a correlated failure, we may not observe a link-down message for that device.
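The device-failure sanity check can be sketched as follows. The topology map and event record formats are hypothetical; the five-minute window follows the text.

```python
def confirm_device_failures(device_events, impact_link_failures, topology,
                            window=300):
    """Keep only device-down events corroborated by at least one impactful
    failure on an incident link within +/- `window` seconds.

    device_events: list of (device_id, timestamp).
    impact_link_failures: list of (link_id, timestamp) already classified as
    having impacted traffic.
    topology: dict mapping device_id -> set of incident link_ids (assumed).
    """
    confirmed = []
    for device, dev_ts in device_events:
        incident = topology.get(device, set())
        if any(link in incident and abs(link_ts - dev_ts) <= window
               for link, link_ts in impact_link_failures):
            confirmed.append((device, dev_ts))
    return confirmed

# Example: only the AggS down event backed by an incident link failure is kept.
topo = {"aggs-1": {"link-a", "link-b"}, "tor-7": {"link-c"}}
devices = [("aggs-1", 1000), ("tor-7", 5000)]
links = [("link-a", 1120)]
print(confirm_device_failures(devices, links, topo))  # [('aggs-1', 1000)]
```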

For the remainder of our analysis, unless stated otherwise, we consider only failure events that impacted network traffic.

4. FAILURE ANALYSIS

4.1 Failure event panorama

Figure 3 illustrates how failures are distributed across our measurement period and across data centers in our network. It shows plots for links that experience at least one failure, both for all failures and those with potential impact; the y-axis is sorted by data center and the x-axis is binned by day. Each point indicates that the link (y) experienced at least one failure on a given day (x).

All failures vs. failures with impact. We first compare the view of all failures (Figure 3 (a)) to failures having impact (Figure 3 (b)). Links that experience failures impacting network traffic are only about one third of the population of links that experience failures. We do not observe significant widespread failures in either plot, with failures tending to cluster within data centers, or even on interfaces of a single device.

Widespread failures: Vertical bands indicate failures that were spatially widespread. Upon further investigation, we find that these tend to be related to software upgrades. For example, the vertical band highlighted in Figure 3 (b) was due to an upgrade of load balancer software that spanned multiple data centers. In the case of planned upgrades, the network operators are able to take precautions so that the disruptions do not impact applications.

Long-lived failures: Horizontal bands indicate link failures on a common link or device over time. These tend to be caused by problems such as firmware bugs or device unreliability (wider bands indicate multiple interfaces failed on a single device). We observe horizontal bands with regular spacing between link failure events. In one case, these events occurred weekly and were investigated in independent NOC tickets. As a result of the time lag, the operators did not correlate these events and dismissed each notification as spurious since they occurred in isolation and did not impact performance. This underscores the importance of network health monitoring tools that track failures over time and alert operators to spatio-temporal patterns which may not be easily recognized using local views alone.

Figure 3: Overview of all link failures (a) and link failures with impact on network traffic (b) on links with at least one failure.

Figure 4: Probability of device failure in one year for device types with population size of at least 300.

Figure 5: Probability of a failure impacting network traffic in one year for interface types with population size of at least 500.

Table 4: Failures per time unit

Failures per day   Mean   Median   95%     COV
Devices            5.2    3.0      14.7    1.3
Links              40.8   18.5     136.0   1.9

4.2 Daily volume of failures

We now consider the daily frequency of failures of devices and links. Table 4 summarizes the occurrences of link and device failures per day during our measurement period. Links experience about an order of magnitude more failures than devices. On a daily basis, device and link failures occur with high variability, having COV of 1.3 and 1.9, respectively. (COV > 1 is considered high variability.)

Link failures are variable and bursty. Link failures exhibit high variability in their rate of occurrence. We observed bursts of link failures caused by protocol issues (e.g., UDLD [9]) and device issues (e.g., power cycling load balancers).

Device failures are usually caused by maintenance. While device failures are less frequent than link failures, they also occur in bursts at the daily level. We discovered that periods with high frequency of device failures are caused by large scale maintenance (e.g., on all ToRs connected to a common AggS).

4.3 Probability of failure

We next consider the probability of failure for network elements. This value is computed by dividing the number of devices of a given type that observe failures by the total device population of the given type. This gives the probability of failure in our one year measurement period. We observe (Figure 4) that in terms of overall reliability, ToRs have the lowest failure rates whereas LBs have the highest failure rate. (Tables 1 and 2 summarize the abbreviated link and device names.)
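The failure probability computation reduces to a per-type ratio, as in this short sketch over a hypothetical device inventory and the set of devices that logged at least one failure.

```python
from collections import Counter

def failure_probability(inventory, failed_devices):
    """Fraction of each device type that saw at least one failure during the
    measurement period.

    inventory: dict device_id -> device type (e.g., 'ToR-1', 'LB-1').
    failed_devices: set of device_ids with >= 1 failure.
    """
    population = Counter(inventory.values())
    failed = Counter(inventory[d] for d in failed_devices if d in inventory)
    return {dev_type: failed[dev_type] / total
            for dev_type, total in population.items()}

# Toy example: 2 of 10 LB-1 devices failed, 0 of 5 ToR-1 devices failed.
inventory = {f"lb{i}": "LB-1" for i in range(10)}
inventory.update({f"tor{i}": "ToR-1" for i in range(5)})
print(failure_probability(inventory, {"lb0", "lb3"}))
```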

Load balancers have the highest failure probability. Figure 4 shows the failure probability for device types with population size of at least 300. In terms of overall failure probability, load balancers (LB-1, LB-2) are the least reliable with a 1 in 5 chance of experiencing failure. Since our definition of failure can include incidents where devices are power cycled during planned maintenance, we emphasize here that not all of these failures are unexpected. Our analysis of load balancer logs revealed several causes of these transient problems such as software bugs, configuration errors, and hardware faults related to ASIC and memory.

Figure 6: Percent of failures and downtime per device type.

Figure 7: Percent of failures and downtime per link type.

Table 5: Summary of failures per device (for devices that experience at least one failure).

Device type   Mean   Median   99%     COV
LB-1          11.4   1.0      426.0   5.1
LB-2          4.0    1.0      189.0   5.1
ToR-1         1.2    1.0      4.0     0.7
LB-3          3.0    1.0      15.0    1.1
ToR-2         1.1    1.0      5.0     0.5
AggS-1        1.7    1.0      23.0    1.7
Overall       2.5    1.0      11.0    6.8

ToRs have low failure rates. ToRs have among the lowest failure rates across all devices. This observation suggests that low-cost, commodity switches are not necessarily less reliable than their expensive, higher capacity counterparts and bodes well for data center networking proposals that focus on using commodity switches to build flat data center networks [3, 12, 21].

We next turn our attention to the probability of link failures at different layers in our network topology.

Load balancer links have the highest rate of logged failures. Figure 5 shows the failure probability for interface types with a population size of at least 500. Similar to our observation with devices, links forwarding load balancer traffic are most likely to experience failures (e.g., as a result of failures on LB devices).

Links higher in the network topology (CORE) and links connecting primary and back up of the same device (ISC) are the second most likely to fail, each with an almost 1 in 10 chance of failure. However, these events are more likely to be masked by network redundancy (Section 5.2). In contrast, links lower in the topology (TRUNK) only have about a 5% failure rate.

Management and inter-data center links have the lowest failure rates. Links connecting data centers (IX) and links used for managing devices (MGMT) have high reliability, with fewer than 3% of each of these link types failing. This observation is important because these links are the most utilized and least utilized, respectively (cf. Figure 2). Links connecting data centers are critical to our network and hence back up links are maintained to ensure that failure of a subset of links does not impact the end-to-end performance.

4.4 Aggregate impact of failures

In the previous section, we considered the reliability of individual links and devices. We next turn our attention to the aggregate impact of each population in terms of total number of failure events and total downtime. Figure 6 presents the percentage of failures and downtime for the different device types.

Load balancers have the most failures but ToRs have the most downtime. LBs have the highest number of failures of any device type. Of our top six devices in terms of failures, half are load balancers. However, LBs do not experience the most downtime, which is dominated instead by ToRs. This is counterintuitive since, as we have seen, ToRs have very low failure probabilities. There are three factors at play here: (1) LBs are subject to more frequent software faults and upgrades (Section 4.7); (2) ToRs are the most prevalent device type in the network (Section 2.1), increasing their aggregate effect on failure events and downtime; and (3) ToRs are not a high priority component for repair because of in-built failover techniques, such as replicating data and compute across multiple racks, that aim to maintain high service availability despite failures.

We next analyze the aggregate number of failures and downtime for network links. Figure 7 shows the normalized number of failures and downtime for the six most failure prone link types.

Load balancer links experience many failure events but relatively small downtime. Load balancer links experience the second highest number of failures, followed by ISC, MGMT and CORE links which all experience approximately 5% of failures. Note that despite LB links being second most frequent in terms of number of failures, they exhibit less downtime than CORE links (which, in contrast, experience about 5X fewer failures). This result suggests that failures for LBs are short-lived and intermittent, caused by transient software bugs rather than more severe hardware issues. We investigate these issues in detail in Section 4.7.

We observe that the total number of failures and downtime are dominated by LBs and ToRs, respectively. We next consider how many failures each element experiences. Table 5 shows the mean, median, 99th percentile and COV for the number of failures observed per device over a year (for devices that experience at least one failure).

Load balancer failures dominated by few failure prone devices. We observe that individual LBs experience a highly variable number of failures, with a few outlier LB devices experiencing more than 400 failures. ToRs, on the other hand, experience little variability in terms of the number of failures, with most ToRs experiencing between 1 and 4 failures. We make similar observations for links, where LB links experience very high variability relative to others (omitted due to limited space).

4.5 Properties of failures

We next consider the properties of failures for network element types that experienced the highest number of events.

Figure 8: Properties of device failures: (a) time to repair, (b) time between failures, (c) annualized downtime.

Figure 9: Properties of link failures that impacted network traffic: (a) time to repair, (b) time between failures, (c) annualized downtime.

4.5.1 Time to repair

This section considers the time to repair (or duration) for failures, computed as the time between a down notification for a network element and when it is reported as being back online. It is not always the case that an operator had to intervene to resolve the failure. In particular, for short duration failures, it is likely that the fault was resolved automatically (e.g., root guard in the spanning tree protocol can temporarily disable a port [10]). In the case of link failures, our SNMP polling interval of four minutes results in a grouping of durations around four minutes (Figure 9 (a)), indicating that many link failures are resolved automatically without operator intervention. Finally, for long-lived failures, the failure durations may be skewed by when the NOC tickets were closed by network operators. For example, some incident tickets may not be termed as ’resolved’ until a hardware replacement arrives in stock, even if normal operation has been restored.

Load balancers experience short-lived failures. We first look at the duration of device failures. Figure 8 (a) shows the CDF of time to repair for device types with the most failures. We observe that LB-1 and LB-3 load balancers experience the shortest failures with median time to repair of 3.7 and 4.5 minutes, respectively, indicating that most of their faults are short-lived.

ToRs experience correlated failures. When considering time to repair for devices, we observe a correlated failure pattern for ToRs. Specifically, these devices tend to have several discrete “steps” in the CDF of their failure durations. These steps correspond to spikes in specific duration values. On analyzing the failure logs, we find that these spikes are due to groups of ToRs that connect to the same (or pair of) AggS going down at the same time (e.g., due to maintenance or AggS failure).

Inter-data center links take the longest to repair. Figure 9 (a) shows the distribution of time to repair for different link types. The majority of link failures are resolved within five minutes, with the exception of links between data centers which take longer to repair. This is because links between data centers require coordination between technicians in multiple locations to identify and resolve faults, as well as additional time to repair cables that may be in remote locations.

4.5.2 Time between failures

We next consider the time between failure events. Since time between failure requires a network element to have observed more than a single failure event, this metric is most relevant to elements that are failure prone. Specifically, note that more than half of all elements have only a single failure (cf. Table 5), so the devices and links we consider here are in the minority.

Load balancer failures are bursty. Figure 8 (b) shows the distribution of time between failures for devices. LBs tend to have the shortest time between failures, with a median of 8.6 minutes and 16.4 minutes for LB-1 and LB-2, respectively. Recall that failure events for these two LBs are dominated by a small number of devices that experience numerous failures (cf. Table 5). This small number of failure prone devices has a high impact on time between failure, especially since more than half of the LB-1 and LB-2 devices experience only a single failure.

In contrast to LB-1 and LB-2, devices like ToR-1 and AggS-1 have median time between failure of multiple hours and LB-3 has median time between failure of more than a day. We note that the LB-3 device is a newer version of the LB-1 and LB-2 devices and it exhibits higher reliability in terms of time between failures.


Link flapping is absent from the actionable network logs. Figure 9 (b) presents the distribution of time between failures for the different link types. On average, link failures tend to be separated by a period of about one week. Recall that our methodology leverages actionable information, as determined by network operators. This significantly reduces our observations of spurious link down events and observations of link flapping that do not impact network connectivity.

MGMT, CORE and ISC links are the most reliable in terms of time between failures, with most link failures on CORE and ISC links occurring more than an hour apart. Links between data centers experience the shortest time between failures. However, note that links connecting data centers have a very low failure probability. Therefore, while most links do not fail, the few that do tend to fail within a short time period of prior failures. In reality, multiple inter-data center link failures in close succession are more likely to be investigated as part of the same troubleshooting window by the network operators.

4.5.3 Reliability of network elements

We conclude our analysis of failure properties by quantifying the aggregate downtime of network elements. We define annualized downtime as the sum of the duration of all failures observed by a network element over a year. For link failures, we consider failures that impacted network traffic, but highlight that a subset of these failures are due to planned maintenance. Additionally, redundancy in terms of network, application, and data in our system implies that this downtime cannot be interpreted as a measure of application-level availability. Figure 8 (c) summarizes the annual downtime for devices that experienced failures during our study.
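A sketch of the annualized downtime and availability ("nines") computation follows; the mapping from availability to nines is the standard -log10(1 - availability), not something specific to our logs, and the per-element input format is assumed.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_downtime_and_nines(failure_durations):
    """Annualized downtime and availability for one network element.

    failure_durations: list of failure durations (seconds) observed over a
    year for the element. Returns (downtime_seconds, availability, nines),
    where nines = -log10(1 - availability).
    """
    downtime = sum(failure_durations)
    availability = 1.0 - downtime / SECONDS_PER_YEAR
    nines = -math.log10(1.0 - availability) if downtime > 0 else float("inf")
    return downtime, availability, nines

# Example: 30 minutes of downtime in a year is a bit over four 9's.
print(annual_downtime_and_nines([10 * 60, 20 * 60]))
```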

Data center networks experience high availability. With the exception of ToR-1 devices, all devices have a median annual downtime of less than 30 minutes. Despite experiencing the highest number of failures, LB-1 devices have the lowest annual downtime. This is due to many of their failures being short-lived. Overall, devices experience higher than four 9’s of reliability with the exception of ToRs, where long lived correlated failures cause ToRs to have higher downtime; recall, however, that only 3.9% of ToR-1s experience any failures (cf. Figure 4).

Annual downtime for the different link types is shown in Figure 9 (c). The median yearly downtime for all link types, with the exception of links connecting data centers, is less than 10 minutes. This duration is smaller than the annual downtime of 24-72 minutes reported by Turner et al. when considering an academic WAN [26]. Links between data centers are the exception because, as observed previously, failures on links connecting data centers take longer to resolve than failures for other link types. Overall, links have high availability, with the majority of links (except those connecting data centers) having higher than four 9’s of reliability.

4.6 Grouping link failures

We now consider correlations between link failures. We also analyzed correlated failures for devices, but except for a few instances of ToRs failing together, grouped device failures are extremely rare (not shown).

To group correlated failures, we need to define what it means for failures to be correlated. First, we require that link failures occur in the same data center to be considered related (since it can be the case that links in multiple data centers fail close together in time but are in fact unrelated). Second, we require failures to occur within a predefined time threshold of each other to be considered correlated. When combining failures into groups, it is important to pick an appropriate threshold for grouping failures. If the threshold is too small, correlated failures may be split into many smaller events. If the threshold is too large, many unrelated failures will be combined into one larger group.

Figure 10: Number of links involved in link failure groups.

Table 6: Examples of problem types

Problem Type         Example causes or explanations
Change               Device deployment, UPS maintenance
Incident             OS reboot (watchdog timer expired)
Network Connection   OSPF convergence, UDLD errors, Cabling, Carrier signaling/timing issues
Software             IOS hotfixes, BIOS upgrade
Hardware             Power supply/fan, Replacement of line card/chassis/optical adapter
Configuration        VPN tunneling, Primary-backup failover, IP/MPLS routing

We considered the number of failures for different threshold values. Beyond grouping simultaneous events, which reduces the number of link failures by a factor of two, we did not see significant changes by increasing the threshold.
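The grouping procedure can be sketched as below, assuming (data_center, link_id, start_time) records; chaining failures that start within the threshold of the previous failure in the group is one reasonable reading of the method, not the exact implementation.

```python
from collections import defaultdict

def group_link_failures(failures, threshold=0):
    """Group link failures that occur in the same data center and start
    within `threshold` seconds of the previous failure in the group.

    failures: list of (data_center, link_id, start_time).
    threshold=0 groups only simultaneous failures.
    Returns a list of groups (lists of failures).
    """
    by_dc = defaultdict(list)
    for dc, link, start in failures:
        by_dc[dc].append((start, link, dc))

    groups = []
    for dc, evs in by_dc.items():
        evs.sort()
        current = [evs[0]]
        for ev in evs[1:]:
            if ev[0] - current[-1][0] <= threshold:
                current.append(ev)
            else:
                groups.append(current)
                current = [ev]
        groups.append(current)
    return groups

# Example: two simultaneous failures in DC1 form one group; DC2 stays separate.
fails = [("dc1", "l1", 100), ("dc1", "l2", 100), ("dc1", "l3", 5000),
         ("dc2", "l9", 100)]
print([len(g) for g in group_link_failures(fails, threshold=0)])  # [2, 1, 1]
```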

Link failures tend to be isolated. The size of failure groups produced by our grouping method is shown in Figure 10. We see that just over half of failure events are isolated, with 41% of groups containing more than one failure. Large groups of correlated link failures are rare, with only 10% of failure groups containing more than four failures. We observed two failure groups with the maximum failure group size of 180 links. These were caused by scheduled maintenance to multiple aggregation switches connected to a large number of ToRs.

4.7 Root causes of failures

Finally, we analyze the types of problems associated with device and link failures. We initially tried to determine the root cause of failure events by mining diaries associated with NOC tickets. However, the diaries often considered multiple potential causes for failure before arriving at the final root cause, which made mining the text impractical. Because of this complication, we chose to leverage the “problem type” field of the NOC tickets, which allows operators to place tickets into categories based on the cause of the problem. Table 6 gives examples of the types of problems that are put into each of the categories.

Hardware problems take longer to mitigate. Figure 11 considers the top problem types in terms of number of failures and total downtime for devices. Software and hardware faults dominate in terms of number of failures for devices. However, when considering downtime, the balance shifts and hardware problems have the most downtime. This shift between the number of failures and the total downtime may be attributed to software errors being alleviated by tasks that take less time to complete, such as power cycling, patching or upgrading software. In contrast, hardware errors may require a device to be replaced, resulting in longer repair times.

Figure 11: Device problem types.

Figure 12: Link problem types.

Figure 13: Estimated packet loss during failure events.

Figure 14: Estimated traffic loss during failure events.

Load balancers affected by software problems. We examined what types of errors dominated for the most failure prone device types (not shown). The LB-1 load balancer, which tends to have short, frequent failures and accounts for most failures (but relatively low downtime), mainly experiences software problems. Hardware problems dominate for the remaining device types. We observe that LB-3, despite also being a load balancer, sees far fewer software issues than LB-1 and LB-2 devices, suggesting higher stability in the newer model of the device.

Link failures are dominated by connection and hardware problems. Figure 12 shows the total number of failures and total downtime attributed to different causes for link failures. In contrast to device failures, link failures are dominated by network connection errors, followed by hardware and software issues. In terms of downtime, software errors incur much less downtime per failure than hardware and network connection problems. This suggests software problems lead to sporadic short-lived failures (e.g., a software bug causing a spurious link down notification) as opposed to severe network connectivity and hardware related problems.

5. ESTIMATING FAILURE IMPACT

In this section, we estimate the impact of link failures. In the absence of application performance data, we aim to quantify the impact of failures in terms of lost network traffic. In particular, we estimate the amount of traffic that would have been routed on a failed link had it been available for the duration of the failure.

In general, it is difficult to precisely quantify how much data was actually lost during a failure because of two complications. First, flows may successfully be re-routed to use alternate routes after a link failure and protocols (e.g., TCP) have in-built retransmission mechanisms. Second, for long-lived failures, traffic variations (e.g., traffic bursts, diurnal workloads) mean that the link may not have carried the same amount of data even if it was active. Therefore, we propose a simple metric to approximate the magnitude of traffic lost due to failures, based on the available data sources.

To estimate the impact of link failures on network traffic (both in terms of bytes and packets), we first compute the median number of packets (or bytes) on the link in the hours preceding the failure event, med_b, and the median packets (or bytes) during the failure, med_d. We then compute the amount of data (in terms of packets or bytes) that was potentially lost during the failure event as:

loss = (med_b − med_d) × duration

where duration denotes how long the failure lasted. We use median traffic instead of average to avoid outlier effects.
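A sketch of this estimator over five-minute samples follows; the eight-hour lookback window and per-second rate units are assumptions made for illustration, with med_b and med_d computed as medians of the before/during samples.

```python
import statistics

def estimate_loss(traffic, fail_start, fail_end, lookback=8 * 3600):
    """Estimate packets (or bytes) lost during a link failure as
    (med_b - med_d) * duration, per the formula above.

    traffic: list of (timestamp, rate) samples, where rate is packets (or
    bytes) per second averaged over five minutes.
    lookback: window before the failure used for med_b (assumed 8 hours).
    """
    before = [r for t, r in traffic if fail_start - lookback <= t < fail_start]
    during = [r for t, r in traffic if fail_start <= t <= fail_end]
    if not before or not during:
        return None
    med_b = statistics.median(before)
    med_d = statistics.median(during)
    duration = fail_end - fail_start
    return (med_b - med_d) * duration

# Example: rate drops from 1000 to 100 pkts/s for a 600 s failure,
# giving roughly (1000 - 100) * 600 = 540,000 packets estimated lost.
samples = [(t, 1000) for t in range(0, 3600, 300)] + \
          [(t, 100) for t in range(3600, 4200, 300)]
print(estimate_loss(samples, fail_start=3600, fail_end=4200))
```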


As described in Section 2, the network traffic in a typical data center may be classified into short-lived, latency-sensitive “mice” flows and long-lived, throughput-sensitive “elephant” flows. Packet loss is much more likely to adversely affect “mice” flows, where the loss of an ACK may cause TCP to perform a timed out retransmission. In contrast, loss in traffic throughput is more critical for “elephant” flows.

Link failures incur loss of many packets, but relatively few bytes. For link failures, few bytes are estimated to be lost relative to the number of packets. We observe that the estimated median number of packets lost during failures is 59K (Figure 13) but the estimated median number of bytes lost is only 25MB (Figure 14). Thus, the average size of lost packets is 423 bytes. A prior measurement study on data center network traffic observed that packet sizes tend to be bimodal with modes around 200B and 1,400B [5]. This suggests that packets lost during failures are mostly part of the lower mode, consisting of keep alive packets used by applications (e.g., MYSQL, HTTP) or ACKs [5].

5.1 Is redundancy effective in reducing impact?

In a well-designed network, we expect most failures to be masked by redundant groups of devices and links. We evaluate this expectation by considering median traffic during a link failure (in packets or bytes) normalized by median traffic before the failure: med_d/med_b; for brevity, we refer to this quantity as “normalized traffic”. The effectiveness of redundancy is estimated by computing this ratio on a per-link basis, as well as across all links in the redundancy group where the failure occurred. An example of a redundancy group is shown in Figure 15. If a failure has been masked completely, this ratio will be close to one across the redundancy group, i.e., traffic during the failure was equal to traffic before the failure.
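As an illustration of how this ratio distinguishes per-link impact from redundancy-group impact, the sketch below computes normalized traffic both ways; the dictionary layout and the use of summed group traffic are our assumptions about one plausible way to aggregate “across all links in the redundancy group”.

    def normalized_traffic(med_before, med_during):
        """Traffic during a failure normalized by traffic before it (med_d / med_b)."""
        if med_before == 0:
            return None  # ratio undefined for links that carried no traffic
        return med_during / med_before

    def group_normalized_traffic(group):
        """Normalized traffic across a redundancy group.

        group: dict mapping link name -> (med_before, med_during).
        Summing first means traffic that fails over onto a redundant
        link still counts toward the group-level ratio.
        """
        total_before = sum(b for b, _ in group.values())
        total_during = sum(d for _, d in group.values())
        return normalized_traffic(total_before, total_during)

    # A failed link whose traffic largely shifts to its redundant partner:
    group = {"link1": (100.0, 10.0), "link2": (100.0, 180.0)}
    print(normalized_traffic(*group["link1"]))  # 0.10 -> heavy per-link impact
    print(group_normalized_traffic(group))      # 0.95 -> mostly masked at group level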

Network redundancy helps, but it is not entirely effective. Figure 16 shows the distribution of normalized byte volumes for individual links and redundancy groups. Redundancy groups are effective at moving the ratio of traffic carried during failures closer to one, with 25% of events experiencing no impact on network traffic at the redundancy group level. Also, the median traffic carried at the redundancy group level is 93%, as compared with 65% per link. This is an improvement of 43% in median traffic as a result of network redundancy ((93 − 65)/65 ≈ 0.43). We make a similar observation when considering packet volumes (not shown).

There are several reasons why redundancy may not be 100% effective in eliminating the impact of failures on network traffic. First, bugs in fail-over mechanisms can arise if there is uncertainty as to which link or component is the backup (e.g., traffic may be regularly traversing the backup link [7]). Second, if the redundant components are not configured correctly, they will not be able to re-route traffic away from the failed component. For example, we observed the same configuration error made on both the primary and backup of a network connection because of a typo in the configuration script. Further, protocol issues such as TCP backoff, timeouts, and spanning tree reconfigurations may result in loss of traffic.

5.2 Redundancy at different layers of the network topology

This section analyzes the effectiveness of network redundancy across different layers in the network topology. We logically divide links based on their location in the topology. Location is determined based on the types of devices connected by the link (e.g., a CoreCore link connects two core routers). Figure 17 plots quartiles of normalized traffic (in bytes) for links at different layers of the network topology.
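For reference, classifying a link by its layer amounts to a lookup on the device types at its endpoints; the sketch below uses the level names from Figure 17, with the mapping itself being our illustrative reconstruction rather than the exact rules used by the measurement pipeline.

    # Map the pair of endpoint device types to a topology level (Figure 17 labels).
    LEVELS = {
        frozenset({"ToR", "AggS"}):  "ToRAgg",
        frozenset({"AggS", "AccR"}): "AggAccR",
        frozenset({"AggS", "LB"}):   "AggLB",
        frozenset({"AccR", "Core"}): "AccRCore",
        frozenset({"Core"}):         "CoreCore",  # both endpoints are core routers
    }

    def link_level(device_a, device_b):
        """Return the topology level for a link between two device types."""
        return LEVELS.get(frozenset({device_a, device_b}), "Other")

    print(link_level("Core", "Core"))  # CoreCore
    print(link_level("ToR", "AggS"))   # ToRAgg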


Figure 15: An example redundancy group between a primary (P) and backup (B) aggregation switch (AggS) and access router (AccR).

[Figure 16 plot: CDF, P[X < x] versus traffic during/traffic before, shown per link and per redundancy group.]

Figure 16: Normalized traffic (bytes) during failure events per link as well as within redundancy groups.

Links highest in the topology benefit most from redundancy. A reliable network core is critical to traffic flow in data centers. We observe that redundancy is effective at ensuring that failures between core devices have minimal impact. In the core of the network, the median traffic carried during failure drops to 27% per link but remains at 100% when considered across a redundancy group. Links between aggregation switches and access routers (AggAccR) experience the next highest benefit from redundancy, where the median traffic carried during failure drops to 42% per link but remains at 86% across redundancy groups.

Links from ToRs to aggregation switches benefit the least from redundancy, but have low failure impact. Links near the edge of the data center topology benefit the least from redundancy: for links connecting ToRs to AggS, the median traffic carried during failure increases from 68% per link to 94% within redundancy groups. However, we observe that on a per-link basis, these links do not experience significant impact from failures, so there is less room for redundancy to benefit them.

[Figure 17 plot: quartiles of median normalized # bytes by topology level (All, ToRAgg, AggAccR, AggLB, AccRCore, CoreCore), per link and per redundancy group.]

Figure 17: Normalized bytes (quartiles) during failure events per link and across redundancy groups, compared across different layers in the data center topology.

6. DISCUSSION

In this section, we discuss implications of our study for the design of data center networks and future directions on characterizing data center reliability.

Low-end switches exhibit high reliability. Low-cost, commodity switches in our data centers experience the lowest failure rate, with a failure probability of less than 5% annually for all types of ToR switches and AggS-2. However, due to their much larger population, the ToRs still rank third in terms of number of failures and dominate in terms of total downtime. Since ToR failures are considered the norm rather than the exception (and are typically masked by redundancy in the application, data, and network layers), ToRs have a low priority for repair relative to other outage types. This suggests that proposals to leverage commodity switches to build flat data center networks [3, 12, 21] will be able to provide good reliability. However, as populations of these devices rise, the absolute number of failures observed will inevitably increase.

Improve reliability of middleboxes. Our analysis of network failure events highlights the role that middleboxes such as load balancers play in the overall reliability of the network. While there have been many studies on improving the performance and scalability of large-scale networks [2, 3, 12–14, 21], only a few studies focus on the management of middleboxes in data center networks [15]. Middleboxes such as load balancers are a critical part of data center networks that need to be taken into account when developing new routing frameworks. Further, the development of better management and debugging tools would help alleviate the software and configuration faults frequently experienced by load balancers. Finally, software load balancers running on commodity servers can be explored to provide cost-effective, reliable alternatives to expensive and proprietary hardware solutions.

Improve the effectiveness of network redundancy. We observe that network redundancy in our system is 40% effective at masking the impact of network failures. One cause is configuration issues that prevent redundancy from masking failures. For instance, we observed an instance where the same typo was made when configuring interfaces on both the primary and backup of a load balancer connection to an aggregation switch. As a result, the backup link was subject to the same flaw as the primary. This type of error occurs when operators configure large numbers of devices, and highlights the importance of automated configuration and validation tools (e.g., [8]).

Separate control plane from data plane. Our analysis of NOC tickets reveals that in several cases, the loss of keep alive messages resulted in the disconnection of portchannels, which are virtual links that bundle multiple physical interfaces to increase aggregate link speed. For some of these cases, we manually correlated the loss of control packets with application-level logs that showed significant traffic bursts in the hosted application on the egress path. This interference between application and control traffic is undesirable. Software Defined Networking (SDN) proposals such as OpenFlow [20] present a solution to this problem by maintaining state in a logically centralized controller, thus eliminating keep alive messages in the data plane. In the context of proposals that leverage location-independent addressing (e.g., [12, 21]), this separation between control plane (e.g., ARP and DHCP requests, directory service lookups [12]) and data plane becomes even more crucial to avoid impact to hosted applications.

7. RELATED WORK

Previous studies of network failures have considered application-level [16, 22] or network connectivity [18, 19, 25, 26, 28] failures. There also have been several studies on understanding hardware reliability in the context of cloud computing [11, 23, 24, 27].

Application failures. Padmanabhan et al. consider failures from the perspective of Web clients [22]. They observe that the majority of failures occur during the TCP handshake as a result of end-to-end connectivity issues. They also find that Web access failures are dominated by server-side issues. These findings highlight the importance of studying failures in data centers hosting Web services.

NetMedic aims to diagnose application failures in enterprise networks [16]. By taking into account the state of components that fail together (as opposed to simply grouping all components that fail together), it is able to limit the number of incorrect correlations between failures and components.

Network failures. There have been many studies of network failures in wide area and enterprise networks [18, 19, 25, 26, 28], but none consider network element failures in large-scale data centers.

Shaikh et al. study properties of OSPF Link State Advertisement (LSA) traffic in a data center connected to a corporate network via leased lines [25]. Watson et al. also study the stability of OSPF by analyzing LSA messages in a regional ISP network [28]. Both studies observe significant instability and flapping as a result of external routing protocols (e.g., BGP). Unlike these studies, we do not observe link flapping, owing to our data sources being geared towards actionable events.

Markopoulou et al. use IS-IS listeners to characterize failures in an ISP backbone [19]. The authors classify failures as either router-related or optical-related by correlating time and impacted network components. They find that 70% of their failures involve only a single link. We similarly observe that the majority of failures in our data centers are isolated.

More recently, Turner et al. consider failures in an academic WAN using syslog messages generated by IS-IS [26]. Unlike previous studies [19, 25, 28], the authors leverage existing syslog, e-mail notifications, and router configuration data to study network failures. Consistent with prior studies that focus on OSPF [25, 28], the authors observe link flapping. They also observe longer time to repair on wide area links, similar to our observations for wide area links connecting data centers.

Failures in cloud computing. The interest in cloud computing has increased focus on understanding component failures, as even a small failure rate can manifest itself in a high number of failures in large-scale systems. Previous work has looked at failures of DRAM [24], storage [11, 23], and server nodes [27], but there has not been a large-scale study on network component failures in data centers. Ford et al. consider the availability of distributed storage and observe that the majority of failures involving more than ten storage nodes are localized within a single rack [11]. We also observe spatial correlations, but they occur higher in the network topology, where we see multiple ToRs associated with the same aggregation switch having correlated failures.

Complementary to our work, Benson et al. mine threads from the customer service forums of an IaaS cloud provider [6]. They report on problems users face when using IaaS and observe that problems related to managing virtual resources and debugging the performance of computing instances, which require the involvement of cloud administrators, increase over time.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented the first large-scale analysis of network failure events in data centers. We focused our analysis on characterizing failures of network links and devices, estimating their failure impact, and analyzing the effectiveness of network redundancy in masking failures. To undertake this analysis, we developed a methodology that correlates network traffic logs with logs of actionable events, to filter a large volume of non-impacting failures due to spurious notifications and errors in logging software.

Our study is part of a larger project, NetWiser, on understanding reliability in data centers to aid research efforts on improving network availability and designing new network architectures. Based on our study, we find that commodity switches exhibit high reliability, which supports current proposals to design flat networks using commodity components [3, 12, 17, 21]. We also highlight the importance of studies to better manage middleboxes such as load balancers, as they exhibit high failure rates. Finally, more investigation is needed to analyze and improve the effectiveness of redundancy at both the network and application layers.

Future work. In this study, we consider the occurrence of interface-level failures. This is only one aspect of reliability in data center networks. An important direction for future work is correlating logs from application-level monitors with the logs collected by network operators to determine what fraction of observed errors do not impact applications (false positives) and what fraction of application errors are not observed (e.g., because of a server or storage failure that we cannot observe). This would enable us to understand what fraction of application failures can be attributed to network failures. Another extension to our study would be to understand what these low-level failures mean in terms of convergence for network protocols such as OSPF, and to analyze the impact on end-to-end network connectivity by incorporating logging data from external sources, e.g., BGP neighbors.

Acknowledgements

We thank our shepherd Arvind Krishnamurthy and the anonymous reviewers for their feedback. We are grateful to David St. Pierre for helping us understand the network logging systems and data sets.

9. REFERENCES

[1] Cisco: Data center: Load balancing data center services, 2004. www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns668/net_implementation_white_paper0900aecd8053495a.html.

[2] H. Abu-Libdeh, P. Costa, A. I. T. Rowstron, G. O’Shea, and A. Donnelly. Symbiotic routing in future data centers. In SIGCOMM, 2010.
[3] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
[4] M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM, 2010.
[5] T. Benson, A. Akella, and D. Maltz. Network traffic characteristics of data centers in the wild. In IMC, 2010.
[6] T. Benson, S. Sahu, A. Akella, and A. Shaikh. A first look at problems in the cloud. In HotCloud, 2010.
[7] J. Brodkin. Amazon EC2 outage calls ’availability zones’ into question, 2011. http://www.networkworld.com/news/2011/042111-amazon-ec2-zones.html.
[8] X. Chen, Y. Mao, Z. M. Mao, and K. van de Merwe. Declarative configuration management for complex and dynamic networks. In CoNEXT, 2010.
[9] Cisco. UniDirectional Link Detection (UDLD). http://www.cisco.com/en/US/tech/tk866/tsd_technology_support_sub-protocol_home.html.
[10] Cisco. Spanning tree protocol root guard enhancement, 2011. http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a00800ae96b.shtml.
[11] D. Ford, F. Labelle, F. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in globally distributed storage systems. In OSDI, 2010.
[12] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.
[13] C. Guo, H. Wu, K. Tan, L. Shiy, Y. Zhang, and S. Lu. DCell: A scalable and fault-tolerant network structure for data centers. In SIGCOMM, 2008.
[14] C. Guo, H. Wu, K. Tan, L. Shiy, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. In SIGCOMM, 2009.
[15] D. Joseph, A. Tavakoli, and I. Stoica. A policy-aware switching layer for data centers. In SIGCOMM, 2008.
[16] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, and P. Bahl. Detailed diagnosis in enterprise networks. In SIGCOMM, 2010.
[17] C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: A scalable Ethernet architecture for large enterprises. In SIGCOMM, 2008.
[18] C. Labovitz and A. Ahuja. Experimental study of Internet stability and wide-area backbone failures. In The Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999.
[19] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, Y. Ganjali, and C. Diot. Characterization of failures in an operational IP backbone network. IEEE/ACM Transactions on Networking, 2008.
[20] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. In SIGCOMM CCR, 2008.
[21] R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A scalable fault-tolerant layer 2 data center network fabric. In SIGCOMM, 2009.
[22] V. Padmanabhan, S. Ramabhadran, S. Agarwal, and J. Padhye. A study of end-to-end web access failures. In CoNEXT, 2006.
[23] B. Schroeder and G. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST, 2007.
[24] B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: A large-scale field study. In SIGMETRICS, 2009.
[25] A. Shaikh, C. Isett, A. Greenberg, M. Roughan, and J. Gottlieb. A case study of OSPF behavior in a large enterprise network. In ACM IMW, 2002.
[26] D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage. California fault lines: Understanding the causes and impact of network failures. In SIGCOMM, 2010.
[27] K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In Symposium on Cloud Computing (SOCC), 2010.
[28] D. Watson, F. Jahanian, and C. Labovitz. Experiences with monitoring OSPF on a regional service provider network. In ICDCS, 2003.
