Passive realtime datacenter fault detection and localization
Arjun Roy, James Hongyi Zeng*, Jasmeet Bagga*, and Alex C. Snoeren
University of California, San Diego; Facebook*


Page 2:

“It would be nice if we could figure out which link was causing these retransmits.”

- Ranjeeth Dasineni, Facebook (paraphrased)

Page 3:

Contemporary datacenter network

However: faults may be partial/intermittent.

Page 4:

Partial faults: A few examples

• NetPilot (SIGCOMM 2012): Frame check error, unequal ECMP hashing, etc.
  Wu, Xin, et al. "NetPilot: automating datacenter network failure mitigation." ACM SIGCOMM Computer Communication Review 42.4 (2012): 419-430.

• Everflow (SIGCOMM 2015): TCAM bit errors, silent packet drops.
  Zhu, Yibo, et al. "Packet-Level Telemetry in Large Datacenter Networks." SIGCOMM, 2015.

• Pingmesh (SIGCOMM 2015): “fiber FCS … errors, switching ASIC defects, switch fabric flaw, switch software bug, NIC configuration issue, network congestions, etc. We have seen all these types of issues in our production networks.”
  Guo, Chuanxiong, et al. "Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis." SIGCOMM, 2015.

Page 5:

Vast body of prior work (just a small sample…)

• Application instrumentation: various production systems

• Active probing: Pingmesh (SIGCOMM ’15), NetNORAD (Facebook), ATPG (CoNEXT ’12), Everflow (SIGCOMM ’15)

• Machine learning: NetPoirot (SIGCOMM ’16)

• Graph algorithms: Gestalt (USENIX ATC ’14), SCORE (NSDI ’05)

• Path tracing: Everflow (SIGCOMM ’15), NetNORAD (Facebook), NetSight (NSDI ’14), Tiny Packet Programs (SIGCOMM ’14)

• Network instrumentation: FlowRadar (NSDI ’16), Planck (SIGCOMM ’14), NetPilot (SIGCOMM ’12)

Page 6:

We exploit: highly regular load balanced traffic

[Figure: traffic magnitude by source rack and destination rack]

Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the Social Network's (Datacenter) Network. ACM SIGCOMM '15, London, England.

Page 7:

Load balanced traffic simplifies fault handling

• Evenly loaded paths means per path performance is similar if no errors.
• Network faults lead to outlier paths.
• If flow network path known, can correlate flow performance with path.

• Approach allows us to find and localize faults:
  • In an application agnostic manner
  • Incurring no additional probing overhead
  • More rapidly than prior published works

Page 8:

Facebook datacenter topology

Alexey Andreyev. Introducing data center fabric, the next-generation Facebook data center network. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/

Page 9:

Finding path information at Facebook

[Figure: path from source host through ToR, Agg, and Core switches to destination host]


Page 17:

Finding path information at Facebook

[Figure: path from source host through ToR, Agg, and Core switches to destination host]

Solution: aggregation switch marks packets based on core downlink traversed.

Page 18:

How do we use path information?

• In principle: can compare flow performance by path.
  1. Combinatorial disaster: O(10,000) paths from single host to remote racks.
  2. No localization: doesn’t tell us which link/switch is at fault.

• But: for this traffic pattern, ECMP routing gives us even bytes/link.

• Solution: Just compare links!

Create “Equivalence Sets”: sets of links handling similar load and exhibiting similar performance, in the absence of faults.

Equivalence sets:
1. Reduces number of comparisons needed.
2. Pinpoints fault to specific location.

Page 19:

Equivalence sets in Facebook topology

[Figure: source host, ToR, Agg switches, and Core switches]

Equivalence set: 4 uplinks from each ToR to pod Agg layer

…each has close to identical performance distribution in absence of errors

Page 20:

Equivalence sets in Facebook topology

[Figure: source host, ToR, Agg switches, and Core switches]

Equivalence set: N uplinks from pod Agg layer to core layer

…each has close to identical performance distribution in absence of errors
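The grouping described on the two slides above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Link` record, the switch names, and the rule "links leaving the same switch toward the same tier form one set" are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    src: str       # switch the link leaves, e.g. "tor1" (hypothetical name)
    dst: str       # switch the link enters
    src_tier: str  # "tor", "agg", or "core"
    dst_tier: str


def equivalence_sets(links):
    """Group links that share a source switch and a destination tier.

    Assumption for this sketch: ECMP spreads load evenly over all such
    links, so they should perform alike when no fault is present.
    """
    sets = defaultdict(list)
    for link in links:
        sets[(link.src, link.dst_tier)].append(link)
    return dict(sets)


# Toy topology: one ToR with 4 uplinks, one Agg switch with 3 uplinks.
links = [Link("tor1", f"agg{i}", "tor", "agg") for i in range(1, 5)]
links += [Link("agg1", f"core{i}", "agg", "core") for i in range(1, 4)]
sets = equivalence_sets(links)
```

Comparisons then happen only within each set, so the work grows with the number of links rather than the number of end-to-end paths.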

Page 21:

Outlier analysis with application agnostic metrics

Hosts already track metrics for congestion control or performance monitoring:

• TCP congestion window: Affected by packet loss.
• TCP retransmits: Affected by packet loss.
• Smoothed round trip time: Affected by latency spikes.
• System call latency: Affected by packet loss.

Caveat: Can be difficult to determine if an effect is due to a faulty link, overloaded hosts, application variance, etc.

With equivalence set based grouping, we can compare distributions by link.

Only link faults cause variance between links.

Page 22:

Demonstrating equivalence sets from Agg to ToR

[Figure: Host connected to ToR, which connects to Agg 1, Agg 2, Agg 3, and Agg 4]

(1) ToR marks packet DSCP per inbound link
(2) Host aggregates TCP metrics by link
(3a) We simulate error on this link:
(3b) Host drops 0.5% of packets traversing link
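Step (2) above, where the host aggregates TCP metrics by link, might look roughly like the following. The DSCP values, link names, and the `(dscp, retransmits)` record shape are hypothetical; the slides do not specify how the host obtains or stores these values.

```python
from collections import defaultdict

# Hypothetical mapping from the DSCP value set by the ToR to the
# Agg-to-ToR link the packet arrived on.
DSCP_TO_LINK = {1: "agg1->tor", 2: "agg2->tor", 3: "agg3->tor", 4: "agg4->tor"}


def aggregate_by_link(flow_records):
    """Collect per-flow retransmit counts under the link each flow used.

    Each record is a (dscp, retransmits) pair. Records carrying a DSCP
    value that is not in the mapping are skipped.
    """
    per_link = defaultdict(list)
    for dscp, retransmits in flow_records:
        link = DSCP_TO_LINK.get(dscp)
        if link is not None:
            per_link[link].append(retransmits)
    return dict(per_link)


# Toy data: flows over link 3 see more retransmits; DSCP 9 is unknown.
records = [(1, 0), (2, 0), (3, 5), (3, 7), (4, 1), (9, 2)]
per_link = aggregate_by_link(records)
```

The result is one sample of retransmit counts per link, which is the input a per-link outlier test needs.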

Page 23:

TCP congestion window in Agg to ToR equivalence set

[Figure: cache server]

Page 24:

Congestion window signal is application agnostic

[Figure panels: cache server, web server]

Page 25:

We use: TCP retransmits in our work

[Figure panels: cache server, web server]

Page 26:

Detecting faults in production

• Monitored traffic through pod aggregation switch.
  1. No faults injected.
  2. Collected TCP metric data on 30 web server hosts.
  3. Equivalence set: four linecards connecting to core layer (each linecard has equal share of uplinks).

• On January 25th, a single linecard had a software fault.
  1. Linecard controller software hung.
  2. BGP routes timed out, production traffic through linecard routed away.
  3. A few minutes later, NetNORAD flagged unresponsive linecard.

Page 27:

Fault visible to our approach in 30 seconds

Page 28:

Classifying faulty links

• “Does this link have more retransmits per flow than the other links?”

• “Do two distributions have the same mean, or is one greater?”

Classifier: compare each link to other links with one sample Student’s T-Test.
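A rough sketch of such a classifier is shown below, using only the Python standard library. Two things are assumptions of this example rather than details from the slides: each link is tested against the pooled mean of the other links in its equivalence set, and the p-value uses a normal approximation in place of the exact Student's t distribution (reasonable only for large samples). The link names and retransmit counts are made up.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """One-sample t-statistic of `sample` against the reference mean `mu0`.

    Needs at least two values that are not all identical.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)
    return (mean - mu0) / (stdev / math.sqrt(n))


def approx_one_sided_p(t):
    """Normal approximation to P(T >= t); not the exact t distribution."""
    return 1.0 - statistics.NormalDist().cdf(t)


def score_links(per_link):
    """Test each link's retransmits against the mean of all other links."""
    scores = {}
    for link, sample in per_link.items():
        others = [x for name, xs in per_link.items() if name != link for x in xs]
        t = one_sample_t(sample, statistics.mean(others))
        scores[link] = (t, approx_one_sided_p(t))
    return scores


# Made-up retransmits per flow; link4 is the lossy one.
per_link = {
    "link1": [0, 1, 0, 2, 1, 0, 1, 0],
    "link2": [1, 0, 0, 1, 2, 0, 1, 1],
    "link3": [0, 2, 1, 0, 0, 1, 1, 0],
    "link4": [4, 6, 5, 7, 3, 6, 5, 4],
}
scores = score_links(per_link)
suspect = max(scores, key=lambda link: scores[link][0])
```

With this toy data, `suspect` is the link with the largest positive t-statistic, which is the link whose flows retransmit more than those of its peers.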

Page 29:

Online fault monitoring with T-Test alone

• In principle: can set up a system that uses end host T-Test result to tell us which network links are faulty.

• However: by itself this is susceptible to False Positives.

• Can’t afford false positives in network with O(10,000) links!

Page 30:

Accounting for false positives

• However, two characteristics aid us:
  1. Per-host false positives evenly distributed per link over time.
  2. Datacenter has a plethora of hosts for which this is true.

• Thus, we’re not trying to see if a given link is marked faulty by hosts.

• Instead, we once again perform outlier analysis.
  1. “Are all the links being marked faulty by hosts at similar rates?”
  2. “Are hosts flagging a particular subset of links as faulty at higher rates?”

Chi-squared test: determines if any links are outliers.

P-Value ≈ 1: “Yes, all the links are being marked faulty by hosts at similar rates.”

P-Value ≈ 0: “No, a subset has a comparatively high percentage of hosts claiming fault.”
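The chi-squared step can be illustrated with a goodness-of-fit test against a uniform expectation. This sketch assumes the input is, for each link in an equivalence set, the number of hosts whose T-Test flagged that link; the counts are invented, and the slides do not say exactly how the test is set up. The p-value is computed with a standard series for the incomplete gamma function so that the example needs no third-party library.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-squared variable with `df` degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma
    function; adequate for the moderate statistics in this sketch.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def chi_squared_uniform(counts):
    """Test whether per-link fault-flag counts are consistent with uniform."""
    expected = sum(counts) / len(counts)
    if expected == 0:
        return 0.0, 1.0  # no host flagged anything
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi2_sf(stat, len(counts) - 1)


# Hypothetical counts of hosts flagging each of four links.
stat_ok, p_ok = chi_squared_uniform([10, 11, 9, 10])   # evenly spread flags
stat_bad, p_bad = chi_squared_uniform([5, 4, 6, 45])   # one link stands out
```

In the evenly spread case the p-value is close to 1, and in the skewed case it is close to 0, matching the two readings given above.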

Page 31:

Evaluation in the datacenter

• Small detection surface; did not detect any ‘organic’ partial faults.

• Approach: inject ‘simulated’ faults to evaluate approach.

• Induced a variety of fault scenarios to challenge our system.

Page 32:

Evaluation in the datacenter: fault scenarios

• Minuscule faults: faults can have very low drop rates.

• Concurrent faults: multiple faults can occur simultaneously.

• Masked faults: larger fault can mask effect of minuscule fault.

• Correlated faults: hardware fault can impact multiple nearby links, confounding outlier analysis.


Page 34:

Finding minuscule faults: experiment setup

[Figure: hosts, ToR switches, Agg switches, and Core switches; one Agg switch connects to Core 1, Core 2, Core 3, … Core N]


Page 36:

Finding minuscule faults: experiment setup

[Figure: hosts, ToR switches, Agg switches, and Core switches; one Agg switch connects to Core 1, Core 2, Core 3, … Core N]

Equivalence set: N uplinks from pod Agg layer to core layer

Page 37:

Finding minuscule faults: experiment setup

[Figure: hosts, ToR switches, Agg switches, and Core switches; one Agg switch connects to Core 1, Core 2, Core 3, … Core N]

Partial fault induced on single Core to Agg downlink.

Page 38:

Fault detection rate vs drop rate

Page 39:

Minuscule faults: choosing between detection speed and sensitivity


Page 45:

“It would be nice if we could figure out which link was causing these retransmits.”

- Ranjeeth Dasineni, Facebook (paraphrased)


Page 47:

Interpreting the T-Test

1. T-Statistic: “Does this link have more or fewer retransmits than average?”

  • Positive T-statistic means larger than average.
  • Negative T-statistic means smaller than average.

2. P-Value: “Is the difference in mean big enough to concern us?”

  • Close to 0 means this link could be an outlier.
  • Close to 1 means we are not concerned.
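For reference, the standard one-sample t-statistic behind these two readings can be written as below. The symbols are notation introduced here, not taken from the slides: \bar{x} is the mean retransmits per flow on the link under test, \mu_0 is the reference mean it is compared against (for example, the average over the other links), s is the sample standard deviation on that link, and n is the number of flows sampled.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
```

The sign of t follows the sign of \bar{x} - \mu_0, which is why a positive statistic means "larger than average". The p-value is the probability, under a Student's t distribution with n - 1 degrees of freedom, of seeing a statistic at least this extreme if the link were no different from its peers.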

Page 48:

Interpreting the T-Test

[Figure panels: P-value 0, t-stat > 0; P-value 1, t-stat ≈ 0]