Democratically Finding The Cause of Packet Drops
![Page 1: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/1.jpg)
Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang (Harry) Liu, Jitu Padhye, Geoff Outhred, Boon Thau Loo
![Page 2: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/2.jpg)
The ultimate goal of network diagnosis: Find the cause of every packet drop
Sherlock- SigComm 2007
Netclinic- VAST 2010
Netprofiler- P2Psys 2005
Marple- SigComm 2017
In this talk I will show how to: Find the cause of every TCP packet drop*
*As long as it is not caused by noise
![Page 3: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/3.jpg)
Not all faults are the same
Associate failed links with problems they cause
High drop rate
Lower drop rate
My connections to service X are failing
![Page 4: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/4.jpg)
Mapping complaints to faulty links
But operators don’t always know where the failures are either
![Page 5: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/5.jpg)
Clouds operate at massive scales
Each data center has millions of devices
![Page 6: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/6.jpg)
Low congestion drop rates add noise
One-off, transient drops do occur on many links and add noise to diagnosis*
* Z., Danyang, et al. "Understanding and mitigating packet corruption in data center networks."
One-off packet drop
Fault: Systemic causes of packet drops whether transient or not
![Page 7: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/7.jpg)
Noise: One-off packet drop due to buffer overflows
Fault: Systemic causes of packet drops whether transient or not
![Page 8: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/8.jpg)
Talk outline
• Solution requirements
• A strawman solution and why it's impractical
• The 007 solution
– Design
– How it finds the cause of every TCP flow's drops
– Theoretical guarantees
• Evaluation
![Page 9: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/9.jpg)
Solution Requirements
• Detect short-lived failures
• Detect concurrent failures
• Robust to noise
![Page 10: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/10.jpg)
Want to avoid infrastructure changes
• Costly to implement and maintain
• Sometimes not even an option
– Example: changes to flow destinations (not in the DC)
![Page 11: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/11.jpg)
A “strawman” solution
• Suppose
– we knew the path of all flows
– we knew of every packet drop
• Tomography can find where failures are
If we assume there are enough flows
![Page 12: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/12.jpg)
Example of doing tomography
Only solvable if we have N independent equations
N = number of links in the network
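The independence condition can be checked mechanically. The sketch below is illustrative only and is not from the talk: it assumes each observed flow contributes one equation whose row marks the links on that flow's path, and it tests whether those rows have rank N. The names `rank` and `is_fully_specified` are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix, via Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Find a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Eliminate column c from every other row.
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def is_fully_specified(paths, num_links):
    """paths: one set of 0-based link indices per observed flow (assumed model)."""
    rows = [[1 if link in p else 0 for link in range(num_links)] for p in paths]
    return rank(rows) == num_links


# Three links, three flows with linearly independent paths: solvable.
print(is_fully_specified([{0, 1}, {1, 2}, {0, 2}], 3))
# Only two flows for three links: under-determined.
print(is_fully_specified([{0, 1}, {1, 2}], 3))
```

With too few active flows the rank falls below N, which is the under-specified case the next slide discusses.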
![Page 13: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/13.jpg)
Tomography is not always practical
Theoretical challenges
Set of equations doesn't fully specify a solution
– Number of active flows may not be sufficient
– Becomes NP hard
Many approximate solutions
– MAX_COVERAGE (PathDump, OSDI 2016)
– They are sensitive to noise
Engineering challenges
![Page 14: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/14.jpg)
Assume small number of failed links
AND
Fate Sharing across flows
![Page 15: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/15.jpg)
Tomography is not always practical
Engineering challenges
• Finding path of all flows is hard
• Pre-compute paths
– ECMP changes with every reboot/link failure
– Hard to keep track of these changes
• Traceroute (TCP)
– ICMP messages use up switch CPU
– NATs and software load balancers
• Infrastructure changes
– Labeling packets, adding metadata
– Costly
![Page 16: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/16.jpg)
We show in this work
• Simple traceroute-based solution
– Minimal overhead on switches
– Tractable (not NP hard)
– Resilient to noise
– No infrastructure changes (host based app)
We prove it's accurate
![Page 17: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/17.jpg)
We can fix problems with traceroute
• Overhead on switch CPU
– Only find paths of flows with packet drops
– Limit number of traceroutes from each host
– Explicit rules on the switch to limit responses
• NATs and software load balancer
– See paper for details
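One way to picture the per-host limit is a fixed-window budget that the path discovery agent consults before sending a traceroute. This is a minimal sketch under assumptions: the class name `TracerouteBudget`, its parameters, and the fixed-window policy are hypothetical and are not taken from the 007 implementation.

```python
import time


class TracerouteBudget:
    """Allow at most max_per_interval traceroutes per interval_s seconds."""

    def __init__(self, max_per_interval, interval_s, clock=time.monotonic):
        self._max = max_per_interval
        self._interval_s = interval_s
        self._clock = clock  # injectable so the policy can be tested
        self._window_start = clock()
        self._used = 0

    def allow(self):
        """Return True if a traceroute may be sent now, and count it."""
        now = self._clock()
        if now - self._window_start >= self._interval_s:
            # A new window has started: reset the counter.
            self._window_start = now
            self._used = 0
        if self._used < self._max:
            self._used += 1
            return True
        return False


# Usage: only trace flows that saw a retransmission, and only within budget.
budget = TracerouteBudget(max_per_interval=2, interval_s=10.0)
for flow in ["flow-a", "flow-b", "flow-c"]:
    if budget.allow():
        print("trace", flow)
    else:
        print("skip", flow)
```

A host-side cap like this bounds how many ICMP responses the host can trigger, independent of how many flows fail.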
![Page 18: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/18.jpg)
How the system works
Monitoring agent: Deployed on all hosts
Notified of each TCP retransmission (ETW)
Path discovery agent finds the path of the failed flows
Flows vote on the status of links
Votes: if you don’t know who to blame just blame everyone!
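The voting step can be sketched in a few lines. The weighting here is an assumption made for illustration: each flow with a retransmission splits one vote evenly across the links on its path (1/h per link for a path of h links), and links that no failed flow touched keep zero votes. The function names `tally_votes` and `rank_links` are hypothetical.

```python
from collections import defaultdict


def tally_votes(failed_flow_paths):
    """failed_flow_paths: one list of link IDs per flow that saw a retransmission."""
    votes = defaultdict(float)
    for path in failed_flow_paths:
        if not path:
            continue
        share = 1.0 / len(path)  # assumed weighting: blame every link equally
        for link in path:
            votes[link] += share
    return dict(votes)


def rank_links(votes):
    """Links sorted by descending votes; ties broken by link ID for determinism."""
    return sorted(votes, key=lambda link: (-votes[link], str(link)))


paths = [["a", "b"], ["b", "c"], ["b", "d"]]
votes = tally_votes(paths)
print(rank_links(votes)[0])  # link "b" is shared by all three failed flows
```

Because every failed flow that crosses a bad link adds to its tally, a link shared by many failed flows rises to the top even though no single flow knows which link to blame.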
![Page 19: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/19.jpg)
How the system works
Democracy works!
![Page 20: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/20.jpg)
Can diagnose TCP flows
• Using votes to compare drop rates
– For each flow we know the links involved
– Link with most votes is the most likely cause of drops
Assume small number of failed links and fate sharing across flows
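Given a vote tally, the per-flow diagnosis described above reduces to picking the highest-voted link on that flow's own path. A minimal sketch, with a hypothetical `blame_link` helper and made-up vote values:

```python
def blame_link(path, votes):
    """Return the link on this flow's path with the most votes (0 if never voted on)."""
    return max(path, key=lambda link: votes.get(link, 0.0))


# Hypothetical tally: link "b" collected the most votes across failed flows.
votes = {"a": 0.5, "b": 1.5, "c": 0.5}
print(blame_link(["a", "b"], votes))
print(blame_link(["c", "b"], votes))
```

This relies on the two assumptions on the slide: with few failed links and fate sharing, the bad link on a path is very likely the one other failed flows also crossed.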
![Page 21: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/21.jpg)
Attractive features of 007
• Resilient to noise
• Intuitive and easy to implement
• Requires no changes to the network
![Page 22: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/22.jpg)
We give theoretical guarantees
• We ensure minimal impact on switch CPU
– Theorem bounding number of traceroutes
• We prove the voting scheme is 100% accurate when the noise is bounded
– Depends on the network topology and failure drop rate
![Page 23: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/23.jpg)
Questions to answer in evaluation
• Does 007 work in practice?
– Capture the right path for each flow?
– Find the cause of drops for each flow correctly?
• Are votes a good indicator of packet drop rate?
• What level of noise can 007 tolerate?
• What level of traffic skew can 007 tolerate?
![Page 24: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/24.jpg)
Does 007 work in practice?
5 hour experiment
• Comparison to EverFlow (ground truth)
– Do traceroutes go over the right path? YES
– Does 007 find the cause of packet drops? YES
Two month deployment
• Types of problems found in production:
– Software bugs
– FCS errors
– Route flaps
– Switch reconfigurations
![Page 25: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/25.jpg)
Are votes correlated with drops?
[Figure: accuracy (0 to 100) for drop rates of 1%, 0.1%, and 0.05%]
![Page 26: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/26.jpg)
Are votes correlated with drops?
• Test cluster (we know ground truth)
False positive
![Page 27: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/27.jpg)
Comparison to MAX_COVERAGE
• MAX_COVERAGE (PathDump, OSDI 2016)
– Approximate solution to a binary optimization
– See 007 extended version for proof
– Highly sensitive to noise
• Integer optimization
– Improvement on the binary optimization approach
– Reduces sensitivity to noise
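To see why a coverage-style approach is sensitive to noise, consider a textbook greedy cover over the paths of failed flows. This is only a sketch in the spirit of MAX_COVERAGE, not PathDump's implementation; the function name `greedy_cover` and the tie-breaking rule are assumptions.

```python
def greedy_cover(failed_flow_paths):
    """Repeatedly blame the link that explains the most still-unexplained flows."""
    uncovered = [set(p) for p in failed_flow_paths if p]
    blamed = []
    while uncovered:
        counts = {}
        for path in uncovered:
            for link in path:
                counts[link] = counts.get(link, 0) + 1
        # Highest count wins; sorted() makes ties deterministic.
        best = max(sorted(counts), key=lambda link: counts[link])
        blamed.append(best)
        uncovered = [p for p in uncovered if best not in p]
    return blamed


# Two flows share the bad link "b"; a third flow had a one-off drop on "d".
print(greedy_cover([["a", "b"], ["b", "c"], ["d"]]))
```

Because every failed flow must be explained, a single one-off drop forces an extra link into the blamed set, which is the noise sensitivity the slide refers to.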
![Page 28: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/28.jpg)
Binary optimization underperforms
• Clos topology
• 2 pods
• 4000 links
• Drop rates between 0.01% and 1%, uniformly at random
• Noise uniformly at random between 0 and 0.0001%
75.3%
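The drop-rate setup on this slide can be expressed as a small sampling routine. This is a sketch under assumptions: the function name, the choice of which links fail, and the use of Python's `random` module are illustrative and do not describe the simulator used in the talk.

```python
import random


def sample_link_drop_rates(num_links, failed_links, rng):
    """Assign each link a drop probability following the ranges on the slide."""
    rates = {}
    for link in range(num_links):
        if link in failed_links:
            # Failed links: 0.01% to 1%, uniformly at random.
            rates[link] = rng.uniform(0.0001, 0.01)
        else:
            # Healthy links: noise between 0 and 0.0001%.
            rates[link] = rng.uniform(0.0, 0.000001)
    return rates


rng = random.Random(0)  # fixed seed so a run is repeatable
rates = sample_link_drop_rates(4000, failed_links={1, 2}, rng=rng)
print(len(rates))
```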
![Page 29: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/29.jpg)
Is 007 robust to noise?
![Page 30: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/30.jpg)
Skewed traffic causes problems
We don’t care about this particular case, because… The failure isn’t impacting any traffic
But what if it had?
![Page 31: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/31.jpg)
Is 007 impacted by traffic skew?
• More simulation results in the paper
![Page 32: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/32.jpg)
Conclusion
• 007: simple voting scheme
• Finds cause of problems for each flow
• Allows operators to prioritize fixes
• Analytically proven to be accurate
• Contained at the end host as an application
– No changes to the network or destinations
![Page 33: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/33.jpg)
Thank You
• Adi Aditya
• Alec Wolman
• Andreas Haeberlen
• Ang Chen
• Deepal Dhariwal
• Ishai Menache
• Jiaxin Cao
• Monia Ghobadi
• Mina Tahmasbi
• Omid AlipourFard
• Stefan Saroiu
• Trevor Adams
![Page 34: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/34.jpg)
An example closer to home
![Page 35: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/35.jpg)
Guaranteed Accurate
• Theorem: For Vigil will rank with probability the bad links that drop packets with probability higher than all good links that drop packets with probability if
where is the total number of connections between hosts, and are lower and upper bounds, respectively, on the number of packets per connection.
![Page 36: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/36.jpg)
Minimal impact on switch CPU
• Theorem: The rate of ICMP packets generated by any switch due to a traceroute is below if the rate at which hosts trigger traceroutes is upper bounded as
where n0, n1, n2 are the number of ToR, T1, and T2 switches respectively and is the number of hosts under each ToR.
![Page 37: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/37.jpg)
Failures are complicated
![Page 38: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/38.jpg)
We can now prioritize fixes
• We can answer questions like:
– Why are connections to storage failing?
– What is causing problems for SQL connections?
– Why do I have bad throughput to a.b.c.d?
![Page 40: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/40.jpg)
More than finding a few failed links
![Page 41: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/41.jpg)
Past solutions don’t help
• Don’t allow for always-on monitoring
– Pingmesh [SIGCOMM-15]
– EverFlow [SIGCOMM-15]
– TRAT [SIGCOMM-02]
– Other tomography work
• Require changes to network/remote hosts
– Marple [SIGCOMM-17]
– PathDump [OSDI-16]
– Link-based anomaly detection [NSDI-17]
![Page 42: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/42.jpg)
Finding paths is also hard
• Infrastructure changes are costly
– DSCP bit reserved for other tasks
– Cannot deploy any changes on the destination end-point
• Reverse engineering ECMP also difficult
– Can get the ECMP functions from vendors
– Seed changes with every reboot/link failure
– Hard to keep track of these changes
• Only option left: Traceroute
– ICMP messages use up switch CPU
– We cannot find the path of all flows
• Problem is not always fully specified
• Approximate solutions are NP hard
• And the approach is sensitive to noise
![Page 43: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/43.jpg)
Our Solution
007 monitors TCP connections at the host through ETW
It detects retransmissions as soon as they happen
![Page 44: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/44.jpg)
Mapping DIPs to VIPs
• Connections are to Virtual IPs
– SYN packets go to a Software Load Balancer (SLB)
– The host gets configured with a physical IP
– All other packets in the connections use the physical IP
• Traceroute packets must use the physical IP
![Page 45: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/45.jpg)
An evaluation with skewed traffic
• Traffic concentrated in one part of network
• Extreme example: most flows go to one ToR
– Small fraction of traffic goes over failed links
– Votes can become skewed
– We call this a hot ToR scenario
![Page 47: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/47.jpg)
Observation
Data gathered using the monitoring agent of NetPoirot
Uses ETW to get notifications of TCP retransmissions
![Page 48: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/48.jpg)
If the path of all flows were known
• Given TCP statistics for existing flows
– We know the paths that have problems
– Without having to send any probe traffic
– Without having to rely on packet captures
• We can also find the failed links
![Page 49: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/49.jpg)
We can now prioritize fixes
• We can answer questions like:
– Why are connections to storage failing?
– What is causing problems for SQL connections?
– Why do I have bad throughput to a.b.c.d?
• Just one catch:
– Needs to know retransmissions
– OK for infrastructure traffic (e.g. storage)
– See paper on how to extend to VM traffic
![Page 50: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/50.jpg)
SLB
Get the DIP to VIP mapping from SLB
Send traceroute-like packets
Each connection votes on the status of links
Good links get a vote of 0
![Page 51: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/51.jpg)
Where in the network?
![Page 52: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/52.jpg)
Holding the network accountable
• Given impacted application find links responsible
– Allows us to prioritize fixes
• Given a failed device quantify its impact
– Estimate cost of failures in customer impact
![Page 53: Democratically Finding The Cause of Packet Drops](https://reader030.fdocuments.in/reader030/viewer/2022013012/61cf69a7fabfbd65752329ad/html5/thumbnails/53.jpg)
Failures are hard to diagnose
High CPU load
High I/O load
Reboots
Software bugs
BGP link flaps
FCS errors
Misconfigurations
Switch reboots
Congestion
Hardware bug
+ Millions of devices
Bad design
Software bugs
High CPU usage
High memory usage