Learning at Scale is Hard! - USENIX
![Page 1: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/1.jpg)
Learning at Scale is Hard!
Outage Pattern Analysis and Dirty Data
Tanner Lund
Microsoft Azure SRE
@101010Lund
![Page 2: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/2.jpg)
Photo: Rachel Chapman (CC)
![Page 3: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/3.jpg)
![Page 4: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/4.jpg)
![Page 5: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/5.jpg)
![Page 6: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/6.jpg)
![Page 7: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/7.jpg)
![Page 8: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/8.jpg)
Photo: Mo Riza (CC)
![Page 9: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/9.jpg)
Photo: Rachel Chapman (CC)
![Page 10: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/10.jpg)
Learning (From Failure) At Scale
![Page 11: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/11.jpg)
Trends: Identified
![Page 12: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/12.jpg)
Antipatterns: Quashed
![Page 13: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/13.jpg)
Reliability Work: Actually Gets Done
Appropriately Prioritized
![Page 14: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/14.jpg)
![Page 15: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/15.jpg)
Data Scientists:
![Page 16: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/16.jpg)
Problem Management
![Page 17: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/17.jpg)
Problem: “The cause of one or more incidents”
– Information Technology Infrastructure Library (ITIL)
![Page 18: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/18.jpg)
Photo: Rachel Chapman (CC)
![Page 19: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/19.jpg)
Sharing is caring!
![Page 20: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/20.jpg)
Gathering data
![Page 21: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/21.jpg)
Selecting models
![Page 22: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/22.jpg)
Training said models
![Page 23: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/23.jpg)
Evaluating models
![Page 24: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/24.jpg)
You know what was harder?
![Page 25: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/25.jpg)
Knowing what we’re actually looking for.
![Page 26: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/26.jpg)
IDK, something amazing!
¯\(°_o)/¯
![Page 27: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/27.jpg)
Fundamental Issue: ROOT CAUSES
![Page 28: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/28.jpg)
![Page 29: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/29.jpg)
Complex Systems fail in complex ways
![Page 30: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/30.jpg)
“Each of these small failures is necessary to cause catastrophe
but only a combination is sufficient to permit failure”
-Richard I. Cook, “How Complex Systems Fail”
![Page 31: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/31.jpg)
Let’s take a step back
![Page 32: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/32.jpg)
Why do we do RCAs?
![Page 33: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/33.jpg)
To stop bad stuff from happening (again)
![Page 34: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/34.jpg)
Hunting for Causes / Problems / Contributing Factors
![Page 35: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/35.jpg)
Outage (for our purposes):
Service or platform level issue that impacts customer experience
![Page 36: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/36.jpg)
Postmortem Text Analysis
![Page 37: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/37.jpg)
BeautifulSoup
NLTK
Gensim
pyLDAvis
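The first steps of that kind of pipeline (strip the HTML, tokenize, drop stopwords, build a bag of words per postmortem) can be sketched as follows. This is a minimal stand-in that uses only the Python standard library in place of BeautifulSoup and NLTK; it is not the pipeline used in the talk. The stopword list, function names, and sample postmortems are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "is", "on", "for", "by"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML postmortem as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase, keep runs of letters, and drop stopwords and 1-character tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]


def bag_of_words(html_docs):
    """Turn each HTML postmortem into a term-frequency Counter."""
    return [Counter(tokenize(strip_html(doc))) for doc in html_docs]


# Hypothetical postmortem snippets, invented for this example.
postmortems = [
    "<h1>Outage</h1><p>The certificate expired and the frontend was unreachable.</p>",
    "<p>A config push to the network caused packet loss.</p>",
]
bags = bag_of_words(postmortems)
```

Per-document term counts like `bags` are the usual input to a topic model; fitting and visualizing one is where Gensim and pyLDAvis would come in.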
![Page 38: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/38.jpg)
![Page 39: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/39.jpg)
Not actionable.
![Page 40: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/40.jpg)
![Page 41: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/41.jpg)
Big Deal™
![Page 42: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/42.jpg)
Metrics!
![Page 43: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/43.jpg)
Photo: Judy Witts (CC)
![Page 44: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/44.jpg)
Pain Value
![Page 45: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/45.jpg)
Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)
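As a worked illustration, the formula is a plain product of four terms and can be used to rank problem categories. The sketch below assumes duration is in hours and severity is a numeric score; the slide does not state units or scales, and the categories and numbers are invented for the example.

```python
def pain_value(outage_count, duration, severity, weighting_factor=1.0):
    """Pain Value = (no. of outages) * (duration) * (severity) * (weighting factor)."""
    terms = (
        ("outage_count", outage_count),
        ("duration", duration),
        ("severity", severity),
        ("weighting_factor", weighting_factor),
    )
    for name, value in terms:
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")
    return outage_count * duration * severity * weighting_factor


# Hypothetical problem categories:
# (outages, mean duration in hours, severity score, weighting factor).
problems = {
    "expired certificates": (4, 2.0, 3, 1.5),
    "config drift": (10, 0.5, 2, 1.0),
}

# Highest pain first.
ranked = sorted(problems, key=lambda name: pain_value(*problems[name]), reverse=True)
```

With these made-up inputs, "expired certificates" scores 36.0 and "config drift" scores 10.0, so the less frequent problem ranks first because its outages are longer and more severe.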
![Page 46: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/46.jpg)
Customers Impacted
Regions
Hardware SKUs
Distance Below SLO
Number of breached SLOs
![Page 47: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/47.jpg)
Data Scientists:
![Page 48: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/48.jpg)
Pain Value = (No. of outages) * (duration) * (severity) * (weighting factor)
![Page 49: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/49.jpg)
Human interpretation still necessary
Photo: Wikimedia Commons
![Page 50: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/50.jpg)
![Page 51: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/51.jpg)
Missing/Insufficient Data
![Page 52: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/52.jpg)
Incomplete Data
![Page 53: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/53.jpg)
Inaccurate Data
It Was Definitely Network’s Fault
Our Certs Expired
![Page 54: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/54.jpg)
Irrelevant Data
![Page 55: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/55.jpg)
Ambiguity
Node – CPU
Node – Instance of Program
Node – Physical Hardware Box
Node – Point on Graph such that G = (V,E)
Node – Any device connected to the network
Node – Communication endpoint
Node – Client, Server, or Peer
Node – Bitcoin miner
Node – Data Type
Node – Node.js
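One hypothetical way to catch this kind of ambiguity before analysis is to flag sentences that use an overloaded term without a qualifier in front of it. The qualifier list, function name, and sample text below are assumptions for illustration, not something taken from the talk.

```python
import re

# Hypothetical qualifiers that pin down which kind of "node" is meant.
QUALIFIERS = {"physical", "compute", "storage", "graph", "network", "cluster"}


def unqualified_mentions(text, term="node"):
    """Return the sentences that use `term` without a qualifier directly before it."""
    flagged = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(words):
            if word in (term, term + "s"):
                previous = words[i - 1] if i > 0 else ""
                if previous not in QUALIFIERS:
                    flagged.append(sentence)
                    break  # one flag per sentence is enough
    return flagged


report = "The node rebooted at 02:00. Two physical nodes lost power."
flagged = unqualified_mentions(report)
```

Here only the first sentence is flagged, since "physical nodes" carries a qualifier. A flagged sentence is a prompt for a human to ask which meaning was intended, not an automatic correction.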
![Page 56: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/56.jpg)
Confounding Factors
(like config drift)
![Page 57: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/57.jpg)
![Page 58: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/58.jpg)
Dirty data will lie to you.
![Page 59: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/59.jpg)
What was the (preliminary) result?
![Page 60: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/60.jpg)
1. Surfaced surprise issues
![Page 61: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/61.jpg)
2. Debunked production myths
![Page 62: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/62.jpg)
3. Stronger arguments for prioritization of reliability work
![Page 63: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/63.jpg)
What did we learn?
![Page 64: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/64.jpg)
1. Define your hypotheses
![Page 65: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/65.jpg)
2. Clean your data
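A first cleaning pass can be as simple as separating records that fail basic checks from those that pass, so that missing, incomplete, or obviously inaccurate entries do not reach the analysis unnoticed. The sketch below assumes outage records are dictionaries with a hypothetical schema (`id`, `start`, `end`, `service`, `contributing_factors`); the field names and sample records are invented.

```python
from datetime import datetime

# Hypothetical schema for an outage record.
REQUIRED_FIELDS = ("id", "start", "end", "service", "contributing_factors")


def find_problems(record):
    """Return a list of human-readable data-quality problems for one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] in ("", None, []):
            problems.append(f"empty field: {field}")
    start, end = record.get("start"), record.get("end")
    if isinstance(start, datetime) and isinstance(end, datetime) and end < start:
        problems.append("end precedes start")
    return problems


def split_clean_dirty(records):
    """Partition records into (clean, dirty); dirty entries keep their problem list."""
    clean, dirty = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            dirty.append((record, problems))
        else:
            clean.append(record)
    return clean, dirty


# Invented sample records.
records = [
    {"id": "INC-1", "start": datetime(2019, 1, 1, 10), "end": datetime(2019, 1, 1, 12),
     "service": "storage", "contributing_factors": ["expired certificate"]},
    {"id": "INC-2", "start": datetime(2019, 1, 2, 10), "end": datetime(2019, 1, 2, 9),
     "service": ""},
]
clean, dirty = split_clean_dirty(records)
```

Keeping the dirty records alongside their problem lists, rather than silently dropping them, leaves room for the human follow-up that checks like these cannot replace: a well-formed record can still blame the wrong team.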
![Page 66: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/66.jpg)
3. Work your way up the DIKW pyramid
![Page 67: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/67.jpg)
What else can we do?
![Page 68: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/68.jpg)
Cross-Correlate Data Sets
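One simple form of cross-correlation is a time-window join between two data sets, for example outages and change events. The sketch below is an assumption about what such a join might look like; the record shapes, the one-hour window, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def changes_before(outages, changes, window=timedelta(hours=1)):
    """Map each outage id to the ids of changes that landed within `window` before it began."""
    matches = {}
    for outage in outages:
        earliest = outage["start"] - window
        matches[outage["id"]] = [
            change["id"]
            for change in changes
            if earliest <= change["time"] <= outage["start"]
        ]
    return matches


# Invented sample data.
outages = [{"id": "INC-1", "start": datetime(2019, 3, 1, 12, 0)}]
changes = [
    {"id": "deploy-7", "time": datetime(2019, 3, 1, 11, 30)},
    {"id": "deploy-8", "time": datetime(2019, 3, 1, 9, 0)},
]
correlated = changes_before(outages, changes)
```

In this example only `deploy-7` falls inside the window for `INC-1`. A match means the two events were close in time, which is a lead to investigate and not evidence that the change contributed to the outage.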
![Page 69: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/69.jpg)
![Page 70: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/70.jpg)
Study your minor failures
![Page 71: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/71.jpg)
Intelligently Calculate Risk
![Page 72: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/72.jpg)
Continue to improve the RCA Process
![Page 73: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/73.jpg)
Photo: Rachel Chapman (CC)
![Page 74: Learning at Scale is Hard! - USENIX](https://reader034.fdocuments.in/reader034/viewer/2022042804/62685202c9a7201c35339437/html5/thumbnails/74.jpg)