Introduction to SRE at Google - GOTO Conference · Speaker Introduction Christof Leng Site...
Introduction to SRE at Google
Christof Leng, 2018
Speaker Introduction
● Christof Leng
● Site Reliability Manager at Google Munich
● Developer Infrastructure SRE
○ Responsible for Google's developer and CI/CD tools
● Researcher, politician, DJ
Why Reliability?
● It's the number one feature
Do you prefer Gmail 2010?
Or Gmail 500?
Reliability is easy to take for granted
● It’s the absence of errors
● Obviously unstable == too late
● You need to work at reliability all the time
○ Not just when everything’s on fire
SRE Organizational Structure
● The SRE Organization is separate from feature development
● SRE teams are organized around a single service or a collection of related services or technologies
Dev and Ops
● Don't Dev and Ops always fight?
○ Dev wants to...
■ ...roll out features fast
■ ...and see them widely adopted
○ Ops wants...
■ ...stability so they don't get paged
And just to make it harder...
● Information asymmetry is extreme
● Ops doesn’t really know the code base
● The team which knows the least about the code...
○ ...has the strongest incentive to object to it launching
Is conflict inevitable?
No :-)
● SRE doesn’t attempt to assess launch risk,
● or set release policy,
● or avoid all outages
Then what?
● Error budgets!
● But you first need an SLO!
Service Level .*
● Service Level Indicator (SLI): a quantitative measure of an attribute of the service. It's a metric that users care about, such as:
○ availability
○ latency
○ freshness
○ durability
● Service Level Objective (SLO): SLI @ specific target (99.9% availability = ☺)
● Service Level Agreement (SLA): SLO + consequences (99% availability = ☹)
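As a rough illustration of how the three terms relate, the sketch below measures availability as the fraction of successful requests and compares it against an SLO and an SLA target. The class name, function name, and request counts are hypothetical and chosen only for this example; the 99.9% and 99% targets are the ones on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLI:
    """Availability measured as the fraction of successful requests (illustrative)."""

    good_requests: int
    total_requests: int

    def value(self) -> float:
        # Assumption: with no traffic there is nothing to violate,
        # so the service counts as fully available.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests


def meets_target(sli: AvailabilitySLI, target: float) -> bool:
    """Return True when the measured SLI is at or above the target."""
    return sli.value() >= target


SLO_TARGET = 0.999  # objective: SLI at a specific target
SLA_TARGET = 0.99   # agreement: an SLO with consequences attached

# Hypothetical measurement window, for illustration only.
measured = AvailabilitySLI(good_requests=99_950, total_requests=100_000)
print(f"SLI: {measured.value():.4%}")
print("meets SLO:", meets_target(measured, SLO_TARGET))
print("meets SLA:", meets_target(measured, SLA_TARGET))
```

The same comparison would apply to latency, freshness, or durability; only the way the SLI value is measured would change.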
100% SLO
<100% SLO
● Google doesn't run at 100% SLO
● Impossible to achieve
● Very expensive
https://pixabay.com/en/laptop-black-blue-screen-monitor-33521/
https://pixabay.com/en/computer-desktop-workstation-office-158675/
Error Budget
● 1 - SLO
● Example
○ SLO: 99.9%
○ Error budget: 100% - 99.9% = 0.1%
○ Can spend this
○ For a 1 billion query/month service
■ 1 million "errors" to spend
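The arithmetic in the example can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and the SLO and query volume are the ones from the slide.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction of all queries: 1 - SLO."""
    return 1.0 - slo


def allowed_errors(slo: float, total_queries: int) -> int:
    """Number of failed queries the budget permits over the window."""
    # round() absorbs the floating-point noise in 1 - slo.
    return round(error_budget(slo) * total_queries)


def budget_remaining(slo: float, total_queries: int, failed_queries: int) -> int:
    """Errors still available to spend; negative means the budget is overspent."""
    return allowed_errors(slo, total_queries) - failed_queries


# SLO of 99.9% on a service handling 1 billion queries per month.
print(allowed_errors(0.999, 1_000_000_000))  # 1000000
```

Tracking `budget_remaining` over the month shows how much room is left for risky changes.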
What do you spend your budget on?
● Change is the #1 cause of outages
● Launches are big sources of change
● Solution: Spend error budget on launches!
○ … or spend it on service instability :(
The rule
● Error budget > 0, launch away
○ Clearly DEV team is doing a good job
● Error budget < 0, launch freeze
○ Until you earn back enough error budget
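The rule above is simple enough to express as a small gate function. The sketch below is illustrative only: the names are hypothetical, and because the slide does not say what happens when the budget is exactly zero, this version treats that case as a freeze (an assumption).

```python
from enum import Enum


class LaunchDecision(Enum):
    LAUNCH = "launch away"
    FREEZE = "launch freeze"


def launch_decision(allowed_errors: int, observed_errors: int) -> LaunchDecision:
    """Apply the error-budget rule to the current measurement window."""
    remaining = allowed_errors - observed_errors
    # Assumption: an exactly exhausted budget (remaining == 0) also freezes launches.
    if remaining > 0:
        return LaunchDecision.LAUNCH
    return LaunchDecision.FREEZE


# Hypothetical month: 1,000,000 errors allowed, 250,000 already spent.
print(launch_decision(1_000_000, 250_000).value)
```

Because the inputs are counts that both teams can read off the same dashboard, the decision does not depend on anyone's opinion of a particular launch.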
Two nice features of Error Budgets
1. Removes a major source of SRE-DEV conflict
a. It’s a math problem, not an opinion or power conflict
2. DEV teams self-police because they are not monolithic
Staffing, Work, Ops Overload
● At the core, you can throw people at a badly-functioning system and keep it alive via manual labor
● That job isn't fun
○ Google doesn't ask SREs to do it
But it’s soooo tempting?
● What I see is all there is
● Can’t see operations work = doesn’t exist
● It’s another incentives problem
Fix 1: Common Staffing Pool
● One more SRE = one less developer
● The more operations work...
○ ...the fewer features
● Self-regulating systems win!
Fix 2: SRE hires only coders
● They speak the same language as DEV
● They know what a computer can do
● They get bored easily
Fix 3: 50% cap on Ops work
● If you succeed, traffic increases
● Toil scales with traffic
● Write software to reduce toil
● Leave enough time for serious coding
○ ...or drown,
○ ...or fail
Fix 4: Keep DEV in the rotation
● “What I see is all there is”
● Dev team sees the product in action
● Not all teams do this though
Fix 5: Speaking of Dev and Ops work...
● Excess operations load gets assigned to the dev team
○ tickets, oncall, etc.
● Another self-regulating system :)
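One way to picture how the 50% cap from Fix 3 and the overflow rule above interact is to compute how many hours of operational work exceed the cap in a given period. This is a sketch under stated assumptions: the function name, the hour-based accounting, and the example figures are hypothetical, not a description of how Google tracks this.

```python
OPS_CAP = 0.5  # the 50% cap on operational work


def excess_ops_hours(ops_hours: float, total_hours: float, cap: float = OPS_CAP) -> float:
    """Hours of ops work above the cap, i.e. the load that would go back to the dev team."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - cap * total_hours)


# Hypothetical SRE team: 400 engineer-hours in the period, 260 spent on ops work.
overflow = excess_ops_hours(ops_hours=260, total_hours=400)
print(overflow)  # 60.0
```

A non-zero result is the signal that tickets or on-call shifts should move to the dev team until the ops share falls back under the cap.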
Fix 6: SRE Portability
● No requirement to stick with any project
○ No requirement to stick with SRE
● Build it and they will come
○ Bust it, and they will leave
● The threat is rarely executed, but it is powerful
Limiting operational work
1. Single staffing pool
2. Hire coders
3. Ops work < 50%
4. Dev involved in operations
5. Excessive toil → Dev
6. Mobility
Death, taxes, and outages...
● SLO < 100% means that there will be outages
○ This is OK. Not fun, but OK
● Two goals for each outage:
○ Minimize impact
○ Prevent recurrence
Minimize Damage
● Make the outage as short as possible
● No NOC
● Good diagnostic information
A word on practice...
Operational readiness drills aren’t cool.
You know what’s cool?
Wheel of Misfortune!
One of our most popular SRE events.
Prevent recurrence
● Step 1: Handle the event
● Step 2: Write the post-mortem
● Step 3: Reset
Post-mortem philosophy
● Post-mortems are blameless
● Assume people are intelligent, well-intentioned
● Focus on process and technology
● Create a timeline
● Get all the facts
● Create bugs for all follow-up work
Google's SRE Website
● https://www.google.com/sre
● More resources
● Articles
● Videos
O'Reilly Book
● Site Reliability Engineering
● How Google Runs Production Systems
● landing.google.com/sre/book.html
● Reliability is the most important feature
● SRE = a dedicated team focused on reliability
○ Software engineering, consulting, on-call
● SLO is the target. Error budget is there to be spent
○ Divert SWE resources to reliability when you run out of error budget
● Limiting operational work
● Incident response and postmortems
Questions on any of these?