@BRENT_CHAPMAN | #ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM
Site Reliability Engineering (SRE)
What is it? Can my organization benefit from it?
Brent Chapman, Principal
Great Circle Associates
Who am I?
Majordomo
Service/Site
Reliability
Engineering
[SRE is] what happens when you ask a software engineer to design an operations function.
— Ben Treynor
VP/Engineering, 24/7 Operations
• SREs develop engineering solutions to operations matters
• Programmers, not operators
• Specialists, just like DB, QA, etc.
SRE is Engineering
Specialties typically include:
• Performance
• Availability
• Latency
• Efficiency
• Monitoring
• Change Management
• Emergency Response
• Capacity Planning
SRE is Engineering
SRE teams are meant to do engineering work to improve the reliability and operability of systems and reduce toil (freeing up more time for engineering work). SRE teams, by definition, must be learning teams who can spot and fix problems as part of their day-to-day work.
— Damon Edwards
Co-Founder, Rundeck
• Created at Google in ~2003
• Benjamin Treynor Sloss (now “VP, 24/7”)
• Took over as manager of Production Team (then 7 software engineers)
• “… what happens when you ask a software engineer to design an operations function”
• This is why SRE has a very “software engineering” feel to it
• Spread through industry as SREs left Google, and orgs adopted/adapted SRE principles for their own situations
• Developed independently from DevOps
History of SRE
• Evolved in parallel, simultaneously, largely independently of each other
• SRE is not “next generation DevOps”
• SRE not created to be the “future of DevOps”
• SRE is an engineering discipline; DevOps is a cultural movement
• SRE tends to be more prescriptive than DevOps
• Many common threads: automation, monitoring, CI/CD, rapid small releases, etc.
SRE and DevOps
class SRE implements interface DevOps
– Liz Fong-Jones
SRE, Google Cloud
SRE and DevOps
DevOps Pillar | SRE Implementation
Reduce organizational silos | Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal | Have a formula for balancing accidents and failures against new releases (error budget)
Implement gradual change | Encourage moving quickly by reducing costs of failure
Leverage tooling & automation | Encourages “automating this year’s job away” and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything | Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
SRE implements DevOps
– Seth Vargo and Liz Fong-Jones
“SRE vs. DevOps: competing standards or close friends?”
Google Cloud Platform Blog, 08 May 2018
• Operations is a Software Problem
• Manage by Service Level Objectives
• Work to Minimize Toil
• Automate This Year’s Job Away
• Move Fast by Reducing the Cost of Failure
• Share Ownership with Developers
• Use the Same Tooling, Regardless of Role
– The Site Reliability Workbook, Chapter 1
Fundamental SRE Principles
• Toil
• Automation
• SLI/SLO/SLA
• Error Budget
• Sharing On-Call with Devs
• On-Call Limits
• Alert Quality
• Common Platforms
• Self-Service
• Service Reviews
• Incident Management
• Blameless Postmortems
SRE Concepts & Practices
• One hour is not long enough to explore all of these principles, concepts, and practices
• So, let’s look at a few that are “most fundamental” and “most distinctive”:
• Minimizing Toil
• Automation
• Monitoring
• SLI/SLO/SLA
• Error Budgets
A Deeper Dive
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
– Vivek Rau, SRE Manager
Minimizing Toil
Toil vs. Engineering Work
Toil | Engineering Work
Lacks enduring value | Builds enduring value
Rote, repetitive | Creative, iterative
Tactical | Strategic
Increases with scale | Enables scaling
Can be automated | Requires human creativity
— Damon Edwards
Co-Founder, Rundeck
Seeking SRE, Chapter 10
Individual
• Discontent, lack of feeling of accomplishment
• Burnout
• More errors, leading to time-consuming rework
• No time to learn new skills
• Career stagnation (lack of opportunity to deliver value-adding projects)
Organization
• Constant shortage of team capacity
• Excessive operational support costs
• Inability to make progress on strategic initiatives (“everyone is busy, but nothing is getting done”)
• Inability to retain top talent (or acquire, after word gets around)
Costs of Toil
Requires engineering time!
• Build automation to eliminate manual work
• Redesign to alleviate need for manual work
Reducing Toil
• Fast
• Consistent
• Repeatable
• Extendible
• Composable
• Malleable
• Reviewable
• Testable
• Can set smart defaults for security, performance, monitoring, etc.
• Enables self-service
• Enables self-repair
Benefits of Automation
• Golden signals
  • Latency, traffic, errors, saturation
• Awareness of long-tail effects
  • Averages vs. percentiles
• Dashboards vs. alerts
• Symptoms vs. causes
• Black-box vs. white-box monitoring
• Monitoring vs. observability
• Alert quality
SRE Monitoring Practices
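To illustrate the “averages vs. percentiles” point, here is a minimal sketch. The latency numbers are made up, and `percentile` is a hypothetical helper using the nearest-rank method (one of several common definitions), not something from the talk.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-indexed rank, clamped so p=0 still returns the smallest sample.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds:
# 95 fast requests and 5 very slow ones (the "long tail").
latencies_ms = [100] * 95 + [5000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With this data the mean sits well above what a typical user sees and well below what the slowest 5% see, so it describes nobody's experience. The median and the 99th percentile together show both the common case and the tail.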
Alerts should be
• Urgent – something that can’t wait
• Actionable – something that can be dealt with
• Interesting – something that requires human judgement
• Unique – something that only you can deal with
All alerts that fire should be reviewed ~weekly for quality, and ruthlessly tuned or eliminated (perhaps through automation)
Alert Quality
• SLI: Service Level Indicator
  • Metric that you monitor
  • Ideally something user-meaningful, e.g., “time to load home page”
• SLO: Service Level Objective
  • Target value for an SLI
  • If the SLI falls outside that target, it indicates a problem
• SLA: Service Level Agreement
  • An SLO with teeth
  • If you breach it, it costs you money (penalties, refunds, etc.)
  • Usually less strict than your SLO, obviously
SLI/SLO/SLA, oh my…
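One way to picture how the three terms relate is the sketch below. It assumes an availability-style SLI (the fraction of good requests). The names `ServiceLevel`, `availability_sli`, and `evaluate`, and the target numbers, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    slo_target: float  # internal objective, e.g. 0.999
    sla_target: float  # contractual commitment, looser than the SLO


def availability_sli(good_events, total_events):
    """SLI: fraction of events that were good (e.g. successful requests)."""
    if total_events == 0:
        return 1.0  # no traffic, nothing failed
    return good_events / total_events


def evaluate(sli, level):
    """Compare a measured SLI against the SLO and the (looser) SLA."""
    if sli < level.sla_target:
        return "SLA breached"  # this is the one that costs money
    if sli < level.slo_target:
        return "SLO missed"    # internal warning, before the SLA is at risk
    return "OK"


# Assumed targets: 99.9% internal SLO, 99.5% contractual SLA.
level = ServiceLevel(slo_target=0.999, sla_target=0.995)
print(evaluate(availability_sli(9_995, 10_000), level))
print(evaluate(availability_sli(9_985, 10_000), level))
print(evaluate(availability_sli(9_900, 10_000), level))
```

Keeping the SLA looser than the SLO gives the team a buffer: an SLO miss is a signal to act before any contractual penalty applies.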
• Agreement between Dev and SRE on a service availability target (SLO)
• Error budget = 1 minus the availability target
• Ex: a 99.9% uptime target means the error budget is 0.1%
• Error budget can be “spent” on either features or reliability
• If the service is within its error budget, keep pushing features
• If the service has used up its error budget, slow/pause feature work and focus on reliability until the service is back in spec
Error Budgets
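The error budget arithmetic can be sketched as follows. The function names, the 30-day window, and the request counts are assumptions for illustration; real policies vary by organization.

```python
def error_budget(slo_target):
    """Error budget as a fraction: 1 minus the availability target."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime the budget allows over a window (assumed 30 days)."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, good_events, total_events):
    """Fraction of the budget still unspent: 1.0 means untouched,
    0 or below means the budget is used up."""
    allowed_bad = error_budget(slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def release_policy(remaining):
    """The decision rule from the slide, as a simple threshold."""
    if remaining > 0:
        return "keep pushing features"
    return "slow/pause features, focus on reliability"


# A 99.9% target leaves roughly 43 minutes of downtime per 30 days.
print(round(allowed_downtime_minutes(0.999), 1), "minutes")

# Hypothetical month: 1,000,000 requests, 500 of them failed.
remaining = budget_remaining(0.999, 999_500, 1_000_000)
print(round(remaining, 3), release_policy(remaining))
```

The same rule applies whether the budget is counted in failed requests or in minutes of downtime. What matters is that Dev and SRE agree on the measure and the window up front.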
• Solo – SRE as individual skillset
• Consulting – consultants to Dev or DevOps
• Embedded – individual SREs join product-focused teams (“Netflix model”)
• Dedicated – full SRE teams, in own org structure (“Google model”)
SRE engagement models
• SRE is a specialization/role
• SREs join cross-functional product teams, just like QA engineers, DBAs, PMs, etc.
• Development and ongoing ops happen from within these teams
• These teams own a service through entire lifecycle, from concept to decommissioning
• No “ops” org to “throw it over the wall to”
Netflix Model (embedded)
• SRE is a distinct org within company
• Individual SREs are on teams within this org
• Dev teams have limited SRE skills
• Initially own full lifecycle of service
• At certain level of scale, performance, and stability, begin partnering with SRE team
• SRE team manages availability, scalability, etc.
• Dev team continues to do feature development
• Dev team stays somewhat involved in operation & on-call (85% SRE, 15% Dev)
• Error budget used to balance stability vs. feature development
Google Model (vertical)
• Production Engineering (Facebook)
• Chaos Engineering (Netflix)
• Resiliency Engineering
Related Disciplines
• O’Reilly books
  • Site Reliability Engineering: How Google Runs Production Systems (April 2016)
    • “The SRE Book”
    • How Google implemented SRE; the “what” of SRE
  • The Site Reliability Workbook: Practical Ways to Implement SRE (July 2018)
    • “The SRE Workbook”
    • Companion to The SRE Book; explains more “how” and “why”, and more from others than Google
  • Seeking SRE: Conversations About Running Production Systems at Scale (Sep 2018)
    • More expansive view, from still more orgs
    • How to apply SRE principles in more situations
Exploring SRE
• USENIX SREcon
• 3 per year: Americas, EMEA, Asia
• Videos of past talks available online, for free
• Usually sells out very early (often before early-bird ends), so register early!
• Highly recommend Ben Treynor Sloss’ “Keys to SRE” keynote from first SREcon
• http://j.mp/2N3ABSz
• Also Damon Edwards’ “Clearing the Way for SRE in the Enterprise” from SREcon18Europe
• http://j.mp/2OUYNrZ
Exploring SRE
Please contact me!
Brent Chapman
@brent_chapman
Available for consulting, training, conferences, public speaking, and in-house presentations
These slides are at http://j.mp/grtcrcl-181018
Questions? More info…