Site Reliability Engineering...


@BRENT_CHAPMAN | #ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM

Site Reliability Engineering (SRE)

What is it? Can my organization benefit from it?


Brent Chapman

Principal

Great Circle Associates


Who am I?

Majordomo


Service

Site

Reliability

Engineering


[SRE is] what happens when you ask a software engineer to design an operations function.

— Ben Treynor

VP/Engineering, 24/7 Operations

Google


• SREs develop engineering solutions to operations matters

• Programmers, not operators

• Specialists, just like DB, QA, etc.

SRE is Engineering


Specialties typically include:

• Performance

• Availability

• Latency

• Efficiency

• Monitoring

• Change Management

• Emergency Response

• Capacity Planning

SRE is Engineering


SRE teams are meant to do engineering work to improve the reliability and operability of systems and reduce toil (freeing up more time for engineering work). SRE teams, by definition, must be learning teams who can spot and fix problems as part of their day-to-day work.

– Damon Edwards

Co-Founder

Rundeck


• Created at Google in ~2003

• Benjamin Treynor Sloss (now “VP, 24/7”)

• Took over as manager of Production Team (then 7 software engineers)

• “… what happens when you ask a software engineer to design an operations function”

• This is why SRE has a very “software engineering” feel to it

• Spread through industry as SREs left Google, and orgs adopted/adapted SRE principles for their own situations

• Developed independently from DevOps

History of SRE


• Evolved in parallel, simultaneously, largely independently of each other

• SRE is not “next generation DevOps”

• SRE not created to be the “future of DevOps”

• SRE is an engineering discipline; DevOps is a cultural movement

• SRE tends to be more prescriptive than DevOps

• Many common threads: automation, monitoring, CI/CD, rapid small releases, etc.

SRE and DevOps


class SRE implements interface DevOps

– Liz Fong-Jones

SRE

Google Cloud

SRE and DevOps


DevOps Pillar | SRE Implementation

Reduce organization silos | Share ownership with developers by using the same tools and techniques across the stack

Accept failure as normal | Have a formula for balancing accidents and failures against new releases (error budget)

Implement gradual change | Encourage moving quickly by reducing costs of failure

Leverage tooling & automation | Encourages “automating this year’s job away” and minimizing manual systems work to focus on efforts that bring long-term value to the system

Measure everything | Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.

SRE implements DevOps

– Seth Vargo and Liz Fong-Jones

“SRE vs. DevOps: competing standards or close friends?”

Google Cloud Platform Blog, 08 May 2018


• Operations is a Software Problem

• Manage by Service Level Objectives

• Work to Minimize Toil

• Automate This Year’s Job Away

• Move Fast by Reducing the Cost of Failure

• Share Ownership with Developers

• Use the Same Tooling, Regardless of Role

– The Site Reliability Workbook, Chapter 1

Fundamental SRE Principles


• Toil

• Automation

• SLI/SLO/SLA

• Error Budget

• Sharing On-Call with Devs

• On-Call Limits

• Alert Quality

• Common Platforms

• Self-Service

• Service Reviews

• Incident Management

• Blameless Postmortems

SRE Concepts & Practices


• One hour is not long enough to explore all of these principles, concepts, and practices

• So, let’s look at a few that are “most fundamental” and “most distinctive”:

• Minimizing Toil

• Automation

• Monitoring

• SLI/SLO/SLA

• Error Budgets

A Deeper Dive


Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

– Vivek Rau

SRE Manager

Google

Minimizing Toil


Toil vs. Engineering Work

Toil | Engineering Work

Lacks enduring value | Builds enduring value

Rote, repetitive | Creative, iterative

Tactical | Strategic

Increases with scale | Enables scaling

Can be automated | Requires human creativity

— Damon Edwards

Co-Founder, Rundeck

Seeking SRE, Chapter 10


Individual

• Discontent, lack of feeling of accomplishment

• Burnout

• More errors, leading to time-consuming rework

• No time to learn new skills

• Career stagnation (lack of opportunity to deliver value-adding projects)

Organization

• Constant shortage of team capacity

• Excessive operational support costs

• Inability to make progress on strategic initiatives (“everyone is busy, but nothing is getting done”)

• Inability to retain top talent (or acquire, after word gets around)

Costs of Toil


Requires engineering time!

• Build automation to eliminate manual work

• Redesign to alleviate need for manual work

Reducing Toil


• Fast

• Consistent

• Repeatable

• Extendible

• Composable

• Malleable

• Reviewable

• Testable

• Can set smart defaults for security, performance, monitoring, etc.

• Enables self-service

• Enables self-repair

Benefits of Automation


• Golden signals

• Latency, traffic, errors, saturation

• Awareness of long-tail effects

• Averages vs. percentiles

• Dashboards vs. alerts

• Symptoms vs. causes

• Black-box vs. white-box monitoring

• Monitoring vs. observability

• Alert quality

SRE Monitoring Practices
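
To illustrate the "averages vs. percentiles" point, here is a minimal Python sketch. The latency numbers are invented for illustration, and `percentile` is a hypothetical nearest-rank helper, not something from the talk. It shows how a small slow tail moves the mean only modestly while the 99th percentile exposes it.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 0..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds:
# 95 fast requests and 5 very slow ones (the "long tail").
latencies_ms = [100] * 95 + [3000] * 5

mean_ms = statistics.mean(latencies_ms)  # pulled upward by the tail
p50_ms = percentile(latencies_ms, 50)    # what a typical request sees
p99_ms = percentile(latencies_ms, 99)    # what the unluckiest users see
```

With these made-up numbers the median stays at the fast value while the 99th percentile lands on the slow one, which is the kind of long-tail effect a dashboard built only on averages would hide.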


Alerts should be:

• Urgent – something that can’t wait

• Actionable – something that can be dealt with

• Interesting – something that requires human judgement

• Unique – something that only you can deal with

All alerts that fire should be reviewed ~weekly for quality, and ruthlessly tuned or eliminated (perhaps through automation)

Alert Quality
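
One way to picture the weekly review is as a checklist applied to every alert that fired. The sketch below is only an illustration: the `AlertReview` record, the helper functions, and the alert names are all hypothetical, and the four flags would in practice be a human judgement made during the review.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    """One fired alert, scored against the four criteria during review."""
    name: str
    urgent: bool       # can't wait
    actionable: bool   # can be dealt with
    interesting: bool  # requires human judgement
    unique: bool       # only the recipient can deal with it


def deserves_to_page(review):
    """An alert earns its page only if it meets all four criteria."""
    return (review.urgent and review.actionable
            and review.interesting and review.unique)


def weekly_review(reviews):
    """Split fired alerts into ones to keep and ones to tune or eliminate."""
    keep = [r.name for r in reviews if deserves_to_page(r)]
    tune_or_eliminate = [r.name for r in reviews if not deserves_to_page(r)]
    return keep, tune_or_eliminate


# Hypothetical alerts from one week.
fired = [
    AlertReview("checkout_error_rate_high", True, True, True, True),
    AlertReview("disk_70_percent_full", False, True, False, True),
    AlertReview("nightly_job_restarted_itself", False, False, False, False),
]
keep, tune_or_eliminate = weekly_review(fired)
```

Alerts that land in the second list are candidates for tuning, demotion to a dashboard, or replacement by automation.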


• SLI: Service Level Indicator

• Metric that you monitor

• Ideally something user-meaningful, e.g., “time to load home page”

• SLO: Service Level Objective

• Target value for an SLI

• If the SLI misses that target, it indicates a problem

• SLA: Service Level Agreement

• An SLO with teeth

• If you breach it, it costs you money (penalties, refunds, etc.)

• Usually less strict than your SLO, so you notice a problem before it costs you money

SLI/SLO/SLA, oh my…
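
The relationship between the three terms can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the 2-second threshold, the 95% SLO, the 85% SLA, and the sample load times are invented, and the SLI is defined as the fraction of home-page loads that finish within the threshold.

```python
def sli_good_fraction(load_times_s, threshold_s):
    """SLI: fraction of page loads that completed within the threshold."""
    if not load_times_s:
        raise ValueError("need at least one measurement")
    good = sum(1 for t in load_times_s if t <= threshold_s)
    return good / len(load_times_s)


def meets_target(sli, target):
    """True when the measured SLI is at or above the target fraction."""
    return sli >= target


# Hypothetical measurements: 18 fast loads and 2 slow ones.
load_times_s = [0.8] * 18 + [4.5] * 2

THRESHOLD_S = 2.0   # assumed definition of a "good" page load
SLO_TARGET = 0.95   # internal objective (assumed)
SLA_TARGET = 0.85   # contractual floor, looser than the SLO (assumed)

sli = sli_good_fraction(load_times_s, THRESHOLD_S)
slo_ok = meets_target(sli, SLO_TARGET)
sla_ok = meets_target(sli, SLA_TARGET)
```

With these numbers the SLO is missed while the SLA still holds, which is why the SLA is usually the looser of the two: the internal objective trips first and leaves room to react before penalties apply.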


• Agreement between Dev and SRE on service availability target (SLO)

• Error budget = 1 minus the availability target

• Example: a 99.9% uptime target leaves an error budget of 0.1%

• Error budget can be “spent” on either features or reliability

• If the service is within its error budget, keep pushing features

• If the service has overspent its error budget, slow/pause feature work and focus on reliability until service is back in spec

Error Budgets
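
The arithmetic behind an error budget is simple enough to sketch in Python. The function names, the 30-day window, and the request counts below are assumptions for illustration; the only figures taken from the slide are the 99.9% target and the resulting 0.1% budget.

```python
def error_budget(slo_target):
    """Error budget as a fraction: 1 minus the availability target."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Downtime the budget permits over a window (30 days is an assumption)."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Failed requests still affordable; a negative value means overspent."""
    return error_budget(slo_target) * total_requests - failed_requests


def next_focus(slo_target, total_requests, failed_requests):
    """Apply the rule from the slide: features while budget remains."""
    remaining = budget_remaining(slo_target, total_requests, failed_requests)
    return "features" if remaining > 0 else "reliability"


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
downtime_min = allowed_downtime_minutes(0.999)

# Hypothetical month: 1,000,000 requests, so about 1,000 may fail.
quiet_month = next_focus(0.999, 1_000_000, 400)
rough_month = next_focus(0.999, 1_000_000, 1_500)
```

Counting the budget in failed requests rather than minutes is one design choice among several; the point is only that Dev and SRE agree on the number in advance and let it drive the features-versus-reliability decision.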


• Solo – SRE as individual skillset

• Consulting – consultants to Dev or DevOps

• Embedded – individual SREs join product-focused teams (“Netflix model”)

• Dedicated – full SRE teams, in own org structure (“Google model”)

SRE engagement models


• SRE is a specialization/role

• SREs join cross-functional product teams, just like QA engineers, DBAs, PMs, etc.

• Development and ongoing ops happen from within these teams

• These teams own a service through entire lifecycle, from concept to decommissioning

• No “ops” org to “throw it over the wall to”

Netflix Model (embedded)


• SRE is a distinct org within company

• Individual SREs are on teams within this org

• Dev teams have limited SRE skills

• Initially own full lifecycle of service

• At certain level of scale, performance, and stability, begin partnering with SRE team

• SRE team manages availability, scalability, etc.

• Dev team continues to do feature development

• Dev team stays somewhat involved in operation & on-call (85% SRE, 15% Dev)

• Error budget used to balance stability vs. feature development

Google Model (vertical)


• Production Engineering (Facebook)

• Chaos Engineering (Netflix)

• Resiliency Engineering

Related Disciplines


• O’Reilly books

• Site Reliability Engineering: How Google Runs Production Systems (April 2016)

• “The SRE Book”

• How Google implemented SRE; the “what” of SRE

• The Site Reliability Workbook: Practical Ways to Implement SRE (July 2018)

• “The SRE Workbook”

• Companion to The SRE Book; explains more of the “how” and “why”, with more contributions from organizations other than Google

• Seeking SRE: Conversations About Running Production Systems at Scale (Sep 2018)

• More expansive view, from still more orgs

• How to apply SRE principles in more situations

Exploring SRE


• USENIX SREcon

• 3 per year: Americas, EMEA, Asia

• Videos of past talks available online, for free

• Usually sells out very early (often before early-bird ends), so register early!

• Highly recommend Ben Treynor Sloss’ “Keys to SRE” keynote from first SREcon

• http://j.mp/2N3ABSz

• Also Damon Edwards’ “Clearing the Way for SRE in the Enterprise” from SREcon18Europe

• http://j.mp/2OUYNrZ

Exploring SRE


Please contact me!

Brent Chapman

[email protected]

@brent_chapman

Available for consulting, training, conferences, public speaking, and in-house presentations

These slides are at http://j.mp/grtcrcl-181018

Questions? More info…