Site Reliability Engineering...

@BRENT_CHAPMAN | #ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM

Site Reliability Engineering (SRE)

What is it? Can my organization benefit from it?


Brent Chapman, Principal

Great Circle Associates


Who am I?

Majordomo


ServiceSite

Reliability

Engineering


[SRE is] what happens

when you ask a software

engineer to design an

operations function.

— Ben Treynor

VP/Engineering, 24/7 Operations

Google


• SREs develop engineering solutions to operations matters

• Programmers, not operators

• Specialists, just like DB, QA, etc.

SRE is Engineering


Specialties typically include:

• Performance

• Availability

• Latency

• Efficiency

• Monitoring

• Change Management

• Emergency Response

• Capacity Planning

SRE is Engineering


SRE teams are meant to do engineering work to improve the reliability and operability of systems and reduce toil (freeing up more time for engineering work). SRE teams, by definition, must be learning teams who can spot and fix problems as part of their day-to-day work.

— Damon Edwards, Co-Founder

Rundeck


• Created at Google in ~2003

• Benjamin Treynor Sloss (now “VP, 24/7”)

• Took over as manager of Production Team (then 7 software engineers)

• “… what happens when you ask a software engineer to design an operations function”

• This is why SRE has a very “software engineering” feel to it

• Spread through industry as SREs left Google, and orgs adopted/adapted SRE principles for their own situations

• Developed independently from DevOps

History of SRE


• Evolved in parallel, simultaneously, largely independently of each other

• SRE is not “next generation DevOps”

• SRE not created to be the “future of DevOps”

• SRE is an engineering discipline; DevOps is a cultural movement

• SRE tends to be more prescriptive than DevOps

• Many common threads: automation, monitoring, CI/CD, rapid small releases, etc.

SRE and DevOps


class SRE implements interface DevOps

– Liz Fong-Jones, SRE

Google Cloud

SRE and DevOps


DevOps Pillar | SRE Implementation

Reduce organization silos | Share ownership with developers by using the same tools and techniques across the stack

Accept failure as normal | Have a formula for balancing accidents and failures against new releases (error budget)

Implement gradual change | Encourage moving quickly by reducing costs of failure

Leverage tooling & automation | Encourages “automating this year’s job away” and minimizing manual systems work to focus on efforts that bring long-term value to the system

Measure everything | Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.

SRE implements DevOps

– Seth Vargo and Liz Fong-Jones

“SRE vs. DevOps: competing standards or close friends?”

Google Cloud Platform Blog, 08 May 2018


• Operations is a Software Problem

• Manage by Service Level Objectives

• Work to Minimize Toil

• Automate This Year’s Job Away

• Move Fast by Reducing the Cost of Failure

• Share Ownership with Developers

• Use the Same Tooling, Regardless of Role

– The Site Reliability Workbook, Chapter 1

Fundamental SRE Principles


• Toil

• Automation

• SLI/SLO/SLA

• Error Budget

• Sharing On-Call with Devs

• On-Call Limits

• Alert Quality

• Common Platforms

• Self-Service

• Service Reviews

• Incident Management

• Blameless Postmortems

SRE Concepts & Practices


• One hour is not long enough to explore all of these principles, concepts, and practices

• So, let’s look at a few that are “most fundamental” and “most distinctive”:

• Minimizing Toil

• Automation

• Monitoring

• SLI/SLO/SLA

• Error Budgets

A Deeper Dive


Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

– Vivek Rau, SRE Manager

Google

Minimizing Toil


Toil vs. Engineering Work

Toil | Engineering Work

Lacks enduring value | Builds enduring value

Rote, repetitive | Creative, iterative

Tactical | Strategic

Increases with scale | Enables scaling

Can be automated | Requires human creativity

— Damon Edwards

Co-Founder, Rundeck

Seeking SRE, Chapter 10


Individual

• Discontent, lack of feeling of accomplishment

• Burnout

• More errors, leading to time-consuming rework

• No time to learn new skills

• Career stagnation (lack of opportunity to deliver value-adding projects)

Organization

• Constant shortage of team capacity

• Excessive operational support costs

• Inability to make progress on strategic initiatives (“everyone is busy, but nothing is getting done”)

• Inability to retain top talent (or acquire, after word gets around)

Costs of Toil


Requires engineering time!

• Build automation to eliminate manual work

• Redesign to alleviate need for manual work

Reducing Toil


• Fast

• Consistent

• Repeatable

• Extendible

• Composable

• Malleable

• Reviewable

• Testable

• Can set smart defaults for security, performance, monitoring, etc.

• Enables self-service

• Enables self-repair

Benefits of Automation


• Golden signals

• Latency, traffic, errors, saturation

• Awareness of long-tail effects

• Averages vs. percentiles

• Dashboards vs. alerts

• Symptoms vs. causes

• Black-box vs. white-box monitoring

• Monitoring vs. observability

• Alert quality

SRE Monitoring Practices
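To make the "averages vs. percentiles" point concrete, here is a minimal sketch. The latency numbers are invented for illustration, and `percentile` is a hypothetical helper using the simple nearest-rank method; real monitoring systems may compute percentiles differently.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [100] * 98 + [5000] * 2

mean_ms = statistics.mean(latencies_ms)  # pulled up by the slow requests, yet hides how slow they are
p50_ms = percentile(latencies_ms, 50)    # what a typical user sees
p99_ms = percentile(latencies_ms, 99)    # what the unluckiest users see

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this made-up data set the mean looks tolerable while the 99th percentile exposes the long tail, which is why latency is usually tracked by percentile rather than by average.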


Alerts should be

• Urgent – something that can’t wait

• Actionable – something that can be dealt with

• Interesting – something that requires human judgement

• Unique – something that only you can deal with

All alerts that fire should be reviewed ~weekly for quality, and ruthlessly tuned or eliminated (perhaps through automation)

Alert Quality
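One way the weekly review could be organized is sketched below. The `FiredAlert` record, the alert names, and the `weekly_review` function are all hypothetical; the four yes/no fields simply mirror the four criteria on this slide and would be filled in by whoever handled the alert.

```python
from dataclasses import dataclass

# The four quality criteria from the slide.
CRITERIA = ("urgent", "actionable", "interesting", "unique")


@dataclass
class FiredAlert:
    """One alert that fired this week, scored by the person who handled it."""
    name: str
    urgent: bool
    actionable: bool
    interesting: bool
    unique: bool


def failed_criteria(alert):
    """Return the names of the criteria this alert did not meet."""
    return [c for c in CRITERIA if not getattr(alert, c)]


def weekly_review(alerts):
    """Split alerts into those to keep and those to tune or eliminate."""
    keep, tune_or_eliminate = [], {}
    for alert in alerts:
        missed = failed_criteria(alert)
        if missed:
            tune_or_eliminate[alert.name] = missed
        else:
            keep.append(alert.name)
    return keep, tune_or_eliminate


# Hypothetical alerts, for illustration only.
fired_this_week = [
    FiredAlert("checkout_error_rate_high", True, True, True, True),
    FiredAlert("disk_70_percent_full", False, True, False, True),
]
keep, tune_or_eliminate = weekly_review(fired_this_week)
print(keep, tune_or_eliminate)
```

Anything that lands in `tune_or_eliminate` is a candidate for a higher threshold, a dashboard instead of a page, or an automated fix.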


• SLI: Service Level Indicator

• Metric that you monitor

• Ideally something user-meaningful, e.g., “time to load home page”

• SLO: Service Level Objective

• Target value for an SLI

• If the SLI misses that target, it indicates a problem

• SLA: Service Level Agreement

• An SLO with teeth

• If you breach it, it costs you money (penalties, refunds, etc.)

• Usually less strict than your SLO, obviously

SLI/SLO/SLA, oh my…
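The relationship between the three terms can be sketched in a few lines. Everything here is an assumption for illustration: the SLI is taken to be the fraction of home page loads that finished fast enough, and the 99% SLO and 95% SLA targets are invented numbers.

```python
def sli(good_events, total_events):
    """SLI: fraction of events that were good (e.g. home page loads under a latency threshold)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing failed
    return good_events / total_events


def classify(measured, slo_target, sla_target):
    """Compare a measured SLI with the internal SLO and the looser, contractual SLA."""
    if measured >= slo_target:
        return "within SLO"
    if measured >= sla_target:
        return "SLO missed, SLA intact"
    return "SLA breached"


# Hypothetical targets: the SLA is deliberately less strict than the SLO.
SLO_TARGET = 0.99
SLA_TARGET = 0.95

print(classify(sli(9_950, 10_000), SLO_TARGET, SLA_TARGET))
print(classify(sli(9_700, 10_000), SLO_TARGET, SLA_TARGET))
print(classify(sli(9_000, 10_000), SLO_TARGET, SLA_TARGET))
```

The gap between the two targets is the warning zone: the team learns about a problem from the missed SLO before it starts costing money under the SLA.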


• Agreement between Dev and SRE on service availability target (SLO)

• Error budget is 1 minus the availability target (the “9’s”)

• Ex: 99.9% uptime means error budget is 0.1%

• Error budget can be “spent” on either features or reliability

• If the service is within its error budget, keep pushing features

• If the service has overspent its error budget, slow/pause feature work and focus on reliability until service is back in spec

Error Budgets
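The error budget arithmetic can be worked through as follows. This is a sketch under stated assumptions: the 30-day window, the downtime figures, and the function names are illustrative, and availability is measured simply as minutes of downtime.

```python
def allowed_downtime_minutes(slo_target, window_days):
    """Error budget expressed as minutes of downtime allowed in the window."""
    error_budget = 1.0 - slo_target  # e.g. 1 - 0.999 = 0.001 (0.1%)
    return error_budget * window_days * 24 * 60


def budget_remaining_minutes(slo_target, window_days, downtime_minutes):
    """Minutes of budget left; negative means the budget is overspent."""
    return allowed_downtime_minutes(slo_target, window_days) - downtime_minutes


def release_decision(slo_target, window_days, downtime_minutes):
    """Apply the rule from the slide: ship while budget remains, otherwise focus on reliability."""
    if budget_remaining_minutes(slo_target, window_days, downtime_minutes) > 0:
        return "keep pushing features"
    return "slow/pause feature work, focus on reliability"


# A 99.9% target over an assumed 30-day window allows about 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(0.999, 30), 1))
print(release_decision(0.999, 30, downtime_minutes=12))
print(release_decision(0.999, 30, downtime_minutes=60))
```

Expressing the budget in minutes makes the trade-off tangible: a single hour-long outage in this example spends more than the whole month's budget.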


• Solo – SRE as individual skillset

• Consulting – consultants to Dev or DevOps

• Embedded – individual SREs join product-focused teams (“Netflix model”)

• Dedicated – full SRE teams, in own org structure (“Google model”)

SRE engagement models


• SRE is a specialization/role

• SREs join cross-functional product teams, just like QA engineers, DBAs, PMs, etc.

• Development and ongoing ops happen from within these teams

• These teams own a service through entire lifecycle, from concept to decommissioning

• No “ops” org to “throw it over the wall to”

Netflix Model (embedded)


• SRE is a distinct org within company

• Individual SREs are on teams within this org

• Dev teams have limited SRE skills

• Initially own full lifecycle of service

• At certain level of scale, performance, and stability, begin partnering with SRE team

• SRE team manages availability, scalability, etc.

• Dev team continues to do feature development

• Dev team stays somewhat involved in operation & on-call (85% SRE, 15% Dev)

• Error budget used to balance stability vs. feature development

Google Model (vertical)


• Production Engineering (Facebook)

• Chaos Engineering (Netflix)

• Resiliency Engineering

Related Disciplines


• O’Reilly books

• Site Reliability Engineering: How Google Runs Production Systems (April 2016)

• “The SRE Book”

• How Google implemented SRE; the “what” of SRE

• The Site Reliability Workbook: Practical Ways to Implement SRE (July 2018)

• “The SRE Workbook”

• Companion to The SRE Book; explains more “how” and “why”, and more from others than Google

• Seeking SRE: Conversations About Running Production Systems at Scale (Sep 2018)

• More expansive view, from still more orgs

• How to apply SRE principles in more situations

Exploring SRE


• USENIX SREcon

• 3 per year: Americas, EMEA, Asia

• Videos of past talks available online, for free

• Usually sells out very early (often before early-bird ends), so register early!

• Highly recommend Ben Treynor Sloss’ “Keys to SRE” keynote from first SREcon

• http://j.mp/2N3ABSz

• Also Damon Edwards’ “Clearing the Way for SRE in the Enterprise” from SREcon18Europe

• http://j.mp/2OUYNrZ

Exploring SRE


Please contact me!

Brent Chapman

[email protected]

@brent_chapman

Available for consulting, training, conferences, public speaking, and in-house presentations

These slides are at http://j.mp/grtcrcl-181018

Questions? More info…
