SLO Review Takeshi Kondo / @chaspy 2020/01/25 SRE NEXT 2020 #srenext #srenextC

Page 1:

SLO Review

Takeshi Kondo / @chaspy 2020/01/25

SRE NEXT 2020 #srenext #srenextC

Page 2:

Service Level Objectives

Page 3:

Questions

• ✋ Do you know the meaning of SLO?
• ✋ Do you define SLOs for your service?
• ✋ Do you have an Error Budget Policy for your service?

Page 4:

Target

• People who want to know SLI/SLO
• People who want to know how to use SLI/SLO
• People who want to keep the reliability and agility of product development

Page 5:

Site Reliability Engineering: Measuring and Managing Reliability 🎉

https://www.coursera.org/learn/site-reliability-engineering-slos

Page 6:

tl;dr

• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs are not perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually

Page 7:

Agenda

• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Page 9:

What

• SLI / Service Level Indicator
  • A quantifiable measure of service reliability
  • e.g. HTTP success rate, response time
• SLO / Service Level Objective
  • Sets a reliability target for an SLI
  • 99%, 99.9%, 99.99%…
• Error Budget
  • An SLO implies an acceptable level of unreliability
  • This is a budget that can be allocated

The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0

Page 10:

SLI should be related to user happiness

😄

😥

SLI (%) = Good Events / Valid Events

Page 11:

SLI should be related to user happiness

😄

😥

SLI (%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count)

Page 12:

SLO is a reliability target for an SLI

😄

😥

SLI (%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count)

SLO: 99.9%

Page 13:

SLO is a reliability target for an SLI

😄

😥

SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))

SLO: 99.9%

Present: 99.95%
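
The ratio on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `sli_percent` and the variable names are illustrative assumptions, and the counts are the ones shown on the slide.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events * 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return 100.0 * good_events / valid_events


# Counts taken from the slide: 10000 responses with 2xx, 5 with 5xx.
count_2xx = 10000
count_5xx = 5
slo_target = 99.9  # SLO from the slide, in percent

current = sli_percent(count_2xx, count_2xx + count_5xx)
print(f"SLI = {current:.2f}%, SLO met: {current >= slo_target}")
```

Only 2xx and 5xx responses are counted as valid events here, which mirrors the formula on the slide.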

Page 14:

We can accept Errors as Error Budget

😄

😥

SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))

SLO: 99.9%

Present: 99.95%

Error Budget: we can accept 5 more 5xx errors 😌

Page 15:

We can accept Errors as Error Budget

😄

😥

SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))

SLO: 99.9%

Present: 99.95%

Error Budget: we can accept 5 more 5xx errors 😌

Event-based SLO
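
The "5 more errors" figure follows from the SLO and the event counts. The sketch below shows one way to compute it for an event-based SLO; the function name `remaining_error_budget` is an assumption for illustration, and rounding the allowance down to whole events is a choice made here, not something stated on the slide.

```python
from fractions import Fraction


def remaining_error_budget(good: int, bad: int, slo: str) -> int:
    """Return how many more bad events fit in the budget for the events seen so far.

    `slo` is a percentage given as a string (e.g. "99.9") so that it converts
    to an exact fraction without floating-point rounding.
    """
    valid = good + bad
    allowed_bad_ratio = 1 - Fraction(slo) / 100  # 99.9% -> 1/1000
    allowed_bad = int(valid * allowed_bad_ratio)  # round down to whole events
    return allowed_bad - bad


# Counts and SLO taken from the slide.
remaining = remaining_error_budget(good=10000, bad=5, slo="99.9")
print(remaining)
```

With 10005 valid events and a 99.9% SLO, about 10 bad events are allowed; 5 are already used, so 5 remain. A negative result would mean the budget is exhausted.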

Page 16:

We can accept Errors as Error Budget

😄

😥

SLI (%) = Number of 1-minute windows in which the 95th-percentile response time is < 100 msec / Number of all time windows

SLO: 99.9%

Present: 99.95%

Page 17:

We can accept Errors as Error Budget

😄

😥

SLI (%) = Number of 1-minute windows in which the 95th-percentile response time is < 100 msec / Number of all time windows

SLO: 99.9%

Present: 99.95%

7 days

Error Budget is only 10 minutes in 7 days 😅

Page 18:

We can accept Errors as Error Budget

😄

😥

SLI (%) = Number of 1-minute windows in which the 95th-percentile response time is < 100 msec / Number of all time windows

SLO: 99.9%

Present: 99.95%

7 days

Error Budget is only 10 minutes in 7 days 😅

Monitor-based SLO
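
The "10 minutes in 7 days" figure can be checked by counting 1-minute windows. This is a rough sketch under the assumption that each 1-minute evaluation is either good or bad; the function name `allowed_bad_windows` is hypothetical.

```python
from datetime import timedelta
from fractions import Fraction


def allowed_bad_windows(period: timedelta, window: timedelta, slo: str) -> int:
    """Return how many evaluation windows may be bad within the period."""
    total_windows = period // window  # timedelta // timedelta gives an int
    allowed_ratio = 1 - Fraction(slo) / 100  # exact, e.g. 1/1000 for "99.9"
    return int(total_windows * allowed_ratio)  # round down to whole windows


# 7-day period, 1-minute windows, 99.9% SLO, as on the slide.
budget = allowed_bad_windows(timedelta(days=7), timedelta(minutes=1), "99.9")
print(f"{budget} bad 1-minute windows allowed in 7 days")
```

Seven days contain 10080 one-minute windows, and 0.1% of that is 10.08, which is where the roughly 10 minutes of budget comes from.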

Page 19:

Why

• Fact-based decision making
• Teams can develop with a balance between reliability and agility
• Especially important in a microservices architecture

Page 20:

Teams can develop with a balance between reliability and agility

🤔

Reliability Agility

Ops 🙂 Keep the reliability

Dev 😎 Let’s release a new feature!

SLO

Page 21:

Especially important in a microservices architecture

ServiceA

ServiceB

ServiceC

Success Rate 99.9%

Success Rate 99%

Success Rate 99% 😥

Reliability depends on other services
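
To see why dependencies matter, consider a simplified model in which one request must pass through all three services and their failures are independent. Both of these are assumptions made for illustration; the slide does not state how the services call each other or which rate belongs to which service.

```python
from math import prod


def serial_success_rate(rates):
    """Combined success rate when every service in the chain must succeed.

    Assumes failures are independent, which real systems only approximate.
    """
    return prod(rates)


# Success rates shown on the slide: 99.9%, 99% and 99%.
combined = serial_success_rate([0.999, 0.99, 0.99])
print(f"Combined success rate: {combined:.2%}")
```

Under this model the combined rate ends up lower than that of any single service, so a service cannot promise more than its dependencies allow.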

Page 22:

Where

Synthetics Client

Frontend

CDN LoadBalancer Application DataStore

Many options, Trade-off

Page 23:

Where

Synthetics Client

Frontend

CDN LoadBalancer Application DataStore

Many options, Trade-off

Some requests might not reach to the apps

Need more engineering effort to generate E2E tests

Page 24:

In Quipper

Synthetics Client

Frontend

CDN LoadBalancer Application DataStore

Send everything to Datadog

Page 25:

Agenda

• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Page 26:

Self-Contained

“Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”

Page 27:

SRE Mission for 2020 / Self-Contained

• Product teams can develop by themselves
  • No need to ask SREs
• We SREs provide the process
  • Design Doc
  • Production Readiness Check
  • Delegated infrastructure management (Terraform)
  • SLI/SLO

Page 28:

Timeline

2019 2020

Migrated to Kubernetes

Define the Ownership

Production Readiness Checklist

SLO review by myself

Set Error Budget Policy

Jun.Mar. Mar.Sep.

SRE NEXT

SLO review with Devs

Page 29:

Timeline

2019 2020

Migrated to Kubernetes

Define the Ownership

Production Readiness Checklist

SLO review by myself

SLO review with Devs

Jun.Mar. Mar.Sep.

SRE NEXT

Set Error Budget Policy

Page 30:

Timeline

2019 2020

Migrated to Kubernetes

Define the Ownership

Production Readiness Checklist

SLO review by myself

SLO review with Devs

Jun.Mar. Mar.Sep.

SRE NEXT

Why do we need such steps?

Set Error Budget Policy

Page 31:

Why do we need such steps?

• Are the SLIs/SLOs we defined appropriate?
  • If not, the Error Budget Policy won’t work well
• Can the product team start the process by itself?
  • If not, we need some scaffolding, preparation, and training

Page 32:

Case Study in Quipper

• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy

Page 34:

Know your systems and organizations

• 2 Products
• 4 Branches 🇯🇵🇮🇩🇵🇭🇲🇽
• 97 Kubernetes Deployments
• 84 Developers (including 6 SREs)
• 48 subdomains

Where is the Ownership?

Page 35:

Define the Owner

Page 36:

Define the Owner

Services / Teams

Japan: 7
Global: 8
Philippines: 3
Indonesia: 4
Shared: 1

Page 37:

Define the Service Owner in the Design Doc for a new service

Page 38:

Case Study in Quipper

• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy

Page 39:

SLO review by myself

• Establish the SLO Review process
  • How to set SLOs?
  • How to monitor SLOs?
  • What action to take on an SLO violation?
  • How to investigate?
• Improve SLI / SLO accuracy
  • How to think about revising them?

Page 40:

How to set and monitor SLO?

Page 41:

How to set and monitor SLO?

• Unfortunately, there is no alerting or recording system 😅
• Use a Slack reminder and record the results in a GitHub Issue

Page 42:

How to set and monitor SLO?

Page 43:

Availability Table

https://landing.google.com/sre/sre-book/chapters/availability-table/

Too many errors 🤔

Target too high 🤔

Start with this!
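
An availability table of this kind can be derived directly from the target percentage. The snippet below is a small sketch, not a copy of the linked table: the set of targets and periods and the function name `allowed_downtime` are assumptions chosen for illustration.

```python
from datetime import timedelta

# Periods to report; chosen for illustration.
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(days=7),
    "30 days": timedelta(days=30),
}


def allowed_downtime(target_percent: float, period: timedelta) -> timedelta:
    """Return the downtime a target allows within the given period."""
    return period * (1 - target_percent / 100)


for target in (99.0, 99.9, 99.99):
    cells = [
        f"{name}: {allowed_downtime(target, period).total_seconds() / 60:.1f} min"
        for name, period in PERIODS.items()
    ]
    print(f"{target}%  " + ", ".join(cells))
```

Printing the rows side by side makes it easier to judge whether a target tolerates too many errors or is too strict for the service.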

Page 44:

Realized that “SLO Review” is a good habit

• Good habit?
  • Like Pair-Programming or Unit Tests
• Why?
  • Motivates us to collect metrics
  • No burnout, a feeling of relief
  • Become aware of the factors that hinder reliability
    • Platform Outage
    • Push notification
    • Resource Capacity
    • Rolling Update

Page 45:

Case Study in Quipper

• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy

Page 46:

Many Problems…

• Noisy metrics caused by the DoS detector
• Developing SLIs
• Send an HTTP path tag for shared services
• No metrics available for microservice SLIs

Page 47:

DoS Detector: Rate limiting by Reverse Proxy

Page 48:

DoS Detector: Rate limiting by Reverse Proxy

If the same client makes a large number of requests in a short time, the reverse proxy returns 503

Page 49:

SLI should be related to user happiness

😄

😥

SLI (%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count)

Page 50:

Noisy metrics caused by the DoS detector
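
Because the rate limiter answers with 503, those responses inflate the 5xx count in the success-rate SLI. The slides do not say how this was resolved; the sketch below shows one possible approach, assuming each response carries a hypothetical `rate_limited` flag set by the reverse proxy so that such responses can be excluded from the valid events.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    rate_limited: bool = False  # hypothetical flag, not from the slides


def success_rate_percent(responses):
    """Success rate that ignores 5xx responses produced by the rate limiter."""
    good = sum(1 for r in responses if 200 <= r.status < 300)
    bad = sum(
        1 for r in responses
        if 500 <= r.status < 600 and not r.rate_limited
    )
    valid = good + bad
    return 100.0 * good / valid if valid else None


# 98 successes, 1 real server error, 20 rate-limited 503 responses.
sample = (
    [Response(200)] * 98
    + [Response(500)]
    + [Response(503, rate_limited=True)] * 20
)
print(success_rate_percent(sample))
```

Whether rate-limited requests should count against the SLI is a product decision; this sketch simply treats them as not reflecting the happiness of regular users.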

Page 51:

Send an HTTP path tag for shared services

Coaching Team uses example.quipper.com/coaching

School Team uses example.quipper.com/school
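
When several teams share one host, a tag derived from the request path lets each team filter its own metrics. The following is a minimal sketch of such a mapping; the table `PATH_PREFIX_TO_TEAM`, the function `team_tag` and the tag values are illustrative assumptions, and how the tag is attached to the metrics is not covered here.

```python
# Hypothetical mapping from path prefix to owning team.
PATH_PREFIX_TO_TEAM = {
    "/coaching": "coaching",
    "/school": "school",
}


def team_tag(path: str) -> str:
    """Return the owning team for a request path, or "unknown"."""
    for prefix, team in PATH_PREFIX_TO_TEAM.items():
        # Match the prefix itself or anything below it, but not "/schoolish".
        if path == prefix or path.startswith(prefix + "/"):
            return team
    return "unknown"


print(team_tag("/coaching/lessons/1"))
print(team_tag("/school"))
```

With such a tag in place, each team can compute a success rate over only the requests it owns.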

Page 52:

Send an HTTP path tag for shared services

Page 53:

Send an HTTP path tag for shared services

Page 54:

No metrics available for microservice SLIs

Page 55:

No metrics available for microservice SLIs

ServiceA

ServiceB

ServiceC

GET http://serviceb

GET http://servicec

Page 56:

No metrics available for microservice SLIs

ServiceA

ServiceB

ServiceC

GET http://serviceb

GET http://servicec

Side-car container

Page 57:

Case Study in Quipper

• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
• To be continued…

Page 58:

Agenda

• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Page 59:

Provide Standardized / Recommended SLIs

• Ideally, the Product Team should set its own SLIs, but…
• Start with the defaults first

Page 60:

SLI menu

• Availability
  • HTTP success rate
• Latency
  • Upstream response time < x msec
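
The latency item on the menu can be expressed in the same good / valid form as the availability SLI. The sketch below is an assumption-laden illustration: the function name `latency_sli_percent`, the sample values and the 100 msec threshold are made up, since the slide leaves x open.

```python
def latency_sli_percent(response_times_ms, threshold_ms):
    """Percentage of requests whose response time is below the threshold."""
    if not response_times_ms:
        raise ValueError("need at least one measurement")
    fast = sum(1 for t in response_times_ms if t < threshold_ms)
    return 100.0 * fast / len(response_times_ms)


# Illustrative measurements in milliseconds.
samples = [50, 80, 120, 99]
print(latency_sli_percent(samples, threshold_ms=100))
```

Expressing both menu items as ratios keeps the SLO and error budget arithmetic identical for availability and latency.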

Page 61:

Make the configuration as code

Page 62:

Make the configuration as code

Developers can easily change it via a pull request
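
The actual configuration shown in the talk is not part of this transcript. As a rough idea of what "SLO as code" can look like, here is a hypothetical Python sketch in which each SLO is a small validated record kept in the repository; every name and value in it is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODefinition:
    """One reviewed SLO; all field names are hypothetical."""
    service: str
    owner: str
    sli: str  # e.g. "availability" or "latency"
    target_percent: float
    window_days: int

    def __post_init__(self):
        # Reject values that can never be a meaningful target or window.
        if not 0 < self.target_percent < 100:
            raise ValueError("target_percent must be between 0 and 100")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


# Changing a target is a one-line diff that can be reviewed in a pull request.
SLOS = [
    SLODefinition("example-service", "example-team", "availability", 99.9, 7),
    SLODefinition("example-service", "example-team", "latency", 99.0, 7),
]

for slo in SLOS:
    print(f"{slo.service} [{slo.sli}]: {slo.target_percent}% over {slo.window_days} days")
```

Keeping the definitions in version control gives every change an author, a reviewer and a history, which fits the review process described earlier.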

Page 63:

Have a steep learning curve

Page 64:

Good Documentation

Page 65:

Work together 🤝

Page 66:

Agenda

• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Page 67:

Summary

• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs are not perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually

Page 68:

Thank You!

chaspy

chaspy_

Site Reliability Engineer at Quipper

Takeshi Kondo

SRE Lounge Terraform-jp