Sandstorm or Significant? The evolving role of situational context in incident management.

Page 1

The evolving role of context in Incident Management

Page 2

Matthew Boeckman, Developer Advocate

Victorops.com/blog

@matthewboeckman

Background

● 18 years on-call Ops
● 15 years w/software teams
● Startup junkie
● DevOps enthusiast

Page 3

What is VictorOps?

VictorOps ingests all of the alerts from your current monitoring tools and becomes the logical layer between those alerts and the people who receive them.

Page 4

victorops.com/IMA

Page 5

5 Phases of Incident Management

Detection: monitoring, metrics, thresholds

Response: alerting, on-call, escalation

Remediation: fixes, tickets, deployments

Analysis: postmortem, how or why, understand

Readiness: improvement, game days, learning

Page 6

Standard Incident Workflow

Detection Response Remediation Analysis Readiness

Page 7

Incident Management Assessment Matrix

Detection Response Remediation Analysis Preparedness

Novice

Beginner

Competent

Proficient

Expert

Page 8

Incident Management Maturity Matrix

Detection Response Remediation Analysis Preparedness

Novice

Beginner x

Competent x x

Proficient x x

Expert

Page 9

Self Assessment

Poll: How would you rate your overall team maturity?

A. Novice
B. Beginner
C. Competent
D. Proficient
E. Expert

Page 10

The Focus Question

How can we help teams mature their incident management practice?

(Stated plainly: Make On-Call suck less)

Page 11

Situational Context

Page 12

Incident Management Key Metrics

● Mean Time to Repair (MTTR)
● Availability (SLA)
● Ticket Volumes
● Escalations
● Customer Satisfaction
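As a rough illustration of the first two metrics, the sketch below computes MTTR and availability from a list of incident start and resolve times. The `Incident` record, the sample timestamps, and the 30-day reporting window are assumptions made for the example, not data from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: when the problem started and when it was resolved."""
    started_at: datetime
    resolved_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.resolved_at - self.started_at


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average time from start to resolution across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.duration for i in incidents), timedelta())
    return total / len(incidents)


def availability(incidents: list[Incident], window: timedelta) -> float:
    """Fraction of the window not spent in an incident (assumes incidents do not overlap)."""
    downtime = sum((i.duration for i in incidents), timedelta())
    return 1 - downtime / window


# Made-up sample data: one 30-minute and one 90-minute incident in a 30-day window.
incidents = [
    Incident(datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 30)),
    Incident(datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)),
]
mttr = mean_time_to_repair(incidents)
uptime = availability(incidents, timedelta(days=30))
print(f"MTTR: {mttr}, availability: {uptime:.4%}")
```

Where the clock starts (failure, detection, or first page) is a choice each team has to make; the sketch simply uses whatever `started_at` holds.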

Page 13

Incident Management Key Metrics

Page 14

Time Spent Managing Incidents - Low Maturity

Detection Response Remediation Analysis

Readiness

Time to Repair (MTTR)

Page 15

Time Spent Managing Incidents - Medium Maturity

Detection Response Remediation Analysis

Readiness

Time to Repair (MTTR)

Page 16

Time Spent Managing Incidents - High Maturity

Detection

Response

Remediation Analysis Readiness

Time to Repair (MTTR)

Page 17

A New Core Metric

Detection

Response

Remediation Analysis Readiness

Time to Repair (MTTR)

Time to Learn (TTL)

● Identify trends
● Capacity plan
● Improve infrastructure

● Gamedays
● Cross train
● Update runbooks

Page 18

Beep Beep Beep

Page 19

Standard Incident Workflow

Page 20

Standard Diagnostic Procedure

1. Fire up the VPN

2. Navigate dashboards, find relevant section

3. Review ticket or incident history for host

4. Review Runbooks for associated host

Page 21

Common Bottlenecks to Establishing Context

● Multiple sources of record
● Duplicate Runbooks or documentation
● Metric overload
● New responders unfamiliar with systems

Page 22

Where Does it Hurt?

Poll: Which is the most painful problem you experience in establishing context?

A. Multiple sources of record
B. Duplicate documentation
C. Metric overload
D. Everything is equally on fire
E. Everything is fantastic

Page 23

Beep Beep Beep

Page 24

A Tale of Two Graphs

Massive spike above expected norm

Response: Fire up the laptop and put a pot of coffee on

Page 25

A Tale of Two Graphs

Small spike for a consistently loaded box.

Response: ACK alert, go back to sleep
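The two graphs differ mainly in how far the new reading sits from what that host normally does. One way to put that context into numbers is a simple z-score against the host's own recent history. The sketch below assumes that approach, with made-up CPU samples and a hypothetical cutoff of 3 standard deviations; it is not a method described in the talk.

```python
from statistics import mean, stdev


def deviation_from_norm(history: list[float], current: float) -> float:
    """How many standard deviations the current reading is from the host's own baseline."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return 0.0 if current == baseline else float("inf")
    return (current - baseline) / spread


def is_significant(history: list[float], current: float, cutoff: float = 3.0) -> bool:
    """Hypothetical rule: a reading matters when it is far outside this host's norm."""
    return deviation_from_norm(history, current) >= cutoff


# Normally quiet box (CPU %) that suddenly reads 95: massive spike above the expected norm.
quiet_box = [10, 12, 11, 9, 10, 11, 12, 10]
# Consistently loaded box that reads 95: small spike.
busy_box = [85, 92, 88, 83, 90, 94, 86, 89]

print("quiet box significant:", is_significant(quiet_box, 95))
print("busy box significant:", is_significant(busy_box, 95))
```

A static 90% threshold would page for both hosts; comparing against each host's own history is what separates the pot of coffee from going back to sleep.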

Page 26

This Time, with Context!

Page 27

Enhanced Contextual Workflow
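One possible shape for this workflow, sketched under assumptions: before an alert reaches the responder, annotate it with the things the standard diagnostic procedure would otherwise have them look up by hand (dashboard, runbook, incident history for the host). The `Alert` type, the lookup tables, and the example.com URLs below are all hypothetical and do not represent the VictorOps API.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert shape; real monitoring tools each have their own payload."""
    host: str
    message: str
    annotations: dict[str, str] = field(default_factory=dict)


# Hypothetical lookup tables standing in for a wiki, a dashboard tool, and a ticket system.
RUNBOOKS = {"web01": "https://wiki.example.com/runbooks/web"}
DASHBOARDS = {"web01": "https://graphs.example.com/d/web01"}
LAST_INCIDENT = {"web01": "2 days ago: disk full on /var/log"}


def enrich(alert: Alert) -> Alert:
    """Attach whatever context is known for the alerting host; skip what is missing."""
    sources = {"runbook": RUNBOOKS, "dashboard": DASHBOARDS, "last_incident": LAST_INCIDENT}
    for key, table in sources.items():
        if alert.host in table:
            alert.annotations[key] = table[alert.host]
    return alert


page = enrich(Alert(host="web01", message="HTTP 5xx rate above threshold"))
for key, value in page.annotations.items():
    print(f"{key}: {value}")
```

The responder then opens the page with the dashboard, runbook, and history already attached, rather than starting from the VPN.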

Page 28

Alert Enhancements

Poll: My team is doing some enhancement of alerts today.

A. True
B. False

Page 29

CI/CD Exacerbates the Contextual Challenge

Many incidents can be tracked to deploys

Developer Velocity = Constant Change

Silos impair communication
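If many incidents can be tracked to deploys, one of the first questions a responder asks is "what changed recently?". The sketch below shows one way to answer it automatically, assuming deploy events are available as (service, version, timestamp) records and that a 30-minute lookback is a sensible window. Both are assumptions for illustration, as are the sample deploys.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    """Hypothetical deploy event, e.g. emitted by a CI/CD pipeline."""
    service: str
    version: str
    deployed_at: datetime


def recent_deploys(
    deploys: list[Deploy],
    alert_time: datetime,
    lookback: timedelta = timedelta(minutes=30),
) -> list[Deploy]:
    """Deploys that landed within the lookback window before the alert, newest first."""
    candidates = [
        d for d in deploys
        if timedelta(0) <= alert_time - d.deployed_at <= lookback
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


deploys = [
    Deploy("api", "v7", datetime(2024, 1, 10, 9, 0)),     # hours before the alert
    Deploy("web", "v42", datetime(2024, 1, 10, 13, 50)),  # 10 minutes before the alert
    Deploy("web", "v43", datetime(2024, 1, 10, 14, 10)),  # after the alert, cannot be the cause
]
suspects = recent_deploys(deploys, alert_time=datetime(2024, 1, 10, 14, 0))
for d in suspects:
    print(f"{d.service} {d.version} deployed at {d.deployed_at:%H:%M}")
```

Attaching this list to the alert gives the on-call engineer the deploy context that silos would otherwise hide.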

Page 30

A Tale of Two Incidents

Page 31

A Tale of Two Incidents

Page 32

Introducing: The Scientific Method

Make Observations (the measurement)

Ask a question (why would a webserver quit working?)

Form a hypothesis (because we just deployed?)

Page 33

The Sandstorm

Page 34

No. Do not.

Page 35

Measure Everything: the Anti-pattern

Measurements cost time and money

Busy dashboards lead to subconscious filtering

Measurements create a natural impulse to alert

Page 36

Enhance

Page 37

Stop

Page 38

An Embarrassment of Dashboards

Page 39

Rule of Thumb

Measure much

Alert on some

Contextualize all
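Read as configuration, the rule of thumb might look like the sketch below: every metric is recorded and carries a note about what it means, but only a few have a paging threshold. The `Metric` type, the metric names, and the thresholds are hypothetical examples, not recommendations from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    name: str
    context: str                          # what the number means and where to look next
    alert_above: Optional[float] = None   # None means "measure only, never page"


# Measure much: four metrics. Alert on some: two thresholds. Contextualize all: every entry has a note.
METRICS = [
    Metric("http_5xx_rate", "User-facing errors; check recent deploys first", alert_above=0.05),
    Metric("disk_used_pct", "Log volume; see the log rotation runbook", alert_above=90),
    Metric("cpu_user_pct", "Useful for capacity planning; noisy on batch hosts"),
    Metric("gc_pause_ms", "Diagnostic detail when latency alerts fire"),
]


def should_page(metric: Metric, value: float) -> bool:
    """Page only for metrics that were explicitly given a threshold."""
    return metric.alert_above is not None and value > metric.alert_above


alerting = [m.name for m in METRICS if m.alert_above is not None]
print("paging metrics:", alerting)
```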

Page 40

Iteration is Key

Dialing in context takes time

Conduct blameless postmortems

Experiment with more and less context

Be objective in your assessment of what works

Page 41

Leverage Situational Context

Providing incident responders with context can meaningfully impact MTTR, paying dividends in time to move your practice forward.

Page 42

The Beginning

Detection Response Remediation Analysis

Readiness

Time to Repair (MTTR)

Page 43

The Goal

Detection

Response

Remediation Analysis Readiness

Time to Repair (MTTR)

Time to Learn (TTL)

● Identify trends
● Capacity plan
● Improve infrastructure

● Gamedays
● Cross train
● Update runbooks

Page 44

Take the IMA! http://victorops.com/ima

Questions?

Thank you!

Matthew Boeckman, @matthewboeckman

Slides on devops.com & slideshare.com