Ignite (10m) how to not burn out your monitoring team

Relaxing picture of Yoga

PagerDuty Alert

hunt through logs for 2 hours

How to not burn out your production team

Gil Zellner (CloudifyDev at Gigaspaces)

Twitter: @Heathenaspargus

Who am I? Now:

Past:

The cost of hiring a new employee is 1.5-3x their monthly salary

Next day

Frustration: I am unable to complete my task

Time spent inefficiently

Repetitive tasks

Working Alone

Yak Shaving

https://www.ergoflex.co.uk/blog/category/sleep-research/sleeponomics-could-sleep-deprivation-be-the-real-reason-politicians-make-bad-decisions

Easy (days)
- no changes to infrastructure
- just policy

Intermediate (months)
- small changes to apps
- logging
- light automation

Hard (years)
- design for better operability
- long term

Mandatory half day off after a night-time production issue

Allocate weekly time to resolve or automate issues that kept us up at night

Wider rotation (more people do on-call)

Knowledge Matrix

Deploy System Mobile Link Backend

Gil V V

Karen V V

Ari V V
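
A knowledge matrix like the one above can be checked mechanically for systems that too few people know. The following is a minimal sketch, not the speaker's tooling. The column names are one reading of the slide, and the check-mark positions are hypothetical because the slide's layout did not survive; `coverage` and `thin_spots` are illustrative names.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical data: who is checked off for which system is NOT recoverable
# from the slide, so these assignments are for illustration only.
KNOWLEDGE: Dict[str, Set[str]] = {
    "Gil": {"Deploy System", "Backend"},
    "Karen": {"Deploy System", "Mobile Link"},
    "Ari": {"Mobile Link", "Backend"},
}


def coverage(matrix: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each system to the sorted list of people who know it."""
    result: Dict[str, List[str]] = {}
    for person, systems in matrix.items():
        for system in systems:
            result.setdefault(system, []).append(person)
    return {system: sorted(people) for system, people in result.items()}


def thin_spots(
    matrix: Dict[str, Set[str]], all_systems: Iterable[str], minimum: int = 2
) -> List[str]:
    """Return systems known by fewer than `minimum` people."""
    cov = coverage(matrix)
    return sorted(s for s in all_systems if len(cov.get(s, [])) < minimum)
```

Running `thin_spots` whenever someone joins or leaves the rotation shows which systems need knowledge sharing before the next on-call shift.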

Solution: alert only on things that meet the following criteria:

1) Alert on symptoms, not suspected "causes"

2) Actionable

3) Business breaking
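
The three criteria above can be encoded as a simple paging gate. This is a minimal sketch that assumes an alert must satisfy all three criteria before it pages anyone; the `Alert` class, its field names, and `should_page` are hypothetical and not tied to any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    name: str
    is_symptom: bool            # user-visible symptom, not a suspected cause
    is_actionable: bool         # a human can do something about it right now
    is_business_breaking: bool  # the business is hurt while it persists


def should_page(alert: Alert) -> bool:
    """Page a human only when all three criteria hold (an assumption)."""
    return (
        alert.is_symptom
        and alert.is_actionable
        and alert.is_business_breaking
    )
```

For example, "checkout error rate is high" would pass this gate, while "CPU at 90% on one host" is a suspected cause and would be logged or ticketed rather than paged.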

General alert! ("Alerte générale!")

Solution: direct alerts to relevant parties
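
Directing alerts to the relevant parties can be as simple as a lookup from the alerting component to its owning team. The sketch below assumes hypothetical component and team names and a hypothetical fallback recipient; real alerting tools have their own routing configuration, which this does not attempt to reproduce.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical ownership table: component name -> team that should be notified.
ROUTES: Dict[str, str] = {
    "mobile": "mobile-team",
    "backend": "backend-team",
    "deploy": "platform-team",
}
DEFAULT_RECIPIENT = "primary-on-call"  # assumed fallback for unowned components


def route(component: str, routes: Dict[str, str] = ROUTES) -> str:
    """Return the team that owns `component`, or the fallback recipient."""
    return routes.get(component, DEFAULT_RECIPIENT)


def group_by_recipient(
    alerts: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group (component, message) pairs so each team sees only its own alerts."""
    grouped: Dict[str, List[str]] = {}
    for component, message in alerts:
        grouped.setdefault(route(component), []).append(message)
    return grouped
```

Keeping the fallback explicit means an alert from an unowned component still reaches a person instead of being broadcast to everyone.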

What are your KPIs?

Netflix stream starts per second
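
A business KPI such as a starts-per-second rate can drive a symptom-based alert by comparing recent samples with a baseline. This is a minimal sketch under stated assumptions, not Netflix's actual method: `kpi_dropped` is a hypothetical helper, and the 20% threshold is an arbitrary illustrative default.

```python
from statistics import mean
from typing import Sequence


def kpi_dropped(
    recent: Sequence[float],
    baseline: Sequence[float],
    max_drop: float = 0.2,
) -> bool:
    """Return True when the recent average is more than `max_drop` below baseline.

    `recent` and `baseline` are samples of the same KPI, for example events
    per second; both must be non-empty.
    """
    if not recent or not baseline:
        raise ValueError("need at least one sample in each window")
    expected = mean(baseline)
    if expected <= 0:
        # No meaningful baseline to compare against.
        return False
    return mean(recent) < expected * (1 - max_drop)
```

In practice the baseline would usually come from the same time of day on earlier days, so that normal daily cycles do not trigger the alert.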

Picking how to measure things

Diagnosis

Write a heal script
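
A heal script typically checks health, applies a known fix a bounded number of times, and hands over to a human if that does not work. The sketch below shows that shape only; `heal` and its parameters are hypothetical, and the health check, remediation, and escalation steps are passed in as callables so nothing here depends on a specific system.

```python
from typing import Callable


def heal(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    escalate: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Try a known fix up to `max_attempts` times, then escalate to a human.

    Returns "healthy" if nothing was wrong, "healed" if a fix worked,
    and "escalated" if every attempt failed.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_attempts):
        remediate()
        if is_healthy():
            return "healed"
    escalate(f"auto-remediation failed after {max_attempts} attempts")
    return "escalated"
```

Bounding the number of attempts matters: a script that restarts a service forever hides the problem, while one that gives up and pages keeps a human in the loop for the cases automation cannot handle.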

Facebook Auto Remediation

https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920

Bad artists copy, great artists steal

email:[email protected]

Twitter: @Heathenaspargus