Ignite (10m) how to not burn out your monitoring team

Relaxing picture of Yoga


PagerDuty Alert

hunt through logs for 2 hours

How to not burn out your production team

Gil Zellner (CloudifyDev at Gigaspaces)

Twitter: @Heathenaspargus

Who am I? Now:





The cost of hiring a new employee is 1.5-3x their monthly salary




Next day




Frustration: I am unable to complete my task


Time spent inefficiently


Repetitive tasks


Working Alone


Yak Shaving




Easy (days)

- no changes to infrastructure
- just policy

Intermediate (months)

- small changes to apps
- logging
- light automation

Hard (years)

- design for better operability
- long term


Mandatory half day off after a nighttime production issue


Allocate weekly time to resolve or automate issues that kept us up at night


Wider rotation (more people do on-call)


Knowledge Matrix

Deploy System Mobile Link Backend

Gil V V

Karen V V

Ari V V
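A knowledge matrix like the one above can also be checked mechanically for systems that only one person knows. The following is a minimal sketch; the people, systems, and who-knows-what assignments are hypothetical placeholders, not the values from the slide.

```python
# Hypothetical knowledge matrix: person -> systems they can support on-call.
# These names and assignments are illustrative assumptions only.
matrix = {
    "alice": {"deploy", "mobile"},
    "bob": {"mobile", "backend"},
    "carol": {"backend"},
}


def coverage(matrix):
    """Return a mapping of system -> list of people who know it."""
    counts = {}
    for person, systems in matrix.items():
        for system in systems:
            counts.setdefault(system, []).append(person)
    return counts


def single_points_of_knowledge(matrix):
    """Systems known by fewer than two people, sorted by name."""
    return sorted(
        system for system, people in coverage(matrix).items() if len(people) < 2
    )


print(single_points_of_knowledge(matrix))
```

Any system this reports is a candidate for pairing or training before the rotation is widened.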






Solution: alert only on things that meet all of the following criteria:

1) Symptom-based: alert on symptoms, not suspected "causes"

2) Actionable

3) Business breaking
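The three criteria above can be expressed as a simple paging filter. This is only a sketch: the `Alert` class, its field names, and the sample alerts are assumptions made for illustration, and in practice someone still has to decide how each alert is classified.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record; the fields mirror the three criteria.
    name: str
    is_symptom: bool            # a user-visible symptom, not a suspected cause
    is_actionable: bool         # a human can do something about it now
    is_business_breaking: bool  # the business is hurt while it fires


def should_page(alert: Alert) -> bool:
    """Page a human only when all three criteria hold."""
    return (
        alert.is_symptom
        and alert.is_actionable
        and alert.is_business_breaking
    )


# Illustrative sample alerts (made up for this sketch).
alerts = [
    Alert("checkout error rate above 5%", True, True, True),
    Alert("CPU at 90% on web-3", False, False, False),
    Alert("nightly report is 10 minutes late", True, True, False),
]

pages = [a.name for a in alerts if should_page(a)]
print(pages)
```

Alerts that fail the filter can still be logged or shown on a dashboard; they just should not wake anyone up.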


General alert! ("Alerte générale!")


Solution: direct alerts to relevant parties
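One way to picture directing alerts to relevant parties is a lookup from the affected service to the team that owns it. The service names, team names, and the `route` function below are hypothetical; real paging tools have their own routing configuration, which this sketch does not try to reproduce.

```python
# Hypothetical ownership table: service -> on-call team that should be paged.
ROUTES = {
    "deploy": "platform-oncall",
    "mobile": "mobile-oncall",
    "backend": "backend-oncall",
}

# Fallback so that an alert with no known owner is never silently dropped.
DEFAULT_TEAM = "ops-oncall"


def route(alert: dict) -> str:
    """Return the team to notify for this alert."""
    return ROUTES.get(alert.get("service"), DEFAULT_TEAM)


print(route({"service": "mobile", "summary": "login failures"}))
print(route({"service": "billing", "summary": "unknown owner"}))
```

The explicit default matters: the goal is fewer people woken per alert, not alerts that reach nobody.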






What are your KPIs?



Netflix stream starts per second
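A business KPI of the "starts per second" kind can be sketched as a rate computed over a time window and compared with an expected value. The function names, the 50% tolerance, and the sample timestamps are assumptions for illustration; this is not a description of how Netflix computes or alerts on its metric.

```python
def starts_per_second(timestamps, window_start, window_seconds):
    """Average number of events per second inside the half-open window."""
    window_end = window_start + window_seconds
    count = sum(1 for t in timestamps if window_start <= t < window_end)
    return count / window_seconds


def kpi_is_healthy(current_rate, expected_rate, tolerance=0.5):
    """Healthy while the rate stays above a fraction of the expected rate.

    The tolerance of 0.5 is an arbitrary illustrative threshold.
    """
    return current_rate >= expected_rate * tolerance


# Made-up event timestamps, in seconds.
events = [0.1, 0.5, 1.2, 1.9, 2.5, 9.0]
rate = starts_per_second(events, window_start=0, window_seconds=2)
print(rate, kpi_is_healthy(rate, expected_rate=3.0))
```

A single number like this is a symptom-level signal: when it drops, users are affected, whatever the underlying cause turns out to be.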


Picking how to measure things



Write a heal script
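A heal script usually boils down to: check health, apply a known fix, check again, and give up after a bounded number of attempts so a human can take over. The sketch below assumes the caller supplies the `check` and `fix` callables (hypothetical names); what they actually do, such as probing an endpoint or restarting a service, depends on the system.

```python
import time
from typing import Callable


def heal(
    check: Callable[[], bool],
    fix: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Try a known fix up to max_attempts times.

    Returns True once the check passes, or False if the system is still
    unhealthy after the last attempt, in which case a human should be paged.
    """
    for _ in range(max_attempts):
        if check():
            return True
        fix()
        time.sleep(wait_seconds)  # give the fix time to take effect
    return check()


# Simulated service for demonstration: the first fix brings it back up.
state = {"up": False, "fixes": 0}


def is_up() -> bool:
    return state["up"]


def restart() -> None:
    state["fixes"] += 1
    state["up"] = True


print(heal(is_up, restart))
```

Bounding the attempts is the important design choice: a script that retries forever hides a problem instead of escalating it.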




Facebook Auto Remediation






Bad artists copy, great artists steal

email:[email protected]

Twitter: @Heathenaspargus