Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Who Watches the Watchmen? DevOps Days Chicago 2014 Arup Chakrabarti @arupchak or [email protected]


Page 1: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Who Watches the Watchmen? DevOps Days Chicago 2014

Arup Chakrabarti @arupchak or [email protected]

Page 2: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

What is PagerDuty?

10/23/14

Ops Guys know all too well...

•  Alert and Incident Tracking
•  On-Call Management
•  Integrates with monitoring tools
•  Alert the right person, every time

Page 3: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

What is PagerDuty?

Page 4: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Why do we care about monitoring?

Oct 2014 US East Outage – Outgoing Traffic

Page 5: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Today’s talk is about:

•  What is PagerDuty?
•  Philosophies
•  Tools
•  Security
•  Distributed Systems
•  Dependency
•  How we cheat by using Chef
•  Validation
•  Q and A

Page 6: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Quick Disclaimer

I did not come up with everything

•  I work with smart people
•  Slides will be posted

Page 7: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Philosophies

Thou Shall:

•  Use the right tool
•  Avoid single host monitoring

Page 8: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Philosophies

Thou Shall:

•  Alert on what customers care about

Page 9: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Philosophies

Thou Shall:

•  Alert on expected values
   •  High and Low
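One way to read "alert on expected values, high and low" is as a band check: a metric that falls well below its normal range is as suspicious as one that spikes. A minimal Python sketch follows; the `check_band` helper, the metric name, and the thresholds are assumptions for illustration and are not from the talk.

```python
from typing import Optional


def check_band(name: str, value: float, low: float, high: float) -> Optional[str]:
    """Return an alert message if value falls outside [low, high], else None."""
    if value < low:
        return f"{name} is too low: {value} < {low}"
    if value > high:
        return f"{name} is too high: {value} > {high}"
    return None


# Hypothetical metric: outgoing notifications per minute.
print(check_band("notifications_per_min", 3, low=50, high=5000))
print(check_band("notifications_per_min", 800, low=50, high=5000))
```

The low bound is the part that is easy to forget: a sudden drop in outgoing traffic can mean the system has stopped working even though nothing is reporting errors.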

Page 10: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Philosophies

Thou Shall:

•  Make it Self-Service

Page 11: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Philosophies

Thou Shall:

•  Validate Alerts Work

Page 12: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Monitoring Tools

New Relic

Page 13: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

New Relic

Oooooo Graphs

Page 14: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

New Relic

Stacked lines!

Page 15: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

New Relic

Lines AND Bars!

Page 16: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

New Relic

Reports!

Page 17: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

New Relic

Moar Reports!

Page 18: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

New Relic / APMs

Great for small envs

•  Pros
   •  Great for new stacks
   •  Helpful for tracing transactions
   •  Gives a lot of data

Page 19: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Problem with APMs

Not a silver bullet

•  Cons
   •  They can be overly prescriptive
   •  They can be hard to tune/customize
   •  Gives a lot of data

Page 20: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

StatsD / DataDog

All hail self service metrics

•  StatsD is the client
•  DataDog is the backend
•  Super easy to use
   •  statsd.gauge(metric_name, val)
   •  statsd.counter(metric_name)
   •  statsd.histogram(metric_name, val)
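The three calls on this slide each boil down to a short text line sent over UDP. The sketch below shows that wire format with a hypothetical `TinyStatsd` class that mirrors the method names on the slide; it is not the client PagerDuty used, and the metric names are made up. The `<name>:<value>|<type>` layout and default port 8125 are standard StatsD; the `h` (histogram) type is a DogStatsD extension.

```python
import socket


class TinyStatsd:
    """Hypothetical minimal client, for illustration only."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format(name: str, value: float, kind: str) -> str:
        # StatsD line protocol: <metric name>:<value>|<type>
        return f"{name}:{value}|{kind}"

    def _send(self, line: str) -> None:
        # One UDP datagram per metric; the sender does not wait for a reply.
        self.sock.sendto(line.encode("ascii"), self.addr)

    def gauge(self, name: str, value: float) -> None:
        self._send(self.format(name, value, "g"))

    def counter(self, name: str, delta: int = 1) -> None:
        self._send(self.format(name, delta, "c"))

    def histogram(self, name: str, value: float) -> None:
        # "h" is a DogStatsD extension, not one of the original StatsD types.
        self._send(self.format(name, value, "h"))


# Metric names below are made up for the example.
print(TinyStatsd.format("queue.depth", 42, "g"))
print(TinyStatsd.format("alerts.sent", 1, "c"))
print(TinyStatsd.format("sms.latency_seconds", 15, "h"))
```

Because the transport is fire-and-forget UDP, instrumenting a code path costs very little, which is a large part of why this style of metric lends itself to self-service.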

Page 21: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

StatsD / DataDog

Custom Alerts

Page 22: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

StatsD / DataDog

Custom Notifications

•  PagerDuty Integration
•  Email
•  HipChat

Page 23: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

StatsD / DataDog

Customize all the things

•  Pros
   •  Very customizable
   •  Can change as you grow
   •  Self Service

Page 24: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

StatsD / DataDog

Needs some hand holding

•  Cons
   •  Need to have Configuration Management
   •  Hard to ramp teams up

Page 25: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

SumoLogic

Logging as Monitoring

•  Ship Critical App Logs
•  Engineers set up alerts on patterns
   •  “Too many 500’s in the last 10m”
•  Somewhat self-service
   •  Initial setup is in Chef
•  Hard to use for realtime debugging
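The logic behind a rule such as "too many 500's in the last 10m" is a sliding-window count. The Python sketch below illustrates that logic only; it is not how SumoLogic evaluates queries, and the `ErrorWindow` class and the threshold of 100 are assumptions.

```python
from collections import deque


class ErrorWindow:
    """Count 5xx responses seen within a sliding time window."""

    def __init__(self, window_seconds: float = 600, threshold: int = 100) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def record(self, status: int, now: float) -> bool:
        """Record one response; return True if the alert should fire."""
        if 500 <= status <= 599:
            self.timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


window = ErrorWindow(window_seconds=600, threshold=2)
for second, status in enumerate([500, 200, 503, 502]):
    print(second, status, window.record(status, now=float(second)))
```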

Page 26: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

PagerDuty at PagerDuty?

Page 27: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

PagerDuty at PagerDuty?

YES

Page 28: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Simple External Monitoring

Dumb health checks

•  Wormly and Monitis
•  Simple tools
•  Backup alerting
•  Very naive in the health checks
•  Had to build out smarter health check page

Page 29: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Simple External Monitoring

Dumb health checks made smarter

•  Health Check Page
   •  Lightly touches internal services
   •  Gives back an expected value for each service
   •  Alert on non-expected value
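The health check page described above can be sketched as a function that probes each internal service lightly and compares the result with a known-good value. Everything in the Python example below (service names, probe functions, expected values) is hypothetical; the talk does not show the actual page.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], str]],
                      expected: Dict[str, str]) -> Dict[str, str]:
    """Probe each service and return {service: problem} for every mismatch."""
    problems = {}
    for service, probe in checks.items():
        try:
            got = probe()
        except Exception as exc:  # a probe that raises counts as unhealthy
            problems[service] = f"probe failed: {exc}"
            continue
        if got != expected[service]:
            problems[service] = f"expected {expected[service]!r}, got {got!r}"
    return problems


def broken_cache_probe() -> str:
    raise RuntimeError("connection refused")


# Stand-in probes; real ones would issue a cheap query against each service.
checks = {"db": lambda: "1", "queue": lambda: "PONG", "cache": broken_cache_probe}
expected = {"db": "1", "queue": "PONG", "cache": "OK"}
print(run_health_checks(checks, expected))
```

An external checker then only needs to fetch one page and alert whenever the reported problems are non-empty, which keeps the external tool simple while the page itself carries the knowledge of what healthy looks like.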

Page 30: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Security Monitoring

Why do security monitoring?

•  Audits are tedious
   •  Continuous Audits
•  Earlier alerting
   •  Easier fixing

Page 31: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Security Monitoring

Audits are tedious

•  IDS via OSSEC
   •  Monitor Logs / Checksum Dirs
•  Port scanning
   •  nmap
•  Scrape IPSec data
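To make "checksum dirs" concrete: the idea is to record a digest for every file under a directory and alert when a later scan differs from the baseline. The Python sketch below shows the idea only; it is not OSSEC's implementation, and the function names are assumptions.

```python
import hashlib
import os
from typing import Dict, List


def checksum_dir(root: str) -> Dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def diff_checksums(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return sorted paths that were added, removed, or modified since baseline."""
    modified = {p for p in baseline.keys() & current.keys() if baseline[p] != current[p]}
    added_or_removed = baseline.keys() ^ current.keys()
    return sorted(modified | added_or_removed)
```

A monitoring job would store the output of `checksum_dir` once, rerun it on a schedule, and raise an alert whenever `diff_checksums` returns a non-empty list.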

Page 32: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Distributed Systems

The single host does not matter anymore

•  Alert on cluster level metrics
   •  Overall number of 500’s
   •  % of nodes down
   •  Overall latency
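A cluster-level alert looks at the aggregate rather than at any one host. As a rough Python sketch of the "% of nodes down" case, with a hypothetical 25% paging threshold that is not from the talk:

```python
from typing import Dict


def percent_down(node_up: Dict[str, bool]) -> float:
    """Percentage of nodes currently reported as down."""
    if not node_up:
        return 0.0
    down = sum(1 for up in node_up.values() if not up)
    return 100.0 * down / len(node_up)


def should_page(node_up: Dict[str, bool], max_percent_down: float = 25.0) -> bool:
    """Page only when the cluster as a whole is degraded past the threshold."""
    return percent_down(node_up) > max_percent_down


cluster = {"node-1": True, "node-2": True, "node-3": False, "node-4": True}
print(percent_down(cluster), should_page(cluster))
```

With this shape, losing a single node out of many does not wake anyone up, while losing a meaningful fraction of the cluster does.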

Page 33: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Avoid Single Host Alerts

Crons should not be used for creating alerts

[Diagram labels: PagerDuty; US West 1; US West 2; Linode; Monitoring System]

Page 34: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Same model for service alerting

[Diagram labels: PagerDuty; Service A; Service B; Service C; Monitoring System]

Page 35: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Dependency Monitoring

Stuff that you do not control

•  Dependencies Everywhere
•  Operations
•  DNS
•  Monitoring Tools
•  Logging

Page 36: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Dependency Monitoring

Stuff that you do not control

•  How to monitor?
•  Operations
•  DNS -> Create/Delete records
•  Monitoring Tools -> Basic ping
•  Logging -> Validate that logs are being pushed
•  Status Pages

Page 37: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Dependency Monitoring

What keeps us up at night

•  Dependencies Everywhere
•  Software
•  Email
•  SMS
•  Phone
•  Push Notifications

Page 38: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Quick Story

When SMS providers screw us over

•  Primary SMS provider was “Up”
•  Customer was not getting their SMS
•  Found out in the worst way possible
   •  Customer called us
•  Provider was working but T-Mobile prepaid was not passing our short code through

Page 39: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

End to End testing

aka how to abuse unlimited messaging plans

•  Every minute we send an SMS alert
   •  Every SMS provider we use
   •  Main Carriers
      •  Verizon
      •  AT&T
      •  T-Mobile
      •  Sprint
•  Measure Response times
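The end-to-end test above amounts to sending a uniquely tagged message through the real delivery path and timing how long it takes to arrive. The Python sketch below shows one probe; the `send` and `wait_for` callables stand in for a real SMS provider and a real handset, and both, like the 120-second timeout, are assumptions rather than details from the talk.

```python
import time
import uuid
from typing import Callable, Optional


def probe_delivery(send: Callable[[str], None],
                   wait_for: Callable[[str, float], bool],
                   timeout_seconds: float = 120.0,
                   clock: Callable[[], float] = time.monotonic) -> Optional[float]:
    """Send a tagged test message; return seconds until it arrived, or None on timeout."""
    token = f"probe-{uuid.uuid4().hex}"  # unique tag so probes cannot be confused
    started = clock()
    send(token)
    if wait_for(token, timeout_seconds):
        return clock() - started
    return None


# In-memory stand-in for "provider plus phone": delivery is immediate.
inbox = []
latency = probe_delivery(send=inbox.append,
                         wait_for=lambda token, timeout: token in inbox)
print(latency is not None)
```

Running one such probe per provider and carrier pair every minute yields both an up/down signal (a `None` result) and a latency series that can be graphed and alerted on.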

Page 40: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

“Device Lab”

It looks more official now

Page 41: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Some stats (Averages)

Sorry, cannot tell you which carrier is which

•  Carrier A
   •  15 Seconds
•  Carrier B
   •  5 Seconds
•  Carrier C
   •  15 Seconds
•  Carrier D
   •  50 Seconds

Page 42: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Very Spikey

Page 43: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

How to cheat with Chef

Automate all the things

•  Install all the agents
   •  New Relic
   •  DataDog – Easy alerts as well
   •  SumoLogic
   •  OSSEC
•  Backup alerts are not automated
•  Cluster alert setup is not automated

Page 44: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

How to Validate

Failure Friday

•  We attack our own services
   •  Process failure
   •  Datacenter failure
   •  Network failure
•  https://blog.pagerduty.com/failure-friday-at-pagerduty

Page 45: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

What we have learned

Failure Friday

•  Process monitoring is co-mingled with the process running
•  Only localhost checks on service
•  Alerts require an outbound network connection
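The slide lists the problems found, not the fixes. One common remedy for all three, offered here as an assumption and not as what PagerDuty did, is a heartbeat check that runs on a different host: the service reports in periodically, and the watcher alerts when the reports stop. A minimal Python sketch with hypothetical service names and a hypothetical 120-second limit:

```python
from typing import Dict, List


def stale_services(last_heartbeat: Dict[str, float],
                   now: float,
                   max_age_seconds: float = 120.0) -> List[str]:
    """Return services whose most recent heartbeat is older than max_age_seconds."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > max_age_seconds)


# Timestamps are seconds on any shared clock; the values are made up.
heartbeats = {"api": 1000.0, "worker": 800.0}
print(stale_services(heartbeats, now=1010.0))
```

Because the watcher alerts on silence, it still fires when the monitored host has lost its process, its local monitoring agent, or its outbound network path.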

Page 46: Who Watches the Watchmen - Arup Chakrabarti, PagerDuty - DevOpsDays Tel Aviv 2015

Thank you. We are Hiring! pagerduty.com/jobs

@arupchak or [email protected]