Who Watches the Watchmen? DevOps Days Chicago 2014
Arup Chakrabarti @arupchak or [email protected]
What is PagerDuty?
10/23/14
Ops Guys know all too well...
• Alert and Incident Tracking
• On-Call Management
• Integrates with monitoring tools
• Alert the right person, every time
Why do we care about monitoring?
Oct 2014 US East Outage – Outgoing Traffic
Today’s talk is about:
• What is PagerDuty?
• Philosophies
• Tools
• Security
• Distributed Systems
• Dependency
• How we cheat by using Chef
• Validation
• Q and A
Quick Disclaimer
I did not come up with everything
• I work with smart people
• Slides will be posted
Philosophies
Thou Shalt:
• Use the right tool
• Avoid single host monitoring
• Alert on what customers care about
• Alert on expected values
  • High and Low
• Make it Self-Service
• Validate Alerts Work
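The "alert on expected values, high and low" rule can be sketched as a band check. This is a minimal illustration, not code from the talk: `ExpectedRange`, `check`, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedRange:
    """Hypothetical band of values considered healthy for one metric."""
    low: float
    high: float


def check(metric: str, value: float, expected: ExpectedRange) -> Optional[str]:
    """Return an alert message if the value leaves the band, else None."""
    if value < expected.low:
        return f"{metric} too low: {value} < {expected.low}"
    if value > expected.high:
        return f"{metric} too high: {value} > {expected.high}"
    return None


# Illustrative numbers only: a pipeline that normally sees 50 to 500 events/min.
band = ExpectedRange(low=50, high=500)
print(check("events_per_min", 0, band))    # a silent pipeline trips the low bound
print(check("events_per_min", 120, band))  # inside the band, so no alert
```

The low bound is the part that is easy to forget: a metric that drops to zero often means traffic is not arriving at all.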
Monitoring Tools
New Relic
New Relic Oooooo Graphs
New Relic Stacked lines!
New Relic Lines AND Bars!
New Relic Reports!
New Relic Moar Reports!
New Relic / APMs
Great for small environments
• Pros
  • Great for new stacks
  • Helpful for tracing transactions
  • Gives a lot of data
Problem with APMs
Not a silver bullet
• Cons
  • They can be overly prescriptive
  • They can be hard to tune/customize
  • Gives a lot of data
StatsD / DataDog
All hail self service metrics
• StatsD is the client
• DataDog is the backend
• Super easy to use
  • statsd.gauge(metric_name, val)
  • statsd.counter(metric_name)
  • statsd.histogram(metric_name, val)
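What those three calls put on the wire can be sketched with a tiny UDP client. This is an illustration only: `TinyStatsd` is a hypothetical class, not the Datadog library, and the method names simply mirror the slide. It assumes the common StatsD line format `name:value|type` sent over UDP to port 8125, with `h` being the DogStatsD histogram type.

```python
import socket


class TinyStatsd:
    """Hypothetical minimal StatsD-style client, not the Datadog library."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def encode(metric_name: str, val: float, kind: str) -> str:
        """Build one datagram in the StatsD line format: name:value|type."""
        return f"{metric_name}:{val}|{kind}"

    def _send(self, line: str) -> None:
        # UDP is fire-and-forget, so a missing agent must never break the app.
        try:
            self._sock.sendto(line.encode("ascii"), self._addr)
        except OSError:
            pass

    def gauge(self, metric_name: str, val: float) -> None:
        self._send(self.encode(metric_name, val, "g"))

    def counter(self, metric_name: str, val: int = 1) -> None:
        self._send(self.encode(metric_name, val, "c"))

    def histogram(self, metric_name: str, val: float) -> None:
        # "h" is a DogStatsD extension to the original StatsD types.
        self._send(self.encode(metric_name, val, "h"))
```

Because each metric is a single line in a single datagram, any engineer can add one from application code without touching the monitoring backend, which is what makes it self-service.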
StatsD / DataDog Custom Alerts
StatsD / DataDog
Custom Notifications
• PagerDuty Integration
• Email
• HipChat
StatsD / DataDog
Customize all the things
• Pros
  • Very customizable
  • Can change as you grow
  • Self Service
StatsD / DataDog
Needs some hand holding
• Cons
  • Need to have Configuration Management
  • Hard to ramp teams up
SumoLogic
Logging as Monitoring
• Ship Critical App Logs
  • Engineers set up alerts on patterns
    • “Too many 500’s in the last 10m”
• Somewhat self-service
  • Initial setup is in Chef
• Hard to use for realtime debugging
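The "too many 500's in the last 10m" pattern boils down to a sliding-window count. The sketch below shows that idea in plain Python. It is not SumoLogic query syntax, and `ErrorWindow` and its default threshold are hypothetical. It assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class ErrorWindow:
    """Count HTTP 5xx responses seen inside a sliding time window."""

    def __init__(self, window_seconds: float = 600, threshold: int = 100) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits: Deque[float] = deque()

    def record(self, timestamp: float, status: int) -> bool:
        """Record one request; return True when the alert should fire."""
        if status >= 500:
            self._hits.append(timestamp)
        # Drop errors that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()
        return len(self._hits) > self.threshold
```

For example, `ErrorWindow(window_seconds=600, threshold=2)` fires on the third 500 seen within ten minutes and goes quiet again once those errors age out.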
PagerDuty at PagerDuty?
YES
Simple External Monitoring
Dumb health checks
• Wormly and Monitis
• Simple tools
• Backup alerting
• Very naive in the health checks
• Had to build out smarter health check page
Simple External Monitoring
Dumb health checks made smarter
• Health Check Page
  • Lightly touches internal services
  • Gives back an expected value for each service
  • Alert on non-expected value
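A health check page of this kind might look roughly like the sketch below. It is an assumption about the shape, not PagerDuty's actual page: `build_health_page`, `unexpected`, and the service names are hypothetical. Each probe lightly touches one service and returns a value, and the external checker alerts on anything that differs from what it expects.

```python
import json
from typing import Callable, Dict

Check = Callable[[], str]


def build_health_page(checks: Dict[str, Check]) -> str:
    """Run each lightweight probe and render the results as JSON."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = probe()
        except Exception as exc:  # a failing probe is itself a result
            results[name] = f"error: {exc}"
    return json.dumps(results, sort_keys=True)


def unexpected(page: str, expected: Dict[str, str]) -> Dict[str, str]:
    """Return the services whose value differs from the expected one."""
    actual = json.loads(page)
    return {
        name: actual.get(name, "missing")
        for name, want in expected.items()
        if actual.get(name) != want
    }


def broken_queue() -> str:
    raise RuntimeError("boom")


# Hypothetical services: the database answers, the queue probe fails.
page = build_health_page({"db": lambda: "ok", "queue": broken_queue})
print(unexpected(page, {"db": "ok", "queue": "ok"}))
```

Comparing against an expected value, rather than only checking for an HTTP 200, is what lets a naive external tool catch a page that loads but reports a broken dependency.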
Security Monitoring
Why do security monitoring?
• Audits are tedious
  • Continuous Audits
• Earlier alerting
  • Easier fixing
Security Monitoring
Audits are tedious
• IDS via OSSEC
  • Monitor logs / checksum directories
• Port scanning
  • nmap
• Scrape IPSec data
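The "checksum directories" idea can be illustrated in a few lines. This sketch is not how OSSEC is implemented. It only shows the underlying technique of hashing every file under a directory and comparing against a saved baseline. `snapshot` and `diff` are hypothetical names.

```python
import hashlib
import os
from typing import Dict


def snapshot(root: str) -> Dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            path = os.path.join(dirpath, fname)
            h = hashlib.sha256()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def diff(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Report files that were added, removed, or modified since the baseline."""
    changes = {}
    for path in baseline.keys() | current.keys():
        if path not in current:
            changes[path] = "removed"
        elif path not in baseline:
            changes[path] = "added"
        elif baseline[path] != current[path]:
            changes[path] = "modified"
    return changes
```

Running `snapshot` on a schedule and alerting on a non-empty `diff` turns a periodic manual audit into a continuous one.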
Distributed Systems
The single host does not matter anymore
• Alert on cluster level metrics
  • Overall number of 500’s
  • % of nodes down
  • Overall latency
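A cluster-level check aggregates across nodes before deciding whether to page anyone. The sketch below is a hypothetical example with made-up thresholds. `cluster_alerts` and its parameters are not from the talk.

```python
from typing import Dict, List


def cluster_alerts(
    node_up: Dict[str, bool],
    errors_500: Dict[str, int],
    max_down_fraction: float = 0.25,
    max_total_500: int = 50,
) -> List[str]:
    """Alert on the cluster as a whole, never on one host alone."""
    alerts = []
    if node_up:
        down = [name for name, up in node_up.items() if not up]
        fraction = len(down) / len(node_up)
        if fraction > max_down_fraction:
            alerts.append(f"{fraction:.0%} of nodes down: {', '.join(sorted(down))}")
    total = sum(errors_500.values())
    if total > max_total_500:
        alerts.append(f"{total} HTTP 500s across the cluster")
    return alerts
```

With four nodes and the default 25% limit, one dead node produces no alert, while two dead nodes do. A single noisy host only matters once it moves the overall numbers.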
Avoid Single Host Alerts
Cron jobs should not be used for creating alerts
[Diagram: PagerDuty; US West 1, US West 2, Linode; Monitoring System]
Same model for service alerting
[Diagram: PagerDuty; Service A, Service B, Service C; Monitoring System]
Dependency Monitoring
Stuff that you do not control
• Dependencies Everywhere
  • Operations
    • DNS
    • Monitoring Tools
    • Logging
Dependency Monitoring
Stuff that you do not control
• How to monitor?
  • Operations
    • DNS -> Create/Delete records
    • Monitoring Tools -> Basic ping
    • Logging -> Validate that logs are being pushed
    • Status Pages
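These dependency checks can be treated as a set of small probes that each answer yes or no. The sketch below is a hypothetical harness: `logs_are_fresh`, `failing_dependencies`, and the five-minute limit are assumptions. The real probes (creating and deleting a DNS record, pinging a monitoring tool) would call each provider's own API.

```python
import time
from typing import Callable, Dict, List, Optional


def logs_are_fresh(
    last_seen_epoch: float,
    max_age_seconds: float = 300,
    now: Optional[float] = None,
) -> bool:
    """True if the newest log line reached the log provider recently enough."""
    now = time.time() if now is None else now
    return (now - last_seen_epoch) <= max_age_seconds


def failing_dependencies(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every probe; a probe that raises counts as a failure."""
    failed = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return sorted(failed)
```

Treating an exception as a failure matters here: when a third party is down, its client library is more likely to raise than to return a clean "false".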
Dependency Monitoring
What keeps us up at night
• Dependencies Everywhere
  • Software
    • Email
    • SMS
    • Phone
    • Push Notifications
Quick Story
When SMS providers screw us over
• Primary SMS provider was “Up”
• Customer was not getting their SMS
• Found out in the worst way possible
  • Customer called us
• Provider was working but T-Mobile prepaid was not passing our short code through
End to End testing
aka how to abuse unlimited messaging plans
• Every minute we send an SMS alert
  • Every SMS provider we use
  • Main Carriers
    • Verizon
    • AT&T
    • T-Mobile
    • Sprint
• Measure Response times
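One probe in such an end-to-end test could be sketched as follows. This is an assumption about the shape, not PagerDuty's implementation: `send` and `was_received` are hypothetical stand-ins for a provider's send API and for whatever reads messages off the test phone, and the timeout is made up.

```python
import time
from typing import Callable, Optional


def measure_delivery(
    send: Callable[[], str],
    was_received: Callable[[str], bool],
    timeout_seconds: float = 120.0,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Send one test message and return seconds until it arrives.

    Returns None if the message does not arrive before the timeout,
    which is the case that should raise an alert.
    """
    started = clock()
    message_id = send()
    while True:
        if was_received(message_id):
            return clock() - started
        if clock() - started >= timeout_seconds:
            return None
        sleep(poll_seconds)
```

Running one probe per provider and carrier pair every minute gives both an up/down signal and a response-time series. The injectable `clock` and `sleep` exist only so the function can be tested without waiting.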
“Device Lab”
It looks more official now
Some stats (Averages)
Sorry, cannot tell you which carrier is which
• Carrier A: 15 seconds
• Carrier B: 5 seconds
• Carrier C: 15 seconds
• Carrier D: 50 seconds
Very Spiky
How to cheat with Chef
Automate all the things
• Install all the agents
  • New Relic
  • DataDog – Easy alerts as well
  • SumoLogic
  • OSSEC
• Backup alerts are not automated
• Cluster alert setup is not automated
How to Validate
Failure Friday
• We attack our own services
  • Process failure
  • Datacenter failure
  • Network failure
• https://blog.pagerduty.com/failure-friday-at-pagerduty
What we have learned
Failure Friday
• Process monitoring is co-mingled with the process running
• Only localhost checks on service
• Alerts require an outbound network connection
Thank you. We are Hiring! pagerduty.com/jobs
@arupchak or [email protected]