Post on 02-Jul-2015
Just Enough WebOps for Developers
Alexis Lê-Quôc @alq http://www.datadoghq.com
Co-founder DATADOG
Datadog is Monitoring that does not suck... as a Service
“Metrics made social”
People-friendly Monitoring
Developer-friendly Monitoring
Dev: 930,000 Ops: 350,000
2010 US figures from BLS
The New Development Equation
Code + + AWS = 3 months 5 minutes
Web Operations?
Cargo cult Operations
Common vocabulary between Dev & WebOps?
Users
SysAdmin
“Come and get it”
“We want root!”
Dev
WebOps
and this is what I do
But first an important digression
Product Service
Service = Code + Infrastructure
Service = Product + Access
Provide access
reliable, fast, cheap
24x7 without going crazy
24x7 && !crazy
Development Models
Delivery historically not the focus
Agile Cycle Delivery WebOps Cycle
WebOps
and this is what I do
WebOps Cycle
Dev, Release, Measure & Log, Monitor, Alert, Fix || Escalate, Investigate, Change, (Release)
Measure
Purpose: Collect quantitative metrics
Process: Instrument servers; Instrument code; Instrument SaaS deps; Automate collection
Risks: Imprecise metric definition; Manual collection; "What does it mean?"
Tools: System (ganglia, collectd, munin, nagios, etc.); Code (metrics, statsd); SaaS (Datadog et al.)
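As a rough illustration of "Instrument code" and "Automate collection", the sketch below emits metrics in the plain-text StatsD line format (`name:value|type`). The helper names (`format_metric`, `send_metric`, `timed`), the metric names, and the `127.0.0.1:8125` address are assumptions made for this example, not something shown in the talk.

```python
import socket
import time
from contextlib import contextmanager


def format_metric(name, value, metric_type):
    """Render one metric in the StatsD line format: name:value|type."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget the line over UDP. Host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


@contextmanager
def timed(name, emit=send_metric):
    """Time the wrapped block and emit the duration as a timer ("ms") metric."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        emit(format_metric(name, elapsed_ms, "ms"))
```

Wrapping a request handler in `with timed("checkout.latency"):` is one way to make collection automatic rather than manual; a precise, agreed metric name helps with the "What does it mean?" risk.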
Log
Purpose: Collect meaningful, timestamped events
Process: All the time; In one place; Access for everyone; Discipline
Risks: TiB of garbage; Non-uniform timestamps; Non-uniform formats
Tools: log4j et al.; syslog et al.; logstash, splunk; + Logging-as-a-Service
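One possible way to avoid non-uniform timestamps and formats is to configure every logger through a single helper. The sketch below uses Python's standard `logging` module with UTC timestamps; the helper name `build_logger`, the logger name, and the exact field order are assumptions chosen for illustration.

```python
import io
import logging
import time


def build_logger(name, stream):
    """Return a logger that writes one uniform, UTC-timestamped line per event."""
    formatter = logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
    )
    # Format timestamps in UTC so events from different hosts line up.
    formatter.converter = time.gmtime

    handler = logging.StreamHandler(stream)
    handler.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]   # replace any handlers set up earlier
    logger.propagate = False      # avoid duplicate lines via the root logger
    return logger


buffer = io.StringIO()
log = build_logger("billing", buffer)
log.info("invoice sent id=%s", 42)
```

In a real setup the stream would more likely be a syslog or log-shipping handler so that everything ends up in one place.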
Monitor
Purpose: Watch actionable events & metrics
Process: Health of the app?; Which metrics for health?; Compute metrics; Metric domain; Access for everyone; Pretty graphs
Risks: Non-actionable metrics
Tools: graphite, cubism et al.; + services
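To make "Compute metrics" concrete, here is a minimal sketch that derives one health metric, an error rate over the most recent samples, from raw request and error counts. Treating error rate as the health signal, the `ErrorRate` class, and the window size are assumptions for this example; the talk does not prescribe a specific metric.

```python
from collections import deque


class ErrorRate:
    """Error rate over the last `window` samples of (requests, errors)."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def record(self, requests, errors):
        self.samples.append((requests, errors))

    def value(self):
        total = sum(requests for requests, _ in self.samples)
        failed = sum(errors for _, errors in self.samples)
        # No traffic in the window is reported as 0.0 rather than an error.
        return failed / total if total else 0.0


rate = ErrorRate(window=2)
rate.record(requests=100, errors=1)
rate.record(requests=100, errors=3)
rate.record(requests=200, errors=6)   # the first sample falls out of the window
```

A ratio like this is easier to act on than raw counts, because a fixed threshold keeps its meaning as traffic grows.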
Alert
Purpose: Bring human in the loop when automated fix does not work
Process: Alert on vital monitors; Add new alerts with new monitors; Compute metrics from alerts; Ruthlessly edit
Risks: Too many alerts; Become desensitized; Ignore alerts; App crashes for real; Pendulum swings back
Tools: nagios; + services
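"Compute metrics from alerts" and "Ruthlessly edit" could look something like the sketch below: count how often each monitor fired and how often anyone acted on it, then list the noisy ones as candidates for editing. The record shape (`monitor`, `acted_on`), the function names, and the thresholds are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter


def alert_stats(alerts):
    """Per monitor: how often it fired and what share of firings led to action."""
    fired = Counter(a["monitor"] for a in alerts)
    acted = Counter(a["monitor"] for a in alerts if a["acted_on"])
    return {
        monitor: {"fired": count, "acted_ratio": acted[monitor] / count}
        for monitor, count in fired.items()
    }


def noisy_monitors(stats, min_fired=3, max_acted_ratio=0.25):
    """Monitors that fire often but rarely lead to action (assumed thresholds)."""
    return sorted(
        monitor
        for monitor, s in stats.items()
        if s["fired"] >= min_fired and s["acted_ratio"] <= max_acted_ratio
    )


history = [
    {"monitor": "disk_full", "acted_on": True},
    {"monitor": "cpu_spike", "acted_on": False},
    {"monitor": "cpu_spike", "acted_on": False},
    {"monitor": "cpu_spike", "acted_on": False},
    {"monitor": "cpu_spike", "acted_on": False},
    {"monitor": "disk_full", "acted_on": True},
]
stats = alert_stats(history)
```

Reviewing such a list regularly is one way to keep people from becoming desensitized before the app crashes for real.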
Fix || Escalate
Purpose: Fix issue or find someone who can
Process: (fix) capture actions as soon as possible (while or shortly after); (fix) runbooks; (fix) automate fixes; (escalation) on-call rotation; (escalation) agree on rules
Risks: Burn out
Tools: PagerDuty; Bug tracker
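The escalation items above (an on-call rotation plus agreed rules) can be sketched in a few lines. The weekly rotation, the 15-minute escalation step, the names, and both function names are assumptions for illustration; a service such as PagerDuty would normally hold these rules.

```python
from datetime import date


def on_call(rotation, rotation_start, today):
    """Pick the primary on-call person, rotating once per week."""
    weeks_elapsed = (today - rotation_start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


def escalation_target(rotation, primary, minutes_unacknowledged, step_minutes=15):
    """Move to the next person for every `step_minutes` without an acknowledgement."""
    steps = minutes_unacknowledged // step_minutes
    index = (rotation.index(primary) + steps) % len(rotation)
    return rotation[index]


rotation = ["ana", "bo", "cy"]          # hypothetical team
start = date(2015, 6, 1)                # assumed start of the rotation
primary = on_call(rotation, start, date(2015, 6, 10))
```

Writing the rule down as code (or configuration) is one way to make sure everyone has agreed to the same rule, and spreading the load across the rotation speaks to the burn-out risk.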
Investigate
Purpose: Collect evidence; Reconstruct what happened
Process: Start where/when problem 1st detected; Work your way from there; Capture relevant graphs/logs
Risks: Missing the starting point; Lagging events/metrics; Low-level events/metrics; Blame game
Tools: Post-mortems
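A small sketch of "Start where/when problem 1st detected" and "Work your way from there": given timestamped events, keep those in a window around the detection time and order them chronologically. The tuple layout `(timestamp, message)`, the window sizes, and the sample events are made up for this example.

```python
from datetime import datetime, timedelta


def events_around(events, detected_at,
                  before=timedelta(minutes=30), after=timedelta(minutes=10)):
    """Return events within the window around detection, oldest first."""
    earliest = detected_at - before
    latest = detected_at + after
    in_window = (e for e in events if earliest <= e[0] <= latest)
    return sorted(in_window, key=lambda e: e[0])


detected = datetime(2015, 7, 2, 12, 0)
events = [
    (datetime(2015, 7, 2, 12, 5), "alert fired"),
    (datetime(2015, 7, 2, 11, 40), "deploy finished"),
    (datetime(2015, 7, 2, 9, 0), "cron ran"),
]
timeline = events_around(events, detected)
```

This only works if timestamps are uniform across sources, which is why lagging or inconsistently stamped events are listed as a risk.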
Change
Purpose: Fewer alerts; Better service
Process: Change infrastructure, code; Infrastructure == code; Add/Edit monitors & alerts
Risks: Ad-hoc changes
Tools: ...
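One reading of "Infrastructure == code" applied to "Add/Edit monitors & alerts" is to keep monitor definitions as data under version control and compute the difference against what is deployed, instead of making ad-hoc changes. The sketch below assumes monitors are plain dictionaries keyed by name; the function name and the sample definitions are hypothetical.

```python
def plan_changes(desired, deployed):
    """Compare desired monitor definitions with deployed ones and list the changes."""
    desired_names = set(desired)
    deployed_names = set(deployed)
    return {
        "add": sorted(desired_names - deployed_names),
        "remove": sorted(deployed_names - desired_names),
        "update": sorted(
            name
            for name in desired_names & deployed_names
            if desired[name] != deployed[name]
        ),
    }


desired = {
    "error_rate": {"threshold": 0.05},
    "latency_p95": {"threshold": 500},
}
deployed = {
    "error_rate": {"threshold": 0.10},
    "cpu_spike": {"threshold": 0.9},
}
plan = plan_changes(desired, deployed)
```

Because the plan is computed from files that can be reviewed, each change to monitors and alerts leaves a trail.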
WebOps
and this is what I do
Questions? Comments?
@alq