Just enough web ops for web developers

Posted on 02-Jul-2015

Datadog is monitoring that does not suck. It's metrics friendly, people friendly and developer friendly monitoring. Learn more at https://www.datadoghq.com/

Just Enough WebOps for Developers

Alexis Lê-Quôc @alq http://www.datadoghq.com

Co-founder DATADOG

Datadog is Monitoring that does not suck... as a Service

“Metrics made social”

People-friendly Monitoring

Developer-friendly Monitoring

Dev: 930,000   Ops: 350,000

2010 US figures from BLS

The New Development Equation

Code + + AWS = 3 months 5 minutes

Web Operations?

Cargo cult Operations

Common vocabulary between Dev & WebOps?

Users

SysAdmin

“Come and get it”

“We want root!”

Dev

WebOps

WebOps

and this is what I do

But first an important digression

Product Service

Service = Code + Infrastructure

Service = Product + Access

Provide access

reliable, fast, cheap

24x7 without going crazy

24x7 && !crazy

Development Models

Delivery historically not the focus

Agile Cycle Delivery WebOps Cycle

WebOps

and this is what I do

WebOps Cycle

Dev -> Release -> Measure & Log -> Monitor -> Alert -> Fix || Escalate -> Investigate -> Change -> (Release)

Measure

Purpose
Collect quantitative metrics

Process
Instrument servers
Instrument code
Instrument SaaS deps
Automate collection

Risks
Imprecise metric definition
Manual collection
“What does it mean?”

Tools
System (ganglia, collectd, munin, nagios, etc.)
Code (metrics, statsd)
SaaS (Datadog et al.)
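As an illustration of "Instrument code", here is a minimal sketch of emitting a metric in the plain-text StatsD line format (`name:value|type`) over UDP. The metric names, the helper names `format_statsd` and `send_metric`, and the assumption that a StatsD-compatible daemon listens on 127.0.0.1:8125 are all hypothetical; a real project would more likely use an existing client library.

```python
import socket


def format_statsd(name: str, value, metric_type: str = "c") -> bytes:
    """Build one StatsD datagram, e.g. b"web.request.count:1|c".

    metric_type follows the StatsD convention: "c" counter, "ms" timer, "g" gauge.
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name: str, value, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    payload = format_statsd(name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload
```

Calling `send_metric("web.request.count", 1)` inside a request handler would count requests, and `send_metric("web.request.latency_ms", 87, "ms")` would record a timing. Because the call is a single function, collection is automated rather than manual, and the metric name carries its own definition and unit.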

Log

Purpose
Collect meaningful, timestamped events

Process
All the time
In one place
Access for everyone
Discipline

Risks
TiB of garbage
Non-uniform timestamps
Non-uniform formats

Tools
log4j et al.
syslog et al.
logstash, splunk
+ Logging-as-a-Service
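The two risks "Non-uniform timestamps" and "Non-uniform formats" can be reduced by giving every logger the same formatter. The sketch below uses Python's standard `logging` module and assumes UTC, ISO 8601-style timestamps; the logger name `webapp` and the field order are illustrative choices, not something the slides prescribe.

```python
import logging
import time


def make_formatter() -> logging.Formatter:
    """One line format and one time zone (UTC) for every event."""
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # render timestamps in UTC, not local time
    return formatter


def build_logger(name: str = "webapp") -> logging.Logger:
    """Return a logger that writes uniformly formatted lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(make_formatter())
        logger.addHandler(handler)
    return logger
```

With this in place, `build_logger().info("user_login user_id=%s", 42)` produces a line that starts with a UTC timestamp, which makes events from different hosts comparable once they are shipped to one place.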

Monitor

Purpose
Watch actionable events & metrics

Process
Health of the app?
Which metrics for health?
Compute metrics
Metric domain
Access for everyone
Pretty graphs

Risks
Non-actionable metrics

Tools
graphite, cubism et al.
+ services
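To make "Which metrics for health?" and "Compute metrics" concrete, here is a small sketch that derives one health metric (error rate) from two raw counters and checks it against its acceptable domain. The choice of error rate and the 1% limit are assumptions for illustration; each application has to pick its own.

```python
def error_rate(requests: int, errors: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def is_healthy(requests: int, errors: int, max_error_rate: float = 0.01) -> bool:
    """True while the computed metric stays inside its acceptable domain."""
    return error_rate(requests, errors) <= max_error_rate
```

A computed ratio like this tends to be more actionable than the raw counters: 3 errors means little on its own, but 3 errors in 200 requests (1.5%) is above the assumed 1% limit.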

Alert

Purpose
Bring a human into the loop when the automated fix does not work

Process
Alert on vital monitors
Add new alerts with new monitors
Compute metrics from alerts
Ruthlessly edit

Risks
Too many alerts
Become desensitized
Ignore alerts
App crashes for real
Pendulum swings back

Tools
nagios
+ services
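One possible way to keep alert volume down (the "Too many alerts" risk) is to fire only when a vital metric stays above its threshold for several checks in a row. This is a sketch of that idea, not a technique the slides name; the class name, the threshold and the count of three checks are hypothetical.

```python
from collections import deque


class ConsecutiveBreachAlert:
    """Fire only after `required_breaches` consecutive checks above `threshold`."""

    def __init__(self, threshold: float, required_breaches: int = 3):
        self.threshold = threshold
        # Remember only the most recent `required_breaches` results.
        self.recent = deque(maxlen=required_breaches)

    def observe(self, value: float) -> bool:
        """Record one check; return True when a human should be alerted."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

For example, with a threshold of 0.05 and the readings 0.1, 0.02, 0.1, 0.1, 0.1, only the fifth reading triggers an alert, because it is the first point at which three checks in a row were above the threshold. A single spike never pages anyone.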

Fix || Escalate

Purpose
Fix issue or find someone who can

Process
(fix) capture actions as soon as possible (while or shortly after)
(fix) runbooks
(fix) automate fixes
(escalation) on-call rotation
(escalation) agree on rules

Risks
Burn out

Tools
PagerDuty
Bug tracker
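The two escalation items, "on-call rotation" and "agree on rules", can be written down as code so that nobody has to guess who is paged. The sketch below assumes a weekly rotation and a rule that an unacknowledged alert moves to the next person every 15 minutes; the names, the shift length and the timeout are all hypothetical.

```python
from datetime import date


def on_call(rotation: list, day: date, start: date, shift_days: int = 7) -> str:
    """Primary on-call person for `day`, given a rotation that began on `start`."""
    shifts_elapsed = (day - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


def escalation_target(rotation: list, day: date, start: date,
                      minutes_unacked: int, ack_timeout: int = 15,
                      shift_days: int = 7) -> str:
    """Who should be paged now: the primary first, then the next person in the
    rotation after each `ack_timeout` minutes without acknowledgement."""
    primary_index = ((day - start).days // shift_days) % len(rotation)
    # Never escalate further than once around the rotation.
    level = min(minutes_unacked // ack_timeout, len(rotation) - 1)
    return rotation[(primary_index + level) % len(rotation)]
```

Sharing the load across a rotation with explicit rules is also one way to address the "Burn out" risk, since no single person is implicitly always on call.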

Investigate

Purpose
Collect evidence
Reconstruct what happened

Process
Start where/when problem 1st detected
Work your way from there
Capture relevant graphs/logs

Risks
Missing the starting point
Lagging events/metrics
Low-level events/metrics
Blame game

Tools
Post-mortems
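"Start where/when problem 1st detected" and "Work your way from there" can be sketched as a time-window filter over collected events. The sketch assumes events are `(timestamp, message)` tuples that already share one clock and one time zone; the window sizes are arbitrary illustrative defaults.

```python
from datetime import datetime, timedelta


def events_around(events, detected_at: datetime,
                  before: timedelta = timedelta(minutes=30),
                  after: timedelta = timedelta(minutes=10)):
    """Return events inside the window around first detection, oldest first."""
    window_start = detected_at - before
    window_end = detected_at + after
    in_window = [e for e in events if window_start <= e[0] <= window_end]
    return sorted(in_window, key=lambda e: e[0])
```

The resulting ordered list is a starting timeline for a post-mortem. If metrics or events lag behind real time (one of the risks above), the `after` window may need to be widened so late-arriving evidence is not missed.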

Change

Purpose
Fewer alerts
Better service

Process
Change infrastructure, code
Infrastructure == code
Add/Edit monitors & alerts

Risks
Ad-hoc changes

Tools...
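One reading of "Infrastructure == code" applied to "Add/Edit monitors & alerts" is to keep monitor definitions as data under version control and compute the difference between what is deployed and what is desired, instead of editing monitors by hand. The sketch below assumes monitors are plain dictionaries keyed by name; the function name and the dictionary shape are hypothetical.

```python
def plan_changes(current: dict, desired: dict) -> dict:
    """Compare deployed monitors with the version-controlled definitions."""
    return {
        "add": sorted(desired.keys() - current.keys()),
        "remove": sorted(current.keys() - desired.keys()),
        "update": sorted(
            name for name in desired.keys() & current.keys()
            if desired[name] != current[name]
        ),
    }
```

Because the plan is derived from a reviewed file rather than typed into a console, every change leaves a trace, which is one way to limit the "ad-hoc changes" risk.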

WebOps

and this is what I do

Dev -> Release -> Measure & Log -> Monitor -> Alert -> Fix || Escalate -> Investigate -> Change

Questions? Comments?

@alq