Metrics-driven Engineering at Etsy
MIKE [email protected] @mikebrittain
Logs, Graphs, Trends, and Correlations
Making Decisions
How many visitors are using this thing?
Can we deploy that to 100% of our visitors?
Did we make it faster?
Did I just break something?
Q. Who makes the graphs?
A. Well, the Ops team manages the network, racks the servers, installs the monitoring tools, wears the pagers, blah, blah, blah...
(but...) Engineers build the application.
Dev + Ops
Access
Yes No
“Engineers are too busy meeting our product
deadlines.”
Here’s the big secret...
Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)
Logging
Logger::log_error("User login failed. Reason: $msg for $username", "login");
web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
Logster
Forked from ganglia-logtailer...
- Daemon mode (only cron mode)
+ Support for Graphite
+ Simplified parsing scripts
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling
Fatals Errors Warnings
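Turning a wall of log lines into three graphable numbers (fatals, errors, warnings) mostly comes down to counting by level. The following is a minimal Python sketch of that idea, assuming lines shaped like the examples above. It is not Logster's actual parser API; `count_levels`, `to_metrics`, and the `webs.errorLog.<level>` metric names are hypothetical.

```python
import re
from collections import Counter

# Matches lines shaped like the examples on the slide:
#   web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
# Captures the host name and the severity level in the second bracket pair.
LINE_RE = re.compile(r"^(?P<host>\S+) \[[^\]]*\] \[(?P<level>\w+)\]")


def count_levels(lines):
    """Count how many lines were logged at each severity level."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines that do not fit the pattern
            counts[match.group("level")] += 1
    return counts


def to_metrics(counts, prefix="webs.errorLog"):
    """Map level counts to metric names (the naming scheme is an assumption)."""
    return {f"{prefix}.{level}": n for level, n in sorted(counts.items())}
```

Run from cron once a minute over the newly written lines, the resulting dictionary is what would be shipped to the graphing backend.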
StatsD
StatsD::increment("logins.success");
StatsD::timing("gearman.time", $msec);
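Under the hood, calls like these boil down to a short text payload sent over UDP: `name:value|c` for counters and `name:value|ms` for timings. Here is a minimal Python sketch of that wire format. The function names are hypothetical, and the host and port defaults (localhost, 8125) are assumptions to adjust for your own setup.

```python
import socket


def format_counter(name, delta=1):
    """Build a StatsD counter payload, e.g. 'logins.success:1|c'."""
    return f"{name}:{delta}|c"


def format_timing(name, msec):
    """Build a StatsD timing payload, e.g. 'gearman.time:320|ms'."""
    return f"{name}:{msec}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram, no reply expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

UDP is the point of the design: if the collector is down, the application neither blocks nor fails, it just loses a data point.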
90th pct
average
lower
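The three series on the timing graph (90th pct, average, lower) can all be derived from the raw timings collected in one flush interval. The sketch below uses a nearest-rank percentile; that choice is an assumption for illustration and may not match StatsD's own calculation exactly. `summarize_timings` is a hypothetical name.

```python
def summarize_timings(values_ms, pct=90):
    """Summarize one interval of timings as lower, average, and pct-th percentile."""
    if not values_ms:
        raise ValueError("need at least one timing value")
    ordered = sorted(values_ms)
    n = len(ordered)
    # Nearest-rank percentile: smallest value with at least pct% of samples
    # at or below it. Integer ceiling division avoids float rounding issues.
    rank = max(1, (pct * n + 99) // 100)
    return {
        "lower": ordered[0],
        "average": sum(ordered) / n,
        "upper_pct": ordered[rank - 1],
    }
```

Graphing the percentile next to the average keeps a handful of very slow requests visible without letting a single outlier dominate the picture.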
Ad hoc
name value timestamp\n
echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
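The same `name value timestamp\n` line can be sent from any language that can open a TCP socket. A minimal Python sketch follows, assuming a Graphite listener on the plaintext port 2003 as in the shell one-liner; `format_metric` and `send_metric` are hypothetical names.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext line: 'name value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the epoch, like `date +%s`
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(line, host, port=2003):
    """Open a TCP connection, write the line, and close."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, a deploy script could call `send_metric(format_metric("events.deploy.site", 1), "graphite.etsycorp.com")` right after pushing code.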
Trends + Events
target=drawAsInfinite(events.deploy.site)
What Happened?
16,000 metrics in Graphite (plus 32,000 metrics in Ganglia)
Dashboards
Dashboards: Mix & Match
<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>
Hard
$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
$g->showDeploys(true);
echo $g->getDashboardHTML(280, 220);
Easy
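What makes the second version easy is a small helper that assembles the render URL and wraps a thumbnail in a link to the full-size graph. The Python sketch below shows how such a helper might work. It is not the PHP class from the slide; `GraphSketch`, its methods, and the `graphite.example.com` host are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


class GraphSketch:
    """Collects targets and colors, then emits Graphite render URLs."""

    def __init__(self, base_url, title, period="-1hours"):
        self.base_url = base_url.rstrip("/")
        self.title = title
        self.period = period
        self.targets = []
        self.colors = []

    def add_metric(self, target, color):
        """Plot a normal time series."""
        self.targets.append(target)
        self.colors.append(color)

    def add_event(self, target, color):
        """Plot an event (such as a deploy) as a vertical line."""
        self.add_metric(f"drawAsInfinite({target})", color)

    def render_url(self, width, height, hide_legend=False):
        params = [("from", self.period), ("width", width), ("height", height),
                  ("title", self.title), ("yMin", 0)]
        if hide_legend:
            params.append(("hideLegend", 1))
        params += [("target", t) for t in self.targets]
        params.append(("colorList", ",".join(self.colors)))
        return f"{self.base_url}/render?{urlencode(params)}"

    def dashboard_html(self, thumb_width, thumb_height):
        """Thumbnail image that links to an 800x600 version of the same graph."""
        full = escape(self.render_url(800, 600))
        thumb = escape(self.render_url(thumb_width, thumb_height, hide_legend=True))
        return f'<a href="{full}"><img src="{thumb}"></a>'
```

With this in place, a dashboard tile takes a few lines: create a `GraphSketch`, call `add_metric("webs.errorLog.notExist", "#00cc00")`, add deploy events with `add_event`, and print `dashboard_html(280, 220)`.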
20 dashboards by 25 engineers
Application health correlated with events
High-level visibility
Low MTTD
Validation
Confidence
codeascraft.etsy.com
github.com/etsy/statsd
github.com/etsy/logster
bitbucket.org/maplebed/ganglia-logtailer
Q&A
Does this sound like fun? Get in touch with us.