A Whirlwind Tour of Etsy's Monitoring Stack
Daniel Schauenberg
@mrtazz
Item by TheBackPackShoppe
How comfortable are you deploying
a change right now?
“If this is your first day at Etsy, you deploy the site”
Ganglia
• System level metrics
• Instance per DC/environment
• > 220k RRD files
• Fully configured through Chef role attributes
Rainbow Graphs!
StatsD
• Single instance on one server
• Traffic mostly from 70 Web & 24 API servers
• Node.js
• Heavy Sampling
• Graphite as backend
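"Heavy sampling" here means clients send only a fraction of events and tag each packet with the sample rate, so StatsD can scale the count back up. A minimal Python sketch of that client-side behaviour, assuming the standard StatsD counter line format (`name:value|c|@rate`); the host, port, and metric names are hypothetical placeholders, not Etsy's actual client code:

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter line, e.g. 'web.requests:1|c|@0.1'."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # The rate suffix tells the server how to scale the count back up.
        line += f"|@{sample_rate}"
    return line


def send_counter(name, value=1, sample_rate=1.0,
                 host="127.0.0.1", port=8125, rng=random.random):
    """Send a counter over UDP, skipping packets according to sample_rate.

    Returns True if a packet was sent, False if it was sampled out.
    host and port are assumed defaults for illustration only.
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False  # sampled out: nothing goes on the wire
    payload = format_counter(name, value, sample_rate).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return True
```

With `sample_rate=0.1`, roughly one call in ten produces a UDP packet, which is what keeps a single StatsD instance viable behind many web and API servers.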
Graphite
• Application level metrics
• 96G RAM, 20 Cores, 7.3T SSD RAID 10
• 525k metrics/minute
• Mirrored Master/Master Setup
• Functionally sharded relays
[Diagram: Graphite architecture. Labels: CNAME; relays (×2); caches (×2); relay shards statsdtimers, statsdcounts, statsd, chef, logster, fqld, search, generic]
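Functional sharding means each relay handles one class of metrics instead of a hash-based slice of all of them. A rough sketch of prefix-based routing using the shard names from the diagram: the metric prefixes and the first-match policy are assumptions for illustration, not Etsy's actual relay configuration. The second helper shows Graphite's plaintext line format (`path value timestamp`).

```python
# Hypothetical routing table: metric-path prefix -> relay shard name.
# Shard names come from the slide; the prefixes are assumed.
ROUTES = [
    ("stats.timers.", "statsdtimers"),
    ("stats_counts.", "statsdcounts"),
    ("stats.", "statsd"),
    ("chef.", "chef"),
    ("logster.", "logster"),
    ("fqld.", "fqld"),
    ("search.", "search"),
]
DEFAULT_RELAY = "generic"


def pick_relay(metric_path):
    """Return the shard for a metric; the first matching prefix wins."""
    for prefix, relay in ROUTES:
        if metric_path.startswith(prefix):
            return relay
    return DEFAULT_RELAY


def plaintext_line(metric_path, value, timestamp):
    """Format one datapoint in Graphite's plaintext protocol."""
    return f"{metric_path} {value} {int(timestamp)}\n"
```

Because the first match wins, the more specific `stats.timers.` rule has to come before the broader `stats.` rule.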
Syslog-ng
• Web, Search, Gearman, Photos, Nagios, Network, VPN
• 1.2GB written/minute
• Chef role attribute based config
• Rule ordering!
github.com/etsy/logster
• Extract metrics from log files
• Written in Python
• Runs every minute via cron
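The idea behind this kind of tool fits in a few lines: read the log lines from the last interval, count what matters, and emit one value per metric. The example below is an illustrative stand-in, not logster's actual parser interface; the access-log format (a three-digit status code right after the quoted request) and the metric names are assumptions.

```python
import re
from collections import Counter

# Matches the status code that follows the quoted request in common log format.
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count HTTP responses per status class (2xx, 4xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


def to_graphite(counts, prefix, timestamp):
    """Render the counts as Graphite plaintext lines, sorted by name."""
    return [f"{prefix}.{name} {value} {int(timestamp)}"
            for name, value in sorted(counts.items())]
```

Run from cron every minute against the newly written lines, a parser like this turns a log file into a handful of Graphite datapoints.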
Splunk
• Indexes all of our log files
• Easy search for patterns
• Saved searches for interesting ones
• Basically using it as a glorified grep
Logstash
• Experiment status
• Makes it easier to integrate different sources
• Easy to set up in dev environment
• Trying to figure out where/how it fits into our infrastructure
Eventinator
• Tracks all events in our infrastructure
• Chef runs and changes
• DNS changes
• Network
• Deploys
• Server provisioning and decommissioning
• ~ 12 million events in the last 2 years
Chef
• Rules everything around me
• Same cookbooks on prod and dev
• Every node runs Chef every 10 minutes
• A ton of knife plugins and handlers
> 120 recipes
Nagios
• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call
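"Fully Chef generated configuration" means service checks live as data and are rendered into Nagios object definitions. Etsy does this with Chef; the Python sketch below only illustrates the rendering idea. The function name, host name, and check command are hypothetical, while the `define service { ... }` block and its directives are standard Nagios object syntax.

```python
def render_service(host_name, description, check_command, contact_groups):
    """Render one Nagios service object from plain data (illustrative only)."""
    directives = [
        ("host_name", host_name),
        ("service_description", description),
        ("check_command", check_command),
        ("contact_groups", ",".join(contact_groups)),
    ]
    # Left-align directive names in a fixed-width column for readability.
    body = "\n".join(f"    {key:<24}{value}" for key, value in directives)
    return f"define service {{\n{body}\n}}\n"


if __name__ == "__main__":
    # Hypothetical check definition, as it might be stored in git.
    print(render_service("web01", "HTTP", "check_http", ["ops"]))
```

Keeping the checks as data in git and generating the files means a new check is a reviewed change rather than a hand edit on the Nagios host.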
Nagios Herald
• Add context to Nagios alerts
• What are the first 5 things you do when you get paged?
• You already have the phone in your hand
• Nagios notification handler
The Toys are real
There’s another side of heaven
Ops Weekly
Summary
• Set of trusted tools
• Enhance them where they fall short
• Try out new things
• Write tools where applicable
• Continuous monitoring and adaptation
codeascraft.com etsy.com/codeascraft/talks
etsy.github.com etsy.com/careers
Questions?