A Whirlwind Tour of Etsy's Monitoring Stack

Daniel Schauenberg [email protected] @mrtazz

description

It's no secret that at Etsy we are big fans of small, incremental, frequent changes and tight feedback loops. This is how we deploy changes to our main codebase more than 50 times a day and safely apply changes to our infrastructure continuously. It lets us rapidly fix bugs and roll out features across our application stack and infrastructure. None of this would be possible without a myriad of monitoring tools that keep us informed about changes and possible problems in every nook and cranny of the Etsy stack, whether that is a network change event, system- or application-level performance, or how bad the last week of on-call rotation was.

Transcript of A Whirlwind Tour of Etsy's Monitoring Stack

Page 1: A Whirlwind Tour of Etsy's Monitoring Stack

A Whirlwind Tour of Etsy's Monitoring Stack

Daniel Schauenberg

[email protected]

@mrtazz

Page 2: A Whirlwind Tour of Etsy's Monitoring Stack
Page 3: A Whirlwind Tour of Etsy's Monitoring Stack

Page 4: A Whirlwind Tour of Etsy's Monitoring Stack

Page 5: A Whirlwind Tour of Etsy's Monitoring Stack

Item by TheBackPackShoppe

Page 6: A Whirlwind Tour of Etsy's Monitoring Stack

How comfortable are you deploying a change right now?

Page 7: A Whirlwind Tour of Etsy's Monitoring Stack

“If this is your first day at Etsy, you deploy the site”

Page 8: A Whirlwind Tour of Etsy's Monitoring Stack
Page 9: A Whirlwind Tour of Etsy's Monitoring Stack
Page 10: A Whirlwind Tour of Etsy's Monitoring Stack

Ganglia

• System level metrics

• Instance per DC/environment

• > 220k RRD files

• Fully configured through Chef role attributes

Page 11: A Whirlwind Tour of Etsy's Monitoring Stack

Rainbow Graphs!

Page 12: A Whirlwind Tour of Etsy's Monitoring Stack

StatsD

• Single instance on one server

• Traffic mostly from 70 Web & 24 API servers

• Node.js

• Heavy Sampling

• Graphite as backend
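
The "heavy sampling" bullet can be illustrated with a minimal Python sketch of a sampled StatsD counter. It assumes the standard StatsD line format (`name:value|c|@rate`) over UDP on the default port 8125; the metric names, the helper names and the 10% rate are hypothetical and not taken from the talk.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter datagram such as 'web.requests:1|c|@0.1'."""
    payload = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # The rate travels with the metric so the server can scale the count back up.
        payload += f"|@{sample_rate}"
    return payload


def send_counter(name, value=1, sample_rate=1.0,
                 host="127.0.0.1", port=8125, rng=random.random):
    """Send a counter with probability sample_rate; return True if a packet was sent."""
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False  # dropped by sampling, so no network traffic at all
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP is fire-and-forget: the caller never waits for StatsD.
        datagram = format_counter(name, value, sample_rate).encode("ascii")
        sock.sendto(datagram, (host, port))
    finally:
        sock.close()
    return True
```

With a rate of 0.1, roughly one call in ten produces a packet, which is how a single StatsD instance can plausibly keep up with traffic from many web and API servers.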

Page 13: A Whirlwind Tour of Etsy's Monitoring Stack

Page 14: A Whirlwind Tour of Etsy's Monitoring Stack

Graphite

• Application level metrics

• 96G RAM, 20 Cores, 7.3T SSD RAID 10

• 525k metrics/minute

• Mirrored Master/Master Setup

• Functionally sharded relays

Page 15: A Whirlwind Tour of Etsy's Monitoring Stack

[Diagram: a CNAME in front of two sets of relays and caches. Relay shard labels: statsdtimers, statsdcounts, statsd, chef, logster, fqld, search, generic]
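
One way to read "functionally sharded relays" is that each metric is routed to a relay according to what produced it. The Python sketch below shows that idea under stated assumptions: the shard names come from the diagram, but the prefix rules, the fallback to `generic` and the helper names are hypothetical, and this is not Graphite's own relay code. The datapoint format is Graphite's plaintext protocol (path, value, Unix timestamp, newline).

```python
import time

# Shard names are taken from the diagram; every prefix below is an assumption.
SHARD_RULES = [
    ("stats.timers.", "statsdtimers"),
    ("stats_counts.", "statsdcounts"),
    ("stats.", "statsd"),
    ("chef.", "chef"),
    ("logster.", "logster"),
    ("fqld.", "fqld"),
    ("search.", "search"),
]
DEFAULT_SHARD = "generic"


def pick_shard(metric_path):
    """Return the shard for a metric; the first matching prefix wins."""
    for prefix, shard in SHARD_RULES:
        if metric_path.startswith(prefix):
            return shard
    return DEFAULT_SHARD


def plaintext_line(metric_path, value, timestamp=None):
    """Format one datapoint for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} {value} {timestamp}\n"
```

Note that the more specific `stats.timers.` rule has to come before the broader `stats.` rule, otherwise timers would land on the general statsd shard.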

Page 16: A Whirlwind Tour of Etsy's Monitoring Stack

Page 17: A Whirlwind Tour of Etsy's Monitoring Stack

Page 18: A Whirlwind Tour of Etsy's Monitoring Stack

Syslog-Ng

• Web, Search, Gearman, Photos, Nagios, Network, VPN

• 1.2GB written/minute

• Chef role attribute based config

• Rule ordering!
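
The slide does not say why rule ordering matters, but a common reason in log routing is that rules are evaluated top to bottom and some of them stop further processing. The Python sketch below models that behaviour generically; it is not syslog-ng configuration, and the patterns, file paths and the `final` flag semantics are assumptions made for illustration.

```python
import re

# Hypothetical rules: (pattern on the program name, destination file, final?)
RULES = [
    (re.compile(r"^nagios"), "/var/log/remote/nagios.log", True),
    (re.compile(r"^(httpd|php)"), "/var/log/remote/web.log", True),
    (re.compile(r".*"), "/var/log/remote/catchall.log", False),
]


def route(program, rules=RULES):
    """Return the destinations a message from `program` is written to."""
    destinations = []
    for pattern, destination, final in rules:
        if pattern.search(program):
            destinations.append(destination)
            if final:
                break  # a final rule stops evaluation of later rules
    return destinations
```

With the catch-all last, a Nagios message goes only to its own file. Reverse the list and the same message is also written to the catch-all, which at more than a gigabyte per minute would be an expensive mistake.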

Page 19: A Whirlwind Tour of Etsy's Monitoring Stack
Page 20: A Whirlwind Tour of Etsy's Monitoring Stack

github.com/etsy/logster

• Extract metrics from log files

• Written in Python

• Runs every minute via cron
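
To make "extract metrics from log files" concrete, here is a standalone Python sketch of the idea: scan the lines written since the last run, count HTTP status classes, and turn the counts into per-second rates for a one-minute interval. It deliberately does not use logster's own parser API; the log format, the regular expression and the metric names are assumptions.

```python
import re
from collections import Counter

# Assumed format: Apache-style access log lines where the status code
# follows the quoted request, e.g. '... "GET / HTTP/1.1" 200 512'.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines):
    """Count responses per status class (http_2xx, http_5xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


def to_rates(counts, interval_seconds=60):
    """Convert raw counts into per-second rates for one cron interval."""
    return {name: count / interval_seconds for name, count in counts.items()}
```

A real run would also remember how far into the file it has already read, so that each minute only new lines are parsed; that bookkeeping is left out here.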

Page 21: A Whirlwind Tour of Etsy's Monitoring Stack

Splunk

• Indexes all of our log files

• Easy search for patterns

• Saved searches for interesting ones

• Basically using it as a glorified grep

Page 22: A Whirlwind Tour of Etsy's Monitoring Stack

Logstash

• Experiment status

• Makes it easier to integrate different sources

• Easy to set up in dev environment

• Trying to figure out where/how it fits into our infrastructure

Page 23: A Whirlwind Tour of Etsy's Monitoring Stack

Eventinator

• Tracks all events in our infrastructure

• Chef runs and changes

• DNS changes

• Network

• Deploys

• Server provisioning and decommissioning

• ~ 12 million events in the last 2 years
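
Eventinator is an internal tool and the talk does not show its data model, so the following Python sketch is purely illustrative. It assumes each event carries a timestamp, a kind matching the bullet list above, a source and a message, and shows the kind of query such a store makes possible: "what changed in the ten minutes before this graph went wrong?" All field and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    kind: str      # e.g. "chef", "dns", "network", "deploy", "provisioning"
    source: str    # host or user that triggered the event (hypothetical field)
    message: str


def events_near(events, moment, window=timedelta(minutes=10), kinds=None):
    """Return events in the `window` leading up to `moment`, oldest first."""
    start = moment - window
    selected = [
        e for e in events
        if start <= e.timestamp <= moment and (kinds is None or e.kind in kinds)
    ]
    return sorted(selected, key=lambda e: e.timestamp)
```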

Page 24: A Whirlwind Tour of Etsy's Monitoring Stack

Page 25: A Whirlwind Tour of Etsy's Monitoring Stack

Chef

• Rules everything around me

• Same cookbooks on prod and dev

• Every node runs Chef every 10 minutes

• A ton of knife plugins and handlers

Page 26: A Whirlwind Tour of Etsy's Monitoring Stack

Page 27: A Whirlwind Tour of Etsy's Monitoring Stack

> 120 recipes

Page 28: A Whirlwind Tour of Etsy's Monitoring Stack

Page 29: A Whirlwind Tour of Etsy's Monitoring Stack

Nagios

Page 30: A Whirlwind Tour of Etsy's Monitoring Stack

Nagios

• 2 instances in each DC/environment

• Fully Chef generated configuration

• Service checks and contacts in git

• Notifications via email->SMS gateway

• ~75% ops on-call

Page 31: A Whirlwind Tour of Etsy's Monitoring Stack

github.com/lozzd/nagdash

Page 32: A Whirlwind Tour of Etsy's Monitoring Stack

Page 33: A Whirlwind Tour of Etsy's Monitoring Stack

Page 34: A Whirlwind Tour of Etsy's Monitoring Stack

Page 35: A Whirlwind Tour of Etsy's Monitoring Stack

Nagios Herald

• Add context to Nagios alerts

• What are the first 5 things you do when you get paged?

• You already have the phone in your hand

• Nagios notification handler
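
The idea of adding context to an alert can be sketched in a few lines of Python. This is not Nagios Herald's API; it assumes the handler receives the standard Nagios environment macros (`NAGIOS_HOSTNAME`, `NAGIOS_SERVICEDESC`, `NAGIOS_SERVICESTATE`, `NAGIOS_SERVICEOUTPUT`) and that each "first thing you would check" is a small function returning extra lines. The provider shown is a hypothetical placeholder.

```python
def build_notification(env, context_providers):
    """Combine the raw alert with whatever context the providers can add."""
    host = env.get("NAGIOS_HOSTNAME", "unknown-host")
    service = env.get("NAGIOS_SERVICEDESC", "unknown-service")
    state = env.get("NAGIOS_SERVICESTATE", "UNKNOWN")
    output = env.get("NAGIOS_SERVICEOUTPUT", "")
    lines = [f"{state}: {service} on {host}", output]
    for provider in context_providers:
        try:
            lines.extend(provider(host, service))
        except Exception as exc:
            # Context is a bonus; a failing lookup must never block the page.
            lines.append(f"(context unavailable: {exc})")
    return "\n".join(line for line in lines if line)


def recent_deploys(host, service):
    """Hypothetical provider; a real one would query a deploy log."""
    return ["Last deploy: (lookup not implemented in this sketch)"]
```

In a real handler `env` would be `os.environ`, and the result would be handed to whatever sends the email or SMS.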

Page 36: A Whirlwind Tour of Etsy's Monitoring Stack

Page 37: A Whirlwind Tour of Etsy's Monitoring Stack

The Toys are real

Page 38: A Whirlwind Tour of Etsy's Monitoring Stack

There’s another side of heaven

Page 39: A Whirlwind Tour of Etsy's Monitoring Stack

Ops Weekly

Page 40: A Whirlwind Tour of Etsy's Monitoring Stack

Ops Weekly

Page 41: A Whirlwind Tour of Etsy's Monitoring Stack

Summary

• Set of trusted tools

• Enhance where they come short

• Try out new things

• Write tools where applicable

• Continuous monitoring and adaptation

Page 42: A Whirlwind Tour of Etsy's Monitoring Stack

codeascraft.com

etsy.com/codeascraft/talks

etsy.github.com

etsy.com/careers

Page 43: A Whirlwind Tour of Etsy's Monitoring Stack

Questions?

Page 44: A Whirlwind Tour of Etsy's Monitoring Stack

A Whirlwind Tour of Etsy's Monitoring Stack

Daniel Schauenberg

[email protected]

@mrtazz