Winning the metrics battle
Winning the metrics battle (finally)
Simon Hildrew
Infrastructure Developer
The Guardian
Nick Satterly
Monitoring Engineer
The Guardian
The metrics battlefield
1,400 2,800
50,000
180,000
Total metrics
5 minutes
every 15 seconds
http://www.flickr.com/photos/ghostsigns/6676069121
http://www.flickr.com/photos/millynet/134071210
developer dashboards
[Chart, axis 0 to 20: Physical screens, Screensaver hacks; labels: dev, hack]
business dashboards
metrics + dashboards = culture change
http://www.flickr.com/photos/chrisjames_taylor/5454315456
Side project
Incremental upgrade
Use off the shelf tool
Pragmatic solution
Done in a year
our approach
➡ Prioritise
➡ Understand the real problem
➡ Question the tools
➡ Be ambitious
➡ Keep learning
Prioritise
drowning in work
http://www.flickr.com/photos/iampeas/246738971
a dedicated monitoring and metrics engineer
Understand the real problem
Urgent issue - current tool end of life
The story so far...
metrics were not helping us solve production outages
ballooning number of applications
but... difficult to instrument applications
T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
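A minimal sketch of this decomposition in Python, reading "T.T." as "time to" (an assumption; the slide does not spell it out). The `Incident` class, its field names, and the durations are hypothetical and only illustrate how faster detection and diagnosis shorten the overall time to fix.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Durations in minutes for one outage (hypothetical example values)."""
    detect: float    # outage start until someone notices
    diagnose: float  # noticing until the cause is understood
    resolve: float   # understanding the cause until service is restored

    def time_to_fix(self) -> float:
        # T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
        return self.detect + self.diagnose + self.resolve


# Hypothetical incident where detection and diagnosis dominate the total.
before = Incident(detect=20, diagnose=45, resolve=10)
# The same incident if better metrics halved detection and diagnosis time.
after = Incident(detect=10, diagnose=22.5, resolve=10)

print(before.time_to_fix(), after.time_to_fix())
```

In this made-up case the resolve step is unchanged, yet the total falls by more than 40%, which is the argument for investing in metrics that speed up detection and diagnosis.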
inaccessible tools
http://www.flickr.com/photos/kdashy/2678539087
inconsistent data
http://www.flickr.com/photos/sybrenstuvel/2468506922
hypothesising & arguing easier than measuring
http://www.flickr.com/photos/nouqraz/200049988
The ‘right’ thing
• measure everything
• measure frequently
• measure each data point once
• input and output must be open
Question the tools
Brute force?
http://www.flickr.com/photos/epublicist/3546059144
The safe option?
http://www.flickr.com/photos/alicebartlett/2361209195
Unintuitive?
http://www.flickr.com/photos/merlijnhoek/2841785343
http://www.flickr.com/photos/evansville/8953838/
Imposing a flawed model?
Too difficult / no progress?
http://www.flickr.com/photos/ginja_andy/4165849136/
Nagios
• the “IBM” of monitoring tools
• compromise over quantity and frequency of checks
• < insert your criticism of Nagios here >
Zabbix
• metric collection tightly coupled to monitoring tool
• confusing UI with poor visualisation
• needed brute force to make limited API work
The ‘right’ thing
• measure everything
• measure frequently
• measure each data point once
• input and output must be open
don’t compromise
Be ambitious
Throw work away
http://www.flickr.com/photos/mugley/2961131550
Draw your dream
Get as far as you can
http://www.flickr.com/photos/sk8geek/7358702704
graphite
Etsy dashboard
FITB ganglia
network applications hosts
db?
api?
SNMP? syslog?
alerting?
message queue
screens users
Develop missing pieces
http://www.flickr.com/photos/kalexanderson/5969012589
graphite
Etsy dashboard
FITB ganglia
network applications hosts
mongodb elastic search
ganglia alerts
ganglia-api
syslog alerts
SNMP alerts
alerta
message queue
screens users
Guardian Management
https://github.com/guardian/guardian-management
Ganglia API
https://github.com/guardian/ganglia-api
Alerta
https://github.com/guardian/alerta
Current stack
• Ganglia
• FITB
• Graphite
• Etsy dashboards
• Guardian management https://github.com/guardian/guardian-management
• Guardian ganglia-api https://github.com/guardian/ganglia-api
• Guardian alerta https://github.com/guardian/alerta
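Graphite, one of the tools in the current stack, accepts metrics over a plaintext line protocol: one `path value timestamp` line per data point, by default on TCP port 2003. That openness on the input side is what makes application instrumentation cheap. A minimal Python sketch, assuming a Carbon listener on localhost; the metric path and the helper names `format_metric` and `send_metric` are hypothetical and not taken from the Guardian's code.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one formatted line to a Carbon plaintext listener (assumed host/port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example usage (requires a running Carbon listener, so it is left commented out):
# send_metric(format_metric("apps.example.requests", 42))
```

Because the format is a single text line, any application or script that can open a socket can publish a metric without a client library.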
Keep learning
we are not there yet
Watch the cultural changes
detecting
diagnosis
performance testing
confirmation
#monitoringsucks
➡ Prioritise
➡ Understand the real problem
➡ Question the tools
➡ Be ambitious
➡ Keep learning
tools can change culture
Thank you
Simon Hildrew @sihil
Nick Satterly @nicksatterly
http://github.com/guardian
http://gu.com/p/3ap5f