Winning the metrics battle (finally)

These slides, from a presentation at Velocity Europe 2012, describe how the Guardian does metrics and monitoring. The original proposal is at http://velocityconf.com/velocityeu2012/public/schedule/detail/26576 and there is also an article about it at http://www.guardian.co.uk/info/developer-blog/2012/oct/04/winning-the-metrics-battle

Transcript of Winning the metrics battle

Page 1: Winning the metrics battle

Winning the metrics battle (finally)

Page 2: Winning the metrics battle

Winning the metrics battle (finally)

Simon Hildrew

Infrastructure Developer

The Guardian

Nick Satterly

Monitoring Engineer

The Guardian

Page 3: Winning the metrics battle
Page 4: Winning the metrics battle

The metrics battlefield

Page 5: Winning the metrics battle

1,400 2,800

50,000

180,000

Total metrics

Page 6: Winning the metrics battle

5 minutes

every 15 seconds

http://www.flickr.com/photos/ghostsigns/6676069121

http://www.flickr.com/photos/millynet/134071210

Page 7: Winning the metrics battle

developer dashboards

Page 8: Winning the metrics battle

[Chart: physical screens vs. screensaver hacks, axis 0 to 20]

Page 9: Winning the metrics battle

dev

hack

Page 10: Winning the metrics battle

business dashboards

Page 11: Winning the metrics battle

metrics + dashboards = culture change

Page 12: Winning the metrics battle

http://www.flickr.com/photos/chrisjames_taylor/5454315456

Page 13: Winning the metrics battle

Side project

Incremental upgrade

Use off the shelf tool

Pragmatic solution

Done in a year

our approach

➡ Prioritise

➡ Understand the real problem

➡ Question the tools

➡ Be ambitious

➡ Keep learning

Page 14: Winning the metrics battle

Prioritise

Page 15: Winning the metrics battle

drowning in work

http://www.flickr.com/photos/iampeas/246738971

Page 16: Winning the metrics battle

a dedicated monitoring and metrics engineer

Page 17: Winning the metrics battle

Understand the real problem

Page 18: Winning the metrics battle

Urgent issue - current tool end of life

Page 19: Winning the metrics battle

The story so far...

Page 20: Winning the metrics battle

metrics were not helping us solve production outages

Page 21: Winning the metrics battle

ballooning number of applications

Page 22: Winning the metrics battle

but... difficult to instrument applications

Page 23: Winning the metrics battle

T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
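
As a rough illustration of the time-to-fix breakdown on this slide, the sketch below adds up the three phases for a single incident. The `Incident` class, its field names, and the durations are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Durations in minutes. Field names and values are illustrative only.
    detect: float
    diagnose: float
    resolve: float

    def time_to_fix(self) -> float:
        # Time to fix = time to detect + time to diagnose + time to resolve.
        return self.detect + self.diagnose + self.resolve


# Hypothetical incident: detection and resolution are quick, diagnosis dominates.
incident = Incident(detect=10, diagnose=45, resolve=5)
print(incident.time_to_fix())
```

Splitting the total this way shows which phase dominates, and so which phase better metrics would shorten the most.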

Page 24: Winning the metrics battle

inaccessible tools

http://www.flickr.com/photos/kdashy/2678539087

Page 25: Winning the metrics battle

inconsistent data

http://www.flickr.com/photos/sybrenstuvel/2468506922

Page 26: Winning the metrics battle

hypothesising & arguing easier than measuring

http://www.flickr.com/photos/nouqraz/200049988

Page 27: Winning the metrics battle

The ‘right’ thing

• measure everything

• measure frequently

• measure each data point once

• input and output must be open

Page 28: Winning the metrics battle

Question the tools

Page 29: Winning the metrics battle

Brute force?

http://www.flickr.com/photos/epublicist/3546059144

Page 30: Winning the metrics battle

The safe option?

http://www.flickr.com/photos/alicebartlett/2361209195

Page 31: Winning the metrics battle

Unintuitive?

http://www.flickr.com/photos/merlijnhoek/2841785343

Page 32: Winning the metrics battle

http://www.flickr.com/photos/evansville/8953838/

Imposing a flawed model?

Page 33: Winning the metrics battle

Too difficult / no progress?

http://www.flickr.com/photos/ginja_andy/4165849136/

Page 34: Winning the metrics battle

Nagios

• the “IBM” of monitoring tools

• compromise over quantity and frequency of checks

• < insert your criticism of Nagios here >

Page 35: Winning the metrics battle

Zabbix

• metric collection tightly coupled to monitoring tool

• confusing UI with poor visualisation

• needed brute force to make limited API work

Page 36: Winning the metrics battle

The ‘right’ thing

• measure everything

• measure frequently

• measure each data point once

• input and output must be open

Page 37: Winning the metrics battle
Page 38: Winning the metrics battle

don’t compromise

Page 39: Winning the metrics battle

Be ambitious

Page 40: Winning the metrics battle

Throw work away

http://www.flickr.com/photos/mugley/2961131550

Page 41: Winning the metrics battle

Draw your dream

Page 42: Winning the metrics battle

Get as far as you can

http://www.flickr.com/photos/sk8geek/7358702704

Page 43: Winning the metrics battle

graphite

Etsy dashboard

FITB ganglia

network applications hosts

db?

api?

SNMP? syslog?

alerting?

message queue

screens users

Page 44: Winning the metrics battle

Develop missing pieces

http://www.flickr.com/photos/kalexanderson/5969012589

Page 45: Winning the metrics battle

graphite

Etsy dashboard

FITB ganglia

network applications hosts

mongodb elastic search

ganglia alerts

ganglia-api

syslog alerts

SNMP alerts

alerta

message queue

screens users

Page 46: Winning the metrics battle

Guardian Management

https://github.com/guardian/guardian-management

Page 47: Winning the metrics battle

Ganglia API

https://github.com/guardian/ganglia-api

Page 48: Winning the metrics battle

Alerta

https://github.com/guardian/alerta

Page 49: Winning the metrics battle

• Ganglia

• FITB

• Graphite

• Etsy dashboards

• Guardian management: https://github.com/guardian/guardian-management

• Guardian ganglia-api: https://github.com/guardian/ganglia-api

• Guardian alerta: https://github.com/guardian/alerta

Current stack
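
Graphite, one of the tools in this stack, accepts metrics over a simple plaintext protocol: one line per data point in the form `<path> <value> <timestamp>`, sent to its carbon listener (port 2003 by default). The sketch below shows what feeding it a single measurement could look like. The metric name, host, and helper functions are assumptions made for illustration, not the Guardian's code.

```python
import socket
import time


def format_metric(path: str, value: float, timestamp: int) -> bytes:
    # One line of Graphite's plaintext protocol: "<path> <value> <timestamp>" plus newline.
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    # host and port are assumptions; 2003 is carbon's default plaintext port.
    ts = int(time.time()) if timestamp is None else int(timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(format_metric(path, value, ts))


# Hypothetical metric name; printing the line avoids needing a running carbon daemon.
print(format_metric("apps.frontend.requests", 42, 1349308800))
```

Calling `send_metric("apps.frontend.requests", 42)` would push the same data point to a carbon daemon on localhost, if one is running there.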

Page 50: Winning the metrics battle

Keep learning

Page 51: Winning the metrics battle

we are not there yet

Page 52: Winning the metrics battle

Watch the cultural changes

Page 53: Winning the metrics battle

detecting

Page 54: Winning the metrics battle

diagnosis

Page 55: Winning the metrics battle

diagnosis

Page 56: Winning the metrics battle

performance testing

Page 57: Winning the metrics battle

confirmation

Page 58: Winning the metrics battle

#monitoringsucks

Page 59: Winning the metrics battle

➡ Prioritise

➡ Understand the real problem

➡ Question the tools

➡ Be ambitious

➡ Keep learning

Page 60: Winning the metrics battle

tools can change culture

Page 61: Winning the metrics battle

Thank you

Simon Hildrew @sihil

[email protected]

Nick Satterly @nicksatterly

[email protected]

http://github.com/guardian

http://gu.com/p/3ap5f