Graph everything

Graph everything! Oliver Hankeln / gutefrage.net, Saturday, 21 September 2013

description

I gave this presentation on metrics, graphing and the surrounding cultural issues at Monitorama 2013 in Berlin. If you want to know how we at gutefrage.net store our metrics, this is the talk for you.

Transcript of Graph everything

Page 1: Graph everything

Graph everything! Oliver Hankeln / gutefrage.net

Page 2: Graph everything

Who am I?

Senior Engineer - Data and Infrastructure at gutefrage.net GmbH

Previously worked in software development

DevOps advocate

Page 3: Graph everything

Who is Gutefrage.net?

Germany's biggest Q&A platform

#1 German site (mobile), about 5M unique users

#3 German site (desktop), about 17M unique users

> 4 million page impressions (PI) per day

Part of the Holtzbrinck group

Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)

Page 4: Graph everything

Flight AB6188

Page 5: Graph everything

What you will get

How do we store our metrics?

Our experiences with that setup

Why the hell are we doing that?

Some thoughts on metrics

Page 6: Graph everything

How we store our metrics

Page 7: Graph everything

Our requirements

Creating new metrics has to be simple

No compaction (bye-bye, RRDtool)

System has to scale

Page 8: Graph everything

OpenTSDB

Written at StumbleUpon, but open source

Uses HBase as storage

Distributed system (multiple TSDs)
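
To make the data model concrete: each OpenTSDB data point is a metric name, a Unix timestamp, a value and at least one tag, and a TSD accepts such points through a line-based `put` command. The sketch below only builds that line; the metric name and tags are made up for illustration and are not taken from the gutefrage.net setup.

```python
def format_put_line(metric, timestamp, value, tags):
    """Build one line for OpenTSDB's text-based 'put' command.

    Layout: put <metric> <unix timestamp> <value> <tag=value> [<tag=value> ...]
    """
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    # Sort tags so the same input always yields the same line.
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_part}"


# Hypothetical metric and tags, purely for illustration.
line = format_put_line("web.requests.count", 1379721600, 42,
                       {"host": "web01", "dc": "berlin"})
print(line)
```

Metric name plus tag set identify one time series, so adding a new metric is just a matter of sending a line with a new name.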

Page 9: Graph everything

The ecosystem

The app feeds metrics in via RabbitMQ

We base Icinga checks on the metrics

We are evaluating Etsy's Skyline for anomaly detection

We deploy sensors via Chef
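
As a rough illustration of basing a check on a metric: Icinga runs Nagios-style plugins, which report their state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output, optionally followed by performance data after a `|`. The sketch below assumes the latest value has already been fetched from the metrics store; the function name, the label and the thresholds are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING",
                CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_metric(label, value, warn, crit):
    """Return (exit_code, output_line) for one metric reading.

    `value` is None when no data point could be fetched.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - no data for {label}"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Everything after '|' is performance data: label=value;warn;crit
    output = (f"{STATUS_NAMES[code]} - {label} is {value} "
              f"| {label}={value};{warn};{crit}")
    return code, output


# Hypothetical reading; a real plugin would print the line
# and then call sys.exit(code).
code, output = check_metric("db_queries_per_pageview", 42, warn=30, crit=60)
print(output)
```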

Page 10: Graph everything

Our experiences

Page 11: Graph everything

What works well

We store about 200M data points in several thousand time series with no issues

tcollector decouples measurement from storage

Creating new metrics is really easy
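
A minimal sketch of why new metrics are cheap with tcollector: a collector is just a program that writes `metric timestamp value tag=value` lines to stdout, and tcollector takes care of shipping them to a TSD. The metric name, the tag and the `read_value` callable below are placeholders, not real sensors from this setup.

```python
import sys
import time


def format_line(metric, value, tags=None, now=None):
    """Format one data point the way a tcollector collector prints it."""
    ts = int(time.time() if now is None else now)
    tag_part = "".join(f" {k}={v}" for k, v in sorted((tags or {}).items()))
    return f"{metric} {ts} {value}{tag_part}"


def run_collector(read_value, metric, tags=None, interval=15,
                  iterations=None, out=sys.stdout):
    """Sample read_value() every `interval` seconds and print each reading.

    iterations=None runs forever, as a long-lived collector would.
    """
    done = 0
    while iterations is None or done < iterations:
        out.write(format_line(metric, read_value(), tags) + "\n")
        out.flush()  # flush so the reading is forwarded without delay
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval)


# Two readings of a made-up sensor, without waiting between them.
run_collector(lambda: 0.42, "app.queue.depth", {"queue": "mail"},
              interval=0, iterations=2)
```

The collector knows nothing about HBase or the TSDs, which is the decoupling of measurement from storage mentioned above.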

Page 12: Graph everything

Challenges

The UI is seriously lacking

No annotation support out of the box

Only 1 s time resolution (and only one value per second per time series)
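
One way to live with the limit of one value per second per time series is to aggregate on the sending side before anything reaches the store. The sketch below averages all samples that fall into the same second; this is an assumed workaround for illustration, not necessarily what gutefrage.net does, and a sum or maximum may suit other metrics better.

```python
from collections import defaultdict


def aggregate_per_second(samples):
    """Collapse samples to one mean value per (series, second).

    samples: iterable of (series_key, timestamp_in_seconds, value),
             where the timestamp may be fractional.
    Returns a dict mapping (series_key, whole_second) to the mean value.
    """
    buckets = defaultdict(list)
    for series_key, timestamp, value in samples:
        # Truncate to the whole second the sample belongs to.
        buckets[(series_key, int(timestamp))].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# Three sub-second samples of a hypothetical series.
raw = [
    ("web.latency_ms", 10.1, 1.0),
    ("web.latency_ms", 10.9, 3.0),
    ("web.latency_ms", 11.0, 5.0),
]
print(aggregate_per_second(raw))
```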

Page 13: Graph everything

Salvation is coming

OpenTSDB 2 is around the corner

Millisecond precision

Annotations and metadata

A decent API

Page 14: Graph everything

Why the hell are we doing this?

Page 15: Graph everything

Communication

Replace gut feeling with real data

Helps to avoid the blame game

Brains prefer graphs to numbers

Page 16: Graph everything

Getting insights

We are moving towards continuous deployment

Complex systems show emergent behaviour

Graphs are the correct flight level

Page 17: Graph everything

Lean Startup

Build - Measure - Learn cycle

You have to define measurable goals

No. It's measure, not guess

Page 18: Graph everything

Perspectives

Operations (Server load, traffic, disk space,...)

Developers (DB Queries/PageView, JS errors,...)

Product Owners (Content creation, Content Quality, ...)

...

Page 19: Graph everything

Some random thoughts

Page 20: Graph everything

Public display

Helps everyone feel involved

n+1 eyes see more than n eyes

Needs a culture of trust

Page 21: Graph everything

Alerting

Fixed values for alerts are not good enough

Drawing Attention vs. Alerting

False positives are bugs

Don't call the on-call guy for nothing
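
To illustrate what "fixed values are not good enough" can mean in practice, here is a sketch of a threshold derived from recent history instead of a hard-coded number: a value is flagged when it lies more than a few standard deviations from the recent mean. This is a generic technique chosen for illustration, not the method from the talk or Skyline's algorithm, and `n_sigma` and `min_points` are assumed tuning knobs.

```python
from statistics import mean, stdev


def is_anomalous(history, current, n_sigma=3.0, min_points=5):
    """Flag `current` if it deviates strongly from recent `history`.

    Returns False while there is too little history to judge, so a
    freshly created metric does not page anyone.
    """
    if len(history) < min_points:
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        return current != mu
    return abs(current - mu) > n_sigma * sigma


# Hypothetical requests-per-second readings.
recent = [100, 102, 98, 101, 99, 100]
print(is_anomalous(recent, 101))
print(is_anomalous(recent, 150))
```

A result like this fits the "drawing attention" side of the slide, for example as a marker on a dashboard, with paging reserved for conditions that someone actually has to act on.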

Page 22: Graph everything

Metrics != boring

You can (and should) get creative with what you measure.

Have some brainstorming sessions

Insights may come from surprising places

Page 23: Graph everything

Track team happiness

There is no fixed scale

It forces you to communicate

If you listen, you can find problems in the team

Page 24: Graph everything

Track ops confidence

Create a platform where you can buy or sell your on-call shifts.

The price for a shift tells you how confident the team is.

This has not been tested - yet.

Page 25: Graph everything

Track recruiting efforts

Helps to get a feel for the job market

Reminds everyone to keep looking for new colleagues

BTW: we are hiring ;-)

Page 26: Graph everything

Questions?

Please contact me:

[email protected]

@mydalon

I'll upload the slides and tweet about it

Page 27: Graph everything

One more thing

Page 28: Graph everything

Please give [email protected]

@mydalon

Page 29: Graph everything

Image Sources:

Plane: Felix Gottwald - www.felixgottwald.net (Creative Commons Attribution Share Alike 3.0 German)

Talking men: Deutsche Fotothek - Peter, Richard sen.

Money: Wikimedia contributor Avij

Other images: Oliver Hankeln

This presentation is licensed under Creative Commons Attribution Share Alike 3.0
