OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

Graphite basics

● Graphite generates graphs from timeseries data
  – Think MRTG or Cacti
  – More flexible than those
● Written in Python
  – This does impact performance
● Web based and easy to use
  – For once, not a marketing buzzword

The church of Graphs

● Pattern Recognition
● Correlation
● Analytics
● Anomaly detection

Helpful Graphite features

● Out of order data insertion
● Ability to compare corresponding time periods (time travel)
● Custom retention periods
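
Comparing corresponding time periods is usually done with Graphite's timeShift function. A minimal sketch of building such targets follows; the helper name and the metric name sys.web01.load are made up for illustration.

```python
def compare_with_past(metric, offsets=("7d",)):
    """Return render targets: the metric itself plus one shifted copy per offset."""
    targets = [metric]
    for offset in offsets:
        # An offset without a sign shifts the series back in time,
        # so "7d" draws last week's data on top of the current graph.
        targets.append(f'timeShift({metric}, "{offset}")')
    return targets


# Hypothetical metric: today's load next to yesterday's and last week's.
targets = compare_with_past("sys.web01.load", ("1d", "7d"))
```

Each returned string can be passed as a separate target parameter in one render request.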

Moving parts

● Relays
  – Send data to correct backend store
    ● Pattern matching on metric names
    ● Consistent hashing
● Storage
  – Flat, fixed size files
    ● These are created when the metric is first recorded
    ● Changing later is hard
● Webapp
  – Django based application offering a web API and JavaScript based frontend application
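
Because the files are fixed size, the disk cost of a metric is known the moment it is created. A rough sketch of that arithmetic, assuming the usual Whisper layout (16 byte header, 12 bytes per archive description, 12 bytes per datapoint); the retention values in the example are invented, not the ones used in this setup.

```python
METADATA_BYTES = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_BYTES = 12  # offset, seconds per point, number of points
POINT_BYTES = 12         # 4 byte timestamp + 8 byte double value


def whisper_file_size(archives):
    """Estimate file size for a list of (seconds_per_point, retention_seconds)."""
    points = sum(retention // step for step, retention in archives)
    return METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


# Hypothetical schema: 10 second points for 1 day, then 1 minute points for 30 days.
DAY = 86400
size = whisper_file_size([(10, DAY), (60, 30 * DAY)])
```

The size depends only on the retention schema, never on how many points were actually written, which is why changing retention later means resizing every existing file.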

Data output

● Web API
  – Everything is an HTTP GET
  – A number of functions for data manipulation
● Graphite offers outputs in multiple formats
  – Graphical (PNG, SVG)
  – Structured (JSON, CSV)
  – Raw data
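
Since everything is an HTTP GET, a query is just a URL. A small sketch of building one for the render endpoint; the host graphite.example.com and the metric name are placeholders.

```python
from urllib.parse import urlencode


def render_url(base_url, target, fmt="json", start="-1h"):
    """Build a render URL; fmt can be png, svg, json, csv or raw."""
    query = urlencode({"target": target, "from": start, "format": fmt})
    return f"{base_url}/render?{query}"


# Hypothetical host and metric name.
url = render_url("http://graphite.example.com", "sys.web01.load", fmt="png")
```

Switching the output format only changes the format parameter; the target expression stays the same.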

Using Graphite

● Custom pages pulling in PNG images
  – Just <img src="some url here">
● Using the default frontend
  – For single, one off graphs
  – Debugging problems
● Using builtin dashboards
  – Users create their own dashboards
  – Third party dashboard tools
● Using third party libraries
  – JSON is nice for this
  – Cubism, D3.js, rickshaw, etc
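
The JSON output is a list of series, each with a target name and a list of [value, timestamp] pairs, where missing points are null. A minimal sketch of consuming it; the sample body and the helper name are invented for illustration.

```python
import json


def latest_values(render_json):
    """Map each target to its most recent non-null value (None if it has none)."""
    result = {}
    for series in json.loads(render_json):
        values = [value for value, _timestamp in series["datapoints"] if value is not None]
        result[series["target"]] = values[-1] if values else None
    return result


# Hand-written sample in the shape returned by format=json.
sample = (
    '[{"target": "sys.web01.load", '
    '"datapoints": [[0.5, 1400000000], [0.7, 1400000060], [null, 1400000120]]}]'
)
latest = latest_values(sample)
```

The trailing null is typical: the newest interval often has no datapoint yet, so consumers should skip nulls rather than treat them as zero.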

Using Graphite

● API
  – Monitoring
  – Runtime performance tuning
● Postmortem analytics
● Performance debugging

Making Graphite scale

● Original setup
  – Small cluster
    ● Two frontend boxes, two backend
  – RAID 1+0 with 4 spinning disks
    ● This works well, with about 200 machines
  – All those individual files force a lot of seeks

Scaling out - try 1

● Add more backend boxes
  – Manual rules to split traffic
  – Pattern matching based on metric names
● Balancing traffic is hard
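
Manual splitting by metric name amounts to an ordered list of regular expressions, first match wins. A sketch of the idea only; the patterns and the store01..store03 backend names are invented, and this is not the relay's actual rule syntax.

```python
import re

# Ordered rules: the first pattern that matches decides the backend.
RULES = [
    (re.compile(r"^sys\.web"), "store01"),
    (re.compile(r"^sys\."), "store02"),
]
DEFAULT_BACKEND = "store03"


def route(metric):
    """Return the backend for a metric name based on hand-maintained patterns."""
    for pattern, backend in RULES:
        if pattern.search(metric):
            return backend
    return DEFAULT_BACKEND
```

Every new naming prefix needs a new rule, and nothing guarantees that the rules spread load evenly, which is where the balancing problem comes from.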

Scaling up

● Replace spinning disks with SSDs
● Massive performance improvement due to more IOPS
  – Still not as much as we needed
● Losing an SSD meant we had a box die
  – This has been fixed
● SSDs are not as reliable as spinning rust
  – SSDs last between 12 and 14 months

Sharding – take II

● At about 10 storage servers, manually maintaining regular expressions became painful

● Keeping disk usage balanced was even harder
  – Anyone is allowed to create graphs

Sharding - take II

● Replace regular expressions with consistent hashing
● Switch to RAID 0
  – We have switched back to RAID 1
● Store data on two nodes in each ring
● Mirror rings in datacenters
● Shuffle metrics to avoid losing data and disk space.
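
Consistent hashing with two copies per metric can be sketched as a ring of hashed node positions, walking clockwise until two distinct nodes are found. This is a generic illustration, not carbon's or carbon-c-relay's actual hashing; the node names, the MD5 choice and the number of virtual positions per node are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent hash ring; each node gets several positions on the ring."""

    def __init__(self, nodes, positions_per_node=100):
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(positions_per_node)
        )
        self._keys = [position for position, _node in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_nodes(self, metric, copies=2):
        """Return up to `copies` distinct nodes responsible for a metric."""
        start = bisect.bisect(self._keys, self._hash(metric))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == copies:
                    break
        return found


# Hypothetical three node ring.
ring = HashRing(["store01", "store02", "store03"])
owners = ring.get_nodes("sys.web01.load")
```

No per-metric rules are needed, and adding a node only moves the metrics whose ring positions now fall before the new node, though those files still have to be copied over by hand.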

Disk usage

● Graphite uses a lot of disk IO
  – Background graph is in thousands on the Y axis.
  – Individual files increase seek times
● There are a lot of stat(2) calls
  – This hasn't been investigated yet

Naming conventions

● Graphite has no rules for names
● We adopted:
  – sys.* is for system metrics
  – user.* is for testing/other stuff
  – Anything else which makes sense is acceptable

Collecting metrics

● We have all sorts of homegrown scripts
  – Shell
  – Perl
  – Python
  – Powershell
● Originally used collectd for system metrics
  – The version of collectd we were using had memory usage issues
    ● These were fixed later

Collecting metrics

● System metrics are now collected by diamond
● Diamond is a Python application
  – Base framework + metric collection scripts
  – Added custom patches for internal metrics
  – Added patches to send monitoring data directly to Nagios for passive checks
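
Whatever the collector, what reaches carbon is one line per datapoint: metric path, value and Unix timestamp. A minimal sketch of a homegrown sender using the plaintext protocol on carbon's default port 2003; the host name and metric name are placeholders.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one datapoint as a carbon plaintext line."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="graphite.example.com", port=2003):
    """Send a single datapoint over TCP (hypothetical host)."""
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, format_metric("sys.web01.load", 0.42, 1400000000) yields the line "sys.web01.load 0.42 1400000000" followed by a newline.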

Relay issues

● The Python relaying implementation eats CPU
● Started with relays directly on the cluster
  – Still need more CPU
● Added relays in each datacenter
  – Still need more CPU
● Ran multiple instances on each relay host
  – Still need more CPU
● Finally rewrote in C and added more relay hosts
  – This works for us (and we have breathing room)

Data visibility

● We send data to multiple places
  – Metrics get dropped
● Small application in Go which gets data from multiple locations and gives us a single merged resultset
  – Prototyped in Python, which was too slow
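
The merging idea can be illustrated in a few lines: take the same metric from several stores and, per timestamp, keep the first value that is not null. This is only a sketch of the concept, not the Go application's code, and the function name is made up.

```python
def merge_series(*copies):
    """Merge [value, timestamp] lists of one metric; first non-null value wins."""
    merged = {}
    for datapoints in copies:
        for value, timestamp in datapoints:
            # Only fill a slot that is still missing or null.
            if merged.get(timestamp) is None:
                merged[timestamp] = value
    return [[merged[timestamp], timestamp] for timestamp in sorted(merged)]


# Two stores that each dropped a different datapoint.
store_a = [[1.0, 10], [None, 20], [3.0, 30]]
store_b = [[1.0, 10], [2.0, 20], [None, 30]]
complete = merge_series(store_a, store_b)
```

Gaps that exist in only one copy disappear in the merged result; a gap remains only where every store dropped the point.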

statsd

● We had statsd running, but unused for a long time
  – statsd use is still relatively small
  – Only a few internal applications use it
  – We already have an analytics framework for this
● The PCI vulnerability scanner reliably crashed it
  – This was patched and pushed upstream
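
For the few applications that do use it, statsd takes small UDP packets of the form name:value|type. A minimal sketch, assuming statsd listens on its default UDP port 8125; the metric names are invented.

```python
import socket


def statsd_counter(name, value=1):
    """Payload that increments a counter."""
    return f"{name}:{value}|c".encode("ascii")


def statsd_timer(name, milliseconds):
    """Payload that records a timing in milliseconds."""
    return f"{name}:{milliseconds}|ms".encode("ascii")


def send(payload, host="127.0.0.1", port=8125):
    """Fire and forget: UDP gives no confirmation that statsd received it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Because the listener accepts arbitrary packets from the network, unexpected input (such as a scanner probing the port) reaches the parser directly.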

Business metrics

● Turns out, developers like Graphite
  – They don't reliably understand whisper semantics
    ● Querying Graphite like SQL doesn't work
  – They create a large number of named metrics
    ● foo.bar.YYYY-MM-DD
● Disk space use is a sudden concern
  – Especially when you don't try and restrict this (feature, not bug)

Scaling out clusters

● Different groups have different requirements
  – Multiple backend rings, same frontend
    ● Unix systems
    ● Windows
    ● Networking
    ● Business metrics
    ● User testing

Current problems

● Hardware
  – Need more CPU
    ● Especially on the frontends where we do a lot of maths
  – Better disk reliability on SSDs
    ● Replacing disks is expensive
  – More disk IO
    ● SSDs are now maxed out under stat(2) calls
    ● Testing Fusion IO cards
      – 10% faster, but we don't know about reliability yet

Current problems

● People
  – If you need a graph, put the data in Graphite
    ● Even if the data isn't time series data
● Frontend scalability
  – The default frontend doesn't work well with a few thousand hosts
● Software upgrades
  – Our last Whisper upgrade caused data recording to stop

Current problems

● Manageability
  – Getting rid of older, non-required metrics is a lot of effort
  – Adding hosts into a ring requires manual rebalancing effort

Future possibilities

● Testing Cassandra as a backend (cyanite)
● Anomaly detection
  – Tested Skyline, didn't scale
● More business metrics
● Sparse metrics
  – Metrics with a lot of nulls, but potentially a lot of named metrics involved

Peopleware

● Hiring people to work on interesting challenges
  – Sysadmins, developers
  – http://www.booking.com/jobs
● Booking.com will be sponsoring a Graphite dev summit in June (tentatively just before the devopsdays Amsterdam event)

Reference URLs

● Graphite
  – https://github.com/graphite-project
● Graphite API
  – http://graphite.readthedocs.org/en/latest/functions.html
● C Carbon relay
  – https://github.com/grobian/carbon-c-relay
● Zipper
  – https://github.com/grobian/carbonserver
● Cyanite
  – https://github.com/pyr/cyanite
  – https://github.com/brutasse/graphite-cyanite

?