Graphite at CityGrid - LA DevOps April 2014

Graphite at CityGrid: if you can’t measure it, you can’t fix it. Wil Heitritter, Director, Tech Ops. Los Angeles DevOps, 2014/04/28

description

High-level description of CityGrid's use of Graphite for collecting/displaying metrics, along with some interesting use-cases.

Transcript of Graphite at CityGrid - LA DevOps April 2014

Page 1: Graphite at CityGrid - LA DevOps April 2014

Graphite at CityGrid

if you can’t measure it, you can’t fix it

Wil Heitritter

Director, Tech Ops

Los Angeles DevOps

2014/04/28

Page 2: Graphite at CityGrid - LA DevOps April 2014

Magnum esse solem philosophus probabit, quantus sit mathematicus

(The philosopher will prove that the sun is large; the mathematician, how large it is.)

-Seneca

Page 3: Graphite at CityGrid - LA DevOps April 2014

Objectives

- Introduce Graphite to new users

- Show what we like, what we hate

- Present some interesting use-cases

- Generate discussion

Page 4: Graphite at CityGrid - LA DevOps April 2014

Before Graphite

Ganglia

• Predictable interface

• Text “metrics” to store versions

• Slow

• Couldn’t pick and choose metrics to see

Page 5: Graphite at CityGrid - LA DevOps April 2014

Why ganglia sucked

- Clusters had to be pre-configured

- Multicast vs. Unicast

- Data Retention

- Static Web Interface (can’t pick and choose)

- Static Host List

Page 6: Graphite at CityGrid - LA DevOps April 2014

What did we think we wanted?

Ease of adding metrics

Ease of sending metrics

Powerful metric display

Retain ganglia-style cluster dashboards

Long-term configurable metric retention

Page 7: Graphite at CityGrid - LA DevOps April 2014

Graphite!

Page 8: Graphite at CityGrid - LA DevOps April 2014

What is Graphite?

a highly scalable real-time graphing system

which collects numeric time-series data

is managed by carbon

and stored as whisper files

and visualized through web interfaces

or queried via the API

http://graphite.wikidot.com/
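To make "queried via the API" concrete, here is a minimal sketch that builds a URL for Graphite's render endpoint. The host name graphite.example.com and the helper name render_url are placeholders, not anything from the talk; the target, from and format parameters are the standard render options.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, from_time="-1h", fmt="json"):
    """Build a render API URL for one or more targets (hypothetical helper)."""
    # Each target becomes its own "target=" query parameter.
    params = [("target", t) for t in targets]
    params.append(("from", from_time))
    params.append(("format", fmt))
    return base_url.rstrip("/") + "/render?" + urlencode(params)


# graphite.example.com is an assumed host, used only for illustration.
url = render_url("http://graphite.example.com", ["business.test.metric1"])
print(url)
```

Fetching that URL from a running graphite-web should return the series as JSON; swapping format for png would return the rendered graph instead.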

Page 9: Graphite at CityGrid - LA DevOps April 2014

Graphite: what we like

Sending metrics is simple

Retrieving metrics is simple

Dashboard creation and sharing… is simple

Many functions()

120MM+ metric values received daily

Backfilling past metrics is simple

Expandable - different frontends

Page 10: Graphite at CityGrid - LA DevOps April 2014

Graphite: what sucks

Dashboard ownership/promotion

No ganglia-like standard dashboard

Data retention… is NOT as simple as we thought

Page 11: Graphite at CityGrid - LA DevOps April 2014

CityGrid’s Graphite

Implementation

Page 12: Graphite at CityGrid - LA DevOps April 2014

Metric Naming

Business Metrics

- These metrics are not tied to any one server

- Format: business.${hierarchical}.${path}.${here}.$metric

- Example: business.ec2.testaccount.us-east-1a.OnDemand.running.m2.4xlarge

Page 13: Graphite at CityGrid - LA DevOps April 2014

Metric Naming

Server Metrics

- These metrics are specific to a particular server (just like ganglia)

- Format: servers.${class}.${f_q_d_n}.${metric}

- Example: servers.rvw.aws1prdrvw1_subdom_cityg_com.LW_api_reviews_QPS
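The two naming formats can be sketched as small helpers. The function names are hypothetical, and the dotted host name in the usage lines is inferred from the underscored example above rather than stated in the slides; the sketch assumes dots in the FQDN are swapped for underscores so the host stays a single node in the metric path.

```python
def server_metric(server_class, fqdn, metric):
    """Build servers.${class}.${f_q_d_n}.${metric} (hypothetical helper)."""
    # Assumption: dots in the FQDN become underscores, as in the slide example.
    host = fqdn.replace(".", "_")
    return ".".join(["servers", server_class, host, metric])


def business_metric(*path):
    """Build business.${hierarchical}.${path}.${here}.$metric (hypothetical helper)."""
    return ".".join(("business",) + path)


print(server_metric("rvw", "aws1prdrvw1.subdom.cityg.com", "LW_api_reviews_QPS"))
print(business_metric("ec2", "testaccount", "us-east-1a", "OnDemand", "running", "m2.4xlarge"))
```

Note that an element such as m2.4xlarge contains a dot itself, so Graphite will treat it as two levels of the hierarchy.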

Page 14: Graphite at CityGrid - LA DevOps April 2014

Sending metrics

Sending directly from metric scripts

- /etc/graphite.conf

- May need to spread out sending if in volume

Collecting from gmond every minute

- Metrics are spread out to prevent spiking

- False data (gmond acts as a cache)
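The slides do not say how the sending was spread out. One possible approach, shown here purely as an assumption, is to hash each metric name to a stable second within the minute so that senders do not all fire at :00. The names send_offset and schedule are made up for this sketch.

```python
import zlib


def send_offset(metric, interval=60):
    """Map a metric name to a stable second within the interval."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(metric.encode("utf-8")) % interval


def schedule(metrics, interval=60):
    """Group metric names by the second at which they would be sent."""
    slots = {}
    for name in metrics:
        slots.setdefault(send_offset(name, interval), []).append(name)
    return slots


names = ["servers.rvw.host%d.load" % i for i in range(100)]
slots = schedule(names)
print(len(slots), "distinct send slots for", len(names), "metrics")
```

Because the offset depends only on the name, a given metric keeps arriving at the same point in each minute, which keeps its interval between datapoints steady.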

Page 15: Graphite at CityGrid - LA DevOps April 2014

Impact of staggered sending

Page 16: Graphite at CityGrid - LA DevOps April 2014

Sending is simply...

echo $metric $value $timestamp | nc $relay $port
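The same plaintext line can be sent from a script without nc. This is a minimal sketch, not CityGrid's sender; format_line and send_metric are hypothetical names, and the relay host and port are whatever $relay and $port would be.

```python
import socket
import time


def format_line(metric, value, timestamp=None):
    """Return one plaintext line: metric, value, Unix timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (metric, value, timestamp)


def send_metric(relay, port, metric, value, timestamp=None):
    """Open a TCP connection to the relay and send a single metric line."""
    line = format_line(metric, value, timestamp)
    with socket.create_connection((relay, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example call (needs a listening carbon relay or cache):
# send_metric("localhost", 2003, "business.test.metric1", 1)
print(repr(format_line("business.test.metric1", 1, 1398700000)))
```

Passing an explicit older timestamp is how past values could be backfilled, provided the timestamp still falls inside the metric's retention.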

Page 17: Graphite at CityGrid - LA DevOps April 2014

Performance

carbon-cache/carbon-relay

SSD

replication within minutes

Page 18: Graphite at CityGrid - LA DevOps April 2014

Maintenance

Changing retention

- whisper-auto-resize.py

Filling holes

- whisper-fill $source $destination

Backups

- Dashboards

- Metrics
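Before changing retention it helps to know what a schema costs on disk. The sketch below estimates the size of one whisper file from a retention string, assuming the usual whisper layout (16-byte header, 12 bytes per archive definition, 12 bytes per datapoint). The retention string is an invented example, not CityGrid's configuration, and the parser only handles the precision:duration form with unit suffixes.

```python
# Seconds per unit suffix; a year is taken as 365 days.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_archive(spec):
    """Turn 'precision:retention' (e.g. '60s:30d') into (seconds_per_point, points)."""
    precision, retention = spec.split(":")
    seconds = int(precision[:-1]) * UNITS[precision[-1]]
    span = int(retention[:-1]) * UNITS[retention[-1]]
    return seconds, span // seconds


def whisper_file_size(retentions):
    """Estimate bytes on disk for one metric with the given retentions."""
    archives = [parse_archive(s.strip()) for s in retentions.split(",")]
    header = 16 + 12 * len(archives)
    return header + sum(12 * points for _seconds, points in archives)


# Hypothetical schema: 1-minute points for 30 days, then 10-minute points for a year.
print(whisper_file_size("60s:30d,10m:1y"))
```

Multiplying the result by the number of metrics gives a rough idea of what a retention change means for the SSDs.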

Page 19: Graphite at CityGrid - LA DevOps April 2014

Graphite Use-Cases

Page 20: Graphite at CityGrid - LA DevOps April 2014

Single Metric

Page 21: Graphite at CityGrid - LA DevOps April 2014

Combined Metrics

Page 22: Graphite at CityGrid - LA DevOps April 2014

Key Metrics Dashboard

Examples of Key Metrics

- QPS

- Processing Time (Max/Mean/Distribution)

- Metrics about sub-requests

- Network usage

- CPU/load

Page 23: Graphite at CityGrid - LA DevOps April 2014

Key Metrics Dashboard

Page 24: Graphite at CityGrid - LA DevOps April 2014
Page 25: Graphite at CityGrid - LA DevOps April 2014

Nagios Integration

check_graphite_target!highestMax(servers.mai.@[email protected]_map_return_code_5*_ratio, 1)!5!10
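The check_graphite_target command definition is not shown in the slides. As a rough sketch of what such a check might do, the code below takes series in the shape returned by the render API with format=json, finds the highest value, and maps it to a Nagios exit code. Reading the trailing 5 and 10 as warning and critical thresholds is an assumption, and check_series is a hypothetical name.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(series, warn, crit):
    """Compare the highest datapoint across series against warn/crit thresholds."""
    # Render JSON gives datapoints as [value, timestamp]; gaps appear as None.
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints"
    worst = max(values)
    if worst >= crit:
        return CRITICAL, "CRITICAL: max=%s" % worst
    if worst >= warn:
        return WARNING, "WARNING: max=%s" % worst
    return OK, "OK: max=%s" % worst


# Made-up sample in the shape of a render?format=json response.
sample = [{"target": "example", "datapoints": [[1.0, 0], [None, 60], [6.5, 120]]}]
status, message = check_series(sample, warn=5, crit=10)
print(status, message)
```

With highestMax(..., 1) in the target, Graphite has already narrowed the result to the single worst series, so the check only has to look at that one.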

Page 26: Graphite at CityGrid - LA DevOps April 2014

How about Pie Charts?

Page 27: Graphite at CityGrid - LA DevOps April 2014
Page 28: Graphite at CityGrid - LA DevOps April 2014

Ad-Hoc Dashboards

Demo

Page 29: Graphite at CityGrid - LA DevOps April 2014

What NOT to do

Page 30: Graphite at CityGrid - LA DevOps April 2014

Trying it out for yourself

Page 31: Graphite at CityGrid - LA DevOps April 2014

Quick Setup

Install & Start

# pip install https://github.com/graphite-project/ceres/tarball/master

# pip install whisper

# pip install carbon

# pip install graphite-web

start it up...

send it a metric:

echo business.test.metric1 1 `date "+%s"` | nc localhost 2003

OK, it’s almost that easy...

Page 32: Graphite at CityGrid - LA DevOps April 2014

Discussion

Page 33: Graphite at CityGrid - LA DevOps April 2014