Metrics stack 2.0

Posted on 27-Jan-2015


Description

Most metrics systems link timeseries to a string key; some add a few tags. They often lack information, use inconsistent formats and terminology, and are poorly organized. As the number of people and programs generating, processing, storing and visualizing metrics grows, this approach becomes very cumbersome, and there is a lot to be gained from taking a step back and re-thinking metric identifiers and metadata.

Metrics 2.0 is a set of conventions around metrics: with barely any extra work, metrics become self-describing and standardized. Compatibility between tools increases dramatically, dashboards can automatically convert information needs into graphs, graph renderers can present data more usefully, and anomaly detectors and aggregators can work more autonomously and avoid common mistakes. The result: less micromanaging of software and configuration, quicker results, more clarity, less frustration and less room for error.

This talk also covers the tools that turn this concept into production-ready reality. Graph-Explorer is an application that integrates with Graphite: enter an expression that represents an information need and it generates the corresponding graphs or alerting rules, automatically applying unit conversion, aggregation, processing, etc. Statsdaemon is an aggregation daemon, like Etsy's statsd, that expresses performed aggregations and statistical operations by updating the metrics' tags, making sure that the metric metadata always corresponds to the data.

Dieter Plaetinck is a systems-gone-backend engineer at Vimeo.

Transcript of Metrics stack 2.0

   

   Credit: user niteroi @ panoramio.com

   

vimeo.com/43800150


1  Metrics 2.0 concepts

2  Implementation

3  Advanced stuff

   

“Dieter” ?

   

Peter? Deter?

   

Terminology sync

   

(1234567890, 82)

(1234567900, 123)

(1234567910, 109)

(1234567920, 77)

db15.mysql.queries_running

host=db15 mysql.queries_running
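The two identifiers above illustrate the core shift: from a dotted string whose positions carry implicit meaning, to named tags you can filter on. A small hypothetical sketch (my own names, not any tool's data model):

```python
# Hypothetical sketch: the same series identified two ways.
legacy_key = "db15.mysql.queries_running"  # position encodes meaning implicitly

# Tag-based: every dimension is named explicitly.
tagged = {"host": "db15", "service": "mysql", "what": "queries_running"}

def matches(metric, **filters):
    """Return True if the metric carries every requested tag value."""
    return all(metric.get(k) == v for k, v in filters.items())

# "All metrics on host db15" becomes a filter instead of string parsing.
print(matches(tagged, host="db15"))   # True
print(matches(tagged, host="db16"))   # False
```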

   

   

How many page requests/s is vimeo.com doing?

   

● stats.hits.vimeo_com

● stats_counts.hits.vimeo_com

   

   

stats.<host>.requesthostport.vimeo_com_443

   

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

   

O(X * Y * Z)

X = # apps

Y = # people

Z = # aggregators

   

How long does it take to retrieve an object from swift?

   

stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>

stats.timers.<host>.object-server.<http_method>.timing.<stat>

target=stats.timers.dfs*.object*GET*timing.mean ?

target=groupByNode(stats.timers.dfs*.proxy-server.object.GET.*.timing.mean,2,"avg")

target=stats.timers.dfs*.object-server.GET.timing.mean

   

swift_type=object stat=mean timing GET avg by http_code


O((D * V)^2)

D = # dimensions

V = # values per dim

   

collectd.db.disk.sda1.disk_time.write


What should I name my metric?

   

101001000

100001000001000000

   

   

Metrics 2.0

   

Old:
● information lacking
● fields unclear & inconsistent
● cumbersome strings / trees
● forbidden characters

New:
● Self-describing
● Standardized
● all dimensions in orthogonal tag-space
● Allow some useful characters

   

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

{
    “server”: “dfvimeodfsproxy5”,
    “http_method”: “GET”,
    “http_code”: “200”,
    “unit”: “ms”,
    “target_type”: “gauge”,
    “stat”: “upper_90”,
    “swift_type”: “object”,
    “plugin”: “swift_proxy_server”
}

   

Main advantages:
● Immediate understanding of metric meaning (ideally)

● Minimize time to graphs, dashboards, alerting rules 

   

github.com/vimeo/graph-explorer/wiki

   

SI + IEC

B   Err   Warn   Conn   Job   File   Req    ...

MB/s   Err/d   Req/h   ...

   

{

    “site”: “vimeo.com”,

    “port”: 80,

    “unit”: “Req/s”,

    “direction”: “in”,

    “service”: “webapp_php”,

    “server”:  “webxx”

}

   

   

Carbon-tagger:

... service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890

Statsdaemon:

..unit=B..unit=B... → unit=B/s

..unit=ms..unit=ms.. → unit=ms stat=mean
                     → unit=ms stat=upper_90
                     → ...
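The carbon-tagger line embeds tags in the metric name as dot-separated key=value pairs. A minimal parser sketch (a hypothetical helper, not carbon-tagger's actual code; untagged nodes get positional n1, n2, ... keys, as Graph-Explorer does for legacy metrics):

```python
def parse_tagged_line(line):
    """Parse a carbon protocol line whose metric name embeds key=value tags.
    Sketch only; real carbon-tagger does validation and enforcement too."""
    name, value, timestamp = line.rsplit(" ", 2)
    tags, n = {}, 0
    for part in name.split("."):
        if "=" in part:
            key, val = part.split("=", 1)
            tags[key] = val
        else:
            n += 1                        # untagged node: positional key
            tags["n%d" % n] = part
    return tags, float(value), int(timestamp)

line = "service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890"
tags, value, ts = parse_tagged_line(line)
print(tags["unit"], value, ts)  # B 123.0 1234567890
```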

   

   

   

Graph-Explorer queries 101

site:api.vimeo.com unit=Req/s

requesthostport api_vimeo_com

   

   

Smoothing

avg over 10M

avg over ...

   

   

Aggregation, compare port 80 vs 443

avg by <dimension>

sum by <dimension>

sum by server

   

   

Compare port 80 traffic among servers

site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M

   

   

Graph-Explorer queries 201

proxy-server swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>


Compare object put/get

Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server

   

   

Comparing servers

http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none

   

   

Compare http codes for GET, per swift type

http_method=GET avg by server group by swift_type

   

   

transcode unit=Job/s avg over <time> from <datetime> to <datetime>

    Note: data is obfuscated

   

Bucketing

!queue sum by zone:ap-southeast|eu-west|us-east|us-west|sa-east|vimeo-df|vimeo-lv group by state

    Note: data is obfuscated

   

Compare job states per region (zones bucket)

group by zone

    Note: data is obfuscated

   

Unit conversion

unit=Mb/s network dfvimeorpc sum by server

   

   

   

unit=MB
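Because units are standardized (SI prefixes on a base unit), conversions like the Mb/s and MB examples above reduce to prefix arithmetic. A sketch of such a helper (my own, not Graph-Explorer's implementation; SI prefixes only):

```python
# SI prefix multipliers; a sketch, not Graph-Explorer's code.
PREFIX = {"": 1, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}

def convert(value, src, dst):
    """Convert between units that differ only by SI prefix, e.g. B -> GB."""
    def split(unit):
        if len(unit) > 1 and unit[0] in PREFIX:
            return unit[0], unit[1:]
        return "", unit
    src_prefix, src_base = split(src)
    dst_prefix, dst_base = split(dst)
    assert src_base == dst_base, "base units must match"
    return value * PREFIX[src_prefix] / PREFIX[dst_prefix]

print(convert(2_500_000_000, "B", "GB"))  # 2.5
```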

   

   

   

{

    server=dfvimeodfs1

    plugin=diskspace

    mountpoint=_srv_node_dfs5

    unit=B

    type=used

    target_type=gauge

}

   

server:dfvimeodfs unit=GB type=free srv node

   

   

unit=GB/d group by mountpoint


Dashboard definition

 queries = [

   'cpu usage sum by core',

   'mem unit=B !total group by type:swap',

   'stack network unit=b/s',

   'unit=B (free|used) group by =mountpoint'

 ]

   

   

stats.dfvimeocliapp2.twitter.error

{

    “n1”: “dfvimeocliapp2”,

    “n2”: “twitter”,

    “n3”: “error”,

    “plugin”: “catchall_statsd”,

    “source”: “statsd”,

    “target_type”: “rate”,

    “unit”: “unknown/s”

}

   

Two hard things in computer science

   

stats.gauges.files.id_boundary_7day

stats.gauges.files.id_boundary_ceil

   

unit=File id_boundary_7d 

{

   “unit”: “File”,

   “n1”: “id_boundary_7d”,

}

   

{

    “intrinsic”: {

        “site”: “vimeo.com”,

        “unit”: “Req/s”

    },

    “extrinsic”: {

        “agent”: “diamond”,

        “processed_by”: “statsd1”,

        “src”: “index.php:135”,

        “replaces”: “vimeo_com_reqps”

    }

}
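One way to read this split: only the intrinsic tags define the metric's identity, so extrinsic metadata can change without creating a new series. A hypothetical helper illustrating that:

```python
# Sketch: intrinsic tags define identity; extrinsic tags are just metadata.
metric = {
    "intrinsic": {"site": "vimeo.com", "unit": "Req/s"},
    "extrinsic": {"agent": "diamond", "processed_by": "statsd1"},
}

def metric_id(m):
    """Canonical, order-independent key built from intrinsic tags only."""
    return " ".join("%s=%s" % kv for kv in sorted(m["intrinsic"].items()))

print(metric_id(metric))  # site=vimeo.com unit=Req/s
```

Changing an extrinsic tag (say, processed_by) leaves the identity untouched, so history stays attached to the same series.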

   

site=vimeo.com unit=Req/s \
  processed_by=statsd1 \
  src=index.php:135 added_by=dieter \
123 1234567890

   

   

Equivalence

servers.host.cpu.total.iowait → “core”: “_sum_”

servers.host.cpu.<core-number>.iowait

servers.host.loadavg.15

   

Rollups & aggregation

   

/etc/carbon/storage-aggregation.conf

[min]

pattern = \.min$

aggregationMethod = min

[max]

pattern = \.max$

aggregationMethod = max

[sum]

pattern = \.count$

aggregationMethod = sum

[default_average]

pattern = .*

aggregationMethod = average
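This config is one reason the target_type tag matters: a count must roll up with sum, not average. A tiny sketch with made-up numbers:

```python
# Sketch with made-up numbers: rolling up six 10-second count buckets
# into one 60-second bucket.
counts = [5, 0, 12, 3, 7, 9]   # hits per 10s interval

avg_rollup = sum(counts) / len(counts)   # loses the total; wrong for a count
sum_rollup = sum(counts)                 # total hits over the minute; correct

# Graphite picks the method per metric by pattern-matching
# storage-aggregation.conf, hence .count -> aggregationMethod = sum.
print(avg_rollup, sum_rollup)
```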

   

   

2 kinds of graphite users

   

Self-describing metrics

stat=upper/lower/mean/...
target_type=counter
...

   

● stats.timers.render_time.histogram.bin_0.01
● stats.timers.render_time.histogram.bin_0.1
● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1
● stats.timers.render_time.histogram.bin_10
● stats.timers.render_time.histogram.bin_50
● stats.timers.render_time.histogram.bin_inf
● stats.timers.render_time.lower → unit=ms stat=lower
● stats.timers.render_time.mean → unit=ms stat=mean
● stats.timers.render_time.mean_90 → ...
● stats.timers.render_time.median
● stats.timers.render_time.std
● stats.timers.render_time.upper
● stats.timers.render_time.upper_90

   

Also..

● graphite API functions such as "cumulative", "summarize" and "smartSummarize"

● Graph renderers

   

   From: dygraphs.com

   


Facet based suggestions

   

   

Metric types

● gauge
● count & rate
● counter
● timer


gauge

● Multiple values in same interval
● “sticky”

   

   

Count & Rate

   

Counter
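A counter only ever increases; consumers derive a rate by differencing consecutive samples. A minimal sketch (hypothetical helper; it ignores counter resets and wraps, which real tools must handle):

```python
def counter_to_rate(samples):
    """Derive per-second rates from an ever-increasing counter.

    samples: list of (timestamp, value) pairs, sorted by time.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates

samples = [(1234567890, 1000), (1234567900, 1500), (1234567910, 1800)]
print(counter_to_rate(samples))  # [(1234567900, 50.0), (1234567910, 30.0)]
```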

   

Timer..

   

http://janabeck.com/blog/2012/10/12/lessons-learned-from-100/

   

Timer..

   

● What should a metric be?
● Stickiness?
● Behavior when no packets received
● Behavior when multiple packets received

   

My personal takeaways

   

Conclusion
● Building graphs, setting up alerting: cumbersome
● Esp. changing information needs (troubleshooting, exploring, ..)
● Esp. complicated information needs
  → PAIN

● Structuring metrics
● Self-describing metrics
● Standardized metrics
● Native metrics 2.0
  → BREEZE

   

Conclusion

● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.

● Converting information needs into graph defs, alerting rules
● Graph-Explorer, carbon-tagger, statsdaemon, …
● Graphite-ng (native metrics 2.0)
● Metrics 2.0 in your apps, agents, aggregators?
● Build out structured metrics library

   

github.com/vimeo

github.com/Dieterbe

twitter.com/Dieter_be

dieter.plaetinck.be