Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)

Metrics at scale @UBER
Mantas Klasavičius

About Me: Senior software engineer @ Uber

<metric_path> <value> <timestamp>
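
The `<metric_path> <value> <timestamp>` shape above matches one line of the Graphite plaintext protocol. A minimal sketch of formatting and parsing such a line follows; the helper names and the example metric path are illustrative assumptions, not taken from the talk.

```python
import time
from typing import Optional, Tuple


def format_graphite_line(metric_path: str, value: float,
                         timestamp: Optional[int] = None) -> str:
    """Render one sample as '<metric_path> <value> <timestamp>' plus newline."""
    if not metric_path or any(ch.isspace() for ch in metric_path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    if timestamp is None:
        # Default to the current time in whole seconds since the epoch.
        timestamp = int(time.time())
    return f"{metric_path} {value} {int(timestamp)}\n"


def parse_graphite_line(line: str) -> Tuple[str, float, int]:
    """Split a plaintext line back into (path, value, timestamp)."""
    path, value, timestamp = line.split()
    return path, float(value), int(timestamp)


# Hypothetical metric path, used only for illustration.
line = format_graphite_line("stats.counts.api.requests", 42, 1500000000)
print(line, end="")
print(parse_graphite_line(line))
```

Keeping the timestamp explicit makes it possible to backfill or replay samples instead of always stamping them with the send time.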

UBER: 6 continents, 72 countries, 425 cities

>5 million a day, >1000 engineers, 7 years

UBER in Vilnius: 3y ago, >20 engineers, 4 Teams:

- Observability
- Databases
- Foundations
- DevExp

Hypergrowth defines us...

Growth of Services

Metrics: Metrics @UBER is a first class citizen

T0 Service

Handling ~500M telemetry timeseries

Writing ~3M values/sec and running ~1K queries/sec

50M minutes worth of data per sec

Growing >25% month over month
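
To put the growth figure in perspective, compounding 25% per month can be worked out directly. The sketch below is plain arithmetic on the number from the slide; since the slide says ">25%", the results are lower bounds.

```python
import math


def growth_factor(monthly_rate: float, months: int) -> float:
    """Total multiplier after compounding monthly_rate for the given months."""
    return (1.0 + monthly_rate) ** months


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed for the volume to double at a constant monthly rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


rate = 0.25  # 25% month over month, the lower bound from the slide
print(round(growth_factor(rate, 12), 2))     # 14.55 (times larger after a year)
print(round(doubling_time_months(rate), 2))  # 3.11 (months to double)
```

In other words, at this rate the system has to absorb roughly a doubling of load every quarter.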

Metrics Collection: Graphite ~2013

Metrics Collection: Graphite 2015

Metrics Collection: Considered choices

Netflix Atlas

Blueflood

Update Graphite

Metrics Collection: M3

Metrics Collection: Cassandra is a figure of epic tradition and of tragedy.

High write throughput

Cassandra's data model supports time-series storage via DateTieredCompactionStrategy (DTCS)

Cassandra's native TTL support
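
The three properties above map onto ordinary CQL options. Here is a minimal sketch that only builds the CQL text and does not connect to a cluster. The keyspace, table, and column names are hypothetical; `DateTieredCompactionStrategy`, `default_time_to_live`, and `USING TTL` are standard CQL.

```python
def create_table_cql(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Build a CREATE TABLE statement for a time-series table with DTCS and a default TTL."""
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (\n"
        "    metric_path text,\n"
        "    ts timestamp,\n"
        "    value double,\n"
        "    PRIMARY KEY (metric_path, ts)\n"
        ") WITH CLUSTERING ORDER BY (ts DESC)\n"
        "  AND compaction = {'class': 'DateTieredCompactionStrategy'}\n"
        f"  AND default_time_to_live = {ttl_seconds};"
    )


def insert_cql(keyspace: str, table: str, ttl_seconds: int) -> str:
    """Build a parameterized INSERT that expires the row after ttl_seconds."""
    return (
        f"INSERT INTO {keyspace}.{table} (metric_path, ts, value) "
        f"VALUES (?, ?, ?) USING TTL {ttl_seconds};"
    )


# Hypothetical names and retention periods, for illustration only.
print(create_table_cql("metrics", "samples", 86400))
print(insert_cql("metrics", "samples", 3600))
```

A per-write TTL overrides the table default, so different retention tiers can share one schema.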

Metrics Collection: Cassandra - our use case

Separate clusters for different types of data

Clusters span multiple datacenters

Dynamically control to which cluster data is written

Forcibly deleting old data

https://github.com/m3db/m3db/




Metrics Collection: Metrics as free resource

*.application_1431728998581_0361.*

*.Connections.10_30_3_24.0x64d11081baa1837.*

*.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*

*.check-<uid_or_uuid>.*
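
The patterns above all embed a unique identifier (application ID, IP address, hex handle, UUID) in the metric path, so every new ID creates new time series. Below is a minimal sketch of a heuristic check that flags such path components. The regular expressions and the example prefixes and suffixes around the IDs are assumptions for illustration, not the detection used in the talk.

```python
import re
from typing import List

# Heuristic patterns for path components that look like unique identifiers.
ID_PATTERNS = [
    # UUID anywhere in the component
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I),
    # IPv4 address written with underscores, e.g. 10_30_3_24
    re.compile(r"^\d{1,3}(_\d{1,3}){3}$"),
    # long hexadecimal handle, e.g. 0x64d11081baa1837
    re.compile(r"^0x[0-9a-f]{8,}$", re.I),
    # long run of digits, e.g. an epoch timestamp in milliseconds
    re.compile(r"\d{8,}"),
]


def unbounded_components(metric_path: str) -> List[str]:
    """Return the dot-separated components that look like per-instance IDs."""
    return [
        part
        for part in metric_path.split(".")
        if any(pattern.search(part) for pattern in ID_PATTERNS)
    ]


# The ID components come from the slide; the surrounding names are hypothetical.
print(unbounded_components("yarn.application_1431728998581_0361.memory"))
print(unbounded_components("zk.Connections.10_30_3_24.0x64d11081baa1837.latency"))
print(unbounded_components("stats.counts.api.requests"))
```

A check like this could run at ingestion time to warn the owning team before the series count grows without bound.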

Metrics Collection: Cost accounting and metrics about metrics

Metrics Visualization: M3 - Querying

Metrics Visualization: Grafana

Observability: Past, Present, and Future

Metrics Visualization

aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn
  | aggregate
  | asPercent (fetch name:requests caller:cn | aggregate)
  | anomalies
  | sort max
  | tail 10

M3QL - Query Like It’s Bash

tail(
  sort(
    anomalies(
      asPercent(
        sum(fillNulls(stats.counts.cn.*.requests.errors)),
        sum(fillNulls(stats.counts.cn.*.requests))
      )
    ),
    max
  ),
  10
)

Metrics Visualization: Graphite Way vs. M3QL
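
The difference between the two styles is purely one of composition: a pipeline reads left to right, while nested calls read inside out. A minimal sketch with toy stand-in functions follows. These helpers are not M3QL or Graphite functions, only hypothetical Python equivalents that show both forms producing the same result.

```python
from functools import reduce
from typing import Callable, List, Optional

Series = List[float]


def fill_nulls(series: List[Optional[float]]) -> Series:
    """Replace missing points with zero."""
    return [0.0 if v is None else float(v) for v in series]


def sum_series(group: List[Series]) -> Series:
    """Element-wise sum across a group of equally long series."""
    return [sum(column) for column in zip(*group)]


def aggregate(group: List[List[Optional[float]]]) -> Series:
    """Toy version of 'fillNulls | sum' applied to a group of series."""
    return sum_series([fill_nulls(s) for s in group])


def as_percent(numerator: Series, denominator: Series) -> Series:
    """Point-wise percentage, returning 0 where the denominator is 0."""
    return [100.0 * n / d if d else 0.0 for n, d in zip(numerator, denominator)]


def pipe(value, *stages: Callable):
    """Feed value through each stage in order, like a shell pipeline."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


# Made-up sample data: two error series and two request series.
errors = [[1, None, 3], [1, 1, None]]
requests = [[10, 20, None], [10, 20, 40]]

nested = as_percent(aggregate(errors), aggregate(requests))
piped = pipe(errors, aggregate, lambda e: as_percent(e, pipe(requests, aggregate)))

print(nested)           # [10.0, 2.5, 7.5]
print(nested == piped)  # True
```

The piped form keeps each step on its own line and lets a named sub-pipeline such as `aggregate` be reused, which is the readability argument the slide makes.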


Alerting based on metrics: Query Based Alerting

graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
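
The talk does not show how such an alert is evaluated. As a minimal sketch, assuming `warning_over` and `critical_over` mean "strictly greater than" and that the check looks at the most recent non-null point, the logic could look like this (the function below is a hypothetical stand-in, not the `graphite.absolute_threshold` implementation):

```python
from typing import List, Optional


def absolute_threshold(values: List[Optional[float]],
                       warning_over: float,
                       critical_over: float) -> str:
    """Classify the latest non-null point of a query result."""
    latest = next((v for v in reversed(values) if v is not None), None)
    if latest is None:
        return "unknown"  # no data to evaluate
    if latest > critical_over:
        return "critical"
    if latest > warning_over:
        return "warning"
    return "ok"


# Made-up query results, using the thresholds from the slide.
print(absolute_threshold([0.0, 0.05], warning_over=0.1, critical_over=10.0))   # ok
print(absolute_threshold([0.0, 0.2], warning_over=0.1, critical_over=10.0))    # warning
print(absolute_threshold([12.0, None], warning_over=0.1, critical_over=10.0))  # critical
```

Checking the critical bound first ensures a value above both limits is reported at the higher severity.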


Alerting based on metrics: Classic Thresholding

Classic high / low thresholds have some intrinsic problems.
• Labor-intensive: each threshold is hand-tuned and manually updated.
• Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
• Not sensitive enough: thresholds take a long time to catch slow degradations.
• Poor UX: configuring really good alerts requires specialized knowledge of the query language.
• No guidance: system doesn’t offer automated root cause exploration.


Alerting based on metrics: Intelligent Monitoring

• Zero config: thresholds are set and maintained automatically.
• Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
• Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
• Integrated UX: works within our existing telemetry and alert configuration systems.
• Helpful: automated root cause analysis.

In short, the only input is a list of business-critical metrics.


Alerting based on metrics: Dynamic Thresholds

The max lower threshold exceeds the min upper threshold
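
The caption makes the case against static limits: when a metric grows, the highest lower bound it needs ends up above the lowest upper bound it needs, so no single fixed band fits the whole period. The sketch below illustrates this with a generic rolling mean plus or minus `k` standard deviations. This is a swapped-in textbook technique on made-up data, not the algorithm described in the talk.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def dynamic_thresholds(series: List[float], window: int = 5,
                       k: float = 3.0) -> List[Tuple[float, float]]:
    """Return a (lower, upper) band per point, derived from the trailing window."""
    bands = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        bands.append((mu - k * sigma, mu + k * sigma))
    return bands


# Made-up metric with steady underlying growth.
series = [100.0 + 10.0 * i for i in range(30)]
bands = dynamic_thresholds(series)

max_lower = max(lower for lower, _ in bands)
min_upper = min(upper for _, upper in bands)

# True means no single static (lower, upper) pair could cover every band.
print(max_lower > min_upper)
```

Because each band is recomputed from recent history, the thresholds follow the growth instead of needing manual retuning.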


Alerting based on metrics: Outage Detection

< 1% outages missed.

6.5 out of 10 alerts are true issues.


Alerting based on metrics: F3

stats.foo

anomalies(stats.foo)

On-Call Dashboard

We are hiring!
