Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)


Transcript of Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)

Page 1

Metrics at scale @UBER
Mantas Klasavičius

Page 2

About Me
Senior software engineer @ Uber

Page 3

About Me
Senior software engineer @ Uber

<metric_path> <value> <timestamp>
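For context, a minimal sketch of emitting one datapoint in this Graphite plaintext format, in Python; the host, port, and metric path here are illustrative, not Uber's actual endpoints:

import socket
import time

# One datapoint per line: <metric_path> <value> <timestamp>
line = "servers.host1.cpu.load 0.42 %d\n" % int(time.time())

# 2003 is Carbon's default plaintext listener port.
sock = socket.create_connection(("localhost", 2003))
sock.sendall(line.encode("ascii"))
sock.close()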

Page 4

UBER

6 continents, 72 countries, 425 cities

>5 million a day

>1000 engineers

7 years

Page 5

UBER in Vilnius

3y ago

>20 engineers
4 Teams:

- Observability

- Databases
- Foundations
- DevExp

Page 6

Hypergrowth defines us...

Growth of Services

Page 7

Metrics
Metrics are first-class citizens @UBER

T0 (Tier-0) Service

Handling ~500M telemetry timeseries

Writing ~3M values/sec and running ~1K queries/sec

50M minutes' worth of data per sec

Growing >25% month over month

Page 8

Metrics Collection
Graphite (~2013)

Page 9

Metrics Collection
Graphite (2015)

Page 10

Metrics Collection
Considered choices

Netflix Atlas

Blueflood

Update Graphite

Page 11

Metrics Collection
M3

Page 12

Metrics Collection
M3

Page 13

Metrics Collection
Cassandra is a figure of epic tradition and of tragedy.

High write throughput

Cassandra's data model suits a time-series data store (DTCS: DateTieredCompactionStrategy)

Cassandra's native TTL support (see the sketch below)
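A minimal sketch of what such a time-series table could look like, driven from Python with the DataStax cassandra-driver; the keyspace, schema, and TTL are assumptions for illustration, not Uber's actual layout:

from cassandra.cluster import Cluster

# Hypothetical 'metrics' keyspace on a local Cassandra node.
session = Cluster(["127.0.0.1"]).connect("metrics")

# DTCS compacts SSTables by time window, which fits append-mostly
# time-series writes; default_time_to_live expires old points natively.
session.execute("""
    CREATE TABLE IF NOT EXISTS datapoints (
        series_id text,
        ts        timestamp,
        value     double,
        PRIMARY KEY (series_id, ts)
    ) WITH compaction = {'class': 'DateTieredCompactionStrategy'}
      AND default_time_to_live = 1209600  -- two weeks, illustrative
""")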

Page 14

Metrics Collection
Cassandra - our use case

Separate clusters for different types of data

Clusters span multiple datacenters

Dynamic control over which cluster data is written to

Forcible deletion of old data

https://github.com/m3db/m3db/

Page 15

Metrics Collection
Metrics as a free resource

*.application_1431728998581_0361.*

*.Connections.10_30_3_24.0x64d11081baa1837.*

*.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*

*.check-<uid_or_uuid>.*
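Paths like these embed per-instance identifiers (YARN application IDs, IPs, hex handles, UUIDs), so every new instance mints a brand-new timeseries that must be stored. A hypothetical sketch of scrubbing such components before emission, in Python; the patterns and the "all" placeholder are illustrative:

import re

# Patterns for unbounded identifiers seen in metric paths (illustrative).
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)
IP_RE   = re.compile(r"\b\d{1,3}(?:_\d{1,3}){3}\b")  # dotted IPs often arrive underscore-escaped
HEX_RE  = re.compile(r"\b0x[0-9a-f]+\b", re.I)

def sanitize(path):
    # Collapse each unbounded identifier into a fixed token so the
    # number of distinct timeseries stays bounded.
    for pattern in (UUID_RE, IP_RE, HEX_RE):
        path = pattern.sub("all", path)
    return path

print(sanitize("ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.requests"))
# -> ply_all.requests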

Page 16

Metrics Collection
Cost accounting and metrics about metrics

Page 17

Metrics Visualization
M3 - Querying

Page 18

Metrics Visualization
Grafana

Page 19

Observability: Past, Present, and Future

Metrics Visualization

M3QL - Query Like It's Bash

aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn
| aggregate
| asPercent (fetch name:requests caller:cn | aggregate)
| anomalies
| sort max
| tail 10

The same query in nested Graphite style:

tail(
  sort(
    anomalies(
      asPercent(
        sum(fillNulls(stats.counts.cn.*.requests.errors)),
        sum(fillNulls(stats.counts.cn.*.requests))
      )
    ),
    max
  ),
  10
)
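The two forms are equivalent; the pipe syntax reads left to right, one stage at a time, like a shell pipeline, while the nested Graphite form has to be read inside out from the deepest parentheses, which is exactly the "Query Like It's Bash" point.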

Page 20

Metrics Visualization
Graphite Way vs. M3QL

Page 21


Alerting based on metrics
Query-Based Alerting

graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)

Page 22


Alerting based on metrics
Classic Thresholding

Classic high / low thresholds have some intrinsic problems.

• Labor-intensive: each threshold is hand-tuned and manually updated.
• Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there's an obvious pattern.
• Not sensitive enough: thresholds take a long time to catch slow degradations.
• Poor UX: configuring really good alerts requires specialized knowledge of the query language.
• No guidance: the system doesn't offer automated root cause exploration.

Page 23


Alerting based on metrics

Intelligent Monitoring

• Zero config: thresholds are set and maintained automatically.
• Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
• Rapid detection: an embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
• Integrated UX: works within our existing telemetry and alert configuration systems.
• Helpful: automated root cause analysis.

In short, the only input is a list of business-critical metrics.

Page 24


Alerting based on metrics

Dynamic Thresholds

The max lower threshold exceeds the min upper threshold.
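The slides don't spell out the algorithm, so here is a minimal sketch of one common way to compute dynamic upper/lower bands: a rolling mean plus or minus k standard deviations over a trailing window. The window size and k are illustrative assumptions; this is not Uber's actual detector:

import statistics

def dynamic_bands(series, window=60, k=3.0):
    # For each point past the warm-up window, build (lower, upper)
    # thresholds from the trailing window of history.
    bands = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.pstdev(hist)
        bands.append((mu - k * sigma, mu + k * sigma))
    return bands

def anomalies(series, window=60, k=3.0):
    # Flag points that escape the band built from the points before them.
    bands = dynamic_bands(series, window, k)
    return [not (lo <= x <= hi)
            for (lo, hi), x in zip(bands, series[window:])]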

Page 25


Alerting based on metrics
Outage Detection

<1% of outages missed (roughly 99% recall).

6.5 out of 10 alerts are true issues (roughly 65% precision).

Page 26


Alerting based on metrics
F3

stats.foo

anomalies(stats.foo)

Page 27

On-Call Dashboard

Page 28

We are hiring!

[email protected]