Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)
Metrics at scale @UBER
Mantas Klasavičius
About Me: Senior software engineer @ Uber
<metric_path> <value> <timestamp>
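The `<metric_path> <value> <timestamp>` shape above matches Graphite's plaintext protocol, where each sample is one newline-terminated line. A minimal sketch of building such a line follows; the function name `format_metric_line` and the example metric path are assumptions for illustration, not names from the talk.

```python
import time


def format_metric_line(metric_path, value, timestamp=None):
    """Build one plaintext sample line: '<metric_path> <value> <timestamp>\\n'."""
    # The three fields are space-separated, so the path itself cannot contain whitespace.
    if any(ch.isspace() for ch in metric_path):
        raise ValueError("metric path must not contain whitespace")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{metric_path} {value} {int(timestamp)}\n"


if __name__ == "__main__":
    # Hypothetical metric path, fixed timestamp for reproducibility.
    print(format_metric_line("stats.counts.example.requests", 42, 1500000000), end="")
```

Each distinct `metric_path` becomes its own time series, which is why the path naming matters so much at scale.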
UBER: 6 continents, 72 countries, 425 cities
>5 million a day
>1000 engineers
7 years
UBER in Vilnius
3 years ago
>20 engineers
4 Teams:
- Observability
- Databases
- Foundations
- DevExp
Hypergrowth defines us...
Growth of Services
Metrics: Metrics @UBER are a first-class citizen
T0 Service
Handling ~500M telemetry timeseries
Writing ~3M values/sec and running ~1K queries/sec
50M minutes worth of data per sec
Growing >25% month over month
Metrics Collection: Graphite ~2013
Metrics Collection: Graphite 2015
Metrics Collection: Considered choices
Netflix Atlas
Blueflood
Update graphite
Metrics Collection: M3
Metrics Collection: Cassandra is a figure of epic tradition and of tragedy.
High write throughput
Cassandra's data model supports time-series storage via DTCS (DateTieredCompactionStrategy)
Cassandra's native TTL support
Metrics Collection: Cassandra - our use case
Separate clusters for different types of data
Clusters span multiple datacenters
Dynamically control to which cluster data is written
Forcibly deleting old data
https://github.com/m3db/m3db/
Metrics Collection: Metrics as a free resource
*.application_1431728998581_0361.*
*.Connections.10_30_3_24.0x64d11081baa1837.*
*.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*
*.check-<uid_or_uuid>.*
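The patterns above show path components that embed unique identifiers (application IDs, IP addresses, hex handles, UUIDs), so every new ID creates a new time series. A minimal sketch of flagging such components before they are emitted follows; the regexes and the name `flag_components` are illustrative assumptions, not rules described in the talk.

```python
import re

# Heuristic patterns for path components that look like unbounded identifiers.
# Order matters: the first matching pattern wins for each component.
SUSPICIOUS = {
    "uuid": re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I
    ),
    "ip": re.compile(r"^\d{1,3}(_\d{1,3}){3}$"),
    "hex": re.compile(r"^0x[0-9a-f]{8,}$", re.I),
    "long_number": re.compile(r"\d{10,}"),
}


def flag_components(metric_path):
    """Return (component, reason) pairs for components that look unbounded."""
    flagged = []
    for component in metric_path.split("."):
        for reason, pattern in SUSPICIOUS.items():
            if pattern.search(component):
                flagged.append((component, reason))
                break  # one reason per component is enough
    return flagged


if __name__ == "__main__":
    # Hypothetical full paths built around the fragments shown on the slide.
    for path in (
        "yarn.application_1431728998581_0361.memory",
        "svc.Connections.10_30_3_24.0x64d11081baa1837.count",
        "stats.counts.api.requests",
    ):
        print(path, flag_components(path))
```

A check like this could run in a client library or at ingestion; where to enforce it is a design choice the slides do not specify.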
Metrics Collection: Cost accounting and metrics about metrics
Metrics Visualization: M3 - Querying
Metrics Visualization: Grafana
Observability: Past, Present, and Future
Metrics Visualization: M3QL - Query Like It’s Bash
aggregate = fillNulls target | sum;
fetch name:requests.errors caller:cn
  | aggregate
  | asPercent (fetch name:requests caller:cn | aggregate)
  | anomalies
  | sort max
  | tail 10
Metrics Visualization: Graphite Way vs. M3QL
tail(
  sort(
    anomalies(
      asPercent(
        sum(fillNulls(stats.counts.cn.*.requests.errors)),
        sum(fillNulls(stats.counts.cn.*.requests))
      )
    ),
    max
  ),
  10
)
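The contrast between the nested Graphite style and the piped M3QL style is purely one of composition order. A minimal sketch in Python follows, using stand-in list operations rather than real query functions; `pipe`, `fill_nulls`, `sort_desc` and `head` are hypothetical names for illustration.

```python
from functools import reduce


def pipe(value, *stages):
    """Apply stages left to right, like 'fetch | f | g' in a shell pipeline."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


# Stand-in operations on a plain list of samples (None marks a missing point).
def fill_nulls(series):
    return [0 if v is None else v for v in series]


def sort_desc(series):
    return sorted(series, reverse=True)


def head(n):
    # Returns a stage that keeps the first n items.
    return lambda series: series[:n]


samples = [3, None, 7, 1]

# Nested style: read from the innermost call outwards.
nested = head(2)(sort_desc(fill_nulls(samples)))

# Pipeline style: read left to right, in execution order.
piped = pipe(samples, fill_nulls, sort_desc, head(2))

if __name__ == "__main__":
    print(nested, piped)
```

Both forms compute the same result; the pipeline form keeps each step next to its arguments, which is the readability point the slide makes.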
Alerting based on metrics: Query Based Alerting
graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
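The `warning_over` and `critical_over` parameters above suggest a two-level static threshold on the query result. A minimal sketch of that evaluation follows, assuming "over" means strictly greater than; the function `classify` is hypothetical and is not the alerting system's actual implementation.

```python
def classify(value, warning_over, critical_over):
    """Map a query result to an alert level using two static thresholds."""
    if warning_over > critical_over:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value > critical_over:
        return "critical"
    if value > warning_over:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Thresholds taken from the example alert; the sample values are made up.
    for sample in (0.05, 0.5, 12.0):
        print(sample, classify(sample, warning_over=0.1, critical_over=10.0))
```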
Alerting based on metrics: Classic Thresholding
Classic high / low thresholds have some intrinsic problems.
• Labor-intensive: each threshold is hand-tuned and manually updated.
• Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
• Not sensitive enough: thresholds take a long time to catch slow degradations.
• Poor UX: configuring really good alerts requires specialized knowledge of the query language.
• No guidance: system doesn’t offer automated root cause exploration.
Alerting based on metrics: Intelligent Monitoring
• Zero config: thresholds are set and maintained automatically.
• Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
• Rapid detection: an embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
• Integrated UX: works within our existing telemetry and alert configuration systems.
• Helpful: automated root cause analysis.
In short, the only input is a list of business-critical metrics.
Alerting based on metrics: Dynamic Thresholds
The max lower threshold exceeds the min upper threshold
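The slides do not describe how the dynamic thresholds are computed. As one common way to derive bounds from recent data instead of hand-tuning them, here is a minimal sketch using a median plus or minus a multiple of the median absolute deviation (MAD). This technique, the function names and the factor `k` are all assumptions for illustration, not the algorithm from the talk.

```python
from statistics import median


def dynamic_band(history, k=3.0):
    """Return (lower, upper) bounds derived from recent samples."""
    center = median(history)
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = median([abs(x - center) for x in history])
    return center - k * mad, center + k * mad


def is_anomalous(history, value, k=3.0):
    """True if value falls outside the band computed from history."""
    lower, upper = dynamic_band(history, k)
    return value < lower or value > upper


if __name__ == "__main__":
    recent = [10, 11, 9, 10, 12, 10, 11]  # made-up samples
    print(dynamic_band(recent))
    print(is_anomalous(recent, 12), is_anomalous(recent, 20))
```

Because each series is evaluated independently, a scheme like this parallelizes trivially across metrics. It does not handle seasonality or rollouts, which a production system would need to model separately.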
Alerting based on metrics: Outage Detection
< 1% outages missed.
6.5 out of 10 alerts are true issues.
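The two figures above correspond to the standard notions of miss rate (outages that raised no alert) and precision (alerts that pointed at a real issue). A minimal sketch of how such figures are computed from counts follows; the counts in the example are made up to reproduce the quoted ratios and are not data from the talk.

```python
def precision(true_alerts, false_alerts):
    """Fraction of fired alerts that corresponded to real issues."""
    return true_alerts / (true_alerts + false_alerts)


def miss_rate(detected_outages, missed_outages):
    """Fraction of real outages that produced no alert."""
    return missed_outages / (detected_outages + missed_outages)


if __name__ == "__main__":
    # Hypothetical counts: 13 of 20 alerts were real; 1 of 200 outages was missed.
    print(precision(true_alerts=13, false_alerts=7))
    print(miss_rate(detected_outages=199, missed_outages=1))
```

The two numbers trade off against each other: lowering the miss rate usually means alerting more eagerly, which lowers precision.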
Alerting based on metrics: F3
stats.foo
anomalies(stats.foo)
On-Call Dashboard
We are hiring!