The math behind big systems analysis.

43
A tour through mathematical methods on systems telemetry Math in Big Systems If it was a simple math problem, we’d have solved all this by now.

description

This presentations discusses the challenges in processing real-time, high-frequency time-series data for anomaly detection.

Transcript of The math behind big systems analysis.

Page 1: The math behind big systems analysis.

A tour through mathematical methods on systems telemetry

Math in Big SystemsIf it was a simple math problem, we’d have solved all this by now.

Page 2: The math behind big systems analysis.

The many faces of

Theo Schlossnagle @postwait

CEO Circonus

Page 3: The math behind big systems analysis.

Choosing an approach is premature…

Problems

How do we determine the type of signal?

How do we manage off-line modeling?

How do we manage online fault detection?

How do we reconcile so users don’t hate us?

How we we solve this in a big-data context?

Page 4: The math behind big systems analysis.

Picking

an Approach

Statistical?

Machine learning?

Supervised?

Ad-hoc?

ontological? (why it is what it is)

Page 5: The math behind big systems analysis.

tl;dr

Apply PhDsApply PhDsRinseWashRepeat

Page 6: The math behind big systems analysis.

Garbage in, category out.

ClassificationUnderstanding a signal

We found to be quite ad-hoc

At least the feature extraction

Page 7: The math behind big systems analysis.

A year of service… I should be able to learn something.

API requests/second 1 year

Page 8: The math behind big systems analysis.

A year of service… I should be able to learn something.

API requests 1 year

Page 9: The math behind big systems analysis.

A year of service… I should be able to learn something.

API requests 1 year∆v

∆t, ∀∆v ≥ 0

Page 10: The math behind big systems analysis.

Some data goes both ways…

Complicating Things

Imagine disk space used…

it makes sense as a gauge (how full)

it makes sense as rate (fill rate)

Page 11: The math behind big systems analysis.

Error + error + guessing = success

How we categorize

Human identify a variety of categories.

Devise a set of ad-hoc features.

Bayesian model of features to categories.

Human tests.

https://www.flickr.com/photos/chrisyarzab/5827332576

Page 12: The math behind big systems analysis.

Many signals have significant noise around their averages

Signal NoiseA single “obviously wrong” measurement…

is often a reasonable outlier.

Page 13: The math behind big systems analysis.

A year of service… I should be able to learn something.

API requests/second 1 year

Page 14: The math behind big systems analysis.

At a resolution where we witness: “uh oh”

API requests/second 4 weeks

Page 15: The math behind big systems analysis.

But, are there two? three?

API requests/second 4 weeks

Is that super interesting?

Page 16: The math behind big systems analysis.

Bring the noise!

API requests/second 2 days

Page 17: The math behind big systems analysis.

Think about what this means… statistically

API requests/second 1 yearenvelope of ±1 std dev

Page 18: The math behind big systems analysis.

Lies, damned lies, and statistics

Simple Truths

Statistics are only really useful withp-values are low.

p ≤ 0.01 : very strong presumption against null hyp.0.01 < p ≤ 0.05 : strong presumption against null hyp.0.05 < p ≤ 0.1 : low presumption against null hyp.p > 0.1 : no presumption against the null hyp.

from xkcd #882 by Randall Munroe

Page 19: The math behind big systems analysis.

What does a p-value have to do with applying stats?

The p-value problem It turns out a lot of measurement data (passive) is very infrequent.

60% of the time…

it works every time.

Page 20: The math behind big systems analysis.

Our low frequencies lead us to

questions of doubt…

Given a certain statistical model:

How many few points need to be seen before we are sufficiently confident that it does not fit the model (presumption against the null hypothesis)?

With few, we simply have outliers or insignificant aberrations.

http://www.flickr.com/photos/rooreynolds/

Page 21: The math behind big systems analysis.

Solving the Frequency Problem

More data, more often… (obviously)

1. sample faster (faster from the source)

2. analyze wider (more sources)

OR

Page 22: The math behind big systems analysis.

Increasing frequency is the only option at times.

Signals of Importance Without large-scale systems

We must increase frequency

https://www.flickr.com/photos/whosdadog/3652670385

Page 23: The math behind big systems analysis.

Most algorithms require measuring residuals from a mean

Mean means Calculating means is “easy”

There are some pitfalls

What do mean that my mean is mean? Why can’t math be nice to people?

https://www.flickr.com/photos/imagesbywestfall/3606314694

Page 24: The math behind big systems analysis.

Newer data should influence our model.

Signals change

The model needs to adapt.

Exponentially decaying averages are quite common in online control systems and used as a basis for creating control charts.

Sliding windows are a bit more expensive.

Page 25: The math behind big systems analysis.

EWM vs. SWM❖ EWM : ST

❖ S1 = V1

❖ ST = 𝛼VT + (1-𝛼)ST-1

❖ Low computation overhead

❖ Low memory usage

❖ Hard to repeat offline

❖ SMW tracking the last N values : ST

❖ ST = ST-1 - VT-N/n + VT/n

❖ Low computational overhead

❖ High memory usage

❖ Easy to repeat offline

Page 26: The math behind big systems analysis.

Repeatable outcomes are needed

In our system…

We need our online algorithms to match our offline algorithms.

This is because human beings get pissed off when they can’t repeat outcomes that woke them up in the middle of the night.

EWM: not repeatableSWM: expensive in online application

Page 27: The math behind big systems analysis.

Repeatable, low-cost sliding windows

Our solution:lurching windows

fixed rolling windowsof fixed windows

SWM + EWM

Page 28: The math behind big systems analysis.

Putting it all together… How to test if we don’t match our model?

Page 29: The math behind big systems analysis.

Hypothesis Testing

Page 30: The math behind big systems analysis.

The CUSUM Method

Page 31: The math behind big systems analysis.

Applying CUSUM

API requests/second 4 weeksCUSUM Control Chart

Page 32: The math behind big systems analysis.

Can we do better?

Investigations

The CUSUM method has some issues. It’s challenging when signals are noise or of variable rate.

We’re looking into the Tukey test:• compares all possible pairs of means• test is conservative in light of uneven sample sizes

https://www.flickr.com/photos/st3f4n/4272645780

Page 33: The math behind big systems analysis.

Most statistical methods assume a normal distribution.

Your telemetry data hasnever been evenly distributed

All this “deviation from the mean”

assumes some symmetric distribution.

All that work and you tell me *now* that I don’t have a normal distribution?

Statistics suck.

https://www.flickr.com/photos/imagesbywestfall/3606314694

Page 34: The math behind big systems analysis.

Think about what this means… statistically

Rather obviouslynot a normal distribution

The average minus the standard deviation is less than zero on a measurement that can only be ≥ 0.

Page 35: The math behind big systems analysis.

With all the data, we have a complete picture of population distribution.

What if we had all the data? … full distribution

Now we can do statisticsthat isn’t hand-wavy.

Page 36: The math behind big systems analysis.

High volume data requires a different strategy

What happens whenwe get what we asked for?

10,000 measurements per second? more?

on each stream…

with millions of streams.

Page 37: The math behind big systems analysis.

Let’s understand the scope of the problem.

First some realities

This is 10 billion to 1 trillion measurements per second.

At least a million independent models.

We need to cheat.

https://www.flickr.com/photos/thost/319978448

Page 38: The math behind big systems analysis.

When we have too much, simplify…

Information compressionWe need to look to transform the data.

Add error in the value space.Add error in the time space.

https://www.flickr.com/photos/meddygarnet/3085238543

Page 39: The math behind big systems analysis.

Summarization & Extraction

❖ Take our high-velocity stream

❖ Summarize as a histogram over 1 minute (error)

❖ Extract useful less-dimensional characteristics

❖ Apply CUSUM and Tukey tests on characteristics

Page 40: The math behind big systems analysis.

Lorem Ipsum Dolor

Modes & moments. Strong indicators of shifts in workload

http://www.brendangregg.com/FrequencyTrails/modes.html

Page 41: The math behind big systems analysis.

Lorem Ipsum Dolor

Quantiles… Useful if you understand the problem domain and the expected distribution.

https://metamarkets.com/2013/histograms/

Page 42: The math behind big systems analysis.

Lorem Ipsum Dolor

Inverse Quantiles…Q: “What quantile is 5ms of latency?”

Useful if you understand the problem domain and the expected distribution.

Page 43: The math behind big systems analysis.

Thank ou! and thank you to the real brains behind Circonus’ analysis systems: Heinrich and George.