CMG15 Session 525

Views and opinions expressed in this presentation are solely of its authors and do not necessarily represent those of Alphabet, Inc or its subsidiaries, including Google, Inc. Select images and formulae are provided with permission from Google, Inc. Percentile-Based Approach To Forecasting Workload Growth. Alex Gilgur, Douglas Browning, Stephen Gunn, Xiaojun Di, Wei Chen, and Rajesh Krishnaswamy. IT Capacity and Performance 41st International Conference by the Computer Measurement Group (CMG'15), San Antonio, TX, November 5, 2015, Session 525

Transcript of CMG15 Session 525

Page 1: CMG15 Session 525

Views and opinions expressed in this presentation are solely of its authors and do not necessarily represent those of Alphabet, Inc or its subsidiaries, including Google, Inc.

Select images and formulae are provided with permission from Google, Inc

Percentile-Based Approach To Forecasting Workload Growth

Alex Gilgur, Douglas Browning, Stephen Gunn, Xiaojun Di, Wei Chen, and Rajesh Krishnaswamy

IT Capacity and Performance 41st International Conference by the Computer Measurement Group (CMG'15)

San Antonio, TX November 5, 2015 Session 525

Page 2: CMG15 Session 525

How often do you see such patterns?

How do you predict them?

Page 3: CMG15 Session 525

“What's in a name? That which we call a rose / By any other name / Would smell as sweet”

● Useful Workload Measures:
  ○ Outlier Boundary
  ○ Mean + “Z” Standard Deviations (e.g. “3 sigma”, “6 sigma”, etc.)
  ○ 95th percentile
  ○ 90th percentile
  ○ 75th percentile
  ○ Simple Average (Mean)
  ○ Median
  ○ 25th percentile

● Define Workload via Little's Law:
  ○ Number of units of work in the system
    ■ Number of packets in flight
    ■ Number of queries in queue
  ○ W = X * T
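As a small worked example of W = X * T, the sketch below reads X as throughput and T as average time in the system (latency). The slide does not spell out the symbols, so that reading and the numbers used are assumptions.

```python
def workload(throughput: float, time_in_system: float) -> float:
    """Little's Law: average units of work in the system, W = X * T."""
    return throughput * time_in_system


# Hypothetical numbers: 400 queries/s, each spending 0.25 s in the system.
queries_in_system = workload(throughput=400.0, time_in_system=0.25)
print(queries_in_system)  # 100.0
```

The same call covers the packets-in-flight case: pass packets per second and the average time a packet spends in flight.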

Page 4: CMG15 Session 525

What We Will Talk About

● Workload Forecasting
  ○ When Classical Methods Fail
  ○ Workload statistics:
    ■ “Z-sigma”
    ■ Quantile Regression
      ● p95
      ● Outlier Boundaries
  ○ Knees, Hyperbolae, and Sensitivity
  ○ Quantile Compression
  ○ Predicting the Workload
  ○ Use Cases

Page 5: CMG15 Session 525

“It may be normal, darling; but I'd rather be natural.” Truman Capote, Breakfast at Tiffany’s

Usual Assumptions:

● Residuals are Normally distributed
● Mean and StDev of residuals are constant

Page 6: CMG15 Session 525

“Double, Double, Toil and Trouble”

Page 7: CMG15 Session 525

Solutions

And the Winner Is...

Page 8: CMG15 Session 525

… Predicting 95th Percentile Directly !

Page 9: CMG15 Session 525

Why not Split the Workload into Two Servers?

● Sometimes app1 and app2 have to go to the same server for processing.

● To size the server (VM/Network Link/Storage LUN/… ), we only want to forecast the upper bound.

● Sometimes it’s hard to get additional capacity:
  ○ budget, justification, approvals, etc.
  ○ Cloud helps, but...

Percentile-based Modeling is often the only solution

Page 10: CMG15 Session 525

QuantReg for the 2 types of workload

fits very nicely in both cases

Page 11: CMG15 Session 525

Should we Size Hardware for 95%-ile of Workload?

5% of the time SLA (99.9%? 99.999%?) will be violated

Page 12: CMG15 Session 525

IQR vs. 5th and 95th Percentiles

IQR excludes “true” outliers

[Figure panels: IQR (Tukey’s method); 5th and 95th Percentiles]

Page 13: CMG15 Session 525

Shouldn’t we Size Resources for Non-Outliers Instead?

John Tukey’s IQR method

● Why?
  ○ Normality does not matter
  ○ 5th & 95th percentiles in this scenario would have “outlawed” a good part of data points that are NOT outliers
● Why Not?
  ○ Multi-Modal distributions

We size for SLA, as long as traffic stays within outlier boundaries
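The contrast between Tukey's boundaries and fixed 5th/95th percentiles can be sketched as follows. The sample data and helper names are hypothetical, and the conventional fence multiplier k = 1.5 is assumed because the slides do not state one.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Tukey's outlier boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def percentile_bounds(data, lo=5, hi=95):
    """Fixed percentile bounds, for comparison."""
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return cuts[lo - 1], cuts[hi - 1]


# Hypothetical throughput samples: steady traffic 90..110 plus one spike.
samples = [float(v) for v in range(90, 111)] + [400.0]

low_fence, high_fence = tukey_fences(samples)
p5, p95 = percentile_bounds(samples)

outliers = [v for v in samples if v < low_fence or v > high_fence]
outside_p5_p95 = [v for v in samples if v < p5 or v > p95]

print("Tukey fences:", low_fence, high_fence, "flagged:", outliers)
print("p5/p95:", p5, p95, "flagged:", outside_p5_p95)
```

On this data the fences flag only the spike, while the p5/p95 bounds also cut off ordinary points at both ends of the steady range.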

Page 14: CMG15 Session 525

Tukey’s Boundaries in Trended Data

Page 15: CMG15 Session 525

Tukey’s Boundaries in Trended Data

Page 16: CMG15 Session 525

Tukey’s Boundaries in Trended Data

Page 17: CMG15 Session 525

● don’t predict based on all data:
  ○ find natural groupings (GMM, DBSCAN, ...)
  ○ then fit the model
  ○ use the higher cluster to guarantee QoS

Tukey’s Method
● robust boundaries
● distribution-agnostic
● can be used to guarantee high QoS

○ Unimodal Distribution
○ Multi-Modal Distribution
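For the multi-modal case, the grouping step might look like the sketch below. It swaps in a plain one-dimensional 2-means split as a stand-in for GMM or DBSCAN so that it runs on the standard library alone; the data and function names are hypothetical.

```python
import statistics


def two_means_split(data, iterations=50):
    """Split 1-D data into (low, high) clusters with a basic 2-means loop."""
    lo_c, hi_c = min(data), max(data)
    if lo_c == hi_c:  # nothing to split
        return list(data), []
    low, high = [], []
    for _ in range(iterations):
        low = [v for v in data if abs(v - lo_c) <= abs(v - hi_c)]
        high = [v for v in data if abs(v - lo_c) > abs(v - hi_c)]
        new_lo, new_hi = statistics.fmean(low), statistics.fmean(high)
        if (new_lo, new_hi) == (lo_c, hi_c):
            break  # centroids stopped moving
        lo_c, hi_c = new_lo, new_hi
    return low, high


def upper_fence(data, k=1.5):
    """Tukey's upper outlier boundary, Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


# Hypothetical bimodal traffic: a quiet mode and a busy mode.
quiet = [float(v) for v in range(10, 21)]
busy = [float(v) for v in range(100, 121)]
low_cluster, high_cluster = two_means_split(quiet + busy)

# Size from the higher cluster only, as the slide suggests.
sizing_bound = upper_fence(high_cluster)
print(len(low_cluster), len(high_cluster), sizing_bound)
```

A real workload would more likely need GMM or DBSCAN, as the slide names, because the number of modes is rarely known to be two.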

Page 18: CMG15 Session 525

Long Story Short

● Forecast Percentiles

● Find Natural Groupings

● Size For Outlier Boundaries

Page 19: CMG15 Session 525

What We Will Talk About Next

● Workload Forecasting
  ○ When Classical Methods Fail
  ○ Workload statistics:
    ■ “Z-sigma”
    ■ Quantile Regression
      ● p95
      ● Outlier Boundaries
  ○ Knees, Hyperbolae, and Sensitivity
  ○ Quantile Compression
  ○ Predicting the Workload
  ○ Use Cases

Page 20: CMG15 Session 525

Knees, Hyperbolae, and Sensitivity

Capacity:

Workload:

In a closed (constrained) system, sensitivity of throughput to latency increases with the throughput

Page 21: CMG15 Session 525

In Human Terms

● As throughput increases, latency can only increase.

● As latency increases, throughput in a constrained queueing system can only decrease.

● As we increase throughput in a constrained system near its saturation point, its upper percentiles must grow at a slower pace than lower percentiles.

Page 22: CMG15 Session 525

In Mathematical Terms

Quantile Compression Theorem:

IF raw demand on a constrained system X′ is moderated via a monotonically increasing damped function X = f(X′),

THEN, as the system is approaching saturation, smaller percentiles of moderated demand X grow on average faster than higher percentiles.

This is only a presentation; for mathematical proof, please see the paper.
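A numeric illustration (not a proof) of the statement above: the sketch assumes one particular damped function, X = C·X′/(C + X′), and a lognormal-shaped raw demand. Neither choice comes from the slides, and the capacity C and the scale factors are hypothetical.

```python
import math
import statistics


def moderated(raw, capacity):
    """Assumed damping: monotonically increasing, saturating at `capacity`."""
    return capacity * raw / (capacity + raw)


def percentiles(data, wanted=(50, 95)):
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in wanted}


# Deterministic lognormal-shaped demand profile (median near 1.0).
normal = statistics.NormalDist()
base = [math.exp(0.5 * normal.inv_cdf((i + 0.5) / 1000)) for i in range(1000)]

capacity = 100.0
transforms = {
    "raw": lambda x: x,
    "moderated": lambda x: moderated(x, capacity),
}

# Growth factor of each percentile when raw demand scales from 50x to 200x.
growth = {}
for name, transform in transforms.items():
    early = percentiles([transform(50.0 * b) for b in base])
    late = percentiles([transform(200.0 * b) for b in base])
    growth[name] = {p: late[p] / early[p] for p in early}

print(growth)
```

Raw demand grows by the same factor at every percentile, while the moderated p95 grows by a smaller factor than the moderated p50, which is the compression effect.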

Page 23: CMG15 Session 525

Long Story Short

“It’s just there...” -Miles Davis

Quantile Compression: As the system is approaching saturation, smaller percentiles of moderated demand X grow on average faster than higher percentiles.

Page 24: CMG15 Session 525

In Practical Terms: are We Constrained?

Percentile trajectories diverge; we are NOT constrained here.

Page 25: CMG15 Session 525

In Practical Terms: are We Constrained?

p5 and p95 trajectories converge; we ARE GETTING constrained here.

Page 26: CMG15 Session 525

In Practical Terms: are We Constrained?

Percentile trajectories are almost all parallel; we are almost NOT constrained here.

“It’s always the quiet ones”

Page 27: CMG15 Session 525

In Practical Terms: are We Constrained?

Unconstrained Growth Rates: p97.5′ > p95′ > p75′ > p50′

p95 trajectory is growing slower than p50; we ARE constrained here.

Page 28: CMG15 Session 525

In Practical Terms: are We Constrained?

Predictions made: p75′ > p95′ > p50′ > p97.5′

[Figure: lines predicted by p50, p75, p95, and p97.5]

Unconstrained Growth Rates: p97.5′ > p95′ > p75′ > p50′

p95 trajectory is growing slower than p50; we ARE constrained here.

Page 29: CMG15 Session 525

In Practical Terms: are We Constrained?

Predictions made: p75′ > p95′ > p50′ > p97.5′

[Figure: lines predicted by p50, p75, p95, and p97.5]

Unconstrained Growth Rates: p97.5′ > p95′ > p75′ > p50′

Observed Growth Rates: p75′ > p95′ > p50′ > p97.5′

p95 trajectory is growing slower than p50; we ARE constrained here.
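The check described on these slides, comparing growth rates of percentile trajectories, could be sketched as below. It assumes straight-line trends and uses synthetic windows; the damping function and all constants are hypothetical.

```python
import statistics


def slope(ys):
    """Least-squares slope of ys against time steps 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def percentile_slopes(windows, wanted=(50, 75, 95)):
    """Growth rate of each percentile trajectory across time windows."""
    series = {p: [] for p in wanted}
    for window in windows:
        cuts = statistics.quantiles(window, n=100, method="inclusive")
        for p in wanted:
            series[p].append(cuts[p - 1])
    return {p: slope(values) for p, values in series.items()}


def looks_constrained(slopes, low=50, high=95):
    """Upper percentile growing slower than the median suggests a constraint."""
    return slopes[high] < slopes[low]


# Synthetic demand: the same profile scaled up a little in every window.
profile = [float(v) for v in range(50, 151)]
free_windows = [[(1 + 0.2 * t) * v for v in profile] for t in range(10)]

# The same demand passed through an assumed damping with capacity 100.
damped_windows = [[100.0 * x / (100.0 + x) for x in w] for w in free_windows]

free_slopes = percentile_slopes(free_windows)
damped_slopes = percentile_slopes(damped_windows)
print(free_slopes, looks_constrained(free_slopes))
print(damped_slopes, looks_constrained(damped_slopes))
```

With real data the slopes are noisy, so a margin or a significance test on the difference would be a sensible addition before declaring a constraint.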

Page 30: CMG15 Session 525

In Statistical Terms: When Resource is Unconstrained

Unbounded Resource Throughput: Unimodal; Asymmetric; Skew is Constant

Page 31: CMG15 Session 525

In Statistical Terms: When Resource is Constrained

Bounded (Constrained) Resource Throughput: may become Bimodal; Skew may vary

Page 32: CMG15 Session 525

Can we Measure Asymmetry?

Page 33: CMG15 Session 525

Not All Distributions are Easy to Deal With

What if... ...Mean and Variance are undefined?

Page 34: CMG15 Session 525

Not All Distributions are Easy to Deal With

What if... ...Mean and Variance are undefined?

Page 35: CMG15 Session 525

Percentiles win!

What if... ...Mean and Variance are undefined?
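One standard example of undefined mean and variance is the Cauchy distribution, which the slides do not name; it is used here only as an assumed illustration. The sketch draws standard Cauchy samples and compares the sample mean with the median and interquartile range.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable


def cauchy_sample(n):
    """Standard Cauchy draws via the inverse CDF, tan(pi * (u - 0.5))."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


results = {}
for n in (1_000, 10_000, 100_000):
    data = cauchy_sample(n)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    results[n] = {
        "mean": statistics.fmean(data),  # has no population value to converge to
        "median": median,                # population median is 0
        "iqr": q3 - q1,                  # population IQR is 2
    }
    print(n, results[n])
```

The median and IQR settle near 0 and 2 as the sample grows, while the sample mean has no such target and can jump whenever an extreme draw appears.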

Page 36: CMG15 Session 525

Long Story Short

When resource is constrained:

1. Distribution changes:
   a. becomes left-skewed
   b. becomes bimodal

2. Skew is very important

3. Percentile-based Skew is the preferable statistic
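The slides do not give the formula for the percentile-based skew. One well-known candidate is Bowley's quartile skewness, (Q3 + Q1 − 2·Q2) / (Q3 − Q1), sketched here as an assumption rather than as the authors' definition.

```python
import statistics


def quartile_skew(data):
    """Bowley's quartile skewness; lies in [-1, 1] and needs Q3 > Q1."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Hypothetical samples with a long right tail, a long left tail, and symmetry.
right_tailed = [1, 2, 3, 4, 5, 10, 20, 40, 80]
left_tailed = [100 - v for v in right_tailed]
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(quartile_skew(right_tailed))  # positive: right-skewed
print(quartile_skew(left_tailed))   # negative: left-skewed
print(quartile_skew(symmetric))     # 0.0
```

Because it uses only quartiles, this statistic stays defined even when the mean and variance are not.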

Page 37: CMG15 Session 525

Some More Examples that you May Have Seen Before

Page 38: CMG15 Session 525

Growth Was Constrained

Page 39: CMG15 Session 525

Unconstrained Growth

Right-Skewed (long right tail)

Page 40: CMG15 Session 525

Controlled Growth

Left-Skewed (long left tail)

Page 41: CMG15 Session 525

What We Will Talk About Next

● Workload Forecasting
  ○ When Linear Regression fails
  ○ Workload statistics:
    ■ “Z-sigma”
    ■ Quantile Regression
      ● p95
      ● Outlier Boundaries
  ○ Knees, Hyperbolae, and Sensitivity
  ○ Quantile Compression
  ○ Predicting the Workload
  ○ Use Cases

Page 42: CMG15 Session 525

“None that I know will be, much that I fear may chance”

● Regression:
  ○ Business Metrics
  ○ Little's Law
  ○ Time-related Covariates

● Time Series Analysis (Forecasting):
  ○ EWMA
  ○ ARIMA

Page 43: CMG15 Session 525

Is it right to Size Resources Using Upper Percentiles of Bounded Data?

Forecasting demand using bounded data leads to undersizing the resource

Doing so is the path to the dark side.

Resource Constraint => Quantile Compression => Underforecasting the load => Undersizing the resource

Quantile Compression: As the system is approaching saturation, smaller percentiles of moderated demand X grow on average faster than higher percentiles.

Page 44: CMG15 Session 525

Can we infer unbounded lines from bounded data?

[Figure: Skew vs. TimeStamp]

1. Find Skew for Unbounded Data

2. Forecast Upper and Lower Percentiles to the Time Horizon of Interest

3. Infer Unbounded Upper Percentiles (Skew = const)

4. If (unbounded = forecasted) => system is still unbounded

5. If (unbounded > forecasted & forecast > history) => system will be constrained
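Steps 3 to 5 can be sketched under explicit assumptions. The skew is taken to be Bowley's quartile skewness (the slides do not define it), the "upper percentile" is taken to be Q3, and "equal" is read as "within a relative tolerance". All names and numbers are hypothetical.

```python
def unbounded_upper_quartile(q1, median, skew):
    """Solve (Q3 + Q1 - 2*M) / (Q3 - Q1) = skew for Q3, holding skew constant."""
    if not -1 < skew < 1:
        raise ValueError("quartile skew must lie strictly between -1 and 1")
    return (2 * median - q1 * (1 + skew)) / (1 - skew)


def classify(unbounded, forecasted, historical, rel_tol=0.05):
    """Apply steps 4 and 5; the tolerance is an assumption, not from the slides."""
    if abs(unbounded - forecasted) <= rel_tol * abs(unbounded):
        return "still unbounded"
    if unbounded > forecasted and forecasted > historical:
        return "will be constrained"
    return "inconclusive"


# Hypothetical inputs: skew measured while the system was unbounded,
# plus forecasts of the lower quartile and the median at the horizon.
skew = 0.2
q1_forecast, median_forecast = 80.0, 100.0
q3_unbounded = unbounded_upper_quartile(q1_forecast, median_forecast, skew)

print(q3_unbounded)
print(classify(q3_unbounded, forecasted=129.0, historical=110.0))
print(classify(q3_unbounded, forecasted=118.0, historical=110.0))
```

The first call treats a direct forecast of 129 as matching the inferred value of about 130; the second flags a direct forecast of 118 as compressed.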

Page 45: CMG15 Session 525

Throughput Forecasting Algorithm

[Flowchart: Start; Get U(t) data; Build hourly boxplots; Identify the most appropriate trend type; For each timestamp: Predict Trajectories for the LB (p25) and Median, (LB’, M’) = Prediction for Low Bounds and Median; Save the forecast; Done]

[Figure: two Throughput plots]

Page 46: CMG15 Session 525

Trend Type Selection

Identify the most appropriate trend type

[Figure: Throughput fitted with five trend types: LIN, LOG, EXP, QUAD, PWR; R² values shown: 0.45, 0.34, 0.47, 0.38, 0.46]

● we know the variance is huge
● we are selecting TREND TYPE
● we are NOT selecting MODEL

Page 47: CMG15 Session 525

A few words about R²

Now we can use T-test

Page 48: CMG15 Session 525

A few words about R²

Now we can use T-test

[Figure: Throughput fitted with each trend type]

● LIN: R² = 0.45
● QUAD: R² = 0.46
● EXP: R² = 0.47
● LOG: R² = 0.34
● PWR: R² = 0.38

Page 49: CMG15 Session 525

Trend Type Selection

[Figure: Throughput with LIN trend, R² = 0.45]

● we know the variance is huge
● we are selecting TREND TYPE
● we are NOT selecting MODEL

MODELS: “LIN”, “PWR”, “EXP”, “LOG”, “QUAD”

Identify the most appropriate trend type
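Trend type selection can be sketched by fitting each candidate through a linearizing transform and ranking by R² in the original units. The sketch covers LIN, LOG, EXP, and PWR only; QUAD needs a two-regressor fit and is left out to keep the example short. The transforms are the textbook ones, assumed here rather than taken from the slides, and both t and y must be positive.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(ys, fitted):
    y_mean = statistics.fmean(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def identity(v):
    return v


# name: (transform of t, transform of y, inverse of the y transform)
TRENDS = {
    "LIN": (identity, identity, identity),   # y = a + b*t
    "LOG": (math.log, identity, identity),   # y = a + b*ln(t)
    "EXP": (identity, math.log, math.exp),   # ln(y) = a + b*t
    "PWR": (math.log, math.log, math.exp),   # ln(y) = a + b*ln(t)
}


def rank_trend_types(ts, ys):
    """Return (trend type, R^2) pairs, best first."""
    scores = {}
    for name, (fx, fy, inverse) in TRENDS.items():
        slope, intercept = fit_line([fx(t) for t in ts], [fy(y) for y in ys])
        fitted = [inverse(intercept + slope * fx(t)) for t in ts]
        scores[name] = r_squared(ys, fitted)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical noiseless series, used only to show the ranking.
ts = list(range(1, 31))
print(rank_trend_types(ts, [5 * math.exp(0.1 * t) for t in ts])[0])
print(rank_trend_types(ts, [3 + 2 * t for t in ts])[0])
```

On noisy throughput the R² values will be close together, as on the slides, so the ranking is a hint about trend type rather than a model verdict.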

Page 50: CMG15 Session 525

Long Story Short

Forecasting Algorithm:

1. Compute the Skew

2. Identify the Trend Type

3. Forecast p25 and p50

4. Apply Skew to Compute Upper Percentiles

5. Compute Outlier Boundaries
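The five steps above can be strung together in one short sketch. It assumes a linear trend type, Bowley's quartile skewness as the skew statistic, p75 as the upper percentile, and Tukey's k = 1.5 for the boundary; none of these specifics are stated on the slide, and the input windows are synthetic.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast_upper_boundary(windows, horizon, k=1.5):
    """Forecast the upper outlier boundary at time step `horizon`."""
    quartiles = [statistics.quantiles(w, n=4, method="inclusive") for w in windows]

    # Step 1: typical skew of the history (assumed to be unbounded history).
    skew = statistics.median(
        [(q3 + q1 - 2 * q2) / (q3 - q1) for q1, q2, q3 in quartiles]
    )

    # Steps 2 and 3: linear trend assumed; forecast p25 and p50.
    steps = list(range(len(windows)))
    slope_q1, icpt_q1 = fit_line(steps, [q[0] for q in quartiles])
    slope_q2, icpt_q2 = fit_line(steps, [q[1] for q in quartiles])
    q1_f = icpt_q1 + slope_q1 * horizon
    q2_f = icpt_q2 + slope_q2 * horizon

    # Step 4: apply the constant skew to recover the upper quartile.
    q3_f = (2 * q2_f - q1_f * (1 + skew)) / (1 - skew)

    # Step 5: Tukey's upper outlier boundary.
    return q3_f + k * (q3_f - q1_f)


# Synthetic history: a symmetric profile growing 10% of its base per step.
profile = [float(v) for v in range(80, 121)]
history = [[(1 + 0.1 * t) * v for v in profile] for t in range(10)]
boundary = forecast_upper_boundary(history, horizon=20)
print(boundary)
```

Each window needs Q3 > Q1 for the skew to be defined, and the trend step would normally use whichever trend type was selected rather than a fixed straight line.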

Page 51: CMG15 Session 525

What We Will Talk About Next

● Workload Forecasting
  ○ When Classical Methods Fail
  ○ Workload statistics:
    ■ “Z-sigma”
    ■ Quantile Regression
      ● p95
      ● Outlier Boundaries
  ○ Knees, Hyperbolae, and Sensitivity
  ○ Quantile Compression
  ○ Predicting the Workload
  ○ Use Cases

Page 52: CMG15 Session 525

Use Cases: Unbounded: How Far to the Threshold?

[Figure: two Throughput plots with a Threshold line; values shown: 1000, 250]

Non-Outliers above threshold

Pr {traffic > threshold} > 5%

Page 53: CMG15 Session 525

Another interesting scenario: Forecasting Resource Congestion Zone

Page 54: CMG15 Session 525

Forecasting Resource Congestion Zone

By predicting collision points for different percentiles, we can get a general idea of a Resource Congestion Zone

HAL9000: I've just picked up a fault in the AE35 unit. It's going to go 100% failure in 72 hours.
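A congestion zone estimate might be sketched as below. It reads "collision point" as the time at which a forecast percentile line meets the resource limit, which is an interpretation of the slide; the linear trajectories and the capacity are hypothetical.

```python
def crossing_time(slope, intercept, capacity):
    """Time at which a linear percentile trajectory reaches `capacity`."""
    if slope <= 0:
        return None  # a flat or falling trajectory never reaches the limit
    return (capacity - intercept) / slope


def congestion_zone(trajectories, capacity):
    """Earliest and latest collision times across percentile trajectories."""
    times = {
        p: crossing_time(slope, intercept, capacity)
        for p, (slope, intercept) in trajectories.items()
    }
    hits = [t for t in times.values() if t is not None]
    zone = (min(hits), max(hits)) if hits else None
    return zone, times


# Hypothetical fitted lines: percentile -> (slope per day, value today).
trajectories = {50: (2.0, 100.0), 75: (3.0, 130.0), 95: (5.0, 200.0)}
zone, times = congestion_zone(trajectories, capacity=500.0)
print(zone)   # (60.0, 200.0)
print(times)
```

The zone opens when the fastest-rising upper percentile reaches the limit and closes when the median does.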

Page 55: CMG15 Session 525

Use Cases: Unbounded: How Much to Add?

(unbounded = forecasted) => system is still unbounded

Page 56: CMG15 Session 525

Use Cases: Bounded (Congested): How Much to Add?

(unbounded > forecasted) => system was, and will be, constrained

Page 57: CMG15 Session 525

Use Cases: Bounded (Congested): How Much to Add?

(unbounded > forecasted) => system may have been, and will be, constrained

Page 58: CMG15 Session 525

Long Story Short

● Feedback Loop & Quantile Compression:
  ○ “It’s just there”:
    ■ explicitly, via the protocol.
    ■ implicitly, in the saturation dynamics.

● Do not assume anything!
  ○ Especially about shapes of distributions.

● Do not forecast p95!
  ○ Forecast Outlier Boundaries instead.
  ○ Mean and Variance are overrated!

● Do Size Hardware for the would-have-been-unbounded Forecasts

Page 59: CMG15 Session 525

Alex Gilgur


+1 (408) 475-7582 / +1 (408) 828-2115

Page 60: CMG15 Session 525

Appendix

Page 61: CMG15 Session 525

“Big 7” of Linearizable Equations

[Table of equations not recovered; surviving fragment: “odds log”]

Page 62: CMG15 Session 525

Is it Right to Size Resources Using Upper Percentiles?

Quantile Compression: As the system is approaching saturation, smaller percentiles of moderated demand X grow on average faster than higher percentiles.

Page 63: CMG15 Session 525

Is it Right to Size Resources Using Upper Percentiles?

Page 64: CMG15 Session 525

Forecasting Methods:

● EWMA

● ARIMA

● Regression

EWMA models are very specific and computationally fast, but they have to be told the trend (linear or exponential) and the seasonality (additive or multiplicative).

ARIMA models implicitly account for trends, seasonality, and stationarity of the data.

Autocorrelation of ARIMA residuals reveals any periodicities that the model has missed.

For stationary data, use ARIMA

For non-stationary data, use EWMA

EWMA and ARIMA overlap

When to use Regression:

● data are monotonic.

● seasonality is NOT statistically significant.

● EWMA and ARIMA fail.

When to use Quantile Regression:

● Upper and Lower bounds behave differently.

● Outliers are possible.

For each data set, we can run a model competition, computing forecast model quality based on a weighted sum of model goodness of fit, model suitability for forecasting, data stationarity and data variability, and selecting the model that works best for each data set.
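A much reduced model competition is sketched below. It scores candidates by mean absolute error on a holdout only, which is a simplification of the weighted sum described above, and it includes just a flat EWMA and a straight-line regression; ARIMA is omitted because it needs a third-party library. The smoothing constant and the series are hypothetical.

```python
import statistics


def ewma_forecast(history, steps, alpha=0.3):
    """Simple exponential smoothing; the forecast is the last level, held flat."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * steps


def regression_forecast(history, steps):
    """Least-squares straight line through the history, extrapolated."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(history)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    return [y_mean + slope * (n + h - x_mean) for h in range(steps)]


MODELS = {"EWMA": ewma_forecast, "Regression": regression_forecast}


def compete(series, holdout=5):
    """Fit on all but the last `holdout` points and score on the rest."""
    train, actual = series[:-holdout], series[-holdout:]
    scores = {}
    for name, model in MODELS.items():
        predicted = model(train, holdout)
        errors = [abs(p - a) for p, a in zip(predicted, actual)]
        scores[name] = statistics.fmean(errors)
    return min(scores, key=scores.get), scores


# Hypothetical series: a steady ramp, and a level shift.
ramp = [10.0 + 2.0 * t for t in range(20)]
shift = [10.0] * 10 + [50.0] * 10
print(compete(ramp))
print(compete(shift))
```

On the ramp the regression wins, while on the level shift the EWMA adapts faster and wins, which is why the competition is run per data set.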

[Figure labels: EWMA, ARIMA, Quantile Regression]