The Mystery Machine: End-to-end performance analysis of large-scale Internet services

Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch

Page 1: The Mystery Machine: End-to-end performance analysis of large-scale Internet services

The Mystery Machine: End-to-end performance analysis of large-scale Internet services

Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch

Page 2

Internet services are complex

[Figure: tasks plotted against time]

Scale and heterogeneity make Internet services complex

Page 4

Analysis Pipeline

Page 5

Step 1: Identify segments

Page 6

Step 2: Infer causal model

Page 7

Step 3: Analyze individual requests

Page 8

Step 4: Aggregate results

Page 9

Challenges

• Previous methods derive a causal model
  – Instrument scheduler and communication
  – Build model through human knowledge

Need method that works at scale with heterogeneous components

Page 10

Opportunities

• Component-level logging is ubiquitous
  – Tremendous detail about a request’s execution

• Handle a large number of requests
  – Coverage of a large range of behaviors

Page 11

The Mystery Machine

1) Infer causal model from large corpus of traces
  – Identify segments
  – Hypothesize all possible causal relationships
  – Reject hypotheses with observed counterexamples

2) Analysis
  – Critical path, slack, anomaly detection, what-if

Page 12

Step 1: Identify segments

Page 14

Define a minimal schema

[Figure: a Task shown as a sequence of three Events; consecutive Events delimit Segment 1 and Segment 2]

Schema fields:
• Request identifier
• Machine identifier
• Timestamp
• Task
• Event

Aggregate existing logs using minimal schema
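The minimal schema above can be sketched as a small record type plus a grouping step. This is an illustrative sketch, not the authors' implementation: the names `LogEvent`, `Segment`, and `segments_by_request` are hypothetical, and it assumes timestamps from different machines have already been made comparable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """One log line mapped onto the five schema fields (field names are assumptions)."""
    request_id: str
    machine_id: str
    timestamp_ms: float
    task: str
    event: str


@dataclass(frozen=True)
class Segment:
    """The execution between two consecutive events of the same task."""
    task: str
    start_event: str
    end_event: str
    start_ms: float
    end_ms: float


def segments_by_request(events):
    """Group events by (request, task) and pair consecutive events into segments."""
    per_task = defaultdict(list)
    for e in events:
        per_task[(e.request_id, e.task)].append(e)

    result = defaultdict(list)
    for (request_id, task), task_events in per_task.items():
        task_events.sort(key=lambda e: e.timestamp_ms)
        # Three events in a task yield two segments, as in the slide's diagram.
        for first, second in zip(task_events, task_events[1:]):
            result[request_id].append(
                Segment(task, first.event, second.event,
                        first.timestamp_ms, second.timestamp_ms)
            )
    return dict(result)
```

Keeping the record this small is what lets logs from heterogeneous components be merged: each component only has to supply these five fields.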

Page 15

Step 2: Infer causal model

Page 16

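Types of causal relationships

Relationship: Happens-Before
  Model: A always completes before B begins (A, then B)
  Counterexample: a trace in which B precedes A

Relationship: Mutual Exclusion
  Model: A and B never overlap (A then B, OR B then A)
  Counterexample: a trace in which A and B overlap

Relationship: Pipeline
  Model: task t1 runs segments A, B, C and task t2 runs A’, B’, C’ in the same order
  Counterexample: a trace in which t2 runs its segments in a different order than t1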
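Each relationship type above is rejected by a specific kind of counterexample, and those checks can be written as small predicates. The sketch below is an assumption-laden illustration: segments are represented as hypothetical `(start, end)` tuples, and pipeline order is represented as the sequence of item labels each task processed.

```python
def violates_happens_before(a, b):
    """Counterexample to 'A happens before B': B starts before A has finished."""
    a_start, a_end = a
    b_start, b_end = b
    return b_start < a_end


def violates_mutual_exclusion(a, b):
    """Counterexample to 'A and B are mutually exclusive': the two intervals overlap."""
    a_start, a_end = a
    b_start, b_end = b
    return a_start < b_end and b_start < a_end


def violates_pipeline(order_in_t1, order_in_t2):
    """Counterexample to a pipeline between t1 and t2: t2 handles items in a different order."""
    return list(order_in_t1) != list(order_in_t2)
```

For example, `violates_happens_before((0, 5), (5, 8))` is false because B starts exactly when A ends, while swapping the arguments produces a counterexample.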

Page 17

Producing causal model

[Figure: causal model over segments S1, N1, C1 and S2, N2, C2]

Page 18

Producing causal model

[Figure: causal model over segments S1, N1, C1 and S2, N2, C2, shown next to the timeline of those segments in Trace 1; axis: Time]

Page 20

Producing causal model

[Figure: causal model over segments S1, N1, C1 and S2, N2, C2, shown next to the timeline of those segments in Trace 2; axis: Time]
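The trace-by-trace refinement shown on these slides can be sketched for the happens-before relationship alone. This is a simplified illustration, not the authors' code: `infer_happens_before` is a hypothetical name, each trace is assumed to be a dict from segment name to a `(start, end)` tuple, and the timings are invented.

```python
from itertools import permutations


def infer_happens_before(traces):
    """Start from every ordered pair of segments and drop pairs contradicted by a trace."""
    names = set()
    for trace in traces:
        names.update(trace)
    # Hypothesize all possible relationships.
    hypotheses = set(permutations(sorted(names), 2))
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace:
                a_end = trace[a][1]
                b_start = trace[b][0]
                if b_start < a_end:  # observed counterexample
                    hypotheses.discard((a, b))
    return hypotheses


# Invented timings for two traces of the same request type.
trace_1 = {"S1": (0, 10), "N1": (10, 20), "C1": (20, 30),
           "S2": (10, 25), "N2": (25, 35), "C2": (35, 45)}
trace_2 = {"S1": (0, 10), "N1": (10, 40), "C1": (40, 50),
           "S2": (10, 15), "N2": (15, 25), "C2": (50, 60)}

after_one = infer_happens_before([trace_1])
after_two = infer_happens_before([trace_1, trace_2])
```

With these timings, the hypothesis that N1 happens before N2 survives the first trace but is rejected by the second, which is why a large and varied corpus of traces matters: relationships that hold only by coincidence need some trace to contradict them.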

Page 23

Step 3: Analyze individual requests

Page 24

Critical path using causal model

[Figure: causal model over segments S1, N1, C1 and S2, N2, C2, shown next to the timeline of those segments in Trace 1; axis: Time]
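One common way to obtain a critical path from a causal model is a longest-path pass over the dependency graph, weighted by the segment durations observed in one trace. The sketch below takes that approach as an assumption; the edge list and durations are invented for illustration, and it assumes each segment starts as soon as its predecessors finish.

```python
from graphlib import TopologicalSorter


def critical_path(durations, edges):
    """Return (path, total) for the longest duration-weighted path in the model."""
    preds = {name: set() for name in durations}
    for a, b in edges:
        preds[b].add(a)

    finish = {}      # earliest finish time of each segment
    came_from = {}   # predecessor that determines that finish time
    for node in TopologicalSorter(preds).static_order():
        latest_pred = max(preds[node], key=lambda p: finish[p], default=None)
        came_from[node] = latest_pred
        start = finish[latest_pred] if latest_pred is not None else 0
        finish[node] = start + durations[node]

    last = max(finish, key=finish.get)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = came_from[last]
    return path[::-1], total


# Hypothetical causal model (happens-before edges) and one trace's durations in ms.
edges = [("S1", "N1"), ("N1", "C1"), ("S1", "S2"),
         ("S2", "N2"), ("N2", "C2"), ("C1", "C2")]
durations = {"S1": 10, "N1": 10, "C1": 10, "S2": 15, "N2": 10, "C2": 10}
path, total = critical_path(durations, edges)
```

In this made-up trace the path through S2 and N2 is longer than the path through N1 and C1, so speeding up N1 or C1 would not shorten the request.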

Page 27

Step 4: Aggregate results

Page 30

Inaccuracies of Naïve Aggregation

Need a causal model to correctly understand latency

Page 32

High variance in critical path

• Breakdown in critical path shifts drastically
  – Server, network, or client can dominate latency

[Figure: three CDF panels (Server, Network, Client); x-axis: Percent of critical path; y-axis: cdf]

For 20% of requests, server contributes 10% of latency

Page 33

High variance in critical path

• Breakdown in critical path shifts drastically
  – Server, network, or client can dominate latency

[Figure: three CDF panels (Server, Network, Client); x-axis: Percent of critical path; y-axis: cdf]

For 20% of requests, server contributes 50% or more of latency

Page 34

Diverse clients and networks

[Figure: three sets of Server, Network, and Client panels]

Page 37

Differentiated service

Deliver data when needed and reduce average response time

• No slack in server generation time: produce data faster
  – Decrease end-to-end latency

• Slack in server generation time: produce data slower
  – End-to-end latency stays same

Page 38

Additional analysis techniques

• Slack analysis

• What-if analysis
  – Use natural variation in large data set

[Figure: timeline of segments S1, N1, C1 and S2, N2, C2; axis: Time]
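Slack can be read as the amount a segment could be delayed without lengthening the request. A standard way to compute it, assumed here rather than taken from the slides, is a forward pass for earliest start times followed by a backward pass for latest start times. The model edges and durations below are invented.

```python
from graphlib import TopologicalSorter


def segment_slack(durations, edges):
    """Return, per segment, how long it could be delayed without changing total latency."""
    preds = {n: set() for n in durations}
    succs = {n: set() for n in durations}
    for a, b in edges:
        preds[b].add(a)
        succs[a].add(b)
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: earliest time each segment can start.
    earliest_start = {}
    for n in order:
        earliest_start[n] = max(
            (earliest_start[p] + durations[p] for p in preds[n]), default=0
        )
    total = max(earliest_start[n] + durations[n] for n in order)

    # Backward pass: latest time each segment can start without delaying the end.
    latest_start = {}
    for n in reversed(order):
        latest_finish = min((latest_start[s] for s in succs[n]), default=total)
        latest_start[n] = latest_finish - durations[n]

    return {n: latest_start[n] - earliest_start[n] for n in order}


# Hypothetical causal model and one trace's durations in ms.
edges = [("S1", "N1"), ("N1", "C1"), ("S1", "S2"),
         ("S2", "N2"), ("N2", "C2"), ("C1", "C2")]
durations = {"S1": 10, "N1": 10, "C1": 10, "S2": 15, "N2": 10, "C2": 10}
slack = segment_slack(durations, edges)
```

Segments on the critical path have zero slack; in this invented example only N1 and C1 have room to be delayed.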

Page 39

What-if questions

• Does server generation time affect end-to-end latency?

• Can we predict which connections exhibit server slack?

Page 41

Server slack analysis

• Slack < 25ms: end-to-end latency increases as server generation time increases
• Slack > 2.5s: server generation time has little effect on end-to-end latency
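The comparison between the two slack groups can be sketched by splitting requests on their measured slack and fitting a least-squares slope of end-to-end latency against server generation time in each group. Only the 25 ms and 2.5 s cut-offs come from the slide; the function names and the sample numbers are invented for illustration.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys); xs must not all be equal."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def latency_slope_by_slack(requests, low_ms=25.0, high_ms=2500.0):
    """requests: iterable of (slack_ms, server_ms, end_to_end_ms) tuples."""
    groups = {"low_slack": [], "high_slack": []}
    for slack_ms, server_ms, end_to_end_ms in requests:
        if slack_ms < low_ms:
            groups["low_slack"].append((server_ms, end_to_end_ms))
        elif slack_ms > high_ms:
            groups["high_slack"].append((server_ms, end_to_end_ms))
    return {
        name: least_squares_slope([s for s, _ in rows], [e for _, e in rows])
        for name, rows in groups.items()
        if len(rows) >= 2
    }


# Invented sample: low-slack requests track server time, high-slack ones do not.
sample = [(10, 100, 600), (5, 200, 700), (20, 300, 800),
          (3000, 100, 4000), (2600, 200, 4000), (5000, 300, 4000)]
slopes = latency_slope_by_slack(sample)
```

A slope near 1 in the low-slack group and near 0 in the high-slack group would match the pattern the slide describes; requests with slack between the two cut-offs are ignored in this sketch.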

Page 43

Predicting server slack

• Predict slack at the receipt of a request
• Past slack is representative of future slack

[Figure: Second slack (ms) plotted against First slack (ms)]

Classifies 83% of requests correctly
Type I Error 8%
Type II Error 9%
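A predictor built on the idea that past slack is representative of future slack can be sketched as a threshold rule: predict that a request will have slack if the previous request on the same connection did. The threshold, the sample pairs, and the convention that a Type I error means "slack predicted but not observed" are all assumptions made for this illustration; the slide does not specify them.

```python
def evaluate_slack_predictor(pairs, threshold_ms):
    """pairs: iterable of (first_slack_ms, second_slack_ms), one pair per connection."""
    counts = {"correct": 0, "type_i": 0, "type_ii": 0}
    total = 0
    for first_ms, second_ms in pairs:
        predicted_slack = first_ms >= threshold_ms   # decided at receipt of the request
        actual_slack = second_ms >= threshold_ms
        total += 1
        if predicted_slack == actual_slack:
            counts["correct"] += 1
        elif predicted_slack:
            counts["type_i"] += 1    # assumed convention: false positive
        else:
            counts["type_ii"] += 1   # assumed convention: false negative
    return {name: count / total for name, count in counts.items()}


# Invented (first slack, second slack) pairs in ms.
pairs = [(100, 200), (10, 5), (100, 5), (10, 300)]
rates = evaluate_slack_predictor(pairs, threshold_ms=50)
```

The three rates always sum to 1, mirroring how the slide's 83%, 8%, and 9% figures account for all classified requests.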

Page 44

Conclusion

Page 45

Questions