Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel...

56
Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick 1 Jiaqi Tan 2 , Rajeev Gandhi 1 , Priya Narasimhan 1 1 Carnegie Mellon University 2 DSO National Labs, Singapore Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 1

Transcript of Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel...

Page 1: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Black-Box Problem Diagnosisin Parallel File Systems

Michael P. Kasick1

Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1

1Carnegie Mellon University 2DSO National Labs, Singapore

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 1

Page 2: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Problem Diagnosis Goals

To diagnose problems in off-the-shelf parallel file systemsEnvironmental performance problems: disk & network faultsTarget file systems: PVFS & Lustre

To develop methods applicable to existing deploymentsApplication transparency: avoid code-level instrumentationMinimal overhead, training, and configurationSupport for arbitrary workloads: avoid models, SLOs, etc.

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 2

Page 3: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Motivation: Real Problem Anecdotes

Problems motivated by PVFS developers’ experiencesFrom Argonne’s Blue Gene/P PVFS cluster

“Limping-but-alive” server problems

No errors reported, can’t identify faulty node with logsSingle faulty server impacts overall system performance

Storage-related problems:

Accidental launch of rogue processes, decreases throughputBuggy RAID controller issues patrol reads when not at idle

Network-related problems:

Faulty-switch ports corrupt packets, fail CRC checksOverloaded switches drop packets but pass diagnostic tests

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 3

Page 4: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Motivation: Real Problem Anecdotes

Problems motivated by PVFS developers’ experiencesFrom Argonne’s Blue Gene/P PVFS cluster

“Limping-but-alive” server problemsNo errors reported, can’t identify faulty node with logsSingle faulty server impacts overall system performance

Storage-related problems:

Accidental launch of rogue processes, decreases throughputBuggy RAID controller issues patrol reads when not at idle

Network-related problems:

Faulty-switch ports corrupt packets, fail CRC checksOverloaded switches drop packets but pass diagnostic tests

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 3

Page 5: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Motivation: Real Problem Anecdotes

Problems motivated by PVFS developers’ experiencesFrom Argonne’s Blue Gene/P PVFS cluster

“Limping-but-alive” server problemsNo errors reported, can’t identify faulty node with logsSingle faulty server impacts overall system performance

Storage-related problems:Accidental launch of rogue processes, decreases throughputBuggy RAID controller issues patrol reads when not at idle

Network-related problems:

Faulty-switch ports corrupt packets, fail CRC checksOverloaded switches drop packets but pass diagnostic tests

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 3

Page 6: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Motivation: Real Problem Anecdotes

Problems motivated by PVFS developers’ experiencesFrom Argonne’s Blue Gene/P PVFS cluster

“Limping-but-alive” server problemsNo errors reported, can’t identify faulty node with logsSingle faulty server impacts overall system performance

Storage-related problems:Accidental launch of rogue processes, decreases throughputBuggy RAID controller issues patrol reads when not at idle

Network-related problems:Faulty-switch ports corrupt packets, fail CRC checksOverloaded switches drop packets but pass diagnostic tests

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 3

Page 7: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Outline

1 Introduction

2 Experimental Methods

3 Diagnostic Algorithm

4 Results

5 Conclusion

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 4

Page 8: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Target Parallel File Systems

Aim to support I/O-intensive applicationsProvide high-bandwidth, concurrent access

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 5

Page 9: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File System Architecture

network

clients

I/O servers

ios0 ios1 ios2 iosN mds0 mdsM

metadata

servers

One or more I/O and metadata serversClients communicate with every server

No server-server communication

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 6

Page 10: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File System Data Striping

…543210Logical File:

Server 1 0 3 6 …

Server 2 1 4 7 …

Server 3 2 5 8 …

PhysicalFiles

Client stripes local file into 64 kB–1 MB chunksWrites to each I/O server in round-robin order

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 7

Page 11: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File Systems: Empirical Insights (I)

Server behavior is similar for most requestsLarge requests are striped across all serversSmall requests, in aggregate, equally load all servers

Hypothesis: Peer-similarityFault-free servers exhibit similar performance metricsFaulty servers exhibit dissimilarities in certain metricsPeer-comparison of metrics identifies faulty node

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 8

Page 12: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Example: Disk-Hog Fault

0 200 400 600

020

000

6000

010

0000

Elapsed Time (s)

Sec

tors

Rea

d (/s

) Faulty server

Non-faulty servers

Peer-asymmetry

Strongly motivates peer-comparison approachMichael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 9

Page 13: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metricsEx: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)Symmetrically on throughput metrics (↓ on all nodes)

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 10

Page 14: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metricsEx: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)

Symmetrically on throughput metrics (↓ on all nodes)

0 200 400 600 800

Elapsed Time (s)

I/O W

ait T

ime

(ms)

010

0020

00 Faulty serverNon−faulty servers

Peer-asymmetry

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 10

Page 15: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metricsEx: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)Symmetrically on throughput metrics (↓ on all nodes)

0 200 400 600 800

Elapsed Time (s)

Sec

tors

Rea

d (/s

)

040

000

8000

0

Faulty serverNon−faulty servers

No asymmetry

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 10

Page 16: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metricsEx: A disk-busy fault manifests . . .

Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)Symmetrically on throughput metrics (↓ on all nodes)

Faults distinguishable by which metrics are peer-divergent

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 10

Page 17: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Outline

1 Introduction

2 Experimental Methods

3 Diagnostic Algorithm

4 Results

5 Conclusion

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 11

Page 18: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

System Model

Fault Model:Non-fail-stop problems

“Limping-but-alive” performance problems

Problems affecting storage & network resources

Assumptions:Hardware is homogeneous, identically configuredWorkloads are non-pathological (balanced requests)Majority of servers exhibit fault-free behavior

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 12

Page 19: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Instrumentation

Sampling of storage & network performance metricsSampled from /proc once every secondGathered from all server nodes

Storage-related metrics of interest:Throughput: Bytes read/sec, bytes written/secLatency: I/O wait time

Network-related metrics of interest:Throughput: Bytes received/sec, transmitted/secCongestion: TCP sending congestion window

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 13

Page 20: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Workloads

ddw & ddr (dd write & read)Use dd to write/read many GB to/from fileLarge (order MB) I/O requests, saturating workload

iozonew & iozoner (IOzone write & read)Ran in either write/rewrite or read/reread modeLarge I/O requests, workload transitions, fsync

postmark (PostMark)Metadata-heavy, small reads/writes (single server)Simulates email/news servers

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 14

Page 21: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Fault Types

Susceptible resources:Storage: Access contentionNetwork: Congestion, packet loss (faulty hardware)

Manifestation mechanism:Hog: Introduces new visible workload (server-monitored)Busy/Loss: Alters existing workload (unmonitored)

Storage NetworkHog disk-hog write-network-hog

read-network-hogBusy/Loss disk-busy receive-packet-loss

send-packet-loss

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 15

Page 22: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Fault Types

Susceptible resources:Storage: Access contentionNetwork: Congestion, packet loss (faulty hardware)

Manifestation mechanism:Hog: Introduces new visible workload (server-monitored)Busy/Loss: Alters existing workload (unmonitored)

Storage NetworkHog disk-hog write-network-hog

read-network-hogBusy/Loss disk-busy receive-packet-loss

send-packet-loss

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 15

Page 23: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Experiment Setup

PVFS cluster configurations:10 clients, 10 combined I/O & metadata servers6 clients, 12 combined I/O & metadata servers

Luster cluster configurations:10 clients, 10 I/O servers, 1 metadata server6 clients, 12 I/O servers, 1 metadata server

Each client runs same workload for ≈600 s

Faults injected on single server for 300 s

All workload & fault combinations run 10 times

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 16

Page 24: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Outline

1 Introduction

2 Experimental Methods

3 Diagnostic Algorithm

4 Results

5 Conclusion

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 17

Page 25: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Diagnostic Algorithm

Phase I: Node IndictmentHistogram-based approach (for most metrics)Time series-based approach (congestion window)Both use peer-comparison to indict faulty node

Phase II: Root-Cause AnalysisAscribes to root cause based on affected metrics

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 18

Page 26: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Phase I: Node Indictment (Histogram-Based)Peer-compare metric PDFs (histograms) across servers

Compute PDF of metric for each server over sliding windowCompute Kullback-Leibler divergence for each server pairFlag pair anomalous if its divergence exceeds thresholdFlag server if over half of its server pairs are anomalous

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 19

Page 27: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Phase I: Node Indictment (Histogram-Based)Peer-compare metric PDFs (histograms) across servers

Compute PDF of metric for each server over sliding window

Compute Kullback-Leibler divergence for each server pairFlag pair anomalous if its divergence exceeds thresholdFlag server if over half of its server pairs are anomalous

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 19

Page 28: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Phase I: Node Indictment (Histogram-Based)Peer-compare metric PDFs (histograms) across servers

Compute PDF of metric for each server over sliding windowCompute Kullback-Leibler divergence for each server pair

Flag pair anomalous if its divergence exceeds thresholdFlag server if over half of its server pairs are anomalous

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 19

Page 29: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Phase I: Node Indictment (Histogram-Based)Peer-compare metric PDFs (histograms) across servers

Compute PDF of metric for each server over sliding windowCompute Kullback-Leibler divergence for each server pairFlag pair anomalous if its divergence exceeds threshold

Flag server if over half of its server pairs are anomalous

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 19

Page 30: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Phase I: Node Indictment (Histogram-Based)Peer-compare metric PDFs (histograms) across servers

Compute PDF of metric for each server over sliding windowCompute Kullback-Leibler divergence for each server pairFlag pair anomalous if its divergence exceeds thresholdFlag server if over half of its server pairs are anomalous

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 19

Page 31: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Threshold Selection

Fault-free training session (stress test)Run ddw, ddr, & postmark under fault-free conditionsFind minimum threshold that eliminates all anomalies

Histogram comparison uses per-server thresholdsCaptures performance profile of each serverImportant to train on each cluster & file system

Train on performance-stressing workloads onlyMetrics deviate most when servers are saturatedLess intense workloads have better coupled behavior

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 20

Page 32: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Example: PVFS Throughput (Disk-Hog Fault)

0 200 400 600

020

000

6000

010

0000

Elapsed Time (s)

Sec

tors

Rea

d (/s

)

Faulty serverNon−faulty servers

PVFS + disk-hog

PVFS only

Throughput diverges due to disk-hog on faulty serverMichael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 21

Page 33: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Phase II: Root-Cause Analysis

Build table of metrics & faults affecting them:

Storage Throughput: Storage Latency:disk-hog disk-hog

disk-busyNetwork Throughput: Network Congestion:network-hog network-hogpacket-loss (ACKs only) packet-loss

Derive checklist that maps divergent metrics to causeInfers resource responsibleDetermines mechanism by which resource faulted

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 22

Page 34: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Checklist for Root-Cause Analysis

Peer-divergence in . . .

Yes: disk-hog faultStorage throughput?No: next question

Yes: disk-busy faultStorage latency?No: . . .

Yes: network-hog faultNetwork throughput?∗No: . . .

Yes: packet-loss faultNetwork congestion?No: no fault discovered

∗Must diverge in both receive & transmit, or in absence of congestion

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 23

Page 35: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Outline

1 Introduction

2 Experimental Methods

3 Diagnostic Algorithm

4 Results

5 Conclusion

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 24

Page 36: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: Single Cluster

control diskhog diskbusy wnethog rnethog recvloss sendloss

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Fault

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 25

Page 37: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: Aggregate

PVFS 10/10 PVFS 6/12 Lustre 10/10 Lustre 6/12

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Cluster

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 26

Page 38: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results Summary

True-positives inconsistent across faultsSome faults are not observable for all workloadsMinimal performance effect where not observable

True- & false-positives inconsistent across clustersAlgorithm sensitive to imprecise thresholdsRank metrics based on degree of dissimilarity

Strategy is promising in general

Instrumentation overhead< 1% increase in workload runtime, negligible

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 27

Page 39: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Outline

1 Introduction

2 Experimental Methods

3 Diagnostic Algorithm

4 Results

5 Conclusion

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 28

Page 40: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Future Work

Analysis: Improve diagnosis accuracy ratesMake analysis robust to imprecise thresholds

Real-world data: Deploy on a production systemValidate technique on real workloads, at scale

Coverage: Expand target problem classOther sources of performance & non-performance faults

Instrumentation: Expand instrumentationAdditional black-box metrics, request sniffing & tracing

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 29

Page 41: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Summary

Problem diagnosis in parallel file systemsIllustrates use of OS-level metrics in diagnosisLeverages peer-comparison to identify faulty nodesDemonstrates root-cause analysis by metrics affected

Diagnosis method is applicable to existing deploymentsInstrumentation is minimally invasive, low overheadFault-free training with stress tests

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 30

Page 42: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Peer-Comparison Scalability

Number of comparisons: n(n−1)2 =⇒ O(n2)

Insight: Don’t need to compare one node against all

Proposed solution:Establish n−k partitions with k serversPerform peer-comparisons among servers in each partitionRepartition with a different grouping for each window

Solution comparisons: (n−k)k(k−1)2 =⇒ O(n)

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 31

Page 43: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Congestion Window Problem

No closely-coupled peer behaviorcwnd is intentionally noisy under normal conditionsSynchronized connections can’t fully use link capacityCan’t compare histograms, too much variance

Congestion window packet-loss heuristic:TCP responds to packet-loss by halving cwndExponential decay after multiple loss eventsLog scale: Each loss results in linear cwnd decrease

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 32

Page 44: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Time Series Comparison Example

0 200 400 600

25

1020

5010

0

Elapsed Time (s)

Seg

men

ts

0 200 400 600

25

1020

5010

0

Elapsed Time (s)

Seg

men

ts

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 33

Page 45: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Time Series Comparison Example

0 200 400 600

25

1020

5010

0

Elapsed Time (s)

Seg

men

ts

0 200 400 600

25

1020

5010

0

Elapsed Time (s)

Seg

men

ts

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 33

Page 46: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Time Series Comparison Example

Elapsed Time (s)

Seg

men

ts

100 200 300 400 500 600 700

25

1050

100 200 300 400 500 600 700

100 200 300 400 500 600 700

25

1050

25

1050

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 34

Page 47: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Time Series Comparison Example

Elapsed Time (s)

Seg

men

ts

100 200 300 400 500 600 700

25

1050

100 200 300 400 500 600 700

100 200 300 400 500 600 700

25

1050

25

1050

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 34

Page 48: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Heterogeneous Hardware (ddr)

0 100 200 300 400 500 600

050

100

150

200

Elapsed Time (s)

I/O W

ait T

ime

(ms)

Disks are same model, have different performance profilesMichael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 35

Page 49: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Load Imbalances (postmark)

0 100 200 300 400 500 600 700

0e+0

01e

+05

2e+0

53e

+05

Elapsed Time (s)

Byt

es R

ecei

ved

(B/s

)

“/” on one metadata server, all path lookups go thereMichael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 36

Page 50: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Cross-Resource Influence (ddr)

0 200 400 600 800

510

2050

100

200

Elapsed Time (s)

Seg

men

ts

Faulty serverNon−faulty servers

Disk-busy effect on server cwnd, unintentional syncMichael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 37

Page 51: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Delayed ACKs (ddw)

0 200 400 600

0e+0

02e

+05

4e+0

56e

+05

8e+0

5

Elapsed Time (s)

Byt

es T

rans

mitt

ed (B

/s)

Faulty serverNon−faulty servers

Packet-loss fault may also deviate network throughputMichael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 38

Page 52: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: PVFS 10/10 Cluster

control diskhog diskbusy wnethog rnethog recvloss sendloss

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Fault

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 39

Page 53: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: PVFS 6/12 Cluster

control diskhog diskbusy wnethog rnethog recvloss sendloss

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Fault

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 40

Page 54: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: Lustre 10/10 Cluster

control diskhog diskbusy wnethog rnethog recvloss sendloss

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Fault

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 41

Page 55: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: Lustre 6/12 Cluster

control diskhog diskbusy wnethog rnethog recvloss sendloss

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Fault

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 42

Page 56: Black-Box Problem Diagnosis in Parallel File Systems · Black-Box Problem Diagnosis in Parallel File Systems Michael P. Kasick1 Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1 1Carnegie

Results: Aggregate

PVFS 10/10 PVFS 6/12 Lustre 10/10 Lustre 6/12

Indicted True PositiveDiagnosed True PositiveIndicted False PositiveDiagnosed False Positive

Cluster

Acc

urac

y R

ate

(%)

020

4060

8010

0

Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 43