Black-Box Problem Diagnosis in Parallel File Systems
Michael P. Kasick1
Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1
1Carnegie Mellon University 2DSO National Labs, Singapore
Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 1
Problem Diagnosis Goals
To diagnose problems in off-the-shelf parallel file systems:
  - Environmental performance problems: disk & network faults
  - Target file systems: PVFS & Lustre
To develop methods applicable to existing deployments:
  - Application transparency: avoid code-level instrumentation
  - Minimal overhead, training, and configuration
  - Support for arbitrary workloads: avoid models, SLOs, etc.
Motivation: Real Problem Anecdotes
Problems motivated by PVFS developers' experiences, from Argonne's Blue Gene/P PVFS cluster.
"Limping-but-alive" server problems:
  - No errors reported; can't identify the faulty node from logs
  - A single faulty server impacts overall system performance
Storage-related problems:
  - Accidental launch of rogue processes decreases throughput
  - Buggy RAID controller issues patrol reads when not at idle
Network-related problems:
  - Faulty switch ports corrupt packets, which fail CRC checks
  - Overloaded switches drop packets but pass diagnostic tests
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
Target Parallel File Systems
Aim to support I/O-intensive applications:
  - Provide high-bandwidth, concurrent access
Parallel File System Architecture
[Diagram: clients connect over a network to I/O servers (ios0, ios1, ios2, …, iosN) and metadata servers (mds0, …, mdsM)]
  - One or more I/O and metadata servers
  - Clients communicate with every server
  - No server-server communication
Parallel File System Data Striping
[Figure: a logical file of chunks 0, 1, 2, … is striped round-robin across physical files: Server 1 holds chunks 0, 3, 6, …; Server 2 holds 1, 4, 7, …; Server 3 holds 2, 5, 8, …]
  - Client stripes the local file into 64 kB–1 MB chunks
  - Writes to each I/O server in round-robin order
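As a sketch, the round-robin mapping from logical chunk to I/O server can be written as follows; the 64 kB stripe size and the three-server count are illustrative choices, not values fixed by the file systems:

```python
# Round-robin striping sketch: map a byte offset in the logical file to
# the I/O server and chunk index that hold it.
STRIPE_SIZE = 64 * 1024   # illustrative; slides say 64 kB-1 MB
NUM_SERVERS = 3           # illustrative server count

def locate(offset, stripe_size=STRIPE_SIZE, num_servers=NUM_SERVERS):
    """Return (server_index, chunk_index) for a logical byte offset."""
    chunk = offset // stripe_size
    return chunk % num_servers, chunk

# Chunks 0, 1, 2 land on servers 0, 1, 2, then placement wraps around.
assert locate(0) == (0, 0)
assert locate(STRIPE_SIZE) == (1, 1)
assert locate(3 * STRIPE_SIZE) == (0, 3)   # chunk 3 is back on server 0
```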
Parallel File Systems: Empirical Insights (I)
Server behavior is similar for most requests:
  - Large requests are striped across all servers
  - Small requests, in aggregate, equally load all servers
Hypothesis (peer-similarity):
  - Fault-free servers exhibit similar performance metrics
  - Faulty servers exhibit dissimilarities in certain metrics
  - Peer-comparison of metrics identifies the faulty node
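A minimal sketch of the peer-similarity idea: a server whose metric sits far from the peer median is a suspect. The median-absolute-deviation cutoff and the server names are illustrative assumptions; the actual indictment algorithm (described later) compares histograms instead.

```python
# Peer-comparison sketch using a median-absolute-deviation cutoff.
from statistics import median

def suspects(samples, k=3.0):
    """samples: {server: metric value}. Return servers far from the peer median."""
    med = median(samples.values())
    mad = median(abs(v - med) for v in samples.values()) or 1e-9  # avoid /0
    return {s for s, v in samples.items() if abs(v - med) > k * mad}

# Four similar servers and one "limping-but-alive" server:
reads = {"ios0": 5100, "ios1": 4900, "ios2": 5000, "ios3": 5050, "ios4": 700}
assert suspects(reads) == {"ios4"}
```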
Example: Disk-Hog Fault
[Figure: sectors read/s vs. elapsed time (s); the faulty server diverges sharply from the non-faulty servers: peer-asymmetry]
Strongly motivates the peer-comparison approach.
Parallel File Systems: Empirical Insights (II)
Faults manifest asymmetrically only on some metrics. Example: a disk-busy fault manifests . . .
  - Asymmetrically on latency metrics (↑ on the faulty server, ↓ on fault-free servers)
  - Symmetrically on throughput metrics (↓ on all nodes)
[Figure: I/O wait time (ms) vs. elapsed time (s); the faulty server rises above the non-faulty servers: peer-asymmetry]
[Figure: sectors read/s vs. elapsed time (s); faulty and non-faulty servers overlap: no asymmetry]
Faults are thus distinguishable by which metrics are peer-divergent.
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
System Model
Fault model:
  - Non-fail-stop, "limping-but-alive" performance problems
  - Problems affecting storage & network resources
Assumptions:
  - Hardware is homogeneous, identically configured
  - Workloads are non-pathological (balanced requests)
  - Majority of servers exhibit fault-free behavior
Instrumentation
Sampling of storage & network performance metrics:
  - Sampled from /proc once every second
  - Gathered from all server nodes
Storage-related metrics of interest:
  - Throughput: bytes read/sec, bytes written/sec
  - Latency: I/O wait time
Network-related metrics of interest:
  - Throughput: bytes received/sec, bytes transmitted/sec
  - Congestion: TCP sending congestion window
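A hedged sketch of the storage side of this sampling: parse one line of Linux's /proc/diskstats to pull out the metrics named above. Field positions follow the documented diskstats format; in a deployment this would be read from /proc/diskstats once per second on every server, and the sample line below is fabricated for illustration.

```python
# Parse a /proc/diskstats line into the storage metrics of interest.
def parse_diskstats_line(line):
    f = line.split()
    return {
        "device": f[2],
        "sectors_read": int(f[5]),     # field 6: sectors read
        "sectors_written": int(f[9]),  # field 10: sectors written
        "io_time_ms": int(f[12]),      # field 13: ms spent doing I/O
    }

# Fabricated sample line in diskstats format:
sample = "8 0 sda 12007 5 648708 6400 1034 210 98576 4800 0 7100 11200"
m = parse_diskstats_line(sample)
assert m["device"] == "sda" and m["sectors_read"] == 648708
```

Per-second rates follow by differencing successive samples of these monotonically increasing counters.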
Workloads
ddw & ddr (dd write & read):
  - Use dd to write/read many GB to/from a file
  - Large (order-MB) I/O requests; saturating workload
iozonew & iozoner (IOzone write & read):
  - Run in either write/rewrite or read/reread mode
  - Large I/O requests, workload transitions, fsync
postmark (PostMark):
  - Metadata-heavy, small reads/writes (single server)
  - Simulates email/news servers
Fault Types
Susceptible resources:
  - Storage: access contention
  - Network: congestion, packet loss (faulty hardware)
Manifestation mechanism:
  - Hog: introduces a new, visible workload (server-monitored)
  - Busy/Loss: alters the existing workload (unmonitored)

              Storage      Network
  Hog         disk-hog     write-network-hog, read-network-hog
  Busy/Loss   disk-busy    receive-packet-loss, send-packet-loss
Experiment Setup
PVFS cluster configurations:
  - 10 clients, 10 combined I/O & metadata servers
  - 6 clients, 12 combined I/O & metadata servers
Lustre cluster configurations:
  - 10 clients, 10 I/O servers, 1 metadata server
  - 6 clients, 12 I/O servers, 1 metadata server
Each client runs same workload for ≈600 s
Faults injected on single server for 300 s
All workload & fault combinations run 10 times
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
Diagnostic Algorithm
Phase I: node indictment
  - Histogram-based approach (for most metrics)
  - Time-series-based approach (congestion window)
  - Both use peer-comparison to indict the faulty node
Phase II: root-cause analysis
  - Ascribes the problem to a root cause based on the affected metrics
Phase I: Node Indictment (Histogram-Based)
Peer-compare metric PDFs (histograms) across servers:
  1. Compute the PDF of the metric for each server over a sliding window
  2. Compute the Kullback-Leibler divergence for each server pair
  3. Flag a pair anomalous if its divergence exceeds a threshold
  4. Flag a server if over half of its server pairs are anomalous
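The steps above can be sketched as follows. The bin count, the smoothing epsilon, the use of a symmetrized KL divergence, and the threshold value are all illustrative assumptions, not the paper's exact parameters:

```python
# Histogram-based indictment sketch: PDFs per server, pairwise KL
# divergence, majority vote over anomalous pairs.
import math
from itertools import combinations

def pdf(samples, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Histogram of samples over [lo, hi], smoothed and normalized to a PDF."""
    counts = [eps] * bins
    for x in samples:
        i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def indict(windows, threshold=1.0):
    """windows: {server: metric samples in the current window}.
    Flag a server if over half of its server pairs are anomalous."""
    servers = list(windows)
    pdfs = {s: pdf(windows[s]) for s in servers}
    anomalous = {s: 0 for s in servers}
    for a, b in combinations(servers, 2):
        if kl(pdfs[a], pdfs[b]) + kl(pdfs[b], pdfs[a]) > threshold:
            anomalous[a] += 1
            anomalous[b] += 1
    return {s for s in servers if anomalous[s] > (len(servers) - 1) / 2}

# Three peers with similar samples, one divergent server:
faulty = indict({"ios0": [0.50] * 20, "ios1": [0.52] * 20,
                 "ios2": [0.55] * 20, "ios3": [0.90] * 20})
assert faulty == {"ios3"}
```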
Threshold Selection
Fault-free training session (stress test):
  - Run ddw, ddr, & postmark under fault-free conditions
  - Find the minimum threshold that eliminates all anomalies
Histogram comparison uses per-server thresholds:
  - Captures the performance profile of each server
  - Important to train on each cluster & file system
Train on performance-stressing workloads only:
  - Metrics deviate most when servers are saturated
  - Less intense workloads exhibit better-coupled behavior
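The training step reduces to a simple maximum: pick the smallest threshold that would flag no fault-free pair as anomalous. The data structure and the safety margin below are illustrative assumptions:

```python
# Threshold training sketch: `divergences` holds the pairwise KL values
# observed for one server across all fault-free training windows.
def train_threshold(divergences, margin=1.1):
    """Minimum threshold that eliminates all fault-free anomalies,
    padded by a small safety margin (illustrative)."""
    return max(divergences) * margin

# Divergences observed under fault-free ddw/ddr/postmark runs (fabricated):
faultfree = [0.12, 0.31, 0.08, 0.27, 0.40]
t = train_threshold(faultfree)
assert t > max(faultfree)   # no training window would be flagged
```

Because the thresholds are per-server, this is repeated for each server (and re-done on each cluster and file system).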
Example: PVFS Throughput (Disk-Hog Fault)
[Figure: sectors read/s vs. elapsed time (s) for faulty and non-faulty servers, comparing PVFS with a disk-hog against PVFS alone]
Throughput diverges due to the disk-hog on the faulty server.
Phase II: Root-Cause Analysis
Build a table of metrics & the faults affecting them:

  Storage throughput:  disk-hog
  Storage latency:     disk-hog, disk-busy
  Network throughput:  network-hog, packet-loss (ACKs only)
  Network congestion:  network-hog, packet-loss

Derive a checklist that maps divergent metrics to a cause:
  - Infers the resource responsible
  - Determines the mechanism by which the resource faulted
Checklist for Root-Cause Analysis
Peer-divergence in . . .
  1. Storage throughput?  Yes: disk-hog fault.  No: next question.
  2. Storage latency?     Yes: disk-busy fault. No: next question.
  3. Network throughput?* Yes: network-hog fault. No: next question.
  4. Network congestion?  Yes: packet-loss fault. No: no fault discovered.

*Must diverge in both receive & transmit, or in the absence of congestion.
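The checklist amounts to a straight-line decision procedure, sketched below. The metric names are illustrative labels, and the * caveat on network throughput is omitted for brevity:

```python
# Root-cause checklist sketch: walk the questions in order and return the
# first fault whose metric is peer-divergent.
def root_cause(divergent):
    """divergent: set of metric names flagged as peer-divergent."""
    if "storage_throughput" in divergent:
        return "disk-hog"
    if "storage_latency" in divergent:
        return "disk-busy"
    if "network_throughput" in divergent:   # subject to the * caveat
        return "network-hog"
    if "network_congestion" in divergent:
        return "packet-loss"
    return "no fault discovered"

assert root_cause({"storage_latency"}) == "disk-busy"
assert root_cause(set()) == "no fault discovered"
```

The ordering matters: a disk-hog diverges in both storage throughput and latency, so the throughput question must come first to separate it from disk-busy.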
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
Results: Single Cluster
[Bar chart: accuracy rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]
Results: Aggregate
[Bar chart: accuracy rate (%) per cluster (PVFS 10/10, PVFS 6/12, Lustre 10/10, Lustre 6/12), showing indicted & diagnosed true-positive and false-positive rates]
Results Summary
True-positives are inconsistent across faults:
  - Some faults are not observable for all workloads
  - Minimal performance effect where not observable
True- & false-positives are inconsistent across clusters:
  - The algorithm is sensitive to imprecise thresholds
  - Remedy: rank metrics by degree of dissimilarity
The strategy is promising in general.
Instrumentation overhead: < 1% increase in workload runtime, negligible.
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
Future Work
Analysis: improve diagnosis accuracy rates
  - Make the analysis robust to imprecise thresholds
Real-world data: deploy on a production system
  - Validate the technique on real workloads, at scale
Coverage: expand the target problem class
  - Other sources of performance & non-performance faults
Instrumentation: expand instrumentation
  - Additional black-box metrics, request sniffing & tracing
Summary
Problem diagnosis in parallel file systems:
  - Illustrates the use of OS-level metrics in diagnosis
  - Leverages peer-comparison to identify faulty nodes
  - Demonstrates root-cause analysis by the metrics affected
The diagnosis method is applicable to existing deployments:
  - Instrumentation is minimally invasive, with low overhead
  - Fault-free training with stress tests
Peer-Comparison Scalability
Number of pairwise comparisons: n(n−1)/2 ⟹ O(n²)
Insight: no need to compare each node against all others.
Proposed solution:
  - Establish n/k partitions of k servers each
  - Perform peer-comparisons among the servers in each partition
  - Repartition with a different grouping for each window
Solution comparisons: (n/k) · k(k−1)/2 = n(k−1)/2 ⟹ O(n)
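The two comparison counts can be checked numerically; the sketch below assumes, for simplicity, that n is a multiple of k:

```python
# Comparison-count sketch: full pairwise comparison is O(n^2); comparing
# only within n/k partitions of k servers each is O(n) for fixed k.
def full_pairs(n):
    return n * (n - 1) // 2

def partitioned_pairs(n, k):
    assert n % k == 0, "sketch assumes n divisible by k"
    return (n // k) * (k * (k - 1) // 2)

assert full_pairs(12) == 66
assert partitioned_pairs(12, 4) == 18   # 3 partitions x 6 pairs each
```

Repartitioning with a different grouping each window preserves the chance that a faulty server is eventually compared against many distinct peers.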
Congestion Window Problem
No closely-coupled peer behavior:
  - cwnd is intentionally noisy under normal conditions
  - Synchronized connections can't fully use link capacity
  - Can't compare histograms: too much variance
Congestion-window packet-loss heuristic:
  - TCP responds to packet loss by halving cwnd
  - Exponential decay after multiple loss events
  - On a log scale, each loss results in a linear cwnd decrease
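A sketch of the heuristic: since each loss halves cwnd, each loss subtracts 1 on a log2 scale, so a sustained linear downward trend in log2(cwnd) suggests repeated loss events. The least-squares slope fit and the cutoff value below are illustrative assumptions:

```python
# Packet-loss heuristic sketch: fit a slope to log2(cwnd) over time and
# flag a connection whose log-scale cwnd is steadily decreasing.
import math

def loss_suspected(cwnd_series, slope_cutoff=-0.05):
    """Least-squares slope of log2(cwnd); flag if steadily falling."""
    ys = [math.log2(c) for c in cwnd_series]
    n = len(ys)
    xbar = (n - 1) / 2
    ybar = sum(ys) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den < slope_cutoff

halving = [64, 64, 32, 32, 16, 16, 8, 8]    # repeated loss events
steady = [60, 64, 62, 63, 61, 64, 62, 63]   # normal cwnd jitter
assert loss_suspected(halving) and not loss_suspected(steady)
```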
Time Series Comparison Example
[Figure: two panels of TCP segments (cwnd) vs. elapsed time (s), on a log scale]
Time Series Comparison Example
[Figure: three panels of TCP segments (cwnd) vs. elapsed time (s), on a log scale]
Heterogeneous Hardware (ddr)
[Figure: I/O wait time (ms) vs. elapsed time (s) across servers for the ddr workload]
Disks of the same model still have different performance profiles.
Load Imbalances (postmark)
[Figure: bytes received (B/s) vs. elapsed time (s) for the postmark workload]
With "/" on one metadata server, all path lookups go there.
Cross-Resource Influence (ddr)
[Figure: TCP segments (cwnd) vs. elapsed time (s) for the ddr workload; the faulty server diverges from the non-faulty servers]
A disk-busy fault also affects server cwnd, via unintentional synchronization.
Delayed ACKs (ddw)
[Figure: bytes transmitted (B/s) vs. elapsed time (s) for the ddw workload; the faulty server diverges from the non-faulty servers]
A packet-loss fault may also cause network throughput to deviate.
Results: PVFS 10/10 Cluster
[Bar chart: accuracy rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]
Results: PVFS 6/12 Cluster
[Bar chart: accuracy rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]
Results: Lustre 10/10 Cluster
[Bar chart: accuracy rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]
Results: Lustre 6/12 Cluster
[Bar chart: accuracy rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]