Disk Failures
Eli Alshan
Agenda
• Articles survey
  – Failure Trends in a Large Disk Drive Population
    – Article review
    – Conclusions
    – Criticism
  – Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
    – Article review
    – Conclusions
    – Criticism
• Further research suggestion
Definitions
• Disk failure - a drive is considered to have failed if it was replaced as part of a repair procedure
  – 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers
• MTTF - Mean Time To Failure
• AFR - Annualized Failure Rate
• ARR - Annual Replacement Rate
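The link between a datasheet MTTF and a nominal annual failure rate can be sketched as below. This is only an illustration: it assumes drives are powered on for the whole year (8,760 hours) and fail at a constant rate, and the function names `mttf_to_afr` and `afr_to_mttf` are ours, not taken from either article.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming continuous operation


def mttf_to_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate (as a fraction) implied by an MTTF in hours."""
    return HOURS_PER_YEAR / mttf_hours


def afr_to_mttf(afr: float) -> float:
    """Inverse conversion: MTTF in hours implied by an annual failure rate."""
    return HOURS_PER_YEAR / afr


# A datasheet MTTF of 1,000,000 hours corresponds to an AFR of about 0.88%.
datasheet_afr = mttf_to_afr(1_000_000)
print(f"{datasheet_afr:.2%}")
```

Under these assumptions, an MTTF of 1,000,000 hours works out to an AFR just under 0.9% per year.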
Failure Trends in a Large Disk Drive Population
• Analysis of drives' self-monitoring data, collected from a large disk drive population
• Attempt to isolate parameters highly correlated with disk failures
Results - Utilization
• Only the very young and very old age groups appear to show the expected behavior (higher utilization associated with higher failure rates)
• Possible explanation - infant mortality
Results - Temperature
• Lower temperatures are associated with higher failure rates
Results – SMART
• Scan Errors - errors found during background surface scans
• Reallocation Count - count of sectors whose data was reallocated after recurring errors in that sector
• Offline Reallocations - the subset of the reallocation count that includes only sectors reallocated during background scrubbing
• Probational Counts - sectors "on probation" until they either fail permanently and are reallocated or continue to work without problems
Results – SMART
• Scan errors dramatically reduce the survival probability of young drives, but after the first month the curve flattens out
• Older drives decline steadily in survival probability throughout the 8-month period
• This behavior could be another manifestation of the infant mortality phenomenon
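To make "survival probability" concrete, here is a minimal sketch of a Kaplan-Meier style estimate over drives observed after their first scan error. It is an illustration under our own assumptions, not necessarily how the article computed its curves; the function name and the sample data are hypothetical.

```python
def kaplan_meier(durations, failed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: months each drive was observed after its first scan error
    failed:    True if the drive failed at that time, False if it was still
               working when observation ended (censored)
    """
    events = sorted(zip(durations, failed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        removed = 0
        # Group all drives that leave the study at the same time t.
        while i < len(events) and events[i][0] == t:
            deaths += events[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical data: two drives fail in month 1, the rest are still alive
# when observation stops (one at month 2, three at month 8).
durations = [1, 1, 2, 8, 8, 8]
failed = [True, True, False, False, False, False]
curve = kaplan_meier(durations, failed)
print(curve)
```

With this made-up sample the curve drops once in the first month and then stays flat, the shape the slide describes for young drives.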
Conclusions
• No consistent pattern of higher failure rates was found for higher-temperature drives or for drives at higher utilization levels
• A few SMART parameters are well correlated with higher failure probabilities
• Over 36% of all failed drives showed no count in any of the SMART signals and no temperature or utilization indication before failure
Criticism
• Analyzing complex, correlated input data one parameter at a time might be misleading
• Temperature and utilization should be time-windowed, so that readings closer to the failure receive more weight
• Physical proximity between tested drives must be taken into account, since nearby drives experience similar environmental conditions
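One way to realize the time-windowing point is a recency-weighted average, in which each reading's weight halves for every fixed number of days it lies before the failure. This is a sketch of the idea only; the function name, the 30-day half-life and the readings are hypothetical.

```python
def recency_weighted_mean(readings, half_life_days=30.0):
    """Average of readings, weighted toward those closest to the failure.

    readings: list of (days_before_failure, value) pairs
    half_life_days: a reading this many days before the failure gets half
                    the weight of a reading taken on the day of failure
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for days_before, value in readings:
        weight = 0.5 ** (days_before / half_life_days)
        weighted_sum += weight * value
        total_weight += weight
    return weighted_sum / total_weight


# Hypothetical temperature readings (degrees Celsius).
readings = [(0, 40.0), (30, 30.0)]
weighted = recency_weighted_mean(readings)
plain = sum(value for _, value in readings) / len(readings)
print(weighted, plain)
```

The weighted value sits closer to the most recent reading than the plain mean does, which is the effect the criticism asks for.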
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
• An analysis of seven data sets, with a focus on storage-related failures
  – Comparison of disk replacement rates observed in the field with common predictors and models used by vendors
  – Statistical properties of disk replacement rates
Disk Replacement Rates
• The measured average ARR was 3.4 times larger than the 0.88% given in the datasheet
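As a quick worked check of what that ratio means, the arithmetic below turns the two numbers on the slide into an observed ARR and the MTTF it would imply. It assumes 8,760 power-on hours per year; the variable names are ours.

```python
HOURS_PER_YEAR = 24 * 365  # assumes drives are powered on all year

datasheet_afr = 0.0088  # 0.88%, from the slide
ratio = 3.4             # observed ARR relative to the datasheet AFR, from the slide

observed_arr = ratio * datasheet_afr
implied_mttf_hours = HOURS_PER_YEAR / observed_arr

print(f"observed ARR: {observed_arr:.2%}")
print(f"implied MTTF: {implied_mttf_hours:,.0f} hours")
```

The result is an ARR of roughly 3% per year, which corresponds to an MTTF a little under 300,000 hours rather than the advertised 1,000,000.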
Disk Replacement Rates
• Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation, but steadily increase over time
Statistical properties of disk failures
• The hypothesis that the time between disk replacements follows an exponential distribution can be rejected with high confidence
• The distribution of time between disk replacements exhibits decreasing hazard rates. Disk replacements are best fit by gamma and Weibull distributions.
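The difference between a constant and a decreasing hazard rate can be seen from the Weibull hazard function, h(t) = (k/λ)(t/λ)^(k−1): with shape k = 1 it reduces to the exponential distribution's constant hazard, and with k < 1 it decreases over time. The sketch below uses illustrative parameter values that are not fitted to any data set.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


scale = 100.0  # hypothetical scale parameter, in hours
times = [1.0, 10.0, 100.0, 1000.0]

# shape = 1.0 is the exponential case: the hazard does not depend on t.
exponential_like = [weibull_hazard(t, 1.0, scale) for t in times]

# shape < 1 (0.7 here, purely illustrative): the hazard falls as t grows,
# i.e. the longer since the last replacement, the less likely the next one.
decreasing = [weibull_hazard(t, 0.7, scale) for t in times]

print(exponential_like)
print(decreasing)
```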
Statistical properties of disk failures
• The statistical analysis presents strong evidence for the existence of correlations between disk replacement intervals. In particular, the empirical data exhibit significant levels of autocorrelation and long-range dependence.
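Autocorrelation here means that the number of replacements in one period is correlated with the number some periods later. A minimal sketch of the sample autocorrelation at a given lag follows; the weekly counts are made up for illustration and the function name is ours.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (0 <= lag < len)."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Hypothetical replacements per week: busy weeks tend to follow busy weeks.
weekly_replacements = [1, 2, 3, 4, 5]
print(autocorrelation(weekly_replacements, 1))
```

For independent replacement counts the value at any non-zero lag would be close to zero; values clearly above zero indicate the kind of correlation the article reports.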
Conclusions
• The article demonstrates the unreliability of the MTTF and AFR data provided by disk vendors.
• Based on the data analysis, the paper's authors find a significant correlation between disk failure intervals.
• The paper substantiates, with significant statistical confidence, that the commonly made assumption of exponentially distributed time between failures is not realistic.
• The article identifies higher levels of variability and decreasing hazard rates as the key features that distinguish the empirical distribution of time between disk replacements from the exponential distribution.
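One common way to quantify "levels of variability" is the squared coefficient of variation, C² = variance / mean². An exponential distribution has C² = 1, so empirical values well above 1 point away from it. The sketch below assumes this measure and uses hypothetical inter-replacement times; it is not a reproduction of the article's analysis.

```python
def squared_cv(samples):
    """Squared coefficient of variation: population variance / mean squared."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return variance / mean ** 2


# Hypothetical times between replacements (hours): many short gaps, one long.
bursty_gaps = [1.0, 1.0, 1.0, 1.0, 96.0]
regular_gaps = [20.0, 20.0, 20.0, 20.0, 20.0]

print(squared_cv(bursty_gaps))   # well above 1: more variable than exponential
print(squared_cv(regular_gaps))  # exactly 0: no variability at all
```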
Criticism
• The data set is relatively small, which might invalidate the thorough statistical analysis performed on it
• The statistical model suggested in the article seems too simplistic to describe a system as complex as a disk drive population
Further research suggestion
• State machine disk health model (HMM)
• State estimation:
  – Vector of drive health indicators
  – Current state of the drives physically close to the drive
• Parameter estimation:
  – BIC + EM (Baum-Welch algorithm)
[Diagram: chain of hidden states S_1, S_2, …, S_{N-1}, S_N, each with a corresponding observation O_1, O_2, …, O_{N-1}, O_N]
Transition Probability :
Emission Probability :
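In the standard HMM formulation, the transition probability is a_ij = P(S_{t+1} = j | S_t = i) and the emission probability is b_j(o) = P(O_t = o | S_t = j); whether the slide used exactly this notation is not recoverable from the transcript. As an illustration, the sketch below runs the forward algorithm, the pass that Baum-Welch repeats inside its E-step, on a hypothetical two-state health model. The states, observation symbols and all probabilities are made up, and observations are simplified to a single discrete symbol rather than a vector of health indicators.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Probability of an observation sequence under an HMM (forward algorithm).

    initial[i]       = P(S_1 = i)
    transition[i][j] = P(S_{t+1} = j | S_t = i)
    emission[i][o]   = P(O_t = o | S_t = i)
    """
    n_states = len(initial)
    # alpha[i] = P(O_1..O_t, S_t = i), updated one observation at a time.
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs]
            * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical model: state 0 = healthy, state 1 = degraded;
# observation 0 = no SMART errors reported, 1 = SMART errors reported.
initial = [0.9, 0.1]
transition = [[0.95, 0.05], [0.0, 1.0]]  # a degraded drive never recovers
emission = [[0.9, 0.1], [0.3, 0.7]]

print(forward_likelihood(initial, transition, emission, [0, 0, 1]))
```

Baum-Welch would re-estimate `initial`, `transition` and `emission` from such forward (and backward) quantities, and BIC could then be used to compare fitted models with different numbers of states.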