Failure Trends in a Large Disk Drive Population

Authors: Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso

Presented by Vinuthna & Arjun


Motivation

• Over 90% of all new information is stored on magnetic media

• Most of this data is stored on hard disk drives (HDDs)

• Study failure patterns and the key factors that affect drive lifetime

• Analyze the correlation between failures and parameters believed to affect HDD lifetime

• Why? Better design and maintenance of storage systems


Previous studies

• Mostly accelerated aging experiments, which are poor predictors of field failure rates

• Moderate population sizes

• Statistics drawn from returned units in warranty databases

• No insight into what actually happened to a drive during operation


Our study

• Large study examining hard drives in Google's infrastructure: more than 100,000 disk drives

• The disk population is large, and the study adds depth and detail from an end user's point of view

• Why? Manufacturers quote failure rates below 2%, but end users experience much higher failure rates

• Some studies report that in 20-30% of drives that fail in the field, the manufacturer finds no problem


System Health Infrastructure

• Collection layer: collects data from each server and stores it in a repository

• Storage is based on Bigtable, which is built on GFS. Data is held in 2D cells, with a third dimension for time versions

• The database keeps a complete history of environmental, error, configuration and repair events

• A lightweight daemon runs on each machine and reports information to the collectors

• Large-scale analysis is done with MapReduce

• The framework takes care of running the computation, so the user focuses on the algorithm itself
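
The storage and analysis pattern on this slide can be illustrated with a small sketch. It is a toy, single-process stand-in and not the Bigtable or MapReduce API: the names `VersionedTable` and `map_reduce`, the drive IDs and the column names are all assumptions made for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table of 2D cells (row, column) with a time dimension."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def latest(self, row, column):
        # Most recent value, or None if the cell was never written
        versions = self._cells.get((row, column))
        return versions[-1][1] if versions else None

    def history(self, row, column):
        # All versions of the cell, oldest first
        return list(self._cells.get((row, column), []))


def map_reduce(records, mapper, reducer):
    """Map each record to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical usage: store two temperature readings, then count events by type.
table = VersionedTable()
table.put("drive-001", "temperature_c", 1, 31)
table.put("drive-001", "temperature_c", 2, 34)

events = [("drive-001", "repair"), ("drive-002", "error"), ("drive-001", "error")]
event_counts = map_reduce(events, lambda e: [(e[1], 1)], lambda k, vs: sum(vs))
```

The analyst only writes the `mapper` and `reducer` functions here, which mirrors the point that the user focuses on the algorithm rather than on how the computation is run.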


Some other info

• Data collected over nine months

• Mix of HDDs: different ages, manufacturers and models

• Failure information mined from repair databases going back up to 5 years

• We monitor temperature, activity levels and SMART parameters

• Results are not affected by the population mix


Results

• Utilization

• Previous notion: high duty cycles affect disk drives negatively


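Utilization and annualized failure rate (AFR)

• "More utilization, more failures" holds only in the infant mortality stage and at the end of life

• After the first year, the AFR of high-utilization drives is only moderately higher than that of low-utilization drives

• How is this possible? One hypothesis is survival of the fittest: drives that are vulnerable to heavy use fail early, leaving a more robust population. Another is that the previously reported correlation came from accelerated life tests, which mostly model early-life failures, and the same trend is seen here for young drives

• Conclusion: utilization has a much weaker correlation with failure than previously assumed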
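
As a reading aid for the AFR figures, here is a minimal sketch of how an annualized failure rate can be computed per group, assuming AFR is taken as failures divided by drive-years of operation. The slides do not give the exact formula used in the study, and the function names and numbers below are hypothetical.

```python
def annualized_failure_rate(failures, drive_days):
    """Failures per drive-year of operation, returned as a fraction."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years


def afr_by_group(observations):
    """observations: iterable of (group, failures, drive_days) tuples."""
    totals = {}
    for group, failures, drive_days in observations:
        f, d = totals.get(group, (0, 0.0))
        totals[group] = (f + failures, d + drive_days)
    return {g: annualized_failure_rate(f, d) for g, (f, d) in totals.items()}


# Hypothetical counts, for illustration only (not data from the study).
sample = [
    ("low utilization", 1, 36500),
    ("low utilization", 1, 36500),
    ("high utilization", 3, 36500),
]
rates = afr_by_group(sample)
```

Normalizing by drive-years rather than by drive count keeps groups comparable when drives were observed for different lengths of time.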


Temperature

• Previous belief: a temperature change of 15 °C can double the failure rate

• Temperature distribution (PDF) plot: the failure rate does not increase with temperature. In fact, lower temperatures may have a higher failure rate

• Age vs. AFR: flat failure rate for mid-range temperatures, modest increase at low temperatures

• High temperature is not associated with a high failure rate, except for older drives

• Conclusion: within a moderate temperature range, temperature is not a strong factor in failure rate


SMART Data Analysis

• Some signals are more relevant to disk failures than others

• Parameters

– Scan errors

– Reallocation counts

– Offline reallocations

– Probational counts

– Miscellaneous signals


Scan errors

• Errors reported when drives scan the disk surface in the background

• Indicative of surface defects

• Consistent impact on AFR

• After their first scan error, drives are 39 times more likely to fail within 60 days than drives with no scan errors
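
A "times more likely to fail" figure of this kind is a ratio of failure fractions between two groups of drives. The sketch below shows that arithmetic under simple assumptions: each drive is reduced to two booleans (signal seen, failed within the observation window), and the counts are made up for illustration. The function names are hypothetical, and the study's exact method is not described on the slides.

```python
def failure_fraction(drives, has_signal):
    """Fraction of drives that failed, among those with (or without) the signal.

    drives: iterable of (signal_seen, failed_within_window) boolean pairs.
    """
    selected = [failed for seen, failed in drives if seen == has_signal]
    if not selected:
        raise ValueError("no drives in this group")
    return sum(selected) / len(selected)


def relative_risk(drives):
    """How many times more likely drives with the signal are to fail."""
    drives = list(drives)
    with_signal = failure_fraction(drives, True)
    without_signal = failure_fraction(drives, False)
    if without_signal == 0:
        raise ValueError("no failures among drives without the signal")
    return with_signal / without_signal


# Made-up population: 10 drives with the signal (3 failed),
# 100 drives without it (1 failed).
population = (
    [(True, True)] * 3 + [(True, False)] * 7
    + [(False, True)] * 1 + [(False, False)] * 99
)
risk = relative_risk(population)  # roughly 30 for these made-up counts
```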


Reallocation Counts

• Number of times a faulty sector has been remapped to a new physical sector

• Consistent impact on AFR

• After their first reallocation, drives are 14 times more likely to fail within 60 days than drives with no reallocations


Offline reallocations

• Subset of reallocation counts

• Reallocated sectors found during background scrubbing

• Survival probability is worse than for total reallocations

• 21 times more likely to fail


Probational counts

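• Suspect sectors are placed "on probation" until they either fail permanently or continue to work without problems

• 16 times more likely to fail

• The critical threshold is 1: the elevated risk applies from the first probational count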


Miscellaneous signals

• Seek errors

• CRC errors

• Power cycles

• Calibration retries

• Spin retries

• Power-on hours

• Vibration


Conclusion

• Larger population size used compared to previous studies

• Lack of consistent pattern of failures for high temperatures or utilization levels

• Some SMART parameters (scan errors, reallocation counts, offline reallocations and probational counts) are well correlated with failure probability

• Prediction models based only on SMART parameters are limited in accuracy
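
One way to see the limit in the last point: a predictor that flags a drive once any of the four SMART signals reaches a threshold can only catch failures that were preceded by such a count. The sketch below measures that coverage on a list of failed drives. The signal names, the dictionary layout and the sample data are assumptions for illustration and are not taken from the study.

```python
# Hypothetical field names for the four signals discussed in the slides.
STRONG_SIGNALS = (
    "scan_errors",
    "reallocations",
    "offline_reallocations",
    "probational_counts",
)


def flagged(smart_counts, threshold=1):
    """True if any strong signal has reached the threshold."""
    return any(smart_counts.get(name, 0) >= threshold for name in STRONG_SIGNALS)


def coverage_of_failed(failed_drives, threshold=1):
    """Fraction of failed drives that the SMART-only rule would have flagged."""
    failed_drives = list(failed_drives)
    if not failed_drives:
        raise ValueError("no failed drives given")
    caught = sum(1 for counts in failed_drives if flagged(counts, threshold))
    return caught / len(failed_drives)


# Made-up SMART counts for four failed drives.
failed = [
    {"scan_errors": 2},
    {},                          # failed with no SMART counts at all
    {"reallocations": 1},
    {"probational_counts": 0},   # signal reported, but still zero
]
coverage = coverage_of_failed(failed)  # 0.5 for this sample
```

Failed drives that never showed a nonzero count put an upper bound on what any model built on these signals alone can predict.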