Techniques for Handling Huge Storage
USENIX LISA ’10 Conference, November 8, 2010
Agenda
- How did we get here?
- When good data goes bad
- Capacity, planning, and design
- What comes next?
Note: this tutorial uses live demos, slides not so much
History
Milestones in Tape Evolution
1951 - magnetic tape for data storage
1964 - 9 track
1972 - Quarter Inch Cartridge (QIC)
1977 - Commodore Datasette
1984 - IBM 3480
1989 - DDS/DAT
1995 - IBM 3590
2000 - T9940
2000 - LTO
2006 - T10000
2008 - TS1130
Milestones in Disk Evolution
1954 - hard disk invented
1950s - solid state disk invented
1981 - Shugart Associates System Interface (SASI)
1984 - Personal Computer Advanced Technology (PC/AT) Attachment, later shortened to ATA
1986 - "Small" Computer System Interface (SCSI)
1986 - Integrated Drive Electronics (IDE)
1994 - EIDE
1994 - Fibre Channel (FC)
1995 - flash-based SSDs
2001 - Serial ATA (SATA)
2005 - Serial Attached SCSI (SAS)
Architectural Changes
- Simple, parallel interfaces
- Serial interfaces
- Aggregated serial interfaces
When Good Data Goes Bad
Failure Rates
- Mean Time Between Failures (MTBF)
  - Statistical interarrival error rate
  - Often cited in literature and data sheets
  - MTBF = total operating hours / total number of failures
- Annualized Failure Rate (AFR)
  - AFR = operating hours per year / MTBF
  - Expressed as a percent
- Example
  - MTBF = 1,200,000 hours
  - Year = 24 x 365 = 8,760 hours
  - AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%
- AFR is easier to grok than MTBF
Operating hours per year is a flexible definition
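A minimal sketch of the AFR arithmetic above, assuming 24x7 operation; substitute whatever operating-hours-per-year figure the vendor's footnote uses:

```python
# AFR from a data-sheet MTBF, assuming 24x7 operation (8,760 hours/year).
def afr(mtbf_hours, operating_hours_per_year=24 * 365):
    """Annualized Failure Rate, as a fraction of the population per year."""
    return operating_hours_per_year / mtbf_hours

print(f"AFR = {afr(1_200_000):.2%}")  # 8,760 / 1,200,000 -> 0.73%
```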
Multiple Systems and Statistics
- Consider 100 systems, each with an MTBF = 1,000 hours
- At time = 1,000 hours, 100 failures have occurred on average
- Not all systems will see exactly one failure
[Bar chart: Number of Systems vs. Number of Failures (0 to 4). Most systems see zero or one failure; the systems with multiple failures are labeled Unlucky, Very Unlucky, and Very, Very Unlucky.]
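A rough sketch of the thought experiment above, assuming failures arrive as a Poisson process with a constant rate of 1/MTBF (the same assumption behind quoting a single MTBF):

```python
# 100 systems, MTBF = 1,000 hours, observed for 1,000 hours: one failure per
# system is expected, but the failures spread unevenly across the population.
from math import exp, factorial

def poisson_pmf(k, mean):
    return mean ** k * exp(-mean) / factorial(k)

systems, mean_failures = 100, 1000 / 1000   # observation time / MTBF
for k in range(5):
    print(f"{k} failures: ~{systems * poisson_pmf(k, mean_failures):.0f} systems")
# roughly 37 systems see no failure, 37 see one, 18 two, 6 three, 2 four
```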
Failure Rates
- MTBF is a summary metric
- Manufacturers estimate MTBF by stressing many units for short qualification periods
- Summary metrics hide useful information
- Example: mortality study
  - Mortality of children aged 5-14 during 1996-1998: 20.8 deaths per 100,000
  - Equivalent MTBF = 100,000 / 20.8 = 4,807 years
  - Current world average life expectancy is 67.2 years
- For large populations, such as huge disk farms, the summary MTBF can appear constant
- The better question to answer: "Is my failure rate increasing or decreasing?"
Why Do We Care?
- Summary statistics like MTBF or AFR can be misleading or risky if we do not also distinguish between stable and trending processes
- We need to analyze the ordered times between failures in relation to system age to describe system reliability
Time Dependent Reliability
- Useful for repairable systems
  - System can be repaired to satisfactory operation by any action
  - Failures occur sequentially in time
- Measure the age of the components of a system
  - Need to distinguish age from interarrival times (time between failures)
  - Doesn't have to be precise; resolution of weeks works OK
  - Some devices report Power On Hours (POH): SMART for disks, OSes
  - Clerical solutions or inventory asset systems work fine
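As a minimal sketch (not from the tutorial), a mean cumulative function of the kind plotted in the TDR examples that follow can be computed directly from per-system failure ages; this version assumes every system has been observed for the full window (no censoring):

```python
# Mean cumulative failures per system versus fleet age in months.
# A straight line suggests a stable failure rate; curving upward suggests
# an increasing rate (wear-out or a common event).
def mean_cumulative_failures(failure_ages_by_system, max_age_months):
    n = len(failure_ages_by_system)
    return [sum(1 for ages in failure_ages_by_system for a in ages if a <= t) / n
            for t in range(1, max_age_months + 1)]

# Hypothetical fleet of four systems; failure ages in months of service
fleet = [[3, 14], [9], [], [5, 11, 12]]
print(mean_cumulative_failures(fleet, 15))
```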
TDR Example 1
[Plot: Mean Cumulative Failures vs. System Age in months (1 to 50) for Disk Set A, Disk Set B, and Disk Set C, compared against the target MTBF line.]
TDR Example 2
Did a common event occur?
[Plot: Mean Cumulative Failures vs. System Age in months (1 to 50) for Disk Set A, Disk Set B, and Disk Set C, compared against the target MTBF line.]
TDR Example 2.5
[Plot: Mean Cumulative Failures vs. calendar date, January 1, 2010 through February 3, 2014.]
Long Term Storage
- Near-line disk systems for backup
  - Access time and bandwidth advantages over tape
- Enterprise-class tape for backup and archival
  - 15-30 year shelf life
  - Significant ECC
  - Read error rate: 1e-20
  - Enterprise-class HDD read error rate: 1e-15
Reliability
- Reliability is time dependent
- TDR analysis reveals trends
- Use cumulative plots, mean cumulative plots, and recurrence rates
- Graphs are good
- Track failures and downtime by system versus age and calendar dates
- Correlate anomalous behavior
- Manage retirement, refresh, and preventive maintenance processes using real data
Data Sheets
Reading Data Sheets
- Manufacturers publish useful data sheets and product guides
- Reliability information
  - MTBF or AFR
  - UER, or equivalent
  - Warranty
- Performance
  - Interface bandwidth
  - Sustained bandwidth (aka internal or media bandwidth)
  - Average rotational delay or rpm (HDD)
  - Average response or seek time
  - Native sector size
- Environmentals
  - Power
AFR operating hours per year can be a footnote
Availability
Nines Matter
- Is the Internet up?
Nines Matter
- Is the Internet up?
- Is the Internet down?
Nines Matter
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
Nines Don't Matter
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
- Do 5-9's matter?
Reliability Matters!
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
- Do 5-9's matter?
- Reliability matters!
Designing for Failure
- Change design perspective
- Design for success
  - How to make it work?
  - What you learned in school: solve the equation
  - Can be difficult...
- Design for failure
  - How to make it work when everything breaks?
  - What you learned in the army: win the war
  - Can be difficult... at first...
Example: Design for Success
[Diagram: two x86 servers running NexentaStor with the HA-Cluster plugin, connected to shared storage over FC, SAS, or iSCSI.]
Designing for Failure
- Application-level replication
  - Hard to implement - coding required
  - Some activity in the open community
  - Hard to apply to general-purpose computing
- Examples
  - DoD, Google, Facebook, Amazon, ...
  - The big guys
- Tends to scale well with size
- Multiple copies of data
Reliability - Availability
- Reliability trumps availability
  - If disks didn't break, RAID would not exist
  - If servers didn't break, HA clusters would not exist
- Reliability is measured in probabilities
- Availability is measured in nines
Data Retention
Evaluating Data Retention
- MTTDL = Mean Time To Data Loss
- Note: MTBF is not constant in the real world, but assuming it is keeps the math simple
- MTTDL[1] is a simple MTTDL model
- No parity (single vdev, striping, RAID-0):
  MTTDL[1] = MTBF / N
- Single parity (mirror, RAIDZ, RAID-1, RAID-5):
  MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
- Double parity (3-way mirror, RAIDZ2, RAID-6):
  MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
- Triple parity (4-way mirror, RAIDZ3):
  MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
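The four formulas collapse into one expression for P disks of parity; here is a sketch with MTBF and MTTR in hours (the example inputs are assumptions, not data-sheet values):

```python
# MTTDL[1] for a single vdev of N disks with P disks of parity:
#   MTTDL[1] = MTBF^(P+1) / (N * (N-1) * ... * (N-P) * MTTR^P)
def mttdl1_years(mtbf_h, mttr_h, n_disks, parity):
    denom = 1.0
    for i in range(parity + 1):         # N * (N-1) * ... * (N-parity)
        denom *= n_disks - i
    hours = mtbf_h ** (parity + 1) / (denom * mttr_h ** parity)
    return hours / 8760                 # convert to years

# Example: 8-disk RAIDZ2, MTBF = 1.2M hours, 24-hour MTTR (assumed values)
print(f"{mttdl1_years(1_200_000, 24, 8, 2):.3g} years")
```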
Another MTTDL Model
- The MTTDL[1] model doesn't take unrecoverable reads into account
- But unrecoverable reads (UER) are becoming the dominant failure mode
- UER is specified as errors per bits read
- More bits = higher probability of loss per vdev
- The MTTDL[2] model considers UER
Why Worry about UER?
- Richard's study
  - 3,684 hosts with 12,204 LUNs
  - 11.5% of all LUNs reported read errors
- Bairavasundaram et al., FAST '08
  - www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
  - 1.53M LUNs over 41 months
  - RAID reconstruction discovers 8% of checksum mismatches
  - "For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined"
- Manufacturers trade UER for space
Why Worry about UER?
RAID array study
Why Worry about UER?
RAID array study
[Chart: failure modes observed in the RAID array study, split between unrecoverable reads and the disk disappearing ("disk pull").]
"Disk pull" tests aren't very useful
MTTDL[2] Model
- Probability that a reconstruction will fail:
  Precon_fail = (N-1) * size / UER
- The model doesn't work for non-parity schemes (single vdev, striping, RAID-0)
- Single parity (mirror, RAIDZ, RAID-1, RAID-5):
  MTTDL[2] = MTBF / (N * Precon_fail)
- Double parity (3-way mirror, RAIDZ2, RAID-6):
  MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
- Triple parity (4-way mirror, RAIDZ3):
  MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
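The same pattern as code, with assumed units (MTBF and MTTR in hours, disk size in bytes, UER as one error per so many bits read); treat it as a sketch of the model rather than a calibrated calculator:

```python
# MTTDL[2]: like MTTDL[1], but the final reconstruction is assumed to fail
# with probability Precon_fail = (N-1) * size_bits / uer_bits (capped at 1).
def mttdl2_years(mtbf_h, mttr_h, n_disks, parity, disk_bytes, uer_bits):
    if parity < 1:
        raise ValueError("MTTDL[2] is not defined for non-parity schemes")
    p_recon_fail = min(1.0, (n_disks - 1) * disk_bytes * 8 / uer_bits)
    denom = 1.0
    for i in range(parity):             # N, then N*(N-1), then N*(N-1)*(N-2)
        denom *= n_disks - i
    hours = mtbf_h ** parity / (denom * mttr_h ** (parity - 1) * p_recon_fail)
    return hours / 8760

# Example: 8 x 2 TB RAIDZ2, MTBF 1.2M h, 24 h MTTR, UER 1 per 1e15 bits (assumed)
print(f"{mttdl2_years(1_200_000, 24, 8, 2, 2e12, 1e15):.3g} years")
```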
Practical View of MTTDL[1]
MTTDL[1] Comparison
MTTDL Models: Mirror
Spares are not always better...
MTTDL Models: RAIDZ2
Space, Dependability, and Performance
Dependability Use Case
- Customer has 15+ TB of read-mostly data
- 16-slot, 3.5" drive chassis
- 2 TB HDDs
- Option 1: one raidz2 set
  - 24 TB available space (12 data + 2 parity)
  - 2 hot spares, 48-hour disk replacement time
  - MTTDL[1] = 1,790,000 years
- Option 2: two raidz2 sets
  - 24 TB available space (each set: 6 data + 2 parity)
  - No hot spares
  - MTTDL[1] = 7,450,000 years
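To see where the difference comes from, the two layouts can be dropped into the double-parity MTTDL[1] formula. The MTBF and MTTR below are placeholders, so the absolute numbers will not match the slide (which uses the tutorial's own MTBF and replacement/resilver assumptions), but the narrow-vdev advantage shows up the same way:

```python
# Double-parity MTTDL[1] applied to the two 16-slot layouts above.
def mttdl1_raidz2_years(mtbf_h, mttr_h, n_disks):
    return mtbf_h ** 3 / (n_disks * (n_disks - 1) * (n_disks - 2) * mttr_h ** 2) / 8760

mtbf, mttr = 1_200_000, 24                                  # hours, assumed
option1 = mttdl1_raidz2_years(mtbf, mttr, n_disks=14)       # one 12+2 raidz2 set
option2 = mttdl1_raidz2_years(mtbf, mttr, n_disks=8) / 2    # two 6+2 sets; either can lose data
print(f"Option 1: {option1:.3g} years   Option 2: {option2:.3g} years")
# Both options expose 24 TB, but the narrower vdevs win on MTTDL[1].
```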
Planning for Spares
- Number of systems drives the need for spares
- How many spares do you need?
- How often do you plan replacements?
- Replacing devices immediately becomes impractical
- Not replacing devices increases risk, but how much?
- There is no black/white answer, it depends...
Spares Optimizer Demo
Capacity, Planning, and Design
Space
- Space is a poor sizing metric, really!
- Technology marketing heavily pushes space
- Maximizing space can mean compromising performance AND reliability
- As disks and tapes get bigger, they don't get better
- The $150 rule
- PHBs get all excited about space
- Most current capacity planning tools manage by space
Bandwidth
- Bandwidth constraints in modern systems are rare
- Overprovisioning for bandwidth is relatively simple
- Where to gain bandwidth can be tricky
  - Link aggregation: Ethernet, SAS
  - MPIO
  - Adding parallelism beyond 2 trades off reliability
Latency
- Lower latency == better performance
- Latency != IOPS
  - IOPS also achieved with parallelism
  - Parallelism only delivers latency when latency is constrained by bandwidth
- Latency = access time + transfer time
- HDD
  - Access time limited by seek and rotate
  - Transfer time usually limited by media or internal bandwidth
- SSD
  - Access time limited by architecture more than c
  - Transfer time limited by architecture and interface
- Tape
  - Access time measured in seconds
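A back-of-the-envelope sketch of the HDD case, using the data-sheet quantities listed earlier (average seek, rpm, media bandwidth); the values are illustrative:

```python
# HDD latency ~= access time (average seek + half a rotation) + transfer time.
def hdd_latency_ms(avg_seek_ms, rpm, io_bytes, media_mb_per_s):
    half_rotation_ms = 0.5 * 60_000 / rpm              # average rotational delay
    transfer_ms = io_bytes / (media_mb_per_s * 1e6) * 1000
    return avg_seek_ms + half_rotation_ms + transfer_ms

# 7,200 rpm drive, 8.5 ms average seek, 120 MB/s media rate, 8 KiB random read
print(f"{hdd_latency_ms(8.5, 7200, 8192, 120):.1f} ms")    # ~12.7 ms; transfer is noise
```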
Deduplication
What is Deduplication?
- A $2.1 billion feature
- 2009 buzzword of the year
- Technique for improving storage space efficiency
  - Trades big I/Os for small I/Os
  - Does not eliminate I/O
- Implementation styles
  - Offline or post processing
    - Data written to nonvolatile storage; a process comes along later and dedupes the data
    - Example: tape archive dedup
  - Inline
    - Data is deduped as it is being allocated to nonvolatile storage
    - Example: ZFS
Dedup how-to
- Given a bunch of data
- Find data that is duplicated
- Build a lookup table of references to the data
- Replace duplicate data with a pointer to the entry in the lookup table
- Granularity: file, block, or byte
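A toy sketch of the recipe at block granularity; the 4 KiB block size and SHA-256 checksum are arbitrary choices for illustration, not any particular product's design:

```python
# Toy block-level dedup: chunk data into fixed-size blocks, key a lookup table
# by block checksum, store each unique block once, and bump a reference count.
import hashlib

BLOCK = 4096
table = {}   # checksum -> {"data": bytes, "refs": int}

def dedup_write(data):
    refs = []                                   # stand-ins for on-disk pointers
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        key = hashlib.sha256(block).hexdigest()
        entry = table.setdefault(key, {"data": block, "refs": 0})
        entry["refs"] += 1
        refs.append(key)
    return refs

dedup_write(b"A" * 8192 + b"B" * 4096)
dedup_write(b"A" * 4096)                        # duplicate block: only a refcount bump
print(len(table), "unique blocks stored")       # -> 2
```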
Dedup Constraints
- Size of the deduplication table
- Quality of the checksums
  - Collisions happen
  - All possible permutations of N bits cannot be stored in N/10 bits
  - Checksums can be evaluated by probability of collisions
  - Multiple checksums can be used, but gains are marginal
- Compression algorithms can work against deduplication
  - Dedup before or after compression?
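To put rough numbers on "collisions happen", the birthday-bound approximation estimates the chance that any two blocks in the dedup table share a checksum; the block count below is an illustrative assumption:

```python
# Birthday bound: P(some collision) ~= k^2 / 2^(n+1) for k blocks, n-bit checksum.
def collision_probability(n_blocks, checksum_bits):
    return min(1.0, n_blocks ** 2 / 2 ** (checksum_bits + 1))

blocks = 2 ** 40                       # ~10^12 blocks, a very large pool (assumed)
for bits in (64, 128, 256):
    print(f"{bits}-bit checksum: P(collision) ~ {collision_probability(blocks, bits):.3g}")
```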
Verification
[Flowchart: the dedup write path: write() -> compress -> checksum -> DDT entry lookup. No DDT match: create a new entry. DDT match with verify off: add a reference. DDT match with verify on: read the stored data and compare; add a reference if the data matches, otherwise create a new entry.]
Reference Counts
Eggs courtesy of Richard’s chickens
Replication
Replication Services
[Diagram: replication services arranged by recovery point objective (seconds, hours, days) and system I/O performance impact (faster to slower): mirror, application-level replication, block replication (DRBD, SNDR), object-level sync (databases, ZFS), file-level sync (rsync), and traditional backup (NDMP, tar).]
How Many Copies Do You Need?
- Answer: at least one, more is better...
  - One production, one backup
  - One production, one near-line, one backup
  - One production, one near-line, one backup, one at DR site
  - One production, one near-line, one backup, one at DR site, one archived in a vault
- RAID doesn't count
- Consider 3 to 4 copies as a minimum for important data
Tiering Example
[Diagram: a big, honking disk array backed up to a big, honking tape library with file-based backup.]
Works great, but...
Tiering Example
[Diagram: the same disk array and tape library with file-based backup; 10 million files, 1 million daily changes, and a 12-hour backup window.]
... backups never complete
Tiering Example
[Diagram: the disk array replicates hourly at the block level to a near-line backup tier, which feeds the tape library on a weekly backup window; 10 million files, 1 million daily changes.]
Backups to near-line storage and tape have different policies
Tiering Example
[Diagram: the same tiering, with the near-line backup tier also serving restores.]
Quick file restoration possible
Application-Level Replication Example
[Diagram: an application stores data at three different sites (Site 1, Site 2, Site 3), with a long-term archive option.]
Data Sheets
Reading Data Sheets Redux
- Manufacturers publish useful data sheets and product guides
- Reliability information
  - MTBF or AFR
  - UER, or equivalent
  - Warranty
- Performance
  - Interface bandwidth
  - Sustained bandwidth (aka internal or media bandwidth)
  - Average rotational delay or rpm (HDD)
  - Average response or seek time
  - Native sector size
- Environmentals
  - Power
AFR operating hours per year can be a footnote
Summary
Key Points
- You will need many copies of your data; get used to it
- The cost per byte decreases faster than old habits can be kicked
- Replication is a good thing; use it often
- Tiering is a good thing; use it often
- Beware of designing only for success; design for failure, too
- Reliability trumps availability
- Space, dependability, performance: pick two