Post on 13-Mar-2020
DON’T LET RAID RAID THE LIFETIME OF
YOUR SSD ARRAY
Sangwhan Moon and A. L. Narasimha Reddy
Texas A&M University
sangwhan@tamu.edu reddy@ece.tamu.edu
• Reliability of MLC Flash Memory
RELIABILITY OF SSD
1.E-07
1.E-06
1.E-05
1.E-04
1.E-03
1.E-02
1.E-01
0 5 10 15 20 25 30 35 40 45 50
Raw
Bit
Err
or
Rat
e
Write Count (x1000)
Read Disturb (per read) Data Retention (per month)
* The measurement data from Hairong Sun et al., “Qualifying Reliability of Solid-State Storage from Multiple Aspects,” MSST’11. 3/26
• Device Level Protection Scheme
– Error Correcting Code (ECC)
– Flash Translation Layer
• Log-like Writing and Garbage Collection
• Wear Leveling
• System Level Protection Scheme
– Parity Protection (RAID5)
These protection schemes require additional writes
internally which in turn reduce the lifetime of SSD
RELIABILITY OF SSD
4/26
• Protect a device array from a device failure
– Protect each page group from a page error
• PARITY = XOR of ALL data
PARITY PROTECTION (RAID5)
010000….
101111….
111100….
000011….
010000….
111100….
000011….
Failure
Recovery
PARITY
5/26
1. Parity update results in additional writes
– Write amplification: [*N/(N-1), 2]
PARITY PROTECTION (RAID5)
* N is the number of SSDs in an SSD array
010000….
101111….
111100….
000011….
111001….
101010….
Write Parity Update
6/26
2. Parity consumes more space
– Higher space utilization reduces the lifetime
PARITY PROTECTION (RAID5)
Striping
Parity Protection
50%
75% PARITY PARITY
PARITY
7/26
• Parity protection is supposed to improve
the lifetime of SSD array
1) Parity update amplifies the number of
writes by up to 2
2) Space overhead for parity protection
initiates frequent garbage collection
PROBLEM STATEMENT
Is parity protection beneficial or not in terms of reliability?
8/26
For the same number of SSDs given, to
store the same amount of data, which is
better in lifetime, striping (RAID0) or
parity protection (RAID5)?
PROBLEM STATEMENT
9/26
CONTRIBUTIONS
Our preliminary results
Parity protection conditionally provides benefit in lifetime over striping.
Parity protection vs. Striping
Systems with different parameters are explored.
Markov models
We estimated the lifetime of SSD arrays.
10/26
• Page Error Rate Model at x write count
– Bit errors accumulate until access time
– ECC detects/corrects the bit errors
LIFETIME MODEL
0 1 2 K FE …
𝛍 + 𝐠(𝐢, 𝐱)
𝐒𝛌(𝐱) (𝐒 − 𝟏)𝛌(𝐱) (𝐒 − 𝐊)𝛌(𝐱)
Raw Bit Error Rate Bits per page
Page recovery rate
ECC correctable bits
Page Error
# of bit errors accumulates
Uncorrectable Page Error Rate 11/26
• Source of failures
– Page error
– Device failure
• Any failure results in data loss in striping
• Parity protection loses data when
– Two page errors in the same page group
– Two device failures
– Page error + device failure
– Device failure + page error
LIFETIME MODEL
12/26
• Parity protected SSD array
LIFETIME MODEL
0
v0
FR
𝑁𝑢𝑠𝑇𝜈(𝑥) 𝑁 − 1 𝜈 𝑥 + 𝑑
𝜉
d0
𝑁𝑑 𝜂
1
v1 FR
𝑁 − 2 𝑢𝑠𝑇𝜈 𝑥 + 𝑑
𝜉
d1
𝑁𝑑 𝜂
𝑁 − 1 𝑇𝑢𝑠𝜈(𝑥)
𝑁 − 1 𝑢𝑠𝑇𝜈 𝑥 + 𝑑
𝑁 − 2 𝜈 𝑥 + 𝑁 − 1 𝑑
. . .
Uncorrectable page error rate
Device failure rate
Device recovery rate
Page group recovery rate
Data loss
v0
v: page error d: device failure FR: data loss
The number of device replacement for device failure recovery
Data loss
13/26
• Mean Time to Data Loss (MTTDL)
– The expected time to encounter the first
data loss in an SSD array
LIFETIME MODEL
MTTDL = jp(j) 1− p(i)
j−1
i=1
tw
∞
j=1
The probability of data loss at j write count
Time to write an SSD array
14/26
• SSD Parameters
– 3x nm MLC flash memory
– Capacity = 80GB
– Page size = 4KB
• ECC: 61-bit errors correctable BCH code
– Annual device failure rate = 3%
– TRIM command is exploited
SIMULATION ENVIRONMENT
15/26
• Simulation Parameters
– The amount of data = 30GB/SSD
– Workload
• Read + Write = 125 MB/s/SSD
• Read : Write = 3:1
– 8 SSDs in an SSD array
• Relative MTTDL
– The ratio of the lifetime of the target SSD array to
that of single SSD with default parameters
SIMULATION ENVIRONMENT
16/26
0
0.2
0.4
0.6
0.8
1
1.2
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Re
lati
ve M
TTD
L
Space Utilization of Single SSD
ANALYSIS OF SINGLE SSD
Lifetime decreases when space utilization increases
17/26
EVALUATION: DIFFERENT NUMBER OF DEVICES
* The amount of data = 30GB/SSD
0
0.2
0.4
0.6
0.8
1
1.2
Single 2 SSDs(Mirroring)
4 SSDs 8 SSDs 12 SSDs 16 SSDs
Re
lati
ve M
TTD
L
RAID5 Striping Single
Mirroring is a way worse than striping due to its penalties
RAID5 is potentially better than striping when ≥ 8 SSDs
18/26
EVALUATION: DIFFERENT AMOUNT OF DATA
0
0.2
0.4
0.6
0.8
1
1.2
80 GB 160 GB 240 GB 320 GB 400 GB
Re
lati
ve M
TTD
L
Total amount of data kept in 8 SSDs
RAID5 StripingConsiderable amount of overprovisioning is required for RAID5 to win over striping
19/26
EVALUATION: DIFFERENT PARAMETERS
[TECC] S. Moon and A. Reddy, “Write amplification due to ECC on flash memory or leave those bit errors alone,” in MSST’12 * 80% of workload accumulates on 20 % of space
0
0.5
1
1.5
2
Reference TECC Hotness* R:W=9:1 62.5MB/s/SSD
Re
lati
ve M
TTD
L
Lifetime of 8 SSDs with different parameters
RAID5 StripingTECC, hotness, read intensive workload, or less intensive workload may help RAID5 win over striping
20/26
EVALUATION: ANNUAL DEVICE FAILURE RATE = 5%
Annual Device Failure Rate = 5%
0
0.2
0.4
0.6
0.8
1
1.2
Single 2 SSDs(Mirroring)
4 SSDs 8 SSDs 12 SSDs 16 SSDs
Re
lati
ve M
TTD
L
RAID5 Striping SingleRAID5 is more effective on protecting SSDs with higher device failure rate.
21/26
0
0.2
0.4
0.6
0.8
1
1 RAID(4 SSDs)
5 RAIDs(20 SSDs)
10 RAIDs(40 SSDs)
15 RAIDs(60 SSDs)
20 RAIDs(80 SSDs)
25 RAIDs(100 SSDs)
50 RAIDs(200 SSDs)
Re
lati
ve M
TTD
L
RAID5 (4 SSDs) Striping
EVALUATION: LARGE SCALE STORAGE SYSTEM
RAID5 wins over striping in large scale storage systems
22/26
• Parity protection is potentially worse than
striping with small number of SSDs
• Parity protection wins against striping when
1) considerably lower space utilization is guaranteed.
2) TECC, hotness, read intensive workload, or less
intensive workload is provided.
3) SSDs have higher device failure rate.
• Parity protection wins against striping in large
scale storage systems
SUMMARY OF EVALUATIONS
23/26
• Other lifetime evaluations
– Different write sizes
– Other storage systems (e.g. RAID6)
• Monetary cost of ownership
• Validation of our analytic models
• Advanced techniques to reduce write
amplification from parity protection
FUTURE WORK
24/26
• Markov models to estimate the lifetime
of an SSD array with protection schemes
• Lifetime comparison of striping (RAID0)
and parity protection (RAID5) with
different parameters
• Parity protection is conditionally superior
to striping.
CONCLUSION
25/26