Erasure Coding in Windows Azure Storage
Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin Microsoft Corporation USENIX ATC, Boston, MA, June 2012
[Map: North America, Europe, and Asia Pacific regions, showing major datacenters and CDN PoPs]
Windows Azure Storage
• Blobs
• CDN
• Drives
• Tables
• Queues
Massive Distributed Storage Systems in the Cloud
MTTF_First = MTTF_One / n
Replication vs. Erasure Coding
Example: two data values, a=2 and b=3.
• replication: store three full copies — (a=2, b=3) (a=2, b=3) (a=2, b=3)
• erasure coding: store a=2 and b=3 once, plus the parities a+b=5 and 2a+b=7; any two of the four stored symbols recover both values
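The slide's toy example can be sketched as follows (illustrative only — function names are mine, and a real system does this arithmetic over a finite field rather than the integers):

```python
# Toy illustration: protect two values a=2, b=3.
# Replication keeps full copies; erasure coding keeps the originals
# plus two parity equations, a+b and 2a+b.

def encode(a, b):
    """Return the four stored symbols: a, b, a+b, 2a+b."""
    return [a, b, a + b, 2 * a + b]

def recover_from_parities(p1, p2):
    """Recover (a, b) when both originals are lost, from a+b and 2a+b."""
    a = p2 - p1          # (2a+b) - (a+b) = a
    b = p1 - a           # (a+b) - a = b
    return a, b

symbols = encode(2, 3)   # [2, 3, 5, 7]
assert recover_from_parities(symbols[2], symbols[3]) == (2, 3)
```

Any two of the four symbols determine a and b, so the coded form tolerates two losses while storing 4 symbols instead of replication's 6.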
WAS Stream Layer
• Storage Location Service, managing the storage stamps
• Storage Stamp: Partition Layer on top of the Stream Layer; intra-stamp replication within the Stream Layer
• Inter-stamp (geo) replication across stamps
Conventional Erasure Coding – Reed-Solomon 6+3
sealed extent ( 3 GB ) → 6 data fragments d0–d5 + 3 parity fragments p1–p3
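A minimal Reed-Solomon 6+3 sketch, working over the prime field GF(257) for readability (production codes use GF(2^w); all names here are illustrative). Encoding fits the unique degree-5 polynomial through the 6 data fragments and evaluates it at 3 more points for the parities; any 6 of the 9 fragments then reconstruct the data (the MDS property):

```python
# Reed-Solomon 6+3 sketch over GF(P), P prime; data symbols must be < P.
P = 257

def interpolate(points, x):
    """Evaluate at x the Lagrange polynomial through `points` (mod P)."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def rs_encode(data):                 # len(data) == 6
    pts = list(enumerate(data))
    return data + [interpolate(pts, x) for x in (6, 7, 8)]

def rs_decode(fragments):            # dict {index: value} with any 6 entries
    pts = list(fragments.items())[:6]
    return [interpolate(pts, x) for x in range(6)]

coded = rs_encode([10, 20, 30, 40, 50, 60])
# lose d0, d1, d2; reconstruct from the remaining 6 fragments
survivors = {i: coded[i] for i in (3, 4, 5, 6, 7, 8)}
assert rs_decode(survivors) == [10, 20, 30, 40, 50, 60]
```

The cost shown later in the talk is visible here: rebuilding even one fragment needs 6 reads, because decoding always interpolates from 6 points.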
Designing For Erasure Coding - 1
Designing For Erasure Coding - 2
Space Savings with RS 6+3 (over 3-replication)
• replicated: three full copies of the sealed extent ( 3 GB )
• RS 6+3: data fragments d0–d5 + code fragments p0–p2 → overhead (6+3)/6 = 1.5x
• RS 12+4: data fragments d0–d11 + code fragments p0–p3 → overhead (12+4)/12 = 1.33x
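The overhead arithmetic from the slide, spelled out (a scheme with k data fragments and m parities stores (k + m)/k times the raw data, versus 3x for triple replication):

```python
# Storage overhead of a k+m erasure code vs. 3-way replication.
def overhead(k, m):
    return (k + m) / k

assert overhead(6, 3) == 1.5                  # RS 6+3
assert abs(overhead(12, 4) - 4 / 3) < 1e-12   # RS 12+4, ~1.33x

print(f"savings vs 3-replication: RS 6+3 -> {1 - overhead(6, 3) / 3:.0%}, "
      f"RS 12+4 -> {1 - overhead(12, 4) / 3:.1%}")
```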
reconstruction read is expensive –
with RS 12+4 (d0–d11, p0–p3), reading d0 while it is unavailable requires 12 fragments (12 disk I/Os, 12 network transfers)
can we achieve 1.33x overhead while requiring only 6 fragments for reconstruction?
optimize erasure coding for cloud storage: make single-failure reconstruction the most efficient case
• LRC 12+2+2: 12 data fragments (x0–x5, y0–y5), 2 local parities (px, py) and 2 global parities (q0, q1)
• storage overhead: (12 + 2 + 2) / 12 = 1.33x
• local parity: single-failure reconstruction requires only 6 fragments
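The local-parity benefit can be sketched with plain XOR parities (illustrative only — WAS computes its parities over a Galois field, and the global parities are omitted here). Each group of 6 data fragments gets a local parity, so repairing one lost fragment reads only the 6 other fragments of its group, versus 12 reads for RS 12+4:

```python
# LRC local-group repair sketch: two groups of 6 data fragments,
# each with a local XOR parity.
from functools import reduce

x = [11, 22, 33, 44, 55, 66]        # group x
y = [17, 28, 39, 41, 52, 63]        # group y
px = reduce(lambda a, b: a ^ b, x)  # local parity of group x
py = reduce(lambda a, b: a ^ b, y)  # local parity of group y

# Fragment y1 becomes unavailable: XOR the 5 surviving group-y
# fragments with py -- 6 reads total, none from group x.
reads = [y[i] for i in (0, 2, 3, 4, 5)] + [py]
y1_rebuilt = reduce(lambda a, b: a ^ b, reads)
assert y1_rebuilt == y[1]
assert len(reads) == 6
```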
• LRC 12+2+2 needs to recover:
• arbitrary 3 failures
• as many 4-failure patterns as possible
one failure (y1): recover y1 from py (group y)
three failures (y1, x0, x2): recover y1 from py (group y); recover x0 and x2 from the global parities q0 and q1
four failures: how to recover these and all similar cases? (see paper)
• LRC 12+2+2 reliability: RS 12+4 > LRC 12+2+2 > RS 6+3
[Figure: reconstruction read cost vs. storage overhead, for Reed-Solomon and LRC. Relative to RS 6+3 (1.5x overhead, read cost 6): LRC (12+4+2) has the same overhead at half the cost (6 → 3); LRC (12+2+2) has the same cost at 1.33x overhead.]
• RS 10+4: HDFS-RAID at Facebook
• RS 6+3: GFS II (Colossus) at Google
• RS (6 + 3): reconstruction cost = 6
• LRC (14 + 2 + 2): reconstruction cost = 7
• RS (14 + 4): reconstruction cost = 14
LRC (14 + 2 + 2): 14% storage savings over RS (6 + 3)
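The 14% figure follows from the overheads (fragment counts are from the slides): LRC 14+2+2 stores 18 fragments for 14 data fragments, versus RS 6+3's 9 for 6.

```python
# Storage savings of LRC 14+2+2 over RS 6+3.
rs_6_3     = (6 + 3) / 6             # 1.5x overhead
lrc_14_2_2 = (14 + 2 + 2) / 14       # ~1.286x overhead
savings = 1 - lrc_14_2_2 / rs_6_3    # exactly 1/7
assert abs(savings - 1 / 7) < 1e-12  # ~14.3%
```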