Storage codes: Managing Big Data with Small Overheads
Presented by Anwitaman Datta & Frédérique E. Oggier, Nanyang Technological University, Singapore
© 2013, A. Datta & F. Oggier, NTU Singapore
Tutorial at NetCod 2013, Calgary, Canada.
Big Data Storage: Disclaimer

A note from the trenches: "You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." – from Andrew Fikes' (Principal Engineer, Google) faculty summit talk `Storage Architecture and Challenges`, 2010.

We never get such calls!!
Big data

• June 2011 EMC2 study*: the world's data is more than doubling every 2 years
  – faster than Moore's Law
  – 1.8 zettabytes of data to be created in 2011
• Big data: a big problem? a big opportunity?

Zetta: 10^21. Zettabyte: if you stored all of this data on DVDs, the stack would reach from the Earth to the moon and back.

* http://www.emc.com/about/news/press/2011/20110628-01.htm
The data deluge: Some numbers

• Facebook "currently" (in 2010) stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (60 terabytes) each week and Facebook serves over one million images per second at peak. [quoted from a paper on "Haystack" from Facebook]

• On "Saturday", photo number four billion was uploaded to photo sharing site Flickr. This comes just five and a half months after the 3 billionth and nearly 18 months after photo number two billion. – Mashable (13th October 2009) [http://mashable.com/2009/10/12/flickr-4-billion/]
Scale how?

To scale vertically (or scale up) means to add resources to a single node in a system.*

To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application.*

Not distributing is not an option!

* Definitions from Wikipedia
Failure (of parts) is Inevitable

• But failure of the system is not an option either!
  – Failure is the pillar of rivals' success …
Deal with it

• Data from Los Alamos National Laboratory (DSN 2006), gathered over 9 years, 4750 machines, 24101 CPUs.
• Distribution of failures:
  – Hardware 60%
  – Software 20%
  – Network/Environment/Humans 5%
• Failures occurred between once a day and once a month.
Failure happens without fail, so …

• But failure of the system is not an option either!
  – Failure is the pillar of rivals' success …
• Solution: Redundancy
Many Levels of Redundancy

• Physical
• Virtual resource
• Availability zone
• Region
• Cloud

From: http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html
Redundancy Based Fault Tolerance

• Replicate data
  – e.g., 3 or more copies
  – In nodes on different racks
    • Can deal with switch failures
• Power back-up using battery between racks (Google)
But At What Cost?

• Failure is not an option, but …
  – … are the overheads acceptable?
Reducing the Overheads of Redundancy

• Erasure codes
  – Much lower storage overhead
  – High level of fault-tolerance
    • In contrast to replication or RAID based systems
• Has the potential to significantly improve the "bottomline"
  – e.g., both Google's new DFS Colossus, as well as Microsoft's Azure, now use ECs
Erasure Codes (ECs)

• An (n,k) erasure code = a map that takes as input k blocks and outputs n blocks, thus introducing n-k blocks of redundancy.
• 3-way replication is a (3,1) erasure code!
  – k=1 block → Encoding → n=3 encoded blocks
• An erasure code such that the k original blocks can be recreated out of any k encoded blocks is called MDS (maximum distance separable).
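To make the MDS property concrete, here is a toy sketch (not any code from this tutorial): a Reed-Solomon-style (n,k) code built by fitting a polynomial through the k data symbols and evaluating it at n points, so that any k of the n encoded symbols suffice to reconstruct. Symbols are single integers over GF(257) for simplicity; real systems work on byte blocks.

```python
# Toy (n,k) MDS erasure code over the prime field GF(257),
# Reed-Solomon style. One symbol stands in for one "block".
p = 257

def interp_eval(pts, x):
    """Evaluate, at x, the unique polynomial through `pts` (mod p),
    using Lagrange interpolation."""
    total = 0
    for xi, yi in pts:
        num = den = 1
        for xj, _ in pts:
            if xj != xi:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p  # den^-1 mod p
    return total

def encode(data, n):
    # Systematic: encoded symbol at position x is P(x), where P
    # interpolates (1, data[0]) .. (k, data[k-1]).
    pts = list(enumerate(data, start=1))
    return [interp_eval(pts, x) for x in range(1, n + 1)]

def decode(survivors, k):
    # MDS property: ANY k surviving (position, symbol) pairs suffice.
    return [interp_eval(survivors, x) for x in range(1, k + 1)]

data = [10, 20, 30]                  # k = 3 data blocks
enc = encode(data, 7)                # n = 7 encoded blocks
assert enc[:3] == data               # systematic prefix
assert decode([(4, enc[3]), (6, enc[5]), (7, enc[6])], 3) == data
```

Note that with k=1 and n=3 this degenerates to storing three copies of the single symbol, which is exactly the "3-way replication is a (3,1) erasure code" remark above.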
Erasure Codes (ECs)

• Originally designed for communication
[Figure: the message (Data = k blocks O1…Ok) is encoded with EC(n,k) into n blocks B1…Bn; some blocks are lost in transit; the receiver decodes any k' (≥ k) received blocks to reconstruct the original k blocks.]
Erasure Codes for Networked Storage

[Figure: the object (Data = k blocks O1…Ok) is encoded into n blocks B1…Bn, stored in storage devices in a network; despite lost blocks, retrieving any k' (≥ k) blocks and decoding reconstructs the object.]
HDFS-RAID

• Distributed RAID File system (DRFS) client
  – provides application access to the files in the DRFS
  – transparently recovers any corrupt or missing blocks encountered when reading a file (degraded read)
    • Does not carry out repairs
• RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS
• BlockFixer, which periodically recomputes blocks that have been lost or corrupted
  – RaidShell allows on-demand repair triggered by an administrator
• Two kinds of erasure codes implemented
  – XOR code and Reed-Solomon code (typically 10+4, i.e., n=14, k=10, a 1.4x storage overhead)

From http://wiki.apache.org/hadoop/HDFS-RAID
Replenishing Lost Redundancy for ECs

• Repair needed for long term resilience.

[Figure: retrieve any k' (≥ k) blocks, decode to recover the original k blocks, re-encode to recreate the lost blocks, and reinsert them in (new) storage devices, so that there are (again) n encoded blocks.]

• Repairs are expensive!
Tailor-Made Codes for Storage

Desired code properties include:
• Low storage overhead
• Good fault tolerance
  – Traditional MDS erasure codes achieve these.
• Better repairability
  – Smaller repair fan-in
  – Reduced I/O for repairs
  – Possibility of multiple simultaneous repairs
  – Fast repairs
  – Efficient B/W usage …
• Better data-insertion
• Better migration to archival
• Better …
Pyramid (Local Reconstruction) Codes

– Good for degraded reads (data locality)
– Not all repairs are cheap (only partial parity locality)

Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems, C. Huang et al. @ NCA 2007
Erasure Coding in Windows Azure Storage, C. Huang et al. @ USENIX ATC 2012
Regenerating Codes

• Network information flow based arguments to determine the "optimal" trade-off of storage vs. repair-bandwidth
Locally Repairable Codes

• Codes satisfying: low repair fan-in, for any failure
• The name is reminiscent of "locally decodable codes"
Self-repairing Codes

• Usual disclaimer: "To the best of our knowledge"
  – First instances of locally repairable codes
    • Self-repairing Homomorphic Codes for Distributed Storage Systems – Infocom 2011
    • Self-repairing Codes for Distributed Storage Systems – A Projective Geometric Construction – ITW 2011
  – Since then, there have been many other instances from other researchers/groups
• Note
  – k encoded blocks are enough to recreate the object
    • Caveat: not any arbitrary k (i.e., SRCs are not MDS)
    • However, there are many such k combinations
Self-repairing Codes: Blackbox View

[Figure: of the n encoded blocks B1…Bn (stored in storage devices in a network), some are lost; retrieve some k'' (< k) blocks (e.g., k''=2) to recreate a lost block, and reinsert it in (new) storage devices, so that there are (again) n encoded blocks.]
PSRC Example

• Repair using two nodes (say N1 and N3): four pieces needed to regenerate two pieces
  – (o1+o2+o4) + (o1) => o2+o4
  – (o3) + (o2+o3) => o2
  – (o1) + (o2) => o1+o2
• Repair using three nodes (say N2, N3 and N4): three pieces needed to regenerate two pieces
  – (o2) + (o4) => o2+o4
  – (o1+o2+o4) + (o4) => o1+o2
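The '+' in the combinations above is plain XOR, so each line can be checked mechanically. A minimal sketch (the one-byte piece values o1…o4 are made up for illustration):

```python
# Check of the XOR algebra in the PSRC example ('+' is XOR; the
# one-byte piece values are invented for illustration).
def xor(*pieces):
    out = 0
    for piece in pieces:
        out ^= piece
    return out

o1, o2, o3, o4 = 0x01, 0x02, 0x03, 0x04

# Two-node repair combinations:
assert xor(xor(o1, o2, o4), o1) == xor(o2, o4)  # (o1+o2+o4) + (o1) => o2+o4
assert xor(o3, xor(o2, o3)) == o2               # (o3) + (o2+o3)   => o2

# Three-node repair combination:
assert xor(xor(o1, o2, o4), o4) == xor(o1, o2)  # (o1+o2+o4) + (o4) => o1+o2
```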
Recap

[Figure: Data is kept as replicas (supporting data access and data insertion) and as erasure coded data (supporting fault-tolerant data access, e.g., MSR's Reconstruction code, and (partial) re-encode/repair, e.g., Self-Repairing Codes).]
Next

[Figure: the same recap diagram; the next topic is data insertion.]
Inserting Redundant Data

• Data insertion
  – Replicas can be inserted in a pipelined manner
  – Traditionally, erasure coded systems used a central point of processing

Can the process of redundancy generation be distributed among the storage nodes?
In-network coding

• Ref: In-Network Redundancy Generation for Opportunistic Speedup of Backup, Future Generation Comp. Syst. 29(1), 2013
• Motivations
  – Reduce the bottleneck at a single point
    • The "source" (or first point of processing) still needs to inject "enough information" for the network to be able to carry out the rest of the redundancy generation
  – Utilize network resources opportunistically
    • Data-centers: when network/nodes are not busy doing other things
    • P2P/F2F: when nodes are online
• More traffic (an ephemeral resource) but higher data insertion throughput
In-network coding

• Dependencies among self-repairing coded fragments can be exploited for in-network coding!
• Consider a SRC(7,3) code with the following dependencies:
  – r1, r2, r3=r1+r2, r4, r5=r1+r4, r6=r2+r4, r7=r1+r6
  – Note: r6=r1+r7, etc. also … [details in Infocom 2011 paper]
• A naïve approach: the source injects r1, r2 and r4; the storage nodes then generate r6, r7, and so on among themselves.
• Need a good "schedule" to insert the redundancy!
  – Avoid cycles/dependencies
  – Subject to "unpredictable" availability of resources!!
• Turns out to be O(n!) with an Oracle
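The dependency chain above can be sketched in code ('+' is XOR over fragment bytes; the fragment contents are made up for illustration): the source uploads only r1, r2 and r4, and the remaining fragments are generated in-network from fragments already present, in an order that respects the dependencies.

```python
# Toy sketch of in-network redundancy generation for the SRC example
# above ('+' is XOR over fragment bytes; fragment values are invented).
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

r = {}
r[1], r[2], r[4] = b'\x11', b'\x22', b'\x44'  # injected by the source

# Storage nodes generate the rest, following a valid schedule:
r[3] = xor(r[1], r[2])   # r3 = r1 + r2
r[5] = xor(r[1], r[4])   # r5 = r1 + r4
r[6] = xor(r[2], r[4])   # r6 = r2 + r4
r[7] = xor(r[1], r[6])   # r7 = r1 + r6  (needs r6 to exist first)

# The dependencies are symmetric: r6 is also recoverable as r1 + r7.
assert xor(r[1], r[7]) == r[6]
```

The ordering constraint is visible in the sketch: r7 can only be produced after r6 exists somewhere in the network, which is exactly why a good insertion schedule matters.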
In-network coding

• Heuristics
  – Several policies (such as max data) were tried
• RndFlw was the best heuristic
  – Among those we tried
• Provided 40% (out of a possible 57%) bandwidth savings at source for a SRC(7,3) code
  – An increase in the data-insertion throughput between 40-60%
• No free lunch: increase of 20-30% in overall network traffic
RapidRAID

• Ref:
  – RapidRAID: Pipelined Erasure Codes for Fast Data Archival in Distributed Storage Systems (Infocom 2013)
    • Has some local repairability properties, but that aspect is yet to be explored
  – Another code instance @ ICDCN 2013
    • Decentralized Erasure Coding for Efficient Data Archival in Distributed Storage Systems
      – Systematic code (unlike RapidRAID)
      – Found using numerical methods; a general theory for the construction of such codes, as well as their repairability properties, are open issues
• Problem statement: can the existing (replication based) redundancy be exploited to create an erasure coded archive?
Slight change of view

[Figure: two ways to look at replicated data – the replicas of objects S1…S4 grouped per storage node, or grouped per object.]
RapidRAID

• Decentralizing the hitherto centralized encoding process
RapidRAID – Example (8,4) code

• Initial configuration
• Logical phase 1: Pipelined coding
• Logical phase 2: Further local coding

[Figure: the resulting linear code.]
RapidRAID: Some results

[Figures: experimental results.]
Big pic

[Figure: the recap diagram of replicas and erasure coded data.]

• Agenda: a composite system achieving all these properties …
Wrapping up: A moment of reflection

• Revisiting repairability – an engineering alternative
  – Redundantly Grouped Cross-object Coding for Repairable Storage (APSys 2012)
  – The CORE Storage Primitive: Cross-Object Redundancy for Efficient Data Repair & Access in Erasure Coded Storage (arXiv:1302.5192)
    • HDFS-RAID compatible implementation
    • http://sands.sce.ntu.edu.sg/StorageCORE/
Wrapping up: A moment of reflection

[Figure: erasure coding of individual objects – pieces of m different objects, object i encoded into pieces e_i1 … e_ik, e_ik+1 … e_in; a RAID-4 parity p_1 … p_n is computed across the erasure coded pieces of the different objects (reminiscent of product codes!).]
Separation of concerns

• Two distinct design objectives for distributed storage systems
  – Fault-tolerance
  – Repairability
• An extremely simple idea
  – Introduce two different kinds of redundancy
    • Any (standard) erasure code – for fault-tolerance
    • RAID-4 like parity (across encoded pieces of different objects) – for repairability

© 2013, A. Datta & F. Oggier, NTU Singapore
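The cross-object parity idea can be sketched in a few lines (a hedged illustration, not the CORE implementation: the piece layout and values are invented). Each of m objects is already erasure coded into n pieces; a RAID-4 parity is then XORed across the j-th pieces of the m objects, so a single lost piece is repaired from its column alone.

```python
# Sketch of RAID-4 parity across encoded pieces of different objects.
def xor_bytes(*pieces):
    out = bytearray(len(pieces[0]))
    for piece in pieces:
        for i, b in enumerate(piece):
            out[i] ^= b
    return bytes(out)

# m = 3 objects, each erasure coded into n = 5 pieces; e[i][j] is
# piece j of object i (contents made up for illustration).
m, n = 3, 5
e = [[bytes([17 * i + j]) * 4 for j in range(n)] for i in range(m)]

# One parity piece per column of encoded pieces.
parity = [xor_bytes(*(e[i][j] for i in range(m))) for j in range(n)]

# Repairing a lost piece e[1][2] touches only its column: the parity
# plus the other objects' pieces at the same position (fan-in m,
# independent of the base code's k).
repaired = xor_bytes(parity[2], e[0][2], e[2][2])
assert repaired == e[1][2]
```

This is the "separation of concerns" in action: the base erasure code is untouched (it still provides fault tolerance), while single-failure repairs are served by the cheap cross-object parity.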
CORE repairability

• Choosing a suitable m < k
  – Reduction in data transfer for repair
  – Repair fan-in disentangled from base code parameter "k"
    • Large "k" may be desirable for faster (parallel) data access
    • Codes typically have trade-offs between repair fan-in, code parameter "k" and the code's storage overhead (n/k)
• However: the gains from reduced fan-in are probabilistic
  – For i.i.d. failures with probability "f"
• Possible to reduce repair time
  – By pipelining data through the live nodes, and computing partial parity

• Interested to
  – Follow: http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/
    • Also, two surveys on (repairability of) storage codes: one short, at a high level (SIGACT Distr. Comp. News, Mar. 2013); one detailed (FnT, June 2013)
  – Get involved: {anwitaman,frederique}@ntu.edu.sg

ଧନ୍ଯବାଦ (Thank you)