
Storage codes: Managing Big Data with Small Overheads

Presented by

Anwitaman Datta & Frédérique E. Oggier Nanyang Technological University, Singapore

© 2013, A. Datta & F. Oggier, NTU Singapore

Tutorial at NetCod 2013, Calgary, Canada.

Big Data Storage: Disclaimer

A note from the trenches: "You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." – from Andrew Fikes' (Principal Engineer, Google) faculty summit talk "Storage Architecture and Challenges", 2010.


We never get such calls!!

Big data

• June 2011 EMC2 study*
– world's data is more than doubling every 2 years
• faster than Moore's Law
– 1.8 zettabytes of data to be created in 2011

Big data:
- big problem?
- big opportunity?

Zetta: 10^21
Zettabyte: If you stored all of this data on DVDs, the stack would reach from the Earth to the moon and back.


* http://www.emc.com/about/news/press/2011/20110628-01.htm


The data deluge: Some numbers

• Facebook "currently" (in 2010) stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (60 terabytes) each week and Facebook serves over one million images per second at peak. [quoted from a paper on "Haystack" from Facebook]

• On "Saturday", photo number four billion was uploaded to photo sharing site Flickr. This comes just five and a half months after the 3 billionth and nearly 18 months after photo number two billion. – Mashable (13th October 2009) [http://mashable.com/2009/10/12/flickr-4-billion/]

Scale how?

To scale vertically (or scale up) means to add resources to a single node in a system.*

To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application.*

Not distributing is not an option!

* Definitions from Wikipedia

Failure (of parts) is Inevitable

• But, failure of the system is not an option either!
– Failure is the pillar of rivals' success …

Deal with it

• Data from Los Alamos National Laboratory (DSN 2006), gathered over 9 years, 4750 machines, 24101 CPUs.
• Distribution of failures:
– Hardware 60%
– Software 20%
– Network/Environment/Humans 5%
• Failures occurred between once a day and once a month.

Failure happens without fail, so …

• But, failure of the system is not an option either!
– Failure is the pillar of rivals' success …

• Solution: Redundancy

Many Levels of Redundancy

• Physical
• Virtual resource
• Availability zone
• Region
• Cloud

From: http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html


Redundancy Based Fault Tolerance

• Replicate data
– e.g., 3 or more copies
– in nodes on different racks
• Can deal with switch failures
• Power back-up using battery between racks (Google)

But At What Cost?

• Failure is not an option, but …
– … are the overheads acceptable?

Reducing the Overheads of Redundancy

• Erasure codes
– Much lower storage overhead
– High level of fault-tolerance
• In contrast to replication or RAID based systems
• Has the potential to significantly improve the "bottomline"
– e.g., both Google's new DFS Colossus, as well as Microsoft's Azure, now use ECs

Erasure Codes (ECs)

• An (n,k) erasure code = a map that takes as input k blocks and outputs n blocks, thus introducing n-k blocks of redundancy.

• 3-way replication is a (3,1) erasure code!
– Encoding: k=1 block becomes n=3 encoded blocks

• An erasure code such that the k original blocks can be recreated out of any k encoded blocks is called MDS (maximum distance separable).
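To make the definition concrete, a minimal sketch (not one of the codes discussed in this tutorial): a single-parity (k+1,k) XOR code, which is MDS for a single erasure — any k of its n=k+1 blocks recreate the original k.

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blocks):
    """(k+1, k) single-parity code: append the XOR of the k data blocks."""
    return blocks + [reduce(xor, blocks)]

def decode(coded, lost_index):
    """Recreate one lost block by XOR-ing the k surviving blocks."""
    survivors = [b for i, b in enumerate(coded) if i != lost_index]
    return reduce(xor, survivors)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 blocks
coded = encode(data)                  # n = 4 blocks, n-k = 1 redundancy
assert decode(coded, 1) == b"BBBB"    # any k of the n blocks suffice
```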

Erasure Codes (ECs)

• Originally designed for communication: with an EC(n,k), the original k blocks O1 … Ok are encoded into n blocks B1 … Bn; receiving any k' (≥ k) blocks suffices to decode and reconstruct the data (= message), despite lost blocks.

[Figure: encoding of k message blocks into n encoded blocks, transmission with losses, and decoding from any k' (≥ k) received blocks.]

Erasure Codes for Networked Storage

[Figure: the data (= object) is encoded into n blocks B1 … Bn, stored in storage devices in a network; retrieving any k' (≥ k) blocks suffices to decode and reconstruct the original k blocks, despite lost blocks.]

HDFS-RAID

• Distributed RAID File system (DRFS) client
– provides application access to the files in the DRFS
– transparently recovers any corrupt or missing blocks encountered when reading a file (degraded read)
• Does not carry out repairs
• RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS
• BlockFixer, which periodically recomputes blocks that have been lost or corrupted
– RaidShell allows on-demand repair triggered by an administrator
• Two kinds of erasure codes implemented
– XOR code and Reed-Solomon code (typically 10+4, with 1.4x overhead)

From http://wiki.apache.org/hadoop/HDFS-RAID
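The overhead figures quoted above follow directly from n/k:

```python
# Storage overhead for the erasure code options mentioned above,
# versus plain 3-way replication.
def overhead(n, k):
    return n / k

# 3-way replication = (3,1) code: 3.0x storage, tolerates 2 losses
assert overhead(3, 1) == 3.0
# Reed-Solomon 10+4 = (14,10) code: 1.4x storage, tolerates any 4 losses
assert abs(overhead(14, 10) - 1.4) < 1e-9
```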

Replenishing Lost Redundancy for ECs

• Repair needed for long term resilience.

[Figure: repair retrieves any k' (≥ k) blocks, decodes to the original k blocks, re-encodes to recreate the lost blocks, and re-inserts them in (new) storage devices, so that there are (again) n encoded blocks.]

• Repairs are expensive!

Tailor-Made Codes for Storage

Desired code properties include:
• Low storage overhead
• Good fault tolerance
• Better repairability
– Smaller repair fan-in
– Reduced I/O for repairs
– Possibility of multiple simultaneous repairs
– Fast repairs
– Efficient B/W usage …
• Better data-insertion
• Better migration to archival

Traditional MDS erasure codes achieve the first two.

Pyramid (Local Reconstruction) Codes

– Good for degraded reads (data locality)
– Not all repairs are cheap (only partial parity locality)

Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems, C. Huang et al. @ NCA 2007
Erasure Coding in Windows Azure Storage, C. Huang et al. @ USENIX ATC 2012
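To make "parity locality" concrete, a toy sketch (the group sizes and layout are illustrative assumptions, not the actual Pyramid/LRC construction): each local group gets an XOR local parity, so a single lost block is rebuilt from its group alone, while multi-failure cases still need the (omitted) global parities.

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical layout: 6 data blocks in two local groups of 3,
# each with an XOR local parity (global parities omitted here).
data = [bytes([i]) * 4 for i in range(6)]
groups = [data[0:3], data[3:6]]
local_parity = [reduce(xor, g) for g in groups]

# Local repair of a lost block touches only its group: 3 reads, not 6.
lost = 4                                    # second block of group 1
g = groups[1]
survivors = [b for i, b in enumerate(g) if i != 1] + [local_parity[1]]
assert reduce(xor, survivors) == data[lost]
```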

Regenerating Codes

• Network information flow based arguments to determine the "optimal" trade-off of storage vs. repair-bandwidth

Locally Repairable Codes

• Codes satisfying: low repair fan-in, for any failure
• The name is reminiscent of "locally decodable codes"

Self-repairing Codes

• Usual disclaimer: "To the best of our knowledge"
– First instances of locally repairable codes
• Self-repairing Homomorphic Codes for Distributed Storage Systems – Infocom 2011
• Self-repairing Codes for Distributed Storage Systems – A Projective Geometric Construction – ITW 2011
– Since then, there have been many other instances from other researchers/groups
• Note
– k encoded blocks are enough to recreate the object
• Caveat: not any arbitrary k (i.e., SRCs are not MDS)
• However, there are many such k combinations

Self-repairing Codes: Blackbox View

[Figure: of the n encoded blocks B1 … Bn (stored in storage devices in a network), a lost block can be recreated by retrieving some k'' (< k) blocks (e.g. k''=2); the recreated block is re-inserted in (new) storage devices, so that there are (again) n encoded blocks.]

PSRC Example

• Repair using two nodes (say N1 and N3): four pieces needed to regenerate two pieces
– (o1+o2+o4) + (o1) => o2+o4
– (o3) + (o2+o3) => o2
– (o1) + (o2) => o1+o2

• Repair using three nodes (say N2, N3 and N4): three pieces needed to regenerate two pieces
– (o2) + (o4) => o2+o4
– (o1+o2+o4) + (o4) => o1+o2

[The accompanying figure showing which pieces each node stores is not in the transcript.]
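The XOR identities above can be checked mechanically (one-byte stand-ins for the pieces o1 … o4; "+" is bytewise XOR):

```python
# Verify the repair equations on sample byte strings.
def xor(*pieces):
    out = bytes(len(pieces[0]))
    for p in pieces:
        out = bytes(x ^ y for x, y in zip(out, p))
    return out

o1, o2, o3, o4 = b"\x01", b"\x02", b"\x04", b"\x08"
assert xor(xor(o1, o2, o4), o1) == xor(o2, o4)   # (o1+o2+o4) + (o1) => o2+o4
assert xor(o3, xor(o2, o3)) == o2                # (o3) + (o2+o3) => o2
assert xor(xor(o1, o2, o4), o4) == xor(o1, o2)   # (o1+o2+o4) + (o4) => o1+o2
```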

Recap

[Diagram: data stored as replicas vs. erasure coded data, with three operations — data insertion, fault-tolerant data access (MSR's Reconstruction code), and (partial) re-encode/repair (e.g., Self-Repairing Codes).]

Next

[Same diagram, highlighting data insertion.]

Inserting Redundant Data

• Data insertion
– Replicas can be inserted in a pipelined manner
– Traditionally, erasure coded systems used a central point of processing

Can the process of redundancy generation be distributed among the storage nodes?

In-network coding

• Ref: In-Network Redundancy Generation for Opportunistic Speedup of Backup, Future Generation Comp. Syst. 29(1), 2013

• Motivations
– Reduce the bottleneck at a single point
• The "source" (or first point of processing) still needs to inject "enough information" for the network to be able to carry out the rest of the redundancy generation
– Utilize network resources opportunistically
• Data-centers: when network/nodes are not busy doing other things
• P2P/F2F: when nodes are online

More traffic (an ephemeral resource), but higher data-insertion throughput.


In-network coding

• Dependencies among self-repairing coded fragments can be exploited for in-network coding!

• Consider a SRC(3,7) code with the following dependencies:
– r1, r2, r3=r1+r2, r4, r5=r1+r4, r6=r2+r4, r7=r1+r6

• Note: r6=r1+r7, etc. also … [details in Infocom 2011 paper]

• A naïve approach: the source inserts r1, r2 and r4; the remaining fragments (r6, r7, …) are derived in-network.

Need a good "schedule" to insert the redundancy!
– Avoid cycles/dependencies
– Subject to "unpredictable" availability of resources!!
– Turns out to be O(n!) w/ Oracle
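The dependency chain above can be sketched directly (a toy with one-byte fragments; the actual SRC operates on larger pieces over a finite field):

```python
# In-network redundancy generation for the SRC dependencies above:
# the source injects only r1, r2 and r4, and the remaining fragments
# are derived by XOR from fragments the network already holds.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

r1, r2, r4 = b"\x01", b"\x02", b"\x08"      # injected by the source
r3 = xor(r1, r2)                            # computed in-network
r5 = xor(r1, r4)
r6 = xor(r2, r4)
r7 = xor(r1, r6)

# Dependencies also run the other way, e.g. r6 = r1 + r7:
assert xor(r1, r7) == r6
```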


• Heuristics
– Several other policies (such as max data) were also tried


• RndFlw was the best heuristic
– among those we tried
• Provided 40% (out of a possible 57%) bandwidth savings at source for a SRC(7,3) code
– An increase in the data-insertion throughput between 40-60%
• No free lunch: increase of 20-30% overall network traffic

Recap

[Same diagram as before: replicas vs. erasure coded data, with data insertion, fault-tolerant data access, and (partial) re-encode/repair.]

Next

[Same diagram; next: creating an erasure coded archive from replicated data.]

RapidRAID

• Ref:
– RapidRAID: Pipelined Erasure Codes for Fast Data Archival in Distributed Storage Systems (Infocom 2013)
• Has some local repairability properties, but that aspect is yet to be explored
– Another code instance @ ICDCN 2013
• Decentralized Erasure Coding for Efficient Data Archival in Distributed Storage Systems
– Systematic code (unlike RapidRAID)
– Found using numerical methods; a general theory for the construction of such codes, as well as their repairability properties, are open issues

• Problem statement: Can the existing (replication based) redundancy be exploited to create an erasure coded archive?

Slight change of view

[Figure: two ways to look at replicated data — three replicas each of objects S1, S2, S3, S4, grouped either per object or per storage node.]

RapidRAID

Decentralizing the hitherto centralized encoding process

RapidRAID – Example: (8,4) code

• Initial configuration
• Logical phase 1: Pipelined coding
• Logical phase 2: Further local coding

[The per-phase diagrams and the resulting linear code are not in the transcript.]
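RapidRAID's actual coding coefficients are in the paper; as a toy sketch of the pipelined-coding idea only (plain XOR standing in for the real linear combinations), each node folds its locally stored piece into a running partial value and forwards it, so no single node has to gather all the pieces:

```python
# Toy pipelined redundancy generation (not the actual RapidRAID code):
# node i XORs its local piece into the running value and forwards it.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

pieces = [b"\x01", b"\x02", b"\x04", b"\x08"]   # one piece per node
partial = bytes(len(pieces[0]))
for piece in pieces:                # the "pipeline": node i -> node i+1
    partial = xor(partial, piece)   # each hop adds its local piece

assert partial == b"\x0f"           # the last node holds the parity
```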

RapidRAID: Some results

[Result plots not included in the transcript.]

Big pic

[Same diagram: replicas vs. erasure coded data, with data insertion, fault-tolerant data access, and (partial) re-encode/repair.]

Agenda: A composite system achieving all these properties …

Wrapping up: A moment of reflection

• Revisiting repairability – an engineering alternative
– Redundantly Grouped Cross-object Coding for Repairable Storage (APSys 2012)
– The CORE Storage Primitive: Cross-Object Redundancy for Efficient Data Repair & Access in Erasure Coded Storage (arXiv:1302.5192)
• HDFS-RAID compatible implementation
• http://sands.sce.ntu.edu.sg/StorageCORE/

Wrapping up: A moment of reflection

[Figure: erasure coding of individual objects — pieces e_j1 … e_jk, e_jk+1 … e_jn of m different objects (one object per row), with RAID-4 parities p1 … pn computed down each column, i.e., across the erasure coded pieces of the different objects (reminiscent of product codes!).]

Separation of concerns

• Two distinct design objectives for distributed storage systems
– Fault-tolerance
– Repairability
• An extremely simple idea
– Introduce two different kinds of redundancy
• Any (standard) erasure code – for fault-tolerance
• RAID-4 like parity (across encoded pieces of different objects) – for repairability
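A minimal sketch of the two-kinds-of-redundancy idea, with toy byte strings standing in for already-erasure-coded pieces (the piece contents and sizes here are illustrative assumptions):

```python
# rows = m objects, each already erasure coded into n pieces;
# a RAID-4 style parity p_i is the XOR of the i-th piece of every object.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

m, n = 3, 4
pieces = [[bytes([obj * n + i]) * 2 for i in range(n)] for obj in range(m)]
parity = [b"\x00\x00"] * n
for row in pieces:
    parity = [xor(p, e) for p, e in zip(parity, row)]

# Repairing one lost piece reads only its column (m pieces),
# instead of k pieces of its own object's codeword.
lost_obj, col = 1, 2
column = [pieces[o][col] for o in range(m) if o != lost_obj] + [parity[col]]
fixed = b"\x00\x00"
for c in column:
    fixed = xor(fixed, c)
assert fixed == pieces[lost_obj][col]
```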

CORE repairability

• Choosing a suitable m < k
– Reduction in data transfer for repair
– Repair fan-in disentangled from base code parameter "k"
• Large "k" may be desirable for faster (parallel) data access
• Codes typically have trade-offs between repair fan-in, code parameter "k" and code's storage overhead (n/k)
• However: the gains from reduced fan-in are probabilistic
– For i.i.d. failures with probability "f"
• Possible to reduce repair time
– By pipelining data through the live nodes, and computing partial parity

• Interested to
– Follow: http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/
– Get involved: {anwitaman,frederique}@ntu.edu.sg
• Also, two surveys on (repairability of) storage codes
– one short, at high level (SIGACT Distr. Comp. News, Mar. 2013)
– one detailed (FnT, June 2013)

ଧନ୍ଯବାଦ (Thank you)