Storage codes: Managing Big Data with Small Overheads
Presented by Anwitaman Datta & Frédérique E. Oggier, Nanyang Technological University, Singapore
© 2013, A. Datta & F. Oggier, NTU Singapore
Tutorial at NetCod 2013, Calgary, Canada.
Big Data Storage: Disclaimer

A note from the trenches: "You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left." – from Andrew Fikes' (Principal Engineer, Google) faculty summit talk `Storage Architecture and Challenges`, 2010.

We never get such calls!!
Big data

• June 2011 EMC2 study*: the world's data is more than doubling every 2 years
  – faster than Moore's Law
  – 1.8 zettabytes of data to be created in 2011
• Big data: a big problem? a big opportunity?

Zetta: 10^21. Zettabyte: if you stored all of this data on DVDs, the stack would reach from the Earth to the moon and back.

* http://www.emc.com/about/news/press/2011/20110628-01.htm
The data deluge: Some numbers

• Facebook "currently" (in 2010) stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (60 terabytes) each week and Facebook serves over one million images per second at peak. [quoted from a paper on "Haystack" from Facebook]

• On "Saturday", photo number four billion was uploaded to photo sharing site Flickr. This comes just five and a half months after the 3 billionth and nearly 18 months after photo number two billion. – Mashable (13th October 2009) [http://mashable.com/2009/10/12/flickr-4-billion/]
Scale how?

To scale vertically (or scale up) means to add resources to a single node in a system.*

To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application.*

Not distributing is not an option!

* Definitions from Wikipedia
Failure (of parts) is Inevitable

• But failure of the system is not an option either!
  – Failure is the pillar of rivals' success …
Deal with it

• Data from Los Alamos National Laboratory (DSN 2006), gathered over 9 years, 4750 machines, 24101 CPUs.
• Distribution of failures:
  – Hardware 60%
  – Software 20%
  – Network/Environment/Humans 5%
• Failures occurred between once a day and once a month.
Failure happens without fail, so …

• But failure of the system is not an option either!
  – Failure is the pillar of rivals' success …
• Solution: Redundancy
Many Levels of Redundancy

• Physical
• Virtual resource
• Availability zone
• Region
• Cloud

From: http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html
Redundancy Based Fault Tolerance

• Replicate data
  – e.g., 3 or more copies
  – In nodes on different racks
    • Can deal with switch failures
• Power back-up using battery between racks (Google)
But At What Cost?

• Failure is not an option, but …
  – … are the overheads acceptable?
Reducing the Overheads of Redundancy

• Erasure codes
  – Much lower storage overhead
  – High level of fault-tolerance
    • In contrast to replication or RAID based systems
• Has the potential to significantly improve the "bottomline"
  – e.g., both Google's new DFS Colossus, as well as Microsoft's Azure, now use ECs
Erasure Codes (ECs)

• An (n,k) erasure code = a map that takes as input k blocks and outputs n blocks, thus introducing n-k blocks of redundancy.
• 3-way replication is a (3,1) erasure code!
  – k=1 block → Encoding → n=3 encoded blocks
• An erasure code such that the k original blocks can be recreated out of any k encoded blocks is called MDS (maximum distance separable).
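To make the MDS property concrete, here is a toy sketch (not any code from this tutorial): a Reed-Solomon-style (n,k) code built by fitting a polynomial through the k data symbols and evaluating it at n points, so that any k of the n encoded symbols suffice to reconstruct. Symbols are single integers over GF(257) for simplicity; real systems work on byte blocks.

```python
# Toy (n,k) MDS erasure code over the prime field GF(257),
# Reed-Solomon style. One symbol stands in for one "block".
p = 257

def interp_eval(pts, x):
    """Evaluate, at x, the unique polynomial through `pts` (mod p),
    using Lagrange interpolation."""
    total = 0
    for xi, yi in pts:
        num = den = 1
        for xj, _ in pts:
            if xj != xi:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p  # den^-1 mod p
    return total

def encode(data, n):
    # Systematic: encoded symbol at position x is P(x), where P
    # interpolates (1, data[0]) .. (k, data[k-1]).
    pts = list(enumerate(data, start=1))
    return [interp_eval(pts, x) for x in range(1, n + 1)]

def decode(survivors, k):
    # MDS property: ANY k surviving (position, symbol) pairs suffice.
    return [interp_eval(survivors, x) for x in range(1, k + 1)]

data = [10, 20, 30]                  # k = 3 data blocks
enc = encode(data, 7)                # n = 7 encoded blocks
assert enc[:3] == data               # systematic prefix
assert decode([(4, enc[3]), (6, enc[5]), (7, enc[6])], 3) == data
```

Note that with k=1 and n=3 this degenerates to storing three copies of the single symbol, which is exactly the "3-way replication is a (3,1) erasure code" remark above.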
Erasure Codes (ECs)

• Originally designed for communication
[Figure: the message (Data = k blocks O1…Ok) is encoded with EC(n,k) into n blocks B1…Bn; some blocks are lost in transit; the receiver decodes any k' (≥ k) received blocks to reconstruct the original k blocks.]
Erasure Codes for Networked Storage

[Figure: the object (Data = k blocks O1…Ok) is encoded into n blocks B1…Bn, stored in storage devices in a network; despite lost blocks, retrieving any k' (≥ k) blocks and decoding reconstructs the object.]
HDFS-RAID

• Distributed RAID File system (DRFS) client
  – provides application access to the files in the DRFS
  – transparently recovers any corrupt or missing blocks encountered when reading a file (degraded read)
    • Does not carry out repairs
• RaidNode, a daemon that creates and maintains parity files for all data files stored in the DRFS
• BlockFixer, which periodically recomputes blocks that have been lost or corrupted
  – RaidShell allows on-demand repair triggered by an administrator
• Two kinds of erasure codes implemented
  – XOR code and Reed-Solomon code (typically 10+4, i.e., n=14, k=10, a 1.4x storage overhead)

From http://wiki.apache.org/hadoop/HDFS-RAID
Replenishing Lost Redundancy for ECs

• Repair needed for long term resilience.

[Figure: retrieve any k' (≥ k) blocks, decode to recover the original k blocks, re-encode to recreate the lost blocks, and reinsert them in (new) storage devices, so that there are (again) n encoded blocks.]

• Repairs are expensive!
Tailor-Made Codes for Storage

Desired code properties include:
• Low storage overhead
• Good fault tolerance
  – Traditional MDS erasure codes achieve these.
• Better repairability
  – Smaller repair fan-in
  – Reduced I/O for repairs
  – Possibility of multiple simultaneous repairs
  – Fast repairs
  – Efficient B/W usage …
• Better data-insertion
• Better migration to archival
• Better …
Pyramid (Local Reconstruction) Codes

– Good for degraded reads (data locality)
– Not all repairs are cheap (only partial parity locality)

Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems, C. Huang et al. @ NCA 2007
Erasure Coding in Windows Azure Storage, C. Huang et al. @ USENIX ATC 2012
Regenerating Codes

• Network information flow based arguments to determine the "optimal" trade-off of storage vs. repair-bandwidth
Locally Repairable Codes

• Codes satisfying: low repair fan-in, for any failure
• The name is reminiscent of "locally decodable codes"
Self-repairing Codes

• Usual disclaimer: "To the best of our knowledge"
  – First instances of locally repairable codes
    • Self-repairing Homomorphic Codes for Distributed Storage Systems – Infocom 2011
    • Self-repairing Codes for Distributed Storage Systems – A Projective Geometric Construction – ITW 2011
  – Since then, there have been many other instances from other researchers/groups
• Note
  – k encoded blocks are enough to recreate the object
    • Caveat: not any arbitrary k (i.e., SRCs are not MDS)
    • However, there are many such k combinations
Self-repairing Codes: Blackbox View

[Figure: of the n encoded blocks B1…Bn (stored in storage devices in a network), some are lost; retrieve some k'' (< k) blocks (e.g., k''=2) to recreate a lost block, and reinsert it in (new) storage devices, so that there are (again) n encoded blocks.]
PSRC Example

• Repair using two nodes (say N1 and N3): four pieces needed to regenerate two pieces
  – (o1+o2+o4) + (o1) => o2+o4
  – (o3) + (o2+o3) => o2
  – (o1) + (o2) => o1+o2
• Repair using three nodes (say N2, N3 and N4): three pieces needed to regenerate two pieces
  – (o2) + (o4) => o2+o4
  – (o1+o2+o4) + (o4) => o1+o2
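The '+' in the combinations above is plain XOR, so each line can be checked mechanically. A minimal sketch (the one-byte piece values o1…o4 are made up for illustration):

```python
# Check of the XOR algebra in the PSRC example ('+' is XOR; the
# one-byte piece values are invented for illustration).
def xor(*pieces):
    out = 0
    for piece in pieces:
        out ^= piece
    return out

o1, o2, o3, o4 = 0x01, 0x02, 0x03, 0x04

# Two-node repair combinations:
assert xor(xor(o1, o2, o4), o1) == xor(o2, o4)  # (o1+o2+o4) + (o1) => o2+o4
assert xor(o3, xor(o2, o3)) == o2               # (o3) + (o2+o3)   => o2

# Three-node repair combination:
assert xor(xor(o1, o2, o4), o4) == xor(o1, o2)  # (o1+o2+o4) + (o4) => o1+o2
```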
Recap

[Figure: Data is kept as replicas (supporting data access and data insertion) and as erasure coded data (supporting fault-tolerant data access, e.g., MSR's Reconstruction code, and (partial) re-encode/repair, e.g., Self-Repairing Codes).]
Next

[Figure: the same recap diagram; the next topic is data insertion.]
Inserting Redundant Data

• Data insertion
  – Replicas can be inserted in a pipelined manner
  – Traditionally, erasure coded systems used a central point of processing

Can the process of redundancy generation be distributed among the storage nodes?
In-network coding

• Ref: In-Network Redundancy Generation for Opportunistic Speedup of Backup, Future Generation Comp. Syst. 29(1), 2013
• Motivations
  – Reduce the bottleneck at a single point
    • The "source" (or first point of processing) still needs to inject "enough information" for the network to be able to carry out the rest of the redundancy generation
  – Utilize network resources opportunistically
    • Data-centers: when network/nodes are not busy doing other things
    • P2P/F2F: when nodes are online
• More traffic (an ephemeral resource) but higher data insertion throughput
In-network coding

• Dependencies among self-repairing coded fragments can be exploited for in-network coding!
• Consider a SRC(7,3) code with the following dependencies:
  – r1, r2, r3=r1+r2, r4, r5=r1+r4, r6=r2+r4, r7=r1+r6
  – Note: r6=r1+r7, etc. also … [details in Infocom 2011 paper]
• A naïve approach: the source injects r1, r2 and r4; the storage nodes then generate r6, r7, and so on among themselves.
• Need a good "schedule" to insert the redundancy!
  – Avoid cycles/dependencies
  – Subject to "unpredictable" availability of resources!!
• Turns out to be O(n!) with an Oracle
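The dependency chain above can be sketched in code ('+' is XOR over fragment bytes; the fragment contents are made up for illustration): the source uploads only r1, r2 and r4, and the remaining fragments are generated in-network from fragments already present, in an order that respects the dependencies.

```python
# Toy sketch of in-network redundancy generation for the SRC example
# above ('+' is XOR over fragment bytes; fragment values are invented).
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

r = {}
r[1], r[2], r[4] = b'\x11', b'\x22', b'\x44'  # injected by the source

# Storage nodes generate the rest, following a valid schedule:
r[3] = xor(r[1], r[2])   # r3 = r1 + r2
r[5] = xor(r[1], r[4])   # r5 = r1 + r4
r[6] = xor(r[2], r[4])   # r6 = r2 + r4
r[7] = xor(r[1], r[6])   # r7 = r1 + r6  (needs r6 to exist first)

# The dependencies are symmetric: r6 is also recoverable as r1 + r7.
assert xor(r[1], r[7]) == r[6]
```

The ordering constraint is visible in the sketch: r7 can only be produced after r6 exists somewhere in the network, which is exactly why a good insertion schedule matters.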
In-network coding

• Heuristics
  – Several policies (such as max data) were tried
• RndFlw was the best heuristic
  – Among those we tried
• Provided 40% (out of a possible 57%) bandwidth savings at source for a SRC(7,3) code
  – An increase in the data-insertion throughput between 40-60%
• No free lunch: increase of 20-30% in overall network traffic
RapidRAID

• Ref:
  – RapidRAID: Pipelined Erasure Codes for Fast Data Archival in Distributed Storage Systems (Infocom 2013)
    • Has some local repairability properties, but that aspect is yet to be explored
  – Another code instance @ ICDCN 2013
    • Decentralized Erasure Coding for Efficient Data Archival in Distributed Storage Systems
      – Systematic code (unlike RapidRAID)
      – Found using numerical methods; a general theory for the construction of such codes, as well as their repairability properties, are open issues
• Problem statement: can the existing (replication based) redundancy be exploited to create an erasure coded archive?
Slight change of view

[Figure: two ways to look at replicated data – the replicas of objects S1…S4 grouped per storage node, or grouped per object.]
RapidRAID

• Decentralizing the hitherto centralized encoding process
RapidRAID – Example (8,4) code

• Initial configuration
• Logical phase 1: Pipelined coding
• Logical phase 2: Further local coding

[Figure: the resulting linear code.]
RapidRAID: Some results

[Figures: experimental results.]
Big pic

[Figure: the recap diagram of replicas and erasure coded data.]

• Agenda: a composite system achieving all these properties …
Wrapping up: A moment of reflection

• Revisiting repairability – an engineering alternative
  – Redundantly Grouped Cross-object Coding for Repairable Storage (APSys 2012)
  – The CORE Storage Primitive: Cross-Object Redundancy for Efficient Data Repair & Access in Erasure Coded Storage (arXiv:1302.5192)
    • HDFS-RAID compatible implementation
    • http://sands.sce.ntu.edu.sg/StorageCORE/
Wrapping up: A moment of reflection

[Figure: erasure coding of individual objects – pieces of m different objects, object i encoded into pieces e_i1 … e_ik, e_ik+1 … e_in; a RAID-4 parity p_1 … p_n is computed across the erasure coded pieces of the different objects (reminiscent of product codes!).]
Separation of concerns

• Two distinct design objectives for distributed storage systems
  – Fault-tolerance
  – Repairability
• An extremely simple idea
  – Introduce two different kinds of redundancy
    • Any (standard) erasure code – for fault-tolerance
    • RAID-4 like parity (across encoded pieces of different objects) – for repairability

© 2013, A. Datta & F. Oggier, NTU Singapore
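The cross-object parity idea can be sketched in a few lines (a hedged illustration, not the CORE implementation: the piece layout and values are invented). Each of m objects is already erasure coded into n pieces; a RAID-4 parity is then XORed across the j-th pieces of the m objects, so a single lost piece is repaired from its column alone.

```python
# Sketch of RAID-4 parity across encoded pieces of different objects.
def xor_bytes(*pieces):
    out = bytearray(len(pieces[0]))
    for piece in pieces:
        for i, b in enumerate(piece):
            out[i] ^= b
    return bytes(out)

# m = 3 objects, each erasure coded into n = 5 pieces; e[i][j] is
# piece j of object i (contents made up for illustration).
m, n = 3, 5
e = [[bytes([17 * i + j]) * 4 for j in range(n)] for i in range(m)]

# One parity piece per column of encoded pieces.
parity = [xor_bytes(*(e[i][j] for i in range(m))) for j in range(n)]

# Repairing a lost piece e[1][2] touches only its column: the parity
# plus the other objects' pieces at the same position (fan-in m,
# independent of the base code's k).
repaired = xor_bytes(parity[2], e[0][2], e[2][2])
assert repaired == e[1][2]
```

This is the "separation of concerns" in action: the base erasure code is untouched (it still provides fault tolerance), while single-failure repairs are served by the cheap cross-object parity.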
CORE repairability

• Choosing a suitable m < k
  – Reduction in data transfer for repair
  – Repair fan-in disentangled from base code parameter "k"
    • Large "k" may be desirable for faster (parallel) data access
    • Codes typically have trade-offs between repair fan-in, code parameter "k" and the code's storage overhead (n/k)
• However: the gains from reduced fan-in are probabilistic
  – For i.i.d. failures with probability "f"
• Possible to reduce repair time
  – By pipelining data through the live nodes, and computing partial parity

• Interested to
  – Follow: http://sands.sce.ntu.edu.sg/CodingForNetworkedStorage/
    • Also, two surveys on (repairability of) storage codes: one short, at a high level (SIGACT Distr. Comp. News, Mar. 2013); one detailed (FnT, June 2013)
  – Get involved: {anwitaman,frederique}@ntu.edu.sg

ଧନ୍ଯବାଦ (Thank you)