Archiving Cold Data in Warehouses With Clustered Network Coding


Fabien André (Technicolor), Anne-Marie Kermarrec (Inria), Erwan Le Merrer, Nicolas Le Scouarnec, Gilles Straub, Alexandre van Kempen (Technicolor)

Abstract  Modern storage systems now typically combine plain replication and erasure codes to reliably store large amounts of data in datacenters. Plain replication allows fast access to popular data, while erasure codes, e.g., Reed-Solomon codes, provide a storage-efficient alternative for archiving less popular data. Although erasure codes are now increasingly employed in real systems, they suffer high overhead during maintenance, i.e., upon failures: files typically must be decoded before being encoded again to repair the encoded blocks stored at the faulty node.

In this paper, we propose a novel erasure code system, tailored for networked archival systems. The efficiency of our approach relies on the joint use of random codes and a clustered placement strategy. Our repair protocol leverages network coding techniques to reduce by 50% the amount of data transferred during maintenance, by repairing several files of a cluster simultaneously. We demonstrate, both through an analysis and an extensive experimental study conducted on a public testbed, that our approach significantly decreases both the bandwidth overhead during the maintenance process and the time to repair lost data. We also show that using a non-systematic code does not impact the throughput, and comes only at the price of a higher CPU usage. Based on these results, we evaluate the impact of this higher CPU consumption on different configurations of data coldness, by determining whether the cluster's network bandwidth dedicated to repair or the CPU dedicated to decoding saturates first.

Keywords  Distributed Storage, Erasure Codes, Maintenance, Cold Data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. EuroSys 2014, April 13-16, 2014, Amsterdam, Netherlands. Copyright © 2014 ACM 978-1-4503-2704-6/14/04. $15.00. http://dx.doi.org/10.1145/2592798.2592816

1. Introduction

Redundancy is key to providing a reliable service in practical systems composed of unreliable components. Typically, distributed storage systems heavily rely on redundancy to mask ineluctable disk/node unavailabilities and failures. While three-way replication is the simplest means to achieve reliability with redundancy, it is now acknowledged that erasure codes can dramatically improve storage efficiency [46].

Major cloud systems such as those of Google [15], Microsoft [5] and Facebook [39] have recently adopted erasure codes, the most popular choice being Reed-Solomon codes for their simplicity. Since replication has higher storage costs but remains more efficient than codes for reads and writes, storage systems now tend to differentiate cold data (i.e., data no longer frequently accessed) from hot data (typically the most popular, largely accessed data), and to process them differently [39]. Plain replication ensures hot data reliability while erasure codes are used for cold data archival. Indeed, reading erasure-coded data is more time- and resource-consuming than reading replicated data. This sets the scene for new offers like Amazon Glacier [19], providing a low-cost archival system at the price of file accessibility on the order of hours.

Reed-Solomon codes are the de facto standard of code-based redundancy in practice. However, having been designed for communication systems, they lack an efficient repair procedure, which is important for networked storage systems. Indeed, in storage systems, the level of redundancy decreases over time with failures. An additional maintenance mechanism is thus key to sustain this redundancy and preserve the reliability of stored information. As Reed-Solomon codes are not associated with a tailored maintenance mechanism, they suffer from significant overhead in terms of bandwidth usage and decoding operations when maintenance has to be triggered. In order to address these two drawbacks, architectural solutions have been proposed [40], as well as new code designs [12, 24, 29], paving the way for better tradeoffs between storage, reliability and maintenance efficiency. The optimal tradeoff has been provided by Dimakis et al. [9] with the use of network coding. This initial work has been

[Figure 1 residue: classical Reed-Solomon repair moves 2 + 2 = 4 blocks over the network, to be decoded to restore Node 1's data; CNC repair moves only 3 blocks, without decoding.]

Figure 1. With Reed-Solomon codes, upon the failure of Node 1, files X and Y are repaired independently, thus requiring the transfer of a total of 4 blocks (2 blocks for each of the files) to decode both files and generate a new block for each file. Instead, in CNC repair, blocks of X and Y are combined at the nodes used for repair, so that only 3 blocks need to be transferred over the network; new blocks for each file are then generated without decoding the original files X and Y.

followed by numerous theoretical studies on coding schemes achieving the tradeoff [11]. However, these code designs either exist only for high redundancy (i.e., rates lower than 1/2) or have high computing costs [13], thus limiting their applicability to practical systems where both low storage overhead and reasonable computing costs are desirable. Moreover, all these studies consider the repair of a single file, thus ignoring the potential benefits of repairing several distinct files together.

Instead of relying on specifically structured codes, random codes are an appealing alternative to provide fault tolerance in a distributed setting [1, 9, 10, 18, 31, 33]. They provide a simple and efficient way to construct codes that are optimal w.h.p., as Reed-Solomon codes are, while offering attractive properties in terms of maintenance. However, the practical aspects of the maintenance of such codes have received little attention so far.

In this paper, we propose a novel approach to redundancy management, combining both random codes and network coding, to provide a practical maintenance protocol. The main intuition behind our system is to apply random codes and network coding at the granularity of groups of nodes (clusters), factorizing the repair cost across several files at the same time. This is illustrated in Figure 1.

    More specifically, our contributions are the following:

1. We propose a novel maintenance system, combining a clustered placement strategy, random codes and network coding techniques at the node level (i.e., between different files hosted by a single machine). We call this approach CNC, for Clustered Network Coding. We show that CNC halves the data transferred during the maintenance process when compared to standard erasure codes.

[Figure 2 residue: a comparison table. Rows: fault tolerance w.r.t. storage overhead (optimal for codes), efficient file access (for Reed-Solomon, only when applied across files), low repair bandwidth (Reed-Solomon transfers the whole file, CNC only half), and reintegration. Columns: Replication, Reed-Solomon, CNC.]

Figure 2. Comparison of CNC with the most commonly implemented redundancy mechanisms: replication, and Reed-Solomon codes (systematic form). Note that Reed-Solomon codes applied across files allow direct reads on encoded data.

Moreover, CNC enables reintegration (i.e., the capability to reintegrate nodes which have been wrongfully declared as faulty). Typically, if Node 1 in Fig. 1 turns out not to have failed (e.g., the system timeout was set to a too low value), the new blocks created by CNC are useful and increase the level of availability. On the contrary, the blocks generated by the classical repair are identical copies of the blocks of Node 1, making reintegration useless. Finally, a simple random selection of nodes during the maintenance process ensures that the network load is evenly balanced between nodes. This enables the storage system to scale with the number of files to repair, as the available bandwidth is consumed as efficiently as it can be. We provide an analysis of CNC demonstrating its performance.

2. We deployed CNC on the public experimental testbed Grid5000 [21], to evaluate its benefits and compare it against Reed-Solomon codes. Experimental results show that encoding and decoding times are similar to those of Reed-Solomon codes, while the time to repair a failure is drastically reduced. Using CNC instead of Reed-Solomon codes has a negligible effect on the contention of the archival cluster. Finally, we show that the fact that CNC does not rely on a systematic code such as Reed-Solomon does not hamper the performance of the system, even in failure-free executions.

Figure 2 summarizes the properties of CNC and existing redundancy mechanisms (i.e., replication, and systematic Reed-Solomon codes1), conveying the benefits of CNC in the context of cold data storage, where low storage overhead and repair bandwidth are more important than efficient file access. Note that in the context of this work, we assume that data is erasure coded for enhanced reliability. We emphasize that the objective of this work is not to replace replication with erasure codes, but to provide an efficient

1 With systematic Reed-Solomon codes, the encoded data includes the original data in clear, allowing access to sub-parts of the original data without decoding (see e.g., [39]).

maintenance mechanism for erasure-coded data in an archival setup (i.e., cold data storage).

The rest of the paper is organized as follows. We first review the background on maintenance techniques using erasure codes in Section 2. Our novel system is presented in Section 3 and analyzed in Section 4. We evaluate and compare CNC against state-of-the-art approaches in Section 5. Finally, we present related work in Section 6 and conclude in Section 7.

2. Motivation and Background

2.1 Maintenance in Storage Systems

Distributed storage systems are designed to provide a reliable storage service over unreliable components [8, 16, 17, 30, 41]. In order to deal with component failures [15, 45], fault tolerance usually relies on data redundancy; three-way replication is the storage policy adopted by Hadoop [43] or by the Google file system [17], for example. Data redundancy must be complemented with a maintenance mechanism able to recover from the loss of data when failures occur. This preserves the reliability guarantees of the system over time. Maintenance has long lain at the very heart of the design of numerous storage systems [4, 14, 20, 44]. Similarly, reintegration, which is the capability to reintegrate replicas stored on a node wrongfully declared as faulty, was shown in [7] to be one of the key techniques to reduce the maintenance cost. These studies focused on the maintenance of replicas. While plain replication is easy to implement and maintain, it suffers from a high storage overhead: typically, x instances of the same file are needed to tolerate x − 1 simultaneous failures. This high overhead is a growing concern, especially as the scale of storage systems keeps increasing. This motivates system designers to consider erasure codes as an alternative to replication, in particular for cold data [39]. Yet, using erasure codes significantly increases the complexity of the system and challenges designers to devise efficient maintenance algorithms.

2.2 Erasure Codes in Storage Systems

Erasure codes have been widely acknowledged as much more efficient than replication [46] with respect to storage overhead. More specifically, Maximum Distance Separable (MDS) codes are known to be optimal: for a given storage overhead (i.e., the ratio of the original quantity of data to store over the total quantity of data including redundancy), MDS codes provide the optimal efficiency in terms of data availability. With an (n, k) MDS code, the file to store is split into k chunks, encoded into n blocks with the property that any subset of k out of n blocks suffices to reconstruct the file. Thus, to reconstruct a file of M bytes, one needs to download exactly M bytes, which corresponds to the same amount of data as if plain replication was used.
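To make the (n, k) property concrete, here is a minimal sketch (not the paper's code) using a Vandermonde generator matrix over the small prime field GF(257); the field size, the Gaussian-elimination helper and the parameters are illustrative assumptions, as real systems typically work over GF(2^8) or GF(2^16):

```python
import itertools, random

P = 257  # small prime field for illustration only

def encode(chunks, n):
    # block i carries sum_j (i+1)^j * chunk_j: row i of a Vandermonde matrix
    return [sum(pow(i + 1, j, P) * c for j, c in enumerate(chunks)) % P
            for i in range(n)]

def decode(block_ids, blocks, k):
    # rebuild the k x k Vandermonde submatrix and solve it mod P
    A = [[pow(i + 1, j, P) for j in range(k)] for i in block_ids]
    b = list(blocks)
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col])
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        inv = pow(A[col][col], -1, P)
        A[col] = [x * inv % P for x in A[col]]
        b[col] = b[col] * inv % P
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(x - f * y) % P for x, y in zip(A[r], A[col])]
                b[r] = (b[r] - f * b[col]) % P
    return b

k, n = 2, 4
chunks = [random.randrange(P) for _ in range(k)]  # one field symbol per chunk
blocks = encode(chunks, n)
for ids in itertools.combinations(range(n), k):   # any k of the n blocks decode
    assert decode(ids, [blocks[i] for i in ids], k) == chunks
```

Every k-subset decodes because every k × k Vandermonde submatrix with distinct evaluation points is invertible, which is exactly the MDS property described above.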

When using codes, the encoding can be applied per file, as shown in Figure 3a, or across files, as shown in Figure 3b.

[Figure 3 residue. Panel (a), encoding per file: Nodes 1-4 store blocks X1, Y1; X2, Y2; X1+X2, Y1+Y2; X1+2X2, Y1+2Y2; a read fetches k blocks and decodes Y if needed. Panel (b), encoding across files: nodes store X; Y; X+Y; X+2Y; Y is read directly.]

Figure 3. Access to data, coding per file (i.e., independently) or coding across files.

When encoding per file, each file is split into k blocks and encoded independently. As a consequence, the redundancy blocks contain information either about file X or about file Y, but not both. In this case, accessing a file requires downloading k blocks and decoding them if needed. When encoding across files, each file is considered as a block, and k files are encoded together to generate redundancy blocks, e.g., on Nodes 3 and 4 in Figure 3b. In this case, provided that the code used is systematic (i.e., the original data is available in clear in the encoded data), it is possible to download one block and get the corresponding file efficiently without decoding (e.g., X and Y are stored as-is in Figure 3b). This second design has the advantage of enabling data to be read directly by fetching one block from a single node (e.g., reading Y from Node 2 in Figure 3b). This fully leverages the systematic property of codes, limiting both accesses to disks and decoding operations, contrary to the first design, which requires fetching blocks from k nodes and decoding them if needed (e.g., reading Y1 and Y2 from Nodes 1 and 2). However, when codes are non-systematic, only the first design can be used; hence, accesses to files incur k disk operations and a decoding, thus consuming additional I/Os and CPU.
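The contrast between the two read paths can be sketched with the toy integer layout of Figure 3 (our own illustrative values; a real system works over a finite field, and "files" here are just numbers):

```python
# Toy version of the Figure 3 layouts, over the integers.
Y1, Y2 = 7, 5                   # file Y split into k = 2 chunks (per-file case)
X, Y = 9, 12                    # pretend whole files are single values

# (a) per-file, non-systematic: every node holds a combination of Y1, Y2
per_file = {1: (1, 1, Y1 + Y2), 2: (1, 2, Y1 + 2 * Y2),
            3: (2, 1, 2 * Y1 + Y2), 4: (3, 1, 3 * Y1 + Y2)}

def read_per_file(i, j):        # k = 2 network fetches, then a decode
    (a, b, u), (c, d, v) = per_file[i], per_file[j]
    det = a * d - b * c         # invert the 2x2 coefficient matrix
    return ((d * u - b * v) // det, (a * v - c * u) // det)

assert read_per_file(1, 2) == (Y1, Y2)   # any two nodes work, but always 2 fetches

# (b) across files, systematic: Y sits in clear on Node 2
across = {1: X, 2: Y, 3: X + Y, 4: X + 2 * Y}
assert across[2] == Y                    # one fetch, no decoding
```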

As the codes we build our solution upon are non-systematic, we consider that encoding is applied per file in the rest of the paper. However, for the sake of completeness, and in order to measure the impact of contacting several nodes (instead of one), we also compare our scheme to systematic Reed-Solomon codes applied across files in Section 5.

Reed-Solomon codes are a classical example of MDS codes, and are already deployed in cloud-based storage systems [5, 15, 39]. However, as pointed out in [40], one of the major concerns with erasure codes lies in their maintenance process, which incurs significant bandwidth overhead as well as high decoding costs, as explained below.

Maintenance of Erasure Codes. When a node is declared faulty, all blocks of the files it was hosting need to be re-created on a new node. The repair process works as follows: for one block of a file to repair, the new node first needs to download k blocks of this file (i.e., corresponding to the size of the file) to be able to decode it. Once decoded, the new node can re-encode the file and regenerate the lost block. This must be iterated for all the lost blocks. Three issues arise:

1. Repairing one block (possibly a small part of a file) requires the new node to download enough blocks (i.e., k) to reconstruct the entire file. This is required for all the blocks previously stored on the faulty node.

2. The new node must then decode the file, though it does not need to access it.2 Decoding operations are known to be time consuming, in particular for large files.

3. Reintegrating a node which has been wrongfully declared as faulty is almost useless. This is due to the fact that the new blocks created during the repair operation have to be strictly identical to the lost ones, since this is necessary to sustain the coding strategy.3 Therefore, reintegrating a node results in having two identical copies of the involved blocks (the reintegrated ones and the new ones). Such blocks can only be useful if either the reintegrated node or the new node fails, but not in the event of any other node failure.

In order to mitigate these drawbacks, various solutions have been suggested. Lazy repairs, for instance as described in [4], consist of deliberately delaying the repairs, waiting for several failures before repairing all of them together. This makes it possible to repair multiple failures with the bandwidth (i.e., data transferred) and decoding overhead needed for repairing one failure. However, delaying repairs leaves the system more vulnerable to a burst of failures. Architectural solutions have also been proposed, for example the Hybrid strategy [40]. This consists of maintaining one full replica stored on a single node in addition to multiple encoded blocks. This extra replica is used upon repair, avoiding the decoding operation. However, maintaining an extra replica on a single node significantly complicates the design, while incurring scalability issues. Finally, new classes of codes have been designed [12, 24, 25] which trade storage optimality for better maintenance efficiency.

Random Codes. CNC relies on random linear codes (random codes for short), which represent an appealing alternative to classical erasure codes in terms of storage efficiency and reliability, while considerably simplifying the maintenance process. Random codes have been initially evaluated

2 Even with systematic codes, for 2/3 of possible failures, a decoding is required, as a block from the systematic part is missing.

3 This can be achieved either by a tracker maintaining the global information about all blocks, or by the new node inferring the exact structure of the lost blocks from all existing ones.

[Figure 4 residue: file X split into k = 2 chunks X1, X2; n = 5 random linear combinations, e.g., 2X1+7X2, 8X1+3X2, 4X1+3X2, 9X1+2X2, 6X1+5X2, each stored with its coefficients.]

Figure 4. Creation process of encoded blocks using a random code. All the coefficients are chosen randomly. Any k = 2 blocks are enough to reconstruct the file X.

in the context of distributed storage systems in [1]. The authors showed that random codes can provide an efficient fault-tolerance mechanism with the property that no synchronization between nodes is required. Instead, the blocks are generated on each node independently, in such a way that they fit the coding strategy with high probability. Avoiding such synchronization is crucial in distributed settings, as also demonstrated in [18, 31].

Encoding a file using random codes is simple: each file is divided into k chunks, and the blocks stored for reliability are created as random linear combinations of these k chunks (see Figure 4). All blocks, along with their associated coefficients, are then stored on n different nodes. Note that the additional storage space required for the coefficients is typically negligible compared to the size of each block.

In order to reconstruct a file initially encoded with a given k, one needs to download k different blocks of this file. Random matrix theory over finite fields ensures that if one takes k random vectors of the same subspace, these k vectors are linearly independent with a probability which can be made arbitrarily close to one, depending on the field size [1]. In other words, an encoded file can be reconstructed as soon as any set of k encoded blocks is collected, which matches the optimal behavior of MDS codes.
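A minimal sketch of this encode/decode cycle for k = 2 over the illustrative prime field GF(257) (the field size and parameters are our assumptions, not the paper's):

```python
import random
P = 257  # illustrative field size

def rand_block(x1, x2):
    # coefficients are stored with the block, as in Figure 4
    a, b = random.randrange(1, P), random.randrange(1, P)
    return (a, b, (a * x1 + b * x2) % P)

def decode2(blk1, blk2):
    (a, b, u), (c, d, v) = blk1, blk2
    det = (a * d - b * c) % P
    if det == 0:                      # singular pair: probability about 1/P
        return None
    inv = pow(det, -1, P)
    return ((d * u - b * v) * inv % P, (a * v - c * u) * inv % P)

X1, X2 = 123, 45                                  # file X split into k = 2 chunks
blocks = [rand_block(X1, X2) for _ in range(5)]   # n = 5 blocks, as in Figure 4
res = decode2(blocks[0], blocks[3])               # any pair decodes w.h.p.
assert res is None or res == (X1, X2)
```

With a realistic field size (e.g., 2^16), the probability that the k collected blocks are linearly dependent becomes negligible, which is the point made above.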

3. Clustered Network Coding

Our CNC system is designed to sustain a predefined level of reliability, i.e., of data redundancy, set by the archival system operator. This reliability level then directly translates into a redundancy factor applied to stored files, with parameters k (number of blocks sufficient to retrieve a file) and n (total number of redundant blocks for a file). A typical scenario for using CNC is a storage cluster as in the Google File System [17], where large files are split into smaller files of the same size, for example 1 GB as in Windows Azure Storage [5]. These files are then erasure coded in order to save storage space. We assume that failure detection is performed by a monitoring system, whose description is out of the scope of this paper. We also assume that this system triggers the repair process, assigning new nodes to replace the faulty ones.

3.1 A Cluster-based Approach

To provide efficient maintenance, CNC relies on (i) hosting all blocks related to a set of files on a single cluster of nodes, and (ii) repairing multiple files simultaneously. To this end, the system is partitioned into disjoint (logical) clusters of n nodes, so that each node of the storage system belongs to only one cluster. Each file to be stored is encoded using random codes and is randomly associated with a single cluster, so as to balance the storage load evenly across clusters. All blocks of a given file are then stored on the n nodes of the same cluster. In other words, the CNC placement strategy consists in storing the blocks of two different files belonging to the same cluster on the same set of nodes.4 Note that these clusters are constructed at a logical level. In practice, the nodes of a given cluster may span geo-dispersed sites to provide enhanced reliability. Obviously, there is a tradeoff between minimizing inter-site traffic and high reliability; this is outside the scope of this paper. In such a setup, the archival system manager (e.g., the master node in the Google File System [17]) only needs to maintain two data structures: an index which maps each file to one cluster, and an index which contains the set of identifiers of the nodes in each cluster, as sketched below. This simple data placement scheme leads to significant data transfer gains and better load balancing, by clustering operations on encoded blocks, as explained in the remaining part of this section.
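A minimal sketch of these two indexes (hypothetical names and values, not the paper's implementation):

```python
# Tracker-side metadata: file -> cluster, and cluster -> node identifiers.
file_to_cluster = {"archive-0001": 0, "archive-0002": 0, "archive-0003": 1}
cluster_nodes = {0: [11, 12, 13, 14], 1: [21, 22, 23, 24]}  # n = 4 per cluster

def nodes_holding(file_id):
    # every block of a file lives on every node of its (single) cluster
    return cluster_nodes[file_to_cluster[file_id]]

assert nodes_holding("archive-0002") == [11, 12, 13, 14]
```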

3.2 Maintenance of CNC

When a node failure is detected, the maintenance operation should ensure that all blocks hosted on the faulty node are repaired. This preserves the redundancy factor, and hence the predefined reliability level of the archival system. While in most systems repair is usually performed at the granularity of a file, a node failure typically leads to the loss of several blocks, involving several files. CNC precisely leverages this characteristic: when a node fails, multiple repairs are triggered, one for each block that the faulty node was storing. Traditional approaches using erasure codes actually consider a failed node as the failure of all of its blocks. By contrast, the novelty of CNC is to leverage network coding at the node level (i.e., between multiple blocks of different files on a particular cluster). This is possible since the CNC placement strategy clusters files so that all nodes of a cluster store the same files. Network coding has already been studied to reduce the bandwidth during maintenance [9, 22, 23], but only at the file level (i.e., between multiple blocks of a single file). CNC differs from these works as it repairs different files simultaneously by mixing them, thus enabling the reduction of the amount of data to be transferred during the maintenance process in practical archival systems.

4 An analytical evaluation of the mean time to data loss for such a clustered placement can be found in [6].

3.3 An Illustrating Example

Before generalizing in the next section, we first describe a simple example (see Figure 5). This provides the intuition behind CNC compared to a classical maintenance process. We consider two files X and Y of size M = 1024 MB, encoded with random codes (k = 2, n = 4), stored on the 4 nodes of the same cluster (i.e., Nodes 1 to 4). File X is chunked into k = 2 chunks X1, X2, and file Y into chunks Y1 and Y2. Each node stores one encoded block related to X and one encoded block related to Y, which are respectively random linear combinations of {X1, X2} and {Y1, Y2}. Each block is of size M/k = 512 MB, so that each node stores a total of 1024 MB.

Let us consider the failure of Node 4. In a classical repair process, the new node asks k = 2 nodes for their blocks corresponding to files X and Y and downloads 4 blocks, for a total of 2048 MB. This enables the new node to decode the two files independently, and then re-encode each file to regenerate the lost blocks of X and Y and store them.

Instead, CNC leverages the fact that the encoded blocks related to X and Y are stored on the same node, and restored on the same new node, to encode the files together rather than independently during the repair process. More precisely, if the nodes are able to compute a linear combination of their encoded blocks, we can prove that if k = 2, only 3 blocks are sufficient to perform the repair of the two files X and Y. The transfer of only 3 blocks thus incurs the download of 1536 MB, instead of the 2048 MB needed with the classical repair process. In addition, this repair can be processed without decoding either of the two files. In practice, the new node has to contact the three remaining nodes to perform the repair. Each of the three nodes sends the new node a random linear combination of its two blocks, with the associated coefficients. Note that the two files are now intermingled (i.e., encoded together). However, we want to be able to access each file independently after the repair. The challenge is thus to create two new random blocks, with the restriction that one is only a random linear combination of the X blocks, and the other of the Y blocks. In this example, finding the appropriate coefficients that cancel the Xi or the Yi comes down to solving, for each of the files X and Y, a system of two equations with three unknowns.5 The new node then makes two different linear combinations of the three received blocks according to the previously computed coefficients, (A=-6, B=-22, C=25) and (D=20, E=9, F=-17) in the example. Thereby it creates two new independent random blocks, related to files X and Y respectively. The repair is then performed, saving the bandwidth consumed by the transfer of one block (i.e., 512 MB in this example).

5 The system is the following: A·4 + B·8 + C·8 = 0 and A·14 + B·3 + C·6 = 0 for (A,B,C), and D·15 + E·12 + F·24 = 0 and D·9 + E·14 + F·18 = 0 for (D,E,F), with the coefficients taken from the repair blocks of Figure 5.

[Figure 5 residue. (a) CNC repair: Nodes 1-3 store the block pairs (2X1+7X2, 5Y1+3Y2), (8X1+3X2, 6Y1+7Y2) and (4X1+3X2, 8Y1+6Y2); each transmits one repair block, a local combination of its two blocks: 4X1+14X2+15Y1+9Y2, 8X1+3X2+12Y1+14Y2 and 8X1+6X2+24Y1+18Y2 (only 3 blocks transmitted). New Node 4 locally combines them into 16X1+205X2+0Y1+0Y2 and 0X1+0X2+246Y1+88Y2. (b) Classical repair: 4 encoded blocks are transmitted, and files X and Y are decoded on new Node 4 before re-encoding.]

Figure 5. Comparison between CNC and the classical maintenance process, for the repair of a failed node which was storing two blocks of two different files (X and Y) in a cluster of 4 nodes (with k = 2, n = 4). All stored blocks, as well as transferred blocks and repair blocks in the example, have exactly the same size.

Note that the example is given over the integers for simplicity, though arithmetic operations would be computed over a finite field in an implementation.
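The whole Figure 5 example can be replayed in a few lines; this sketch (ours, not the paper's code) works over the integers like the figure, with blocks represented by their coefficient vectors over (X1, X2, Y1, Y2):

```python
# Replaying Figure 5a: three surviving nodes repair Node 4's two blocks.
node_blocks = [
    ((2, 7, 0, 0), (0, 0, 5, 3)),   # Node 1: 2X1+7X2 and 5Y1+3Y2
    ((8, 3, 0, 0), (0, 0, 6, 7)),   # Node 2: 8X1+3X2 and 6Y1+7Y2
    ((4, 3, 0, 0), (0, 0, 8, 6)),   # Node 3: 4X1+3X2 and 8Y1+6Y2
]
local = [(2, 3), (1, 2), (2, 3)]    # each node's local mixing coefficients

# Each node ships ONE repair block: a combination of its X and Y blocks.
repair = [tuple(a * x + b * y for x, y in zip(xb, yb))
          for (xb, yb), (a, b) in zip(node_blocks, local)]
assert repair == [(4, 14, 15, 9), (8, 3, 12, 14), (8, 6, 24, 18)]  # 3 blocks, not 4

def combine(c1, c2, c3):
    # New Node 4 locally combines the three received repair blocks.
    return tuple(c1 * u + c2 * v + c3 * w for u, v, w in zip(*repair))

assert combine(20, 9, -17) == (16, 205, 0, 0)   # Y part cancelled: new X block
assert combine(-6, -22, 25) == (0, 0, 246, 88)  # X part cancelled: new Y block
```

The two asserts reproduce exactly the new blocks of Figure 5a, obtained without ever decoding X or Y.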

3.4 CNC: The General Case

We now generalize the previous example to any k. We first define a repair block object: a repair block is a random linear combination of two encoded blocks of two different files stored on a given node. Repair blocks are transient objects which only exist during the maintenance process (i.e., repair blocks only transit on the network and are never stored permanently). We are now able to formulate the core technical result of this paper; the following theorem applies in a context where different files are encoded using random codes with the same k, and the encoded blocks are placed according to the clustered placement described in the previous section.

Theorem 1. In order to repair two different files, downloading k + 1 repair blocks from k + 1 different nodes is a sufficient condition.

Repairing two files jointly actually comes down to creating one new random block for each of the two files. The formal proof, provided in the technical report [3], relies on showing that vectors resulting from CNC operations remain random, which ensures that blocks do not degenerate in the long run due to successive operations performed on them. This theorem thus implies that instead of having to download 2k blocks as with Reed-Solomon codes when repairing, CNC decreases that need to only k + 1. Other implications and analysis are detailed in the next section. Note that the encoded blocks of the two files do not need to have the same size. In case of different sizes, the smaller one is simply zero-padded during the network coding operations, as is

usually done in this context; the padding is then removed at the end of the repair process. In a real system, nodes usually store far more than two blocks, implying multiple iterations of the process previously described. More formally, to restore a failed node which was storing x blocks, the repair process must be iterated x/2 times. In fact, as two new blocks are repaired during each iteration, the number of iterations is halved compared to the classical repair process. Note that in case of an odd number of stored blocks, the repair process is iterated until only one block remains. The last block is repaired by downloading k blocks of the corresponding file, which are then randomly combined to conclude the repair. The overhead related to the repair of the last block in case of an odd block number becomes negligible as the number of stored blocks grows.

The fact that the repair process must be iterated several times can also be leveraged to balance the bandwidth load over all the nodes in the cluster. Only k + 1 nodes, out of the n nodes of the cluster, are selected at each iteration of the repair process. As all nodes of the cluster have a symmetrical role, a different set of k + 1 nodes can be selected at each iteration. In order to leverage the whole available bandwidth of the cluster, CNC makes use of a random selection of these k + 1 nodes at each iteration. In other words, for each round of the repair process, the new node selects k + 1 nodes uniformly at random among the n cluster nodes. Doing so, we show that every node is evenly loaded, i.e., each node sends the same number of repair blocks in expectation.

More formally, let N be the number of repair blocks sent by a given node. In a cluster where n nodes participate in the maintenance operation, for T iterations of the repair process, the average number of repair blocks sent by each node is:

E(N) = T(k + 1)/n    (1)

An example illustrating this load balancing is provided in the next section.

4. CNC Analysis

The novel maintenance protocol proposed in the previous section enables (i) significantly reducing the amount of data transferred during the repair process; (ii) balancing the load between the nodes of a cluster; (iii) avoiding computationally intensive decoding operations; and finally, (iv) providing useful reintegration. These benefits are now detailed.

4.1 Transfer Savings

A direct implication of Theorem 1 is that for large enough values of k, the data transfer required to perform a repair is halved; this directly results in a better usage of the available bandwidth. To repair two files in a classical repair process, the new node needs to download at least 2k blocks to be able to decode each of the two files. The ratio (k + 1)/(2k) (CNC over Reed-Solomon) thus tends to 1/2 as larger values of k are used.

The exact amount of data needed to repair x blocks of size s, all encoded with the same k, is (x/2) · s · (k + 1) if x is even, and (x/2) · s · (k + 1 + (k − 1)/x) if x is odd.

An example of the transfer savings is given in Figure 6, for k = 16 and a file size of 1 GB.

From Theorem 1, CNC repairs lost files in groups of two. One can wonder whether there is a benefit in grouping more than two files during the repair. In fact, a simple extension of Theorem 1 reveals that to group G files together, a sufficient condition is that the new node downloads (G − 1)k + 1 repair blocks from (G − 1)k + 1 distinct nodes among the n nodes of the cluster. Firstly, this implies that the new node must be able to contact many more nodes than k + 1. Secondly, we can easily see that the gains made possible by CNC are maximal when two files are considered simultaneously: the savings in data transfer when repairing are expressed by the ratio ((G − 1)k + 1)/(Gk). The minimal value of this ratio (1/2, which corresponds to the maximal gain) is obtained for G = 2 and large values of k.

A second natural question is whether downloading fewer than (G − 1)k + 1 repair blocks to group G files together is possible. We can answer this question positively, as the value (G − 1)k + 1 is only a sufficient condition. In fact, if nodes do not send random combinations, but carefully choose the coefficients of the combination, it is theoretically possible to download fewer repair blocks. However, as G grows, finding such coefficients becomes computationally intractable, especially for large values of k. This calls for the use of the simpler setting, i.e., G = 2, as presented in this paper.
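A quick numeric check of the ratio confirms that G = 2 is the sweet spot:

```python
# Savings ratio ((G-1)k + 1) / (Gk) of grouped repair vs. classical repair.
k = 16
for G in (2, 3, 4, 8):
    print(G, ((G - 1) * k + 1) / (G * k))
# G=2 -> 0.531, G=3 -> 0.687, G=4 -> 0.766, G=8 -> 0.883: minimal at G = 2
```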

[Figure 6 residue: repair bandwidth (GB) vs. number of files to repair (0-1000), for Replication, RS, and CNC with k = 8, 16 and 32.]

Figure 6. Necessary amount of data to transfer to repair a failed node, according to the selected redundancy scheme (1 GB files).

4.2 Load Balancing

As previously mentioned, when a node fails, the repair process is iterated as many times as needed to repair all lost blocks. CNC ensures that the load over the remaining nodes is balanced during maintenance, thanks to the random selection of the k + 1 nodes at each round.

Consider a scenario involving a 5-node cluster, storing 10 different files encoded with random codes (k = 2). Node 5 has failed, involving the loss of 10 blocks of the 10 files stored on that cluster. Nodes 1 to 4 are available for the repair process. T = 5 iterations of the repair process are necessary to recreate the 10 new blocks, as each iteration repairs 2 blocks at the same time. The total number of repair blocks sent during the whole maintenance is T · (k + 1) = 15, whereas the classical repair process needs to download 20 encoded blocks. The random selection ensures in addition that the load is evenly balanced between the available nodes of the cluster. Here, Nodes 1, 2 and 4 are selected during the first repair round, then Nodes 2, 3 and 4 during the second round, and so forth. The total number of repair blocks is balanced between all available nodes, each sending T(k + 1)/n = 15/4 = 3.75 repair blocks on average.

As a consequence of using the whole available bandwidth in parallel, as opposed to sequentially fetching blocks from only a subset of nodes, the Time To Repair (TTR) a failed node is also greatly reduced.
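Equation (1) can be sanity-checked on the example above with a quick Monte Carlo sketch (our own check, using the paper's T = 5, k = 2, n = 4):

```python
import random
from collections import Counter

def simulate(runs=100_000, T=5, k=2, n=4):
    # per round, the new node picks k+1 of the n survivors uniformly at random
    sent = Counter()
    for _ in range(runs):
        for _ in range(T):
            for node in random.sample(range(n), k + 1):
                sent[node] += 1          # each selected node ships one repair block
    return [sent[i] / runs for i in range(n)]

print(simulate())   # each entry is close to T*(k+1)/n = 5*3/4 = 3.75
```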

4.3 No Decoding Operations

Decoding operations are known to be time consuming and therefore should only be performed for file accesses. While the use of classical erasure codes requires such decoding to take place upon repair, CNC avoids those operations. In fact, no file needs to be decoded at any time in CNC: repairing two blocks only requires computing two linear combinations, instead of decoding the two files. This greatly simplifies the repair process over classical approaches. As a consequence, the time to perform a repair is reduced compared to the

classical repair process, especially when dealing with large files, as confirmed by our experiments in Section 5.

4.4 Reintegration

The decision to declare a node as faulty is usually made using timeouts; this is typically an error-prone decision [7]. In fact, nodes can be wrongfully timed out and can reconnect once the repair is done [27]. While the longer the timeouts, the fewer errors are made, adopting large timeouts may jeopardize the reliability guarantees, typically in the event of a burst of failures. The interest of reintegration is to be able to leverage the fact that nodes which have been wrongfully timed out are reintegrated into the system. The authors of [7] showed that reintegration is a key concept to save maintenance bandwidth. However, reintegration has not been addressed when using erasure codes.

As previously mentioned, when using classical erasure codes, the repaired blocks have to be strictly identical to the lost ones. Therefore, reintegrating a node which was suspected as faulty is almost useless, since this results in two identical copies of the lost and the repaired blocks. Such blocks can only be useful in the event of the failure of two specific nodes, the incorrectly timed-out node and the new one. Instead, reintegration is always useful when deploying CNC. More precisely, every single new block can be leveraged to compensate for the loss of any other block, and is therefore useful in the event of the failure of any node. Indeed, newly created blocks are simply new random blocks, thus different from the lost ones while being functionally equivalent. Therefore, each new block contributes to the redundancy factor of the cluster.

5. Evaluation

In order to confirm the theoretical savings provided by the CNC repair protocol in terms of bandwidth utilization and decoding operations, we deployed CNC on an experimental platform. We now describe the implementation of the system and the CNC experimental results.

5.1 System Overview

We implemented a simple storage cluster with an architecture similar to Hadoop [43] or the Google File System [17]. This architecture is composed of one tracker node that manages the metadata of files, and several storage nodes that store the data. This set of storage nodes forms a cluster as defined in Section 3. An overview of the system architecture is depicted in Figure 7. Client nodes can PUT/GET data directly to/from the storage nodes, after having obtained their IP addresses from the tracker. In case of a storage node failure, the tracker initiates the repair process and schedules the repair jobs. All files to be stored in the system are encoded using random codes with the same k. Let n be the number of storage nodes in the cluster; then n encoded blocks are created for each file, one for each storage node. Note that the

[Figure 7 residue: a tracker node holding file metadata, client nodes, and a cluster of storage nodes; a new node exchanges ASK_REPAIRBLOCK/REPAIRBLOCK messages with the cluster.]

Figure 7. Experimental System Overview.

system can thus tolerate n − k storage node failures before files are lost for good.

Operations. In the case of a PUT operation, the client first encodes the blocks. The coefficients of the linear combination associated with each encoded block are appended at the beginning of the block. Those n encoded blocks are sent to the n storage nodes of the cluster using a PUT_BLOCK_MSG. A PUT_BLOCK_MSG contains the encoded information, as well as the hash of the corresponding file. Upon the receipt of a PUT_BLOCK_MSG, the storage node stores the encoded block using the hash as filename. To retrieve the file, the client sends a GET_BLOCK_MSG to at least k nodes, out of the n nodes of the cluster. A GET_BLOCK_MSG only contains the hash of the file to be retrieved. Upon the receipt of a GET_BLOCK_MSG, the storage node sends the block corresponding to the given hash. As soon as the client has received k blocks, the file can be recovered.

In case of a storage node failure, a new node is selected by the tracker to replace the failed one. This new node sends an ASK_REPAIRBLOCK_MSG to k + 1 storage nodes. An ASK_REPAIRBLOCK_MSG contains the two hashes of the two blocks which have to be combined following the repair protocol described in Section 3. Upon the receipt of an ASK_REPAIRBLOCK_MSG, the storage node combines the two encoded blocks corresponding to the two hashes, and sends the resulting block back to the new node. As soon as k + 1 repair blocks are received, the new node can regenerate two lost blocks. This process is iterated until all lost blocks are repaired.
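The storage-node side of this exchange boils down to one random combination per request. A sketch of the handler (names, structure and the toy field arithmetic are our assumptions; the actual implementation is in C/C++):

```python
import random

P = 257  # illustrative field size

class StorageNode:
    def __init__(self):
        self.blocks = {}  # file hash -> (coefficient vector, encoded payload)

    def on_ask_repairblock(self, hash_x, hash_y):
        # An ASK_REPAIRBLOCK_MSG names two stored blocks of two different
        # files; the reply is a single random combination of them.
        a, b = random.randrange(1, P), random.randrange(1, P)
        coeffs_x, payload_x = self.blocks[hash_x]
        coeffs_y, payload_y = self.blocks[hash_y]
        coeffs = [a * c % P for c in coeffs_x] + [b * c % P for c in coeffs_y]
        payload = (a * payload_x + b * payload_y) % P
        return ("REPAIRBLOCK", coeffs, payload)  # coefficients travel with the block
```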

5.2 Deployment and Results

We deployed this system on the Grid5000 experimental testbed [21]. The experiment ran on 24 storage nodes, 1 tracker node, and 4 client nodes, all connected through a 1 Gbps network. Each node has 2 Intel Xeon E5520 CPUs at 2.26 GHz, 32 GB of RAM and two 300 GB SAS hard drives used in RAID-0. The 24 storage nodes form a logical cluster, as defined in Section 3. All files were encoded with k = 16, and had a size of 1 GB, which is the size used for sealed extents in Windows Azure Storage [5].

[Figure 8 residue: eight panels of encoding and decoding throughput (MB/s) on Xeon E5520 and Xeon E5-2630, vs. file size (16-1024 MB, k = 16) and vs. k (4-24, n = int(1.5k), 1024 MB files), for CNC and RS.]

Figure 8. Single-core in-memory encoding and decoding throughput for various file sizes with k = 16, and for various values of k with file size 1024 MB, on Xeon E5520 (2.26 GHz) and Xeon E5-2630 (2.30 GHz) running Linux 3.2 64-bit with 32 GB RAM.

Implementation. We implemented the coding logic of CNC as a library in C, relying on GF-Complete [36] for finite field operations. The networking and storage logic has been implemented in C++ using Boost.Asio. The client performs all encoding/decoding operations using one dedicated thread, possibly sending/receiving other blocks while computing. However, the storage nodes being repaired do not perform computation while receiving data; latency is at this stage less critical, since no user is directly impacted. Besides CNC, our system also supports systematic Reed-Solomon codes, applied either per file or across files as described in Figure 3. We compare CNC to these two systematic Reed-Solomon coding schemes, noted Reed-Solomon or RS.

Encoding/Decoding performance. We first look at the in-memory encoding and decoding rates of our CNC library, on two different machines (one considered slow, the other fast). Those rates are measured when using random codes for various code lengths (k), depending on the size of the file to be encoded (16 MB to 1024 MB) and on the hardware of the two machines. Results are depicted in Figure 8. For a given (k, n), encoding and decoding rates are close to linear with the file size. For example, with (k = 16, n = 24), the encoding of a 1 GB file occurs at 125 MB/s on the Xeon E5520, while the faster Xeon E5-2630 encodes at around 200 MB/s. Decoding speeds are 200 MB/s and 300 MB/s respectively. This confirms that machine architectures are crucial for performance when dealing with coding libraries [36, 37].

For Reed-Solomon codes, both the encoding and the decoding rates (measured when some data from the systematic part is missing) are represented in the figures. These are provided by the Jerasure library [38]. We observe that the rates for CNC and Reed-Solomon codes are fairly similar, CNC being a bit faster for decoding, and a bit slower for encoding. The minor difference between the two schemes is due to the fact that, for the block sizes (i.e., 1 to 64 MB) and the k (i.e., 4 to 24) that we consider, applying the operations to the data completely dominates other costs (e.g., inverting matrices of coefficients, which is costlier for random matrices than for Reed-Solomon generator matrices).

Repair Time. In this experiment, we measure the total repair time upon a node failure, depending on the amount of storage of the faulty node. The results, depicted in Figure 9, include the time to receive repair blocks at the new node, the time to compute (decoding for Reed-Solomon codes, and linear combinations for new block creation for CNC), as well as the wait time (which corresponds to the delay until the last repair block has been received, allowing operation). Hence, it represents the effective time between failure detection and complete repair.

Figure 9 shows that the repair time is dramatically reduced when using CNC compared to Reed-Solomon codes, especially with an increasing amount of data to be repaired. For instance, to repair a node hosting 128 GB of data, CNC and Reed-Solomon codes require respectively 824 and 2076 seconds (i.e., a 60% reduction when using CNC). These time savings are mainly due to the fact that decoding operations are avoided in CNC, and that less information is transferred.

[Figure 9 residue: repair time in seconds (0-2500) for CNC and RS, for 16, 32, 64 and 128 GB of repaired data.]

Figure 9. Repair time for CNC and Reed-Solomon codes for various amounts of data. The total time is split between waiting time (for response), reception time (over the network) and time dedicated to computing.

PUT and GET performance without failures. Figure 10 shows the performance of PUT operations from a single client accessing the cluster. The system is able to perform PUT operations at a rate of 40 MB/s for CNC and 45 MB/s for Reed-Solomon codes with encoding per file, and at a rate of 55 MB/s for Reed-Solomon codes with encoding across files. CNC and Reed-Solomon codes exhibit similar performance when applied per file. This is consistent with the encoding speed we observed in Figure 8. Encoding across files is slightly faster due to the fact that files do not need to be split into chunks before being encoded.

Figure 11 shows the performance of GET operations from a single client. They are performed at a rate of 110 MB/s for both CNC and Reed-Solomon codes (encoding per file). For these, the network (1 Gbps) is clearly the limiting factor, which is again consistent with the high decoding speed (greater than 190 MB/s) that we observed in Figure 8. For Reed-Solomon coding applied across files, there is in fact a slight performance drop (around 90 MB/s): in this case, the GET contacts only one storage node, thus opening a single TCP connection and reading from a single disk. Hence, the client does not saturate the 1 Gbps link, as is the case with encoding per file, where k TCP connections are opened (parallel reads from k storage nodes).

Figure 12 shows the performance of multiple clients accessing the cluster concurrently. The clients perform GET operations continuously for 30 minutes, and we compute the aggregate throughput of all clients. We observe that there is no strong degradation of performance due to concurrency. Note that Reed-Solomon codes with encoding across files fully leverage the systematic nature of such codes: reading a block incurs only one disk seek, without requiring a decoding operation. Yet, our experiments show that this property does not hamper CNC, because access to disk is a negligible factor. The gap between encoding per file and encoding across files diminishes as the number of clients increases, as can be

[Figure 10 residue: PUT throughput (MB/s) vs. total amount put (GB), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]

Figure 10. Single-client throughput of the PUT operation for various amounts of data.

[Figure 11 residue: GET throughput (MB/s) vs. total amount fetched (GB), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]

Figure 11. Single-client throughput of the GET operation for various amounts of data.

expected. With up to 4 clients running, the performance increases linearly, as clients do not compete for resources. For 4 clients performing continuous GET operations in parallel, the aggregate throughput is 360-365 MB/s for CNC and RS per file, and 380 MB/s for RS across files. Coding across files has a slight advantage (less than 5%) when 4 clients continuously query the storage cluster at around 90 MB/s each: indeed, coding across files implies bigger blocks and fewer TCP connections per client. This has a limited impact and was not visible for a single client.

The number of concurrent clients we consider (i.e., 4 in this experiment) is already much higher than the average number of clients that would access data on a cluster storing cold data as described in [39]. In [39], the logical cluster is composed of 36 TB storage nodes, and data is marked as cold if not accessed for at least 3 months. Let us assume that the system offers each user 100 GB of archival capacity, in 1 GB archive files, with users accessing one of their backup archives once a month. In this case, we would observe a read rate as low as 3·10^-3 reads/s, which is much lower than the read rate of our experiment, and also much lower than the read rate reported for hot data on production systems (e.g., 30 reads/s [15]). As a consequence, the penalty

[Figure 12 residue: aggregate GET throughput (MB/s) vs. number of clients (1-4), for CNC, Reed-Solomon (per file) and Reed-Solomon (across files).]

Figure 12. Multiple-client throughput of the GET operation for various amounts of data.

in I/O due to encoding per file rather than across files (see Figure 3) has a negligible impact for cold data. This is confirmed experimentally: according to these figures, there is no penalty in using CNC for 1 client and little impact for up to 4 clients (our cluster only allowed 4 client nodes). Also, the decoding is not a limiting factor. Hence, CNC compares favorably to Reed-Solomon codes in spite of the decoding needed. The only additional cost that must be paid for using CNC is a higher CPU utilization. In our experiments, a single dedicated thread (and thus one CPU core used at 100%) was sufficient for handling the encoding and decoding operations. The negligible impact in terms of I/Os is consistent with the block sizes that we consider. For a file of 1 GB split into k = 16 blocks, we need approximately 8 seconds to receive all blocks of 64 MB over the network, which is 2000 times longer than a typical disk seek (4 ms).6
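The arithmetic behind the figures quoted above, spelled out (assumed workload: a 24-node cluster of 36 TB nodes, 100 GB per user, one 1 GB archive read per user per month):

```python
# Read-rate bound for cold data, under the assumptions stated above.
users = 24 * 36_000 // 100                 # 24 nodes x 36 TB, 100 GB per user
reads_per_sec = users / (30 * 24 * 3600)   # one archive read per user per month
print(f"{reads_per_sec:.1e} reads/s")      # ~3.3e-03, the figure quoted above

# Network transfer time vs. disk seek, for a 1 GB file in k = 16 blocks:
transfer_s = 8 * 1024 / 1000               # ~8 s for 1 GB over a 1 Gbps link
print(transfer_s / 0.004)                  # ~2000 typical 4 ms disk seeks
```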

Performance under failures. Figure 13 plots the performance of the system when it has suffered failures. We ran a set of experiments measuring the GET performance of several concurrent clients running on 4 client nodes. A fixed number of storage nodes were continuously failed so as to evaluate the performance of the various coding schemes in each possible failure configuration (e.g., 1 failure of a systematic node, 1 failure of a non-systematic node, 2 failures of systematic nodes). For a given failure probability, we then evaluate the average performance, which depends on the probability that the system is in each of the possible failure configurations. This allows us to measure the average performance of the system even though failures are rare and seldom occur within the time-span of a real experiment.
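A minimal sketch of this averaging, assuming independent node failures and a binomial weighting over the 16 systematic and 8 non-systematic nodes (the exact formula is not spelled out above, so this model and the helper below are our assumptions):

    # Average GET throughput under per-node failure probability p, weighting
    # the throughput measured in each failure configuration by the
    # probability of being in that configuration (independent failures).
    from math import comb

    def avg_throughput(p, measured, n_sys=16, n_par=8):
        # measured[(s, t)] = throughput with s systematic and t
        # non-systematic nodes failed; configurations missing from the
        # dict are assumed to have negligible probability.
        total = 0.0
        for (s, t), thr in measured.items():
            weight = (comb(n_sys, s) * p**s * (1 - p)**(n_sys - s)
                      * comb(n_par, t) * p**t * (1 - p)**(n_par - t))
            total += weight * thr
        return total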

From Figure 13, we observe that the average throughput for CNC and Reed-Solomon (per file) does not degrade as the failure rate increases. This is due to the fact that the clients do not saturate the capacity of the remaining storage nodes. Indeed, in our setting, after two failures, 22 of the 24 storage nodes are still available.

⁶ Even for smaller files of 16 MB split into k=16 blocks of 1 MB, we still need approximately 0.125 seconds to receive all blocks of 1 MB, which remains 30 times longer than a typical disk seek.

Figure 13. Bandwidth depending on the failure rate. [Plot: average GET throughput (MB/s) vs. failure rate (0.001-0.1), for CNC, Reed-Solomon (per file), and Reed-Solomon (across files).]

When encoding is applied across files, a slight degradation is observed: when a failure occurs, it affects one of the systematic nodes with probability 16/24, so that accessing a file stored on that node requires performing a degraded read (i.e., transferring encoded data corresponding to 16 files and decoding this data to recover the single file being read). This degradation remains limited: even if failures are frequent, they do not necessarily affect the data read. Limiting the impact of such degraded reads is a current area of research in codes for storage, as discussed in the next section.

Impact of data coldness. From the previous experiments, we observed that CNC does not degrade the throughput when compared to Reed-Solomon applied across files, but comes at the price of a higher CPU utilization when reading data (i.e., GET operations). However, CNC also halves the repair bandwidth. In this subsection, we study which of the network or the CPU is the most limiting factor for various failure rates (e.g., once a day to once a year).

Data is considered cold when not accessed for a given amount of time. We consider various thresholds for data coldness (from 1 day to 1 year). A threshold of 1 month means that data is accessed at most once a month. We use this as an upper bound on the read rate to infer the corresponding upper bound on CPU usage.

We consider a cluster of storage nodes connected through a 1 Gbps network, equipped with Xeon E5520 processors and 72 TB of storage. The system also comprises 4 client nodes, each dedicating 1 CPU core to decoding operations. Storage nodes dedicate up to 1/4 of their bandwidth to repairs, reserving the other 3/4 for operations such as reads. In Figure 14a, we plot the CPU usage for CNC (i.e., the average utilization of the cores that client nodes dedicate to decoding operations). CPUs are not saturated as long as data accessed less than once every 14 days is marked as cold. In Figure 14b, we plot the fraction of the dedicated bandwidth used by repair operations with Reed-Solomon. The network is saturated if repairs are needed at least once per month.

Figure 14. CPU saturation for various coldness thresholds (a) and network saturation for various mean times between failures (b). Thresholds where CNC or RS are preferred (c). [Panels: (a) CPU usage (%) vs. data cold after (days), CNC decoding; (b) network usage (%) vs. time between failures (days), RS repair; (c) time between failures (days) vs. data cold after (days), with regions: none applicable (both saturated); RS to be preferred (CNC saturated); either CNC or RS (none saturated); CNC to be preferred (RS saturated).]

Figure 14c summarizes these thresholds (i.e., the minimal settings at which the resource each scheme consumes most is not saturated), indicating when to use CNC, Reed-Solomon, or either of them. Obviously, these conclusions hold in our experimental setting (1 Gbps Ethernet and one Xeon E5520) and need to be adapted when considering faster networks or CPUs. Yet, in our experience, they hold in current network and architecture configurations.
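The decision rule summarized in Figure 14c amounts to comparing the two measured utilizations; a minimal sketch (the function name is ours, and the utilizations in [0, 1] would come from measurements such as Figures 14a and 14b):

    # Which scheme to prefer, following the four regions of Figure 14c.
    # cnc_cpu: utilization of the CNC decoding cores; rs_net: utilization
    # of the Reed-Solomon repair bandwidth (both as fractions of capacity).
    def preferred_scheme(cnc_cpu, rs_net):
        cnc_viable = cnc_cpu < 1.0   # decoding keeps up with the read rate
        rs_viable = rs_net < 1.0     # repair bandwidth keeps up with failures
        if cnc_viable and rs_viable:
            return "either CNC or RS (none saturated)"
        if cnc_viable:
            return "CNC to be preferred (RS saturated)"
        if rs_viable:
            return "RS to be preferred (CNC saturated)"
        return "none applicable (both saturated)"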

Impact of decoding. As stated earlier, the main impact of using CNC is a higher CPU cost for decoding. In our setting (i.e., a cluster from Grid5000 interconnected using 1 Gbps Ethernet), the CPU was not the most limiting factor, as a single core could decode at up to 200 MB/s (1600 Mbps) according to Figure 8.

Figure 8 also shows that on modern CPUs, 1 core can decode at more than 250 MB/s (2 Gbps). These results extrapolate to multiple cores, as decoding can be performed in parallel by multiple cores (i.e., decoding is performed by stripes, and stripes can be processed independently). Table 1 gives the decoding throughput of various processors and the corresponding network they can saturate. Notice that appropriate CPUs allow saturating 1 Gbps, 10 Gbps, or even 40 Gbps networks. A 100 Gbps Ethernet network would however not be saturated, leaving a slight advantage to systematic Reed-Solomon (i.e., 100 Gbps for RS vs. 80 Gbps for CNC with 4 Xeon E7-4870) in that specific configuration.
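The rule of thumb behind Table 1 is simply cores times per-core rate; a rough sketch, assuming the ~2 Gbps per-core decode rate of Figure 8 and linear scaling across cores (both idealizations):

    # Fastest standard Ethernet link a given core count can keep saturated,
    # assuming ~2 Gbps of CNC decoding per core and linear scaling across
    # cores (stripes decode independently).
    def saturated_link_gbps(cores, per_core_gbps=2.0, links=(1, 10, 40, 100)):
        aggregate = cores * per_core_gbps
        return max((l for l in links if l <= aggregate), default=0)

    print(saturated_link_gbps(24))   # 2x Xeon E5-2695: 48 Gbps -> 40 Gbps link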

6. Related Work

The problem of efficiently maintaining erasure-coded content has triggered a novel research area in both the theoretical and practical communities. Novel codes tailored for networked storage systems have been designed, with different purposes. For instance, in a context where partial recovery may be tolerated, priority random linear codes have been proposed in [32] to offer the property that critical data has a higher chance of surviving node failures than data of less importance. Another point in the code design space is provided by self-repairing codes [34], which have been especially designed to minimize the number of nodes contacted during a repair, thus enabling faster and parallel replenishment of lost redundancy.

In a context where bandwidth is a scarce resource, network coding has been shown to be a promising technique to support the maintenance process. Network coding was initially proposed to improve the throughput utilization of a given network topology [2]. Introduced in distributed storage systems in [9], network coding techniques have been shown to dramatically reduce the maintenance bandwidth. The authors of [9] derived a class of codes, namely regenerating codes, which achieve the optimal tradeoff between storage efficiency and repair bandwidth. In spite of their attractive properties, regenerating codes are mainly studied in an information-theoretic context and lack practical insights. Indeed, this seminal paper provides theoretical bounds on the quantity of data to be transferred during a repair. The computational cost of a random linear implementation of these codes, which is rather high, is evaluated in [13]. Recent advances in this research area are surveyed in [11, 26, 28].

Recently, the authors of [29], [35] and [42] have designed new codes tailored for cloud systems. In [29], the authors proposed a new class of Reed-Solomon codes, namely rotated Reed-Solomon codes, with the purpose of minimizing I/O for recovery and degraded reads. While important for hot data (i.e., data frequently accessed), minimizing I/O is less crucial when storing cold data, as we observed in our experiments. Simple Regenerating Codes, introduced in [35], reduce the maintenance bandwidth while providing exact repairs and a simple XOR implementation.

Table 1. Decoding throughput of various processors for CNC (k=16, 1 GB file size) and corresponding capacity in terms of network.

    Cores  CPU(s)            Throughput  Net. saturated
    4      Xeon E3-1220      8 Gbps      1 Gbps
    6      Xeon E5-2630      12 Gbps     10 Gbps
    24     2x Xeon E5-2695   48 Gbps     40 Gbps
    40     4x Xeon E7-4870   80 Gbps     -

A novel family of codes called Locally Repairable Codes (LRCs) has been proposed in [42]; they also reduce the maintenance bandwidth, by adding additional local parities, while still providing exact repair. Yet this reduction comes at the price of losing optimal storage efficiency. Moreover, an exact repair does not provide the benefits of reintegration. Finally, a new family of codes, called Piggybacked-RS codes, has been proposed in [39]. They are constructed by taking an existing Reed-Solomon code and adding carefully designed functions of one byte-level stripe onto the parities of other byte-level stripes. They reduce the maintenance bandwidth by 30% while still preserving the MDS storage efficiency of Reed-Solomon codes.

Some other recent works [22, 23] aim to bring network coding into practical systems. The code design presented in [23] is not MDS, thus consuming more storage space. The codes in [22] handle a single failure; they target a maintenance framework that operates over a cloud of clouds. Additionally, CNC codes do not require splitting blocks further, contrary to the F-MSR codes in [22], and thus avoid the corresponding increase in coding/decoding costs. Finally, despite the probabilistic nature of both types of codes, the repair process of [22] has a significant probability of data loss while operating. This is due to the impossibility of combining data blocks directly within the cloud machines. Specific mechanisms have to be implemented to ensure data integrity in the long term, which adds design complexity to the overall proposal. One significant convergence point of the two approaches is that both codes are non-systematic, arguing for the possibility of bandwidth gains in archival systems.

7. Conclusion

While erasure codes, typically Reed-Solomon, have been widely acknowledged as a sound alternative to plain replication in the context of reliable distributed archival systems, they suffer from high costs, bandwidth- and computation-wise, upon node repair. In this paper, we address this issue and provide a novel code-based system offering high reliability and efficient maintenance for practical archival systems. The originality of our approach, CNC, stems from a cluster-based placement strategy, assigning a set of files to a specific cluster of nodes, combined with the use of random codes and network coding at the granularity of several files. CNC leverages network coding and the co-location of blocks of several files to encode files together during the repair. This significantly decreases the bandwidth required during repair, avoids file decoding, and provides useful node reintegration. We provide a theoretical analysis of CNC. We also implemented CNC and deployed it on a testbed. Our evaluation shows a 50% improvement of CNC over Reed-Solomon-based approaches with respect to bandwidth consumption and repair time; the price to pay is a moderately higher CPU utilization, as a single core of a modern processor is sufficient for handling transfers on a 1 Gbps network.

Also, the impact of chunking files due to the use of a non-systematic code remains limited for infrequently accessed data. We have shown that, in our setting (1 Gbps network, Xeon E5520), the impact of CNC on cold data is a limited CPU usage without throughput loss. As CNC reduces maintenance-related costs, it is particularly suited to cold data storage such as archival systems.

8. Acknowledgments

We thank the anonymous reviewers and our shepherd Lidong Zhou for their useful comments. We are particularly grateful to Lidong Zhou for his great help in improving the experimental contribution of this paper. We thank Ahmed Oulabas for his contribution to the CNC coding library.

This study was partially funded by the ODISEA collaborative project from the System@tic and Images & Réseaux clusters. Experiments presented in this paper were carried out using the Grid5000 experimental testbed, developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies.

References

[1] S. Acedanski, S. Deb, M. Médard, and R. Koetter. How good is random linear coding based distributed networked storage? In NetCod, 2005.

[2] R. Ahlswede, N. Cai, S.-Y. Li, and R. Yeung. Network Information Flow. IEEE Transactions on Information Theory, 46:1204–1216, 2000.

[3] F. André, A.-M. Kermarrec, E. Le Merrer, N. Le Scouarnec, G. Straub, and A. van Kempen. Archiving Cold Data in Warehouses with Clustered Network Coding. arXiv:1206.4175.

[4] R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker. Total recall: system support for automated availability management. In NSDI, 2004.

[5] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M. F. ul Haq, M. I. ul Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, and L. Rigas. Windows Azure Storage: a highly available cloud storage service with strong consistency. In SOSP, 2011.

[6] S. Caron, F. Giroire, D. Mazauric, J. Monteiro, and S. Pérennès. Data life time for different placement policies in P2P storage systems. In Globe, 2010.

[7] B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, F. Kaashoek, J. Kubiatowicz, and R. Morris. Efficient Replica Maintenance for Distributed Storage Systems. In NSDI, 2006.

[8] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In SOSP, 2001.

[9] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. O. Wainwright, and K. Ramchandran. Network Coding for Distributed Storage Systems. In INFOCOM, 2007.

[10] A. G. Dimakis, V. Prabhakaran, and K. Ramchandran. Decentralized Erasure Codes for Distributed Networked Storage. In Joint special issue, IEEE/ACM Transactions on Networking and IEEE Transactions on Information Theory, 2006.

[11] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh. A Survey on Network Codes for Distributed Storage. Proceedings of the IEEE, 99:476–489, 2010.

[12] A. Duminuco and E. Biersack. Hierarchical Codes: How to Make Erasure Codes Attractive for Peer-to-Peer Storage Systems. In P2P, 2008.

[13] A. Duminuco and E. Biersack. A Practical Study of Regenerating Codes for Peer-to-Peer Backup Systems. In ICDCS, 2009.

[14] A. Duminuco, E. Biersack, and T. En-Najjary. Proactive replication in distributed storage systems using machine availability estimation. In CoNEXT, 2007.

[15] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In OSDI, 2010.

[16] A. Gharaibeh and M. Ripeanu. Exploring data reliability tradeoffs in replicated storage systems. In HPDC, 2009.

[17] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP, 2003.

[18] C. Gkantsidis and P. Rodriguez. Network Coding for Large Scale Content Distribution. In INFOCOM, 2005.

[19] Glacier. http://aws.amazon.com/fr/glacier/.

[20] P. B. Godfrey, S. Shenker, and I. Stoica. Minimizing Churn in Distributed Systems. In SIGCOMM, 2006.

[21] Grid5000. https://www.grid5000.fr/.

[22] Y. Hu, H. C. H. Chen, P. P. C. Lee, and Y. Tang. NCCloud: Applying Network Coding for the Storage Repair in a Cloud-of-Clouds. In FAST, 2012.

[23] Y. Hu, C.-M. Yu, Y. K. Li, P. Lee, and J. Lui. NCFS: On the Practicality and Extensibility of a Network-Coding-Based Distributed File System. In NetCod, 2011.

[24] C. Huang, M. Chen, and J. Li. Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems. In NCA, 2007.

[25] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure coding in Windows Azure Storage. In USENIX ATC, 2012.

[26] S. Jiekak, A.-M. Kermarrec, N. Le Scouarnec, G. Straub, and A. Van Kempen. Regenerating Codes: A System Perspective. ACM SIGOPS Operating Systems Review, 47:23–32, 2013.

[27] A. Kermarrec, E. Le Merrer, G. Straub, and A. Van Kempen. Availability-Based Methods for Distributed Storage Systems. In SRDS, 2012.

[28] A. Kermarrec, N. Le Scouarnec, and G. Straub. Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes. arXiv:1102.0204 (updated September 2013).

[29] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. In FAST, 2012.

[30] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: an architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190–201, 2000.

[31] H.-Y. Lin and W.-G. Tzeng. A Secure Erasure Code-Based Cloud Storage System with Secure Data Forwarding. IEEE Transactions on Parallel and Distributed Systems, 2012.

[32] Y. Lin, B. Liang, and B. Li. Priority Random Linear Codes in Distributed Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 20(11):1653–1667, 2009.

[33] M. Martalò, M. Picone, M. Amoretti, G. Ferrari, and R. Raheli. Randomized network coding in distributed storage systems with layered overlay. In ITA, 2011.

[34] F. E. Oggier and A. Datta. Self-repairing homomorphic codes for distributed storage systems. In INFOCOM, 2011.

[35] D. S. Papailiopoulos, J. Luo, A. G. Dimakis, C. Huang, and J. Li. Simple Regenerating Codes: Network Coding for Cloud Storage. In INFOCOM, 2012.

[36] J. S. Plank, K. Greenan, and E. L. Miller. Screaming Fast Galois Field Arithmetic Using Intel SIMD Extensions. In FAST, 2013.

[37] J. S. Plank, J. Luo, C. D. Schuman, L. Xu, and Z. Wilcox-O'Hearn. A performance evaluation and examination of open-source erasure coding libraries for storage. In FAST, 2009.

[38] J. S. Plank, S. Simmerman, and C. D. Schuman. Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications - Version 1.2A. University of Tennessee, CS-08-627, 2008.

[39] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster. In HotStorage, 2013.

[40] R. Rodrigues and B. Liskov. High Availability in DHTs: Erasure Coding vs. Replication. In IPTPS, 2005.

[41] A. I. T. Rowstron and P. Druschel. Storage Management and Caching in PAST, A Large-scale, Persistent Peer-to-peer Storage Utility. In SOSP, 2001.

[42] M. Sathiamoorthy, M. Asteris, D. S. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In VLDB, 2013.

[43] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In MSST, 2010.

[44] K. Tati and G. M. Voelker. On Object Maintenance in Peer-to-Peer Systems. In IPTPS, 2006.

[45] K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In SoCC, 2010.

[46] H. Weatherspoon and J. Kubiatowicz. Erasure Coding Vs. Replication: A Quantitative Comparison. In IPTPS, 2002.