Transcript of On the Delay-Storage Trade-Off in Content Download from Coded Distributed Storage Systems


IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 32, NO. 5, MAY 2014 989

On the Delay-Storage Trade-Off in Content Download from Coded Distributed Storage Systems

Gauri Joshi, Yanpei Liu, Student Member, IEEE, and Emina Soljanin, Fellow, IEEE

Abstract—We study how coding in distributed storage reduces expected download time, in addition to providing reliability against disk failures. The expected download time is reduced because when a content file is encoded with redundancy and distributed across multiple disks, reading only a subset of the disks is sufficient for content reconstruction. For the same total storage used, coding exploits the diversity in storage better than simple replication, and hence gives faster download. We use a novel fork-join queueing framework to model multiple users requesting the content simultaneously, and derive bounds on the expected download time. Our system model and results are a novel generalization of the fork-join system that is studied in queueing theory literature. Our results demonstrate the fundamental trade-off between the expected download time and the amount of storage space. This trade-off can be used to design the amount of redundancy required to meet the delay constraints on content delivery.

Index Terms—distributed storage, fork-join queues, MDS codes

I. INTRODUCTION

LARGE-SCALE cloud storage and distributed file systems such as Amazon Elastic Block Store (EBS) [1] and Google File System (GoogleFS) [2] have become the backbone of many applications, e.g., searching, e-commerce, and cluster computing. In these distributed storage systems, the content files stored on a set of disks may be simultaneously requested by multiple users. The users have two major demands: reliable storage and fast content download. Content download time includes the time taken for a user to compete with the other users for access to the disks, and the time to acquire the data from the disks. Fast content download is important for delay-sensitive applications such as video streaming, VoIP, as well as collaborative tools like Dropbox [3] and Google Docs [4].

In large-scale distributed storage systems, disk failures are the norm and not the exception [2]. To protect the data from disk failures, cloud storage providers today simply replicate content throughout the storage network over multiple disks. In addition to fault tolerance, replication makes the content quickly accessible, since multiple users requesting a content can be directed to different replicas. However, replication consumes a large amount of storage space. In data centers that process massive data, using more storage space implies higher expenditure on electricity, maintenance and repair, as well as the cost of leasing physical space.

Manuscript received May 15, 2013; revised October 1, 2013, November 5, 2013, and December 10, 2013. This work was presented in part at the 50th Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, October 2012.

G. Joshi is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA (e-mail: [email protected]).

Y. Liu is with the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI (e-mail: [email protected]).

E. Soljanin is with Bell Labs, Alcatel-Lucent, Murray Hill, NJ (e-mail: [email protected]).

Digital Object Identifier 10.1109/JSAC.2014.140518.

Coding, which was originally developed for reliable communication in the presence of noise, offers a more efficient way to store data in distributed systems. The main idea behind coding is to add redundancy so that a content, stored on a set of disks, can be reconstructed by reading a subset of these disks. Previous work shows that coding can achieve the same reliability against failures with less storage space. It also allows efficient replacement of disks that have to be removed due to failure or maintenance. We show that in addition to reliability and easy repair, coding also gives faster content download, because we only have to wait for content download from a subset of the disks. Some preliminary results on the analysis of download time via queueing-theoretic modeling are presented in [5].

A. Previous Work

Research in coding for distributed storage was galvanized by the results reported in [6]. Prior to that work, literature on distributed storage recognized that, when compared with replication, coding can offer huge storage savings for the same reliability levels. But it was also argued that the benefits of coding are limited, and are outweighed by certain disadvantages and extra complexity. Namely, to provide reliability in multi-disk storage systems, when some disks fail, it must be possible to restore either the exact lost data or an equivalent level of reliability with minimal download from the remaining storage. This problem of efficient recovery from disk failures was addressed in some early work [7]. But in general, the cost of repair regeneration was considered much higher in coded than in replicated systems [8], until [6] established the existence and advantages of new regenerating codes. This work was quickly followed up, and the area is very active today (see, e.g., [9]–[11] and references therein).

Only recently [12]–[14] was it realized that, in addition to reliability, coding can guarantee the same level of content accessibility with lower storage than replication. In [12], a scenario is considered in which all but one of multiple simultaneous requests are blocked, and accessibility is measured in terms of the blocking probability. In [13], multiple requests are placed in a queue instead of being blocked, and the authors propose a scheduling scheme that maps requests to servers (or disks) to minimize the waiting time. In [14], the authors give a combinatorial proof that flooding requests to all disks, instead of a subset of them, gives the fastest download time. This result corroborates the system model we consider in this paper to model the distributed storage system and analyze its download time.

0733-8716/14/$31.00 © 2014 IEEE

Using redundancy in coding for delay reduction has also been studied in the context of packet transmission in [15]–[17], and in some content retrieval scenarios [18], [19]. Although they share a common spirit with our work, these studies do not consider the effect of queueing of requests in coded distributed storage systems.

B. Our Contributions

We show that coding allows fast content download in addition to reliable storage. When the content is coded and distributed over multiple disks, it is sufficient to read it only from a subset of these disks. We take a queueing-theoretic approach to study how coding the content in this way provides diversity in storage and achieves a significant reduction in the download time. The analysis of download time leads us to an interesting trade-off between download time and storage space, which can be used to design the optimal level of redundancy in a distributed storage system. To the best of our knowledge, we are the first to propose the (n, k) fork-join system and find bounds on its mean response time, a novel generalization of the (n, n) fork-join system studied in the queueing theory literature.

We consider that requests entering the system are assigned to multiple disks, where they enter local queues waiting for disk access. Note that this is in contrast to some existing works (e.g., [13], [14]) where requests wait in a centralized queue when all disks are busy. Our approach of immediate dispatching of requests to local queues is used by most server farms to facilitate fast acknowledgement responses to customers [20]. Under this queueing model, we propose the (n, k) fork-join system, where each request is forked to the n disks that store the coded content, and it exits the system when any k (k ≤ n) disks are read. The (n, n) fork-join system, in which all n disks have to be read, has been extensively studied in the queueing theory and operations research literature [21]–[23]. Our analysis of download time can be seen as a generalization of the analysis of the (n, n) fork-join system.

The rest of the paper is organized as follows. In Sec. II, we analyze the expected download time of the (n, k) fork-join system and present the fundamental trade-off between expected download time and storage. These results were presented in part in [5]. In Sec. III, we relax some simplifying assumptions, and present the delay-storage trade-off while considering practical issues such as heavy-tailed and correlated service times of the disks. In Sec. IV, we extend the analysis to distributed systems with a large number of disks. Such systems can be divided into groups of n disks each, where each group is an independent (n, k) fork-join system. Finally, Sec. V concludes the paper and gives future research directions.

II. THE (n, k) FORK-JOIN SYSTEM

We consider the scenario in which users attempt to download the content from the distributed storage system, and their requests are placed in a queue at each disk. In Sec. II-A we propose the (n, k) fork-join system to model the queueing of download requests, and derive theoretical bounds on the expected download time in Sec. II-B. This analysis leads us to the fundamental trade-off between download time and storage, which provides insights into practical system design. Numerical and simulation results demonstrating this trade-off are presented in Sec. II-C.

Fig. 1. Illustration of the (3, 2) fork-join system. Since 2 out of 3 tasks of Job 1 are served, the third task abandons its queue and the job exits the system. Job 2 has to wait for one more task to be served.

A. System Model

Consider that a content F of unit size is divided into k blocks of equal size. It is encoded into n ≥ k blocks using an (n, k) maximum distance separable (MDS) code, and the coded blocks are stored on an array of n disks. MDS codes have the property that any k out of the n blocks are sufficient to reconstruct the entire file. MDS codes have been suggested to provide reliability against disk failures. In this paper we show that, in addition to error correction, we can exploit these codes to reduce the download time of the content.

The encoded blocks are stored on n different disks (one block per disk). Each incoming request is sent to all n disks, and the content can be recovered when any k out of the n blocks are successfully downloaded. We model the queueing of download requests at the disks using the (n, k) fork-join system, which is defined as follows.

Definition 1 ((n, k) fork-join system). An (n, k) fork-join system consists of n nodes. Every arriving job is divided into n tasks which enter first-come first-served queues at each of the n nodes. The job departs the system when any k out of its n tasks are served by their respective nodes. The remaining n − k tasks abandon their queues and exit the system before completion of service.

The (n, n) fork-join system, known in the literature as the fork-join queue, has been extensively studied in, e.g., [21]–[23]. However, to the best of our knowledge, the (n, k) generalization in Definition 1 above has not been previously studied.

We consider that the arrival of download requests is Poisson with rate λ. Every request is forked to the n disks. The time taken to download one unit of data is exponential with mean 1/μ. Since each disk stores 1/k units of data, we consider the service time of each node to be exponentially distributed with mean 1/μ′, where μ′ = kμ. Define the load factor ρ′ ≜ λ/μ′. This model, with an M/M/1 queue at every disk, is sometimes referred to as the Flatto-Hahn-Wright (FHW) model [24], [25] in the fork-join queue literature. While most of our analytical results in Section II and Section IV are for the FHW model, we use simulations to study systems with M/G/1 queues at the disks in Section III.
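The (n, k) fork-join system under this model can also be simulated directly with a discrete-event loop. The following is a minimal sketch (all function and variable names are ours, not from the paper); following Definition 1, we assume a departing job's remaining tasks abandon immediately, even if already in service:

```python
import heapq
import random

def sim_fork_join(n, k, lam, mu_prime, num_jobs=20000, seed=1):
    """Estimate the mean response time T(n,k): Poisson(lam) job arrivals,
    each job forked into n tasks (one per disk), i.i.d. Exp(mu_prime)
    service, FCFS queues; a job departs when k tasks finish, and its
    remaining tasks abandon at once (even mid-service)."""
    rng = random.Random(seed)
    queues = [[] for _ in range(n)]   # waiting job ids per disk
    serving = [None] * n              # job id currently in service per disk
    epoch = [0] * n                   # invalidates stale completion events
    finished, arrival, departed = {}, {}, set()
    resp_sum, resp_cnt, next_id = 0.0, 0, 0
    events = [(rng.expovariate(lam), 'arr', 0, 0)]  # (time, kind, disk, epoch)

    def start_next(d, now):
        while queues[d] and queues[d][0] in departed:
            queues[d].pop(0)          # lazily drop abandoned waiting tasks
        serving[d] = queues[d].pop(0) if queues[d] else None
        if serving[d] is not None:
            heapq.heappush(
                events, (now + rng.expovariate(mu_prime), 'fin', d, epoch[d]))

    while resp_cnt < num_jobs:
        now, kind, d, tag = heapq.heappop(events)
        if kind == 'arr':
            j, next_id = next_id, next_id + 1
            arrival[j], finished[j] = now, 0
            for disk in range(n):     # fork the job to all n disks
                queues[disk].append(j)
                if serving[disk] is None:
                    start_next(disk, now)
            heapq.heappush(events, (now + rng.expovariate(lam), 'arr', 0, 0))
        elif tag == epoch[d]:         # still-valid task completion on disk d
            j = serving[d]
            finished[j] += 1
            epoch[d] += 1
            serving[d] = None
            if finished[j] == k:      # job departs; other n - k tasks abandon
                departed.add(j)
                resp_sum += now - arrival[j]
                resp_cnt += 1
                for disk in range(n):
                    if serving[disk] == j:   # abort the in-service task
                        epoch[disk] += 1
                        start_next(disk, now)
            start_next(d, now)
    return resp_sum / resp_cnt
```

For k = 1 this estimate should agree with the exact value T(n,1) = 1/(nμ − λ) derived later in this section.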

For the (n, n) fork-join system to be stable, [26] shows that the arrival rate λ must be less than μ′, the service rate of a node, which in our (n, n) system equals nμ. In Lemma 1 below, we show that λ < nμ is also a necessary condition for the stability of the (n, k) fork-join system for any 1 ≤ k ≤ n.

Example 1. Let content F be split into two equal blocks a and b, and stored on 3 disks as a, b, and a ⊕ b, the exclusive-or of blocks a and b. Thus each disk stores content of half the size of file F, and downloads from any 2 disks jointly enable reconstruction of F. Fig. 1 illustrates this (3, 2) fork-join system. Each download request, or job, is forked to the 3 nodes. When 2 out of its 3 tasks are served, the third task abandons its queue and the job exits the system. For example, Job 1 is about to exit the system, while Job 2 is waiting for one more task to be served. The letters F and J denote the fork and join operations, respectively.
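The (3, 2) MDS code of Example 1 can be written out concretely. A small sketch (function names are ours), assuming the two blocks are equal-length byte strings:

```python
def xor_blocks(u, v):
    """Bytewise exclusive-or of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(u, v))

def encode_3_2(a, b):
    """(3, 2) MDS code of Example 1: disks store a, b, and a XOR b."""
    return [a, b, xor_blocks(a, b)]

def decode_any_2(disks, i, j):
    """Reconstruct (a, b) from the blocks on any two disks i and j."""
    have = {i: disks[i], j: disks[j]}
    if 0 in have and 1 in have:
        return have[0], have[1]
    if 0 in have:                                  # have a and a XOR b
        return have[0], xor_blocks(have[0], have[2])
    return xor_blocks(have[1], have[2]), have[1]   # have b and a XOR b
```

Any two of the three disks suffice, which is exactly the MDS property exploited by the (3, 2) fork-join system.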

Lemma 1 (Stability of the (n, k) fork-join system). For the (n, k) fork-join system to be stable, the rate of Poisson arrivals λ and the service rate μ′ = kμ per node must satisfy

λ < nμ′/k = nμ.  (1)

Proof. Tasks arrive at each queue at rate λ and are served at rate μ′ = kμ. But when k out of the n tasks finish service, the remaining n − k tasks abandon their queues. A task can be one of the abandoning tasks with probability (n − k)/n. Hence the effective arrival rate to each queue is λ minus the rate of abandonment λ(n − k)/n. The condition for stability of each queue is then

λ − λ(n − k)/n < μ′,  (2)

which reduces to (1).
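Condition (1) is easy to check programmatically; a one-line sketch (the function name is ours):

```python
def is_stable(n, k, lam, mu):
    """Lemma 1: with per-node service rate mu' = k*mu, the effective
    per-queue arrival rate lam*k/n must stay below mu', i.e. lam < n*mu."""
    return lam * (k / n) < k * mu
```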

B. Bounds on the Mean Response Time

Our objective is to determine the expected download time, which we refer to as the mean response time T(n,k) of the (n, k) fork-join system. It is the expected time that a job spends in the system, from its arrival until k out of n of its tasks are served by their respective nodes. Previous works [21]–[23] have studied T(n,n), but it has not been found in closed form – only bounds are known. An exact expression for the mean response time is known only for the (2, 2) fork-join system [22].

Since the n tasks are served by independent M/M/1 queues, intuition suggests that T(n,k) is given by the expected value of the kth order statistic of the n service times. However, this is not true, which makes the analysis of T(n,k) challenging. The reason the order-statistics approach does not work is that when j nodes (j < n) finish serving their tasks, they can start serving the tasks of the next job (cf. Fig. 1). As a result, the service time of a job depends on the departure times of previous jobs.

We now present upper and lower bounds on the mean response time T(n,k). The numerical results in Section II-C show that these bounds are fairly tight.

Theorem 1 (Upper Bound on Mean Response Time). The mean response time T(n,k) of an (n, k) fork-join system satisfies

T(n,k) ≤ (Hn − Hn−k)/μ′ + λ[(Hn2 − H(n−k)2) + (Hn − Hn−k)²] / (2μ′²[1 − ρ′(Hn − Hn−k)]),  (3)

where λ is the request arrival rate, μ′ is the service rate at each queue, ρ′ = λ/μ′ is the load factor, and the generalized harmonic numbers Hn and Hn2 are defined by

Hn = Σ_{j=1}^{n} 1/j  and  Hn2 = Σ_{j=1}^{n} 1/j².  (4)

The bound is valid only when ρ′(Hn − Hn−k) < 1.

Proof. To find this upper bound, we use an easier-to-analyse model called the split-merge system. In the (n, k) fork-join queueing model, after a node serves a task, it can start serving the next task in its queue. On the contrary, in the split-merge model, the n nodes are blocked until k of them finish service. Thus, the job departs all the queues at the same time. Due to this blocking of nodes, the mean response time of the (n, k) split-merge model is an upper bound on (and a pessimistic estimate of) T(n,k) for the (n, k) fork-join system.

The (n, k) split-merge system is equivalent to an M/G/1 queue whose service time is a random variable S distributed according to the kth order statistic of the exponential distribution. The mean and variance of S are

E[S] = (Hn − Hn−k)/μ′  and  V[S] = (Hn2 − H(n−k)2)/μ′².  (5)

The Pollaczek-Khinchin formula [27] gives the mean response time T of an M/G/1 queue in terms of this mean and variance:

T = E[S] + λ(E[S]² + V[S]) / (2(1 − λE[S])).  (6)

Substituting (5) into (6) gives the upper bound (3). Note that the Pollaczek-Khinchin formula is valid only when 1/λ > E[S], the stability condition of the M/G/1 queue. Since E[S] increases with k, there exists a k0 such that the M/G/1 queue is unstable for all k ≥ k0. The inequality 1/λ > E[S] can be simplified to ρ′(Hn − Hn−k) < 1, which is the condition for validity of the upper bound given in Theorem 1.
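The bound of Theorem 1 is straightforward to evaluate via the generalized harmonic numbers and the Pollaczek-Khinchin formula; a sketch (function names are ours):

```python
def harmonic(n, power=1):
    """Generalized harmonic number: sum of 1/j**power for j = 1..n."""
    return sum(1.0 / j ** power for j in range(1, n + 1))

def t_upper(n, k, lam, mu_p):
    """Upper bound (3) on T(n,k): mean response time of the (n, k)
    split-merge system, an M/G/1 queue whose service time S is the
    k-th order statistic of n i.i.d. Exp(mu_p) variables."""
    es = (harmonic(n) - harmonic(n - k)) / mu_p              # E[S] from (5)
    vs = (harmonic(n, 2) - harmonic(n - k, 2)) / mu_p ** 2   # V[S] from (5)
    if lam * es >= 1.0:
        raise ValueError("bound invalid: rho' * (Hn - Hn-k) >= 1")
    return es + lam * (es ** 2 + vs) / (2.0 * (1.0 - lam * es))  # P-K (6)
```

For k = 1 and μ′ = μ this reproduces the exact value T(n,1) = 1/(nμ − λ) given in Corollary 1 below.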

Remark 1. For the (n, n) fork-join system, the authors in [22] find an upper bound on the mean response time different from (3) derived above. To find their bound, they first prove that the response times of the n queues form a set of associated random variables [28]. Then they use the property of associated random variables that their expected maximum is less than that of independent variables with the same marginal distributions. However, this approach cannot be extended to the (n, k) fork-join system with k < n, because this property of associated variables does not hold for the kth order statistic when k < n.

As a corollary to Theorem 1 above, we can get an exact expression for T(n,1), the mean response time of the (n, 1) fork-join system. Recall that in the (n, 1) fork-join system, the entire content is replicated on n disks, and we just have to wait for any one disk to serve the incoming request.

Corollary 1. The mean response time T(n,1) of the (n, 1) fork-join system is given by

T(n,1) = 1/(nμ − λ).  (7)

Proof. In Theorem 1 we constructed the (n, k) split-merge system, which always has a worse response time than the corresponding (n, k) fork-join system. For the special case k = 1, the split-merge system is equivalent to the fork-join system and gives the same response time. Substituting k = 1 and μ′ = kμ = μ in (5) and (6), we get the result (7).

Theorem 2 (Lower Bound on Mean Response Time). The mean response time T(n,k) of an (n, k) fork-join queueing system satisfies

T(n,k) ≥ (1/μ′)[Hn − Hn−k + ρ′(Hn(n−ρ′) − H(n−k)(n−k−ρ′))],  (8)

where λ is the request arrival rate, μ′ is the service rate at each queue, ρ′ = λ/μ′ is the load factor, and the generalized harmonic number Hn(n−ρ′) is given by

Hn(n−ρ′) = Σ_{j=1}^{n} 1/(j(j − ρ′)).

Proof. The lower bound in (8) is a generalization of the bound for the (n, n) fork-join system derived in [23]. The bound for the (n, n) system is derived by considering that a job goes through n stages of processing. A job is said to be in the jth stage if j out of its n tasks have been served by their respective nodes, for 0 ≤ j ≤ n − 1. The job waits for the remaining n − j tasks to be served, after which it departs the system. For the (n, k) fork-join system, since we only need k tasks to finish service, each job goes through k stages of processing. In the jth stage, where 0 ≤ j ≤ k − 1, j tasks have been served and the job departs when k − j more tasks finish service.

We now show that the service rate of a job in the jth stage of processing is at most (n − j)μ′. Consider two jobs B1 and B2 in the ith and jth stages of processing, respectively. Let i > j, that is, B1 has completed more tasks than B2. Job B2 moves to the (j + 1)th stage when one of its n − j remaining tasks completes. If all these tasks are at the heads of their respective queues, the service rate for job B2 is exactly (n − j)μ′. However, since i > j, B1's task could be ahead of B2's in one of the n − j pending queues, due to which that task of B2 cannot be immediately served. Hence, the service rate of a job in the jth stage of processing is at most (n − j)μ′.

Thus, the time for a job to move from the jth to the (j + 1)th stage is lower bounded by 1/((n − j)μ′ − λ), the mean response time of an M/M/1 queue with arrival rate λ and service rate (n − j)μ′. The total mean response time is the sum of the mean response times of each of the k stages of processing, and is bounded below as

T(n,k) ≥ Σ_{j=0}^{k−1} 1/((n − j)μ′ − λ)
       = (1/μ′) Σ_{j=0}^{k−1} 1/((n − j) − ρ′)
       = (1/μ′) Σ_{j=0}^{k−1} [1/(n − j) + ρ′/((n − j)(n − j − ρ′))]
       = (1/μ′)[Hn − Hn−k + ρ′(Hn(n−ρ′) − H(n−k)(n−k−ρ′))].

Fig. 2. Behavior of the mean response time T(10,k) for λ = 1 and μ = 1 as k increases (and storage n/k decreases), showing the simulated T(10,k), the upper bound, the lower bound, and the required storage. The plot shows that the bounds on mean response time given by (3) and (8) are tight for small k and become loose as k increases.

Hence, we have found lower and upper bounds on the mean response time T(n,k). In Fig. 2 we demonstrate the tightness of the bounds for a (10, k) fork-join system with λ = 1 and μ = 1. Note that the upper bound for k = n shown in the plot is T(n,n) ≤ Hn/(nμ − λ), as given in [22], instead of the bound in (3). The reason behind this substitution is explained in Remark 1.

We observe in Fig. 2 that the bounds become loose as k increases. In particular, the upper bound becomes loose because the blocking of queues in the split-merge system becomes significant as k increases. Similarly, the lower bound becomes loose with increasing k because the difference between the actual service rate in the jth stage of processing and its bound (n − j)μ′ increases. When k = 1, the bounds coincide and give T(n,1) = 1/(nμ − λ). For the same reasons, the bounds also become loose when μ decreases.
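The lower bound of Theorem 2 can be computed either from the closed form (8) or, equivalently, directly as the sum over the k stages used in its proof; a sketch (the function name is ours):

```python
def t_lower(n, k, lam, mu_p):
    """Lower bound (8) on T(n,k): the move from stage j to stage j+1 is
    bounded below by the mean response time 1/((n-j)*mu' - lam) of an
    M/M/1 queue with arrival rate lam and service rate (n-j)*mu'."""
    assert lam < (n - k + 1) * mu_p, "every stage's M/M/1 queue must be stable"
    return sum(1.0 / ((n - j) * mu_p - lam) for j in range(k))
```

For k = 1 this again coincides with the exact value T(n,1) = 1/(nμ − λ).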

C. Download Time vs. Storage Space Trade-off

We next present numerical results demonstrating the fundamental trade-off between storage and response time of the (n, k) fork-join system. We also compare the response time of the (n, k) system to that of the power-of-d assignment policy.

The expected download time of the file can be reduced in two ways: 1) by increasing the total storage n/k per file, and 2) by increasing the number n of disks used for file storage. Both the total storage and the number of disks could be a limiting factor in practice. We first address the scenario where the number of disks n is kept constant, but the storage expansion n/k changes from 1 to n as we choose k from n down to 1. We then study the scenario where the storage expansion factor n/k is kept constant, but the number of disks varies.

Fig. 3. Upper and lower bounds on the mean response time T(n,n/2) versus n (with n = 2k) for λ = 1 and three different service rates, μ = 5, μ = 1, and μ = 0.51. Due to the diversity advantage of more disks, T(n,n/2) reduces with n.

1) Flexible Storage Expansion & Fixed Number of Disks: Fig. 2 is a plot of the mean response time versus k for a fixed number of disks n. Note that as we increase k, the total storage n/k decreases. When we increase k, two factors affect the mean response time in opposite ways: 1) as k increases, the storage per disk reduces, which reduces the mean response time; 2) with higher k, we have to wait for more nodes to finish service before the job exits the system. Hence we lose the diversity benefit of coding, which results in increased mean response time.

In Fig. 2 we observe that the second factor dominates, causing T(n,k) to strictly increase with k. However, if the service rate μ is lowered, the per-node service time 1/kμ becomes large at small k. This can cause the mean response time to first decrease, and then increase with k.

2) Flexible Number of Disks & Fixed Storage Expansion: Next, we take a different viewpoint and analyze the benefit of spreading the content across more disks while using the same total storage space. Fig. 3 plots the bounds (3) and (8) on the mean response time T(n,k) versus n while keeping the code rate k/n = 1/2 constant, for the (n, k) fork-join system with λ = 1 and three different values of μ. For these parameter values the bounds are tight, and can be used for analysis in place of simulations.

We observe that the mean response time T(n,k) reduces as n (and hence k) increases, because we get the diversity advantage of having more disks. The reduction in T(n,k) happens at a higher rate for small values of k and μ. For heavy-tailed service distributions (e.g., Pareto, cf. Sec. III), the benefit that comes from diversity is even larger.

T(n,k) approaches zero as n → ∞ for a fixed storage expansion n/k. This is because we assumed that the service rate of a single disk is kμ, since 1/k units of the content F are stored on each disk. However, in practice the mean service time 1/kμ will not go to zero, as reading a disk involves some non-zero setup time to complete each task, irrespective of the amount of data read from the disk.

Fig. 4. Mean response time versus 1/μ, the average time to download one unit of content, for the (10, 1) and (20, 2) fork-join systems and the power-of-2 and power-of-10 (LWL) job assignment policies. For λ = 1 and the same amount of total storage used (10 units), the fork-join system has a lower mean response time than the corresponding power-of-d and LWL assignment policies.

3) Comparison with Power-of-d Assignment: In our distributed storage model, we divide the content into k blocks and use 1/k units of space on each disk; hence the total storage space used is n/k units. This is unlike conventional replication-based storage solutions, where n entire copies of the content are stored on the n disks. In such systems, each incoming request can be assigned to any of the n disks. One such assignment policy is the power-of-d assignment [20], [29]. For each incoming request, power-of-d job assignment uniformly selects d nodes (d ≤ n) and sends the request to the node with the least work left among the d nodes. The work left at a node can be the expected time for that node to become empty when there are no new arrivals, or simply the number of jobs queued. When d = n, power-of-d reduces to the least-work-left (LWL) policy.
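The power-of-d selection step itself is simple; a minimal sketch (names are ours), where work_left[i] is the current work estimate at node i:

```python
import random

def power_of_d(work_left, d, rng=random):
    """Sample d distinct nodes uniformly at random and return the index
    of the sampled node with the least work left. With d equal to the
    number of nodes, this is the least-work-left (LWL) policy."""
    candidates = rng.sample(range(len(work_left)), d)
    return min(candidates, key=lambda node: work_left[node])
```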

Fig. 4 is a plot of the mean response time versus 1/μ, the average time taken to download one unit of content. All the systems shown in Fig. 4 use the same total storage space of n/k = 10 units. We observe in Fig. 4 that the fork-join system outperforms the power-of-d and LWL assignment policies. This is because, as we saw in Fig. 3, when we increase n and k while keeping the ratio n/k (the total storage) fixed, the mean response time of the (n, k) fork-join system decreases. That is, the diversity advantage dominates the slowdown due to waiting for more nodes to finish service. Thus, for large enough n, the (n, k) fork-join system outperforms the corresponding power-of-d scheme that uses the same storage space of n/k units.

There are other practical issues that are not considered in Fig. 4. For instance, in the (n, k) system there are communication costs associated with forking jobs to n nodes, and costs of decoding the MDS coded blocks. On the other hand, the power-of-d assignment system requires constant feedback from the nodes to determine the work left at each node.

III. GENERALIZING THE SERVICE DISTRIBUTION

The theoretical analysis and numerical results so far assumed a specific service time distribution at each node – we considered the exponential distribution. In this section we


994 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 32, NO. 5, MAY 2014

present some results by generalizing the service time distribution. In Section III-A we extend the upper bound to general service time distributions. We present numerical results for heavy-tailed and correlated service times in Section III-B and Section III-C respectively.

A. General Service Time Distribution

In several practical scenarios the service distribution is unknown. We present an upper bound on the mean response time for such cases, using only the mean and the variance of the service distribution. Let the service times of the n nodes, X1, X2, . . . , Xn, be i.i.d. random variables with E[Xi] = 1/μ′ and V[Xi] = σ² for all i.

Theorem 3 (Upper Bound with General Service Time). The mean response time T(n,k) of an (n, k) fork-join system with general service time X such that E[X] = 1/μ′ and V[X] = σ² satisfies

T(n,k) ≤ γ + λ[γ² + σ²C(n, k)] / (2[1 − λγ]),   (9)

where γ ≜ 1/μ′ + σ√((k − 1)/(n − k + 1)), and C(n, k) is the constant given in the table in [30].

Proof. The proof follows from Theorem 1, where the upper bound can be calculated using the (n, k) split-merge system and the Pollaczek-Khinchin formula (6). However, since we do not have an exact expression for E[S] and V[S], where S is the kth order statistic of the service times X1, X2, . . . , Xn, we use the following upper bounds derived in [31] and [30]:

E[S] ≤ γ = 1/μ′ + σ√((k − 1)/(n − k + 1)),   (10)

V[S] ≤ C(n, k)σ².   (11)

The proofs of (10) and (11) are given in [31] and [30] respectively. The constant C(n, k) can be found in the table in [30]. Holding n constant, C(n, k) decreases as k increases.

Observe that (6) strictly increases as either E[S] or V[S] increases. As a result, we can substitute the upper bounds into it to obtain the upper bound (9) on the mean response time.
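For a quick numerical check, the bound (9) can be evaluated directly once C(n, k) is looked up in the table in [30]; in the sketch below (our own helper, not from the paper) C is treated as a caller-supplied constant:

```python
import math

def general_service_upper_bound(lam, mu_prime, sigma, n, k, C):
    """Upper bound (9): gamma + lam*(gamma^2 + sigma^2*C) / (2*(1 - lam*gamma)),
    with gamma = 1/mu' + sigma*sqrt((k-1)/(n-k+1)) and C = C(n, k) from [30]."""
    gamma = 1.0 / mu_prime + sigma * math.sqrt((k - 1) / (n - k + 1))
    if lam * gamma >= 1:
        raise ValueError("bound requires lam * gamma < 1 (stability)")
    return gamma + lam * (gamma ** 2 + sigma ** 2 * C) / (2 * (1 - lam * gamma))
```

For example, with λ = 1, μ′ = 4, σ = 0.25, k = 1 (so γ = 0.25) and C = 1, the bound evaluates to 1/3.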

Regarding the lower bound, we note that our proof of Theorem 2 cannot be extended to this general service time setting, because the proof requires the memoryless property of the service time, which does not hold in general.

B. Heavy-tailed Service Time

In many practical systems the service time has a heavy-tailed distribution, which means that there is a larger probability of getting very large values. More formally, a random variable X is said to have a heavy-tailed distribution if its tail probability is not exponentially bounded, i.e., lim_{x→∞} e^{βx} Pr(X > x) = ∞ for all β > 0. We consider the Pareto distribution, which has been widely used to model heavy-tailed jobs in the existing literature (see for example [32], [33]). The Pareto distribution is parametrized by a scale parameter xm and a shape parameter α, and its cumulative distribution function is FX(x) = 1 − (xm/x)^α for x ≥ xm, and 0 otherwise. A smaller value of α implies a heavier tail. In particular, when α = ∞ the service time becomes deterministic, and when α ≤ 1 the mean service time becomes infinite. In [32] a Pareto distribution with α = 1.1 was reported for the sizes of files on the web.

Fig. 5. Mean response time T(10,k) of different service time distributions (exponential, and Pareto with α = 1.8, 4, ∞), for λ = 1 and μ = 3. For more heavy-tailed (smaller α) distributions, the increase in mean response time with k becomes dominant since we have to wait for more nodes to finish service.
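The Pareto service times used in Fig. 5 can be sampled by inverting the CDF above; this sketch (the function name is ours) draws X = xm(1 − U)^(−1/α) for U uniform on [0, 1):

```python
import random

def sample_pareto(x_m, alpha, rng):
    """Inverse-CDF sampling: solving F(x) = 1 - (x_m / x)**alpha = u
    gives x = x_m * (1 - u)**(-1/alpha), with 1 - u in (0, 1]."""
    u = rng.random()
    return x_m * (1.0 - u) ** (-1.0 / alpha)
```

For α > 1 the mean is α·xm/(α − 1), e.g. α = 4, xm = 1 gives mean 4/3, while α ≤ 1 gives an infinite mean, matching the heavy-tail discussion above.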

In Fig. 5 we plot the mean response time T(n,k) versus k for n = 10 disks, arrival rate λ = 1, and service rate μ = 3, for the exponential and Pareto service distributions. Each disk stores 1/k units of data, and thus the service rate of each individual queue is μ′ = kμ. For a given k, all distributions have the same mean service time 1/kμ. We observe that as the distribution becomes more heavy-tailed (smaller α), waiting for more nodes (larger k) to finish results in an increase in mean response time which outweighs the decrease caused by the smaller service time 1/kμ. For smaller α, the optimal k decreases because the increase in mean response time for larger k is more dominant.

C. Correlated Service Times

Thus far we have considered that the n tasks of a job have independent service times. We now analyze how correlation between the service times affects the mean response time of the fork-join system. In practice, service times could be correlated because the service time is proportional to the size of the file being downloaded. We model the correlation by taking the service time of each task to be δXd + (1 − δ)Xr,i, a weighted sum of two independent exponential random variables Xd and Xr,i, both with mean 1/kμ. The variable Xd is common across the n queues, and the Xr,i are independent across the queues 1 ≤ i ≤ n. The weight δ represents the degree of correlation between the service times of the n queues. When δ = 0, the system is identical to the original (n, k) fork-join system analyzed in Section II. The mean response time T′(n,k) of the (n, k) fork-join system with the service time distribution described above is

T′(n,k) = δ E[Xd] + (1 − δ)T(n,k)   (12)
        = δ/(kμ) + (1 − δ)T(n,k),


Fig. 6. Mean response time T′(n,k) of (10, k) fork-join systems with correlated service times, for λ = 1, μ = 3, and δ = 0, 0.5, 1. As δ increases, the service times become more correlated and we lose the diversity advantage of coding.

where T(n,k) is the response time with independent exponential service times analyzed in Section II. Fig. 6 shows the trade-off between the mean response time and k for weights δ = 0, 0.5, and 1. When δ = 0, coding provides diversity in this regime and gives a faster response time for smaller k, as we already observed in Fig. 2. As the correlation between service times increases, we lose the diversity advantage provided by coding and no longer get a fast response for small k. Note that for δ = 1 there is no diversity advantage, and the decrease in response time with k is only due to the fact that each disk stores 1/k units of data.
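The correlated service model δXd + (1 − δ)Xr,i is straightforward to sample; the sketch below (the function name is ours) draws one vector of n task service times:

```python
import random

def correlated_service_times(n, k, mu, delta, rng):
    """Service time of task i is delta*Xd + (1-delta)*Xr_i, where Xd is
    common to all n queues and the Xr_i are independent, both Exp(k*mu)."""
    x_d = rng.expovariate(k * mu)  # shared (correlated) component
    return [delta * x_d + (1.0 - delta) * rng.expovariate(k * mu)
            for _ in range(n)]
```

δ = 1 makes all n task times identical (no diversity), while δ = 0 recovers the independent model of Section II; either way each task time has mean 1/kμ.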

IV. THE (m,n, k) FORK-JOIN SYSTEM

In a distributed storage system with a large number of disks m, having an (m, k) fork-join system would involve a large signaling overhead of forking each request to all m disks, and high decoding complexity. The decoding complexity is high even for small k because it depends on the field size, which is a function of m in standard codes such as Reed-Solomon codes. Hence, we propose a system where we divide the m disks into g = m/n groups of n disks each, which act as independent (n, k) fork-join systems. In Section IV-A we give the system model and analyze the mean response time of the (m,n, k) fork-join system. In Section IV-B we present numerical results comparing the mean response time under different policies of assigning an incoming request to one of the groups.

A. Analysis of Response Time

Consider a distributed storage system with m disks. We divide them into g = m/n groups of n disks each. We refer to this system as the (m,n, k) fork-join system, formally defined as follows.

Definition 2 (The (m,n, k) fork-join system). An (m,n, k) fork-join system consists of m ≥ n disks partitioned into g = m/n groups with n disks each. An incoming download request is assigned to one of the g groups according to some

Fig. 7. Mean response time T(12,n,k) of the (12, 6, k) and (12, 12, k) systems with the exponential and Pareto (α = 1.8) service distributions, for λ = 1 and μ = 3. Given m, we would like to find the smallest n and the largest k that can achieve a given target response time.

policy (e.g., uniformly at random). Each group behaves as an independent (n, k) fork-join system as described in Definition 1.

Lemma 1 can be extended to show that the conditions λi < nμ for 1 ≤ i ≤ g, where λi is the request arrival rate to group i, are necessary for the stability of the (m,n, k) fork-join system.

The mean response time depends on the policy used to assign an incoming request to one of the groups. Suppose each incoming request is randomly assigned to the ith group, 1 ≤ i ≤ g, with probability pi, such that ∑_{i=1}^{g} pi = 1. Then the arrivals to the ith group are Poisson with rate λi = piλ. For this assignment policy we can extend the bounds in Theorem 1 and Theorem 2 to derive bounds on the mean response time of the (m,n, k) fork-join system as follows.

Corollary 2. The bounds on the mean response time T(m,n,k) of an (m,n, k) fork-join system are weighted sums of the bounds in (3) and (8) with λ replaced by λi, where the weights equal pi = λi/λ.

For example, under the uniform job assignment policy, with pi = 1/g for all 1 ≤ i ≤ g, T(m,n,k) is bounded by (3) and (8) with λ replaced by λ/g.
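Routing by the probabilities pi is Poisson splitting, which is why each group sees an independent Poisson stream of rate piλ. A minimal sketch of the dispatcher (the function name is ours):

```python
import random

def assign_group(probs, rng):
    """Route one request: return group i with probability probs[i].
    Splitting a Poisson(lam) arrival stream this way makes group i's
    arrivals Poisson(probs[i] * lam), as used in Corollary 2."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off
```

Uniform assignment corresponds to probs = [1/g]*g, giving the λ/g substitution of the uniform policy.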

B. Numerical Results

To reduce the decoding complexity and signaling overhead, an (m,n, k) fork-join system with smaller n, and thus more groups g = m/n, is preferred. However, reducing n reduces the diversity advantage, which can result in a higher expected download time. Thus, there is a delay-complexity trade-off when we vary the number of groups g. Moreover, the content has to be replicated at all groups to which a request for it can be directed. Thus, having a large number of groups also means increased storage space.

In Fig. 7 we plot the mean response time of the (12, n, k) system under uniform group assignment, with exponential and Pareto service times. Given the number of disks m, we would like to find the smallest n and the largest k that can achieve a given target response time. A smaller n means there are fewer


Fig. 8. Mean response time of (20, 5, k) and (20, 10, k) systems, with λ = 1 and μ = 1/8, for different group assignment policies (uniform, power-of-2, and LWL, which is power-of-4 for n = 5 and power-of-2 for n = 10). The power-of-2 and LWL assignments give faster response times than uniform assignment.

disks per group, and hence less signaling overhead of forking a request to the disks in a group. A larger k is desirable because the total storage space used is m/k units. For the exponential service distribution, Fig. 7 shows that the diversity of having a larger n, or a smaller k, always gives a lower response time. But this monotonicity does not hold for the Pareto service time distribution. For example, the (12, 6, 3) fork-join system, with n = 6 disks per group and 12/3 = 4 units of storage used, gives a lower response time than the (12, 12, 2) fork-join system, with n = 12 disks per group and 12/2 = 6 units of storage used.

We now study the response time of the (m,n, k) fork-join system under three different group assignment policies: the uniform assignment policy, where each incoming request is assigned to a group chosen uniformly at random from the g groups, and the power-of-d and least-work-left (LWL) policies introduced in Section II-C3.

In Fig. 8 we show a comparison of the response time of the (20, n, k) fork-join system with the uniform, power-of-d, and LWL group assignment policies. Request arrivals are Poisson with rate λ = 1, and service times are exponential with rate μ = 1/8. As expected, the power-of-d assignments give a lower response time than the uniform assignment, but at the cost of receiving feedback about the amount of work left at each node. We again note that the power-of-2 policy is only slightly worse than the LWL policy (cf. Fig. 4). The simulations suggest that power-of-d group assignment is a strategy worth considering in actual implementations.

V. CONCLUDING REMARKS

A. Major Implications

We show how coding in distributed storage systems, which is used to provide reliability against disk failures, also reduces the content download time. We consider that the content is divided into k blocks and stored on n > k disks or nodes in a network. The redundancy is added using an (n, k) maximum distance separable (MDS) code, which allows content reconstruction by reading any k of the n disks. Since the download time from each disk is random, waiting for only k out of the n disks reduces the overall download time significantly.

We take a queueing-theoretic approach to model multiple users requesting the content simultaneously. We propose the (n, k) fork-join system model, where each request is forked to queues at the n disks. This is a novel generalization of the previously studied (n, n) fork-join queue. We derive upper and lower bounds on the expected download time and show that they are fairly tight. To the best of our knowledge, we are the first to propose the (n, k) fork-join system and find bounds on its mean response time. We also extend this analysis to distributed systems with a large number of disks, which can be divided into many (n, k) fork-join systems.

Our results demonstrate the fundamental trade-off between the download time and the amount of storage space. This trade-off can be used to design the amount of redundancy required to meet the delay constraints of content delivery. We observe that the optimal operating point varies with the service distribution of the time to read each disk. We present theoretical results for the exponential distribution, and simulation results for the heavy-tailed Pareto distribution.

B. Future Perspectives

Although we focus on distributed storage here, the results in this paper can be extended to computing systems such as MapReduce [34], as well as to content access networks [16], [35].

There are some practical issues affecting the download time that are not considered in this paper, and that could be addressed in future work. For instance, the signaling overhead of forking the request to n disks, and the complexity of decoding, increase with n. In practical storage systems, adding redundancy not only requires extra capital investment in storage and networking, but also consumes more energy [36]. It would be interesting to study the fundamental trade-off between power consumption and quality-of-service. Finally, note that in this paper we focus on the read operation in a storage system. However, in practical systems the requests entering the system consist of both read and write operations; we leave the investigation of the write operation for future work.

REFERENCES

[1] Amazon EBS, http://aws.amazon.com/ebs/.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in ACM SIGOPS Op. Sys. Rev., vol. 37, no. 5, 2003, pp. 29–43.
[3] Dropbox, http://www.dropbox.com/.
[4] Google Docs, http://docs.google.com/.
[5] G. Joshi, Y. Liu, and E. Soljanin, “Coding for fast content download,” Allerton Conf., pp. 326–333, Oct. 2012.
[6] A. G. Dimakis, P. B. Godfrey, M. Wainwright, and K. Ramchandran, “Network coding for distributed storage systems,” Proc. IEEE INFOCOM, pp. 2000–2008, May 2007.
[7] M. Blaum, J. Brady, J. Bruck, and J. Menon, “EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures,” IEEE Trans. Comput., vol. 44, no. 2, pp. 192–202, 1995.
[8] R. Rodrigues and B. Liskov, “High availability in DHTs: Erasure coding vs. replication,” Int. Workshop Peer-to-Peer Sys., pp. 226–239, Feb. 2005.
[9] K. V. Rashmi, N. B. Shah, P. V. Kumar, and K. Ramchandran, “Explicit construction of optimal exact regenerating codes for distributed storage,” Allerton Conf. on Commun. Control and Comput., pp. 1243–1249, Sep. 2009.
[10] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, “Interference alignment in regenerating codes for distributed storage: necessity and code constructions,” IEEE Trans. Inf. Theory, vol. 58, pp. 2134–2158, Apr. 2012.


[11] I. Tamo, Z. Wang, and J. Bruck, “Zigzag codes: MDS array codes with optimal rebuilding,” IEEE Trans. Inf. Theory, vol. 59, no. 3, pp. 1597–1616, 2013.
[12] U. Ferner, M. Medard, and E. Soljanin, “Toward sustainable networking: Storage area networks with network coding,” Allerton Conf., pp. 517–524, Oct. 2012.
[13] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, “Codes can reduce queueing delay in data centers,” Proc. Int. Symp. Inform. Theory, pp. 2766–2770, Jul. 2012.
[14] N. Shah, K. Lee, and K. Ramchandran, “The MDS queue: analyzing latency performance of codes and redundant requests,” Tech. Rep. arXiv:1211.5405, Nov. 2012.
[15] G. Kabatiansky, E. Krouk, and S. Semenov, Error Correcting Coding and Security for Data Networks: Analysis of the Superchannel Concept, 1st ed. Wiley, Mar. 2005, ch. 7.
[16] N. F. Maxemchuk, “Dispersity routing in high-speed networks,” Compu. Networks and ISDN Sys., vol. 25, pp. 645–661, Jan. 1993.
[17] Y. Liu, J. Yang, and S. C. Draper, “Exploiting route diversity in multi-packet transmission using mutual information accumulation,” Allerton Conf. on Commun. Control and Comput., pp. 1793–1800, Sep. 2011.
[18] L. Xu, “Highly available distributed storage systems,” Ph.D. dissertation, California Institute of Technology, 1998.
[19] E. Soljanin, “Reducing delay with coding in (mobile) multi-agent information transfer,” Allerton Conf. on Commun. Control and Comput., pp. 1428–1433, Sep. 2010.
[20] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.
[21] C. Kim and A. K. Agrawala, “Analysis of the fork-join queue,” IEEE Trans. Comput., vol. 38, no. 2, pp. 250–255, Feb. 1989.
[22] R. Nelson and A. Tantawi, “Approximate analysis of fork/join synchronization in parallel queues,” IEEE Trans. Comput., vol. 37, no. 6, pp. 739–743, Jun. 1988.
[23] E. Varki, A. Merchant, and H. Chen, “The M/M/1 fork-join queue with variable sub-tasks,” unpublished, available online.
[24] L. Flatto and S. Hahn, “Two parallel queues created by arrivals with two demands I,” SIAM Journal on Applied Mathematics, vol. 44, no. 5, pp. 1041–1053, 1984.
[25] P. E. Wright, “Two parallel processors with coupled inputs,” Advances in Applied Probability, pp. 986–1007, 1992.
[26] P. Konstantopoulos and J. Walrand, “Stationarity and stability of fork-join networks,” J. Appl. Prob., vol. 26, pp. 604–614, Sep. 1989.
[27] H. C. Tijms, A First Course in Stochastic Models, 2nd ed. Wiley, 2003, ch. 2.5, p. 58.
[28] J. Esary, F. Proschan, and D. Walkup, “Association of random variables, with applications,” Annals of Math. Stat., vol. 38, no. 5, pp. 1466–1474, Oct. 1967.
[29] M. Mitzenmacher, “The power of two choices in randomized load balancing,” Ph.D. dissertation, University of California, Berkeley, CA, 1996.
[30] N. Papadatos, “Maximum variance of order statistics,” Ann. Inst. Statist. Math., vol. 47, pp. 185–193, 1995.
[31] B. C. Arnold and R. A. Groeneveld, “Bounds on expectations of linear systematic statistics based on dependent samples,” Annals of Stat., vol. 7, pp. 220–223, Oct. 1979.
[32] M. E. Crovella and A. Bestavros, “Self-similarity in World Wide Web traffic: evidence and possible causes,” IEEE/ACM Trans. Netw., pp. 835–846, Dec. 1997.
[33] M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On power-law relationships of the Internet topology,” Proc. ACM SIGCOMM, pp. 251–262, 1999.
[34] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. of the ACM, vol. 51, pp. 107–113, Jan. 2008.
[35] A. Vulimiri, O. Michel, P. B. Godfrey, and S. Shenker, “More is less: reducing latency via redundancy,” Proc. ACM HotNets, pp. 13–18, 2012.
[36] T. Bostoen, S. Mullender, and Y. Berbers, “Power-reduction techniques for data-center storage systems,” ACM Comput. Surveys, vol. 45, no. 3, 2013.

Gauri Joshi received a B.Tech degree in Electrical Engineering, and an M.Tech in Communication and Signal Processing, from the Indian Institute of Technology (IIT) Bombay in 2009 and 2010 respectively. She received an S.M. degree in 2012, and is now pursuing a PhD at the Massachusetts Institute of Technology (MIT). She is a recipient of the Institute Gold Medal for academic excellence at IIT Bombay, and the William Martin Memorial Prize for the best S.M. thesis in computer science at MIT.

Yanpei Liu (S’09) received the B.Eng. degree in electronic engineering from The Chinese University of Hong Kong, Hong Kong, China, in 2009 (with the highest honor), and the M.S. degree in computer science from the University of Wisconsin, Madison. Currently, he is pursuing his Ph.D. degree in electrical and computer engineering at the University of Wisconsin, Madison.

Emina Soljanin received the MS and PhD degrees from Texas A&M University, College Station, in 1989 and 1994, and the European Diploma degree from the University of Sarajevo, Bosnia, in 1986, all in Electrical Engineering.

From 1986 to 1988, she worked in the Energoinvest Company, Bosnia, developing optimization algorithms and software for power system control. After graduating from Texas A&M in 1994, she joined Bell Laboratories, Murray Hill, NJ, where she is now a Distinguished Member of Technical Staff in the Mathematics of Networks and Communications research department. Her research interests are in the broad area of communications, information and coding theory, and their applications. In the course of her twenty-year employment with Bell Labs, she has participated in a very wide range of research and business projects. These projects include designing the first distance-enhancing codes to be implemented in commercial magnetic storage devices, the first forward error correction for Lucent’s optical transmission devices, color space quantization and color image processing, quantum computation, link error prediction methods for the third-generation wireless network standards, and anomaly and intrusion detection. Her most recent activities are in the area of network and rateless coding. For research in these areas, she has won NSF, DARPA, NAE, and ONR funding for salary, students, travel, and workshops.

Dr. Soljanin served as a Technical Proof-Reader, 1990-1992, and as the Associate Editor for Coding Techniques, 1997-2000, for the IEEE Transactions on Information Theory. She has been serving as a Co-Chair for the DIMACS Special Focus on Computational Information Theory and Coding. She spent 2008 as a visiting researcher at Ecole Polytechnique Federale de Lausanne (EPFL), in Switzerland. Dr. Soljanin is a member of the editorial board of the Springer Journal on Applicable Algebra in Engineering, Communication and Computing (AAECC), and a member of the Board of Governors of the IEEE Information Theory Society.