

Optimizing Differentiated Latency in Multi-Tenant, Erasure-Coded Storage

Yu Xiang, Tian Lan, Member, IEEE, Vaneet Aggarwal, Senior Member, IEEE, and Yih-Farn R. Chen

Abstract—Erasure codes are widely used in distributed storage systems since they provide space-optimal data redundancy to protect against data loss. Despite recent progress on quantifying average service latency when erasure codes are employed, there is very little work on providing differentiated latency among multiple tenants that may have different latency requirements. This paper proposes a novel framework for providing and optimizing differentiated latency in erasure-coded storage by investigating two policies, weighted queue and priority queue, for scheduling tenant requests. For both policies, we quantify service latency for different tenant classes for homogeneous files with arbitrary placement and service time distributions. We develop an optimization framework that jointly minimizes differentiated latency over three decision spaces: data placement, request scheduling, and resource management. Efficient algorithms harnessing bipartite matching and convex optimization techniques are developed to solve the proposed optimization. Our solution enables elastic service-level agreements to meet heterogeneous application requirements. We further prototype our solution with both queuing models in an open-source cloud storage deployment that simulates three geographically distributed data centers through bandwidth reservations. Experimental results validate our theoretical delay analysis and show significant joint latency reduction for different classes of files, providing valuable insights into service differentiation and elastic quality of service (QoS) in erasure-coded storage systems.

Index Terms—Differentiated services, Distributed storage, Erasure codes, Weighted queue, Priority queue, Quality of service, Service-level agreements.

I. INTRODUCTION

Erasure coding has been increasingly adopted by storage systems such as Microsoft Azure and Google Cloud Storage because it offers better space efficiency while achieving the same or better reliability than full data replication. As a result, the effect of coding on content retrieval latency in data-center storage systems is attracting significant attention. Google and Amazon have reported that every 500 ms of extra delay means a 1.2% user loss [2]. Optimizing cloud storage to meet heterogeneous service-level objectives for latency among multiple tenants is an important and challenging problem. Yet, due to the lack of analytic latency models for differentiated services in erasure-coded storage, most of the literature is limited to the analysis and optimization of average service latency over all tenants, e.g., [19], [24], [20], [27], failing to recognize cloud applications' heterogeneous preferences.

Y. Xiang and Y. R. Chen are with AT&T Labs-Research, Bedminster, NJ 07921 (email: {yxiang, chen}@research.att.com). T. Lan is with the Department of ECE, George Washington University, DC 20052 (email: [email protected]). V. Aggarwal is with the School of IE, Purdue University, West Lafayette, IN 47907 (email: [email protected]). This work was supported in part by the National Science Foundation under Grant no. CNS-1618335.

This paper was presented in part at the 35th International Conference on Distributed Computing Systems (ICDCS) 2015 [1].

Providing the same level of latency to all applications is unsatisfactory: cloud tenants may find it either inadequate or too expensive for their specific requirements, which are shown to vary significantly [2]. In fact, elastic QoS is an integral part of cloud computing and one of its most attractive premises. To the best of our knowledge, however, the optimization of differentiated service latency in an erasure-coded storage system is an open problem. First, latency is largely affected by queueing, and the presence of multiple tenants forces the queueing requests of all tenants to share the same underlying infrastructure. Second, under (n, k) erasure coding, an access request can be served by any k-out-of-n data chunks, which requires solving a chunk placement and dynamic scheduling problem in line with tenants' latency requirements. In this paper, we study erasure-coded storage under two request management policies, priority queuing and weighted queuing. By quantifying the service latency of these policies, we propose a novel optimization framework that provides differentiated service latency to meet heterogeneous application requirements and to enable elastic Service-Level Agreements (SLAs) in cloud storage.

Much of the prior work on improving storage latency in erasure-coded systems is limited to the easier problem of analyzing and optimizing average latency over all cloud tenants, regardless of their different latency preferences. For homogeneous files, the authors of [19], [20] proposed a block-t-scheduling policy that only allows the first t requests at the head of the buffer to move forward in order to gain tractability. A separate line of work was developed using the fork-join queue [21], [22], [23] and provides different bounds on average service latency for erasure-coded storage systems [24], [25], [26]. Recently, a new approach to analyzing average latency in erasure-coded storage was proposed in [27]. It harnesses order-statistic analysis and a new probabilistic scheduling policy to derive an upper bound on average latency in closed form. Not only does this result supersede previous latency analyses [19], [24], [20] by incorporating multiple non-homogeneous files and arbitrary service time distributions, its closed-form quantification of service latency also enables a joint latency and storage cost minimization that can be efficiently solved via an approximate algorithm [27], [28], [29]. The probabilistic scheduling policy has also been shown to be asymptotically optimal for the tail latency index of heavy-tailed files [30].

However, all these prior works focus on analyzing and optimizing average service latency, which is inadequate for a multi-tenant cloud environment where each tenant has a different latency requirement for accessing files in erasure-coded, online cloud storage. While customizing elastic service latency for tenants is undoubtedly appealing, it also comes with great technical challenges and calls for a new framework for quantifying, optimizing, and delivering differentiated service latency in general erasure-coded storage. In this paper,


we propose two classes of policies for providing differentiated latency and elastic QoS in multi-tenant, erasure-coded storage systems: a non-preemptive priority queue and a weighted queue for scheduling tenant requests. Both policies partition tenants into different service classes based on their delay requirements and apply a differentiated management policy to the file requests generated by tenants in each service class.

In particular, our first policy is modeled as a non-preemptive priority queue. That is, a file request generated by a tenant in the high priority class (i.e., one with more stringent delay requirements) can move ahead of all low priority requests (i.e., those generated by tenants in the low priority class) waiting in the queue of a storage node, but low priority requests already in service are not interrupted by high priority requests. The second policy is modeled as a weighted queue, where file requests submitted by tenants in each class are buffered and processed in a separate first-come-first-serve queue at each storage node. The service rates of these queues are tunable, subject to the constraints that they are non-negative and sum to at most the service rate of the server. Tuning these weights allows us to provide differentiated service rates to tenants in different classes, thereby assigning differentiated service latency to tenants. In this paper, we apply the probabilistic scheduling of [27] and derive closed-form latency bounds for the service policies that use non-preemptive priority queues and weighted queues, under arbitrary data placement and service time distributions.

The goal of this paper is to quantify service latency under these two policies and to develop an optimization framework that minimizes differentiated service latency for multiple tenants. More precisely, for the priority queue policy, we formulate a joint latency minimization problem for all tenants over two dimensions, i.e., the placement of erasure-encoded file chunks on distributed storage nodes and the scheduling of chunk access requests, while the joint optimization for the weighted queue policy further minimizes latency over the weights assigned to each service class. We show that both problems are mixed-integer optimizations and difficult to solve in general. To develop efficient algorithmic solutions, we show that each problem can be decoupled into two sets of sub-problems: one equivalent to a bipartite matching and the other proven to be convex. We develop a greedy algorithm that solves these subproblems optimally in an iterative fashion, thereby jointly minimizing the service latency of all tenants in different service classes. This optimization gives rise to new challenges that cannot be addressed by scheduling algorithms in MapReduce-type frameworks [14], [15], [16], [17], because chunk placement in erasure-coded storage must be solved jointly with a request scheduling problem that selects ki-out-of-ni distinct chunks/servers with available data to serve any request.

To validate our theoretical analysis and joint latency optimization for different tenants, we prototype the proposed algorithms in Tahoe [38], an open-source, distributed file system based on the zfec erasure coding library for fault tolerance. A Tahoe storage system consisting of 12 storage nodes is deployed as virtual machines in an OpenStack-based data center environment, and one additional storage client is deployed to issue storage requests. From the experimental results, we first find that service time is inversely proportional to the bandwidth of the server, which validates an assumption used in the analysis of the weighted queue latency. Further, the experimental results validate the fast convergence of our differentiated latency optimization algorithms. We see that our algorithms efficiently reduce latency with both the priority and the weighted queues, and the experimental results are reasonably close to the derived latency bounds for both models. Finally, we note that priority queuing can lead to unfairness, since low priority tenants only share the residual service rates left over by high priority tenants, whereas weighted queuing is able to balance service rates by optimizing the weights assigned to each service class.

II. SYSTEM MODEL

In this section we introduce the system model and the two queuing models used in this paper. We consider a data center consisting of m heterogeneous servers, denoted by M = {1, 2, . . . , m}. A set of tenants store r heterogeneous files among those m servers. Each file is encoded with a different erasure code as follows. We divide file i into ki fixed-size chunks and then encode it using an (ni, ki) Maximum Distance Separable (MDS) erasure code [31] to generate ni distinct encoded chunks of equal size. Each encoded chunk is then stored on a different storage node. Thus, we denote by Si the set of storage nodes hosting chunks of file i, which satisfies Si ⊆ M and ni = |Si|; each selected server hosts exactly one chunk of file i. When processing a file-retrieval request, the (ni, ki) MDS erasure code allows the content to be reconstructed from any subset of ki-out-of-ni chunks, so we need to access only ki chunks of file i to recover the complete file, and we must decide which ki of the ni chunks to access. We assume that the parameters ki and ni for each file are fixed; the value of ki is determined by the content size and the choice of chunk size.
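The any-ki-of-ni recovery property of an MDS code can be illustrated with a toy (3, 2) code in which the third chunk is the bitwise XOR of the first two. This is a minimal sketch for intuition only, not the zfec code used in our prototype:

```python
# Toy (n, k) = (3, 2) MDS code: chunks c1, c2 and parity p = c1 XOR c2.
# Any 2 of the 3 stored chunks suffice to reconstruct the original file.

def encode(data: bytes) -> list[bytes]:
    half = (len(data) + 1) // 2
    c1, c2 = data[:half], data[half:].ljust(half, b"\0")  # pad c2 to equal size
    p = bytes(a ^ b for a, b in zip(c1, c2))
    return [c1, c2, p]

def decode(chunks: dict[int, bytes], orig_len: int) -> bytes:
    # chunks maps index (0, 1, or 2) -> chunk content; any two indices suffice.
    if 0 in chunks and 1 in chunks:
        c1, c2 = chunks[0], chunks[1]
    elif 0 in chunks:                      # recover c2 = c1 XOR p
        c1 = chunks[0]
        c2 = bytes(a ^ b for a, b in zip(c1, chunks[2]))
    else:                                  # recover c1 = c2 XOR p
        c2 = chunks[1]
        c1 = bytes(a ^ b for a, b in zip(c2, chunks[2]))
    return (c1 + c2)[:orig_len]

data = b"hello world"
c = encode(data)
assert decode({0: c[0], 1: c[1]}, len(data)) == data
assert decode({0: c[0], 2: c[2]}, len(data)) == data
assert decode({1: c[1], 2: c[2]}, len(data)) == data
```

Real MDS codes generalize this idea to arbitrary (ni, ki) via Reed-Solomon arithmetic; the XOR parity is simply the (3, 2) special case.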

We assume that the tenants' files are divided into two service classes: R1 for delay-sensitive files and R2 for non-delay-sensitive files. Requests for each file (or content) i arrive as a Poisson process with rate λi, i = 1, . . . , r. For a given placement Si of each file and erasure-code parameters (ni, ki), we use probabilistic scheduling to select ki file chunks when a file is requested. There are ni-choose-ki options, and each option is chosen with a certain probability. This scheduling strategy was proposed in [27], where an outer bound on the latency was also provided. Probabilistic scheduling over all ni-choose-ki options was shown in [27] to be equivalent to choosing each of the storage nodes in Si with a certain probability, say πi,j. More formally, let πi,j ∈ [0, 1] be the probability that a request for file i is forwarded to and served by storage node j, i.e.,

πi,j = P[j ∈ Si is selected | ki nodes are selected]. (1)

It is easy to see that πi,j = 0 if no chunk of content i is placed on node j, i.e., j ∉ Si. Further, Σ_j πi,j = ki, since exactly ki distinct storage nodes are needed to process each request. Next, we describe the two queuing models used in this paper.
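The equivalence between subset selection and per-node access probabilities can be checked numerically. For example, choosing one of the ni-choose-ki subsets uniformly at random induces the per-node marginal πi,j = ki/ni, and the marginals always sum to ki (a small sketch with illustrative parameters):

```python
from itertools import combinations

n_i, k_i = 4, 2                 # illustrative (n_i, k_i) erasure-code parameters
nodes = list(range(n_i))        # the n_i storage nodes in S_i
subsets = list(combinations(nodes, k_i))   # all n_i-choose-k_i scheduling options

# Uniform probabilistic scheduling: each subset chosen with equal probability.
pi = {j: sum(1 for s in subsets if j in s) / len(subsets) for j in nodes}

assert all(abs(p - k_i / n_i) < 1e-12 for p in pi.values())   # marginal = k_i/n_i
assert abs(sum(pi.values()) - k_i) < 1e-12                    # sum_j pi_{i,j} = k_i
```

Non-uniform subset probabilities induce other marginals, but the constraint Σ_j πi,j = ki always holds since every subset contains exactly ki nodes.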

A. Priority Queuing

We assign a high priority to delay-sensitive files in R1 and a low priority to non-delay-sensitive files in R2. In order


Fig. 1. System evolution for high/low priority queuing (states 1-5; high priority jobs A1, A2, C1, C2 and low priority jobs B1, B2, D1, D2).

Fig. 2. System evolution for weighted queuing (states 1-5; the same jobs served concurrently from class 1 and class 2 queues).

to serve tenants with different priorities, each storage node maintains two sets of queues: high priority queues and low priority queues. Requests made by high priority tenants enter the high priority queues, and requests made by low priority tenants enter the low priority queues. A chunk is serviced from the high priority queue as long as that queue is non-empty; a chunk is serviced from the low priority queue only if there is no chunk in the high priority queue. We assume a non-preemptive priority queue: if a high priority request arrives during the waiting time of a low priority request, the arriving high priority request is served first, but a request already in service is not affected by the later arrival of high priority requests.

An example is shown in Fig. 1, where a server has two queues, one for high priority requests and one for low priority requests, denoted Qh and Ql respectively. At state 1, Qh holds two jobs, A2 and C1, and Ql holds three jobs, B1, B2, and D2, with job A1 in service. Since priority queues are used, a low priority request is served only when the high priority queue is empty; thus after A1 is served, A2 and then C1 are served, while all jobs in the low priority queue wait, as depicted in state 2. In state 3, after C1 leaves, there is no new high priority request, so B1 from Ql is served. State 4 shows a new high priority request C2 arriving while B1 is being served. Because the priority queue is non-preemptive, the system does not disrupt service for B1. After B1 is served, C2 is served before the remaining low priority requests, as depicted in state 5.
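The serving order in this example can be reproduced with a few lines of simulation. The sketch below tracks serving order only (not timing) and hard-codes the single arrival of C2 during B1's service:

```python
from collections import deque

def serve_order(in_service, high, low, arrivals):
    """Non-preemptive priority: the high priority queue is drained first,
    but a job already in service is never interrupted. `arrivals` maps a
    job name to the (priority, job) that arrives while it is in service."""
    order, high, low = [], deque(high), deque(low)
    current = in_service
    while current is not None:
        order.append(current)
        if current in arrivals:            # new request during this service
            prio, job = arrivals[current]
            (high if prio == "high" else low).append(job)
        current = high.popleft() if high else (low.popleft() if low else None)
    return order

# State 1 of Fig. 1: A1 in service, Qh = [A2, C1], Ql = [B1, B2, D2];
# C2 (high priority) arrives while B1 is being served.
print(serve_order("A1", ["A2", "C1"], ["B1", "B2", "D2"],
                  {"B1": ("high", "C2")}))
# -> ['A1', 'A2', 'C1', 'B1', 'C2', 'B2', 'D2']
```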

B. Weighted Queuing

Weighted queuing apportions service rate among different service classes in proportion to given weights. Tenants with higher weights receive higher service rates, while tenants with lower weights can still receive their fair share if the weights are properly balanced. Assume that the total bandwidth B allocated to a server is divided among the two service classes by weights w1 and w2, such that all tenants' files in class k get bandwidth Bk = B·wk for k ∈ {1, 2}, with w1 + w2 = 1. The system is thus equivalent to dividing each physical server into two logical servers, one with service bandwidth B1 = B·w1 and the other with B2 = B·w2. We assume that service time is inversely proportional to bandwidth, so the service time with bandwidth B/2 is twice that with bandwidth B. Unlike priority queuing, each server is now able to serve two requests from different classes at the same time, offering different service bandwidths to different classes. In this case, the low priority class may still get a small portion of bandwidth, without having to wait for all the high priority jobs to finish.

To compare with the priority queuing model, we consider the example shown in Fig. 2, with the same set of jobs as in Fig. 1. Here A1, A2, C1, C2 are class 1 jobs and B1, B2, D1, D2 are class 2 jobs, where class 1 is the class with the larger weighted bandwidth and is thus equivalent to a higher priority. Let the jobs in service at state 1 be A1 and B1. Each time a logical server for either class becomes free, the first queued job of that class is served at service bandwidth B·wk, k ∈ {1, 2}, as shown in all states of Fig. 2. We can see from Fig. 2 that all class 1 requests are served during the five states, along with three of the class 2 jobs. In contrast, priority queuing must finish serving class 1 requests before class 2, so fewer requests were served in the previous example. Thus, weighted queuing provides more fairness to customers in the low priority class.
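The bandwidth split underlying this example can be sketched numerically. The bandwidth, chunk size, and weights below are illustrative values, not measurements from our experiments:

```python
# Weighted queuing: server bandwidth B is split as B_k = B * w_k, and
# service time is assumed inversely proportional to allocated bandwidth.
B = 100.0                      # total server bandwidth (MB/s), illustrative
chunk_size = 4.0               # chunk size (MB), illustrative
w = {1: 0.75, 2: 0.25}         # class weights, w1 + w2 = 1

service_time = {k: chunk_size / (B * wk) for k, wk in w.items()}

assert abs(sum(w.values()) - 1.0) < 1e-12
# Class 2 gets one third of class 1's bandwidth, so 3x the service time.
assert abs(service_time[2] - 3 * service_time[1]) < 1e-12
print(service_time)            # both classes are served concurrently at these rates
```

Tuning w1 and w2 trades class 1 latency against class 2 latency, which is exactly the weight-optimization dimension introduced in Section IV.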

III. UPPER BOUND: PROBABILISTIC SCHEDULING

In this section, we derive a latency upper bound for each file in the two service classes.

We consider two types of delay: queuing delay, denoted by Qj, and connection delay, denoted by Nj. Queuing delay is the waiting time a chunk request experiences at node j due to sharing of network and I/O bandwidth, and is determined by the service rates and arrival rates of chunk requests at each storage node according to our queuing models. Connection delay includes the overhead of setting up a connection from a tenant to node j in a given data center. We assume that the connection delay has a known mean ηj and variance ξj², and is independent of Qj. It is easy to see that if a subset Ai ⊆ Si of ki nodes is selected to process a file-i request, the request's latency is determined by the maximum of the queuing plus connection delays over the ki nodes in Ai. Therefore, under probabilistic scheduling, the average latency of file i is

  T̄i = E[ max_{j∈Ai} (Nj + Qj) ],

where the expectation is taken over all system dynamics, including queuing at each storage node and the random selection of Ai for each request with respect to the probabilities πi,j ∀j.
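This expectation can be estimated by Monte Carlo simulation. The sketch below assumes exponentially distributed connection and queuing delays and uniform subset selection, all with illustrative parameters; it is a modeling sketch, not our actual queuing analysis:

```python
import random

random.seed(0)
nodes = [0, 1, 2, 3]
k_i = 2
eta = {0: 0.02, 1: 0.03, 2: 0.02, 3: 0.05}   # mean connection delays (s), illustrative
mu  = {0: 40.0, 1: 30.0, 2: 50.0, 3: 25.0}   # per-node service rates, illustrative

def one_request():
    A_i = random.sample(nodes, k_i)          # one uniform k_i-subset selection
    # N_j + Q_j drawn as exponential connection + exponential queuing delay (assumption)
    return max(random.expovariate(1 / eta[j]) + random.expovariate(mu[j]) for j in A_i)

n_trials = 100_000
T_bar = sum(one_request() for _ in range(n_trials)) / n_trials
print(round(T_bar, 4))   # empirical E[max_{j in A_i}(N_j + Q_j)]
```

The max over the ki selected nodes is what makes the exact expectation hard to compute in closed form, which motivates the order-statistics bound used next.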


We denote by Xj the service time per chunk at node j, which has an arbitrary distribution with finite mean E[Xj] = 1/μj, variance E[Xj²] − E[Xj]² = σj², second moment E[Xj²] = Γj,2, and third moment E[Xj³] = Γj,3. The authors of [27] gave an upper bound on T̄i using its mean and variance as follows.

Lemma 1 ([27]): The expected latency T̄i of file i is tightly upper bounded by

  T̄i ≤ min_{z∈R} { z + Σ_{j∈Si} (πi,j/2) (E[Dj] − z) + Σ_{j∈Si} (πi,j/2) √((E[Dj] − z)² + Var[Dj]) },   (2)

where Dj = Nj + Qj is the aggregate delay on node j with mean E[Dj] and variance Var[Dj].
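The right-hand side of this bound is convex in z, so the minimization is a simple one-dimensional search. The sketch below evaluates the bound with illustrative per-node statistics and minimizes it by ternary search:

```python
import math

# Illustrative per-node delay statistics and scheduling probabilities.
S_i   = [0, 1, 2]
pi    = {0: 0.7, 1: 0.7, 2: 0.6}        # pi_{i,j}, summing to k_i = 2
E_D   = {0: 0.10, 1: 0.15, 2: 0.12}     # E[D_j] (s)
Var_D = {0: 0.004, 1: 0.009, 2: 0.006}  # Var[D_j]

def bound(z):
    # z + sum_j (pi/2) * [(E[D_j] - z) + sqrt((E[D_j] - z)^2 + Var[D_j])]
    return z + sum(pi[j] / 2 * ((E_D[j] - z)
                   + math.sqrt((E_D[j] - z) ** 2 + Var_D[j])) for j in S_i)

lo, hi = -1.0, 1.0
for _ in range(200):                     # ternary search over the convex bound
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if bound(m1) < bound(m2):
        hi = m2
    else:
        lo = m1
z_star = (lo + hi) / 2
print(round(bound(z_star), 4))           # upper bound on T_bar_i
```

In the optimization algorithms of Section IV, this inner minimization over z is solved repeatedly for candidate placements and scheduling probabilities.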

We now consider the cases in which storage nodes manage chunk requests using the priority or weighted queue models, and derive upper bounds on service latency.

A. Latency Analysis with Priority Queues

According to our system model, each storage node maintains two priority queues: one for requests of files in service class R1 (the high priority class) and one for requests in service class R2 (the low priority class). Each file is either high priority or low priority, and thus all chunk requests of the same file have the same priority level k ∈ {1, 2}. Note that the queuing delay differs between the priority classes, because low priority requests must wait for the high priority queue to empty. Thus, we need upper bounds on the expected latency of the files in each priority class under probabilistic scheduling. Let Λjk = Σ_{i∈Rk} λi πi,j be the aggregate arrival rate of class-k requests on node j, and ρjk = Λjk/μj the corresponding service intensity. We analyze non-preemptive priority queues on each node and use Lemma 1 to obtain an upper bound on the service latency of each file in the two service classes, as follows.

Theorem 1: For non-preemptive priority queues, the expected latency T̄i,k of file i of class k is upper bounded by T̄i in (2), where E[Dj] and Var[Dj] for class k are replaced by E[Djk] and Var[Djk], respectively, given as follows:

  E[Dj1] = ηj + 1/μj + (Λj1 + Λj2) Γj,2 / (2(1 − ρj1)),   (3)

  Var[Dj1] = ξj² + σj² + (Λj1 + Λj2) Γj,3 / (3(1 − ρj1)) + (Λj1 + Λj2)² Γj,2² / (4(1 − ρj1)²),   (4)

  E[Dj2] = ηj + 1/μj + (Λj1 + Λj2) Γj,2 / (2(1 − ρj1)(1 − ρj1 − ρj2)),   (5)

  Var[Dj2] = ξj² + σj² + (Λj1 + Λj2) Γj,3 / (3(1 − ρj1)²(1 − ρj2)) + (Λj1 + Λj2)² Γj,2² / (4(1 − ρj1)²(1 − ρj2)²) + Λj1(Λj1 + Λj2) Γj,2² / (2(1 − ρj1)³(1 − ρj2)).   (6)

Theorem 1 quantifies the mean and variance of the delay for retrieving a file chunk of class k from each server j and then employs the order-statistics bound in Lemma 1 to bound the maximum latency of retrieving ki chunks from distinct servers. In our non-preemptive queuing model, any request belonging to the high priority class R1 moves to the head of the line and is processed as soon as the storage server becomes free. Thus, high priority requests are processed as if low priority requests did not exist, except for a residual service time to complete any request that has already started on the server. For the low priority class R2, a new request is served only if there is no high priority request in the queue. Its queuing delay consists of four parts: the residual service time of any active request, the total service time of the high priority requests already in the queue, the total service time of the low priority requests already in the queue before the target request's arrival, and the total service time of the high priority requests arriving during the target request's waiting time. Detailed expressions for the moments of the queuing delays are presented in Appendix A.
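The moment expressions in Theorem 1 are straightforward to evaluate numerically. The sketch below plugs illustrative parameters into equations (3)-(6), assuming exponential service times so that the moments Γj,2 and Γj,3 have simple closed forms, and confirms that the low priority class sees a larger mean delay:

```python
# Equations (3)-(6) for one storage node j, with illustrative parameters.
eta, xi2 = 0.02, 1e-4          # connection delay mean and variance (assumed)
mu = 40.0                      # service rate; exponential service time (assumption)
sigma2 = 1 / mu**2             # Var[X_j] for exponential service
G2, G3 = 2 / mu**2, 6 / mu**3  # second and third moments of X_j (exponential)
L1, L2 = 10.0, 8.0             # class arrival rates Lambda_j1, Lambda_j2
r1, r2 = L1 / mu, L2 / mu      # service intensities rho_j1, rho_j2

E_D1 = eta + 1/mu + (L1 + L2) * G2 / (2 * (1 - r1))                       # eq. (3)
V_D1 = xi2 + sigma2 + (L1 + L2) * G3 / (3 * (1 - r1)) \
       + (L1 + L2)**2 * G2**2 / (4 * (1 - r1)**2)                         # eq. (4)
E_D2 = eta + 1/mu + (L1 + L2) * G2 / (2 * (1 - r1) * (1 - r1 - r2))       # eq. (5)
V_D2 = xi2 + sigma2 + (L1 + L2) * G3 / (3 * (1 - r1)**2 * (1 - r2)) \
       + (L1 + L2)**2 * G2**2 / (4 * (1 - r1)**2 * (1 - r2)**2) \
       + L1 * (L1 + L2) * G2**2 / (2 * (1 - r1)**3 * (1 - r2))            # eq. (6)

assert E_D2 > E_D1             # low priority waits at least as long as high priority
print(round(E_D1, 4), round(E_D2, 4))
```

Feeding E[Djk] and Var[Djk] into the bound of Lemma 1 then yields the per-class latency bound of Theorem 1.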

B. Latency Analysis for Weighted Queuing

We consider the weighted queue policy, where each storage node employs a separate queue for each service class. The queuing delay Qjk for class-k requests on node j depends on the queuing weight wjk, since the service bandwidth on each storage node is shared among all queues in proportion to their assigned weights, whereas connection delay due to protocol overhead is independent of bandwidth allocation and remains unchanged. Let μj be the overall service rate on node j. According to our weighted queuing model, class-k requests on node j receive a fraction wjk of the service bandwidth and therefore have an average service time of 1/(wjk μj) per chunk request. Due to the Poisson property of request arrivals, each weighted queue can be modeled as an M/G/1 queue, whose mean and variance can be found in closed form.

Theorem 2: For weighted queues, the expected latency T̄i,k of file i of class k is upper bounded by T̄i in (2), where E[Dj] and Var[Dj] for class k are replaced by E[Djk] and Var[Djk], respectively, given as follows:

  E[Djk] = ηj + 1/μj + Λkj Pkj² Γj,2 / (2 wjk (wjk − Λkj Pkj / μj)),   (7)

  Var[Djk] = ξj² + σj² + Σ_{k=1}^{2} Pkj [ Λkj Pkj Γj,3 / (3 wjk² (wjk − Λkj Pkj / μj)) + Λkj² Pkj² Γj,2² / (4 wjk² (wjk − Λkj Pkj / μj)²) ],   (8)

where Λkj = Σ_{i∈Rk} λi πi,j is the arrival rate of class-k files at storage node j, and Pkj = Λkj / Σ_k Λkj is the proportion of class-k files at storage node j.


With weighted queues, the two queues for different classes are M/G/1 queues with different arrival rates and service rates, since we assumed that service time is inversely proportional to bandwidth. Thus, we can analyze each queue separately, applying the analysis of [27] to each class. The mean and variance with weighted queues are then found by summing the per-class means and variances, weighting each class's contribution by the probability that a file belongs to that class. The detailed proof is given in Appendix B.
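Equations (7)-(8) can likewise be evaluated directly. The sketch below uses illustrative arrival rates and weights, checks the stability condition of each weighted M/G/1 queue, and again assumes exponential service times for the moments:

```python
# Equations (7)-(8) for one storage node j with two weighted classes (illustrative values).
eta, xi2 = 0.02, 1e-4                        # connection delay mean/variance (assumed)
mu = 40.0                                    # overall service rate of node j
sigma2, G2, G3 = 1/mu**2, 2/mu**2, 6/mu**3   # exponential service-time moments (assumption)
Lam = {1: 10.0, 2: 5.0}                      # Lambda_kj: class arrival rates at node j
w   = {1: 0.7, 2: 0.3}                       # weights w_jk, summing to 1
P   = {k: Lam[k] / sum(Lam.values()) for k in Lam}   # P_kj: class proportions

def E_D(k):                                  # eq. (7)
    return eta + 1/mu + Lam[k] * P[k]**2 * G2 / (
        2 * w[k] * (w[k] - Lam[k] * P[k] / mu))

def Var_D(k):                                # eq. (8)
    s = sum(P[c] * (Lam[c] * P[c] * G3 / (3 * w[c]**2 * (w[c] - Lam[c] * P[c] / mu))
            + Lam[c]**2 * P[c]**2 * G2**2 / (4 * w[c]**2 * (w[c] - Lam[c] * P[c] / mu)**2))
            for c in (1, 2))
    return xi2 + sigma2 + s

for k in (1, 2):
    assert w[k] - Lam[k] * P[k] / mu > 0     # stability of each weighted M/G/1 queue
print(round(E_D(1), 4), round(E_D(2), 4))
```

Because E[Djk] grows as wjk approaches Λkj Pkj / μj, the weights directly control each class's latency, which is the handle exploited by the JLOW optimization in Section IV.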

IV. JOINT LATENCY OPTIMIZATION FOR DIFFERENTIATED SERVICE CLASSES

In this section we jointly optimize differentiated service latency for files in all service classes over three dimensions. First, we decide the chunk placement of each file i across the m servers based on the erasure code parameters (ni, ki). Second, when a client requests a file, a subset of ki nodes hosting the desired file chunks must be selected according to probabilities πi,j ∀j, i.e., the access probabilities for each chunk1. Finally, in the case of weighted queues, we also decide the weights wjk used for sharing service bandwidth among different classes of files. We formulate the service latency optimization problem in erasure-coded storage using priority and weighted queues, and then propose two algorithms, JLOP (Joint Latency Optimization in Priority Queuing) and JLWO (Joint Latency and Weight Optimization), for solving the latency minimization for differentiated service classes in the model described in the previous section. In an online environment with dynamic file arrivals and removals, we can solve Problems JLOP and JLWO repeatedly using the proposed algorithms to update system configurations on the fly.

A. Latency Optimization for Priority Queues

For priority queues, we give preference to high-priority tenants by allowing them to occupy as many system resources as needed without regard for low-priority requests, while low-priority tenants share the remaining resources given the decisions of the high-priority tenants. We therefore propose a two-stage optimization problem. First, we jointly optimize the chunk placement and access probabilities of all files in the high-priority class to minimize the service latency they receive. Then, latency for the low-priority files is minimized given the existing traffic generated by the high-priority files.

Let λ̂k = Σ_{i∈Rk} λi be the total arrival rate of class k requests, so that λi/λ̂k is the fraction of file i requests among the class k files. The average latency of all files in class k is then given by Σ_{i∈Rk} (λi/λ̂k) T̄ik, for k = 1, 2. Our goal is to minimize the latency of high-priority files over their chunk placement and access probabilities regardless of low-priority requests, and, based on that decision, to update the optimization variables of low-priority files to minimize their latency under the existing high-priority traffic. We formulate Problem JLOP, which consists of two sequential optimization problems for k = 1, 2, as follows.

1Note that it is possible to have πi,j = 0 even if a chunk of file i is placed on node j; i.e., the chunk is only a cold copy kept to ensure fault tolerance and is never accessed in latency optimization.

JLOP:

min Σ_{i∈Rk} (λi/λ̂k) T̄ik   (9)

s.t. Σ_{j=1}^{m} πi,j = ki, ∀i,   (10)

πi,j ∈ [0, 1], πi,j = 0 ∀j ∉ Si,   (11)

|Si| = ni and Si ⊆ M,   (12)

var. Si, πi,j, ∀i, j.

This optimization problem for k = 1, 2 is a mixed-integer optimization, since we must select a fixed set of ni servers for chunk placement for each file. It is hard to solve in general, so we break it into two sub-problems, placement and scheduling, which have an easier-to-handle structure and can be solved efficiently.
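As a quick illustration of the constraint set (10)-(12), the sketch below checks feasibility of a candidate placement set and access-probability vector for one file; the function name and tolerance are our own, not part of the paper.

```python
def jlop_feasible(pi, S, n_i, k_i, m, tol=1e-9):
    """Check constraints (10)-(12) for one file i:
    (10) access probabilities over the m servers sum to k_i,
    (11) each pi[j] lies in [0, 1] and vanishes for j outside S_i,
    (12) the placement set S_i contains exactly n_i of the m servers."""
    if len(S) != n_i or not S <= set(range(m)):
        return False                       # violates (12)
    if abs(sum(pi) - k_i) > tol:
        return False                       # violates (10)
    return all(-tol <= pi[j] <= 1 + tol and (pi[j] <= tol or j in S)
               for j in range(m))          # checks (11)
```

For example, with a hypothetical (4, 2) code on m = 5 servers, pi = [0.5, 0.5, 1, 0, 0] with S = {0, 1, 2, 3} is feasible; note pi[3] = 0 is allowed inside S (a cold copy, as footnote 1 observes).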

We first consider the sub-problem of optimizing the scheduling probabilities and recognize that, for fixed chunk placements, Problem JLOP is convex in πij. The scheduling sub-problem for the high-priority class is convex because the high-priority class is served as if the low-priority class did not exist, so the problem reduces to an optimization with a single priority class; as shown in [27], T̄ik is convex in Λjk, and since Λjk is a linear combination of the πij, the problem is convex in πij when all other parameters are fixed. For the low-priority class, we have the following lemma:

Lemma 2: (Convexity of the scheduling sub-problem for the low-priority class.) When {z, Si} are fixed, Problem JLOP is a convex optimization over the probabilities {πi,j}.

Now we consider the placement sub-problem of optimizing over Si for each file with fixed z, πij for all classes in priority queuing. We first rewrite the latency bound T̄ik as

T̄ik = z + Σ_{j∈Si} (πij/2) F(z, Λjk),   (13)

where F(z, Λjk) = Ajk + sqrt(A²jk + Bjk) is an auxiliary function with Ajk = E[Djk] − z and Bjk = Var[Djk]. To show that the placement sub-problem can be cast as a bipartite matching, we consider the problem of placing the chunks of file i on m available servers. Placing the chunks is equivalent to permuting the existing access probabilities {πi,j, ∀j} over all m servers, because πi,j > 0 only if a chunk of file i is placed on server j. Let β be a permutation of m elements. Under the new placement defined by β, the probability of accessing the chunk of file i on server j becomes πi,β(j). Thus, our objective in this sub-problem is to find a placement (permutation) β that minimizes the average service latency, which can be solved as a matching problem between the set of existing scheduling probabilities {πij, ∀j} and the set of m available servers, with respect to their load excluding the contribution of file i itself. Let Λ−ijk = Λjk − λiπij, where file i belongs to priority class k, be the total class k request rate at server j excluding the contribution of file i. We define a complete bipartite graph Gr = (U, V, E) with disjoint vertex sets U, V of equal size m and edge weights given by

Du,v = Σi (λi πiu / (2λ̂k)) F(z, Λ−ivk + λi πiu), ∀u, v,   (14)


which quantifies the contribution to overall latency (in objective value (9)) of assigning the existing πiu to server v with existing load Λ−ivk. It is easy to see that a minimum-weight matching on Gr finds an optimal β minimizing

Σ_{u=1}^{m} Du,β(u) = Σ_{u=1}^{m} Σi (λi πiu / (2λ̂k)) F(z, Λ−iβ(u)k + λi πiu),

which is exactly the optimization objective of Problem JLOP (up to a constant) when a file chunk with access probability πi,u is assigned to a server with existing load Λ−iβ(u)k.
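The minimum-weight matching step can be sketched as follows. For clarity on small m we use exhaustive search over permutations instead of the Hungarian algorithm the paper employs, and we assume the cost matrix D has already been built from (14).

```python
import itertools

def min_weight_matching(D):
    """Return the permutation beta minimizing sum_u D[u][beta(u)] on a
    complete bipartite graph with m x m cost matrix D. The paper uses
    the Hungarian algorithm (O(m^3)); brute force is shown here only
    to make the matching objective explicit on small instances."""
    m = len(D)
    best_cost, best = float("inf"), None
    for perm in itertools.permutations(range(m)):
        cost = sum(D[u][perm[u]] for u in range(m))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best, best_cost
```

Intuitively, the matching pairs large access probabilities with lightly loaded servers, which is exactly what minimizing Σu Du,β(u) achieves.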

We can now propose Algorithm JLOP, which solves the placement and scheduling sub-problems iteratively for each of the two priority classes. In Step 1, we solve a convex optimization to determine the optimal request scheduling for the high-priority class with all other decision variables fixed. Next, for the given scheduling probabilities, Step 2 computes an optimal matching between file chunks and storage servers to minimize overall access latency. The same process is repeated in Steps 3 and 4 for the low-priority service class, which likewise requires solving a convex scheduling optimization (due to Lemma 2) and an optimal matching for chunk placement. Finally, the latency bounds for both classes are updated via an optimization over z. The algorithm is guaranteed to converge, because optimally solving each sub-problem yields a monotonically decreasing sequence of latency objectives; it converges to a local optimum of the joint optimization, since no local change (in scheduling or placement alone) in any sub-problem can further improve the objective.

B. Latency Optimization for Weighted Queues

In the presence of weighted queues, we consider a joint optimization of all files in different service classes by minimizing a weighted aggregate latency. Let λ̂k = Σ_{i∈Rk} λi, λ̂ = Σk λ̂k, and let T̃ik be given by the upper bound on T̄ik in Theorem 2. Then we want to optimize the following:

JLWO:

min C1 T̃1 + C2 T̃2   (15)

s.t. T̃k = (λ̂k/λ̂) Σ_{i∈Rk} T̃ik,   (16)

Σ_{j=1}^{m} πi,j = ki, πi,j ∈ [0, 1], πi,j = 0   (17)

∀j ∉ Si,   (18)

Σ_{k=1}^{2} wjk = 1, ∀j,   (19)

|Si| = ni, and Si ⊆ M,   (20)

var. Si, πi,j, wjk, ∀i, j, k.

Since we must choose a fixed set of ni servers for chunk placement according to the (ni, ki) erasure code applied, Problem JLWO is a mixed-integer optimization and hard to solve computationally. To solve it, we first break Problem JLWO into three sub-problems: (i) a weight sub-problem for optimizing service bandwidth among

Algorithm JLOP:

Initialize t = 0, ε > 0.
Initialize feasible {z(0), πi,j(0), Si(0)}.
while O(t) − O(t−1) > ε
    // Solve scheduling and placement for the high-priority class
    // with given {z(t), π1i,j(t), π2i,j(t), S1i(t), S2i(t)}
    Step 0: for all high-priority jobs
        // Solve scheduling with given {z(t), Si(t)}
        Step 1: π1i,j(t+1) = arg min_{πi,j} (9) s.t. (10), (11).
        // Solve placement with given {z(t), πi,j(t+1)}
        Step 2: for i = 1, ..., r
            Calculate Λ−ij using {πi,j(t+1)}.
            Calculate Du,v from (14).
            (β(j) ∀j ∈ N) = Hungarian Algorithm({Du,v}).
            Update πi,β(j)(t+1) = πi,j(t) ∀i, j.
            Initialize S1i(t+1) = {}.
            for j = 1, ..., N
                if ∃i s.t. π1i,j(t+1) > 0
                    Update S1i(t+1) = S1i(t+1) ∪ {j}.
                end if
            end for
        end for
    end for
    // Solve scheduling and placement for the low-priority class
    // with given {z(t), π1i,j(t+1), S1i(t+1)}
    Step 3: for all low-priority jobs
        // Solve scheduling as in Step 1 with given {z(t), S1i(t+1)}
        // Solve placement as in Step 2 with given {z(t), π1i,j(t+1)}
        Update S2i(t+1) and π2i,j(t+1).
    end for
    // Update bound for both classes given {πi,j(t+1), Si(t+1)}
    z(t+1) = arg min_{z∈R} (9).
    Update objective value O(t+1).
    Update t = t + 1.
end while
Output: {Si(t), πi,j(t)}

Fig. 3. Algorithm JLOP: Our proposed algorithm for solving Problem JLOP.

different queues by choosing the weights wjk, (ii) a scheduling sub-problem for determining the access probabilities πi,j, and (iii) a placement sub-problem that selects a subset Si of nodes to host the encoded chunks of file i.

We first recognize that the scheduling sub-problem is convex; this follows from the convexity of T̄k in Λjk shown in [27] and the fact that Λjk is a linear combination of the πij. Second, the placement sub-problem can again be cast as a matching, similar to the one proposed for priority queuing, resulting in a bipartite matching that can be solved efficiently. Finally, we show that the weight sub-problem is convex with respect to wjk in the following lemma.

Lemma 3: (Convexity of the bandwidth allocation sub-problem.) When {z, πi,j, Si} are fixed, Problem JLWO is a convex optimization over the weights {wjk}.
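Since Lemma 3 makes the weight sub-problem convex, a one-node, two-class instance can be solved by simple ternary search; the M/G/1 waiting-time helper, moment values, and function names below are our own illustrative assumptions, not the paper's solver.

```python
def mg1_wait(lam, ex, ex2, w):
    """Mean M/G/1 waiting time when the queue receives a fraction w of
    the node bandwidth, so service time scales by 1/w (sketch;
    ex = E[X] and ex2 = E[X^2] at full bandwidth)."""
    rho = lam * ex / w
    assert rho < 1, "queue must be stable"
    return lam * (ex2 / w ** 2) / (2 * (1 - rho))

def best_weight_split(lam1, lam2, ex, ex2, c1, c2, iters=200):
    """Ternary search for the weight w1 minimizing
    c1*W1(w1) + c2*W2(1 - w1); valid because the objective is
    convex in the weights (Lemma 3)."""
    lo = lam1 * ex + 1e-6       # stability: w1 > λ1 E[X]
    hi = 1 - lam2 * ex - 1e-6   # and 1 - w1 > λ2 E[X]
    def obj(w):
        return (c1 * mg1_wait(lam1, ex, ex2, w)
                + c2 * mg1_wait(lam2, ex, ex2, 1 - w))
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if obj(m1) < obj(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

With equal loads and equal class prices the optimal split is symmetric; lowering C2 relative to C1 shifts bandwidth toward class 1, matching the behavior observed in the experiments below.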

We propose Algorithm JLWO to solve the differentiated latency optimization in weighted queuing by iteratively computing optimal solutions to the three sub-problems. In particular, Step 1 finds the optimal bandwidth allocation for fixed chunk


Algorithm JLWO:

Initialize t = 0, ε > 0.
Initialize feasible {z(0), πi,j(0), Si(0), wjk(0)}.
while O(t) − O(t−1) > ε
    // Solve bandwidth allocation for given {z(t), πi,j(t), Si(t)}
    wjk(t+1) = arg min_{wjk} (15) s.t. (19).
    // Solve scheduling for given {z(t), Si(t), wjk(t+1)}
    πi,j(t+1) = arg min_{πi,j} (15) s.t. (16), (17).
    // Solve placement for given {z(t), wjk(t+1), πi,j(t+1)}
    for i = 1, ..., r
        Calculate Λ−ij using {πi,j(t+1)}.
        Calculate Du,v from (14).
        (β(j) ∀j ∈ N) = Hungarian Algorithm({Du,v}).
        Update πi,β(j)(t+1) = πi,j(t) ∀i, j.
        Initialize Si(t+1) = {}.
        for j = 1, ..., N
            if ∃i s.t. πi,j(t+1) > 0
                Update Si(t+1) = Si(t+1) ∪ {j}.
            end if
        end for
    end for
    // Update bound for given {wjk(t+1), πi,j(t+1), Si(t+1)}
    z(t+1) = arg min_{z∈R} (15).
    Update objective value O(t+1) = (15).
    Update t = t + 1.
end while
Output: {Si(t), πi,j(t), wjk(t)}

Fig. 4. Algorithm JLWO: Our proposed algorithm for solving Problem JLWO.

placement and request scheduling via a convex optimization (due to Lemma 3) over the weights wjk. Next, Step 2 solves for the optimal scheduling probabilities, and Step 3 solves for the optimal chunk placement using bipartite matching. Since each step finds an optimal solution for its sub-problem, the iterative algorithm generates a monotonic sequence of objective values and converges to a locally optimal solution.

V. IMPLEMENTATION AND EVALUATION

A. Tahoe Testbed

To validate our proposed algorithms for the priority and weighted queuing models, and to evaluate their performance, we implemented the algorithms in Tahoe [38], an open-source distributed file system based on the zfec erasure coding library. It provides three special instances of a generic node: (a) Tahoe Introducer, which keeps track of a collection of storage servers and clients and introduces them to each other; (b) Tahoe Storage Server, which exposes attached storage to external clients and stores erasure-coded shares; and (c) Tahoe Client, which processes upload/download requests and connects to storage servers through a Web-based REST API and the Tahoe-LAFS (Least-Authority File System) storage protocol over SSL.

Our experiments were conducted on a Tahoe testbed consisting of three separate hosts in an OpenStack cluster running Havana, where we reserved bandwidth between the hosts in different availability zones (using a bandwidth control tool from our Cloud QoS platform) to simulate three separate data centers. The virtual machines (VMs) on the same host are treated as hosts belonging to the same availability zone, i.e., one data center. Across the three simulated data centers, we ran 12 storage servers on 12 virtual hosts, one introducer on a separate virtual host, and one extra-large client node (with sufficient CPU and memory capacity) on another virtual host to initiate all client requests. The 12 storage servers are evenly distributed across the three data centers; i.e., we have four VM instances (storage servers) in each of the three OpenStack availability zones in our cluster. We use an extra-large client node because we initiate a large number of requests through it, using multiple Tahoe ports to simulate multiple clients on this node. All VMs have a 100GB volume attached for storing chunks in the case of storage servers and clients, or meta information in the case of the introducer. Our Tahoe testbed is shown in Fig. 5.

Fig. 5. Our Tahoe testbed with three hosts, each running 4 VMs as storage servers.

B. Basic Experiment Setup

We use a (7,4) erasure code in the Tahoe testbed throughout the experiments described in this section; however, we use two different experiment setups for the two queuing models.

For priority queuing, Algorithm JLOP provides the chunk placement and request scheduling decisions that minimize average latency for both the high-priority and low-priority classes, which we also denote as class 1 (high priority) and class 2 (low priority). We assign the arrival rates of the two classes and generate file requests of class 1 and class 2 one by one based on these arrival rates. File requests are divided into chunk requests and then dispatched to the servers. When chunk requests arrive at a server, class 2 requests enter at the end of the queue, while class 1 requests enter behind any queued class 1 requests but before all class 2 requests. Since we use non-preemptive priority queuing, a class 2 request already in service before a class 1 request enters the queue is allowed to complete without interference from class 1 requests.
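The dispatch rule described above can be sketched as a small two-class queue; the class and method names are our own, and a real server would attach service-time handling around it.

```python
from collections import deque

class TwoClassQueue:
    """Non-preemptive two-class priority queue: class 1 chunk requests
    queue ahead of all waiting class 2 requests, but a class 2 request
    already in service is always allowed to finish."""
    def __init__(self):
        self.waiting = {1: deque(), 2: deque()}  # per-class FIFO queues
        self.in_service = None
    def arrive(self, req, cls):
        self.waiting[cls].append(req)
    def start_next(self):
        # called only when the server is idle; never preempts
        if self.in_service is None:
            for cls in (1, 2):                   # class 1 always first
                if self.waiting[cls]:
                    self.in_service = self.waiting[cls].popleft()
                    break
        return self.in_service
    def complete(self):
        done, self.in_service = self.in_service, None
        return done
```

A class 2 request that has started service is completed even if a class 1 request arrives in the meantime, which is exactly the non-preemptive behavior assumed in the analysis.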

For weighted queuing, requests are generated based on the arrival rates of the two classes; the system then calls a bandwidth reservation tool to reserve the assigned bandwidth Bwi,j on each server based on the weights produced by the algorithm, and then submits all the requests of the two classes. The chunk placement, request scheduling, and weight assignment decisions come from Algorithm JLWO.


C. Experiments and Evaluation

Convergence of Algorithms. We implemented Algorithms JLOP and JLWO in MATLAB, where the convex optimization sub-problems are solved using MOSEK, a commercial optimization solver (although any standard convex solver could be used). For the 12 distributed storage servers in our testbed, Fig. 6 demonstrates the convergence of our algorithms, which optimize the latency of the two classes in both queuing models over chunk placement Si, request scheduling πi,j, and bandwidth weights wjk (for the weighted queuing model). Having established the convexity and solvability of the sub-problems earlier, we now examine the convergence of the proposed algorithms in Fig. 6. Our algorithms for the two models efficiently solve the optimization problem with r = 1000 files. For weighted queuing we set C1 = 1, C2 = 0.4 in the objective function. For priority queuing, we plot the latency of the two classes with the class 2 latency scaled by a factor of 0.05; this is necessary to show T1 and T2 on the same scale, since class 2 experiences much higher latency. The normalized objective converges within 175 iterations for a tolerance ε = 0.01. To achieve dynamic file management, our optimization algorithms can be executed repeatedly upon file arrivals and departures.
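The convergence behavior rests on alternating minimization: each sub-problem is solved exactly, so the objective never increases, and the loop stops once the improvement drops below ε. A schematic of that loop (not the paper's MATLAB/MOSEK implementation; all names are ours):

```python
def alternating_minimize(obj, steps, x0, eps=0.01, max_iter=1000):
    """Run block-coordinate steps (e.g. weights, scheduling, placement)
    until the objective improves by less than eps per sweep. Each step
    must return a state with objective no worse than its input."""
    x = x0
    prev = obj(x)
    for t in range(1, max_iter + 1):
        for step in steps:
            x = step(x)
        cur = obj(x)
        assert cur <= prev + 1e-12, "exact sub-problem steps never increase obj"
        if prev - cur <= eps:
            return x, cur, t
        prev = cur
    return x, prev, max_iter
```

For a separable quadratic with exact per-block minimizers, one sweep reaches the optimum and a second sweep certifies the ε-tolerance, mirroring how JLOP/JLWO detect convergence of O(t).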

Validation of Experiment Setup. While our service delay bound applies to arbitrary service time distributions and to systems hosting any number of files, we first run an experiment to understand the actual service time distribution of each class, for both the priority and weighted queuing models, on our testbed. We upload r = 1000 files of size 100MB using a (7, 4) erasure code. All access experiments run for a 10-hour duration to allow sufficient averaging. The priority queuing experiment uses an aggregate request arrival rate of 0.28/sec for both classes, of which class 1 requests arrive at 0.0122/sec and class 2 requests at 0.2678/sec, giving a class 1/class 2 ratio of 1/22. This ratio is chosen so that the interval between class 1 arrivals is comparable to the chunk service time, giving class 2 requests a chance to be serviced. For weighted queuing, the aggregate request arrival rate is also 0.28/sec, with each class arriving at 0.14/sec (a class 1/class 2 ratio of 1/1) and bandwidth weights of 0.6 for class 1 and 0.4 for class 2. We then measure the chunk service times of class 1 and class 2 at the servers for both models. Figure 7 depicts the cumulative distribution function (CDF) of the chunk service time for each class in each queuing model. Both classes in priority queuing have a mean chunk service time of around 22 sec, while in weighted queuing class 1 has a mean chunk service time of 35 sec and class 2 of 60 sec. The chunk service time thus scales inversely with the bandwidth reserved according to the weight assignments for the M/G/1 queues, validating our assumption in the analysis of weighted queues.

Validation of algorithms and joint optimization. To validate that the algorithms work for the two queuing models, we choose r = 1000 files of size 100MB and an aggregate arrival rate of 0.28/sec. For weighted queuing we use a class 1/class 2 ratio of 1:1, i.e., an arrival rate of 0.14/sec for each class. We upload the heterogeneous 100MB files to obtain the service time statistics (including mean, variance, and second and third moments) of class 1 and class 2 at all storage nodes when running the two queuing models. We fix C1 = 1, vary C2 from 0 to 1, and run Algorithm JLOP (which is not affected by C2) and Algorithm JLWO to generate the optimal solutions for priority queuing and weighted queuing, respectively. The algorithms provide chunk placement and request scheduling for both models, plus weight assignment for weighted queuing. We then measure the average retrieval latency of the r files. For priority queuing, the aggregate request arrival rate is 0.28/sec with a class 1/class 2 ratio of 1:22. We repeat the priority queuing experiment using the placement and scheduling decisions from Algorithm JLOP and average the latency to obtain data points for both priority classes in Fig. 8, even though C2 does not affect latency in priority queuing. From Fig. 8 we can see that, for weighted queuing, the latency of class 2 increases as C2 decreases, i.e., as class 2 becomes less important; likewise, the average latency of class 1 requests decreases as C2 decreases. This matches the expectation that when class 2 becomes more important, more weight is allocated to class 2, and since C2

is always smaller than C1, class 1 receives more bandwidth. For priority queuing, even with a much lower class 1 arrival rate than in the weighted queuing model, class 2 requests rarely get a chance to be served due to the priority policy. They therefore experience extremely long latency compared with class 1 requests, and even compared with both classes in weighted queuing. Fig. 8 shows that weighted queuing provides much more fairness for class 2 requests than priority queuing.

Evaluation of the performance of our solution: priority queuing. To demonstrate the effectiveness of Algorithm JLOP for priority queuing, we vary the file size in the experiments from 50MB to 250MB with an aggregate request arrival rate of 0.28/sec, a high/low priority ratio of 1/22, and erasure code (7,4). We choose r = 1000 files, with 200 files of each size, i.e., 200 files of 50MB, 200 of 100MB, and so on. We initiate retrieval requests for these files and plot the average latency for files of each size, together with our analytic latency upper bound, in Fig. 9. The average latency increases almost linearly with file size for the high-priority class, since the service time of each chunk increases linearly and the queuing delay depends only on high-priority requests. For low-priority requests, however, latency increases super-linearly with file size, since latency depends mainly on whether a request gets a chance to be served at all: queuing delay dominates, and service time is much smaller by comparison. We also observe that our analytic latency bound tightly follows the actual average service latency for both classes.

Next, we fix the arrival rate of low-priority requests at 0.14/sec and vary the arrival rate of the high-priority class from λ1 = 0.015/sec to λ1 = 0.027/sec with file size 200MB. The actual service delay and our analytic bound for each class are shown as a bar plot in Fig. 10; the bound provides a close estimate of the service latency. As the high-priority arrival rate increases, the latency of the low-priority class shows logarithmic growth: the probability that a low-priority request gets served drops dramatically as the high-priority arrival rate increases, leading to extremely long queuing delays for


Fig. 6. Convergence of Algorithms JLOP and JLWO for heterogeneous files on our 12-node testbed. Both algorithms efficiently compute the solution within 175 iterations.

Fig. 7. Actual service time distribution for both classes under priority queuing and weighted queuing; each model hosts r = 1000 files, each of size 100MB, using erasure code (7,4) with an aggregate request arrival rate of 0.28/sec.

Fig. 8. r = 1000 files, each of size 100MB; the aggregate request arrival rate for both classes is 0.28/sec under both priority and weighted queuing. Varying C2 validates our algorithms; weighted queuing provides more fairness to class 2 requests.

Fig. 9. Evaluation of different file sizes in priority queuing (low-priority statistics, both experiment and bound, use the secondary axis). Latency increases quickly as file size grows due to the queuing delay of both classes. Our analytic latency bound, accounting for both network and queuing delay, tightly follows the actual service latency.

Fig. 10. Evaluation of different request arrival rates in priority queuing, with fixed λ2 = 0.14/sec and varying λ1. As the arrival rate of the high-priority class increases, the latency of low-priority requests shows logarithmic growth.

Fig. 11. Evaluation of different file sizes in weighted queuing. The latency increase shows more fairness for class 2 requests. Our analytic latency bound, accounting for both network and queuing delay, tightly follows the actual service latency for both classes.

low priority requests. This shows extreme unfairness for class2 requests.

Evaluation of the performance of our solution: weighted queuing. For weighted queuing, we design an experiment similar to the priority queuing case to compare results. First, we vary the file size from 50MB to 250MB with an aggregate request arrival rate of 0.28/sec and a class 1/class 2 ratio of 1/1. We have 1000 heterogeneous files in total, of which every 200 files share the same size. A (7,4) erasure code is applied, with C1 = 1 and C2 = 0.4, using the same workload combination as in the file size experiment for priority queuing. As shown in Fig. 11, latency increases with file size for both classes, but much faster for class 2 than for class 1. This is because class 2 requests typically receive a small portion of the service bandwidth, so increasing the file size increases their service time, and hence their queuing time, far more than for class 1 requests, which receive more bandwidth. The analytic bound again tightly follows the actual latency for both classes. However, even though class 2 requests experience longer latency than class 1 requests, the latency remains within a reasonable range, unlike in priority queuing. Weighted queuing thus provides more fairness across classes.

Next, we vary the aggregate arrival rate of class 1 and class 2 requests from 0.22/sec to 0.34/sec with file size 200MB, keeping the class 1/class 2 ratio at 1, as shown in Fig. 12. The actual service delay and our analytic bound for each class are shown in Fig. 12, where the bound provides a close estimate of the service latency. As the arrival rate increases, the latency of class 2 grows much faster than that of class 1. This is because increasing the class 1 arrival rate shifts more bandwidth to class 1 to reduce its latency, further decreasing the bandwidth available to class 2. Moreover, the increased workload cannot be fully compensated by reallocating bandwidth, so the latency of class 1 increases as well.

Fig. 12. Evaluation of different request arrival rates in weighted queuing. As the arrival rate increases, the latency increase shows more fairness for class 2 requests compared to priority queuing.


VI. CONCLUSIONS

Relying on a novel probabilistic scheduling policy, this paper develops an analytical upper bound on the average service delay of multi-tenant, erasure-coded storage with an arbitrary number of files and any service time distribution, using weighted queuing or priority queuing to provide differentiated services to different tenants. An optimized distributed storage system is then formalized using these queues. Even though only local optimality can be guaranteed due to the non-convex nature of the problems, the proposed algorithms significantly reduce latency. Both our theoretical analysis and algorithm design are validated via a prototype in Tahoe, an open-source distributed file system, deployed in a cloud storage testbed that simulates three geographically distributed data centers through bandwidth reservations. Note that this paper focuses on two service classes only; an extension to a general number of service classes remains an open research problem.

VII. APPENDIX

A. Theorem 1

Proof: Following the results in [39, Chapter 3] and [40, Chapter 7], the mean queuing delay of the high-priority class at server j is given by

E[Qj1] = E[R] / (1 − ρj1),

where E[R] is the mean residual service time, given by

E[R] = Σ_{k=1}^{2} Λjk Γj,2 / 2,

and Λj = Σ_{k=1}^{2} Λjk is the aggregate request arrival rate of the two priority classes at server j. The mean queuing delay of the low-priority class at server j is given by

E[Qj2] = E[R] / ((1 − ρj1)(1 − ρj1 − ρj2)).

The second moment of the queuing delay of the high-priority class at server j is

E[Q²j1] = 2(E[Qj1])² + ρj1 E[R²] / (1 − ρj1),

where E[R] = E[X²j] / (2E[Xj]) and E[R²] = E[X³j] / (3E[Xj]). The variance of the queuing delay of the high-priority class at server j is

Var[Qj1] = E[Q²j1] − (E[Qj1])² = (ρj1 / (1 − ρj1)²) (ρj1 (E[R])² + (1 − ρj1) E[R²]).   (21)

The second moment of the queuing delay of the low-priority class at server j is

E[Q²j2] = 2(E[Qj2])² + (ρj1 + ρj2) E[R²] / ((1 − ρj1)²(1 − ρj2)),

and the variance of the queuing delay of the low-priority class at server j is

Var[Qj2] = E[Q²j2] − (E[Qj2])² = ((ρj1 + ρj2) / ((1 − ρj1)²(1 − ρj2)²)) ((1 − ρj2) E[R²] + [(ρj1 + ρj2) + ρj1(1 − ρj2)/(1 − ρj1)] (E[R])²).   (22)

This further gives

Var[Qj2] = (Λj1 + Λj2)Γj,3 / (3(1 − ρj1)²(1 − ρj2)) + ((Λj1 + Λj2)Γj,2 / (2(1 − ρj1)(1 − ρj2)))² + Λj1(Λj1 + Λj2)Γ²j,2 / (2(1 − ρj1)³(1 − ρj2)).   (23)

These expressions yield the result in Theorem 1 after simplification.
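A small numeric sketch of the mean-delay formulas at the start of this proof; the function name and moment values are our own illustrative assumptions, with Γj,2 playing the role of the second moment of service time.

```python
def priority_queue_means(lam1, lam2, ex, gamma_2):
    """Mean queuing delays E[Q_j1], E[Q_j2] of a non-preemptive M/G/1
    priority queue at one server: per-class arrival rates lam1, lam2,
    mean service time ex = E[X], second moment gamma_2 = E[X^2]."""
    rho1, rho2 = lam1 * ex, lam2 * ex      # per-class utilizations
    assert rho1 + rho2 < 1, "server must be stable"
    R = (lam1 + lam2) * gamma_2 / 2        # mean residual service time E[R]
    q1 = R / (1 - rho1)
    q2 = R / ((1 - rho1) * (1 - rho1 - rho2))
    return q1, q2
```

Raising the low-priority load increases E[R] and hence both delays, but the extra (1 − ρj1 − ρj2) factor makes E[Qj2] grow much faster, consistent with the unfairness observed in the experiments.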

B. Theorem 2

Proof: From the weighted queuing model, each queue is served at a different service rate, so the service time in the weighted queue with weight w_{jk} is E[X]/w_{jk}, encountered with probability P_{jk}. The expected service time is therefore

\sum_{k=1}^{n} \frac{P_{jk} E[X]}{w_{jk}},

and the variance of the service time is E[X_{jk}^2] - (E[X_{jk}])^2, which gives Var[X_{jk}] as the second term in E[Z_j]. We also have

E[Q_j] = \sum_{k=1}^{n} P_{jk} E[Q_{jk}],

where E[Q_{jk}] is the expected waiting time of requests in the M/G/1 queue with weight w_{jk}. It has the same form as the one derived for a single priority class in priority queuing, with the service rate scaled accordingly:

E[Q_{jk}] = \frac{\Lambda_{jk} P_{jk} E[X^2]/w_{jk}^2}{2(1 - \Lambda_{jk} P_{jk} E[X]/w_{jk})}.

We can then obtain

E[Q_j] = \sum_{k=1}^{n} P_{jk} \frac{\Lambda_{jk} P_{jk} E[X^2]/w_{jk}^2}{2(1 - \Lambda_{jk} P_{jk} E[X]/w_{jk})}.

After simplification, we obtain the third term in the statement of the theorem. For the variance, we apply the result derived for the single-priority-class model with weighted service rates and average over the distribution probabilities of the weighted queues to get

Var[Q_j] = \sum_{k=1}^{n} P_{jk} \left( \frac{\Lambda_{jk} P_{jk} E[X^3]/w_{jk}^3}{3(1 - \Lambda_{jk} P_{jk} E[X]/w_{jk})} + \frac{\Lambda_{jk}^2 P_{jk}^2 (E[X^2])^2/w_{jk}^4}{4(1 - \Lambda_{jk} P_{jk} E[X]/w_{jk})^2} \right). \quad (24)

A simplification leads to the equation in Theorem 2.
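The aggregation step above — per-queue Pollaczek–Khinchine waiting times averaged with the scheduling probabilities — can be sketched numerically. The arrival rates, weights, and probabilities below are assumed example values, not taken from the paper:

```python
# Sketch of the weighted-queue aggregation: a request joins queue k with
# probability P_jk; queue k is an M/G/1 queue whose service time is scaled
# by 1/w_jk; the mean waits are then averaged over P_jk.
# All parameter values are assumed examples.

mu = 10.0
EX, EX2 = 1.0 / mu, 2.0 / mu**2     # service-time moments (exponential)
Lam = [2.0, 3.0]                    # class arrival rates Lambda_jk
P = [0.5, 0.5]                      # scheduling probabilities P_jk
w = [0.3, 0.7]                      # queue weights w_jk

EQ = 0.0
for k in range(2):
    lam_k = Lam[k] * P[k]           # arrival rate into queue k
    rho_k = lam_k * EX / w[k]       # load of queue k
    assert rho_k < 1.0              # stability of queue k
    # Pollaczek-Khinchine mean wait with service time scaled by 1/w_jk
    W_k = lam_k * (EX2 / w[k]**2) / (2.0 * (1.0 - rho_k))
    EQ += P[k] * W_k
print(EQ)
```

Note how a small weight inflates both the numerator (through 1/w_{jk}^2) and the load \rho_{jk}, which is exactly the trade-off the weight optimization in the paper exploits.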


C. Lemma 2

Proof: As shown in Equation (13), \bar{T}_{ik} depends on F(z, \pi_{ij}) for weighted queuing. Since we have

F(z, \Lambda_{j2}) = A_{j2} + \sqrt{A_{j2}^2 + B_{j2}^2},

we find the second-order derivative of A_{j2} with respect to \Lambda_{j2}:

\frac{\partial^2 A_{j2}}{\partial \Lambda_{j2}^2} = \frac{\Gamma_{j,2}}{(\rho_{j1} - 1)(\rho_{j1}\mu_j + \Lambda_{j2} - \mu_j)^3}. \quad (25)

We note that \rho_{j1} - 1 < 0, and (\rho_{j1}\mu_j + \Lambda_{j2} - \mu_j)^3 is negative as long as \rho_{j2} < 1 - \rho_{j1}. Since \Lambda_{j2} is a linear combination of the \pi_{ij} of the low-priority class, A_{j2} is convex in \pi_{ij} as long as \Lambda_{j2} < (1 - \rho_{j1})\mu_j. We find the second-order derivative of B_{j2} with respect to \Lambda_{j2}:

\frac{\partial^2 B_{j2}}{\partial \Lambda_{j2}^2} = \frac{2\Gamma_{j,3}\mu_j(\Lambda_{j1} + \mu_j)}{3(1 - \rho_{j1})^2(\mu_j - \Lambda_{j2})^3} + \frac{\Gamma_{j,2}^2\mu_j^2(\Lambda_{j1} + \mu_j)(3\Lambda_{j1} + \mu_j + 2\Lambda_{j2})}{2(1 - \rho_{j1})^2(\mu_j - \Lambda_{j2})^4} + \frac{\Lambda_{j1}\Gamma_{j,2}^2\mu_j(\Lambda_{j1} + \mu_j)}{(1 - \rho_{j1})^3(\mu_j - \Lambda_{j2})^3}. \quad (26)

Here, all three terms are positive as long as \Lambda_{jk} < \mu_j. Thus, both A_{j2} and B_{j2} are convex in \Lambda_{j2} (and hence in \pi_{ij}). Now we prove that F is convex in A_{j2} and B_{j2}:

\frac{\partial^2 F}{\partial \Lambda_{j2}^2} = \frac{\partial^2 A_{j2}}{\partial \Lambda_{j2}^2} + \frac{A_{j2}\frac{\partial^2 A_{j2}}{\partial \Lambda_{j2}^2} + B_{j2}\frac{\partial^2 B_{j2}}{\partial \Lambda_{j2}^2}}{(A_{j2}^2 + B_{j2}^2)^{1/2}} + \frac{\left(A_{j2}\frac{\partial B_{j2}}{\partial \Lambda_{j2}} - B_{j2}\frac{\partial A_{j2}}{\partial \Lambda_{j2}}\right)^2}{(A_{j2}^2 + B_{j2}^2)^{3/2}}. \quad (27)

Since we have proved that \frac{\partial^2 A_{j2}}{\partial \Lambda_{j2}^2} > 0 and \frac{\partial^2 B_{j2}}{\partial \Lambda_{j2}^2} > 0, it is easy to see from Equation (27) that \frac{\partial^2 F}{\partial \Lambda_{j2}^2} > 0. Since F is increasing and convex in A_{j2} and B_{j2}, and A_{j2} and B_{j2} are both convex in \Lambda_{j2}, we conclude that their composition F(z, \Lambda_{j2}) is also convex in \Lambda_{j2}. This completes the proof.
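The composition argument can be spot-checked numerically. The functions below are toy stand-ins for A_{j2} and B_{j2} (positive, increasing, and convex on x > 0), not the paper's actual expressions; for such inputs the second differences of F stay nonnegative:

```python
import math

# Toy convex, positive, increasing stand-ins for A_j2 and B_j2 on x > 0.
# These are NOT the paper's expressions; they only illustrate that
# F = A + sqrt(A^2 + B^2) preserves convexity under composition.
A = lambda x: x**2 + 1.0
B = lambda x: math.exp(x)
F = lambda x: A(x) + math.sqrt(A(x)**2 + B(x)**2)

# Finite-difference second derivative of F on a grid: should be >= 0
# up to floating-point noise if F is convex.
h = 1e-3
grid = [0.1 * i for i in range(1, 30)]
second_diffs = [(F(x + h) - 2.0 * F(x) + F(x - h)) / h**2 for x in grid]
print(min(second_diffs))
```

This mirrors the structure of Equation (27): with A, B > 0 and A'', B'' > 0, every term in the second derivative of F is nonnegative.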

D. Lemma 3

Proof: As shown in Equation (13), \bar{T}_1 and \bar{T}_2 depend on F(z, \pi_{ij}, w_{jk}) for weighted queuing. Since we have

F(z, \Lambda_{jk}, w_{jk}) = A_{jk} + \sqrt{A_{jk}^2 + B_{jk}^2},

we find the second-order derivative of A_{jk} with respect to w_{jk}:

\frac{\partial^2 A_{jk}}{\partial w_{jk}^2} = \frac{\Lambda_{jk} P_{jk}^2 \mu_j \Gamma_{j,2}\left(\left(\frac{3\mu_j w_{jk}}{2} - \Lambda_{jk} P_{jk}\right)^2 + \frac{3\mu_j^2 w_{jk}^2}{4}\right)}{w_{jk}^3(\mu_j w_{jk} - \Lambda_{jk} P_{jk})^3}, \quad (28)

where the numerator is positive since it is the sum of the two squares \left(\frac{3\mu_j w_{jk}}{2} - \Lambda_{jk} P_{jk}\right)^2 and \frac{3\mu_j^2 w_{jk}^2}{4}, and the denominator is positive as long as w_{jk} > \frac{\Lambda_{jk} P_{jk}}{\mu_j}.

The second-order derivative of B_{jk} with respect to w_{jk} is given by

\frac{\partial^2 B_{jk}}{\partial w_{jk}^2} = \frac{\Lambda_{jk}^2 P_{jk}^3 \Gamma_{j,2}^2}{2 w_{jk}^2 \left(w_{jk} - \frac{\Lambda_{jk} P_{jk}}{\mu_j}\right)^2} \cdot \left(\frac{3}{w_{jk}^2} + \frac{4}{w_{jk}\left(w_{jk} - \frac{\Lambda_{jk} P_{jk}}{\mu_j}\right)} + \frac{3}{\left(w_{jk} - \frac{\Lambda_{jk} P_{jk}}{\mu_j}\right)^2}\right), \quad (29)

which is positive as long as w_{jk} > \frac{\Lambda_{jk} P_{jk}}{\mu_j}. Thus, both A_{jk} and B_{jk} are convex in w_{jk}. As shown in the proof of Lemma 2, F is convex in A_{jk} and B_{jk}. Since F is increasing and convex in A_{jk} and B_{jk}, and A_{jk} and B_{jk} are both convex in w_{jk}, we conclude that their composition F(z, \Lambda_{jk}, w_{jk}) is also convex in w_{jk}. This completes the proof.

REFERENCES

[1] Y. Xiang, T. Lan, V. Aggarwal, and Y. R. Chen, "Multi-Tenant Latency Optimization in Erasure-Coded Storage with Differentiated Services," in Proc. ICDCS, Jun.-Jul. 2015.

[2] E. Schurman and J. Brutlag, "The user and business impact of server delays, additional bytes and http chunking in web search," O'Reilly Velocity Web Performance and Operations Conference, June 2009.

[3] Y. Lu, Q. Xie, G. Kliot, A. Geller, J. Larus, and A. Greenberg, "Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services," in Proc. IFIP Performance, 2010.

[4] A. G. Dimakis, V. Prabhakaran, and K. Ramchandran, "Distributed data storage in sensor networks using decentralized erasure codes," in Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, 2004.

[5] M. K. Aguilera, R. Janakiraman, and L. Xu, "Using Erasure Codes Efficiently for Storage in a Distributed System," in Proceedings of the 2005 International Conference on DSN, pp. 336-345, 2005.

[6] H. Kameyama and Y. Sato, "Erasure Codes with Small Overhead Factor and Their Distributed Storage Applications," CISS '07, 41st Annual Conference, 2007.

[7] A. Fallahi and E. Hossain, "Distributed and energy-aware MAC for differentiated services wireless packet networks: a general queuing analytical framework," IEEE Transactions on Mobile Computing, 2007.

[8] A. S. Alfa, "Matrix-geometric solution of discrete time MAP/PH/1 priority queue," Naval Research Logistics, vol. 45, pp. 23-50, 1998.

[9] N. E. Taylor and Z. G. Ives, "Reliable storage and querying for collaborative data sharing systems," in Proc. IEEE ICDE, 2010.

[10] J. H. Kim and J. K. Lee, "Performance of carrier sense multiple access with collision avoidance in wireless LANs," in Proc. IEEE IPDS, 1998.

[11] E. Ziouva and T. Antonakopoulos, "CSMA/CA performance under high traffic conditions: throughput and delay analysis," Computer Communications, vol. 25, pp. 313-321, 2002.

[12] S. Mochan and L. Xu, "Quantifying Benefit and Cost of Erasure Code based File Systems," technical report available at http://nisl.wayne.edu/Papers/Tech/cbefs.pdf, 2010.

[13] H. Weatherspoon and J. D. Kubiatowicz, "Erasure Coding vs. Replication: A Quantitative Comparison," in Proceedings of the First IPTPS, 2002.

[14] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: Attack of the clones," in Proceedings of the USENIX Conference on Networked Systems Design and Implementation, 2013.

[15] H. Xu and W. C. Lau, "Optimization for speculative execution in big data processing clusters," IEEE Transactions on Parallel and Distributed Systems, 2016.

[16] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, "Reining in the outliers in mapreduce clusters using Mantri," in Proceedings of the USENIX Conference on Operating Systems Design and Implementation, 2010.

[17] D. Wang, G. Joshi, and G. Wornell, "Using straggler replication to reduce latency in large-scale parallel computing," ACM SIGMETRICS Performance Evaluation Review, 2015.

[18] C. Anglano, R. Gaeta, and M. Grangetto, "Exploiting Rateless Codes in Cloud Storage Systems," IEEE Transactions on Parallel and Distributed Systems, pre-print, 2014.

[19] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, "Codes Can Reduce Queueing Delay in Data Centers," CoRR, vol. abs/1202.1359, 2012.


[20] N. Shah, K. Lee, and K. Ramchandran, "The MDS queue: analyzing latency performance of codes and redundant requests," in Proc. IEEE International Symposium on Information Theory, July 2014.

[21] F. Baccelli, A. Makowski, and A. Shwartz, "The fork-join queue and related systems with synchronization constraints: stochastic ordering and computable bounds," Advances in Applied Probability, pp. 629-660, 1989.

[22] R. Pedarsani, J. Walrand, and Y. Zhong, "Robust scheduling in a flexible fork-join network," in Proc. 53rd IEEE Conference on Decision and Control, Los Angeles, CA, 2014, pp. 3669-3676.

[23] E. Ozkan and A. R. Ward, "On the Control of Fork-Join Networks," arXiv:1505.04470v2, Jun 2016.

[24] G. Joshi, Y. Liu, and E. Soljanin, "On the Delay-Storage Trade-off in Content Download from Coded Distributed Storage Systems," IEEE JSAC, vol. 32, no. 5, May 2014.

[25] S. Chen, Y. Sun, U. C. Kozat, L. Huang, P. Sinha, G. Liang, X. Liu, and N. B. Shroff, "When Queuing Meets Coding: Optimal-Latency Data Retrieving Scheme in Storage Clouds," in Proc. IEEE Infocom, April 2014.

[26] A. Kumar, R. Tandon, and T. C. Clancy, "On the Latency of Erasure-Coded Cloud Storage Systems," arXiv:1405.2833, May 2014.

[27] Y. Xiang, T. Lan, V. Aggarwal, and Y. R. Chen, "Joint Latency and Cost Optimization for Erasure-coded Data Center Storage," IEEE/ACM Transactions on Networking, vol. 24, no. 4, Aug. 2016.

[28] Y. Xiang, T. Lan, V. Aggarwal, and Y. R. Chen, "Joint Latency and Cost Optimization for Erasure-coded Data Center Storage," IEEE/ACM Transactions on Networking, vol. 24, no. 4, pp. 2443-2457, Aug. 2016.

[29] Y. Xiang, V. Aggarwal, T. Lan, and Y. R. Chen, "Differentiated latency in data center networks with erasure coded files through traffic engineering," IEEE Transactions on Cloud Computing, 2017.

[30] V. Aggarwal and T. Lan, "Tail Index for a Distributed Storage System with Pareto File Size Distribution," arXiv:1607.06044, Jul 2016.

[31] F. MacWilliams and N. Sloane, The Theory of Error-Correcting Codes. North-Holland, 1977.

[32] D. Bertsimas and K. Natarajan, "Tight bounds on Expected Order Statistics," Probability in the Engineering and Informational Sciences, 2006.

[33] A. Abdelkefi and J. Yuming, "A Structural Analysis of Network Delay," in Proc. Ninth Annual CNSR, 2011.

[34] F. Paganini, A. Tang, A. Ferragut, and L. L. H. Andrew, "Network Stability Under Alpha Fair Bandwidth Allocation With General File Size Distribution," IEEE Transactions on Automatic Control, 2012.

[35] A. B. Downey, "The structural cause of file size distributions," in Proceedings of the Ninth International Symposium on MASCOTS, 2001.

[36] M. Bramson, Y. Lu, and B. Prabhakar, "Randomized load balancing with general service time distributions," in Proceedings of ACM Sigmetrics, 2010.

[37] D. Bertsimas and K. Natarajan, "Tight bounds on Expected Order Statistics," Probability in the Engineering and Informational Sciences, 2006.

[38] B. Warner, Z. Wilcox-O'Hearn, and R. Kinninmont, "Tahoe-LAFS docs," available online at https://tahoe-lafs.org/trac/tahoe-lafs.

[39] L. Kleinrock, Queueing Systems, Volume 2: Computer Applications. John Wiley & Sons, 1976.

[40] J. Sztrik, Basic Queueing Theory. GlobeEdit, OmniScriptum GmbH & Co. KG, Saarbrucken, Germany, 2016. ISBN 978-3-639-73471-3.

Yu Xiang received the B.A.Sc. degree from Harbin Institute of Technology in 2010, and the Ph.D. degree from George Washington University in 2015, both in Electrical Engineering. She is now a senior inventive scientist at AT&T Labs-Research. Her current research interests are in cloud resource optimization, distributed storage systems, and cloud storage charge-back.

Tian Lan (S'03-M'10) received the B.A.Sc. degree from Tsinghua University, China, in 2003, the M.A.Sc. degree from the University of Toronto, Canada, in 2005, and the Ph.D. degree from Princeton University in 2010. Dr. Lan is currently an Associate Professor of Electrical and Computer Engineering at the George Washington University. His research interests include cloud resource optimization, mobile networking, storage systems, and cyber security. Dr. Lan received the 2008 IEEE Signal Processing Society Best Paper Award, the 2009 IEEE GLOBECOM Best Paper Award, and the 2012 INFOCOM Best Paper Award.

Vaneet Aggarwal (S'08-M'11-SM'15) received the B.Tech. degree in 2005 from the Indian Institute of Technology, Kanpur, India, and the M.A. and Ph.D. degrees in 2007 and 2010, respectively, from Princeton University, Princeton, NJ, USA, all in Electrical Engineering.

He is currently an Assistant Professor at Purdue University, West Lafayette, IN. Prior to this, he was a Senior Member of Technical Staff Research at AT&T Labs-Research, NJ, and an Adjunct Assistant Professor at Columbia University, NY. His research interests are in applications of statistical, algebraic, and optimization techniques to distributed storage systems, machine learning, and wireless systems. Dr. Aggarwal was the recipient of Princeton University's Porter Ogden Jacobus Honorific Fellowship in 2009. In addition, he received the AT&T Key Contributor award in 2013, the AT&T Vice President Excellence Award in 2012, and the AT&T Senior Vice President Excellence Award in 2014.

Yih-Farn (Robin) Chen is a Director of Inventive Science, leading the Cloud Platform Software Research department at AT&T Labs - Research. His current research interests include cloud computing, software-defined storage, mobile computing, distributed systems, World Wide Web, and IPTV. He holds a Ph.D. in Computer Science from the University of California at Berkeley, an M.S. in Computer Science from the University of Wisconsin, Madison, and a B.S. in Electrical Engineering from National Taiwan University. Robin is an ACM Distinguished Scientist and a Vice Chair of the International World Wide Web Conferences Steering Committee (IW3C2). He also serves on the editorial board of IEEE Internet Computing.