
Proportional-Share Scheduling for Distributed Storage Systems

Yin Wang∗

University of Michigan
[email protected]

Arif Merchant
HP Laboratories
[email protected]

Abstract

Fully distributed storage systems have gained popularity in the past few years because of their ability to use cheap commodity hardware and their high scalability. While there are a number of algorithms for providing differentiated quality of service to clients of a centralized storage system, the problem has not been solved for distributed storage systems. Providing performance guarantees in distributed storage systems is more complex because clients may have different data layouts and access their data through different coordinators (access nodes), yet the performance guarantees required are global.

This paper presents a distributed scheduling framework. It is an adaptation of fair queuing algorithms for distributed servers. Specifically, upon scheduling each request, it enforces an extra delay (possibly zero) that corresponds to the amount of service the client gets on other servers. Different performance goals, e.g., per storage node proportional sharing, total service proportional sharing, or mixed, can be met by different delay functions. The delay functions can be calculated at coordinators locally, so excess communication is avoided. The analysis and experimental results show that the framework can enforce performance goals under different data layouts and workloads.

1 Introduction

The storage requirements of commercial and institutional organizations are growing rapidly. A popular approach for reducing the resulting cost and complexity of management is to consolidate the separate computing and storage resources of various applications into a common pool. The common resources can then be managed together and shared more efficiently. Distributed storage systems, such as Federated Array of Bricks (FAB) [20], Petal [16], and IceCube [27], are designed to serve as large storage pools. They are built from a number of individual storage nodes, or bricks, but present a single, highly-available store to users. High scalability is another advantage of distributed storage systems. The system can grow smoothly from small to large-scale installations because it is not limited by the capacity of an array or mainframe chassis. This satisfies the needs of service providers to continuously add application workloads onto storage resources.

A data center serving a large enterprise may support thousands of applications. Inevitably, some of these applications will have higher storage performance

∗ This work was done during an internship at HP Laboratories.

Figure 1: A distributed storage system

requirements than others. Traditionally, these requirements have been met by allocating separate storage for such applications; for example, applications with high write rates may be allocated storage on high-end disk arrays with large caches, while other applications live on less expensive, lower-end storage. However, maintaining separate storage hardware in a data center can be a management nightmare. It would be preferable to provide each application with the service level it requires while sharing storage. However, storage systems typically treat all I/O requests equally, which makes differentiated service difficult. Additionally, a bursty I/O workload from one application can cause other applications sharing the same storage to suffer.

One solution to this problem is to specify the performance requirement of each application's storage workload and enable the storage system to ensure that it is met. Thus applications are insulated from the impact of workload surges in other applications. This can be achieved by ordering the requests from the applications appropriately, usually through a centralized scheduler, to coordinate access to the shared resources [5, 6, 24]. The scheduler can be implemented in the server or as a separate interposed request scheduler [2, 12, 17, 29] that treats the storage server as a black box and applies the resource control externally.

Centralized scheduling methods, however, fit poorly with distributed storage systems. To see this, consider the typical distributed storage system shown in Figure 1. The system is composed of bricks; each brick is a computer with a CPU, memory, networking, and storage. In a symmetric system, each brick runs the same software. Data stored by the system is distributed across the bricks. Typically, a client accesses the data through a coordinator,


which locates the bricks where the data resides and performs the I/O operation. A brick may act both as a storage node and a coordinator. Different requests, even from the same client, may be coordinated by different bricks. Two features in this distributed architecture prevent us from applying any existing request scheduling algorithm directly. First, the coordinators are distributed. A coordinator schedules requests possibly without the knowledge of requests processed by other coordinators. Second, the data corresponding to requests from a client could be distributed over many bricks, since a logical volume in a distributed storage system may be striped, replicated, or erasure-coded across many bricks [7]. Our goal is to design a distributed scheduler that can provide service guarantees regardless of the data layout.

This paper proposes a distributed algorithm to enforce proportional sharing of storage resources among streams of requests. Each stream has an assigned weight, and the algorithm reserves for it a minimum share of the system capacity proportional to its weight. Surplus resources are shared among streams with outstanding requests, also in proportion to their weights. System capacity, in this context, can be defined in a variety of ways: for example, the number of I/Os per second, the number of bytes read or written per second, etc. The algorithm is work-conserving: no resource is left idle if there is any request waiting for it. However, it can be shown easily that a work-conserving scheduling algorithm for multiple resources (bricks in our system) cannot achieve proportional sharing in all cases. We present an extension to the basic algorithm that allows per-brick proportional sharing in such cases, or a method that provides a hybrid between system-wide proportional sharing and per-brick proportional sharing. This method allows total proportional sharing when possible while ensuring a minimum level of service on each brick for all streams.

The contribution of this paper includes a novel distributed scheduling framework that can incorporate many existing centralized fair queuing algorithms. Within the framework, several algorithms that are extensions to Start-time Fair Queuing [8] are developed for different system settings and performance goals. To the best of our knowledge, this is the first algorithm that can achieve total service proportional sharing for distributed storage resources with distributed schedulers. We evaluate the proposed algorithms both analytically and experimentally on a FAB system, but the results are applicable to most distributed storage systems. The results confirm that the algorithms allocate resources fairly under various settings: different data layouts, clients accessing the data through multiple coordinators, and fluctuating service demands.

This paper is organized as follows. Section 2 presents an overview of the problem, the background, and the

Figure 2: Data access model of a distributed storage system. Different clients may have different data layouts spreading across different sets of bricks. However, coordinators know all data layouts and can handle requests from any client.

related work. Section 3 describes our distributed fair queueing framework, two instantiations of it, and their properties. Section 4 presents the experimental evaluation of the algorithms. Section 5 concludes.

2 Overview and background

We describe here the distributed storage system that our framework is designed for, the proportional sharing properties it is intended to enforce, the centralized algorithm that we base our work upon, and other related work.

2.1 Distributed Storage Systems

Figure 2 shows the configuration of a typical distributed storage system. The system includes a collection of storage bricks, which might be built from commodity disks, CPUs, and NVRAM. Bricks are connected by a standard network such as gigabit Ethernet. Access to the data on the bricks is handled by the coordinators, which present a virtual disk or logical volume interface to the clients. In the FAB distributed storage system [20], a client may access data through an arbitrary coordinator or a set of coordinators at the same time to balance its load. Coordinators also handle data layout and volume management tasks, such as volume creation, deletion, extension, and migration. In FAB, the coordinators reside on the bricks, but this is not required. We consider local-area distributed storage systems where the network latencies are small compared with disk latencies. We assume that the network bandwidths are sufficiently large that the I/O throughput is limited by the bricks rather than the network.

The data layout is usually designed to optimize properties such as load balance, availability, and reliability. In FAB, a logical volume is divided into a number of segments, which may be distributed across bricks using a replicated or erasure-coded layout. The choice of brick-set for each segment is determined by the storage system. Generally, the layout is opaque to the clients.

The scheduling algorithm we present is designed for such a distributed system, making a minimum of


SYMBOL                    DESCRIPTION

φ_f                       Weight of stream f
p^i_f                     Stream f's i-th request
p^i_{f,A}                 Stream f's i-th request to brick A
cost(·)                   Cost of a single request
cost^max_f                Max request cost of stream f
cost^max_{f,A}            Max cost on brick A of stream f
W_f(t_1, t_2)             Aggregate cost of requests served from f during interval [t_1, t_2]
batchcost(p^i_{f,A})      Total cost of requests between p^{i-1}_{f,A} and p^i_{f,A}, including p^i_{f,A}
batchcost^max_{f,A}       Max value of batchcost(p^i_{f,A})
A(·)                      Arrival time of a request
S(·)                      Start tag of a request
F(·)                      Finish tag of a request
v(t)                      Virtual time at time t
delay(·)                  Delay value of a request

Table 1: Some symbols used in this paper.

assumptions. The data for a client may be laid out in an arbitrary manner. Clients may request data located on an arbitrary set of bricks at arbitrary and even fluctuating rates, possibly through an arbitrary set of coordinators.

2.2 Proportional Sharing

The algorithms in this paper support proportional sharing of resources for clients with queued requests. Each client is assigned a weight by the user and, in every time interval, the algorithms try to ensure that clients with requests pending during that interval receive service proportional to their weights.

More precisely, I/O requests are grouped into service classes called streams, each with a weight assigned; e.g., all requests from a client could form a single stream. A stream is backlogged if it has requests queued. A stream f consists of a sequence of requests p^0_f ... p^n_f. Each request has an associated service cost cost(p^i_f). For example, with bandwidth performance goals, the cost might be the size of the requested data; with service time goals, request processing time might be the cost. The maximum request cost of stream f is denoted cost^max_f. The weight assigned to stream f is denoted φ_f; only the relative values of the weights matter for proportional sharing.

Formally, if W_f(t_1, t_2) is the aggregate cost of the requests from stream f served in the time interval [t_1, t_2], then the unfairness between two continuously backlogged streams f and g is defined to be:

    | W_f(t_1, t_2)/φ_f − W_g(t_1, t_2)/φ_g |    (1)

A fair proportional sharing algorithm should guarantee that (1) is bounded by a constant. The constant

Figure 3: Distributed data. Stream f sends requests to brick A only, while stream g sends requests to both A and B.

usually depends on the stream characteristics, e.g., cost^max_f.

The time interval [t_1, t_2] in (1) may be any time duration. This corresponds to a "use it or lose it" policy, i.e., a stream will not suffer during one time interval for consuming surplus resources in another interval, nor will it benefit later from under-utilizing resources.

In the case of distributed data storage, we need to define what is to be proportionally shared. Let us first look at the following example.

EXAMPLE 1. Figure 3 is a storage system consisting of two bricks A and B. If streams f and g are equally weighted and both backlogged at A, how should we allocate the service capacity of brick A?

There are two alternatives for the above example, which induce two different meanings for proportional sharing. The first is single brick proportional sharing, i.e., the service capacity of brick A will be proportionally shared. Many existing proportional sharing algorithms fall into this category. However, stream g also receives service at brick B, thus receiving higher overall service. While this appears fair because stream g does a better job of balancing its load over the bricks than stream f, note that the data layout may be managed by the storage system and opaque to the clients; thus the quality of load balancing is merely an accident. From the clients' point of view, stream f unfairly receives less service than stream g. The other alternative is total service proportional sharing. In this case, the share of the service stream f receives on brick A can be increased to compensate for the fact that stream g receives service on brick B, while f does not. This problem is more intricate and little work has been done on it.

It is not always possible to guarantee total service proportional sharing with a work-conserving scheduler, i.e., where the server is never left idle when there is a request queued. Consider the following extreme case.

EXAMPLE 2. Stream f requests service from brick A only, while equally weighted stream g is sending requests to A and many other bricks. The amount of service g obtains from the other bricks is larger than the capacity of A. With a work-conserving scheduler, it is impossible to equalize the total service of the two streams.


If the scheduler tries to make the total service received by f and g as close to equal as possible, g will be blocked at brick A, which may not be desirable. Inspired by the example, we would like to guarantee some minimum service on each brick for each stream, yet satisfy total service proportional sharing whenever possible.

In this paper, we propose a distributed algorithm framework under which single brick proportional sharing, total service proportional sharing, and total service proportional sharing with a minimum service guarantee are all possible. We note that, although we focus on distributed storage systems, the algorithms we propose may be more broadly applicable to other distributed systems.

2.3 The centralized approach

In selecting an approach towards a distributed proportional sharing scheduler, we must take four requirements into account: i) the scheduler must be work-conserving: resources that backlogged streams are waiting for should never be idle; ii) "use it or lose it", as described in the previous section; iii) the scheduler should accommodate fluctuating service capacity, since the service times of IOs can vary unpredictably due to the effects of caching, sequentiality, and interference by other streams; and iv) reasonable computational complexity: there might be thousands of bricks and clients in a distributed storage system, hence the computational and communication costs must be considered.

There are many centralized scheduling algorithms that could be extended to distributed systems [3, 4, 8, 28, 30, 17, 14, 9]. We chose to focus on Start-time Fair Queuing (SFQ) [8] and its extension SFQ(D) [12] because they come closest to meeting the requirements above. We present a brief discussion of the SFQ and SFQ(D) algorithms in the remainder of this section.

SFQ is a proportional sharing scheduler for a single server; intuitively, it works as follows. SFQ assigns a start time tag and a finish time tag to each request, corresponding to the normalized times at which the request should start and complete according to a system notion of virtual time. For each stream, a new request is assigned a start time based on the assigned finish time of the previous request, or the current virtual time, whichever is greater. The finish time is assigned as the start time plus the normalized cost of the request. The virtual time is set to be the start time of the currently executing request, or the finish time of the last completed request if there is none currently executing. Requests are scheduled in the order of their start tags. It can be shown that, in any time interval, the service received by two backlogged workloads is approximately proportionate to their weights.

More formally, the request p^i_f is assigned the start tag S(p^i_f) and the finish tag F(p^i_f) as follows:

    S(p^i_f) = max{ v(A(p^i_f)), F(p^{i-1}_f) },  i ≥ 1    (2)

    F(p^i_f) = S(p^i_f) + cost(p^i_f)/φ_f,  i ≥ 1    (3)

where A(p^i_f) is the arrival time of request p^i_f, and v(t) is the virtual time at t; F(p^0_f) = 0, v(0) = 0.

SFQ cannot be directly applied to storage systems,

since storage servers are concurrent, serving multiple requests at a time, and the virtual time v(t) is not defined. Jin et al. [12] extended SFQ to concurrent servers by defining the virtual time as the maximum start tag of requests in service (the last request dispatched). The resulting algorithm is called depth-controlled start-time fair queuing and abbreviated to SFQ(D), where D is the queue depth of the storage device. As with SFQ, the following theorem [12] shows that SFQ(D) provides backlogged workloads with proportionate service, albeit with a looser bound on the unfairness.

THEOREM 1. During any interval [t_1, t_2], the difference between the amount of work completed by an SFQ(D) server for two backlogged streams f and g is bounded by:

    | W_f(t_1, t_2)/φ_f − W_g(t_1, t_2)/φ_g | ≤ ( cost^max_f/φ_f + cost^max_g/φ_g ) ∗ (D + 1)    (4)

While there are more complex variations of SFQ [12] that can reduce the unfairness of SFQ(D), for simplicity, we use SFQ(D) as the basis for our distributed scheduling algorithms. Since the original SFQ algorithm cannot be directly applied to storage systems, for the sake of readability, we will use "SFQ" to refer to SFQ(D) in the remainder of the paper.
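To make the tag bookkeeping concrete, the following is a minimal Python sketch of an SFQ(D) scheduler. It is illustrative only and not the FAB implementation; the class name, the per-stream dictionary of finish tags, and the explicit dispatch loop are our own choices, and costs and weights are plain numbers.

    import heapq

    class SFQD:
        """Sketch of SFQ(D): tags per equations (2)-(3); v(t) is the start tag of
        the last request dispatched; at most `depth` requests are outstanding."""

        def __init__(self, depth):
            self.depth = depth            # D: server queue depth
            self.outstanding = 0
            self.virtual_time = 0.0       # v(t)
            self.last_finish = {}         # per-stream finish tag of the previous request
            self.pending = []             # min-heap of (start_tag, seq, request)
            self.seq = 0

        def submit(self, stream, weight, cost, request):
            start = max(self.virtual_time, self.last_finish.get(stream, 0.0))  # eq. (2)
            self.last_finish[stream] = start + cost / weight                   # eq. (3)
            heapq.heappush(self.pending, (start, self.seq, request))
            self.seq += 1

        def dispatch(self):
            """Return the requests to send to the disk now, in start-tag order."""
            batch = []
            while self.pending and self.outstanding < self.depth:
                start, _, request = heapq.heappop(self.pending)
                self.virtual_time = start   # SFQ(D) advances v(t) on dispatch
                self.outstanding += 1
                batch.append(request)
            return batch

        def complete(self, request):
            self.outstanding -= 1           # frees a slot for the next dispatch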

2.4 Related Work

Extensive research in scheduling for packet switching networks has yielded a series of fair queuing algorithms; see [19, 28, 8, 3]. These algorithms have been adapted to storage systems for service proportional sharing. For example, YFQ [1], SFQ(D), and FSFQ(D) [12] are based on start-time fair queueing [8]; SLEDS [2] and SARC [29] use leaky buckets; CVC [11] employs the virtual clock [30]. Fair queuing algorithms are popular for two reasons: 1) they provide theoretically proven strong fairness, even under fluctuating service capacity, and 2) they are work-conserving.

However, fair queuing algorithms are not convenient for real-time performance goals, such as latencies. To address this issue, one approach is based on real-time schedulers; e.g., Facade [17] implements an Earliest


Deadline First (EDF) queue with proportional feedback for adjusting the disk queue length. Another method is feedback control, a classical engineering technique that has recently been applied to many computing systems [10]. These generally require at least a rudimentary model of the system being controlled. In the case of storage systems, whose performance is notoriously difficult to model [22, 25], Triage [14] adopts an adaptive controller that can automatically adjust the system model based on input-output observations.

There are some frameworks [11, 29] combining the above two objectives (proportional sharing and latency guarantees) in a two-level architecture. Usually, the first level guarantees proportional sharing by fair queueing methods, such as CVC [11] and SARC [29]. The second level tries to meet the latency goal with a real-time scheduler, such as EDF. Some feedback from the second-level to the first-level scheduler is helpful to balance the two objectives [29]. All of the above methods are designed for use in a centralized scheduler and cannot be directly applied to our distributed scheduling problem.

Existing methods for providing quality of service in distributed systems can be put into two categories. The first category is the distributed scheduling of a single resource. The main problem here is to maintain information at each scheduler regarding the amount of resource each stream has so far received. For example, in fair queuing algorithms, where there is usually a system virtual time v(t) representing the normalized fair amount of service that all backlogged clients should have received by time t, the problem is how to synchronize the virtual time among all distributed schedulers. This can be solved in a number of ways; for example, in high-capacity crossbar switches, in order to fairly allocate the bandwidth of the output link, the virtual time of different input ports can be synchronized by the access buffer inside the crossbar [21]. In wireless networks, the communication medium is shared. When a node can overhear packets from neighboring nodes for synchronization, distributed priority backoff schemes closely approximate a single global fair queue [18, 13, 23]. In the context of storage scheduling, Request Window [12] is a distributed scheduler that is similar to a leaky bucket scheduler. Services for different clients are balanced by the windows issued by the storage server. It is not fully work-conserving under light workloads.

The second category is centralized scheduling of multiple resources. Gulati and Varman [9] address the problem of allocating disk bandwidth fairly among concurrent competing flows in a parallel I/O system with multiple disks and a centralized scheduler. They aim at the optimization problem of minimizing the unfairness among different clients with concurrent requests. I/O requests are scheduled in batches, and a combinatorial

optimization problem is solved in each round, which makes the method computationally expensive. The centralized controller makes it unsuitable for use in fully distributed high-performance systems, such as FAB.

To the best of our knowledge, the problem of fair scheduling in distributed storage systems that involve both distributed schedulers and distributed data has not been previously addressed.

3 Proportional Sharing in Distributed Storage Systems

We describe a framework for proportional sharing in distributed storage systems, beginning with the intuition, followed by a detailed description, and two instantiations of the method exhibiting different sharing properties.

3.1 An intuitive explanation

First, let us consider the simplified problem where the data is centralized at one brick, but the coordinators may be distributed. An SFQ scheduler could be placed either at the coordinators or at the storage brick. As fair scheduling requires information about all backlogged streams, direct or indirect communication among coordinators may be necessary if the scheduler is implemented at the coordinators. Placing the scheduler at the brick avoids this problem. In fact, SFQ(D) can be used without modification in this case, provided that coordinators attach a stream ID to each request so that the scheduler at the brick can assign the start tag accordingly.

Now consider the case where the data is distributed over multiple bricks as well. In this case, SFQ schedulers at each brick can guarantee only single brick proportional sharing, but not necessarily total service proportional sharing, because the scheduler at each brick sees only the requests directed to it and cannot account for the service rendered at other bricks.

Suppose, however, that each coordinator broadcasts all requests to all bricks. Clearly, in this case, each brick has complete knowledge of all requests for each stream. Each brick responds only to the requests for which it is the correct destination. The remaining requests are treated as virtual requests, and we call the combined stream of real and virtual requests a virtual stream; see Fig. 4. A virtual request takes zero processing time but does account for the service share allocated to its source stream. Then the SFQ scheduler at the brick guarantees service proportional sharing of backlogged virtual streams. As the aggregate service cost of a virtual stream equals the aggregate service cost of the original stream, total service proportional sharing can be achieved.

The above approach is simple and straightforward, but with large-scale distributed storage systems,


Figure 4: The naive approach. The coordinator broadcasts every request to all bricks. Requests to incorrect destination bricks are virtual and take zero processing time. Proportional scheduling at each local brick guarantees total service proportional sharing.

Figure 5: The improved approach. Only the aggregate cost of virtual requests is communicated, indicated by the number before each request (assuming unit cost of each request). Broadcasting is avoided yet total service proportional sharing can be achieved.

broadcasting is not acceptable. We observe, however, that the SFQ scheduler requires only knowledge of the cost of each virtual request; the coordinators may therefore broadcast the cost value instead of the request itself. In addition, the coordinator may combine the costs of consecutive virtual requests and piggyback the total cost information onto the next real request; see Fig. 5. The communication overhead is negligible because, in general, read/write data rather than requests dominate the communication, and the local area network connecting bricks usually has enough bandwidth for this small overhead.

The piggybacked cost information on each real request is called the delay of the request, because the modified SFQ scheduler will delay processing the request according to this value. Different delay values may be used for different performance goals, which greatly extends the ability of SFQ schedulers. This flexibility is captured in the framework presented next.

3.2 Distributed Fair Queuing Framework

We propose the distributed fair queuing framework displayed in Fig. 6; as we show later, it can be used for total proportional sharing, single-brick proportional sharing, or a hybrid between the two. Assume there are streams f, g, ... and bricks A, B, .... The fair queueing scheduler is placed at each brick as just discussed. The scheduler

has a priority queue for all streams and orders all requests by some priority, e.g., start time tags in the case of an SFQ scheduler. On the other hand, each coordinator has a separate queue for each stream, where the requests in a queue may have different destinations.

When we apply SFQ to the framework, each request has a start tag and a finish tag. To incorporate the idea presented in the previous section, we modify the computation of the tags as follows:

    S(p^i_{f,A}) = max{ v(A(p^i_{f,A})), F(p^{i-1}_{f,A}) + delay(p^i_{f,A})/φ_f }    (5)

    F(p^i_{f,A}) = S(p^i_{f,A}) + cost(p^i_{f,A})/φ_f    (6)

The only difference between the SFQ formulae (2-3) and those above is the new delay function for each request, which is calculated at the coordinators and carried by the request. The normalized delay value translates into the amount of time by which the start tag should be shifted. How the delay is computed depends upon the proportional sharing properties we wish to achieve, and we will discuss several delay functions and the resulting sharing properties in the sections that follow. We will refer to the modified Start-time Fair Queuing algorithm as Distributed Start-time Fair Queuing (DSFQ).

In DSFQ, as in SFQ(D), v(t) is defined to be the start tag of the last request dispatched to the disk before or at time t. There is no global virtual time in the system. Each brick maintains its own virtual time, which varies at different bricks depending on the workload and the service capacity of the brick.
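Viewed from a brick, DSFQ changes only the start-tag rule of SFQ(D): the stream's previous finish tag is pushed forward by the normalized delay carried on the request. A minimal sketch of that computation, with hypothetical argument names, is shown below; setting delay to zero recovers equations (2)-(3).

    def dsfq_tags(virtual_time, prev_finish, cost, delay, weight):
        """Brick-side tag computation of DSFQ, following equations (5)-(6).
        virtual_time: this brick's own v(t), i.e. the start tag of the last dispatched request
        prev_finish:  finish tag of the stream's previous request at this brick
        delay:        aggregate cost piggybacked by the coordinator (zero for plain SFQ)"""
        start = max(virtual_time, prev_finish + delay / weight)   # eq. (5)
        finish = start + cost / weight                            # eq. (6)
        return start, finish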

We note that the framework we propose works with other fair scheduling algorithms [28] as long as each stream has its own clock such that the delay can be applied; for example, a similar extension could be made to the Virtual Clock algorithm [30] if we desire proportional service over an extended time period (time-averaged fairness) rather than the "use it or lose it" property (instantaneous fairness) supported by SFQ. Other options between these two extreme cases could be implemented in this framework as well. For example, the scheduler can constrain each stream's time tag to be within some window of the global virtual time. Thus, a stream that under-utilizes its share can get extra service later, but only to a limited extent.

If the delay value is set to always be zero, DSFQ reduces to SFQ and achieves single brick proportional sharing. We next consider other performance goals.


Figure 6: The distributed fair queuing framework

3.3 Total Service Proportional Sharing

We describe how the distributed fair queueing framework can be used for total proportional sharing when each stream uses one coordinator, and then argue that the same method also engenders total proportional sharing with multiple coordinators per stream.

3.3.1 Single-Client Single-Coordinator

We first assume that requests from one stream are always processed by one coordinator; different streams may or may not have different coordinators. We will later extend this to the multiple-coordinator case. The performance goal, as before, is that the total amount of service each client receives must be proportional to its weight.

As described in Section 3.1, the following delay function for a request from stream f to brick A represents the total cost of requests sent to other bricks since the previous request to brick A:

    delay(p^i_{f,A}) = batchcost(p^i_{f,A}) − cost(p^i_{f,A})    (7)

When this delay function is used with the distributed scheduling framework defined by formulae (5-7), we call the resulting algorithm TOTAL-DSFQ. The delay function (7) is the total service cost of requests sent to other bricks since the last request on the brick. Intuitively, it implies that, if the brick is otherwise busy, a request should wait an extra time corresponding to the aggregate service requirements of the preceding requests from the same stream that were sent to other bricks, normalized by the stream's weight.
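On the coordinator side, computing this delay only requires a per-stream, per-brick counter of the cost sent elsewhere since the previous request to that brick. The sketch below is ours, not the FAB code, and assumes the single-client single-coordinator setting of this section; the cost is whatever metric the system shares (e.g., request size).

    from collections import defaultdict

    class TotalDelayTracker:
        """Coordinator-side bookkeeping for the TOTAL-DSFQ delay of equation (7):
        for each (stream, brick), the cost sent to *other* bricks since the
        stream's previous request to that brick."""

        def __init__(self):
            self.other_cost = defaultdict(lambda: defaultdict(float))

        def delay_for(self, stream, brick, cost):
            # delay = batchcost(p) - cost(p): everything accumulated for this brick so far
            delay = self.other_cost[stream][brick]
            self.other_cost[stream][brick] = 0.0
            # this request's cost counts as "service elsewhere" for every other brick
            for other in self.other_cost[stream]:
                if other != brick:
                    self.other_cost[stream][other] += cost
            return delay

The returned value is piggybacked on the request and plugged into equation (5) at the brick.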

Why TOTAL-DSFQ engenders proportional sharing of the total service received by the streams can be explained using virtual streams. According to formulae (5-7), TOTAL-DSFQ is exactly equivalent to the architecture where coordinators send virtual streams to the bricks and bricks are controlled by the standard SFQ. This virtual stream contains all the requests in f, but the requests that are not destined for A are served at A in

zero time. Note that SFQ holds its fairness property even when the service capacity varies [8]. In our case, the server capacity (processing speed) varies from normal, if the request is to be serviced on the same brick, to infinity if the request is virtual and is to be serviced elsewhere. Intuitively, since brick A sees all the requests in f (and their costs) as a part of the virtual stream, the SFQ scheduler at A factors in the costs of the virtual requests served elsewhere in its scheduling, even though they consume no service time at A. This will lead to proportional sharing of the total service. The theorem below formalizes the bounds on unfairness using TOTAL-DSFQ.

THEOREM 2. Assume stream f is requesting service on N_f bricks and stream g on N_g bricks. During any interval [t_1, t_2] in which f and g are both continuously backlogged at some brick A, the difference between the total amount of work completed by all bricks for the two streams during the entire interval, normalized by their weights, is bounded as follows:

    | W_f(t_1, t_2)/φ_f − W_g(t_1, t_2)/φ_g |
        ≤ ((D + D_SFQ) ∗ N_f + 1) ∗ cost^max_{f,A}/φ_f
        + ((D + D_SFQ) ∗ N_g + 1) ∗ cost^max_{g,A}/φ_g
        + (D_SFQ + 1) ∗ ( batchcost^max_{f,A}/φ_f + batchcost^max_{g,A}/φ_g )    (8)

where D is the queue depth of the disk^1, and D_SFQ is the queue depth of the Start-time Fair Queue at the brick.

PROOF. The proof of this and all following theorems can be found in [26].

The bound in Formula (8) has two parts. The first part is similar to the bound of SFQ(D) in (4): the unfairness due to server queues. The second part is new and is contributed by the distributed data. If the majority of requests of stream f are processed at the backlogged server, then batchcost^max_{f,A} is small and the bound is tight. Otherwise, if f gets a lot of service at other bricks, the bound is loose.

As we showed in Example 2, however, there are situations in which total proportional sharing is impossible with work-conserving schedulers. In the theorem above, this corresponds to the case with an infinite batchcost^max_{g,A}, and hence the bound is infinite. To delineate more precisely when total proportional sharing is possible under TOTAL-DSFQ, we characterize when the total service rates of the streams are proportional to their weights. The theorem below says that, under TOTAL-DSFQ, if a set of streams are backlogged together at a

^1 If there are multiple disks (the normal case), D is the sum of the queue depths of the disks.


set of bricks, then either their normalized total service rates over all bricks are equal (thus satisfying the total proportionality requirement), or there are some streams whose normalized service rates are equal and the remainder receive no service at the backlogged bricks because they already receive more service elsewhere.

Let R_f(t_1, t_2) = W_f(t_1, t_2)/(φ_f ∗ (t_2 − t_1)) be the normalized service rate of stream f in the duration (t_1, t_2). If the total service rates of streams are proportional to their weights, then their normalized service rates should be equal as the time interval t_2 − t_1 goes to infinity. Suppose stream f is backlogged at a set of bricks, denoted as set S; its normalized service rate at S is denoted as R_{f,S}(t_1, t_2), and R_{f,other}(t_1, t_2) denotes its normalized total service rate at all other bricks. R_f(t_1, t_2) = R_{f,S}(t_1, t_2) + R_{f,other}(t_1, t_2). We drop (t_1, t_2) hereafter as we always consider the interval (t_1, t_2).

THEOREM 3. Under TOTAL-DSFQ, if during (t_1, t_2) streams {f_1, f_2, ..., f_n} are backlogged at a set of bricks S, in the order R_{f_1,other} ≤ R_{f_2,other} ≤ ... ≤ R_{f_n,other}, then as t_2 − t_1 → ∞, either R_{f_1} = R_{f_2} = ... = R_{f_n}, or ∃ k ∈ {1, 2, ..., n − 1} such that R_{f_1} = ... = R_{f_k} ≤ R_{f_{k+1},other} and R_{f_{k+1},S} = ... = R_{f_n,S} = 0.

The intuition of Theorem 3 is as follows. At brick set S, let us first set R_{f_1,S} = R_{f_2,S} = ... = R_{f_n,S} = 0 and try to allocate the resources of S. Stream f_1 has the highest priority since its delay is the smallest. Thus the SFQ scheduler will increase R_{f_1,S} until R_{f_1} = R_{f_2,other}. Now both f_1 and f_2 have the same total service rate and the same highest priority. Brick set S will then increase R_{f_1,S} and R_{f_2,S} equally until R_{f_1} = R_{f_2} = R_{f_3,other}. In the end, either all the streams have the same total service rate, or it is impossible to balance all streams due to the limited service capacity of all bricks in S. In the latter case, the first k streams have equal total service rates, while the remaining streams are blocked for service at S. Intuitively, this is the best we can do with a work-conserving scheduler to equalize normalized service rates.
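This allocation is a water-filling process. The sketch below is purely illustrative and not part of the paper's algorithm: it assumes the streams have equal weights, that the brick set S has a fixed aggregate normalized capacity, and that the outside rates R_{f,other} are known and sorted in ascending order.

    def equilibrium_shares(r_other, capacity):
        """Water-filling illustration of Theorem 3's intuition.
        r_other:  normalized service rates the streams get outside S, sorted ascending
        capacity: total normalized service rate the backlogged brick set S can provide
        Returns each stream's normalized service rate obtained on S."""
        n = len(r_other)
        level = r_other[0]                     # common total rate of the streams being raised
        for k in range(n):
            target = r_other[k + 1] if k + 1 < n else float("inf")
            need = (target - level) * (k + 1)  # capacity needed to lift streams 0..k to target
            if need >= capacity:
                level += capacity / (k + 1)    # capacity exhausted: streams 0..k end up equal
                break
            capacity -= need
            level = target
        # streams whose outside rate already exceeds `level` are blocked on S
        return [max(0.0, level - r) for r in r_other]

For example, equilibrium_shares([0, 1, 5], capacity=4) returns [2.5, 1.5, 0.0]: the first two streams end with equal total rates of 2.5, and the third is blocked on S because it already receives more service elsewhere, matching the second case of the theorem.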

In Section 3.4 we propose a modification to TOTAL-DSFQ that ensures no stream is blocked at any brick.

3.3.2 Single-client Multi-coordinator

So far we have assumed that a stream requests service through one coordinator only. In many high-end systems, however, it is preferable for high-load clients to distribute their requests among multiple coordinators in order to balance the load on the coordinators. In this section, we discuss the single-client multi-coordinator setting and the corresponding fairness analysis for TOTAL-DSFQ. In summary, we find that TOTAL-DSFQ does engender

Figure 7: Effect of multiple coordinators under TOTAL-DSFQ. The delay value of an individual request is different from Fig. 5, but the total amount of delay remains the same.

total proportional sharing in this setting, except in some unusual cases.

We motivate the analysis with an example. First, let us assume that a stream accesses two coordinators in round-robin order and examine the effect on the delay function (7) through the example stream in Fig. 5. The result is displayed in Fig. 7. Odd-numbered requests are processed by the first coordinator and even-numbered requests by the second. With one coordinator, the three requests to brick A have delay values 0, 2, and 0. With two round-robin coordinators, the delay values of the two requests dispatched by the first coordinator are now 0 and 1; the delay value of the request dispatched by the second coordinator is 1. Thus, although an individual request may have a delay value different from the single-coordinator case, the total amount of delay remains the same. This is because every virtual request (to other bricks) is counted exactly once.

We formalize this result in Theorem 4 below, which says, essentially, that streams backlogged at a brick receive total proportional service so long as each stream uses a consistent set of coordinators (i.e., the same set of coordinators for each brick it accesses).

Formally, assume stream f sends requests through n coordinators C_1, C_2, ..., C_n, and coordinator C_i receives a substream of f denoted as f_i. With respect to brick A, each substream f_i has its batchcost^max_{f_i,A}. Let us first assume that batchcost^max_{f_i,A} is finite for all substreams, i.e., requests to A are distributed among all coordinators.

THEOREM 4. Assume stream f accesses n coordinators such that each one receives substreams f_1, ..., f_n, respectively, and stream g accesses m coordinators with substreams g_1, ..., g_m, respectively. During any interval [t_1, t_2] in which f and g are both continuously backlogged at brick A, inequality (8) still holds, where

    batchcost^max_{f,A} = max{ batchcost^max_{f_1,A}, ..., batchcost^max_{f_n,A} }    (9)

    batchcost^max_{g,A} = max{ batchcost^max_{g_1,A}, ..., batchcost^max_{g_m,A} }    (10)

An anomalous case arises if a stream partitions the


bricks into disjoint subsets and accesses each partition through separate coordinators. In this case, the requests served in one partition will never be counted in the delay of any request to the other partition, and the total service may no longer be proportional to the weight. For example, requests to B in Fig. 7 have smaller delay values than the ones in Fig. 5. This case is unlikely to occur with most load balancing schemes such as round-robin or uniformly random selection of coordinators. Note that the algorithm will still guarantee total proportional sharing if different streams use separate coordinators.

More interestingly, selecting randomly among multiple coordinators may smooth out the stream and result in more uniform delay values. For example, if batchcost(p^j_{f,A}) in the original stream is a sequence of i.i.d. (independent, identically distributed) random variables with large variance, such that batchcost^max_{f,A} might be large, it is not difficult to show that with an independently random mapping of each request to a coordinator, batchcost(p^j_{f_i,A}) is also a sequence of i.i.d. random variables with the same mean, but the variance decreases as the number of coordinators increases. This means that under random selection of coordinators, while the average delay is still the same (thus the service rate is the same), the variance in the delay value is reduced and therefore the unfairness bound is tighter. We test this observation through an empirical study later.

3.4 Hybrid Proportional Sharing

Under TOTAL-DSFQ, Theorem 3 tells us that a stream may be blocked at a brick if it gets too much service at other bricks. This is not desirable in many cases. We would like to guarantee a minimum service rate for each stream on every brick so the client program can always make progress. Under the DSFQ framework, i.e., formulae (5-6), this means that the delay must be bounded, using a different delay function than the one used in TOTAL-DSFQ. We next develop a delay function that guarantees a minimum service share to backlogged streams on each brick.

Let us assume that the weights assigned to streams are normalized, i.e., 0 ≤ φ_f ≤ 1 and Σ_f φ_f = 1. Suppose that, in addition to the weight φ_f, each stream f is assigned a brick-minimum weight φ^min_f, corresponding to the minimum service share per brick for the stream.^2 We can then show that the following delay function will guarantee the required minimum service share on each brick for each stream:

    delay(p^i_{f,A}) = (φ_f/φ^min_f − 1)/(1 − φ_f) ∗ cost(p^i_{f,A})    (11)

^2 Setting the brick-minimum weights requires knowledge of the client data layouts. We do not discuss this further in the paper.

We can see, for example, that setting φ^min_f = φ_f yields a delay of zero, and the algorithm then reduces to single brick proportional sharing, which guarantees minimum share φ_f for stream f, as expected.

By combining delay function (11) with the delay function (7) for TOTAL-DSFQ, we can achieve an algorithm that approaches total proportional sharing while guaranteeing a minimum service level for each stream per brick, as follows:

    delay(p^i_{f,A}) = min{ batchcost(p^i_{f,A}) − cost(p^i_{f,A}),
                            (φ_f/φ^min_f − 1)/(1 − φ_f) ∗ cost(p^i_{f,A}) }    (12)

The DSFQ algorithm using the delay function (12) defines a new algorithm called HYBRID-DSFQ. Since the delay under HYBRID-DSFQ is no greater than the delay in (11), the service rate at every brick is no less than the rate under (11), thus the minimum per-brick service share φ^min_f is still guaranteed. On the other hand, if the amount of service a stream f receives on other bricks between requests to brick A is lower than (φ_f/φ^min_f − 1)/(1 − φ_f) ∗ cost(p^i_{f,A}), the delay function behaves similarly to equation (7), and hence the sharing properties in this case should be similar to TOTAL-DSFQ, i.e., total proportional sharing.
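In code, the hybrid delay is simply the TOTAL-DSFQ delay capped by the bound of equation (11). The sketch below is our own illustration, reusing the kind of per-brick bookkeeping shown earlier; weight and min_weight stand for the normalized φ_f and φ^min_f.

    def hybrid_delay(batch_other_cost, cost, weight, min_weight):
        """HYBRID-DSFQ delay per equation (12).
        batch_other_cost: batchcost(p) - cost(p), i.e. the cost the stream sent to
                          other bricks since its previous request to this brick."""
        cap = (weight / min_weight - 1.0) / (1.0 - weight) * cost  # bound from eq. (11)
        return min(batch_other_cost, cap)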

Empirical evidence (in Section 4.3) indicates that HYBRID-DSFQ works as expected for various workloads. However, there are pathological workloads that can violate the total service proportional sharing property. For example, if a stream using two bricks knows its data layout, it can alternate bursts to one brick and then the other. Under TOTAL-DSFQ, the first request in each burst would have received a large delay, corresponding to the service the stream had received on the other brick during the preceding burst, but in HYBRID-DSFQ, the delay is truncated by the minimum share term in the delay function. As a result, the stream receives more service than its weight entitles it to. We believe that this can be resolved by including more history in the minimum share term, but the design and evaluation of such a delay function is reserved for future work.

4 Experimental Evaluation

We evaluate our distributed proportional sharing algorithm in a prototype FAB system [20], which consists of six bricks. Each brick is an identically configured HP ProLiant DL380 server with 2x 2.8GHz Xeon CPU, 1.5GB RAM, 2x Gigabit NIC, and an integrated Smart Array 6i storage controller with four 18GB Ultra320, 15K rpm SCSI disks configured as RAID 0. All bricks are running SUSE 9.2 Linux, kernel 2.6.8-24.10-smp. Each brick runs a coordinator. To simplify performance


comparisons, FAB caching was turned off, except at the disk level.^3

The workload generator consists of a number of clients (streams), each running several Postmark [15] instances. We chose Postmark as the workload generator because its independent, randomly-distributed requests give us the flexibility to configure different workloads for demonstration and for testing extreme cases. Each Postmark thread has exactly one outstanding request in the system at any time, accessing its isolated 256MB logical volume. Unless otherwise specified, each volume resides on a single brick and each thread generates random read/write requests with file sizes from 1KB to 16KB. The number of transactions per thread is sufficiently large so the thread is always active when it is on. Other parameters of Postmark are set to the default values.

With our work-conserving schedulers, until a stream is backlogged, its IO throughput increases as the number of threads increases. When it is backlogged, on the other hand, the actual service amount depends on the scheduling algorithm. To simplify calculation, we target throughput (MB/sec) proportional sharing and define the cost of a request to be its size. Other cost functions, such as IO/sec or estimated disk service time, could be used as well.

4.1 Single Brick Proportional Sharing

We first demonstrate the effect of TOTAL-DSFQ on two streams reading from one brick. Stream f consistently has 30 Postmark threads, while the number of Postmark threads for stream g is increased from 0 to 20. The ratio of weights between f and g is 1:2. As the data is not distributed, the delay value is always zero, and this is essentially the same as SFQ(D) [12].

Figure 8 shows the performance isolation between the two clients. The throughput of stream g is increasing and its latency is fixed until g acquires its proportional share at around 13 threads. After that, additional threads do not give any more bandwidth but increase the latency. On the other hand, the throughput and latency of stream f are both affected by g. Once g gets its share, it has no further impact on f.

4.2 Total Service Proportional Sharing

Figure 9 demonstrates the effectiveness of TOTAL-DSFQ for two clients. The workload streams have access patterns shown in Fig. 3. We arranged the data

^3 Our algorithms focus on the management of storage bandwidth; a full exploration of the management of multiple resources (including cache and network bandwidth) to control end-to-end performance is beyond the scope of this paper.

Figure 8: Proportional sharing on one brick. φ_f:φ_g = 1:2; legend in (a) also applies to (b). (a) Throughputs for the two streams (MB/s) and (b) average latencies for the two streams (s), each plotted against the number of Postmark threads of stream g.

layout so that each Postmark thread accesses only one brick. Stream f and stream g both have 30 threads on brick A throughout the experiment; meanwhile, an increasing number of threads from g is processed at brick B. Postmark allows us to specify the maximum size of the random files generated, and we tested the algorithm with workloads using two different maximum random file sizes, 16KB and 1MB.

Figure 9(a) shows that as the number of Postmark threads from stream g directed to brick B increases, its throughput from brick B increases, and the share it receives at brick A decreases to compensate. The total throughputs received by streams f and g stay roughly equal throughout. As stream g becomes more unbalanced between bricks A and B, however, the throughput difference between streams f and g varies more. This can be related to the fairness bound in Theorem 2: as the imbalance increases, so does batchcost^max_{g,A}, and the bound becomes a little looser. Figure 9(b) uses the same data layout but with a different file size and weight ratio. As g gets more service on B, its throughput on B rises from 0 to 175 MB/s. As a result, the algorithm increases f's share on the shared brick A, and its throughput rises from 40 MB/s to 75 MB/s, while g's throughput on A drops from 160 MB/s to 125 MB/s. In combination, the throughput of both streams increases, while maintaining a 1:4 ratio.

The experiment in Figure 10 has a data layout with dependencies among requests. Each thread in g accesses all three bricks, while stream f accesses one brick only. The resource allocation is balanced when stream g has three or more threads on the shared brick. As g has a RAID-0 data layout, the service rates on the other two bricks are limited by the rate on the shared brick. This experiment shows that TOTAL-DSFQ correctly controls the


Figure 9: Total service proportional sharing. f's data is on brick A only; g has data on both bricks A and B. As g gets more service on the bricks it does not share with f, the algorithm increases f's share on the brick they do share; thus the total throughputs of both streams increase. (a) Random I/O, φ_f:φ_g = 1:1, max file size 16KB; (b) sequential I/O, φ_f:φ_g = 1:4, max file size 1MB. Both panels plot throughput (MB/s) against the number of Postmark threads of stream g on brick B, with curves for stream f, stream g on brick A, stream g on brick B, and stream g total.

Figure 10: Total service proportional sharing with striped data. φ_f:φ_g = 1:1. g has a RAID-0 logical volume striped across three bricks; f's data is on one brick only. The plot shows throughput (MB/s) against the number of Postmark threads of stream g on one brick, with curves for stream f, stream g on one brick, and stream g total.

total share in the case where requests on different bricks are dependent. We note that in this example, the scheduler could allocate more bandwidth on the shared brick to stream g in order to improve the total system throughput instead of maintaining proportional service; however, this is not our goal.

Figure 11 is the result with multiple coordinators. The data layouts and workloads are the same as in the experiment shown in Figures 3 and 9: two bricks, stream f accesses only one, and stream g accesses both. The only difference is that stream g accesses both bricks A and B through two or four coordinators in round-robin order.

Using multiple coordinators still guarantees proportional sharing of the total throughput. Furthermore, a comparison of Fig. 9, 11(a), and 11(b) indicates that as the number of coordinators increases, the match between the total throughputs received by f and g is closer, i.e., the unfairness bound is tighter. This confirms the

Figure 11: Total service proportional sharing with multiple coordinators, φ_f:φ_g = 1:1. (a) Two coordinators; (b) four coordinators. Both panels plot throughput (MB/s) against the number of Postmark threads of stream g on brick B, with curves for stream f, stream g total, stream g on brick A, and stream g on brick B.

servation in Section 3.3.2 that multiple coordinators maysmooth out a stream and reduce the unfairness.

4.3 Hybrid Proportional Sharing

The result of HYBRID-DSFQ is presented in Fig. 12. The workload is the same as in the experiment shown in Figures 3 and 9: stream f accesses brick A only, and stream g accesses both A and B. Streams f and g both have 20 Postmark threads on A, and g has an increasing number of Postmark threads on B. We wish to give stream g a minimum share of 1/12 on brick A when it is backlogged. This corresponds to φ^min_g = 1/12; based on Equation 12, the delay function for g is

    delay(p^i_{g,A}) = min { batchcost(p^i_{g,A}) − cost(p^i_{g,A}), 10 · cost(p^i_{g,A}) }

Stream f is served on A only and thus its delay is always zero.
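To make this concrete, here is a minimal Python sketch of the delay computation. The names batch_cost and request_cost stand in for batchcost(p^i_{g,A}) and cost(p^i_{g,A}); the factor 10 is copied from the instantiated formula above (it comes from Equation 12 with φ^min_g = 1/12) rather than re-derived here, and this is not FAB source code.

    def hybrid_delay_g(batch_cost, request_cost):
        # Delay charged to a request of stream g on brick A under HYBRID-DSFQ:
        # the total-service delay term, capped at 10 times the request cost so
        # that g keeps at least its reserved minimum share on this brick.
        return min(batch_cost - request_cost, 10.0 * request_cost)

    def hybrid_delay_f(request_cost):
        # Stream f has data on brick A only, so it is never delayed.
        return 0.0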

With HYBRID-DSFQ, the algorithm reserves a minimum share for each stream, and tries to keep the total throughputs as close as possible without reallocating the reserved share. For this workload, the service capacity of a brick is approximately 6 MB/s. We can see in Fig. 12(a) that if the throughput of stream g on brick B is less than 4 MB/s, HYBRID-DSFQ can balance the total throughputs of the two streams. As g receives more service on brick B, the maximum-delay part of HYBRID-DSFQ takes effect and g gets its minimum share on brick A. The total throughputs are no longer proportional to the assigned weights, but remain reasonably close. Figure 12(b) repeats the experiment with the streams selecting between two coordinators alternately; the workload and data layout are otherwise identical to the single-coordinator experiment.
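For reference, the reserved floor implied by these numbers can be worked out directly (this arithmetic is ours, not stated explicitly above): with a brick capacity of roughly 6 MB/s and φ^min_g = 1/12, stream g's minimum share on brick A is about 6 × 1/12 = 0.5 MB/s, which appears to correspond to the 0.5 MB/s level marked in Fig. 12.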


Figure 12: Two-brick experiment using HYBRID-DSFQ. Both panels plot throughput (MB/s) against the number of Postmark threads of stream g on brick B, with curves for stream f, stream g total, stream g on brick A, and stream g on brick B; a 0.5 MB/s level is marked in each panel. (a) Throughputs with HYBRID-DSFQ. (b) Throughputs with HYBRID-DSFQ, two coordinators.

The results indicate that HYBRID-DSFQ works as designed with multiple coordinators too.

4.4 Fluctuating Workloads

First we investigate how TOTAL-DSFQ responds to sudden changes in load by using an on/off fluctuating workload. Figure 13 shows the total throughputs of the three streams. Streams f and g are continuously backlogged at brick A, and thus their total throughputs are the same. When stream h is on, some bandwidth on brick B is occupied by h (h's service is not proportional to its weight because it has too few threads and thus is not backlogged on brick B). As a result, g's throughput drops. f's throughput then follows closely, about a second later, because part of f's share on A is reallocated to g to compensate for its loss on B. The detailed throughputs of g on each brick are not shown in the figure. We also see that as the number of threads (and hence the SFQ depth) increases, the sharp drop in g's throughput becomes more significant. These experimental observations agree with the unfairness bounds on TOTAL-DSFQ shown in Theorem 2, which increase with the queue depth.
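As a concrete picture of the load pattern behind this experiment, the sketch below models stream h's on/off behavior described in the Figure 13 caption (10 threads repeatedly on for 10 seconds, then off for 10 seconds); the function name and the time sampling are our own illustrative choices.

    def h_threads(t_sec, n_threads=10, on_sec=10.0, off_sec=10.0):
        # Active Postmark threads of stream h on brick B at time t: all of them
        # during the "on" half of each 20-second window, none during the "off" half.
        period = on_sec + off_sec
        return n_threads if (t_sec % period) < on_sec else 0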

Next we examine the effectiveness of the different proportional sharing algorithms using sinusoidal workloads. Both streams f and g access three bricks and overlap on only one brick, brick A. The number of Postmark threads for each stream on each brick is approximately a sinusoidal function of time, with a different frequency for each stream; see Fig. 14(a). To demonstrate the effectiveness of proportional sharing, we try to saturate brick A by setting the number of threads on it to a sinusoidal function varying from 15 to 35, while the thread counts on the other bricks take values from 0 to 10 (not shown in Fig. 14(a)). The result confirms several hypotheses.
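The thread schedule can be sketched roughly as below. Only the ranges come from the text (15 to 35 threads on the shared brick A, 0 to 10 on the other bricks); the period, phase, and specific time samples are illustrative assumptions, and the different frequencies of the two streams are modelled here simply by passing different periods.

    import math

    def sinusoidal_threads(t_sec, lo, hi, period_sec, phase=0.0):
        # Approximate Postmark thread count at time t, oscillating between lo and hi.
        mid = (lo + hi) / 2.0
        amp = (hi - lo) / 2.0
        return round(mid + amp * math.sin(2.0 * math.pi * t_sec / period_sec + phase))

    # e.g., thread counts on the shared brick A at t = 100 s, with assumed periods:
    f_on_A = sinusoidal_threads(100, 15, 35, period_sec=150)
    g_on_A = sinusoidal_threads(100, 15, 35, period_sec=100)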

Figure 13: Fluctuating workloads. Streams f and g both have the same number of Postmark threads on brick A, and stream g has 10 additional Postmark threads on brick B. In addition, there is a stream h that has 10 on/off threads on brick B that are repeatedly on together for 10 seconds and then off for 10 seconds. The weights are equal: φf : φg : φh = 1 : 1 : 1. Both panels plot throughput (MB/s) against time (sec) for stream f, stream g (total), and stream h. (a) Streams f and g both have 20 Postmark threads on brick A, i.e., queue depth D_SFQ = 20. (b) Streams f and g both have 30 threads, D_SFQ = 30.

Figure 14(b) shows the result on a standard FAB without any fair scheduling. Not surprisingly, the throughput curves are similar to the thread curves in Fig. 14(a) except when the server is saturated. Figure 14(c) shows that single brick proportional sharing provides proportional service on brick A but not necessarily proportional total service. At time 250, the service on A is not proportional because g has its minimum number of threads on A and is not backlogged. Figure 14(d) displays the effect of total service proportional sharing. The total service rates match well in general. At times around 65, 100, 150, and 210, the rates deviate because one stream gets too much service on the other bricks, and its service on A drops close to zero. Thus TOTAL-DSFQ cannot balance the total service there. At times around 230-260, the service rates are not close because stream g is not backlogged, as was the case in Fig. 14(c). Finally, Fig. 14(e) confirms the effect of hybrid proportional sharing. Compared with Fig. 14(d), HYBRID-DSFQ guarantees the minimum share where TOTAL-DSFQ does not, at the cost of slightly greater deviation from total proportional sharing during some periods.

Figure 14: Sinusoidal workloads, φf:φg = 1:1. Panel (a) plots the number of Postmark threads over time for f (total), g (total), f on brick A, and g on brick A; the legend in (a) applies to all the panels. Panels (b)-(e) plot throughput (MB/s) over time: (b) without any proportional sharing scheduling, (c) single brick proportional sharing, (d) total service proportional sharing, and (e) hybrid proportional sharing; a 0.5 MB/s level is marked in panels (d) and (e).

5 Conclusions

In this paper, we presented a proportional-service scheduling framework suitable for use in a distributed storage system. We use it to devise a distributed scheduler that enforces proportional sharing of total service between streams to the degree possible given the workloads.

Enforcing proportional total service in a distributed storage system is hard because different clients can access data from multiple storage nodes (bricks) using different, and possibly multiple, access points (coordinators). Thus, there is no single entity that knows the state of all the streams and the service they have received. Our scheduler extends the SFQ(D) [12] algorithm, which was designed as a centralized scheduler. Our scheduler is fully distributed, adds very little communication overhead, has low computational requirements, and is work-conserving. We prove the fairness properties of this scheduler analytically and also show experimental results from an implementation on the FAB distributed storage system that illustrate these properties.

We also present examples of unbalanced workloads for which no work-conserving scheduler can provide proportional sharing of the total throughput, and attempting to come close can block some clients on some bricks. We demonstrate a hybrid scheduler that attempts to provide total proportional sharing where possible, while guaranteeing a minimum share per brick for every client. Experimental evidence indicates that it works well.

Our work leaves several issues open. First, we assumed that clients using multiple coordinators load those coordinators equally or randomly; while this is a reasonable assumption in most cases, there may be cases when it does not hold, for example, when some coordinators have an affinity to data on particular bricks. Some degree of communication between coordinators may be required in order to provide total proportional sharing in this case. Second, more work is needed to design and evaluate better hybrid delay functions that can deal robustly with pathological workloads. Finally, our algorithms are designed for enforcing proportional service guarantees, but in many cases, requirements may be based partially on absolute service levels, such as a specified minimum throughput or maximum response time. We plan to address how this may be combined with proportional sharing in future work.

6 Acknowledgements

Antonio Hondroulis and Hernan Laffitte assisted with the experimental setup. Marcos Aguilera, Eric Anderson, Christos Karamanolis, Magnus Karlsson, Kimberly Keeton, Terence Kelly, Mustafa Uysal, Alistair Veitch, and John Wilkes provided valuable comments. We also thank the anonymous reviewers for their comments, and Carl Waldspurger for shepherding the paper.

References

[1] J. L. Bruno, J. C. Brustoloni, E. Gabber, B. Ozden, and A. Silberschatz. Disk scheduling with quality of service guarantees. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS), June 1999.

[2] D. D. Chambliss, G. A. Alvarez, P. Pandey, and D. Jadav. Performance virtualization for large-scale storage systems. In Proceedings of the 22nd International Symposium on Reliable Distributed Systems (SRDS). IEEE Computer Society, Oct. 2003.

[3] H. M. Chaskar and U. Madhow. Fair scheduling with tunable latency: a round-robin approach. IEEE/ACM Transactions on Networking, 11(4):592–601, 2003.

[4] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. In SIGCOMM Symposium Proceedings on Communications Architectures & Protocols, Sept. 1989.

[5] Z. Dimitrijević and R. Rangaswami. Quality of service support for real-time storage systems. In International IPSI-2003 Conference, Oct. 2003.

[6] B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. Xen and the art of virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, Oct. 2003.

[7] S. Frølund, A. Merchant, Y. Saito, S. Spence, and A. Veitch. A decentralized algorithm for erasure-coded virtual disks. In Int. Conf. on Dependable Systems and Networks, June 2004.

[8] P. Goyal, M. Vin, and H. Cheng. Start-time fair queueing: A scheduling algorithm for integrated services packet switching networks. ACM Transactions on Networking, 5(5):690–704, 1997.

[9] A. Gulati and P. Varman. Lexicographic QoS scheduling for parallel I/O. In Proceedings of the 17th ACM Symposium on Parallelism in Algorithms and Architectures, July 2005.

[10] J. L. Hellerstein, Y. Diao, S. Parekh, and D. M. Tilbury. Feedback Control of Computing Systems. Wiley-IEEE Press, 2004.

[11] L. Huang, G. Peng, and T.-c. Chiueh. Multi-dimensional storage virtualization. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2004.

[12] W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2004.

[13] V. Kanodia, C. Li, A. Sabharwal, B. Sadeghi, and E. Knightly. Distributed priority scheduling and medium access in ad hoc networks. Wireless Networks, 8(5):455–466, Nov. 2002.

[14] M. Karlsson, C. Karamanolis, and X. Zhu. Triage: Performance isolation and differentiation for storage systems. In Proceedings of the 12th International Workshop on Quality of Service (IWQoS). IEEE, June 2004.

[15] J. Katcher. Postmark: A new file system benchmark. Technical Report 3022, Network Appliance, Oct. 1997.

[16] E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proceedings of ASPLOS. ACM, Oct. 1996.

[17] C. R. Lumb, A. Merchant, and G. A. Alvarez. Facade: Virtual storage devices with performance guarantees. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST), Mar. 2003.

[18] H. Luo, S. Lu, V. Bharghavan, J. Cheng, and G. Zhong. A packet scheduling approach to QoS support in multi-hop wireless networks. Mobile Networks and Applications, 9(3):193–206, 2004.

[19] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single node case. IEEE/ACM Transactions on Networking, 1(3):344–357, 1993.

[20] Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence. FAB: Building distributed enterprise disk arrays from commodity components. In Proceedings of ASPLOS. ACM, Oct. 2004.

[21] D. C. Stephens and H. Zhang. Implementing distributed packet fair queueing in a scalable switch architecture. In Proceedings of INFOCOM, Mar. 1998.

[22] M. Uysal, G. A. Alvarez, and A. Merchant. A modular, analytical throughput model for modern disk arrays. In Proceedings of the 9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS). IEEE, Aug. 2001.

[23] N. H. Vaidya, P. Bahl, and S. Gupta. Distributed fair scheduling in a wireless LAN. In MobiCom: Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, Aug. 2000.

[24] C. A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, Dec. 2002.

[25] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, and G. R. Ganger. Storage device performance prediction with CART models. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2004.

[26] Y. Wang and A. Merchant. Proportional service allocation in distributed storage systems. Technical Report HPL-2006-184, HP Laboratories, Dec. 2006.

[27] W. Wilcke et al. IBM intelligent bricks project—petabytes and beyond. IBM Journal of Research and Development, 50(2/3):181–197, Mar. 2006.

[28] H. Zhang. Service disciplines for guaranteed performance service in packet-switching networks. Proceedings of the IEEE, 83(10):1374–1396, 1995.

[29] J. Zhang, A. Sivasubramaniam, A. Riska, Q. Wang, and E. Riedel. An interposed 2-level I/O scheduling framework for performance virtualization. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2005.

[30] L. Zhang. VirtualClock: A new traffic control algorithm for packet-switched networks. ACM Transactions on Computer Systems, 9(2):101–124, 1991.