Cost-aware caching schemes in heterogeneous storage systems


J Supercomput (2011) 56: 56–78. DOI 10.1007/s11227-009-0342-1

Cost-aware caching schemes in heterogeneous storage systems

Abhirup Chakraborty · Ajit Singh

Published online: 20 October 2009
© Springer Science+Business Media, LLC 2009

Abstract Modern single- and multi-processor computer systems incorporate, either directly or through a LAN, a number of storage devices with diverse performance characteristics. These storage devices have to deal with workloads of unpredictable burstiness. A storage-aware caching scheme—one that partitions the cache among the disks and aims at balancing the work across them—is necessary in this environment. Moreover, maintaining a proper size for these partitions is crucial. Adjusting the partition sizes after each epoch (a certain time interval) assumes that the workload in the subsequent epoch will show characteristics similar to those observed in the current epoch. In an environment with a highly bursty and time-varying workload, however, such an approach seems optimistic. Moreover, the existing storage-aware caching schemes assume a linear relationship between cache size and hit ratio. In practice, a (disk) partition may accumulate cache blocks (and thus choke the remaining disks) without increasing the hit ratio significantly. This disk choking phenomenon may degrade the performance of the disk system. In this paper, we address the issues of continuous repartitioning and disk choking. First, we present a caching scheme that continuously adjusts the partition sizes, forgoing any periodic activity. Then, considering the disk choking issue, we present a repartitioning framework based on the notion of marginal gains. Experimental results show the effectiveness of our approach; our scheme outperforms the existing storage-aware caching schemes.

Keywords Heterogeneous storage systems · Caching · Disk array · File systems

A. Chakraborty (✉) · A. Singh
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, N2L 3G1
e-mail: [email protected]

A. Singh
e-mail: [email protected]


1 Introduction

Modern computer systems interact with a broad and diverse set of storage devices. Accessing storage devices via local file systems, remote file servers such as NFS [29], AFS [16], Sprite [25] and xFS [2], archival storage on tapes, read-only compact disks, network-attached disks [13], etc. is a common activity on computer systems. Disk arrays [27], where disks of different ages and performance parameters may be incorporated, are commonly used. Moreover, nowadays there exist storage sites that a client can access across the Internet [21, 33]. Thus, there is a diversity of behavior and properties among the storage devices. And these characteristics of the devices will vary greatly as new storage components [7] are introduced.

Though this set of devices is disparate, one similarity is inherent among all: the time to access them is high, especially compared to CPU cache and memory latencies. Thus, there exists a wide gap between the performance of microprocessors and disks. To bridge this gap, today's file systems generally deploy a file cache. A file cache is a portion of main memory allocated by the operating system for temporarily storing frequently used disk blocks. Thus, the storage system can serve a disk block without accessing the disk if the requested block is found within the file cache. This way, the file cache filters disk requests and thereby reduces the overall execution time of individual applications and also increases overall system performance, often by an order of magnitude.

Though there have been substantial changes in storage technology over the past decades, the caching architecture used by modern operating systems has remained largely unmodified. There have been some innovations in technique, for example, incorporating application control [5], multi-level caching [14, 37], integrating the file cache and virtual memory cache [25], and integrating caching and prefetching [4, 15]. However, as far as caching policies are concerned, most modern operating systems employ Clock, LRU or LRU-like algorithms because of their simplicity. But both Clock and LRU have several limitations [3], and there has been recent work on improving the caching policy by making the replacement decision based on a combination of characteristics inherent in the workloads: spatial and temporal locality (recency), and frequency [3, 11, 20, 24]. The problem with these algorithms is that they are cost-oblivious: the replacement cost is assumed to be uniform for all cache blocks. On the contrary, these cache blocks might be fetched from devices with diverse performance characteristics. So, the assumption of uniform replacement cost is problematic in a system deploying multiple device types with a rich set of performance characteristics. As a simple example, consider a block fetched from a local disk as compared to one fetched from a remote, highly contended file server. In this case, the operating system should most likely prefer the block from the file server for replacement [12].

The storage-aware caching scheme in [12]—herein referred to as Forney's algorithm—addresses the issue of caching in a heterogeneous storage environment and proposes a caching scheme based on aggregate partitioning that attempts to balance work (i.e., the cumulative delay while accessing blocks) across devices. In this algorithm, the behavior of the caching system is observed during an epoch, and at the end of each epoch the cache is repartitioned based on that observation. Here, during an epoch—especially at its end—the performance may degenerate severely.


A partition may experience a large amount of delay while another partition holds cache blocks that are no longer pivotal. So, the length of an epoch, or window size (W), has a significant role in the efficacy of the algorithm. The performance could be rendered smooth by making this epoch length very small. But making the window size small has two adverse effects: first, a smaller W-value might not provide sufficient feedback and smoothing to support repartitioning decisions; second, a small W-value implies frequent repartitioning, which incurs a significant amount of processing overhead and results in a less efficient operating system. Though the former problem can be eliminated by accumulating sampled information over a number of past epochs, the latter still remains. Furthermore, a repartitioning may not produce any significant change in the sizes of the partitions, rendering the repartitioning activity worthless. Consider, for example, an extreme case where the workload of all devices remains steady for a long time; no adjustment of the partitions is necessary during that interval, and hence all repartitioning activities in that interval are redundant. In addition, the other two parameters of this algorithm, termed the threshold (T) and the base correction amount (I), need to be set empirically with care. The threshold value is used to determine the partitions that might be termed page (or block) consumers, whereas the base correction amount indicates the number of pages (or blocks) a page consumer should consume.

Based on these observations, we realize that the caching scheme can be rendered efficient by repartitioning the cache in a continuous fashion, without using the concept of an epoch, and only when it is necessary. Moreover, we observe that existing aggregate partitioning algorithms still have an inherent problem. These algorithms assume that the relationship between cache size and hit ratio is linear, and hence that the work across a slow disk can be decreased by allocating more cache blocks to that disk. But this relationship is not linear beyond a certain threshold. Hence, in practice, a disk with a higher age or workload may consume blocks without increasing cache hits proportionately. This happens when a disk enters the saturation region, where additional cache blocks cannot impart a significant increase in cache hits. This problem is inherent in all caching methods based on aggregate partitioning. At first glance, it appears that a lazy repartitioning approach may alleviate the problem. In lazy repartitioning, the algorithm does not reallocate blocks instantly to adjust the partitions to the desired sizes. Instead, it reallocates blocks only when the partition (block consumer) demands the allocated blocks. So, even if a slow disk logically draws blocks from the fast disks, a fast disk can still use the blocks that now belong to the slow disk; hence, the number of unutilized blocks in the slow disk could be reduced. But in this approach the allocation is one-way: the fast disks only lose blocks and can never regain them. In a system where the working set corresponding to the slow disk is larger than the cache size, this problem is easily realized: the slow disk can consume all the cache blocks, choking the remaining disks. We term this the disk choking problem.

In this paper, we address the two issues outlined above. First, we present a simple and efficient solution to the repartitioning problem that adjusts the partitions in a continuous fashion. Then, we address the problem of disk choking and propose a solution based on the notion of marginal utility.


Here, marginal utility refers to the reduction in the work performed by a disk (or the reduction in the amount of delay) with the addition of an extra cache block to the corresponding partition. The concept of using marginal utility in allocating buffers has been studied by the database community. In [26], the authors propose an approach to buffer allocation based on both the access pattern of queries and the availability of buffers at runtime. In relational database management systems, queries are issued by the clients, and these queries wait in a queue before execution. As a query is selected for execution, the buffer manager examines the access pattern of the query and the availability of buffers in the buffer pool. Based on these observations, the buffer manager allocates buffers to the queries. The issue of partitioning a cache among several competing disks is different from that of buffer allocation among queries. In database management systems, a query runs for a short time, and the buffer allocation algorithm does not allocate buffers to a running query based on the performance of the query as it runs: buffers are allocated before execution. The main difference in our situation is that in a general-purpose heterogeneous storage environment, categorization or formulation of the various access patterns is not possible. So, the marginal utility must be computed online, based on observation of the cache performance as it is supplied with a reference string. We propose a framework to capture the marginal utility values of the cache blocks and, based on this framework, a technique to adjust the partition sizes during system activity.

The rest of the paper is organized as follows: Sect. 2 presents an overview of the algorithmic aspects of this work. Section 3 presents a number of approaches to partitioning the cache in a continuous fashion. Section 4 describes the framework and mechanism for repartitioning the cache considering the issue of disk choking. Section 5 outlines the simulation environment. Section 6 presents experimental results showing the effectiveness of the utility-based approach. Section 7 surveys the work related to ours. Finally, Sect. 8 concludes the paper and outlines future extensions of this research.

2 Preliminaries

This section provides an overview of the algorithmic issues we explore. First, we outline the existing cost-aware algorithms based on aggregate partitioning. Then we provide a taxonomy of aggregate partitioning. We use the terms page and block interchangeably in this paper.

2.1 Algorithms based on aggregate partitioning

In a cost-oblivious caching approach, an incoming page (or block) replaces an existing page that may be located anywhere in the cache. This can be termed the place-anywhere approach. In a place-anywhere algorithm, costs are recorded at page-level granularity, and a page can occupy any location in the cache. On the contrary, an aggregate partitioning algorithm divides the cache into logical partitions and assigns a partition to each device. The algorithm maintains performance or cost information at the granularity of partitions. As cost information is maintained for each partition, the amount of meta-data is reduced and cost information can be updated without scanning the whole cache. Moreover, aggregate partitioning integrates well with existing software, as cost-oblivious policies can be employed for replacing individual pages within a partition.

Forney’s algorithm is the first cost-aware algorithm that utilizes the notion of aggregate partitioning [12]. It considers both static (due to the diverse physical characteristics of storage media) and dynamic (due to variations in the workload on disks and in network traffic) performance heterogeneity. In this approach, the cache is divided into logical partitions, where the blocks within a partition come from the same device and thus share the same replacement cost. The size of each partition is varied dynamically to balance work across devices. Here, work is defined as the cumulative delay for each device. The main challenge of this algorithm is determining the relative sizes of the partitions dynamically. The dynamic repartitioning algorithm works in two phases: in the first phase, the cumulative delay for each device is determined; in the second, the cache partitions are adjusted. These two phases repeat cyclically.

The cumulative delay for each partition (or device) is measured over the last W successful device requests (distributed over all the devices), where W is the window size. Knowing the mean delay over all partitions and the per-device cumulative wait time, the relative wait time for each device is determined.

During repartitioning, page consumers and page suppliers are identified based on the relative wait times of the partitions. Page consumers are partitions that have a relative wait time above a threshold T; page suppliers are partitions with below-threshold wait times. Here, the threshold value is used to infer a variation in delay due to variations in workload or device characteristics only. Moreover, the algorithm classifies each partition into one of four states: cool, warming, cooling, and warm. Of these, the first corresponds to a page supplier and the rest correspond to page consumers. A page consumer increases its partition size by I pages, where I is the base correction amount. If a partition remains a page consumer during subsequent epochs, the increase in partition size grows exponentially. On the other hand, the number of pages a page supplier j must yield is given as:

$$\frac{\mathit{IRWT}_j}{\sum_{i \in \mathit{suppliers}} \mathit{IRWT}_i} \times N_{con}$$

where IRWT_j = 1 − (relative wait time of partition j), and N_con is the total number of pages to be consumed by all the consumers during this repartitioning.
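As a concrete illustration, the sketch below computes the supplier shares under this rule. It is a minimal Python rendering of the formula above, not code from the paper; the function and variable names are invented for the example.

```python
def pages_to_yield(rel_wait, n_con):
    """Split the total correction N_con across the page suppliers in
    proportion to their inverse relative wait times
    (IRWT_j = 1 - relative wait time of partition j)."""
    irwt = {j: 1.0 - w for j, w in rel_wait.items()}
    total = sum(irwt.values())
    return {j: v / total * n_con for j, v in irwt.items()}

# Two suppliers with relative wait times 0.2 and 0.6 must together
# yield 100 pages; the less-delayed supplier yields more.
print(pages_to_yield({"disk0": 0.2, "disk1": 0.6}, 100))
# {'disk0': 66.66..., 'disk1': 33.33...}
```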

2.2 Taxonomy

Two basic approaches are possible for aggregate partitioning: static and dynamic. In a static scheme, the size of each partition is predetermined and remains fixed during the operation of the system. However, without an estimate of the workload, and without knowledge of the miss rate as a function of cache size, it is not possible to come up with partition sizes that balance the work across devices. Thus, dynamic partitioning, which adjusts the partition sizes during the operation of the system, is necessary.

Dynamic partitioning can be classified into eager partitioning and lazy partitioning. In eager partitioning, whenever new partition sizes are desired, the partition sizes are adjusted immediately by physically moving the blocks. Lazy partitioning gradually moves the allocated blocks to a partition only when demanded by that partition (the consumer). In this scheme, a partition does not incorporate newly assigned blocks instantaneously; rather, the partition claims new blocks only when it needs cache blocks to store incoming disk blocks.

3 Continuous repartitioning

Our approach, termed continuous repartitioning, attempts to balance work across devices and employs an aggregate partitioning scheme. We associate with each partition i a parameter Di that records the cumulative delay of that partition. Whenever a miss occurs in partition i, Di is incremented by the retrieval time of the incoming block. Upon a miss, depending on the D-value of the affected partition, one of two alternatives happens: the partition either receives a block from another partition or evicts a block of its own to make room for the incoming disk block. To aid in making this decision, partitions are classified into three categories or states: consumer, supplier, and neutral. Upon a miss, a consumer receives a block from one of the suppliers, whereas a supplier or a neutral partition replaces one of its own blocks. Efficient maintenance of the partition states is the cardinal issue of our algorithm. This can be done either by using a graph-based approach, which provides a set of suppliers for each individual partition, or by implicitly maintaining consumers and suppliers in a generalized form. We discuss these techniques in the subsequent parts of this section. At the end of the section, we discuss the strategy employed by a consumer when selecting a victim partition from a set of suppliers.

3.1 A simple approach

To determine whether a partition can receive a cache block, or whether a partition should yield a cache block, we can maintain a Directed Acyclic Graph (DAG) in which each node corresponds to a partition. In this DAG, there is an edge i → j if Di > Dj. The edge i → j indicates that partition j is a supplier for partition i. So, the set of suppliers for a particular partition can be determined from the outgoing edges of the corresponding node; whenever a miss occurs within that partition, new blocks can be acquired from any of these suppliers.
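As a toy illustration of this rule, the sketch below derives the supplier set of each partition directly from the D-values; the names are invented here, and the system described in this section would maintain the DAG incrementally rather than rebuild it on every query.

```python
def supplier_sets(D):
    """Outgoing edges of the DAG: partition j is a supplier for
    partition i whenever D[i] > D[j]. Rebuilds the whole structure,
    which is fine for illustration but not for the per-miss path."""
    return {i: [j for j in D if D[i] > D[j]] for i in D}

# Partition 'a' has accumulated the most delay, so it may draw
# blocks from both 'b' and 'c'.
print(supplier_sets({"a": 120.0, "b": 80.0, "c": 70.0}))
# {'a': ['b', 'c'], 'b': ['c'], 'c': []}
```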

But this approach has a drawback: it requires frequent and unnecessary block switches, i.e., transfers of blocks from one partition to another. Blocks that are in transition may not be well utilized. Moreover, though an individual block switch requires an insignificant amount of processing, a huge number of block switches over a short interval can add up to significant processing overhead.


Fig. 1 An instance of an unnecessary block switch: j consumes blocks from k and l, but at the same time may supply blocks to i

3.2 Refined approach

To eliminate frequent and unnecessary block switching, we introduce the notion of a threshold (δ). This parameter should be chosen carefully, based on the workload and the behavior of the devices. Now, in the DAG, an edge i → j is added if Di ≥ Dj + δ. But the problem of redundant block switching is still inherent in this approach. A partition receiving blocks from another partition may have to yield blocks to some other partitions. This occurs when a partition is a consumer with respect to some partitions but a supplier for others. This scenario is shown in Fig. 1. Here, whenever a miss occurs within j, it receives a block from either k or l; and whenever there is a miss in i, it receives a block from any of the partitions l, j, and k. So, there may be unnecessary block transfers, as j may receive and deliver blocks at the same time.

To eliminate the possibility of unnecessary block switching, we impose the fol-lowing constraint:

Constraint a: Do not allow a partition to be a candidate to be both a supplier and a consumer at the same time.

Here, we observe that variations in D-values can be controlled by restricting the length of all paths in the DAG. In one extreme instance of the DAG, there is no edge at all: all the D-values are within δ of one another, and hence can be termed homogeneous. It is expected that, given a suitable value of δ, the existence of paths of length two or more is highly improbable.

In algorithms based on a DAG, maintaining the DAG requires substantial processing, proportional to the number of cache misses: whenever a miss occurs, and hence a particular D-value changes, we have to scan the DAG to determine which edges should be deleted and which inserted. Moreover, to find the suppliers for a particular partition i, we have to traverse the DAG starting from node i and select the leaf nodes (those having no outgoing edge), which are the suppliers for that partition. As these activities are performed on each cache miss, the processing overhead is not insignificant. In addition, some memory must be allotted to store the DAG.

To eliminate these problems, we attempt to represent suppliers and consumers implicitly, without using a DAG. We outline the technique in the following subsection.


3.3 Implicitly maintaining suppliers and consumers

In a DAG-based approach, we can get a different set of suppliers for each partition by simply traversing the DAG. Instead of deriving the set of suppliers for each partition, we endeavor to maintain the lists of suppliers and consumers implicitly, in a generalized way. In this case, a consumer can receive a block, if needed, from a supplier properly selected from the list of suppliers.

In this approach, as shown in Fig. 2, we maintain two lists: one for suppliers, denoted as S, and the other for consumers, denoted as C. Four variables, shown in Table 1, keep track of the maximum and minimum values of both lists. The minimum gap or distance between the supplier and consumer lists is denoted as h. Lists C and S are subject to the following constraint:

Constraint b: The minimum gap between lists C and S must be greater than or equal to δ.

A partition i can be placed in list C if and only if the following condition holds:

$$D^{C}_{\max} - D_i < \delta \quad\text{and}\quad D_i - D^{S}_{\max} \geq \delta. \tag{1}$$

Here, the first term ensures that constraint a is not violated, and the second term ensures that constraint b is maintained.

In a similar way, a partition i can be placed in list S if and only if the following condition holds:

$$D_i - D^{S}_{\max} < \delta \quad\text{and}\quad D^{C}_{\min} - D_i \geq \delta. \tag{2}$$
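Taken together, conditions (1) and (2) give a constant-time state test. The following sketch assumes that the extrema of Table 1 are tracked incrementally elsewhere; the function and argument names, and the string labels, are illustrative rather than taken from the paper.

```python
DELTA = 500.0  # the threshold delta, in ms (the value used in Sect. 6)

def classify(d_i, dc_max, dc_min, ds_max):
    """Constant-time membership test for partition i, following
    conditions (1) and (2)."""
    if dc_max - d_i < DELTA and d_i - ds_max >= DELTA:
        return "consumer"   # condition (1): may join list C
    if d_i - ds_max < DELTA and dc_min - d_i >= DELTA:
        return "supplier"   # condition (2): may join list S
    return "neutral"
```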

Fig. 2 The lists S and C are separated by a vertical distance h. The vertical scale represents the cumulative delay; the nodes are spread horizontally only for simplicity of representation

Table 1 Parameters associated with the supplier and consumer lists

  D^C_max : node with maximum D-value in the consumer list
  D^C_min : node with minimum D-value in the consumer list
  D^S_max : node with maximum D-value in the supplier list
  D^S_min : node with minimum D-value in the supplier list
  h       : minimum gap between lists C and S


So, a partition can decide its state (i.e., supplier, consumer, or neutral) in constant time, without scanning through all the partitions. During the operation of the system, as the D-values of the partitions change, the supplier and consumer lists can be maintained by adjusting the minimum and maximum values in these lists.

3.4 Selecting victim partition

When a cache miss occurs in a partition in the consumer state, the partition selects a victim partition from the list of suppliers. In selecting the victim partition, a strategy similar to inverse lottery, previously proposed for resource allocation [34], can be used. The idea is that each supplier is given a number of tickets in inverse proportion to its cumulative delay. When a replacement is needed, a lottery is held by drawing a random ticket; the partition holding that ticket becomes the victim. The victim partition then yields its least valuable page. The purpose of this mechanism is to penalize the suppliers with less cumulative delay more heavily.
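A minimal sketch of such an inverse lottery, assuming strictly positive cumulative delays; mapping ticket counts to weights of 1/D is one natural choice, as the paper does not fix a formula.

```python
import random

def pick_victim(suppliers):
    """Inverse lottery over the supplier list: each supplier holds
    tickets in inverse proportion to its cumulative delay D, so the
    least-delayed supplier is the most likely victim."""
    weights = [1.0 / d for _, d in suppliers]
    return random.choices([p for p, _ in suppliers], weights=weights)[0]

victim = pick_victim([("disk0", 200.0), ("disk1", 800.0)])
# disk0 is four times as likely as disk1 to yield its least valuable page
```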

3.5 Agility

In the experiments of Sect. 6, we have taken care of the issue of agility. This refers to the requirement that a partition should catch up instantly with a change in workload. Consider two partitions, i and j. Suppose partition i incurs high activity and its delay increases in proportion, so that its delay surpasses that of partition j by a large amount (i.e., Di ≫ Dj). In this case, partition j may have to wait a long time before it can consume a block from partition i. This can be solved by imposing a restriction on the growth of the D-values of the partitions.

Also, some disks may be inactive or may have little workload over a long time interval. In that case, the D-value of the corresponding partition will remain almost fixed. Hence, we also have to restrict the divergence of the D-values of the partitions.
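The text states these two restrictions without giving formulas. One plausible realization, shown purely as an assumption-laden sketch, is to clamp every D-value to within a fixed spread of the current minimum:

```python
def clamp_delays(D, spread):
    """Bound the divergence of the cumulative delays: after each
    update, no partition's D-value may exceed the current minimum by
    more than `spread`. This particular rule is an assumption of the
    sketch, not the authors' formula."""
    floor = min(D.values())
    return {i: min(d, floor + spread) for i, d in D.items()}

print(clamp_delays({"i": 9000.0, "j": 1000.0}, spread=3000.0))
# {'i': 4000.0, 'j': 1000.0}: j no longer waits indefinitely to consume
```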

4 Repartitioning based on marginal utility

As outlined in Sect. 1, the existing approaches based on aggregate partitioning suffer from a significant limitation related to the assumed linear relationship between cache size and hit ratio. However, this assumption is not valid: under certain conditions, a disk can consume blocks without increasing cache hits proportionately. Thus, the continuous repartitioning approach given in Sect. 3 is no longer effective in this scenario. Here, the proposed solution must track the utilization of a block within a partition: a partition should consume a block only if it can render better utilization of the block. We maintain that the marginal utilities of the blocks within the various partitions are a suitable indicator of the utilization of a block within a partition. We outline a framework to calculate the marginal utility of the blocks within a partition, and use this marginal utility to make the repartitioning decision. As illustrated in the subsequent subsection, we maintain the marginal utility at the granularity of a cache segment that consists of a few cache blocks. So, in this approach, the unit of repartitioning is a segment.


4.1 Marginal utility

For a cache of size n, the marginal utility of the nth cache block can be expressed as

$$MU(n) = D(n-1) - D(n),$$

where D(n) refers to the delay experienced by cache misses when the cache size is n.

Maintaining marginal utility at the block level leads to overhead in terms of processing and storage space. So, we divide the cache into segments of length nseg blocks and maintain the marginal utility for each segment. Hence, a segment consists of nseg adjacent cache blocks that are allocated or deallocated as a whole within a partition. This leads to a piecewise estimation of the delay as a function of cache size. Here we assume that the cache uses the LRU replacement algorithm. Figure 3 shows the marginal utilities of the cache segments.

While accessing a block on a disk, the entire cache partition corresponding to that disk is searched to locate the block within the partition. If the block hits within a segment of the partition, the marginal utility of that segment is incremented by the access time of that block.
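A compact sketch of this bookkeeping: the segment index of a hit is the block's LRU stack depth divided by the segment length. The class and method names are invented for illustration; lookup structures and the miss path are elided.

```python
class SegmentedLRU:
    """One cache partition under LRU, logically split into segments of
    nseg blocks (Sect. 4.1). A hit at LRU stack depth d falls in
    segment d // nseg, and that segment's marginal utility grows by
    the block's access time."""

    def __init__(self, n_segments, nseg):
        self.nseg = nseg
        self.mu = [0.0] * n_segments  # per-segment marginal utility
        self.lru = []                 # block ids, MRU first

    def on_hit(self, block, access_time):
        depth = self.lru.index(block)         # LRU stack depth of the hit
        self.mu[depth // self.nseg] += access_time
        self.lru.pop(depth)
        self.lru.insert(0, block)             # move block to MRU position
```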

4.2 Repartitioning

The repartitioning scheme uses the marginal utility values when making the repartitioning decision. A partition (consumer) takes a cache segment from another partition (supplier) if the cache segment can be better utilized within the former. For this, a consumer must maintain the utility of segments that are not within its current partition. So, a consumer keeps a ghost cache for the segments that lie outside the current partition size. The ghost cache does not keep the cache blocks; it only stores the block identifiers. So, the space overhead of the ghost cache is very low.

Fig. 3 The LRU cache is broken into segments. The marginal utility of each segment represents the number of cache hits within that cache segment


4.2.1 Basic idea

We use the notation Pi to denote the ith partition, and si to indicate the size (in segments) of partition Pi (Pi consists of si segments, so the si-th segment is the last segment of partition Pi). We use $P_i^k$ to refer to the kth segment of partition Pi.

At a certain instant, partition Pi may either consume a segment at location (si + 1) or supply its si-th segment. We call the si-th segment (the one the partition will supply first) the primary S-segment, and the (si + 1)-th segment (where a segment consumed from another partition will be placed) the primary¹ C-segment of partition Pi. Now, a partition Pi can consume a segment from a partition Pj if

$$MU(P_i^{s_i+1}) \geq MU(P_j^{s_j}) + \delta.$$

Here, δ is a threshold that should be chosen carefully. This parameter is introduced to suppress transfers of segments from one partition to another due to instantaneous variation in the workload. It should be noted that a partition can consume not just one but multiple segments (from one or more other partitions) at a time, so this approach can quickly adapt to variations in the workload. Moreover, as the ghost segments (for a consumer) store the identifiers of the cache blocks, those blocks can be prefetched immediately from the disk, which increases the throughput of the disk system. Figure 4 shows this concept of transferring a segment to the consumer. We address the issue of maintaining ghost segments at the end of this subsection. The issue of agility described in Sect. 3 also applies to this repartitioning approach; we incorporated this mechanism in the experiments of Sect. 6.

Here, it should be noted that though a segment consists of a number of physically adjacent cache blocks, the disk blocks mapped to, or buffered within, the segment may not be adjacent in physical storage. The placement of a disk block within a cache partition is entirely determined by the intra-partition caching policy (e.g., LRU).
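The consume test of this subsection can be phrased in a few lines of Python. The sketch below only moves MU bookkeeping between two partitions; the class, its fields, and the function name are our assumptions, and transferring the cached blocks and prefetching the ghost blocks, as described above, are elided.

```python
from dataclasses import dataclass

@dataclass
class Part:
    mu: list        # MU per resident segment; mu[-1] is the s-th segment
    mu_ghost: list  # MU estimates for segments just beyond the partition

def try_consume(p_i, p_j, delta):
    """P_i takes P_j's last (s_j-th) segment when that segment is worth
    at least delta more as P_i's (s_i+1)-th segment:
    MU(P_i^{s_i+1}) >= MU(P_j^{s_j}) + delta."""
    if p_i.mu_ghost and p_j.mu and p_i.mu_ghost[0] >= p_j.mu[-1] + delta:
        p_j.mu.pop()                        # P_j supplies its s_j-th segment
        p_i.mu.append(p_i.mu_ghost.pop(0))  # becomes P_i's (s_i+1)-th segment
        return True
    return False
```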

4.2.2 Identifying suppliers and consumers

So far we have laid out part of the framework for repartitioning. One major issue that remains is deciding when to make the repartitioning decision. In a utility-based approach, the relation between a supplier and a consumer holds for a very short interval. This relation may disappear immediately after a repartitioning decision is made (i.e., one or more segments are transferred to a consumer). So, maintaining the supplier–consumer relationship is not relevant in this scenario.

One simple approach is to take the decision at each cache miss. But this is prohibitive, as we would have to scan all the partitions to decide whether there exists any consumer or supplier. Moreover, a partition may consume or supply segments only after a long interval of disk activity. Another approach is to make the repartitioning decision after a certain time interval. But this approach is also inefficient: it may perform unnecessary repartitioning, or it may repartition at an inopportune moment (e.g., when the partitions should have been adjusted long before).

¹ As described in a subsequent part of this section, a partition can consume or supply multiple segments at a time. Hence, primary is introduced to refer to the first segment to be supplied or consumed.


Fig. 4 Partition Pi consumes the sj-th segment (and subsequent segments) of partition Pj if those segments would render better utility in partition Pi. Once the sj-th segment is transferred to partition Pi, it becomes the (si + 1)-th segment of partition Pi. Now, the (si + 1)-th segment of partition Pi is loaded with the disk blocks contained within ghost segment si + 1. Partition Pj maintains a ghost segment for the sj-th segment

Based on the above observations, we propose a repartitioning approach that minimizes the overhead and adjusts the partitions whenever necessary. In this approach, we maintain two variables, max and min, that refer to the partition with the maximum MU-value of the primary C-segment and the partition with the minimum MU-value of the primary S-segment, respectively.² As will be evident in the subsequent part of this subsection, maintaining the value min is not feasible. So, instead of min, we keep minV, which is a lower bound on the minimum MU-value of the primary S-segments: the minimum MU-value of the primary S-segments (over all partitions) is at least minV.

Maintaining the variable max is simple. Upon a cache miss on a partition, the partition:

1. adjusts the MU-values;
2. if there is any change in the MU-value of its primary C-segment, checks whether the MU-value of its C-segment is greater than that of the partition denoted by max, and adjusts max accordingly.

So, the max value is set properly whenever there is a miss, and it always refers to the partition with the maximum MU-value of the primary C-segment. But the scenario is different for the variable min: it might refer to a partition that has not been accessed for a long time interval. So, the task of setting the variable min must be attributed to the other, active partitions. For this, we maintain the variable minV.

² It should be noted that in this scenario partitions cannot be categorized as suppliers or consumers beforehand. The intention of maintaining max and min is to track whether any supplier–consumer relationship develops within the system.


We try to adjust the variable minV only when

$$MU(P_{max}^{s_{max}+1}) - minV \geq \delta.$$

Hence, we attempt to adjust the variable minV when the MU-value of the C-segment of the partition Pmax exceeds minV by the amount δ. Note that the above is the condition for repartitioning, but repartitioning might not be feasible at this moment, as the minimum MU-value of the S-segments among all the partitions might have changed since minV was last initialized. So, we first adjust the minV value and then check whether any adjustment of the partition sizes is feasible. Thus, when a miss occurs in a partition, we perform the following steps in addition to the two steps given earlier.

3. If there is any change in $MU(P_{max}^{s_{max}+1})$ (i.e., either max refers to the current partition or max has changed) and $MU(P_{max}^{s_{max}+1}) - minV \geq \delta$, then initialize minV; otherwise stop.

4. If $MU(P_{max}^{s_{max}+1}) - minV \geq \delta$, adjust the partition sizes.
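Putting steps 1–4 together, the following is a hedged sketch of the per-miss bookkeeping; the helper methods (update_mu, primary_c_mu, primary_s_mu) and the state object are assumptions of ours, not names from the paper.

```python
def on_miss(p, partitions, state, delta):
    """Steps 1-4 of the miss path in Sect. 4.2.2. `state.max_p` is the
    partition whose primary C-segment currently has the highest MU;
    `state.min_v` lower-bounds the primary S-segment MUs."""
    p.update_mu()                                       # step 1
    if p.primary_c_mu() > state.max_p.primary_c_mu():   # step 2
        state.max_p = p
    c_mu = state.max_p.primary_c_mu()
    if c_mu - state.min_v < delta:                      # step 3: otherwise stop
        return False
    state.min_v = min(q.primary_s_mu() for q in partitions)
    return c_mu - state.min_v >= delta                  # step 4: repartition now?
```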

4.2.3 Ghost segments

Here, it should be noted that maintaining ghost segments for all the cache segments is not necessary; it suffices to maintain a few ghost segments for each partition. We observe that increasing the number of ghost segments does not increase performance, and that maintaining only three ghost segments per partition renders good performance. As a partition grows, the need arises to estimate the MU-value of a higher-order segment. This estimation can be done from the MU-values of the ghost segments maintained by the algorithm, using a forward interpolation method. When a partition grows, low-order ghost segments can be reallocated to maintain the high-order ghost segments that newly enter the ghost window. Figure 5 shows the concept of using the ghost window.
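The text names a forward interpolation method without spelling it out; the sketch below assumes simple linear extrapolation from the last two ghost values (clamped at zero), which is only one way to realize it.

```python
def extrapolate_mu(ghost_mu):
    """Estimate the MU of the segment newly entering the ghost window
    by linear forward extrapolation from the last two ghost values.
    The linear rule is an assumption of this sketch."""
    slope = ghost_mu[-1] - ghost_mu[-2]
    return max(0.0, ghost_mu[-1] + slope)

print(extrapolate_mu([9.0, 6.0, 4.0]))  # estimated MU of the next segment: 2.0
```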

5 Evaluation environment

This section describes our methodology for evaluating the performance of storage-aware caching. We describe our simulator and the storage environment assumed in it. In Sect. 6, we present the results obtained using this simulator.

To measure the performance of storage-aware caching, we have implemented a trace-driven simulator. The simulator assumes a storage environment where a number of disks (of varying ages) are accessed by a single client. The client has a local cache that is partitioned across the disks. Each partition is maintained using the LRU replacement strategy. The focus of our investigation is maintaining the proper partition sizes dynamically. The client issues the workload for the disks.

The client workload that drives the simulator is captured in a trace file, which specifies the data blocks accessed at various time points. We derive the synthetic disk traces using the PQRS algorithm proposed in [35]. This algorithm has been shown to generate traces that capture the spatiotemporal burstiness and correlation found in real traces [28].


Fig. 5 The rightmost ghost segment corresponds to the cache segment si + r. When partition Pi consumes a segment si + 1, the leftmost ghost segment is reallocated for the cache segment si + r + 1. The MU-value of segment si + r + 1 is estimated (using a forward interpolation equation) from the current values in the window. Thus, the space for the ghost window remains constant

We use several traces (trace 1, trace 2, trace 3, trace 4 and trace 5) in evaluating the performance of the caching schemes. The number of disk blocks for trace 1 and trace 3 is 120,000, and for trace 2 and trace 4 it is 100,000. The numbers of requests for trace 1, trace 2, trace 3 and trace 4 are 200,000, 180,000, 350,000 and 300,000, respectively. The size of a disk block is taken as 8 KB. Trace 5 emulates the non-linear relationship between cache size and hit rate: this simple trace file contains a series of sequential scans of the disk blocks.

Using these five trace files, we perform five sets of experiments. In the first two settings, we use trace 1 and trace 2. In the first setting, we use only trace 1 and apply the trace file to each of the disks. This is similar to a RAID-0 environment, where a disk block is split across a set of disks and each disk must be accessed to retrieve a block. In the second setting, we use only trace 2 and feed the reference string of the trace file to the disks in a shifted fashion: we identify equally spaced positions within the reference string and start feeding the references to the disks from these positions. In the third and fourth settings, we use trace 3 and trace 4, respectively. To vary the workload with time, we use a simple access model that filters the disk requests using a distribution function and varies the distribution periodically. In the fifth setting, we use trace 5 and trace 2: we apply trace 5 to the slow disks, and apply trace 2 to the rest of the disks, starting from different locations within the trace as described earlier.

We model the disk access time using only the disk bandwidth, average seek time and average rotational latency. Hence, our disk model considers the worst-case scenario. Device heterogeneity is achieved by device aging. As in [12], we consider a base device (IBM 9LZX) and age its performance over a range of years. A collection of disks from this set is used as the disk system in the simulator. The characteristics of disks of various ages are shown in Table 2. The disk parameters used in the simulator are based on the age of each individual disk (i.e., as given by Table 2).


Table 2 Aging a base disk device (IBM 9LZX). The table shows the performance parameters of the same base device in different generations as the disk technology improves

  Age (years)   Bandwidth (MB/s)   Seek time (ms)   Rotation (ms)
  0             20.0               5.30             3.00
  1             14.3               5.89             3.33
  2             10.2               6.54             3.69
  3             7.29               7.27             4.11
  4             5.21               8.08             4.56
  5             3.72               8.98             5.07
  6             2.66               9.97             5.63
  7             1.90               11.1             6.26
  8             1.36               12.3             6.96
  9             0.97               13.7             7.73
  10            0.69               15.2             8.59
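From these parameters, the simulator's worst-case per-block access time can be computed as average seek plus average rotational latency plus the transfer time of one block. Sect. 5 does not give an explicit formula, so the additive form and MB = 10^6 bytes are assumptions of this sketch.

```python
BLOCK = 8 * 1024  # 8 KB disk block (Sect. 5)

def access_time_ms(bandwidth_mb_s, seek_ms, rotation_ms):
    """Worst-case per-block access time: seek + rotational latency +
    transfer time of one block, assuming 1 MB = 1e6 bytes."""
    transfer_ms = BLOCK / (bandwidth_mb_s * 1e6) * 1e3
    return seek_ms + rotation_ms + transfer_ms

# A 4-year-old disk from Table 2: 5.21 MB/s, 8.08 ms seek, 4.56 ms rotation
print(round(access_time_ms(5.21, 8.08, 4.56), 2))  # ~14.21 ms per missed block
```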

6 Experimental results

In this section, we present a series of experimental results assessing the effectiveness of the proposed caching schemes. We measure the throughput obtained at the client side and use it as the performance metric. This throughput is measured by observing the delay in retrieving disk blocks. While measuring the delay, we consider only the delay of cache misses; the delay experienced on a cache hit is comparatively negligible. We measure throughput while varying the age of the slow disk and the cache size, performing the experiments under the five settings of the trace files stated in Sect. 5. For the disk system, we consider a set of four disks, and the slow disk is chosen randomly from the pool. To compare our results, we use an existing storage-aware caching scheme (Forney's scheme). The δ-value is set to 500 ms, which is observed to capture the changes in the stable behavior of a disk. For the utility-based approach, we maintain only three ghost segments per partition. The control values of the cache size and the age of the slow disk are set to 250 MB and 4 years, respectively. We set the segment size to 2 percent of the total cache size. Initially, the cache partitions (possibly with the exception of the last one) are assigned an equal number of consecutive segments.

Here it should be noted that, contrary to [12], we do not model the disk request size and do not use request locality to calculate the seek time and rotational latency. In our simulation, the request size is equal to the block size. Hence, we take a pessimistic approach and assume that each disk miss results in a delay equal to the disk access time, whereas in [12] the delay is calculated at the granularity of the request size by exploiting the locality of the requests, the request size there being far greater than a block size.

We first present the experimental results with the time-varying workloads. Figures 6 and 7 show the effect of varying the age of a disk.


Fig. 6 Overall throughput of the disk system with varying ages of the slow disk (Trace setting 1)

Fig. 7 Overall throughput of the disk system with varying ages of the slow disk (Trace setting 2)

Fig. 8 Overall throughput of the disk system with varying cache sizes (Trace setting 1)

As shown in the figures, throughput decreases as the slow disk's age increases. However, the continuous repartitioning and utility-based schemes attain higher throughput than Forney's scheme as the disk age grows.

The effect of varying the cache size on the caching schemes is shown in Figs. 8 and 9. Here, the throughput for each of the traces increases with cache size. As shown in the figures, the new schemes achieve higher throughput when compared with Forney's scheme.


Fig. 9 Overall throughput of the disk system with varying cache sizes (Trace setting 2)

Fig. 10 Overall throughput of the disk system with varying ages of the slow disk (Trace setting 3)

Fig. 11 Overall throughput of the disk system with varying ages of the slow disk (Trace setting 4)

Figures 10, 11 and 12 show the effect of varying the age of a disk. Here, we select a particular disk and obtain the simulation data by aging that disk. As shown in the figures, throughput decreases as the slow disk's age increases. In the first two settings, the performance of the continuous repartitioning and utility-based schemes is almost identical, with the throughput of the latter slightly dominating that of the former. However, the performance of these two schemes is notably higher than that of Forney's scheme.


Fig. 12 Overall throughput of the disk system with varying ages of the slow disk (Trace setting 5)

Fig. 13 Overall throughput of the disk system with varying cache sizes (Trace setting 3)

Fig. 14 Overall throughput of the disk system with varying cache sizes (Trace setting 4)

Throughput with setting 4 is higher than with setting 3 because of the higher spatial and temporal locality of the relevant traces. The effectiveness of the utility-based scheme is evident in Fig. 12. In this setting, as described earlier, one of the applied traces (i.e., trace 5) captures the non-linear relationship between cache size and hit ratio. As shown in the figure, when supplied with such a trace, the utility-based scheme attains significantly higher throughput than the other two schemes.

The effect of varying the cache size on the caching schemes is shown in Figs. 13, 14 and 15. Here, the throughput for each of the settings increases with cache size. In the first two settings (Figs. 13 and 14), the utility-based scheme performs as well as the continuous repartitioning scheme.


Fig. 15 Overall throughput of the disk system with varying cache sizes (Trace setting 5)

Fig. 16 Overall throughput of the disk system with varying δ-values (Trace setting 1)

However, the utility-based scheme outperforms the rest of the schemes in experimental setting 5, where a trace with a non-linear relationship between cache size and hit ratio is fed to the slow disk. Here, as the cache size is increased, the throughput of the utility-based scheme increases rapidly (Fig. 15). For the continuous repartitioning scheme, there is only a small increase in throughput with increasing cache size; for Forney's scheme, the throughput remains almost flat as the cache size increases.

Figure 16 shows the throughput with varying δ-values. The throughput is low for a very low δ-value, and it decreases sharply for a very high value (greater than 2000 ms). Selecting a low value results in redundant block switching due to instantaneous variations in the workloads; selecting a very high value does not result in good utilization of the cache blocks. Thus, selecting a δ-value an order of magnitude higher than a miss cost is necessary to capture the long-term variations in the workloads. From Fig. 16, we observe that the performance remains stable over a wide range of δ-values.


7 Related work

There has been work on cost-aware caching in the areas of web caching, main memory and disk caching. We revisit the significant works in each of these areas.

In web caching, the community has studied cost-aware caching considering pages with varying sizes and costs. This general caching problem is more intricate than the uniform version. In [17], Irani studies the special case of this general problem considering only pages with varying sizes; there, it is pointed out that Belady's rule is no longer optimal if page sizes and costs differ. Page replacement policies for the general caching problem are studied by Albers et al. [1], who classify general caching problems into four models and propose several approximate solutions to the offline case. The theoretical computer science community has studied cost-aware algorithms as k-server problems [23]; cost-aware caching falls within a restricted class of k-server problems, i.e., weighted caching. The Greedy-Dual (GD) algorithm [38] introduces variable fetch costs for pages of uniform size. The Greedy-Dual-Size (GDS) algorithm [6, 17] extends GD to an environment with variable object sizes and fetch costs. LANDLORD [39], which is closely related to the GDS web caching algorithm in [6], is a significant algorithm in the literature. Page replacement algorithms developed in the context of web caches do not necessarily apply to disk caching in a file system. The main reason is that a file cache stores fixed-size blocks and does not take into account the size of a document, whereas a web cache uses whole-document caching, and the size of web documents varies depending on the type of information they contain (video, audio, text, etc.). Moreover, there is a larger variation in performance across the wide-area Internet than within a storage system. Finally, the replacement costs of web pages do not show strong correlation the way the costs of blocks within a storage system do.

In [18], the authors propose a Cost-Sensitive OPTimal replacement algorithm (CSOPT) that minimizes a miss-cost function in a system with two types of miss costs: local and remote memory misses. This work is set in the context of CC-NUMA multiprocessors, where local and remote misses have different costs due to the large remote-to-local memory latency ratio [22, 36]. Moreover, a remote miss always consumes interconnect bandwidth, whereas a local miss can be satisfied locally. This algorithm does not always replace the block selected by the OPT algorithm if that block has a high miss cost. Instead, CSOPT considers keeping a high-cost block in the cache until it is referenced again. So, the algorithm tries to save a miss on an expensive block by trading off several misses on cheaper blocks. Hence, instead of minimizing the miss count, CSOPT minimizes the overall cost of cache misses. The search tree used by the algorithm is huge, which makes the algorithm unrealizable in any practical system.

In [19], the authors consider non-uniform miss costs among the cache blocks and propose several extensions of LRU. The idea behind these extensions is to keep (if feasible) a high-cost block victimized by LRU in the cache until its next reference and to replace a block with a low replacement cost instead. In such a case, the victimized block with high replacement cost is said to be in reservation; this idea of reservation is borrowed from the CSOPT algorithm mentioned earlier. The cost of the reserved block is depreciated over time according to various algorithms, and ultimately the reservation is released.


In contrast to these cost-sensitive replacement algorithms, algorithms based on aggregate partitioning exploit the correlation of replacement costs among disk blocks and, instead of devising a new algorithm from scratch, re-use existing cost-oblivious caching algorithms (e.g., LRU) within each partition.

In [10], Chu and Opderbeck propose a method for varying the amount of physical memory available to a process, based on the observation of page fault frequency. Reference [31] proposes a hardware- and software-based partitioning scheme that partitions a set-associative cache (e.g., an L1 or L2 cache) among a set of processes or threads. Contrary to processor caches, which require cache access times in nanoseconds, a great deal of processing is feasible during or between accesses of a disk cache; hence, more sophisticated and complex techniques can be applied in a disk cache policy [32]. Based on this observation, reference [32] proposes a partitioning scheme to manage a disk cache shared by multiple processes. In this approach, a disk cache is partitioned into disjoint blocks among several processes, and the size of each partition is determined by the locality of the corresponding process. The cache management algorithm determines the size of each partition and dynamically adjusts the partition sizes. The partitioning technique is based on the method proposed by Stone, Wolf and Turek [30]. Zhu et al. [40] propose a storage cache management scheme that aims at increasing the average idle period of the disks, thus reducing disk energy consumption. However, none of these approaches consider storage device heterogeneity in terms of access costs. Our work considers non-uniform access costs of disk blocks, exploits the cost correlation among cached blocks and, at the same time, re-uses existing cost-oblivious caching algorithms. Preliminary versions of this work appeared in [8, 9].

8 Conclusion

In this paper, we identified the inherent problems with caching algorithms in heterogeneous storage systems. First, we proposed and analyzed solution strategies that adjust partition sizes in a continuous fashion. We outlined a strategy to maintain partition states implicitly; using only one variable per partition, this approach makes it possible to balance the cumulative delay of the partitions. Our solution is computationally simple and continuously adjusts the partitions to balance their cumulative delay. The only parameter, δ, can be chosen based on the workload and the performance of the devices.

In a caching environment with aggregate partitioning, supplied with a uniform workload having a non-linear relationship between cache size and hit ratio, a slow disk may consume cache space, choking the rest of the disks. This phenomenon of disk choking degrades the performance of the disk system as a whole. We proposed a framework to partition the cache based on the utility of the cache blocks within a partition. Experimentation using a simple trace capturing the non-linear behavior between cache size and hit ratio demonstrates that the utility-based scheme notably outperforms the other schemes. A strategy for caching disk blocks is implemented by the operating system and thus affects the performance of the computer system at a very fundamental level.


Thus, even a small improvement on this score assumes a large significance. In this work, we assume that a disk block is brought into the cache only when a miss occurs, i.e., we do not consider prefetching. As future work, we would like to investigate the issue of prefetching in a heterogeneous storage environment. Another direction is to incorporate power-aware storage systems and study the tradeoffs between energy savings (power) and the performance of the disk systems. Changing the workloads by replicating data, and by dynamically adjusting the data placement and/or request redirection (across the disks) during system activity, is an interesting research issue; the issue of coupling the caching and data allocation schemes is orthogonal to the one considered in this paper.

References

1. Albers S, Arora S, Khanna S (1999) Page replacement for general caching problems. In: Proceedings of the tenth annual ACM-SIAM symposium on discrete algorithms, January 1999

2. Anderson T, Dahlin M, Neefe J, Patterson D, Roselli D, Wang R (1995) Serverless network file systems. In: Proceedings of the 15th symposium on operating system principles. ACM, Colorado, USA, December 1995, pp 109–126

3. Bansal S, Modha DS (2004) CAR: clock with adaptive replacement. In: Proceedings of the USENIX conference on file and storage technologies, March 2004, pp 187–200

4. Cao P, Felten EW, Karlin AR, Li K (1995) A study of integrated prefetching and caching strategies. In: Proceedings of the joint international conference on measurement & modeling of computer systems (SIGMETRICS), May 1995, pp 188–197

5. Cao P, Felten EW, Li K (1994) Implementation and performance of application-controlled file caching. In: Proceedings of the first USENIX conference on operating systems design and implementation, November 1994, pp 165–178

6. Cao P, Irani S (1997) Cost-aware WWW proxy caching algorithms. In: Proceedings of the USENIX symposium on internet technologies and systems, December 1997, pp 193–206

7. Carley LR, Ganger GR, Nagle DF (2000) MEMS-based integrated-circuit mass-storage systems. Commun ACM 43(11):72–80

8. Chakraborty A, Singh A (2005) A new approach to cost-aware caching in heterogeneous storage systems. In: Second international workshop on operating systems, programming environments and management tools for high performance computing on clusters (COSET-2), held in conjunction with the ACM international conference on supercomputing 2005, Cambridge, Massachusetts, June 2005, pp 1–6

9. Chakraborty A, Singh A (2007) A utility-based approach to cost-aware caching in heterogeneous storage systems. In: IEEE international parallel & distributed processing symposium (IPDPS), March 2007, pp 1–10

10. Chu WW, Opderbeck H (1972) The page fault frequency replacement algorithm. In: Proceedings of the AFIPS conference, pp 597–608

11. Ding X, Jiang S, Chen F (2007) A buffer cache management scheme exploiting both temporal and spatial localities. ACM Trans Storage 3(2), June 2007

12. Forney BC, Arpaci-Dusseau AC, Arpaci-Dusseau RH (2002) Storage-aware caching: revisiting caching for heterogeneous storage systems. In: Proceedings of the 2002 USENIX conference on file and storage technologies, January 2002, pp 61–74

13. Gibson GA, Meter RV (2000) Network attached storage architecture. Commun ACM 43(11):37–45

14. Gill BS (2008) On multi-level exclusive caching: offline optimality and why promotions are better than demotions. In: Proceedings of the 6th USENIX conference on file and storage technologies, February 2008

15. Gill BS, Modha DS (2005) SARC: sequential prefetching in adaptive replacement cache. In: Proceedings of the USENIX annual technical conference

16. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst 6(1):51–81


17. Irani S (1997) Page replacement with multi-size pages and applications to web caching. In: Proceedings of the ACM symposium on the theory of computing

18. Jeong J, Dubois M (1999) Optimal replacements in caches with two miss costs. In: Proceedings of the 11th ACM symposium on parallel algorithms and architectures, June 1999, pp 155–164

19. Jeong J, Dubois M (2003) Cost-sensitive cache replacement algorithms. In: Proceedings of the IEEE conference on high performance computing

20. Jiang S, Chen F, Zhang X (2005) CLOCK-Pro: an effective improvement of the CLOCK replacement. In: Proceedings of the annual USENIX technical conference, April 2005, pp 323–336

21. Kubiatowicz J, Bindel D, Eaton P, Chen Y, Geels D, Gummadi R, Rhea S, Weimer W, Wells C, Weatherspoon H, Zhao B (2000) OceanStore: an architecture for global-scale persistent storage. In: Proceedings of the ninth international conference on architectural support for programming languages and operating systems (ASPLOS 2000), November 2000

22. Lovett T, Clapp R (1996) STiNG: a CC-NUMA computer system for the commercial marketplace. In: Proceedings of the 23rd international symposium on computer architecture, May 1996, pp 308–317

23. Manasse M, McGeoch L, Sleator D (1988) Competitive algorithms for on-line problems. In: Proceedings of the twentieth annual ACM symposium on theory of computing, May 1988, pp 322–333

24. Megiddo N, Modha DS (2003) ARC: a self-tuning, low overhead replacement cache. In: Proceedings of the 2nd USENIX conference on file and storage technologies, March 2003

25. Nelson MN, Welch BB, Ousterhout JK (1988) Caching in the Sprite network file system. ACM Trans Comput Syst 6(1), February 1988

26. Ng R, Faloutsos C, Sellis T (1991) Flexible buffer allocation based on marginal gains. In: Proceedings of the 1991 ACM conference on management of data (SIGMOD), pp 387–396

27. Patterson DA, Gibson GA, Katz RH (1988) A case for redundant arrays of inexpensive disks. In: Proceedings of the ACM SIGMOD conference, June 1988

28. Ruemmler C, Wilkes J (1993) UNIX disk access patterns. In: Proceedings of the winter 1993 USENIX conference, January 1993, pp 405–420

29. Sandberg R (1985) The design and implementation of the Sun network file system. In: Proceedings of the 1985 USENIX summer technical conference, June 1985, pp 119–130

30. Stone HS, Wolf JL, Turek J (1989) Optimal partitioning of cache memory. IBM research report RC 14444, March 1989, pp 1–25

31. Suh GE, Rudolph L, Devadas S (2004) Dynamic partitioning of shared cache memory. J Supercomput 28(1):7–26

32. Thiebaut D, Stone HS, Wolf JL (1992) Improving disk cache hit-ratios through cache partitioning. IEEE Trans Comput 41(6):665–676

33. Vahdat A, Anderson T, Dahlin M, Belani E, Culler D, Eastham P, Yoshikawa C (1998) WebOS: operating system services and wide area applications. In: Proceedings of the seventh symposium on high performance distributed computing, July 1998

34. Waldspurger CA, Weihl WE (1994) Lottery scheduling: flexible proportional-share resource management. In: Proceedings of the first USENIX symposium on operating systems design and implementation, November 1994

35. Wang M, Ailamaki A, Faloutsos C (2002) Capturing the spatio-temporal behavior of real traffic data. In: Proceedings of the symposium on computer performance modeling, measurement and evaluation, September 2002

36. Weber W, Gold S, Helland P, Shimizu T, Wicki T, Wilcke W (1997) The Mercury interconnect architecture: a cost-effective infrastructure for high performance servers. In: Proceedings of the 24th international symposium on computer architecture, June 1997, pp 98–107

37. Yadgar G, Factor M, Schuster A (2007) Karma: know-it-all replacement for a multilevel cache. In: Proceedings of the 5th USENIX conference on file and storage technologies, February 2007

38. Young NE (1991) On-line caching as cache size varies. In: Proceedings of the symposium on discrete algorithms

39. Young NE (1999) On-line file caching. In: Proceedings of the ninth annual ACM-SIAM symposium on discrete algorithms, January 1999

40. Zhu Q, Zhou Y (2005) Power-aware storage cache management. IEEE Trans Comput 54(5):587–602