
Virtual Log Based File Systems for a Programmable Disk*

Randolph Y. Wang†   Thomas E. Anderson‡   David A. Patterson†

Abstract

In this paper, we study how to minimize the latency of small transactional writes to disks. The basic approach is to write to free sectors that are near the current disk head location by leveraging the embedded processor core inside the disk. We develop a number of analytical models to demonstrate the performance potential of this approach. We then present the design of a variation of a log-structured file system based on the concept of a virtual log, which supports fast small transactional writes without extra hardware support. We compare our approach against traditional update-in-place and logging systems by modifying the Solaris kernel to serve as a simulation engine. Our evaluations show that random updates on an unmodified UFS execute up to an order of magnitude faster on a virtual log than on a conventional disk. The virtual log can also significantly improve LFS in cases where delaying small writes is not an option or on-line cleaning would degrade performance. If the current trends of disk technology continue, we expect the performance advantage of this approach to become even more pronounced in the future.

1 Introduction

In this paper, we set out to answer a simple question: how do we minimize the latency of small transactional writes to disk?

The performance of small synchronous disk writes impacts the performance of important applications such as recoverable virtual memory [29], persistent object stores [2, 19], and database applications [34, 35].

*This work was supported in part by the Defense Advanced Research Projects Agency (DABT63-96-C-0056), the National Science Foundation (CDA 9401156), California MICRO, the AT&T Foundation, Digital Equipment Corporation, Hewlett Packard, IBM, Intel, Sun Microsystems, and Xerox Corporation. Anderson was also supported by a National Science Foundation Presidential Faculty Fellowship.

†Computer Science Division, University of California, Berkeley, {rywang,pattrsn}@cs.berkeley.edu

‡Department of Computer Science and Engineering, University of Washington, Seattle, [email protected]

These systems have become more complex in order to deal with the increasing relative cost of small writes [20].

Similarly, most existing file systems are carefully structured to avoid small synchronous disk writes. UFS by default delays data writes to the disk. More recent research has shown that it is possible to delay metadata writes as well if they are carefully ordered [10]. The Log-structured File System (LFS) [27] claims to be write optimized, but does so by batching small writes. While the structural integrity of file systems can be maintained using these delayed write techniques, none of them provides data reliability. Write-ahead logging systems [4, 7, 13, 33] accumulate small updates in a log and replay the modifications later by updating in place. Databases often place the log on a separate disk to avoid having the small updates to the log conflict with reads to other parts of the disk. Our interest is in the limits to small write performance for a single disk. Understanding the single disk case also may enable us to improve the performance of log truncation.

Because of the limitations imposed by disks, non-volatile RAM (NVRAM) or an uninterruptible power supply (UPS) is often used to provide fast transactional writes [3, 16, 17, 20]. However, when write locality exceeds buffer capacity, performance degrades. There are also applications that demand stricter guarantees of reliability and integrity than that of either NVRAM or UPS. Fast small disk writes can provide a cost-effective complement to NVRAM solutions in these respects.

Unlike most of the existing file systems, we do not seek to avoid small disk writes. Instead, our basic approach is to write to a disk location that is closest to the current head location, while retaining transactional behavior. We call this eager writing. Eager writing requires the file system to be aware of the precise disk head location and disk geometry. One way to satisfy this requirement is to enhance the disk interface to the host so that the host file system can have precise knowledge of the disk state. A second solution is to migrate into the disk some of the file system responsibilities which are traditionally executed on the host. In the rest of this paper, we will assume this second approach, although the techniques that we will describe do not necessarily depend on the ability to run file systems inside disks and can work equally well using the first solution.

Several technology trends have simultaneously enabled and necessitated the approach of migrating file system responsibility into the disk. First, Moore's Law has driven down the relative cost of CPU power to disk bandwidth, enabling powerful systems to be embedded on disk devices [1]. For example, the 233 MHz StrongARM 110 by Digital costs under $20 [6]. As this trend continues, it will soon be possible to run the entire file system on the disk.

Second, disk bandwidth has been scaling faster than other aspects of the disk system. Disk bandwidth has been growing at 40% per year [12]. I/O bus performance has been scaling less quickly [22]. The ability of the file system to communicate with the disk (to reorganize the disk, for example) without consuming valuable I/O bus bandwidth has become increasingly important. Seek time, half-rotation delay, head switch time, and track switch time are improving at even slower rates. Typical disk latency has improved at an annual rate of only 10% in the past decade [22]. A file system whose small write latency is largely decided by the disk bandwidth instead of any other parameters will continue to perform well.

Third, the increasing complexity of modern disk drives and the fast product cycles make it increasingly difficult for operating system vendors to incorporate useful device heuristics into their file systems to improve performance. The traditional host file system and the disk firmware do not share perfect knowledge of each other, and the approach of increasing the complexity of the file system to match the complexity of modern disks cannot work. In contrast, by running file system code inside the disk, we can combine precise knowledge of the file system semantics and the detailed disk mechanism to perform optimizations that are otherwise impossible.

The basic concept of performing writes near the current disk head position is by no means a new one [5, 9, 11, 14, 24]. But these systems either do not provide transactional behavior, or have poor failure recovery times, or require NVRAM for transactions. In this work, we present the design of a virtual log, a logging strategy based on eager writing with these

unusual features:

- Virtual logging exploits fast writes to provide transactions without special hardware support, by writing the commit record to the next free block after the data write(s).

- The virtual log allows space occupied by obsolete entries to be reused without recopying live entries.

- The virtual log bootstraps its recovery from the log tail pointer, which can be stored at a fixed location on disk as part of the firmware power-down sequence, allowing efficient normal operations.

We discuss two designs in which the virtual log can be used to improve file system performance. The first is to use it to implement a logical disk interface. This design, called a VLD, does not alter the existing disk interface and can deliver the performance advantage of eager writing to an unmodified file system. In the second approach, which we have not implemented, we seek a tighter integration of the virtual log into the file system; we present the design of a variation of LFS, called VLFS, which should provide even better performance than the VLD. While our work focuses on a traditional architecture in which the user of the transactional file system executes on the host, the technique can also be extended to move the transaction code itself into the disk to speed up operations such as read-modify-write [25]. We develop analytical models and algorithms to answer a number of fundamental questions about eager writing:

- What is the theoretical limit of the performance of this approach?

- How should we re-engineer the host/disk interface to take advantage of the knowledge of the disk mechanism and file system semantics?

- How can we ensure open space under the disk head?

- How does this approach fare as different components of the disk mechanism improve at different rates?

We evaluate our approach against update-in-place and logging by modifying the Solaris kernel to serve as a simulation engine. Our evaluations show that an unmodified UFS on an eager writing disk runs about ten times as fast as an update-in-place system for random small updates. Eager writing's economical use of bandwidth also allows it to significantly improve LFS in cases where delaying small writes is not an option or on-line cleaning would degrade performance. As disk bandwidth continues to improve faster than disk latency, the performance advantage of eager writing should become more profound in the future.

Of course, these benefits may come at a price of potentially reducing read performance. Like LFS, the virtual log does not necessarily place data in the optimal position for future reads. But as increasingly large file caches are employed, modern file systems such as the Network Appliance file system report predominantly write traffic [17]. Large caches also provide opportunity for read reorganization before the reads happen [23].

Although our evaluations show significant performance promise of the virtual log, this paper remains a preliminary study. A full evaluation of the approach would require solutions to the algorithmic and policy questions of the data reorganizer as part of a complete VLFS implementation. These issues, as well as questions such as how to extend the virtual logging technique to multiple disks, are subjects of our ongoing research.

The remainder of the paper is organized as follows. Section 2 presents the eager writing analytical models. Section 3 presents the design of the virtual log and the virtual log based LFS. Section 4 describes the experimental platform that is used to evaluate the update-in-place, logging, and eager writing strategies. Section 5 evaluates these alternatives with a series of experiments. Section 6 describes some of the related work. Section 7 concludes.

2 Limits to Low Latency Writes

The principle of writing data near the current disk head position is most effective when the disk head is always on a free sector when a write request arrives. This is not always possible in reality. In this section, we develop a number of analytical models to estimate the amount of time needed to locate free sectors under various utilizations. These models will help us evaluate whether eager writing is a sound strategy, set performance targets for real implementations, and predict future improvements as disks improve. We also use the models to motivate new file system allocation and reorganization algorithms. Because we are interested in the theoretical limits of eager writing latency, the models are for the smallest addressable unit: a disk sector (although the validity of the formulas does not depend on the actual sector size, and they can apply equally well to larger blocks).

2.1 A Single Track Model

Suppose a disk track contains n sectors, its utilization is (1 - p), and the free space is randomly distributed. On average, the number of sectors the disk head must skip before arriving at any free sector is:

(1 - p)n / (1 + pn)    (1)

Proof. Suppose n is the track size, k is the number of free sectors in the track, and E(n, k) is the expected number of used sectors we encounter before reaching the first free sector. With a probability of k/n, the first sector is free and the number of sectors the disk head must skip is 0. Otherwise, with a probability of (n - k)/n, the first sector is used and we must continue searching in the remaining (n - 1) sectors, of which k are free. Therefore, the expected delay in this case is [1 + E(n - 1, k)]. This yields the recurrence:

E(n, k) = [(n - k)/n] · [1 + E(n - 1, k)]    (2)

By induction on n, it is easy to prove that the following is the unique solution to (2):

E(n, k) = (n - k) / (1 + k)    (3)

(1) follows if we substitute k in (3) with pn. For example, when there is only one free sector left in the track (pn = 1), (1) degenerates to a half rotation. More generally, (1) is roughly the ratio between occupied sectors and free ones. This is a promising result for the eager writing approach because, for example, even at a relatively high utilization of 80%, we can expect to incur only a four-sector rotation delay to locate a free sector. For today's disks, this translates to less than 100 µs. In six to seven years, this delay should improve by another order of magnitude because it scales with platter bandwidth. In contrast, it is difficult for an update-in-place system to avoid at least a half-rotation delay. With today's technology, this is at best 3 ms, and it improves at a slower rate than bandwidth. This difference is the fundamental reason why eager writing has the potential to outperform update-in-place.
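
The closed form in (3) is easy to check numerically. The following sketch (our illustration, not part of the paper's simulator) scatters k free sectors at random over a track of n sectors and measures the average number of occupied sectors passed before the first free one; the empirical average should approach (n - k)/(1 + k).

    import random

    def simulated_skip(n, k, trials=200_000):
        # Place k free sectors uniformly at random in a track of n sectors
        # and count the occupied sectors passed before the first free one
        # (the starting head position is arbitrary by symmetry, so use 0).
        total = 0
        for _ in range(trials):
            total += min(random.sample(range(n), k))
        return total / trials

    n = 72                                   # sectors per track (HP97560, Table 1)
    for utilization in (0.5, 0.8, 0.9):
        k = round(n * (1 - utilization))     # free sectors left in the track
        print(utilization, simulated_skip(n, k), (n - k) / (1 + k))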

2.2 A Single Cylinder Model

We now extend (1) to cover a single cylinder. We compare the time needed to locate the nearest sector in the current track against the time needed to find one in other tracks in the same cylinder and take the minimum of the two. Therefore the expected latency can be expressed as:

Σ_x Σ_y min(x, y) · fx(p, x) · fy(p, y)    (4)

where x is the delay (in units of sectors) experienced to locate the closest sector in the current track, y is the delay experienced to locate the closest sector in other tracks in the current cylinder, and fx(p, x) and fy(p, y) are the probability functions of x and y, respectively, under the assumption of a free space percentage of p. Suppose a head switch costs s and there are t tracks in a cylinder; then the probability functions can be expressed as:

fx(p, x) = p(1 - p)^x    (5)

fy(p, y) = fx(1 - (1 - p)^(t-1), y - s)    (6)

In other words, fx is the probability that there are x occupied sectors followed by a free sector in the current track, and fy is the probability that the first (y - s) rotational positions in all (t - 1) tracks are occupied and there is one free sector at the next rotational position in at least one of these tracks.
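
Equations (4) through (6) can be evaluated directly by truncating the geometric tails. The sketch below is our own numeric evaluation, not code from the paper; the conversion of the head switch cost from the Table 1 figures into sector units is our assumption.

    def expected_cylinder_delay(p, n, t, s, terms=1024):
        # Evaluate equation (4) using fx from (5) and fy from (6).
        # p: fraction of free space, n: sectors per track, t: tracks per
        # cylinder, s: head switch cost expressed in sector units.
        q = 1 - (1 - p) ** (t - 1)            # prob. another track is free at a position
        fx = lambda x: p * (1 - p) ** x                   # equation (5)
        fy = lambda y: q * (1 - q) ** (y - s)             # equation (6), for y >= s
        return sum(min(x, y) * fx(x) * fy(y)
                   for x in range(terms)
                   for y in range(s, s + terms))

    # ST19101 (Table 1): 256 sectors/track, 16 tracks/cylinder, 0.5 ms head
    # switch, 10000 RPM, i.e. roughly 6 ms / 256 of rotation per sector.
    ms_per_sector = 60_000 / 10_000 / 256
    s = round(0.5 / ms_per_sector)
    for free in (0.2, 0.5, 0.8):
        delay = expected_cylinder_delay(free, 256, 16, s)
        print(free, round(delay * ms_per_sector, 3), "ms")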

Figure 1 validates the model of (4) with a simulation of the HP97560 and the Seagate ST19101 disks, whose relevant parameters are detailed in Table 1. The HP97560 is a relatively old disk; we chose it because its characteristics are well documented and its existing simulators are thoroughly validated [18, 28]. The Seagate is state-of-the-art technology. It is more complex and the simulator is at best a coarse approximation. For example, the models are for a single density zone while the actual disk has multiple density zones. The simulated eager writing algorithm in Figure 1 is not restricted to the current cylinder and always seeks to the nearest sector. (We will show a modified eager writing algorithm in the presence of a free space compactor in the next section.) The figure shows that the single cylinder model is in fact a good approximation for an entire zone. This is a simple consequence of the relative seek cost: nearby cylinders are not much more likely than the current cylinder to have a free sector in a rotationally good position. This is also because the head switch time and single cylinder seek time are close to each other.

Compared to the half-rotation delays of 7 ms for the HP and 3 ms for the Seagate that an update-in-place system may incur, Figure 1 promises significant performance improvement, especially at lower utilizations. Indeed, the figure shows that the eager writing latency has improved by nearly an order of magnitude on the newer Seagate disk compared to the HP disk.

[Figure 1 plot: average time to locate a free sector (ms) versus free disk space (%), with curves for the HP97560 model, HP97560 simulation, ST19101 model, and ST19101 simulation.]

Figure 1: Amount of time to locate the first free sector as a function of the disk utilization. The model is for a single cylinder while the simulation results are for the entire zone.

                             HP97560    ST19101
    Sectors per Track (n)    72         256
    Tracks per Cylinder (t)  19         16
    Head Switch (s)          2.5 ms     0.5 ms
    Minimum Seek             3.6 ms     0.5 ms
    Rotation Speed (RPM)     4002       10000
    SCSI Overhead (o)        2.3 ms     0.1 ms

Table 1: Parameters of the HP97560 and the Seagate ST19101 disks.

As the disk utilization increases, however, it takes longer for an eager writing disk head to find a nearby free sector and the performance degrades. When this occurs, one solution is to compact the free space under the disk head to ensure low latency. We next develop a model assuming the presence of a compactor.

2.3 A Model Assuming a Compactor

Compacting free space and reorganizing data are responsibilities of the storage subsystem, and it is natural to dedicate these tasks to the embedded processor in the disk. A self-reorganizing disk can take advantage of the "free" bandwidth between the disk head and the platters during idle periods without consuming valuable I/O bus bandwidth, polluting host cache and memory, or interfering with host CPU operation.

We now develop a performance model of eager writing assuming the existence of a free space compactor that constantly generates empty tracks. We start writing to an empty track and continue to fill the same track with newly arrived write requests until the utilization of the current track reaches a threshold.


[Figure 2 plot: average time to locate a free sector (ms) versus track switch threshold (%), with curves for the HP97560 simulation, HP97560 model, ST19101 simulation, and ST19101 model.]

Figure 2: Average latency to locate free sectors for all writes performed to an initially empty track as a function of the track switch threshold. The track switch threshold is the percentage of free sectors reserved per track before a switch occurs. A high threshold corresponds to frequent switches.

Then we switch to the next empty track and repeat the process. If writes arrive with no delay, then it would be trivial to fill the track. If writes arrive randomly and we assume the free space distribution is random at the time of the arrivals, then writes between successive track switches follow the model of (1). Therefore, substituting pn in (1) with i, the number of free sectors, the total number of sectors to skip between track switches can be expressed as:

Σ_{i=k+1}^{n} (n - i) / (1 + i)    (7)

where n is the total number of sectors in a track and k is the number of free sectors reserved per track before switching tracks. Suppose each track switch costs s and the rotation delay of one sector is r; because we pay one track switch cost per (n - k) writes, the average latency per sector of this strategy can be expressed as:

[ s + r · Σ_{i=k+1}^{n} (n - i) / (1 + i) ] / (n - k)    (8)

So far, we have assumed that the free space distribution is always random. This is not the case with the approach of filling the track up to a threshold because, for example, the first free sector immediately following a number of used sectors is more likely to be selected for eager writing than one following a number of free sectors. In general, this non-randomness increases latency. Although the precise modeling of the non-randomness is quite complex, through approximations and empirical trials, we have found that adding the following δ function to (7) works well for a wide range of disk parameters in practice:

δ(n, k) = (n - k - 0.5)^(p+2) / [ (8 - n/96) · (p + 2) · n^p ]    (9)

where p = 1 + n/36. By approximating the summation in (7) with an integral and adding the correction factor of (9) to account for the non-randomness, we arrive at the final latency model:

{ s + r · [ (n + 1) ln((n + 2)/(k + 2)) - (n - k) + δ(n, k) ] } / (n - k)    (10)

Figure 2 validates the model of (10). When we switch tracks too frequently, although the free sectors in any particular track between switches are plentiful, we are penalized by the high switch cost. When we switch tracks too infrequently, it becomes increasingly difficult to locate free sectors in an increasingly crowded track and the performance is also non-optimal. In general, the model aids the selection of an optimal threshold for a particular set of disk parameters. More importantly, the model also reassures us that we need not suffer the performance degradation seen at high utilizations in Figure 1 if we judiciously choose a track switch threshold. As long as there is sufficient idle time to compact free space, we can expect to stay comfortably at the lower part of the curve in Figure 2, which again promises significant improvement over an update-in-place approach. Figure 1 and Figure 2 also show why the demands on the compactor are less than the demands on the LFS cleaner. This is because the models indicate that compacting is only necessary at high utilizations and the compactor can move data at small granularities.
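
The threshold tradeoff in Figure 2 can be explored directly from (9) and (10). The sketch below is ours, not the paper's; it approximates the track switch cost with the head switch figure from Table 1 and evaluates the average per-sector latency as the number of reserved free sectors k varies.

    import math

    def delta(n, k):
        # Correction term (9); p = 1 + n/36 as defined in the text.
        p = 1 + n / 36
        return (n - k - 0.5) ** (p + 2) / ((8 - n / 96) * (p + 2) * n ** p)

    def latency_per_sector(n, k, s, r):
        # Equation (10): average latency per sector written when an empty
        # track is filled until k free sectors remain, then switched.
        rotation = (n + 1) * math.log((n + 2) / (k + 2)) - (n - k) + delta(n, k)
        return (s + r * rotation) / (n - k)

    n, s = 256, 0.5                          # ST19101-like track, 0.5 ms switch cost
    r = 60_000 / 10_000 / n                  # ms of rotation per sector at 10000 RPM
    best_k = min(range(1, n - 1), key=lambda k: latency_per_sector(n, k, s, r))
    print("best threshold:", round(best_k / n, 2),
          "latency:", round(latency_per_sector(n, best_k, s, r), 3), "ms")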

3 A Virtual Log

Eager writing allows a high degree of location independence of data. It also presents two challenges: how to locate data as its physical location can constantly change, and how to recover this location information after a failure. In this section, we explain the rationale behind the design of the virtual log and the file systems built on it.

3.1 The Indirection Map

To support location independence, we introduce a level of indirection, similar to the technique found in a number of existing file systems [5, 8, 9]. The host file system manipulates logical addresses.


[Figure 3 diagram: a backward chain of map entry sectors ending at the log tail, and the same chain implemented as a tree so that an obsolete entry can be bypassed.]

Figure 3: Maintaining the virtual log. (a) New map entry sectors are appended to a backward chain. (b) Implementing the backward chain as a tree to support map entry overwriting.

The disk system maintains a logical-to-physical address map. As new blocks are allocated, new map entries are created; as blocks are overwritten, their map entries are updated; as blocks are read, their map entries are queried for their physical locations; and as blocks are freed, so are their map entries. Most modern disk drives already maintain a similar map so that they can transparently remap failed sectors. Updates to such a map, however, occur rarely. In the rest of this section, we will describe how we keep the map persistent, how we perform updates, how we recover the map after a failure, and how file systems can be built using these primitives. Our goal is to avoid some of the inefficiencies and inflexibilities of the previous approaches. These pitfalls include:

- lengthy scanning of large portions of the disk to recover the map,

- reliance on NVRAM, which is not always available (and even when it is, NVRAM can usually be put to more productive uses than storing map entries for recovery),

- excessive overhead in terms of space and extra I/Os needed to maintain the map, and

- altering the physical block format on disks to include, for example, a self-identifying header for each disk block.

3.2 Maintaining and Recovering the Virtual Log

We implement the indirection map as a table. To keep the map persistent, we leverage the low latency offered by eager writing. Whenever an update takes place, we write the piece of the table that contains the new map entry to a free sector near the disk head. Suppose the file system addresses the disk at sector granularity and each map entry consumes a word; then the indirection map will consume a storage overhead of less than 1% of the disk capacity. We will discuss how to further reduce this storage overhead to enable the entire indirection map to be stored in disk memory in the next section. If the transaction includes multiple data blocks and their map entries do not fall in the same map sector, then multiple map sectors may need to be written. Although the alternative of logging the multiple map entries using a single sector may better amortize the cost of map entry updates, it requires garbage collecting the obsolete map entries and is not used in our current design. We will discuss the issues of how to minimize map entry update I/O and how to avoid garbage collection later in the paper.
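
(As a concrete example, assuming 512-byte sectors and 4-byte map entries, the map adds 4/512, or about 0.8%, of the raw capacity, consistent with the less-than-1% figure above.)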

To be able to recover the map after a failure, we must be able to identify the locations of the map entry sectors, which are scattered throughout the disk due to eager writing. One way to accomplish this is to thread all the map entry sectors together to form a log. We term this a virtual log because the components of the log are not necessarily physically contiguous. Because eager writing prevents us from predicting the location of the next on-disk map entry, we cannot maintain forward pointers in the map entries. Instead, we chain the map entries backward by appending new tail entries to the virtual log as shown in Figure 3a. Note that we can adapt this technique to a disk that supports a small writable header per block, in which case a map entry including the backward pointer will be placed in the header. We do not assume this support in the rest of the discussion.

As map entries are overwritten, if we continuously append to the virtual log, the backward chain will accumulate obsolete sectors over time. We cannot simply reuse these obsolete sectors because doing so will break the backward chain. Our solution is to implement the backward linked list as a tree as shown in Figure 3b. Whenever an existing map entry is overwritten, a new log tail is introduced as the new tree root. One branch of the root points to the previous root; the other points to the map sector following the overwritten map entry. The overwritten sector can be recycled without breaking the virtual log. As Section 3.3 will show, we can keep the entire virtual log in disk memory during normal operation. Consequently, overwriting a map entry requires only one disk I/O to create the new log tail.

In order to be able to recover the virtual log without scanning the disk, we must remember the location of the log tail. Modern disk drives use residual power left in the drive to park their heads in a landing zone at the outer perimeter of the disks prior to spinning down the drives. It is easy to modify the firmware so that the drive records the current log tail location at a fixed location on the disk before it parks the actuator [21, 32]. To be sure of the validity of the content stored at this fixed location, we can protect it with a checksum and clear it after recovery. In the extremely rare case when the power-down mechanism fails, we can detect the failure by computing the checksum and resort to scanning the disk to retrieve the log tail.

With a stable log tail, recovering the virtual log is straightforward. We start at the log tail as the root and traverse the tree on the frontier based on age. Obsolete log entries can be recognized as such because their updated versions are younger and traversed earlier.
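
A minimal sketch of this age-ordered traversal is shown below; it is our illustration of the recovery rule just described, not the paper's implementation, and it assumes each map sector exposes a write sequence number (its age), its map entries, and its backward pointers.

    import heapq

    def recover_map(read_sector, tail_addr):
        # read_sector(addr) is a stand-in for a disk read; it returns an
        # object with .seq (age), .entries (logical -> physical dict) and
        # .back (addresses of older map sectors).
        mapping, seen, frontier = {}, set(), []
        root = read_sector(tail_addr)
        heapq.heappush(frontier, (-root.seq, tail_addr, root))
        while frontier:
            _, addr, sector = heapq.heappop(frontier)     # youngest sector first
            if addr in seen:
                continue
            seen.add(addr)
            for logical, physical in sector.entries.items():
                mapping.setdefault(logical, physical)     # newer version wins
            for older_addr in sector.back:
                if older_addr not in seen:
                    older = read_sector(older_addr)
                    heapq.heappush(frontier, (-older.seq, older_addr, older))
        return mapping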

3.3 Implementing LFS on the Virtual Log

So far, we have described 1) a generic logging strategy that can support transactional behavior, and 2) an indirection map built on the log which implements a logical disk interface that supports location independence. One advantage of this approach is that we can implement eager writing behind a logical disk interface and deliver its performance advantage to an unmodified file system. We now describe another application of the virtual log: implementing a variant of the log-structured file system (VLFS). Unlike the logical disk approach of the previous section, the VLFS requires modifying the disk interface to the host file system. As a result of seeking a tighter integration between the file system and the programmable disk, however, this allows a number of optimizations impossible with an unmodified UFS. Currently, we have not directly implemented VLFS. Instead, the experiments in Section 5 are based on file systems running on the virtual log via the logical disk interface as described in the last section. We indirectly deduce the VLFS performance by evaluating these file systems.

One disadvantage of the indirection map as described in Section 3.1 is the amount of storage space and the extra I/Os needed to maintain and query the map. To solve this inefficiency, the file system can store physical addresses of the data blocks in the inodes (and indirection blocks). This is the same approach taken by LFS and is shown in Figure 4. As file blocks are written, the data blocks, the inode blocks that contain physical addresses of the data blocks, and inode maps that contain physical addresses of the inodes are all appended to the log.

[Figure 4 diagram: the log tail followed by data blocks, an inode, and an inode map block (the virtual log entry).]

Figure 4: Implementing LFS on the virtual log. VLFS uses eager writing to write to the disk, so the log is not necessarily physically contiguous. The virtual log contains only the inode map blocks.

What is different in the virtual log based implementation (VLFS) is that the log need not be physically contiguous, and only the inode map blocks logically belong to the virtual log. This is essentially adding a level of indirection to the indirection map. The advantage is that the inode map, which is the sole content of the virtual log, is now compact enough to be stored in memory; it also reduces the number of I/Os needed to maintain the indirection map because VLFS simply takes advantage of the existing indirection data structures in the file system without introducing its own.

Another LFS optimization that can also be applied to VLFS is checkpointing for recovery. Periodically, we write the entire inode map to the disk contiguously and truncate the virtual log. At recovery time, while LFS reads a checkpoint at a known disk location and rolls forward, VLFS traverses the virtual log backwards from the log tail towards the checkpoint.

In order to support VLFS, the disk needs to export a new interface to the host. In deciding how to partition the responsibility between the disk and the host, we face at least two conflicting goals. On the one hand, we need to move enough file system operations into the disk so that they can take advantage of the knowledge of the disk mechanism which is unavailable on the host. On the other hand, we would like the host to retain certain operations so that we do not place undue stress on the I/O bus. Two key questions are where to place the data cache and where to place the inode cache. To minimize the crossing of the I/O bus, one design is to retain both caches on the host. Under this design, on a write, in addition to the data blocks, the disk is also given the inode. The disk writes the data blocks, inserts their physical addresses into the inode, writes the inode, writes the inode map blocks, and returns the modified inode to the host. This design complicates the effort of keeping the inode cache coherent if the disk is to modify the inodes later during the process of free space compacting or data reorganization. A simpler alternative is to keep the inode cache on the disk. This alternative drastically simplifies the interface and the cache coherence issue at the expense of more communication across the I/O bus. The precise placement of these caches and the associated interfaces are subjects of our ongoing research.

To support the VLFS free space compactor, which is equivalent to an LFS-style cleaner, we introduce the equivalent of the LFS segment summary, a track summary. The track summary details the content of a track, recording the inode number and block offset of each block. Like all other data structures of VLFS, the track summary has no permanent location and can be written anywhere in the track. In fact, although the track summary is a logically distinct data structure, it can be piggy-backed onto a log map entry to avoid extra disk writes.¹

During VLFS compacting, a whole track is read and its track summary is examined to determine the locations of the inodes that point to the live blocks. If the inodes in question are in the current track, then we can simply copy the data to its destination track, update the corresponding inodes, and rewrite the inodes. If the inodes in question are not found in the current track, then we have the choice of either reading the inodes from other tracks and completing the cleaning, or "cleaning around" these blocks by leaving them where they are. Placing the only copy of the inode cache on the disk instead of on the host can simplify the implementation of the VLFS compactor by making the disk the sole entity that can read or write inodes. This can also allow concurrent write operations by both the host and the compactor using techniques similar to LFS "optimistic cleaning" [15, 30] that avoid expensive file locks.
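
The compacting decision can be summarized as follows; since VLFS is not implemented in the paper, the sketch below is purely illustrative and every helper it calls (read_track, inode_location, eager_write, and so on) is a hypothetical interface.

    def compact_track(disk, track):
        blocks, summary = disk.read_track(track)       # whole-track read plus summary
        for block in blocks:
            entry = summary.lookup(block.addr)         # (inode number, block offset)
            if entry is None:
                continue                               # free or obsolete block
            inode_addr = disk.inode_location(entry.inode_number)
            if disk.track_of(inode_addr) == track:
                # Inode is local: move the data, update the inode, rewrite it.
                new_addr = disk.eager_write(block.data)
                inode = disk.read_inode(inode_addr)
                inode.set_block(entry.offset, new_addr)
                disk.eager_write(inode.serialize())
            # else: "clean around" -- leave the block where it is rather than
            # paying for an extra inode read from another track.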

3.4 Comparing VLFS with LFS

VLFS and LFS share a number of common advantages. Both can benefit from an asynchronous memory buffer by preventing short-lived data from ever reaching the disk. Both can benefit from disk reorganization during idle time: the LFS cleaner can generate empty segments while the VLFS compactor can compact free space under the disk head, so that during busy time both file systems can work with large extents of free space to ensure good performance.

¹We can avoid inadvertent freeing of the piggy-backed track summaries by storing the track summary locations in memory.

Due to eager writing, VLFS possesses a number of unique advantages. First, small synchronous writes perform well on VLFS, whereas LFS performance suffers if an application requires frequent "fsync" operations.² This is because LFS can only efficiently write large segments. Eager writing, however, can efficiently perform small writes at the most convenient disk locations. Second, while the free space compactor is only an optimization for VLFS, the cleaner is a necessity for LFS. In cases where idle time is scarce or disk utilization is high, LFS performance may degrade due to the repeated copying of live data by the cleaner [23, 30, 31]. Eager writing avoids this bandwidth waste and fills the available space in the most efficient way possible. Third, cleaning under LFS is a relatively heavy-weight operation because it must move data at segment granularity, typically 0.5 MB or larger. As a result, LFS needs large idle intervals to mask the cleaning overhead. The VLFS compactor has no restriction in terms of the amount of data it can move, so it can take advantage of relatively short idle intervals. Finally, reads can interfere with LFS writes by forcing the disk head away from free space and/or disturbing the track buffer (which can sometimes be used to absorb writes without accessing the disk platters). VLFS can gracefully accommodate reads by performing intervening writes near the data being read.

VLFS and LFS also share some common disadvantages. For example, data written randomly may have poor sequential read performance. In some of these situations, reorganization techniques that can improve LFS performance [23] should be equally applicable to VLFS. In some other situations, sequential reads after random updates are nothing more than an artifact of the I/O interface. For example, an application that streams through all the records of a database usually does not particularly care about the ordering of the records. An I/O interface that hints to the storage system to deliver the records in any order it deems convenient can perform well for both read and write operations. Indeed, an interface similar to that used by "Informed Prefetching" [26] can serve the virtual log well.

4 Experimental Platform

To study the effectiveness of eager writing, we evaluate the following four combinations of file

²"fsync" is the system call that forces dirty data to the disk.

Page 9: Virtual - EECS at UC BerkeleyComputer Science Division, Univ ersit y of California, Berk eley, f ryw ang,pattrsn g @cs.b erk.edu z Departmen t of Computer Science and Engineering,

[Figure 5 diagram: benchmarks run against either the in-kernel Solaris UFS or the user-level LFS (MinixUFS on a log-structured logical disk); each file system uses either the regular simulated disk or the VLD with its virtual log, both built on a ramdisk driven by the HP/Seagate timing models.]

Figure 5: Architecture of the experimental platform.

systems and simulated disks: a UFS on a regular disk, a UFS on a Virtual Log Disk (VLD), an LFS on a regular disk, and an LFS on a VLD. These combinations are illustrated in Figure 5. Although we do recognize that a complete evaluation of the VLFS would require a complete implementation of VLFS, with its associated data reorganizer and free space compactor, in this paper we take the first step of deducing the behavior of VLFS by examining the file systems running on the VLD.

Two host machines are used: one is a 50 MHz SUN SPARCstation-10 which is equipped with 64 MB of memory and runs Solaris 2.6; the other is a similarly configured UltraSPARC-170 workstation that runs at 167 MHz. The SPARCstation-10 supports both 4 KB and 8 KB file blocks while the UltraSPARC-170 only supports 8 KB file blocks. Because our focus is small write performance, our experiments are run on the SPARCstation-10 unless explicitly stated to be otherwise. We perform some additional experiments on the UltraSPARC-170 only to study the impact of host processing speed. In the rest of this section, we briefly describe each of the disk modules and file systems shown in Figure 5.

4.1 The Regular Disk

The regular disk module simulates a portion of the HP97560 disk or the Seagate ST19101 disk. Their relevant parameters are shown in Table 1. A ramdisk driver is used to store file data using 24 MB of kernel memory. The Dartmouth simulator [18] is ported into the kernel to ensure realistic timing behavior of the HP disk. We simply adjust the parameters of the Dartmouth model to coincide with those of the Seagate disk to simulate the faster disk. Although the Seagate simulator is not as precise as the HP model, it gives a clear indication of how the improvements in disk technology are affecting the different disk allocation strategies. We have run all our experiments on both the HP model and the Seagate model. Unless stated otherwise, however, the results presented are those obtained on the fast Seagate model.

One advantage of the ramdisk simulator is that it allows two possible modes of simulation. In one mode, the simulator sleeps the right amount of time reported by the Dartmouth model to imitate the behavior of the physical disks, and we can conduct the evaluations by directly timing the application. In the second mode, the simulator does not sleep in the kernel; it runs at memory speed instead and only reports statistics. This allows us to speed up certain phases of the experiments whose actual elapsed time is not important. The disadvantage of the ramdisk simulator is its small size due to the limited available kernel memory. We only simulate 36 cylinders of the HP97560 and 11 cylinders of the Seagate. The results reported in Section 5, however, should be applicable to a much larger zone. We plan to use a real disk to store the file data in a future version of the simulator, although this modification would slightly complicate timing measurements because the time spent in the real disk will need to be carefully accounted for in the final analysis.

4.2 The Virtual Log Disk

The VLD adds the virtual log to the disk simulator described above. It exports the same block device driver interface so it can be used by any existing file system. The VLD maintains an in-memory indirection map. As the map entries are updated, the VLD appends new map entries to the virtual log as described in Section 3.2. To avoid having the disk head trapped in regions of high utilization during eager writing, the VLD simply performs cylinder seeks only in one direction until it reaches the last cylinder, at which point it starts from the first cylinder again.

One challenge of implementing the VLD while preserving the existing disk interface is handling deletes, which are not visible at the block device driver interface. This is a common problem faced by logical disks.


Our solution is to monitor overwrites: when a logical address is reused by a write, the VLD detects that the old logical-to-physical mapping can be freed and a new one needs to be established. The disadvantage of the approach is that it does not capture the blocks that are freed by the file system but not yet overwritten.
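
The overwrite monitoring amounts to a one-line check on the write path. A minimal sketch, with vld.map and vld.free_list as assumed in-memory structures:

    def on_logical_write(vld, logical, new_physical):
        # Deletes are invisible behind the block interface, so reuse of a
        # logical address is the only signal that the old mapping is dead.
        old_physical = vld.map.get(logical)
        if old_physical is not None:
            vld.free_list.add(old_physical)    # reclaim the old physical block
        vld.map[logical] = new_physical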

Another question that arose during the implementation of the VLD is the choice of the physical block size. For example, suppose the host file system writes 4 KB blocks. If the VLD chooses a physical block size of 512 B, it may need to locate eight separate free sectors to complete eager writing one logical block. If the VLD chooses a physical block size of 4 KB instead, although it takes longer to locate the first free 4 KB-aligned sector, it is guaranteed to have eight consecutive free sectors afterwards. To understand this tradeoff, we can extend the model of (1) in Section 2.1. Suppose the file system logical block size is B and the VLD physical block size is b (b ≤ B); then the average amount of time (expressed in the number of sectors skipped) needed to locate all the free sectors for a logical block is:

[ (1 - p)n / (b + pn) ] · B    (11)

(11) indicates that the latency is lowest when the physical block size matches the logical block size. In all our experiments, we have used a physical block size of 4 KB. Each physical block requires one map entry, so the entire map consumes 24 KB and is stored in memory in our simulations.
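
Evaluating (11) for the 4 KB / 512 B example above makes the tradeoff concrete (our sketch; the 80% utilization figure is only an illustration):

    def sectors_skipped(p, n, b, B):
        # Equation (11): expected sectors skipped to eager-write one logical
        # block of B sectors using physical blocks of b sectors (b <= B),
        # on a track of n sectors with fraction p free.
        return (1 - p) * n / (b + p * n) * B

    n, B, p = 256, 8, 0.2          # 4 KB logical block = eight 512 B sectors; 80% utilized
    for b in (1, 2, 4, 8):         # candidate physical block sizes, in sectors
        print(b, round(sectors_skipped(p, n, b, B), 2))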

The third issue concerns the interaction between eager writing and the disk track buffer read-ahead algorithm. When reading, the HP97560 (or at least the Dartmouth simulator) keeps in cache only the sectors from the beginning of the current request through the current read-ahead point and discards the data whose addresses are lower than that of the current request. This algorithm makes sense for sequential reads of data whose physical addresses increase monotonically. This is the case for file systems which directly manipulate physical addresses. For the VLD, however, the combination of eager writing and the logical-to-physical address translation means that sequentially read data may not necessarily have monotonically increasing physical addresses. As a result, the HP97560 tends to purge data prematurely from its read-ahead buffer under the VLD. The solution is to aggressively prefetch the entire track as soon as the head reaches the target track and not discard data until it is delivered to the host during sequential reads on the VLD. The measurements of sequential reads on the VLD in Section 5 were taken with this modification.

Lastly, the VLD module also implements a free space compactor. Although the eager writing strategy should allow the compactor to take advantage of idle intervals of arbitrary length, for simplicity, our compactor compacts free space at the granularity of tracks. During idle periods, the compactor reads the current track and uses eager writing to copy the live data to other tracks, in a way similar to hole-plugging under LFS [23, 36]. Currently, we choose compaction targets randomly and plan to investigate optimal VLD compaction algorithms (e.g., ones that preserve or enhance read and write locality) in the future. Applying the lessons learned from the models in Section 2.3, the VLD fills empty tracks to a certain threshold (75% in the experiments). After exhausting the empty tracks generated by the compactor, the VLD reverts to the greedy algorithm modeled in Section 2.2.

4.3 UFS

Because both the regular disk and the VLD export the standard block device driver interface, we can run the Solaris UFS (subsequently also labeled as UFS) unmodified on these disks. We configure UFS on both disks with a block size of 4 KB and a fragment size of 1 KB. Like most other Unix local file systems, the Solaris UFS updates metadata synchronously, while the user can specify whether the data writes are synchronous. It also performs aggressive prefetching after several sequential reads are detected.

4.4 LFS

We have ported the MIT Log-Structured Logical Disk [8], a user-level implementation of LFS (subsequently also labeled as LFS). It consists of two modules: the MinixUFS and the log-structured logical disk, both running at user level. MinixUFS accesses the logical disk with a block interface while the logical disk interfaces with the raw disk using segments. The block size is 4 KB and the segment size is 0.5 MB. Read-ahead in MinixUFS is disabled. A number of other issues also impact the LFS performance. First, MinixUFS employs a file buffer cache of 6.1 MB. Unless "sync" operations are issued, all writes are asynchronous. In some of the experiments in Section 5, we assume this buffer to be made of NVRAM so that the LFS configuration can have a similar reliability guarantee as that of the synchronous systems. In this context, we will examine the effectiveness of NVRAM.

Page 11: Virtual - EECS at UC BerkeleyComputer Science Division, Univ ersit y of California, Berk eley, f ryw ang,pattrsn g @cs.b erk.edu z Departmen t of Computer Science and Engineering,

Second, the logical disk's response to a "sync" operation is determined by a tunable parameter called the partial segment threshold. If the current segment is filled above the threshold at the time of the "sync", the current segment is flushed to the disk as if it were full. If it is filled below the threshold, the current segment is written to the disk but the memory copy is retained to receive more writes. The partial segment threshold in the experiments is set to 75%.
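
The policy can be paraphrased in a few lines (our sketch; segment.fill_fraction, write_to_disk, and start_new_segment are assumed helpers, not the logical disk's actual interface):

    def handle_sync(segment, threshold=0.75):
        segment.write_to_disk()                  # a sync always writes the segment out
        if segment.fill_fraction() >= threshold:
            segment.start_new_segment()          # treat the segment as if it were full
        # otherwise the in-memory copy stays open to absorb further writes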

Third, the original version of the logical disk only invokes the cleaner when it runs out of empty segments. We have modified the cleaner so that it can be invoked during idle periods, before it runs out of free space.

5 Experimental Results

In this section, we compare the performance of eager writing against that of update-in-place and logging with a variety of micro-benchmarks. We first run the small file and large file benchmarks that are commonly used by similar file system studies. Then we use a benchmark that best demonstrates the strength of eager writing: small random synchronous updates with no idle time. This benchmark also illustrates the effect of disk utilization. Next we examine the effect of technology trends as disks improve. Last, we examine how the availability of idle time impacts the performance of eager writing and logging. Unless explicitly stated to be otherwise, the experimental platform is based on the SPARCstation-10 running on the simulated Seagate disk.

5.1 Small File Performance

We first examine two benchmarks similar to the ones used by both the original LFS study [27] and the Logical Disk study [8]. In the first benchmark, we create 1500 1 KB files, read them back after a cache flush, and delete them. The benchmark is run on empty disks. The results of this benchmark are shown in Figure 6. Under LFS, updates are flushed to disk only if the memory buffer is filled. (Otherwise, LFS pays a heavy price by continuously flushing partial segments to the disk.) Under UFS, updates are synchronous.

As expected, UFS on the VLD is able to significantly outperform the same file system on the regular disk for the create and delete phases of the benchmark. This is due to eager writing's ability to complete small writes much faster than

[Figure 6 bar chart: normalized create, read, and delete performance for UFS and LFS on the regular disk and on the VLD.]

Figure 6: Small file performance. The benchmark creates, reads, and deletes 1 KB files. All performance is normalized to that of UFS running on a regular disk.

update-in-place. The performance improvement, however, is not as great as the analytical models would suggest. We will see why this is the case in Section 5.4. The read performance on the VLD is slightly worse than that on the regular disk because of the overhead introduced by the indirection map and the fact that read-ahead inside the disk is not as effective. We will see the same pattern in the read performance of other experiments as well.

LFS comfortably outperforms UFS by batching small writes and preventing overwritten directories from reaching the disk. The performance of LFS on the regular disk and on the VLD is close because eager writing can support large segment writes as efficiently as update-in-place does. If we do not consider the cost of LFS cleaning (which is not triggered in this experiment), the performance of a UFS that employs delayed write techniques such as those proposed in [10] should be between that of the unmodified UFS and that of LFS. Just like under LFS, however, delayed writes under UFS do not offer data reliability.

From this benchmark, we speculate that by integrating LFS with the virtual log, the VLFS (which we have not implemented) should be able to approximate the performance of UFS on the VLD when we must write synchronously to the disk, while retaining the benefits of LFS when asynchronous buffering is acceptable.

5.2 Large File Performance

In the second benchmark, we write a 10 MB file sequentially, read it back sequentially, write 10 MB of data randomly to the same file, read it back sequentially again, and finally read 10 MB of


[Figure 7 bar chart: bandwidth (MB/s) for sequential write, sequential read, asynchronous and synchronous random write, sequential read again, and random read, for UFS and LFS on the regular disk and on the VLD.]

Figure 7: Large file performance. The benchmark sequentially writes a 10 MB file, reads it back sequentially, writes it again randomly (both asynchronously and synchronously for the UFS runs), reads it again sequentially, and finally reads it randomly.

random data from the file. The performance of random I/O can also be an indication of the effect of interleaving a large number of independent concurrent workloads. The benchmark is again run on empty disks. Figure 7 shows the results. The writes are asynchronous with the exception of the two random write runs on UFS that are labeled as "Sync". Neither the LFS cleaner nor the VLD compactor is run in the benchmark.

We first point out a few characteristics that are results of implementation artifacts. The first two phases of the LFS performance are not as good as those of UFS. This is because LFS relies on Solaris for the raw disk interface and is less efficient than the native in-kernel UFS. Furthermore, LFS disables prefetching, which explains its low sequential read bandwidth. The UFS sequential read performance is significantly better than its write performance due to aggressive prefetching both at the file system level and inside the disk. With these artifacts explained, we now examine a number of interesting performance characteristics.

First, sequential read after random write performs poorly in all LFS and VLD systems. This is because both logging and eager writing destroy spatial locality. This, fortunately, is a problem that can be solved by caching, data reorganization [23], or I/O interface changes as explained in Section 3.4.

Second, the LFS random write bandwidth is higher than that of sequential write. This is because during the random write phase, some of the blocks are written multiple times and the overall number of bytes that reach the disk is smaller than in the sequential case. This is an important benefit of delayed writes.

Third, on a UFS, while it is not surprising that the synchronous random writes do well on the VLD, it is interesting to note that even sequential writes perform better on the VLD. This is because of the occasional inadvertent miss of disk rotations on the regular disk. Interestingly enough, this phenomenon does not occur if we run the benchmark on the slower HP97560 disk. This evidence supports our earlier contention that tuning the host operating system to match changing technologies in both the disk and the host is indeed a difficult and error-prone task. The approach of running the file system inside the disk in general, and the concept of a virtual log in particular, can simplify such efforts.

Fourth, although our regular disk simulator does not implement disk queue sorting, UFS does sort the asynchronous random writes when flushing to disk. Therefore, the performance of this phase of the benchmark, which is also worse on the regular disk than on the VLD due to the reason described above, is a best case scenario of what disk queue sorting can accomplish. In general, disk queue sorting is likely to be much less effective when the disk queue length is short compared to the working set size during random updates. Similarly, if the size of a write-ahead log is small compared to the size of the database, the throughput of log truncation will be limited. The VLD based systems need not suffer from these limitations.

Finally, LFS running on the VLD delivers slightly better write bandwidth than LFS on the regular disk. In addition to the potential occasional miss of rotations on the regular disk, this is also because, even with large segment-sized writes, LFS on a regular disk still needs to perform occasional long-distance seeks, which the VLD diligently avoids with eager writing. In summary, the benchmark further demonstrates the power of combining lazy writing by the file system with eager writing by the disk.

5.3 Effect of Disk Utilization

There are a number of questions that are still unanswered by the first two benchmarks. First, the VLD always has plenty of free space to work with in the previous benchmarks. How much performance degradation can we expect when the disk is fuller? Second, the LFS cleaner is not invoked under the previous benchmarks. What then is the impact on performance when the cleaner is forced to run under various disk utilizations? Third, we know that LFS


performs poorly with frequent flushes to the disk. How much then can NVRAM help? We attempt to answer these questions with the third benchmark.

In this benchmark, we create a single file of a certain size. Then we repeatedly choose a random 4 KB block to update. There is no idle time between writes. For UFS, the file is opened with the flag that indicates to the file system that all writes are to be performed synchronously. In other words, the "write" system call does not return until the block is written to the disk surface. For LFS, we do not flush to the disk until the 6.1 MB file buffer cache is full; we assume in this case that the buffer cache is made of NVRAM. We measure the steady state bandwidth of UFS on the regular disk, UFS on the VLD, and LFS on the regular disk. We repeat the experiment under different disk utilizations by varying the size of the file we update.
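
For reference, the structure of this micro-benchmark looks roughly like the sketch below (our illustration; the paper's version runs against the simulated disks inside the modified Solaris kernel rather than a stock file system, and O_SYNC stands in for the synchronous-write flag):

    import os, random, time

    def random_update_latency(path, file_size, block=4096, writes=2000):
        # Synchronous random 4 KB updates with no idle time between writes.
        fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o644)
        os.pwrite(fd, b"\0" * file_size, 0)           # create a file of the given size
        payload = os.urandom(block)
        start = time.time()
        for _ in range(writes):
            offset = random.randrange(file_size // block) * block
            os.pwrite(fd, payload, offset)            # returns only after the disk write
        elapsed = time.time() - start
        os.close(fd)
        return elapsed / writes                       # average latency per update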

Figure 8 plots the average latency experienced per write. UFS on the regular disk suffers from excessive disk head movement due to the update-in-place policy. The latency increases slightly as the updated file grows because the disk head needs to travel a greater distance between successive writes. This increase may have been larger had we simulated the entire disk.

LFS provides excellent performance when the entire file fits in the NVRAM buffer. As soon as the file outgrows the available NVRAM, writes are forced to the disk; as this happens and as disk utilization increases, the cleaner must run, quickly dominating performance. The plateau between roughly 60% and 85% disk utilization arises because the LFS cleaner tends to choose less utilized segments to clean; with certain distributions of free space, the number of segments that must be cleaned to generate a given number of free segments may be the same as (or even larger than) the number required under a higher utilization.
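To see why the cleaning cost can plateau, consider the simplified, greedy selection sketch below, which cleans the least-utilized segments first. Real LFS cleaners typically use a cost-benefit policy, so this is only an illustration of why the amount of live data copied depends on how free space is distributed across segments, not just on overall utilization.

    /*
     * Simplified, greedy segment selection for an LFS-style cleaner:
     * clean least-utilized segments first until "needed" free segments
     * can be produced.  Returns the number of live blocks that must be
     * copied, which is the work the cleaner steals from foreground I/O.
     * (This greedy policy is an illustration, not the actual cleaner.)
     */
    #include <stdlib.h>

    static int by_live(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    long cleaning_cost(int *live_blocks, int nsegs,
                       int blocks_per_seg, int needed)
    {
        long copied = 0, freed = 0;
        long target = (long)needed * blocks_per_seg;

        qsort(live_blocks, nsegs, sizeof(int), by_live);
        for (int i = 0; i < nsegs && freed < target; i++) {
            copied += live_blocks[i];                  /* live data rewritten */
            freed  += blocks_per_seg - live_blocks[i]; /* net space reclaimed */
        }
        return copied;
    }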

With eager writing, the VLD suffers from neither the excessive disk head movements nor the bandwidth waste as a result of cleaning. As the disk utilization increases, the VLD latency also rises. The rise, however, is not significant compared to the various overheads in the system. We examine the effects of system overheads on the VLD in the next section.
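For reference, the core of eager writing can be sketched as choosing, among the free sectors known to the disk, the one reachable soonest from the current head position. The single-track cost model below (forward rotational distance only) is a deliberate simplification of the seek, rotation, and head-switch costs that the embedded firmware would actually weigh.

    /*
     * Sketch of eager-write sector selection on a single track: choose
     * the free sector with the smallest forward rotational distance from
     * the current head position.  A real programmable disk would also
     * consider head switches and short seeks to nearby tracks.
     */

    /* Returns the chosen sector index, or -1 if no sector is free. */
    long pick_eager_sector(const unsigned char *is_free,
                           long sectors_per_track, long head_pos)
    {
        long best = -1, best_dist = sectors_per_track + 1;

        for (long s = 0; s < sectors_per_track; s++) {
            if (!is_free[s])
                continue;
            /* forward rotational distance from the head to sector s */
            long dist = (s - head_pos + sectors_per_track) % sectors_per_track;
            if (dist < best_dist) {
                best_dist = dist;
                best = s;
            }
        }
        return best;
    }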

5.4 Effect of Technology Trends

Despite the advantage that eager writing has demonstrated over update-in-place so far, the performance difference is not as great as the analytical models might suggest. To see why this is the case, we now provide a more detailed breakdown of the benchmark times reported in the last section. We also examine how the technology trends impact the relative ratios of the individual components that make up the overall latency.

Figure 8: Performance of random small synchronous updates under various disk utilizations. The disk utilization is obtained from the Unix "df" utility and includes about 12% of reserved free space that is not usable. The arrow on the x-axis points to the size of the NVRAM used by LFS. (Curves: LFS with NVRAM on the regular disk, UFS on the VLD, and UFS on the regular disk; y-axis: latency per 4 KB block in ms; x-axis: disk utilization in %.)

We repeat the experiment of the last section on three different platforms under the same disk utilization (80%)³. Figure 9 shows the result. The first two bars compare the performance of update-in-place and virtual logging on a SPARCstation-10 and HP disk combination. In the next run, we replace the older HP disk with the new Seagate disk. In the third run, we replace the older SPARCstation-10 with the newer UltraSPARC-170. We see that the performance gap between update-in-place and virtual logging widens from less than three-fold to almost an order of magnitude as both disks and host processors improve.

Figure 10 reveals the underlying source of this performance difference by providing a detailed breakdown of the latencies shown in the preceding figure. The component labeled "SCSI overhead" is the fixed amount of time that the embedded processor in the disk spends processing each SCSI command. The component labeled "transfer" is the time it takes to move the bits to or from the media after the head has been positioned over the target sector. The component labeled "locate sectors" is the amount of time the disk spends positioning the disk head; it includes seek, rotation, and head switch times. The component labeled "other" covers the operating system processing overhead, including the time to run the virtual log algorithm, because the disk simulator is part of the host kernel; the time consumed by simulating the disk mechanism itself, however, is less than 5% of this component.

³ The VLD latency in this case is taken immediately after running a compactor, as explained in Section 5.5.

Figure 9: Performance gap between update-in-place and virtual logging widens as disks and host processors improve. (Bars: UFS on the VLD and UFS on the regular disk, for the HP + SPARCstation-10, Seagate + SPARCstation-10, and Seagate + UltraSPARC configurations; y-axis: latency per 4 KB block in ms.)

Figure 10: Breaking down the total latency into SCSI overhead, transfer time, time required to locate free sectors, and other processing time (percentage breakdown for the same three platform configurations as in Figure 9). The latency breakdowns show the underlying reason for the performance difference: update-in-place performance becomes increasingly dominated by the mechanical delays of the disk, while virtual logging achieves a balance between processor and disk improvements.

We see that the mechanical delay becomes a dominant factor in update-in-place latency. We also see that eager writing has indeed succeeded in significantly reducing the disk head positioning times. The overall performance improvement on the older disk or on the older host, however, is low due to the high overheads. After we replace the older disk, the performance of virtual logging becomes host limited, as the component labeled "other" dominates. After we replace the older host, however, the latency components again become more balanced. This indicates that the virtual log is able to ride the impressive disk bandwidth growth, achieving a balance between processor and disk improvements.

Figure 11: Performance of LFS (with NVRAM) as a function of available idle time. (One curve per burst size: 128 KB, 256 KB, 504 KB, 1008 KB, 2016 KB, and 4032 KB; y-axis: latency per 4 KB block in ms; x-axis: idle interval length in seconds; points A through D are discussed in the text.)

5.5 Effect of Available Idle Time

The benchmark in Section 5.3 assumes zero idle time. This severely stresses LFS because the time consumed by the cleaner is not masked by idle periods. It also penalizes the VLD by disallowing free space compacting. In this section, we examine how LFS on a regular disk and UFS on a VLD behave as more idle time becomes available. We modify the benchmark of Section 5.3 to perform a burst of random updates, pause, and repeat. The disk utilization is kept at 80%.
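The modified benchmark amounts to a simple burst-then-pause loop, sketched below; the burst size, idle interval, and file parameters shown are placeholders, and the file descriptor is assumed to be set up as in the earlier synchronous-update benchmark.

    /*
     * Sketch of the burst/idle benchmark: perform a burst of random 4 KB
     * updates, pause for a fixed idle interval, and repeat.  BURST_BYTES
     * and IDLE_USEC are placeholders; fd refers to a file opened as in
     * the earlier synchronous-update benchmark.
     */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE  4096
    #define BURST_BYTES (256 * 1024)     /* placeholder burst size    */
    #define IDLE_USEC   500000           /* placeholder idle interval */

    void run_bursts(int fd, long nblocks, const char *block, int nbursts)
    {
        for (int b = 0; b < nbursts; b++) {
            for (long done = 0; done < BURST_BYTES; done += BLOCK_SIZE) {
                off_t off = (off_t)(random() % nblocks) * BLOCK_SIZE;
                pwrite(fd, block, BLOCK_SIZE, off);   /* one random update */
            }
            usleep(IDLE_USEC);   /* idle period available to the cleaner
                                    or the compactor */
        }
    }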

Figure 11 shows how the LFS performance responds to the increase in idle interval length. Each curve represents a different burst size. At point A, no idle time is available. LFS fills up the NVRAM with updates, and then proceeds to flush the buffered segments to the regular disk, invoking the cleaner when necessary.

At point B, enough idle time is available to clean one segment. If the burst size is less than or equal to 1 MB, at the time when the NVRAM becomes full, the cleaner has already generated the maximum number of empty segments possible. Consequently, it takes a constant amount of time to flush the NVRAM. This is why the first four curves meet at point B⁴. A similar latency is achieved at point C, where cleaning two segments per idle interval is sufficient to compact all the free space.

⁴ In this case, because the NVRAM size is larger than the available free space, the cleaner still needs to run during flushing to reclaim the free space created by the overwrites buffered in the NVRAM.


Figure 12: Performance of UFS on the VLD as a function of available idle time. (One curve per burst size: 128 KB, 256 KB, 512 KB, 1024 KB, 2048 KB, and 4096 KB; y-axis: latency per 4 KB block in ms; x-axis: idle interval length in seconds.)

Point D corresponds to sufficient idle time to flush the entire burst from NVRAM to the disk. The benchmark is able to run at NVRAM speed because the flushing time, which includes occasional cleaning, is entirely masked by the long idle periods. The first three curves coincide at this point because they correspond to burst sizes that can fit in a single segment.

We do not pretend to know the optimal LFS cleaning algorithm in the presence of NVRAM; that question is outside the scope of this paper. Nevertheless, this experiment reveals two characteristics of the impact of available idle time on LFS performance. First, because the cleaner moves segment-sized data, LFS can only benefit from relatively coarse-grained idle intervals. Second, unless there is sufficient idle time to mask the flushing of the burst buffered in the NVRAM, LFS performance remains poor.

Figure 12 shows the result of repeating the same benchmark on UFS running on a VLD. Unlike the LFS cleaner, the VLD compactor has no restriction on the granularity of the data it moves. For convenience, our implementation of the compactor moves data at track granularity, which is still much smaller than the LFS segment granularity. Performance on the VLD improves along a continuum of idle interval lengths. Also note the smaller scale of the axes in Figure 12 compared to those of Figure 11. Furthermore, the VLD performance is also more predictable, whereas LFS experiences large variances in performance depending on an array of factors, including whether the NVRAM is full and whether disk cleaning is necessary. The experiments in this section run on the SPARCstation-10. As we have seen in the last section, a more powerful host processor such as the UltraSPARC-170 can easily cut the latency in Figure 12 in half.
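As a rough illustration, track-granularity compaction can be sketched as copying the live blocks of a partially free track to eagerly chosen free sectors elsewhere and then declaring the whole track free. The track and block-map structures and the two helper routines below are hypothetical stand-ins, and the virtual-log bookkeeping needed for recovery is omitted.

    /*
     * Sketch of track-granularity free-space compaction: copy the live
     * blocks of a partially free track to eagerly chosen free sectors
     * elsewhere, update the block map, and mark the whole track free.
     * All structures and helpers here are hypothetical illustrations.
     */
    #define SECTORS_PER_TRACK 64              /* placeholder geometry */

    struct track {
        long phys_base;                       /* physical address of sector 0      */
        long logical[SECTORS_PER_TRACK];      /* logical block per sector, -1=free */
    };

    /* Hypothetical stubs standing in for VLD firmware services. */
    static long next_free = 100000;           /* pretend free region              */
    static long eager_alloc_near_head(void) { return next_free++; }
    static void copy_sector(long from, long to) { (void)from; (void)to; }

    void compact_track(struct track *t, long *block_map /* logical -> physical */)
    {
        for (int s = 0; s < SECTORS_PER_TRACK; s++) {
            if (t->logical[s] < 0)
                continue;                     /* sector already free           */
            long to = eager_alloc_near_head();/* eager write of the live block */
            copy_sector(t->phys_base + s, to);
            block_map[t->logical[s]] = to;    /* remap the logical block       */
            t->logical[s] = -1;               /* sector now free               */
        }
        /* The entire track can now be reused for future eager writes. */
    }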

The disadvantage of UFS on the VLD compared to LFS with NVRAM is its limiting performance given an infinite amount of idle time. Because each write is synchronously written to the disk surface, the VLD experiences the overheads detailed in Section 5.4. These overheads also render the impact of the compactor less important. Fortunately, as explained in that section, as disks continue to improve and operating system implementors pay more careful attention to streamlining the I/O path, we expect this gap to narrow.

Furthermore, eager writing does not dictate the use of a UFS, nor does it preclude the use of NVRAM. A lazy writing file system that employs both an NVRAM and an eager writing VLD can enjoy 1) the low latency of the NVRAM and its filtering of short-lived data, and 2) the effective use of the available bandwidth by the eager writing strategy when idle intervals are short or disk utilization is high.

6 Related Work

The work presented in this paper builds upon a number of existing techniques, including reducing latency by writing data near the disk head, a transactional log, file systems that support data location independence, and log-structured file systems. The goal of this study is not to demonstrate the effectiveness of any of these individual ideas. Rather, the goals are 1) to provide a theoretical foundation for eager writing with the analytical models, 2) to show that the integration of these ideas at the disk level can provide a number of unique benefits to both UFS and LFS, 3) to demonstrate implementations that realize the benefits of eager writing without the semantic compromise of delayed writes or extra hardware support such as NVRAM, and 4) to conduct a series of systematic experiments to quantify the differences among the alternatives. We are not aware of existing studies that aim for these goals.

Simply writing data near the disk head is not a new idea. Many efforts have focused on improving the performance of the write-ahead log. This is motivated by the observation that appending to the log may incur extra rotational delay even when no seek is required. The IBM IMS Write Ahead Data Set (WADS) system [11] addresses this issue for drums (fixed-head disks) by keeping some tracks completely empty. Once each track is filled with a single block, it is not re-used until the data is copied out of the track into its normal location.

Likewise, Hagmann places the write-ahead log on its own logging disk [14], where each log append can fill any open block in the cylinder until its utilization reaches a threshold. Eager writing in our system, while retaining good logging performance, assumes no dedicated logging disk and does not require copying of data from the log into its permanent location. The virtual log is the file system.

A number of file systems have also explored the idea of lowering the latency of small writes by writing near the disk head location. In [24], Menon proposes to use this technique to speed up parity updates in disk arrays. This is similar to some write-ahead logging systems in that it requires multiple disks and rewriting of data.

Mime [5], the extension of Loge [9], is the closest in spirit to our system. Using an indirection map, it also writes near disk head locations to improve small write performance. There are a number of differences between Mime and the virtual log. First, Mime relies on self-identifying disk blocks. Second, Mime scans free segments to recover its indirection map. As disk capacity increases, potentially reaching a terabyte by the year 2002 [22], this scanning may become a time-consuming process. Third, the virtual log incorporates a free space compactor, whose importance is shown in Section 2.3.

The Network Appliance file system, WAFL [16, 17], checkpoints the disk to a consistent state periodically, uses NVRAM for fast writes between checkpoints, and can write data and metadata anywhere on the disk. An exception to the write-anywhere policy is the root inodes, which are written for each checkpoint and must be at fixed locations. Unlike Mime, WAFL supports fast recovery by rolling forward from a checkpoint using the log in the NVRAM. One goal of the virtual log is to support fast transactions and fast recovery without NVRAM and its capacity, reliability, and cost limitations. Another difference between WAFL and the virtual log is that the WAFL write allocation decisions are made at the RAID controller level, so the opportunity to optimize for rotational delay is limited.

Our idea of fast transactional writes using the virtual log originated as a generalization of the AutoRAID technique of hole-plugging [36], to improve LFS performance at high disk utilizations without AutoRAID hardware support for self-describing disk sectors. In hole-plugging, partially empty segments are freed by writing their live blocks into the holes found in other segments. This outperforms traditional cleaning at high disk utilizations by avoiding reading and writing a large number of nearly full segments just to produce a few empty segments [23]. AutoRAID requires an initial log-structured write of a physically contiguous segment, after which it is free to copy the live data in a segment into any empty block on the disk. Compared to AutoRAID, our approach eliminates the initial segment write and more efficiently schedules the individual writes.

7 Conclusion

In this paper, we have developed a number of analytical models which show the theoretical performance potential of eager writing, the technique of performing small transactional writes near the disk head position. We have presented the virtual log design, which delivers fast transactional writes without the semantic compromise of delayed writes or extra hardware support such as NVRAM. This design requires careful log management for fast recovery. By conducting a systematic comparison of this approach against traditional update-in-place and logging approaches, we have demonstrated the benefits of applying the virtual log approach to both UFS and LFS. The availability of fast transactions can also simplify the design of file systems and sophisticated applications. As the current technology trends continue, we expect the performance advantage of this approach to become increasingly important.

Acknowledgements

We would like to thank Arvind Krishnamurthy for many interesting discussions on the proofs of several analytical models, Marvin Solomon for discovering the simplest proof of the single track model, Daniel Stodolsky and Chris Malakapalli for helping us understand the workings of Quantum and Seagate disks, and Jeanna Neefe Matthews for asking the question of how hole-plugging could be efficiently implemented without hardware support for self-describing disk sectors.

References

[1] Adams, L., and Ou, M. Processor Integration in a Disk Controller. IEEE Micro 17, 4 (July 1997).

[2] Atkinson, M., Chisholm, K., Cockshott, P., and Marshall, R. Algorithms for a Persistent Heap. Software - Practice and Experience 13, 3 (March 1983), 259-271.

[3] Baker, M., Asami, S., Deprit, E., Ousterhout, J., and Seltzer, M. Non-Volatile Memory for Fast, Reliable File Systems. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V) (Sept. 1992), pp. 10-22.

[4] Birrell, A., Hisgen, A., Jerian, C., Mann, T., and Swart, G. The Echo Distributed File System. Technical Report 111, Digital Equipment Corp. Systems Research Center, Sept. 1993.

[5] Chao, C., English, R., Jacobson, D., Stepanov, A., and Wilkes, J. Mime: a High Performance Parallel Storage Device with Strong Recovery Guarantees. Tech. Rep. HPL-CSP-92-9 rev 1, Hewlett-Packard Company, Palo Alto, CA, March 1992.

[6] Chart Watch: Mobile Processors. Microprocessor Report, June 1997.

[7] Chutani, S., Anderson, O., Kazar, M., Leverett, B., Mason, W., and Siedbotham, R. The Episode File System. In Proc. of the 1992 Winter USENIX (January 1992), pp. 43-60.

[8] de Jonge, W., Kaashoek, M. F., and Hsieh, W. C. The Logical Disk: A New Approach to Improving File Systems. In Proc. of the 14th ACM Symposium on Operating Systems Principles (December 1993), pp. 15-28.

[9] English, R. M., and Stepanov, A. A. Loge: a Self-Organizing Disk Controller. In Proc. of the 1992 Winter USENIX (January 1992).

[10] Ganger, G. R., and Patt, Y. N. Metadata Update Performance in File Systems. In Proc. of the First Symposium on Operating Systems Design and Implementation (November 1994), pp. 49-60.

[11] Gawlick, D., Gray, J., Limura, W., and Obermarck, R. Method and Apparatus for Logging Journal Data Using a Log Write Ahead Data Set. U.S. Patent 4507751 issued to IBM, March 1985.

[12] Grochowski, E. G., and Hoyt, R. F. Future Trends in Hard Disk Drives. IEEE Transactions on Magnetics 32, 3 (May 1996).

[13] Hagmann, R. Reimplementing the Cedar File System Using Logging and Group Commit. In Proc. of the 11th ACM Symposium on Operating Systems Principles (October 1987), pp. 155-162.

[14] Hagmann, R. Low Latency Logging. Tech. Rep. CSL-91-1, Xerox Corporation, Palo Alto, CA, February 1991.

[15] Hartman, J., and Ousterhout, J. The Zebra Striped Network File System. ACM Transactions on Computer Systems (Aug. 1995).

[16] Hitz, D., Lau, J., and Malcolm, M. File System Design for an NFS File Server Appliance. In Proc. of the 1994 Winter USENIX (January 1994).

[17] Hitz, D., Lau, J., and Malcolm, M. File System Design for an NFS File Server Appliance. Tech. Rep. 3002, Network Appliance, March 1995.

[18] Kotz, D., Toh, S., and Radhakrishnan, S. A Detailed Simulation Model of the HP 97560 Disk Drive. Tech. Rep. PCS-TR91-220, Dept. of Computer Science, Dartmouth College, July 1994.

[19] Lamb, C., Landis, G., Orenstein, J., and Weinreb, D. The ObjectStore Database System. Communications of the ACM 34, 10 (October 1991), 50-63.

[20] Lowell, D. E., and Chen, P. M. Free Transactions with Rio Vista. In Proc. of the 16th ACM Symposium on Operating Systems Principles (October 1997), pp. 92-101.

[21] Malakapalli, C. Personal Communication, Seagate Technology, Inc., July 1998.

[22] Mashey, J. R. Big Data and the Next Wave of InfraStress. Computer Science Division Seminar, University of California, Berkeley, October 1997.

[23] Matthews, J. N., Roselli, D. S., Costello, A. M., Wang, R. Y., and Anderson, T. E. Improving the Performance of Log-Structured File Systems with Adaptive Methods. In Proc. of the 16th ACM Symposium on Operating Systems Principles (October 1997), pp. 238-251.

[24] Menon, J., Roche, J., and Kasson, J. Floating Parity and Data Disk Arrays. Journal of Parallel and Distributed Computing 17, 1 and 2 (January/February 1993), 129-139.

[25] O'Toole, J., and Shrira, L. Opportunistic Log: Efficient Installation Reads in a Reliable Storage Server. In Proc. of the First Symposium on Operating Systems Design and Implementation (November 1994), pp. 39-48.

[26] Patterson, R. H., Gibson, G. A., Ginting, E., Stodolsky, D., and Zelenka, J. Informed Prefetching and Caching. In Proc. of the 15th ACM Symposium on Operating Systems Principles (December 1995).

[27] Rosenblum, M., and Ousterhout, J. The Design and Implementation of a Log-Structured File System. In Proc. of the 13th ACM Symposium on Operating Systems Principles (Oct. 1991), pp. 1-15.

[28] Ruemmler, C., and Wilkes, J. An Introduction to Disk Drive Modeling. IEEE Computer 27, 3 (March 1994), 17-28.

[29] Satyanarayanan, M., Mashburn, H. H., Kumar, P., Steere, D. C., and Kistler, J. J. Lightweight Recoverable Virtual Memory. In Proc. of the 14th ACM Symposium on Operating Systems Principles (December 1993), pp. 146-160.

[30] Seltzer, M., Bostic, K., McKusick, M., and Staelin, C. An Implementation of a Log-Structured File System for UNIX. In Proc. of the 1993 Winter USENIX (Jan. 1993), pp. 307-326.

[31] Seltzer, M., Smith, K., Balakrishnan, H., Chang, J., McMains, S., and Padmanabhan, V. File System Logging Versus Clustering: A Performance Comparison. In Proc. of the 1995 Winter USENIX (Jan. 1995).

[32] Stodolsky, D. Personal Communication, Quantum Corp., July 1998.

[33] Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G. Scalability in the XFS File System. In Proc. of the 1996 Winter USENIX (January 1996), pp. 1-14.

[34] Transaction Processing Performance Council. TPC Benchmark B Standard Specification. Waterside Associates, Fremont, CA, Aug. 1990.

[35] Transaction Processing Performance Council. TPC Benchmark C Standard Specification. Waterside Associates, Fremont, CA, August 1996.

[36] Wilkes, J., Golding, R., Staelin, C., and Sullivan, T. The HP AutoRAID Hierarchical Storage System. In Proc. of the 15th ACM Symposium on Operating Systems Principles (December 1995), pp. 96-108.