
Flash-Based Extended Cache for Higher Throughput and Faster Recovery

Woon-Hak Kang‡
[email protected]

Sang-Won Lee‡
[email protected]

Bongki Moon§
[email protected]

‡College of Info. & Comm. Engr., Sungkyunkwan University, Suwon 440-746, Korea
§Dept. of Computer Science, University of Arizona, Tucson, Arizona, 85721, USA

ABSTRACT

Considering the current price gap between disk and flash memory drives, for applications dealing with large scale data, it will be economically more sensible to use flash memory drives to supplement disk drives rather than to replace them. This paper presents FaCE, a new low-overhead caching strategy that uses flash memory as an extension to the DRAM buffer. FaCE aims at improving the transaction throughput as well as shortening the recovery time from a system failure. To achieve these goals, we propose two novel algorithms for flash cache management, namely, Multi-Version FIFO replacement and Group Second Chance. One striking result from FaCE is that using a small flash memory drive as a caching device could deliver even higher throughput than using a large flash memory drive to store the entire database tables. This was possible due to flash write optimization as well as disk access reduction obtained by the FaCE caching methods. In addition, FaCE takes advantage of the non-volatility of flash memory to fully support database recovery by extending the scope of a persistent database to include the data pages stored in the flash cache. We have implemented FaCE in the PostgreSQL open source database server and demonstrated its effectiveness for TPC-C benchmarks.

1. INTRODUCTION

As the technology of flash memory solid state drives (SSDs) continues to advance, they are increasingly adopted in a wide spectrum of storage systems to deliver higher throughput at more affordable prices. For example, it has been shown that flash memory SSDs can outperform disk drives in throughput, energy consumption and cost effectiveness for database workloads [12, 13]. Nevertheless, it is still true that the price per unit capacity of flash memory SSDs is higher than that of disk drives, and this market trend is likely to continue for the foreseeable future. Therefore, for applications dealing with large scale data, it may be economically more sensible to use flash memory SSDs to supplement disk drives rather than to replace them.

In this paper, we present a low-overhead strategy for using flash memory as an extended cache for a recoverable database. This method is referred to as Flash as Cache Extension, or FaCE for short. Like any other caching mechanism, the objective of FaCE is to provide the performance of flash memory and the capacity of disk at as little cost as possible. We set out to achieve this with the realization that flash memory has drastically different characteristics (such as no-overwriting, slow writes, and non-volatility) than a DRAM buffer, and that these must be dealt with carefully and exploited effectively. There are a few existing approaches that store frequently accessed data in flash memory drives by using them as either faster disk drives or an extension to the RAM buffer. The FaCE method we propose is in line with the existing cache extension approaches, but it is also different from them in many ways.

First, FaCE aims at flash write optimization as well as disk access reduction. Unlike a DRAM buffer that yields uniform performance for both random and sequential accesses, the performance of flash memory varies considerably depending on the type of operation (i.e., read or write) and the pattern of accesses (i.e., random or sequential). With most contemporary flash memory SSDs, random writes are slower than sequential writes by approximately an order of magnitude. FaCE provides flash-aware strategies for managing the flash cache that can be designed and implemented independently of the DRAM buffer management policy. By turning small random writes into large sequential ones, the high sequential bandwidth and internal parallelism of modern flash memory SSDs can be utilized more effectively for higher throughput [5].

One surprising consequence we observed was that a disk-based OLTP system with a small flash cache added outperformed, by almost a factor of two, even an OLTP system that stored the entire database on flash memory devices. This result, which has never been reported to the best of our knowledge, demonstrates that the FaCE method provides a cost-effective performance boost for a disk-based OLTP system.

Second, FaCE takes advantage of the non-volatility of flash memory and extends the scope of a persistent database to include the data pages stored in the flash cache. Once a data page evicted from the DRAM buffer is staged in the flash cache, it is considered to have been propagated to the persistent database. Therefore, the data pages in the flash cache can be utilized to minimize the recovery overhead, accelerate restarting the system from a failure, and achieve transaction atomicity and durability at nominal cost. The only additional processing required for a system restart is to restore the metadata of the flash cache, but this does not take more than just a few seconds. In addition, since most data pages that need to be accessed during the recovery phase can be found in the flash cache, the recovery time will be significantly shortened. The recovery manager of FaCE provides mechanisms that fully support database recovery and persistent metadata management for cached pages.

Third, FaCE is a low-overhead framework that uses a flash memory drive as an extension of the DRAM buffer rather than a disk replacement. The use of the flash cache is tightly coupled with the DRAM buffer. Unlike some existing approaches, a data page is cached in flash memory not on entry to the DRAM buffer, but instead on exit from it. This is because a copy in the flash cache will never be accessed while another copy of the same page exists in the DRAM buffer. The run-time overhead is very small, as there is no need for manual or algorithmic elaboration to separate hot data items from cold ones. FaCE decreases but never increases the amount of traffic to and from disk, because it does not migrate data items between flash memory and disk drives just for the sake of higher cache hit rates. When a page is to be removed from the flash cache, if it is valid and dirty, it will be written to disk, very much like a dirty page evicted from the DRAM buffer and flushed to disk.

We have implemented the FaCE system in the PostgreSQL open source database server. Most of the changes necessary for FaCE are made within the buffer manager module, the checkpoint process and the recovery daemon. In the TPC-C benchmarks, we observed that FaCE yielded hit rates from the flash cache in the range from low 60 to mid 80 percent and reduced disk writes by 50 to 70 percent consistently. This high hit rate and considerable write reduction led to substantial improvement in transaction throughput, by up to a factor of two or more. Besides, FaCE reduced the restart time by more than a factor of four consistently across various checkpoint intervals.

The rest of the paper is organized as follows. Section 2 reviews the previous work on flash memory caching and presents the analysis of the cost effectiveness of flash memory as an extended cache. In Section 3, we overview the basic framework and design alternatives of FaCE and present the key algorithms for flash cache management. Section 4 presents the database recovery system designed for FaCE. In Section 5, we analyze the performance impact of the FaCE system on the TPC-C workloads and demonstrate its advantages over the existing approaches. Lastly, Section 6 summarizes the contributions of this paper.

2. BACKGROUND

As the technology of flash memory SSDs continues to innovate, their performance has improved remarkably in both sequential bandwidth and random throughput. Table 1 compares the price and performance characteristics of flash memory SSDs and magnetic disk drives. The throughput and bandwidth in the table were measured by the Orion calibration tool [15] when the devices were in the steady state. The flash memory SSDs are based on either MLC or SLC type NAND flash chips. The magnetic disks are enterprise class 15k-RPM drives, and they were tested as a single unit or as an 8-disk RAID-0 array.

2.1 Faster Disk vs. Buffer Extension

In some of the existing approaches [3, 11], flash memory drives are used as yet another type of disk drive with faster access speed. Flash memory drives may replace disk drives altogether for small to medium scale databases, or may be used more economically to store only frequently accessed data items for large scale databases. When both types of media are deployed at the same time, a data item typically resides exclusively in either of the media unless disk drives are used in a lower tier of the storage hierarchy [8].

One technical concern about this approach is the cost of identifying hot data items. This can be done either statically by profiling [3] or dynamically by monitoring at run-time [11], but not without drawbacks. While the static approach may not cope with changes in access patterns, the dynamic one may suffer from excessive run-time and space overheads for identifying hot data objects and migrating them between flash memory and disk drives. It has been shown that the dynamic approach becomes less effective when the workload is update intensive [3].

In contrast, if flash memory SSDs are used as a DRAM buffer extension, or a cache layer between DRAM and disk, the system can simply go along with the data page replacement mechanism provided by the DRAM buffer pool without having to provide any additional mechanism to separate hot data pages from cold ones. There is no need to monitor data access patterns, hence very little run-time overhead and no negative impact from the prediction quality of future data access patterns.

Two important observations can be made from Table 1 in regard to utilizing flash memory as a cache extension. First, there still exists a considerable bandwidth disparity between random writes and sequential writes with both SLC-based and MLC-based flash memory devices. The random write bandwidth was in the range of merely 10 to 13 percent of the sequential write bandwidth, while random read enjoys bandwidth much closer (in the range of 48 to 60 percent) to that of sequential read. Unlike a DRAM buffer that yields uniform performance regardless of the type or pattern of accesses, the design of flash cache management should take the unique characteristics of flash memory into account.

Second, disk arrays are a very cost-effective means for providing high sequential bandwidth. However, they still fall far behind flash memory SSDs with respect to random I/O throughput, especially for random reads. Therefore, the caching framework for flash memory should be designed such that random disk I/O operations are replaced by random flash read and sequential flash write operations as much as possible.

2.2 Cost-Effectiveness of Flash Cache

With a buffer replacement algorithm that does not suffer from Belady's anomaly [1], an increase in the number of buffer frames is generally expected to result in fewer page faults. Tsuei et al. have shown that the data hit rate is a linear function of log(BufferSize) when the database size is fixed [18]. Based on this observation, we analyze the cost-effectiveness of flash memory as a cache extension. The question we are interested in answering is how much flash memory will be required to achieve the same level of reduction in I/O time obtained by an increase in DRAM buffer capacity. In the following analysis, C_disk and C_flash denote the time taken to access a disk page and the time taken to access a flash page, respectively.


Storage            4KB Random Throughput (IOPS)    Sequential Bandwidth (MB/sec)    Capacity    Price in $
Media                Read          Write              Read          Write            in GB       ($/GB)
MLC SSD†            28,495         6,314             251.33        242.80              256      450 (1.78)
MLC SSD‡            35,601         2,547             258.70         80.81               80      180 (2.25)
SLC SSD§            38,427         5,057             259.2         195.25               32      440 (13.75)
Single disk¶           409           343             156           154               146.8      240 (1.63)
8-disk¶ RAID-0       2,598         2,502             848           843               1,170    1,920 (1.63)

SSD: †Samsung 470 Series 256GB, ‡Intel X25-M G2 80GB, §Intel X25-E 32GB
¶Disk: Seagate Cheetah 15K.6 146.8GB

Table 1: Price and Performance Characteristics of Flash Memory SSDs and Magnetic Disk Drives

Suppose the DRAM buffer size is increased from B to (1 + δ)B for some δ > 0. Then the increment in the hit rate is expected to be

    α log((1 + δ)B) − α log(B) = α log(1 + δ)

for a positive constant α. The increased hit rate will lead to a reduction in disk accesses, which accounts for reduced I/O time by αC_disk log(1 + δ). If the δB increment of DRAM buffer capacity is replaced by an extended cache of θB flash memory, then the data hit rate (for both DRAM hits and flash memory hits) will increase to α log((1 + θ)B). Since the rate of DRAM hits will remain the same, the rate of flash memory hits will be given by

    α log((1 + θ)B) − α log(B) = α log(1 + θ).

Each flash memory hit translates to an access to disk replaced by an access to flash memory. Therefore, the amount of reduced I/O time by the flash cache will be α(C_disk − C_flash) log(1 + θ).

The break-even point for θ is obtained by the following equation

    αC_disk log(1 + δ) = α(C_disk − C_flash) log(1 + θ)

and is represented by the formula below.

    1 + θ = (1 + δ)^(C_disk / (C_disk − C_flash))

For most contemporary disk drives and flash memory SSDs, the value of C_disk / (C_disk − C_flash) is very close to one. For example, with a Seagate disk drive and a Samsung flash memory SSD shown in Table 1, the value of the fraction is approximately 1.006 for a read-only workload and 1.025 for a write-only workload.

This implies that the effect of disk access reduction by flash cache extension is almost as good as that of extended DRAM cache. Given that NAND flash memory is almost ten times cheaper than DRAM with respect to price per capacity and the price gap is expected to grow, this analysis demonstrates that the cost-effectiveness of flash cache is indeed significant.
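
To make the break-even relation concrete, the short sketch below (an illustration added here, not part of the original paper) evaluates θ for a given δ from assumed per-page access latencies; the latency values are placeholders chosen for readability, not the Table 1 measurements.

    # Sketch: break-even flash cache size ratio (theta) for a given DRAM increment (delta).
    # Assumed illustrative latencies in seconds per 4KB page (placeholders, not Table 1 values).
    C_DISK = 2.5e-3     # hypothetical random disk page access time
    C_FLASH = 0.05e-3   # hypothetical random flash page access time

    def break_even_theta(delta: float) -> float:
        """Return theta such that theta*B of flash cache saves as much I/O time as
        delta*B of extra DRAM: 1 + theta = (1 + delta)^(C_DISK / (C_DISK - C_FLASH))."""
        exponent = C_DISK / (C_DISK - C_FLASH)
        return (1.0 + delta) ** exponent - 1.0

    if __name__ == "__main__":
        for delta in (0.5, 1.0, 2.0):
            print(f"delta = {delta:.1f}  ->  break-even theta = {break_even_theta(delta):.3f}")

Because the exponent stays close to one, the computed θ stays close to δ, which is exactly the point of the analysis: roughly the same cache-size increment in flash buys roughly the same I/O-time saving as it would in DRAM.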

2.3 Related Work

Flash memory SSDs have recently been adopted by commercial database systems to store frequently accessed data pages. For example, Oracle Exadata caches hot data pages in flash memory when they are fetched from disk [16, 17]. Hot data selection is done statically by the types of data, such that tables and indexes have higher priority than log and backup data.

Similarly, the Bufferpool Extension prototype of IBM DB2 proposes a temperature-aware caching (TAC) scheme that relies on data access frequencies [2, 4]. TAC monitors data access patterns continuously to identify hot data regions at the granularity of an extent or a fixed number of data pages. Hot data pages are cached in the flash memory cache on entry to the DRAM buffer from disk. TAC adopts a write-through caching policy. When a dirty page is evicted from the DRAM buffer, it is written to both the flash cache and disk. Consequently, the flash cache provides a caching effect for read operations but has no effect of reducing disk write operations. Besides, the high cost of maintaining cache metadata persistently in flash memory degrades the overall performance. (See Section 4.1 for more detailed descriptions of the persistent metadata management by TAC.)

In contrast, the Lazy Cleaning (LC) method presented in a recent study caches data pages in flash memory upon exit from the DRAM buffer and handles them by a write-back policy if they are dirty [6]. It uses a background thread to flush dirty pages from the flash cache to disk when the percentage of dirty pages goes beyond a tunable threshold. The LC method manages the flash cache using the LRU-2 replacement algorithm. Hence, replacing a page in the flash cache incurs costly random read and random write operations. Besides, since no mechanism is provided for incorporating the flash cache in database recovery, dirty pages cached in flash memory are subject to checkpointing to disk. It has been reported in the study that the additional cost of checkpointing is significant [6].

Among the aforementioned flash caching approaches, the LC method is the closest to the FaCE caching scheme presented in this paper. In both approaches, data pages are cached in flash memory upon exit from the DRAM buffer and managed by a write-back policy. Other than those similarities, the FaCE method differs from the LC method in a few critical aspects of utilizing flash memory for caching and database recovery purposes, and achieves almost twice the transaction throughput of the LC method.

First of all, FaCE considers not only cache hit rates but also write optimization of the flash cache. It manages the flash cache in a first-in-first-out fashion and allows multiple versions to be left in the flash cache, so that random writes are avoided or turned into sequential ones. The FIFO-style replacement allows FaCE to replace a group of pages at once so that the internal parallelism of a flash memory SSD can be exploited. It also boosts the cache hit rate by allowing a second chance for warm data pages to stay in the flash cache. In general, a flash caching mechanism improves the throughput of an OLTP system as the capacity of its flash cache increases, which will peak when the entire database is stored in the flash cache. With the FaCE method, however, we observed that a disk-based OLTP system achieved almost twice the transaction throughput of an OLTP system based entirely on flash memory, by utilizing a flash cache whose size is only a small fraction (about 10%) of the database size.¹

Second, unlike the LC method, FaCE extends the scope of a persistent database to include the data pages stored in the flash cache. It may sound like a simple notion, but its implications are significant. For example, database checkpointing can be done much more efficiently by flushing dirty pages to the flash cache rather than disk and by not subjecting data pages in the flash cache to checkpointing. Furthermore, the persistent data copies stored in the flash cache can be utilized for faster database recovery from a failure. FaCE provides a low-overhead mechanism for maintaining the metadata persistently in flash memory. We have implemented all the caching and recovery mechanisms of FaCE fully in the PostgreSQL system.

3. FLASH AS CACHE EXTENSION

The principal benefit of using flash memory as an extension to a DRAM buffer is that a mix of flash memory and disk drives can be used in a unified way without manual or algorithmic intervention for data placement across the drives. When a data page is deemed cold and swapped out by the DRAM buffer manager, it may be granted a chance to stay in the flash cache for an extended period. If the data page turns out to be warm enough to be referenced again while staying in the flash cache, then it will be swapped back into the DRAM buffer from flash memory much more quickly than from disk. If a certain page is indeed cold and not referenced for a long while, then it will not be cached in either the DRAM buffer or the flash cache.

3.1 Basic Framework

Figure 1: Flash as Cache Extension: Overview (the RAM buffer (LRU) evicts clean and dirty pages, which are staged out to the flash cache (LRU/FIFO); flash hits are served from the flash cache, and misses are fetched from the database on disk)

¹ Performance gain by the LC method was measured when the flash cache was larger than half of the entire database [6]. Therefore, the level of transaction throughput reported in that study is expected to be fairly close to what would be achieved by storing the entire database in flash memory devices.

Figure 1 illustrates the main components of a caching system deployed for a database system when the flash cache is enabled, and their interactions. The main interactions between the components are summarized as follows.

• When the database server requests a data page that is not in memory (i.e., a DRAM buffer miss), the flash cache is searched for the page. (A list of pages cached in flash memory is maintained in the DRAM buffer to support the search operations.) If the page is found in the flash cache (i.e., a flash hit), it is fetched from the flash cache. Otherwise, it is fetched from disk.

• When a page is swapped out of the DRAM buffer, different actions can be taken on the page depending on whether it is clean or dirty. If it is clean, the page is discarded or staged in to the flash cache. If it is dirty, the page is written back to either disk or the flash cache, or both.

• When a page is staged out of the flash cache, different actions can be taken on the page depending on whether it is clean or dirty. If it is clean, the page is just discarded. If it is dirty, the page is written to disk unless it was already written to disk when evicted from the DRAM buffer.

Evidently from the key interactions described above, the fundamental issues are when and which data pages should be staged in to or out of the flash cache. There are quite a few alternatives for addressing these issues, and they may have a profound impact on the overall performance of a database server. For example, when a data page is fetched from disk on a DRAM cache miss, we opt not to enter the page into the flash cache, because the copy in the flash cache will never be accessed while the page is cached in the DRAM buffer. For this reason, staging a page in to the flash cache is considered only when the page is evicted from the DRAM buffer.
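
The interactions in Figure 1 can be condensed into the following toy model (an illustrative sketch, not the actual PostgreSQL code path; TinyCache, DISK, and the dict-based flash cache are stand-ins introduced here). It shows a page request falling through the DRAM buffer, then the flash cache, then disk, and a page entering the flash cache only on exit from the DRAM buffer.

    from collections import OrderedDict

    # Minimal sketch of the Figure 1 interactions (illustrative only).
    # DISK stands in for the database; the flash cache is a plain dict of frames.
    DISK = {pid: f"page-{pid}" for pid in range(1000)}

    class TinyCache:
        """DRAM buffer (LRU) in front of a flash cache, in front of DISK."""
        def __init__(self, dram_frames=4, flash_frames=16):
            self.dram = OrderedDict()          # page_id -> page (LRU order)
            self.flash = {}                    # page_id -> page (placement policy elided)
            self.dram_frames, self.flash_frames = dram_frames, flash_frames

        def get_page(self, pid):
            if pid in self.dram:               # DRAM hit
                self.dram.move_to_end(pid)
                return self.dram[pid]
            page = self.flash.get(pid) or DISK[pid]   # flash hit, else disk read
            self._admit_to_dram(pid, page)
            return page

        def _admit_to_dram(self, pid, page):
            if len(self.dram) >= self.dram_frames:
                old_pid, old_page = self.dram.popitem(last=False)
                # On exit from DRAM the page is staged in to the flash cache;
                # pages are NOT cached in flash on entry from disk.
                if len(self.flash) >= self.flash_frames:
                    self.flash.pop(next(iter(self.flash)))   # stage out (to disk if dirty)
                self.flash[old_pid] = old_page
            self.dram[pid] = page

    cache = TinyCache()
    for pid in [1, 2, 3, 4, 5, 1, 2, 6, 1]:
        cache.get_page(pid)
    print(len(cache.dram), len(cache.flash))   # 4 4

Dirty-flag handling, write-back to disk, and the mvFIFO queue structure are deliberately omitted here; they are the subject of Sections 3.2 and 3.3.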

3.2 Design Choices for FaCE

The FaCE caching scheme focuses on how exactly data pages are to be staged in the flash cache and how the flash cache should be managed. Specifically, in the rest of this section, we discuss alternative strategies towards the following three key questions and justify the choices made for the FaCE scheme: (1) When a dirty page is evicted from the DRAM buffer, does it have to be written through to disk as well as the flash cache, or only to the flash cache? (2) Which replacement algorithm is better suited to exploiting flash memory as an extended cache? (3) When a clean page is evicted from the DRAM buffer, does it have to be cached in flash memory at all, or given as much preference as a dirty page? These are orthogonal questions. An alternative strategy can be chosen separately for each of the three dimensions.

Note that the FaCE scheme does not distinguish data pages by type (e.g., index pages vs. log data), nor does it monitor data access patterns to separate hot pages from cold ones. Nonetheless, there is nothing in the FaCE framework that disallows such additional information from being used in making caching decisions. Table 2 summarizes the design choices of FaCE and compares them with existing flash memory caching methods. The design choices of FaCE will be elaborated in this section.


             Exadata       TAC            LC            FaCE
When         on entry      on entry       on exit       on exit
What         clean         both           both          both
Sync         write-thru    write-thru     write-back    write-back
Replace      LRU           Temperature    LRU-2         FIFO

Table 2: Flash Caching Methods

Write-Back rather than Write-Through

When a dirty page is evicted from the DRAM buffer, it may be written through to both disk and the flash cache. Alternatively, a dirty page may be written back only to the flash cache, leaving its disk copy intact and outdated until it is synchronized with the current copy when being staged out of the flash cache and written to disk.

The two alternative approaches are equivalent in the effect of flash caching for read operations. However, the overhead of the write-through policy is obviously much higher than that of the write-back policy for write operations. While the write-through policy requires a disk write as well as a flash write for each dirty page being evicted from the DRAM buffer, the write-back policy reduces disk writes by staging dirty pages in the flash cache until they are staged out to disk. In effect, write-back replaces one or more disk writes (required for a dirty page evicted repeatedly) with as many flash writes followed by a single (deferred) disk write. For this reason, FaCE adopts the write-back policy rather than write-through.

Now that the flash and disk copies of a data page may not be in full sync, FaCE ensures that a request for the page from the database server is always served by the current copy. Note that the adoption of a write-back policy does not make the database server any more vulnerable to data corruption, as flash memory is a non-volatile storage medium. In fact, the FaCE caching scheme provides a low-cost recovery mechanism that takes advantage of the persistency of flash-resident cached data. (Refer to Section 4 for details.)

Replacement by Multi-Version FIFO

The size of a flash cache is likely to be much larger than the DRAM buffer, but it is not unlimited either. The page frames in the flash cache need to be managed carefully for better utilization. The obvious first choice is to manage the flash cache by LRU replacement. With LRU replacement, a victim page is chosen at the rear end of the LRU list regardless of its physical location in the flash cache. This implies that each page replacement incurs a random flash write for a (logically) in-place replacement. Random writes are the pattern that requires flash memory to make the most strenuous effort to process, and the bandwidth of random writes is typically an order of magnitude lower than that of sequential writes with flash memory.

Alternatively, a flash cache can be managed by FIFO replacement. The FIFO replacement may seem irrational given that it is generally considered inferior to LRU replacement and can suffer from Belady's anomaly with respect to hit rate. Nonetheless, the FIFO replacement has its own unique merit when it is applied to a flash caching scheme. Since all incoming pages are enqueued to the rear end of the flash cache, all flash writes will be done sequentially in an append-only fashion. This particular write pattern is known to be a perfect match for flash memory [13], and helps the flash cache yield its best performance.

The FaCE scheme adopts a variant of FIFO replacement. It is different from the traditional FIFO replacement in that one or more different versions of a data page are allowed to be present in the flash cache simultaneously. When a flash frame is to be replaced by an incoming page, a victim frame is selected at the front end of the flash cache and the incoming page is enqueued to the rear end of the flash cache. If the incoming page is dirty, then it is enqueued unconditionally. No additional action is taken to remove existing versions, in order not to incur random writes. If the incoming page is clean, then it is enqueued only when the same copy does not already exist in the flash cache. A data page dequeued from the flash cache is written to disk only if it is dirty and it is the latest version in the flash cache, so that redundant disk writes can be avoided. We call this approach Multi-Version FIFO (mvFIFO) replacement. Refer to Section 3.3 for the detailed design of FaCE and its optimizations.

Despite previous studies reporting the limitations of LRU replacement applied to a second-level buffer cache [20, 19], we observed that LRU replacement still delivered higher hit rates in the flash cache than the mvFIFO replacement. This is partly attributed to the fact that the former keeps no more than a single copy of a page cached in flash memory while the latter may need to store one or more versions of a page. However, the gap in the hit rates was narrow and far outweighed by the reduced I/O cost of the mvFIFO replacement with respect to the overall transaction throughput.

Caching Clean and Dirty

Caching a page in flash memory will be beneficial if the cached copy is referenced again before being removed from the flash cache and the cost of a disk read is higher than the combined cost of a flash write and a flash read, which is true for most contemporary disk drives and flash memory SSDs. The caching decision thus needs to be made based on how probable it is that a cached page will be referenced again at least once before the page is removed from the flash cache.

As a matter of fact, if a page being evicted from the DRAM buffer is dirty, it is always beneficial to cache the page in flash memory, because an immediate disk write would be requested for the page otherwise. In addition, by caching dirty pages, the write-back policy can turn a multitude of disk writes (required for repeated evictions of the same page from the DRAM buffer) into as many flash writes followed by a single disk write. On the other hand, the threshold for caching a clean page in flash memory is higher. This is because caching a clean page could cause a dirty page to be evicted from the flash cache and written to disk, while no disk write would be incurred if the clean page were simply discarded. In that case, the cost of caching a clean page, which could be as high as a disk write plus a flash write, will not be recovered by a single flash hit.

Therefore, especially when a write-back policy is adopted, dirty pages should have priority over clean ones with respect to caching in flash memory. This justifies the adoption of our new mvFIFO replacement that allows multiple versions of a data page to be present in the flash cache.
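
The argument above can be phrased as a simple expected-cost comparison. The following sketch (illustrative only; the per-page costs are placeholder assumptions, not the Table 1 measurements, and the possible disk write forced by displacing a dirty flash page is ignored) checks whether admitting an evicted page pays off for a given re-reference probability.

    # Illustrative cost check for admitting an evicted page to the flash cache.
    # Costs are assumed per-page latencies in milliseconds (placeholders).
    C_DISK_READ, C_DISK_WRITE = 2.5, 2.9
    C_FLASH_READ, C_FLASH_WRITE = 0.05, 0.2

    def worth_caching(dirty: bool, p_rehit: float) -> bool:
        """Return True if caching the evicted page is expected to be cheaper than the
        alternative (discarding a clean page, or immediately writing a dirty page to disk)."""
        if dirty:
            # Write-back: a flash write now (plus one deferred disk write later)
            # versus an immediate disk write; caching is essentially always cheaper.
            return C_FLASH_WRITE < C_DISK_WRITE
        # Clean page: pay a flash write now; save a disk read (replaced by a flash read)
        # only if the page is re-referenced while it stays in the flash cache.
        expected_saving = p_rehit * (C_DISK_READ - C_FLASH_READ)
        return expected_saving > C_FLASH_WRITE

    print(worth_caching(dirty=True, p_rehit=0.0))    # True
    print(worth_caching(dirty=False, p_rehit=0.02))  # False: 0.02 * 2.45 = 0.049 < 0.2
    print(worth_caching(dirty=False, p_rehit=0.30))  # True:  0.30 * 2.45 = 0.735 > 0.2

Under these assumed costs, dirty pages always clear the bar, while clean pages do so only when they are reasonably likely to be re-referenced, which is the asymmetry that motivates prioritizing dirty pages.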

3.3 The FaCE System

In a traditional buffer management system, a dirty flag is maintained for each page kept in the buffer pool to indicate whether the page has been updated since it was fetched from disk.


Figure 2: Multi-Version FIFO Flash Cache (on eviction from the DRAM buffer, a page is enqueued to the mvFIFO queue if fdirty = 1 or it is not yet in the flash cache, invalidating any previous version; on eviction from the flash cache, a page is flushed to disk if valid = 1 and dirty = 1, and discarded otherwise)

Using a single flag per page in the buffer is sufficient, because there exist no more than two versions of a data page in the database system at any moment. In the FaCE system, however, a data page can reside in the flash cache as well as in the DRAM buffer and on disk. Therefore, the number of different copies of a page may be more than two, not to mention the different versions of data pages that may be maintained in the flash cache by the mvFIFO replacement.

We introduce another flag called flash dirty (or fdirty for short) to represent the state of a data page accurately. Much like a dirty flag is set on for a data page updated in the DRAM buffer, an fdirty flag is set on for a buffered data page that is newer than its corresponding flash-resident copy. With the two flags playing different roles, the FaCE system can determine what action needs to be taken when a page is evicted from the DRAM buffer or from the flash cache. This will be explained in detail shortly.

Both the dirty and fdirty flags are reset to zero when a page is fetched from disk (because a copy does not exist in the flash cache). They are both set to one when the page is updated in the DRAM buffer. If a page is fetched from the flash cache after being evicted from the DRAM buffer, then the fdirty flag must be reset to zero to indicate that the two copies in the DRAM buffer and the flash cache are synced. However, the dirty flag of this page must remain unaffected, because the copies in the DRAM buffer and the flash cache may still be newer than its disk copy. This requires that while the fdirty flags are needed only for the pages in the DRAM buffer, the dirty flags must be maintained for pages both in the DRAM buffer and in the flash cache.

When a page is evicted from the DRAM buffer, it is enqueued to the flash cache unconditionally if its fdirty flag is on. Otherwise, it is enqueued to the flash cache only when there is no identical copy already in the flash cache. If a copy is enqueued unconditionally due to a raised fdirty flag, it will be the most recent version of the page in the flash cache. Since conditional enqueuing ensures no redundancy in the flash cache, these enqueuing methods – conditional and unconditional – of mvFIFO guarantee that there will be no more than one copy of each distinct version, and that the latest version of a page can be considered the only valid copy among those in the flash cache.

For the convenience of disposing of invalid copies, we maintain a valid flag for each page cached in flash memory. The valid flag is set on for any page entering the flash cache, which in turn invalidates the previous version in the flash cache so that there exists only a single valid copy at any moment in time. When a page is dequeued from the flash cache, it is flushed to disk only if its dirty and valid flags are both on. Otherwise, it is simply discarded. Note that the valid flags, as well as the dirty and fdirty flags, are maintained as memory-resident metadata. Invalidating a page copy does not incur any I/O operation. The main operations of the mvFIFO replacement are illustrated in Figure 2 and summarized in Algorithm 1.

Algorithm 1: Multi-Version FIFO Replacement

On eviction of page p from the DRAM buffer:
    if p.fdirty = true ∨ p ∉ flash cache then
        invalidate the previous version of p if it exists;
        p is enqueued to the flash cache;
        p.valid ← true;
    endif

On eviction of page p from the flash cache:
    if p.dirty = true ∧ p.valid = true then
        p is written back to disk;
    else
        p is discarded;
    endif

On fetch of page p from disk:
    dirty ← fdirty ← false;

On fetch of page p from the flash cache:
    fdirty ← false;

On update of page p in the DRAM buffer:
    dirty ← fdirty ← true;
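
For concreteness, a minimal executable rendering of Algorithm 1 is given below (an illustrative sketch, not the PostgreSQL implementation; frame management, the hash table, and disk I/O are reduced to in-memory stand-ins, and the class and method names are introduced here).

    from collections import deque

    class MvFIFOFlashCache:
        """Toy model of the mvFIFO flash cache in Algorithm 1 (illustrative only)."""
        def __init__(self, capacity, disk):
            self.capacity = capacity
            self.queue = deque()          # entries: [page_id, page_image, dirty, valid]
            self.latest = {}              # page_id -> its latest (valid) entry
            self.disk = disk              # stand-in for the database on disk

        def on_dram_eviction(self, page_id, page_image, fdirty, dirty):
            if fdirty or page_id not in self.latest:       # unconditional / conditional enqueue
                if page_id in self.latest:
                    self.latest[page_id][3] = False        # invalidate the previous version
                while len(self.queue) >= self.capacity:
                    self._evict_one()
                entry = [page_id, page_image, dirty, True] # valid <- true
                self.queue.append(entry)                   # sequential, append-only write
                self.latest[page_id] = entry

        def _evict_one(self):
            page_id, image, dirty, valid = self.queue.popleft()
            if dirty and valid:
                self.disk[page_id] = image                 # deferred (single) disk write
            if valid:
                del self.latest[page_id]                   # older invalid copies are just dropped

        def fetch(self, page_id):
            """Return (image, came_from_flash); the caller resets fdirty on a flash hit."""
            if page_id in self.latest:
                return self.latest[page_id][1], True
            return self.disk[page_id], False

    # Usage: evict a dirty page twice; only one (deferred) disk write can ever happen for it.
    disk = {1: "v0"}
    cache = MvFIFOFlashCache(capacity=2, disk=disk)
    cache.on_dram_eviction(1, "v1", fdirty=True, dirty=True)
    cache.on_dram_eviction(1, "v2", fdirty=True, dirty=True)
    cache.on_dram_eviction(2, "v0", fdirty=False, dirty=False)  # forces stage-out of old v1
    print(cache.fetch(1)[0], disk[1])   # v2 v0  (v1 was invalidated, so it never reached disk)

The usage trace illustrates the key property of mvFIFO: superseded versions are invalidated in memory and simply dropped at dequeue time, so repeated evictions of the same dirty page cost sequential flash writes rather than repeated disk writes.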

Group Second Chance (GSC)

The Second Chance replacement is a variant of the FIFO replacement and is generally considered superior to its basic form. The same idea of a second chance can be adopted for the mvFIFO replacement. With a second chance given to a valid page being dequeued from the flash cache, if the page has been referenced while staying in the flash cache, it will be enqueued back instead of being discarded or flushed to disk.

For a DRAM buffer pool, the second chance replacement is implemented by associating a reference flag with each page and by having a clock hand point to a page being considered for a second chance. Since it does not involve copying or moving pages around, the second chance replacement can be adopted with only a little additional cost for setting and resetting reference flags. In contrast, for the flash cache being managed by the mvFIFO replacement, it is desired that data pages are physically dequeued from and enqueued to flash memory for the sake of efficient sequential I/O operations.

The negative aspect of this, however, is the increased I/O activity, as dequeuing and enqueuing a page require two I/O operations to the flash cache. This will be further aggravated if more than a few valid (and referenced) pages need to be examined by the second chance replacement before a victim page is found in the flash cache. It is an ironic situation, because the more the pages in the flash cache are hit by references or utilized, the higher the cost of a replacement is likely to grow.

To address this concern, we propose a novel Group Second Chance (GSC) replacement for the flash cache. When a page evicted from the DRAM buffer is about to enter the flash cache, a replacement is triggered to make space for the page. The pages at the front end of the flash cache will be scanned to find a victim page, but group second chance limits the scan depth so that the replacement cost is bounded. Though the scan depth can be set to any small constant, it makes practical sense to set it to no more than the number of pages (typically 64 or 128) in a flash memory block.

All the pages within the scan depth are dequeued from the flash cache in a batch. Following the basic mvFIFO replacement, the pages in the batch are either discarded or flushed to disk if their reference flags are down. Then, the remaining pages will be enqueued back to the flash cache. In a rare case where all the pages in the batch are referenced, the page at the very front end will be discarded or flushed to disk in order to make space for an incoming page from the DRAM buffer. In a more typical case, the number of pages to be enqueued back will be much smaller than the scan depth. In that case, more pages are pulled from the LRU tail of the DRAM buffer to fill up the space in the batch. This ensures that a dequeuing or enqueuing operation will be carried out by a single batch-sized I/O operation, much more infrequently than being done for individual pages.
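
The batch mechanics can be sketched as follows (illustrative only; entries are reduced to (page_id, dirty, referenced) tuples, the valid-flag bookkeeping of mvFIFO is elided, and the batched flash I/O is modeled by a single dequeue/enqueue of the whole group).

    from collections import deque

    # Sketch of Group Second Chance (GSC) over a FIFO queue of flash-cache entries.
    # disk is a set of page_ids "written back"; dram_tail supplies filler pages.
    def group_second_chance(flash_queue, dram_tail, disk, scan_depth=4):
        batch = [flash_queue.popleft() for _ in range(min(scan_depth, len(flash_queue)))]
        keep = []
        for page_id, dirty, referenced in batch:
            if referenced:
                keep.append((page_id, dirty, False))   # second chance, reference flag cleared
            elif dirty:
                disk.add(page_id)                      # stage out to disk
            # clean, unreferenced pages are simply discarded
        if len(keep) == len(batch) and keep:           # rare case: every page was referenced
            page_id, dirty, _ = keep.pop(0)            # give up the frontmost page anyway
            if dirty:
                disk.add(page_id)
        while len(keep) < len(batch) and dram_tail:    # top up the batch from the DRAM LRU tail
            keep.append(dram_tail.pop(0))
        flash_queue.extend(keep)                       # one batch-sized sequential enqueue
        return flash_queue

    q = deque([(1, True, True), (2, False, False), (3, True, False), (4, False, True), (5, True, True)])
    disk, tail = set(), [(9, True, False), (8, False, False)]
    group_second_chance(q, tail, disk)
    print(list(q), disk)   # pages 1 and 4 get a second chance; page 3 is written to disk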

Pulling page frames from the DRAM buffer is analogous to what is done by the background writeback daemons of the Linux kernel or the DBWR processes of the Oracle database server. The Linux writeback daemons wake up when free memory shrinks below a threshold and write dirty pages back to disk to free memory. The Oracle DBWR processes perform batch writes to make clean buffer frames available. We expect that the effect of pulling page frames on the hit rates of the DRAM buffer and the flash cache is negligible, either positively or negatively.

Note that page replacement in the flash cache can also be done with batch I/O operations but without second chances. We refer to this approach as Group Replacement (GR) and compare it with Group Second Chance in Section 5 to better understand the performance impact of the two optimizations separately.

4. RECOVERY IN FaCE

When a system failure happens, the database must be recovered to a consistent state such that the atomicity and durability of transactions are ensured. Two fundamental principles for database recovery are write-ahead logging and commit-time force-write of the log tail. The FaCE system is no different in that these two principles are strictly applied for database recovery.

When a dirty page is evicted from the DRAM buffer, all of its log records are written to a stable log device in advance. As far as the durability of transactions is concerned, once a dirty page is written to either the flash cache or disk, the page is considered propagated persistently to the database, as the flash memory drive used as a cache extension is non-volatile. The non-volatility of flash memory guarantees that it is always possible for the FaCE system to recover the latest copies from the flash cache after a system failure. Therefore, FaCE utilizes data pages stored in the flash cache to serve dual purposes, namely, cache extension and recovery.

In this section, we present the recovery mechanism of FaCE that achieves transaction atomicity and durability at a nominal cost and recovers the database system from a failure quickly by utilizing the data pages that survived in the flash cache.

4.1 Checkpointing the Flash Cache

If a database system crashes while operating with FaCE enabled, there is an additional consideration beyond the standard restart steps to restore the database to a consistent state. Some of the data pages stored in the flash cache may be newer than the corresponding disk copies, and the database consistency can only be restored by using those flash copies. Even if they were fully synced, the flash copies should be preferred to the disk copies, because the flash copies would help the system restart more quickly.

The only problem in doing this, though, is that we must guarantee that the information about data pages stored in the flash cache survives a system failure. Otherwise, with that information lost, the flash copies of data pages will be inaccessible when the system restarts, and the database may not be restored to a consistent state. Of course, it is not impossible to restore the metadata by analyzing the entire flash cache, but it would be an extremely time-consuming process. One practice common in most database systems is to include additional metadata (e.g., file and block identification numbers and pageLSN) in the individual page header [9]. However, adding information about whether a page is cached in flash to the page header is not an option either, because it would incur too much additional disk I/O to update the disk-resident page header whenever a page enters or leaves the flash cache.

One remedy, proposed by temperature-aware caching (TAC) [2], is to maintain the metadata persistently in flash memory and keep it current. The metadata are maintained in a data structure called a slot directory that contains one entry for each data page stored in the flash cache. TAC uses flash memory as a write-through cache and relies on an invalidation-validation mechanism to keep the flash cache consistent with disk at all times. One obvious drawback of this approach is that an entry in the slot directory needs to be updated for each page entering the flash cache. This overhead is not trivial, because updating an entry requires two additional random flash writes, one for invalidation and another for validation.

In principle, this burden will be shared by any LRU-based flash caching method that maintains metadata persistently in flash memory. This is because it needs to update an entry in the metadata directory for each page being replaced in the flash cache, incurring just as much overhead as TAC does. This overhead will be significant and unavoidable, because the LRU replacement selects any page in the flash cache for replacement and, consequently, updating metadata entries will have to be done by random write operations.

Figure 3: Metadata Checkpointing in FaCE (a RAM-resident metadata directory and a persistent metadata directory; 24-byte entries holding page_id, pageLSN, and a flag are written in sequence, and the current segment of 64,000 entries is checkpointed to the flash cache as one unit)

Fortunately, however, FaCE has an effective way of dealing with this overhead, since it relies on the mvFIFO replacement instead of LRU. In a similar way to how a database log tail is maintained, metadata changes are collected in memory and written persistently to flash memory in a single large segment. This type of metadata management is feasible with FaCE because a data page entering the flash cache is always written to the rear in chronological order, and so is its metadata entry to the directory. Therefore, metadata updates will be done much more efficiently than updating individual entries (typically tens of bytes each). This process is illustrated in Figure 3. We call this metadata flushing operation flash cache checkpointing, as saving the metadata persistently (about one and a half MBytes each time in the current implementation of FaCE) is analogous to database checkpointing. The flash cache checkpointing is triggered independently of the database checkpointing.

Of course, the memory-resident portion of the metadata directory will be lost upon a system failure, but it can be restored quickly by scanning only a small portion of the flash cache that corresponds to the most recent segment of metadata. Recovering the metadata is described in more detail in the next section.
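
The segment-based metadata flush can be pictured with the following sketch (illustrative only; the 24-byte entry layout and the 64,000-entry segment size follow Figure 3, while the class name and the BytesIO device are stand-ins for the raw-device I/O in the real system).

    import io
    import struct

    SEGMENT_ENTRIES = 64_000           # entries per metadata segment (Figure 3)
    ENTRY_FMT = "<QQQ"                 # page_id, pageLSN, flag: 3 x 8 bytes = 24 bytes

    class FlashCacheMetadata:
        """In-memory metadata directory with segment-at-a-time persistence (sketch)."""
        def __init__(self, flash_device):
            self.flash_device = flash_device      # stand-in for the raw flash device
            self.current_segment = []             # RAM-resident entries not yet persisted

        def on_enqueue(self, page_id, page_lsn, flag):
            # Pages enter the flash cache in FIFO order, so their metadata entries
            # can simply be appended, like a log tail.
            self.current_segment.append(struct.pack(ENTRY_FMT, page_id, page_lsn, flag))
            if len(self.current_segment) >= SEGMENT_ENTRIES:
                self._flash_cache_checkpoint()

        def _flash_cache_checkpoint(self):
            # One large sequential write (about 1.5 MB for 64,000 x 24-byte entries).
            segment = b"".join(self.current_segment)
            self.flash_device.write(segment)
            self.current_segment = []

    # Usage with an in-memory stand-in for the persistent directory area:
    meta = FlashCacheMetadata(io.BytesIO())
    for i in range(SEGMENT_ENTRIES):
        meta.on_enqueue(page_id=i, page_lsn=1000 + i, flag=1)
    print(meta.flash_device.getbuffer().nbytes)   # 1536000 bytes, about 1.5 MB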

4.2 Database Restart

When a database system restarts after a failure, the first thing to do is to restore the metadata directory of the flash cache. While the vast majority of metadata entries are stored in flash memory and survive a failure, the entries in the current segment are resident in memory and lost upon a failure. To restore the current segment of the metadata directory, the data pages whose metadata entries were lost need to be fetched from the flash cache. Those data pages can be found at the rear end of the flash cache, which is maintained as a circular queue. The front and rear pointers are maintained persistently in the flash cache. The data pages in the flash cache contain all the necessary information, such as page id and pageLSN, in their page headers.

In theory, the metadata directory could be restored by fetching only the data pages belonging to the latest segment from the flash cache. However, this would require the database system to be quiesced while flushing the current segment of metadata is in progress, and its negative impact on performance would not be negligible. In the current implementation of FaCE, we allow a new metadata entry to enter the current segment in memory even when the previous segment is being flushed to flash memory. Considering the fact that a failure can happen in the midst of flushing metadata, the current implementation of FaCE restores the metadata directory by fetching the data pages belonging to the two most recent segments of the directory from the flash cache. This way we avoid quiescing the database system and improve its throughput at a minimally increased cost of restart. In fact, as will be shown in Section 5.5, FaCE can shorten the overall restart time considerably, because most of the recovery can be carried out by utilizing data pages cached persistently in flash memory.
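
A rough sketch of this restart-time restoration follows (illustrative only; page headers are modeled as (page_id, pageLSN) tuples, and slicing the last two segments' worth of pages stands in for the raw flash reads performed at restart).

    # Sketch: rebuild the lost tail of the metadata directory after a crash.
    # persistent_entries: metadata already stored in flash segments (survives the crash).
    # flash_pages: data pages in the circular flash cache queue, front..rear order,
    # each modeled as (page_id, page_lsn) taken from its page header.
    def restore_metadata(persistent_entries, flash_pages, segment_entries):
        directory = dict(persistent_entries)              # page_id -> pageLSN (survived part)
        # A failure may hit in the middle of a segment flush, so rescan the pages
        # covered by the two most recent segments and re-derive their entries.
        tail = flash_pages[-2 * segment_entries:]
        for page_id, page_lsn in tail:
            if page_id not in directory or page_lsn >= directory[page_id]:
                directory[page_id] = page_lsn             # newest version wins
        return directory

    # Usage with tiny numbers (segments of 4 entries instead of 64,000):
    survived = [(1, 100), (2, 110)]
    pages = [(1, 100), (2, 110), (3, 120), (4, 130), (2, 140), (5, 150), (6, 160), (7, 170)]
    print(restore_metadata(survived, pages, segment_entries=4))
    # {1: 100, 2: 140, 3: 120, 4: 130, 5: 150, 6: 160, 7: 170}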

5. PERFORMANCE EVALUATION

We have implemented the FaCE system in the PostgreSQL open source database server to demonstrate its effectiveness as a flash cache extension for database workloads. TPC-C benchmark tests were carried out on a hardware platform equipped with a RAID-0 array of enterprise class 15k-RPM disk drives and SLC-type and MLC-type flash memory SSDs.

5.1 Prototype Implementation

The FaCE system has been implemented as an addition to the buffer manager, recovery and checkpointing modules of PostgreSQL. The most relevant functions in the buffer manager module are bufferAlloc and getFreeBuffer. bufferAlloc is called upon DRAM buffer misses, and it invokes getFreeBuffer to get a victim page in the DRAM buffer and to flush it to the database if necessary. We have modified these functions to incorporate the mvFIFO replacement and the optimization strategies of FaCE in the buffer manager. Specifically, the modified bufferAlloc is now responsible for the sync and replacement methods, and the modified getFreeBuffer deals with caching clean and/or dirty pages. For database checkpointing, we have modified the bufferSync function, which is called from createCheckPoint, so that all the dirty pages in the DRAM buffer are checked in to the flash cache instead of disk. A new recovery module for the mapping metadata has been added between the PostgreSQL modules initBufferPool and startupXLOG.

Besides, we have added a few data structures to manage page frames in the flash cache. Among those are a directory of metadata maintained for all the data pages cached in flash memory and a hash table that maps a page id to a frame in the flash cache. There is an entry for each page cached in flash memory. Each entry stores its page id, frame id, dirty flag and LSN. Both the metadata directory and the hash table are resident in memory, but a persistent copy of the former is maintained in flash memory as well for recovery purposes. (See Section 4.1 for details.) Since the hash table is not persistent, it is rebuilt from the metadata when the database system restarts.

5.2 Experimental Setup

The TPC-C benchmark is a mixture of read-only and update-intensive transactions that simulate the activities found in OLTP applications. The benchmark tests were carried out by running PostgreSQL with FaCE enabled on a Linux system with a 2.8GHz Intel Core i7-860 processor and 4GB DRAM. This computing platform was equipped with an MLC-based SSD (Samsung 470 Series 256GB), an SLC-based SSD (Intel X25-E 32GB), and a RAID-0 disk array with eight drives. The RAID controller was an Intel RS2WG160 with a PCIe 2.0 interface and 512MB cache, and the disk drives were enterprise class 15k-RPM Seagate ST3146356SS drives with 146.8GB capacity each and a Serial Attached SCSI (SAS) interface.

The database size was set to approximately 50GB (scale of 500, approximately 59GB including indexes and other data files), and the DRAM buffer pool was limited to 200MB in order to amplify I/O effects for a relatively small database. The capacity of the flash cache was varied between 2GB and 14GB, so that the flash cache was larger than the DRAM buffer but smaller than the database. The number of concurrent clients was set to 50, which was the highest level of concurrency achieved on the computing platform before hitting the scalability bottleneck due to contention [10]. The page size of PostgreSQL was 4 KBytes. The benchmark database and workload were created by the BenchmarkSQL tool [14].

For steady-state behaviors, all performance measurements were taken after the flash cache had been fully populated with data pages. Both the flash memory and disk drives were bound as raw devices, and the 'Direct IO' flag was set in PostgreSQL when opening data files so that interference from data caching by the operating system was minimized.

5.3 Transaction Throughput

In this section, we analyze the performance impact of FaCE with respect to transaction throughput as well as hit rate and I/O utilization. To demonstrate its effectiveness, we compare FaCE (and its optimization strategies, GR and GSC, presented in Section 3.3) with the Lazy Cleaning (LC) method [6], which is one of the most recent flash caching strategies based on LRU replacement. Both LC and FaCE were built into the PostgreSQL database server; they cache pages – clean or dirty – when they are evicted from the DRAM buffer, and implement the write-back policy by copying dirty pages to disk when they are evicted from the flash cache.

Read Hit Rate and Write Reduction

Table 3 compares LC and FaCE with respect to read hits and write reductions observed in the TPC-C benchmark. As the size of the flash cache increased, both the hit rates and write reductions increased in all cases, because it became increasingly probable for warm pages to be referenced again, overwritten (by LC) or invalidated (by FaCE) before they were staged out of the flash cache.

Not surprisingly, the hit rate and write reduction of LC were approximately 2 to 9 percent higher than those of FaCE consistently over the entire range of the flash cache sizes tested. This is because, when the flash cache is managed by LC, it keeps no more than a single copy of each cached page, and the copy is always up-to-date. Thus, LC could utilize the space of the flash cache more effectively than FaCE, which may store more than a single copy of each cached page. For example, when the flash cache of 8GB was managed by FaCE, the portion of duplicate pages was as high as 30 to 40 percent.

Flash cache size†            2GB    4GB    6GB    8GB    10GB
(Measured in %)
LC                           72.9   80.0   83.7   87.0   89.3
FaCE                         65.5   72.6   76.4   78.6   80.5
FaCE+GR                      65.5   72.6   76.2   78.6   80.4
FaCE+GSC                     69.7   76.6   79.8   82.1   83.7

(a) Ratio of flash cache hits to all DRAM misses

Flash cache size†            2GB    4GB    6GB    8GB    10GB
(Measured in %)
LC                           51.8   62.1   68.8   74.0   78.6
FaCE                         46.3   54.8   60.1   62.8   65.0
FaCE+GR                      46.3   55.3   59.7   62.7   65.4
FaCE+GSC                     50.2   59.9   65.9   70.4   73.9

(b) Ratio of flash cache writes to all dirty evictions

†Database size: 50GB

Table 3: Read Hit and Write Reduction Rates

Despite the high ratio of duplicate pages, however, as Table 3 shows, the hit rate of FaCE was lower than that of LC by no more than 10 percent, owing to the locality of references in the TPC-C workloads. Another point to note is that the group second chance (GSC) improved not only read hits but also write reductions for FaCE by giving dirty pages a second chance to stay and to be invalidated in the flash cache rather than being flushed to disk.
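The C sketch below illustrates one plausible reading of this behavior for a group of frames taken from the mvFIFO tail. The actual GSC policy is defined in Section 3.3; the identifiers and the single-recycle rule used here are assumptions made only for illustration.

/* Hedged sketch of a group-second-chance pass over the mvFIFO tail;
 * identifiers and the recycle rule are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define GROUP_SIZE 4

typedef struct {
    int  page_id;
    bool dirty;
    bool valid;          /* false once a newer version enters the flash cache */
    bool second_chance;  /* already recycled once */
} FlashFrame;

/* Stand-ins for the real I/O paths. */
static void write_back_to_disk(const FlashFrame *f) { printf("disk <- page %d\n", f->page_id); }
static void reenqueue_at_head(FlashFrame *f)        { printf("mvFIFO head <- page %d\n", f->page_id); }

/* Process one group of frames taken from the FIFO tail. */
static void group_second_chance(FlashFrame *group, int n)
{
    for (int i = 0; i < n; i++) {
        FlashFrame *f = &group[i];
        if (!f->valid || !f->dirty) {
            continue;                    /* stale or clean copy: simply dropped */
        }
        if (!f->second_chance) {
            f->second_chance = true;     /* recycle once, hoping for invalidation */
            reenqueue_at_head(f);
        } else {
            write_back_to_disk(f);       /* second pass: flush to disk and drop */
        }
    }
}

int main(void)
{
    FlashFrame group[GROUP_SIZE] = {
        { 1, true,  true,  false },      /* dirty, valid     -> gets a second chance */
        { 2, true,  false, false },      /* invalidated      -> dropped              */
        { 3, false, true,  false },      /* clean            -> dropped              */
        { 4, true,  true,  true  },      /* already recycled -> written back         */
    };
    group_second_chance(group, GROUP_SIZE);
    return 0;
}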

Utilization and Throughput

Flash cache size†            2GB    4GB    6GB    8GB    10GB
(Measured in %)
LC                           92.6   96.4   97.7   98.2   98.1
FaCE                         65.6   73.7   78.9   82.7   84.9
FaCE+GR                      51.6   62.5   67.7   70.0   69.6
FaCE+GSC                     60.9   68.0   70.9   74.7   75.9

(a) Device-level utilization of the flash cache

Flash cache size†            2GB    4GB    6GB    8GB    10GB
(Measured in IOPS)
LC                           4534   4226   3849   3362   3370
FaCE                         4973   5870   6479   7019   7415
FaCE+GR                      7213   8474   9390   9848   10693
FaCE+GSC                     11098  12208  13031  13871  14678

(b) Throughput of 4KB-page I/O operations

†Database size: 50GB

Table 4: Utilization and I/O Throughput

Keeping a single copy of each cached page by LRU often requires overwriting an existing copy – either an old version or a victim page – in the flash cache. Overwriting a page in the flash cache will most likely incur a random flash write. Table 4 compares LC and FaCE with respect to average utilization and I/O throughput of the flash cache. Table 4(a) shows that LC raised the device saturation of the flash cache quickly to 92 percent or higher. In contrast, the flash cache device was never saturated beyond 76 percent by FaCE with the GSC optimization.

Table 4(b) compares LC and FaCE with respect to throughput measured in 4KB page I/O operations carried out per second.


[Figure 4: Transaction Throughput: LC vs. FaCE – transactions per minute vs. |Flash cache|/|Database| (%); (a) MLC SSD (Samsung 470), (b) SLC SSD (Intel X25-E); curves for HDD only, SSD only, FaCE+GSC, FaCE+GR, FaCE, and LC]

The results clearly reflect the difference between LC and FaCE in the device saturation of the flash cache. FaCE with group second chance (GSC) processed I/O operations more efficiently than LC by more than a factor of four when the size of the flash cache was 10GB. This was because the write operations were dominantly random under LC while they were dominantly sequential under FaCE. More importantly, as the size of the flash cache increased, the I/O throughput of FaCE improved consistently and considerably while that of LC deteriorated. This reflects the common trend that writes become more random as the data region they span is extended [12].

Impacts on Transaction Throughput

Figure 4 compares LC and FaCE with respect to transaction throughput measured in the number of transactions processed per minute (tpmC). In order to analyze the effect of different types of flash memory SSDs, we used both MLC-type (Samsung 470) and SLC-type (Intel X25-E) SSDs in the experiments. (Refer to Table 1 for the characteristics of the SSDs.) Besides, in order to understand the scope of performance impact by flash caching, we included the cases where the database was stored entirely on either a flash memory SSD (denoted by SSD-only) or a disk array (denoted by HDD-only).2

Figure 4(a) shows the transaction throughput obtained by using an MLC SSD as a flash caching device (or as a main storage medium for the SSD-only case). Under the LC method, the transaction throughput remained flat without any significant improvements with increases in the size of the flash cache. This is because the MLC SSD was already utilized at its maximum (93 to 98 percent as shown in Table 4), and the saturation of the flash caching device quickly became a performance bottleneck as more data pages were stored in the flash cache. Under the FaCE method, in contrast, the utilization of the flash caching device was consistently below 85 percent. As the size of the flash cache increased, the transaction throughput continued to improve over the entire range without being limited by the I/O throughput of the flash caching drive.

With an SLC SSD, as shown in Figure 4(b), an almost identical trend was observed in transaction throughput under FaCE.

2For the SSD-only case with an SLC SSD, the database size had to be reduced by 20 percent – 400 warehouses instead of 500 – due to the limited capacity. 64GB was the largest capacity of the X25-E at the time of the experiments and was not large enough to store a 50GB database, its indexes and other related data.

The LC method, on the other hand, improved transaction throughput and reduced the performance gap considerably. This was due to the higher random write throughput of the SLC SSD. With the group second chance optimization, however, FaCE still outperformed LC by at least 25 percent. As pointed out in Section 2, the bandwidth disparity between random writes and sequential writes still exists even in the SLC type of flash memory SSDs, and FaCE took better advantage of it by turning random writes into sequential ones.

For the same reason, by utilizing a small flash cache, FaCE achieved higher transaction throughput even than SSD-only, which had to deal with a considerable amount of random writes. In the case of an MLC SSD, FaCE outperformed SSD-only by almost a factor of three. This striking result demonstrates the cost-effectiveness of FaCE in terms of return on investment.

In summary, while LC achieved higher hit rates and reduced the amount of traffic to the disk drives more effectively, FaCE was superior to LC in utilizing the flash cache efficiently for higher I/O throughput. Despite the trade-off, it turned out that the benefit from the higher I/O bandwidth of the flash cache outweighed the benefit from reduced disk I/O operations. Overall, FaCE outperformed LC considerably in transaction throughput irrespective of the type of flash memory device used.

5.4 Cost Effectiveness and Scalability

This section evaluates empirically the cost effectiveness of the flash cache and its impact on the scalable throughput of a disk array. These two aspects of flash caching have not been studied in the existing work.

5.4.1 More DRAM or More Flash

A database buffer pool is used to replace slow disk accesses with faster DRAM accesses for frequently accessed database items. In general, investing resources in the DRAM buffer is an effective means of improving the overall performance of a database system, because a larger DRAM buffer pool will reduce page faults and increase the throughput. As the size of a DRAM buffer increases, however, the return on investment will not be sustained indefinitely and will eventually saturate after passing a certain threshold. For example, with TPC-C workloads, the DRAM buffer hit rate is known to reach a knee point when its size is quite a small fraction of the database size due to skewed data accesses [18].

As the analysis given in Section 2.2 indicates, the return on investment is likely to be much higher with flash memory than with DRAM, given the persistent and widening price gap between them.


                         200MB DRAM or 2GB Flash added
(Measured in tpmC)       x1     x2     x3     x4     x5
More DRAM                2061   2353   2501   2705   2843
More Flash               3681   4310   4830   5161   5570

Table 5: More DRAM vs. More Flash

To demonstrate the cost effectiveness of flash memory as a cache extension, we measured the performance gain that would be obtained by making the same amount of monetary investment in DRAM and flash memory.

Assuming that the cost per gigabyte of DRAM is approximately ten times higher than that of MLC-type flash memory [7], we evaluated the cost effectiveness by measuring the throughput increment obtained from each 2GB of flash memory or, alternatively, each 200MB of DRAM added to the basic system configuration described in Section 5.2. As shown in Table 5, the transaction throughput (or return on investment) was consistently higher by a wide margin when more resources were directed to flash memory rather than DRAM. Flash caching was disabled in the first row of the table, and FaCE+GSC was used in the second row.
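In concrete terms, the assumed 10:1 price-per-gigabyte ratio makes each 2GB flash increment roughly cost-equivalent to each 200MB DRAM increment, and the figures in Table 5 then imply nearly a twofold throughput advantage per unit of investment at the largest increment:

\[
\frac{\mathrm{cost}(2\,\mathrm{GB\ flash})}{\mathrm{cost}(200\,\mathrm{MB\ DRAM})}
\approx \frac{2\,\mathrm{GB} \times 1}{0.2\,\mathrm{GB} \times 10} = 1,
\qquad
\frac{\mathrm{tpmC}(\mathrm{More\ Flash},\ {\times}5)}{\mathrm{tpmC}(\mathrm{More\ DRAM},\ {\times}5)}
= \frac{5570}{2843} \approx 1.96 .
\]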

5.4.2 Scale-Up with More Disks

No matter how large a DRAM buffer or a flash cache is, there will be cache misses if neither is as large as the entire database. For that reason, the I/O throughput of disk drives will always remain on the critical path of most database operations. For example, when a data page is about to enter the flash cache, another page may have to be staged out of the flash cache if it is already full. In this case, a page evicted from the DRAM buffer will end up causing a disk write followed by a flash write. This acutely points out the fact that the system throughput may not be improved up to its full potential by the flash cache alone without addressing the bottleneck (i.e., disks) on the critical path.

[Figure 5: Effect of a Disk Array Size – transactions per minute vs. number of RAIDed HDDs; curves for FaCE+GSC, LC, and HDD only]

Using a disk array is a popular way of increasing disk throughput (measured in IOPS). Figure 5 compares LC and FaCE with respect to transaction throughput measured with a varying number of disk drives from four to sixteen. (For this test, we added eight more disk drives to the basic system configuration described in Section 5.2.) The flash cache size was set to 6GB for both LC and FaCE, and HDD-only was also included in the scalability comparison. The same database was distributed across all available disk drives.

For FaCE and HDD-only, transaction throughput increased consistently with the increase in the number of disk drives. This confirms that disk drives were on the critical path and increasing disk throughput was among the keys to improving the transaction throughput. It also confirms that, with FaCE, the flash cache was never a limiting factor in transaction throughput until the number of disk drives was increased to sixteen, at which point the transaction throughput started saturating. In contrast, the transaction throughput of LC did not scale and improve at all beyond eight disk drives, and became even worse than that of HDD-only when sixteen disk drives were used.

5.5 Performance of Recovery

In order to evaluate the recovery system of FaCE, we crashed the PostgreSQL server using the Linux kill command and measured the time taken to restart the system. When database checkpointing was turned on, the kill command was issued at the mid-point of a checkpoint interval. For example, if the checkpoint interval was 180 seconds, the kill command was issued 90 seconds after the most recent checkpoint. In the case of FaCE, the GSC optimization was enabled for the flash cache management. The flash cache size was set to 4GB.

Checkpoint intervals
(Measured in seconds)        60     120    180
FaCE+GSC                     93     118    188
HDD only                     604    786    823

Table 6: Time Taken to Restart the System

Table 6 presents the average recovery time taken by the system when it was run with checkpoint intervals of 60, 120, and 180 seconds. For each checkpoint interval, we took the average of the restart times measured from five separate runs. Across all three checkpoint intervals, FaCE reduced the restart time considerably – by 77 to 85 percent – over the system without flash caching. Such a significant reduction in restart time was possible because the recovery could be carried out by utilizing data pages cached persistently in flash memory. We observed in our experiments that more than 98 percent of the data pages required for recovery were fetched from the flash cache instead of disk.

The restart times given in Table 6 for FaCE include the time taken to restore the metadata directory, which was approximately 2.5 seconds on average regardless of the checkpoint interval. The metadata directory can be restored by fetching the persistent portion of the directory from flash memory and by scanning as many data pages as two segments' worth of metadata entries from the flash cache. The latter is required to rebuild the most recent segment of the directory. In our experiments, the persistent portion of the metadata directory was 21 MBytes and the amount of data pages to read from the flash cache was 512 MBytes.3 The sequential read bandwidth of the flash memory SSD used in our experiments was high enough – at least 250 MB/sec – to finish this task within 2.5 seconds on average.
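These figures are consistent with the segment parameters given in footnote 3 (64,000 entries of 24 Bytes per segment, 16 segments for a 4GB flash cache, of which 14 are read as metadata and 2 are rebuilt by scanning their data pages):

\[
\underbrace{14 \times 64{,}000 \times 24\,\mathrm{B}}_{\approx\, 21\,\mathrm{MB\ of\ metadata}}
\;+\;
\underbrace{2 \times 64{,}000 \times 4\,\mathrm{KB}}_{=\, 512\,\mathrm{MB\ of\ data\ pages}}
\;\Longrightarrow\;
\frac{21\,\mathrm{MB} + 512\,\mathrm{MB}}{250\,\mathrm{MB/s}} \approx 2.1\,\mathrm{s},
\]

which fits within the 2.5 seconds observed on average.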

Figure 6 shows the time-varying transaction throughput measured immediately after the system was recovered from a failure.

3Each segment contains 64,000 metadata entries of 24 Bytes each. Among the 16 segments required for a 4GB flash cache, 14 of them are fetched directly from flash memory and the rest are rebuilt from the data page headers.


[Figure 6: Transaction Throughput after Restart – transaction throughput (tpmC) over time (sec) for FaCE+GSC recovery and HDD recovery]

This figure clearly demonstrates that, when FaCE was enabled, the system resumed normal transaction processing much more quickly and maintained higher transaction throughput at all times. The checkpoint interval was set to 180 seconds in this experiment.

6. CONCLUSION

This paper presents a low-overhead caching method called FaCE that utilizes flash memory as an extension to a DRAM buffer for a recoverable database. FaCE caches data pages in flash memory on exit from the DRAM buffer. By basing its caching decision solely on the DRAM buffer replacement, the flash cache is capable of sustaining high hit rates without incurring excessive run-time overheads for monitoring access patterns, identifying hot and cold data items, and migrating them between flash memory and disk drives. We have implemented FaCE and its optimization strategies within the PostgreSQL open source database server, and demonstrated that FaCE achieves a significant improvement in the transaction throughput.

We have also made a few important observations about the effectiveness of FaCE as a flash caching method. First, FaCE demonstrates that adding flash memory as a cache extension is more cost effective than increasing the size of a DRAM buffer. Second, the optimization strategies (i.e., GR and GSC) of FaCE indicate that turning small random writes into large sequential ones is critical to maximizing the I/O throughput of a flash caching device so as to achieve scalable transaction throughput. Third, the mvFIFO replacement of FaCE enables efficient and persistent management of the metadata directory for the flash cache, and allows more sustainable I/O performance for higher transaction throughput. Fourth, FaCE takes advantage of the non-volatility of flash memory to minimize the recovery overhead and accelerate the system restart from a failure. Since most data pages needed during the recovery phase tend to be found in the flash cache, the recovery time can be shortened significantly.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2011-0026492, No. 2011-0027613), and the IT R&D program of MKE/KEIT (10041244, SmartTV 2.0 Software Platform).

7. REFERENCES

[1] L. A. Belady, R. A. Nelson, and G. S. Shedler. An Anomaly in Space-Time Characteristics of Certain Programs Running in a Paging Machine. Communications of the ACM, 12(6):349–353, 1969.

[2] B. Bhattacharjee, K. A. Ross, C. Lang, G. A. Mihaila, and M. Banikazemi. Enhancing Recovery Using an SSD Buffer Pool Extension. In DaMoN, pages 10–16, 2011.

[3] M. Canim, G. A. Mihaila, B. Bhattacharjee, K. A. Ross, and C. A. Lang. An Object Placement Advisor for DB2 Using Solid State Storage. PVLDB, 2(2):1318–1329, 2009.

[4] M. Canim, G. A. Mihaila, B. Bhattacharjee, K. A. Ross, and C. A. Lang. SSD Bufferpool Extensions for Database Systems. PVLDB, 3(2):1435–1446, 2010.

[5] F. Chen, R. Lee, and X. Zhang. Essential Roles of Exploiting Internal Parallelism of Flash Memory Based Solid State Drives in High-Speed Data Processing. In HPCA, pages 266–277, 2011.

[6] J. Do, D. Zhang, J. M. Patel, D. J. DeWitt, J. F. Naughton, and A. Halverson. Turbocharging DBMS Buffer Pool Using SSDs. In SIGMOD, pages 1113–1124, June 2011.

[7] DramExchange. Price Quotes. http://www.dramexchange.com/.

[8] J. Gray and B. Fitzgerald. Flash Disk Opportunity for Server Applications. ACM Queue, 6(4):18–23, July/August 2008.

[9] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, Inc., 1993.

[10] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: A Scalable Storage Manager for the Multicore Era. In EDBT, pages 24–35, March 2009.

[11] I. Koltsidas and S. D. Viglas. Flashing Up the Storage Layer. PVLDB, 1(1):514–525, 2008.

[12] S.-W. Lee, B. Moon, and C. Park. Advances in Flash Memory SSD Technology for Enterprise Database Applications. In SIGMOD, pages 863–870, June 2009.

[13] S.-W. Lee, B. Moon, C. Park, J.-M. Kim, and S.-W. Kim. A Case for Flash Memory SSD in Enterprise Database Applications. In SIGMOD, pages 1075–1086, June 2008.

[14] D. Lussier and S. Martin. The BenchmarkSQL Project. http://benchmarksql.sourceforge.net.

[15] Oracle Corp. ORION: Oracle I/O Test Tool. http://goo.gl/OQO7U.

[16] Oracle Corp. Oracle Database Concepts 11g Rel. 2. Feb 2010.

[17] Oracle Corp. Oracle TPC Benchmark C Full Disclosure Report, December 2010.

[18] T.-F. Tsuei, A. N. Packer, and K.-T. Ko. Database Buffer Size Investigation for OLTP Workloads. In SIGMOD, pages 112–122, May 1997.

[19] D. L. Willick, D. L. Eager, and R. B. Bunt. Disk Cache Replacement Policies for Network Fileservers. In Distributed Computing Systems, pages 2–11, May 1993.

[20] Y. Zhou, J. F. Philbin, and K. Li. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches. In Proceedings of the USENIX Annual Technical Conference, pages 91–104, June 2001.
