
Enhancing Data Cache Performance via Dynamic Allocation

George Murillo, Scott Noel, Joshua Robinson, Paul Willmann

Rice University
6100 Main Street

Houston, TX 77005, USA
{jmurillo, scnoel, jpr, willmann}@rice.edu

Abstract

As process technologies get smaller, the mounting problem of wire delay is becoming a pervasive challenge in the microprocessor architecture community. To avoid sacrificing the seemingly required benefits of a larger, more associative level 1 (L1) data cache, architects have opted for an L1 data cache with a multiple-cycle latency. Largely overlooked, however, are the data reference patterns of modern applications that exhibit predictable, read-and-throw-away behavior and do not benefit from traditional large, associative caches. We examine a new cache hierarchy that achieves equivalent or better levels of data locality using lower-delay elements that will scale better with new process technologies. Furthermore, we show that our proposed architecture achieves up to 16.7% better performance than existing architectures with larger structures when real-world latency is applied, and that it provides miss rates comparable to those of much larger, more complex, and therefore slower architectures.

1 Introduction

Over the past few years, the use of general purpose processors to run computationally intense multimedia applications has increased. In addition to multimedia-only applications, general purpose applications are starting to include multimedia features. This led us to analyze multimedia processors and digital signal processors (DSPs) in order to identify features that could be applied to general purpose processors to improve performance. The inherent nature of media-only applications, however, is strikingly different from that of general purpose business and scientific programs. With respect to the memory subsystem, media-only applications usually do not make good use of traditional caches due to the limited amount of temporal locality and general reuse that can be extracted from them. This is primarily a result of repetitive read-and-throw-away algorithms in which large amounts of data are processed once and rarely, if ever, revisited. Such data rarely benefits from a large, flexible cache. Media processors and DSPs typically have small, simple caches (if they have caches at all) that are often software-managed [1] in order to maximize their efficiency by controlling what is actually cached, and to provide deterministic behavior.

1.1 Hypothesis

In this paper we propose a new cache management policy based on the algorithms and access patterns found in many media processing streams. We propose a special cache sidebuffer where we will place data that we anticipate to have little temporal locality; this data will not be allocated in the main L1 data cache. The architecture will perform dynamic runtime analysis of loads to determine whether a load should be allocated in L1 or instead placed in storage outside L1 (the sidebuffer). The implementation and careful management of the sidebuffer and L1 data cache will reduce the overall (cache and sidebuffer combined) L1 miss rate by removing repeating, read-only accesses that would otherwise pollute the L1 cache by evicting needed data. We believe that our configuration will perform similarly to unmodified configurations that have larger caches, and that it will perform better than configurations with equal-sized caches.

In Section 2, we review the background and motivation for our work. In Section 3, we present the Dynamic Cache Allocation architecture. In Section 4, we review our experimental implementation and evaluation process. In Section 5, we present our results and analysis. In Section 6, we discuss related work, and in Section 7 we present our concluding remarks.


2 Background

Before proposing a solution to the hypothetical problem of inefficient cache management given the changing behavior of general purpose applications, we first needed to establish the validity of the problem as posed. If valid, we needed to derive a solution that could not only outperform existing architectures but would also scale in an environment of increasing relative wire delay. It would thus serve as a performance boost and a means to deal with the growing scaling problems computer architects currently face. We present the supposition of the problems faced and our evaluation of their scientific basis here.

2.1 Motivation

We believe that the use of general purpose processors to process media applications and media-like workloads will increase during the coming years. One of the major problems general purpose processors currently suffer is data cache efficiency, expressed as a miss rate; architects have tried to reduce cache misses by increasing the size of the cache and its associativity. However, some workloads (such as media workloads) make very poor use of caches and thus may pollute the cache by introducing large quantities of 'useless' data (lacking significant temporal locality) into the cache, evicting more useful data. Data that is not reused does not benefit from being in the cache, and moreover its presence is harmful to the rest of the cached data. However, we foresee that future microprocessors will in fact have smaller, less associative caches because large associative caches are slower in absolute terms. Relative to processor clock speed (which continues to increase), wire delay is increasing, and thus even existing caches are getting slower from one generation of processor to the next. This is anecdotally exhibited in the transition from IBM's single-cycle POWER3-II L1 data cache [11] to the four-cycle store, two-cycle load latency in POWER4 [4]. Thus, increasing size and associativity is an infeasible option. Rather, we believe that researching and improving cache management policies will yield designs that are more future-ready.

Analyzing and leveraging the behavior of media applications could increase performance in general purpose processors, both for media applications and for regular applications. Data allocated in the cache that will only be read and thrown away can cause cache pollution, because (if repeated across a wide range of data addresses) it may evict data from the cache that could be reused many times in the future. Since a large portion of the data in some media applications tends to fit this classification, data is constantly ejected from general-purpose caches and miss rates increase. This problem will be exacerbated as more workloads with similar behavior are run on a general purpose processor.

2.2 Initial Verification Testing

In order to first determine that modern programs actually do exhibit the type of cache pollution we are suggesting in this paper, we ran memory traces of several benchmark programs. We wanted to determine what percentage of program memory accesses, especially for programs in the SPEC benchmark suite and media programs, are a read-once type of access with no write back. Our tests determined whether a memory access was read-once by noting whether there was more than one load to the same memory address, in both the forward and backward directions in the trace, without a store to that address between the loads.

Figure 1 shows the percentage of memory references, expressed as a percentage of the total number of memory references including loads and stores, that we would like to target for special allocation. A reference qualifies if, during a given examination window, there are at least two other loads from that cache line, and during that window there are no writes to that line. Our results show that these types of accesses range anywhere from 8% for the equake benchmark to 50% for the mcf benchmark, assuming a window size of 8 memory accesses in each direction from the memory access being analyzed. Note that on average 24% of the memory accesses over all of the memory traces we ran exhibited the read-once behavior we were looking for. We felt that this was a high enough percentage to warrant targeting these accesses for allocation in our sidebuffer rather than the L1 cache.
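The qualification rule above can be made concrete with a small trace-analysis sketch. This is an illustrative reconstruction, not the authors' tool: the trace record format, the 32-byte line size, and all function names are assumptions; only the windowed qualification criterion and the +/-8 window of Figure 1 come from the text.

    /*
     * Sketch (assumptions noted above): a load qualifies if, within a window
     * of WINDOW accesses on either side, at least two other loads touch the
     * same cache line and no store touches that line.
     */
    #define LINE_SHIFT 5            /* assumed 32-byte cache lines           */
    #define WINDOW     8            /* +/- 8 accesses, as in Figure 1        */

    typedef struct { int is_store; unsigned long addr; } mem_ref_t;

    static int qualifies(const mem_ref_t *trace, long n, long i)
    {
        unsigned long line = trace[i].addr >> LINE_SHIFT;
        long lo = (i - WINDOW < 0) ? 0 : i - WINDOW;
        long hi = (i + WINDOW >= n) ? n - 1 : i + WINDOW;
        int other_loads = 0;

        if (trace[i].is_store)
            return 0;               /* only loads can qualify                */

        for (long j = lo; j <= hi; j++) {
            if (j == i || (trace[j].addr >> LINE_SHIFT) != line)
                continue;
            if (trace[j].is_store)
                return 0;           /* a nearby store to the line disqualifies it */
            other_loads++;
        }
        return other_loads >= 2;    /* repetitive: at least two other loads  */
    }

    /* Fraction reported in Figure 1: qualifying references / all references. */
    static double fraction_qualifying(const mem_ref_t *trace, long n)
    {
        long q = 0;
        for (long i = 0; i < n; i++)
            q += qualifies(trace, n, i);
        return (n > 0) ? (double)q / (double)n : 0.0;
    }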

We not only wanted to test whether we had a large number of memory accesses of this qualifying type, but we also wanted to make sure that these types of accesses were well dispersed throughout the cache. Wide distribution would suggest that cache pollution is more likely, due to more conflicting addresses, and thus that our sidebuffer would be more effective in increasing performance (assuming there are memory references available to take advantage of the less-polluted cache). A qualitative look at plots of the number of qualifying accesses versus cache set address (see Figure 2 for an example) appeared very promising. Figure 2 demonstrates that if we were to target these qualifying accesses and not allocate them in the cache, we would be affecting the full spectrum of cache locations as opposed to only a small corner of the cache.


Figure 1: Percentage of Qualifying Accesses

As more addresses are affected, it is more likely that these references will conflict with something else, especially in a direct-mapped cache.

Figure 2: gzip-log, Window Size ±8

Though substantial spikes exist, there are well-dispersed accesses throughout the cache, and the absolute numbers of these qualifying accesses (even for the 'small' entries in the graph) are nontrivial. Programs like gzip and mcf showed especially promising results: the qualifying accesses occurred almost equally across nearly every set address. Thus the cache is probably becoming widely polluted by allowing these accesses, which do not need to go in the L1, to allocate in L1. (We cannot guarantee a 'polluted' state unless we can guarantee that useful data is being evicted.) Placing these accesses in the sidebuffer rather than the L1 should address the problem.

The last testing we did was to verify that these qualifying accesses are predictable. It is not beneficial to know that the accesses are polluting the cache if we cannot dynamically predict that they are occurring and remove them from the cache. To check this, we looked at the dispersion of the static instructions, by program counter (PC), for all of the memory accesses. Graphing the number of qualifying accesses versus the PC showed that most of the traces we ran mapped the memory accesses to very few PCs. In most cases, there were around 700 distinct PCs. Moreover, an even smaller number of PCs accounted for most of these memory accesses, so memory accesses tend to clump around certain PCs. Thus, if we know which PCs the qualifying accesses occur at, we can use the PC to predict when we have a qualifying memory access that should not be allocated in the cache. See Figure 3 for an example of this PC behavior.

Figure 3: equake PC Analysis, ±128 References
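The dispersion check itself can be sketched as a per-PC tally whose (PC, count) pairs produce plots like Figure 3. This is only an illustrative reconstruction; the fixed array bound and the names are assumptions, and qualifying references are assumed to have already been identified by the window test sketched earlier.

    #include <stdio.h>

    #define MAX_PCS 4096                 /* assumed bound on distinct load PCs */

    typedef struct { unsigned long pc; long count; } pc_count_t;

    static pc_count_t pc_table[MAX_PCS];
    static int        n_pcs;

    /* Record one qualifying reference issued by the static load at `pc`. */
    static void tally_qualifying(unsigned long pc)
    {
        for (int i = 0; i < n_pcs; i++)
            if (pc_table[i].pc == pc) { pc_table[i].count++; return; }
        if (n_pcs < MAX_PCS) {
            pc_table[n_pcs].pc    = pc;
            pc_table[n_pcs].count = 1;
            n_pcs++;
        }
    }

    /* Dump (PC, count) pairs; a handful of PCs should dominate the counts. */
    static void dump_pc_histogram(void)
    {
        for (int i = 0; i < n_pcs; i++)
            printf("%#lx %ld\n", pc_table[i].pc, pc_table[i].count);
    }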


3 Architecture

Through extensive analysis of the runtime memory reference patterns of SPEC2000 benchmarks, we established that there exists significant opportunity to use runtime-available information to classify load instructions into two classes: those that should be allocated a line in the regular cache, and those that should be held aside due to state information that suggests the particular line will not be rewritten. Here, we present an architecture that implements this runtime analysis in such a way as to not lengthen the critical data path.

3.1 Mapping a Solution

Our analysis suggests that the references that classify as 'read-only' (e.g. no writes are encountered within a span of plus or minus 32 memory references) and repetitive (at least three loads from that line) are generated by only a very few static instructions, as established in Section 2. This suggests that a simple finite state machine (FSM) can be used to make a prediction about whether a given load instruction is part of a large read-only reference pattern. This FSM is implemented in our Load History Table (LHT). Furthermore, our analysis suggests that increasing the number of distinct cache lines that can be outstanding in a read-only reference pattern reduces early eviction before a line is finished being used. These lines are stored in the sidebuffer, which attempts to capture read-only spatial locality. These structures are laid out with the rest of the processor core in Figure 4. Throughout our architecture, we focus on simplicity as defined by wire-delay-imposed access time rather than raw transistor count. As technology trends continue to scale smaller, transistor availability will increasingly become a non-issue and performance will be dominated by wire-delay effects.

3.2 Load History Table

The Load History Table works, in part, analogously to a traditional branch history table. Upon encountering a load memory reference, the PC of that instruction is used to index a direct-mapped table of history bits. The history bits function as threshold counters. Since, in this case, the instruction is a load, the count is increased. When there is a store, the only input taken into consideration by the LHT is the reference destination address. Stores follow a different path to update state, as described below; the result is that the history bits associated with the load to the line now being stored are decremented. Once the threshold value is met (which we set to eight with four history bits), any new load data for that PC is filled from L2 into the sidebuffer rather than into the L1 cache. However, if the cache line being referenced is already in L1 (that is, we did not cross a cache line boundary since the previous load), the cache line stays in L1 and is not moved. Only newly allocated lines are steered by the LHT, for simplicity. In a direct-mapped L1, there would be no benefit to evicting a line from L1 to the sidebuffer. However, in an associative L1, it may be beneficial in corner cases to move the line to the sidebuffer in order to prevent the later eviction of 'normal' read-write L1 data.

In order to maintain correct history information, the LHT must also be responsible for maintaining an association between each cache line in the L1 data cache and the PC that 'owns' that cache line. A PC owns a cache line if it was the most recent instruction to load that line. This maintenance is used solely for mapping cache lines to PCs so that, in the case of a store, the LHT can look up the correct history entry to decrement. So, unlike a load, the LHT uses only the store address as input and then determines the relationship of that address to a PC. For simplicity and presumed speed in a physical implementation, we used direct-mapped L1 data caches for our implementation, so the mapping from reference address to PC is a simple table lookup (where the table has as many entries as the L1 data cache has lines). Support for associativity is straightforward to implement but deviates from our goal of reducing latency wherever possible. The LHT layout is pictured in Figure 5.
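A minimal sketch of the LHT update and steering logic described above follows. The 1024-entry, 4-bit configuration and threshold of eight come from the text; the index hash, the assumed L1 line count (and thus owner-table size), counter saturation, and the choice to update ownership on every load are assumptions made only for this sketch.

    #include <stdint.h>

    #define LHT_ENTRIES 1024
    #define HIST_BITS   4
    #define THRESHOLD   (1u << (HIST_BITS - 1))   /* MSB flip => 8            */
    #define HIST_MAX    ((1u << HIST_BITS) - 1)
    #define L1_LINES    256                       /* assumed L1 line count    */
    #define LINE_SHIFT  5                         /* assumed 32-byte lines    */

    static uint8_t  history[LHT_ENTRIES];   /* per-PC threshold counters            */
    static uint32_t owner_pc[L1_LINES];     /* PC that most recently loaded a line  */

    static unsigned lht_index(uint32_t pc)  { return (pc >> 2) % LHT_ENTRIES; }
    static unsigned l1_index(uint32_t addr) { return (addr >> LINE_SHIFT) % L1_LINES; }

    typedef enum { ALLOC_L1, ALLOC_SIDEBUFFER } alloc_t;

    /* Load: bump the PC's counter, record ownership, and steer a new fill. */
    alloc_t lht_on_load(uint32_t pc, uint32_t addr)
    {
        unsigned i = lht_index(pc);
        if (history[i] < HIST_MAX)
            history[i]++;
        owner_pc[l1_index(addr)] = pc;
        /* Steering applies only to newly allocated lines; a line already
           resident in L1 simply stays there.                               */
        return (history[i] >= THRESHOLD) ? ALLOC_SIDEBUFFER : ALLOC_L1;
    }

    /* Store: only the address is available, so look up the owning PC through
       the owner table and decrement its history counter.                   */
    void lht_on_store(uint32_t addr)
    {
        unsigned i = lht_index(owner_pc[l1_index(addr)]);
        if (history[i] > 0)
            history[i]--;
    }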

3.3 Sidebuffer

The sidebuffer serves as storage for read-only lines. It is implemented as a small (8 to 32 lines) fully associative cache with first-in-first-out (FIFO) replacement. Least-recently-used (LRU) replacement would be desirable to evaluate, but the simulation tools we used (Simplescalar 3.0) do not implement an LRU policy. However, due to the nontraditional locality behavior (e.g. spatial locality with very little temporal locality) of the references we are targeting for storage in the sidebuffer, a simpler replacement policy (such as FIFO) may prove just as effective.

The sidebuffer receives data from the L2 cache upon a fill (when steered by the LHT according to history data). The sidebuffer is also read-only; thus, there is no outgoing path to L2 from the sidebuffer. When the sidebuffer encounters a store reference that matches one of the lines inside it, the sidebuffer invalidates that line automatically. The sidebuffer always reports a miss on store references, and since the data is not in L1, the L2 is triggered as if there were an L1 miss.


Figure 4: Dynamic Cache Allocation Architecture

The LHT then steers the fill data to L1 after the L2 access latency has expired. We could have optimized the datapath to support transfer of lines from the sidebuffer to L1 to avoid the L2 access penalty, but for simplicity in this initial implementation, we did not. A side effect of this is that the data in L1 and the sidebuffer are mutually exclusive. The only overhead required to maintain this condition is the requirement that the sidebuffer invalidate a line if a write goes to it. If a line in the sidebuffer is invalidated, we say it has been misallocated. Due to the mutually exclusive relationship between data in L1 and the sidebuffer, the selection logic for the mux in Figure 4 is extraordinarily simple: after a cache miss and upon completion of the line fill, the data will hit in one and only one of those memories.
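The sidebuffer's bookkeeping can be summarized with the sketch below, assuming the 16-line, 32-byte configuration evaluated later. Structure names and the tag representation are illustrative, and the data storage and L2 interface are omitted; only the FIFO replacement, read-only behavior, store-triggered invalidation, and L1/sidebuffer mutual exclusion follow the text.

    #include <stdint.h>
    #include <stdbool.h>

    #define SB_LINES   16
    #define LINE_SHIFT 5

    typedef struct {
        bool     valid;
        uint32_t line;               /* full line address (fully associative) */
    } sb_entry_t;

    static sb_entry_t sb[SB_LINES];
    static unsigned   sb_next;       /* FIFO replacement pointer              */

    static uint32_t line_of(uint32_t addr) { return addr >> LINE_SHIFT; }

    /* Load lookup: hits if any valid entry matches the line address. */
    bool sb_load_hit(uint32_t addr)
    {
        for (unsigned i = 0; i < SB_LINES; i++)
            if (sb[i].valid && sb[i].line == line_of(addr))
                return true;
        return false;
    }

    /* Fill from L2, used only when the LHT steers a new line here. */
    void sb_fill(uint32_t addr)
    {
        sb[sb_next].valid = true;
        sb[sb_next].line  = line_of(addr);
        sb_next = (sb_next + 1) % SB_LINES;
    }

    /* Stores always miss here; a matching line is invalidated (a
       "misallocation"), and the resulting L2 access refills the line into
       L1, preserving L1/sidebuffer mutual exclusion.                       */
    void sb_store(uint32_t addr)
    {
        for (unsigned i = 0; i < SB_LINES; i++)
            if (sb[i].valid && sb[i].line == line_of(addr))
                sb[i].valid = false;
    }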

4 Implementation

We developed two new architectural structures, the load history table (LHT) and sidebuffer, in order to exploit the memory behavior we identified. In order to determine the effectiveness of these additions, we integrated them into the superscalar architecture simulated by Simplescalar. We evaluated cache miss rates and IPC numbers using sim-outorder. By varying the parameters of both structures using convenient command line options, we explored the design space to pinpoint the most effective use of transistor resources (i.e. the points of diminishing returns) within the realistic bounds of the desired single-cycle delay. Once we determined the optimal LHT and sidebuffer parameters, we compared simulated results using these configurations with results from an unmodified architecture.

4.1 Applications

The memory reference behavior we targeted in our experiments shows up to varying degrees in all programs. In some programs, such as media processing applications, the pattern is very pronounced, while in general purpose programs the patterns appear less predictably. We ran simulations using both general purpose applications (i.e. SPECint and SPECfp) and multimedia applications from the MediaBench suite.

4.2 Exploring the Design Space

Before comparing our modified architecture to an unmodified architecture, we first evaluated how to optimally configure our new structures. We determined the most effective design in terms of transistor count and latency before doing a full performance analysis.


Figure 5: Internal LHT Architecture

Since we intended the inclusion of our changes to affect the hit rate of the level 1 data cache system, we adjusted our parameters with the single goal of minimizing the miss rate. For this part of the study, we integrated our proposed changes into sim-cache to quickly evaluate L1 data cache miss rates.

Between the two structures (the LHT and sidebuffer), there are four parameters that influence the combined level-one miss rate. The LHT has N direct-mapped entries, each with B bits. Increasing N has the effect of reducing the number of conflicts in the table, thereby reducing the number of misallocations to the sidebuffer due to aliasing. The other LHT parameter is B, the number of bits per entry, which corresponds to the granularity of the confidence counters. More bits allow more history to be captured for a particular PC, which eventually translates into better prediction and fewer combined L1 misses due to fewer misallocations in the sidebuffer. Figure 6 shows the effect of different numbers of history bits on the combined L1/sidebuffer miss rate for the gzip-graphic benchmark. This uses a fixed sidebuffer size of 16 entries.

For each data point, the threshold is set to the value at which the most significant bit flips (with B history bits, a threshold of 2^(B-1); with four bits, eight). If the threshold is too high, our allocation policy into the cache may be too conservative.

Figure 6: Effect of History Size on Combined L1/Sidebuffer Miss Rate (gzip-graphic)


Therefore, fewer lines will be allocated into the sidebuffer, and we will limit the benefits of our architecture. This is exhibited in Figure 6 with five history bits. Too few history bits (and thus overly aggressive allocation into the sidebuffer) is detrimental when sidebuffer invalidation happens too often and the L2 load penalty is incurred.

The sidebuffer is implemented as a standard cache structure in Simplescalar, so it has the standard cache parameters. To maximize the hit rate in the sidebuffer, we decided early on to make it a fully associative structure. If wire latencies continue to constrain performance, we will eventually want to explore making the sidebuffer set-associative to decrease the access latency. We examined implementations with few (4 to 32) lines in the sidebuffer, so a fully associative implementation seemed achievable. We assumed a base LRU replacement policy in Simplescalar, though this is currently implemented as FIFO in the Simplescalar 3.0 tools. Evaluating a simpler replacement policy, as discussed above, may be valuable.

Other than the associativity and replacement policy, we are left with two cache parameters for the sidebuffer to examine. First, we varied the sidebuffer block size (line size) and found that doubling the L1 block size gave us a small benefit in reducing the miss rate (presumably because of spatial locality). The benefit was small enough that we decided to avoid the hardware complexity of working with two different L1 line sizes and instead matched the sidebuffer block size to the L1 block size.

The last sidebuffer parameter is the number of lines. Obviously, the greater the number of lines, the better the hit rate of the sidebuffer. But we were not free to make this structure unreasonably large, due to our requirement that it be accessible in one cycle. This is crucial since the sidebuffer, like the L1 data cache, is on the critical path of all load instructions (albeit in parallel with the L1 data cache). Keeping this in mind, we only explored values that we feel would result in a sidebuffer with a one-cycle access time. Figure 7 shows the effect of varying the number of sidebuffer entries when running the gzip-graphic benchmark with four history bits and a threshold value of eight.

4.3 Experimental Parameters

We examined a variety of applications from differentcomputing domains on a base architecture simulated bySimplescalar 3.0.

4.3.1 Host Architecture

The final values we chose as optimal for the LHT are a 1024-entry table with four bits per entry, resulting in a 512-byte structure.

Figure 7: Effect of Sidebuffer Size on Combined L1/Sidebuffer Miss Rate (gzip-graphic)

For the sidebuffer parameters, we determined a 16-line, 32-byte-block cache to be optimal; this corresponds to our estimate of the upper bound for a one-cycle-accessible structure. As wire delay effects increasingly dominate in the future and the sidebuffer becomes too slow, our preliminary data suggests we should first reduce associativity and then reduce capacity. The Simplescalar 3.0 sim-outorder defaults we used (when integrated with our changes) are as depicted in Figure 8. It is noteworthy that, in keeping with Simplescalar defaults, all cache latencies are fully pipelined, and cache operations are lockup-free with an unlimited number of outstanding references allowed. Unless otherwise specified, the following parameters are constant for all simulations.

Figure 8: Base Architecture Parameters


4.3.2 Applications Examined

We simulated our modified architecture versus several comparable, unmodified architectures on a wide range of applications. We ran a subset of the SPEC2000 [2] suite, including the integer applications mcf, vpr, and gzip (run on a graphic image). We also ran one program from the SPECfp suite: equake. All of these programs were run using the reduced input datasets because of limited computation resources. We also fast-forwarded the simulations past the first million instructions because we are not interested in the performance of the programs while they are initializing. The integer applications were chosen to represent the domain of general purpose applications, while the floating-point applications were chosen to represent scientific applications.

In addition to SPEC programs, we also ran simulations on a set of applications from the MediaBench [6] suite. These programs included a PCM audio compressor and decompressor, EPIC (image compression), a G721 compression and decompression program, and an MPEG2 encoder and decoder. The MediaBench simulations used the standard datasets because these were small enough to keep the run time of our simulations reasonable. For these programs we fast-forwarded through the first hundred thousand instructions. The duration of these benchmarks was considerably shorter than that of SPEC.

4.4 Effectiveness of LHT and Sidebuffer

With an optimal LHT and sidebuffer size, we then began cycle-accurate performance simulations using sim-outorder in the Simplescalar suite. We were interested in how well our system performed against the baseline with different sizes of L1 data cache. We ran our experiments over a range of small cache sizes because we believe that, as wire delays continue to increase relative to transistor switching time, more and more microprocessors will move to a smaller L1 cache to maintain a one-cycle hit time. As part of our hypothesis, we expect our modified architecture to perform better and better as the L1 data cache is scaled down in size and associativity.

4.5 Comparison Architectures

For baseline comparisons, we compare against a data cache hierarchy that is identical to Dynamic Allocation in size, latency, and associativity, but lacks a sidebuffer and dynamic management. This should provide some insight into the benefit Dynamic Allocation alone provides, though such a comparison is slightly to our advantage because it ignores the disparity in transistor count. In emerging architectures with larger and larger transistor budgets, we are far less concerned with transistor count than with latency, which we have worked to minimize through the use of small structures. Thus, the latency overhead of Dynamic Allocation should be minimal.

The current choice among some architects for the use of the increasing transistor budget is to increase the size of the L1 while accepting a two-cycle access time, hoping the increased hit rate will offset the loss. State-of-the-art processors are currently forced to deal with this tradeoff. As such, we compared against data cache hierarchies similar to those in current high-performance computers. For our medium configuration, we used size, associativity, and latency parameters similar to those in the Intel Itanium [5]. For the large configuration, we used parameters similar to those in the IBM POWER4 [4]. We assumed a one-cycle access time for our configuration due to our careful analysis of the datapath for bottlenecks and the simplification of all serial-access, long-delay units in our architecture.

5 Experimental Results

5.1 SPEC2000 Performance

Figure 9 shows the combined L1 miss rate of Dynamic Allocation versus that of the traditional configurations. Miss rate here refers to the combined miss rate across the sidebuffer and the L1 data cache. Dynamic Allocation mostly achieves miss rates comparable to those of far larger (8 and 16 times larger for the medium and large configurations, respectively) caches with much more associativity, and it drastically outperforms the miss rate of the baseline configuration with the same L1 cache size.

Figure 9: SPEC2000 L1 Data Cache Miss Rates
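Stated explicitly (our paraphrase of the definition above, not a formula from the paper), the combined miss rate plotted in Figure 9 counts a reference as a hit if it hits in either structure:

    combined miss rate = (references that miss in both the L1 data cache and the sidebuffer)
                         / (total data references)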


While the overall cache size for Dynamic Allocation is larger by 16 cache lines (the size of the sidebuffer), this does not account for such a pronounced difference in miss rate. These benchmarks clearly benefit from the isolation and separation of the types of locality that Dynamic Allocation demarcates. Isolating the references with little temporal locality in the sidebuffer leaves the main L1 free for use by the rest of the application. There are enough useful non-sidebuffer references to benefit from the decreased contention in the main L1 data cache that miss rates with this much smaller yet fast architecture are competitive. Of particular note here is that Dynamic Allocation manages to get "in the ballpark" of the miss rates of significantly more complex and larger memory structures for these general-purpose applications.

Figure 10 shows the performance of Dynamic Allocation versus configurations similar to current designs when latency is taken into consideration. Due to latency-hiding techniques in our baseline configuration, such as out-of-order issue, it is not a given that performance will suffer directly according to miss rate [3]. Dynamic Allocation outperforms all other configurations, sometimes by wide margins. Though we did not complete results under all configurations for other SPEC benchmarks, Dynamic Allocation also outperformed all other configurations for ammp, which was the only other SPEC benchmark we ran against the fully configured modified simulator. Clearly, these benchmarks are sensitive to memory access latency. The smallest speedup we see is 2.4% for gzip-graphic versus the medium traditional configuration, while the largest was 16.7% for equake versus a cache 16 times as large. The arithmetic mean speedup was 8.66%.

Figure 10: SPEC2000 Performance Results

While these numbers are impressive by themselves, the important aspect of these results is future scalability. As wire delay becomes even more of an issue, the disparity between the large, associative structures and Dynamic Allocation will become even more pronounced.

Figures 11 and 12 plot the miss rate of gzip-graphic and equake, respectively, as the size of the L1 cache is reduced, for varying sidebuffer sizes and configurations. The '32' and '64' bytes refer to the cache line sizes for those plots. As L1 becomes smaller and smaller, the miss rates for Dynamic Allocation configurations do increase. However, they do not increase as quickly as those of unmodified configurations, even for Dynamic Allocation with very small sidebuffer sizes. Thus, as L1 data caches get smaller, Dynamic Allocation is even more beneficial.

Figure 11: Miss Rate as L1 Decreases - GZIP-graphic

Figure 12: Miss Rate as L1 Decreases - equake


5.2 MediaBench Results

Figure 13 shows the combined (as previously defined) L1 miss rates of Dynamic Allocation versus the traditional configurations described above for the MediaBench suite. Most noteworthy is that the miss rate does not vary significantly with increased cache size beyond the small configuration. This suggests that these types of applications do not benefit greatly from any cache, due to their access patterns and typically large working sets, though they have a minimum amount of locality that must be captured. Dynamic Allocation helps meet this minimum and gets in the neighborhood of the miss rates of the larger, more complex configurations for some applications, such as MPEG2 decode and encode, and G721 encode and decode. However, since most of these applications are in fact small computation kernels that operate only on their incoming and outgoing data, there may be very few (if any) memory references that would not be targeted by the sidebuffer, and thus there may be very few references available to take advantage of the less-conflicted main L1 data cache. While sidebuffer access rates for these benchmarks are higher than for SPEC, that does not imply that the regular L1 accesses are well distributed or that they will benefit from the less-conflicted cache. Furthermore, some applications clearly do not benefit at all from caching (beyond a certain bound), as continuing to increase cache size and associativity does not improve the miss rate. Epic has some unique behavior and is described below.

Figure 13: MediaBench L1 Data Cache Miss Rates

Figure 14 presents the performance of the MediaBench suite with latencies applied as before. As suggested by the miss rate numbers, many MediaBench applications do not benefit greatly from Dynamic Allocation. The arithmetic mean speedup was 4.11%.

Figure 14: MediaBench Performance Results

Unlike the SPEC results, the baseline small traditional configuration with its single-cycle latency performs almost identically to Dynamic Allocation. It is worthwhile to compare MPEG2 decode with MPEG2 encode. MPEG2 decode exhibits cacheable behavior, where adding a larger cache helps decrease misses and thus increases performance. Furthermore, it is quite obvious from the miss rate graph that MPEG2 decode experiences a dramatic improvement in miss rate with Dynamic Allocation. MPEG2 encode, however, has fairly good miss rate behavior even with the smallest cache, and does not improve substantially as decode does. Thus, for applications where the miss rate is nontrivial, Dynamic Allocation provides better performance when latencies are applied to the larger, slower memory hierarchies.

Also of particular note is the Epic benchmark. While Dynamic Allocation did help reduce the miss rate for equal-sized caches, Epic remains a capacity-bound problem, even though Dynamic Allocation attempts to alleviate the capacity-filling effects of many read-only, striding applications. 44.8% of the memory accesses in Epic go to the sidebuffer, but among the references to the regular cache (including the flushes incurred by the sidebuffer), the miss rate is a significant 29.98%. Epic uses a Huffman encoding algorithm with a large internal symbol tree that is sorted in a read-write manner and, moreover, is not in contiguous memory (e.g. it is represented as a node-based structure).


Thus, it is likely that this large read-write structure is organized and referenced in such a way as to make it unwieldy in a very small direct-mapped L1 cache, even after the conflicts introduced by read-only data streams are removed. Increasing the associativity of the small cache reduced miss rates to levels comparable with those of both larger configurations.

6 Related Work

The field of intelligent cache management for the purposes of improving overall performance and efficiency has been explored in parallel, yet different and sometimes complementary, ways. Jouppi [7] proposed reducing conflicts in direct-mapped caches by adding small associative caches such as victim caches. However, victim caches do not address the problem of large working sets of the kind targeted by Dynamic Allocation, which may be larger than the size of the victim cache.

Jouppi also proposed stream buffering on misses, while Baer and Chen [8] extended the idea to use an FSM to predict the references to be filled into the stream buffer. In this vein, the stream buffer can be compared to the functionality provided by the separate storage of the sidebuffer, but there are important distinctions: previous predictive measures worked by expected data stride and ran ahead of the program, rather than performing Dynamic Allocation's per-PC analysis of the current instruction, and they were methods of guessing what should be fetched, thus increasing the L1-to-L2 bandwidth consumed and the pressure on branch prediction accuracy. This is often at the expense of wasted bandwidth. Dynamic Allocation does not address main memory latency; it may increase the bandwidth consumed in rare corner cases when sidebuffer entries must be refetched from L2 if the L2 entry has been evicted (e.g. during an L1 allocation following an attempted write to a sidebuffer line), but this is almost certainly less common than prefetching waste. Analysis of sidebuffer access and L1 reload patterns shows this is in fact extremely rare. Compiler algorithms [9] for prefetching work complementarily to the efforts of Dynamic Allocation since they map to specific instructions in applications.

McFarling [10] proposed an FSM approach to excluding specific lines from allocation in a direct-mapped cache in order to reduce conflicts. His work focuses almost entirely on instruction caching, and the FSM he uses examines state on a per-cache-line basis; it does not consider the static instruction that generated the reference. Furthermore, Dynamic Allocation provides specific management of the 'excluded' (sidebuffer) cache lines and gains performance from the flexibility and size of the sidebuffer. Most significant, however, is that dynamic exclusion gives weight to the data pre-existing in the cache by recording previous hits on given cache lines. While this may have the same effect as Dynamic Allocation in some cases, in others it may be overly pessimistic about the usefulness of incoming data. Dynamic Allocation instead profiles the locality of specific instructions rather than relying on specific cache line state. The Dynamic Allocation FSM is also considerably simpler than McFarling's and does not allow quick thrashing between the sidebuffer and main L1.

Gonzalez et al. [12] proposed splitting the L1 data cache into two separate structures, a dual data cache, to capture spatial and temporal locality separately. This design is significantly more complex than Dynamic Allocation because it allows a cache line to be valid in both caches at once and allows both to be read-write. Furthermore, the locality prediction table they use is very accurate but far more complex than our simple threshold-counting mechanism: it keeps history data about strides, lengths, last address accessed, and state, and updating this is nontrivial in hardware. The overall architecture is unlikely to scale well. Dynamic Allocation, in contrast, makes a much looser approximation about the type of locality and achieves reasonable performance with a much simpler design.

7 Concluding Remarks

Dynamic Cache Allocation is a scalable mechanism for enhancing the performance of small, simple data caches in a wide array of applications. Dynamic Cache Allocation improves data cache miss rates substantially over a direct-mapped cache. It also outperforms the data cache hierarchies of modern-day processors when real-world latencies are applied. Most importantly, as cache size and associativity are reduced in order to keep up with the latency requirements of increasing clock rates, Dynamic Cache Allocation's cache miss rate diverges from the miss rate of standard cache configurations by wider and wider margins. This suggests the benefit of Dynamic Allocation will increase as wire delay in memories continues to increase latency. Dynamic Allocation offers a lower-latency alternative to achieving miss rates comparable to those of larger, more complex caches.

Acknowledgements

We would like to thank Mike Brogioli for directing usto MediaBench and helping us get started with it.


References

[1] Keith D. Cooper and Timothy J. Harvey, "Compiler-Controlled Memory", Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[2] John L. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium", Computer, July 2000.

[3] John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann Publishers, San Francisco, California, 2002.

[4] J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy, "POWER4 System Architecture", IBM Journal of Research and Development, January 2002.

[5] Harsh Sharangpani and Ken Arora, "Itanium Processor Microarchitecture", IEEE Micro, September/October 2000.

[6] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems", Proceedings of the 30th International Symposium on Microarchitecture, December 1997.

[7] Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers", Proceedings of the 17th International Symposium on Computer Architecture, May 1990.

[8] Jean-Loup Baer and Tien-Fu Chen, "An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty", Proceedings of Supercomputing, November 1991.

[9] Todd C. Mowry, Monica S. Lam, and Anoop Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching", Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.

[10] Scott McFarling, "Cache Replacement with Dynamic Exclusion", Proceedings of the 19th International Symposium on Computer Architecture, April 1992.

[11] Bob Amos, Sanjay Deshpande, Mike Mayfield, and Frank O'Connell, "RS/6000 SP 375 MHz POWER3 SMP High Node", IBM Online Whitepaper Repository, http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/nighthawk.html, August 2001.

[12] Antonio Gonzalez, Carlos Aliagas, and Mateo Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality", Proceedings of the 9th International Conference on Supercomputing, July 1995.
