
Analysis and Improvement of Cache Performance for Multimedia Applications

Bo Zhang, Guohui Wang, Jesus Jason Sedano*
Dept. of Computer Science, *Dept. of Electrical & Computer Engineering,

Rice University, Houston, TX
{bozhang, ghwang, jjsedano}@rice.edu

Abstract

Multimedia applications are among the most quickly growing workloads in computer systems and are becoming increasingly important in people's daily lives. They are essentially memory-intensive and time-critical applications. Previous results and our own analysis both confirm that media applications have special properties, such as predictable data access patterns, that make them relatively easy to optimize on general-purpose processors using data prefetching techniques.

In this paper, we first systematically study the cache performance of multimedia applications to characterize their behavior and gain deeper insight into their characteristics. Based on this understanding, we then apply prefetching schemes, under different memory access models, to improve cache performance. Our simulation results show that prefetching is an effective way to improve cache performance in terms of miss rate.

1. Introduction

Multimedia applications are fast becoming one of the dominant workloads for modern computer systems [3]. The growth of Internet use and the demand from the majority of users for graphically based programs will continue to trend upward. Even the simplest tasks now use a complex graphical interface to make them more user friendly, and they often contain many bells and whistles, usually videos, sounds, and fancy graphics.

The real-time constraints of media applications demand a high level of performance density. To meet this requirement, special-purpose media processors, such as Imagine [4], were designed to run media processing efficiently. These media processors are expensive and less flexible, and can only be used effectively in special-purpose systems. However, with the development of the Internet and graphical interfaces, far more media applications now run on general-purpose processors. General-purpose processors offer increased flexibility but achieve performance levels two or three orders of magnitude worse than special-purpose processors [4].

A common explanation for this huge performance gap is that conventional general-purpose architectures are poorly matched to the specific properties of media applications, including little data reuse, high data parallelism, large data sets, and computationally intensive processing. The memory systems of general-purpose architectures depend on caches optimized for reducing latency and exploiting data reuse, and do not efficiently exploit the data parallelism available in media applications [4]. It is therefore interesting to study the cache performance of media applications and, in turn, try to improve the

performance of media applications on general-purpose processors.

In this project, we give a thorough analysis of cache performance for media applications on general-purpose processors. We try to answer the following questions: What is the cache performance of media applications? How do cache configurations impact the cache performance of media applications? What are the memory access patterns of media applications? Based on our analysis, we are able to use a simple prefetch scheme to improve the cache performance of media applications. The remainder of this paper is organized as follows: Section 2 introduces our methodology for studying the cache performance of media applications. Section 3 discusses the analytical results of cache performance. Section 4 introduces our simulation of a prefetch scheme designed to improve the cache performance of media applications. Section 5 discusses the simulation results for prefetching. Section 6 summarizes the paper.

2. Methodology

2.1 Simulation environment

Our study of cache performance is based on the SimpleScalar simulator [1]. We use SimpleScalar Toolset version 3.0 with the PISA architecture. The simulations are run on a SPARC system under the Solaris OS.

2.2 Benchmarks

2.2.1 MediaBench

MediaBench [2] consists of complete applications coded in high-level languages. It includes core algorithms for the most widely used multimedia applications. All the applications are publicly available and widely used on general-purpose processors. We choose a subset of MediaBench, including JPEG, MPEG, ADPCM, MESA, GHOSTSCRIPT, and EPIC. We choose these applications because they cover different types of media application, such as image, video, audio, 3D graphics, document rendering, and compression.

JPEG: JPEG is a standardized image compression mechanism for both full-color and gray-scale images. It can compress an image with little or no noticeable degradation in image quality. We use JPEG decompression; the input file is a 5KB .jpg file and the output .ppm file is 100KB.

MPEG2: MPEG2 is a standard for digital video transmission. We use the mpeg2 decoder in our simulation with a 34KB input file.

EPIC: EPIC is an experimental image compression utility. Its compression algorithms are based on a bi-orthogonal, critically sampled dyadic wavelet decomposition and a combined run-length/Huffman entropy encoder. We compress a 65KB .pgm file with the EPIC program.

ADPCM: ADPCM stands for Adaptive Differential Pulse Code Modulation. It is a family of speech compression and decompression algorithms. A common implementation takes 16-bit linear PCM samples and converts them to 4-bit samples. We use the program "rawdaudio" with a 72KB input adpcm file.

GhostScript: An interpreter for the PostScript language. We use the program "gs" with a 78KB input file, tiger.ps.

Mesa: Mesa is a clone of OpenGL, a commonly used 3-D graphics library.

We use the demo program "mipmap" in our simulation, which demonstrates the use of mipmaps for texture maps.

2.2.2 SPEC2000 Applications

We choose the SPEC2000 [5] benchmark suite to represent traditional programs. It is not easy to collect all the binaries and input files of the SPEC2000 programs; in our simulation, we use Compress, Go, and GCC.

3. Cache Performance of Media Applications

It is important to quantitatively characterize the cache behavior of multimedia applications in order to provide insights for future research in this area. In this section, the cache performance and memory access patterns of a set of representative multimedia benchmarks are analyzed first. We present a clear picture of the memory reference characteristics of multimedia applications with regard to their cache performance and memory access patterns. The cache performance of different multimedia applications under different parameter settings is studied in order to gain a better understanding of their memory access characteristics. For comparison purposes, three benchmarks from SPEC2000 are also studied to show the differences between multimedia and general-purpose applications.

3.1 Simulation setting

Three factors are highly related to cache performance: cache size, associativity, and line size. Therefore, we conducted four sets of simulations to study how these three factors influence L1 data cache performance. All four sets of simulations use LRU as the cache replacement algorithm. The configurations of the four sets of simulations are:

1) Varying cache size from 4KB to 256KB while keeping line size constant at 32B and using direct mapping.
2) Varying cache size from 4KB to 256KB while keeping line size constant at 32B and using 2-way mapping.
3) Varying associativity from 1 to 8 while keeping line size 32B and cache size 16KB.
4) Varying line size from 8 bytes to 64 bytes while keeping cache size 16KB and 2-way mapping.
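These experiments can be reproduced in miniature with a tiny trace-driven cache model. The following is a minimal sketch, not our SimpleScalar setup: the `miss_rate` function, the synthetic sequential trace, and the parameter values are illustrative assumptions.

```python
from collections import OrderedDict

def miss_rate(addresses, cache_size, line_size, assoc):
    """Miss rate of an LRU set-associative cache over an address trace."""
    n_sets = cache_size // (line_size * assoc)
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        s = sets[block % n_sets]
        if block in s:
            s.move_to_end(block)          # LRU update on a hit
        else:
            misses += 1
            if len(s) >= assoc:
                s.popitem(last=False)     # evict the least recently used block
            s[block] = True
    return misses / len(addresses)

# A purely sequential (media-like) trace of 4-byte accesses.
trace = [i * 4 for i in range(20000)]
for size in (4096, 16384, 65536):
    print(size, miss_rate(trace, size, 32, 1))
```

On a streaming trace like this, growing the cache does not help at all (each 32B line is touched once, missed once, giving a miss rate of 0.125 regardless of size), while doubling the line size to 64B halves the miss rate to 0.0625, which mirrors the cache-size and line-size trends discussed in Section 3.2.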

3.2 Simulation results and analysis

In this section, results for each of the above simulations are presented, together with detailed analysis. Throughout this section, we use only the L1 data cache miss rate to represent cache performance.

Figure 1. L1 D-cache performance (one-way, line size: 32 Bytes)

Figure 1 compares the L1 data cache miss rates of the different media benchmarks and SPEC benchmarks as cache size varies from 4KB to 256KB; as described above, we keep the other cache-related parameters constant, that is, 32B line size, direct mapping, and the LRU cache replacement algorithm. A few observations from Figure 1:

1) Miss rate decreases as cache size increases.

2) Generally speaking, multimedia applications have better cache performance than the SPEC2000 benchmarks, which differs from our expectations.
3) Different media applications show very different cache performance. The miss rates of ADPCM and MPEG are very low (~0.1% at a 16KB cache), while Mesa, EPIC, and GhostScript perform much worse.
4) Among the six media benchmarks, EPIC and MESA differ from the others because their miss rates are not very sensitive to cache size. That is, it is hard to improve their miss rates by increasing cache size alone.

Figure 2. L1 D-cache performance (2-way, line size: 32 Bytes)

Figure 2 shows the corresponding miss rate comparison generated by simulation set 2. The same observations can be made in Figure 2. The miss rates in Figure 2 are somewhat lower than those in Figure 1, which is not surprising.

Figure 3. L1 D-cache performance (varying associativity)

Now we fix the cache size and study how the other factors, associativity and line size, affect cache performance, so that we can gain deeper insight into how to improve the cache performance of multimedia applications. Figure 3 shows the results for simulation set 3. A big drop in miss rate is observed when associativity increases from 1 to 2, but beyond that, further increases in associativity gain little additional benefit, which implies that associativity is not an effective factor for improving cache performance.

Figure 4. L1 D-cache performance (varying line size)

We then turn to line size. Figure 4 presents the results of simulation set 4. As we can see, the miss rates of all media

benchmarks except GhostScript drop almost by half when we double the line size. Of course, a larger line size means more memory bandwidth usage. The fact that increasing line size improves miss rate significantly implies that media applications have good locality and a sequential memory access property. GhostScript's miss rate does not decrease much as line size increases, so GhostScript may not have a strong sequential property.

Figure 5. Memory access patterns

To verify our conjecture, we collected simulation traces of the different benchmarks and studied their memory access patterns. Basically, we plotted the address deviation between adjacent memory accesses. Figure 5 shows the memory access patterns of the different benchmarks under the cache configuration 16KB, 32B line size, 2-way associativity. The X axis is time (in cycles) and the Y axis is the address deviation. It can be observed that multimedia applications have more regular memory access patterns than the SPEC2000 benchmarks. As we can see, GhostScript's memory access pattern is fairly random; that is why we cannot improve its cache performance by simply increasing cache size.
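The address deviations plotted in Figure 5 can be computed directly from a trace of data addresses. A minimal sketch (the two traces below are made-up illustrations, not our simulation output):

```python
def address_deviations(trace):
    """Deltas between consecutive data addresses; regular (media-like)
    traces cluster around a few small stride values."""
    return [b - a for a, b in zip(trace, trace[1:])]

# A streaming, media-like trace has an almost constant deviation...
media_like = [1000 + 4 * i for i in range(8)]
print(address_deviations(media_like))    # [4, 4, 4, 4, 4, 4, 4]

# ...while irregular, pointer-chasing accesses do not.
irregular = [1000, 5240, 1016, 9000, 1032]
print(address_deviations(irregular))     # [4240, -4224, 7984, -7968]
```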

In summary, media applications have lower miss rates than traditional programs, but the cache performance of different media applications can vary greatly. ADPCM and MPEG have very low miss rates, while the miss rates of EPIC and Mesa are much higher. All the media applications except GhostScript have regular memory access patterns, and we believe this is due to the following reasons [3]:

1) Most multimedia applications apply block-partitioning algorithms to the input data and work on small blocks of data that easily fit in the cache.
2) Within these blocks, there is significant data reuse as well as spatial locality.
3) A large number of references generated by multimedia applications are to their internal data structures, which are relatively small and can easily fit in reasonably sized caches.

These results lead us to believe that prefetching should be a very promising technique for improving the performance of media applications such as Mesa and EPIC, due to their regular memory access patterns. We therefore hypothesize that by implementing appropriate prefetching schemes in general-purpose processors, we can better leverage the properties of media applications to provide better cache performance. In the next section we describe our experience using prefetching to improve the cache performance of media applications.

4. Prefetch Simulation

4.1 One-block look-ahead prefetching

We first try simple one-block look-ahead prefetching [6] on the general-purpose processor. One-block look-ahead means that, whenever there is a cache miss, we also prefetch the next block into the cache. One-block look-ahead prefetching seems to be

very simple, but our intuition is that this scheme could be a good choice for media applications. Media applications have relatively regular memory access strides, but not constant strides, so it is very hard to accurately predict the stride dynamically at run time. On the other hand, because of the streaming property of media applications, if one memory block is accessed, it is highly likely that the next block will be accessed in the near future.
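As a sketch of the idea (not our SimpleScalar implementation; the toy direct-mapped cache and the synthetic trace are simplifying assumptions), one-block look-ahead can be added to a trace-driven cache model in a few lines:

```python
def run(trace, n_lines, line_size, prefetch):
    """Direct-mapped cache; with prefetch=True, a miss on block b also
    brings block b+1 into the cache (one-block look-ahead [6])."""
    tags = [None] * n_lines
    def install(block):
        tags[block % n_lines] = block
    misses = 0
    for addr in trace:
        block = addr // line_size
        if tags[block % n_lines] != block:
            misses += 1
            install(block)
            if prefetch:
                install(block + 1)   # fetch the next sequential block too
    return misses / len(trace)

stream = [i * 4 for i in range(10000)]   # sequential, media-like accesses
print(run(stream, 512, 32, prefetch=False))
print(run(stream, 512, 32, prefetch=True))
```

On this streaming trace the prefetcher halves the miss rate (0.125 to 0.0625), because every demand miss on block b installs block b+1 before it is needed; this is the best case for the scheme, and irregular traces benefit far less.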

4.2 Simulation assumptions

We implemented one-block look-ahead prefetching in SimpleScalar version 3.0. In our simulations, different assumptions about the memory access model are used:

(1) Pipelined memory model: the pipelined memory model is the one adopted by the SimpleScalar simulator. In this model, the processor can issue memory requests in a pipelined manner and fetch multiple data blocks in parallel. The pipelined memory model implies unlimited memory bandwidth, so in this model the benefit of prefetching is actually overstated.

(2) Sequential memory model [7]: the pipelined memory model is obviously impractical in a real system, so we implemented a sequential memory model in SimpleScalar. This model is the most restrictive one, since it implies that no memory request can be initiated until the previous request has completed. This means that if we place a memory request while another request is still outstanding, we must wait until the data of the outstanding request arrives from memory before the new request starts to be served. Another issue that needs to be considered in this model is: what is the priority of data fetching after cache

misses? In our simulation, we use two policies. The first is the FIFO policy, in which we treat missed blocks and prefetched blocks with the same priority, first come first served. The second is the missed-block-first policy, in which a missed-block request is always served first; only when there is no missed request and the memory bus is free do we process the prefetch blocks in the request queue.
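The two policies amount to different scheduling decisions over the pending request queue. A minimal sketch, assuming requests are tagged as demand misses or prefetches (the names and queue representation are illustrative, not our simulator's internals):

```python
from collections import deque

def next_request(queue, policy):
    """Pick the next memory request to serve from a queue of
    ('miss' | 'prefetch', block) pairs, under the two priority policies."""
    if policy == "fifo":                 # demand misses and prefetches equal
        return queue.popleft()
    # missed-block-first: serve any pending demand miss before prefetches
    for i, (kind, block) in enumerate(queue):
        if kind == "miss":
            del queue[i]
            return ("miss", block)
    return queue.popleft()               # only prefetches left

q = deque([("prefetch", 7), ("miss", 3), ("prefetch", 8)])
print(next_request(q.copy(), "fifo"))          # ('prefetch', 7)
print(next_request(q.copy(), "miss_first"))    # ('miss', 3)
```

Under FIFO, an earlier prefetch delays the demand miss behind it, which is exactly the pipeline stall discussed in Section 5.2.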

4.3 Simulation settings

We performed three sets of simulations. In the first set, we measured cache performance with and without prefetching under the pipelined memory model. In the second, we ran simulations under the sequential memory model with the FIFO policy. In the third, we used the sequential memory model with the missed-block-first policy. In our experiments, the simulation settings were:

Cache line size: 32 Bytes
Associativity: 2-way
Cache size: 8KB, 16KB, 32KB
L1 hit latency: 1 cycle
L2 hit latency: 6 cycles
L2 cache: 256KB 4-way set-associative cache with 64-byte lines
Memory access latency: 26 cycles
Cache replacement policy: LRU

4.4 Simulation metrics

(1) L1 cache miss rate: this metric reflects how successfully the prefetching schemes reduce cache misses. In our results, an access is considered a hit even if the data request for that access has been issued but has not yet completed.

(2) Cycles Per Instruction (CPI)

CPI represents the average number of cycles each instruction takes when running. This metric is used to evaluate how the running performance of the benchmarks is impacted by prefetching.

5. Results and Discussion

5.1 Results for the pipelined memory model

Figure 6 shows the results under the pipelined memory model; Figures 6(a), (c), and (e) show the miss rates of the different benchmark programs at different cache sizes. We can see that, for all the media applications, the prefetch scheme reduces the cache miss rate relative to no prefetching. For an 8KB cache, the miss rate of Mesa decreases from 3.0% to 2.2%, and EPIC's decreases from 1.93% to 1.07%. The only exception is the GhostScript program at a cache size of 8KB. We believe the reason is that the memory access pattern of GhostScript is far messier than those of the other media applications; when the cache size is small, the resulting cache pollution leads to more cache misses. For the SPEC2000 benchmarks, because their memory access patterns are not as regular as those of media applications, the results are not as good. But we can still see that the miss rates of Compress and Go are reduced, whereas the miss rate of GCC at an 8KB cache increases.

Figures 6(b), (d), and (f) show the CPI results for the different benchmark programs. The CPIs of all media applications except GhostScript are reduced slightly (~1%). The impact of the prefetch scheme on CPI is not very pronounced. The reason is, first, that CPI is affected by many factors, such as

pipeline and memory latency; cache performance is only one of them. Second, the miss rates of the media applications are already very low, so there is not much room to improve overall performance by reducing cache misses.
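A back-of-the-envelope calculation illustrates why the CPI gain is small. Assuming a simple additive model, CPI = base + f_mem x miss_rate x miss_penalty, with the 26-cycle memory latency from Section 4.3 and a hypothetical 30% memory-instruction fraction (the base CPI and memory fraction are illustrative assumptions, not measured values):

```python
def cpi_estimate(base_cpi, mem_fraction, miss_rate, miss_penalty):
    """Simple additive CPI model: every L1 miss pays the full penalty."""
    return base_cpi + mem_fraction * miss_rate * miss_penalty

# Halving an already-low miss rate moves CPI by only a few percent.
before = cpi_estimate(1.0, 0.30, 0.02, 26)   # 2% miss rate -> 1.156
after  = cpi_estimate(1.0, 0.30, 0.01, 26)   # 1% miss rate -> 1.078
print(before, after)
```

Even a perfect prefetcher that halved a 2% miss rate would only shave about 7% off CPI in this model, consistent with the ~1% improvements observed.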

5.2 Results for the sequential memory model

The results of the sequential memory model are shown in Figure 7. Let us first look at the results for the FIFO priority policy. In Figures 7(a), (c), and (e), we can see that the miss rates of the media applications are reduced. However, in the CPI results in Figures 7(b), (d), and (f), we can see that the CPI for Mesa and GhostScript increases slightly compared to no prefetching. The reason is that we give the same priority to missed-block requests and prefetches: the data fetch after a cache miss can be delayed by a prefetch, and the pipeline is unnecessarily stalled in this case.

In the missed-block-first policy, we give missed blocks a higher priority; prefetch requests can be served only when there is no missed-block request. From the results we can see that the reduction in miss rate is not as good as with the FIFO policy. The reason is that if we give prefetching a low priority, some prefetch requests are invalidated because the data has already missed and been fetched before the prefetch is actually served. However, with the missed-block-first policy, the CPI of every benchmark becomes better than with the FIFO policy, because the unnecessary pipeline stalls caused by prefetching are eliminated.

Figure 6. Prefetching results for the pipelined memory model: (a) miss rate for 8KB cache, (b) CPI result for 8KB cache, (c) miss rate for 16KB cache, (d) CPI result for 16KB cache, (e) miss rate for 32KB cache, (f) CPI result for 32KB cache

Figure 7. Prefetching results for the sequential memory model: (a) miss rate for 8KB cache, (b) CPI result for 8KB cache, (c) miss rate for 16KB cache, (d) CPI result for 16KB cache, (e) miss rate for 32KB cache, (f) CPI result for 32KB cache

6. Conclusion and Future Work

Multimedia has become one of the dominant application classes on general-purpose processors. Since these applications

normally have special properties, it is generally assumed that they have poor memory performance compared to traditional applications. In this paper, we performed a comprehensive study of memory performance on a subset of representative media benchmarks and found that:

1) Multimedia applications have lower L1 data cache miss rates than general-purpose applications.

2) The performance of some multimedia applications is hard to improve by increasing cache size alone.

3) Media applications have regular memory access patterns that make them easier to optimize using data prefetching techniques.

We implemented the one-block look-ahead prefetching technique under two different memory access models. The simulation results show that the prefetch scheme can effectively improve the L1 data cache miss rate of multimedia applications. However, because the miss rates of media applications are already very low, the benefit of prefetching on CPI is not very significant.

There are several interesting directions for future work. First, it would be interesting to continue studying the memory access patterns of different applications. Second, more intelligent or adaptive prefetching schemes should be explored. Finally, more specific and advanced computational units for multimedia applications should be studied to further improve the performance of media applications.

Acknowledgement

We would like to thank Prof. Rixner and Paul for their helpful feedback and comments throughout the project.

7. References

[1] SimpleScalar Tool Set Version 3.0: http://www.cs.wisc.edu/mscalar/simplescalar.html

[2] C. Lee, M. Potkonjak, and W. Mangione-Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems", Proc. 30th Ann. Int'l Symp. on Microarchitecture, pp. 330-335, 1997.

[3] S. Sohoni, Z. Xu, R. Min, and Y. Hu, "A Study of Memory System Performance of Multimedia Applications", in Proceedings of the ACM Joint International Conference on Measurement & Modeling of Computer Systems (SIGMETRICS 2001), Cambridge, Massachusetts, pp. 206-215, June 2001.

[4] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang, "Imagine: Media Processing with Streams", IEEE Micro, pp. 34-46, April 2001.

[5] SPEC2000: http://www.spec.org/cpu/

[6] A. J. Smith, "Cache Memories", ACM Computing Surveys, Vol. 14, No. 3, pp. 473-530, 1982.

[7] Tien-Fu Chen and Jean-Loup Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors", IEEE Transactions on Computers, Vol. 44, No. 5, pp. 609-623, May 1995.