Transcript of Lecture 12: Memory Hierarchy -- Cache Optimizations

Page 1: Lecture 12: Memory Hierarchy -- Cache Optimizations

CSCE 513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE513

Page 2: Topics for Memory Hierarchy

• Memory Technology and Principle of Locality
  – CAQA: 2.1, 2.2, B.1
  – COD: 5.1, 5.2
• Cache Organization and Performance
  – CAQA: B.1, B.2
  – COD: 5.2, 5.3
• Cache Optimization
  – 6 Basic Cache Optimization Techniques
    • CAQA: B.3
  – 10 Advanced Optimization Techniques
    • CAQA: 2.3
• Virtual Memory and Virtual Machine
  – CAQA: B.4, 2.4; COD: 5.6, 5.7
  – Skipped for this course

Page 3: Cache and Memory Access (ld/st Instructions)

Page 4: A Summary on Sources of Cache Misses

• Compulsory (cold start or process migration, first reference): first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Coherence (invalidation): another process (e.g., I/O) updates memory
  – We will cover it later on.

Page 5: Memory Hierarchy Performance

• Two indirect performance measures have waylaid many a computer designer.
  – Instruction count is independent of the hardware.
  – Miss rate is mostly independent of the hardware.
• A better measure of memory hierarchy performance is the Average Memory Access Time (AMAT):

AMAT = Hit time + Miss rate × Miss penalty

CPU execution time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Page 6: Impact on Performance

• Suppose a processor executes at
  – Clock rate = 200 MHz (5 ns per cycle)
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory (data) operations get a 50-cycle miss penalty
• Suppose that 1% of instruction accesses get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
      + [0.30 (data Mops/ins) × 0.10 (miss/data Mop) × 50 (cycles/miss)]
      + [1 (inst Mop/ins) × 0.01 (miss/inst Mop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• 2/3.1 (64.5%) of the time the processor is stalled waiting for memory!

  Ideal CPI: 1.1
  Data miss: 1.5
  Inst miss: 0.5
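A minimal C sketch that reproduces the arithmetic above (the rates and penalty are the example's assumed values):

#include <stdio.h>

int main(void) {
    double ideal_cpi = 1.1;            /* CPI with a perfect memory system */
    double data_ops_per_inst = 0.30;   /* 30% of instructions are loads/stores */
    double data_miss_rate = 0.10;      /* 10% of data accesses miss */
    double inst_miss_rate = 0.01;      /* 1% of instruction fetches miss */
    double miss_penalty = 50.0;        /* cycles per miss */

    double data_stall = data_ops_per_inst * data_miss_rate * miss_penalty;  /* 1.5 */
    double inst_stall = 1.0 * inst_miss_rate * miss_penalty;                /* 0.5 */
    double cpi = ideal_cpi + data_stall + inst_stall;                       /* 3.1 */

    printf("CPI = %.1f, memory stall fraction = %.1f%%\n",
           cpi, 100.0 * (data_stall + inst_stall) / cpi);
    return 0;
}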

Page 7: Improving Cache Performance

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty

Goal                    Basic approaches
Reducing miss rate      Larger block size, larger cache size, and higher associativity
Reducing miss penalty   Multilevel caches, and higher read priority over writes
Reducing hit time       Avoid address translation when indexing the cache

Page 8: 1. Reduce Miss Rate via Larger Block Size

Page 9: Much Larger Block Size → Increased Miss Rate

Page 10: Example: Miss Rate vs. Reduced AMAT

Page 11: Larger Block Size → Increased Miss Penalty; Choose a Block Size Based on AMAT

Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty

Page 12: 2. Reduce Miss Rate via Larger Cache

• Increasing the capacity of the cache reduces capacity misses
• May have a longer hit time and higher cost
• Trend: larger L2 or L3 off-chip caches

Page 13: 3. Reduce Miss Rate via Higher Associativity

• 2:1 cache rule:
  – Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
• 8-way set associative is as effective as fully associative for practical purposes
• Trade-off: a higher-associativity cache complicates the circuit
  – May have a longer clock cycle
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%

Page 14: Associativity ↔ AMAT

Page 15: Associativity ↔ AMAT: High Associativity Leads to Higher Access Time

Page 16: 4. Reduce Miss Penalty via Multilevel Caches

• Approaches
  – Make the cache faster to keep pace with the speed of CPUs
  – Make the cache larger to overcome the widening gap
• L1: fast hits; L2: fewer misses
• L2 equations (expanded below)
• Hit Time_L1 << Hit Time_L2 << … << Hit Time_Mem
• Miss Rate_L1 < Miss Rate_L2 < …
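The "L2 equations" referred to above are the standard two-level expansion of AMAT (CAQA B.3), where Miss Rate_L2 is the local miss rate of L2:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

so

AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)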

Page 17: Miss Rate in Multilevel Caches

• Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L1, Miss Rate_L2)
  – The L1 cache skims the cream of the memory accesses
• Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1, Miss Rate_L1 × Miss Rate_L2)
  – Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
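For instance (illustrative numbers, not from the slides): if L1 misses on 4% of CPU references and L2's local miss rate is 50%, then L2's global miss rate is 0.04 × 0.5 = 2%, i.e., only 2% of the references that leave the CPU go all the way to main memory.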

Page 18: Miss Rates in Multilevel Caches

Page 19: Multilevel Caches: Design of L2

• Size
  – Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1
• Whether data in L1 is also in L2
  – Novice approach: design L1 and L2 independently
  – Multilevel inclusion: L1 data are always present in L2
    • Advantage: easy consistency between I/O and the cache (check L2 only)
    • Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block being replaced => slightly higher 1st-level miss rate
      – e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
  – Multilevel exclusion: L1 data is never found in L2
    • A cache miss in L1 results in a swap of blocks between L1 and L2
    • Advantage: prevents wasting space in L2
      – e.g., AMD Athlon: 64 KB L1 and 256 KB L2

Page 20: Multilevel Caches: Example

Page 21: Multilevel Caches: Example

Page 22: 5. Reduce Miss Penalty by Giving Priority to Read Misses over Writes

• Serve reads before writes have been completed
• Write-through with write buffers:

  SW R3, 512(R0)   ; M[512] <- R3  (cache index 0)
  LW R1, 1024(R0)  ; R1 <- M[1024] (cache index 0)
  LW R2, 512(R0)   ; R2 <- M[512]  (cache index 0)

  Problem: write-through with write buffers creates RAW conflicts with main memory reads on cache misses
  – If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
  – Instead, check the write buffer contents before a read; if there are no conflicts, let the memory access continue
• Write-back: suppose a read miss will replace a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done

Page 23: 6. Reduce Hit Time by Avoiding Address Translation during Indexing of the Cache

• Importance of cache hit time
  – Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
  – More importantly, cache access time limits the clock cycle rate in many processors today!
• Fast hit time:
  – Quickly and efficiently find out if the data is in the cache, and
  – if it is, get that data out of the cache
• Four techniques:
  1. Small and simple caches
  2. Avoiding address translation during indexing of the cache
  3. Pipelined cache access
  4. Trace caches

Page 24: Avoiding Address Translation during Cache Indexing

• Two tasks: indexing the cache and comparing addresses
• Virtually vs. physically addressed caches
  – Virtual cache: use the virtual address (VA) for the cache
  – Physical cache: use the physical address (PA) after translating the virtual address
• Challenges for virtual caches
  1. Protection: page-level protection (RW/RO/Invalid) must be checked
     – It is normally checked as part of the virtual-to-physical address translation
     – Solution: an additional field copies the protection information from the TLB and is checked on every access to the cache
  2. Context switching: the same VA of different processes refers to different PAs, requiring the cache to be flushed
     – Solution: increase the width of the cache address tag with a process-identifier tag (PID)
  3. Synonyms or aliases: two different VAs for the same PA
     – Inconsistency problem: two copies of the same data in a virtual cache
     – Hardware antialiasing solution: guarantee every cache block a unique PA
       – Alpha 21264: check all possible locations; if one is found, it is invalidated
     – Software page-coloring solution: force aliases to share some address bits
       – Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs
  4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)

Page 25: Virtually Indexed, Physically Tagged Cache

[Diagram: three cache organizations]
• Conventional organization: CPU → TLB → cache → memory; the VA is translated to a PA before the physically addressed cache is accessed.
• Virtually addressed cache: CPU → cache → TLB → memory; the cache uses VA tags and translation happens only on a miss. This raises the synonym problem.
• Virtually indexed, physically tagged: the cache access is overlapped with the VA translation; the cache is indexed with VA bits but tagged with the PA, which requires the cache index to remain invariant across translation. The L2 cache behind it is physically addressed.
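A quick worked constraint for the virtually indexed, physically tagged organization (illustrative numbers, not from the slides): the index plus block offset must fit inside the page offset so that the index bits are identical in the VA and the PA. With 4 KB pages (12 offset bits) and 64-byte blocks (6 offset bits), only 6 bits remain for the index, i.e., 64 sets; a direct-mapped cache is then limited to 64 × 64 B = 4 KB, and an 8-way set-associative cache to 8 × 4 KB = 32 KB.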

Page 26: Summary of the 6 Basic Cache Optimization Techniques (Textbook B.3)

Page 27: Advanced Cache Optimization Techniques

1. Reducing the hit time — Small and simple first-level caches and way prediction. Both techniques also generally decrease power consumption.
2. Increasing cache bandwidth — Pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.
3. Reducing the miss penalty — Critical word first and merging write buffers. These optimizations have little impact on power.
4. Reducing the miss rate — Compiler optimizations. Obviously any improvement at compile time improves power consumption.
5. Reducing the miss penalty or miss rate via parallelism — Hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily due to prefetched data that are unused.

• These techniques increase hardware complexity
• They require sophisticated compiler transformations

Page 28: 1. Small and Simple First-Level Caches

• Cache-hit critical path, three steps:
  1. addressing the tag memory using the index portion of the address,
  2. comparing the read tag value to the address, and
  3. setting the multiplexor to choose the correct data item if the cache is set associative.
• Guideline: smaller hardware is faster; a small data cache allows a fast clock rate
  – The size of L1 caches has recently increased either slightly or not at all
• Alpha 21164 has an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache
• e.g., L1 caches stayed the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  – Also, an L2 cache small enough to fit on chip with the processor ⇒ avoids the time penalty of going off chip
• Guideline: simpler hardware is faster
  – Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time.
  – Lower levels of associativity usually reduce power because fewer cache lines must be accessed.
• General design: a small and simple first-level cache
  – Keep the tags on chip and the data off chip for second-level caches
  – One emphasis is on fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory

Page 29: Cache Size → AMAT

http://www.hpl.hp.com/research/cacti/

Page 30: Cache Size → Power Consumption

Page 31: Examples

Page 32: 2. Fast Hit Times via Way Prediction

• How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next access
  – The multiplexor is set early to select the desired block; only 1 tag comparison is done that cycle (in parallel with reading the data)
  – Miss ⇒ check the other blocks for matches in the next cycle
• Accuracy ≈ 85%
• Drawback: the CPU pipeline is harder to design if the hit time is variable length

[Timing: hit time < way-miss hit time < miss penalty]
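A minimal C sketch of the lookup flow just described (hypothetical structures; the real mechanism is hardware inside the cache):

#include <stdbool.h>

#define SETS 256

struct line { bool valid; unsigned long tag; };
static struct line cache[SETS][2];           /* 2-way set-associative cache */
static unsigned char predicted_way[SETS];    /* one prediction bit per set */

/* Returns the number of cycles spent on a hit, or -1 on a cache miss.
   Assumes set < SETS. */
int lookup(unsigned long tag, unsigned set) {
    unsigned w = predicted_way[set];
    if (cache[set][w].valid && cache[set][w].tag == tag)
        return 1;                            /* fast hit: one tag compare */
    unsigned other = w ^ 1u;
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = (unsigned char)other;   /* retrain the predictor */
        return 2;                            /* way mispredict: extra cycle */
    }
    return -1;                               /* genuine miss */
}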

Page 33: 3. Increasing Cache Bandwidth by Pipelining

• Simply pipeline cache access
  – Multiple clock cycles for a first-level cache hit
• Advantage: fast cycle time, although hits become slower
• Example: accessing instructions from the I-cache
  – Pentium: 1 clock cycle (mid-1990s)
  – Pentium Pro through Pentium III: 2 clocks (mid-1990s to 2000)
  – Pentium 4: 4 clocks (2000s)
  – Intel Core i7: 4 clocks
• Drawback: increasing the number of pipeline stages leads to
  – a greater penalty on mispredicted branches, and
  – more clock cycles between the issue of a load and the use of the data
• Note that this increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit

Page 34: 4. Increasing Cache Bandwidth with Non-Blocking Caches

• A non-blocking or lockup-free cache allows continued cache hits during a miss
  – Requires full/empty (F/E) bits on registers or out-of-order execution
  – Requires multi-bank memories
• Hit under miss reduces the effective miss penalty by working during a miss instead of ignoring CPU requests
• Hit under multiple miss or miss under miss further lowers the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller since there can be many outstanding memory accesses
  – Requires multiple memory banks
  – The Pentium Pro allows 4 outstanding memory misses
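A minimal C sketch of the bookkeeping a non-blocking cache controller needs — a small table of outstanding misses (often called MSHRs). The structure and limit of 4 entries are illustrative assumptions, not from the slides:

#include <stdbool.h>

#define MAX_OUTSTANDING 4   /* e.g., the Pentium Pro allowed 4 outstanding misses */

struct mshr { bool valid; unsigned long block_addr; };
static struct mshr mshrs[MAX_OUTSTANDING];

enum miss_action { MERGE, ALLOCATE, STALL };

/* On a cache miss: merge with an outstanding miss to the same block,
   allocate a new entry if one is free, or stall when the table is full. */
enum miss_action handle_miss(unsigned long block_addr) {
    int free_slot = -1;
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return MERGE;                    /* secondary miss to the same block */
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        mshrs[free_slot].valid = true;       /* track a new primary miss */
        mshrs[free_slot].block_addr = block_addr;
        return ALLOCATE;
    }
    return STALL;                            /* too many outstanding misses */
}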

Page 35: Effectiveness of Non-Blocking Caches

Page 36: Performance Evaluation of Non-Blocking Caches

• A cache miss does not necessarily stall the processor
  – It is difficult to judge the impact of any single miss and hence to calculate the average memory access time
• The effective miss penalty is the non-overlapped time that the processor is stalled
  – not the sum of the misses
• The benefit of non-blocking caches is complex; it depends on
  – the miss penalty when there are multiple misses,
  – the memory reference pattern, and
  – how many instructions the processor can execute with a miss outstanding
• For out-of-order processors
  – Check the textbook

Page 37: 5. Increasing Cache Bandwidth via Multiple Banks

• Rather than treating the cache as a single monolithic block, divide it into independent banks to support simultaneous accesses
  – The ARM Cortex-A8 supports one to four banks in its L2 cache
  – The Intel Core i7 has four banks in L1 (to support up to 2 memory accesses per clock), and the L2 has eight banks

Page 38: Sequential Interleaving

• Works best when accesses naturally spread across banks ⇒
  – the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is sequential interleaving
  – Spread block addresses sequentially across banks
  – e.g., with n banks, bank i holds all blocks whose address is i modulo n
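A one-line illustration in C of the sequential-interleaving mapping described above (4 banks is an assumed example value):

#include <stdio.h>

int main(void) {
    unsigned long n_banks = 4;                  /* assumed number of banks */
    for (unsigned long block_addr = 0; block_addr < 8; block_addr++)
        /* sequential interleaving: bank i gets blocks with address i mod n */
        printf("block %lu -> bank %lu\n", block_addr, block_addr % n_banks);
    return 0;
}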

Page 39: 6. Reduce Miss Penalty: Early Restart and Critical Word First

• Don't wait for the full block before restarting the CPU
• Critical word first — request the missed word from memory first and send it to the CPU right away; let the CPU continue while filling the rest of the block
  – Large blocks are more popular today ⇒ critical word first is widely used
• Early restart — as soon as the requested word of the block arrives, send it to the CPU and continue execution
  – Spatial locality ⇒ the CPU tends to want the next sequential word, so it may still pay to fetch that one

Page 40: 7. Merging Write Buffer to Reduce Miss Penalty

• A write buffer lets the processor continue while waiting for a write to complete
• Merging write buffer:
  – If the buffer contains other modified blocks, the addresses can be checked to see if the new data matches that of some write buffer entry
  – If so, the new data is combined with that entry
• For sequential writes in write-through caches, this increases the block size of the write (more efficient)
• The Sun T1 (Niagara) and many others use write merging
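A minimal C sketch of the merging check described above (hypothetical structure; a real write buffer does this per block in hardware):

#include <stdbool.h>

#define WB_ENTRIES 4
#define WORDS_PER_BLOCK 8

struct wb_entry {
    bool valid;
    unsigned long block_addr;
    unsigned long word[WORDS_PER_BLOCK];
    unsigned char word_valid;                /* bitmap of words already written */
};
static struct wb_entry wb[WB_ENTRIES];

/* Returns true if the write was merged into an existing buffer entry;
   otherwise the caller allocates a new entry (or stalls if the buffer is full). */
bool write_merge(unsigned long block_addr, unsigned word_idx, unsigned long data) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block_addr) {
            wb[i].word[word_idx] = data;     /* merge the write into the same block */
            wb[i].word_valid |= (unsigned char)(1u << word_idx);
            return true;
        }
    }
    return false;
}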

Page 41: Merging Write Buffer Example

Page 42: 8. Reducing Misses by Compiler Optimizations (Software-only Approach)

• McFarling [1989] reduced misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks
• Instructions
  – Reorder procedures in memory to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Loop interchange: change the nesting of loops to access data in memory order
  – Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
  – Merging arrays: improve spatial locality by using a single array of compound elements vs. two arrays
  – Loop fusion: combine two independent loops that have the same looping and some variable overlap

Page 43: Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Before — sequence of access: x[0][0], x[1][0], x[2][0], …
After — sequence of access: x[0][0], x[0][1], x[0][2], …

[Figure: row-major layout of the 5000 × 100 array x, rows 0–4999 and columns 0–99]

Page 44: Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };

• Two inner loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N and the cache size:
  – 2N³ + N² memory words accessed (assuming no conflicts; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache

Page 45: Array Access in Matrix Multiplication

Page 46: Array Access for the Blocking/Tiling Transformation

Page 47: Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            };

• B is called the blocking factor
• Capacity misses drop from 2N³ + N² to 2N³/B + N²
• Does blocking reduce conflict misses too?
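As a quick check of the formula above (illustrative numbers, not from the slides): with N = 512 and B = 64, the worst-case count falls from 2·512³ + 512² ≈ 2.7 × 10⁸ accessed words to 2·512³/64 + 512² ≈ 4.5 × 10⁶, roughly a 60× reduction.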

Page 48: Reducing Conflict Misses by Blocking

• Conflict misses in caches that are not fully associative vs. blocking size
  – Lam et al. [1991]: a blocking factor of 24 had 1/5 the misses of 48, despite both fitting in the cache

[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a fully associative cache and a direct-mapped cache]

Page 49: Merging/Splitting Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Array of structures vs. structure of arrays.

Page 50: Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Two misses per access to a and c vs. one miss per access; improves temporal locality.

Page 51: Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: performance improvement (1× to 3×) from merged arrays, loop interchange, loop fusion, and blocking on benchmarks: compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]

Page 52: 9. Reducing Misses by Hardware Prefetching of Instructions & Data

• Hardware prefetches items before the processor requests them
  – Both instructions and data can be prefetched,
  – either directly into the caches or into an external buffer that can be accessed more quickly than main memory
• Instruction prefetching
  – Typically, the CPU fetches 2 blocks on a miss: the requested block and the next one
  – The requested block goes in the instruction cache; the prefetched block goes in the instruction stream buffer
• Data prefetching
  – The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  – Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is < 256 bytes
  – The Intel Core i7 supports hardware prefetching into both L1 and L2, with the most common case being prefetching of the next line
• Simpler than before.

Page 53: Hardware Prefetching

• Relies on utilizing memory bandwidth that would otherwise be unused
  – If it interferes with demand misses, it can actually lower performance
  – When prefetched data are not used or useful data are displaced, prefetching will have a very negative impact on power

Page 54: Hardware Prefetching

Page 55: 10. Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate

The compiler inserts prefetch instructions to request data before the processor needs it.

• Data prefetch
  – Register prefetch: load data into a register (HP PA-RISC loads)
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetch instructions take time
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue reduces the problem of issue bandwidth

Page 56: Compiler-Controlled Prefetching Example

• Page #93

Page 57: Compiler-Controlled Prefetching Example

Page 58: Prefetching in the GCC Compiler

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
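GCC exposes this as the __builtin_prefetch(addr, rw, locality) built-in documented at the URL above. A minimal sketch of its use (the loop, array name, and prefetch distance of 16 iterations are illustrative assumptions, not from the slides):

void scale(double *a, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            /* read prefetch (rw = 0) with high temporal locality (locality = 3):
               fetch the element 16 iterations ahead so it arrives before it is used */
            __builtin_prefetch(&a[i + 16], 0, 3);
        a[i] = 2.0 * a[i];
    }
}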

Page 59: Summary of the 10 Advanced Cache Optimization Techniques

Page 60: Review of Cache Performance and Optimizations

Page 61: Summary of the 6 Basic Cache Optimization Techniques (Textbook B.3)

Page 62: Summary of the 10 Advanced Cache Optimization Techniques

Page 63: Review of Cache Optimizations

• 3 C's of cache misses
  – Compulsory, Capacity, Conflict
• Write policies
  – Write-back, write-through, write-allocate, no-write-allocate
• Multi-level cache hierarchies reduce miss penalty
  – 3 levels are common in modern systems (some have 4!)
  – Can change the design trade-offs of the L1 cache if it is known to have an L2 behind it
• Prefetching: retrieve memory data before the CPU requests it
  – Prefetching can waste bandwidth and cause cache pollution
  – Software vs. hardware prefetching
• Software memory hierarchy optimizations
  – Loop interchange, loop fusion, cache tiling

Page 64: Design Guidelines

• Cache block size: 32 or 64 bytes
  – Fixed size across cache levels
• Cache sizes (per core):
  – L1: small and fastest for low hit time, 2 KB to 64 KB each for separate D$ and I$
  – L2: large and faster for low miss rate, 256 KB – 512 KB for combined D$ and I$
  – L3: large and fast for low miss rate, 1 MB – 8 MB for combined D$ and I$
• Associativity
  – L1: direct-mapped, 2- or 4-way
  – L2: 4- or 8-way
• Banked, pipelined, and non-blocking access

Page 65: For Software Development

• Explicit or compiler-controlled prefetching
  – Insert prefetching calls
• Explicit or compiler-assisted code optimization for cache performance
  – Loop transformations (interchange, blocking, etc.)

Page 66: Memory Hierarchy Performance (B.2 in the Textbook)

• Two indirect performance measures have waylaid many a computer designer.
  – Instruction count is independent of the hardware.
  – Miss rate is independent of the hardware.
• A better measure of memory hierarchy performance is the Average Memory Access Time (AMAT):

AMAT = Hit time + Miss rate × Miss penalty

CPU execution time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Page 67: Equations (B-19, B-21)

Page 68: Review

• Finished
  – Appendix A, Instruction Set Principles
  – Appendix B, Review of Memory Hierarchy
  – Appendix C, Pipelining: Basic and Intermediate Concepts
  – Chapter 1, Fundamentals of Quantitative Design and Analysis
  – Chapter 2, Memory Hierarchy Design
• Assignment #3 posted, due 10/29
• Second half
  – Chapter 3, Instruction-Level Parallelism and Its Exploitation
  – Chapter 4, Data-Level Parallelism in Vector, SIMD, and GPU Architectures
  – Chapter 5, Thread-Level Parallelism
  – Chapter 7, Domain-Specific Architectures

Page 69: Feedback

• Lecture
  – Amount and depth of content
• Assignment
  – Amount and depth
  – Amount and depth of hands-on work and programming
  – Extra questions and bonus points
• Midterm
  – Depth and amount
• Nov 12 and Nov 14 classes (regular class on Nov 19th)
  – Reschedule one to Nov 9 (?)
  – Remote via web connect (?)