Lecture 12: Memory Hierarchy -- Cache Optimizations
CSCE 513 Computer Architecture
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://passlab.github.io/CSCE513
Topics for Memory Hierarchy
• Memory Technology and Principle of Locality
  – CAQA: 2.1, 2.2, B.1
  – COD: 5.1, 5.2
• Cache Organization and Performance
  – CAQA: B.1, B.2
  – COD: 5.2, 5.3
• Cache Optimization
  – 6 Basic Cache Optimization Techniques
    • CAQA: B.3
  – 10 Advanced Optimization Techniques
    • CAQA: 2.3
• Virtual Memory and Virtual Machine
  – CAQA: B.4, 2.4; COD: 5.6, 5.7
  – Skip for this course
Cache and Memory Access (ld/st instructions)
A Summary on Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Coherence (invalidation): other process (e.g., I/O) updates memory
  – We will cover it later on.
Memory Hierarchy Performance
• Two indirect performance measures have waylaid many a computer designer.
  – Instruction count is independent of the hardware;
  – Miss rate is mostly independent of the hardware.
• A better measure of memory hierarchy performance is the Average Memory Access Time (AMAT):

AMAT = Hit time + Miss rate × Miss penalty

CPU execution time = IC × (CPI_execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
Impact on Performance
• Suppose a processor executes at
  – Clock rate = 200 MHz (5 ns per cycle)
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50-cycle miss penalty
• Suppose that 1% of instruction accesses get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
        + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• 2/3.1 (64.5%) of the time the processor is stalled waiting for memory!
  – Ideal CPI: 1.1
  – Data miss: 1.5
  – Inst miss: 0.5
Improving Cache Performance
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
Goals                  Basic Approaches
Reducing miss rate     Larger block size, larger cache size, and higher associativity
Reducing miss penalty  Multilevel caches, and higher read priority over writes
Reducing hit time      Avoid address translation when indexing the cache
1. Reduce Miss Rate via Larger Block Size

Much Larger Block Size => Increased Miss Rate
Example: Miss Rate vs. Reduced AMAT
Larger Block Size => Increased Miss Penalty
Choose a Block Size Based on AMAT
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
2. Reduce Miss Rate via Larger Cache
• Increasing the capacity of the cache reduces capacity misses
• Maybe longer hit time and higher cost
• Trends: larger L2 or L3 off-chip caches
3. Reduce Miss Rate via Higher Associativity
• 2:1 cache rule:
  – Miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2
• 8-way set associative is as effective as fully associative for practical purposes
• Tradeoff: a higher-associativity cache complicates the circuit
  – May have a longer clock cycle
• Beware: execution time is the only final measure!
  – Will clock cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
Associativity <=> AMAT

Associativity <=> AMAT: High Associativity Leads to Higher Access Time
4. Reduce Miss Penalty via Multilevel Caches
• Approaches
  – Make the cache faster to keep pace with the speed of CPUs
  – Make the cache larger to overcome the widening gap
• L1: fast hits, L2: fewer misses
• L2 equations:
  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Hit Time_L1 << Hit Time_L2 << … << Hit Time_Mem
• Miss Rate_L1 < Miss Rate_L2 < …
Miss Rate in Multilevel Caches
• Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L1, Miss rate_L2)
  – The L1 cache skims the cream of the memory accesses
• Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1, Miss rate_L1 × Miss rate_L2)
  – Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
Miss Rates in Multilevel Caches
Multilevel Caches: Design of L2
• Size
  – Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1
• Whether data in L1 is in L2
  – Novice approach: design L1 and L2 independently
  – Multilevel inclusion: L1 data are always present in L2
    • Advantage: easy consistency between I/O and cache (check L2 only)
    • Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block to be replaced => slightly higher 1st-level miss rate
      – e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
  – Multilevel exclusion: L1 data is never found in L2
    • A cache miss in L1 results in a swap of blocks between L1 and L2
    • Advantage: prevents wasting space in L2
      – e.g., AMD Athlon: 64 KB L1 and 256 KB L2
Multilevel Caches: Example
5. Reduce Miss Penalty by Giving Priority to Read Misses over Writes
• Serve reads before writes have been completed
• Write-through with write buffers:

  SW R3, 512(R0)   ; M[512] <- R3  (cache index 0)
  LW R1, 1024(R0)  ; R1 <- M[1024] (cache index 0)
  LW R2, 512(R0)   ; R2 <- M[512]  (cache index 0)

  Problem: write-through with write buffers creates RAW conflicts with main memory reads on cache misses
  – If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
  – Check write buffer contents before a read; if there are no conflicts, let the memory access continue
• Write-back: suppose a read miss will replace a dirty block
  – Normal: write the dirty block to memory, then do the read
  – Instead: copy the dirty block to a write buffer, do the read, then do the write
  – The CPU stalls less since it restarts as soon as the read is done
6. Reduce Hit Time by Avoiding Address Translation during Indexing of the Cache
• Importance of cache hit time
  – Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
  – More importantly, cache access time limits the clock cycle rate in many processors today!
• Fast hit time:
  – Quickly and efficiently find out if data is in the cache, and
  – if it is, get that data out of the cache
• Four techniques:
  1. Small and simple caches
  2. Avoiding address translation during indexing of the cache
  3. Pipelined cache access
  4. Trace caches
Avoiding Address Translation during Cache Indexing
• Two tasks: indexing the cache and comparing addresses
• Virtually vs. physically addressed cache
  – Virtual cache: use the virtual address (VA) for the cache
  – Physical cache: use the physical address (PA) after translating the virtual address
• Challenges to virtual caches
  1. Protection: page-level protection (RW/RO/Invalid) must be checked
     – It is checked as part of the virtual-to-physical address translation
     – Solution: an additional field to copy the protection information from the TLB and check it on every access to the cache
  2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
     – Solution: increase the width of the cache address tag with a process-identifier tag (PID)
  3. Synonyms or aliases: two different VAs for the same PA
     – Inconsistency problem: two copies of the same data in a virtual cache
     – Hardware anti-aliasing solution: guarantee every cache block a unique PA
       – Alpha 21264: check all possible locations; if one is found, it is invalidated
     – Software page-coloring solution: force aliases to share some address bits
       – Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs
  4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12)
Virtually Indexed, Physically Tagged Cache

[Figure: three organizations.
(1) Conventional organization: CPU -> TB -> $ -> MEM; the VA is translated to a PA before every cache access.
(2) Virtually addressed cache: CPU -> $ -> TB -> MEM; translate only on a miss, but with the synonym problem.
(3) Overlapped access: the CPU sends the VA to the TB and the cache in parallel (PA tags in the cache, physically addressed L2 $); overlapping cache access with VA translation requires the cache index to remain invariant across translation.]
Summary of the 6 Basic Cache Optimization Techniques (Textbook B.3)
Advanced Cache Optimization Techniques
1. Reducing the hit time — small and simple first-level caches and way prediction. Both techniques also generally decrease power consumption.
2. Increasing cache bandwidth — pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.
3. Reducing the miss penalty — critical word first and merging write buffers. These optimizations have little impact on power.
4. Reducing the miss rate — compiler optimizations. Obviously any improvement at compile time improves power consumption.
5. Reducing the miss penalty or miss rate via parallelism — hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily due to prefetched data that are unused.

• Increase hardware complexity
• Require sophisticated compiler transformations
1. Small and Simple First-Level Caches
• Cache-hit critical path, three steps:
  1. addressing the tag memory using the index portion of the address,
  2. comparing the read tag value to the address, and
  3. setting the multiplexor to choose the correct data item if the cache is set associative.
• Guideline: smaller hardware is faster. A small data cache enables a fast clock rate
  – The size of the L1 caches has recently increased either slightly or not at all.
  – The Alpha 21164 has an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache
  – E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  – Also an L2 cache small enough to fit on chip with the processor => avoids the time penalty of going off chip
• Guideline: simpler hardware is faster
  – Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time.
  – Lower levels of associativity will usually reduce power because fewer cache lines must be accessed.
• General design: a small and simple cache for the 1st-level cache
  – Keep the tags on chip and the data off chip for 2nd-level caches
  – One emphasis is on fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory
Cache Size => AMAT

http://www.hpl.hp.com/research/cacti/
Cache Size => Power Consumption
Examples
2. Fast Hit Times via Way Prediction
• How to combine the fast hit time of direct-mapped with the lower conflict misses of a 2-way set-associative cache?
• Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next access
  – Multiplexer set early to select the desired block; only 1 tag comparison done that cycle (in parallel with reading data)
  – Miss => check the other blocks for matches in the next cycle
• Accuracy ≈ 85%
• Drawback: the CPU pipeline is harder to design if the hit time is variable-length

[Figure: a correctly predicted way gives the normal hit time; a way mispredict adds a cycle to the hit time; a true miss pays the miss penalty.]
3. Increasing Cache Bandwidth by Pipelining
• Simply pipeline the cache access
  – Multiple clock cycles for a 1st-level cache hit
• Advantage: fast cycle time and slow hit
• Example: accessing instructions from the I-cache
  – Pentium: 1 clock cycle (mid-1990s)
  – Pentium Pro ~ Pentium III: 2 clocks (mid-1990s to 2000)
  – Pentium 4: 4 clocks (2000s)
  – Intel Core i7: 4 clocks
• Drawback: increasing the number of pipeline stages leads to
  – a greater penalty on mispredicted branches, and
  – more clock cycles between the issue of a load and the use of the data
• Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit
4. Increasing Cache Bandwidth with Non-Blocking Caches
• A non-blocking or lockup-free cache allows continued cache hits during a miss
  – Requires F/E bits on registers or out-of-order execution
  – Requires multi-bank memories
• Hit under miss reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
• Hit under multiple miss or miss under miss further lowers the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller since there can be many outstanding memory accesses
  – Requires multiple memory banks
  – The Pentium Pro allows 4 outstanding memory misses
Effectiveness of Non-Blocking Caches
Performance Evaluation of Non-Blocking Caches
• A cache miss does not necessarily stall the processor
  – It is difficult to judge the impact of any single miss and hence to calculate the average memory access time.
• The effective miss penalty is the non-overlapped time that the processor is stalled,
  – not the sum of the misses
• The benefit of non-blocking caches is complex; it depends on
  – the miss penalty when there are multiple misses,
  – the memory reference pattern, and
  – how many instructions the processor can execute with a miss outstanding.
• For out-of-order processors
  – Check the textbook
5. Increasing Cache Bandwidth via Multiple Banks
• Rather than treating the cache as a single monolithic block, divide it into independent banks to support simultaneous accesses
  – The Arm Cortex-A8 supports one to four banks in its L2 cache;
  – the Intel Core i7 has four banks in L1 (to support up to 2 memory accesses per clock), and the L2 has eight banks.
Sequential Interleaving
• Works best when accesses naturally spread across banks =>
  – the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is sequential interleaving
  – Spread block addresses sequentially across banks
  – E.g., bank i has all blocks whose address is i modulo n
6. Reduce Miss Penalty: Early Restart and Critical Word First
• Don't wait for the full block before restarting the CPU
• Critical word first — request the missed word from memory first and send it to the CPU right away; let the CPU continue while filling the rest of the block
  – Large blocks are more popular today => critical word first is widely used
• Early restart — as soon as the requested word of the block arrives, send it to the CPU and continue execution
  – Spatial locality => we tend to want the next sequential word, so we may still pay to get that one
7. Merging Write Buffer to Reduce Miss Penalty
• A write buffer lets the processor continue while waiting for the write to complete
• Merging write buffer:
  – If the buffer contains modified blocks, the addresses can be checked to see if the new data matches that of some write buffer entry
  – If so, the new data is combined with that entry
• For sequential writes in write-through caches, merging increases the block size of the write (more efficient)
• The Sun T1 (Niagara) and many others use write merging
Merging Write Buffer Example
8. Reducing Misses by Compiler Optimizations: a Software-Only Approach
• McFarling [1989] reduced misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks
• Instructions
  – Reorder procedures in memory to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Loop interchange: change the nesting of loops to access data in memory order
  – Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
  – Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  – Loop fusion: combine 2 independent loops that have the same looping and some variable overlap
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Before (k, j, i order), sequence of accesses: x[0][0], x[1][0], x[2][0], …
After (k, i, j order), sequence of accesses: x[0][0], x[0][1], x[0][2], …
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };

• Two inner loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N and cache size:
  – 2N³ + N² words accessed => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache
Array Access in Matrix Multiplication

Array Access for Blocking/Tiling Transformation
Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      };

• B is called the blocking factor
• Capacity misses drop from 2N³ + N² to 2N³/B + N²
• Reduce conflict misses too?
Reducing Conflict Misses by Blocking
• Conflict misses in caches that are not fully associative vs. blocking size
  – Lam et al. [1991]: a blocking factor of 24 had 1/5 the misses of a factor of 48, despite both fitting in the cache

[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a fully associative cache and a direct-mapped cache.]
Merging/Splitting Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.

Array of Structs or Struct of Arrays
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves temporal locality.
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).]
9. Reducing Misses by Hardware Prefetching of Instructions & Data
• Hardware prefetches items before the processor requests them.
  – Both instructions and data can be prefetched,
  – either directly into the caches or into an external buffer that can be more quickly accessed than main memory.
• Instruction prefetching
  – Typically, the CPU fetches 2 blocks on a miss: the requested block and the next one
  – The requested block goes in the instruction cache; the prefetched block goes in the instruction stream buffer
• Data prefetching
  – The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  – Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is < 256 bytes
  – The Intel Core i7 supports hardware prefetching into both L1 and L2, with the most common case being prefetching the next line. Simpler than before.
Hardware Prefetching
• Relies on utilizing memory bandwidth that otherwise would be unused
  – If it interferes with demand misses, it can actually lower performance.
  – When prefetched data are not used or useful data are displaced, prefetching will have a very negative impact on power.
10. Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
The compiler inserts prefetch instructions to request data before the processor needs it.
• Data prefetch
  – Register prefetch: load data into a register (HP PA-RISC loads)
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetch instructions take time
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue reduces the problem of issue bandwidth
Compiler-Controlled Prefetching Example
• Page #93

Prefetching in the GCC Compiler
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Summary of the 10 Advanced Cache Optimization Techniques
Review on Cache Performance and Optimizations

Summary of the 6 Basic Cache Optimization Techniques (Textbook B.3)

Summary of the 10 Advanced Cache Optimization Techniques
Review of Cache Optimizations
• 3 C's of cache misses
  – Compulsory, Capacity, Conflict
• Write policies
  – Write-back, write-through, write-allocate, no write allocate
• Multi-level cache hierarchies reduce miss penalty
  – 3 levels common in modern systems (some have 4!)
  – Can change the design tradeoffs of the L1 cache if it is known to have an L2
• Prefetching: retrieve memory data before the CPU requests it
  – Prefetching can waste bandwidth and cause cache pollution
  – Software vs. hardware prefetching
• Software memory hierarchy optimizations
  – Loop interchange, loop fusion, cache tiling
Design Guidelines
• Cache block size: 32 or 64 bytes
  – Fixed size across cache levels
• Cache sizes (per core):
  – L1: small and fastest for low hit time, 2 KB to 64 KB each for separate D$ and I$
  – L2: large and faster for low miss rate, 256 KB – 512 KB for combined D$ and I$
  – L3: large and fast for low miss rate, 1 MB – 8 MB for combined D$ and I$
• Associativity
  – L1: direct-mapped, 2/4-way
  – L2: 4/8-way
• Banked, pipelined, and non-blocking access
For Software Development
• Explicit or compiler-controlled prefetching
  – Insert prefetching calls.
• Explicit or compiler-assisted code optimization for cache performance
  – Loop transformations (interchange, blocking, etc.)
Memory Hierarchy Performance, B.2 in Textbook
• Two indirect performance measures have waylaid many a computer designer.
  – Instruction count is independent of the hardware;
  – Miss rate is independent of the hardware.
• A better measure of memory hierarchy performance is the Average Memory Access Time (AMAT):

AMAT = Hit time + Miss rate × Miss penalty

CPU execution time = IC × (CPI_execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time

Equations (B-19, B-21)
Review
• Finished
  – Appendix A, Instruction Set Principles
  – Appendix B, Review of Memory Hierarchy
  – Appendix C, Pipelining: Basic and Intermediate Concepts
  – Chapter 1, Fundamentals of Quantitative Design and Analysis
  – Chapter 2, Memory Hierarchy Design
• Assignment #3 posted, due 10/29
• Second half
  – Chapter 3, Instruction-Level Parallelism and Its Exploitation
  – Chapter 4, Data-Level Parallelism in Vector, SIMD, and GPU Architectures
  – Chapter 5, Thread-Level Parallelism
  – Chapter 7, Domain-Specific Architectures
Feedback
• Lecture
  – Amount and depth of content
• Assignments
  – Amount and depth
  – Amount and depth of hands-on work and programming
  – Extra questions and bonus points
• Midterm
  – Depth and amount
• Nov 12 and Nov 14 classes (regular class on Nov 19th)
  – Reschedule one to Nov 9 (?)
  – Remote via web connect (?)