Caches and Memory Hierarchy: Review - UCSBtyang/class/240a17/slides/Cache1.pdf · 6 Memory...
Transcript of Caches and Memory Hierarchy: Review - UCSBtyang/class/240a17/slides/Cache1.pdf · 6 Memory...
-
1
Caches and Memory Hierarchy:Review
UCSB CS240A, Fall 2017
-
2
Motivation• Mostapplicationsinasingleprocessorrunsatonly10-20%oftheprocessorpeak
• Mostofthesingleprocessorperformancelossisinthememorysystem– Movingdatatakesmuchlongerthanarithmeticandlogic
• Parallel computing with low single machine performance is not good enough.
• Understand high performance computing and cost in a single machine setting
• Review of cache/memory hierarchy
-
Second-LevelCache(SRAM)
TypicalMemoryHierarchy
Control
Datapath
SecondaryMemory(Disk
OrFlash)
On-ChipComponents
RegFile
MainMemory(DRAM)Data
CacheInstrCache
Speed(cycles):½’s 1’s 10’s 100’s 1,000,000’s
Size(bytes): 100’s 10K’sM’sG’sT’s
3
• Principleoflocality+memoryhierarchypresentsprogrammerwith≈asmuchmemoryasisavailableinthecheapest technologyatthe≈speedofferedbythefastest technology
Cost/bit:highest lowest
Third-LevelCache(SRAM)
-
4
IdealizedUniprocessor Model• Processornamesbytes,words,etc.initsaddressspace
– Theserepresentintegers,floats,pointers,arrays,etc.• Operationsinclude
– Readandwriteintoveryfastmemorycalledregisters– Arithmeticandotherlogicaloperationsonregisters
• Orderspecifiedbyprogram– Readreturnsthemostrecentlywrittendata– Compilerandarchitecturetranslatehighlevelexpressionsinto“obvious”lowerlevelinstructions
– Hardwareexecutesinstructionsinorderspecifiedbycompiler• IdealizedCost
– Eachoperationhasroughlythesamecost(read,write,add,multiply,etc.)
A=B+CÞ
Readaddress(B)toR1Readaddress(C)toR2R3=R1+R2WriteR3toAddress(A)
-
5
Uniprocessors intheRealWorld• Realprocessorshave
– registersandcaches• smallamountsoffastmemory• storevaluesofrecentlyusedornearbydata• differentmemoryopscanhaveverydifferentcosts
– parallelism• multiple“functionalunits”thatcanruninparallel• differentorders,instructionmixeshavedifferentcosts
– pipelining• aformofparallelism,likeanassemblylineinafactory
• Whyisthisyourproblem?• Intheory,compilersandhardware“understand”allthisandcanoptimizeyourprogram;inpracticetheydon’t.
• Theywon’tknowaboutadifferentalgorithmthatmightbeamuchbetter“match”totheprocessor
-
6
MemoryHierarchy• Mostprogramshaveahighdegreeoflocality intheiraccesses
– spatiallocality: accessingthingsnearbypreviousaccesses– temporallocality: reusinganitemthatwaspreviouslyaccessed
• Memoryhierarchytriestoexploitlocalitytoimproveaverage
on-chipcacheregisters
datapath
control
processor
Secondlevelcache(SRAM)
Mainmemory
(DRAM)
Secondarystorage(Disk)
Tertiarystorage
(Disk/Tape)
Speed 1ns 10ns 100ns 1-10ms 10sec
Size KB MB GB TB PB
-
Processor
Control
Datapath
Review:CacheinModernComputerArchitecture
7
PC
Registers
Arithmetic&LogicUnit(ALU)
MemoryInput
Output
BytesAddress
WriteData
ReadData
Processor-MemoryInterface I/O-MemoryInterfaces
Program
Data
Cache
-
8
CacheBasics• Cacheisfast(expensive)memorywhichkeepscopyofdatainmainmemory;itishiddenfromsoftware– Simplestexample:dataatmemoryaddressxxxxx1101isstored
atcachelocation1101• Memorydataisdividedintoblocks
– Cacheaccessmemorybyablock(cacheline)– Cachelinelength:#ofbytesloadedtogetherinoneentry
• Cacheisdividedbythenumberofsets– Acacheblockcanbehostedinoneset.
• Cachehit:in-cachememoryaccess—cheap• Cachemiss:Needtoaccessnext,slowerlevelofcache
-
0000000001000100001100100001010011000111010000100101010010110110001101011100111110000100011001010011101001010110110101111100011001110101101111100111011111011111
0000000001000100001100100001010011000111010000100101010010110110001101011100111110000100011001010011101001010110110101111100011001110101101111100111011111011111
0000000001000100001100100001010011000111010000100101010010110110001101011100111110000100011001010011101001010110110101111100011001110101101111100111011111011111
8 88Byte
Word8-Byte Block
address address address
2 LSBs are 0 3 LSBs are 0
0
1
2
3
01234567012345670123456701234567
Byte offset in blockBlock #10/4/17 9
MemoryBlock-addressingexample
-
ProcessorAddressFieldsusedbyCacheController
• BlockOffset:Byteaddresswithinblock– Bisnumberofbytesperblock
• SetIndex:Selectswhichset.Sisthenumberofsets• Tag:Remainingportionofprocessoraddress
• SizeofTag=Addresssize– log(S)– log(B)
Block offsetSetIndexTag
10
ProcessorAddress
CacheSizeC= Associativity N × #ofSetS × CacheBlockSizeBExample:Cachesize16K.8bytesasablock.à 2Kblocksà IfN=1,S=2Kusing11bits.
Associativity N represents#itemsthatcanbeheldper set
-
010100100000
010100110000
010101000000
010101010000
010101100000
010101110000
010110000000
010110010000
010110100000
010110110000
010100100000
010100110000
010101000000
010101010000
010101100000
010101110000
010110000000
010110010000
010110100000
010110110000
82
83
84
85
86
87
88
89
90
91
2
3
4
5
6
7
0
1
2
3
0
1
0
1
0
1
0
1
0
1
010100100000
010100110000
010101000000
010101010000
010101100000
010101110000
010110000000
010110010000
010110100000
010110110000
Blocknumberaliasingexample
10/4/17 11
Block# Block#mod8 Block#mod2
12-bitmemoryaddresses,16Byteblocks
3-bitsetindex 1-bitsetindex
-
• 4byteblocks,cachesize=1Kwords(or4KB)
Direct-MappedCache:N=1.S=NumberofBlocks=210
20Tag 10Index
DataIndex TagValid012...
102110221023
3130... 131211... 210Byteoffset
20
Data
32
Hit
12
Validbitensures
somethingusefulincacheforthisindex
CompareTagwith
upperpartofAddresstoseeifaHit
Readdatafromcacheinstead
ofmemoryifaHit
Comparator
CacheSizeC= Associativity N × #ofSetS × CacheBlockSizeB
-
CacheOrganizations
• “FullyAssociative”:Blockcangoanywhere– N=numberofblocks.S=1
• “DirectMapped”:Blockgoesoneplace– N=1.S=cachecapacityintermsofnumberofblocks
• “N-waySetAssociative”:Nplacesforablock
BlockIDBlockID
CacheSizeC= N × #ofSetS × SizeBAssociativity N represents#itemsthatcanbeheldper set
-
Four-WaySet-AssociativeCache• 28 =256setseachwithfourways(eachwithoneblock)
3130...131211...210 Byteoffset
DataTagV012...
253254255
DataTagV012...
253254255
DataTagV012...
253254255
SetIndex
DataTagV012...
253254255
8Index
22Tag
Hit Data
32
4x1select
Way0 Way1 Way2 Way3
14
-
Howtofindifadataaddressincache?
15
• Assumeblocksize8bytesàlast3bitsofaddressareoffset.
• Setindex2bits.• Givenaddress0b1001011,wheretofindthis
itemfromcache?
0bmeansbinarynumber
-
Howtofindifadataaddressincache?
16
• Assumeblocksize8bytesàlast3bitsofaddressareoffset.
• Setindex2bits.• 0b1001011à Blocknumber0b1001.• Set index2bits(mod4)
• Set numberà 0b01.• Tag =0b10.
• If directorybasedcache,onlyoneblockinset#1.
• If 4ways,therecouldbe4blocksinset#1.• Use tag0b10 tocomparewhatisintheset.
0bmeansbinarynumber
-
CacheReplacementPolicies• RandomReplacement
– Hardwarerandomlyselectsacacheevict• Least-RecentlyUsed
– Hardwarekeepstrackofaccesshistory– Replacetheentrythathasnotbeenusedforthelongesttime– For2-wayset-associativecache,needonebitforLRUreplacement
• ExampleofaSimple“Pseudo”LRUImplementation– Assume64FullyAssociativeentries– Hardwarereplacementpointerpointstoonecacheentry– Wheneveraccessismadetotheentrythepointerpointsto:
• Movethepointertothenextentry– Otherwise:donotmovethepointer– (exampleof“not-most-recentlyused”replacementpolicy)
:
Entry0Entry1
Entry63
ReplacementPointer
17
-
HandlingDataWriting
• Storeinstructionswritetomemory,changingvalues
• Needtomakesurecacheandmemoryhavesamevaluesonwrites:2policies– 1)Write-throughpolicy:writecacheandwritethroughthecachetomemory
• Everywriteeventuallygetstomemory• Tooslow,soincludeWriteBuffertoallowprocessortocontinueoncedatainBuffer
• Bufferupdatesmemoryinparalleltoprocessor– 2)Write-backpolicy
18
-
Write-ThroughCache
• Writebothvaluesincacheandinmemory
• WritebufferstopsCPUfromstallingifmemorycannotkeepup
• Writebuffermayhavemultipleentriestoabsorbburstsofwrites
• Whatifstoremissesincache?
19
Processor
32-bitAddress
32-bitData
Cache
32-bitAddress
32-bitData
Memory
1022 99252
720
12
1312041 Addr Data
WriteBuffer
-
HandlingStoreswithWrite-Back
2)Write-BackPolicy:writeonlytocacheandthenwritecacheblockbacktomemorywhenevictblockfromcache– Writescollectedincache,onlysinglewritetomemoryperblock
– Includebittoseeifwrotetoblockornot,andthenonlywritebackifbitisset
• Called“Dirty”bit(writingmakesit“dirty”)
20
-
Write-BackCache
• Store/cachehit,writedataincacheonly&setdirtybit– Memoryhasstalevalue
• Store/cachemiss,readdatafrommemory,thenupdateandsetdirtybit– “Write-allocate”policy
• Load/cachehit,usevaluefromcache
• Onanymiss,writebackevictedblock,onlyifdirty.Updatecachewithnewblockandcleardirtybit.
21
Processor
32-bitAddress
32-bitData
Cache
32-bitAddress
32-bitData
Memory
1022 99252
720
12
1312041
DDDD
DirtyBits
-
Write-Throughvs.Write-Back
• Write-Through:– Simplercontrollogic– Morepredictabletimingsimplifiesprocessorcontrollogic
– Easiertomakereliable,sincememoryalwayshascopyofdata(bigidea:Redundancy!)
• Write-Back– Morecomplexcontrollogic– Morevariabletiming(0,1,2memoryaccessespercacheaccess)
– Usuallyreduceswritetraffic
– Hardertomakereliable,sometimescachehasonlycopyofdata
22
-
Cache(Performance) Terms
• Hitrate:fractionofaccessesthathitinthecache• Missrate:1– Hitrate• Misspenalty:timetoreplaceablockfromlowerlevelinmemoryhierarchytocache
• Hittime:timetoaccesscachememory(includingtagcomparison)
23
-
AverageMemoryAccessTime(AMAT)• AverageMemoryAccessTime(AMAT)istheaveragetimetoaccessmemoryconsideringbothhitsandmissesinthecache
AMAT= Timeforahit+Missrate× Misspenalty
24
Givena0.2nsclock,amisspenaltyof50clockcycles,amissrateof2%perinstructionandacachehittimeof1clockcycle,whatisAMAT?AMAT=1cycle+0.02*50=2cycles=0.4ns.