Spring 2016 Caches
May 4: Caches

Roadmap

C:
    car *c = malloc(sizeof(car));
    c->miles = 100;
    c->gals = 17;
    float mpg = get_mpg(c);
    free(c);

Java:
    Car c = new Car();
    c.setMiles(100);
    c.setGals(17);
    float mpg = c.getMPG();

Assembly language:
    get_mpg:
        pushq %rbp
        movq %rsp, %rbp
        ...
        popq %rbp
        ret

Machine code:
    01110100000110001000110100000100000000101000100111000010110000011111101000011111

Computer system:
OS:

Course topics: Memory & data; Integers & floats; Machine code & C; x86 assembly; Procedures & stacks; Arrays & structs; Memory & caches; Processes; Virtual memory; Memory allocation; Java vs. C

How does execution time grow with SIZE?

    int array[SIZE];
    int sum = 0;
    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < SIZE; j++) {
            sum += array[j];
        }
    }

[Plot: TIME vs. SIZE]
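
One way to explore the question is to time the loop for several array sizes. The following is a minimal sketch, not the harness used to produce the course plot: the names `sum_repeated` and `time_sum` are illustrative, and it assumes that `clock()` is precise enough and that the compiler is not optimizing the loop away (for example, build with `-O0`).

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Sum the first `size` ints of `array`, repeated `reps` times
 * (the slide uses reps = 200000). */
long sum_repeated(const int *array, size_t size, int reps) {
    assert(array != NULL);
    long sum = 0;
    for (int i = 0; i < reps; i++) {
        for (size_t j = 0; j < size; j++) {
            sum += array[j];
        }
    }
    return sum;
}

/* CPU seconds spent in sum_repeated on a zero-filled array of `size` ints.
 * Returns -1.0 if the array cannot be allocated. */
double time_sum(size_t size, int reps) {
    int *array = calloc(size, sizeof *array);
    if (array == NULL) {
        return -1.0;
    }
    clock_t start = clock();
    volatile long sink = sum_repeated(array, size, reps);
    clock_t end = clock();
    (void)sink; /* keep the result alive so the call is not discarded */
    free(array);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_sum` for a range of sizes and plotting the results against SIZE is one way to look for the points where the working set stops fitting in a cache level.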

Actual Data

[Plot: measured Time vs. SIZE]

Making memory accesses fast!
• Cache basics
• Principle of locality
• Memory hierarchies
• Cache organization
• Program optimizations that consider caches

Problem: Processor-Memory Bottleneck

[Figure: CPU and registers connected to main memory by a bus]

• Processor performance doubled about every 18 months.
• Bus latency/bandwidth evolved much slower.
• Core 2 Duo: can process at least 256 bytes/cycle.
• Core 2 Duo: bandwidth 2 bytes/cycle, latency 100-200 cycles (30-60 ns).
• Problem: lots of waiting on memory.

cycle: single machine step (fixed-time)

Solution: caches

[Figure: the same CPU/main memory diagram with a cache added]

Cache 💰
• A hidden storage space for provisions, weapons, and/or treasures
• Computer memory with short access time used for the storage of frequently or recently used instructions or data (i-cache and d-cache)
  - More generally: used to optimize data transfers between any system elements with different characteristics (network interface cache, I/O cache, etc.)

General Cache Mechanics

• Cache: smaller, faster, more expensive memory.
  - Caches a subset of the blocks (a.k.a. lines).
  - In the figure, the cache holds blocks 7, 9, 14, 3.
• Data is copied in block-sized transfer units.
• Memory: larger, slower, cheaper memory.
  - Viewed as partitioned into “blocks” or “lines”.
  - In the figure, memory holds blocks 0-15.

General Cache Concepts: Hit

[Figure: cache holding blocks 7, 9, 14, 3; memory holding blocks 0-15]

• Data in block b is needed. Request: 14
• Block b is in cache: Hit!

General Cache Concepts: Miss

[Figure: cache holding blocks 7, 9, 14, 3; memory holding blocks 0-15]

• Data in block b is needed. Request: 12
• Block b is not in cache: Miss!
• Block b is fetched from memory. Request: 12
• Block b is stored in cache.
  - Placement policy: determines where b goes
  - Replacement policy: determines which block gets evicted (victim)

Why Caches Work
• Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently.
• Temporal locality:
  - Recently referenced items are likely to be referenced again in the near future.
  - Why is this important?
• Spatial locality:
  - Items with nearby addresses tend to be referenced close together in time.
  - How do caches take advantage of this?

Example: Any Locality?

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;

• Data:
  - Temporal: sum referenced in each iteration
  - Spatial: array a[] accessed in stride-1 pattern
• Instructions:
  - Temporal: cycle through loop repeatedly
  - Spatial: reference instructions in sequence
• Being able to assess the locality of code is a crucial skill for a programmer.

Locality Example #1

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];

        return sum;
    }

M = 3, N = 4

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

Order accessed (stride-1):
    1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
    5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
    9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]

Layout in memory:
     76: a[0][0] a[0][1] a[0][2] a[0][3]
     92: a[1][0] a[1][1] a[1][2] a[1][3]
    108: a[2][0] a[2][1] a[2][2] a[2][3]

76 is just one possible starting address of array a.

Locality Example #2

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];

        return sum;
    }

M = 3, N = 4

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

Order accessed (stride-N):
    1: a[0][0]   2: a[1][0]   3: a[2][0]
    4: a[0][1]   5: a[1][1]   6: a[2][1]
    7: a[0][2]   8: a[1][2]   9: a[2][2]
   10: a[0][3]  11: a[1][3]  12: a[2][3]

Layout in memory:
     76: a[0][0] a[0][1] a[0][2] a[0][3]
     92: a[1][0] a[1][1] a[1][2] a[1][3]
    108: a[2][0] a[2][1] a[2][2] a[2][3]
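
The stride labels in the two examples can be checked by measuring the byte distance between consecutive inner-loop accesses. This is a small sketch assuming M = 3 and N = 4 as in the figures; the helper names `byte_offset`, `row_order_stride`, and `col_order_stride` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define M 3
#define N 4

static int a[M][N]; /* row-major, as every C 2D array is */

/* Byte offset of a[i][j] from the start of the array. */
ptrdiff_t byte_offset(int i, int j) {
    assert(i >= 0 && i < M && j >= 0 && j < N);
    return (char *)&a[i][j] - (char *)&a[0][0];
}

/* Inner loop of sum_array_rows: j advances by one each step. */
ptrdiff_t row_order_stride(void) {
    return byte_offset(0, 1) - byte_offset(0, 0);
}

/* Inner loop of sum_array_cols: i advances by one each step. */
ptrdiff_t col_order_stride(void) {
    return byte_offset(1, 0) - byte_offset(0, 0);
}
```

The row-order stride is one `int`, while the column-order stride is a whole row of N `int`s, which is why the second traversal touches a new cache block much more often once rows are longer than a block.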

Locality Example #3

    int sum_array_3d(int a[M][N][L])
    {
        int i, j, k, sum = 0;

        for (i = 0; i < N; i++)
            for (j = 0; j < L; j++)
                for (k = 0; k < M; k++)
                    sum += a[k][i][j];

        return sum;
    }

• What is wrong with this code?
• How can it be fixed?

Layout in memory (for M = ?, N = 3, L = 4):
     76: a[0][0][0] a[0][0][1] a[0][0][2] a[0][0][3]
     92: a[0][1][0] a[0][1][1] a[0][1][2] a[0][1][3]
    108: a[0][2][0] a[0][2][1] a[0][2][2] a[0][2][3]
         a[1][0][0] a[1][0][1] a[1][0][2] a[1][0][3]
         a[1][1][0] a[1][1][1] a[1][1][2] a[1][1][3]
         a[1][2][0] a[1][2][1] a[1][2][2] a[1][2][3]
         a[2][0][0] a[2][0][1] a[2][0][2] a[2][0][3]
         a[2][1][0] a[2][1][1] a[2][1][2] a[2][1][3]
         a[2][2][0] a[2][2][1] a[2][2][2] a[2][2][3]
         …
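
One possible answer to the two questions: the innermost loop varies k, the first index, so consecutive accesses are N*L ints apart. Reordering the loops so that the last index varies fastest gives a stride-1 traversal. The sketch below assumes M = 3 (the slide leaves M open); the `_slow` and `_fast` function names are illustrative.

```c
#include <assert.h>

enum { M = 3, N = 3, L = 4 }; /* M = 3 is an assumption; the slide says M = ? */

/* Loop order from the slide: k is innermost, so each access jumps N*L ints. */
int sum_array_3d_slow(int a[M][N][L]) {
    assert(a != 0);
    int sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < L; j++)
            for (int k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

/* Reordered so the loop nest matches the index order: the last index is
 * innermost and memory is read front to back (stride-1). */
int sum_array_3d_fast(int a[M][N][L]) {
    assert(a != 0);
    int sum = 0;
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < L; j++)
                sum += a[k][i][j];
    return sum;
}
```

Both functions compute the same sum; only the order in which memory is visited changes.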

Make memory fast again.
• Cache basics
• Principle of locality
• Memory hierarchies
• Cache organization
• Program optimizations that consider caches

Cost of Cache Misses
• Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
• Would you believe 99% hits is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  - Average access time (check the cache every time, so the hit time is always paid):
      97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
      99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
• This is why “miss rate” is used instead of “hit rate”.

cycle = single fixed-time machine step
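
The arithmetic on this slide follows the pattern hit time + miss rate * miss penalty. A minimal sketch of that calculation is shown here; the function name `average_access_time` is illustrative.

```c
#include <assert.h>

/* Average access time in cycles: the hit time is paid on every access,
 * and the miss penalty is paid on the fraction of accesses that miss. */
double average_access_time(double hit_time, double miss_rate,
                           double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a hit time of 1 cycle and a miss penalty of 100 cycles, a 3% miss rate gives about 4 cycles and a 1% miss rate gives about 2 cycles, matching the numbers on the slide.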

Cache Performance Metrics

• Miss Rate
  - Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
  - Typical numbers (in percentages):
      3% - 10% for L1
      Can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
  - Time to deliver a line in the cache to the processor
  - Includes time to determine whether the line is in the cache
  - Typical hit times: 4 clock cycles for L1; 10 clock cycles for L2
• Miss Penalty
  - Additional time required because of a miss
  - Typically 50 - 200 cycles for missing in L2 & going to main memory (Trend: increasing!)

Can we have more than one cache?
• Why would we want to do that?

Memory Hierarchies
• Some fundamental and enduring properties of hardware and software systems:
  - Faster storage technologies almost always cost more per byte and have lower capacity
  - The gaps between memory technology speeds are widening
      True for: registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.
  - Well-written programs tend to exhibit good locality
• These properties complement each other beautifully
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy

May 6

An Example Memory Hierarchy

Smaller, faster, costlier per byte at the top; larger, slower, cheaper per byte at the bottom.

    Level                                          Access time             Human-scale analogy
    registers                                      < 1 ns
    on-chip L1 cache (SRAM)                        1 ns                    5-10 s
    off-chip L2 cache (SRAM)                       5-10 ns                 1-2 min
    main memory (DRAM)                             100 ns                  15-30 min
    local secondary storage (local disks): SSD     150,000 ns              31 days
    local secondary storage (local disks): Disk    10,000,000 ns (10 ms)   66 months = 5.5 years
    remote secondary storage                       1-150 ms                1-15 years
      (distributed file systems, web servers)

An Example Memory Hierarchy: what each level holds
• CPU registers hold words retrieved from L1 cache.
• L1 cache holds cache lines retrieved from L2 cache.
• L2 cache holds cache lines retrieved from main memory.
• Main memory holds disk blocks retrieved from local disks.
• Local disks hold files retrieved from disks on remote network servers.

An Example Memory Hierarchy: who manages each level
• Registers: explicitly program-controlled (e.g., refer to exactly %rax, %rbx).
• Caches and main memory: program sees “memory”; hardware manages caching transparently.

Memory Hierarchies

• Fundamental idea of a memory hierarchy:
  - For each level k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
• Why do memory hierarchies work?
  - Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  - Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
• Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

Intel Core i7 Cache Hierarchy

Processor package:
• Core 0 … Core 3, each with: Regs, L1 d-cache, L1 i-cache, L2 unified cache
• L3 unified cache (shared by all cores)
Main memory sits outside the processor package.

• Block size: 64 bytes for all caches.
• L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
• L2 unified cache: 256 KB, 8-way, Access: 11 cycles
• L3 unified cache: 8 MB, 16-way, Access: 30-40 cycles

Making memory accesses fast!
• Cache basics
• Principle of locality
• Memory hierarchies
• Cache organization
  - Direct-mapped (sets; index + tag)
  - Associativity (ways)
  - Replacement policy
  - Handling writes
• Program optimizations that consider caches

Cache Organization
• Where should data go in the cache?
  - We need a mapping from memory addresses to specific locations in the cache to make checking the cache for an address fast
  - Otherwise each memory access requires “searching the entire cache” (slow!)
  - What is a data structure that provides fast lookup?

Aside: Hash Tables for Fast Lookup

[Figure: a hash table with 10 buckets (0-9) and a list of decimal keys to insert]

[Figure: a hash table with 8 buckets (000-111, i.e., 0-7). Insert: 000001, 000101, 110011, 101010, 100111]

Where should we put data in the cache?

[Figure: memory addresses 0000-1111 on the left; a cache with four entries, indices 00, 01, 10, 11, on the right]

• How can we compute this mapping?
  - address mod cache size
  - same as: low-order log2(cache size) bits
Where should we put data in the cache?
41
00011011
Index
0000000100100011010001010110011110001001101010111100110111101111
Data
Memory Cache
Collision.Hmm..Thecachemightgetconfusedlater!Why?Andhowdowesolvethat?

Use tags to record which location is cached

    Index   Tag   Data
    00      00
    01      ??
    10      01
    11      01

tag = rest of address bits

What’s a cache block? (or cache line)

[Figure: byte addresses 0-15 (0000-1111) grouped into blocks (lines) numbered 0-7; the blocks map onto cache indices 0-3 (00, 01, 10, 11), each cache entry with a tag]

• block/line size = ?
• typical block/line sizes: 32 bytes, 64 bytes

A puzzle.
• What can you infer from this:
  - Cache starts empty
  - Access (addr: hit/miss) stream: (12: miss), (13: hit), (14: miss)
• Inferences:
  - block size >= 2 bytes
  - block size < 3 bytes

Direct mapped cache

• Direct mapped:
  - Each memory address can be mapped to exactly one index in the cache.
  - Easy to find an address! Cheap to implement.
  - But…
• What happens if a program uses addresses 2, 6, 2, 6, 2, …?

[Figure: memory addresses 0000-1111 mapped onto cache indices 00, 01, 10, 11, each with a tag]

2 and 6 conflict, rest of cache is unused

Associativity
• What if we could store data in any place in the cache?
  - Minimize conflicts!
  - More complicated hardware, consumes more power, slower.
  - Accesses 2, 6, 2, 6, 2, …: both addresses can be cached at once (tags 0010 and 0110).
• So we combine the two ideas:
  - Each address maps to exactly one set.
  - Each set can have more than one way.

Ways of organizing 8 blocks:
    1-way: 8 sets, 1 block each (direct mapped)
    2-way: 4 sets, 2 blocks each
    4-way: 2 sets, 4 blocks each
    8-way: 1 set, 8 blocks (fully associative)

Now how do I know where data goes?

m-bit Address:
    | Tag: (m-k-n) bits | Index: k bits | Block Offset: n bits |

Example: capacity: 8 bytes, block size: 2 bytes, sets: 4, ways: 1

[Figure: byte addresses 0-15 (0000-1111) grouped into blocks (lines) numbered 0-7; cache indices 0-3 (00, 01, 10, 11), each with a tag]

Where would 13 (1101) be stored?

Example placement in set-associative caches
• Where would data from address 0x1833 be placed?
• 0x1833 in binary is: 0001 1000 0011 0011
• block size: 16 bytes; capacity: 8 blocks

m-bit Address:
    | Tag: (m-k-4) bits | Index: k bits | Block Offset: 4 bits |

    1-way associativity: 8 sets, 1 block each, k = 3
    2-way associativity: 4 sets, 2 blocks each, k = 2
    4-way associativity: 2 sets, 4 blocks each, k = 1
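
The tag/index/offset split can be written out directly with shifts and masks. The following sketch uses an illustrative `AddressParts` struct and `split_address` function (neither is from the slides) and assumes 32-bit addresses.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} AddressParts;

/* Split addr into tag | index | offset, where the block offset takes the
 * low offset_bits (n) bits and the set index the next index_bits (k) bits. */
AddressParts split_address(uint32_t addr, unsigned offset_bits,
                           unsigned index_bits) {
    assert(offset_bits + index_bits < 32);
    AddressParts parts;
    parts.offset = addr & ((1u << offset_bits) - 1);
    parts.index = (addr >> offset_bits) & ((1u << index_bits) - 1);
    parts.tag = addr >> (offset_bits + index_bits);
    return parts;
}
```

With 16-byte blocks (n = 4), address 0x1833 has block offset 3 in every configuration. It falls in set 3 for k = 3 (binary 011), set 3 for k = 2 (binary 11), and set 1 for k = 1.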

Block replacement
• Any empty block in the correct set may be used for storing data.
• If there are no empty blocks, which one should we replace?
  - Obvious for direct-mapped caches, what about set-associative?
  - Caches typically use something close to least recently used (LRU) (hardware usually implements “not most recently used”)

    1-way associativity: 8 sets, 1 block each
    2-way associativity: 4 sets, 2 blocks each
    4-way associativity: 2 sets, 4 blocks each

Another puzzle.
• What can you infer from this?
  - Cache starts empty
  - Access (addr: hit/miss) stream: (10: miss); (12: miss); (10: miss)
• Inferences:
  - 12 is not in the same block as 10
  - 12’s block replaced 10’s block
  - direct-mapped cache with <= 2 sets

General Cache Organization (S, E, B)
• S = # sets
• E = lines per set (we say “E-way”)
• B = bytes of data per cache line (the data block)
• Each line holds a valid bit (v), a tag, and data bytes 0, 1, 2, …, B-1
• cache size: S x E x B data bytes

Cache Read
• S = 2^s sets, E = lines per set, B = 2^b bytes of data per cache line (the data block)
• Each line holds a valid bit (v), a tag, and data bytes 0, 1, 2, …, B-1
• Address of byte in memory:
    | tag: t bits | set index: s bits | block offset: b bits |
• Steps:
  - Locate set
  - Check if any line in set has matching tag
  - Yes + line valid: hit
  - Locate data starting at offset (data begins at this offset)

Example: Direct-Mapped Cache (E = 1)

Direct-mapped: One line per set
Assume: cache block size 8 bytes
S = 2^s sets, each line: valid bit (v), tag, data bytes 0-7

Address of int:
    | t bits | 0…01 | 100 |

• Find set: the set index bits (0…01) select one set.
• valid? + tag match?: yes = hit
• Block offset 100: int (4 bytes) is here, in bytes 4-7 of the block.
• No match? Then old line gets evicted and replaced.

This is why we want alignment!

Example (for E = 1)

    double sum_array_rows(double a[16][16])
    {
        int i, j;
        double sum = 0;

        for (i = 0; i < 16; i++)
            for (j = 0; j < 16; j++)
                sum += a[i][j];
        return sum;
    }

    double sum_array_cols(double a[16][16])
    {
        int i, j;
        double sum = 0;

        for (j = 0; j < 16; j++)
            for (i = 0; i < 16; i++)
                sum += a[i][j];
        return sum;
    }

Assume:
• cold (empty) cache
• 3 bits for set, 5 bits for offset (32 B block = 4 doubles)
• sum, i, j in registers
• Address of an aligned element of a: aa...ayyyyxxxx000, split as aa...ayyy | yxx | xx000 (tag | set | offset)

Example addresses:
    0,0: aa...a000 000 00000
    0,4: aa...a000 001 00000
    1,0: aa...a000 100 00000
    2,0: aa...a001 000 00000

sum_array_rows:
    32 B = 4 doubles, so 4 misses per row of array
    4 * 16 = 64 misses

sum_array_cols:
    every access a miss
    16 * 16 = 256 misses
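
The two miss counts can be reproduced with a small model of the cache described on this slide. The sketch assumes the array starts at address 0 and that only accesses to `a` touch the cache; `access_cache` and `count_misses` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { DIM = 16, ELEM_BYTES = 8, BLOCK_BYTES = 32, NUM_SETS = 8 };

typedef struct {
    bool valid;
    size_t tag;
} Line;

/* Direct-mapped lookup of a byte address. Returns true on a hit and
 * installs the block on a miss. */
static bool access_cache(Line *cache, size_t addr) {
    assert(cache != NULL);
    size_t block = addr / BLOCK_BYTES; /* drop the 5 offset bits */
    size_t set = block % NUM_SETS;     /* next 3 bits */
    size_t tag = block / NUM_SETS;     /* remaining bits */
    if (cache[set].valid && cache[set].tag == tag) {
        return true;
    }
    cache[set].valid = true;
    cache[set].tag = tag;
    return false;
}

/* Misses for one pass over double a[16][16], starting from a cold cache.
 * row_major = true models sum_array_rows, false models sum_array_cols. */
int count_misses(bool row_major) {
    Line cache[NUM_SETS] = {{false, 0}};
    int misses = 0;
    for (int outer = 0; outer < DIM; outer++) {
        for (int inner = 0; inner < DIM; inner++) {
            int i = row_major ? outer : inner;
            int j = row_major ? inner : outer;
            size_t addr = ((size_t)i * DIM + (size_t)j) * ELEM_BYTES;
            if (!access_cache(cache, addr)) {
                misses++;
            }
        }
    }
    return misses;
}
```

In this model the row-wise pass misses once per 32-byte block (64 misses), while the column-wise pass keeps bouncing between two sets and misses on all 256 accesses.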

Example (for E = 1)

    float dotprod(float x[8], float y[8])
    {
        float sum = 0;
        int i;

        for (i = 0; i < 8; i++)
            sum += x[i]*y[i];
        return sum;
    }

In this example, cache blocks are 16 bytes; 8 sets in cache.
How many block offset bits? How many set index bits?

Address bits: ttt....t sss bbbb
    B = 16 = 2^b: b = 4 offset bits
    S = 8 = 2^s: s = 3 index bits

      0: 000....0 000 0000
    128: 000....1 000 0000
    160: 000....1 010 0000

If x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128, the same set is overwritten again and again:
    x[0] x[1] x[2] x[3]
    y[0] y[1] y[2] y[3]
    x[0] x[1] x[2] x[3]
    y[0] y[1] y[2] y[3]
    x[0] x[1] x[2] x[3]

If x and y have unaligned starting addresses, e.g., &x[0] = 0, &y[0] = 160, each block gets its own set:
    x[0] x[1] x[2] x[3]
    x[4] x[5] x[6] x[7]
    y[0] y[1] y[2] y[3]
    y[4] y[5] y[6] y[7]

E-way Set-Associative Cache (Here: E = 2)

E = 2: Two lines per set
Assume: cache block size 8 bytes
Each line: valid bit (v), tag, data bytes 0-7

Address of short int:
    | t bits | 0…01 | 100 |

• Find set: the set index bits (0…01) select one set.
• Compare both lines in the set: valid? + tag match: yes = hit
• Block offset 100: short int (2 bytes) is here.
• No match?
  - One line in set is selected for eviction and replacement
  - Replacement policies: random, least recently used (LRU), …

Example (for E = 2)

    float dotprod(float x[8], float y[8])
    {
        float sum = 0;
        int i;

        for (i = 0; i < 8; i++)
            sum += x[i]*y[i];
        return sum;
    }

If x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128, can still fit both because two lines in each set:
    x[0] x[1] x[2] x[3] | y[0] y[1] y[2] y[3]
    x[4] x[5] x[6] x[7] | y[4] y[5] y[6] y[7]

Types of Cache Misses: 3 C’s!
• Cold (compulsory) miss
  - Occurs on first access to a block
• Conflict miss
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot
      e.g., referencing blocks 0, 8, 0, 8, ... could miss every time
  - Direct-mapped caches have more conflict misses than n-way set-associative (where n > 1)
• Capacity miss
  - Occurs when the set of active cache blocks (the working set) is larger than the cache (just won’t fit, even if cache was fully-associative)
  - Note: Fully-associative only has Cold and Capacity misses

What about writes?
• Multiple copies of data exist:
  - L1, L2, possibly L3, main memory
• What is the main problem with that?
  - Which copies should we update?
  - What if it’s a hit? Miss?
• What to do on a write-hit?
  - Write-through: write immediately to memory, all caches in between.
  - Write-back: defer write to memory until line is evicted (replaced)
      Must track which cache lines have been modified (“dirty bit”)
• What to do on a write-miss?
  - Write-allocate (“fetch on write”): load into cache, update line in cache.
      Good if more writes or reads to the location follow
  - No-write-allocate (“write around”): just write immediately to memory.
• Typical caches (why?):
  - Write-back + Write-allocate, usually
  - Write-through + No-write-allocate, occasionally

Write-back, write-allocate example

Setup: a tiny cache with a single line (tag, data, dirty bit) in front of memory.
• There is only one set in this tiny cache, so the tag is the entire address!
• In this example we are sort of ignoring block offsets. Here a block holds 2 bytes (16 bits, 4 hex digits).
• Normally a block would be much bigger and thus there would be multiple items per block. While only one item in that block would be written at a time, the entire line would be brought into cache.

Initial state:
    Cache:  tag U, data 0xBEEF, dirty bit 0
    Memory: T: 0xCAFE, U: 0xBEEF (contents of memory stored at addresses T and U)

mov 0xFACE, T
    Step 1: Bring T into cache.
        Cache:  tag T, data 0xCAFE, dirty bit 0
    Step 2: Write 0xFACE to cache only and set dirty bit.
        Cache:  tag T, data 0xFACE, dirty bit 1
        Memory: T: 0xCAFE, U: 0xBEEF

mov 0xFEED, T
    Write hit! Write 0xFEED to cache only.
        Cache:  tag T, data 0xFEED, dirty bit 1
        Memory: T: 0xCAFE, U: 0xBEEF

mov U, %rax
    1. Write T back to memory since it is dirty.
        Memory: T: 0xFEED, U: 0xBEEF
    2. Bring U into the cache so we can copy it into %rax.
        Cache:  tag U, data 0xBEEF, dirty bit 0

Back to the Core i7 to look at ways

• Block/line size: 64 bytes for all.
• L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
• L2 unified cache: 256 KB, 8-way, Access: 11 cycles
• L3 unified cache: 8 MB, 16-way, Access: 30-40 cycles
• Moving down the hierarchy: slower, but more likely to hit

May 11
• Reminders
  - Lab 3 due tonight @ 11:59pm
  - HW 3 (mostly cache problems) due Friday

Where else is caching used?

Software Caches are More Flexible
• Examples
  - File system buffer caches, browser caches, etc.
  - Content-delivery networks (CDN): cache for the Internet (e.g., Netflix)
• Some design differences
  - Almost always fully-associative
      so, no placement restrictions
      index structures like hash tables are common (for placement)
  - More complex replacement policies
      misses are very expensive when disk or network involved
      worth thousands of cycles to avoid them
  - Not necessarily constrained to single “block” transfers
      may fetch or write-back in larger units, opportunistically

Optimizations for the Memory Hierarchy
• Write code that has locality!
  - Spatial: access data contiguously
  - Temporal: make sure access to the same data is not too far apart in time
• How can you achieve locality?
  - Proper choice of algorithm
  - Loop transformations

Example: Array of Structs

    typedef struct {
        char *name;
        int scores[4];
    } Student;

    /* Array of structs */
    Student students[N];

    /* Compute average score for assignment ‘hw’ */
    double average_score(int hw) {
        int total = 0;
        for (int s = 0; s < N; s++) {
            total += students[s].scores[hw];
        }
        return total / (double)N;
    }

Memory:
    Address   Contents
    0x00      0x60010      students[0] (name)
    0x08      10  10       (.scores[1] is the second value)
    0x10      9   8
    0x18      0x6020b      students[1] (name)
    0x20      9   9        (.scores[1] is the second value)
    0x28      9   7
    0x30      0x6031a      students[2] (name)
    0x38      10  8        (.scores[1] is the second value)
    0x40      5   6
    ...       ...

Cache: capacity: 32 bytes, block size: 16 bytes, sets: 2, ways: 1

Accesses for hw = 1:
    1. students[0].scores[1]   set: 0, tag: 000   COLD MISS
    2. students[1].scores[1]   set: 0, tag: 001   COLD MISS
    3. students[2].scores[1]   set: 1, tag: 001   COLD MISS

• Cache Miss Analysis
  - Accesses: int (4 bytes)
  - Stride: sizeof(Student) = 24
  - Cache block size: 16 bytes
  - No temporal locality, so only cold misses
  - Miss rate: 100%

Example: Array of Structs vs Struct of Arrays

    typedef struct {
        char *name;
        int scores[4];
    } Student;

    /* Array of structs */
    Student students[N];

Memory (array of structs):
    0x00      0x60010      students[0]
    0x08      10  10
    0x10      9   8
    0x18      0x6020b      students[1]
    0x20      9   9
    0x28      9   7
    0x30      0x6031a      students[2]
    0x38      10  8
    0x40      5   6
    ...       ...

    /* “Struct of arrays” */
    struct {
        char *names[N];
        int scores[4][N];
    } students;

Memory (struct of arrays):
    0x00      0x60010      students.names
    0x08      0x6020b
    0x10      0x6031a
    ...       ...
    0x80      10  9        students.scores[0]
    0x88      10  ?
    ...       ...
    0xb0      10  9        students.scores[1]
    0xb8      8   ?
    ...       ...

Disclaimer: This is not ideal style-wise, just an option for optimizing.

Example: Array of Structs vs Struct of Arrays

    /* “Struct of arrays” */
    struct {
        char *names[N];
        int scores[4][N];
    } students;

    double average_score(int hw) {
        int total = 0;
        for (int s = 0; s < N; s++) {
            total += students.scores[hw][s];
        }
        return total / (double)N;
    }

Cache: capacity: 32 bytes, block size: 16 bytes, sets: 2, ways: 1

Accesses for hw = 1 (students.scores[1] starts at 0xb0 = 1011 0000):
    1. students.scores[1][0]   set: 1, tag: 101   COLD MISS
    2. students.scores[1][1]   set: 1, tag: 101   HIT
    3. students.scores[1][2]   set: 1, tag: 101   HIT

• Cache Miss Analysis
  - Accesses: int (4 bytes)
  - Stride: 4 bytes
  - Cache block size: 16 bytes
  - 16 bytes/block / 4 bytes/int = 4 ints/block
  - Miss rate:
      1 COLD MISS/block
      4 ints/block - 1 MISS/block = 3 HIT/block
      75% HIT rate (25% miss rate)
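
The two strides in the analysis can be measured directly from the layouts. The sketch below assumes N = 16 (the slides leave N open) and renames the two variables to `aos` and `soa` so both layouts can coexist in one file; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define N 16 /* assumed class size; not given on the slides */

/* Array of structs. */
typedef struct {
    char *name;
    int scores[4];
} Student;
static Student aos[N];

/* Struct of arrays. */
static struct {
    char *names[N];
    int scores[4][N];
} soa;

/* Bytes between score hw of one student and score hw of the next. */
ptrdiff_t aos_stride(int hw) {
    assert(hw >= 0 && hw < 4);
    return (char *)&aos[1].scores[hw] - (char *)&aos[0].scores[hw];
}

ptrdiff_t soa_stride(int hw) {
    assert(hw >= 0 && hw < 4);
    return (char *)&soa.scores[hw][1] - (char *)&soa.scores[hw][0];
}

/* Average score for assignment hw using the struct-of-arrays layout. */
double soa_average_score(int hw) {
    assert(hw >= 0 && hw < 4);
    int total = 0;
    for (int s = 0; s < N; s++) {
        total += soa.scores[hw][s];
    }
    return total / (double)N;
}
```

On a typical 64-bit system `aos_stride` is `sizeof(Student)` = 24 bytes, larger than a 16-byte block, while `soa_stride` is 4 bytes, so four consecutive scores share one block.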

Example: Array of Structs vs Struct of Arrays

But don’t forget about other uses of this struct/array…

    /* “Struct of arrays” */
    struct {
        char *names[N];
        int scores[4][N];
    } students;

    double student_grade(int s) {
        int total = 0;
        for (int hw = 0; hw < 4; hw++) {
            total += students.scores[hw][s];
        }
        return total / (double)4;
    }

What would the MISS rate be for this?

Example: Matrix Multiplication (extra example, skipping in class)

    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

[Figure: c = a * b; element (i, j) of c is computed from row i of a and column j of b]

Memory access pattern?

Cache Miss Analysis (extra example, skipping in class)
• Assume:
  - Matrix elements are doubles
  - Cache block = 64 bytes = 8 doubles
  - Cache size C << n (much smaller than n, not left-shifted by n)
• First iteration:
  - n/8 + n = 9n/8 misses (omitting matrix c)
      n/8 misses for the row of a (spatial locality: chunks of 8 items in a row in same cache line)
      n misses for the column of b (each item in column in different cache line)
  - Afterwards in cache (schematic): the row of a, and a strip of b 8 doubles wide
• Other iterations:
  - Again: n/8 + n = 9n/8 misses (omitting matrix c)
• Total misses:
  - 9n/8 * n^2 = (9/8) * n^3 (once per element)

Blocked Matrix Multiplication (extra example, skipping in class)

    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k, i1, j1, k1;
        for (i = 0; i < n; i+=B)
            for (j = 0; j < n; j+=B)
                for (k = 0; k < n; k+=B)
                    /* B x B mini matrix multiplications */
                    for (i1 = i; i1 < i+B; i1++)
                        for (j1 = j; j1 < j+B; j1++)
                            for (k1 = k; k1 < k+B; k1++)
                                c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];
    }

[Figure: c = a * b computed block by block; block size B x B]
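
A quick way to gain confidence in the blocked version is to compare it against the straightforward triple loop. The sketch below makes the block size a parameter `bs` instead of the macro B and assumes n is a multiple of `bs`, as the slide code implicitly does; `mmm_naive`, `mmm_blocked`, and `mmm_versions_agree` are illustrative names.

```c
#include <assert.h>
#include <stdlib.h>

/* Straightforward i-j-k product; c must start zeroed. */
void mmm_naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Blocked product with bs x bs tiles; c must start zeroed. */
void mmm_blocked(const double *a, const double *b, double *c, int n, int bs) {
    assert(bs > 0 && n % bs == 0);
    for (int i = 0; i < n; i += bs)
        for (int j = 0; j < n; j += bs)
            for (int k = 0; k < n; k += bs)
                /* bs x bs mini matrix multiplication */
                for (int i1 = i; i1 < i + bs; i1++)
                    for (int j1 = j; j1 < j + bs; j1++)
                        for (int k1 = k; k1 < k + bs; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

/* Returns 1 if both versions produce identical results on a small
 * integer-valued test problem, 0 if they differ, -1 if allocation fails. */
int mmm_versions_agree(int n, int bs) {
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c1 = calloc(count, sizeof *c1);
    double *c2 = calloc(count, sizeof *c2);
    int result = -1;
    if (a && b && c1 && c2) {
        for (size_t x = 0; x < count; x++) {
            a[x] = (double)(x % 7); /* small integers keep sums exact */
            b[x] = (double)(x % 5);
        }
        mmm_naive(a, b, c1, n);
        mmm_blocked(a, b, c2, n, bs);
        result = 1;
        for (size_t x = 0; x < count; x++) {
            if (c1[x] != c2[x]) {
                result = 0;
            }
        }
    }
    free(a);
    free(b);
    free(c1);
    free(c2);
    return result;
}
```

The test data uses small integer values so that the different summation order of the two versions cannot introduce rounding differences.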

Cache Miss Analysis (extra example, skipping in class)
• Assume:
  - Cache block = 64 bytes = 8 doubles
  - Cache size C << n (much smaller than n)
  - Three blocks fit into cache: 3B^2 < C
• First (block) iteration:
  - B^2/8 misses for each block (B^2 elements per block, 8 per cache line)
  - 2n/B * B^2/8 = nB/4 (omitting matrix c) (n/B blocks per row, n/B blocks per column)
  - Afterwards in cache (schematic): a row of n/B blocks of a and a column of n/B blocks of b, block size B x B
• Other (block) iterations:
  - Same as first iteration
  - 2n/B * B^2/8 = nB/4
• Total misses:
  - nB/4 * (n/B)^2 = n^3/(4B)

Summary (extra example, skipping in class)
• No blocking: (9/8) * n^3
• Blocking: 1/(4B) * n^3
• If B = 8, difference is 4 * 8 * 9/8 = 36x
• If B = 16, difference is 4 * 16 * 9/8 = 72x
• Suggests largest possible block size B, but limit 3B^2 < C!
• Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
      Input data: 3n^2, computation 2n^3
      Every array element used O(n) times!
  - But program has to be written properly

Cache-Friendly Code
• Programmer can optimize for cache performance
  - How data structures are organized
  - How data are accessed
      Nested loop structure
      Blocking (previous example) is a general technique
• All systems favor “cache-friendly code”
  - Getting absolute optimum performance is very platform specific
      Cache sizes, line sizes, associativities, etc.
  - Can get most of the advantage with generic code
      Keep working set reasonably small (temporal locality)
      Use small strides (spatial locality)
      Focus on inner loop code

The Memory Mountain

[Figure: read throughput (MB/s, 0 to 16000) plotted against working-set size (bytes, 32k to 128m) and stride (x8 bytes, s1 to s11). Measured on a Core i7 Haswell, 2.1 GHz, 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache, 64 B block size. Labeled features: slopes of spatial locality; ridges of temporal locality (L1, L2, L3, Mem); aggressive prefetching.]