Memory Hierarchy: Cache - Wellesley Collegecs240/s16/slides/cache.pdfAnother puzzle: Cache starts...

Post on 10-Jun-2020

5 views 0 download

Transcript of Memory Hierarchy: Cache - Wellesley Collegecs240/s16/slides/cache.pdfAnother puzzle: Cache starts...

MemoryHierarchy:Cache

MemoryhierarchyCachebasicsLocalityCacheorganizationCache-aware programming

Howdoesexecution timegrowwithSIZE?int[] array = new int[SIZE];fillArrayRandomly(array); int s = 0;

for (int i = 0; i < 200000; i++) {for (int j = 0; j < SIZE; j++) {s += array[j];

}}

3SIZE

TIME

reality beyondO(...)

4

0

5

10

15

20

25

30

35

40

45

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

SIZE

Time

Processor-MemoryBottleneck

5

MainMemory

CPU Reg

Processorperformancedoubledaboutevery18months Busbandwidth

evolvedmuchslower

Bandwidth:256bytes/cycleLatency:1-fewcycles

Bandwidth:2Bytes/cycleLatency:100cycles

Solution:caches

Cache

Example

CacheEnglish:n.ahiddenstoragespaceforprovisions,weapons,ortreasuresv.tostoreawayinhidingforfutureuse

ComputerScience:n.acomputermemorywithshortaccesstimeusedtostorefrequentlyorrecentlyusedinstructionsordatav. tostore[data/instructions]temporarilyforlaterquickretrieval

AlsousedmorebroadlyinCS:softwarecaches,filecaches,etc.

6

GeneralCacheMechanics

7

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

8 9 14 3Cache

Memory Larger, slower, cheaper.Partitioned intoblocks (lines).

Dataismovedinblockunits

Smaller, faster,moreexpensive.Storessubsetofmemoryblocks.

(lines)

CPU Block: unitofdataincacheandmemory.(a.k.a.line)

CacheHit

8

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

8 9 14 3Cache

Memory

1.Requestdatainblock b.Request:14

142.Cachehit:

Blockbisincache.

CPU

9

CacheMiss

9

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

8 9 14 3Cache

Memory

1.Request datainblockb.Request:12

2.Cache miss:blockisnot incache

4.Cachefill:Fetchblockfrommemory,storeincache.

Request:12

12

12

9

9

12

3.Cacheeviction:Evictablocktomakeroom,maybestoretomemory.

PlacementPolicy:where toputblockincache

ReplacementPolicy:whichblocktoevict

CPU

Locality:whycacheswork

Programstendtousedataandinstructionsataddressesnearorequaltothosetheyhaveusedrecently.

Temporallocality:Recently referenced itemsarelikelytobereferenced againinthenear future.

Spatiallocality:Itemswithnearbyaddresses arelikelytobereferenced closetogether intime.

Howdocachesexploittemporalandspatiallocality?

10

block

block

Locality#1

Data:Temporal:sum referenced ineachiterationSpatial:arraya[] accessed instride-1 pattern

Instructions:Temporal:execute looprepeatedlySpatial:execute instructions insequence

Assessinglocalityincodeisanimportantprogrammingskill.

11

sum = 0;for (i = 0; i < n; i++) {

sum += a[i];}return sum;

Whatisstoredinmemory?

Locality#2

12

a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

1:a[0][0]2:a[0][1]3:a[0][2]4:a[0][3]5:a[1][0]6:a[1][1]7:a[1][2]8:a[1][3]9:a[2][0]10:a[2][1]11:a[2][2]12:a[2][3]

stride1

int sum_array_rows(int a[M][N]) {int sum = 0;

for (int i = 0; i < M; i++) {for (int j = 0; j < N; j++) {

sum += a[i][j];}

}return sum;

}

row-majorMxN2DarrayinC

Locality#3

13

int sum_array_cols(int a[M][N]) {int sum = 0;

for (int j = 0; j < N; j++) {for (int i = 0; i < M; i++) {

sum += a[i][j];}

}return sum;

}

1:a[0][0]2:a[1][0]3:a[2][0]4:a[0][1]5:a[1][1]6:a[2][1]7:a[0][2]8:a[1][2]9:a[2][2]10:a[0][3]11:a[1][3]12:a[2][3]

strideN

row-majorMxN2DarrayinC

…a[0][0] a[0][1] a[0][2] a[0][3]a[1][0] a[1][1] a[1][2] a[1][3]a[2][0] a[2][1] a[2][2] a[2][3]

Locality#4

Whatis"wrong"withthiscode?Howcanitbefixed?

14

int sum_array_3d(int a[M][N][N]) {int sum = 0;

for (int i = 0; i < N; i++) {for (int j = 0; j < N; j++) {

for (int k = 0; k < M; k++) {sum += a[k][i][j];

}}

}return sum;

}

CostofCacheMissesHugedifferencebetweenahitandamiss

Couldbe100x,ifjustL1andmainmemory

99%hitscouldbetwiceasgoodas97%.How?Assumecachehittimeof1cycle,misspenaltyof100cycles

Meanaccess time:97%hits:1cycle+0.03*100cycles=4cycles99%hits:1cycle+0.01*100cycles=2cycles

15

hit/miss rates

CachePerformanceMetrics

MissRateFractionofmemoryaccesses todatanotincache (misses /accesses)Typically: 3%- 10%forL1;maybe<1% forL2,depending onsize,etc.

HitTimeTimetofindanddeliverablockinthecachetotheprocessor.Typically:1- 2clockcyclesforL1;5- 20clockcyclesforL2

MissPenaltyAdditional timerequired oncachemiss=mainmemoryaccess timeTypically50- 200cyclesforL2 (trend:increasing!)

16

Memory

memoryhierarchywhydoesitwork?

persistentstorage(harddisk, flash,overnetwork,cloud,etc.)

mainmemory(DRAM)

L3cache(SRAM,off-chip)

L1cache(SRAM,on-chip)

L2cache(SRAM,on-chip)

registerssmall,fast,power-hungry,expensive

large,slow,power-efficient,cheap

programsees“memory”;hardwaremanagescaching

transparently

explicitlyprogram-controlled

Cache Organization:KeyPointsBlockFixed-sizeunitofdata inmemory/cache

PlacementPolicyWhereshouldagivenblockbestoredinthecache?

§ direct-mapped, setassociative

ReplacementPolicyWhatifthereisnoroominthecacheforrequesteddata?

§ leastrecentlyused,mostrecentlyused

WritePolicyWhenshouldwritesupdatelowerlevelsofmemoryhierarchy?

§ writeback,writethrough,writeallocate,nowriteallocate

Blocks 00000000

00001000

00010000

00011000

Memory(byte)address

00010010

Dividememory intofixed-sizealignedblocks.powerof2

fullbyteaddress

BlockIDaddressbits- offsetbits

offsetwithinblocklog2(blocksize)

Example:blocksize=8

block

0

block

1

block

2

block

3

00010001000100100001001100010100000101010001011000010111

rememberwithinSameBlock?(PointersLab) ...

Note:draw

ingaddressorderdifferentlyfromhereon!

PlacementPolicy

00011011

IndexCache

S=#slots=4

Small,fixednumberofblockslots.

Large,fixednumberofblockslots.

Memory Mapping:index(BlockID)=???BlockID

0000000100100011010001010110011110001001101010111100110111101111

Placement:Direct-Mapped

21

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Memory Mapping:index(BlockID)=BlockIDmod SBlockID

Cache

S=#slots=4

(easyforpower-of-2blocksizes...)

Placement:mappingambiguity

22

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Memory

Whichblockisinslot2?

BlockID

Cache

S=#slots=4

Mapping:index(BlockID)=BlockIDmod S

Placement:Tagsresolveambiguity

23

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Memory

BlockIDbitsnotusedforindex.

BlockID

Tag Data00110101

Cache

S

Mapping:index(BlockID)=BlockIDmod S

Address=Tag,Index,Offset

00010010 fullbyteaddress

BlockIDAddressbits - Offsetbits

Offsetwithinblocklog2(blocksize)=b

#addressbits

BlockIDbits - IndexbitsTag

log2(# cacheslots)Index

a-bitAddresssbits(a-s-b) bits bbits

OffsetTag Index

Wherewithinablock?

Whatslot inthecache?Disambiguates slotcontents.

Placement:Direct-Mapped

25

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

Memory

(stilleasyforpower-of-2blocksizes...)

BlockID

Cache

Whynotthismapping?index(BlockID)=BlockID/ S

Apuzzle.

Cachestartsempty.Access(address,hit/miss)stream:

(10,miss),(11,hit),(12,miss)

Whatcouldtheblocksizebe?

26

blocksize>=2bytes blocksize<8bytes

Placement:directmappingconflicts

Whathappenswhenaccessinginrepeatedpattern:0010,0110,0010,0110,0010...?

27

00011011

Index

0000000100100011010001010110011110001001101010111100110111101111

BlockID

cacheconflictEveryaccess suffers amiss,evictscachelineneededbynextaccess.

Placement:SetAssociative

28

0

1

2

3

Set

2-way4sets,

2blockseach

0

1

Set

4-way2sets,

4blockseach

01234567

Set

1-way8sets,

1blockeach

directmapped

0

Set

8-way1set,

8blocks

fullyassociative

Mapping:index(BlockID)=BlockIDmod S

S=#slotsincachesets

Indexperset ofblockslots.Storeblockinany slotwithinset.

Replacementpolicy:ifset isfull,whatblockshouldbereplaced?Common: leastrecentlyused(LRU)buthardwareusually implements “notmostrecentlyused”

Example:Tag,Index,Offset?

index(1101)=____

4-bitAddress OffsetTag Index

tagbits ____setindexbits ____blockoffsetbits____

Direct-mapped4slots2-byteblocks

Example:Tag,Index,Offset?

16-bitAddress OffsetTag IndexE-wayset-associativeS slots16-byteblocks

01234567

Set

0

1

2

3

Set

0

1

Set

E=1-wayS=8sets

E=2-wayS=4sets

E=4-wayS=2sets

tagbits ____setindexbits ____blockoffsetbits ____index(0x1833) ____

tagbits ____setindexbits ____blockoffsetbits ____index(0x1833) ____

tagbits ____setindexbits ____blockoffsetbits ____index(0x1833) ____

ReplacementPolicyIfsetisfull,whatblockshouldbereplaced?

Common: leastrecentlyused(LRU)(buthardwareusually implements “notmostrecently used”

Anotherpuzzle:Cachestartsempty,usesLRU.Access(address,hit/miss)stream

(10,miss);(12,miss);(10,miss)

31

12isnotinthesameblockas10 12’sblockreplaced10’sblock

direct-mapped cacheassociativity ofcache?

GeneralCacheOrganization (S,E,B)

32

Elinesperset(“E-way”)

Ssets

set

block/line

0 1 2 B-1tagv

validbit B =2b bytesofdatapercacheline(thedatablock)

cachecapacity:SxExBdatabytesaddresssize:t+s+baddressbits

Powersof2

CacheRead

33

E=2e lines perset

S=2s sets

0 1 2 B-1tag1

validbitB=2b bytesofdatapercacheline(thedatablock)

tbits sbits bbitsAddressofbyteinmemory:

tag setindex

blockoffset

databeginsatthisoffset

LocatesetbyindexHitifanyblock inset:

is valid;andhas matching tag

Getdataatoffsetinblock

CacheRead:Direct-Mapped (E=1)

34

S=2s sets

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

0 1 2 7tagv 3 654

findset

Thiscache:• Blocksize:8bytes• Associativity: 1blockperset(directmapped)

CacheRead:Direct-Mapped (E=1)

35

tbits 0…01 100Addressofint:

0 1 2 7tagv 3 654

match?:yes=hitvalid?+

blockoffset

tag 7654

int (4Bytes) ishere

Ifnomatch:old line isevictedandreplaced

Thiscache:• Blocksize:8bytes• Associativity: 1blockperset(directmapped)

Direct-MappedCachePractice

12-bitaddress16lines,4-byteblocksizeDirectmapped

36

11 10 9 8 7 6 5 4 3 2 1 0

03DFC2111167––––03161DF0723610D5

098F6D431324––––03630804020011B2––––0151112311991190B3B2B1B0ValidTagIndex

––––014FD31B7783113E15349604116D

––––012C––––00BB3BDA159312DA––––02D98951003A1248B3B2B1B0ValidTagIndex

0x354

0xA20

Offsetbits? Indexbits?Tagbits?

Example (E=1)

37

int sum_array_rows(double a[16][16]){double sum = 0;

for (int r = 0; r < 16; r++){for (int c = 0; c < 16; c++){

sum += a[r][c];}

}return sum;

}

32bytes=4doubles

Assume: cold(empty)cache3-bitsetindex,5-bitoffset

aa...arrr rcc cc000

int sum_array_cols(double a[16][16]){double sum = 0;

for (int c = 0; c < 16; c++){for (int r = 0; r < 16; r++){

sum += a[r][c];}

}return sum;

}

Localsinregisters.Assume a is aligned such that&a[r][c] is aa...a rrrr cccc 000

0,0 0,1 0,2 0,3

0,4 0,5 0,6 0,7

0,8 0,9 0,a 0,b

0,c 0,d 0,e 0,f

1,0 1,1 1,2 1,3

1,4 1,5 1,6 1,7

1,8 1,9 1,a 1,b

1,c 1,d 1,e 1,f

32bytes=4doubles

4missesperrowofarray4*16=64misses

everyaccessamiss16*16=256misses

0,0 0,1 0,2 0,3

1,0 1,1 1,2 1,3

2,0 2,1 2,2 2,3

3,0 3,1 3,2 3,3

4,0 4,1 4,2 4,3

0,0:aa...a000 000 000000,4:aa...a000 001 000001,0:aa...a000 100 000002,0:aa...a001 000 00000

Example (E=1)

38

int dotprod(int x[8], int y[8]) {int sum = 0;

for (int i = 0; i < 8; i++) {sum += x[i]*y[i];

}return sum;

}

x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]y[0] y[1] y[2] y[3]x[0] x[1] x[2] x[3]

ifxandyaremutuallyaligned,e.g.,0x00,0x80

ifxandyaremutuallyunaligned,e.g.,0x00,0xA0

x[0] x[1] x[2] x[3]

y[0] y[1] y[2] y[3]

x[4] x[5] x[6] x[7]

y[4] y[5] y[6] y[7]

block=16bytes;8sets incacheHowmanyblockoffsetbits?Howmanysetindexbits?

Addressbits:ttt....tsssbbbbB=16=2b: b=4offsetbitsS=8=2s: s=3indexbits

Addresses asbits0x00000000: 000....000000000x00000080: 000....100000000x000000A0: 000....1010000016bytes=4ints

CacheRead:Set-Associative (Example:E=2)

39

tbits 0…01 100Addressofint:

findset

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

Thiscache:• Blocksize:8bytes• Associativity: 2blocksperset

0 1 2 7tagv 3 6540 1 2 7tagv 3 654

CacheRead:Set-Associative (Example:E=2)

40

Thiscache:• Blocksize:8bytes• Associativity: 2blocksperset

tbits 0…01 100Addressofint:

compareboth

valid?+ match:yes=hit

blockoffset

tag 7654

int (4Bytes) ishere

Ifnomatch:Evictandreplaceone line inset.

Example (E=2)

42

float dotprod(float x[8], float y[8]) {float sum = 0;

for (int i = 0; i < 8; i++) {sum += x[i]*y[i];

}return sum;

}

x[0] x[1] x[2] x[3] y[0] y[1] y[2] y[3]Ifxandyaligned,e.g.&x[0]=0,&y[0]=128,canstillfitbothbecauseeachsethasspacefortwoblocks/lines

x[4] x[5] x[6] x[7] y[4] y[5] y[6] y[7]4sets

2blocks/lines perset

TypesofCacheMisses

Cold(compulsory)miss

Conflictmiss

Capacitymiss

Whichonescanwemitigate/eliminate?How?

43

WritingtocacheMultiplecopiesofdataexist,mustbekeptinsync.

Write-hitpolicyWrite-through:Write-back:needsadirtybit

Write-misspolicyWrite-allocate:No-write-allocate:

Typicalcaches:Write-back+Write-allocate, usuallyWrite-through +No-write-allocate, occasionally

44

Write-back,write-allocateexample

45

0xCAFECache

Memory

U

0xFACE

0xCAFE

0

T

U

dirtybittag

1. mov $T,%ecx2. mov $U,%edx3. mov $0xFEED,(%ecx)

a. MissonT.

eax = 0xCAFEecx =Tedx =U

Cache/memorynotinvolved

Write-back,write-allocateexample

46

Cache

Memory 0xFACE

0xCAFE

T

U

dirtybit

1. mov $T,%ecx2. mov $U,%edx3. mov $0xFEED,(%ecx)

a. MissonT.b. EvictU(clean:discard).c. FillT(write-allocate).d. WriteTincache(dirty).

4. mov (%edx),%eaxa. MissonU.tag

T 00xFACE0xFEED 1

eax = 0xCAFEecx =Tedx =U

Write-back,write-allocateexample

47

0xCAFECache

Memory

U

0xFACE

0xCAFE

0

T

U

dirtybittag

eax = 0xCAFEecx =Tedx =U

1. mov $T,%ecx2. mov $U,%edx3. mov $0xFEED,(%ecx)

a. MissonT.b. EvictU(clean:discard).c. FillT(write-allocate).d. WriteTincache(dirty).

4. mov (%edx),%eaxa. MissonU.b. EvictT(dirty:writeback).c. FillU.d. Set%eax.

5. DONE.0xFEED

0xCAFE

ExampleMemoryHierarchy

48

Regs

L1d-cache

L1i-cache

L2unified cache

Core0

Regs

L1d-cache

L1i-cache

L2unified cache

Core3

L3unified cache(sharedbyallcores)

Mainmemory

Processorpackage

L1i-cacheandd-cache:32KB,8-way,Access: 4cycles

L2unified cache:256KB,8-way,Access: 11cycles

L3unified cache:8MB,16-way,Access: 30-40cycles

Blocksize:64bytesforallcaches.

slower,butmorelikelytohit

Typicallaptop/desktopprocessor(alwayschanging)

Aside:softwarecachesExamples

Filesystembuffercaches,webbrowser caches,databasecaches,networkCDNcaches,etc.

SomedesigndifferencesAlmostalwaysfully-associative

Oftenusecomplexreplacement policies

Notnecessarily constrained tosingle“block”transfers

49

Cache-FriendlyCodeLocality,locality,locality.Programmercanoptimizeforcacheperformance

Datastructure layoutDataaccesspatterns

Nested loopsBlocking(seeCSAPP6.5)

Allsystemsfavor“cache-friendlycode”Performance ishardware-specificGeneric rulescapturemostadvantages

Keepworkingsetsmall(temporal locality)Usesmallstrides (spatial locality)Focusoninnerloopcode

50