CS 240 Stage 3: Abstractions for Practical Systems
Slides: cs240/f19/slides/cache.pdf (49 pages)

Page 1: CS 240 Stage 3: Abstractions for Practical Systems

Caching and the memory hierarchy
Operating systems and the process model
Virtual memory
Dynamic memory allocation
Victory lap

Page 2: Memory Hierarchy: Cache

Memory hierarchy
Cache basics
Locality
Cache organization
Cache-aware programming

Page 3: How does execution time grow with SIZE?

    int array[SIZE];
    fillArrayRandomly(array);
    int s = 0;

    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < SIZE; j++) {
            s += array[j];
        }
    }

[Plot: TIME (y-axis) vs. SIZE (x-axis)]

Page 4: Reality

[Plot: Time (y-axis, 0 to 45) vs. SIZE (x-axis, 0 to 9000)]

Page 5: Processor-Memory Bottleneck

[Diagram: CPU with registers (Reg), cache, and main memory connected by a bus]

Processor performance doubled about every 18 months.
Bus bandwidth evolved much slower.

Example:
    CPU and registers: bandwidth 256 bytes/cycle, latency 1 to a few cycles
    Main memory: bandwidth 2 bytes/cycle, latency 100 cycles

Solution: caches

Page 6: Cache

English:
    n. a hidden storage space for provisions, weapons, or treasures
    v. to store away in hiding for future use

Computer Science:
    n. a computer memory with short access time used to store frequently or recently used instructions or data
    v. to store [data/instructions] temporarily for later quick retrieval

Also used more broadly in CS: software caches, file caches, etc.

Page 7: General Cache Mechanics

[Diagram: CPU; cache holding blocks 8, 9, 14, 3; memory with blocks 0 to 15]

Cache: smaller, faster, more expensive. Stores a subset of memory blocks (lines).
Memory: larger, slower, cheaper. Partitioned into blocks (lines).
Data is moved in block units.

Block: unit of data in cache and memory (a.k.a. line).

Page 8: Cache Hit

[Diagram: CPU; cache holding blocks 8, 9, 14, 3; memory with blocks 0 to 15]

1. Request data in block b. (Request: 14)
2. Cache hit: block b is in the cache. (Block 14 is found in the cache.)

Page 9: Cache Miss

[Diagram: CPU; cache holding blocks 8, 9, 14, 3; memory with blocks 0 to 15; block 12 replaces block 9 in the cache]

1. Request data in block b. (Request: 12)
2. Cache miss: block is not in the cache.
3. Cache eviction: evict a block to make room, maybe store it to memory. (Here block 9 is evicted.)
4. Cache fill: fetch the block from memory, store it in the cache. (Block 12 is fetched.)

Placement policy: where to put the block in the cache.
Replacement policy: which block to evict.

Page 10: Locality: why caches work

Programs tend to use data and instructions at addresses near or equal to those they have used recently.

Temporal locality: recently referenced items are likely to be referenced again in the near future.

Spatial locality: items with nearby addresses are likely to be referenced close together in time.

How do caches exploit temporal and spatial locality?

Page 11: Locality #1

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;

What is stored in memory?

Data:
    Temporal: sum referenced in each iteration
    Spatial: array a[] accessed in stride-1 pattern

Instructions:
    Temporal: execute loop repeatedly
    Spatial: execute instructions in sequence

Assessing locality in code is an important programming skill.

Page 12: Locality #2

    int sum_array_rows(int a[M][N]) {
        int sum = 0;

        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Row-major M x N 2D array in C:

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

Access order (stride 1):

    1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
    5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
    9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]

Page 13: Locality #3

    int sum_array_cols(int a[M][N]) {
        int sum = 0;

        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Row-major M x N 2D array in C:

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

Access order (stride N):

    1: a[0][0]   2: a[1][0]   3: a[2][0]
    4: a[0][1]   5: a[1][1]   6: a[2][1]
    7: a[0][2]   8: a[1][2]   9: a[2][2]
   10: a[0][3]  11: a[1][3]  12: a[2][3]

Page 14: Locality #4

What is "wrong" with this code? How can it be fixed?

    int sum_array_3d(int a[M][N][N]) {
        int sum = 0;

        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                for (int k = 0; k < M; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }
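
One possible answer, offered as a sketch rather than the slide's official solution: the innermost loop above varies the first index k, so consecutive accesses are N*N ints apart. Reordering the loops so that the last index varies fastest gives a stride-1 pattern. The name sum_array_3d_fixed and the concrete values of M and N are assumptions made only so the example compiles.

```c
/* Illustrative sizes (assumptions, not from the slides). */
#define M 3
#define N 4

/* Loop order follows the row-major layout: the last index (j) varies
   fastest, so consecutive accesses touch adjacent ints (stride 1). */
int sum_array_3d_fixed(int a[M][N][N]) {
    int sum = 0;

    for (int k = 0; k < M; k++) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[k][i][j];
            }
        }
    }
    return sum;
}
```

The result is the same as the original function; only the order of the memory accesses changes.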

Page 15: Cost of Cache Misses

Miss cost could be 100 x hit cost.

99% hits could be twice as good as 97%. How?
Assume cache hit time of 1 cycle, miss penalty of 100 cycles.

Mean access time (using hit/miss rates):
    97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
    99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
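
The arithmetic above can be written as a one-line function. This is a minimal sketch; the name mean_access_time and its parameter names are illustrative, not from the slides.

```c
/* Mean access time = hit time + miss rate * miss penalty (all in cycles). */
double mean_access_time(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

Calling it with a hit time of 1, a miss penalty of 100, and miss rates of 0.03 and 0.01 reproduces the two cases on the slide, up to floating-point rounding.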

Page 16: Cache Performance Metrics

Miss Rate
    Fraction of memory accesses to data not in cache (misses / accesses).
    Typically: 3% - 10% for L1; maybe < 1% for L2, depending on size, etc.

Hit Time
    Time to find and deliver a block in the cache to the processor.
    Typically: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2.

Miss Penalty
    Additional time required on cache miss = main memory access time.
    Typically 50 - 200 cycles for L2 (trend: increasing!).

Page 17: Memory hierarchy: why does it work?

From small, fast, power-hungry, expensive (top) to large, slow, power-efficient, cheap (bottom):

    registers (explicitly program-controlled)
    L1 cache (SRAM, on-chip)
    L2 cache (SRAM, on-chip)
    L3 cache (SRAM, off-chip)
    main memory (DRAM)
    persistent storage (hard disk, flash, over network, cloud, etc.)

Page 18: Cache Organization: Key Points

Block
    Fixed-size unit of data in memory/cache.

Placement Policy
    Where in the cache should a given block be stored?
    § direct-mapped, set associative

Replacement Policy
    What if there is no room in the cache for requested data?
    § least recently used, most recently used

Write Policy
    When should writes update lower levels of the memory hierarchy?
    § write back, write through, write allocate, no write allocate

Page 19: Blocks

Divide address space into fixed-size (power of 2), aligned blocks.

Example: block size = 8

    Memory (byte) address    Block
    00000000                 block 0
    00001000                 block 1
    00010000                 block 2
    00010001 ... 00010111    (remaining bytes of block 2)
    00011000                 block 3

Full byte address: 00010010
    Block ID: address bits - offset bits (here 00010)
    Offset within block: log2(block size) bits (here 010)

Remember withinSameBlock? (Pointers Lab)

Note: drawing address order differently from here on!
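
A minimal sketch of the block ID and offset split for the 8-byte-block example above. The function names are illustrative; in particular, within_same_block is a stand-in written for this example and is not the Pointers Lab code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Block size must be a power of two; BLOCK_BITS = log2(block size). */
enum { BLOCK_BITS = 3, BLOCK_SIZE = 1 << BLOCK_BITS };  /* 8-byte blocks */

/* Block ID: the address with its offset bits shifted away. */
uint32_t block_id(uint32_t addr) {
    return addr >> BLOCK_BITS;
}

/* Offset: the low log2(block size) bits of the address. */
uint32_t block_offset(uint32_t addr) {
    return addr & (BLOCK_SIZE - 1);
}

/* Two addresses share a block exactly when their block IDs match. */
bool within_same_block(uint32_t a, uint32_t b) {
    return block_id(a) == block_id(b);
}
```

For the address 00010010 (0x12), block_id gives 2 and block_offset gives 2.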

Page 20: Placement Policy

Cache: small, fixed number of block slots.
    S = # slots = 4
    Index: 00, 01, 10, 11

Memory: large, fixed number of block slots.
    Block ID: 0000, 0001, 0010, ..., 1111

Mapping: index(Block ID) = ???

Page 21: Placement: Direct-Mapped

Cache: S = # slots = 4; Index: 00, 01, 10, 11
Memory: Block ID: 0000, 0001, 0010, ..., 1111

Mapping: index(Block ID) = Block ID mod S
(easy for power-of-2 block sizes...)

Page 22: Placement: mapping ambiguity

Cache: S = # slots = 4; Index: 00, 01, 10, 11
Memory: Block ID: 0000, 0001, 0010, ..., 1111

Mapping: index(Block ID) = Block ID mod S

Which block is in slot 2?

Page 23: Placement: Tags resolve ambiguity

Memory: Block ID: 0000, 0001, 0010, ..., 1111

Cache (S slots):

    Index  Tag  Data
    00     00
    01     11
    10     01
    11     01

Tag: Block ID bits not used for index.

Mapping: index(Block ID) = Block ID mod S

Page 24: Address = Tag, Index, Offset

Full byte address (example): 00010010

a-bit address:

    | Tag: (a-s-b) bits | Index: s bits | Offset: b bits |

Block ID: address bits - offset bits.
Offset within block: log2(block size) = b bits. Where within a block?
Index: log2(# cache slots) = s bits. What slot in the cache?
Tag: Block ID bits - index bits. Disambiguates slot contents.
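
The tag/index/offset split can be sketched with shifts and masks. The struct AddrParts and the function split_address are hypothetical names introduced for illustration; s and b are the index and offset bit counts from the layout above.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} AddrParts;

/* Split an address into tag, index, and offset, given s index bits
   and b offset bits. Assumes s + b < 32. */
AddrParts split_address(uint32_t addr, unsigned s, unsigned b) {
    AddrParts p;
    p.offset = addr & ((1u << b) - 1);         /* low b bits */
    p.index  = (addr >> b) & ((1u << s) - 1);  /* next s bits */
    p.tag    = addr >> (s + b);                /* remaining high bits */
    return p;
}
```

Assuming 8-byte blocks (b = 3) and 4 slots (s = 2), the address 00010010 splits into tag 000, index 10, offset 010.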

Page 25: Placement: Direct-Mapped

Cache: Index: 00, 01, 10, 11
Memory: Block ID: 0000, 0001, 0010, ..., 1111

Why not this mapping? index(Block ID) = Block ID / S
(still easy for power-of-2 block sizes...)

Page 26: A puzzle.

Cache starts empty. Access (address, hit/miss) stream:

    (10, miss), (11, hit), (12, miss)

What could the block size be?

    block size >= 2 bytes
    block size < 8 bytes
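
The bounds above can be checked by brute force. This sketch assumes aligned blocks, so that address A lies in block A / B for block size B; the function name is made up for this example.

```c
#include <stdbool.h>

/* True if block size B (in bytes) is consistent with the stream
   (10, miss), (11, hit), (12, miss) on an initially empty cache:
   11 must share a block with 10, and 12 must fall in a different block. */
bool block_size_fits_puzzle(unsigned B) {
    return (10 / B == 11 / B) && (12 / B != 10 / B);
}
```

Among power-of-2 sizes, only B = 2 and B = 4 pass: with B = 1 the access to 11 would miss, and with B = 8 or more the access to 12 would hit.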

Page 27: Placement: direct mapping conflicts

Cache: Index: 00, 01, 10, 11
Memory: Block ID: 0000, 0001, 0010, ..., 1111

What happens when accessing in a repeated pattern: 0010, 0110, 0010, 0110, 0010...?

Cache conflict: every access suffers a miss and evicts the cache line needed by the next access.
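
The conflict can be reproduced with a small simulation. This is a sketch under the diagram's assumptions (4 slots, cache starts empty); for simplicity each slot stores the whole block ID rather than just the tag bits, and the function name is illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_SLOTS = 4 };  /* S = 4 slots, as in the diagram */

/* Count misses for a stream of block IDs in a direct-mapped cache
   that starts empty. Each slot remembers which block it holds. */
size_t count_misses_direct_mapped(const unsigned *block_ids, size_t n) {
    bool valid[NUM_SLOTS] = { false };
    unsigned held[NUM_SLOTS] = { 0 };
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned slot = block_ids[i] % NUM_SLOTS;  /* index = Block ID mod S */
        if (!valid[slot] || held[slot] != block_ids[i]) {
            misses++;                              /* miss: evict and fill */
            valid[slot] = true;
            held[slot] = block_ids[i];
        }
    }
    return misses;
}
```

Block IDs 0010 (2) and 0110 (6) both map to slot 2, so the stream 2, 6, 2, 6, 2 misses on all five accesses even though the other three slots stay empty.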

Page 28: Placement: Set Associative

Mapping: index(Block ID) = Block ID mod S, where S = # sets in cache.

One index per set of block slots. Store block in any slot within set.

Organizations of an 8-block cache:
    1-way: 8 sets (0 to 7), 1 block each (direct mapped)
    2-way: 4 sets (0 to 3), 2 blocks each
    4-way: 2 sets (0 to 1), 4 blocks each
    8-way: 1 set (0), 8 blocks (fully associative)

Replacement policy: if set is full, what block should be replaced?
Common: least recently used (LRU), but hardware usually implements "not most recently used".

Page 29: Example: Tag, Index, Offset?

Direct-mapped, 4 slots, 2-byte blocks

4-bit address: | Tag | Index | Offset |

    tag bits ____
    set index bits ____
    block offset bits ____

    index(1101) = ____

Page 30: Example: Tag, Index, Offset?

E-way set-associative, S sets, 16-byte blocks

16-bit address: | Tag | Index | Offset |

E = 1-way, S = 8 sets (sets 0 to 7):
    tag bits ____   set index bits ____   block offset bits ____   index(0x1833) ____

E = 2-way, S = 4 sets (sets 0 to 3):
    tag bits ____   set index bits ____   block offset bits ____   index(0x1833) ____

E = 4-way, S = 2 sets (sets 0 to 1):
    tag bits ____   set index bits ____   block offset bits ____   index(0x1833) ____

Page 31: Replacement Policy

If set is full, what block should be replaced?

Common: least recently used (LRU) (but hardware usually implements "not most recently used").

Another puzzle: cache starts empty, uses LRU. Access (address, hit/miss) stream:

    (10, miss); (12, miss); (10, miss)

Associativity of cache?

    12 is not in the same block as 10.
    12's block replaced 10's block.
    So: direct-mapped cache.

Page 32: General Cache Organization (S, E, B)

S sets
E lines per set ("E-way")
B = 2^b bytes of data per cache line (the data block)

Each line (block) in a set: | v (valid bit) | tag | bytes 0 1 2 ... B-1 |

Cache capacity: S x E x B data bytes
Address size: t + s + b address bits

S, E, B are powers of 2.

Page 33: Cache Read

S = 2^s sets
E = 2^e lines per set
B = 2^b bytes of data per cache line (the data block)

Each line: | valid bit | tag | bytes 0 1 2 ... B-1 |

Address of byte in memory: | tag (t bits) | set index (s bits) | block offset (b bits) |

Locate set by index.
Hit if any block in set is valid and has matching tag.
Get data at offset in block (data begins at this offset).

Page 34: Cache Read: Direct-Mapped (E = 1)

This cache:
• Block size: 8 bytes
• Associativity: 1 block per set (direct mapped)

S = 2^s sets, each with one line: | v | tag | bytes 0 1 2 3 4 5 6 7 |

Address of int: | t bits | 0...01 | 100 |

Find set: the index bits 0...01 select the set.

Page 35: Cache Read: Direct-Mapped (E = 1)

This cache:
• Block size: 8 bytes
• Associativity: 1 block per set (direct mapped)

Address of int: | t bits | 0...01 | 100 |

Selected line: | v | tag | bytes 0 1 2 3 4 5 6 7 |

valid? + match?: yes = hit
Block offset 100: the int (4 bytes) is here, in bytes 4 to 7.

If no match: old line is evicted and replaced.

Page 36: Direct-Mapped Cache Practice

12-bit address
16 lines, 4-byte block size
Direct mapped

Address bits: 11 10 9 8 7 6 5 4 3 2 1 0
Offset bits? Index bits? Tag bits?

    Index  Tag  Valid  B0  B1  B2  B3
    0      19   1      99  11  23  11
    1      15   0      --  --  --  --
    2      1B   1      00  02  04  08
    3      36   0      --  --  --  --
    4      32   1      43  6D  8F  09
    5      0D   1      36  72  F0  1D
    6      31   0      --  --  --  --
    7      16   1      11  C2  DF  03
    8      24   1      3A  00  51  89
    9      2D   0      --  --  --  --
    A      2D   1      93  15  DA  3B
    B      0B   0      --  --  --  --
    C      12   0      --  --  --  --
    D      16   1      04  96  34  15
    E      13   1      83  77  1B  D3
    F      14   0      --  --  --  --

Addresses to look up: 0x354, 0xA20

Page 37: Example (E = 1)

Assume: cold (empty) cache, 3-bit set index, 5-bit offset. Block: 32 bytes = 4 doubles.
Locals in registers. Assume a is aligned such that &a[r][c] is aa...a rrrr cccc 000.
As tag | set index | offset: aa...a rrr | r cc | cc 000

    double sum_array_rows(double a[16][16]) {
        double sum = 0;

        for (int r = 0; r < 16; r++) {
            for (int c = 0; c < 16; c++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

Blocks in access order:
    0,0 0,1 0,2 0,3 | 0,4 0,5 0,6 0,7 | 0,8 0,9 0,a 0,b | 0,c 0,d 0,e 0,f
    1,0 1,1 1,2 1,3 | 1,4 1,5 1,6 1,7 | 1,8 1,9 1,a 1,b | 1,c 1,d 1,e 1,f
4 misses per row of array: 4 * 16 = 64 misses.

    double sum_array_cols(double a[16][16]) {
        double sum = 0;

        for (int c = 0; c < 16; c++) {
            for (int r = 0; r < 16; r++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

Blocks in access order:
    0,0 0,1 0,2 0,3
    1,0 1,1 1,2 1,3
    2,0 2,1 2,2 2,3
    3,0 3,1 3,2 3,3
    4,0 4,1 4,2 4,3
Every access a miss: 16 * 16 = 256 misses.

Addresses (tag, set index, offset):
    0,0: aa...a000 000 00000
    0,4: aa...a000 001 00000
    1,0: aa...a000 100 00000
    2,0: aa...a001 000 00000
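
The two miss counts can be checked with a short simulation. This is a sketch under the slide's assumptions (cold direct-mapped cache, 8 sets, 32-byte blocks); placing the array at address 0 is an extra assumption that satisfies the stated alignment, and the function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

enum { DIM = 16, SETS = 8, OFFSET_BITS = 5 };  /* 32-byte blocks */

/* Count misses when summing a 16x16 array of 8-byte doubles in a cold
   direct-mapped cache, traversing by rows or by columns. The array is
   assumed to start at address 0. */
int count_misses(bool by_rows) {
    bool valid[SETS] = { false };
    uint32_t tags[SETS] = { 0 };
    int misses = 0;

    for (int outer = 0; outer < DIM; outer++) {
        for (int inner = 0; inner < DIM; inner++) {
            int r = by_rows ? outer : inner;
            int c = by_rows ? inner : outer;
            uint32_t addr  = (uint32_t)(r * DIM + c) * 8;  /* row-major layout */
            uint32_t block = addr >> OFFSET_BITS;
            uint32_t set   = block % SETS;
            uint32_t tag   = block / SETS;

            if (!valid[set] || tags[set] != tag) {
                misses++;              /* miss: replace the line in this set */
                valid[set] = true;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

count_misses(true) gives 64 and count_misses(false) gives 256, matching the counts above. In the column traversal, rows r and r + 2 map to the same set with different tags, so each block is evicted before its other three doubles are used.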

Page 38: Example (E = 1)

    int dotprod(int x[8], int y[8]) {
        int sum = 0;

        for (int i = 0; i < 8; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

Block = 16 bytes = 4 ints; 8 sets in cache.
How many block offset bits? How many set index bits?

Address bits: ttt....t sss bbbb
    B = 16 = 2^b: b = 4 offset bits
    S = 8 = 2^s: s = 3 index bits

Addresses as bits:
    0x00000000: 000....0 000 0000
    0x00000080: 000....1 000 0000
    0x000000A0: 000....1 010 0000

If x and y are mutually aligned, e.g., 0x00, 0x80: both map to the same set, and its contents alternate:
    x[0] x[1] x[2] x[3], then y[0] y[1] y[2] y[3], then x[0] x[1] x[2] x[3], then y[0] y[1] y[2] y[3], ...

If x and y are mutually unaligned, e.g., 0x00, 0xA0: each block gets its own set:
    x[0] x[1] x[2] x[3]
    x[4] x[5] x[6] x[7]
    y[0] y[1] y[2] y[3]
    y[4] y[5] y[6] y[7]

Page 39: Cache Read: Set-Associative (Example: E = 2)

This cache:
• Block size: 8 bytes
• Associativity: 2 blocks per set

Each set holds two lines: | v | tag | bytes 0 1 2 3 4 5 6 7 | v | tag | bytes 0 1 2 3 4 5 6 7 |

Address of int: | t bits | 0...01 | 100 |

Find set: the index bits 0...01 select the set.

Page 40: Cache Read: Set-Associative (Example: E = 2)

This cache:
• Block size: 8 bytes
• Associativity: 2 blocks per set

Address of int: | t bits | 0...01 | 100 |

Selected set: | v | tag | bytes 0 1 2 3 4 5 6 7 | v | tag | bytes 0 1 2 3 4 5 6 7 |

Compare both lines: valid? + match: yes = hit
Block offset 100: the int (4 bytes) is here, in bytes 4 to 7.

If no match: evict and replace one line in set.

Page 41: Example (E = 2)

    float dotprod(float x[8], float y[8]) {
        float sum = 0;

        for (int i = 0; i < 8; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

4 sets, 2 blocks/lines per set.

If x and y are aligned, e.g. &x[0] = 0, &y[0] = 128, the cache can still fit both because each set has space for two blocks/lines:

    set 0: x[0] x[1] x[2] x[3] | y[0] y[1] y[2] y[3]
    set 1: x[4] x[5] x[6] x[7] | y[4] y[5] y[6] y[7]
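
A small LRU simulation makes the effect of associativity on dotprod concrete. It is a sketch only: the 16-byte block size is carried over from the direct-mapped dotprod example, elements are assumed to be 4 bytes, and the names Line and dotprod_misses are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_SETS = 8, MAX_WAYS = 2, OFFSET_BITS = 4 };  /* 16-byte blocks */

typedef struct {
    bool valid;
    uint32_t tag;
    unsigned last_used;  /* timestamp of most recent access, for LRU */
} Line;

/* Count misses for the access pattern x[0], y[0], x[1], y[1], ... over
   8 four-byte elements per array, in a cold cache with `sets` sets and
   `ways` lines per set. Requires sets <= MAX_SETS and ways <= MAX_WAYS. */
int dotprod_misses(uint32_t x_addr, uint32_t y_addr, unsigned sets, unsigned ways) {
    Line cache[MAX_SETS][MAX_WAYS] = { 0 };
    unsigned now = 0;
    int misses = 0;

    for (int i = 0; i < 8; i++) {
        uint32_t addrs[2] = { x_addr + 4u * (uint32_t)i, y_addr + 4u * (uint32_t)i };
        for (int k = 0; k < 2; k++) {
            uint32_t block = addrs[k] >> OFFSET_BITS;
            Line *set = cache[block % sets];
            uint32_t tag = block / sets;
            Line *target = NULL;

            /* Hit if any valid line in the set has a matching tag. */
            for (unsigned w = 0; w < ways; w++) {
                if (set[w].valid && set[w].tag == tag) { target = &set[w]; break; }
            }
            if (target == NULL) {
                misses++;
                /* Fill an invalid line if there is one, else evict the LRU line. */
                target = &set[0];
                for (unsigned w = 0; w < ways; w++) {
                    if (!set[w].valid) { target = &set[w]; break; }
                    if (set[w].last_used < target->last_used) target = &set[w];
                }
                target->valid = true;
                target->tag = tag;
            }
            target->last_used = ++now;
        }
    }
    return misses;
}
```

With x at 0 and y at 128, the direct-mapped configuration (8 sets, 1 way) misses on all 16 accesses, while the 2-way configuration (4 sets, 2 ways) takes only the 4 cold misses.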

Page 42: Types of Cache Misses

Cold (compulsory) miss

Conflict miss

Capacity miss

Which ones can we mitigate/eliminate? How?

Page 43: Writing to cache

Multiple copies of data exist and must be kept in sync.

Write-hit policy
    Write-through: write to the next level of the hierarchy immediately.
    Write-back: defer the write to the next level until the line is evicted; needs a dirty bit.

Write-miss policy
    Write-allocate: load the block into the cache, then write it there.
    No-write-allocate: write directly to the next level without loading the block into the cache.

Typical caches:
    Write-back + Write-allocate, usually
    Write-through + No-write-allocate, occasionally

Page 44: Write-back, write-allocate example

Cache (one line): tag U, dirty bit 0, data 0xCAFE
Memory: T: 0xFACE, U: 0xCAFE
Registers: eax = 0xCAFE, ecx = T, edx = U

1. mov $T, %ecx
2. mov $U, %edx
   (Steps 1 and 2: cache/memory not involved.)
3. mov $0xFEED, (%ecx)
   a. Miss on T.

Page 45: Write-back, write-allocate example

Cache (one line): tag T, dirty bit 1, data 0xFEED (filled as 0xFACE with dirty bit 0, then written)
Memory: T: 0xFACE, U: 0xCAFE
Registers: eax = 0xCAFE, ecx = T, edx = U

1. mov $T, %ecx
2. mov $U, %edx
3. mov $0xFEED, (%ecx)
   a. Miss on T.
   b. Evict U (clean: discard).
   c. Fill T (write-allocate).
   d. Write T in cache (dirty).
4. mov (%edx), %eax
   a. Miss on U.

Page 46: Write-back, write-allocate example

Cache (one line): tag U, dirty bit 0, data 0xCAFE
Memory: T: 0xFEED (was 0xFACE), U: 0xCAFE
Registers: eax = 0xCAFE, ecx = T, edx = U

1. mov $T, %ecx
2. mov $U, %edx
3. mov $0xFEED, (%ecx)
   a. Miss on T.
   b. Evict U (clean: discard).
   c. Fill T (write-allocate).
   d. Write T in cache (dirty).
4. mov (%edx), %eax
   a. Miss on U.
   b. Evict T (dirty: write back).
   c. Fill U.
   d. Set %eax.
5. DONE.
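
The trace can be replayed on a toy model of a cache with a single line. This is a sketch, not the course's code: the type OneLineCache, the function names, and the encoding of locations T and U as indices 0 and 1 are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum { T = 0, U = 1 };   /* the two memory locations, encoded as indices */

typedef struct {
    uint16_t memory[2];  /* backing store */
    bool valid, dirty;
    int tag;             /* which location the single line holds */
    uint16_t data;
    int writebacks;      /* number of dirty evictions so far */
} OneLineCache;

/* On a miss: evict the current line (writing it back only if dirty),
   then fill the line from memory. */
static void ensure_cached(OneLineCache *c, int where) {
    if (c->valid && c->tag == where) return;  /* hit */
    if (c->valid && c->dirty) {               /* dirty: write back */
        c->memory[c->tag] = c->data;
        c->writebacks++;
    }
    c->tag = where;                           /* fill */
    c->data = c->memory[where];
    c->valid = true;
    c->dirty = false;
}

void cache_store(OneLineCache *c, int where, uint16_t value) {
    ensure_cached(c, where);  /* write-allocate: fill on a write miss */
    c->data = value;          /* write-back: update only the cache */
    c->dirty = true;
}

uint16_t cache_load(OneLineCache *c, int where) {
    ensure_cached(c, where);
    return c->data;
}
```

Starting with U cached clean, cache_store(&c, T, 0xFEED) leaves memory[T] at 0xFACE with the line dirty; the following cache_load(&c, U) writes 0xFEED back to memory[T] and returns 0xCAFE, as in steps 3 and 4.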

Page 47: Example Memory Hierarchy

Typical laptop/desktop processor (ca. 201_)

Processor package:
    Core 0 ... Core 3, each with:
        Regs
        L1 d-cache and L1 i-cache
        L2 unified cache
    L3 unified cache (shared by all cores)
Main memory

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 11 cycles
L3 unified cache: 8 MB, 16-way, access: 30-40 cycles

Lower levels are slower, but more likely to hit.

Block size: 64 bytes for all caches.

Page 48: Aside: software caches

Examples
    File system buffer caches, web browser caches, database caches, network CDN caches, etc.

Some design differences
    Almost always fully-associative
    Often use complex replacement policies
    Not necessarily constrained to single "block" transfers

Page 49: Cache-Friendly Code

Locality, locality, locality.

Programmer can optimize for cache performance
    Data structure layout
    Data access patterns
        Nested loops
        Blocking (see CSAPP 6.5)

All systems favor "cache-friendly code"
    Performance is hardware-specific
    Generic rules capture most advantages
        Keep working set small (temporal locality)
        Use small strides (spatial locality)
        Focus on inner loop code