Computer Architecture
Cache Memory
By Yoav Etsion and Dan Tsafrir
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
In the old days…

EDVAC (Electronic Discrete Variable Automatic Computer)
- The successor of ENIAC (the first general-purpose electronic computer)
- Designed & built in 1944-1949 by Eckert & Mauchly (who also built ENIAC), with John von Neumann
- Unlike ENIAC, binary rather than decimal, and a "stored program" machine
- Operational until 1961
In the olden days…

In 1945, von Neumann wrote:
"…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory."

[Photo: von Neumann & EDVAC]
In the olden days…

Later, in 1946, he wrote:
"…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available… We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

[Photo: von Neumann & EDVAC]
Not so long ago…

In 1994, in their paper "Hitting the Memory Wall: Implications of the Obvious", William Wulf and Sally McKee wrote:
"We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one."
Not so long ago…

[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance improves ~60% per year (2× in 1.5 years); DRAM improves ~9% per year (2× in 10 years); the gap grew ~50% per year.]
More recently (2008)…

[Figure: "The memory wall in the multicore era" - performance in seconds (lower = slower) as a function of the number of processor cores, for a conventional architecture.]
Memory Trade-Offs

- Large (dense) memories are slow
- Fast memories are small, expensive, and consume high power
- Goal: give the processor the illusion of a memory that is large (dense), fast, low-power, and cheap
- Solution: a hierarchy of memories

  CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM)
  Speed:  fastest → slowest
  Size:   smallest → biggest
  Cost:   highest → lowest
  Power:  highest → lowest
Typical levels in the memory hierarchy

  Memory level             Response time   Size
  CPU registers            ≈ 0.5 ns        ≈ 100 bytes
  L1 cache                 ≈ 1 ns          ≈ 64 KB
  Last-level cache (LLC)   ≈ 20 ns         ≈ 8 – 32 MB
  Main memory (DRAM)       ≈ 150 ns        ≈ 4 – 100s GB
  SSD                      ≈ 100 µs        ≈ 128 GB
  Hard disk (SATA)         ≈ 5 ms          ≈ 1 – 4 TB
Why Hierarchy Works: Locality

- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
  • Example: code and variables in loops
  • → keep recently accessed data closer to the processor
- Spatial locality (locality in space): if an item is referenced, nearby items tend to be referenced soon
  • Example: scanning an array
  • → move contiguous blocks closer to the processor
- Due to locality, a memory hierarchy is a good idea: we are going to use what we've just recently used, and its immediate neighborhood (see the sketch below)
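To make the two notions concrete, here is a minimal C sketch (illustrative only, not from the slides): summing an array touches consecutive addresses (spatial locality) while reusing the same few variables on every iteration (temporal locality).

    #include <stddef.h>

    /* Summing an array exhibits both kinds of locality. */
    long sum(const int *a, size_t n)
    {
        long s = 0;            /* s and i are reused every iteration: temporal locality */
        for (size_t i = 0; i < n; i++)
            s += a[i];         /* consecutive addresses: spatial locality; one 32/64-byte
                                  line fill serves the next several iterations for free */
        return s;
    }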
Programs with locality cache well…

[Figure: memory address vs. time, one dot per memory access. Horizontal bands show temporal locality, diagonal bands show spatial locality, and a scattered region shows bad locality behavior. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Memory Hierarchy: Terminology

For each memory level, define the following:
- Hit: the data appears in that memory level
- Hit rate: the fraction of accesses found in that level
- Hit latency: the time to access that level (also includes the time to determine hit/miss)
- Miss: the data must be retrieved from the next level
- Miss rate: 1 - (hit rate)
- Miss penalty: the time to bring in the missing info (replace a block) + the time to deliver the info to the accessor

Average memory access time:
  t_effective = (hit latency × hit rate) + (miss penalty × miss rate)
              = (hit latency × hit rate) + (miss penalty × (1 - hit rate))

If the hit rate is close to 1, t_effective is close to the hit latency, which is generally what we want.
Effective Memory Access Time

- The cache holds a subset of the memory
  • Hopefully, the subset that is being used now, known as "the working set"
- Effective memory access time:
  t_effective = (t_cache × hit rate) + (t_mem × (1 - hit rate))
  • t_mem includes the time it takes to detect a cache miss
- Example: assume t_cache = 10 ns, t_mem = 100 ns

  Hit rate   t_eff (ns)
  0%         100
  50%        55
  90%        19
  99%        10.9
  99.9%      10.1

- As t_mem/t_cache goes up, a hit rate closer to 1 becomes more important (the program below reproduces the table)
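A minimal C sketch of the slide's formula (function names are mine), which reproduces the table above:

    #include <stdio.h>

    /* t_effective = t_cache*hit_rate + t_mem*(1 - hit_rate) */
    static double t_effective(double t_cache, double t_mem, double hit_rate)
    {
        return t_cache * hit_rate + t_mem * (1.0 - hit_rate);
    }

    int main(void)
    {
        const double rates[] = { 0.0, 0.5, 0.9, 0.99, 0.999 };
        for (int i = 0; i < 5; i++)   /* prints 100, 55, 19, 10.9, 10.1 ns */
            printf("hit rate %5.1f%% -> t_eff = %6.2f ns\n",
                   100.0 * rates[i], t_effective(10.0, 100.0, rates[i]));
        return 0;
    }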
Cache – main idea

- The cache holds a small part of the entire memory
  • Need to map parts of the memory into the cache
- Main memory is (logically) partitioned into "blocks" or "lines" (or, when the info is cached, "cachelines")
  • Typical block size: 32 or 64 bytes
  • Blocks are "aligned" in memory
- The cache is partitioned into cache lines
  • Each cache line holds a block, together with a tag that identifies it
  • Only a subset of the blocks is mapped to the cache at a given time
- The cache views an address as: | block # | offset |
- Why use lines/blocks rather than words? (To exploit spatial locality: one fill serves the neighboring words too)

[Diagram: memory blocks 0, 1, 2, …, 90, 91, 92, 93, …; a few blocks (e.g., 42, 90, 92) currently reside in cache lines, each stored alongside its tag.]
Cache Lookup

- Cache hit: the block is mapped to the cache → return data according to the block's offset
- Cache miss: the block is not mapped to the cache → do a cacheline fill
  • Fetch the block into a fill buffer (may require a few cycles)
  • Write the fill buffer into the cache
  • May need to evict another block from the cache, to make room for the new block

[Diagram: same memory/cache picture as on the previous slide.]
Checking valid bit & tag

- Initially the cache is empty → need a "line valid" indication: a valid bit per line
- A line may also be invalidated later

[Diagram: the address (bits 31…0) is split into Tag (bits 31…5) and Offset (bits 4…0). Each tag-array entry holds a valid bit v and a tag. A hit requires that the stored tag equals the address tag AND that the valid bit is set; on a hit, the data array supplies the line, and the offset selects the data within it.]
Cache organization

Basic questions:
- Associativity: where can we place a memory block in the cache?
- Eviction policy: which cache line should be evicted on a miss?

Associativity:
- Ideally, every memory block can go to each cache line
  • Called a fully-associative cache
  • Most flexible, but most expensive
- Compromise: simpler designs, in which blocks can only reside in a subset of the cache lines
  • Direct-mapped cache
  • 2-way set-associative cache
  • N-way set-associative cache
Fully Associative Cache

- An address is partitioned into: a block number, and an offset within the block
- Each block may be mapped to each of the cache lines → look the block up in all lines:
  • Each cache line has a tag
  • All tags are compared to the block # in parallel → need a comparator per line
  • If one of the tags matches the block #, we have a hit → supply data according to the offset
- Best hit rate, but most wasteful → must be relatively small

[Diagram: address fields - Tag = block # (bits 31…5), Offset (bits 4…0); the address tag is compared against every tag-array entry in parallel.]
Fully Associative Cache

- Is said to be a "CAM": Content Addressable Memory

[Diagram: same as on the previous slide.]
Direct Map Cache

- Each memory block can be mapped to only a single cache line
- Offset: the byte within the cache line (bits 4…0)
- Set: the index into the data array and the tag array (bits 13…5; here, 2^9 = 512 sets)
  • Of all the blocks that share a given set index, only one can reside in the cache at a time
- Tag: the remaining block-number bits (bits 31…14) are used as the tag
  • The tag uniquely identifies the memory block
  • Must compare the tag stored in the tag array to the tag of the address

[Diagram: the address is split into Tag (31…14), Set (13…5), and Offset (4…0); the block number is bits 31…5; the set field indexes both the tag array and the data array.]
Direct Map Cache (cont)

- Partition memory into slices; slice size = cache size
- Partition each slice into blocks; block size = cache line size
  • The distance of a block from the start of its slice indicates its position in the cache (the set)
- Advantages:
  • Easy & fast hit/miss resolution
  • Easy & fast replacement algorithm
  • Lowest power
- Disadvantage:
  • A line has only "one chance"; lines are replaced due to "conflict misses"
  • The organization with the highest miss rate

[Diagram: memory divided into consecutive cache-size slices; the blocks at distance x from the start of each slice all map to the same set X.]
Direct Map Cache – Example

- Line size: 32 bytes → 5 offset bits
- Cache size: 16 KB = 2^14 bytes
- #lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512
- #sets = #lines = 512 → #set bits = 9 (bits 5…13)
- #tag bits = 32 - (#set bits + #offset bits) = 32 - (9 + 5) = 18 (bits 14…31)

Lookup address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
- offset = 1 1000 = 0x18
- set = 0 1011 0011 = 0x0B3
- tag = 00 0100 1000 1101 0001 = 0x048D1

The tag stored at set 0x0B3 is compared with 0x048D1 → hit/miss. (A sanity-check program follows.)
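The same decomposition in C, as a sanity check (a sketch assuming the 16 KB direct-mapped cache above; the constants are the example's):

    #include <stdio.h>

    enum { OFFSET_BITS = 5, SET_BITS = 9 };   /* 32-byte lines, 512 sets */

    int main(void)
    {
        unsigned addr   = 0x12345678;
        unsigned offset =  addr & ((1u << OFFSET_BITS) - 1);              /* bits 4..0   */
        unsigned set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1); /* bits 13..5  */
        unsigned tag    =  addr >> (OFFSET_BITS + SET_BITS);              /* bits 31..14 */
        printf("offset=0x%X set=0x%X tag=0x%X\n", offset, set, tag);
        /* prints: offset=0x18 set=0xB3 tag=0x48D1 */
        return 0;
    }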
Direct map (tiny example)

- Assume:
  • Memory size is 2^5 = 32 bytes → need a 5-bit address
  • A block comprises 4 bytes → thus, there are exactly 8 blocks
- Note:
  • Need only 3 bits to identify a block
  • The offset is used exclusively within the cache lines
  • The offset is not used to locate the cache line

[Diagram: memory as 8 blocks (indices 000…111) of 4 bytes each (offsets 00…11); e.g., address 00001 = block 000 / offset 01, address 01110 = block 011 / offset 10, address 11111 = block 111 / offset 11.]
Direct map (tiny example)

- Further assume the size of our cache is 2 cache lines (→ need 2 = 5-2-1 tag bits)
- The address divides like so:
  b4 b3 | b2 | b1 b0
   tag  | set | offset

[Diagram: a tag array (2 entries × 2 bits) and a data array (2 lines × 4 bytes) beside the 8-block memory array; even-numbered blocks map to cache line 0, odd-numbered blocks to cache line 1.]
Direct map (tiny example)

- Accessing address 0 0 0 1 0 (= the byte marked "C")
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (00) | set (0) | offset (10)

[Diagram: block 000 (bytes A B C D) is brought into cache line 0; the tag-array entry for set 0 becomes 00.]
Direct map (tiny example)

- Accessing address 0 1 0 1 0 (= Y)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (01) | set (0) | offset (10)

[Diagram: block 010 (bytes W X Y Z) replaces the previous contents of cache line 0; the tag for set 0 becomes 01.]
Direct map (tiny example)

- Accessing address 1 0 0 1 0 (= Q)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (10) | set (0) | offset (10)

[Diagram: block 100 (bytes T R Q P) replaces the contents of cache line 0; the tag for set 0 becomes 10.]
Direct map (tiny example)

- Accessing address 1 1 0 1 0 (= J)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (11) | set (0) | offset (10)

[Diagram: block 110 (bytes L K J I) replaces the contents of cache line 0; the tag for set 0 becomes 11.]
Direct map (tiny example)

- Accessing address 0 0 1 1 0 (= B)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (00) | set (1) | offset (10)

[Diagram: block 001 (bytes D C B A) is brought into cache line 1; the tag for set 1 becomes 00.]
Direct map (tiny example)

- Accessing address 0 1 1 1 0 (= Y)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (01) | set (1) | offset (10)

[Diagram: block 011 (bytes W Z Y X) replaces the contents of cache line 1; the tag for set 1 becomes 01.]
Direct map (tiny example)

- Now assume the size of our cache is 4 cache lines
- The address divides like so:
  b4 | b3 b2 | b1 b0
  tag |  set  | offset

[Diagram: block 100 (bytes D C B A) maps to set 00, with the single tag bit b4 = 1.]
Direct map (tiny example)

- Again with a 4-line cache and the split b4 | b3 b2 | b1 b0 (tag | set | offset)

[Diagram: block 000 (bytes W Z Y X) also maps to set 00, but with tag b4 = 0; the two blocks compete for the same set.]
2-Way Set Associative Cache

- Each set holds two lines (way 0 and way 1)
- Each block can be mapped into one of the two lines in the appropriate set (the HW checks both ways in parallel)
- The cache is effectively partitioned into two

Example:
- Line size: 32 bytes
- Cache size: 16 KB
- #lines: 512
- #sets: 256
- Offset bits: 5 (bits 4…0)
- Set bits: 8 (bits 5…12)
- Tag bits: 19 (bits 13…31)

Address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
- Offset: 1 1000 = 0x18 = 24
- Set: 1011 0011 = 0x0B3 = 179
- Tag: 000 1001 0001 1010 0010 = 0x091A2

[Diagram: two tag arrays and two cache-storage arrays (way 0 and way 1), both indexed by the same set #.]
2-Way Cache – Hit Decision

[Diagram: the set field (bits 12…5) indexes the tag and data arrays of both ways; the address tag (bits 31…13) is compared against both stored tags in parallel; the comparator outputs determine hit/miss and drive a MUX that selects which way's data is sent out. A software sketch follows.]
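In software terms, the hit decision can be sketched as follows (a toy model with my own structure names; real hardware performs the two tag compares and the output MUX in parallel):

    #include <stdbool.h>
    #include <stdint.h>

    enum { OFFSET_BITS = 5, SET_BITS = 8, NSETS = 1 << SET_BITS,
           NWAYS = 2, LINE = 32 };                 /* the 16 KB, 2-way example above */

    struct line { bool valid; uint32_t tag; uint8_t data[LINE]; };
    static struct line cache[NSETS][NWAYS];        /* [set][way] */

    /* Returns true on a hit and copies the requested byte into *out. */
    static bool lookup(uint32_t addr, uint8_t *out)
    {
        uint32_t offset = addr & (LINE - 1);
        uint32_t set    = (addr >> OFFSET_BITS) & (NSETS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);
        for (int way = 0; way < NWAYS; way++) {      /* HW checks both ways in parallel */
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *out = cache[set][way].data[offset]; /* the MUX: select the hitting way */
                return true;
            }
        }
        return false;                                /* miss: a line fill would follow */
    }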
2-Way Set Associative Cache (cont)

- Partition memory into "slices" (ways); slice size = way size = ½ cache size
- Partition each slice into blocks; block size = cache line size
  • The distance of a block from the start of its slice indicates its position in the cache (the set)
- Compared to a direct-mapped cache:
  • Half-size slices → 2× #slices → 2× #blocks mapped to each cache set
  • But each set can hold 2 blocks at a given time
  • ++ fewer collisions/evictions
  • -- more logic, more power consuming

[Diagram: memory divided into way-size slices; the blocks at distance x from the start of each slice all map to set X.]
N-way set associative cache

- Similar to 2-way
- At the extreme, when every cache line is a way, the cache is fully associative
Cache organization summary

- Increasing set associativity:
  • Improves the hit rate
  • Increases power consumption
  • Increases access time
- Strike a balance
Cache Read Miss

- On a read miss: perform a cache line fill
  • Fetch the entire block that contains the missing data from memory
- The block is fetched into the cache line fill buffer
  • May take a few bus cycles to complete the fetch
  • e.g., a 64-bit (8-byte) data bus and a 32-byte cache line → 4 bus cycles
  • Can stream (forward) the critical chunk into the core before the line fill ends
- Once the entire block is fetched into the fill buffer, it is moved into the cache
Cache Replacement Policy

- Direct-mapped cache: easy
  • A new block is mapped to a single line in the cache
  • The old line is evicted (re-written to memory if needed)
- N-way set-associative cache: harder
  • Choose a victim from all the ways in the appropriate set
  • But which? To determine, use a replacement algorithm
- Example replacement policies:
  • Optimum (theoretical, postmortem, called "Belady")
  • FIFO (First In, First Out)
  • Random
  • LRU (Least Recently Used): a decent approximation of Belady
- More on this next week…
LRU Implementation

- 2 ways:
  • 1 bit per set marks the latest way accessed in the set
  • Evict the way not pointed to by the bit
- k-way set associative LRU:
  • Requires a full ordering of the way accesses
  • Algorithm, when way i is accessed:

      x = counter[i];
      counter[i] = k-1;
      for (j = 0; j < k; j++)
          if (j != i && counter[j] > x)
              counter[j]--;

  • When replacement is needed: evict the way with counter = 0
  • Expensive even for small k's, because it is invoked for every load/store
  • Needs a log2(k)-bit counter per line
- Example (k = 4):
  Initial state:   Way 0 1 2 3, Count 0 1 2 3
  Access way 2:    Way 0 1 2 3, Count 0 1 3 2
  Access way 0:    Way 0 1 2 3, Count 3 0 2 1
Pseudo LRU (PLRU)

- In practice, it is sufficient to approximate LRU efficiently
  • Maintain k-1 bits, instead of k∙log2(k) bits
- Assume k = 4, and let's enumerate the way's cache lines
  • We need 2 bits: cache line 00, cl-01, cl-10, and cl-11
- Use a binary search tree to represent the 4 cache lines
  • Set each of the 3 (= k-1) internal nodes to hold a bit variable: B0, B1, and B2
- Whenever accessing a cache line b1b0:
  • Along the path to that line, set each bit variable Bj to the corresponding cache-line bit
  • Think of the bit value as: Bj = 1 means "the right side was referenced more recently"
- Need to evict? Walk the tree as follows (see the sketch below):
  • Go left if Bj = 1; go right if Bj = 0
  • Evict the leaf you've reached (= the opposite direction relative to previous accesses)

[Diagram: a binary tree with root B0, children B1 and B2, and leaves = cache lines 00, 01, 10, 11.]
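A compact C sketch of tree-PLRU for k = 4 (structure and names are mine, following the bit convention above: B = 1 means the right side was referenced more recently):

    #include <stdio.h>

    struct plru4 { unsigned b0, b1, b2; };  /* b0 = root; b1, b2 = its children */

    static void access_way(struct plru4 *t, unsigned way)  /* way = bits b1b0, 0..3 */
    {
        unsigned hi = (way >> 1) & 1, lo = way & 1;
        t->b0 = hi;                 /* root records which half was touched */
        if (hi == 0) t->b1 = lo;    /* left subtree: ways 00, 01 */
        else         t->b2 = lo;    /* right subtree: ways 10, 11 */
    }

    static unsigned victim(const struct plru4 *t)
    {
        unsigned hi = !t->b0;                    /* left if B = 1, right if B = 0 */
        unsigned lo = (hi == 0) ? !t->b1 : !t->b2;
        return (hi << 1) | lo;                   /* the least recently touched leaf */
    }

    int main(void)
    {
        struct plru4 t = {0, 0, 0};
        access_way(&t, 3); access_way(&t, 0); access_way(&t, 2); access_way(&t, 1);
        printf("victim = %u\n", victim(&t));     /* prints 3, as in the example below */
        return 0;
    }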
Pseudo LRU (PLRU) – Example

- Access 3 (11), 0 (00), 2 (10), 1 (01)
- → the next victim is 3 (11), as expected

[Diagram: the tree bits after each access. After 3: B0=1, B2=1. After 0: B0=0, B1=0. After 2: B0=1, B2=0. After 1: B0=0, B1=1. Walking the final tree (left if 1, right if 0): B0=0 → right, B2=0 → right → leaf 11 is evicted.]
LRU vs. Random vs. FIFO

- LRU: hardest
- FIFO: easier; approximates LRU (evicts the oldest rather than the least recently used)
- Random: easiest
- Results:
  • Misses per 1000 instructions in the L1 data cache, on average
  • Averaged across ten SPECint2000 / SPECfp2000 benchmarks
  • PLRU turns out rather similar to LRU

  Size   2-way: LRU / Rand / FIFO    4-way: LRU / Rand / FIFO    8-way: LRU / Rand / FIFO
  16K    114.1 / 117.3 / 115.5       111.7 / 115.1 / 113.1       109.0 / 111.8 / 110.4
  64K    103.4 / 104.3 / 103.9       102.4 / 102.3 / 103.1        99.7 / 100.5 / 100.3
  256K    92.2 /  92.1 /  92.5        92.1 /  92.1 /  92.5        92.1 /  92.1 /  92.5
Effect of Cache on Performance

- MPKI (misses per kilo-instruction)
  • The average number of misses for every 1000 instructions
  • MPKI = memory accesses per kilo-instruction × miss rate
- Memory stall cycles
  = |memory accesses| × miss rate × miss penalty cycles
  = IC/1000 × MPKI × miss penalty cycles
- CPU time
  = (CPU execution cycles + memory stall cycles) × cycle time
  = IC/1000 × (1000 × CPI_execution + MPKI × miss penalty cycles) × cycle time

(A worked example follows.)
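A worked example with assumed, illustrative numbers (not taken from the slides): suppose IC = 10^9 instructions, CPI_execution = 1, MPKI = 20, miss penalty = 100 cycles, and cycle time = 1 ns. Then memory stall cycles = IC/1000 × MPKI × miss penalty = 10^6 × 20 × 100 = 2×10^9, and CPU time = (10^9 × 1 + 2×10^9) × 1 ns = 3 s. That is, cache misses triple the runtime of this hypothetical program.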
Memory Update Policy on Writes

- Write back: lazy writes to the next cache level; prefer the cache
- Write through: immediately update the next cache level
Write Back: Cheaper writes

- Store operations that hit the cache:
  • Write only to the cache; the next cache level (or memory) is not accessed
  • The line is marked as "modified" or "dirty"
  • When evicted, the line is written to the next level only if dirty
- Pros:
  • Saves memory accesses when a line is updated more than once
  • Attractive for multicore/multiprocessor
- Cons:
  • On eviction, the entire line must be written to memory (there is no indication of which bytes within the line were modified)
  • A read miss might require writing to memory (if the evicted line is dirty)
Write Through: Cheaper evictions

- Stores that hit the cache:
  • Write to the cache, and
  • Write to the next cache level (or memory)
  • Need to write only the bytes that were changed, not the entire line → less work
- When evicting, no need to write to the next cache level
  • Lines are never dirty, so they don't need to be written back
  • Still need to throw stuff out, though
- Use write buffers, to mask the wait for the lower-level memory
Write through: need a write buffer

- A write buffer between the cache & memory
  • The processor core writes data into the cache & the write buffer
  • The write buffer allows the processor to avoid stalling on writes
- Works OK if the store frequency (in cycles) << the DRAM write cycle
  • Otherwise the store buffer overflows, no matter how big it is
- Write combining: combine adjacent writes to the same location in the write buffer
- Note: on a cache miss, need to look up the write buffer (or drain it)

[Diagram: Processor → Cache, with a Write Buffer between the cache and DRAM.]
Cache Write Miss

- The processor is not waiting for the written data → it continues to work
- Option 1, write allocate: fetch the line into the cache
  • Goes with the write-back policy, because with write back, write ops are quicker if the line is in the cache
  • Assumes more writes/reads to the cache line will be performed soon
  • Hopes that subsequent accesses to the line hit the cache
- Option 2, write no-allocate: do not fetch the line into the cache (see the sketch below)
  • Goes with the write-through policy
  • Subsequent writes would update memory anyhow
  • (If reads occur, the first read will bring the line into the cache)
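A C sketch contrasting the two pairings (the helper functions are hypothetical stand-ins for the cache machinery, stubbed out so the sketch compiles):

    #include <stdbool.h>

    /* Hypothetical stand-ins for the cache machinery described above. */
    static bool cache_lookup(unsigned addr)            { (void)addr; return false; }
    static void cache_fill(unsigned addr)              { (void)addr; }  /* line fill */
    static void cache_write(unsigned addr, int v, bool dirty)
                                            { (void)addr; (void)v; (void)dirty; }
    static void next_level_write(unsigned addr, int v) { (void)addr; (void)v; }

    /* Write-back + write-allocate: a missing line is fetched; writes stay in the cache. */
    void store_wb_alloc(unsigned addr, int v)
    {
        if (!cache_lookup(addr))
            cache_fill(addr);            /* write allocate: bring the line in */
        cache_write(addr, v, true);      /* dirty; next level updated only on eviction */
    }

    /* Write-through + write-no-allocate: a miss does not fetch the line. */
    void store_wt_noalloc(unsigned addr, int v)
    {
        if (cache_lookup(addr))
            cache_write(addr, v, false); /* never dirty */
        next_level_write(addr, v);       /* the next level is always updated */
    }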
WT vs. WB – Summary

                                    Write-Through                    Write-Back
  Policy                            Data written to the cache        Write data only to the cache;
                                    block (if present) is also       update the lower level when a
                                    written to lower-level memory    block falls out of the cache
  Complexity                        Less                             More
  Can read misses produce writes?   No                               Yes
  Do repeated writes make it
  to the lower level?               Yes                              No
  Upon write miss                   Write no-allocate                Write allocate
Write Buffers for WT – Summary

[Diagram: Processor → Cache → Write Buffer → Lower-Level Memory; the buffer holds data awaiting write-through to the lower-level memory.]

- Q: Why a write buffer?
  A: So the CPU doesn't stall.
- Q: Why a buffer, why not just one register?
  A: Bursts of writes are common.
- Q: Are Read-After-Write (RAW) hazards an issue for the write buffer?
  A: Yes! Drain the buffer before the next read, or check the buffer contents on reads.
Write-back vs. Write-through

- Commercial processors favor write-back
  • Write bursts to the same line are common
  • Write-back simplifies the management of multi-cores:
    data in two consecutive cache levels is inconsistent while a write is in flight,
    and with write-through this happens on every write
Optimizing the Hierarchy
Cache Line Size

- A larger line size takes advantage of spatial locality
- But too-big blocks may fetch unused data, while possibly evicting useful data → the miss rate goes up
- A larger line size also means a larger miss penalty:
  • Longer time to fill the line (critical-chunk-first reduces the problem)
  • Longer time to evict

avgAccessTime = missPenalty × missRate + hitTime × (1 - missRate)
Classifying Misses: the 3 Cs

- Compulsory: the first access to a block, which cannot yet be in the cache
  • The block must be brought into the cache; the cache size does not matter
  • Solution: prefetching
- Capacity: the cache cannot contain all the blocks needed during program execution
  • Blocks are evicted and later retrieved
  • Solution: increase the cache size; stream buffers
- Conflict: occurs in set-associative or direct-mapped caches when too many blocks are mapped to the same set
  • Solution: increase associativity; victim cache
3Cs in SPEC92

[Figure: miss rate per type (fraction, 0 to 0.14) vs. cache size (1 to 128 KB), for 1-, 2-, 4-, and 8-way associativity. Compulsory misses are a negligible sliver; capacity misses dominate and shrink as the cache grows; conflict misses shrink as associativity increases.]
Multi-ported Cache

- An n-ported cache enables n accesses in parallel
  • Parallelizes cache access across different pipeline stages
  • Parallelizes cache access in a superscalar processor
- But for n = 2, it more than doubles the cache area
  • Wire complexity also degrades access times
- A "banked cache" can help:
  • Each line is divided into n banks
  • Can fetch data from k ≤ n different banks, in possibly different lines
Separate Code / Data Caches

- Parallelizes data access and instruction fetch
- The code cache is a read-only cache
  • No need to write the line back into memory when evicted
  • Simpler to manage
- What about self-modifying code?
  • The I-cache "snoops" (= monitors) all write ops
  • This requires a dedicated snoop port (read the tag array + match the tag); otherwise snoops would stall fetch
  • If the code cache contains the written address:
    invalidate the corresponding cache line, and
    flush the pipeline, as it may contain stale code
Last-level cache (LLC)

- The LLC is either L2 or L3
- The LLC is bigger, but has a higher latency
  • It reduces the L1 miss penalty: saves an access to memory
  • On modern processors, the LLC is located on-chip
- Since the LLC contains L1, it needs to be significantly larger
  • Data is replicated across the cache levels: fetching from the LLC into L1 replicates the data
  • E.g., if the LLC is only 2× L1, half of the LLC is duplicated in L1
- The LLC is typically unified (code / data)
Core 2 Duo Die Photo

[Die photo, highlighting the L2 cache. Core 2 Duo's L2 is up to 6 MB and is shared by the cores.]
Ivy Bridge (L3, "last level" cache)

[Die photo. 64 KB data + 64 KB instruction L1 cache per core; 512 KB L2 data cache per core; and up to 32 MB L3 cache shared by all cores.]
AMD Phenom II Six Core

[Die photo.]
LLC: Inclusiveness

- Data replication across cache levels presents a tradeoff: inclusive vs. non-inclusive caches
- Inclusive: the LLC contains all the data present in the higher cache levels
  • Evicting a line from the LLC also evicts it from the higher levels
  • Pro: makes the cache hierarchy easy to manage (the LLC serves as a coordination point)
  • Con: wasted cache space
- Non-inclusive: L1 may contain data not present in the LLC
  • Pro: better use of cache resources
  • Con: how do we know what data is in the caches?
- A critical issue in multicore design: data coherency and consistency across the individual L1 caches
LLC: Inclusiveness

- Practicality wins: the LLC is typically inclusive
  • All addresses in L1 are also contained in the LLC
- The LLC eviction process:
  • An address evicted from the LLC → snoop-invalidate it in L1
  • But the data in L1 might be newer than in L2
    (evicting a dirty line from L1 → it is written back to L2)
  • Thus, when evicting from L2 a line that is dirty in L1:
    the snoop-invalidate to L1 generates a write from L1 to L2,
    the line is marked as modified in L2 → and the line is written to memory
Victim Cache

- The load on a cache's sets may be non-uniform: some sets may suffer more conflict misses than others
- Solution: allocate ways to sets dynamically
- A victim buffer adds some associativity to direct-mapped caches:
  • A line evicted from the L1 cache is placed in the victim cache
  • If the victim cache is full → evict its LRU line
  • On an L1 cache lookup, also search the victim cache in parallel

[Diagram: a direct-mapped cache beside a small fully-associative victim buffer.]
Victim Cache

- On a victim-cache hit:
  • The line is moved back into the cache
  • The line it evicts is moved into the victim cache
  • Same access time as a cache hit

[Diagram: as above.]
Stream Buffers

- Before inserting a new line into the cache, put the new line in a stream buffer
- If the line is expected to be accessed again, move it from the stream buffer into the cache
  • E.g., if the line hits in the stream buffer
- Example: scanning a very large array (much larger than the cache)
  • Each item in the array is accessed just once
  • If the array elements were inserted into the cache, the entire cache would be thrashed
  • If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array's lines into the cache
Prefetching

- Predict future memory accesses, and fetch them from memory ahead of time
- Instruction prefetching:
  • On a cache miss, prefetch sequential lines into the stream buffers
  • Branch-predictor-directed prefetching: let the branch predictor run ahead
- Data prefetching: predict future data accesses
  • Next sequential (block prefetcher)
  • Stride
  • General pattern
- Software prefetching: the compiler injects special prefetch instructions
Prefetching

- Prefetching can greatly improve performance…
- …but incurs high overheads:
  • Predictions are not 100% accurate (closer to 50-60% in practice)
  • Need to predict the correct address, and make sure the data arrives on time:
    too early → the line may be evicted before use; too late → the processor has to stall
  • Prefetching can waste memory bandwidth and power: in some commodity processors, roughly 50% of the data brought from memory is never used, due to aggressive prefetching
Critical Word First: Reduce the Miss Penalty

- Don't wait for the full block to be loaded before restarting the CPU
- Early restart:
  • As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first:
  • Request the missed word first from memory, and send it to the CPU as soon as it arrives
  • Let the CPU continue execution while filling the rest of the words in the line
  • Also called wrapped fetch, or requested word first
- Example (Pentium): an 8-byte bus and a 32-byte cache line → 4 bus cycles to fill a line. Fetching data from address 95H, the chunks arrive in wrap-around order (a small program below reproduces this order):

  Chunk:      80H-87H   88H-8FH   90H-97H   98H-9FH
  Bus cycle:     3         4         1         2
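The wrap-around order can be computed with simple modular arithmetic; a small C sketch of the example above:

    #include <stdio.h>

    enum { LINE = 32, CHUNK = 8, NCHUNKS = LINE / CHUNK };

    int main(void)
    {
        unsigned addr = 0x95;                           /* the missed address */
        unsigned base = addr & ~(unsigned)(LINE - 1);   /* line start: 0x80 */
        unsigned crit = (addr & (LINE - 1)) / CHUNK;    /* critical chunk: index 2 */
        for (unsigned i = 0; i < NCHUNKS; i++) {
            unsigned c = (crit + i) % NCHUNKS;          /* wrap from the critical chunk */
            printf("bus cycle %u: %02XH-%02XH\n",
                   i + 1, base + c * CHUNK, base + c * CHUNK + CHUNK - 1);
        }
        /* cycles 1..4 fetch 90H-97H, 98H-9FH, 80H-87H, 88H-8FH */
        return 0;
    }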
Non-Blocking Cache

- Very important in OoO processors
- Hit under miss:
  • Allow cache hits while one miss is in progress
  • Another miss has to wait
- Miss under miss, hit under multiple misses:
  • Allow hits and misses while other misses are in progress
  • The memory system must allow multiple pending requests
  • Manage a list of outstanding cache misses; when a miss is served and its data gets back, update the list
- Pending operations are managed by MSHRs, also known as "Miss-Status Holding Registers"
Compiler/Programmer Optimizations: Merging Arrays

- Merge 2 arrays into a single array of compound elements:

/* BEFORE: two sequential arrays */
int val[SIZE];
int key[SIZE];

/* AFTER: one array of structures */
struct merge {
    int val;
    int key;
} merged_array[SIZE];

- Reduces conflicts between val and key
- Improves spatial locality
Compiler optimizations: Loop Fusion

- Combine 2 independent loops that have the same looping and some variable overlap
- Assume each element of a is 4 bytes, a 32 KB cache, and 32 B / line:

for (i = 0; i < 10000; i++)
    a[i] = 1 / a[i];
for (i = 0; i < 10000; i++)
    sum = sum + a[i];

- First loop: hits on 7/8 of the iterations (8 array elements per line)
- Second loop: the array (40 KB) > cache → same hit rate as in the 1st loop
- Fuse the loops:

for (i = 0; i < 10000; i++) {
    a[i] = 1 / a[i];
    sum = sum + a[i];
}

- First line: hits on 7/8 of the iterations
- Second line: hits on all iterations
Compiler Optimizations: Loop Interchange

- Change the loop nesting to access data in the order in which it is stored in memory
- A two-dimensional array is laid out in memory as:
  x[0][0] x[0][1] … x[0][99] x[1][0] x[1][1] …

/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

- Sequential accesses, instead of striding through memory every 100 words
- Improved spatial locality
Summary: Cache and Performance

- Reduce the cache miss rate:
  • Larger cache
  • Reduce compulsory misses: larger block size; HW prefetching (instructions, data); SW prefetching (data)
  • Reduce conflict misses: higher associativity; victim cache
  • Stream buffers: reduce cache thrashing
  • Compiler optimizations
- Reduce the miss penalty:
  • Early restart and critical word first on a miss
  • Non-blocking caches (hit under miss, miss under miss)
  • 2nd/3rd-level caches
- Reduce the cache hit time:
  • On-chip caches
  • Smaller caches (hit time increases with cache size)
  • Direct-mapped cache (hit time increases with associativity)
- Bring frequently accessed data closer to the processor