
CSE331 W15.1 Irwin&Li Fall 2006 PSU

CSE 331 Computer Organization and Design

Fall 2006

Week 15

Section 1: Mary Jane Irwin (www.cse.psu.edu/~mji)

Section 2: Feihui Li (www.cse.psu.edu/~feli)

Course material on ANGEL: cms.psu.edu

[adapted from D. Patterson slides]

CSE331 W15.2 Irwin&Li Fall 2006 PSU

Head’s Up

Last week’s material: intro to pipelined datapath design; memory design

This week’s material: memory hierarchies
- Reading assignment – PH: 7.1-7.2

Final exam week: the Final Exam is Tuesday, Dec. 19, 4:40 to 6:30, 110 Business

Reminders
- HW 8 (last) is due Dec. 14 (by 11:55pm)
- Quiz 7 (last) is due Dec. 15 (by 11:55pm)
- Dec. 13 is the deadline for filing grade corrections/updates

CSE331 W15.3 Irwin&Li Fall 2006 PSU

To say that the P6 project had its share of people issues would be an understatement. Any effort that combines extreme creativity, mindbending stress, and colorful personalities has to account for the “people factor,” … The people factor is probably the least understood aspect of hardware design, and nothing you learn in graduate school compares with what a day-to-day, long-term, high-visibility, pressure-cooker project can teach you about your coworkers, or yourself.

The Pentium Chronicles, Colwell, pg. 137

CSE331 W15.4 Irwin&Li Fall 2006 PSU

Review: Major Components of a Computer

Processor

Control

Datapath

Memory

Devices

Input

Output

Cache

Main Memory

Secondary Memory (Disk)

CSE331 W15.5 Irwin&Li Fall 2006 PSU

The Memory Hierarchy

Processor

L1$

L2$

Main Memory

Secondary Memory

Increasing distance from the processor means increasing access time; the (relative) size of the memory grows at each level.

Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.

Typical transfer sizes between levels: 4-8 bytes (a word) between the processor and L1$; 8-32 bytes (a block) between L1$ and L2$; 1 to 4 blocks between L2$ and Main Memory; 1,024+ bytes (a disk sector = page) between Main Memory and Secondary Memory.

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology

CSE331 W15.6 Irwin&Li Fall 2006 PSU

The Memory Hierarchy: Why Does it Work?

Temporal Locality (locality in time): keep most recently accessed instructions/data closer to the processor.

Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels.

[Figure: data moves between the processor and the upper level memory (Blk X), and between the upper level and the lower level memory (Blk Y)]

CSE331 W15.7 Irwin&Li Fall 2006 PSU

The Memory Hierarchy: Terminology

Hit: data is in some block in the upper level (Blk X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y)
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor

Hit Time << Miss Penalty


CSE331 W15.8 Irwin&Li Fall 2006 PSU

How is the Hierarchy Managed?

registers ↔ memory: by the compiler (programmer?)

cache ↔ main memory: by the cache controller hardware

main memory ↔ disks: by the operating system (virtual memory), with the virtual to physical address mapping assisted by the hardware (TLB), and by the programmer (files)

CSE331 W15.9 Irwin&Li Fall 2006 PSU

Why? The “Memory Wall”

The Processor vs DRAM speed “gap” continues to grow

[Plot: clocks per instruction for the processor vs clocks per DRAM access, on a log scale from 0.01 to 1000, for VAX/1980, PPro/1996, and 2010+; the gap widens steadily]

Good cache design is increasingly important to overall performance

CSE331 W15.10 Irwin&Li Fall 2006 PSU

The Cache

Two questions to answer (in hardware):
- Q1: How do we know if a data item is in the cache?
- Q2: If it is, how do we find it?

Direct mapped
- For each item of data at the lower level, there is exactly one location in the cache where it might be, so lots of items at the lower level must share locations in the upper level
- Address mapping: (block address) modulo (# of blocks in the cache)
- First consider block sizes of one word

CSE331 W15.12 Irwin&Li Fall 2006 PSU

Caching: A Simple First Example

[Figure: a 4-block direct mapped cache (indices 00, 01, 10, 11), each block holding Valid, Tag, and Data fields, mapped against a 16-word main memory with addresses 0000xx through 1111xx; the two low order bits (xx) define the byte in the word (32b words)]

Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache.

Q2: How do we find it? Use the next 2 low order memory address bits – the index – to determine which cache block, i.e., (block address) modulo (# of blocks in the cache).
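To make the field breakdown concrete, here is a minimal C sketch (my addition, not from the slides; the example address is arbitrary) that splits a byte address into the byte offset, index, and tag for this 4-block, one-word-per-block cache:

```c
#include <stdio.h>
#include <stdint.h>

/* Toy geometry from the figure: 32b words (2-bit byte offset),
 * 4 cache blocks (2-bit index), and a 2-bit tag. */
#define OFFSET_BITS 2
#define INDEX_BITS  2

int main(void) {
    uint32_t addr   = 0x2C;                 /* byte address 101100 in binary */
    uint32_t offset = addr & 0x3;           /* low 2 bits: byte within the word */
    uint32_t index  = (addr >> OFFSET_BITS) & 0x3;        /* next 2 bits: cache block */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS); /* high-order bits */
    printf("offset=%u index=%u tag=%u\n",
           (unsigned)offset, (unsigned)index, (unsigned)tag);
    return 0;
}
```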

CSE331 W15.14 Irwin&Li Fall 2006 PSU

Direct Mapped Cache

Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache – all blocks initially marked as not valid.

[Trace: 0 miss, 1 miss, 2 miss, 3 miss (four compulsory misses fill blocks 00-11 with tag 00); 4 miss (maps to block 00, evicting word 0 and storing tag 01); 3 hit; 4 hit; 15 miss (maps to block 11, tag 11)]

8 requests, 6 misses
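The hit/miss count is easy to check in software. Below is a small, self-contained C sketch (my addition, not part of the original slides) that simulates a direct mapped cache with one-word blocks over this reference string; with 4 blocks it prints 6 misses:

```c
#include <stdio.h>

#define NUM_BLOCKS 4   /* direct mapped, one word per block */

int main(void) {
    int valid[NUM_BLOCKS] = {0}, tag[NUM_BLOCKS] = {0};
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};
    int n = (int)(sizeof refs / sizeof refs[0]), misses = 0;

    for (int i = 0; i < n; i++) {
        int index = refs[i] % NUM_BLOCKS;  /* (block address) modulo (# of blocks) */
        int t     = refs[i] / NUM_BLOCKS;  /* remaining high-order bits */
        if (!valid[index] || tag[index] != t) {  /* miss: fetch block, update tag */
            misses++;
            valid[index] = 1;
            tag[index]   = t;
        }
    }
    printf("%d requests, %d misses\n", n, misses);  /* 8 requests, 6 misses */
    return 0;
}
```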

CSE331 W15.15 Irwin&Li Fall 2006 PSU

MIPS Direct Mapped Cache Example

One word/block, cache size = 1K words

[Figure: the 32-bit byte address (bits 31 ... 0) is split into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2), and a 2-bit Byte offset (bits 1-0); the index selects one of the 1024 (Valid, Tag, Data) entries, the stored tag is compared against the address tag to produce Hit, and the selected 32-bit Data word is returned]

What kind of locality are we taking advantage of?
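As a sketch of the same breakdown in code (my addition; the field widths are taken from the figure, the example address is arbitrary):

```c
#include <stdio.h>
#include <stdint.h>

/* 1K-word direct mapped cache, one word per block:
 * byte offset = bits 1:0, index = bits 11:2, tag = bits 31:12. */
int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & 0x3;
    uint32_t index  = (addr >> 2) & 0x3FF;  /* 10 bits: one of 1024 entries */
    uint32_t tag    = addr >> 12;           /* remaining 20 bits */
    printf("tag=0x%05X index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```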

CSE331 W15.16 Irwin&Li Fall 2006 PSU

Handling Cache Hits

Read hits (I$ and D$)
- this is what we want!

Write hits (D$ only)
- allow the cache and memory to be inconsistent (write-back)
  - write the data only into the cache (then write-back the cache contents to the memory when that cache block is “evicted”)
  - need a dirty bit for each cache block to tell if it needs to be written back to memory when it is evicted
- require the cache and memory to be consistent (write-through)
  - always write the data into both the cache and the memory
  - don’t need a dirty bit
  - writes run at the speed of the main memory – slow! Or use a write buffer, so the processor only has to stall when the write buffer is full
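A minimal C sketch of the two write-hit policies (my own illustration, not the slides’ code; the toy memory array stands in for the next level of the hierarchy):

```c
#include <stdbool.h>
#include <stdint.h>

#define MEM_WORDS 1024
static uint32_t memory[MEM_WORDS];   /* toy backing store */

typedef struct {
    bool valid, dirty;
    uint32_t tag, data;              /* one-word block for simplicity */
} cache_line_t;

static void mem_write(uint32_t word_addr, uint32_t value) {
    memory[word_addr % MEM_WORDS] = value;
}

/* Write-back: update only the cache and set the dirty bit; memory is
 * updated later, when this block is evicted. */
void write_hit_writeback(cache_line_t *line, uint32_t value) {
    line->data  = value;
    line->dirty = true;
}

/* Write-through: update the cache and the memory together; no dirty
 * bit needed (a real design would enqueue the store in a write buffer). */
void write_hit_writethrough(cache_line_t *line, uint32_t word_addr, uint32_t value) {
    line->data = value;
    mem_write(word_addr, value);
}
```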

CSE331 W15.17 Irwin&Li Fall 2006 PSU

Review: Why Pipeline? For Throughput!

[Pipeline diagram: instructions Inst 0 through Inst 4 overlapped in time (clock cycles), each flowing through the I$, Reg, ALU, D$, and Reg stages one cycle apart]

To avoid a structural hazard we need two caches on-chip: one for instructions (I$) and one for data (D$).

To keep the pipeline running at its maximum rate, both I$ and D$ need to satisfy a request from the datapath every cycle. What happens when they can’t do that?

CSE331 W15.19 Irwin&Li Fall 2006 PSU

Another Reference String Mapping

Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache – all blocks initially marked as not valid.

[Trace: every reference misses; words 0 and 4 both map to cache block 00, so each access evicts the other, with the stored tag alternating between 00 and 01]

8 requests, 8 misses

Ping pong effect due to conflict misses – two memory locations that map into the same cache block.
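Swapping the reference string in the earlier direct mapped simulator sketch reproduces the ping pong behavior:

```c
int refs[] = {0, 4, 0, 4, 0, 4, 0, 4};  /* 0 and 4 share index 0: prints 8 misses */
```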

CSE331 W15.20 Irwin&Li Fall 2006 PSU

Sources of Cache Misses

Compulsory (cold start or process migration, first reference):
- first access to a block; a “cold” fact of life, not a whole lot you can do about it
- if you are going to run “millions” of instructions, compulsory misses are insignificant

Conflict (collision):
- multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity

Capacity:
- cache cannot contain all blocks accessed by the program
- Solution: increase cache size

CSE331 W15.21 Irwin&Li Fall 2006 PSU

Handling Cache Misses

Read misses (I$ and D$)
- stall the pipeline, fetch the block from the next level in the memory hierarchy, write the word + tag into the cache, send the requested word to the processor, and then let the pipeline resume

Write misses (D$ only)
1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word + tag into the cache, and then let the pipeline resume

or (normally used in write-back caches)
2. Write allocate – just write the word + tag into the cache (which may involve evicting a dirty block); no need to check for a cache hit, no need to stall

or (normally used in write-through caches with a write buffer)
3. No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn’t full
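As a rough C sketch of options 2 and 3 (my own illustration; write_back_block and write_buffer_put are hypothetical helpers standing in for the real hardware, and the one-word cache_line_t matches the earlier write-hit sketch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid, dirty; uint32_t tag, data; } cache_line_t;

/* hypothetical helpers, declared but not modeled here */
void write_back_block(const cache_line_t *line);           /* drain a dirty block */
void write_buffer_put(uint32_t word_addr, uint32_t value); /* queue a store */

/* Option 2: write allocate (normally used in write-back caches). */
void write_miss_allocate(cache_line_t *line, uint32_t tag, uint32_t value) {
    if (line->valid && line->dirty)
        write_back_block(line);   /* may have to evict a dirty block first */
    line->valid = true;
    line->tag   = tag;
    line->data  = value;
    line->dirty = true;           /* memory is updated only on eviction */
}

/* Option 3: no-write allocate (write-through caches with a write buffer). */
void write_miss_no_allocate(cache_line_t *line, uint32_t word_addr, uint32_t value) {
    line->valid = false;                 /* invalidate the block, per the slide */
    write_buffer_put(word_addr, value);  /* word drains to memory via the buffer */
}
```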

CSE331 W15.22 Irwin&Li Fall 2006 PSU

Multiword Block Direct Mapped Cache

Four words/block, cache size = 1K words

[Figure: the 32-bit byte address is split into a 20-bit Tag (bits 31-12), an 8-bit Index (bits 11-4), a 2-bit Block offset (bits 3-2), and a 2-bit Byte offset (bits 1-0); the index selects one of the 256 (Valid, Tag, four-word Data) entries, the tag comparison produces Hit, and the block offset selects which 32-bit word of the block to return]

What kind of locality are we taking advantage of?

CSE331 W15.24 Irwin&Li Fall 2006 PSU

Taking Advantage of Spatial Locality

Let the cache block hold more than one word. Consider the same reference string 0 1 2 3 4 3 4 15 with a two-block cache holding two words per block. Start with an empty cache – all blocks initially marked as not valid.

[Trace: 0 miss (block Mem(1)|Mem(0) loaded); 1 hit; 2 miss (block Mem(3)|Mem(2) loaded); 3 hit; 4 miss (block Mem(5)|Mem(4), tag 01, replaces Mem(1)|Mem(0)); 3 hit; 4 hit; 15 miss (block Mem(15)|Mem(14), tag 11)]

8 requests, 4 misses
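Extending the earlier simulator sketch to multiword blocks (again my addition) confirms the count; with 2 blocks of 2 words each it prints 4 misses:

```c
#include <stdio.h>

#define NUM_BLOCKS      2
#define WORDS_PER_BLOCK 2

int main(void) {
    int valid[NUM_BLOCKS] = {0}, tag[NUM_BLOCKS] = {0};
    int refs[] = {0, 1, 2, 3, 4, 3, 4, 15};
    int n = (int)(sizeof refs / sizeof refs[0]), misses = 0;

    for (int i = 0; i < n; i++) {
        int block_addr = refs[i] / WORDS_PER_BLOCK;  /* which memory block */
        int index = block_addr % NUM_BLOCKS;
        int t     = block_addr / NUM_BLOCKS;
        if (!valid[index] || tag[index] != t) {      /* miss: fetch the whole block */
            misses++;
            valid[index] = 1;
            tag[index]   = t;
        }
    }
    printf("%d requests, %d misses\n", n, misses);   /* 8 requests, 4 misses */
    return 0;
}
```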

CSE331 W15.26 Irwin&Li Fall 2006 PSU

We concluded that no single quantifiable metric could reliably predict career success. However, we also felt that fast-trackers have identifiable qualities and show definite trends.

One quality is a whatever-it-takes attitude. All high-output engineers have a willingness to do whatever it takes to make a project succeed, especially their particular corner of it. If that means working weekends, doing extracurricular research, or rewriting tools, they do it.

The Pentium Chronicles, Colwell, pg. 140

CSE331 W15.27 Irwin&Li Fall 2006 PSU

Miss Rate vs Block Size vs Cache Size

[Plot: miss rate (%) from 0 to 10 vs block size (8 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls with block size at first, then climbs for the smaller caches]

Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

CSE331 W15.28 Irwin&Li Fall 2006 PSU

Block Size Tradeoff

Larger block size means larger miss penalty
- latency to the first word in the block + transfer time for the remaining words

[Sketch: as block size grows, miss penalty rises steadily; miss rate first falls because larger blocks exploit spatial locality, then rises when having fewer blocks compromises temporal locality; average access time therefore bottoms out and then increases with the growing miss penalty and miss rate]

In general, Average Memory Access Time = Hit Time + Miss Penalty × Miss Rate

Larger block sizes take advantage of spatial locality, but if the block size is too big relative to the cache size, the miss rate will go up.
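A quick worked instance of the formula (the numbers are mine, chosen only for illustration): with a 1-cycle hit time, a 5% miss rate, and an 82-cycle miss penalty,

$$\text{AMAT} = 1 + 82 \times 0.05 = 5.1 \text{ cycles}$$

so even a modest miss rate dominates the average access time when the miss penalty is large.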

CSE331 W15.29 Irwin&Li Fall 2006 PSU

Multiword Block Considerations

Read misses (I$ and D$)
- processed the same as for single word blocks – a miss returns the entire block from memory
- miss penalty grows as block size grows
  - Early restart – the datapath resumes execution as soon as the requested word of the block is returned
  - Requested word first – the requested word is transferred from the memory to the cache (and datapath) first
- Nonblocking cache – allows the datapath to continue to access the cache while the cache is handling an earlier miss

Write misses (D$)
- can’t use write allocate, or we will end up with a “garbled” block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block), so we must fetch the block from memory first and pay the stall time

CSE331 W15.30 Irwin&Li Fall 2006 PSU

Other Ways to Reduce Cache Miss Rates

1. Allow more flexible block placement
- In a direct mapped cache a memory block maps to exactly one cache block
- At the other extreme, a memory block could be mapped to any cache block – a fully associative cache
- A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative); see the lookup sketch after this list

2. Use multiple levels of caches
- Add a second level of cache on chip – normally a unified L2 cache (i.e., it holds both instructions and data)
- The L1 caches focus on minimizing hit time in support of a shorter clock cycle (smaller, with smaller block sizes)
- The L2 cache focuses on reducing miss rate to reduce the penalty of long main memory access times (larger, with larger block sizes)
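A minimal sketch of the set-associative lookup (my addition; the set and way counts are arbitrary):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 128
#define NUM_WAYS 4      /* 4-way set associative */

typedef struct { bool valid; uint32_t tag; } way_t;
static way_t cache_sets[NUM_SETS][NUM_WAYS];

/* A memory block may live in any of the NUM_WAYS ways of exactly one set. */
bool lookup(uint32_t block_addr) {
    uint32_t set = block_addr % NUM_SETS;   /* (block address) modulo (# of sets) */
    uint32_t tag = block_addr / NUM_SETS;
    for (int w = 0; w < NUM_WAYS; w++)      /* hardware checks all ways in parallel */
        if (cache_sets[set][w].valid && cache_sets[set][w].tag == tag)
            return true;                    /* hit */
    return false;                           /* miss: a replacement policy picks a way */
}
```

With NUM_WAYS = 1 this degenerates to a direct mapped cache; with NUM_SETS = 1 it becomes fully associative.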

CSE331 W15.31 Irwin&Li Fall 2006 PSU

Cache Summary

The Principle of Locality:
- a program is likely to access a relatively small portion of the address space at any instant of time
  - Temporal Locality: locality in time
  - Spatial Locality: locality in space

Three major categories of cache misses:
- Compulsory misses: sad facts of life, e.g., cold start misses
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping pong effect!
- Capacity misses: increase cache size

Cache design space:
- total size, block size, associativity (replacement policy)
- write-hit policy (write-through, write-back)
- write-miss policy (write allocate, write buffers)

CSE331 W15.32 Irwin&Li Fall 2006 PSU

Improving Cache Performance

Reduce the hit time
- smaller cache
- direct mapped cache
- smaller blocks
- for writes:
  - no write allocate – just write to the write buffer
  - write allocate – write to a delayed write buffer that then writes to the cache

Reduce the miss rate
- bigger cache
- associative cache
- larger blocks (16 to 64 bytes)
- use a victim cache – a small buffer that holds the most recently discarded blocks

Reduce the miss penalty
- smaller blocks
- for large blocks, fetch the critical word first
- use a write buffer – check the write buffer (and/or victim cache) on a read miss; may get lucky
- use multiple cache levels – the L2 cache is not tied to the CPU clock rate
- faster backing store / improved memory bandwidth
  - wider buses
  - SDRAMs

CSE331 W15.33 Irwin&Li Fall 2006 PSU

Review: (DDR) SDRAM Operation

[Figure: an N-row x N-column DRAM core with M bit planes; a row address selects one row, which is read into an N x M SRAM register, and a column address (stepped by a +1 incrementer during a burst) selects the M-bit output]

After a row is read into the SRAM register:
- input CAS as the starting “burst” address along with a burst length
- a burst of data (ideally a cache block) is transferred from a series of sequential addresses within that row
- a clock controls the transfer of successive words in the burst

[Timing diagram: RAS with the row address, then CAS with the column address, followed by the 1st through 4th M-bit accesses, all within one row cycle time]

CSE331 W15.34 Irwin&Li Fall 2006 PSU

Memory Systems that Support Caches

The off-chip interconnect and memory architecture affect overall system performance dramatically.

[Figure: the CPU and cache sit on-chip and connect over a bus (32-bit data and 32-bit address per cycle) to the off-chip Main Memory]

Assume
1. 1 clock cycle (1 ns) to send the address from the cache to the Main Memory
2. 50 ns (50 processor clock cycles) for the DRAM first word access time, and a 10 ns (10 clock cycle) cycle time for the remaining words in a burst (SDRAM)
3. 1 clock cycle (1 ns) to return a word of data from the Main Memory to the cache

Memory-Bus to Cache bandwidth = number of bytes accessed from Main Memory and transferred to the cache/CPU per clock cycle.

CSE331 W15.36 Irwin&Li Fall 2006 PSU

One Word Wide Memory Organization

[Figure: on-chip CPU and cache connected by a one-word-wide bus to the Main Memory]

If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:

1 cycle to send the address
50 cycles to read the DRAM
1 cycle to return the data
52 total clock cycles of miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a miss is 4/52 = 0.077 bytes per clock.

CSE331 W15.38 Irwin&Li Fall 2006 PSU

Burst Memory Organization

[Figure: on-chip CPU and cache connected by a one-word-wide bus to a (DDR) SDRAM Main Memory]

What if the block size is four words and a (DDR) SDRAM is used?

1 cycle to send the 1st address
50 + 3×10 = 80 cycles to read the DRAM (50 cycles for the first word, then 10 cycles for each of the three remaining words in the burst)
1 cycle to return the last data word
82 total clock cycles of miss penalty

Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/82 = 0.195 bytes per clock.
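Putting the two organizations side by side:

$$\frac{4\ \text{bytes}}{52\ \text{cycles}} \approx 0.077\ \text{bytes/clock} \qquad \text{vs} \qquad \frac{16\ \text{bytes}}{82\ \text{cycles}} \approx 0.195\ \text{bytes/clock}$$

so on a four-word miss the burst organization delivers roughly 2.5 times the bandwidth of the one-word-wide organization.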

CSE331 W15.39 Irwin&Li Fall 2006 PSU

DRAM Memory System Summary

It’s important to match the cache characteristics
- caches access one block at a time (usually more than one word)

with the DRAM characteristics
- use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache

and with the memory-bus characteristics
- make sure the memory bus can support the DRAM access rates and patterns

all with the goal of increasing the Memory-Bus to Cache bandwidth.