UC Regents Fall 2005 © UCB, CS 152 L13: Cache I
2005-10-13 John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 13 – Cache I
www-inst.eecs.berkeley.edu/~cs152/
A cosmic ray hits a DRAM cell ...
TAs: David Marquardt and Udam Saini
Last Time: DRAM design
[Diagram: the DRAM array.]
4096 rows, selected by a 1-of-4096 decoder driven by a 12-bit row address input.
2048 columns, each column 4 bits deep.
33,554,432 usable bits (the tester found good bits in a bigger array).
8192 bits delivered by the sense amps; the requested bits are selected and sent off the chip.
DRAM has high latency to first bit out. A fact of life.
Today: Caches and the Memory System
Memory Hierarchy: Technology motivation for caching.
Locality: Why caching works
Cache design: Final project component.
[Diagram: the five classic components of a computer: Processor (Control, Datapath), Memory, Input, Output.]
1977: DRAM faster than microprocessors
Apple ][ (1977)
[Photos: Steve Wozniak and Steve Jobs.]
CPU: 1000 ns. DRAM: 400 ns.
Since then: technology scaling ...
A circuit in 250 nm technology (introduced in 2000) is H nanometers long.
The same circuit in 180 nm technology (introduced in 2003) is 0.7 x H nanometers long.
Each dimension is 30% smaller; the area is 50% smaller.
Logic circuits use smaller C's, lower Vdd, and higher kn and kp to speed up clock rates.
DRAM scaled for more bits, not more MHz
Assume Ccell = 1 fF.
The bit line may have 2000 nFET drains; assume a bit line C of 100 fF, or 100*Ccell.
Ccell holds Q = Ccell*(Vdd - Vth).
When we dump this charge onto the bit line, what voltage do we see?
dV = [Ccell*(Vdd - Vth)] / [100*Ccell]
dV = (Vdd - Vth) / 100 ≈ tens of millivolts! In practice, scale the array to get a 60 mV signal.
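The charge-sharing arithmetic can be checked with a quick calculation (a sketch; the Vdd = 1.8 V and Vth = 0.5 V values are assumed for illustration, not taken from the slide):

```python
# Charge sharing when the 1 fF cell dumps its charge onto a ~100 fF
# bit line. Vdd and Vth below are assumed values for illustration.
C_cell = 1e-15        # F, storage cell
C_bit  = 100e-15      # F, bit line (~2000 nFET drains)
V_dd, V_th = 1.8, 0.5

q  = C_cell * (V_dd - V_th)   # charge stored: Q = Ccell*(Vdd - Vth)
dV = q / C_bit                # dV = (Vdd - Vth) / 100
print(f"dV = {dV * 1e3:.0f} mV")   # dV = 13 mV: tens of millivolts
```

Signals this small are why DRAM needs sense amplifiers, and why the array is sized to guarantee a 60 mV swing.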
1980-2003, CPU speed outpaced DRAM ...
[Plot: performance (1/latency), log scale from 10 to 10000, vs. year from 1980 to 2005. CPU: 60% per year (2X in 1.5 yrs), until the power wall circa 2005. DRAM: 9% per year (2X in 10 yrs). The gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM. Create a "memory hierarchy".
Caches: Variable-latency memory ports
[Diagram: the CPU talks to a small, fast Upper Level Memory, which in turn talks to a large, slow Lower Level Memory; blocks Blk X and Blk Y move between the two levels.]
Data in the upper level is returned with lower latency.
Data in the lower level is returned with higher latency.
Cache replaces data, instruction memory
[Diagram: the five-stage pipelined datapath (IF Fetch, ID Decode, EX ALU, MEM, WB), with the Instr Mem and Data Memory blocks highlighted.]
Replace them with an Instruction Cache and a Data Cache of DRAM main memory.
Recall: Intel ARM XScale CPU (PocketPC)
From: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001, p. 1600.
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance in a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb instructions.
(Fig. 2. Microprocessor pipeline organization.)
A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch. A two-instruction-deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- ...
[Slide annotations: 32 KB Instruction Cache; 32 KB Data Cache; 180 nm process (introduced 2003).]
2005 Memory Hierarchy: Apple iMac G5
iMac G5: 1.6 GHz, $1299.00

                  Reg   L1 Inst  L1 Data  L2     DRAM   Disk
Size              1K    64K      32K      512K   256M   80G
Latency (cycles)  1     3        3        11     160    1e7

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Registers are managed by the compiler; the caches are managed by hardware; DRAM and disk are managed by the OS, hardware, and the application.
Goal: the illusion of large, fast, cheap memory.
[Die photo: PowerPC 970FX, 90 nm, 58 M transistors, showing the registers (1K), L1 instruction cache (64K), L1 data cache (32K), and the 512K L2.]
Latency: A closer look
Read latency: the time to return the first byte of a random access.

                  Reg   L1 Inst  L1 Data  L2     DRAM   Disk
Size              1K    64K      32K      512K   256M   80G
Latency (cycles)  1     3        3        11     160    1e7
Latency (sec)     0.6n  1.9n     1.9n     6.9n   100n   12.5m
Hz                1.6G  533M     533M     145M   10M    80

Architect's latency toolkit:
(1) Parallelism. Request data from N 1-bit-wide memories at the same time. This overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth.
(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle, and receive it N cycles later.
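Idea (2) can be sketched as a toy schedule (a model only; the 4-cycle latency and the addresses are illustrative):

```python
# Toy model of a pipelined memory: one request may issue per cycle,
# and each returns LATENCY cycles later. Latency per access is
# unchanged, but steady-state throughput is one result per cycle.
LATENCY = 4  # cycles (assumed for illustration)

def schedule(addresses, latency=LATENCY):
    """(address, issue cycle, data-ready cycle) for back-to-back reads."""
    return [(a, t, t + latency) for t, a in enumerate(addresses)]

for addr, issued, ready in schedule([0x00, 0x20, 0x40, 0x60]):
    print(f"addr {addr:#04x}: issue cycle {issued}, data in cycle {ready}")
```

Four dependent reads would take 16 cycles; pipelined back-to-back reads finish in 8.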
Programs with locality cache well ...
[Plot: memory address vs. time, one dot per memory access. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Q. Point out bad locality behavior ...
[Annotations on the plot mark regions of spatial locality, temporal locality, and bad locality.]
The caching algorithm in one slide
Temporal locality: keep the most recently accessed data closer to the processor.
Spatial locality: move contiguous blocks in the address space to upper levels.
[Diagram: Upper Level Memory between the processor and Lower Level Memory; blocks Blk X and Blk Y move between levels.]
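The two policies can be sketched together as a tiny block-level LRU model (illustrative only; the 4-block capacity and access pattern are assumptions, and real caches implement this in hardware):

```python
from collections import OrderedDict

# Temporal locality: keep the most recently used blocks (LRU).
# Spatial locality: move whole multi-byte blocks, not single bytes.
BLOCK = 32      # bytes per block (32-byte blocks, as used later)
CAPACITY = 4    # blocks held in the upper level (illustrative)

cache = OrderedDict()   # block number -> block data, most recent last

def access(addr):
    blk = addr // BLOCK             # spatial: operate on whole blocks
    hit = blk in cache
    if hit:
        cache.move_to_end(blk)      # temporal: mark most recently used
    else:
        if len(cache) == CAPACITY:
            cache.popitem(last=False)   # evict the least recently used
        cache[blk] = f"<block {blk}>"
    return hit

hits = [access(a) for a in (0, 4, 32, 64, 96, 128, 0)]
print(hits)  # [False, True, False, False, False, False, False]
```

The second access hits (byte 4 shares block 0 with byte 0: spatial locality); the final access to address 0 misses because block 0 was evicted as least recently used.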
Caching terminology
[Diagram: as before, with Blk X resident in the upper level and Blk Y in the lower level.]
Hit: the data appears in an upper-level block (ex: Blk X).
Miss: data retrieval from the lower level is needed (ex: Blk Y).
Hit rate: the fraction of memory accesses found in the upper level.
Miss rate: 1 - hit rate.
Hit time: the time to access the upper level. Includes the hit/miss check.
Miss penalty: the time to replace a block in the upper level + deliver the data to the CPU.
Hit Time << Miss Penalty
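These terms combine into the standard average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty (the formula is standard, though not named on the slide; the numbers below are illustrative):

```python
# Average memory access time from the terms defined above:
#   AMAT = hit time + miss rate * miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hits, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100), "cycles")  # 6.0 cycles on average
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average.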
Admin: Final Xilinx Checkoff Tomorrow
Lab report due Monday, 11:59 PM.
Final project posted soon ...
Cache Design Example
Recall: Static Memory ...
Recall: Static Memory Cell Design
[Diagram: 6-transistor SRAM cell: cross-coupled inverters between Vdd and Gnd, with access transistors gated by the word line connecting to the bit line and its complement, !bitline.]
SRAM array: simpler than DRAM array
4/12/04 © UCB Spring 2004, CS152 / Kubiatowicz, Lec 19.13
Random Access Memory (RAM) Technology
° Why do computer designers need to know about RAM technology?
• Processor performance is usually limited by memory bandwidth
• As IC densities increase, lots of memory will fit on the processor chip
- Tailor on-chip memory to specific needs
- Instruction cache
- Data cache
- Write buffer
° What makes RAM different from a bunch of flip-flops?
• Density: RAM is much denser
Static RAM Cell: the 6-Transistor SRAM Cell
[Diagram: the cell storing complementary values (0 and 1), with a word line (row select) and bit / !bit lines. Annotation: the pullups may be replaced with resistive pullups to save area.]
° Write:
1. Drive the bit lines (bit = 1, !bit = 0)
2. Select the row
° Read:
1. Precharge bit and !bit to Vdd or Vdd/2 => make sure they are equal!
2. Select the row
3. The cell pulls one line low
4. The sense amp on the column detects the difference between bit and !bit
Typical SRAM Organization: 16-word x 4-bit
[Diagram: a 16-row x 4-column array of SRAM cells (Word 0, Word 1, ..., Word 15), one sense amp and one write driver/precharger per column, an address decoder driven by A0-A3, WrEn and Precharge controls, data inputs Din 0-Din 3, and data outputs Dout 0-Dout 3.]
Q: Which is longer: the word line or the bit line?
Logic Diagram of a Typical SRAM
[Diagram: a 2^N-word x M-bit SRAM with an N-bit address bus A, an M-bit data bus D, and WE_L and OE_L control inputs.]
° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
• A new control signal, output enable (OE_L), is needed
• WE_L asserted (Low), OE_L deasserted (High):
- D serves as the data input pin
• WE_L deasserted (High), OE_L asserted (Low):
- D is the data output pin
• Both WE_L and OE_L asserted:
- Result is unknown. Don't do that!!!
° Although we could change the VHDL to do what we desire, we must do the best with what we've got (vs. what we need)
Word and bit lines slow down as the array grows larger! Architects specify the number of rows and columns.
Parallel data I/O lines: add muxes to select a subset of bits.
How could we pipeline this memory?
Cache Design Example
CPU address space: An array of "blocks"
[Diagram: memory as an array of 32-byte blocks, numbered 0, 1, 2, ..., 2^27 - 1.]
32-bit memory address: bits 31..5 select which block (27 bits); bits 4..0 select the byte # within the block (5 bits).
The job of a cache is to hold a "popular" subset of blocks.
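The block/byte split can be sketched in a few lines (field widths taken from the slide; the example address is illustrative):

```python
# Splitting a 32-bit address into block number and byte offset,
# for 32-byte blocks: 5 offset bits, 27 block-number bits.
BLOCK_BITS = 5

def split_address(addr):
    """Return (block #, byte # within the block)."""
    block = addr >> BLOCK_BITS               # bits 31..5: which block
    byte  = addr & ((1 << BLOCK_BITS) - 1)   # bits 4..0: byte offset
    return block, byte

print(split_address(0x44))   # (2, 4): byte 4 of block 2
```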
One Approach: Fully Associative Cache
Address split: Cache Tag (27 bits, address bits 31..5) and Byte Select (5 bits, bits 4..0; ex: 0x04).
[Diagram: the cache data holds 4 blocks of 32 bytes (Byte 0 ... Byte 31), each with a valid bit and a stored block # ("tag", bits 26..0). The incoming tag is compared against all four stored tags at once (one = comparator per block); any match raises Hit and returns the bytes of the "hit" cache line.]
Ideal, but expensive ...
Building a cache with one comparator
[Diagram: memory as an array of 32-byte blocks, numbered 0 through 2^27 - 1, with blocks colored by cache index. Blocks of a certain color may only appear in one line of the cache.]
32-bit memory address: bits 31..7 select which block (25 bits); bits 6..5 are the color, i.e. the cache index (2 bits); bits 4..0 select the byte # (5 bits).
Example: A Direct Mapped Cache
Address split: Cache Tag (25 bits, address bits 31..7), Index (2 bits, bits 6..5; ex: 0x01), and Byte Select (5 bits, bits 4..0; ex: 0x00).
[Diagram: the index selects one of four cache lines; each line holds a valid bit, a 25-bit cache tag (bits 24..0), and 32 bytes of cache data (Byte 0 ... Byte 31). A single = comparator checks the stored tag against the address tag; a match raises Hit and returns the bytes of the "hit" cache line.]
PowerPC 970: 64K direct-mapped Level-1 I-cache
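A behavioral model of the direct-mapped lookup, using the 25/2/5-bit split above (the stored line contents are illustrative, and a real cache does all of this in hardware):

```python
# Behavioral model of a direct-mapped lookup: 25-bit tag,
# 2-bit index (4 cache lines), 5-bit byte select.
NUM_LINES = 4
valid = [False] * NUM_LINES
tags  = [0] * NUM_LINES
data  = [bytes(32)] * NUM_LINES   # 32-byte blocks

def lookup(addr):
    byte  = addr & 0x1F           # bits 4..0: byte select
    index = (addr >> 5) & 0x3     # bits 6..5: only ONE line to check
    tag   = addr >> 7             # bits 31..7: compare with stored tag
    hit = valid[index] and tags[index] == tag
    return hit, (data[index][byte] if hit else None)

# Fill line 1 with block 0xA0's data (illustrative), then look it up.
valid[1], tags[1], data[1] = True, 0xA0 >> 7, bytes(range(32))
print(lookup(0xA5))   # (True, 5): hit, byte 5 of the block
```

One comparator suffices because each block color maps to exactly one line.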
The limits of direct-mapped caches ...
[Plot: memory address vs. time, one dot per memory access. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
What if both regions have the same block color?
Hybrid Design: Set Associative Cache
Address split: Cache Tag (26 bits), Index (2 bits; ex: 0x01), and Byte Select (4 bits). The cache block is halved (to 16 bytes) to keep the number of cached bits constant.
[Diagram: two ways, each with its own valid bits, cache tags, and cache data arrays of 16-byte blocks. The index selects one set; both stored tags are compared against the address tag (two = comparators raising HitLeft and HitRight), and the bytes of the "hit" set member are returned.]
"N-way" set associative: N is the number of blocks for each color.
PowerPC 970: 32K 2-way set-associative L1 D-cache
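A behavioral model of the 2-way lookup, using the 26/2/4-bit split above (tag match only; the valid-bit arrays and the fill policy shown are illustrative):

```python
# Behavioral model of a 2-way set-associative lookup: 26-bit tag,
# 2-bit index (4 sets), 4-bit byte select (16-byte blocks).
NUM_SETS, WAYS = 4, 2
valid = [[False] * WAYS for _ in range(NUM_SETS)]
tags  = [[0] * WAYS for _ in range(NUM_SETS)]

def lookup(addr):
    index = (addr >> 4) & 0x3    # bits 5..4: select the set
    tag   = addr >> 6            # bits 31..6
    # Both ways of the set are checked in parallel in hardware.
    return any(valid[index][w] and tags[index][w] == tag
               for w in range(WAYS))

# Two blocks with the same color can now coexist in one set:
for base in (0x10, 0x50):        # both map to set 1
    index, tag = (base >> 4) & 0x3, base >> 6
    way = 0 if not valid[index][0] else 1   # fill the first free way
    valid[index][way], tags[index][way] = True, tag

print(lookup(0x10), lookup(0x50))  # True True: no conflict eviction
```

A direct-mapped cache would have evicted one of these two blocks; the second way removes the conflict.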
The benefits of set-associativity ...
[Plot: memory address vs. time, one dot per memory access. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
What if both regions have the same block color?
Q. What costs (over direct mapped) do we pay for this benefit?
Conclusions ...
Program locality is why building a memory hierarchy makes sense.
Latency toolkit: hierarchy design, bit-wise parallelism, pipelining.
Cache operation: compare tags, detect hits, select bytes.
In practice: how many rows, how many columns, how many arrays.
Lectures: Coming up next ...
More on caches + intro to final project
Advanced topics begin ...