Savio Chau Spring Quarter, 2002 Final Review Final: June 10, 2001 3:00 p.m. to 6:00 p.m. Knudsen...

Savio Chau

Spring Quarter, 2002

Final Review

Final: June 10, 2002

3:00 p.m. to 6:00 p.m.

Knudsen 1200B

Extra office hours: Friday 6/7/02 4:30 p.m. to 7:30 p.m.

Saturday 6/8/02 4:00 p.m. to 6:00 p.m.


Areas for Study
• What is computer architecture?
• Number Representation
– Floating point number representation and IEEE 754
– Floating point operations with IEEE 754
• MIPS instruction set
– Able to write simple assembly code with MIPS instruction set
– Understanding of procedure calls and stack management
• Procedure call
– Stack management
• General ideas about single cycle/multi cycle data path and control unit design
• Pipelined Processor
– Basic concepts and data flow in pipeline
– Hazards
• Data Hazard
– Stalling the pipe
– Forwarding (including the special case of lw followed by R-type)
• Control Hazard
– Branch Prediction


Areas for Study
• Memory Hierarchy and Virtual Memory
– Concept of memory hierarchy and locality (spatial and temporal)
– Performance of memory hierarchy: calculation of average access time
– Cache organizations and overheads
• Associativity: direct mapped, set associative, fully associative
• Block size
• Replacement policies
• Write back vs. write through
– Virtual Memory
• Virtual to Physical Address Translation: Page Table, Page Frame Table
• Translation Lookaside Buffer (TLB)
– You should know how to read/write data from a memory hierarchy with a virtual address
• I/O System
– I/O system architecture
– I/O system design process
– I/O system design parameters
– I/O device interface design
– You should be able to do both system level and detailed design


What is Computer Architecture?

• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation

[Figure, courtesy D. Patterson: levels of abstraction, bottom-up view. Software: Application, Operating System, Compiler, Firmware. Instruction Set Architecture (instruction set processor, I/O system). Hardware: Datapath & Control, Digital Design, Circuit Design, Physical Design.]


IEEE 754 Standard for Floating Point Numbers

Two formats: single precision (32-bit) and double precision (64-bit). Single precision format:

Bit 31: sign | Bits 30-23: exponent (biased) | Bits 22-0: significand only (leading 1 is implicit)

• Maximize precision of representation with a fixed number of bits
– Gain 1 bit by making the leading 1 of the mantissa implicit. Therefore, $F = 1 + \text{significand}$ and $\text{Value} = (-1)^s \times (1 + \text{significand}) \times 2^{E}$
• Easy for comparing numbers
– Put sign bit at MSB
– Use bias instead of sign bit for exponent field

Real exponent value = exponent − bias, bias = 127 for single precision. Examples:

Real exponent | IEEE 754 exponent field | Floating point number value
A = −126 | 00000001 | $(-1)^s \times F \times 2^{(1-127)} = (-1)^s \times F \times 2^{-126}$
B = 127 | 11111110 | $(-1)^s \times F \times 2^{(254-127)} = (-1)^s \times F \times 2^{127}$

This is much easier to compare than two's complement exponents, where A = $-126_{10} = 10000010_2$ and B = $127_{10} = 01111111_2$.

• Need to take care of special cases (by convention)
– Value = 0: E = 0, f = 0 (f = significand)
– Value = $(-1)^s \times \infty$: E = 255, f = 0
– Value = $(-1)^s \times (0.f) \times 2^{-126}$: E = 0, f ≠ 0 (value has been denormalized)


IEEE 754 Computation Example

A) 40 = $(-1)^0 \times 1.25 \times 2^5 = (-1)^0 \times 1.01_2 \times 2^{(132-127)}$ = [0][10000100][01000000000000000000000]

B) –80 = $(-1)^1 \times 1.25 \times 2^6 = (-1)^1 \times 1.01_2 \times 2^{(133-127)}$ = [1][10000101][01000000000000000000000]

C) Denormalize the significand with the lower exponent and then align the exponents. From here through step E the third field holds the whole significand with its leading bit written out (no hidden 1), so that both operands share the exponent $2^7$:

40 = $(-1)^0 \times 0.3125 \times 2^7 = (-1)^0 \times 0.0101_2 \times 2^{(134-127)}$ = [0][10000110][01010000000000000000000]

–80 = $(-1)^1 \times 0.6250 \times 2^7 = (-1)^1 \times 0.1010_2 \times 2^{(134-127)}$ = [1][10000110][10100000000000000000000]

D) Need to convert the significand of –80 into 2's complement before the subtraction:

–80 = [1][10000110][10100000000000000000000] → [1][10000110][01100000000000000000000]

40 – 80 = [0][10000110][01010000000000000000000] + [1][10000110][01100000000000000000000] = [1][10000110][10110000000000000000000]

E) Convert the result from 2's complement back into sign-magnitude form: [1][10000110][01010000000000000000000]

F) Renormalize (the leading 1 becomes implicit again): [1][10000110][01010000000000000000000] = [1][10000100][01000000000000000000000] = $(-1)^1 \times 1.01_2 \times 2^5$

Check: 40 – 80 = –40 = $(-1)^1 \times 1.25 \times 2^5 = (-1)^1 \times 1.01_2 \times 2^5$
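The field values in this example can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the helper names `ieee754_fields` and `ieee754_value` are hypothetical, and the sketch relies only on the standard `struct` module to obtain the single-precision bit pattern.

```python
import struct

def ieee754_fields(x):
    """Return (sign, biased exponent, fraction) of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23, biased by 127
    fraction = bits & 0x7FFFFF        # bits 22-0, leading 1 is implicit
    return sign, exponent, fraction

def ieee754_value(sign, exponent, fraction):
    """Rebuild a normalized value: (-1)^s * (1 + significand) * 2^(E - 127)."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

if __name__ == "__main__":
    for x in (40.0, -80.0, 40.0 - 80.0):
        s, e, f = ieee754_fields(x)
        print(f"{x:7.1f} = [{s}][{e:08b}][{f:023b}]")
```

The sketch only handles normalized numbers; the special cases (zero, infinity, denormalized values) would need extra branches.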


Procedure Call: An Overly Simplified Example

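main() /* Caller */
{
    x = y + z;
    funct(arg); /* procedure call */
    …
}

int funct(arg) /* Callee */
{
    w = arg - v;
    return (w);
}

[Figure: register usage during the call. PC holds the address being executed; $a0 ($4) holds arg; $v0 ($2) holds the return value w; $t0 ($8), $t1 ($9), and $t2 ($10) hold x, y, and z; $ra ($31) holds the return address in main. Steps: (1) the caller places arg in $a0, the return address goes into $ra, and PC is set to funct addr; (2) the callee computes w and places it in $v0; (3) PC is reloaded from $ra and execution continues in main.]

But!
• What if there are more than 4 arguments?
• What if some register values need to be preserved across the procedure call (e.g., if you want to preserve the value x)?
• What if another procedure call happens before the current procedure is completed?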


Call-Return Linkage: Stack Frames

Solution:
• Save the needed information (e.g., arguments, return address) onto a stack in memory
• Information needed by the called procedure is grouped into a stack frame
• Many variations on stacks are possible (up/down, last pushed/next)

[Figure: stack frame (activation record), drawn with high memory at the top and low memory at the bottom:
FP → ARGS (frame pointer points to 1st word of frame)
Callee Save Registers (old $fp, $ra, $s0, etc.)
SP → Local Variables (stack pointer points to last word of frame)
The stack below the local variables grows and shrinks during expression evaluation. Arguments and local variables are referenced at a fixed (negative) offset from FP.]


Performance of An Ideal Pipeline

• Latency of Pipeline = Latency of a Single Task
• Potential Throughput Improvement = Number of Pipeline Stages, under the ideal situation that all instructions are independent and there are no branch instructions
• Pipeline Rate is Limited by the Slowest Pipeline Stage

[Figure: three consecutive lw instructions in a five-stage pipeline (IFetch, Reg/Dec, Exec, Mem, WrBack). Each lw starts one cycle after the previous one, so all three finish within Cycles 1 to 7.]
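The ideal-pipeline claims above can be made concrete with a small calculation. This is a sketch under stated assumptions: every stage takes exactly one clock, there are no hazards, and the unpipelined baseline spends all k stage-times on each instruction. The function names are hypothetical.

```python
def pipeline_cycles(n_instructions, n_stages):
    """Cycles to finish n instructions in an ideal k-stage pipeline.

    The first instruction needs k cycles (latency of a single task);
    every later instruction completes one cycle after the previous one.
    """
    return n_stages + n_instructions - 1

def unpipelined_cycles(n_instructions, n_stages):
    """Cycles when each instruction must finish before the next one starts."""
    return n_instructions * n_stages

def ideal_speedup(n_instructions, n_stages):
    """Throughput improvement of the ideal pipeline over the baseline."""
    return unpipelined_cycles(n_instructions, n_stages) / pipeline_cycles(
        n_instructions, n_stages
    )

if __name__ == "__main__":
    print(pipeline_cycles(3, 5))        # the three lw instructions in the figure
    print(ideal_speedup(1_000_000, 5))  # approaches the number of stages
```

For long instruction streams the speedup approaches the number of stages, which is the "potential throughput improvement" stated on the slide.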




Example of Detailed Pipeline Operations

See MIPS Example in Class

[Figure: pipelined MIPS datapath. PC, Instruction Memory, Registers, ALU (with ALU Control), and Data Memory are separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Control signals shown: PCsrc, RegWrite, ALUsrc, ALUop, RegDst, Branch, MemWr, MemRd, MemtoR.]

Example instruction sequence:

Clk | PC | Instruction
1 | 00 | lw $2, 0($3)
2 | 04 | add $4, $0, $5
3 | 08 | sw $6, 4($3)
4 | 12 | addi $7, $2, 100
5 | 16 | add $8, $2, $5
6 | 20 | add $9, $2, $4
7 | 24 | sub $10, $4, $7
8 | 28 | add $11, $7, $8


Signal Propagation through the Example Pipeline

[Table: contents of the pipeline registers for the example sequence at Clocks 4 through 7. Columns: instruction in PC, PC; instruction in IF/ID, IF/ID.rs, IF/ID.rt, IF/ID.rd, IF/ID.Immed16; instruction in ID/EX, ID/EX.A, ID/EX.B, ID/EX.Immed16, ID/EX.rt, ID/EX.rd, ID/EX.ALUsrc, ID/EX.ALUop, ID/EX.RegDst, ID/EX.Branch, ID/EX.MemWr, ID/EX.MemRd, ID/EX.MemtoR, ID/EX.RegWrite; instruction in EX/MEM, EX/MEM.ALUout, EX/MEM.B, EX/MEM.rd, EX/MEM.branchadd, EX/MEM.Zero, EX/MEM.Branch, EX/MEM.MemWr, EX/MEM.MemRd, EX/MEM.MemtoR, EX/MEM.RegWrite; instruction in MEM/WB, MEM/WB.mdo, MEM/WB.ALUout, MEM/WB.rd, MEM/WB.MemtoR, MEM/WB.RegWrite.]

Instruction held in each pipeline register:

Clock | IF/ID | ID/EX | EX/MEM | MEM/WB
4 | addi | sw | add | lw
5 | add | addi | sw | add
6 | add | add | addi | sw
7 | sub | add | add | addi


Single Cycle, Multiple Cycle, vs. Pipeline


Pipeline Hazards

• Pipelining Limitations: Hazards are situations that prevent the next instruction from executing during its designated cycle
– Structural Hazard: resource conflict when several pipelined instructions need the same functional unit simultaneously
– Data Hazard: an instruction depends on the result of a prior instruction that is still in the pipeline
– Control Hazard: pipelining of branches and other instructions that change the PC
• Solutions:
– Common to all: stall the pipeline by inserting "bubbles" until the hazard is resolved
– Structural: don't share components between instructions, use special components (e.g., 2-port memory)
– Data: re-ordering of instructions, forwarding
– Control: branch prediction, re-ordering of instructions


To Stall a Pipelined Data Path

Don't change the PC, keep fetching the same instruction, and set all control signals in the ID/EX pipeline register to benign values (0).

[Figure: sub r4, r1, r3 is re-fetched in successive cycles while the PC is not updated. Each refetch creates a bubble: a stage whose control signals are all set to 0 (i.e., do nothing). The bubbles move down the pipeline until sub is finally executed.]


Hardware to Stall The Pipeline

• Step 1: Detect the hazard (check whether a lw is being executed and whether the data it loads is one of the operands of the next instruction)
– Stall = ID/EX.MemRead and ((ID/EX.rt = IF/ID.rs) or (ID/EX.rt = IF/ID.rt))
• Step 2: If Stall is true
– Do not fetch the next instruction: disable writing to the PC and IF/ID registers
– Disable all control signals of the current instruction
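The two steps above can be sketched in software. This is an illustrative Python model, not the hardware itself: the class names `IdEx` and `IfId` and the output dictionary keys are hypothetical, while the stall condition is the one given on the slide.

```python
from dataclasses import dataclass

@dataclass
class IdEx:
    mem_read: bool  # ID/EX.MemRead (1 when the instruction in EX is a lw)
    rt: int         # ID/EX.rt, the register the lw will write

@dataclass
class IfId:
    rs: int         # IF/ID.rs, first source register of the next instruction
    rt: int         # IF/ID.rt, second source register of the next instruction

def hazard_detect(id_ex, if_id):
    """Step 1: compute Stall. Step 2: derive the write enables from it."""
    stall = id_ex.mem_read and (id_ex.rt == if_id.rs or id_ex.rt == if_id.rt)
    return {
        "stall": stall,
        "pc_write": not stall,     # PC keeps its value while stalling
        "if_id_write": not stall,  # IF/ID keeps the same instruction
        "zero_controls": stall,    # bubble: all control signals forced to 0
    }

if __name__ == "__main__":
    # lw r1, 0(r2) in ID/EX, sub r4, r1, r3 in IF/ID
    print(hazard_detect(IdEx(mem_read=True, rt=1), IfId(rs=1, rt=3)))
```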

[Figure: pipelined datapath with a Hazard Detect unit and a Forwarding Unit. Hazard Detect takes ID/EX.MemRead, ID/EX.rt, IF/ID.rs, IF/ID.rt, and IF/ID.opcode as inputs and drives PC Wr, IF/ID Wr, and a mux that replaces the control signals entering ID/EX with 0.]


Stalling The Pipeline Example: R-type after lw

lw r1, 0(r2)
sub r4, r1, r3
and r6, r7, r1
or r8, r1, r9

[Figure sequence: the pipelined datapath with Hazard Detect and Forwarding Unit, shown cycle by cycle for the code above.]

1. lw is in ID/EX and sub is in IF/ID. ID/EX.MemRead = 1 (lw instruction), ID/EX.rt = r1, and IF/ID.rs = r1, so the hazard is detected. Control signals travelling with lw: MemRead = 1, MemWr = 0, RegWr = 1.
2. Hazard Detect sets PC Wr = 0 and IF/ID Wr = 0, so the PC and the IF/ID register keep their contents.
3. lw moves to EX/MEM (MemRead = 1, MemWr = 0, RegWr = 1). A bubble enters ID/EX with MemRead = 0, MemWr = 0, RegWr = 0 (not doing anything). sub is re-fetched and remains in IF/ID (IF/ID.rs = r1).
4. lw moves to MEM/WB (RegWr = 1), the bubble moves to EX/MEM (MemRead = 0, MemWr = 0, RegWr = 0), sub enters ID/EX (RegWr = 1), and the and instruction enters IF/ID.
5. The lw data is available. sub moves to EX/MEM (MemRead = 0, MemWr = 0, RegWr = 1), the bubble moves to MEM/WB (RegWr = 0), the and instruction enters ID/EX (RegWr = 1), and the or instruction enters IF/ID.
6. sub moves to MEM/WB, the and instruction moves to EX/MEM, and the or instruction enters ID/EX (RegWr = 1 for all three). The bubble has not changed any state of the pipeline.
7. The and instruction moves to MEM/WB and the or instruction moves to EX/MEM; the lw data and the sub data have both been produced. The bubble has not changed any state of the pipeline.


Data Hazard Solution: Forwarding

Logic Equations for the Control Outputs of the Forwarding Unit

• Fwd A = 1 (i.e., Type 1a) if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRs))
Fwd A = 2 (i.e., Type 2a) if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRs))
• Fwd B = 1 (i.e., Type 1b) if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRt))
Fwd B = 2 (i.e., Type 2b) if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRt))
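The forwarding equations can be expressed as one function that is evaluated once for each ALU input. This is a hedged Python sketch with a hypothetical function name. One assumption goes beyond the slide: when both the EX/MEM and MEM/WB conditions match, the sketch gives priority to EX/MEM because it holds the more recent result.

```python
def forward_select(ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd, id_ex_src):
    """Return the mux select for one ALU input (0 = register file value).

    id_ex_src is ID/EX.RegRs when computing Fwd A and ID/EX.RegRt for Fwd B.
    """
    # Type 1 hazard: result is in EX/MEM (assumed to win if both match)
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return 1
    # Type 2 hazard: result is in MEM/WB
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_src:
        return 2
    return 0

if __name__ == "__main__":
    # add r1, r2, r3 in EX/MEM; sub r4, r1, r3 in EX
    fwd_a = forward_select(True, 1, False, 0, id_ex_src=1)  # rs = r1
    fwd_b = forward_select(True, 1, False, 0, id_ex_src=3)  # rt = r3
    print(fwd_a, fwd_b)
```

The check against register 0 reflects that $0 is hard-wired to zero in MIPS, so a result destined for it must never be forwarded.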

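[Figure: datapath with the Forwarding Unit. Its inputs are rs and rt from ID/EX and rd from EX/MEM and MEM/WB; its outputs Fwd A and Fwd B drive Mux A and Mux B in front of the ALU. Each mux selects input 0 (register file value), 1 (EX/MEM result), or 2 (MEM/WB result).]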


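Forwarding Example

add r1, r2, r3
sub r4, r1, r3
and r6, r7, r1

[Figure: three snapshots of the datapath with the Forwarding Unit.
1. add is in EX and computes A + B, with A = R[rs] and B = R[rt].
2. add is in EX/MEM and sub is in EX. sub needs r1 as rs: Type 1a hazard, Fwd A = 01, so A + B is forwarded from EX/MEM to ALU input A, and the ALU computes A − B.
3. add is in MEM/WB, sub is in EX/MEM, and the and instruction is in EX. It needs r1 as rt: Type 2b hazard, Fwd B = 10, so A + B is forwarded from MEM/WB to ALU input B, and the ALU computes A • B.]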


One Case Forwarding Can't Avoid Stalling

lw r1, 0(r2)
sub r4, r1, r3
and r6, r7, r1

Problem: lw followed by R-type. The lw instruction is still reading memory when the sub instruction needs the data for EX. Need to stall 1 cycle (see previous example).

[Figure: lw is in EX/MEM while the dependent R-type instruction is in EX. This is a Type 1a hazard, but the EX/MEM output cannot be forwarded: it holds the address computed by lw, which is not the valid output of lw. The valid output for lw, Mem[addr], is available only after the memory access, in MEM/WB.]


Control Hazard Solution: Branch Prediction (e.g., Predict Branch Not Taken)

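[Figure: pipeline diagram for predict-branch-not-taken. The instructions at PC = 12, 16, 20, 24, and 28 (or $15, $7, $3) are fetched in sequence on the assumption that the branch is not taken. The result of the comparison is not to branch.]

Prediction is correct: branching does not cause any penalty.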


Penalty of Wrong Prediction




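[Figure: pipeline diagram for a wrong prediction. The instructions at PC = 12, 16, 20, and 24 are fetched in sequence; the result of the comparison is branch taken, so fetching resumes at the branch target, PC = 36.]

Prediction is incorrect: the pipe must be flushed, and the penalty is the same as without branch prediction (3 cycles).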


To Reduce Branch Penalty: Move Address Calculation Hardware Forward

[Figure: datapath before the change; a taken branch costs a 1st, 2nd, and 3rd clock delay.]


To Reduce Branch Penalty: Move Address Calculation Hardware Forward

[Figure: datapath with the address calculation hardware moved forward; only the 1st clock delay remains.]


Memory Hierarchy
• Motivations:
– Large memories (DRAM) are slow and lower cost
– Small memories (SRAM) are fast but higher cost
• Goal: present the user with a large memory at the lowest cost while providing access at a speed comparable to the fastest technology
• Reduce the required bandwidth of the large memory

[Figure: memory hierarchy consisting of a fast (small) memory in front of a large (slow) memory.]


Typical Memory Hierarchy

Performance:

Level | Capacity | Access time | Cost
CPU Registers | 100's of bytes | < 10's of ns |
Cache | K bytes | 10-100 ns | $0.01 - 0.001/bit
Main Memory | M bytes | 100 ns - 1 µs | $0.01 - 0.001/bit
Disk | G bytes | ms | 10^-3 - 10^-4 cents/bit
Tape | infinite capacity | sec-min | 10^-6 cents/bit

[Figure: hierarchy from top to bottom: Registers, Cache, Memory, Disk, Tape.]


Why Memory Hierarchy Works?

• The Principle of Locality:
– A program accesses a relatively small portion of the address space at any instant of time. Example: 90% of time in 10% of the code
– Put all data in the large slow memory and put the portion of the address space being accessed into the small fast memory
• Two Different Types of Locality:
– Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon
– Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon


Analysis of Memory Hierarchy Performance

General Idea
• Average Memory Access Time = Upper level hit rate × Upper level hit time + Upper level miss rate × Miss penalty
• Example, let:
– $h$ = hit rate: the percentage of memory references that are found in the upper level
– $1 - h$ = miss rate
– $t_m$ = the hit time of the main memory
– $t_c$ = the hit time of the cache memory
• Then, Average Memory Access Time = $h\,t_c + (1-h)(t_c + t_m) = t_c + (1-h)\,t_m$

Note: This example assumes the cache has to be looked up to determine whether a miss has occurred. The time to look up the cache is also equal to $t_c$.

• This formula can be applied recursively to multiple levels. Let:
– The subscript $L_n$ refer to the upper level memory (e.g., a cache)
– The subscript $L_{n-1}$ refer to the lower level memory (e.g., main memory)
– Average Memory Access Time = $h_{L_n} t_{L_n} + (1-h_{L_n})\,[\,t_{L_n} + \{h_{L_{n-1}} t_{L_{n-1}} + (1-h_{L_{n-1}})(t_{L_{n-1}} + t_m)\}\,]$
• The trick is how to find the miss penalty
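The recursive form of the formula lends itself to a short function. The following Python sketch uses a hypothetical name, `amat`, and makes the same assumption as the note on the slide: each level must be looked up (costing its hit time) before a miss is known. The numbers in the example are made up for illustration.

```python
def amat(levels, t_m):
    """Average memory access time for a hierarchy.

    levels: list of (hit_rate, hit_time) pairs, fastest level first.
    t_m:    hit time of the final level, which is assumed never to miss.

    Per level: h*t + (1-h)*(t + penalty) = t + (1-h)*penalty,
    where the penalty is the access time of the rest of the hierarchy.
    """
    if not levels:
        return t_m
    hit_rate, hit_time = levels[0]
    return hit_time + (1 - hit_rate) * amat(levels[1:], t_m)

if __name__ == "__main__":
    # One cache level: t_c + (1 - h) * t_m
    print(amat([(0.75, 1)], t_m=8))
    # Two levels, matching the recursive formula on the slide
    print(amat([(0.5, 1), (0.75, 4)], t_m=16))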


Cache Organization

• Mechanism for looking up data
– Index: to look up a block or a set in the cache
– Tag: to determine if the data is what you want (hit or miss)
– Byte Select (or Word Select): to select the byte (or word) that you need in a block
• Block size: to take advantage of spatial locality
– Temporal locality might be compromised if block size is too large
– In general, larger block size has higher miss penalty (unless wide parallel memory is used)
• Associativity: to reduce conflict
– Direct Mapped
– Set Associative
– Fully Associative
• Write Policy: to ensure consistency between cache and memory
– Write Through
– Write Back


Large Block Size

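• For a $2^N$ byte direct mapped cache with 32-bit addresses:
– The uppermost (32 − N) bits are always the cache tag
– The lowest M bits are the byte select (block size = $2^M$ bytes)
– The middle (N − M) bits are the cache index

[Figure: example address split into cache tag 0x50, cache index 0x01, and byte select 0x00; the byte select drives a mux that picks one byte of the 32-byte block, and the tag comparison produces Hit.]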
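The address split can be sketched in a few lines. This Python example assumes a direct mapped cache of $2^N$ bytes with $2^M$-byte blocks, so the index is taken as the N − M bits between the byte select and the tag. The function name is hypothetical, and the 1 KB cache with 32-byte blocks in the usage example is an assumption chosen for illustration.

```python
def split_address(addr, n, m):
    """Split a 32-bit address for a direct mapped cache.

    n: log2 of the cache size in bytes
    m: log2 of the block size in bytes
    Returns (tag, index, byte_select).
    """
    byte_select = addr & ((1 << m) - 1)          # lowest M bits
    index = (addr >> m) & ((1 << (n - m)) - 1)   # middle N - M bits
    tag = addr >> n                              # uppermost 32 - N bits
    return tag, index, byte_select

if __name__ == "__main__":
    # Assumed geometry: 1 KB cache (N = 10), 32-byte blocks (M = 5)
    addr = (0x50 << 10) | (0x01 << 5) | 0x00
    tag, index, byte_select = split_address(addr, n=10, m=5)
    print(hex(tag), hex(index), hex(byte_select))
```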


Associativity

• Direct Mapped: memory blocks (M mod N) go only into a single block
• Set Associative: memory blocks (M mod N) can go anywhere in a set of blocks
• Fully Associative: memory blocks (M mod N) can go anywhere in the cache

[Figure: memory blocks 0 to 9 mapped onto a four-block cache (blocks 0 to 3) in three ways: direct mapped; set associative with Set 0 and Set 1; fully associative, where the set is the entire cache.]


Cache Overhead Estimation Example

Memory size = 4 Gbytes (i.e., 32-bit address)
Cache size = 64 Kbytes
Word addressable

[Figure: two-way set associative cache. Address = 19-bit tag | 12-bit index | 1-bit word select. Each set (index 0 to $2^{12} - 1$) holds two blocks; each block holds a V bit, a D bit, a 19-bit tag, and two 32-bit words (Word #1, Word #2). Two comparators (=) check the tags. In each block a bank of 32 2-to-1 muxes selects word 1 or word 2, and a final bank of 32 2-to-1 muxes selects the block that hit, producing the 32-bit data and Hit.]

Number of indexes = $2^{16}$ bytes × (1 word / 4 bytes) × (1 block / 2 words) × (1 set / 2 blocks) = $2^{12}$ (number of sets = number of indexes)
Number of index bits = 12 bits
Number of word select bits = 1
Number of bits in tag = 32 bits − 12 bits − 1 bit = 19 bits
Storage overhead = (19 bits + 1 bit + 1 bit)/block × 2 blocks/set × $2^{12}$ sets = 172032 bits
Number of comparators = 19 bits/block × 2 blocks/set = 38 (comparator bits)
Number of multiplexors = 32 + 32 + 32 = 96 (2-to-1 mux)
Miscellaneous gates: 2 AND gates and 1 OR gate
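The overhead estimate can be reproduced with a short calculation. This Python sketch uses a hypothetical function name and parameter names; the two status bits per block correspond to the "1 bit + 1 bit" in the storage overhead line (the V and D bits in the figure), and the address is treated as word addressable, so there is no byte offset field.

```python
def cache_overhead(cache_bytes, addr_bits, bytes_per_word,
                   words_per_block, blocks_per_set, status_bits_per_block):
    """Estimate index, word select, and tag widths plus storage overhead."""
    sets = cache_bytes // (bytes_per_word * words_per_block * blocks_per_set)
    index_bits = sets.bit_length() - 1                  # log2(sets), power of two assumed
    word_select_bits = words_per_block.bit_length() - 1  # log2(words per block)
    tag_bits = addr_bits - index_bits - word_select_bits
    overhead_bits = (tag_bits + status_bits_per_block) * blocks_per_set * sets
    return {
        "sets": sets,
        "index_bits": index_bits,
        "word_select_bits": word_select_bits,
        "tag_bits": tag_bits,
        "overhead_bits": overhead_bits,
        "comparator_bits": tag_bits * blocks_per_set,
    }

if __name__ == "__main__":
    # 64 KB, 32-bit address, 4-byte words, 2 words/block, 2 blocks/set, V + D bits
    print(cache_overhead(64 * 1024, 32, 4, 2, 2, 2))
```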


Similarities Between Cache and Virtual Memory

• Both Use Two Levels of Memories

– Higher Level: Faster and Smaller

– Lower Level: Slower and Larger

• Both Rely on the Principle of Locality

• Both Use Associativity to Reduce Conflicts

• Both Need to Decide Which Block in Higher Level has to be Replaced Upon Miss

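[Figure: cache, main memory, and secondary storage in a row. Cache design covers the cache and main memory pair; virtual memory design covers the main memory and secondary storage pair.]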


Differences Between Cache and Virtual Memory

• Cache is several orders of magnitude faster than virtual memory, while virtual memory is several orders of magnitude larger than cache

• Consequently– Virtual memory can use software to track blocks in use while cache

has to use hardware– The cost to implement full associativity is low for Virtual memory

and very high for cache– Virtual memory can use more sophisticated block replacement

algorithms– Virtual memory has to use write-back while cache can use write-

back or write-through

Parameter Typical Value in Cache

Typical Value in Virtual Memory

Total Size in Blocks 1000 - 1000,000 2000 - 250,000 Total Size in Kbytes 8 - 8,000 8000 - 8,000,000 Block Size in Bytes 16 - 256 4000 - 64,000 Miss Panelty in Cycles 10 - 100 1 M - 10 M Miss Rate 0.1% - 10% 0.00001% - 0.0001%


Page Table and Page Frame Table
• Page Table:
– Used by the program to keep track of which pages are in the secondary store and which are in main memory
– Translates a virtual memory address into a physical address

[Figure: the page table pointer and the virtual page # select an entry in the page table. Virtual page addresses shown: C37000, ..., 737000, 29B000.]

Physical Page Address | Access Right | Valid
000F000 | X | 1
... | ... | ...
00002000 | R | 1
00001000 | R/W | 0

• Page Frame Table:
– Used by the operating system to know how the pages in main memory are allocated to different active jobs
– Provides information for deciding which page is a candidate to be replaced

Page Frame # in Main Memory | Used Bit | Dirty Bit | User | Virtual Page Address
0 (addr = 000000) | 1 | 1 | A | 0000029B
1 (addr = 001000) | 1 | 0 | B | 00000737
... | ... | ... | ... | ...
2^n − 1 (addr = FFF000) | 0 | 0 | A | 000C374


Address Mapping

• Address Translation Determines If Main Memory Has the Requested Page by Examining the Valid Bit of the Page in the Page Table

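• If the requested page is not in main memory, the operating system transfers the data from secondary memory to main memory and then sets the valid bit. The old page is written back to secondary memory first if necessary (e.g., page modified but not saved).

[Figure: two cases. V = 1: the physical address goes straight to the cache. V = 0: the new page is read from secondary memory (read address), the old page is written back (write address), the valid bit is set to 1, and the new physical address goes to the cache.]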


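Translation of Virtual to Physical Address
• Page table located in physical memory
• V = Valid Bit:
– V = 1: page is in main memory
• Access Rights: R = Read-Only, R/W = Read/Write, X = Execute Only

[Figure: the virtual page number selects a page table entry holding the valid bit, the access rights, and the physical page #. The physical page # combined with the page offset forms the physical address, which is sent to memory if V = 1.]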


Translation Lookaside Buffer

• Cache of Recently Used Page Table Entries

• Can Be Fully Associative, Set Associative, or Direct Mapped

• Direct Mapped TLB Example:

Note: The dirty bit indicates whether the page in memory has been modified. If it has not been modified, the page can be replaced without copying it back to disk.


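Virtual Memory and Cache Mappings
Example: DECstation 3100

Note: Another important bookkeeping bit, the Write Access Bit for write protection, is not shown.

[Figure: the 32-bit virtual address is split into a virtual page number (bits 31-12, 20 bits) and a page offset (bits 11-0, 12 bits). The TLB (Valid, Dirty, Tag, Physical Page #) compares tags and, on a TLB Hit, supplies the 20-bit physical page number. The resulting physical address is split into Tag, Index (14 bits), and Byte Offset (2 bits) to look up the cache (Valid, Tag, Data), which produces the 32-bit Data and Cache Hit.]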


Accessing Data from Memory Hierarchy

Virtual Address Format: Virtual Page # (TLB Tag | TLB Index) | Offset

Procedure:
Step 1: Translate virtual address to physical address
  Use TLB to reduce page table look up time
  If hit, use physical address in TLB to look up cache (step 2)
  If miss, go to page table in main memory
    If found in page table, update TLB and look up cache (step 2)
    If page fault, use page frame table to pick a page in memory to be replaced
      update page frame table
      update page table in memory
      copy data from disk to the selected memory page
        if the selected page is dirty, write it back to disk first
      update cache if the data from disk has a cache hit
      update TLB, get physical address and go to step 2
Step 2: Use physical address to access data from cache
  If hit, use data from cache
  If miss, go to main memory to access data
    update cache
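The two-step procedure can be sketched as a toy software model. Everything here is an assumption made for illustration: the 4 KB page size, the use of Python dictionaries for the TLB, page table, cache, and memory, and the function names. Page-fault handling (picking a victim with the page frame table and copying from disk) is deliberately left out and reported as an exception.

```python
PAGE_BITS = 12  # assumed page size of 4 KB

def translate(vaddr, tlb, page_table):
    """Step 1: virtual to physical address. Returns (paddr, 'tlb_hit' or 'tlb_miss')."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                       # TLB hit
        return (tlb[vpn] << PAGE_BITS) | offset, "tlb_hit"
    if vpn in page_table:                # found in page table: update TLB
        tlb[vpn] = page_table[vpn]
        return (tlb[vpn] << PAGE_BITS) | offset, "tlb_miss"
    raise LookupError("page fault: the OS would bring the page in from disk")

def read(vaddr, tlb, page_table, cache, memory):
    """Step 2: use the physical address to access the cache, then main memory."""
    paddr, tlb_event = translate(vaddr, tlb, page_table)
    if paddr in cache:                   # cache hit
        return cache[paddr], tlb_event, "cache_hit"
    value = memory[paddr]                # cache miss: go to main memory
    cache[paddr] = value                 # update cache
    return value, tlb_event, "cache_miss"

if __name__ == "__main__":
    tlb, cache = {}, {}
    page_table = {0x12: 0x7}             # virtual page 0x12 -> physical page 0x7
    memory = {0x7ABC: 99}
    print(read(0x12ABC, tlb, page_table, cache, memory))  # first access misses
    print(read(0x12ABC, tlb, page_table, cache, memory))  # second access hits
```

The second access hits in both the TLB and the cache, which is the behavior the principle of locality relies on.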


I/O System Architecture Overview

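[Figure: layers of an I/O system. Software: User Application, Operating System (reached through a system call), Device Driver. Hardware: I/O Controller (on the memory or I/O bus), I/O Device (connected through the media). The layers are grouped into the System Interface, the Logical level (device drivers), and the Physical level (I/O controllers and devices). Protocol can be defined at all levels.]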


A Classification of I/O According to the Targets of I/O Operation

• Processor to Memory: very low latency, very high throughput, very low protocol overhead
• Processor to Peripheral: latency, throughput, and protocol overhead vary according to the I/O devices
• Processor to Processors
– Tightly Coupled: all processors share a physical memory. Low latency, high throughput, low overhead protocol, coherence problem
– Loosely Coupled: each processor has its own physical memory. Medium latency, medium throughput, high protocol overhead, scalable
• Processor to Network: high latency, low throughput, high protocol overhead, very scalable


I/O System Example

[Figure: example I/O system. A Processor with Cache sits on a Memory - I/O Bus together with Main Memory, I/O Controllers (for two disks, graphics, and the network), a Network Interface Controller, and an IEEE 1394 Bus Interface Controller. The IEEE 1394 bus leads to other processors (each with its own cache) or peripherals.]


I/O System Design Process
• Establish Requirements: understand what you need
• Select the I/O System That Has the Required Capability: understand what the I/O system being considered can do
• Integration: understand how everything fits together
• Implementation

[Figure: the design progresses from candidate devices (Device A? Device B? Device C? Device D?) and candidate buses (Bus A? Bus B? Bus C?) to selected devices on a selected bus (Bus B), with the integration details still open.]


I/O System Design Example: Establish Requirements

• Design an I/O architecture for a spacecraft that has the following equipment

[Figure: 17 subsystems to be connected by the I/O system: Flight Computer (CDH), Flight Computer (ACS), Flight Computer (Payload), 2 Star Trackers, 2 Telecom Subsystems, 2 Inertial Measurement Units, 2 Power Control Units, 2 Thruster Control Units, Wide Angle Camera, High Resolution Camera, Radar Sounder, Altimeter.]

Subsystem requirements shown in the figure:
– Data Rate: 5 Kbps, 1 transaction/sec, Latency < 10 ms
– Data Rate: 8 Mbps, 1000 samples/sec, Latency < 0.1 ms
– Data Rate: 10 Kbps, 1000 samples/sec, Latency < 0.1 ms
– Data Rate: 400 bps, 2 commands/sec, Latency < 0.5 sec
– Data Rate < 100 bps, 10 commands/sec, Latency < 0.1 ms
– Data Rate: 20 Mbps, 2 frames/sec, Latency < 0.5 sec
– Data Rate: 20 Mbps, 2 frames/sec, Latency < 0.5 sec
– Data Rate: 1 Mbps, 1 transaction/sec, Latency < 1 sec
– Data Rate: 5 Kbps, 100 samples/sec, Latency < 0.01 sec

System Constraints (Prioritized):
1. Total power consumption of the avionics system < 100 W.
2. The I/O system power consumption should be less than 35% of the avionics system.
3. Each subsystem has to meet the latency and throughput requirements.
4. System reliability should exceed 12 years (i.e., requires fault tolerance).
5. The system design should be scalable and distributed.
6. Maximum distance between subsystems is 5 meters. Average distance is 3 m.
7. Minimize the cable mass.


I/O System Design Example: Candidate I/O Interface

Metrics | IEEE 1394 (Cable version) | IEEE 1393 | Fiber Channel | I2C | UART (Direct Interface) | Ethernet (IEEE 802.3)
Raw Bandwidth | 100, 200, 400 Mbps | 200 to 1000 Mbps | 1 Gbps | 100, 400 Kbps | 115 Kbps to 10 Mbps | 10, 100 Mbps
Latency | 125 µs max | 196 bits × N nodes | 196 bits × N (loop) | Nondeterministic | < 100 ns | Nondeterministic
Topology | Tree | Ring | Loop, Star, Switch network | Multi-Drop | Star | Multi-Drop
Signal Level Protocol | Async | Async | Async | Async | Async | Async
Cable Type | Electrical (Twisted pair) | Optical Fiber | Optical Fiber, Electrical (Twisted pair) | Electrical (Single end) | Electrical (Twisted pair) | Electrical (Coaxial)
Power (Note 1) | 1 W/node | 8 W/node | 8 W/node | 5 mW/node | 35 mW/node | 150 mW/node
Multi-master | Yes | Yes | Yes | Yes | No | Yes
Max. # Nodes | 64 | 127 | 127 for Loop | 128 | N/A | 248
Max Bus Length (Note 1) | 72 m (4.5 m/hop) | 10 km (100 m/hop) | Fiber: 10 km, Electrical: 30 m | Approx. 40 m (load < 400 pF) | Approx. 10 m | 500 m
Protocol Overhead | 8% for 278 byte data | 3 bytes per 53-byte frame | 25% for 2168 byte data (Note 2) | 1 byte address + Ack bit/byte | 1 start + 1 stop bits/byte (25%) | 64 bytes/msg (msg < 1500 B)


I/O System Design Example: Selecting an I/O Interface

• There are 17 nodes in the system and the power allocation of the I/O system is 35 W. This eliminates the Fiber Channel and the IEEE 1393.
• The latency requirement eliminates the I2C and Ethernet.
• The total bandwidth requirement of the system is 56 Mbps. This eliminates the UART.
• The system reliability requirement eliminates the IEEE 1394 bus because the tree topology is not very fault tolerant.
• All interface options, except the UART, are buses and thus meet the scalability requirement. All bus options here support distributed processing.
• The distance requirement prohibits the search for a parallel bus.
• All interface options, except the UART, are serial buses and thus meet the cable mass requirement.

PROBLEM: WE DON'T HAVE AN OPTION THAT CAN MEET ALL REQUIREMENTS!

Resolution: Since power consumption and latency are technology dependent and difficult to improve, the next best option is to improve system reliability using fault tolerance design techniques. Therefore, the IEEE 1394 is the best choice in this case, but it needs to be enhanced with fault tolerance design techniques. Use dual redundant buses.

Check: Since redundant buses have to be used, the number of interfaces of the IEEE 1394 bus is doubled. The power consumption will be 17 × 1 W × 2 = 34 W. This is OK since it is still within the 35 W power constraint.
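The power screening and the final check can be reproduced with a few lines. This Python sketch takes the per-node power figures from the candidate interface table and the 17-node count and 35 W budget from the selection argument; the dictionary and function names are hypothetical.

```python
NODES = 17
IO_POWER_BUDGET_W = 35.0

# Per-node power from the candidate interface table, in watts
POWER_PER_NODE_W = {
    "IEEE 1394": 1.0,
    "IEEE 1393": 8.0,
    "Fiber Channel": 8.0,
    "I2C": 0.005,
    "UART": 0.035,
    "Ethernet": 0.150,
}

def io_power_w(interface, nodes=NODES, redundancy=1):
    """Total I/O power: one interface per node per redundant bus."""
    return POWER_PER_NODE_W[interface] * nodes * redundancy

def within_budget(interface, redundancy=1):
    return io_power_w(interface, redundancy=redundancy) <= IO_POWER_BUDGET_W

if __name__ == "__main__":
    for name in POWER_PER_NODE_W:
        print(f"{name:14s} {io_power_w(name):7.2f} W  ok={within_budget(name)}")
    # Dual redundant IEEE 1394 buses, as in the Check paragraph
    print(io_power_w("IEEE 1394", redundancy=2), within_budget("IEEE 1394", redundancy=2))
```

The sketch covers only the power constraint; latency, bandwidth, and reliability still have to be screened separately, as the bullet list does.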


Key I/O Design Parameters to be Discussed

• Connectivity

• Protocol

• Access Control

• Performance

• Expandability

• Failure Handling

• Operating System Support

Typical I/O System Layers and Key Parameters

Physical: Protocol, Connectivity, Access Control, Performance, Expandability, Failure Handling
Logical: Protocol, Failure Handling
System Interface: Operating System Support, Failure Handling


Specification of the Interface Signals

Design an I/O controller that reads a 32-bit word from an I/O device under the command of the processor. The protocol and timing are as follows.

[Figure: block diagram. The Processor connects to the I/O Controller through Proc Data Bus, Proc Addr Bus, Write Enable, and Read Enable. The I/O Controller connects to the I/O Device through Read Request, I/O Data Bus, and I/O Data Ready.]

Signals in the timing diagram:
– Proc Data Bus (Processor ↔ Controller)
– Proc Address Bus (Processor → Controller)
– Write Enable (Processor → Controller)
– Read Enable (Processor → Controller)
– Controller Read Request (Controller → Device)
– I/O Data Bus (Device → Controller): carries valid data when the device responds
– I/O Data Ready (Device → Controller)

Bus cycle | Proc Address Bus | Proc Data Bus
Write Command | 00050000 | 00000001 (go-read)
Read Status | 00050001 | 00000000
Read Status | 00050001 | 10000000
Read Data | 00050002 | Valid data


Logic Design in RTL

RTL of I/O Controller:

Clock 1 (Decoding):
Wait_Proc1:
    If proc_addr_bus = 0x00050001 & read_enable = 1
        Then proc_data_bus ← STATUS_REG
             Goto Wait_Proc1
    If proc_addr_bus = 0x00050002 & read_enable = 1
        Then proc_data_bus ← DATA_REG
             Goto Wait_Proc1
    If proc_addr_bus = 0x00050000 & write_enable = 1
        Then COMMAND_REG ← proc_data_bus
    If COMMAND_REG != 0x00000001
        Then Goto Wait_Proc1
        Else read_request ← 1

Clock 2 (Get I/O data):
Wait_Dev:
    If io_data_ready = 0
        Then Goto Wait_Dev
        Else DATA_REG ← io_data_bus
             STATUS_REG<31> ← 1
             read_request ← 0
    If proc_addr_bus = 0x00050001 & read_enable = 1
        Then proc_data_bus ← STATUS_REG

Clock 3 (Proc get data):
Wait_Proc2:
    If proc_addr_bus = 0x00050002 & read_enable = 1
        Then proc_data_bus ← DATA_REG
        Else Goto Wait_Proc2
    If proc_addr_bus = 0x00050001 & read_enable = 1
        Then proc_data_bus ← STATUS_REG

Clock 4:
    Goto Wait_Proc1


Realization of the Design in Hardware

I/O Controller Data Path and Control:

[Figure: data path. A Decoder on Proc_addr, Read_Enable, and Write_Enable selects the Command Reg, Status Reg, or Data Reg. A mux places the Status Reg or Data Reg on Proc_data. IO_data feeds the Data Reg. The Control Logic takes GoRead and io_data_ready as inputs and drives CRWrite, SRWrite, SRRead, DRWrite, DRRead, DataReady, and Read_request.]

Control state diagram, outputs per state:

State 1 (stay while not GoRead; leave on GoRead):
CRWrite = 1; DataReady = 0
SRWrite = 0; SRRead = 1
DRRead = 1; DRWrite = 0
If GoRead, ReadRequest = 1, else ReadRequest = 0

State 2 (stay while not io_data_ready; leave on io_data_ready):
CRWrite = 0
SRWrite = 0; SRRead = 1
DRRead = 0
If io_data_ready, DRWrite = 1, else DRWrite = 0
If io_data_ready, DataReady = 1, else DataReady = 0
If io_data_ready, ReadRequest = 0, else ReadRequest = 1

State 3 (stay while not Read Data Reg; leave on Read Data Reg):
CRWrite = 0
SRWrite = 0; SRRead = 1
DRWrite = 0
If Read Data Reg, DRRead = 1, else DRRead = 0
DataReady = 0; ReadRequest = 0

State 4:
CRWrite = 0
SRWrite = 0; SRRead = 1
DRRead = 1; DRWrite = 0
DataReady = 0; ReadRequest = 0


Writing the Software Driver for the Processor

MIPS Device Driver for the I/O Controller:

# Assuming the I/O Controller is memory mapped
# Assuming Command Register address (0x00050000) is in $s0
# Assuming the GoRead command (0x00000001) is in $t0
# Assuming Status Register address (0x00050001) is in $s1
# When Status Register = 0x10000000, it indicates data in Data Register is ready
# Assuming Data Register address (0x00050002) is in $s2
# The read data will be stored in $s3

        sw   $t0, 0($s0)        # Proc writes GoRead to Command Reg
        lui  $t3, 0x1000        # $t3 = 0x10000000, the data-ready status value
Wait:   lw   $t1, 0($s1)        # Proc checks Status Reg
        bne  $t1, $t3, Wait     # Wait if I/O data not ready
        lw   $s3, 0($s2)        # Proc reads Data Reg
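The polling protocol used by the driver can be exercised against a toy model of the controller. This Python sketch is not the hardware design: the class `IOControllerModel`, its `device_delay` parameter, and the idea that the device responds after a fixed number of register reads are all assumptions for illustration. The register addresses and the data-ready status value follow the driver comments.

```python
CMD_ADDR, STATUS_ADDR, DATA_ADDR = 0x00050000, 0x00050001, 0x00050002
GO_READ, DATA_READY = 0x00000001, 0x10000000

class IOControllerModel:
    """Toy memory-mapped controller: the device answers after a fixed delay."""

    def __init__(self, device_words, device_delay):
        self.device_words = list(device_words)  # words the device will deliver
        self.device_delay = device_delay        # reads to wait before data is ready
        self.status = 0
        self.data = 0
        self.countdown = None                   # None means no read in progress

    def write(self, addr, value):
        if addr == CMD_ADDR and value == GO_READ:
            self.status = 0                     # new request: data not ready yet
            self.countdown = self.device_delay

    def read(self, addr):
        self._tick()
        if addr == STATUS_ADDR:
            return self.status
        if addr == DATA_ADDR:
            return self.data
        raise ValueError(f"unmapped address {addr:#x}")

    def _tick(self):
        if self.countdown is None:
            return
        if self.countdown == 0:                 # device asserts data ready
            self.data = self.device_words.pop(0)
            self.status = DATA_READY
            self.countdown = None
        else:
            self.countdown -= 1

def driver_read_word(ctrl):
    """Same steps as the MIPS driver: write GoRead, poll status, read data."""
    ctrl.write(CMD_ADDR, GO_READ)
    polls = 0
    while ctrl.read(STATUS_ADDR) != DATA_READY:
        polls += 1
    return ctrl.read(DATA_ADDR), polls

if __name__ == "__main__":
    print(driver_read_word(IOControllerModel([0xCAFE], device_delay=2)))
```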


This is the best class I have had so far

GOOD LUCK!
