Savio Chau
Spring Quarter, 2002
Final Review
Final: June 10, 2002
3:00 p.m. to 6:00 p.m.
Knudsen 1200B
Extra office hours: Friday 6/7/02 4:30 p.m. to 7:30 p.m.
Saturday 6/8/02 4:00 p.m. to 6:00 p.m.
Areas for Study
• What is computer architecture?
• Number Representation
  – Floating point number representation and IEEE 754
  – Floating point operations with IEEE 754
• MIPS instruction set
  – Able to write simple assembly code with MIPS instruction set
  – Understanding of procedure calls and stack management
    • Procedure call
    • Stack management
• General ideas about single cycle/multi cycle data path and control unit design
• Pipelined Processor
  – Basic concepts and data flow in pipeline
  – Hazards
    • Data Hazard
      – Stalling the pipe
      – Forwarding (including the special case of lw followed by R-type)
    • Control Hazard
      – Branch Prediction
• Memory Hierarchy and Virtual Memory
  – Concept of memory hierarchy and locality (spatial and temporal)
  – Performance of memory hierarchy: calculation of average access time
  – Cache organizations and overheads
    • Associativity: direct mapped, set associative, fully associative
    • Block size
    • Replacement policies
    • Write back vs. write through
  – Virtual Memory
    • Virtual to Physical Address Translation: Page Table, Page Frame Table
    • Translation Lookaside Buffer (TLB)
  – You should know how to read/write data from a memory hierarchy with a virtual address
• I/O System
  – I/O system architecture
  – I/O system design process
  – I/O system design parameters
  – I/O device interface design
  – You should be able to do both system level and detailed design
What is Computer Architecture?
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
Courtesy D. Patterson
[Figure: levels of abstraction in a computer system. Software: Application, Operating System, Compiler, Firmware. Instruction Set Architecture: Instr. Set Proc., I/O system. Hardware: Datapath & Control (Control, ALU, I Reg, Mem), Digital Design, Circuit Design, Physical Design. The hardware side is shown as a bottom-up view.]
IEEE 754 Standard for Floating Point Numbers
• Maximize precision of representation with a fixed number of bits
  – Gain 1 bit by making the leading 1 of the mantissa implicit. Therefore,
    $F = 1 + \text{significand}$, $\text{Value} = (-1)^s \times (1 + \text{significand}) \times 2^E$
• Easy for comparing numbers
  – Put sign bit at MSB
  – Use bias instead of sign bit for exponent field
    Real exponent value = exponent − bias, bias = 127 for single precision
    Examples:
    Exponent | IEEE 754 value | Floating Point Number Value
    A = −126 | 00000001 | $(-1)^s \times F \times 2^{(1-127)} = (-1)^s \times F \times 2^{-126}$
    B = 127 | 11111110 | $(-1)^s \times F \times 2^{(254-127)} = (-1)^s \times F \times 2^{127}$
    This is much easier to compare than having $A = -126_{10} = 10000010_2$ and $B = 127_{10} = 01111111_2$ in 2's complement
• Need to take care of special cases (by convention)
    Value = 0: E = 0, f = 0 (i.e., f = significand)
    Value = $(-1)^s \infty$: E = 255, f = 0
    Value = $(-1)^s (0.f) \times 2^{-126}$: E = 0, f ≠ 0 (value has been denormalized)
Two formats: single precision (32-bit) and double precision (64-bit). Single precision format:
    bit 31: sign | bits 30-23: Exponent (biased) | bits 22-0: Significand only (leading 1 is implicit)
IEEE 754 Computation Example
A) $40 = (-1)^0 \times 1.25 \times 2^5 = (-1)^0 \times 1.01_2 \times 2^{(132-127)}$ = [0][10000100][010000000000000000000]
B) $-80 = (-1)^1 \times 1.25 \times 2^6 = (-1)^1 \times 1.01_2 \times 2^{(133-127)}$ = [1][10000101][010000000000000000000]
C) Denormalize the significand with the lower exponent and then align the exponents:
   $40 = (-1)^0 \times 0.3125 \times 2^7 = (-1)^0 \times 0.0101_2 \times 2^{(134-127)}$ = [0][10000110][010100000000000000000]
   $-80 = (-1)^1 \times 0.6250 \times 2^7 = (-1)^1 \times 0.1010_2 \times 2^{(134-127)}$ = [1][10000110][101000000000000000000]
D) Need to convert the IEEE 754 significand of –80 into 2’s complement before the subtraction:
   –80 = [1][10000110][101000000000000000000] → [1][10000110][011000000000000000000]
   40 – 80 = [0][10000110][010100000000000000000] + [1][10000110][011000000000000000000]
           = [1][10000110][101100000000000000000]
E) Convert the result in 2’s complement into IEEE 754 = [1][10000110][010100000000000000000]
F) Renormalize: [1][10000110][010100000000000000000] = [1][10000100][010000000000000000000] $= (-1)^1 \times 1.01_2 \times 2^5$
Check: $40 - 80 = -40 = (-1)^1 \times 1.25 \times 2^5 = (-1)^1 \times 1.01_2 \times 2^5$
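As a cross-check of the encodings used in this example, the following minimal Python sketch unpacks a single-precision value into its sign, biased exponent, and fraction fields. The helper names `ieee754_fields` and `show` are hypothetical, introduced only for illustration; the sketch prints all 23 fraction bits, whereas the slide abbreviates the field.

```python
import struct


def ieee754_fields(x):
    """Return (sign, biased exponent, 23-bit fraction) of x as a single-precision float."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def show(x):
    """Format x as [sign][exponent][fraction] in binary."""
    s, e, f = ieee754_fields(x)
    return f"[{s}][{e:08b}][{f:023b}]"


for value in (40.0, -80.0, 40.0 - 80.0):
    print(value, show(value))
```

The last line printed corresponds to the renormalized result of step F: sign 1, exponent 132 (10000100), fraction 0100...0.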
Procedure Call: An Overly Simplified Example
main() /* Caller */
{
    x = y + z;
    funct(arg); /* procedure call */
    …
}

int funct(arg) /* Callee */
{
    w = arg - v;
    return (w);
}
[Figure: register usage during the call. PC steps through main (addresses 1, 2, 3), jumps to the funct address, and returns to main address 3. $a0 ($4) carries arg, $v0 ($2) returns w, $t0 ($8), $t1 ($9), and $t2 ($10) hold x, y, and z, and $ra ($31) holds the return address (main addr 3).]
But!
• What if there are more than 4 arguments?
• What if some register values need to be preserved across the procedure call (e.g., if you want to preserve the value x)?
• What if another procedure call happens before the current procedure is completed?
Call-Return Linkage: Stack Frames
Solution:
• Save the needed information (e.g., arguments, return address) onto a stack in memory
• Information needed by the called procedure is grouped into a stack frame
• Many variations on stacks possible (up/down, last pushed/next)
[Figure: stack frame or activation record, drawn from high memory down to low memory. FP (frame pointer) points to the 1st word of the frame, at the arguments (ARGS); below them are the callee save registers (old $fp, $ra, $s0, etc.) and the local variables; SP (stack pointer) points to the last word of the frame. The stack grows and shrinks during expression evaluation. Arguments and local variables are referenced at a fixed (negative) offset from FP.]
Performance of An Ideal Pipeline
• Latency of Pipeline = Latency of a Single Task
• Potential Throughput Improvement = Number of Pipeline Stages, Under the Ideal Situation That All Instructions Are Independent and There Are No Branch Instructions
• Pipeline Rate is Limited by the Slowest Pipeline Stage
[Figure: pipeline timing for three consecutive lw instructions]
Cycle:    1        2        3        4        5        6        7
1st lw:   IFetch   Reg/Dec  Exec     Mem      WrBack
2nd lw:            IFetch   Reg/Dec  Exec     Mem      WrBack
3rd lw:                     IFetch   Reg/Dec  Exec     Mem      WrBack
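The timing above can be checked with a small calculation. This is a minimal sketch that assumes an ideal pipeline (no stalls, one instruction entering per cycle) and compares it against an unpipelined machine that spends one cycle per stage on every instruction; the function names are hypothetical.

```python
def pipeline_cycles(n_instructions, n_stages):
    """Cycles for an ideal pipeline: fill the pipe once, then finish one instruction per cycle."""
    return n_stages + (n_instructions - 1)


def unpipelined_cycles(n_instructions, n_stages):
    """Cycles when each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def speedup(n_instructions, n_stages):
    """Throughput improvement of the ideal pipeline over the unpipelined machine."""
    return unpipelined_cycles(n_instructions, n_stages) / pipeline_cycles(n_instructions, n_stages)


print(pipeline_cycles(3, 5))      # three lw instructions on a 5-stage pipeline
print(speedup(3, 5))              # small program: well below 5
print(speedup(1_000_000, 5))      # long program: approaches the number of stages
```

For the three lw instructions in the figure this gives 7 cycles, and the speedup approaches the number of stages only as the instruction count grows.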
Example of Detailed Pipeline Operations
See MIPS Example in Class
[Figure: pipelined MIPS datapath with PC, Instruction Memory, Registers, sign extension, ALU and ALU Control, Data Memory, branch adders, and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Control signals shown: PCsrc, RegWrite, ALUsrc, ALUop, RegDst, Branch, MemWr, MemRd, MemtoR.]
Clk | PC | Instruction
1 | 00 | lw $2, 0($3)
2 | 04 | add $4, $0, $5
3 | 08 | sw $6, 4($3)
4 | 12 | addi $7, $2, 100
5 | 16 | add $8, $2, $5
6 | 20 | add $9, $2, $4
7 | 24 | sub $10, $4, $7
8 | 28 | add $11, $7, $8
Signal Propagation through the Example Pipeline
[Table: contents of the pipeline registers at clocks 4 through 7 of the example. Columns: instruction in PC, PC; instruction in IF/ID, IF/ID.rs, IF/ID.rt, IF/ID.rd, IF/ID.Immed16; instruction in ID/EX, ID/EX.A, ID/EX.B, ID/EX.Immed16, ID/EX.rt, ID/EX.rd, ID/EX.ALUsrc, ID/EX.ALUop, ID/EX.RegDst, ID/EX.Branch, ID/EX.MemWr, ID/EX.MemRd, ID/EX.MemtoR, ID/EX.RegWrite; instruction in EX/MEM, EX/MEM.ALUout, EX/MEM.B, EX/MEM.rd, EX/MEM.branchadd, EX/MEM.Zero, EX/MEM.Branch, EX/MEM.MemWr, EX/MEM.MemRd, EX/MEM.MemtoR, EX/MEM.RegWrite; instruction in MEM/WB, MEM/WB.mdo, MEM/WB.ALUout, MEM/WB.rd, MEM/WB.MemtoR, MEM/WB.RegWrite.]
Single Cycle, Multiple Cycle, vs. Pipeline
Pipeline Hazards
• Pipelining Limitations: Hazards are Situations that Prevent the Next Instruction from Executing During its Designated Cycle
  – Structural Hazard: Resource Conflict When Several Pipelined Instructions Need the Same Functional Unit Simultaneously
  – Data Hazard: An Instruction Depends on the Result of a Prior Instruction that is Still in the Pipeline
  – Control Hazard: Pipelining of Branches and Other Instructions that Change the PC
• Solutions:
  – Common to all: Stall the Pipeline by Inserting “Bubbles” Until the Hazard is Resolved
  – Structural: Don’t share components between instructions, use special components (e.g., 2 port memory)
  – Data: re-ordering of instructions, forwarding
  – Control Hazard: Branch prediction, re-ordering of instructions
To Stall a Pipelined Data Path
Don’t change the PC, keep fetching the same instruction, and set all control signals in the ID/EX pipeline register to benign values (0).
[Figure: pipeline timing diagram. The PC is not updated, so sub r4, r1, r3 is refetched twice before it executes. Each refetch creates a bubble: all control signals of the bubble are set to 0 in every stage (i.e., do nothing).]
Hardware to Stall The Pipeline
• Step 1: Detecting the hazard (check if lw is being executed and if the memory data is loaded to one of the operands in the next instruction)
  – Stall = ID/EX.MemRead and ((ID/EX.rt = IF/ID.rs) or (ID/EX.rt = IF/ID.rt))
• Step 2: If Stall is true
  – Do not fetch the next instruction by disabling the writing to PC and IF/ID registers
  – Disable all control signals of the current instruction
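The two steps above can be sketched in Python as follows. This is only an illustration of the Stall equation; the function names and the `ZeroControls` output name are assumptions made for this sketch, while `PCWr` and `IF_IDWr` mirror the PCWr and IF/IDWr write enables shown on the slides.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Step 1: a load in ID/EX whose destination (rt) is a source of the instruction in IF/ID."""
    return bool(id_ex_mem_read) and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


def hazard_detect_outputs(stall):
    """Step 2: freeze PC and IF/ID, and zero the control signals entering ID/EX."""
    return {
        "PCWr": 0 if stall else 1,          # 0 = do not update the PC
        "IF_IDWr": 0 if stall else 1,       # 0 = keep the same instruction in IF/ID
        "ZeroControls": 1 if stall else 0,  # 1 = insert a bubble (hypothetical signal name)
    }


# lw r1, 0(r2) is in ID/EX (rt = 1); sub r4, r1, r3 is in IF/ID (rs = 1, rt = 3).
stall = must_stall(id_ex_mem_read=1, id_ex_rt=1, if_id_rs=1, if_id_rt=3)
print(stall, hazard_detect_outputs(stall))
```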
[Figure: pipelined datapath with a Hazard Detect unit and a Forwarding Unit. The Hazard Detect unit takes ID/EX.MemRead, ID/EX.rt, IF/ID.rs, IF/ID.rt, and IF/ID.opcode as inputs, drives the PCWr and IF/IDWr write enables, and selects 0 instead of the Control outputs through a mux. The Forwarding Unit drives Fwd A and Fwd B for Mux A and Mux B in front of the ALU.]
Stalling The Pipeline Example: R-type after lw
lw r1, 0(r2)
sub r4, r1, r3
and r6, r7, r1
or r8, r1, r9
[Figure sequence (seven frames): the datapath with the Hazard Detect unit and Forwarding Unit, advanced one clock per frame.
• Frames 1 and 2: lw is in ID/EX (ID/EX.MemRead = 1, ID/EX.rt = R1; MemRead = 1, MemWr = 0, RegWr = 1) and sub is in IF/ID (IF/ID.rs = R1). The hazard is detected, so PCWr = 0 and IF/IDWr = 0.
• Frame 3: sub is re-fetched. A bubble (MemRead = 0, MemWr = 0, RegWr = 0, i.e., not doing anything) follows lw down the pipeline.
• Frames 4 to 7: sub, and, and or advance behind the bubble, while the lw data and then the sub data become available for write back.]
The bubble has not changed any state of the pipeline
Data Hazard Solution: Forwarding
Logic Equation for the Control Outputs of the Forwarding Unit
• Fwd A = 1 (i.e., Type 1a) if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRs))
  Fwd A = 2 (i.e., Type 2a) if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRs))
• Fwd B = 1 (i.e., Type 1b) if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRt))
  Fwd B = 2 (i.e., Type 2b) if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRt))
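The forwarding equations can be expressed as a short Python sketch. Two points are assumptions of this sketch rather than statements from the slide: a select value of 0 means "no forwarding, use the register file value", and when both conditions hold the EX/MEM result (the more recent one) takes priority over MEM/WB. The function and dictionary key names are hypothetical.

```python
def forward_select(ex_mem, mem_wb, id_ex_src):
    """Return the mux select (0, 1, or 2) for one ALU operand."""
    # Type 1: the instruction just ahead (now in EX/MEM) writes the register we read.
    if ex_mem["RegWrite"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_src:
        return 1
    # Type 2: the instruction two ahead (now in MEM/WB) writes the register we read.
    if mem_wb["RegWrite"] and mem_wb["rd"] != 0 and mem_wb["rd"] == id_ex_src:
        return 2
    return 0  # assumed encoding for "no forwarding"


def forwarding_unit(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (Fwd A, Fwd B) for the instruction currently in ID/EX."""
    return (forward_select(ex_mem, mem_wb, id_ex_rs),
            forward_select(ex_mem, mem_wb, id_ex_rt))


idle = {"RegWrite": 0, "rd": 0}
add_r1 = {"RegWrite": 1, "rd": 1}   # add r1, r2, r3
sub_r4 = {"RegWrite": 1, "rd": 4}   # sub r4, r1, r3

# sub r4, r1, r3 in EX while add is in EX/MEM: Type 1a.
print(forwarding_unit(add_r1, idle, id_ex_rs=1, id_ex_rt=3))
# and r6, r7, r1 in EX while sub is in EX/MEM and add is in MEM/WB: Type 2b.
print(forwarding_unit(sub_r4, add_r1, id_ex_rs=7, id_ex_rt=1))
```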
[Figure: EX, MEM, and WB stages with the Forwarding Unit. Mux A and Mux B in front of the ALU each have inputs 0, 1, and 2, selected by Fwd A and Fwd B. The Forwarding Unit compares rs and rt from ID/EX with rd from EX/MEM and MEM/WB.]
Forwarding Example
add r1, r2, r3
sub r4, r1, r3
and r6, r7, r1
[Figure: three frames of the datapath with the Forwarding Unit. When sub is in EX, add (rd = r1) is in EX/MEM and its result A+B is forwarded to operand A (Type 1a Hazard, Fwd A = 01). When and is in EX, add is in MEM/WB and its result is forwarded to operand B (Type 2b Hazard, Fwd B = 10).]
One Case Forwarding Can’t Avoid Stalling
add r1, r2, r3
sub r4, r1, r3
and r6, r7, r1
Problem: lw followed by R-type. The lw instruction is still reading memory when the following R-type instruction needs the data for EX. Need to stall 1 cycle (see previous example).
[Figure: lw (destination r1) is in EX/MEM while the dependent add r4, r1, r3 is in ID/EX. This is a Type 1a Hazard, but the EX/MEM output cannot be forwarded: it holds the address, which is not valid output of lw. The valid output for lw, Mem[addr], is available only in MEM/WB.]
Control Hazard Solution: Branch Prediction (e.g., Predict Branch Not Taken)
[Figure: pipeline diagram for the instructions at PC=12, PC=16, PC=20, PC=24, and PC=28 (or $15,$7,$3). The three instructions after the branch are fetched assuming branch not taken. The result of the comparison is not to branch.]
Prediction is correct, branching does not cause any penalty
Penalty of Wrong Prediction
[Figure: pipeline diagram for the instructions at PC=12, PC=16, PC=20, and PC=24, followed by the branch target at PC=36. The three instructions after the branch are fetched assuming branch not taken. The result of the comparison is branch taken.]
Prediction is incorrect, need to flush the pipe. The penalty is the same as without branch prediction (3 cycles)
To Reduce Branch Penalty, Move Address Calculation Hardware Forward
[Figure: two frames of the datapath. The first frame marks a 1st, 2nd, and 3rd clock delay for a branch; the second frame, with the address calculation hardware moved forward, marks only the 1st clock delay.]
Memory Hierarchy
• Motivations:
  – Large Memories (DRAM) are Slow and Lower Cost
  – Small Memories (SRAM) are Fast but Higher Cost
• Goal: Present the User with a Large Memory at the Lowest Cost while Providing Access at a Speed Comparable to the Fastest Technology
• Reduce the Required Bandwidth of the Large Memory
[Figure: a fast (small) memory in front of a large (slow) memory, together forming the memory hierarchy.]
Typical Memory Hierarchy
Performance:
Level | Capacity | Access time | Cost
CPU Registers | 100’s of bytes | < 10’s of ns |
Cache | K bytes | 10-100 ns | $0.01 - 0.001/bit
Main Memory | M bytes | 100 ns - 1 us | $0.01 - 0.001/bit
Disk | G bytes | ms | $10^{-3}$ - $10^{-4}$ cents/bit
Tape | infinite capacity | sec-min | $10^{-6}$ cents/bit
Why Memory Hierarchy Works?
• The Principle of Locality:
  – Program Accesses a Relatively Small Portion of the Address Space at Any Instant of Time. Example: 90% of Time in 10% of the Code
  – Put All Data in Large Slow Memory and Put the Portion of Address Space Being Accessed into the Small Fast Memory.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): If an Item is Referenced, It will Tend to be Referenced Again Soon
  – Spatial Locality (Locality in Space): If an Item is Referenced, Items Whose Addresses are Close by Tend to be Referenced Soon.
Analysis of Memory Hierarchy Performance
General Idea
• Average Memory Access Time = Upper level hit rate × Upper level hit time + Upper level miss rate × Miss penalty
• Example, let:
  – $h$ = Hit rate: the percentage of memory references that are found in upper level
  – $1 - h$ = Miss Rate
  – $t_m$ = the Hit Time of the Main Memory
  – $t_c$ = the Hit Time of the Cache Memory
• Then, Average Memory Access Time $= h\,t_c + (1 - h)(t_c + t_m) = t_c + (1 - h)\,t_m$
  Note: This example assumes cache has to be looked up to determine if miss has occurred. The time to look up cache is also equal to $t_c$.
• This formula can be applied recursively to multiple levels. Let:
  The subscript $L_n$ refer to the upper level memory (e.g., a cache)
  The subscript $L_{n-1}$ refer to the lower level memory (e.g., main memory)
  – Average Memory Access Time $= h_{L_n} t_{L_n} + (1 - h_{L_n}) \left[ t_{L_n} + \{ h_{L_{n-1}} t_{L_{n-1}} + (1 - h_{L_{n-1}})(t_{L_{n-1}} + t_m) \} \right]$
• The trick is how to find the miss penalty
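The two formulas above can be evaluated with a few lines of Python. This is a minimal sketch; the function names and the hit rates and access times in the example calls are made-up values for illustration, not figures from the course.

```python
def amat_one_level(h, t_c, t_m):
    """h*t_c + (1 - h)*(t_c + t_m): the cache is always looked up first."""
    return h * t_c + (1 - h) * (t_c + t_m)


def amat_two_level(h_up, t_up, h_low, t_low, t_m):
    """Apply the same formula recursively: the miss penalty of the upper level
    is the average access time of the lower level."""
    lower = h_low * t_low + (1 - h_low) * (t_low + t_m)
    return h_up * t_up + (1 - h_up) * (t_up + lower)


# Assumed example values: 95% hit rate, 1-cycle cache, 100-cycle main memory.
print(amat_one_level(0.95, 1, 100))
# Assumed two-level example: 90% / 1 cycle upper, 80% / 10 cycles lower, 100-cycle memory.
print(amat_two_level(0.9, 1, 0.8, 10, 100))
```

The one-level result agrees with the simplified form $t_c + (1 - h)\,t_m$, which is a quick way to sanity-check a hand calculation.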
Cache Organization
• Mechanism for looking up data
  – Index: to look up a block or a set in the cache
  – Tag: to determine if the data is what you want (hit or miss)
  – Byte Select (or Word Select): to select the byte (or word) that you need in a block
• Block size: to take advantage of spatial locality
  – Temporal locality might be compromised if block size is too large
  – In general, larger block size has higher miss penalty (unless wide parallel memory is used)
• Associativity: to reduce conflict
  – Direct Mapping
  – Set Associative
  – Fully Associative
• Write Policy: to ensure consistency between cache and memory
  – Write Through
  – Write Back
Large Block Size
• For a $2^N$ Byte Cache:
  – The Uppermost (32 − N) Bits Are Always The Cache Tag
  – The Lowest M Bits Are The Byte Select (Block Size = $2^M$)
  – The Middle (N − M) Bits Are The Cache Index
[Figure: cache lookup example with tag 0x50, index 0x01, and byte select 0x00; a mux delivers the selected byte on a hit.]
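The field boundaries can be illustrated with a short Python sketch that splits a 32-bit address into tag, index, and byte select for a direct-mapped cache. The cache size (N = 10, i.e., 1 KB) and block size (M = 5, i.e., 32 bytes) used in the example are assumptions chosen for illustration, and `split_address` is a hypothetical helper name.

```python
def split_address(addr, n, m):
    """Split a 32-bit address for a direct-mapped cache of 2**n bytes with 2**m-byte blocks.

    Returns (tag, index, byte_select) with widths (32 - n), (n - m), and m bits.
    """
    byte_select = addr & ((1 << m) - 1)
    index = (addr >> m) & ((1 << (n - m)) - 1)
    tag = addr >> n
    return tag, index, byte_select


# Assumed geometry: 1 KB cache (n = 10), 32-byte blocks (m = 5).
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(hex(addr), [hex(field) for field in split_address(addr, n=10, m=5)])
```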
Associativity
• Direct Mapped: Memory Blocks (M mod N) go only into a single block
• Set Associative: Memory Blocks (M mod N) can go anywhere in a set of blocks
• Fully Associative: Memory Blocks (M mod N) can go anywhere in the cache
[Figure: memory blocks 0-9 mapped onto a cache with blocks 0-3, drawn three ways: direct mapped, set associative (Set 0 and Set 1), and fully associative (entire cache).]
Cache Overhead Estimation Example
Memory size = 4 Gbytes (i.e., 32-bit address)
Cache size = 64 Kbytes
Word addressable
[Figure: two-way set-associative cache with entries 0 to $2^{12} - 1$. Each block holds a V bit, a D bit, a 19-bit tag, and two 32-bit words (Word #1, Word #2). The address supplies a 19-bit tag, a 12-bit index, and a 1-bit Word Sel. Two comparators (=) check the tags; 2-to-1 muxes (2 × 32 and 32) select word 1 or word 2 and the hitting block to produce Hit and the 32-bit data.]
Number of indexes = $2^{16}$ bytes × (1 word/4 bytes) × (1 block/2 words) × (1 set/2 blocks) = $2^{12}$ (sets = # of indexes)
Number of index bits = 12 bits
Number of word select bits = 1
Number of bits in tag = 32 bits – 12 bits – 1 bit = 19 bits
Storage overhead = (19 bits + 1 bit + 1 bit)/block × 2 blocks/set × $2^{12}$ sets = 172032 bits
Number of comparators = 19 bit/set × 2 sets = 38
Number of multiplexors = 32 + 32 + 32 = 96 (2-to-1 mux)
Miscellaneous gates: 2 AND gates and 1 OR gate
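The arithmetic of this estimate can be reproduced with a small Python sketch. It follows the slide's own accounting (32 address bits split into tag, index, and word select, with 2 status bits per block); the function name, parameter names, and default values are assumptions introduced for this illustration.

```python
def cache_overhead(cache_bytes=64 * 1024, bytes_per_word=4, words_per_block=2,
                   blocks_per_set=2, address_bits=32, status_bits_per_block=2):
    """Return (sets, index_bits, word_select_bits, tag_bits, overhead_bits)."""
    sets = cache_bytes // bytes_per_word // words_per_block // blocks_per_set
    index_bits = sets.bit_length() - 1                    # log2 for a power of two
    word_select_bits = words_per_block.bit_length() - 1   # log2 for a power of two
    tag_bits = address_bits - index_bits - word_select_bits
    # Overhead = everything stored per block that is not data: tag plus status bits.
    overhead_bits = (tag_bits + status_bits_per_block) * blocks_per_set * sets
    return sets, index_bits, word_select_bits, tag_bits, overhead_bits


print(cache_overhead())
```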
Similarities Between Cache and Virtual Memory
• Both Use Two Levels of Memories
– Higher Level: Faster and Smaller
– Lower Level: Slower and Larger
• Both Rely on the Principle of Locality
• Both Use Associativity to Reduce Conflicts
• Both Need to Decide Which Block in Higher Level has to be Replaced Upon Miss
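[Figure: cache, main memory, and secondary storage in a row. Cache design covers the cache and main memory pair; virtual memory design covers the main memory and secondary storage pair.]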
Differences Between Cache and Virtual Memory
• Cache is several orders of magnitude faster than virtual memory, while virtual memory is several orders of magnitude larger than cache
• Consequently
  – Virtual memory can use software to track blocks in use while cache has to use hardware
  – The cost to implement full associativity is low for virtual memory and very high for cache
  – Virtual memory can use more sophisticated block replacement algorithms
  – Virtual memory has to use write-back while cache can use write-back or write-through
Parameter | Typical Value in Cache | Typical Value in Virtual Memory
Total Size in Blocks | 1,000 - 1,000,000 | 2,000 - 250,000
Total Size in Kbytes | 8 - 8,000 | 8,000 - 8,000,000
Block Size in Bytes | 16 - 256 | 4,000 - 64,000
Miss Penalty in Cycles | 10 - 100 | 1 M - 10 M
Miss Rate | 0.1% - 10% | 0.00001% - 0.0001%
Page Table and Page Frame Table
• Page Table:
  – Used by program to keep track of which page is in the secondary store and which is in main memory
  – Translate virtual memory address into physical address
  [Table: page table, located by the Page Table Pointer and indexed by Virtual Page #]
  Virtual Page # | Physical Page Address | Access Right | Valid
  C37000 | 000F000 | X | 1
  ... | ... | ... | ...
  737000 | 00002000 | R | 1
  29B000 | 00001000 | R/W | 0
• Page Frame Table:
  – Used by the operating system to know how the pages in main memory are allocated to different active jobs
  – To provide information for deciding which page is a candidate to be replaced
  Page Frame # in Main Memory | Used Bit | Dirty Bit | User | Virtual Page Address
  0 (addr = 000000) | 1 | 1 | A | 0000029B
  1 (addr = 001000) | 1 | 0 | B | 00000737
  ... | ... | ... | ... | ...
  $2^n - 1$ (addr = FFF000) | 0 | 0 | A | 000C374
Address Mapping
• Address Translation Determines If Main Memory Has the Requested Page by Examining the Valid Bit of the Page in the Page Table
• If the Requested Page Is Not in Main Memory, the Operating System Transfers Data from Secondary Memory to Main Memory and Then Sets the Valid Bit. The old page is written back to secondary memory first if necessary (e.g., page modified but not saved).
[Figure: address translation in three cases. With V=1 the translated address goes to the cache. With V=0 the old page is written out (write address) and the new page is read in (read address). After the transfer, V=1 and the new physical address goes to the cache.]
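Translation of Virtual to Physical Address
• Page Table Located in Physical Memory
• V = Valid Bit:
  – V = 1: Page is in Main Memory
• Access Rights: R = Read-Only, R/W = Read/Write, X = Execute Only
[Figure: the page table entry selected by the virtual page number supplies the access rights and the physical page #; the physical page # combined with the page offset forms the physical address, which goes to memory if V=1.]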
Translation Lookaside Buffer
• Cache of Recently Used Page Table Entries
• Can Be Fully Associative, Set Associative, or Direct Mapped
• Direct Mapped TLB Example:
[Figure: direct-mapped TLB, with the entry selected by the index bits of the virtual page number.]
Note: Dirty bit indicates if the page in memory has been modified. If it has not been modified, it will be replaced without copying back to memory.
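Virtual Memory and Cache Mappings
Example: DECstation 3100
Note: Another important bookkeeping bit, the Write Access Bit for Write Protection, Is Not Shown
[Figure: the 32-bit virtual address is split into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0). The TLB (Valid, Dirty, Tag, Physical Page #) compares tags in parallel and, on a TLB hit, supplies the 20-bit physical page #. The physical address is split into tag, 14-bit index, and 2-bit byte offset to access the cache (Valid, Tag, Data), which produces Cache Hit and the 32-bit data.]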
Accessing Data from Memory Hierarchy
Virtual Address Format: [ TLB Tag | TLB Index | Offset ], where TLB Tag and TLB Index together form the Virtual Page #
Procedure:
Step 1: Translate virtual address to physical address
  Use TLB to reduce page table look up time
  If hit, use physical address in TLB to look up cache (step 2)
  If miss, go to page table in main memory
    If found in page table, update TLB and look up cache (step 2)
    If page fault, use page frame table to pick a page in memory to be replaced
      update page frame table
      update page table in memory
      copy data from disk to the selected memory page
        if the selected page is dirty, write it back to disk first
      update cache if the data from disk has a cache hit
      update TLB, get physical address and go to step 2
Step 2: Use physical address to access data from cache
  If hit, use data from cache
  If miss, go to main memory to access data
    update cache
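The procedure above can be sketched as a toy Python model. It is deliberately simplified and rests on several assumptions that are not in the slides: 4 KB pages (12 offset bits), dictionaries standing in for the TLB, page table, cache, and main memory, and a page fault that is only reported rather than serviced (no page frame table or disk). The class and attribute names are hypothetical.

```python
PAGE_BITS = 12  # assumed page size of 4 KB


class MemoryHierarchy:
    def __init__(self, page_table, main_memory):
        self.tlb = {}                 # virtual page # -> physical page #
        self.page_table = page_table  # same mapping, for pages resident in main memory
        self.cache = {}               # physical address -> data
        self.memory = main_memory     # physical address -> data
        self.log = []                 # records which path each access took

    def translate(self, vaddr):
        """Step 1: virtual address -> physical address."""
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in self.tlb:
            self.log.append("TLB hit")
        elif vpn in self.page_table:
            self.log.append("TLB miss, page table hit")
            self.tlb[vpn] = self.page_table[vpn]   # update TLB
        else:
            self.log.append("page fault")
            raise LookupError("page fault: the OS would replace a page and load from disk")
        return (self.tlb[vpn] << PAGE_BITS) | offset

    def read(self, vaddr):
        """Step 2: use the physical address to access the cache, then main memory."""
        paddr = self.translate(vaddr)
        if paddr in self.cache:
            self.log.append("cache hit")
        else:
            self.log.append("cache miss")
            self.cache[paddr] = self.memory[paddr]  # update cache
        return self.cache[paddr]


hierarchy = MemoryHierarchy(page_table={0x29B: 0x1}, main_memory={0x1004: 42})
first = hierarchy.read(0x29B004)
second = hierarchy.read(0x29B004)
print(first, second, hierarchy.log)
```

The first access misses in both the TLB and the cache; the second access to the same address hits in both.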
I/O System Architecture Overview
[Figure: layered I/O system. Software: User Application, Operating System (reached through a system call), Device Driver. Hardware: I/O Controller on the memory or I/O bus, I/O Device, Media. Two such stacks face each other; peer device drivers, I/O controllers, and I/O devices are related through system interface, logical, and physical layers. Protocol can be defined at all levels.]
A Classification of I/O According to the Targets of I/O Operation
• Processor to Memory: Very low latency, very high throughput, very low protocol overhead
• Processor to Peripheral: Latency, throughput, and protocol overhead vary according to the I/O devices
• Processor to Processors
  – Tightly Coupled: all processors share a physical memory. Low latency, high throughput, low overhead protocol, coherence problem
  – Loosely Coupled: each processor has its own physical memory. Medium latency, medium throughput, high protocol overhead, scalable
• Processor to Network: High latency, low throughput, high protocol overhead, very scalable
I/O System Example
[Figure: a processor with cache and main memory on a memory-I/O bus, together with I/O controllers for graphics and disks, a network interface controller for the network, and an IEEE 1394 bus interface controller. A second processor with cache is also shown. The IEEE 1394 bus leads to other processors or peripherals.]
I/O System Design Process
• Establish Requirements: Understanding What You Need
• Select the I/O System That Has the Required Capability: Understand What the I/O System Being Considered Can Do
• Integration: Understand How Everything Fits Together
• Implementation
[Figure: the process starts with open questions about the devices (Device A? to Device D?) and candidate buses (Bus A? Bus B? Bus C?), and ends with Devices A to D fixed, Bus B under consideration, and the integration details still marked "?".]
I/O System Design Example: Establish Requirements
• Design an I/O architecture for a spacecraft that has the following equipment:
  – Flight Computer (CDH)
  – Flight Computer (ACS)
  – Flight Computer (Payload)
  – Star Tracker (2 units)
  – Telecom Subsystem (2 units)
  – Inertial Measurement Unit (2 units)
  – Power Control Unit (2 units)
  – Thruster Control Unit (2 units)
  – Wide Angle Camera
  – High Resolution Camera
  – Radar Sounder
  – Altimeter
[Figure: the equipment surrounds an I/O system still to be designed ("I/O?"). Each subsystem carries a requirement label. The labels, in the order they appear on the slide:]
  – Data Rate: 5 Kbps, 1 transaction/sec, Latency < 10 ms
  – Data Rate: 8 Mbps, 1000 samples/sec, Latency < 0.1 ms
  – Data Rate: 10 Kbps, 1000 samples/sec, Latency < 0.1 ms
  – Data Rate: 400 bps, 2 commands/sec, Latency < 0.5 sec
  – Data Rate < 100 bps, 10 commands/sec, Latency < 0.1 ms
  – Data Rate: 20 Mbps, 2 frames/sec, Latency < 0.5 sec
  – Data Rate: 20 Mbps, 2 frames/sec, Latency < 0.5 sec
  – Data Rate: 1 Mbps, 1 transaction/sec, Latency < 1 sec
  – Data Rate: 5 Kbps, 100 samples/sec, Latency < 0.01 sec
System Constraints (Prioritized):
1. Total power consumption of the avionics system < 100 W.
2. The I/O system power consumption should be less than 35% of the avionics system.
3. Each subsystem has to meet the latency and throughput requirements.
4. System reliability should exceed 12 years (i.e., requires fault tolerance).
5. The system design should be scalable and distributed.
6. Maximum distance between subsystems is 5 meters. Average distance is 3 m.
7. Minimize the cable mass.
I/O System Design Example: Candidate I/O Interface
Metrics | IEEE 1394 (Cable version) | IEEE 1393 | Fiber Channel | I2C | UART (Direct Interface) | Ethernet (IEEE 802.3)
Raw Bandwidth | 100, 200, 400 Mbps | 200 to 1000 Mbps | 1 Gbps | 100, 400 Kbps | 115 Kbps to 10 Mbps | 10, 100 Mbps
Latency | 125 µs max | 196 bits × N nodes | 196 bits × N (loop) | Non-deterministic | < 100 ns | Non-deterministic
Topology | Tree | Ring | Loop, Star, Switch network | Multi-Drop | Star | Multi-Drop
Signal Level Protocol | Async | Async | Async | Async | Async | Async
Cable Type | Electrical (Twisted pair) | Optical Fiber | Optical Fiber, Electrical (Twisted pair) | Electrical (Single end) | Electrical (Twisted pair) | Electrical (Coaxial)
Power (Note 1) | 1 W/node | 8 W/node | 8 W/node | 5 mW/node | 35 mW/node | 150 mW/node
Multi-master | Yes | Yes | Yes | Yes | No | Yes
Max. # Nodes | 64 | 127 | 127 for Loop | 128 | N/A | 248
Max Bus Length (Note 1) | 72 m (4.5 m/hop) | 10 km (100 m/hop) | Fiber: 10 km, Electrical: 30 m | Approx. 40 m (load < 400 pF) | Approx. 10 m | 500 m
Protocol Overhead | 8 % for 278 byte data | 3 bytes per 53-byte frame | 25 % for 2168 byte data (Note 2) | 1 byte address + Ack bit / byte | 1 start + 1 stop bits/byte (25%) | 64 bytes / msg (msg < 1500 B)
I/O System Design Example: Selecting an I/O Interface
• There are 17 nodes in the system and the power allocation of the I/O system is 35 W. This eliminates the Fiber Channel and the IEEE 1393
• The latency requirement eliminates the I2C and Ethernet
• The total bandwidth requirement of the system is 56 Mbps. This eliminates the UART
• The system reliability requirement eliminates the IEEE 1394 bus because tree topology is not very fault tolerant
• All interface options, except the UART, are buses and thus meet the scalability requirement. All bus options here support distributed processing.
• The distance requirement prohibits the search for a parallel bus
• All interface options, except the UART, are serial buses and thus meet the cable mass requirement
PROBLEM: WE DON’T HAVE AN OPTION THAT CAN MEET ALL REQUIREMENTS!
Resolution: Since power consumption and latency are technology dependent and difficult to improve, the next best option is to improve system reliability using fault tolerance design techniques. Therefore, the IEEE 1394 is the best choice in this case, but it needs to be enhanced with fault tolerance design techniques. Use dual redundant buses.
Check: Since redundant buses have to be used, the number of interfaces of the IEEE 1394 bus is doubled. The power consumption will be 17 x 1 W x 2 = 34 W. This is OK since it is still within the 35 W power constraint.
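The power arithmetic behind the first bullet and the final check can be written out as a short Python sketch. The per-node power figures come from the candidate interface table and the 35 W budget is 35% of the 100 W avionics limit; the function and constant names are hypothetical.

```python
NODES = 17
IO_POWER_BUDGET_W = 0.35 * 100  # 35% of the 100 W avionics budget


def io_power(nodes, watts_per_node, redundancy=1):
    """Total interface power: one interface per node per redundant bus."""
    return nodes * watts_per_node * redundancy


fibre_or_1393 = io_power(NODES, 8.0)             # 8 W/node options
dual_1394 = io_power(NODES, 1.0, redundancy=2)   # IEEE 1394 with dual redundant buses

print(fibre_or_1393, fibre_or_1393 <= IO_POWER_BUDGET_W)
print(dual_1394, dual_1394 <= IO_POWER_BUDGET_W)
```

The 8 W/node options exceed the budget even without redundancy, while dual redundant IEEE 1394 fits with 1 W to spare.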
Key I/O Design Parameters to be Discussed
• Connectivity
• Protocol
• Access Control
• Performance
• Expandability
• Failure Handling
• Operating System Support
Typical I/O System Layers and Key Parameters
• Physical: Protocol, Connectivity, Access Control, Performance, Expandability, Failure Handling
• Logical: Protocol, Failure Handling
• System Interface: Operating System Support, Failure Handling
Specification of the Interface Signals
Design an I/O controller that reads a 32-bit word from an I/O device under the command of the processor. The protocol and timing are as follows:
[Figure: block diagram of Processor, I/O Controller, and I/O Device. Between processor and controller: Proc Data Bus, Proc Addr Bus, Write Enable, Read Enable. Between controller and device: Read Request, I/O Data Bus, I/O Data Ready.]
[Figure: timing diagram with four bus transactions in sequence:
1. Write Command: address 00050000, data 00000001 (go-read)
2. Read Status: address 00050001, data 00000000
3. Read Status: address 00050001, status indicating that the data is ready
4. Read Data: address 00050002, valid data
After the command, the controller raises Read Request; the device puts valid data on the I/O Data Bus and raises I/O Data Ready.]
Logic Design in RTL
RTL of I/O Controller:
Clock 1: Wait_Proc1 (Decoding):
    If proc_addr_bus = 0x00050002 & read_enable = 1
        Then proc_data_bus ← STATUS_REG
             Goto Wait_Proc1
    If proc_addr_bus = 0x00050001 & read_enable = 1
        Then proc_data_bus ← DATA_REG
             Goto Wait_Proc1
    If proc_addr_bus = 0x00050000 & write_enable = 1
        Then COMMAND_REG ← proc_data_bus
    If COMMAND_REG != 0x00000001
        Then Goto Wait_Proc1
        Else read_request ← 1
Clock 2: Wait_Dev (Get I/O data):
    If io_data_ready = 0
        Then goto Wait_Dev
        Else DATA_REG ← io_data_bus
             STATUS_REG<31> ← 1
             read_request ← 0
    If proc_addr_bus = 0x00050002 & read_enable = 1
        Then proc_data_bus ← STATUS_REG
Clock 3: Wait_Proc2 (Proc get data):
    If proc_addr_bus = 0x00050001 & read_enable = 1
        Then proc_data_bus ← DATA_REG
        Else goto Wait_Proc2
    If proc_addr_bus = 0x00050002 & read_enable = 1
        Then proc_data_bus ← STATUS_REG
Clock 4: Goto Wait_Proc1
Realization of the Design in Hardware
I/O Controller Data Path and Control:
[Figure: data path with a Decoder on Proc_addr, Read_Enable, and Write_Enable; a Command Reg, Status Reg, and Data Reg; a mux onto Proc_data; and Control Logic. Control inputs: GoRead, io_data_ready. Control outputs: CRWrite, SRWrite, SRRead, DRWrite, DRRead, DataReady, Read_request. IO_data feeds the Data Reg.]
[Figure: four-state control state machine with transitions labeled GoRead, io_data_ready, and Read Data Reg. Control outputs of the four states, in slide order:]
• CRWrite = 1; DataReady = 0; SRWrite = 0; SRRead = 1; DRRead = 1; DRWrite = 0; if GoRead, ReadRequest = 1, else ReadRequest = 0
• CRWrite = 0; SRWrite = 0; SRRead = 1; DRRead = 0; if io_data_ready, DRWrite = 1, else DRWrite = 0; if io_data_ready, DataReady = 1, else DataReady = 0; if io_data_ready, ReadRequest = 0, else ReadRequest = 1
• CRWrite = 0; SRWrite = 0; SRRead = 1; DRWrite = 0; if Read Data Reg, DRRead = 1, else DRRead = 0; DataReady = 0; ReadRequest = 0
• CRWrite = 0; SRWrite = 0; SRRead = 1; DRRead = 1; DRWrite = 0; DataReady = 0; ReadRequest = 0
Writing the Software Driver for the Processor
MIPS Device Driver for the I/O Controller:
# Assuming the I/O Controller is memory mapped
# Assuming Command Register address (0x00050000) is in $s0
# Assuming the GoRead command (0x00000001) is in $t0
# Assuming Status Register address (0x00050001) is in $s1
# When Status Register = 0x10000000, it indicates data in Data Register is ready
# Assuming Data Register address (0x00050002) is in $s2
# The read data will be stored in $s3
        sw   $t0, 0($s0)              # Proc writes GoRead to Command Reg
Wait:   lw   $t1, 0($s1)              # Proc checks Status Reg
        subi $t2, $t1, 0x10000000     # $t2 = 0 when the status says data is ready
        bne  $t2, $0, Wait            # Wait if I/O data not ready
        lw   $s3, 0($s2)              # Proc reads Data Reg
This is the best class I have had so far
GOOD LUCK!