Transcript of RISC-V slides: hy225/20a/epanalipsi_2_proc.pdf
COMPUTER ORGANIZATION AND DESIGN: The Hardware/Software Interface, RISC-V Edition
Chapter 4: The Processor
[Register File diagram: a Write Address Decoder driven by RegWrite and rd selects which register's load-enable (ldEn) clock fires; 64-bit WriteData is broadcast to all registers (e.g. register 01, x1); rs2 and rs1 select ReadData2 and ReadData1 through 64-bit read multiplexors; a Control box drives the signals]
Chapter 4 — The Processor — 6
Copyright 2020 University of Crete − https://www.csd.uoc.gr/~hy225/20a/copyright.html
Pipelined Datapath & Control: Operation without data or control dependencies, yet
University of Crete
Dept. of Computer Science
CS−225 (HY−225) Computer ORganization
Spring 2020 semester
Slides for §9.3 − 9.5:
§9.3 Pipelined Datapath Operation
§9.4 Control for the Pipelined Datapath
§9.5 Graphical representation: time−work
[Datapath diagrams, Cycle 5: the example program — 60: ld x10, 40(x1); 64: sub x11, x2, x3; 68: add x12, x3, x4; 72: sd x13, 48(x1); 76: add x14, x5, x6 — with all five instructions in flight through the IM, RF (rs1/rs2/rd, we, Din, ReadData), ALU, and DM (Addr, Din, Dout, we) stages; the pipeline registers PC, IR, A, B, Imm carry per-stage values (e.g. x1 = 100) and the Control box supplies each stage's control signals (RegWrite/we, ldEn, multiplexor selects)]
[Time−work diagram, Cycle 5: the five instructions occupy the stages Instr. Fetch → Reg.Rd; Op → ALU → Data Mem. → Write Back, each staggered one cycle after its predecessor, so that in Cycle 5 every stage is busy with a different instruction]
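The staggering shown in the time−work diagram can be generated mechanically; a minimal sketch (the stage names come from the slides, the instruction labels are hypothetical placeholders):

```python
# Minimal sketch of the slides' time-work diagram: each instruction
# enters Instr. Fetch one cycle after its predecessor and then moves
# through the five stages, one stage per cycle.
STAGES = ["Instr. Fetch", "Reg.Rd; Op", "ALU", "Data Mem.", "Write Back"]

def time_work(instructions):
    """Return {instr: {cycle: stage}} for an ideal 5-stage pipeline."""
    schedule = {}
    for i, instr in enumerate(instructions):
        # instruction i is fetched in cycle i+1 (cycles are 1-based)
        schedule[instr] = {i + 1 + s: STAGES[s] for s in range(len(STAGES))}
    return schedule

# Hypothetical labels for the five in-flight instructions:
sched = time_work(["i1", "i2", "i3", "i4", "i5"])
# In cycle 5 all five are in flight, each in a different stage:
print([sched[i][5] for i in ["i1", "i2", "i3", "i4", "i5"]])
```

Running this for cycle 5 reproduces the diagram's diagonal: the oldest instruction is in Write Back while the newest is still being fetched.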
Data Dependences (Hazards) in Pipelines
An earlier instruction I1 writes a word in some memory or register file; a later instruction I2 reads it. I2 needs the new data written by I1, hence must wait for I1 to write − or at least to generate − the new data.
RAW (Read after Write) − true dependence, as above
RAR (Read after Read) − not a dependence, can freely reorder the reads
WAW (Write after Write) − if you want to reorder them, simply abort the write of I1 (if no one reads this word between I1 and I2)
WAR (Write after Read) − 'antidependence': if you want to do the write (I2) early, just keep a copy of the old data and have I1 read that copy
No Memory Data Hazards in our simple Pipeline
Data Memory accesses are performed 'in−order' in our simple pipeline, i.e. they are not reordered relative to what the program specifies; thus, no dependences of memory word accesses are ever violated.
[Time−work diagram: in the overlapped Instr. Fetch → Reg.Rd; Op → ALU → Data Mem. → Write Back sequences, successive Data Mem. stages occur in program order, one cycle apart]

Register Accesses reordered, with Pipelining
Register accesses, in initial program order, become reordered in time by pipelining: the Reg. Write of an instruction overlaps the Reg. Reads of the next 2 or 3 instructions, thus causing potential dependence hazards.
For each instruction that writes a destination register, if the next 2 or 3 instructions read that same register, i.e. need its result, we have to do something about it...
[Time−work diagram: five overlapped Instr. Fetch → Reg. Read → ALU → Data Mem. → Reg. Write sequences, showing the reordered register accesses]
Actual Need−Produce Time vs. from/in−Register Time
ALU Instructions: the inputs are actually needed, and the output result is actually produced, in the ALU (computation) stage; but the inputs are read from their 'official' position in the Reg. Read stage, and the output result is written into its 'official' register position in the Reg. Write stage.
Load Instructions (Instr. Fetch → Reg. Read → ALU → Data Mem. → Reg. Write): the inputs are needed in the ALU stage (the actual address computation); the output result is actually produced in the Data Mem. stage; the output result is written in the Reg. Write stage.
We can 'Bypass' the 'official' loop through the Register File for immediate−use results.
Forwarding: ALU result to next I; Load result to next−after−next I
All we care about is actual results 'Forwarded' from the producer to the consumer instruction.
From ALU instructions: the ALU op. result can be forwarded to any subsequent instruction − including the one immediately succeeding, without it having to wait one extra cycle.
From a Load instruction: the loaded data can NOT be used by the immediately succeeding instruction without that instruction having to wait one extra cycle.
ALU instructions never stall the pipeline, but Load instructions will do so when immediately followed by a dependent instruction.
[Time−work diagrams: forwarding arrows from the producer's ALU or Data Mem. stage to the consumer's ALU stage, across the overlapped Instr. Fetch → Reg. Read → ALU → Data Mem. → Reg. Write sequences]
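The rule on this slide − an ALU result reaches any subsequent instruction, loaded data only the next−after−next − can be written as a stall formula; a sketch under the slide's 5-stage assumptions:

```python
def stall_cycles(producer, distance):
    """Extra cycles the consumer must wait in this 5-stage pipeline
    with forwarding. producer: 'alu' or 'load'; distance: how many
    instructions later the consumer comes (1 = immediately after).
    ALU results forward in time to any consumer; loaded data arrives
    one cycle too late for the immediately following instruction."""
    if producer == "load" and distance == 1:
        return 1  # load-use hazard: one bubble
    return 0      # everything else forwards with no stall

print(stall_cycles("alu", 1))   # -> 0
print(stall_cycles("load", 1))  # -> 1
print(stall_cycles("load", 2))  # -> 0
```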
[Pipelined datapath with Forwarding Control: the pipeline registers carry each downstream instruction's destination register and write enable (rrd3/rwe3, rrd4/rwe4, rrd5/rwe5 for the register file; mrd3/mwe3, mrd4/mwe4 for memory); Forwarding Control compares these against the current rs1/rs2 and drives the forwarding multiplexors (fwd.ctrl); an un-forwardable case signals Wait/Repeat! and Abort!]
Distance 1 dependence on LOAD: Wait!
60: ld x10, 40(x1)
64: sub x11, x10, x3
68: add x12, x3, x4
72: sd x13, 48(x1)
76: add x14, x5, x6
The instruction immediately after a Load wants to use the load'ed data. Impossible without losing one cycle: force this instruction to wait (repeat itself on the next cycle). Simple, in−order pipeline: the next instruction has to wait too!
[Time−work diagram: fetch 60; fetch 64; ld computes x1 + 40, reads M[140], writes x10; Hazard Detect! − sub repeats itself and a no−op bubble flows down the pipeline; then sub computes x10 − x3 and writes x11; the repeated fetch 68 proceeds, add computes x3 + x4 and writes x12]
[Datapath with Forwarding Control and Hazard Detection: the rs1/rs2 of the decoding instruction, qualified by need.rs1/need.rs2, are compared against rrd3/rwe3, rrd4/rwe4, rrd5/rwe5 and mrd3/mwe3, mrd4/mwe4; a distance-1 match against a load in stage 3 asserts Wait − the PC and IR repeat (their ldEn is deasserted) for one cycle while a no−op is injected]
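The Hazard Detection box's decision can be sketched as a boolean condition over the slides' signal names (the exact encoding of the signals is an assumption for illustration):

```python
def must_wait(rs1, rs2, need_rs1, need_rs2, rrd3, stage3_is_load):
    """Distance-1 load-use detection, loosely mirroring the slides'
    signals: rrd3 is the destination register of the instruction now
    in stage 3, and stage3_is_load says it is a load (its data arrives
    only after Data Mem.). If the instruction now decoding needs that
    register as rs1 or rs2, it must Wait/Repeat for one cycle."""
    if not stage3_is_load or rrd3 == "x0":  # x0 is never a real dependence
        return False
    return (need_rs1 and rs1 == rrd3) or (need_rs2 and rs2 == rrd3)

# sub x11, x10, x3 immediately after ld x10, 40(x1): must wait
print(must_wait("x10", "x3", True, True, "x10", True))   # -> True
# add x14, x5, x6 after the same load: no dependence, no wait
print(must_wait("x5", "x6", True, True, "x10", True))    # -> False
```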
Instruction Scheduling
This is 'Static' Scheduling, at Compile Time.
Source program: a = b + c;  e = b − f;
Memory layout (offsets from gp): +0: a, +8: b, +16: c, +24: e, +32: f

Unscheduled code − 2 extra clock cycles lost; two temporary registers (t0, t1) suffice:
ld t0, 8(gp)
ld t1, 16(gp)
add t1, t0, t1
sd t1, 0(gp)
ld t1, 32(gp)
sub t1, t0, t1
sd t1, 24(gp)

Scheduled code − no extra clock cycle lost; three temporary registers (t0, t1, t2) needed:
ld t0, 8(gp)
ld t1, 16(gp)
ld t2, 32(gp)
add t1, t0, t1
sd t1, 0(gp)
sub t1, t0, t2
sd t1, 24(gp)

The more things you have 'up in the air' (in parallel), the more temporary registers you need in order to 'name' those 'pending' values.

What if the program is: a[i] = b + c;  e = b − a[j]; ?
Does the compiler know for sure whether i != j (OK to reorder the sd−ld) or i == j (a RAW dependence through memory: forward in a register)? If unknown to the compiler, static scheduling is impossible => dynamic scheduling at runtime (out-of-order pipeline).
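The 2-cycle vs. 0-cycle claim can be checked mechanically; a sketch that counts load-use bubbles in a straight-line sequence (the tuple representation of each instruction is a simplification for illustration):

```python
def load_use_stalls(instrs):
    """Count 1-cycle bubbles: a load whose destination register is read
    by the very next instruction. instrs: list of (op, dest, sources)."""
    stalls = 0
    for (op, dest, _), (_, _, srcs) in zip(instrs, instrs[1:]):
        if op == "ld" and dest in srcs:
            stalls += 1
    return stalls

unscheduled = [
    ("ld", "t0", ()), ("ld", "t1", ()), ("add", "t1", ("t0", "t1")),
    ("sd", None, ("t1",)), ("ld", "t1", ()), ("sub", "t1", ("t0", "t1")),
    ("sd", None, ("t1",)),
]
scheduled = [
    ("ld", "t0", ()), ("ld", "t1", ()), ("ld", "t2", ()),
    ("add", "t1", ("t0", "t1")), ("sd", None, ("t1",)),
    ("sub", "t1", ("t0", "t2")), ("sd", None, ("t1",)),
]
print(load_use_stalls(unscheduled))  # -> 2
print(load_use_stalls(scheduled))    # -> 0
```

Moving the third load up separates both loads from their consumers, which is exactly why the scheduled version needs the extra register t2.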
Control Dependences (branch/jump) in Pipelines
Copyright 2020 University of Crete − https://www.csd.uoc.gr/~hy225/20a/copyright.html (slides 1−9); Elsevier (slides 10−11)
'Data Dependence' = the next instruction uses data (register/memory) from a previous one
'Control Dependence' = which instruction is next depends on the previous one
Control Dependences arise from 'Control Transfer Instructions' (CTI)
Control Transfer Instructions (CTI) are: Jump and Branch instructions
'Jumps' are unconditional CTIs: they always transfer control
'Branches' are conditional CTIs: whether or not they transfer control depends on the result of a data comparison that they have to perform
Statistics (rough numbers, in a majority of programs, but NOT always so):
− Branches are about 15−16% of all ('dynamically') executed instructions in a program
− about 2/3 of executed branches are 'taken' (successful) = ~10% of all instructions
− about 1/3 of executed branches are not taken (unsuccessful) = ~5% of all instructions
− most backwards branches appear in loops, and they are about 90% taken
− Jumps are about 4−5% of all executed instructions in a program; procedure calls are about 1%, and returns another ~1%, of all executed instructions
Branch Taken example: 40: beq ..., goto 72
36: add ...
40: beq ..., goto 72
44: sd ...
48: and ...
52: or ...
... ...
72: ld ...
76: xor ...
[Time−work diagram: add (36) is fetched and completes (fetch, ALU, DM, WB); beq (40) resolves only after its branch latency; meanwhile sd (44), and (48), or (52) are fetched speculatively (+4 each) and, once the branch is taken, are aborted − turned into no−ops − before the DM and WB stages; fetch then continues at 72 (ld) and 76 (xor)]
Speculative Execution / Aborted Execution! We need to abort speculative execution before it causes permanent damage: before the DM and WB stages.
In our simple pipeline, branch latency is 2 cycles (read registers; compare); with MIPS−style comparisons (beq/bne only) it could even be 1 cycle. In modern processors, branch latency is quite long; the example here uses a 3−cycle branch latency.
In this example, each taken branch causes the loss of 3 extra clock cycles. About 2/3 of all executed branches are taken, so this is a heavy loss.
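Combining the earlier statistics (branches ≈ 15% of executed instructions, ≈ 2/3 of them taken) with the 3-cycle loss above gives the average cost per instruction; a small sketch using only the slides' numbers:

```python
# Rough cost of taken branches without prediction, from the slides'
# statistics: ~15% branches, ~2/3 of them taken, 3 extra cycles each.
branch_frac = 0.15   # branches as a fraction of executed instructions
taken_frac = 2 / 3   # fraction of branches that are taken
penalty = 3          # extra clock cycles per taken branch (this example)

extra_cpi = branch_frac * taken_frac * penalty
print(round(extra_cpi, 2))  # -> 0.3 extra cycles per instruction
```

An ideal CPI of 1 would thus degrade to about 1.3, which is why the loss is called heavy.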
Branch Target Buffer (BTB)
A small table − a cache, like a hash table − containing pairs of (instruction) addresses for which there is statistical evidence that their next−PC is something other than PC+4:
− the PC of a jump or branch−likely instruction;
− the Target PC to which this instruction usually went, in the past.
Example contents: 40 → 72, 260 → 200, 88 → 120, 180 → 160, ...
A 'best approximation' − not necessarily correct information.
Like IM − the Instruction Cache − this will oftentimes 'overflow': old pairs are removed to make room for more recent ones.
Branches that are believed not−taken are NOT entered into the BTB.
May be complemented with a small hardware stack:
− on every call (jal ra, ...), push the return address;
− on every return (jr ra), pop an address and predict jumping to that one.
In parallel with each Fetch, search the fetched instruction's PC value in the BTB.
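A BTB behaves like a small hash table from fetch PC to predicted target; a minimal sketch using the table contents shown on the slide (the capacity and the FIFO eviction are illustrative assumptions, not a claim about real hardware):

```python
class BTB:
    """Tiny Branch Target Buffer: PC -> usual target. Entries are only
    a best approximation; on overflow the oldest pair is evicted
    (FIFO here, as a simple stand-in for a real replacement policy)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.table = {}  # dicts preserve insertion order in Python 3.7+

    def predict(self, pc):
        # Hit: use the stored target; miss: predict fall-through, PC + 4.
        return self.table.get(pc, pc + 4)

    def update(self, pc, target):
        if pc not in self.table and len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))  # evict oldest pair
        self.table[pc] = target

btb = BTB()
for pc, target in [(40, 72), (260, 200), (88, 120), (180, 160)]:
    btb.update(pc, target)
print(btb.predict(40))  # -> 72 (BTB hit: predict taken to 72)
print(btb.predict(44))  # -> 48 (no entry: predict not taken, PC + 4)
```

Note that a not-taken prediction falls out naturally: absent entries simply predict PC + 4, matching the slide's rule that believed-not-taken branches are not entered at all.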
When the BTB prediction is Correct: 40: beq ..., goto 72
36: add ...
40: beq ..., goto 72
44: sd ...
48: and ...
52: or ...
... ...
72: ld ...
76: xor ...
When a matching BTB entry is found, use its Prediction; else, fetch from PC+4.
[Time−work diagram: fetch 36 (add); fetch 40 (beq) − the BTB hits (40 → 72), so fetch continues speculatively at 72 (ld), 76 (xor), 80, ...; the branch resolves taken, as predicted, so execution continues and commits (op.dec, ALU, DM, WB)]
When the Prediction is Correct, NO extra clock cycles are lost!
When the BTB prediction is Wrong: 40: beq ..., goto 72
36: add ...
40: beq ..., goto 72
44: sd ...
48: and ...
52: or ...
... ...
72: ld ...
76: xor ...
The Prediction says: after fetching from 40, fetch from 72. But this time, the branch ends up going the other way: to 44.
[Time−work diagram: fetch 36 (add); fetch 40 (beq) − BTB hit, so 72 (ld) and 76 (xor) are fetched speculatively; when the branch resolves not taken − mispredicted − Abort! Flush the Pipeline! − ld and xor become no−ops, and fetch restarts at 44 (sd), 48 (and), ...]
When Mispredicted, branches cost 3 extra clock cycles in this pipeline.
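Given the 3-cycle misprediction cost, the average extra cycles per branch depend only on how often the predictor is wrong; a small sketch (the 10% misprediction rate is an illustrative assumption, not a figure from the slides):

```python
def avg_branch_penalty(mispredict_rate, penalty=3):
    """Average extra cycles per executed branch with a BTB predictor:
    correct predictions cost 0 cycles, mispredictions cost `penalty`."""
    return mispredict_rate * penalty

# Illustrative, assumed numbers: 10% of branches mispredicted.
print(round(avg_branch_penalty(0.10), 2))  # -> 0.3 extra cycles per branch
```

Compare this with the ~2 cycles per taken branch lost without any prediction: even a mediocre predictor recovers most of the branch cost.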
Chapter 1 — Computer Abstractions and Technology — 28
Relative Performance
Define Performance = 1 / Execution Time.
"X is n times faster than Y":
n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X
Example: time taken to run a program is 10 s on A, 15 s on B.
Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
So A is 1.5 times faster than B.
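The definition above is a one-line computation; a sketch reproducing the slide's example:

```python
def speedup(time_y, time_x):
    """'X is n times faster than Y':
    n = Performance_X / Performance_Y = Time_Y / Time_X."""
    return time_y / time_x

# The slide's example: 10 s on A, 15 s on B -> A is 1.5x faster than B.
print(speedup(15, 10))  # -> 1.5
```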
Chapter 1 — Computer Abstractions and Technology — 37
Performance Summary
The BIG Picture:
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
Performance depends on:
− Algorithm: affects IC, possibly CPI
− Programming language: affects IC, CPI
− Compiler: affects IC, CPI
− Instruction set architecture: affects IC, CPI, Tc
Chapter 1 — Computer Abstractions and Technology — 35
CPI in More Detail
If different instruction classes take different numbers of cycles:
Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i
Weighted average CPI:
CPI = Clock Cycles / Instruction Count = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)
where Instruction Count_i / Instruction Count is the relative frequency of class i.
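The weighted-average formula is easy to sanity-check in code; a sketch with a hypothetical instruction mix (the class CPIs and counts are assumptions for illustration):

```python
def weighted_cpi(classes):
    """CPI = sum_i CPI_i * (IC_i / IC) over instruction classes.
    classes: list of (cpi_i, instruction_count_i) pairs."""
    total = sum(ic for _, ic in classes)
    return sum(cpi * ic for cpi, ic in classes) / total

# Hypothetical mix: ALU ops CPI 1 (50%), loads CPI 2 (30%), branches CPI 3 (20%).
mix = [(1, 50), (2, 30), (3, 20)]
print(weighted_cpi(mix))  # -> 1.7
```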
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 38
Average Access Time
Hit time is also important for performance.
Average memory access time (AMAT):
AMAT = Hit time + Miss rate × Miss penalty
Example: a CPU with a 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I−cache miss rate = 5%:
AMAT = 1 + 0.05 × 20 = 2 ns, i.e. 2 cycles per instruction.
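The AMAT formula and the slide's example as a small sketch:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time:
    AMAT = hit time + miss rate * miss penalty (all times in cycles)."""
    return hit_time + miss_rate * miss_penalty

# The slide's example: 1-cycle hit, 5% miss rate, 20-cycle penalty;
# with a 1 ns clock that is 2 cycles = 2 ns per access.
print(amat(1, 0.05, 20))  # -> 2.0
```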
Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 36
Measuring Cache Performancen Components of CPU time
n Program execution cycles
n Includes cache hit time
n Memory stall cycles
n Mainly from cache misses
n With simplifying assumptions:
§5.4
Measurin
g a
nd Im
provin
g C
ache P
erfo
rm
ance
penalty MissnInstructio
MissesProgram
nsInstructio
penalty Missrate MissProgram
accessesMemory
cycles stallMemory
´´=
´´=
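The memory-stall formula can be evaluated directly; a sketch with hypothetical numbers (the instruction count, access rate, miss rate, and penalty below are assumptions for illustration):

```python
def memory_stall_cycles(instructions, accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles =
    (Memory accesses / Program) * Miss rate * Miss penalty,
    where Memory accesses = instructions * accesses per instruction."""
    return instructions * accesses_per_instr * miss_rate * miss_penalty

# Hypothetical numbers: 1e6 instructions, on average 1 memory access per
# instruction, 2% miss rate, 100-cycle miss penalty.
stalls = memory_stall_cycles(1_000_000, 1.0, 0.02, 100)
print(round(stalls))  # -> 2000000 stall cycles added to program execution cycles
```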