Transcript of "Tema 5: Processadors superescalars" (Topic 5: Superscalar processors), slides available at studies.ac.upc.edu/ETSETB/SEGPAR/slides/tema5.pdf

Topic 5: Superscalar processors
Eduard Ayguadé and Josep Llosa

These slides have been prepared using material available at: 1) the companion web site for "Computer Organization & Design. The Hardware/Software Interface", Copyright 1998 Morgan Kaufmann Publishers; 2) some slides and examples that are part of the teaching material of Prof. Guri Sohi (U. of Wisconsin/Madison); 3) some processor diagrams extracted from the Microprocessor Report journal, Copyright In-Stat/MDR; 4) other material available through the internet.
Issuing multiple instructions per cycle

CPI < 1

Two variations:
- Very Long Instruction Word (VLIW): fixed number of instructions (up to 16), scheduled by the compiler
  - Joint HP/Intel (EPIC/Itanium)
- Superscalar: varying number of instructions per cycle (1 to 8), scheduled by the compiler (statically scheduled) or by hardware (Tomasulo; dynamically scheduled)
  - IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA-8000
VLIW design

Making our pipeline superscalar, for example: 2 instructions executed per cycle, with some constraints:
- One instruction is arithmetic
- One instruction accesses memory
- The second can only be issued if the first is issued

2 instructions are fetched from memory, paired and aligned to 64-bit boundaries to feed the two pipelines.

[Pipeline diagram: in each cycle an ALU-or-branch instruction and a load-or-store instruction advance together through the F, D, E, M, W stages of the two pipelines.]
VLIW design

Simple superscalar code scheduling:

loop: ld   $3, 0($1)
      add  $3, $3, $2
      st   $3, 0($1)
      addi $1, $1, -4
      bne  $1, $0, loop

The first three instructions have data dependences, and so do the last two. A possible schedule is:

       ALU or branch        Data memory access    cycle
loop:  nop                  ld  $3, 0($1)         1
       addi $1, $1, -4      nop                   2
       add  $3, $3, $2      nop                   3
       bne  $1, $0, loop    st  $3, 4($1)         4

i.e. 4 cycles to execute 5 instructions (CPI = 4/5 = 0.8)
VLIW design

Loop unrolling can help to decrease CPI (assuming that the number of iterations is a multiple of 4):

loop: ld   $3, 0($1)
      add  $3, $3, $2
      st   $3, 0($1)
      ld   $3, 4($1)
      add  $3, $3, $2
      st   $3, 4($1)
      ld   $3, 8($1)
      add  $3, $3, $2
      st   $3, 8($1)
      ld   $3, 12($1)
      add  $3, $3, $2
      st   $3, 12($1)
      addi $1, $1, -16
      bne  $1, $0, loop

(notice that with unrolling we reduce the number of instructions that control the execution of the loop)

VLIW design

A possible schedule (after renaming registers across the unrolled copies) is:

       ALU or branch        Data memory access    cycle
loop:  nop                  ld  $3, 0($1)         1
       nop                  ld  $6, 12($1)        2
       add  $3, $3, $2      ld  $5, 8($1)         3
       add  $6, $6, $2      ld  $4, 4($1)         4
       add  $5, $5, $2      st  $3, 0($1)         5
       add  $4, $4, $2      st  $6, 12($1)        6
       addi $1, $1, -16     st  $5, 8($1)         7
       bne  $1, $0, loop    st  $4, 4+16($1)      8

i.e. 8 cycles to execute 4 iterations (2 cycles per iteration instead of 4): CPI = 8/14 ≈ 0.57
EPIC: beyond RISC and VLIW

Explicitly Parallel Instruction Computing:
- Parallel instruction encoding
- Instruction dependence hints allow flexible instruction grouping
- Large directly addressable register file (128 or more)
- Fully predicated instruction set

A family of binary-compatible processors, avoiding recompilation.

EPIC: Instruction Bundles

Grouping information: dependencies among instructions in the bundle (no empty slots as in VLIW) and chaining with the next bundle. No direct mapping to hardware as in VLIW.

[Bundle layout: Instruction 2 | Instruction 1 | Instruction 0 | Template]

Each instruction contains:
- Opcode
- Predicate register (6 bits)
- Source1 (7 bits)
- Source2 (7 bits)
- Destination (7 bits)
- Opcode extension / branch target / misc

The template contains:
- Instruction grouping information
- Prefetch hints
EPIC: Predication

Predicate registers: 64, each just one bit.

Increased opportunity for parallel execution.

With branches:                With predication:

      instr 1                       instr 1
      instr 2                       instr 2
      ...                           ...
      cmp (a==b)                    p1, p2 ← cmp (a==b)
      jump equ lb1            (p1)  instr 3
      instr 3                 (p1)  instr 4
      instr 4                 (p2)  instr 5
      jump lb2                (p2)  instr 6
lb1:  instr 5                       instr 7
      instr 6                       instr 8
lb2:  instr 7                       ...
      instr 8
      ...
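The if-conversion above can be illustrated at a high level. A minimal Python sketch (the function names and the `x + 1` / `x * 2` operations are illustrative, not from the slides): both arms are evaluated, but only predicate-guarded writes take effect, so no jump is needed.

```python
# If-conversion sketch: the branchy version skips one arm via control
# flow; the predicated version computes complementary predicates p1/p2
# and guards every write, removing the branch entirely.
def branchy(a, b, x):
    if a == b:
        x = x + 1          # the lb1 side of the branch
    else:
        x = x * 2          # the fall-through side
    return x

def predicated(a, b, x):
    p1 = a != b            # p1, p2 <- cmp (a==b): complementary predicates
    p2 = a == b
    t1 = x * 2             # both arms execute...
    t2 = x + 1
    if p1:                 # ...but only the guarded write commits
        x = t1
    if p2:
        x = t2
    return x
```

Both versions compute the same result; the predicated form exposes instr 3–6 for parallel issue because no control flow separates them.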
EPIC: Speculation

Long-latency loads stall the processor. Goal: avoid the complexity of the out-of-order logic of dynamic superscalar cores.

load.s: speculative, non-faulting load access

Without speculation:          With speculation:

instr 1                       load.s $1, …
instr 2                       instr 1
jump equ lb1                  instr 2
...                           jump equ lb1
load $1, …                    ...
instr 3 …, $1                 chk.s $1
…                             instr 3 …, $1
                              …
Itanium® architecture

Up to 6 instructions per cycle, from two bundles of 3 instructions each.

Itanium® vs. Itanium® 2 architecture

Estimated Itanium® 2 performance: 1.5–2x that of Itanium®

Itanium® 2 die layout and overview

Superscalar design
Superscalar design: execute multiple instructions every clock cycle

T = N × CPI × (1/W) × tc

where W is the number of instructions that can be initiated per clock cycle.

The superscalar approach avoids the rigid layout of instructions imposed by the VLIW design.
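Plugging illustrative numbers into the formula (all values below are assumed for the example, not taken from the slides):

```python
# T = N * CPI * (1/W) * tc with made-up values.
N   = 1_000_000   # dynamic instruction count (assumed)
CPI = 1.0         # cycles per instruction (assumed ideal)
W   = 4           # issue width: 4 instructions initiated per cycle
tc  = 10e-9       # cycle time: 10 ns (100 MHz)

T = N * CPI * (1 / W) * tc   # execution time in seconds, ~2.5 ms here
```

Doubling W halves T only if CPI stays at its ideal value, which the rest of the chapter shows is the hard part.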
We are going too fast … think again

#define N 9984
double x[N+8], y[N+8], u[N];
loop () {
  register int i;
  double q;
  for (i = 0; i < N; i++) {
    q = u[i] * y[i];
    y[i] = x[i] + q;
    x[i] = q - u[i] * x[i];
  }
}

Operations per iteration:
- 3 reads
- 2 writes
- 2 multiplications
- 1 addition
- 1 subtraction
We are going too fast … think again

      ld $10, @y[0]
      ld $11, @u[0]
      ld $12, @x[0]
      ld $13, N
loop: ld   f1, 0($10)
      ld   f2, 0($11)
      ld   f3, 0($12)
      mulf f4, f1, f2
      mulf f5, f2, f3
      add  $11, $11, #8
      addf f6, f4, f3
      subf f7, f4, f5
      st   0($10), f6
      add  $10, $10, #8
      st   0($12), f7
      sub  $13, $13, #1
      bne  loop
      add  $12, $12, #8   ; delay slot

Execution stage:
- 1 cycle integer
- 3 cycles FP
Branches effective after D
We are going too fast … think again

Pipelined processor with 1 integer unit and 1 FP unit, 100 MHz.

1 iteration of 14 instructions every 15 cycles. If tc = 10 ns, we get:
- 93 MIPS out of 100 MIPS peak
- 26.6 MFLOPS out of 100 MFLOPS peak
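The quoted rates follow directly from the per-iteration counts (the 4 FP operations are the 2 mulf, 1 addf and 1 subf in the loop body):

```python
# Quick check of the MIPS/MFLOPS figures on the slide.
freq_mhz = 100     # 100 MHz processor
instrs   = 14      # instructions per iteration
fp_ops   = 4       # mulf, mulf, addf, subf
cycles   = 15      # cycles per iteration

mips   = instrs / cycles * freq_mhz   # ~93.3, the slide's "93 MIPS"
mflops = fp_ops / cycles * freq_mhz   # ~26.7, the slide's "26.6 MFLOPS"
```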
[Pipeline diagram: the 14 instructions of one iteration flow through F, D, E, M, W; FP operations spend 3 cycles in E, and dependence stalls stretch 1 iteration to 15 cycles.]
We are going too fast … think again

VLIW processor with W=4: 1 integer unit, 1 FP unit, 1 memory unit and 1 branch unit.

1 iteration of 14 instructions every 5 cycles. If tc = 10 ns, we get:
- 280 MIPS out of 400 MIPS peak
- 80 MFLOPS out of 100 MFLOPS peak

Modulo scheduling (steady-state kernel; iteration index in parentheses, time flows downward):

memory       FP            integer            branches
st1 (i+1)    mulf2 (i+2)   add3 (i+2)
st2 (i)      mulf1 (i+2)                      bne (i+2)
ld3 (i+2)    subf (i+1)    sub (i+2)
ld2 (i+2)    addf (i+1)    add2 (i+2)
ld1 (i+2)                  add1 (i+2)
We are going too fast … think again

Cycle-by-cycle start-up of the modulo schedule (iteration number in parentheses; iterations 1 and 2 repeat the pattern of iteration 0, one starting every 5 cycles):

cycle   memory      FP          integer            branches
1       ld1 (0)                 add1 (0)
2       ld2 (0)                 add2 (0)
3       ld3 (0)                 sub (0)
4                   mulf1 (0)                      bne (0)
5                   mulf2 (0)   add3 (0)
7                   addf (0)
8                   subf (0)
10      st1 (0)
11      st2 (0)

Two competing memory operations in the same cycle: st2 is delayed, so one iteration lasts 3 more cycles, but we can overlap and start a new iteration every 5 cycles (pipelining).
We are going too fast … think again

loop: ld   f1, 0($10)
      ld   f2, 0($11)
      ld   f3, 0($12)
      mulf f4, f1, f2
      mulf f5, f2, f3
      add  $11, $11, #8
      addf f6, f4, f3
      subf f7, f4, f5
      st   0($10), f6
      add  $10, $10, #8
      st   0($12), f7
      sub  $13, $13, #1
      bne  loop
      add  $12, $12, #8
[Pipeline diagram: in-order execution of iterations i and i+1, overlapped; 11 cycles per iteration.]

In-order execution

Note: instructions start execution following the order of F/D. Otherwise, it could be 10 cycles per iteration.

How can a processor that fetches instructions in lexicographical order achieve this parallelism?
We are going too fast … think again

[Pipeline diagram: out-of-order execution of iterations i and i+1, overlapped; independent instructions bypass stalled ones.]

Out-of-order execution

loop: ld   f1, 0($10)
      ld   f2, 0($11)
      ld   f3, 0($12)
      mulf f4, f1, f2
      mulf f5, f2, f3
      add  $11, $11, #8
      addf f6, f4, f3
      subf f7, f4, f5
      st   0($10), f6
      add  $10, $10, #8
      st   0($12), f7
      sub  $13, $13, #1
      bne  loop
      add  $12, $12, #8

How can a processor that fetches instructions in lexicographical order achieve this parallelism?
We are going too fast … think again

[Pipeline diagram: out-of-order execution with branch prediction; iterations i, i+1 and i+2 overlapped — 5 cycles per iteration?]

Out-of-order execution + branch prediction

loop: ld   f1, 0($10)
      ld   f2, 0($11)
      ld   f3, 0($12)
      mulf f4, f1, f2
      mulf f5, f2, f3
      add  $11, $11, #8
      addf f6, f4, f3
      subf f7, f4, f5
      st   0($10), f6
      add  $10, $10, #8
      st   0($12), f7
      sub  $13, $13, #1
      bne  loop
      add  $12, $12, #8

How can a processor that fetches instructions in lexicographical order achieve this parallelism?
We are going too fast … think again

VLIW processor with W=7: 2 integer units, 2 FP units, 2 ld/st units and 1 branch unit.

1 iteration of 14 instructions every 3 cycles. If tc = 10 ns, we get:
- 467 MIPS out of 700 MIPS peak
- 133 MFLOPS out of 200 MFLOPS peak

[Modulo-scheduled kernel (3 cycles), operations by unit, iteration index in parentheses, time flows downward:
integer: add1 (i+3), add2 (i+3), add3 (i+3), sub (i+3)
FP: mulf1 (i+3), mulf2 (i+2), addf (i+2), subf (i+1)
memory: ld1 (i+3), ld2 (i+3), ld3 (i+3), st1 (i+1), st2 (i)
branches: bne (i+3)]
Superscalar design

Problems for superscalar design:

Structural hazards:
- Need multiple execution units (multiple pipelines)
- Need multiple simultaneous accesses to register files
- Need multiple simultaneous accesses to caches

Data hazards:
- How to deal with read-after-write (RAW) hazards
- How to deal with write-after-read (WAR) and write-after-write (WAW) hazards
- What to do with stalled instructions

Control hazards:
- What to do with conditional branches
- What to do with computed branches
Superscalar design

WAR and WAW dependences caused by out-of-order execution:

ld  $3, 10($1)
add $4, $3, $3
ld  $3, 100($2)
add $5, $3, $3
…
sub $6, $6, $3

[Pipeline diagram: the first ld spends many cycles in M (cache miss), so the second ld writes $3 first; the two "$3 written" points occur out of program order.]
Superscalar design

Structural hazards:
- Have as many functional units as needed
- Build register files with many read and write ports
- Build multi-ported caches

Data hazard solutions:
- Execute instructions in order; use a scoreboard to eliminate data hazards by stalling instructions
- Execute instructions out of order, as soon as operands are available, but graduate them in order. Why?
- Use register renaming to avoid WAR and WAW data hazards

Superscalar design

Control hazard solutions: use branch prediction:
- Make sure that the branch is resolved before registers are modified
- …or use speculative execution, rolling back results if branches were predicted wrong
Branch prediction

What do we need to predict for a jump/branch?
- jump: the target address, which can be stored in the instruction itself or computed from the current PC plus a displacement
- return from subroutine (ret): the return address, which is obtained from the stack (incrementing the SP and reading from memory)
- conditional branch: the target address, usually computed from the current PC plus a displacement, and the direction — is the branch going to be taken or continue with the next instruction?

Branch prediction

Branch Target Buffer (BTB): stores the target address for each jump/branch. It is a cache, so when it is full a replacement algorithm is applied.
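A BTB behaves like a small cache mapping branch addresses to targets. A dict-based sketch (class name, capacity and FIFO replacement are illustrative; real BTBs are set-associative with tag matching):

```python
class BTB:
    """Branch Target Buffer sketch: maps a branch's PC to its last
    known target. FIFO eviction stands in for a real replacement
    algorithm."""
    def __init__(self, capacity=4):
        self.entries = {}          # instruction @ -> target @
        self.capacity = capacity

    def predict(self, pc):
        # On a hit, predict the stored target; on a miss, fall through.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, target):
        # Insert/refresh an entry, evicting the oldest when full.
        if pc not in self.entries and len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))
        self.entries[pc] = target

btb = BTB()
btb.update(0x100, 0x40)    # the branch at 0x100 last jumped to 0x40
```

A hit yields the stored target; any PC not in the BTB predicts sequential fetch at PC + 4, matching the mux in the slide's figure.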
[BTB diagram: the PC indexes the BTB (instruction address → target address); on a hit the predicted next PC is the stored target, otherwise PC + 4.]
Branch prediction

The BTB does not work for predicting the return address of a subroutine: the same instruction address (the one that points to ret) needs to have different target addresses.

[Diagram: two different call sites call the same subroutine; its single ret must return to a different address each time.]

Solution: implement a return stack that mimics the original stack.
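The return stack can be sketched with a plain list: a call pushes its return address, a ret pops the prediction. The structure mirrors the program's stack but lives in the fetch engine; all names and the fixed 4-byte instruction size are illustrative.

```python
class ReturnStack:
    """Return address stack sketch: predicts ret targets that a BTB
    cannot, because one ret instruction returns to many call sites."""
    def __init__(self):
        self.stack = []

    def on_call(self, call_pc):
        # Return address = the instruction after the call.
        self.stack.append(call_pc + 4)

    def predict_ret(self):
        # Pop the most recent return address (0 as a dummy if empty).
        return self.stack.pop() if self.stack else 0

ras = ReturnStack()
ras.on_call(0x100)        # first call site
ras.on_call(0x200)        # nested call from a second site
```

The two pops now yield 0x204 and then 0x104, i.e. each ret is predicted with the return address of its own caller.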
Branch prediction

Branch History Table (BHT): says whether or not the branch was taken last time. The simplest is a 1-bit table attached to the BTB, set to 1 if the branch jumped last time (taken) and 0 otherwise (not taken); initially set to 0.

[Diagram: the BTB is extended with a 1-bit BHT; the predicted target is used only on a BTB hit AND a BHT bit of 1 (taken), otherwise PC + 4.]
Branch prediction

Question: how many mispredictions per loop?

Answer: 2
- At the end of the loop, when it exits instead of looping as before
- The first time through the loop on the next pass through the code, when it predicts exit instead of looping

Solution: a 2-bit counter BHT that changes its prediction only if it mispredicts twice:
- Increment for taken, decrement for not taken
- States 00, 01, 10, 11 (initially set to 00)
Branch prediction

Automaton for the 2-bit counter predictor:

[State diagram: states 11 and 10 predict taken, 01 and 00 predict not taken; each taken outcome moves the counter toward 11, each not-taken outcome moves it toward 00.]

Can it be better? Yes, of course … there has been a lot of research on branch prediction in recent years.
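The automaton can be written directly as a saturating counter. A sketch (state encoding 0–3, with the two upper states predicting taken, matching the description above):

```python
class TwoBitPredictor:
    """2-bit saturating counter: only two consecutive mispredictions
    flip the prediction."""
    def __init__(self):
        self.state = 0                 # 00: strongly not-taken (initial)

    def predict(self):
        return self.state >= 2         # states 10 and 11 predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # move toward 11
        else:
            self.state = max(0, self.state - 1)   # move toward 00

# A 10-iteration loop branch: taken 9 times, then not taken on exit.
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
```

On the first pass the cold counter mispredicts while warming up; on every later pass of the same loop only the exit mispredicts, i.e. 1 per loop execution instead of the 1-bit scheme's 2.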
Correlating branches

Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.

Idea: record whether the m most recently executed branches were taken or not taken, and use that pattern to select the proper branch history table.

In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters.
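A minimal (m,n) predictor sketch: an m-bit global history selects one of 2^m entries of n-bit saturating counters. For brevity the tables are shared across branches (real predictors also index by PC bits, as the next slide notes), and all sizes are illustrative.

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: the outcomes of the last m branches
    select among 2**m n-bit saturating counters."""
    def __init__(self, m=2, n=2):
        self.m, self.n = m, n
        self.history = 0                 # last m outcomes as a bit pattern
        self.tables = [0] * (2 ** m)     # one n-bit counter per pattern

    def predict(self):
        # Upper half of the counter range predicts taken.
        return self.tables[self.history] >= 2 ** (self.n - 1)

    def update(self, taken):
        c = self.tables[self.history]
        limit = 2 ** self.n - 1
        self.tables[self.history] = min(limit, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the m-bit global history.
        self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)

# An alternating branch (T, NT, T, NT, ...) defeats a 2-bit counter
# but is learned perfectly once the (2,2) predictor warms up.
p = CorrelatingPredictor(m=2, n=2)
pattern = [True, False] * 12
for t in pattern[:8]:                    # warm-up
    p.update(t)
hits = sum((p.predict() == t, p.update(t))[0] for t in pattern[8:])
```

With m = 0 this degenerates to a single counter, i.e. the (0,2) predictor of the previous slide.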
Correlating branches

[Diagram: the BTB is paired with an m-bit global history register; the history pattern selects one of the 2^m pattern history tables (PHT) of n-bit counters, which provides the taken/not-taken prediction on a BTB hit.]

The old 2-bit BHT is then a (0,2) predictor. The PHT could also be indexed with some bits of the PC.
Accuracy of different schemes

[Chart: frequency of mispredictions (0%–18%) on SPEC benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries of a (2,2) predictor; the (2,2) predictor achieves the lowest misprediction rates on most benchmarks.]
Selective history predictor

… but the same predictor is not always the best.

[Diagram: a selector table of 2-bit counters ("choose predictor1" / "choose predictor2"), indexed by the PC and k bits of global history, chooses between predictor1 and predictor2; the selected predictor provides the taken/not-taken prediction.]
Speculation

Allow an instruction that depends on a branch to execute (without any consequences, including exceptions): boosting.

Separate speculative bypassing of results from real bypassing of results:
- When the instruction is no longer speculative (i.e. the branch has been resolved), its boosted results can update the state or can be discarded
- Execute out of order but commit in order, to prevent any irrevocable action (state update or exception) until the instruction commits

We will elaborate on this later in this chapter.
Dependences in a program

RAW, WAR and WAW dependences

[Diagram illustrating RAW, WAR and WAW dependences between instructions.]

Dependences in a program

RAW dependences are important because they determine the data flow in the program. We will solve them later in this chapter.

WAR and WAW dependences appear because we are reusing registers to store temporary values:
- The number of registers visible at the machine-language level is fixed (and usually small)
- They need to be reused

Solution: dynamically rename registers!
Register renaming

Each entry of the (logical) register file either:
- contains the value that is stored in this register, or
- contains a pointer to an element of a list of (physical) registers available for renaming

[Diagram: register file entries $0…$31, each with a flag (1: value, 0: renamed) and either the value or a pointer into the rename buffer ren0…renj.]
Register renaming

At the decode stage, the destination register (e.g. $i) is always renamed with a register from the rename buffer (e.g. renj).

From now on, when an instruction uses register $i, the name of the source register is changed to renj. When the value of renj has been computed, it is transferred (if still needed as register $i) to the register file; renj is then free for a new renaming.

Register renaming

Example:

add $3, $3, 4        add ren1, $3, 4
ld  $8, ($3)    →    ld  ren2, (ren1)
add $3, $3, 4        add ren3, ren1, 4
ld  $9, ($3)         ld  ren4, (ren3)

If the first instruction finishes after ren3 has been used, then the result in ren1 does not have to be written back to $3.

[Diagram: successive register-file snapshots as $3 is renamed to ren1 and then ren3, and $8, $9 to ren2 and ren4.]
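The renaming of the example can be mechanized. A sketch of decode-time renaming (simplified: an unbounded rename buffer and no recycling of freed entries):

```python
from itertools import count

class Renamer:
    """Rename destination registers at decode and rewrite source
    operands to the current renaming, as in the $3 -> ren1 -> ren3
    example above."""
    def __init__(self):
        self.mapping = {}              # logical register -> current name
        self.fresh = count(1)          # produces ren1, ren2, ...

    def rename(self, dest, sources):
        # Sources read the *current* mapping (or the architectural reg).
        srcs = [self.mapping.get(s, s) for s in sources]
        # The destination always gets a fresh rename-buffer entry.
        new = f"ren{next(self.fresh)}"
        self.mapping[dest] = new
        return new, srcs

r = Renamer()
# add $3, $3, 4 ; ld $8, ($3) ; add $3, $3, 4 ; ld $9, ($3)
i1 = r.rename("$3", ["$3"])
i2 = r.rename("$8", ["$3"])
i3 = r.rename("$3", ["$3"])
i4 = r.rename("$9", ["$3"])
```

The four results reproduce the renamed code of the slide: ren1 receives the first $3, the first ld reads ren1, the second add writes ren3, and the second ld reads ren3 — so the WAW/WAR hazards on $3 disappear.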
In-order superscalar processors

Instructions are fetched, executed and committed in compiler-generated order:
- If one instruction stalls, all instructions behind it stall

Instructions are statically scheduled by the hardware:
- They are issued in their compiler-generated order
- The hardware decides how many of the next n instructions can be issued, where n is the superscalar issue width

Main advantage of in-order instruction scheduling: simpler implementation
- Faster clock cycle
- Fewer transistors
Out-of-order superscalar processors

Instructions are fetched in compiler-generated order, but they may be executed out of this order. Instruction completion may be in order (today) or out of order (older computers).

Dynamic scheduling:
- The hardware decides in what order instructions can be executed
- Instructions behind a stalled instruction can pass it

Main advantage of out-of-order execution: higher performance
- Better at hiding latencies, less processor stalling
- Higher utilization of functional units
Precise exceptions

An exception is precise if the following two conditions are met:
- All the instructions preceding the instruction that produced the exception have been executed and have modified the process state correctly
- All the instructions following the instruction that produced the exception have not yet been executed and have made no modification to the process state

In-order completion is necessary in order to have a microarchitecture with precise exceptions.
Out-of-order superscalar processors

Dynamic execution

Based on Tomasulo's algorithm, proposed back in the 60's. Why do we study it? Because it led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, … It did not consider in-order completion.
Dynamic execution

Control and buffers are distributed with the functional units.

[Diagram: reservation stations feed the ADD and MULT units; operands come from the register file and the rename/reorder buffer, and results are broadcast on the Common Data Bus (CDB), which also implements the bypasses.]
Reservation stations

An instruction is sent to a reservation station if there is one empty for the resource that can execute it. Reservation stations hold instructions, possibly with pending operands, which wait there until all operands are available.

Each reservation station entry holds:

busy | oper | tag1 | source1 | value1 | tag2 | source2 | value2 | dest

- busy: if 1, the reservation station is occupied by an instruction (with pending operands or in execution)
- tag1, tag2: 0 = source operand not available, 1 = source operand available
- source1, source2: if tag = 0, pointer to a register in the rename buffer
- value1, value2: if tag = 1, the value of the operand
- dest: pointer to the destination register in the rename buffer
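The entry format can be written down directly. A sketch with a CDB-broadcast step (field names follow the slide; the dataclass packaging and float values are mine):

```python
from dataclasses import dataclass

@dataclass
class RSEntry:
    """One reservation station entry, fields as on the slide."""
    busy: bool = False
    oper: str = ""
    tag1: int = 0        # 0: operand pending, 1: operand available
    source1: int = 0     # if tag1 == 0: rename-buffer index of operand 1
    value1: float = 0.0  # if tag1 == 1: value of operand 1
    tag2: int = 0
    source2: int = 0
    value2: float = 0.0
    dest: int = 0        # rename-buffer index of the destination

    def ready(self):
        # Ready to execute when both operands are available.
        return self.busy and self.tag1 == 1 and self.tag2 == 1

    def snoop_cdb(self, cdb_dest, cdb_value):
        # Capture a broadcast result that matches a pending operand.
        if self.tag1 == 0 and self.source1 == cdb_dest:
            self.tag1, self.value1 = 1, cdb_value
        if self.tag2 == 0 and self.source2 == cdb_dest:
            self.tag2, self.value2 = 1, cdb_value

rs = RSEntry(busy=True, oper="add", tag1=1, value1=3.0, source2=7)
rs.snoop_cdb(7, 4.0)     # the CDB broadcasts the value destined for ren7
```

Before the broadcast the entry is not ready (tag2 = 0, waiting on ren7); after snooping the matching CDB result it becomes ready, which is exactly the monitoring rule of the next slide.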
Reservation stations

Each reservation station (RS) monitors whether a result whose destination register is one of its pending operands is available on the CDB:

RStag = 0 and RSsource = CDBdest

The CDB transfers the value (CDBvalue) that will be stored in the destination register (CDBdest) of the rename buffer.
Reservation stations

Ready to execute? … and then?

When both RStag1 = RStag2 = 1, the instruction is ready for execution.

For example, assume that we have two adders and four reservation stations (Entry 0..3) that can feed them. The signal "I want an adder" is generated as RStag1 AND RStag2.
Reservation stations

All reservation stations can be unified in a single structure (in some processors called the instruction window).

Example

F1 ← F2 / F4
F6 ← F0 + F1
F1 ← F3 - F4
F7 ← F1 * F5
F8 ← F2 + F3

1 ADD/SUB unit, 3 cycles, 2 RS
1 MUL unit, 4 cycles, 1 RS
1 DIV unit, 8 cycles, 1 RS

[Timing diagram, cycles 1–20: each instruction goes through F, D, I (issue to a reservation station, waiting there while operands or the unit are busy), E (3, 4 or 8 cycles) and W (write the result on the CDB).]

Note: the reservation station is occupied from I to W.
Access to memory

Some superscalar processors only allowed a single memory operation per cycle, but this rapidly became a performance bottleneck. To allow multiple memory requests to be serviced simultaneously, the memory hierarchy has to be multiported.

It is usually sufficient to multiport only the lowest level of the memory hierarchy, namely the primary caches, since many requests do not proceed to the upper levels of the hierarchy.
Multiported cache

Access time increases with the number of ports. Multiporting can be achieved by making multiple serial requests during the same cycle.

[Diagram: one L1 cache with two ports, port1 and port2.]

Multiported cache

Multiporting can also be achieved by having multiple memory banks: an interleaved cache. Bandwidth is reduced if both accesses go to the same bank.

[Diagram: four cache banks (bank1–bank4) shared by two ports.]
Access to memory

To allow memory operations to be overlapped with other operations (both memory and non-memory), the memory hierarchy must be non-blocking. That is, if a memory request misses in the data cache, other memory requests should be allowed to proceed: hit-on-miss or miss-on-miss.
Access to memory

Each memory port requires an adder to compute the effective memory address: register numbers are readily available from the instructions themselves, while memory addresses have to be computed and are available late in the pipeline.

Loads/stores do exhibit RAW, WAW and WAR dependences, both through the computation of the effective address and through the value that needs to be stored in memory.

Reservation stations for ld/st

Are there any order constraints that need to be considered?
Dependency checking

Examples:

ld  $1, 100($2)        mul $1, $2, $3
mul $3, $4, $5         st  $2, 4($1)
ld  $6, 4($3)          ld  $5, 10($6)
st  $7, 10($8)         add $7, $7, $5

[Pipeline diagrams: in each example the later memory operation becomes ready before the earlier one, whose address is still being computed by the mul.]

What if 4($3) = 10($8)? Letting the store bypass the earlier load would be incorrect!
What if 4($1) = 10($6)? Letting the load bypass the earlier store would be incorrect!
Dependency checking

Examples:

mul $1, $2, $3
st  $4, 4($1)
st  $5, 10($6)

Stores must go to memory in FIFO order in order to have precise exceptions:
- What if 4($1) = 10($6)? Reordering the two stores would be incorrect!
- What if 4($1) ≠ 10($6), but 4($1) causes an exception? For the exception to be precise, the second store cannot go to memory first.
Dependency checking

Whenever there is a load in a reservation station whose address cannot yet be computed, any store that follows cannot go to memory. Similarly, whenever there is a store in a reservation station whose address is not yet known, any memory access that follows cannot go to memory.

Bypassing memory accesses

New load addresses are checked against waiting store addresses. If there is a match, the load must wait for the store it matches. The store data may be bypassed to the matching load.
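The check can be sketched as a small store queue: an incoming load compares its address against the waiting stores, youngest first; a match forwards the store's data, and a store with an unknown address blocks the load (all names are illustrative).

```python
class StoreQueue:
    """Waiting stores, oldest first; each entry is (address, data),
    where address is None if it has not been computed yet."""
    def __init__(self):
        self.stores = []

    def add_store(self, addr, data):
        self.stores.append((addr, data))

    def load(self, addr):
        # Scan from youngest to oldest for a matching store.
        for st_addr, st_data in reversed(self.stores):
            if st_addr is None:
                return ("wait", None)        # unknown address: must wait
            if st_addr == addr:
                return ("bypass", st_data)   # forward the store's data
        return ("memory", None)              # no match: read from memory

sq = StoreQueue()
sq.add_store(0x100, 7)
```

A load to 0x100 now bypasses the queued store's data without touching memory; once a store with an unresolved address is queued, every younger load must wait, which is the ordering rule stated above.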
Instruction retirement

Retiring only one instruction per cycle can be a bottleneck; it is possible to retire multiple instructions in parallel.

[Diagram: rename/reorder buffer.]

Reorder buffer

The reorder buffer is an extension of the rename buffer: the tail does not advance until the value for that renamed register has been generated.

[Diagram: circular buffer ren0…renj; the tail points to the first renamed register pending to be moved to the register file, and the head points to the last register renamed (i.e. the last instruction decoded).]
Reorder buffer

If the head reaches the tail, renaming has to be stopped. This stalls the processor until registers for renaming are available (structural hazard).

A value from a register in the reorder buffer may not need to be transferred to the register file: none of the registers in the register file is renamed to it.
Support for speculative execution

Each entry in the reorder buffer contains a special "speculative" bit:
- Speculative instructions are marked in the reorder buffer
- Should a branch be confirmed, the speculative bits of the corresponding speculative instructions are turned to "confirm"
- If the branch is not confirmed, the status is set to "kill"

When an instruction reaches the tail of the reorder buffer:
- If it is marked "speculative", retirement must stall until it is no longer speculative
- If it is marked "confirm", continue to commit the instruction
- If it is marked "kill", its result is discarded
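Retirement at the tail of the reorder buffer can be sketched as follows (the three status values follow the slide; the deque packaging and entry names are mine):

```python
from collections import deque

def retire(rob):
    """Commit from the tail of the reorder buffer, in order.
    Each entry is (name, status) with status 'speculative',
    'confirm' or 'kill'."""
    committed, discarded = [], []
    while rob:
        name, status = rob[0]
        if status == "speculative":
            break                        # stall until the branch resolves
        rob.popleft()
        if status == "confirm":
            committed.append(name)       # update architectural state
        else:                            # "kill"
            discarded.append(name)       # result is thrown away
    return committed, discarded

rob = deque([("i1", "confirm"), ("i2", "kill"), ("i3", "speculative")])
done, killed = retire(rob)
```

Here i1 commits, i2's result is discarded, and retirement stalls at i3 until its branch is resolved, so no irrevocable action happens out of order.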
Example

F1 ← F2 / F4
F6 ← F0 + F1
F1 ← F3 - F4
F7 ← F1 * F5
F8 ← F2 + F3

1 ADD/SUB unit, 3 cycles, 2 RS
1 MUL unit, 4 cycles, 1 RS
1 DIV unit, 8 cycles, 1 RS

[Timing diagram, cycles 1–20: as in the previous example, but each instruction now ends with a C (commit) stage; since commit is in order, later instructions wait for earlier ones before committing.]

Example

F1 ← F2 / F4
F6 ← F0 + F1
F1 ← F3 - F4
F7 ← F1 * F5
F8 ← F2 + F3

1 ADD/SUB unit, 3 cycles, 2 RS
1 MUL unit, 4 cycles, 1 RS
1 DIV unit, 8 cycles, 1 RS

[Timing diagram, cycles 1–20: with a rename/reorder buffer of only 4 entries, renaming stalls until an entry is freed, delaying fetch and decode.]

rename/reorder buffer with 4 entries: structural hazard
Yet another example

The processor has:
- an instruction window with 24 entries
- a load/store unit (LSU) with 8 entries
- an instruction fetch buffer (IB) with 4 entries
- 2 adders
- 1 pipelined multiplier with a 2-cycle latency
- two memory ports, both pipelined with a 2-cycle latency

The physical register file is integrated into the instruction window. A structure (RAT: Register Alias Table) maintains mappings from logical register numbers to instruction window entries (if the logical register is renamed).
Yet another example: cycle 1
Yet another example: cycle 2
Yet another example: cycle 3
Yet another example: cycle 4
Yet another example: cycle 5
Yet another example: cycle 6
Yet another example: cycle 7
Yet another example: cycle 8
Yet another example: cycle 9
Yet another example: cycle 10
Yet another example: cycle 11
Current high-performance µP (8/06)

Current high-performance µP (8/06)

30 years of progress

4004 to Pentium® 4 processor:
- Transistor count: more than 20,000x increase
- Frequency: more than 20,000x increase
- 39% compound annual growth
Now you can tell what has happened in between …