Advanced Computer Architecture Instruction Level...
Transcript of Advanced Computer Architecture Instruction Level...
Advanced Computer Architecture
Instruction Level ParallelismInstruction Level Parallelism
Myung Hoon, Sunwoo
School of Electrical and Computer EngineeringAjou University
Outline• ILP• Compiler techniques to increase ILP• Loop Unrolling• Static Branch Prediction• Dynamic Branch Prediction• Dynamic Branch Prediction• Overcoming Data Hazards with Dynamic Scheduling• Tomasulo Algorithmg• Speculation
Ajou Univ. SOC Lab.MultimediaCommunications2
Recall from Pipelining Review
• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls p p p+ Control Stalls
– Ideal pipeline CPI: measure of the maximum performance attainable by the implementationatta ab e by t e p e e tat o
– Structural hazards: HW cannot support this combination of instructionsD t h d I t ti d d lt f i– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
Ajou Univ. SOC Lab.MultimediaCommunications3
Instruction Level Parallelism• Instruction-Level Parallelism (ILP): overlap the execution of
instructions to improve performance
• 2 approaches to exploit ILP:1) Rely on hardware to help discover and exploit the parallelism dynamically
(e.g., Pentium 4, AMD Opteron, IBM Power)(e.g., Pentium 4, AMD Opteron, IBM Power)
2) Rely on software technology to find parallelism, statically at compile-time (e.g., Itanium 2)( g , )
Ajou Univ. SOC Lab.MultimediaCommunications4
Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small– BB: a straight-line code sequence with no branches in except to the entry and– BB: a straight-line code sequence with no branches in except to the entry and
no branches out except at the exit– average dynamic branch frequency 15% to 25%
=> 4 to 7 instructions execute between a pair of branches– Plus instructions in BB likely to depend on each otherPlus instructions in BB likely to depend on each other
• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism to exploit parallelism among iterations of a loop. E.g.,
for (i=1; i<=1000; i=i+1)x[i] = x[i] + y[i];
Ajou Univ. SOC Lab.MultimediaCommunications5
Loop-Level Parallelism
• Exploit loop-level parallelism to parallelism by “unrolling loop” either by
1.dynamic via branch prediction or 2.static via loop unrolling by compiler
Determining instruction dependence is critical to Loop Level Parallelism
• If 2 instructions areparallel they can execute simultaneously in a pipeline of arbitrary depth– parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
– dependent, they are not parallel and must be executed in order, although they may often be partially overlappedy y p y pp
Ajou Univ. SOC Lab.MultimediaCommunications6
Data Dependence and Hazards
• InstrJ is data dependent (aka true dependence) on InstrI:1 I t t i t d d b f I t it it1. InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3J: sub r4 r1 r3
2. or InstrJ is data dependent on InstrK which is dependent on InstrI
• If two instructions are data dependent, they cannot execute simultaneously or
J: sub r4,r1,r3
p , y ybe completely overlapped
• Data dependence in instruction sequence ⇒ data dependence in source code ⇒ effect of original data dependence must be preserved
• If data dependence caused a hazard in pipeline, called a Read After Write (RAW) hazard
Ajou Univ. SOC Lab.MultimediaCommunications7
ILP and Data Dependancies, Hazards
• HW/SW must preserve program order: order instructions would execute in if executed sequentially as determined by
i i loriginal source program– Dependences are a property of programs
• Presence of dependence indicates potential for a hazard, but actual hazard and length of any stall is property of the pipeline
• Importance of the data dependencies1) indicates the possibility of a hazard2) determines order in which results must be calculated2) determines order in which results must be calculated3) sets an upper bound on how much parallelism can possibly be exploited
HW/SW l l it ll li b i d l h it• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Ajou Univ. SOC Lab.MultimediaCommunications8
Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use same register or memory location called a name but no flow of data between the instructionslocation, called a name, but no flow of data between the instructions associated with that name; 2 versions of name dependence
• Instr writes operand before Instr reads it• InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”
• If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard
Ajou Univ. SOC Lab.MultimediaCommunications9
Name Dependence #2: Output dependence
• InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3 J: add r1,r2,r3
• Called an “output dependence” by compiler writers
J: add r1,r2,r3K: mul r6,r1,r7
• Called an output dependence by compiler writersThis also results from the reuse of name “r1”
• If anti-dependence caused a hazard in the pipeline called a Write After WriteIf anti dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard
• Instructions involved in a name dependence can execute simultaneously if nameInstructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict
– Register renaming resolves name dependence for regs– Either by compiler or by HW
Ajou Univ. SOC Lab.MultimediaCommunications10
Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to , g , p ppreserve program orderif p1 {S1;S1;
};if p2 {if p2 {S2;
}}
• S1 is control dependent on p1, and S2 is control dependent on p2but not on p1.p
Ajou Univ. SOC Lab.MultimediaCommunications11
Control Dependence Ignored
• Control dependence need not be preservedControl dependence need not be preserved– willing to execute instructions that should not have been executed, thereby
violating the control dependences, if can do so without affecting correctness of the program
• Instead, 2 properties critical to program correctness are 1) exception behavior and ) p2) data flow
Ajou Univ. SOC Lab.MultimediaCommunications12
Exception Behavior• Preserving exception behavior
⇒ any changes in instruction execution order must not change how exceptions are raised in program (⇒ no new exceptions)
• Example:pDADDU R2,R3,R4BEQZ R2,L1LW R1,0(R2)
L1L1:– (Assume branches not delayed)
• Problem with moving LW before BEQZ?
Ajou Univ. SOC Lab.MultimediaCommunications13
Data Flow
• Data flow: actual flow of data values among instructions that produce g presults and those that consume them
– branches make flow dynamic, determine which instruction is supplier of data
• Example:DADDU R1,R2,R3BEQZ R4,LQ ,DSUBU R1,R5,R6L: …OR R7,R1,R8
• OR depends on DADDU or DSUBU? Must preserve data flow on execution
Ajou Univ. SOC Lab.MultimediaCommunications14
Computers in the News
Who said this? A. Jimmy Carter, 1979B. Bill Clinton, 1996
C. Al Gore, 2000D. George W. Bush, 2006
"Again I'd repeat to you that if we can remain theAgain, I d repeat to you that if we can remain the most competitive nation in the world, it will benefit the worker here in America. People have got to understand, when we talk about spending your taxpayers' money on research and development, there is a correlating benefit, particularly to your children. See, it takes a while for some of the investments that are being made with government dollars to come to market. I don't know if people realize this but the Internet began as the Defenserealize this, but the Internet began as the Defense Department project to improve military communications. In other words, we were trying to figure out how to better communicate, here was research money spent, and as a result y pof this sound investment, the Internet came to be.
The Internet has changed us. It's changed the whole world."
Ajou Univ. SOC Lab.MultimediaCommunications15
Outline• ILP• Compiler techniques to increase ILP
Loop Unrolling• Loop Unrolling• Static Branch Prediction• Dynamic Branch Prediction• Overcoming Data Hazards with Dynamic Scheduling• Tomasulo Algorithm• SpeculationSpeculation
Ajou Univ. SOC Lab.MultimediaCommunications16
Software Techniques - Example
• This code, add a scalar to a vector:This code, add a scalar to a vector:for (i=1000; i>0; i=i–1)
x[i] = x[i] + s;• Assume following latencies for all examples• Assume following latencies for all examples
– Ignore delayed branch in these examples
Instruction Instruction Delay Latency(distance) (stalls between)
producing result using result in cycles in cyclesproducing result using result in cycles in cyclesFP ALU op Another FP ALU op 4 3FP ALU op Store double 3 2 Load double FP ALU op 2 1Load double Store double 1 0Integer op Integer op 1 0
Ajou Univ. SOC Lab.MultimediaCommunications17
Integer op Integer op 1 0
FP Loop: Where are the Hazards?
• First translate into MIPS code: -To simplify assume 8 is lowest address
Loop: L.D F0,0(R1);F0=vector elementADD D F4 F0 F2 dd l f F2
-To simplify, assume 8 is lowest address
ADD.D F4,F0,F2;add scalar from F2S.D 0(R1),F4;store resultDADDUI R1,R1,-8;decrement pointer 8B (DW)BNEZ R1,Loop ;branch R1!=zero
Ajou Univ. SOC Lab.MultimediaCommunications18
FP Loop Showing Stalls1 Loop: L.D F0,0(R1) ;F0=vector element2 stall3 ADD.D F4,F0,F2 ;add scalar in F24 stall5 stall5 stall6 S.D 0(R1),F4 ;store result7 DADDUI R1,R1,-8 ;decrement pointer 8B (DW)
Instr ction Instr ction Latenc in
8 stall ;assumes can’t forward to branch9 BNEZ R1,Loop ;branch R1!=zero
Instruction Instruction Latency inproducing result using result clock cyclesFP ALU op Another FP ALU op 3
9 clock c cles Re rite code to minimi e stalls?
FP ALU op Store double 2 Load double FP ALU op 1
Ajou Univ. SOC Lab.MultimediaCommunications
• 9 clock cycles: Rewrite code to minimize stalls?
19
Revised FP Loop Minimizing Stalls
1 Loop: L.D F0,0(R1)2 DADDUI R1 R1 -82 DADDUI R1,R1,-83 ADD.D F4,F0,F24 stall
5 stall
6 S.D 8(R1),F4 ;altered offset when move DSUBUI
7 BNEZ R1,Loop
Instruction Instruction Latency in
, p
Swap DADDUI and S.D by changing address of S.Dy
producing result using result clock cyclesFP ALU op Another FP ALU op 3FP ALU op Store double 2
7 clock cycles but just 3 for execution (L D ADD D S D) 4 for loop
FP ALU op Store double 2 Load double FP ALU op 1
Ajou Univ. SOC Lab.MultimediaCommunications
7 clock cycles, but just 3 for execution (L.D, ADD.D,S.D), 4 for loop overhead; How make faster?
20
Unroll Loop Four Times (straightforward way)
Rewrite loop to minimize stalls?
1 Loop:L.D F0,0(R1)3 ADD.D F4,F0,F2
1 cycle stall2 cycles stall
6 S.D 0(R1),F4 ;drop DSUBUI & BNEZ7 L.D F6,-8(R1)9 ADD.D F8,F6,F212 S.D -8(R1),F8 ;drop DSUBUI & BNEZ13 L.D F10,-16(R1)15 ADD.D F12,F10,F218 S 16( 1) 12 d S18 S.D -16(R1),F12 ;drop DSUBUI & BNEZ19 L.D F14,-24(R1)21 ADD.D F16,F14,F224 S D 24(R1) F1624 S.D -24(R1),F1625 DADDUI R1,R1,#-32 ;alter to 4*826 BNEZ R1,LOOP
27 clock cycles, or 6.75 per iteration(Assumes R1 is multiple of 4)
Ajou Univ. SOC Lab.MultimediaCommunications21
( p )
Unrolled Loop Detail• Do not usually know upper bound of loop• Suppose it is n, and we would like to unroll the loop to make k copies of
the bodythe body• Instead of a single unrolled loop, we generate a pair of consecutive
loops:– 1st executes (n mod k) times and has a body that is the original loop– 1st executes (n mod k) times and has a body that is the original loop– 2nd is the unrolled body surrounded by an outer loop that iterates (n/k)
times• For large values of n, most of the execution time will be spent in theFor large values of n, most of the execution time will be spent in the
unrolled loop
Ajou Univ. SOC Lab.MultimediaCommunications22
Unrolled Loop Detailn = 10, k = 3
(n mod k) = 1
/
n = 10, k = 5
(n mod k) = 0
Loop: L.D F0,0(R1)ADD.D F4,F0,F2S D 0(R1) F4
L.D F0,0(R1)ADD.D F4,F0,F21st exec.
n / k = 3n / k = 2
S.D 0(R1),F4L.D F6,-8(R1)ADD.D F8,F6,F2S.D -8(R1),F8
S.D 0(R1),F4Loop: L.D F6,-8(R1)
ADD.D F8,F6,F2S D 8(R1) F82nd( ),
L.D F10,-16(R1)ADD.D F12,F10,F2S.D -16(R1),F12 L D F14 24(R1)
S.D -8(R1),F8 L.D F10,-16(R1)ADD.D F12,F10,F2S.D -16(R1),F12
2nd exec.~
L.D F14,-24(R1)ADD.D F16,F14,F2S.D -24(R1),F16L.D F14,-32(R1)
L.D F14,-24(R1)ADD.D F16,F14,F2S.D -24(R1),F16DADDUI R1 R1 # 32
ADD.D F16,F14,F2S.D -32(R1),F16DADDUI R1,R1,#-40BNEZ R1 LOOP
DADDUI R1,R1,#-32BNEZ R1,LOOP
Ajou Univ. SOC Lab.MultimediaCommunications23
BNEZ R1,LOOP
Unrolled Loop That Minimizes Stalls1 Loop:L.D F0,0(R1)2 L.D F6,-8(R1)3 L D F10 -16(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD.D F4,F0,F26 ADD D F8 F6 F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD.D F16,F14,F29 S.D 0(R1),F49 S.D 0(R1),F410 S.D -8(R1),F811 S.D -16(R1),F1212 DSUBUI R1,R1,#32, ,13 S.D 8(R1),F16 ; 8-32 = -2414 BNEZ R1,LOOP
14 clock cycles, or 3.5 per iteration
Ajou Univ. SOC Lab.MultimediaCommunications24
5 Loop Unrolling Decisions• Requires understanding how one instruction depends on
another and how the instructions can be changed or reordered given the dependences:given the dependences:
1. Determine loop unrolling useful by finding that loop iterations were independent (except for maintenance code)
2 U diff t i t t id t i t f d2. Use different registers to avoid unnecessary constraints forced by using same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the jloop termination and iteration code
4. Determine that loads and stores in unrolled loop can be interchanged by observing that loads and stores from differentinterchanged by observing that loads and stores from different iterations are independent • Transformation requires analyzing memory addresses and finding that they
do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code
Ajou Univ. SOC Lab.MultimediaCommunications25
3 Limits to Loop Unrolling1. Decrease in amount of overhead amortized with each extra unrolling
• Amdahl’s Law
2. Growth in code size • For larger loops, concern it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
If t b ibl t ll t ll li l t i t l• If not be possible to allocate all live values to registers, may lose some or all of its advantage
• Loop unrolling reduces impact of branches on pipeline; another way• Loop unrolling reduces impact of branches on pipeline; another way is branch prediction
Ajou Univ. SOC Lab.MultimediaCommunications26
Static Branch Prediction• Lecture from last week showed scheduling code around
delayed branch• To reorder code around branches, need to predict branchTo reorder code around branches, need to predict branch
statically when compile • Simplest scheme is to predict a branch as taken
Average misprediction = untaken branch frequency = 34% SPEC– Average misprediction = untaken branch frequency = 34% SPEC
22%
18%20%
25%
e
12%
18%
11% 12%9% 10%
15%
10%
15%
20%
dict
ion
Rat
e
4%6%
0%
5%
10%
Mis
pre
compre
sseq
ntott
espre
sso gcc li
dodu
c
ear
hydro
2dmdlj
dpsu
2cor
Ajou Univ. SOC Lab.MultimediaCommunications27
Integer Floating Point
Dynamic Branch Prediction• Why does prediction work?
– Underlying algorithm has regularities– Data that is being operated on has regularities– Data that is being operated on has regularities– Instruction sequence has redundancies that are artifacts of way that
humans/compilers think about problems
• Is dynamic branch prediction better than static branch prediction?– Seems to be – There are a small number of important branches in programs which have– There are a small number of important branches in programs which have
dynamic behavior
Ajou Univ. SOC Lab.MultimediaCommunications28
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address index table of 1-bit values– Says whether or not branch taken last time– No address check
• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 p, p ( giterations before exit):
– End of loop case, when it exits instead of looping as before– First time through loop on next time through code, when it predicts exit instead g p g p
of looping
Ajou Univ. SOC Lab.MultimediaCommunications29
Dynamic Branch Prediction
• Solution: 2-bit scheme where change prediction only if get mispredictiontwicetwice
T
NT
T NTPredict Taken Predict Taken
NTT
NT
Predict Not Taken
Predict Not TakenT
• Red: stop, not taken
NT
• Green: go, taken• Adds hysteresis to decision making process
Ajou Univ. SOC Lab.MultimediaCommunications30
BHT Accuracy• Mispredict because either:
– Wrong guess for that branch– Got branch history of wrong branch when index the table
• 4096 entry table:
18%
14%16%18%20%
Rat
e
5%
12%10% 9%
5%
9% 9%
8%10%12%14%
edic
tion
R
5% 5%
0% 1%
0%2%4%6%
Mis
pre
0%
eqnto
ttes
press
o gcc li
spice
dodu
csp
ice
fpppp
matrix3
00na
sa7
Ajou Univ. SOC Lab.MultimediaCommunications
e m
31
Integer Floating Point
Correlated Branch Prediction (1)• Definition: Branch prediction that uses the behavior of other branches to
make a prediction• Consider the following code sequence• Consider the following code sequence
BNEZ R1,L1 ; branch b1 (d!=0)
Assuming d is assigned to R1 register
if (d==0) d=1;if (d==1) ...;
To MIPS code DADDIU R1,R0,#1 ; d==0, so d=1
L1: DADDIU R3,R1,#-1
BNEZ R3,L2 ; branch b2 (d!=0)BNEZ R3,L2 ; branch b2 (d! 0)
…
L2: …..
[P ibl ti ]
Initial value of d d == 0 ? b1 Value of d
before b2 d ==1 ? b2
[Possible execution sequences]
0 Yes NT 1 Yes NT
1 No T 1 Yes NT
2 No T 2 No T
Ajou Univ. SOC Lab.MultimediaCommunications32
2 No T 2 No T
NT: Not Taken , T: Taken
Correlated Branch Prediction (2)• In the case of 1-bit predictor
T NT
Predict TakenPredict Not
TakenT
NT
d=? b1 prediction b1 New b1 b2 b2 New b2
[Behavior of a 1-bit predictor initialized to NT]
d=? b1 prediction action prediction prediction action prediction
2 NT(initial) T T NT(initial) T T0 T NT NT T NT NTMispredicted Mispredictediteration
Mispredicted Mispredicted
2 NT T T NT T T0 T NT NT T NT NT
Mispredicted
Mispredicted
Mispredicted
Mispredicted
Mispredicted
iteration
• As the above table shows, all the branches are mispredicted.
Ajou Univ. SOC Lab.MultimediaCommunications33
Correlated Branch Prediction (3)• Consider a predictor that uses 1 bit of correlation
– The easiest way to think is that every branch has two separate prediction bits» The first bit : the prediction bit if the last branch is not taken.» The second bit : the prediction bit if the last branch is taken.
Prediction bits Prediction if last branch is not taken Prediction if last branch is taken
NT/NT NT NT
NT/T NT T
[B h i f th 1 bit di t ith 1bit f l ti ]
NT/T NT T
T/NT T NT
T/T T T
d=? b1 prediction b1 action
New b1 prediction
b2prediction
b2 action
New b2prediction
[Behavior of the 1-bit predictor with 1bit of correlation]
2 NT/NT(initial) T T/NT NT/NT(initial) T NT/T0 T/NT NT T/NT NT/T NT NT/T2 T/NT T T/NT NT/T T NT/T
predicted
predicted
predicted
predicted
iteration
Mispredicted Mispredicted
0 T/NT NT T/NT NT/T NT NT/Tp
predicted
p ed cted
predicted
• The branches except the first iteration are correctly predicted.• Even if we had chosen different values for d, the predictor for b2 would correctly predict,
Ajou Univ. SOC Lab.MultimediaCommunications34
p y psince b2 is correlated with the previous prediction of b1.
• 1-bit predictor with 1bit of correlation is called a (1,1) predictor.
Correlated Branch Prediction (4)• In general, a (m, n) predictor means recording last m branches to select between 2m history
tables, each with n-bit counters– (1,1) predictor uses the behavior of the last one branch to choose from 2 branch predictors(i.e.
hi t t bl 21) h f hi h i 1 bit( 1) di thistory tables, 21), each of which is an 1-bit(n=1) predictor• (2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that predictionp
Branch address
4
The low-order address bits of the branch instruction address
If two branches have the same low-order address
2-bits per branch predictorbits, this doesn’t matter.
The prediction is a hint that is assumed to be correct. If the hint turns out to be wrong, the prediction bit is inverted and stored back
Prediction
prediction bit is inverted and stored back.
n • Global Branch History: m-bit shift register keeping T/NT status of last m branches.
Ajou Univ. SOC Lab.MultimediaCommunications35
2-bit global branch history
Example-(1,1) predictor• (1,1) predictor example (NT = 0, T = 1)
d=? b1 prediction b1 action New b1 prediction
b2prediction
b2 action
New b2prediction
2 NT/NT(initial) T T/NT NT/NT(initial) T NT/T
0 T/NT NT T/NT NT/T NT NT/T
2 T/NT T T/NT NT/T T NT/T
I iti ll NT I iti ll NT 01b1 addr. 01b1 addr.
0 T/NT NT T/NT NT/T NT NT/T
Predict NT for b1 branch b1 action is Taken
1 2
New Prediction is Taken
2
0
Branch address (2 bits)
addr. addr.
000 100
Initially NT Initially NT
0 0
Branch address (2 bits)
01
addr. addr.
000 100
1 0
Branch address (2 bits)
0
addr. addr.
000
001
100
101
NT T NT T NT T
0
0
0
0
001
010
011
101
110
111
0
0
0
0
001
010
011
101
110
111
1
0
0
0
001
010
011
101
110
111
Update
0
1-bit global branch history (GBH)
Initially NT 0
1-bit global branch history(GBH)
1
1-bit global branch history(GBH)
b1 action is Taken
Update
121 bit predictor
Ajou Univ. SOC Lab.MultimediaCommunications36
1 bit global branch history (GBH) b t g oba b a c sto y(G )
Predict NT•Update new prediction value into the content of 001 addr
•Update 1(taken) into GBH
Example-(1,1) predictord=? b1 prediction b1 action New b1
predictionb2
predictionb2
action New b2
prediction
2 NT/NT(initial) T T/NT NT/NT(initial) T NT/T
0 T/NT NT T/NT NT/T NT NT/T
2 T/NT T T/NT NT/T T NT/T
0 T/NT NT T/NT NT/T NT NT/T
3
4
5
11 b2 addr. 11 b2 addr.
B h dd (2 bit )
01 b1 addr.
b2 prediction is NT because the last branch b1 is T b2 action is T, so new b2 prediction is T.b1 prediction is NT because the last branch b2 is T
5
1 0
Branch address (2 bits)
addr. addr.
000
001
100
101 1 0
Branch address (2 bits)
addr. addr.
000
001
100
101 1 0
Branch address (2 bits)
addr. addr.
000
001
100
101
NT TNT T NT T
0 0010
011
110
111 0 1010
011
110
111
Update
0 1010
011
110
111
5
1
1-bit global branch history
1
1-bit global branch history
b2 action is Taken
Update
1
1-bit global branch history
Predict NT34
Ajou Univ. SOC Lab.MultimediaCommunications37
Predict NT •Update new prediction value into the content of 111 addr.•Update 1(taken) into GBH
Example-(1,1) predictord=? b1 prediction b1 action New b1
predictionb2
predictionb2
action New b2
prediction
2 NT/NT(initial) T T/NT NT/NT(initial) T NT/T
0 T/NT NT T/NT NT/T NT NT/T
2 T/NT T T/NT NT/T T NT/T
0 T/NT NT T/NT NT/T NT NT/T7 8
01 b1 addr. 11b2 addr. 11b2 addr.
b1 action is NT, so prediction is correct.
6b2 prediction is NT because the last branch b1 is NT
7
b2 action is NT, so prediction is correct.
8
1 0
Branch address (2 bits)
addr. addr.
000
001
100
101 1 0
Branch address (2 bits)
addr. addr.
000
001
100
101 1 0
Branch address (2 bits)
addr. addr.
000
001
100
101
NT TNT T NT T
1
0
0
1
001
010
011
101
110
111
1
0
0
1
001
010
011
101
110
111 0 1
001
010
011
101
110
111
Not update Not update
6
0
1-bit global branch history
0
1-bit global branch history
Prediction NT 0
1-bit global branch history
b2 action is Not Taken
b1 action is Not Taken
p Not update78
Ajou Univ. SOC Lab.MultimediaCommunications38
Example-(1,1) predictord=? b1 prediction b1 action New b1
predictionb2
predictionb2
action New b2
prediction
2 NT/NT(initial) T T/NT NT/NT(initial) T NT/T
0 T/NT NT T/NT NT/T NT NT/T
2 T/NT T T/NT NT/T T NT/T
0 T/NT NT T/NT NT/T NT NT/T10 11
01b1 addr. 11 b2 addr.
b1 action is T, so prediction is correct.
9b2 prediction is T because the last branch b1 is T
b2 action is T, so prediction is correct.
01
11
b1 addr.
1 0
Branch address (2 bits)
addr. addr.
000
001
100
101 1 0
Branch address (2 bits)
addr. addr.
000
001
100
101
NT T NT T
1 0
Branch address (2 bits)
addr. addr.
000
001
100
101
NT T
1
0
0
1
001
010
011
101
110
111 0 1
001
010
011
101
110
1119
1
0
0
1
001
010
011
101
110
11110
N t d t
0
1-bit global branch history
1
1-bit global branch history
1
1-bit global branch history
11
Prediction T b1 action is Taken
Not update
Prediction T
Ajou Univ. SOC Lab.MultimediaCommunications39
Accuracy of Different Schemes2-bit predictor can access all 4096 entries or a part of 4096 entries .
16%
18%
20%
4096 Entries 2-bit BHTUnlimited Entries 2-bit BHT
11%12%
14%
16% Unlimited Entries 2 bit BHT1024 Entries (2,2) BHT
ncy o
f
dic
tions
5%6% 6% 6%
5%6%
8%
10%
0%
Fre
quen
Mis
pre
d
0%1%
5%4%
5%
1%2%
4%
0% 0%
nasa7
matr
ix300
doducd
spic
e
fpppp
gcc
expre
sso
eqnto
tt li
tom
catv
Ajou Univ. SOC Lab.MultimediaCommunications40
4,096 entries: 2-bits per entryUnlimited entries: 2-bits/entry1,024 entries (2,2)
Tournament Predictors (1)• The main idea behind Tournament predictors is to implement several predictors and
then dynamically select one of the predictors.
• Global PredictorGlobal Predictor– If the surrounding pattern includes neighboring branches, the predictor is referred to as a global
predictor
Make use of the fact that the behavior of many branches is strongly correlated with the history of– Make use of the fact that the behavior of many branches is strongly correlated with the history of other recently taken branches.
– Keep a single shift register updated with the recent history of every branch executed, and use this value to index into a table.
A single shift register that keeps track of recent history for all branches
00…110101
12 bit
Table of4K entries
12 bits
12 bits
Ajou Univ. SOC Lab.MultimediaCommunications41
Pattern history table
Tournament Predictors (2)• Local Predictor
– If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
– Local branch history table » indexed by the low-order bits of each branch instruction’s address from Program Counter
(PC)» Each history table entry records the direction taken by the most recent n branches.Each history table entry records the direction taken by the most recent n branches.
(ex) if the branch was recently taken 10 times, the entry will be all 1s.
i=0;do{
Branch PC
Use 10 bits of branch PC toi d i t l l hi t t bl
i=i+1;} while(i==10);
Single branch is taken 10 times until i is 10 in do loop.
T bl f
index into local history tableLocal branch history table
Table of1K entries0111011001 10-bit history
indexes intonext level
Ajou Univ. SOC Lab.MultimediaCommunications42
Table of 1024 entries of 10-bithistories for a single branch
next levelPattern history table
Tournament Predictors (3)• Tournament Predictor
P1/P2 =global predictor/ local predictorP1 P2 P1 P2
S0 =00S3 =11
P1c P2c P1c-P2c
0 0 0 No change
0 1 -1 Decrement counter
1 0 1 Increment counter
S1 =01S2 =10
1 0 1 Increment counter
1 1 0 No change
P1c and P2c denote whether predictors P1 and P2 are correct respectivelycorrect, respectively.
• The tournament predictor uses 2-bit up/down saturating counter.The tournament predictor uses 2 bit up/down saturating counter.Ex) If the current tournament predictor state is S3 and P2 is correct, then the P1c
is 0 and P2c is 1. Hence, the counter is decrement by 1 and the next tournament predictor state is S2
Ajou Univ. SOC Lab.MultimediaCommunications43
predictor state is S2.
Tournament Predictors (4)
Tournament predictor using 4K 2-bit counters indexed by local branch address.
• Alpha 21264 branch predictor [2001]
Chooses between:
– Global predictor» 4K entries index by history of last 12 branches (212 = 4K)» 4K entries index by history of last 12 branches (2 4K)» Each entry is a standard 2-bit predictor
– Local predictor» Local history table: 1024 10-bit entries recording last 10 branches, index by branch addressy g y» The pattern of the last 10 occurrences of that particular branch used to index table of 1K
entries with 3-bit saturating counters
LocalPredictor
Global
MUX
TournamentPredictor
Branch PC
Predictor
Ajou Univ. SOC Lab.MultimediaCommunications44
[Alpha 21264 branch predictor logic]Table of 2-bit
saturating counters
Comparing Predictors (Fig. 2.8)
• Advantage of tournament predictor is ability to select the right predictor for a particular branch
– Particularly crucial for integer benchmarks. – A typical tournament predictor will select the global predictor almost 40% of
the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarksthe SPEC FP benchmarks
Ajou Univ. SOC Lab.MultimediaCommunications45
Pentium 4 Misprediction Rate (per 1000 instructions not per branch)(per 1000 instructions, not per branch)
13
1213
14
≈6% misprediction rate per branch SPECint (19% of INT instructions are branch)
11
10
11
12n
stru
ctio
ns
≈2% misprediction rate per branch SPECfp(5% of FP instructions are branch)
7
9
7
8
9
per
10
00
In
5
5
6
7
pre
dic
tio
ns
2
3
4
Bra
nch
mis
10 0 0
0
1
4gzip
5.vpr
6.gcc
1.mcf
crafty
pwise
swim
mgrid
applu
mesa
Ajou Univ. SOC Lab.MultimediaCommunications46
164.g
175
176
181.
186.cr
168.wupw
171.sw
172.mg
173.ap
177.m
SPECint2000 SPECfp2000
Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls the instruction fetch.Branch target calculation is costly and stalls the instruction fetch.
• BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found the corresponding Predicted PC is returned
• If the branch was predicted taken instruction fetch continues at the returned• If the branch was predicted taken, instruction fetch continues at the returned predicted PC
Ajou Univ. SOC Lab.MultimediaCommunications47
Branch Target Buffers
Ajou Univ. SOC Lab.MultimediaCommunications48
Dynamic Branch Prediction Summary
• Prediction becoming important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated with next branch– Either different branches (GA)– Or different executions of same branches (PA)
• Tournament predictors take insight to next level, by using multiple predictors – usually one based on global information and one based on local information, and y g ,
combining them with a selector– In 2006, tournament predictors using ≈ 30K bits are in processors like the Power5
and Pentium 4
• Branch Target Buffer: include branch address & prediction
Ajou Univ. SOC Lab.MultimediaCommunications49
Outline• ILP• Compiler techniques to increase ILP
Loop Unrolling• Loop Unrolling• Static Branch Prediction• Dynamic Branch Prediction• Overcoming Data Hazards with Dynamic Scheduling• Tomasulo Algorithm• SpeculationSpeculation
Ajou Univ. SOC Lab.MultimediaCommunications50
Advantages of Dynamic Scheduling
• Dynamic scheduling - hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
• It handles cases when dependences unknown at compile time – it allows the processor to tolerate unpredictable delays such as cache misses, by
ti th d hil iti f th i t lexecuting other code while waiting for the miss to resolve
• It allows code that compiled for one pipeline to run efficiently on a different pipelinepipeline
• It simplifies the compiler
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling
Ajou Univ. SOC Lab.MultimediaCommunications51
HW Schemes: Instruction Parallelism
• Key idea: Allow instructions behind stall to proceedDIVD F0 F2 F4DIVD F0,F2,F4ADDD F10,F0,F8SUBD F12,F8,F14
• Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
– In a dynamically scheduled pipeline, all instructions still pass through issue stagein order (in-order issue)
Will di ti i h h i t ti b i ti d h it l t• Will distinguish when an instruction begins execution and when it completesexecution; between 2 times, the instruction is in execution
• Note: Dynamic execution creates WAR and WAW hazards and makesexceptions harder
Ajou Univ. SOC Lab.MultimediaCommunications52
Dynamic Scheduling Step 1
• Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID) also called Instruction IssueInstruction Decode (ID), also called Instruction Issue
• Split the ID pipe stage of simple 5-stage pipeline into 2 stages:
• Issue—Decode instructions, check for structural hazards
• Read operands—Wait until no data hazards, then read operands
Ajou Univ. SOC Lab.MultimediaCommunications53
A Dynamic Algorithm: Tomasulo’s
• For IBM 360/91 (before caches!)L l t– ⇒ Long memory latency
• Goal: High Performance without special compilers
• Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operationsscheduling of operations
– This led Tomasulo to try to figure out how to get more effective registers — renaming in hardware!
• Why Study 1966 Computer?
• The descendants of this have flourished!– Alpha 21264, Pentium 4, AMD Opteron, Power 5, …
Ajou Univ. SOC Lab.MultimediaCommunications54
Tomasulo Algorithm
• Control & buffers distributed with Function Units (FU)– FU buffers called “reservation stations”; have pending operands– FU buffers called reservation stations ; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations(RS); called register renaming ; ( ); g g ;
– Renaming avoids WAR, WAW hazards– More reservation stations than registers, so can do optimizations compilers can’t
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
– Avoids RAW hazards by executing an instruction only when its operands are available
• Load and Stores treated as FUs with RSs as well
• Integer instructions can go past branches (predict taken), allowing FP ops beyond basic block in FP queue
Ajou Univ. SOC Lab.MultimediaCommunications55
Tomasulo Organization
From Mem FP RegistersFP OpQueueQueue
Load BuffersLoad1Load2
Store Buffers
Load2Load3Load4Load5Load6
Add1Add2 Mult1
Buffers
FP dd
Add3
FP lti li
Mult2
Reservation Stations
To Mem
FP adders FP multipliers
Ajou Univ. SOC Lab.MultimediaCommunications56
Common Data Bus (CDB)
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)Vj Vk V l f S dVj, Vk: Value of Source operands
– Store buffers has V field, result to be storedQj, Qk: Reservation stations producing source registers (value to be written)
– Note: Qj,Qk=0 => ready– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busyBusy: Indicates reservation station or FU is busy
Register result status—Indicates which functional unit will write each register, if one exists Blank when no pending instructions that will write that registerone exists. Blank when no pending instructions that will write that register.
Ajou Univ. SOC Lab.MultimediaCommunications57
Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op QueueIf ti t ti f ( t t l h d)If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execute—operate on operands (EX)When both operands ready then execute;When both operands ready then execute;if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)Write on Common Data Bus to all awaiting units;Write on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)• Common data bus: data + source (“come from” bus)Common data bus: data source ( come from bus)
– 64 bits of data + 4 bits of Functional Unit source address– Write if matches expected Functional Unit (produces result)– Does the broadcast
• Example speed: 3 clocks for Fl .pt. +,-; 10 for * ; 40 clks for /
Ajou Univ. SOC Lab.MultimediaCommunications58
Tomasulo Example
Instruction status: Exec Write
Instruction stream
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
R i S i
3 Load/Buffers
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk
Add1 NoAdd2 NoFU count 3 FP Adder R SAdd3 NoMult1 NoMult2 No
FU countdown
3 FP Adder R.S.2 FP Mult R.S.
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Ajou Univ. SOC Lab.MultimediaCommunications59
Clock cycle counter
Tomasulo Example Cycle 1Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6VADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 NoMult1 NoMult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1
Ajou Univ. SOC Lab.MultimediaCommunications60
Tomasulo Example Cycle 2Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6VADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 NoMult1 NoMult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1
N t C h lti l l d t t di
Ajou Univ. SOC Lab.MultimediaCommunications61
Note: Can have multiple loads outstanding
Tomasulo Example Cycle 3Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6VADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult1 Yes MULTD R(F4) Load2Mult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
• Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued
Ajou Univ. SOC Lab.MultimediaCommunications62
Stations; MULT issued• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6VADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult1 Yes MULTD R(F4) Load2Mult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1
• Load2 completing; what is waiting for Load2?
Ajou Univ. SOC Lab.MultimediaCommunications63
Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5VADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No
10 Mult1 Yes MULTD M(A2) R(F4)10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2
• Timer starts down for Add1 Mult1
Ajou Univ. SOC Lab.MultimediaCommunications64
• Timer starts down for Add1, Mult1
Tomasulo Example Cycle 6Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5VADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2
Ajou Univ. SOC Lab.MultimediaCommunications65
• Issue ADDD here despite name dependency on F6?
Tomasulo Example Cycle 7Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5VADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2
Ajou Univ. SOC Lab.MultimediaCommunications66
• Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 8Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 No2 Add2 Yes ADDD (M-M) M(A2)
Add3 No7 Mult1 Yes MULTD M(A2) R(F4)7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 (M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications67
Tomasulo Example Cycle 9Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 No1 Add2 Yes ADDD (M-M) M(A2)
Add3 No6 Mult1 Yes MULTD M(A2) R(F4)6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 (M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications68
Tomasulo Example Cycle 10Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 No0 Add2 Yes ADDD (M-M) M(A2)
Add3 No5 Mult1 Yes MULTD M(A2) R(F4)5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 (M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications69
• Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 No
4 Mult1 Yes MULTD M(A2) R(F4)4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) (M-M+M(M-M) Mult2
• Write result of ADDD here?
Ajou Univ. SOC Lab.MultimediaCommunications70
Write result of ADDD here?• All quick instructions complete in this cycle!
Tomasulo Example Cycle 12Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 No
3 Mult1 Yes MULTD M(A2) R(F4)3 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications71
Tomasulo Example Cycle 13Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 No
2 Mult1 Yes MULTD M(A2) R(F4)2 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications72
Tomasulo Example Cycle 14Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 No
1 Mult1 Yes MULTD M(A2) R(F4)1 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications73
Tomasulo Example Cycle 15Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 No
0 Mult1 Yes MULTD M(A2) R(F4)0 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications74
• Mult1 (MULTD) completing; what is waiting for it?
Tomasulo Example Cycle 16Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 NoMult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications75
• Just waiting for Mult2 (DIVD) to complete
Faster than light computation(skip a couple of cycles)
Ajou Univ. SOC Lab.MultimediaCommunications76
Tomasulo Example Cycle 55Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 NoMult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12Clock F0 F2 F4 F6 F8 F10 F12 ...
55 FU M*F4 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications77
Tomasulo Example Cycle 56Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56VADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 NoMult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) (M-M+M(M-M) Mult2
Ajou Univ. SOC Lab.MultimediaCommunications78
• Mult2 (DIVD) is completing; what is waiting for it?
Tomasulo Example Cycle 57Instruction status: Exec Write
Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57V 7ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qkime Name y p j Qj Q
Add1 NoAdd2 NoAdd3 NoMult1 NoMult1 NoMult2 Yes DIVD M*F4 M(A1)
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 F30Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) (M-M+M(M-M) Result
• Once again: In-order issue, out-of-order execution and out-
Ajou Univ. SOC Lab.MultimediaCommunications79
Once again: In order issue, out of order execution and outof-order completion.
Why can Tomasulo overlap iterations of loops?
• Register renaming– Multiple iterations use different physical destinations for registers
(dynamic loop unrolling).
• Reservation stations – Permit instruction issue to advance past integer control flow operations– Also buffer old values of registers - totally avoiding the WAR stall
• Other perspective: Tomasulo building data flow dependency graph on the fly
Ajou Univ. SOC Lab.MultimediaCommunications80
Tomasulo’s scheme offers 2 major advantages
1. Distribution of the hazard detection logic– distributed reservation stations and the CDB– distributed reservation stations and the CDB– If multiple instructions waiting on single result, & each instruction has
other operand, then instructions can be released simultaneously by broadcast on CDB
– If a centralized register file were used, the units would have to read theirresults from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards
Ajou Univ. SOC Lab.MultimediaCommunications81
Tomasulo Drawbacks
• Complexity– delays of 360/91, MIPS 10000, Alpha 21264,delays of 360/91, MIPS 10000, Alpha 21264,
IBM PPC 620 in CA:AQA 2/e, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus– Each CDB must go to multiple functional units
hi h it hi h i i d it⇒high capacitance, high wiring density– Number of functional units that can complete per cycle limited to one!
» Multiple CDBs ⇒ more FU logic for parallel assoc stores
• Non-precise interrupts!– We will address this later
Ajou Univ. SOC Lab.MultimediaCommunications82
Outline• ILP• Compiler techniques to increase ILP
Loop Unrolling• Loop Unrolling• Static Branch Prediction• Dynamic Branch Prediction• Overcoming Data Hazards with Dynamic Scheduling• Tomasulo Algorithm• SpeculationSpeculation
Ajou Univ. SOC Lab.MultimediaCommunications83
Speculation to greater ILP
• Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correctg p g g– Speculation ⇒ fetch, issue, and execute instructions as if
branch predictions were always correct Dynamic scheduling ⇒ only fetches and issues instructions– Dynamic scheduling ⇒ only fetches and issues instructions
• Essentially a data flow execution model: Operations execute as soon as ytheir operands are available
Ajou Univ. SOC Lab.MultimediaCommunications84
Speculation to greater ILP
3 components of HW-based speculation:
1. Dynamic branch prediction to choose which instructions to execute
2. Speculation to allow execution of instructions before control dependences are resolved + ability to undo effects of incorrectly speculated sequence+ ability to undo effects of incorrectly speculated sequence
3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks
Ajou Univ. SOC Lab.MultimediaCommunications85
Adding Speculation to Tomasulo
• Must separate execution from allowing instruction to finish or “commit”
• This additional step called instruction commit
• When an instruction is no longer speculative, allow it to update the register file or memory
• Requires additional set of buffers to hold results of instructions that have finished execution but have not committed
• This reorder buffer (ROB) is also used to pass results among instructions that may be speculated
Ajou Univ. SOC Lab.MultimediaCommunications86
Reorder Buffer (ROB)• In non-speculative Tomasulo’s algorithm, once an instruction writes its
result, any subsequently issued instructions will find result in the register file
• With speculation, the register file is not updated until the instruction commits
– (we know definitively that the instruction should execute)
• Thus the ROB supplies operands in interval between completion of• Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit
– ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo’s algorithm( ) p p g
– ROB extends architectured registers like RS
Ajou Univ. SOC Lab.MultimediaCommunications87
Reorder Buffer Entry FieldsEach entry in the ROB contains four fields:
1. Instruction type • a branch (has no destination result), a store (has a memory address
destination), or a register operation (ALU operation or load, which has register destinations)
2. Destination• Register number (for loads and ALU operations) or
memory address (for stores)memory address (for stores) where the instruction result should be written
3. Value3. Value• Value of instruction result until the instruction commits
4 Ready4. Ready• Indicates that instruction has completed execution, and the value is ready
Ajou Univ. SOC Lab.MultimediaCommunications88
Reorder Buffer operation• Holds instructions in FIFO order, exactly as issued• When instructions complete, results placed into ROB
– Supplies operands to other instruction between execution complete & commit ⇒ more registers like RS
– Tag results with ROB buffer number instead of reservation station• Instructions commit ⇒values at head of ROB placed in registers (or memory
locations)• As a result, easy to undo speculated instructions
i di d b h ion mispredicted branches or on exceptionsReorderBufferFP
OpQueue FP Regs
Commit path
FP Adder FP AdderRes Stations Res Stations
Commit path
Ajou Univ. SOC Lab.MultimediaCommunications89
FP Adder FP Adder
Recall: 4 Steps of Speculative Tomasulo Algorithm
1. Issue—get instruction from FP Op QueueIf ti t ti d d b ff l t f i i t & d d &If reservation station and reorder buffer slot free, issue instr & send operands &reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution—operate on operands (EX)2. Execution operate on operands (EX)When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB)Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit—update register with reorder resultWhen instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes y) preorder buffer (sometimes called “graduation”)
Ajou Univ. SOC Lab.MultimediaCommunications90
Tomasulo With Reorder Buffer - Cycle 1LD F6 34+R2
Issue Exec Write Commit1LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 N
ADDD F6 F8 F2
0 Add3 No0 Mult1 No0 Mult2 No
Busy AddressEntry Busy Instruction State Destination Value Load1 Yes 34+Regs[R2]
1 Yes LD F6, 34(R2) Issue F6 Load22 Load33456 Reorder Buffer78910
F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #1Busy no no no Yes no no no no
91
Tomasulo With Reorder Buffer - Cycle 2LD F6 34+R2
Issue Exec Write Commit1 2LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 2
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 N
ADDD F6 F8 F2
0 Add3 No0 Mult1 No0 Mult2 No
Busy AddressEntry Busy Instruction State Destination Value Load1 Yes 34+Regs[R2]
1 Yes LD F6, 34(R2) Ex1 F6 Load2 Yes 45+Regs[R3]2 Yes LD F2, 45(R3) Issue F2 Load33456 Reorder Buffer78910
F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #2 #1Busy no Yes no Yes no no no no
92
Tomasulo With Reorder Buffer - Cycle 3LD F6 34+R2
Issue Exec Write Commit1 2 3LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 32 33
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 N
ADDD F6 F8 F2
0 Add3 No0 Mult1 Yes Mult Regs[F4] #2 #30 Mult2 No
Busy AddressEntry Busy Instruction State Destination Value Load1 No
1 Yes LD F6, 34(R2) write F6 Mem[load1] Load2 Yes 45+Regs[R3]2 Yes LD F2, 45(R3) Ex1 F2 Load33 Yes MULT F0, F2, F4 Issue F0456 Reorder Buffer78910
F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #2 #1Busy Yes Yes no Yes no no no no
93
Tomasulo With Reorder Buffer - Cycle 4LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 4 2 3 43 44
Time Name Busy Op Vj Vk Qj Qk Dest2 Add1 Yes SUB Regs[F6] Mem[45+Regs[R3]] #4 Reservation0 Add2 No Stations0 Add3 No
ADDD F6 F8 F2
0 Add3 No10 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 No
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 Yes LD F2, 45(R3) write F2 Mem[load2] Load33 Yes MULT F0, F2, F4 EX1 F04 Yes SUBD F8, F6, F2 Issue F856 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #2 #4Busy Yes Yes no no Yes no no no
94
Tomasulo With Reorder Buffer - Cycle 5LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 55
Time Name Busy Op Vj Vk Qj Qk Dest1 Add1 Yes SUB Regs[F6] Mem[45+Regs[R3]] #4 Reservation0 Add2 No Stations0 Add3 N
ADDD F6 F8 F2
0 Add3 No9 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex2 F04 Yes SUBD F8, F6, F2 Ex1 F85 Yes DIVD F10, F0, F6 Issue F106 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #4 #5Busy Yes no no no Yes Yes no no
95
Tomasulo With Reorder Buffer - Cycle 6LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 556
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 Yes SUB Regs[F6] Mem[45+Regs[R3]] #4 Reservation0 Add2 Yes Add Regs[F2] #4 #6 Stations0 Add3 N
ADDD F6 F8 F2 6
0 Add3 No8 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex3 F04 Yes SUBD F8, F6, F2 Ex2 F85 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 Issue F6 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
96
Tomasulo With Reorder Buffer - Cycle 7LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 5 756 7
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation2 Add2 Yes Add #4 Regs[F2] #6 Stations0 Add3 N
ADDD F6 F8 F2 6 7
0 Add3 No7 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex4 F04 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 EX1 F6 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
97
Tomasulo With Reorder Buffer - Cycle 8LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 5 756 7
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation1 Add2 Yes Add #4 Regs[F2] #6 Stations0 Add3 N
ADDD F6 F8 F2 6 7
0 Add3 No6 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex5 F04 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 Ex2 F6 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
98
Tomasulo With Reorder Buffer - Cycle 9LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 5 756 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 Yes Add #4 Regs[F2] #6 Stations0 Add3 No
ADDD F6 F8 F2 6 7 9
0 Add3 No5 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex6 F04 Y SUBD F8 F6 F2 it F8 F6 #24 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
99
Tomasulo With Reorder Buffer - Cycle 10LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 5 756 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 N
ADDD F6 F8 F2 6 7 9
0 Add3 No4 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex7 F04 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
100
Tomasulo With Reorder Buffer - Cycle 11LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 5 756 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 No
ADDD F6 F8 F2 6 7 9
0 Add3 No3 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex8 F04 Y SUBD F8 F6 F2 it F8 F6 #24 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
101
Tomasulo With Reorder Buffer - Cycle 12LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 44 5 756 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 No
ADDD F6 F8 F2 6 7 9
0 Add3 No2 Mult1 Yes Mult Mem[45+Regs[R3]] Regs[F4] #30 Mult2 Yes DIV Regs[F6] #3 #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 Ex9 F04 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Issue F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer789
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
102
Tomasulo With Reorder Buffer - Cycle 13LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 4 13 4 5 75 136 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 No
ADDD F6 F8 F2 6 7 9
0 Add3 No0 Mult1 No0 Mult2 Yes DIV #2xRegs[F4] Regs[F6] #5
Busy AddressEntry Busy Instruction State Destination Value Load1 NoEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 Yes MULT F0, F2, F4 write F0 #2 x Regs[F4]4 Yes SUBD F8 F6 F2 write F8 F6 #24 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Ex1 F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer7889
10F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #3 #6 #4 #5Busy Yes no no Yes Yes Yes no no
103
Tomasulo With Reorder Buffer - Cycle 14LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 4 13 144 5 75 136 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 No
ADDD F6 F8 F2 6 7 9
0 Add3 No0 Mult1 No0 Mult2 Yes DIV #2xRegs[F4] Regs[F6] #5
Busy AddressEntry Busy Instruction State Destination Value Load1 NoEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 No MULT F0, F2, F4 commit F0 #2 x Regs[F4]4 Y SUBD F8 F6 F2 it F8 F6 #24 Yes SUBD F8, F6, F2 write F8 F6 - #25 Yes DIVD F10, F0, F6 Ex2 F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer788910
F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #6 #4 #5Busy No no no Yes Yes Yes no no
104
Tomasulo With Reorder Buffer - Cycle 15LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 4 13 144 5 7 155 136 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations0 Add3 No
ADDD F6 F8 F2 6 7 9
0 Add3 No0 Mult1 No0 Mult2 Yes DIV #2xRegs[F4] Regs[F6] #5
Busy AddressE t B I t ti St t D ti ti V l L d1 NEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 No MULT F0, F2, F4 commit F0 #2 x Regs[F4]4 N SUBD F8 F6 F2 it F8 F6 #24 No SUBD F8, F6, F2 commit F8 F6 - #25 Yes DIVD F10, F0, F6 Ex3 F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer78910
F0 F2 F4 F6 F8 F10 F12 ... F30
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #6 #5Busy no no no Yes no Yes no no
105
Tomasulo With Reorder Buffer - Cycle 16LD F6 34+R2
Issue Exec Write Commit1 2 3 4LD F6 34+R2
LD F2 45+R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
1 2 3 42 3 4 53 4 13 144 5 7 155 136 7 9
Time Name Busy Op Vj Vk Qj Qk Dest0 Add1 No Reservation0 Add2 No Stations
ADDD F6 F8 F2 6 7 9
0 Add3 No0 Mult1 No0 Mult2 Yes DIV #2xRegs[F4] Regs[F6] #5
Busy AddressEntry Busy Instruction State Destination Value Load1 No
1 No LD F6, 34(R2) commit F6 Mem[load1] Load2 No2 No LD F2, 45(R3) commit F2 Mem[load2] Load33 No MULT F0, F2, F4 commit F0 #2 x Regs[F4]4 No SUBD F8, F6, F2 commit F8 F6 - #25 Yes DIVD F10, F0, F6 Ex4 F106 Yes ADDD F6, F8, F2 write F6 #4 + F2 Reorder Buffer7 Need 36 more EX
cycles for DIV to8910
F0 F2 F4 F6 F8 F10 F12 ... F30
cycles for DIV to finish…
Ajou Univ. SOC Lab.MultimediaCommunications
Reorder # #6 #5Busy no no no Yes no Yes no no
106
Avoiding Memory Hazards• WAW and WAR hazards through memory are eliminated with
speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pendingstill be pending
• RAW hazards through memory are maintained by two restrictions: 1 not allo ing a load to initiate the second step of its e ec tion if an acti e1. not allowing a load to initiate the second step of its execution if any active
ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
2. maintaining the program order for the computation of an effective address of a load with respect to all earlier storesof a load with respect to all earlier stores.
• these restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memorylocation written to by an earlier store cannot perform the memory access until the store has written the data
Ajou Univ. SOC Lab.MultimediaCommunications107
Exceptions and Interrupts• IBM 360/91 invented “imprecise interrupts”
– Computer stopped at this PC; its likely close to this address– Not so popular with programmersp p p g– Also, what about Virtual Memory? (Not in IBM 360)
• Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
– If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly
– This is exactly same as need to do with precise exceptionsThis is exactly same as need to do with precise exceptions
• Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROBy
– If a speculated instruction raises an exception, the exception is recorded in the ROB
– This is why reorder buffers in all new processors
Ajou Univ. SOC Lab.MultimediaCommunications108
Getting CPI below 1
• CPI ≥ 1 if issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors: 1. statically-scheduled superscalar processors,2 dynamically scheduled superscalar processors and2. dynamically-scheduled superscalar processors, and 3. VLIW (very long instruction word) processors
• 2 types of superscalar processors issue varying numbers of instructions per clock – use in-order execution if they are statically scheduled, or – out-of-order execution if they are dynamically scheduled
• VLIW processors, in contrast, issue a fixed number of instructions p , ,formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)
Ajou Univ. SOC Lab.MultimediaCommunications109
VLIW: Very Large Instruction Word
• Each “instruction” has explicit coding for multiple operations– In IA-64, grouping called a “packet”– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding– The long instruction word has room for many operations– By definition, all the operations the compiler puts in the long instruction word are y , p p p g
independent => execute in parallel– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide– Need compiling technique that schedules across several branches
Ajou Univ. SOC Lab.MultimediaCommunications110
Recall: Unrolled Loop that Minimizes Stalls for Scalar
1 Loop: L.D F0,0(R1)2 L D F6 8(R1)
L.D to ADD.D: 1 Cycle2 L.D F6,-8(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD D F4 F0 F2
ADD.D to S.D: 2 Cycles
5 ADD.D F4,F0,F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD D F16 F14 F28 ADD.D F16,F14,F29 S.D 0(R1),F410 S.D -8(R1),F811 S D -16(R1),F1211 S.D 16(R1),F1212 DSUBUI R1,R1,#3213 BNEZ R1,LOOP14 S.D 8(R1),F16 ; 8-32 = -24( ), ;
14 clock cycles, or 3.5 per iteration
Ajou Univ. SOC Lab.MultimediaCommunications111
Loop Unrolling in VLIW
Memory Memory FP FP Int op/ ClockMemory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branchL.D F0,0(R1) L.D F6,-8(R1) 1L.D F10,-16(R1) L.D F14,-24(R1) 2L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4
ADD.D F20,F18,F2 ADD.D F24,F22,F2 5S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6S.D -16(R1),F12 S.D -24(R1),F16 7S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8S.D -0(R1),F28 BNEZ R1,LOOP 9
Unrolled 7 times to avoid delaysy7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)Average: 2.5 ops per clock, 50% efficiency
Ajou Univ. SOC Lab.MultimediaCommunications
Note: Need more registers in VLIW (15 vs. 6 in SS)
112
Problems with 1st Generation VLIW
• Increase in code sizeti h ti i t i ht li d f t i biti l– generating enough operations in a straight-line code fragment requires ambitiously
unrolling loops– whenever VLIW instructions are not full, unused functional units translate to wasted
bits in instruction encodingg
• Operated in lock-step; no hazard detection HW– a stall in any functional unit pipeline caused entire processor to stall since all– a stall in any functional unit pipeline caused entire processor to stall, since all
functional units must be kept synchronized– Compiler might predict function units, but caches hard to predict
• Binary code compatibility– Pure VLIW => different numbers of functional units and unit latencies require
different versions of the code
Ajou Univ. SOC Lab.MultimediaCommunications113
Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”Instruction Computer (EPIC)• IA-64: instruction set architecture• 128 64-bit integer regs + 128 82-bit floating point regs• 128 64-bit integer regs + 128 82-bit floating point regs
– Not separate register files per functional unit as in old VLIW• Hardware checks dependencies
(interlocks => binary compatibility over time)(interlocks => binary compatibility over time)• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mispredictions?It i ™ fi t i l t ti (2001)• Itanium™ was first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800Mhz– 6-wide, 10-stage pipeline at 800Mhz on 0.18 µ process
• Itanium 2™ is name of 2nd implementation (2005)– 6-wide, 8-stage pipeline at 1666Mhz on 0.13 µ process– Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3, , , ,
Ajou Univ. SOC Lab.MultimediaCommunications114
Increasing Instruction Fetch Bandwidth
• Predicts next instruct address, sends it out
Branch Target Buffer (BTB)
,before decoding instructuction
• PC of branch sent to BTB• When match is found,
Predicted PC is returned• If branch predicted taken, p ,
instruction fetch continues at Predicted PC
Ajou Univ. SOC Lab.MultimediaCommunications115
IF BW: Return Address Predictor
• Small buffer of return addresses acts as a stack• Caches most recent return addressesCaches most recent return addresses• Call ⇒ Push a return address
on stack• Return ⇒ Pop an address off stack & predict as new PC• Return ⇒ Pop an address off stack & predict as new PC
60%
70%
ncy
gom88ksim
40%
50%
60%
n f
requen
cc1compressxlisp
20%
30%
pre
dic
tion
ijpegperlvortex
0%
10%
0 1 2 4 8 16
Mis
p
Ajou Univ. SOC Lab.MultimediaCommunications
0 1 2 4 8 16Return address buffer entries
116
More Instruction Fetch Bandwidth
• Integrated branch prediction branch predictor is part of instruction fetch unit and is constantly predicting branchesy p g
• Instruction prefetch Instruction fetch units prefetch to deliver multiple instruct. per clock, integrating it with branch prediction
• Instruction memory access and buffering Fetching multiple instructionsInstruction memory access and buffering Fetching multiple instructions per cycle:
– May require accessing multiple cache blocks (prefetch to hide cost of crossing cache blocks)
– Provides buffering, acting as on-demand unit to provide instructions to issue stage as needed and in quantity needed
Ajou Univ. SOC Lab.MultimediaCommunications117
Speculation: Register Renaming vs. ROB
• Alternative to ROB is a larger physical set of registers combined with register renamingg
– Extended registers replace function of both ROB and reservation stations
• Instruction issue maps names of architectural registers to physical registerInstruction issue maps names of architectural registers to physical register numbers in extended register set
– On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards)
– Speculation recovery easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits
• Most Out-of-Order processors today use extended registers with renaming
Ajou Univ. SOC Lab.MultimediaCommunications118
Value Prediction
• Attempts to predict value produced by instruction– E.g., Loads a value that changes infrequentlyE.g., Loads a value that changes infrequently
• Value prediction is useful only if it significantly increases ILPFocus of research has been on loads; so so results no processor uses value– Focus of research has been on loads; so-so results, no processor uses value prediction
• Related topic is address aliasing prediction• Related topic is address aliasing prediction– RAW for load and store or WAW for 2 stores
Address alias prediction is both more stable and simpler since need not• Address alias prediction is both more stable and simpler since need not actually predict the address values, only whether such values conflict
– Has been used by a few processors
Ajou Univ. SOC Lab.MultimediaCommunications119
(Mis) Speculation on Pentium 4% f i t d
43% 45%
• % of micro-ops not used
39%40%
45%
24% 24%30%35%
20%20%
25%
3%1% 1% 0%5%
10%15%
1% 1% 0%0%
5%
I t Fl ti P i t
Ajou Univ. SOC Lab.MultimediaCommunications120
Integer Floating Point
Perspective
• Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model
• Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice
• Conservative in ideas, just faster clock and bigger
• Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995announced in 1995
– Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units⇒ performance 8 to 16X
• Peak v. delivered performance gap increasing
Ajou Univ. SOC Lab.MultimediaCommunications121