Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer...
-
Upload
lora-porter -
Category
Documents
-
view
219 -
download
4
Transcript of Chapter 3 Instruction Level Parallelism 2 Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2014 Computer...
Chapter 3Instruction Level Parallelism
2Dr. Eng. Amr T. Abdel-Hamid
Elect 707
Spring 2014
Co
mp
uter A
pp
lication
s
Text book slides: Computer Architecture: A Quantitative Approach 4th Edition, John L. Hennessy & David A. Patterso with modifications.
Dr. A
mr T
alaat
Elect 707
Control Hazard on BranchesThree Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch
Dr. A
mr T
alaat
Elect 707
Solving Branch Problems
Prediction Static: Software Dynamic: hardware
Speculation
Dr. A
mr T
alaat
Elect 707
Static Branch Alternatives#1: Stall until branch direction is clear#2: Predict Branch Not Taken
Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penaltyOther solutions: branch target known before outcome
Dr. A
mr T
alaat
Elect 707
Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction) Branch History Table: Lower bits of PC address index
table of 1-bit values Says whether or not branch taken last time No address check
Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit): End of loop case, when it exits instead of looping as before First time through loop on next time through code, when it
predicts exit instead of looping
Dr. A
mr T
alaat
Elect 707
Branch History Table
T
Predict TakenPredict not Taken
1 0T
NT
1-bit prediction
NT
Feedback
01
BHTbranch PC
Dr. A
mr T
alaat
Elect 707
Solution: 2-bit scheme where change prediction only if get misprediction twice
Adds hysteresis to decision making process
Dynamic Branch Prediction
T
T NT
NT
Predict Taken
Predict Not Taken
Predict Taken
Predict Not TakenT
NTT
NT
Dr. A
mr T
alaat
Elect 707
18%
5%
12%10%
9%
5%
9% 9%
0%1%
0%2%4%6%8%
10%12%14%16%18%20%
eqnt
ott
espr
esso gc
c li
spice
dodu
c
spice
fppp
p
mat
rix30
0
nasa
7
Mis
pre
dic
tio
n R
ate
BHT Accuracy Mispredict because either:
Wrong guess for that branch Got branch history of wrong branch when index the
table 4096 entry table:
Dr. A
mr T
alaat
Elect 707 9
Correlating Branches
Code example showing the potential
Assemble code
Observation: if BNEZ1 and BNEZ2 is not taken, then BNEZ3 is taken
Dr. A
mr T
alaat
Elect 707 10
Correlating Branch Predictor
Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
Then behavior of recent branches selects between, say, 2 predictions of next branch, updating just that prediction
(1,1) predictor: 1-bit global, 1-bit local
Branch address (4 bits)
1-bits per branch local predictors
PredictionPrediction
1-bit global branch history(0 = not taken)
Dr. A
mr T
alaat
Elect 707
Correlated Branch Prediction
Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
In general, (m,n) predictor means record last m branches to select between 2m history tables, each with n-bit counters Thus, old 2-bit BHT is a (0,2) predictor
Global Branch History: m-bit shift register keeping T/NT status of last m branches.
Each entry in table has m n-bit predictors.
Dr. A
mr T
alaat
Elect 707
Correlating Branches(2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction
Branch address
2-bits per branch predictor
Prediction
2-bit global branch history
4
Dr. A
mr T
alaat
Elect 707
0%
Freq
uenc
y of
Mis
pred
icti
ons
0%1%
5%6% 6%
11%
4%
6%5%
1%2%
4%
6%
8%
10%
12%
14%
16%
18%
20%
4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)
Accuracy of Different Schemes
4096 Entries 2-bit BHTUnlimited Entries 2-bit BHT1024 Entries (2,2) BHT
nasa
7
mat
rix3
00
dodu
cd
spic
e
fppp
p
gcc
expr
esso
eqnt
ott li
tom
catv
Dr. A
mr T
alaat
Elect 707
Loop Example
Loop: LD F0 0 R1 SD F4 0 R1 SUBI R1 R1 #8 MULTD F4 F0 F2
BNEZ F4 Loop
very long wait for branch?
Dr. A
mr T
alaat
Elect 707
spec·u·la·tion spec·u·la·tion : [spek-yuh-ley-shuhn] Show IPA1.
the contemplation or consideration of some subject: toengage in speculation on humanity's ultimate destiny.
2. a single instance or process of consideration.3. a conclusion or opinion reached by such contemplation:
These speculations are impossible to verify.4.
conjectural consideration of a matter; conjecture or surmise:a report based on speculation rather than facts.
5. engagement in business transactions involving considerable risk but offering the chance of large gains, especially trading in commodities, stocks, etc., in the hope of profit from changes in the market price
Dr. A
mr T
alaat
Elect 707
Speculation to greater ILP
Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct Dynamic scheduling only fetches and issues
instructions Speculation fetch, issue, and execute instructions
as if branch predictions were always correct Essentially a data flow execution model: Operations
execute as soon as their operands are available
Dr. A
mr T
alaat
Elect 707
Speculation to greater ILP3 components of HW-based speculation:
1.Dynamic branch prediction to choose which instructions to execute 2.Speculation to allow execution of instructions before control dependences are resolved + ability to undo effects of incorrectly speculated sequence 3.Dynamic scheduling to deal with scheduling of different combinations of basic blocks
Dr. A
mr T
alaat
Elect 707
Adding Speculation to Tomasulo
Must separate execution from allowing instruction to finish or “commit”
This additional step called instruction commit When an instruction is no longer speculative,
allow it to update the register file or memory Requires additional set of buffers to hold results
of instructions that have finished execution but have not committed
This reorder buffer (ROB) is also used to pass results among instructions that may be speculated
Dr. A
mr T
alaat
Elect 707
Reorder Buffer (ROB)
In Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file
With speculation, the register file is not updated until the instruction commits (we know definitively that the instruction should execute)
Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit ROB is a source of operands for instructions, just as
reservation stations (RS) provide operands in Tomasulo’s algorithm
ROB extends architecture registers like RS
Dr. A
mr T
alaat
Elect 707
Reorder Buffer Entry
Each entry in the ROB contains four fields: 1. Instruction type
• a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)
2. Destination• Register number (for loads and ALU operations) or
memory address (for stores) where the instruction result should be written
3. Value• Value of instruction result until the instruction commits
4. Ready• Indicates that instruction has completed execution, and
the value is ready
Dr. A
mr T
alaat
Elect 707
Reorder Buffer operation
Holds instructions in FIFO order, exactly as issued When instructions complete, results placed into ROB
Supplies operands to other instruction between execution complete & commit more registers like RS
Tag results with ROB buffer number instead of reservation station
Instructions commit values at head of ROB placed in registers
As a result, easy to undo speculated instructions on mispredicted branches or on exceptions
ReorderBufferFP
OpQueue
FP Adder FP AdderRes Stations Res Stations
FP Regs
Commit path
Dr. A
mr T
alaat
Elect 707
4 Steps of Speculative Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue
instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution—operate on operands (EX) When both operands ready then execute; if not ready,
watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
Dr. A
mr T
alaat
Elect 707
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1F0F0 LD F0,10(R2)LD F0,10(R2) NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
Dr. A
mr T
alaat
Elect 707
2 ADDD R(F4),ROB12 ADDD R(F4),ROB1
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
F10F10
F0F0ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)NN
NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
Dr. A
mr T
alaat
Elect 707
3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
F2F2
F10F10
F0F0
DIVD F2,F10,F6DIVD F2,F10,F6
ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)
NN
NN
NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
Dr. A
mr T
alaat
Elect 707
3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
F0F0 ADDD F0,F4,F6ADDD F0,F4,F6 NN
F4F4 LD F4,0(R3)LD F4,0(R3) NN
---- BNE F2,<…>BNE F2,<…> NN
F2F2
F10F10
F0F0
DIVD F2,F10,F6DIVD F2,F10,F6
ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)
NN
NN
NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
5 0+R35 0+R3
Dr. A
mr T
alaat
Elect 707
3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD ROB5, R(F6)6 ADDD ROB5, R(F6)
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
----
F0F0ROB5ROB5
ST 0(R3),F4ST 0(R3),F4
ADDD F0,F4,F6ADDD F0,F4,F6NN
NN
F4F4 LD F4,0(R3)LD F4,0(R3) NN
---- BNE F2,<…>BNE F2,<…> NN
F2F2
F10F10
F0F0
DIVD F2,F10,F6DIVD F2,F10,F6
ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)
NN
NN
NN
Done?
DestDest
Oldest
Newest
from Memory
Dest
Reorder Buffer
Registers
1 10+R21 10+R25 0+R35 0+R3
Dr. A
mr T
alaat
Elect 707
3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
----
F0F0M[10]M[10]
ST 0(R3),F4ST 0(R3),F4
ADDD F0,F4,F6ADDD F0,F4,F6YY
NN
F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY
---- BNE F2,<…>BNE F2,<…> NN
F2F2
F10F10
F0F0
DIVD F2,F10,F6DIVD F2,F10,F6
ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)
NN
NN
NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
2 ADDD R(F4),ROB12 ADDD R(F4),ROB16 ADDD M[10],R(F6)6 ADDD M[10],R(F6)
Dr. A
mr T
alaat
Elect 707
3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
----
F0F0M[10]M[10]
<val2><val2>ST 0(R3),F4ST 0(R3),F4
ADDD F0,F4,F6ADDD F0,F4,F6YY
YY
F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY
---- BNE F2,<…>BNE F2,<…> NN
F2F2
F10F10
F0F0
DIVD F2,F10,F6DIVD F2,F10,F6
ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)
NN
NN
NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
Dr. A
mr T
alaat
Elect 707
----
F0F0M[10]M[10]
<val2><val2>ST 0(R3),F4ST 0(R3),F4
ADDD F0,F4,F6ADDD F0,F4,F6YY
YY
F4F4 M[10]M[10] LD F4,0(R3)LD F4,0(R3) YY
---- BNE F2,<…>BNE F2,<…> NN
3 DIVD ROB2,R(F6)3 DIVD ROB2,R(F6)2 ADDD R(F4),ROB12 ADDD R(F4),ROB1
Tomasulo With Reorder buffer:
ToMemory
FP addersFP adders FP multipliersFP multipliers
Reservation Stations
FP OpQueue
ROB7
ROB6
ROB5
ROB4
ROB3
ROB2
ROB1
F2F2
F10F10
F0F0
DIVD F2,F10,F6DIVD F2,F10,F6
ADDD F10,F4,F0ADDD F10,F4,F0
LD F0,10(R2)LD F0,10(R2)
NN
NN
NN
Done?
DestDest
Oldest
Newest
from Memory
1 10+R21 10+R2Dest
Reorder Buffer
Registers
Dr. A
mr T
alaat
Elect 707
Getting CPI below 1
CPI ≥ 1 if issue only 1 instruction every clock cycle Multiple-issue processors come in 3 flavors:
1. statically-scheduled superscalar processors,2. dynamically-scheduled superscalar processors, and 3. VLIW (very long instruction word) processors
2 types of superscalar processors issue varying numbers of instructions per clock use in-order execution if they are statically scheduled, or out-of-order execution if they are dynamically scheduled
VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)
Dr. A
mr T
alaat
Elect 707
Reading Assignment: Intel Core i7 prediction scheme
Sections 3.3 & 3.6