Instruction Level Parallelism
• Pipelining achieves Instruction Level Parallelism (ILP): multiple instructions in parallel
• But, problems with pipeline hazards:
  CPI = Ideal CPI + stalls/instruction
  Stalls = Structural + Data (RAW/WAW/WAR) + Control
• How to reduce stalls? That is, how to increase ILP?
Techniques for Improving ILP
• Loop unrolling
• Basic pipeline scheduling
• Dynamic scheduling, scoreboarding, register renaming
• Dynamic memory disambiguation
• Dynamic branch prediction
• Multiple instruction issue per cycle
• Software and hardware techniques
Loop-Level Parallelism
• Basic block: straight-line code w/o branches
• Fraction of branches is about 0.15, so the average basic block is only 6-7 instructions
• And these instructions may be dependent: ILP within a basic block is limited!
• Hence, look for parallelism beyond a basic block
• Loop-level parallelism is a simple example of this
Loop-Level Parallelism: An Example
• Consider the loop:

for(int i = 1000; i >= 1; i = i-1) {
    x[i] = x[i] + C; // FP
}

• Each iteration of the loop is independent of the other iterations: loop-level parallelism
• To convert it into ILP:
  • Loop unrolling (static, dynamic)
  • Vector instructions
The Loop, in DLX
� In DLX, the loop looks like:
Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?

• Assume:
  • R1 holds the initial address
  • F2 has the scalar value 'C'
  • The lowest address in the array is 8
How Many Cycles per Loop?
CC1  Loop: LD   F0, 0(R1)
CC2        stall
CC3        ADDD F4, F0, F2
CC4        stall
CC5        stall
CC6        SD   0(R1), F4
CC7        SUBI R1, R1, 8
CC8        stall
CC9        BNEZ R1, Loop
CC10       stall

10 cycles per loop iteration
Reducing Stalls by Scheduling
CC1  Loop: LD   F0, 0(R1)
CC2        SUBI R1, R1, 8
CC3        ADDD F4, F0, F2
CC4        stall
CC5        BNEZ R1, Loop
CC6        SD   8(R1), F4   // 8(R1) == 0(R1) before the SUBI

• Realizing that SUBI and SD can be swapped is non-trivial!
• Overhead versus actual work: 3 cycles of work (LD, ADDD, SD), 3 cycles of overhead (SUBI, BNEZ, stall)
Unrolling the Loop
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4     // No SUBI, BNEZ
      LD   F6, -8(R1)    // Note diff FP reg, new offset
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)  // Note diff FP reg, new offset
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)  // Note diff FP reg, new offset
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
How Many Cycles per Loop?
Loop: LD   F0, 0(R1)     // 1 stall
      ADDD F4, F0, F2    // 2 stalls
      SD   0(R1), F4
      LD   F6, -8(R1)    // 1 stall
      ADDD F8, F6, F2    // 2 stalls
      SD   -8(R1), F8
      LD   F10, -16(R1)  // 1 stall
      ADDD F12, F10, F2  // 2 stalls
      SD   -16(R1), F12
      LD   F14, -24(R1)  // 1 stall
      ADDD F16, F14, F2  // 2 stalls
      SD   -24(R1), F16
      SUBI R1, R1, 32    // 1 stall
      BNEZ R1, Loop      // 1 stall

28 cycles per unrolled loop == 7 cycles per original iteration
Scheduling the Unrolled Loop
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12   // 16(R1) == -16(R1) before the SUBI
      BNEZ R1, Loop
      SD   8(R1), F16    // In the branch delay slot; == -24(R1) before the SUBI

14 cycles per unrolled loop == 3.5 cycles per original iteration
Observations and Requirements
• Gain from scheduling is even higher for the unrolled loop: more parallelism is exposed on unrolling
• Need to know that 1000 is a multiple of 4
• Requirements:
  • Determine that the loop can be unrolled
  • Use different registers to avoid conflicts
  • Determine that SD can be moved after SUBI, and find the offset adjustment
  • Understand dependences
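The unrolling transformation above can also be sketched at the C level. This is a minimal sketch, not the compiler's actual output: the loop is unrolled by 4, and distinct temporaries t0..t3 play the role of the distinct FP registers F4, F8, F12, F16, removing name dependences between the unrolled copies. It assumes, as the slides do, that the trip count is a multiple of 4.

```c
#include <assert.h>

#define N 1000

/* Original loop: x[i] = x[i] + c, iterating downward (0-based here). */
void add_scalar(double *x, double c) {
    for (int i = N - 1; i >= 0; i--)
        x[i] = x[i] + c;
}

/* Unrolled by 4: one copy of the loop overhead (decrement + branch)
   per 4 elements. Assumes N is a multiple of 4. */
void add_scalar_unrolled(double *x, double c) {
    for (int i = N - 1; i >= 3; i -= 4) {
        double t0 = x[i]     + c;   /* plays the role of F4  */
        double t1 = x[i - 1] + c;   /* plays the role of F8  */
        double t2 = x[i - 2] + c;   /* plays the role of F12 */
        double t3 = x[i - 3] + c;   /* plays the role of F16 */
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Both versions compute the same result; the unrolled one simply exposes four independent load/add/store chains per iteration for the scheduler.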
Dependences
• Dependent instructions ==> cannot be executed in parallel
• Three kinds of dependences:
  • Data dependence (RAW)
  • Name dependence (WAW and WAR)
  • Control dependence
Dependences (continued)
• Dependences are properties of programs
• Stalls are properties of the pipeline
• Two possibilities:
  • Maintain the dependence, but avoid stalls
  • Eliminate the dependence by code transformation
Data Dependence
• Data dependence represents data flow from one instruction to another: one instruction uses the result of another (take the transitive closure)
• In our example:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8

• Note: dependence through memory is hard to detect:
  • 100(R4) and 80(R6) may be the same address
  • 20(R1) and 20(R1) may be different addresses at different times
Name Dependence
• Two instructions use the same register/memory location (name), but there is no flow of data between them
  • Anti-dependence: WAR hazard
  • Output dependence: WAW hazard
• Can do register renaming: statically, or dynamically
Name Dependence in our Example
Before renaming:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, 32

After register renaming:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
Control Dependence
• An example:

T1;
if (p1) {
    S1;
}

• Statement S1 is control-dependent on p1, but T1 is not
• What this means for execution:
  • S1 cannot be moved before p1
  • T1 cannot be moved after p1
Control Dependence in our Example
Without unrolling (control dependences preserved):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, exit
      // Two more such copies...
      SUBI R1, R1, 8
      BNEZ R1, Loop

Unrolled (the intermediate branches removed):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
Handling Control Dependence
• Control dependence need not be maintained
• We need to maintain:
  • Exception behaviour: do not cause new exceptions
  • Data flow: ensure the right data item is used
• Speculation and conditional instructions are techniques to get around control dependence
Loop Unrolling: a Relook
• Our example:

for(int i = 1000; i >= 1; i = i-1) {
    x[i] = x[i] + C; // FP
}

• Consider:

for(int i = 1000; i >= 1; i = i-1) {
    A[i-1] = A[i] + C[i]; // S1
    B[i-1] = B[i] + A[i-1]; // S2
}

• S2 is dependent on S1
• S1 is dependent on its previous iteration; same case with S2
• Loop-carried dependence ==> loop iterations have to be executed in-order
Removing Loop-Carried Dependence
• Another example:

for(int i = 1000; i >= 1; i = i-1) {
    A[i] = A[i] + B[i]; // S1
    B[i-1] = C[i] + D[i]; // S2
}

• S1 depends on the prior iteration of S2
• The dependence can be removed (it is not cyclic):

A[1000] = A[1000] + B[1000];
for(int i = 1000; i >= 2; i = i-1) {
    B[i-1] = C[i] + D[i]; // S2
    A[i-1] = A[i-1] + B[i-1]; // S1
}
B[0] = C[1] + D[1];
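The transformation above can be checked mechanically. A minimal sketch (0-based arrays instead of the slides' 1-based indexing): run both versions on the same inputs and compare. The prologue peels S1's first instance (which reads the original B), and the epilogue peels S2's last instance.

```c
#include <assert.h>

#define N 1000

/* Original loop: S1 reads B[i], which the previous iteration's S2 wrote. */
void original(double *A, double *B, const double *C, const double *D) {
    for (int i = N - 1; i >= 1; i--) {
        A[i]     = A[i] + B[i];   /* S1 */
        B[i - 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed loop from the slide: peel S1's first instance and S2's
   last instance; the remaining loop body has no cyclic dependence. */
void transformed(double *A, double *B, const double *C, const double *D) {
    A[N - 1] = A[N - 1] + B[N - 1];      /* peeled first S1 */
    for (int i = N - 1; i >= 2; i--) {
        B[i - 1] = C[i] + D[i];          /* S2 */
        A[i - 1] = A[i - 1] + B[i - 1];  /* S1 */
    }
    B[0] = C[1] + D[1];                  /* peeled last S2 */
}
```

After the transformation, each iteration's S2 feeds the S1 in the *same* iteration, so iterations no longer have to run in order.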
Static vs. Dynamic Scheduling
• Static scheduling: limitations
  • Dependences may not be known at compile time
  • Even if known, the compiler becomes complex
  • Compiler has to have knowledge of the pipeline
• Dynamic scheduling
  • Handles dynamic dependences
  • Simpler compiler
  • Efficient even if the code was compiled for a different pipeline
Dynamic Scheduling
• For now, we will focus on overcoming data hazards
• The idea:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

• SUBD can proceed without waiting for DIVD
CDC 6600: A Case Study
• IF stage: fetch instructions onto a queue
• ID stage is split into two stages:
  • Issue: decode and check for structural hazards
  • Read operands: check for data hazards
• Execution may begin, and may complete, out-of-order
  • This complicates exception handling; ignore that for now
• What is the logic for data hazard checks?
The CDC Scoreboard
• Out-of-order completion ==> WAR and WAW hazards possible
• Scoreboard: a data structure for all hazard detection in the presence of out-of-order execution/completion
• All instructions "consult" the scoreboard to detect hazards
The Scoreboard Solution
• Three components:
  • Stages of the pipeline: Issue (ID1), Read-operands (ID2), EX, WB
  • Data structure (in hardware)
  • Logic for hazard detection and stalling
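The scoreboard's bookkeeping can be sketched as C structs. This is an illustrative sketch, not the CDC 6600 hardware: the field names (Fi/Fj/Fk, Qj/Qk, Rj/Rk) follow the common textbook convention, and the sizes and the `can_issue` helper are assumptions for illustration.

```c
#include <assert.h>

#define NUM_FU   5    /* number of functional units (illustrative) */
#define NUM_REGS 32

/* Per-functional-unit status, textbook-style field names. */
typedef struct {
    int busy;       /* is the unit busy? */
    int op;         /* operation being performed */
    int Fi;         /* destination register */
    int Fj, Fk;     /* source registers */
    int Qj, Qk;     /* FUs producing Fj/Fk (-1 = value ready) */
    int Rj, Rk;     /* Fj/Fk ready and not yet read? */
} FUStatus;

typedef struct {
    FUStatus fu[NUM_FU];
    int result[NUM_REGS];  /* FU that will write each register (-1 = none) */
} Scoreboard;

/* Issue-stage check (ID1): stall on a structural hazard (unit busy)
   or a WAW hazard (some unit already has this destination register). */
int can_issue(const Scoreboard *sb, int unit, int dest) {
    if (sb->fu[unit].busy)       return 0;  /* structural hazard */
    if (sb->result[dest] != -1)  return 0;  /* WAW hazard */
    return 1;
}
```

The same tables drive the other stages: read-operands waits for Rj and Rk, and write-back scans the other units' Fj/Fk fields for WAR hazards.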
Scoreboard Control & the Pipeline Stages
• Issue (ID1): decode, check whether the functional unit is free, and whether a previous instruction has the same destination register
  • No such hazard ==> scoreboard issues to the appropriate functional unit
  • Note: structural/WAW hazards are prevented by stalling here
  • Note: a stall here ==> the IF queue will grow
• Read operands (ID2):
  • An operand is available if no earlier instruction is going to write it, or if the register is being written currently
  • RAW hazards are resolved here
Scoreboard Control & the Pipeline Stages (continued)
• Execute (EX):
  • Functional units perform the execution
  • Scoreboard is notified on completion
• Write-Back (WB):
  • Check for WAR hazards
  • Stall on detection; write back otherwise
Some Remarks
• WAW causes a stall in ID1; WAR causes a stall in WB
• No forwarding logic
  • Output is written as soon as it is available (and there is no WAR hazard)
• Structural hazard possible in register read/write
  • The CDC 6600 has 16 functional units, and 4 buses
The Scoreboard Data-Structures
• Instruction status
• Functional unit status
• Register result status
• Randy Katz's CS252 slides (Lecture 10, Spring 1996):
  • Scoreboard pipeline control
  • A detailed example
Limitations of the Scoreboard
• Speedup of 1.7 for (compiled) FORTRAN; speedup of 2.5 for hand-coded assembly
• The scoreboard looks only within a basic block!
• Some hazards still cause stalls:
  • Structural
  • WAR, WAW
Dynamic Scheduling
• Better than static scheduling
• Scoreboarding:
  • Used by the CDC 6600
  • Useful only within a basic block
  • WAW and WAR stalls
• Tomasulo algorithm:
  • Used in the IBM 360/91 for the FP unit
  • Main additional feature: register renaming to avoid WAR and WAW stalls
Register Renaming: Basic Idea
• Compiler maps memory --> registers statically
• Register renaming maps registers --> virtual registers in hardware, dynamically
  • Must keep track of this mapping
  • Make sure to read the current value
• Usually, num. virtual registers > num. ISA registers
• Virtual registers are known as reservation stations in the IBM 360/91
Tomasulo: Main Architectural Features
• Reservation stations: fetch and buffer an operand as soon as it is available
• Load/store buffers: hold the address (and, for stores, the data) to be loaded/stored
• Distributed hazard detection and execution control
• Common Data Bus (CDB): results are passed from where they are generated to where they are needed
• Note: the IBM 360/91 also had reg-mem instructions
The Tomasulo Architecture
[Figure: block diagram. The FP operation queue (fed from the instruction unit) and the load buffers (fed from memory) supply two sets of reservation stations, one for the FP ADD/SUB unit and one for the FP MUL/DIV unit; operation and operand buses come from the FP registers; results are broadcast on the Common Data Bus to the reservation stations, the store buffers (to memory), and the FP registers.]
Pipeline Stages
• Issue:
  • Wait for a free Reservation Station (RS) or load/store buffer, and place the instruction there
  • Rename registers in the process (WAR and WAW handled here)
• Execute (EX):
  • Monitor the CDB for required operands
  • RAW hazards are checked in this process
• Write Result (WB):
  • Write to the CDB
  • Picked up by any RS, store buffer, or register
Register Renaming
• In an RS, operands are referred to by a tag (if the operand is not already in a register)
• The tag refers to the RS (holding the instruction) that will produce the required operand
• Thus each RS acts as a virtual register
The Data Structure
• Three parts, as in the scoreboard:
  • Instruction status
  • Reservation stations, load/store buffers, register file
  • Register status: which unit is going to produce each register's value
    • This is the register --> virtual register mapping
Components of RS, Reg. File, Load/Store Buffers
• Each RS has:
  • Op: the operation (+, -, x, /)
  • Vj, Vk: the operands (if available)
  • Qj, Qk: the tags of the RSs producing Vj/Vk (0 if Vj/Vk is known)
  • Busy: is the RS busy?
• Each register in the register file, and each store buffer, has:
  • Qi: tag of the RS whose result should go to the register or the memory location (blank ==> no such active RS)
• Load and store buffers have:
  • A Busy field; a store buffer also has the value V to be stored
Maintaining the Data Structure
• Issue:
  • Wait until: an RS or buffer is empty
  • Updates: Qj, Qk, Vj, Vk, Busy of the RS/buffer; maintain the register mapping (register status)
• Execute:
  • Wait until: Qj=0 and Qk=0 (operands available)
• Write result:
  • The CDB result is picked up by RSs (update Qj, Qk, Vj, Vk), store buffers (update Qi, V), and the register file (update register status)
  • Update Busy of the RS which finished
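The bookkeeping above can be sketched in C. This is an illustrative sketch, not the IBM 360/91 design: the field names (Vj/Vk, Qj/Qk, register status Qi) follow the slides, tag 0 means "operand available" as the slides specify, and the sizes and helper names are assumptions.

```c
#include <assert.h>

#define NUM_RS   6     /* tags 1..NUM_RS; tag 0 means "value available" */
#define NUM_REGS 32

typedef struct {
    int    busy;
    int    op;        /* operation to perform */
    double Vj, Vk;    /* operand values, once available */
    int    Qj, Qk;    /* tags of the RSs producing Vj/Vk (0 = in Vj/Vk) */
} ReservationStation;

typedef struct {
    ReservationStation rs[NUM_RS];
    int regstat[NUM_REGS];  /* Qi per register: producing RS tag, 0 = none */
} Tomasulo;

/* An RS may begin execution once both operand tags are 0. */
int ready_to_execute(const Tomasulo *t, int tag) {
    const ReservationStation *r = &t->rs[tag - 1];
    return r->busy && r->Qj == 0 && r->Qk == 0;
}

/* Write-result stage: broadcast on the CDB. Every RS waiting on this
   tag captures the value, the register status is cleared, and the
   finishing RS becomes free. */
void cdb_broadcast(Tomasulo *t, int tag, double value) {
    for (int i = 0; i < NUM_RS; i++) {
        if (t->rs[i].Qj == tag) { t->rs[i].Vj = value; t->rs[i].Qj = 0; }
        if (t->rs[i].Qk == tag) { t->rs[i].Vk = value; t->rs[i].Qk = 0; }
    }
    for (int r = 0; r < NUM_REGS; r++)
        if (t->regstat[r] == tag) t->regstat[r] = 0;
    t->rs[tag - 1].busy = 0;
}
```

Note how the broadcast is the only communication: no RS ever reads another RS directly, which is what makes the hazard detection distributed.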
Some Examples
• Randy Katz's CS252 slides (Lecture 11, Spring 1996)
• Dynamic loop unrolling example from the text
Dynamic Loop Unrolling
• Assume the branch is predicted to be taken
• Denote: load buffers as L1, L2, ...; ADDD RSs as A1, A2, ...
• First iteration: F0 --> L1, F4 --> A1
• Second iteration: F0 --> L2, F4 --> A2

Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?
Summary Remarks
• Memory disambiguation required
• Drawbacks of Tomasulo:
  • Large amount of hardware
  • Complex control logic
  • CDB is a performance bottleneck
• But:
  • Required if designing for an old ISA
  • Multiple issue ==> register renaming and dynamic scheduling required
• Next class: branch prediction
Dealing with Control Hazards
• Software techniques:
  • Branch delay slots
  • Software branch prediction
  • Canceling or nullifying branches
• Misprediction rates can be high
  • Worse with multiple issue per cycle
• Hence, hardware/dynamic branch prediction
Branch Prediction Buffer
• PC --> Taken/Not-Taken (T/NT) mapping
  • Can use just the last few bits of the PC
  • The prediction may be that of some other branch
  • OK, since correctness is not affected
• Shortcoming of this prediction scheme: the branch is mispredicted twice for each execution of a loop
  • Bad if the loop is small:

for(int i = 0; i < 10; i++) {
    x[i] = x[i] + C;
}
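The twice-per-execution misprediction can be demonstrated with a small simulation. A minimal sketch, assuming a single 1-bit buffer entry dedicated to this loop's backward branch: the predictor simply remembers the branch's last outcome.

```c
#include <assert.h>

/* 1-bit predictor for one branch: predict the last outcome.
   Simulate a 10-iteration loop executed 'runs' times: the backward
   branch is taken 9 times, then not taken on loop exit. */
int one_bit_mispredictions(int runs) {
    int pred = 0;   /* initial prediction: not taken */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);   /* last iteration falls through */
            if (taken != pred) miss++;
            pred = taken;          /* 1-bit: remember last outcome */
        }
    }
    return miss;
}
```

Each execution of the loop mispredicts exactly twice: once on the exit (predicted taken, not taken) and once on the next entry (predicted not taken, taken). For a 10-iteration loop that is a 20% misprediction rate even though the branch is 90% taken.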
Two-Bit Predictor
• Have to mispredict twice before the prediction changes: built-in hysteresis
• General case is an n-bit predictor
  • 0 to (2^n)-1 saturating counter
  • 0 to 2^(n-1)-1: predict taken
  • 2^(n-1) to (2^n)-1: predict not taken
• Experimental studies: 2-bit is as good as n-bit
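A sketch of the n-bit saturating counter, using the slides' convention that the lower half of the counter range predicts taken (taken outcomes move the counter toward 0). The struct and function names are illustrative.

```c
#include <assert.h>

typedef struct {
    int count;  /* saturating counter, 0 .. 2^n - 1 */
    int n;      /* counter width in bits */
} Predictor;

/* Slides' convention: 0 .. 2^(n-1)-1 predict taken,
   2^(n-1) .. 2^n - 1 predict not taken. */
int predict_taken(const Predictor *p) {
    return p->count < (1 << (p->n - 1));
}

/* Move toward 0 on a taken branch, toward 2^n - 1 on a not-taken
   branch, saturating at both ends. */
void update(Predictor *p, int taken) {
    if (taken) {
        if (p->count > 0) p->count--;
    } else {
        if (p->count < (1 << p->n) - 1) p->count++;
    }
}
```

With n = 2, a strongly-taken branch must mispredict twice before the prediction flips, which removes one of the two per-loop mispredictions of the 1-bit scheme.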
Implementing Branch Prediction Buffers
• A small cache accessed along with the instruction in IF
• Or, 2 additional bits in the instruction cache
• Note: a branch prediction buffer is not useful for the DLX pipeline
  • The branch target is not known any earlier than the branch condition
Prediction Performance
• 4096 entries in the prediction buffer
• SPEC89, IBM Power architecture

[Figure: misprediction rate (0% to 18%) for the SPEC89 benchmarks Nasa7, Matrix300, Tomcatv, Doduc, Spice, Fpppp, Gcc, Espresso, Eqntott, Li.]
Improving Branch Prediction
• Two ways: increase buffer size, improve accuracy

[Figure: misprediction rate (0% to 18%) for the same SPEC89 benchmarks, comparing a 4096-entry buffer against a buffer with infinite entries.]
Improving Prediction Accuracy
• Predict branches based on the outcomes of recent other branches:

if(aa == 2) {
    aa = 0;
}
if(bb == 2) {
    bb = 0;
}
if(aa == bb) {
    // Do something
}

• Correlating, or two-level, predictor
Two-Level Predictor
• There are effectively two predictors for each branch, selected by whether the previous branch was T/NT:

Prediction bits | Prediction if last branch NT | Prediction if last branch T
NT/NT           | NT                           | NT
NT/T            | NT                           | T
T/NT            | T                            | NT
T/T             | T                            | T
Two-Level Predictor (continued)
• The last predictor was a (1,1) predictor: one bit of history, one bit of prediction
• The general case is an (m,n) predictor: m bits of history, n bits of prediction
• How to implement? Keep an m-bit shift register of recent branch outcomes
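A sketch of a (2,2) correlating predictor built exactly this way: a global 2-bit shift register of recent outcomes selects one of four 2-bit counters per branch entry. The table size is illustrative, and (unlike the earlier slide's convention) this sketch uses the common convention that the upper half of the counter range predicts taken.

```c
#include <assert.h>

#define M       2     /* bits of global history */
#define ENTRIES 1024  /* branch entries (illustrative) */

typedef struct {
    unsigned history;                        /* m-bit shift register */
    unsigned char counter[ENTRIES][1 << M];  /* 2-bit counters, 0..3 */
} CorrelatingPredictor;

/* The branch entry picks a row; the global history picks the column.
   Counters 2 and 3 predict taken (common convention, assumed here). */
int predict(const CorrelatingPredictor *p, unsigned entry) {
    return p->counter[entry][p->history] >= 2;
}

void train(CorrelatingPredictor *p, unsigned entry, int taken) {
    unsigned char *c = &p->counter[entry][p->history];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the actual outcome into the m-bit history register */
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1u);
}
```

On a strictly alternating T/NT branch, the history disambiguates the two cases, so the predictor learns the pattern that a plain 2-bit counter cannot.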
Cost of Two-Level Predictor
• Number of bits required: num. branch entries x 2^m x n
• How many bits in a 4096-entry (0,2) predictor? 8K
• How many branch entries in an 8K-bit (2,2) predictor? 1K
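The cost formula above is a one-liner; the function name is illustrative.

```c
#include <assert.h>

/* Predictor cost in bits: entries x 2^m x n, as on the slide. */
long predictor_bits(long entries, int m, int n) {
    return entries * (1L << m) * n;
}
```

Both of the slide's examples land on the same 8K-bit budget: a (2,2) predictor pays for its 4 counters per entry by holding 4x fewer entries.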
Performance of (2,2) Predictor
[Figure: misprediction rate (0% to 20%) for the SPEC89 benchmarks Nasa7, Matrix300, Tomcatv, Doduc, Spice, Fpppp, Gcc, Espresso, Eqntott, Li, comparing a 4096-entry (0,2) predictor, an infinite-entry (0,2) predictor, and a 1K-entry (2,2) predictor.]
Branch Target Buffer
• A branch prediction buffer is not useful for DLX: we need to know the target address by the end of IF
• So, also store the branch target address: a branch target buffer, or branch target cache
• Access the branch target buffer in the IF cycle
  • Hit ==> the predicted branch target is known at the end of IF
  • We also need to know whether the branch is predicted T/NT
Branch Target Buffer (continued)
• No entry found ==> (Target = PC+4)
• An exact match of the PC is important, since we are predicting even before knowing that it is a branch instruction
• Hardware is similar to a cache: lookup based on the PC, yielding the predicted target
• Need to store the predicted PC only for taken predictions
Steps in Using a Target Buffer
[Flowchart, spanning the IF, ID, and EX stages:]
• IF: access the instruction cache and the target buffer in parallel
  • Entry found ==> use the predicted PC for the next fetch
  • Entry not found ==> normal execution (fetch PC+4)
• ID/EX: resolve whether it actually is a taken branch
  • Entry found, and it is a taken branch ==> correct prediction, proceed
  • Entry found, but not a taken branch ==> mispredicted branch; restart fetch, delete the buffer entry
  • Entry not found, but it is a taken branch ==> make a new target buffer entry
Penalties in Branch Prediction
• Given a prediction accuracy of p, a buffer hit rate of h, and a taken-branch frequency of f, what is the branch penalty?
  • h x (1-p) x 2 + (1-h) x f x 2

Buffer hit?       | Prediction correct? | Penalty
Yes               | Yes                 | 0
Yes               | No                  | 2
No (taken branch) | -                   | 2
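The penalty formula can be written out directly; the function name and the example parameter values below are illustrative, not measurements from the slides.

```c
#include <assert.h>

/* Average branch penalty in cycles, from the slide's formula:
   hit but mispredicted:     h * (1-p) * 2
   miss, and branch taken:   (1-h) * f * 2  */
double branch_penalty(double h, double p, double f) {
    return h * (1.0 - p) * 2.0 + (1.0 - h) * f * 2.0;
}
```

For instance, with a 90% hit rate, 90% accuracy, and 60% of branches taken, the average penalty is 0.9 x 0.1 x 2 + 0.1 x 0.6 x 2 = 0.3 cycles per branch.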
Storing Target Instructions
• Directly store target instructions instead of the target address
  • Target buffer access is now allowed to take longer
  • Or, branch folding can be achieved: replace the fetched instruction with the one found in the target buffer entry
  • Zero-cycle unconditional branch; may work for conditional branches as well