Instruction Level Parallelism
• Pipelining achieves Instruction Level Parallelism (ILP): multiple instructions in parallel
• But, problems with pipeline hazards:
  CPI = Ideal CPI + stalls/instruction
  Stalls = Structural + Data (RAW/WAW/WAR) + Control
• How to reduce stalls? That is, how to increase ILP?
Techniques for Improving ILP
• Loop unrolling
• Basic pipeline scheduling
• Dynamic scheduling, scoreboarding, register renaming
• Dynamic memory disambiguation
• Dynamic branch prediction
• Multiple instruction issue per cycle
• Software and hardware techniques
Loop-Level Parallelism
• Basic block: straight-line code w/o branches
• Fraction of branches is about 0.15, so the average basic block is only 6-7 instructions
• And these instructions may be dependent: ILP within a basic block is limited!
• Hence, look for parallelism beyond a basic block
• Loop-level parallelism is a simple example of this
Loop-Level Parallelism: An Example
• Consider the loop:

for(int i = 1000; i >= 1; i = i-1) {
    x[i] = x[i] + C; // FP
}

• Each iteration of the loop is independent of the other iterations: loop-level parallelism
• To convert it into ILP:
  • Loop unrolling (static, dynamic)
  • Vector instructions
The Loop, in DLX
� In DLX, the loop looks like:
Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?

• Assume:
  • R1 holds the initial address
  • F2 has the scalar value 'C'
  • The lowest address in the array is 8
How Many Cycles per Loop?
CC1  Loop: LD   F0, 0(R1)
CC2        stall
CC3        ADDD F4, F0, F2
CC4        stall
CC5        stall
CC6        SD   0(R1), F4
CC7        SUBI R1, R1, 8
CC8        stall
CC9        BNEZ R1, Loop
CC10       stall

10 cycles per loop iteration
Reducing Stalls by Scheduling
CC1  Loop: LD   F0, 0(R1)
CC2        SUBI R1, R1, 8
CC3        ADDD F4, F0, F2
CC4        stall
CC5        BNEZ R1, Loop
CC6        SD   8(R1), F4   // 8(R1) == 0(R1) before the SUBI

• Realizing that SUBI and SD can be swapped is non-trivial!
• Overhead versus actual work: 3 cycles of work (LD, ADDD, SD), 3 cycles of overhead (SUBI, BNEZ, stall)
Unrolling the Loop
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4     // No SUBI, BNEZ
      LD   F6, -8(R1)    // Note diff FP reg, new offset
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)  // Note diff FP reg, new offset
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)  // Note diff FP reg, new offset
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
How Many Cycles per Loop?
Loop: LD   F0, 0(R1)     // 1 stall
      ADDD F4, F0, F2    // 2 stalls
      SD   0(R1), F4
      LD   F6, -8(R1)    // 1 stall
      ADDD F8, F6, F2    // 2 stalls
      SD   -8(R1), F8
      LD   F10, -16(R1)  // 1 stall
      ADDD F12, F10, F2  // 2 stalls
      SD   -16(R1), F12
      LD   F14, -24(R1)  // 1 stall
      ADDD F16, F14, F2  // 2 stalls
      SD   -24(R1), F16
      SUBI R1, R1, 32    // 1 stall
      BNEZ R1, Loop      // 1 stall

28 cycles per unrolled loop == 7 cycles per original iteration
Scheduling the Unrolled Loop
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12   // 16(R1) == -16(R1) before the SUBI
      BNEZ R1, Loop
      SD   8(R1), F16    // In the branch delay slot; == -24(R1) before the SUBI

14 cycles per unrolled loop == 3.5 cycles per original iteration
Observations and Requirements
• Gain from scheduling is even higher for the unrolled loop: more parallelism is exposed on unrolling
• Need to know that 1000 is a multiple of 4
• Requirements:
  • Determine that the loop can be unrolled
  • Use different registers to avoid conflicts
  • Determine that SD can be moved after SUBI, and find the offset adjustment
  • Understand dependences
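The unrolling transformation above can also be sketched at the C level. This is a minimal sketch, not the compiler's actual output: the loop is unrolled by 4, and distinct temporaries t0..t3 play the role of the distinct FP registers F4, F8, F12, F16, removing name dependences between the unrolled copies. It assumes, as the slides do, that the trip count is a multiple of 4.

```c
#include <assert.h>

#define N 1000

/* Original loop: x[i] = x[i] + c, iterating downward (0-based here). */
void add_scalar(double *x, double c) {
    for (int i = N - 1; i >= 0; i--)
        x[i] = x[i] + c;
}

/* Unrolled by 4: one copy of the loop overhead (decrement + branch)
   per 4 elements. Assumes N is a multiple of 4. */
void add_scalar_unrolled(double *x, double c) {
    for (int i = N - 1; i >= 3; i -= 4) {
        double t0 = x[i]     + c;   /* plays the role of F4  */
        double t1 = x[i - 1] + c;   /* plays the role of F8  */
        double t2 = x[i - 2] + c;   /* plays the role of F12 */
        double t3 = x[i - 3] + c;   /* plays the role of F16 */
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Both versions compute the same result; the unrolled one simply exposes four independent load/add/store chains per iteration for the scheduler.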
Dependences
• Dependent instructions ==> cannot be executed in parallel
• Three kinds of dependences:
  • Data dependence (RAW)
  • Name dependence (WAW and WAR)
  • Control dependence
Dependences (continued)
• Dependences are properties of programs
• Stalls are properties of the pipeline
• Two possibilities:
  • Maintain the dependence, but avoid stalls
  • Eliminate the dependence by code transformation
Data Dependence
• Data dependence represents data flow from one instruction to another: one instruction uses the result of another (take the transitive closure)
• In our example:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8

• Note: dependence through memory is hard to detect:
  • 100(R4) and 80(R6) may be the same address
  • 20(R1) and 20(R1) may be different addresses at different times
Name Dependence
• Two instructions use the same register/memory location (name), but there is no flow of data between them
  • Anti-dependence: WAR hazard
  • Output dependence: WAW hazard
• Can do register renaming: statically, or dynamically
Name Dependence in our Example
Before renaming:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, 32

After register renaming:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
Control Dependence
• An example:

T1;
if (p1) {
    S1;
}

• Statement S1 is control-dependent on p1, but T1 is not
• What this means for execution:
  • S1 cannot be moved before p1
  • T1 cannot be moved after p1
Control Dependence in our Example
Without unrolling (control dependences preserved):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, exit
      // Two more such copies...
      SUBI R1, R1, 8
      BNEZ R1, Loop

Unrolled (the intermediate branches removed):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
Handling Control Dependence
• Control dependence need not be maintained
• We need to maintain:
  • Exception behaviour: do not cause new exceptions
  • Data flow: ensure the right data item is used
• Speculation and conditional instructions are techniques to get around control dependence
Loop Unrolling: a Relook
• Our example:

for(int i = 1000; i >= 1; i = i-1) {
    x[i] = x[i] + C; // FP
}

• Consider:

for(int i = 1000; i >= 1; i = i-1) {
    A[i-1] = A[i] + C[i]; // S1
    B[i-1] = B[i] + A[i-1]; // S2
}

• S2 is dependent on S1
• S1 is dependent on its previous iteration; same case with S2
• Loop-carried dependence ==> loop iterations have to be executed in-order
Removing Loop-Carried Dependence
• Another example:

for(int i = 1000; i >= 1; i = i-1) {
    A[i] = A[i] + B[i]; // S1
    B[i-1] = C[i] + D[i]; // S2
}

• S1 depends on the prior iteration of S2
• The dependence can be removed (it is not cyclic):

A[1000] = A[1000] + B[1000];
for(int i = 1000; i >= 2; i = i-1) {
    B[i-1] = C[i] + D[i]; // S2
    A[i-1] = A[i-1] + B[i-1]; // S1
}
B[0] = C[1] + D[1];
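The transformation above can be checked mechanically. A minimal sketch (0-based arrays instead of the slides' 1-based indexing): run both versions on the same inputs and compare. The prologue peels S1's first instance (which reads the original B), and the epilogue peels S2's last instance.

```c
#include <assert.h>

#define N 1000

/* Original loop: S1 reads B[i], which the previous iteration's S2 wrote. */
void original(double *A, double *B, const double *C, const double *D) {
    for (int i = N - 1; i >= 1; i--) {
        A[i]     = A[i] + B[i];   /* S1 */
        B[i - 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed loop from the slide: peel S1's first instance and S2's
   last instance; the remaining loop body has no cyclic dependence. */
void transformed(double *A, double *B, const double *C, const double *D) {
    A[N - 1] = A[N - 1] + B[N - 1];      /* peeled first S1 */
    for (int i = N - 1; i >= 2; i--) {
        B[i - 1] = C[i] + D[i];          /* S2 */
        A[i - 1] = A[i - 1] + B[i - 1];  /* S1 */
    }
    B[0] = C[1] + D[1];                  /* peeled last S2 */
}
```

After the transformation, each iteration's S2 feeds the S1 in the *same* iteration, so iterations no longer have to run in order.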
Static vs. Dynamic Scheduling
• Static scheduling: limitations
  • Dependences may not be known at compile time
  • Even if known, the compiler becomes complex
  • Compiler has to have knowledge of the pipeline
• Dynamic scheduling
  • Handles dynamic dependences
  • Simpler compiler
  • Efficient even if the code was compiled for a different pipeline
Dynamic Scheduling
• For now, we will focus on overcoming data hazards
• The idea:

DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14

• SUBD can proceed without waiting for DIVD
CDC 6600: A Case Study
• IF stage: fetch instructions onto a queue
• ID stage is split into two stages:
  • Issue: decode and check for structural hazards
  • Read operands: check for data hazards
• Execution may begin, and may complete, out-of-order
  • This complicates exception handling; ignore that for now
• What is the logic for data hazard checks?
The CDC Scoreboard
• Out-of-order completion ==> WAR and WAW hazards possible
• Scoreboard: a data structure for all hazard detection in the presence of out-of-order execution/completion
• All instructions "consult" the scoreboard to detect hazards
The Scoreboard Solution
• Three components:
  • Stages of the pipeline: Issue (ID1), Read-operands (ID2), EX, WB
  • Data structure (in hardware)
  • Logic for hazard detection and stalling
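The scoreboard's bookkeeping can be sketched as C structs. This is an illustrative sketch, not the CDC 6600 hardware: the field names (Fi/Fj/Fk, Qj/Qk, Rj/Rk) follow the common textbook convention, and the sizes and the `can_issue` helper are assumptions for illustration.

```c
#include <assert.h>

#define NUM_FU   5    /* number of functional units (illustrative) */
#define NUM_REGS 32

/* Per-functional-unit status, textbook-style field names. */
typedef struct {
    int busy;       /* is the unit busy? */
    int op;         /* operation being performed */
    int Fi;         /* destination register */
    int Fj, Fk;     /* source registers */
    int Qj, Qk;     /* FUs producing Fj/Fk (-1 = value ready) */
    int Rj, Rk;     /* Fj/Fk ready and not yet read? */
} FUStatus;

typedef struct {
    FUStatus fu[NUM_FU];
    int result[NUM_REGS];  /* FU that will write each register (-1 = none) */
} Scoreboard;

/* Issue-stage check (ID1): stall on a structural hazard (unit busy)
   or a WAW hazard (some unit already has this destination register). */
int can_issue(const Scoreboard *sb, int unit, int dest) {
    if (sb->fu[unit].busy)       return 0;  /* structural hazard */
    if (sb->result[dest] != -1)  return 0;  /* WAW hazard */
    return 1;
}
```

The same tables drive the other stages: read-operands waits for Rj and Rk, and write-back scans the other units' Fj/Fk fields for WAR hazards.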
Scoreboard Control & the Pipeline Stages
• Issue (ID1): decode, check whether the functional unit is free, and whether a previous instruction has the same destination register
  • No such hazard ==> scoreboard issues to the appropriate functional unit
  • Note: structural/WAW hazards are prevented by stalling here
  • Note: a stall here ==> the IF queue will grow
• Read operands (ID2):
  • An operand is available if no earlier instruction is going to write it, or if the register is being written currently
  • RAW hazards are resolved here
Scoreboard Control & the Pipeline Stages (continued)
• Execute (EX):
  • Functional units perform the execution
  • Scoreboard is notified on completion
• Write-Back (WB):
  • Check for WAR hazards
  • Stall on detection; write back otherwise
Some Remarks
• WAW causes a stall in ID1; WAR causes a stall in WB
• No forwarding logic
  • Output is written as soon as it is available (and there is no WAR hazard)
• Structural hazard possible in register read/write
  • The CDC 6600 has 16 functional units, and 4 buses
The Scoreboard Data-Structures
• Instruction status
• Functional unit status
• Register result status
• Randy Katz's CS252 slides (Lecture 10, Spring 1996):
  • Scoreboard pipeline control
  • A detailed example
Limitations of the Scoreboard
• Speedup of 1.7 for (compiled) FORTRAN; speedup of 2.5 for hand-coded assembly
• The scoreboard looks only within a basic block!
• Some hazards still cause stalls:
  • Structural
  • WAR, WAW
Dynamic Scheduling
• Better than static scheduling
• Scoreboarding:
  • Used by the CDC 6600
  • Useful only within a basic block
  • WAW and WAR stalls
• Tomasulo algorithm:
  • Used in the IBM 360/91 for the FP unit
  • Main additional feature: register renaming to avoid WAR and WAW stalls
Register Renaming: Basic Idea
• Compiler maps memory --> registers statically
• Register renaming maps registers --> virtual registers in hardware, dynamically
  • Must keep track of this mapping
  • Make sure to read the current value
• Usually, num. virtual registers > num. ISA registers
• Virtual registers are known as reservation stations in the IBM 360/91
Tomasulo: Main Architectural Features
• Reservation stations: fetch and buffer an operand as soon as it is available
• Load/store buffers: hold the address (and, for stores, the data) to be loaded/stored
• Distributed hazard detection and execution control
• Common Data Bus (CDB): results are passed from where they are generated to where they are needed
• Note: the IBM 360/91 also had reg-mem instructions
The Tomasulo Architecture
[Figure: block diagram. The FP operation queue (fed from the instruction unit) and the load buffers (fed from memory) supply two sets of reservation stations, one for the FP ADD/SUB unit and one for the FP MUL/DIV unit; operation and operand buses come from the FP registers; results are broadcast on the Common Data Bus to the reservation stations, the store buffers (to memory), and the FP registers.]
Pipeline Stages
• Issue:
  • Wait for a free Reservation Station (RS) or load/store buffer, and place the instruction there
  • Rename registers in the process (WAR and WAW handled here)
• Execute (EX):
  • Monitor the CDB for required operands
  • RAW hazards are checked in this process
• Write Result (WB):
  • Write to the CDB
  • Picked up by any RS, store buffer, or register
Register Renaming
• In an RS, operands are referred to by a tag (if the operand is not already in a register)
• The tag refers to the RS (holding the instruction) that will produce the required operand
• Thus each RS acts as a virtual register
The Data Structure
• Three parts, as in the scoreboard:
  • Instruction status
  • Reservation stations, load/store buffers, register file
  • Register status: which unit is going to produce each register's value
    • This is the register --> virtual register mapping
Components of RS, Reg. File, Load/Store Buffers
• Each RS has:
  • Op: the operation (+, -, x, /)
  • Vj, Vk: the operands (if available)
  • Qj, Qk: the tags of the RSs producing Vj/Vk (0 if Vj/Vk is known)
  • Busy: is the RS busy?
• Each register in the register file, and each store buffer, has:
  • Qi: tag of the RS whose result should go to the register or the memory location (blank ==> no such active RS)
• Load and store buffers have:
  • A Busy field; a store buffer also has the value V to be stored
Maintaining the Data Structure
• Issue:
  • Wait until: an RS or buffer is empty
  • Updates: Qj, Qk, Vj, Vk, Busy of the RS/buffer; maintain the register mapping (register status)
• Execute:
  • Wait until: Qj=0 and Qk=0 (operands available)
• Write result:
  • The CDB result is picked up by RSs (update Qj, Qk, Vj, Vk), store buffers (update Qi, V), and the register file (update register status)
  • Update Busy of the RS which finished
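The bookkeeping above can be sketched in C. This is an illustrative sketch, not the IBM 360/91 design: the field names (Vj/Vk, Qj/Qk, register status Qi) follow the slides, tag 0 means "operand available" as the slides specify, and the sizes and helper names are assumptions.

```c
#include <assert.h>

#define NUM_RS   6     /* tags 1..NUM_RS; tag 0 means "value available" */
#define NUM_REGS 32

typedef struct {
    int    busy;
    int    op;        /* operation to perform */
    double Vj, Vk;    /* operand values, once available */
    int    Qj, Qk;    /* tags of the RSs producing Vj/Vk (0 = in Vj/Vk) */
} ReservationStation;

typedef struct {
    ReservationStation rs[NUM_RS];
    int regstat[NUM_REGS];  /* Qi per register: producing RS tag, 0 = none */
} Tomasulo;

/* An RS may begin execution once both operand tags are 0. */
int ready_to_execute(const Tomasulo *t, int tag) {
    const ReservationStation *r = &t->rs[tag - 1];
    return r->busy && r->Qj == 0 && r->Qk == 0;
}

/* Write-result stage: broadcast on the CDB. Every RS waiting on this
   tag captures the value, the register status is cleared, and the
   finishing RS becomes free. */
void cdb_broadcast(Tomasulo *t, int tag, double value) {
    for (int i = 0; i < NUM_RS; i++) {
        if (t->rs[i].Qj == tag) { t->rs[i].Vj = value; t->rs[i].Qj = 0; }
        if (t->rs[i].Qk == tag) { t->rs[i].Vk = value; t->rs[i].Qk = 0; }
    }
    for (int r = 0; r < NUM_REGS; r++)
        if (t->regstat[r] == tag) t->regstat[r] = 0;
    t->rs[tag - 1].busy = 0;
}
```

Note how the broadcast is the only communication: no RS ever reads another RS directly, which is what makes the hazard detection distributed.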
Some Examples
• Randy Katz's CS252 slides (Lecture 11, Spring 1996)
• Dynamic loop unrolling example from the text
Dynamic Loop Unrolling
• Assume the branch is predicted to be taken
• Denote: load buffers as L1, L2, ...; ADDD RSs as A1, A2, ...
• First iteration: F0 --> L1, F4 --> A1
• Second iteration: F0 --> L2, F4 --> A2

Loop: LD   F0, 0(R1)   // F0 is array element
      ADDD F4, F0, F2  // F2 has the scalar 'C'
      SD   0(R1), F4   // Store result
      SUBI R1, R1, 8   // For next iteration
      BNEZ R1, Loop    // More iterations?
Summary Remarks
• Memory disambiguation required
• Drawbacks of Tomasulo:
  • Large amount of hardware
  • Complex control logic
  • CDB is a performance bottleneck
• But:
  • Required if designing for an old ISA
  • Multiple issue ==> register renaming and dynamic scheduling required
• Next class: branch prediction
Dealing with Control Hazards
• Software techniques:
  • Branch delay slots
  • Software branch prediction
  • Canceling or nullifying branches
• Misprediction rates can be high
  • Worse with multiple issue per cycle
• Hence, hardware/dynamic branch prediction
Branch Prediction Buffer
• PC --> Taken/Not-Taken (T/NT) mapping
  • Can use just the last few bits of the PC
  • The prediction may be that of some other branch
  • OK, since correctness is not affected
• Shortcoming of this prediction scheme: the branch is mispredicted twice for each execution of a loop
  • Bad if the loop is small:

for(int i = 0; i < 10; i++) {
    x[i] = x[i] + C;
}
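The twice-per-execution misprediction can be demonstrated with a small simulation. A minimal sketch, assuming a single 1-bit buffer entry dedicated to this loop's backward branch: the predictor simply remembers the branch's last outcome.

```c
#include <assert.h>

/* 1-bit predictor for one branch: predict the last outcome.
   Simulate a 10-iteration loop executed 'runs' times: the backward
   branch is taken 9 times, then not taken on loop exit. */
int one_bit_mispredictions(int runs) {
    int pred = 0;   /* initial prediction: not taken */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);   /* last iteration falls through */
            if (taken != pred) miss++;
            pred = taken;          /* 1-bit: remember last outcome */
        }
    }
    return miss;
}
```

Each execution of the loop mispredicts exactly twice: once on the exit (predicted taken, not taken) and once on the next entry (predicted not taken, taken). For a 10-iteration loop that is a 20% misprediction rate even though the branch is 90% taken.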
Two-Bit Predictor
• Have to mispredict twice before the prediction changes: built-in hysteresis
• General case is an n-bit predictor
  • 0 to (2^n)-1 saturating counter
  • 0 to 2^(n-1)-1: predict taken
  • 2^(n-1) to (2^n)-1: predict not taken
• Experimental studies: 2-bit is as good as n-bit
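A sketch of the n-bit saturating counter, using the slides' convention that the lower half of the counter range predicts taken (taken outcomes move the counter toward 0). The struct and function names are illustrative.

```c
#include <assert.h>

typedef struct {
    int count;  /* saturating counter, 0 .. 2^n - 1 */
    int n;      /* counter width in bits */
} Predictor;

/* Slides' convention: 0 .. 2^(n-1)-1 predict taken,
   2^(n-1) .. 2^n - 1 predict not taken. */
int predict_taken(const Predictor *p) {
    return p->count < (1 << (p->n - 1));
}

/* Move toward 0 on a taken branch, toward 2^n - 1 on a not-taken
   branch, saturating at both ends. */
void update(Predictor *p, int taken) {
    if (taken) {
        if (p->count > 0) p->count--;
    } else {
        if (p->count < (1 << p->n) - 1) p->count++;
    }
}
```

With n = 2, a strongly-taken branch must mispredict twice before the prediction flips, which removes one of the two per-loop mispredictions of the 1-bit scheme.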
Implementing Branch Prediction Buffers
• A small cache accessed along with the instruction in IF
• Or, 2 additional bits in the instruction cache
• Note: a branch prediction buffer is not useful for the DLX pipeline
  • The branch target is not known any earlier than the branch condition
Prediction Performance
• 4096 entries in the prediction buffer
• SPEC89, IBM Power architecture

[Figure: misprediction rate (0% to 18%) for the SPEC89 benchmarks Nasa7, Matrix300, Tomcatv, Doduc, Spice, Fpppp, Gcc, Espresso, Eqntott, Li.]
Improving Branch Prediction
• Two ways: increase buffer size, improve accuracy

[Figure: misprediction rate (0% to 18%) for the same SPEC89 benchmarks, comparing a 4096-entry buffer against a buffer with infinite entries.]
Improving Prediction Accuracy
• Predict branches based on the outcomes of recent other branches:

if(aa == 2) {
    aa = 0;
}
if(bb == 2) {
    bb = 0;
}
if(aa == bb) {
    // Do something
}

• Correlating, or two-level, predictor
Two-Level Predictor
• There are effectively two predictors for each branch, selected by whether the previous branch was T/NT:

Prediction bits | Prediction if last branch NT | Prediction if last branch T
NT/NT           | NT                           | NT
NT/T            | NT                           | T
T/NT            | T                            | NT
T/T             | T                            | T
Two-Level Predictor (continued)
• The last predictor was a (1,1) predictor: one bit of history, one bit of prediction
• The general case is an (m,n) predictor: m bits of history, n bits of prediction
• How to implement? Keep an m-bit shift register of recent branch outcomes
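A sketch of a (2,2) correlating predictor built exactly this way: a global 2-bit shift register of recent outcomes selects one of four 2-bit counters per branch entry. The table size is illustrative, and (unlike the earlier slide's convention) this sketch uses the common convention that the upper half of the counter range predicts taken.

```c
#include <assert.h>

#define M       2     /* bits of global history */
#define ENTRIES 1024  /* branch entries (illustrative) */

typedef struct {
    unsigned history;                        /* m-bit shift register */
    unsigned char counter[ENTRIES][1 << M];  /* 2-bit counters, 0..3 */
} CorrelatingPredictor;

/* The branch entry picks a row; the global history picks the column.
   Counters 2 and 3 predict taken (common convention, assumed here). */
int predict(const CorrelatingPredictor *p, unsigned entry) {
    return p->counter[entry][p->history] >= 2;
}

void train(CorrelatingPredictor *p, unsigned entry, int taken) {
    unsigned char *c = &p->counter[entry][p->history];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the actual outcome into the m-bit history register */
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1u);
}
```

On a strictly alternating T/NT branch, the history disambiguates the two cases, so the predictor learns the pattern that a plain 2-bit counter cannot.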
Cost of Two-Level Predictor
• Number of bits required: num. branch entries x 2^m x n
• How many bits in a 4096-entry (0,2) predictor? 8K
• How many branch entries in an 8K-bit (2,2) predictor? 1K
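The cost formula above is a one-liner; the function name is illustrative.

```c
#include <assert.h>

/* Predictor cost in bits: entries x 2^m x n, as on the slide. */
long predictor_bits(long entries, int m, int n) {
    return entries * (1L << m) * n;
}
```

Both of the slide's examples land on the same 8K-bit budget: a (2,2) predictor pays for its 4 counters per entry by holding 4x fewer entries.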
Performance of (2,2) Predictor
[Figure: misprediction rate (0% to 20%) for the SPEC89 benchmarks Nasa7, Matrix300, Tomcatv, Doduc, Spice, Fpppp, Gcc, Espresso, Eqntott, Li, comparing a 4096-entry (0,2) predictor, an infinite-entry (0,2) predictor, and a 1K-entry (2,2) predictor.]
Branch Target Buffer
• A branch prediction buffer is not useful for DLX: we need to know the target address by the end of IF
• So, also store the branch target address: a branch target buffer, or branch target cache
• Access the branch target buffer in the IF cycle
  • Hit ==> the predicted branch target is known at the end of IF
  • We also need to know whether the branch is predicted T/NT
Branch Target Buffer (continued)
• No entry found ==> (Target = PC+4)
• An exact match of the PC is important, since we are predicting even before knowing that it is a branch instruction
• Hardware is similar to a cache: lookup based on the PC, yielding the predicted target
• Need to store the predicted PC only for taken predictions
Steps in Using a Target Buffer
[Flowchart, spanning the IF, ID, and EX stages:]
• IF: access the instruction cache and the target buffer in parallel
  • Entry found ==> use the predicted PC for the next fetch
  • Entry not found ==> normal execution (fetch PC+4)
• ID/EX: resolve whether it actually is a taken branch
  • Entry found, and it is a taken branch ==> correct prediction, proceed
  • Entry found, but not a taken branch ==> mispredicted branch; restart fetch, delete the buffer entry
  • Entry not found, but it is a taken branch ==> make a new target buffer entry
Penalties in Branch Prediction
• Given a prediction accuracy of p, a buffer hit rate of h, and a taken-branch frequency of f, what is the branch penalty?
  • h x (1-p) x 2 + (1-h) x f x 2

Buffer hit?       | Prediction correct? | Penalty
Yes               | Yes                 | 0
Yes               | No                  | 2
No (taken branch) | -                   | 2
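The penalty formula can be written out directly; the function name and the example parameter values below are illustrative, not measurements from the slides.

```c
#include <assert.h>

/* Average branch penalty in cycles, from the slide's formula:
   hit but mispredicted:     h * (1-p) * 2
   miss, and branch taken:   (1-h) * f * 2  */
double branch_penalty(double h, double p, double f) {
    return h * (1.0 - p) * 2.0 + (1.0 - h) * f * 2.0;
}
```

For instance, with a 90% hit rate, 90% accuracy, and 60% of branches taken, the average penalty is 0.9 x 0.1 x 2 + 0.1 x 0.6 x 2 = 0.3 cycles per branch.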
Storing Target Instructions
• Directly store target instructions instead of the target address
  • Target buffer access is now allowed to take longer
  • Or, branch folding can be achieved: replace the fetched instruction with the one found in the target buffer entry
  • Zero-cycle unconditional branch; may work for conditional branches as well