Machine Organization (CS 570)

Lecture 4: Pipelining*

Jeremy R. Johnson

Wed. Oct. 18, 2000

*This lecture was derived from material in the text (Chap. 3). All figures from Computer Architecture: A Quantitative Approach, Second Edition, by John Hennessy and David Patterson, are copyrighted material (COPYRIGHT 1996 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).


Introduction

• Objective: To understand pipelining and the enhanced performance it provides

• Pipelining is an implementation technique in which the execution of multiple instructions is overlapped. Each instruction is broken down into stages, and while one instruction occupies one stage, another instruction can occupy a different stage at the same time.

• Topics
– Review DLX
– Simple Implementation of DLX
– Basic Pipeline for DLX
– Pipeline hazards
– Floating point pipeline


Instruction Format

• R-type Instruction (register format - add, sub, …)
– Fields: op | rs | rt | rd | func

• I-type Instruction (immediate format - load, store, branch, immediate)
– Fields: op | rs | rt | Immediate

• J-type Instruction (jump, jal)
– Fields: op | offset added to PC


Implementation Stages

• Instruction Fetch Cycle (IF)
– IR ← Mem[PC]
– NPC ← PC + 4

• Instruction Decode/Register Fetch Cycle (ID)
– A ← Regs[IR6..10]
– B ← Regs[IR11..15]
– Imm ← ((IR16)^16 ## IR16..31), i.e. the 16-bit immediate sign-extended to 32 bits



• Execution/Effective Address Cycle (EX)
– Memory Reference:
  • ALUOutput ← A + Imm;
– Register-Register ALU Instruction:
  • ALUOutput ← A func B;
– Register-Immediate ALU Instruction:
  • ALUOutput ← A op Imm;
– Branch:
  • ALUOutput ← NPC + Imm;
  • Cond ← (A op 0);



• Memory Access/Branch Completion Cycle (MEM)
– Memory Reference:
  • LMD ← Mem[ALUOutput]; or
  • Mem[ALUOutput] ← B;
– Branch:
  • if (Cond) PC ← ALUOutput;

• Write-back Cycle (WB)
– Register-Register ALU Instruction:
  • Regs[IR16..20] ← ALUOutput;
– Register-Immediate ALU Instruction:
  • Regs[IR11..15] ← ALUOutput;
– Load Instruction:
  • Regs[IR11..15] ← LMD;
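The field extraction used in the ID and WB cycles above can be sketched in a few lines of Python. This is only an illustration, assuming DLX's bit numbering, in which bit 0 is the most significant bit of the 32-bit instruction word. The helper names (`bits`, `sign_extend16`, `decode`) and the opcode value in the sample word are hypothetical and do not come from the lecture.

```python
def bits(word, first, last):
    """Return bits first..last of a 32-bit word, where bit 0 is the MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def sign_extend16(value):
    """Sign-extend a 16-bit field, i.e. replicate its top bit 16 times."""
    return value - 0x10000 if value & 0x8000 else value


def decode(ir):
    """Split an instruction word into the fields read during ID and WB."""
    return {
        "op": bits(ir, 0, 5),
        "a_reg": bits(ir, 6, 10),     # A <- Regs[IR6..10]
        "b_reg": bits(ir, 11, 15),    # B <- Regs[IR11..15]
        "rd": bits(ir, 16, 20),       # R-type destination (IR16..20)
        "imm": sign_extend16(bits(ir, 16, 31)),
    }


# Hypothetical I-type word: op=0x23 (illustrative), rs=2, rt=1, immediate=-4
word = (0x23 << 26) | (2 << 21) | (1 << 16) | (-4 & 0xFFFF)
fields = decode(word)
print(fields)
```

All fields are extracted unconditionally, mirroring the fixed-field decoding of the ID cycle; which of them is meaningful depends on the instruction type.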


DLX Datapath


Simple DLX Pipeline

• Each stage (clock-cycle) becomes a pipeline stage
• Overlap execution of instructions
• Add registers between stages

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Instruction i         IF    ID    EX    MEM   WB
Instruction i+1             IF    ID    EX    MEM   WB
Instruction i+2                   IF    ID    EX    MEM   WB
Instruction i+3                         IF    ID    EX    MEM   WB
Instruction i+4                               IF    ID    EX    MEM   WB
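The staggered pattern in this diagram can be generated mechanically. The sketch below assumes an ideal pipeline with no stalls, where instruction i enters IF in cycle i + 1. The function names are illustrative only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(num_instructions, depth=len(STAGES)):
    """Cycles needed to complete all instructions when nothing stalls."""
    return depth + num_instructions - 1


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage in cycle c + 1."""
    n_cycles = total_cycles(num_instructions, len(stages))
    rows = []
    for i in range(num_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # each instruction starts one cycle later
        rows.append(row)
    return rows


rows = pipeline_diagram(5)
for i, row in enumerate(rows):
    print(f"i+{i}: " + " ".join(f"{cell:<4}" for cell in row))
```

With five instructions and five stages the last instruction finishes in cycle 9, matching the nine columns of the diagram.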


Overlap of Functional Units


Pipelined Datapath


Pipeline Performance

• Expect speedup equal to the number of pipe stages
– assumes equal sized tasks
– no additional overhead due to pipelining

• Speedup from pipelining (reduce CPI or decrease clock) = Avg. inst. ex. time unpipelined / Avg. inst. ex. time pipelined

• Example: 10 ns clock without pipelining, 11 ns with pipelining (to account for overhead). ALU operations (40%) and branches (20%) take 4 cycles; memory operations (40%) take 5.

• Speedup = 10 ns × ((0.4 + 0.2) × 4 + 0.4 × 5) / 11 ns = 44 ns / 11 ns = 4
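The arithmetic in this example can be checked with a short calculation. The sketch assumes the instruction mix sums to 100%, with memory operations at 40%, which is the value that yields the 44 ns figure. It also assumes the pipelined machine completes one instruction per 11 ns clock. The function name is illustrative.

```python
def avg_instruction_time(clock_ns, mix):
    """Average time per instruction; mix is a list of (fraction, cycles)."""
    return clock_ns * sum(fraction * cycles for fraction, cycles in mix)


# ALU 40% and branch 20% at 4 cycles, memory 40% at 5 cycles (assumed mix)
unpipelined_ns = avg_instruction_time(10, [(0.4, 4), (0.2, 4), (0.4, 5)])
pipelined_ns = 11  # ideal pipeline: one instruction per clock
speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 6), round(speedup, 6))
```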


Pipeline Hazards

• Situations in pipelining when the next instruction cannot execute in the following clock cycle

• Structural hazards
– hardware cannot support the combination of instructions that we want to execute in the same cycle

• Control hazards
– need to make a decision based on the results of one instruction while others are executing

• Data hazards
– an instruction depends on the results of a previous instruction still in the pipeline


Pipeline Performance II

• Must account for hazards
– Hazards introduce stall cycles in the pipeline

Speedup = Avg. inst. ex. time unpipelined / Avg. inst. ex. time pipelined

= (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

= [CPI unpipelined / (1 + Pipeline stall cycles per inst.)] × [Clock cycle unpipelined / Clock cycle pipelined]

= Pipeline depth / (1 + Pipeline stall cycles per inst.)

– The third line takes the ideal pipelined CPI to be 1; the last line further assumes equal clock cycles and an unpipelined CPI equal to the pipeline depth
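The final form of the speedup expression is easy to evaluate for different stall rates. The sketch below assumes the simplifications behind that form: an ideal pipelined CPI of 1, equal clock cycle times, and an unpipelined CPI equal to the pipeline depth. The function name and the sample stall rates are illustrative.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return depth / (1 + stall_cycles_per_instruction)


for stalls in (0.0, 0.25, 0.5, 1.0):
    print(stalls, round(pipeline_speedup(5, stalls), 3))
```

With no stalls a five-stage pipeline reaches its ideal speedup of 5; one stall cycle per instruction on average halves it.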


Structural Hazards

• Problem: conflict in resources

• Example: Suppose that a single memory is shared by instructions and data in the pipeline. A data access (MEM) then conflicts with an instruction fetch (IF) in the same cycle

• Solution: remove the conflicting stages, redesign the resource into separate resources (e.g. separate instruction and data memories), or replicate the resource


Structural Hazard


Data Hazards

• Problem: Instruction depends on the result of a previous instruction still in the pipeline

• Example:
– add R1, R2, R3
– sub R5, R1, R4

• Solutions:
– forwarding or bypassing
– instruction reordering to remove dependencies


Data Hazard Example

– add R1, R2, R3
– sub R4, R1, R5
– and R6, R1, R7
– or R8, R1, R9
– xor R10, R1, R11


Data Dependencies


Data Forwarding


Implementing Forwarding

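• Detection – e.g. EX/MEM.IR16..20 = ID/EX.IR6..10 (the destination register of the instruction in EX/MEM matches a source register of the instruction in ID/EX)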

• Use multiplexor to select forwarded results
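The detection and selection steps can be sketched as a small function that decides where each ALU operand should come from. This is a simplified model and not the DLX control logic: it assumes every instruction in the pipeline registers writes its `dest` register, and it ignores loads and the special case of R0. The `Instr` tuple and the returned labels are illustrative.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest src1 src2")


def forward_source(src, ex_mem, mem_wb):
    """Choose the multiplexor input for one ALU operand.

    The most recent producer (EX/MEM) takes priority over MEM/WB.
    """
    if ex_mem is not None and ex_mem.dest == src:
        return "EX/MEM"
    if mem_wb is not None and mem_wb.dest == src:
        return "MEM/WB"
    return "REGFILE"


add = Instr("add", "R1", "R2", "R3")
sub = Instr("sub", "R4", "R1", "R5")
and_ = Instr("and", "R6", "R1", "R7")

# sub is in EX while add is in the EX/MEM register
print(forward_source(sub.src1, ex_mem=add, mem_wb=None))
# and is in EX while sub is in EX/MEM and add is in MEM/WB
print(forward_source(and_.src1, ex_mem=sub, mem_wb=add))
```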


Data Hazard with Stall

– lw R1, 0(R2)
– sub R4, R1, R5
– and R6, R1, R7
– or R8, R1, R9

Without stall:

Instruction Number    1     2     3     4     5     6     7     8
lw R1, 0(R2)          IF    ID    EX    MEM   WB
sub R4, R1, R5              IF    ID    EX    MEM   WB
and R6, R1, R7                    IF    ID    EX    MEM   WB
or R8, R1, R9                           IF    ID    EX    MEM   WB

With stall:

Instruction Number    1     2     3     4     5     6     7     8     9
lw R1, 0(R2)          IF    ID    EX    MEM   WB
sub R4, R1, R5              IF    ID    Stall EX    MEM   WB
and R6, R1, R7                    IF    Stall ID    EX    MEM   WB
or R8, R1, R9                           Stall IF    ID    EX    MEM   WB


Compiler Scheduling for Data Hazards

• Data hazards are naturally generated
– C = A + B
  • lw R1, A
  • lw R2, B
  • add R3, R1, R2
  • sw C, R3

• Compiler can reorder instructions to remove dependencies
– a = b + c; d = e - f;
  • lw R1, b
  • lw R2, c
  • lw R3, e
  • add R5, R1, R2
  • lw R4, f
  • sw a, R5
  • sub R6, R3, R4
  • sw d, R6
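The benefit of the reordered sequence can be checked by counting load-use stalls. The sketch below assumes full forwarding, so the only remaining stall is one cycle when an instruction uses the result of the load immediately before it. The `unscheduled` sequence is an assumed straightforward translation of the two statements, not taken from the slide. The tuple layout (op, destination, sources) is illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by using a load result in the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# Assumed naive translation of: a = b + c; d = e - f;
unscheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()),
    ("add", "R5", ("R1", "R2")), ("sw", None, ("R5",)),
    ("lw", "R3", ()), ("lw", "R4", ()),
    ("sub", "R6", ("R3", "R4")), ("sw", None, ("R6",)),
]

# Reordered sequence from the slide
scheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("lw", "R3", ()),
    ("add", "R5", ("R1", "R2")), ("lw", "R4", ()),
    ("sw", None, ("R5",)), ("sub", "R6", ("R3", "R4")),
    ("sw", None, ("R6",)),
]

print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```

Under these assumptions the naive order stalls before the add and before the sub, while the scheduled order places an independent instruction after every load.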


Effectiveness of Scheduling


Control Hazards

• Problem: The next instruction to enter the pipeline may depend on an instruction that is still executing, so we may have to wait until one of its stages completes before we know which instruction to fetch next

• Example: branch instruction

• Solutions:
– Stall - operate sequentially until decision can be made (wastes time)
– Predict - guess what to do next. If the guess is correct, operate normally; if the guess is wrong, clear the pipe and begin again
– Compute address of branch target earlier


Pipeline Stall for Branch

• Stall pipeline until MEM stage, which determines new PC

• Don’t stall until a branch is detected (ID)

• 3 cycles lost per branch is significant
– With a 30% branch frequency and an ideal CPI of 1, branch stalls raise the CPI to 1 + 0.3 × 3 = 1.9, so the machine achieves only about 1/2 of the ideal speedup

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Branch instruction    IF    ID    EX    MEM   WB
Branch successor            IF    Stall Stall IF    ID    EX    MEM   WB
Branch successor + 1                                IF    ID    EX    MEM
Branch successor + 2                                      IF    ID    EX
Branch successor + 3                                            IF    ID
Branch successor + 4                                                  IF


Computing the Taken PC Earlier

• Can detect branch condition (BEQZ, BNEZ) during ID
• Need extra adder to compute branch target during ID
• This reduces stall to one cycle
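The effect of shortening the branch stall can be estimated with a one-line CPI model. The sketch assumes an ideal CPI of 1, that every branch pays the full stall, and the 30% branch frequency used in the earlier stall example. The function name is illustrative.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles, ideal_cpi=1.0):
    """Effective CPI when every branch stalls the pipeline for stall_cycles."""
    return ideal_cpi + branch_frequency * stall_cycles


cpi_resolved_in_mem = cpi_with_branch_stalls(0.30, 3)  # new PC known in MEM
cpi_resolved_in_id = cpi_with_branch_stalls(0.30, 1)   # new PC known in ID
print(round(cpi_resolved_in_mem, 2), round(cpi_resolved_in_id, 2))
```

Under these assumptions, resolving the branch in ID lowers the effective CPI from about 1.9 to about 1.3.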


Compile Time Branch Prediction

• Assume either that the branch is taken or not taken
• Proceed under this assumption - if wrong “back out” and start over.

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Untaken branch inst   IF    ID    EX    MEM   WB
Instruction + 1             IF    ID    EX    MEM   WB
Instruction + 2                   IF    ID    EX    MEM   WB
Instruction + 3                         IF    ID    EX    MEM   WB
Instruction + 4                               IF    ID    EX    MEM   WB

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Taken branch inst     IF    ID    EX    MEM   WB
Instruction + 1             IF    Idle  Idle  Idle  Idle
Branch Target                     IF    ID    EX    MEM   WB
Branch Target + 1                       IF    ID    EX    MEM   WB
Branch Target + 2                             IF    ID    EX    MEM   WB


Delayed Branch

• Instruction after branch (branch delay slot) is executed no matter what the outcome of the branch is
• Requires that the instruction in the branch delay slot is safe to execute independent of branch
• Effectiveness depends on compiler

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Untaken branch inst   IF    ID    EX    MEM   WB
branch delay inst           IF    ID    EX    MEM   WB
Instruction + 2                   IF    ID    EX    MEM   WB
Instruction + 3                         IF    ID    EX    MEM   WB
Instruction + 4                               IF    ID    EX    MEM   WB

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Taken branch inst     IF    ID    EX    MEM   WB
branch delay inst           IF    ID    EX    MEM   WB
Branch Target                     IF    ID    EX    MEM   WB
Branch Target + 1                       IF    ID    EX    MEM   WB
Branch Target + 2                             IF    ID    EX    MEM   WB


Designing Instruction Sets (MIPS) for Pipelining

• Want to break down instruction execution into a reasonable number of stages of roughly equal complexity

• All instructions the same length
– easier to fetch and decode

• Few instruction formats (source register fields are located in the same place)
– can begin reading registers at the same time the instruction is decoded

• Memory operands appear only in loads and stores
– calculate the address during the execute stage and access memory in the following stage; otherwise the pipeline would expand to an address stage, a memory stage, and an execute stage

• Operands must be aligned in memory
– don’t have to worry about a single data transfer instruction requiring two data memory accesses; hence, it requires a single pipeline stage
