Oct. 18, 2000 Machine Organization 1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived from material in the text (Chap. 3). All figures from Computer Architecture: A Quantitative Approach, Second Edition, by John Hennessy and David Patterson, are copyrighted material (COPYRIGHT 1996 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).




Introduction

• Objective: To understand pipelining and the enhanced performance it provides

• Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Instructions are broken down into stages, and while one instruction is executing one stage, another instruction can simultaneously execute another stage.

• Topics
– Review DLX
– Simple Implementation of DLX
– Basic Pipeline for DLX
– Pipeline hazards
– Floating point pipeline


Instruction Format

• R-type Instruction (register format - add, sub, …)
– Fields: op | rs | rt | rd | func

• I-type Instruction (immediate format - load, store, branch, immediate)
– Fields: op | rs | rt | Immediate

• J-type Instruction (jump, jal)
– Fields: op | offset added to PC
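The three formats can be illustrated with a small field decoder. This is only a sketch: the slide does not give field widths, so the 6/5/5/5/11 (R-type), 6/5/5/16 (I-type) and 6/26 (J-type) splits are assumed from the usual 32-bit DLX layout, bit 0 is taken to be the most significant bit, and the function names are hypothetical.

```python
def field(word, first, last):
    """Extract bits first..last of a 32-bit word (bit 0 = most significant bit)."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)

def decode_r(word):
    # Assumed R-type layout: op(6) rs(5) rt(5) rd(5) func(11)
    return {"op": field(word, 0, 5), "rs": field(word, 6, 10),
            "rt": field(word, 11, 15), "rd": field(word, 16, 20),
            "func": field(word, 21, 31)}

def decode_i(word):
    # Assumed I-type layout: op(6) rs(5) rt(5) immediate(16)
    return {"op": field(word, 0, 5), "rs": field(word, 6, 10),
            "rt": field(word, 11, 15), "imm": field(word, 16, 31)}

def decode_j(word):
    # Assumed J-type layout: op(6) offset(26)
    return {"op": field(word, 0, 5), "offset": field(word, 6, 31)}
```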


Implementation Stages

• Instruction Fetch Cycle (IF)
– IR ← Mem[PC]
– NPC ← PC + 4

• Instruction Decode/Register Fetch Cycle (ID)
– A ← Regs[IR6..10]
– B ← Regs[IR11..15]
– Imm ← ((IR16)^16 ## IR16..31)


Implementation Stages

• Execution/Effective Address Cycle (EX)
– Memory Reference:
  • ALUOutput ← A + Imm;
– Register-Register ALU Instruction:
  • ALUOutput ← A func B;
– Register-Immediate ALU Instruction:
  • ALUOutput ← A op Imm;
– Branch:
  • ALUOutput ← NPC + Imm;
  • Cond ← (A op 0);


Implementation Stages

• Memory Access/Branch Completion Cycle (MEM)
– Memory Reference:
  • LMD ← Mem[ALUOutput]; or
  • Mem[ALUOutput] ← B;
– Branch:
  • if (Cond) PC ← ALUOutput;

• Write-back Cycle (WB)
– Register-Register ALU Instruction:
  • Regs[IR16..20] ← ALUOutput;
– Register-Immediate ALU Instruction:
  • Regs[IR11..15] ← ALUOutput;
– Load Instruction:
  • Regs[IR11..15] ← LMD;
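The five cycles can be traced in a few lines of code. The sketch below is an illustration, not the DLX implementation: instructions are hypothetical symbolic dictionaries rather than encoded words, the branch is assumed to be BEQZ-style (taken when A is zero), and the function returns the next PC instead of writing a PC register.

```python
import operator

ALU = {"add": operator.add, "sub": operator.sub,
       "and": operator.and_, "or": operator.or_}

def execute(regs, mem, program, pc=0):
    """Run one instruction through IF, ID, EX, MEM, WB and return the next PC."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4
    ir = program[pc // 4]
    npc = pc + 4
    # ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- immediate field
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)
    kind = ir["kind"]
    # EX: compute ALUOutput (and Cond for branches)
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu_rr":
        alu_output = ALU[ir["func"]](a, b)
    elif kind == "alu_ri":
        alu_output = ALU[ir["func"]](a, imm)
    else:  # branch, assumed BEQZ-style
        alu_output = npc + imm
        cond = (a == 0)
    # MEM: load, store, or complete the branch
    if kind == "load":
        lmd = mem[alu_output]
    elif kind == "store":
        mem[alu_output] = b
    elif kind == "branch" and cond:
        npc = alu_output
    # WB: write the result register
    if kind == "alu_rr":
        regs[ir["rd"]] = alu_output
    elif kind == "alu_ri":
        regs[ir["rt"]] = alu_output
    elif kind == "load":
        regs[ir["rt"]] = lmd
    return npc
```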


DLX Datapath


Simple DLX Pipeline

• Each stage (clock-cycle) becomes a pipeline stage
• Overlap execution of instructions
• Add registers between stages

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Instruction i         IF    ID    EX    MEM   WB
Instruction i+1             IF    ID    EX    MEM   WB
Instruction i+2                   IF    ID    EX    MEM   WB
Instruction i+3                         IF    ID    EX    MEM   WB
Instruction i+4                               IF    ID    EX    MEM   WB
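The overlap in the table follows a simple rule: instruction i enters stage s in clock cycle i + s + 1 (counting both from zero). A minimal sketch that regenerates the chart under that rule, assuming no stalls; the function name is hypothetical:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = num_instructions + len(STAGES) - 1
    chart = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i is one cycle behind instruction i - 1
        chart.append(row)
    return chart

for i, row in enumerate(pipeline_chart(5)):
    print(f"Instruction i+{i}: " + " ".join(f"{cell:<4}" for cell in row))
```

Five instructions finish in 9 cycles rather than the 25 that a non-overlapped five-cycle implementation would need.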


Overlap of Functional Units


Pipelined Datapath


Pipeline Performance

• Expect speedup equal to the number of pipe stages
– assumes equal sized tasks
– no additional overhead due to pipelining

• Speedup from pipelining (reduce CPI or decrease clock cycle time)
= Avg. inst. exec. time unpipelined / Avg. inst. exec. time pipelined

• Example: 10 ns clock without pipelining, 11 ns with pipelining (to account for overhead). ALU (40%) and Branch (20%) instructions take 4 cycles; Memory (40%) instructions take 5.

• Speedup = 10 ns × ((0.4 + 0.2) × 4 + 0.4 × 5) / 11 ns = 44 ns / 11 ns = 4
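The arithmetic in the example can be checked directly. In this sketch the memory fraction is taken as 40%, an assumption made so that the instruction mix sums to 100% and the unpipelined time comes out to the 44 ns used in the slide; the variable names are illustrative.

```python
def average_cycles(mix):
    """mix: list of (fraction, cycles) pairs; returns the weighted-average cycle count."""
    return sum(fraction * cycles for fraction, cycles in mix)

unpipelined_clock_ns = 10.0
pipelined_clock_ns = 11.0  # 10 ns plus 1 ns of pipelining overhead

# ALU 40% and branch 20% at 4 cycles; memory operations (assumed 40%) at 5 cycles.
mix = [(0.4, 4), (0.2, 4), (0.4, 5)]

unpipelined_time_ns = unpipelined_clock_ns * average_cycles(mix)  # about 44 ns
pipelined_time_ns = pipelined_clock_ns * 1.0                      # ideal CPI of 1
speedup = unpipelined_time_ns / pipelined_time_ns                 # about 4
```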


Pipeline Hazards

• Situations in pipelining when the next instruction cannot execute in the following clock cycle

• Structural hazards
– hardware cannot support the combination of instructions that we want to execute in the same cycle

• Control hazards
– need to make a decision based on the results of one instruction while others are executing

• Data hazards
– an instruction depends on the results of a previous instruction still in the pipeline


Pipeline Performance II

• Must account for hazards
– Hazards introduce stall cycles in the pipeline

Speedup
= Avg. inst. exec. time unpipelined / Avg. inst. exec. time pipelined

= (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

= [CPI unpipelined / (1 + Pipeline stall cycles per inst.)] × [Clock cycle unpipelined / Clock cycle pipelined]

= Pipeline depth / (1 + Pipeline stall cycles per inst.), when the clock cycles are equal and the unpipelined CPI equals the pipeline depth
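The final form of the speedup expression is easy to evaluate. A minimal sketch, assuming an ideal pipelined CPI of 1 and an unpipelined CPI equal to the pipeline depth; the function and parameter names are hypothetical, and the optional `clock_ratio` stands for the unpipelined clock cycle divided by the pipelined one.

```python
def pipeline_speedup(depth, stall_cycles_per_inst, clock_ratio=1.0):
    """Speedup = depth / (1 + stall cycles per instruction) * clock_ratio."""
    return depth / (1.0 + stall_cycles_per_inst) * clock_ratio

no_stalls = pipeline_speedup(5, 0.0)    # 5.0: the ideal case equals the depth
one_stall = pipeline_speedup(5, 1.0)    # 2.5: one stall cycle per instruction halves it
```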


Structural Hazards

• Problem: conflict in resources

• Example: Suppose that the instruction and data memories were shared in the pipeline. A data access then conflicts with an instruction fetch in the same cycle.

• Solution: remove the conflicting stages, redesign the hardware to separate the resources, or replicate resources
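The shared-memory conflict can be located mechanically. The sketch below assumes the five-stage pipeline with no other stalls, so instruction k (counting from zero) is fetched in cycle k + 1 and reaches MEM in cycle k + 4, which is the cycle in which instruction k + 3 is fetched. The function name and the `lw`/`sw` opcode strings are illustrative.

```python
def memory_conflicts(ops):
    """Return (data-access index, fetch index) pairs that need the single memory in the same cycle."""
    conflicts = []
    for k, op in enumerate(ops):
        # Only loads and stores use memory in MEM; the fetch three instructions later collides.
        if op in ("lw", "sw") and k + 3 < len(ops):
            conflicts.append((k, k + 3))
    return conflicts

print(memory_conflicts(["lw", "add", "sub", "and", "or"]))  # [(0, 3)]
```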


Structural Hazard


Data Hazards

• Problem: Instruction depends on the result of a previous instruction still in the pipeline

• Example:
– add R1, R2, R3
– sub R5, R1, R4

• Solutions:
– forwarding or bypassing
– instruction reordering to remove dependencies
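A read-after-write dependence like the one in the example can be found by comparing each source register with the destinations of the instructions just ahead. This is a sketch under an assumption: `window=2` supposes the register file is written in the first half of a cycle and read in the second half, so only the two preceding instructions can still be holding a needed result. The tuple format and function name are hypothetical.

```python
def raw_hazards(program, window=2):
    """program: list of (op, dest, src1, src2) tuples.

    Returns (producer index, consumer index, register) triples for every source
    register written by one of the `window` preceding instructions.
    """
    hazards = []
    for j, (_, _, *sources) in enumerate(program):
        for src in sources:
            # Search backwards so the most recent writer of src is reported.
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i][1] == src:
                    hazards.append((i, j, src))
                    break
    return hazards

program = [("add", "R1", "R2", "R3"),
           ("sub", "R5", "R1", "R4")]
print(raw_hazards(program))  # [(0, 1, 'R1')]
```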


Data Hazard Example

– add R1, R2, R3
– sub R4, R1, R5
– and R6, R1, R7
– or R8, R1, R9
– xor R10, R1, R11


Data Dependencies


Data Forwarding


Implementing Forwarding

• Detection
– e.g. EX/MEM.IR16..20 = ID/EX.IR6..10 (the destination of the instruction in MEM matches a source of the instruction in EX)

• Use multiplexor to select forwarded results
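The detection test and the multiplexor choice can be written as one selection function. This is a simplified sketch, not the full DLX forwarding logic: it handles a single ALU operand, takes register numbers already extracted from the pipeline registers, and the returned strings are only labels for the multiplexor inputs. Register 0 is excluded because it always reads as zero in DLX.

```python
def forward_select(id_ex_src, ex_mem_dest, mem_wb_dest):
    """Pick the source of one ALU operand for the instruction entering EX.

    id_ex_src   : source register number (e.g. ID/EX.IR6..10)
    ex_mem_dest : destination of the instruction one ahead (e.g. EX/MEM.IR16..20), or None
    mem_wb_dest : destination of the instruction two ahead, or None
    """
    if ex_mem_dest not in (None, 0) and ex_mem_dest == id_ex_src:
        return "EX/MEM"         # most recent result takes priority
    if mem_wb_dest not in (None, 0) and mem_wb_dest == id_ex_src:
        return "MEM/WB"
    return "register file"      # no forwarding needed
```

Checking EX/MEM before MEM/WB matters when both earlier instructions write the same register: the younger result is the one the program order requires.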


Data Hazard with Stall

– lw R1, 0(R2)
– sub R4, R1, R5
– and R6, R1, R7
– or R8, R1, R9

Without the stall:

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8
lw R1, 0(R2)          IF    ID    EX    MEM   WB
sub R4, R1, R5              IF    ID    EX    MEM   WB
and R6, R1, R7                    IF    ID    EX    MEM   WB
or R8, R1, R9                           IF    ID    EX    MEM   WB

With the stall:

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
lw R1, 0(R2)          IF    ID    EX    MEM   WB
sub R4, R1, R5              IF    ID    Stall EX    MEM   WB
and R6, R1, R7                    IF    Stall ID    EX    MEM   WB
or R8, R1, R9                           Stall IF    ID    EX    MEM   WB
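The one-cycle load delay can be reproduced with a small timing model. The sketch assumes full forwarding, so the only wait is for a value loaded from memory: it leaves MEM one cycle after the load's EX, and a dependent instruction's EX must therefore come at least two cycles after the load's EX. The tuple format and function name are hypothetical.

```python
def ex_cycles(program):
    """program: list of (op, dest, sources). Returns the EX clock cycle of each instruction."""
    cycles = []
    for i, (_, _, sources) in enumerate(program):
        # In-order pipeline: the first EX is in cycle 3, then at most one EX per cycle.
        earliest = 3 if i == 0 else cycles[-1] + 1
        for j in range(i):
            op_j, dest_j, _ = program[j]
            if op_j == "lw" and dest_j in sources:
                earliest = max(earliest, cycles[j] + 2)  # wait for the load's MEM stage
        cycles.append(earliest)
    return cycles

program = [
    ("lw",  "R1", ["R2"]),
    ("sub", "R4", ["R1", "R5"]),
    ("and", "R6", ["R1", "R7"]),
    ("or",  "R8", ["R1", "R9"]),
]
print(ex_cycles(program))  # [3, 5, 6, 7]
```

The sub executes in cycle 5 instead of 4, and every later instruction slips by the same single cycle, matching the stalled timing.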


Compiler Scheduling for Data Hazards

• Data hazards are naturally generated
– C = A + B
  • lw R1, A
  • lw R2, B
  • add R3, R1, R2
  • sw C, R3

• Compiler can reorder instructions to remove dependencies
– a = b + c; d = e - f;
  • lw R1, b
  • lw R2, c
  • lw R3, e
  • add R5, R1, R2
  • lw R4, f
  • sw a, R5
  • sub R6, R3, R4
  • sw d, R6
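The benefit of the reordering can be counted. The sketch below assumes forwarding, so a stall occurs only when an instruction immediately follows a load and reads the loaded register. The "unscheduled" order is a hypothetical straightforward translation added for comparison; only the scheduled order comes from the slide.

```python
def load_use_stalls(program):
    """Count instructions that directly follow a load and read its destination register."""
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, sources) in zip(program, program[1:]):
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls

# Hypothetical naive order for a = b + c; d = e - f (not from the slide).
unscheduled = [
    ("lw", "R1", []), ("lw", "R2", []), ("add", "R5", ["R1", "R2"]), ("sw", None, ["R5"]),
    ("lw", "R3", []), ("lw", "R4", []), ("sub", "R6", ["R3", "R4"]), ("sw", None, ["R6"]),
]
# Scheduled order as listed on the slide.
scheduled = [
    ("lw", "R1", []), ("lw", "R2", []), ("lw", "R3", []), ("add", "R5", ["R1", "R2"]),
    ("lw", "R4", []), ("sw", None, ["R5"]), ("sub", "R6", ["R3", "R4"]), ("sw", None, ["R6"]),
]

print(load_use_stalls(unscheduled), load_use_stalls(scheduled))  # 2 0
```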


Effectiveness of Scheduling


Control Hazards

• Problem: The next element to go into the pipe may depend on a currently executing instruction, or we may have to wait until a stage is completed to determine the next stage

• Example: branch instruction

• Solutions:
– Stall - operate sequentially until the decision can be made (wastes time)
– Predict - guess what to do next. If the guess is correct, operate normally; if the guess is wrong, clear the pipe and begin again
– Compute address of branch target earlier


Pipeline Stall for Branch

• Stall pipeline until MEM stage, which determines new PC

• Don’t stall until a branch is detected (ID)

• 3 cycles lost per branch is significant
– With a 30% branch frequency and an ideal CPI of 1, the CPI with branch stalls is 1 + 0.3 × 3 = 1.9, so the machine achieves only about 1/2 of the ideal speedup

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Branch instruction    IF    ID    EX    MEM   WB
Branch successor            IF    Stall Stall IF    ID    EX    MEM   WB
Branch successor + 1                                IF    ID    EX    MEM
Branch successor + 2                                      IF    ID    EX
Branch successor + 3                                            IF    ID
Branch successor + 4                                                  IF


Computing the Taken PC Earlier

• Can detect branch condition (BEQZ, BNEZ) during ID
• Need extra adder to compute branch target during ID
• This reduces stall to one cycle
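The effect of shortening the branch stall can be compared numerically. A minimal sketch using the 30% branch frequency quoted earlier and an ideal CPI of 1; it assumes every branch pays the full penalty, and the function name is hypothetical.

```python
def cpi_with_branches(branch_frequency, penalty_cycles, ideal_cpi=1.0):
    """CPI when every branch adds penalty_cycles stall cycles."""
    return ideal_cpi + branch_frequency * penalty_cycles

stall_until_mem = cpi_with_branches(0.30, 3)  # about 1.9
resolve_in_id = cpi_with_branches(0.30, 1)    # about 1.3

# Fraction of the ideal (CPI = 1) performance that each scheme retains.
print(1.0 / stall_until_mem, 1.0 / resolve_in_id)
```

Resolving the branch in ID keeps roughly 77% of the ideal performance, compared with roughly 53% when the pipeline stalls until MEM.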


Compile Time Branch Prediction

• Assume either that the branch is taken or not taken
• Proceed under this assumption - if wrong “back out” and start over.

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Untaken branch inst   IF    ID    EX    MEM   WB
Instruction + 1             IF    ID    EX    MEM   WB
Instruction + 2                   IF    ID    EX    MEM   WB
Instruction + 3                         IF    ID    EX    MEM   WB
Instruction + 4                               IF    ID    EX    MEM   WB

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Taken branch inst     IF    ID    EX    MEM   WB
Instruction + 1             IF    Idle  Idle  Idle  Idle
Branch Target                     IF    ID    EX    MEM   WB
Branch Target + 1                       IF    ID    EX    MEM   WB
Branch Target + 2                             IF    ID    EX    MEM   WB


Delayed Branch

• Instruction after branch (branch delay slot) is executed no matter what the outcome of the branch is
• Requires that the instruction in the branch delay slot is safe to execute independent of branch
• Effectiveness depends on compiler

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Untaken branch inst   IF    ID    EX    MEM   WB
branch delay inst           IF    ID    EX    MEM   WB
Instruction + 2                   IF    ID    EX    MEM   WB
Instruction + 3                         IF    ID    EX    MEM   WB
Instruction + 4                               IF    ID    EX    MEM   WB

                      Clock Cycle
Instruction Number    1     2     3     4     5     6     7     8     9
Taken branch inst     IF    ID    EX    MEM   WB
branch delay inst           IF    ID    EX    MEM   WB
Branch Target                     IF    ID    EX    MEM   WB
Branch Target + 1                       IF    ID    EX    MEM   WB
Branch Target + 2                             IF    ID    EX    MEM   WB


Designing Instruction Sets (MIPS) for Pipelining

• Want to break down instruction execution into a reasonable number of stages of roughly equal complexity

• All instructions the same length
– easier to fetch and decode

• Few instruction formats (source register fields are located in the same place)
– can begin reading registers at the same time the instruction is decoded

• Memory operands appear only in loads and stores
– calculate the address during the execute stage and access memory in the following stage; otherwise the pipeline would expand to an address stage, a memory stage and an execute stage

• Operands must be aligned in memory
– don’t have to worry about a single data transfer instruction requiring two data memory accesses; hence, the data transfer requires only a single pipeline stage
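The alignment point can be made concrete by counting how many memory words one transfer touches. This sketch assumes a 4-byte-wide data memory, which the slide does not state; the function names are hypothetical.

```python
def is_aligned(address, size_bytes):
    """True when the operand starts on a multiple of its own size."""
    return address % size_bytes == 0

def accesses_needed(address, size_bytes, memory_width=4):
    """Number of memory_width-byte words that a transfer of size_bytes at address touches."""
    first_word = address // memory_width
    last_word = (address + size_bytes - 1) // memory_width
    return last_word - first_word + 1

print(accesses_needed(8, 4), accesses_needed(6, 4))  # 1 2
```

An aligned word needs one access and fits in a single MEM stage; the misaligned word at address 6 straddles two memory words and would need two.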