Computer Architecture and Engineering Lecture 7 Pipelining Ics152/fa05/lecnotes/lec4-1.pdf ·...

UC Regents Fall 2005 © UCB

CS 152 L7: Pipelining I

2005-9-20

John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 152 Computer Architecture and Engineering

Lecture 7 – Pipelining I

www-inst.eecs.berkeley.edu/~cs152/

TAs: David Marquardt and Udam Saini


Office Hours Change

David: W 3-4, Th 3-4, 125 Cory

Udam: W 3-5, 125 Cory; Tu 10-12, 345 Soda

John: Mon 9:30-10:30 AM, 315 Soda


Last Time: Performance Equation

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Goal is to optimize execution time, not individual equation terms.

The CPI of the program. Reflects the program’s instruction mix.

Machines are optimized with respect to program workloads.

Clock period. Optimize jointly with machine CPI.
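A minimal sketch of why the goal is total execution time rather than any single term. The instruction count, CPI values, and clock periods below are made-up numbers for two hypothetical machines, not figures from the lecture.

```python
def execution_time(instructions, cpi, clock_period_s):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * clock_period_s


# Two hypothetical machines running the same 1,000,000-instruction program.
machine_a = execution_time(1_000_000, cpi=1.0, clock_period_s=10e-9)  # best CPI, slow clock
machine_b = execution_time(1_000_000, cpi=1.5, clock_period_s=5e-9)   # worse CPI, fast clock

print(f"A: {machine_a * 1e3:.2f} ms, B: {machine_b * 1e3:.2f} ms")
```

With these assumed numbers, machine B finishes first even though its CPI is worse, because the product of the terms is what matters.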


Today: Introduction to Pipelining

How to apply the performance equation to our single-cycle CPU.

Why pipelining is hard: data hazards, control hazards, structural hazards.

Pipelining: an idea from assembly line production applied to CPU design

Also: Introduction to Lab 3


Note: Reading is Fundamental ...

The lectures are a gentle introduction, to prepare you to read the book ...

The book presentation of pipelined processors is sufficient to do Lab 3.

These lectures are not.


Recall: Our single-cycle processor

[Figure: single-cycle datapath. PC register (D, Q) with +0x4 adder; Instr Mem (Addr, Data); RegFile (rs1, rs2, ws, wd, WE, rd1, rd2); Ext; 32-bit ALU (op); Data Memory (Addr, Din, Dout, WE); MemToReg mux.]

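$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Cycles/Instruction: CPI == 1. This is good.

Seconds/Cycle: Slow. This is bad.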

Challenge: Speed up clock while keeping CPI == 1


Recall: An R-format CPU design

[Figure: R-format datapath. RegFile (5-bit rs1, rs2, ws; 32-bit rd1, rd2, wd; WE) feeding a 32-bit ALU (op), with decode logic.]

Instruction fields: opcode | rs | rt | rd | shamt | funct

Decode fields to get: ADD $8 $9 $10


Reminder: How data flows after posedge

[Figure: R-format datapath with PC register (D, Q), +0x4 adder, Instr Mem (Addr, Data), decode logic, RegFile, and 32-bit ALU (op).]


Next posedge: Update state and repeat

[Figure: RegFile (rs1, rs2, ws, wd, WE, rd1, rd2) and PC register (D, Q).]


Observation: Logic idle most of cycle

[Figure: single-cycle datapath. PC register (D, Q) with +0x4 adder; Instr Mem (Addr, Data); RegFile (rs1, rs2, ws, wd, WE, rd1, rd2); Ext; 32-bit ALU (op); Data Memory (Addr, Din, Dout, WE); MemToReg mux.]

For most of the cycle, the ALU is either “waiting” for its inputs or “holding” its output.

Ideal: a CPU architecture where each part is always “working”.


Inspiration: Automobile assembly line

Assembly line moves on a steady clock.

Each station does the same task on each car.

[Figure: car body shell and car chassis meet at a merge station, followed by a bolting station; the clock paces the line.]


Inspiration: Automobile assembly line

Simpler station tasks → more cars per hour.

Simple tasks take less time, clock is faster.


Inspiration: Automobile assembly line

Line speed limited by slowest task.

Most efficient if all tasks take same time to do.


Inspiration: Automobile assembly line

Simpler tasks, complex car → long line!

These lines go 24 x 7, and rarely shut down.

Why?


Lessons from car assembly lines

Faster line movement yields more cars per hour off the line.

Faster line movement requires more stages, each doing simpler tasks.

To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle).

“Filling”, “flushing”, and “stalling” assembly line are all bad news.
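The balance lesson can be sketched numerically. This is an illustration only: the stage delays are invented, and `clock_period` and `idle_fraction` are hypothetical helper names, not anything defined in the course.

```python
def clock_period(stage_delays_ns):
    """The line (or clock) can only advance as fast as the slowest stage allows."""
    return max(stage_delays_ns)


def idle_fraction(stage_delays_ns):
    """Fraction of stage-time wasted because fast stages wait for the slowest one."""
    period = clock_period(stage_delays_ns)
    busy = sum(stage_delays_ns)
    return 1 - busy / (period * len(stage_delays_ns))


# Same total work (15 ns) split two ways across five stages (assumed numbers).
unbalanced = [1, 2, 8, 2, 2]
balanced = [3, 3, 3, 3, 3]

for name, delays in [("unbalanced", unbalanced), ("balanced", balanced)]:
    print(f"{name}: period = {clock_period(delays)} ns, "
          f"idle = {idle_fraction(delays):.1%}")
```

Under these assumptions the unbalanced split is held to an 8 ns period by its one slow stage, while the balanced split runs at 3 ns with no idle time.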


Key Analogy: The instruction is the car

[Figure: PC, +0x4 adder, and Instr Mem form Pipeline Stage #1 (Instruction Fetch). The fetched instruction moves through a chain of IR registers, one per stage; the IR at Stage #2 controls hardware in stage 2, and likewise for Stages #3, #4, and #5.]

“Data-stationary control”


Example: Decode & Register Fetch Stage

[Figure: Pipeline Stage #1 (Instr Fetch: PC, +0x4 adder, Instr Mem, IR); Stage #2 (Decode & Reg Fetch: RegFile, Ext, IR, with registers A, B, M); Stage #3 (IR).]

A sample program:

ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8

R’s chosen so that instructions are independent - like cars on the line.


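Performance Equation and Pipelining

[Figure: three-stage pipeline. Instr Fetch (PC, +0x4 adder, Instr Mem, IR); Decode & Reg Fetch (RegFile, Ext, IR, with registers A, B, M); Stage #3 (IR).]

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

CPI == 1: Once the pipe is full, one instruction completes per cycle.

Clock period is shorter: Less work to do in each cycle.

To get shortest clock period, balance the work to do in each pipeline stage.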
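A small sketch of the "one instruction completes per cycle once the pipe is full" claim, assuming a hazard-free program and an ideal pipeline with no stalls. The five-stage depth and the function names are assumptions made for illustration.

```python
def pipelined_cycles(num_instructions, num_stages):
    """Cycles to run a hazard-free program: fill the pipe, then one finishes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def effective_cpi(num_instructions, num_stages):
    """Average cycles per instruction, including the cycles spent filling the pipe."""
    return pipelined_cycles(num_instructions, num_stages) / num_instructions


for n in (3, 1000):
    print(f"{n} instructions, 5 stages: {pipelined_cycles(n, 5)} cycles, "
          f"CPI = {effective_cpi(n, 5):.3f}")
```

For short programs the fill time dominates; for long ones the effective CPI approaches 1.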


Hazards: An instruction is not a car ...

[Figure: three-stage pipeline (Stage #1 Instr Fetch, Stage #2 Decode & Reg Fetch, Stage #3) with the new sample program in flight.]

New sample program:

ADD R4,R3,R2
OR R5,R4,R2

R4 not written yet ... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!

An example of a “hazard” -- we must (1) detect and (2) resolve all hazards to make a CPU that matches ISA.
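The "detect" half can be sketched in software for the two sample programs. This is a hedged illustration, not the Lab 3 hardware: `parse`, `raw_hazards`, and the `window` parameter are hypothetical names. `window` stands in for how many following instructions can read a register before it has been written; the right value depends on the pipeline and is not given in these slides.

```python
def parse(instr):
    """Split 'ADD R4,R3,R2' into (op, dest, [sources]); assumes three-register format."""
    op, regs = instr.split(None, 1)
    dest, *sources = [r.strip() for r in regs.split(",")]
    return op, dest, sources


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each read-after-write
    pair where the consumer follows the producer by at most `window` instructions."""
    parsed = [parse(i) for i in program]
    hazards = []
    for i, (_, dest, _) in enumerate(parsed):
        for j in range(i + 1, min(i + 1 + window, len(parsed))):
            if dest in parsed[j][2]:
                hazards.append((i, j, dest))
    return hazards


independent = ["ADD R4,R3,R2", "OR R7,R6,R5", "SUB R10,R9,R8"]
dependent = ["ADD R4,R3,R2", "OR R5,R4,R2"]

print(raw_hazards(independent))
print(raw_hazards(dependent))
```

The first program yields no hazards; the second reports that instruction 1 (OR) reads R4 before instruction 0 (ADD) has written it.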


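Performance Equation and Hazards

[Figure: three-stage pipeline. Instr Fetch (PC, +0x4 adder, Instr Mem, IR); Decode & Reg Fetch (RegFile, Ext, IR, with registers A, B, M); Stage #3 (IR).]

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

“Software slows the machine down” (Seymour Cray)

Some ways to cope with hazards make CPI > 1 (“stalling pipeline”).

Added logic to detect and resolve hazards increases clock period.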
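A rough sketch of how both effects eat into the pipelining win. All numbers are invented for illustration, and `cpi_with_stalls` and `speedup` are hypothetical helpers; the model assumes the instruction count is unchanged, so it cancels out of the ratio.

```python
def cpi_with_stalls(stall_cycles_per_instruction):
    """Ideal pipelined CPI of 1 plus the average stall cycles each instruction adds."""
    return 1.0 + stall_cycles_per_instruction


def speedup(single_cycle_period_ns, pipelined_period_ns, stall_cycles_per_instruction):
    """Single-cycle time per instruction divided by pipelined time per instruction."""
    single = 1.0 * single_cycle_period_ns
    pipelined = cpi_with_stalls(stall_cycles_per_instruction) * pipelined_period_ns
    return single / pipelined


# Assumed: 10 ns single-cycle clock, 2 ns ideal pipelined clock.
ideal = speedup(10.0, 2.0, stall_cycles_per_instruction=0.0)
# Assumed: hazard logic stretches the clock to 2.5 ns and stalls add 0.25 cycles/instr.
realistic = speedup(10.0, 2.5, stall_cycles_per_instruction=0.25)

print(f"ideal speedup: {ideal:.2f}x, with hazards: {realistic:.2f}x")
```

With these made-up values the speedup falls from 5x to 3.2x, which is the tradeoff the slide describes.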


A (simplified) 5-stage pipelined CPU

[Figure: five-stage pipelined datapath, with an IR register carried from stage to stage.]

1. “IF” Stage, Instr Fetch: PC, +0x4 adder, Instr Mem.
2. “ID/RF” Stage, Decode & Reg Fetch: RegFile, Ext; registers A, B, M.
3. “EX” Stage, Execution: 32-bit ALU (op); registers Y, M.
4. “MEM” Stage, Memory: Data Memory (Addr, Din, Dout, WE), MemToReg; register R.
5. “WB” Stage, WriteBack: Mux, Logic.

Welcome to Lab 3!
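To visualize how instructions overlap in the five stages named above, here is a small sketch that prints a stage-by-cycle table. It assumes a hazard-free program with no stalls; `pipeline_diagram` is a hypothetical helper, not part of the lab materials.

```python
STAGES = ["IF", "ID/RF", "EX", "MEM", "WB"]


def pipeline_diagram(program):
    """Return one row per instruction; row[c] is the stage occupied in cycle c, or ''."""
    total_cycles = len(STAGES) + len(program) - 1
    rows = []
    for i, _ in enumerate(program):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters IF in cycle i
        rows.append(row)
    return rows


program = ["ADD R4,R3,R2", "OR R7,R6,R5", "SUB R10,R9,R8"]
for instr, row in zip(program, pipeline_diagram(program)):
    print(f"{instr:<14}" + "".join(f"{stage:<6}" for stage in row))
```

Each row is shifted one cycle to the right of the one above it, so in the middle cycles several stages are busy at once.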


Administrivia: Upcoming deadlines ...

Thursday 9/29: At 11:59 PM via email: Lab 2 peer evaluations, and Lab 3 preliminary design document due.

Monday 9/26: Lab 2 final report due via the submit program, 11:59 PM.

Friday 9/23: Lab 2 “Xilinx Checkoff”, in section. For non-150 students, “150 Lab Lecture 4”, 2-3 PM, 125 Cory.

Lab 3 now available on the web site


Starting 9/29: Homework, Midterm, Lab

HW graded on effort

Midterm two weeks from today, in evening, no class that day.

Thursday review session. Will cover format, material, and ground rules for test.

Lab 3 design doc, checkoffs, later in week ...


Lab 3 Introduction

“Pipelining Your Processor”


Week 1 for Lab 3: Pipelining Processors


Week 2: Hazard-Free Code on the Board

A sample program:

ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8

R’s chosen so that instructions are independent - like cars on the line.


Week 3: Run TA’s “Hard Tests” on Xilinx

An example of a “hazard” -- we must (1) detect and (2) resolve all hazards to make a CPU that matches ISA.

New sample program:

ADD R4,R3,R2
OR R5,R4,R2


Next 2 Lectures: Pipelining details ...

Control, Hazards, Forwarding
