Todayʼs Menu Multi-Cycle Exceptions Exceptions ... · 13 Pipelining Multicycle Pipelining Let’s...

1

Multi-Cycle Exceptions Pipelining

2

Today’s Menu

• Exceptions
  • What are they?
  • What do we do about them?

• Introduction to pipelining
  • Why pipelining?
  • Why is it difficult?
  • How can we do it efficiently?
  • Examples

3

Exceptions and Interrupts

• Exceptions are ‘exceptional events’ that disrupt the normal flow of a program
• Terminology varies between different machines
• Examples of interrupts
  • User hitting the keyboard
  • Disk drive asking for attention
  • Arrival of a network packet
• Examples of exceptions
  • Divide by zero
  • Overflow
  • Page fault

4

Handling Exceptions and Interrupts

• When do we jump to an exception?
  • Upon detection, invoke the OS to “service the event”
  • Right when it occurs?
  • What about in the middle of executing a multi-cycle instruction?
• It is difficult to abort in the middle of an instruction
  • The processor checks for events at the end of every instruction
  • The processor provides EPC and Cause registers to inform the OS of the cause
• EPC (Exception Program Counter)
  • Holds the PC that the OS should jump to when resuming execution
• Cause register
  • Holds the bit-encoded cause of the exception
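The "check at the end of every instruction" policy can be sketched as a toy interpreter. This is only an illustration under stated assumptions: the two opcodes, the tuple instruction format, the `CAUSE_DIV_ZERO` bit, and the choice to resume at the instruction after the faulting one are all hypothetical and not taken from any real ISA.

```python
CAUSE_DIV_ZERO = 0x1  # hypothetical bit assignment in the Cause register


def run(program, regs):
    """Run (op, rd, rs, rt) tuples; return final registers and handled events."""
    pc, epc, cause = 0, None, 0
    handled = []
    while pc < len(program):
        op, rd, rs, rt = program[pc]
        event = 0
        if op == "add":
            regs[rd] = regs[rs] + regs[rt]
        elif op == "div":
            if regs[rt] == 0:
                event |= CAUSE_DIV_ZERO      # remember the problem, do not write rd
            else:
                regs[rd] = regs[rs] // regs[rt]
        # End of instruction: the only point where events are checked.
        if event:
            epc, cause = pc, event           # hardware saves EPC and Cause
            handled.append((epc, cause))     # stand-in for the OS handler
            pc = epc + 1                     # assumption: resume after the faulting instruction
        else:
            pc += 1
    return regs, handled
```

For example, `run([("add", 1, 2, 3), ("div", 4, 1, 0), ("add", 5, 1, 1)], [0, 0, 2, 3, 0, 0])` records one event at instruction index 1 and still executes the final `add`.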

5

Exception Flow

• When an exception (or interrupt) occurs, control is transferred to the OS
• When the OS is done, it jumps back to the user program (if it can)

[Figure: an event in the user process raises an exception; the operating system's exception handler processes it, followed by an optional exception return to the user process]

6

Why This Is Very Messy

• You have many instructions in flight
• In one of these instructions, a “bad thing” happens, e.g., divide-by-zero
• What do we have to do?
  • We have to deal with this event, since normal program execution is probably now incorrect
  • But we have a bunch of instructions in flight
  • Many of them, but maybe not all of them, need to get killed
  • We don’t want to kill work that is actually correct and waste it
• When do we kill them?
  • NOW -- die die die…?
  • Wait until the exception-causing instruction finishes?
  • Wait until the pipeline empties?
• A very, very messy part of real machine design.

7

Review of Multicycle vs. Single Cycle

• Single cycle implementations have to consider the worst-case delay through the datapath to come up with the cycle time.
• Multicycle implementations have the advantage of using a different number of cycles for executing each instruction.
• In general, the multicycle machine is better than the single cycle machine, but the actual execution time strongly depends on the workload.
• The most widely used machine implementation is neither single cycle nor multicycle; it’s the pipelined implementation. (Next lecture)
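The workload dependence can be made concrete with a small calculation. This is a sketch with made-up numbers: the per-class latencies, cycle counts, the 2 ns multicycle clock, and both instruction mixes are assumptions chosen for illustration, not measurements of any machine.

```python
def single_cycle_time(counts, latency_ns):
    """Every instruction takes one cycle as long as the slowest instruction."""
    cycle = max(latency_ns.values())
    return sum(counts.values()) * cycle


def multicycle_time(counts, cycles, clock_ns):
    """Each instruction class takes its own number of short cycles."""
    return sum(n * cycles[k] for k, n in counts.items()) * clock_ns


# Hypothetical parameters (assumptions for illustration only).
LATENCY_NS = {"load": 8, "store": 7, "rtype": 6, "branch": 5}
CYCLES = {"load": 5, "store": 4, "rtype": 4, "branch": 3}
CLOCK_NS = 2

load_heavy = {"load": 25, "store": 10, "rtype": 45, "branch": 20}
branch_heavy = {"load": 5, "store": 5, "rtype": 40, "branch": 50}
```

With these assumed numbers the single cycle machine needs 800 ns for either 100-instruction mix, while the multicycle machine needs 810 ns for the load-heavy mix and 710 ns for the branch-heavy one, so which design wins depends on the mix.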

8

Complete Single-cycle Datapath

[Figure: single-cycle datapath with PC and +4 adder, instruction memory, register file, sign extend (16 to 32 bits), shift-left-2 and branch adder, ALU with Zero output, data memory, and multiplexers]

9

Cost of the Single Cycle Architecture

[Figure: instruction classes 1, 2, and 3 need different amounts of time; our cycle time is set by the longest instruction]

Most of the time is wasted!

10

Multi-cycle Solution

[Figure: with a shorter clock, instruction classes 1, 2, and 3 take different numbers of cycles (one takes 4 cycles, another takes 2 cycles)]

Less Wasted Time

Idea: Let the FASTEST instruction determine clock period

11

Multi-cycle Reality

• We are going to go further than allowing the fastest instruction to determine the clock rate
• We are going to break EVERY instruction up into phases

[Figure: phases of R-class, Load, Branch, and Store instructions]

12

Multicycle Control – Add Intermediate Registers

[Figure: multicycle datapath with intermediate registers (Instruction Register, A, B, MDR, ALUOut), register file, sign extend (16 to 32 bits), shift left 2, ALU with Zero output, ALU Control driven by Instruction [5:0] and ALU Op, and control signals IorD, MemRead, MemWrite, IRWrite, RegDest, RegWrite, ALU SelA, ALU SelB, MemToReg]

13

Pipelining

  Multicycle Pipelining

  Let’s build cars

14

Pipelining

• Can we go faster?
• Pipelining:
  • Production assembly lines
  • Henry Ford, Model T, 1908
• Two ways to build a car:
  • Each step takes 1 hour

Non-pipelined: 1 car/4 hours

19

Pipelining

Non-pipelined: 1 car/4 hours
Pipelined: 1 car/hour




20

Analogy: Gasoline Transportation

• Trucking gas from depot to gas station
  • Get the barrels
  • Load them into the truck
  • Drive to the gas station
  • Unload the gas
  • Return for more oil
• Let’s do the math
  • Each truck can carry 5 barrels
  • A truck can be loaded with 5 barrels in 1 hour
  • It takes each truck 1 day to drive to and from the gas station
  • Q: How many barrels per week are delivered?
  • Q: What if I had more trucks?

[Figure: trucks driving from the depot to the gas station]

21

Looks a Lot Like a Multicycle Processor

[Figure: multicycle datapath with instruction register, register file, sign extend (16 to 32 bits), shift left 2, ALU with Zero output, and memory]

• What are the steps?
  • Fetch an instruction (Get the barrels)
  • Decode the instruction (Load them into the truck)
  • ALU op (Drive to the gas station)
  • Memory access (Unload the gas)
  • Write-back (Return for more oil)

22

Business 201

• Roll the barrels down the road
  • Big fire hazard - probably will not meet OSHA (US Occupational Safety and Health Administration) standards

23

Business 201

• Build a pipeline
  • Will meet OSHA standards
  • Might make the environmentalists angry
• Now let’s do the math
  • Pipeline can accept 1 barrel every hour
  • Q: How many barrels get delivered to the gas station per day?
  • Q: How many barrels are “in-flight” at any moment?

24

Trucking vs. Pipelines

• Trucks
  • Each truck can carry 5 barrels
  • A truck can be loaded with 5 barrels in 1 hour
  • A truck takes 1 day to drive to and from the gas station
  • LOTS of TIME when the loading area, gas station, and pieces of the road are unused
  • Unless you have lots of trucks
• Pipelines
  • Pipeline can accept 1 barrel every hour
  • Resources (loading area, gas station, pipeline) are always in use
  • As long as you can keep your pipeline full (e.g., you have enough barrels)
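One way to put numbers on this comparison is a steady-state throughput estimate. The sketch below rests on assumptions the slides do not state: a truck's full cycle is 1 hour of loading plus the 24-hour round trip, there is a single loading area that handles one truck at a time, and start-up effects are ignored. The function names are hypothetical.

```python
from fractions import Fraction

HOURS_PER_WEEK = 7 * 24


def truck_barrels_per_week(trucks, capacity=5, load_hours=1, round_trip_hours=24):
    """Steady-state delivery rate for a fleet sharing one loading area."""
    per_truck = Fraction(capacity, load_hours + round_trip_hours)  # barrels/hour
    loading_limit = Fraction(capacity, load_hours)                 # loading area bottleneck
    return min(trucks * per_truck, loading_limit) * HOURS_PER_WEEK


def pipeline_barrels_per_week(barrels_per_hour=1):
    """A full pipeline delivers at the same rate at which it accepts barrels."""
    return barrels_per_hour * HOURS_PER_WEEK
```

Under these assumptions one truck averages 33.6 barrels per week against 168 for the pipeline, and adding trucks only helps until the loading area itself is busy all the time (25 trucks here).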

25

Big Idea: Pipeline Concurrency

Unpipelined: this computation (100 ns) is “too long”.

Pipelined version: 5 pipe stages of ~20 ns each. Latches, called ‘pipeline registers’, break up the computation into stages.

26

Big Idea: It’s Faster

Unpipelined (100 ns): I can “launch” a new computation every 100 ns in this structure.

Pipelined version, 5 pipe stages (~20 ns each): I can launch a new computation every 20 ns in the pipelined structure. Latches, called ‘pipeline registers’, break up the computation into stages.

27

Pipelining: Implementation Issues

• What prevents us from just doing a zillion pipe stages?
  • Some computations just won’t divide into any finer (shorter in time) logical implementations
  • Ultimately, it often comes down to circuit design issues

[Figure: 5 stages of ~20 ns each: OK; 50 stages of ~2 ns each: nope, sorry]

28

Pipelining: Implementation Issues

• What prevents us from just doing a zillion pipe stages?
  • Those latches are NOT free: they take up area, and there is a real delay to go THROUGH the latch itself
  • In a modern, deep pipeline (10-20 stages), this is a real effect
  • Typically see logic “depths” in one pipe stage of 10-20 “gates”

[Figure: a 10-stage pipe with timing labels ~2 ns and ~0.2 ns; one stage is expanded to show about 20 levels of logic]

At these speeds, and with this few levels of logic, latch delay is important
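The effect of latch delay on deep pipelines can be sketched with a simple model: cycle time is the logic delay divided evenly across the stages, plus one latch delay per stage. The even split and the 1 ns latch delay used in the note below are assumptions for illustration; real stages rarely divide evenly.

```python
def cycle_time_ns(logic_ns, stages, latch_ns):
    """Clock period when logic is split evenly and each stage pays one latch delay."""
    return logic_ns / stages + latch_ns


def speedup(logic_ns, stages, latch_ns):
    """Throughput gain over doing all the logic in one unpipelined step."""
    return logic_ns / cycle_time_ns(logic_ns, stages, latch_ns)
```

With 100 ns of logic and an assumed 1 ns latch, 5 stages give a 21 ns cycle (speedup about 4.8 rather than 5) and 50 stages give a 3 ns cycle (speedup about 33 rather than 50). In this model the speedup can never exceed `logic_ns / latch_ns`, however many stages are added.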

29

Remember the ARM big.LITTLE Idea?

LITTLE: pipeline depth 8-10, much lower power

BIG: pipeline depth 15-24, much higher frequency

30

How Many Pipeline Stages?

• E.g., Intel
  • Pentium 4: over 20 stages
  • More than 120 instructions in flight
  • High clock frequency (>3 GHz)
  • High IPC (Instructions per Cycle)
• Too many stages:
  • Lots of complications
  • Should take care of possible dependencies among in-flight instructions
  • Control logic is huge
  • Too little work per stage and too high a branch misprediction penalty lead to bad performance

31

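Performance of Pipelined Systems

• Unpipelined
  • Latency: 5 cycles
  • Throughput: 1 instruction per 5 cycles
• Pipelined
  • Latency: 5 cycles
  • Throughput: 1 instruction per 1 cycle
• Ideally, $\text{Speedup}_{\text{pipeline}} = \text{Time}_{\text{sequential}} / \text{Time}_{\text{pipelined}} = \text{Pipeline Depth}$

[Figure: instructions versus time for the unpipelined and pipelined cases; each box is one pipeline stage time]

Ideal speedup only if we can keep the pipeline full!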
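A short calculation shows why the ideal speedup is only approached when the pipeline stays full. This sketch assumes a hazard-free stream of instructions and one cycle per stage; the function names are hypothetical.

```python
def unpipelined_cycles(n, depth=5):
    """Each instruction occupies the whole datapath for `depth` cycles."""
    return n * depth


def pipelined_cycles(n, depth=5):
    """The first instruction takes `depth` cycles; each later one finishes a cycle after."""
    return depth + (n - 1) if n > 0 else 0


def pipeline_speedup(n, depth=5):
    return unpipelined_cycles(n, depth) / pipelined_cycles(n, depth)
```

A single instruction sees no speedup at all, three instructions finish in 7 cycles instead of 15, and the ratio approaches the depth of 5 only as the instruction count grows large.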

32

MIPS Pipeline Stages

• Stage 1: Instruction Fetch (IF)
• Stage 2: Instruction Decode (ID)
• Stage 3: Execute (EX)
• Stage 4: Memory Access (MEM)
• Stage 5: Write Back (to register file) (WB)

33

5-stage Version of MIPS Datapath

[Figure: datapath divided by four banks of pipeline registers into Stage 1 Instr. Fetch (PC, +4 adder), Stage 2 Decode (register file, sign extend 16 to 32 bits), Stage 3 ALU Execute, Stage 4 MemAcc (data memory), and Stage 5 Writeback]

34

Complete 5 Stage Pipeline (Drawn Smaller)



[Figure: complete 5-stage pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]

35

Flow of Instructions Through Pipeline

[Figure: pipeline diagram over clock cycles 1-7 for LW R1, 100(R0); LW R2, 200(R0); LW R3, 300(R0). Each instruction passes through IM, REG, ALU, DM, REG, starting one cycle after the previous one]

In cycle 4 we have 3 instructions “in-flight”:
• Inst 1 is accessing the memory (DM)
• Inst 2 is using the ALU (EX)
• Inst 3 is accessing the register file (ID)
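The flow described above can be reproduced with a few lines of code. This is a sketch that assumes an ideal pipeline with no stalls, where instruction i enters the first stage in cycle i + 1; the helper names are hypothetical.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]  # stage labels as drawn on the slide


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by the 0-based instruction in the 1-based clock cycle, or None."""
    k = cycle - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def diagram(instrs):
    """One text row per instruction, one 4-character column per clock cycle."""
    total = len(instrs) + len(STAGES) - 1
    rows = []
    for i, text in enumerate(instrs):
        cells = [(stage_in_cycle(i, c) or "").ljust(4) for c in range(1, total + 1)]
        rows.append((f"{text:<18}" + "".join(cells)).rstrip())
    return "\n".join(rows)
```

Calling `print(diagram(["LW R1, 100(R0)", "LW R2, 200(R0)", "LW R3, 300(R0)"]))` prints a staircase of stages, and `stage_in_cycle` confirms the cycle-4 snapshot: DM, ALU, and REG for instructions 1, 2, and 3.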

36

Stage 1 - IF (Instruction Fetch)



[Figure: 5-stage pipelined datapath with the Instruction Fetch stage highlighted for LW]

37

Stage 2 - ID (Instruction Decode)



[Figure: 5-stage pipelined datapath with the Instruction Decode stage highlighted for LW]

38

Stage 3 - EX (Execution)



[Figure: 5-stage pipelined datapath with the Execution stage highlighted for LW]

39

Stage 4 - MEM (Memory)



[Figure: 5-stage pipelined datapath with the Memory stage highlighted for LW]

40

Stage 5 - WB (Write Back)



[Figure: 5-stage pipelined datapath with the WriteBack stage highlighted for LW]

41

New Complications

• The good news
  • Multiple instructions are running at the same time, through the datapath
  • This works because each stage of the pipeline is isolated by latches
  • So, in the best of all possible worlds, an N-stage pipe has N instructions flowing through it, and speedup is close to N
• The bad news
  • Instructions interfere with each other
  • Common name for these: conflicts
• Why?
  • Different instructions are “in flight” through the datapath at the same time
  • Different instructions might want to use the same piece of hardware in the datapath at the same time (i.e., in the same clock cycle)
  • These conflicts (contention for an over-used resource) are the source of endless grief in pipeline design

42

Good News: >1 Instruction “In Flight” in Pipe

[Figure: pipeline diagram over clock cycles 1-8 for ADD R2,R3,R1; SUB R5,R6,R7; ADD R10,R11,R12. Each instruction passes through IM, REG, ALU, DM, REG, one cycle behind the previous one]

43

Bad News: Instructions Interfere

[Figure: pipeline diagram over clock cycles 1-8 for ADD R10, R11, R12; ADD R17, R0, R0; ADD R16, R0, R0; SUB R20, R21, R22; ADD R30, R17, R18. In the same clock cycle, one instruction writes to the register file while another reads from the register file]

44

Instruction Interference in a Pipe

• In its most basic form, it’s about contention for a resource
  • 2 instructions want to “use” a piece of hardware in the pipe
  • There’s only one of these in the pipe; maybe it can’t “service” the requirements of more than one instruction at a time

[Figure: the conflict from the previous slide’s instruction sequence, drawn on the stages Iget, Rget, ALU op, Mput, Rput]


45

Sometimes, You Can Redesign the Resource

• In this particular case…
  • The problem is that one instruction READS the register file
  • …and the other WRITES the register file
  • Solution: allow WRITE-then-READ in one clock cycle (“double pump”)

[Figure: two instructions on the stages Iget, Rget, ALU op, Mput, Rput, with each register file stage split into a write (W) half and a read (R) half]

No conflict now: the 1st instruction writes in the 1st half of the clock cycle, and the later instruction reads in the 2nd half
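The write-then-read ordering can be modelled in software as a register file whose per-cycle operation performs the write before the reads. This is a behavioural sketch only; the `RegisterFile` class and its `clock` method are hypothetical names, and real hardware does this with clock phases rather than statement order.

```python
class RegisterFile:
    """Behavioural model of a double-pumped register file."""

    def __init__(self, n=32):
        self.regs = [0] * n

    def clock(self, write=None, reads=()):
        """One clock cycle: write in the first half, read in the second half."""
        if write is not None:
            rd, value = write
            if rd != 0:                 # R0 is hard-wired to zero on MIPS
                self.regs[rd] = value
        return [self.regs[r] for r in reads]
```

For example, `RegisterFile().clock(write=(17, 42), reads=(17, 18))` returns `[42, 0]`: the reader in the same cycle already sees the value written to R17.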

46

Now, Even this Case Works OK

[Figure: pipeline diagram over clock cycles 1-8 for ADD R10, R11, R12; ADD R17, R0, R0; ADD R16, R0, R0; SUB R20, R21, R22; ADD R30, R17, R18. R17 is written (17 W) and then read (17 R) within the same clock cycle]

47

But..This Case Still Screws Up

[Figure: pipeline diagram over clock cycles 1-8 for ADD R2,R3,R1; SUB R5,R6,R7; ADD R10,R11,R12; ADD R12,R10,R11. The last instruction reads the value out of R10 (10 R) before the previous instruction writes its result back into R10 (10 W)]

48

Another Conflict: Data Hazards

• Basic structure
  • An instruction in flight wants to use a data value that’s not “done” yet
  • “Done” means “it’s been computed” and “it’s located where I would normally expect to go look in the pipe hardware to find it”
• Basic cause
  • You are used to assuming a purely sequential model of instruction execution
  • Instruction N finishes before instruction N+k, for k >= 1
  • Nope, sorry -- not true any more in a pipeline
  • There are dependencies now between “nearby” instructions (“near” in sequential order of fetch from memory)
• Consequence
  • Data hazards -- instructions want data values that are not done yet, or not in the right place yet

49

This Data Hazard, Revisited

• In this particular case…
  • The R10 value is not computed or returned to the register file when the later instruction wants to use it as an input

[Figure: the read of R10 (10 R) occurs before the write-back of R10 (10 W)]

Double pumping the register file doesn’t help here; the later instruction needs R10 2 clock cycles before it’s been computed and stored back. Oops…

50

Coping with Data Hazards

• What do you do?
  • Sometimes the dumb-sounding answer is right
• Hypothesis:
  • It is BAD when certain instructions “overlap” in time in certain patterns in our 5-stage MIPS pipeline
• Proposed solution
  • Don’t let them overlap like this…?
  • Right - that is one solution
• Mechanics
  • Don’t let the instruction flow through the pipe
  • In particular, don’t let it WRITE any bits anywhere in the pipe hardware that represents REAL CPU state (e.g., register file, memory)
  • Name for this operation: PIPELINE STALL

51

Coping with Data Hazards: Example

[Figure: pipeline diagram over clock cycles 1-8 for ADD R10, R11, R12; ADD R12, R10, R11; ADD R11, R10, R12. Both later instructions read R10 (10 R), and the second one reads it before the first instruction writes it (10 W)]

52

Solution 1 : Stall

[Figure: pipeline diagram over clock cycles 1-8 for ADD R10, R11, R12; ADD R12, R10, R11; ADD R11, R10, R12. Two bubbles delay the second instruction until its read of R10 (10 R) lines up with the first instruction's write of R10 (10 W)]

Empty slots in the pipe are called bubbles; they mean no real instruction work is getting saved here
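The number of bubbles needed can be worked out with a small scheduler. This sketch assumes the 5-stage pipe from these slides with a double-pumped register file and no forwarding, so a consumer may read a register in the same cycle in which its producer writes it back; the `schedule` function and the `(dest, src1, src2)` tuple format are hypothetical.

```python
def schedule(instrs):
    """Return the clock cycle in which each (dest, src1, src2) reads its registers."""
    read_cycle = []
    wb_cycle = {}   # register -> cycle in which its latest producer writes back
    next_slot = 2   # first instruction: fetch in cycle 1, register read in cycle 2
    for dest, *srcs in instrs:
        ready = max((wb_cycle.get(s, 0) for s in srcs), default=0)
        c = max(next_slot, ready)   # stall until every source has been written
        read_cycle.append(c)
        wb_cycle[dest] = c + 3      # register read -> EX -> MEM -> WB
        next_slot = c + 1           # in-order: the next instruction reads no earlier
    return read_cycle
```

For the sequence on this slide, `schedule([("R10", "R11", "R12"), ("R12", "R10", "R11"), ("R11", "R10", "R12")])` gives `[2, 5, 8]`. The second instruction reads in cycle 5 instead of cycle 3, which is the two bubbles shown; in this model the third instruction also waits, because it needs R12 from the second.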

53

Mechanically: How Do We Stall?

• Add extra hardware to detect stall situations
  • Watches the instruction field bits
  • Looks for “read versus write” conflicts in particular pipe stages
  • Basically, a bunch of careful “case logic”
• Add extra hardware to push bubbles through the pipe
  • Actually, relatively easy
  • Can just let the instruction you want to stall GO FORWARD through the pipe…
  • …but TURN OFF the bits that allow any results to get written into the machine state
  • So, the instruction “executes” (it does the work), but doesn’t “save”

“If an instruction executes in the middle of a forest, but no registers are around to save the results…did it really execute?” (No.)
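The "turn off the write bits" idea can be sketched as an operation on the control signals travelling with an instruction. The field names mirror the RegWrite, MemWrite, and MemRead signals from the multicycle control slide, but the `Control` class and `to_bubble` helper are hypothetical and cover only these three signals.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Control:
    reg_write: bool  # RegWrite: a result may be written to the register file
    mem_write: bool  # MemWrite: data may be written to data memory
    mem_read: bool   # MemRead: left unchanged in this sketch


def to_bubble(ctrl: Control) -> Control:
    """Let the instruction keep flowing, but forbid every state-changing write."""
    return replace(ctrl, reg_write=False, mem_write=False)
```

An instruction whose control bits have passed through `to_bubble` still occupies its pipe stages, but nothing it computes reaches the register file or memory.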

54

Recall the Registers Between Pipeline Stages



[Figure: 5-stage pipelined datapath, showing the pipeline registers between stages]


55

Recall What an Instruction Looks Like

• add R8, R17, R18 is stored in binary format as 00000010 00110010 01000000 00100000
• MIPS lays out instructions into “fields”
  • op: operation of the instruction
  • rs: first register source operand
  • rt: second register source operand
  • rd: register destination operand
  • shamt: shift amount
  • funct: function (selects the type of operation)

Field   op      rs      rt      rd      shamt   funct
Bits    31-26   25-21   20-16   15-11   10-6    5-0
Value   000000  10001   10010   01000   00000   100000

We gotta watch these register operand fields (rs, rt, rd)
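The field layout above can be checked by decoding the example word with shifts and masks. This is a minimal sketch for R-format instructions only; `decode_rtype` is a hypothetical helper name.

```python
def decode_rtype(word):
    """Split a 32-bit MIPS R-format instruction word into its six fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31-26
        "rs": (word >> 21) & 0x1F,     # bits 25-21
        "rt": (word >> 16) & 0x1F,     # bits 20-16
        "rd": (word >> 11) & 0x1F,     # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add R8, R17, R18 from the slide, written as one 32-bit word
ADD_R8_R17_R18 = 0b00000010_00110010_01000000_00100000
```

Decoding `ADD_R8_R17_R18` yields rs = 17, rt = 18, and rd = 8, matching the operands of `add R8, R17, R18`, with funct = 32 (binary 100000).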

56

Data Hazard Logic



[Figure: 5-stage pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers carrying the Rs, Rt, and Rd fields. A Data Hazard Logic block compares Rs =? Rd and Rt =? Rd between the ID/EX, EX/MEM, and MEM/WB stages]

57

Example

Instruction          Rd    Rs    Rt
sub R2, R1, R3       R2    R1    R3
and R12, R2, R5      R12   R2    R5
or  R13, R6, R2      R13   R6    R2
add R14, R2, R2      R14   R2    R2
sw  R15, 100(R2)     XX    R2    R15

• SUB-AND hazard
  • EX/MEM.RegisterRd == ID/EX.RegisterRs == R2
• SUB-OR hazard
  • MEM/WB.RegisterRd == ID/EX.RegisterRt == R2
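The two comparisons above can be run over the whole example sequence. This sketch assumes that when an instruction sits in ID/EX, the instruction one ahead of it is in EX/MEM and the one two ahead is in MEM/WB; it treats `sw` as having no destination register. The tuple format and `find_hazards` are hypothetical, and special cases such as a destination of R0 are ignored.

```python
# (name, rd, rs, rt); rd is None when the instruction writes no register
PROGRAM = [
    ("sub", "R2", "R1", "R3"),
    ("and", "R12", "R2", "R5"),
    ("or", "R13", "R6", "R2"),
    ("add", "R14", "R2", "R2"),
    ("sw", None, "R2", "R15"),
]


def find_hazards(program):
    """List (producer, consumer, condition) for each Rd match one or two stages ahead."""
    found = []
    for i, (name, _rd, rs, rt) in enumerate(program):
        for distance, latch in ((1, "EX/MEM"), (2, "MEM/WB")):
            j = i - distance
            if j < 0:
                continue
            producer, prd, _, _ = program[j]
            if prd is None:
                continue
            for field, reg in (("Rs", rs), ("Rt", rt)):
                if reg == prd:
                    found.append((producer, name, f"{latch}.RegisterRd == ID/EX.Register{field}"))
    return found
```

On `PROGRAM` this reports exactly the SUB-AND and SUB-OR hazards listed above and nothing for `add` or `sw`, which are far enough behind `sub`.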

58

Example

Instruction          Rd    Rs    Rt
sub R2, R1, R3       R2    R1    R3
and R12, R2, R5      R12   R2    R5
or  R13, R6, R2      R13   R6    R2
add R14, R2, R2      R14   R2    R2
sw  R15, 100(R2)     XX    R2    R15

• Interactions (real or not) can be tricky
  • Example: do instruction #1 (sub) and #4 (add) interact or conflict?
  • Well, they do BOTH want to use R2…

59

No Dependence Between #1 and #4

[Figure: pipeline diagram over clock cycles 1-8 for SUB R2, R1, R3; AND R12, R2, R5; OR R13, R6, R2; ADD R14, R2, R2. SUB writes R2 (2 W) in the same clock cycle in which ADD reads R2 (2 R)]

In this case, the double-pumped register file makes it OK…

60

How Else Could We Stall the Pipeline?

• Compiler can insert nops

[Figure: pipeline diagram over clock cycles 1-8 for ADD R10, R11, R12; nop; nop; ADD R12, R10, R11]

On MIPS, R0 = R0+R0 will do it -- it saves no state
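A compiler pass of this kind can be sketched as follows. It assumes the pipeline from these slides with a double-pumped register file and no forwarding, so a consumer must be at least two instruction slots behind its producer. The `(name, rd, rs, rt)` tuple format, the `NOP` placeholder, and `insert_nops` are hypothetical.

```python
NOP = ("nop", None, None, None)


def insert_nops(program, gap=2):
    """Insert nops so every consumer has at least `gap` slots after its producer."""
    out = []
    for instr in program:
        _name, _rd, rs, rt = instr
        sources = {r for r in (rs, rt) if r is not None}
        needed = 0
        # Look back over the last `gap` emitted slots for a producer of a source.
        for back, prev in enumerate(reversed(out[-gap:]), start=1):
            if prev[1] is not None and prev[1] in sources:
                needed = max(needed, gap - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out
```

For the pair on this slide, `insert_nops([("add", "R10", "R11", "R12"), ("add", "R12", "R10", "R11")])` places two nops between the instructions. If an unrelated instruction already separates them, only one nop is added.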

61

Or, The Hardware Can Simulate NOPS

[Figure: pipeline diagram over clock cycles 1-8 for ADD R10, R11, R12; stall; stall; ADD R12, R10, R11. The stalled slots carry bubbles through the remaining pipeline stages]

62

Next lecture

  How to fix the pipeline to avoid (most) dependency problems …
