Todayʼs Menu Multi-Cycle Exceptions Exceptions ... · 13 Pipelining Multicycle Pipelining Let’s...
1
Multi-Cycle Exceptions Pipelining
2
Today’s Menu
Exceptions: What are they? What do we do about them?
Introduction to pipelining: Why pipelining? Why is it difficult? How can we do it efficiently? Examples
3
Exceptions and Interrupts
Exceptions are ‘exceptional events’ that disrupt the normal flow of a program
Terminology varies between different machines
Examples of Interrupts: user hitting the keyboard, disk drive asking for attention, arrival of a network packet
Examples of Exceptions: divide by zero, overflow, page fault
4
Handling Exceptions and Interrupts
When do we jump to an exception?
Upon detection, invoke the OS to “service the event”
Right when it occurs? What about in the middle of executing a multi-cycle instruction?
It is difficult to abort in the middle of an instruction, so the processor checks for events at the end of every instruction
Processor provides EPC & Cause registers to inform the OS of the cause
EPC (Exception Program Counter): holds the PC that the OS should jump to when resuming execution
Cause Register: holds the bit-encoded cause of the exception
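A minimal sketch of the "check at the end of every instruction" policy, in Python. The `Cpu` class, the cause bit values, and `HANDLER_ADDR` are hypothetical names chosen for illustration; real machines define their own encodings. The sketch models an interrupt, where execution resumes at the next instruction.

```python
from dataclasses import dataclass

# Hypothetical bit-encoded causes; real machines define their own encoding.
CAUSE_OVERFLOW = 0b001
CAUSE_DIV_ZERO = 0b010
CAUSE_INTERRUPT = 0b100

HANDLER_ADDR = 0x8000_0000  # assumed OS entry point, for illustration only


@dataclass
class Cpu:
    pc: int = 0
    epc: int = 0      # Exception Program Counter
    cause: int = 0    # Cause register
    pending: int = 0  # events detected while the current instruction runs

    def step(self, instr_len: int = 4) -> None:
        """Finish one instruction, then check for pending events."""
        next_pc = self.pc + instr_len
        if self.pending:
            self.epc = next_pc         # where the OS resumes the program
            self.cause = self.pending  # tell the OS what happened
            self.pending = 0
            self.pc = HANDLER_ADDR     # transfer control to the OS
        else:
            self.pc = next_pc

    def exception_return(self) -> None:
        """The OS jumps back to the user program."""
        self.pc = self.epc
```

The event is never acted on mid-instruction: it is only recorded in `pending` and examined once the instruction has finished.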
5
Exception Flow
When an exception (or interrupt) occurs, control is transferred to the OS
When the OS is done, it jumps back to the user program (if it can)
[Figure: an event in the user process raises an exception; the operating system's exception handler processes it; an optional exception return resumes the user process]
6
Why This Is Very Messy
You have many instructions in flight
In one of these instructions, a “bad thing” happens, e.g., divide-by-zero
What do we have to do?
We have to deal with this event, since normal program execution is probably now incorrect
But, we have a bunch of instructions in flight
Many of them, but maybe not all of them, need to get killed
Don’t want to kill stuff that is actually correct, and waste that work.
When do we kill them? NOW -- die die die….? Wait till exception-causing instruction finishes? Wait till the pipeline empties?
Very very very messy part of real machine design.
7
Review of Multicycle vs. Single Cycle
Single cycle implementations have to consider the worst case delay through the datapath to come up with the cycle time.
Multicycle implementations have the advantage of using a different number of cycles for executing each instruction.
In general, the multicycle machine is better than the single cycle machine, but the actual execution time strongly depends on the workload.
The most widely used machine implementation is neither single cycle, nor multicycle – it’s the pipelined implementation. (Next lecture)
8
Complete Single-cycle Datapath
[Figure: single-cycle datapath with PC, PC adder (+4), instruction memory, register file, sign extend (16 to 32 bits), shift left 2 and branch adder, ALU with Zero output, data memory, and three multiplexers]
9
Cost of the Single Cycle Architecture
[Figure: instruction classes 1, 2, and 3 each use only part of the cycle; our cycle time is set by the longest instruction]
Most of the time is wasted!
10
Multi-cycle Solution
[Figure: instruction classes 1, 2, and 3 with a shorter clock period; one class takes 4 cycles, another takes 2 cycles]
Less Wasted Time
Idea: Let the FASTEST instruction determine clock period
11
Multi-cycle Reality
We are going to go further than allowing the fastest instruction to determine rate
We are going to break EVERY instruction up into phases
[Figure: R-class, Load, Branch, and Store instructions, each broken up into phases]
12
Multicycle Control – Add Intermediate Registers
[Figure: multicycle datapath with intermediate registers (Instruction Register, MDR, A, B, ALUOut) and control signals IorD, MemRead, MemWrite, IRWrite, RegDest, RegWrite, ALU SelA, ALU SelB, MemToReg, and ALU Op; ALU Control is driven by Instruction [5:0]]
13
Pipelining
Multicycle Pipelining
Let’s build cars
14
Pipelining
Can we go faster? Pipelining: production assembly lines (Henry Ford, Model T, 1908)
Two ways to build a car; each step takes 1 hour
Non-pipelined: 1 car/4 hours
19
Pipelining
Can we go faster? Pipelining: production assembly lines (Henry Ford, Model T, 1908)
Two ways to build a car; each step takes 1 hour
Non-pipelined: 1 car/4 hours
Pipelined: 1 car/hour
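The car numbers can be checked with a short Python sketch. It assumes 4 steps of 1 hour each (as on the slide) and an assembly line that starts empty; the function name `cars_finished` is made up for illustration.

```python
def cars_finished(hours: int, steps: int = 4, pipelined: bool = False) -> int:
    """Cars completed after `hours`, with each step taking 1 hour."""
    if pipelined:
        # The first car needs `steps` hours; after that one car finishes every hour.
        return max(0, hours - (steps - 1))
    # Non-pipelined: the next car starts only when the previous one is done.
    return hours // steps


for h in (4, 8, 40):
    print(h, cars_finished(h), cars_finished(h, pipelined=True))
```

Both lines deliver the first car after 4 hours; the pipelined line only pulls ahead once it is full.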
20
Analogy: Gasoline Transportation
Trucking gas from depot to gas station:
Get the barrels
Load them into the truck
Drive to the gas station
Unload the gas
Return for more oil
Let’s do the math:
Each truck can carry 5 barrels
Can load a truck with 5 barrels in 1 hour
It takes each truck 1 day to drive to and from gas station
Q: How many barrels per week are delivered?
Q: What if I had more trucks?
21
Looks a Lot Like a Multicycle Processor
[Figure: multicycle datapath with instruction register, register file, sign extend, shift left 2, ALU, memory, and multiplexers]
What are the steps?
Fetch an instruction (Get the barrels)
Decode the instruction (Load them into the truck)
ALU OP (Drive to the gas station)
Memory Access (Unload the gas)
Write-back (Return for more oil)
22
Business 201
Roll the barrels down the road
Big fire hazard - probably will not meet OSHA standards (OSHA: US Occupational Safety and Health Administration)
23
Business 201
Build a pipeline
Will meet OSHA standards
Might make the environmentalists angry
Now let’s do the math:
Pipeline can accept 1 barrel every hour
Q: How many barrels get delivered to the gas station per day?
Q: How many barrels are “in-flight” at any moment?
24
Trucking vs. Pipelines
• Trucks
• Each truck can carry 5 barrels
• Can load a truck with 5 barrels in 1 hour
• Truck takes 1 day to drive to and from gas station
• LOTS of TIME when loading area, gas station, and pieces of the road are unused
• Unless you have lots of trucks
• Pipelines
• Pipeline can accept 1 barrel every hour
• Resources (loading area, gas station, pipeline) are always in use
• As long as you can keep your pipeline full (e.g., you have enough barrels)
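One way to work the truck and pipeline numbers, as a rough Python sketch. Several things here are assumptions not stated on the slides: a trip is 1 hour of loading plus a 24-hour round trip, unloading takes no time, only completed trips count, trucks never wait for the loading area, and the pipeline is already full.

```python
HOURS_PER_WEEK = 7 * 24


def truck_barrels_per_week(trucks: int = 1) -> int:
    """Barrels delivered per week by `trucks` trucks (assumptions in the text above)."""
    hours_per_trip = 1 + 24                      # load, then drive there and back
    trips = HOURS_PER_WEEK // hours_per_trip     # completed trips per truck
    return trucks * trips * 5                    # 5 barrels per trip


def pipeline_barrels_per_week() -> int:
    """Barrels delivered per week by a full pipeline accepting 1 barrel per hour."""
    return HOURS_PER_WEEK * 1


print(truck_barrels_per_week(1), truck_barrels_per_week(5), pipeline_barrels_per_week())
```

Under these assumptions a single truck delivers far fewer barrels than the pipeline, and adding trucks is the trucking equivalent of keeping more work in flight.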
25
Big Idea: Pipeline Concurrency
This computation is “too long”: 100 ns
Pipelined version, 5 pipe stages: ~20 ns each
Latches, called ‘Pipeline registers’, break up computation into stages
26
Big Idea: It’s Faster
I can “launch” a new computation every 100 ns in this structure
Pipelined version, 5 pipe stages: I can launch a new computation every ~20 ns in the pipelined structure
Latches, called ‘Pipeline registers’, break up computation into stages
27
Pipelining: Implementation Issues
What prevents us from just doing a zillion pipe stages?
Some computations just won’t divide into any finer (shorter in time) logical implementations
Ultimately, often comes down to circuit design issues
5 stages (~20 ns each): OK
50 stages (~2 ns each): nope, sorry
28
Pipelining: Implementation Issues
What prevents us from just doing a zillion pipe stages?
Those latches are NOT free, they take up area, and there is a real delay to go THRU the latch itself
In modern, deep pipeline (10-20 stages), this is a real effect
Typically see logic “depths” in one pipe stage of 10-20 “gates”
[Figure: a ~2 ns stage split into a 10 stage pipe of ~0.2 ns stages]
At these speeds, and with this few levels of logic, latch delay is important
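The effect of latch delay can be sketched numerically in Python. The model below is a simplification: logic divides perfectly evenly across stages, and every stage pays one latch delay. The 1 ns latch delay is an assumed value for illustration, not a figure from the slides.

```python
def cycle_time_ns(total_logic_ns: float, stages: int, latch_ns: float) -> float:
    """Clock period when the logic is split evenly and each stage ends in a latch."""
    return total_logic_ns / stages + latch_ns


def launch_speedup(total_logic_ns: float, stages: int, latch_ns: float) -> float:
    """How much more often a new computation can be launched than without pipelining."""
    return total_logic_ns / cycle_time_ns(total_logic_ns, stages, latch_ns)


for stages in (5, 50, 500):
    print(stages, cycle_time_ns(100.0, stages, 1.0), launch_speedup(100.0, stages, 1.0))
```

In this model the speedup can never exceed `total_logic_ns / latch_ns`, no matter how many stages are added, which is one reason a zillion pipe stages does not pay off.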
29
Remember the ARM big.LITTLE Idea?
LITTLE: Pipeline depth: 8-10, much lower power
BIG: Pipeline depth: 15-24, much higher frequency
30
How Many Pipeline Stages?
E.g., Intel Pentium 4: over 20 stages
More than 120 instructions in flight
High clock frequency (>3GHz)
High IPC (Instructions per Cycle)
Too many stages:
Lots of complications
Should take care of possible dependencies among in-flight instructions
Control logic is huge
Too little work per stage, too high a branch misprediction penalty: bad performance
31
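Performance of Pipelined Systems
[Figure: instructions plotted against time for unpipelined and pipelined execution; each pipeline stage takes one stage time]
Unpipelined: latency 5 cycles, throughput 1 per 5 cycles
Pipelined: latency 5 cycles, throughput 1 per 1 cycle
Ideally, $\text{Speedup}_{\text{pipeline}} = \dfrac{\text{Time}_{\text{sequential}}}{\text{Time}_{\text{pipelined}}} = \text{Pipeline Depth}$
Ideal speedup only if we can keep the pipeline full!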
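A small Python sketch of why the ideal speedup is only approached with a full pipeline. It assumes a 5-stage pipeline with one cycle per stage and no stalls; the function names are illustrative.

```python
def unpipelined_cycles(n: int, depth: int = 5) -> int:
    """Each instruction takes `depth` cycles and they run back to back."""
    return n * depth


def pipelined_cycles(n: int, depth: int = 5) -> int:
    """The first instruction takes `depth` cycles; each later one finishes 1 cycle after."""
    return depth + n - 1 if n > 0 else 0


def speedup(n: int, depth: int = 5) -> float:
    return unpipelined_cycles(n, depth) / pipelined_cycles(n, depth)


for n in (1, 3, 10, 1000):
    print(n, pipelined_cycles(n), round(speedup(n), 2))
```

A single instruction sees no speedup at all, since its latency is still 5 cycles; the ratio approaches the pipeline depth only as the number of instructions grows.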
32
MIPS Pipeline Stages
Stage 1: Instruction Fetch (IF)
Stage 2: Instruction Decode (ID)
Stage 3: Execute (EX)
Stage 4: Memory Access (MEM)
Stage 5: Write Back to register file (WB)
33
5-stage Version of MIPS Datapath
[Figure: datapath divided by four sets of registers into STAGE 1 Instr. Fetch (PC, adder +4, instruction memory), STAGE 2 Decode (register file, sign extend), STAGE 3 ALU Execute (ALU), STAGE 4 MemAcc (data memory), and STAGE 5 Writeback (multiplexer back to the register file)]
34
Complete 5 Stage Pipeline (Drawn Smaller)
[Figure: complete 5-stage pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
35
Flow of Instructions Through Pipeline
Clock cycle:      1    2    3    4    5    6    7
LW R1, 100(R0)    IM   REG  ALU  DM   REG
LW R2, 200(R0)         IM   REG  ALU  DM   REG
LW R3, 300(R0)              IM   REG  ALU  DM   REG
(Program execution order runs down the page, time runs to the right)
In cycle 4 we have 3 instructions “in-flight”:
Inst 1 is accessing the memory (DM)
Inst 2 is using the ALU (EX)
Inst 3 is accessing the register file (ID)
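The same picture can be generated with a few lines of Python. This is a sketch of an ideal pipeline with no stalls, where instruction i (counting from 0) enters the pipe in clock cycle i + 1; the helper names are made up.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]  # IF, ID, EX, MEM, WB


def stage_in_cycle(instr_index: int, cycle: int):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle` (1-based), or None."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def diagram(instrs, cycles: int) -> str:
    """One text row per instruction, one 5-character column per clock cycle."""
    rows = []
    for i, text in enumerate(instrs):
        cells = [(stage_in_cycle(i, c) or "").ljust(5) for c in range(1, cycles + 1)]
        rows.append(f"{text:<18}" + "".join(cells).rstrip())
    return "\n".join(rows)


program = ["LW R1, 100(R0)", "LW R2, 200(R0)", "LW R3, 300(R0)"]
print(diagram(program, 7))
print([stage_in_cycle(i, 4) for i in range(3)])  # what is in flight in cycle 4
```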
36
Stage 1 - IF (Instruction Fetch)
[Figure: 5-stage pipelined datapath with the Instruction Fetch stage highlighted for LW]
37
Stage 2 - ID (Instruction Decode)
[Figure: 5-stage pipelined datapath with the Instruction Decode stage highlighted for LW]
38
Stage 3 - EX (Execution)
[Figure: 5-stage pipelined datapath with the Execution stage highlighted for LW]
39
Stage 4 - MEM (Memory)
[Figure: 5-stage pipelined datapath with the Memory stage highlighted for LW]
40
Stage 5 - WB (Write Back)
[Figure: 5-stage pipelined datapath with the WriteBack stage highlighted for LW]
41
New Complications
The good news
Multiple instructions are running at the same time, thru the datapath
This works because each stage of pipeline is isolated by latches
So, in the best of all possible worlds, N stage pipe has N instructions flowing thru it, speedup is close to N.
The bad news
Instructions interfere with each other
Common name for these: conflicts
Why?
Different instructions “in flight” thru data path at same time
Different instructions might want to use the same piece of hardware in the datapath at the same time (i.e., in same clock cycle)
These conflicts (contention for an over-used resource) are the source of endless grief in pipeline design
42
Good News: >1 Instruction “In Flight” in Pipe
Clock cycle:      1    2    3    4    5    6    7    8
ADD R2,R3,R1      IM   REG  ALU  DM   REG
SUB R5,R6,R7           IM   REG  ALU  DM   REG
ADD R10,R11,R12             IM   REG  ALU  DM   REG
(Program execution order runs down the page, time runs to the right)
43
Bad News: Instructions Interfere
Clock cycle:          1    2    3    4    5    6    7    8
ADD R10, R11, R12     IM   REG  ALU  DM   REG
ADD R17, R0, R0            IM   REG  ALU  DM   REG
ADD R16, R0, R0                 IM   REG  ALU  DM   REG
SUB R20, R21, R22                    IM   REG  ALU  DM   REG
ADD R30, R17, R18                         IM   REG  ALU  DM
In clock cycle 6, ADD R17, R0, R0 wants to write to the register file while ADD R30, R17, R18 wants to read from the register file
44
Instruction Interference in a Pipe
In its most basic form, it’s about contention for a resource
2 instructions want to “use” a piece of hardware in the pipe
There’s only one of these in the pipe, maybe it can’t “service” the requirements of more than one instruction at a time
[Figure: the conflict from the previous slide’s instruction sequence; two instructions, each passing through Iget, Rget, ALU op, Mput, Rput, overlap at the register file]
45
Sometimes, You Can Redesign the Resource
In this particular case…
The problem is one instruction READS register file…
…and the other WRITES register file
Solution: allow WRITE-then-READ in one clock cycle (“double pump”)
[Figure: two instructions passing through Iget, Rget, ALU op, Mput, Rput; the register file cycle is split into a write half (W) followed by a read half (R)]
No conflict now: 1st instruction writes in 1st half of clock cycle, later instruction reads in 2nd half
46
Now, Even this Case Works OK
[Figure: the same five-instruction sequence (ADD R10, R11, R12; ADD R17, R0, R0; ADD R16, R0, R0; SUB R20, R21, R22; ADD R30, R17, R18) over clock cycles 1-8; in clock cycle 6, R17 is written (17 W) in the first half of the cycle and read (17 R) in the second half]
47
But..This Case Still Screws Up
Clock cycle:      1    2    3    4    5    6    7    8
ADD R2,R3,R1      IM   REG  ALU  DM   REG
SUB R5,R6,R7           IM   REG  ALU  DM   REG
ADD R10,R11,R12             IM   REG  ALU  DM   REG
ADD R12,R10,R11                  IM   REG  ALU  DM   REG
Writeback result into R10 (10 W): clock cycle 7
Read value out of R10 (10 R): clock cycle 5
48
Another Conflict: Data Hazards
Basic structure
An instruction in flight wants to use a data value that’s not “done” yet
“Done” means “it’s been computed” and “it’s located where I would normally expect to go look in the pipe hardware to find it”
Basic cause
You are used to assuming a purely sequential model of instruction execution
Instruction N finishes before instruction N+k, for k >= 1
Nope, sorry -- not true any more in a pipeline
There are dependencies now between “nearby” instructions (“near” in sequential order of fetch from memory)
Consequence
Data hazards -- instructions want data values that are not done yet, or not in the right place yet
49
This Data Hazard, Revisited
In this particular case…
R10 value is not computed or returned to register file when later instruction wants to use it as an input
Double pumping reg file doesn’t help here; later instruction needs R10 2 clock cycles before it’s been computed & stored back. Oops…
[Figure: two instructions passing through Iget, Rget, ALU op, Mput, Rput; the first writes R10 (10 W) in its Rput stage, two cycles after the second reads R10 (10 R) in its Rget stage]
50
Coping with Data Hazards
What do you do? Sometimes the dumb-sounding answer is right
Hypothesis
It is BAD when certain instructions “overlap” in time in certain patterns in our 5 stage MIPS pipeline
Proposed solution
Don’t let them overlap like this…? Right - that is one solution
Mechanics
Don’t let the instruction flow thru the pipe
In particular, don’t let it WRITE any bits anywhere in the pipe hardware that represents REAL CPU state (e.g., register file, memory)
Name for this operation: PIPELINE STALL
51
Coping with Data Hazards: Example
Clock cycle:          1    2    3    4    5    6    7    8
ADD R10, R11, R12     IM   REG  ALU  DM   REG
ADD R12, R10, R11          IM   REG  ALU  DM   REG
ADD R11, R10, R12               IM   REG  ALU  DM   REG
R10 is written (10 W) in clock cycle 5, but the two later instructions read R10 (10 R) in clock cycles 3 and 4
52
Solution 1 : Stall
[Figure: ADD R10, R11, R12; ADD R12, R10, R11; ADD R11, R10, R12 over clock cycles 1-8; the second ADD is held back by two bubbles so that its register read (10 R) falls in the same clock cycle as the first ADD’s write of R10 (10 W), and the third ADD is delayed behind it]
Empty slots in the pipe are called bubbles; no real instruction work is getting saved there
53
Mechanically: How Do We Stall?
Add extra hardware to detect stall situations
Watches the instruction field bits
Looks for “read versus write” conflicts in particular pipe stages
Basically, a bunch of careful “case logic”
Add extra hardware to push bubbles thru pipe
Actually, relatively easy
Can just let the instruction you want to stall GO FORWARD thru the pipe…
…but, TURN OFF the bits that allow any results to get written into the machine state
So, the instruction “executes” (it does the work), but doesn’t “save”
“If an instruction executes in the middle of a forest, but no registers are around to save the results…did it really execute?” (No.)
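How many bubbles a sequence needs can be estimated with a small Python sketch. It assumes the 5-stage pipe from these slides, a double-pumped register file (so a read may share a clock cycle with the write-back it depends on), in-order issue, and no forwarding. The tuple format `(dest, src1, src2)` and the function names are made up for illustration.

```python
WB_OFFSET = 3  # ID -> EX -> MEM -> WB: write-back happens 3 cycles after decode


def schedule_id_cycles(instrs):
    """Return the clock cycle in which each instruction reads its registers (ID)."""
    wb_cycle = {}   # register -> cycle in which its latest producer writes back
    id_cycles = []
    prev_id = 1     # the first instruction is fetched in cycle 1, decoded in cycle 2
    for dest, *srcs in instrs:
        id_cycle = prev_id + 1                      # earliest slot, in program order
        for src in srcs:
            if src != "R0" and src in wb_cycle:     # R0 is hardwired to zero on MIPS
                id_cycle = max(id_cycle, wb_cycle[src])
        id_cycles.append(id_cycle)
        wb_cycle[dest] = id_cycle + WB_OFFSET
        prev_id = id_cycle
    return id_cycles


def total_bubbles(instrs) -> int:
    """Extra cycles compared with a stall-free pipeline."""
    ids = schedule_id_cycles(instrs)
    return ids[-1] - (len(ids) + 1) if ids else 0


pair = [("R10", "R11", "R12"), ("R12", "R10", "R11")]
print(schedule_id_cycles(pair), total_bubbles(pair))
```

For the ADD R10 / ADD R12, R10, R11 pair this reports two bubbles: the second instruction decodes in cycle 5 instead of cycle 3, the same cycle in which R10 is written back.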
54
Recall the Registers Between Pipeline Stages
[Figure: 5-stage pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between stages]
55
Recall What an Instruction Looks Like
add R8, R17, R18 is stored in binary format as
00000010 00110010 01000000 00100000
MIPS lays out instructions into “fields”:
op: operation of the instruction
rs: first register source operand
rt: second register source operand
rd: register destination operand
shamt: shift amount
funct: function (select type of operation)
Field:  op      rs      rt      rd      shamt   funct
Bits:   31-26   25-21   20-16   15-11   10-6    5-0
Value:  000000  10001   10010   01000   00000   100000
We gotta watch these reg op fields
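The field layout above can be checked by decoding the example word in Python. This is a sketch for R-type instructions only; the function name `decode_r_type` is illustrative.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit MIPS R-type instruction word into its six fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add R8, R17, R18 from the slide
word = 0b00000010_00110010_01000000_00100000
print(hex(word), decode_r_type(word))
```

The decoded register numbers (rs = 17, rt = 18, rd = 8) are the fields the hazard logic has to compare.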
56
Data Hazard Logic
[Figure: 5-stage pipelined datapath; the Rs and Rt fields held in ID/EX are compared against the Rd fields held in the later pipeline registers]
Data Hazard Logic: Rs =? Rd and Rt =? Rd, between the ID/EX, EX/MEM, and MEM/WB stages
57
Example
Instruction          Rd     Rs    Rt
sub R2, R1, R3       R2     R1    R3
and R12, R2, R5      R12    R2    R5
or R13, R6, R2       R13    R6    R2
add R14, R2, R2      R14    R2    R2
sw R15, 100(R2)      R15    R2    XX
SUB-AND Hazard: EX/MEM.RegisterRd == ID/EX.RegisterRs == R2
SUB-OR Hazard: MEM/WB.RegisterRd == ID/EX.RegisterRt == R2
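The comparisons can be written out as a Python sketch. The `Instr` tuple and `find_hazards` function are hypothetical stand-ins for the register-number fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers; real hazard logic would also check that the older instruction actually writes a register.

```python
from typing import List, NamedTuple, Optional


class Instr(NamedTuple):
    name: str
    rd: Optional[int]  # destination register number, None if unused
    rs: Optional[int]
    rt: Optional[int]


def find_hazards(id_ex: Instr, ex_mem: Optional[Instr], mem_wb: Optional[Instr]) -> List[str]:
    """Compare Rs/Rt of the instruction in ID/EX with Rd of the two older instructions."""
    hazards = []
    for stage, older in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        if older is None or older.rd is None or older.rd == 0:
            continue  # empty slot, no destination, or R0 (hardwired zero on MIPS)
        for field in ("rs", "rt"):
            if getattr(id_ex, field) == older.rd:
                hazards.append(
                    f"{stage}.RegisterRd == ID/EX.Register{field.capitalize()} == R{older.rd}"
                )
    return hazards


sub = Instr("sub", rd=2, rs=1, rt=3)
and_ = Instr("and", rd=12, rs=2, rt=5)
or_ = Instr("or", rd=13, rs=6, rt=2)

print(find_hazards(and_, sub, None))  # the SUB-AND hazard
print(find_hazards(or_, and_, sub))   # the SUB-OR hazard
```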
58
Example
Instruction          Rd     Rs    Rt
sub R2, R1, R3       R2     R1    R3
and R12, R2, R5      R12    R2    R5
or R13, R6, R2       R13    R6    R2
add R14, R2, R2      R14    R2    R2
sw R15, 100(R2)      R15    R2    XX
Interactions (real or not) can be tricky
Example: do instruction #1 (sub) and #4 (add) interact, conflict?
Well, they do BOTH want to use R2… ??
59
No Dependence Between #1 and #4
Clock cycle:        1    2    3    4    5    6    7    8
SUB R2, R1, R3      IM   REG  ALU  DM   REG
AND R12, R2, R5          IM   REG  ALU  DM   REG
OR R13, R6, R2                IM   REG  ALU  DM   REG
ADD R14, R2, R2                    IM   REG  ALU  DM   REG
In clock cycle 5, SUB writes R2 (2 W) and ADD reads R2 (2 R)
In this case, double pumped reg file makes it ok…
60
How Else Could We Stall the Pipeline? Compiler can insert nops
Clock cycle:          1    2    3    4    5    6    7    8
ADD R10, R11, R12     IM   REG  ALU  DM   REG
nop                        IM   REG  ALU  DM   REG
nop                             IM   REG  ALU  DM   REG
ADD R12, R10, R11                    IM   REG  ALU  DM   REG
On MIPS R0 = R0+R0 will do it -- saves no state
61
Or, The Hardware Can Simulate NOPS
[Figure: ADD R10, R11, R12; stall; stall; ADD R12, R10, R11 over clock cycles 1-8; the hardware inserts bubbles into the pipe stages in place of the two stalled slots]
62
Next lecture
How to fix the pipeline to avoid (most) dependency problems …