Post on 20-Dec-2015
1
COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Sep 24, 2003
Topic: Pipelining -- Intermediate Concepts
(Multicycle Operations; Exceptions)
2
Outline
• Multi-cycle operations
  - Floating-point operations
  - Structural and data hazards
• Interrupts, Faults and Exceptions
  - Precise exceptions
  - Complications in pipelines
• READING: Appendix A
3
Pipelining Multicycle Operations
• Assume five-stage pipeline
• Third stage (execution) has two functional units E1 and E2
  - Instruction goes through either E1 or E2, but not both
  - E1 and E2 are not pipelined
  - Stage delay of E1 = 2 cycles
  - Stage delay of E2 = 4 cycles
  - No buffering on inputs of E1 and E2
• Stage delay of other stages = 1 cycle
• Consider an instruction sequence of five instructions
  - Instructions 1, 3, 5 need E1
  - Instructions 2, 4 need E2
4
Space-Time Diagram: Multicycle Operations
(entries are instruction numbers; "." marks an idle cycle)
Delay  Stage   1  2  3  4  5  6  7  8  9 10 11 12 13
  1    IF      1  2  3  4  5  5  5
  1    ID      .  1  2  3  4  4  4  5
  2    E1      .  .  1  1  3  3  .  .  5  5
  4    E2      .  .  .  2  2  2  2  4  4  4  4
  1    MEM     .  .  .  .  1  .  3  2  .  .  5  4
  1    WB      .  .  .  .  .  1  .  3  2  .  .  5  4
• Out-of-order completion
  - 3 finishes before 2, and 5 finishes before 4
• Instructions may be delayed after entering the pipeline because of structural hazards
  - Instructions 2 and 4 both want to use E2 unit at same time
  - Instruction 4 stalls in ID unit
  - This causes instruction 5 to stall in IF unit
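The timing in the space-time diagram can be reproduced with a few lines of code. The following is a minimal sketch, assuming strict in-order flow with no buffering between stages; the function name `schedule` and the dictionary layout are illustrative choices, and contention for MEM and WB is not modelled because it does not arise in this five-instruction sequence.

```python
def schedule(units, delay):
    """Entry cycle of each instruction into IF, ID, its execution unit, MEM and WB.

    Assumes an in-order pipeline IF -> ID -> (E1 | E2) -> MEM -> WB in which an
    instruction holds its stage until the next stage is free (no buffering).
    """
    unit_free = {name: 1 for name in delay}   # first cycle in which each unit is idle
    rows, prev = [], None
    for number, unit in enumerate(units, start=1):
        # IF is released when the previous instruction moves into ID.
        if_start = 1 if prev is None else prev["ID"]
        # ID is released when the previous instruction enters its execution unit.
        id_start = if_start + 1 if prev is None else max(if_start + 1, prev["EX"])
        # Structural hazard: wait in ID until the required unit is idle.
        ex_start = max(id_start + 1, unit_free[unit])
        unit_free[unit] = ex_start + delay[unit]
        mem = ex_start + delay[unit]
        prev = {"instr": number, "unit": unit, "IF": if_start, "ID": id_start,
                "EX": ex_start, "MEM": mem, "WB": mem + 1}
        rows.append(prev)
    return rows


rows = schedule(["E1", "E2", "E1", "E2", "E1"], {"E1": 2, "E2": 4})
completion_order = [r["instr"] for r in sorted(rows, key=lambda r: r["WB"])]
print(completion_order)   # [1, 3, 2, 5, 4]
```

Instruction 4 enters ID in cycle 5 but cannot enter E2 until cycle 8, which in turn holds instruction 5 in IF during cycles 5 to 7.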
5
Floating-Point Operations in MIPS
[Figure: IF and ID feed four parallel execution paths, which rejoin at MEM and WB]
  EX
  M1 M2 M3 M4 M5 M6 M7
  A1 A2 A3 A4
  DIV (25)
• Structural hazard: not fully pipelined
• Structural hazard: instructions have varying running times
• WAW hazards possible; WAR hazards not possible
• Longer operation latency implies more frequent stalls for RAW hazards
• Out-of-order completion; has ramifications for exceptions
6
Structural Hazard on WB Unit
Instruction                 1    2    3    4    5    6    7    8    9    10   11
DIV.D (issued at t = -16)   D    D    D    D    D    D    D    D    D    MEM  WB
MUL.D F0, F4, F6            IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
integer instruction         .    IF   ID   EX   MEM  WB
integer instruction         .    .    IF   ID   EX   MEM  WB
ADD.D F2, F4, F6            .    .    .    IF   ID   A1   A2   A3   A4   MEM  WB
integer instruction         .    .    .    .    IF   ID   EX   MEM  WB
integer instruction         .    .    .    .    .    IF   ID   EX   MEM  WB
L.D F2, 0(R2)               .    .    .    .    .    .    IF   ID   EX   MEM  WB
• This is the worst-case scenario: max steady-state number of write ports is 1
  - Don't replicate resources; detect and serialize access as needed
• Early resolution
  - Track use of WB in ID stage (using shift register), stall instructions there
      - reservation register
  - Simplifies pipeline control; all stalls occur in ID
      - adds shift register and write-conflict logic
• Late resolution
  - Stall instructions at entry to MEM or WB stage
  - Complicates pipeline control (two stall locations)
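The early-resolution scheme can be sketched as a shift register consulted in ID. The code below is a minimal illustration, not a description of any particular machine: the class name `WriteBackReservation`, the helper `id_cycles`, and the choice of "cycles from ID to WB" as the distance measure are all assumptions made for the example.

```python
class WriteBackReservation:
    """Shift register: slots[k] is True if the WB port is claimed k cycles from now."""

    def __init__(self, depth):
        self.slots = [False] * depth

    def tick(self):
        # One clock cycle passes: every reservation moves one slot closer to WB.
        self.slots = self.slots[1:] + [False]

    def try_reserve(self, cycles_until_wb):
        if self.slots[cycles_until_wb]:
            return False                      # write conflict: stay in ID
        self.slots[cycles_until_wb] = True
        return True


def id_cycles(distances, depth=32):
    """Cycle in which each instruction leaves ID, given its ID-to-WB distance."""
    port = WriteBackReservation(depth)
    cycle, issued = 1, []
    for distance in distances:
        while not port.try_reserve(distance):
            port.tick()                       # stall one cycle in ID
            cycle += 1
        issued.append(cycle)
        port.tick()
        cycle += 1
    return issued


# Hypothetical distances: 9 for an FP multiply, 3 for integer ops, 6 for an FP add.
print(id_cycles([9, 3, 3, 6]))   # [1, 2, 3, 5]
```

In the usage line the FP add would reach WB in the same cycle as the multiply, so it is held in ID for one extra cycle; all stalls happen in one place, at the cost of the extra register and conflict check.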
7
WAW Hazards
Pipeline timing, cycles 1 to 13 (s = stall); stages are listed in order for each instruction:
DIV.D (issued at t = -16):  D D D D D D D D D MEM WB
MULT.D F0, F4, F6:          IF ID s M1 M2 M3 M4 M5 M6 M7 MEM WB
integer instruction:        IF s ID EX MEM WB
integer instruction:        IF ID EX MEM WB
ADD.D F2, F4, F6:           IF ID s A1 A2 A3 A4 MEM WB
L.D F2, 0(R2):              IF ID EX MEM WB
• WAW hazard arises only when no instruction between ADD.D and L.D uses result computed by ADD.D
  - Adding an instruction like "ADD.D F8,F2,F4" before L.D would stall pipeline enough for RAW hazard to avoid WAW hazard
  - Can happen through a branch/trap (example in HP3, Section A.9)
  - Rare situation, but must still handle correctly
• Hazard resolution
  - Delay the issue of L.D until ADD.D enters MEM
  - Cancel write of ADD.D
8
RAW Hazards
L: L.D   F4, 0(R2)
M: MUL.D F0, F4, F6
A: ADD.D F2, F0, F8
S: S.D   0(R2), F2
D: DIV.D F12, F4, F8
Stage   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
IF      L  M  A  A  S  S  S  S  S  S  S  D
ID      .  L  M  M  A  A  A  A  A  A  A  S  D
EX      .  .  L  .  .  .  .  .  .  .  .  .  S  S  S  S
Mult    .  .  .  .  M  M  M  M  M  M  M
Add     .  .  .  .  .  .  .  .  .  .  .  A  A  A  A
Div     .  .  .  .  .  .  .  .  .  .  .  .  .  D  D  D  D  D  D
MEM     .  .  .  L  .  .  .  .  .  .  .  M  .  .  .  A  S
WB      .  .  .  .  L  .  .  .  .  .  .  .  M  .  .  .  A  S
• Longer delays of FP operations increase the number of stalls in response to RAW hazards
• Two methods for reducing stalls
  - Compiler could have moved instruction D between instructions M and A, which would allow D to complete earlier; or hardware could detect this possibility and issue instruction D out of order
  - ID stage is a bottleneck because instructions wait there for their operands to be available; could add buffers (reservation stations) to functional units and let instructions await their operands there
9
Responsibilities of ID (all stalls in ID)
• Three sets of checks
  - Structural hazards
      - Check for availability of FP unit
      - Ensure WB unit will be available when needed
  - RAW hazards
      - Stall current instruction until its source registers are not listed as pending registers in a pipeline register that will not be available when current instruction needs the result
  - WAW hazards
      - If any instruction in adder, divider, or multiplier has same register destination as current instruction, stall current instruction
• Hazards between FP and integer instructions
  - Integer and FP instructions use disjoint sets of registers, except for FP-integer register moves
  - FP load-stores can conflict with integer load-stores in MEM stage
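The three sets of checks can be written down as a single decision function. This is a simplified sketch under stated assumptions: the `Pending` record, the unit names, and the `ready_in` counter (cycles until a result can be forwarded) are hypothetical, and the RAW test is reduced to "the producer is not ready yet".

```python
from dataclasses import dataclass


@dataclass
class Pending:
    dest: str       # register an in-flight instruction will write, e.g. "F2"
    unit: str       # unit it occupies: "add", "mul", "div" or "int"
    ready_in: int   # cycles until its result can be forwarded (0 = available now)


def stall_reason(sources, dest, unit, in_flight, busy_units, wb_taken):
    """Return the first hazard that keeps the current instruction in ID, or None."""
    # Structural hazards: functional unit and write-back port.
    if unit in busy_units:
        return "structural: functional unit busy"
    if wb_taken:
        return "structural: WB port already reserved"
    # RAW hazard: a source register is still being produced.
    for p in in_flight:
        if p.dest in sources and p.ready_in > 0:
            return f"RAW on {p.dest}"
    # WAW hazard: an instruction in the adder, multiplier or divider has the same destination.
    for p in in_flight:
        if p.unit in ("add", "mul", "div") and p.dest == dest:
            return f"WAW on {p.dest}"
    return None


multiply = Pending(dest="F0", unit="mul", ready_in=5)
print(stall_reason({"F0", "F8"}, "F2", "add", [multiply], set(), False))   # RAW on F0
```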
10
MIPS R4000 Floating-Point Pipeline
Stage  Functional Unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers

Add/Subtract
Stage   1  2  3  4
A       .  x  x
D
E
M
N
R       .  .  x  x
S       .  x  .  x
U       x

Multiply
Stage   1  2  3  4  5  6  7  8
A       .  .  .  .  .  .  x
D
E       .  x
M       .  x  x  x  x
N       .  .  .  .  .  x  x
R       .  .  .  .  .  .  .  x
S
U       x

Divide
Stage   1  2  3  4  … 30 31 32 33 34 35 36
A       .  x  .  .  .  .  x  .  x  .  x
D       .  .  .  x  …  x  x  x  x  x
E
M
N
R       .  .  x  .  .  .  .  x  .  x  .  x
S
U       x
11
Instruction Mixes in FP Pipeline: Adds Only

Add/Subtract
Stage   1  2  3  4
A       .  x  x
D
E
M
N
R       .  .  x  x
S       .  x  .  x
U       x
• Can't initiate another add on cycle 2: conflict in stages A and R
• Can't initiate another add on cycle 3: conflict in stage S

Successive adds (x, y) initiated every 3 cycles
Stage   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
A       .  x  x  .  y  y  .  x  x  .  y  y  .  x  x  .  y  y
D
E
M
N
R       .  .  x  x  .  y  y  .  x  x  .  y  y  .  x  x  .  y  y
S       .  x  .  x  y  .  y  x  .  x  y  .  y  x  .  x  y  .  y
U       x  .  .  y  .  .  x  .  .  y  .  .  x  .  .  y
• Forbidden latencies: 1 and 2
• Steady-state utilization (cycles 4 through 18) = (5*7)/(8*15) = 35/120 = 29.17%
• Total utilization (cycles 1 through 19) = (5+5*7+2)/(8*19) = 42/152 = 27.63%
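The forbidden latencies follow mechanically from the reservation table: a latency is forbidden if some stage is used at two cycles that far apart. A minimal sketch, assuming the Add/Subtract stage usage is transcribed as below (the dictionary `ADD` and the function name are illustrative):

```python
# Add/Subtract reservation table: stage -> cycles in which one add uses that stage.
ADD = {"U": [1], "S": [2, 4], "A": [2, 3], "R": [3, 4]}


def forbidden_latencies(table):
    """Initiation distances at which two operations would need the same stage in the same cycle."""
    return sorted({later - earlier
                   for cycles in table.values()
                   for earlier in cycles
                   for later in cycles
                   if later > earlier})


print(forbidden_latencies(ADD))   # [1, 2]
```

Stages A and R each contribute latency 1 and stage S contributes latency 2, so the smallest permitted initiation interval for back-to-back adds is 3 cycles.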
12
FP Pipeline: Multiplies Only

Multiply
Stage   1  2  3  4  5  6  7  8
A       .  .  .  .  .  .  x
D
E       .  x
M       .  x  x  x  x
N       .  .  .  .  .  x  x
R       .  .  .  .  .  .  .  x
S
U       x
Collision vector: 1 1 1 1 0 0 0 0

Successive multiplies (x, y, z) initiated every 4 cycles
Stage   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
A       .  .  .  .  .  .  x  .  .  .  y  .  .  .  z  .  .  .  x  .  .  .  y  .  .  .  z
D
E       .  x  .  .  .  y  .  .  .  z  .  .  .  x  .  .  .  y  .  .  .  z
M       .  x  x  x  x  y  y  y  y  z  z  z  z  x  x  x  x  y  y  y  y  z  z  z  z
N       .  .  .  .  .  x  x  .  .  y  y  .  .  z  z  .  .  x  x  .  .  y  y  .  .  z  z
R       .  .  .  .  .  .  .  x  .  .  .  y  .  .  .  z  .  .  .  x  .  .  .  y  .  .  .  z
S
U       x  .  .  .  y  .  .  .  z  .  .  .  x  .  .  .  y  .  .  .  z
• Collision vector: 1 indicates forbidden latency, 0 indicates allowed latency
• Steady-state utilization (cycles 5-24) = (5*10)/(8*20) = 50/160 = 31.25%
• Total utilization (cycles 1-28) = (5+5*10+5)/(8*28) = 60/224 = 26.79%
13
FP Pipeline: Adds and Multiplies

Add/Subtract
Stage   1  2  3  4
A       .  x  x
D
E
M
N
R       .  .  x  x
S       .  x  .  x
U       x

Multiply
Stage   1  2  3  4  5  6  7  8
A       .  .  .  .  .  .  x
D
E       .  x
M       .  x  x  x  x
N       .  .  .  .  .  x  x
R       .  .  .  .  .  .  .  x
S
U       x

Interleaved adds (a, b) and multiplies (m, n)
Stage   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
A       .  .  .  a  a  .  m  b  b  .  n  a  a  .  m  b  b  .  n  a  a  .  m  b  b  .  n
D
E       .  m  .  .  .  n  .  .  .  m  .  .  .  n  .  .  .  m  .  .  .  n
M       .  m  m  m  m  n  n  n  n  m  m  m  m  n  n  n  n  m  m  m  m  n  n  n  n
N       .  .  .  .  .  m  m  .  .  n  n  .  .  m  m  .  .  n  n  .  .  m  m  .  .  n  n
R       .  .  .  .  a  a  .  m  b  b  .  n  a  a  .  m  b  b  .  n  a  a  .  m  b  b  .  n
S       .  .  .  a  .  a  .  b  .  b  .  a  .  a  .  b  .  b  .  a  .  a  .  b  .  b
U       m  .  a  .  n  .  b  .  m  .  a  .  n  .  b  .  m  .  a  .  n  .  b
• Note out-of-order completion
• Steady-state utilization (cycles 6-21) = (4*17)/(8*16) = 68/128 = 53.13%
• Total utilization = (12+4*17+22)/(8*28) = 102/224 = 45.54%
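The steady-state figure can be checked by overlaying shifted copies of the two reservation tables and counting busy stage-cycles. The sketch below assumes multiplies are initiated at cycles 1, 5, 9, ... and adds at cycles 3, 7, 11, ..., which is one placement consistent with the row patterns shown; the names `occupancy` and `utilization` are illustrative.

```python
from fractions import Fraction

# Reservation tables: stage -> cycles (relative to initiation) in which it is used.
ADD = {"U": [1], "S": [2, 4], "A": [2, 3], "R": [3, 4]}
MULTIPLY = {"U": [1], "E": [2], "M": [2, 3, 4, 5], "N": [6, 7], "A": [7], "R": [8]}


def occupancy(initiations):
    """Set of busy (stage, cycle) slots; raises ValueError if two operations collide."""
    busy = set()
    for start, table in initiations:
        for stage, cycles in table.items():
            for c in cycles:
                slot = (stage, start + c - 1)
                if slot in busy:
                    raise ValueError(f"collision at {slot}")
                busy.add(slot)
    return busy


def utilization(busy, first, last, stages=8):
    """Fraction of stage-cycles in use between cycles first and last (inclusive)."""
    used = sum(1 for _, cycle in busy if first <= cycle <= last)
    return Fraction(used, stages * (last - first + 1))


mix = [(s, MULTIPLY) for s in range(1, 22, 4)] + [(s, ADD) for s in range(3, 24, 4)]
busy = occupancy(mix)
print(utilization(busy, 6, 21))   # 17/32, i.e. 68/128 = 53.13%
```

Because `occupancy` raises on any double booking, a successful run also confirms that this interleaving is collision-free.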
14
Interrupts, Faults, or Exceptions
• Synchronous, coerced interrupts that occur within instructions and after which execution must resume are the hardest to implement
• See Figure A.27 in HP3

Exception type  Sync/Async  User request/Coerced  Within/Between instr.  Resume/Terminate
I/O request     Async       Coerced               Between instr.         Resume
OS call         Sync        User request          Between instr.         Resume
Breakpoint      Sync        User request          Between instr.         Resume
Power fail      Async       Coerced               Within instr.          Terminate
15
Precise Interrupts (Sequential Processor)
• When interrupt occurs, state of interrupted process is saved, including PC (= u), registers, and memory
• Interrupt is precise if the following three conditions hold
  - All instructions preceding u have been executed, and have modified the state correctly
  - All instructions following u are unexecuted, and have not modified the state
  - If the interrupt was caused by an instruction, it was caused by instruction u, which is either completely executed (overflow) or completely unexecuted (VM page fault)
• Precise interrupts are desirable if software is to fix up error that caused interrupt and execution has to be resumed
  - Easy for external interrupts, could be complex and costly for internal
  - Imperative for some interrupts (VM page faults, IEEE FP standard)
16
Problems on Sequential Processors
• Instruction modifies state early, then causes an interrupt
  - State change must be undone
  - Example: First operand of VAX instruction uses autodecrement addressing mode, which writes a register. Trying to access second operand causes a page fault. Since instruction execution cannot be completed, we must restore the register written by autodecrement to its original value
• Long-running instructions
  - Not enough to be able to restore state, must make progress from interrupt to interrupt
  - Example: MVC on IBM 360 copies 256 bytes
      - No virtual memory, so interrupts not allowed to stop MVC
  - Example: MVC on IBM 370 copies 256 bytes
      - Has virtual memory, so first access all pages involved; after that, no interrupts allowed
  - Example: MVCL on IBM 370 copies up to 2^24 bytes
      - Has VM; two addresses and length are in registers
      - Registers saved and restored on interrupts (making progress)
17
Interrupts in MIPS Pipeline
Pipeline stage  Problem exceptions
IF              Page fault on instruction fetch; Misaligned memory access; Memory-protection violation
ID              Undefined or illegal opcode
EX              Arithmetic exception
MEM             Page fault on data fetch; Misaligned memory access; Memory-protection violation
WB              None
• How do we stop and restart execution on an interrupt to keep it precise?
• What problems do delayed branches cause?
• What happens if multiple exceptions occur in the pipeline?
• Can exceptions occur out-of-order?
• What problems do multi-cycle instructions cause?
18
MIPS Integer Pipeline, Single Interrupt
Instr   1  2  3  4  5  6  7  8  9 10
u-2     F  D  X  M  W
u-1     .  F  D  X  M  W
u       .  .  F  D  X  M  W
u+1     .  .  .  F  D  X  M  W
u+2     .  .  .  .  F  D  X  M  W
TRAP    .  .  .  .  .  F  D  X  M  W
• Force TRAP instruction in pipeline on next IF
• Turn off all writes for faulting instruction and subsequent instructions
• After exception-handling routine in OS receives control, save PC of faulting instruction
• When exception has been handled, the RFE instruction reloads PC and restarts sequential instruction execution
19
Complications with Delayed Branches
Instr          1  2  3  4  5  6  7  8  9
1 branch       F  D  X  M  W
2 delay slot   .  F  D  X  M  W
u BTA          .  .  F  D  X  M  W
u+1            .  .  .  F  D  X  M  W
u+2            .  .  .  .  F  D  X  M  W
• Suppose instruction 2 causes an exception (e.g., a page fault) after the taken branch completes (determining that the branch outcome is true)
  - Instruction 2 cannot complete
  - Neither can instruction u
• On restart, we do not have sequential execution
  - We must remember two PC values: 2 and u
20
Complications with Multiple Exceptions
Instr   1  2  3  4  5  6
LW      F  D  X  M  W
ADD     .  F  D  X  M  W
• At same cycle, LW takes a data page fault and ADD takes an arithmetic exception
• On an unpipelined machine, LW's exception would occur first
  - Handle the page fault
  - Restart execution
  - ADD will cause arithmetic exception to reoccur; handle it then
21
Complications with Out-of-order Exceptions
Instr   1  2  3  4  5  6
LW      F  D  X  M  W
ADD     .  F  D  X  M  W
• LW takes data page fault, ADD takes instruction page fault
• Relative timing differs between unpipelined and pipelined machines
  - To maintain precise interrupts, we need to consider both when they occur and the instructions that caused them
• Post exceptions in exception status vector, turn off state modifications, and check vector in WB unit
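The status-vector idea can be illustrated with a small cycle-by-cycle model. This is a sketch under simplifying assumptions (one instruction fetched per cycle, no stalls); the `Instr` record and the function `run` are hypothetical names, and the `raises` field simply scripts which stage will fault.

```python
from dataclasses import dataclass, field

STAGES = ["F", "D", "X", "M", "W"]


@dataclass
class Instr:
    name: str
    raises: dict = field(default_factory=dict)   # stage -> exception this instruction will hit
    status: list = field(default_factory=list)   # exception status vector carried down the pipe
    writes_enabled: bool = True


def run(program):
    """Return (detection order, exception actually taken) for an unstalled 5-stage pipeline."""
    detected = []
    last_cycle = len(program) + len(STAGES) - 1
    for cycle in range(1, last_cycle + 1):
        for index, instr in enumerate(program):          # oldest instruction first
            position = cycle - 1 - index
            if not 0 <= position < len(STAGES):
                continue
            stage = STAGES[position]
            if stage in instr.raises:
                instr.status.append(instr.raises[stage])  # post, do not act yet
                instr.writes_enabled = False              # turn off state modification
                detected.append((cycle, instr.name))
            if stage == "W" and instr.status:
                return detected, (instr.name, instr.status[0])
    return detected, None


program = [Instr("LW", {"M": "data page fault"}),
           Instr("ADD", {"F": "instruction page fault"})]
print(run(program))
# ([(2, 'ADD'), (4, 'LW')], ('LW', 'data page fault'))
```

ADD's fault is detected first (cycle 2), yet LW's fault is the one taken, because the vector is only examined when each instruction reaches WB in program order.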
22
Complications with Multicycle Operations
Instruction           1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
DIVF F0, F2, F4       F  D  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  M  W
ADDF F10, F10, F8     .  F  D  X  X  X  X  M  W
SUBF F12, F12, F14    .  .  F  D  X  X  X  X  M  W
• Instructions are independent (no hazards) and therefore issue immediately
• Differences in running times cause out-of-order termination
• DIVF throws arithmetic exception late in its execution
  - At that point, ADDF and SUBF have both completed execution and destroyed one of their operands
• Can we maintain precise interrupts under these conditions?
23
FP Pipeline Exceptions: Solns. 1 and 2
• Settle for imprecise interrupts (CRAY, with checkpointing)
  - Done on Alpha 21064 and 21164, IBM Power-1 and Power-2, MIPS R8000 by supporting a fast imprecise mode and a slow precise mode
  - Not an option if you have to support virtual memory or IEEE floating point standard
• Software finishes certain instructions (SPARC)
  - Keep enough state around for trap handler to create a precise sequence for exception and finish work for some instruction stages
  - Only FP instructions cause this problem
Cycle   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
        F  D  X  X  X  X  X  X  X  X  X  X  X  X  X  X  X  M  W
        .  F  D  X  X  X  X  X  X  X  X  M  W
        .  .  F  D  X  X  X  X  X  X  X  X  M  W
        .  .  .  F  D  X  X  X  X  M  W
24
FP Pipeline Exceptions: Solns. 3 and 4
• Stalling (MIPS R2000/3000, MIPS R4000, Pentium)
  - An instruction is allowed to issue only if it is certain that all the instructions before the issuing instruction will complete without causing an exception
  - To prevent excessive stalling, FP units must decide on possibility of exceptions early in pipeline
• General methods (PowerPC 620, MIPS R10000)
  - Reorder buffer, history file, future file
  - An instruction is allowed to finalize its writes only when all previously issued instructions are complete
  - More naturally used in connection with ILP (Chapter 4)
  - Significant complexity (to be discussed later)
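The rule that an instruction finalizes its writes only after all earlier instructions are complete can be sketched as a queue that retires from the head. The code below is an illustration of the reorder-buffer idea only, not of any listed processor; the tuple layout and the completion cycles (26, 7, 8) are assumptions chosen to resemble a long divide followed by two short FP operations.

```python
from collections import deque


def run_reorder_buffer(entries):
    """entries: program-order list of (name, dest, done_cycle, exception or None).

    Returns (register writes in commit order, exception info or None). A result is
    written to the register file only when its entry reaches the head of the buffer.
    """
    rob = deque(entries)
    register_writes = []                  # (cycle, name, dest)
    cycle = 0
    while rob:
        cycle += 1
        while rob and rob[0][2] <= cycle:             # head has finished executing
            name, dest, _, exception = rob.popleft()
            if exception is not None:
                squashed = [entry[0] for entry in rob]    # younger results are discarded
                return register_writes, (name, exception, squashed)
            register_writes.append((cycle, name, dest))
    return register_writes, None


program = [("DIVF", "F0", 26, "arithmetic exception"),
           ("ADDF", "F10", 7, None),
           ("SUBF", "F12", 8, None)]
print(run_reorder_buffer(program))
# ([], ('DIVF', 'arithmetic exception', ['ADDF', 'SUBF']))
```

Although the two short operations finish executing long before the divide, neither has overwritten its destination register when the divide's exception is taken, so the interrupt is precise.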