CS510 Computer Architectures, Lecture 7: Pipeline Hazards

Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- The pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce the speedup
- Time to fill the pipeline and time to drain it reduce the speedup
[figure: laundry analogy, tasks A-D overlapped in time from 6 PM to 9 PM]

It's Not That Easy to Achieve the Promised Performance
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: the HW cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: pipelining of branches and other instructions that change the PC
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles", i.e., idle clock cycles, into the pipeline.

Structural Hazards
[pipeline diagram, CC1-CC9: the LOAD's memory access (MEM) and a later instruction's instruction fetch (IF) use the memory in the same clock cycle]
Operation on memory by two different instructions in the same clock cycle.

Structural Hazards with Single-Port Memory
[pipeline diagram, CC1-CC9: with a single memory port, instruction fetches must stall while the LOAD accesses memory]
3 cycles of stall with a 1-port memory.

Avoiding Structural Hazard with Dual-Port Memory
[pipeline diagram, CC1-CC9: LOAD and Instr 1-5 all proceed, using separate instruction-memory (IM) and data-memory (DM) ports]
No stall with a 2-port memory.

Speed Up Equation for Pipelining
The ideal CPI for a pipelined machine is almost always 1.
  Ideal CPI = CPI unpipelined / Pipeline depth (number of pipeline stages)

  Speedup from pipelining = Avg instruction time unpipelined / Avg instruction time pipelined
                          = (CPI unpipelined x Clock cycle unpipelined) / (CPI pipelined x Clock cycle pipelined)
                          = [(Ideal CPI x Pipeline depth) / CPI pipelined] x (Clock cycle unpipelined / Clock cycle pipelined)

Speed Up Equation for Pipelining (with stalls)
  CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
                = 1 + Pipeline stall clock cycles per instruction

  Speedup = [(Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)] x (Clock cycle unpipelined / Clock cycle pipelined)
          = [Pipeline depth / (1 + Pipeline stall CPI)] x (Clock cycle unpipelined / Clock cycle pipelined)

Dual-Port vs Single-Port Memory
- Machine A: 2-port memory (needs no stall for loads); same clock cycle time as the unpipelined machine
- Machine B: 1-port memory (stalls on loads); clock rate 1.05 times faster than the unpipelined machine
- Ideal CPI = 1 for both; loads are 40% of the instructions executed
  SpeedUp A = [Pipeline Depth / (1 + 0)] x (Clock unpipe / Clock pipe) = Pipeline Depth
  SpeedUp B = [Pipeline Depth / (1 + load stall CPI)] x (Clock unpipe / (Clock unpipe / 1.05))
            = (Pipeline Depth / 1.2) x 1.05 = 0.87 x Pipeline Depth
  SpeedUp A / SpeedUp B = Pipeline Depth / (0.87 x Pipeline Depth) = 1.15
Machine A is 1.15 times faster.
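As a sanity check of the comparison above, here is a minimal Python sketch of the speedup formula; note that the 0.2 stall CPI used for machine B is back-calculated from the slide's "Pipeline Depth / 1.2" step and is an assumption rather than a stated figure.

    # Speedup = [Pipeline depth / (1 + pipeline stall CPI)] x (Clock_unpipelined / Clock_pipelined)
    def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
        """clock_ratio = unpipelined clock cycle / pipelined clock cycle."""
        return depth / (1.0 + stall_cpi) * clock_ratio

    depth = 5                                                   # 5-stage DLX pipeline
    speedup_a = pipeline_speedup(depth, stall_cpi=0.0)          # machine A: dual-port memory, no load stalls
    speedup_b = pipeline_speedup(depth, stall_cpi=0.2,          # machine B: assumed 0.2 stall CPI (see note above)
                                 clock_ratio=1.05)              # 1.05x faster clock than unpipelined
    print(speedup_a, speedup_b, speedup_a / speedup_b)          # 5.0, ~4.38, ratio ~1.14 (the slide's 1.15)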
Data Hazard on Registers
  ADD R1,R2,R3
  SUB R4,R1,R3
  AND R6,R1,R7
  OR  R8,R1,R9
  XOR R10,R11,R1
[pipeline diagram, CC1-CC9: ADD writes R1 in CC5, while the following instructions want to read R1 in CC3-CC6]

Data Hazard on Registers
Registers can be made to read and write in the same cycle: the data is stored into Ri in the first half of the clock cycle, and that data can be read from Ri in the second half of the same clock cycle.
[figure: store into Ri in the first half of a clock cycle, read from Ri in the second half]

Data Hazard on Registers
[pipeline diagram, CC1-CC9: even with the split-cycle register file, SUB needs to stall 2 cycles before it can read the R1 value written by ADD]

Three Generic Data Hazards: Read After Write (RAW)
Instr i followed by Instr j: Instr j tries to read an operand before Instr i writes it.
  Instr i:  LW  R1, 0(R2)
  Instr j:  SUB R4, R1, R5

Three Generic Data Hazards: Write After Read (WAR)
Instr i followed by Instr j: Instr j tries to write an operand before Instr i reads it.
  Instr i:  ADD R1, R2, R3
  Instr j:  LW  R2, 0(R5)
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5

Three Generic Data Hazards: Write After Write (WAW)
Instr i followed by Instr j: Instr j tries to write an operand before Instr i writes it, leaving the wrong result (Instr i's, not Instr j's).
  Instr i:  LW R1, 0(R2)
  Instr j:  LW R1, 0(R3)
Can't happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5
We will see WAR and WAW in later, more complicated pipelines.

Forwarding to Avoid Data Hazards
[pipeline diagram, CC1-CC9: the ALU result of ADD R1,R2,R3 is forwarded directly to the ALU inputs of SUB, AND, OR, and XOR, removing the stalls]

HW Change for Forwarding
[datapath figure: multiplexers select the ALU and zero-test inputs from the register file or from the D/A, A/M, and M/W pipeline buffers (the forwarding paths), with the data memory shown between the A/M and M/W buffers]

Load Delay Due to Data Hazard
  LOAD R1,0(R2)
  SUB  R4,R1,R6
  AND  R6,R1,R7
  OR   R8,R1,R9
[pipeline diagram, CC1-CC9: without forwarding, the instructions after the LOAD cannot use R1 until it has been written back]
Load delay = 2 cycles.

Load Delay with Forwarding
[pipeline diagram, CC1-CC9: the loaded value is forwarded from the M/W buffer, but SUB still needs R1 one cycle before the LOAD can supply it]
Load delay with forwarding = 1 cycle. We need to add hardware, called a pipeline interlock, to enforce this remaining stall.
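The pipeline interlock mentioned above can be sketched as a simple check a simulator might perform; this is an illustrative Python sketch with a made-up instruction encoding, not the actual DLX control logic.

    # Stall one cycle when a load's destination register is read by the very next
    # instruction; forwarding alone cannot cover this case.
    def needs_load_interlock(ex_instr, id_instr):
        """ex_instr / id_instr are dicts like {'op': 'LW', 'rd': 1, 'rs': [2]} (hypothetical format)."""
        return (ex_instr is not None
                and ex_instr['op'] == 'LW'
                and ex_instr['rd'] in id_instr['rs'])

    lw  = {'op': 'LW',  'rd': 1, 'rs': [2]}       # LW  R1, 0(R2)
    sub = {'op': 'SUB', 'rd': 4, 'rs': [1, 6]}    # SUB R4, R1, R6 -- uses R1 right away
    add = {'op': 'ADD', 'rd': 7, 'rs': [8, 9]}    # ADD R7, R8, R9 -- independent of R1
    print(needs_load_interlock(lw, sub))          # True  -> insert one bubble
    print(needs_load_interlock(lw, add))          # False -> no stall needed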
Software Scheduling to Avoid Load Hazards
Try to produce fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Fast code:
  LW  Rb, b
  LW  Rc, c
  LW  Re, e
  ADD Ra, Rb, Rc
  LW  Rf, f
  SW  a, Ra
  SUB Rd, Re, Rf
  SW  d, Rd

Slow code (with forwarding):
  LW  Rb, b
  LW  Rc, c
  ADD Ra, Rb, Rc      <- RAW stall on Rc
  SW  a, Ra
  LW  Re, e
  LW  Rf, f
  SUB Rd, Re, Rf      <- RAW stall on Rf
  SW  d, Rd

Compiler Avoiding Load Stalls
[bar chart: percentage of loads stalling the pipeline, unscheduled vs. scheduled code: tex 65% vs. 25%, spice 42% vs. 14%, gcc 54% vs. 31%]

Pipelined DLX Datapath
[datapath figure: IF, ID, EX, MEM, and WB stages with the F/D, D/A, A/M, and M/W pipeline buffers; the branch address calculation and the branch condition (Zero?) decision happen late, so the branch decision for the target address is not available until the end of EX/MEM]

Control Hazard on Branches: Three Stall Cycles
  40  BEQ R1,R3,target   (target = 80)
  44  AND R12,R2,R5
  48  OR  R13,R6,R2
  52  ADD R14,R2,R2
  ...
  80  LD  R4,100(R7)
[pipeline diagram, CC1-CC9: the three sequential instructions after BEQ enter the pipeline before the branch target address is available; they should not be executed when the branch condition is true]
Branch delay = 3 cycles.

Control Hazard on Branches: Three Stall Cycles
  Branch instruction     IF ID EX MEM WB
  Branch successor          IF ID EX MEM
  Branch successor + 1         IF ID EX
  Branch successor + 2            IF ID
- At the successor's IF we don't yet know that the instruction being executed is a branch, so we fetch the branch successor.
- Once we know it is a branch, we must stall until the branch target address is known.
- The target address becomes available only after MEM.
- Result: 3 wasted clock cycles for a TAKEN branch.

Branch Stall Impact
- If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI = 1.9: half of the ideal speed
- Two-part solution:
  - Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
  - Compute the TAKEN branch address (branch target) earlier
- A DLX branch tests whether a register is = 0 or != 0
- DLX solution: get the new PC earlier
  - Move the zero test to the ID stage
  - Add an additional adder to calculate the new PC (taken PC) in the ID stage
  - Result: a 1-clock-cycle penalty for branches, in contrast to 3 cycles
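A worked check of the branch-stall arithmetic above (plain Python; the 1.3 figure for the improved pipeline is derived from the slide's numbers, not quoted from it):

    ideal_cpi        = 1.0
    branch_freq      = 0.30     # 30% of instructions are branches
    stall_penalty    = 3        # original DLX: branch resolved in MEM
    improved_penalty = 1        # zero test and target adder moved to ID

    print(ideal_cpi + branch_freq * stall_penalty)      # 1.9 -> roughly half the ideal speed
    print(ideal_cpi + branch_freq * improved_penalty)   # 1.3 with the 1-cycle penalty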
Pipelined DLX Datapath (revised)
[datapath figure: an extra adder computes the branch target and the zero test is moved forward, to get the target address and the condition earlier; the target address is available after ID, so when a branch instruction is in the Execute stage the next address is already available]

Branch Behavior in Programs
- Conditional branch frequencies average up to 16% for integer programs and up to 12% for floating-point programs
- Forward taken ...%, backward taken ...%, and the average over all conditional branches taken ...%

Branch Hazard Alternatives
- Stall until the branch direction is clear
- Predict branch NOT TAKEN
- Predict branch TAKEN
- Delayed branch

Branch Hazard Alternatives: (1) Stall
Stall until the branch direction is clear.
Original DLX pipeline (branch resolved in MEM): 3-cycle penalty
  Branch instruction     IF ID EX MEM WB
  Branch successor          stall stall stall IF ID EX MEM
  Branch successor + 1                           IF ID EX
  Branch successor + 2                              IF ID
Revised DLX pipeline (branch address available by EX): 1-cycle penalty (the branch delay slot)
  Branch instruction     IF ID EX MEM WB
  Branch successor          stall IF ID EX MEM WB
  Branch successor + 1               IF ID EX MEM
  Branch successor + 2                  IF ID EX

Branch Hazard Alternatives: (2) Predict Branch NOT TAKEN
- Execute successor instructions in sequence; PC+4 is already calculated, so use it to fetch the next instruction
- Flush the instructions in the pipeline if the branch is actually TAKEN
- Takes advantage of the late pipeline state update
- 47% of DLX branches are NOT TAKEN on average
NOT TAKEN branch: no penalty
  instruction i      IF ID EX MEM WB
  instruction i+1       IF ID EX MEM WB
  instruction i+2          IF ID EX MEM WB
TAKEN branch: 1-cycle penalty (the fall-through instruction already in progress is flushed)
  instruction i      IF ID EX MEM WB
  instruction i+1       IF (flushed)
  instruction T            IF ID EX MEM WB

Branch Hazard Alternatives: (3) Predict Branch TAKEN
- 53% of DLX branches are TAKEN on average
- The branch target address is available only after ID in DLX, so DLX still incurs a 1-cycle penalty for a TAKEN branch
- Other machines may know the branch target before the outcome:
  - NOT TAKEN branch: 2-cycle penalty in DLX (1 in such machines)
  - TAKEN branch: 1-cycle penalty in DLX (0 in such machines)
NOT TAKEN branch:
  instruction i      IF ID EX MEM WB
  instruction T         stall IF             <- TAKEN address not available at fetch time; fetched, then flushed
  instruction i+1                IF ID EX MEM WB
TAKEN branch:
  instruction i      IF ID EX MEM WB
  instruction T         stall IF ID EX MEM WB    <- TAKEN address available after ID
  instruction T+1                IF ID EX MEM WB
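A quick arithmetic check of the predict-NOT-TAKEN scheme above (a small Python sketch; the 53% taken figure is from the slides):

    # Predict NOT TAKEN: no penalty when the branch is in fact not taken,
    # one flushed instruction (1 cycle) when it is taken.
    p_taken = 0.53
    expected_penalty = (1 - p_taken) * 0 + p_taken * 1
    print(expected_penalty)     # ~0.53 stall cycles per conditional branch

The evaluation slide further on uses 0.65 as the effective penalty for this scheme because unconditional branches, which are always taken, are counted as well.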
Branch Hazard Alternatives: (4) Delayed Branch
Delay the branch so that it takes effect AFTER one or more successor instructions:
  branch instruction
  sequential successor 1
  ...
  sequential successor n     <- delayed branch of length n
  branch target (if taken)
A 1-slot delayed branch is enough to allow a proper decision and branch target address in the 5-stage DLX pipeline with the control-hazard improvement.

Delayed Branch
Where do we get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is TAKEN
- From the fall-through path: only valuable when the branch is NOT TAKEN
- Canceling branches allow more slots to be filled
Compiler effectiveness for a single delayed branch slot:
- Fills about 60% of delayed branch slots
- About 80% of the instructions executed in delayed branch slots are useful in the computation
- About 50% (60% x 80%) of slots are usefully filled

Branch Hazard Alternatives: Delayed Branch (filling the slot)
From the target (the instruction at the branch target is copied into the delay slot):
  Before:                        After:
    SUB R4, R5, R6  (target)       ADD R1, R2, R3
    ...                            if R1=0 then
    ADD R1, R2, R3                   SUB R4, R5, R6   <- delay slot
    if R1=0 then
    <delay slot>
- Improves performance when the branch is TAKEN (e.g., a loop branch)
- It must be all right to execute the rescheduled instruction when the branch is NOT TAKEN
- The instruction may need to be duplicated if it is the target of another branch instruction

From fall through (the instruction after the branch is moved into the delay slot):
  Before:                        After:
    ADD R1, R2, R3                 ADD R1, R2, R3
    if R1=0 then                   if R1=0 then
    <delay slot>                     SUB R4, R5, R6   <- delay slot
    SUB R4, R5, R6
- Improves performance when the branch is NOT TAKEN
- It must be all right to execute the instruction when the branch is TAKEN

From before (an independent instruction before the branch is moved into the delay slot):
  Before:                        After:
    ADD R1, R2, R3                 if R2=0 then
    if R2=0 then                     ADD R1, R2, R3   <- delay slot
    <delay slot>
- Always improves performance
- The branch must not depend on the rescheduled instruction

Limitations on Delayed Branch
- Difficulty in finding useful instructions to fill the delayed branch slots
- Solution: squashing
  - The delayed branch is associated with a branch prediction
  - Instructions on the predicted path are executed in the delayed branch slot
  - If the branch outcome is mispredicted, the instructions in the delayed branch slot are squashed (discarded)

Canceling Branch
Used when delayed-branch scheduling, i.e., filling the delay slot, cannot be done because of
- restrictions on which instructions may be scheduled into the delay slots, and
- limits on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN.
The instruction includes the direction in which the branch is predicted:
- When the branch behaves as predicted, the instructions in the delay slot are executed
- When the branch is incorrectly predicted, the instructions in the delay slot are turned into no-ops
A canceling branch therefore allows the delay slot to be filled even when the candidate instruction does not meet the usual requirements.

Evaluating Branch Alternatives
  Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
Conditional and unconditional branches collectively: 14% frequency; 65% of branches are TAKEN.

  Scheduling scheme    Branch penalty   CPI                    Speedup vs. unpipelined   Speedup vs. stall
  Stall pipeline       3                1 + 0.14x3    = 1.42   5/1.42 = 3.5              1.0
  Predict TAKEN        1                1 + 0.14x1    = 1.14   5/1.14 = 4.4              1.25
  Predict NOT TAKEN    0.65             1 + 0.14x0.65 = 1.09   5/1.09 = 4.6              1.30
  Delayed branch       0.5              1 + 0.14x0.5  = 1.07   5/1.07 = 4.7              1.33
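The table above can be reproduced in a few lines (a Python sketch; only the 14% branch frequency, the per-scheme penalties, and the 5-stage pipeline depth from the slide are used):

    depth, branch_freq = 5, 0.14
    penalties = {'stall pipeline': 3, 'predict taken': 1,
                 'predict not taken': 0.65, 'delayed branch': 0.5}
    for scheme, penalty in penalties.items():
        cpi = 1 + branch_freq * penalty
        print(scheme, round(cpi, 2),                  # pipeline CPI
              round(depth / cpi, 2),                  # speedup vs. unpipelined
              round((1 + branch_freq * 3) / cpi, 2))  # speedup vs. stalling the pipeline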
Static (Compiler) Prediction of Taken/Untaken Branches: Code Motion
  LW   R1, 0(R2)
  SUB  R1, R1, R3      <- depends on the LW, so it needs a stall
  BEQZ R1, L
  OR   R4, R5, R6
  ADD  R10, R4, R3
L: ADD  R7, R8, R9
- Predict NOT TAKEN: if the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the intervening instructions, moving OR R4,R5,R6 into the stall slot can increase speed
- Predict TAKEN: if the branch is almost always TAKEN, R7 is not needed, and R8 and R9 are not modified on the fall-through path, moving ADD R7,R8,R9 into the stall slot can increase speed

Static (Compiler) Prediction of Taken/Untaken Branches
- Improves the strategy for placing instructions in the delay slot
- Two strategies:
  - Direction-based prediction: predict backward branches TAKEN and forward branches NOT TAKEN
  - Profile-based prediction: record branch behavior and predict each branch based on the prior run(s)
[bar charts: frequency of misprediction for the "always taken" scheme (y-axis 0%-70%) and misprediction rate for "taken backwards, not taken forwards" (y-axis 0%-14%), for alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv]

Evaluating Static Branch Prediction Strategies
- The misprediction rate ignores the frequency of branches
- The number of instructions between mispredicted branches is a better metric
[bar chart: instructions per mispredicted branch, profile-based vs. direction-based, for alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv]

Pipelining Summary
- Just overlap tasks; this is easy if the tasks are independent
- Speed up = [Pipeline depth / (1 + Pipeline stall CPI)] x (Clock cycle unpipelined / Clock cycle pipelined), so hazards and their stalls limit the achievable speedup
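As a closing illustration, the "instructions per mispredicted branch" metric from the Evaluating Static Branch Prediction Strategies slide can be sketched as below; the formula used (one over branch frequency times misprediction rate) and the example rates are assumptions for illustration, not values from the lecture.

    def instrs_per_mispredicted_branch(branch_freq, mispredict_rate):
        # mispredicted branches per instruction = branch_freq * mispredict_rate
        return 1.0 / (branch_freq * mispredict_rate)

    # e.g., with 14% branches: a hypothetical 10% misprediction rate leaves ~71
    # instructions between mispredictions, a 2% rate leaves ~357.
    print(instrs_per_mispredicted_branch(0.14, 0.10))
    print(instrs_per_mispredicted_branch(0.14, 0.02))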