Post on 07-Oct-2020
Instruction Level Parallelism
● ILP, Loop-level Parallelism
● Dependences, Hazards
● Speculation, Branch prediction
Basic Block
● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop
DADDUI R1, 0(R2)
BEQZ R2, L1
LW R1, 0(R2)
L1:
ILP
● Name dependence
– Antidependence, output dependence
– Register renaming
● Hazard
– Overlap during execution would change the order of access to the operand involved in the dependence.
for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop
Data Dependence
Name Dependence
ADD.D F4, F0, F2
ADD.D F4, F6, F8
Hazards
● Program Order
– ILP preserves program order only where it affects the outcome of the program
● Structural Hazards
– Resource conflicts
● Data Hazards
– RAW, WAW, WAR
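The three data hazard classes can be checked mechanically from the registers each instruction reads and writes. The sketch below is only an illustration: the `(dest, sources)` tuple encoding and the function name `classify_hazards` are assumptions made for this example, not notation from the slides.

```python
def classify_hazards(first, second):
    """Return the hazards possible if `second` overlaps with the earlier `first`.

    Each instruction is a (dest, sources) pair; dest is None when the
    instruction writes no register (for example a store).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = []
    if dest1 is not None and dest1 in srcs2:
        hazards.append("RAW")   # second reads what first writes (true data dependence)
    if dest2 is not None and dest2 in srcs1:
        hazards.append("WAR")   # second writes what first reads (antidependence)
    if dest1 is not None and dest1 == dest2:
        hazards.append("WAW")   # both write the same register (output dependence)
    return hazards


load = ("F0", ["R1"])           # L.D    F0, 0(R1)
add1 = ("F4", ["F0", "F2"])     # ADD.D  F4, F0, F2
add2 = ("F4", ["F6", "F8"])     # ADD.D  F4, F6, F8
store = (None, ["F4", "R1"])    # S.D    F4, 0(R1)
dec = ("R1", ["R1"])            # DADDUI R1, R1, #-8

print(classify_hazards(load, add1))   # ['RAW']
print(classify_hazards(add1, add2))   # ['WAW']
print(classify_hazards(store, dec))   # ['WAR']
```

WAR and WAW come from name dependences only, which is why register renaming can remove them while RAW hazards remain.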
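Structural Hazard
Time (clock cycles):  1    2    3    4    5    6    7    8    9
i1                    MEM  ID   EX   MEM  WB
i2                         MEM  ID   EX   MEM  WB
i3                              MEM  ID   EX   MEM  WB
i4                                   MEM  ID   EX   MEM  WB
i5                                        MEM  ID   EX   MEM  WB
...
(The first stage of each row is the instruction fetch, labelled MEM because it accesses memory.)
HAZARD!!! With a single memory port, i1's data access (MEM) and i4's instruction fetch both need the memory in cycle 4.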
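The conflict in the diagram can be found by lining up each instruction's fetch cycle with the data-memory cycle of earlier instructions. This is a minimal sketch assuming one instruction is fetched per cycle, a five-stage pipeline, and a single shared memory port; the function name `memory_conflicts` and its return format are invented for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(uses_data_memory):
    """Return (cycle, fetching instruction, accessing instruction) conflicts.

    uses_data_memory[i] is True when instruction i+1 accesses data memory in
    its MEM stage. Instruction numbers in the result are 1-based (i1, i2, ...).
    """
    n = len(uses_data_memory)
    conflicts = []
    for i in range(n):                       # instruction i+1 is fetched in cycle i+1
        mem_cycle = (i + 1) + STAGES.index("MEM")
        fetched = mem_cycle                  # instruction number fetched in that cycle
        if uses_data_memory[i] and fetched <= n:
            conflicts.append((mem_cycle, fetched, i + 1))
    return conflicts


# Five instructions that all access data memory, as drawn on the slide.
print(memory_conflicts([True] * 5))   # [(4, 4, 1), (5, 5, 2)]
```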
Data Hazard
DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11

Time (clock cycles) →
DADD    IM   REG  ALU  DM   REG
DSUB         IM   REG  ALU  DM   REG
AND               IM   REG  ALU  DM   REG
OR                     IM   REG  ALU  DM
XOR                         IM   REG  ALU
Avoiding Data Hazards – Forwarding
DADD R1,R2,R3
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9
XOR  R10,R1,R11

Time (clock cycles) →
DADD    IM   REG  ALU  DM   REG
DSUB         IM   REG  ALU  DM   REG
AND               IM   REG  ALU  DM   REG
OR                     IM   REG  ALU  DM
XOR                         IM   REG  ALU
Load Delay Slot
LD   R1,0(R2)
DSUB R4,R1,R5
AND  R6,R1,R7
OR   R8,R1,R9

Time (clock cycles) →
LD      IM   REG  ALU  DM   REG
DSUB         IM   REG  ALU  DM   REG
AND               IM   REG  ALU  DM   REG
OR                     IM   REG  ALU  DM

The loaded value might not be available in the destination register for use by the instruction immediately following the load: the LOAD DELAY SLOT.
Cost of Stalls
Data references = 40% of instructions. Ideal CPI = 1. The processor with the hazard has a clock rate 1.1 times that of the processor without the hazard. Which processor is faster?
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
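One way to work the question is to plug the numbers into the CPI formula. The sketch below assumes that the hazard costs exactly one stall cycle per data reference, which the slide does not state, and normalises the clock cycle of the hazard-free processor to 1.

```python
def time_per_instruction(ideal_cpi, stall_cycles_per_instr, clock_cycle_time):
    """Average instruction time = (ideal CPI + stall cycles) * cycle time."""
    return (ideal_cpi + stall_cycles_per_instr) * clock_cycle_time


data_ref_fraction = 0.40     # from the slide
stall_per_data_ref = 1       # assumption: one stall cycle per data reference
clock_ratio = 1.1            # from the slide: hazard processor is clocked 1.1x faster

t_without = time_per_instruction(1.0, 0.0, 1.0)
t_with = time_per_instruction(1.0, data_ref_fraction * stall_per_data_ref,
                              1.0 / clock_ratio)

print(round(t_with / t_without, 2))   # 1.27
```

Under these assumptions the processor without the hazard is about 1.27 times faster, because the 40% increase in CPI outweighs the 10% gain in clock rate.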
Pipeline Scheduling
Reorder the instructions of the program so that dependent
instructions are far enough apart
Done by the compiler, before the program runs:
Static Instruction Scheduling
Done by the hardware, when the program is running:
Dynamic Instruction Scheduling
Pipeline Scheduling
Original Program
LW R3, 0(R1)
stall
ADDI R5, R3, 1
ADD R2, R2, R3
LW R13, 0(R11)
stall
ADD R12, R13, R3
Total Execution Cycles: 7

Scheduled Code
LW R3, 0(R1)
LW R13, 0(R11)
ADDI R5, R3, 1
ADD R2, R2, R3
ADD R12, R13, R3
Total Execution Cycles: 5
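The two cycle counts can be reproduced with a very small model. This sketch assumes an in-order, single-issue pipeline where the only stall is a one-cycle load-use delay, and it counts issue slots the way the slide does; the tuple encoding and the name `count_cycles` are made up for the example.

```python
def count_cycles(program, load_use_delay=1):
    """Count issue slots, inserting a stall after a load whose result is used at once.

    program is a list of (opcode, dest, sources) tuples in program order.
    """
    cycles = 0
    prev = None
    for opcode, dest, sources in program:
        if prev is not None and prev[0] == "LW" and prev[1] in sources:
            cycles += load_use_delay     # loaded value is not ready yet: stall
        cycles += 1                      # issue this instruction
        prev = (opcode, dest, sources)
    return cycles


original = [
    ("LW", "R3", ["R1"]),            # LW   R3, 0(R1)
    ("ADDI", "R5", ["R3"]),          # ADDI R5, R3, 1
    ("ADD", "R2", ["R2", "R3"]),     # ADD  R2, R2, R3
    ("LW", "R13", ["R11"]),          # LW   R13, 0(R11)
    ("ADD", "R12", ["R13", "R3"]),   # ADD  R12, R13, R3
]
# Scheduling hoists the second load so that no load is followed by its consumer.
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(count_cycles(original), count_cycles(scheduled))   # 7 5
```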
Loop-level Parallelism
Original Loop:
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop

Unrolled Loop:
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop
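The same transformation can be seen at source level. The sketch below unrolls the x[i] = x[i] + a loop by four; it is an illustration rather than what a compiler would emit, and the clean-up loop is an addition for array lengths that are not a multiple of 4 (the 1000-element loop in the slides does not need it).

```python
def add_scalar(x, a):
    """Original loop: one element per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + a


def add_scalar_unrolled(x, a):
    """Same loop unrolled by 4: four independent updates per iteration."""
    n = len(x)
    main = n - n % 4
    for i in range(0, main, 4):
        x[i] = x[i] + a
        x[i + 1] = x[i + 1] + a
        x[i + 2] = x[i + 2] + a
        x[i + 3] = x[i + 3] + a
    for i in range(main, n):         # clean-up loop for leftover elements
        x[i] = x[i] + a


x1 = [float(i) for i in range(1000)]
x2 = list(x1)
add_scalar(x1, 2.5)
add_scalar_unrolled(x2, 2.5)
print(x1 == x2)   # True
```

The four updates in the unrolled body touch different elements, so they carry no dependences on each other, and the loop overhead (index update and branch) is paid once per four elements.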
Loop Unrolling
Instr producing result    Instr using result    Latency to avoid a stall
FP ALU op                 Another FP ALU op     3
FP ALU op                 Store Double          2
Load Double               FP ALU op             1
Load Double               Store Double          0

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop
Total Cycles: 27 cycles
Loop Unrolling
Instr producing result    Instr using result    Latency to avoid a stall
FP ALU op                 Another FP ALU op     3
FP ALU op                 Store Double          2
Load Double               FP ALU op             1
Load Double               Store Double          0

Loop: L.D F0, 0(R1)
L.D F6, -8(R1)
L.D F10, -16(R1)
L.D F14, -24(R1)
ADD.D F4, F0, F2
ADD.D F8, F2, F6
ADD.D F12, F2, F10
ADD.D F16, F2, F14
S.D F4, 0(R1)
S.D F8, -8(R1)
DADDUI R1, R1, #-32
S.D F12, 16(R1)
S.D F16, 8(R1)
BNE R1, R2, Loop
Total Cycles: 14 cycles
➢ Code Size
➢ Register pressure
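Both totals can be checked against the latency table with a small issue-cycle model. This is a sketch under stated assumptions: an in-order, single-issue pipeline; the four latencies from the table; plus one extra entry, an integer op feeding a branch with latency 1, which is not in the slide's table and is assumed here. The tuple encoding and the names `total_cycles` and `unrolled_body` are made up for the example.

```python
# Intervening cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,    # assumption: not listed in the slide's table
}


def total_cycles(program):
    """Issue cycle of the last instruction; program is (kind, dest, sources) tuples."""
    producer = {}            # register -> (kind, issue cycle) of its latest writer
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1
        for reg in sources:
            if reg in producer:
                prod_kind, prod_cycle = producer[reg]
                issue = max(issue, prod_cycle + 1 + LATENCY.get((prod_kind, kind), 0))
        if dest is not None:
            producer[dest] = (kind, issue)
        cycle = issue
    return cycle


def unrolled_body(scheduled):
    """Build the loop body unrolled by 4, in original or scheduled order."""
    loads = [("load", f"F{d}", ["R1"]) for d in (0, 6, 10, 14)]
    adds = [("fpalu", f"F{d}", ["F2", f"F{s}"])
            for d, s in ((4, 0), (8, 6), (12, 10), (16, 14))]
    stores = [("store", None, [f"F{s}", "R1"]) for s in (4, 8, 12, 16)]
    dec = ("int", "R1", ["R1"])              # DADDUI R1, R1, #-32
    branch = ("branch", None, ["R1", "R2"])  # BNE R1, R2, Loop
    if scheduled:
        return loads + adds + stores[:2] + [dec] + stores[2:] + [branch]
    body = []
    for ld, add, st in zip(loads, adds, stores):
        body += [ld, add, st]
    return body + [dec, branch]


print(total_cycles(unrolled_body(False)), total_cycles(unrolled_body(True)))   # 27 14
```

The model ignores memory addresses, so it does not check that the last two stores need their offsets adjusted to 16 and 8 once they follow the DADDUI.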
Exceptions
● Certain exceptional events that occur during program execution, handled by the processor hardware
● Control transfer to specific OS code based on the family of exception
● I/O device requests, System call, Breakpoint, Integer arithmetic overflow, FP arithmetic anomaly, Page fault, Undefined or unimplemented instruction, Hardware malfunctions, Power failure.
Exceptions
● Synchronous vs. Asynchronous
● User requested vs. Coerced
● User maskable vs. User non-maskable
● Within vs. Between instructions
– Save and restore processor state
– Restartable pipeline
● Resume vs. Terminate
Stopping and Restarting Execution
● Trap instruction, Turn off writes, Save PC, Save processor state, Exception handler, RFE
● Precise exceptions
Pipeline stage    Problem exceptions occurring
IF                Page fault on instruction fetch; misaligned memory access; memory protection violation
ID                Undefined or illegal opcode
EX                Arithmetic exception
MEM               Page fault on data fetch; misaligned memory access; memory protection violation
WB                None
Precise Exceptions
LD      IF   ID   EX   MEM  WB
DADD         IF   ID   EX   MEM  WB
● Exceptions at the same cycle
● Early exception by a later instruction
● Instruction Status Vector
– Check before commit
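The "early exception by a later instruction" case can be sketched as follows. Each instruction posts exceptions into its own status vector as it moves through the stages, and the vector is examined only when the instruction is about to commit, so exceptions surface in program order. The dictionary encoding, the fault names, and both function names are assumptions made for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def first_raised_in_time(program):
    """Instruction whose exception occurs in the earliest clock cycle."""
    events = []
    for i, (name, faults) in enumerate(program):
        for stage, exc in faults.items():
            cycle = (i + 1) + STAGES.index(stage)   # instruction i+1 is fetched in cycle i+1
            events.append((cycle, i, name, exc))
    return min(events)[2:] if events else None


def first_handled(program):
    """Instruction whose exception is taken when vectors are checked at commit."""
    for name, faults in program:                    # commit happens in program order
        status_vector = [faults.get(stage) for stage in STAGES]
        posted = [exc for exc in status_vector if exc is not None]
        if posted:
            return (name, posted[0])
    return None


program = [
    ("LD", {"MEM": "data page fault"}),             # raised in cycle 4
    ("DADD", {"IF": "instruction page fault"}),     # raised in cycle 2
]

print(first_raised_in_time(program))   # ('DADD', 'instruction page fault')
print(first_handled(program))          # ('LD', 'data page fault')
```

DADD's fault happens two cycles before LD's, yet checking the status vector at commit makes LD's fault the one that is handled first, which is what a precise exception model requires.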
Control Dependences
● Program correctness
– Data flow and Exception behaviour
● Software Speculation
– Liveness

DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
L1:

DADDU R1, R2, R3
BEQZ R4, L1
DSUBU R1, R5, R6
L1: ...
OR R7, R1, R8

DADDU R1, R2, R3
BEQZ R12, L1
DSUBU R4, R5, R6
DADDU R5, R4, R9
L1: OR R7, R8, R9
Branch Hazards
● 1 stall cycle for every branch yields a performance loss of 10% to 30%!
Time (clock cycles):    1    2    3    4    5    6    7    8    9
Branch                  IF   ID   EX   MEM  WB
Branch Successor             IF   IF   ID   EX   MEM  WB
Branch Successor + 1                   IF   ID   EX   MEM  WB
Branch Successor + 2                        IF   ID   EX   MEM  WB
Reducing Pipeline Branch Penalties
● Freeze the pipeline
● Static Prediction
– Predict Taken, Predict Untaken
● Fill Branch Delay Slot
Time (clock cycles):    1    2    3    4    5    6    7    8    9
Branch                  IF   ID   EX   MEM  WB
Branch Delay Slot            IF   ID   EX   MEM  WB
Branch Successor                  IF   ID   EX   MEM  WB
Branch Successor + 1                   IF   ID   EX   MEM  WB
From the MIPS ISA Manual: The transfer of control takes place only following the instruction immediately after the control transfer instruction (the Branch Delay Slot).
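Performance of Branch Schemes
$$\text{Stall cycles}_{\text{Branches}} = \text{Branch frequency} \times \text{Branch penalty}$$
$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$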
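A quick numeric check of the speedup formula follows. The pipeline depth of 5 and the 20% branch frequency are example values chosen for illustration, not figures from the slides, and branches are assumed to be the only source of stalls.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup = depth / (1 + branch frequency * branch penalty)."""
    stall_cycles_per_instr = branch_frequency * branch_penalty
    return depth / (1 + stall_cycles_per_instr)


# Assumed example: 5-stage pipeline, 20% of instructions are branches.
print(round(pipeline_speedup(5, 0.20, 1), 2))   # 4.17
print(round(pipeline_speedup(5, 0.20, 2), 2))   # 3.57
```

With a one-cycle penalty the speedup drops from the ideal 5 to about 4.17, a loss of roughly 17%, which sits inside the 10% to 30% range quoted for one stall cycle per branch.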
Classes of Exceptions
Exception type                   Sync vs. Async   User request vs. Coerced   User maskable vs. nonmaskable   Within vs. between instructions   Resume vs. Terminate
I/O device request               Async            Coerced                    Nonmaskable                     Between                           Resume
Invoke OS                        Sync             User request               Nonmaskable                     Between                           Resume
Tracing Instruction Execution    Sync             User request               User maskable                   Between                           Resume
Breakpoint                       Sync             User request               User maskable                   Between                           Resume
Arithmetic Overflow              Sync             Coerced                    User maskable                   Within                            Resume
FP underflow or overflow         Sync             Coerced                    User maskable                   Within                            Resume
Page fault                       Sync             Coerced                    Nonmaskable                     Within                            Resume
Undefined Instructions           Sync             Coerced                    Nonmaskable                     Within                            Terminate
Hardware malfunctions            Async            Coerced                    Nonmaskable                     Within                            Terminate
Power Failure                    Async            Coerced                    Nonmaskable                     Within                            Terminate
Smith and Pleszkun, Implementing precise interrupts in pipelined processors, IEEE Transactions on Computers, 37(5), 1988.