claudio.talarico@mail.ewu.edu
Computing Systems
Pipelining: enhancing performance
Pipelining
A technique in which multiple instructions are overlapped in execution: the instructions' steps can be carried out in parallel.
(Figure: nonpipelined execution, Texec = 2400 ps, vs. pipelined execution, Texec = 1400 ps)
Pipelining
Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time (= latency) of an individual instruction: increasing throughput decreases the total time to complete the work.
The ideal speedup is the number of stages in the pipeline. Do we achieve this? Usually not:
- stages may be imperfectly balanced
- pipelining involves some overhead
Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages
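The formula above can be sketched numerically. Assuming the classic 5-stage example where a single-cycle instruction takes 800 ps and each pipeline stage takes 200 ps (an assumption; these values reproduce the 2400 ps and 1400 ps figures from the earlier slide for three instructions):

```python
# Sketch: pipeline timing model. Assumes (hypothetically) an 800 ps
# single-cycle instruction and a 200 ps pipeline stage time.

def nonpipelined_time(n_instr, instr_time_ps):
    # each instruction runs to completion before the next starts
    return n_instr * instr_time_ps

def pipelined_time(n_instr, n_stages, stage_time_ps):
    # the first instruction takes n_stages cycles; each additional
    # instruction completes one cycle later
    return (n_instr + n_stages - 1) * stage_time_ps

print(nonpipelined_time(3, 800))   # 2400 (ps)
print(pipelined_time(3, 5, 200))   # 1400 (ps)
```

Note that the ideal 5x speedup is only approached for long instruction sequences; for 3 instructions the speedup is 2400/1400, well under 5.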
Pipelining
What makes it easy (designing instruction sets for pipelining)?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
What makes it hard? Sometimes the next instruction cannot be started in the next cycle (hazards):
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues. Instructions supported: lw, sw, add, sub, and, or, slt, beq.
Basic idea
Basic idea: take a single-cycle datapath and separate it into pieces
Stylized Datapath; the drawing leaves out some details
Pipelined datapath
There is a bug! Can you find it? What instructions can we execute to manifest the bug?
Instructions and data move from left to right (with two exceptions)
Corrected datapath
For the load instruction we need to preserve the destination register number until the data is read from the MEM/WB pipeline register
Graphically representing pipelines

Pipelining can be difficult to understand:
every clock cycle, many instructions are simultaneously executing in a single datapath
To aid understanding there are 2 basic styles of pipeline figures:
- multiple-clock-cycle pipeline diagrams
- single-clock-cycle pipeline diagrams
These can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
and they help in understanding datapaths.
We highlight the right half of registers or memory when they are being read and highlight the left half when they are being written
Multiple-clock cycle diagrams: graphical view
Multiple-clock cycle diagrams: traditional view
Single-clock-cycle diagrams: pipeline at a particular time instant
Pipeline operation
One operation begins in every cycle, and one operation completes in each cycle. Each instruction takes 5 cycles (k cycles in general, where k is the depth of the pipeline). In one clock cycle, several instructions are active, and different stages are executing different instructions. When a stage is not used, no control needs to be applied.
Issue: how do we generate the control signals? We need to set the control values for each pipeline stage, for each instruction.
Pipeline Control
Note: we moved the position of the destination register
Pipeline Control
We have 5 stages. What needs to be controlled in each stage?
Instruction Fetch / PC Increment: the control signals to read instruction memory and write the PC are always asserted, so there is nothing special to control in this pipeline stage.
Instruction Decode / Register Fetch: the same thing happens at every clock cycle, so there are no optional control lines to set.
Execution / Address Calculation: control lines set in this stage are RegDst, ALUOp, and ALUSrc.
Memory Access: control lines set in this stage are Branch, MemRead, and MemWrite.
Write Back: control lines set in this stage are MemtoReg and RegWrite.
Pipeline Control

Since the control signals are needed from the execution stage on, we can generate them during the instruction decode stage and pass them along the pipeline registers, just like the data.
Execution/Address Calculation, Memory Access, and Write-Back stage control lines:

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    |   1    |   1    |   0    |   0    |   0    |    0    |    0     |    1     |    0
lw          |   0    |   0    |   0    |   1    |   0    |    1    |    0     |    1     |    1
sw          |   X    |   0    |   0    |   1    |   0    |    0    |    1     |    0     |    X
beq         |   X    |   0    |   1    |   0    |   1    |    0    |    0     |    0     |    X

We have nine control lines: RegDst, ALUOp1, ALUOp0, and ALUSrc are used in the Execution/Address Calculation stage; Branch, MemRead, and MemWrite in the Memory Access stage; RegWrite and MemtoReg in the Write-Back stage.
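The control table can be sketched as a simple decode lookup. This is an illustration only, not the actual hardware: "X" (don't care) is represented as None, and the field names follow the table.

```python
# Sketch of the ID-stage control decoder for the four instruction
# classes. None stands for a "don't care" (X) in the table.

CONTROL = {
    #            RegDst ALUOp1 ALUOp0 ALUSrc Branch MemRead MemWrite RegWrite MemtoReg
    "R-format": (1,     1,     0,     0,     0,     0,      0,       1,       0),
    "lw":       (0,     0,     0,     1,     0,     1,      0,       1,       1),
    "sw":       (None,  0,     0,     1,     0,     0,      1,       0,       None),
    "beq":      (None,  0,     1,     0,     1,     0,      0,       0,       None),
}

def decode(instr_class):
    # return the nine control lines as a name -> value mapping
    fields = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc", "Branch",
              "MemRead", "MemWrite", "RegWrite", "MemtoReg")
    return dict(zip(fields, CONTROL[instr_class]))

print(decode("lw")["MemRead"])   # 1: lw is the only instruction that reads data memory
```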
Pipeline datapath with control
Dependencies

There is a problem with starting the next instruction before the first is finished: dependencies that "go backward in time" are data hazards.
Software solution
Have the compiler guarantee no hazards. Where do we insert the "nops"?

sub $2, $1, $3
nop              # two nops needed here before $2 can be read
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Problem: this really slows us down!
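The compiler's job here can be sketched as a small scheduling pass. This is a hypothetical illustration (the tuple instruction format is my own), assuming no forwarding and a register file that writes in the first half of a cycle and reads in the second half, so a consumer must trail its producer by at least 3 slots:

```python
# Sketch: nop insertion with no forwarding hardware.
# Instructions are (name, dest, sources) tuples (hypothetical format).
# A consumer must be >= 3 slots behind the producer of each source.

def insert_nops(program):
    out = []
    for name, dest, srcs in program:
        for src in srcs:
            # find the distance back to the most recent writer of src
            for d, (_, pdest, _) in enumerate(reversed(out), start=1):
                if pdest == src:
                    out.extend([("nop", None, ())] * max(0, 3 - d))
                    break
        out.append((name, dest, srcs))
    return out

prog = [("sub", "$2", ("$1", "$3")),
        ("and", "$12", ("$2", "$5")),
        ("or",  "$13", ("$6", "$2")),
        ("add", "$14", ("$2", "$2")),
        ("sw",  "$15", ("$2",))]

scheduled = insert_nops(prog)
print(sum(1 for name, _, _ in scheduled if name == "nop"))  # 2 nops, right after sub
```

Only "and" is close enough to "sub" to need padding; "or", "add", and "sw" are pushed far enough back by those two nops.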
Hardware solution: Forwarding

Use temporary results; don't wait for them to be written back:
- ALU forwarding (EX hazard)
- read/write to the same register (MEM hazard)

What if this $2 was $13?
Forwarding logic
Forwarding from the EX/MEM registers:

if (EX/MEM.RegWrite                      // instruction writes to a register
    and (EX/MEM.RegisterRd != 0)         // not if destination is $zero
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 10

if (EX/MEM.RegWrite                      // instruction writes to a register
    and (EX/MEM.RegisterRd != 0)         // not if destination is $zero
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 10
Forwarding logic
Forwarding from MEM/WB registers
if (MEM/WB.RegWrite                      // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)         // not if destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 01

if (MEM/WB.RegWrite                      // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)         // not if destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 01
Almost true !!! There is a bug !!!
Forwarding logic

Let's consider a sequence of instructions that all read and write the same register.

According to the previous policy, since MEM/WB.RegisterRd = ID/EX.RegisterRs, we "should" forward from MEM/WB. But this time the more recent result is in the EX/MEM register. Thus, we have to forward from the EX/MEM register (fortunately, we already know how to do that!).
Forwarding logic
Forwarding from MEM/WB registers (corrected version)
if (MEM/WB.RegWrite                      // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)         // not if destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRs))
  ForwardA = 01

if (MEM/WB.RegWrite                      // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)         // not if destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRt))
  ForwardB = 01
Make sure the latest value is not in EX/MEM
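The EX-hazard and corrected MEM-hazard conditions can be sketched together as a function computing ForwardA (ForwardB is symmetric, using Rt instead of Rs). Representing the pipeline registers as plain dicts is a simplification for illustration:

```python
# Sketch of the ForwardA mux selection, combining the EX-hazard
# condition with the corrected MEM-hazard condition.

def forward_a(ex_mem, mem_wb, id_ex_rs):
    # EX hazard: the most recent result sits in EX/MEM
    if (ex_mem["RegWrite"] and ex_mem["Rd"] != 0
            and ex_mem["Rd"] == id_ex_rs):
        return 0b10
    # MEM hazard: forward from MEM/WB only if EX/MEM does not
    # already hold a more recent result for the same register
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and mem_wb["Rd"] == id_ex_rs
            and ex_mem["Rd"] != id_ex_rs):
        return 0b01
    return 0b00  # no forwarding: operand comes from the register file

# Three back-to-back writes to $1: the EX/MEM copy must win.
ex_mem = {"RegWrite": True, "Rd": 1}   # more recent result
mem_wb = {"RegWrite": True, "Rd": 1}   # older result
print(bin(forward_a(ex_mem, mem_wb, 1)))  # 0b10: forward from EX/MEM
```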
Forwarding unit
The main idea: the forwarding unit drives the ForwardA and ForwardB mux controls (some details not shown).
Forwarding unit
Mux control   | Source | Comment
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result
Can't always forward!

A load word instruction can still cause a hazard:
- an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to "stall" the instruction that follows the load. This hazard cannot be solved by forwarding; we must stall (insert a nop).
Stall logic
Hazard detection unit:
if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
         (ID/EX.RegisterRt = IF/ID.RegisterRt)))
  stall the pipeline
We can stall by letting an instruction that won't do anything go forward:
- deassert the control lines (this way the instruction has no effect and acts like a bubble in the pipeline), and
- prevent the following instructions from being fetched; this is accomplished simply by preventing the PC register and the IF/ID register from changing.
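The stall condition above can be sketched as a small predicate. This is an illustration of the hazard detection unit's decision, not the hardware itself:

```python
# Sketch of the load-use hazard detection unit. Returns True when the
# pipeline must stall: the instruction in EX is a load (MemRead) and
# the instruction in ID reads the register that the load will write.

def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) followed by and $4, $2, $5 -> stall one cycle
print(must_stall(True, 2, 2, 5))    # True
# lw followed by an independent instruction -> no stall
print(must_stall(True, 2, 6, 7))    # False
```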
The only instruction that reads data memory is load
The destination of the load instruction is in the Rt field
Pipeline with hazard detection unit (some details not shown)
Branch Hazards (= control hazards)

When we decide to branch, other instructions are in the pipeline!
Solutions to branch hazard
- Branch stalling (software): easy but inefficient.
- Static branch prediction: assume "branch not taken"; we need to add hardware to flush instructions if we are wrong, discarding the instructions in the IF, ID, and EX stages (change the control values to 0).
- Reducing the branch delay penalty: move the branch decision earlier (to the ID stage); comparing the two registers read in the ID stage for equality requires few extra gates; we still need to flush the instruction in the IF/ID register (clearing the register transforms the fetched instruction into a nop).
- Make the hazard into a feature: a delayed branch slot, where we always execute the instruction following the branch.
Branch detection in the ID stage: the branch target computation has been moved ahead
Delayed branch (MIPS)

A "branch delay slot", which the compiler tries to fill with a useful instruction (making the one-cycle delay part of the ISA). Filling the slot with an instruction from before the branch is the best solution; filling it from the branch target pays off when the branch is mostly taken.
Branches
If the branch is taken, we may have a penalty of one cycle. For our simple design, this is reasonable. With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance. Solution: dynamic branch prediction (keep track of branch history).
Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
Modern processors predict correctly 95% of the time!
Example: a loop branch is taken 9 times in a row, then not taken. Assume a 1-bit predictor. We mispredict the first and the last time, so the prediction accuracy is 80%.
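The example can be checked with a tiny simulation. A 1-bit predictor simply predicts whatever the branch did last time; starting the predictor at "not taken" matches the steady state of repeated loop executions (each run ends not taken):

```python
# Sketch: 1-bit branch predictor on the loop example above.
# A 1-bit predictor predicts the last observed outcome.

def one_bit_accuracy(outcomes, initial_prediction=False):
    pred, correct = initial_prediction, 0
    for taken in outcomes:
        correct += (pred == taken)
        pred = taken              # remember only the last outcome
    return correct / len(outcomes)

# taken 9 times in a row, then not taken
outcomes = [True] * 9 + [False]
print(one_bit_accuracy(outcomes))  # 0.8: mispredict first and last
```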
Improving performance
Try and avoid stalls! e.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
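Why does reordering help? With forwarding, the only unavoidable bubble is a load followed immediately by a consumer of its result. A sketch that counts such load-use stalls before and after swapping the two stores (the tuple instruction format is a hypothetical illustration):

```python
# Sketch: counting load-use stalls. With forwarding, only a load
# followed immediately by a consumer of its result costs one bubble.
# Instructions are (op, dest, sources) tuples (hypothetical format).

def load_use_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in curr[2]:
            stalls += 1
    return stalls

original = [("lw", "$t0", ("$t1",)),
            ("lw", "$t2", ("$t1",)),
            ("sw", None,  ("$t2", "$t1")),   # uses $t2 right after its load
            ("sw", None,  ("$t0", "$t1"))]

reordered = [("lw", "$t0", ("$t1",)),
             ("lw", "$t2", ("$t1",)),
             ("sw", None,  ("$t0", "$t1")),  # swap the stores
             ("sw", None,  ("$t2", "$t1"))]

print(load_use_stalls(original))   # 1
print(load_use_stalls(reordered))  # 0
```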
Dynamic Pipeline Scheduling: the hardware is organized differently and chooses which instructions to execute next. It will execute instructions out of order (e.g., it doesn't wait for a dependency to be resolved, but rather keeps going!) and speculates on branches to keep the pipeline full (it may need to roll back if a prediction is incorrect).
Trying to exploit instruction-level parallelism
Dynamic scheduled pipeline
Advanced Pipelining
Trying to exploit instruction-level parallelism:
- increase the depth of the pipeline (overlap more instructions)
- replicate internal functional units to start more than one instruction each cycle (multiple issue): static multiple issue (decision at compile time) or dynamic multiple issue (decision at execution time)
- loop unrolling to expose more ILP (better scheduling)

"Superscalar" processors, e.g., the DEC Alpha 21264: 9-stage pipeline, 6-instruction issue. All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes").

VLIW: very long instruction word, static multiple issue (relies more on compiler technology).
Summary
Pipelined processors divide execution into multiple steps. Pipelining improves instruction throughput, not the inherent execution time (latency) of instructions. However, pipeline hazards reduce performance:
- structural, data, and control hazards
- structural hazards are resolved by duplicating resources
- data forwarding helps resolve data hazards, but not all hazards can be resolved (a load followed by an R-type); some data hazards require nop insertion (bubbles)
- the control-hazard delay penalty can be reduced by branch prediction: always not taken, delayed slots, dynamic prediction
Concluding Remarks
Pipelined processors are not easy to design. Technology affects implementation, and instruction set design affects both performance and design difficulty. More stages do not necessarily lead to higher performance. Pipelining and multiple issue both attempt to exploit ILP.