pipelined processor design.ppt

download pipelined processor design.ppt

of 72

Transcript of pipelined processor design.ppt

  • 8/12/2019 pipelined processor design.ppt

    1/72

    112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Pipelined Processor Design

  • 8/12/2019 pipelined processor design.ppt

    2/72

    212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Presentation Outline

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    3/72

    312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Laundry Example: Three Stages

    1. Wash dirty load of clothes

    2. Dry wet clothes

    3. Fold and put clothes into drawers

    Each stage takes 30 minutes to complete

    Four loads of clothes to wash, dry, and fold

    A B

    C D

    Pipelining Example

  • 8/12/2019 pipelined processor design.ppt

    4/72

    412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Sequential laundry takes 6 hoursfor 4 loads

    Intuitively, we can use pipeliningto speed up laundry

    Sequential Laundry

    Time

    6 PM

    A

    30 30 30

    7 8 9 10 11 12 AM

    30 30 30

    B

    30 30 30

    C

    30 30 30

    D

  • 8/12/2019 pipelined processor design.ppt

    5/72

    512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Pipelined laundry takes3 hoursfor 4 loads

    Speedup factor is 2for4 loads

    Time to wash, dry, andfold one load is still thesame (90 minutes)

    Pipelined Laundry: Start LoadASAP

    Time

    6 PM

    A

    30

    7 8 9 PM

    B

    30

    30

    C

    30

    30

    30

    D

    30

    30

    30

    30

    30 30

  • 8/12/2019 pipelined processor design.ppt

    6/72

    612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Serial Execution versus Pipelining

    Consider a task that can be divided into ksubtasks

    The ksubtasksare executed on kdifferent stages

    Each subtask requires one time unit

    The total execution time of the task is k time units

    Pipelining is to overlap the execution The kstages work in parallel on kdifferent tasks

    Tasks enter/leave pipeline at the rate of one task per time unit

    1 2 k1 2 k

    1 2 k

    1 2 k1 2 k

    1 2 k

    Without Pipelining

    One completion every k time units

    With Pipelining

    One completion every 1time unit

  • 8/12/2019 pipelined processor design.ppt

    7/72712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Synchronous Pipeline

    Uses clocked registersbetween stages

    Upon arrival of a clock edge

    All registers hold the results of previous stages simultaneously

    The pipeline stages are combinational logiccircuits

    It is desirable to have balancedstages

    Approximately equal delay in all stages

    Clock period is determined by the maximum stage delay

    S1 S2 SkRegister

    Register

    Register

    Register

    Input

    Clock

    Output

  • 8/12/2019 pipelined processor design.ppt

    8/72812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Let ti= time delay in stage S

    i

    Clock cycle t= max(ti) is the maximum stage delay

    Clock frequencyf = 1/t = 1/max(ti)

    A pipeline can process ntasks in k + n1 cycles kcycles are needed to complete the first task

    n1 cycles are needed to complete the remaining n1 tasks

    Ideal speedup of a

    k-stage pipeline over serial execution

    Pipeline Performance

    k + n1Pipelined execution in cycles

    Serial execution in cycles== Sk k for largennkSk

  • 8/12/2019 pipelined processor design.ppt

    9/72912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    MIPS Processor Pipeline

    Five stages, one cycle per stage

    1. IF: Instruction Fetch from instruction memory

    2. ID: Instruction Decode, register read, and J/Br address

    3. EX: Executeoperation or calculate load/store address

    4. MEM: Memory accessfor load and store

    5. WB: Write Backresult to register

  • 8/12/2019 pipelined processor design.ppt

    10/721012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Single-Cycle vs PipelinedPerformance

    Consider a 5-stage instruction execution in which

    Instruction fetch = ALU operation = Data memory access = 200 ps

    Register read = register write = 150 ps

    What is the clock cycle of the single-cycle processor?

    What is the clock cycle of the pipelined processor?

    What is the speedup factor of pipelined execution?

    Solution

    Single-Cycle Clock = 200+150+200+200+150 = 900 ps

    Reg ALU MEMIF

    900 ps

    Reg

    Reg ALU MEMIF

    900 ps

    Reg

  • 8/12/2019 pipelined processor design.ppt

    11/721112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Single-Cycle versus Pipelined contd

    Pipelined clock cycle =

    CPI for pipelined execution =

    One instruction completes each cycle (ignoring pipeline fill)

    Speedup of pipelined execution =

    Instruction count and CPI are equal in both cases

    Speedup factor is less than 5 (number of pipeline stage)

    Because the pipeline stages are not balanced

    900 ps / 200 ps = 4.5

    1

    max(200, 150) = 200 ps

    200

    IF Reg MEMALU Reg

    IF Reg MEM RegALU

    IF Reg MEMALU Reg200

    200 200 200 200 200

  • 8/12/2019 pipelined processor design.ppt

    12/721212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Pipeline Performance Summary

    Pipelining doesnt improve latencyof a single instruction

    However, it improves throughputof entire workload

    Instructions are initiated and completed at a higher rate

    In ak-stagepipeline, kinstructions operate in parallel

    Overlapped execution using multiple hardware resources

    Potential speedup = number of pipeline stagesk

    Unbalanced lengths of pipeline stages reduces speedup

    Pipeline rate is limited by slowestpipeline stage

    Unbalanced lengths of pipeline stages reduces speedup

    Also, time to filland drainpipeline reduces speedup

  • 8/12/2019 pipelined processor design.ppt

    13/721312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Next . . .

    Pipelining versus Serial Execution Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    14/721412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    ID = Decode &

    Register Read

    Single-Cycle Datapath

    Shown below is the single-cycle datapath

    How to pipeline this single-cycle datapath?

    Answer:Introduce pipeline register at end of each stage

    Next

    PC

    zeroPCSrc

    ALUCtrl

    RegWrite

    ExtOp

    RegDst

    ALUSrc

    Data

    Memory

    Address

    Data_in

    Data_out

    32

    32ALU

    32

    Registers

    RA

    RB

    BusA

    BusB

    RW

    5

    BusW

    32

    Address

    Instruction

    Instruction

    Memory

    PC

    00

    30

    Rs

    5

    RdE

    Imm16

    Rt

    0

    1

    0

    1

    32

    Imm26

    32

    ALU result

    32

    0

    1

    clk

    +1

    0

    1

    30

    Jump or Branch Target Address

    MemRead

    MemWrite

    MemtoReg

    EX = ExecuteIF = Instruction Fetch MEM = Memory

    Access

    WB =

    Write

    Back

    Bne

    Beq

    J

  • 8/12/2019 pipelined processor design.ppt

    15/72

    1512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    zero

    Pipelined Datapath

    Pipeline registers are shown in green, including the PC

    Same clock edge updates all pipeline registers, registerfile, and data memory (for store instruction)

    clk

    32

    A

    LU

    32

    5

    Address

    Instruction

    Instruction

    Memory Rs

    5Rt1

    0

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    ID = Decode &

    Register Read

    EX = ExecuteIF = Instruction Fetch MEM = Memory

    Access

    WB=Wri

    teBack

    32

    Register

    File

    RA

    RB

    BusA

    BusB

    RWBusW

    32

    E

    Rd

    0

    1PC

    AL

    Uout

    D

    WBData

    32

    Imm16

    Imm26

    Next

    PC

    A

    B

    Imm

    NPC2

    Instruc

    tion

    NPC

    +1

    0

    1

  • 8/12/2019 pipelined processor design.ppt

    16/72

    1612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Problem with Register Destination

    Is there a problem with the register destination address?

    Instruction in the ID stage different from the one in the WB stage

    Instruction in the WB stage is not writing to its destination registerbut to the destination of a different instruction in the ID stage

    zero

    clk

    32

    ALU

    32

    5

    Address

    Instruction

    Instruction

    MemoryRs

    5Rt1

    0

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    ID = Decode &

    Register Read EX = ExecuteIF = Instruction Fetch MEM =

    Memory Access

    WB=Write

    Back

    32

    RegisterF

    ile

    RA

    RB

    BusA

    BusB

    RWBusW

    32

    E

    Rd

    0

    1PC

    ALU

    out

    D

    WBData

    32

    Imm16

    Imm26

    Next

    PC

    A

    B

    Imm

    NPC2

    Instruction

    NPC

    +1

    0

    1

  • 8/12/2019 pipelined processor design.ppt

    17/72

    1712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Pipelining the Destination Register Destination Register number should be pipelined

    Destination register number is passed from ID to WB stage

    The WB stage writes back data knowing the destination register

    zero

    clk

    32

    ALU

    32

    5

    Address

    Instruction

    Instruction

    Memory Rs

    5Rt1

    0

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    ID EXIF MEM WB

    32

    RegisterFile

    RA

    RB

    BusA

    BusB

    RWBusW

    32

    E

    PC

    ALUout

    D

    W

    BData

    32

    Imm16

    Imm26

    Next

    PC

    A

    B

    Imm

    NPC2

    Ins

    truction

    NPC

    +1

    0

    1Rd

    Rd20

    1 Rd3

    Rd4

  • 8/12/2019 pipelined processor design.ppt

    18/72

    1812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Graphically Representing Pipelines

    Multiple instruction execution over multiple clock cycles

    Instructions are listed in execution order from top to bottom

    Clock cycles move from left to right

    Figure shows the use of resources at each stage and each cycle

    Time (in cycles)

    ProgramE

    xecutionOrder

    add $s1, $s2, $s3

    CC2Reg

    IM

    DM

    Reg

    sub $t5, $s2, $t3

    CC4

    ALU

    IM

    sw $s2, 10($t3)

    DM

    Reg

    CC5Reg

    ALU

    IM

    DM

    Reg

    CC6

    Reg

    ALU DM

    CC7

    Reg

    ALU

    CC8

    Reg

    DM

    lw $t6, 8($s5) IM

    CC1

    Reg

    ori $s4, $t3, 7

    ALU

    CC3

    IM

  • 8/12/2019 pipelined processor design.ppt

    19/72

    1912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Instruction-Time Diagram shows:

    Which instruction occupying what stage at each clock cycle

    Instruction flow is pipelined over the 5 stages

    Instruction-Time Diagram

    IF

    WB

    EX

    ID

    WB

    EX

    WB

    MEM

    ID

    IF

    EX

    ID

    IF

    TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3

    MEM

    EX

    ID

    IF

    WB

    MEM

    EX

    ID

    IF

    lw $t7, 8($s3)

    lw $t6, 8($s5)

    ori $t4, $s3, 7sub $s5, $s2, $t3

    sw $s2, 10($s3)InstructionOrder

    Up to five instructions can be in the

    pipeline during the same cycleInstruction Level Parallelism (ILP)

    ALU instructions skipthe MEM stage.

    Store instructionsskip the WB stage

  • 8/12/2019 pipelined processor design.ppt

    20/72

    2012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Control Signals

    zero

    clk

    32

    ALU

    32

    5

    Address

    Instruction

    Instruction

    MemoryRs

    5Rt1

    0

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    32

    RegisterF

    ile

    RA

    RB

    BusA

    BusB

    RWBusW

    32

    E

    PC

    ALU

    out

    D

    WBData

    32

    Imm16

    Imm26

    Next

    PC

    A

    B

    Imm

    NPC2

    Instruction

    NPC

    +1

    0

    1Rd

    Rd20

    1 Rd3

    Rd4

    Same control signals used in the single-cycle datapath

    ALUCtrl

    RegWrite

    RegDst

    ALUSrc

    MemWrite

    MemtoReg

    MemRead

    ExtOp

    PCSrc

    ID EXIF MEM WB

    Bne

    Beq

    J

  • 8/12/2019 pipelined processor design.ppt

    21/72

    2112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Pipelined Control

    zero

    clk

    32

    AL

    U

    32

    5

    Address

    Instruction

    Instruction

    Memory Rs

    5Rt1

    0

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    32

    RegisterFile

    RA

    RB

    BusA

    BusB

    RWBusW

    32

    EPC

    ALUout

    D

    WB

    Data

    32

    Imm16

    Imm26

    Next

    PC

    A

    B

    Imm

    NPC

    2

    Instruction

    NPC

    +1

    0

    1Rd

    Rd20

    1 Rd3

    Rd4

    PCSrc

    Bne

    Beq

    J

    Op

    RegDst

    EX

    ALUSrc

    ALUCtrl

    ExtOp

    JBeqBne

    MEM

    MemWrite

    MemRead

    MemtoReg

    WB

    RegWrite

    Pass control

    signals along

    pipeline just

    like the data

    Main& ALUControl

    func

  • 8/12/2019 pipelined processor design.ppt

    22/72

    2212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Pipelined Control Cont'd

    ID stage generates all the control signals

    Pipeline the control signals as the instruction moves

    Extend the pipeline registers to include the control signals

    Each stage uses some of the control signals

    Instruction Decode and Register Read

    Control signals are generated

    RegDst is used in this stage

    Execution Stage => ExtOp, ALUSrc, andALUCtrl

    Next PC uses J,Beq,Bne, and zerosignals for branchcontrol

    Memory Stage => MemRead,MemWrite,and MemtoReg

    Write Back Stage => RegWriteis used in this stage

  • 8/12/2019 pipelined processor design.ppt

    23/72

    2312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Op

    DecodeStage

    Execute Stage

    Control Signals

    Memory Stage

    Control Signals

    Write

    Back

    RegDst ALUSrc ExtOp J Beq Bne ALUCtrl MemRd MemWr MemReg RegWrite

    R-Type 1=Rd 0=Reg x 0 0 0 func 0 0 0 1

    addi 0=Rt 1=Imm 1=sign 0 0 0 ADD 0 0 0 1

    slti 0=Rt 1=Imm 1=sign 0 0 0 SLT 0 0 0 1

    andi 0=Rt 1=Imm 0=zero 0 0 0 AND 0 0 0 1

    ori 0=Rt 1=Imm 0=zero 0 0 0 OR 0 0 0 1

    lw 0=Rt 1=Imm 1=sign 0 0 0 ADD 1 0 1 1

    sw x 1=Imm 1=sign 0 0 0 ADD 0 1 x 0

    beq x 0=Reg x 0 1 0 SUB 0 0 x 0

    bne x 0=Reg x 0 0 1 SUB 0 0 x 0

    j x x x 1 0 0 x 0 0 x 0

    Control Signals Summary

  • 8/12/2019 pipelined processor design.ppt

    24/72

    2412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    25/72

    2512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Hazards:situations that would cause incorrect execution

    If next instruction were launched during its designated clock cycle

    1. Structural hazards

    Caused by resource contention

    Using same resource by two instructions during the same cycle

    2. Data hazards

    An instruction may compute a result needed by next instruction

    Hardware can detect dependencies between instructions

    3. Control hazards

    Caused by instructions that change control flow (branches/jumps)

    Delays in changing the flow of control

    Hazards complicate pipeline control and limit performance

    Pipeline Hazards

  • 8/12/2019 pipelined processor design.ppt

    26/72

    2612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Structural Hazards

    Problem

    Attempt to use the same hardware resource by two different

    instructions during the same cycle

    Example

    Writing back ALU result in stage 4

    Conflict with writing load data in stage 5

    WB

    WBEX

    ID

    WB

    EX MEM

    IF ID

    IF

    TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3

    EX

    IDIF

    MEM

    EXID

    IF

    lw $t6, 8($s5)

    ori $t4, $s3, 7sub $t5, $s2, $s3

    sw $s2, 10($s3)Instruct

    ions

    Structural Hazard

    Two instructions areattempting to write

    the register fileduring same cycle

  • 8/12/2019 pipelined processor design.ppt

    27/72

    2712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Resolving Structural Hazards Serious Hazard:

    Hazard cannot be ignored

    Solution 1: Delay Access to Resource

    Must have mechanism to delay instruction access to resource

    Delay all write backs to the register file to stage 5

    ALU instructions bypass stage 4 (memory) without doing anything Solution 2: Add more hardware resources (more costly)

    Add more hardware to eliminate the structural hazard

    Redesign the register file to have two write ports

    First write port can be used to write back ALU results in stage 4

    Second write port can be used to write back load data in stage 5

  • 8/12/2019 pipelined processor design.ppt

    28/72

    2812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    29/72

    2912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Dependency between instructions causes a data hazard

    The dependent instructions are close to each other

    Pipelined execution might change the order of operand access

    Read After WriteRAW Hazard

    Given two instructions I and J, where I comes before J

    Instruction J should read an operand after it is written by I

    Called a data dependencein compiler terminology

    I: add $s1,$s2, $s3 # $s1 is writtenJ: sub $s4, $s1, $s3 # $s1 is read

    Hazard occurs when Jreads the operand before I writes it

    Data Hazards

  • 8/12/2019 pipelined processor design.ppt

    30/72

    3012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    DMReg

    IM

    Reg

    ALU

    IM

    DM

    Reg

    Reg

    ALU

    IM

    DM

    Reg

    Reg

    ALU DM

    Reg

    ALU

    Reg

    DM

    IM

    Reg

    ALU

    IM

    Time (cycles)

    ProgramEx

    ecutionOrder

    value of $s2

    sub $s2, $t1, $t3

    CC110

    CC2

    add $s4, $s2, $t5

    10

    CC3

    or $s6, $t3, $s2

    10

    CC4

    and $s7, $t4, $s2

    10

    CC620

    CC720

    CC820

    CC5

    sw $t8, 10($s2)

    10

    Example of a RAW Data Hazard

    Result of subis needed by add, or, and, & swinstructions

    Instructions add& orwill read old valueof $s2from reg file

    During CC5, $s2is written at end of cycle, old valueis read

  • 8/12/2019 pipelined processor design.ppt

    31/72

    3112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    RegReg

    Solution 1: Stalling the Pipeline

    Three stall cycles during CC3thru CC5(wasting 3 cycles)

    Stall cycles delay execution of add & fetching of orinstruction

    The addinstruction cannot read $s2until beginning of CC6

    The addinstruction remains in the Instruction register until CC6

    ThePC register is not modifieduntil beginning of CC6

    DM

    Reg

    RegReg

    Time (in cycles)

    In

    structionOrder value of $s2

    CC110

    CC210

    CC310

    CC410

    CC620

    CC720

    CC820

    CC510

    add $s4, $s2, $t5 IM

    or $s6, $t3, $s2 IM ALU

    ALU Reg

    sub $s2, $t1, $t3 IM Reg ALU DM Reg

    CC920

    stall stall

    DM

    stall

  • 8/12/2019 pipelined processor design.ppt

    32/72

    3212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    DM

    Reg

    Reg

    Reg

    Reg

    Reg

    Time (cycles)

    ProgramE

    xecutionOrder

    value of $s2sub $s2, $t1, $t3 IM

    CC110

    CC2

    add $s4, $s2, $t5 IM

    10CC3

    or $s6, $t3, $s2

    ALU

    IM

    10CC4

    and $s7, $s6, $s2

    ALU

    IM

    10CC6

    Reg

    DM

    ALU

    20

    CC7

    Reg

    DM

    ALU

    20

    CC8

    Reg

    DM

    20

    CC5

    sw $t8, 10($s2)

    Reg

    DM

    ALU

    IM

    10

    Solution 2: Forwarding ALU Result

    TheALU resultis forwarded(fed back) to theALU input

    No bubbles are inserted into the pipeline and no cycles are wasted

    ALU result is forwarded fromALU, MEM, andWBstages

  • 8/12/2019 pipelined processor design.ppt

    33/72

    3312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Implementing Forwarding

    0123

    01

    23

    Result

    3232

    clk

    Rd

    32

    Rs

    Instruction

    0

    1

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    32

    Rd4

    ALUE

    Imm16Imm26

    1

    0

    Rd3

    Rd2

    A

    BWData

    D

    Im26

    32

    RegisterFile

    RB

    BusA

    BusB

    RW

    BusW

    RARt

    Two multiplexers added at the inputs of A & B registers

    Data fromALU stage, MEM stage, andWB stageis fed back

    Two signals: ForwardAand ForwardBcontrol forwarding

    ForwardA

    ForwardB

  • 8/12/2019 pipelined processor design.ppt

    34/72

    3412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Forwarding Control Signals

    Signal ExplanationForwardA = 0 First ALU operand comes from register file = Value of (Rs)

    ForwardA = 1 Forward result of previous instruction to A (from ALU stage)

    ForwardA = 2 Forward result of 2nd

    previous instruction to A (from MEM stage)

    ForwardA = 3 Forward result of 3rdprevious instruction to A (from WB stage)

    ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)

    ForwardB = 1 Forward result of previous instruction to B (from ALU stage)

    ForwardB = 2 Forward result of 2ndprevious instruction to B (from MEM stage)

    ForwardB = 3 Forward result of 3rdprevious instruction to B (from WB stage)

  • 8/12/2019 pipelined processor design.ppt

    35/72

    3512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Forwarding Example

    Result

    3232

    clk

    Rd

    32

    Rs

    Instruction

    0

    1

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    32

    Rd4

    ALU

    extImm16

    Imm26

    1

    0

    Rd3

    Rd2

    A

    B

    0123

    0123

    W

    Data

    D

    Imm

    32

    RegisterFile

    RB

    BusA

    BusB

    RWBusW

    RARt

    Instruction sequence:

    lw $t4, 4($t0)ori $t7, $t1, 2

    sub $t3, $t4, $t7

    When subinstruction is fetched

    oriwill be in the ALU stagelw will be in the MEM stage

    ForwardA = 2from MEM stage ForwardB = 1from ALU stage

    lw $t4,4($t0)ori $t7,$t1,2sub $t3,$t4,$t7

    2

    1

  • 8/12/2019 pipelined processor design.ppt

    36/72

    3612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    RAW Hazard Detection

    Currentinstruction being decoded is in Decodestage

    Previousinstruction is in the Executestage

    Second previousinstruction is in the Memorystage

    Third previousinstruction in the Write Backstage

    If ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite)) ForwardA 1

    Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite)) ForwardA 2

    Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite)) ForwardA 3

    Else ForwardA 0

    If ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite)) ForwardB 1

    Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite)) ForwardB 2

    Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite)) ForwardB 3

    Else ForwardB 0

  • 8/12/2019 pipelined processor design.ppt

    37/72

  • 8/12/2019 pipelined processor design.ppt

    38/72

    3812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Pipeline Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    39/72

    3912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Reg

    Reg

    Reg

    Time (cycles)

    Program

    Order

    CC2

    add $s4, $s2, $t5

    Reg

    IF

    CC3

    or $t6, $t3, $s2

    ALU

    IF

    CC6

    Reg

    DM

    ALU

    CC7

    Reg

    Reg

    DM

    CC8

    Reg

    lw $s2, 20($t1) IF

    CC1 CC4

    and $t7, $s2, $t4

    DM

    ALU

    IF

    CC5

    DM

    ALU

    Load Delay

    Unfortunately, not all data hazards can be forwarded

    Loadhas a delay that cannot be eliminated by forwarding

    In the example shown below

    The LWinstruction does not read data until end of CC4

    Cannot forward data to ADDat end of CC3 - NOT possible

    However, load canforward data to

    2nd next and later

    instructions

  • 8/12/2019 pipelined processor design.ppt

    40/72

    4012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Detecting RAW Hazard after Load

    Detecting a RAW hazard after a Load instruction:

    The loadinstruction will be in the EXstage

    Instruction that depends on the load data is in the decode stage

    Condition for stalling the pipeline

    if ((EX.MemRead == 1) // Detect Load in EX stage

    and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard

    Insert a bubbleinto the EX stage after a load instruction

    Bubble is a no-opthat wastes one clock cycle

    Delays the dependent instruction after load by once cycle

    Because of RAW hazard

  • 8/12/2019 pipelined processor design.ppt

    41/72

    4112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Regor $t6, $s3, $s2 IM DM RegALU

    RegALU DMReg

    add $s4, $s2, $t5 IM

    Reglw $s2, 20($s1) IM

    stall

    ALU

    bubble bubble bubble

    DM Reg

    Stall the Pipeline for one Cycle

    ADDinstruction depends on LWstall at CC3

    Allow Loadinstruction in ALUstage to proceed

    Freeze PCand Instructionregisters (NO instruction is fetched)

    Introduce a bubbleinto the ALUstage (bubble is a NO-OP)

    Loadcan forward data to next instruction after delaying itTime (cycles)

    Program

    Order

    CC2 CC3 CC6 CC7 CC8CC1 CC4 CC5

  • 8/12/2019 pipelined processor design.ppt

    42/72

  • 8/12/2019 pipelined processor design.ppt

    43/72

    4312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Control Signals

    Bubble= 0

    0

    1

    Hazard Detect, Forward, and Stall

    0123

    0123

    Result

    3232

    clk

    Rd

    32

    Rs

    0

    1

    ALU result

    32

    0

    1

    Data

    Memory

    Address

    Data_in

    Data_out

    32

    Rd4

    ALU

    E

    Imm26

    1

    0

    Rd3

    Rd2

    A

    BWData

    D

    Im26

    32

    RegisterFile

    RB

    BusA

    BusB

    RWBusW

    RARt

    Instruction

    ForwardB ForwardA

    Hazard Detect

    Forward, & Stall

    func

    RegDst

    Main & ALUControl

    Op

    MEME

    X

    WB

    RegWriteRegWrite

    RegWrite

    MemReadStall

    DisablePC

    PC

  • 8/12/2019 pipelined processor design.ppt

    44/72

    4412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    CPI for a Pipeline Processor

    CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW

    Stalls + Branch Stalls CPIIdealis 1 for a simple Pipeline processor

    CPIIdealis 0.5 for 2-Way Super Scalar processor

    CPIIdealis 0.25 for 4-Way Super Scalar processor, and so on . . .

    There are no Structural Hazards in the pipeline displayed in previous slides.

    WAR and WAW hazards also do not appear in the current pipeline

    Branch Hazards will be discussed later.

    RAW Hazards (Caused due to data dependencies)

    Majority of RAW Hazards are resolved by forwarding

    Load RAW Hazards are resolved by Stalling

    Stalls due to Load is --- One Cycles in the current Pipeline

  • 8/12/2019 pipelined processor design.ppt

    45/72

    4512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    CPI for a Pipeline Processor

    Assume 37% Load instructions

    40% of the Load Instructions Cause Stalls

    How many cycle stalls for load instructions, in current pipeline?

    One Cycle

    CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls +WAW Stalls + Branch Stalls

    CPIPipeline= 1 + 0 + RAW Stalls + 0 + 0 + 0

    CPIPipeline= 1 + RAW Stalls

    CPIPipeline= 1 + Load Instructions x Load Instructions that CauseStalls x Number of Cycles of Stalls

    CPIPipeline= 1 + 0.37 x 0.4 x 1

    CPIPipeline= 1 + 0.148 = 1.148

  • 8/12/2019 pipelined processor design.ppt

    46/72

    4612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Code Scheduling to Avoid Stalls

    Compilers reorder code in a way to avoid load stalls

    Consider the translation of the following statements:

    A = B + C; D = E F; // A thru F are in Memory

    Slow code:

    lw $t0, 4($s0) # &B = 4($s0)lw $t1, 8($s0) # &C = 8($s0)

    add $t2, $t0, $t1 # stall cycle

    sw $t2, 0($s0) # &A = 0($s0)

    lw $t3, 16($s0) # &E = 16($s0)lw $t4, 20($s0) # &F = 20($s0)

    sub $t5, $t3, $t4 # stall cycle

    sw $t5, 12($0) # &D = 12($0)

    Fast code: No Stalls

    lw $t0, 4($s0)lw $t1, 8($s0)

    lw $t3, 16($s0)

    lw $t4, 20($s0)

    add $t2, $t0, $t1sw $t2, 0($s0)

    sub $t5, $t3, $t4

    sw $t5, 12($s0)

    Name Dependence: Write After

  • 8/12/2019 pipelined processor design.ppt

    47/72

    4712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Instruction Jshould write its result after it is read by I

    Called anti-dependenceby compiler writers

    I: sub $t4, $t1, $t3 # $t1 is read

    J: add $t1,$t2, $t3 # $t1 is written

    Results from reuse of the name $t1

    NOT a data hazard in the 5-stage pipeline because:

    Reads are always in stage 2

    Writes are always in stage 5, and

    Instructions are processed in order

    Anti-dependence can be eliminated by renaming

    Use a different destination register for add (eg, $t5)

    Name Dependence: Write AfterRead

    Name Dependence: Write After

  • 8/12/2019 pipelined processor design.ppt

    48/72

    4812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Name Dependence: Write AfterWrite

    Same destination register is written by two instructions

    Called output-dependence in compiler terminology

    I: sub $t1, $t4, $t3 # $t1 is written

    J: add $t1, $t2, $t3 # $t1 is written

    again

    Not a data hazard in the 5-stage pipeline because:

    All writes are ordered and always take place in stage 5

    However, can be a hazard in more complex pipelines

    If instructions are allowed to complete out of order, and

    Instruction Jcompletes and writes $t1before instruction I

    Output dependence can be eliminated by renaming $t1

    Read After Read is NOT a name dependence

  • 8/12/2019 pipelined processor design.ppt

    49/72

    4912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    50/72

    5012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Control Hazards

    Jump and Branch can cause great performance loss

    Jump instruction needs only thejump target address

    Branch instruction needs two things:

    Branch Result Taken or Not Taken

    Branch Target Address

    PC + 4 If Branch is NOT taken

    PC + 4 + 4 immediate If Branch is Taken

    Jump and Branch targets are computed in the ID stageAt which point a new instruction is already being fetched

    Jump Instruction: 2-cycle delay

    Branch: 2-cycle delay for branch result (taken or not taken)

  • 8/12/2019 pipelined processor design.ppt

    51/72

    5112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    2-Cycle Branch Delay

    Control logic detects a Branchinstruction in the 2ndStage

    ALU computes the Branch outcome in the 3rdStage

    Next1 andNext2instructions will be fetched anyway

    Convert Next1and Next2into bubbles if branch is taken

    Beq $t1,$t2,L1 IF

    cc1

    Next1

    cc2

    Reg

    IF

    Next2

    cc4 cc5 cc6 cc7

    IF Reg DMALU

    BubbleBubble Bubble

    BubbleBubble BubbleBubble

    L1: target instruction

    cc3

    BranchTargetAddr

    ALU

    Reg

    IF

  • 8/12/2019 pipelined processor design.ppt

    52/72

    5212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Implementing Jump and Branch

    zero

    clk

    32

    ALU

    32

    5

    Address

    Instruction

    Instruction

    Memory Rs

    5Rt1

    0

    32

    Regis

    terFile

    RA

    RB

    BusA

    BusB

    RWBusW

    EPC

    ALUout

    D

    Imm16

    Imm26

    A

    B

    Im26

    NPC

    2

    In

    struction

    NPC

    +1

    0

    1Rd

    Rd20

    1 Rd3

    PCSrc

    Bne

    BeqJ

    Op

    RegDst

    EX

    J, Beq, Bne

    MEM

    Main & ALUControl

    fun

    c

    Next

    PC

    0

    2

    3

    1

    0

    2

    3

    1

    Control Signals0

    1Bubble = 0

    Branch Delay = 2 cycles

    Branch target & outcomeare computed in ALU stage

    J

    umporBranchTarge

    t

  • 8/12/2019 pipelined processor design.ppt

    53/72

    5312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions

    Assume to stall until branch condition and target is resolved

    Assume conditional 54% of conditional branches are true (i.e., these branches areTAKEN)

    Number of Stalls

    Branch Stalls incase of Taken Branch = 2

    Branch Stalls incase of Not-Taken = 2

    Jump Stalls incase of Jump = 2

    How many cycle stalls for load instructions, in current pipeline?

    One Cycle

    CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls +Branch Stalls

    CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls

    Ignoring RAW Stalls, as information is not provided

    CPIPipeline= 1 + Branch Stalls

    CPIPipeline= 1 + Branch Stalls

    CPIPipeline= 1 + Branch Taken Stalls + Branch Not-Taken Stalls + Jump Stalls

    CPIPipeline= 1 + 0.18 x 0.54 x 2 + 0.18 x 0.36 x 2 + 0.10 x 2

  • 8/12/2019 pipelined processor design.ppt

    54/72

    5412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Predict Branch NOT Taken

    Branches can be predicted to be NOT taken

    If branch outcome is NOT taken then

    Next1 andNext2instructions can be executed

    Do not convert Next1& Next2into bubbles

    No wasted cycles

    Beq $t1,$t2,L1 IF

    cc1

    Next1

    cc2

    Reg

    IF

    Next2

    cc3

    NOT TakenALU

    Reg

    IF Reg

    cc4 cc5 cc6 cc7

    ALU DM

    ALU DM

    Reg

    Reg

  • 8/12/2019 pipelined processor design.ppt

    55/72

    5512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions

    Assume to Predict Not-Taken Branch policy

    Assume conditional 54% of conditional branches are true (i.e., these branches are TAKEN)

    Number of Stalls

    Branch Stalls incase of Taken Branch = 2

    Branch Stalls incase of Not-Taken = 0

    If the condition is not true then the correct instructions are already in pipeline

    Jump Stalls incase of Jump = 2

    How many cycle stalls for load instructions, in current pipeline?

    One Cycle

    CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls + Branch Stalls

    CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls

    Ignoring RAW Stalls, as information is not provided

    CPIPipeline= 1 + Branch Stalls

    CPIPipeline= 1 + Branch Stalls CPIPipeline= 1 + Branch Taken Stalls + Branch Not-Taken Stalls + Jump Stalls

    CPIPipeline= 1 + 0.18 x 0.54 x 2 + 0.18 x 0.36 x 0 + 0.10 x 2

    CPIPipeline= 1 + 0.18 x 0.54 x 2 + 0.10 x 2

  • 8/12/2019 pipelined processor design.ppt

    56/72

    5612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Reducing the Delay of Branches

    Branch delay can be reduced from 2 cycles tojust 1 cycle

    Branches can be determined earlier in the Decode stage

    A comparator is used in the decode stage to determine branchdecision, whether the branch is taken or not

    Because of forwarding the delay in the second stage will beincreased and this will also increase the clock cycle

    Only one instructionthat follows the branch is fetched

    If the branch is taken then only one instruction is flushed

    We should insert a bubble after jump or taken branch

    This will convert the next instruction into a NOP

  • 8/12/2019 pipelined processor design.ppt

    57/72

    5712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    J, Beq, Bne

    Reducing Branch Delay to 1 Cycle

    clk

    32

    ALU

    32

    5

    Address

    Instruction

    Instruction

    Memory Rs

    5Rt1

    0

    32

    Regis

    terFile

    RA

    RB

    BusA

    BusB

    RWBusW

    E

    PC

    ALUout

    D

    Imm16

    A

    B

    Im16

    In

    struction

    +1

    0

    1Rd

    Rd20

    1 Rd3

    PCSrc

    Bne

    BeqJ

    Op

    RegDst

    EX

    MEM

    Main & ALUControl

    fun

    c

    0

    2

    3

    1

    0

    2

    3

    1

    Control Signals0

    1Bubble = 0

    J

    umporBranchTarget

    Reset signal convertsnext instruction afterjump or taken branchinto a bubble

    Zero=

    ALUCtrl

    Reset

    NextPC

    Longer Cycle

    Data forwardedthen compared

  • 8/12/2019 pipelined processor design.ppt

    58/72

    5812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions

    Assume to Predict Not-Taken Branch policy

    Assume conditional 54% of conditional branches are true (i.e., these branches are TAKEN)

    Number of Stalls

    Branch Stalls incase of Taken Branch = 1

    Branch Stalls incase of Not-Taken = 0

    If the condition is not true then the correct instructions are already in pipeline

    Jump Stalls incase of Jump = 1

    How many cycle stalls for load instructions, in current pipeline?

    One Cycle

    CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls + Branch Stalls

    CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls

    Ignoring RAW Stalls, as information is not provided

    CPIPipeline= 1 + Branch Stalls

    CPIPipeline= 1 + Branch Stalls CPIPipeline= 1 + Branch Taken Stalls + Branch Not-Taken Stalls + Jump Stalls

    CPIPipeline= 1 + 0.18 x 0.54 x 1 + 0.18 x 0.36 x 0 + 0.10 x 1

    CPIPipeline= 1 + 0.18 x 0.54 x 1 + 0.10 x 1

  • 8/12/2019 pipelined processor design.ppt

    59/72

    5912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Next . . .

    Pipelining versus Serial Execution

    Pipelined Datapath and Control

    Pipeline Hazards

    Data Hazards and Forwarding

    Load Delay, Hazard Detection, and Stall

    Control Hazards

    Delayed Branch and Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    60/72

    6012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Branch Hazard Alternatives

    Predict Branch Not Taken (previously discussed)

    Successor instruction is already fetched

    Do NOT Flush instruction after branch if branch is NOT taken

    Flush only instructions appearing after Jump or taken branch

    Delayed Branch Define branch to take placeAFTERthe next instruction

    Compiler/assembler fills the branch delay slot (for 1 delay cycle)

    Dynamic Branch Prediction

    Loop branches are taken most of time

    Must reduce branch delay to 0, but how?

    How to predict branch behavior at runtime?

    l d h

  • 8/12/2019 pipelined processor design.ppt

    61/72

    6112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Define branch to take place afterthe next instruction

    For a 1-cycle branch delay, we have onedelay slot

    branch instruction

    branch delay slot (next instruction)

    branch target (if branch taken)

    Compiler fills the branch delay slot

    By selecting an independent instruction

    From before the branch

    If no independent instruction is found

    Compiler fills delay slot with a NO-OP

    Delayed Branch

    label:

    . . .

    add $t2,$t3,$t4

    beq $s1,$s0,label

    Delay Slot

    label:

    . . .

    beq $s1,$s0,label

    add $t2,$t3,$t4

    b k f l d h

  • 8/12/2019 pipelined processor design.ppt

    62/72

    6212-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    New meaning for branch instruction

    Branching takes place after next instruction (Not immediately!)

    Impacts software and compiler

    Compiler is responsible to fill the branch delay slot

    For a 1-cycle branch delay Onebranch delay slot

    However, modern processors and deeply pipelined

    Branch penalty is multiple cycles in deeper pipelines

    Multiple delay slots are difficult to fill with useful instructions

    MIPS used delayed branching in earlier pipelines

    However, delayed branching is not useful in recent processors

    Drawback of Delayed Branching

    CPI f Pi li P

  • 8/12/2019 pipelined processor design.ppt

    63/72

    6312-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    CPI for a Pipeline Processor Assume 18% Branch instructions, and 10% Branch Instructions

    Assume to Delay Branch policy Assume conditional 54% of conditional branches are true (i.e., these

    branches are TAKEN)

    Assume 75% of delay slots of branch instructions are filled with a validinstruction (i.e., no NO-OP)

    Assume 85% of delay slots of jump instructions are filled with a valid

    instruction.

    How many cycle stalls for load instructions, in current pipeline?

    One Cycle

    CPIPipeline= CPIIdeal+ Structural Stalls + RAW Stalls + WAR Stalls + WAWStalls + Branch Stalls

    CPIPipeline= CPIIdeal+ RAW Stalls + Branch Stalls

    Ignoring RAW Stalls, as information is not provided

    CPIPipeline= 1 + Branch Stalls + Jump Stalls

    CPIPipeline= 1 + 0.18 x 0.25 x 1 + 0.10 x 0.15 x 1

    Z D l d B hi

  • 8/12/2019 pipelined processor design.ppt

    64/72

    6412-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Zero-Delayed Branching

    How to achieve zero delay for a jump or a taken branch?

    Jump or branch target address is computed in the ID stage

    Next instruction has already been fetched in the IF stage

    Solution

    Introduce a Branch Target Buffer (BTB) in the IF stage Store the target address of recent branch and jump instructions

    Use the lower bits of the PC to index the BTB

    Each BTB entry stores Branch/Jump address & Target Address Check the PC to see if the instruction being fetched is a branch

    Update the PC using the target address stored in the BTB

    B h T t B ff

  • 8/12/2019 pipelined processor design.ppt

    65/72

    6512-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Branch Target Buffer

    The branch target bufferis implemented as a small cache

    Stores the target address of recent branches and jumps

    We must also have prediction bits

    To predict whether branches are taken or not taken

    The prediction bits are dynamically determined by the hardware

    mux

    PC

    Branch Target & Prediction Buffer

    Addresses of

    Recent Branches

    Target

    Addresses

    low-order bits

    used as index

    Predict

    BitsInc

    =predict_taken

    D i B h P di ti

  • 8/12/2019 pipelined processor design.ppt

    66/72

    6612-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Dynamic Branch Prediction

    Prediction of branches at runtime using prediction bits

    Prediction bits are associated with each entry in the BTB

    Prediction bits reflect the recent history of a branch instruction

    Typically few prediction bits (1 or 2) are used per entry

    We dont know if the prediction is correct or not If correct prediction

    Continue normal executionno wasted cycles

    If incorrect prediction (misprediction) Flush the instructions that were incorrectly fetchedwasted cycles

    Update prediction bits and target address for future use

    Dynamic Branch Prediction

  • 8/12/2019 pipelined processor design.ppt

    67/72

    6712-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Correct PredictionNo stall cycles

    YesNo

    Dynamic Branch Prediction Contd

    Use PC to address InstructionMemory and Branch Target Buffer

    FoundBTB entry with predict

    taken?

    Increment PC PC = target address

    Mispredicted Jump/branchEnter jump/branch address, target

    address, and set prediction in BTB entry.Flush fetched instructions

    Restart PC at target address

    Mispredicted branchBranch not taken

    Update prediction bitsFlush fetched instructionsRestart PC after branch

    NormalExecution

    YesNo

    Jumpor takenbranch?

    Jumpor takenbranch?

    No Yes

    IF

    ID

    EX

    1 bit P di ti S h

  • 8/12/2019 pipelined processor design.ppt

    68/72

    6812-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Prediction is just a hint that is assumed to be correct

    If incorrect then fetched instructions are flushed

    1-bit prediction scheme is simplest to implement

    1 bit per branch instruction (associated with BTB entry)

    Record last outcome of a branch instruction (Taken/Not taken)

    Use last outcome to predict future behavior of a branch

    1-bit Prediction Scheme

    Predict

    Not Taken

    TakenPredict

    Taken

    NotTaken

    Not Taken

    Taken

    1 Bit P di t Sh t i

  • 8/12/2019 pipelined processor design.ppt

    69/72

    6912-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    1-Bit Predictor: Shortcoming

    Inner loop branch mispredicted twice!

    Mispredict as taken on last iteration of inner loop

    Then mispredict as not taken on first iteration of innerloop next time around

    outer:

    inner: bne , , innerbne , , outer

    2 bit P di ti S h

  • 8/12/2019 pipelined processor design.ppt

    70/72

    7012-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    1-bit prediction scheme has a performance shortcoming

    2-bit prediction scheme works better and is often used

    4 states: strong and weak predict taken / predict not taken

    Implemented as a saturating counter

    Counter is incremented to max=3 when branch outcome is taken

    Counter is decremented to min=0 when branch is not taken

    2-bit Prediction Scheme

    Not Taken

    Taken

    Not Taken

    TakenStrong

    Predict

    Not Taken

    Taken

    Weak

    Predict

    Taken

    Not Taken

    Weak

    Predict

    Not TakenNot Taken

    TakenStrong

    Predict

    Taken

    Fallacies and Pitfalls

  • 8/12/2019 pipelined processor design.ppt

    71/72

    7112-Jun-14 | Adeel Israr | COMSATS Institute of Information Technology, Islamabad

    Fallacies and Pitfalls Pipelining is easy!

    The basic idea is easy

    The devil is in the details

    Detecting data hazards and stalling pipeline

    Poor ISA design can make pipelining harder Complex instruction sets (Intel IA-32)

    Significant overhead to make pipelining work

    IA-32 micro-op approach

    Complex addressing modes

    Register update side effects, memory indirection

    Pipeline Ha a ds S mma

  • 8/12/2019 pipelined processor design.ppt

    72/72

    Pipeline Hazards Summary

    Three types of pipeline hazards

    Structural hazards: conflicts using a resource during same cycle

    Data hazards: due to data dependencies between instructions

    Control hazards: due to branch and jump instructions

    Hazards limit the performance and complicate the design Structural hazards: eliminated by careful design or more hardware

    Data hazards are eliminated by forwarding

    However, load delay cannot be eliminated and stalls the pipeline

    Delayed branching can be a solution when branch delay = 1 cycle

    BTB with branch prediction can reduce branch delay to zero

    Branch misprediction should flush the wrongly fetched instructions