Hardware Speculation


  • 7/25/2019 Hardware Speculation

    1/20

    10/8/15

    1

    ILP: Out-of-Order Execution

    Antonia Zhai

    Department of Computer Science and Engineering

    University of Minnesota

    http://www.cs.umn.edu/~zhai

    With slides from: Profs. Mowry, Falsafi, Hill, Hoe, Lipasti, Shen,

    Smith, Sohi, Vijaykumar, Patterson, Culler

    Datapath for Conditional Branch Instructions

    [Figure: datapath for a branch-on-equal instruction in the IF (instruction fetch) stage. Labeled blocks: PC, Instr. Mem., IR, Reg. Array (ports regA, regB, regW, datW, datA, datB), Xtnd, ALU (inputs aluA, aluB), and the +4 adder producing IncrPC; instruction fields 25:21 and 20:16.]

    Branch Prediction

    Why does prediction work?

    Underlying algorithm has regularities: loops are iterated multiple times

    Data that is being operated on has regularities

    Instruction sequence has redundancies:

    Artifacts of the way that humans/compilers think

    E.g., error-checking branches are rarely taken

    Prediction → compressible information streams?

    Prediction allows us to break control dependence constraints

    Elements of Branch Prediction

    Determine whether it is a branch instruction

    Predict whether it will be taken or not

    Predict the target address if taken

    Just Predicting Taken/Not Taken Can Help

    r1 ← r2 / r3
    r2 ← r1 / r3
    r3 ← r2 - r3
    beq r3, 100

    The target can be computed much earlier than the branch decision

    Prediction I: A branch will do exactly what it did last time

    Branch History Table (BHT)

    Each entry is a state machine, indexed by the low-order bits of the instruction address

    Encodes information about the prior history of branch instructions

    Small chance of two branch instructions aliasing

    Predicts whether or not the branch will be taken

    [Figure: the low-order bits of the instruction address (.. 1 0 1 0 0 0 0 0) form the branch prediction table index, which selects one of the 1-bit entries (0, 1, 1, 1, 1, 0, 0, 0).]

    Example

    Both 0x108 and 0x208 map to the table entry at index 0x08, which holds 0/1.

    Taken/Not taken   Instruction            Prediction
    Taken             0x108: beq r1, 0x20    Not taken
    Taken             0x108: beq r1, 0x20    Taken
    Taken             0x108: beq r1, 0x20    Taken
    Not taken         0x108: beq r1, 0x20    Taken
    Taken             0x108: beq r1, 0x20    Not taken
    Not taken         0x208: beq r2, 0x10    Taken
    Taken             0x108: beq r1, 0x20    Not taken

    Problem: Predictor changes too quickly
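The table above can be reproduced with a small simulation. This is a minimal sketch, not the slides' implementation: the class name `OneBitBHT`, the 8-bit index width (chosen so that 0x108 and 0x208 both land on index 0x08), and the initial "not taken" state are assumptions made for illustration.

```python
class OneBitBHT:
    """1-bit branch history table: each entry remembers the last outcome."""

    def __init__(self, index_bits=8):
        # Assumption: index with the low 8 address bits, so 0x108 and 0x208 alias.
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)  # False = predict not taken

    def predict(self, pc):
        return self.table[pc & self.mask]

    def update(self, pc, taken):
        self.table[pc & self.mask] = taken


# (address, actual outcome) pairs taken from the example table.
trace = [(0x108, True), (0x108, True), (0x108, True), (0x108, False),
         (0x108, True), (0x208, False), (0x108, True)]

bht = OneBitBHT()
predictions = []
for pc, taken in trace:
    predictions.append(bht.predict(pc))  # predict before the outcome is known
    bht.update(pc, taken)                # then record what actually happened

mispredictions = sum(p != t for p, (_, t) in zip(predictions, trace))
print(predictions, mispredictions)
```

Under these assumptions the predictor is wrong on 5 of the 7 branches, which illustrates how a single not-taken outcome or an aliasing branch flips the entry immediately.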

    Example

    for (i = 0; i < 100; i++) {
        for (j = 0; j < 10; j++) {
            total = a[i][j];
        }
    }

    What is the misprediction rate?

    There are two branches:

    1. Backward branch for the inner loop
       (2 out of 10 mispredicted for each invocation, 100 invocations; misprediction rate: 200/1000)

    2. Backward branch for the outer loop
       (2/100 mispredicted in total)

    Solution:

    (200 + 2)/(1000 + 100) = 202/1100 ≈ 18.4%
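The count can be checked by simulation. The sketch below assumes bottom-tested loops (each backward branch is taken to continue and falls through on exit), a private 1-bit entry per branch with no aliasing, and an initial "not taken" state; the helper names are hypothetical.

```python
def loop_branch_outcomes(outer=100, inner=10):
    """Yield (branch name, taken) for the two backward branches of the nest."""
    for i in range(outer):
        for j in range(inner):
            yield "inner", j != inner - 1   # taken except on the last iteration
        yield "outer", i != outer - 1


def simulate_one_bit(outcomes):
    last = {}     # one 1-bit entry per branch (assumes no aliasing)
    misses = {}
    counts = {}
    for name, taken in outcomes:
        predicted = last.get(name, False)   # assumed initial state: not taken
        counts[name] = counts.get(name, 0) + 1
        misses[name] = misses.get(name, 0) + (predicted != taken)
        last[name] = taken                  # remember only the last outcome
    return misses, counts


misses, counts = simulate_one_bit(loop_branch_outcomes())
rate = sum(misses.values()) / sum(counts.values())
print(misses, counts, round(rate, 3))
```

With these assumptions the inner branch misses twice per invocation (on entry and on exit) and the outer branch misses twice overall, matching the figures on the slide.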


    Prediction II

    Change the prediction after two mispredictions

    2-bit saturating counter

    [Figure: four-state machine with states Yes!, Yes?, No?, No!; taken (T) outcomes move toward Yes! and not-taken (NT) outcomes move toward No!. The table entry at index 0x08 now holds 00/01/10/11.]

    Example

    for (i = 0; i < 100; i++) {
        for (j = 0; j < 10; j++) {
            total = a[i][j];
        }
    }

    What is the misprediction rate?

    There are two branches:

    1. Backward branch for the inner loop
       (1 out of 10 mispredicted for each invocation, 100 invocations, plus 2 extra misses in the first invocation; misprediction rate: 102/1000)

    2. Backward branch for the outer loop
       (3/100 mispredicted in total)

    Solution:

    (102 + 3)/(1000 + 100) = 105/1100 ≈ 9.5%

    Generalization

    Using an N-bit saturating counter as a predictor

    If branch taken & counter value < (2^N - 1): increment counter

    If branch not taken & counter value > 0: decrement counter

    Prediction:
      Taken: if most significant bit is 1
      Not taken: if most significant bit is 0

    Find the proper N:

    We want to remember the history, but only recent history
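The update and prediction rules above translate almost directly into code. The following is a minimal sketch under stated assumptions: counters start at 0, each branch has its own counter, and loops are bottom-tested as in the nested-loop example; `SaturatingCounter` and `nested_loop_misses` are illustrative names.

```python
class SaturatingCounter:
    """N-bit saturating counter; the most significant bit is the prediction."""

    def __init__(self, bits, value=0):
        self.bits = bits
        self.max_value = (1 << bits) - 1
        self.value = value              # assumed initial state: all zeros

    def predict(self):
        return bool(self.value >> (self.bits - 1))   # MSB set means "taken"

    def update(self, taken):
        if taken and self.value < self.max_value:
            self.value += 1             # saturate at 2^N - 1
        elif not taken and self.value > 0:
            self.value -= 1             # saturate at 0


def nested_loop_misses(bits, outer=100, inner=10):
    """Count mispredictions of the two backward branches of the loop nest."""
    counters = {"inner": SaturatingCounter(bits), "outer": SaturatingCounter(bits)}
    misses = {"inner": 0, "outer": 0}

    def step(name, taken):
        counter = counters[name]
        misses[name] += counter.predict() != taken
        counter.update(taken)

    for i in range(outer):
        for j in range(inner):
            step("inner", j != inner - 1)
        step("outer", i != outer - 1)
    return misses


one_bit = nested_loop_misses(1)
two_bit = nested_loop_misses(2)
print(one_bit, two_bit)
```

With N = 1 the sketch degenerates to the "last outcome" predictor (202 misses), and with N = 2 it gives the 102 + 3 misses quoted for the 2-bit counter.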

    Prediction III

    Whether a branch is taken or not depends on other branch instructions

    How can we make use of this information?

    if (a > 1)       // branch #1
        conquer the world
    if (a < -1)      // branch #2
        clean my living room

    Two branch instructions:

    If branch #1 is taken, branch #2 will never be taken


    Correlation Branch Predictor

    Every branch has two separate predictors

    One bit predicts the branch if the last branch is taken

    One bit predicts the branch if the last branch is not taken

    A.K.A. two-level predictor

    [Figure: two 1-bit state machines, one used when the previous branch was taken and one when it was not taken, each switching between T and NT. Caption: One bit predictor with one bit of correlation]

    Example

    for (i = 0; i < 10; i++) {
        a = random(-100, 100);   // a is a random number from -100 to 100
        if (a > 1)               // b1
            conquer the world
        if (a < -1)              // b2
            clean my living room
    }                            // b3

    Branch #2

    Input: -10, 7, 5, 10, -2, -55, 4, -89, 33, -3

    Each predictor state is written as (bit used when B1 is not taken) / (bit used when B1 is taken).

    Input   B2 predictor (correlated with B1)   B1 action   B2 prediction   B2 action
    -10     NT / NT                             NT          NT              T
    7       T / NT                              T           NT              NT
    5       T / NT                              T           NT              NT
    10      T / NT                              T           NT              NT
    -2      T / NT                              NT          T               T
    -55     T / NT                              NT          T               T
    4       T / NT                              T           NT              NT
    -89     T / NT                              NT          T               T
    33      T / NT                              T           NT              NT
    -3      T / NT                              NT          T               T
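The behavior of branch b2 in this example can be replayed in a few lines. This is a sketch under assumptions: "taken" is modeled as the condition being true (as the slides do), both correlation bits start at NT, and the dictionary layout is purely illustrative.

```python
inputs = [-10, 7, 5, 10, -2, -55, 4, -89, 33, -3]

# b2_bits[b1_taken] is the 1-bit prediction for b2, selected by b1's outcome.
b2_bits = {False: False, True: False}   # assumed initial state: NT / NT

rows = []
for a in inputs:
    b1_taken = a > 1
    b2_taken = a < -1
    prediction = b2_bits[b1_taken]       # pick the bit that matches b1's outcome
    rows.append((b1_taken, prediction, b2_taken))
    b2_bits[b1_taken] = b2_taken         # 1-bit update of the selected bit only

misses = sum(prediction != actual for _, prediction, actual in rows)
print(misses, "misprediction(s) out of", len(rows))
```

Only the very first prediction is wrong in this run; after that the bit selected by b1's outcome always agrees with b2, because the two conditions are mutually exclusive.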


    Definition

    (1, 1) predictor

    Uses the behavior of the last branch

    Selects from 2^1 sets of choices

    Each choice is coded with 1 bit

    (m, n) predictor

    Use the history of m branches

    Select from 2^m sets of choices

    Each choice is coded with n bits

    Example: How many bits are there in a 1024-entry (2, 2) branch predictor?

    1024 * (2^2) * 2 = 8192 bits
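The storage calculation generalizes to any (m, n) predictor. A small sketch, where `predictor_bits` is a hypothetical helper name:

```python
def predictor_bits(entries, m, n):
    """Storage of an (m, n) predictor: 2^m n-bit counters per table entry."""
    return entries * (2 ** m) * n


print(predictor_bits(1024, 2, 2))   # the (2, 2) example from the slide
print(predictor_bits(1024, 1, 1))   # a (1, 1) predictor of the same depth
```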


    A (2,2) Predictor

    [Figure: the branch address selects a row of 2-bit counters (00/01/10/11), and the two-bit global history (XX) selects one counter within that row.]

    Combine the Local & Global Predictors

    Local predictor: predict based on the history of just one branch

    Global predictor: predict based on global history

    Combine them with a selector (a multi-level predictor)

    A branch predictor without branch address???


    Tournament Predictor

    Alpha branch predictor

    4K-entry predictor-predictor (selector): uses a 2-bit saturating counter to select between the two predictors, based on local information of the branch

    A local predictor: 1024 entries of 10 bits each, keeping track of the 10 most recent outcomes; the entry then selects from a table of 3-bit saturating counters

    4K-entry global predictor, indexed by the history of 12 branches; each entry is a standard 2-bit predictor

    11.5 mispredictions per 1000 completed instructions for SPECint95

    Branch Target Buffer

    [Figure: a 16-entry branch target buffer, indices 0x00 through 0x0f. Two branch addresses map to the same entry with different tags:
      0x0000ac24 = 0b0000001010110000100100 (A C 2 4), tag 0x2b0
      0x0000aca4 = 0b0000001010110010100100 (A C A 4), tag 0x2b2]

    Example

    TAG/Address   Instruction (all branches are taken)   Prediction
    NULL/NULL     0xac24: beq r1, 0x20                   No match, no prediction
    2b0/0xac48    0xac24: beq r1, 0x20                   Match, 0xac48
    2b0/0xac48    0xac24: beq r1, 0x20                   Match, 0xac48
    2b0/0xac48    0xac24: beq r1, 0x20                   Match, 0xac48
    2b0/0xac48    0xac24: beq r1, 0x20                   Match, 0xac48
    2b0/0xac48    0xac24: beq r1, 0x20                   Match, 0xac48
    2b0/0xac48    0xac24: beq r1, 0x20                   Match, 0xac48
    2b0/0xac48    0xaca4: beq r2, 0x10                   No match, no prediction
    2b2/0xacb8    0xac24: beq r1, 0x20                   No match, no prediction
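The lookups in this example can be sketched as a small direct-mapped table. The bit split used here is an assumption inferred from the numbers on the slides (drop the 2-bit byte offset, use the next 4 bits as the index, and the remaining upper bits as the tag, which yields tags 0x2b0 and 0x2b2); the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds (tag, target) or None."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def split(self, pc):
        word = pc >> 2   # assumption: 4-byte instructions, drop the byte offset
        index = word & ((1 << self.index_bits) - 1)
        tag = word >> self.index_bits
        return index, tag

    def lookup(self, pc):
        index, tag = self.split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]          # tag match: predicted target
        return None                  # no match, no prediction

    def update(self, pc, target):
        index, tag = self.split(pc)
        self.entries[index] = (tag, target)


# (branch address, actual target); all branches are taken, as on the slide.
trace = [(0xac24, 0xac48)] * 7 + [(0xaca4, 0xacb8), (0xac24, 0xac48)]

btb = BranchTargetBuffer()
predictions = []
for pc, target in trace:
    predictions.append(btb.lookup(pc))
    btb.update(pc, target)

print([hex(p) if p is not None else None for p in predictions])
```

The last two lookups miss because 0xac24 and 0xaca4 share index 0x9 but carry different tags, so each branch evicts the other's entry.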


    The Entire Process


    Hardware Speculation

    When the prediction is wrong, incorrectly executed instructions must be erased

    Our hardware support system does not allow this

    Extending the hardware: hardware speculation

    Separate the bypassing of results among instructions from the completion of an instruction

    Adding an instruction commit stage to Tomasulo's algorithm

    Goal: instructions commit in order


    Reorder Buffer (ROB)

    Hardware buffer that holds the results of instructions that have finished execution but not yet committed

    Fields of each entry: Instruction Type, Destination Field, Value Field, Ready Field

    Tomasulo's Algorithm with ROB

    [Figure: block diagram with labeled components: instruction queue, operation bus, registers, operand buses, reservation stations, FP adders, FP multipliers, address unit, load buffer, memory unit, ROB (Reg#, store address, store value), and the common data bus (CDB).]

    Four Execution Stages

    Issue: get the instruction from the in-order instruction queue
      Issue the instruction if a reservation station and a ROB entry are available, else stall
      Read register value if available in the register file or ROB, else set tag
      Update register entry and ROB entry

    Execute (a.k.a. issue):
      Monitor the common data bus to wait for operands
      Execute when both operands are ready (resolves RAW dependences)

    Write Results:
      Write result to the CDB → all reservation stations, ROB
      For a store, the value and address are sent to the ROB

    Commit (a.k.a. complete, graduate)
      Normal commit: instruction is at the head of the ROB & its result is in the buffer
        Update register
        Remove instruction from ROB
      Store instruction
        Update memory
        Remove instruction from ROB
      Branch instruction
        Correctly predicted: nothing
        Incorrectly predicted: flush ROB
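The commit rules above can be sketched as a function that retires at most one instruction per call from the head of the ROB. This is a simplified illustration, not the slides' hardware: `RobEntry`, its field names, and the return strings are assumptions, and branch delay slots are ignored (a mispredicted branch flushes every younger entry).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    op: str                      # "alu", "store", or "branch"
    dest: object = None          # register name, or memory address for a store
    value: object = None
    ready: bool = False
    mispredicted: bool = False   # meaningful for branches only


def commit(rob, registers, memory):
    """Try to commit the instruction at the head of the ROB."""
    if not rob or not rob[0].ready:
        return "stall"                       # in-order: wait for the head
    entry = rob.popleft()
    if entry.op == "alu":
        registers[entry.dest] = entry.value  # normal commit: update register
    elif entry.op == "store":
        memory[entry.dest] = entry.value     # memory is written only at commit
    elif entry.op == "branch" and entry.mispredicted:
        rob.clear()                          # squash all younger instructions
        return "flush"
    return "commit"


registers = {"R0": 3, "R1": 100, "R2": 11}
memory = {}
rob = deque([
    RobEntry("alu", "R2", 9, ready=True),
    RobEntry("branch", ready=True, mispredicted=True),
    RobEntry("alu", "R0", 6, ready=True),    # speculative, must not commit
])
results = [commit(rob, registers, memory) for _ in range(3)]
print(results, registers)
```

Because architectural state changes only inside `commit`, the speculative write to R0 is discarded by the flush and never becomes visible.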


    Examples: Hardware Speculation

    Program (i2 is incorrectly predicted as not taken; i3 is in the branch delay slot):
      i1: R2 ← R0 * R0
      i2: bne R2, 0x20
      i3: 100(R1) ← st R0
      i4: R0 ← R0 + R0

    Latencies: multiplier 3 cycles; the other units 1 cycle each

    Time: T0
    Events: i1: issue

    Reservation stations (Op1, Op2, ROB):
      Mult 1: V:3, V:3, ROB1
      Adder 1, Adder 2, Mult 2, Branch: empty

    Register file (tag, value):
      R0: 3
      R1: 100
      R2: tag Mult 1, value 11

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1  ALU  R2   ---  No
      ROB2, ROB3, ROB4: empty

    Examples: Hardware Speculation

    Time: T1
    Events: i1: execute 1; i2: issue

    Reservation stations (Op1, Op2, ROB):
      Mult 1: V:3, V:3, ROB1
      Branch: PC V:PC+4, Op1 T:Mult1, Op2 V:0x20, ROB2
      Adder 1, Adder 2, Mult 2: empty

    Register file (tag, value):
      R0: 3
      R1: 100
      R2: tag Mult 1, value 11

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1  ALU  R2   ---  No
      ROB2  bne  ---  ---  No
      ROB3, ROB4: empty

    Examples: Hardware Speculation

    Time: T2
    Events: i1: execute 2; i2: wait (pred); i3: issue (not shown)

    Reservation stations (Op1, Op2, ROB):
      Mult 1: V:3, V:3, ROB1
      Branch: PC V:PC+4, Op1 T:Mult1, Op2 V:0x20, ROB2
      Adder 1, Adder 2, Mult 2: empty

    Register file (tag, value):
      R0: 3
      R1: 100
      R2: tag Mult 1, value 11

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1  ALU  R2   ---  No
      ROB2  bne  ---  ---  No
      ROB3  St   ---  ---  No
      ROB4: empty

    Examples: Hardware Speculation

    Time: T3
    Events: i1: execute 3; i2: wait; i3: execute (addr. unit); i4: issue

    Reservation stations (Op1, Op2, ROB):
      Adder 1: V:3, V:3, ROB4
      Mult 1: V:3, V:3, ROB1
      Branch: PC V:PC+4, Op1 T:Mult1, Op2 V:0x20, ROB2
      Adder 2, Mult 2: empty

    Register file (tag, value):
      R0: tag Adder 1, value 3
      R1: 100
      R2: tag Mult 1, value 11

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1  ALU  R2   ---  No
      ROB2  bne  ---  ---  No
      ROB3  St   ---  ---  No
      ROB4  ALU  R0   ---  No

    Examples: Hardware Speculation

    Time: T4
    Events: i1: write result; i2: wait; i3: write result (stall); i4: execute

    Reservation stations (Op1, Op2, ROB):
      Adder 1: V:3, V:3, ROB4
      Branch: PC V:PC+4, Op1 V:9, Op2 V:0x20, ROB2
      Adder 2, Mult 1, Mult 2: empty

    Register file (tag, value):
      R0: tag Adder 1, value 3
      R1: 100
      R2: tag Mult 1, value 11

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1  ALU  R2   9    Yes
      ROB2  bne  ---  ---  No
      ROB3  St   ---  ---  No
      ROB4  ALU  R0   ---  No

    Examples: Hardware Speculation

    Time: T5
    Events: i1: commit; i2: execute; i3: write results; i4: write results (stall)

    Reservation stations (Op1, Op2, ROB):
      Adder 1: V:3, V:3, ROB4
      Branch: PC V:PC+4, Op1 V:9, Op2 V:0x20, ROB2
      Adder 2, Mult 1, Mult 2: empty

    Register file (tag, value):
      R0: tag Adder 1, value 3
      R1: 100
      R2: 9

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1: empty
      ROB2  bne  ---  ---  No
      ROB3  St   200  V:3  Yes
      ROB4  ALU  R0   ---  No

    Examples: Hardware Speculation

    Time: T6
    Events: i2: write result (misprediction); i3: wait for commit; i4: write results

    Reservation stations (Op1, Op2, ROB):
      Adder 1: V:3, V:3, ROB4
      Branch: PC V:PC+4, Op1 V:9, Op2 V:0x20, ROB2
      Adder 2, Mult 1, Mult 2: empty

    Register file (tag, value):
      R0: tag Adder 1, value 3
      R1: 100
      R2: 9

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1: empty
      ROB2  bne  ---  ---  Yes
      ROB3  St   200  V:3  Yes
      ROB4  ALU  R0   V:6  Yes

    Examples: Hardware Speculation

    Time: T7
    Events: i2: commit (misprediction); i3 (branch delay slot): wait for commit; i4: squashed

    Reservation stations (Op1, Op2, ROB):
      Adder 1, Adder 2, Mult 1, Mult 2, Branch: empty

    Register file (tag, value):
      R0: 3
      R1: 100
      R2: 9

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1, ROB2: empty
      ROB3  St   200  V:3  Yes
      ROB4: empty

    Examples: Hardware Speculation

    Time: T8
    Events: i3 (branch delay slot): commit; store 3 to memory location 200

    Reservation stations (Op1, Op2, ROB):
      Adder 1, Adder 2, Mult 1, Mult 2, Branch: empty

    Register file (tag, value):
      R0: 3
      R1: 100
      R2: 9

    Reorder buffer (Op, Dst, Val., Ready):
      ROB1, ROB2, ROB3, ROB4: empty