Designing SMT Processors for Atomic-Block...

University of Illinois http://iacoma.cs.uiuc.edu / BulkSMT: Designing SMT Processors for Atomic-Block Execution Xuehai Qian, Benjamin Sahelices and Josep Torrellas University of Illinois To appear at HPCA 2012 Thursday, December 8, 2011

Transcript of Designing SMT Processors for Atomic-Block...

  • University of Illinois http://iacoma.cs.uiuc.edu/

    BulkSMT: Designing SMT Processors for Atomic-Block Execution

    Xuehai Qian, Benjamin Sahelices and Josep Torrellas

    University of Illinois

    To appear at HPCA 2012

    Thursday, December 8, 2011

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Motivation

    • Architectures that continuously execute Atomic Blocks

        • Performance and programmability advantages [Hammond 04], [Ahn 10]

        • All proposals use single-context cores

    • What if we used Simultaneous Multithreading (SMT) cores?

        • Enable a better utilization of the hardware

        • Fast/local communication between contexts

        • Enable higher-concurrency forms of atomic block execution

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Contributions

    • BulkSMT: first SMT design that supports atomic-block (transactional) execution

        • Enables concurrency between dependent blocks

    • Analysis of design space

        • SQUASH on conflict

        • STALL on conflict

        • ORDER on conflict

    • Design of a multicore of BulkSMTs

    • BulkSMT is cost-effective

        • Higher performance for the same core count

        • Competitive performance for 1/4 of the cores

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Outline

    • Motivation

    • Designing BulkSMT

    • Design Space

    • Hardware Mechanisms

    • Multicore of BulkSMTs

    • Evaluation


  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Designing SMT For Blocked Execution

    [Figure: contexts above a shared L1/L2 cache; labels: "spec wr", "spec rd", "No disp"]

    • Version Management: Eager

    • Conflict Detection: Eager

    • Conflict Resolution:

        • SQUASH

        • STALL

        • ORDER

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    SQUASH Design

    [Figure (animation): processor P0 with L1$ and L2$. Two contexts access x, one with "wr x" and one with "rd x"; on the conflict a block is squashed and "rd x" is issued again.]

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    STALL Design

    [Figure (animation): processor P0 with L1$ and L2$. Two contexts access x, one with "wr x" and one with "rd x"; on the conflict a block stalls, and "rd x" is issued after the commit.]

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    ORDER Design

    [Figure (animation): processor P0 with L1$ and L2$. Two contexts access x, one with "wr x" and one with "rd x"; both continue executing, and then the two blocks commit one after the other.]

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    STALL Design Issues

    • On a dependence:

        • Stall the destination block

        • HW records source and destination blocks

    • On commit/squash of source block:

        • Wake up stalled destination block

    [Figure: blocks T0 and T1 with accesses "wr x" and "rd x"]

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    STALL Design Issues

    • Transitive stall OK

        • Block allowed to stall on an already stalled block

        [Figure: blocks T0, T1 and T2 with accesses "wr y", "wr x", "rd x", "wr y"]

        T1 wakes up T0; T0 wakes up T2

        Total order of blocks: T1 ➝ T0 ➝ T2

    • Cycle not OK

        • When a stall cycle is formed, the block that closes the cycle is squashed

        [Figure: blocks T0 and T1 with accesses "wr y", "wr x", "rd x", "wr y"]

        Squash T1

        Total order of blocks: T0 ➝ T1
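
The STALL rules above (stall the destination, wake it when the source commits or is squashed, squash the block that closes a stall cycle) can be sketched in software. This is only an illustrative model, not the hardware described in the talk: the class `StallTracker`, its method names, and the assumption that a stalled block waits on exactly one source are all hypothetical.

```python
class StallTracker:
    """Toy model of STALL-on-conflict bookkeeping (hypothetical names)."""

    def __init__(self):
        # Maps a stalled destination block to the source block it waits on.
        # Assumption: a stalled block issues no accesses, so it has one source.
        self.waits_on = {}

    def on_dependence(self, src, dst):
        """dst depends on src. Return 'stall', or 'squash' if dst would close a cycle."""
        node = src
        while node in self.waits_on:      # follow the chain of stalled blocks
            node = self.waits_on[node]
            if node == dst:               # src already waits (transitively) on dst
                return "squash"           # dst closes the cycle and is squashed
        self.waits_on[dst] = src          # transitive stalls are allowed
        return "stall"

    def on_finish(self, src):
        """src committed or was squashed: wake the blocks stalled on it."""
        woken = [d for d, s in self.waits_on.items() if s == src]
        for d in woken:
            del self.waits_on[d]
        return woken


if __name__ == "__main__":
    t = StallTracker()
    print(t.on_dependence("T1", "T0"))   # T0 stalls on T1
    print(t.on_dependence("T0", "T2"))   # T2 stalls on the already stalled T0
    print(t.on_finish("T1"))             # T1 wakes up T0
    print(t.on_finish("T0"))             # T0 wakes up T2
```

In the two-block cycle example, `on_dependence("T1", "T0")` followed by `on_dependence("T0", "T1")` reports a squash for T1, which matches the "Squash T1" outcome on the slide.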

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    ORDER Design Issues (I)

    • On a dependence:

        • HW records source and destination blocks; both blocks proceed

        • HW will enforce the same order in commit

    • Transitive order is OK

        [Figure: blocks T0, T1 and T2 with accesses "wr y", "wr x", "rd x", "wr y"]

        Total order of blocks: T1 ➝ T0 ➝ T2

    • Cycle is not OK

        • On cycle, HW breaks it by squashing one or more blocks involved in the cycle

        [Figure: blocks T0 and T1 with accesses "wr y", "wr x", "rd x", "wr y"]

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    ORDER Design Issues (II)

    • On a cycle: different dependences have different squash requirements

        • RAW (T0: wr, T1: rd): Squash T0 ⇒ Squash T1; Squash T1 ⇏ Squash T0

        • WAW (T0: wr, T1: wr): Squash T0 ⇒ Squash T1; Squash T1 ⇒ Squash T0

        • WAR (accesses: rd and wr): Squash T0 ⇏ Squash T1; Squash T1 ⇏ Squash T0
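
The squash implications listed for RAW, WAW and WAR can be applied transitively to see which blocks must be squashed together with a chosen victim. The sketch below is an illustration of those implication rules only; it is not the Squash Set Generation logic of the talk, and the function `squash_set`, the table `IMPLIES`, and the `(src, dst, kind)` tuple encoding are assumptions made for the example.

```python
# For a dependence src -> dst of a given kind:
# (squashing src forces squashing dst, squashing dst forces squashing src)
IMPLIES = {
    "RAW": (True, False),
    "WAW": (True, True),
    "WAR": (False, False),
}


def squash_set(deps, victim):
    """Return every block that must be squashed if `victim` is squashed.

    deps is a list of (src, dst, kind) tuples, kind in {"RAW", "WAW", "WAR"}.
    """
    squashed = {victim}
    work = [victim]
    while work:
        block = work.pop()
        for src, dst, kind in deps:
            forward, backward = IMPLIES[kind]
            if forward and src == block and dst not in squashed:
                squashed.add(dst)
                work.append(dst)
            if backward and dst == block and src not in squashed:
                squashed.add(src)
                work.append(src)
    return squashed


if __name__ == "__main__":
    deps = [("T0", "T1", "RAW"), ("T1", "T2", "WAW"), ("T2", "T3", "WAR")]
    print(sorted(squash_set(deps, "T0")))
    print(sorted(squash_set(deps, "T3")))
```

A WAR dependence never enlarges the set, so a block that is only involved in WAR dependences can be squashed alone.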

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Outline

    • Motivation

    • Designing BulkSMT

    • Design Space

    • Hardware Mechanisms

    • Multicore of BulkSMTs

    • Evaluation


  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Hardware Mechanisms

    • Access Recording and Conflict Detection

        • Function: Record the addresses accessed by each thread and detect when two threads have a data conflict

        • Implementation: Access Bits in cache and related logic

        • Designs: All

    • Cycle Detection

        • Function: Record data conflicts and their ordering, and detect conflict cycles

        • Implementation: Dependence Table and Cycle Table

        • Designs: STALL, ORDER

    • Advanced Conflict Recording

        • Function: Represent the type of conflict between different blocks compactly

        • Implementation: Enhanced Dependence Table

        • Designs: ORDER

    • Squash Set Generation

        • Function: On a cycle of conflicts, decide the set of blocks to squash

        • Implementation: Logic that operates on the Dependence Table

        • Designs: ORDER

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Access Recording

    Per cache line:

    • LW: Last Writer context-ID

    • R[i]: Read bit-mask

    • Sp: Speculative bit

    On an access:

    • Read from context k: set R[k]

    • Write from context k: Sp ← 1, LW ← k

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Conflict Detection

    • Read After Write (RAW)

        • Load from context k reads a cache line with (Sp == 1) && (LW != k)

    • Write After Read (WAR)

        • ...

    • Write After Write (WAW)

        • ...
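
A minimal sketch of the per-line access bits and the RAW check may help make the condition concrete. It models only what the slides state: the LW, R[i] and Sp fields, the updates on a read and on a write, and the RAW test (Sp == 1) && (LW != k). The WAR and WAW conditions are elided in the slides, so they are left out here. The names `LineBits`, `load`, `store` and the constant `NUM_CONTEXTS = 4` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_CONTEXTS = 4  # assumed number of SMT contexts


@dataclass
class LineBits:
    """Access bits kept with one cache line (field names follow the slides)."""
    lw: Optional[int] = None                      # LW: last writer context-ID
    r: List[bool] = field(
        default_factory=lambda: [False] * NUM_CONTEXTS)  # R[i]: read bit-mask
    sp: bool = False                              # Sp: speculative bit


def load(line: LineBits, k: int) -> bool:
    """Record a read by context k; return True on a RAW conflict."""
    raw = line.sp and line.lw != k   # (Sp == 1) && (LW != k)
    line.r[k] = True                 # set R[k]
    return raw


def store(line: LineBits, k: int) -> None:
    """Record a write by context k (WAR/WAW checks are not modeled)."""
    line.sp = True                   # Sp <- 1
    line.lw = k                      # LW <- k


if __name__ == "__main__":
    line = LineBits()
    store(line, 1)
    print(load(line, 1))   # same context reads its own speculative write
    print(load(line, 2))   # another context reads it
```

A load by the last writer itself is not flagged, because the line's speculative data belongs to the same context.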

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Commit and Squash

    • On commit of context k:

        • Flash clear the R[k] bit of all the cache lines

        • Flash conditional-clear the Sp bit of any line whose (LW == k)

    • On squash of context k:

        • ...
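
The commit step can be illustrated with a small software model. In hardware these are flash operations over all lines at once; the loop below only mimics their effect. The squash case is elided in the slides and is therefore not modeled. The dictionary layout with keys `"LW"`, `"R"` and `"Sp"` and the function name `commit` are assumptions for this sketch.

```python
def commit(lines, k):
    """Model the effect of committing context k on every cache line."""
    for line in lines:
        line["R"][k] = False        # flash clear of the R[k] bit
        if line["LW"] == k:         # conditional clear: only lines last written by k
            line["Sp"] = False


if __name__ == "__main__":
    lines = [
        {"LW": 1, "R": [False, True, True, False], "Sp": True},
        {"LW": 2, "R": [False, True, False, False], "Sp": True},
    ]
    commit(lines, 1)
    for line in lines:
        print(line)
```

After the call, the line last written by context 1 is no longer speculative, while the line last written by context 2 keeps its Sp bit.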

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Cycle Detection: HW Structures

    • Dependence Table (DT)

        • Records ordered dependences Ti ➙ Tj

        • n × n table of 1-bit entries (n = 4), indexed by i and j

    • Cycle Table (CT)

        • In background: finds cycles of dependences

        • Cycle = bit in diagonal

        • n × n table of 1-bit entries, indexed by i and j

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Cycle Detection

    [Figure (animation): blocks T0, T1 and T2 with dependences d1 and d2. Each dependence sets a bit in the DT and in the CT; propagating d2 through the CT sets a further bit, labeled dA.]

    Propagation of d2 dependence:

    • Col_dst |= Col_src (here: Col2 |= Col1)

    • Row_src |= Row_dst (here: Row1 |= Row2)

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Cycle Detection

    [Figure (animation): blocks T0 and T1 with dependences d1 and d2; successive CT snapshots as bits are set, ending with an entry labeled dA.]

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Issues Related to CT and DT

    • CT is not a time-critical structure: it is updated in the background

        • OK to find a cycle some time later (DT is always up-to-date)

    • Update of CT is recursive

        • Terminates when any bit in the diagonal is set or the CT no longer changes

        • Always detects precise cycles, with no false positives
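
The CT update can be sketched as a fixed-point computation over the dependences recorded in the DT, using the two propagation rules from the Cycle Detection slides (Col_dst |= Col_src and Row_src |= Row_dst) and the termination condition above. This is a software illustration under assumptions, not the hardware design: the tables are modeled as lists of booleans, entry `[i][j]` is taken to mean Ti ➙ Tj, and the function names are hypothetical.

```python
def new_table(n):
    """n x n table of 1-bit entries, all cleared."""
    return [[False] * n for _ in range(n)]


def has_cycle(ct):
    """A cycle shows up as a set bit on the diagonal of the CT."""
    return any(ct[i][i] for i in range(len(ct)))


def propagate(ct, src, dst):
    """Propagate one dependence src -> dst; return True if the CT changed."""
    n = len(ct)
    changed = False
    if not ct[src][dst]:
        ct[src][dst] = True
        changed = True
    for i in range(n):                      # Col_dst |= Col_src
        if ct[i][src] and not ct[i][dst]:
            ct[i][dst] = True
            changed = True
    for j in range(n):                      # Row_src |= Row_dst
        if ct[dst][j] and not ct[src][j]:
            ct[src][j] = True
            changed = True
    return changed


def update_ct(dt, ct):
    """Repeat propagation until a diagonal bit is set or nothing changes."""
    n = len(dt)
    changed = True
    while changed and not has_cycle(ct):
        changed = False
        for src in range(n):
            for dst in range(n):
                if dt[src][dst] and propagate(ct, src, dst):
                    changed = True
    return has_cycle(ct)


if __name__ == "__main__":
    dt, ct = new_table(4), new_table(4)
    dt[0][1] = True            # d1: T0 -> T1
    dt[1][2] = True            # d2: T1 -> T2
    print(update_ct(dt, ct))   # no cycle, but T0 -> T2 is now set in the CT
    dt[2][0] = True            # a dependence that closes the cycle
    print(update_ct(dt, ct))
```

Every bit set by `propagate` corresponds to a real chain of recorded dependences, which is why a diagonal bit indicates a true cycle rather than a false positive.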

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Advanced Conflict Recording

    [Figure: Enhanced Dependence Table. Both axes: context number. Legend: RAW, WAR, WAW.]
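    One way to picture a table that records the type of conflict between each pair of contexts is a matrix of small bit fields. The sketch below is an assumption, not the encoding used in BulkSMT: the `Conflict` flags, the three-bits-per-entry layout, and the `EnhancedDependenceTable` class are hypothetical names chosen for illustration.

```python
from enum import IntFlag


class Conflict(IntFlag):
    NONE = 0
    RAW = 1  # successor read data written by the predecessor
    WAR = 2  # successor wrote data read by the predecessor
    WAW = 4  # successor wrote data written by the predecessor


class EnhancedDependenceTable:
    """Hypothetical model: entries[pred][succ] holds the conflict types
    observed between a predecessor context and a successor context."""

    def __init__(self, num_contexts):
        self.entries = [[Conflict.NONE] * num_contexts
                        for _ in range(num_contexts)]

    def record(self, pred, succ, kind):
        # Several conflict types can accumulate on the same pair.
        self.entries[pred][succ] |= kind

    def has(self, pred, succ, kind):
        return bool(self.entries[pred][succ] & kind)


edt = EnhancedDependenceTable(4)
edt.record(0, 1, Conflict.RAW)
edt.record(0, 1, Conflict.WAW)
```

    After these two calls, `edt.has(0, 1, Conflict.RAW)` is true and `edt.has(0, 1, Conflict.WAR)` is false.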

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Set of Blocks to Squash on a Cycle

    [Figure: threads T0, T1, and T2 with wr and rd accesses; the dependence edges between them are labeled RAW, RAW, and WAR and form a cycle.]

    • Squash the thread that closes the cycle (T0)

    • Forward propagate from T0

    • T0 ➙ T1: squash T1

    • T1 ➙ T2: WAR, no propagate

    • Backward propagate from T0

    • T2 ➙ T0: RAW, no propagate
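    The propagation rules above can be sketched as a small graph walk. This is an illustrative sketch under stated assumptions: the `squash_set` function and the edge dictionary are hypothetical, and because the slide only shows RAW propagating and WAR not propagating, leaving WAW out of the propagating set is an assumption.

```python
from collections import deque

# Conflict types that force the successor to be squashed as well.
# Assumption: only RAW propagates; the slide does not cover WAW.
PROPAGATING = {"RAW"}


def squash_set(edges, closer):
    """edges maps (pred, succ) -> conflict type, with succ ordered after pred.
    closer is the thread that closed the cycle."""
    squashed = {closer}
    work = deque([closer])
    while work:
        thread = work.popleft()
        for (pred, succ), kind in edges.items():
            # Forward propagation only: follow edges that leave a squashed
            # thread. Edges that enter it (backward) are never followed.
            if pred == thread and kind in PROPAGATING and succ not in squashed:
                squashed.add(succ)
                work.append(succ)
    return squashed


# The cycle from the slide: T0 -> T1 (RAW), T1 -> T2 (WAR), T2 -> T0 (RAW).
edges = {("T0", "T1"): "RAW", ("T1", "T2"): "WAR", ("T2", "T0"): "RAW"}
result = squash_set(edges, "T0")
```

    For the slide's example, `result` contains T0 and T1 while T2 survives, matching the walkthrough.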

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Outline

    • Motivation

    • Designing BulkSMT

    • Design Space

    • Hardware Mechanisms

    • Multicore of BulkSMTs

    • Evaluation

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Multicore of BulkSMTs

    • Eager Scheme (E)

    • Block wants to commit

    • Commit globally first, then commit locally

    • Reception of cache invalidation

    • Check the cache access bits; may squash multiple blocks

    • Lazy Scheme (L)

    • ......
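    For the Eager scheme, the reaction to an incoming cache invalidation can be sketched as follows. This is an assumption-laden illustration: the per-context read and write bits on each line, the `CacheLine` class, and the `contexts_to_squash` function are hypothetical. The Lazy scheme is not sketched because the slide elides its details.

```python
class CacheLine:
    """Hypothetical cache line with one read bit and one write bit
    per SMT context."""

    def __init__(self, num_contexts):
        self.read_bits = [False] * num_contexts
        self.write_bits = [False] * num_contexts


def contexts_to_squash(cache, line_addr):
    """Return the contexts whose running block touched the invalidated line.
    More than one context may be returned, so several blocks may squash."""
    line = cache.get(line_addr)
    if line is None:
        return []  # line not present: no local block accessed it
    return [ctx for ctx in range(len(line.read_bits))
            if line.read_bits[ctx] or line.write_bits[ctx]]


line = CacheLine(4)
line.read_bits[0] = True   # context 0 read the line
line.write_bits[2] = True  # context 2 wrote the line
cache = {0x40: line}
victims = contexts_to_squash(cache, 0x40)
```

    In this example an invalidation for address 0x40 selects contexts 0 and 2, while an invalidation for an address that is not cached selects none.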

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Evaluation

    • Cycle-accurate execution-driven simulator based on SESC

    • 6 SPLASH-2 and 2 PARSEC applications

    • Applications run with 1, 4, or 16 threads

    • Compare performance:

    • Using same number of cores, different number of threads

    • Using same number of threads, different number of cores

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Execution Time (Same HW, 1 Core)

    [Figure: execution time (y-axis 0 to 2.0) for Barnes, Cholesky, Fluidanimate, Ocean, Radiosity, Radix, Raytrace, Streamcluster, and GeoMean. Bars per application: BK, SQ, ST, OR. Bar breakdown: Useful, ProcPipe, ProcMem, BlockStall, Squashed.]

    • BulkSMT more cost effective: faster for the same HW

    • BulkSMT ORDER is the best

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Execution Time (Same HW, 4 Cores)

    [Figure: execution time (y-axis 0 to 3.0) for Barnes, Cholesky, Fluidanimate, Ocean, Radiosity, Radix, Raytrace, Streamcluster, and GeoMean. Bars per application: BK-{LL,EE} and {SQ,ST,OR}-{LL,EE}. Bar breakdown: Useful, ProcPipe, ProcMem, BlockStall, Squashed.]

    • BulkSMT ORDER is best

    • BulkSMT limited by application scalability

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Execution Time (Same Threads, 16)

    [Figure: execution time (y-axis 0 to 3.0) for Barnes, Cholesky, Fluidanimate, Ocean, Radiosity, Radix, Raytrace, Streamcluster, and GeoMean. Bars per application: BK-{LL,EE} and {SQ,ST,OR}-{LL,EE}. Bar breakdown: Useful, ProcPipe, ProcMem, BlockStall, Squashed.]

    • BulkSMT multicore: uses 1/4 of the hardware and achieves better performance

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Also in the paper

    • Implementation issues

    • Handling high-contention synchronization

    • Other characteristics of execution behavior

    • Related work

  • Xuehai Qian BulkSMT: Designing SMT Processors for Atomic Block Execution

    Conclusion

    • Proposed BulkSMT: first SMT design that supports atomic-block (transactional) execution

    • Enables concurrency between dependent blocks

    • Proposed designs with different concurrency vs. cost trade-offs

    • SQUASH on conflict

    • STALL on conflict

    • ORDER on conflict

    • Designed a multicore of BulkSMTs

    • BulkSMT is cost-effective

    • Higher performance for the same core count

    • Competitive performance for 1/4 of the cores
