Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

David Tarjan*, Michael Boyer, and Kevin Skadron*

University of Virginia

Department of Computer Science

* Currently on internship/sabbatical at NVIDIA Research

Motivation

[Figure: three chip organizations, labeled Homogeneous, Heterogeneous, and Adaptive (Federation), built from multithreaded scalar IO cores and 2-way OO cores, with L2 caches.]

Basic Insights

A multithreaded in-order core has many registers, which can be reused for a reorder buffer or active list

If cores are small, single-cycle communication between neighbors is feasible

Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible

In-order & Out-of-order Pipelines

[Figure: the in-order pipeline (Fetch, Decode, Execute, Mem, Writeback) next to the out-of-order pipeline, which has the same stages plus Bpred, Allocate, Rename, Issue, and Commit.]

Issue Queue Example

[Figure: issue queue entries 1 to 5, each with Ready Bits, Subscriber Slot 1, and Subscriber Slot 2; example slot contents include IQ2 and IQ3.]

Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002

Sassone et al., Matrix Scheduler Reloaded, ISCA 2007

Simplified Load-Store Queue

Memory Alias Table (MAT)

No store forwarding

No conservative waiting on stores

Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits

Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005

MAT Example

st 0x13, r5

ld r1, 0x13

MAT entries 0-7: 0 0 0 0 0 0 0 0

MAT Example

st 0x13, r5

ld r1, 0x13 (EXE)

MAT entries 0-7: 0 0 0 1 0 0 0 0

ld executes and increments counter

MAT Example

st 0x13, r5 (COM)

ld r1, 0x13

MAT entries 0-7: 0 0 0 1! 0 0 0 0

st commits and sets flag

MAT Example

ld r1, 0x13 (COM)

MAT entries 0-7: 0 0 0 1! 0 0 0 0

Flush

ld commits, sees flag, and flushes pipeline

MAT Example

ld r1, 0x13

MAT entries 0-7: 0 0 0 0 0 0 0 0

MAT is reset and execution resumes
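The five MAT slides above can be traced with a small Python sketch. It is only an illustration under stated assumptions, not the authors' hardware design: the class name `MemoryAliasTable` and its methods are hypothetical, the table is indexed by low-order address bits (which matches 0x13 landing in entry 3 of an 8-entry table), and a load that commits without seeing a flag is assumed to decrement its counter.

```python
class MemoryAliasTable:
    """Toy model of the MAT: one counter and one violation flag per entry."""

    def __init__(self, size=8):
        self.size = size
        self.reset()

    def reset(self):
        # Clear every counter and flag, as on the last slide of the example.
        self.counters = [0] * self.size
        self.flags = [False] * self.size

    def index(self, addr):
        # Assumption: entries are selected by the low-order address bits.
        return addr % self.size

    def load_execute(self, addr):
        # A load that executes increments the counter of its entry.
        self.counters[self.index(addr)] += 1

    def store_commit(self, addr):
        # A committing store sets the flag if a load to the same entry
        # has already executed but not yet committed.
        i = self.index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        """Return True if the pipeline must be flushed."""
        i = self.index(addr)
        if self.flags[i]:
            self.reset()  # flush: the MAT is reset and execution resumes
            return True
        # Assumption: a load that commits cleanly releases its counter.
        self.counters[i] -= 1
        return False
```

Replaying the slides means calling `load_execute(0x13)`, then `store_commit(0x13)`, then `load_commit(0x13)`. Because different addresses can share an entry, this kind of table can also signal a violation where none occurred, which is the false-positive case.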

Performance Impact

[Figure: bar chart of average IPC loss (axis 0% to 6%). Bars: consumer-based issue queue, pseudo-random scheduling, MAT, commit-time branch recovery. Data labels: 0.00%, 2.67%, 1.71%, 5.46%.]

Performance

[Figure: average IPC (axis 0 to 1.4) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO, grouped by spec, specint, and specfp.]

Energy Efficiency

[Figure: normalized BIPS^3/Watt (axis 0 to 2.5) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO, grouped by spec, specint, and specfp.]

Area Efficiency

[Figure: normalized BIPS^3/(Watt*mm^2) (axis 0 to 1.2) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO, grouped by spec, specint, and specfp.]

Conclusions

Two in-order cores can be federated at run-time to form a 2-way OO core

Almost doubling the IPC of a throughput core is possible with very little extra hardware

Don’t want traditional OO structures because their performance comes at too high a price

Best combined area- and energy-efficiency

Q & A

Backup

Core Fusion Data

Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007

Overall Results

Scalar in-order core is 8KB I/D, 256KB L2

Base 2-way core has 16KB I- and D-caches, 256KB L2, 32-entry ROB, 16-entry issue queue, 16-entry LSQ, bimodal bpred

4-way core is 32KB I/D, 2MB L2, 128-entry ROB, 32-entry IQ and LSQ, tournament bpred

Branch Prediction

Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS)

NLS is OK if the instruction working set is no larger than the I$

A small bimodal predictor is OK for a small-window processor
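For readers unfamiliar with the term, a bimodal predictor is commonly built as a table of 2-bit saturating counters indexed by the branch address. The sketch below shows that textbook form only. The table size, the initial counter value, and the assumption of 4-byte instructions in `_index` are illustrative choices that do not come from the slides.

```python
class BimodalPredictor:
    """Textbook bimodal predictor: a table of 2-bit saturating counters."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        # Assumption: start every counter at 1 (weakly not taken).
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the two low bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter with the resolved branch outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The predictor keeps no global history, so its state is one small table, which fits the low-cost goal stated on this slide.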

Fetch

The two I$’s act as one I$ of twice the size and associativity (with random replacement)

More logic and buffers to capture two instructions

Extra cycle to route instructions from the two I$’s to the two decoders

Decode

Cancel the second instruction if the first turns out to be a branch

Extra cycle to route decoded instructions to the new allocate stage

Allocate

New logic and free lists to allocate ROB and IQ entries

Rename

New table, since it has too many ports

One centralized rename table, not distributed

Has a separate table (or a field in each RAT entry) for the IQ-slot number of each register’s producer instruction (see our new issue queue)

Issue

Uses a simple lookup table as the wakeup structure, where instructions subscribe to their input instructions (explained in detail later)

Centralized: one IQ for the two cores
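The subscription idea can be sketched in Python as follows. This is an illustrative model, not the authors' circuit. The class `SubscriptionIssueQueue` and its methods are hypothetical names. Two subscriber slots per entry follows the issue queue example slide. Refusing allocation when a producer has no free subscriber slot is an assumption made here, because the slides do not say how that case is handled.

```python
class SubscriptionIssueQueue:
    """Toy wakeup table: consumers subscribe to the IQ slots of their producers."""

    SUBSCRIBER_SLOTS = 2  # per entry, as drawn on the example slide

    def __init__(self, num_entries):
        # One ready bit per source operand (two sources assumed).
        self.ready = [[True, True] for _ in range(num_entries)]
        # For each entry, the (consumer slot, operand index) pairs to wake.
        self.subscribers = [[] for _ in range(num_entries)]

    def allocate(self, slot, producers):
        """Place an instruction in `slot`.

        producers[i] is the IQ slot that produces operand i, or None if the
        operand is already available. Returns False when a producer has no
        free subscriber slot (assumption: the caller then stalls).
        """
        needed = {}
        for p in producers:
            if p is not None:
                needed[p] = needed.get(p, 0) + 1
        for p, n in needed.items():
            if len(self.subscribers[p]) + n > self.SUBSCRIBER_SLOTS:
                return False
        self.subscribers[slot] = []
        for operand, p in enumerate(producers):
            self.ready[slot][operand] = p is None
            if p is not None:
                self.subscribers[p].append((slot, operand))
        return True

    def is_ready(self, slot):
        return all(self.ready[slot])

    def issue(self, slot):
        """Wake the subscribers of `slot`; return consumers that became ready."""
        woken = []
        for consumer, operand in self.subscribers[slot]:
            self.ready[consumer][operand] = True
            if self.is_ready(consumer):
                woken.append(consumer)
        self.subscribers[slot] = []
        return woken
```

Wakeup here is a direct table lookup on the issuing entry rather than a broadcast compared against every entry, which is the property the slide describes as a simple lookup table.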

Register File

Register file is mirrored in the two cores

No extra copy instructions or load-balancing questions

Execute

Add an extra cycle for copying the result to the other core’s register file (like EV6)

Memory Access

The two D$s are checked in parallel, each responsible for half of the merged D$’s ways

No standard LSQ, only a Memory Alias Table (details later)

Only detects ordering violations and sends a signal to the pipeline
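A rough Python model of the merged D$ lookup is given below, as a sketch under assumptions rather than the actual design. The names `SetAssocCache` and `FederatedCache`, the 64-byte line size, and the fill policy are all illustrative. In particular, choosing at random which core's cache receives a missing line is an assumption, borrowed from the random replacement mentioned for the merged I$.

```python
import random


class SetAssocCache:
    """One core's cache: a list of tags per set."""

    def __init__(self, num_sets, ways, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sets = [[] for _ in range(num_sets)]

    def _locate(self, addr):
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets

    def probe(self, addr):
        """Return True if the line holding addr is present."""
        s, tag = self._locate(addr)
        return tag in self.sets[s]

    def fill(self, addr, rng):
        """Insert the line, evicting a random way if the set is full."""
        s, tag = self._locate(addr)
        if len(self.sets[s]) >= self.ways:
            self.sets[s].pop(rng.randrange(self.ways))
        self.sets[s].append(tag)


class FederatedCache:
    """Two per-core caches that together supply the ways of one merged cache."""

    def __init__(self, num_sets, ways_per_core, seed=0):
        self.rng = random.Random(seed)
        self.halves = [SetAssocCache(num_sets, ways_per_core) for _ in range(2)]

    def access(self, addr):
        """Return True on a hit. Both halves are probed (in parallel in hardware)."""
        if any(half.probe(addr) for half in self.halves):
            return True
        # Assumption: on a miss, one of the two cores is picked at random.
        self.rng.choice(self.halves).fill(addr, self.rng)
        return False
```

With `ways_per_core` ways in each half, a set of the merged cache holds up to twice that many lines, and any given line lives in exactly one core's cache.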

Commit

Centralized commit, no slippage

Recover from branch mispredictions, since there are no checkpoints of the RAT on branches

Recover from memory order violations (or false positives) from the MAT