A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
Ayose Falcón, Alex Ramirez, Mateo Valero
HPCA-10
February 18, 2004
Simultaneous Multithreading
- SMT [Tullsen95] / Multistreaming [Yamamoto95]
- Instructions from different threads coexist in each processor stage
- Resources are shared among the different threads. But sharing implies competition: in caches, queues, FUs, …
- The fetch policy decides!
Motivation
- SMT performance is limited by fetch performance: a superscalar fetch is not enough to feed an aggressive SMT core. SMT fetch is a bottleneck [Tullsen96] [Burns99]
- Straightforward solution: fetch from several threads each cycle
  a) Multiple fetch units (1 per thread): expensive!
  b) Shared fetch + fetch policy [Tullsen96]: multiple PCs, multiple branch predictions per cycle, multiple I-cache accesses per cycle
- Does the performance of this fetch organization compensate for its complexity?
Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results
- Summary & Conclusions
Fetching from a Single Thread (1.X)
[Figure: branch predictor + instruction cache feeding SHIFT & MASK logic]
- Fine-grained, non-simultaneous sharing
- Simple: similar to a superscalar fetch unit; no additional HW needed
- A fetch policy is needed: it decides fetch priority among threads. Several proposals exist in the literature
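The SHIFT & MASK step in the figure can be sketched in a few lines. This is a simplified model based only on the slide; the function name, offsets, and the one-basic-block-per-cycle cutoff are illustrative, not the authors' hardware:

```python
FETCH_WIDTH = 8  # instructions per fetch cycle (the "1.8" configuration)

def shift_and_mask(cache_line, pc_offset, taken_branch_offset=None):
    """Select the instructions to fetch from an aligned I-cache line.

    SHIFT: discard instructions before the fetch PC's offset in the line.
    MASK:  discard instructions after the first predicted-taken branch,
           since this fetch unit stops at one basic block per cycle.
    """
    block = cache_line[pc_offset:]  # SHIFT
    if taken_branch_offset is not None:
        block = block[:taken_branch_offset - pc_offset + 1]  # MASK
    return block[:FETCH_WIDTH]

# Fetching a 16-instruction line from offset 10 with no taken branch
# yields only 6 instructions: part of the 8-wide fetch BW goes unused.
line = list(range(16))
print(len(shift_and_mask(line, 10)))  # 6
```

The example also hints at the next slide's point: line boundaries and short basic blocks both leave fetch slots empty.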
Fetching from a Single Thread (1.X)
- But… a single thread is not enough to fill the fetch BW
- A gshare / hybrid branch predictor + BTB limits the fetch width to one basic block per cycle (6-8 instructions)
[Chart: fetch throughput (IPFC) per fetch policy, for 1.8 and 1.16]
- Fetch BW is heavily underused: avg 40% wasted with 1.8, avg 60% wasted with 1.16
- Fetch BW is fully used in only 31% of fetch cycles with 1.8, and 6% with 1.16
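The waste percentages are easy to sanity-check. The average useful-fetch values below are hypothetical, chosen only to reproduce the slide's numbers:

```python
def wasted_fraction(avg_fetched, fetch_width):
    """Fraction of the fetch bandwidth that carries no instruction."""
    return 1 - avg_fetched / fetch_width

# Hypothetical averages consistent with the slide: a basic-block-limited
# fetch delivering ~4.8 inst/cycle wastes 40% of an 8-wide fetch, and
# ~6.4 inst/cycle wastes 60% of a 16-wide fetch.
print(round(wasted_fraction(4.8, 8), 2))   # 0.4
print(round(wasted_fraction(6.4, 16), 2))  # 0.6
```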
Fetching from Multiple Threads (2.X)
- Increases fetch throughput: more threads means more possibilities to fill the fetch BW
- More fetch BW use than 1.X
- Fetch BW is fully used in 54% of cycles with 2.8, and 16% with 2.16
[Chart: fetch throughput (IPFC) for 2.8 and 2.16 vs. 1.8 and 1.16; improvements of 28% and 33%]
Fetching from Multiple Threads (2.X)
[Figure: branch predictor + 2-banked instruction cache (BANK 1, BANK 2), replicated SHIFT & MASK logic, and a MERGE stage]
- 2 predictions per cycle + 2 I-cache ports
- Multibanked + multiported instruction cache
- Replication of the SHIFT & MASK logic
- New HW to realign and merge cache lines
- But… what is the additional HW cost of a 2.X fetch?
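The MERGE stage's job amounts to filling the shared fetch slots from two threads' blocks. A minimal model, assuming the higher-priority thread contributes first (the data layout is invented for the example):

```python
def merge_fetch(blocks, fetch_width=8):
    """Merge per-thread fetch blocks into the shared fetch bandwidth.

    `blocks` is a priority-ordered list of (thread_id, instructions).
    The first thread contributes its whole block; later threads fill
    whatever fetch slots remain. Returns (thread_id, inst) pairs.
    """
    slots = []
    for tid, block in blocks:
        room = fetch_width - len(slots)
        slots.extend((tid, inst) for inst in block[:room])
        if len(slots) == fetch_width:
            break
    return slots

# Thread 0 fetches a 5-instruction block; thread 1 fills the 3 free slots.
merged = merge_fetch([(0, ["i"] * 5), (1, ["j"] * 7)])
print(len(merged))  # 8
```

Even in this toy form, the cost is visible: realigning and interleaving two cache lines needs extra muxing that a 1.X fetch avoids entirely.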
Our Goal
- Can we take the best of both worlds? The low complexity of a 1.X fetch architecture + the high performance of a 2.X fetch architecture
- That is… can a single thread provide sufficient instructions to fill the available fetch bandwidth?
High-Performance Fetch Engines (I)
- We look for high performance. A gshare / hybrid branch predictor + BTB gives low performance: it limits the fetch BW to one basic block per cycle (6-8 instructions)
- We look for low complexity. A trace cache, Branch Target Address Cache, Collapsing Buffer, etc. fetch multiple basic blocks per cycle (12-16 instructions), but at high complexity
High-Performance Fetch Engines (II)
Our alternatives:
- Gskew [Michaud97] + FTB [Reinman99]: FTB fetch blocks are larger than basic blocks; 5% speedup over gshare+BTB in superscalars
- Stream predictor [Ramirez02]: streams are larger than FTB fetch blocks; 11% speedup over gskew+FTB in superscalars
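The key property of streams (they run through not-taken branches and end only at a taken branch, so they are longer than basic blocks or FTB fetch blocks) can be illustrated with a toy lookup table. The table contents and 4-byte instruction size are invented for the example:

```python
# Toy stream table: start PC -> (stream length in instructions, next
# stream's start PC). A stream runs through any not-taken branches and
# ends only at a taken branch. Contents are made up for illustration.
stream_table = {
    0x1000: (14, 0x2040),
    0x2040: (9, 0x1000),
}

def predict_stream(pc, inst_size=4):
    """Return the stream's instruction addresses and the next fetch PC."""
    length, next_pc = stream_table[pc]
    return [pc + i * inst_size for i in range(length)], next_pc

addrs, next_pc = predict_stream(0x1000)
print(len(addrs), hex(next_pc))  # 14 0x2040
```

A 14-instruction stream like this one is what makes a 16-wide single-thread fetch plausible later in the talk.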
Simulation Setup
- Modified version of SMTSIM [Tullsen96]: trace-driven, allowing wrong-path execution
- Decoupled fetch (1 additional pipeline stage)
- Branch predictor sizes of approx. 45KB
- Decode & rename width limited to 8 instructions; fetch width 8/16 inst.; fetch buffer 32 inst.

Fetch policy: ICOUNT
RAS /thread: 64-entry
FTQ size /thread: 4-entry
Functional units: 6 int, 4 ld/st, 3 fp
Inst. queues: 32 int, 32 ld/st, 32 fp
ROB /thread: 256-entry
Physical registers: 384 int, 384 fp
L1 I-cache & D-cache: 32KB, 2-way, 8 banks
L2 cache: 1MB, 2-way, 8 banks, 10 cyc.
Line size: 64B (16 instructions)
TLB: 48 I + 48 D
Mem. lat.: 100 cyc.
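ICOUNT, the fetch policy used throughout, can be sketched as a simple priority sort. This is a simplified model of the policy from [Tullsen96], not SMTSIM's actual code:

```python
def icount_order(in_flight):
    """ICOUNT [Tullsen96]: give fetch priority to the thread with the
    fewest instructions in the pre-issue pipeline stages, so fast-moving
    threads get fetch slots and stalled threads cannot clog the queues.
    `in_flight` maps thread id -> in-flight instruction count."""
    return sorted(in_flight, key=in_flight.get)

# Thread 2 has the fewest in-flight instructions, so it fetches first.
print(icount_order({0: 25, 1: 12, 2: 4}))  # [2, 1, 0]
```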
Workloads
- SPECint2000, code layout optimized: Spike [Cohn97] + profile data using the train input
- Most representative 300M-instruction trace, using the ref input
- Workloads including 2, 4, 6, and 8 threads, classified according to thread characteristics:
  ILP: only ILP benchmarks
  MEM: memory-bounded benchmarks
  MIX: mix of ILP and MEM benchmarks
Talk Outline
- Motivation
- Fetch Architectures for SMT
- High-Performance Fetch Engines
- Simulation Setup
- Results: ILP workloads; MEM & MIX workloads (only for 2 & 4 threads; see paper for the rest)
- Summary & Conclusions
ILP Workloads - Fetch Throughput
- With a given fetch bandwidth, fetching from two threads always benefits fetch performance
- The critical point is 1.16: the stream predictor achieves better fetch performance than 2.8, while gshare+BTB and gskew+FTB achieve worse fetch performance than 2.8
[Chart: fetch throughput (IPFC) for 1.8, 2.8, 1.16, and 2.16 on 2_ILP and 4_ILP, with gshare+BTB, gskew+FTB, and stream fetch]
ILP Workloads – 1.X (1.8) vs 2.X (2.8)
[Chart: commit throughput (IPC) for 1.8 and 2.8 on 2_ILP and 4_ILP, with gshare+BTB, gskew+FTB, and stream fetch]
- ILP benchmarks have few memory problems and high parallelism
- The fetch unit is the real limiting factor: the higher the fetch throughput, the higher the IPC
ILP Workloads
- So… 2.X is better than 1.X in ILP workloads…
- But what about 1.2X instead of 2.X? That is, 1.16 instead of 2.8
- Maintains single-thread fetch: cache lines and buses are already 16 instructions wide, so we only have to modify the HW to select 16 instead of 8 instructions
ILP Workloads – 2.X (2.8) vs 1.2X (1.16)
- With 1.16, the stream predictor increases throughput (9% avg): streams are long enough for a 16-wide fetch, giving similar or better performance than 2.16!
- Fetching a single block per cycle is not enough: gshare+BTB suffers a 10% slowdown, gskew+FTB a 4% slowdown
[Chart: commit throughput (IPC) for 2.8, 1.16, and 2.16 on 2_ILP and 4_ILP, with gshare+BTB, gskew+FTB, and stream fetch]
MEM & MIX Workloads - Fetch Throughput
- Same trend as the ILP fetch throughput: for a given fetch BW, fetching from two threads is better; stream > gskew+FTB > gshare+BTB
[Chart: fetch throughput (IPFC) for 1.8, 2.8, 1.16, and 2.16 on 2_MIX, 2_MEM, 4_MIX, and 4_MEM]
MEM & MIX Workloads – 1.X (1.8) vs 2.X (2.8)
- With memory-bounded benchmarks… overall performance actually decreases! Memory-bounded threads monopolize resources for many cycles
- This problem was previously identified: new fetch policies flush [Tullsen01] or stall [Luo01, El-Mousry03] problematic threads
[Chart: commit throughput (IPC) for 1.8 and 2.8 on 2_MIX, 2_MEM, 4_MIX, and 4_MEM]
MEM & MIX Workloads
- Fetching from only one thread means fetching only from the first, highest-priority thread
- Allows the highest-priority thread to proceed with more resources
- Prevents low-quality (lower-priority) threads from monopolizing more and more resources (registers, IQ slots, etc.) on cache misses
- Only the highest-priority thread is fetched; when the cache miss is resolved, instructions from the second thread will be consumed, and ICOUNT will give it more priority after the miss resolution
- A powerful fetch unit can be harmful if not well used
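The flush/stall idea mentioned earlier can be sketched on top of ICOUNT. This simplified selector just skips threads with an outstanding L2 miss; it is a rough model of STALL-style policies, not the exact published mechanisms:

```python
def select_fetch_thread(threads):
    """Pick the next fetch thread: skip any thread waiting on an L2
    miss, then take the ICOUNT winner among the rest. FLUSH [Tullsen01]
    would additionally squash the missing thread's in-flight
    instructions to free its resources; this sketch models only the
    stall part."""
    ready = [t for t in threads if not t["l2_miss"]]
    if not ready:
        return None  # every thread is blocked on memory
    return min(ready, key=lambda t: t["icount"])["tid"]

# Thread 0 has fewer in-flight instructions but is blocked on an L2
# miss, so thread 1 is fetched instead.
threads = [{"tid": 0, "icount": 3, "l2_miss": True},
           {"tid": 1, "icount": 10, "l2_miss": False}]
print(select_fetch_thread(threads))  # 1
```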
MEM & MIX Workloads – 1.X (1.8) vs 1.2X (1.16)
[Chart: commit throughput (IPC) for 1.8, 1.16, and 2.16 on 2_MIX, 2_MEM, 4_MIX, and 4_MEM, with gshare+BTB, gskew+FTB, and stream fetch]
- Even 2.16 has worse commit performance than 1.8: more interference is introduced by low-quality threads
- Overall, 1.16 is the best combination: low complexity (fetching from one thread) and high performance (wide fetch)
Summary
- The fetch unit is the most significant obstacle to obtaining high SMT performance
- However, researchers usually don't care about SMT fetch performance: they care about how to combine threads to sustain the available fetch throughput, and a simple gshare/hybrid + BTB is commonly used
- Everybody assumes that 2.8 (2.X) is the correct answer
- Fetching from many threads can be counterproductive: sharing implies competing, and low-quality threads monopolize more and more resources
Conclusions
- 1.16 (1.2X) is the best fetch option, using a high-width fetch architecture: it's not the prediction accuracy, it's the fetch width
- Beneficial for both ILP and MEM workloads: 1.X is bad for ILP, 2.X is bad for MEM
- Fetches only from the most promising thread (according to the fetch policy), and as much as possible
- Offers the best performance/complexity tradeoff
- Fetching from a single thread may require revisiting current SMT fetch policies
Thanks
Questions & Answers

Backup Slides
SMT Workloads
Workload Threads
2_ILP eon, gcc
2_MEM mcf, twolf
2_MIX gzip, twolf
4_ILP eon, gcc, gzip, bzip2
4_MEM mcf, twolf, vpr, perlbmk
4_MIX gzip, twolf, bzip2, mcf
6_ILP eon, gcc, gzip, bzip2, crafty, vortex
6_MIX gzip, twolf, bzip2, mcf, vpr, eon
8_ILP eon, gcc, gzip, bzip2, crafty, vortex, gap, parser
8_MIX gzip, twolf, bzip2, mcf, vpr, eon, gap, parser
Simulation Setup

Fetch policy: ICOUNT
Gshare predictor: 64K-entry, 16-bit history
Gskew predictor: 3x32K-entry, 15-bit history
BTB/FTB: 2K-entry, 4-way assoc.
Stream predictor: 1K-entry, 4-way + 4K-entry, 4-way
RAS /thread: 64-entry
FTQ size /thread: 4-entry
Functional units: 6 int, 4 ld/st, 3 fp
Inst. queues: 32 int, 32 ld/st, 32 fp
ROB /thread: 256-entry
Physical registers: 384 int, 384 fp
L1 I-cache & D-cache: 32KB, 2-way, 8 banks
L2 cache: 1MB, 2-way, 8 banks, 10 cyc.
Line size: 64B (16 instructions)
TLB: 48 I + 48 D
Mem. lat.: 100 cyc.
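For reference, the gshare scheme in the table indexes its counter table by XORing the branch PC with the global history. A minimal sketch; the PC shift amount and example values are illustrative:

```python
HIST_BITS = 16  # matches the 16-bit gshare history above

def gshare_index(pc, global_history):
    """gshare indexes its pattern history table with PC XOR global
    history, so the same branch maps to different 2-bit counters in
    different branch-history contexts."""
    return ((pc >> 2) ^ global_history) & ((1 << HIST_BITS) - 1)

# The same branch under two different global histories uses two
# different counters.
i1 = gshare_index(0x40AB0C, 0b1010101010101010)
i2 = gshare_index(0x40AB0C, 0b0101010101010101)
print(i1 != i2)  # True
```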