Superscalar Processors
Transcript of "Procesadores Superescalares" (Superscalar Processors)
Prof. Mateo Valero
Las Palmas de Gran Canaria
November 26, 1999
M. Valero 2
Initial developments
• Mechanical machines
• 1854: Boolean algebra by G. Boole
• 1904: Diode vacuum tube by J.A. Fleming
• 1945: Stored-program concept by J. von Neumann
• 1946: ENIAC by J.P. Eckert and J. Mauchly
• 1949: EDSAC by M. Wilkes
• 1952: UNIVAC I and IBM 701
M. Valero 3
ENIAC (1946)
M. Valero 4
EDSAC (1949)
M. Valero 5
Pipeline
M. Valero 6
Superscalar Processor

Fetch → Decode → Rename → Instruction Window (Wakeup + Select) → Register File → Bypass → Data Cache

• Fetch of multiple instructions every cycle.
• Rename of registers to eliminate false (name) dependencies.
• Instructions wait for source operands and for functional units.
• Out-of-order execution, but in-order graduation.
Scalable Pipes
M. Valero 7
Technology Trends and Impact
[Chart: delay in psec vs. process generation (0.80, 0.35, 0.18 microns), for issue width 4 and 8, ROB size 32 and 64]

S. Palacharla et al., "Complexity-Effective...". ISCA 1997, Denver.
M. Valero 8
Physical Scalability
[Chart: die reachable (percent) vs. processor generation (0.25, 0.18, 0.13, 0.10, 0.08, 0.06 microns), for 1, 2, 4, 8, and 16 clocks]

Doug Matzke, "Will Physical Scalability...". IEEE Computer, Sept. 1997, pp. 37-39.
M. Valero 9
Register influence on ILP
• Spec95
[Chart: IPC vs. register file size (48 to 256 registers), for integer and floating point]

Configuration: 8-way fetch/issue, window of 256 entries, up to 1 taken branch per cycle, gshare with 64K entries, one-cycle register file latency.
M. Valero 10
Register File Latency
– 66% and 20% performance improvement when moving from 2-cycle to 1-cycle latency

[Chart: IPC for SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) with a 1-cycle vs. a 2-cycle register file]

[Chart: IPC for SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) with a 1-cycle vs. a 2-cycle register file]
M. Valero 11
Outline
• Virtual-physical registers
• A register file cache
• VLIW architectures
M. Valero 12
Virtual-Physical Registers
• Motivation
– Conventional renaming scheme
– Virtual-Physical Registers
[Diagram: instruction timeline (Icache, Decode & Rename, Commit) showing the interval where the physical register is allocated but unused vs. actually used, under the conventional and virtual-physical schemes]
M. Valero 13
Example

  load f2, 0(r4)          load p1, 0(r4)
  fdiv f2, f2, f10   →    fdiv p2, p1, p10
  fmul f2, f2, f12        fmul p3, p2, p12
  fadd f2, f2, 1          fadd p4, p3, 1
                (rename)

Latencies: cache miss 20, fdiv 20, fmul 10, fadd 5 cycles.

– Register pressure: average registers allocated per cycle
  • Conventional: 3.6
  • Virtual-Physical: 0.7

[Chart: allocation intervals of p1-p4 over cycles 0-55 under each scheme]
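The conventional renaming of the example above can be sketched in a few lines. This is a hypothetical illustration, not the simulator's code: each destination logical register receives a fresh physical register at decode, which is exactly why the register stays allocated (and unused) long before the value arrives.

```python
# Hypothetical sketch of conventional register renaming: every destination
# gets a fresh physical register at decode, as in the slide's example
# (load/fdiv/fmul/fadd all writing f2).

def rename(instructions, free_list):
    """Map each destination logical register to a fresh physical register."""
    mapping = {}          # logical -> current physical register
    renamed = []
    for op, dst, srcs in instructions:
        # Sources read the mapping as it stands *before* the new allocation.
        phys_srcs = [mapping.get(s, s) for s in srcs]
        preg = free_list.pop(0)     # allocated at decode (conventional scheme)
        mapping[dst] = preg
        renamed.append((op, preg, phys_srcs))
    return renamed

code = [("load", "f2", ["r4"]),
        ("fdiv", "f2", ["f2", "f10"]),
        ("fmul", "f2", ["f2", "f12"]),
        ("fadd", "f2", ["f2"])]
print(rename(code, ["p1", "p2", "p3", "p4"]))
```

Running this reproduces the p1-p4 chain of the slide: each write to f2 consumes a new physical register even though only one value is live at a time.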
M. Valero 14
Percentage of Used/Wasted Registers
[Charts: percentage of used vs. wasted registers, SpecInt95 and SpecFP95]
M. Valero 15
Virtual-Physical Registers

• Physical registers play two different roles
  – Keep track of dependences (decode)
  – Provide a storage location for results (write-back)
• Proposal: three types of registers
  – Logical: architected registers
  – Virtual-Physical (VP): keep track of dependences
  – Physical: store values
• Approach
  – Decode: rename from logical to VP
  – Write-back (or issue): rename from VP to physical
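The two-step renaming above can be sketched as follows. This is a minimal illustration under assumed simplifications (VP tags drawn from an unbounded counter, a plain free list for physical registers); the class and method names are invented for the sketch:

```python
# Hypothetical sketch of virtual-physical renaming: at decode an instruction
# only gets a VP tag (which tracks dependences); the real physical register
# is allocated later, at write-back, so storage is not held while the
# instruction waits in the window.

import itertools

class VPRenamer:
    def __init__(self, n_phys):
        self.vp_counter = itertools.count()   # unbounded dependence tags
        self.phys_free = list(range(n_phys))  # scarce storage locations
        self.vp_to_phys = {}

    def decode(self):
        """Rename a destination: allocate only a VP tag."""
        return next(self.vp_counter)

    def writeback(self, vp):
        """Allocate a physical register just before the value is produced."""
        if not self.phys_free:
            return None          # would trigger a stall or steal (NRR / DSY)
        preg = self.phys_free.pop(0)
        self.vp_to_phys[vp] = preg
        return preg

r = VPRenamer(n_phys=2)
v0, v1, v2 = r.decode(), r.decode(), r.decode()
print(r.writeback(v0), r.writeback(v1), r.writeback(v2))
```

Three instructions can be decoded with only two physical registers; the shortage only bites when the third value is actually produced, which is the source of the pressure reduction measured on the previous slide.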
M. Valero 16
Virtual-Physical Registers
• Hardware support
[Diagram: pipeline (Fetch, Decode, Issue, Execute, Write-back, Commit) with a General Map Table (Lreg → VPreg) used at decode, a Physical Map Table (VPreg → Preg) used at write-back, the instruction queue holding VP source tags, and the ROB tracking (VPreg, Preg, Lreg) per instruction]
M. Valero 17
Virtual-Physical Registers
• No free physical register
  – Re-execute, but... if it is the oldest instruction...
  – Avoiding deadlock
    • A number (NRR) of registers are reserved for the oldest instructions
• 21% speedup for Spec95 on an 8-way issue machine [HPCA-4]
• Conclusions
  – The optimal NRR is different for each program
  – For a given program, the best NRR may differ across sections of code
M. Valero 18
Virtual-Physical Registers

– Performance evaluation
  • SimpleScalar OoO with modified renaming
  • 8-way issue
  • RUU: 128 entries
  • Functional units (latency):
    » 8 simple int. (1)
    » 4 int. mult. (7)
    » 6 simple FP (4)
    » 4 FP mult. (4)
    » 4 FP div. (16)
    » 4 memory ports
  • L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle
  • L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle
  • L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles
  • Main memory: 50 cycles
  • Branch prediction: 18-bit gshare, 2 taken branches
  • Benchmarks: SPEC95, Compaq/DEC compilers -O5
M. Valero 19
Virtual-Physical Registers
– Performance evaluation
[Chart: speedup (%) with 64 registers, per benchmark; individual speedups of 5, 10, 13, 6, 29, 22, 42, and 20%]
M. Valero 20
IPC and NRR
[Chart: IPC vs. NRR (1, 4, 8, 16, 24, 36) for li and applu]
M. Valero 21
Virtual-Physical Registers

• What is the optimal allocation policy?
  – Approximation
    • Registers should be allocated to the instructions that can use them earliest (avoid unused registers)
    • If some instruction must stall because of the lack of registers, choose the latest instructions (delaying the earliest would also delay the commit of the latest)
  – Implementation
    • Each instruction allocates a physical register at write-back. If none is available, it steals the register from the latest instruction after the current one
M. Valero 22
DSY Performance
[Charts: IPC for conventional, vp-original, and vp-dsy on SpecInt95 (compress, gcc, go, li, perl, Hmean) and SpecFP95 (mgrid, tomcatv, applu, swim, hydro2d, Hmean)]
M. Valero 23
Performance and Number of Registers
[Charts: IPC vs. number of registers (48, 64, 80, 96, 128, 160) for conventional, vp-original, and vp-dsy; SpecInt95 and SpecFP95]
M. Valero 24
Outline
• Virtual-physical registers
• A register file cache
• VLIW architectures
M. Valero 25
Register Requirements
[Charts: for SpecInt95 and SpecFP95, percentage (0-100) over 0-32 registers holding "value & instruction" vs. "value & ready instruction"]
M. Valero 26
Register File Latency
– 66% and 20% performance improvement when moving from 2-cycle to 1-cycle latency

[Chart: IPC for SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) with a 1-cycle vs. a 2-cycle register file]

[Chart: IPC for SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) with a 1-cycle vs. a 2-cycle register file]
M. Valero 27
Register File Bypass
[Chart: IPC for SpecInt95 under three configurations: 1-cycle/1 bypass level, 2-cycle/2 bypass levels, and 2-cycle/1 bypass level]
M. Valero 28
Register File Bypass
[Chart: IPC for SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) under three configurations: 1-cycle/1 bypass level, 2-cycle/2 bypass levels, and 2-cycle/1 bypass level]
M. Valero 29
Register File Cache
• Organization
  – Bank 1 (Register File)
    • All registers (128)
    • 2-cycle latency
  – Bank 2 (Register File Cache)
    • A subset of registers (16)
    • 1-cycle latency

[Diagram: RFC in front of the RF]
M. Valero 30
Experimental Framework
– OoO simulator
  • 8-way issue/commit
  • Functional units (latency):
    – 2 simple integer (1)
    – 3 complex integer: mult. (2), div. (14)
    – 4 simple FP (2)
    – 2 FP div. (14)
    – 3 branch (1)
    – 4 load/store
  • 128-entry ROB
  • 16-bit gshare
  • Icache and Dcache: 64 KB, 2-way set-associative, 1/8-cycle hit/miss; Dcache lockup-free, 16 outstanding misses
– Benchmarks
  • Spec95, DEC compiler -O4 (int.), -O5 (FP)
  • 100 million instructions after initialization
– Access time and area models
  • Extension to the Wilton & Jouppi models
M. Valero 31
Caching Policy (1 of 3)
• First policy
  – Many values (85% int. and 84% FP) are used at most once
  – Thus, only non-bypassed values are cached
  – FIFO replacement

[Diagram: RFC in front of the RF]
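The two-bank organization with this first policy can be sketched as follows. This is a hypothetical model (class and parameter names invented for the sketch), assuming a 16-entry FIFO-managed cache in front of the full register file:

```python
# Hypothetical sketch of the register file cache: a small 1-cycle bank (RFC)
# in front of the full 2-cycle register file (RF). Under the first policy,
# only non-bypassed values are inserted, with FIFO replacement.

class RegisterFileCache:
    def __init__(self, rfc_size=16):
        self.rf = {}            # all registers (e.g. 128), 2-cycle latency
        self.rfc = {}           # small subset (e.g. 16), 1-cycle latency
        self.rfc_size = rfc_size
        self.fifo = []          # FIFO order of cached registers

    def write(self, reg, value, cache_it):
        self.rf[reg] = value
        if cache_it:            # the caching policy decides what enters the RFC
            if len(self.fifo) >= self.rfc_size:
                victim = self.fifo.pop(0)   # FIFO replacement
                del self.rfc[victim]
            self.rfc[reg] = value
            self.fifo.append(reg)

    def read(self, reg):
        """Return (value, latency in cycles)."""
        if reg in self.rfc:
            return self.rfc[reg], 1
        return self.rf[reg], 2

rf = RegisterFileCache()
rf.write("p1", 42, cache_it=True)    # non-bypassed value: cached
rf.write("p2", 7, cache_it=False)    # bypassed value: RF only
print(rf.read("p1"), rf.read("p2"))  # p1 hits in 1 cycle, p2 takes 2
```

The later policies on the following slides keep this structure and only change the `cache_it` decision and the replacement priority.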
M. Valero 32
Performance
– 20% and 4% improvement over 2-cycle
– 29% and 13% degradation over 1-cycle

[Charts: IPC for SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) under 1-cycle, RFC.1, and 2-cycle configurations]
M. Valero 33
Caching Policy (2 of 3)

• Second policy
  – Cache values that are sources of any non-issued instruction with all its operands ready
    • Not issued because of lack of functional units
    • or the other operand is in the main register file

[Diagram: RFC in front of the RF]
M. Valero 34
Performance
– 24% and 5% improvement over 2-cycle
– 25% and 12% degradation over 1-cycle

[Charts: IPC for SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Amean, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) under 1-cycle, RFC.2, and 2-cycle configurations]
M. Valero 35
Caching Policy (3 of 3)

• Third policy
  – Cache values that are sources of any non-issued instruction with all its operands ready
  – Prefetching: a table that, for each physical register, indicates the other operand of the first instruction that uses it
  – Replacement: give priority (as victims) to those values already read at least once
M. Valero 36
Performance
– 27% and 7% improvement over 2-cycle
– 24% and 11% degradation over 1-cycle

[Charts: IPC for SpecInt95 (compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, Hmean) and SpecFP95 (applu, apsi, fpppp, hydro2d, mgrid, su2cor, swim, tomcatv, turb3d, wave5, Hmean) under 1-cycle, RFC.3, and 2-cycle configurations]
M. Valero 37
Speed for Different RFC Architectures
[Chart: IPC for configurations C1-C4 vs. the 1-cycle and 2-cycle/one-bypass baselines; non-bypass caching + prefetch-first-pair; SpecInt95, access time taken into account]
M. Valero 38
Speed for Different RFC Architectures
[Chart: IPC for configurations C1-C4 vs. the 1-cycle and 2-cycle/one-bypass baselines; non-bypass caching + prefetch-first-pair; SpecFP95]
M. Valero 39
Conclusions
– Register file access time is critical
– Virtual-physical registers significantly reduce register pressure
  • 24% improvement for SpecFP95
– A register file cache can reduce the average access time
  • 27% and 7% improvement for a two-level, locality-based partitioning architecture
High Performance Instruction Fetch Through a Software/Hardware Cooperation
Alex Ramirez
Josep Ll. Larriba-Pey
Mateo Valero
UPC, Barcelona
M. Valero 41
Superscalar Processor

Fetch → Decode → Rename → Instruction Window (Wakeup + Select) → Register File → Bypass → Data Cache

Fetch of multiple instructions every cycle.
Rename of registers to eliminate false (name) dependencies.
Instructions wait for source operands and for functional units.
Out-of-order execution, but in-order graduation.

J.E. Smith and S. Vajapeyam, "Trace Processors...". IEEE Computer, Sept. 1997, pp. 68-74.
M. Valero 42
Motivation
• The instruction fetch rate is important not only in steady state
  – Program start-up
  – Misspeculation points
  – Program segments with little ILP

[Diagram: Instruction Fetch & Decode feeding instruction queue(s) into Instruction Execution, with branch/jump outcomes fed back]
M. Valero 43
Motivation
• Instruction fetch effectively limits the performance of superscalar processors
  – Even more relevant at program start-up points
• More aggressive processors need higher fetch bandwidth
  – Fetching multiple basic blocks becomes necessary
• Current solutions need extensive additional hardware
  – Branch address cache
  – Collapsing buffer: multi-ported cache
  – Trace cache: special-purpose cache
M. Valero 44
PostgreSQL
[Chart: fetch performance for Postgres across configurations (32KB, 64KB, F4, F8, F16, PBr, Pic, Bw4, Bw8, Bw16, PF-, PF4); 64KB I1, 64KB D1, 256KB L2]
M. Valero 45
Programs Behaviour
[Chart: fetch performance for Postgres, Gcc, and Vortex across configurations (32KB, 64KB, F4, F8, F16, PBr, Pic, Bw4, Bw8, Bw16, PF-, PF4); 64KB I1, 64KB D1, 256KB L2]
M. Valero 46
The Fetch Unit (1 of 3)

[Diagram: scalar fetch unit — the fetch address indexes the instruction cache (i-cache); shift & mask; the branch prediction mechanism and next-address logic produce the next fetch address; instructions go to decode]

• Scalar fetch unit
  – Few instructions per cycle
  – 1 branch
• Limitations
  – Prediction accuracy
  – I-cache miss rate
• Previous work, code reordering (software, to reduce cache misses)
  – Fisher (IEEE Tr. on Comp. '81)
  – Hwu and Chang (ISCA'89)
  – Pettis and Hansen (SIGPLAN'90)
  – Torrellas et al. (HPCA'95)
  – Kalamatianos et al. (HPCA'98)
M. Valero 47
The Fetch Unit (2 of 3)

[Diagram: aggressive core fetch unit — the fetch address indexes the i-cache; shift & mask; branch target buffer, return stack, and multiple-branch predictor feed the next-address logic; instructions go to decode]

• Aggressive fetch unit
  – Many instructions per cycle
  – Several branches
• Limitations
  – Prediction accuracy
  – Sequentiality
  – I-cache miss rate
• Previous work, trace building (hardware, forming traces at run time)
  – Yeh et al. (ICS'93)
  – Conte et al. (ISCA'95)
  – Rotenberg et al. (MICRO'96)
  – Friendly et al. (MICRO'97)
M. Valero 48
Trace Cache
[Diagram: control flow graph of basic blocks b0-b8]

A trace is a sequence of logically contiguous instructions.

A trace cache line stores a segment of the dynamic instruction stream across multiple, potentially taken branches (e.g., b1-b2-b4, b1-b3-b7, ...).

It is indexed by fetch address and branch outcomes: a history-based fetch mechanism.
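The indexing scheme just described can be sketched as a lookup keyed on the pair (fetch address, predicted branch outcomes). This is an illustrative model, not a hardware description; names and the basic-block labels reuse the b1/b2/... example above:

```python
# Hypothetical sketch of trace-cache indexing: a line is looked up by the
# fetch address plus the predicted outcomes of the branches inside the trace,
# so the same start address can map to different traces (history-based fetch).

class TraceCache:
    def __init__(self):
        self.lines = {}   # (fetch_addr, branch_outcomes) -> instruction trace

    def fill(self, fetch_addr, outcomes, trace):
        # outcomes: tuple of taken/not-taken bits, e.g. (True, False)
        self.lines[(fetch_addr, outcomes)] = trace

    def lookup(self, fetch_addr, predicted_outcomes):
        return self.lines.get((fetch_addr, predicted_outcomes))

tc = TraceCache()
tc.fill(0x100, (True, False), ["b1", "b2", "b4"])
print(tc.lookup(0x100, (True, False)))   # hit: the b1-b2-b4 trace
print(tc.lookup(0x100, (False, True)))   # miss: fall back to the core fetch unit
```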
M. Valero 49
The Fetch Unit (3 of 3)

[Diagram: aggressive core fetch unit (i-cache, shift & mask, BTB, return stack, multiple-branch predictor, next-address logic) extended with a trace cache (t-cache) and a fill buffer fed from fetch or commit]

The trace cache aims at forming traces at run time.
M. Valero 50
Our Contribution

• Mixed software/hardware approach
  – Optimize performance at compile time
    • Use profiling information
    • Make optimum use of the available hardware
  – Avoid redundant work at run time
    • Do not repeat what was done at compile time
    • Adapt the hardware to the new software
• Software Trace Cache
  – Profile-directed code reordering & mapping
• Selective Trace Storage
  – Fill unit modification
M. Valero 51
Our Work
• Workload analysis
  – Temporal locality
  – Sequentiality
• Software Trace Cache
  – Seed selection
  – Trace building
  – Trace mapping
  – Results
• Selective Trace Storage
  – Counting blue traces
  – Implementation
  – Results

[Chart: FIPA for gcc, li, and postgres under Base, TC, STC, and STS; 32KB instruction cache, 64KB trace cache]
M. Valero 52
Workload Analysis (Reference Locality)

Code size needed to cover a given fraction of dynamic references:

Benchmark   75%    90%    99%    Code size
swim        148    232    763    110350
hydro2d     1223   1977   5371   125946
applu       2407   5060   10509  132803
m88ksim     458    1006   2863   51341
li          325    563    1365   38126
gcc         9595   22098  57878  349382
compress    243    338    525    21991
postgres    2716   5221   11748  374399

• Considerable amount of reference locality
M. Valero 53
Workload Analysis (Sequentiality)

Benchmark   Unpredictable   Predictable
swim        45.3            54.7
mgrid       19.9            81.1
apsi        22.1            77.9
m88ksim     37.3            62.7
li          49.2            50.8
gcc         60.1            39.9
ijpeg       70.2            29.8
postgres    23.8            76.2

Unpredictable: loop branches, indirect jumps, subroutine returns, unpredictable conditional branches.
Predictable: fall-through, unconditional branches, conditional branches with fixed behaviour, subroutine calls.
M. Valero 54
Software Trace Cache

• Profile-directed code reordering
  – Obtain a weighted control flow graph
  – Select seeds, or starting basic blocks
  – Build basic block traces
    • Map dynamically consecutive basic blocks to physically contiguous storage
    • Move unused basic blocks out of the execution path
  – Carefully map these traces in memory
    • Avoid conflict misses in the most popular traces
    • Minimize conflicts among the rest
• Increased role of the instruction cache
  – Able to provide longer instruction traces
M. Valero 55
STC : Seed Selection
• All procedure entry points
  – Ordered by popularity
  – Starts building traces on the most popular procedures
• Knowledge-based selection
  – Based on source code knowledge
  – Leads to longer sequences
    • Inlining of the main path of found procedures
  – Loses temporal locality
    • Less popular basic blocks surround the most popular ones
M. Valero 56
STC : Trace Building
• Greedy algorithm
  – Follow the most likely path out of a basic block
  – Add secondary seeds for all other targets
• Two threshold values
  – Execution threshold: do not include unpopular basic blocks
  – Transition threshold: do not follow unlikely transitions
• Iterate the process with less restrictive thresholds

[Diagram: weighted CFG with blocks A1-A8, B1, C1-C5, annotated with execution counts and transition probabilities; transitions below the branch threshold are not followed, blocks below the execution threshold are excluded, and valid alternative targets are marked "visit later"]
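The greedy pass above can be sketched in a few lines. This is a simplified illustration under assumed data structures (a successor map with transition probabilities and per-block execution counts); the function name and the A1/A2/... labels echo the CFG example on this slide:

```python
# Hypothetical sketch of greedy trace building: from a seed block, repeatedly
# follow the most likely successor, subject to the execution and transition
# thresholds; all other targets are queued as secondary seeds.

def build_trace(cfg, weights, seed, exec_thr, trans_thr):
    """cfg[b] -> list of (successor, transition_probability);
    weights[b] -> execution count of basic block b."""
    trace, secondary_seeds = [seed], []
    block = seed
    visited = {seed}
    while cfg.get(block):
        succs = sorted(cfg[block], key=lambda sp: sp[1], reverse=True)
        best, prob = succs[0]
        for other, _ in succs[1:]:
            secondary_seeds.append(other)   # valid targets, visit later
        if prob < trans_thr or weights[best] < exec_thr or best in visited:
            break        # unlikely transition, unpopular block, or a loop
        trace.append(best)
        visited.add(best)
        block = best
    return trace, secondary_seeds

cfg = {"A1": [("A2", 0.9), ("A3", 0.1)], "A2": [("A4", 1.0)], "A4": []}
weights = {"A1": 10, "A2": 9, "A3": 1, "A4": 9}
print(build_trace(cfg, weights, "A1", exec_thr=5, trans_thr=0.3))
```

Here the trace follows A1-A2-A4 and leaves A3 as a secondary seed; rerunning with less restrictive thresholds would pick up the blocks skipped in the first pass.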
M. Valero 57
STC : Trace Mapping
[Diagram: memory mapping — the most popular traces are placed in a region of i-cache size (the CFA), with no other code mapped to addresses that conflict with it; the least popular traces fill the rest of memory]
M. Valero 58
I-cache Miss Rate
[Diagram: core fetch unit — i-cache, xchange/shift & mask, BTB, RAS, BP, next-address logic]

[Table: i-cache miss rate (%) by code layout (Base, P&H, Torrellas, Auto, Ops) and CFA size, with 2-way and victim-cache baselines. Base miss rates: 6.5 (8KB), 2.7 (32KB), 1.4 (64KB); with the reordered layouts and a CFA, the 32KB miss rate drops to 0.2-0.4 and the 64KB miss rate to 0.02-0.14]
M. Valero 59
Fetch Bandwidth
[Diagram: core fetch unit — i-cache, xchange/shift & mask, BTB, RAS, BP, next-address logic]

[Table: fetch bandwidth by code layout (Base, P&H, Torrellas, Auto, Ops) and CFA size, with a 16KB trace cache and 16KB+ops. Ideal cache: 7.6 base, up to 12.2 with reordering plus trace cache; 32KB i-cache: 4.7 base, 8.2-9.2 with CFA layouts, up to 11.6 with the trace cache]
M. Valero 60
STC : Results
32KB instruction cache, 64KB trace cache

[Chart: FIPC for gcc, li, and postgres — Base: 2.2, 4.41, 3.13; STC: 2.65, 4.61, 5.05; TC: 2.55, 4.95, 4.54; S/HTC: 2.97, 5.11, 5.64]
M. Valero 61
STC: Conclusions
• The STC increases the role of the core fetch unit
  – Builds traces at compile time
    • Increases code sequentiality
  – Maps them carefully in memory
    • Reduces the instruction cache miss rate
• Increased core fetch unit performance
  – Trace cache-like performance with no additional hardware cost
    • Compile-time solution
  or...
  – Optimum results with a small supporting trace cache
    • Better fail-safe mechanism on a trace cache miss
M. Valero 62
Selective Trace Storage
• The STC constructed traces at compile time
  – Blue traces
    • Built at compile time
    • Traces containing only consecutive instructions
    • May be provided by the instruction cache in a single cycle
  – Red traces
    • Built at run time
    • Traces containing taken branches
    • Can be provided by the trace cache in a single cycle
• Blue traces need not be stored in the trace cache
  – Better usage of the storage space
    • Better performance at the same cost
    • Equivalent performance at lower cost
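The blue/red distinction reduces to a one-line test in the fill unit. A minimal sketch, assuming a committed trace is represented as (pc, taken) pairs (a representation invented for the illustration):

```python
# Hypothetical sketch of the fill-unit filter: a trace whose instructions are
# all sequential (no taken branch inside it) is "blue" -- the i-cache can
# already deliver it in one cycle, so the fill unit skips storing it.

def should_store(trace):
    """trace: list of (pc, taken_branch) pairs for committed instructions.
    A taken branch anywhere before the last instruction breaks sequentiality,
    making the trace "red": only those are worth a trace cache line."""
    return any(taken for _, taken in trace[:-1])

blue = [(0x10, False), (0x14, False), (0x18, False)]
red  = [(0x10, False), (0x14, True), (0x40, False)]
print(should_store(blue), should_store(red))   # False True
```

A taken branch as the *last* instruction does not disqualify a trace here, since everything up to it is still sequential; this detail is an assumption of the sketch.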
M. Valero 63
STS: Counting Blue Traces
[Chart: percentage of traces with 0, 1, 2, or 3+ breaks (taken branches), before and after reordering]

Reordering reduces the number of breaks.
High degree of redundancy, even in the original code.
M. Valero 64
STS: Implementation
[Diagram: fetch engine with BTB, multiple-branch predictor, return address stack, and next-address logic; the fill unit filters out blue (redundant) traces, storing only the components of red traces in the trace cache; on a hit, instructions pass through xchange, shift & mask to decode]
M. Valero 65
STS: FIPA - Realistic Branch Predictor
[Chart: FIPA for Gcc, Li, and Postgres with a realistic branch predictor]
M. Valero 66
STS: FIPC - Realistic BP - 64KB i-cache
[Chart: FIPC for Gcc, Li, and Postgres with a realistic branch predictor and a 64KB i-cache]
M. Valero 67
STS: FIPA - Perfect Branch Predictor
[Chart: FIPA for Gcc, Li, and Postgres with a perfect branch predictor]
M. Valero 68
STS: Conclusions
• Minor hardware modification
  – Filter out blue traces in the fill unit
    • Avoids redundant run-time work
• Better usage of the storage space
  – Higher performance at the same cost
  – Equivalent performance at much lower cost
• The benefits of STS increase when used with the STC
  – The more work done at compile time, the less work left to do at run time
M. Valero 69
Conclusions
• Instruction fetch is better approached using both software and hardware techniques
  – Compile-time code reorganization
    • Increases code sequentiality
    • Minimizes instruction cache misses
  – Avoid redundant run-time work
    • Do not store the same traces twice
• High fetch unit performance with little additional hardware
  – Small 2KB complementary trace cache & smart fill unit
M. Valero 70
Future Work

• Further increasing fetch performance
  – Increase i-cache performance
    • Reduce miss ratio
    • Reduce miss penalty
  – Increase the quality of provided instructions
    • Better branch prediction accuracy
    • Faster recovery after mispredictions
• Take the path of least resistance
  – Simplicity of design
  – Software approach whenever possible
M. Valero 71
The End