February 18, 2004 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Slide 1 of 23
Understanding Scheduling Replay Schemes
Ilhyun Kim, Mikko H. Lipasti
PHARM Team, University of Wisconsin-Madison

Speculation vs. Recovery

All speculative techniques share a few common requirements
  - Some mechanisms for generating predictions
  - Microarchitectural support for realizing the benefits of predictions
  - Recovery for mispredictions

Relatively little focus on recovery
  - Prediction and speculative techniques have been discussed extensively
  - Vague descriptions like refetch, squash, reissue and replay

Recovery for speculative scheduling: scheduling replay
  - What are the issues in scheduling replay?
  - What functionalities should it provide?
  - What are the potential limitations?

Related Work

Selective re-issue
  - Initially proposed for value prediction
  - Assumed by many data-speculation techniques
  - Detailed mechanics were not fully described and/or developed
  - Generic dependence vector scheme [Sazeides, Ph.D. thesis]

Scheduling replay
  - Alpha 21264: squashing replay
  - Pentium 4: selective replay based on replay queue
  - Evaluation of replay schemes [Morancho et al.]
  - Scheduling miss prediction [Yoaz et al.]

Our work
  - Provides a framework for developing & analyzing replay schemes
  - Proposes token-based selective replay
Outline
Speculative Scheduling & Wavefront Propagation
Parallel Verification
Scheduling Replay Schemes
Token-based Selective Replay
Performance Evaluation
Conclusions

Speculative Scheduling Overview

[Figure: three pipeline organizations]
  - Original Tomasulo's algorithm: Fetch | Decode | Sched/Exe | WB | Commit (atomic sched / exe)
  - Fetch | Decode | Sched | Disp | RF | Exe | WB | Commit: cannot achieve max ILP
  - Speculative scheduling: Fetch | Decode | Sched | Disp | RF | Exe | WB | Commit (speculative issue, verify scheduling decisions)

Source of scheduling misses
  - Load instructions: D-cache / DTLB misses, store-to-load aliasing
  - Performance / complexity optimization techniques

Speculative Execution Wavefront
  - Initiated by a set of wakeup and select operations that link data dependences
  - Speculative "image" of execution

Real Execution Wavefront
  - The scheduled image is projected to the EXE stage, initiating the real execution wavefront
  - Serves to verify the scheduled execution
  - Comparing the scheduled and actual execution latencies

Verification runs behind the speculative execution wavefront
  - The current execution verifies scheduling decisions made in the past

[Figure: pipeline Fetch | Dec | Ren | Que | Sched | Disp | Disp | RF | RF | Exe | WB | Commit, annotated with dependence linking (speculative execution wavefront), data linking (real execution wavefront), the delay between the two wavefronts, and the point where a cache miss is detected]

Serial Verification
  - Triggers re-scheduling of directly dependent instructions
  - e.g. a scoreboard propagates poison bits along with data dependences

Scoreboard is OK, but not enough
  - Hard to stop the invalid speculative execution wavefront
  - Verification and schedule propagate at the same rate
  - The scheduler doesn't know which instructions depend on the miss
  - The scheduler keeps issuing instructions unnecessarily

[Figure: pipeline Fetch | Dec | Ren | Que | Sched | Disp | Disp | RF | RF | Exe | WB | Commit with a scoreboard, showing dependence linking (speculative execution wavefront) and data linking (real execution wavefront)]

Invalid Wavefront Propagation

[Figure: two histograms, parser and gap. x-axis: wavefront propagation (insts), 0 to 64+; y-axis: load scheduling misses (log scale); series: serial verification, parallel verification. Annotated maxima: 836 (parser), 157 (gap)]

A load miss can propagate through 836 instruction levels!
  - Not bounded by the size of the instruction window (8-wide, 128 RUU)
  - Total issue count goes up by 15% in parser (compared to parallel verification)
  - Average 10% in SPEC2K INT, worst 42% in mcf
  - Negative impacts on performance and power

Need a mechanism to stop invalid wavefront propagation

Parallel Verification

Issued instructions are verified in parallel
  - Verification catches up with invalid speculative execution wavefront
  - The scheduler does not trigger any further incorrect issue
  - Other independent instructions may be issued instead

Focus of this talk: parallel verification for scheduling replay

[Figure: pipeline Fetch | Dec | Ren | Que | Sched | Disp | Disp | RF | RF | Exe | WB | Commit in which parallel verification terminates the speculative execution wavefront]
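
The contrast between serial and parallel verification can be illustrated with a toy model. This sketch is not the authors' simulator: it assumes a single dependence chain hanging off one missed load, one dependence level issued per cycle, and a fixed schedule-to-verify delay. The function name `wrongly_issued` and its parameters are hypothetical.

```python
def wrongly_issued(chain_len, verify_delay, parallel):
    """Count dependents issued on a stale schedule after a load miss.

    Toy model (assumption): the load issues at cycle 0, its i-th dependent
    issues at cycle i, and the miss is detected at cycle `verify_delay`.
    """
    wrong = 0
    for level in range(1, chain_len + 1):
        issue_cycle = level
        if parallel:
            # All transitive dependents are invalidated when the miss is detected.
            kill_cycle = verify_delay
        else:
            # Poison moves one dependence level per cycle, the same rate
            # at which the scheduler wakes up the chain.
            kill_cycle = verify_delay + level
        if issue_cycle < kill_cycle:
            wrong += 1
    return wrong


serial_count = wrongly_issued(chain_len=100, verify_delay=6, parallel=False)
parallel_count = wrongly_issued(chain_len=100, verify_delay=6, parallel=True)
```

Under these assumptions the serial count grows with the length of the chain, while the parallel count is capped by the verification delay.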

Requirements of Parallel Verification

Propagation of scheduling verification should be FASTER than that of speculative execution wavefront propagation
  - Verification catches up with invalid speculative wavefront

Verification should be performed on the transitive closure of dependent instructions
  - No invalid wavefront slips through invalidation / recovery

Ideal scheduling replay
  - All mis-scheduled dependent instructions are invalidated instantly
  - Independent instructions are unaffected (selective replay)
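
The "transitive closure of dependent instructions" can be made concrete with a small graph walk. This is only an illustrative software sketch (hardware would not search a graph); the `consumers` map and the instruction names are hypothetical.

```python
from collections import deque


def transitive_dependents(consumers, missed):
    """Return every instruction that depends, directly or indirectly, on `missed`."""
    seen = set()
    work = deque([missed])
    while work:
        producer = work.popleft()
        for consumer in consumers.get(producer, ()):
            if consumer not in seen:
                seen.add(consumer)
                work.append(consumer)
    return seen


# Hypothetical window: LD feeds ADD, ADD feeds OR and AND; SUB and BR are independent.
consumers = {"LD": ["ADD"], "ADD": ["OR", "AND"], "SUB": ["BR"]}
to_replay = transitive_dependents(consumers, "LD")
```

An ideal selective scheme would invalidate exactly `to_replay` and leave `SUB` and `BR` untouched.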

Reducing Name Space for Dependence Tracking

A naïve way: dependence vector scheme works, but...
  - Dependence vector size == the max number of loads in the window
  - Propagate full vectors to dependent instructions at e.g. rename time
  - Scalability issues (e.g. replay at any instruction boundary)

Approximation or conversion of the name space for precise dependence tracking into a smaller set
  - Reduce the number of bits in dependence vectors
  - Faster verification, multi-level dependence tracking

[Figure: a scheduling miss is detected and each instruction asks "Am I dependent on the miss?"]
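
A minimal sketch of the naïve dependence vector scheme, assuming one vector bit per load and vectors merged in program order at rename. The tuple layout `(op, dest, srcs)`, the register names, and the helper names are assumptions made for illustration; bits are never recycled here, which a real design would have to handle.

```python
def build_vectors(program):
    """Return, per instruction, a bitmask of the loads it depends on."""
    reg_vec = {}    # register name -> dependence vector of its producer
    inst_vec = []
    next_bit = 0
    for op, dest, srcs in program:
        vec = 0
        for src in srcs:
            vec |= reg_vec.get(src, 0)   # merge the source vectors
        inst_vec.append(vec)
        if op == "LD":
            vec |= 1 << next_bit         # consumers also depend on this load
            next_bit += 1
        reg_vec[dest] = vec
    return inst_vec


def dependents_of(inst_vec, load_bit):
    """Indices of instructions that must replay when load `load_bit` misses."""
    return [i for i, vec in enumerate(inst_vec) if vec & (1 << load_bit)]


PROGRAM = [
    ("LD", "r1", ["r9"]),
    ("ADD", "r2", ["r1", "r3"]),
    ("LD", "r4", ["r8"]),
    ("OR", "r5", ["r2", "r4"]),
    ("SUB", "r6", ["r7", "r3"]),
]
VECTORS = build_vectors(PROGRAM)
```

The vector width equals the number of loads that can be in flight, which is the name space the slide proposes to shrink.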

Non-Selective Replay (aka "squashing" replay)

Kill operands with non-zero-value timers
  - Assuming all operands awakened after the mis-scheduled instruction are incorrect

Dependence tracking: wakeup order (imprecise)

Invalidate & replay ALL instructions in the load shadow

[Figure: instructions LD, ADD, OR, AND, BR moving through Sched | Disp | RF | Exe | Verify, from the cache miss until the miss is resolved. Wakeup logic for the OR instruction: each source operand (L, R) has a tag comparator on the tag bus, a ready bit, and a timer started at wakeup; a single kill wire reaches all operands]
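
The timer-based kill can be sketched as follows. This is a behavioral illustration, not the circuit from the slide: the `Operand` class, the countdown direction, and the value of `VERIFY_DELAY` are assumptions.

```python
VERIFY_DELAY = 4  # assumed number of cycles between wakeup and verification


class Operand:
    """One source operand entry in the issue queue (hypothetical model)."""

    def __init__(self):
        self.ready = False
        self.timer = 0

    def wakeup(self):
        self.ready = True
        self.timer = VERIFY_DELAY   # unverified while the timer is non-zero

    def tick(self):
        if self.timer > 0:
            self.timer -= 1

    def kill(self):
        # Single kill wire: operands woken inside the load shadow lose readiness.
        if self.timer > 0:
            self.ready = False
            self.timer = 0


verified = Operand()
verified.wakeup()
for _ in range(VERIFY_DELAY):
    verified.tick()

in_shadow = Operand()
in_shadow.wakeup()
in_shadow.tick()

for operand in (verified, in_shadow):
    operand.kill()
```

Because the kill only looks at timers, it cannot tell dependent operands from independent ones that happened to wake up in the same window, which is why the tracking is labeled imprecise.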

Delayed Selective Replay

Invalidates all conservatively (same as non-selective replay)
  - Samples the completion signal in the given issue slot at timer 0
  - Selectively re-validates direct child instructions if no poison bit from scoreboard

Dependence tracking: wakeup order and position (imprecise)

[Figure: instructions LD, ADD, OR, SUB, XOR, AND, BR moving through Sched | Disp | RF | Exe | Verify around a cache miss. Wakeup logic: each source operand has a tag comparator on the wakeup bus, a ready bit, a timer started at wakeup, and a slot number compared against the completion bus (one wire per issue slot) driven by the scoreboard; a kill wire reaches all operands. Labels: "re-validate" and "invalidated source operand (prevent further propagation)"]

Position-based Selective Replay

Ideal selective recovery
  - Dependence tracking is managed in a matrix form
  - Column: load issue slot, row: pipeline stages

Dependence tracking: 2-dimensional position (precise)

[Figure: ALU pipe and MEM pipe across Sched | Disp | RF | Exe | Verify with instructions LD, ADD, OR, AND, SLL, XOR, each carrying a dependence matrix, shown before and after a cache miss is detected. Matrices propagate along with the tag broadcast (bit merge & shift) and shift down every cycle in sync with pipeline flow. Wakeup logic adds a kill bus (one wire per mem port) and a dependence info bus (mem ports X depth) to the tag bus]
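
The matrix bookkeeping can be sketched in a few lines. This is an illustrative model under assumed sizes (`DEPTH` rows, `PORTS` columns), with hypothetical helper names; the kill test follows the "bits match in the last row" rule that appears later in the deck.

```python
DEPTH, PORTS = 4, 2   # assumed: tracked pipeline stages, memory ports


def empty():
    return [[0] * PORTS for _ in range(DEPTH)]


def load_matrix(port):
    """Matrix broadcast by a load that has just issued on `port`."""
    m = empty()
    m[0][port] = 1
    return m


def merge(a, b):
    """Bitwise OR of two source-operand matrices."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def shift_down(m):
    """Advance one cycle: every row moves down, the last row drops off."""
    return [[0] * PORTS] + [row[:] for row in m[:-1]]


def killed(m, port):
    """A miss on `port` at the verify stage matches a bit in the last row."""
    return m[-1][port] == 1


ld = load_matrix(port=1)
add = merge(shift_down(ld), empty())   # ADD wakes one cycle behind the load
ld = shift_down(ld)
for _ in range(DEPTH - 2):
    ld, add = shift_down(ld), shift_down(add)
```

After these cycles the load's bit sits in the last row of both matrices, so a miss on port 1 kills the ADD while an instruction with an empty matrix is unaffected.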

Limitations of Replay Schemes

Performance scalability
  - Non-selective scheme replays independent instructions
  - Delayed selective replay creates bubbles in scheduling

Complexity issue in position-based replay
  - Extra wires grow multiplicatively as the machine grows
  - A function of memory ports, issue width and pipeline depth
  - e.g. 50 to 196 extra wires when transitioning from 4 to 8-wide machines

Incompatible with data-speculation techniques (e.g. value prediction)
  - Data-speculation techniques collapse true data dependences
  - Wakeup order or position no longer correlates to dependences

Overcoming the Limitations

Source of the limitations
  - Dependence information propagates as a part of the scheduling or execution process

Move dependence propagation out of scheduling logic
  - Track dependences in program order (i.e. in rename stage)
  - Similar to the dependence vector scheme, which requires a big name space
  - How to reduce the bits while providing precise dependence tracking?

Token-based selective replay
  - Tracks dependences only for the instructions likely to be mis-scheduled
  - Plant tokens in loads based on scheduling hit/miss prediction
  - Propagate the tokens to dependent instructions
  - Selectively recover instructions with the token

Expensive backup recovery if token planting is incorrect
  - Squash & re-insert in program order (analogous to bpred recovery)
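
The two recovery paths of the token scheme can be summarized as a small decision function. This is a sketch of the policy as described on this slide, not the authors' implementation; `recover`, `token_of`, and `dep_tokens` are hypothetical names.

```python
def recover(missed_load, token_of, dep_tokens):
    """Choose the recovery action for a load whose schedule turned out wrong.

    token_of:   load -> token id, only for loads that were given a token
    dep_tokens: instruction -> set of token ids it depends on
    """
    token = token_of.get(missed_load)
    if token is None:
        # No token was planted: fall back to the expensive path.
        return ("squash_and_reinsert", None)
    victims = sorted(inst for inst, toks in dep_tokens.items() if token in toks)
    return ("selective_replay", victims)


token_of = {"LD1": 3}
dep_tokens = {"ADD": {3}, "OR": {3, 5}, "SUB": set(), "XOR": {5}}
with_token = recover("LD1", token_of, dep_tokens)
without_token = recover("LD2", token_of, dep_tokens)
```

The quality of the miss predictor decides how often the cheap path is taken, which is why token coverage is evaluated separately later in the talk.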

Token-based Selective Replay: Pipeline Structure

[Figure: pipeline Fetch | Decode | Rename | Queue | Schedule | Exe | Verify | Commit]
  - Rename: the PC indexes the sched miss predictor; sched miss confidence goes to the token allocator (high-confidence? name space?), which handles token allocation/deallocation; tokens then propagate
  - Verify: selective replay for token heads; non-selective replay for others (squash & reinsert instructions in program order)
  - Commit: deallocate tokens when retired

Token allocation
  - Source register mapping from the rename table gives a physical reg ID and a dep vector per source
      Src0 dep vector: 0 0 1 1 0 1 0 0
      Src1 dep vector: 1 0 1 0 0 1 0 0
  - Vector merge: 1 0 1 1 0 1 0 0
  - To issue queue, along with conventional instruction / reg info: dep_vector 1 0 1 1 0 1 0 0, head 1, token_ID 111
  - Token allocated? New token ID added, new dep_vector back to rename table: 1 0 1 1 0 1 0 1
  - # wires in kill bus = 2 X (# tokens)
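
The rename-time vector handling in the token allocation figure can be reproduced with the slide's own bit patterns. This is a minimal sketch: the function name is hypothetical, and mapping token ID 7 (binary 111) to the rightmost vector position is an assumption inferred from the example.

```python
def rename_vectors(src0_vec, src1_vec, new_token=None):
    """Merge source dependence vectors; optionally add a newly planted token.

    Returns (vector sent to the issue queue, vector written to the rename table).
    """
    merged = [a | b for a, b in zip(src0_vec, src1_vec)]
    for_rename_table = merged[:]
    if new_token is not None:
        # This instruction is a token head: its consumers inherit the new token.
        for_rename_table[new_token] = 1
    return merged, for_rename_table


SRC0 = [0, 0, 1, 1, 0, 1, 0, 0]
SRC1 = [1, 0, 1, 0, 0, 1, 0, 0]
to_issue_queue, to_rename_table = rename_vectors(SRC0, SRC1, new_token=7)
```

With these inputs the merged vector is `1 0 1 1 0 1 0 0` and the vector written back is `1 0 1 1 0 1 0 1`, matching the values shown in the figure.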

Machine Parameters

SimpleScalar-Alpha-based 4- and 8-wide OoO
  - 4-wide: 128 ROB, 64 LSQ, 64 IQ, 2 memory ports
  - 8-wide: 256 ROB, 128 LSQ, 128 IQ, 4 memory ports
  - Speculative scheduling, 6-cycle schedule-to-verify delay
  - 32K IL1 (2), 32K DL1 (2), 512K L2 (8), memory (100)
  - Combined branch prediction, fetch until the first taken branch
  - Position-based selective replay

Token-based selective replay
  - 4-wide: 8 tokens, 8-wide: 16 tokens
  - Scheduling miss predictor: 4k-entry, PC direct-mapped 2-bit counters
  - 4-cycle penalty for squashing instructions from issue queue
  - Re-insert instructions at the rate of machine width

SPEC2K INT, reduced input sets
  - Reference input sets for crafty, eon and gap, up to 3B instructions
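
A 4k-entry, PC-indexed, direct-mapped table of 2-bit counters could look like the sketch below. Only the table size and counter width come from the slide; the index function, the update policy, and the confidence threshold are assumptions, and the class name is hypothetical.

```python
class SchedMissPredictor:
    """Direct-mapped table of 2-bit saturating counters, indexed by load PC."""

    def __init__(self, entries=4096, threshold=2):
        self.entries = entries
        self.threshold = threshold        # assumed confidence threshold
        self.table = [0] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries   # assumes 4-byte instructions

    def predict_miss(self, pc):
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, missed):
        i = self._index(pc)
        if missed:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


predictor = SchedMissPredictor()
predictor.update(0x1000, missed=True)
predictor.update(0x1000, missed=True)
plant_token = predictor.predict_miss(0x1000)
```

A load for which `predict_miss` returns true would be a candidate for a token, subject to a free token being available.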

Scheduling Misses Covered by Tokens

[Figure: two bar charts (4-wide, 8 tokens; 8-wide, 16 tokens) of load scheduling misses covered, y-axis 0.6 to 1, for bzip, gcc, mcf, twolf, vpr; series: perfect, 4k predictor]

% load sched misses / load issues
                          bzip   gcc    mcf    twolf  vpr
  4-wide, 8 tokens        3.71   2.09   27.59  10.43  6.86
  8-wide, 16 tokens       6.86   3.18   27.60  12.31  8.88

75~92% of scheduling misses are recovered by tokens selectively
  - The misses not covered by tokens are recovered non-selectively (re-insert)
  - mcf runs out of tokens due to many concurrent misses

Name space reduction
  - 8-wide: naïve vector scheme tracks 128 loads, reduced to 16 loads (16 tokens)

Normalized Issue Count

[Figure: two bar charts (4-wide, 8-wide) of issue count normalized to PosSel, y-axis 0.8 to 1.3, for bzip, gcc, mcf, twolf, vpr; series: non-selective, delayed, token. One off-scale bar is labeled 1.33]

Selective replay is essential for lower issue count
  - Significant increase in non-selective replay
  - Independent instructions are unnecessarily replayed
  - Worse on wider machines

Token scheme performs as well as ideal scheme (position-based)
  - Except for mcf: low scheduling miss coverage

Normalized IPC

[Figure: two bar charts (4-wide, 8-wide) of IPC normalized to PosSel, y-axis 0.8 to 1.05, for bzip, gcc, mcf, twolf, vpr; series: non-selective, delayed, token]

Non-selective and delayed schemes do not scale to wider machines
  - Scheduling miss penalty grows as the width grows

Token selective recovery
  - Works better than non-selective or delayed selective schemes in many cases
  - Better performance scalability

Discussion

Delayed selective recovery
  - A good design alternative to ideal scheme on a 4-wide machine
  - Good tradeoffs among complexity, performance, and issue count

Complexity is a function of the number of tokens (not the machine width nor depth) in token scheme
  - # extra wires in the scheduler = 2 X (# tokens)
  - Position-based scheme: {(width) X (depth) + 1} X (# mem ports)
  - 32 (token-based) vs. 196 (position-based) on our 8-wide machine
  - Better for wider and deeper machines

Support for data-speculation techniques
  - Token scheme correctly tracks true data dependences in program order
  - Other schemes cannot recover unless correct dependences are carried through the scheduler
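
The wire-count formulas can be checked against the quoted figures with a short calculation. Taking `depth` to be the 6-cycle schedule-to-verify delay is an assumption; it is the value that reproduces the 50 and 196 wire counts given in the talk. The function names are hypothetical.

```python
def position_based_wires(width, depth, mem_ports):
    """Extra scheduler wires for position-based replay: (width * depth + 1) * ports."""
    return (width * depth + 1) * mem_ports


def token_based_wires(tokens):
    """Extra scheduler wires for token-based replay: 2 * tokens."""
    return 2 * tokens


DEPTH = 6  # assumed equal to the schedule-to-verify delay

four_wide = (position_based_wires(4, DEPTH, 2), token_based_wires(8))
eight_wide = (position_based_wires(8, DEPTH, 4), token_based_wires(16))
```

Doubling the machine width roughly quadruples the position-based count here because the number of memory ports doubles as well, while the token count depends only on how many tokens are provisioned.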

Conclusions

Scheduling replay is essential for speculative scheduling
  - Invalidate and re-schedule incorrectly issued instructions
  - Increasingly important as the pipeline becomes wider and deeper

Speculative wavefront propagation in scheduling replay
  - Incurred by the schedule-to-verify delay
  - Negatively affects issue count (power) and performance

Scheduling replay needs multi-level dependence tracking to avoid unnecessary issue under misses
  - Issues in efficient dependence tracking
  - Non-selective, delayed selective and position-based selective schemes

Token-based selective replay
  - Scalable to wider machines, support for data speculation
Questions??

Scheduling miss predictor performance at different thresholds

PC-indexed, direct-mapped, 4K entries

[Figure: two charts over prediction confidence threshold (conf 0 to conf 3), y-axis 0 to 1.2, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr]
  - Load sched misses covered: coverage of scheduling misses (higher is better)
  - Loads predicted as sched miss: loads predicted to be a miss (lower is better)

Limitations with data-speculation techniques

Assumptions enabling the name space conversion (into a smaller set)
  - Data-dependence enforcement, deterministic schedule-to-verify delay
  - Tracking issue / execution status filters out independent instructions

Data-speculation breaks those assumptions
  - Cannot be directly applied to data-speculation recovery

[Figure: two Sched ... Exe | Verify timelines from issue to miss detection]
  - Sched replay: issued instructions are dependent / independent, executed instructions are independent, the rest are unissued
  - Data speculation recovery (labels: variable, collapsed data-dependence): issued instructions are dependent / independent, executed instructions are dependent / independent, the rest are unissued

Normalized IPC

[Figure: two bar charts (4-wide, 8-wide) of IPC normalized to PosSel, y-axis 0.75 to 1.05, for bzip, gcc, mcf, twolf, vpr; series: non-selective, delayed, token, re-insert, conservative]

Re-insert
  - All scheduling misses are recovered by squashing & re-inserting
  - Worst-case performance of token-based replay

Conservative
  - Loads with high mis-scheduling confidence are scheduled based on L2 latency
  - Squashing & re-inserting if mis-scheduled
  - May unnecessarily delay too many loads

Scheduling Replay Models

Replay-queue-based replay (like the Pentium 4)
  - Issued instructions move from issue queue to replay queue
  - Circulates instructions until they hit in the scoreboard
  - Parallel verification for this model is left to future work

Issue-queue-based replay (our assumption)

[Figure: issue queue feeding the Exe pipeline and verify stage; when a cache miss is detected, verification status returns to the issue queue on the kill bus; instructions retire from the issue queue if correctly executed]

Parallel Verification

[Figure: Sched to Exe timelines over cycle n, cycle n+1, cycle n+2, showing how the cache miss signal travels with a scoreboard / checker and with dependence tracking / parallel verification; the latter ends in a terminated speculative execution wavefront]

Position-based Selective Replay

Ideal selective recovery
  - Dependence tracking is managed in a matrix form
  - Column: load issue slot, row: pipeline stages

Dependence tracking: precise position information

[Figure: integer pipeline and mem pipeline (width 2) across Sched | Disp | RF | Exe | Verify at cycle n and cycle n+1, with instructions LD, LD, ADD, OR, XOR, AND, SLL each carrying a dependence matrix. Matrices are merged on the tag / dep info broadcast (bit merge & shift). On the kill bus broadcast after a cache miss is detected, an operand is invalidated if bits match in the last row; four instructions are marked killed. Wakeup logic: tag comparators and ready bits per operand, plus tag bus, dependence info bus, and kill bus]