February 18, 2004 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Slide 1 of 23 Understanding Scheduling...


Understanding Scheduling Replay Schemes

Ilhyun Kim, Mikko H. Lipasti

PHARM Team, University of Wisconsin-Madison


Speculation vs. Recovery

All speculative techniques share a few common requirements
  Some mechanisms for generating predictions
  Microarchitectural support for realizing the benefits of predictions
  Recovery for mispredictions

Relatively little focus on recovery
  Prediction and speculative techniques have been discussed extensively
  Vague descriptions like refetch, squash, reissue and replay

Recovery for speculative scheduling: scheduling replay
  What are the issues in scheduling replay?
  What functionalities should it provide?
  What are the potential limitations?


Related Work

Selective re-issue
  Initially proposed for value prediction
  Assumed by many data-speculation techniques
  Detailed mechanics were not fully described and/or developed
  Generic dependence vector scheme [Sazeides, Ph.D. thesis]

Scheduling replay
  Alpha 21264: squashing replay
  Pentium 4: selective replay based on replay queue
  Evaluation of replay schemes [Morancho et al.]
  Scheduling miss prediction [Yoaz et al.]

Our work
  Provides a framework for developing & analyzing replay schemes
  Proposes token-based selective replay


Outline

Speculative Scheduling & Wavefront Propagation

Parallel Verification

Scheduling Replay Schemes

Token-based Selective Replay

Performance Evaluation

Conclusions


Speculative Scheduling Overview

[Figure: three pipelines. (1) Original Tomasulo's algorithm: Fetch, Decode, Sched / Exe, WB, Commit, with atomic sched / exe. (2) Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit: cannot achieve max ILP. (3) Speculative Scheduling: Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit, with speculative issue and verification of scheduling decisions.]

Source of scheduling misses
  Load instructions: D-cache / DTLB misses, store-to-load aliasing
  Performance / complexity optimization techniques


Speculative Execution Wavefront
  Initiated by a set of wakeup and select operations that link data dependences
  Speculative "image" of execution

Real Execution Wavefront
  The scheduled image is projected to the EXE stage, initiating the real execution wavefront
  Serves to verify the scheduled execution
    Comparing the scheduled and actual execution latencies

Verification runs behind the speculative execution wavefront
  The current execution verifies scheduling decisions made in the past

[Figure: pipeline Fetch, Dec, Ren, Que, Sched, Disp, Disp, RF, RF, Exe, WB, Commit. Labels: dependence linking (speculative execution wavefront), data linking (real execution wavefront), delay between the two wavefronts, cache miss detected.]


Scoreboard is OK, but not enough

Serial Verification
  Triggers re-scheduling of directly dependent instructions
  e.g. a scoreboard propagates poison bits along with data dependences

Hard to stop the invalid speculative execution wavefront
  Verification and scheduling propagate at the same rate
  The scheduler doesn't know which instructions depend on the miss
  The scheduler keeps issuing instructions unnecessarily

[Figure: the speculative and real execution wavefronts with a scoreboard. Labels: Ren, dependence linking, data linking.]


Invalid Wavefront Propagation

[Figure: two histograms, Parser and Gap. x-axis: wavefront propagation (insts), 0 to 64+; y-axis: load scheduling misses, log scale. Each panel compares serial verification with parallel verification. Annotated maxima: 836 and 157.]

A load miss can propagate through 836 instruction levels!!
  Not bounded by the size of the instruction window (8-wide, 128 RUU)

Total issue count goes up by 15% in parser (compared to parallel verification)
  Average 10% in SPEC2K INT, worst 42% in mcf
  Negative impacts on performance and power

Need a mechanism to stop invalid wavefront propagation



Issued instructions are verified in parallel
  Verification catches up with the invalid speculative execution wavefront
  The scheduler does not trigger any further incorrect issue
  Other independent instructions may be issued instead

Focus of this talk: parallel verification for scheduling replay

[Figure labels: Ren, dependence linking, data linking, parallel verification, terminated speculative execution wavefront.]
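The difference between serial and parallel verification can be illustrated with a toy model. This is only a sketch under strong assumptions that are not in the slides: a single dependence chain hanging off the missed load, one chain level issued per cycle, and a hypothetical function name `wasted_issues`.

```python
def wasted_issues(chain_length, verify_delay, parallel):
    """Count chain levels issued on a stale schedule after a load miss.

    Toy model: one dependence level issues per cycle, and the miss is
    detected `verify_delay` cycles after the load was scheduled.
    """
    issued = 0
    cycle = 0
    while issued < chain_length:
        cycle += 1
        if cycle > verify_delay and parallel:
            # Parallel verification invalidates every transitive dependent
            # at once, so the scheduler stops issuing from this chain.
            break
        # Serial verification: poison also advances one level per cycle,
        # so it stays `verify_delay` levels behind and never catches up.
        issued += 1
    return issued


serial = wasted_issues(chain_length=836, verify_delay=6, parallel=False)
parallel = wasted_issues(chain_length=836, verify_delay=6, parallel=True)
```

In this model the serial case wastes one issue per chain level, while the parallel case is bounded by the schedule-to-verify delay.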




Requirements of Parallel Verification

Propagation of scheduling verification should be FASTER than speculative execution wavefront propagation
  Verification catches up with the invalid speculative wavefront

Verification should be performed on the transitive closure of dependent instructions
  No invalid wavefront slips through invalidation / recovery

Ideal scheduling replay
  All mis-scheduled dependent instructions are invalidated instantly
  Independent instructions are unaffected (selective replay)
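The "transitive closure of dependent instructions" can be made concrete with a small graph walk. The sketch below is illustrative only: the `consumers` map and the instruction names are hypothetical, and real hardware would not walk a graph but would encode this information in the scheduler.

```python
from collections import deque


def dependents_closure(consumers, missed):
    """Return every instruction that transitively depends on `missed`.

    `consumers` maps an instruction name to the instructions that read
    its result (a hypothetical software view of the dependence graph).
    """
    seen = set()
    work = deque([missed])
    while work:
        producer = work.popleft()
        for child in consumers.get(producer, ()):
            if child not in seen:
                seen.add(child)
                work.append(child)
    return seen


# Hypothetical window: LD feeds ADD, ADD feeds OR; AND and BR do not depend on LD.
consumers = {"LD": ["ADD"], "ADD": ["OR"], "AND": ["BR"]}
to_replay = dependents_closure(consumers, "LD")
```

An ideal selective replay would invalidate exactly `to_replay` and leave AND and BR untouched.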


Reducing Name Space for dependence tracking

A naïve way: dependence vector scheme works, but...
  Dependence vector size == the max number of loads in the window
  Propagate full vectors to dependent instructions at e.g. rename time
  Scalability issues (e.g. replay at any instruction boundary)

Approximation or conversion of the name space for precise dependence tracking into a smaller set
  Reduce the number of bits in dependence vectors

[Figure: a scheduling miss is detected and each instruction asks "Am I dependent on the miss?"]

Faster verification → multi-level dependence tracking
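A minimal sketch of the naïve dependence vector scheme, assuming one bit per in-flight load and vectors stored as Python integers. The function names and the slot numbering are assumptions made for illustration, not the layout used in the cited thesis.

```python
def rename_dep_vector(src_vectors, load_slot=None):
    """Dependence vector for a newly renamed instruction.

    Bit i set means "depends on in-flight load i". The vector is the OR
    of the source operands' vectors; a load also sets its own bit.
    """
    vec = 0
    for v in src_vectors:
        vec |= v
    if load_slot is not None:
        vec |= 1 << load_slot
    return vec


def depends_on(vec, load_slot):
    """True if the instruction must replay when load `load_slot` misses."""
    return bool((vec >> load_slot) & 1)


ld0 = rename_dep_vector([], load_slot=0)   # first load in the window
ld1 = rename_dep_vector([], load_slot=1)   # second load in the window
add = rename_dep_vector([ld0])             # ADD reads the result of ld0
or_ = rename_dep_vector([add, ld1])        # OR reads ADD and ld1
```

The cost the slide points at is visible here: the vector needs one bit for every load that can be in the window.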


Non-Selective Replay (aka "squashing" replay)

Kill operands with non-zero-value timers
  Assuming all operands awakened after the mis-scheduled instruction are incorrect

Dependence tracking: wakeup order (imprecise)

Invalidate & replay ALL instructions in the load shadow

[Figure: LD, ADD, OR, AND and BR moving through Sched, Disp, RF, Exe, Verify, from the cache miss until the miss is resolved. Issue queue entries (e.g. the OR instruction) hold tag L / ready L / timer L and tag R / ready R / timer R, compared against the tag bus. Labels: wakeup, timer start, kill wire (single wire).]
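The timer-based kill can be sketched as follows. This is an assumed software model, not the actual circuit: the `Operand` class, the countdown direction of the timer and the 4-cycle window are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    ready: bool = False
    timer: int = 0  # cycles left until this wakeup is verified (assumed countdown)

    def wakeup(self, verify_delay):
        # A tag match sets the ready bit and starts the timer.
        self.ready = True
        self.timer = verify_delay

    def tick(self):
        if self.timer > 0:
            self.timer -= 1

    def kill(self):
        # Non-selective: any operand still inside the verification window is
        # reset, regardless of which instruction produced it.
        if self.timer > 0:
            self.ready = False
            self.timer = 0


def kill_all(operands):
    """Model of the single kill wire: every issue queue operand sees it."""
    for op in operands:
        op.kill()


old = Operand()
old.wakeup(verify_delay=4)
for _ in range(4):
    old.tick()          # already verified: timer has reached 0

recent = Operand()
recent.wakeup(verify_delay=4)
recent.tick()           # still in the load shadow: timer is 3

kill_all([old, recent])
```

After the kill, `old` stays ready while `recent` is invalidated even if it never depended on the missed load, which is the imprecision the slide describes.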


Delayed Selective Replay

Invalidates all conservatively (same as non-selective replay)

Samples the completion signal in the given issue slot at timer 0
  Selectively re-validates direct child instructions if no poison bit from scoreboard

Dependence tracking: wakeup order and position (imprecise)

[Figure: LD, ADD, OR, SUB, XOR, AND and BR moving through Sched, Disp, RF, Exe, Verify after a cache miss. Issue queue entry: tagL, ReadyL, timer, Slot #, tagR, ReadyR, timer, Slot #. Buses: wakeup bus, kill wire, completion bus (wire / issue slot) driven by the scoreboard. Labels: timer start, re-validate, invalidated source operand (prevent further propagation).]
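One possible reading of the two steps above, written as a small model. Everything here is an assumption for illustration: the `Operand` fields, the dictionary used for the completion bus, and the choice to keep the timer running after a kill so that the completion signal can be sampled when it reaches 0.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    slot: int             # issue slot of the producer (hypothetical field)
    timer: int            # cycles until the producer reaches the verify stage
    ready: bool = True
    killed: bool = False


def kill(operands):
    """Conservative step: drop every operand still awaiting verification."""
    for op in operands:
        if op.timer > 0:
            op.ready = False
            op.killed = True


def tick(operands, completion_bus):
    """Advance one cycle.

    completion_bus[slot] is True when the instruction in that issue slot
    completed without a poison bit from the scoreboard.
    """
    for op in operands:
        if op.timer > 0:
            op.timer -= 1
            if op.timer == 0 and op.killed and completion_bus[op.slot]:
                op.ready = True      # re-validate: the producer was correct
                op.killed = False


a = Operand(slot=0, timer=1)   # producer in slot 0 will complete correctly
b = Operand(slot=1, timer=1)   # producer in slot 1 carries a poison bit
kill([a, b])
tick([a, b], completion_bus={0: True, 1: False})
```

Operand `a` is invalidated and then re-validated one cycle later, which corresponds to the scheduling bubble this scheme introduces.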


Position-based Selective Replay

Ideal selective recovery

Dependence tracking is managed in a matrix form
  Column: load issue slot, row: pipeline stages

Dependence tracking: 2-dimensional position (precise)

[Figure: ALU pipe and MEM pipe across Sched, Disp, RF, Exe, Verify with LD, ADD, SLL, OR, AND and XOR, each carrying a dependence matrix. Matrices propagate along with tag broadcast (bit merge & shift) and shift down every cycle in sync with pipeline flow. Issue queue entry: tagL, ReadyL, tagR, ReadyR. Buses: tag bus, dependence info bus (mem ports X depth), kill bus (wire / mem port). Label: cache miss detected.]
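The matrix bookkeeping can be sketched in a few lines. The 4-row by 2-column shape and the helper names are assumptions chosen to resemble the figure; the sketch models only the merge, shift and kill check, not the wakeup logic around them.

```python
DEPTH = 4   # rows: pipeline stages from schedule to verify (assumed)
PORTS = 2   # columns: load issue slots / memory ports (assumed)


def empty():
    return [[0] * PORTS for _ in range(DEPTH)]


def merge(*matrices):
    """Bitwise OR of the parents' matrices, done when a child is woken up."""
    out = empty()
    for m in matrices:
        for r in range(DEPTH):
            for c in range(PORTS):
                out[r][c] |= m[r][c]
    return out


def shift(m):
    """Move every row one stage down; the last row falls off."""
    return [[0] * PORTS] + [row[:] for row in m[:-1]]


def issue_load(port):
    """A load marks its own issue slot in the first row."""
    m = empty()
    m[0][port] = 1
    return m


def must_replay(m, missed_port):
    """A miss is reported when the load sits in the last row."""
    return m[DEPTH - 1][missed_port] == 1


ld = issue_load(port=0)
ld = shift(ld)                 # one cycle later the load is in row 1
add = merge(ld)                # ADD wakes up and inherits the load's position
for _ in range(2):             # both advance in sync with the pipeline
    ld, add = shift(ld), shift(add)
independent = empty()          # an instruction with no load ancestors
```

When the load reaches the last row, the ADD's bit is in the same position, so a kill for port 0 selects it and nothing else.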




Limitations of Replay Schemes

Performance scalability
  Non-selective scheme replays independent instructions
  Delayed selective replay creates bubbles in scheduling

Complexity issue in position-based replay
  Extra wires increase multiplicatively as the machine grows
  A function of memory ports, issue width and pipeline depth
  e.g. 50 to 196 extra wires when transitioning from 4 to 8-wide machines

Incompatible with data-speculation techniques (e.g. value prediction)
  Data-speculation techniques collapse true data dependences
  Wakeup order or position no longer correlates to dependences


Overcoming the Limitations

Source of the limitations
  Dependence information propagates as a part of the scheduling or execution process

Move dependence propagation out of scheduling logic
  Track dependences in program order (i.e. in rename stage)
  Similar to the dependence vector scheme, which requires a big name space
  How to reduce the bits while providing precise dependence tracking?

Token-based selective replay
  Tracks dependences only for the instructions likely to be mis-scheduled
    Plant tokens in loads based on scheduling hit/miss prediction
    Propagate the tokens to dependent instructions
    Selectively recover instructions with the token
  Expensive backup recovery if token planting is incorrect
    Squash & re-insert in program order (analogous to branch misprediction recovery)


Token-based Selective Replay

Pipeline structure
  Fetch, Decode, Rename, Queue, Schedule, Exe, Verify, Commit
  Sched miss predictor (indexed by PC) supplies sched miss confidence to the token allocator
  Token allocation / deallocation (high-confidence? name space?)
  Token propagation at Rename
  Selective replay for token heads
  Non-selective replay for others: squash & reinsert instructions in program order
  Deallocate tokens when retired

Token allocation
  Source register mapping from rename table
    Src0: physical reg ID, dep vector 0 0 1 1 0 1 0 0
    Src1: physical reg ID, dep vector 1 0 1 0 0 1 0 0
  Vector merge: 1 0 1 1 0 1 0 0
  Token allocated? New token ID: head 1, token_ID 111
  New dep_vector, back to rename table: 1 0 1 1 0 1 0 1
  To issue queue: conventional instruction / reg info, dep_vector 1 0 1 1 0 1 0 0, head 1, token_ID 111

Issue queue entry
  tagL, ReadyL, tagR, ReadyR, dep_vector, head, token_ID
  Buses: tag bus, kill bus
  # wires in kill bus = 2 X (# tokens)
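The vector arithmetic in the token allocation figure can be reproduced with a short sketch. It assumes that bit positions are numbered left to right starting at 0, so token_ID 111 (7) is the last position; the helper names are hypothetical.

```python
def merge_vectors(src0, src1):
    """OR the dependence vectors of the two source registers."""
    return [a | b for a, b in zip(src0, src1)]


def plant_token(dep_vector, token_id):
    """Vector written back to the rename table for a token head's result."""
    out = list(dep_vector)
    out[token_id] = 1
    return out


def killed_by(dep_vector, token_id):
    """True if a kill for `token_id` must invalidate this instruction."""
    return dep_vector[token_id] == 1


src0 = [0, 0, 1, 1, 0, 1, 0, 0]
src1 = [1, 0, 1, 0, 0, 1, 0, 0]
merged = merge_vectors(src0, src1)       # the load's own dep_vector
token_id = 0b111                          # new token allocated to this load
new_vector = plant_token(merged, token_id)
```

Instructions that later read the load's destination register pick up `new_vector` at rename, so a kill for token 7 reaches them without any help from the scheduler.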


Machine parameters

Simplescalar-Alpha-based 4- and 8-wide OoO
  4-wide: 128 ROB, 64 LSQ, 64 IQ, 2 memory ports
  8-wide: 256 ROB, 128 LSQ, 128 IQ, 4 memory ports
  Speculative scheduling, 6-cycle schedule-to-verify delay
  32K IL1 (2), 32K DL1 (2), 512K L2 (8), memory (100)
  Combined branch prediction, fetch until the first taken branch

Position-based selective replay

Token-based selective replay
  4-wide: 8 tokens, 8-wide: 16 tokens
  Scheduling miss predictor: 4k-entry, PC direct-mapped 2-bit counters
  4-cycle penalty for squashing instructions from issue queue
  Re-insert instructions at the rate of machine width

SPEC2K INT, reduced input sets
  Reference input sets for crafty, eon and gap up to 3B instructions


Scheduling Misses Covered by Tokens

[Figure: load scheduling misses covered (y-axis, 0.6 to 1) for bzip, gcc, mcf, twolf and vpr, comparing a perfect predictor with the 4k predictor. Panels: 4-wide, 8 tokens; 8-wide, 16 tokens.]

% load sched misses / load issues (bzip, gcc, mcf, twolf, vpr)
  3.71 2.09 27.59 10.43 6.86
  6.86 3.18 27.60 12.31 8.88

75~92% of scheduling misses are recovered by tokens selectively
  The misses not covered by tokens are recovered non-selectively (re-insert)
  mcf runs out of tokens due to many concurrent misses

Name space reduction
  8-wide: naïve vector scheme tracks 128 loads → 16 loads (16 tokens)


Normalized Issue Count

[Figure: issue count normalized to PosSel (y-axis, 0.8 to 1.3) for bzip, gcc, mcf, twolf and vpr under non-selective, delayed and token replay. Panels: 4-wide, 8-wide. One bar is labeled 1.33.]

Selective replay is essential for lower issue count

Significant increase in non-selective replay
  Independent instructions are unnecessarily replayed
  Worse on wider machines

Token scheme performs as well as ideal scheme (position-based)
  Except for mcf: low scheduling miss coverage


Normalized IPC

[Figure: IPC normalized to PosSel (y-axis, 0.8 to 1.05) for bzip, gcc, mcf, twolf and vpr. Panels: 4-wide, 8-wide.]

Non-selective and delayed schemes do not scale to wider machines
  Scheduling miss penalty grows as the width grows

Token selective recovery
  Works better than non-selective or delayed selective schemes in many cases
  Better performance scalability


Discussion

Delayed selective recovery
  A good design alternative to ideal scheme on a 4-wide machine
  Good tradeoffs among complexity, performance, and issue count

Complexity is a function of the number of tokens (not the machine width nor depth) in token scheme
  # extra wires in the scheduler = 2 X (# tokens)
  Position-based scheme: {(width) X (depth) + 1} X (# mem ports)
  32 (token-based) vs. 196 (position-based) on our 8-wide machine
  Better for wider and deeper machines

Support for data-speculation techniques
  Token scheme correctly tracks true data dependences in program order
  Other schemes cannot recover unless correct dependences are carried through the scheduler
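The wire counts quoted in the talk follow from the two formulas. The worked example below assumes that "depth" is the 6-cycle schedule-to-verify delay from the machine parameters; with that assumption the formula reproduces the 50, 196 and 32 figures given in the slides.

```python
def token_wires(num_tokens):
    """Extra scheduler wires in the token scheme: 2 x (# tokens)."""
    return 2 * num_tokens


def position_wires(width, depth, mem_ports):
    """Extra wires in the position-based scheme:
    {(width) x (depth) + 1} x (# mem ports)."""
    return (width * depth + 1) * mem_ports


DEPTH = 6  # assumed: the 6-cycle schedule-to-verify delay

pos_4wide = position_wires(width=4, depth=DEPTH, mem_ports=2)
pos_8wide = position_wires(width=8, depth=DEPTH, mem_ports=4)
tok_4wide = token_wires(8)    # 4-wide configuration uses 8 tokens
tok_8wide = token_wires(16)   # 8-wide configuration uses 16 tokens
```

Doubling the width roughly quadruples the position-based count because the number of memory ports doubles as well, whereas the token count depends only on the number of tokens.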


Conclusions

Scheduling replay is essential for speculative scheduling
  Invalidate and re-schedule incorrectly issued instructions
  Increasingly important as pipelines become wider and deeper

Speculative wavefront propagation in scheduling replay
  Incurred by the schedule-to-verify delay
  Negatively affects issue count (power) and performance

Scheduling replay needs multi-level dependence tracking to avoid unnecessary issue under misses
  Issues in efficient dependence tracking
  Non-selective, delayed selective and position-based selective schemes

Token-based selective replay
  Scalable to wider machines, support for data speculation


Questions??


Scheduling miss predictor performance at different thresholds

PC-indexed, direct-mapped, 4K entries

[Figure: two panels over prediction confidence threshold (conf 0 to conf 3) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex and vpr; y-axis 0 to 1.2. One panel shows load sched misses covered: coverage of scheduling misses (higher is better). The other shows loads predicted as sched miss: loads predicted to be a miss (lower is better).]
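A PC-indexed, direct-mapped table of 2-bit counters could look like the sketch below. The index function (dropping the two low PC bits for 4-byte Alpha instructions), the saturating update rule and the default threshold are assumptions; the slides specify only the table size, the indexing style and the counter width.

```python
class SchedMissPredictor:
    """Direct-mapped table of 2-bit saturating confidence counters."""

    def __init__(self, entries=4096, threshold=2):
        self.entries = entries
        self.threshold = threshold      # assumed confidence threshold (0..3)
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two low PC bits carry no information.
        return (pc >> 2) % self.entries

    def predict_miss(self, pc):
        """True if this load should be given a token."""
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, missed):
        """Train with the verified outcome of the load at `pc`."""
        i = self._index(pc)
        if missed:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = SchedMissPredictor()
pc = 0x1000
before = predictor.predict_miss(pc)
predictor.update(pc, missed=True)
predictor.update(pc, missed=True)
after_two_misses = predictor.predict_miss(pc)
predictor.update(pc, missed=False)
after_one_hit = predictor.predict_miss(pc)
```

Raising `threshold` trades coverage of scheduling misses for fewer loads predicted as misses, which is the tradeoff the two panels plot.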


Limitations with data-speculation techniques

Assumptions enabling the name space conversion (into a smaller set)
  Data-dependence enforcement, deterministic schedule-to-verify delay
  Tracking issue / execution status filters out independent instructions

Data-speculation breaks those assumptions
  Cannot be directly applied to data-speculation recovery

[Figure: two Sched-to-Verify pipelines at the point where an issue miss is detected. Sched replay: unissued; issued dependent / independent; executed independent. Data speculation recovery: unissued; issued dependent / independent; executed dependent / independent. Labels: variable, collapsed data-dependence.]


Normalized IPC

[Figure: IPC normalized to PosSel (y-axis, 0.75 to 1.05) for bzip, gcc, mcf, twolf and vpr under non-selective, delayed, token, re-insert and conservative. Panels: 8-wide, 4-wide.]

Re-insert
  All scheduling misses are recovered by squashing & re-inserting
  Worst-case performance of token-based replay

Conservative
  Loads with high mis-scheduling confidence are scheduled based on L2 latency
  Squashing & re-inserting if mis-scheduled
  May unnecessarily delay too many loads


Scheduling Replay Models

Replay-queue-based Replay (like the Pentium 4)
  Issued instructions move from issue queue to replay queue
  Circulates instructions until they hit in the scoreboard
  Parallel verification for this model is left to future work

Issue-queue-based Replay (our assumption)
  Retire from issue queue if correctly executed

[Figure labels: issue queue, Exe pipeline, verify, verification status (kill bus), cache miss detected.]




[Figure: Sched and Exe over cycle n, cycle n+1 and cycle n+2. One panel shows the cache miss signal handled by a scoreboard / checker; the other shows the cache miss signal handled by dependence tracking / parallel verification, with a terminated speculative execution wavefront.]


Position-based Selective Replay

Ideal selective recovery

Dependence tracking is managed in a matrix form
  Column: load issue slot, row: pipeline stages

Dependence tracking: precise position information

[Figure: cycle n and cycle n+1 of an integer pipeline and a mem pipeline (width 2) across Sched, Disp, RF, Exe, verify, with LD, LD, ADD, OR, XOR, AND and SLL and their dependence matrices. Labels: merge matrices, bit merge & shift, tag / dep info broadcast, kill bus broadcast, invalidate if bits match in the last row, cache miss detected; four instructions are marked killed. Issue queue entry: tagL, ReadyL, tagR, ReadyR, with tag bus, dependence info bus and kill bus.]
