Value Prediction: Are(n’t) We Done Yet?
Mikko Lipasti
University of Wisconsin-Madison
Keynote, 2nd Value Prediction Workshop, Oct. 10, 2004 2 of 38
Definition
What is value prediction? Broadly, three salient attributes:
1. Generate a speculative value (predict)
2. Consume speculative value (execute)
3. Verify speculative value (compare/recover)
• This subsumes branch prediction
• Focus here on operand values
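The three attributes above can be sketched in a few lines of Python. This is only an illustration, assuming a simple last-value predictor indexed by instruction address; the class name `LastValuePredictor`, the function `run`, and the trace format are hypothetical and do not come from the talk.

```python
class LastValuePredictor:
    """Hypothetical last-value predictor: guesses that an instruction
    produces the same value it produced the previous time."""

    def __init__(self):
        self.table = {}  # instruction address (pc) -> last observed value

    def predict(self, pc):
        # Step 1: generate a speculative value (None means "no prediction").
        return self.table.get(pc)

    def verify(self, pc, predicted, actual):
        # Step 3: compare against the real result, then train the table.
        self.table[pc] = actual
        return predicted == actual


def run(trace):
    """trace: list of (pc, actual_value). Returns (correct, mispredicted)."""
    predictor = LastValuePredictor()
    correct = mispredicted = 0
    for pc, actual in trace:
        predicted = predictor.predict(pc)
        if predicted is None:
            predictor.verify(pc, None, actual)  # first sighting: train only
            continue
        # Step 2: dependent instructions would execute here using `predicted`.
        if predictor.verify(pc, predicted, actual):
            correct += 1
        else:
            mispredicted += 1  # dependents must be squashed and re-executed
    return correct, mispredicted
```

For example, `run([(0x10, 0), (0x10, 0), (0x10, 0), (0x10, 5), (0x10, 5)])` counts three correct predictions and one misprediction; the first occurrence only trains the table.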

Some History
“Classical” value prediction
• Independently invented by 4 groups in 1995-1996
1. AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, inv. March 1995
2. Technion: F. Gabbay and A. Mendelson, inv. sometime 1995, TR 11/96, US patent Sep 1997
3. CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct. 1995, ASPLOS paper submitted March 1996
4. Wisconsin: Y. Sazeides, J. Smith, Summer 1996

Why?
Possible explanations:
1. Natural evolution from branch prediction
2. Natural evolution from memoization
3. Natural evolution from rampant speculation
  • Cache hit speculation
  • Memory independence speculation
  • Speculative address generation
4. Improvements in tracing/simulation technology
  • “There’s a lot of zeroes out there.” (C. Wilkerson)
  • Values, not just instructions & addresses
  • TRIP6000 [A. Martin-de-Nicolas, IBM]

Publications by Year
[Figure: cumulative publications (y-axis, 0 to 70) by year (x-axis, 1996 to 2004); series: ISCA, MICRO, HPCA, Others, Total]
Excludes journals, workshops, compiler conferences

What Happened?
• Tremendous academic interest
  • Dozens of research groups, papers, proposals
• No industry uptake
  • No present or planned CPU with value prediction
• Why?
  • Meager performance benefit (< 10%)
  • Power consumption
    • Dynamic power for extra activity
    • Static power (area) for prediction tables
  • Complexity and correctness
    • Subtle memory ordering issues [MICRO ’01]
    • Misprediction recovery [HPCA ’04]

Performance?
• Relationship between timely fetch and value prediction benefit [Gabbay, ISCA]
  • Value prediction doesn’t help when the result can be computed before the consumer instruction is fetched
• High-bandwidth fetch helps
  • Wide trace caches studied in late 1990s
  • But, these have several negative attributes
• Recent designs focus on frequency, not ILP
• High-bandwidth fetch is a red herring
  • More important to fetch the right instructions

Future Adoption?
• Classical value prediction will only make it in the context of a very different microarchitecture
  • One that explicitly and aggressively exposes ILP
• Promising trends
  • Deep pipelining craze appears to be over
    • Can’t manage the design complexity
  • High frequency mania appears to be over
    • Can’t afford the power
  • Architects are pursuing ILP once again
• Value prediction has another opportunity

What Value Prediction Begat
• Value prediction catalyzed a new focus on values in computation
  • This had not been studied before
• A whole new realm of research: Value-Aware Microarchitecture
  • Spans numerous subdisciplines
  • Significant industrial impact already
• Also, developments in supporting technologies

Value-Aware Microarchitecture
Memory Hierarchy
• Register File Compression [several]
• Cache Compression [Gupta, Alameldeen]
• Memory Compression [e.g. IBM MXT]
• Bandwidth compression
  • Address and data bus encoding [Rudolph]
  • Initialization Traffic [Lewis]
Execution Core
• Value Prediction
• Operand Significance
  • Low Power [Canal]
  • Execution bandwidth [Loh]
  • Bit-slicing [Pentium 4, Mestan]
• Instruction reuse [Sodani]
• Carry prediction [Circuit-level Speculation]
Load/Store Processing
• Load value prediction [numerous]
• Fast address calculation [Austin]
• Value-aware alias prediction [Onder]
• Memory consistency [Cain]
Cache Coherence
• Producer-side
  • Silent stores, temporally silent stores [Lepak]
  • Speculative lock elision [Rajwar; Wisc, UIUC]
• Consumer side
  • Load value prediction using stale lines [Lepak]
  • “Coherence decoupling” [Burger, Sohi; ASPLOS ’04]

Supporting Technologies
• Value prediction presented some unique challenges:
  • Relatively low correct prediction rate (initially 40-50%)
  • Nontrivial misprediction rate with avoidable misprediction cost
• These drove study of:
  • Confidence prediction/estimation
    • First microarchitectural application of confidence estimation, though not widely credited or cited as such
    • Since studied for numerous applications, e.g. gating control speculation
  • Selective recovery [Sazeides Ph.D., Kim HPCA ’04]
    • Numerous challenges in extending recovery to entire window
  • Both have proved to be fruitful research areas
• Also stimulated development of software technology:
  • Value profiling
  • Value-based compiler optimizations
  • Run-time code specialization
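To make the idea of confidence estimation concrete, here is a minimal sketch that assumes a per-instruction saturating counter that resets on a misprediction. The slide does not say which estimator was used; the class name `ConfidenceCounter` and the default `maximum` and `threshold` values are illustrative assumptions.

```python
class ConfidenceCounter:
    """Hypothetical saturating confidence counter for one static instruction.
    A prediction is consumed only once the counter reaches the threshold."""

    def __init__(self, maximum=3, threshold=2):
        self.maximum = maximum      # saturation point (assumed value)
        self.threshold = threshold  # minimum count to trust a prediction
        self.count = 0

    def confident(self):
        return self.count >= self.threshold

    def update(self, was_correct):
        if was_correct:
            # Saturate rather than grow without bound.
            self.count = min(self.count + 1, self.maximum)
        else:
            # Reset on a misprediction so the costly case is retried cautiously.
            self.count = 0
```

Gating the use of predictions this way trades coverage for accuracy: fewer values are speculated on, but fewer mispredictions need recovery.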

Outline
• Some History
• Industry Trends
• Value-Aware Microarchitecture
• Case study: Memory Consistency [Trey Cain, ISCA 2004]
  • Conventional load queue microarchitecture
  • Value-based memory ordering
  • Replay-reduction heuristics
  • Performance evaluation
• Conclusions

Value-based Memory Consistency
• High ILP => Large instruction windows
  • Larger physical register file
  • Larger scheduler
  • Larger load/store queues
  • Result in increased access latency
• Value-based Replay
  • If load queue scalability is a problem… who needs one!
  • Instead, re-execute load instructions a 2nd time in program order
  • Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average

Enforcing RAW dependences
Program order (Exe order):
1. (1) store A
2. (3) store ?
3. (2) load A
• Load queue contains load addresses
• Memory independence speculation
  • Hoist load above unknown store assuming it is to a different address
• Check correctness at store retirement
  • One search per store address calculation
  • If address matches, the load is squashed
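The associative check described above can be sketched as a search over the load queue each time a store address resolves. This is a simplified model, not the hardware design: `LoadEntry` and `loads_to_squash` are hypothetical names, ages are program-order sequence numbers, and store-to-load forwarding is ignored.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    age: int             # program-order sequence number (smaller = older)
    address: int         # resolved load address
    issued: bool = True  # True once the load has read its value


def loads_to_squash(load_queue, store_age, store_address):
    """Search performed when a store's address is computed: return every
    younger load that already issued to the same address."""
    return [
        ld for ld in load_queue
        if ld.issued and ld.age > store_age and ld.address == store_address
    ]
```

In the slide's example, `load A` (program order 3) executes before the unknown store (program order 2). If that store later resolves to address A, the search returns the load and it is squashed; if it resolves elsewhere, the speculation was correct.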

Enforcing memory consistency
Program order (Exe order):
Processor p1: 1. (3) load A; 2. (1) load A
Processor p2: 1. (2) store A
[Figure: raw and war edges connect the loads on p1 and the store on p2]
• Two approaches
  • Snooping: Search per incoming invalidate
  • Insulated: Search per load address calculation

Load queue implementation
[Figure: load queue built from an address CAM and a load meta-data RAM; labels: external address, store address, load address, store age, load age, squash determination, queue management, external request]
• # of write ports = load address calc width
• # of read ports = load+store address calc width (+ 1)
• Current generation designs (32-48 entries, 2 write ports, 2 (3) read ports)

Load queue scaling
• Larger instruction window => larger load queue
  • Increases access latency
  • Increases energy consumption
• Wider issue width => more read/write ports
  • Also increases latency and energy

Related work: MICRO 2003
• Park et al., Purdue
  • Extra structure dedicated to enforcing memory consistency
  • Increase capacity through segmentation
• Sethumadhavan et al., UT-Austin
  • Add set of filters summarizing contents of load queue

Keep it simple…
• Throw more hardware at the problem?
  • Need to design/implement/verify
  • Execution core is already complicated
• Load queue checks for rare errors
  • Why not move error checking away from exe?

Value-based Consistency
• Replay: access the cache a second time, cheaply!
  • Almost always cache hit
  • Reuse address calculation and translation
  • Share cache port used by stores in commit stage
• Compare: compares new value to original value
  • Squash if the values differ
• This is value prediction!
  • Predict: access cache prematurely
  • Execute: as usual
  • Verify: replay load, compare value, recover if necessary
Pipeline: IF1 IF2 D R Q S EX WB … REP CMP C
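A minimal sketch of the replay and compare steps follows, assuming the cache at commit time can be modeled as a dictionary from address to value. The function `first_squashed_load` and the load record layout are hypothetical simplifications; a real pipeline interleaves this with store commit.

```python
def first_squashed_load(loads, cache):
    """loads: program-ordered list of dicts with keys 'address' and 'value',
    where 'value' is what the premature (speculative) access returned.
    cache: dict mapping address -> value visible at replay time.
    Returns the index of the first load whose replayed value differs,
    or None if every load verifies."""
    for index, load in enumerate(loads):
        replayed = cache[load["address"]]  # replay: second cache access
        if replayed != load["value"]:      # compare: new value vs. original
            return index                   # squash this load and all younger work
    return None
```

Because the check compares values rather than addresses, a load that raced with a store writing the same value still verifies, which is where value locality can avoid a squash.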

Rules of replay
1. All prior stores must have written data to the cache
  • No store-to-load forwarding
2. Loads must replay in program order
3. If a load is squashed, it should not be replayed a second time
  • Ensures forward progress

Replay reduction
• Replay costs
  • Consumes cache bandwidth (and power)
  • Increases reorder buffer occupancy
• Can we avoid these penalties?
  • Infer correctness of certain operations
  • Four replay filters
• These are used to avoid checking our value prediction when in fact no value prediction occurred (loaded value is known to be correct)
  • Similar to “constant prediction” in initial work

No-Reorder filter
• Avoid replay if load isn’t reordered wrt other memory operations
• Can we do better?

Enforcing single-thread RAW dependencies
• No-Unresolved Store Address Filter
  • Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
  • Works for intra-processor RAW dependences
  • Doesn’t enforce memory consistency

Enforcing MP consistency
• No-Recent-Miss Filter
  • Avoid replay if there have been no cache line fills (to any address) while load in instruction window
• No-Recent-Snoop Filter
  • Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Constraint graph
• Defined for sequential consistency by Landin et al., ISCA-18
• Directed graph represents a multithreaded execution
  • Nodes represent dynamic instruction instances
  • Edges represent their transitive orders (program order, RAW, WAW, WAR)
• If the constraint graph is acyclic, then the execution is correct
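The acyclicity test can be illustrated with a small depth-first search. The function `has_cycle` and the node labels are hypothetical, and the example edges are one assumed assignment of instructions to processors, since the layout of the example figure is not recoverable from this transcript.

```python
def has_cycle(edges):
    """edges: iterable of (src, dst) pairs covering program order, RAW, WAW
    and WAR orderings. Returns True if the directed graph contains a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for successor in graph[node]:
            if color[successor] == GRAY:
                return True  # back edge: the path loops onto itself
            if color[successor] == WHITE and visit(successor):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Assumed example: P1 stores A then B; P2 loads B (new value) then A (old value).
example = [
    ("P1:ST A", "P1:ST B"),  # program order
    ("P1:ST B", "P2:LD B"),  # RAW: the load observes the store
    ("P2:LD B", "P2:LD A"),  # program order
    ("P2:LD A", "P1:ST A"),  # WAR: the load read the value before the store
]
```

With all four edges `has_cycle(example)` is true, so this execution would not be sequentially consistent; dropping the WAR edge leaves an acyclic graph.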

Constraint graph example - SC
[Figure: constraint graph for two processors (Proc 1, Proc 2) with nodes ST A, ST B, LD A, LD B numbered 1 to 4, program-order edges within each processor, and WAR and RAW edges between processors]
Cycle indicates that execution is incorrect

Anatomy of a cycle
[Figure: the same constraint graph (Proc 1, Proc 2; nodes ST A, ST B, LD A, LD B; program-order, WAR and RAW edges), annotated with "Incoming invalidate" and "Cache miss"]

Filter Summary
[Figure: filter configurations arranged along an axis from Conservative to Aggressive]
• Replay all committed loads
• No-Reorder Filter
• No-Unresolved Store/No-Recent-Snoop Filter
• No-Unresolved Store/No-Recent-Miss Filter
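The filter configurations can be read as a per-load decision about whether a replay is needed. The sketch below assumes that each combined configuration replays a load when either of its two conditions is triggered; that composition, the `LoadStatus` fields, and the policy names are my reading of the slides rather than the exact hardware logic.

```python
from dataclasses import dataclass


@dataclass
class LoadStatus:
    reordered: bool                      # issued out of order wrt other memory ops
    unresolved_prior_store: bool         # a prior store address was unknown at issue
    line_fill_in_window: bool            # a cache line fill occurred while in window
    external_invalidate_in_window: bool  # an external invalidate occurred while in window


def needs_replay(load, policy):
    """Return True if the load must be replayed at commit under `policy`."""
    if policy == "replay-all":
        return True
    if policy == "no-reorder":
        return load.reordered
    if policy == "no-unresolved-store/no-recent-miss":
        return load.unresolved_prior_store or load.line_fill_in_window
    if policy == "no-unresolved-store/no-recent-snoop":
        return load.unresolved_prior_store or load.external_invalidate_in_window
    raise ValueError(f"unknown policy: {policy}")
```

A load that was reordered but saw no unresolved stores, fills, or invalidates is replayed under the first two policies and skipped under the last two.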

Outline
• Some History
• Industry Trends
• Value-Aware Microarchitecture
• Case study: Memory Consistency [Cain, ISCA]
  • Conventional load queue microarchitecture
  • Value-based memory ordering
  • Replay-reduction heuristics
  • Performance evaluation
• Conclusions

Base machine model
PHARMsim: PowerPC execute-at-execute simulator with OOO cores and aggressive split-transaction snooping coherence protocol
Out-of-order execution core:
  • 5 GHz, 15-stage, 8-wide pipeline
  • 256 entry reorder buffer, 128 entry load/store queue
  • 32 entry issue queue
Functional units (latency):
  • 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4)
  • 4 L1 Dcache load ports in OoO window
  • 1 L1 Dcache load/store port at commit
Front-end:
  • Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB
Memory system (latency):
  • 32k DM L1 icache (1), 32k DM L1 dcache (1)
  • 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines
  • Memory (400 cycle/100 ns best-case latency, 10 GB/s BW)
  • Stride-based prefetcher modeled after Power4

%L1 DCache bandwidth increase
[Figure: % L1 DCache bandwidth increase for (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter; benchmark groups: SPECint2000, SPECfp2000, commercial, multiprocessor]
On average, 3.4% bandwidth overhead using no-recent-snoop filter

Value-based replay performance (relative to constrained load queue)
[Figure: benchmark groups: SPECint2000, SPECfp2000, commercial, multiprocessor]
Value-based replay 8% faster on avg than baseline using 16-entry ld queue

Does value locality help?
• Not much…
• Value locality does avoid memory ordering violations
  • 59% single-thread violations avoided
  • 95% consistency violations avoided
• But these violations rarely occur
  • ~1 single-thread violation per 100 million instr
  • 4 consistency violations per 10,000 instr

What About Power?
• Simple power model:
  $$\text{Energy} = N_{\text{replays}} \left( E_{\text{per cache access}} + E_{\text{per word comparison}} \right) + \text{replay overhead} - \left( E_{\text{per ldq search}} \times N_{\text{ldq searches}} \right)$$
• Empirically: 0.02 replay loads per committed instruction
• If load queue CAM energy/insn > 0.02 × energy expenditure of a cache access and comparison: value-based implementation saves power!
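The break-even condition can be checked with a few lines of arithmetic. Only the 0.02 replays per committed instruction comes from the slide; the function name `replay_saves_energy`, its parameters, and the nanojoule figures in the usage note are placeholder assumptions, not measurements.

```python
def replay_saves_energy(ldq_energy_per_insn, e_cache_access, e_word_compare,
                        replays_per_insn=0.02, replay_overhead_per_insn=0.0):
    """Compare, per committed instruction, the load queue CAM search energy
    that is removed against the energy that replay adds.
    All energies must use the same unit (for example nJ)."""
    replay_energy_per_insn = (
        replays_per_insn * (e_cache_access + e_word_compare)
        + replay_overhead_per_insn
    )
    return ldq_energy_per_insn > replay_energy_per_insn
```

With made-up values of 1.0 nJ per cache access and 0.1 nJ per comparison, replay costs about 0.022 nJ per instruction, so it saves energy whenever the load queue search costs more than that per instruction, for example `replay_saves_energy(0.05, 1.0, 0.1)`.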

Value-based replay Pros/Cons
+ Eliminates associative lookup hardware
  • Load queue becomes simple FIFO
  • Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction
  • Enforces dependence order consistency constraint [MICRO ’01]
- Requires additional pipeline stages
- Requires additional cache datapath for loads

Conclusions
• Value prediction
  • Continues to generate lots of academic interest
  • Little industry uptake so far
    • Historical trends (narrow deep pipelines) minimized benefit
    • Sea-change underway on this front
  • Value prediction will be revisited in quest for ILP
    • Power consumption is key!
• Value-Aware Microarchitecture
  • Multiple fertile areas of research
  • Some has found its way into products
• Are we done yet? No!
• Questions?

Backups

Caveat: Memory Dependence Prediction
• Some predictors train using the conflicting store (e.g. store-set predictor)
  • Replay mechanism is unable to pinpoint conflicting store
• Fair comparison:
  • Baseline machine: store-set predictor w/ 4k entry SSIT and 128 entry LFST
  • Experimental machine: Simple 21264-style dependence predictor w/ 4k entry history table

Load queue search energy
[Figure: access energy (nJ, 0 to 3.5) vs. number of entries (16, 32, 64, 128, 256, 512); series: rd6wr6, rd4wr4, rd2wr2]
Based on 0.09 micron process technology using Cacti v. 3.2

Load queue search latency
[Figure: access latency (ns, 0 to 1.4) vs. number of entries (16, 32, 64, 128, 256, 512); series: rd6wr6, rd4wr4, rd2wr2]
Based on 0.09 micron process technology using Cacti v. 3.2

Benchmarks
• MP (16-way)
  • Commercial workloads (SPECweb, TPC-H)
  • SPLASH2 scientific application (ocean)
  • Error bars signify 95% statistical confidence
• UP
  • 3 from SPECfp2000
    • Selected due to high reorder buffer utilization
    • apsi, art, wupwise
  • 3 commercial
    • SPECjbb2000, TPC-B, TPC-H
  • A few from SPECint2000

Life cycle of a load
[Figure: OoO execution window and load queue holding loads and stores with unresolved addresses (LD ?, ST ?); addresses resolve to LD A and ST A; "Blam!"]

Performance relative to unconstrained load queue
Good news: Replay w/ no-recent-snoop filter only 1% slower on average

Reorder-Buffer Utilization

Why focus on load queue?
• Load queue has different constraints than store queue
  • More loads than stores (30% vs 14% dynamic instructions)
  • Load queue searched more frequently (consuming more power)
  • Store-forwarding logic performance critical
• Many non-scalable structures in OoO processor
  • Scheduler
  • Physical register file
  • Register map

Prior work: formal memory model representations
• Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13)
• Acyclic graph representation (Landin et al., ISCA-18)
• Modeling memory operation as a series of sub-operations (Collier, RAPA)
• Acyclic graph + sub-operations (Adve, thesis)
• Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)

Some History
“Classical” value prediction
• Independently invented by 4 groups in 1995-1996
1. AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, inv. March 1995
2. Technion: F. Gabbay and A. Mendelson, inv. sometime 1995, TR 11/96, US patent Sep 1997
3. CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct. 1995, ASPLOS paper submitted March 1996
4. Wisconsin: Y. Sazeides, J. Smith, Summer 1996

From: [email protected] (Larry Widigen)
Received: by charlie (4.1) id AA00850; Wed, 14 Aug 96 10:33:12 PDT
Date: Wed, 14 Aug 96 10:33:12 PDT
Message-Id: <9608141733.AA00850@charlie>
To: [email protected]
Subject: www location of paper
Status: RO
X-Status:
X-Keywords:
X-UID: 1

I would like to review your forthcoming paper, "Value Locality and Load Value Prediction." Could you provide a www address where it resides? I am curious as to its contents since its title suggests that it may discuss an area where I have done some work.

Cordially,
Larry Widigen
Manager of Processor Development