A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,
description
Transcript of A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,
![Page 1: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/1.jpg)
1A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
A Scalable Approach to A Scalable Approach to
Thread-Level SpeculationThread-Level Speculation
J. Gregory Steffan, Christopher B. Colohan, J. Gregory Steffan, Christopher B. Colohan,
Antonia Zhai, and Todd C. MowryAntonia Zhai, and Todd C. Mowry
Computer Science DepartmentComputer Science Department
Carnegie Mellon UniversityCarnegie Mellon University
(Appeared in ISCA 2000.)(Appeared in ISCA 2000.)
![Page 2: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/2.jpg)
2A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Multithreaded Machines Are EverywhereMultithreaded Machines Are Everywhere
How can we use them? Parallelism!
C
P
C
C
P
C
C
P
C
Shared Memory
SUN MAJC,IBM Power4 ALPHA 21464 Dual Pentium SGI Origin
Threads
C
P
C
C
P
C
Shared MemoryC
C
P
C
P
![Page 3: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/3.jpg)
3A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Automatic ParallelizationAutomatic Parallelization
Proving independence of threads is hard:Proving independence of threads is hard:– complex control flowcomplex control flow
– complex data structurescomplex data structures
– pointers, pointers, pointerspointers, pointers, pointers
– run-time inputsrun-time inputs
How can we make the compiler’s job feasible?How can we make the compiler’s job feasible?
Thread-Level Speculation (TLS)
![Page 4: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/4.jpg)
4A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
ExampleExample
while (...){
x = hash[index1]; … hash[index2] = y; ...
}
Time= hash[3]…hash[10] =…
Processor
= hash[19]…hash[21] =…
= hash[33]…hash[30] =…
= hash[10]…hash[25] =…
![Page 5: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/5.jpg)
5A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Example of Thread-Level SpeculationExample of Thread-Level Speculation
Time
= hash[3]…hash[10] =…
Epoch 1
= hash[19]…hash[21] =…
Epoch 2
= hash[33]…hash[30] =…
Epoch 3
= hash[10]…hash[25] =…
Epoch 4
Processor Processor Processor Processor
![Page 6: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/6.jpg)
6A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Example of Thread-Level SpeculationExample of Thread-Level Speculation
Time
= hash[3]…hash[10] =…
Epoch 1= hash[19]…hash[21] =…
Epoch 2= hash[33]…hash[30] =…
Epoch 3= hash[10]…hash[25] =…
Epoch 4
Processor Processor Processor Processor
Violation!
![Page 7: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/7.jpg)
7A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Example of Thread-Level SpeculationExample of Thread-Level Speculation
Time
= hash[3]…hash[10] =…commit?
Epoch 1= hash[19]…hash[21] =…commit?
Epoch 2= hash[33]…hash[30] =…commit?
Epoch 3= hash[10]…hash[25] =…commit?
Epoch 4
Processor Processor Processor Processor
Violation!
![Page 8: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/8.jpg)
8A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Example of Thread-Level SpeculationExample of Thread-Level Speculation
Time
= hash[3]…hash[10] =…commit?
Epoch 1= hash[19]…hash[21] =…commit?
Epoch 2= hash[33]…hash[30] =…commit?
Epoch 3= hash[10]…hash[25] =…commit?
Epoch 4
Processor Processor Processor Processor
Violation!
= hash[10]…hash[25] =…commit?
Epoch 4Retry
![Page 9: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/9.jpg)
9A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Goals of Our ApproachGoals of Our Approach
1) Handle arbitrary memory accesses1) Handle arbitrary memory accesses– i.e. not just array referencesi.e. not just array references
2) Preserve performance of non-speculative workloads2) Preserve performance of non-speculative workloads– keep hardware support minimal and simplekeep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture3) Apply to any scale of multithreaded architecture– CMPs, SMT processors, more traditional MPsCMPs, SMT processors, more traditional MPs
effective, simple, and scalable TLS
![Page 10: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/10.jpg)
10A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Overview of Our ApproachOverview of Our Approach
System requirements:System requirements:
1) Detect data dependence violations1) Detect data dependence violations • extend invalidation-based cache coherenceextend invalidation-based cache coherence
2) Buffer speculative modifications2) Buffer speculative modifications• use the caches as speculative buffersuse the caches as speculative buffers
coherence already works at a variety of scales
hence our scheme is also scalable
![Page 11: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/11.jpg)
11A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Related SchemesRelated Schemes
• Wisconsin (Multiscalar, Trace Processor)Wisconsin (Multiscalar, Trace Processor)
• Stanford (Hydra)Stanford (Hydra)
• U.P. Catalunya (Speculative Multithreading)U.P. Catalunya (Speculative Multithreading)
• Intel/U. Portland (Dynamic Multithreading)Intel/U. Portland (Dynamic Multithreading)
• Illinois at U.C. (I-ACOMA)Illinois at U.C. (I-ACOMA)
our approach seamlessly scales both up and down
![Page 12: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/12.jpg)
12A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
OutlineOutline
Details of our ApproachDetails of our Approach– life cycle of an epochlife cycle of an epoch
– speculative coherence speculative coherence
– what happens at commit timewhat happens at commit time
– forwarding data between epochsforwarding data between epochs
• PerformancePerformance
• ConclusionsConclusions
![Page 13: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/13.jpg)
13A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Life Cycle of an EpochLife Cycle of an Epoch
Spawned
BecomesSpeculative
Commit?
Init
SpeculativeWork
Wait to beHomefree?
Slow Commit:
Fast Commit:
Complete,Pass Homefree
Time
![Page 14: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/14.jpg)
14A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Epoch NumbersEpoch Numbers
Represent a partial orderingRepresent a partial ordering– signed-compare sequence numbers if TIDs matchsigned-compare sequence numbers if TIDs match
• allows for wrap-aroundallows for wrap-around
– otherwise the epochs are unorderedotherwise the epochs are unordered• from independent programs from independent programs • from independent chains of speculation within one from independent chains of speculation within one
programprogram
Thread Identifier (TID) Sequence Number
![Page 15: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/15.jpg)
15A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Thread ModelSpeculative Thread ModelRound-robin schedule of epochs to processorsRound-robin schedule of epochs to processors
– not a requirement of our scheme, just for conveniencenot a requirement of our scheme, just for convenience
Each epoch spawns the next Each epoch spawns the next – through a lightweight fork instruction (10 cycles)through a lightweight fork instruction (10 cycles)
Violations detected through pollingViolations detected through polling– each epoch runs to completion before detecting failed each epoch runs to completion before detecting failed
speculation and restartingspeculation and restarting
Violation chainingViolation chaining– if an epoch suffers a violation, we squash all logically-later if an epoch suffers a violation, we squash all logically-later
epochsepochs
many possibilities to be evaluated in future work
![Page 16: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/16.jpg)
16A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Preserving CorrectnessPreserving Correctness
Speculation must fail whenever speculative state is lostSpeculation must fail whenever speculative state is lost
– eg., replacement of a speculative line, ORB overfloweg., replacement of a speculative line, ORB overflow
Any exceptions are suppressed until epoch is homefreeAny exceptions are suppressed until epoch is homefree
– eg., divide by zero, segfaulteg., divide by zero, segfault
Polling violation detection must avoid infinite loopingPolling violation detection must avoid infinite looping
– requires a poll inside each looprequires a poll inside each loop
No system calls while speculative (for now)No system calls while speculative (for now)
ensures original sequential semantics are preserved
![Page 17: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/17.jpg)
17A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Life Cycle of an EpochLife Cycle of an Epoch
Spawned
BecomesSpeculative
Commit?
SpeculativeCoherence
Complete,Pass Homefree
Time
to Squashor Commit
Mechanisms
![Page 18: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/18.jpg)
18A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
MESI Coherence ExampleMESI Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Thread A:
CacheProcessor
-Tag
InvalidState
-Data
Thread B:
![Page 19: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/19.jpg)
19A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
MESI Coherence ExampleMESI Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
CacheProcessor
-Tag
InvalidState
-Data
Load X
Read
Thread A: Thread B:
![Page 20: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/20.jpg)
20A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
MESI Coherence ExampleMESI Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
CacheProcessor
XTag
Excl.State
2Data
Fill
Load XThread A: Thread B:
Read
![Page 21: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/21.jpg)
21A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
MESI Coherence ExampleMESI Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
CacheProcessor
XTag
Excl.State
2Data
Read-Exclusive
Load XStore X=3
read-exclusive invalidates all other copies
Thread A: Thread B:
![Page 22: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/22.jpg)
22A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
MESI Coherence ExampleMESI Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
CacheProcessor
-Tag
InvalidState
-Data
Load XStore X=3
read-exclusive invalidates all other copies
Thread A: Thread B:
Read-Exclusive Invalidation
![Page 23: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/23.jpg)
23A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
MESI Coherence ExampleMESI Coherence Example
Shared Memory (X )
CacheProcessor
XTag
DirtyState
3Data
CacheProcessor
-Tag
InvalidState
-Data
Load XStore X=3
the state ‘dirty’ implies exclusiveness
Fill
Thread A: Thread B:
InvalidationRead-Exclusive
![Page 24: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/24.jpg)
24A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Highlights of our scheme:Highlights of our scheme:– detection of a data dependence violationdetection of a data dependence violation
– speculatively modifiedspeculatively modified andand sharedshared cache lines cache lines
Epoch5: Epoch6:Load X
Epoch4:
Store X=3Load X
![Page 25: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/25.jpg)
25A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch5:
CacheProcessor
-Tag
InvalidState
-Data
Epoch6:Load X
Read
![Page 26: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/26.jpg)
26A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch5:
CacheProcessor
XTag
Excl.State
2Data
Epoch6:Load X
Fill
Spec.Loaded
track which lines are speculatively loaded
Read
![Page 27: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/27.jpg)
27A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch5:
CacheProcessor
XTag
Excl.State
2Data
Epoch6:Load X
Spec.Loaded
Store X=3
Sp Read-Ex (epoch5)
speculative msgs piggyback epoch number
![Page 28: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/28.jpg)
28A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch5:
CacheProcessor
XTag
Excl.State
2Data
Epoch6:Load X
Spec.Loaded
Store X=3
Sp Inv (epoch5)
epoch5 < epoch6, and speculatively loaded
Sp Read-Ex (epoch5)
![Page 29: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/29.jpg)
29A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch5:
CacheProcessor
-Tag
InvalidState
-Data
Epoch6:Load X
Store X=3 speculation failed!
speculation fails for epoch 6
Sp Inv (epoch5)Sp Read-Ex (epoch5)
![Page 30: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/30.jpg)
30A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
XTag
Excl.State
3Data
Epoch5: Store X=3
Fill
Spec.Modified
track which lines are speculatively modified
CacheProcessor
-Tag
InvalidState
-Data
Epoch6:Load X speculation
failed!
Sp Inv (epoch5)Sp Read-Ex (epoch5)
![Page 31: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/31.jpg)
31A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Highlights of our scheme:Highlights of our scheme:– detection of a data dependence violationdetection of a data dependence violation
– speculatively modifiedspeculatively modified andand sharedshared cache lines cache lines
Epoch5: Epoch6:Load X
Epoch4:
Store X=3Load X
![Page 32: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/32.jpg)
32A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch4:
CacheProcessor
XTag
Excl.State
3Data
Epoch5: Store X=3
Spec.Modified
![Page 33: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/33.jpg)
33A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch4:
CacheProcessor
XTag
Excl.State
3Data
Epoch5: Store X=3
Spec.Modified
Load X
Read
![Page 34: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/34.jpg)
34A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
-Tag
InvalidState
-Data
Epoch4:
CacheProcessor
XTagState
3Data
Epoch5: Store X=3
Spec.Modified
Load X
notify shared
Shared
both speculatively modified and shared!
Read
![Page 35: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/35.jpg)
35A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculative Coherence ExampleSpeculative Coherence Example
Shared Memory (X=2)
CacheProcessor
XTagState
2Data
Epoch4:
CacheProcessor
XTagState
3Data
Epoch5: Store X=3
Spec.Modified
Load X
Shared
multiple versions of the same cache line
Fill
SharedSpec.
LoadedRead notify shared
![Page 36: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/36.jpg)
36A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Summary of New Speculative Line StateSummary of New Speculative Line State
New cache line state:New cache line state:– has it been has it been speculatively loadedspeculatively loaded??
• detect dependence violationsdetect dependence violations
– has it been has it been speculatively modifiedspeculatively modified??• buffer speculative modificationsbuffer speculative modifications
– is it in a is it in a speculative speculative sharedshared or or exclusiveexclusive state? state?• important performance optimizationsimportant performance optimizations
What if a speculative cache line is replaced?What if a speculative cache line is replaced?– speculation fails for that epochspeculation fails for that epoch
![Page 37: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/37.jpg)
37A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Implementation of Speculative StateImplementation of Speculative State
CacheProcessor
TagState Data-- --- -
-- --- -
![Page 38: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/38.jpg)
38A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Implementation of Speculative StateImplementation of Speculative State
CacheProcessor
State Data- -- -
- -
Tag--
--- -
SL--
--
SM--
--
SpeculativelyModified
SpeculativelyLoaded
modest amount of extra space
![Page 39: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/39.jpg)
39A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Life Cycle of an EpochLife Cycle of an Epoch
Spawned
BecomesSpeculative
Commit?
SpeculativeCoherence
Complete,Pass Homefree
Time
to Squashor Commit
Mechanisms Squash
![Page 40: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/40.jpg)
40A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation FailsWhen Speculation Fails
CacheProcessor
State DataSp Ex *Sp Sh *
Sp Ex *
Tag**
**Sp Sh *
SL11
01
SM00
11
FlashReset
![Page 41: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/41.jpg)
41A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation FailsWhen Speculation Fails
CacheProcessor
State DataExcl *
*
Sp Ex *
Tag**
** *
SL00
00
SM00
11
Shared
Sp Sh
If Set thenInvalidate;
Flash Reset
![Page 42: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/42.jpg)
42A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation FailsWhen Speculation Fails
CacheProcessor
State DataExcl *
*
Invalid *
Tag**
**Invalid *
SL00
00
SM00
00
quick bit operation
Shared
![Page 43: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/43.jpg)
43A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Life Cycle of an EpochLife Cycle of an Epoch
Spawned
BecomesSpeculative
Commit?
SpeculativeCoherence
Complete,Pass Homefree
Time
to Squashor Commit
Mechanisms
Commit
![Page 44: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/44.jpg)
44A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataSp Ex *Sp Sh *
Sp Ex *
Tag**
**Sp Sh *
SL11
01
SM00
11
FlashReset
![Page 45: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/45.jpg)
45A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataExcl *
*
Sp Ex *
Tag**
**Sp Sh *
SL00
00
SM00
11
SharedSM & Exclusive:Become Dirty
![Page 46: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/46.jpg)
46A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataExcl *
*
Sp Ex *
Tag**
**Sp Sh *
SL00
00
SM00
11
SharedSM & Shared:
Need ExclusiveAccess
want to avoid searching entire cache
![Page 47: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/47.jpg)
47A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataExcl *
*
Sp Ex *
Tag**
*XSp Sh *
SL00
00
SM00
11
Shared
ownership required buffer (ORB)
--X
ORB
![Page 48: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/48.jpg)
48A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataExcl *
*
Sp Ex *
Tag**
*XSp Sh *
SL00
00
SM00
11
Shared
Upgrade-Request (X)
--X
ORB
![Page 49: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/49.jpg)
49A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataExcl *
*
Sp Ex *
Tag**
*XSp Sh *
SL00
00
SM00
11
Shared
Ack (X)
---
ORB
If SM,Become Dirty;
Flash Reset
Upgrade-Request (X)
![Page 50: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/50.jpg)
50A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
When Speculation SucceedsWhen Speculation Succeeds
CacheProcessor
State DataExcl *
*
Dirty *
Tag**
*XDirty *
SL00
00
SM00
00
Shared ---
ORB
flush the ORB, then quick bit operations
![Page 51: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/51.jpg)
51A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Forwarding Data Between EpochsForwarding Data Between Epochs
• predictable dependences cause frequent violationspredictable dependences cause frequent violations
• compiler inserts wait-signal synchronizationcompiler inserts wait-signal synchronization
Store XLoad X
synchronize to avoid violations
Wait
ForwardingWith
Store XSignal
Load X
![Page 52: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/52.jpg)
52A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Speculation in a Shared CacheSpeculation in a Shared Cache
Why?Why?
1) Shared-cache multithreaded architectures1) Shared-cache multithreaded architectures• eg. eg. simultaneous multithreadingsimultaneous multithreading
2) Context switch to another chain of speculation2) Context switch to another chain of speculation
3) Start new epoch while current epoch waits to commit3) Start new epoch while current epoch waits to commit
How?How?
replicate the speculative context
![Page 53: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/53.jpg)
53A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Support for Speculation in a Shared CacheSupport for Speculation in a Shared Cache
replicate the speculative context
CacheProcessor
State Data- -
-
- -
Tag--
--- -
SL--
--
SM--
--
-
ORB
![Page 54: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/54.jpg)
54A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Support for Speculation in a Shared CacheSupport for Speculation in a Shared Cache
CacheProcessor
State Data- -
-
- -
Tag--
--- -
SL--
--
SM--
--
-
ORBSL--
--
SM--
--
ORB
replicate the speculative context
![Page 55: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/55.jpg)
55A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
OutlineOutline
• Details of our ApproachDetails of our Approach
PerformancePerformance– simulation infrastructuresimulation infrastructure
– single-chip multiprocessor performancesingle-chip multiprocessor performance
– scaling beyond chip boundariesscaling beyond chip boundaries
• ConclusionsConclusions
![Page 56: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/56.jpg)
56A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Simulation InfrastructureSimulation Infrastructure
Compiler system and tools based on SUIFCompiler system and tools based on SUIF– help analyze dependences, insert synchronizationhelp analyze dependences, insert synchronization
– produce produce MIPSMIPS binaries containing TLS primitives binaries containing TLS primitives
Benchmarks (all run to completion)Benchmarks (all run to completion)– buk, compress95, ijpeg, equakebuk, compress95, ijpeg, equake
SimulatorSimulator– superscalar, similar to superscalar, similar to MIPS R10KMIPS R10K– models all bandwidth and contention models all bandwidth and contention
detailed simulation!C
C
P
C
P
Crossbar
![Page 57: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/57.jpg)
57A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Pipeline ParametersPipeline Parameters
Issue WidthIssue Width 44Functional UnitsFunctional Units 2Int, 2FP, 1Mem, 1Bra2Int, 2FP, 1Mem, 1BraReorder Buffer SizeReorder Buffer Size 3232Integer MultiplyInteger Multiply 12 cycles12 cyclesInteger DivideInteger Divide 76 cycles76 cyclesAll Other IntegerAll Other Integer 1 cycle1 cycleFP DivideFP Divide 15 cycles15 cyclesFP Square RootFP Square Root 20 cycles20 cyclesAll Other FPAll Other FP 2 cycles2 cyclesBranch PredictionBranch Prediction GShare (16KB, 8 history bits)GShare (16KB, 8 history bits)
![Page 58: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/58.jpg)
58A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Memory ParametersMemory ParametersCache Line SizeCache Line Size 32B32BInstruction CacheInstruction Cache 32KB, 4-way set-assoc32KB, 4-way set-assocData CacheData Cache 32KB, 2-way set-assoc, 2 banks32KB, 2-way set-assoc, 2 banksUnified Secondary CacheUnified Secondary Cache 2MB, 4-way set-assoc, 4 banks 2MB, 4-way set-assoc, 4 banks Miss HandlersMiss Handlers 8 for data, 2 for insts8 for data, 2 for instsCrossbar InterconnectCrossbar Interconnect 8B per cycle per bank8B per cycle per bankMinimum Miss Latency to Minimum Miss Latency to Secondary CacheSecondary Cache
10 cycles10 cycles
Minimum Miss Latency to Local Minimum Miss Latency to Local MemoryMemory
75 cycles75 cycles
Main Memory BandwidthMain Memory Bandwidth 1 access per 20 cycles1 access per 20 cyclesIntra-Chip Communication LatencyIntra-Chip Communication Latency 10 cycles10 cycles
Inter-Chip Communication LatencyInter-Chip Communication Latency 200 cycles200 cycles
![Page 59: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/59.jpg)
59A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Benchmark Details: Regions and EpochsBenchmark Details: Regions and Epochs
ApplicationApplicationUnrolling Unrolling FactorFactor
Avg. Insts. Avg. Insts. per Epochper Epoch
Parallel Parallel CoverageCoverage
bukbuk 88 81.081.0 22.8%22.8%88 135.0135.0 33.8%33.8%
compress95compress95 11 196.7196.7 24.6%24.6%11 240.4240.4 22.7%22.7%
ijpegijpeg 3232 1467.91467.9 8.2%8.2%11 80.880.8 2.2%2.2%11 84.084.0 5.0%5.0%11 100.3100.3 6.7%6.7%
equakeequake 11 2925.52925.5 39.3%39.3%
![Page 60: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/60.jpg)
60A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Performance on a 4-Processor CMPPerformance on a 4-Processor CMP
ApplicationApplication
Overall Overall Region Region
SpeedupSpeedupParallel Parallel
CoverageCoverageProgram Program SpeedupSpeedup
bukbuk 2.262.26 56.6%56.6% 1.461.46
compress95compress95 1.271.27 47.3%47.3% 1.121.12
equakeequake 1.771.77 39.3%39.3% 1.211.21
ijpegijpeg 1.941.94 22.1%22.1% 1.081.08
program speedups are limited by coverage
![Page 61: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/61.jpg)
61A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Performance on a 4-Processor CMPPerformance on a 4-Processor CMP
2.26
1.27
1.771.94
0
0.5
1
1.5
2
2.5
buk compress95 equake ijpeg
Spee
dup
Region
56.6% 47.3% 39.3% 22.1%Parallel Coverage:
![Page 62: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/62.jpg)
62A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Performance on a 4-Processor CMPPerformance on a 4-Processor CMP
2.26
1.27
1.771.94
1.46
1.12 1.211.08
0
0.5
1
1.5
2
2.5
buk compress95 equake ijpeg
Spee
dup Region
Program
program speedups are limited by coverage
56.6% 47.3% 39.3% 22.1%Parallel Coverage:
![Page 63: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/63.jpg)
63A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Varying the Number of ProcessorsVarying the Number of ProcessorsN
orm
aliz
ed R
egio
n Ex
ecut
ion
Tim
e
buk and equake are memory-bound
compress95 and ijpeg are computation-intensive
![Page 64: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/64.jpg)
64A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Varying the Number of ProcessorsVarying the Number of ProcessorsN
orm
aliz
ed R
egio
n Ex
ecut
ion
Tim
e
buk and equake scale well
passing the homefree token is not a bottleneck
![Page 65: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/65.jpg)
65A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Performance of the ORB (on a 4-CMP) Performance of the ORB (on a 4-CMP)
ApplicationApplication
Average Average Flush Flush
Latency Latency (cycles)(cycles)
ORB Size (entries)ORB Size (entries)
AverageAverage MaximumMaximumbukbuk 13.9513.95 2.382.38 99compress95compress95 0.040.04 0.010.01 88equakeequake 0.130.13 0.040.04 1212ijpegijpeg 1.061.06 0.170.17 55
a small ORB is sufficient
![Page 66: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/66.jpg)
66A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Tracking Dependences Per Cache LineTracking Dependences Per Cache Line
Problem:Problem:– analagous to false sharing: false violationsanalagous to false sharing: false violations
– write-after-write dependences also cause violationswrite-after-write dependences also cause violations• but not a true dependence!but not a true dependence!
Solution:Solution:– track dependences at a word granularitytrack dependences at a word granularity
– have an SM and SL bit per word in each cache linehave an SM and SL bit per word in each cache line
is per-word state worth the extra overhead?
![Page 67: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/67.jpg)
67A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Tracking Dependences Per Cache LineTracking Dependences Per Cache Line
Does it do any good?Does it do any good?– not for our 4 benchmarksnot for our 4 benchmarks
– adding this support showed no improvementadding this support showed no improvement
Why not?Why not?– buk and equake have random access patternsbuk and equake have random access patterns
– compress95 is heavily synchronizedcompress95 is heavily synchronized
– ijpeg is unrolled to avoid false sharingijpeg is unrolled to avoid false sharing
existing techniques for avoiding false sharing can address this problem
![Page 68: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/68.jpg)
68A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Scaling Beyond Chip BoundariesScaling Beyond Chip Boundaries
Shared Memory
C
C
P
C
P
Crossbar
C
C
P
C
P
Crossbar
Node Node
200 Cycles
simulate architectures with 1, 2 and 4 nodes
![Page 69: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/69.jpg)
69A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Scaling Beyond Chip BoundariesScaling Beyond Chip BoundariesN
orm
aliz
ed R
egio
n Ex
ecut
ion
Tim
e
multi-chip systems benefit from TLS
![Page 70: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/70.jpg)
70A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
Scaling Beyond Chip BoundariesScaling Beyond Chip BoundariesN
orm
aliz
ed R
egio
n Ex
ecut
ion
Tim
e
our scheme scales well
![Page 71: A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan,](https://reader036.fdocuments.in/reader036/viewer/2022070502/56814bec550346895db8c842/html5/thumbnails/71.jpg)
71A Scalable Approach to Thread-Level Speculation SteffanCarnegie Mellon
ConclusionsConclusions
The overheads of our scheme are low:The overheads of our scheme are low:– mechanisms to squash or commit are not a bottleneckmechanisms to squash or commit are not a bottleneck
– per-word speculative state is not always necessaryper-word speculative state is not always necessary
It offers compelling performance improvements:It offers compelling performance improvements:– program speedups from 8% to 46% on a 4-processor program speedups from 8% to 46% on a 4-processor
CMPCMP
– program speedups up to 75% on multi-chip architecturesprogram speedups up to 75% on multi-chip architectures
It is scalable:It is scalable:– coherence provides elegant data dependence trackingcoherence provides elegant data dependence tracking
seamless TLS on a wide range of architectures