More on Thread Level Speculation
Anthony Gitter, Dafna Shahaf, Or Sheffet
Thread Level Speculation (TLS)
A technique for automatic parallelization.
• Run threads in parallel, but in a speculative state.
• Check for violations.
• Commit upon successful completion.
• Squash when detecting a violation.
  – Propagate the squash onwards.
  – Re-run the thread.
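The run / check / commit / squash cycle above can be sketched as a small software simulation. This is only an illustration under stated assumptions, not the hardware mechanism: the names `SpecThread` and `run_tls` are hypothetical, the "parallel" phase is simulated by letting every thread run against the same memory snapshot, and violations are checked lazily at commit time. Because stores are never forwarded between speculative threads in this model, a squash does not need to cascade.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class SpecThread:
    """One speculative thread; a lower `order` means earlier in program order."""
    order: int
    body: Callable[["SpecThread", Dict[str, int]], None]
    reads: Set[str] = field(default_factory=set)          # addresses loaded from memory
    writes: Dict[str, int] = field(default_factory=dict)  # buffered speculative stores

    def run(self, memory: Dict[str, int]) -> None:
        """(Re-)execute the body from scratch, discarding old speculative state."""
        self.reads.clear()
        self.writes.clear()
        self.body(self, memory)

    def load(self, memory: Dict[str, int], addr: str) -> int:
        if addr in self.writes:        # reading our own store: no dependence
            return self.writes[addr]
        self.reads.add(addr)
        return memory.get(addr, 0)

    def store(self, addr: str, value: int) -> None:
        self.writes[addr] = value      # stays private until commit


def run_tls(threads, memory):
    """Speculate, then commit in program order. Returns the number of squashes."""
    threads = sorted(threads, key=lambda t: t.order)
    snapshot = dict(memory)
    for t in threads:                  # simulated parallel phase
        t.run(snapshot)
    squashes = 0
    for i, t in enumerate(threads):
        memory.update(t.writes)        # commit the oldest remaining thread
        for later in threads[i + 1:]:
            if later.reads & set(t.writes):   # later thread read a stale value
                later.run(memory)             # squash and re-run
                squashes += 1
    return squashes
```

For example, with one thread that increments `x` and a later thread that stores `2 * x` into `y`, the later thread is squashed once and the final memory matches serial execution.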
Mechanism of TLS
1. Managing speculative state.
2. Disambiguation: checking addresses for violating dependencies
  – Eager vs. Lazy
3. Upon commit
  – Broadcast (Everybody? Relevant?)
  – Invalidate/update of other threads
  – Leave speculative state
4. Upon squash
  – Broadcast
  – Invalidate changes for this thread
  – Re-run
Done at the hardware level. Involves the cache. Simple. Fast.
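The eager vs. lazy distinction in step 2 comes down to when the address check runs. The sketch below is a software illustration only; the class `Tracker` and its methods are hypothetical names, thread ids are assumed to follow program order, and loads that follow a thread's own store to the same address are not treated specially.

```python
class Tracker:
    """Records loads and stores per thread; thread ids follow program order."""

    def __init__(self, eager):
        self.eager = eager
        self.loads = {}    # thread id -> set of addresses loaded
        self.stores = {}   # thread id -> set of addresses stored

    def load(self, tid, addr):
        self.loads.setdefault(tid, set()).add(addr)

    def store(self, tid, addr):
        """Eager mode reports violators immediately; lazy mode reports nothing yet."""
        self.stores.setdefault(tid, set()).add(addr)
        return self._violators(tid, {addr}) if self.eager else []

    def commit(self, tid):
        """Lazy check: compare everything this thread stored against later loads."""
        return self._violators(tid, self.stores.get(tid, set()))

    def _violators(self, tid, addrs):
        # A later thread that already loaded one of these addresses read stale data.
        return sorted(t for t, loaded in self.loads.items()
                      if t > tid and loaded & addrs)
```

With `eager=True`, `store` on an address that a later thread has already loaded returns that thread right away; with `eager=False`, the same violation only shows up when `commit` is called.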
Scenarios
• Thread attributes:
  – Length
  – Memory accesses
  – Dependences

Length  Accesses  Depend.  Outcome
?       ?         Many     Serial
?       ?         0        Easily parallel
Short   Many      Few      TLS costly
Short   Few       Few      TLS works
Long    Few       Few      TLS costly
When is TLS Too Costly?
• “Too much data” scenario
  – Thread touches too many addresses.
  – Bulk Disambiguation of Speculative Threads in Multiprocessors (Ceze, Tuck, Cascaval, Torrellas).
• “Too much time” scenario
  – Execution involves many instructions (e.g. database transactions).
  – Tolerating Dependences Between Large Speculative Threads Via Sub-Threads (Colohan, Ailamaki, Steffan, Mowry).
Too Many Addresses – Solution 1
Each thread maintains a bitwise mask of the cache.
• Flip bit on when touching an address.
• Upon completion, check addresses you and others touched. (Lazy)
• Commit / Squash: send mask.
• Invalidating/replacing/changing address state in cache: use mask.
All bitwise operations. Very simple!
Infeasible for size reasons (won’t scale).
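A minimal sketch of the exact-mask idea, assuming one bit per address and a single "touched" mask per thread as the slide describes. The names `TouchMask`, `addresses_in` and `invalidate` are hypothetical. Note that the mask needs as many bits as there are addresses, which is the scaling problem mentioned above.

```python
class TouchMask:
    """Exact record of touched addresses: bit i is set once address i is touched."""

    def __init__(self, num_addresses):
        self.num_addresses = num_addresses
        self.bits = 0                      # Python int used as a bit vector

    def touch(self, addr):
        if not 0 <= addr < self.num_addresses:
            raise IndexError(addr)
        self.bits |= 1 << addr             # flip the bit on

    def overlap(self, other):
        """Bitwise AND: addresses touched by both threads."""
        return self.bits & other.bits


def addresses_in(bits):
    """Decode a mask back into the exact list of addresses it contains."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def invalidate(cache, bits):
    """Drop every cached address named by the mask (cache is a dict addr -> value)."""
    for addr in addresses_in(bits):
        cache.pop(addr, None)
```

Because the mask is exact, a non-zero `overlap` always names real shared addresses; the price is the size of the mask.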
Solution: Hash!
Introducing BULK: hardware that hashes the address space into a signature (~2k in size).
[Figure: addresses from the address space are hashed into bit vectors, which are combined by bitwise OR into the signature.]
  0 1 0 1 0 0 0 0 1
  0 0 1 1 0 0 1 0 0
  0 1 1 1 0 0 1 0 1   (bitwise OR of the two rows above)
• Upon completion, send signature!
• Upon receiving, pull back to a superset of possible addresses.
Bulk Features:
• Separate Reading / Writing signatures.
• Committing: sending signature.
• Invalidating: pulling back signature into a superset.
• Granularity is on word level (not cache line)
  – since we map addresses
Caveat: We might see violations even if there weren't any!
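The caveat follows directly from hashing: distinct addresses can land on the same signature bit. The sketch below illustrates this with a deliberately simple placeholder hash (`addr % bits`), which is an assumption for illustration and not the encoding used by the Bulk hardware. The class `Signature` and its method names are hypothetical.

```python
class Signature:
    """Fixed-size summary of a set of addresses; may report false positives."""

    def __init__(self, bits=2048):
        self.bits = bits
        self.value = 0

    def _slot(self, addr):
        return addr % self.bits            # placeholder hash, for illustration only

    def insert(self, addr):
        self.value |= 1 << self._slot(addr)

    def may_contain(self, addr):
        """True for every inserted address, and possibly for aliases as well."""
        return bool((self.value >> self._slot(addr)) & 1)

    def intersects(self, other):
        """Bitwise AND of two signatures: non-zero means a possible conflict."""
        return (self.value & other.value) != 0

    def expand(self, candidates):
        """Pull the signature back to a superset of the addresses it encodes."""
        return [a for a in candidates if self.may_contain(a)]
```

With a 16-bit signature, addresses 7 and 23 share a bit, so a write signature holding 7 intersects a read signature holding 23 even though the threads touched different addresses. The signature never misses a real overlap in this model; it can only over-report.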
Handling Long Threads (Attempt 1)
Q: Does eliminating a data dependence help?
[Figure: two threads run in parallel. One stores *p= and *q=; the other loads =*p and =*q. The load =*p triggers a violation. Image courtesy Chris Colohan]
• Upon violation – we re-execute a long thread.
[Figure: the same threads with the *p dependence eliminated. The load =*q now triggers a violation. Image courtesy Chris Colohan]
Handling Long Threads (Attempt 2): Sub-Threads
• Sub-threads are checkpoints during thread execution
• No longer “all or nothing”
• Must be lightweight
• Help with primary and secondary violations
[Figure: a store *q= triggers a violation on a load =*q in a thread divided into sub-threads. Image courtesy Chris Colohan]
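The benefit of checkpoints can be illustrated with simple arithmetic: on a violation, the thread rewinds to the latest checkpoint at or before the violated load instead of to its start. The function names and the instruction counts used in the usage note are hypothetical, and checkpoints are assumed to be a sorted list of instruction indices beginning with 0.

```python
from bisect import bisect_right


def rollback_point(violated_load, checkpoints):
    """Latest checkpoint at or before the violated load (checkpoints sorted, first is 0)."""
    i = bisect_right(checkpoints, violated_load)
    return checkpoints[i - 1]


def wasted_work(progress, violated_load, checkpoints):
    """Instructions that must be re-executed after the violation is detected."""
    return progress - rollback_point(violated_load, checkpoints)
```

For a thread that has executed 1000 instructions when a load at instruction 730 is violated, `wasted_work(1000, 730, [0])` gives 1000 (no sub-threads, all or nothing), while `wasted_work(1000, 730, [0, 250, 500, 750])` gives 500.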
Sub-thread Implementation
• Assume CMP with shared L2
• L1 is unaware of sub-threads
  – Speculatively modified bit per cache line
• L2 performs eager violation detection
  – 2 additional bits per cache line per sub-thread
  – Replication to track different sub-thread contexts
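One way to picture the per-line, per-sub-thread bits is as two sets of (thread, sub-thread) contexts attached to each L2 line, with an eager check on every store. This is a software sketch under assumptions, not the hardware design: the class `L2Line` is a hypothetical name, thread and sub-thread ids are assumed to follow program order, line replication is not modeled, and loads that follow a thread's own store are not treated specially.

```python
class L2Line:
    """Load and store bits for one cache line, tracked per (thread, sub-thread)."""

    def __init__(self):
        self.loaded = set()   # contexts whose load bit is set
        self.stored = set()   # contexts whose store bit is set

    def load(self, thread, sub_thread):
        self.loaded.add((thread, sub_thread))

    def store(self, thread, sub_thread):
        """Set the store bit and eagerly check for violated later threads.

        Returns a dict mapping each violated thread to the earliest of its
        sub-threads that loaded this line, i.e. where it has to rewind to.
        """
        self.stored.add((thread, sub_thread))
        victims = {}
        for t, s in self.loaded:
            if t > thread:                       # only later threads can be violated
                victims[t] = min(s, victims.get(t, s))
        return victims
```

If thread 2 loaded the line only in its second sub-thread (index 1), a store by thread 1 returns `{2: 1}`, so thread 2 keeps the work of its first sub-thread.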
Sub-thread Evaluation
[Figure: stacked bar chart of normalized time (axis from 0 to 1.2), broken down into Idle CPU, Failed, Cache Miss and Busy, for the benchmarks New Order, New Order 150, Delivery, Delivery Outer, Stock Level, Payment and Order Status. Each benchmark has three bars: N = no sub-threads, S = with sub-threads, L = limit, ignoring violations. Image courtesy Chris Colohan]
Summary
• Thread attributes:
  – Length
  – Memory accesses
  – Dependences

Length  Accesses  Depend.  Outcome
?       ?         Many     Serial
?       ?         0        Easily parallel
Short   Few       Few      TLS works
Short   Many      Few      TLS costly: BULK
Long    Few       Few      TLS costly: Sub-Threads
Long    Many      Few      Hopeless??
Open Questions
• Long threads that also touch many addresses.
  – Bulk on top of sub-threads?
• Combining lazy/eager evaluations
Thank you!
Buffering Large Threads
[Figure, built up over six frames: two L1 caches and one L2 cache, each with lines 0x00 and 0x01; cached lines carry store/load markers such as S1, S2 and L2. Slides courtesy Chris Colohan]
• Frame 1: store X, 0x00
  – Store and load bit per thread
• Frame 2: adds store A, 0x01
• Frame 3: adds load 0x00
• Frame 4: adds store Y, 0x00
  – Replicate line – one version per thread
• Frame 5: adds load 0x01
• Frame 6: adds store B, 0x01
Sub-thread Support
[Figure, three frames: the same caches and instruction sequence (store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01), with a thread split into sub-threads a and b and markers such as S1a, S1b, S2a, L2a and L2b. Slides courtesy Chris Colohan, Copyright 2006 Chris Colohan]
• Divide into two sub-threads
• Only roll back violated sub-thread
• Store and load bit per sub-thread