Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.
-
Upload
vernon-floyd -
Category
Documents
-
view
239 -
download
0
Transcript of Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.
Transactional Memory
The Art of Multiprocessor ProgrammingHerlihy and Shavit
From the New York Times …
SAN FRANCISCO, May 7. 2004 - Intel said on Friday that it was scrapping its development of two microprocessors, a move that is a shift in the company's business strategy….
Moore’s Law
(hat tip: Herb Sutter)
Clock speed
flattening sharply
Transistor count still
rising
© 2006 Herlihy & Shavit 6
The Problem
• Cannot exploit cheap threads• Today’s Software
– Non-scalable methodologies
• Today’s Hardware– Poor support for scalable
synchronization
© 2006 Herlihy & Shavit 7
Locking
© 2006 Herlihy & Shavit 8
Coarse-Grained Locking
Easily made correct …But not scalable.
© 2006 Herlihy & Shavit 9
Fine-Grained Locking
Here comes trouble …
© 2006 Herlihy & Shavit 10
Why Locking Doesn’t Scale
• Not Robust• Relies on conventions• Hard to Use
– Conservative– Deadlocks– Lost wake-ups
• Not Composable
© 2006 Herlihy & Shavit 11
Locks are not Robust
If a thread holdinga lock is delayed
…
No one else can make progress
© 2006 Herlihy & Shavit 12
Why Locking Doesn’t Scale
• Not Robust• Relies on conventions• Hard to Use
– Conservative– Deadlocks– Lost wake-ups
• Not Composable
© 2006 Herlihy & Shavit 13
Locking Relies on Conventions
• Relation between– Lock bit and object bits– Exists only in programmer’s mind
/* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */
Actual comment from Linux
Kernel(hat tip: Bradley Kuszmaul)
© 2006 Herlihy & Shavit 14
Why Locking Doesn’t Scale
• Not Robust• Relies on conventions• Hard to Use
– Conservative– Deadlocks– Lost wake-ups
• Not Composable
© 2006 Herlihy & Shavit 15
Sadistic Homework
enq(x) enq(y)Double-ended queue
No interference if ends “far
enough” apart
© 2006 Herlihy & Shavit 16
Sadistic Homework
deq() ceq()Double-ended queue
Interference OK if ends “close
enough” together
© 2006 Herlihy & Shavit 17
Sadistic Homework
deq() deq()Double-ended queue
Make sure suspended dequeuers
awake as needed
In Search of the Lost Wake-Up
• Waiting thread doesn’t realize when to wake up
• It’s a real problem in big systems– “Calling pthread_cond_signal() or
pthread_cond_broadcast() when the thread does not hold the mutex lock associated with the condition can lead to lost wake-up bugs.”
from Google™ search for “lost wake-up”
© 2006 Herlihy & Shavit 19
You Try It …
• One lock?– Too Conservative
• Locks at each end?– Deadlock, too complicated, etc
• Waking blocked dequeuers?– Harder than it looks
© 2006 Herlihy & Shavit 20
Actual Solution
• Clean solution would be a publishable result
• [Michael & Scott, PODC 96]• High performance fine-grained
lock-based solutions are good for libraries…
• not general consumption by programmers
© 2006 Herlihy & Shavit 21
Why Locking Doesn’t Scale
• Not Robust• Relies on conventions• Hard to Use
– Conservative– Deadlocks– Lost wake-ups
• Not Composable
© 2006 Herlihy & Shavit 22
Locks do not compose
add(T1, item)
delete(T1, item)
add(T2, item) item item
Move from T1 to T2
Must lock T2
before deleting from T1
lock T2lock T2lock T1lock T1
lock T1lock T1
item
Exposing lock internals breaks abstraction
Hash Table Must lock T1
before adding item
© 2006 Herlihy & Shavit 23
Monitor Wait and Signal
zzz
If buffer is empty, wait for item to show up
Empty
bufferYes!
© 2006 Herlihy & Shavit 24
Wait and Signal do not Compose
empty
empty
zzz…
Wait for either?
© 2006 Herlihy & Shavit 25
The Transactional Manifesto
• What we do now is inadequate to meet the multicore challenge
• Research Agenda– Replace locking with a transactional
API – Design languages to support this model– Implement the run-time to be fast
enough
© 2006 Herlihy & Shavit 26
Transactions
• Atomic– Commit: takes effect– Abort: effects rolled back
• Usually retried
• Linearizable– Appear to happen in one-at-a-time
order
© 2006 Herlihy & Shavit 27
atomic { x.remove(3); y.add(3);}
atomic { y = null;}
Atomic Blocks
© 2006 Herlihy & Shavit 28
atomic { x.remove(3); y.add(3);}
atomic { y = null;}
Atomic Blocks
No data race
© 2006 Herlihy & Shavit 29
Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q;}
Sadistic Homework Revisited
(1)
Write sequential Code
© 2006 Herlihy & Shavit 30
Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; }}
Sadistic Homework Revisited
(1)
© 2006 Herlihy & Shavit 31
Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; }}
Sadistic Homework Revisited
(1)
Enclose in atomic block
© 2006 Herlihy & Shavit 32
Warning
• Not always this simple– Conditional waits– Enhanced concurrency– Overlapping locks
• But often it is– Works for sadistic homework
© 2006 Herlihy & Shavit 33
Public void Transfer(Queue<T> q1, q2){ atomic { T x = q1.deq(); q2.enq(x); }}
Composition
(1)
Trivial or what?
© 2006 Herlihy & Shavit 34
Public T LeftDeq() { atomic { if (this.left == null) retry; … }}
Wake-ups: lost and found
(1)
Roll back transaction and restart when something
changes
© 2006 Herlihy & Shavit 35
OrElse Compositionatomic { x = q1.deq(); } orElse { x = q2.deq();}
Run 1st method. If it retries …Run 2nd method. If it retries …
Entire statement retries
Transactional Memory
• Software transactional memory (STM)
• Hardware transactional memory (HTM)
• Hybrid transactional memory (HyTM, try in hardware and default to software if unsuccessful)
© 2006 Herlihy & Shavit 36
© 2006 Herlihy & Shavit 37
Hardware versus Software
• Do we need hardware at all?– Analogies:
• Virtual memory: yes!• Garbage collection: no!
– Probably do need HW for performance
• Do we need software?– Policy issues don’t make sense for
hardware
Design Issues
• Implementation choices
• Language design issues
• Semantic issues
Granularity
• Object– managed languages, Java, C#, …– Easy to control interactions between
transactional & non-trans threads
• Word– C, C++, …– Hard to control interactions between
transactional & non-trans threads
Direct/Deferred Update
• Deferred – modify private copies & install on
commit– Commit requires work– Consistency easier
• Direct – Modify in place, roll back on abort– Makes commit efficient– Consistency harder
Conflict Detection
• Eager– Detect before conflict arises– “Contention manager” module
resolves
• Lazy– Detect on commit/abort
• Mixed– Eager write/write, lazy read/write …
Conflict Detection
• Eager detection may abort transaction that could have committed.
• Lazy detection discards more computation.
Contention Managers
• Oracle that resolves conflicts– For eager conflict detection
• CM decides– Whether to abort other transaction– Or give it a chance to finish …
Contention Manager Strategies
• Exponential backoff• Oldest gets priority• Most work done gets priority• Non-waiting has priority over waiting• Lots of alternatives
– None seems to dominate– But choice can have big impact
I/O & System Calls?
• Some I/O revocable– Provide transaction-safe libraries– Undoable file system/DB calls
• Some not– Opening cash drawer– Firing missile
I/O & System Calls
• One solution: make transaction irrevocable– If transaction tries I/O, switch to
irrevocable mode.• There can be only one …
– Requires serial execution• No explicit aborts
– In irrevocable transactions
Exceptions
int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}
int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}
Exceptions
int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}
int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}
Throws OutOfMemoryException!
Exceptions
int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}
int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}
Throws OutOfMemoryException!
What is printed?
Unhandled Exceptions
• Aborts transaction– Preserves invariants– Safer
• Commits transaction– Like locking semantics– What if exception object refers to
values modified in transaction?
Nested Transactions
atomic void foo() { bar();}
atomic void bar() { …}
atomic void foo() { bar();}
atomic void bar() { …}
Nested Transactions
• Needed for modularity– Who knew that cosine() contained a
transaction?• Flat nesting
– If child aborts, so does parent• First-class nesting
– If child aborts, partial rollback of child only
Open Nested Transactions
• Normally, child commit– Visible only to parent
• In open nested transactions– Commit visible to all– Escape mechanism– Dangerous, but useful
• What escape mechanisms are needed?
Privatization
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Privatization
“privatize” node
Privatization
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Thread 0
Access without transactional
synchronization
Transactions
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
Thread 0
Thread 1
Transactions
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
Thread 0
Thread 1
Is this safe?
Transactions
Thread 0
Thread 1
Is this safe?For locks, yes.
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
Transactions
Thread 0
Thread 1
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}
Is this safe?For locks, yes.
For STMs, maybe not.
Strong vs Weak Isolation
• Non-transactional accesses don’t synchronize
• Non-transactional thread may see delayed updates/recovery
• Solutions– Privatization barrier?– Other techniques?
Single Global Lock Semantics
• Transactions act as if it acquires SGL
• Good:– Intuitively appealing
• Bad:– What about aborted transactions?– May be expensive to implement
Simple Lock-Based STM
• STMs come in different forms– Lock-based– Lock-free
• Here we will describe a simple lock-based STM
Synchronization
• Transaction keeps– Read set: locations & values read– Write set: locations & values to be
written• Deferred update
– Changes installed at commit• Lazy conflict detection
– Conflicts detected at commit
© 2006 Herlihy & Shavit 65
STM: Transactional Locking
Map
Array of Versioned-Write-Locks
Application Memory
V#
V#
V#
© 2006 Herlihy & Shavit 66
Reading an Object
• Put V#s & value in RS• If not already locked
MemLocks
V#
V#
V#
V#
V#
© 2006 Herlihy & Shavit 67
To Write an Object
• Add V# and new value to WS
MemLocks
V#
V#
V#
V#
V#
© 2006 Herlihy & Shavit 68
To Commit
• Acquire W locks• Check V#s unchanged
• In RS & WS• Install new values• Increment V#s• Release …
MemLocks
V#
V#
V#
V#
V#
X
Y
V#+1
V#+1
Transactional Consistency
Application Memory
x
y
4
2
8
4
Invariant x = 2y
Transaction A: Write xWrite y
Transaction B: Read xRead y Compute z = 1/(x-y) = 1/2
Zombies!
• A transaction that will abort because of a synchronization conflict
• But doesn’t know it yet• And keeps running
Zombies!
Application Memory
x
y
4
2
8
4
Invariant x = 2y
Transaction A: Write x (kills B)
Write y
Transaction B: Read x = 4
Transaction B: (zombie)Read y = 4Compute z = 1/(x-y)
DIV by 0 ERROR
Solution: The “Global Clock”
• Have one shared global clock• Incremented by (small subset of)
writing transactions• Read by all transactions• Used to validate that state worked
on is always consistent
© 2006 Herlihy & Shavit 73
Read-Only Transactions
• Copy V Clock to RV• Read lock,V#• Read mem• Check unlocked• Recheck V# unchanged• Check V# < RV
MemLocks
12
32
56
19
17
100
Shared Version Clock
Private Read Version
100
Reads form a snapshot of memory.No read set!
© 2006 Herlihy & Shavit 74
Regular Transactions
• Copy V Clock to RV• On read/write, check:
• Unlocked• V# ≤ RV• Add to R/W set
MemLocks
100
Shared Version Clock
Private Read Version
100
12
32
56
19
17
© 2006 Herlihy & Shavit 75
Regular Transactions
• Acquire locks• WV = F&Inc(V Clock)• Check each V# ≤ RV• Update memory• Set write V#s to WV
MemLocks
100
Shared Version Clock
Private Read Version
100101
x
y
12
32
56
19
17
100
100
© 2006 Herlihy & Shavit 76
Hardware Transactional Memory
• Exploit Cache coherence protocols• Already do almost what we need
– Invalidation– Consistency checking
• Exploit Speculative execution– Branch prediction = optimistic synch!
© 2006 Herlihy & Shavit 77
HW Transactional Memory
Interconnect
caches
memory
read active
T
© 2006 Herlihy & Shavit 78
Transactional Memory
caches
memory
read
activeTT
active
© 2006 Herlihy & Shavit 79
Transactional Memory
caches
memory
activeTT
activecommitted
© 2006 Herlihy & Shavit 80
Transactional Memory
caches
memory
write
active
committed
TD
© 2006 Herlihy & Shavit 81
Rewind
caches
memory
activeTT
activewriteaborted
D
© 2006 Herlihy & Shavit 82
Transaction Commit
• At commit point– If no cache conflicts, we win.
• Mark transactional entries– Read-only: valid– Modified: dirty (eventually written
back)
• That’s all, folks!– Except for a few details …
© 2006 Herlihy & Shavit 83
Not all Skittles and Beer
• Limits to– Transactional cache size– Scheduling quantum
• Transaction cannot commit if it is– Too big– Too slow– Actual limits platform-dependent
Approaches
• Virtualization– Spill transactional cache to memory
• Hybrid– Use limited HTM as HW accelerator
for STM– Sun’s Rock™ processor …
Back to Amdahl’s Law:
Speedup = 1/(ParallelPart/N + SequentialPart)
Pay for N = 8 cores SequentialPart = 25%
Speedup = only 2.9 times!
Must parallelize applications on a very fine grain!
So…How will we make use of multicores?
Art of Multiprocessor Programming
86
Traditional Scaling Process
User code
TraditionalUniprocessor
Speedup1.8x1.8x
7x7x
3.6x3.6x
Time: Moore’s law
Art of Multiprocessor Programming
87
Multicore Scaling Process
User code
Multicore
Speedup 1.8x1.8x
7x7x
3.6x3.6x
Unfortunately, not so simple…
Art of Multiprocessor Programming
88
Actual Scaling Process
1.8x1.8x 2x2x 2.9x2.9x
User code
Multicore
Speedup
Parallelization and Synchronization require great care…
Need Ways to Make Scaling Smoother…
TM code
Speedup 1.8x1.8x
7x7x
3.6x3.6x
Multicore
User code
I, for one, Welcome our new Multicore Overlords …
• Multicore architectures force us to rethink how we do synchronization
• Standard locking model won’t work• Transactional model might
– Software – Hardware
• A full-employment act!