Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.

Transactional Memory

The Art of Multiprocessor ProgrammingHerlihy and Shavit

From the New York Times …

SAN FRANCISCO, May 7. 2004 - Intel said on Friday that it was scrapping its development of two microprocessors, a move that is a shift in the company's business strategy….

Moore’s Law

(hat tip: Herb Sutter)

Clock speed

flattening sharply

Transistor count still

rising

© 2006 Herlihy & Shavit 6

The Problem

• Cannot exploit cheap threads• Today’s Software

– Non-scalable methodologies

• Today’s Hardware– Poor support for scalable

synchronization


Locking


Coarse-Grained Locking

Easily made correct …But not scalable.


Fine-Grained Locking

Here comes trouble …


Why Locking Doesn’t Scale

• Not Robust• Relies on conventions• Hard to Use

– Conservative– Deadlocks– Lost wake-ups

• Not Composable


Locks are not Robust

If a thread holdinga lock is delayed

…

No one else can make progress





• Not Composable


Locking Relies on Conventions

• Relation between– Lock bit and object bits– Exists only in programmer’s mind

/* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */

Actual comment from Linux

Kernel(hat tip: Bradley Kuszmaul)





• Not Composable


Sadistic Homework

enq(x) enq(y)Double-ended queue

No interference if ends “far

enough” apart


Sadistic Homework

deq() ceq()Double-ended queue

Interference OK if ends “close

enough” together


Sadistic Homework

deq() deq()Double-ended queue

Make sure suspended dequeuers

awake as needed

In Search of the Lost Wake-Up

• Waiting thread doesn’t realize when to wake up

• It’s a real problem in big systems– “Calling pthread_cond_signal() or

pthread_cond_broadcast() when the thread does not hold the mutex lock associated with the condition can lead to lost wake-up bugs.”

from Google™ search for “lost wake-up”


You Try It …

• One lock?– Too Conservative

• Locks at each end?– Deadlock, too complicated, etc

• Waking blocked dequeuers?– Harder than it looks


Actual Solution

• Clean solution would be a publishable result

• [Michael & Scott, PODC 96]• High performance fine-grained

lock-based solutions are good for libraries…

• not general consumption by programmers





• Not Composable


Locks do not compose

add(T1, item)

delete(T1, item)

add(T2, item) item item

Move from T1 to T2

Must lock T2

before deleting from T1

lock T2lock T2lock T1lock T1

lock T1lock T1

item

Exposing lock internals breaks abstraction

Hash Table Must lock T1

before adding item


Monitor Wait and Signal

zzz

If buffer is empty, wait for item to show up

Empty

bufferYes!


Wait and Signal do not Compose

empty

empty

zzz…

Wait for either?


The Transactional Manifesto

• What we do now is inadequate to meet the multicore challenge

• Research Agenda– Replace locking with a transactional

API – Design languages to support this model– Implement the run-time to be fast

enough


Transactions

• Atomic– Commit: takes effect– Abort: effects rolled back

• Usually retried

• Linearizable– Appear to happen in one-at-a-time

order


atomic { x.remove(3); y.add(3);}

atomic { y = null;}

Atomic Blocks


atomic { x.remove(3); y.add(3);}

atomic { y = null;}

Atomic Blocks

No data race


Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q;}

Sadistic Homework Revisited

(1)

Write sequential Code


Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; }}


(1)


Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; }}


(1)

Enclose in atomic block


Warning

• Not always this simple– Conditional waits– Enhanced concurrency– Overlapping locks

• But often it is– Works for sadistic homework


Public void Transfer(Queue<T> q1, q2){ atomic { T x = q1.deq(); q2.enq(x); }}

Composition

(1)

Trivial or what?


Public T LeftDeq() { atomic { if (this.left == null) retry; … }}

Wake-ups: lost and found

(1)

Roll back transaction and restart when something

changes


OrElse Compositionatomic { x = q1.deq(); } orElse { x = q2.deq();}

Run 1st method. If it retries …Run 2nd method. If it retries …

Entire statement retries


• Software transactional memory (STM)

• Hardware transactional memory (HTM)

• Hybrid transactional memory (HyTM, try in hardware and default to software if unsuccessful)



Hardware versus Software

• Do we need hardware at all?– Analogies:

• Virtual memory: yes!• Garbage collection: no!

– Probably do need HW for performance

• Do we need software?– Policy issues don’t make sense for

hardware

Design Issues

• Implementation choices

• Language design issues

• Semantic issues

Granularity

• Object– managed languages, Java, C#, …– Easy to control interactions between

transactional & non-trans threads

• Word– C, C++, …– Hard to control interactions between

transactional & non-trans threads

Direct/Deferred Update

• Deferred – modify private copies & install on

commit– Commit requires work– Consistency easier

• Direct – Modify in place, roll back on abort– Makes commit efficient– Consistency harder

Conflict Detection

• Eager– Detect before conflict arises– “Contention manager” module

resolves

• Lazy– Detect on commit/abort

• Mixed– Eager write/write, lazy read/write …

Conflict Detection

• Eager detection may abort transaction that could have committed.

• Lazy detection discards more computation.

Contention Managers

• Oracle that resolves conflicts– For eager conflict detection

• CM decides– Whether to abort other transaction– Or give it a chance to finish …

Contention Manager Strategies

• Exponential backoff• Oldest gets priority• Most work done gets priority• Non-waiting has priority over waiting• Lots of alternatives

– None seems to dominate– But choice can have big impact

I/O & System Calls?

• Some I/O revocable– Provide transaction-safe libraries– Undoable file system/DB calls

• Some not– Opening cash drawer– Firing missile

I/O & System Calls

• One solution: make transaction irrevocable– If transaction tries I/O, switch to

irrevocable mode.• There can be only one …

– Requires serial execution• No explicit aborts

– In irrevocable transactions

Exceptions

int i = 0;try { atomic { i++; node = new Node(); }} catch (Exception e) { print(i);}


Exceptions



Throws OutOfMemoryException!

Exceptions



Throws OutOfMemoryException!

What is printed?

Unhandled Exceptions

• Aborts transaction– Preserves invariants– Safer

• Commits transaction– Like locking semantics– What if exception object refers to

values modified in transaction?

Nested Transactions

atomic void foo() { bar();}

atomic void bar() { …}

atomic void foo() { bar();}

atomic void bar() { …}

Nested Transactions

• Needed for modularity– Who knew that cosine() contained a

transaction?• Flat nesting

– If child aborts, so does parent• First-class nesting

– If child aborts, partial rollback of child only

Open Nested Transactions

• Normally, child commit– Visible only to parent

• In open nested transactions– Commit visible to all– Escape mechanism– Dangerous, but useful

• What escape mechanisms are needed?

Privatization

Node n;atomic { n = head; if (n != null) head = head.next;}// use n.val many times




Privatization

“privatize” node

Privatization



Thread 0

Access without transactional

synchronization

Transactions



atomic {Node n = head; while (n != null) { n.val++; n = n.next; }}


Thread 0

Thread 1

Transactions





Thread 0

Thread 1

Is this safe?

Transactions

Thread 0

Thread 1

Is this safe?For locks, yes.





Transactions

Thread 0

Thread 1





Is this safe?For locks, yes.

For STMs, maybe not.

Strong vs Weak Isolation

• Non-transactional accesses don’t synchronize

• Non-transactional thread may see delayed updates/recovery

• Solutions– Privatization barrier?– Other techniques?

Single Global Lock Semantics

• Transactions act as if it acquires SGL

• Good:– Intuitively appealing

• Bad:– What about aborted transactions?– May be expensive to implement

Simple Lock-Based STM

• STMs come in different forms– Lock-based– Lock-free

• Here we will describe a simple lock-based STM

Synchronization

• Transaction keeps– Read set: locations & values read– Write set: locations & values to be

written• Deferred update

– Changes installed at commit• Lazy conflict detection

– Conflicts detected at commit


STM: Transactional Locking

Map

Array of Versioned-Write-Locks

Application Memory

V#

V#

V#


Reading an Object

• Put V#s & value in RS• If not already locked

MemLocks

V#

V#

V#

V#

V#


To Write an Object

• Add V# and new value to WS

MemLocks

V#

V#

V#

V#

V#


To Commit

• Acquire W locks• Check V#s unchanged

• In RS & WS• Install new values• Increment V#s• Release …

MemLocks

V#

V#

V#

V#

V#

X

Y

V#+1

V#+1

Transactional Consistency

Application Memory

x

y

4

2

8

4

Invariant x = 2y

Transaction A: Write xWrite y

Transaction B: Read xRead y Compute z = 1/(x-y) = 1/2

Zombies!

• A transaction that will abort because of a synchronization conflict

• But doesn’t know it yet• And keeps running

Zombies!

Application Memory

x

y

4

2

8

4

Invariant x = 2y

Transaction A: Write x (kills B)

Write y

Transaction B: Read x = 4

Transaction B: (zombie)Read y = 4Compute z = 1/(x-y)

DIV by 0 ERROR

Solution: The “Global Clock”

• Have one shared global clock• Incremented by (small subset of)

writing transactions• Read by all transactions• Used to validate that state worked

on is always consistent


Read-Only Transactions

• Copy V Clock to RV• Read lock,V#• Read mem• Check unlocked• Recheck V# unchanged• Check V# < RV

MemLocks

12

32

56

19

17

100

Shared Version Clock

Private Read Version

100

Reads form a snapshot of memory.No read set!


Regular Transactions

• Copy V Clock to RV• On read/write, check:

• Unlocked• V# ≤ RV• Add to R/W set

MemLocks

100



100

12

32

56

19

17


Regular Transactions

• Acquire locks• WV = F&Inc(V Clock)• Check each V# ≤ RV• Update memory• Set write V#s to WV

MemLocks

100



100101

x

y

12

32

56

19

17

100

100


Hardware Transactional Memory

• Exploit Cache coherence protocols• Already do almost what we need

– Invalidation– Consistency checking

• Exploit Speculative execution– Branch prediction = optimistic synch!


HW Transactional Memory

Interconnect

caches

memory

read active

T



caches

memory

read

activeTT

active



caches

memory

activeTT

activecommitted



caches

memory

write

active

committed

TD


Rewind

caches

memory

activeTT

activewriteaborted

D


Transaction Commit

• At commit point– If no cache conflicts, we win.

• Mark transactional entries– Read-only: valid– Modified: dirty (eventually written

back)

• That’s all, folks!– Except for a few details …


Not all Skittles and Beer

• Limits to– Transactional cache size– Scheduling quantum

• Transaction cannot commit if it is– Too big– Too slow– Actual limits platform-dependent

Approaches

• Virtualization– Spill transactional cache to memory

• Hybrid– Use limited HTM as HW accelerator

for STM– Sun’s Rock™ processor …

Back to Amdahl’s Law:

Speedup = 1/(ParallelPart/N + SequentialPart)

Pay for N = 8 cores SequentialPart = 25%

Speedup = only 2.9 times!

Must parallelize applications on a very fine grain!

So…How will we make use of multicores?

Art of Multiprocessor Programming

86

Traditional Scaling Process

User code

TraditionalUniprocessor

Speedup1.8x1.8x

7x7x

3.6x3.6x

Time: Moore’s law


87

Multicore Scaling Process

User code

Multicore

Speedup 1.8x1.8x

7x7x

3.6x3.6x

Unfortunately, not so simple…


88

Actual Scaling Process

1.8x1.8x 2x2x 2.9x2.9x

User code

Multicore

Speedup

Parallelization and Synchronization require great care…

Need Ways to Make Scaling Smoother…

TM code

Speedup 1.8x1.8x

7x7x

3.6x3.6x

Multicore

User code

I, for one, Welcome our new Multicore Overlords …

• Multicore architectures force us to rethink how we do synchronization

• Standard locking model won’t work• Transactional model might

– Software – Hardware

• A full-employment act!

Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.

Documents

Transcript of Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.