Compiler and Runtime Support for Efficient Software Transactional Memory
Transcript of Compiler and Runtime Support for Efficient Software Transactional Memory
Compiler and Runtime Support for Efficient
Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
2
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data

Traditional mechanism: lock-based synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable

New solution: Transactional Memory (TM)
– Simpler programming model: atomicity, consistency, isolation
– No deadlocks
– Composability
– Optimistic concurrency
– Analogy
• GC : Memory allocation ≈ TM : Mutual exclusion
3
Composability
class Bank {
  ConcurrentHashMap accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
Thread-safe – but no scaling
• ConcurrentHashMap (Java 5 / JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
4
Transactional solution
class Bank {
  HashMap accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
5
Transactions are Composable
[Figure: Scalability – 10,000,000 operations. Scalability (0–4) vs. # of processors (0–16) for Synchronized and Transactional versions, measured on a 16-way 2.2 GHz Xeon system.]
6
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT

Novel features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
– Complete GC support
7
System Overview
Polyglot
ORP VM
McRT STM
StarJIT
Transactional Java
Java + STM API
Transactional STIR
Optimized T-STIR
Native Code
8
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
  atomic {S}
• Retry: block until alternate path possible
  atomic {… retry; …}
• Orelse: compose alternate atomic blocks
  atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
  tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
  when (condition) {S}
Builds on prior research: Concurrent Haskell, CAML, Cilk, Java
HPCS languages: Fortress, Chapel, X10
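The constructs above compose. As a sketch in the slide's extended syntax (not standard Java – `atomic`, `retry`, and `orelse` are the proposed extensions, and the `Queue` class here is hypothetical), a blocking take from either of two queues:

```
// Pseudocode in this talk's extended Java.
// retry blocks the transaction until an alternate path may succeed;
// orelse falls through to the next atomic block.
Object takeEither(Queue q1, Queue q2) {
    atomic {
        if (q1.isEmpty()) retry;   // block until q1 has an element...
        return q1.remove();
    } orelse {
        if (q2.isEmpty()) retry;   // ...unless q2 has one first
        return q2.remove();
    }
}
```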
9
Transactional Java → Java
Transactional Java
atomic {
S;
}
STM API
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
Standard Java + STM API
while(true) {
TxnHandle th = txnStart();
try {
S’;
break;
} finally {
if(!txnCommit(th))
continue;
}
}
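The translated retry loop can be exercised against a toy STM stub. This is a minimal sketch, assuming a single shared integer and a commit implemented with compare-and-set – the TxnHandle/txnStart/txnCommit names follow the slide's API, but the bodies are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

class RetryLoopSketch {
    static final AtomicInteger balance = new AtomicInteger(100);

    // Toy transaction handle: remembers the value observed at txnStart.
    static final class TxnHandle {
        final int snapshot;
        TxnHandle(int s) { snapshot = s; }
    }

    static TxnHandle txnStart() { return new TxnHandle(balance.get()); }

    // Commit succeeds only if no other thread changed balance since txnStart.
    static boolean txnCommit(TxnHandle th, int newValue) {
        return balance.compareAndSet(th.snapshot, newValue);
    }

    static void deposit(int amount) {
        while (true) {                              // the translated loop
            TxnHandle th = txnStart();
            int newBalance = th.snapshot + amount;  // S': transformed body
            if (txnCommit(th, newBalance)) break;   // commit, else retry
        }
    }
}
```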
10
JVM STM support
On-demand cloning of methods called inside transactions

Garbage collection support
• Enumeration of refs in read set, write set & undo log

Extra transaction record field in each object
• Supports both word & object granularity

Native method invocation throws exception inside transaction
• Some intrinsic functions allowed

Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
11
Background: McRT-STM
STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)

Writes:
– strict two-phase locking
– update in place
– undo on abort

Reads:
– versioning
– validation before commit

Granularity per type
– Object-level: small objects
– Word-level: large arrays

Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
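The read-side versioning above can be sketched in a few lines. This is an illustrative toy (single field, single transaction record) with invented names, not the McRT-STM source:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class VersionedReadSketch {
    static final AtomicLong txnRecord = new AtomicLong(3); // odd => shared, version 3
    static int data = 42;                                  // the guarded field

    // Read-set entry: the record plus the version observed at read time.
    static final class ReadEntry {
        final AtomicLong rec; final long version;
        ReadEntry(AtomicLong r, long v) { rec = r; version = v; }
    }
    static final List<ReadEntry> readSet = new ArrayList<>();

    static int stmRd() {
        long v = txnRecord.get();                  // sample the version first
        int value = data;                          // then read the field itself
        readSet.add(new ReadEntry(txnRecord, v));  // log for commit-time validation
        return value;
    }

    // Validation before commit: every record must still hold the version we saw.
    static boolean stmValidate() {
        for (ReadEntry e : readSet)
            if (e.rec.get() != e.version) return false;
        return true;
    }
}
```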
12
Ensuring Atomicity: Novel Combination
Memory Ops:

Mode ↓                    Reads    Writes
Pessimistic Concurrency            ✓
Optimistic Concurrency    ✓

+ Caching effects
+ Avoids lock operations
+ In-place updates
+ Fast commits
+ Fast reads

Quantitative results in PPoPP'06
13
McRT-STM: Example
atomic {
  B = A + 5;
}

becomes

stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
14
Transaction Record
Pointer-sized record per object / word
Two states:
• Shared (low bit is 1)
  – Read-only / multiple readers
  – Value is version number (odd)
• Exclusive (low bit is 0)
  – Write-only / single owner
  – Value is thread transaction descriptor (4-byte aligned)

Mapping
• Object: slot in object
• Field: hashed index into global record table
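The low-bit encoding and the word-granularity mapping can be sketched as follows; the helper names and table size are illustrative, not the actual implementation:

```java
class TxnRecordEncoding {
    // Low bit 1: shared, value is an odd version number.
    static boolean isShared(long rec) { return (rec & 1L) == 1L; }

    // Low bit 0: exclusive, value is a 4-byte-aligned descriptor pointer.
    static boolean isExclusive(long rec) { return (rec & 1L) == 0L; }

    // Word granularity: hash (object hash, field offset) into a global
    // table of transaction records, as in the slide's f(obj.hash, offset).
    static final int TABLE_SIZE = 1 << 10;   // illustrative size
    static int recordIndex(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        return (h & 0x7fffffff) % TABLE_SIZE;
    }
}
```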
15
Transaction Record: Example
Every data item has an associated transaction record
[Figure: Two layouts for class Foo { int x; int y; }. Object granularity: an extra transaction record field (TxR) is embedded in each object alongside vtbl, x, and y. Word granularity: each object word hashes into a global table of transaction records TxR1 … TxRn, where the hash is f(obj.hash, offset).]
16
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …

Read and write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number

Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)

In atomic region
– Read operation appends to the read set
– Write operation appends to the write set and undo log
– GC enumerates read/write/undo logs
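Since writes update in place, abort must replay the undo log in reverse. A minimal sketch, using int-array "objects" instead of real fields and invented names (the slide's <Ai, Oi, Vi, Ki> without the GC type tag):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class UndoLogSketch {
    // One undo entry: containing object, element index, original value.
    static final class Entry {
        final int[] owner; final int index; final int oldValue;
        Entry(int[] o, int i, int v) { owner = o; index = i; oldValue = v; }
    }
    static final Deque<Entry> undoLog = new ArrayDeque<>();

    // A transactional write logs the original value, then updates in place.
    static void stmWr(int[] obj, int index, int newValue) {
        undoLog.push(new Entry(obj, index, obj[index]));
        obj[index] = newValue;
    }

    // Abort: restore original values in LIFO order.
    static void abort() {
        while (!undoLog.isEmpty()) {
            Entry e = undoLog.pop();
            e.owner[e.index] = e.oldValue;
        }
    }
}
```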
17
McRT-STM: Example
T1:
atomic {
  t = foo.x;
  bar.x = t;
  t = foo.y;
  bar.y = t;
}

T2:
atomic {
  t1 = bar.x;
  t2 = bar.y;
}

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values

class Foo { int x; int y; };
Foo bar, foo;
18
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();
• T1 copies foo into bar• T2 reads bar, but should not see intermediate values
19
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

[Figure: animation of the conflict. foo starts at version 3 (x = 9, y = 7); bar starts at version 5 (x = 0, y = 0). T1 logs read <foo, 3>, acquires bar for writing (write set <bar, 5>), and updates bar.x and bar.y in place while logging undo entries <bar.x, 0> and <bar.y, 0>. T2 logs read <bar, 5>, then waits while T1 holds bar. When T1 commits, bar's version becomes 7, so T2's validation of <bar, 5> fails and T2 aborts.]

• T2 should read [0, 0] or [9, 7] – never a mix
20
Early Results: Overhead breakdown
[Figure: STM time breakdown (0–100%) on a single processor for Binary tree, Hashtable, Linked list, and Btree; components: TLS access, STM write, STM commit, STM validate, STM read.]
Time breakdown on single processor
STM read & validation overheads dominate
Good optimization targets
21
System Overview
Polyglot
ORP VM
McRT STM
StarJIT
Transactional Java
Java + STM API
Transactional STIR
Optimized T-STIR
Native Code
22
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation →
greater optimization opportunities
23
Representing Read/Write Barriers
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
…
stmWr(&a.x, t1)
stmWr(&a.y, t2)
if(stmRd(&a.z) == 0) {
stmWr(&a.x, 0);
stmWr(&a.z, t3)
}
Traditional barriers hide redundant locking/logging
24
An STM IR for Optimization
Redundancies exposed:
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnOpenForWrite(a)
txnLogObjectInt(&a.y, a)
a.y = t2
txnOpenForRead(a)
if(a.z == 0) {
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = 0
txnOpenForWrite(a)
txnLogObjectInt(&a.z, a)
a.z = t3
}
25
Optimized Code
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnLogObjectInt(&a.y, a)
a.y = t2
if(a.z == 0) {
a.x = 0
txnLogObjectInt(&a.z, a)
a.z = t3
}
Fewer & cheaper STM operations
26
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in presence of nesting

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
27
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)

Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
  – Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  – Word / mixed where beneficial
28
Effectiveness of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x overhead on 1P. With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization

[Figure: % overhead on 1P (0–90%) for HashMap and TreeMap; bars for Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining.]
29
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe w/o concurrency control

Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)

Atomic
• Transactional version via "AtomicMap" wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
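The size-per-segment trick behind Atomic Prime can be sketched like this: the map's single size field is split into 16 counters so concurrent transactional inserts rarely conflict on the same word (illustrative code, not the benchmark source):

```java
class SegmentedSize {
    static final int SEGMENTS = 16;          // à la ConcurrentHashMap
    final int[] counts = new int[SEGMENTS];  // one counter per segment

    // Inside a transaction, an insert touches only its segment's counter,
    // so inserts into different segments no longer conflict on a size field.
    void recordInsert(int keyHash) {
        counts[(keyHash & 0x7fffffff) % SEGMENTS]++;
    }

    // Total size is the sum over segments (reads all counters).
    int size() {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }
}
```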
30
Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap. The effect of compiler optimizations scales.

[Figure: Speedup over 1P Unsafe (0–16) vs. # of processors (0–16) for Unsafe, Synchronized, Concurrent, Atomic (No Opt), and Atomic.]
31
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments. Atomic still scales.

[Figure: Speedup over 1P Unsafe (0–16) vs. # of processors for Synchronized, Concurrent, Atomic (No Opt), and Atomic.]
32
20% Inserts and Removes
Atomic conflicts on the entire bucket array – the array is a single object.

[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, and Atomic.]
33
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, Object Atomic, and Word Atomic.]
34
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck. No degradation, modest performance gain.

[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, Object Atomic, Word Atomic, and Word Atomic Prime.]
35
20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads:
– word-level for arrays
– object-level for non-arrays

[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, and Mixed Atomic Prime.]
36
Scalability: java.util.TreeMap
Results similar to HashMap.

[Figure: Two panels vs. # of processors (0–16). Left, 100% gets: scalability (0–16) for Unsafe, Synchronized, and Atomic. Right, 80% gets: scalability (0–1.2) for Synchronized, Atomic, and Atomic Prime.]
37
Scalability: OO7 – 80% Reads
“Coarse” atomic is competitive with medium-grain synchronization
Operations & traversal over synthetic database
[Figure: Scalability (0–6) vs. # of processors for Atomic, Synch (Coarse), Synch (Med.), and Synch (Fine).]
38
Key Takeaways
Optimistic reads + pessimistic writes is a sweet spot
Compiler optimizations significantly reduce STM overhead
– 20–40% over thread-unsafe
– 10–30% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides best of both
39
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention

Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
40
Conclusions
Rich transactional language constructs in Java
Efficient, first-class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word- and object-level conflict detection
Complete GC support
41