Compiler and Runtime Support for Efficient Software Transactional Memory
Transcript of Compiler and Runtime Support for Efficient Software Transactional Memory
Compiler and Runtime Support for Efficient
Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
2
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data

Traditional mechanism: lock-based synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable

New solution: Transactional Memory (TM)
– Simpler programming model: atomicity, consistency, isolation
– No deadlocks
– Composability
– Optimistic concurrency
– Analogy
• GC : Memory allocation ≈ TM : Mutual exclusion
3
Composability
class Bank {
  ConcurrentHashMap accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
Thread-safe – but no scaling
• ConcurrentHashMap (Java 5 / JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
4
Transactional solution
class Bank {
  HashMap accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
5
Transactions are Composable
[Figure: Scalability – 10,000,000 operations. Scalability (0–4) vs. # of processors (0–16) for Synchronized and Transactional versions, measured on a 16-way 2.2 GHz Xeon system.]
6
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT

Novel features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
– Complete GC support
7
System Overview
Polyglot
ORP VM
McRT STM
StarJIT
Transactional Java
Java + STM API
Transactional STIR
Optimized T-STIR
Native Code
8
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
  atomic {S}
• Retry: block until alternate path possible
  atomic {… retry; …}
• Orelse: compose alternate atomic blocks
  atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
  tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
  when (condition) {S}
Builds on prior research: Concurrent Haskell, CAML, Cilk, Java
HPCS languages: Fortress, Chapel, X10
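The constructs above compose. As a sketch in the slide's extended syntax (not standard Java – `atomic`, `retry`, and `orelse` are the proposed extensions, and the `Queue` class here is hypothetical), a blocking take from either of two queues:

```
// Pseudocode in this talk's extended Java.
// retry blocks the transaction until an alternate path may succeed;
// orelse falls through to the next atomic block.
Object takeEither(Queue q1, Queue q2) {
    atomic {
        if (q1.isEmpty()) retry;   // block until q1 has an element...
        return q1.remove();
    } orelse {
        if (q2.isEmpty()) retry;   // ...unless q2 has one first
        return q2.remove();
    }
}
```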
9
Transactional Java → Java
Transactional Java
atomic {
S;
}
STM API
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
Standard Java + STM API
while(true) {
TxnHandle th = txnStart();
try {
S’;
break;
} finally {
if(!txnCommit(th))
continue;
}
}
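The translated retry loop can be exercised against a toy STM stub. This is a minimal sketch, assuming a single shared integer and a commit implemented with compare-and-set – the TxnHandle/txnStart/txnCommit names follow the slide's API, but the bodies are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

class RetryLoopSketch {
    static final AtomicInteger balance = new AtomicInteger(100);

    // Toy transaction handle: remembers the value observed at txnStart.
    static final class TxnHandle {
        final int snapshot;
        TxnHandle(int s) { snapshot = s; }
    }

    static TxnHandle txnStart() { return new TxnHandle(balance.get()); }

    // Commit succeeds only if no other thread changed balance since txnStart.
    static boolean txnCommit(TxnHandle th, int newValue) {
        return balance.compareAndSet(th.snapshot, newValue);
    }

    static void deposit(int amount) {
        while (true) {                              // the translated loop
            TxnHandle th = txnStart();
            int newBalance = th.snapshot + amount;  // S': transformed body
            if (txnCommit(th, newBalance)) break;   // commit, else retry
        }
    }
}
```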
10
JVM STM support
On-demand cloning of methods called inside transactions

Garbage collection support
• Enumeration of refs in read set, write set & undo log

Extra transaction record field in each object
• Supports both word & object granularity

Native method invocation throws exception inside transaction
• Some intrinsic functions allowed

Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
11
Background: McRT-STM
STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)

Writes:
– strict two-phase locking
– update in place
– undo on abort

Reads:
– versioning
– validation before commit

Granularity per type
– Object-level: small objects
– Word-level: large arrays

Benefits
– Fast memory accesses (no buffering / object wrapping)
– Minimal copying (no cloning for large objects)
– Compatible with existing types & libraries
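The read-side versioning above can be sketched in a few lines. This is an illustrative toy (single field, single transaction record) with invented names, not the McRT-STM source:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class VersionedReadSketch {
    static final AtomicLong txnRecord = new AtomicLong(3); // odd => shared, version 3
    static int data = 42;                                  // the guarded field

    // Read-set entry: the record plus the version observed at read time.
    static final class ReadEntry {
        final AtomicLong rec; final long version;
        ReadEntry(AtomicLong r, long v) { rec = r; version = v; }
    }
    static final List<ReadEntry> readSet = new ArrayList<>();

    static int stmRd() {
        long v = txnRecord.get();                  // sample the version first
        int value = data;                          // then read the field itself
        readSet.add(new ReadEntry(txnRecord, v));  // log for commit-time validation
        return value;
    }

    // Validation before commit: every record must still hold the version we saw.
    static boolean stmValidate() {
        for (ReadEntry e : readSet)
            if (e.rec.get() != e.version) return false;
        return true;
    }
}
```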
12
Ensuring Atomicity: Novel Combination
Memory Ops:

Mode ↓                    Reads    Writes
Pessimistic Concurrency            ✓
Optimistic Concurrency    ✓

+ Caching effects
+ Avoids lock operations
+ In-place updates
+ Fast commits
+ Fast reads

Quantitative results in PPoPP'06
13
McRT-STM: Example
atomic {
  B = A + 5;
}

becomes

stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
14
Transaction Record
Pointer-sized record per object / word
Two states:
• Shared (low bit is 1)
  – Read-only / multiple readers
  – Value is version number (odd)
• Exclusive (low bit is 0)
  – Write-only / single owner
  – Value is thread transaction descriptor (4-byte aligned)

Mapping
• Object: slot in object
• Field: hashed index into global record table
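The low-bit encoding and the word-granularity mapping can be sketched as follows; the helper names and table size are illustrative, not the actual implementation:

```java
class TxnRecordEncoding {
    // Low bit 1: shared, value is an odd version number.
    static boolean isShared(long rec) { return (rec & 1L) == 1L; }

    // Low bit 0: exclusive, value is a 4-byte-aligned descriptor pointer.
    static boolean isExclusive(long rec) { return (rec & 1L) == 0L; }

    // Word granularity: hash (object hash, field offset) into a global
    // table of transaction records, as in the slide's f(obj.hash, offset).
    static final int TABLE_SIZE = 1 << 10;   // illustrative size
    static int recordIndex(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        return (h & 0x7fffffff) % TABLE_SIZE;
    }
}
```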
15
Transaction Record: Example
Every data item has an associated transaction record
[Figure: Two layouts for class Foo { int x; int y; }. Object granularity: an extra transaction record field (TxR) is embedded in each object alongside vtbl, x, and y. Word granularity: each object word hashes into a global table of transaction records TxR1 … TxRn, where the hash is f(obj.hash, offset).]
16
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …

Read and write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number

Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)

In atomic region
– Read operation appends to the read set
– Write operation appends to the write set and undo log
– GC enumerates read/write/undo logs
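Since writes update in place, abort must replay the undo log in reverse. A minimal sketch, using int-array "objects" instead of real fields and invented names (the slide's <Ai, Oi, Vi, Ki> without the GC type tag):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class UndoLogSketch {
    // One undo entry: containing object, element index, original value.
    static final class Entry {
        final int[] owner; final int index; final int oldValue;
        Entry(int[] o, int i, int v) { owner = o; index = i; oldValue = v; }
    }
    static final Deque<Entry> undoLog = new ArrayDeque<>();

    // A transactional write logs the original value, then updates in place.
    static void stmWr(int[] obj, int index, int newValue) {
        undoLog.push(new Entry(obj, index, obj[index]));
        obj[index] = newValue;
    }

    // Abort: restore original values in LIFO order.
    static void abort() {
        while (!undoLog.isEmpty()) {
            Entry e = undoLog.pop();
            e.owner[e.index] = e.oldValue;
        }
    }
}
```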
17
McRT-STM: Example
T1:
atomic {
  t = foo.x;
  bar.x = t;
  t = foo.y;
  bar.y = t;
}

T2:
atomic {
  t1 = bar.x;
  t2 = bar.y;
}

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values

class Foo { int x; int y; };
Foo bar, foo;
18
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();
• T1 copies foo into bar• T2 reads bar, but should not see intermediate values
19
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();

T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();

[Figure: animation of the conflict. foo starts at version 3 (x = 9, y = 7); bar starts at version 5 (x = 0, y = 0). T1 logs read <foo, 3>, acquires bar for writing (write set <bar, 5>), and updates bar.x and bar.y in place while logging undo entries <bar.x, 0> and <bar.y, 0>. T2 logs read <bar, 5>, then waits while T1 holds bar. When T1 commits, bar's version becomes 7, so T2's validation of <bar, 5> fails and T2 aborts.]

• T2 should read [0, 0] or [9, 7] – never a mix
20
Early Results: Overhead breakdown
[Figure: STM time breakdown (0–100%) on a single processor for Binary tree, Hashtable, Linked list, and Btree; components: TLS access, STM write, STM commit, STM validate, STM read.]
Time breakdown on single processor
STM read & validation overheads dominate
Good optimization targets
21
System Overview
Polyglot
ORP VM
McRT STM
StarJIT
Transactional Java
Java + STM API
Transactional STIR
Optimized T-STIR
Native Code
22
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation →
greater optimization opportunities
23
Representing Read/Write Barriers
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
…
stmWr(&a.x, t1)
stmWr(&a.y, t2)
if(stmRd(&a.z) == 0) {
stmWr(&a.x, 0);
stmWr(&a.z, t3)
}
Traditional barriers hide redundant locking/logging
24
An STM IR for Optimization
Redundancies exposed:
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnOpenForWrite(a)
txnLogObjectInt(&a.y, a)
a.y = t2
txnOpenForRead(a)
if(a.z == 0) {
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = 0
txnOpenForWrite(a)
txnLogObjectInt(&a.z, a)
a.z = t3
}
25
Optimized Code
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnLogObjectInt(&a.y, a)
a.y = t2
if(a.z == 0) {
a.x = 0
txnLogObjectInt(&a.z, a)
a.z = t3
}
Fewer & cheaper STM operations
26
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in presence of nesting

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
27
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)

Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
  – Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  – Word / mixed where beneficial
28
Effectiveness of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x overhead on 1P. With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization

[Figure: % overhead on 1P (0–90%) for HashMap and TreeMap; bars for Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining.]
29
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe w/o concurrency control

Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)

Atomic
• Transactional version via "AtomicMap" wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
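The size-per-segment trick behind Atomic Prime can be sketched like this: the map's single size field is split into 16 counters so concurrent transactional inserts rarely conflict on the same word (illustrative code, not the benchmark source):

```java
class SegmentedSize {
    static final int SEGMENTS = 16;          // à la ConcurrentHashMap
    final int[] counts = new int[SEGMENTS];  // one counter per segment

    // Inside a transaction, an insert touches only its segment's counter,
    // so inserts into different segments no longer conflict on a size field.
    void recordInsert(int keyHash) {
        counts[(keyHash & 0x7fffffff) % SEGMENTS]++;
    }

    // Total size is the sum over segments (reads all counters).
    int size() {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }
}
```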
30
Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap. The effect of compiler optimizations scales.

[Figure: Speedup over 1P Unsafe (0–16) vs. # of processors (0–16) for Unsafe, Synchronized, Concurrent, Atomic (No Opt), and Atomic.]
31
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments. Atomic still scales.

[Figure: Speedup over 1P Unsafe (0–16) vs. # of processors for Synchronized, Concurrent, Atomic (No Opt), and Atomic.]
32
20% Inserts and Removes
Atomic conflicts on the entire bucket array – the array is a single object.

[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, and Atomic.]
33
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, Object Atomic, and Word Atomic.]
34
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size per segment, lowering the bottleneck. No degradation, modest performance gain.

[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, Object Atomic, Word Atomic, and Word Atomic Prime.]
35
20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads:
– word-level for arrays
– object-level for non-arrays

[Figure: Speedup over 1P Unsafe (0–3) vs. # of processors for Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, and Mixed Atomic Prime.]
36
Scalability: java.util.TreeMap
Results similar to HashMap.

[Figure: Two panels vs. # of processors (0–16). Left, 100% gets: scalability (0–16) for Unsafe, Synchronized, and Atomic. Right, 80% gets: scalability (0–1.2) for Synchronized, Atomic, and Atomic Prime.]
37
Scalability: OO7 – 80% Reads
“Coarse” atomic is competitive with medium-grain synchronization
Operations & traversal over synthetic database
[Figure: Scalability (0–6) vs. # of processors for Atomic, Synch (Coarse), Synch (Med.), and Synch (Fine).]
38
Key Takeaways
Optimistic reads + pessimistic writes is a sweet spot
Compiler optimizations significantly reduce STM overhead
– 20–40% over thread-unsafe
– 10–30% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides best of both
39
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention

Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
40
Conclusions
Rich transactional language constructs in Java
Efficient, first-class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word- and object-level conflict detection
Complete GC support
41