Automatic Data Partitioning in Software Transactional Memories

Torvald Riegel, Christof Fetzer, Pascal Felber

(TU Dresden, Germany / Uni Neuchatel, Switzerland)

No one-size-fits-all TM!

- STMs:
  - Design: invisible vs. visible reads, object-based vs. word-based
  - Parameters: lock-based: #locks, address→lock mapping
- HTMs:
  - Different interfaces (e.g., Rock vs. AMD’s ASF)
  - Resource bounds
- Heterogeneous workloads: global tuning does not help
- Divide and conquer!?

How to divide

- User-driven? Hmm, rather not …
- Temporally
  - Runtime tuning can handle phases …
  - But only if whole workload has same phases
- Memory
  - “Word-based”: mapping function is difficult
    - Runtime overheads
    - Mapping needs to be stable
    - Memory allocator affects mapping heavily (see false conflicts)
  - “Object-based”: still need mapping or per-object data
- Code
  - Problem: same function might operate on different data

How to conquer?

- Tune concurrency control mechanisms
  - Use different STM implementations
  - Use HTM only where applicable/necessary
  - Tune TM parameters per partition
  - Challenge: threads must agree on which mechanisms to use for each item/location!
  - Two-phase commit or similar is necessary when using several independent TM mechanisms
- Improve mapping/partitioning at other levels
  - E.g., location→lock mapping

Data Partitioning

- Partition memory automatically
- We use Pool Allocation (Lattner et al., PLDI 05)
- Mixed compile-time/runtime technique:
  - Based on pointer analysis for C/C++
  - Nodes in points-to graph become partitions
  - Partitions are instantiated dynamically at runtime and supplied to called functions that use these partitions
- Memory allocator is not affected
- Implementation extends Tanger (STM compiler)
  - STM load/store functions get pointer to partition

Example: Points-to graph for STAMP’s Vacation

[Figure: partial, simplified DS graph for main(). Annotations: “Type, if known”; “struct has 4 fields, 2 are pointers”; “A Red-Black Tree instance”; “A second Red-Black Tree instance”.]

Conquering …

- Partition types determine STM implementation used per partition (TinySTM):
  - Multiple Locks (general purpose)
  - Single Shared Lock (infrequently updated partitions)
  - Single Exclusive Lock (low concurrency partitions)
  - Read-Only (no concurrency control necessary)
  - Thread-local, transaction-local
- Loads/stores dispatched to type-specific STM functions on each call
- Partition types and parameters can be tuned
  - E.g., read-only partitions get tuned on first write
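
The per-call dispatch and the "tuned on first write" behaviour listed above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `partition_t`, `part_load`, `part_store`, and the enum values are hypothetical names, not TinySTM or Tanger API. The actual concurrency control is replaced by counters, the sketch is single-threaded, and promoting a read-only partition straight to the general-purpose type is an assumed policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical partition types, mirroring the list on the slide. */
typedef enum {
    PART_READ_ONLY,
    PART_SINGLE_EXCLUSIVE,
    PART_SINGLE_SHARED,
    PART_MULTI_LOCK
} part_type_t;

/* Hypothetical per-partition descriptor handed to every load/store. */
typedef struct {
    part_type_t type;
    unsigned long plain_accesses;   /* accesses done without concurrency control */
    unsigned long guarded_accesses; /* accesses that would go through an STM path */
} partition_t;

/* Load: dispatch on the partition type at each call. */
uintptr_t part_load(partition_t *p, const uintptr_t *addr)
{
    assert(p != NULL && addr != NULL);
    switch (p->type) {
    case PART_READ_ONLY:
        p->plain_accesses++;   /* read-only: no concurrency control needed */
        return *addr;
    default:
        p->guarded_accesses++; /* placeholder for the type-specific STM load */
        return *addr;
    }
}

/* Store: a write to a read-only partition retunes the partition first. */
void part_store(partition_t *p, uintptr_t *addr, uintptr_t value)
{
    assert(p != NULL && addr != NULL);
    if (p->type == PART_READ_ONLY) {
        /* Assumption: the first write promotes to the general-purpose type. */
        p->type = PART_MULTI_LOCK;
    }
    p->guarded_accesses++;     /* placeholder for the type-specific STM store */
    *addr = value;
}
```

A partition that is never written stays on the cheap path; the first `part_store` changes its type, and later loads take the guarded path.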

Performance

[Figure: performance plot. Annotations and legend:]
- Exclusive Lock is faster than general purpose STM
- Partitioning decreases false conflicts in lock array. Lock hash function gets a 2nd level at compile time.
- Partitioning adds runtime overhead
- Legend: TinySTM w/o partitioning support, 2^20 / 2^24 locks
- Legend: TinySTM with partitioning, 4 different tuning heuristics
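
The effect of a second hash level on false conflicts can be illustrated with a small C sketch. The array size, the word shift, and the function names are assumptions chosen for illustration and do not reproduce TinySTM's actual mapping. With one shared lock array, two addresses from unrelated data structures can map to the same lock. If the partition selects its own sub-array, such cross-partition collisions cannot occur.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_BITS      4u                 /* tiny array so collisions are easy to show */
#define NUM_LOCKS      (1u << LOCK_BITS)
#define NUM_PARTITIONS 2u
#define WORD_SHIFT     3u                 /* assumption: 8-byte words */

/* One-level mapping: all addresses share a single lock array. */
size_t lock_index_flat(uintptr_t addr)
{
    return (size_t)((addr >> WORD_SHIFT) & (NUM_LOCKS - 1u));
}

/* Two-level mapping: the partition id selects a sub-array,
 * the address hash selects the lock inside that sub-array. */
size_t lock_index_partitioned(unsigned partition, uintptr_t addr)
{
    assert(partition < NUM_PARTITIONS);
    return (size_t)partition * NUM_LOCKS + lock_index_flat(addr);
}
```

For example, the addresses `0x1000` and `0x2000` share a lock under `lock_index_flat`, but get different locks when they belong to partitions 0 and 1. Addresses inside the same partition can still collide, so the per-partition array size remains a tuning parameter.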

Performance (2)

[Figure: performance plot. Annotations:]
- Read-Only partitions during first phase of benchmark
- 5 x 256K locks
- 2^26 locks! (2^24 livelocks due to false conflicts)

Challenges

- Analysis:
  - Calls to libraries?
    - Points-to graphs can probably be attached to libs (local per-function analysis + callgraph)
    - Analysis is bottom-up on call-graph
- TM implementations that don’t support two-phase commit
- Dispatch:
  - Runtime overheads
  - JIT?
  - Size of binaries
- Tuning partitions and partitioning
  - No direct feedback, partitioning results in even more parameters to be tuned
  - Partition selection / merging at compile-time/runtime

Questions?

Tanger + TinySTM + …:

http://tinystm.org

(send email for version with partitioning support)

Backup Slides

Are there partitions?

Partition Type Performance & Tuning Strategies

- Tuning strategy:
  - Start with read-only type
  - On reaching a certain number of aborts, switch to:
    1. Single Exclusive Lock
    2. Single Shared Lock
    3. Multiple Locks
- Part-1: switch directly to Multiple Locks
- Part-4: try other types first (single locks, fewer multiple locks)
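
The abort-driven switching described above can be sketched in C as a small state machine. All names and the threshold are hypothetical. The sketch reads Part-1 as "switch directly to Multiple Locks" and Part-4 as "step through the single-lock types first"; the "fewer multiple locks" step mentioned for Part-4 is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type order, following the numbered list on the slide. */
typedef enum {
    TYPE_READ_ONLY,
    TYPE_SINGLE_EXCLUSIVE,
    TYPE_SINGLE_SHARED,
    TYPE_MULTI_LOCK
} tune_type_t;

/* HEUR_DIRECT models Part-1, HEUR_STEPWISE models a simplified Part-4. */
typedef enum { HEUR_DIRECT, HEUR_STEPWISE } heuristic_t;

typedef struct {
    tune_type_t type;
    heuristic_t heuristic;
    unsigned aborts;     /* aborts seen since the last type switch */
    unsigned threshold;  /* hypothetical abort budget per type */
} tuner_t;

/* Record one abort and switch the partition type once the budget is used up. */
void tuner_on_abort(tuner_t *t)
{
    assert(t != NULL);
    if (t->type == TYPE_MULTI_LOCK)
        return;                          /* already the most general type */
    if (++t->aborts < t->threshold)
        return;                          /* budget not yet exhausted */
    t->aborts = 0;
    if (t->heuristic == HEUR_DIRECT)
        t->type = TYPE_MULTI_LOCK;
    else
        t->type = (tune_type_t)(t->type + 1);
}
```

With a threshold of 1, a `HEUR_STEPWISE` tuner moves from read-only to Single Exclusive Lock, then Single Shared Lock, then Multiple Locks, one step per abort.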

Analysis

- We use Data Structure Analysis (DSA [1]):
  - Pointer analysis for the LLVM compiler framework
  - Creates a points-to graph with Data Structure (DS) nodes
  - Context-sensitive:
    - Data structures are distinguished based on call graphs
  - Field-sensitive:
    - Distinguishes between DS fields
  - Unification-based:
    - Pointers target a single node in the points-to graph
    - Information about pointers from different places gets merged
    - If the information is incompatible, the node is collapsed (= “nothing known”)
  - Can safely analyze incomplete programs:
    - Calls to external / not analyzed functions have an effect only on the data that escapes into / from these functions (this data gets marked “External”)
    - Analyzing more code increases analysis precision
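
The unification-based behaviour (one target node per pointer, merging, collapse on incompatible information) can be illustrated with a toy union-find in C. This is not DSA's algorithm or data structure; the type ids, `ds_graph_t`, and the function names are hypothetical, and fields, context sensitivity, and "External" marking are ignored.

```c
#include <assert.h>

#define MAX_NODES      16
#define TYPE_UNKNOWN   0     /* no type information yet */
#define TYPE_COLLAPSED (-1)  /* incompatible information was merged */

typedef struct {
    int parent[MAX_NODES];   /* union-find forest over DS nodes */
    int type[MAX_NODES];     /* hypothetical type id, valid for representatives */
    int count;               /* number of nodes created so far */
} ds_graph_t;

/* Create a fresh node with the given type id and return its index. */
int ds_new_node(ds_graph_t *g, int type)
{
    assert(g->count < MAX_NODES);
    int id = g->count++;
    g->parent[id] = id;
    g->type[id] = type;
    return id;
}

/* Return the representative of node n (with path halving). */
int ds_find(ds_graph_t *g, int n)
{
    while (g->parent[n] != n) {
        g->parent[n] = g->parent[g->parent[n]];
        n = g->parent[n];
    }
    return n;
}

/* Merge two nodes because some pointer may target either of them. */
int ds_unify(ds_graph_t *g, int a, int b)
{
    int ra = ds_find(g, a);
    int rb = ds_find(g, b);
    if (ra == rb)
        return ra;
    int ta = g->type[ra];
    int tb = g->type[rb];
    g->parent[rb] = ra;
    if (ta == TYPE_UNKNOWN)
        g->type[ra] = tb;              /* adopt the other side's information */
    else if (tb != TYPE_UNKNOWN && tb != ta)
        g->type[ra] = TYPE_COLLAPSED;  /* incompatible: "nothing known" */
    return ra;
}
```

In this toy model, nodes that are never unified stay separate and could each become their own partition, while a collapsed node corresponds to data about which nothing specific is known.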

[1] Chris Lattner, PhD thesis, 2005

Analysis (2)

Integration into Tanger compilation process:

1. Compile and link program parts into LLVM intermediate representation module
2. Analyze module using DSA
   - Local intra-function analysis: per-function DS graph
   - Merge DS graphs bottom-up in callgraph (put callees’ information into callers)
   - Merge DS graphs top-down in callgraph (vice versa)
3. Transactify module
   - Use DSA information to decide between object-based / word-based
   - Requirement: if memory chunk (DS node) is object-based, then it must be safe for object-based everywhere in the program
   - DSA can give us this guarantee
4. Link in STM library and generate native code
