Automatic Data Partitioning in Software Transactional Memories Torvald Riegel, Christof Fetzer, Pascal Felber (TU Dresden, Germany / Uni Neuchatel, Switzerland)

Page 1

Automatic Data Partitioning in Software Transactional Memories

Torvald Riegel, Christof Fetzer, Pascal Felber

(TU Dresden, Germany / Uni Neuchatel, Switzerland)

Page 2

No one-size-fits-all TM!

- STMs:
  - Design: invisible vs. visible reads, object-based vs. word-based
  - Parameters: lock-based: #locks, address-to-lock mapping
- HTMs:
  - Different interfaces (e.g., Rock vs. AMD’s ASF)
  - Resource bounds
- Heterogeneous workloads: global tuning does not help

Divide and conquer!?

Page 3

How to divide

- User-driven? Hmm, rather not …
- Temporally
  - Runtime tuning can handle phases …
  - But only if the whole workload has the same phases
- Memory
  - “Word-based”: mapping function is difficult
    - Runtime overheads
    - Mapping needs to be stable
    - Memory allocator affects mapping heavily (see false conflicts)
  - “Object-based”: still need mapping or per-object data
- Code
  - Problem: same function might operate on different data

Page 4

How to conquer?

- Tune concurrency control mechanisms
  - Use different STM implementations
  - Use HTM only where applicable/necessary
  - Tune TM parameters per partition
  - Challenge: threads must agree on which mechanisms to use for each item/location!
  - Two-phase commit or similar is necessary when using several independent TM mechanisms
- Improve mapping/partitioning at other levels
  - E.g., location-to-lock mapping
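
The two-phase commit requirement can be illustrated with a minimal Python sketch. The `PartitionTM` class and `commit_transaction` function are hypothetical names invented for this illustration and are not part of TinySTM or Tanger. The sketch only shows the control flow: every mechanism touched by a transaction must first agree to commit before any of them does.

```python
class PartitionTM:
    """Hypothetical stand-in for one independent TM mechanism."""

    def __init__(self, name, will_validate=True):
        self.name = name
        self.will_validate = will_validate  # False simulates a conflict
        self.state = "active"

    def prepare(self):
        # Phase 1: validate so that a later commit cannot fail.
        self.state = "prepared" if self.will_validate else "aborted"
        return self.will_validate

    def commit(self):
        # Phase 2: only legal after a successful prepare.
        assert self.state == "prepared"
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def commit_transaction(participants):
    """Commit on all participating mechanisms, or on none of them."""
    for tm in participants:
        if not tm.prepare():
            # One mechanism refused: roll back everywhere.
            for p in participants:
                p.abort()
            return False
    for tm in participants:
        tm.commit()
    return True
```

A transaction that touches a multiple-locks partition and a single-lock partition would pass both mechanisms to `commit_transaction`; if either fails to prepare, neither commits.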

Page 5

Data Partitioning

- Partition memory automatically
- We use Pool Allocation (Lattner et al., PLDI 05)
- Mixed compile-time/runtime technique:
  - Based on pointer analysis for C/C++
  - Nodes in points-to graph become partitions
  - Partitions are instantiated dynamically at runtime and supplied to called functions that use these partitions
- Memory allocator is not affected
- Implementation extends Tanger (STM compiler)
  - STM load/store functions get pointer to partition
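
As a rough illustration of partitions being supplied to called functions, the Python sketch below passes a partition descriptor as an extra argument and forwards it to the load/store functions. All names (`Partition`, `stm_load`, `stm_store`, `increment`) are hypothetical; the real system is a compiler transformation on C/C++ code, which this toy model does not reproduce.

```python
class Partition:
    """Hypothetical runtime descriptor for one points-to graph node."""

    def __init__(self, name):
        self.name = name
        self.loads = 0
        self.stores = 0


def stm_load(memory, addr, partition):
    # The load function receives the partition the address belongs to.
    partition.loads += 1
    return memory[addr]


def stm_store(memory, addr, value, partition):
    partition.stores += 1
    memory[addr] = value


def increment(memory, addr, partition):
    # Transformed callee: the caller supplies the partition as an extra
    # argument, so the same code can operate on different partitions.
    value = stm_load(memory, addr, partition)
    stm_store(memory, addr, value + 1, partition)


memory = {0x10: 5, 0x20: 7}
tree_a = Partition("tree_a")
tree_b = Partition("tree_b")
increment(memory, 0x10, tree_a)
increment(memory, 0x20, tree_b)
increment(memory, 0x20, tree_b)
```

The same `increment` function ends up accounting its accesses to two different partitions, depending only on what the caller passes in.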

Page 6

Example: Points-to graph for STAMP’s Vacation

[Figure: partial, simplified DS graph for main(). Annotations: type, if known; struct has 4 fields, 2 are pointers; a Red-Black Tree instance; a second Red-Black Tree instance.]

Page 7

Conquering …

- Partition types determine the STM implementation used per partition (TinySTM):
  - Multiple Locks (general purpose)
  - Single Shared Lock (infrequently updated partitions)
  - Single Exclusive Lock (low concurrency partitions)
  - Read-Only (no concurrency control necessary)
  - Thread-local, transaction-local
- Loads/stores dispatched to type-specific STM functions on each call
- Partition types and parameters can be tuned
  - E.g., read-only partitions get tuned on first write
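
Per-call dispatch and the retuning of a read-only partition on its first write might look roughly like the Python sketch below. The names are hypothetical, only two of the partition types are modeled, and the switch is shown single-threaded; in a real STM, all threads would have to agree on the type change.

```python
from enum import Enum


class PartitionType(Enum):
    READ_ONLY = "read-only"
    MULTIPLE_LOCKS = "multiple locks"


class Partition:
    def __init__(self, ptype=PartitionType.READ_ONLY):
        self.ptype = ptype
        self.log = []  # records which code path each access took


def load_read_only(partition, memory, addr):
    # No concurrency control necessary for read-only data.
    partition.log.append("plain load")
    return memory[addr]


def load_multiple_locks(partition, memory, addr):
    # Placeholder for a general-purpose, lock-based STM load.
    partition.log.append("locked load")
    return memory[addr]


LOAD_DISPATCH = {
    PartitionType.READ_ONLY: load_read_only,
    PartitionType.MULTIPLE_LOCKS: load_multiple_locks,
}


def stm_load(partition, memory, addr):
    # Dispatch to the type-specific function on each call.
    return LOAD_DISPATCH[partition.ptype](partition, memory, addr)


def stm_store(partition, memory, addr, value):
    if partition.ptype is PartitionType.READ_ONLY:
        # First write: the partition can no longer stay read-only.
        partition.ptype = PartitionType.MULTIPLE_LOCKS
        partition.log.append("retuned")
    partition.log.append("locked store")
    memory[addr] = value
```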

Page 8

Performance

[Figure: performance plot. Annotations and series:]

- Exclusive Lock is faster than general-purpose STM
- Partitioning decreases false conflicts in the lock array. The lock hash function gets a second level at compile time.
- Partitioning adds runtime overhead
- TinySTM w/o partitioning support, 2^20 / 2^24 locks
- TinySTM with partitioning, 4 different tuning heuristics
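
To illustrate why a second hash level can reduce false conflicts, the Python sketch below compares a single shared lock table with per-partition lock tables. The hash (drop the low three address bits, then take the remainder) and the tiny table size are assumptions chosen for illustration, not TinySTM's actual mapping.

```python
def shared_lock_index(addr, num_locks):
    # One global lock array: assume 8-byte words, so drop the low 3 bits.
    return (addr >> 3) % num_locks


def partitioned_lock_index(partition_id, addr, locks_per_partition):
    # First level: the partition (fixed at compile time) selects a lock array.
    # Second level: the address hash selects a lock inside that array.
    return (partition_id, (addr >> 3) % locks_per_partition)


NUM_LOCKS = 1 << 4                  # deliberately tiny table
addr_a = 0x1000                     # assumed to belong to partition 0
addr_b = 0x1000 + (NUM_LOCKS << 3)  # assumed to belong to partition 1

# In the shared table both addresses map to the same lock (a false conflict).
collide_shared = (shared_lock_index(addr_a, NUM_LOCKS)
                  == shared_lock_index(addr_b, NUM_LOCKS))

# With per-partition tables they map to different locks.
collide_partitioned = (partitioned_lock_index(0, addr_a, NUM_LOCKS)
                       == partitioned_lock_index(1, addr_b, NUM_LOCKS))
```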

Page 9

Performance (2)

[Figure: performance plot. Annotations:]

- Read-Only partitions during first phase of benchmark
- 5 x 256K locks
- 2^26 locks! (2^24 livelocks due to false conflicts)

Page 10

Challenges

- Analysis: calls to libraries?
  - Points-to graphs can probably be attached to libs (local per-function analysis + call graph)
  - Analysis is bottom-up on call graph
- TM implementations that don’t support two-phase commit
- Dispatch:
  - Runtime overheads
  - JIT?
  - Size of binaries
- Tuning partitions and partitioning
  - No direct feedback, partitioning results in even more parameters to be tuned
  - Partition selection / merging at compile-time/runtime

Page 11

Questions?

Tanger + TinySTM + …:

http://tinystm.org

(send email for version with partitioning support)

Page 12

Backup Slides

Page 13

Are there partitions?

Page 14

Partition Type Performance & Tuning Strategies

Tuning strategy:

- Start with read-only type
- On reaching a certain number of aborts, switch to:
  1. Single Exclusive Lock
  2. Single Shared Lock
  3. Multiple Locks
- Part-1: switch directly to Multiple Locks
- Part-4: try other types first (single locks, fewer multiple locks)
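
One possible reading of this strategy is an abort-driven escalation through the partition types, sketched below in Python. The class name, the threshold parameter, the counter reset after each switch, and the `direct` flag (loosely modeling the Part-1 behavior) are all assumptions; the slides do not specify these details.

```python
ESCALATION = [
    "read-only",
    "single exclusive lock",
    "single shared lock",
    "multiple locks",
]


class TunedPartition:
    """Hypothetical per-partition tuning state."""

    def __init__(self, abort_threshold, direct=False):
        self.ptype = "read-only"          # every partition starts read-only
        self.aborts = 0
        self.abort_threshold = abort_threshold
        self.direct = direct              # True: jump straight to multiple locks

    def record_abort(self):
        self.aborts += 1
        if self.aborts < self.abort_threshold:
            return
        self.aborts = 0                   # assumed: counter restarts per type
        if self.direct:
            self.ptype = "multiple locks"
        else:
            idx = ESCALATION.index(self.ptype)
            self.ptype = ESCALATION[min(idx + 1, len(ESCALATION) - 1)]
```

With `abort_threshold=2`, a partition moves one step along the list after every second abort and stays at multiple locks once it gets there.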

Page 15

Analysis

We use Data Structure Analysis (DSA [1]):

- Pointer analysis for LLVM compiler framework
- Creates a points-to graph with Data Structure (DS) nodes
- Context-sensitive: data structures distinguished based on call graphs
- Field-sensitive: distinguishes between DS fields
- Unification-based:
  - Pointers target a single node in the points-to graph
  - Information about pointers from different places gets merged
  - If the information is incompatible, the node is collapsed (= “nothing known”)
- Can safely analyze incomplete programs:
  - Calls to external / not analyzed functions have an effect only on the data that escapes into / from these functions (this data gets marked “External”)
  - Analyzing more code increases analysis precision
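
The unification idea can be sketched with a union-find structure in Python. This is a generic illustration of unification-based merging with collapse on incompatible type information, not DSA's actual implementation; the class and method names are hypothetical, and field- and context-sensitivity are left out.

```python
class PointsToGraph:
    """Toy unification-based points-to graph (union-find over nodes)."""

    def __init__(self):
        self.parent = {}
        self.type = {}  # node -> type name, or None once collapsed

    def add_node(self, name, type_name):
        self.parent[name] = name
        self.type[name] = type_name

    def find(self, name):
        # Follow parent links to the representative, halving paths on the way.
        while self.parent[name] != name:
            self.parent[name] = self.parent[self.parent[name]]
            name = self.parent[name]
        return name

    def unify(self, a, b):
        # A pointer may target only one node, so both targets are merged.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        self.parent[rb] = ra
        if self.type[ra] != self.type[rb]:
            self.type[ra] = None  # incompatible information: collapse
        return ra

    def node_type(self, name):
        return self.type[self.find(name)]
```

Merging two nodes of the same type keeps the type; merging in a node of a different type collapses the result, after which nothing is known about its layout.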

[1] Chris Lattner, PhD thesis, 2005

Page 16

Analysis (2)

Integration into Tanger compilation process:

1. Compile and link program parts into LLVM intermediate representation module
2. Analyze module using DSA
   - Local intra-function analysis: per-function DS graph
   - Merge DS graphs bottom-up in call graph (put callees’ information into callers)
   - Merge DS graphs top-down in call graph (vice versa)
3. Transactify module
   - Use DSA information to decide between object-based / word-based
   - Requirement: if a memory chunk (DS node) is object-based, then it must be safe for object-based everywhere in the program
   - DSA can give us this guarantee
4. Link in STM library and generate native code
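
The bottom-up phase of step 2 can be pictured as a post-order walk over the call graph that folds each callee's information into its callers. The Python sketch below is a heavy simplification under stated assumptions: DS graphs are reduced to plain sets of node names, the call graph is assumed to be acyclic, and the function and variable names are hypothetical.

```python
def bottom_up_summaries(call_graph, local_nodes):
    """Merge callee information into callers.

    call_graph:  function name -> list of callee names
    local_nodes: function name -> set of DS node names found locally
    Assumes an acyclic call graph (no recursion).
    """
    summaries = {}

    def visit(fn):
        if fn in summaries:
            return summaries[fn]
        merged = set(local_nodes[fn])
        for callee in call_graph.get(fn, []):
            merged |= visit(callee)  # callees are processed before callers
        summaries[fn] = merged
        return merged

    for fn in local_nodes:
        visit(fn)
    return summaries


call_graph = {"main": ["insert", "lookup"], "insert": ["rotate"]}
local_nodes = {
    "main": {"tree_a", "tree_b"},
    "insert": {"node"},
    "lookup": {"key"},
    "rotate": {"parent_link"},
}
summaries = bottom_up_summaries(call_graph, local_nodes)
```

The top-down phase would run in the opposite direction, pushing caller information into callees.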