A Quest for Unified, Global View Parallel
Programming Models for Our Future
Kenjiro Taura
University of Tokyo

[Title-slide figure: a task DAG with nodes T0-T199, of the kind DAGViz (shown later in this talk) produces]
1 / 52
Acknowledgements
▶ Jun Nakashima (MassiveThreads)
▶ Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
▶ An Huynh (DAGViz)
▶ Shintaro Iwasaki (Vectorization)
2 / 52
3 / 52
What is task parallelism?
▶ like most CS terms, the definition is vague
▶ I don't consider the dichotomy "data parallelism vs. task parallelism" useful
  ▶ imagine lots of tasks, each working on a piece of data: is it data parallel or task parallel?
▶ let's instead ask:
  ▶ what's useful from the programmer's viewpoint
  ▶ what distinctions are useful to make from the implementer's viewpoint
4 / 52
What is task parallelism?
A system supports task parallelism when:

1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and tasks are automatically mapped onto hardware parallelism (cores, nodes, . . . );
4. and they are cheaply context-switched.

[Figure: an execution timeline with "create task" points spawning new tasks]

(a minimal code sketch follows below)
5 / 52
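To make the definition concrete, here is a minimal sketch of dynamic task creation, assuming the TBB-style mtbb::task_group interface that appears later in this talk:

  long fib(long n) {
    if (n < 2) return n;
    long x, y;
    mtbb::task_group tg;
    tg.run([&] { x = fib(n - 1); });  /* create a task at an arbitrary point */
    y = fib(n - 2);                   /* the parent keeps working */
    tg.wait();                        /* synchronize with the child */
    return x + y;
  }

The programmer only declares logical concurrency; mapping the resulting tasks onto cores is left to the runtime.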
What are they good for?
▶ generality: "creating tasks at arbitrary points" unifies many superficially different patterns
  ▶ parallel nested loops, parallel recursions
  ▶ they trivially compose
▶ programmability: cheap task creation + automatic load balancing allow a straightforward, processor-oblivious decomposition of the work (divide-and-conquer until trivial)
▶ performance: dynamic scheduling is a basis for hiding latencies and tolerating noise
6 / 52
Our goal
▶ programmers use tasks (+ higher-level syntax on top) as the unified means to express parallelism
▶ the system maps tasks to hardware parallelism
  ▶ cores within a node
  ▶ nodes
  ▶ SIMD lanes within a core!
7 / 52
Rest of the talk
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
8 / 52
9 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
10 / 52
Taxonomy
▶ library or frontend: is it implemented with ordinary C/C++ compilers, or does it heavily rely on a tailored frontend?
▶ tasks suspendable or atomic: can tasks suspend/resume in the middle, or do they always run to completion?
▶ synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology, or only in pre-defined patterns (e.g., bag-of-tasks, fork/join)?
▶ tasks untied or tied: can tasks migrate after they have started?
11 / 52
Instantiations
                 library/frontend  suspendable tasks  untied tasks  sync topology
OpenMP tasks     frontend          yes                yes           fork/join
TBB              library           yes                no            fork/join
Cilk             frontend          yes                yes           fork/join
Quark            library           no                 no            arbitrary
Nanos++          library           yes                yes           arbitrary
Qthreads         library           yes                yes           arbitrary
Argobots         library           yes                yes?          arbitrary
MassiveThreads   library           yes                yes           arbitrary
12 / 52
MassiveThreads
▶ https://github.com/massivethreads/massivethreads
▶ design philosophy: user-level threads (ULT) in an ordinary thread API as you know it
  ▶ tid = myth_create(f, arg)
  ▶ myth_join(tid)
  ▶ myth_yield() to switch among threads (useful for latency hiding)
  ▶ mutexes and condition variables to build arbitrary synchronization patterns
▶ efficient work stealing scheduler (locally LIFO and child-first; steals the oldest task first)
▶ an (experimental) customizable work stealing [Nakashima and Taura; ROSS 2013]

(a usage sketch follows below)
13 / 52
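A minimal sketch of this "ordinary thread API" style (the exact signatures and header name are my assumptions based on the calls above, not a verified excerpt of MassiveThreads):

  #include <myth/myth.h>  /* header name assumed */

  void *child(void *arg) {
    /* ... work ... */
    return arg;
  }

  void parent(void) {
    void *ret;
    myth_thread_t tid = myth_create(child, NULL);  /* tid = myth_create(f, arg) */
    myth_yield();           /* voluntarily switch to another ULT */
    myth_join(tid, &ret);   /* join the child thread */
  }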
User-facing APIs on MassiveThreads
▶ TBB's task_group and parallel_for (but with an untied work stealing scheduler)
▶ Chapel tasks on top of MassiveThreads (currently broken orz)
▶ SML# (Ueno @ Tohoku University): ongoing
▶ Tapas (Fukuda @ RIKEN), a domain-specific language for particle simulation

The TBB interface on MassiveThreads:

  quicksort(a, p, q) {
    if (q - p < th) {
      ...
    } else {
      mtbb::task_group tg;
      r = partition(a, p, q);
      tg.run([=] { quicksort(a, p, r-1); });
      quicksort(a, r, q);
      tg.wait();
    }
  }
14 / 52
Important performance metrics
▶ low local creation/sync overhead
▶ low local context switch cost
▶ reasonably low load balancing (migration) overhead
▶ a somewhat sequential scheduling order

  parent() {
    π0:
    spawn { γ: ... };
    π1:
  }

op              what is measured   time (cycles)
local create    π0 → γ             ≈ 140
work steal      π0 → π1            ≈ 900
context switch  myth_yield         ≈ 80

(Haswell i7-4500U (1.80 GHz), GCC 4.9; a measurement sketch follows below)
15 / 52
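For reference, a minimal sketch of how such per-operation cycle counts can be taken (an rdtsc-based harness of my own, not the talk's actual benchmark; it times a create + sync pair rather than the π0 → γ switch alone):

  #include <x86intrin.h>

  long create_sync_cycles(int iters) {
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
      mtbb::task_group tg;
      tg.run([] {});    /* spawn an empty child (γ) */
      tg.wait();        /* and synchronize */
    }
    return (__rdtsc() - t0) / iters;  /* average cycles per create + sync */
  }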
Comparison to other systems
[Bar chart: per-operation cost in clocks of local task creation ("child") and of a work steal ("parent") for Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads, and TBB; the y-axis runs from 0 to 3000 clocks, visible values include 73, 72, 138, and 167, and one off-scale bar is marked ≈ 7000]

  parent() {
    π0:
    spawn { γ: ... };
    π1:
  }
Summary:
▶ Cilk(Plus), known for its superb local creation performance, sacrifices work stealing performance
▶ TBB's local creation overhead is equally good, but it is "parent-first" and its tasks are tied to a worker once started
16 / 52
Further research agenda (1)
▶ task runtimes for ever larger scale systems are vital
▶ ⇒ "locality-/cache-/hierarchy-/topology-/whatever-aware" schedulers are obviously important
▶ ⇒ hence proposals for hierarchical/customizable schedulers
▶ ⇒ yet, IMO, there is no clear demonstration of a scheduler that clearly outperforms simple greedy work stealing over many workloads
▶ the question, it seems, ultimately comes down to this:

  when no tasks exist near you but some may exist far from you, steal one or not (stay idle)?
17 / 52
Further research agenda (2)
▶ quantify the gap between hand-optimized decomposition and automatic decomposition (by work stealing); e.g.
  ▶ space-filling decomposition vs. work stealing
  ▶ 2.5D matrix multiply vs. work stealing
▶ both experimentally and theoretically
18 / 52
19 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
20 / 52
Two facets of task parallelism in distributed memory settings

▶ a means to hide latency, for which we merely need a local user-level thread library supporting suspend/resume at arbitrary points
▶ a means to globally balance loads, for which we need a system specifically designed to migrate tasks across address spaces

MassiveThreads/DM is a system supporting:
▶ distributed load balancing and latency hiding
▶ + a global address space supporting migration and replication
21 / 52
Tasks to hide latencies
The goal:
▶ individual tasks look like ordinary blocking accesses (programmer-friendly)
▶ hide latencies by creating lots of tasks

Ingredients for the implementation:
▶ a local tasking layer with good context switch performance
▶ a message/RDMA layer with good multithreaded performance

A sequential scan, and its latency-hiding version using a parallel loop:

  scan(global_array<T> a) {
    for (i = 0; i < n; i++) {
      .. = .. a[i] ..;
    }
  }

  scan(global_array<T> a) {
    pfor (i = 0; i < n; i++) {
      .. = .. a[i] ..;
    }
  }
22 / 52
Preliminary results
▶ context switch: we used MassiveThreads's myth_yield function to switch context upon blocking
▶ message/RDMA: we rolled our own thread-safe communication layer (on MPI, on IB verbs, and on Fujitsu Tofu RMA), partly because Fujitsu MPI lacks multithreading support

  /* a[i] */
  T get(address<T> addr) {
    issue a non-blocking get(addr);
    while (!(result available)) {
      myth_yield();
    }
    return result;
  }

[Plot: gets/node/sec (up to ≈ 1.4 × 10^6) against the number of tasks (1-10), one curve per worker count (workers = 1, ..., 5)]
23 / 52
Taxonomy
▶ library or frontend
▶ tasks suspendable or atomic
▶ synchronization patterns arbitrary or pre-defined
▶ tasks untied or tied
▶ the main issue: implementation complexity rises on distributed memory, especially for untied tasks
  ▶ that is, how do we move tasks across address spaces?
24 / 52
Instantiations
                                        library/frontend  suspendable task  untied tasks  sync topology  scale
Distributed Cilk [Blumofe et al. 96]    frontend          yes               yes           fork/join      16
Satin [Nieuwpoort et al. 01]            frontend          yes               no            fork/join      256
Tascell [Hiraishi et al. 09]            frontend          yes               yes           fork/join      128
Scioto [Dinan et al. 09]                library           no                no            BoT            8192
HotSLAW [Min et al. 11]                 library           yes               no            fork/join      256
X10/GLB [Zhang et al. 13]               library           no                no            BoT            16384
Grappa [Nelson et al. 15]               library           yes               no            fork/join      4096
MassiveThreads/DM [Akiyama et al. 15]   library           yes               yes           fork/join      4096
25 / 52
MassiveThreads/DM
▶ a global (inter-node) work stealing library
▶ usable with ordinary C/C++ compilers
▶ supports fork/join with untied tasks
▶ ⇒ moves native threads across nodes
26 / 52
Migrating native threads
▶ problem: the stack of a native thread contains pointers pointing into itself
▶ migrating the thread to an arbitrary address breaks these pointers
▶ ⇒ upon migration, copy the stack to the same address (iso-address [Antoniu et al. 1999]; sketched below)

[Figure: a stack at address @a; copied to a different address @a' its internal pointers break, whereas the iso-address scheme copies it to the same address @a on the destination node]
27 / 52
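A minimal sketch of the iso-address idea (my illustration using POSIX mmap, not MassiveThreads/DM's actual code):

  #include <sys/mman.h>

  /* Every node reserves the same virtual range for a given thread's
     stack, so copying the raw stack bytes to the *same* address on the
     destination node keeps all pointers into the stack valid. */
  void *reserve_iso_stack(void *agreed_addr, size_t size) {
    /* MAP_FIXED places the mapping exactly at the agreed-upon address */
    return mmap(agreed_addr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  }

  /* On migration: transfer the stack bytes (e.g., by RDMA) into the
     identical range on the destination, then resume the thread there. */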
Iso-address limits scalability
▶ for each thread, all nodes must reserve its stack address range
▶ ⇒ a huge waste of virtual memory

[Figure: the virtual address space, with stack regions reserved on every node for every thread]
28 / 52
Is consuming a huge virtual memory really a problem?

▶ with high concurrency, it may indeed overflow the virtual address space:

  stack size × task depth × cores/node × nodes = 2^14 × 2^13 × 2^8 × 2^13 = 2^48

▶ more importantly, this lavish use of virtual memory prohibits using RDMA for work stealing (as RDMA memory must be pinned)
▶ ⇒ we proposed the UniAddress scheme [Akiyama et al. 2015]
29 / 52
Further research agenda
▶ demonstrate global distributed load balancing on practical workloads with lots of shared data
▶ "locality-/hierarchy-/. . . " awareness is even more important in this setting
▶ the latency-hiding opportunity adds an extra dimension
  ▶ steal or not; switch or not
30 / 52
31 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
32 / 52
Analyzing task parallel programs
▶ task parallel systems are more "opaque" to users
  ▶ task management, load balancing, scheduling
▶ systems show performance differences, and researchers want to understand precisely where they come from
[Figure: a layered view; the program creates/waits tasks, the runtime system maps them onto the physical resource. The full task DAG (T0-T199) from the title slide is shown as an example]
33 / 52
DAG Recorder and DAGViz
▶ DAG Recorder runs a task parallel program and extracts its DAG, augmented with timestamps, CPUs, etc.
▶ DAGViz is its visualizer

  A() {
    for (i = 0; i < 2; i++) {
      mk_task_group;
      create_task(B());
      create_task(C());
      D();
      wait_tasks();
    }
  }

  D() {
    mk_task_group;
    create_task(E());
    F();
    wait_tasks();
  }

[Figure: the DAG extracted from this program; node types include create_task, wait_tasks, begin_section, and endtask]
34 / 52
Why record the DAG?
▶ the DAG is a logical representation of the program execution, independent of the runtime system
  ▶ you can compare the DAGs produced by two systems side by side
▶ the DAG contains sufficient information to reconstruct many details:
  ▶ work and critical path (excluding overhead), as sketched below
  ▶ actual parallelism (running cores) over time
  ▶ available parallelism (ready tasks) over time
  ▶ how long each task was delayed by the scheduler
35 / 52
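As an illustration of why timestamps suffice, here is a sketch of computing work and the critical path from a recorded DAG (the node layout is hypothetical and treats the fork-join DAG as a tree; DAG Recorder's real representation differs):

  #include <algorithm>
  #include <vector>

  struct Node {
    double t_start, t_end;         /* timestamps recorded per node */
    std::vector<Node*> children;   /* tasks this node created */
  };

  double work(const Node *n) {     /* total time summed over all nodes */
    double w = n->t_end - n->t_start;
    for (const Node *c : n->children) w += work(c);
    return w;
  }

  double span(const Node *n) {     /* length of the critical path */
    double s = 0;
    for (const Node *c : n->children) s = std::max(s, span(c));
    return (n->t_end - n->t_start) + s;
  }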
DAGViz Demo
Seeing is believing.
36 / 52
Challenge: reducing space requirement

▶ literally recording all subgraphs is prohibitive
▶ so we collapse "uninteresting" subgraphs into single nodes
▶ current criterion: we collapse a subgraph ⇐⇒
  1. its nodes are executed by a single worker, and
  2. its span is smaller than a (configurable) threshold

[Figure: the example DAG being progressively collapsed, one subgraph at a time, into single nodes]
37 / 52
Ongoing work
▶ hoping to use this tool to automate the discovery of issues in runtime systems
  ▶ scheduler delays along a critical path
  ▶ work time inflation
▶ shed light on "steal or not" trade-offs
38 / 52
39 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
40 / 52
Motivation
▶ task parallelism is a friend of divide-and-conquer algorithms
▶ divide-and-conquer makes coding "trivial": you divide until the problem becomes trivial
  ▶ matrix multiply, matrix factorization, triangular solve, FFT, sorting, . . .
▶ in reality, the programmer has to optimize the leaves manually
▶ why? because we lack good compilers
41 / 52
The power of divide-and-conquer
  /* quicksort */
  quicksort(a, p, q) {
    if (q − p < 2) {
      return;
    } else {
      ...
    }
  }

  /* FFT */
  fft(n, x) {
    if (n = 1) {
      return x0;
    } else {
      ...
    }
  }

  /* C += AB */
  mm(A, B, C) {
    if (|A| = 1 && |B| = 1 && |C| = 1) {
      C00 += A00 · B00;
    } else {
      ...
    }
  }

  /* triangular solve LX = B */
  trsm(L, B) {
    if (M = 1) {
      B /= l11;
    } else {
      ...
    }
  }

  /* Cholesky factorization */
  chol(A) {
    if (n = 1) {
      return (√a11);
    } else {
      ...
    }
  }

They all admit a "trivial" base case, if only its performance were acceptable . . .
42 / 52
Static optimizations and vectorization of tasks
▶ goal: run straightforward task-based programs as fast as manually optimized programs
▶ write once, parallelize everywhere (nodes, cores, and vectors)

[Figure: "serialized and vectorized"]

43 / 52
What does our compiler do?
1. static cut-off: statically eliminates task creations
2. code-bloat-free inlining: inline-expands recursions
3. loopification: transforms recursions into flat loops (and then vectorizes them if possible)
44 / 52
Static cut-off
  f(a, b, · · · ) {
    if (E) {
      L(a, b, · · · );
    } else {
      · · ·
      spawn f(a1, b1, · · · );
      · · ·
      spawn f(a2, b2, · · · );
      · · ·
    }
  }

⇒

  f_seq(a, b, · · · ) {
    if (E) {
      L(a, b, · · · );
    } else {
      · · ·
      f_seq(a1, b1, · · · );
      · · ·
      f_seq(a2, b2, · · · );
      · · ·
    }
  }

key: determine a condition H_k under which the height of the recursion from the leaves is ≤ k:

▶ H_0 = E
▶ H_{k+1} = E ∨ ∀i. (a_i, b_i, · · · ) satisfy H_k

when this succeeds, generate code that statically eliminates all task creations (a concrete instance is sketched below)
45 / 52
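For instance, with fib (my example, not one from the talk), E is n < 2, so H_1 = (n < 3) and H_2 = (n < 4); a static cut-off at H_2 yields:

  long fib_seq(long n) {                /* generated serial clone */
    if (n < 2) return n;
    return fib_seq(n - 1) + fib_seq(n - 2);
  }

  long fib(long n) {
    if (n < 4) return fib_seq(n);       /* H_2 holds: recursion height ≤ 2,
                                           so no tasks need to be created */
    long x, y;
    mtbb::task_group tg;
    tg.run([&] { x = fib(n - 1); });
    y = fib(n - 2);
    tg.wait();
    return x + y;
  }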
Code-bloat-free inlining
▶ under condition H_k, inline-expanding all recursions k times would eliminate all function calls
▶ but this would result in exponential code bloat when the function has multiple recursive calls
▶ code-bloat-free inlining fuses multiple recursive calls into a single call site:

  · · ·
  f(a1, b1, · · · );
  · · ·
  f(a2, b2, · · · );
  · · ·

⇒

  for (i = 0; i < 2; i++) {
    switch (i) {
      case 0: · · ·
      case 1: · · ·
    }
    f(a_i, b_i, · · · );
  }
46 / 52
Loopification

  f_seq(a, b, · · · ) {
    if (E) {
      L(a, b, · · · );
    } else {
      · · ·
      f_seq(a1, b1, · · · );
      · · ·
      f_seq(a2, b2, · · · );
      · · ·
    }
  }

⇒

  for i ∈ P {
    L(x_i, y_i, · · · );
  }

▶ instead of code-bloat-free inlining, loopification attempts to generate a flat (or shallow) loop directly from the recursive code
▶ it tries to synthesize, and verify, the hypothesis that the original code is an affine loop over leaf blocks (a concrete example follows below)
▶ the loopified code may then be vectorized
47 / 52
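As a concrete instance (vecadd is one of the benchmarks on the next slide, though this particular shape is my assumption), a divide-and-conquer vector add whose leaves form an affine loop:

  void vecadd_seq(float *a, float *b, long n) {
    if (n == 1) {
      a[0] += b[0];                            /* the leaf block L */
    } else {
      vecadd_seq(a, b, n / 2);                 /* affine split of [0, n) */
      vecadd_seq(a + n / 2, b + n / 2, n - n / 2);
    }
  }

  /* the flat loop loopification aims to derive, which is then vectorizable: */
  void vecadd_loop(float *a, float *b, long n) {
    for (long i = 0; i < n; i++) a[i] += b[i];
  }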
Results: effect of optimizations
[Bar chart: relative performance (normalized to the "base" version) of dynamic cut-off, static cut-off, code-bloat-free inlining (cef), loopification, and the proposed combination, on fib, nqueens, fft, sort, nbody, strassen, vecadd, heat2d, heat3d, gaussian, matmul, trimul, treeadd, treesum, and uts; the y-axis runs from 0 to 16, with off-scale bars marked 27.12, 17.56, 17.65, 220.14, and 109.72]
48 / 52
Results: remaining gap to hand-optimized code
[Bar chart: performance relative to the task version (task = 1) on nbody, vecadd, heat2d, heat3d, gaussian, matmul, and trimul, plus average and geomean, comparing task, omp, hand-optimized omp, and polly; the y-axis runs from 0 to 3]
49 / 52
50 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
51 / 52
Future outlook of task parallelism
▶ the goal: offer both programmability and performance
▶ there is a long way to go toward acceptable performance on distributed memory machines. why?
  ▶ dynamic load balancing → random traffic
  ▶ global address space → fine-grain communication
▶ these are OK in shared memory today. why not on distributed memory (at least for now)?
  ▶ checking errors and completion everywhere
  ▶ doing mutual exclusion everywhere
  ▶ no analog of hardware prefetching
  ▶ or a lack of bandwidth to tolerate random traffic and aggressive prefetching
Thank you for listening
52 / 52