A Quest for Unified, Global View Parallel
Programming Models for Our Future
Kenjiro Taura
University of Tokyo

[Title-slide figure: a task DAG with nodes T0-T199, of the kind DAGViz (shown later in this talk) produces]
1 / 52
Acknowledgements
▶ Jun Nakashima (MassiveThreads)
▶ Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
▶ An Huynh (DAGViz)
▶ Shintaro Iwasaki (Vectorization)
2 / 52
3 / 52
What is task parallelism?
▶ like most CS terms, the definition is vague
▶ I don't consider the dichotomy "data parallelism vs. task parallelism" useful
  ▶ imagine lots of tasks, each working on a piece of data: is it data parallel or task parallel?
▶ let's instead ask:
  ▶ what's useful from the programmer's viewpoint
  ▶ what distinctions are useful to make from the implementer's viewpoint
4 / 52
What is task parallelism?
A system supports task parallelism when:

1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and tasks are automatically mapped onto hardware parallelism (cores, nodes, . . . );
4. and they are cheaply context-switched.

[Figure: an execution timeline with "create task" points spawning new tasks]

(a minimal code sketch follows below)
5 / 52
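To make the definition concrete, here is a minimal sketch of dynamic task creation, assuming the TBB-style mtbb::task_group interface that appears later in this talk:

  long fib(long n) {
    if (n < 2) return n;
    long x, y;
    mtbb::task_group tg;
    tg.run([&] { x = fib(n - 1); });  /* create a task at an arbitrary point */
    y = fib(n - 2);                   /* the parent keeps working */
    tg.wait();                        /* synchronize with the child */
    return x + y;
  }

The programmer only declares logical concurrency; mapping the resulting tasks onto cores is left to the runtime.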
What are they good for?
▶ generality: "creating tasks at arbitrary points" unifies many superficially different patterns
  ▶ parallel nested loops, parallel recursions
  ▶ they trivially compose
▶ programmability: cheap task creation + automatic load balancing allow a straightforward, processor-oblivious decomposition of the work (divide-and-conquer until trivial)
▶ performance: dynamic scheduling is a basis for hiding latencies and tolerating noise
6 / 52
Our goal
▶ programmers use tasks (+ higher-level syntax on top) as the unified means to express parallelism
▶ the system maps tasks to hardware parallelism
  ▶ cores within a node
  ▶ nodes
  ▶ SIMD lanes within a core!
7 / 52
Rest of the talk
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
8 / 52
9 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
10 / 52
Taxonomy
▶ library or frontend: is it implemented with ordinary C/C++ compilers, or does it heavily rely on a tailored frontend?
▶ tasks suspendable or atomic: can tasks suspend/resume in the middle, or do they always run to completion?
▶ synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology, or only in pre-defined patterns (e.g., bag-of-tasks, fork/join)?
▶ tasks untied or tied: can tasks migrate after they have started?
11 / 52
Instantiations
                 library/frontend  suspendable tasks  untied tasks  sync topology
OpenMP tasks     frontend          yes                yes           fork/join
TBB              library           yes                no            fork/join
Cilk             frontend          yes                yes           fork/join
Quark            library           no                 no            arbitrary
Nanos++          library           yes                yes           arbitrary
Qthreads         library           yes                yes           arbitrary
Argobots         library           yes                yes?          arbitrary
MassiveThreads   library           yes                yes           arbitrary
12 / 52
MassiveThreads
▶ https://github.com/massivethreads/massivethreads
▶ design philosophy: user-level threads (ULT) in an ordinary thread API as you know it
  ▶ tid = myth_create(f, arg)
  ▶ myth_join(tid)
  ▶ myth_yield() to switch among threads (useful for latency hiding)
  ▶ mutexes and condition variables to build arbitrary synchronization patterns
▶ efficient work stealing scheduler (locally LIFO and child-first; steals the oldest task first)
▶ an (experimental) customizable work stealing [Nakashima and Taura; ROSS 2013]

(a usage sketch follows below)
13 / 52
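A minimal sketch of this "ordinary thread API" style (the exact signatures and header name are my assumptions based on the calls above, not a verified excerpt of MassiveThreads):

  #include <myth/myth.h>  /* header name assumed */

  void *child(void *arg) {
    /* ... work ... */
    return arg;
  }

  void parent(void) {
    void *ret;
    myth_thread_t tid = myth_create(child, NULL);  /* tid = myth_create(f, arg) */
    myth_yield();           /* voluntarily switch to another ULT */
    myth_join(tid, &ret);   /* join the child thread */
  }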
User-facing APIs on MassiveThreads
▶ TBB's task_group and parallel_for (but with an untied work stealing scheduler)
▶ Chapel tasks on top of MassiveThreads (currently broken orz)
▶ SML# (Ueno @ Tohoku University): ongoing
▶ Tapas (Fukuda @ RIKEN), a domain-specific language for particle simulation

The TBB interface on MassiveThreads:

  quicksort(a, p, q) {
    if (q - p < th) {
      ...
    } else {
      mtbb::task_group tg;
      r = partition(a, p, q);
      tg.run([=] { quicksort(a, p, r-1); });
      quicksort(a, r, q);
      tg.wait();
    }
  }
14 / 52
Important performance metrics
▶ low local creation/sync overhead
▶ low local context switch cost
▶ reasonably low load balancing (migration) overhead
▶ a somewhat sequential scheduling order

  parent() {
    π0:
    spawn { γ: ... };
    π1:
  }

op              what is measured   time (cycles)
local create    π0 → γ             ≈ 140
work steal      π0 → π1            ≈ 900
context switch  myth_yield         ≈ 80

(Haswell i7-4500U (1.80 GHz), GCC 4.9; a measurement sketch follows below)
15 / 52
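For reference, a minimal sketch of how such per-operation cycle counts can be taken (an rdtsc-based harness of my own, not the talk's actual benchmark; it times a create + sync pair rather than the π0 → γ switch alone):

  #include <x86intrin.h>

  long create_sync_cycles(int iters) {
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
      mtbb::task_group tg;
      tg.run([] {});    /* spawn an empty child (γ) */
      tg.wait();        /* and synchronize */
    }
    return (__rdtsc() - t0) / iters;  /* average cycles per create + sync */
  }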
Comparison to other systems
[Bar chart: per-operation cost in clocks of local task creation ("child") and of a work steal ("parent") for Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads, and TBB; the y-axis runs from 0 to 3000 clocks, visible values include 73, 72, 138, and 167, and one off-scale bar is marked ≈ 7000]

  parent() {
    π0:
    spawn { γ: ... };
    π1:
  }
Summary:
▶ Cilk(Plus), known for its superb local creation performance, sacrifices work stealing performance
▶ TBB's local creation overhead is equally good, but it is "parent-first" and its tasks are tied to a worker once started
16 / 52
Further research agenda (1)
▶ task runtimes for ever larger scale systems are vital
▶ ⇒ "locality-/cache-/hierarchy-/topology-/whatever-aware" schedulers are obviously important
▶ ⇒ hence proposals for hierarchical/customizable schedulers
▶ ⇒ yet, IMO, there is no clear demonstration of a scheduler that clearly outperforms simple greedy work stealing over many workloads
▶ the question, it seems, ultimately comes down to this:

  when no tasks exist near you but some may exist far from you, steal one or not (stay idle)?
17 / 52
Further research agenda (2)
▶ quantify the gap between hand-optimized decomposition and automatic decomposition (by work stealing); e.g.
  ▶ space-filling decomposition vs. work stealing
  ▶ 2.5D matrix multiply vs. work stealing
▶ both experimentally and theoretically
18 / 52
19 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
20 / 52
Two facets of task parallelism in distributed memory settings

▶ a means to hide latency, for which we merely need a local user-level thread library supporting suspend/resume at arbitrary points
▶ a means to globally balance loads, for which we need a system specifically designed to migrate tasks across address spaces

MassiveThreads/DM is a system supporting:
▶ distributed load balancing and latency hiding
▶ + a global address space supporting migration and replication
21 / 52
Tasks to hide latencies
The goal:
▶ individual tasks look like ordinary blocking accesses (programmer-friendly)
▶ hide latencies by creating lots of tasks

Ingredients for the implementation:
▶ a local tasking layer with good context switch performance
▶ a message/RDMA layer with good multithreaded performance

A sequential scan, and its latency-hiding version using a parallel loop:

  scan(global_array<T> a) {
    for (i = 0; i < n; i++) {
      .. = .. a[i] ..;
    }
  }

  scan(global_array<T> a) {
    pfor (i = 0; i < n; i++) {
      .. = .. a[i] ..;
    }
  }
22 / 52
Preliminary results
▶ context switch: we used MassiveThreads's myth_yield function to switch context upon blocking
▶ message/RDMA: we rolled our own thread-safe communication layer (on MPI, on IB verbs, and on Fujitsu Tofu RMA), partly because Fujitsu MPI lacks multithreading support

  /* a[i] */
  T get(address<T> addr) {
    issue a non-blocking get(addr);
    while (!(result available)) {
      myth_yield();
    }
    return result;
  }

[Plot: gets/node/sec (up to ≈ 1.4 × 10^6) against the number of tasks (1-10), one curve per worker count (workers = 1, ..., 5)]
23 / 52
Taxonomy
▶ library or frontend
▶ tasks suspendable or atomic
▶ synchronization patterns arbitrary or pre-defined
▶ tasks untied or tied
▶ the main issue: implementation complexity rises on distributed memory, especially for untied tasks
  ▶ that is, how do we move tasks across address spaces?
24 / 52
Instantiations
                                        library/frontend  suspendable task  untied tasks  sync topology  scale
Distributed Cilk [Blumofe et al. 96]    frontend          yes               yes           fork/join      16
Satin [Nieuwpoort et al. 01]            frontend          yes               no            fork/join      256
Tascell [Hiraishi et al. 09]            frontend          yes               yes           fork/join      128
Scioto [Dinan et al. 09]                library           no                no            BoT            8192
HotSLAW [Min et al. 11]                 library           yes               no            fork/join      256
X10/GLB [Zhang et al. 13]               library           no                no            BoT            16384
Grappa [Nelson et al. 15]               library           yes               no            fork/join      4096
MassiveThreads/DM [Akiyama et al. 15]   library           yes               yes           fork/join      4096
25 / 52
MassiveThreads/DM
▶ a global (inter-node) work stealing library
▶ usable with ordinary C/C++ compilers
▶ supports fork/join with untied tasks
▶ ⇒ moves native threads across nodes
26 / 52
Migrating native threads
▶ problem: the stack of a native thread contains pointers pointing into itself
▶ migrating the thread to an arbitrary address breaks these pointers
▶ ⇒ upon migration, copy the stack to the same address (iso-address [Antoniu et al. 1999]; sketched below)

[Figure: a stack at address @a; copied to a different address @a' its internal pointers break, whereas the iso-address scheme copies it to the same address @a on the destination node]
27 / 52
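A minimal sketch of the iso-address idea (my illustration using POSIX mmap, not MassiveThreads/DM's actual code):

  #include <sys/mman.h>

  /* Every node reserves the same virtual range for a given thread's
     stack, so copying the raw stack bytes to the *same* address on the
     destination node keeps all pointers into the stack valid. */
  void *reserve_iso_stack(void *agreed_addr, size_t size) {
    /* MAP_FIXED places the mapping exactly at the agreed-upon address */
    return mmap(agreed_addr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  }

  /* On migration: transfer the stack bytes (e.g., by RDMA) into the
     identical range on the destination, then resume the thread there. */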
Iso-address limits scalability
▶ for each thread, all nodes must reserve its stack address range
▶ ⇒ a huge waste of virtual memory

[Figure: the virtual address space, with stack regions reserved on every node for every thread]
28 / 52
Is consuming a huge virtual memory really a problem?

▶ with high concurrency, it may indeed overflow the virtual address space:

  stack size × task depth × cores/node × nodes = 2^14 × 2^13 × 2^8 × 2^13 = 2^48

▶ more importantly, this lavish use of virtual memory prohibits using RDMA for work stealing (as RDMA memory must be pinned)
▶ ⇒ we proposed the UniAddress scheme [Akiyama et al. 2015]
29 / 52
Further research agenda
▶ demonstrate global distributed load balancing on practical workloads with lots of shared data
▶ "locality-/hierarchy-/. . . " awareness is even more important in this setting
▶ the latency-hiding opportunity adds an extra dimension
  ▶ steal or not; switch or not
30 / 52
31 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
32 / 52
Analyzing task parallel programs
▶ task parallel systems are more "opaque" to users
  ▶ task management, load balancing, scheduling
▶ systems show performance differences, and researchers want to understand precisely where they come from
[Figure: a layered view; the program creates/waits tasks, the runtime system maps them onto the physical resource. The full task DAG (T0-T199) from the title slide is shown as an example]
33 / 52
DAG Recorder and DAGViz
▶ DAG Recorder runs a task parallel program and extracts its DAG, augmented with timestamps, CPUs, etc.
▶ DAGViz is its visualizer

  A() {
    for (i = 0; i < 2; i++) {
      mk_task_group;
      create_task(B());
      create_task(C());
      D();
      wait_tasks();
    }
  }

  D() {
    mk_task_group;
    create_task(E());
    F();
    wait_tasks();
  }

[Figure: the DAG extracted from this program; node types include create_task, wait_tasks, begin_section, and endtask]
34 / 52
Why record the DAG?
▶ the DAG is a logical representation of the program execution, independent of the runtime system
  ▶ you can compare the DAGs produced by two systems side by side
▶ the DAG contains sufficient information to reconstruct many details:
  ▶ work and critical path (excluding overhead), as sketched below
  ▶ actual parallelism (running cores) over time
  ▶ available parallelism (ready tasks) over time
  ▶ how long each task was delayed by the scheduler
35 / 52
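As an illustration of why timestamps suffice, here is a sketch of computing work and the critical path from a recorded DAG (the node layout is hypothetical and treats the fork-join DAG as a tree; DAG Recorder's real representation differs):

  #include <algorithm>
  #include <vector>

  struct Node {
    double t_start, t_end;         /* timestamps recorded per node */
    std::vector<Node*> children;   /* tasks this node created */
  };

  double work(const Node *n) {     /* total time summed over all nodes */
    double w = n->t_end - n->t_start;
    for (const Node *c : n->children) w += work(c);
    return w;
  }

  double span(const Node *n) {     /* length of the critical path */
    double s = 0;
    for (const Node *c : n->children) s = std::max(s, span(c));
    return (n->t_end - n->t_start) + s;
  }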
DAGViz Demo
Seeing is believing.
36 / 52
Challenge: reducing space requirement

▶ literally recording all subgraphs is prohibitive
▶ so we collapse "uninteresting" subgraphs into single nodes
▶ current criterion: we collapse a subgraph ⇐⇒
  1. its nodes are executed by a single worker, and
  2. its span is smaller than a (configurable) threshold

[Figure: the example DAG being progressively collapsed, one subgraph at a time, into single nodes]
37 / 52
Ongoing work
▶ hoping to use this tool to automate the discovery of issues in runtime systems
  ▶ scheduler delays along a critical path
  ▶ work time inflation
▶ shed light on "steal or not" trade-offs
38 / 52
39 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
40 / 52
Motivation
▶ task parallelism is a friend of divide-and-conquer algorithms
▶ divide-and-conquer makes coding "trivial": you divide until the problem becomes trivial
  ▶ matrix multiply, matrix factorization, triangular solve, FFT, sorting, . . .
▶ in reality, the programmer has to optimize the leaves manually
▶ why? because we lack good compilers
41 / 52
The power of divide-and-conquer
  /* quicksort */
  quicksort(a, p, q) {
    if (q − p < 2) {
      return;
    } else {
      ...
    }
  }

  /* FFT */
  fft(n, x) {
    if (n = 1) {
      return x0;
    } else {
      ...
    }
  }

  /* C += AB */
  mm(A, B, C) {
    if (|A| = 1 && |B| = 1 && |C| = 1) {
      C00 += A00 · B00;
    } else {
      ...
    }
  }

  /* triangular solve LX = B */
  trsm(L, B) {
    if (M = 1) {
      B /= l11;
    } else {
      ...
    }
  }

  /* Cholesky factorization */
  chol(A) {
    if (n = 1) {
      return (√a11);
    } else {
      ...
    }
  }

They all admit a "trivial" base case, if only its performance were acceptable . . .
42 / 52
Static optimizations and vectorization of tasks
▶ goal: run straightforward task-based programs as fast as manually optimized programs
▶ write once, parallelize everywhere (nodes, cores, and vectors)

[Figure: "serialized and vectorized"]

43 / 52
What does our compiler do?
1. static cut-off: statically eliminates task creations
2. code-bloat-free inlining: inline-expands recursions
3. loopification: transforms recursions into flat loops (and then vectorizes them if possible)
44 / 52
Static cut-off
  f(a, b, · · · ) {
    if (E) {
      L(a, b, · · · );
    } else {
      · · ·
      spawn f(a1, b1, · · · );
      · · ·
      spawn f(a2, b2, · · · );
      · · ·
    }
  }

⇒

  f_seq(a, b, · · · ) {
    if (E) {
      L(a, b, · · · );
    } else {
      · · ·
      f_seq(a1, b1, · · · );
      · · ·
      f_seq(a2, b2, · · · );
      · · ·
    }
  }

key: determine a condition H_k under which the height of the recursion from the leaves is ≤ k:

▶ H_0 = E
▶ H_{k+1} = E ∨ ∀i. (a_i, b_i, · · · ) satisfy H_k

when this succeeds, generate code that statically eliminates all task creations (a concrete instance is sketched below)
45 / 52
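For instance, with fib (my example, not one from the talk), E is n < 2, so H_1 = (n < 3) and H_2 = (n < 4); a static cut-off at H_2 yields:

  long fib_seq(long n) {                /* generated serial clone */
    if (n < 2) return n;
    return fib_seq(n - 1) + fib_seq(n - 2);
  }

  long fib(long n) {
    if (n < 4) return fib_seq(n);       /* H_2 holds: recursion height ≤ 2,
                                           so no tasks need to be created */
    long x, y;
    mtbb::task_group tg;
    tg.run([&] { x = fib(n - 1); });
    y = fib(n - 2);
    tg.wait();
    return x + y;
  }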
Code-bloat-free inlining
▶ under condition H_k, inline-expanding all recursions k times would eliminate all function calls
▶ but this would result in exponential code bloat when the function has multiple recursive calls
▶ code-bloat-free inlining fuses multiple recursive calls into a single call site:

  · · ·
  f(a1, b1, · · · );
  · · ·
  f(a2, b2, · · · );
  · · ·

⇒

  for (i = 0; i < 2; i++) {
    switch (i) {
      case 0: · · ·
      case 1: · · ·
    }
    f(a_i, b_i, · · · );
  }
46 / 52
Loopification

  f_seq(a, b, · · · ) {
    if (E) {
      L(a, b, · · · );
    } else {
      · · ·
      f_seq(a1, b1, · · · );
      · · ·
      f_seq(a2, b2, · · · );
      · · ·
    }
  }

⇒

  for i ∈ P {
    L(x_i, y_i, · · · );
  }

▶ instead of code-bloat-free inlining, loopification attempts to generate a flat (or shallow) loop directly from the recursive code
▶ it tries to synthesize, and verify, the hypothesis that the original code is an affine loop over leaf blocks (a concrete example follows below)
▶ the loopified code may then be vectorized
47 / 52
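As a concrete instance (vecadd is one of the benchmarks on the next slide, though this particular shape is my assumption), a divide-and-conquer vector add whose leaves form an affine loop:

  void vecadd_seq(float *a, float *b, long n) {
    if (n == 1) {
      a[0] += b[0];                            /* the leaf block L */
    } else {
      vecadd_seq(a, b, n / 2);                 /* affine split of [0, n) */
      vecadd_seq(a + n / 2, b + n / 2, n - n / 2);
    }
  }

  /* the flat loop loopification aims to derive, which is then vectorizable: */
  void vecadd_loop(float *a, float *b, long n) {
    for (long i = 0; i < n; i++) a[i] += b[i];
  }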
Results: effect of optimizations
[Bar chart: relative performance (normalized to the "base" version) of dynamic cut-off, static cut-off, code-bloat-free inlining (cef), loopification, and the proposed combination, on fib, nqueens, fft, sort, nbody, strassen, vecadd, heat2d, heat3d, gaussian, matmul, trimul, treeadd, treesum, and uts; the y-axis runs from 0 to 16, with off-scale bars marked 27.12, 17.56, 17.65, 220.14, and 109.72]
48 / 52
Results: remaining gap to hand-optimized code
[Bar chart: performance relative to the task version (task = 1) on nbody, vecadd, heat2d, heat3d, gaussian, matmul, and trimul, plus average and geomean, comparing task, omp, hand-optimized omp, and polly; the y-axis runs from 0 to 3]
49 / 52
50 / 52
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
51 / 52
Future outlook of task parallelism
▶ the goal: offer both programmability and performance
▶ there is a long way to go toward acceptable performance on distributed memory machines. why?
  ▶ dynamic load balancing → random traffic
  ▶ global address space → fine-grain communication
▶ these are OK in shared memory today. why not on distributed memory (at least for now)?
  ▶ checking errors and completion everywhere
  ▶ doing mutual exclusion everywhere
  ▶ no analog of hardware prefetching
  ▶ or a lack of bandwidth to tolerate random traffic and aggressive prefetching
Thank you for listening
52 / 52