Fine-grain Task Aggregation and Coordination on GPUs
Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§
ISCA, June 16, 2014
† University of Wisconsin–Madison  § AMD Research
| Fine-grain Task Aggregation and Coordination on GPUs | ISCA, June 16, 2014
Executive Summary
SIMT languages (e.g., CUDA & OpenCL) restrict GPU programmers to regular parallelism
‒ Compare to Pthreads, Cilk, MapReduce, TBB, etc.
Goal: enable irregular parallelism on GPUs
‒ Why? More GPU applications
‒ How? Fine-grain task aggregation
‒ What? Cilk on GPUs
Outline
Background
‒ GPUs
‒ Cilk
‒ Channel Abstraction
Our Work
‒ Cilk on Channels
‒ Channel Design
Results/Conclusion
GPUs Today
GPU tasks are scheduled by a control processor (CP), a small, in-order programmable core
Today’s GPU abstractions are coarse-grain
[Figure: GPU with a CP and several SIMD units, attached to system memory]
+ Maps well to SIMD hardware
‒ Limits fine-grain scheduling
Cilk Background
Cilk extends C for divide-and-conquer parallelism by adding two keywords:
‒ spawn: schedule a thread to execute a function
‒ sync: wait for prior spawns to complete

int fib(int n) {
  if (n <= 2) return 1;
  int x = spawn fib(n - 1);
  int y = spawn fib(n - 2);
  sync;
  return (x + y);
}
Prior Work on Channels
The CP, or aggregator (agg), manages channels
Channels are finite task queues, except:
1. User-defined scheduling
2. Dynamic aggregation
3. One consumption function
[Figure: aggregator and channels in system memory, feeding the GPU’s SIMD units]
Dynamic aggregation enables “CPU-like” scheduling abstractions on GPUs
Outline
Background
‒ GPUs
‒ Cilk
‒ Channel Abstraction
Our Work
‒ Cilk on Channels
‒ Channel Design
Results/Conclusion
Enable Cilk on GPUs via Channels
Cilk routines are split at each sync into sub-routines
Step 1

Original:
int fib(int n) {
  if (n <= 2) return 1;
  int x = spawn fib(n - 1);
  int y = spawn fib(n - 2);
  sync;
  return (x + y);
}

“pre-sync”:
int fib(int n) {
  if (n <= 2) return 1;
  int x = spawn fib(n - 1);
  int y = spawn fib(n - 2);
}

“continuation”:
int fib_cont(int x, int y) {
  return (x + y);
}
Enable Cilk on GPUs via Channels
Channels instantiated for breadth-first traversal
‒ Quickly populates the GPU’s tens of thousands of lanes
‒ Facilitates coarse-grain dependency management
Step 2
[Figure: Cilk tree for fib(5), showing “pre-sync” tasks (ready or done) and “continuation” tasks, with spawn edges (task A spawned task B) and dependence edges (task B depends on task A). Ready fib tasks sit in the fib channel; continuations sit in a fib_cont channel stack, deepest at the top of the stack.]
Bound Cilk’s Memory Footprint
Bound memory to the depth of the Cilk tree by draining the channels closest to the base case first
‒ The amount of work generated dynamically is not known a priori
We propose that GPUs allow SIMT threads to yield
‒ Facilitates resolving conflicts on shared resources like memory
[Figure: Cilk tree for fib(5); deeper channels are drained first]
Channel Implementation
Our design accommodates SIMT access patterns:
+ Array-based
+ Lock-free
+ Non-blocking
See paper for details
Outline
Background
‒ GPUs
‒ Cilk
‒ Channel Abstraction
Our Work
‒ Cilk on Channels
‒ Channel Design
Results/Conclusion
Methodology
Implemented Cilk on channels on a simulated APU
‒ Caches are sequentially consistent
‒ Aggregator schedules Cilk tasks
Cilk scales with the GPU Architecture
[Chart: execution time of Fibonacci, Queens, Sort, and Strassen on 1, 2, 4, and 8 compute units (CUs), normalized to 1 CU]
More compute units → faster execution
Conclusion
We observed that dynamic aggregation enables new GPU programming languages and abstractions
We enabled dynamic aggregation by extending the GPU’s control processor to manage channels
We found that breadth-first scheduling works well for Cilk on GPUs
We proposed that GPUs allow SIMT threads to yield in support of breadth-first scheduling
Future work should focus on how the control processor can enable more GPU applications
Backup
Divergence and Channels
Branch divergence
[Chart: percent of wavefronts by active-lane count (1–16, 17–32, 33–48, 49–64 lanes active) for Fibonacci, Queens, Sort, and Strassen]
Memory divergence
+ Data in channels: good
‒ Pointers to data in channels: bad
GPU NOT Blocked on Aggregator
[Chart: percent of time, for fib, queens, sort, and strassen, under four aggregator designs: simple, 2-way light OoO, 2-way OoO, and 4-way OoO]
GPU Cilk vs. standard GPU workloads
Cilk is more succinct than SIMT languages; channels trigger more GPU dispatches

Benchmark   LOC reduction   Dispatch rate   Speedup
Strassen    42%             13x             1.06
Queens      36%             12.5x           0.98

Same performance, easier to program
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.