
Fine-grain Task Aggregation and Coordination on GPUs

Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§

ISCA, June 16, 2014

† University of Wisconsin–Madison    § AMD Research


Executive Summary

SIMT languages (e.g., CUDA and OpenCL) restrict GPU programmers to regular parallelism
‒ Compare to Pthreads, Cilk, MapReduce, TBB, etc.

Goal: enable irregular parallelism on GPUs
‒ Why? More GPU applications
‒ How? Fine-grain task aggregation
‒ What? Cilk on GPUs


Outline

Background
‒ GPUs
‒ Cilk
‒ Channel Abstraction

Our Work
‒ Cilk on Channels
‒ Channel Design

Results/Conclusion


GPUs Today

GPU tasks are scheduled by the control processor (CP), a small, in-order programmable core

Today’s GPU abstractions are coarse-grain

[Figure: GPU with a control processor (CP) and SIMD units, attached to system memory]

+ Maps well to SIMD hardware
‒ Limits fine-grain scheduling


Cilk Background

Cilk extends C for divide-and-conquer parallelism by adding keywords:

‒ spawn: schedule a thread to execute a function
‒ sync: wait for prior spawns to complete

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }
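As a quick aside (not on the slide): deleting spawn and sync yields the serial elision of this routine, a plain C program that computes the same result and can be compiled directly.

    #include <stdio.h>

    /* Serial elision of the Cilk routine above: removing spawn and sync
     * leaves a valid C program with the same result. */
    static int fib(int n) {
        if (n <= 2) return 1;
        int x = fib(n - 1);
        int y = fib(n - 2);
        return x + y;
    }

    int main(void) {
        printf("fib(10) = %d\n", fib(10));   /* prints fib(10) = 55 */
        return 0;
    }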


Prior Work on Channels

CP, or aggregator (agg), manages channels

Finite task queues, except:
1. User-defined scheduling
2. Dynamic aggregation
3. One consumption function

[Figure: channels in system memory, managed by the aggregator (Agg) and consumed by the GPU's SIMD units]

Dynamic aggregation enables “CPU-like” scheduling abstractions on GPUs
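For intuition, a minimal CPU-side sketch of the channel abstraction described above, with hypothetical names and layout (the paper's actual design targets SIMT producers and differs in the details):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch of a channel: a finite task queue with a single
     * consumption function, drained in batches by the aggregator.
     * All names are hypothetical; see the paper for the real design. */
    typedef struct {
        void (*consume)(void *task);   /* the one consumption function */
        unsigned char *tasks;          /* array-based backing storage  */
        size_t task_size, capacity, count;
    } channel_t;

    static channel_t *channel_create(size_t task_size, size_t capacity,
                                     void (*consume)(void *)) {
        channel_t *ch = malloc(sizeof *ch);
        ch->tasks = malloc(task_size * capacity);
        ch->task_size = task_size;
        ch->capacity  = capacity;
        ch->count     = 0;
        ch->consume   = consume;
        return ch;
    }

    /* Producers aggregate fine-grain tasks into the channel. */
    static int channel_push(channel_t *ch, const void *task) {
        if (ch->count == ch->capacity) return 0;        /* channel full */
        memcpy(ch->tasks + ch->count++ * ch->task_size, task, ch->task_size);
        return 1;
    }

    /* The aggregator dispatches the consumption function over the batch. */
    static void channel_drain(channel_t *ch) {
        for (size_t i = 0; i < ch->count; i++)
            ch->consume(ch->tasks + i * ch->task_size);
        ch->count = 0;
    }

    static void run_task(void *task) { printf("task %d\n", *(int *)task); }

    int main(void) {
        channel_t *ch = channel_create(sizeof(int), 8, run_task);
        for (int i = 0; i < 5; i++) channel_push(ch, &i);
        channel_drain(ch);          /* batched, aggregator-style dispatch */
        free(ch->tasks);
        free(ch);
        return 0;
    }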


Outline

Background
‒ GPUs
‒ Cilk
‒ Channel Abstraction

Our Work
‒ Cilk on Channels
‒ Channel Design

Results/Conclusion


Enable Cilk on GPUs via Channels

Cilk routines are split at each sync into sub-routines

Step 1

Original routine:

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }

"Pre-sync" sub-routine:

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
    }

"Continuation" sub-routine:

    int fib_cont(int x, int y) {
        return (x + y);
    }
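For intuition on how the two sub-routines cooperate at run time, here is a hedged CPU-only sketch with illustrative names rather than the paper's runtime: the pre-sync part leaves behind a continuation record with a join counter, and the continuation runs once both child results have arrived.

    #include <stdio.h>

    /* Illustrative only: the continuation carries the spawned results and a
     * join counter; it becomes ready when the counter reaches zero. */
    typedef struct {
        int x, y;      /* results of the two spawned children */
        int pending;   /* children that have not completed    */
    } fib_cont_task;

    static int fib_cont(const fib_cont_task *c) { return c->x + c->y; }

    /* A child deposits its result and decrements the join counter. */
    static void child_done(fib_cont_task *c, int slot, int value) {
        if (slot == 0) c->x = value; else c->y = value;
        c->pending--;
    }

    int main(void) {
        fib_cont_task c = { .pending = 2 };   /* fib(4) spawned fib(3), fib(2) */
        child_done(&c, 0, 2);                 /* fib(3) = 2 */
        child_done(&c, 1, 1);                 /* fib(2) = 1 */
        if (c.pending == 0)
            printf("fib(4) = %d\n", fib_cont(&c));   /* fib(4) = 3 */
        return 0;
    }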


Enable Cilk on GPUs via Channels

Channels instantiated for breadth-first traversal
‒ Quickly populates the GPU's tens of thousands of lanes
‒ Facilitates coarse-grain dependency management

Step 2

[Figure: fib(5) spawn tree. Legend: "pre-sync" task ready; "pre-sync" task done; "continuation" task; task A spawned task B; task B depends on task A. Pre-sync tasks are queued in the fib channel; continuations are pushed onto the fib_cont channel stack (top of stack marked).]


Bound Cilk’s Memory Footprint

Bound memory to the depth of the Cilk tree by draining channels closer to the base case
‒ The amount of work generated dynamically is not known a priori

We propose that GPUs allow SIMT threads to yield
‒ Facilitates resolving conflicts on shared resources like memory

[Figure: spawn tree for fib(5)]
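A minimal sketch of the draining policy stated above, assuming one channel per recursion depth (an illustration, not the paper's scheduler): the aggregator always drains the deepest non-empty channel, so live tasks are bounded by the tree depth rather than the dynamically generated width.

    #include <stdio.h>
    #include <stddef.h>

    /* Illustration: with one channel per recursion depth, draining the deepest
     * non-empty channel first bounds live tasks by tree depth rather than by
     * the (unknown a priori) amount of dynamically generated work. */
    #define MAX_DEPTH 8

    typedef struct { size_t count; } channel_t;   /* task storage elided */

    static int pick_depth(const channel_t ch[], int max_depth) {
        for (int d = max_depth; d >= 0; d--)       /* closest to base case */
            if (ch[d].count > 0)
                return d;
        return -1;                                  /* nothing left to run */
    }

    int main(void) {
        channel_t by_depth[MAX_DEPTH] = { {1}, {2}, {4}, {3}, {0} };
        printf("drain depth %d first\n", pick_depth(by_depth, MAX_DEPTH - 1));
        /* prints: drain depth 3 first */
        return 0;
    }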


Channel Implementation

Our design accommodates SIMT access patterns
+ array-based
+ lock-free
+ non-blocking

See Paper
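The real data structure is described in the paper; purely as a generic illustration of the array-based, lock-free flavor, producers could claim slots with an atomic fetch-and-add so that no thread ever blocks on a lock:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Generic illustration (not the paper's design): an array-based queue
     * where producers claim slots with atomic fetch-add; a full queue is
     * reported to the caller rather than blocking. */
    #define CAPACITY 1024

    typedef struct {
        _Atomic size_t next;          /* next free slot index */
        int slots[CAPACITY];
    } array_queue;

    static bool try_push(array_queue *q, int task) {
        size_t i = atomic_fetch_add(&q->next, 1);   /* non-blocking claim */
        if (i >= CAPACITY) return false;            /* queue is full      */
        q->slots[i] = task;
        return true;
    }

    int main(void) {
        static array_queue q;                        /* zero-initialized */
        printf("pushed: %d\n", try_push(&q, 42));    /* pushed: 1 */
        return 0;
    }

If try_push reports a full queue, a SIMT producer could use the proposed yield mechanism to let consumers drain the channel before retrying.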


Outline

Background
‒ GPUs
‒ Cilk
‒ Channel Abstraction

Our Work
‒ Cilk on Channels
‒ Channel Design

Results/Conclusion


Methodology

Implemented Cilk on channels on a simulated APU
‒ Caches are sequentially consistent
‒ Aggregator schedules Cilk tasks


Cilk scales with the GPU Architecture

[Figure: normalized execution time for Fibonacci, Queens, Sort, and Strassen on 1, 2, 4, and 8 compute units (CUs)]

More compute units → faster execution


Conclusion

We observed that dynamic aggregation enables new GPU programming languages and abstractions

We enabled dynamic aggregation by extending the GPU’s control processor to manage channels

We found that breadth-first scheduling works well for Cilk on GPUs

We proposed that GPUs allow SIMT threads to yield for breadth-first scheduling

Future work should focus on how the control processor can enable more GPU applications


Backup


Divergence and Channels

Branch divergence

Memory divergence
+ Data in channels: good
‒ Pointers to data in channels: bad

[Figure: percent of wavefronts with 1–16, 17–32, 33–48, and 49–64 lanes active for Fibonacci, Queens, Sort, and Strassen]


GPU NOT Blocked on Aggregator

[Figure: percent of time the GPU is not blocked on the aggregator for fib, queens, sort, and strassen, with simple, 2-way light OoO, 2-way OoO, and 4-way OoO aggregator cores]


GPU Cilk vs. standard GPU workloads

Cilk is more succinct than SIMT languages
Channels trigger more GPU dispatches

             LOC reduction   Dispatch rate   Speedup
  Strassen        42%             13x          1.06
  Queens          36%            12.5x         0.98

Same performance, easier to program


Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.