General Transformations for GPU Execution of Tree...

General Transformations for GPU Execution of

Tree TraversalsMichael Goldfarb*, Youngjoon Jo**, Milind Kulkarni

School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google

Thursday, March 27, 14

GPU execution of irregular programs

• GPUs offer promise of massive, energy-efficient parallelism

• Much success in mapping regular applications to GPUs

• Regular memory accesses, predictable computation

• Much less success in mapping irregular applications

• Pointer-based data structures

• Unpredictable, input-dependent computation and memory accesses

2


Tree traversal algorithms

• Many irregular algorithms are built around tree-traversal

• Barnes-Hut

• Nearest-neighbor

• 2-point correlation

• Numerous papers describing how to map tree traversal algorithms to GPUs

3


Point correlation

4

• Data mining algorithm

• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p

• Naïve approach: compare all N points with p

• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p


Point correlation

5






Point correlation

6






Point correlation

7

A


Point correlation

7

A

B G


Point correlation

7

A

B G

C F


Point correlation

7

A

B G

C F

D E


Point correlation

7

A

B G

C F H K

D E


Point correlation

7

A

B G

C F H K

D E I J


Point correlation

8

D E I J

C F H K

B G

A


Point correlation

9

D E I J

C F H K

B G

A


Point correlation

10

D E I J

C F H K

B G

A


Point correlation

11

D E I J

C F H K

B G

A


Point correlation

12

D E I J

C F H K

B G

A


Point correlation

13

D E I J

C F H K

B G

A


Point correlation

14

D E I J

C F H K

B G

A


Point correlation

15

KDCell root = /* build kdtree */;Set<Point> ps;double radius;

foreach Point p in ps { recurse(p, root, radius);}...void recurse(Point p, KDCell node, double r) { if (tooFar(p, node, r)) return; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { recurse(p, node.left, r); recurse(p, node.right, r); }}


Basic pattern

16

TreeNode root;Set<Point> ps;

foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}


Basic pattern

16

recursive traversal

TreeNode root;Set<Point> ps;



Basic pattern

16

recursive traversal

tree structureTreeNode root;Set<Point> ps;



Basic pattern

16

recursive traversal

repeated traversal




Basic pattern

16

recursive traversal

repeated traversal



Lots of parallelism!


What’s the problem?

• GPUs add high overhead for recursion

• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses

• Status quo: ad hoc solutions

• New algorithm? New GPU techniques!

17


What’s the problem?

• GPUs add high overhead for recursion

• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses

• Status quo: ad hoc solutions

• New algorithm? New GPU techniques!

17

Want generally applicable techniques for

mapping irregular applications to GPUs


Contributions

• Two general techniques for mapping tree-traversals to GPUs

• Autoropes: eliminates recursion overhead

• Lockstepping: promotes memory coalescing

• Compiler pass to automatically apply techniques to recursive tree-traversal code

• Significant GPU speedups on 5 tree-traversal algorithms

18


Naïve GPU implementation

• Warp-based SIMT (single-instruction, multiple-thread) execution

• 32 points put in a single warp

• Warp traverses tree

• All points in warp must execute same instruction

• If points diverge, some points sit idle while other threads execute

19



20

A

B G

C F H K

D E I J



20

A

B G

C F H K

D E I J

A

B G

C F

D E



20

A

B G

C F H K

D E I J

A

B G

H K

I J


A

B G

H K

I J

C F

D E


21

B

A

G


A

B G

H K

I J

C F

D E


22

B

A

G


A

B G

H K

I J

C F

D E


23

B

A

G


A

B G

H K

I J

C F

D E


24

B

A

G


A

B G

H K

I J

C F

D E


25

B

A

G


A

B G

H K

I J

C F

D E


26

B

A

G


A

B G

H K

I J

C F

D E


27

B

A

G


A

B G

H K

I J

C F

D E


28

B

A

G


A

B G

H K

I J

C F

D E


29

B

A

G


A

B G

H K

I J

C F

D E


30

B

A

G


A

B G

H K

I J

C F

D E


31

B

A

G


A

B G

H K

I J

C F

D E


32

B

A

G


A

B G

H K

I J

C F

D E


33

B

A

G


A

B G

H K

I J

C F

D E


34

B

A

G


Lots of accesses to tree

• Many accesses just moving up the tree in order to later move down again

• Lots of function stack manipulation

• Trees are very large, cannot be stored in GPU’s fast memory

• Want to minimize accesses to tree

35


How to avoid extra accesses to tree?

• Typical technique: ropes

• Pointers in each tree node that let a traversal jump to the next part of the tree

• Effectively linearizes traversal

36

A

B G

C F H K

D E I J


How to avoid extra accesses to tree?

• Installing ropes into a tree requires complex, application-specific preprocessing

• Using ropes correctly during execution requires complex, application-specific logic

37

A

B G

C F H K

D E I J


Autoropes

• General technique for “linearizing” tree traversal for arbitrary traversal algorithms

• Achieve generality, simplicity and space-efficiency at the cost of overhead

• Key insight: recursive tree algorithms are just depth-first traversals of a tree; can transform into iterative algorithm

38


Autoropes

39

A

B G

C F H K

D E I J

A

B G

C F

D EA

Rope stack


Autoropes

40

A

B G

C F H K

D E I J

G

A

B G

C F

D EB

Rope stack


Autoropes

41

A

B G

C F H K

D E I J

F

A

B G

C F

D E

G

C

Rope stack


Autoropes

42

A

B G

C F H K

D E I J

E

A

B G

C F

D E

FG

D

Rope stack


Autoropes

43

A

B G

C F H K

D E I J

F

A

B G

C F

D E

G

E

Rope stack


Autoropes

44

A

B G

C F H K

D E I J

G

A

B G

C F

D EF

Rope stack


Autoropes

45

A

B G

C F H K

D E I J

A

B G

C F

D EG

Rope stack


Autoropes

• Ropes stored on rope stack instead of in tree

• No application-specific code to use ropes

• Ropes instantiated dynamically

• No preprocessing required

• Same access patterns as with manual ropes

• Extra pushes and pops on rope stack add some overhead

46


Autoropes

47

void recurse(Point p, KDCell node, double r) { if (tooFar(p, node, r)) return; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { recurse(p, node.left, r); recurse(p, node.right, r); }}

Simple compiler transformation convertsrecursive code into iterative autoropes code


Autoropes

48

ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r)) continue; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}

See paper for details of how to transform more complex code


Unintended consequence

49

• Recursive calls naturally lead to thread divergence

• If some threads make recursive calls, other threads wait until calls return

• Does not happen for iterative code

• All threads reconverge at beginning of loop

ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r))

continue; if (...) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}


formerly recursion

Unintended consequence

49

• Recursive calls naturally lead to thread divergence

• If some threads make recursive calls, other threads wait until calls return

• Does not happen for iterative code

• All threads reconverge at beginning of loop

ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r))

continue; if (...) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}

formerly return


A

B G

H K

I J

C F

D E

Autoropes on GPU

50

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

51

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

52

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

53

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

54

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

55

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

56

B

A

G


A

B G

H K

I J

C F

D E

Autoropes on GPU

56

B

A

G

Threads no longer diverge in execution

But do diverge in tree!


Thread divergence vs. memory coalescing• Memory accesses on GPU only well behaved

if accesses by all threads in warp can be coalesced

• Same memory or strided access

• Bad memory behavior of autoropes outweighs lack of thread divergence

• Goal: benefits of autoropes while maintaining memory coalescing

57


Lockstepping• Essentially, force GPU to let threads diverge

• If any thread in a warp wants to visit a node’s children, all threads in a warp visit the child

• Threads that are “dragged along” are programmatically masked out

• Warp execution takes longer (proportional to union of threads’ traversals, rather than longest traversal), but improved memory performance makes up for it

• Automatically implemented during autoropes compiler pass

58


Dynamic lockstepping

• Some algorithms allow different traversal orders

• Some points visit left child before right, and others visit right before left

• Optimization reduces traversal size

• Inherently bad memory access patterns

59


A

B G

H K

I J

C F

D E


60

B

A

G


A

B G

H K

I J

C F

D E


61

B

A

G


A

B G

H K

I J

C F

D E


62

B

A

G


A

B G

H K

I J

C F

D E


63

B

A

G


A

B G

H K

I J

C F

D E


64

B

A

G



• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use

• Maintains memory coalescing

• Some points do more work than in original algorithm

• Tradeoff can still be worth it!

65


Engineering details

• Transform point data from array of structures format to structure of arrays

• Use analysis from [PACT 2013] to prove safety and transform automatically

• Copy tree data to GPU in linearized fashion

• Lay out fields of tree and point according to use (more commonly-accessed fields placed in shared memory)

• Interleave rope stacks for points in warp to allow strided access

66


Results• Two platforms

• GPU platform: NVIDIA Tesla C2070 (6GB global memory, 14 SMs)

• CPU platform: 32-core, 2.3 GHz Opteron

• Five benchmarks

• Barnes-Hut, Point correlation, Nearest neighbor, k-Nearest neighbor, Vantage point trees

• Multiple inputs per benchmark

• Used sorted and unsorted points

67


High-level takeaways

• Autoropes+lockstep always faster than simple recursive GPU implementation (up to 14x faster)

• For most benchmarks/inputs, best GPU implementation faster than CPU implementation up to 16 threads

• Speedups comparable to hand-written implementations

68


Barnes-Hut

69


Point correlation

70


Nearest-neighbor

71


Conclusions

• Mapping irregular applications to GPUs is very difficult

• Developed two general techniques, autoropes and lockstepping, that can achieve significant speedup on GPU

• vs. baseline GPU code and CPU implementations

• Automatic approaches competitive with previous hand-written implementations

72


General Transformations for GPU Execution of

Tree TraversalsMichael Goldfarb*, Youngjoon Jo**, Milind Kulkarni

School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google


General Transformations for GPU Execution of Tree...

Documents

Transcript of General Transformations for GPU Execution of Tree...