General Transformations for GPU Execution of Tree...

92
General Transformations for GPU Execution of Tree Traversals Michael Goldfarb*,Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering * Now at Qualcomm; ** Now at Google Thursday, March 27, 14

Transcript of General Transformations for GPU Execution of Tree...

Page 1: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

General Transformations for GPU Execution of

Tree TraversalsMichael Goldfarb*, Youngjoon Jo**, Milind Kulkarni

School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google

Thursday, March 27, 14

Page 2: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

GPU execution of irregular programs

• GPUs offer promise of massive, energy-efficient parallelism

• Much success in mapping regular applications to GPUs

• Regular memory accesses, predictable computation

• Much less success in mapping irregular applications

• Pointer-based data structures

• Unpredictable, input-dependent computation and memory accesses

2

Thursday, March 27, 14

Page 3: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Tree traversal algorithms

• Many irregular algorithms are built around tree-traversal

• Barnes-Hut

• Nearest-neighbor

• 2-point correlation

• Numerous papers describing how to map tree traversal algorithms to GPUs

3

Thursday, March 27, 14

Page 4: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

4

• Data mining algorithm

• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p

• Naïve approach: compare all N points with p

• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p

Thursday, March 27, 14

Page 5: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

5

• Data mining algorithm

• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p

• Naïve approach: compare all N points with p

• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p

Thursday, March 27, 14

Page 6: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

6

• Data mining algorithm

• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p

• Naïve approach: compare all N points with p

• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p

Thursday, March 27, 14

Page 7: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

7

A

Thursday, March 27, 14

Page 8: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

7

A

B G

Thursday, March 27, 14

Page 9: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

7

A

B G

C F

Thursday, March 27, 14

Page 10: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

7

A

B G

C F

D E

Thursday, March 27, 14

Page 11: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

7

A

B G

C F H K

D E

Thursday, March 27, 14

Page 12: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

7

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 13: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

8

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 14: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

8

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 15: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

9

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 16: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

10

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 17: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

11

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 18: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

12

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 19: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

13

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 20: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

14

D E I J

C F H K

B G

A

Thursday, March 27, 14

Page 21: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

15

KDCell root = /* build kdtree */;Set<Point> ps;double radius;

foreach Point p in ps { recurse(p, root, radius);}...void recurse(Point p, KDCell node, double r) { if (tooFar(p, node, r)) return; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { recurse(p, node.left, r); recurse(p, node.right, r); }}

Thursday, March 27, 14

Page 22: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Basic pattern

16

TreeNode root;Set<Point> ps;

foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}

Thursday, March 27, 14

Page 23: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Basic pattern

16

recursive traversal

TreeNode root;Set<Point> ps;

foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}

Thursday, March 27, 14

Page 24: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Basic pattern

16

recursive traversal

tree structureTreeNode root;Set<Point> ps;

foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}

Thursday, March 27, 14

Page 25: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Basic pattern

16

recursive traversal

repeated traversal

tree structureTreeNode root;Set<Point> ps;

foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}

Thursday, March 27, 14

Page 26: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Basic pattern

16

recursive traversal

repeated traversal

tree structureTreeNode root;Set<Point> ps;

foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}

Lots of parallelism!

Thursday, March 27, 14

Page 27: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

What’s the problem?

• GPUs add high overhead for recursion

• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses

• Status quo: ad hoc solutions

• New algorithm? New GPU techniques!

17

Thursday, March 27, 14

Page 28: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

What’s the problem?

• GPUs add high overhead for recursion

• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses

• Status quo: ad hoc solutions

• New algorithm? New GPU techniques!

17

Want generally applicable techniques for

mapping irregular applications to GPUs

Thursday, March 27, 14

Page 29: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Contributions

• Two general techniques for mapping tree-traversals to GPUs

• Autoropes: eliminates recursion overhead

• Lockstepping: promotes memory coalescing

• Compiler pass to automatically apply techniques to recursive tree-traversal code

• Significant GPU speedups on 5 tree-traversal algorithms

18

Thursday, March 27, 14

Page 30: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Naïve GPU implementation

• Warp-based SIMT (single-instruction, multiple-thread) execution

• 32 points put in a single warp

• Warp traverses tree

• All points in warp must execute same instruction

• If points diverge, some points sit idle while other threads execute

19

Thursday, March 27, 14

Page 31: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Naïve GPU implementation

20

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 32: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Naïve GPU implementation

20

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 33: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Naïve GPU implementation

20

A

B G

C F H K

D E I J

A

B G

C F

D E

Thursday, March 27, 14

Page 34: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Naïve GPU implementation

20

A

B G

C F H K

D E I J

A

B G

H K

I J

Thursday, March 27, 14

Page 35: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

21

B

A

G

Thursday, March 27, 14

Page 36: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

22

B

A

G

Thursday, March 27, 14

Page 37: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

23

B

A

G

Thursday, March 27, 14

Page 38: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

24

B

A

G

Thursday, March 27, 14

Page 39: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

25

B

A

G

Thursday, March 27, 14

Page 40: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

26

B

A

G

Thursday, March 27, 14

Page 41: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

27

B

A

G

Thursday, March 27, 14

Page 42: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

28

B

A

G

Thursday, March 27, 14

Page 43: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

29

B

A

G

Thursday, March 27, 14

Page 44: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

30

B

A

G

Thursday, March 27, 14

Page 45: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

31

B

A

G

Thursday, March 27, 14

Page 46: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

32

B

A

G

Thursday, March 27, 14

Page 47: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

33

B

A

G

Thursday, March 27, 14

Page 48: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Naïve GPU implementation

34

B

A

G

Thursday, March 27, 14

Page 49: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Lots of accesses to tree

• Many accesses just moving up the tree in order to later move down again

• Lots of function stack manipulation

• Trees are very large, cannot be stored in GPU’s fast memory

• Want to minimize accesses to tree

35

Thursday, March 27, 14

Page 50: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

How to avoid extra accesses to tree?

• Typical technique: ropes

• Pointers in each tree node that let a traversal jump to the next part of the tree

• Effectively linearizes traversal

36

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 51: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

How to avoid extra accesses to tree?

• Typical technique: ropes

• Pointers in each tree node that let a traversal jump to the next part of the tree

• Effectively linearizes traversal

36

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 52: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

How to avoid extra accesses to tree?

• Typical technique: ropes

• Pointers in each tree node that let a traversal jump to the next part of the tree

• Effectively linearizes traversal

36

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 53: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

How to avoid extra accesses to tree?

• Typical technique: ropes

• Pointers in each tree node that let a traversal jump to the next part of the tree

• Effectively linearizes traversal

36

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 54: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

How to avoid extra accesses to tree?

• Installing ropes into a tree requires complex, application-specific preprocessing

• Using ropes correctly during execution requires complex, application-specific logic

37

A

B G

C F H K

D E I J

Thursday, March 27, 14

Page 55: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

• General technique for “linearizing” tree traversal for arbitrary traversal algorithms

• Achieve generality, simplicity and space-efficiency at the cost of overhead

• Key insight: recursive tree algorithms are just depth-first traversals of a tree; can transform into iterative algorithm

38

Thursday, March 27, 14

Page 56: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

39

A

B G

C F H K

D E I J

A

B G

C F

D EA

Rope stack

Thursday, March 27, 14

Page 57: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

40

A

B G

C F H K

D E I J

G

A

B G

C F

D EB

Rope stack

Thursday, March 27, 14

Page 58: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

41

A

B G

C F H K

D E I J

F

A

B G

C F

D E

G

C

Rope stack

Thursday, March 27, 14

Page 59: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

42

A

B G

C F H K

D E I J

E

A

B G

C F

D E

FG

D

Rope stack

Thursday, March 27, 14

Page 60: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

43

A

B G

C F H K

D E I J

F

A

B G

C F

D E

G

E

Rope stack

Thursday, March 27, 14

Page 61: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

44

A

B G

C F H K

D E I J

G

A

B G

C F

D EF

Rope stack

Thursday, March 27, 14

Page 62: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

45

A

B G

C F H K

D E I J

A

B G

C F

D EG

Rope stack

Thursday, March 27, 14

Page 63: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

• Ropes stored on rope stack instead of in tree

• No application-specific code to use ropes

• Ropes instantiated dynamically

• No preprocessing required

• Same access patterns as with manual ropes

• Extra pushes and pops on rope stack add some overhead

46

Thursday, March 27, 14

Page 64: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

47

void recurse(Point p, KDCell node, double r) { if (tooFar(p, node, r)) return; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { recurse(p, node.left, r); recurse(p, node.right, r); }}

Simple compiler transformation convertsrecursive code into iterative autoropes code

Thursday, March 27, 14

Page 65: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Autoropes

48

ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r)) continue; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}

See paper for details of how to transform more complex code

Thursday, March 27, 14

Page 66: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Unintended consequence

49

• Recursive calls naturally lead to thread divergence

• If some threads make recursive calls, other threads wait until calls return

• Does not happen for iterative code

• All threads reconverge at beginning of loop

ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r))

continue; if (...) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}

Thursday, March 27, 14

Page 67: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

formerly recursion

Unintended consequence

49

• Recursive calls naturally lead to thread divergence

• If some threads make recursive calls, other threads wait until calls return

• Does not happen for iterative code

• All threads reconverge at beginning of loop

ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r))

continue; if (...) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}

formerly return

Thursday, March 27, 14

Page 68: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

50

B

A

G

Thursday, March 27, 14

Page 69: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

51

B

A

G

Thursday, March 27, 14

Page 70: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

52

B

A

G

Thursday, March 27, 14

Page 71: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

53

B

A

G

Thursday, March 27, 14

Page 72: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

54

B

A

G

Thursday, March 27, 14

Page 73: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

55

B

A

G

Thursday, March 27, 14

Page 74: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

56

B

A

G

Thursday, March 27, 14

Page 75: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Autoropes on GPU

56

B

A

G

Threads no longer diverge in execution

But do diverge in tree!

Thursday, March 27, 14

Page 76: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Thread divergence vs. memory coalescing• Memory accesses on GPU only well behaved

if accesses by all threads in warp can be coalesced

• Same memory or strided access

• Bad memory behavior of autoropes outweighs lack of thread divergence

• Goal: benefits of autoropes while maintaining memory coalescing

57

Thursday, March 27, 14

Page 77: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Lockstepping• Essentially, force GPU to let threads diverge

• If any thread in a warp wants to visit a node’s children, all threads in a warp visit the child

• Threads that are “dragged along” are programmatically masked out

• Warp execution takes longer (proportional to union of threads’ traversals, rather than longest traversal), but improved memory performance makes up for it

• Automatically implemented during autoropes compiler pass

58

Thursday, March 27, 14

Page 78: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Dynamic lockstepping

• Some algorithms allow different traversal orders

• Some points visit left child before right, and others visit right before left

• Optimization reduces traversal size

• Inherently bad memory access patterns

59

Thursday, March 27, 14

Page 79: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Dynamic lockstepping

60

B

A

G

Thursday, March 27, 14

Page 80: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Dynamic lockstepping

61

B

A

G

Thursday, March 27, 14

Page 81: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Dynamic lockstepping

62

B

A

G

Thursday, March 27, 14

Page 82: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Dynamic lockstepping

63

B

A

G

Thursday, March 27, 14

Page 83: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

A

B G

H K

I J

C F

D E

Dynamic lockstepping

64

B

A

G

Thursday, March 27, 14

Page 84: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Dynamic lockstepping

• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use

• Maintains memory coalescing

• Some points do more work than in original algorithm

• Tradeoff can still be worth it!

65

Thursday, March 27, 14

Page 85: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Engineering details

• Transform point data from array of structures format to structure of arrays

• Use analysis from [PACT 2013] to prove safety and transform automatically

• Copy tree data to GPU in linearized fashion

• Lay out fields of tree and point according to use (more commonly-accessed fields placed in shared memory)

• Interleave rope stacks for points in warp to allow strided access

66

Thursday, March 27, 14

Page 86: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Results• Two platforms

• GPU platform: NVIDIA Tesla C2070 (6GB global memory, 14 SMs)

• CPU platform: 32-core, 2.3 GHz Opteron

• Five benchmarks

• Barnes-Hut, Point correlation, Nearest neighbor, k-Nearest neighbor, Vantage point trees

• Multiple inputs per benchmark

• Used sorted and unsorted points

67

Thursday, March 27, 14

Page 87: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

High-level takeaways

• Autoropes+lockstep always faster than simple recursive GPU implementation (up to 14x faster)

• For most benchmarks/inputs, best GPU implementation faster than CPU implementation up to 16 threads

• Speedups comparable to hand-written implementations

68

Thursday, March 27, 14

Page 88: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Barnes-Hut

69

Thursday, March 27, 14

Page 89: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Point correlation

70

Thursday, March 27, 14

Page 90: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Nearest-neighbor

71

Thursday, March 27, 14

Page 91: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

Conclusions

• Mapping irregular applications to GPUs is very difficult

• Developed two general techniques, autoropes and lockstepping, that can achieve significant speedup on GPU

• vs. baseline GPU code and CPU implementations

• Automatic approaches competitive with previous hand-written implementations

72

Thursday, March 27, 14

Page 92: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a

General Transformations for GPU Execution of

Tree TraversalsMichael Goldfarb*, Youngjoon Jo**, Milind Kulkarni

School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google

Thursday, March 27, 14