General Transformations for GPU Execution of Tree Traversals
Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni
School of Electrical and Computer Engineering
* Now at Qualcomm; ** Now at Google
Thursday, March 27, 14
GPU execution of irregular programs
• GPUs offer promise of massive, energy-efficient parallelism
• Much success in mapping regular applications to GPUs
• Regular memory accesses, predictable computation
• Much less success in mapping irregular applications
• Pointer-based data structures
• Unpredictable, input-dependent computation and memory accesses
Tree traversal algorithms
• Many irregular algorithms are built around tree traversal
• Barnes-Hut
• Nearest-neighbor
• 2-point correlation
• Numerous papers describing how to map tree traversal algorithms to GPUs
Point correlation
• Data mining algorithm
• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
• Naïve approach: compare all N points with p
• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p
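The two approaches above can be compared in a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `KDCell`, `build`, `too_far`, `correlate`, and `naive` are hypothetical, the tree stores one point per leaf, and pruning tests the distance from p to each node's bounding box.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


def dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


@dataclass
class KDCell:
    lo: Point                        # bounding-box minimum corner
    hi: Point                        # bounding-box maximum corner
    point: Optional[Point] = None    # set only on leaves (assumption: one point per leaf)
    left: Optional["KDCell"] = None
    right: Optional["KDCell"] = None

    def is_leaf(self) -> bool:
        return self.point is not None


def build(points: Sequence[Point], depth: int = 0) -> KDCell:
    """Build a kd-tree by splitting at the median, cycling through the axes."""
    k = len(points[0])
    lo = tuple(min(p[d] for p in points) for d in range(k))
    hi = tuple(max(p[d] for p in points) for d in range(k))
    if len(points) == 1:
        return KDCell(lo, hi, point=points[0])
    pts = sorted(points, key=lambda p: p[depth % k])
    mid = len(pts) // 2
    return KDCell(lo, hi,
                  left=build(pts[:mid], depth + 1),
                  right=build(pts[mid:], depth + 1))


def too_far(p: Point, node: KDCell, r: float) -> bool:
    """True if nothing inside the node's bounding box can be within r of p."""
    gap2 = sum(max(l - x, 0.0, x - h) ** 2
               for x, l, h in zip(p, node.lo, node.hi))
    return gap2 >= r * r


def correlate(p: Point, node: KDCell, r: float) -> int:
    """Count points within radius r of p, pruning far subtrees."""
    if too_far(p, node, r):
        return 0
    if node.is_leaf():
        return 1 if dist2(node.point, p) < r * r else 0
    return correlate(p, node.left, r) + correlate(p, node.right, r)


def naive(p: Point, points: Sequence[Point], r: float) -> int:
    """Baseline: compare p against all N points."""
    return sum(1 for q in points if dist2(p, q) < r * r)
```

Both functions return the same count; the kd-tree version skips every subtree whose bounding box lies at least r away from p.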
[Figure: example kd-tree built over the points, drawn level by level from the root: A; B G; C F H K; D E I J.]
KDCell root = /* build kdtree */;
Set<Point> ps;
double radius;

foreach Point p in ps {
  recurse(p, root, radius);
}
...
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;                  // prune subtrees far from p
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;                                // leaf point lies within radius r
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}
Basic pattern
TreeNode root;                      // tree structure
Set<Point> ps;

foreach Point p in ps {             // repeated traversal
  recurse(p, root, ...);
}
...
recurse(Point p, TreeNode node, ...) {
  if (truncate?(p, node, ...)) { ... }
  recurse(p, node.child1, ...);     // recursive traversal
  recurse(p, node.child2, ...);
  ...
}
Lots of parallelism!
What’s the problem?
• GPUs add high overhead for recursion
• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
• Status quo: ad hoc solutions
• New algorithm? New GPU techniques!
Want generally applicable techniques for mapping irregular applications to GPUs
Contributions
• Two general techniques for mapping tree traversals to GPUs
• Autoropes: eliminates recursion overhead
• Lockstepping: promotes memory coalescing
• Compiler pass to automatically apply techniques to recursive tree-traversal code
• Significant GPU speedups on 5 tree-traversal algorithms
Naïve GPU implementation
• Warp-based SIMT (single-instruction, multiple-thread) execution
• 32 points put in a single warp
• Warp traverses tree
• All points in warp must execute same instruction
• If points diverge, some points sit idle while other threads execute
[Figure: animation of two points in a warp traversing the example tree recursively; one point's traversal covers nodes A, B, C, D, E, F, G and the other's covers A, B, G, H, I, J, K.]
Lots of accesses to tree
• Many accesses just moving up the tree in order to later move down again
• Lots of function stack manipulation
• Trees are very large, cannot be stored in GPU’s fast memory
• Want to minimize accesses to tree
How to avoid extra accesses to tree?
• Typical technique: ropes
• Pointers in each tree node that let a traversal jump to the next part of the tree
• Effectively linearizes traversal
[Figure: the example tree (levels A; B G; C F H K; D E I J) with ropes drawn between nodes.]
• Installing ropes into a tree requires complex, application-specific preprocessing
• Using ropes correctly during execution requires complex, application-specific logic
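To make the preprocessing and traversal logic concrete, here is a minimal sketch of hand-installed ropes on the example tree. It is an illustration, not code from the talk: the names `Node`, `install_ropes`, and `traverse_with_ropes` are hypothetical, and the parent/child shape (C, F under B; H, K under G; D, E under C; I, J under H) is assumed from the figure layout. Each rope points to the node a preorder traversal visits next once the current subtree is finished or skipped.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    rope: Optional["Node"] = None  # next node once this subtree is done or skipped


def example_tree() -> Node:
    """Example tree; the exact parent/child shape is an assumption."""
    c = Node("C", Node("D"), Node("E"))
    b = Node("B", c, Node("F"))
    h = Node("H", Node("I"), Node("J"))
    g = Node("G", h, Node("K"))
    return Node("A", b, g)


def install_ropes(node: Optional[Node], next_node: Optional[Node] = None) -> None:
    """Preprocessing pass: store in every node where to jump after its subtree."""
    if node is None:
        return
    node.rope = next_node
    # After the left subtree comes the right sibling, if there is one.
    after_left = node.right if node.right is not None else next_node
    install_ropes(node.left, after_left)
    install_ropes(node.right, next_node)


def traverse_with_ropes(root: Node, truncate: Callable[[Node], bool]) -> List[str]:
    """Stackless traversal: descend to the first child, or follow the rope."""
    visited = []
    node = root
    while node is not None:
        visited.append(node.name)
        first_child = node.left if node.left is not None else node.right
        if truncate(node) or first_child is None:
            node = node.rope      # skip or finish this subtree
        else:
            node = first_child    # keep descending
    return visited
```

With truncation at F and G, the traversal visits A, B, C, D, E, F, G without ever moving back up the tree. The cost is the extra pointer per node and a preprocessing pass that depends on the traversal order.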
Autoropes
• General technique for “linearizing” tree traversal for arbitrary traversal algorithms
• Achieve generality, simplicity and space-efficiency at the cost of some run-time overhead (extra pushes and pops on a rope stack)
• Key insight: recursive tree algorithms are just depth-first traversals of a tree, so they can be transformed into an iterative algorithm
Example: rope stack contents as one traversal visits nodes A, B, C, D, E, F, G of the example tree

Visiting   Rope stack (top first)
A          (empty)
B          G
C          F, G
D          E, F, G
E          F, G
F          G
G          (empty)
Autoropes
• Ropes stored on rope stack instead of in tree
• No application-specific code to use ropes
• Ropes instantiated dynamically
• No preprocessing required
• Same access patterns as with manual ropes
• Extra pushes and pops on rope stack add some overhead
Autoropes
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}
Simple compiler transformation converts recursive code into iterative autoropes code:
ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    ropeStack.push(node.right);   // pushed first so the left child is popped first
    ropeStack.push(node.left);
  }
}
See paper for details of how to transform more complex code
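The transformation above can be checked on the example tree with a short sketch. Assumptions are labeled: `Node`, `recursive_visit`, and `autoropes_visit` are hypothetical names, every interior node has two children, and a generic `truncate` predicate stands in for `tooFar`. The point of the sketch is that the rope-stack loop visits nodes in the same order as the recursion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def example_tree() -> Node:
    """Example tree; the exact parent/child shape is an assumption."""
    c = Node("C", Node("D"), Node("E"))
    b = Node("B", c, Node("F"))
    h = Node("H", Node("I"), Node("J"))
    g = Node("G", h, Node("K"))
    return Node("A", b, g)


def recursive_visit(node: Node, truncate: Callable[[Node], bool],
                    visited: List[str]) -> None:
    """Original recursive form: visit, then recurse left and right."""
    visited.append(node.name)
    if truncate(node) or node.left is None:
        return
    recursive_visit(node.left, truncate, visited)
    recursive_visit(node.right, truncate, visited)


def autoropes_visit(root: Node, truncate: Callable[[Node], bool]) -> List[str]:
    """Iterative form: ropes live on an explicit stack instead of in the tree."""
    visited = []
    rope_stack = [root]
    while rope_stack:
        node = rope_stack.pop()
        visited.append(node.name)
        if truncate(node) or node.left is None:
            continue                      # formerly "return"
        rope_stack.append(node.right)     # formerly the second recursive call
        rope_stack.append(node.left)      # popped next, so left is visited first
    return visited
```

No preprocessing of the tree is needed; the only extra state is the per-traversal stack.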
Unintended consequence
• Recursive calls naturally lead to thread divergence
• If some threads make recursive calls, other threads wait until calls return
• Does not happen for iterative code
• All threads reconverge at beginning of loop
ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r))
    continue;                     // formerly return
  if (...) p.correlated++;
  else {
    ropeStack.push(node.right);   // formerly recursion
    ropeStack.push(node.left);
  }
}
Autoropes on GPU
[Figure: animation of two threads in a warp running the autoropes traversal on the example tree, one covering nodes A, B, C, D, E, F, G and the other A, B, G, H, I, J, K.]
Threads no longer diverge in execution
But do diverge in tree!
Thread divergence vs. memory coalescing
• Memory accesses on GPU only well behaved if accesses by all threads in warp can be coalesced
• Same memory or strided access
• Bad memory behavior of autoropes outweighs lack of thread divergence
• Goal: benefits of autoropes while maintaining memory coalescing
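A rough way to see why divergence in the tree hurts is to count how many memory segments one warp-wide load touches. The sketch below is a simplified model, not a description of any specific GPU: the function name `transactions` is hypothetical and the 128-byte segment size is an assumption.

```python
from typing import Iterable


def transactions(addresses: Iterable[int], segment_bytes: int = 128) -> int:
    """Number of distinct memory segments touched by one warp-wide access.

    Simplified model (assumption): each distinct segment costs one transaction.
    """
    return len({addr // segment_bytes for addr in addresses})


WARP = 32
same_node = [1024] * WARP                    # all threads read the same tree node
strided = [4 * lane for lane in range(WARP)] # consecutive 4-byte words
scattered = [4096 * lane for lane in range(WARP)]  # each thread at a different node
```

Under this model the first two patterns need one transaction each, while the scattered pattern needs 32, one per thread.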
Lockstepping
• Essentially, force GPU to let threads diverge
• If any thread in a warp wants to visit a node’s children, all threads in the warp visit the children
• Threads that are “dragged along” are programmatically masked out
• Warp execution takes longer (proportional to union of threads’ traversals, rather than longest traversal), but improved memory performance makes up for it
• Automatically implemented during autoropes compiler pass
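The behavior described above can be simulated sequentially. This is a sketch under assumptions, not the compiler's output: `lockstep_traverse` is a hypothetical name, the warp shares one rope stack, each stack entry carries a per-thread mask, and every interior node has two children.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def example_tree() -> Node:
    """Example tree; the exact parent/child shape is an assumption."""
    c = Node("C", Node("D"), Node("E"))
    b = Node("B", c, Node("F"))
    h = Node("H", Node("I"), Node("J"))
    g = Node("G", h, Node("K"))
    return Node("A", b, g)


def lockstep_traverse(
    root: Node, truncates: Sequence[Callable[[Node], bool]]
) -> Tuple[List[str], List[List[str]]]:
    """Simulate one warp; truncates[t] is thread t's truncation test."""
    n = len(truncates)
    per_thread: List[List[str]] = [[] for _ in range(n)]
    warp_visits: List[str] = []
    rope_stack = [(root, [True] * n)]        # (node, mask of threads doing real work)
    while rope_stack:
        node, mask = rope_stack.pop()
        warp_visits.append(node.name)        # the whole warp loads this node
        child_mask = []
        for t in range(n):
            if mask[t]:
                per_thread[t].append(node.name)   # unmasked threads do real work
            # A thread stays active below this node only if it did not truncate.
            child_mask.append(mask[t] and not truncates[t](node))
        if node.left is not None and any(child_mask):
            rope_stack.append((node.right, child_mask))
            rope_stack.append((node.left, child_mask))
    return warp_visits, per_thread
```

With one thread truncating at F and G and another truncating at B, the warp visits all eleven nodes (the union of the two traversals), while each thread does real work on only its own seven.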
Dynamic lockstepping
• Some algorithms allow different traversal orders
• Some points visit left child before right, and others visit right before left
• Optimization reduces traversal size
• Inherently bad memory access patterns
[Figure: animation over the example tree for two points in a warp, one covering nodes A, B, C, D, E, F, G and the other A, B, G, H, I, J, K.]
• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use
• Maintains memory coalescing
• Some points do more work than in original algorithm
• Tradeoff can still be worth it!
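A minimal sketch of the voting idea follows. The details are assumptions rather than the authors' scheme: the name `dynamic_lockstep_order` is hypothetical, the vote is a simple majority with ties going to left-first, and truncation and masking are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def dynamic_lockstep_order(
    root: Node, prefers_left: Callable[[int, Node], bool], n_threads: int
) -> List[str]:
    """Visit order for a warp whose threads vote on child order at each node."""
    order = []
    rope_stack = [root]
    while rope_stack:
        node = rope_stack.pop()
        order.append(node.name)
        if node.left is None or node.right is None:
            continue  # sketch assumes interior nodes have two children
        votes_left = sum(1 for t in range(n_threads) if prefers_left(t, node))
        left_first = 2 * votes_left >= n_threads   # majority; ties go left (assumption)
        first, second = ((node.left, node.right) if left_first
                         else (node.right, node.left))
        rope_stack.append(second)
        rope_stack.append(first)    # popped next, so the winning child is visited first
    return order
```

Threads on the losing side of a vote follow the warp's order instead of their preferred one, which is where the extra work relative to the original algorithm comes from.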
Engineering details
• Transform point data from array of structures format to structure of arrays
• Use analysis from [PACT 2013] to prove safety and transform automatically
• Copy tree data to GPU in linearized fashion
• Lay out fields of tree and point according to use (more commonly-accessed fields placed in shared memory)
• Interleave rope stacks for points in warp to allow strided access
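Two of these layout changes are easy to illustrate. The sketch below is an assumption-laden example, not the authors' code: `aos_to_soa` and `interleaved_index` are hypothetical names, points are modeled as dictionaries, and the warp size is taken to be 32.

```python
from typing import Dict, List, Sequence


def aos_to_soa(points: Sequence[Dict[str, float]]) -> Dict[str, List[float]]:
    """Array of structures -> structure of arrays: one contiguous list per field."""
    return {field: [p[field] for p in points] for field in points[0]}


def interleaved_index(depth: int, lane: int, warp_size: int = 32) -> int:
    """Slot of stack entry `depth` for thread `lane` in a shared, interleaved array.

    Entries at the same depth for consecutive lanes are adjacent, so a warp
    pushing or popping at one depth touches consecutive addresses.
    """
    return depth * warp_size + lane


points_aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
points_soa = aos_to_soa(points_aos)
```

In the structure-of-arrays form, threads reading the same field of consecutive points access consecutive elements, which is the strided pattern that coalesces well.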
Results
• Two platforms
• GPU platform: NVIDIA Tesla C2070 (6GB global memory, 14 SMs)
• CPU platform: 32-core, 2.3 GHz Opteron
• Five benchmarks
• Barnes-Hut, Point correlation, Nearest neighbor, k-Nearest neighbor, Vantage point trees
• Multiple inputs per benchmark
• Used sorted and unsorted points
High-level takeaways
• Autoropes+lockstep always faster than simple recursive GPU implementation (up to 14x faster)
• For most benchmarks/inputs, best GPU implementation is faster than the CPU implementation running on up to 16 threads
• Speedups comparable to hand-written implementations
[Figures: results charts for Barnes-Hut, Point correlation, and Nearest-neighbor.]
Conclusions
• Mapping irregular applications to GPUs is very difficult
• Developed two general techniques, autoropes and lockstepping, that can achieve significant speedup on GPU
• vs. baseline GPU code and CPU implementations
• Automatic approaches competitive with previous hand-written implementations