Transcript of "GPU Accelerated Pathfinding" (Penn Engineering CIS 565, LECTURES/Path.pdf)
GPU Accelerated Pathfinding
Avi Bleiweiss
NVIDIA Corporation
![Page 2: GPU Accelerated Pathfinding - Penn Engineeringcis565/LECTURES/Path.pdf · • With adjacency directory • Adjacency list cacheable • Loop control direct map • 2 or 4, 32 bit](https://reader033.fdocuments.in/reader033/viewer/2022041916/5e6a01fafcf94407496d6022/html5/thumbnails/2.jpg)
Introduction
• Navigation planning
• Global and local
• Crowded game scenes
• Many thousands of agents
• Decomposable movement
• Explicit parallelism
• Dynamic environment
[Diagram: navigation pipeline, run per frame — the environment feeds Roadmap Generation, then Global Navigation, then Local Navigation, producing cost and path]
Motivation
• CUDA compute enabling
• Nested data parallelism
• Irregular, divergent algorithms
• Large thread SIMD challenge
• Extend GPU game computing
• Core game AI actions
Flat Data Parallel
a serial operation applied across bulk data
Nested Data Parallel
a parallel operation applied across bulk data (parallelism nested within parallelism)
Objective
• Optimally navigate agents
• From start to goal state
• Roadmap representation
• Graph data structure
• Parallel, arbitrary search
• Varying topology complexity
• GPU performance scale
[Figure: an agent navigating from start to goal, replanning where it detects an obstacle]
Outline
• Algorithm
• Implementation
• Performance
• Futures
Algorithm
Graph
• Linked set of nodes, edges
• G = {N, E}
• Dense or sparse
• Ratio of edges (E) to node pairs (N²)
• Directed or undirected
• Ordered, unordered node pairs
• Consistent data structure
[Figure: example graphs — dense and undirected vs. sparse and directed]
Data Structure
• Adjacency matrix
• Intuitive edge presence
• Wasteful for sparse graphs, O(N²)
• Adjacency lists
• Immediate node indices
• Compact storage O(N+E)
• Roadmap sparse graph
• Adjacency lists default
adjacency matrix:

     0 1 2 3
  0  0 1 1 0
  1  0 0 0 1
  2  0 0 0 0
  3  0 0 0 0

adjacency lists:

  0: 1, 2
  1: 3
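As a sketch, the slide's 4-node example can be written in both representations (Python used here for illustration; the function names are hypothetical):

```python
# The slide's 4-node example in both representations.

# Adjacency matrix: O(N^2) entries, mostly zero for a sparse graph.
matrix = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Adjacency lists: O(N + E) storage, only the edges that exist.
lists = {0: [1, 2], 1: [3], 2: [], 3: []}

def neighbors_matrix(m, n):
    """Scan a full row: O(N) per query."""
    return [j for j, bit in enumerate(m[n]) if bit]

def neighbors_list(adj, n):
    """Immediate node indices, direct lookup."""
    return adj[n]

# Both answer the same adjacency queries.
assert neighbors_matrix(matrix, 0) == neighbors_list(lists, 0) == [1, 2]
```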
Search
• Feasibility and optimality
• Planning in state space
• Unvisited, dead, or alive
• Priority queue alive states
• Cost based state
• Running time complexity
• Worse than linear
Forward Search Algorithm Template

Q.Insert(n_S) and mark n_S as visited
while Q not empty do
    n ← Q.Extract()
    if n == n_G then return SUCCESS
    for all u ∈ U(n) do
        n' ← f(n, u)
        if n' not visited then
            mark n' visited
            Q.Insert(n')
        else
            resolve duplicate n'
return FAILURE

• Alive nodes are placed on a priority queue Q
• n_S and n_G are the start and goal states, respectively
• u is an action in a list of actions U
• f(n, u) is the state transition function
• n is the current node and n' the next adjacent node
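A minimal host-side sketch of the template above (Python for illustration; substituting a FIFO queue for the priority queue Q makes this particular instance a breadth-first search):

```python
from collections import deque

def forward_search(adjacency, start, goal):
    """Forward-search template sketch. A FIFO deque stands in for the
    priority queue Q, so this instance is breadth-first search."""
    Q = deque([start])       # alive states
    visited = {start}
    while Q:
        n = Q.popleft()      # Q.Extract()
        if n == goal:
            return "SUCCESS"
        for n2 in adjacency.get(n, []):   # n' = f(n, u) for each action u
            if n2 not in visited:
                visited.add(n2)
                Q.append(n2)
            # else: duplicate n' resolved by simply skipping it
    return "FAILURE"

adjacency = {0: [1, 2], 1: [3], 2: [], 3: []}
assert forward_search(adjacency, 0, 3) == "SUCCESS"
assert forward_search(adjacency, 2, 3) == "FAILURE"
```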
Algorithms
• Cost based search
• Priority queue sort function
• Search properties:
• Dijkstra is A* without a heuristic
Search      Start  Goal  Heuristic  Optimal  Speed
Best First  no     yes   yes        no       fair
Dijkstra    yes    no    no         yes      slow
A*          yes    yes   yes        yes*     fast

* assumes an admissible heuristic
Heuristic
• Admissible = optimistic
• Never overestimate cost-to-goal
• A* with admissible heuristic
• Guarantees optimal path
• Narrows search scope
• Suboptimal, weighted heuristic
• Quality vs. efficiency tradeoff
Function    Definition
Manhattan   w * sum(abs(n_g - n))
Diagonal    w * max(abs(n_g - n))
Euclidean   w * sqrt(dot(n_g - n, n_g - n))

n – position vector of the state; n_g – goal position; w – heuristic weight; abs and the difference apply per component
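The table above can be sketched as follows (illustrative Python; positions are coordinate tuples, and the per-component sum in Manhattan is assumed from the standard definition):

```python
import math

def manhattan(n, goal, w=1.0):
    return w * sum(abs(g - c) for g, c in zip(goal, n))

def diagonal(n, goal, w=1.0):
    return w * max(abs(g - c) for g, c in zip(goal, n))

def euclidean(n, goal, w=1.0):
    return w * math.sqrt(sum((g - c) ** 2 for g, c in zip(goal, n)))

# With w = 1, Euclidean distance never overestimates the true
# cost-to-goal along straight-line segments, so it is admissible;
# w > 1 trades path optimality for a narrower, faster search.
n, goal = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
assert manhattan(n, goal) == 7.0
assert diagonal(n, goal) == 4.0
assert euclidean(n, goal) == 5.0
assert euclidean(n, goal) <= manhattan(n, goal)
```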
A*
• Irregular, highly nested
• Priority queue element
• {node index, cost } pair
• Memory bound
• Extensive scatter, gather
• Low arithmetic intensity
• Embedded in heuristic
• Unrolling inner loop
f = priority queue element, a {node index, cost} pair
F = priority queue containing the initial f = {0, 0}
G = from-start cost set, initialized to zero
P, S = pending and shortest edge sets, initialized to null
n = index of the closest node
E = node adjacency list

while F not empty do
    n ← F.Extract()
    S[n] ← P[n]
    if n is the goal then return SUCCESS
    foreach edge e in E[n] do
        h ← heuristic(e.to, goal)
        g ← G[n] + e.cost
        f ← {e.to, g + h}
        if e.to not in P, or g < G[e.to] and e.to not in S, then
            F.Insert(f)
            G[e.to] ← g
            P[e.to] ← e
return FAILURE

Cost notation:
• g(n): cost from the start to node n
• h(n): heuristic estimate of the cost from n to the goal
• f(n) = g(n) + h(n): combined cost ordering the priority queue
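A host-side reference for the A* template, using Python's heapq as the priority queue F (a sketch, not the CUDA kernel; duplicate resolution is implemented here by skipping nodes that are already settled in S):

```python
import heapq

def a_star(nodes, edges, start, goal, heuristic):
    """A* sketch. `edges[n]` is a list of (to, cost) pairs and
    `nodes[n]` a position fed to the heuristic."""
    F = [(heuristic(nodes[start], nodes[goal]), start)]
    G = {start: 0.0}   # best known from-start cost per node
    P = {}             # pending edge that discovered each node
    S = {}             # shortest (settled) edge per node
    while F:
        _, n = heapq.heappop(F)       # F.Extract()
        if n in S:                    # stale duplicate entry, skip
            continue
        S[n] = P.get(n)
        if n == goal:
            path = [goal]             # walk settled edges back to start
            while S[path[-1]] is not None:
                path.append(S[path[-1]][0])
            return G[goal], path[::-1]
        for to, cost in edges.get(n, []):
            g = G[n] + cost
            if to not in S and (to not in G or g < G[to]):
                G[to] = g
                P[to] = (n, cost)
                heapq.heappush(F, (g + heuristic(nodes[to], nodes[goal]), to))
    return None                       # FAILURE

def euclidean(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

nodes = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
edges = {0: [(1, 1.0), (2, 4.0)], 1: [(3, 1.0)], 2: [(3, 2.0)]}
cost, path = a_star(nodes, edges, 0, 3, euclidean)
assert cost == 2.0 and path == [0, 1, 3]
```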
Implementation
Software
• Game AI workloads
• GPU, CPU invocation paths
• Orthogonal multi core semantics
• CUDA graceful multi launch
• Scalar C++, SIMD intrinsics (SSE)
Tradeoffs
• Shared roadmap caching
• Working set coalesced access
• Efficient priority queue operations
• Divergent kernel parallel execution
• CUDA profiler for optimization
Roadmap Textures
• Linear device memory
• Texture reference binding
• Flattened edge list
• With adjacency directory
• Adjacency list cacheable
• Loop control direct map
• 2 or 4 components, 32 bits each
Node       id | position.x | position.y | position.z
Edge       from | to | cost | reserved
Adjacency  offset | offset + count
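A sketch of the flattening described above (illustrative Python; the real implementation binds these arrays to CUDA texture references in linear device memory):

```python
# Flatten the roadmap: one edge list plus an adjacency directory of
# (offset, offset + count) pairs per node. Data here is illustrative.

# Edges as (from, to, cost) triples, grouped by source node.
edges = [
    (0, 1, 1.0), (0, 2, 4.0),   # node 0: two outgoing edges
    (1, 3, 2.0),                # node 1: one outgoing edge
]                               # nodes 2, 3: none

def build_directory(edge_list, num_nodes):
    """Per-node (offset, end) into the flattened edge list; a kernel's
    inner loop becomes `for e in range(offset, end)`, a direct map
    with no pointer chasing."""
    directory, offset = [], 0
    for n in range(num_nodes):
        count = sum(1 for e in edge_list if e[0] == n)
        directory.append((offset, offset + count))
        offset += count
    return directory

directory = build_directory(edges, 4)
assert directory == [(0, 2), (2, 3), (3, 3), (3, 3)]

# Loop control maps directly onto the directory entry:
start, end = directory[0]
assert [edges[i][1] for i in range(start, end)] == [1, 2]
```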
Working Set
• Thread local storage
• Global memory regions
• O(T*N) storage complexity
• Node sized arrays
• 4-, 8-, and 16-byte data structures
• May exceed available memory
Inputs — O(T*N)

List    Definition                          Initialization
Paths   start, goal positions               user defined
G       cost from-start                     zero
F       sum of costs from-start, to-goal    zero
P, S    visited, dead node (edge) sets      zero

Outputs — O(T)

List    Definition                          Initialization
Costs   accumulated path cost               zero
W       subtree, plotted waypoints          zero

T – threads, N – roadmap nodes
Coalescing
• Strided memory layout
• Node id fast axis
• Interleaved organization
• Thread id running index
• Contiguous memory access
• Across a thread warp
• Array indexing inexpensive
[Figure: working-set layouts — strided stores each thread's G, F, P, S, W arrays contiguously (node id 0..N-1 as the fast axis); interleaved stores them with thread id 0..T-1 as the running index, so a warp's accesses to the same node id land on contiguous addresses]
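The index arithmetic behind the two layouts can be sketched as (illustrative Python):

```python
# With the thread id as the fast-running index, the 32 threads of a
# warp that touch the same node id hit 32 consecutive addresses,
# which the hardware coalesces into a few wide transactions.

T = 128  # threads (agents) in flight, per the launch configuration

def strided_index(thread_id, node_id, num_nodes):
    """Node id fast axis: each thread's array is contiguous."""
    return thread_id * num_nodes + node_id

def interleaved_index(thread_id, node_id, T=T):
    """Thread id fast axis: a warp's accesses are contiguous."""
    return node_id * T + thread_id

# A warp reading node 5 of its per-thread G array:
warp = range(32)
interleaved = [interleaved_index(t, 5) for t in warp]
assert interleaved == list(range(5 * T, 5 * T + 32))  # contiguous

strided = [strided_index(t, 5, num_nodes=340) for t in warp]
assert strided[1] - strided[0] == 340  # large stride, no coalescing
```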
Priority Queue
• Element pairs
• Cost, node id
• Fixed size array
• Heap based
• Logarithmic running cost
• Operation efficiency
• Insertion, extraction
• Insertions dominate
• Early success exit
__device__ void
insert(CUPriorityQ* pq, CUCost c)
{
    int i = ++(pq->size);
    CUCost* costs = pq->costs;
    // sift up: in the 1-based heap array the parent of i is i >> 1
    while (i > 1 && costs[i >> 1].cost > c.cost) {
        costs[i] = costs[i >> 1];
        i >>= 1;
    }
    pq->costs[i] = c;
}

__device__ CUCost
extract(CUPriorityQ* pq)
{
    CUCost cost;
    if (pq->size >= 1) {
        cost = pq->costs[1];                  // minimum is at the root
        pq->costs[1] = pq->costs[pq->size--]; // move last element up
        heapify(pq);                          // sift the new root down
    }
    return cost;
}
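A host-side Python mirror of the device heap can sanity-check the 1-based index arithmetic (illustrative; a plain sift-down loop stands in for the kernel's heapify):

```python
# 1-based binary min-heap: parent(i) = i >> 1, children 2i and 2i+1.

def insert(heap, cost):
    heap.append(None)             # grow; slot 0 is unused (1-based)
    i = len(heap) - 1
    while i > 1 and heap[i >> 1] > cost:
        heap[i] = heap[i >> 1]    # shift parent down
        i >>= 1
    heap[i] = cost

def extract(heap):
    top = heap[1]
    last = heap.pop()
    if len(heap) > 1:
        heap[1] = last
        i = 1                     # heapify: sift the root down
        while True:
            child = 2 * i
            if child >= len(heap):
                break
            if child + 1 < len(heap) and heap[child + 1] < heap[child]:
                child += 1        # pick the smaller child
            if heap[i] <= heap[child]:
                break
            heap[i], heap[child] = heap[child], heap[i]
            i = child
    return top

heap = [None]                     # slot 0 unused
for c in [5.0, 1.0, 3.0, 2.0]:
    insert(heap, c)
assert [extract(heap) for _ in range(4)] == [1.0, 2.0, 3.0, 5.0]
```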
Execution
• CUDA launch scope
• Consult device properties
• An agent constitutes a thread
• One-dimensional grid of
• One-dimensional thread blocks
• Kernel resource usage
• 20 registers
• 40 shared memory bytes
CUDA Occupancy Tool Data
Threads per block 128
Registers per block 2560
Warps per block 4
Threads per multiprocessor 384
Thread blocks per multiprocessor 3
Thread blocks per GPU (8800 GT) 42
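The table's numbers follow from the kernel's resource usage; a quick arithmetic check (assuming a warp size of 32 and 14 multiprocessors on the 8800 GT — both assumptions, not stated on the slide):

```python
# Occupancy arithmetic for the pathfinding kernel.
threads_per_block = 128
registers_per_thread = 20     # kernel resource usage from the slide
warp_size = 32                # assumed (G80-class hardware)
threads_per_mp = 384          # occupancy-tool result for this kernel
multiprocessors = 14          # assumed for the GeForce 8800 GT

registers_per_block = registers_per_thread * threads_per_block
warps_per_block = threads_per_block // warp_size
blocks_per_mp = threads_per_mp // threads_per_block
blocks_per_gpu = blocks_per_mp * multiprocessors

assert registers_per_block == 2560
assert warps_per_block == 4
assert blocks_per_mp == 3
assert blocks_per_gpu == 42
```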
Performance
Experiments
• Roadmap topology complexity (RTC)
• Fixed, varying agent count
• Dijkstra and non-weighted A* search
• SSE, multi core CPU scale
• CUDA interleaved kernel
• GPU timing includes copy
Benchmarks
• G0–G5: small to moderate RTC (<500 nodes)
• G6: large graph (>5000 nodes)
Graph  Nodes  Edges  Agents    Blocks  Pairs
G0     8      24     64        1       all pairs
G1     32     178    1024      8       all pairs
G2     64     302    4096      32      all pairs
G3     129    672    16641     131     all pairs
G4     245    1362   60025     469     all pairs
G5     340    2150   115600    904     all pairs
G6     5706   39156  64–9216   1–72    random pairs
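The agent and block counts for G0–G5 are consistent with all-pairs searches launched in 128-thread blocks, which can be checked directly:

```python
import math

# For G0-G5 every node pair is searched, so agents = nodes^2, and
# the block count splits the agents into 128-thread blocks.
benchmarks = {            # graph: (nodes, agents, blocks)
    "G0": (8, 64, 1),
    "G1": (32, 1024, 8),
    "G2": (64, 4096, 32),
    "G3": (129, 16641, 131),
    "G4": (245, 60025, 469),
    "G5": (340, 115600, 904),
}

for name, (nodes, agents, blocks) in benchmarks.items():
    assert agents == nodes ** 2               # all-pairs searches
    assert blocks == math.ceil(agents / 128)  # 128 threads per block
```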
Processors
• Processor properties:

Property            Intel Core 2 Duo  AMD Athlon 64 X2  8400M  8800 GT  GTX 280
Core Clock (MHz)    2000              2110              400    600      600
Shader Clock (MHz)  NA                NA                550    1500     1300
Memory Clock (MHz)  1180              667               400    900      1000
Global Memory (MB)  2048              2048              256    512      1024
Memory Bus (bits)   64                64                64     256      512
Multiprocessors     1 per core        1 per core        8      14       30
Footprint
• Working set O(T*N) dominates roadmap O(N+E)
• Per thread local 0.33–13.6 KB (G0–G5), 230 KB (G6)
[Charts: GPU total footprint (MBytes, log scale), 8800 GT — left: RTC G0–G5 at a fixed agent count, requiring 1, 1, 1, 1, 2, and 3 kernel launches respectively; right: RTC G6 at 64–9216 agents, requiring 1, 1, 1, 2, and 4 launches]
Search
• A* higher arithmetic intensity improves speedup
[Charts: GPU (8800 GT) speedup over a single-core AMD Athlon 64 X2 for RTC G0–G5, fixed agent count — left: A* with Euclidean heuristic, against optimized scalar and SSE code; right: Dijkstra, against optimized scalar code]
Multi Core
• Quad core vs. dual core speedup: 1.05X (G0–G5), 1.2X (G6)
[Charts: dual-core CPU (Intel Core 2 Duo, 2C-SSE) speedup over a single core with SSE-optimized code, A* with Euclidean heuristic — left: RTC G0–G5, fixed agent count; right: RTC G6, ascending agent count]
Cross GPU
[Charts: GPU (8400M, 8800 GT, GTX 280) speedup over a single-core AMD Athlon 64 X2 with SSE-optimized code, A* with Euclidean heuristic — left: RTC G0–G5, fixed agent count; right: RTC G6, ascending agent count]
• GTX 280 vs. 8800 GT speedup up to 2X
Running Time
• Pageable ("unlocked") memory copy overhead
• Host-to-Device (up to 50%)
• Device-to-Host (less than 5%)
Parameter                      G5        G6*
Total running time (seconds)   2.495     6.136
Average search time (seconds)  0.000021  0.000665
Average points per path        12.6576   15.8503

* agent count: 9216
[Chart: GPU (8800 GT) running time for A* with Euclidean heuristic, log scale, normalized to RTC G0, across graphs G0–G5]
Limitations
• Small agent count
• Unlocked copy expensive
• Pinned memory 1.6X overall speedup
• Software memory coalescing
• Limited, 1.15X performance scale
• Multi GPU linear scale
• Replicated roadmap expense
• Weighted A* oscillating
Futures
• Working set greedy allocation
• Dynamic, CUDA kernel malloc
• Global memory caching
• Kernel spawning threads
• Unrolled A* inner loop
• Realigning agent blocks
• Local navigation
Conclusions
• Global navigation scalable
• GPU efficient search for
• Many thousands of agents
• Nested data parallelism
• Evolving GPU opportunity
• GPU preferred platform
• Integrating core game AI
Questions?
Thank You!