GPU-Assisted Path Tracing
Matthias Boindl
Christian Machacek
Institute of Computer Graphics and Algorithms
Vienna University of Technology
Motivation: Why Path Tracing?
Physically based
  Nature provides the reference image
Parallelizable
Sublinear in #objects
Conceptually simple
  Can lead to a clean implementation
But: a fast implementation on GPUs is not trivial
Outline
Path tracing intro
  Main steps of the algorithm
Mapping the algorithm to the GPU
  How to organize code into kernels
  When to launch kernels
  How to pass data between kernels
Acceleration structures
  Focus on bounding volume hierarchies
Path Tracing Intro
Like ray tracing, except it…
  …supports arbitrary BRDFs
  …is stochastic: at each bounce, the new direction is decided randomly
Convergence video
From Pharr, Humphreys: PBRT, 2nd ed. (2010)
Path Tracing Pseudocode
while image not converged
    r = new ray from eye through next pixel
    do
        i = closest intersection of r with scene
        if no i: break
        if i is on a light source: c = c + throughput * emission
        randomly pick new direction and create reflected ray r
        evaluate BRDF at i
        update throughput
    while path throughput high enough
From Pharr, Humphreys: PBRT, 2nd ed. (2010)
Execution time: logic 15%, new path 4%, materials 25%, ray cast 56%
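The control flow of the pseudocode can be tried out without any geometry. The following is a minimal sketch, not the presenters' implementation: the scene is replaced by a hypothetical stochastic stand-in (`intersect`), and all constants (`P_LIGHT`, `P_DIFFUSE`, `EMISSION`, `ALBEDO`, `MIN_THROUGHPUT`) are assumptions chosen for illustration.

```python
import random

# Hypothetical stand-in for a scene: instead of real geometry, each ray
# randomly hits a light, hits a diffuse surface, or escapes.
P_LIGHT, P_DIFFUSE = 0.2, 0.5      # remaining 0.3: ray leaves the scene
EMISSION, ALBEDO = 1.0, 0.5
MIN_THROUGHPUT = 0.01              # "path throughput high enough" cutoff


def intersect(rng):
    """Return 'light', 'diffuse', or None (no intersection)."""
    u = rng.random()
    if u < P_LIGHT:
        return "light"
    if u < P_LIGHT + P_DIFFUSE:
        return "diffuse"
    return None


def trace_path(rng):
    """One pass through the inner do/while loop of the pseudocode."""
    c = 0.0            # radiance accumulated along this path
    throughput = 1.0
    while True:
        hit = intersect(rng)
        if hit is None:
            break
        if hit == "light":
            c += throughput * EMISSION
            throughput = 0.0       # assumption: lights do not reflect
        else:
            # Picking a new direction and evaluating the BRDF collapse to
            # a constant albedo in this stand-in scene.
            throughput *= ALBEDO
        if throughput < MIN_THROUGHPUT:
            break
    return c


def render_pixel(num_paths, seed=0):
    """Average many paths (the outer 'while image not converged' loop)."""
    rng = random.Random(seed)
    return sum(trace_path(rng) for _ in range(num_paths)) / num_paths
```

In this stand-in scene the estimate should approach P_LIGHT * EMISSION / (1 - P_DIFFUSE * ALBEDO), up to a small bias from the throughput cutoff.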
From Bikker (2013)
Megakernel Execution Divergence
Solution: Wavefront Path Tracing
Separate, specialized kernels
Keep a pool of ~1 million paths alive
Work for the next stage goes into kernel-specific, compact queues (= 4 MB index arrays)
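The queue idea can be illustrated on the CPU. This is a sequential sketch under stated assumptions: `build_queues`, `NUM_MATERIALS`, and the two input lists are hypothetical names, and a GPU version would append to the queues in parallel rather than in a loop. With about 1 million paths and 4-byte indices, one full queue comes to the 4 MB mentioned above.

```python
from array import array

NUM_MATERIALS = 3   # hypothetical number of material kernels


def build_queues(path_material, path_alive):
    """Compact path indices into one queue per consumer kernel.

    path_material[i] is the material id hit by path i; path_alive[i] is
    False for terminated paths, which go to the new-path queue instead.
    """
    # array("I") stores unsigned ints (typically 4 bytes each).
    material_queues = [array("I") for _ in range(NUM_MATERIALS)]
    new_path_queue = array("I")
    for i, alive in enumerate(path_alive):
        if alive:
            material_queues[path_material[i]].append(i)
        else:
            new_path_queue.append(i)
    return material_queues, new_path_queue
```

Each kernel then reads only its own queue, so every thread in a launch runs the same code on a densely packed list of paths.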
https://mediatech.aalto.fi/~samuli/
Results
Performance
Execution times (ms / 1M path segments)
Limitations and Possible Improvements
Higher memory requirements (+200 MB)
Kernel launch overhead
  Dynamic parallelism on GK110
  Use an outer scheduling kernel
  No CPU round trip
Launch independent stages side-by-side
  CUDA streams
  So kernels with little work don’t hog the GPU
Acceleration Structures
Find nearest intersection in O(log N)
Space partitioning vs. object partitioning
Hybrid methods exist
Performance
For interactive rendering, compromise between
  Traversal performance (build quality)
  Construction/Update time
  Update or rebuild from scratch
Adapt to GPU environment
  Memory architecture
  Parallel execution
State of the Art
Tero Karras and Timo Aila. 2013. Fast parallel construction of high-quality bounding volume hierarchies. In Proceedings of the 5th High-Performance Graphics Conference (HPG '13). ACM, New York, NY, USA, 89-99.
Close the Performance Gap
Basic Idea
Fast construction of simple BVH
  Generate leaf for each triangle
Reduce SAH cost by modifying tree
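To make "SAH cost" concrete, here is a minimal sketch of evaluating the surface area heuristic for a given BVH. The `Node` layout and the cost constants `c_inner` and `c_tri` are illustrative assumptions, not values taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3                      # AABB minimum corner
    hi: Vec3                      # AABB maximum corner
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    num_tris: int = 0             # > 0 only for leaves


def surface_area(lo, hi):
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_cost(root, c_inner=1.2, c_tri=1.0):
    """Expected traversal cost under the surface area heuristic."""
    root_area = surface_area(root.lo, root.hi)
    total = 0.0
    stack = [root]
    while stack:
        n = stack.pop()
        # Probability of a random ray hitting this node, relative to the root.
        rel = surface_area(n.lo, n.hi) / root_area
        if n.left is None:                 # leaf: pay per triangle test
            total += c_tri * rel * n.num_tris
        else:                              # inner node: pay one traversal step
            total += c_inner * rel
            stack.extend((n.left, n.right))
    return total
```

A lower value means that a random ray is expected to visit fewer and smaller nodes, which is what the tree modifications aim for.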
Treelets
Allow local tree modification
A, B, C, F are leaves; D, E, G are internal nodes
Treelet Construction
Find root: parallel bottom-up traversal
  Start with leaves
  Use an atomic counter at each internal node (where two paths meet)
  Ensures all children have been processed
Build treelet
  Add both children
  Pick children with highest surface area
Fixed size: 7 leaf nodes
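The "build treelet" step can be sketched as a greedy expansion. This is one reading of the bullets above, with a hypothetical `Node` type that stores only a precomputed surface area; the bottom-up root search with atomic counters is not shown.

```python
from dataclasses import dataclass
from typing import Optional

TREELET_LEAVES = 7


@dataclass(eq=False)                   # compare nodes by identity
class Node:
    area: float                        # surface area of the node's AABB
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self):
        return self.left is None


def form_treelet(root, max_leaves=TREELET_LEAVES):
    """Grow a treelet from an inner node; returns (internal, treelet_leaves)."""
    internal = [root]
    leaves = [root.left, root.right]          # add both children
    while len(leaves) < max_leaves:
        # Only treelet leaves that are inner nodes of the BVH can be expanded.
        candidates = [n for n in leaves if not n.is_leaf]
        if not candidates:
            break
        best = max(candidates, key=lambda n: n.area)   # highest surface area
        leaves.remove(best)
        internal.append(best)
        leaves.extend((best.left, best.right))
    return internal, leaves
```

Note that a treelet leaf may itself be an inner node of the full BVH; only the topology above the treelet leaves is rearranged afterwards.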
Rearrange Treelet
Minimize the SAH cost (total node surface area) of the treelet
  Naive implementation: test each permutation
  Better: dynamic programming
    Caching of best intermediate results
    Start with leaves, then pairs, then triplets, …
Suboptimal subtree construction avoided
Parallelizable as well
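The dynamic-programming idea can be sketched with subsets of treelet leaves encoded as bitmasks. This is a simplified sequential illustration, not the presenters' code: it assumes the objective is the summed surface area of the treelet's internal nodes, and `optimize_treelet`, `union`, and `area` are hypothetical helper names. Boxes are `(lo, hi)` pairs of 3-tuples.

```python
def union(a, b):
    (alo, ahi), (blo, bhi) = a, b
    return (tuple(map(min, alo, blo)), tuple(map(max, ahi, bhi)))


def area(box):
    lo, hi = box
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def optimize_treelet(leaf_boxes):
    """Return (best_cost, topology) over all binary trees on the leaves.

    Cost = sum of internal-node surface areas; leaf terms are omitted
    because they are the same for every topology.
    """
    n = len(leaf_boxes)
    full = (1 << n) - 1
    box = [None] * (full + 1)      # cached AABB per subset
    cost = [0.0] * (full + 1)      # cached best cost per subset
    tree = [None] * (full + 1)     # cached best topology per subset
    # Every proper subset of s is numerically smaller than s, so increasing
    # order guarantees that smaller subsets are solved first.
    for s in range(1, full + 1):
        low = s & -s                      # lowest set bit of s
        i = low.bit_length() - 1
        rest = s ^ low
        if rest == 0:                     # single leaf
            box[s], tree[s] = leaf_boxes[i], i
            continue
        box[s] = union(leaf_boxes[i], box[rest])
        best = None
        p = (s - 1) & s                   # enumerate proper subsets of s
        while p:
            q = s ^ p
            if p > q:                     # visit each split {p, q} once
                cand = cost[p] + cost[q]
                if best is None or cand < best:
                    best, tree[s] = cand, (tree[p], tree[q])
            p = (p - 1) & s
        cost[s] = area(box[s]) + best
    return cost[full], tree[full]
```

With 7 leaves there are only 127 subsets, so the tables stay tiny; the returned topology is a nested tuple of leaf indices.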
Results
Gap closed
Results
Speed/Quality tradeoff
Conclusion
Use specialized kernels
  Lower execution divergence
  (Better use of instruction cache)
  (Fewer registers used simultaneously)
Construct acceleration structures quickly
  But not too quickly
Thanks for your attention!
Logic Kernel
Does not need a queue, operates on all paths
If shadow ray was unblocked, add light contribution
Find material or light source the ray hits
  Place path into proper material queue
Russian roulette
If path terminated, accumulate to image
  Place path into new path queue
Sample light sources (aka next event estimation)
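The Russian roulette step from the list above can be sketched as follows. This is a minimal scalar version under assumptions: the throughput is treated as a single float, and the survival probability (throughput clamped by a hypothetical `max_survival`) is one common choice, not necessarily the one used in the talk.

```python
import random


def russian_roulette(throughput, rng, max_survival=0.95):
    """Return the reweighted throughput, or None if the path is terminated."""
    p_survive = min(max_survival, throughput)
    if p_survive <= 0.0 or rng.random() >= p_survive:
        return None
    # Dividing by the survival probability keeps the estimator unbiased.
    return throughput / p_survive
```

Paths that return None would be accumulated to the image and placed into the new path queue; survivors continue with the boosted throughput.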
New Path Kernel
Generate a new image-space sample
Generate camera ray
  Place it into extension ray cast queue
Initialize path state
  Throughput
  Pixel position
  etc.
Material Kernels
Generate incoming direction
Evaluate light contribution based on light sample generated in the logic kernel
We haven’t cast the shadow ray yet!
For MIS: p(light sample) from the BSDF
Discard BSDF stack
Queue
  extension ray
  (shadow ray)
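The MIS bullet above refers to weighting the light sample by both sampling densities. A minimal scalar sketch, assuming the balance heuristic (the slides do not say which heuristic is used) and hypothetical function and parameter names:

```python
def balance_heuristic(pdf_a, pdf_b):
    """MIS weight for a sample drawn from strategy a (balance heuristic)."""
    total = pdf_a + pdf_b
    return pdf_a / total if total > 0.0 else 0.0


def light_sample_contribution(emission, bsdf_value, cos_theta,
                              pdf_light, pdf_bsdf):
    """Weighted contribution of a light sample, before the shadow test.

    pdf_bsdf is the density with which BSDF sampling would have produced
    the same direction; the material kernel is what can evaluate it.
    """
    if pdf_light <= 0.0:
        return 0.0
    w = balance_heuristic(pdf_light, pdf_bsdf)
    return w * emission * bsdf_value * cos_theta / pdf_light
```

The result is only added to the pixel later, once the shadow ray has been found to be unblocked.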
Ray Cast Kernels
Extension rays
  Find first intersection against scene geometry
  Store hit data into path state
Shadow rays
  Blocked or not?