Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)
INTRODUCTION TO MONTE CARLO RAY TRACING
OPENCL IMPLEMENTATION
TAKAHIRO HARADA 9/2014
2 | Introduction to Monte Carlo Ray Tracing OpenCL implementation | SEPT 3, 2014
RECAP OF LAST SESSION
• Talked about theory
• BRDFs
  ‒ Reflection, Refraction, Diffuse, Microfacet
• Fresnel is everywhere
• Monte Carlo Ray Tracing
  ‒ Intuitive understanding of Monte Carlo Integration
  ‒ Simple sampling (Random sampling)
  ‒ Better sampling (Importance sampling)
  ‒ Layered material
http://www.slideshare.net/takahiroharada/introduction-to-monte-carlo-ray-tracing-cedec2013
REVIEW
SIMPLE CPU MC RAY TRACER
Direct illumination

    for( i, j )
    {
        ray = PrimaryRayGen( camera, pixelLoc );
        {
            hit = Trace( ray );
            if( hit )
                d( pixelLoc ) += EvaluateDI( ray, hit );
        }
    }
REVIEW
SIMPLE CPU MC RAY TRACER
Indirect illumination

    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit ) break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
COMPARISON
[Figure: Direct illumination vs. Indirect illumination]
WHY OPENCL?
• Speed!
  ‒ GPU can accelerate it
  ‒ Why? The faster, the better
• OpenCL is an API for GPU compute
• OpenCL is not only for graphics programmers
• OpenCL does not always require a GPU
  ‒ Runs on CPU too
  ‒ Runs if there is a CPU (everywhere)
• If a renderer is written in OpenCL, it runs on Windows, Linux, and Mac OS X
PORTING TO OPENCL (FIRST ATTEMPT)
THINGS TO BE DONE
DATA STRUCTURE
• No pointer in OpenCL*
• Change pointer to index
• Stored in a flat memory
• Not suited for partial update
*Shared Virtual Memory (OpenCL 2.0)
THINGS TO BE DONE
DATA STRUCTURE
• Node data for a binary tree
  ‒ Spatial acceleration structure (BVH)
  ‒ Shading network
• Buffer<NodeData> nodeData;
[Figure: the nodeData buffer as consecutive Node Data records, each holding m_max.x, m_max.y, m_max.z, m_min.x, m_min.y, m_min.z, m_child0, m_child1]
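The pointer-to-index change can be sketched in plain C. This is only an illustration, not the talk's code: the fields follow the Node Data figure, while `find_leaf`, the use of -1 to mark a missing child, and the fixed-size index stack are assumptions made for the example.

```c
/* Hypothetical layout mirroring the Node Data figure. */
typedef struct {
    float max[3];   /* m_max.x, m_max.y, m_max.z */
    float min[3];   /* m_min.x, m_min.y, m_min.z */
    int   child0;   /* index into the node array, -1 if absent (assumption) */
    int   child1;   /* index into the node array, -1 if absent (assumption) */
} NodeData;

enum { STACK_MAX = 64 };   /* assumed upper bound for this sketch */

static int contains(const NodeData *n, const float p[3])
{
    for (int a = 0; a < 3; ++a)
        if (p[a] < n->min[a] || p[a] > n->max[a]) return 0;
    return 1;
}

/* Returns the index of a leaf whose box contains p, or -1.
   Children are followed by index through one flat array, so the
   buffer can be copied to a device without fixing up addresses. */
int find_leaf(const NodeData *nodes, int root, const float p[3])
{
    int stack[STACK_MAX];
    int top = 0;
    stack[top++] = root;
    while (top > 0) {
        int idx = stack[--top];
        const NodeData *n = &nodes[idx];
        if (!contains(n, p)) continue;
        if (n->child0 < 0 && n->child1 < 0) return idx;   /* leaf */
        if (top + 2 > STACK_MAX) return -1;               /* sketch-only guard */
        if (n->child1 >= 0) stack[top++] = n->child1;
        if (n->child0 >= 0) stack[top++] = n->child0;
    }
    return -1;
}
```

Because only indices are stored, the same array is valid on host and device.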
THINGS TO BE DONE
DATA STRUCTURE
• Material
  ‒ Texture entry
• Buffer<Material> material;
• Buffer<char> texData;
• Buffer<uint> texTable;
[Figure: Material0, Material1, ... records (m_kd, m_ior, m_..., m_kdTex, m_iorTex, m_bumpTex); a TextureTable with entries tex0, tex1, tex2, tex3, tex4, ...; Texture0 to Texture3 records, each with m_header and m_data]
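One way to read the three buffers is that a material stores a texture index, the table maps that index to a byte offset, and the texture bytes live in one flat array. The sketch below assumes exactly that. The header layout (`TexHeader`), the single-channel 8-bit texels, and `fetch_texel` are hypothetical and not taken from the slides.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t width, height; } TexHeader;   /* hypothetical m_header */

typedef struct {
    float kd[4];
    float ior;
    int   kdTex;     /* index into texTable, -1 if unused (assumption) */
    int   iorTex;
    int   bumpTex;
} Material;

/* texTable[texIdx] is assumed to be the byte offset of the texture
   (header followed by texels) inside texData. */
uint8_t fetch_texel(const unsigned char *texData, const uint32_t *texTable,
                    int texIdx, uint32_t x, uint32_t y)
{
    TexHeader h;
    const unsigned char *base = texData + texTable[texIdx];
    memcpy(&h, base, sizeof h);   /* copy out the header; avoids unaligned reads */
    return base[sizeof h + y * h.width + x];
}
```

A material is then looked up as `fetch_texel(texData, texTable, m.kdTex, x, y)`, with no pointer stored anywhere in the buffers.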
THINGS TO BE DONE
WRITING OPENCL KERNEL
CPU code

    for( i, j )
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit ) break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }

OpenCL kernel

    __kernel void PtKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit ) return;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }
IT WORKS BUT…
• This approach is simple
• But it has a lot of issues
DRAWBACKS
PERFORMANCE
• Likely does not utilize hardware efficiently
  ‒ SIMD divergence
  ‒ GPU occupancy (latency)
• Maintainability
• Extendibility, Portability
GPU ARCHITECTURE
OPENCL ON CPU
• Processing element executes Work item (thread)
  ‒ A SIMD lane (4*)
• Compute unit executes Work group (thread group)
  ‒ A core (8*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel (shader)
  ‒ A CPU
  ‒ # of compute units != # of work groups
* On AMD FX-8350
[Figure: Kernel, Work group, and Work item mapped onto Compute Unit and Processing element]
GPU VS CPU
• Processing element executes Work item
  ‒ A SIMD lane (64*)
• Compute unit executes Work group
  ‒ A SIMD engine (44x4*)
  ‒ # of processing elements != # of work items
• Compute device executes Kernel
  ‒ A GPU
  ‒ # of compute units != # of work groups
* On AMD Radeon R9 290X
[Figure: the Kernel / Work group / Work item mapping drawn side by side for a CPU, Compute Unit (4), and a GPU, Compute Unit (64)]
HIGH LEVEL DESCRIPTION
• Today's GPU is similar to a CPU (if you look at a very high level)
  ‒ GPU is an extremely wide CPU
  ‒ Many cores
  ‒ Wide SIMD
• AMD Radeon R9 290X GPU
  ‒ 176 = 44x4 SIMD engines (cores)
  ‒ 64 wide SIMD
• But different in
  ‒ SIMD width (very wide)
  ‒ Limited local resources
  ‒ Strategy to hide latency
• Knowing these differences is the key to exploiting the performance
SIMD DIVERGENCE
• SIMD execution = Program counter is shared among SIMD lanes
• If it diverges in branches, HW utilization decreases a lot (gets easier to diverge on wide SIMD)

    int funcA()
    {
        int value = 0;
        int a = computeA();
        if( a == 0 )      value = compute0();
        else if( a == 1 ) value = compute1();
        else if( a == 2 ) value = compute2();
        else if( a == 3 ) value = compute3();
        return value;
    }

[Figure: execution of funcA across Lane0 to Lane7]
WIDE SIMD EXECUTION
[Figure: the funcA example on Lane0 to Lane7, with happy and sad face icons marking the lanes; the final frame shows only happy faces]
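The cost of divergence can be estimated with a small model. This is a sketch under simplifying assumptions, not a description of real hardware scheduling. It assumes every branch costs the same and that the SIMD unit runs one masked pass per distinct branch taken by any lane. The name `simd_utilization` is made up for the example.

```c
/* a[i] is the branch taken by lane i (0 .. numBranches-1).
   Returns useful lane slots divided by issued lane slots. */
double simd_utilization(const int *a, int lanes, int numBranches)
{
    int passes = 0;   /* masked passes the SIMD unit has to issue */
    int useful = 0;   /* lane slots that do real work */
    for (int b = 0; b < numBranches; ++b) {
        int active = 0;
        for (int i = 0; i < lanes; ++i)
            if (a[i] == b) ++active;
        if (active > 0) {
            ++passes;
            useful += active;
        }
    }
    if (passes == 0 || lanes == 0) return 0.0;
    return (double)useful / ((double)passes * (double)lanes);
}
```

With eight lanes spread over the four branches of `funcA`, the model gives 0.25. When all lanes take the same branch it gives 1.0.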
LATENCY
• The highest latency comes from memory access
• CPUs avoid it by having a larger cache
  ‒ Latency of cache access is small (fast)
• Most memory accesses do not go to memory
• A CPU can run at full speed until a cache miss
• The number of concurrent executions on the GPU is far larger than on the CPU
  ‒ More than 11k (= 44x4x64) work items
• The GPU cache is not large enough to absorb memory requests from those work items if they all request different parts of memory
• Strategy
  ‒ Keep memory access as local as possible (not realistic for practical apps)
  ‒ Use the GPU mechanism for latency hiding
GPU LATENCY HIDING
• GPU can execute at full speed if there are only ALU instructions (Inst. 0 - 2) *
• Stalls on memory access instruction (Inst. 3)
* Can hide latency using logical vector
[Figure: Inst. 0, Inst. 1, Inst. 2, Inst. 3 (MemAccess), and Inst. 4 issued in sequence across the SIMD lanes]
GPU LATENCY HIDING
• When stalled, switch to another work group
• Could fill the stall with instructions from WG1
• A SIMD of a GPU needs to process multiple WGs at the same time to hide latency (or maximize its throughput)
[Figure: instruction streams of WG0 (Inst. 0 to 4) and WG1 (Inst. 0 to 3) on the same SIMD lanes]
HOW MANY WGS CAN WE EXECUTE PER SIMD
• 10 wavefronts (64 WIs) per SIMD is the max
• It depends on local resource usage of the kernel
• VGPR usage is often the problem
• Share 256 VGPRs among n work groups
  ‒ 1 wavefront, 256 VGPRs (bad)
  ‒ 2 wavefronts, 128 VGPRs
  ‒ 4 wavefronts, 64 VGPRs (good)
  ‒ 10 wavefronts, 25 VGPRs
• Share 16KB LDS among n work groups
  ‒ 1 work group, 16KB (bad)
  ‒ 2 work groups, 8KB
  ‒ 4 work groups, 4KB (good)
• VGPRs
  ‒ Registers used by vector ALUs
  ‒ 64KB/SIMD
  ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)
• LDS (Local data share)
  ‒ 64KB/CU (CU == 4 SIMDs)
  ‒ 32KB/SIMD
ADVICE TO REDUCE VGPR PRESSURE
GET MORE PERFORMANCE FROM GPU
• Don't write a large kernel
• If the program can be split into several pieces, split them into several kernels
  ‒ Single kernel approach
    ‒ VGPR usage of the kernel is 200 = max(60, 200, 10)
    ‒ 1 wavefront per SIMD
    ‒ Bad for latency hiding
  ‒ Multiple kernel approach
    ‒ FuncA: 4 wavefronts per SIMD
    ‒ FuncB: 1 wavefront per SIMD
    ‒ FuncC: 10 wavefronts per SIMD
    ‒ FuncB is bad, but FuncA and FuncC run fast
• Helps compiler too
[Figure: a Single Kernel containing FuncA (60 VGPRs), FuncB (200 VGPRs), and FuncC (10 VGPRs), versus Multiple Kernels with one function each]
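The wavefront counts quoted above follow from dividing the 256 VGPRs available per SIMD lane by the kernel's VGPR usage and capping the result at 10. The helper below is a sketch of that arithmetic only. The function name is made up, and it ignores register allocation granularity and the LDS limit, both of which can lower the real number.

```c
enum {
    VGPRS_PER_LANE     = 256,  /* from the slides: 64KB / 64 lanes / 4 bytes */
    MAX_WAVES_PER_SIMD = 10    /* from the slides */
};

/* Upper bound on concurrent wavefronts per SIMD from VGPR usage alone. */
int wavefronts_per_simd(int vgprsPerWorkItem)
{
    int n;
    if (vgprsPerWorkItem <= 0) return MAX_WAVES_PER_SIMD;
    n = VGPRS_PER_LANE / vgprsPerWorkItem;
    return n > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : n;
}
```

For the example functions, 60 VGPRs gives 4 wavefronts, 200 gives 1, and 10 gives 10. A single kernel containing all three is limited by the 200-VGPR function and gets 1.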
WHAT IS THE SCALAR ARCHITECTURE?
VLIW4 (NI), VLIW5 (EG)
• Best for vector computation
• Low efficiency on scalar computation
• Physical concurrent execution
  ‒ 16 work items
  ‒ 4 ALU operations each
  ‒ Total: 16x4 ALU operations
• 1 SIMD is running in a CU
• Difficult to fill xyzw
• If not filled, we waste HW cycles
Scalar Architecture (SI, CI)
• Good for both vector and scalar computation
• Need more work groups to fill GPU
• Physical concurrent execution
  ‒ 16x4 work items
  ‒ 1 ALU operation each
  ‒ Total: 16x4 ALU operations
• 4 SIMDs are running in a CU
• 4x more work groups are necessary to fill HW
[Figure: VLIW: one SIMD (SIMD0) with Lane 0 to Lane 15, each lane with X Y Z W slots; scalar: SIMD0 to SIMD3]
ANOTHER SOLUTION
• Splitting computation into multiple kernels
  ‒ Primary ray gen kernel
  ‒ Trace kernel
  ‒ Evaluate DI kernel
  ‒ Etc
• Better HW utilization
  ‒ Less divergence
  ‒ Higher HW occupancy
OTHER BENEFITS?
[Figure panels: Ray Generation, First Hit, First Hit (Normal), Direct Illumination, Indirect Illumination]
• Maintainability
  ‒ Debugging is not as easy as in C or C++
  ‒ Not all compilers are mature
    ‒ Can hit a compiler bug, which is hard to debug
  ‒ Helps compiler
    ‒ By splitting kernels, we can isolate the issue
  ‒ If the code is developed by many people, this is important
• Extendibility, Portability
  ‒ Easy to extend features
  ‒ Primary Ray Gen Kernel
    ‒ Add another camera projection
  ‒ Ray Casting Kernel
    ‒ Easy to add other primitives (e.g., vector displacement)
    ‒ Take it out for physics ray casting queries
PORTING TO OPENCL (SECOND ATTEMPT)
SPLITTING KERNELS
TRANSFORMING CPU CODE
Naïve CPU implementation

    forAll()
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit ) break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }

Preparing for OpenCL implementation

    {
        forAll() PrimaryRayGenKernel();
        while(1)
        {
            forAll() TraceKernel();
            if( !any( hits ) ) break;
            forAll() SampleLightKernel();
            forAll() TraceKernel();
            forAll() AccumulateDIKernel();
            forAll() SampleNextRayKernel();
        }
    }

Each for loop => A kernel execution
SPLITTING KERNELS
CPU implementation

    forAll()
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        while(1)
        {
            hit = Trace( ray );
            if( !hit ) break;
            d( pixelLoc ) += EvaluateDI( ray, hit, rayState );
            ray, rayState = sampleNextRay( ray, hit );
        }
    }

Host code

    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
SPLITTING KERNELS
Host Code

    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }

Device Code

    __kernel void PrimaryRayGenKernel();   // Generate rays for all pixels in parallel
    __kernel void TraceKernel();           // Compute intersection for all rays in parallel
    __kernel void SampleLightKernel();     // Sample light for all hit points in parallel
    __kernel void AccumulateDIKernel();    // Accumulate DI for all hit points in parallel
    __kernel void SampleNextRayKernel();   // Generate bounced rays for all hit points in parallel
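The control flow of the host loop can be tried out on the CPU by treating each kernel as a function called once per work item. The sketch below is a toy, not the renderer. The `Ray` struct, the `bouncesLeft` counter standing in for "the next trace hits something", and the one-unit "radiance" are all invented for illustration. The light sampling and shadow trace steps are left out.

```c
typedef struct {
    int alive;        /* 0 once the path has terminated */
    int bouncesLeft;  /* toy stand-in for scene intersection results */
} Ray;

static void trace_kernel(const Ray *rays, int *hits, int i)
{
    hits[i] = rays[i].alive && rays[i].bouncesLeft > 0;
}

static void accumulate_di_kernel(const int *hits, int *fb, int i)
{
    if (hits[i]) fb[i] += 1;   /* toy contribution: one unit per hit */
}

static void sample_next_ray_kernel(Ray *rays, const int *hits, int i)
{
    if (hits[i]) rays[i].bouncesLeft -= 1;
    else         rays[i].alive = 0;
}

/* Each inner for loop stands for one kernel launch over n work items.
   Returns the number of loop iterations that were executed. */
int render(Ray *rays, int *hits, int *fb, int n)
{
    int iterations = 0;
    for (;;) {
        int any = 0;
        for (int i = 0; i < n; ++i) trace_kernel(rays, hits, i);
        for (int i = 0; i < n; ++i) any |= hits[i];          /* any( hits ) */
        if (!any) break;
        for (int i = 0; i < n; ++i) accumulate_di_kernel(hits, fb, i);
        for (int i = 0; i < n; ++i) sample_next_ray_kernel(rays, hits, i);
        ++iterations;
    }
    return iterations;
}
```

All state that survives from one launch to the next (`rays`, `hits`, `fb`) lives in arrays passed to every function, which is the role global memory plays on the device.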
DESIGN
LOCALIZE BRANCH
[Figure: the host loop shown twice, with labels marking where branching on Camera Type and on Brdf Type is localized]
DESIGN
RAY STATE
• Cannot keep state between kernels
• Ray state needs to be saved to/restored from global memory
• Example
  ‒ Ray generation + ray direction visualization

    struct RayState
    {
        float4 m_throughput;
        int2 m_randomNumber;
        int m_pixelIdx;
    };

    __kernel void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState; // save
    }

    __kernel void VisualizeRayKernel(__global ...)
    {
        RayState s = gRayState[get_global_id(0)]; // restore
        Ray ray = gRay[get_global_id(0)];
        gFb[s.m_pixelIdx] = Ray_getDir( ray );
    }
DESIGN
RAY STATE
[Figure: PrimaryRayGenKernel work items (In: pixelIdx, Out: Ray) write to global memory (state); VisualizeKernel work items (In: Ray, Out: Pixel color) read from it]
RAY COMPACTION
• Sparse data lowers SIMD utilization
• Without compaction
  ‒ 3 SIMD executions
  ‒ Occupancy (3/8, 1/8, 4/8)
• With compaction
  ‒ 1 SIMD execution
  ‒ Occupancy (7/8)
*When rays are created for all pixels, this is not necessary
• No need to write a compaction kernel
• Can compact using global atomics
  ‒ Prepare a counter (gRayCount)
  ‒ Perform atomic increment to reserve memory
  ‒ Better to do atomics in WG first, then do an atomic add per WG

    __kernel void PrimaryRayGenKernel(__global ...)
    {
        ray, rayState = PrimaryRayGen( camera, pixelLoc );
        int dst = atom_inc( &gRayCount );
        gRay[dst] = ray;
        gRayState[dst] = rayState;
    }

[Figure: Primary and Secondary ray buffers, without and with compaction]
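The counter-based compaction, and the per-work-group variant mentioned in the last bullet, can be sketched with C11 atomics. This is an illustration under assumptions. The loops run serially here, so output order is deterministic, whereas on a GPU the slot order depends on scheduling. The `Ray` layout and both function names are hypothetical.

```c
#include <stdatomic.h>

typedef struct { float dir[3]; int pixelIdx; } Ray;   /* hypothetical layout */

/* One atomic increment per live ray, as in the kernel above. */
int compact_rays(const Ray *in, const int *alive, int n, Ray *out)
{
    atomic_int count;
    atomic_init(&count, 0);
    for (int i = 0; i < n; ++i) {                 /* one iteration = one work item */
        if (!alive[i]) continue;
        int dst = atomic_fetch_add(&count, 1);    /* reserve one slot */
        out[dst] = in[i];
    }
    return atomic_load(&count);
}

/* Count inside the group first, then do a single global atomic add per group. */
int compact_rays_per_group(const Ray *in, const int *alive, int n,
                           int groupSize, Ray *out)
{
    atomic_int count;
    atomic_init(&count, 0);
    for (int g = 0; g < n; g += groupSize) {
        int end = g + groupSize < n ? g + groupSize : n;
        int local = 0;
        for (int i = g; i < end; ++i)             /* local atomics on a GPU */
            local += alive[i] != 0;
        int base = atomic_fetch_add(&count, local);   /* reserve a block of slots */
        for (int i = g; i < end; ++i)
            if (alive[i]) out[base++] = in[i];
    }
    return atomic_load(&count);
}
```

The second version touches the global counter once per group instead of once per ray, which is the reason the slide recommends it.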
DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
  ‒ Want to keep the work uniform
    ‒ Different # of light samples per ray isn't good
    ‒ Compute contribution from one point on a light
• Simple approach
  ‒ Select a light
  ‒ Select a point on a light
  ‒ Compute DI without occlusion term
• More sophisticated light sampling
  ‒ Using potential contribution for PDF
  ‒ Forward+ style light culling

    __kernel void SampleLightKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        shadowRay, lfnDotV = Light_Sample( ray, s );
        gShadowRay[GIDX] = shadowRay;
        gDi[GIDX] = lfnDotV;
        gRayState[GIDX] = s;
    }
DIRECT ILLUMINATION COMPUTATION
• SampleLightKernel
• TraceRayKernel
  ‒ Check if the point on the light is visible or not
  ‒ Reuse code
• AccumulateDIKernel
  ‒ If the ray is not blocked, accumulate the result

    __kernel void SampleLightKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        shadowRay, lfnDotV = Light_Sample( ray, s );
        gShadowRay[GIDX] = shadowRay;
        gDi[GIDX] = lfnDotV;
        gRayState[GIDX] = s;
    }

    __kernel void AccumulateDIKernel(__global ...)
    {
        Hit shadowHit = gShadowHit[GIDX];
        float4 di = gDi[GIDX];
        if( !shadowHit )
            gFb[GIDX] += di;
    }
SAMPLE NEXT RAY
• Compute next ray by sampling BRDF
• Store ray and ray state

    __kernel void SampleNextRayKernel(__global ...)
    {
        RayState s = gRayState[GIDX];
        Ray ray = gRay[GIDX];
        Hit hit = gHit[GIDX];
        if( !hit ) return;
        nextRay, s = Brdf_Sample( ray, s );
        int dst = atom_inc( &gRayCount );
        gRayNext[dst] = nextRay;
        gRayStateNext[dst] = s;
    }
TRACE KERNEL
• BVH is used for acceleration structure
  ‒ Index is used to describe hierarchy structure (no pointer)
• 2 level BVH
  ‒ Top: stores an object in a leaf (object index, transform)
  ‒ Bottom: stores a primitive (triangle, quad) in a leaf
• Store those BVHs in a single memory
  ‒ Traverse top tree
  ‒ Hit a leaf, transform the ray into object space
  ‒ Traverse bottom tree
  ‒ On exit, transform the ray back to world space
[Figure: a Top BVH (nodes 0 to 6) whose leaves hold Mesh0/xform0, Mesh1/xform1, Mesh2/xform2, Mesh3/xform3, each with its own Bottom BVH; all trees are stored in one array (slots 0 to 9) as Top, Bottom A, Bottom B, Bottom C, Bottom D, with a root idx]
SO FAR
• Explained an OpenCL implementation of a simple path tracer
• Easy to extend from here
• Extension can be done by swapping one or two kernels
  ‒ Material system, Shader
  ‒ Light sampling
  ‒ Support for different types of primitives
  ‒ Ray caster + spatial acceleration structure
ADVANCED TOPICS
INSTANCING
• Powerful technique to increase geometric complexity
• Small memory overhead
  ‒ Shares geometric information (vertex, normal etc)
  ‒ Shares BVH
  ‒ Stores object transform
[Figure: a Top BVH (nodes 0 to 6) with leaves Mesh0/xform0, Mesh0/xform1, Mesh0/xform2, Mesh1/xform3; memory holds Bottom A, Top, Bottom B]
LAYERED MATERIAL
• Binary tree of BRDFs
• Leaf node
  ‒ BRDF
• Internal node
  ‒ Blend function
  ‒ Fresnel blend, Linear blend
• Evaluate one BRDF at a time
  ‒ Traverse binary tree
  ‒ Random sampling at internal node
[Figure: example tree with leaves Reflect, Microfacet, and Diffuse; each internal node picks a child with probability 0.5, giving pdf=0.5 for Reflect and pdf=0.25 each for Microfacet and Diffuse]
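The "random sampling at internal node" step can be sketched as a walk from the root to a leaf, multiplying the selection probabilities along the way. This is a sketch under assumptions. The node layout, the constant 0.5 weights (a Fresnel blend would presumably derive its weight from the view direction), and `select_brdf` are invented for illustration.

```c
typedef enum { NODE_BLEND, NODE_BRDF } NodeType;

typedef struct {
    NodeType type;
    int   child0, child1;  /* node indices, used by NODE_BLEND */
    float weight0;         /* probability of descending into child0 */
    int   brdfId;          /* used by NODE_BRDF */
} MaterialNode;

/* u[level] is a uniform random number in [0,1) for each tree level.
   Returns the selected BRDF id (or -1) and writes the probability
   with which that leaf was selected to *pdf. */
int select_brdf(const MaterialNode *nodes, int root,
                const float *u, int maxDepth, float *pdf)
{
    int idx = root;
    int level = 0;
    *pdf = 1.0f;
    while (nodes[idx].type == NODE_BLEND && level < maxDepth) {
        const MaterialNode *n = &nodes[idx];
        if (u[level] < n->weight0) {
            *pdf *= n->weight0;
            idx = n->child0;
        } else {
            *pdf *= 1.0f - n->weight0;
            idx = n->child1;
        }
        ++level;
    }
    return nodes[idx].type == NODE_BRDF ? nodes[idx].brdfId : -1;
}
```

For a tree where the root blends Reflect with a second blend node over Microfacet and Diffuse, all with weight 0.5, this yields probabilities of 0.5, 0.25, and 0.25. The selected BRDF's contribution then has to be divided by that probability to keep the estimate unbiased.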
LAYERED MATERIAL
Host loop with SelectBRDFKernel:

    {
        launch( PrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SelectBRDFKernel );
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
    }
MORE ADVANCED TOPICS
VR
• Latency is super important
• To improve frame rendering time,
  ‒ Used multiple GPUs
  ‒ Foveated rendering
• More than 60fps on 4 Hawaii GPUs
  ‒ 6M triangles
  ‒ 32 shadow rays/sample
  ‒ 2 AA rays/sample

    {
        launch( VRPrimaryRayGenKernel );
        while(1)
        {
            launch( TraceKernel );
            if( !any( hits ) ) break;
            launch( SampleLightKernel );
            launch( TraceKernel );
            launch( AccumulateDIKernel );
            launch( SampleNextRayKernel );
        }
        launch( FillPixelKernel );
    }
DISPLACEMENT MAPPING
• Powerful technique to increase geometric complexity
• Pre-tessellation
  ‒ Required memory is too large
  ‒ GPU memory is too small
• Direct ray tracing
  ‒ When a ray hits a patch, tessellate and displace
• To amortize tessellation and displacement cost, batch ray intersection
• Need to change TraceKernel
[Figure: Base mesh, Vector displacement map, With vector displacement. Fig. from http://support.nextlimit.com/display/mxdocsv3/Displacement+component]
DISPLACEMENT MAPPING
• TraceKernel
  ‒ If a ray hits a quad with a displacement map, save (ray, primitive) to a buffer
  ‒ Sort (ray, primitive) pairs by primitive index
  ‒ Process primitives in the list in parallel
• For each patch
  ‒ Build quad BVH in parallel
  ‒ Cast rays in parallel
• Key is work buffer memory allocation
[Figure: per-patch processing by level, Level 0 (1 node), Level 1 (4 nodes), Level 2 (16 nodes), with BVH build and Ray Cast stages at each level]
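The sort-then-batch step can be sketched on the CPU as follows. This is only an illustration of the grouping. A GPU implementation would presumably use a parallel sort rather than `qsort`, and the `RayPrim` struct and `batch_by_primitive` are hypothetical names.

```c
#include <stdlib.h>

typedef struct { int ray; int prim; } RayPrim;   /* one deferred intersection test */

static int by_prim(const void *a, const void *b)
{
    const RayPrim *x = (const RayPrim *)a;
    const RayPrim *y = (const RayPrim *)b;
    if (x->prim != y->prim) return x->prim < y->prim ? -1 : 1;
    return (x->ray > y->ray) - (x->ray < y->ray);   /* tie-break for a stable result */
}

/* Sorts the pairs by primitive index and records where each primitive's
   run of rays starts. Returns the number of batches (distinct primitives).
   batchStart must have room for n entries. */
int batch_by_primitive(RayPrim *pairs, int n, int *batchStart)
{
    int batches = 0;
    qsort(pairs, (size_t)n, sizeof *pairs, by_prim);
    for (int i = 0; i < n; ++i)
        if (i == 0 || pairs[i].prim != pairs[i - 1].prim)
            batchStart[batches++] = i;
    return batches;
}
```

After sorting, every ray that hit the same patch sits in one contiguous run, so the patch is tessellated and displaced once and all of its rays are tested against the result.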
VECTOR DISPLACEMENT IN ACTION
[Figure: Base mesh and the result with Vector displacement]
52GB memory if pre-tessellation is used
OPEN SHADING LANGUAGE
• OSL itself has nothing to do with OpenCL
• Many use cases
• Using OSL in OCL renderer
  ‒ Translate OSL to
    ‒ OCL kernel
    ‒ SPIR
  ‒ Feed those to OCL runtime
    ‒ clBuildProgram
    ‒ clCreateKernel
• OSL example

    surface
    matte [[ string description = "Lambertian diffuse material" ]]
    (float Kd = 1 [[float UImin = 0, float UIsoftmax = 1 ]],
     color Cs = 1 [[float UImin = 0, float UImax = 1 ]],
     string texname = "diffuse.tex" [[int texture_slot = 1]] )
    {
        Ci = Kd * Cs * noise(5.0 * P) * diffuse (N);
    }
SPIR
• Standard Portable Intermediate Representation
• Based on LLVM IR (32, 64)
• Useful to ship OpenCL Apps
• Device independent
• OpenCL did not have usable binary code representation
  ‒ Binary for each device x driver
  ‒ Combinations explode
  ‒ Embed kernel as string
    ‒ Load source, clCreateProgramWithSource
    ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES
    ‒ Load binary, clCreateProgramWithBinary
• OpenCL implementation has to support cl_khr_spir extension
  ‒ Works on AMD, Intel (OpenCL 1.2)
  ‒ SPIR 2.0 is coming with OpenCL 2.0
SPIR
CREATE SPIR BINARY
• Offline compiler
  ‒ clang-spir* -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-spir-compile-options "-x spir" -include <opencl_spir.h> -o <output> <input>
  ‒ clBuildProgram with "-x spir -spir-std=CL1.2"
• Use host OpenCL API
  ‒ clCompileProgram + Option
  ‒ clGetProgramInfo + CL_PROGRAM_BINARIES
*https://github.com/KhronosGroup/SPIR