GPGPU Lessons Learned...3 phases to every simulation clock tick Integrate positions and velocities...
GPGPU Lessons Learned
Mark Harris
General-Purpose Computation on GPUs
Highly parallel applications:

- Physically-based simulation
- Image processing
- Scientific computing
- Computer vision
- Computational finance
- Medical imaging
- Bioinformatics

www.gpgpu.org
NVIDIA GPU Pixel Shader GFLOPS

[Chart: observed GPU GFLOPS compared with theoretical peak CPU GFLOPS; the surviving axis labels are 2005 and 2006.]
Physics on GPUs

- GPU: very high data parallelism
  - G70: 24 pixel pipelines, 48 shading processors
  - 1000s of simultaneous threads
  - Very high memory bandwidth
  - SLI enables 1-4 GPUs per system
- Physics: very high data parallelism
  - 1000s of colliding objects
  - 1000s of collisions to resolve every frame

Physics is an ideal match for GPUs
Physically-based Simulation on GPUs
Jens Krüger, TU-Munich
Doug L. James, CMU
Particle Systems
Fluid Simulation
Cloth Simulation
Soft-body Simulation
What about Game Physics?

- Fluids, particles, cloth map naturally to GPUs
  - Highly parallel, independent data
- Game Physics = rigid body physics
  - Collision detection and response
  - Solving constraints
- Rigid body physics is more complicated
  - Arbitrary shapes
  - Arbitrary interactions and dependencies
  - Parallelism is harder to extract
Havok FX

- A framework for Game Physics on the GPU
- Joint NVIDIA / Havok R&D project launched in 2005

For details, come to the talks:

- Havok FX™: GPU-accelerated Physics for PC Games, 4:00-5:00PM Thursday [Need location]
- Physics Simulation on NVIDIA GPUs, 5:30-6:30PM Thursday [Need Location]
Havok FX Demo
NVIDIA DinoBones demo
Lessons learned from Havok FX
Arithmetic Intensity is Key
CPUs and GPUs can get along
Readback ain’t wrong
Vertex Scatter vs. Pixel Gather
Printf debugging for pixel shaders
Arithmetic Intensity is Key

Arithmetic Intensity = Arithmetic / Bandwidth

- GPUs like it high
  - Very little on-chip cache
  - Going to memory and back costs a lot
  - Long programs with much more math than texture fetches
- Game physics has very high arithmetic intensity (AI)
  - More than 1500 pixel shader cycles per collision
  - About 100 texture fetches per collision
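
As a rough illustration, the two figures on this slide can be turned into a single ratio. The sketch below assumes one pixel shader cycle counts as one unit of arithmetic and one texture fetch as one unit of bandwidth, which is a simplification; the function name `arithmeticIntensity` is hypothetical and not from the talk.

```cpp
#include <cassert>

// Arithmetic intensity as the slide defines it: arithmetic divided by bandwidth.
// Assumption for illustration: both are counted per collision, as shader
// cycles and texture fetches respectively.
double arithmeticIntensity(double shaderCycles, double textureFetches) {
    assert(textureFetches > 0.0);  // the ratio is undefined without any fetches
    return shaderCycles / textureFetches;
}
```

With the slide's numbers, `arithmeticIntensity(1500.0, 100.0)` gives 15, i.e. roughly 15 shader cycles of work for every texture fetch.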
Havok GPU Threading Experiment

[Chart: Performance of a pixel shader. x-axis: number of 32-pixel rows shaded (1 to 342); y-axis: time in ms (0 to 1.6).]
Leverage Processor Strengths

- GPUs are good at data parallel computation
- CPUs are good at sequential computation
- Most real problems have a bit of both
  - Luckily most real computers have both processors!
  - Especially game platforms
- Rigid body collision processing is a great example
Rigid Body Dynamics Overview

- 3 phases to every simulation clock tick
  - Integrate positions and velocities
  - Detect collisions
  - Resolve collisions
- Integration is embarrassingly parallel
  - No dependencies between objects: use the GPU
- Detecting collisions is basically scene traversal
  - CPU is good at this: use it
- Resolving collisions is a tricky one
  - Is it parallel enough for the GPU?
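
To make the "no dependencies between objects" point concrete, here is a minimal CPU-side sketch of the integration phase. The slides do not say which integrator Havok FX uses; semi-implicit Euler is assumed here purely for illustration, and the `Vec3`, `Body`, and `integrate` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

struct Body {
    Vec3 position;
    Vec3 velocity;
};

// Advance every body by one time step (semi-implicit Euler, an assumption).
// Each iteration reads and writes only bodies[i], so the iterations are
// independent of one another. That independence is what lets this phase
// map onto one GPU thread (or pixel) per body.
void integrate(std::vector<Body>& bodies, Vec3 gravity, float dt) {
    assert(dt > 0.0f);  // the time step must be positive
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        Body& b = bodies[i];
        b.velocity.x += gravity.x * dt;
        b.velocity.y += gravity.y * dt;
        b.velocity.z += gravity.z * dt;
        b.position.x += b.velocity.x * dt;
        b.position.y += b.velocity.y * dt;
        b.position.z += b.velocity.z * dt;
    }
}
```

Because no iteration touches another body's state, the loop body could be moved into a shader unchanged in structure, with `i` replaced by the pixel's address.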
Is physics a data parallel task?

[Diagram: a contact graph of eight rigid bodies (Body 1 to Body 8) joined by twelve numbered contact links. Contacts and velocities feed a "Solve Collisions" stage, which outputs new velocities.]
Is physics a data parallel task?

[Diagram: a contact graph of eight unnumbered bodies. Contacts and velocities feed a "Solve Collisions" stage, which outputs new velocities.]
Is physics a data parallel task?

[Diagram: inside "Solve Collisions", the contacts are split into Batch 1, Batch 2, and Batch 3. Each batch contains the steps Solve link 1, Solve link 2, ..., Solve link N, and the result is new velocities.]
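
The slide shows the contacts divided into batches but does not say how the batches are built. One plausible reading is that no body appears twice within a batch, so the links in a batch can be solved in parallel while the batches run one after another. The sketch below implements that reading with a simple greedy assignment; it is an assumption for illustration, not Havok FX's actual algorithm, and `Link` and `buildBatches` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// A contact link joins two distinct bodies, identified here by index.
using Link = std::pair<int, int>;

// Greedily place each link in the first batch where neither of its bodies
// already appears. Links within one batch then touch disjoint bodies, so
// they have no data dependencies on each other.
std::vector<std::vector<Link>> buildBatches(const std::vector<Link>& links) {
    std::vector<std::vector<Link>> batches;
    std::vector<std::set<int>> bodiesInBatch;  // bodies already used, per batch
    for (const Link& link : links) {
        assert(link.first != link.second);
        std::size_t b = 0;
        while (b < batches.size() &&
               (bodiesInBatch[b].count(link.first) > 0 ||
                bodiesInBatch[b].count(link.second) > 0)) {
            ++b;
        }
        if (b == batches.size()) {  // no existing batch fits: open a new one
            batches.emplace_back();
            bodiesInBatch.emplace_back();
        }
        batches[b].push_back(link);
        bodiesInBatch[b].insert(link.first);
        bodiesInBatch[b].insert(link.second);
    }
    return batches;
}
```

A greedy pass like this does not guarantee the minimum number of batches, but it is cheap and sequential, which fits the idea of doing the scheduling on the CPU and the solving on the GPU.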
Readback is Not Evil

- Hybrid CPU-GPU solution implies communication
  - Readback and download across PCI-e bus
- It’s not that bad if you use it wisely
  - Minimize transfer size and frequency
  - Use PBO to optimize transfers
- Physics data << computation
  - Read back and download a few bytes per object each frame
  - At most a few MB per frame, < 200 MB/sec
  - PCI-e = 4 GB/sec
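
The bandwidth argument can be checked with simple arithmetic. The sketch below uses 3 MB per frame and 60 frames per second as illustrative inputs; neither figure is stated on the slide, and 4 GB/sec is taken as 4000 MB/sec. The function names are hypothetical.

```cpp
#include <cassert>

// Sustained transfer rate in MB/sec for a given per-frame transfer size.
double transferRateMBPerSec(double megabytesPerFrame, double framesPerSecond) {
    return megabytesPerFrame * framesPerSecond;
}

// Fraction of the bus consumed by that rate, given the bus bandwidth in MB/sec.
double busFraction(double rateMBPerSec, double busMBPerSec) {
    assert(busMBPerSec > 0.0);
    return rateMBPerSec / busMBPerSec;
}
```

Under those assumptions, `transferRateMBPerSec(3.0, 60.0)` is 180 MB/sec, which stays under the slide's 200 MB/sec bound and uses less than 5% of a 4 GB/sec bus.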
Vertex Scatter vs. Pixel Gather

Problem: sparse array update

- You have computed a set of addresses that need to be updated
- Compute updates for only those addresses in an array
Method 1: Pixel Gather

- (Pre)compute an array of compute flags
- Process all pixels in the destination array
- Branch out of computation where flag is zero

[Figure: a grid of compute flags, 30 cells set to 0 and six cells set to 1.]
Method 2: Vertex Scatter

- (Pre)compute addresses of elements to update
- Draw 1-pixel points at those addresses
- Run update shader on points

[Figure: the list of addresses to update: (0,4), (2,3), (4,5), (0,1), (0,2), (4,2).]
Vertex Scatter vs. Pixel Gather

- Not obvious that Vertex Scatter can be a win
  - Drawing single-pixel points is inefficient
  - Because shader pipes process 2x2 “quads” of pixels
- But you can use a simple heuristic
  - Use Vertex Scatter if # of updates is significantly smaller than array size
  - Otherwise use Pixel Gather
- But always experiment in your own application!
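
The difference between the two methods is easier to see in a CPU emulation. The sketch below is not GPU code: it stands in a nested loop for "process all pixels" and a loop over addresses for "draw 1-pixel points". All names are hypothetical, addresses are assumed to be (row, column) pairs, and the 0.25 threshold in `preferVertexScatter` is an arbitrary placeholder for "significantly smaller" that should be tuned by experiment.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<float>>;
using Address = std::pair<int, int>;  // assumed to mean (row, column)

// Stand-in for the update shader; the real computation is application-specific.
float update(float value) { return value + 1.0f; }

// Pixel Gather: visit every cell and branch out where the flag is zero.
// Returns the number of cells visited.
std::size_t pixelGather(Grid& data, const std::vector<std::vector<int>>& flags) {
    assert(flags.size() == data.size());
    std::size_t visited = 0;
    for (std::size_t r = 0; r < data.size(); ++r) {
        for (std::size_t c = 0; c < data[r].size(); ++c) {
            ++visited;
            if (flags[r][c] == 0) continue;  // early out, like the shader branch
            data[r][c] = update(data[r][c]);
        }
    }
    return visited;
}

// Vertex Scatter: visit only the listed addresses, one "point" per address.
// Returns the number of cells visited.
std::size_t vertexScatter(Grid& data, const std::vector<Address>& addresses) {
    for (const Address& a : addresses) {
        data[a.first][a.second] = update(data[a.first][a.second]);
    }
    return addresses.size();
}

// The slide's heuristic, with a placeholder threshold for "significantly smaller".
bool preferVertexScatter(std::size_t updates, std::size_t arraySize,
                         double threshold = 0.25) {
    return static_cast<double>(updates) < threshold * static_cast<double>(arraySize);
}
```

Both functions produce the same array contents; they differ only in how many cells they touch. The visit counts do not capture the 2x2 quad inefficiency of single-pixel points, which is why measuring in the real application still matters.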
Printf for Pixels

- Debugging pixel shaders is hard
  - Especially GPGPU shaders: the output is not an image
- Even harder if you’ve used all of your outputs
  - Havok FX easily uses up 4 float4 MRT outputs
- A simple hack to dump data from your shaders
  - A macro to dump arbitrary shader variables
  - A wrapper function to run the program once for each “printf”
  - And run it once more with correct outputs
Printf for Pixels
First, define a handy macro to put in your shaders

    #ifdef DEBUG_SHADER
    #define CG_PRINTF(index, variable) \
        if (debugSelector == index) \
            DEBUG_OUT = variable;
    #else
    #define CG_PRINTF(index, variable)
    #endif
Printf for Pixels

Use the macro to instrument your shader

    float4 foo( float2 coords, // other params
    #ifdef DEBUG_SHADER
                uniform float debugSelector
    #endif
              ) : COLOR0 {
    #ifdef DEBUG_SHADER
        float4 DEBUG_OUT;
    #endif
        float4 temp1 = complexCalc1();
        float4 temp2 = complexCalc2(temp1);
        float4 ret = complexCalc3(temp2);

        CG_PRINTF(1, temp1);
        CG_PRINTF(2, temp2);

    #ifdef DEBUG_SHADER
        CG_PRINTF(0, ret);   // selector 0 selects the normal result
        return DEBUG_OUT;
    #else
        return ret;
    #endif
    }
Printf for Pixels: C++ code
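
    // Runs prog once per debug selector, reading each result back into
    // debugData[selector], then runs it once more with selector 0 so the
    // normal outputs are produced. The loop stops at the first null entry
    // in debugData. runProgram is the application's own draw helper.
    void debugProgram(CGprogram prog, int x, int y, int w, int h, float** debugData) {
        CGparameter psel = cgGetNamedParameter(prog, "debugSelector");
        for (int selector = 1; selector < 100; ++selector) {
            if (debugData[selector] == 0) break;

            // run program with debug selector
            cgGLSetParameter1f(psel, (float)selector);
            runProgram(prog, x, y, w, h);

            // read back results
            glReadPixels(x, y, w, h, GL_RGBA, GL_FLOAT, debugData[selector]);
        }

        // run program as normal
        cgGLSetParameter1f(psel, 0.0f);
        runProgram(prog, x, y, w, h);
    }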
Questions?

Havok FX presentations at GDC 2006:

- Havok FX™: GPU-accelerated Physics for PC Games, 4:00-5:00PM Thursday [Need location]
- Physics Simulation on NVIDIA GPUs, 5:30-6:30PM Thursday [Need Location]