Aim

Advanced Procedural Rendering in DirectX 11

Matt SwobodaPrincipal Engineer, SCEE R&D PhyreEngine TeamDemo Coder, Fairlight

AimMore dynamic game worlds.

Demoscene?I make demosLike games, crossed with music videosLinear, non-interactive, scriptedAll generated in real-time On consumer-level PC hardwareUsually effect-driven & dynamicRelatively light on static artist-built dataOften heavy on procedural & generative content

DirectX 11?DirectX 9 is very oldWe are all very comfortable with it.... But does not map well to modern graphics hardwareDirectX 11 lets you use same hardware smarterCompute shaders Much improved shading languageGPU-dispatched draw calls.. And much more

Procedural Mesh GenerationA reasonable result from random formulae(Hopefully a good result from sensible formulae)

Signed Distance Fields (SDFs)Distance function: Returns the closest distance to the surface from a given pointSigned distance function: Returns the closest distance from a point to the surface, positive if the point is outside the shape and negative if inside

Signed Distance FieldsUseful tool for procedural geometry creationEasy to define in code .... Reasonable results from random formulaeCan create from meshes, particles, fluids, voxelsCSG, distortion, repeats, transforms all easyNo concerns with geometric topology Just define the field in space, polygonize later

A BoxBox(pos, size){ a = abs(pos-size) - size; return max(a.x,a.y,a.z); }*Danger: may not actually compile

Cutting with Booleans

d = Box(pos)c = fmod(pos * A, B)subD = max(c.y,min(c.y,c.z))d = max(d, -subD)

More Booleans

d = Box(pos)c = fmod(pos * A, B)subD = max(c.y,min(c.y,c.z))subD = min(subD,cylinder(c))subD = max(subD, Windows())d = max(d, -subD)

Repeated Booleans

d = Box(pos)e = fmod(pos + N, M)floorD = Box(e)d = max(d, -floorD)

Cutting Holes

d = Box(pos)e = fmod(pos + N, M)floorD = Box(e)floorD = min(floorD,holes())d = max(d, -floorD)

Combined Result

d = Box(pos)c = fmod(pos * A, B)subD = max(c.y,min(c.y,c.z))subD = min(subD,cylinder(c))subD = max(subD, Windows())e = fmod(pos + N, M)floorD = Box(e)floorD = min(floorD,holes())d = max(d, -subD)d = max(d, -floorD)

Repeating the Spacepos.y = frac(pos.y)d = Box(pos)c = fmod(pos * A, B)subD = max(c.y,min(c.y,c.z))subD = min(subD,cylinder(c))subD = max(subD, Windows())e = fmod(pos + N, M)floorD = Box(e)floorD = min(floorD,holes())d = max(d, -subD)d = max(d, -floorD)

Repeating the Spacepos.xy = frac(pos.xy)d = Box(pos)c = fmod(pos * A, B)subD = max(c.y,min(c.y,c.z))subD = min(subD,cylinder(c))subD = max(subD, Windows())e = fmod(pos + N, M)floorD = Box(e)floorD = min(floorD,holes())d = max(d, -subD)d = max(d, -floorD)

DetailsAddDetails()

DetailsDoLighting()ToneMap()

DetailsAddDeferredTexture()AddGodRays()

DetailsMoveCamera()MakeLookGood()

Ship It.

Procedural SDFs in PracticeGenerated scenes probably wont replace 3D artists

Procedural SDFs in PracticeGenerated scenes probably wont replace 3D artistsGenerated SDFs good proxies for real meshesCode to combine a few primitives cheaper than art dataCombine with artist-built meshes converted to SDFsBoolean, modify, cut, distort procedurally

Video(Video Removed)(Its a cube morphing into a mesh. You know, just for fun etc.)

SDFs From Triangle Meshes

SDFs from Triangle MeshesConvert triangle mesh to SDF in 3D texture32^3 256^3 volume texture typicalSDFs interpolate well.. bicubic interpolation.. Low resolution 3D textures still work wellAgnostic to poly count (except for processing time)Can often be done offline

SDFs from Triangle MeshesA mesh converted to a 64x64x64 SDF and polygonised. Its two people doing yoga, by the way.

SDFs from Triangle MeshesNave approach?Compute distance from every cell to every triangleVery slow but accurateVoxelize mesh to grid, then sweep? UGLYSweep to compute signed distance from voxels to cellsVoxelization too inaccurate near surface....But near-surface distance is important - interpolationCombine accurate triangle distance and sweep

Geometry StagesBind 3D texture targetVS transforms to SDF spaceGeometry shader replicates triangle to affected slicesFlatten triangle to 2DOutput positions as TEXCOORDs.... All 3 positions for each vertex

Pixel Shader StageCalculates distance from 3D pixel to triangle Compute closest position on triangleEvaluate vertex normal using barycentricEvaluate distance sign using weighted normal Write signed distance to output color, distance to depthDepth test keeps closest distance

Post Processing StepCells around mesh surface now contain accurate signed distanceRest of grid is emptyFill out rest of the grid in post process CSFast Sweeping algorithm

Fast Sweepingd = maxPossibleDistancefor i = 0 to row lengthd += cellSizeif(abs(cell[i]) > abs(d))cell[i] = delsed = cell[i]Requires ability to read and write same bufferOne thread per rowThread R/W doesnt overlapNo interlock neededSweep forwards then backwards on same axisSweep each axis in turn

SDFs from Particle Systems

SDFs From Particle SystemsNave: treat each particle as a sphereCompute min distance from point to particlesBetter: use metaball blobby equationDensity(P) = Sum[ (1 (r2/R2))3 ] for all particles R : radius thresholdr : distance from particle to point PProblem: checking all particles per cell

Evaluating Particles Per CellBucket sort particles into grid cells in CSEvaluate a kernel around each cellSum potentials from particles in neighbouring cells9x9x9 kernel typical (729 cells, containing multiple particles per cell, evaluated for ~2 million grid cells)Gives accurate result .. glacially> 200ms on Geforce 570

Evaluating Particles, FastRender single points into gridWrite out particle position with additive blendSum particle count in alpha channelPost process gridDivide by count: get average position of particles in cellEvaluate potentials with kernel - grid cells onlyUse grid cell average position as proxy for particles

Evaluating Particles, FasterEvaluating potentials accurately far too slowSumming e.g. 9x9x9 cell potentials for each cell..Still > 100 ms for our test casesUse separable blur to spread potentials insteadNot quite 100% accurate.. But close enoughCalculate blur weights with potential function to at least feign correctnessHugely faster - < 2 ms

Visualising Distance FieldsRay Tracing & Polygonisation

Ray CastingSee: ray marching; sphere tracing

SDF(P) = Distance to closest point on surface (Closest points actual location not known)Step along ray by SDF(P) until SDF(P)~0Skips empty space!

Ray CastingAccuracy depends on iteration countPrimary rays require high accuracy50-100 iterations -> slowResult is transitory, view dependentUseful for secondary raysCan get away with fewer iterationsDo something else for primary hits

Polygonisation / Meshing Generate triangle mesh from SDFRasterise as for any other meshSuits 3D hardwareIntegrate with existing render pipelineReuse mesh between passes / framesSpeed not dependent on screen resolutionUse Marching Cubes

Marching Cubes In One SlideOperates on a discrete gridEvaluate field F() at 8 corners of each cubic cellGenerate sign flag per corner, OR togetherWhere sign(F) changes across corners, triangles are generated5 per cell maxLookup table defines triangle pattern

Marching Cubes IssuesLarge number of grid cells128x128x128 = 2 million cellsOnly process whole grid when necessaryTriangle count varies hugely by field contentsCan change radically every frameUpper bound very large: -> size of gridMost cells empty: actual output count relatively smallTraditionally implemented on CPU

Geometry Shader Marching CubesCPU submits a large, empty draw callOne point primitive per grid cell (i.e. a lot)VS minimal: convert SV_VertexId to cell positionGS evaluates marching cubes for cellOutputs 0 to 5 triangles per cellFar too slow: 10ms - 150ms (128^3 grid, architecture-dependent)Work per GS instance varies greatly: poor parallelismSome GPU architectures handle GS very badly

Stream Compaction on GPU

Stream CompactionTake a sparsely populated array

Push all the filled elements together Remember count & offset mappingNow only have to process filled part of array

Stream CompactionCounting pass - parallel reductionIteratively halve array size (like mip chain)Write out the sum of the count of parent cellsUntil final step reached: 1 cell, the total count Offset pass - iterative walk back upCell offset = parent position + sibling positionsHistopyramids: stream compaction in 3D

HistopyramidsSum down mip chain in blocks

(Imagine it in 3D)

HistopyramidsCount up from base to calculate offsets

Histopyramids In UseFill grid volume texture with active mask0 for empty, 1 for activeGenerate counts in mip chain downwardsUse 2nd volume texture for cell locationsWalk up the mip chain

Compaction In ActionUse histopyramid to compact active cellsActive cell count now known tooGPU dispatches drawcall only for # active cellsUse DrawInstancesIndirectGS determines grid position from cell indexUse histopyramid for thisGenerate marching cubes for cell in GS

Compaction ReactionHuge improvement over brute force~5 ms down from 11 msGreatly improves parallelismReduced draw call sizeGeometry still generated in GSRuns again for each render passNo indexing / vertex reuse

Geometry Generation

Generating GeometryWish to pre-generate geometry (no GS)Reuse geometry between passes; allow indexed vertices First generate active vertices Intersection of grid edges with 0 potential contourRemember vertex index per grid edge in lookup tableVertex count & locations still vary by potential field contents Then generate indices Make use of the vertex index lookup

Generating Vertex DataProcess potential grid in CSOne cell per threadFind active edges in each cellOutput vertices per cellIncrementCounter() on vertex bufferReturns current num vertices writtenWrite vertex to end of buffer at current counterWrite counter to edge index lookup: scattered writeOr use 2nd histopyramid for vertex data instead

Generating GeometryNow generate index data with another CSHistopyramid as before.. .. But use edge index grid lookup to locate indicesDispatchIndirect to limit dispatch to # active cellsRender geom: DrawIndexedInstancedIndirectGPU draw call: index count copied from histopyramidNo GS required! Generation can take just 2ms

Meshing Improvements

Smoothing

Smoothing More

SmoothingLaplacian smoothAverage vertices along edge connectionsKey for improving quality of fluid dynamics meshingMust know vertex edge connectionsGenerate from index buffer in post process

Bucket Sorting ArraysNeed to bucket elements of an array?E.g. Spatial hash; particles per grid cell; triangles connected to each vertex Each bucket has varying # elementsDont want to over-allocate bucketsAllocate only # elements in array

Counting SortUse Counting SortCounting pass count # elements per bucketUse atomics for parallel op InterlockedAdd()Compute Parallel Prefix Sum Like a 1d histopyramid.. See CUDA SDK Finds offset for each bucket in element arrayThen assign elements to bucketsReuse counter buffer to track idx in bucket

Smoothing ProcessUse Counting Sort: bucket triangles per vertexPost-process: determine edges per vertex Smooth vertices(Original Vertex * 4 + Sum[Connected Vertices]) / (4 + Connected Vertex Count)Iterate smooth process to increase smoothness

0 Smoothing Iterations

Subdivision, Smooth NormalsUse existing vertex connectivity dataSubdivision: split edges, rebuild indices1 new vertex per edge4 new triangles replace 1 old triangleCalc smooth vertex normals from final meshUse vertex / triangle connectivity dataAverage triangle face normals per vertexVery fast minimal overhead on total generation cost

PerformanceSame scene, 128^3 grid, Geforce 570Brute force GS version: 11 ms per passNo reuse shadowmap passes add 11ms each Generating geometry in CS: 2 ms + 0.4 ms per pass2ms to generate geometry in CS; 0.4ms to render itGenerated geometry reused between shadow passes

Video(Video Removed)(A tidal wave thing through a city. It was well cool!!!!!1)

Wait, Was That Fluid Dynamics?Yes, It Was.

Smoothed Particle HydrodynamicsSolver works on particlesParticles represent point samples of fluid in spaceLocate local neighbours for each particleFind all particles inside a particles smoothing radiusNeighbourhood search can be expensiveSolve fluid forces between particles within radiusWe use Compute for most of this

Neighbourhood SearchSpatially bucket particles using spatial hashReturn of Counting Sort - with a histopyramidIn this case: hash is quantised 3D positionBucket particles into hashed cells

SPH Process Step by StepBucket particles into cells Evaluate all particles..Find particle neighbours from cell structureMust check all nearby cells inside search radius tooSum forces on particles from all neighboursSimple equation based on distance and velocitiesReturn new acceleration

SPH PerformancePerformance depends on # neighbours evaluatedDetermined by cell granularity, particle search radius, number of particles in system, area covered by systemFavour small cell granularityEasier to reduce # particles tested at cell levelBalance particle radius by handSmoothness vs performance

SPH PerformanceIn practice this is still far, far too slow (>200ms)Can check > 100 cells, too many particle interactionsSo we cheat..Average particle positions + velocities in each cellUse average value for particles vs distant cellsForce vectors produced close enough to real values..Only use real particle positions for close cells

IlluminationThe Rendering Pipeline of the Future

Rendering Pipeline of the FuturePrimary rays are rasterisedFast: rasterisation still faster for typical game meshesUse for camera / GBuffers, shadow maps Secondary rays are traced Use GBuffers to get starting pointGlobal illumination / ambient occlusion, reflectionsPaths are complex bounce, scatter, divergeNeeds full scene knowledge hard for rasterisationTend to need less accuracy / sharpness..

Ambient Occlusion Ray TracingCast many random rays out from surface Monte-Carlo styleAO result = % of rays that reach skySlow.. Poor ray coherenceLots of rays per pixel needed for good resultSome fakes available SSAO & variants largely horrible..

Ambient Occlusion with SDFsRaytrace SDFs to calculate AOAccuracy less important (than primary rays)Less SDF iterations < 20, not 50-100Limit ray length We dont really ray cast..Just sample multiple points along ray Ray result is a function of SDF distance at points

4 Rays Per Pixel

16 Rays Per Pixel

64 Rays Per Pixel

256 Rays Per Pixel

Ambient Occlusion Ray TracingGood performance: 4 rays; Quality: 64 raysTry to plug quality/performance gapCould bilateral filter / blur Few samples, smooth results spatially (then add noise)Or use temporal reprojectionFew samples, refine results temporallyRandomise rays differently every frame

Temporal ReprojectionKeep previous frames dataPrevious result buffer, normals/depths, view matrixReproject current frame previous frameCurrent view position * view inverse * previous viewSample previous frames result, blend with currentReject sample if normals/depths differ too muchProblem: rejected samples / holes

Video(Video Removed)(Basically it looks noisy, then temporally refines, then when the camera moves you see holes)

Temporal Reprojection: Good

Temporal Reprojection: Holes

Hole FillingReprojection works if you can fill holes nicelyEasy to fill holes for AO: just cast more raysCast 16 rays for pixels in holes, 1 for the restAdversely affects performanceWork between local pixels differs greatlyCS thread groups wait on longest threadSome threads take 16x longer than others to complete

Video(Video Removed)(It looks all good cos the holes are filled)

Rays Per Thread

Hole FillingSolution: balance rays across threads in CS16x16 pixel tiles: 256 threads in groupCompute & sum up required rays in tile1 pixel per thread1 for reprojected pixels; 16 for hole pixelsSpread ray evaluation across cores evenlyN rays per thread

Rays Per Thread - Tiles

Video(Video Removed)(It still looks all good cos the holes are filled, by way of proof Im not lying about the technique)

Performance16 rays per pixel: 30 ms1 ray per pixel, reproject: 2 ms1 + 16 in holes, reproject: 12 ms 1 + 16 rays, load balanced tiles: 4 ms~ 2 rays per thread typical!

Looking Forward

Looking ForwardMultiple representations of same worldGeometry + SDFsRasterise themTrace themCollide with them World can be more dynamic.

http://directtovideo.wordpress.com

ThanksJani Isoranta, Kenny Magnusson for 3DAngeldawn for the Fairlight logoJussi Laakonen, Chris Butcher for actually making this talk happenSCEE R&D for allowing this to happenGuillaume Werle, Steve Tovey, Rich Forster, Angelo Pesce, Dominik Ries for slide reviews

ReferencesHigh-speed Marching Cubes using Histogram Pyramids; Dyken, Ziegler et al.Sphere Tracing: a geometric method for the antialiased ray tracing of implicit surfaces; John C. HartRendering Worlds With Two Triangles; Inigo QuilezlesFast approximations for global illumination on dynamic scenes; Alex Evans

http://directtovideo.wordpress.com*DirectX 9 is 9 years oldComfortable, well-understood API on PC & consoleAPI itself largely an incremental update to DirectX 8 released in 2000Ingrained in our codebases (and minds)

Shader language now has integers, atomics. Can now write to buffers by side effect in pixel shaders and a wealth of other new things besides.*The separation of generation of the space and the topology is key in achieving flexibility topology is evil.*CSG subtract operator just max with negative distance *Repeat is just frac (or fmod) on the space.

Always transform the space, not the shape. The same goes for twists, bends, distorts, transforms. *Textures are generated in realtime too positions and normals are read from Gbuffers.The texture generation isnt dissimilar to the distance field generation, but separating them makes it easier to get performance (the detail is done in the texture so it doesnt weigh down the distance field shader).*Dont expect everyone to fire all artists and generate worlds in code ***Bicubic interpolation theres a good gems articile and some nice shader code around which can do bicubic 3d texture interpolation with 8 taps (all linear filtered though) from a 3d texture. Pretty good actually!**Geometry shaders suck, though. Will move this to compute-side expansion at some point.*Can evaluate color if necessary by sampling texture

OK I lied. We dont use a 3d texture target because depth stencil doesnt work. We actually use a 2d texture array so we can have a matching depthstencil.*The problem comes when rays just miss the surface they have to march a lot of iterations to get past the surface because SDF(P) is very small near to the surface

Because the result is transitory it cant be cached between frames or passes it has to be regenerated continuously.**Out of patent since 2005!Wikipedia**Look, GS is really shit on some architectures. On older cards like my geforce 9800 in a work machine, the performance was tragic (the 150ms figure quoted much better on the same card with histopyramids). It seems better on more modern cards and on one vendor than another, though.*We use stream compaction for many things .. Like particle depth sorting, SPH, triangles in grid cell, you name it. *Histopyramids thanks to Gernot Ziegler and Chris Dyken. *Offset = parent offset in mip chain + cells sibling counts (that come before it in sequence)*Append buffers vs counter buffers: append buffers do not return the count**your mileage may vary a lot

Note that its possible to use an appendbuffer instead of a histopyramid for generating indices here; performance vs histopyramid depends on architecture. Architecture-dependent is something that comes up a lot. Sucks, dont it?*0 iterations, 1 iteration*4 iterations, 16 iterations*We also use this algorithm for depth sorting particles.*Counting sort is the first pass of radix sort.

Atomics:Ive found that performance of atomics depends on contention and differs greatly between hardware vendors. Hard to judge, hard to predict; definitely not free.*Why bucket triangles not connected vertices?Because it causes less writes!

I also used a linked list using a counter buffer and IncrementCounter(). This was much less successful because the linked list meant that the indices per vertex were scattered all around memory it was much slower to walk the list. Changing from a linked list to buckets gave approximately 10x speedup on some hardware.*GS approach scales linearly with grid size : 64^3 grid is 1.4msThe CS approach is without smoothing, subdivision or normal calculation to give a fair comparison (as the GS one doesnt do it). *The particle behaviours I describe here are interesting because they demonstrate uses of spatial databases and interaction between particles and the world, and particles with other particles.*Particles are easier to collide with environment than grid based systems.*First, pre-determine bucket sizesThen compact bucket buffer Finally, put particles into correct bucketsWe use a histopyramid for the compaction. We actually render particles into a grid using vertex,geometry and pixel shader and use a UAV to write out the counts its faster than using CS for the lot because blending beats atomics here, although thats a case by case thing (or so Ive found) so you need to check both ways. **2x cell kernel radius = 5x5x5 cells = 125 cells!Using averaged values for distant cells works because the force direction produced doesnt change much between the particles in the cell against the particle, and because the weight is low (1/d^2 falloff).This is something like what they do to solve N-body problems in astrophysics. 2 kernels the inner is high quality and slow but small, the outer is large and approximated and fast.*Use mip levels to reduce cost of texture sampling at distance*About 7-8ms on a Geforce 570. We cast nice rays with a lot of taps and no mip filtering bias tricks. You could use a lot less taps and sacrifice some quality, but increase speed accordingly. 8 taps is pretty good and itll complete in more like 3ms.*About 30ms on a Geforce 570 we cast nice rays

*About 60ms on a Geforce 570

*About 3 dog years on a Geforce 570

*In reality what this means is, for every pixel in every tile if the tile contains a white pixel, it will take the time of casting 16 rays for a pixel FOR THE WHOLE TILE.i.e. its slow.*To explain via code, the Compute Shader does this:

Process 1 pixel: read pixel, reproject, determine if in hole, add required rays to list in local memorySyncFor(I = threadIndex; I < numRays; I += NumThreads) { Process ray; }SyncOutput lit pixel*The rays are balanced across the tiles, so theres no harsh peaks.*Without balancing the 1+16 is only about twice as fast as doing 16 rays per pixel; with balancing its much faster.*

Aim

Documents

Transcript of Aim