Practical Occlusion Culling on PS3

Practical occlusion culling for PS3
Will Vale
Second Intention Limited

Let's get started: lots of slides to get through.

CELLPHONES!!

Who is this guy?
- Freelancer with graphics tech bias
- Working with Guerrilla since 2005
- Talking about occlusion culling in KILLZONE 3

Takeaway
- Background: Why do occlusion culling?
- The SPU runtime (rendering and testing)
- Creating workable occluders
- Useful debugging tools
- Problems and performance
- Results, and thoughts on where to go next

I want to spend a fair bit of time talking about the runtime, since that's where a lot of the interesting bits are. I also want to cover the way we produce the asset data for occlusion, and how we show the workings of the system for profiling and debugging. I'll finish off with a summary of the pros and cons.

Why do occlusion culling?

Drawing these guys is a waste of time

We want to draw as much stuff as possible.

If we spend time drawing anything invisible, we waste time on the RSX and in our pipeline.


Killzone 2: Starting point
- Scene geometry in a Kd-Tree
- Culled using zones, portals and blockers
- Problems:
  - Lots of artist time to place and tweak
  - Entirely static
  - Geometric complexity
  - Lots of tests, fiddly code
  - Too much time: around 10-30% of one SPU (serial)
  - Can't feed RSX until it's done

Zones are connected with portals; blockers can occlude portal frustums.

Corinth River: KILLZONE 2


Killzone 2: Rendering pipeline
[Diagram: job timeline. PPU: other work, prepare to draw, kick, other work. One SPU: other work, scene query, build main display list, other rendering. Four more SPUs: other work, other rendering.]

Here's what we had at the end of Killzone 2.

The rendering pipeline is largely a chain of jobs on SPU, calling out to other jobs during display list generation. After setting up the inputs and kicking, the PPU has minimal involvement.

NB: Bar lengths aren't representative of anything, and obviously the jobs run on any SPU in real life.

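Killzone 2: Rendering pipeline
[Diagram: data flow. The PPU prepares to draw and kicks. On the SPU, the scene query reads the scene database (Kd-tree, zones, portals) from main memory and writes a query result (objects, parts, lights), which feeds the main display list build.]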


Frozen Shores: KILLZONE 3


Occluded geometry

Killzone 3: Art goals
- Increase scene complexity
  - Larger, more open environments
  - With more stuff in them
- Simplify content pipeline
  - Don't waste artist time on things which aren't pretty
  - Don't require artist tweaks, but allow them
- 80% solution
  - Want it to just work 80% of the time

Killzone 3: Tech goals
- Don't increase RSX load
  - Never enough GPU time that we can waste it
- Fully conservative solution
  - No popping when you go around corners
- Drop into pipeline without restructuring
  - Reduce risk
  - Allow swapping between implementations at runtime

We also wanted to reduce SPU time spent on the critical path, which includes the scene query.

The idea
- Some spare memory
- Some spare SPU time
- Best guess: create and test a depth buffer on SPUs
  - Decouples tests and occluders
  - Rendering linear in number of occluders
  - Testing linear in number of objects
  - Plays to SPU strengths
  - Culls early

Rasterisation is pretty SPU-friendly: easy to parallelise, probably compute bound.

As we'll see later, an all-SPU approach also makes it relatively easy to slot the new code into the pipeline.

The plan
- Create occluder geometry offline
- Each frame, SPUs render occluders to 720p depth buffer
  - Split buffer into 16 pixel high slices for rasterisation
- Down-sample buffer to 80x45 (16x16 max filter)
- Test bounding boxes against this during scene traversal
  - Accurate: Rasterisation + depth test
  - Coarse: Some kind of constant-time point test

So this is where we got to before we started writing code. We aimed to spend a month or so getting the rasterisation up and running, since we expected that to be the most limiting factor.

The reason we wanted to use a high-resolution depth buffer is to prevent conservatism from messing up occluder fusion: if we draw things in fine detail, we don't need to worry about conservative rasterisation, since any errors will only be one pixel wide.
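
To make the down-sample step of the plan concrete, here is a minimal sketch of a 16x16 max filter plus a simple coarse test. Everything named here (`DownsampleMax`, `IsOccludedCoarse`, `kTile`) is hypothetical, and it assumes depth values in [0, 1] where larger means farther away and buffer dimensions that are multiples of 16. The coarse test loops over the tiles a box covers; it is a stand-in, not the constant-time point test mentioned in the plan.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int kTile = 16;  // 16x16 max filter, as in the plan

// Reduce a width x height depth buffer to (width/16) x (height/16).
// Assumption: depth in [0, 1], larger = farther; dimensions are multiples of 16.
// Keeping the farthest depth per tile is the conservative choice for occluders.
std::vector<float> DownsampleMax(const std::vector<float>& depth, int width, int height) {
    const int outW = width / kTile;
    const int outH = height / kTile;
    std::vector<float> out(static_cast<std::size_t>(outW) * outH, 0.0f);
    for (int ty = 0; ty < outH; ++ty) {
        for (int tx = 0; tx < outW; ++tx) {
            float farthest = 0.0f;
            for (int y = 0; y < kTile; ++y) {
                for (int x = 0; x < kTile; ++x) {
                    const std::size_t idx =
                        static_cast<std::size_t>(ty * kTile + y) * width + (tx * kTile + x);
                    farthest = std::max(farthest, depth[idx]);
                }
            }
            out[static_cast<std::size_t>(ty) * outW + tx] = farthest;
        }
    }
    return out;
}

// Hypothetical coarse test: a box is hidden only if its nearest depth lies
// behind the farthest occluder depth in every coarse tile it covers.
bool IsOccludedCoarse(const std::vector<float>& coarse, int coarseW,
                      int tx0, int ty0, int tx1, int ty1, float boxNearestDepth) {
    for (int ty = ty0; ty <= ty1; ++ty) {
        for (int tx = tx0; tx <= tx1; ++tx) {
            if (boxNearestDepth <= coarse[static_cast<std::size_t>(ty) * coarseW + tx]) {
                return false;  // box might be in front of the occluders here
            }
        }
    }
    return true;
}
```

With a 1280x720 input this yields the 80x45 buffer from the plan. A tile containing even one far (uncovered) pixel keeps the far value, so boxes overlapping it are never wrongly culled.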

SPU runtime

Now to get into what we actually built. I'm going to cover rasterisation first, then the tests, since that's the order we worked in.

Killzone 3: Modified pipeline
[Diagram: job timeline. PPU: other work. One SPU: scene query, setup, rasterise, filter, occluded query, build main display list. Four more SPUs: other work, with rasterise and occluded query jobs interleaved with other rendering.]

Recall that for Killzone 2 we had a single query job.

For Killzone 3, we replace the query job with a more complex chain. The PPU side and the render back end (in blue) stay much the same.

Killzone 3: Modified pipeline
[Diagram: data flow. SPU job chain: scene query, setup, rasterise, filter, occluded query, build main display list. Main memory holds the scene database (Kd-tree), the occluder meshes, a staging area (projected occluder triangles), the occlusion buffer (depth data), and the query results (objects, parts, lights).]


Main memory staging layout

Block              Size      Count  Total
Global triangles   48 bytes  4096   192KB
DMA list entries   8 bytes   23K    184KB
Job commands                        3KB
Grand total                         379KB

NB:
- We originally rasterised at 720p
- Ended up shipping with 640x360 (see later)
- Memory and performance figures are for this option

It's worth having a quick look at what we write to the scratch area of main memory.

The setup job writes triangle data, but because triangles often span multiple strips we only store the triangle once (48 bytes) and then store a DMA list entry (8 bytes) for each strip the triangle touches.

We also have space for a couple of KB of job commands in the scratch area, for when the setup job launches the rasterise jobs.
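
As a rough check on what storing each triangle once buys, the arithmetic can be sketched as below. The function names are hypothetical, and the comparison assumes that "23K" means 23 x 1024 entries and that the staging area is completely full, neither of which the slides confirm.

```cpp
#include <cstddef>

constexpr std::size_t kTriangleBytes = 48;  // one stored triangle (from the slide)
constexpr std::size_t kDmaEntryBytes = 8;   // one DMA list entry (from the slide)

// Bytes used when each triangle is stored once and every strip it
// touches gets an 8-byte DMA list entry pointing at it.
constexpr std::size_t SharedBytes(std::size_t triangles, std::size_t stripRefs) {
    return triangles * kTriangleBytes + stripRefs * kDmaEntryBytes;
}

// Bytes used if a full 48-byte copy were written for every strip instead.
constexpr std::size_t DuplicatedBytes(std::size_t stripRefs) {
    return stripRefs * kTriangleBytes;
}
```

Under those assumptions, 4096 triangles and 23 x 1024 strip references cost 376KB shared (192KB + 184KB, matching the table before job commands), against 1104KB if every reference carried its own copy.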

Occluder query job
- Finds occluders in the (truncated) view frustum
- Occluders are normal rendering primitives
  - Live with the rest of a drawable object, identified by flag bit

[Diagram: the query job walks the Kd-tree, extracts occluder parts, sorts occluders by size and outputs a list. It reads the Kd-tree and mesh data from main memory and writes the query result back.]

The occluder query is much the same as the Killzone 2 query, without the portals and zones.

The main change is that we sort the occluders (there aren't too many, so this is easy) so that if we have to discard any, they aren't likely to be important.
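
The sort-then-discard idea can be sketched in a few lines. The `Occluder` struct, its `size` field and the `budget` parameter are all hypothetical; the slides do not say how size is measured or where the limit comes from.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical occluder record; 'size' stands in for whatever measure
// the query uses (for example a bounding volume or projected area).
struct Occluder {
    int id;
    float size;
};

// Sort largest first, then drop anything beyond the budget, so that any
// occluders we have to discard are the smallest ones.
std::vector<Occluder> SelectOccluders(std::vector<Occluder> found, std::size_t budget) {
    std::sort(found.begin(), found.end(),
              [](const Occluder& a, const Occluder& b) { return a.size > b.size; });
    if (found.size() > budget) {
        found.resize(budget);
    }
    return found;
}
```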

Occluder setup job
- Decodes RSX-style vertex and index arrays
- Outputs clipped + projected triangles to staging area
- Internal pipeline to hide DMA latency

[Diagram: setup job pipeline stages shown: load part (local copy), load address (local copy), load arrays (local copy), prime indices (1K cache), prime vertices (2K cache), process (global write caches). Inputs from main memory: query result, engine structs, array headers, index array, vertex array. Output: staging area. Workload of the early stages: resolve indirection in engine data.]


Setup: LS memory layout

Block                              Size   Count  Total
Triangle write cache               6KB    1      6KB
DMA list write cache               1.5KB  23     34.5KB
Index cache                        1KB    6      6KB
Vertex data cache                  2KB    6      12KB
Post-transform cache               ~600b  6      ~3.5KB
Smaller data, alignment slop etc.                11KB
Total data                                       73KB
Stack                                            8KB
Code                                             40KB
Grand total                                      105KB

In future we'll probably increase some cache sizes and optimise this job further, but it didn't become a bottleneck until we'd worked on speeding up a lot of other areas. The split sizes show different configurations for full-720p vs. half-720p depth buffers.

Setup: Load data
- First three stages load small (bytes rather than KB) engine structs
- Index data streamed through 1K cache
  - First read pipelined, later reads block
  - Occluders diced so they usually fit in one go
- Vertex data streamed through 2K cache
  - First read pipelined again
  - 90% hit rate
- 32-entry post-transform cache
  - Direct mapped, not a FIFO
  - 60% hit rate

Pipeline data for these first stages is pretty small and we can store it easily in LS.

Index data is usually not a problem since we slice up the primitives offline.

Vertex data can be pretty large, so we read that through a very simple cache (a window on the vertex data in main memory).

Both caches use blocking reads when they reload, but the first read is pipelined; this is usually good enough.

We also have a small post-transform cache which saves a surprising amount of work.
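
For readers who haven't met a direct-mapped post-transform cache, a minimal sketch follows. The slides only say it has 32 entries and is direct mapped rather than a FIFO; the slot function (vertex index modulo 32), the `Vec4` type and the class name are assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

constexpr std::uint32_t kCacheEntries = 32;         // 32 entries, per the slide
constexpr std::uint32_t kInvalidTag = 0xFFFFFFFFu;  // assumed "empty slot" marker

struct Vec4 { float x, y, z, w; };  // hypothetical transformed-vertex type

class PostTransformCache {
public:
    PostTransformCache() : values_() { tags_.fill(kInvalidTag); }

    // Returns true and copies the cached vertex out on a hit.
    bool Lookup(std::uint32_t vertexIndex, Vec4& out) const {
        const std::uint32_t slot = vertexIndex % kCacheEntries;
        if (tags_[slot] != vertexIndex) {
            return false;
        }
        out = values_[slot];
        return true;
    }

    // Each index maps to exactly one slot, so storing simply overwrites
    // whatever was there; no age tracking is needed, unlike a FIFO.
    void Store(std::uint32_t vertexIndex, const Vec4& v) {
        const std::uint32_t slot = vertexIndex % kCacheEntries;
        tags_[slot] = vertexIndex;
        values_[slot] = v;
    }

private:
    std::array<std::uint32_t, kCacheEntries> tags_;
    std::array<Vec4, kCacheEntries> values_;
};
```

The appeal of this layout is that a lookup is one index computation and one compare, at the price of conflict misses when two live vertices share a slot.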

Setup: Decode and transform
- Last stage does all the heavy lifting
- Decode vertices from 32-bit float or 16-bit integer
  - RSX formats
- No-clipping path
  - Primitive bounds lie inside frustum
  - Store projected vertices in post-transform cache
- Clipping path
  - Only when required
  - Cull/clip triangles against near and far planes
  - Scissor test handles image extents later
  - Store clip-space vertices in post-transform cache
  - Branchless clipper

Most primitives (todo %) go through the no-clipping path: they're spatially diced, so they don't tend to be enormous.

Some get clipped via a simple branchless clipper: as it walks around the triangle, it generates an interpolated vertex for every edge, but only increments the output vertex counter if the edge spans the clipping plane. Near and far planes are handled separately.
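
The counter trick described above can be sketched for a single plane like this. It is an illustration, not the shipped SPU code: the names are hypothetical, the near plane is assumed to be z + w >= 0 in clip space, and the conditional expressions are written so a compiler can turn them into selects rather than branches.

```cpp
struct Vec4 { float x, y, z, w; };  // hypothetical clip-space vertex

// Assumed convention: a vertex is inside the near plane when z + w >= 0.
inline float NearDistance(const Vec4& v) { return v.z + v.w; }

inline Vec4 Lerp(const Vec4& a, const Vec4& b, float t) {
    Vec4 r = {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
              a.z + (b.z - a.z) * t, a.w + (b.w - a.w) * t};
    return r;
}

// Clips one triangle against the near plane. 'out' needs room for 5
// vertices: up to 4 results plus one scratch slot, because vertices are
// always written and only the counter decides whether they are kept.
// Returns the number of output vertices (0, 3 or 4).
int ClipTriangleNear(const Vec4 in[3], Vec4 out[5]) {
    int count = 0;
    for (int i = 0; i < 3; ++i) {
        const Vec4& a = in[i];
        const Vec4& b = in[(i + 1) % 3];
        const float da = NearDistance(a);
        const float db = NearDistance(b);
        const int aInside = da >= 0.0f;
        const int bInside = db >= 0.0f;

        out[count] = a;    // always written...
        count += aInside;  // ...kept only if the start vertex is inside

        // Always generate the interpolated vertex for this edge.
        const float denom = da - db;
        const float t = da / (denom != 0.0f ? denom : 1.0f);
        out[count] = Lerp(a, b, t);
        count += aInside ^ bInside;  // kept only if the edge spans the plane
    }
    return count;
}
```

The far plane would get the same treatment with its own distance function, in line with the note that the two planes are handled separately.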

Setup: Cull and dispatch
- Cull projected triangles against image extents
- Send visible triangles to staging area in main memory
  - Store one copy of each triangle via 6KB double-buffered write cache
  - Store DMA list entry for each strip under the triangle via 1.5KB double-buffered write cache
  - Saves memory (8-byte entries vs. 48-byte triangles)
- If we run out of staging space, ignore excess triangles
- Then set up and kick rasteriser jobs

Finally we cull any outlying triangles, and write them back to main memory. At least until we run out of space.

We don't duplicate triangles to each strip; instead we store each triangle once and give the strips pointers (DMA list entries) to their subset of triangles.

We transpose the triangles here to save the rasterise jobs some work.
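
The dispatch logic can be sketched as follows, under simplifying assumptions: only the vertical extent of each triangle is considered, per-strip vectors of triangle indices stand in for real DMA lists, and the `Staging` struct and `AddTriangle` function are hypothetical names. The 16-pixel strip height comes from the plan slide.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kStripHeight = 16;  // 16-pixel-high strips, from the plan

// Hypothetical model of the staging area: triangles are counted once,
// and each strip keeps a list of indices standing in for DMA list entries.
struct Staging {
    Staging(int imageHeight_, std::size_t maxTriangles_, std::size_t maxEntries_)
        : imageHeight(imageHeight_), maxTriangles(maxTriangles_), maxEntries(maxEntries_),
          triangleCount(0), entryCount(0),
          stripLists((imageHeight_ + kStripHeight - 1) / kStripHeight) {}
    int imageHeight;
    std::size_t maxTriangles, maxEntries;
    std::size_t triangleCount, entryCount;
    std::vector<std::vector<std::uint32_t>> stripLists;
};

// Returns true if the triangle was stored; false if it was culled
// (entirely off the image) or ignored because staging space ran out.
bool AddTriangle(Staging& s, float minY, float maxY) {
    if (maxY < 0.0f || minY >= static_cast<float>(s.imageHeight)) {
        return false;  // outside the image extents
    }
    const float top = std::max(minY, 0.0f);
    const float bottom = std::min(maxY, static_cast<float>(s.imageHeight - 1));
    const int first = static_cast<int>(top) / kStripHeight;
    const int last = static_cast<int>(bottom) / kStripHeight;
    const std::size_t needed = static_cast<std::size_t>(last - first + 1);
    if (s.triangleCount + 1 > s.maxTriangles || s.entryCount + needed > s.maxEntries) {
        return false;  // out of staging space: ignore the excess triangle
    }
    const std::uint32_t index = static_cast<std::uint32_t>(s.triangleCount++);
    for (int strip = first; strip <= last; ++strip) {
        s.stripLists[strip].push_back(index);  // one entry per strip touched
    }
    s.entryCount += needed;
    return true;
}
```

A 360-line image gives 23 strips in this model. Because the occluders arrive sorted by size, the triangles dropped when space runs out belong to the least important occluders.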

Rasterise job
- Launch one rasterise job per strip
- Load triangles from staging area using list DMA
- Draw triangles to a floating point 640x16 depth buffer in LS
- Compress depth buffer to uint16 and store

[Diagram: rasterise job stages: list DMA input, set up and draw triangles, compress depth buffer, output scanline. The job reads from the staging area and writes to the occlusion buffer in main memory.]
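
The slides don't say how the float depth is packed into uint16, so the following is only one plausible sketch. It assumes depth in [0, 1] with larger meaning farther, and rounds up so that a quantised occluder is never nearer than the real one, which keeps the result conservative. `CompressDepth` and `CompressStrip` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantise one depth value to 16 bits. Assumption: depth in [0, 1],
// larger = farther. Rounding up pushes occluders away from the viewer,
// so quantisation can only make them hide less, never more.
std::uint16_t CompressDepth(float depth) {
    const float clamped = std::min(std::max(depth, 0.0f), 1.0f);
    return static_cast<std::uint16_t>(std::ceil(clamped * 65535.0f));
}

// Compress a whole strip (for example 640x16 floats) in one pass.
std::vector<std::uint16_t> CompressStrip(const std::vector<float>& strip) {
    std::vector<std::uint16_t> out(strip.size());
    for (std::size_t i = 0; i < strip.size(); ++i) {
        out[i] = CompressDepth(strip[i]);
    }
    return out;
}
```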


Rasterise: LS memory layout

Block                  Size  Count  Total
Input triangle buffer  48KB  1      48KB
Depth buffer           20KB  1      20KB
Output scanline