Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14


#Rendering Battlefield 4 with Mantle

Johan Andersson, Electronic Arts

#

(benchmark: Core i7-3970X, AMD Radeon R9 290X, 1080p ULTRA)
DX11:   Avg 78 fps, Min 42 fps
Mantle: Avg 120 fps, Min 94 fps (+58%!)

#BF4 Mantle goals

Goals:

Significantly improve CPU performance
More consistent & stable performance
Improve GPU performance where possible

Add support for a new Mantle rendering backend in a live game
Minimize changes to engine interfaces
Compatible with built PC content

Work on a wide set of hardware
APU to quad-GPU
But x64 only (32-bit Windows needs to die)

Non-goals:

Design new renderer from scratch for Mantle

Take advantage of asymmetric MGPU (APU+discrete)

Optimize video memory consumption

#BF4 Mantle strategic goals

Prove that low-level graphics APIs work outside of consoles

Push the industry towards low-level graphics APIs everywhere

Build a foundation for the future that we can build great games on

#Shaders

#Shaders

Shader resource bind points replaced with a resource table object: the descriptor set
This is how the hardware accesses the shader resources
Flat list of images, buffers and samplers used by any of the shader stages
Vertex shader streams converted to vertex shader buffer loads

The engine assigns each shader resource to a specific slot in the descriptor set(s)
Slots can be shared between shader stages = smaller descriptor sets
The mapping takes a while to wrap one's head around

#Shader conversion

DX11 bytecode shaders get converted to AMDIL & the slot mapping is applied using the ILC tool
Done at load time
Don't have to change our shaders!

Have full source & control over the process

Could write AMDIL directly or use other frontends if wanted

#Descriptor sets

Very simple usage in BF4: for each draw call, write a flat list of resources
Essentially a direct replacement of SetTexture/SetConstantBuffer/SetInputStream

Single dynamic descriptor set object per frame
Sub-allocate for each draw call and write list of resources

~15000 resource slots written per frame in BF4, still very fast
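A minimal C++ sketch of this per-draw pattern, assuming a hypothetical FrameDescriptorPool abstraction (the actual Frostbite code and Mantle entry points differ):

```cpp
// Hypothetical sketch of BF4-style per-draw descriptor writes.
#include <cassert>
#include <cstdint>
#include <vector>

struct DescriptorSlot {
    enum Type { Image, Buffer, Sampler } type;
    uint64_t resource;  // engine handle of the bound resource
};

class FrameDescriptorPool {
public:
    // Sub-allocate a contiguous slot range for one draw call.
    uint32_t allocate(uint32_t slotCount) {
        uint32_t first = m_nextSlot;        // use atomics if shared across threads
        m_nextSlot += slotCount;
        assert(m_nextSlot <= kCapacity);    // ~15000 slots/frame fits easily
        return first;
    }
    void write(uint32_t slot, const DescriptorSlot& d) { m_slots[slot] = d; }
    void reset() { m_nextSlot = 0; }        // once per frame

private:
    static constexpr uint32_t kCapacity = 64 * 1024;
    std::vector<DescriptorSlot> m_slots = std::vector<DescriptorSlot>(kCapacity);
    uint32_t m_nextSlot = 0;
};
```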

#Descriptor sets

#Descriptor sets: future optimizations

Use static descriptor sets when possible

Reduce resource duplication by reusing & sharing more across shader stages

Nested descriptor sets

#Compute pipelines

1:1 mapping between pipeline & shader

No state built into pipeline

Can execute in parallel with rendering

~100 compute pipelines in BF4

#Graphics pipelines

All graphics shader stages combined into a single pipeline object together with important graphics state

~10000 graphics pipelines in BF4 on a single level, ~25 MB of video memory

Could use a smaller working pool of active state objects to keep a reasonable amount in memory
Has not been required for us

#Pre-building pipelines

Graphics pipeline creation is an expensive operation, do it at load time instead of runtime!
Creating one of our graphics pipelines takes ~10-60 ms each
Pre-build using N parallel low-priority jobs
Avoid 99.9% of runtime stalls caused by pipeline creation!

Requires knowing the graphics pipeline state that will be used with the shaders:
Primitive type
Render target formats
Render target write masks
Blend modes

Not fully trivial to know all state, may require engine changes / pre-defining use cases
Important to design for!
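A rough sketch of the load-time pre-build, assuming a hypothetical createGraphicsPipeline() wrapper around the driver call and a plain thread pool in place of Frostbite's job system:

```cpp
// Hypothetical sketch: pre-build all known pipeline permutations at load time.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct PipelineDesc {
    uint32_t primitiveType;
    uint32_t renderTargetFormats[8];
    uint8_t  renderTargetWriteMasks[8];
    uint32_t blendMode;
    // ... plus shader stage handles
};

void createGraphicsPipeline(const PipelineDesc&);  // stand-in for the driver call

void prebuildPipelines(const std::vector<PipelineDesc>& descs, unsigned workers) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([&] {
            // Each creation costs ~10-60 ms; keep it off the critical path.
            for (size_t j = next++; j < descs.size(); j = next++)
                createGraphicsPipeline(descs[j]);
        });
    for (auto& t : pool) t.join();  // low-priority in practice, joined here for brevity
}
```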

#Pipeline cache

Cache built pipelines both in a memory cache and a disk cache
Improved loading times
Max 300 MB
Simple LRU policy
LZ4 compressed (free)

Database signature:
Driver version
Vendor ID
Device ID
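A minimal sketch of how such a signature can gate the disk cache; the struct layout is illustrative, not the shipped format:

```cpp
// Hypothetical sketch: reject the whole cached pipeline database on mismatch.
#include <cstdint>

struct PipelineCacheSignature {
    uint32_t driverVersion;
    uint32_t vendorId;
    uint32_t deviceId;
};

bool cacheIsValid(const PipelineCacheSignature& onDisk,
                  const PipelineCacheSignature& current) {
    // On mismatch the database is dropped and rebuilt; entries themselves
    // are LZ4-compressed and evicted with a simple LRU policy up to 300 MB.
    return onDisk.driverVersion == current.driverVersion &&
           onDisk.vendorId == current.vendorId &&
           onDisk.deviceId == current.deviceId;
}
```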

#Memory

#Memory management

Mantle devices expose multiple memory heaps with different characteristics
Can be different between devices, drivers and OSes

User explicitly places resources in wanted heaps
Driver suggests preferred heaps when creating objects, not a requirement

Example heaps on one device (read/write bandwidths in GB/s):

Type    Size      Page   CPU access                                              GPU Read  GPU Write  CPU Read  CPU Write
Local   256 MB    65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  130       170        0.0058    2.8
Local   4096 MB   65535  -                                                       130       180        0         0
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  2.6       2.6        0.1       3.3
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent                               2.6       2.6        3.2       2.9

#Frostbite memory heaps

System Shared Mapped
CPU memory that is GPU visible. Write combined & persistently mapped = easy & fast to write to in parallel at any time

System Shared Pinned
CPU cached for readback. Not used much

Video Shared
GPU memory accessible by the CPU. Used for descriptor sets and dynamic buffers
Max 256 MB (legacy constraint)
Avoid keeping persistently mapped as WDDM doesn't like this and can decide to move it back to CPU memory

Video Private
GPU private memory. Used for render targets, textures and other resources the CPU does not need to access

#Memory references

WDDM needs to know which memory allocations are referenced for each command buffer
In order to make sure they are resident and not paged out
Max ~1700 memory references are supported
Overhead with having lots of references

Engine needs to keep track of what memory is referenced while building the command buffers
Easy & fast to do
Each reference is either read-only or read/write
We use a simple global list of references shared for all command buffers
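A sketch of a global, shared reference list of this kind; the names and dedupe policy are assumptions, not Frostbite's actual code:

```cpp
// Hypothetical sketch of a global per-frame memory reference list.
#include <cstdint>
#include <mutex>
#include <vector>

struct MemoryReference {
    uint64_t allocation;  // driver memory object handle
    bool readOnly;        // read-only references are cheaper for VidMM
};

class FrameMemoryReferences {
public:
    void add(uint64_t allocation, bool readOnly) {
        std::lock_guard<std::mutex> lock(m_mutex);
        // A read/write reference subsumes a read-only one for the same memory.
        for (auto& r : m_refs)
            if (r.allocation == allocation) { r.readOnly = r.readOnly && readOnly; return; }
        m_refs.push_back({allocation, readOnly});  // stay below the ~1700 limit
    }
    const std::vector<MemoryReference>& list() const { return m_refs; }
    void reset() { m_refs.clear(); }  // after submission each frame

private:
    std::mutex m_mutex;
    std::vector<MemoryReference> m_refs;
};
```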

#Memory pooling

Pooling memory allocations was required for us
Sub-allocate within larger 1-32 MB chunks
All resources store memory handle + offset
Not as elegant as just void* on consoles
Fragmentation can be a concern, but hasn't been much of an issue for us in practice

GPU virtual memory mapping is fully supported, can simplify & optimize management
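A simplified first-fit suballocator illustrating the handle+offset scheme; the chunk policy and allocateGpuMemory() are hypothetical:

```cpp
// Hypothetical sketch of handle+offset sub-allocation from large chunks.
#include <cstdint>
#include <vector>

uint64_t allocateGpuMemory(uint64_t size);  // stand-in for the driver allocation

struct SubAllocation {
    uint64_t memoryHandle;  // the large underlying allocation
    uint64_t offset;        // where this resource lives inside it
};

class MemoryPool {
public:
    // Assumes size + alignment fits within one chunk.
    SubAllocation alloc(uint64_t size, uint64_t alignment) {
        for (auto& c : m_chunks) {
            uint64_t offset = (c.used + alignment - 1) & ~(alignment - 1);
            if (offset + size <= kChunkSize) {
                c.used = offset + size;
                return {c.handle, offset};
            }
        }
        m_chunks.push_back({allocateGpuMemory(kChunkSize), 0});  // new chunk
        return alloc(size, alignment);
    }

private:
    static constexpr uint64_t kChunkSize = 32ull << 20;  // BF4 used 1-32 MB chunks
    struct Chunk { uint64_t handle; uint64_t used; };
    std::vector<Chunk> m_chunks;
};
```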

#Overcommitting video memory

Avoid overcommitting video memory!
Will lead to severe stalls as VidMM moves blocks and moves memory back and forth
VidMM is a black box
One of the biggest issues we ran into during development

Recommendations:
Balance memory pools
Make sure to use read-only memory references
Use memory priorities

#Memory prioritiesSetting priorities on the memory allocations helps VidMM choose what to page out when it has to

5 priority levels:
Very high = render targets with MSAA
High = render targets and UAVs
Normal = textures
Low = shader & constant buffers
Very low = vertex & index buffers
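As a sketch, the mapping above boils down to something like this (ResourceKind and the function are hypothetical engine-side names):

```cpp
// Hypothetical sketch of mapping resource kinds to the five priority levels.
enum class MemoryPriority { VeryLow, Low, Normal, High, VeryHigh };

enum class ResourceKind {
    RenderTargetMsaa, RenderTarget, Uav, Texture,
    ShaderBuffer, ConstantBuffer, VertexBuffer, IndexBuffer
};

MemoryPriority priorityFor(ResourceKind kind) {
    switch (kind) {
        case ResourceKind::RenderTargetMsaa: return MemoryPriority::VeryHigh;
        case ResourceKind::RenderTarget:
        case ResourceKind::Uav:              return MemoryPriority::High;
        case ResourceKind::Texture:          return MemoryPriority::Normal;
        case ResourceKind::ShaderBuffer:
        case ResourceKind::ConstantBuffer:   return MemoryPriority::Low;
        default:                             return MemoryPriority::VeryLow;  // VB/IB
    }
}
```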

#Memory residency future

For best results manage which resources are in video memory yourself & keep only ~80% used
Avoid all stalls
Can async DMA in and out

We are thinking of redesigning to fully avoid possibility of overcommitting

Hoping WDDM's memory residency management can be simplified & improved in the future

#Resource management

#Resource lifetimes

App manages lifetime of all resources
Have to make sure the GPU is not using an object or memory while we are freeing it on the CPU
How we've always worked with GPUs on the consoles
Multi-GPU adds some additional complexity that consoles do not have

We keep track of lifetimes on a per-frame granularity
Queues for object destruction & free memory operations
Add to queue at any time on the CPU
Process queues when GPU command buffers for the frame are done executing
Tracked with command buffer fences
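A minimal sketch of such per-frame destruction queues keyed on command buffer fences; fence values and thread synchronization are simplified:

```cpp
// Hypothetical sketch of fence-gated per-frame destruction queues.
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

struct FrameGarbage {
    uint64_t fence = 0;                        // signaled when the frame's GPU work is done
    std::vector<std::function<void()>> frees;  // deferred destroy/free operations
};

class LifetimeTracker {
public:
    // Callable at any time on the CPU; the actual free is deferred.
    void deferFree(std::function<void()> op) { m_current.frees.push_back(std::move(op)); }

    void onFrameSubmitted(uint64_t frameFence) {
        m_current.fence = frameFence;
        m_pending.push_back(std::move(m_current));
        m_current = {};
    }

    // Called once per frame with the last fence the GPU has completed.
    void collect(uint64_t completedFence) {
        while (!m_pending.empty() && m_pending.front().fence <= completedFence) {
            for (auto& f : m_pending.front().frees) f();
            m_pending.pop_front();
        }
    }

private:
    FrameGarbage m_current;             // locking omitted for brevity
    std::deque<FrameGarbage> m_pending;
};
```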

#Linear frame allocator

We use multiple linear allocators with Mantle for both transient buffers & images
Used for huge amounts of small constant data and other GPU frame data that the CPU writes
Easy to use and very low overhead
Don't have to care about lifetimes or state

Fixed memory buffers for each frame
Super cheap sub-allocation from any thread
If full, use heap allocation (also fast due to pooling)

Alternative: ring buffers
Requires being able to stall & drain the pipeline at any allocation if full, additional complexity for us
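A sketch of the fixed per-frame buffer with lock-free sub-allocation and heap fallback; sizes and names are illustrative:

```cpp
// Hypothetical sketch of the fixed per-frame buffer with lock-free bump allocation.
#include <atomic>
#include <cstdint>

class LinearFrameAllocator {
public:
    struct Allocation { void* cpu; uint64_t gpuOffset; };

    Allocation alloc(uint32_t size, uint32_t align) {
        // Over-allocate by 'align' so the aligned offset always fits the claim.
        uint64_t base = m_head.fetch_add(uint64_t(size) + align);  // any thread
        uint64_t offset = (base + align - 1) & ~uint64_t(align - 1);
        if (offset + size > kSize)
            return heapAlloc(size, align);  // pooled heap fallback, also fast
        return { static_cast<uint8_t*>(m_cpuBase) + offset, offset };
    }

    void reset() { m_head = 0; }  // at the start of each frame

private:
    Allocation heapAlloc(uint32_t size, uint32_t align);  // not shown

    static constexpr uint64_t kSize = 8ull << 20;  // illustrative size
    void* m_cpuBase = nullptr;                     // persistently mapped memory
    std::atomic<uint64_t> m_head{0};
};
```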

#Tiling

Textures should be tiled for performance
Explicitly handled in Mantle, user selects linear or tiled
Some formats (BC) can't be accessed as linear by the GPU

On consoles we handle tiling offline as part of our data processing pipeline
We know the exact tiling formats and have separate resources per platform

For Mantle:
Tiling formats are opaque, can be different between GPU architectures and image types
Tile textures with a DMA image upload from SystemShared to VideoPrivate
Linear source, tiled destination
Free

#Command buffers

#Command buffers

Command buffers are the atomic unit of work dispatched to the GPU
Separate creation from execution
No immediate context a la DX11 that can execute work at any call
Makes resource synchronization and setup significantly easier & faster

Typical BF4 scenes have around ~50 command buffers per frame
Reasonable tradeoff for us between submission overhead and CPU load-balancing

#Command buffer sources

Frostbite has 2 separate sources of command buffers

World rendering
Rendering the world with tons of objects, lots of draw calls. Have all frame data up front
All resources except for render targets are read-only
Generated in parallel up front each frame

Immediate rendering (the rest)
Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
Managing resource state, memory and running on different queues (graphics, compute, DMA)
Sequentially generated in a single job, simulate an immediate context by splitting the command buffer

Both are very important and have different requirements

#Resource transitions

Key design in Mantle to significantly lower driver overhead & complexity
Explicit hazard tracking by the app/engine
Drives architecture-specific caches & compression
AMD: FMASK, CMASK, HTILE
Enables explicit memory management

Examples:
Optimal render target writes
Graphics shader read-only
Compute shader write-only
DrawIndirect arguments

Mantle has a strong validation layer that tracks transitions which is a major help

#Managing resource transitions

Engines need a clear design on how to handle state transitions
Multiple approaches possible:

Sequential in-order command buffers
Generate one command buffer at a time, in order
Transition resources on-demand when doing operations on them, very simple
Recommendation: start with this

Out-of-order multiple command buffers
Track state per command buffer, fix up transitions when the order of command buffers is known

Hybrid approaches & more

#Managing resource transitions in Frostbite

Current approach in Frostbite is quite basic:
We keep track of a single state for each resource (not subresource)
The immediate rendering transitions resources as needed depending on the operation
The out-of-order world rendering command buffers don't need to transition states
They already have write access to MRTs and read access to all resources, set up outside them
Avoids the problem of them not knowing the state during generation

Works now, but as we do more general parallel rendering it will have to change
Track resource state for each command buffer & fix up between command buffers
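A sketch of the single-state-per-resource tracking with on-demand transitions; the ResourceState values and prepareResource() are stand-ins for the real Mantle states and transition command:

```cpp
// Hypothetical sketch of single-state-per-resource transition tracking.
#include <cstdint>
#include <unordered_map>

enum class ResourceState : uint8_t {
    RenderTargetWrite, GraphicsShaderRead, ComputeShaderWrite, DrawIndirectArgs
};

struct CommandBuffer {
    // Stand-in for recording the actual Mantle state transition command.
    void prepareResource(uint64_t resource, ResourceState from, ResourceState to);
};

class TransitionTracker {
public:
    // Immediate rendering path: transition on demand, only when states differ.
    void require(uint64_t resource, ResourceState next, CommandBuffer& cmd) {
        auto it = m_state.find(resource);
        ResourceState current =
            (it != m_state.end()) ? it->second : ResourceState::GraphicsShaderRead;
        if (current != next) {
            cmd.prepareResource(resource, current, next);
            m_state[resource] = next;
        }
    }

private:
    // One state per resource (not per subresource), as described above.
    std::unordered_map<uint64_t, ResourceState> m_state;
};
```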

#Dynamic state objects

Graphics state is only set with the pipeline object and 5 dynamic state objects
State objects: color blend, raster, viewport, depth-stencil, MSAA
No other parameters such as in DX11 with stencil ref or SetViewport functions

Frostbite use case:
Pre-create when possible
Otherwise on-demand creation (hash map)
Only ~100 state objects!

Still possible to end up with lots of state objects
Esp. with state object float & integer values (depth bounds, depth bias, viewport)
But no need to store all permutations in memory, objects are fast to create & the app manages lifetimes
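A sketch of the on-demand hash-map path for one state object type; the key fields and createRasterState() are illustrative:

```cpp
// Hypothetical sketch of on-demand state object creation behind a hash map.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

struct RasterStateDesc {
    float   depthBias;
    float   slopeScaledDepthBias;
    uint8_t cullMode;
    bool operator==(const RasterStateDesc& o) const {
        return depthBias == o.depthBias &&
               slopeScaledDepthBias == o.slopeScaledDepthBias &&
               cullMode == o.cullMode;
    }
};

struct RasterStateDescHash {
    size_t operator()(const RasterStateDesc& d) const {
        size_t h = std::hash<float>()(d.depthBias);
        h = h * 31 + std::hash<float>()(d.slopeScaledDepthBias);
        return h * 31 + d.cullMode;
    }
};

uint64_t createRasterState(const RasterStateDesc&);  // stand-in for the driver call

// Objects are cheap to create, so permutations are built only when first used.
uint64_t getRasterState(const RasterStateDesc& desc) {
    static std::unordered_map<RasterStateDesc, uint64_t, RasterStateDescHash> cache;
    auto it = cache.find(desc);
    if (it == cache.end())
        it = cache.emplace(desc, createRasterState(desc)).first;
    return it->second;
}
```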

#Queues

#Queues

The universal queue can do graphics, compute and presents

We also use additional queues to parallelize GPU operations:
DMA queue - improve perf with faster transfers & avoid idling graphics while transferring
Compute queue - improve perf by utilizing idle ALU and update resources simultaneously with gfx

More GPUs = more queues!

#Queues synchronization

Order of execution within a queue is sequential

Synchronize multiple queues with GPU semaphores (signal & wait)

Also works across multiple GPUs

(diagram: compute and graphics queues synchronized via semaphore signal & wait)

#Queues synchronization cont.

Started out with explicit semaphores
Error prone to handle when having lots of different semaphores & queues
Difficult to visualize & debug

Switched to a representation more similar to a job graph
Just a model on top of the semaphores

#GPU job graph

Each GPU job has a list of dependencies (other command buffers)
Dependencies have to finish first before the job can run on its queue
The dependencies can be from any queue

Was easier to work with, debug and visualize
Really extendable going forward

(diagram: job graph with Graphics 1, Graphics 2, DMA and Compute nodes)
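A sketch of such a job graph lowered onto per-queue semaphores; the semaphore helpers are hypothetical wrappers, and jobs are assumed to be submitted in topological order:

```cpp
// Hypothetical sketch: a job graph lowered onto per-queue semaphores.
#include <cstdint>
#include <vector>

enum class Queue { Universal, Compute, Dma };

void queueWaitSemaphore(Queue q, uint64_t semaphore);                  // stand-ins for
uint64_t createSemaphore();                                            // the real queue
void queueSubmit(Queue q, uint64_t cmdBuf, uint64_t signalSemaphore);  // & semaphore API

struct GpuJob {
    uint64_t commandBuffer;
    Queue queue;
    std::vector<GpuJob*> dependencies;  // may live on any queue
    uint64_t semaphore = 0;             // signaled when this job completes
};

// Jobs are assumed to arrive in topological order.
void submitGraph(const std::vector<GpuJob*>& jobs) {
    for (GpuJob* job : jobs) {
        // Cross-queue dependencies become semaphore waits; same-queue
        // dependencies are already satisfied by sequential execution.
        for (GpuJob* dep : job->dependencies)
            if (dep->queue != job->queue)
                queueWaitSemaphore(job->queue, dep->semaphore);
        job->semaphore = createSemaphore();
        queueSubmit(job->queue, job->commandBuffer, job->semaphore);
    }
}
```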

#Async DMA

AMD GPUs have dedicated hardware DMA engines, let's use them!
Uploading through DMA is faster than on the universal queue, even if blocking
DMA has alignment restrictions, have to support falling back to copies on the universal queue

Use case: frame buffer & texture uploads
Used by resource initial data uploads and our UpdateSubresource
Guaranteed to be finished before the GPU universal queue starts rendering the frame

Use case: multi-GPU frame buffer copy
Peer-to-peer copy of the frame buffer to the GPU that will present it

#Async compute

Frostbite has lots of compute shader passes that could run in parallel with graphics work
HBAO, blurring, classification, tile-based lighting, etc.

Running as async compute can improve GPU performance by utilizing free ALU
For example while doing shadowmap rendering (ROP bound)

#Async compute: tile-based lighting

3 sequential compute shaders
Input: zbuffer & gbuffer
Output: HDR texture/UAV

Runs in parallel with the graphics pipeline that renders to other targets

(timeline diagram: compute queue runs TileZ, Cull lights and Lighting while the graphics queue renders Gbuffer, Shadowmaps, Reflection, Distort and Transp, synchronized with semaphore signal & wait)

#Async compute: tile-based lighting cont.

We manually prepare the resources for the async compute
Important to not access the resources on other queues at the same time (unless in a read-only state)
Have to transition resources on the queue that last used them

Up to 80% faster in our initial tests, but not fully reliable
But is a pretty small part of the frame time
Not in BF4 yet

(same timeline diagram as above)

#Multi-GPU

#Multi-GPU

Multi-GPU alternatives:
AFR - Alternate Frame Rendering (1-4 GPUs of the same power)
Heterogeneous AFR - 1 small + 1 big GPU (APU + discrete)
SFR - Split Frame Rendering
Multi-GPU job graph - primary strong GPU + slave GPUs helping

Frostbite supports AFR natively
No synchronization points within the frame
For resources that are not rendered every frame: re-render resources for each GPU
Example: sky envmap update on weather change

With Mantle multi-GPU is explicit and we have to build support for it ourselves

#Multi-GPU AFR with Mantle

All resources explicitly duplicated on each GPU with async DMA
Hidden internally in our rendering abstraction

Every frame we alternate which GPU we build command buffers for and use resources from

Our UpdateSubresource has to make sure it updates resources on all GPUs

Presenting to the screen has, in some modes, to copy the frame buffer to the GPU that owns the display

Bonus:
Can simulate multi-GPU mode even with a single GPU!
Multi-GPU works in windowed mode!

#Multi-GPU issues

GPUs independently rendering & presenting to the screen can cause micro-stuttering
Frames are not presented at regular intervals
Frame rate can be high but presentation & gameplay are not smooth
FCAT is a good tool to analyse this

(timeline: GPU0 & GPU1 each presenting frames 0-3, irregular presentation interval)

We need to introduce a dependency & dampening between the GPUs to alleviate this: frame pacing

(timeline: GPU0 & GPU1 each presenting frames 0-3, ideal presentation interval)

#Frame pacing

Measure average frame rate on each GPU
Short history (10-30 frames)
Filter out spikes

Insert a delay on the GPU before each present
Forces the frame times to become more regular and the GPUs to align
Delay value is based on the calculated avg frame rate

(timeline: a delay inserted before each present aligns GPU0 & GPU1 to regular presentation intervals)
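A sketch of the pacing logic: average a short, spike-filtered frame-time history and pad each present up to that average (constants illustrative):

```cpp
// Hypothetical sketch of the frame pacing heuristic.
#include <algorithm>
#include <cstddef>
#include <deque>

class FramePacer {
public:
    void onFrameComplete(double gpuFrameMs) {
        if (m_history.size() >= kWindow) m_history.pop_front();
        m_history.push_back(gpuFrameMs);
    }

    // Delay to insert on the GPU before the next present.
    double presentDelayMs(double thisFrameMs) const {
        if (m_history.empty()) return 0.0;
        double mean = 0.0;
        for (double ms : m_history) mean += ms;
        mean /= m_history.size();
        // Filter spikes: recompute the average without outlier frames.
        double sum = 0.0; int n = 0;
        for (double ms : m_history)
            if (ms < 2.0 * mean) { sum += ms; ++n; }
        double avg = n ? sum / n : mean;
        return std::max(0.0, avg - thisFrameMs);  // pad short frames up to the average
    }

private:
    static constexpr size_t kWindow = 30;  // short history (10-30 frames)
    std::deque<double> m_history;
};
```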

#Conclusion

#Mantle dev recommendations

The validation layer is a critical friend!

You'll end up with a lot of object & memory management code, try to share it with console code

Make sure you have control over memory usage and can avoid overcommitting video memory

Build a robust solution for resource state management early

Figure out how to pre-create your graphics pipelines, can require engine design changes

Build for multi-GPU support from the start, easier than to retrofit

#Future

Second wave of Frostbite Mantle titles

Adapt the Frostbite core rendering layer based on learnings from Mantle:
Refine binding & buffer updates to further reduce overhead
Virtual memory management
More async compute & async DMA
Multi-GPU job graph R&D

Linux
Would like to see how our Mantle renderer behaves with a different memory management & driver model

#Questions?

Email: [email protected]
Web: http://frostbite.com
Twitter: @repi
