Post on 02-Jun-2018
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
1/38
DIRECTX
ADVANCEMENTS IN THE MANY CORE ERA
Getting the most out of the PC Platform
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
2/38
System Architecture Trends
CPU Evolution
From single core to multi-corePower wall limits frequency scaling
GPU EvolutionUnified processor cores
Multiple hardware engines
MMU and paged memories
More autonomous, fully
programmable machines
Highly Integrated SoC Designs
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
3/38
GeForce FX 5800GeForce 6800 Ultra
GeForce 7800 GTX
GeForce 8800 GTX
GeForce GTX 280
GeForce GTX 480
GeForce GTX 580
GeForce GTX 680
GeForce GTX TITAN
GeForce 780 Ti
Pentium 4
Woodcrest
Harpertown
BloomfieldWestmere
Sandy BridgeIvy Bridge
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
6000
Apr-01 Jan-04 Oct-06 Jul-09 Apr-12 Dec-14
NVIDIA GPU Single Precision
Intel CPU Single Precision
Northwood
PrescottWoodcrest
Harpertown
Blo
GeForce FX 5900
GeForce 6800 GT
GeForce 7800 GTX
GeForce 8800 GTX
GeForce GTX 280
GeForce GT
0
30
60
90
120
150
180
210
240
270
300
330
360
2003 2005 2007 2
CPU
GeForce GPU
GPU vs CPU Perf Scaling
GFLOPS
GB/s
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
4/38
Application Trends
From Functional Parallelism to Task-
based ParallelismGraphs of jobs, work stealing schedulers
GPU as a General Purpose ProcessorPhysics & simulation, Programmable Graphics
Requires more control over underlying hardware
Content is KingShifting balance between runtime costs vs
production costsFrom
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
5/38
Current API Abstraction is Getting
Designed After >20 Years Old Machine Abstraction
Large State Machine AbstractionState transitions orchestrated by the CPU
Opaque Memory ModelResource management hidden from the app
Limited Multi-core ScalabilitySerial submission model
Implicit Hazard Tracking and SynchronizationHigh work submission costs
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
6/38
Radical Reductions in Submission
Embrace Latest GPU ArchitecturImprovements
Highly Scalable on Multi-core Sys
Provide Console-like Execution E
Designed in Close Collaboration
Microsoft and NVIDIA
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
7/38
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
8/38
DX12 Nutsand Bolts
State Management Model
Command Lists
Resource Binding And Hazard Tra
Residency Management And Mem
CPU/GPU Synchronization
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
9/38
State Management
Further State Grouping and FactorizationProvide opportunities for aggressive pre-baking
As much cost as possible is paid up-front
Pipeline State Encapsulates Heavy-weight StateBits of state specified and used together
More Frequently Changing State Still Lives
Outside PSOsLighter weight state changes
ID3D12Pipel
ID3D1
ID3D11Ra
ID3D11
ID3D11Dep
ID3D11I
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
10/38
PSO and Non-PSO State
PSO Non-PSO
Shader Program at Every Stage Resource Bindings
Blend State Viewport Mappings
Rasterizer State Scissor Rectangles
Depth-stencil State Blend Factor
Input Layout Depth Test
Render Target Properties Stencil
Input Topology Type Input Topology Bucket (List v
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
11/38
New Binding Model
Current GPUs Can Reference a Large Namespace of ResourcSo called bindless model
Resource Binding State is Pulled Into the SystemRather than CPU pushing it - more scalable and efficient
Can Reference a Very Large Working Set of ResourcesE.g. Kepler GPUs can use a palette of > 1M textures
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
12/38
Shader Code
Bindless BenefitsEssential for raytracing and next-gen rendering
where resource working set is not known in advan
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
13/38
Descriptors
Descriptors encapsulate handles to resources
Semantically equivalent to ID3D11*ResourceView objects
Opaque bag of bits, of implementation-specific format and size
Expected to be on the order of 64B-128B
No driver-side allocations associated with descriptors
Can be freely copied around and thrown away by the app
API functions to create descriptors given app-provided spec
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
14/38
Descriptor Tables and Heaps
Descriptor Tables Encapsulate Palettes
of Resource ReferencesContiguous arrays of descriptor entries
Individual descriptors indexable from the GPU
Tables Defined in Descriptor Heaps
Driver provided memoryImplementation-specific size restrictions
Represent Unit of Resource Binding StateAllow for bulk binding of resources
Descriptor Heap
SR
SR
SR
Table #0
SR
SR
SR
Table #1
SR
SRTable #6
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
15/38
Descriptors, Tables, and Heaps
Tables Very Cheap to Switch
Ideally just a pointer change to the HW
Efficent bulk binding state change
A Collection of Tables Can be Set at a Time
Shaders can select a table to use for indexingRestrictions based on resource type
Applications Responsible for Managing Tables Within Heaps
Various strategies, balancing heap space vs performance
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
16/38
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
17/38
Bundles
Highly Reusable Rendering Components
A kind of command list
Can Only be Executed From Direct Command Lists
Non-pipeline State Inherited from the Calling Command List
Provides for reuse flexibility
Represent Individual Objects in a SceneCollection of draws and state changes
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
18/38
Residency Management And Heap
Heaps Represent Bulk Memory Allocation
Unit of OS/driver memory management
Explicitly Separate From The Binding ModelIndividual resource bindings no longer tracked
Residency Managed At Coarse Grain Per HeapApplications responsible for managing memory within heaps
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
19/38
Hazard Tracking
Implicit Hazard Tracking Becoming Hard and Impractical
Tiled resources allow for memory aliasing
Very large palette of resources can be referenced
Applications Use Explicit Barriers to Signal HazardsE.g. resource transitioning from RTV to SRV
Can exploit app-level knowledge and be less conservative
Robustness vs Efficiency Trade off
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
20/38
CPU/GPU Synchronization
Hazards Between Multiple Concurrent Processors Managed W
CPU sharing data with the GPU
Applications Responsible for Setting Fences and Tracking ThCan use standard OS synchronization APIs
Renaming and versioning optimizations are responsibility ofDynamic Resources are Effectively Permanently Mapped
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
21/38
Diving IntoNitrous
Nitrous = Oxides custom engi
Specifically designed for high
Core neutral. Main thread act
lightweight sequencer
All work divided up into small
are in the microsecond range
Can produce lots of jobs, 10,0
per frame
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
22/38
Multi-core CPU Basics
Be Wary, There Is A Lot Of Very Bad Advice In The Wild
Spawning threads to handle tasksRelying OS preemptive scheduler, heavy weight OS synchronization primitiv
Functional threading in general
Your Survival GuideOK: Multi-thread read of same location
OK: Multi-thread write to different locations
OK: Multi-thread write to same location in stamp mode
CAUTION: Atomic instructions
STOP: Multi-thread read/write to same location
STOP: Multi-thread write to same CACHE line
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
23/38
Even Frame
Setting Up Our Frame
Concept of Asynchronous GPU
Now Exposed Through APIMust buffer frames ourselves
Will create 2+ copies of everything
CPU to GPU DataShader constants
Texture updates
GPU CommandsCommand group = self contained, no
cross thread hazards
Means WAW, RAW hazards must be
marked in command buffers
Data nor command should not be
changed while GPU could be reading
CommandBuffer
GPU visible
memory
Command
Buffer
CommandBuffer
ComBu
Com
Bu
ComBu
d b d
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
24/38
But Commands May be Generated
Inter Frame Data
Job Scheduler
Core 0
Command
Buffer
Command
Buffer
Core 1 Core 2
Command
Buffer
Command
Buffer
M d il b f
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
25/38
More details about our frame
In reality, diagram is over simplified
Nitrous has its own internal command formatSmall, efficient commands
Stateless, each command contains references to all needed state
Inheritance unneeded
Seperates internal graphics system from any particular API
Being Stateless, can be generated completely out of order
Entire Frame is queued up in internal command formatD3D12 is translated. Each internal command is translated, 1:1each command buffer to a D3D12 command list
Gets more optimal use out of instruction cache and data cachAllows us surrender the entire system to the driver
Sh d
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
26/38
Shaders
Shader Blocks
Group of shaders with identicleresources
Key point: No changing of individual
shaders, shader considered onemonolithic block
All resources bound at same places
across all stages
Maps well to a group of pipeline
state objects in d3d12
ShaderGroup SimpleShader
{
ResourceSetPrimitive = V
ConstantSetDynamic[0] = D
ResourceSetBatch[1] = UConstantSetShader[0] = Gl
Methods
{
main:
CodeBlocks = SimpleShade
VertexShader = SimpleVSS
PixelShader = SimplePSS
zprime:
CodeBlocks = SimpleShade
VertexShader = SimpleVSS
PixelShader = BlankSimp
}
}
F F i f U d t
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
27/38
Even FrameBatch Set
Four Frequencies of Updates
Batch 0 Batch 1 Batch 2 Batch 3 Batch 4
Prim 0 Prim 1 Prim 2
Prim 1 Prim 2
PRIMITIVE
BATCH SET
IB
Resources
Tri info
Pri
Sh
ReCo
BatchesPrimitives
Shaders
RTs
Blend State
ReCo
Sh
R S t
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
28/38
Resource Sets
In Real World, Textures
Are Grouped
Nitrous Has 5 Bind Points2 for batch
2 for shader
1 for primitive
VB Is Just A Resource Set
Maps Well To D3d12 Descriptor Tables
SPACE FIGHTER 1
(0) Albiedo
(1) Material Mask(2) Ambient Occlusion
(3) Normal Map
(4) Weathering Map
M P l
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
29/38
Memory Pools
Resources used together, created
together
Multiple resource sets are often pooled
Simplifies memory management, less
then 1000 total allocations
Maps well to D3D12 memory heap
Orange Team Units M
SPACE FIGHTER 1 CA
CARRIER FORWARD CA
(0) Albiedo
(1) Material Mask(2) Ambient Occlusion
(3) Normal Map
(4) Weathering Map
(0) Al
(1) Ma(2) Am
(3) No
(4) We
(0) Albiedo
(1) Material Mask
(2) Ambient Occlusion
(3) Normal Map(4) Weathering Map
(0) Al
(1) Ma
(2) Am
(3) No(4) We
Constants and Reso rce Update
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
30/38
Constants and Resource Update
Common data used per frame is dumped into a Graphics Transfer M
Reset every frame, with a simple incrementing counterAll constants are uploaded into this memory, and referenced by com
Per frame resource updates placed in this memory
No point in resizing and reallocating, need the max memory planned f
Non per frame updates, which are big, other memory is allocated, b
free memory after receiving notification that GPU command is comp
D3d12 this ends up just being a buffer
Starting the frame
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
31/38
Starting the frame
bool BeginFrame()
{
bool bGPUbound = true
int FrameData = g_uFrame % FramesQueued
if g_Fences[FrameData] is blocking
then g_Fences[FrameData]->WaitOnFence
Else bGPUBound = false
//we know the GPU is free with this nowg_FrameData[FrameData]->LockAndMap()
g_uFrame++;
return bGPUBound;
}
This function shplace on sequen
once ready to be
generating comm
Can place timer
both sides of the
get accurate CPof our app
Median frames q
should be Fram
.5, so usually on
frames behind
Generating a command list
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
32/38
Generating a command list
Once frame has begun, can begin issuing command lists
Easy strategy: Create 2 sets of command lists, 1 set for tbeing rendered, and 1 for the frame being generated (orthen 2 frames are queued)
What we do:Create a series of tasks which dump rendering into command b
Dont create a new command buffer for every task, however, ea context which points to a command buffer
So, if we have 6 CPUs and 600 objects to process in a particularend up generating 6 command buffers with 100 batches each
The draw order will fluctuate frame to frame, so must handle adifferently
Resource Sets
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
33/38
Resource Sets
B
B
B
P
S
S
Natural Allegory To Descriptor Tables
Since Resource Sets are immutable, and can bind multiple, canpre create
If hardware can support multiple descriptor tables,
conceivable that we dont have to update them per frame ever
Otherwise, need to dynamically create a descritor table for
each batch
What Resource Bindings Exist For Nitrous? Turns
Out We Have A Small State Vector
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
34/38
Generating a Command List
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
35/38
Generating a Command List
For each Batch (inside a task)
Create State vector from Batch, Shader, and Primitiveif not enough bind points
Lookup state vector in our cache (16 entry, un
per thread)
if in cache
Bind created descriptor table
elseCreate New descriptor table (since same size, we h
pool of them)
Bind descriptor table, evict last used thing in cato cache
Tracking Resource Usage
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
36/38
Tracking Resource Usage
Apps responsibility to track what
resources get used
Simple strategy: Stamp a frame number
on each memory pool anytime it is bound
Traverse the complete resource list,
anything which matches current framemust be resident
Quick as long as we keep # of heaps
reasonableImportant: Frame # should be paddedinto a cache line to avoid serialization
Heap Description Las
UI Textures intro
UI Textures in Game
Orange Faction Units
Purple Faction Units
Weapon effects
Post Process RTs
Terrain Heightmap
Expected results
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
37/38
Expected results
Expecting large increases in performance in terms of CPU t
driver/D3D
Expecting vast reduction in driver complexity, hopefully mo
drivers
Expecting less frames to be queued, generally more respon
Shouldnt have frame hitches caused by driver at all.
8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform
38/38