D3D11 Deferred Contexts - NVIDIA Developer
-
Upload
nguyenkhue -
Category
Documents
-
view
239 -
download
2
Transcript of D3D11 Deferred Contexts - NVIDIA Developer
![Page 1: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/1.jpg)
D3D11 Deferred Contexts Primer & Best Practices
Bryan Dudash Developer Technology, NVIDIA
![Page 2: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/2.jpg)
Agenda
● Discussions on bottlenecks
● What are these ―deferred contexts‖?
● Best Practices
● Anecdotes
● Final Thoughts
![Page 3: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/3.jpg)
Bottlenecks
![Page 4: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/4.jpg)
Game Engines are Complex
● Many possible bottlenecks ● CPU
● Game code bottleneck ● D3D11 Runtime bottleneck ● Driver code bottleneck
● GPU ● Shading, Texture, etc etc ● Blending
● Bandwidth ● Texture and Buffer updates
![Page 5: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/5.jpg)
CPU bottleneck
● This talk is about CPU bottlenecks ● Specifically code around rendering
● Other bottlenecks well covered by previous talks ● ―DirectX11 Performance Reloaded‖ ● Nick Thibieroz, AMD ● Holger Gruen, NVIDIA
![Page 6: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/6.jpg)
Our target case
Application producer thread
Driver
D3D API command - Draw command, state setting etc.
Mapped buffer uploads - Buffer updates
Non-D3D workloads - Anything else
* Cool diagram blatantly borrowed from ―DirectX11 Performance Reloaded Talk‖
• Not feeding draw commands to driver fast enough • Not ideal way to drive performance
![Page 7: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/7.jpg)
What is a ―Deferred Context‖ ● ID3D11DeviceContext that does not immediately issue commands invoked on it
● Called a ―deferred context‖ or ―DC‖ ● All commands are deferred until later
● ―Finished― into a ID3D11CommandList ● ID3D11CommandList is executed later on immediate context (―IC‖)
● Supported on all D3D11 hardware ● Possibly through emulation in D3D11 runtime
● Check direct driver support with:
.
struct D3D11_FEATURE_DATA_THREADING {
BOOL DriverConcurrentCreates;
BOOL DriverCommandLists;
} D3D11_FEATURE_DATA_THREADING
![Page 8: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/8.jpg)
Simple Pseudo code example* IC Render Thread
ID3D11Device* pd3dDevice;
ID3D11DeviceContext* pd3dImmediateContext;
ID3D11DeviceContext* pd3dDeferredContext = NULL;
ID3D11CommandList* pd3dCommandList = NULL;
// Make ourselves a shiny new DC
pd3dDevice->CreateDeferredContext
( 0 , &pd3dDeferredContext );
loop { // our frame loop
// Some IC rendering or other setup
// Indicate to other thread to start rendering to DC
// with Event or other threading construct
// Possibly do some unrelated IC work
// Wait for completion of DC thread(s)
// Execute all deferred commands
pd3dImmediateContext->ExecuteCommandList
( pd3dCommandList , FALSE);
// More IC rendering and back buffer swap
}
Some Worker Thread
// Traverse scene graph and
// render some stuff to deferred context
// Create a command list with
// all commands since previous finish call
pd3dDeferredContext->FinishCommandList
( FALSE, &pd3dCommandList );
* Don‘t write an implementation that looks like this. This is just meant to
show you the D3D11 interfaces used.
![Page 9: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/9.jpg)
Another Simple Example – Jobs
DC Pool
Allocate DC to Group ―B‖
Allocate DC to Group ―A‖
Traverse Scene Rendering to
Group ―B‖
Dependency
Render Object to ―A‖ Render Object to ―A‖ Render Object to ―B‖ Render Object to ―A‖ Render Object to ―A‖ Render Object to ―A‖
Traverse Scene Rendering to
Group ―A‖
Finalize Command List
―A‖
Finalize Command List
―B‖
![Page 10: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/10.jpg)
What does using a DC enable? ● Lower CPU bottleneck*
● By de-serializing app render,d3d runtime and driver work
● Thread out runtime D3D calls onto as many threads as you like.
● Can simplify a jobs solution ● Reduced app/driver sync time
● The Good/Bad ● +Facilitates parallelization of scene traversal ● +Parallelizes runtime API calls ● +Parallelizes buffer updates ● -Redundant state overhead
●Avoidable depending on grouping
* There are tons of caveats we‘ll cover in the Best Practices section
![Page 11: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/11.jpg)
What can‘t I do with DCs? ● You knew this was coming, right?
● DCs are a ―fire & forget‖ model
● Deferred Contexts cannot get any feedback from the GPU ● Query data cannot be retrieved.
● No device state inheritance or transmission
● Always starts with default device state ● Always leaves with default device state ● However global state (textures, buffers, etc) persists
● Across IC/Execute
● Only addresses CPU bottlenecks
![Page 12: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/12.jpg)
Inherited Object State
● Global state of objects is inherited between contexts
● Texture data, constant data, queries
● Display lists ● Fill once, use multiple
times
v Operation VB
data
IC: write(A) A
CL execute (next 4 operations) A
-- CL Map(discard) – write(B) B
-- CL Map(discard) – write(C) C
-- CL Draw C
-- CL Map(discard) – write(D) D
IC Draw D
IC Map(discard) – write(E) E
![Page 13: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/13.jpg)
Manual Command Lists ● Application custom threaded command lists
● Manually capture all data required to issue D3D11 calls ● Replay on IC
● Token+Replay is what D3D11 emulation does ● If driver doesn‘t support command lists directly ● Be careful of I$ thrashing from branchy replays
●Branch mispredicts
● The good/bad
● +Allows you to parallelize scene traversal ● +Allows more efficient render state reuse ● +Can be lock-free and guarantee no allocations/deallocations during replay ● - Does not parallelize runtime API calls ● - Does not parallelize buffer updates on app thread
● Driver still able to parallelize these
● - Watch out for thread sync issues
![Page 14: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/14.jpg)
Some Numbers
All numbers run on: - In house DC test application - Notebook Core i7 2670QM @ 2.2GHz - 16GB RAM, - GeForce GTX560M
![Page 15: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/15.jpg)
0
5
10
15
20
25
Fram
e Ti
me(
ms)
Draw Call Count
Batched DC versus IC * Run with custom DC test application
Batched DC IC
![Page 16: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/16.jpg)
0
2
4
6
8
10
12
14
16
1
50
0
10
00
15
00
20
00
25
00
30
00
35
00
40
00
45
00
50
00
55
00
60
00
65
00
70
00
75
00
80
00
85
00
90
00
95
00
10
00
0
10
50
0
11
00
0
11
50
0
12
00
0
12
50
0
13
00
0
13
50
0
14
00
0
14
50
0
15
00
0
15
50
0
16
00
0
16
50
0
17
00
0
17
50
0
18
00
0
18
50
0
19
00
0
19
50
0
20
00
0
Fram
e Ti
me(
ms)
Draw Call Count
Scene vs Per-Draw vs Batched DC performance * Run with custom DC test application
Scene Per-Draw Batched
Too many thread events causing efficiency issues
![Page 17: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/17.jpg)
Best Practices
![Page 18: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/18.jpg)
Test Test Test ● Always test using the latest drivers
● Remember to test on equivalent systems ●Or the same system for best results
● Two complete render paths ● Initial render path (non-DC) ● DC command lists threaded path ● At least during internal dev to make sure you are gaining perf
● Try to test on different: ● CPUs – clock speed and cores affect CPU perf and bottlenecks ● GPUS
● Multiple generations ● Multiple IHVs – different drivers have different implementations
● Motherboards – PCIE bandwidth may affect CPU waiting
![Page 19: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/19.jpg)
Be a Good Buffer Management Citizen™
● John McDonald‘s ― Efficient Buffer Management‖ ● GDC2012 talk
● NEVER readback from the GPU
● I.e. Never use staging resource on a DC ● Will result in the map being forced onto IC
●when command list is executed
● And thus serialized ● And anything dependent on that will also be serialized
![Page 20: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/20.jpg)
NEVER set Restore Context State
● 2nd parameter to ExecuteCommandList ● If set to TRUE
● Will save and restore ALL d3d state ● Set *tons* of redundant state ● Added CPU overhead
● If set to FALSE ● Application is responsible to set what state it needs ● Likely you are already setting proper state
![Page 21: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/21.jpg)
Load Balance List Size ● Don‘t make a new DC/Commandlist for every draw call
● Really, just don‘t
● Don‘t make your command lists too short ● Should have at least a few hundred API calls
● At least dozen draws or so ● A ―standard‖ mix of buffer updates, state setting and draws.
● Don‘t make your command lists too long ● Execute of long lists may interfere with other IC calls ● Chop into multiple as some tweak-able limit
● Dependent on engine implementation ● State per call, etc ● See ―TEST TEST TEST‖ best practice
![Page 22: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/22.jpg)
Operations to Avoid
● Doing these inside a DC will affect performance adversely ● Queries
● Subsequent getData() on IC will (potentially) stall until DC exec reaches endQuery
● Readbacks/blit to staging resources ● Subsequent map() on IC will potentially stall until DC exec
reaches the blit
● Any really large one time updates ● Do these on IC
![Page 23: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/23.jpg)
Don‘t hog the CPU
● I know you want to get to 100% utilization but… ● If the driver has no headroom to process commands then your worker
threads will just be waiting…
● Driver cannot fully transform to hardware commands on DC
● Some work remains to be done on IC during command list execute ● If all cores are dominated by application, driver is starved. ● Try 2*(N-1) as well as (2*N)-1 application threads
● i.e. 6-7 on a quad core. For *all* game threads. ● Driver may or may not need a full physical core ● Test test test
![Page 24: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/24.jpg)
Don‘t muck with CPU affinity
● Will almost never offer a speedup
● Will interfere with driver‘s efficiency
● Can quickly become bottleneck
![Page 25: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/25.jpg)
Don‘t pre-clear state
● DCs provide a default state context for you! ● Clearing state is just extra busy work
● But may happen as a result of your engine‘s state management code
● Examples ● Setting shaders to NULL ● Setting SRVs to NULL ● Etc…
![Page 26: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/26.jpg)
Manage Redundant State
● A general best practice
● Spend time on threads to determine which state can be reused
● May not be true for single threaded IC
![Page 27: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/27.jpg)
Maintain a DC pool
● Initialize DCs pool with threads ● Reuse these
● DC state resets after finalize
● DCs hold memory while commands lists are ―in flight‖ ● Or longer if you don‘t release the command list! ● ~10-30MB/list/frame assuming balanced lists
● Constant buffers, state, etc
● Plus dynamic buffer updates sizes ● 32bit applications may run into address space issues for large
command lists
![Page 28: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/28.jpg)
UpdateSubResource bug
● On drivers that don‘t support command lists
● There is workaround code listed in the MC D3D11 documentation for UpdateSubResource
If your application calls UpdateSubresource on a deferred context with a destination box—to which pDstBox points—that has a non-(0,0,0) offset, where the driver does not support command lists, UpdateSubresource inappropriately applies that destination-box offset to the pSrcData parameter.
![Page 29: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/29.jpg)
Anecdotes
![Page 30: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/30.jpg)
Civilization V ● Watch Dan Baker‘s GDC2010 presentation.
● ―Firaxis‘ Civilization V : A Case Study in Scalable Performance‖
● Large multi-threaded engine ● Sometimes >10k draws per frame (w/ lots of state)
● ―n wide‖ render buffers ● Threaded out to # of cores ● Cognizant of command list sizes
●Load balance to homogenize # of calls
● DCs versus serialized execution of render commands initially gained ~50% performance
● Later non-DC path optimizations closed that gap a bit
● Saw major benefits from parallel buffer updates
![Page 31: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/31.jpg)
Other Anecdotes ● Assassin's Creed 3
● Conservatively ~24% gain from using DCs in CPU bottleneck situations ● >> in some situations
● i.e. 37 FPS -> 46 FPS ● 2.93GHZ Nehalem, GTX680, 720p
● Other engines* ● DC command lists quicker to implement than manual threading with IC
● Simpler than rolling your own token+replay
● Be careful with too many command lists ● Extra state require to set up draws ● Lint on your state calls to avoid redundant sets
● Important in non-DC case as well
● Watch out for over utilizing CPU in game code ● Driver needs some time too
* Covers common cases on various engines, so just call ‗em general anecdotes
![Page 32: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/32.jpg)
Final Thoughts
● Threading your engine == good ● Jobs/Work system == better
● Driver DC command lists ● Parallelize API calls and buffer updates ● May add overhead from extra state sets
● Amortize by grouping and state change filters
● Always test performance continuously ● To make sure you have the right solution for your game ● Test on both AMD and NVIDIA
![Page 33: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/33.jpg)
Final Thoughts(2)
● Work with your IHV ● Only you can prevent CPU bottlenecks™
● Constantly tuning driver performance for game engine workloads
● Improved directly as a result from working with Civ5 and AC3
● DC use may(should) shift bottleneck ● GPU may become bottleneck
● Driver may become bottleneck
![Page 34: D3D11 Deferred Contexts - NVIDIA Developer](https://reader034.fdocuments.in/reader034/viewer/2022051507/58a2bcc61a28abac368b4965/html5/thumbnails/34.jpg)
Questions?
Bdudash at nvidia com