An Optimized Diffusion Depth Of Field Solver

Post on 20-May-2015

1.200 views 0 download

Tags:

Transcript of An Optimized Diffusion Depth Of Field Solver

An Optimized Diffusion Depth Of Field Solver (DDOF)

28th February 2011 2AMD‘s Favorite Effects

Holger Gruen – AMD

Agenda• Motivation• Recap of a high-level explanation of DDOF• Recap of earlier DDOF solvers• A Vanilla Cyclic Reduction(CR) DDOF solver• A DX11 optimized CR solver for DDOF• Results

28th February 2011 AMD‘s Favorite Effects 3

Motivation• Solver presented at GDC 2010 [RS2010] has

some weaknesses• Great implementation but memory reqs and

runtime too high for many game developers• Looking for faster and memory efficient solver

28th February 2011 AMD‘s Favorite Effects 4

Diffusion DOF recap 1• DDOF is an enhanced way of blurring a picture

taking an arbitrary CoC at a pixel into account• Interprets input image as a heat distribution• Uses the CoC at a pixel to derive a per pixel

heat conductivity CoC=Circle of Confusion

28th February 2011 AMD‘s Favorite Effects 5

Diffusion DOF recap 2• Blurring is done by time stepping a differential

equation that models the diffusion of heat• ADI method used to arrive at a separable

solution for stepping• Need to solve tri-diagonal linear system for

each row and then each colum of the input28th February 2011 AMD‘s Favorite Effects 6

DDOF Tri-diagonal system

28th February 2011 AMD‘s Favorite Effects 7

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0

0 n n n n

b c y x

a b c y x

a b c y x

a b y x

• row/col of inputimage

• derived from CoC at each pixel of aninput row/col

• resulting blurred row/col

Solver recap 1• The GDC2010 solver [RS2010] is a ‚hybrid‘ solver

– Performs three PCR steps upfront– Performs serial ‚Sweep‘ algorithm to solve small

resulting systems– Check [ZCO2010] for details on other hybrid

solvers

28th February 2011 AMD‘s Favorite Effects 8

Solver recap 2• The GDC2010 solver [RS2010] has drawbacks

– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm

• GPUs without RW cache will suffer

– For high resolutions three PCR steps produce tri-diagonal system of substantial size

• This means a serial (sweep) algorithm is run on a ‚big‘ system

28th February 2011 AMD‘s Favorite Effects 9

Solver recap 3• Cyclic Reduction (CR) solver

– Used by [Kass2006] in the original DDOF paper– Runs in two phases

1. reduction phase2. backward substitution phase

28th February 2011 AMD‘s Favorite Effects 10

Solver recap 4• According to [ZCO2010]:

– CR solver has lowest computational complexity of all solvers

– It suffers from lack of parallelism though • At the end of the reduction phase• At the start of the backwards substitution phase

28th February 2011 AMD‘s Favorite Effects 11

Passes of a Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 12

Input imageX

Pass 1: construct from CoC

abc

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0

0 n n n n

b c y x

a b c y x

a b c y x

a b y x

Passes of a Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 13

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at size 1Solve for the first y

Y substitutesubstitute

Blurred image

Vanilla Solver Results• Higher performance than reported in

[Bavoil2010] (~6 ms vs. ~8ms at 1600x1200)

• Memory footprint prohibitively high – >200 MB at 1600x1200

• Need an answer to tackling the lack of parallelism problem – answer given in [ZCO2010]

28th February 2011 AMD‘s Favorite Effects 14

Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 15

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at size 1Solve for the first y

Y substitutesubstitute

Blurred image

This is what kills

parallelism

Keeping the parallelism high

28th February 2011 AMD‘s Favorite Effects 16

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution to have a big enough parallel workload (e.g using PCR see [ZCO2010])

Y substitutesubstitute

Blurred image

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 17

Input imageX

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

Blurred image

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 18

rgab32fX

rgab32fabc

rgab32f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba32f rgab32fsubsti-tute

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 19

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 20

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

This saves some significant amount of memory - We found no artifacts for going from rgba32f to rgba16f

Memory Optimizations 2

28th February 2011 AMD‘s Favorite Effects 21

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

This does again save a significant amount of memory as this is the biggest surface used by the solver

Memory Optimizations 2

28th February 2011 AMD‘s Favorite Effects 22

rgab16fX

abc

rgab16f

rgab32f

reduce reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

Skip abc construction pass and compute abc on-the-fly during 1. reduction pass

Intermediate Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 23

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

Memory Optimizations 3

28th February 2011 AMD‘s Favorite Effects 24

rgab16fX

abc

rgab16f

rgab32f

reduce reduce

reduce

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f rgab16fsubsti-tute

Skip abc construction pass compute abc during 1. reduction pass

Yet again this saves a significant amount of memory !

Memory Optimizations 3

28th February 2011 AMD‘s Favorite Effects 25

rgab16fX

abc

reduce4

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f

substitute4

Skip abc construction pass compute abc during 1. reduction pass

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

Intermediate Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 26

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

4–to-1 Reduction 2.87 3.32 ~73

DX11 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 27

rgab16fX

abc

reduce4

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f

substitute4

Skip abc construction pass compute abc during 1. reduction pass

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

DX11 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 28

rgab16fX

abc

reduce4

Stop at a reasonable size

Solve for Y at that resolution

Y substitutesubstitute

rgba16f

substitute4

Skip abc construction pass compute abc during 1. reduction pass

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

Pack abc and X into one rgba_uint surface

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 29

rgab16fX

rgab32fabc

uint

uint

uint

uint

pack x,y channel

(f32tof16(X.x) + (f32tof16(X.y) << 16))

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 30

rgab16fX

rgab32fabc

uint

uint

uint

uint

lower 5 bits of z channel

higher 27 bits of x channel

pack

(asuint(abc.x) &0xFFFFFFC0) | (f32tof16(X.z) & 0x3F))

Steal 6 lowest mantissa bits of abc.x to store some bits of X.z

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 31

rgab16fX

rgab32fabc

uint

uint

uint

uint

central 5 bits of z channel

higher 27 bits of y channel

pack

(asuint(abc.y) &0xFFFFFFC0) | ((f32tof16(X.z) >>6 )& 0x3F))

Steal 6 lowest mantissa bits of abc.y to store some bits of X.z

SM5 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 32

rgab16fX

rgab32fabc

uint

uint

uint

uint

higher 5 bits of z channel

higher 27 bits of z channel pack

(asuint(abc.z) &0xFFFFFFC0) | ((f32tof16(X.z) >>12 )& 0x3F))

Steal 6 lowest mantissa bits of abc.z to store some bits of X.z

Sample Screenshot

28th February 2011 AMD‘s Favorite Effects 33

Abs(Packed-Unpacked) x 255.0f

28th February 2011 AMD‘s Favorite Effects 34

DX11 Memory Optimizations 2• Solver does a horizonal and vertical pass• Chain of lower res RTs needs to be there twice

– Horizontal reduction/substitution chain– Vertical reduction/substitution chain

• How can DX11 help?

28th February 2011 AMD‘s Favorite Effects 35

DX11 Memory Optimizations 2• UAVs allow us to reuse data of the horizontal

chain for the vertical chain• A proof of concept implementation shows that this

works nicely but impacts the runtime significantly – ~40% lower fps

• Stayed with RTs as memory was already quite low• Use only if you are really concerned about memory

28th February 2011 AMD‘s Favorite Effects 36

Final Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 37

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate,)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

4–to-1 Reduction 2.87 3.32 ~73

4-to-1 Reduction + SM5 Packing 2.75 3.14 ~58

Future Work• Look into CS acceleration of the solver

– 4-to-1 reduction pass– 1-to-4 substitution pass

• Look into using heat diffusion for other effects– e.g. Motion blur

28th February 2011 AMD‘s Favorite Effects 38

Conclusion• Optimized CR solver is fast and mem-efficient

– Used in Dragon Age 2– 4aGames considering its use for new projects– Detailed description in ‚Game Engine Gems 2‘

• Mail me (holger.gruen@amd.com) if you want access to the sources

28th February 2011 AMD‘s Favorite Effects 39

References• [Kass2006] “Interactive depth of field using simulated diffusion on a GPU”

Michael Kass, Pixar Animation studios, Pixar technical memo #06-01• [ZCO2010] “Fast Tridiagonal Solvers on the GPU” Y. Zhang, J. Cohen, J. D.

Owens, PPoPP 2010• [RS2010] “DX11 Effects in Metro 2033: The Last Refuge” A. Rege, O.

Shishkovtsov, GDC 2010• [Bavoil2010] „Modern Real-Time Rendering Techniques“, L. Bavoil,

FGO2010

28th February 2011 AMD‘s Favorite Effects 40

Backup

28th February 2011 AMD‘s Favorite Effects 41

Results 1920x1200

28th February 2011 AMD‘s Favorite Effects 42

Solver Time in ms Memory in MegabytesHD5870 GTX480

Standard Solver (already skips high res abc construction)

4.31 4.03 ~158

4–to-1 Reduction 3.36 4.02 ~88

4-to-1 Reduction + SM5 Packing 3.23 3.79 ~70