Page 2: An Optimized Diffusion Depth Of Field Solver

An Optimized Diffusion Depth Of Field Solver (DDOF)

28th February 2011, AMD's Favorite Effects

Holger Gruen – AMD

Page 3: An Optimized Diffusion Depth Of Field Solver

Agenda
• Motivation
• Recap of a high-level explanation of DDOF
• Recap of earlier DDOF solvers
• A vanilla Cyclic Reduction (CR) DDOF solver
• A DX11-optimized CR solver for DDOF
• Results

Page 4: An Optimized Diffusion Depth Of Field Solver

Motivation
• The solver presented at GDC 2010 [RS2010] has some weaknesses
• It is a great implementation, but its memory requirements and runtime are too high for many game developers
• We are looking for a faster and more memory-efficient solver

Page 5: An Optimized Diffusion Depth Of Field Solver

Diffusion DOF recap 1
• DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC at each pixel into account
• It interprets the input image as a heat distribution
• It uses the CoC at a pixel to derive a per-pixel heat conductivity

CoC = Circle of Confusion

Page 6: An Optimized Diffusion Depth Of Field Solver

Diffusion DOF recap 2
• Blurring is done by time-stepping a differential equation that models the diffusion of heat
• The ADI method is used to arrive at a separable solution for stepping
• We need to solve a tri-diagonal linear system for each row and then for each column of the input

Page 7: An Optimized Diffusion Depth Of Field Solver

DDOF Tri-diagonal system

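$$
\begin{pmatrix}
b_1 & c_1 &        &        & 0 \\
a_2 & b_2 & c_2    &        &   \\
    & a_3 & b_3    & c_3    &   \\
    &     & \ddots & \ddots & \ddots \\
0   &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
$$

• x_i: row/col of the input image
• a_i, b_i, c_i: derived from the CoC at each pixel of an input row/col
• y_i: resulting blurred row/col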
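The slides do not spell out how a, b and c follow from the CoC. The sketch below is one plausible construction in the spirit of [Kass2006], under loudly stated assumptions: the conductivity is taken as the squared CoC, the coupling between two neighbouring pixels is the smaller of their conductivities, and `build_abc` is a hypothetical helper name, not part of the original sources.

```python
def build_abc(coc):
    """Build tri-diagonal coefficients for one row/col from per-pixel CoC values.

    Assumption (not from the slides): conductivity beta = CoC squared, and the
    link between pixels i and i+1 uses min(beta[i], beta[i+1]) so that sharp
    pixels do not receive heat from blurry neighbours.
    """
    n = len(coc)
    beta = [v * v for v in coc]
    link = [min(beta[i], beta[i + 1]) for i in range(n - 1)]
    a = [0.0] * n  # sub-diagonal, a[0] stays 0
    b = [1.0] * n  # main diagonal
    c = [0.0] * n  # super-diagonal, c[n-1] stays 0
    for i in range(n):
        if i > 0:
            a[i] = -link[i - 1]
            b[i] += link[i - 1]
        if i < n - 1:
            c[i] = -link[i]
            b[i] += link[i]
    return a, b, c
```

With this construction every row satisfies a_i + b_i + c_i = 1, so a constant image stays constant, and a pixel with CoC 0 gets the identity row (a = c = 0, b = 1) and is left unblurred.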

Page 8: An Optimized Diffusion Depth Of Field Solver

Solver recap 1
• The GDC2010 solver [RS2010] is a 'hybrid' solver
– It performs three PCR steps upfront
– It performs the serial 'Sweep' algorithm to solve the small resulting systems
– Check [ZCO2010] for details on other hybrid solvers
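As a rough CPU-side illustration of the hybrid idea (a few PCR steps, then a serial sweep per remaining subsystem), here is a minimal sketch. It is not the [RS2010] shader code; the function names are hypothetical, and it assumes a[0] == 0 and c[n-1] == 0 as in the tri-diagonal system above.

```python
def pcr_step(a, b, c, x, stride):
    """One parallel cyclic reduction step: equation i ends up coupling
    only unknowns i - 2*stride, i and i + 2*stride."""
    n = len(b)
    na, nb, nc, nx = [0.0] * n, list(b), [0.0] * n, list(x)
    for i in range(n):  # every i is independent, hence 'parallel'
        lo, hi = i - stride, i + stride
        if lo >= 0:
            k = a[i] / b[lo]
            na[i] = -k * a[lo]
            nb[i] -= k * c[lo]
            nx[i] -= k * x[lo]
        if hi < n:
            k = c[i] / b[hi]
            nc[i] = -k * c[hi]
            nb[i] -= k * a[hi]
            nx[i] -= k * x[hi]
    return na, nb, nc, nx


def sweep(a, b, c, x):
    """Serial sweep (Thomas algorithm). cp and xp are the modified
    coefficients that a GPU version has to keep in a scratch-pad."""
    n = len(b)
    cp, xp = [0.0] * n, [0.0] * n
    cp[0], xp[0] = c[0] / b[0], x[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        xp[i] = (x[i] - a[i] * xp[i - 1]) / m
    y = [0.0] * n
    y[-1] = xp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = xp[i] - cp[i] * y[i + 1]
    return y


def hybrid_solve(a, b, c, x, pcr_steps=3):
    """PCR steps split the system into 2**pcr_steps interleaved systems,
    each of which is then solved with the serial sweep."""
    stride = 1
    for _ in range(pcr_steps):
        a, b, c, x = pcr_step(a, b, c, x, stride)
        stride *= 2
    y = [0.0] * len(b)
    for start in range(stride):
        idx = list(range(start, len(b), stride))
        if not idx:
            continue
        sub = sweep([a[i] for i in idx], [b[i] for i in idx],
                    [c[i] for i in idx], [x[i] for i in idx])
        for i, v in zip(idx, sub):
            y[i] = v
    return y
```

With three PCR steps a row of n pixels still leaves eight serial sweeps of roughly n/8 unknowns each, which is the workload the next slide criticises.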

Page 9: An Optimized Diffusion Depth Of Field Solver

Solver recap 2
• The GDC2010 solver [RS2010] has drawbacks
– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
  • GPUs without a RW cache will suffer
– For high resolutions, three PCR steps produce tri-diagonal systems of substantial size
  • This means a serial (sweep) algorithm is run on a 'big' system

Page 10: An Optimized Diffusion Depth Of Field Solver

Solver recap 3
• Cyclic Reduction (CR) solver
– Used by [Kass2006] in the original DDOF paper
– Runs in two phases
  1. reduction phase
  2. backward substitution phase
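The two phases can be sketched on the CPU as follows. This is a minimal recursive illustration, not the shader implementation: `cr_solve` is a hypothetical name, and the code assumes a[0] == 0 and c[n-1] == 0.

```python
def cr_solve(a, b, c, x):
    """Solve a tri-diagonal system with cyclic reduction."""
    n = len(b)
    if n == 1:
        return [x[0] / b[0]]

    # Reduction phase: keep the odd-indexed unknowns and eliminate their
    # even-indexed neighbours, which halves the system size.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        na = -k1 * a[i - 1]
        nb = b[i] - k1 * c[i - 1]
        nc = 0.0
        nx = x[i] - k1 * x[i - 1]
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            nb -= k2 * a[i + 1]
            nc = -k2 * c[i + 1]
            nx -= k2 * x[i + 1]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rx.append(nx)

    half = cr_solve(ra, rb, rc, rx)

    # Backward substitution phase: the odd unknowns are known now, so each
    # even unknown follows directly from its own original equation.
    y = [0.0] * n
    for j, v in enumerate(half):
        y[2 * j + 1] = v
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Each recursion level corresponds to one reduce pass on the way down and one substitute pass on the way back up.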

Page 11: An Optimized Diffusion Depth Of Field Solver

Solver recap 4
• According to [ZCO2010]:
– The CR solver has the lowest computational complexity of all solvers
– It suffers from a lack of parallelism, though
  • At the end of the reduction phase
  • At the start of the backward substitution phase

Page 12: An Optimized Diffusion Depth Of Field Solver

Passes of a Vanilla CR Solver

[Diagram: Pass 1 constructs the coefficients abc from the CoC of the input image X. Together with X this sets up the tri-diagonal system (a_i, b_i, c_i; y_i; x_i) for each row/col.]

Page 13: An Optimized Diffusion Depth Of Field Solver

Passes of a Vanilla CR Solver

[Diagram: input image X -> Pass 1: construct abc from CoC -> four reduce passes -> stop at size 1 and solve for the first y -> substitute passes -> blurred image Y.]

Page 14: An Optimized Diffusion Depth Of Field Solver

Vanilla Solver Results
• Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
• Memory footprint prohibitively high
– >200 MB at 1600x1200
• Need an answer to the lack-of-parallelism problem
– An answer is given in [ZCO2010]

Page 15: An Optimized Diffusion Depth Of Field Solver

Vanilla CR Solver

[Diagram: input image X -> Pass 1: construct abc from CoC -> four reduce passes -> stop at size 1 and solve for the first y -> substitute passes -> blurred image Y. Annotation: This is what kills parallelism.]

Page 16: An Optimized Diffusion Depth Of Field Solver

Keeping the parallelism high

[Diagram: input image X -> Pass 1: construct abc from CoC -> reduce passes -> stop at a reasonable size -> solve for Y at that resolution -> substitute passes -> blurred image Y.]

Stop at a reasonable size and solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010]).
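A minimal CPU sketch of this idea: reduce only until the system is at most `stop_size` unknowns long, solve that system with PCR, then substitute back. The names and the default `stop_size` are hypothetical, the code assumes a[0] == 0 and c[n-1] == 0, and the real solver runs these steps as GPU passes.

```python
def reduce_once(a, b, c, x):
    """One CR reduction pass: keep the odd-indexed unknowns."""
    n = len(b)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        na, nb, nc = -k1 * a[i - 1], b[i] - k1 * c[i - 1], 0.0
        nx = x[i] - k1 * x[i - 1]
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            nb -= k2 * a[i + 1]
            nc = -k2 * c[i + 1]
            nx -= k2 * x[i + 1]
        ra.append(na); rb.append(nb); rc.append(nc); rx.append(nx)
    return ra, rb, rc, rx


def pcr_solve(a, b, c, x):
    """Full PCR: every step keeps all n equations busy in parallel."""
    n, stride = len(b), 1
    while stride < n:
        na, nb, nc, nx = [0.0] * n, list(b), [0.0] * n, list(x)
        for i in range(n):
            lo, hi = i - stride, i + stride
            if lo >= 0:
                k = a[i] / b[lo]
                na[i] = -k * a[lo]; nb[i] -= k * c[lo]; nx[i] -= k * x[lo]
            if hi < n:
                k = c[i] / b[hi]
                nc[i] = -k * c[hi]; nb[i] -= k * a[hi]; nx[i] -= k * x[hi]
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    return [x[i] / b[i] for i in range(n)]  # all equations are decoupled


def cr_pcr_solve(a, b, c, x, stop_size=4):
    """CR that stops at stop_size (>= 1) and hands over to PCR."""
    levels = []
    while len(b) > stop_size:
        levels.append((a, b, c, x))
        a, b, c, x = reduce_once(a, b, c, x)
    y = pcr_solve(a, b, c, x)
    for la, lb, lc, lx in reversed(levels):  # substitute passes
        n = len(lb)
        full = [0.0] * n
        full[1::2] = y
        for i in range(0, n, 2):
            left = la[i] * full[i - 1] if i > 0 else 0.0
            right = lc[i] * full[i + 1] if i + 1 < n else 0.0
            full[i] = (lx[i] - left - right) / lb[i]
        y = full
    return y
```

The choice of `stop_size` trades reduce and substitute passes against the amount of work done in the PCR solve. The slides only say "a reasonable size", so any concrete value is a tuning assumption.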

Page 17: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 1

[Diagram: input image X -> Pass 1: construct abc from CoC -> reduce passes -> stop at a reasonable size -> solve for Y at that resolution -> substitute passes -> blurred image Y.]

Page 18: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 1

[Diagram: the same pass chain, annotated with surface formats. The X, abc and Y surfaces are all rgba32f.]

Page 19: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 1

[Diagram: the same pass chain, annotated with surface formats. The X and Y surfaces are now rgba16f; the abc surfaces stay rgba32f.]

Page 20: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 1

[Diagram: the same pass chain, annotated with surface formats. The X and Y surfaces are rgba16f; the abc surfaces stay rgba32f.]

This saves a significant amount of memory. We found no artifacts when going from rgba32f to rgba16f.
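A small sketch of what the format change means in numbers. It only illustrates the per-surface size and the size of the half-float rounding error; it does not prove the absence of artifacts. The helper names are hypothetical, and Python's `struct` format `'e'` is used as a stand-in for the GPU's 16-bit float.

```python
import struct


def to_half_and_back(v):
    """Round a value to IEEE 754 half precision and convert it back."""
    return struct.unpack('<e', struct.pack('<e', v))[0]


def surface_megabytes(width, height, bytes_per_texel):
    """Size of one full-resolution surface in MB (1 MB = 1024 * 1024 bytes)."""
    return width * height * bytes_per_texel / (1024 * 1024)


# rgba32f stores 4 x 4 bytes per texel, rgba16f stores 4 x 2 bytes.
size_32f = surface_megabytes(1600, 1200, 16)
size_16f = surface_megabytes(1600, 1200, 8)

# Worst relative rounding error over all 8-bit colour levels k / 255.
worst = max(abs(to_half_and_back(k / 255.0) - k / 255.0) / (k / 255.0)
            for k in range(1, 256))
```

Every rgba16f surface is half the size of its rgba32f counterpart, and for normalised colour values the relative rounding error of a half float stays within 2^-11.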

Page 21: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 2

[Diagram: the same pass chain, annotated with surface formats. The X and Y surfaces are rgba16f; the abc surfaces are rgba32f.]

This again saves a significant amount of memory, as this is the biggest surface used by the solver.

Page 22: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 2

[Diagram: the pass chain without a separate abc construction pass. The X and Y surfaces are rgba16f; the reduced abc surfaces are rgba32f.]

Skip the abc construction pass and compute abc on the fly during the first reduction pass.

Page 23: An Optimized Diffusion Depth Of Field Solver

Intermediate Results 1600x1200

Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132

Page 24: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 3

[Diagram: the pass chain without a separate abc construction pass. The X and Y surfaces are rgba16f; the reduced abc surfaces are rgba32f.]

Skip the abc construction pass and compute abc during the first reduction pass.

Yet again this saves a significant amount of memory!

Page 25: An Optimized Diffusion Depth Of Field Solver

Memory Optimizations 3

[Diagram: input X (rgba16f) -> reduce4 -> stop at a reasonable size -> solve for Y at that resolution -> substitute passes -> substitute4 -> Y (rgba16f).]

Skip the abc construction pass and compute abc during the first reduction pass.

Reduce 4-to-1 in a special first reduction pass.

Substitute 1-to-4 in a special substitution pass.
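One way to read the 4-to-1 pass is as two ordinary 2-to-1 reductions fused into one, with the intermediate half-resolution equations recomputed on the fly instead of being stored. The sketch below follows that reading; whether the actual shader is organised this way is an assumption, and `reduced_row` and `reduce4` are hypothetical names. It assumes a[0] == 0 and c[n-1] == 0.

```python
def reduced_row(a, b, c, x, i):
    """Equation for unknown i (i >= 1) after one 2-to-1 reduction step,
    computed from the neighbouring equations without storing the level."""
    k1 = a[i] / b[i - 1]
    ra = -k1 * a[i - 1]
    rb = b[i] - k1 * c[i - 1]
    rc = 0.0
    rx = x[i] - k1 * x[i - 1]
    if i + 1 < len(b):
        k2 = c[i] / b[i + 1]
        rb -= k2 * a[i + 1]
        rc = -k2 * c[i + 1]
        rx -= k2 * x[i + 1]
    return ra, rb, rc, rx


def reduce4(a, b, c, x):
    """Fused 4-to-1 reduction: keep unknowns 3, 7, 11, ... only.

    Returns one (a, b, c, x) tuple per kept unknown. The half-resolution
    level is never materialised, which is where the memory saving comes from.
    """
    n = len(b)
    out = []
    for i in range(3, n, 4):
        la, lb, lc, lx = reduced_row(a, b, c, x, i - 2)  # left half-res row
        ma, mb, mc, mx = reduced_row(a, b, c, x, i)      # centre half-res row
        k1 = ma / lb
        ra, rb, rc, rx = -k1 * la, mb - k1 * lc, 0.0, mx - k1 * lx
        if i + 2 < n:
            ha, hb, hc, hx = reduced_row(a, b, c, x, i + 2)  # right row
            k2 = mc / hb
            rb -= k2 * ha
            rc = -k2 * hc
            rx -= k2 * hx
        out.append((ra, rb, rc, rx))
    return out
```

The trade-off is extra arithmetic per output texel (three half-resolution rows are recomputed for each kept unknown) in exchange for not allocating the half-resolution surfaces.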

Page 26: An Optimized Diffusion Depth Of Field Solver

Intermediate Results 1600x1200

Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
4-to-1 Reduction | 2.87 | 3.32 | ~73

Page 27: An Optimized Diffusion Depth Of Field Solver

DX11 Memory Optimizations 1

[Diagram: input X (rgba16f) -> reduce4 -> stop at a reasonable size -> solve for Y at that resolution -> substitute passes -> substitute4 -> Y (rgba16f), with abc computed during the first reduction pass.]

Page 28: An Optimized Diffusion Depth Of Field Solver

DX11 Memory Optimizations 1

[Diagram: input X (rgba16f) -> reduce4 -> stop at a reasonable size -> solve for Y at that resolution -> substitute passes -> substitute4 -> Y (rgba16f), with abc computed during the first reduction pass.]

Pack abc and X into one rgba_uint surface.

Page 29: An Optimized Diffusion Depth Of Field Solver

Using SM5 for data packing

[Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels.]

Pack the x and y channels of X into one uint:

(f32tof16(X.x) + (f32tof16(X.y) << 16))

Page 30: An Optimized Diffusion Depth Of Field Solver

Using SM5 for data packing

[Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels.]

Pack the lower 6 bits of the X.z half float together with the higher 26 bits of abc.x:

(asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)

Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z.

Page 31: An Optimized Diffusion Depth Of Field Solver

Using SM5 for data packing

[Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels.]

Pack the central 6 bits of the X.z half float together with the higher 26 bits of abc.y:

(asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)

Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z.

Page 32: An Optimized Diffusion Depth Of Field Solver

SM5 Memory Optimizations 1

[Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels.]

Pack the highest 4 bits of the X.z half float together with the higher 26 bits of abc.z:

(asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F)

Steal the 6 lowest mantissa bits of abc.z to store the remaining bits of X.z.
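The packing expressions above can be checked on the CPU with a small emulation. This is a sketch under assumptions: the HLSL intrinsics `f32tof16` and `asuint` are emulated with Python's `struct` module (GPU rounding may differ), the order of the four uints and the `unpack` routine are hypothetical, and only the pack expressions themselves come from the slides.

```python
import struct

KEEP = 0xFFFFFFC0  # keeps the upper 26 bits of a 32-bit float


def f32tof16(v):
    """Bit pattern of v as a 16-bit half float (emulates HLSL f32tof16)."""
    return struct.unpack('<H', struct.pack('<e', v))[0]


def f16tof32(bits):
    """Value of a 16-bit half float bit pattern (emulates HLSL f16tof32)."""
    return struct.unpack('<e', struct.pack('<H', bits & 0xFFFF))[0]


def asuint(v):
    """Bit pattern of v as a 32-bit float (emulates HLSL asuint)."""
    return struct.unpack('<I', struct.pack('<f', v))[0]


def asfloat(bits):
    """Value of a 32-bit float bit pattern (emulates HLSL asfloat)."""
    return struct.unpack('<f', struct.pack('<I', bits))[0]


def pack(abc, X):
    """Pack abc (3 floats) and X (3 colour channels) into four uints."""
    xz = f32tof16(X[2])
    return (
        (asuint(abc[0]) & KEEP) | (xz & 0x3F),          # X.z bits 0..5
        (asuint(abc[1]) & KEEP) | ((xz >> 6) & 0x3F),   # X.z bits 6..11
        (asuint(abc[2]) & KEEP) | ((xz >> 12) & 0x3F),  # X.z bits 12..15
        f32tof16(X[0]) + (f32tof16(X[1]) << 16),        # X.x and X.y
    )


def unpack(p):
    """Inverse of pack (hypothetical; not shown on the slides)."""
    abc = tuple(asfloat(u & KEEP) for u in p[:3])
    xz = (p[0] & 0x3F) | ((p[1] & 0x3F) << 6) | ((p[2] & 0x3F) << 12)
    X = (f16tof32(p[3] & 0xFFFF), f16tof32(p[3] >> 16), f16tof32(xz))
    return abc, X
```

X survives the round trip at half precision, while each abc coefficient loses its 6 lowest mantissa bits, a relative error below 2^-17.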

Page 33: An Optimized Diffusion Depth Of Field Solver

Sample Screenshot


Page 34: An Optimized Diffusion Depth Of Field Solver

Abs(Packed-Unpacked) x 255.0f


Page 35: An Optimized Diffusion Depth Of Field Solver

DX11 Memory Optimizations 2
• The solver does a horizontal and a vertical pass
• The chain of lower-res RTs needs to be there twice
– Horizontal reduction/substitution chain
– Vertical reduction/substitution chain
• How can DX11 help?

Page 36: An Optimized Diffusion Depth Of Field Solver

DX11 Memory Optimizations 2
• UAVs allow us to reuse data of the horizontal chain for the vertical chain
• A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly
– ~40% lower fps
• Stayed with RTs as memory was already quite low
• Use only if you are really concerned about memory

Page 37: An Optimized Diffusion Depth Of Field Solver

Final Results 1600x1200

Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
4-to-1 Reduction | 2.87 | 3.32 | ~73
4-to-1 Reduction + SM5 Packing | 2.75 | 3.14 | ~58

Page 38: An Optimized Diffusion Depth Of Field Solver

Future Work
• Look into CS acceleration of the solver
– 4-to-1 reduction pass
– 1-to-4 substitution pass
• Look into using heat diffusion for other effects
– e.g. motion blur

Page 39: An Optimized Diffusion Depth Of Field Solver

Conclusion
• The optimized CR solver is fast and memory-efficient
– Used in Dragon Age 2
– 4A Games is considering its use for new projects
– Detailed description in 'Game Engine Gems 2'
• Mail me ([email protected]) if you want access to the sources

Page 40: An Optimized Diffusion Depth Of Field Solver

References
• [Kass2006] "Interactive depth of field using simulated diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar technical memo #06-01
• [ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010
• [RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010
• [Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010

Page 41: An Optimized Diffusion Depth Of Field Solver

Backup


Page 42: An Optimized Diffusion Depth Of Field Solver

Results 1920x1200

Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
Standard solver (already skips high-res abc construction) | 4.31 | 4.03 | ~158
4-to-1 Reduction | 3.36 | 4.02 | ~88
4-to-1 Reduction + SM5 Packing | 3.23 | 3.79 | ~70