An Optimized Diffusion Depth of Field Solver Optimized Diffusion... · 2012-07-23 · References...

40

Transcript of An Optimized Diffusion Depth of Field Solver Optimized Diffusion... · 2012-07-23 · References...

An Optimized Diffusion Depth Of Field Solver (DDOF)

28th February 2011 2 AMD‘s Favorite Effects

Holger Gruen – AMD

Agenda • Motivation

• Recap of a high-level explanation of DDOF

• Recap of earlier DDOF solvers

• A Vanilla Cyclic Reduction(CR) DDOF solver

• A DX11 optimized CR solver for DDOF

• Results

28th February 2011 AMD‘s Favorite Effects 3

Motivation

• Solver presented at GDC 2010 [RS2010] has some weaknesses

• Great implementation but memory reqs and runtime too high for many game developers

• Looking for faster and memory efficient solver

28th February 2011 AMD‘s Favorite Effects 4

Diffusion DOF recap 1

• DDOF is an enhanced way of blurring a picture taking an arbitrary CoC at a pixel into account

• Interprets input image as a heat distribution

• Uses the CoC at a pixel to derive a per pixel heat conductivity

CoC=Circle of Confusion

28th February 2011 AMD‘s Favorite Effects 5

Diffusion DOF recap 2

• Blurring is done by time stepping a differential equation that models the diffusion of heat

• ADI method used to arrive at a separable solution for stepping

• Need to solve tri-diagonal linear system for each row and then each colum of the input

28th February 2011 AMD‘s Favorite Effects 6

DDOF Tri-diagonal system

28th February 2011 AMD‘s Favorite Effects 7

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0

0 n n n n

b c y x

a b c y x

a b c y x

a b y x

• row/col of input image

• derived from CoC at each pixel of an input row/col

• resulting blurred row/col

Solver recap 1 • The GDC2010 solver [RS2010] is a ‚hybrid‘ solver

– Performs three PCR steps upfront

– Performs serial ‚Sweep‘ algorithm to solve small resulting systems

– Check [ZCO2010] for details on other hybrid solvers

28th February 2011 AMD‘s Favorite Effects 8

Solver recap 2 • The GDC2010 solver [RS2010] has drawbacks

– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm • GPUs without RW cache will suffer

– For high resolutions three PCR steps produce tri-diagonal system of substantial size • This means a serial (sweep) algorithm is run on a ‚big‘ system

28th February 2011 AMD‘s Favorite Effects 9

Solver recap 3 • Cyclic Reduction (CR) solver

– Used by [Kass2006] in the original DDOF paper

– Runs in two phases

1. reduction phase

2. backward substitution phase

28th February 2011 AMD‘s Favorite Effects 10

Solver recap 4 • According to [ZCO2010]:

– CR solver has lowest computational complexity of all solvers

– It suffers from lack of parallelism though

• At the end of the reduction phase

• At the start of the backwards substitution phase

28th February 2011 AMD‘s Favorite Effects 11

Passes of a Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 12

Input image X

Pass 1: construct from CoC

abc

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0

0 n n n n

b c y x

a b c y x

a b c y x

a b y x

reduce

reduce

reduce

reduce

Stop at size 1 Solve for the

first y

Y substitute substitute

Blurred image

Vanilla Solver Results

• Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8ms at 1600x1200)

• Memory footprint prohibitively high

– >200 MB at 1600x1200

• Need an answer to tackling the lack of parallelism problem – answer given in [ZCO2010]

28th February 2011 AMD‘s Favorite Effects 13

Vanilla CR Solver

28th February 2011 AMD‘s Favorite Effects 14

Input image X

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at size 1 Solve for the

first y

Y substitute substitute

Blurred image

This is

what kills

parallelism

Keeping the parallelism high

28th February 2011 AMD‘s Favorite Effects 15

Input image X

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution to

have a big

enough parallel

workload (e.g

using PCR see

[ZCO2010])

Y substitute substitute

Blurred image

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 16

Input image X

Pass 1: construct from CoC

abc

reduce

reduce

reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

Blurred image

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 17

rgab32f X

rgab32f abc

rgab32f

rgab32f

reduce

reduce

reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba32f rgab32f

substi-

tute

Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 18

rgab16f X

rgab32f abc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f rgab16f

substi-

tute

This saves some significant

amount of memory - We found

no artifacts for going from

rgba32f to rgba16f

Memory Optimizations 2

28th February 2011 AMD‘s Favorite Effects 19

rgab16f X

rgab32f abc

rgab16f

rgab32f

reduce

reduce

reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f rgab16f

substi-

tute

This does again save a

significant amount of

memory as this is the

biggest surface used

by the solver

Memory Optimizations 2

28th February 2011 AMD‘s Favorite Effects 20

rgab16f X

abc

rgab16f

rgab32f

reduce reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f rgab16f

substi-

tute

Skip abc

construction pass

and compute abc

on-the-fly during 1.

reduction pass

Intermediate Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 21

Solver

Time in ms

Memory in Megabytes

HD5870 GTX480

GDC2010 hybrid solver on GTX480

~8.5

8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66

3.33

~132

Memory Optimizations 3

28th February 2011 AMD‘s Favorite Effects 22

rgab16f X

abc

rgab16f

rgab32f

reduce reduce

reduce

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f rgab16f

substi-

tute

Skip abc

construction

pass compute

abc during 1.

reduction pass

Yet again this saves a

significant amount of

memory !

Memory Optimizations 3

28th February 2011 AMD‘s Favorite Effects 23

rgab16f X

abc

reduce4

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f

substitute4

Skip abc

construction

pass compute

abc during 1.

reduction pass

Reduce 4-to-1

in a special first

reduction pass

Substitute

1-to-4 in a

special

substitution pass

Intermediate Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 24

Solver

Time in ms

Memory in Megabytes

HD5870 GTX480

GDC2010 hybrid solver on GTX480

~8.5

8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66

3.33

~132

4–to-1 Reduction

2.87

3.32

~73

DX11 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 25

rgab16f X

abc

reduce4

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f

substitute4

Skip abc

construction

pass compute

abc during 1.

reduction pass

Reduce 4-to-1

in a special first

reduction pass

Substitute

1-to-4 in a

special

substitution pass

DX11 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 26

rgab16f X

abc

reduce4

Stop at a

reasonable

size

Solve for Y at

that resolution

Y substitute substitute

rgba16f

substitute4

Skip abc

construction

pass compute

abc during 1.

reduction pass

Reduce 4-to-1

in a special first

reduction pass

Substitute

1-to-4 in a

special

substitution pass

Pack abc and X into

one rgba_uint surface

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 27

rgab16f X

rgab32f abc

uint

uint

uint

uint

pack x,y channel

(f32tof16(X.x) +

(f32tof16(X.y) << 16))

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 28

rgab16f X

rgab32f abc

uint

uint

uint

uint

higher 27 bits of x channel

(asuint(abc.x) &0xFFFFFFC0) |

(f32tof16(X.z) & 0x3F))

Steal 6 lowest mantissa bits of abc.x to store some bits of X.z

Using SM5 for data packing

28th February 2011 AMD‘s Favorite Effects 29

rgab16f X

rgab32f abc

uint

uint

uint

uint

higher 27 bits of y channel

(asuint(abc.y) &0xFFFFFFC0) |

((f32tof16(X.z) >>6 )& 0x3F))

Steal 6 lowest mantissa bits of abc.y to store some bits of X.z

SM5 Memory Optimizations 1

28th February 2011 AMD‘s Favorite Effects 30

rgab16f X

rgab32f abc

uint

uint

uint

uint

higher 27 bits of z channel

(asuint(abc.z) &0xFFFFFFC0) |

((f32tof16(X.z) >>12 )& 0x3F))

Steal 6 lowest mantissa bits of abc.z to store some bits of X.z

Sample Screenshot

28th February 2011 AMD‘s Favorite Effects 31

Abs(Packed-Unpacked) x 255.0f

28th February 2011 AMD‘s Favorite Effects 32

DX11 Memory Optimizations 2

• Solver does a horizonal and vertical pass

• Chain of lower res RTs needs to be there twice

– Horizontal reduction/substitution chain

– Vertical reduction/substitution chain

• How can DX11 help?

28th February 2011 AMD‘s Favorite Effects 33

DX11 Memory Optimizations 2

• UAVs allow us to reuse data of the horizontal chain for the vertical chain

• A proof of concept implementation shows that this works nicely but impacts the runtime significantly – ~40% lower fps

• Stayed with RTs as memory was already quite low

• Use only if you are really concerned about memory

28th February 2011 AMD‘s Favorite Effects 34

Final Results 1600x1200

28th February 2011 AMD‘s Favorite Effects 35

Solver

Time in ms

Memory in Megabytes

HD5870 GTX480

GDC2010 hybrid solver on GTX480

~8.5

8.00 [Bavoil2010]

~117 (guesstimate,)

Standard Solver (already skips high res abc construction)

3.66

3.33

~132

4–to-1 Reduction

2.87

3.32

~73

4-to-1 Reduction + SM5 Packing

2.75

3.14

~58

Future Work

• Look into CS acceleration of the solver

– 4-to-1 reduction pass

– 1-to-4 substitution pass

• Look into using heat diffusion for other effects

– e.g. Motion blur

28th February 2011 AMD‘s Favorite Effects 36

Conclusion

• Optimized CR solver is fast and mem-efficient

– Used in Dragon Age 2

– 4aGames considering its use for new projects

– Detailed description in ‚Game Engine Gems 2‘

• Mail me ([email protected]) if you want access to the sources

28th February 2011 AMD‘s Favorite Effects 37

References • [Kass2006] “Interactive depth of field using simulated diffusion on a GPU”

Michael Kass, Pixar Animation studios, Pixar technical memo #06-01

• [ZCO2010] “Fast Tridiagonal Solvers on the GPU” Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010

• [RS2010] “DX11 Effects in Metro 2033: The Last Refuge” A. Rege, O. Shishkovtsov, GDC 2010

• [Bavoil2010] „Modern Real-Time Rendering Techniques“, L. Bavoil, FGO2010

28th February 2011 AMD‘s Favorite Effects 38

Backup

28th February 2011 AMD‘s Favorite Effects 39

Results 1920x1200

28th February 2011 AMD‘s Favorite Effects 40

Solver

Time in ms

Memory in Megabytes

HD5870 GTX480

Standard Solver (already skips high res abc construction)

4.31

4.03

~158

4–to-1 Reduction

3.36

4.02

~88

4-to-1 Reduction + SM5 Packing

3.23

3.79

~70