An Optimized Diffusion Depth Of Field Solver

An Optimized Diffusion Depth Of Field Solver (DDOF)

28th February 2011 2AMD‘s Favorite Effects

Holger Gruen – AMD

Agenda• Motivation• Recap of a high-level explanation of DDOF• Recap of earlier DDOF solvers• A Vanilla Cyclic Reduction(CR) DDOF solver• A DX11 optimized CR solver for DDOF• Results

28th February 2011 AMD‘s Favorite Effects 3

Motivation• Solver presented at GDC 2010 [RS2010] has

some weaknesses• Great implementation but memory reqs and

runtime too high for many game developers• Looking for faster and memory efficient solver

Diffusion DOF recap 1• DDOF is an enhanced way of blurring a picture

taking an arbitrary CoC at a pixel into account• Interprets input image as a heat distribution• Uses the CoC at a pixel to derive a per pixel

heat conductivity CoC=Circle of Confusion

Diffusion DOF recap 2• Blurring is done by time stepping a differential

equation that models the diffusion of heat• ADI method used to arrive at a separable

solution for stepping• Need to solve tri-diagonal linear system for

each row and then each colum of the input28th February 2011 AMD‘s Favorite Effects 6

DDOF Tri-diagonal system

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0 n n n n

b c y x

a b c y x

a b y x

• row/col of inputimage

• derived from CoC at each pixel of aninput row/col

• resulting blurred row/col

Solver recap 1• The GDC2010 solver [RS2010] is a ‚hybrid‘ solver

– Performs three PCR steps upfront– Performs serial ‚Sweep‘ algorithm to solve small

resulting systems– Check [ZCO2010] for details on other hybrid

solvers

Solver recap 2• The GDC2010 solver [RS2010] has drawbacks

– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm

• GPUs without RW cache will suffer

– For high resolutions three PCR steps produce tri-diagonal system of substantial size

• This means a serial (sweep) algorithm is run on a ‚big‘ system

Solver recap 3• Cyclic Reduction (CR) solver

– Used by [Kass2006] in the original DDOF paper– Runs in two phases

1. reduction phase2. backward substitution phase

Solver recap 4• According to [ZCO2010]:

– CR solver has lowest computational complexity of all solvers

– It suffers from lack of parallelism though • At the end of the reduction phase• At the start of the backwards substitution phase

Passes of a Vanilla CR Solver

Input imageX

Pass 1: construct from CoC

1 1 1 1

2 2 2 2 2

3 3 3 3 3

0 n n n n

b c y x

a b c y x

a b y x

Passes of a Vanilla CR Solver

Input imageX

reduce

Stop at size 1Solve for the first y

Y substitutesubstitute

Blurred image

Vanilla Solver Results• Higher performance than reported in

[Bavoil2010] (~6 ms vs. ~8ms at 1600x1200)

• Memory footprint prohibitively high – >200 MB at 1600x1200

• Need an answer to tackling the lack of parallelism problem – answer given in [ZCO2010]

Vanilla CR Solver

Input imageX

reduce

Stop at size 1Solve for the first y

Blurred image

This is what kills

parallelism

Keeping the parallelism high

Input imageX

reduce

Stop at a reasonable size

Solve for Y at that resolution to have a big enough parallel workload (e.g using PCR see [ZCO2010])

Blurred image

Memory Optimizations 1

Input imageX

reduce

Solve for Y at that resolution

Blurred image

rgab32fX

rgab32fabc

rgab32f

reduce

rgba32f rgab32fsubsti-tute

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

This saves some significant amount of memory - We found no artifacts for going from rgba32f to rgba16f

rgab16fX

rgab32fabc

rgab16f

rgab32f

reduce

This does again save a significant amount of memory as this is the biggest surface used by the solver

rgab16fX

rgab16f

rgab32f

reduce reduce

reduce

Skip abc construction pass and compute abc on-the-fly during 1. reduction pass

Intermediate Results 1600x1200

Solver Time in ms Memory in MegabytesHD5870 GTX480

GDC2010 hybrid solver on GTX480 ~8.5 8.00 [Bavoil2010]

~117 (guesstimate)

Standard Solver (already skips high res abc construction)

3.66 3.33 ~132

rgab16fX

rgab16f

rgab32f

reduce reduce

reduce

Skip abc construction pass compute abc during 1. reduction pass

Yet again this saves a significant amount of memory !

rgab16fX

reduce4

rgba16f

substitute4

Reduce 4-to-1in a special first reduction pass

Substitute 1-to-4 in a special substitution pass

Intermediate Results 1600x1200

~117 (guesstimate)

3.66 3.33 ~132

4–to-1 Reduction 2.87 3.32 ~73

DX11 Memory Optimizations 1

rgab16fX

reduce4

rgba16f

substitute4

DX11 Memory Optimizations 1

rgab16fX

reduce4

rgba16f

substitute4

Pack abc and X into one rgba_uint surface

Using SM5 for data packing

rgab16fX

rgab32fabc

pack x,y channel

(f32tof16(X.x) + (f32tof16(X.y) << 16))

rgab16fX

rgab32fabc

lower 5 bits of z channel

higher 27 bits of x channel

(asuint(abc.x) &0xFFFFFFC0) | (f32tof16(X.z) & 0x3F))

Steal 6 lowest mantissa bits of abc.x to store some bits of X.z

rgab16fX

rgab32fabc

central 5 bits of z channel

higher 27 bits of y channel

(asuint(abc.y) &0xFFFFFFC0) | ((f32tof16(X.z) >>6 )& 0x3F))

Steal 6 lowest mantissa bits of abc.y to store some bits of X.z

SM5 Memory Optimizations 1

rgab16fX

rgab32fabc

higher 5 bits of z channel

higher 27 bits of z channel pack

(asuint(abc.z) &0xFFFFFFC0) | ((f32tof16(X.z) >>12 )& 0x3F))

Steal 6 lowest mantissa bits of abc.z to store some bits of X.z

Sample Screenshot

Abs(Packed-Unpacked) x 255.0f

DX11 Memory Optimizations 2• Solver does a horizonal and vertical pass• Chain of lower res RTs needs to be there twice

– Horizontal reduction/substitution chain– Vertical reduction/substitution chain

• How can DX11 help?

DX11 Memory Optimizations 2• UAVs allow us to reuse data of the horizontal

chain for the vertical chain• A proof of concept implementation shows that this

works nicely but impacts the runtime significantly – ~40% lower fps

• Stayed with RTs as memory was already quite low• Use only if you are really concerned about memory

Final Results 1600x1200

~117 (guesstimate,)

3.66 3.33 ~132

4–to-1 Reduction 2.87 3.32 ~73

4-to-1 Reduction + SM5 Packing 2.75 3.14 ~58

Future Work• Look into CS acceleration of the solver

– 4-to-1 reduction pass– 1-to-4 substitution pass

• Look into using heat diffusion for other effects– e.g. Motion blur

Conclusion• Optimized CR solver is fast and mem-efficient

– Used in Dragon Age 2– 4aGames considering its use for new projects– Detailed description in ‚Game Engine Gems 2‘

• Mail me (holger.gruen@amd.com) if you want access to the sources

References• [Kass2006] “Interactive depth of field using simulated diffusion on a GPU”

Michael Kass, Pixar Animation studios, Pixar technical memo #06-01• [ZCO2010] “Fast Tridiagonal Solvers on the GPU” Y. Zhang, J. Cohen, J. D.

Owens, PPoPP 2010• [RS2010] “DX11 Effects in Metro 2033: The Last Refuge” A. Rege, O.

Shishkovtsov, GDC 2010• [Bavoil2010] „Modern Real-Time Rendering Techniques“, L. Bavoil,

FGO2010

Backup

Results 1920x1200

4.31 4.03 ~158

4–to-1 Reduction 3.36 4.02 ~88

4-to-1 Reduction + SM5 Packing 3.23 3.79 ~70

An Optimized Diffusion Depth Of Field Solver

Documents

Transcript of An Optimized Diffusion Depth Of Field Solver

Finite Element Solutions for Geotechnical · PDF fileIntegrated Solver Optimized for the next generation 64-bit platform Finite Element Solutions for Geotechnical Engineering Integrated

EfficiEntly Solving thE StochaStic REaction-DiffuSion csb ... · solver against the state-of-the-art C solver URD-ME[1] on the well-known MinD model from E. coli: Model Schematic

GOAL solver: a hybrid local search based solver for high ...

CASI Solver

Automatic Solver Configuration and Solver Portfolios

MATLAB Implementation of a Multigrid Solver for Diffusion ...

A multigrid solver for 3D electromagnetic diffusion

Introduction to Lego Mindstorms NXT. Video clips: Rubix cube solver Rubix cube solver Rubix cube solver Rubix cube solver Roboflush Roboflush Roboflush.

Understanding Stress, Chemistry and Molecular Diffusion ...€¦ · Understanding Stress, Chemistry and Molecular Diffusion for Optimized CMP of ULK Dielectrics ULK Thin-Film Materials

FAST ITERATIVE SOLVER FOR CONVECTION-DIFFUSION …

DEVELOPMENT OF AN OPTIMIZED LAMBERT PROBLEM SOLVER …

Optimized halftoning using dot diffusion and methods for inverse

Solver About

3D Excavation and Piled-Raft Foundationnorthamerica.midasuser.com/web/upload/sample/3D... · 2017-02-21 · Integrated Solver Optimized for the next generation 64-bit platform Finite

Author: Elisa Boatti - unipv · Author: Elisa Boatti 19/01/2016 ... VUMAT for Abaqus Explicit solver • AceGen high-level intuitive and optimized language, very similar to Mathematica

ACOUSTIC NOISE OPTIMIZED DIFFUSION-WEIGHTED IMAGING …

Adams Solver

A PARALLEL SOLVER FOR REACTION DIFFUSION … · May 31, 2004 15:27 WSPC/103-M3AS 00348 A Parallel Solver for Reaction{Di usion Systems 885 we have chosen to use adaptive methods in

CTC isolation & analysis in an integrated microsystem · • HRP label & substrate optimized for reaction-limited kinetics rather than diffusion-limited • Optimal electrode design

Estabilidad de Taludes - latinamerica.midasuser.comlatinamerica.midasuser.com/web/upload/sample/Estabilidad_de_Talud… · Integrated Solver Optimized for the next generation 64-bit