
A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys

University of Virginia

Graphics Hardware 2003, July 26-27, San Diego, CA

General-Purpose GPU Programming

Why do we port algorithms to the GPU?

How much faster can we expect it to be, really?

What is the challenge in porting?

Case Study

Problem: Implement a Boundary Value Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and engineering applications, e.g.:

Heat transfer

Fluid flow

Related Work

Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms

Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Very similar to our system

Developed concurrently

Complementary approach

Driving problem: Fluid mechanics sim

Problem domain is a warped disc:

[Figure: warped disc domain; both panels labeled "regular grid"]

BVPs: Background

Boundary value problems are sometimes governed by PDEs of the form:

$L\phi = f$

$L$ is some operator

$\phi$ is the unknown, defined over the problem domain

$f$ is a forcing function (source term)

Given $L$ and $f$, solve for $\phi$.

BVPs: Example

Heat Transfer

Find a steady-state temperature distribution $T$ in a solid of thermal conductivity $k$ with thermal source $S$

This requires solving a Poisson equation of the form:

$k \nabla^2 T = -S$

This is a BVP where $L$ is the Laplacian operator $\nabla^2$

All our applications require a Poisson solver.
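
As a worked illustration of how the heat-transfer equation becomes a set of linear equations, the standard five-point finite-difference form is shown below. The uniform grid spacing $h$ and the indices $i, j$ are assumptions introduced here; the slides do not state the discretization.

$$
k \, \frac{T_{i+1,j} + T_{i-1,j} + T_{i,j+1} + T_{i,j-1} - 4\,T_{i,j}}{h^2} = -S_{i,j}
\quad\Longrightarrow\quad
T_{i,j} = \frac{1}{4}\left( T_{i+1,j} + T_{i-1,j} + T_{i,j+1} + T_{i,j-1} + \frac{h^2}{k}\, S_{i,j} \right)
$$

The right-hand form, solved for the center value, is what a relaxation method updates at each grid point.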

BVPs: Solving

Most such problems cannot be solved analytically

Instead, discretize onto a grid to form a set of linear equations, then solve:

Direct elimination

Gauss-Seidel iteration

Conjugate-gradient

Strongly implicit procedures

Multigrid method
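
To make one of the listed options concrete, here is a minimal sketch of Gauss-Seidel iteration for a discretized Poisson problem. It assumes a square grid with uniform spacing `h`, zero boundary values, and the equation laplacian(u) = f; the function name `gauss_seidel_poisson` is hypothetical and this is a CPU illustration, not the solver from the talk.

```python
def gauss_seidel_poisson(f, h, sweeps):
    """Approximate laplacian(u) = f on a square grid with u = 0 on the boundary."""
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Five-point stencil solved for the center value; values
                # already updated in this sweep are reused immediately.
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  - h * h * f[i][j])
    return u
```

Each sweep touches every interior point once, and the number of sweeps needed grows with grid size, which is the cost multigrid is meant to avoid.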

Multigrid method

Iteratively corrects an approximation to the solution

Operates at multiple grid resolutions

Low-resolution grids are used to correct higher-resolution grids recursively

Very fast, especially for large grids: O(n) in the number of grid points n

Multigrid method

Use coarser grid levels to recursively correct an approximation to the solution

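Algorithm:

smooth

residual: $L\phi_i - f$

restrict

recurse

interpolate

[Figure: kernel stencils. Five-point Laplacian: center -4, edge neighbors 1. Nine-point restriction weights: center 1/4, edges 1/8, corners 1/16. Interpolation weights: center 1, edges 1/2, corners 1/4.]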
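
The four steps above can be sketched as a recursive V-cycle. This is a simplified 1D CPU illustration under stated assumptions, not the GPU implementation from the talk: it solves u'' = f with zero boundary values on grids of 2^k + 1 points, uses the 1D analogues of the stencils, and follows the slide's sign convention (residual = Lu - f), so the coarse-grid correction is subtracted. All function names are hypothetical.

```python
def smooth(u, f, h, sweeps=2):
    # Gauss-Seidel relaxation of u'' = f, in place.
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] - h * h * f[i])

def residual(u, f, h):
    # Slide convention: residual = L u - f (zero at the boundary).
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) - f[i]
    return r

def restrict(fine):
    # Full weighting with weights 1/4, 1/2, 1/4.
    nc = (len(fine) - 1) // 2 + 1
    coarse = [0.0] * nc
    for i in range(1, nc - 1):
        coarse[i] = (0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i]
                     + 0.25 * fine[2 * i + 1])
    return coarse

def interpolate(coarse):
    # Linear interpolation with weights 1/2, 1, 1/2.
    nf = 2 * (len(coarse) - 1) + 1
    fine = [0.0] * nf
    for i in range(len(coarse)):
        fine[2 * i] = coarse[i]
    for i in range(1, nf, 2):
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1])
    return fine

def v_cycle(u, f, h):
    if len(u) <= 3:
        smooth(u, f, h, sweeps=1)  # one interior point: solved exactly
        return u
    smooth(u, f, h)
    r_coarse = restrict(residual(u, f, h))
    # Solve L e = residual on the coarser grid, starting from zero.
    e = interpolate(v_cycle([0.0] * len(r_coarse), r_coarse, 2.0 * h))
    for i in range(1, len(u) - 1):
        u[i] -= e[i]  # L(u - e) = f when L e = L u - f
    smooth(u, f, h)
    return u
```

Calling `v_cycle` repeatedly on the same `u` corrects the approximation until the residual stops shrinking.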

Implementation

For each step of the algorithm:

Bind as texture maps the buffers that contain the necessary data

Set the target buffer for rendering

Activate a fragment program that performs the necessary kernel computation

Render a grid-sized quad with multitexturing

[Figure: source buffer textures feed the fragment program, which writes to the render target buffer]
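
The render-pass pattern above can be emulated on the CPU to show its data flow. In this sketch, `render_pass` stands in for drawing a grid-sized quad, `kernel` for the fragment program, `sources` for the bound textures, and the returned grid for the render target. These names, the unit grid spacing, and the residual kernel are illustrative assumptions; no graphics API calls are shown.

```python
def render_pass(kernel, sources, width, height):
    """Run `kernel` once per output texel, writing into a fresh target."""
    target = [[0.0] * width for _ in range(height)]  # separate from sources
    for y in range(height):
        for x in range(width):
            target[y][x] = kernel(sources, x, y)
    return target

def residual_kernel(sources, x, y):
    """Five-point residual L u - f with unit grid spacing; 0 on the border."""
    u, f = sources
    height, width = len(u), len(u[0])
    if x in (0, width - 1) or y in (0, height - 1):
        return 0.0
    return (u[y - 1][x] + u[y + 1][x] + u[y][x - 1] + u[y][x + 1]
            - 4.0 * u[y][x] - f[y][x])
```

Each texel's result depends only on the read-only sources, so the kernel can run for all texels independently, which is what lets the GPU process them in parallel.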

Optimizing the Solver

Detect steady-state natively on GPU

Minimize shader length

Special-case whenever possible

Avoid context-switching

Optimizing the Solver: Steady-state

How to detect convergence?

L1 norm - average error

L2 norm - RMS error (common in visual sim)

L∞ norm - max error (common in sci/eng apps)

Can use occlusion query!
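
For reference, the three norms can be computed from a residual grid as in the sketch below. The `converged` helper counts the texels whose error exceeds a tolerance, which is one plausible way a max-error test maps onto an occlusion query (the query reports how many fragments survived). That mapping and both function names are assumptions, not details given in the slides.

```python
import math

def error_norms(residual):
    """Return (average, RMS, max) of the absolute values in a 2D grid."""
    values = [abs(v) for row in residual for v in row]
    l1 = sum(values) / len(values)
    l2 = math.sqrt(sum(v * v for v in values) / len(values))
    linf = max(values)
    return l1, l2, linf

def converged(residual, tolerance):
    # Count "surviving fragments": texels still above the tolerance.
    failing = sum(1 for row in residual for v in row if abs(v) > tolerance)
    return failing == 0
```

The max-error test needs only a count compared against zero, so no per-texel values have to be read back to the CPU.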

[Plot: seconds to steady state vs. grid size]

Optimizing the Solver: Shader length

Minimize number of registers used

Vectorize as much as possible

Use the rasterizer to perform computations of linearly-varying values

Pre-compute invariants on CPU

shader       original fp   fastpath fp   fastpath vp
smooth       79-6-1        20-4-1        12-2
residual     45-7-0        16-4-0        11-1
restrict     66-6-1        21-3-0        11-1
interpolate  93-6-1        25-3-0        13-2

Optimizing the Solver: Special-case

Fast-path vs. slow-path

write several variants of each fragment program to handle boundary cases

eliminates conditionals in the fragment program

equivalent to avoiding CPU inner-loop branching

[Figure: slow path (with boundaries) vs. fast path (no boundaries)]
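
The fast-path idea can be illustrated on the CPU as follows. The branching version tests for the boundary at every texel; the split version runs a branch-free kernel over the interior only and handles the border separately. The Jacobi-style update, fixed (Dirichlet) boundary values, unit grid spacing, and all names are assumptions made for this sketch.

```python
def smooth_fast(u, f, x, y):
    # Branch-free interior kernel (unit grid spacing).
    return 0.25 * (u[y - 1][x] + u[y + 1][x] + u[y][x - 1] + u[y][x + 1]
                   - f[y][x])

def smooth_slow(u, f, x, y):
    # Single kernel that must test for the boundary at every texel.
    height, width = len(u), len(u[0])
    if x == 0 or y == 0 or x == width - 1 or y == height - 1:
        return u[y][x]  # boundary value is held fixed
    return smooth_fast(u, f, x, y)

def jacobi_pass_branching(u, f):
    return [[smooth_slow(u, f, x, y) for x in range(len(u[0]))]
            for y in range(len(u))]

def jacobi_pass_split(u, f):
    out = [row[:] for row in u]          # border pass: copy fixed values
    for y in range(1, len(u) - 1):       # interior pass: no conditionals
        for x in range(1, len(u[0]) - 1):
            out[y][x] = smooth_fast(u, f, x, y)
    return out
```

Both passes produce the same grid; the split version simply moves the boundary decision out of the per-texel kernel and into the choice of which region to draw.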

[Plot: seconds per V-cycle vs. grid size]

Optimizing the Solver: Context-switching

Find the best packing of data from multiple grid levels into the pbuffer surfaces
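
As an illustration of what such a packing involves, the sketch below places every grid level side by side in one surface and returns each level's rectangle. The left-to-right layout, the 2^k + 1 grid sizes, and the name `pack_levels` are hypothetical; the slides do not say which layout was chosen.

```python
def pack_levels(finest, coarsest=3):
    """Lay out square grid levels left to right in a single surface.

    Returns (rects, surface_width, surface_height), where each rect is
    (x_offset, y_offset, width, height) for one level, finest first.
    """
    if coarsest < 2:
        raise ValueError("coarsest must be at least 2")
    rects = []
    x_offset = 0
    size = finest
    while size >= coarsest:
        rects.append((x_offset, 0, size, size))
        x_offset += size
        size = (size - 1) // 2 + 1  # next coarser level of a 2^k + 1 grid
    return rects, x_offset, finest
```

With all levels in one surface, restrict and interpolate passes move data between rectangles of the same buffer instead of switching rendering contexts.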

Remove context switching

Can introduce operations with undefined results: reading/writing same surface

Why do we need to do this?

Can we get away with it?

What about superbuffers?

Data Layout

Performance:

[Plot: seconds to steady state vs. grid size]

Data Layout

Possible additional vectorization:

Compute 4 values at a time

Requires source, residual, solution values to be in different buffers

Complicates boundary calculations

Adds setup and teardown overhead

[Figure: stacked domain]
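
One way to read "compute 4 values at a time" with a stacked domain is to split the grid into four strips and store them in the four components of each texel. The sketch below shows that packing and its inverse. The strip-wise split by rows and the names `stack_domain` and `unstack_domain` are assumptions; the slides do not specify the layout.

```python
def stack_domain(grid):
    """Pack a grid with 4*k rows into k rows of 4-component texels.

    Component c of texel (x, y) holds grid[c * k + y][x].
    """
    rows = len(grid)
    if rows % 4 != 0:
        raise ValueError("row count must be a multiple of 4")
    k = rows // 4
    width = len(grid[0])
    return [[tuple(grid[c * k + y][x] for c in range(4))
             for x in range(width)]
            for y in range(k)]

def unstack_domain(texels):
    """Inverse of stack_domain: recover the original row order."""
    k = len(texels)
    width = len(texels[0])
    return [[texels[y][x][c] for x in range(width)]
            for c in range(4) for y in range(k)]
```

The packing and unpacking steps correspond to the setup and teardown overhead noted above. Rows at the edge of each strip have vertical neighbors in a different component, which is one reason boundary calculations get more complicated.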

Results: CPU vs. GPU

Performance:

[Plot: seconds to steady state vs. grid size]

Conclusions

What we need going forward:

Superbuffers

or: Universal support for multiple-surface pbuffers

or: Cheap context switching

Developer tools

Debugging tools

Documentation

Global accumulator

Ever increasing amounts of precision, memory

Textures bigger than 2048 on a side

Acknowledgements

Hardware

David Kirk

Matt Papakipos

Driver Support

Nick Triantos

Pat Brown

Stephen Ehmann

Fragment Programming

James Percy

Matt Pharr

General-purpose GPU

Mark Harris

Aaron Lefohn

Ian Buck

Funding

NSF Award #0092793