CUDA Libraries & Packages
outreach.sbel.wisc.edu/Workshops/GPUworkshop/2011/...


Popular CUDA Packages

Krishnan Suresh

[email protected]

2

CUDA Libraries & Packages

1. CUBLAS: Dense Linear Algebra

2. Thrust: Parallel sort, etc

3. CUSP: Sparse Linear Algebra Package

4. CuSparse: Sparse Linear Algebra Package

5. CUFFT: Fast Fourier Transform

6. GPUMat: Matlab Wrapper (free)

7. Jacket: Matlab Wrapper ($$)

3

CUBLAS

4

CUBLAS

• CUDA implementation of BLAS (Basic Linear Algebra Subprograms)

– Vector-vector (Level-1)

– Matrix-vector (Level-2)

– Matrix-matrix (Level-3)

• Precisions

– Single: real & complex

– Double: real & complex (not all functions)

• No kernel calls, shared memory, etc


CUBLAS Library Usage

• No additional downloads needed

– cublas.lib (in CUDA SDK)

– Add cublas.lib to linker

– #include <cublas.h>

CUBLAS Code Structure

1. Initialize CUBLAS: cublasInit()

2. Create CPU memory and data

3. Create GPU memory: cublasAlloc(…)

4. Copy from CPU to GPU: cublasSetVector(…)

5. Operate on GPU: cublasSgemm(…)

6. Check for CUBLAS error: cublasGetError()

7. Copy from GPU to CPU: cublasGetVector(…)

8. Verify results

9. Free GPU memory: cublasFree(…)

10. Shut down CUBLAS: cublasShutdown()

7

CUBLAS BLAS-1 Functions:

Vector-vector operations

CU(BLAS) Naming Convention

cublasIsamax: I = index of, s = single precision, a = absolute, max = max

Find the index of the absolute max of a vector of single precision reals

Variants: cublasIdamax, cublasIcamax, cublasIzamax


CU(BLAS) Naming Convention (2)

cublasSaxpy: S = single precision, axpy = alpha*x+y

Compute alpha*x+y where x & y are single precision reals & alpha is a scalar

Variant: cublasDaxpy
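To make the two naming examples concrete, here is a minimal CPU sketch of what "isamax" and "saxpy" compute. The function names `ref_isamax` and `ref_saxpy` are hypothetical helpers, not CUBLAS calls, and positive increments are assumed.

```cpp
#include <cmath>
#include <cstddef>

// CPU reference for "isamax": returns the 1-based (Fortran-style) index of the
// first element with the largest absolute value, or 0 if n <= 0.
// Assumption: incx > 0.
int ref_isamax(int n, const float* x, int incx) {
    if (n <= 0 || incx <= 0) return 0;
    int best = 1;
    float bestVal = std::fabs(x[0]);
    for (int i = 1; i < n; ++i) {
        const float v = std::fabs(x[static_cast<std::size_t>(i) * incx]);
        if (v > bestVal) { bestVal = v; best = i + 1; }
    }
    return best;
}

// CPU reference for "saxpy": y <- alpha * x + y (y is overwritten).
// Assumption: incx > 0 and incy > 0.
void ref_saxpy(int n, float alpha, const float* x, int incx, float* y, int incy) {
    for (int i = 0; i < n; ++i)
        y[static_cast<std::size_t>(i) * incy] += alpha * x[static_cast<std::size_t>(i) * incx];
}
```

Note the 1-based return value of the index routine, which follows the Fortran BLAS convention.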

CUBLAS Example-1 (CPU): $a = x^T y$

CUBLAS Example-1 (GPU): $a = x^T y$

• No kernel calls

• No memory mgmt.

Callout: increment of 1

CUBLAS Example-2 (CPU): $z = x + y$


CUBLAS Example-2 (GPU): $z = x + y$

Output stored in d_y

CUBLAS BLAS-2 Functions: Matrix-Vector Operations

• $z = Ax + y$, with $A$ symmetric banded

• $x = A^{-1} y$, with $A$ upper or lower

15

CUBLAS: Caveats

• Solves Ax = b only for upper/lower triangular A

• Limited class of sparse matrices

• Column-major format & 1-based indexing (Fortran style)

• C uses row-major format & 0-based indexing; use macros
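The indexing caveat can be illustrated with a pair of index macros. This is a minimal sketch of the common idiom; the helper `make_column_major` is a hypothetical name used only for this example.

```cpp
#include <cstddef>
#include <vector>

// Column-major index helpers; ld is the leading dimension (allocated row count).
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))             // 0-based (i, j)
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1)) // 1-based (i, j)

// Fill a rows x cols matrix, stored column-major, so that A(i, j) = 10*i + j
// with 0-based i and j.
std::vector<float> make_column_major(int rows, int cols) {
    std::vector<float> a(static_cast<std::size_t>(rows) * cols);
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            a[IDX2C(i, j, rows)] = 10.0f * i + j;
    return a;
}
```

With these macros the C code keeps its natural loop indices while the data lands in the column-major layout the library expects.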

CU(BLAS) Naming Convention

cublasSsbmv: S = single precision, sb = symmetric banded, mv = matrix-vector: $z = Ax + y$

Band pattern of A:

x x x
x x x x
x x x x x
  x x x x
    x x x


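Example: $z = Ax + y$

$$A_{(N,N)} = \begin{bmatrix} 2 & 1 & & & \\ 1 & 2 & 1 & & \\ & 1 & 2 & \ddots & \\ & & \ddots & \ddots & 1 \\ & & & 1 & 2 \end{bmatrix}$$

It is sufficient to store

$$\begin{bmatrix} 2 & 1 & & & \\ & 2 & 1 & & \\ & & 2 & \ddots & \\ & & & \ddots & 1 \\ & & & & 2 \end{bmatrix}_{(N,N)}$$

Stored as

$$\mathtt{h\_A}_{(2,N)} = \begin{bmatrix} X & 1 & 1 & \cdots & 1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

Symmetric-Banded

#Super-Diagonals = 1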
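The band storage can be checked with a small CPU reference. This is a sketch under stated assumptions: `ref_sbmv_upper` and `make_tridiagonal_band` are hypothetical helper names, the layout is the BLAS upper-band convention with (k+1) rows stored column by column, and the scalars alpha and beta are the general "sbmv" parameters (the slide's $z = Ax + y$ corresponds to alpha = beta = 1).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU reference for z = alpha*A*x + beta*y with A symmetric banded.
// Only the upper band is stored, column-major, in a (k+1) x n array:
// A(i, j) for max(0, j-k) <= i <= j lives at band[(k + i - j) + j*(k+1)].
std::vector<float> ref_sbmv_upper(int n, int k, float alpha,
                                  const std::vector<float>& band,
                                  const std::vector<float>& x,
                                  float beta, const std::vector<float>& y) {
    const int lda = k + 1;
    std::vector<float> z(n);
    for (int i = 0; i < n; ++i) z[i] = beta * y[i];
    for (int j = 0; j < n; ++j) {
        for (int i = std::max(0, j - k); i <= j; ++i) {
            const float aij = band[(k + i - j) + static_cast<std::size_t>(j) * lda];
            z[i] += alpha * aij * x[j];
            if (i != j) z[j] += alpha * aij * x[i]; // mirrored lower-triangle entry
        }
    }
    return z;
}

// Build the 2 x n band array of the tridiagonal example:
// row 0 holds the super-diagonal (first slot unused), row 1 the diagonal.
std::vector<float> make_tridiagonal_band(int n) {
    std::vector<float> band(2 * static_cast<std::size_t>(n));
    for (int j = 0; j < n; ++j) {
        band[0 + 2 * j] = (j == 0) ? 0.0f : 1.0f; // the "X" slot is never read
        band[1 + 2 * j] = 2.0f;
    }
    return band;
}
```

The unused "X" slot exists only so that every column of the band array has the same height.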

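CUBLAS Example-3 (CPU)

$z = Ax + y$, with $\mathtt{h\_A}_{(2,N)} = \begin{bmatrix} X & 1 & 1 & \cdots & 1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$

Macro for 0-indexing in C

h_A in memory, column by column: X, 2, 1, 2, 1, ...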

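CUBLAS Example-3 (CPU)

$$\mathtt{h\_A}_{(2,N)} = \begin{bmatrix} X & 1 & 1 & \cdots & 1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_N \end{bmatrix} = \begin{bmatrix} 2 & 1 & & & \\ 1 & 2 & 1 & & \\ & 1 & 2 & \ddots & \\ & & \ddots & \ddots & 1 \\ & & & 1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$$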

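CUBLAS Example-3 (GPU)

$z = Ax + y$, with $\mathtt{h\_A}_{(2,N)} = \begin{bmatrix} X & 1 & 1 & \cdots & 1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$

Callouts on the code listing: #Rows, Upper, diagonal, #Rows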


CUBLAS Optimal Usage

1. Copy from CPU to GPU: cublasSet…(…)

2. Operate on GPU: Operation 1, Operation 2, …, Operation n

3. Copy from GPU to CPU: cublasGet…(…)

CUBLAS BLAS-3 Functions: Matrix-Matrix Operations

• $C = AB + C$

• $X = A^{-1} B$, with $A$ upper or lower

CUBLAS Performance

CUBLAS SGEMM Performance

(Chart labels: CPU Optimized for 4-core CPU; Naive)


Thrust

26

Thrust

• C++ Template Library using CUDA

• Vector containers:

  – host_vector & device_vector

  – Generalizes std::vector

  – Store any type & dynamically resize

• Numerous algorithms:

  – Sort

  – Sum

  – Max

27

Thrust: Getting started

• Download to (CUDA include directory)

– http://code.google.com/p/thrust/

– Requires CUDA 2.3

• Tutorial:

– http://code.google.com/p/thrust/wiki/Tutorial

28

Thrust: Concept


Thrust Algorithms: Prefix Sum

Given a sequence: $\{x_1, x_2, x_3, \ldots, x_N\}$

And an operation: $\oplus$

Output: $\{x_1,\ x_1 \oplus x_2,\ x_1 \oplus x_2 \oplus x_3,\ \ldots,\ x_1 \oplus x_2 \oplus x_3 \oplus \cdots \oplus x_N\}$

30

Prefix Sum

Key to numerous algorithms

Also referred to as “Scan” algorithm

Different operations result in different results
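As a CPU illustration of how the operation changes the scan result, here is a minimal sketch built on `std::partial_sum`. The names `inclusive_scan_cpu` and `MaxOp` are hypothetical helpers; Thrust offers an analogous `thrust::inclusive_scan` on the device, which is not used here.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Inclusive prefix scan on the CPU with a caller-supplied binary operation.
template <typename Op>
std::vector<int> inclusive_scan_cpu(const std::vector<int>& in, Op op) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin(), op);
    return out;
}

// Binary "max" as a function object, so it can be passed like std::plus.
struct MaxOp {
    int operator()(int a, int b) const { return std::max(a, b); }
};
```

For the sequence {1, 2, 9, 6}, scanning with `std::plus<int>()` gives {1, 3, 12, 18}, with `std::multiplies<int>()` gives {1, 2, 18, 108}, and with `MaxOp()` gives {1, 2, 9, 9}.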

31

Prefix Sum: Example

Given a sequence: $\{1, 2, 9, 6, \ldots\}$

And an operation: $+$

Output: $\{x_1,\ x_1 + x_2,\ x_1 + x_2 + x_3,\ \ldots,\ x_1 + x_2 + x_3 + \cdots + x_N\}$

$= \{1, 3, 12, 18, \ldots\}$

32

Prefix Sum: Example

Given a sequence: $\{1, 2, 9, 6, \ldots\}$

And an operation: $*$

Output: $\{x_1,\ x_1 * x_2,\ x_1 * x_2 * x_3,\ \ldots,\ x_1 * x_2 * x_3 * \cdots * x_N\}$

$= \{1, 2, 18, 108, \ldots\}$


Prefix Sum: Example

Given a sequence: $\{1, 2, 9, 6, \ldots\}$

And an operation: $\max$

Output: $\{x_1,\ \max(x_1, x_2),\ \max(\max(x_1, x_2), x_3),\ \ldots\}$

$= \{1, 2, 9, 9, \ldots\}$

34

Thrust: Examples Set-up

35

Thrust: Examples

36

Thrust: Examples cont.

2 2 2

1 2 ... Na x x x x


CUSP

Cusp is a library for sparse

linear algebra and graph

computations on CUDA.

38

CUSP

    #include <cusp/hyb_matrix.h>
    #include <cusp/io/matrix_market.h>
    #include <cusp/krylov/cg.h>

    int main(void)
    {
        // create an empty sparse matrix structure (HYB format)
        cusp::hyb_matrix<int, float, cusp::device_memory> A;

        // load a matrix stored in MatrixMarket format
        cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

        // allocate storage for solution (x) and right hand side (b)
        cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0);
        cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1);

        // solve the linear system A * x = b with the Conjugate Gradient method
        cusp::krylov::cg(A, x, b);

        return 0;
    }

39

CuSparse

40

CUFFT

CUDA Implementation of

Fast Fourier Transform


Fourier Transform

• Extract frequencies from signal

• Given a function $f(t),\ -\infty < t < \infty$

• 1-D Fourier transform: $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt$

• 2-D, 3-D

42

Fourier Transform

(Figure panels: Continuous Signal; Fourier Transform. Source: Wikipedia)

$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$

43

Discrete Fourier Transform

• Given a sequence $x_0, x_1, \ldots, x_{N-1}$

• Discrete Fourier transform (DFT): $\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-\frac{2\pi i}{N} k n}$

… another sequence
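The DFT definition translates directly into a double loop. The following is a minimal CPU sketch; `dft_naive` is a hypothetical helper name, and the sign convention $e^{-2\pi i k n / N}$ is the one used in the definition on this slide.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct O(N^2) evaluation of X_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N).
std::vector<std::complex<double> > dft_naive(const std::vector<std::complex<double> >& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double> > X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;
    }
    return X;
}
```

A constant input sequence puts all of its energy into the $k = 0$ term, which makes a convenient sanity check.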

44

DFT Examples

Highest frequency

that can be captured

correctly


Fast Fourier Transform

• DFT: Naïve $O(N^2)$ operation

• FFT: Fast DFT, $O(N \log N)$

• Key to signal processing, PDE, …

$x_0, x_1, \ldots, x_{N-1} \;\mapsto\; \hat{x}_0, \hat{x}_1, \ldots, \hat{x}_{N-1}$

$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-\frac{2\pi i}{N} k n}$
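One way to reach $O(N \log N)$ is the recursive radix-2 Cooley-Tukey scheme, sketched below on the CPU. This is an illustration only, not how CUFFT is implemented; `fft_radix2` is a hypothetical helper name and the input length is assumed to be a power of two.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

// Recursive radix-2 Cooley-Tukey FFT with the exp(-2*pi*i*k*n/N) convention.
// Assumption: x.size() is a power of two.
std::vector<cplx> fft_radix2(const std::vector<cplx>& x) {
    const std::size_t N = x.size();
    if (N <= 1) return x;
    std::vector<cplx> even(N / 2), odd(N / 2);
    for (std::size_t n = 0; n < N / 2; ++n) {
        even[n] = x[2 * n];
        odd[n] = x[2 * n + 1];
    }
    const std::vector<cplx> E = fft_radix2(even);
    const std::vector<cplx> O = fft_radix2(odd);
    const double pi = std::acos(-1.0);
    std::vector<cplx> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(N);
        const cplx t = cplx(std::cos(angle), std::sin(angle)) * O[k]; // twiddle factor
        X[k] = E[k] + t;
        X[k + N / 2] = E[k] - t;
    }
    return X;
}
```

Each level of recursion does $O(N)$ work and there are $\log_2 N$ levels, which is where the $O(N \log N)$ cost comes from.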

46

CUFFT

• Fast CUDA library for FFT

• No additional downloads needed

– cufft.lib (in CUDA SDK)

– Add cufft.lib to linker

– #include <cufft.h>

47

CUFFT: Features

• 1-D, 2-D, 3-D

• Precisions

– Single: real & complex

– Double: real & complex (not all functions)

• Uses CUDA memory calls & fft data

• Requires a ‘plan’

• Based on FFTW model

48

CUFFT Example


CUFFT Example (cont.)

Callouts on the code listing: complex to complex; 1 data (batch)

50

Questions?

Finite Element Analysis

on the GPU

Krishnan Suresh

[email protected]

52

Summary

• Minimize data transfer between CPU & GPU

• Avoid complex GPU logic

• Maximize independent #GPU-threads

• GPU memory management is critical


Finite Element Analysis

• Simulation of engineering phenomena

• Phenomena:

– Structural, thermal, fluid, …

– Static, modal, transient

– Linear, non-linear

54

Structural Static FEA

Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K_e \to K$, $f_e \to f$; $Ku = f$) → Post-process

55

FEA Variations

Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K_e \to K$, $f_e \to f$; $Ku = f$) → Post-process

Variations: Tet/Hex/…; Order/Hybrid; Direct/Iterative; Nonlinear; Optimization

1. Accurate

2. Automated

3. Fast

56

Why GPU?

Hours or even days of CPU time.

[Gordon; JPL]


Typical Bottleneck

Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K_e \to K$, $f_e \to f$; $Ku = f$) → Post-process

… GPU emphasis

58

Linear Solvers

$Ku = f$

K is sparse & usually symmetric positive definite (P.D.)

Direct: $K = LDL^T$, $u = L^{-T} D^{-1} L^{-1} f$

59

Direct Sparse on GPU (1)

(2006)

60

Direct Sparse on GPU (1)


Direct Sparse on GPU (1)

Single precision limitation

62

Direct Sparse on GPU (2) Ku f

(2008)

63

Direct Sparse on GPU (2)

Ku f

Effective speed-up

Linear Solvers

$Ku = f$

Direct: $K = LDL^T$, $u = L^{-T} D^{-1} L^{-1} f$

Iterative: $u_{i+1} = u_i + B(f - Ku_i)$, with $B$: preconditioner of $K$

• Repeated Matrix-Vector ops

• Ideally suited for GPU!


Linear Solvers

$Ku = f$

Direct

Iterative: $u_{i+1} = u_i + B(f - Ku_i)$, with $B$: preconditioner of $K$

Issues:

1. K: Sparse

2. B: Need pre-conditioner!

66

Iterative Sparse on GPU (2)

• Double precision real world SpMv

– CPU (2.3 GHz Dual Xeon): 1 GFLOPS

– GPU (GTX 280): 16 GFLOPS

67

Iterative Sparse on GPU (1)

(2008)

• Jacobi preconditioned conjugate gradient

• ATI GPU

• Speed-up 3.5.

68

Linear Solvers

$Ku = f$

Direct

Iterative: $u_{i+1} = u_i + P(f - Ku_i)$, with $P$: preconditioner of $K$

Iterative variants: Assembled; Assembly-free


Assembly-free Iterative Solvers

$u_{i+1} = u_i + P(f - Ku_i)$, with $P$: preconditioner of $K$

1. AfSpMv: Assembly-free Sparse Matrix-Vector Multiplication

• Compute Ku without assembling K

2. Assembly-free Pre-conditioning

• Compute Pv without assembling/inverting K

Hex-Mesh Solver using CUDA

(HMS-CUDA)

Objective • Solve structural problems posed over hexahedral meshes on the GPU

• Targeted Applications

Material Characterization Topology Optimization Cellular modeling

Large models (~10 million dof)

Arbitrary-shaped hex-element

Non-uniform connectivity

Heterogeneous/orthotropic material

Hexahedral elements

Displacement Elements

– Standard displacement interpolation

– 8 node hex; Ke: 24*24

– Good balance between accuracy and speed


Strategy “Assembly-free Matrix-Vector Multiply + Conjugate Gradient +

Preconditioning”

• Assembly-free:

– Decreases memory foot-print on GPU, easy to parallelize

• Assembly-free over arbitrary hex-meshes

– Ke is not same for all elements

– Re-computing Ke as needed is too expensive

– Storing all Ke can be a challenge (> 5 GB)

• Solution: Exploit Geometric and Material Congruency

– Compute and store Ke of template elements

– Map all elements to template elements

– Decreases memory footprint

– Increases coalesced memory access in GPU

Congruency Question

“A Congruence Problem for Polyhedra”, Borisov A., et al., Amer. Math. Monthly 117 (2010) 232–249

Are these two hexahedral elements geometrically

and materially congruent?, i.e.,

Will they result in the same Ke matrix?

Theorem 1.2 (Cauchy, 1839). Two convex polyhedra with corresponding congruent and

similarly situated faces have equal corresponding dihedral angles.

Translation: 12 measurements are sufficient.

Translation: A maximum of 18 measurements are sufficient

Corollary 2.5. Let P be a convex polyhedron with E edges. Then there is a set of E

measurements that is sufficient to determine P up to congruence amongst all nearby

convex polyhedra with the same combinatorial structure as P.

Congruency Question

Currently, edge lengths are used to determine congruency

Assign element signature

Compare element signature
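The assign-and-compare step can be sketched on the CPU as follows. Everything here is an assumption for illustration: the node ordering, the edge order, the rounding tolerance `tol`, and the names `edge_signature`, `assign_templates`, and `make_box` are hypothetical. Material properties are not part of this signature, and equal edge lengths are only a necessary condition for congruence, so a production check may need more measurements.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Point { double x, y, z; };
typedef std::array<Point, 8> Hex;            // nodes 0-3: bottom face, 4-7: top face
typedef std::array<long long, 12> Signature; // quantized edge lengths, fixed edge order

// Edge lengths in a fixed order, rounded to a tolerance (hypothetical choice).
Signature edge_signature(const Hex& h, double tol) {
    static const int edges[12][2] = {{0,1},{1,2},{2,3},{3,0},{4,5},{5,6},
                                     {6,7},{7,4},{0,4},{1,5},{2,6},{3,7}};
    Signature s;
    for (int e = 0; e < 12; ++e) {
        const Point& a = h[edges[e][0]];
        const Point& b = h[edges[e][1]];
        const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        s[e] = std::llround(std::sqrt(dx * dx + dy * dy + dz * dz) / tol);
    }
    return s;
}

// Map every element to a template id; equal signatures share one id.
std::vector<int> assign_templates(const std::vector<Hex>& elems, double tol) {
    std::map<Signature, int> seen;
    std::vector<int> ids;
    for (std::size_t i = 0; i < elems.size(); ++i) {
        const Signature s = edge_signature(elems[i], tol);
        std::map<Signature, int>::iterator it = seen.find(s);
        if (it == seen.end()) {
            const int id = static_cast<int>(seen.size());
            it = seen.insert(std::make_pair(s, id)).first;
        }
        ids.push_back(it->second);
    }
    return ids;
}

// Axis-aligned box helper, used only to build example elements.
Hex make_box(double x0, double y0, double z0, double dx, double dy, double dz) {
    Hex h = {{ {x0, y0, z0}, {x0 + dx, y0, z0}, {x0 + dx, y0 + dy, z0}, {x0, y0 + dy, z0},
               {x0, y0, z0 + dz}, {x0 + dx, y0, z0 + dz},
               {x0 + dx, y0 + dy, z0 + dz}, {x0, y0 + dy, z0 + dz} }};
    return h;
}
```

Only one Ke per distinct template id then needs to be computed and stored.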

Congruence Examples

Structured grid, isotropic material

– 8192 elements

– One template element

– 1 Ke matrix (24*24); 4.6 KB

Unstructured grid, isotropic material

– 6144 elements

– 34 template elements (~0.5%)

– 34 Ke matrices; 156 KB (vs 28 MB)


Results

Unstructured grid, composite

layers

– 83584 elements

– Material congruency check

– 322 templates (~0.4%)

– 1.48 MB storage (vs 384 MB)

Proposed Algorithm

Detect

Congruency

Hex-mesh

Compute Ke of templates

Push data to GPU

Assembly-free

Ku = f

on GPU

Conjugate Gradient

Example: Matlab

GPU algorithms:

Dot-product: Use CUBLAS

Ax: Implement assembly-

free

Also implement in CPU for

comparison

Assembly-free y = Kx in GPU

Each node assigned to a thread

Launch as many blocks as needed

Algorithm (per thread/node):

    uResult = 0, …
    For each neighboring elem
        For each of 8 nodes of elem
            Get (u,v,w) of node
            Apply Ke of elem, node
            Accumulate into uResult, …
    Apply BC
    Push uResult to Memory
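The per-node gather can be illustrated with a deliberately simplified 1-D analogue on the CPU: 2-node elements with one dof per node instead of 8-node hexes with (u,v,w). The `Mesh1D` layout and the function name are hypothetical, boundary conditions are omitted, and the outer loop over nodes stands in for one GPU thread per node.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified 1-D analogue: 2-node elements, 1 dof per node.
struct Mesh1D {
    std::vector<int> elemNodes;      // 2 node ids per element
    std::vector<int> elemTemplate;   // template id per element
    std::vector<double> templateKe;  // one 2x2 row-major Ke per template
    int numNodes;
};

// y = K x without assembling K: each node sums contributions of its elements.
std::vector<double> assembly_free_matvec(const Mesh1D& m, const std::vector<double>& x) {
    // node -> list of (element, local node index); a GPU code would precompute this
    std::vector<std::vector<std::pair<int, int> > > adj(m.numNodes);
    const int numElems = static_cast<int>(m.elemTemplate.size());
    for (int e = 0; e < numElems; ++e)
        for (int a = 0; a < 2; ++a)
            adj[m.elemNodes[2 * e + a]].push_back(std::make_pair(e, a));

    std::vector<double> y(m.numNodes, 0.0);
    for (int node = 0; node < m.numNodes; ++node) {   // one thread per node on the GPU
        double result = 0.0;
        for (std::size_t k = 0; k < adj[node].size(); ++k) {
            const int e = adj[node][k].first;
            const int a = adj[node][k].second;        // row of Ke owned by this node
            const double* Ke = &m.templateKe[4 * m.elemTemplate[e]];
            for (int b = 0; b < 2; ++b)
                result += Ke[2 * a + b] * x[m.elemNodes[2 * e + b]];
        }
        y[node] = result;                              // each node writes only its own entry
    }
    return y;
}
```

Because each node writes only its own result, no atomic updates are needed, at the price of reading each element's Ke once per attached node.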


GPU Data

Separate storage for (u,v,w) for coalesced memory

access.

Can reduce foot-print further

Assembly-free Kx on GPU

Code NOT Optimized!

Experiments

Models of different complexity

Four different solvers:

1. Matlab assembled direct

2. Matlab assembled CG (1e-10)

3. C assembly-free CG(1e-10)

4. CUDA assembly-free CG (1e-10)

All methods exploit congruency and

computing Ke of template elements (so

not included in cost)

All computations in double precision

All methods yield identical results

Platform

64 bit Windows Office Professional

CPU:

– i7, 3.2 GHz, 6 GB

– C Code single core

GPU:

– GTX 480 (480 cores)

– 1.5 GB


Timing: Problem 1

Isotropic structured hex mesh; 32 x 16 x 16; 1 template element; 28,000 dof

Solver                               Overhead cost (s)      Solution time (s)
Matlab – Assembled; direct solve     1.8 (K assembly)       7.0
Matlab – Assembled; CG (1e-10)       1.8 (K assembly)       8.7
C – Assembly-free; CG (1e-10)        -                      8.0
CUDA – Assembly-free; CG (1e-10)     0.13 (GPU transfer)    0.20

Timing: Problem 2

Isotropic structured hex mesh; 32 x 32 x 16; 1 template element; 55,539 dof

Solver                               Overhead cost (s)      Solution time (s)
Matlab – Assembled; direct solve     3.6 (K assembly)       33.2
Matlab – Assembled; CG (1e-10)       3.6 (K assembly)       2.73
C – Assembly-free; CG (1e-10)        -                      19.8
CUDA – Assembly-free; CG (1e-10)     0.16 (GPU transfer)    0.29

Timing: Problem 3

Isotropic structured hex mesh; 64 x 32 x 16; 1 template element; 109,385 dof

Solver                               Overhead cost (s)      Solution time (s)
Matlab – Assembled; direct solve     7.1 (K assembly)       > 8 mins (stalled)
Matlab – Assembled; CG (1e-10)       7.1 (K assembly)       8.8
C – Assembly-free; CG (1e-10)        0.0                    60.8
CUDA – Assembly-free; CG (1e-10)     0.19 (GPU transfer)    0.66

Timing: Problem 4

Isotropic structured hex mesh; 64 x 32 x 16; 2 template elements; 109,000 dof

                 Overhead cost (s)         Solution time (s)
Solver           Isotropic   Composite     Isotropic   Composite
Matlab: Direct   --          --            --          --
Matlab: CG       7.1         7.1           8.8         22.0
C: CG, AF        0.0         0.0           60.8        154
CUDA: CG, AF     0.16        0.17          0.66        1.6

Same K, but 2.5 x CG iterations.

No penalty in GPU due to non-isotropy


Timing: Problem 5

Isotropic structured hex mesh; 256 x 128 x 64; 1 template element; 6.5 million dof; 468 MB GPU memory

Solver                               Overhead cost (s)      Solution time (s)
Matlab – Assembled; direct solve     --                     Stalls!
Matlab – Assembled; CG (1e-10)       --                     Stalls!
C – Assembly-free; CG (1e-10)        --                     Stalls!
CUDA – Assembly-free; CG (1e-10)     0.41 (GPU transfer)    87.2

Timing: Problem 6

Unstructured hex mesh; Isotropic and Composite (alternate layers +/- 60 deg fibers); 23,000 dof; 68 template elements

                 Overhead cost (s)         Solution time (s)
Solver           Isotropic   Composite     Isotropic   Composite
Matlab: Direct   1.28        1.28          2.7         2.7
Matlab: CG       1.28        1.28          20.0        20.0
C: CG, AF        0.0         0.0           127         132
CUDA: CG, AF     0.16        0.17          6.5         7.13

High aspect ratio elements; CG performs poorly without preconditioning

Timing: Problem 6

Unstructured hex mesh; Isotropic and Composite (alternate layers +/- 60 deg fibers); 274,000 dof; 322 template elements

                 Overhead cost (s)         Solution time (s)
Solver           Isotropic   Composite     Isotropic   Composite
Matlab: Direct   --          --            --          --
Matlab: CG       18.3        18.3          214.2       216.3
C: CG, AF        0.0         0.0           720.3       770.6
CUDA: CG, AF     0.18        0.18          29.3        35.2

CG Pre-conditioning for FEA

(Chart axes: Effective vs. Parallel-Friendly. Entries: Jacobi, ILU, Multi-grid, Generic Mesh, Structured Mesh, Coarse-Fine, ELE, SSOR)


Exploiting HMS-CUDA

for Topology Optimization

Classic SIMP Optimization

D

Demo

Matlab code www.ersl.wisc.edu

Large-Scale Optimization

Size                  DOF          [Wang 07]*
Medium (84,28,14)     107,184      2.4 hours
Large (180,60,30)     1,010,160    45.7 hours

*[Wang 07]: “Large-scale topology optimization using preconditioned Krylov

subspace methods with recycling”, Wang, de Sturler, Paulino, IJNME, vol. 69, 2007.


Bridge Problem

V = 30% 1 min 10 secs

Exploiting HMS-CUDA

for Material Characterization

Double Notch Specimen

Fixed

Prescribed displacement

Accuracy: Double Notch

Prescribed: u = 0.01

     ANSYS                  CUDA
u    (0, 0.01)              (0, 0.01)
v    (-0.098, 0.098)        (-0.098, 0.098)
z    (-0.0016, 0.00016)     (-0.0016, 0.00016)


Accuracy: Double Notch

Prescribed: w = 0.01

     ANSYS                  CUDA
u    (-0.00027, 0.00024)    (-0.00027, 0.00024)
v    (-0.0011, 0.0011)      (-0.0011, 0.0011)
z    (0, 0.01)              (0, 0.01)

Material Characterization-1

Initial Guess: E1=100 (MPa), E2=25, ν12=0.1, G23=25

Exact Answer: E1=147000 (MPa), E2=10000, ν12=0.27, G23=7000

F=(1000,0,0), M=(0,0,0); CGTol = 1e-10

#FEA = 4700

Converged Result: E1=146957, E2=10362, ν12=0.268, G23=6963

Material Characterization-2

Initial Guess: E1=100 (MPa), E2=25, ν12=0.1, G23=25

Exact Answer: E1=147000, E2=10000, ν12=0.27, G23=7000

F=(1000,0,0), M=(0,0,0); CGTol = 0.001

#FEA = 2162

Converged Result: E1=147190, E2=10266, ν12=0.27, G23=6996

Material Characterization-3

Initial Guess: E1=100 (MPa), E2=25, ν12=0.1, G23=25

Exact Answer: E1=147000 (MPa), E2=10000, ν12=0.27, G23=7000

F=(1000,1000,1000), M=(0,0,0); CGTol = 0.001

#FEA = 1952

Converged Result: E1=146967, E2=10329, ν12=0.259, G23=6949


Material Characterization-4

Initial Guess: E1=100 (MPa), E2=25, ν12=0.1, G23=25

Exact Answer: E1=147000 (MPa), E2=10000, ν12=0.27, G23=7000

F=(1000,1000,1000), M=(0,0,0); CGTol = 0.001

#FEA = 2876

Converged Result: E1=144340, E2=10378, ν12=0.193, G23=6998

Conclusions

• Assembly-free Hex-mesh solver on GPU hard to beat!

• Very little penalty for material variation

• A penalty-factor of 3.0 for unstructured hex meshes

• Future work

– ANSYS interface

– Optimize GPU code

– Pre-conditioner for thin elements

– Reduce GPU Memory

– OpenCL implementation

– Congruent hex mesh generator

– Multi-physics

– Transient problems

– Nonlinear problems

– …