CUDA Libraries & Packages
outreach.sbel.wisc.edu/Workshops/GPUworkshop/2011/
1
Popular CUDA Packages
Krishnan Suresh
2
CUDA Libraries & Packages
1. CUBLAS: Dense Linear Algebra
2. Thrust: Parallel sort, scan, etc.
3. CUSP: Sparse Linear Algebra Package
4. CuSparse: Sparse Linear Algebra Package
5. CUFFT: Fast Fourier Transform
6. GPUMat: Matlab Wrapper (free)
7. Jacket: Matlab Wrapper ($$)
3
CUBLAS
4
CUBLAS
• CUDA implementation of BLAS (Basic
Linear Algebra Subprograms)
– Vector-vector (Level-1)
– Matrix-vector (Level-2)
– Matrix-matrix (Level-3)
• Precisions
– Single: real & complex
– Double: real & complex (not all functions)
• No explicit kernel calls, shared-memory management, etc.
5
CUBLAS Library Usage
• No additional downloads needed
– cublas.lib (in CUDA SDK)
– Add cublas.lib to linker
– #include <cublas.h>
6
CUBLAS Code Structure
1. Initialize CUBLAS: cublasInit()
2. Create CPU memory and data
3. Create GPU memory: cublasAlloc(…)
4. Copy from CPU to GPU: cublasSetVector(…)
5. Operate on GPU: cublasSgemm(…)
6. Check for CUBLAS error: cublasGetError()
7. Copy from GPU to CPU: cublasGetVector(…)
8. Verify results
9. Free GPU memory: cublasFree(…)
10. Shut down CUBLAS: cublasShutdown()
7
CUBLAS BLAS-1 Functions:
Vector-vector operations
8
CU(BLAS) Naming Convention
cublasIsamax = I (index of) + s (single precision) + amax (absolute max):
find the index of the element with the largest absolute value in a
vector of single-precision reals.
Variants: cublasIdamax (double), cublasIcamax (single complex),
cublasIzamax (double complex).
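As a CPU reference (not the CUBLAS call itself), the semantics of cublasIsamax can be sketched like this; note the 1-based (Fortran-style) result index that CUBLAS uses:

```cpp
#include <cmath>
#include <vector>

// CPU sketch of what cublasIsamax returns: the 1-based index of the
// element with the largest absolute value in a vector of floats.
int isamax_ref(const std::vector<float>& x) {
    int best = 1;                                  // 1-based index
    for (int i = 1; i < (int)x.size(); ++i)
        if (std::fabs(x[i]) > std::fabs(x[best - 1]))
            best = i + 1;
    return best;
}
```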
9
CU(BLAS) Naming Convention (2)
cublasSaxpy = S (single precision) + axpy (alpha*x + y):
compute y = alpha*x + y, where x and y are vectors of single-precision
reals and alpha is a scalar.
Variant: cublasDaxpy (double).
10
CUBLAS Example-1 (CPU): a = xᵀy [code screenshot not captured in transcript]
11
CUBLAS Example-1 (GPU): a = xᵀy [code screenshot not captured in transcript]
• No kernel calls
• No explicit memory management
• Vector increment (stride) of 1
12
CUBLAS Example-2 (CPU): z = x + y [code screenshot not captured in transcript]
13
CUBLAS Example-2 (GPU): z = x + y [code screenshot not captured in transcript]
Output stored in d_y
14
CUBLAS BLAS-2 Functions:
Matrix-Vector Operations
z = αAx + βy   (A symmetric banded)
x = A⁻¹y       (A upper or lower triangular)
15
CUBLAS: Caveats
• Solves Ax = b only for (upper/lower) triangular A
• Supports only a limited class of sparse matrices (e.g., banded)
• Data is column-major & 1-indexed (Fortran style)
• C is row-major & 0-indexed; use index-mapping macros
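The index-mapping macros look like this (IDX2F/IDX2C are the names used in NVIDIA's CUBLAS examples); ld is the leading dimension, i.e., the number of rows:

```c
/* CUBLAS expects column-major (Fortran-style) storage, so a C program
   keeps the matrix in a flat array and maps (row, col) to an offset.
   IDX2F takes 1-based indices, IDX2C takes 0-based indices. */
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1))
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* Fill a column-major ld x n matrix with a(i,j) = 10*i + j,
   using the 1-based macro exactly as a Fortran routine would. */
void fill_colmajor(float *a, int ld, int n) {
    for (int j = 1; j <= n; ++j)
        for (int i = 1; i <= ld; ++i)
            a[IDX2F(i, j, ld)] = 10.0f * (float)i + (float)j;
}
```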
16
CU(BLAS) Naming Convention
cublasSsbmv = S (single precision) + sbmv (symmetric banded matrix-vector):
compute z = αAx + βy, where A is symmetric banded (nonzeros only on a
few diagonals around the main diagonal):

  x x x
  x x x x
  x x x x x
    x x x x
      x x x
17
Example: z = αAx + βy

      | 2 1           |
      | 1 2 1         |
  A = |   1 2 ...     |     (N x N, tridiagonal)
      |     ... ... 1 |
      |         1   2 |

Since A is symmetric banded with one super-diagonal, it is sufficient
to store it as a 2 x N array h_A (stored as symmetric-banded,
#super-diagonals = 1):

  h_A = | 1 1 1 ... _ |   (super-diagonal; last entry unused)
        | 2 2 2 ... 2 |   (main diagonal)
18
CUBLAS Example-3 (CPU): z = αAx + βy
A stored as the 2 x N array h_A above; a macro handles 0-indexing in C.
[code screenshot not captured in transcript]
19
CUBLAS Example-3 (CPU), cont.: the CPU loop applies the tridiagonal
A = tridiag(1, 2, 1) row by row, i.e., z_i = x_(i-1) + 2 x_i + x_(i+1) + y_i.
[code screenshot not captured in transcript]
20
CUBLAS Example-3 (GPU): z = αAx + βy
cublasSsbmv arguments include #rows, the upper-storage mode, and the
number of super-diagonals. [code screenshot not captured in transcript]
21
CUBLAS Optimal Usage
1. Copy from CPU to GPU: cublasSet…(…)
2. Operate on GPU:
   Operation 1
   Operation 2
   …
   Operation n
3. Copy from GPU to CPU: cublasGet…(…)
22
CUBLAS BLAS-3 Functions:
Matrix-Matrix Operations
C = αAB + βC
X = αA⁻¹B      (A upper or lower triangular)
23
CUBLAS Performance
24
CUBLAS SGEMM Performance
[performance chart not captured in transcript: CUBLAS SGEMM vs. a naive
CPU implementation and a CPU implementation optimized for a 4-core CPU]
25
Thrust
26
Thrust
• C++ template library using CUDA
• Vector containers: host_vector & device_vector
– Generalize std::vector
– Store any type & resize dynamically
• Numerous algorithms: sort, sum (reduce), max, …
27
Thrust: Getting started
• Download into the CUDA include directory
– http://code.google.com/p/thrust/
– Requires CUDA 2.3
• Tutorial:
– http://code.google.com/p/thrust/wiki/Tutorial
28
Thrust: Concept [diagram not captured in transcript]
29
Thrust Algorithms: Prefix Sum
Given a sequence {x_1, x_2, x_3, ..., x_N}
and an operation ⊕,
the output is {x_1, x_1 ⊕ x_2, x_1 ⊕ x_2 ⊕ x_3, ..., x_1 ⊕ x_2 ⊕ ... ⊕ x_N}
30
Prefix Sum
Key to numerous algorithms
Also referred to as “Scan” algorithm
Different operations yield different scans
31
Prefix Sum: Example
Given the sequence {1, 2, 9, 6, ...} and the operation +,
the output is {x_1, x_1 + x_2, x_1 + x_2 + x_3, ...} = {1, 3, 12, 18, ...}
32
Prefix Sum: Example
Given the sequence {1, 2, 9, 6, ...} and the operation *,
the output is {x_1, x_1 * x_2, x_1 * x_2 * x_3, ...} = {1, 2, 18, 108, ...}
33
Prefix Sum: Example
Given the sequence {1, 2, 9, 6, ...} and the operation max,
the output is {x_1, max(x_1, x_2), max(max(x_1, x_2), x_3), ...} = {1, 2, 9, 9, ...}
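The three example scans can be reproduced with a CPU sketch (thrust::inclusive_scan is the GPU counterpart); the operation ⊕ is passed in as a function:

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Inclusive scan ("prefix sum") with a caller-supplied operation, as
// in the slides: out_i = x_1 op x_2 op ... op x_i.
std::vector<int> scan(const std::vector<int>& x,
                      std::function<int(int, int)> op) {
    std::vector<int> out(x.size());
    std::partial_sum(x.begin(), x.end(), out.begin(), op);
    return out;
}
```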
34
Thrust: Examples Set-up [code screenshot not captured in transcript]
35
Thrust: Examples [code screenshot not captured in transcript]
36
Thrust: Examples cont.
a = x_1^2 + x_2^2 + ... + x_N^2
[code screenshot not captured in transcript]
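A CPU sketch of this reduction; in Thrust the same sum of squares is typically an inner product of the device vector with itself:

```cpp
#include <numeric>
#include <vector>

// a = x_1^2 + x_2^2 + ... + x_N^2, as an inner product of x with itself.
float sum_of_squares(const std::vector<float>& x) {
    return std::inner_product(x.begin(), x.end(), x.begin(), 0.0f);
}
```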
37
CUSP
Cusp is a library for sparse
linear algebra and graph
computations on CUDA.
38
CUSP
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
int main(void)
{
// create an empty sparse matrix structure (HYB format)
cusp::hyb_matrix<int, float, cusp::device_memory> A;
// load a matrix stored in MatrixMarket format
cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");
// allocate storage for solution (x) and right hand side (b)
cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0);
cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1);
// solve the linear system A * x = b with the Conjugate Gradient method
cusp::krylov::cg(A, x, b);
return 0;
}
39
CuSparse [feature slide not captured in transcript]
40
CUFFT
CUDA Implementation of
Fast Fourier Transform
41
Fourier Transform
• Extract frequencies from a signal
• Given a function f(t), −∞ < t < ∞
• 1-D Fourier transform:
  f̂(ξ) = ∫_{−∞}^{+∞} f(t) e^{−2πiξt} dt
• 2-D, 3-D analogues
42
Fourier Transform
Continuous signal and its Fourier transform (Wikipedia)
Inverse transform: f(t) = ∫_{−∞}^{+∞} f̂(ξ) e^{2πiξt} dξ
43
Discrete Fourier Transform
• Given a sequence x_0, x_1, ..., x_{N−1}
• The discrete Fourier transform (DFT) is another sequence:
  x̂_k = ∑_{n=0}^{N−1} x_n e^{−2πikn/N}
44
DFT Examples
[example plots not captured in transcript: there is a highest frequency
that can be captured correctly]
45
Fast Fourier Transform
• DFT: naive O(N²) operation
• FFT: fast DFT, O(N log N)
• Key to signal processing, PDEs, …
  x_0, x_1, ..., x_{N−1}  →  x̂_0, x̂_1, ..., x̂_{N−1}
  x̂_k = ∑_{n=0}^{N−1} x_n e^{−2πikn/N}
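The naive O(N²) DFT is a direct transcription of the formula above; CUFFT (like FFTW) computes the same sums in O(N log N):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive DFT: xhat_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N).
std::vector<std::complex<double>>
dft(const std::vector<std::complex<double>>& x) {
    const size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> xhat(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            xhat[k] += x[n] *
                std::exp(std::complex<double>(0.0, -2.0 * pi * (double)(k * n) / (double)N));
    return xhat;
}
```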
46
CUFFT
• Fast CUDA library for FFT
• No additional downloads needed
– cufft.lib (in CUDA SDK)
– Add cufft.lib to linker
– #include <cufft.h>
47
CUFFT: Features
• 1-D, 2-D, 3-D transforms
• Precisions
– Single: real & complex
– Double: real & complex (not all functions)
• Uses standard CUDA memory calls & its own FFT data types
• Requires a 'plan'
• Modeled on FFTW
48
CUFFT Example [code screenshot not captured in transcript]
49
CUFFT Example (cont.): complex-to-complex transform, batch of 1
[code screenshot not captured in transcript]
50
Questions?
Finite Element Analysis
on the GPU
Krishnan Suresh
52
Summary
• Minimize data transfer between CPU & GPU
• Avoid complex GPU logic
• Maximize the number of independent GPU threads
• GPU memory management is critical
53
Finite Element Analysis
• Simulation of engineering phenomena
• Phenomena:
– Structural, thermal, fluid, …
– Static, modal, transient
– Linear, non-linear
54
Structural Static FEA
Structural Static FEA
Model → Discretize → Element Stiffness (K_e, f_e)
→ Assemble (K = ∑ K_e, f = ∑ f_e) / Solve (Ku = f) → Post-process
55
FEA Variations
Model → Discretize → Element Stiffness (K_e, f_e)
→ Assemble (K = ∑ K_e, f = ∑ f_e) / Solve (Ku = f) → Post-process
Variations: tet/hex/… elements; element order/hybrid formulations;
direct/iterative solvers; nonlinear analysis; optimization loops.
Goals: 1. Accurate  2. Automated  3. Fast
56
Why GPU?
Hours or even days of CPU time.
[Gordon; JPL]
57
Typical Bottleneck
Model → Discretize → Element Stiffness (K_e, f_e)
→ Assemble (K = ∑ K_e, f = ∑ f_e) / Solve (Ku = f) → Post-process
The Assemble/Solve stage is the bottleneck … GPU emphasis
58
Linear Solvers
Ku f
K is sparse & usually symmetric P.D.
Direct:
  K = L D L^T
  u = L^-T D^-1 L^-1 f
59
Direct Sparse on GPU (1)
(2006)
60
Direct Sparse on GPU (1)
61
Direct Sparse on GPU (1)
Single precision limitation
62
Direct Sparse on GPU (2): Ku = f
(2008)
63
Direct Sparse on GPU (2): Ku = f
Effective speed-up [chart not captured in transcript]
Linear Solvers
Ku = f
Direct:
  K = L D L^T;  u = L^-T D^-1 L^-1 f
Iterative:
  u_{i+1} = u_i + B (f − K u_i),  B: preconditioner of K
• Repeated matrix-vector ops
• Ideally suited for GPU!
65
Linear Solvers
Ku = f
Direct | Iterative: u_{i+1} = u_i + B (f − K u_i),  B: preconditioner of K
Issues:
1. K: sparse
2. B: need a pre-conditioner!
66
Iterative Sparse on GPU (2)
• Double-precision real-world SpMV
– CPU (2.3 GHz dual Xeon): 1 GFLOPS
– GPU (GTX 280): 16 GFLOPS
67
Iterative Sparse on GPU (1)
(2008)
• Jacobi-preconditioned conjugate gradient
• ATI GPU
• Speed-up: 3.5x
68
Linear Solvers
Ku = f
Direct | Iterative: u_{i+1} = u_i + P (f − K u_i),  P: preconditioner of K
Assembled vs. assembly-free
69
Assembly-free Iterative Solvers
u_{i+1} = u_i + P (f − K u_i),  P: preconditioner of K
1. AfSpMV: Assembly-free Sparse Matrix-Vector Multiplication
   • Compute Ku without assembling K
2. Assembly-free pre-conditioning
   • Compute Pv without assembling/inverting K
Hex-Mesh Solver using CUDA
(HMS-CUDA)
Objective
• Solve structural problems posed over hexahedral meshes on the GPU
• Targeted applications: material characterization, topology optimization, cellular modeling
• Large models (~10 million dof)
• Arbitrary-shaped hex elements
• Non-uniform connectivity
• Heterogeneous/orthotropic material
Hexahedral elements
• Displacement elements
– Standard displacement interpolation
– 8-node hex; K_e: 24×24
– Good balance between accuracy and speed
Strategy: "Assembly-free Matrix-Vector Multiply + Conjugate Gradient + Preconditioning"
• Assembly-free:
– Decreases memory footprint on GPU; easy to parallelize
• Assembly-free over arbitrary hex-meshes:
– K_e is not the same for all elements
– Re-computing K_e as needed is too expensive
– Storing all K_e can be a challenge (> 5 GB)
• Solution: exploit geometric and material congruency
– Compute and store K_e of template elements
– Map all elements to template elements
– Decreases memory footprint
– Increases coalesced memory access on GPU
Congruency Question
Are these two hexahedral elements geometrically and materially
congruent? I.e., will they result in the same K_e matrix?

"A Congruence Problem for Polyhedra", Borisov A., et al.,
Amer. Math. Monthly 117 (2010), 232-249.

Theorem 1.2 (Cauchy, 1839). Two convex polyhedra with corresponding
congruent and similarly situated faces have equal corresponding
dihedral angles.
Translation: 12 measurements are sufficient.
Translation: a maximum of 18 measurements are sufficient.

Corollary 2.5. Let P be a convex polyhedron with E edges. Then there
is a set of E measurements that is sufficient to determine P up to
congruence amongst all nearby convex polyhedra with the same
combinatorial structure as P.
Congruency Question
Currently, edge lengths are used to determine congruency
Assign element signature
Compare element signature
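The signature idea can be sketched as follows. This is an illustrative simplification (sorted edge lengths plus a material id; the names Signature and template_id are hypothetical): elements with equal signatures share one template K_e slot. Note that sorted edge lengths alone are a necessary but not sufficient congruence test, which is exactly the subtlety the Cauchy discussion above addresses:

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Signature = (sorted edge lengths, material id); equal signatures map
// to the same template element, so only one K_e is stored per template.
using Signature = std::pair<std::vector<double>, int>;

int template_id(std::map<Signature, int>& templates,
                std::vector<double> edge_lengths, int material) {
    std::sort(edge_lengths.begin(), edge_lengths.end());
    Signature sig{edge_lengths, material};
    auto it = templates.find(sig);
    if (it != templates.end()) return it->second;   // reuse existing template
    int id = (int)templates.size();                 // register new template
    templates[sig] = id;
    return id;
}
```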
Congruence Examples
Structured grid, isotropic material:
– 8192 elements
– 1 template element
– 1 K_e matrix (24×24); 4.6 KB
Unstructured grid, isotropic material:
– 6144 elements
– 34 template elements (~0.5%)
– 34 K_e matrices; 156 KB (vs. 28 MB)
Results
Unstructured grid, composite layers:
– 83584 elements
– Material congruency check
– 322 templates (~0.4%)
– 1.48 MB storage (vs. 384 MB)
Proposed Algorithm
Hex-mesh → Detect congruency → Compute K_e of templates
→ Push data to GPU → Assembly-free Ku = f on GPU (conjugate gradient)
Example: Matlab
GPU algorithms:
– Dot-product: use CUBLAS
– Ax: implement assembly-free
Also implemented on the CPU for comparison
Assembly-free y = Kx on GPU
Each node is assigned to a thread; launch as many blocks as needed.
Algorithm (per thread/node):
  uResult = 0, …
  for each neighboring elem:
    for each of the 8 nodes of elem:
      get (u, v, w) of node
      apply K_e of (elem, node)
      accumulate into uResult, …
  apply BC
  push uResult to memory
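The per-node algorithm can be sketched in serial CPU code, shrunk to a hypothetical 1-D chain of 2-node elements sharing one template stiffness K_e = k [[1, −1], [−1, 1]] (the slides use 24×24 hex elements, with one GPU thread per node). The product y = Kx is formed element by element without ever assembling K:

```cpp
#include <vector>

// Assembly-free y = K x for a 1-D chain of nnodes-1 two-node elements:
// each "thread" (here, loop iteration) owns one node, visits its
// neighboring elements, and accumulates their K_e contributions.
std::vector<double> afspmv_chain(int nnodes, double k,
                                 const std::vector<double>& x) {
    std::vector<double> y(nnodes, 0.0);
    const double Ke[2][2] = {{k, -k}, {-k, k}};     // template element
    for (int node = 0; node < nnodes; ++node) {     // "per thread"
        double acc = 0.0;
        for (int e = node - 1; e <= node; ++e) {    // neighboring elements
            if (e < 0 || e + 1 >= nnodes) continue;
            int local = node - e;                   // 0 or 1 within element
            for (int a = 0; a < 2; ++a)             // element's 2 nodes
                acc += Ke[local][a] * x[e + a];
        }
        y[node] = acc;                              // push result to memory
    }
    return y;
}
```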
GPU Data
Separate storage for (u, v, w) for coalesced memory access.
Can reduce footprint further.
Assembly-free Kx on GPU
Code NOT optimized! [code screenshot not captured in transcript]
Experiments
Models of different complexity.
Four different solvers:
1. Matlab assembled direct
2. Matlab assembled CG (1e-10)
3. C assembly-free CG (1e-10)
4. CUDA assembly-free CG (1e-10)
All methods exploit congruency and pre-compute K_e of the template
elements (so this cost is not included below).
All computations in double precision.
All methods yield identical results.
Platform
64-bit Windows Professional
CPU: i7, 3.2 GHz, 6 GB RAM; C code runs on a single core
GPU: GTX 480 (480 cores), 1.5 GB
Timing: Problem 1
Isotropic structured hex mesh; 32 x 16 x 16; 1 template element; 28,000 dof
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   1.8 (K assembly)      7.0
Matlab – Assembled; CG (1e-10)     1.8 (K assembly)      8.7
C – Assembly-free; CG (1e-10)      -                     8.0
CUDA – Assembly-free; CG (1e-10)   0.13 (GPU transfer)   0.20
Timing: Problem 2
Isotropic structured hex mesh; 32 x 32 x 16; 1 template element; 55,539 dof
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   3.6 (K assembly)      33.2
Matlab – Assembled; CG (1e-10)     3.6 (K assembly)      2.73
C – Assembly-free; CG (1e-10)      -                     19.8
CUDA – Assembly-free; CG (1e-10)   0.16 (GPU transfer)   0.29
Timing: Problem 3
Isotropic structured hex mesh; 64 x 32 x 16; 1 template element; 109,385 dof
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   7.1 (K assembly)      > 8 mins (stalled)
Matlab – Assembled; CG (1e-10)     7.1 (K assembly)      8.8
C – Assembly-free; CG (1e-10)      0.0                   60.8
CUDA – Assembly-free; CG (1e-10)   0.19 (GPU transfer)   0.66
Timing: Problem 4
Structured hex mesh (isotropic vs. composite); 64 x 32 x 16; 2 template
elements; 109,000 dof

                  Overhead cost (s)        Solution time (s)
Solver            Isotropic   Composite    Isotropic   Composite
Matlab: Direct    --          --           --          --
Matlab: CG        7.1         7.1          8.8         22.0
C: CG, AF         0.0         0.0          60.8        154
CUDA: CG, AF      0.16        0.17         0.66        1.6

Same K, but 2.5x the CG iterations; no GPU penalty due to non-isotropy.
Timing: Problem 5
Isotropic structured hex mesh; 256 x 128 x 64; 1 template element;
6.5 million dof; 468 MB GPU memory
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   --                    Stalls!
Matlab – Assembled; CG (1e-10)     --                    Stalls!
C – Assembly-free; CG (1e-10)      --                    Stalls!
CUDA – Assembly-free; CG (1e-10)   0.41 (GPU transfer)   87.2
Timing: Problem 6
Unstructured hex mesh; isotropic and composite (alternate layers of
+/- 60 deg fibers); 23,000 dof; 68 template elements
[mesh figure not captured; fixed face and applied load]

                  Overhead cost (s)        Solution time (s)
Solver            Isotropic   Composite    Isotropic   Composite
Matlab: Direct    1.28        1.28         2.7         2.7
Matlab: CG        1.28        1.28         20.0        20.0
C: CG, AF         0.0         0.0          127         132
CUDA: CG, AF      0.16        0.17         6.5         7.13

High-aspect-ratio elements; CG performs poorly without preconditioning.
Timing: Problem 6 (larger mesh)
Unstructured hex mesh; isotropic and composite (alternate layers of
+/- 60 deg fibers); 274,000 dof; 322 template elements
[mesh figure not captured; fixed face and applied load]

                  Overhead cost (s)        Solution time (s)
Solver            Isotropic   Composite    Isotropic   Composite
Matlab: Direct    --          --           --          --
Matlab: CG        18.3        18.3         214.2       216.3
C: CG, AF         0.0         0.0          720.3       770.6
CUDA: CG, AF      0.18        0.18         29.3        35.2
CG Pre-conditioning for FEA
[chart not captured in transcript: candidate preconditioners (Jacobi,
SSOR, ILU, ELE, multi-grid for generic/structured meshes, coarse-fine)
plotted by effectiveness vs. parallel-friendliness]
Exploiting HMS-CUDA
for Topology Optimization
Classic SIMP Optimization [demo figures not captured in transcript]
Demo: Matlab code at www.ersl.wisc.edu
Large-Scale Optimization

Size                 DOF        Time [Wang 07]*
Medium (84, 28, 14)  107,184    2.4 hours
Large (180, 60, 30)  1,010,160  45.7 hours

*[Wang 07]: "Large-scale topology optimization using preconditioned
Krylov subspace methods with recycling", Wang, de Sturler, Paulino,
IJNME, vol. 69, 2007.
Bridge Problem
V = 30%; solved in 1 min 10 secs
Exploiting HMS-CUDA
for Material Characterization
Double Notch Specimen
[figure not captured: one face fixed, prescribed displacement on the other]
Accuracy: Double Notch
     ANSYS                  CUDA
u    (0, 0.01)              (0, 0.01)
v    (-0.098, 0.098)        (-0.098, 0.098)
z    (-0.0016, 0.00016)     (-0.0016, 0.00016)
Prescribed: u = 0.01
Accuracy: Double Notch
     ANSYS                  CUDA
u    (-0.00027, 0.00024)    (-0.00027, 0.00024)
v    (-0.0011, 0.0011)      (-0.0011, 0.0011)
z    (0, 0.01)              (0, 0.01)
Prescribed: w = 0.01
Material Characterization-1
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000 (MPa), E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 0, 0), M = (0, 0, 0); CGTol = 1e-10; #FEA = 4700
Converged result: E1 = 146957, E2 = 10362, n12 = 0.268, G23 = 6963
Material Characterization-2
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000, E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 0, 0), M = (0, 0, 0); CGTol = 0.001; #FEA = 2162
Converged result: E1 = 147190, E2 = 10266, n12 = 0.27, G23 = 6996
Material Characterization-3
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000 (MPa), E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 1000, 1000), M = (0, 0, 0); CGTol = 0.001; #FEA = 1952
Converged result: E1 = 146967, E2 = 10329, n12 = 0.259, G23 = 6949
Material Characterization-4
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000 (MPa), E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 1000, 1000), M = (0, 0, 0); CGTol = 0.001; #FEA = 2876
Converged result: E1 = 144340, E2 = 10378, n12 = 0.193, G23 = 6998
Conclusions
• Assembly-free Hex-mesh solver on GPU hard to beat!
• Very little penalty for material variation
• A penalty-factor of 3.0 for unstructured hex meshes
• Future work
– ANSYS interface
– Optimize GPU code
– Pre-conditioner for thin elements
– Reduce GPU Memory
– OpenCL implementation
– Congruent hex mesh generator
– Multi-physics
– Transient problems
– Nonlinear problems
– …