CUDA Libraries & Packages
outreach.sbel.wisc.edu/Workshops/GPUworkshop/2011/
1
Popular CUDA Packages
Krishnan Suresh
2
CUDA Libraries & Packages
1. CUBLAS: Dense Linear Algebra
2. Thrust: Parallel sort, scan, etc.
3. CUSP: Sparse Linear Algebra Package
4. CuSparse: Sparse Linear Algebra Package
5. CUFFT: Fast Fourier Transform
6. GPUMat: Matlab Wrapper (free)
7. Jacket: Matlab Wrapper ($$)
3
CUBLAS
4
CUBLAS
• CUDA implementation of BLAS (Basic
Linear Algebra Subprograms)
– Vector-vector (Level-1)
– Matrix-vector (Level-2)
– Matrix-matrix (Level-3)
• Precisions
– Single: real & complex
– Double: real & complex (not all functions)
• No explicit kernel calls, shared-memory management, etc.
5
CUBLAS Library Usage
• No additional downloads needed
– cublas.lib (in CUDA SDK)
– Add cublas.lib to linker
– #include <cublas.h>
6
CUBLAS Code Structure
1. Initialize CUBLAS: cublasInit()
2. Create CPU memory and data
3. Create GPU memory: cublasAlloc(…)
4. Copy from CPU to GPU: cublasSetVector(…)
5. Operate on GPU: cublasSgemm(…)
6. Check for CUBLAS error: cublasGetError()
7. Copy from GPU to CPU: cublasGetVector(…)
8. Verify results
9. Free GPU memory: cublasFree(…)
10. Shut down CUBLAS: cublasShutdown()
7
CUBLAS BLAS-1 Functions:
Vector-vector operations
8
CU(BLAS) Naming Convention
cublasIsamax = I (index of) + s (single precision) + amax (absolute max):
find the index of the element with the largest absolute value in a
vector of single-precision reals.
Variants: cublasIdamax (double), cublasIcamax (single complex),
cublasIzamax (double complex).
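As a CPU reference (not the CUBLAS call itself), the semantics of cublasIsamax can be sketched like this; note the 1-based (Fortran-style) result index that CUBLAS uses:

```cpp
#include <cmath>
#include <vector>

// CPU sketch of what cublasIsamax returns: the 1-based index of the
// element with the largest absolute value in a vector of floats.
int isamax_ref(const std::vector<float>& x) {
    int best = 1;                                  // 1-based index
    for (int i = 1; i < (int)x.size(); ++i)
        if (std::fabs(x[i]) > std::fabs(x[best - 1]))
            best = i + 1;
    return best;
}
```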
9
CU(BLAS) Naming Convention (2)
cublasSaxpy = S (single precision) + axpy (alpha*x + y):
compute y = alpha*x + y, where x and y are vectors of single-precision
reals and alpha is a scalar.
Variant: cublasDaxpy (double).
10
CUBLAS Example-1 (CPU): a = xᵀy [code screenshot not captured in transcript]
11
CUBLAS Example-1 (GPU): a = xᵀy [code screenshot not captured in transcript]
• No kernel calls
• No explicit memory management
• Vector increment (stride) of 1
12
CUBLAS Example-2 (CPU): z = x + y [code screenshot not captured in transcript]
13
CUBLAS Example-2 (GPU): z = x + y [code screenshot not captured in transcript]
Output stored in d_y
14
CUBLAS BLAS-2 Functions:
Matrix-Vector Operations
z = αAx + βy   (A symmetric banded)
x = A⁻¹y       (A upper or lower triangular)
15
CUBLAS: Caveats
• Solves Ax = b only for (upper/lower) triangular A
• Supports only a limited class of sparse matrices (e.g., banded)
• Data is column-major & 1-indexed (Fortran style)
• C is row-major & 0-indexed; use index-mapping macros
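The index-mapping macros look like this (IDX2F/IDX2C are the names used in NVIDIA's CUBLAS examples); ld is the leading dimension, i.e., the number of rows:

```c
/* CUBLAS expects column-major (Fortran-style) storage, so a C program
   keeps the matrix in a flat array and maps (row, col) to an offset.
   IDX2F takes 1-based indices, IDX2C takes 0-based indices. */
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1))
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* Fill a column-major ld x n matrix with a(i,j) = 10*i + j,
   using the 1-based macro exactly as a Fortran routine would. */
void fill_colmajor(float *a, int ld, int n) {
    for (int j = 1; j <= n; ++j)
        for (int i = 1; i <= ld; ++i)
            a[IDX2F(i, j, ld)] = 10.0f * (float)i + (float)j;
}
```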
16
CU(BLAS) Naming Convention
cublasSsbmv = S (single precision) + sbmv (symmetric banded matrix-vector):
compute z = αAx + βy, where A is symmetric banded (nonzeros only on a
few diagonals around the main diagonal):

  x x x
  x x x x
  x x x x x
    x x x x
      x x x
17
Example: z = αAx + βy

      | 2 1           |
      | 1 2 1         |
  A = |   1 2 ...     |     (N x N, tridiagonal)
      |     ... ... 1 |
      |         1   2 |

Since A is symmetric banded with one super-diagonal, it is sufficient
to store it as a 2 x N array h_A (stored as symmetric-banded,
#super-diagonals = 1):

  h_A = | 1 1 1 ... _ |   (super-diagonal; last entry unused)
        | 2 2 2 ... 2 |   (main diagonal)
18
CUBLAS Example-3 (CPU): z = αAx + βy
A stored as the 2 x N array h_A above; a macro handles 0-indexing in C.
[code screenshot not captured in transcript]
19
CUBLAS Example-3 (CPU), cont.: the CPU loop applies the tridiagonal
A = tridiag(1, 2, 1) row by row, i.e., z_i = x_(i-1) + 2 x_i + x_(i+1) + y_i.
[code screenshot not captured in transcript]
20
CUBLAS Example-3 (GPU): z = αAx + βy
cublasSsbmv arguments include #rows, the upper-storage mode, and the
number of super-diagonals. [code screenshot not captured in transcript]
21
CUBLAS Optimal Usage
1. Copy from CPU to GPU: cublasSet…(…)
2. Operate on GPU:
   Operation 1
   Operation 2
   …
   Operation n
3. Copy from GPU to CPU: cublasGet…(…)
22
CUBLAS BLAS-3 Functions:
Matrix-Matrix Operations
C = αAB + βC
X = αA⁻¹B      (A upper or lower triangular)
23
CUBLAS Performance
24
CUBLAS SGEMM Performance
[performance chart not captured in transcript: CUBLAS SGEMM vs. a naive
CPU implementation and a CPU implementation optimized for a 4-core CPU]
25
Thrust
26
Thrust
• C++ template library using CUDA
• Vector containers: host_vector & device_vector
– Generalize std::vector
– Store any type & resize dynamically
• Numerous algorithms: sort, sum (reduce), max, …
27
Thrust: Getting started
• Download into the CUDA include directory
– http://code.google.com/p/thrust/
– Requires CUDA 2.3
• Tutorial:
– http://code.google.com/p/thrust/wiki/Tutorial
28
Thrust: Concept [diagram not captured in transcript]
29
Thrust Algorithms: Prefix Sum
Given a sequence {x_1, x_2, x_3, ..., x_N}
and an operation ⊕,
the output is {x_1, x_1 ⊕ x_2, x_1 ⊕ x_2 ⊕ x_3, ..., x_1 ⊕ x_2 ⊕ ... ⊕ x_N}
30
Prefix Sum
Key to numerous algorithms
Also referred to as “Scan” algorithm
Different operations yield different scans
31
Prefix Sum: Example
Given the sequence {1, 2, 9, 6, ...} and the operation +,
the output is {x_1, x_1 + x_2, x_1 + x_2 + x_3, ...} = {1, 3, 12, 18, ...}
32
Prefix Sum: Example
Given the sequence {1, 2, 9, 6, ...} and the operation *,
the output is {x_1, x_1 * x_2, x_1 * x_2 * x_3, ...} = {1, 2, 18, 108, ...}
33
Prefix Sum: Example
Given the sequence {1, 2, 9, 6, ...} and the operation max,
the output is {x_1, max(x_1, x_2), max(max(x_1, x_2), x_3), ...} = {1, 2, 9, 9, ...}
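The three example scans can be reproduced with a CPU sketch (thrust::inclusive_scan is the GPU counterpart); the operation ⊕ is passed in as a function:

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Inclusive scan ("prefix sum") with a caller-supplied operation, as
// in the slides: out_i = x_1 op x_2 op ... op x_i.
std::vector<int> scan(const std::vector<int>& x,
                      std::function<int(int, int)> op) {
    std::vector<int> out(x.size());
    std::partial_sum(x.begin(), x.end(), out.begin(), op);
    return out;
}
```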
34
Thrust: Examples Set-up [code screenshot not captured in transcript]
35
Thrust: Examples [code screenshot not captured in transcript]
36
Thrust: Examples cont.
a = x_1^2 + x_2^2 + ... + x_N^2
[code screenshot not captured in transcript]
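A CPU sketch of this reduction; in Thrust the same sum of squares is typically an inner product of the device vector with itself:

```cpp
#include <numeric>
#include <vector>

// a = x_1^2 + x_2^2 + ... + x_N^2, as an inner product of x with itself.
float sum_of_squares(const std::vector<float>& x) {
    return std::inner_product(x.begin(), x.end(), x.begin(), 0.0f);
}
```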
37
CUSP
Cusp is a library for sparse
linear algebra and graph
computations on CUDA.
38
CUSP
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
int main(void)
{
// create an empty sparse matrix structure (HYB format)
cusp::hyb_matrix<int, float, cusp::device_memory> A;
// load a matrix stored in MatrixMarket format
cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");
// allocate storage for solution (x) and right hand side (b)
cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0);
cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1);
// solve the linear system A * x = b with the Conjugate Gradient method
cusp::krylov::cg(A, x, b);
return 0;
}
39
CuSparse [feature slide not captured in transcript]
40
CUFFT
CUDA Implementation of
Fast Fourier Transform
41
Fourier Transform
• Extract frequencies from a signal
• Given a function f(t), −∞ < t < ∞
• 1-D Fourier transform:
  f̂(ξ) = ∫_{−∞}^{+∞} f(t) e^{−2πiξt} dt
• 2-D, 3-D analogues
42
Fourier Transform
Continuous signal and its Fourier transform (Wikipedia)
Inverse transform: f(t) = ∫_{−∞}^{+∞} f̂(ξ) e^{2πiξt} dξ
43
Discrete Fourier Transform
• Given a sequence x_0, x_1, ..., x_{N−1}
• The discrete Fourier transform (DFT) is another sequence:
  x̂_k = ∑_{n=0}^{N−1} x_n e^{−2πikn/N}
44
DFT Examples
[example plots not captured in transcript: there is a highest frequency
that can be captured correctly]
45
Fast Fourier Transform
• DFT: naive O(N²) operation
• FFT: fast DFT, O(N log N)
• Key to signal processing, PDEs, …
  x_0, x_1, ..., x_{N−1}  →  x̂_0, x̂_1, ..., x̂_{N−1}
  x̂_k = ∑_{n=0}^{N−1} x_n e^{−2πikn/N}
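The naive O(N²) DFT is a direct transcription of the formula above; CUFFT (like FFTW) computes the same sums in O(N log N):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Naive DFT: xhat_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N).
std::vector<std::complex<double>>
dft(const std::vector<std::complex<double>>& x) {
    const size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> xhat(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            xhat[k] += x[n] *
                std::exp(std::complex<double>(0.0, -2.0 * pi * (double)(k * n) / (double)N));
    return xhat;
}
```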
46
CUFFT
• Fast CUDA library for FFT
• No additional downloads needed
– cufft.lib (in CUDA SDK)
– Add cufft.lib to linker
– #include <cufft.h>
47
CUFFT: Features
• 1-D, 2-D, 3-D transforms
• Precisions
– Single: real & complex
– Double: real & complex (not all functions)
• Uses standard CUDA memory calls & its own FFT data types
• Requires a 'plan'
• Modeled on FFTW
48
CUFFT Example [code screenshot not captured in transcript]
49
CUFFT Example (cont.): complex-to-complex transform, batch of 1
[code screenshot not captured in transcript]
50
Questions?
Finite Element Analysis
on the GPU
Krishnan Suresh
52
Summary
• Minimize data transfer between CPU & GPU
• Avoid complex GPU logic
• Maximize the number of independent GPU threads
• GPU memory management is critical
53
Finite Element Analysis
• Simulation of engineering phenomena
• Phenomena:
– Structural, thermal, fluid, …
– Static, modal, transient
– Linear, non-linear
54
Structural Static FEA
Structural Static FEA
Model → Discretize → Element Stiffness (K_e, f_e)
→ Assemble (K = ∑ K_e, f = ∑ f_e) / Solve (Ku = f) → Post-process
55
FEA Variations
Model → Discretize → Element Stiffness (K_e, f_e)
→ Assemble (K = ∑ K_e, f = ∑ f_e) / Solve (Ku = f) → Post-process
Variations: tet/hex/… elements; element order/hybrid formulations;
direct/iterative solvers; nonlinear analysis; optimization loops.
Goals: 1. Accurate  2. Automated  3. Fast
56
Why GPU?
Hours or even days of CPU time.
[Gordon; JPL]
57
Typical Bottleneck
Model → Discretize → Element Stiffness (K_e, f_e)
→ Assemble (K = ∑ K_e, f = ∑ f_e) / Solve (Ku = f) → Post-process
The Assemble/Solve stage is the bottleneck … GPU emphasis
58
Linear Solvers
Ku f
K is sparse & usually symmetric P.D.
Direct:
  K = L D L^T
  u = L^-T D^-1 L^-1 f
59
Direct Sparse on GPU (1)
(2006)
60
Direct Sparse on GPU (1)
61
Direct Sparse on GPU (1)
Single precision limitation
62
Direct Sparse on GPU (2): Ku = f
(2008)
63
Direct Sparse on GPU (2): Ku = f
Effective speed-up [chart not captured in transcript]
Linear Solvers
Ku = f
Direct:
  K = L D L^T;  u = L^-T D^-1 L^-1 f
Iterative:
  u_{i+1} = u_i + B (f − K u_i),  B: preconditioner of K
• Repeated matrix-vector ops
• Ideally suited for GPU!
65
Linear Solvers
Ku = f
Direct | Iterative: u_{i+1} = u_i + B (f − K u_i),  B: preconditioner of K
Issues:
1. K: sparse
2. B: need a pre-conditioner!
66
Iterative Sparse on GPU (2)
• Double-precision real-world SpMV
– CPU (2.3 GHz dual Xeon): 1 GFLOPS
– GPU (GTX 280): 16 GFLOPS
67
Iterative Sparse on GPU (1)
(2008)
• Jacobi-preconditioned conjugate gradient
• ATI GPU
• Speed-up: 3.5x
68
Linear Solvers
Ku = f
Direct | Iterative: u_{i+1} = u_i + P (f − K u_i),  P: preconditioner of K
Assembled vs. assembly-free
69
Assembly-free Iterative Solvers
u_{i+1} = u_i + P (f − K u_i),  P: preconditioner of K
1. AfSpMV: Assembly-free Sparse Matrix-Vector Multiplication
   • Compute Ku without assembling K
2. Assembly-free pre-conditioning
   • Compute Pv without assembling/inverting K
Hex-Mesh Solver using CUDA
(HMS-CUDA)
Objective
• Solve structural problems posed over hexahedral meshes on the GPU
• Targeted applications: material characterization, topology optimization, cellular modeling
• Large models (~10 million dof)
• Arbitrary-shaped hex elements
• Non-uniform connectivity
• Heterogeneous/orthotropic material
Hexahedral elements
• Displacement elements
– Standard displacement interpolation
– 8-node hex; K_e: 24×24
– Good balance between accuracy and speed
Strategy: "Assembly-free Matrix-Vector Multiply + Conjugate Gradient + Preconditioning"
• Assembly-free:
– Decreases memory footprint on GPU; easy to parallelize
• Assembly-free over arbitrary hex-meshes:
– K_e is not the same for all elements
– Re-computing K_e as needed is too expensive
– Storing all K_e can be a challenge (> 5 GB)
• Solution: exploit geometric and material congruency
– Compute and store K_e of template elements
– Map all elements to template elements
– Decreases memory footprint
– Increases coalesced memory access on GPU
Congruency Question
Are these two hexahedral elements geometrically and materially
congruent? I.e., will they result in the same K_e matrix?

"A Congruence Problem for Polyhedra", Borisov A., et al.,
Amer. Math. Monthly 117 (2010), 232-249.

Theorem 1.2 (Cauchy, 1839). Two convex polyhedra with corresponding
congruent and similarly situated faces have equal corresponding
dihedral angles.
Translation: 12 measurements are sufficient.
Translation: a maximum of 18 measurements are sufficient.

Corollary 2.5. Let P be a convex polyhedron with E edges. Then there
is a set of E measurements that is sufficient to determine P up to
congruence amongst all nearby convex polyhedra with the same
combinatorial structure as P.
Congruency Question
Currently, edge lengths are used to determine congruency
Assign element signature
Compare element signature
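The signature idea can be sketched as follows. This is an illustrative simplification (sorted edge lengths plus a material id; the names Signature and template_id are hypothetical): elements with equal signatures share one template K_e slot. Note that sorted edge lengths alone are a necessary but not sufficient congruence test, which is exactly the subtlety the Cauchy discussion above addresses:

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Signature = (sorted edge lengths, material id); equal signatures map
// to the same template element, so only one K_e is stored per template.
using Signature = std::pair<std::vector<double>, int>;

int template_id(std::map<Signature, int>& templates,
                std::vector<double> edge_lengths, int material) {
    std::sort(edge_lengths.begin(), edge_lengths.end());
    Signature sig{edge_lengths, material};
    auto it = templates.find(sig);
    if (it != templates.end()) return it->second;   // reuse existing template
    int id = (int)templates.size();                 // register new template
    templates[sig] = id;
    return id;
}
```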
Congruence Examples
Structured grid, isotropic material:
– 8192 elements
– 1 template element
– 1 K_e matrix (24×24); 4.6 KB
Unstructured grid, isotropic material:
– 6144 elements
– 34 template elements (~0.5%)
– 34 K_e matrices; 156 KB (vs. 28 MB)
Results
Unstructured grid, composite layers:
– 83584 elements
– Material congruency check
– 322 templates (~0.4%)
– 1.48 MB storage (vs. 384 MB)
Proposed Algorithm
Hex-mesh → Detect congruency → Compute K_e of templates
→ Push data to GPU → Assembly-free Ku = f on GPU (conjugate gradient)
Example: Matlab
GPU algorithms:
– Dot-product: use CUBLAS
– Ax: implement assembly-free
Also implemented on the CPU for comparison
Assembly-free y = Kx on GPU
Each node is assigned to a thread; launch as many blocks as needed.
Algorithm (per thread/node):
  uResult = 0, …
  for each neighboring elem:
    for each of the 8 nodes of elem:
      get (u, v, w) of node
      apply K_e of (elem, node)
      accumulate into uResult, …
  apply BC
  push uResult to memory
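The per-node algorithm can be sketched in serial CPU code, shrunk to a hypothetical 1-D chain of 2-node elements sharing one template stiffness K_e = k [[1, −1], [−1, 1]] (the slides use 24×24 hex elements, with one GPU thread per node). The product y = Kx is formed element by element without ever assembling K:

```cpp
#include <vector>

// Assembly-free y = K x for a 1-D chain of nnodes-1 two-node elements:
// each "thread" (here, loop iteration) owns one node, visits its
// neighboring elements, and accumulates their K_e contributions.
std::vector<double> afspmv_chain(int nnodes, double k,
                                 const std::vector<double>& x) {
    std::vector<double> y(nnodes, 0.0);
    const double Ke[2][2] = {{k, -k}, {-k, k}};     // template element
    for (int node = 0; node < nnodes; ++node) {     // "per thread"
        double acc = 0.0;
        for (int e = node - 1; e <= node; ++e) {    // neighboring elements
            if (e < 0 || e + 1 >= nnodes) continue;
            int local = node - e;                   // 0 or 1 within element
            for (int a = 0; a < 2; ++a)             // element's 2 nodes
                acc += Ke[local][a] * x[e + a];
        }
        y[node] = acc;                              // push result to memory
    }
    return y;
}
```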
GPU Data
Separate storage for (u, v, w) for coalesced memory access.
Can reduce footprint further.
Assembly-free Kx on GPU
Code NOT optimized! [code screenshot not captured in transcript]
Experiments
Models of different complexity.
Four different solvers:
1. Matlab assembled direct
2. Matlab assembled CG (1e-10)
3. C assembly-free CG (1e-10)
4. CUDA assembly-free CG (1e-10)
All methods exploit congruency and pre-compute K_e of the template
elements (so this cost is not included below).
All computations in double precision.
All methods yield identical results.
Platform
64-bit Windows Professional
CPU: i7, 3.2 GHz, 6 GB RAM; C code runs on a single core
GPU: GTX 480 (480 cores), 1.5 GB
Timing: Problem 1
Isotropic structured hex mesh; 32 x 16 x 16; 1 template element; 28,000 dof
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   1.8 (K assembly)      7.0
Matlab – Assembled; CG (1e-10)     1.8 (K assembly)      8.7
C – Assembly-free; CG (1e-10)      -                     8.0
CUDA – Assembly-free; CG (1e-10)   0.13 (GPU transfer)   0.20
Timing: Problem 2
Isotropic structured hex mesh; 32 x 32 x 16; 1 template element; 55,539 dof
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   3.6 (K assembly)      33.2
Matlab – Assembled; CG (1e-10)     3.6 (K assembly)      2.73
C – Assembly-free; CG (1e-10)      -                     19.8
CUDA – Assembly-free; CG (1e-10)   0.16 (GPU transfer)   0.29
Timing: Problem 3
Isotropic structured hex mesh; 64 x 32 x 16; 1 template element; 109,385 dof
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   7.1 (K assembly)      > 8 mins (stalled)
Matlab – Assembled; CG (1e-10)     7.1 (K assembly)      8.8
C – Assembly-free; CG (1e-10)      0.0                   60.8
CUDA – Assembly-free; CG (1e-10)   0.19 (GPU transfer)   0.66
Timing: Problem 4
Structured hex mesh (isotropic vs. composite); 64 x 32 x 16; 2 template
elements; 109,000 dof

                  Overhead cost (s)        Solution time (s)
Solver            Isotropic   Composite    Isotropic   Composite
Matlab: Direct    --          --           --          --
Matlab: CG        7.1         7.1          8.8         22.0
C: CG, AF         0.0         0.0          60.8        154
CUDA: CG, AF      0.16        0.17         0.66        1.6

Same K, but 2.5x the CG iterations; no GPU penalty due to non-isotropy.
Timing: Problem 5
Isotropic structured hex mesh; 256 x 128 x 64; 1 template element;
6.5 million dof; 468 MB GPU memory
[mesh figure not captured; one face fixed]

Solver                             Overhead cost (s)     Solution time (s)
Matlab – Assembled; direct solve   --                    Stalls!
Matlab – Assembled; CG (1e-10)     --                    Stalls!
C – Assembly-free; CG (1e-10)      --                    Stalls!
CUDA – Assembly-free; CG (1e-10)   0.41 (GPU transfer)   87.2
Timing: Problem 6
Unstructured hex mesh; isotropic and composite (alternate layers of
+/- 60 deg fibers); 23,000 dof; 68 template elements
[mesh figure not captured; fixed face and applied load]

                  Overhead cost (s)        Solution time (s)
Solver            Isotropic   Composite    Isotropic   Composite
Matlab: Direct    1.28        1.28         2.7         2.7
Matlab: CG        1.28        1.28         20.0        20.0
C: CG, AF         0.0         0.0          127         132
CUDA: CG, AF      0.16        0.17         6.5         7.13

High-aspect-ratio elements; CG performs poorly without preconditioning.
Timing: Problem 6 (larger mesh)
Unstructured hex mesh; isotropic and composite (alternate layers of
+/- 60 deg fibers); 274,000 dof; 322 template elements
[mesh figure not captured; fixed face and applied load]

                  Overhead cost (s)        Solution time (s)
Solver            Isotropic   Composite    Isotropic   Composite
Matlab: Direct    --          --           --          --
Matlab: CG        18.3        18.3         214.2       216.3
C: CG, AF         0.0         0.0          720.3       770.6
CUDA: CG, AF      0.18        0.18         29.3        35.2
CG Pre-conditioning for FEA
[chart not captured in transcript: candidate preconditioners (Jacobi,
SSOR, ILU, ELE, multi-grid for generic/structured meshes, coarse-fine)
plotted by effectiveness vs. parallel-friendliness]
Exploiting HMS-CUDA
for Topology Optimization
Classic SIMP Optimization [demo figures not captured in transcript]
Demo: Matlab code at www.ersl.wisc.edu
Large-Scale Optimization

Size                 DOF        Time [Wang 07]*
Medium (84, 28, 14)  107,184    2.4 hours
Large (180, 60, 30)  1,010,160  45.7 hours

*[Wang 07]: "Large-scale topology optimization using preconditioned
Krylov subspace methods with recycling", Wang, de Sturler, Paulino,
IJNME, vol. 69, 2007.
Bridge Problem
V = 30%; solved in 1 min 10 secs
Exploiting HMS-CUDA
for Material Characterization
Double Notch Specimen
[figure not captured: one face fixed, prescribed displacement on the other]
Accuracy: Double Notch
     ANSYS                  CUDA
u    (0, 0.01)              (0, 0.01)
v    (-0.098, 0.098)        (-0.098, 0.098)
z    (-0.0016, 0.00016)     (-0.0016, 0.00016)
Prescribed: u = 0.01
Accuracy: Double Notch
     ANSYS                  CUDA
u    (-0.00027, 0.00024)    (-0.00027, 0.00024)
v    (-0.0011, 0.0011)      (-0.0011, 0.0011)
z    (0, 0.01)              (0, 0.01)
Prescribed: w = 0.01
Material Characterization-1
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000 (MPa), E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 0, 0), M = (0, 0, 0); CGTol = 1e-10; #FEA = 4700
Converged result: E1 = 146957, E2 = 10362, n12 = 0.268, G23 = 6963
Material Characterization-2
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000, E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 0, 0), M = (0, 0, 0); CGTol = 0.001; #FEA = 2162
Converged result: E1 = 147190, E2 = 10266, n12 = 0.27, G23 = 6996
Material Characterization-3
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000 (MPa), E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 1000, 1000), M = (0, 0, 0); CGTol = 0.001; #FEA = 1952
Converged result: E1 = 146967, E2 = 10329, n12 = 0.259, G23 = 6949
Material Characterization-4
Initial guess:    E1 = 100 (MPa), E2 = 25, n12 = 0.1, G23 = 25
Exact answer:     E1 = 147000 (MPa), E2 = 10000, n12 = 0.27, G23 = 7000
Load: F = (1000, 1000, 1000), M = (0, 0, 0); CGTol = 0.001; #FEA = 2876
Converged result: E1 = 144340, E2 = 10378, n12 = 0.193, G23 = 6998
Conclusions
• Assembly-free Hex-mesh solver on GPU hard to beat!
• Very little penalty for material variation
• A penalty-factor of 3.0 for unstructured hex meshes
• Future work
– ANSYS interface
– Optimize GPU code
– Pre-conditioner for thin elements
– Reduce GPU Memory
– OpenCL implementation
– Congruent hex mesh generator
– Multi-physics
– Transient problems
– Nonlinear problems
– …