ME964 High Performance Computing for Engineering Applications “The real problem is not whether machines think but whether men do.” B. F. Skinner © Dan Negrut, 2011 ME964 UW-Madison Outlining Midterm Projects Topic 3: GPU-based FEA Topic 4: GPU Direct Solver for Sparse Linear Algebra March 01, 2011

Page 1:

ME964 High Performance Computing for Engineering Applications

“The real problem is not whether machines think but whether men do.” B. F. Skinner

© Dan Negrut, 2011 ME964 UW-Madison

Outlining Midterm Projects
Topic 3: GPU-based FEA
Topic 4: GPU Direct Solver for Sparse Linear Algebra

March 01, 2011

Page 2:

Before We Get Started…

Last time
Midterm Project topics 1 and 2
– Discrete Element Method on the GPU. Area coordinator: Toby Heyn
– Collision Detection on the GPU. Area coordinator: Arman Pazouki

Today
Midterm Project topics 3 and 4
– Finite Element Method on the GPU. Area coordinators: Prof. Suresh and Naresh Khude
– Sparse direct solver on the GPU (Cholesky). Area coordinator: Dan Negrut

Midterm Project Related Issues
Midterm Project is due on 04/13 at 11:59 PM (use Learn@UW drop-box)
Intermediate report due on 03/22 at 11:59 PM (use the same Learn@UW drop-box)
Each area coordinator
– Will provide a test problem for you to test your GPU implementation
– Will also assist you with questions related to the non-programming aspects (the “theory”) behind the topic you chose
You can continue your Midterm Project (MP) and have it become your Final Project (FP)
– In this case you will be expected to show how the FP implementation is superior to your MP implementation

Other issues
HW5 due tonight at 11:59 PM
– Use Learn@UW drop-box to submit homework

Page 3:

Finite Element Analysis on the GPU?

Krishnan Suresh, Professor

Page 4:

Finite Element Analysis

Computer simulation of engineering models
Physics:
– Structural, thermal, fluid, …
Mode:
– Static, modal, transient
– Linear, non-linear, multi-physics

Page 5:

Why GPU?

Hours or even days of CPU time.

[Gordon; JPL]

Page 6:

Question

Can one exploit graphics processing units (GPUs) to speed up Finite Element analysis?

Page 7:

Structural Static FEA

[Figure: analysis pipeline. Model, Discretize, Element Stiffness ($K_e$, $f_e$), Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$, $Ku = f$), Post-process]

Page 8:

FEA: Variations

[Figure: analysis pipeline. Model, Discretize, Element Stiffness ($K_e$, $f_e$), Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$, $Ku = f$), Post-process]

Variations marked on the pipeline: Tet/Hex/…, Order/Hybrid, Direct/Iterative, Nonlinear, Optimization

Page 9:

FEA: Challenges

[Figure: analysis pipeline. Model, Discretize, Element Stiffness ($K_e$, $f_e$), Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$, $Ku = f$), Post-process]

Variations marked on the pipeline: Tet/Hex/…, Order/Hybrid, Direct/Iterative, Nonlinear, Optimization

1. Accuracy
2. Automation
3. Speed

Page 10:

Typical Bottleneck

[Figure: analysis pipeline. Model, Discretize, Element Stiffness ($K_e$, $f_e$), Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$, $Ku = f$), Post-process]

Page 11:

GPU & Engineering Analysis

[Figure: Model (CPU), Discretize (GPU?)]

Discretization
Data: Small b-rep (+)
Logic: Complex (-)
Threads: Few (-)

Not a good candidate for GPU!?

Page 12:

Element Stiffness

[Figure: Model (CPU), Discretize (CPU), Element Stiffness $K_e$, $f_e$ (GPU?)]

Element Stiffness
Data: O(N) (+/-)
Logic: Simple (+)
Threads: N (+)

[Figure labels: Hex 2nd Order, Hex Hybrid]

Page 13:

Stiffness: Hex 2nd Order

$K_e$ is an $M \times M$ matrix
8 Corners ~ 100 Bytes Data (x y z)
27 Nodes ~ M = 81 DOF (u v w)
$k_{ij}$ ~ Gaussian integration
– 30 flops

[Figure labels: (8 Corners), (27 Nodes)]

$\text{Flops} \approx 15\,N M^2$

$N = 200000,\ M = 81 \;\Rightarrow\; T_{\mathrm{CPU}} \approx 4\ \mathrm{sec}$
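The operation count on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: it assumes the factor 15 comes from about 30 flops per entry $k_{ij}$ applied to roughly $M^2/2$ entries of a symmetric $K_e$, which the slide does not state explicitly. The variable names are illustrative only.

```python
# Rough operation count for second-order hex element stiffness matrices.
# N, M, the 30 flops per entry and the 4 s timing are the slide's figures.
# Assumption: K_e is symmetric, so only about M*M/2 entries are computed.
N = 200_000            # number of elements
M = 81                 # 27 nodes x 3 DOF (u, v, w)
FLOPS_PER_ENTRY = 30   # Gaussian integration cost per k_ij

flops_per_element = FLOPS_PER_ENTRY * M * M // 2   # = 15 * M^2
total_flops = N * flops_per_element                # = 15 * N * M^2

T_CPU = 4.0  # seconds, as quoted on the slide
implied_rate = total_flops / T_CPU / 1e9           # sustained GFLOP/s implied by T_CPU

print(f"flops per element: {flops_per_element}")
print(f"total flops:       {total_flops:.3e}")
print(f"implied CPU rate:  {implied_rate:.2f} GFLOP/s")
```

Under these assumptions the total is about 2e10 flops, so the quoted 4 seconds corresponds to a sustained CPU rate of roughly 5 GFLOP/s.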

Page 14:

Typical Bottleneck

[Figure: analysis pipeline. Model, Discretize, Element Stiffness ($K_e$, $f_e$), Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$, $Ku = f$)]

Page 15:

Direct vs. Iterative

$Ku = f$

K is sparse & usually symmetric positive definite (P.D.)

Direct
$K = LDL^T$
$u = L^{-T} D^{-1} L^{-1} f$

Iterative
$u_{i+1} = u_i + B(f - Ku_i)$
$B$: Preconditioner of $K$
(GPU Variation: Assembly-free)

Note: Nvidia offers CuBLAS-3 dense matrix library
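The direct route, $K = LDL^T$ followed by $u = L^{-T} D^{-1} L^{-1} f$, can be illustrated on a small dense matrix. This is a sketch only: it uses dense storage and no pivoting or fill-reducing ordering, whereas a real sparse direct solver would exploit the sparsity of $K$. The function names `ldlt` and `ldlt_solve` and the 3x3 test system are made up for illustration.

```python
def ldlt(K):
    """Factor a symmetric positive definite matrix as K = L D L^T (dense, no pivoting)."""
    n = len(K)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = K[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * D[k] for k in range(j))
            L[i][j] = (K[i][j] - s) / D[j]
    return L, D


def ldlt_solve(L, D, f):
    """Solve K u = f as u = L^-T D^-1 L^-1 f."""
    n = len(f)
    y = list(f)
    for i in range(n):                       # forward substitution: L y = f
        y[i] -= sum(L[i][k] * y[k] for k in range(i))
    z = [y[i] / D[i] for i in range(n)]      # diagonal scaling: D z = y
    u = list(z)
    for i in reversed(range(n)):             # back substitution: L^T u = z
        u[i] -= sum(L[k][i] * u[k] for k in range(i + 1, n))
    return u


# Hypothetical 3x3 SPD system whose exact solution is u = [1, 1, 1].
K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
f = [3.0, 2.0, 3.0]
L, D = ldlt(K)
u = ldlt_solve(L, D, f)
print(u)
```

The factorization is done once and the two triangular solves are cheap, which is why the direct approach is attractive when the same $K$ is reused for several right-hand sides.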

Page 16:

Direct Sparse on GPU (1)

(2006)

Page 17:

Direct Sparse on GPU (1)

$Ku = f$

Page 18:

Direct Sparse on GPU (1)

$Ku = f$

Page 19:

Direct Sparse on GPU (2)

$Ku = f$

(2008)

Page 20:

Direct Sparse on GPU (2)

$Ku = f$

Page 21:

Iterative Sparse on GPU (1)

(2008)

Jacobi preconditioned conjugate gradient
ATI GPU
Speed-up 3.5

Page 22:

Iterative Sparse on GPU (2)

Double precision real world SpMV
– CPU (2.3 GHz Dual Xeon): 1 GFLOPS
– GPU (GTX 280): 16 GFLOPS
– Speedup ~ 16
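For reference, the SpMV kernel being benchmarked is the sparse matrix-vector product $y = Ax$. The sketch below assumes the compressed sparse row (CSR) storage format, which the slide does not specify, and runs serially; the function name `spmv_csr` and the test matrix are illustrative.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A x for A stored in compressed sparse row (CSR) format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):      # rows are independent: a GPU can assign one thread per row
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 tridiagonal matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] in CSR form.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
print(y)
```

Each nonzero costs one multiply and one add but requires loading a value, a column index and an entry of `x`, so the kernel is limited by memory bandwidth rather than arithmetic throughput.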

Page 23:

FEA/GPU Class Projects?

1. Complete < 6 weeks

2. Important (publishable)

3. Pilot code

Page 24:

FEA/GPU Class Projects?

1. GPU Friendly Preconditioners for Thin Structures
– Research papers
– OpenCL and ViennaCL Pilot Code

2. Topology Optimization
– Research papers
– CUDA code

3. Others
– Can discuss …

Page 25:

Thin Structure?

Page 26:

Thin Structure?

Large K

Page 27:

Preconditioners?

$Ku = f$

Iterative Methods:
– GPU methods available for K*u
– Typical preconditioners: simple Jacobi, …

Poor preconditioner … slow convergence

Objective:
– GPU friendly preconditioner for thin structures

$u_{i+1} = u_i + B(f - Ku_i)$
$B$: Preconditioner of $K$
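The iteration $u_{i+1} = u_i + B(f - Ku_i)$ can be sketched with the simple Jacobi choice $B = \mathrm{diag}(K)^{-1}$ mentioned on the slide. This is a toy illustration on a dense, well-conditioned 3x3 matrix; the function names, tolerance and test system are assumptions, and a production solver would normally wrap the preconditioner in conjugate gradient rather than use this basic update.

```python
def matvec(K, u):
    """Dense matrix-vector product K u."""
    return [sum(kij * uj for kij, uj in zip(row, u)) for row in K]


def preconditioned_iteration(K, f, apply_B, tol=1e-10, max_iter=1000):
    """Iterate u_{i+1} = u_i + B (f - K u_i) until the max-norm residual drops below tol."""
    u = [0.0] * len(f)
    for it in range(max_iter):
        r = [fi - ki for fi, ki in zip(f, matvec(K, u))]   # residual f - K u_i
        if max(abs(ri) for ri in r) < tol:
            return u, it
        u = [ui + di for ui, di in zip(u, apply_B(r))]     # correction B r
    return u, max_iter


# Hypothetical SPD system whose exact solution is u = [1, 1, 1].
K = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
f = [3.0, 2.0, 3.0]


def jacobi(r):
    """Jacobi preconditioner: B = diag(K)^-1."""
    return [ri / K[i][i] for i, ri in enumerate(r)]


u, iterations = preconditioned_iteration(K, f, jacobi)
print(u, iterations)
```

The closer $B$ is to $K^{-1}$, the fewer iterations are needed. For thin structures $K$ is poorly conditioned and Jacobi is a weak approximation, which is the motivation for a better, GPU-friendly preconditioner.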

Page 28:

Research Publication

Page 29:

Basic Idea

Restriction (via dual-representation)

Prolongation (via dual-representation)

1-D Coarse Mesh (via dual-representation)

Page 30:

Algorithm

Page 31:

Why Preconditioner?

Page 32:

Why Double Precision?

Page 33:

How Expensive is Preconditioner?

Page 34:

GPU Friendly

[Figure panels: Speed-up without Preconditioner; Speed-up with Preconditioner]

Page 35:

FEA/GPU Class Projects?

1. GPU Friendly Preconditioners for Thin Structures
– Research papers
– OpenCL and ViennaCL Pilot Code

2. Topology Optimization
– Research papers
– CUDA code

3. Others
– Can discuss …

Page 36:

Topology Optimization

[Equations, partially recoverable: $\min_{\Omega \subset D} J$ at a prescribed volume $V_0$; multi-objective form $\min_{\Omega \subset D} \{J, V\}$]

[Figure: design domain $D$, from Sigmund 2001]

V = 50%
Stiffest topology for a given volume?
Where to remove material?

Multi Objective + Topology Optimization = MOTO

Page 37:

Demo

Matlab code www.ersl.wisc.edu

Page 38:

Pareto Optimal Designs

Purely pareto optimal

Page 39:

Comparison

[Figure: design domain $D$]

Page 40:

3-D

[Figure panels: Pareto-Method; SIMP]

Page 41:

3-D GPU Implementation

Multi-grid Topology Optimization on the GPU (IDETC conf. 2011)

Page 42:

Motivation for Topic 4: Sparse Direct Solver

Page 43:

Nomenclature & Simplifying Assumptions

Page 44:

The Schur Complement Problem in Multi-Body Dynamics Applications

Page 45:

Formulation Framework

Position: $\mathbf{r}_i = [x_i, y_i, z_i]^T$

Orientation: Euler parameters, $\mathbf{p}_i = [e_{0,i}, e_{1,i}, e_{2,i}, e_{3,i}]^T$

Translational Velocity: $\dot{\mathbf{r}}_i = [\dot{x}_i, \dot{y}_i, \dot{z}_i]^T$

Angular velocities: $\bar{\boldsymbol{\omega}}_i = [\bar{\omega}_{x,i}, \bar{\omega}_{y,i}, \bar{\omega}_{z,i}]^T$

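Page 46:

Constrained Equations of Motion

$\boldsymbol{\Phi}(\mathbf{r}, \mathbf{p}, t) = \mathbf{0}$

$\boldsymbol{\Phi}_{\mathbf{r}}(\mathbf{r}, \mathbf{p}, t)\,\dot{\mathbf{r}} + \bar{\boldsymbol{\Pi}}(\mathbf{r}, \mathbf{p}, t)\,\bar{\boldsymbol{\omega}} = -\boldsymbol{\Phi}_t(\mathbf{r}, \mathbf{p}, t)$

$\boldsymbol{\Phi}_{\mathbf{r}}(\mathbf{r}, \mathbf{p}, t)\,\ddot{\mathbf{r}} + \bar{\boldsymbol{\Pi}}(\mathbf{r}, \mathbf{p}, t)\,\dot{\bar{\boldsymbol{\omega}}} = \boldsymbol{\tau}(\mathbf{r}, \dot{\mathbf{r}}, \mathbf{p}, \bar{\boldsymbol{\omega}}, t)$

$\begin{bmatrix} \mathbf{M} & \mathbf{0} \\ \mathbf{0} & \bar{\mathbf{J}} \end{bmatrix} \begin{bmatrix} \ddot{\mathbf{r}} \\ \dot{\bar{\boldsymbol{\omega}}} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Phi}_{\mathbf{r}}^T(\mathbf{r}, \mathbf{p}, t) \\ \bar{\boldsymbol{\Pi}}^T(\mathbf{r}, \mathbf{p}, t) \end{bmatrix} \boldsymbol{\lambda} = \begin{bmatrix} \mathbf{F}(\mathbf{r}, \dot{\mathbf{r}}, \mathbf{p}, \bar{\boldsymbol{\omega}}, t) \\ \hat{\mathbf{n}}(\mathbf{r}, \dot{\mathbf{r}}, \mathbf{p}, \bar{\boldsymbol{\omega}}, t) \end{bmatrix}$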

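Page 47:

Numerical Solution of the Newton-Euler Constrained Equations of Motion

One has to solve a set of Differential Algebraic Equations (DAEs) to find the time evolution of a mechanical system

Most often the numerical solution of the DAEs requires the solution of a linear system of the form:

$\begin{bmatrix} \mathbf{M} & \mathbf{0} & \boldsymbol{\Phi}_{\mathbf{r}}^T \\ \mathbf{0} & \bar{\mathbf{J}} & \bar{\boldsymbol{\Pi}}^T \\ \boldsymbol{\Phi}_{\mathbf{r}} & \bar{\boldsymbol{\Pi}} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \ddot{\mathbf{r}} \\ \dot{\bar{\boldsymbol{\omega}}} \\ \boldsymbol{\lambda} \end{bmatrix} = \begin{bmatrix} \mathbf{F} \\ \hat{\mathbf{n}} \\ \boldsymbol{\tau} \end{bmatrix}$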

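Page 48:

Approach Followed

First solve the “Reduced System” for $\boldsymbol{\lambda}$:

$\begin{bmatrix} \boldsymbol{\Phi}_{\mathbf{r}} & \bar{\boldsymbol{\Pi}} \end{bmatrix} \begin{bmatrix} \mathbf{M}^{-1} & \mathbf{0} \\ \mathbf{0} & \bar{\mathbf{J}}^{-1} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_{\mathbf{r}}^T \\ \bar{\boldsymbol{\Pi}}^T \end{bmatrix} \boldsymbol{\lambda} = \mathbf{b}$

Then recover accelerations

$\ddot{\mathbf{r}} = \mathbf{M}^{-1}(\mathbf{F} - \boldsymbol{\Phi}_{\mathbf{r}}^T \boldsymbol{\lambda})$

$\dot{\bar{\boldsymbol{\omega}}} = \bar{\mathbf{J}}^{-1}(\hat{\mathbf{n}} - \bar{\boldsymbol{\Pi}}^T \boldsymbol{\lambda})$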
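The reduced system follows from substituting the recovered accelerations into the acceleration-level constraint. The derivation below is a sketch under assumed notation: the Greek symbols did not survive extraction, so $\boldsymbol{\Phi}_{\mathbf{r}}$ and $\bar{\boldsymbol{\Pi}}$ (constraint Jacobian blocks), $\boldsymbol{\lambda}$ (Lagrange multipliers) and $\boldsymbol{\tau}$ (right-hand side of the acceleration constraint) are stand-in names, and the explicit form of $\mathbf{b}$ is derived here rather than taken from the slide.

```latex
\begin{aligned}
\ddot{\mathbf{r}} &= \mathbf{M}^{-1}\bigl(\mathbf{F} - \boldsymbol{\Phi}_{\mathbf{r}}^{T}\boldsymbol{\lambda}\bigr),
\qquad
\dot{\bar{\boldsymbol{\omega}}} = \bar{\mathbf{J}}^{-1}\bigl(\hat{\mathbf{n}} - \bar{\boldsymbol{\Pi}}^{T}\boldsymbol{\lambda}\bigr) \\
\boldsymbol{\tau} &= \boldsymbol{\Phi}_{\mathbf{r}}\,\ddot{\mathbf{r}} + \bar{\boldsymbol{\Pi}}\,\dot{\bar{\boldsymbol{\omega}}}
 = \boldsymbol{\Phi}_{\mathbf{r}}\mathbf{M}^{-1}\mathbf{F} + \bar{\boldsymbol{\Pi}}\bar{\mathbf{J}}^{-1}\hat{\mathbf{n}}
 - \bigl(\boldsymbol{\Phi}_{\mathbf{r}}\mathbf{M}^{-1}\boldsymbol{\Phi}_{\mathbf{r}}^{T} + \bar{\boldsymbol{\Pi}}\bar{\mathbf{J}}^{-1}\bar{\boldsymbol{\Pi}}^{T}\bigr)\boldsymbol{\lambda} \\
\Rightarrow\quad
\bigl(\boldsymbol{\Phi}_{\mathbf{r}}\mathbf{M}^{-1}\boldsymbol{\Phi}_{\mathbf{r}}^{T} + \bar{\boldsymbol{\Pi}}\bar{\mathbf{J}}^{-1}\bar{\boldsymbol{\Pi}}^{T}\bigr)\boldsymbol{\lambda}
 &= \underbrace{\boldsymbol{\Phi}_{\mathbf{r}}\mathbf{M}^{-1}\mathbf{F} + \bar{\boldsymbol{\Pi}}\bar{\mathbf{J}}^{-1}\hat{\mathbf{n}} - \boldsymbol{\tau}}_{\mathbf{b}}
\end{aligned}
```

The matrix on the left is the Schur complement of the mass block in the full saddle-point system, which is why the reduced problem is called the Schur complement problem.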

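Page 49:

Iterative Solution of the Reduced System

Define positive definite Reduced Matrix

$\mathbf{E} = \begin{bmatrix} \boldsymbol{\Phi}_{\mathbf{r}} & \bar{\boldsymbol{\Pi}} \end{bmatrix} \begin{bmatrix} \mathbf{M}^{-1} & \mathbf{0} \\ \mathbf{0} & \bar{\mathbf{J}}^{-1} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_{\mathbf{r}}^T \\ \bar{\boldsymbol{\Pi}}^T \end{bmatrix}$

Preconditioned Conjugate Gradient requires computation at time $t_n$ of $\mathbf{E}\boldsymbol{\lambda}_n^{(k)}$

$\mathbf{E}\boldsymbol{\lambda} = \mathbf{b}$ requires preconditioning: [expression involving $\mathbf{E}_{\mathrm{old}}$, not recoverable from the extraction]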

Page 50:

Computing $\mathbf{E}\boldsymbol{\lambda}_n^{(k)}$

A thread is associated with each body

We’ll look at how thread 9 does its share of work to compute $\mathbf{E}\boldsymbol{\lambda}_n^{(k)}$

Time step n, iteration (k):

[Figure: body and constraint graph, with the multipliers $\boldsymbol{\lambda}_n^{(k)}$ and the entries $e_1, e_2, e_3, \ldots$ of the product $\mathbf{E}\boldsymbol{\lambda}_n^{(k)}$; layout not recoverable]

Page 51:

How Thread-9 Does its Work

S1. Compute reaction forces acting on me: $\mathbf{F}_9^C = (\boldsymbol{\Phi}_9^3)^T \lambda_3 + (\boldsymbol{\Phi}_9^5)^T \lambda_5 + (\boldsymbol{\Phi}_9^6)^T \lambda_6$

S2. Compute my constraint acceleration: $\mathbf{a}_9^C = \mathbf{M}_9^{-1} \mathbf{F}_9^C$

S3. Project my constraint acceleration: $\boldsymbol{\Phi}_9^3 \mathbf{a}_9^C$, $\boldsymbol{\Phi}_9^5 \mathbf{a}_9^C$, $\boldsymbol{\Phi}_9^6 \mathbf{a}_9^C$

Finally, $e_3 = \boldsymbol{\Phi}_9^3 \mathbf{a}_9^C + \boldsymbol{\Phi}_{12}^3 \mathbf{a}_{12}^C$
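Steps S1 to S3 amount to a matrix-free product $\mathbf{E}\boldsymbol{\lambda}$ in which each body contributes independently. The sketch below is a serial toy model under loud assumptions: every body has a single degree of freedom (so Jacobian blocks and masses are scalars), and the body numbers, constraint numbers and values are invented for illustration. It checks the per-body result against an explicitly formed $\mathbf{E} = \boldsymbol{\Phi}\mathbf{M}^{-1}\boldsymbol{\Phi}^T$.

```python
# Hypothetical toy data: 3 bodies with one DOF each and 2 constraints (joints).
inv_mass = {9: 1.0 / 2.0, 12: 1.0 / 4.0, 7: 1.0}
# jac[c][b] = Jacobian entry of constraint c with respect to body b
jac = {3: {9: 1.0, 12: -1.0}, 5: {9: 2.0, 7: 1.0}}
lam = {3: 0.5, 5: -1.0}


def e_times_lambda(inv_mass, jac, lam):
    """Matrix-free E*lambda: loop over bodies, as one GPU thread per body would."""
    e = {c: 0.0 for c in jac}
    for body in inv_mass:
        mine = [c for c in jac if body in jac[c]]          # constraints touching this body
        force = sum(jac[c][body] * lam[c] for c in mine)   # S1: reaction force on the body
        accel = inv_mass[body] * force                     # S2: constraint acceleration
        for c in mine:                                     # S3: project onto each constraint
            e[c] += jac[c][body] * accel                   # concurrent writes need an atomic add or a reduction
    return e


def explicit_e_times_lambda(inv_mass, jac, lam):
    """Reference: form each entry E[c][d] = sum_b jac[c][b] * inv_mass[b] * jac[d][b], then multiply."""
    out = {}
    for c in jac:
        out[c] = sum(
            sum(jac[c][b] * inv_mass[b] * jac[d].get(b, 0.0) for b in jac[c]) * lam[d]
            for d in jac
        )
    return out


e_free = e_times_lambda(inv_mass, jac, lam)
e_ref = explicit_e_times_lambda(inv_mass, jac, lam)
print(e_free, e_ref)
```

The matrix $\mathbf{E}$ is never assembled in the per-body version. Entry $e_3$ is the sum of the contributions of the two bodies joined by constraint 3, which matches the final line of the slide.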

Page 52:

Iteration Operation Count for Body 9 (Thread-9)

Columns: Step, Multiplications, Additions
Rows: S1, S2, S3
[Table entries as extracted; their assignment to rows and columns is not recoverable: $6(C_9 - 1)$, $6C_9$, $6C_9$, $5C_9$, 56]

Page 53:

Computing $\mathbf{E}\boldsymbol{\lambda}_n^{(k)}$ [Concluding Remarks]

The algorithm scales very well: one thread for each body

Each thread only interacts with adjacent joints

Load balance is obtained when the bodies have similar topology index

Page 54:

Direct Solution of the Reduced System

Page 55:

The Sparse Direct Solver

Page 56:

The Direct Solver: How Things Get Done

In the reduced linear system $\mathbf{E}\boldsymbol{\lambda} = \mathbf{b}$ each constraint induces an equation

Example: constraint 3 induced equation: $E_{32}\lambda_2 + E_{33}\lambda_3 + E_{35}\lambda_5 + E_{36}\lambda_6 = b_3$

Since $\mathbf{E}$ is positive definite, $E_{33}$ is also positive definite

Fundamental Idea: Solve for $\lambda_3$ and substitute it in all the equations where it shows up
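The "solve for $\lambda_3$ and substitute" step is one step of symmetric Gaussian elimination. The sketch below treats every $E_{ij}$ as a scalar and uses an invented 4-unknown system in which unknown 3 couples to 2, 5 and 6, mirroring the example equation; the function name `eliminate` and all numerical values are assumptions. It also reports the new couplings (fill-in) that the substitution creates.

```python
def eliminate(E, b, p):
    """Solve equation p for unknown p and substitute it into every other equation.

    E is a dict of sparse rows, b a dict of right-hand sides. Returns the reduced
    matrix and right-hand side over the remaining unknowns.
    """
    rest = [i for i in E if i != p]
    E_red = {i: {} for i in rest}
    b_red = {}
    for i in rest:
        factor = E[i].get(p, 0.0) / E[p][p]     # E[p][p] > 0 because E is positive definite
        b_red[i] = b[i] - factor * b[p]
        for j in rest:
            value = E[i].get(j, 0.0) - factor * E[p].get(j, 0.0)
            if value != 0.0:
                E_red[i][j] = value
    return E_red, b_red


# Hypothetical sparse SPD system; its exact solution is lambda_i = 1 for all i.
E = {
    2: {2: 4.0, 3: 1.0},
    3: {2: 1.0, 3: 4.0, 5: 1.0, 6: 1.0},
    5: {3: 1.0, 5: 4.0},
    6: {3: 1.0, 6: 4.0},
}
b = {2: 5.0, 3: 7.0, 5: 5.0, 6: 5.0}

E_red, b_red = eliminate(E, b, 3)
# Entries present after elimination that were structurally zero before it.
fill_in = sorted((i, j) for i in E_red for j in E_red[i] if j not in E[i])
print(E_red, b_red, fill_in)
```

Unknowns 2, 5 and 6 were not coupled to each other before, but they all become mutually coupled once $\lambda_3$ is substituted out. This fill-in is what makes the choice of elimination order matter.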

Page 57:

First Example: Seven-Body Mechanism

Page 58:

Page 59:

The Elimination Sequence

The fundamental question is this: in what sequence should the unknowns (the edges of the graph) be eliminated?
– Different elimination sequences result in different levels of effort

The question becomes more complicated since you are interested in a parallel elimination sequence
– You would like to limit the number of synchronization barriers that you impose in the implementation

In the end, although it is formulated as solving a linear system, the problem becomes that of starting with a graph and eliminating its edges in parallel
– Similar to a Mikado, or “pick-up sticks”, game that you want to play in parallel
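The effect of the elimination order can be seen on a tiny symbolic example. This sketch uses the standard elimination-graph model, in which unknowns are vertices and a coupling between two unknowns is an edge; that differs from the slide's picture, where the unknowns are the edges of the body graph. The function name `count_fill` and the star-shaped coupling pattern are assumptions made for illustration.

```python
def count_fill(adjacency, order):
    """Count the fill edges created when eliminating graph vertices in the given order."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        nbrs = sorted(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        for i, a in enumerate(nbrs):       # eliminating v couples all of its neighbours
            for c in nbrs[i + 1:]:
                if c not in adj[a]:
                    adj[a].add(c)
                    adj[c].add(a)
                    fill += 1
    return fill


# Hypothetical star-shaped coupling: unknown 0 is coupled to unknowns 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
bad = count_fill(star, [0, 1, 2, 3, 4])    # hub first: its neighbours become fully coupled
good = count_fill(star, [1, 2, 3, 4, 0])   # leaves first: no new couplings appear
print(bad, good)
```

In the good order the four leaves are not coupled to one another, so they could be eliminated concurrently with a single synchronization before the hub. That is the kind of parallel sequence the slide is asking for.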

Page 60:

Second Example: HMMWV Model

Elim. Sequence     A      M      I     F    NNZ
Bad                1240   1336   195   96   99
Good               459    469    109   10   99
Index Reduction    220    233    90    13   77