Performing acoustic, vibro-acoustic and aero-acoustic computations using MUMPS
Presented By: Eveline Rosseel
29 May 2013
FFT Confidential 5/29/2013
Outline
• Introduction to Free Field Technologies
• MUMPS in Actran
• Benchmark of sparse direct solvers
• Conclusions
Free Field Technologies: leader in acoustic, vibro-acoustic and aero-acoustic CAE
• Free Field Technologies (FFT): software development since 1998
• Main activities
– Development of the Actran software
– Services: training, consulting, technology transfer, …
– Research in acoustic CAE and related fields
• Our customers’ fields
– Automotive
– Aerospace
– Electronic
– Heavy equipment
Free Field Technologies around the world
Headquartered in Mont-Saint-Guibert, Belgium, FFT has offices in Toulouse, France; Tokyo, Japan; and Troy, MI, USA.
FFT is part of MSC Software Corporation, a leading international provider of Virtual Product Development technology.
Our software is distributed in each global region and used by more than 250 customers around the world.
MUMPS in Actran
MUMPS: default solver in Actran
Target applications
• Mostly complex, unsymmetric sparse systems with a symmetric structure
• Up to a few million DOFs and up to a few thousand right-hand sides (RHS)
• Out-of-core computations
• Shared and distributed memory computing
• Application-dependent sparsity patterns
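The first bullet distinguishes the sparsity pattern from the values: an entry at (i, j) always has a stored counterpart at (j, i), but the two values may differ. A minimal sketch of that distinction, assuming a coordinate-format list of (row, column, value) triples; the 3x3 matrix and both function names are hypothetical and not taken from Actran or MUMPS:

```python
def has_symmetric_structure(entries):
    """Return True if every stored (i, j) has a stored (j, i) counterpart."""
    pattern = {(i, j) for (i, j, _) in entries}
    return all((j, i) in pattern for (i, j) in pattern)


def is_symmetric(entries):
    """Return True if the stored values also satisfy a_ij == a_ji."""
    values = {(i, j): v for (i, j, v) in entries}
    return all(values.get((j, i)) == v for (i, j), v in values.items())


# Hypothetical 3x3 complex system in coordinate (row, column, value) form.
entries = [
    (0, 0, 4 + 1j), (0, 1, 1 - 2j),
    (1, 0, 3 + 0j), (1, 1, 5 + 0j), (1, 2, 2j),
    (2, 1, -1 + 1j), (2, 2, 6 - 1j),
]

print(has_symmetric_structure(entries))  # pattern check
print(is_symmetric(entries))             # value check
```

A solver can exploit the symmetric pattern during ordering and symbolic analysis even though the numerical factorization must treat the matrix as unsymmetric.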
MUMPS: highlighted experiences (backtransformation phase)
Out-of-core computations: congestion due to shared scratch disk
• Example: frequency parallelism, every proc runs its own MUMPS instance
[Diagram: procs 1 to n each solve one frequency (freq 1 to freq n) in their own memory and share one scratch disk]
Run | Factorize | Solve
1 proc, sequential | 39 min | 7 min
8 procs = 8 sequential MUMPS instances | 44 min | Up to 4.5 h
Configuration: 600 KDOF, 253 RHS, Westmere-EX Intel 2.26 GHz, 4x8 cores, RAID-0 SATA scratch disk
• Solution: ICNTL(27) and introduction of additional synchronization amongst procs
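ICNTL(27) is the MUMPS control parameter for the blocking size used with multiple right-hand sides. One way to see why it matters out-of-core is to count how often the factors are traversed. The sketch below assumes, as a simplification not stated on the slides, that the factors are read from the scratch disk once per block of right-hand sides; the function name is hypothetical.

```python
import math


def factor_read_passes(nrhs, block_size):
    """Number of sweeps over the factors if each RHS block needs one sweep."""
    if nrhs <= 0 or block_size <= 0:
        raise ValueError("nrhs and block_size must be positive")
    return math.ceil(nrhs / block_size)


nrhs = 253  # number of right-hand sides in the configuration above
for block_size in (1, 16, nrhs):
    passes = factor_read_passes(nrhs, block_size)
    print(f"block size {block_size:>3}: {passes} pass(es) over the factors")
```

Under this assumption a block size equal to the number of right-hand sides needs a single sweep, at the cost of holding all right-hand sides in memory at once.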
MUMPS: highlighted experiences (backtransformation phase)
Out-of-core computations
• Reduced I/O congestion using ICNTL(27) and additional synchronization points
• Optimal value: ICNTL(27)=NRHS
• Additional synchronization points: the backtransformation step of processors sharing the same scratch disk is done in sequential mode
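The additional synchronization can be pictured as a lock per scratch disk: factorizations run concurrently, but only one solver at a time enters the I/O-heavy backtransformation. The sketch below is an illustration only. It uses Python threads in place of MPI processes, replaces the solver calls with a placeholder, and all names are hypothetical; it does not show how Actran implements the synchronization.

```python
import threading

scratch_lock = threading.Lock()   # one lock per shared scratch disk
counter_lock = threading.Lock()   # protects the bookkeeping counters below
active = 0                        # workers currently in the backtransformation
max_active = 0                    # highest concurrency observed in that step


def solve_frequency(freq_index, results):
    """Stand-in for one solver instance handling one frequency."""
    global active, max_active
    # The factorization would run here, concurrently across workers (omitted).
    with scratch_lock:  # serialize the backtransformation on the shared disk
        with counter_lock:
            active += 1
            max_active = max(max_active, active)
        results[freq_index] = f"solution for frequency {freq_index}"  # placeholder
        with counter_lock:
            active -= 1


results = {}
workers = [threading.Thread(target=solve_frequency, args=(k, results))
           for k in range(8)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(f"frequencies solved: {len(results)}, max concurrent solves: {max_active}")
```

With separate processes the same effect would need an inter-process mechanism, for example passing a token between the ranks that share a disk.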
MUMPS: highlighted experiences (analysis phase)
• Quality of the reordered matrix (METIS, SCOTCH, …) influences memory consumption in the factorization phase
• Distributed computations:
– Memory consumption peak on proc 0 during the sequential analysis phase surpasses the memory consumption of the parallel factorization phase
[Figure: plot with regions labelled "Analysis phase" and "Factorization"]
Configuration: 1.9 MDOF, 1 RHS, SCOTCH ordering, out-of-core run on a Westmere-EX 2.4 GHz processor with 4x10 cores and 256 GB RAM
MUMPS: highlighted experiences (analysis phase)
Scalability of MPI computations: need for parallel analysis
• Avoid the memory consumption peak of the sequential analysis phase by using a parallel analysis phase: PT-Scotch or ParMETIS
[Figure: plot with regions labelled "Analysis phase" and "Factorization"]
Configuration: 1.9 MDOF, 1 RHS, PT-SCOTCH ordering, out-of-core run on a Westmere-EX 2.4 GHz processor with 4x10 cores and 256 GB RAM
• Problem: time and memory consumption have increased compared to the run with Scotch -> scalability issue in the parallel analysis
MUMPS in Actran: future plans
• Increasing model sizes:
– Need for 64-bit integer version
• Increasing number of nodes in distributed memory computing:
– Need for robust parallel analysis
• More investigations on hybrid iterative/direct solver use
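The need for a 64-bit integer version can be illustrated with a short calculation. It assumes a single array of complex double-precision entries (16 bytes each) addressed with signed 32-bit indices; where exactly such limits arise inside a given solver is not stated on the slides, and the entry count in the example is hypothetical.

```python
INT32_MAX = 2**31 - 1            # largest signed 32-bit integer
BYTES_PER_COMPLEX_DOUBLE = 16    # two 8-byte floats per entry


def max_gib_addressable(index_max=INT32_MAX,
                        bytes_per_entry=BYTES_PER_COMPLEX_DOUBLE):
    """Size in GiB of the largest array a 32-bit signed index can address."""
    return index_max * bytes_per_entry / 2**30


def needs_64bit_index(n_entries):
    """True if an array with n_entries elements cannot be 32-bit indexed."""
    return n_entries > INT32_MAX


print(f"32-bit limit: about {max_gib_addressable():.1f} GiB of complex doubles")
print(needs_64bit_index(3_000_000_000))  # hypothetical factor entry count
```

Under these assumptions, any single array of complex doubles beyond roughly 32 GiB already exceeds what 32-bit indices can address.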
Benchmark of sparse direct solvers
Overview solvers
Motivation: assess the performance of MUMPS with respect to other sparse direct solvers
Feature | MUMPS (4.10.0) | Pardiso 10.3 Intel MKL | UMFPACK (5.6.2)
Out-of-core | ✓ | ✓ | X
Multithreading | BLAS | ✓ | BLAS
MPI support | ✓ | X | X
Ordering | (PAR)Metis, (PT)Scotch, PORD, (Q)AMD, AMF | MD or Metis | (COL)AMD, Metis or NESDIS
Multiple RHS, block approach | ✓ (block size) | ✓ (no block size) | X
Iterative refinement | ✓ | ✓ | ✓
Single/double precision | ✓ | ✓ | ✓
Solver benchmark settings
• Sequential and multithreaded tests (no MPI)
• Ordering: METIS
• Pivot threshold: 0.01
• Double precision
• Internal RHS block size: 16
• No iterative refinement
• Memory relaxation: 20%
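For MUMPS, settings like these are normally expressed through its ICNTL and CNTL control arrays. The mapping below is a sketch based on the parameter descriptions in the MUMPS users' guide and should be checked against the guide for the version in use; the slides do not say which parameters Actran actually sets, and the dictionary layout and helper names are hypothetical.

```python
# Assumed mapping of the benchmark settings onto MUMPS control parameters.
benchmark_controls = {
    "ICNTL(7)": 5,     # ordering: METIS
    "CNTL(1)": 0.01,   # relative threshold for numerical pivoting
    "ICNTL(27)": 16,   # blocking size for multiple right-hand sides
    "ICNTL(10)": 0,    # maximum steps of iterative refinement (0 = none)
    "ICNTL(14)": 20,   # percentage increase of the estimated working space
}
# Double precision is selected by the arithmetic version of the library,
# not by a control parameter.


def with_out_of_core(controls):
    """Return a copy of the controls with the out-of-core mode switched on."""
    updated = dict(controls)
    updated["ICNTL(22)"] = 1  # 0 = in-core, 1 = out-of-core
    return updated


def format_controls(controls):
    """Render the controls as sorted 'name = value' lines for logging."""
    return [f"{name} = {value}" for name, value in sorted(controls.items())]


for line in format_controls(with_out_of_core(benchmark_controls)):
    print(line)
```

Pardiso and UMFPACK expose comparable options through their own parameter arrays, which are not sketched here.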
Acoustic test cases
• Represent 20-30% of our customers (mainly automotive)
• All tests successful
Acoustic radiation test case | Symmetric? | NDOF | NRHS
IFEM-VS | No | 15.6K | 1
PML-DC | Yes | 280K | 1
IFEM-DC | No | 405K | 1
RC_Indus_hpc_MUMPS case 3 (MB) | No | 730K | 1
RC_Indus_hpc_MUMPS case 2 (IFE) | No | 872K | 1
PML-DF | Yes | 1.05M | 1
IFEM-DF | No | 1.38M | 1
RC_Indus_hpc_MUMPS case 1 (IFE) | No | 1.90M | 3
Memory consumption: in-core
Pardiso MKL and MUMPS: comparable memory requirements
Pardiso MKL: lowest memory consumption
UMFPACK: highest memory consumption, large difference with other solvers
Memory consumption: out-of-core
Large difference between memory consumption of OOC Pardiso and OOC MUMPS on acoustic tests
Computational cost: sequential runs
MUMPS: lowest overall factorization time for the sequential runs
UMFPACK: very high factorization time on largest test case
Computational cost: multithreaded runs
[Figure: two panels, "Absolute timings" and "Parallel efficiency"]
Pardiso MKL: nearly optimal parallel efficiency
UMFPACK: very high computing times
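Parallel efficiency is commonly defined as the speedup divided by the number of threads, E(p) = T(1) / (p · T(p)), so a value of 1 means ideal scaling. A minimal sketch of that definition follows; the timings are hypothetical placeholders and are not measurements from this benchmark.

```python
def speedup(t_seq, t_par):
    """Speedup of a parallel run relative to the sequential run."""
    return t_seq / t_par


def parallel_efficiency(t_seq, t_par, n_threads):
    """Speedup divided by the thread count; 1.0 means ideal scaling."""
    return speedup(t_seq, t_par) / n_threads


# Hypothetical wall-clock timings in seconds (not taken from the slides).
t_seq = 1200.0
timings = {2: 640.0, 4: 350.0, 8: 200.0}

for n_threads, t_par in timings.items():
    eff = parallel_efficiency(t_seq, t_par, n_threads)
    print(f"{n_threads} threads: speedup {speedup(t_seq, t_par):.2f}, "
          f"efficiency {eff:.2f}")
```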
Vibro-acoustic test cases
• Represent 50-60% of our customers
Test case | MUMPS | PARDISO MKL | UMFPACK
RC_Indus_hpc_MUMPS 4 (276 KDOF, 2 RHS) | OK | OK | OK
Ship (410 KDOF, 3 RHS) | OK | OK | OK
RC_Indus_hpc_MUMPS 5 (1.86 MDOF, 1 RHS) | OK | OK | OK
Pl case (1.75 MDOF, 20 RHS) | OK | IC: zero pivot error; OOC: OK | Out of memory (> 250 GB)
Cockpit (3.09 MDOF, 50 RHS) | IC: memory allocation error | OK | N.A. (symmetric matrix)
Vibro-acoustic test case RC_Indus_hpc_MUMPS_C5
Peak memory consumption (Gbyte)
Mode | Pardiso MKL | MUMPS | UMFPACK
IN-CORE | 47.3 | 52.2 | 76.9
OUT-OF-CORE | 10.9 | 11.7 | -
• IC: Pardiso has the lowest memory requirements, UMFPACK the highest
Computation time
• Same trend as for pure acoustic problems: MUMPS has the lowest factorization time
• UMFPACK: largest sequential computation time
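The peak-memory figures above also show how much the out-of-core mode saves for each solver on this test case. The short calculation below uses only the values from the table; the dictionary layout and function name are hypothetical.

```python
# Peak memory in Gbyte for RC_Indus_hpc_MUMPS_C5, copied from the table above.
peak_gbyte = {
    "Pardiso MKL": {"in_core": 47.3, "out_of_core": 10.9},
    "MUMPS": {"in_core": 52.2, "out_of_core": 11.7},
}


def ooc_reduction(in_core, out_of_core):
    """Fraction of in-core peak memory saved by running out-of-core."""
    return 1.0 - out_of_core / in_core


reduction = {
    solver: ooc_reduction(mem["in_core"], mem["out_of_core"])
    for solver, mem in peak_gbyte.items()
}

for solver, fraction in reduction.items():
    print(f"{solver}: out-of-core saves {fraction:.1%} of in-core peak memory")
```

Both solvers cut peak memory by roughly three quarters on this case, so the in-core ranking carries over almost unchanged to the out-of-core runs.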
TM test cases
• Represent 10-15% of our customers
Test case | MUMPS | PARDISO MKL | UMFPACK
Inlet-Nacelle (542 KDOF, 44 RHS) | OK | OK | OK
Inlet-APU (818 KDOF, 7 RHS) | OK | OK | OK
By-pass DUCT (596 KDOF, 253 RHS) | OK | OK | OK
By-pass (1.54 MDOF, 521 RHS) | OK | OK | OK
Inlet-Nacelle (3.02 MDOF, 151 RHS) | IC: memory allocation error, OOC: OK | OK | Out of memory (> 250 GB)
TM test cases: memory
[Figure: memory plots for the Inlet Nacelle (600 KDOF) and ByPass Duct (600 KDOF) test cases]
Real, symmetric test cases
• Low importance to our customers (5%)
• 3 pure vibro test cases (316 to 733 KDOF) and 1 pure acoustic test (419 KDOF)
• MUMPS: large memory requirements on the pure acoustic test case w.r.t. Pardiso MKL
• Pardiso: large memory requirements on the pure vibro test cases
Conclusions
Performing acoustic, vibro-acoustic and aero-acoustic simulations with MUMPS
MUMPS in Actran
• Default solver in Actran: very good results obtained, thanks to the MUMPS developers!
• Interested in more extensive use of the parallel analysis tool
Solver benchmarks
• UMFPACK:
– Many restrictions: in-core only, only 1 RHS at a time, non-symmetric matrices only
– No improvement over MUMPS: excessive memory consumption and computation time
• Pardiso:
– Low memory requirements, especially on OOC acoustic test cases
– Good multithreaded behaviour: almost optimal scalability
• MUMPS:
– Fast solver: overall the lowest factorization time
– Low memory requirements for the out-of-core version, especially on vibro(-acoustic) tests