MUMPS v 5.1mumps.enseeiht.fr/doc/ud_2017/Sikelis_Talk.pdf · Kostas Sikelis, Phd Software...
Transcript of MUMPS v 5.1mumps.enseeiht.fr/doc/ud_2017/Sikelis_Talk.pdf · Kostas Sikelis, Phd Software...
Kostas Sikelis, Phd
Software Development specialist, Optistruct
MUMPS User Days, June 2017
MUMPS v 5.1.1
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Outline
• Optistruct - Overview
• Comparison of MUMPS 5.1.1 vs 5.0.1
• 64 bit MUMPS
• BLR/ABLR – Preliminary results
• Wish list
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
OptiStruct - Overview
Large Scale Computing and Parallelization
CompositesVibrations and
Acoustics
Linear and
Nonlinear
Analysis
Fatigue Multiphysics
Optimization
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – MUMPS usage
• Linear static analysis
• Symmetric positive definite or indefinite systems
• Nonlinear static/transient analysis
• Unsymmetric systems
• Lanczos Eigensolver
• Symmetric indefinite systems
• Direct Frequency response Analysis
• Symmetric complex systems
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – MUMPS Benchmark
Model # of Equations # of Nonzeros # of Elements Element Type
Knuckle 2.8M 117M 650K Solid (CTETRA10)
Car body 12M 320M 1.8M Shell (CQUAD4/CTRIA3)
366K Solid (CTETRA/CPENTA/CHEXA
• Incore linear static analysis.
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – MUMPS Benchmark
Data extracted from MUMPS report
• Time
• ELAPSED TIME IN ANALYSIS DRIVER
• ELAPSED TIME IN FACTORIZATION DRIVER
• ELAPSED TIME IN SOLVE DRIVER
• Total Memory
• TOTAL space in MBYTES for IC factorization
• ** EFF Min: Avg. Space in MBYTES per working proc (x # of MPI-Processes)
• ** Avg. Space in MBYTES per working proc during solve (x # of MPI-Processes)
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – 5.1.1 vs 5.0.1
• Slight performance drop in Analysis phase due to 64bit METIS
• Improved MPI scaling both in Factorization and Solve phases
• Signifficantly improved SMP scaling both in Factorization and Solve phases
1 2 4 8 16
5.1.1 80 77 80 83 95
5.0.1 66 67 67 71 79
0
20
40
60
80
100
# of mpi
Analysis Wall time - Shell
1 2 4 8 12
5.1.1 52 53 53 54 55
5.0.1 45 44 44 46 47
0
10
20
30
40
50
60
# of mpi
Analysis Wall time - Solid
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
1 2 4 8 12
5.1.1 7172 4567 2515 2227 1248
5.0.1 7072 3817 3230 2302 1316
010002000300040005000600070008000
# of mpi
Factorization Wall time (s) - Solid
-19,6%
-1,4%
22,1%3,2%
5,1%
MUMPS in Optistruct – 5.1.1 vs 5.0.1
• Slight performance drop in Analysis phase due to 64bit METIS
• Improved MPI scaling both in Factorization and Solve phases
• Signifficantly improved SMP scaling both in Factorization and Solve phases
1 2 4 8 16
5.1.1 335 168 108 77 55
5.0.1 328 180 116 72 60
050
100150200250300350400
# of mpi
Factorization Wall time (s) - Shell
-2,1%
6,7%
6,9%-6,9%
8,3%
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – 5.1.1 vs 5.0.1
• Slight performance drop in Analysis phase due to 64bit METIS
• Improved MPI scaling both in Factorization and Solve phases
• Signifficantly improved SMP scaling both in Factorization and Solve phases (even more potential withaggressive setting)
1 2 4 8 12
5.1.1 7172 3605 1940 1073 779
5.0.1 7072 3729 2131 1286 1010
010002000300040005000600070008000
# of threads
Factorization Wall time (s) - Solid
22,9%
-1,4%
3,3%
9,0%16,6%
1 2 4 8 16
5.1.1 335 225 169 163 163
5.0.1 328 270 201 207 201
050
100150200250300350400
# of threads
Factorization Wall time (s) - Shell
-2,1%
16,7%15,9% 21,2% 18,9%
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – MUMPS 64bit (full)
• Motivation
• Industrial models challenge 32bit integer capacity
• Quick solution: Promote all integers to 64
• MPI-Free Implementation
• Compile Fortran source forcing all integers to be 64bit (-i8 or -fdefault-integer-8 in OPTF)
• Compile C source forcing all integers to be 64bit (-DINTSIZE64 in OPTC)
• Link with 64bit BLAS libraries
• MPI : Intel supports 64 bit integer mpi with option –ilp64 in mpirun command, but with the following limitation
• In REDUCE operations MPI_2INTEGER data type is not promoted, send and receive buffers must be INTEGER(4)
• For custom reduction functions used in REDUCE operations the integer arguments must be INTEGER(4)
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – MUMPS 64bit (full)
i. bcast_errors.F@16 : INTEGER IN(2), OUT(2) --> INTEGER(4) IN(2), OUT(2)
ii. tools_common.F@269 : INTEGER TEMP1(2), TEMP2(2) --> INTEGER(4) TEMP1(2), TEMP2(2)
iii. [dzsc]fac_scalings_simScale_util.F@23 : INTEGER IWRK(IWSZ) --> INTEGER(4) IWRK(IWSZ)
iv. [dzsc]fac_scalings_simScale_util.F@433-436: INTEGER LEN, INV(2*LEN), INOUTV(2*LEN), DTYPE --> INTEGER(4) LEN, INV(2*LEN), INOUTV(2*LEN), DTYPE
v. [dzsc]fac_scalings_simScale_util.F@466 : INTEGER IW(IWSZ) --> INTEGER(4) IW(IWSZ)
vi. [dzsc]fac_scalings_simScale_util.F@923 : INTEGER IWRK(IWSZ) --> INTEGER(4) IWRK(IWSZ)
vii. [dzsc]mumps_driver.F@1642 : INTEGER TMP1(2),TMP(2) --> INTEGER(4) TMP1(2),TMP(2)
• MUMPS 5.0.1 : Apply above changes manually
• MUMPS 5.1.1 : -DWORKAROUNDINTELILP64MPI2INTEGER
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.Copyright © 2015 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
• MUMPS 5.0.1
• NZ -> 32 bit
• Stiffness definitely less than 2 billion
• 32bit integer for internal symbolic
analysis
• No more than 1 billion terms
• Even smaller model for unsymmetric
matrix from nonlinear with friction
• Interface to METIS 32bit API
• Need scotch sometimes
Optistruct – MUMPS 64 bit (selective)
• MUMPS 5.1.1
• NZ -> 32 bit, NNZ -> 64bit
• Backward compatible
• Solve stiffness with more than 2
billion terms if NNZ is used
• 64bit integer for internal symbolic
analysis
• Solve more than 1 billion terms
• Interface to METIS 64bit API
• Avoid METIS overflow
• Better performance
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – selective 64 bit vs full
• Analysis phase is a bit faster
• Factorization phase occasionally much faster.
• Solve phase a bit slower.
• Considerable reduction in Total Memory.
1x12 2x6 4x3 6x2 12x1
OS32 55 53 53 57 55
OS64 55 55 55 56 57
0
10
20
30
40
50
60
#of mpi x # of threads
Analysis Wall time (s) - Solid
0% 3,8% 3,8% -1,8% 3,6%
1x16 2x8 4x4 8x2 16x1
OS32 80 76 76 80 95
OS64 82 79 81 85 99
0
20
40
60
80
100
120
#of mpi x # of threads
Analysis Wall time (s) - Shell
2,5% 3,9% 6,6% 6,2%4,2%
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – selective 64 bit vs full
1x12 2x6 4x3 6x2 12x1
OS32 780 942 1020 1011 1248
OS64 776 921 1262 1118 1257
0
200
400
600
800
1000
1200
1400
#of mpi x # of threads
Factorization Wall time (s) - Solid
-0,5%-2,2% 23,7% 10,6%
0,7%
1x16 2x8 4x4 8x2 16x1
OS32 163 82 59 49 55
OS64 157 81 61 58 89
0
50
100
150
200
#of mpi x # of threads
Factorization Wall time (s) - Shell
-3,7%
-1,2%3,4%
18,4% 61,9%
• Analysis phase is a bit faster
• Factorization phase occasionally much faster.
• Solve phase a bit slower.
• Considerable reduction in Total Memory.
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – selective 64 bit vs full
• Analysis phase is a bit faster
• Factorization phase occasionally much faster.
• Solve phase a bit slower.
• Considerable reduction in Total Memory.
1x12 2x6 4x3 6x2 12x1
OS32 9,2 6,5 5,2 5,3 5,8
OS64 9,6 6,5 4,8 4,6 5,4
0
2
4
6
8
10
12
#of mpi x # of threads
Solve Wall time (s) - Solid
4,3%
0%-7,7% -6,9%-13,2%
1 2 4 8 16
OS32 10,7 7,5 6,5 2,6 3,3
OS64 10,8 7,5 5,2 3,5 3,2
0
2
4
6
8
10
12
#of mpi x # of threads
Solve Wall time (s) - Shell
0,9%
0%-20%
34,6%-3%
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – selective 64 bit vs full
• Analysis phase is a bit faster
• Factorization phase occasionally much faster.
• Solve phase a bit slower.
• Considerable reduction in Total Memory.
1x12 2x6 4x3 6x2 12x1
OS32 85122 102054 115483 152051 201719
OS64 91174 111843 137507 185438 225805
0
50000
100000
150000
200000
250000
#of mpi x # of threads
Total Memory Estimate (MB) - Solid
7,1%9,6%
19%
22%
12%
1x16 2x8 4x4 8x2 16x1
OS32 41182 43499 45981 52768 61797
OS64 42909 46238 52252 60256 71286
01000020000300004000050000600007000080000
#of mpi x # of threads
Total Memory Estimate (MB) - Shell
4,2%6,3% 13,6%
14,2%15,4%
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – BLR/ABLR
• Analysis phase is a bit slower (for solid).
• Factorization and Solve phases up to 3 times faster (for solid)
• Reduced accuracy of the solution. Solution changes with the number of MPI-processes
1x16 2x8 4x4 8x2 16x1
BLR 84 83 91 99 108
ABLR 86 83 86 90 108
5.1.1 80 76 76 80 95
020406080
100120
#of mpi x # of threads
Analysis Wall time (s) - Shell
1x12 2x6 4x3 6x2 12x1
BLR 73 72 72 76 73
ABLR 72 72 72 76 73
5.1.1 55 53 53 57 55
01020304050607080
#of mpi x # of threads
Analysis Wall time (s) - Solid
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – BLR/ABLR
• Analysis phase is a bit slower (for solid).
• Factorization and Solve phases up to 3 times faster (for solid)
• Reduced accuracy of the solution. Solution changes with the number of MPI-processes
1x16 2x8 4x4 8x2 16x1
BLR 153 77 53 46 55
ABLR 158 77 52 43 52
5.1.1 163 82 59 49 55
0
50
100
150
200
#of mpi x # of threads
Factorization Wall time (s) - Shell
1x12 2x6 4x3 6x2 12x1
BLR 362 282 294 328 349
ABLR 296 256 262 307 314
5.1.1 780 942 1020 1011 1248
0200400600800
100012001400
#of mpi x # of threads
Factorization Wall time (s) - Solid
x2,1x3,3 x3,5 x3 x3,5
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
MUMPS in Optistruct – BLR/ABLR
• Analysis phase is a bit slower (for solid).
• Factorization and Solve phases up to 3 times faster (for solid)
• Reduced accuracy of the solution. Solution changes with the number of MPI-processes
1x12 2x6 4x3 6x2 12x1
BLR 2,4640E+03 2,5385E+03 2,4489E+03 2,6225E+03 2,5359E+03
ABLR 2,4640E+03 2,4649E+03 4,8555E+04 2,3939E+06 1,5674E+03
5.1.1 2,43367E+032,43367E+03 2,43367E+032,43367E+032,43367E+03
1,0E+001,0E+011,0E+021,0E+031,0E+041,0E+051,0E+061,0E+07
#of mpi x # of threads
Log(Compliance) - Solid
1x12 2x6 4x3 6x2 12x1
BLR 1,2578E-02 4,2218E-02 6,4007E-03 7,3649E-02 4,1007E-02
ABLR 1,2578E-02 1,8823E-02 2,3833E+00 1,9451E+00 3,4821E+00
5.1.1 2,3268E-10 2,3735E-10 2,4897E-10 2,8041E-10 2,9909E-10
1,0E-10
1,0E-08
1,0E-06
1,0E-04
1,0E-02
1,0E+00
#of mpi x # of threads
Epsilon - Solid
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – Wish list
• Robust detection of ill-conditioned (near-singular) matrix
• Often MUMPS provides physically meaningless solutions to ill-conditioned systems.
• Preferable just error out.
• Possibly a parameter similar to MAXRATIO used for LDLT i.e 𝒓 = 𝒎𝒂𝒙(𝑫𝒊𝒂𝒈 𝑨 𝒊𝒊
𝑫𝒊)
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – Wish list
• Detection of NaN in input matrix
• A failsafe in case input is erroneous.
• Preferable to error out instead of delivering an erroneous solution
© 2016 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved.
Optistruct – Wish list
• Suggestion of optimal reordering
0:00:00
0:02:53
0:05:46
0:08:38
0:11:31
0:14:24
0:17:17
0:20:10
0:23:02
0:25:55
METIS SCOTCH PT-SCOTCH
Total timing
0
5
10
15
20
25
30
35
40
METIS SCOTCH PT-SCOTCH
Operations (1.0E+12)
Questions?
THANK YOU
MUMPS Team, thank you for a great linear equation solver!