Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments
Luis E. Garcia-Castillo (1), Daniel Garcia-Donoro (2), Adrian Amor-Martin (1)
(1) Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Spain, {luise, aamor}@tsc.uc3m.es
(2) Xidian University, Xi'an, China, daniel@xidian.edu.cn
GREMA-UC3M Higher Order Finite Element Code v1.2
UC3M: a young University established in 1989

3 Campuses in the Madrid Region:
• Getafe, 11 km from the capital
• Leganés, 12 km from the capital
• Colmenarejo, 45 km from the capital

3 Schools (Bachelor programs):
• Social and Legal Sciences (GC)
• Humanities, Communication and Library Sciences (GC)
• Polytechnic School (LC)

1 Center for Advanced Studies:
• For Master programs
Outline
1. Antecedents
2. Parallel Higher-Order FEM Code: Electromagnetic Modeling Features; Computational Features and Implementation; GUI with HPCaaS Interface
3. MUMPS Interface: Getting Experience with MUMPS; Present Stage of Development; Some Issues with Memory; Specialized Interfaces
4. Applications & Performance
5. Work in Progress and Future Work
Antecedents
More than 20 years of experience on numerical methods for EM (mainly FEM, but also others). Contributions on:
• Curl-conforming basis functions
• Non-standard mesh truncation technique (FE-IIEE) for scattering and radiation problems
• Adaptivity: h and hp strategies
• Hybridization with MoM and high-frequency techniques such as PO/PTD and GTD/UTD

Code written from scratch, mainly during the PhD thesis of D. Garcia-Donoro, with parallel processing (MPI) and HPC in mind:
• Inclusion of well-proven research techniques developed within the research group
• "Reasonably" friendly to be used by non-developers
FEM Formulation: Double-Curl Vector Wave PDE

Formulation based on the double-curl vector wave equation (use of E or H):

∇ × (f̄ᵣ⁻¹ ∇ × V) − k₀² ḡᵣ V = −j k₀ h₀ P − ∇ × (f̄ᵣ⁻¹ L)

Table: Formulation magnitudes and parameters

  Form | V | f̄ᵣ | ḡᵣ |  h  | P |  L | Γ_D   | Γ_N
  E    | E | μ̄ᵣ | ε̄ᵣ |  η  | J |  M | Γ_PEC | Γ_PMC
  H    | H | ε̄ᵣ | μ̄ᵣ | 1/η | M | −J | Γ_PMC | Γ_PEC
FEM Formulation: Boundary Conditions and Excitations

The boundary conditions considered are of Dirichlet, Neumann and Cauchy types:

n̂ × V = Ψ_D over Γ_D   (1)
n̂ × (f̄ᵣ⁻¹ ∇ × V) = Ψ_N over Γ_N   (2)
n̂ × (f̄ᵣ⁻¹ ∇ × V) + γ n̂ × n̂ × V = Ψ_C over Γ_C   (3)

• Periodic Boundary Conditions on the unit cell (infinite array approach)
• Analytic boundary conditions for waveports of common waveguides, and also numerical waveports for arbitrary waveguides by means of a 2D eigenvalue/eigenmode characterization
• Lumped RLC (resistors, coils and capacitors) elements and ports
• Impressed electric and magnetic currents; plane waves
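The numerical-waveport characterization mentioned above reduces to a 2D eigenvalue problem on the port cross-section. As an illustrative toy only (a finite-difference sketch, not the code's actual 2D eigensolver): the TM-mode cutoff wavenumbers of a rectangular a × b guide are obtained from the Dirichlet Laplacian and checked against the analytic values π²(m²/a² + n²/b²).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

a, b = 2.0, 1.0          # cross-section dimensions (illustrative units)
nx, ny = 60, 30          # interior grid points
hx, hy = a / (nx + 1), b / (ny + 1)

# 1D second-difference operators with homogeneous Dirichlet walls (TM modes)
Dx = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(nx, nx)) / hx**2
Dy = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(ny, ny)) / hy**2
L = sp.kron(sp.eye(ny), Dx) + sp.kron(Dy, sp.eye(nx))   # 2D Laplacian

# Smallest eigenvalues of -L are the squared cutoff wavenumbers k_c^2
kc2 = np.sort(eigsh(-L.tocsc(), k=4, sigma=0, which="LM",
                    return_eigenvectors=False))

# Analytic TM_mn cutoffs of the rectangular guide for comparison
kc2_exact = sorted(np.pi**2 * (m**2 / a**2 + n**2 / b**2)
                   for m in range(1, 4) for n in range(1, 4))[:4]
for num, ex in zip(kc2, kc2_exact):
    print(f"numerical k_c^2 = {num:8.4f}   analytic = {ex:8.4f}")
```

For an arbitrary cross-section the same eigenproblem is posed on an unstructured 2D mesh, which is what makes the waveport "numerical".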
FEM Formulation: Variational Formulation

Use of H(curl) spaces:

H(curl)₀ = { W ∈ H(curl) : n̂ × W = 0 on Γ_D }   (4)
H(curl) = { W ∈ L² : ∇ × W ∈ L² }   (5)

and the Galerkin method: find V ∈ H(curl) such that c(F, V) = l(F) for all F ∈ H(curl)₀, with

c(F, V) = ∫_Ω (∇ × F) · (f̄ᵣ⁻¹ ∇ × V) dΩ − k₀² ∫_Ω F · ḡᵣ V dΩ + γ ∫_{Γ_C} (n̂ × F) · (n̂ × V) dΓ_C

l(F) = −j k₀ h₀ ∫_Ω F · P dΩ − ∫_{Γ_N} F · Ψ_N dΓ_N − ∫_{Γ_C} F · Ψ_C dΓ_C − ∫_Ω F · ∇ × (f̄ᵣ⁻¹ L) dΩ
FEM Discretization: Higher-Order Curl-Conforming Elements

Own family of higher-order isoparametric curl-conforming finite elements (tetrahedron, prism, hexahedron - under test). Rigorous implementation of Nedelec's mixed-order elements.
[Figure: 2nd-order versions of the tetrahedral and prismatic elements, with local coordinates (ξ, η, ζ) and numbered degrees of freedom]
Mesh Truncation with FE-IIEE: Problem Set-Up

Open-region problems handled (optionally) by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) ⇒ asymptotically exact absorbing boundary condition.
[Figure: FEM domain (Ω_FEM) containing media (ε₁, μ₁) and (ε₂, μ₂), PEC/PMC objects and sources J, M; truncation boundary S with equivalent currents J_eq, M_eq on the inner surface S′ (normals n̂, n̂′); exterior domain (Ω_EXT) with incident field E_inc, H_inc]
Mesh Truncation with FE-IIEE: Algorithm

Local BC for FEM (sparse matrices):

n̂ × (f̄ᵣ⁻¹ ∇ × V) + γ n̂ × n̂ × V = Ψ^INC + Ψ^SCAT over S

Iterative estimation of Ψ^SCAT by the exterior Equivalence Principle on S′:

V^FE-IIEE = ∮_{S′} (L_eq × ∇G) dS′ − j k₀ h₀ ∮_{S′} O_eq · (G + (1/k₀²) ∇∇G) dS′

∇ × V^FE-IIEE = j k₀ h₀ ∮_{S′} (O_eq × ∇G) dS′ − ∮_{S′} L_eq · (k₀² G + ∇∇G) dS′

Ψ^SCAT = n̂ × (f̄ᵣ⁻¹ ∇ × V^FE-IIEE) + γ n̂ × n̂ × V^FE-IIEE
Mesh Truncation with FE-IIEE: Flow Chart

[Flow chart: the FEM code for non-open problems (mesh → computation of element matrices → assembly of element matrices → imposition of BC not related to S → sparse solver → postprocess) is augmented with the FE-IIEE loop: initial BC Ψ⁽⁰⁾(r) on S; from V(r ∈ Γ_S) and ∇ × V(r ∈ Γ_S), equivalent currents J_eq⁽ⁱ⁾(r′), M_eq⁽ⁱ⁾(r′) yield the upgraded BC Ψ⁽ⁱ⁺¹⁾(r) on S]
Computational Features
• Code written using modern Fortran constructs (F2003)
• Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms
• Numerical verification by the Method of Manufactured Solutions
• Numerical validation by tests with EM benchmark problems
• Hybrid MPI+OpenMP programming
• Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...)
• Graphical User Interface (GUI) with HPCaaS interface
• Linux & Windows versions
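A minimal sketch of the Method of Manufactured Solutions mentioned above, applied to a 1D Helmholtz model problem rather than the actual FEM code: pick a manufactured solution u_m, derive the source f that makes it exact, solve numerically, and confirm the expected second-order convergence under mesh halving.

```python
import numpy as np

k = 2.0                                        # wavenumber (illustrative)
u_m = lambda x: np.sin(np.pi * x)              # manufactured solution
f   = lambda x: (k**2 - np.pi**2) * np.sin(np.pi * x)   # derived source term

def solve(n):
    """Central-difference solve of u'' + k^2 u = f with u(0) = u(1) = 0."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    A = (np.diag(np.full(n, -2.0)) + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1)) / h**2 + k**2 * np.eye(n)
    u = np.linalg.solve(A, f(x))
    return np.max(np.abs(u - u_m(x)))          # max discretization error

e1, e2 = solve(39), solve(79)                  # h and h/2
print(f"error(h) = {e1:.3e}  error(h/2) = {e2:.3e}  ratio = {e1/e2:.2f}")
```

An error ratio near 4 under mesh halving is the pass criterion for a second-order scheme; any implementation bug typically destroys the observed convergence rate, which is what makes MMS an effective verification tool.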
Computational Features: Towards HPC
• "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
• Global mesh object → local mesh objects on each processor
• Specialized direct solver interfaces
• Problems of several tens of millions of unknowns on more than one thousand cores
Parallel Flow Chart of the Code

[Flow chart (MPI division into processes p0, p1, p2, ..., pn): read input data → create global mesh object → numbering of DOFs → deallocate global mesh object (p1, ..., pn) while p0 initializes the solver → create local mesh objects → (if FE-IIEE enabled) extract boundary mesh → calculate & fill LHS matrix → matrix factorization → calculate & fill RHS matrix (p0) → solve system of equations → (if FE-IIEE enabled) calculate scattered field, update global RHS on p0 and re-solve until the convergence/iteration criterion is achieved → loop over remaining frequencies → postprocess]
Graphical User Interface: Features

• GUI based on a general-purpose pre- and post-processor called GiD (http://gid.cimne.upc.es)
• Creation (or import) of the geometry model of the problem
• Mesh generation
• Assignment of material properties and boundary conditions
• Visualization of results
• Integration with Posidonia (in-house HPCaaS)
Posidonia: In-house HPCaaS Solution

• Easing the use of HPC platforms
• Remote job submission to HPC infrastructures
• Designed with security, user-friendliness, collaborative computing and mobility in mind
• Management of all the communication with the remote computer system (file transfers, ...)
• Interaction with its batch system (job scheduler)
• History repository of simulations
• Notification when a submitted job is completed
• Transparent downloading of the results to visualize them locally
• Posidonia also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, no. 6, pp. 166-177, Dec. 2015.
GUI with HPCaaS Interface: Screenshot
Getting Experience with MUMPS: From "MUMPS do it all" to ...

Matrix format:
• Elemental
• Assembled (centralized on process 0)
• Assembled (distributed among processes)
• Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
• Dense RHS
• Sparse RHS (large number of RHS vectors)
• Centralized solution
• Distributed solution (waiting for distributed RHS feature, ...)
MUMPS Interface: Present Stage of Development (version 5.0.2)

• MUMPS initialization
• Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
• Computation of the FEM matrix coefficients associated with each local process
• Input of matrix coefficients to MUMPS in distributed assembled format
• Call to MUMPS for matrix factorization
• Computation of FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
• Call to MUMPS for system solve
• (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
• MUMPS finalization

* Frequent use of the out-of-core (OOC) capabilities of MUMPS
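The factorize-once, solve-in-blocks workflow above can be sketched with SciPy's sparse LU standing in for MUMPS (sizes and block length are illustrative): the factorization is reused while many sparse excitation vectors are streamed through it a few at a time, so only one block of dense solution vectors is ever held in memory.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n, n_rhs, block = 500, 100, 16                 # toy sizes; block ~ the 10-20 above
A = sp.random(n, n, density=0.01, random_state=0, format="csc") \
    + 10.0 * sp.eye(n, format="csc")           # stand-in system matrix
lu = splu(A)                                   # analysis + factorization, once

B = sp.random(n, n_rhs, density=0.02, random_state=1, format="csc")  # sparse RHS
results = []                                   # keep only reduced quantities
for j0 in range(0, n_rhs, block):
    X = lu.solve(B[:, j0:j0 + block].toarray())     # dense solutions, one block
    results.extend(np.linalg.norm(X, axis=0))       # e.g. per-excitation norms

print(f"solved {n_rhs} right-hand sides in blocks of {block}")
```

In the real interface the sparse RHS blocks are passed to the MUMPS solve phase directly; the point of the sketch is only the blocking pattern.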
Some Issues with Memory: Memory Allocation inside MUMPS

Memory Issue
• A peak memory use during the analysis phase has been detected (distributed assembled)
• Found to be due to memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647   MAXS = ordLAST(I)-ordFIRST(I)+1
1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory Issue
• Example: 45,000,000-dof problem using 192 processes and 4 bytes per integer
• MAXS bandwidth is 45,000,000 ⇒ 34.56 GB of memory per process

Workaround
• Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
• Change the ordering of dof given as input to MUMPS (to be done)
• It may be the case we are doing something completely wrong ...
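A quick arithmetic check of the per-process figure implied by the listing above (SIPES has max(1, MAXS) rows and NPROCS columns of 4-byte integers):

```python
# Per-process allocation of SIPES(max(1,MAXS), NPROCS) with 4-byte integers
maxs, nprocs, bytes_per_int = 45_000_000, 192, 4
gigabytes = maxs * nprocs * bytes_per_int / 1e9
print(f"{gigabytes:.2f} GB per process")       # 34.56 GB per process
```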
Some Issues with Memory: Large Number of RHS & Solution Vectors

Large Number of RHS Vectors
Analysis of a given problem under a large number of excitations. Examples:
• Monostatic radar cross section (RCS) prediction
• Large arrays of antennas

Present MUMPS Interface
Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time):
• Use of sparse format for the RHS
• Use of centralized solution vectors

The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.
Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed Solution Planned for Near Future
• Update of the centralized RHS by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
• Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS Interfaces: Repetitive Solver

[Figures: Vivaldi antenna; antenna array]
[Figure: antenna array]

[Figure: (a) unit cell; (b) unit cell mesh]

[Figure: virtual mesh of the antenna array]
Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
   - Identical matrices with different right-hand sides
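A dense NumPy sketch of steps 1-5 (toy sizes; the cells are taken as decoupled for brevity, whereas real virtual cells share interface unknowns): one Schur complement serves every cell, and the interior problems reuse one identical matrix with different right-hand sides.

```python
import numpy as np

rng = np.random.default_rng(0)
ni, nb, cells = 8, 4, 3            # interior dofs, boundary dofs, cell count

# Unit-cell blocks [[Aii, Aib], [Abi, Abb]]; the same matrix in every cell
Aii = 5.0 * np.eye(ni) + 0.1 * rng.standard_normal((ni, ni))
Aib = 0.1 * rng.standard_normal((ni, nb))
Abi = 0.1 * rng.standard_normal((nb, ni))
Abb = 5.0 * np.eye(nb)

# Step 1: Schur complement of the unit cell, computed once
S = Abb - Abi @ np.linalg.solve(Aii, Aib)

# Steps 2-4: assemble one block per virtual cell and solve the interface problem
S_glob = np.kron(np.eye(cells), S)
f_i = [rng.standard_normal(ni) for _ in range(cells)]   # interior excitations
g_b = [rng.standard_normal(nb) for _ in range(cells)]   # boundary excitations
rhs = np.concatenate([g - Abi @ np.linalg.solve(Aii, f)
                      for f, g in zip(f_i, g_b)])
u_b = np.linalg.solve(S_glob, rhs).reshape(cells, nb)

# Step 5: interior back-solves (identical matrix, different right-hand sides)
u_i = [np.linalg.solve(Aii, f - Aib @ ub) for f, ub in zip(f_i, u_b)]

# Sanity: every cell satisfies its full block system
for f, g, ui, ub in zip(f_i, g_b, u_i, u_b):
    assert np.allclose(Aii @ ui + Aib @ ub, f)
    assert np.allclose(Abi @ ui + Abb @ ub, g)
print("Schur-based repetitive solve verified")
```

In the production interface the unit-cell Schur complement comes from MUMPS itself, and only one copy of it is formed regardless of the number of virtual cells.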
Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and Remarks
• Advantages: saving in time and memory
• Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a certain level of the tree are identical
  - Further saving in time
  - Large saving in memory
• Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored
Specialized MUMPS Interfaces (cont.): Repetitive Solver

[Figure]
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

• Multifrontal algorithm on only a few levels
• Iterative solution from the last level of the multifrontal algorithm
• It can be understood as the direct solver acting as a preconditioner for the iterative solver
• Natural approach to some DDM strategies
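A conceptual SciPy sketch of the idea (not the MUMPS-based implementation): a direct factorization of an approximate operator, standing in here for the partially completed multifrontal elimination, preconditions a Krylov iteration on the full system.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, gmres, LinearOperator

n = 400
rng = np.random.default_rng(0)
A = sp.random(n, n, density=0.02, random_state=0, format="csc") \
    + 8.0 * sp.eye(n, format="csc")            # stand-in system matrix
b = rng.standard_normal(n)

# Stand-in for "a few multifrontal levels": an exact factorization of a
# sparsified copy of A, used as preconditioner for the Krylov iteration.
A_approx = A.copy()
A_approx.data[np.abs(A_approx.data) < 0.1] = 0.0    # drop small entries
A_approx.eliminate_zeros()
lu = splu(A_approx)
M = LinearOperator((n, n), matvec=lu.solve)

x, info = gmres(A, b, M=M)
print("info =", info, " residual =", np.linalg.norm(A @ x - b))
```

The closer the preconditioner is to the true factorization (i.e., the more levels the direct phase completes), the fewer iterations the outer Krylov loop needs, which is the trade-off the hybrid solver exploits.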
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU Cofactors
1. Calls to multiple (typically sequential) MUMPS instances for the Schur complements
2. Assembling of the Schur complements
3. Finalizing the MUMPS instances
4. Solve the interface problem
5. Create new MUMPS instances to solve the interior problems
Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problems (idea inspired by work led by Prof. Paszynski)
• Reproduction (or restoration) of the interior matrix equation and interior right-hand side
• Call to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
• Use of Dirichlet conditions for the interface unknowns
• Preliminary tests show that the approach is advantageous in memory (as expected) and also competitive in time
Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Plot: solve time in seconds (0-1000) vs. number of unknowns (1x10^6 to 6x10^6), comparing MULTISOLVER and MUMPS]
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Plot: memory in GB (0-140) vs. number of unknowns (1x10^6 to 6x10^6), comparing MULTISOLVER and MUMPS against the system limit of 128 GB]
HPC Environment: Cluster of Xidian University (XDHPC)

• 140 compute nodes
  - Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
• 56 Gbps InfiniBand network
Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression

• [10 - 16] GHz
• Length: 218 mm
• 324.5 K tetrahedra
• 2.2 M unknowns
• Wall time: 7.3 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression

[Figures]
Scattering Problem: Bistatic RCS of Car

• Bistatic RCS at 1.5 GHz
• Tyres modeled as dielectric (ε_r = 4.0)
• Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car

• 2.7 M tetrahedra
• 17.3 M unknowns
• Wall time: 5.9 min per frequency point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car

[Figure: incident plane wave arriving from behind]

[Figure: incident plane wave arriving from the front]

[Figure: incident plane wave arriving from the front]
Parallel Scalability: Factorization and FE-IIEE Stages

[Left plot: factorization + solving speedup vs. number of cores (120-480) for 12, 6 and 4 threads per process; annotated values 49, 52, 57 and 70]
[Right plot: FE-IIEE speedup vs. number of cores (120-480) for 12, 6 and 4 threads per process; annotated values 78, 85 and 100]

Benchmark: Bistatic RCS of Impala
Parallel Scalability: Whole Code

[Plot: factorization + solving + FE-IIEE speedup vs. number of cores (120-480) for 12, 6 and 4 threads per process; annotated values 62, 64, 70 and 80]
Work in Progress and Future Work
Work in Progress
• Hierarchical basis functions of variable order p
• h-adaptivity ⇒ support for hp meshes

Future Work
• Conformal and non-conformal DDM
• Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
UC3M a young University established in 1989
3 Campuses in Madrid Regionbull Getafe 11km far from capitalbull Leganeacutes 12km far from capitalbull Colmenarejo 45km far from capital 3 Schools ( Bachelor programs ) bull Social and Legal Sciences (GC)bull Humanities Communication and Library
Sciences (GC)bull Polytechnic School (LC)
1 Center for Advanced Studiesbull For Master programs
Colmenarejo
Leganeacutes Getafe
Outline
1 Antecedents
2 Parallel Higher-Order FEM CodeElectromagnetic Modeling FeaturesComputational Features and ImplementationGUI with HPCaaS Interface
3 MUMPS InterfaceGetting Experience with MUMPSPresent Stage of DevelopmentSome Issues with MemorySpecialized Interfaces
4 Applications amp Performance
5 Work in Progress and Future Work
GREMA-UC3M Higher Order Finite Element Code v12
Antecedents
More than 20 years of experience onnumerical methods for EM (mainlyFEM but also others) Contributionson
I Curl-conforming basis functionsI Non-standard mesh truncation
technique (FE-IIEE) for scatteringand radiation problems
I Adaptivity h and hp strategiesI Hybridization with MoM and high
frequency techniques such asPOPTD and GTDUTD
I
Code writing from scratch mainlyduring PhD thesis of DGarcia-DonoroParallel processing (MPI) and HPC inmind
Inclusion of well-provenresearch techniquesdeveloped within the researchgroupldquoReasonablerdquo friendly to beused by non-developers
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationDouble-Curl Vector Wave PDE
Formulation based on double curl vector wave equation (use of E or H)
nablatimes (fminus1r nablatimes V
)minus k2
0 gr V = minusjk0H0P +nablatimes fminus1r Q
Table Formulation magnitudes and parameters
V macrfr macrgr h P L ΓD ΓN
Form E E macrmicror macrεr η J M ΓPEC ΓPMC
Form H H macrεr macrmicror1η M minusJ ΓPMC ΓPEC
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationBoundary Conditions and Excitations
The boundary conditions considered are of Dirichlet Neumann andCauchy types
ntimes V = ΨD over ΓD (1)
ntimes(
macrfrminus1nablatimes V
)= ΨN over ΓN (2)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨC over ΓC (3)
Periodic Boundary Conditions on unit cell (infinite array approach)
Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization
Lumped RLC (resistance coils and capacitors) elements an ports
Impressed electric and magnetic currents plane waves
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationVariational Formulation
Use of H(curl) spaces
H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)
H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method
Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0
c(FV) =
int
Ω
(nablatimes F) middot(
macrfrminus1nablatimes V
)dΩminus k2
0
int
Ω
(F middot macrgr V
)dΩ +
γ
int
ΓC
(ntimes F) middot (ntimes V) dΓC
l(F) = minusjk0h0
int
Ω
F middot P dΩ minusint
ΓN
F middotΨN dΓN minusint
ΓC
F middotΨC dΓC
minusint
Ω
F middotnablatimes(
macrfrminus1
L)
dΩ
GREMA-UC3M Higher Order Finite Element Code v12
FEM DiscretizationHigher-Order Curl-Conforming Element
Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements
4
2
31
13 14
15
16
18
17
19 20
ζ
ξ
η1
2 3
4
7
8
9
10
11
12
6 5
1
2
3
4
5
6
1
2
3
4
56
7
89
10
1112
16
15
17
18
20
19
22
21
2324
25
26
27
28
29
30
31
32
3334
35
36
ζ
ξ
η
14
13
Example 2nd order versions of tetra and prism
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEProblem Set-Up
Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition
ε1 micro1
Medium 1
ε2 micro2
Medium 2
PEC
PMC
PEC
J
nprime
n
M
S prime
Meq
Jeq
S
Exterior Domain(ΩEXT)EincHinc
FEM Domain(ΩFEM)
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEAlgorithm
Local BC for FEM (sparse matrices)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S
Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime
VFE-IIEE =
intintcopy
Sprime(Leq timesnablaG) middot dSprime minus jk0h0
intintcopy
Sprime
(Oeq
(G +
1k2
0nablanablaG
))middot dSprime
nabla times VFE-IIEE = jk0h0
intintcopy
Sprime(Oeq timesnablaG) dSprime minus
intintcopy
Sprime
(Leq
(k2
0 G + nablanablaG))
middot dSprime
ΨSCAT = ntimes(
macrfrminus1nablatimes VFE-IIEE
)+ γ ntimes ntimes VFE-IIEE
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
[Figure: wall time in seconds (0 to 1000) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS]
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison
[Figure: memory in GB (0 to 140) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS, with the system limit (128 GB) marked]
HPC Environment: Cluster of Xidian University (XDHPC)
Cluster of Xidian University: 140 compute nodes
I Two twelve-core Intel Xeon 2690 v2 2.2 GHz CPUs
I 64 GB of RAM
I 18 TB of hard disk
56 Gbps InfiniBand network
Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression
[10 − 16] GHz
Length: 218 mm
3245 K tetrahedra
22 M unknowns
Wall time: 73 min per frequency point
I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013
Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression
Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression
Scattering Problem: Bistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeled as dielectric (εr = 4.0)
Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car
27 M tetrahedra
173 M unknowns
Wall time: 59 min per frequency point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from behind
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from the front
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from the front
Parallel Scalability: Factorization and FE-IIEE Stages
[Figure: factorization + solving speedup vs. number of cores (120 to 480) for 12, 6, and 4 threads per process]
Speedup graph corresponding to the factorization phase
[Figure: FE-IIEE speedup vs. number of cores (120 to 480) for 12, 6, and 4 threads per process]
Speedup graph corresponding to the mesh truncation phase
Benchmark: Bistatic RCS of Impala
Parallel Scalability: Whole Code
[Figure: factorization + solving + FE-IIEE speedup vs. number of cores (120 to 480) for 12, 6, and 4 threads per process]
Speedup graph corresponding to the whole code
Work in Progress and Future Work
Work in Progress
Hierarchical basis functions of variable order p
h-adaptivity ⇒ support for hp meshes
Future Work
Conformal and non-conformal DDM
Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
Outline
1 Antecedents
2 Parallel Higher-Order FEM CodeElectromagnetic Modeling FeaturesComputational Features and ImplementationGUI with HPCaaS Interface
3 MUMPS InterfaceGetting Experience with MUMPSPresent Stage of DevelopmentSome Issues with MemorySpecialized Interfaces
4 Applications amp Performance
5 Work in Progress and Future Work
GREMA-UC3M Higher Order Finite Element Code v12
Antecedents
More than 20 years of experience onnumerical methods for EM (mainlyFEM but also others) Contributionson
I Curl-conforming basis functionsI Non-standard mesh truncation
technique (FE-IIEE) for scatteringand radiation problems
I Adaptivity h and hp strategiesI Hybridization with MoM and high
frequency techniques such asPOPTD and GTDUTD
I
Code writing from scratch mainlyduring PhD thesis of DGarcia-DonoroParallel processing (MPI) and HPC inmind
Inclusion of well-provenresearch techniquesdeveloped within the researchgroupldquoReasonablerdquo friendly to beused by non-developers
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationDouble-Curl Vector Wave PDE
Formulation based on double curl vector wave equation (use of E or H)
nablatimes (fminus1r nablatimes V
)minus k2
0 gr V = minusjk0H0P +nablatimes fminus1r Q
Table Formulation magnitudes and parameters
V macrfr macrgr h P L ΓD ΓN
Form E E macrmicror macrεr η J M ΓPEC ΓPMC
Form H H macrεr macrmicror1η M minusJ ΓPMC ΓPEC
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationBoundary Conditions and Excitations
The boundary conditions considered are of Dirichlet Neumann andCauchy types
ntimes V = ΨD over ΓD (1)
ntimes(
macrfrminus1nablatimes V
)= ΨN over ΓN (2)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨC over ΓC (3)
Periodic Boundary Conditions on unit cell (infinite array approach)
Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization
Lumped RLC (resistance coils and capacitors) elements an ports
Impressed electric and magnetic currents plane waves
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationVariational Formulation
Use of H(curl) spaces
H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)
H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method
Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0
c(FV) =
int
Ω
(nablatimes F) middot(
macrfrminus1nablatimes V
)dΩminus k2
0
int
Ω
(F middot macrgr V
)dΩ +
γ
int
ΓC
(ntimes F) middot (ntimes V) dΓC
l(F) = minusjk0h0
int
Ω
F middot P dΩ minusint
ΓN
F middotΨN dΓN minusint
ΓC
F middotΨC dΓC
minusint
Ω
F middotnablatimes(
macrfrminus1
L)
dΩ
GREMA-UC3M Higher Order Finite Element Code v12
FEM DiscretizationHigher-Order Curl-Conforming Element
Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements
4
2
31
13 14
15
16
18
17
19 20
ζ
ξ
η1
2 3
4
7
8
9
10
11
12
6 5
1
2
3
4
5
6
1
2
3
4
56
7
89
10
1112
16
15
17
18
20
19
22
21
2324
25
26
27
28
29
30
31
32
3334
35
36
ζ
ξ
η
14
13
Example 2nd order versions of tetra and prism
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEProblem Set-Up
Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition
ε1 micro1
Medium 1
ε2 micro2
Medium 2
PEC
PMC
PEC
J
nprime
n
M
S prime
Meq
Jeq
S
Exterior Domain(ΩEXT)EincHinc
FEM Domain(ΩFEM)
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEAlgorithm
Local BC for FEM (sparse matrices)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S
Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime
VFE-IIEE =
intintcopy
Sprime(Leq timesnablaG) middot dSprime minus jk0h0
intintcopy
Sprime
(Oeq
(G +
1k2
0nablanablaG
))middot dSprime
nabla times VFE-IIEE = jk0h0
intintcopy
Sprime(Oeq timesnablaG) dSprime minus
intintcopy
Sprime
(Leq
(k2
0 G + nablanablaG))
middot dSprime
ΨSCAT = ntimes(
macrfrminus1nablatimes VFE-IIEE
)+ γ ntimes ntimes VFE-IIEE
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Antecedents
More than 20 years of experience onnumerical methods for EM (mainlyFEM but also others) Contributionson
I Curl-conforming basis functionsI Non-standard mesh truncation
technique (FE-IIEE) for scatteringand radiation problems
I Adaptivity h and hp strategiesI Hybridization with MoM and high
frequency techniques such asPOPTD and GTDUTD
I
Code writing from scratch mainlyduring PhD thesis of DGarcia-DonoroParallel processing (MPI) and HPC inmind
Inclusion of well-provenresearch techniquesdeveloped within the researchgroupldquoReasonablerdquo friendly to beused by non-developers
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationDouble-Curl Vector Wave PDE
Formulation based on double curl vector wave equation (use of E or H)
nablatimes (fminus1r nablatimes V
)minus k2
0 gr V = minusjk0H0P +nablatimes fminus1r Q
Table Formulation magnitudes and parameters
V macrfr macrgr h P L ΓD ΓN
Form E E macrmicror macrεr η J M ΓPEC ΓPMC
Form H H macrεr macrmicror1η M minusJ ΓPMC ΓPEC
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationBoundary Conditions and Excitations
The boundary conditions considered are of Dirichlet Neumann andCauchy types
ntimes V = ΨD over ΓD (1)
ntimes(
macrfrminus1nablatimes V
)= ΨN over ΓN (2)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨC over ΓC (3)
Periodic Boundary Conditions on unit cell (infinite array approach)
Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization
Lumped RLC (resistance coils and capacitors) elements an ports
Impressed electric and magnetic currents plane waves
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationVariational Formulation
Use of H(curl) spaces
H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)
H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method
Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0
c(FV) =
int
Ω
(nablatimes F) middot(
macrfrminus1nablatimes V
)dΩminus k2
0
int
Ω
(F middot macrgr V
)dΩ +
γ
int
ΓC
(ntimes F) middot (ntimes V) dΓC
l(F) = minusjk0h0
int
Ω
F middot P dΩ minusint
ΓN
F middotΨN dΓN minusint
ΓC
F middotΨC dΓC
minusint
Ω
F middotnablatimes(
macrfrminus1
L)
dΩ
GREMA-UC3M Higher Order Finite Element Code v12
FEM DiscretizationHigher-Order Curl-Conforming Element
Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements
4
2
31
13 14
15
16
18
17
19 20
ζ
ξ
η1
2 3
4
7
8
9
10
11
12
6 5
1
2
3
4
5
6
1
2
3
4
56
7
89
10
1112
16
15
17
18
20
19
22
21
2324
25
26
27
28
29
30
31
32
3334
35
36
ζ
ξ
η
14
13
Example 2nd order versions of tetra and prism
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEProblem Set-Up
Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition
ε1 micro1
Medium 1
ε2 micro2
Medium 2
PEC
PMC
PEC
J
nprime
n
M
S prime
Meq
Jeq
S
Exterior Domain(ΩEXT)EincHinc
FEM Domain(ΩFEM)
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEAlgorithm
Local BC for FEM (sparse matrices)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S
Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime
VFE-IIEE =
intintcopy
Sprime(Leq timesnablaG) middot dSprime minus jk0h0
intintcopy
Sprime
(Oeq
(G +
1k2
0nablanablaG
))middot dSprime
nabla times VFE-IIEE = jk0h0
intintcopy
Sprime(Oeq timesnablaG) dSprime minus
intintcopy
Sprime
(Leq
(k2
0 G + nablanablaG))
middot dSprime
ΨSCAT = ntimes(
macrfrminus1nablatimes VFE-IIEE
)+ γ ntimes ntimes VFE-IIEE
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont): Repetitive Solver
Features and Remarks
Advantages: savings in time and memory
Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a certain level of the tree are identical
  Further saving in time
  Large saving in memory
Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree
Or "algebraic symmetry" can be explored
Specialized MUMPS Interfaces (cont): Repetitive Solver
[Figure]
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver
Multifrontal algorithm on only a few levels
Iterative solution from the last level of the multifrontal algorithm
It can be understood as the direct solver acting as preconditioner of the iterative solver
Natural approach to some DDM strategies
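The idea of a partial direct factorization acting as preconditioner can be sketched with subdomain-block factorizations inside a conjugate-gradient loop. A Python/NumPy stand-in (the 1D Laplacian, the two-block split, and the tolerances are illustrative, not the FEM code):

```python
import numpy as np

def pcg(A, b, apply_Minv, tol=1e-10, maxit=200):
    """Preconditioned conjugate gradient; apply_Minv plays the role of
    the partial direct factorization acting as preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# SPD test matrix: 1D Laplacian, split into two "subdomains".
n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
half = n // 2
# Direct factorizations of the diagonal (subdomain) blocks only -- the
# "few levels" of the elimination tree; the coupling is handled iteratively.
blocks = [np.linalg.inv(A[:half, :half]), np.linalg.inv(A[half:, half:])]
def apply_Minv(r):
    return np.concatenate([blocks[0] @ r[:half], blocks[1] @ r[half:]])

b = np.ones(n)
x = pcg(A, b, apply_Minv)
print(np.allclose(A @ x, b, atol=1e-6))
```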
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU Cofactors
Calls to multiple (typically sequential) MUMPS instances for Schur complements
Assembling Schur complements
Finalizing MUMPS instances
Solve interface problem
Create new MUMPS instances to solve the interior problems
Specialized MUMPS Interfaces (cont): Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problems
Idea inspired by work led by Prof. Paszynski
Reproduction (or restoration) of the interior matrix equation and interior right-hand side
Call to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
Use of Dirichlet conditions for the interface unknowns
Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time
Regular MUMPS & MULTISOLVER: Time and Memory Comparison
[Figure: time in seconds (0-1000) vs number of unknowns (1e6 to 6e6); curves for MULTISOLVER and MUMPS]
Regular MUMPS & MULTISOLVER (cont): Time and Memory Comparison
[Figure: memory in GB (0-140) vs number of unknowns (1e6 to 6e6); curves for MULTISOLVER and MUMPS; system limit at 128 GB]
HPC Environment: Cluster of Xidian University (XDHPC)
140 compute nodes
  Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  64 GB of RAM
  1.8 TB of hard disk
56 Gbps InfiniBand network
Waveguide Problem: Low Pass Filter with Higher-Order Mode Suppression
[10-16] GHz
Length: 218 mm
324.5 K tetrahedra
2.2 M unknowns
Wall time: 7.3 min per freq point
  I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013
Waveguide Problem (cont): Low Pass Filter with Higher-Order Mode Suppression
[Figure]
[Figure]
Scattering Problem: Bistatic RCS of Car
Bistatic RCS at 1.5 GHz
Tyres modelled as dielectric (εr = 4.0)
Several incident plane waves from different directions
Scattering Problem (cont): Bistatic RCS of Car
2.7 M tetrahedra
17.3 M unknowns
Wall time: 59 min per freq point (46 compute nodes)
Scattering Problem (cont): Bistatic RCS of Car
[Figure: incident plane wave arriving from behind]
[Figure: incident plane wave arriving from the front]
Parallel Scalability: Factorization and FE-IIEE Stages
[Figure: factorization + solving speedup vs number of cores (120-480), curves for 12, 6, and 4 threads per process]
[Figure: FE-IIEE (mesh truncation) speedup vs number of cores (120-480), curves for 12, 6, and 4 threads per process]
Benchmark: Bistatic RCS of Impala
Parallel Scalability: Whole Code
[Figure: factorization + solving + FE-IIEE speedup vs number of cores (120-480), curves for 12, 6, and 4 threads per process; speedup graph corresponding to the whole code]
Work in Progress and Future Work
Work in Progress
Hierarchical basis functions of variable order p
h-adaptivity ⇒ support for hp meshes
Future Work
Conformal and non-conformal DDM
Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
FEM Formulation: Double-Curl Vector Wave PDE
Formulation based on the double-curl vector wave equation (use of E or H):

\nabla \times (\bar{f}_r^{-1} \, \nabla \times \mathbf{V}) - k_0^2 \, \bar{g}_r \, \mathbf{V} = -j k_0 h_0 \, \mathbf{P} - \nabla \times (\bar{f}_r^{-1} \, \mathbf{L})

Table: Formulation magnitudes and parameters

         V    f_r    g_r    h      P    L     Γ_D      Γ_N
Form E:  E    μ_r    ε_r    η      J    M     Γ_PEC    Γ_PMC
Form H:  H    ε_r    μ_r    1/η    M    -J    Γ_PMC    Γ_PEC
FEM Formulation: Boundary Conditions and Excitations
The boundary conditions considered are of Dirichlet, Neumann, and Cauchy types:

\hat{n} \times \mathbf{V} = \Psi_D \quad \text{over } \Gamma_D    (1)
\hat{n} \times (\bar{f}_r^{-1} \nabla \times \mathbf{V}) = \Psi_N \quad \text{over } \Gamma_N    (2)
\hat{n} \times (\bar{f}_r^{-1} \nabla \times \mathbf{V}) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V} = \Psi_C \quad \text{over } \Gamma_C    (3)

Periodic boundary conditions on unit cell (infinite array approach)
Analytic boundary conditions for waveports of common waveguides, and also numerical waveports for arbitrary waveguides by means of 2D eigenvalue/eigenmode characterization
Lumped RLC (resistors, coils, and capacitors) elements and ports
Impressed electric and magnetic currents, plane waves
FEM Formulation: Variational Formulation
Use of H(curl) spaces

H(curl)_0 = \{ \mathbf{W} \in H(curl) : \hat{n} \times \mathbf{W} = 0 \text{ on } \Gamma_D \}    (4)
H(curl) = \{ \mathbf{W} \in L^2 : \nabla \times \mathbf{W} \in L^2 \}    (5)

and Galerkin method:

Find \mathbf{V} \in H(curl) such that c(\mathbf{F}, \mathbf{V}) = l(\mathbf{F}) \quad \forall \mathbf{F} \in H(curl)_0

c(\mathbf{F}, \mathbf{V}) = \int_\Omega (\nabla \times \mathbf{F}) \cdot (\bar{f}_r^{-1} \nabla \times \mathbf{V}) \, d\Omega - k_0^2 \int_\Omega (\mathbf{F} \cdot \bar{g}_r \mathbf{V}) \, d\Omega + \gamma \int_{\Gamma_C} (\hat{n} \times \mathbf{F}) \cdot (\hat{n} \times \mathbf{V}) \, d\Gamma_C

l(\mathbf{F}) = -j k_0 h_0 \int_\Omega \mathbf{F} \cdot \mathbf{P} \, d\Omega - \int_{\Gamma_N} \mathbf{F} \cdot \Psi_N \, d\Gamma_N - \int_{\Gamma_C} \mathbf{F} \cdot \Psi_C \, d\Gamma_C - \int_\Omega \mathbf{F} \cdot \nabla \times (\bar{f}_r^{-1} \mathbf{L}) \, d\Omega
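The Galerkin recipe above (find V such that c(F, V) = l(F) for every test function F) can be illustrated on a scalar 1D analogue. A Python/NumPy sketch of P1 Galerkin assembly for -u'' = 1 with homogeneous Dirichlet conditions (the model problem and element choice are illustrative stand-ins, not the vector curl-conforming code):

```python
import numpy as np

def galerkin_1d(n):
    """P1 Galerkin solve of -u'' = 1 on (0,1), u(0)=u(1)=0.
    Scalar stand-in for: find V with c(F,V) = l(F) for all test F."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    K = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for e in range(n):                                   # element-by-element assembly
        Ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h    # stiffness: int F' u' dx
        be = np.array([h / 2, h / 2])                    # load: int F * 1 dx
        idx = [e, e + 1]
        K[np.ix_(idx, idx)] += Ke
        b[idx] += be
    # essential (Dirichlet) conditions: restrict to the interior test/trial space
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(K[1:-1, 1:-1], b[1:-1])
    return x, u

x, u = galerkin_1d(16)
exact = x * (1 - x) / 2              # analytic solution of -u'' = 1
print(np.max(np.abs(u - exact)))     # ~machine precision: nodally exact for this load
```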
FEM Discretization: Higher-Order Curl-Conforming Element
Own family of higher-order isoparametric curl-conforming finite elements (tetrahedron, prism, and hexahedron, the latter under test)
Rigorous implementation of Nédélec's mixed-order elements
[Figure: 2nd-order versions of the tetrahedron and prism, with local numbering of the degrees of freedom]
Mesh Truncation with FE-IIEE: Problem Set-Up
Open-region problems (optionally) by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) ⇒ asymptotically exact absorbing boundary condition
[Figure: FEM domain Ω_FEM bounded by S, with equivalent currents J_eq, M_eq on S'; exterior domain Ω_EXT with incident field E_inc, H_inc; media (ε1, μ1) and (ε2, μ2), PEC and PMC objects, impressed currents J and M, normals n and n']
Mesh Truncation with FE-IIEE: Algorithm
Local BC for FEM (sparse matrices):

\hat{n} \times (\bar{f}_r^{-1} \nabla \times \mathbf{V}) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V} = \Psi^{INC} + \Psi^{SCAT} \quad \text{over } S

Iterative estimation of \Psi^{SCAT} by the exterior Equivalence Principle on S':

\mathbf{V}^{FE-IIEE} = \oint_{S'} (\mathbf{L}_{eq} \times \nabla G) \cdot d\mathbf{S}' - j k_0 h_0 \oint_{S'} \mathbf{O}_{eq} \left( G + \frac{1}{k_0^2} \nabla \nabla G \right) \cdot d\mathbf{S}'

\nabla \times \mathbf{V}^{FE-IIEE} = j k_0 h_0 \oint_{S'} (\mathbf{O}_{eq} \times \nabla G) \, dS' - \oint_{S'} \mathbf{L}_{eq} \left( k_0^2 G + \nabla \nabla G \right) \cdot d\mathbf{S}'

\Psi^{SCAT} = \hat{n} \times (\bar{f}_r^{-1} \nabla \times \mathbf{V}^{FE-IIEE}) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V}^{FE-IIEE}
Mesh Truncation with FE-IIEE: Flow Chart
[Flow chart: the FEM code for non-open problems (computation and assembly of element matrices, imposition of BC not related to S, sparse solver, postprocess) is wrapped in a loop: an initial BC Ψ(0)(r) is set on S; from each solution, V(r ∈ Γ_S) and ∇×V(r ∈ Γ_S) yield the equivalent currents J_eq(i)(r') and M_eq(i)(r'), which upgrade the BC on S to Ψ(i+1)(r)]
Computational Features
Code written using modern Fortran constructs (F2003)
Strong emphasis on code maintainability by use of OOP (Object-Oriented Programming) paradigms
Numerical verification by use of the Method of Manufactured Solutions
Numerical validation by tests with EM benchmark problems
Hybrid MPI+OpenMP programming
Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...)
Graphical User Interface (GUI) with HPCaaS interface
Linux & Windows versions
Computational Features: Towards HPC
"Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
Global mesh object → local mesh objects on each processor
Specialized direct solver interfaces
Problems of several tens of millions of unknowns on more than one thousand cores
Parallel Flow Chart of the Code
[Flow chart: all processes (p0...pn) read the input data and create the global mesh object; after numbering of DOFs and MPI division, p1...pn deallocate the global mesh object while p0 initializes the solver; all processes create their local mesh object (and extract the boundary mesh when FE-IIEE is enabled), calculate and fill the LHS matrix, and perform the matrix factorization; p0 calculates and fills the RHS matrix; all processes solve the system of equations; with FE-IIEE enabled, the scattering field is calculated and p0 updates the global RHS until iterative convergence is achieved; the loop repeats while more frequencies remain, followed by postprocessing]
Graphical User Interface: Features
GUI based on a general-purpose pre- and post-processor called GiD: http://gid.cimne.upc.es
Creation (or importation) of the geometry model of the problem
Mesh generation
Assignment of material properties and boundary conditions
Visualization of results
Integration with Posidonia (in-house HPCaaS)
Posidonia: In-house HPCaaS Solution
Easing the use of HPC platforms
Remote job submission to HPC infrastructures
Designed with security, user-friendliness, collaborative computing, and mobility in mind
Management of all the communication with the remote computer system (file transfers, ...)
Interaction with its batch system (job scheduler)
History repository of simulations
Notification when a submitted job is completed
Transparent downloading of the results to visualize them locally
Posidonia also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)
  A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, 61:166-177, Dec. 2015
GUI with HPCaaS Interface: Screenshot
[Figure: screenshot]
Getting Experience with MUMPS: From "MUMPS do it all" to ...
Matrix format:
Elemental
Assembled (centralized on process 0)
Assembled (distributed among processes)
Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)
RHS and solution:
Dense RHS
Sparse RHS (large number of RHS vectors)
Centralized solution
Distributed solution (waiting for distributed RHS feature, ...)
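The memory argument for the sparse RHS format can be made concrete. A Python/NumPy sketch of a compressed-column multi-RHS block (the sizes and array names are made up for illustration, not MUMPS's interface arrays):

```python
import numpy as np

# Toy sparse multi-RHS block in compressed-column form (values + row
# indices + column pointers), for the case where only a few entries of
# each excitation vector are nonzero (e.g. one port or one source footprint).
n, nrhs = 1_000_000, 16          # unknowns, RHS vectors in the block
nnz_per_col = 200                # nonzero entries per excitation

rhs_val = np.ones(nrhs * nnz_per_col, dtype=np.complex128)          # nonzero values
rhs_row = np.tile(np.arange(nnz_per_col), nrhs).astype(np.int32)    # their row indices
col_ptr = np.arange(0, nrhs * nnz_per_col + 1, nnz_per_col,
                    dtype=np.int32)                                 # column starts

dense_bytes = n * nrhs * 16                                  # complex double, dense
sparse_bytes = rhs_val.nbytes + rhs_row.nbytes + col_ptr.nbytes
print(dense_bytes // sparse_bytes)                           # ~4000x smaller here
```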
MUMPS Interface: Present Stage of Development (version 5.0.2)
MUMPS initialization
Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
Computation of FEM matrix coefficients associated with each local process
Input of matrix coefficients to MUMPS in distributed assembled format
Call to MUMPS for matrix factorization
Computation of FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
Call to MUMPS for system solve
(FE-IIEE enabled) Iterative update of RHS and system solve until the error criterion is satisfied
MUMPS finalization
* Frequent use of out-of-core (OOC) capabilities of MUMPS
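The factorize-once / solve-many pattern of the FE-IIEE loop above can be sketched as follows. A Python/NumPy stand-in with a hand-rolled LU (the RHS update is a placeholder contraction, not the real integral-equation evaluation; MUMPS would supply the factorization and solve phases):

```python
import numpy as np

def lu_factor(A):
    """Plain LU without pivoting (stand-in for the MUMPS factorization,
    performed once and reused across FE-IIEE iterations)."""
    LU = A.astype(float).copy()
    n = LU.shape[0]
    for k in range(n - 1):
        LU[k+1:, k] /= LU[k, k]
        LU[k+1:, k+1:] -= np.outer(LU[k+1:, k], LU[k, k+1:])
    return LU

def lu_solve(LU, b):
    n = LU.shape[0]
    y = b.astype(float).copy()
    for k in range(n):                 # forward substitution, unit lower L
        y[k] -= LU[k, :k] @ y[:k]
    x = y
    for k in range(n - 1, -1, -1):     # back substitution, upper U
        x[k] = (x[k] - LU[k, k+1:] @ x[k+1:]) / LU[k, k]
    return x

rng = np.random.default_rng(2)
n = 30
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant: no pivoting needed
LU = lu_factor(A)                                 # factorized once

b0 = rng.standard_normal(n)
b = b0.copy()
x_old = np.zeros(n)
for it in range(50):
    x = lu_solve(LU, b)                 # solve phase only; no refactorization
    if np.linalg.norm(x - x_old) < 1e-10 * np.linalg.norm(x):
        break                           # error criterion satisfied
    x_old = x
    b = b0 + 0.1 * x                    # placeholder RHS update (illustrative only)
print(np.allclose(A @ x, b))  # True
```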
Some Issues with Memory: Memory Allocation inside MUMPS
Memory Issue
A peak memory use during the analysis phase has been detected (distributed assembled)
Found to be due to memory allocation inside MUMPS routines, related to the maximum of MAXS among processors

Listing 1: file zana_aux_par.F
1589  SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647    MAXS = ordLAST(I)-ordFIRST(I)+1
1653    ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864  END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont): Memory Allocation inside MUMPS
Memory Issue
Example: 45,000,000-dof problem using 192 processes and 4 bytes per integer
MAXS bandwidth is 45,000,000 ⇒ 34.56 GB memory per process
Workaround
Matrix partition based on rows instead of elements of the mesh
  Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong ...
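The per-process figure quoted above follows directly from the dimensions of the SIPES allocation in Listing 1. A quick check in Python:

```python
# Size of the SIPES(max(1,MAXS), NPROCS) work array in the slide's example:
# MAXS of 45,000,000 on one process, 192 MPI processes, 4-byte integers.
maxs = 45_000_000
nprocs = 192
bytes_per_int = 4

gb = maxs * nprocs * bytes_per_int / 1e9   # decimal GB, per process
print(gb)  # 34.56
```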
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
FEM FormulationBoundary Conditions and Excitations
The boundary conditions considered are of Dirichlet Neumann andCauchy types
ntimes V = ΨD over ΓD (1)
ntimes(
macrfrminus1nablatimes V
)= ΨN over ΓN (2)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨC over ΓC (3)
Periodic Boundary Conditions on unit cell (infinite array approach)
Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization
Lumped RLC (resistance coils and capacitors) elements an ports
Impressed electric and magnetic currents plane waves
GREMA-UC3M Higher Order Finite Element Code v12
FEM FormulationVariational Formulation
Use of H(curl) spaces
H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)
H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method
Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0
c(FV) =
int
Ω
(nablatimes F) middot(
macrfrminus1nablatimes V
)dΩminus k2
0
int
Ω
(F middot macrgr V
)dΩ +
γ
int
ΓC
(ntimes F) middot (ntimes V) dΓC
l(F) = minusjk0h0
int
Ω
F middot P dΩ minusint
ΓN
F middotΨN dΓN minusint
ΓC
F middotΨC dΓC
minusint
Ω
F middotnablatimes(
macrfrminus1
L)
dΩ
GREMA-UC3M Higher Order Finite Element Code v12
FEM DiscretizationHigher-Order Curl-Conforming Element
Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements
4
2
31
13 14
15
16
18
17
19 20
ζ
ξ
η1
2 3
4
7
8
9
10
11
12
6 5
1
2
3
4
5
6
1
2
3
4
56
7
89
10
1112
16
15
17
18
20
19
22
21
2324
25
26
27
28
29
30
31
32
3334
35
36
ζ
ξ
η
14
13
Example 2nd order versions of tetra and prism
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEProblem Set-Up
Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition
ε1 micro1
Medium 1
ε2 micro2
Medium 2
PEC
PMC
PEC
J
nprime
n
M
S prime
Meq
Jeq
S
Exterior Domain(ΩEXT)EincHinc
FEM Domain(ΩFEM)
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEAlgorithm
Local BC for FEM (sparse matrices)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S
Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime
VFE-IIEE =
intintcopy
Sprime(Leq timesnablaG) middot dSprime minus jk0h0
intintcopy
Sprime
(Oeq
(G +
1k2
0nablanablaG
))middot dSprime
nabla times VFE-IIEE = jk0h0
intintcopy
Sprime(Oeq timesnablaG) dSprime minus
intintcopy
Sprime
(Leq
(k2
0 G + nablanablaG))
middot dSprime
ΨSCAT = ntimes(
macrfrminus1nablatimes VFE-IIEE
)+ γ ntimes ntimes VFE-IIEE
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels.
- Iterative solution from the last level of the multifrontal algorithm.
- It can be understood as the direct solver acting as a preconditioner for the iterative solver.
- A natural approach to some DDM strategies.
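The "direct solver as preconditioner" idea can be illustrated with a toy system (pure Python, made-up 1-D Laplacian data, not the production solver): the "few levels" of exact factorization become a block-diagonal direct solve, and the remaining coupling between blocks is resolved by a preconditioned Richardson iteration.

```python
# Hybrid direct + iterative sketch: solve A x = b where M = block-diagonal
# part of A is solved EXACTLY (the "direct" part), and the block coupling
# is handled iteratively: x <- x + M^{-1} (b - A x).
n, nb = 8, 4                      # problem size, direct-solver block size
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = [1.0] * n

def gauss(Ab, rb):
    """Dense direct solve (Gaussian elimination with partial pivoting)."""
    m = len(rb)
    M = [Ab[i][:] + [rb[i]] for i in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            fac = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (M[r][m] - sum(M[r][k] * x[k] for k in range(r + 1, m))) / M[r][r]
    return x

x = [0.0] * n
for _ in range(200):
    resid = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    if max(abs(r) for r in resid) < 1e-12:
        break
    # Preconditioner application: exact (direct) solve on each diagonal block.
    for s in range(0, n, nb):
        blk = [row[s:s + nb] for row in A[s:s + nb]]
        z = gauss(blk, resid[s:s + nb])
        for k in range(nb):
            x[s + k] += z[k]
```

In the real code the role of `gauss` on a block is played by a partial multifrontal factorization; the design choice is the usual one: the more levels handled directly, the stronger (and more memory-hungry) the preconditioner.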
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

- Calls to multiple (typically sequential) MUMPS instances for the Schur complements.
- Assembling the Schur complements.
- Finalizing the MUMPS instances.
- Solving the interface problem.
- Creating new MUMPS instances to solve the interior problems.
Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side.
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems.
- Use of Dirichlet conditions for the interface unknowns.
- Preliminary tests show that the approach is worthwhile in memory (as expected) while remaining competitive in time.
Regular MUMPS & MULTISOLVER: Time and Memory Comparison

Figure: Solution time in seconds versus number of unknowns (1×10^6 to 6×10^6; y-axis 0–1000 s) for MULTISOLVER and regular MUMPS.
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

Figure: Memory in GB versus number of unknowns (1×10^6 to 6×10^6; y-axis 0–140 GB) for MULTISOLVER and MUMPS, together with the system limit (128 GB).
HPC Environment: Cluster of Xidian University (XDHPC)

- 140 compute nodes, each with:
  - two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs;
  - 64 GB of RAM;
  - 18 TB of hard disk.
- 56 Gbps InfiniBand network.
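For reference, the aggregate capacity implied by these figures (simple arithmetic, not stated on the slide) is what backs the "more than one thousand cores" runs mentioned earlier:

```python
# Aggregate capacity of the XDHPC cluster from the per-node figures above.
nodes, sockets_per_node, cores_per_cpu = 140, 2, 12
ram_per_node_gb = 64

total_cores = nodes * sockets_per_node * cores_per_cpu  # all cores in the machine
total_ram_gb = nodes * ram_per_node_gb                  # all RAM in the machine
```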
Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression

- Band: [10–16] GHz.
- Length: 218 mm.
- 324.5 K tetrahedra.
- 2.2 M unknowns.
- Wall time: 73 min per frequency point.

Reference: I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Scattering Problem: Bistatic RCS of a Car

- Bistatic RCS at 1.5 GHz.
- Tyres modeled as dielectric (εr = 4.0).
- Several incident plane waves from different directions.
Scattering Problem (cont.): Bistatic RCS of a Car

- 2.7 M tetrahedra.
- 17.3 M unknowns.
- Wall time: 59 min per frequency point (46 compute nodes).
Scattering Problem (cont.): Bistatic RCS of a Car

Figures: RCS results for an incident plane wave arriving from behind, and for an incident plane wave arriving from the front.
Parallel Scalability: Factorization and FE-IIEE Stages

Figure: Speedup of the factorization + solving phase versus number of cores (120–480), for 12, 6, and 4 threads per process.
Figure: Speedup of the FE-IIEE (mesh truncation) phase versus number of cores (120–480), for 12, 6, and 4 threads per process.

Benchmark: Bistatic RCS of Impala.
Parallel Scalability: Whole Code

Figure: Speedup of the whole code (factorization + solving + FE-IIEE) versus number of cores (120–480), for 12, 6, and 4 threads per process.
Work in Progress and Future Work

Work in Progress:
- Hierarchical basis functions of variable order p.
- h-adaptivity ⇒ support for hp meshes.

Future Work:
- Conformal and non-conformal DDM.
- Hybrid (direct + iterative) solver.
Thanks for your attention
Thanks to the MUMPS team
FEM Formulation: Variational Formulation

Use of H(curl) spaces,

\[
\mathbf{H}_0(\mathrm{curl}) = \{\, \mathbf{W} \in \mathbf{H}(\mathrm{curl}) : \hat{\mathbf{n}}\times\mathbf{W} = \mathbf{0} \ \text{on}\ \Gamma_D \,\} \tag{4}
\]
\[
\mathbf{H}(\mathrm{curl}) = \{\, \mathbf{W} \in (L^2)^3 : \nabla\times\mathbf{W} \in (L^2)^3 \,\} \tag{5}
\]

and the Galerkin method: find \(\mathbf{V} \in \mathbf{H}(\mathrm{curl})\) such that \(c(\mathbf{F},\mathbf{V}) = l(\mathbf{F})\ \ \forall\, \mathbf{F} \in \mathbf{H}_0(\mathrm{curl})\), with

\[
c(\mathbf{F},\mathbf{V}) = \int_{\Omega} (\nabla\times\mathbf{F})\cdot\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right) d\Omega
\;-\; k_0^2 \int_{\Omega} \mathbf{F}\cdot\left(\bar{\bar{g}}_r\,\mathbf{V}\right) d\Omega
\;+\; \gamma \int_{\Gamma_C} (\hat{\mathbf{n}}\times\mathbf{F})\cdot(\hat{\mathbf{n}}\times\mathbf{V})\, d\Gamma_C
\]
\[
l(\mathbf{F}) = -j k_0 \eta_0 \int_{\Omega} \mathbf{F}\cdot\mathbf{P}\, d\Omega
\;-\; \int_{\Gamma_N} \mathbf{F}\cdot\boldsymbol{\Psi}_N\, d\Gamma_N
\;-\; \int_{\Gamma_C} \mathbf{F}\cdot\boldsymbol{\Psi}_C\, d\Gamma_C
\;-\; \int_{\Omega} \mathbf{F}\cdot\nabla\times\left(\bar{\bar{f}}_r^{-1}\,\mathbf{L}\right) d\Omega
\]
FEM Discretization: Higher-Order Curl-Conforming Elements

- Own family of higher-order isoparametric curl-conforming finite elements (tetrahedron, prism, hexahedron — under test).
- Rigorous implementation of Nédélec's mixed-order elements.

Figure: Example — 2nd-order versions of the tetrahedron (20 DOFs) and the prism (36 DOFs), shown as reference elements with local (ξ, η, ζ) coordinates and DOF numbering.
Mesh Truncation with FE-IIEE: Problem Set-Up

Open-region problems are handled (optionally) by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) ⇒ an asymptotically exact absorbing boundary condition.

Figure: Problem set-up. The FEM domain Ω_FEM — containing media (ε1, µ1) and (ε2, µ2), PEC/PMC objects, and the sources J, M — is truncated by the boundary S. Equivalent currents J_eq, M_eq on the interior surface S′ (with normals n, n′) radiate into the exterior domain Ω_EXT, where the incident fields E^inc, H^inc are defined.
Mesh Truncation with FE-IIEE: Algorithm

Local boundary condition for the FEM (sparse matrices):

\[
\hat{\mathbf{n}}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right) + \gamma\, \hat{\mathbf{n}}\times\hat{\mathbf{n}}\times\mathbf{V} = \boldsymbol{\Psi}^{\mathrm{INC}} + \boldsymbol{\Psi}^{\mathrm{SCAT}} \quad \text{over } S
\]

Iterative estimation of \(\boldsymbol{\Psi}^{\mathrm{SCAT}}\) by the exterior equivalence principle on \(S'\):

\[
\mathbf{V}^{\mathrm{FE\text{-}IIEE}} = \oint_{S'} \left(\mathbf{L}_{\mathrm{eq}} \times \nabla G\right)\cdot dS'
\;-\; j k_0 \eta_0 \oint_{S'} \mathbf{O}_{\mathrm{eq}}\left(G + \frac{1}{k_0^2}\,\nabla\nabla G\right)\cdot dS'
\]
\[
\nabla\times\mathbf{V}^{\mathrm{FE\text{-}IIEE}} = j k_0 \eta_0 \oint_{S'} \left(\mathbf{O}_{\mathrm{eq}} \times \nabla G\right) dS'
\;-\; \oint_{S'} \mathbf{L}_{\mathrm{eq}}\left(k_0^2\, G + \nabla\nabla G\right)\cdot dS'
\]
\[
\boldsymbol{\Psi}^{\mathrm{SCAT}} = \hat{\mathbf{n}}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}^{\mathrm{FE\text{-}IIEE}}\right) + \gamma\,\hat{\mathbf{n}}\times\hat{\mathbf{n}}\times\mathbf{V}^{\mathrm{FE\text{-}IIEE}}
\]
Mesh Truncation with FE-IIEE: Flow Chart

The FE-IIEE loop wraps a standard FEM code for non-open problems. Starting from an initial boundary condition Ψ^(0)(r) on S, the element matrices are computed and assembled, the boundary conditions not related to S are imposed, and the sparse system is solved. From the solution (V and ∇×V on Γ_S), the equivalent currents J_eq^(i)(r′) and M_eq^(i)(r′) are evaluated and used to upgrade the boundary condition on S to Ψ^(i+1)(r); the loop repeats until convergence, followed by postprocessing.
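Structurally, the loop is a fixed-point iteration on the boundary condition. A minimal sketch (pure Python; `solve_fem`, `update_bc`, and the scalar stand-in for Ψ are hypothetical placeholders for the sparse FEM solve and the integral-equation evaluation):

```python
# Structural sketch of the FE-IIEE iteration: solve the FEM problem with the
# current boundary condition on S, re-radiate the equivalent currents from S'
# to obtain an updated condition, and repeat until it stops changing.
def fe_iiee_like_iteration(solve_fem, update_bc, psi0=0.0,
                           tol=1e-10, max_iter=100):
    psi = psi0
    for i in range(max_iter):
        v = solve_fem(psi)        # stands in for the sparse FEM solve
        psi_new = update_bc(v)    # stands in for J_eq, M_eq -> new Psi on S
        if abs(psi_new - psi) < tol:
            return psi_new, i + 1
        psi = psi_new
    return psi, max_iter

# Toy contraction standing in for the physical operators (fixed point at 2.0):
sol, iters = fe_iiee_like_iteration(lambda p: 0.5 * p + 1.0, lambda v: v)
```

The convergence of the real scheme rests on the exterior radiation operator being (in practice) a contraction on Ψ; the toy operator above merely mimics that property.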
Computational Features

- Code written using modern Fortran constructs (F2003).
- Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms.
- Numerical verification by the Method of Manufactured Solutions.
- Numerical validation by tests with EM benchmark problems.
- Hybrid MPI+OpenMP programming.
- Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...).
- Graphical User Interface (GUI) with HPCaaS interface.
- Linux & Windows versions.
Computational Features: Towards HPC

- "Rethink" some of the OOP constructs (e.g., arrays of small derived types, ...).
- Global mesh object → local mesh objects on each processor.
- Specialized direct solver interfaces.
- Problems of several tens of millions of unknowns on more than one thousand cores.
Parallel Flow Chart of the Code

All processes (p0, p1, p2, ..., pn) read the input data and create the global mesh object. After the numbering of DOFs, processes p1...pn deallocate the global mesh object, every process creates its local mesh object, and — if FE-IIEE is enabled — extracts the boundary mesh. Process p0 initializes the solver; all processes calculate and fill their part of the LHS matrix and perform the matrix factorization in parallel. Process p0 calculates and fills the RHS matrix, and the system of equations is solved by all processes. With FE-IIEE enabled, the scattered field is calculated, p0 updates the global RHS, and the system is solved again until the convergence/iteration criterion is achieved. The cycle repeats for any remaining frequencies, followed by postprocessing.
Graphical User Interface: Features

- GUI based on GiD, a general-purpose pre- and post-processor (http://gid.cimne.upc.es).
- Creation (or import) of the geometry model of the problem.
- Mesh generation.
- Assignment of material properties and boundary conditions.
- Visualization of results.
- Integration with Posidonia (in-house HPCaaS).
Posidonia: In-house HPCaaS Solution

- Eases the use of HPC platforms: remote job submission to HPC infrastructures.
- Designed with security, user-friendliness, collaborative computing, and mobility in mind.
- Manages all the communication with the remote computer system (file transfers, ...) and the interaction with its batch system (job scheduler).
- History repository of simulations; notification when a submitted job is completed.
- Transparent downloading of the results to visualize them locally.
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications).

Reference: A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, no. 6, pp. 166–177, Dec. 2015.
GUI with HPCaaS Interface: Screenshot

Figure: Screenshot of the GUI with the HPCaaS interface.
Getting Experience with MUMPS: From "MUMPS, do it all" to ...

Matrix format:
- Elemental.
- Assembled (centralized on process 0).
- Assembled (distributed among processes).
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS).

RHS and solution:
- Dense RHS.
- Sparse RHS (large number of RHS vectors).
- Centralized solution.
- Distributed solution (waiting for the distributed RHS feature ...).
MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization.
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors.
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides).
- Computation of the FEM matrix coefficients associated with each local process.
- Input of the matrix coefficients to MUMPS in distributed assembled format.
- Call to MUMPS for the matrix factorization.
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10–20 vectors) in sparse format.
- Call to MUMPS for the system solve.
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied.
- MUMPS finalization.

(*) Frequent use of the out-of-core (OOC) capabilities of MUMPS.
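The "factorize once, solve in blocks" pattern of this workflow can be sketched with a toy dense system (pure Python; the tiny matrix, the excitation vectors, and the plain LU stand in for the MUMPS factorization and solve phases — none of this is the actual MUMPS API):

```python
# Sketch of block-wise RHS handling: the matrix is factorized ONCE and the
# many excitation vectors are streamed through the solve phase in blocks,
# so only one block of solution vectors is held in memory at a time.
n, n_rhs, block = 4, 50, 16

A = [[3.0 if i == j else 1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
rhs = [[float(i == k % n) for i in range(n)] for k in range(n_rhs)]

# "Factorization phase": in-place LU without pivoting (fine for this
# diagonally dominant toy matrix) -- done once.
LU = [row[:] for row in A]
for c in range(n):
    for r in range(c + 1, n):
        LU[r][c] /= LU[c][c]
        for k in range(c + 1, n):
            LU[r][k] -= LU[r][c] * LU[c][k]

def lu_solve(v):
    """Reuse the stored factors: forward (L) then backward (U) substitution."""
    y = v[:]
    for r in range(n):
        for k in range(r):
            y[r] -= LU[r][k] * y[k]
    for r in range(n - 1, -1, -1):
        for k in range(r + 1, n):
            y[r] -= LU[r][k] * y[k]
        y[r] /= LU[r][r]
    return y

# "Solve phase" in blocks: at most `block` solution vectors live at a time.
solutions = []
for s in range(0, n_rhs, block):
    sol_block = [lu_solve(v) for v in rhs[s:s + block]]
    solutions.extend(sol_block)   # in the real code: post-process and flush
```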
Some Issues with Memory: Memory Allocation inside MUMPS

Memory Issue:
- A peak in memory use during the analysis phase has been detected (distributed assembled input).
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors.

Listing 1: file zana_aux_par.F

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647   MAXS = ordLAST(I)-ordFIRST(I)+1
1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory Issue — example: a 45,000,000-DOF problem using 192 processes and 4 bytes per integer. If the MAXS "bandwidth" is 45,000,000, the SIPES array alone requires 45,000,000 × 192 × 4 bytes ⇒ 34.56 GB of memory per process.
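The figure follows directly from the allocation in Listing 1, with the sizes quoted above:

```python
# SIPES(max(1,MAXS), NPROCS) with 4-byte integers, allocated on EVERY process.
maxs, nprocs, int_bytes = 45_000_000, 192, 4
per_process_gb = maxs * nprocs * int_bytes / 1e9  # size of SIPES on one process
```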
Workaround:
- Matrix partition based on rows instead of on elements of the mesh.
  - Slightly worse LU fill-in (size of the cofactors) than with the element-based partition.
- Change the ordering of the DOFs given as input to MUMPS (to be done).

It may be the case that we are doing something completely wrong ...
Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors — analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction.
- Large arrays of antennas.

Present MUMPS interface:
- Treatment of RHS & solution vectors in blocks (typically 10–20 vectors at a time).
- Use of sparse format for the RHS.
- Use of centralized solution vectors.
- The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.
Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- The centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance).
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
FEM DiscretizationHigher-Order Curl-Conforming Element
Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements
4
2
31
13 14
15
16
18
17
19 20
ζ
ξ
η1
2 3
4
7
8
9
10
11
12
6 5
1
2
3
4
5
6
1
2
3
4
56
7
89
10
1112
16
15
17
18
20
19
22
21
2324
25
26
27
28
29
30
31
32
3334
35
36
ζ
ξ
η
14
13
Example 2nd order versions of tetra and prism
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEProblem Set-Up
Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition
ε1 micro1
Medium 1
ε2 micro2
Medium 2
PEC
PMC
PEC
J
nprime
n
M
S prime
Meq
Jeq
S
Exterior Domain(ΩEXT)EincHinc
FEM Domain(ΩFEM)
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEAlgorithm
Local BC for FEM (sparse matrices)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S
Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime
VFE-IIEE =
intintcopy
Sprime(Leq timesnablaG) middot dSprime minus jk0h0
intintcopy
Sprime
(Oeq
(G +
1k2
0nablanablaG
))middot dSprime
nabla times VFE-IIEE = jk0h0
intintcopy
Sprime(Oeq timesnablaG) dSprime minus
intintcopy
Sprime
(Leq
(k2
0 G + nablanablaG))
middot dSprime
ΨSCAT = ntimes(
macrfrminus1nablatimes VFE-IIEE
)+ γ ntimes ntimes VFE-IIEE
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression
[Result figures]
Scattering Problem: Bistatic RCS of Car
- Bistatic RCS at 15 GHz
- Tyres modeled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car
- 27 M tetrahedra
- 173 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car
[Figure: incident plane wave arriving from behind]
[Figures: incident plane wave arriving from the front]
Parallel Scalability: Factorization and FE-IIEE Stages
[Figure: speedup of the factorization + solving phase versus number of cores (120–480), for 12, 6, and 4 threads per process]
[Figure: speedup of the FE-IIEE (mesh truncation) phase versus number of cores (120–480), for 12, 6, and 4 threads per process]
Benchmark: bistatic RCS of Impala
Parallel Scalability: Whole Code
[Figure: speedup of the whole code (factorization + solving + FE-IIEE) versus number of cores (120–480), for 12, 6, and 4 threads per process]
Work in Progress and Future Work
Work in Progress
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes
Future Work
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
Mesh Truncation with FE-IIEE: Problem Set-Up
Open-region problems are handled (optionally) by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) ⇒ an asymptotically exact absorbing boundary condition.
[Figure: FEM domain Ω_FEM (media ε₁, μ₁ and ε₂, μ₂; PEC and PMC objects; sources J, M) truncated by the boundary S; equivalent currents J_eq, M_eq on the interior surface S′ radiate into the exterior domain Ω_EXT, with incident fields E^inc, H^inc]
Mesh Truncation with FE-IIEE: Algorithm
Local boundary condition for the FEM (sparse matrices):

$$\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right)+\gamma\,\hat{n}\times\hat{n}\times\mathbf{V}=\boldsymbol{\Psi}^{\mathrm{INC}}+\boldsymbol{\Psi}^{\mathrm{SCAT}}\quad\text{over }S$$

Iterative estimation of $\boldsymbol{\Psi}^{\mathrm{SCAT}}$ by the exterior equivalence principle on $S'$:

$$\mathbf{V}^{\text{FE-IIEE}}=\oiint_{S'}\left(\mathbf{L}_{eq}\times\nabla G\right)\cdot d\mathbf{S}'-jk_0h_0\oiint_{S'}\mathbf{O}_{eq}\left(G+\frac{1}{k_0^2}\nabla\nabla G\right)\cdot d\mathbf{S}'$$

$$\nabla\times\mathbf{V}^{\text{FE-IIEE}}=jk_0h_0\oiint_{S'}\left(\mathbf{O}_{eq}\times\nabla G\right)dS'-\oiint_{S'}\mathbf{L}_{eq}\left(k_0^2\,G+\nabla\nabla G\right)\cdot d\mathbf{S}'$$

$$\boldsymbol{\Psi}^{\mathrm{SCAT}}=\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}^{\text{FE-IIEE}}\right)+\gamma\,\hat{n}\times\hat{n}\times\mathbf{V}^{\text{FE-IIEE}}$$
Mesh Truncation with FE-IIEE: Flow Chart
[Flow chart: the FEM code for non-open problems (computation and assembly of element matrices, imposition of the BC not related to S, sparse solver, postprocess) is extended with the FE-IIEE loop: initial BC Ψ⁽⁰⁾(r) on S → sparse solve → V(r ∈ Γ_S) and ∇×V(r ∈ Γ_S) ⇒ J_eq⁽ⁱ⁾(r′), M_eq⁽ⁱ⁾(r′) → upgraded BC Ψ⁽ⁱ⁺¹⁾(r) on S]
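The FE-IIEE loop above can be caricatured as a fixed-point iteration on the boundary term. In this sketch the matrices are random stand-ins, invented only so the iteration converges: A plays the FEM operator, B maps the boundary term Psi into the right-hand side, and C condenses the "equivalent currents → radiated boundary term" step. It illustrates the loop structure, not the actual surface-integral evaluation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 4                      # interior unknowns, boundary-term size

A = np.eye(n) * 5 + rng.standard_normal((n, n)) * 0.1  # stand-in FEM matrix
b = rng.standard_normal(n)                             # stand-in excitation
B = rng.standard_normal((n, m)) * 0.1  # maps boundary term Psi into the RHS
C = rng.standard_normal((m, n)) * 0.1  # stand-in radiation (J_eq/M_eq) step

psi = np.zeros(m)                 # Psi^(0): initial boundary term
for i in range(50):
    x = np.linalg.solve(A, b + B @ psi)   # FEM solve with current BC on S
    psi_new = C @ x                       # equivalent currents -> Psi^(i+1)
    if np.linalg.norm(psi_new - psi) < 1e-12 * (1 + np.linalg.norm(psi)):
        psi = psi_new
        break                             # error criterion satisfied
    psi = psi_new

# At the fixed point the solution satisfies A x = b + B (C x)
assert np.allclose(A @ x, b + B @ (C @ x))
```

Note that only the right-hand side changes between iterations, which is why the FEM matrix is factorized once and the loop reduces to repeated solve calls.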
Computational Features
- Code written using modern Fortran constructs (F2003)
- Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms
- Numerical verification by the Method of Manufactured Solutions
- Numerical validation by tests with EM benchmark problems
- Hybrid MPI+OpenMP programming
- Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...)
- Graphical User Interface (GUI) with HPCaaS interface
- Linux & Windows versions
Computational Features: Towards HPC
- "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
- Global mesh object → local mesh objects on each processor
- Specialized direct solver interfaces
- Problems of several tens of millions of unknowns on more than one thousand cores
Parallel Flow Chart of the Code
[Flow chart, summarized:]
1. MPI division of the work; all processes (p0, p1, ..., pn) read the input data and create the global mesh object
2. Numbering of DOFs; p1, ..., pn then deallocate the global mesh object
3. Solver initialization
4. All processes create their local mesh object and, if FE-IIEE is enabled, extract the boundary mesh
5. All processes calculate and fill the LHS matrix; matrix factorization
6. p0 calculates and fills the RHS matrix; the system of equations is solved by all processes
7. If FE-IIEE is enabled: the scattered field is calculated, p0 updates the global RHS, and the system is solved again until convergence or the iteration limit is reached
8. Postprocess; the loop repeats while more frequencies remain
Graphical User Interface: Features
- GUI based on GiD, a general-purpose pre- and post-processor: http://gid.cimne.upc.es
- Creation (or import) of the geometry model of the problem
- Mesh generation
- Assignment of material properties and boundary conditions
- Visualization of results
- Integration with Posidonia (in-house HPCaaS)
Posidonia: In-house HPCaaS Solution
- Eases the use of HPC platforms
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing, and mobility in mind
- Manages all the communication with the remote computer system (file transfers, ...)
- Interacts with its batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job is completed
- Transparent download of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)
Reference: A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, no. 6, pp. 166-177, Dec. 2015.
GUI with HPCaaS Interface
[Screenshot]
Getting Experience with MUMPS: From "MUMPS, do it all" to ...
Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)
RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for a distributed RHS feature, ...)
MUMPS Interface: Present Stage of Development (version 5.0.2)
- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for the matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10–20 vectors) in sparse format
- Call to MUMPS for the system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization
* Frequent use of the out-of-core (OOC) capabilities of MUMPS
Some Issues with Memory: Memory Allocation inside MUMPS
- A memory-use peak during the analysis phase has been detected (distributed assembled format)
- Found to be caused by a memory allocation inside MUMPS routines, related to the maximum MAXS among processors
Listing 1: file zana_aux_par.F
    1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
    ...
    1647   MAXS = ordLAST(I)-ordFIRST(I)+1
    ...
    1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
    ...
    1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.): Memory Allocation inside MUMPS
Memory issue:
- Example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer
- If the MAXS "bandwidth" is 45,000,000 ⇒ 34.56 GB of memory per process
Workaround:
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of the cofactors) than with the partition based on elements
- Change the ordering of the dof given as input to MUMPS (to be done)
- It may be the case that we are doing something completely wrong ...
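The 34.56 GB figure follows directly from the SIPES allocation in Listing 1; a quick check of the arithmetic, with the sizes taken from the example above:

```python
# Size of the SIPES(max(1,MAXS), NPROCS) integer array that
# ZMUMPS_BUILD_LOC_GRAPH allocates on every process.
maxs = 45_000_000        # worst-case ordLAST(I)-ordFIRST(I)+1 over processes
nprocs = 192
bytes_per_integer = 4

size_gb = maxs * nprocs * bytes_per_integer / 1e9
print(f"{size_gb:.2f} GB allocated per process")   # 34.56 GB
```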
Some Issues with Memory: Large Number of RHS & Solution Vectors
Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas
Present MUMPS interface:
- Treatment of the RHS & solution vectors in blocks (typically 10–20 vectors at a time)
- Use of sparse format for the RHS
- Use of centralized solution vectors
  - The reason behind the treatment of the RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
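The blocked treatment of many right-hand sides can be sketched with SciPy's sparse LU standing in for MUMPS: factorize once, then feed the sparse RHS vectors to the solve phase in blocks, so only one block of dense solution vectors is held at a time. The matrix, sizes, and block length here are invented for illustration.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, n_rhs, block = 200, 37, 10     # unknowns, excitations, RHS block size

# Stand-in sparse FEM matrix (diagonally dominant, so well conditioned)
A = sp.random(n, n, density=0.02, random_state=2, format="csc") \
    + 20 * sp.eye(n, format="csc")
lu = spla.splu(A)                  # factorize once (analogue of MUMPS JOB=2)

# Sparse multi-RHS: each excitation touches only a few entries
B = sp.random(n, n_rhs, density=0.01, random_state=3, format="csc")

# Solve in blocks of `block` vectors so that only `block` dense
# solution vectors are ever in memory at the same time
X = np.empty((n, n_rhs))
for j0 in range(0, n_rhs, block):
    j1 = min(j0 + block, n_rhs)
    X[:, j0:j1] = lu.solve(B[:, j0:j1].toarray())   # densify one block only

assert np.allclose(A @ X, B.toarray(), atol=1e-8)
```

With a centralized solution, the block size directly caps the peak memory spent on solution vectors, which is exactly the rationale stated above.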
Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors
Distributed solution planned for the near future:
- The centralized RHS is updated by FE-IIEE ⇒ the use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS Interfaces: Repetitive Solver
[Figures: Vivaldi antenna element and antenna array; (a) unit cell and (b) unit-cell mesh; virtual mesh of the antenna array]
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
   - Identical matrices with different right-hand sides
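The five steps above can be sketched in NumPy for a toy unit cell. The layout is hypothetical: the cells are assumed decoupled and loaded only on their interface dof (a real array shares interface dof between neighbouring cells through a connectivity map), so the Kronecker-product assembly is only illustrative of step 2.

```python
import numpy as np

rng = np.random.default_rng(4)
n_i, n_b, n_cells = 6, 4, 3  # interior/interface dof per cell, number of cells

# One unit cell, partitioned into interior (i) and interface (b) unknowns
K = rng.standard_normal((n_i + n_b, n_i + n_b)) + 10 * np.eye(n_i + n_b)
K_ii, K_ib = K[:n_i, :n_i], K[:n_i, n_i:]
K_bi, K_bb = K[n_i:, :n_i], K[n_i:, n_i:]

# 1) Schur complement of the unit cell (computed once, reused for all cells)
S_cell = K_bb - K_bi @ np.linalg.solve(K_ii, K_ib)

# 2) Assemble the cell Schur complements into the interface problem
#    (block-diagonal here because the toy cells are decoupled)
S = np.kron(np.eye(n_cells), S_cell)

# 3)-4) Add BC (skipped in this toy) and solve the interface problem
b_b = rng.standard_normal(n_cells * n_b)
x_b = np.linalg.solve(S, b_b)

# 5) Interior problems: identical matrix K_ii, different RHS per cell
x_i = [np.linalg.solve(K_ii, -K_ib @ x_b[c * n_b:(c + 1) * n_b])
       for c in range(n_cells)]

# Each interior solution balances its cell's interior equations
assert np.allclose(K_ii @ x_i[0] + K_ib @ x_b[:n_b], 0, atol=1e-10)
```

Because the unit-cell matrix is identical for every cell, step 5 amounts to one factorization reused with many right-hand sides, which is where the time and memory savings come from.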
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Features and remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2, and no BC) all leaves of a given level of the tree are identical
  - Further savings in time
  - Large savings in memory
- Boundary conditions (BC) alter this one-branch-tree behavior ⇒ the BC may be left up to the root of the tree
- Alternatively, "algebraic symmetry" can be explored
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver
- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- A natural approach to some DDM strategies
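A minimal sketch of the direct-solver-as-preconditioner idea, with SciPy in place of MUMPS: only a leading sub-block is factorized exactly (standing in for "a few levels" of the multifrontal tree), and that partial factorization preconditions GMRES on the full system. The matrix, the splitting point, and the Jacobi treatment of the unfactorized part are invented for the example.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(5)
n = 300
A = sp.random(n, n, density=0.02, random_state=5, format="csc") \
    + 15 * sp.eye(n, format="csc")
b = rng.standard_normal(n)

# "Few levels of the multifrontal algorithm": factorize only a leading
# sub-block exactly; the rest is handled cheaply.
k = n // 2
lu_top = spla.splu(A[:k, :k].tocsc())   # exact factors for the first levels
diag = A.diagonal()

def apply_precond(r):
    z = r.copy()
    z[:k] = lu_top.solve(r[:k])         # direct solve on the factorized part
    z[k:] = r[k:] / diag[k:]            # cheap (Jacobi) step on the remainder
    return z

M = spla.LinearOperator((n, n), matvec=apply_precond)
x, info = spla.gmres(A, b, M=M)         # direct solver acts as preconditioner
assert info == 0 and np.allclose(A @ x, b, atol=1e-2)
```

The same structure maps naturally onto DDM: each subdomain contributes an exact local factorization, and the iterative method resolves the coupling between subdomains.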
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Mesh Truncation with FE-IIEEAlgorithm
Local BC for FEM (sparse matrices)
ntimes(
macrfrminus1nablatimes V
)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S
Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime
VFE-IIEE =
intintcopy
Sprime(Leq timesnablaG) middot dSprime minus jk0h0
intintcopy
Sprime
(Oeq
(G +
1k2
0nablanablaG
))middot dSprime
nabla times VFE-IIEE = jk0h0
intintcopy
Sprime(Oeq timesnablaG) dSprime minus
intintcopy
Sprime
(Leq
(k2
0 G + nablanablaG))
middot dSprime
ΨSCAT = ntimes(
macrfrminus1nablatimes VFE-IIEE
)+ γ ntimes ntimes VFE-IIEE
GREMA-UC3M Higher Order Finite Element Code v12
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Mesh Truncation with FE-IIEEFlow Chart
Assembly ofElement Matrices
Imposition of BCNon Related to S
PostprocessSparseSolver
InitialBC on S
Computation ofElement Matrices
Upgrade of BC on
Mesh
FEM code for nonminusopen problems
Ψ(0)(r)
S Ψ(i+1)(r)
J(i)eq (r prime)M(i)
eq (r prime)rArr
V (r isin ΓS)nablatimesV (r isin ΓS)
GREMA-UC3M Higher Order Finite Element Code v12
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
Posidonia: In-house HPCaaS Solution
- Eases the use of HPC platforms
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing, and mobility in mind
- Manages all communication with the remote computer system (file transfers, ...)
- Interacts with the remote batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job is completed
- Transparent downloading of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)
A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, 166–177, Dec. 2015.
GUI with HPCaaS Interface: Screenshot
Getting Experience with MUMPS: From "MUMPS, do it all" to ...
Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)
RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for a distributed RHS feature, ...)
MUMPS Interface: Present Stage of Development (version 5.0.2)
- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for the matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for the system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization
* Frequent use of the out-of-core (OOC) capabilities of MUMPS
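The factorize-once, solve-in-blocks pattern listed above can be sketched as follows. The real code drives MUMPS from Fortran; here SciPy's SuperLU stands in for the direct solver, and the matrix and sizes are made up for illustration.

```python
import numpy as np
from scipy.sparse import identity, random as sprandom
from scipy.sparse.linalg import splu

rng = np.random.default_rng(0)
n, n_rhs, block = 500, 64, 16
# Diagonally dominated sparse matrix as a stand-in for the FEM system
A = (sprandom(n, n, density=0.01, random_state=0) + 10.0 * identity(n)).tocsc()
B = rng.standard_normal((n, n_rhs))        # e.g. one RHS vector per excitation

lu = splu(A)                               # factorization phase: done once
X = np.empty_like(B)
for j in range(0, n_rhs, block):           # solve phase: RHS fed in blocks
    X[:, j:j + block] = lu.solve(B[:, j:j + block])

rel_res = np.linalg.norm(A @ X - B) / np.linalg.norm(B)
```

The factorization dominates the cost, so amortizing it over many right-hand sides is what makes frequency sweeps and multi-excitation runs affordable.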
Some Issues with Memory: Memory Allocation inside MUMPS
Memory issue:
- A peak in memory use during the analysis phase has been detected (distributed assembled)
- Found to be due to memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F
1589  SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647    MAXS = ordLAST(I)-ordFIRST(I)+1
1653    ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864  END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.): Memory Allocation inside MUMPS
Memory issue, example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer; the MAXS bandwidth is 45,000,000 ⇒ 34.56 GB of memory per process.
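The figure follows directly from the allocation in Listing 1, since SIPES is a (MAXS x NPROCS) integer array on every process; reproducing the arithmetic:

```python
maxs = 45_000_000            # worst case: MAXS equals the full matrix order
nprocs = 192
bytes_per_integer = 4

per_process_bytes = maxs * nprocs * bytes_per_integer
per_process_gb = per_process_bytes / 1e9
# per_process_gb -> 34.56 GB; with several MPI processes per node this
# dwarfs the 64 GB of RAM available on each compute node
```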
Workaround:
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
- Change the ordering of dofs as input to MUMPS (to be done)
- It may be that we are doing something completely wrong...
Some Issues with Memory: Large Number of RHS & Solution Vectors
Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas
Present MUMPS interface: treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time):
- Use of sparse format for the RHS
- Use of centralized solution vectors
- The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
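The memory motivation for blocking can be made concrete with back-of-the-envelope numbers (illustrative only; the 720-excitation sweep is a hypothetical monostatic-RCS scenario, not a figure from the slides):

```python
n_unknowns = 45_000_000        # problem size, as in the earlier example
bytes_per_entry = 16           # double-precision complex entries

def solution_storage_gb(n_vectors):
    """Centralized storage for n_vectors dense solution vectors, in GB."""
    return n_unknowns * n_vectors * bytes_per_entry / 1e9

full_sweep = solution_storage_gb(720)   # all excitations at once
one_block = solution_storage_gb(20)     # a single block of 20 vectors
```

Holding the whole sweep centrally would need hundreds of gigabytes on one process, while a 20-vector block stays comfortably within a node's RAM.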
Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors
Distributed solution planned for the near future:
- The centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is a distributed RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS Interfaces: Repetitive Solver
Vivaldi antenna
Antenna Array
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Antenna Array
Specialized MUMPS Interfaces (cont.): Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Figure: Virtual Mesh of Antenna Array
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
   - Identical matrices with different right-hand sides
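A dense NumPy toy of steps 1-5 (the real interface obtains the Schur complement from MUMPS; the sizes, the assembly map, and the BC term here are made up): one unit-cell Schur complement is computed once and reused for every virtual cell, and all interior solves share one matrix with different right-hand sides.

```python
import numpy as np

rng = np.random.default_rng(1)
ni, nb = 40, 8                      # interior / interface unknowns per cell
ncells = 4

# One unit-cell matrix, partitioned into interior (i) and interface (b) blocks
A_ii = np.diag(np.full(ni, 4.0)) + 0.1 * rng.standard_normal((ni, ni))
A_ib = 0.1 * rng.standard_normal((ni, nb))
A_bi = 0.1 * rng.standard_normal((nb, ni))
A_bb = np.diag(np.full(nb, 4.0)) + 0.1 * rng.standard_normal((nb, nb))

# 1. Schur complement of the unit cell, computed once
S = A_bb - A_bi @ np.linalg.solve(A_ii, A_ib)

# 2.-3. assemble identical cell Schur complements into the interface problem
# (trivial map: each cell owns one interface block; the 0.01*I term stands
# in for the boundary-condition contributions)
nB = ncells * nb
S_glob = np.zeros((nB, nB))
for c in range(ncells):
    S_glob[c*nb:(c+1)*nb, c*nb:(c+1)*nb] += S
S_glob += 0.01 * np.eye(nB)

f_i = rng.standard_normal((ni, ncells))     # interior load per cell
g = rng.standard_normal(nB)                 # interface load
for c in range(ncells):                     # condense interior loads onto interface
    g[c*nb:(c+1)*nb] -= A_bi @ np.linalg.solve(A_ii, f_i[:, c])

# 4. solve the (small) interface problem
x_b = np.linalg.solve(S_glob, g)

# 5. interior solves: identical matrix A_ii, different right-hand sides
x_i = np.linalg.solve(A_ii, f_i - A_ib @ x_b.reshape(ncells, nb).T)
```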
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Features and remarks:
- Advantages: saving in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a certain level of the tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver
- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- Can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
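The hybrid idea can be sketched with SciPy (assumptions: the "few multifrontal levels" are imitated by exactly factorizing decoupled subdomain blocks while dropping the couplings between them; the real scheme stops MUMPS's elimination tree early instead):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, gmres, splu

n = 2000
# Full operator: a shifted 1D Laplacian standing in for the FEM matrix
off_full = np.full(n - 1, -1.0)
A = diags([off_full, np.full(n, 2.1), off_full], [-1, 0, 1], format="csc")

# "Partial" factorization: exact within blocks of 100 unknowns, couplings
# between blocks dropped -> one cheap direct solve per subdomain
off_cut = off_full.copy()
off_cut[99::100] = 0.0
A_partial = diags([off_cut, np.full(n, 2.1), off_cut], [-1, 0, 1], format="csc")

b = np.ones(n)
lu = splu(A_partial)                            # incomplete direct factorization
M = LinearOperator((n, n), matvec=lu.solve)     # direct solver as preconditioner
x, info = gmres(A, b, M=M)                      # outer Krylov iteration

rel_res = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

Because A differs from A_partial only by the few dropped couplings, the preconditioned operator is a low-rank perturbation of the identity and the Krylov iteration converges in a handful of steps, which is the payoff of stopping the direct solver early.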
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors
Lack of availability of LU cofactors:
1. Calls to multiple (typically sequential) MUMPS instances for the Schur complements
2. Assembling of the Schur complements
3. Finalizing the MUMPS instances
4. Solve the interface problem
5. Create new MUMPS instances to solve the interior problems
Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors
MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time
Regular MUMPS & MULTISOLVER: Time and Memory Comparison
[Plot: wall time in seconds (0-1000) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS]
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison
[Plot: memory in GB (0-140) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS; system limit of 128 GB marked]
HPC Environment: Cluster of Xidian University (XDHPC)
- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network
Waveguide Problem: Low Pass Filter with Higher-Order Mode Suppression
- Band: 10–16 GHz
- Length: 218 mm
- 324.5 K tetrahedra
- 2.2 M unknowns
- Wall time: 7.3 min per freq point
Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Waveguide Problem (cont.): Low Pass Filter with Higher-Order Mode Suppression
Scattering Problem: Bistatic RCS of Car
- Bistatic RCS at 1.5 GHz
- Tyres modelled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car
- 2.7 M tetrahedra
- 17.3 M unknowns
- Wall time: 5.9 min per freq point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from behind
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from the front
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from the front
Parallel Scalability: Factorization and FE-IIEE Stages
[Plot: factorization + solving speedup vs. number of cores (120-480) for 12, 6, and 4 threads per process]
Speedup graph corresponding to the factorization phase
[Plot: FE-IIEE speedup vs. number of cores (120-480) for 12, 6, and 4 threads per process]
Speedup graph corresponding to the mesh truncation phase
Benchmark: Bistatic RCS of Impala
Parallel Scalability: Whole Code
[Plot: factorization + solving + FE-IIEE speedup vs. number of cores (120-480) for 12, 6, and 4 threads per process]
Speedup graph corresponding to the whole code
Work in Progress and Future Work
Work in Progress:
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes
Future Work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
Computational Features
Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions
GREMA-UC3M Higher Order Finite Element Code v12
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Computational FeaturesTowards HPC
Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Flow Chart of the Code
MPI Division
FE-IIEE enabled
Yes
No
p0 p1 p2 pn ReadInput Data
p0 p1 p2 pn Create global mesh object
Numberingof DOFs
p1 p2 pn Deallocateglobal mesh
object
p0
Initializesolver
p0 p1 p2 pn Create local mesh object
p0 p1 p2 pn Extractboundary
mesh
p0 p1 p2 pn
p0 p1 p2 pn Calculate ampfill LHS matrix
p0 p1 p2 pn Matrixfactorization
p0 Calculate ampfill RHS matrix
Solve systemof equations
p0 p1 p2 pn
Yes
Yes
Calculatescatteringfield
p0 Updateglobal RHS
FE-IIEE enabled
p0 p1 p2 pn
More frequencies
Postprocess
No
No
Solve systemof equations
p0 p1 p2 pn
ConvIter achieved
Yes
No
GREMA-UC3M Higher Order Finite Element Code v12
Graphical User InterfaceFeatures
FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces
Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)
GREMA-UC3M Higher Order Finite Element Code v12
PosidoniaIn-house HPCaaS Solution
Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)
A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015
GREMA-UC3M Higher Order Finite Element Code v12
GUI with HPCaaS InterfaceScreenshot
GREMA-UC3M Higher Order Finite Element Code v12
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC Environment: Cluster of Xidian University (XDHPC)

- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 v2 CPUs at 2.2 GHz
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network
Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression

- Frequency band: 10-16 GHz
- Length: 218 mm
- 3245 K tetrahedra
- 22 M unknowns
- Wall time: 73 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression (result figures)
Scattering Problem: Bistatic RCS of Car

- Bistatic RCS at 15 GHz
- Tyres modelled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car

- 27 M tetrahedra
- 173 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car

- Incident plane wave arriving from behind (figure)
- Incident plane wave arriving from the front (figures)
Parallel Scalability: Factorization and FE-IIEE Stages

(Left figure: speedup of factorization + solving vs. number of cores, 120 to 480, for 12, 6, and 4 threads per process; values 49, 52, 57, and 70 are marked on the curves. Speedup graph corresponding to the factorization phase.)

(Right figure: FE-IIEE speedup vs. number of cores for the same thread counts; values 78, 85, and 100 are marked. Speedup graph corresponding to the mesh truncation phase.)

Benchmark: bistatic RCS of an Impala
Parallel Scalability: Whole Code

(Figure: speedup of the whole code (factorization + solving + FE-IIEE) vs. number of cores, 120 to 480, for 12, 6, and 4 threads per process; values 62, 64, 70, and 80 are marked on the curves. Speedup graph corresponding to the whole code.)
Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
Parallel Flow Chart of the Code

1. All processes (p0, p1, p2, ..., pn) read the input data and create the global mesh object.
2. Numbering of DOFs; then p1...pn deallocate the global mesh object while p0 initializes the solver.
3. MPI division of the work: all processes create their local mesh object and extract the boundary mesh.
4. All processes calculate & fill the LHS matrix and perform the matrix factorization.
5. For each frequency, p0 calculates & fills the RHS matrix and all processes solve the system of equations.
6. If FE-IIEE is enabled: the scattering field is calculated, p0 updates the global RHS, and the system of equations is solved again until the convergence/iteration criterion is achieved.
7. Postprocess; the loop repeats while more frequencies remain.
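The FE-IIEE branch of this flow chart reuses the matrix factorization while only the right-hand side changes, so each iteration costs one forward/back-substitution. A toy fixed-point sketch, with an invented matrix A standing for the factorized FEM system and B for the boundary update:

```python
import numpy as np

# Toy stand-in for the per-frequency FE-IIEE loop: A is factorized once, the
# mesh-truncation correction B enters only through the right-hand side, and
# the fixed point solves (A - B) x = f. All matrices and sizes are invented.
rng = np.random.default_rng(0)
n = 40
A = 4.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))  # "factorized" system matrix
B = 0.05 * rng.standard_normal((n, n))                   # boundary-update operator
f = rng.standard_normal(n)

Ainv = np.linalg.inv(A)            # stands in for the reusable LU factors
x = Ainv @ f                       # initial solve with the plain RHS
for it in range(100):
    x_new = Ainv @ (f + B @ x)     # update the global RHS, solve again
    converged = np.linalg.norm(x_new - x) <= 1e-12 * np.linalg.norm(x_new)
    x = x_new
    if converged:                  # convergence criterion achieved
        break
```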
Graphical User Interface: Features

- GUI based on GiD, a general-purpose pre- and post-processor: http://gid.cimne.upc.es
- Creation (or import) of the geometry model of the problem
- Mesh generation
- Assignment of material properties and boundary conditions
- Visualization of results
- Integration with Posidonia (in-house HPCaaS)
Posidonia: In-house HPCaaS Solution

- Eases the use of HPC platforms
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing, and mobility in mind
- Manages all communication with the remote computer system (file transfers, ...)
- Interacts with its batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job is completed
- Transparent downloading of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, pp. 166-177, Dec. 2015.
GUI with HPCaaS Interface: Screenshot (figure)
Getting Experience with MUMPS: From "MUMPS, do it all" to ...

Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for the distributed-RHS feature ...)
MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

* Frequent use of the out-of-core (OOC) capabilities of MUMPS
Some Issues with Memory: Memory Allocation inside MUMPS

Memory issue:
- A peak in memory use during the analysis phase has been detected (distributed assembled input)
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647   MAXS = ordLAST(I)-ordFIRST(I)+1
1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue, example: a 45,000,000-DOF problem using 192 processes and 4 bytes per integer. MAXS bandwidth is 45,000,000 ⇒ 34.56 GB of memory per process.

Workaround:
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
- Change the ordering of DOFs given as input to MUMPS (to be done)

It may be the case that we are doing something completely wrong ...
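The figure follows directly from the allocation in Listing 1, SIPES(max(1,MAXS), NPROCS), with one 4-byte integer per entry:

```python
# Reproducing the slide's memory estimate for the SIPES allocation in
# ZMUMPS_BUILD_LOC_GRAPH: SIPES(max(1, MAXS), NPROCS) integer entries.
MAXS = 45_000_000          # worst-case row range on one process (= #DOF here)
NPROCS = 192
BYTES_PER_INT = 4

peak_bytes = max(1, MAXS) * NPROCS * BYTES_PER_INT
peak_gb = peak_bytes / 1e9
print(f"{peak_gb:.2f} GB per process")   # prints "34.56 GB per process"
```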
Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas

Present MUMPS interface:
- Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time)
- Use of sparse format for the RHS
- Use of centralized solution vectors

The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.
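A minimal sketch of the blocked strategy (sizes invented, a dense inverse standing in for the reusable LU factors): only one block of solution vectors exists at a time, and each block is post-processed before the next one is solved.

```python
import numpy as np

# Blocked treatment of many excitations: factorize once, then feed the RHS
# columns to the solver in blocks of `nb` vectors, keeping only one block of
# solution vectors in memory at a time. All sizes and data are invented.
rng = np.random.default_rng(4)
n, n_rhs, nb = 60, 50, 15
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned toy matrix
F = rng.standard_normal((n, n_rhs))              # one RHS column per excitation

Ainv = np.linalg.inv(A)                          # stands in for the LU factors
results = []
for j0 in range(0, n_rhs, nb):
    X_block = Ainv @ F[:, j0:j0 + nb]            # solve up to nb RHS at once
    # keep only the post-processed quantity, then discard the block
    results.extend(np.linalg.norm(X_block, axis=0))

assert len(results) == n_rhs
```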
Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- The centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed-RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS Interfaces: Repetitive Solver

(Figures: Vivaldi antenna; antenna array; (a) unit cell and (b) unit cell mesh; virtual mesh of the antenna array.)
Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
   - Identical matrices with different right-hand sides
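The five steps can be sketched for a 1-D chain of identical cells that share one interface node with each neighbour (all sizes and values invented; dense NumPy in place of MUMPS): the unit-cell Schur complement is computed once, assembled over the virtual cells, and the interior recovery reuses one factorization with many right-hand sides.

```python
import numpy as np
from numpy.linalg import solve

# Toy repetitive solver on a 1-D chain of nc identical cells. Each cell has
# ni interior unknowns and 2 interface nodes (left/right), shared between
# neighbours. One unit-cell Schur complement serves every "virtual" cell.
rng = np.random.default_rng(3)
ni, nc = 8, 5
m = ni + 2
Kc = rng.standard_normal((m, m)); Kc = Kc @ Kc.T + m * np.eye(m)  # SPD unit-cell matrix
fc = rng.standard_normal(m)                  # same local load in every cell (for brevity)

I, G = slice(0, ni), slice(ni, m)            # interior / interface blocks
KII, KIG, KGI, KGG = Kc[I, I], Kc[I, G], Kc[G, I], Kc[G, G]

# 1. Schur complement of the unit cell (computed once)
S = KGG - KGI @ solve(KII, KIG)              # 2 x 2 condensed matrix
g = fc[G] - KGI @ solve(KII, fc[I])          # condensed load

# 2./3. Assemble S over all "virtual" cells onto the chain of nc+1 interface
#       nodes, with a Dirichlet BC fixing the two outer nodes to zero.
nG = nc + 1
SG, gG = np.zeros((nG, nG)), np.zeros(nG)
for c in range(nc):
    idx = [c, c + 1]
    SG[np.ix_(idx, idx)] += S
    gG[idx] += g
keep = np.arange(1, nG - 1)                  # eliminate the fixed end nodes

# 4. Solve the interface problem
xG = np.zeros(nG)
xG[keep] = solve(SG[np.ix_(keep, keep)], gG[keep])

# 5. Interior recovery: identical matrix KII, different RHS per cell
X_int = solve(KII, fc[I][:, None] - KIG @ np.vstack([xG[:-1], xG[1:]]))
```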
Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a certain level of the tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this one-branch-tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
GUI with HPCaaS Interface
Screenshot

Getting Experience with MUMPS
From "MUMPS, do it all" to...

Matrix Format
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and Solution
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for a distributed RHS feature...)
MUMPS Interface
Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for the matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for the system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

* Frequent use of the out-of-core (OOC) capabilities of MUMPS
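The distributed assembled input used above can be pictured with a toy assembly: each process hands MUMPS its local (row, column, value) triplets, and entries belonging to shared dof are summed internally (the behavior selected with ICNTL(18)=3). The 4-dof matrix and its split between two "processes" below are invented for illustration only:

```python
import numpy as np

# Each "process" contributes local (row, col, val) triplets; entries for
# shared dof appear in several contributions and are summed on assembly,
# which is what MUMPS' distributed assembled input (ICNTL(18)=3) expects.
n = 4
proc0 = [(0, 0, 2.0), (0, 1, -1.0), (1, 0, -1.0), (1, 1, 1.0)]
proc1 = [(1, 1, 1.0), (1, 2, -1.0), (2, 1, -1.0), (2, 2, 2.0),
         (2, 3, -1.0), (3, 2, -1.0), (3, 3, 2.0)]

A = np.zeros((n, n))
for i, j, v in proc0 + proc1:   # MUMPS performs this summation internally
    A[i, j] += v
print(A[1, 1])  # 2.0: the shared dof 1 accumulates 1.0 from each process
```

In the real code the triplets are the local FEM contributions of each MPI process; no global matrix is ever formed on a single process.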
Some Issues with Memory
Memory Allocation inside MUMPS

Memory Issue
- A peak in memory use during the analysis phase has been detected (distributed assembled input)
- Found to be due to memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647   MAXS = ord%LAST(I) - ord%FIRST(I) + 1
1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.)
Memory Allocation inside MUMPS

Memory Issue
- Example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer
- MAXS (bandwidth) is 45,000,000 ⇒ 34.56 GB of memory per process

Workaround
- Matrix partition based on rows instead of on elements of the mesh
  - Slightly worse LU fill-in (size of the cofactors) than with the partition based on elements
- Change the ordering of the dof given as input to MUMPS (to be done)

It may be the case that we are doing something completely wrong...
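The per-process figure quoted above follows directly from the allocation in Listing 1 (SIPES is the integer array of size max(1,MAXS) × NPROCS), i.e. 45,000,000 × 192 × 4 bytes ≈ 34.56 GB:

```python
# Size of the SIPES(max(1,MAXS), NPROCS) integer array from Listing 1,
# for the example: MAXS = 45,000,000, NPROCS = 192, 4-byte integers.
MAXS = 45_000_000       # bandwidth (ordering range) on one process
NPROCS = 192            # number of MPI processes
BYTES_PER_INT = 4

bytes_per_process = MAXS * NPROCS * BYTES_PER_INT
gb_per_process = bytes_per_process / 1e9
print(f"{gb_per_process:.2f} GB per process")  # 34.56 GB per process
```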
Some Issues with Memory
Large Number of RHS & Solution Vectors

Large Number of RHS Vectors
Analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas

Present MUMPS Interface
Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time):
- Use of sparse format for the RHS
- Use of centralized solution vectors
The reason behind treating the RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.
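The blocked treatment can be sketched as follows. This is a minimal NumPy illustration only: the dense solve stands in for a MUMPS solve that reuses the LU factors, and the matrix sizes and block size of 10 are placeholders:

```python
import numpy as np

def solve_in_blocks(A, B, block=10):
    """Yield solution blocks of A X = B, `block` columns at a time,
    so only `block` dense solution vectors exist simultaneously."""
    for j0 in range(0, B.shape[1], block):
        # In the real code this is a MUMPS solve reusing the factorization;
        # a dense solve stands in for it here.
        yield np.linalg.solve(A, B[:, j0:j0 + block])

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)  # well-conditioned test matrix
B = rng.standard_normal((50, 37))                      # 37 excitations
# Each block would normally be post-processed (e.g. far fields computed)
# and discarded; here we just check the residual of every block.
ok = all(np.allclose(A @ Xb, B[:, j0:j0 + 10])
         for j0, Xb in zip(range(0, 37, 10), solve_in_blocks(A, B)))
print(ok)  # True
```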
Some Issues with Memory (cont.)
Large Number of RHS & Solution Vectors

Distributed Solution Planned for the Near Future
- The centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS Interfaces
Repetitive Solver

Vivaldi antenna
Antenna array
Specialized MUMPS Interfaces (cont.)
Repetitive Solver

Antenna array
Specialized MUMPS Interfaces (cont.)
Repetitive Solver

(a) Unit cell  (b) Unit cell mesh
Specialized MUMPS Interfaces (cont.)
Repetitive Solver

Figure: Virtual mesh of the antenna array
Specialized MUMPS Interfaces (cont.)
Repetitive Solver

Algorithm
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
- Identical matrices with different right-hand sides
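The algebra behind the steps above can be sketched densely with NumPy. This is an illustration only: the unit-cell blocks are invented, the "array" is just two cells sharing one interface, and the real code obtains the Schur complement from MUMPS rather than by dense elimination:

```python
import numpy as np

rng = np.random.default_rng(2)
ni, nb, ncells = 6, 3, 2   # interior / interface dofs per cell, 2 cells

# Identical (invented) unit-cell matrix, partitioned into interior (i)
# and interface (b) blocks; made diagonally dominant so it is invertible.
Aii = rng.standard_normal((ni, ni)) + ni * np.eye(ni)
Aib = rng.standard_normal((ni, nb))
Abi = rng.standard_normal((nb, ni))
Abb = rng.standard_normal((nb, nb)) + nb * np.eye(nb)
f = [(rng.standard_normal(ni), rng.standard_normal(nb)) for _ in range(ncells)]

# 1) Schur complement of the unit cell (computed once, reused for every cell)
S = Abb - Abi @ np.linalg.solve(Aii, Aib)

# 2)-4) Assemble and solve the interface problem over all "virtual" cells
#       (here both cells share the same interface dofs, the simplest layout)
S_glob = ncells * S   # identical cells => identical Schur contributions
g_glob = sum(fb - Abi @ np.linalg.solve(Aii, fi) for fi, fb in f)
xb = np.linalg.solve(S_glob, g_glob)

# 5) Interior problems: identical matrix Aii, different right-hand sides
xi = [np.linalg.solve(Aii, fi - Aib @ xb) for fi, fb in f]

# Check against the directly assembled two-cell system
A = np.block([[Aii, np.zeros((ni, ni)), Aib],
              [np.zeros((ni, ni)), Aii, Aib],
              [Abi, Abi, 2 * Abb]])
rhs = np.concatenate([f[0][0], f[1][0], f[0][1] + f[1][1]])
x = np.linalg.solve(A, rhs)
print(np.allclose(np.concatenate([xi[0], xi[1], xb]), x))  # True
```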
Specialized MUMPS Interfaces (cont.)
Repetitive Solver

Features and Remarks
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a certain level of the tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this one-branch-tree behavior
  ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored
Specialized MUMPS Interfaces
Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
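The idea can be illustrated with a small preconditioned conjugate gradient in which an exact block factorization plays the role of the truncated multifrontal solve. This is a toy SPD analogue, not the actual code: the matrix, sizes, and the block-Jacobi preconditioner are invented for illustration:

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-10, maxit=200):
    """Preconditioned conjugate gradient; M_solve applies the
    (direct-solver-based) preconditioner M^{-1} to a residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it + 1
        z = M_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit

# SPD test matrix with a 2x2 block structure; the preconditioner solves the
# diagonal blocks exactly, standing in for the partial (few-level)
# multifrontal factorization, while the iteration handles the rest.
rng = np.random.default_rng(1)
n, h = 80, 40
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
M = A.copy()
M[:h, h:] = 0.0   # drop off-diagonal coupling -> block-Jacobi preconditioner
M[h:, :h] = 0.0
b = rng.standard_normal(n)
x, iters = pcg(A, b, lambda r: np.linalg.solve(M, r))
print(np.allclose(A @ x, b, atol=1e-6), iters < 200)  # True True
```

The levels that are factorized directly correspond to M being exact there; the dropped coupling is what the outer iteration must recover.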
Specialized MUMPS Interfaces
Dealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU Cofactors
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembling the Schur complements
- Finalizing the MUMPS instances
- Solve the interface problem
- Create new MUMPS instances to solve the interior problems
Specialized MUMPS Interfaces (cont.)
Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problems
- Idea inspired by work led by Prof. Paszynski
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach saves memory (as expected) while remaining competitive in time
Regular MUMPS & MULTISOLVER
Time and Memory Comparison

[Figure: wall time in seconds (0 to 1000) vs number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS]
Regular MUMPS & MULTISOLVER (cont.)
Time and Memory Comparison

[Figure: memory in GB (0 to 140) vs number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS; the 128 GB system limit is marked]
HPC Environment
Cluster of Xidian University (XDHPC)

Cluster of Xidian University
- 140 compute nodes
  - Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network
Waveguide Problem
Low-Pass Filter with Higher-Order Mode Suppression

- 10-16 GHz band
- Length: 218 mm
- 3245 K tetrahedra
- 22 M unknowns
- Wall time: 73 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Scattering Problem
Bistatic RCS of Car

- Bistatic RCS at 15 GHz
- Tyres modeled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
Scattering Problem (cont.)
Bistatic RCS of Car

- 27 M tetrahedra
- 173 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)
Scattering Problem (cont.)
Bistatic RCS of Car

Incident plane wave arriving from behind
Scattering Problem (cont.)
Bistatic RCS of Car

Incident plane wave arriving from the front
Parallel Scalability
Factorization and FE-IIEE Stages

[Figure: factorization + solving speedup vs number of cores (120 to 480), for 12, 6, and 4 threads per process]
Speedup graph corresponding to the factorization phase

[Figure: FE-IIEE speedup vs number of cores (120 to 480), for 12, 6, and 4 threads per process]
Speedup graph corresponding to the mesh truncation phase

Benchmark: bistatic RCS of Impala
Parallel Scalability
Whole Code

[Figure: factorization + solving + FE-IIEE speedup vs number of cores (120 to 480), for 12, 6, and 4 threads per process]
Speedup graph corresponding to the whole code
Work in Progress and Future Work

Work in Progress
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes

Future Work
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver
Thanks for your attention
Thanks to the MUMPS team
Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to
Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)
RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )
GREMA-UC3M Higher Order Finite Element Code v12
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
MUMPS InterfacePresent Stage of Development (version 502)
MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors
I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)
Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization
lowast Frequent use of out-of-core (OOC) capabilities of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryMemory Allocation inside MUMPS
Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors
Listing 1 file zana aux parF
1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel Scalability: Factorization and FE-IIEE Stages

Figure: factorization + solving speedup vs. number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process, with annotated values 49, 52, 57, and 70. Caption: speedup graph corresponding to the factorization phase.

Figure: FE-IIEE speedup vs. number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process, with annotated values 78, 85, and 100. Caption: speedup graph corresponding to the mesh truncation phase.

Benchmark: bistatic RCS of Impala.
Parallel Scalability: Whole Code

Figure: factorization + solving + FE-IIEE speedup vs. number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process, with annotated values 62, 64, 70, and 80. Caption: speedup graph corresponding to the whole code.
Work in Progress and Future Work

Work in Progress: hierarchical basis functions of variable order p; h-adaptivity ⇒ support for hp meshes.
Future Work: conformal and non-conformal DDM; hybrid (direct + iterative) solver.
Thanks for your attention
Thanks to the MUMPS team
Some Issues with Memory: Memory Allocation inside MUMPS
Memory Issue: a peak in memory use during the analysis phase has been detected (distributed assembled input). It was found to be due to memory allocation inside MUMPS routines, related to the maximum MAXS among processors.
Listing 1: file zana_aux_par.F

    1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
    1647   MAXS = ordLAST(I) - ordFIRST(I) + 1
    1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
    1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
Some Issues with Memory (cont.): Memory Allocation inside MUMPS
Memory Issue, example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer. The MAXS bandwidth is 45,000,000 ⇒ 34.56 GB of memory per process.
Workaround: matrix partition based on rows instead of on elements of the mesh.
Slightly worse LU fill-in (size of cofactors) than with a partition based on elements.
Change the ordering of the dof given as input to MUMPS (to be done).
It may be that we are doing something completely wrong.
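The per-process figure follows directly from the allocation in Listing 1: SIPES holds MAXS × NPROCS integers on every process. A quick check in plain Python, using the numbers quoted on the slide:

```python
# Size of the SIPES(max(1,MAXS), NPROCS) allocation on one process,
# with the figures quoted above.
MAXS = 45_000_000        # dof "bandwidth" seen by one process
NPROCS = 192             # number of MPI processes
BYTES_PER_INT = 4        # default integer size

size_bytes = MAXS * NPROCS * BYTES_PER_INT
size_gb = size_bytes / 1e9
print(f"{size_gb:.2f} GB per process")   # -> 34.56 GB per process
```

This is why the allocation scales badly: it grows with the global bandwidth times the process count, not with the local share of the matrix.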
Some Issues with Memory: Large Number of RHS & Solution Vectors
Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
Monostatic radar cross section (RCS) prediction
Large arrays of antennas
Present MUMPS interface: treatment of RHS & solution vectors in blocks (typically 10–20 vectors at a time).
Use of sparse format for the RHS; use of centralized solution vectors.
The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.
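The blocking scheme above can be sketched as: factorize once, then sweep the excitations in fixed-size blocks so that only one block of solution vectors is resident at a time. A minimal dense NumPy sketch under illustrative assumptions (the matrix, block size, and function names are hypothetical, not the code's actual MUMPS interface; a real implementation would reuse a stored factorization instead of calling `solve` per block):

```python
import numpy as np

def solve_in_blocks(A, B, block_size=16):
    """Solve A X = B for many right-hand sides, block_size columns at a
    time, so at most one block of solution vectors is held at once."""
    results = []   # in the real code each block would be consumed
                   # immediately (e.g. RCS post-processing), then freed
    for start in range(0, B.shape[1], block_size):
        X_block = np.linalg.solve(A, B[:, start:start + block_size])
        results.append(X_block)
    return np.hstack(results)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)   # well-conditioned test matrix
B = rng.standard_normal((50, 40))                     # 40 excitations
X = solve_in_blocks(A, B, block_size=16)
print(np.allclose(A @ X, B))   # -> True
```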
Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors
Distributed Solution Planned for Near Future: the centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance).
Wish list: distributed RHS. Is a distributed-RHS feature planned for near-future versions of MUMPS?
Specialized MUMPS Interfaces: Repetitive Solver
Vivaldi antenna
Antenna array
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Antenna array
(a) Unit cell; (b) Unit cell mesh
Figure: virtual mesh of antenna array
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
Identical matrices with different right-hand sides.
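The algebra behind the steps above can be sketched on a dense toy problem: partition the unit-cell matrix into interior (i) and boundary (b) blocks, form the Schur complement once (it is reusable across identical cells), solve on the interface, and recover the interiors. This is an illustrative NumPy sketch of the block elimination only, not the actual MUMPS-based implementation, which obtains the Schur complement from MUMPS itself:

```python
import numpy as np

rng = np.random.default_rng(1)
n_i, n_b = 8, 3   # interior and boundary (interface) unknowns of the cell
A = rng.standard_normal((n_i + n_b, n_i + n_b)) + 10 * np.eye(n_i + n_b)
b = rng.standard_normal(n_i + n_b)

Aii, Aib = A[:n_i, :n_i], A[:n_i, n_i:]
Abi, Abb = A[n_i:, :n_i], A[n_i:, n_i:]
bi, bb = b[:n_i], b[n_i:]

# Step 1: Schur complement of the unit cell (computed once, reusable)
S = Abb - Abi @ np.linalg.solve(Aii, Aib)
g = bb - Abi @ np.linalg.solve(Aii, bi)

# Steps 2-4: with a single cell, S itself is the interface problem
xb = np.linalg.solve(S, g)

# Step 5: interior solve -- identical matrix Aii, cell-dependent RHS
xi = np.linalg.solve(Aii, bi - Aib @ xb)

x = np.concatenate([xi, xb])
print(np.allclose(A @ x, b))   # -> True
```

With many virtual cells, step 2 sums the per-cell S contributions into the global interface matrix; step 5 then repeats the last solve per cell with the same Aii and different right-hand sides, which is exactly the saving the slide points out.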
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Features and Remarks
Advantages: savings in time and memory.
Under certain circumstances (number of cells equal to a power of two and no BC), all leaves at a given level of the tree are identical:
Further saving in time
Large saving in memory
Boundary conditions (BC) alter this one-branch-tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored instead.
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver
Multifrontal algorithm on only a few levels.
Iterative solution from the last level of the multifrontal algorithm.
It can be understood as the direct solver acting as a preconditioner for the iterative solver.
A natural approach to some DDM strategies.
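The "direct solver as preconditioner" idea can be illustrated generically (this is a sketch of the concept, not the code's actual hybrid solver): factorize only part of the matrix exactly, here its two diagonal blocks, and use those exact sub-solves to precondition a Krylov iteration on the full system.

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-10, maxit=200):
    """Preconditioned conjugate gradient; M_solve applies the preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# SPD test system
rng = np.random.default_rng(2)
Q = rng.standard_normal((60, 60))
A = Q @ Q.T + 60 * np.eye(60)
b = rng.standard_normal(60)

# "Direct" part: solve the two diagonal blocks exactly, as the preconditioner
# (a real solver would factorize B1, B2 once and reuse the factors)
n1 = 30
B1, B2 = A[:n1, :n1], A[n1:, n1:]
def M_solve(r):
    return np.concatenate([np.linalg.solve(B1, r[:n1]),
                           np.linalg.solve(B2, r[n1:])])

x = pcg(A, b, M_solve)
print(np.allclose(A @ x, b, atol=1e-6))   # -> True
```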
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors
Lack of availability of LU cofactors:
Calls to multiple (typically sequential) MUMPS instances for the Schur complements
Assembling the Schur complements
Finalizing the MUMPS instances
Solving the interface problem
Creating new MUMPS instances to solve the interior problems
Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors
MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
Reproduction (or restoration) of the interior matrix equation and interior right-hand side.
Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems.
Use of Dirichlet conditions for the interface unknowns.
Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time.
Regular MUMPS & MULTISOLVER: Time and Memory Comparison

Figure: solve time in seconds (0–1000) vs. number of unknowns (1×10⁶ to 6×10⁶), comparing MULTISOLVER and MUMPS.
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

Figure: memory in GB (0–140) vs. number of unknowns (1×10⁶ to 6×10⁶), comparing MULTISOLVER and MUMPS; the system limit (128 GB) is marked.
HPC Environment: Cluster of Xidian University (XDHPC)
140 compute nodes:
Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
64 GB of RAM
18 TB of hard disk
56 Gbps InfiniBand network
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Some Issues with Memory (cont)Memory Allocation inside MUMPS
Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process
WorkaroundMatrix partition based on rows instead of elements of the mesh
I Slightly worse LU fill-in (size of cofactors) than with partition based onelements
Change ordering of dof as input to MUMPS (to be done)
It may be the case we are doing something completely wrong
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Some Issues with MemoryLarge Number of RHS amp Solution Vectors
Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples
Monostatic radar cross section (RCS) predictionLarge arrays of antennas
Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)
Use of sparse format for RHSUse of centralized solution vectors
I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors
GREMA-UC3M Higher Order Finite Element Code v12
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors
Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS InterfacesRepetitive Solver
Vivaldi antenna
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Figure: Virtual Mesh of Antenna Array
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solution of the interface problem
5. Solution of the interior unit-cell problems
Identical matrices with different right-hand sides
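The five steps above can be sketched with plain linear algebra. Below is a minimal numpy illustration: the unit-cell blocks are random SPD placeholders, the "virtual" cells form an invented chain sharing one interface unknown per pair (`maps`), and dense `numpy.linalg.solve` stands in for MUMPS and its Schur-complement feature. Sizes and connectivity are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical unit cell: ni interior dofs, nb boundary (interface) dofs
ni, nb, ncells = 6, 3, 4
A = rng.standard_normal((ni + nb, ni + nb))
A = A @ A.T + (ni + nb) * np.eye(ni + nb)        # SPD so every block solve is safe
Aii, Aib = A[:ni, :ni], A[:ni, ni:]
Abi, Abb = A[ni:, :ni], A[ni:, ni:]

# step 1: Schur complement of the unit cell (computed once, reused by all cells)
S = Abb - Abi @ np.linalg.solve(Aii, Aib)

# "virtual" cells in a chain: consecutive cells share one global interface dof
nglob = nb * ncells - (ncells - 1)
maps = [np.arange(c * (nb - 1), c * (nb - 1) + nb) for c in range(ncells)]
f = [rng.standard_normal(ni + nb) for _ in range(ncells)]  # per-cell loads

# step 2: assemble the Schur complements into the interface problem
Sg, g = np.zeros((nglob, nglob)), np.zeros(nglob)
for c in range(ncells):
    fi, fb = f[c][:ni], f[c][ni:]
    Sg[np.ix_(maps[c], maps[c])] += S
    g[maps[c]] += fb - Abi @ np.linalg.solve(Aii, fi)

# steps 3-4: (boundary conditions would be imposed here) solve the interface problem
xb = np.linalg.solve(Sg, g)

# step 5: interior solves -- identical matrix, different right-hand sides
xi = [np.linalg.solve(Aii, f[c][:ni] - Aib @ xb[maps[c]]) for c in range(ncells)]
```

Because the condensation is exact block elimination, the recovered solution matches a monolithic solve of the fully assembled system.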
Specialized MUMPS Interfaces (cont.): Repetitive Solver
Features and Remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a certain level of the tree are identical
  - Further savings in time
  - Large savings in memory
- Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored
Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver
Hybrid Direct & Iterative Solver:
- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- Can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
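The idea of stopping the direct factorization early and iterating from there can be imitated in a few lines with SciPy: an incomplete LU factorization plays the role of the "truncated" multifrontal solve and preconditions GMRES. The 2-D Laplacian test matrix and all tolerances are illustrative stand-ins, not the actual solver described above.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# illustrative sparse system: 2-D Laplacian on a 20 x 20 grid (400 unknowns)
n1 = 20
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n1, n1))
A = (sp.kron(sp.eye(n1), T) + sp.kron(T, sp.eye(n1))).tocsc()
b = np.ones(A.shape[0])

# "multifrontal on only a few levels": an incomplete factorization that drops
# small fill-in instead of completing the elimination
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, matvec=ilu.solve)

# the (incomplete) direct solve acts as preconditioner of the iterative solver
x, info = spla.gmres(A, b, M=M, maxiter=200)   # info == 0 means convergence
```

The closer the partial factorization is to the complete one, the fewer GMRES iterations are needed, which is the time/memory trade-off the hybrid solver exploits.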
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU Cofactors:
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembly of the Schur complements
- Finalization of the MUMPS instances
- Solution of the interface problem
- Creation of new MUMPS instances to solve the interior problems
Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problems
Idea inspired by work led by Prof. Paszynski
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach is worthwhile in memory (as expected) and also competitive in time
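The interior recovery step can be sketched as a small routine: once the interface solution is known, each subdomain restores its interior blocks, imposes the interface values as Dirichlet data, and factorizes/solves independently. Here SciPy's sequential sparse LU stands in for a sequential MUMPS instance, and the blocks are random placeholders invented for the example.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_interior(A_ii, A_ib, f_i, x_b):
    """Solve one restored interior problem with the interface solution x_b
    imposed as Dirichlet data (one sequential factorization per subdomain)."""
    lu = spla.splu(sp.csc_matrix(A_ii))       # local direct factorization
    return lu.solve(f_i - A_ib @ x_b)         # Dirichlet lift moves x_b to the rhs

# illustrative data: a diagonally dominant (hence invertible) interior block
rng = np.random.default_rng(0)
ni, nb = 50, 8
A_ii = sp.random(ni, ni, density=0.1, random_state=0) + 10 * sp.eye(ni)
A_ib = sp.random(ni, nb, density=0.2, random_state=1)
f_i, x_b = rng.standard_normal(ni), rng.standard_normal(nb)
x_i = solve_interior(A_ii, A_ib, f_i, x_b)
```

Since the interior problems are mutually independent once x_b is fixed, this step is embarrassingly parallel across subdomains.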
Regular MUMPS & MULTISOLVER: Time and Memory Comparison
[Figure: solution time in seconds (0 to 1000) vs. number of unknowns (1x10^6 to 6x10^6), with curves for MULTISOLVER and regular MUMPS.]
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison
[Figure: memory in GB (0 to 140) vs. number of unknowns (1x10^6 to 6x10^6), with curves for MULTISOLVER and regular MUMPS; the system limit (128 GB) is marked.]
HPC Environment: Cluster of Xidian University (XDHPC)
Cluster of Xidian University:
- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 v2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network
Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression
- Band: 10–16 GHz
- Length: 218 mm
- 3245 K tetrahedra
- 22 M unknowns
- Wall time: 73 min per frequency point
I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Scattering Problem: Bistatic RCS of Car
- Bistatic RCS at 1.5 GHz
- Tyres modeled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car
- 2.7 M tetrahedra
- 17.3 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car
Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Repetitive Solver
Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Repetitive Solver
(a) Unit Cell (b) Unit Cell Mesh
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Repetitive Solver
Figure Virtual Mesh of Antenna Array
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Repetitive Solver
Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface
problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems
I Identical matrices with different right hand sides
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Repetitive Solver
Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical
I Further saving in timeI Large saving in memory
Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU Cofactors
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembling the Schur complements
- Finalizing the MUMPS instances
- Solving the interface problem
- Creating new MUMPS instances to solve the interior problems
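The steps above amount to static condensation. A toy dense NumPy sketch with a single subdomain (so the assembly step is trivial); in the real workflow each Schur complement is returned by a MUMPS instance:

```python
import numpy as np

rng = np.random.default_rng(2)
ni, nb = 12, 4             # interior and interface (boundary) unknowns

# Toy system partitioned as [[A_ii, A_ib], [A_bi, A_bb]]
A = rng.standard_normal((ni + nb, ni + nb)) + (ni + nb) * np.eye(ni + nb)
b = rng.standard_normal(ni + nb)
A_ii, A_ib = A[:ni, :ni], A[:ni, ni:]
A_bi, A_bb = A[ni:, :ni], A[ni:, ni:]
b_i, b_b = b[:ni], b[ni:]

# 1) Each (here: one) MUMPS-like instance returns the Schur complement
#    S = A_bb - A_bi A_ii^{-1} A_ib and the condensed RHS.
S = A_bb - A_bi @ np.linalg.solve(A_ii, A_ib)
g = b_b - A_bi @ np.linalg.solve(A_ii, b_i)

# 2)-4) Assemble the Schur complements (trivial with one subdomain),
#       finalize the instances, and solve the interface problem.
x_b = np.linalg.solve(S, g)

# 5) New instances solve the interior problem with the interface known.
x_i = np.linalg.solve(A_ii, b_i - A_ib @ x_b)

# The condensed solution matches the monolithic solve.
assert np.allclose(np.concatenate([x_i, x_b]), np.linalg.solve(A, b))
```

Because the LU cofactors of the first factorization are no longer available after the instances are finalized, step 5 must re-solve the interior problems from scratch, which is what motivates the next slide.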
Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problems
- Idea inspired by work led by Prof. Paszynski
- Reproduction (or restoration) of the interior matrix equation and interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach saves memory (as expected) while remaining competitive in time
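A dense sketch of this interior recovery under the stated assumptions: the already-computed interface values are imposed as Dirichlet data by overwriting the interface rows of the restored local system (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
ni, nb = 10, 3             # interior and interface unknowns
A = rng.standard_normal((ni + nb, ni + nb)) + (ni + nb) * np.eye(ni + nb)
b = rng.standard_normal(ni + nb)

x = np.linalg.solve(A, b)  # reference monolithic solution
x_b_known = x[ni:]         # interface values from the interface solve

# Restore the local matrix equation and impose Dirichlet conditions on
# the interface unknowns: identity rows + known values in the RHS.
A_loc = A.copy()
b_loc = b.copy()
A_loc[ni:, :] = 0.0
A_loc[ni:, ni:] = np.eye(nb)
b_loc[ni:] = x_b_known

# A (typically sequential) MUMPS-like instance factorizes/solves this
# local problem; only the interior part of the result is needed.
x_loc = np.linalg.solve(A_loc, b_loc)
assert np.allclose(x_loc[:ni], x[:ni])
```

Each subdomain's local solve is independent, which is why many small sequential instances can run concurrently and why the memory footprint stays low.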
Regular MUMPS & MULTISOLVER: Time and Memory Comparison
[Figure: solution time (seconds, 0 to 1000) vs. number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS]
Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison
[Figure: memory (GB) vs. number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS; system limit: 128 GB]
HPC Environment: Cluster of Xidian University (XDHPC)
- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 v2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network
Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression
- Band: 10–16 GHz
- Length: 218 mm
- 324.5 K tetrahedra
- 2.2 M unknowns
- Wall time: 73 min per frequency point
Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.
Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression
[Results figures]
Scattering Problem: Bistatic RCS of Car
- Bistatic RCS at 1.5 GHz
- Tyres modelled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
Scattering Problem (cont.): Bistatic RCS of Car
- 2.7 M tetrahedra
- 17.3 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)
Scattering Problem (cont.): Bistatic RCS of Car
[RCS figures: incident plane wave arriving from behind; incident plane wave arriving from the front (two figures)]
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Repetitive Solver
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver
Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors
Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems
GREMA-UC3M Higher Order Finite Element Code v12
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors
MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski
Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Regular MUMPS amp MULTISOLVERTime and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0 0
4 0 0
6 0 0
8 0 0
1 0 0 0Se
cond
s
u n k n o w n s
M U L T I S O L V E R M U M P S
GREMA-UC3M Higher Order Finite Element Code v12
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison
1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60
2 0
4 0
6 0
8 0
1 0 0
1 2 0
1 4 0Me
mory
in GB
u n k n o w n s
M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )
GREMA-UC3M Higher Order Finite Element Code v12
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from the front
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityFactorization and FE-IIEE Stages
1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0
4 9 5 2 5 7
7 0
factor
izatio
n + so
lving s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to thefactorization phase
1 2 0 2 4 0 3 6 0 4 8 0
1 0
1 5
2 0
2 5
3 0
3 5
4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
7 8 8 5
1 0 0
FE-IIE
E spe
edup
n u m b e r o f c o r e sSpeedup graph corresponding to the
mesh truncation phase
Benchmark Bistatic RCS of Impala
GREMA-UC3M Higher Order Finite Element Code v12
Parallel ScalabilityWhole Code
1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4
6 2 6 4 7 0
8 0 fac
toriza
tion +
solvin
g + FE
-IIEE s
peed
up
n u m b e r o f c o r e s
1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s
Speedup graph corresponding to the whole code
GREMA-UC3M Higher Order Finite Element Code v12
Work in Progress and Future Work
Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes
Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver
GREMA-UC3M Higher Order Finite Element Code v12
Thanks for your attention
Thanks to the MUMPS team
HPC EnvironmentCluster of Xidian University (XDHPC)
Cluster of Xidian University140 compute nodes
I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk
56 Gbps InfiniBand network
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression
[10 minus 16] GHz
Length 218 mm
3245 K tetrahedrons
22 M unknowns
Wall time 73 min per freq point
I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression
GREMA-UC3M Higher Order Finite Element Code v12
Scattering ProblemBistatic RCS of Car
Bistatic RCS at 15 GHz
Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
27 M tetrahedrons
173 M unknownsWall time 59 min per freq point(46 compute nodes)
GREMA-UC3M Higher Order Finite Element Code v12
Scattering Problem (cont)Bistatic RCS of Car
q Incident plane wave arriving from behind
Scattering Problem (cont.): Bistatic RCS of a Car
- Incident plane wave arriving from the front
Parallel Scalability: Factorization and FE-IIEE Stages
[Figure: speedup of the factorization + solving phase vs. number of cores (120–480), for 12, 6, and 4 threads per process]
[Figure: speedup of the FE-IIEE (mesh truncation) phase vs. number of cores (120–480), for 12, 6, and 4 threads per process]
Benchmark: Bistatic RCS of an Impala
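The speedups plotted here are the usual ratio of reference wall time to wall time at n cores. A minimal sketch with hypothetical timings, not the measured XDHPC data:

```python
def speedup(t_ref: float, t_n: float) -> float:
    """Parallel speedup: wall time of the reference run divided by
    wall time of the run under test."""
    return t_ref / t_n

def efficiency(t_ref: float, t_n: float, cores_ref: int, cores_n: int) -> float:
    """Parallel efficiency relative to the reference core count."""
    return speedup(t_ref, t_n) * cores_ref / cores_n

# Hypothetical example: 600 s on 12 cores vs. 80 s on 120 cores.
print(speedup(600.0, 80.0))              # 7.5
print(efficiency(600.0, 80.0, 12, 120))  # 0.75
```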
Parallel Scalability: Whole Code
[Figure: speedup of the whole code (factorization + solving + FE-IIEE) vs. number of cores (120–480), for 12, 6, and 4 threads per process]
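The whole-code speedup can be related to the per-stage speedups by an Amdahl-style combination: if stage i takes fraction f_i of the serial wall time and achieves speedup S_i, the overall speedup is 1 / Σ(f_i / S_i). A small sketch with hypothetical fractions (the actual time split between factorization + solving and FE-IIEE is not given in the slides):

```python
def overall_speedup(fractions, stage_speedups):
    """Amdahl-style combination: overall speedup of a pipeline whose
    stage i takes fraction f_i of serial time and speeds up by S_i."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return 1.0 / sum(f / s for f, s in zip(fractions, stage_speedups))

# Hypothetical 70%/30% split between the two stages.
print(overall_speedup([0.7, 0.3], [70.0, 100.0]))  # ~76.9
```

The slower-scaling stage dominates: even a modest serial-time fraction with a low speedup pulls the whole-code figure well below the best stage.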
Work in Progress and Future Work
Work in progress:
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes
Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver
Thanks for your attention.
Thanks to the MUMPS team.