Scalable Parallel Solvers for Highly Heterogeneous Nonlinear Stokes Flow Discretized with Adaptive High-Order Finite Elements

Johann Rudi (1), Tobin Isaac (1), Georg Stadler (2), Michael Gurnis (3), and Omar Ghattas (1,4)

(1) Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin, USA
(2) Courant Institute of Mathematical Sciences, New York University, USA
(3) Seismological Laboratory, California Institute of Technology, USA
(4) Jackson School of Geosciences and Department of Mechanical Engineering, The University of Texas at Austin, USA


Summary of main results

- Geometric multigrid (GMG) for preconditioning Stokes systems
- Novel GMG-based BFBT/LSC pressure Schur complement preconditioner
- Repartitioning on coarse GMG levels for load balancing and MPI communicator reduction
- Algebraic multigrid (AMG) as coarse solver for GMG avoids full AMG setup cost and large matrix assembly
- High-order finite elements
- Adaptive meshes resolving heterogeneous viscosity with variations of up to 6 orders of magnitude
- Octree algorithms for handling adaptive meshes in parallel
- Parallel scalability results on up to 16,384 CPU cores (MPI)
- Inexact Newton-Krylov method for highly nonlinear rheology
- Global-scale simulation of Earth's mantle flow

1. Earth mantle flow

Model equations for mantle convection with plate tectonics

Rock in the mantle moves like a viscous, incompressible fluid on time scales of millions of years. From conservation of mass and momentum, we obtain that the instantaneous flow velocity can be modeled as a nonlinear Stokes system.

\[
-\nabla \cdot \left[ \mu(T,u)\, (\nabla u + \nabla u^\top) \right] + \nabla p = f(T), \qquad \nabla \cdot u = 0
\]

with velocity u, pressure p, temperature T, and viscosity \mu.

The right-hand side forcing f is derived from the Boussinesq approximation and depends on the temperature. The viscosity \mu depends exponentially on the temperature (via an Arrhenius relationship) and on a power of the second invariant of the strain rate tensor, and it incorporates plastic yielding as well as lower and upper bounds.

\[
\mu(T,u) = \max\left(\mu_{\min},\ \min\left(\frac{\tau_{\text{yield}}}{2\,\dot\varepsilon(u)},\ w \min\left(\mu_{\max},\ a(T)\,\dot\varepsilon(u)^{\frac{1-n}{n}}\right)\right)\right)
\]

with a factor a(T) that depends exponentially on temperature, plate decoupling w(x), viscosity bounds 0 < \mu_{\min} < \mu_{\max}, yield stress 0 < \tau_{\text{yield}}, exponent n ≈ 3, and \dot\varepsilon(u) the square root of the 2nd invariant of the strain rate tensor.
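To make the rheology concrete, here is a minimal sketch (not the authors' code) that evaluates the effective viscosity pointwise with numpy; the Arrhenius-type prefactor a(T), the activation parameter E, and all numerical values are illustrative assumptions.

```python
import numpy as np

def effective_viscosity(T, edot, w=1.0,
                        mu_min=1e18, mu_max=1e24,
                        tau_yield=1e8, n=3.0, E=9.0):
    """Pointwise effective viscosity
       mu = max(mu_min, min(tau_yield / (2*edot), w * min(mu_max, a(T) * edot**((1-n)/n)))).
    T:    temperature (nondimensional here, purely for illustration)
    edot: square root of the 2nd invariant of the strain rate tensor
    The prefactor a(T) below is an assumed Arrhenius-type stand-in."""
    a_T = 1e21 * np.exp(E * (0.5 - T))           # assumed temperature dependence
    creep = a_T * edot ** ((1.0 - n) / n)        # power-law creep weakening
    upper = w * np.minimum(mu_max, creep)        # plate decoupling and upper bound
    yielding = tau_yield / (2.0 * edot)          # plastic yielding branch
    return np.maximum(mu_min, np.minimum(yielding, upper))

# Larger strain rates push the viscosity onto the yielding/weakening branches,
# mimicking the weak, narrow plate boundaries described above.
print(effective_viscosity(T=0.3, edot=1e-15))
print(effective_viscosity(T=0.3, edot=1e-13))
```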

Central open questions

- Main drivers of plate motion: negative buoyancy forces or convective shear traction?
- Strength of plate coupling and amount of energy dissipation in hinge zones
- Role of subducting slab geometries
- Accuracy of rheology extrapolations derived from laboratory experiments

Research target

Global simulation of the Earth's instantaneous mantle convection and associated plate tectonics with realistic parameters and high resolutions down to faulted plate boundaries.

2. Solver challenges of global-scale mantle flow

Inherent challenges of realistic Earth mantle flow simulations:

- Severe nonlinearity, heterogeneity, and anisotropy of the Earth's rheology with a wide range of spatial scales
- Highly localized features with respect to Earth's radius (~6371 km), like plate thickness ~50 km and shearing zones at plate boundaries ~5 km
- 6 orders of magnitude viscosity contrast within ~5 km thin plate boundaries

Highly accurate numerical simulations require:

- Resolution down to ~1 km at plate boundaries (a uniform mesh of Earth's mantle would result in computationally prohibitive O(10^{12}) degrees of freedom). Enabled by: adaptive mesh refinement
- Velocity approximation with high accuracy and local mass conservation. Enabled by: high-order discretizations

Effective viscosity field and adaptive mesh resolving narrow plate boundaries (in red). Visualization by L. Alisic.

3. Scalable Stokes solver

High-order finite element discretization of the Stokes system

\[
\begin{cases}
-\nabla \cdot \left[ \mu \, (\nabla u + \nabla u^\top) \right] + \nabla p = f \\
\nabla \cdot u = 0
\end{cases}
\quad \xrightarrow{\text{discretize with high-order FE}} \quad
\begin{bmatrix} A & B^\top \\ B & 0 \end{bmatrix}
\begin{bmatrix} u \\ p \end{bmatrix}
=
\begin{bmatrix} f \\ 0 \end{bmatrix}
\]

- High-order finite element shape functions
- Inf-sup stable velocity-pressure pairings: Q_k × P^disc_{k-1} with 2 ≤ k
- Locally mass conservative due to the discontinuous pressure space
- Fast, matrix-free application of stiffness and mass matrices
- Hexahedral elements allow exploiting the tensor product structure of the basis functions to greatly reduce the number of floating point operations (see the sum-factorization sketch after this list)
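The tensor-product remark above can be illustrated with sum factorization: applying a 3D operator of Kronecker form D ⊗ D ⊗ D costs three small tensor contractions instead of one large matrix-vector product. This is a generic numpy sketch under that assumption, not the poster's matrix-free kernels.

```python
import numpy as np

def apply_tensor_product(D, x):
    """Apply (D kron D kron D) to x via sum factorization:
    three small contractions, O(m * n^3) work per direction,
    instead of assembling the full Kronecker matrix."""
    m, n = D.shape
    X = x.reshape(n, n, n)
    X = np.einsum('ai,ijk->ajk', D, X)   # contract first direction
    X = np.einsum('bj,ajk->abk', D, X)   # contract second direction
    X = np.einsum('ck,abk->abc', D, X)   # contract third direction
    return X.reshape(m * m * m)

# Consistency check against the assembled Kronecker operator (small sizes only)
n, m = 4, 6                              # e.g., nodes -> quadrature points per direction
D = np.random.default_rng(0).random((m, n))
x = np.random.default_rng(1).random(n ** 3)
assert np.allclose(apply_tensor_product(D, x), np.kron(np.kron(D, D), D) @ x)
```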

Linear solver: Preconditioned Krylov subspace method

Coupled iterative solver: GMRES with upper triangular block preconditioning

\[
\underbrace{\begin{bmatrix} A & B^\top \\ B & 0 \end{bmatrix}}_{\text{Stokes operator}}
\underbrace{\begin{bmatrix} \tilde{A} & B^\top \\ 0 & -\tilde{S} \end{bmatrix}^{-1}}_{\text{preconditioner}}
\begin{bmatrix} u' \\ p' \end{bmatrix}
=
\begin{bmatrix} f \\ 0 \end{bmatrix}
\]

Approximating the inverse of the viscous stress block, Ã^{-1} ≈ A^{-1}, is well suited for multigrid methods.
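Applying the upper triangular block preconditioner amounts to one approximate Schur complement solve followed by one approximate viscous block solve per Krylov iteration. Below is a minimal dense sketch in which direct solves stand in for the Ã^{-1} and S̃^{-1} multigrid approximations; the matrix sizes and random data are placeholders.

```python
import numpy as np

def apply_block_triangular_prec(A_tilde_solve, S_tilde_solve, B, r_u, r_p):
    """Apply [A~  B^T; 0  -S~]^{-1} to a residual (r_u, r_p):
         p' = -S~^{-1} r_p
         u' =  A~^{-1} (r_u - B^T p')
    A_tilde_solve, S_tilde_solve: callables approximating A^{-1} and S^{-1}
    (multigrid V-cycles on the poster; any solver works in this sketch)."""
    p_new = -S_tilde_solve(r_p)
    u_new = A_tilde_solve(r_u - B.T @ p_new)
    return u_new, p_new

# Toy usage with dense solves standing in for multigrid
rng = np.random.default_rng(0)
nu, npr = 12, 4
M = rng.standard_normal((nu, nu))
A = M @ M.T + nu * np.eye(nu)                     # SPD stand-in for the viscous block
B = rng.standard_normal((npr, nu))
S = B @ np.linalg.solve(A, B.T)                   # exact Schur complement (toy scale)
u_new, p_new = apply_block_triangular_prec(
    lambda r: np.linalg.solve(A, r),
    lambda r: np.linalg.solve(S, r),
    B, rng.standard_normal(nu), rng.standard_normal(npr))
```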

BFBT/LSC Schur complement approximation S̃^{-1}

Improved BFBT / Least Squares Commutator (LSC) method:

\[
\tilde{S}^{-1} = (B D^{-1} B^\top)^{-1} (B D^{-1} A D^{-1} B^\top) (B D^{-1} B^\top)^{-1}
\]

with diagonal scaling D := diag(A). Here, approximating the inverse of the discrete pressure Laplacian, (BD^{-1}B^T), is well suited for multigrid methods.

Derivation: Consider the least squares problem of a commutation relationship. Find the minimizing matrix X for

\[
\min_X \left\| A D^{-1} B^\top e_j - B^\top X e_j \right\|^2_{C^{-1}} \quad \text{for all } j,
\]

where the matrix C is s.p.d., the matrix D is invertible but otherwise arbitrary for now, and e_j is the j-th unit vector. The solution X = (B C^{-1} B^\top)^{-1} (B C^{-1} A D^{-1} B^\top) gives a C^{-1}-orthogonal projection, i.e.,

\[
\left\langle B^\top e_i,\ (A D^{-1} B^\top - B^\top X)\, e_j \right\rangle_{C^{-1}} = 0 \quad \text{for all } i, j.
\]

From the choice C^{-1} = Ã^{-1}, e.g., a multigrid V-cycle, we obtain

\[
\left\langle B^\top e_i,\ (A D^{-1} B^\top - B^\top X)\, e_j \right\rangle_{\tilde{A}^{-1}} = 0 \quad \text{for all } i, j
\quad \Leftrightarrow \quad
\tilde{S} = B \tilde{A}^{-1} B^\top,
\]

which represents an optimal preconditioner for the right-preconditioned discrete Stokes system. A computationally feasible choice is C^{-1} = D^{-1} = diag(A)^{-1}.
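A sketch of applying S̃^{-1} = (BD^{-1}B^T)^{-1}(BD^{-1}AD^{-1}B^T)(BD^{-1}B^T)^{-1} to a pressure vector, assuming A and B are available as scipy sparse matrices; a sparse LU factorization of the pressure Poisson operator stands in for the GMG V-cycles used on the poster.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_bfbt_apply(A, B):
    """Return a callable applying the BFBT/LSC approximation
       S~^{-1} = (B D^{-1} B^T)^{-1} (B D^{-1} A D^{-1} B^T) (B D^{-1} B^T)^{-1}
    with D = diag(A). A sparse factorization of (B D^{-1} B^T), the discrete
    pressure Poisson operator, stands in for a multigrid V-cycle here."""
    Dinv = sp.diags(1.0 / A.diagonal())
    P = (B @ Dinv @ B.T).tocsc()                  # discrete pressure Poisson operator
    P_solve = spla.factorized(P)                  # stand-in for GMG on P
    middle = B @ Dinv @ A @ Dinv @ B.T
    def apply(r_p):
        return P_solve(middle @ P_solve(r_p))
    return apply

# Usage inside the block preconditioner sketch above: S_tilde_solve = make_bfbt_apply(A, B)
```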

4. Stokes solver robustness with scaled BFBT Schur complement approximation

The subducting plate model problem on a cross section of the spherical Earth domain serves as a benchmark for solver robustness. (Figure: subduction model viscosity field.)

Multigrid parameters: GMG for Ã: 1 V-cycle, 3+3 smoothing; GMG for (BD^{-1}B^T): 1 V-cycle, 3+3 smoothing, and additional 6+6 smoothing in the discontinuous, modal pressure space.

Robustness with respect to plate boundary thickness

[Figure: GMRES convergence for plate boundary thicknesses of 10 km, 5 km, and 2 km. Each panel plots the relative l2 residual norm (10^0 down to 10^{-8}) versus GMRES iteration (0 to 300) for the viscous stress solve, the Stokes system with mass matrix preconditioning, and the Stokes system with BFBT.]

Convergence for solving Au = f (gray), the Stokes system with BFBT (blue), and the Stokes system with a viscosity-weighted mass matrix as Schur complement approximation (red) for comparison to conventional preconditioning.

5. Parallel octree-based adaptive mesh refinement

Idea: Identify octree leaves with hexahedral elements.

- Octree structure enables fast parallel adaptive octree/mesh refinement and coarsening
- Octrees and space-filling curves enable fast neighbor search, repartitioning, and 2:1 balancing in parallel (see the sketch after this list)
- Algebraic constraints on non-conforming element faces with hanging nodes enforce global continuity of the velocity basis functions
- Demonstrated scalability to O(500K) cores (MPI)

p4est library
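The space-filling-curve idea can be illustrated with a Morton (Z-order) key: interleaving the bits of an octant's integer coordinates yields a one-dimensional ordering, and cutting that ordering into contiguous pieces partitions the leaves across ranks. This is a generic sketch of the concept, not p4est's implementation.

```python
def morton_key(x, y, z, bits=19):
    """Interleave the bits of integer octant coordinates into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key

def partition(leaves, num_ranks):
    """Sort leaves (x, y, z) along the Z-order curve and split them into
    num_ranks contiguous chunks -- the basic load-balancing step."""
    leaves = sorted(leaves, key=lambda c: morton_key(*c))
    n = len(leaves)
    return [leaves[r * n // num_ranks:(r + 1) * n // num_ranks]
            for r in range(num_ranks)]

# Example: 8 coarse octants distributed over 2 ranks
print(partition([(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)], 2))
```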

6. Parallel adaptive high-order geometric multigrid

The hybrid multigrid hierarchy coarsens the adaptive octree-based mesh in stages:

- p-GMG: p-coarsening from high-order F.E. to trilinear F.E.
- h-GMG: geometric h-coarsening of the octree mesh
- AMG: algebraic coarsening on a small number of cores with a reduced MPI communicator
- direct solve at the coarsest level

Geometric multigrid method: p-GMG and h-GMG

- Parallel repartitioning of coarser h-GMG meshes is important to maintain load balancing of the adaptive meshes
- Sufficiently coarse meshes are repartitioned on subsets of cores; the MPI communicator is reduced to the nonempty cores
- High-order L2-projection of coefficients onto coarser levels
- Re-discretization of the differential equations at each coarser p- and h-GMG level
- Smoother: Chebyshev-accelerated Jacobi (PETSc) with matrix-free differential operator-apply functions, avoiding full matrix assembly (see the sketch after this list)
- Restriction & interpolation: high-order L2-projection; the restriction and interpolation operators are adjoints of each other in the L2 sense
- No collective communication needed in GMG cycles
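A sketch of the Chebyshev-accelerated Jacobi smoother mentioned in the list above, using only a matrix-free operator apply and the operator diagonal. The eigenvalue-bound factors and the crude power-iteration estimate are illustrative assumptions; the poster relies on PETSc's implementation.

```python
import numpy as np

def chebyshev_jacobi_smooth(apply_A, diag_A, b, x, steps=3,
                            lam_max_est=None, lo=0.1, hi=1.1):
    """Chebyshev iteration for A x = b, preconditioned by the Jacobi diagonal,
    targeting eigenvalues of D^{-1} A in [lo * lam_max, hi * lam_max].
    apply_A is a matrix-free operator apply; diag_A is the diagonal of A."""
    Dinv = 1.0 / diag_A
    if lam_max_est is None:                       # crude power-iteration estimate
        v = np.random.default_rng(1).standard_normal(len(b))
        for _ in range(10):
            v = Dinv * apply_A(v)
            v /= np.linalg.norm(v)
        lam_max_est = v @ (Dinv * apply_A(v))
    lam_min, lam_max = lo * lam_max_est, hi * lam_max_est
    theta, delta = 0.5 * (lam_max + lam_min), 0.5 * (lam_max - lam_min)
    sigma = theta / delta
    rho = 1.0 / sigma
    r = b - apply_A(x)
    d = (Dinv * r) / theta
    for _ in range(steps):                        # three-term Chebyshev recurrence
        x = x + d
        r = r - apply_A(d)
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * (Dinv * r)
        rho = rho_new
    return x

# Example: 3 smoothing steps on a 1D Laplacian with a matrix-free apply
n = 64
def apply_lap(v):
    w = 2.0 * v
    w[:-1] -= v[1:]
    w[1:] -= v[:-1]
    return w
x_smooth = chebyshev_jacobi_smooth(apply_lap, np.full(n, 2.0), np.ones(n), np.zeros(n))
```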

Coarse solver for geometric multigrid: AMG (PETSc's GAMG)

- Coarse problems use only small core counts, usually O(100)
- The MPI communicator is reduced to the nonempty cores

GMG for (BD^{-1}B^T) on the discontinuous, modal pressure space

Novel approach: Re-discretize the underlying variable-coefficient Laplace operator with continuous, nodal high-order finite elements in Q_k.

- The coefficient of the Laplace operator is derived from the diagonal scaling D^{-1}
- Apply GMG as described above to the continuous, nodal Q_k re-discretization of the pressure Laplace operator
- On the finest level, additionally apply the smoother in the space P^disc_{k-1}

7. Convergence dependence on mesh size and discretization order

h-dependence using geometric multigrid for Ã and (BD^{-1}B^T)

The mesh is increasingly refined while the discretization stays fixed to Q_2 × P^disc_1. Performed with the subducting plate model problem (see above).

[Figure: GMRES convergence (relative l2 residual norm, 10^0 down to 10^{-6}, versus iteration, 0 to 250) for solving Au = f (velocity DOF 4.6M, 13.4M, 32.5M), solving (BD^{-1}B^T) p = g (pressure DOF 0.9M, 2.6M, 6.3M), and solving the Stokes system (velocity+pressure DOF 5.5M, 16.0M, 38.8M).]

Multigrid parameters: GMG for Ã: 1 V-cycle, 3+3 smoothing; GMG for (BD^{-1}B^T): 1 V-cycle, 3+3 smoothing, and additional 6+6 smoothing in the discontinuous, modal pressure space.

p-dependence using geometric multigrid for Ã and (BD^{-1}B^T)

The discretization order of the finite element space increases while the mesh stays fixed. Performed with the subducting plate model problem (see above).

[Figure: GMRES convergence (relative l2 residual norm, 10^0 down to 10^{-6}, versus iteration, 0 to 250) for solving Au = f with Q1 through Q5 velocity discretizations, solving (BD^{-1}B^T) p = g with P1 through P4 pressure discretizations, and solving the Stokes system with the pairings Q2-P1 through Q5-P4.]

Multigrid parameters: GMG for Ã: 1 V-cycle, 3+3 smoothing; GMG for (BD^{-1}B^T): 1 V-cycle, 3+3 smoothing, and additional 6+6 smoothing in the discontinuous, modal pressure space.

Remark: The deteriorating Stokes convergence with increasing order is due to a deteriorating approximation of the Schur complement by the BFBT method and not the multigrid components.

8. Parallel scalability of geometric multigrid

Global problem on an adaptively refined mesh of the Earth's mantle

- Locally refined mesh with up to 6 refinement levels difference
- Q_2 × P^disc_1 discretization
- Constant AMG setup time throughout all core counts, accounting for <10 percent of total setup

Stampede at the Texas Advanced Computing Center

- 16 CPU cores per node (2 × 8-core Intel Xeon E5-2680)
- 32 GB main memory per node (8 × 4 GB DDR3-1600 MHz)
- 1,024 nodes or 16,384 cores used for scalability (MPI)

Weak scalability with increasingly locally refined Earth mesh

Weak efficiency* of solve time:

  number of cores   128    256    512    1024   2048   4096   8192   16384
  Au = f solve      1      1.04   0.96   0.89   0.9    0.91   0.83   0.84
  linear Stokes     1      0.99   0.95   0.92   0.94   0.92   0.89   0.88

Detailed timings for solving Au = f

  #cores   velocity DOF   setup time (s): AMG, total   solve time (s)
  128      21M            0.3, 3.0                     64.7
  256      42M            0.5, 3.3                     62.5
  512      82M            0.5, 3.8                     65.1
  1024     162M           0.6, 4.6                     69.4
  2048     329M           0.3, 5.3                     69.4
  4096     664M           0.5, 8.0                     69.8
  8192     1333M          0.7, 12.9                    76.6
  16384    2668M          0.3, 21.6                    76.1

Detailed timings for solving the linear Stokes system

  #cores   total DOF (velocity+pressure)   setup time (s)   solve time (s)
  128      25M                             6.4              256.1
  256      50M                             7.5              258.7
  512      97M                             7.3              262.1
  1024     191M                            8.1              269.1
  2048     386M                            9.6              266.0
  4096     782M                            11.2             274.1
  8192     1567M                           17.6             284.2
  16384    3131M                           26.1             287.2

*Weak efficiency baseline is 128 cores

Strong scalability with a fixed locally refined Earth mesh

Strong efficiency* of solve time:

  number of cores   128    256    512    1024   2048   4096   8192   16384
  Au = f solve      1      0.97   0.94   0.86   0.8    0.71   0.57   0.48
  linear Stokes     1      1      0.95   0.91   0.87   0.83   0.7    0.52

*Strong efficiency baseline is 128 cores
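For reference, the efficiencies shown above are the usual ratios against the 128-core baseline; the sketch below states the two formulas. The poster's exact accounting (e.g., whether setup time is included) is not specified here, and the strong-scaling times in the example are hypothetical.

```python
def weak_efficiency(t_base, t_p):
    """Weak scaling: work per core is fixed, so the ideal time stays flat.
    Efficiency = baseline time / measured time."""
    return t_base / t_p

def strong_efficiency(t_base, p_base, t_p, p):
    """Strong scaling: total work is fixed, so the ideal time shrinks like 1/p.
    Efficiency = (t_base * p_base) / (t_p * p)."""
    return (t_base * p_base) / (t_p * p)

# Weak scaling of the Au = f solve time from the table above (64.7 s -> 76.1 s)
print(weak_efficiency(64.7, 76.1))                 # ~0.85
# Hypothetical fixed-size solve: 10 s on 128 cores, 0.16 s on 16384 cores
print(strong_efficiency(10.0, 128, 0.16, 16384))   # ~0.49
```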

9. Scalable nonlinear Stokes solver: Inexact Newton-Krylov method

The Newton update (ũ, p̃) is computed as the inexact solution of

\[
-\nabla \cdot \left[ \left( \mu\, \mathrm{I} + \dot\varepsilon \frac{\partial \mu}{\partial \dot\varepsilon}\,
\frac{(\nabla u + \nabla u^\top) \otimes (\nabla u + \nabla u^\top)}{\|\nabla u + \nabla u^\top\|_F^2} \right)
(\nabla \tilde{u} + \nabla \tilde{u}^\top) \right] + \nabla \tilde{p} = -r_{\mathrm{mom}},
\qquad
\nabla \cdot \tilde{u} = -r_{\mathrm{mass}}.
\]
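The nonlinear solve can be sketched as an inexact Newton loop: each step solves the linearized system above only to a loose relative tolerance that tightens as the nonlinear residual decreases (an Eisenstat-Walker-type forcing term), followed by a backtracking line search. The residual, linearized-solve, and norm routines are assumed to be supplied by the caller; this is a generic sketch, not the poster's implementation.

```python
def inexact_newton(residual, solve_linearized, norm, x0,
                   rtol=1e-6, max_steps=30, eta0=1e-2, eta_min=1e-8):
    """Inexact Newton with a superlinearly tightening forcing term and a
    backtracking line search on a user-supplied residual norm.
      residual(x)                 -> nonlinear residual r
      solve_linearized(x, r, tol) -> approximate Newton update dx, solved to
                                     relative Krylov tolerance tol
      norm(r)                     -> residual norm (e.g., H^-1 norm for velocity)"""
    x = x0
    r = residual(x)
    r0_norm = r_norm = norm(r)
    eta = eta0
    for _ in range(max_steps):
        if r_norm <= rtol * r0_norm:
            break
        dx = solve_linearized(x, r, eta)
        alpha = 1.0                                 # backtracking line search
        while True:
            r_trial = residual(x + alpha * dx)
            if norm(r_trial) < (1.0 - 1e-4 * alpha) * r_norm or alpha < 1e-3:
                break
            alpha *= 0.5
        x = x + alpha * dx
        r, r_norm_old, r_norm = r_trial, r_norm, norm(r_trial)
        # Forcing term shrinks with nonlinear progress (Eisenstat-Walker-like)
        eta = max(eta_min, min(eta0, 0.9 * (r_norm / r_norm_old) ** 2))
    return x
```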

- Krylov tolerance for the inexact update computation decreases with subsequent Newton steps to achieve superlinear convergence
- Number of Newton steps is independent of the mesh size
- Velocity residual is measured in the H^{-1}-norm for the backtracking line search; this avoids overly conservative update steps ≪ 1 (evaluation of the residual norm requires 3 scalar constant-coefficient Laplace solves, which are performed by PCG with GMG preconditioning; see the sketch after this list)
- Grid continuation at initial Newton steps: adaptive mesh refinement to resolve increasing viscosity variations arising from the nonlinear dependence on the velocity
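The H^{-1}-type residual norm used in the line search can be sketched as follows: for each velocity component the dual-norm contribution r_i^T K^{-1} r_i is evaluated with a scalar constant-coefficient Laplace solve, i.e., three Poisson solves in total. Here scipy's CG stands in for the GMG-preconditioned PCG, and the toy operator K is an assumption for illustration.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def h_minus1_norm(r_components, K):
    """Approximate H^{-1}-type norm of a 3-component velocity residual:
       ||r||^2 ~ sum_i r_i^T K^{-1} r_i,
    with K a scalar constant-coefficient Laplace operator, i.e., one Poisson
    solve per component (plain CG here, PCG with GMG on the poster)."""
    total = 0.0
    for r_i in r_components:
        z_i, info = spla.cg(K, r_i)
        assert info == 0
        total += r_i @ z_i
    return np.sqrt(total)

# Toy usage: 1D Dirichlet Laplacian as K, random residual components
n = 50
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
r = [np.random.default_rng(k).standard_normal(n) for k in range(3)]
print(h_minus1_norm(r, K))
```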

Convergence of inexact Newton-Krylov (16,384 cores)

[Figure: Relative l2 residual norm (10^0 down to 10^{-12}) versus accumulated GMRES iterations (0 to 4500) over 26 Newton steps, showing the nonlinear residual and the GMRES residual.]

Plate velocities at nonlinear solution.

Adaptive mesh refinement after the first Newton step is indicated by a black vertical line. 2.3B velocity & pressure DOF at solution; 459 min total runtime on 16,384 cores.

SIAM Conference on Computational Science and Engineering (CSE15) Salt Lake City, Utah, USA March 14–18, 2015