Parallel Domain Decomposition Preconditioning Timothy J. Barth · OMP.F_Mxrla.V_ IdektpJ_ GldM8...
Transcript of Parallel Domain Decomposition Preconditioning Timothy J. Barth · OMP.F_Mxrla.V_ IdektpJ_ GldM8...
Parallel Domain Decomposition Preconditioning
For Computational Fluid Dynamics
VecPar '98, Porto, Portugal, June 21-23, 1998
Timothy J. Barth
Information Sciences Directorate
NASA Ames Research Center
Moffett Field, California USA, 9LI035
Tony F. Chan
Department of Mathematics
University of California, Los Angeles
Los Angeles, California USA, 9002LI
Wei-Pai Tang
Department of Computer Science
University of Waterloo
Waterloo, Ontario Canada, N2L 3G1
Barth, Chart, Tang (1998)
https://ntrs.nasa.gov/search.jsp?R=20020023597 2020-05-25T14:23:11+00:00Z
I Outline
• Objectives
• Some Difficult Fluid Flow Prob_ms
• Stabilized Spatial Discretizations
• Newton's Method for Solving the Discretized
Flow Equations
* Schur Complement Domain Decomposition
- Basic Formulation
- Simplifying Strategies
• Iteratiye Su_.bd.__n and Schur
Complement Solves-
, Matrix Element Dropping
• Localized Schur Complement
Computation
• Supersparse Computations
- Performance Evaluation
• Concluding Remarks
Barth, Chsn. T_ng (1998)
!
Objectives
* Simulate Compressible Navier-Stokes Flow
About General Geometries
- Unstructured (Simplicial) Meshes
- Stabilized Numerical Space Discretizations
- Exact and Approximate Newton Iterations
• Obtain ParallelScalabilityand Efficiency
SchurComplement Domain Decomposition
- ILU + GMRES on Subdomain Problems
- Emphasis on CoarseGrain Parallism
• SGI Origin2000(64processors)
• SGI Array (40processors)
• CRAY Jg0 Cluster(20processors)
B&rth, Chan, T,tng (1098)
Some Difficult Flow Prob_ms
Entrance/Exit Flow
Recirculation Flow
Boundary-layer Flow
Barth, Chan, Tang (1998)
_r
Advection Dominated Model Problems
m I
o,L ____I I L .IE".........=t[-...oO-,_+_._.o- I...... i i'-_i'- yu--..._-o I
u. +u u =0(.._o
Convergence of ILU+GMRES for Cuthill-McKee
ordered matrices produced from scalar SUPG spatial
discretization.
Barth, Chan, T:_ng (1998)
Stabilized Numerical Methods
Advection Dominated Flows
• Model Equation:
_,(_,t) = g(_,t), • c r
• Stabilized F.E.M.: (Johnson, Hughes, et. al.)
Find u E Sh such that V w E Vh
fawutdn+f_w('x'Vu)dn+fn e(Vw'Vu)dna
+f(A. Vu - _Au)r (A.w - caw) d n
+ Discontinuity Capturing Operator = 0
Barth, Chan, Tang (1908)
ELF
(an Element Library for Fluids)
Euler equations in divergence form:
u¢ + f(u)= + g(u), = 0, /(u), g(u): R m _ R m
Euler equations in symmetric quasilinear form
Ao vt+ AAo v=+ BAo vv = 0, Ao, A,B E R m×m
_"SPD _ _llmm ...... - symm -_- ...... ....
ELF Methodology: Galerkin Least-Squares in
symmetric variables
ELF Input: Simplex geometry, Simplex dofs.
ELF Output: Jacobian matrix and rhs terms for
global assembly.
Barth, Chan, T_ng ,'1_,_8)
Newton's Method
Fluid flow equations in semi-discrete form:
Dut = R(u), R(u):R" -_R -
Backward Euler time integration:
Let At vary with spatial position and IIR(_)llso
that Newton's method is approached as
liR(_)ll-_ 0.
Barth, Chan, T_ng (1098)
. = ,
Solving A x -- b
Let
Az-b=O
represent a matrix problem taken from one step
of Newton's method.
Solve using flexible GMRES in right
preconditioned form:
(AM -l )Mz - b 0
• Allows nonstationary preconditioning
Ideally
(AM -1) = 0(1).
Barth, Chan, T_ng ¢'1C_._8)
' Preconditioners
Some candidate preconditioners:
1. ILU factorization
- Nonoptimal
- Not easily parallized
@ Additive Schwarz variants of ILU on
overlapping meshes
- Nonoptimal without coarse space correction
- 3-D tetrahedral coarsening is problematic
3. Multigrid
- Similar difficulties as additive Schwarz
4. Nonoverlappmg Schurdomain decomposition
- Data localitywell-suitedto coarsegrain
parallelarchitectures
- Algebraiccoarsespaceprovidesglobal
coupling
Barth, Chan. "r_ng (1998)
--'t'O "Nonoptimality of ILU rrecona_ 1,,n_ng
i_iii..........., --[- ................. =_1 ....?m ......."................?m ".........................
:,o i: ..................no
,o 5551
mOo _ 4o _o IO 0 2o 4o _o
OMP.F_Mxrla.V_ IdektpJ_ GldM8 ldw_-Vec_r bhkJplt_
Figure 1-2. Performance of ILU for diffusion and
advection dominated problems using scalar SUPO
discretization and Cuthill-McKee ordering.
• IILU retains 0 (_) condition number for
elliptic problems.
Use of modified ILU variants is
problematic in the nonsymmetric
(advection dominated) limit.
Barth, Chan, "F_ng (1998)
"2 '*_ ,._ _,
Serial versus Parallel ILU
'&. ...........[..l-_, u,u oroll ! !"_'_ 1.J..,j.o_ _.11_ ....... i...|--=_ ==yu.r.-zu..yo.GOl(u.._÷u..ty#
I " •......... 4............................. .L...............
J_ ........i!I iiii illi iiiiiiiiiiiiii I
io I ft i " ! |_l I I I I
I0 0 20 40 (iO 80 IGO
OURES _V_mr
8 SldomJ Ibmdm
JO ° : : i
I0 :" •'.l___.J ..............II°_l_'_ l _ i i i I_,*.*÷ ............I
I°l_'t ! _ ! t i I
|_--• oo,._oo_o*ilo*eo_o ...... o°.4ooo._,,,,o .......
4 ,,," ............. - ............. "- .............
I0 { : :i0 o _o 4o w so t'_
GMRI_ Mm_.V_mm. M_ipll_
Comparison of single domain ILU preconditioning
with multiple domain ILU via Schur complement.
B&rth, Chan, T,tng (1998)
T
Additive Schwarz Preconditioned GMRES
(From Barth, AGARD R-807, 1996)
_o_-.......r r r 7cp..teMm
iio o leo 2o0 _o 4co
Effect of increased overlap and ILU fill.
,o'r......i I l ]
_.._,j i'.... -'7 "i
.........i1o _ _) 10o 15o
GI_IU35 Mmtx.Veca_ Pmdu_
Effect of increasing the number of subdomains.
- Inviscid flow about multi-element airfoil (5k vertices)
- 1 sweep overlapped ILU with _ _ s_.._e correction
Barth, Chan, Tang (1998)
Schur Complement Domain Decomposition
(Some Contributors)
Basic Approach: Axelsson, Glowinski, Lions, Mandel,
Nepomyaschikh, P_riaux, Przemieniecki, Widlund, Xu
Interface Preconditioned: Bjorstad, Bramble, Chan,
Keyes, Golub, Gropp, Mayers, Pasciak, Schatz, Smith
_'P llelIm I " .Chart,Keyes,Gropp,
Smith
See Also: Domain Decomposition: Parallel
Multilevel Methods for Elliptic PDEs, B. Smith
and P. Bjorstad and W. Gropp, Cambridge Press,
1996.
B&rth. Chin, T._ng (1998)
1-,.,
Nonoverlapping Domain Decomposition
via Schur Complement
8 4 S 678 g x
Domain decomposition and induced 2 x 2 matrix
partitioning.
Barth, Chin, T;sng (1998)
Solving a 2 x 2 Block Matrix
• Subdomain partition as illustrated.
• Order the subdomain variables before the
interface variables.
@
@
2 x 2 form of the system
zl, zB denote the interior and the boundaryvariables.
• Block LU factorization of A:
@ Eliminate ZB from the reduced equations:
Sxv =/8 - ABIAT_fl
Solve for z z by
z! = A'_(fl - AIBzB),
• S ffi ABs - ABzAT_AnB is the Schwr
Complement of Ass in A.
B&rth, Chin, T.sng (199R)
Exact Schur Complement Preconditio.ni._g
Preprocessing Step: Calculate the Schur
complement
Preconditioning Step: Solve Mz - r
. #.
Step (1)
Step (2)
Step (3)
Step (4)
Step (5)
Step (6)
tI]b _- rb- _ vi
X b = _--l_o b
_li = (AIB)i Xb
z, = u_- (AII)_. _y_
(parallel)(parallel)
(comm)
(s-parallel,
(parallel)
(parallel)
Barth, Chan, T,tng (1998)
Scalability Experiments
Fixed Subdomain Sise, 4 x Partition:
- Interior and interface dofs/partition is constant.
- Total interface size increases 4x.
Fixed Problem Size, 4 x Partitions:
Interior and interface dofs/partition decrease 4x
and 2x respectively.
Total interface size increases 2x.
Barth, C.h_n. T_ng (1_8)
Interface Decomposition
Overview (top) and closeup (bottom) of interface
decomposition obtained via partitioning of the Schur
complement supernode graph (64 subdomains, 4
interfaces).
B&rth, Chan, T_ng (1998)
Global Preconditioning Strategy for Ax = b
• Partition subdomains and interfaces
• Assign a processor to each subdomain and
interface component
• Precondition subdomain problems using
processor-local ILU
• Precondition the Schur complement using_Jt - -
_ drol_.filled, processor-local -ILU factorizations
• Implement flexible GMRES in a domain
decomposed parallel environment
- Parallel dot products
- Parallel matrix-vector products
Barth, Chan, T_ng (19.98)
Preconditioner I: Inexact Subproblem Solves
Preprocessing Step: Calculate the approxi'mate
Schur complement
= AaB- _(ABI),(AII); 1 (AIB )i.
1. (AH);-I(AIa), -+ (Ali)_-l(A,v), using fig1 steps
of ILU+GMRES so that S -_ S.
2. (AIx) _"I z, --¢ A_-I1z, using m2 steps of
ILU+GMRES.
3. $-- limb"-_8-11t/Jb using m3 steps of
ILU+GMRES.
_Preconditioning Step: Solv_ Mz = r
Step (1)
Step (2)
Step (3)
Step (4)
Step (5)
Step (6)
,,, = (A_,)_.__,v, = (A_I)_ u,
tOb -----rb -- _'_ t)i
X b _- _-1 1UlJb
v, = (AI_), _b
• , ffi",- (_-)_" v,
(parallel)
(_omm)(s-parallel,comm)
(parallel)
(parallel)
Barth, Chan, Tang (1998)
.qi -,
Preconditioner I: Perforumnce
Diffusioa Dominaw.d
150-
100-
k==i
m_l=m_2=m_3
Scalar SUPG discretization with lineax elements
Stopping criterion: IIA=. - bi[ _< lO-SllA=o - bll
Barth, Chan, Tang (1998)
*',,L
Preconditioner I: Performance
Adveelion _nated
120 .................. v ................. ! ................. - .....................................
t
? ? , ,, ,,
100........... --- zsk_,(_ ,,_,_) ...................... nekck_ (I mkkm_)
.'_ -.- __(I mklmJl)80 ........... -_.. nOkd_(n6_) ...................
I
- ................._.................i................._.................3..................
__i " "1 [ i40 ,_................................................]
_n-i ............ -..&,. ............... i ................. _.................. i .................
"_/ "<'_.'",,.. i i __" " .... / ....._'_L_-,-.__. _ - . ..,._.. . . -..... , ",. _ ...... • . I •
I II ! I i t ! t I ! I/ _ -- - ....... 4 ......O . • •
0 2 4 6 8 10
m_l=m_2fm_3
- Scalar SUPG discretization with linear elements
- Stopping criterion: IlAz,. - bll < 10-SllAzo - bll
Barth, Chan, Tang (1998)
Observations for Preconditioner I
There exists an optimal number of subdomains
for minimizing solution time.
Replacing a single long recurrence ILU
factorization with several shorter recurrences
coupled via Schur complement yields an
improved quality preconditioner.
• Choosing small values of the parameters m l, m2,
.....and m3 < 3!cads ___um CPU times. (our
expe_ence).
Linear growth in the time needed to form the
preconditioner is avoided by further partitioning
of the interface. How to precondition the
parallelized interface? Processor-Local ILU.
B&rth, Chan, Tang (1998)
Preconditioner II: Drop Tolerance Approximation
Drop elements of the Schtur complement matrix, Sij:
(1) Is,_l< toa(scalarmat,i_en_,i_)(2) Sparsity pattern (block matrix emries)
- For discretized elliptic problems S exhibits faster
element decay than (AH)_ -1 (Golub and Mayers).
Preprocesslng Step: Calculate the approximate
Schur complement
8-_ ABB -- _"_(Am)i(ftxz)_l(Azv)ii
A A
- Precondition S with ILU(DROP(S))
'Preconditioning Stepf _(;|V_ Mx - r
Step (1)
Step (2)
Step (3)
Step (4)Step (5)
Step (6)
u, = (A_t)_ "1ri
v, = (As1)i u,
rob ---- rb -- E Vi
Xb -_ _-1 Wb
Yi (AIB)iXb
(parallel)
(parallel)
(comm)
(s-parallel,comm)
(parallel)
B&rth, Chin, T_ng (1998)
Preconditioner III: Wireframe Approximation
, Main cost in Preconditioner II is the formation of
S itself: one subdomain solve is needed for every
variable on the interface.
• Idea: Approximate S by the Schur complement of
a relatively thin udreframe region surrounding the
interface variables.
Take a principal submatrix of
A: variables within the wire-
frame.
_".- ....• Compute the • - aplm_i-
mate Schur complement of
interface variables in this prin-
cipal submatrix instead of A
itself.
• From DD theory: the exact Schur complement of
wireframe region is spectraUy equivalent to the
Schur complement of the whole domain.
Barth, Chart, T,in_ (1998)
Preconditioner III: Wireframe Approximation
Pure Diffusion Problem:
Mesh
Size
1600
1600
1600
Num Time Time
Procs Suppoct Iter Form* Applym
16 2 15 .012 .II
16 3 II .017 .08
16 4 6 .020 .04
Pure Advection Problem:
,qt -
Mesh
Size
Sum
Procs
Time
Form*Support Iter
1600 16 2 8 .012 .09
1600 16 3 .......... _L .017 .06
1600 16 4 4 .019 .03
Time
Apply
- Scalar SUPG discretization with linear elements
-- Stopping criterion: [[Axn - bl[ _<10-9[[Axo - b[]
No element dropping, ml = 2, m2 -- 2, m3 = 2
* Timing based on parallelized wireframe
Barth, Chan, Tt_ (!_08)
Preconditioner III: Wirefranm Approximation
$ Introduces a new adjustable parameter into the
preconditioner.
• Specify the width of the wireframe strip in terms
of graph distance on the mesh triangulation.
• Quality of the preconditioner improves rapidly
with increasing wireframe width.
Time taken form the Schur complement is
reduced by approximately 50%.
• Further improvements in cost/performance:
choosing the shape of the wireframe to better
represent the PDE being solved, viz. the flow
direction.
B&rth, Chan, T_ng (1998)
Preconditioner III: Wireframe Approximation
• Preprocessing step: approximate Schur
complement on the localized subdomains:
Axy: restrictions of AH, A18, ABI to the
localized Schur subdomain(s).
• The preconditioning step (solving Mx = r) is
now:
Step (1)
Step (2)
Step (3)
Step (4)
Step (5)
Step (6)
A 1u, = (A.); r,
v, = (As_), u,
Wb---_rb-- EUi
u, = (A18)i Xb
x, = _ - (A%_);'V,
(parallel)
(parallel)
(comm)
(s-parallel,comm)
Barth, Chan, T_ng (1998)
Preconditioner IV: Supersparse Approximation
• Observations:
1. Iterative calculation of A_-_ A1s is expensive.
2. Columns of AzB are usually very sparse.
3. Unpreconditioned Krylov subspace methods
require computation of the vector sequence
-_[p,Ap, A2p,... ,Amp], p-- col(B) for small m.
4. ILU preconditioning destroys sparsity of the
preconditioned Krylov subspace vectors
_,M-_Ap, (M-_A)2p,..., (M-_A)_p].
Barth, Chan_ T_ng (1998)
Preconditioner IV: Supersparse Approxi'm.a.tion
• Strategy:
1. Eliminate wasted flop's in matrix-vector
products by storing vectors and matrices in
sparse form.
X • O av
/ ****llil/:!W ,, XXJ |el !°v I
*_:-2._Appr.oximate the ILUbscksolves based on a
vector fill level strategy to maintain sparsity.
Ax-b
bi xi
....J..................I........I|IIIII|IIIIIIIIIIIIIIIIVlllfflIOV
i
i i
Barth, Chan, T;_ng (1998)
"'2
Preconditioner IV: Supersparse Approximation
• Results indicate I_reconditkming performance
comparable to exact baci_olves.
• 60-70% reduction in cost.
• This technique can be combined with the
previous wireframe strategy with combined 5-7
fold performance gains.
Fill Level k
0
|
2
3
4
OO
Global GMRES Iter
,, a, , m
|
26
22
Time(k)/Time(oo)
0.299
0.313
21
20
20
2O
0.337
0.362
0.392
1.000
_-D test problem consisting of Euler flow past a
multi-element airfoil geometry partitioned into 4 subdomains
with 1600 mesh vertices in each subdomain.
B&rth, Chan, Tang (1998)
Compressible Flow Calculation
,o1 lml: .............................................................. lo ....
j, .................................; , ..... ,', _--CC........ I#_:d i i i_Q .C- _---- ._-CZ_..;..-C _' ....N t_.-_................'f....................!
l i_.l I 1L
7 1 t J iio o.o zJ s_ _ io_ i=-._ io o lo 2o 30 40 5o 6o'
_im_mGlobal GMRI_ l_edx-V_ Pmdk,_
- Mach = .63, Angle of Attack = 2.0 °
- ELF Galerkin Least-Squares
- Time Relaxed Newton Iteration
- m l -- 1, m2 = ma = 4, Fill Level--3
- 4 Subdomain Mesh, (Sk Cells/Subdomain)
Bsrth, Chan, Tang (1998)
._t -
Compressible Flow Calculation
/ \
1
,oL._._ :°Y_'_-F .....F--7.......]......-qio_'11-.... "-'_-_-_"_, T 1__!• .. • ..... _L-- o4,.
,__:.-_:-__.._%,_ .-_----.
1 1"....::_-:-
i _ i I i,, _10 0 $ 10 I$ 10 0 l0 20 30 40 50 60
llcI50n olohl oMRES lt41m_-Vectm'
- Mach = .63, Angle of Attack = 2.0 °
- ELF Galerkin Least-Squares
- Time Relaxed Newton Iteration
- ml = 1, m2 = m3 = 4, Fill Level=3
- 16 Subdomain Mesh, (Sk Cells/Subdomain)
Barth, Chan, T,_ng (1098)
Subdomain- Interface Load Balancing
r--r ....T'"r"-l-"r"" T.....
_o. -_2_-__.m ,.'...."......
0 . !
q_.m)w_oi ,_bdoml ImerlaceSize
....... 4°°1"........i.........1.........r.......r........T........T........i.........I......
/ i ! t I i lJ_'t
_----' _ "-[ t __, !, !i_'_ .......i.........i,......1-_.., _........_......
=---| IA J=:= ,i !
/ I I I i. I I ! t-- O! 1 I I I ! 1 ! t
0 8 16 24 32 40 48 5664 0 8 16 24 32 40 48 5664
Wocemmn
Mach -- .2, Angle of Attack ----2.0 °
16 Subdomain Mesh, (Sk Cells/Subdomain)
IBM SP2, MPI Message Passing Protocol
ELF Oalerkin Least-Squares
Barth, Chan, Tang (1998)
Preliminary Timings
Subdomaln Interf=lO0 Subdomaln Interf=lSO Subdomaln Interf=lSO-20(
(4 Interface Partitions)
; meoo a
• i
Raw1"Imps Adi.ma forI¢m¢¢ S_
........!.....T-T-T .....!.....f.... '°T.........................[........T........T........T........].........i......./ i I I I I i
,oF............... I i................
,,.....
ii.l i i i..:$-:.!-,::::i=.-.?:--,-!--F--,>t.....t.......t................f........f........I........l........i......./ _.].... ,I,...;.... .I....-i.'..-..I..'.... _---4
ii i Li i __i) f 4 _ ! ', ,_0 s 16 14 3i l} 4s 5664 0 s 16 i+ 3i +0 4 s66+
Pmcemco PmO:WlS
6 GMRES Iterations
- 5k Cells/$ubdomain
-IBM SP2, MPI Message Passing Protocol
- mi = 1, mi : ma = 4, Fill Level=3
Barth, Chan, Tang (19<;8)
w
Performance of Schur Complement D.D.
For Viscous Flow
,o.__°_0 ,0 30 _0 4o lo 0 _ 20 _) 40
(Add C]dR_ Mmtx-Vec_ Mol_pli_ GiolM GIJES blmtx-¥ec_r Mu/_lles
6
...... _,----_:_=..........I.................1.......N_
,o. m ..................................I0
! "
- SUPG Discretization
- Local CFL Number 100, 000
B&rth, Chan, T_ng (1908)
Concluding Return-ks
The baseline nonoverlapping scheme is very
robust but relatively expensive
Localized wireframe and supersparse
computations can significantly reduce cost of the
forming the Schur complement preconditioner
• Scalability: Growth of the Schur complement
matrix necessitatespart!tl0ning of the interface
• Scalability: Load balancing still problematic due
to size imbalance of the interface associated with
each subdomain
Barth, Chan, Tang (1098)