International Journal of Mathematics and Computer Sciences (IJMCS) ISSN: 2305-7661 Vol.32 Aug 2014 www.scholarism.net
Parallel Domain Decomposition Methods with Mixed Order Discretization for Fully
Implicit Solution of Tracer Transport Problems on the Cubed-Sphere
Haijian Yang (1), Chao Yang (2) and Xiao-Chuan Cai (3)
(1)College of Mathematics and Econometrics, Hunan University, Hunan, 410082, Changsha, People’s Republic
of China
(2)Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of
Sciences, Beijing, 100190, People’s Republic of China
(3)Department of Computer Science, University of Colorado Boulder, Boulder, CO 80309, USA
Haijian Yang
Email: [email protected]
Chao Yang
Email: [email protected]
Xiao-Chuan Cai (Corresponding author)
Email: [email protected]
Abstract
In this paper, a fully implicit finite volume Eulerian scheme and a corresponding scalable parallel solver are
developed for some tracer transport problems on the cubed-sphere. To efficiently solve the large sparse linear
system at each time step on parallel computers, we introduce a Schwarz preconditioned Krylov subspace
method using two discretizations. More precisely, the higher order method is used for the residual
calculation and the lower order method is used for the construction of the preconditioner. The matrices from the
two discretizations have similar sparsity patterns and eigenvalue distributions, but the matrix from the lower
order method is much sparser. As a result, excellent scalability results (in total computing time and the number of
iterations) are obtained. Even though the Schwarz preconditioner was originally designed for elliptic problems, our
experiments indicate clearly that the method scales well for this class of purely hyperbolic problems. In
addition, we show numerically that the proposed method is highly scalable in both the strong and the weak
sense on a supercomputer with thousands of processors.
Keywords
Transport equation, Cubed-sphere, Fully implicit method, Domain decomposition, Parallel scalability
1 Introduction
The tracer transport equation plays a critical role in global atmospheric models [9, 13]. The problem at high
resolution is very demanding in terms of computational resources. In order to develop a new generation of
climate modeling software and make effective use of supercomputers with a large number of processors, robust
and scalable algorithms are necessary to allow the simultaneous use of fine spatial meshes and large time steps,
and to maintain fast convergence for a wide range of physical parameters.
There are several schemes designed for the tracer transport problem on the cubed-sphere by using explicit
methods, such as finite volume methods [5, 19], discontinuous Galerkin (DG) methods [17], and spectral-element
methods [37]. These schemes are shown to be stable and reliable for solving the tracer transport problem.
However, because of the explicit nature of the algorithms, there are strict restrictions on the time step size
imposed by the Courant-Friedrichs-Lewy (CFL) condition. When very fine meshes are used in the spatial
discretization, the time step size has to be very small in order to satisfy this stability condition. To relax the
stability restriction on the time step size, the semi-Lagrangian (SL) method is becoming increasingly popular for
solving the transport equation [7, 8, 11, 14, 29]. In this paper, we introduce and study fully implicit domain
decomposition algorithms that are not only robust with respect to the physical parameters but also scalable with
a large number of processors. A potential drawback of the fully implicit method is that a large linear or
nonlinear system needs to be solved at each time step. To improve the efficiency of a fully implicit solver,
domain decomposition based preconditioning algorithms have been successfully applied in several applications
[2, 6, 10, 25, 36]. In particular, we have employed domain decomposition based implicit algorithms in
atmospheric modeling for solving both the global shallow water equations [32, 33] and the regional
compressible Euler equations [34]. In this work, we extend the methods to the solution of the tracer transport
problem for atmospheric flows. The tracer transport equation differs from the shallow-water or
Euler equations in that the tracer distribution can be quite non-smooth, which requires a more deliberate
design of the numerical discretization that may in turn challenge the solver for the discretized system. We also
mention that this class of Schwarz methods has not been well understood for purely hyperbolic problems such as
the tracer transport problem because of the lack of ellipticity that is required by the existing theory [26, 27, 31].
For the discretization, we use a finite volume Eulerian scheme based on the Lax-Friedrichs flux solver together
with a second-order spatial reconstruction. One layer of ghost cells is used in the scheme to couple the six
patches together and pass the information between the patches. To solve the large linear algebraic system at
each time step, a Krylov subspace method preconditioned by restricted additive Schwarz is applied. In order to
have a highly scalable (in total computing time) iterative method, two issues have to be addressed: (1) the
number of iterations has to be relatively stable when the mesh is refined and/or when the number of processors
is increased; (2) the subdomain solver has to be cheap enough. The first issue is addressed by using the Schwarz
preconditioner with sufficiently large overlap. To deal with the second issue, traditional approaches replace
the subdomain solve by some kind of incomplete factorization, but in our case, we replace the second-order
discretization by a first-order discretization, which corresponds to a sparser matrix with a similar distribution of
eigenvalues. The accuracy of the overall method is not affected since the change happens only at the
preconditioning level. Even though the transport problem is linear, the resulting algebraic system may become
nonlinear when a limiter is used in the discretization. We consider such a case in the paper, where
the Schwarz preconditioned Krylov solver is replaced by a Newton-Krylov-Schwarz method.
The rest of the paper is organized as follows. In Sect. 2, we present the transport equation and a fully implicit
discretization scheme. Section 3 focuses on the details of the domain decomposition algorithm, with special
emphasis on tuning the Schwarz preconditioners. Some numerical experiments to understand the accuracy and
the parallel performance of the proposed methods are provided in Sect. 4. We end the paper with some
concluding remarks in Sect. 5.
2 Transport Equation
Consider the following tracer transport equation defined on the sphere [15, 18]:
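$$\begin{cases} \dfrac{\partial \phi}{\partial t} + \nabla\cdot(V\phi) = \phi\,\nabla\cdot V, & \text{on } S\times(0,T],\\[4pt] \phi|_{t=0} = \phi_0, \end{cases} \tag{2.1}$$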
where ϕ is the diagnostic variable representing the tracer mixing ratio per unit mass, V=(u,v) is the velocity of
the flow in the local latitude-longitude coordinates (λ,θ), S is the surface of the sphere, and ϕ0 is a given initial
condition.
To discretize (2.1), we employ the cubed-sphere mesh [20–23, 32], which is based on a gnomonic mapping from
the six faces of a cube to the surface of the sphere. Figure 1 is a schematic illustration of the relative positions of
the six patches and their local connectivity. In Fig. 1, patches one to four are placed along the equator, and patches
five and six are centered at the north and south poles, respectively. The mesh on each patch is nonorthogonal
and curvilinear due to the gnomonic mapping. The coordinate system for each patch is free of singularities, and
the metric terms are the same on all patches. Let (λ,θ)∈[−π,π]×[−π/2,π/2] be the longitude-latitude coordinates,
and (x,y)∈[−π/4,π/4]×[−π/4,π/4] be the curvilinear coordinates on the cubed-sphere. When written in the local
curvilinear coordinates, Equation (2.1) has the form
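$$\frac{\partial (\Lambda\phi)}{\partial t} + \left(\frac{\partial}{\partial x}(\Lambda v^1\phi) + \frac{\partial}{\partial y}(\Lambda v^2\phi)\right) = \phi\left(\frac{\partial}{\partial x}(\Lambda v^1) + \frac{\partial}{\partial y}(\Lambda v^2)\right), \tag{2.2}$$
where \(\Lambda = (\sec x\,\sec y)^2/\sqrt{(1+\tan^2 x+\tan^2 y)^3}\) and \((v^1,v^2)\) are the contravariant components of V.
More details about the transformation between the surface of the cube and the sphere can be found in Nair et al.
[17].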
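As a quick sanity check on the metric term, the following sketch evaluates Λ on one patch and integrates it with a midpoint rule. It assumes a unit sphere, for which the six patch integrals of Λ should add up to the sphere's surface area 4π; the function names `metric_lambda` and `patch_area` are illustrative and not taken from the paper.

```python
import numpy as np

def metric_lambda(x, y):
    """Metric term Lambda(x, y) from (2.2), assuming a sphere of unit radius."""
    rho2 = 1.0 + np.tan(x) ** 2 + np.tan(y) ** 2
    return 1.0 / (np.cos(x) * np.cos(y)) ** 2 / rho2 ** 1.5

def patch_area(n):
    """Midpoint-rule integral of Lambda over one patch [-pi/4, pi/4]^2 (n x n cells)."""
    h = np.pi / (2 * n)                                  # mesh size h = pi / (2N)
    centers = -np.pi / 4 + (np.arange(n) + 0.5) * h      # cell centers x_i, y_j
    x, y = np.meshgrid(centers, centers, indexing="ij")
    return float(np.sum(metric_lambda(x, y)) * h * h)

if __name__ == "__main__":
    print(metric_lambda(0.0, 0.0))                # value at the patch center
    print(6 * patch_area(128), 4 * np.pi)         # total area vs. 4*pi
```

The agreement with 4π only tests the formula for Λ; it says nothing about the accuracy of the transport scheme itself.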
Fig. 1
Relative positions of the six patches and their local connectivity
S is divided into six identical patches \(\Omega^k\) (k=1,2,…,6), each of which is covered with a logically square N×N mesh with
cells centered at \((x_i,y_j)\):
$$\Omega^k_{i,j} = [x_{i-1/2},x_{i+1/2}]\times[y_{j-1/2},y_{j+1/2}],\quad i,j=1,\ldots,N,$$
where the mesh size is h=π/(2N). When using a finite volume method, the cell average of ϕ is denoted as
$$\Phi^k_{i,j} = \frac{1}{h^2\Lambda_{i,j}}\int_{y_{j-1/2}}^{y_{j+1/2}}\int_{x_{i-1/2}}^{x_{i+1/2}} \Lambda\phi\,dx\,dy. \tag{2.3}$$
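In the following, the superscript k is omitted for simplicity. Denote \(f(\phi)=v^1\phi\) and \(g(\phi)=v^2\phi\). After integrating
(2.2) over cell \(\Omega_{i,j}\) and using a cell-centered finite volume method, we obtain
$$\frac{\partial \Phi_{i,j}}{\partial t} + \frac{1}{\Lambda_{i,j}h}\left((\Lambda f)_{i+1/2,j} - (\Lambda f)_{i-1/2,j}\right) + \frac{1}{\Lambda_{i,j}h}\left((\Lambda g)_{i,j+1/2} - (\Lambda g)_{i,j-1/2}\right) \approx \frac{1}{h^2\Lambda_{i,j}}\int_{y_{j-1/2}}^{y_{j+1/2}}\int_{x_{i-1/2}}^{x_{i+1/2}} \phi\left(\frac{\partial}{\partial x}(\Lambda v^1) + \frac{\partial}{\partial y}(\Lambda v^2)\right)dx\,dy, \tag{2.4}$$
where \((\Lambda f)_{i+1/2,j}\) is approximated by \(\Lambda_{i+1/2,j}\tilde f_{i+1/2,j}\) with
$$\tilde f_{i+1/2,j} \approx \frac{1}{h}\int_{y_{j-1/2}}^{y_{j+1/2}} f(\phi(x_{i+1/2},y))\,dy \tag{2.5}$$
and \((\Lambda g)_{i,j+1/2}\), \((\Lambda f)_{i-1/2,j}\) and \((\Lambda g)_{i,j-1/2}\) are approximated similarly. The right-hand side of (2.4) is evaluated
as
$$\left(\frac{1}{\Lambda}\frac{\partial}{\partial x}(\Lambda v^1) + \frac{1}{\Lambda}\frac{\partial}{\partial y}(\Lambda v^2)\right)_{i,j}\Phi_{i,j} \approx \frac{1}{\Lambda_{i,j}}\left(\frac{(\Lambda v^1)_{i+1/2,j} - (\Lambda v^1)_{i-1/2,j}}{h} + \frac{(\Lambda v^2)_{i,j+1/2} - (\Lambda v^2)_{i,j-1/2}}{h}\right)\Phi_{i,j}.$$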
A Riemann solver is required to obtain the approximate fluxes in (2.5). In this study, we employ the local
Lax-Friedrichs flux formula:
$$\tilde F(\Phi^-,\Phi^+) = \left[\left(F(\Phi^-) + F(\Phi^+)\right) - \alpha\left(\Phi^+ - \Phi^-\right)\right]/2, \tag{2.6}$$
where α is the maximum absolute value of the normal velocity along each cell boundary, and \(\Phi^-\) and \(\Phi^+\) are the
reconstructed states of Φ on the cell boundary. By using (2.6), we obtain
$$\tilde f_{i+1/2,j} = \left[\left(f(\Phi^-_{i+1/2,j}) + f(\Phi^+_{i+1/2,j})\right) - \alpha_{i+1/2,j}\left(\Phi^+_{i+1/2,j} - \Phi^-_{i+1/2,j}\right)\right]/2 = \left[v^1(x_{i+1/2},y_j)\left(\Phi^-_{i+1/2,j} + \Phi^+_{i+1/2,j}\right) - \alpha_{i+1/2,j}\left(\Phi^+_{i+1/2,j} - \Phi^-_{i+1/2,j}\right)\right]/2;$$
the other fluxes are defined in a similar way.
The purpose of the reconstruction is to estimate Φ on the cell boundary, based on the cell-averaged values
of Φ on the neighboring cells, as shown in Fig. 2. For now, we use a piecewise linear reconstruction that does not
destroy the linearity of the problem; later, in one of the numerical experiments, we will consider a case in which the
linearity is not preserved after using a limiter in the scheme. In the x-direction, we calculate the reconstructed
states by
$$\begin{aligned} \Phi^-_{i-1/2,j} &= \Phi_{i-1,j} + \tfrac14\left(\Phi_{i,j} - \Phi_{i-2,j}\right), & \Phi^+_{i-1/2,j} &= \Phi_{i,j} - \tfrac14\left(\Phi_{i+1,j} - \Phi_{i-1,j}\right),\\ \Phi^-_{i+1/2,j} &= \Phi_{i,j} + \tfrac14\left(\Phi_{i+1,j} - \Phi_{i-1,j}\right), & \Phi^+_{i+1/2,j} &= \Phi_{i+1,j} - \tfrac14\left(\Phi_{i+2,j} - \Phi_{i,j}\right); \end{aligned}$$
and we can similarly obtain \(\Phi^-_{i,j-1/2}\), \(\Phi^+_{i,j-1/2}\), \(\Phi^-_{i,j+1/2}\) and \(\Phi^+_{i,j+1/2}\) in the y-direction. The
reconstruction scheme is second-order accurate in space.
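The reconstruction and the local Lax-Friedrichs flux (2.6) can be illustrated on a one-dimensional slice of cell averages. The sketch below is a simplification under stated assumptions: a single face velocity `v` is used, so that α reduces to |v|, and the metric factors Λ are left out. The function names are hypothetical.

```python
import numpy as np

def reconstruct_second_order(phi, i):
    """Left (minus) and right (plus) states at face i+1/2 from cell averages phi."""
    minus = phi[i] + 0.25 * (phi[i + 1] - phi[i - 1])
    plus = phi[i + 1] - 0.25 * (phi[i + 2] - phi[i])
    return minus, plus

def lax_friedrichs_flux(v, minus, plus):
    """Local Lax-Friedrichs flux for f(phi) = v * phi, assuming alpha = |v|."""
    alpha = abs(v)
    return 0.5 * (v * (minus + plus) - alpha * (plus - minus))

if __name__ == "__main__":
    phi = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # a single spike in cell 2
    minus, plus = reconstruct_second_order(phi, 2)
    print(minus, plus)
    print(lax_friedrichs_flux(2.0, minus, plus))      # equals v * minus for v > 0
    print(lax_friedrichs_flux(-2.0, minus, plus))     # equals v * plus for v < 0
```

With α = |v| the flux picks the upwind reconstructed state, which is why the first-order variant introduced in Sect. 3 leads to a sparser matrix: each face then depends on a single neighboring cell.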
Fig. 2
Reconstructions of Φ on the cell \(\Omega_{i,j} = [x_{i-1/2},x_{i+1/2}]\times[y_{j-1/2},y_{j+1/2}]\)
The cubed-sphere mesh provides us with a nearly uniform mesh and eliminates the mesh singularities that often
appear in the latitude-longitude mesh. However, it also gives rise to a new difficulty: artificial boundaries are
created between the six patches. Values near patch boundaries should be correctly passed to couple the patches
together. In order to solve the transport equation on the six patches as one system, we use one layer of ghost
cells along each patch boundary. Then information between patches is passed by setting appropriate boundary
conditions on ghost cells. For example, suppose \(\Gamma_{12}\) is the boundary between Patch 1 and Patch 2. The reconstruction
on the cell boundary \(\{x_{1/2}\}\times[y_{j-1/2},y_{j+1/2}]\) for Patch 2 is given by
$$\Phi^-_{1/2,j} = (\Phi_{N,j})^{I} + \tfrac14\left(\Phi^*_{1,j} - (\Phi_{N-1,j})^{I}\right),$$
where \((\Phi_{N,j})^{I}\) and \((\Phi_{N-1,j})^{I}\) are the cell-averaged values of Φ in Patch 1, and \(\Phi^*_{1,j}\) is an interpolated value on
Patch 2. Analogously,
$$\Phi^+_{1/2,j} = \Phi_{1,j} - \tfrac14\left(\Phi_{2,j} - (\Phi_{N,j})^{*,I}\right),$$
where \(\Phi_{1,j}\) and \(\Phi_{2,j}\) are the cell-averaged values of Φ in Patch 2, and \((\Phi_{N,j})^{*,I}\) is an interpolated value on
Patch 1. The interpolations we use to calculate \(\Phi^*_{1,j}\) on Patch 2 and \((\Phi_{N,j})^{*,I}\) on Patch 1 depend only on the
geometric position of the mesh cell. For example, let \(\bar\Phi_{1,j}\) be the ghost point belonging to Patch 1 outward to
Patch 2; then the values \(\Phi^*_{1,j}\) on Patch 2 are calculated by the following interpolation:
$$\Phi^*_{1,j} = \begin{cases} \eta_j\Phi_{1,j} + (1-\eta_j)\Phi_{1,j+1}, & j<[N/2]+1,\\ \eta_j\Phi_{1,j} + (1-\eta_j)\Phi_{1,j-1}, & \text{otherwise}, \end{cases}$$
where \(\eta_j\) is the linear interpolation coefficient defined by
$$\eta_j = \begin{cases} r(\Phi_{1,j+1},\bar\Phi_{1,j})/r(\Phi_{1,j-1},\Phi_{1,j+1}), & j<[N/2]+1,\\ r(\Phi_{1,j-1},\bar\Phi_{1,j})/r(\Phi_{1,j-1},\Phi_{1,j+1}), & \text{otherwise}, \end{cases}$$
with r(⋅,⋅) being the great-circle distance between the two points.
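After spatially discretizing (2.2), we have a semi-discrete system
$$\frac{\partial \Phi_{i,j}}{\partial t} + L(\Phi_{i,j}) = 0, \tag{2.7}$$
where, consistent with (2.4), \(L(\Phi_{i,j})\) is a linear operator defined by
$$L(\Phi_{i,j}) = \frac{1}{h}\left(\frac{\Lambda_{i+1/2,j}}{\Lambda_{i,j}}\tilde f_{i+1/2,j} - \frac{\Lambda_{i-1/2,j}}{\Lambda_{i,j}}\tilde f_{i-1/2,j}\right) + \frac{1}{h}\left(\frac{\Lambda_{i,j+1/2}}{\Lambda_{i,j}}\tilde g_{i,j+1/2} - \frac{\Lambda_{i,j-1/2}}{\Lambda_{i,j}}\tilde g_{i,j-1/2}\right) - \frac{1}{\Lambda_{i,j}}\left(\frac{(\Lambda v^1)_{i+1/2,j} - (\Lambda v^1)_{i-1/2,j}}{h} + \frac{(\Lambda v^2)_{i,j+1/2} - (\Lambda v^2)_{i,j-1/2}}{h}\right)\Phi_{i,j}.$$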
For comparison purposes, we implement both implicit and explicit methods for the temporal integration of (2.7).
For the implicit method, we use the second-order backward differentiation formula (BDF-2), which reads
$$\frac{1}{2\Delta t}\left(3\Phi^{(m)}_{i,j} - 4\Phi^{(m-1)}_{i,j} + \Phi^{(m-2)}_{i,j}\right) + L(\Phi^{(m)}_{i,j}) = 0, \tag{2.8}$$
where \(\Phi^{(m)}_{i,j}\) is the value of \(\Phi_{i,j}\) at the mth time step with a uniform time step size △t. At the first
time step only, a first-order backward Euler (BDF-1) method is used. For the explicit method, we use the second-order
Strong Stability Preserving Runge-Kutta (SSP RK-2) method
$$\begin{cases} \bar\Phi^{(m)}_{i,j} = \Phi^{(m-1)}_{i,j} - \Delta t\,L(\Phi^{(m-1)}_{i,j}),\\[4pt] \Phi^{(m)}_{i,j} = \dfrac12\left(\Phi^{(m-1)}_{i,j} + \bar\Phi^{(m)}_{i,j}\right) - \dfrac{\Delta t}{2}L(\bar\Phi^{(m)}_{i,j}). \end{cases} \tag{2.9}$$
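To make the two time integrators concrete, the sketch below applies (2.8) and (2.9) to a generic linear system dΦ/dt + LΦ = 0, with L given as a small dense matrix. This is an illustration only: the dense solves via `numpy.linalg.solve` stand in for the parallel linear solver of Sect. 3, and the function names are hypothetical.

```python
import numpy as np

def bdf2_solve(L, phi0, dt, nsteps):
    """Integrate dPhi/dt + L Phi = 0 with BDF-2, using backward Euler for step 1."""
    eye = np.eye(len(phi0))
    prev = np.array(phi0, dtype=float)
    cur = np.linalg.solve(eye + dt * L, prev)            # BDF-1 start-up step
    for _ in range(nsteps - 1):
        rhs = (4.0 * cur - prev) / (2.0 * dt)            # known terms of (2.8)
        nxt = np.linalg.solve(1.5 / dt * eye + L, rhs)   # (3/(2 dt) I + L) Phi^m = rhs
        prev, cur = cur, nxt
    return cur

def ssp_rk2_solve(L, phi0, dt, nsteps):
    """Integrate dPhi/dt + L Phi = 0 with the two-stage SSP RK-2 scheme (2.9)."""
    phi = np.array(phi0, dtype=float)
    for _ in range(nsteps):
        stage = phi - dt * (L @ phi)
        phi = 0.5 * (phi + stage) - 0.5 * dt * (L @ stage)
    return phi

if __name__ == "__main__":
    L = np.array([[2.0]])                                # scalar decay, exact solution exp(-2 t)
    print(bdf2_solve(L, [1.0], 1.0 / 200, 200), ssp_rk2_solve(L, [1.0], 1.0 / 200, 200))
    stiff = np.array([[100.0]])                          # a time step far beyond the explicit limit
    print(bdf2_solve(stiff, [1.0], 1.0, 3), ssp_rk2_solve(stiff, [1.0], 1.0, 3))
```

In the second experiment the explicit update grows without bound while the implicit one stays bounded, which is the behavior that motivates paying for a linear solve at each implicit step.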
3 Fully Implicit Domain Decomposition Methods
After the discretization of (2.2) in space and time, we obtain a linear system at each time step. In this study we
propose an additive Schwarz right-preconditioned Generalized Minimal RESidual (GMRES) method for solving
the system,
$$AM^{-1}X = b,\quad X = M\Phi, \tag{3.1}$$
where \(M^{-1}\) is the preconditioner [24]. To solve (3.1) at time step m, we set the initial guess \(\Phi_0\) equal to the
solution of the previous time step \(\Phi^{(m-1)}\); at the first time step only, the initial condition serves as the
initial guess. The next approximate solution is then obtained by using the right-preconditioned GMRES method
with a restart value of 30 until the residual satisfies
$$\|AM^{-1}X_n - b\| \le \eta_r,\quad n=0,1,\ldots,$$
where \(\eta_r = 10^{-5}\) is the tolerance, and then \(\Phi^{(m)} = M^{-1}X_n\).
To define the one-level restricted additive Schwarz preconditioner, each patch \(\Omega^k\) is decomposed into p
non-overlapping subdomains \(\Omega^k_i\). Here p is the number of processors per patch, and each subdomain corresponds
to one processor. Hence, the number of processors for the whole domain Ω is 6p. To obtain the overlapping
subdomains, we extend each subdomain \(\Omega^k_i\) by δ layers of mesh cells to a larger subdomain \(\Omega^k_{i,\delta}\) that overlaps
with its neighbors. Any subdomain boundary that coincides with a patch boundary is extended to the
neighboring patch(es).
Note that the total number of unknowns is \(6N^2\). Let \(N_i\) be the number of unknowns in \(\Omega^k_{i,\delta}\) and let the restriction
operator \(R^k_{i,\delta}\) be an \(N_i\times(6N^2)\) matrix that maps a vector defined on the entire domain to a shorter vector defined
on subdomain \(\Omega^k_{i,\delta}\) by discarding all components corresponding to mesh points outside \(\Omega^k_{i,\delta}\).
Specifically, \(R^k_{i,0}\) is also an \(N_i\times(6N^2)\) matrix that is similarly defined, with the difference that its application to
a \((6N^2)\times 1\) vector zeroes all those components corresponding to mesh cells outside \(\Omega^k_i\). The
\(N_i\times N_i\) subdomain matrix is defined as
$$A^k_i = R^k_{i,\delta}\,A\,(R^k_{i,\delta})^T. \tag{3.2}$$
The one-level restricted additive Schwarz (RAS) preconditioner for A is defined as [4, 26, 27]
$$M^{-1}_{RAS} = \sum_{k=1}^{6}\sum_{i=1}^{p}(R^k_{i,0})^T(A^k_i)^{-1}R^k_{i,\delta}. \tag{3.3}$$
The matrix-vector multiplication with \((A^k_i)^{-1}\) is either calculated exactly by an LU factorization or obtained
approximately by an ILU factorization. More details will be discussed in the numerical experiments section.
The effectiveness of the Schwarz preconditioner relies on its ability to mimic the spectrum of the linear operator
while remaining relatively cheap to apply. In the RAS preconditioner (3.3), we construct the subdomain
matrix directly from the matrix A. In this case it is effective in terms of the number of GMRES iterations, as
shown in Sect. 4.3. However, the computing time is not satisfactory because the subdomain matrices are
denser when a higher order discretization is used. In order to lower the cost of the subdomain solves without losing
the effectiveness of the preconditioner, we build the subdomain matrix with a first-order spatial discretization
while the second-order scheme is still used to build the matrix A in (3.1). This idea is based on the fact that
the matrices arising from the first-order scheme and the second-order scheme both originate from the same transport
equation and thus have similar eigenvalue distributions. Let \(\tilde A\) denote the matrix from the first-order discretization
and \(\tilde A^k_i\) be its restriction to the overlapping subdomain \(\Omega^k_{i,\delta}\). Then the new RAS preconditioner is defined by:
$$\tilde M^{-1}_{RAS} = \sum_{k=1}^{6}\sum_{i=1}^{p}(R^k_{i,0})^T(\tilde A^k_i)^{-1}R^k_{i,\delta}, \tag{3.4}$$
where the matrix-vector multiplication with \((\tilde A^k_i)^{-1}\) is obtained by an LU or an ILU factorization. In the first-order
discretization, the reconstructed states are given by:
$$\Phi^-_{i-1/2,j} = \Phi_{i-1,j},\quad \Phi^+_{i-1/2,j} = \Phi_{i,j},\quad \Phi^-_{i+1/2,j} = \Phi_{i,j},\quad \Phi^+_{i+1/2,j} = \Phi_{i+1,j},$$
in the x-direction, and similarly in the y-direction. As an example, in Figs. 3 and 4 we show the sparsity patterns
and eigenvalue distributions of the matrices based on the first-order and second-order discretizations of a typical
transport problem on a relatively coarse mesh. We see that the sparsity patterns are similar and the eigenvalue
distributions are also similar, but the number of nonzeros of the first-order matrix is much less than that of the
second-order matrix.
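The mixed-order idea can be tried out on a toy problem. The sketch below is not the solver of this paper: it assumes a one-dimensional periodic mesh, constant positive velocity (so the Lax-Friedrichs flux reduces to upwinding), backward Euler in time, dense matrices, and exact subdomain solves, and it replaces GMRES by a plain preconditioned Richardson (defect-correction) iteration to stay short. All function names are hypothetical. The residual is computed with the second-order matrix and the RAS preconditioner is built from the first-order one, as in (3.4).

```python
import numpy as np

def implicit_matrix(n, cfl, order):
    """Backward-Euler matrix I + cfl * D for 1D periodic advection, velocity > 0."""
    stencils = {1: {0: 1.0, -1: -1.0},                        # first-order states
                2: {1: 0.25, 0: 0.75, -1: -1.25, -2: 0.25}}   # second-order states
    A = np.eye(n)
    for i in range(n):
        for offset, coeff in stencils[order].items():
            A[i, (i + offset) % n] += cfl * coeff
    return A

def ras_apply(A_pre, r, nsub, overlap):
    """Apply one-level RAS built from A_pre; needs n % nsub == 0, n/nsub + 2*overlap <= n."""
    n = len(r)
    size = n // nsub                                          # cells owned per subdomain
    z = np.zeros(n)
    for s in range(nsub):
        lo = s * size
        ext = np.arange(lo - overlap, lo + size + overlap) % n    # overlapping index set
        local = np.linalg.solve(A_pre[np.ix_(ext, ext)], r[ext])  # exact subdomain solve
        z[lo:lo + size] = local[overlap:overlap + size]           # restricted: keep owned cells
    return z

def defect_correction(A, A_pre, b, nsub, overlap, iters):
    """Richardson iteration x <- x + M^{-1}(b - A x); a stand-in for GMRES."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + ras_apply(A_pre, b - A @ x, nsub, overlap)
    return x

if __name__ == "__main__":
    n, cfl = 32, 2.0
    A2, A1 = implicit_matrix(n, cfl, 2), implicit_matrix(n, cfl, 1)
    print(np.count_nonzero(A2), np.count_nonzero(A1))         # nonzeros: 2nd vs 1st order
    b = np.sin(2 * np.pi * np.arange(n) / n) + 1.0
    x = defect_correction(A2, A1, b, nsub=4, overlap=3, iters=50)
    print(np.linalg.norm(A2 @ x - b))                          # residual w.r.t. 2nd-order matrix
```

In this toy setting the first-order matrix has half as many nonzeros per row as the second-order one. How the iteration count depends on `nsub` and `overlap` can be explored by changing the arguments; any such observation is specific to this simplified model and should not be read as a result of the paper.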
Fig. 3
The sparsity patterns of the matrices with the second (left panel) and the first (right panel) order discretizations.
Here the mesh is 6×6×6. “nnz” is the number of nonzero elements
Fig. 4
The eigenvalue distributions of the matrices with the second (left panel) and the first (right panel) order
discretizations. Here the mesh is 6×6×6. “Asterisk” represents the eigenvalues of the matrix
4 Numerical Experiments
In this section, we test the proposed implicit method with a variety of test cases. The purposes of the tests
include: (1) the verification of the numerical order of convergence and the effective resolution; (2) the
preservation of the shape of “rough” distributions; (3) mixing diagnostics for two nonlinearly related tracers;
and (4) the parallel performance of the method.
4.1 The test cases
We consider four different initial scalar fields for ϕ0 including: a smooth scalar field, two quasi-smooth scalar
fields, and a non-smooth scalar field [12, 16, 18]. The velocity fields are given as either non-divergent or
divergent flows. Let \((\lambda_i,\theta_i)\), i=1,2, be the centers of the initial distributions, \((X,Y,Z) = (\cos\theta\cos\lambda,\ \cos\theta\sin\lambda,\ \sin\theta)\),
and \((X_i,Y_i,Z_i) = (\cos\theta_i\cos\lambda_i,\ \cos\theta_i\sin\lambda_i,\ \sin\theta_i)\). Then the smooth initial
scalar field is the Gaussian hills defined by
$$\phi_0(\lambda,\theta) = h_1(\lambda,\theta) + h_2(\lambda,\theta), \tag{4.1}$$
where \(h_i = \exp\{-5((X-X_i)^2+(Y-Y_i)^2+(Z-Z_i)^2)\}\). Two scalar fields are used for the quasi-smooth case: the cosine
bells and the “correlated” cosine bells. The cosine bells (\(C^1\) function) are defined as:
$$\phi_0(\lambda,\theta) = \begin{cases} 0.1 + 0.9\,h_i(\lambda,\theta) & \text{if } r_i<r,\ i=1,2,\\ 0.1 & \text{otherwise}, \end{cases} \tag{4.2}$$
where r=1/2 is the base radius of the bells and \(h_i(\lambda,\theta) = \frac{1}{2}\big(1+\cos(2\pi r_i)\big)\) if \(r_i<r\),
with \(r_i\) being the great-circle distance between (λ,θ) and \((\lambda_i,\theta_i)\). The correlated cosine bells are
given by:
ϕ∗=ψ(ϕ0),
(4.3)
where ϕ0 is the cosine bells condition defined in (4.2) and the nonlinear functional relation ψ is given by
$$\psi(\chi) = a_\psi\chi^2 + b_\psi \tag{4.4}$$
with \(a_\psi=-0.8\) and \(b_\psi=0.9\).
For the non-smooth case, the initial condition is the slotted-cylinders defined by
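$$\phi_0(\lambda,\theta) = \begin{cases} 1 & \text{if } r_i\le r \text{ and } |\lambda-\lambda_i|\ge \frac{r}{6} \text{ for } i=1,2,\\ 1 & \text{if } r_1\le r \text{ and } |\lambda-\lambda_1|<\frac{r}{6} \text{ and } \theta-\theta_1<-\frac{5r}{12},\\ 1 & \text{if } r_2\le r \text{ and } |\lambda-\lambda_2|<\frac{r}{6} \text{ and } \theta-\theta_2>\frac{5r}{12},\\ 0.1 & \text{otherwise}, \end{cases} \tag{4.5}$$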
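where r=1/2. In the numerical tests, we employ two different types of deformational wind fields [18]. The first
velocity field is a non-divergent flow
$$\begin{cases} u(\lambda,\theta,t) = k\sin^2(\lambda)\sin(2\theta)\cos\left(\dfrac{\pi t}{T}\right),\\[4pt] v(\lambda,\theta,t) = k\sin(2\lambda)\cos(\theta)\cos\left(\dfrac{\pi t}{T}\right). \end{cases} \tag{4.6}$$
The second velocity field is a divergent flow defined by
$$\begin{cases} u(\lambda,\theta,t) = -k\sin^2\left(\dfrac{\lambda}{2}\right)\sin(2\theta)\cos^2(\theta)\cos\left(\dfrac{\pi t}{T}\right),\\[4pt] v(\lambda,\theta,t) = \dfrac{k}{2}\sin(\lambda)\cos^3(\theta)\cos\left(\dfrac{\pi t}{T}\right). \end{cases} \tag{4.7}$$
The components of the velocity vector for the zonal background flow are given by
$$\begin{cases} u(\lambda,\theta,t) = k\sin^2(\bar\lambda)\sin(2\theta)\cos\left(\dfrac{\pi t}{T}\right) + \dfrac{2\pi\cos(\theta)}{T},\\[4pt] v(\lambda,\theta,t) = k\sin(2\bar\lambda)\cos(\theta)\cos\left(\dfrac{\pi t}{T}\right), \end{cases} \tag{4.8}$$
where \(\bar\lambda = \lambda - 2\pi t/T\). This wind field is non-divergent but highly deformational [18]. In the experiments, the
following combinations of the initial conditions and velocity fields are used: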
Case-1: Gaussian hills (4.1) and non-divergent flow (4.6);
Case-2: Cosine bells (4.2) and divergent flow (4.7);
Case-3: Slotted-cylinders (4.5) and zonal background flow (4.8);
Case-4: Cosine bells (4.2) and zonal background flow (4.8);
Case-5: Correlated cosine bells (4.3) and zonal background flow (4.8).
We set the duration of integration to T=5 time units. The parameter k and the centers of the initial
distributions (λi,θi), (i=1,2) are chosen to make the test cases challenging,
for the test cases with the non-divergent flow: k=2, (λ1,θ1)=(5π/6,0) and (λ2,θ2)=(−5π/6,0);
for the test cases with the divergent flow: k=1, (λ1,θ1)=(3π/4,0) and (λ2,θ2)=(−3π/4,0).
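For readers who want to reproduce the initial data, the following sketch evaluates the Gaussian hills (4.1) with the centers listed above for the non-divergent flow, together with the functional relation (4.4) used for the correlated tracer. It assumes a unit sphere and scalar inputs in radians; the function and constant names are illustrative only.

```python
import numpy as np

# Centers (lambda_i, theta_i) quoted in the text for the non-divergent flow.
CENTERS_NONDIVERGENT = [(5 * np.pi / 6, 0.0), (-5 * np.pi / 6, 0.0)]

def to_cartesian(lam, theta):
    """(X, Y, Z) on the unit sphere for longitude lam and latitude theta."""
    return np.array([np.cos(theta) * np.cos(lam),
                     np.cos(theta) * np.sin(lam),
                     np.sin(theta)])

def gaussian_hills(lam, theta, centers=CENTERS_NONDIVERGENT):
    """Initial field (4.1): sum of two hills h_i = exp(-5 * squared chord distance)."""
    p = to_cartesian(lam, theta)
    return float(sum(np.exp(-5.0 * np.sum((p - to_cartesian(lc, tc)) ** 2))
                     for lc, tc in centers))

def correlated(chi, a_psi=-0.8, b_psi=0.9):
    """Functional relation (4.4) that maps one tracer to its correlated partner."""
    return a_psi * chi ** 2 + b_psi

if __name__ == "__main__":
    print(gaussian_hills(5 * np.pi / 6, 0.0))   # value at the first hill's center
    print(correlated(0.1), correlated(1.0))     # images of the background and peak values
```

At a hill center the field is slightly larger than one because the tail of the second hill contributes as well.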
Case-1, Case-2, and Case-3 were first proposed in Nair et al. [18] to validate global transport schemes. Case-4
and Case-5 are used in Lauritzen et al. [15] to evaluate schemes using interrelated tracers, scatter plots and
numerical mixing diagnostics. These test cases are designed so that the flow reverses its course at
half-time t=T/2 and the scalar field returns to its initial position and shape at the end of the simulation; that is,
the exact final solution ϕT is identical to the initial condition ϕ0.
4.2 Correctness and Accuracy
In the tests, we compute certain errors to assess the order of convergence of the proposed scheme. We define the
following measurements [18, 29, 30]:
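$$l_1 = \frac{I(|\phi_0-\phi_T|)}{I(|\phi_0|)},\quad l_2 = \left(\frac{I((\phi_0-\phi_T)^2)}{I((\phi_0)^2)}\right)^{1/2},\quad l_\infty = \frac{\max|\phi_0-\phi_T|}{\max|\phi_0|};$$
$$\phi_{\max} = \frac{\max\phi_T - \max\phi_0}{\max\phi_0 - \min\phi_0},\quad \phi_{\min} = \frac{\min\phi_T - \min\phi_0}{\max\phi_0 - \min\phi_0}.$$
All functions are evaluated at the mesh points, and the integral I(ϕ) is calculated as the discrete summation over
all cell centers \(\sum_{k=1}^{6}\sum_{i=1}^{N}\sum_{j=1}^{N}\Lambda^k_{i,j}\Phi^k_{i,j}\).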
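The error measures above translate directly into array operations. In the sketch below, `weights` plays the role of the metric terms Λ at the cell centers and all three arrays are assumed to have the same shape; the function name and the dictionary keys are illustrative.

```python
import numpy as np

def error_norms(phi0, phi_t, weights):
    """Normalized l1, l2, l_inf errors and the phi_max / phi_min measures."""
    phi0, phi_t, weights = (np.asarray(a, dtype=float) for a in (phi0, phi_t, weights))

    def integral(f):
        # Discrete integral I(f): weighted sum over all cell centers.
        return float(np.sum(weights * f))

    diff = phi0 - phi_t
    span = phi0.max() - phi0.min()
    return {
        "l1": integral(np.abs(diff)) / integral(np.abs(phi0)),
        "l2": float(np.sqrt(integral(diff ** 2) / integral(phi0 ** 2))),
        "linf": float(np.max(np.abs(diff)) / np.max(np.abs(phi0))),
        "phi_max": float((phi_t.max() - phi0.max()) / span),
        "phi_min": float((phi_t.min() - phi0.min()) / span),
    }

if __name__ == "__main__":
    phi0 = np.array([1.0, 2.0, 3.0, 4.0])
    phi_t = np.array([1.0, 2.0, 3.0, 3.5])      # the maximum is damped by 0.5
    print(error_norms(phi0, phi_t, np.ones(4)))
```

A negative `phi_max`, as in this toy input, indicates that the final field undershoots the initial maximum.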
Following [12, 15, 29], we estimate the numerical convergence
rates K1, K2 and K∞ for l1, l2 and l∞, respectively, by using a least-squares linear regression
$$\log(l_i) = A_i - K_i\log(\Delta\lambda),\quad i=1,2,\infty,$$
where Δλ is the average mesh spacing in degrees and \(A_i\) (i=1,2,∞) are constants defined in Lauritzen et al. [15] and
Harris et al. [8]. For the transport equation, the maximum CFL number can be defined by
$$\mathrm{CFL} = \frac{\Delta t\,U_{\max}}{\Delta\lambda\,(\pi/180^\circ)}, \tag{4.9}$$
where \(U_{\max}\) is the maximum wind speed [12, 15].
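As a worked instance of (4.9), the sketch below recomputes two CFL numbers for the setting reported with Table 2: T=5, a 1,024×1,024×6 mesh (so Δλ=90°/1024), and Umax=2.32. The function and constant names are illustrative.

```python
import math

T = 5.0                 # duration of integration (time units)
U_MAX = 2.32            # maximum wind speed quoted for this test case
DLAM_DEG = 90.0 / 1024  # average mesh spacing in degrees on the 1,024 x 1,024 x 6 mesh

def cfl_number(dt, u_max, dlam_deg):
    """CFL number (4.9): dt * U_max divided by the mesh spacing in radians."""
    return dt * u_max / (dlam_deg * math.pi / 180.0)

if __name__ == "__main__":
    for steps in (10, 5000):
        print(steps, cfl_number(T / steps, U_MAX, DLAM_DEG))
```

The two values come out at roughly 756 and 1.51, in line with the CFL column of Table 2 for △t=T/10 and △t=T/5,000.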
Since the Gaussian-hills initial condition in Case-1 is infinitely smooth, we use it to assess the order of accuracy
of the implicit scheme for the non-divergent flow. In the test, we use the following
meshes: N=16,32,64,128,256,512 (correspondingly Δλ=90∘/16, 90∘/32, and so on). Table 1 and Figure 5 show the
errors in space. We see that the order of accuracy approaches third order as the mesh becomes
finer. This is because the solution of the test case is smooth and compact. Figure 6 shows the initial condition
and the numerical solutions at t=T/2 and t=T obtained with the implicit method. We find that the final solution is in good
agreement with the initial condition.
Table 1
Results for Case-1 by using the implicit method with different meshes
Mesh l1 l2 l∞ ϕmax ϕmin
16×16×6 3.90E−1 3.28E−1 3.73E−1 −3.66E−1 −5.20E−2
32×32×6 1.40E−1 1.37E−1 1.81E−1 −1.67E−1 −2.35E−2
64×64×6 2.94E−2 3.31E−2 5.27E−2 −4.41E−2 −1.88E−3
128×128×6 4.29E−3 5.18E−3 9.07E−3 −7.39E−3 0
256×256×6 5.60E−4 6.89E−4 1.23E−3 −1.01E−3 0
512×512×6 8.10E−5 9.92E−5 1.72E−4 −1.49E−4 0
The time step size is fixed to △t=T/5,000
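The observed order can be cross-checked from the l2 column of Table 1 with a least-squares fit of log(l2) against log(Δλ). The sketch reports the slope of that fit, which is positive when the error decreases under refinement; the array and function names are illustrative.

```python
import numpy as np

# Mesh sizes N and l2 errors copied from Table 1 (Case-1, implicit method).
N_CELLS = np.array([16, 32, 64, 128, 256, 512])
L2_ERRORS = np.array([3.28e-1, 1.37e-1, 3.31e-2, 5.18e-3, 6.89e-4, 9.92e-5])

def observed_order(n_cells, errors):
    """Slope of the least-squares line through (log(dlam), log(error))."""
    dlam = 90.0 / np.asarray(n_cells, dtype=float)   # mesh spacing in degrees
    slope, _intercept = np.polyfit(np.log(dlam), np.log(errors), 1)
    return float(slope)

if __name__ == "__main__":
    print(observed_order(N_CELLS, L2_ERRORS))            # fit over all six meshes
    print(observed_order(N_CELLS[-3:], L2_ERRORS[-3:]))  # fit over the three finest meshes
```

Restricting the fit to the three finest meshes gives a slope closer to three than the fit over all meshes, which matches the trend described for Fig. 5.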
Fig. 5
Convergence plots for the l1, l2 and l∞ errors as the mesh is refined for Case-1. The problem is solved
with △t=T/5,000 by using the implicit method. Legends “2nd-order” and “3rd-order” represent the second and
third order convergence rates in space
Fig. 6
Contour plots of Case-1. The problem is solved on a 512×512×6 mesh by the implicit method with △t=T/5,000.
The upper left panel is the initial scalar field ϕ0, the upper right panel is the numerical solution at t=T/2, and
the lower panel is the numerical solution at t=T
We next compare the fully implicit method with the explicit method (2.9). Table 2 and Fig. 7 show the
computed errors in time. The CFL condition forces the time step size of the explicit method to be small when
solving the problem on fine meshes, as shown in Table 2. On the other hand, the implicit method allows the
time step size to be chosen independently of the mesh resolution. From Table 2 we see that the explicit and implicit methods
have similar numerical accuracy when △t=T/10^4 for the explicit method and △t=T/500 for the implicit method. It is
observed from Fig. 8 that the implicit method outperforms the explicit method in
terms of the total computing time with up to 3072 processors. We remark that the comparison of the explicit and
the implicit methods is based on the same spatial discretization. The performance of the explicit scheme may be
improved by using a semi-Lagrangian finite volume method, but such a comparison is beyond the scope of this
study.
Fig. 7
Convergence plots for the l1, l2 and l∞ errors of Case-1. The problem is solved on a 1,024×1,024×6 mesh by
using the implicit method. Legends “1st-order” and “2nd-order” represent the first and second order
convergence rates in time
Fig. 8
Comparison between the implicit method and the explicit method for Case-1 in terms of the computing time
with different numbers of processors. The tests are performed on a 1,024×1,024×6 mesh with △t=T/500 for the
implicit method and with △t=T/10^4 for the explicit method
Table 2
Results for Case-1 by using explicit and implicit methods with different time step sizes
△t CFL l1 l2 l∞ ϕmax ϕmin
Implicit method
T/10 7.56E+2 8.29E−1 6.13E−1 6.82E−1 −4.79E−1 −4.79E−4
T/50 1.51E+2 2.54E−1 2.35E−1 2.96E−1 −2.36E−1 −1.13E−2
T/100 7.56E+1 9.37E−2 9.65E−2 1.31E−2 −9.67E−2 −5.70E−3
T/500 1.51E+1 2.82E−3 3.23E−3 5.00E−3 −3.38E−3 0
T/1,000 7.56E+0 5.94E−4 6.61E−4 1.00E−3 −7.24E−4 0
T/2,000 3.78E+0 1.38E−4 1.49E−4 2.22E−4 −1.73E−4 0
T/5,000 1.51E+0 2.58E−5 2.86E−5 4.65E−5 −3.91E−5 0
Explicit method
T/5,000 1.51E+0 – – – – –
T/10^4 7.56E−1 1.86E−3 1.79E−3 1.76E−3 −1.36E−5 0
T/(2×10^4) 3.78E−1 9.35E−4 8.98E−4 8.84E−4 −1.38E−5 0
T/(5×10^4) 1.51E−1 3.74E−4 3.60E−4 3.59E−4 −1.47E−5 0
T/10^5 7.56E−2 1.88E−4 1.81E−4 1.84E−4 −1.52E−5 0
The mesh is 1,024×1,024×6. “–” denotes no convergence. In (4.9), Umax=2.32 for this particular test case
We then study Case-2, where the cosine bells initial condition and the divergent wind are used. The flow is more
complex compared to the non-divergent case. In Lauritzen and Skamarock [12], Lauritzen et al. [15], and White and
Dongarra [29], “effective resolution” is defined to assess the absolute error and the rate of convergence. In our
implicit simulation we define the effective resolution to be the resolution at which the l2 error is approximately 0.033.
Table 3 and Fig. 9 show the computed errors. The convergence plot in Fig. 9 gives the
effective resolution as the intersection between the convergence curve of l2 and the line l2=0.033. As
shown in Fig. 9, the effective resolution for the implicit method is about 90∘/64≈1.4062∘ when
using △t=T/5,000. Figure 10 shows the initial condition and the solutions at t=T/2 and t=T obtained with the implicit
method.
Fig. 9
Convergence plots for the l1, l2 and l∞ errors as the mesh is refined for Case-2. The problem is solved
with △t=T/5,000 by using the implicit method. Legends “2nd-order” and “3rd-order” represent the second and
third order convergence rates in space. The blue line is l2=0.033, which is used to define “effective resolution”
Fig. 10
Contour plots of Case-2. The problem is solved on a 512×512×6 mesh by the implicit method with △t=T/5,000.
The upper left panel is the initial scalar field ϕ0, the upper right panel is the numerical solution at t=T/2, and
the lower panel is the numerical solution at t=T
Table 3
Results for Case-2 by using the implicit method with different meshes
Mesh l1 l2 l∞ ϕmax ϕmin
16×16×6 1.66E−1 3.25E−1 3.97E−1 −3.68E−1 0
32×32×6 7.40E−2 1.52E−1 1.68E−1 −1.63E−1 0
64×64×6 1.90E−2 4.02E−2 6.00E−2 −3.56E−2 0
128×128×6 3.56E−3 8.32E−3 1.98E−2 −5.19E−3 0
256×256×6 6.33E−4 1.84E−3 5.89E−3 −6.73E−4 0
512×512×6 1.15E−4 4.59E−4 1.82E−3 −8.60E−5 0
The time step size is fixed to △t=T/5,000
In Case-3, the slotted-cylinders initial condition and the zonal background flow are used. The non-smooth initial
condition is used to challenge the proposed scheme. We use the implicit method to obtain the numerical solution
with △t=T/5,000 on a 512×512×6 mesh. Figure 11 shows contour plots of the initial condition and the numerical solutions
at t=T/2 and t=T, using a contour interval of 0.05. The errors
are l1=8.67E−2, l2=1.93E−1, l∞=8.48E−1, ϕmax=1.43E−1, and ϕmin=−4.25E−2.
Fig. 11
Contour plots of Case-3. The problem is solved on a 512×512×6 mesh by the implicit method with △t=T/5,000.
The upper left panel is the initial scalar field ϕ0, the upper right panel is the numerical solution at t=T/2, and
the lower panel is the numerical solution at t=T
In Lauritzen and Thuburn [16], in order to explore the mixing characteristics of a transport scheme, a mixing
diagnostics is defined to quantify the numerical mixing in terms of the normalized distance between the pre-
existing functional curve and scatter points. This mixing diagnostics is based on the highly deformational
analytical flow field (4.8) and two nonlinearly related tracers. The two nonlinearly related tracers are the cosine
bells and the correlated cosine bells (i.e., Case-4 and Case-5). The mixing ratio for two nonlinearly related
tracers is referred to as the cosine bells condition ϕ and the correlated cosine bells condition ϕ∗.
In general, a plot of scatter points (ϕ,ϕ∗) follows a constant curve from the pre-existing functional relation
curve ψ defined in (4.4). Based on the distance between the pre-existing functional curve and scatter points,
three categories of deviation are defined from this curve [16]: “real” mixing, “range-preserving” unmixing and
overshooting. The three diagnostics that quantitatively account for numerical mixing that resembles the three
deviations are referred to as lr, lu and lo, respectively. More details about the definitions of the
errors lr, luand lo can be found in Lauritzen et al. [15], Lauritzen and Thuburn [16] and references therein. In
the tests, we compute the mixing diagnostics (lr,lu,lo) at t=T/2 at resolutions N=64,128,256,512, as shown in
Table 4 and Fig. 12. Figure 13 shows contour plots of the initial condition and of the numerical solutions at t=T/2. In
the scatter plots, “real mixing” denotes the area where points are below the curve but within the triangle,
“range-preserving unmixing” denotes the area where points are above the curve but within the triangle, and
“overshooting” denotes the area where points are outside the triangle [29].
Table 4
Diagnostics for the real mixing (lr), the range-preserving unmixing (lu), and the overshooting (lo) at t=T/2 by
using the implicit method with different meshes
Mesh lr lu lo
64×64×6 4.30E−3 9.94E−4 2.80E−3
128×128×6 1.80E−3 2.27E−4 1.40E−3
256×256×6 5.67E−4 2.49E−4 4.84E−4
512×512×6 9.61E−5 6.28E−5 6.60E−5
The time step size is fixed to △t=T/5,000
Fig. 12
Scatter plots at t=T/2 with the implicit method for two nonlinearly correlated tracers based on the cosine-bells
initial conditions. The horizontal axis denotes the value of the numerical solution ϕ for the cosine-bells initial
condition and the vertical axis denotes the value of the numerical solution ϕ∗ for correlated cosine bells
condition at t=T/2
Fig. 13
Contour plots of Case-4 and Case-5. The problem is solved on a 512×512×6 mesh by the implicit method
with Δt=T/5,000. The upper left panel is the initial scalar field ϕ0 for Case-5, the upper right panel is the
numerical solution at t=T/2 for Case-4, and the lower panel is the numerical solution at t=T/2 for Case-5. The
initial field for Case-4 is the same as Case-2; see Fig. 9
It is observed from Figs. 10 and 11 that there are spurious oscillations occurring at the quasi-smooth or non-
smooth area. To reduce the spurious oscillations, we modify the scheme by adding a slope limiter in the state
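reconstruction. For example, we calculate the reconstructed states $\Phi^{-}_{i-1/2,j}$ and $\Phi^{+}_{i-1/2,j}$ by
$$\Phi^{-}_{i-1/2,j}=\Phi_{i-1,j}+\frac{1}{2}\,\mathrm{limiter}\left(\Phi_{i-1,j}-\Phi_{i-2,j},\;\Phi_{i,j}-\Phi_{i-1,j}\right)$$
and
$$\Phi^{+}_{i-1/2,j}=\Phi_{i,j}-\frac{1}{2}\,\mathrm{limiter}\left(\Phi_{i,j}-\Phi_{i-1,j},\;\Phi_{i+1,j}-\Phi_{i,j}\right),$$
respectively, and others are defined in a similar way. Here we use the corrected van Albada limiter [28]:
$$\mathrm{limiter}(d_1,d_2)=\begin{cases}0 & \text{if } d_1 d_2\le 0,\\[4pt] \dfrac{d_1 d_2\,(d_1+d_2)}{d_1^2+d_2^2} & \text{otherwise}.\end{cases}$$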
In this case, (2.8) becomes a system of nonlinear algebraic equations, which we solve with a Newton-Krylov-Schwarz type
algorithm: an inexact Newton method is applied to the nonlinear
system, and the linear algorithms described in Sect. 3 are used for the Jacobian systems [3, 35]. We solve Case-2 on
a 512×512×6 mesh by the implicit method with Δt=T/5,000. As shown in Fig. 14, undershoots appear when the
limiter is not applied, while the undershoots disappear and monotonicity is obtained with the use of the
limiter.
Fig. 14
Contour plots of Case-2 at t=T/2 without (left panel) or with (right panel) the limiter. The problem is solved on
a 512×512×6 mesh by the implicit method with △t=T/5,000
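The limited reconstruction used to suppress the oscillations can be illustrated in one dimension. The sketch below evaluates the van Albada limiter (zero whenever the two differences have opposite signs or one of them vanishes) and the two reconstructed states at the interface i−1/2 for a single row of cell averages. The function names and the restriction to one index direction are assumptions made for illustration; the cubed-sphere geometry and the j-direction are omitted.

```python
def van_albada(d1, d2):
    """Limited slope from two neighboring differences; zero at local extrema."""
    if d1 * d2 <= 0.0:
        return 0.0
    return d1 * d2 * (d1 + d2) / (d1 * d1 + d2 * d2)


def reconstruct_interface(phi, i):
    """Left (-) and right (+) states at interface i-1/2 of a 1D row of cell averages.

    Requires 2 <= i <= len(phi) - 2 so that every stencil cell exists.
    """
    left = phi[i - 1] + 0.5 * van_albada(phi[i - 1] - phi[i - 2], phi[i] - phi[i - 1])
    right = phi[i] - 0.5 * van_albada(phi[i] - phi[i - 1], phi[i + 1] - phi[i])
    return left, right


# Smooth (linear) data: both states meet at the midpoint value.
print(reconstruct_interface([0.0, 1.0, 2.0, 3.0, 4.0], 2))
# Step data: the limiter switches off the slope, so no new extrema are created.
print(reconstruct_interface([0.0, 0.0, 1.0, 1.0, 1.0], 2))
```

For equal differences the limiter returns the difference itself, so the scheme keeps its second-order reconstruction on smooth data, while at a jump it falls back to the cell averages.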
4.3 Parallel Performance
In this subsection, we present some numerical results using Case-1 and mainly focus on the parallel performance
of the Schwarz preconditioner, which is the most critical component of the implicit method. Our algorithms are
implemented based on the Portable, Extensible Toolkit for Scientific Computation (PETSc) [1]. All computations
are performed on a Dell PowerEdge C6100 supercomputer located at the University of Colorado Boulder. Each
node contains 24 GB of local memory and two hex-core 2.8 GHz Intel Westmere processors. The nodes are
interconnected via a non-blocking QDR InfiniBand high performance network.
Scalability is an important issue in parallel computing, especially for solving large-scale problems with many
processors. In our tests, the strong scalability is defined by Speedup=T1/T2 where T1 and T2 are the execution
times obtained by running the parallel code with Np,1 and Np,2 processors (Np,1≤Np,2), respectively. The weak
scalability is used to examine how the execution time varies with the number of processors when the problem
size per processor is fixed.
For the implicit solver, we check the robustness of the algorithms with respect to the time step size Δt.
Table 5 shows the influence of Δt for a fixed 2,048×2,048×6 mesh. As Δt increases, the number of iterations
increases, while the computing time decreases. Also, it is clear that the strong scalability of the proposed method is
better when Δt is small. In this test case, a sparse LU factorization is
used to solve the subdomain problems in the RAS preconditioner. It is important to note that the timing results
obtained with the first-order discretization based preconditioner are always better than those obtained with
the second-order discretization based preconditioner.
Table 5
Effect of time step sizes for Case-1
Np | Iter Time (Δt=T/10) | Iter Time (Δt=T/50) | Iter Time (Δt=T/200) | Iter Time (Δt=T/500)
1st-order-pre
192 132.6 176.5 67.3 630.7 40.3 2,205.7 27.3 4,869.9
384 139.5 83.6 68.7 293.0 41.2 977.0 27.9 2,226.5
768 168.5 63.2 73.3 126.3 42.4 399.3 28.6 893.3
1,536 182.0 28.9 76.2 67.4 43.3 201.7 29.1 503.2
3,072 257.4 14.6 84.6 37.7 45.5 110.2 29.8 285.5
2nd-order-pre
192 62.8 449.5 15.3 1,830.7 9.6 7,313.7 8.6 20,062.9
384 67.0 216.3 16.3 816.6 10.0 3,501.6 8.9 9,274.0
768 119.2 85.3 25.2 220.9 10.6 767.0 9.0 2,039.3
1,536 134.4 49.3 27.8 107.7 10.9 352.5 9.1 934.8
3,072 250.5 40.2 46.5 48.8 14.4 124.6 9.5 388.1
2,048×2,048×6 mesh, LU subdomain solver, and δ=1. “Iter” denotes the average number of iterations per time
step, and “Time” denotes the total computing time in seconds. “1st-order-pre” denotes the preconditioner with
the first-order discretization; “2nd-order-pre” denotes the preconditioner with the second-order discretization
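The strong-scaling definition Speedup=T1/T2 can be applied directly to the timings in Table 5. The sketch below does so for the Δt=T/500 column, taking Np=192 as the reference run. The parallel efficiency (speedup divided by the ideal speedup Np,2/Np,1) is a commonly used derived quantity that is added here as an assumption; it is not defined in the text.

```python
# Total computing times in seconds, copied from Table 5 (column dt = T/500).
NP = [192, 384, 768, 1536, 3072]
TIME_1ST = [4869.9, 2226.5, 893.3, 503.2, 285.5]    # 1st-order-pre
TIME_2ND = [20062.9, 9274.0, 2039.3, 934.8, 388.1]  # 2nd-order-pre


def strong_scaling(nprocs, times):
    """Return (Np, speedup, efficiency) relative to the smallest processor count."""
    rows = []
    for n, t in zip(nprocs, times):
        speedup = times[0] / t      # Speedup = T1 / T2
        ideal = n / nprocs[0]       # ideal speedup Np,2 / Np,1
        rows.append((n, speedup, speedup / ideal))
    return rows


for label, times in (("1st-order-pre", TIME_1ST), ("2nd-order-pre", TIME_2ND)):
    for n, s, e in strong_scaling(NP, times):
        print(f"{label}  Np={n:5d}  speedup={s:6.2f}  efficiency={e:5.2f}")
```

Efficiencies above one are possible in this setting: with an LU subdomain solver the factorization cost typically grows faster than linearly with the subdomain size, so halving the subdomain can more than halve the work per processor.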
In the Schwarz preconditioner, the overlapping parameter δ plays an important role in controlling the number of
iterations and the total computing time. For this experiment, we consider two meshes, 1,024×1,024×6 and
2,048×2,048×6, and two fixed time step sizes, Δt=T/100 and Δt=T/1,000. We run Case-1 using different overlapping
sizes and different numbers of processors. The subdomain problems are solved with the LU factorization. In
Tables 6 and 7, we show the performance of the additive Schwarz preconditioner. The results suggest that an
optimal overlapping size exists if the goal is to minimize the total computing time for a given number of
processors on a particular machine. Also, measured by the computing time, the implicit method is more attractive
with the first-order preconditioner than with the second-order one, as shown in Fig. 15.
Fig. 15
Strong scalability results for Case-1 with different number of processors Np. The mesh is 2,048×2,048×6
and Δt=T/100. We use LU and δ=1
Table 6
Effect of overlapping size δ for Case-1
Np | Iter Time (δ=0) | Iter Time (δ=1) | Iter Time (δ=2) | Iter Time (δ=3)
1st-order-pre
192 54.6 1,112.4 53.0 1,135.9 51.7 1,202.1 51.6 1,233.7
384 54.0 516.5 54.1 531.8 53.0 562.3 52.7 604.9
768 58.2 221.2 56.5 223.5 55.0 247.9 54.6 275.8
1,536 58.7 115.7 58.1 117.7 56.7 119.5 56.1 141.5
3,072 67.1 64.3 62.7 63.8 60.5 80.5 59.7 86.9
2nd-order-pre
192 22.8 3,672.8 10.7 3,609.0 8.7 3,883.5 8.4 4,421.4
384 24.1 1,551.6 11.1 1,630.7 9.1 1,736.6 8.7 1,716.3
768 26.8 368.8 14.6 405.5 13.6 434.8 13.3 456.9
1,536 28.3 232.2 15.7 196.3 14.6 199.8 14.3 219.0
3,072 37.4 81.0 24.7 88.1 23.8 93.8 23.4 103.0
LU subdomain solver, 2,048×2,048×6 mesh, and Δt=T/100. “Iter” denotes the average number of iterations per
time step, and “Time” denotes the total computing time in seconds. “1st-order-pre” denotes the preconditioner
with the first-order discretization; “2nd-order-pre” denotes the preconditioner with the second-order
discretization
Table 7
Effect of the overlapping size δ for Case-1
Np | Iter Time (δ=0) | Iter Time (δ=1) | Iter Time (δ=2) | Iter Time (δ=3)
1st-order-pre
192 18.4 1,277.7 15.5 1,262.7 14.7 1,371.0 14.5 1,452.7
384 18.5 626.8 15.7 620.7 14.9 639.2 14.5 760.9
768 18.9 255.6 16.1 264.0 15.4 270.5 15.0 304.7
1,536 19.2 173.3 16.3 152.5 15.4 159.6 15.1 181.8
3,072 19.6 101.3 16.7 103.0 15.9 110.6 15.5 121.6
2nd-order-pre
192 14.5 3,083.0 6.6 3,443.8 4.1 3,507.8 3.3 4,084.6
384 14.8 1,372.0 6.9 1,496.7 4.1 1,624.4 3.3 1,764.1
768 15.1 351.7 6.9 343.3 4.2 385.0 3.3 427.3
1,536 15.4 227.3 7.2 215.0 4.2 200.8 3.3 248.4
3,072 15.6 110.1 7.4 110.5 4.4 128.6 3.8 150.0
LU subdomain solver, 1,024×1,024×6 mesh, and Δt=T/1,000. “Iter” denotes the average number of iterations
per time step, and “Time” denotes the total computing time in seconds. “1st-order-pre” denotes the
preconditioner with the first-order discretization; “2nd-order-pre” denotes the preconditioner with the second-
order discretization
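The role of the overlap δ can be seen in a toy setting. The sketch below applies a restricted additive Schwarz preconditioner to a one-dimensional model matrix: backward Euler with first-order upwinding for a constant positive velocity, so each row reads (1+c) y[k] − c y[k−1], with c playing the role of a CFL number. Each subdomain is extended by δ cells, solved exactly, and only the values on the non-overlapping part are kept. The model problem, the function names and the parameters are all assumptions for illustration; this is neither the cubed-sphere discretization nor PETSc code.

```python
def solve_upwind_block(rhs, c):
    """Solve the lower bidiagonal system (1 + c) y[k] - c y[k-1] = rhs[k].

    The value upstream of the first cell is taken as zero, which is the
    homogeneous condition imposed on the artificial subdomain boundary.
    """
    y, prev = [], 0.0
    for value in rhs:
        prev = (value + c * prev) / (1.0 + c)
        y.append(prev)
    return y


def ras_apply(r, c, nsub, delta):
    """Apply the restricted additive Schwarz preconditioner to a residual r."""
    n = len(r)
    size = n // nsub  # sketch assumption: nsub divides n evenly
    z = [0.0] * n
    for start in range(0, n, size):
        lo = max(0, start - delta)            # overlapping subdomain [lo, hi)
        hi = min(n, start + size + delta)
        local = solve_upwind_block(r[lo:hi], c)
        # Restricted prolongation: keep only the cells this subdomain owns.
        for k in range(start, start + size):
            z[k] = local[k - lo]
    return z


def deviation_from_exact(n, c, nsub, delta):
    """Max difference between the RAS result and the global solve for r = 1."""
    r = [1.0] * n
    exact = solve_upwind_block(r, c)
    z = ras_apply(r, c, nsub, delta)
    return max(abs(a - b) for a, b in zip(z, exact))


for delta in (0, 1, 2, 3):
    print(delta, deviation_from_exact(n=16, c=2.0, nsub=4, delta=delta))
```

In this toy problem the deviation from the global solve shrinks as δ grows, while each subdomain solve becomes larger. This is the same trade-off between iteration count and cost per iteration that is visible in Tables 6 and 7.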
The performance of the implicit method depends heavily on how the subdomain problems are solved. In the
following set of tests, we compare several different subdomain solvers, including sparse ILU factorizations
with different levels of fill-in. We run the tests on a fixed 2,048×2,048×6 mesh with Δt=T/100, and summarize
the results for different numbers of processors and levels of fill-in in Table 8. Compared with Table 6, we see
that the implicit method with ILU is more attractive in terms of the computing time. The number of
iterations is relatively small and increases slightly as Np grows to 3,072. Moreover, in terms of the total
computing time, the “2nd-order-pre” approach performs better than the “1st-order-pre”
approach when the number of processors is small, such as 192 and 384, but the “1st-order-pre” approach
becomes more competitive as Np becomes larger. As a result, as shown in
Fig. 16, the “1st-order-pre” approach shows better strong scalability than the “2nd-order-pre” approach.
In this sense, the “1st-order-pre” approach is more suitable for solving large-scale problems with many
processors.
Fig. 16
Strong scalability results for Case-1 with different number of processors Np. The mesh is 2,048×2,048×6
and △t=T/100. We use ILU(3) and δ=1
Table 8
Test results using different fill-in levels k and different number of processors for Case-1
Np | Iter Time (k=2) | Iter Time (k=3) | Iter Time (k=4) | Iter Time (k=5)
1st-order-pre
192 106.1 588.5 77.9 518.8 64.0 492.0 57.4 494.3
384 107.5 305.2 79.6 269.3 65.3 255.0 58.2 251.5
768 109.6 160.9 82.4 145.0 67.2 140.0 60.6 137.5
1,536 111.8 90.9 84.5 84.7 68.9 80.7 62.1 80.9
3,072 114.8 58.5 87.6 56.5 72.9 54.2 66.6 53.5
2nd-order-pre
192 64.5 599.4 32.3 434.2 22.8 402.8 18.6 401.6
384 67.6 352.4 33.3 273.5 23.8 244.0 19.7 230.1
768 68.6 202.7 34.1 162.7 24.2 146.7 20.1 138.0
1,536 71.8 108.3 38.1 89.0 26.9 84.1 22.6 96.8
3,072 73.6 87.6 38.9 59.1 29.6 58.9 26.3 58.7
δ=1 and Δt=T/100. 2,048×2,048×6 mesh. “Iter” denotes the average number of iterations per time step, and
“Time” denotes the total computing time in seconds. “1st-order-pre” denotes the preconditioner with the first-
order discretization; “2nd-order-pre” denotes the preconditioner with the second-order discretization
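The gain from replacing LU by ILU can be quantified by comparing matching entries of Tables 6 and 8, which share the same mesh (2,048×2,048×6), time step (Δt=T/100) and overlap (δ=1). The sketch below computes the ratio of LU time to ILU(3) time for each processor count. The choice of the k=3 column and the helper name `time_ratios` are assumptions made for illustration.

```python
# Total computing times in seconds for delta = 1 and dt = T/100.
NP = [192, 384, 768, 1536, 3072]
LU_TIME = {  # Table 6, column delta = 1
    "1st-order-pre": [1135.9, 531.8, 223.5, 117.7, 63.8],
    "2nd-order-pre": [3609.0, 1630.7, 405.5, 196.3, 88.1],
}
ILU3_TIME = {  # Table 8, column k = 3
    "1st-order-pre": [518.8, 269.3, 145.0, 84.7, 56.5],
    "2nd-order-pre": [434.2, 273.5, 162.7, 89.0, 59.1],
}


def time_ratios(lu_times, ilu_times):
    """Ratio of LU time to ILU time; values above 1 mean ILU is faster."""
    return [lu / ilu for lu, ilu in zip(lu_times, ilu_times)]


for label in LU_TIME:
    ratios = time_ratios(LU_TIME[label], ILU3_TIME[label])
    for n, ratio in zip(NP, ratios):
        print(f"{label}  Np={n:5d}  LU/ILU(3) = {ratio:5.2f}")
```

The ratios are largest at small processor counts, where the subdomains are large and the exact factorization is most expensive, and they approach one as Np increases.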
Finally, to further examine the parallel performance of the proposed methods, we show the weak scalability in
Table 9 and Fig. 17. In terms of the computing time, the first-order approach is better than the second-order
approach with the LU subdomain solver and, at the larger processor counts, also with ILU(4), although the
first-order approach needs more iterations. We also observe that for the implicit
solver the number of iterations grows when the number of processors increases and the mesh is refined; as a
result, the computing time cannot stay constant. This suggests the need for a two-level or multilevel Schwarz
algorithm.
Fig. 17
Weak scalability results for Case-1 with different number of processors Np. A fixed 96×96 mesh is used per
processor. “1st-order-pre-LU” denotes the 1st-order scheme with an LU subdomain solve, and the others are
defined similarly
Table 9
Weak scalability results for Case-1 with a fixed 96×96 mesh per processor
Np | Mesh | Iter Time (1st-order-pre) | Iter Time (2nd-order-pre)
LU
24 192×192×6 17.0 19.5 6.3 38.3
96 384×384×6 24.0 24.8 8.1 41.7
384 768×768×6 34.8 34.7 10.1 46.7
864 1,152×1,152×6 43.3 43.4 12.4 60.1
1,536 1,536×1,536×6 52.0 55.5 15.2 68.8
1,944 1,728×1,728×6 54.7 61.1 16.7 75.1
2,904 2,112×2,112×6 61.7 68.3 19.8 89.8
ILU(4)
24 192×192×6 17.1 13.9 6.5 14.1
96 384×384×6 24.1 16.9 9.0 15.9
384 768×768×6 35.6 22.9 13.6 22.2
864 1,152×1,152×6 45.9 30.0 17.8 30.5
1,536 1,536×1,536×6 56.5 40.0 22.5 42.9
1,944 1,728×1,728×6 62.0 49.1 24.9 55.9
2,904 2,112×2,112×6 73.3 56.3 30.5 70.1
δ=1 and △t=T/100. “Iter” denotes the average number of iterations per time step, and “Time” denotes the total
computing time in seconds. “1st-order-pre” denotes the preconditioner with the first-order discretization;
“2nd-order-pre” denotes the preconditioner with the second-order discretization. The total number of degrees of
freedom for the biggest case is 2,112×2,112×6=26,763,264
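The weak-scaling setup of Table 9 can be checked with a few lines of arithmetic. The sketch below confirms that every run carries 96×96 cells per processor and computes a weak-scaling efficiency, taken here as the time of the smallest run divided by the time at each Np. That efficiency definition and the helper names are assumptions for illustration; the text only states that the problem size per processor is fixed.

```python
# (Np, N) pairs copied from Table 9; the mesh is N x N x 6.
CASES = [(24, 192), (96, 384), (384, 768), (864, 1152),
         (1536, 1536), (1944, 1728), (2904, 2112)]
# Total computing times in seconds, copied from Table 9.
TIMES = {
    "1st-order-pre, LU": [19.5, 24.8, 34.7, 43.4, 55.5, 61.1, 68.3],
    "2nd-order-pre, LU": [38.3, 41.7, 46.7, 60.1, 68.8, 75.1, 89.8],
    "1st-order-pre, ILU(4)": [13.9, 16.9, 22.9, 30.0, 40.0, 49.1, 56.3],
    "2nd-order-pre, ILU(4)": [14.1, 15.9, 22.2, 30.5, 42.9, 55.9, 70.1],
}


def cells_per_processor(nproc, n):
    """Number of mesh cells owned by each processor on an N x N x 6 mesh."""
    return n * n * 6 // nproc


def weak_efficiency(times):
    """Time of the smallest run divided by the time of each run (1.0 is ideal)."""
    return [times[0] / t for t in times]


for nproc, n in CASES:
    print(nproc, cells_per_processor(nproc, n))
for label, times in TIMES.items():
    print(label, [round(e, 2) for e in weak_efficiency(times)])
```

An efficiency that falls well below one as Np grows reflects the growth of the iteration count reported in Table 9, which is what a coarse level in a two-level or multilevel Schwarz method is meant to control.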
5 Concluding Remarks
A parallel, fully implicit method was developed for solving the tracer transport problem on the cubed-sphere.
Domain decomposition methods with both first-order and second-order discretizations were proposed to solve the
linear system at each time step. The implicit method with the second-order temporal discretization allows much
larger time steps than the explicit method, while preserving the accuracy of the solution, and also demonstrates
superior performance in terms of the total computing time compared to the explicit method. The effectiveness
and scalability of the implicit method depend heavily on the design of the preconditioner. After many
experiments, we found that the class of restricted additive Schwarz methods based on a first-order discretization
works well, and is more attractive than the one based on the second-order discretization. Excellent results were
obtained for solving several test problems with tens of millions of unknowns on a parallel machine with up to 3,072
processors. We believe that the family of Schwarz methods with low order discretization is suitable for larger
problems and for machines with a larger number of processors. Future research may include solving other flow
problems on the cubed-sphere on much finer meshes with a larger number of processors and with multilevel
Schwarz preconditioners.
Acknowledgments
The authors would like to express their appreciation to the anonymous reviewers for their invaluable comments,
which greatly improved the quality of the manuscript. This work was supported in part by NSF grant CCF-1216314
and DOE grant DE-SC0001774. H. Yang was also supported in part by NSFC grants 91330111,
11201137 and 11272352. C. Yang was also supported in part by NSFC grants 61170075 and 91130023.
References
1. Balay, S., Buschelman, K., Gropp, W.D., Kaushik, D., Knepley, M., McInnes, L.C., Smith, B.F., Zhang, H.: PETSc Users Manual. Argonne National Laboratory (2012)
2. Brown, P.N., Shumaker, D.E., Woodward, C.S.: Fully implicit solution of large-scale non-equilibrium radiation diffusion with high order time integration. J. Comput. Phys. 204, 760–783 (2005)
3. Cai, X.-C., Gropp, W.D., Keyes, D.E., Melvin, R.G., Young, D.P.: Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation. SIAM J. Sci. Comput. 19, 246–265 (1998)
4. Cai, X.-C., Sarkis, M.: A restricted additive Schwarz preconditioner for general sparse linear systems. SIAM J. Sci. Comput. 21, 792–797 (1999)
5. Chen, C., Xiao, F.: Shallow water model on cubed-sphere by multi-moment finite volume method. J. Comput. Phys. 227, 5019–5044 (2008)
6. Evans, K.J., Knoll, D.A.: Temporal accuracy of phase change convection simulations using the JFNK-SIMPLE algorithm. Int. J. Num. Meth. Fluids. 55, 637–655 (2007)
7. Erath, C., Lauritzen, P.H., Garcia, J.H., Tufo, H.M.: Integrating a scalable and efficient semi-Lagrangian multi-tracer transport scheme in HOMME. Proc. Comput. Sci. 9, 994–1003 (2012)
8. Harris, L.M., Lauritzen, P.H., Mittal, R.: A flux-form version of the conservative semi-Lagrangian multi-tracer transport scheme (CSLAM) on the cubed-sphere grid. J. Comput. Phys. 230, 1215–1237 (2011)
9. Jacobson, M.Z.: Fundamentals of Atmospheric Modeling. Cambridge University Press, New York (1999)
10. Knoll, D.A., Chacon, L., Margolin, L.G., Mousseau, V.A.: On balanced approximations for time integration of multiple time scale systems. J. Comput. Phys. 185, 583–611 (2003)
11. Lauritzen, P.H., Nair, R.D., Ullrich, P.A.: A conservative semi-Lagrangian multi-tracer transport scheme (CSLAM) on the cubed-sphere grid. J. Comput. Phys. 229, 1401–1424 (2010)
12. Lauritzen, P.H., Skamarock, W.C.: Test-case suite for 2D passive tracer transport: a proposal for the NCAR transport workshop. March (2011)
13. Lauritzen, P.H., Jablonowski, C., Taylor, M., Nair, R.: Numerical Techniques for Global Atmospheric Models. Lecture Notes in Computational Science and Engineering. Springer, Berlin (2011)
14. Lauritzen, P.H., Ullrich, P.A., Nair, R.D.: Atmospheric transport schemes: desirable properties and a semi-Lagrangian view on finite-volume discretizations. In: Lecture Notes in Computational Science and Engineering (Tutorials), vol. 80, Springer (2011)
15. Lauritzen, P.H., Skamarock, W.C., Prather, M.J., Taylor, M.A.: A standard test case suite for two-dimensional linear transport on the sphere. Geosci. Model Dev. Discuss. 5, 189–228 (2012)
16. Lauritzen, P.H., Thuburn, J.: Evaluating advection/transport schemes using interrelated tracers, scatter plots and numerical mixing diagnostics. Q. J. Roy. Meteor. Soc. 138, 906–918 (2012)
17. Nair, R.D., Thomas, S.J., Loft, R.D.: A discontinuous Galerkin global shallow water model. Mon. Weather Rev. 133, 876–888 (2005)
18. Nair, R.D., Lauritzen, P.H.: A class of deformational-flow test cases for linear transport problems on the sphere. J. Comput. Phys. 229, 8868–8887 (2010)
19. Putman, W.M., Lin, S.-J.: Finite-volume transport on various cubed-sphere grids. J. Comput. Phys. 227, 55–78 (2007)
20. Rancic, M.R., Purser, J., Mesinger, F.: A global-shallow water model using an expanded spherical cube: Gnomonic versus conformal coordinates. Q. J. Roy. Meteor. Soc. 122, 959–982 (1996)
21. Ronchi, C., Iacono, R., Paolucci, P.: The cubed sphere: a new method for the solution of partial differential equations in spherical geometry. J. Comput. Phys. 124, 93–114 (1996)
22. Sadourny, R., Arakawa, A., Mintz, Y.: Integration of the nondivergent barotropic vorticity equation with an icosahedral-hexagonal grid for the sphere. Mon. Weather Rev. 96, 351–356 (1968)
23. Sadourny, R.: Conservative finite-difference approximations of the primitive equations on quasi-uniform spherical grids. Mon. Weather Rev. 100, 211–224 (1972)
24. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
25. Shadid, J.N., Tuminaro, R.S., Devine, K.D., Hennigan, G.L., Lin, P.T.: Performance of fully coupled domain decomposition preconditioners for finite element transport/reaction simulations. J. Comput. Phys. 205, 24–47 (2005)
26. Smith, B., Bjørstad, P., Gropp, W.: Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, Cambridge (1996)
27. Toselli, A., Widlund, O.: Domain Decomposition Methods-Algorithms and Theory. Springer, Berlin (2005)
28. Van Albada, G.D., van Leer, B., Roberts, W.W.: A comparative study of computational methods in cosmic gas dynamics. Astron. Astrophys. 108, 95–103 (1982)
29. White III, J.B., Dongarra, J.J.: High-performance high-resolution semi-Lagrangian tracer transport on a sphere. J. Comput. Phys. 230, 6778–6799 (2011)
30. Williamson, D.L., Drake, J.B., Hack, J.J., Jakob, R., Swarztrauber, P.N.: A standard test set for numerical approximations to the shallow water equations in spherical geometry. J. Comput. Phys. 102, 211–224 (1992)
31. Wu, Y., Cai, X.-C., Keyes, D.E.: Additive Schwarz methods for hyperbolic equations. In: Mandel, J., Farhat, C., Cai, X.-C. (eds.) Proceedings of the 10th International Conference on Domain Decomposition Methods, AMS, pp. 513–521 (1998)
32. Yang, C., Cao, J., Cai, X.-C.: A fully implicit domain decomposition algorithm for shallow water equations on the cubed-sphere. SIAM J. Sci. Comput. 32, 418–438 (2010)
33. Yang, C., Cai, X.-C.: Parallel multilevel methods for implicit solution of shallow water equations with nonsmooth topography on cubed-sphere. J. Comput. Phys. 230, 2523–2539 (2011)
34. Yang, C., Cai, X.-C.: A scalable fully implicit compressible Euler solver for mesoscale nonhydrostatic simulation of atmospheric flows. SIAM J. Sci. Comput. To appear
35. Yang, H., Cai, X.-C.: Parallel two-grid semismooth Newton-Krylov-Schwarz method for nonlinear complementarity problems. J. Sci. Comput. 47, 258–280 (2011)
36. Yang, H., Prudencio, E., Cai, X.-C.: Fully implicit Lagrange-Newton-Krylov-Schwarz algorithms for boundary control of unsteady incompressible flows. Int. J. Numer. Meth. Eng. 91, 644–665 (2012)
37. Zhang, J., Wang, L.L., Rong, Z.: A prolate-element method for nonlinear PDEs on the sphere. J. Sci. Comput. 47, 73–92 (2011)