
Computers & Fluids 39 (2010) 1411–1423


A parallel workload balanced and memory efficient lattice-Boltzmann algorithm with single unit BGK relaxation time for laminar Newtonian flows

David Vidal a,b,*, Robert Roy a, François Bertrand a,**

a Ecole Polytechnique de Montréal, Montréal, QC, Canada H3C 3A7
b FPInnovations – Paprican, Pointe-Claire, QC, Canada H9R 3J9

Article info

Article history: Received 30 November 2008; Received in revised form 11 January 2010; Accepted 15 April 2010; Available online 20 April 2010

Keywords: Lattice Boltzmann method; Fluid flow; Porous media; SPMD parallelization; Workload balance; Memory usage

0045-7930/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.compfluid.2010.04.011

* Corresponding author at: Ecole Polytechnique de Montréal, Montréal, QC, Canada H3C 3A7.
** Corresponding author.
E-mail addresses: [email protected] (D. Vidal), [email protected] (R. Roy), [email protected] (F. Bertrand).

Abstract

A parallel workload balanced and memory efficient lattice-Boltzmann algorithm for laminar Newtonian fluid flow through large porous media is investigated. It relies on a simplified LBM scheme using a single unit BGK relaxation time, which is implemented by means of a shift algorithm and comprises an even fluid node partitioning domain decomposition strategy based on a vector data structure. It provides perfect parallel workload balance, and its two-nearest-neighbour communication pattern combined with a simple data transfer layout results in 20–55% lower communication cost, 25–60% higher computational parallel performance and 40–90% lower memory usage than previously reported LBM algorithms. Performance tests carried out using scale-up and speed-up case studies of laminar Newtonian fluid flow through hexagonal packings of cylinders and a random packing of polydisperse spheres on two different computer architectures reveal parallel efficiencies with 128 processors as high as 75% for domain sizes comprising more than 5 billion fluid nodes.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The lattice Boltzmann method (LBM), developed in the early 1990s, is now considered by many researchers as the method of choice for the simulation of single phase or multiphase flows in complex sparse geometries such as porous media [1–3]. It originates from a discretized version of the Boltzmann equation in which collisions are handled by a relaxation procedure towards local equilibrium. LBM is advantageous with respect to more classical CFD methods because of three main factors: (1) its flexibility in discretizing complex geometries by means of a simple structured lattice in which nodes are marked as "fluid" or "solid" depending on the phase they belong to, (2) the explicit nature of its underlying scheme, which facilitates its parallelization, and (3) its relative ease of implementation. Despite these advantages, three areas of improvement have recently attracted the attention of researchers in relation to the simulation of fluid flow in porous media: (1) the reduction of core memory usage, (2) the improvement of computational efficiency and accuracy of LBM algorithms, and (3) the reduction of workload imbalance often observed with highly heterogeneous domains when computing in parallel.


Several researchers [3–6] showed that transforming the sparse matrix data structure inherent to the LBM lattice into a vector data structure in which only the fluid nodes are retained (no computations take place on the solid nodes) through semi-direct or indirect addressing can significantly reduce memory usage. Furthermore, Martys and Hagedorn [3] proposed a simplification of the LBM scheme using a relaxation time equal to unity, which allows a significant reduction of the memory consumption by only requiring the storage of the fluid density and the three components of the velocity for each fluid node in the domain. Despite its memory advantage over standard population-storing methods, this promising density/velocity-storing strategy has not really been picked up by the scientific community, probably because of its apparent restrictions on the relaxation time. Argentini et al. [7] introduced another method to reduce memory usage by up to 78%, but it uses a non-BGK approach that is limited to Stokes flows.

Because of the LBM's young age and heuristic development, a wide range of schemes and implementations are available. Some of them have recently been carefully evaluated by several researchers. For example, Pan et al. [8] compared Bhatnagar–Gross–Krook single-relaxation-time (BGK) and multiple-relaxation-time (MRT) LBM schemes. They found a better accuracy with the MRT schemes at the expense of a slightly higher computational cost (10–20%), although the BGK scheme can still provide accurate results when the single-relaxation-time parameter is equal to unity. Very recently, Mattila et al. [9] performed a comprehensive comparison in terms of computational efficiency and memory consumption of five different LBM implementations found in the literature: two well-established algorithms, the two-lattice and one-lattice (two-step) algorithms, and three recent ones, the Lagrangian, shift (also called compressed-grid) and swap algorithms. They also investigated the effect of various data structures and memory addressing schemes on computational efficiency. They found that the shift [10] and the newly developed swap algorithms [11], combined with a novel bundle data structure and indirect addressing, both yield high computational performance and low memory consumption. The reader is referred to [9–12] and references therein for a more thorough description of these different implementations.

Fig. 1. Numbering of the 15 populations of the D3Q15 lattice used in the present work. Odd and even numbers correspond respectively to the forward-pointing and backward-pointing populations. Population 0 is a rest population.

Concerning parallel efficiency, the classical Cartesian domain decompositions studied by Satofuka and Nishioka [13], which consist in dividing the whole domain into equally-sized subdomains (slices or boxes), may lead to more and more severe workload imbalance as the number of subdomains increases. As underlined by Pan et al. [6], this is due to the variation of the porosity among the subdomains as their number is increased, all the more so in the case of heterogeneous porous media. To alleviate this problem, several methods have been proposed based on orthogonal recursive bisection [6,14], multilevel recursive bisection or k-way graph partitioning using the METIS package [4,5,15,16]. Although they provide definite improvements with regard to the classical domain decomposition methods, these methods are not easy to implement, come at a high memory expense, create complex communication patterns and can become computationally expensive when dealing with very large systems (say billions of lattice nodes) or when dynamic load balancing is required, such as for fluid flow through packings of highly polydisperse spheres [17] or settling particles [18].

Wang et al. [19] proposed a quick, simple and elegant way to balance workload and reduce memory requirement for the two-lattice LBM implementation. The method consists of first vectorizing the data structure through the use of indirect memory addressing as previously proposed by others [4–6]. But to achieve accurate workload balance, the resulting data vector is then simply split into equally-sized subvectors, each of which is assigned to a specific processor. As a result, exact fluid node load balance and high parallel efficiency can be achieved. Furthermore, a simple communication pattern among processors, similar to slice domain decomposition, is obtained since data communication for a given processor only involves its two nearest processors. However, the minimization of the amount of communication is not accounted for by the method, and this may impair its parallel performance. To further improve this method and lower memory usage, we recently proposed a one-lattice algorithm with a vector data structure combined with an even fluid node partitioning domain decomposition and a fully-optimized data transfer layout that carefully selects and minimizes the LBM populations to be communicated [20]. As communication overhead will always impair parallel performance when decreasing domain size, there is an obvious interest in LBM algorithms that can reduce further the amount of data to be exchanged between processors.

Martys and Hagedorn [3] proposed a few years ago a simplified LBM algorithm that reduces core memory usage. It appears that this method has so far been overlooked by the scientific community, probably because the use of a constant relaxation time is limited in scope and may not be suitable for certain classes of problems such as those involving turbulent and non-Newtonian fluid flows. However, for the specific case of laminar Newtonian flows, not only does this method decrease significantly core memory usage by only requiring the storage of the density and the three components of the velocity for each fluid node of the domain, but also, something that has not been reported by these authors, it reduces by 20–55% (depending on the lattice type) the communication cost as only the fluid density and velocity arrays need to be exchanged instead of the usual 5–9 inward LBM population arrays.

In a previous companion paper [20], we showed how to improve significantly the parallel performance of large LBM simulations in the general case of flows through heterogeneous porous media. Since many flow problems in porous media involve laminar Newtonian flows (e.g. for the prediction of the permeability of the porous medium), we propose here to use a combination of methods to further improve memory usage and parallel performance for this specific class of problems. The specific objective of this work is in fact threefold: (1) to highlight the advantages and limitations of the simplified LBM algorithm proposed by Martys and Hagedorn [3] to reduce memory usage, (2) to combine this simplified algorithm with a shift algorithm based on a vector data structure and an even fluid node partitioning domain decomposition, in order to improve parallel performance when dealing with heterogeneous domains, and (3) to compare the memory usage and parallel performance of this novel algorithm with theoretical performance model predictions and one-lattice LBM implementations previously studied [20]. First, the lattice Boltzmann method will be recalled and simplifications to this method will be examined. Second, the shift LBM implementation will be described along with the data structure used, and the resulting gain in memory will be evaluated. Parallel communication and workload balance strategies and their resulting performance will be next examined. Finally, the computational advantages of the new proposed method will be assessed by means of two case studies involving 3-dimensional laminar Newtonian fluid flows through hexagonal packings of cylinders and a packing of polydisperse spheres.

2. The lattice Boltzmann method

Contrary to the conventional CFD methods that solve directly the Navier–Stokes equations, LBM actually "simulates" by means of particles the macroscopic behaviour of fluid molecules in motion. More precisely, LBM comes from the discretization in space (x), velocity (e) and time (t) of the Boltzmann equation from the kinetic gas theory that describes the evolution of the probability distribution function (or population) of a particle, f(x, e, t), and its microdynamic interactions.

2.1. Fused collision-propagation scheme

In practice, the populations of particles propagate and collide at every time step δt on a lattice with spacing δx and along ei velocity directions, where the number of directions i (nd) depends on the type of lattice. A D3Q15 lattice is used in the present work, i.e. a 3-dimensional lattice with nd = 15 velocity directions (Fig. 1).


The general collision-propagation procedure can be mathematically summarized by a fused scheme with a single relaxation time:

$$f_i(\mathbf{x},t) = f_i(\mathbf{x}-\mathbf{e}_i\delta_t,\,t-\delta_t) - \frac{f_i(\mathbf{x}-\mathbf{e}_i\delta_t,\,t-\delta_t) - f_i^{eq}(\mathbf{x}-\mathbf{e}_i\delta_t,\,t-\delta_t)}{\tau^*}, \qquad (1)$$

where fi(x, t) is the particle probability distribution function (or population) in the direction of the velocity ei at position x and time t, and τ* is a dimensionless relaxation time. The second term of the right-hand side of Eq. (1) approximates the collision process by means of a single relaxation procedure (the so-called Bhatnagar, Gross and Krook approximation [21]) towards a local equilibrium population that can be given for a D3Q15 lattice by

$$f_i^{eq}(\mathbf{x},t) = w_i\,\rho\left[1 + 3\,(\mathbf{e}_i\cdot\mathbf{u})\left(\frac{\delta_t}{\delta_x}\right)^{2} + \frac{9}{2}\,(\mathbf{e}_i\cdot\mathbf{u})^{2}\left(\frac{\delta_t}{\delta_x}\right)^{4} - \frac{3}{2}\,(\mathbf{u}\cdot\mathbf{u})\left(\frac{\delta_t}{\delta_x}\right)^{2}\right], \qquad (2)$$

where $w_0 = \frac{2}{9}$, $w_i = \frac{1}{9}$ for $i = 1{:}6$, $w_i = \frac{1}{72}$ for $i = 7{:}14$, and where the local fluid density is defined as

$$\rho = \rho(\mathbf{x},t) = \sum_i f_i, \qquad (3)$$

and the local macroscopic fluid velocity is given by

$$\mathbf{u} = \mathbf{u}(\mathbf{x},t) = \frac{1}{\rho}\sum_i f_i\,\mathbf{e}_i. \qquad (4)$$

The dimensionless relaxation time τ* is related to the kinematic viscosity ν of the fluid:

$$\tau^* = \frac{\nu}{\delta_t\,c_s^2} + \frac{1}{2}, \qquad (5)$$

where cs is the speed of sound defined as

$$c_s = \sqrt{\frac{3\,(1-w_0)}{7}}\,\frac{\delta_x}{\delta_t}. \qquad (6)$$

It follows from Eqs. (5) and (6) that two degrees of freedom among the three parameters τ*, δt and δx must be set to obtain a given viscosity. One possibility is to fix δx and δt, but this can lead to significant inaccuracies, depending on the boundary conditions used, as will be further discussed in Section 5.2.

Starting from initial conditions and using appropriate boundary conditions, the collision-propagation scheme described in Eq. (1) is marched in time until an appropriate convergence is reached (e.g. (du/dt)/(du/dt)_max = 10^-5). The LBM scheme is explicit and the update of the populations at a lattice node is a local operation since it only requires the populations of the immediate neighbouring nodes. This makes the LBM scheme well adapted to distributed parallelization. Finally, as previously mentioned, there are several ways to implement this scheme, which were recently rigorously classified and studied by Mattila et al. [9]. In this work, a shift LBM implementation with a vector data structure is used, as described in more detail in Section 3.

2.2. Simplified fused collision-propagation scheme

It is known (see e.g. [3]) that the choice τ* = 1 leads to a drastic simplification of Eq. (1) of the LBM scheme. In such a case, the population pointing in the direction of the velocity ei at position x and time t is only dependent on the population at the local equilibrium at the previous time step at the closest neighbouring node in the −ei direction:

$$f_i(\mathbf{x},t) = f_i^{eq}(\mathbf{x}-\mathbf{e}_i\delta_t,\,t-\delta_t). \qquad (7)$$

The obvious outcome of this simplified scheme is that the nd populations at each node no longer need to be stored, and that only the local density and the three components of the fluid velocity must be, as the populations at local equilibrium can directly be computed through Eq. (2). The standard and simplified schemes will hereafter be called population-storing and density/velocity-storing schemes, respectively. The use of the simplified scheme results in a substantial reduction of memory requirements because of the replacement of nd arrays by four arrays of equal size. On the other hand, this scheme is limited to laminar Newtonian fluid flow problems since non-Newtonian and turbulent LBM schemes generally entail the determination of local relaxation times that yield the desired local viscosity (e.g. [22,23]). Moreover, another drawback is that the time step δt is now fixed for a given δx, so that one cannot play with δt to converge faster towards steady state, which may limit the computational performance of the method. However, as will be seen in Section 5.2, despite this apparent trade-off between memory and convergence speed, the use of a single unit relaxation time may be fully justified in the case of laminar Newtonian fluids when the accuracy of LBM is considered.
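To make the density/velocity-storing scheme concrete, here is a minimal C sketch of the fused update of Eqs. (2) and (7) for a single D3Q15 node. The authors' code is Fortran; the velocity-set ordering, function and array names, and the lattice-unit convention δx = δt = 1 below are illustrative assumptions, and the vector data structure of Section 3 is not shown.

```c
/* Minimal sketch of the density/velocity-storing update (tau* = 1):
 * each new population is the equilibrium of the upstream node at the
 * previous time step (Eq. (7)), so only rho and u must be stored.
 * Velocity-set ordering and lattice units (dx = dt = 1) are assumptions. */
#define ND 15                         /* D3Q15 */
const int e[ND][3] = {                /* lattice velocities (assumed ordering) */
  { 0, 0, 0},
  { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
  { 1, 1, 1}, {-1,-1,-1}, { 1, 1,-1}, {-1,-1, 1},
  { 1,-1, 1}, {-1, 1,-1}, { 1,-1,-1}, {-1, 1, 1}
};
const double w[ND] = {                /* weights of Eq. (2): 2/9, 1/9, 1/72 */
  2.0/9.0,
  1.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0, 1.0/9.0,
  1.0/72.0, 1.0/72.0, 1.0/72.0, 1.0/72.0, 1.0/72.0, 1.0/72.0, 1.0/72.0, 1.0/72.0
};

/* Equilibrium population, Eq. (2), in lattice units. */
double feq(int i, double rho, const double u[3])
{
  double eu = e[i][0]*u[0] + e[i][1]*u[1] + e[i][2]*u[2];
  double uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];
  return w[i] * rho * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*uu);
}

/* One fused collision-propagation step (Eq. (7)) for one fluid node,
 * given rho and u of its ND upstream neighbours x - e_i dt at t - dt
 * (for i = 0 this is the node itself). The new moments are rebuilt
 * with Eqs. (3) and (4). */
void update_node(const double rho_up[ND], const double u_up[ND][3],
                 double *rho_new, double u_new[3])
{
  double rho = 0.0, m[3] = {0.0, 0.0, 0.0};
  for (int i = 0; i < ND; ++i) {
    double fi = feq(i, rho_up[i], u_up[i]);  /* f_i(x,t) = f_i^eq(x - e_i dt, t - dt) */
    rho  += fi;
    m[0] += fi * e[i][0];
    m[1] += fi * e[i][1];
    m[2] += fi * e[i][2];
  }
  *rho_new = rho;
  u_new[0] = m[0] / rho;  u_new[1] = m[1] / rho;  u_new[2] = m[2] / rho;
}
```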

2.3. Boundary conditions and imposition of pressure drop

In this work, three types of boundary conditions are used, which are typical for a porous medium. First, the boundary conditions at the periphery of the domain are periodic, which means that any out-going population re-enters the domain on its opposite side. Second, to impose a pressure drop ΔP in a given ej direction, a body force is added on each node at each iteration in the ei directions not normal to ej. Combined with periodic boundary conditions in this direction, this "trick" enforces the prescribed pressure gradient. Third, the no-slip wall boundary conditions on the solid phase of the porous domain are modeled using the classical half-way bounce-back method, which reflects any in-coming populations to the wall in the opposite direction at the next iteration. The solid boundary is often reported to be located half-way between the last fluid node and the first solid node, but this might not be true in all circumstances (see Section 5.2). The combination of half-way bounce-back boundary conditions on solid walls, an added body force through the entire domain, and periodic conditions on the domain boundaries is a common choice for the LBM simulation of pressure-driven fluid flow through porous media. This combination is frequent because it is flexible, easy to implement and computationally efficient. The accuracy of these boundary conditions will be further discussed in Section 5.2. Note that, contrary to a one-lattice implementation (see e.g. [20]), there is no need here to implement so-called "periodic" and "bounce-back" lattice nodes because velocity and density data at the previous time step can be accessed directly thanks to the shift LBM implementation described next.
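One plausible way to fold the half-way bounce-back rule into the simplified gather is sketched below. It assumes that oppositely directed populations are numbered in odd/even pairs, in the spirit of Fig. 1, and reuses feq() from the previous sketch; it is not the authors' implementation, and the body-force contribution is omitted.

```c
/* Sketch of half-way bounce-back inside the simplified (tau* = 1) gather.
 * Assumption: oppositely directed populations are numbered in odd/even
 * pairs (1,2), (3,4), ... */
double feq(int i, double rho, const double u[3]);   /* from the previous sketch */

int opposite(int i)
{
  if (i == 0) return 0;                   /* rest population */
  return (i % 2 == 1) ? i + 1 : i - 1;    /* assumed odd/even pairing */
}

/* Population i arriving at node x: if the upstream node x - e_i dt is
 * solid, the population that left x towards the wall at t - dt (its
 * local equilibrium in the opposite direction, since tau* = 1) is
 * reflected back; otherwise Eq. (7) applies directly. */
double gather_population(int i, int upstream_is_solid,
                         double rho_up, const double u_up[3],    /* at x - e_i dt */
                         double rho_loc, const double u_loc[3])  /* at x          */
{
  if (upstream_is_solid)
    return feq(opposite(i), rho_loc, u_loc);   /* half-way bounce-back */
  return feq(i, rho_up, u_up);                 /* ordinary propagation */
}
```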

3. LBM implementation, data structure and memory requirements

Following the approach of other recent researchers [3–6], a vector data structure is used in this work instead of the sparse matrix data structure inherent to any porous structure, in order to reduce the memory usage by only storing information of the "fluid nodes" since no computations are performed on the "solid nodes" (Fig. 2). The reader is referred to [20] for a careful comparison of memory usage by these two data structures.
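A minimal sketch of how such a fluid-node-only vector with indirect addressing can be built is given below. It reuses the ND macro and velocity table e[] from the first sketch; the z–y–x traversal order, periodic wrap and type choices are illustrative assumptions, not the authors' data structure.

```c
/* Sketch: compacting an (nx*ny*nz) lattice into a fluid-node-only vector
 * with a connectivity (indirect addressing) list and an integer coordinate
 * list. 'solid' flags the phase of every lattice cell. */
#include <stdlib.h>

#define ND 15
extern const int e[ND][3];   /* D3Q15 velocity table from the first sketch */

typedef struct {
  long   nf;        /* number of fluid nodes                                   */
  long  *neighbor;  /* nf x (ND-1) connectivity list; -1 marks a solid upstream
                       node (the paper stores this as 4-byte integers, switching
                       to the 8-byte compiler option only for the largest domains) */
  short *coord;     /* nf x 3 integer coordinate list (2-byte, as in Eq. (8))  */
} FluidVector;

static long cell(int x, int y, int z, int nx, int ny, int nz)
{
  /* periodic wrap, z-y-x ordering (z varies fastest) */
  x = (x + nx) % nx;  y = (y + ny) % ny;  z = (z + nz) % nz;
  return ((long)x * ny + y) * nz + z;
}

FluidVector build_vector(const char *solid, int nx, int ny, int nz)
{
  long ncell = (long)nx * ny * nz;
  long *id = malloc(ncell * sizeof *id);   /* lattice cell -> fluid-node index */
  FluidVector v = {0, NULL, NULL};

  for (int x = 0; x < nx; ++x)             /* first pass: number the fluid nodes */
    for (int y = 0; y < ny; ++y)
      for (int z = 0; z < nz; ++z) {
        long c = cell(x, y, z, nx, ny, nz);
        id[c] = solid[c] ? -1 : v.nf++;
      }

  v.neighbor = malloc(v.nf * (ND - 1) * sizeof *v.neighbor);
  v.coord    = malloc(v.nf * 3 * sizeof *v.coord);

  for (int x = 0; x < nx; ++x)             /* second pass: connectivity list */
    for (int y = 0; y < ny; ++y)
      for (int z = 0; z < nz; ++z) {
        long n = id[cell(x, y, z, nx, ny, nz)];
        if (n < 0) continue;               /* skip solid nodes */
        v.coord[3*n] = x;  v.coord[3*n+1] = y;  v.coord[3*n+2] = z;
        for (int i = 1; i < ND; ++i)       /* upstream neighbour x - e_i */
          v.neighbor[n*(ND-1) + (i-1)] =
            id[cell(x - e[i][0], y - e[i][1], z - e[i][2], nx, ny, nz)];
      }
  free(id);
  return v;
}
```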

Fig. 2. 2D schematic discretization and decomposition of a porous medium (a) using a 24 × 15 node domain on a 4-processor cluster. The sparse matrix data structure (b) is converted into a (ordered) vector data structure (c), which is equally split among the processors, resulting in a decomposition of the original discretization (d). Note that the vectorization order (y–x order) determines the location of the CPU interfaces (dashed red lines). Green and blue nodes are ghost layer nodes that need to be added to the left and the right of each subdomain, respectively. No computations are performed on ghost layer nodes, which are used to facilitate the transfer of data during the propagation step. Buffer nodes are also added to allow the shift algorithm to proceed without overwriting necessary data (see Fig. 3).

The LBM scheme being explicit, the information at the previous time step is required to compute that at the current time step. To implement this technique and avoid the overwriting of useful information, data at any two consecutive time steps could be kept in memory, as in the so-called two-lattice algorithm [19], although such an approach is memory inefficient. To alleviate this problem, researchers have devised algorithms that overlap two consecutive time steps into one memory array through the use of a clever order of population updates, and thus avoid the overwriting of yet-to-be-propagated populations by ones that have already been propagated. To do so, the one-lattice (two-step) algorithm first executes locally the collisions for all the nodes. It then propagates the resulting populations in ascending node order for the backward-pointing populations and in descending node order for the forward-pointing populations. For this to work, it requires the use of so-called (tagged) periodic and bounce-back nodes to store out-going and solid-pointing populations, respectively [20]. Recently, Mattila et al. [11] introduced the "swap" algorithm based on a fused (one-step) propagation-collision scheme that, at each lattice node, permutes populations with neighbouring nodes that have not yet been updated before performing the collision with previously-arrived and newly-permutated populations. Note that, unlike the other algorithms, the swap algorithm does not require the allocation of additional memory, but is restricted to a population-storing scheme. Another possibility is to consider the shift (also called compressed-grid) algorithm, which uses additional memory through buffer nodes to store newly-updated values, thus preventing the overwriting of still useful data [10]. More precisely, the size Sbuf of the memory offset between specific data at two consecutive time steps is determined by the data spatial dependency, which is related to the propagation step of LBM. As illustrated in Fig. 2b for a 2D case with periodic boundary conditions, two lattice node columns are necessary to avoid breaking this dependency. In 3D and similar conditions, a two-layer thick lattice node slice would be required. In practice, the procedure consists of shifting the newly updated data, in reverse (resp. forward) node order, to +Sbuf (resp. −Sbuf) memory positions at odd (resp. even) time steps, as depicted in Fig. 3. More details can be found in [10].
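The ±Sbuf bookkeeping can be illustrated with a deliberately simplified 1D stand-in: a three-point stencil replaces the LBM propagation (so a spatial dependency of one node and Sbuf = 1 suffice), and fixed end values stand in for the ghost layers of Figs. 2 and 3. This is an illustration of the in-place shift idea only, not the authors' 3D implementation.

```c
/* Minimal 1D illustration of the shift (compressed-grid) bookkeeping.
 * Odd steps: sweep nodes in descending order and write node j at j+SBUF.
 * Even steps: sweep in ascending order and write node j back at j.
 * With these sweep directions no old value is overwritten before every
 * node that still needs it has been updated. */
#include <stdio.h>

#define N    12
#define SBUF 1

static double stencil(double l, double c, double r) { return 0.25*l + 0.5*c + 0.25*r; }

static void odd_step(double *a)   /* old data in a[0..N-1], new in a[SBUF..N-1+SBUF] */
{
  for (int j = N - 1; j >= 0; --j) {
    if (j == 0 || j == N - 1)
      a[j + SBUF] = a[j];         /* fixed ends stand in for ghost-layer updates */
    else
      a[j + SBUF] = stencil(a[j - 1], a[j], a[j + 1]);
  }
}

static void even_step(double *a)  /* old data in a[SBUF..N-1+SBUF], new in a[0..N-1] */
{
  for (int j = 0; j < N; ++j) {
    if (j == 0 || j == N - 1)
      a[j] = a[j + SBUF];
    else
      a[j] = stencil(a[SBUF + j - 1], a[SBUF + j], a[SBUF + j + 1]);
  }
}

int main(void)
{
  double a[N + SBUF];
  for (int j = 0; j < N; ++j) a[j] = (j == N / 2);   /* initial pulse */
  for (int it = 0; it < 5; ++it) { odd_step(a); even_step(a); }
  for (int j = 0; j < N; ++j) printf("%.4f ", a[j]); /* result back at a[0..N-1] */
  printf("\n");
  return 0;
}
```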

Fig. 3. Schematic representation of the shift algorithm procedure at odd (n + 1) and even (n + 2) time steps for a given subdomain. White and yellow cell nodes represent data (populations or velocity and density) in memory at the beginning of odd and even time steps, respectively. Green and blue nodes are ghost layer nodes that need to be added, to the left and the right of each subdomain, respectively. No computations are performed on ghost layer nodes, which are used to facilitate the transfer of data during the propagation step. Gray cell nodes represent buffer lattice nodes used during shifting to prevent the overwriting of useful data. For the sake of illustration, the spatial data dependency is arbitrarily assumed to be equal to 5, which means that Sbuf = 5.

As the proposed scheme for a single unit relaxation time (Eq. (7)) consists in a fused scheme, the one-lattice (two-step) algorithm does not represent a viable solution and must be ruled out. Moreover, since only density and velocity are to be stored in memory in our scheme, the swap algorithm cannot be used unless additional memory is allocated to store the set of ½(nd − 1) populations that are yet to be propagated, the size of which depends on the data spatial dependency, and a trickier update procedure is developed. As a matter of fact, it appears that the shift algorithm, originally developed for a population-storing LBM scheme, can be straightforwardly extended to a density/velocity-storing LBM scheme. For the (nx × ny × nz) lattice discretization of a porous domain of porosity ε, it can be shown that the memory requirement (qmem) for the shift LBM implementation with a vector data structure [20] and density/velocity storage is

$$q_{mem} \approx \{8\ \text{bytes}\}\times\underbrace{(n_f + S_{buf})\times 4}_{\text{density/velocity storage}} \;+\; \{4\ \text{bytes}\}\times\underbrace{n_f\times(n_d-1)}_{\text{connectivity list}} \;+\; \{2\ \text{bytes}\}\times\underbrace{n_f\times 3}_{\text{coordinate list}} \;+\; \{1\ \text{byte}\}\times\underbrace{n_f\times(n_d-1)}_{\text{bounce-back treatment}}, \qquad (8)$$

where nd is the number of populations used in the lattice, which is equal to 15 for a D3Q15 lattice, nf = (nx × ny × nz) × ε is the number of fluid nodes, {8 bytes} means that the computations are done in double precision, and {x byte(s)} refers to an x-byte integer array with x < 8. The vector data structure requires indirect addressing through the use of a connectivity list for all fluid nodes. Also, an additional 1-byte array is employed to determine the direction of the in-coming populations (the direction is reversed when bounce-back occurs) for the treatment of bounce-back conditions. Finally, it is convenient to store an integer coordinate list of all fluid nodes since such information will be required when post-processing the simulation results.
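As a quick sanity check of Eq. (8) (neglecting the small Sbuf term), the per-fluid-node cost for a D3Q15 lattice (nd = 15) works out as

$$\frac{q_{mem}}{n_f} \approx 8\times 4 + 4\times(15-1) + 2\times 3 + 1\times(15-1) = 32 + 56 + 6 + 14 = 108\ \text{bytes per fluid node},$$

so that the 400³, ε = 27.5% sphere packing of Section 5.1.2 (nf ≈ 1.76 × 10^7 fluid nodes) requires roughly 1.9 GB, consistent with the value reported later in Table 5 for the proposed shift algorithm.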

1 ≈20% less for D3Q15 and D3Q19 (4 × [(ny × nz) + 2 × (nz + 1)] × ε vs. ≈5 × (ny × nz) × ε) and ≈55% less for a D3Q27 (4 × [(ny × nz) + 2 × (nz + 1)] × ε vs. ≈9 × (ny × nz) × ε).

2 Persistent communications with MPI_SSEND_INIT and MPI_RECV_INIT did not improve performance significantly.

4. Parallel workload balance and communication strategies

Three key aspects of a parallel LBM implementation are considered in this section: computational workload, communication overhead and resulting parallel performance.

4.1. Workload balance and communication strategies

With a vector data structure, it is rather straightforward to balance workload on a parallel computer, as explained by Wang et al. [19]. Each processor receives an equal portion of the vector(s), as illustrated in Fig. 2c. Ghost layers for in-coming data from neighbouring processors as well as buffer nodes to handle the shift procedure are added to each subdomain. Overall, this workload balance procedure leads to a slice domain decomposition method that resembles classical slice decomposition techniques (for instance, y–x vectorization in 2D and z–y–x vectorization in 3D create slices in the x direction, as can be seen in Figs. 2d and 4, respectively): the data communication pattern is simple due to the fact that each processor needs to communicate with only its two nearest processors. The amount of data to be transferred between processors depends on the position of the interfaces, which can be sharp or staircase-like. In the best-case scenario, the sharp interface, only four arrays of size (ny × nz) × ε (i.e. the density and the three components of the velocity of layer I in Fig. 4) are required. In the worst-case scenario, the staircase-like interface, the size of these arrays is larger, [(ny × nz) + 2 × (nz + 1)] × ε, because of the data spatial dependency and the presence of periodic boundary conditions. The likelihood of being in the worst-case scenario at one interface increases with the number of processors used, and may lead to a communication bottleneck. This data transfer layout is however much simpler than that for a population-storing LBM implementation (see [20]). It requires the transfer of roughly 20–55%¹ fewer data between processors for D3Q15, D3Q19 and D3Q27 lattices. As will be seen, this results in a noticeable improvement of parallel performance when communication overhead is a limiting factor. A sketch of this partitioning and exchange is given below.

Fig. 4. Possible interface scenarios and their corresponding data transfer layouts between subdomain n and subdomain n + 1 (transparent) for the density/velocity-storing scheme (sharp interface: node layer I only, qcom = 4 × (ny × nz) × ε values of u and ρ transferred in the x-forward direction; staircase-like interface: node layers I and II, qcom = 4 × [(ny × nz) + 2 × (nz + 1)] × ε). Note that the vectorization order is z–y–x. Also, these data transfer layouts are valid for periodic boundary conditions, and only data corresponding to fluid nodes need to be exchanged.
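The even split of the fluid-node vector and the resulting two-nearest-neighbour exchange can be sketched as follows. The paper's code is Fortran with MPI_ISEND/MPI_IRECV; the C version below, its halo size h, function names and periodic ring topology are assumptions used for illustration only.

```c
/* Sketch of the even fluid-node partitioning and of the non-blocking
 * exchange of one density/velocity array with the two nearest ranks. */
#include <mpi.h>

/* Contiguous chunk [*first, *first + *count) of the nf-long fluid-node
 * vector assigned to 'rank' out of 'nprocs' (remainder spread over the
 * first ranks), giving at most a one-node workload difference. */
void even_partition(long nf, int rank, int nprocs, long *first, long *count)
{
  long base = nf / nprocs, rem = nf % nprocs;
  *count = base + (rank < rem ? 1 : 0);
  *first = rank * base + (rank < rem ? rank : rem);
}

/* Non-blocking halo exchange for one array (rho or one velocity
 * component): send 'h' boundary values to each neighbour and receive
 * 'h' ghost values, involving only the two nearest ranks. */
void exchange_halo(double *ghost_lo, double *send_lo,   /* h values each */
                   double *send_hi, double *ghost_hi,
                   long h, MPI_Comm comm)
{
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);
  int lo = (rank - 1 + nprocs) % nprocs;   /* periodic ring of subdomains */
  int hi = (rank + 1) % nprocs;

  MPI_Request req[4];
  MPI_Irecv(ghost_lo, (int)h, MPI_DOUBLE, lo, 0, comm, &req[0]);
  MPI_Irecv(ghost_hi, (int)h, MPI_DOUBLE, hi, 1, comm, &req[1]);
  MPI_Isend(send_lo,  (int)h, MPI_DOUBLE, lo, 1, comm, &req[2]);
  MPI_Isend(send_hi,  (int)h, MPI_DOUBLE, hi, 0, comm, &req[3]);
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```

In the density/velocity-storing scheme, such an exchange would be performed for ρ and for each of the three velocity components, with h corresponding to the (ny × nz) × ε (sharp) or [(ny × nz) + 2 × (nz + 1)] × ε (staircase-like) interface sizes discussed above.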

4.2. Parallel performance

A single-program multiple-data (SPMD) model using MPI and a Fortran compiler was used to implement on parallel computers the proposed shift single unit relaxation time LBM algorithm with a vector data structure combined with even fluid node partitioning. Communications between processors are carried out by non-blocking MPI_ISEND and MPI_IRECV subroutines.² When evaluating parallel performance, one can either keep the lattice dimensions constant while increasing the number of processors (a speed-up test) or increase the lattice dimensions proportionally to the number of processors used (a scale-up test). If Ntot is the total number of lattice nodes and np the number of processors used for a given simulation, these two scenarios lead respectively to Ntot = nx × ny × nz and Ntot = nx × ny × nz × np. One can then derive for both scenarios the following theoretical efficiency E(np) model for a porous medium of average porosity ε (see [20] for the development of the model):

$$E(n_p) = \frac{\varepsilon}{\varepsilon_{max} + \dfrac{q\,n_p\,(t_{lat} + s\,t_{data})}{N_{tot}\,m\,t_{oper}}} = \frac{\varepsilon}{\varepsilon_{max} + r_{cc}}, \qquad (9)$$

where εmax is the largest subdomain porosity, m is the number of arithmetic (floating-point) operations per lattice node per iteration, q is the number of MPI_ISEND and MPI_IRECV calls required per processor (q = 4 when a processor has two neighbours), toper is the average time spent per arithmetic operation, tlat is the communication latency, tdata is the average time to transfer 1 byte of data, s is the amount of data that needs to be transferred to one neighbouring processor, equal to qcom × 8 bytes, and rcc represents the ratio between the communication and the computational workload. Note that εmax = ε when the workload is balanced. The product (m·toper) for a specific code on a given machine, hereafter called the nodal computational time, can be approximated by

$$m\,t_{oper} \approx \frac{t_{CPU,1}}{n_{it}\times n_f}, \qquad (10)$$

where tCPU,1 is the CPU time measured on a single processor and nit the number of iterations. The latency tlat and the data transfer rate tdata can be respectively evaluated using utilities such as mpptest [24] and mpiP [25], although only rough approximations can be obtained because the actual values depend on the number and the size of the messages transferred. From Eq. (9), it can be seen that the efficiency is bounded by: (1) ε/εmax when communication overhead is negligible (i.e. rcc ≪ εmax) and the workload is not balanced, (2) 1 when communication overhead is negligible and the workload is balanced (ε = εmax), and (3) ε/rcc → 0 when communications become overwhelming (i.e. rcc ≫ εmax). Furthermore, as shown in [20], the computational performance Pcomp, expressed in Millions of Lattice fluid node Updates Per Second (MLUPS), can be expressed by

$$P_{comp}(n_p) = \frac{10^{-6}\,\varepsilon\,N_{tot}}{\varepsilon_{max}\,\dfrac{N_{tot}}{n_p}\,m\,t_{oper} + q\,(t_{lat} + s\,t_{data})} = \frac{10^{-6}\,n_p}{m\,t_{oper}}\,E(n_p). \qquad (11)$$

Interestingly, this equation links the parallel efficiency to the computational performance, which can be evaluated experimentally as

$$P_{comp}(n_p) = \frac{10^{-6}\,\varepsilon\,N_{tot}\,n_{it}}{t_{CPU,n_p}}, \qquad (12)$$

where tCPU,np represents the CPU time with np processors. Combining Eqs. (11) and (12) gives a way to measure experimentally the parallel efficiency without timings on a single processor:

$$E(n_p) = \frac{\varepsilon\,N_{tot}\,n_{it}\,m\,t_{oper}}{n_p\,t_{CPU,n_p}}. \qquad (13)$$

Table 1. Specifications of the Mammouth(mp) HPC cluster.
  Make: Dell PowerEdge SC1425
  Processors used: 2 × 128 Intel Xeon (a); clock speed 3.6 GHz; bus FSB 800 MHz; RAM 8 GB; cache 1 MB L2
  Network: InfiniBand 4× (700–800 MB/s); tlat ≈ 5.0 µs; tdata ≈ 8.0 ns/byte
  Operating system: RedHat Enterprise Linux 4 (2.6.9-42.0.3.ELsmp)
  Compiler: Portland Group pgf90 Fortran (6.0-4)
  Message passing: MPI (mvapich2 0.9.82)
  (a) Only one processor per server was used.

Table 2. Nodal computational time (m·toper), in ns, for the various codes on the Mammouth(mp) HPC cluster (4-byte / 8-byte integer compiler option).
  Proposed shift algorithm with even fluid node vector partitioning: 422 / N/A
  One-lattice algorithm with even fluid node vector partitioning: 879 / N/A
  One-lattice algorithm with classical x-axis slice decomposition: 1142 / N/A

Note that simulations on a single processor may not be feasible because of the size of the computational domain. In fact, this equation allows the evaluation of experimental parallel efficiencies for very large domains. It is valid if the computational time is proportional to the number of lattice nodes, which is rigorously the case for the LBM algorithms considered in this work.

Fig. 5. Memory usage per lattice node as a function of the porosity of an hexagonal packing of cylinders for the one-lattice (matrix and vector data structures) and the proposed shift (vector data structure) LBM implementations. The porosity is changed by varying the diameter of the cylinders. Lines correspond to model predictions (Eq. (8) and models from [20]), and symbols are numerical data points. The thicker the lines or the bigger the symbols, the coarser the lattice (δx = 0.0462 µm) for the one-lattice implementations, the memory usage of which depends on the lattice size.
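To close this section, a small helper for evaluating the efficiency and performance models of Eqs. (9) and (11) is sketched below. It is only a convenience for exploring the model; the parameter values in main() are placeholders to be replaced by the figures of Tables 1–4 and by the s and Ntot of the case at hand, not values taken from the paper's results.

```c
/* Evaluate the efficiency and performance models, Eqs. (9) and (11). */
#include <stdio.h>

typedef struct {
  double eps, eps_max;   /* domain and largest-subdomain porosities          */
  double Ntot;           /* total number of lattice nodes                    */
  double m_toper;        /* nodal computational time (s per node/iteration)  */
  double q;              /* MPI calls per processor (4 for two neighbours)   */
  double t_lat, t_data;  /* latency (s) and transfer time (s/byte)           */
  double s;              /* bytes sent per neighbour exchange (q_com x 8)    */
} ModelParams;

static double efficiency(const ModelParams *p, int np)        /* Eq. (9)  */
{
  double rcc = p->q * np * (p->t_lat + p->s * p->t_data) / (p->Ntot * p->m_toper);
  return p->eps / (p->eps_max + rcc);
}

static double performance_mlups(const ModelParams *p, int np) /* Eq. (11) */
{
  return 1e-6 * np * efficiency(p, np) / p->m_toper;
}

int main(void)
{
  /* Placeholder inputs; substitute the values of Tables 1-4 and of the
   * case at hand (they are NOT taken from the paper's figures). */
  ModelParams p = { .eps = 0.3, .eps_max = 0.3, .Ntot = 1e9,
                    .m_toper = 400e-9, .q = 4, .t_lat = 5e-6,
                    .t_data = 8e-9, .s = 8.0 * 4 * 1e5 };
  for (int np = 16; np <= 128; np *= 2)
    printf("np = %3d  E = %.3f  Pcomp = %.1f MLUPS\n",
           np, efficiency(&p, np), performance_mlups(&p, np));
  return 0;
}
```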

5. Numerical experiments and discussion

The proposed parallel algorithm relies on a density/velocity-storing single unit relaxation time LBM scheme implemented by means of a shift procedure and an even fluid node partitioning based on a vector data structure. The efficiency of this algorithm will first be assessed by means of 3-dimensional fluid flow simulations through a hexagonal packing of cylinders and a heterogeneous random packing of polydisperse spheres. Next, the choice of a single unit relaxation time will be further discussed in conjunction with the boundary conditions used.

5.1. Memory and parallel efficiency experiments

The efficiency of the proposed method with regard to parallel performance and memory usage is investigated by means of two case studies that were introduced in a previous article [20]. More precisely, it is compared with two other algorithms described and investigated in that article: (1) a one-lattice LBM implementation with a matrix data structure and a classical slice domain decomposition and (2) a one-lattice LBM implementation with a vector data structure and an even fluid node partitioning technique combined with a fully-optimized population transfer layout. The latter of the two was found to be the most parallel and memory efficient algorithm in that paper. The reader is referred to [20] for more detailed descriptions of the case studies and these LBM implementations.

5.1.1. Hexagonal packings of cylinders

The following numerical experiments were performed on the High-Performance Computing (HPC) cluster (Mammouth(mp)) from the Réseau Québécois de Calcul de Haute Performance (RQCHP). Table 1 summarizes the main features of this HPC cluster. The dimension of the lattice was increased proportionally to the number of processors used (scale-up test). More precisely, the total number of lattice nodes (Ntot) was 600 × 10np × 347 in the x, y and z directions, respectively. Four different cylinder radii (R = 60.6, 67.1, 73.6 and 80.1 lattice nodes) were investigated in order to vary the domain porosity. Table 2 presents the nodal computational times (m·toper) of the various codes used on Mammouth(mp), as calculated from Eq. (10). These results clearly show a better computational performance for the shift algorithm. As a matter of fact, Mattila et al. [9] also reported a better sequential computational performance for the shift algorithm than for the one-lattice (two-step) algorithm.

As displayed in Fig. 5, the memory requirement model (Eq. (8)) predicts for the shift algorithm a linear decrease of the memory usage per node as the porosity of the hexagonal packings of cylinders is decreased. The predictions are in good agreement with the numerical data. It can also be observed that the shift algorithm with a single unit relaxation time reduces memory usage by 40–75% as compared to the one-lattice LBM implementation with a vector data structure. This represents a tour de force considering that this one-lattice implementation is already significantly more memory efficient than the traditional algorithm with a matrix data structure. Note that memory savings could be even greater in the case of D3Q19 or D3Q27 lattice types.

For this scale-up case study, there is an obvious slice domain decomposition along the cylinder axis (y axis), which can lead to high parallel efficiency due to straightforward workload balance and constant communication load (involving 600 × 347 lattice nodes). However, to emphasize the improvement obtained by resorting to an even fluid node vector partitioning domain decomposition when load imbalance is present, the computational domain was discretized and decomposed in the x direction, which is perpendicular to the cylinder axial direction. Fig. 6 presents the parallel efficiency and computational performance for the proposed method and the two above-mentioned one-lattice LBM implementations. As expected, the algorithms using even fluid node vector partitioning significantly improve the parallel efficiency with respect to the classical domain decompositions. In particular, the drop and fluctuations in efficiency resulting from variations in subdomain porosity for classical domain decomposition are suppressed. This proves the workload balancing capability of the even fluid node vector partitioning method. Moreover, the efficiency model (Eq. (9)) predicts well the different trends observed experimentally. Surprisingly, the efficiency of the proposed shift algorithm (blue curve with diamonds) is slightly lower than that of its one-lattice counterpart (red curve with circles). In fact, the ≈20% lower communication load (see Section 4.1) is outweighed by the better intrinsic computational performance of the shift algorithm. This result is due to a higher rcc ratio for the shift algorithm (see Eq. (9)). Nevertheless and more importantly, the proposed algorithm provides in this scale-up case study a sustained 25% increase in computational parallel performance, as evidenced in Fig. 6, at a much lower memory usage.

Fig. 6. Parallel efficiency (a) and computational performance (b) comparisons on the Mammouth(mp) cluster between the one-lattice and shift algorithms with x-slice and even fluid node vector partitioning domain decompositions for an hexagonal packing of cylinders (R = 73.6 lattice nodes) with proportional domain size (scale-up test). The colored dashed lines in (a) represent the model predictions from Eq. (9) for the corresponding algorithms. In (b), the black dashed line represents the theoretical linear performance for the shift algorithm with an even fluid node vector partitioning domain decomposition.

Table 3. Specifications of the Artemis HPC cluster.
  Make: Dell PowerEdge 1950
  Processors used: 2 × 64 quad core Intel Xeon 5440; clock speed 2.83 GHz; bus FSB 1333 MHz; RAM 16 GB; cache 12 MB L2
  Network: Gigabit Ethernet; tlat ≈ 1.0 µs; tdata ≈ 12.0 ns/byte
  Operating system: CentOS 4.6 (2.6.9-67.0.15.ELsmp)
  Compiler: Intel Fortran (10.1.015)
  Message passing: MPI (OpenMPI 1.2.6)

Table 5. Memory usage of the various codes on a single processor for the spherical particle packing case study.
  Proposed shift algorithm with even fluid node vector partitioning: 1.9 GB
  One-lattice algorithm with even fluid node vector partitioning: 5.9 GB
  One-lattice algorithm with classical x-axis slice decomposition: 10.1 GB

5.1.2. Random packing of polydisperse spheres

To further investigate the parallel performance and the memory gain of the proposed method, a speed-up test was carried out on a 3-dimensional random packing of polydisperse spheres (ε = 27.5%) generated using a Monte-Carlo packing procedure described elsewhere [17]. The domain size (400³ lattice nodes) can fit in the memory of a single server and was kept constant as the number of processors was increased (speed-up test). These tests were performed on the HPC cluster (Artemis) from FPInnovations. Table 3 summarizes the main features of this HPC cluster and Table 4 displays the nodal computational times of the various codes, which evidence here again the intrinsic computational superiority of the shift algorithm.

As can be seen from Table 5, the combination of a vector data structure and a single unit relaxation scheme reduces by a factor of 5.3, with respect to a one-lattice algorithm with classical slice decomposition, the memory required to simulate the flow through the packing. The single unit relaxation scheme alone reduces the memory by a factor of 3.1, which is fairly close to the factor of 3.75 theoretically achievable by replacing the storage of 15 populations by that of the density and three velocity components.

Table 4. Nodal computational time (m·toper), in ns, for the various codes on the Artemis HPC cluster (4-byte / 8-byte integer compiler option).
  Proposed shift algorithm with even fluid node vector partitioning: 197 / 250
  One-lattice algorithm with even fluid node vector partitioning: 443 / 478
  One-lattice algorithm with classical x-axis slice decomposition: 704 / 802

Fig. 7 shows a significant decrease in parallel performance as the number of processors increases for the three algorithms investigated. This is explained by the reduction of the subdomain granularity as the number of processors increases, which eventually leads to an overwhelming communication overhead. In particular, it can be seen that the proposed shift implementation (blue curve with diamonds) underperforms as compared to its one-lattice counterpart (red curve with circles). As already explained in Section 5.1.1, this is due to the higher communication/computational workload ratio (rcc) resulting from the better intrinsic computational performance of the shift algorithm. These trends are confirmed by the efficiency model predictions, although the quantitative agreement with the numerical data is not as good as in the previous case study. This can be attributed to the small size of the domain investigated.

To assess the performance improvement and the adequacy of the efficiency models when the domain size is increased, a scale-up test from 384³ to 2630³ node lattices, corresponding to a maximum of 5.0 × 10^9 fluid nodes, was carried out. The computational performance and parallel efficiency for 128 processors were calculated through Eqs. (12) and (13) for the three algorithms (Fig. 8). Note that for a domain size larger than 1280³ lattice nodes, the 8-byte integer compilation option was needed for the indexing of nodes at the pre-processing stage. As expected, the results show an increasing performance and efficiency as the domain size and the corresponding granularity are increased. Moreover, the model predictions from Eqs. (9) and (11) (color dotted lines) become extremely good above ≈1.2 × 10^8 fluid nodes (768³ node lattices). Here again, the parallel efficiency of the one-lattice algorithm with even fluid node vector partitioning surpasses the proposed shift algorithm at the same domain size, but the difference decreases as the domain size increases. As both algorithms follow closely the performance models, it can be concluded that they exhibit the expected workload balance that tends towards 100% (E → 1) as the domain size is increased. In fact, efficiencies as high as 79% and 75% for 128 processors are obtained for the largest domain with the one-lattice and shift algorithms, respectively. Despite its slightly lower efficiency, it can be observed in Fig. 8 that the computational parallel performance of the shift algorithm is about 60% superior, and that the slope of the curve is significantly steeper than that of the one-lattice algorithm. This difference in performance between the two algorithms is significantly larger than in the case of the (older) Mammouth(mp) architecture (see Fig. 7) because of a larger cache and faster bus (see Tables 1 and 3). Unsurprisingly, the use of 8-byte integers affects the computational performance, all the more so for the shift algorithm due to a higher proportion of integer-based operations. Moreover, the efficiency of the one-lattice algorithm with a classical x-axis slice decomposition, which is significantly lower than that for the other two methods, levels off as domain size is increased to a lower asymptotic value equal to ε/εmax = 27.5%/42.2% = 65.2%. Quite clearly, this compromises the applicability of the method for large domain sizes. Finally, on the HPC cluster Artemis and its 1024 GB memory, the largest domains tested that could fit in memory comprised 5.0 × 10^9, 1.6 × 10^9 and 5.8 × 10^8 fluid nodes, for the shift, the one-lattice with even fluid node vector partitioning and the one-lattice with classical slice decomposition methods, respectively. In other words, these two one-lattice implementations are surpassed by the shift algorithm by factors of 3.1 and 10, respectively.

5.2. Justifications for using the simplified single unit relaxation timeLBM scheme

We recall that using a constant relaxation time is only practicalin the case of laminar Newtonian flows since LBM schemes for

0

20

40

60

80

100

0 20 40 60 80 100Number of processors

Eff

icie

ncy

(%)

Proposed shift algorithm with even fluid node vector partitioning

One-lattice algorithm with even fluid node vector partitioning

One-lattice algorithm with classical x-axis matrix slice decomposition

1

10

100

1000

100101

Number of processors

Com

puta

tion

al p

erfo

rman

ce (

ML

UP

S)

Proposed shift algorithm with even fluid node vector partitioning

One-lattice algorithm with even fluid node vector partitioning

One-lattice algorithm with classical x-axis matrix slice decomposition

(a)

(b)

Fig. 7. Parallel efficiency (a) and computational performance (b) comparisons on the the Artemis cluster between the one-lattice and shift algorithms with x-slice Cartesianand even fluid node vector partitioning domain decompositions for a random packing of polydisperse spheres with constant domain size (speed-up test). The colored dashedlines in (a) represent the model predictions from Eq. (9) for the corresponding algorithms. In (b), the black dashed line represents the theoretical linear performance for theshift algorithm with an even fluid node vector partitioning domain decomposition.

1420 D. Vidal et al. / Computers & Fluids 39 (2010) 1411–1423

As seen in Section 5.1.1, fixing τ* = 1 can save a significant amount of memory, which can be of great interest for simulating laminar Newtonian fluid flow through very complex geometries. This comes at the expense of an increased CPU time. Indeed, it can be inferred from Eqs. (5) and (6) that δt varies as O(δx²(τ* − 1/2)) for a prescribed kinematic viscosity ν. It is thus attractive to use, for a given δx, a large value of τ* to increase δt and converge faster to a desired steady-state solution. Fixing the relaxation time to unity removes this flexibility and, as a result, may increase the computational time. Part of this CPU time increase can however be recovered in the case of parallel execution thanks to the lower communication overhead resulting from the simplified single unit relaxation time LBM scheme. In addition, this choice may be judicious in practice for accuracy reasons related to the type of boundary conditions used.
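For the reader's convenience, the scaling just quoted (and the CPU-time scaling recalled in the caption of Fig. 9) can be recovered from the standard lattice-BGK relations between the lattice sound speed, the kinematic viscosity and the relaxation time; the following derivation is a sketch based on these textbook relations and does not reproduce Eqs. (5) and (6) themselves:

\[
  c_s^2 = \frac{\delta_x^2}{3\,\delta_t^2}, \qquad
  \nu = c_s^2\left(\tau^* - \tfrac{1}{2}\right)\delta_t
      = \frac{\delta_x^2}{3\,\delta_t}\left(\tau^* - \tfrac{1}{2}\right)
  \quad\Longrightarrow\quad
  \delta_t = \frac{\delta_x^2\left(\tau^* - \tfrac{1}{2}\right)}{3\,\nu}
           = O\!\left(\delta_x^2\left(\tau^* - \tfrac{1}{2}\right)\right).
\]

For a fixed physical time horizon $T$ and a domain of side $L$, the total work then scales as

\[
  \text{CPU time} \;\propto\; \frac{T}{\delta_t}\left(\frac{L}{\delta_x}\right)^{3}
  \;\propto\; \left[\delta_x^{5}\left(\tau^* - \tfrac{1}{2}\right)\right]^{-1}.
\]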

Several researchers [26–32] have pointed out the limitations of the half-way bounce-back rule as far as accuracy is concerned. Despite nearly second-order accuracy in space, its overall accuracy varies with τ*. This is illustrated in Fig. 9, which presents the relative error of the mean velocity with respect to the analytical solution for a 3D flow in a square duct using a one-lattice LBM implementation. As also discussed by others [26,27,30,31], the half-way bounce-back boundary condition generally shows best results for τ* = 1. The reason why accuracy varies with τ* is still debated (e.g. the position of the wall may depend on τ* and be precisely half-way when τ* = 1, or slip velocity at the wall may be present). When τ* is too small or too large and the lattice resolution is coarse with respect to the geometry, significant errors can be noticed. As a result, if one decides to use the half-way bounce-back boundary condition, τ* = 1 appears as a good choice to obtain accurate results in all circumstances. Note that other methods [26–29,31,33–35] have been developed to strictly enforce no-slip wall boundary conditions independently of τ*. However, they come at extra computational cost that, although it has never been carefully evaluated, is likely to be proportional to the number of interfaces between the solid and the fluid phases (usually quite high for porous media). Furthermore, some of these methods require extra information to be communicated between processors, which may noticeably impair the parallel performance. This and the fact that half-way bounce-back boundary conditions are rather simple to implement may explain why, despite their limitations, they are still very popular.
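To make the rule discussed above concrete, the following is a minimal, self-contained C++ sketch of streaming with the half-way bounce-back rule, written here on a D2Q9 lattice for brevity (the solver discussed in this paper is three-dimensional); all identifiers and the array layout are illustrative assumptions and not the authors' implementation.

#include <array>
#include <vector>

constexpr int Q = 9;  // D2Q9 lattice
// Discrete velocities and, for each direction, the index of its opposite.
constexpr std::array<std::array<int, 2>, Q> c = {{
    {{0, 0}}, {{1, 0}}, {{0, 1}}, {{-1, 0}}, {{0, -1}},
    {{1, 1}}, {{-1, 1}}, {{-1, -1}}, {{1, -1}}}};
constexpr std::array<int, Q> opp = {0, 3, 4, 1, 2, 7, 8, 5, 6};

// Stream the post-collision populations fPost into fNew. A population that
// would land on a solid node is instead bounced back into the opposite
// direction at its node of origin, which places the no-slip wall half-way
// between the fluid node and the solid node.
void streamWithBounceBack(int Nx, int Ny,
                          const std::vector<bool>& solid,     // Nx*Ny phase flags
                          const std::vector<double>& fPost,   // Nx*Ny*Q populations
                          std::vector<double>& fNew) {        // Nx*Ny*Q populations
  auto node = [Nx](int x, int y) { return y * Nx + x; };
  auto pop  = [Nx](int x, int y, int i) { return (y * Nx + x) * Q + i; };
  for (int y = 0; y < Ny; ++y) {
    for (int x = 0; x < Nx; ++x) {
      if (solid[node(x, y)]) continue;              // no computation on solid nodes
      for (int i = 0; i < Q; ++i) {
        const int xn = (x + c[i][0] + Nx) % Nx;     // periodic wrap, for simplicity
        const int yn = (y + c[i][1] + Ny) % Ny;
        if (solid[node(xn, yn)]) {
          fNew[pop(x, y, opp[i])] = fPost[pop(x, y, i)];   // half-way bounce-back
        } else {
          fNew[pop(xn, yn, i)] = fPost[pop(x, y, i)];      // ordinary propagation
        }
      }
    }
  }
}

Note how the treatment is purely local to the fluid node and its phase flags, which is consistent with the low implementation and communication overhead discussed above.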

Fig. 8. Parallel efficiency (a) and computational performance (b) comparisons at 128 processors on the Artemis cluster between the one-lattice and shift algorithms with x-slice and even fluid node vector partitioning domain decompositions for a random packing of polydisperse spheres with increasing domain size (scale-up test). The colored dotted lines represent the model predictions from Eqs. (13) and (12) for the corresponding domain decompositions. Open and filled symbols correspond to simulations performed respectively with 4- and 8-byte integers.


It also justifies the use of an LBM implementation based on a single unit relaxation time for simulating laminar Newtonian fluid flows. Note that one way to alleviate the constraint on CPU time resulting from the choice τ* = 1 and speed up convergence to steady state would be to use the iterative momentum relaxation technique proposed by Kandhai et al. [36]. This technique was reported to cut down the number of iterations required to reach steady state by 45–97%.

6. Concluding remarks

An efficient parallel LBM algorithm was introduced for simulating laminar Newtonian fluid flows through porous media. It provides perfect parallel workload balance owing to a two-nearest-neighbour communication pattern and a simple lattice-type-independent data transfer layout, with lower (20–55%) communication cost and higher (25–60%, depending on the architecture used) computational parallel performance than previously reported LBM algorithms. With this algorithm, the usual trade-off between memory and computational performance is overcome owing to a 40–90% reduction in memory usage with respect to classical population-storing LBM algorithms. The proposed algorithm is built around four combined strategies to achieve remarkable performance. First, a vector data structure is used instead of the sparse matrix structure inherent to porous media in order to reduce memory requirements. Second, taking advantage of this vector data structure, an even fluid node partitioning domain decomposition technique is used to perfectly balance the parallel workload. Third, the use of a single unit relaxation time simplifies the collision-propagation LBM scheme by replacing the usual population array storage with a density/velocity array storage, which leads to significant memory savings and reduces the parallel communication cost. Finally, to further reduce the memory usage and improve the sequential computational performance, a shift algorithm that overlaps the data related to two consecutive time steps into a smaller memory space, while accounting for their spatial dependency, was implemented. Although resorting to a single unit relaxation time may appear restrictive at first sight, its use is fully justified as far as accuracy is concerned when classical, computationally efficient, half-way bounce-back boundary conditions are considered. Indeed, it is shown in this work that the best results are obtained for such boundary conditions when a relaxation time equal to unity is employed.
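As an illustration of the second strategy (not the authors' code; the function name and signature are hypothetical), the even fluid node partitioning amounts to cutting the fluid-node vector into P contiguous blocks whose sizes differ by at most one node, independently of how the solid phase is distributed:

#include <cstdint>
#include <utility>
#include <vector>

// Return, for each of the P processes, the [begin, end) range of indices it
// owns in the global fluid-node vector of length nFluid.
std::vector<std::pair<std::int64_t, std::int64_t>>
evenFluidNodePartition(std::int64_t nFluid, int P) {
  std::vector<std::pair<std::int64_t, std::int64_t>> ranges(P);
  const std::int64_t base = nFluid / P;
  const std::int64_t extra = nFluid % P;
  std::int64_t begin = 0;
  for (int r = 0; r < P; ++r) {
    const std::int64_t count = base + (r < extra ? 1 : 0);  // sizes differ by <= 1
    ranges[r] = {begin, begin + count};
    begin += count;
  }
  return ranges;
}

Because every process then carries the same number of fluid nodes to within one, the workload balance is perfect by construction, which is what the speed-up and scale-up tests above confirm.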

Fig. 9. (a) Relative error of the mean velocity with respect to the analytical solution and (b) normalized computation time, as a function of the single dimensionless relaxation time and the ratio of the lateral domain dimension (D) to the lattice size for an LBM fluid flow simulation through a 3-dimensional square duct. The no-slip wall boundary condition is enforced through the half-way bounce-back rule, which gives here a nearly second-order accuracy in space (O(δx^1.8)). From Eqs. (5) and (6), one can show that the CPU time scales as O([δx⁵(τ* − 1/2)]⁻¹).


The major drawback of the proposed algorithm is that it is restricted to laminar Newtonian fluid flows because LBM schemes for non-Newtonian and turbulent flows are generally based on variable local relaxation times to achieve variable local viscosities. For such flows, one-lattice, shift or swap algorithms with variable relaxation times and even fluid node vector partitioning domain decomposition remain the methods of choice, although they are much less memory efficient than the method proposed in this work.

The memory and parallel computational performances were assessed on two different computer architectures by means of scale-up and speed-up case studies for laminar Newtonian fluid flows through hexagonal packings of cylinders and a random packing of polydisperse spheres. The numerical data were observed to be in very good agreement with the performance model predictions, showing that the parallel efficiency of the algorithm tends asymptotically to 100% as the domain size is increased, and that the workload is thus well balanced despite the heterogeneity of the domains tested. Efficiencies with 128 processors as high as 75% were found for domain sizes comprising as many as 5 billion fluid nodes. To our knowledge, this is the first time that such large flow simulations, which required overall less than 1 TB of memory for the largest domain sizes, are reported. This highlights the memory efficiency of the proposed algorithm. The domain sizes investigated also justify the use of the even fluid node partitioning domain decomposition instead of more advanced techniques such as spectral recursive bisection or k-way graph partitioning, as these would be impractical for such large systems.

Acknowledgments

The computer resources and support from the Réseau Québécois de Calcul de Haute Performance (RQCHP) and from FPInnovations, as well as the financial contribution of the NSERC Sentinel Network, are gratefully acknowledged. Special thanks to Louis-Alexandre Leclaire and Christian Poirier.

References

[1] Succi S. The lattice Boltzmann equation for fluid dynamics and beyond. Oxford, UK: Oxford Science Publications; 2001.

[2] Nourgaliev RR, Dinh TN, Theofanous TG, Joseph D. The lattice Boltzmann equation method: theoretical interpretation, numerics and implications. Int J Multiphase Flow 2003;29(1):117–69.

[3] Martys NS, Hagedorn JG. Multiscale modeling of fluid transport in heterogeneous materials using discrete Boltzmann methods. Mater Struct 2002;35:650–9.

[4] Dupuis A, Chopard B. An object oriented approach to lattice gas modeling. Future Gener Comput Syst 2000;16(5):523–32.

[5] Schulz M, Krafczyk M, Tolke J, Rank E. Parallelization strategies and efficiency of CFD computations in complex geometries using lattice Boltzmann methods on high performance computers. In: Breuer M, Durst F, Zenger C, editors. High performance scientific and engineering computing. Berlin: Springer Verlag; 2002. p. 115–22.

[6] Pan C, Prins JF, Miller CT. A high-performance lattice Boltzmann implementation to model flow in porous media. Comput Phys Commun 2004;158:89–105.

[7] Argentini R, Bakker AF, Lowe CP. Efficiently using memory in lattice Boltzmann simulations. Future Gener Comput Syst 2004;20(6):973–80.

[8] Pan C, Luo LS, Miller CT. An evaluation of lattice Boltzmann schemes for porous medium flow simulation. Comput Fluids 2006;35:898–909.

[9] Mattila K, Hyväluoma J, Timonen J, Rossi T. Comparison of implementations of the lattice-Boltzmann method. Comput Math Appl 2008;55(7):1514–24.

[10] Pohl T, Kowarschik M, Wilke J, Iglberger K, Rüde U. Optimization and profiling of the cache performance of parallel lattice Boltzmann codes. Parallel Process Lett 2003;13(4):549–60.

[11] Mattila K, Hyväluoma J, Rossi T, Aspnäs M, Westerholm J. An efficient swap algorithm for the lattice Boltzmann method. Comput Phys Commun 2007;176:200–10.

[12] Wellein G, Zeiser T, Hager G, Donath S. On the single processor performance of simple lattice Boltzmann kernels. Comput Fluids 2006;35:910–9.

[13] Satofuka N, Nishioka T. Parallelization of lattice Boltzmann method for incompressible flow computations. Comput Mech 1999;23:164–71.

[14] Kandhai D, Koponen A, Hoekstra AG, Kataja M, Timonen J, Sloot PMA. Lattice-Boltzmann hydrodynamics on parallel systems. Comput Phys Commun 1998;111:14–26.

[15] Axner L, Bernsdorf J, Zeiser T, Lammers P, Linxweiler J, Hoekstra AG. Performance evaluation of a parallel sparse lattice Boltzmann solver. J Comput Phys 2008;227:4895–911.

[16] Freudiger S, Hegewald J, Krafczyk M. A parallelization concept for a multi-physics lattice Boltzmann prototype based on hierarchical grids. Progr Comput Fluid Dynam Int J 2008;8(1–4):168–78.

[17] Vidal D, Ridgway C, Pianet G, Schoelkopf J, Roy R, Bertrand F. Effect of particle size distribution and packing compression on fluid permeability as predicted by lattice-Boltzmann simulations. Comput Chem Eng 2009;33:256–66.


[18] Pianet G, Bertrand F, Vidal D, Mallet B. Modeling the compression of particle packings using the discrete element method. In: Proceedings of the 2008 TAPPI advanced coating fundamentals symposium, Atlanta, GA, USA. TAPPI Press; 2008.

[19] Wang J, Zhang X, Bengough AG, Crawford JW. Domain-decomposition method for parallel lattice Boltzmann simulation of incompressible flow in porous media. Phys Rev E 2005;72:016706–11.

[20] Vidal D, Roy R, Bertrand F. On improving the performance of large parallel lattice Boltzmann flow simulations in heterogeneous porous media. Comput Fluids 2010;39(2):324–37.

[21] Bhatnagar PL, Gross EP, Krook M. A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Phys Rev 1954;94:511–25.

[22] Gabbanelli SG, Drazer G, Koplik J. Lattice Boltzmann method for non-Newtonian fluid flows. Phys Rev E 2006;72:046312.

[23] Weickert M, Teike G, Schmidt O, Sommerfeld M. Investigation of the LES WALE turbulence model within the lattice Boltzmann framework. Comput Math Appl 2009. doi:10.1016/j.camwa.2009.08.06.

[24] mpptest version 1.4b. <http://www-unix.mcs.anl.gov/mpi/mpptest/>; October 2008.

[25] mpiP version 3.1.2. <http://mpip.sourceforge.net/>; November 2008.

[26] Noble DR, Chen S, Georgiadis JG, Buckius RO. A consistent hydrodynamic boundary condition for the lattice Boltzmann method. Phys Fluids 1995;7(1):203–9.

[27] Inamuro T, Yoshino M, Ogino F. A non-slip boundary condition for the lattice Boltzmann simulations. Phys Fluids 1995;7(12):2928–30. Erratum: Phys Fluids 1996;8(4):1124.

[28] Maier R, Bernard RS, Grunau DW. Boundary conditions for the lattice Boltzmann method. Phys Fluids 1996;8(7):1788–801.

[29] Chen S, Martinez D, Mei R. On boundary conditions in lattice Boltzmann methods. Phys Fluids 1996;8(9):2527–36.

[30] Gallivan MA, Noble DR, Georgiadis JG, Buckius RO. An evaluation of the bounce-back boundary condition for lattice Boltzmann simulations. Int J Numer Meth Fluids 1997;25:249–63.

[31] Zou Q, He X. On pressure and velocity boundary conditions for the lattice Boltzmann BGK model. Phys Fluids 1997;9(6):1591–8.

[32] Holdych DJ, Noble D, Georgiadis J, Buckius R. Truncation error analysis of lattice Boltzmann methods. J Comput Phys 2004;193:595–619.

[33] Chopard B, Dupuis A. A mass conserving boundary condition for lattice Boltzmann models. Int J Mod Phys B 2003;17:103–7.

[34] Guo A, Zheng C, Shi B. An extrapolated method for boundary conditions in lattice Boltzmann method. Phys Fluids 2002;14(6):2007–10.

[35] Fang HP, Chen SY. Lattice Boltzmann method for three-dimensional moving particles in a Newtonian fluid. Chin Phys 2004;13(1):47–53.

[36] Kandhai D, Koponen A, Hoekstra A, Sloot PMA. Iterative momentum relaxation for fast lattice-Boltzmann simulations. Future Gener Comput Syst 2001;18:89–96.