
On improving the performance of large parallel lattice Boltzmann flow simulations in heterogeneous porous media

David Vidal a,b,*, Robert Roy a, François Bertrand a,*

a Ecole Polytechnique de Montréal, Montréal, Que., Canada H3C 3A7
b FPInnovations – Paprican, Pointe-Claire, Que., Canada H9R 3J9

Article info

Article history:
Received 20 November 2008
Received in revised form 16 July 2009
Accepted 18 September 2009
Available online 24 September 2009

Computers & Fluids 39 (2010) 324–337
doi:10.1016/j.compfluid.2009.09.011
0045-7930/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.

* Corresponding authors.
E-mail addresses: [email protected] (D. Vidal), robert.roy@polymtl.ca (R. Roy), [email protected] (F. Bertrand).

Abstract

Classical Cartesian domain decompositions for parallel lattice Boltzmann simulations of fluid flow through heterogeneous porous media are doomed to workload imbalance as the number of processors increases, thus leading to decreasing parallel performance. A one-lattice lattice Boltzmann method (LBM) implementation with vector data structure combined with even fluid node partitioning domain decomposition and fully-optimized data transfer layout is presented. It is found to provide nearly-optimal workload balance, lower memory usage and better computational performance than classical slice decomposition techniques using sparse matrix data structures. Predictive memory usage and parallel performance models are also established and observed to be in very good agreement with data corresponding to numerical fluid flow simulations performed through 3-dimensional packings of cylinders and polydisperse spheres.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The knowledge of transport phenomena in porous media is both of great scientific and technological importance. Nevertheless, porous media are among the most complex geometries that nature has to offer and are difficult to characterize. In this regard, fluid permeability is often used as it is a very sensitive material property that takes into account pore connectivity. Although empirical approximate correlations such as the Carman–Kozeny equation exist, the only theoretical way to evaluate fluid permeability relies on the integration of the Navier–Stokes equations.

Conventional computational fluid dynamics (CFD) methods (i.e. finite element/volume or finite difference methods) have proven limited in solving the Navier–Stokes equations in porous media. On the contrary, the lattice Boltzmann method (LBM), developed in the early 90s, is considered by many researchers as the method of choice for the simulation of single phase or multiphase flows in complex geometries [1,2]. In addition to its relative ease of implementation, LBM superiority is due to two main factors: (1) its flexibility in discretizing complex geometries by means of a simple structured lattice on which the fluid and solid phases are encoded in a Boolean manner, and (2) the inherent locality of its scheme, which makes it straightforwardly suitable for distributed parallelization.

Despite its advantages, three avenues of improvement have recently attracted the attention of researchers, in relation to the simulation of fluid flow in porous media: (1) the reduction of core memory usage, (2) the improvement of computational efficiency and accuracy of LBM algorithms, and (3) the reduction of workload imbalance associated with the heterogeneity of the domain when computing in parallel. In the case of memory requirement for porous media systems, several researchers [3–6] showed that transforming the sparse matrix data structure inherent to the LBM lattice into a vector data structure in which only the "fluid nodes" are retained (since no computations are performed on the "solid nodes") can significantly reduce memory consumption, especially when the domain porosity is lower than 70–75%. Martys and Hagedorn [6] and Argentini et al. [7] proposed simplifications of the LBM scheme in specific circumstances (for a LBM relaxation time equal to unity and for Stokes flows, respectively), which significantly reduce memory usage because, for each fluid node of the domain, they only require the storage of the fluid density and velocity or of a limited number of distribution moments. Along the same lines, Martys and Hagedorn [6] proposed a semi-direct addressing strategy based on the use of a pointer array, whereby memory allocation is required for the fluid nodes only.

The heuristic development of LBM has created a large variety of LBM schemes and implementations, and several researchers have tried to evaluate them. For example, Pan et al. [8] compared Bhatnagar–Gross–Krook single-relaxation-time (BGK) and multiple-relaxation-time (MRT) LBM schemes. They observed better accuracy with MRT schemes at the expense of a slightly higher computational cost (10–20%), although the BGK scheme can still provide accurate results when the single-relaxation-time parameter is equal to unity. Very recently, Mattila et al. [9] performed a comprehensive comparison in terms of computational efficiency and memory consumption of five different LBM implementations from the literature: well-established one-lattice two-step and two-lattice algorithms, and three recent implementations, namely the Lagrangian, shift (also called compressed-grid) and swap algorithms. They also investigated the effect of various data structures and memory addressing schemes on computational efficiency. They found out that the newly developed swap algorithm [10] combined with a novel bundle data structure and indirect addressing yields both high computational performance and low memory consumption. The reader is referred to [9–12] and references therein for more thorough descriptions of these different implementations.

Fig. 1. Numbering of the 15 populations of the D3Q15 lattice used in the present work. Odd and even numbers correspond respectively to the inward-pointing and outward-pointing populations. Population 0 is a rest population.

Concerning parallel efficiency, the classical Cartesian domain decompositions studied by Satofuka and Nishioka [13], which consist of dividing the whole domain in equally-sized subdomains (using slice- or box-decompositions), create a workload imbalance as the number of subdomains increases. As underlined by Pan et al. [5], this is due to porosity variations among the subdomains as their number is increased, not only for heterogeneous porous media but also for (macroscopically) homogeneous ones. To overcome this problem, several methods have been proposed such as the orthogonal recursive-bisection [5,14] and multilevel recursive-bisection or k-way graph partitioning using the METIS package [3,4,15,16]. Although they may provide significant improvements with regard to classical domain decomposition techniques, these methods are difficult to implement, come at high memory expense, create complex communication patterns and may become computationally expensive when dealing with very large systems (of say billions of lattice nodes) or when dynamic load balancing is required.

Recently, Wang et al. [17] proposed a quick, simple and elegant way to balance workload and reduce memory requirement for the two-lattice LBM implementation. The method consists of first vectorizing the data structure through the use of indirect memory addressing, as previously proposed by others [3–5]. But to achieve accurate workload balance, the resulting data vector is then split into equally-sized sub-vectors, each of which is assigned to a specific processor. As a result, exact fluid node load balance and high parallel efficiency can be achieved. Furthermore, a simple communication pattern among processors, similar to that with slice domain decomposition, is obtained since data communication for a given processor only involves the two nearest processors. Also, these authors claim that the data to be exchanged are contiguous in memory due to the vector data structure. However, it appears that the whole population set on lattice nodes involved in ghost layers is exchanged between processors, which is more information than actually required. This can affect the data communication load and thus impair parallel performance. Moreover, despite the reduction of the memory requirements through the use of a vector data structure, we believe that further improvements could be achieved by resorting to LBM schemes that are more efficient than the two-lattice implementation.

The objective of this work is three-fold: (1) to extend the parallel workload balancing procedure proposed by Wang et al. [17] to a more memory-efficient LBM algorithm, namely the one-lattice two-step LBM implementation, (2) to further improve the parallel performance of this scheme by reducing the communication overhead through the determination of a precise communication layout that minimizes the amount of data to be exchanged between the processors, and (3) to propose memory usage and parallel efficiency models that accurately predict and explain the performance of two one-lattice two-step LBM implementations, one with a sparse matrix data structure and a classical slice domain decomposition, and one with a vector data structure and an even fluid node partitioning domain decomposition. First, the lattice Boltzmann method is recalled. There follows a description of the sparse matrix and vector data structures used in the LBM implementations and their associated memory requirements. The parallel communication and workload balance strategies as well as their resulting performance are next examined. Finally, the computational advantages of the new methods proposed are assessed by simulating fluid flow through two different porous media made up of 3-dimensional packings of cylinders and polydisperse spheres, respectively.

2. The lattice Boltzmann method

Contrary to the conventional CFD methods that solve directly the Navier–Stokes equations, LBM actually "simulates" by means of a particle approach the macroscopic behaviour of fluid molecules in motion. More precisely, LBM comes from the discretization in space (x), velocity (e), and time (t) of the Boltzmann equation from the kinetic gas theory, which describes the evolution of the probability distribution function (or population) of a particle, f(x, e, t), and its microdynamic interactions.

2.1. Collision–propagation scheme

In practice, the populations of particles propagate and collide at every time step δt on a lattice with spacing δx and along e_i velocity directions, where the number of directions i (n_d) depends on the type of lattice chosen. A D3Q15 lattice is used in the present work, i.e. a 3-dimensional lattice with n_d = 15 velocity directions (Fig. 1). The collision–propagation procedure can be mathematically summarized by a two-step scheme encompassing a collision step,

f_i^*(\mathbf{x}, t) = f_i(\mathbf{x}, t) - \frac{f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t)}{\tau^*},   (1)

followed by a propagation step,

f_i(\mathbf{x} + \mathbf{e}_i \delta t, t + \delta t) = f_i^*(\mathbf{x}, t),   (2)

where f_i(x, t) is the particle probability distribution function (or population) in the direction of the velocity e_i at position x and time t, and τ* is a dimensionless relaxation time. The second term of the right-hand side of Eq. (1) approximates the collision process by means of a single relaxation procedure, the so-called Bhatnagar, Gross and Krook's approximation [18], towards a local equilibrium population that, for a D3Q15 lattice, is given by

f_i^{eq}(\mathbf{x}, t) = w_i \rho \left[ 1 + 3 (\mathbf{e}_i \cdot \mathbf{u}) \left( \frac{\delta t}{\delta x} \right)^2 + \frac{9}{2} (\mathbf{e}_i \cdot \mathbf{u})^2 \left( \frac{\delta t}{\delta x} \right)^4 - \frac{3}{2} (\mathbf{u} \cdot \mathbf{u}) \left( \frac{\delta t}{\delta x} \right)^2 \right],   (3)

with w_0 = 2/9, w_i = 1/9 for i = 1:6 and w_i = 1/72 for i = 7:14, where

\rho = \rho(\mathbf{x}, t) = \sum_i f_i   (4)

and

\mathbf{u} = \mathbf{u}(\mathbf{x}, t) = \frac{1}{\rho} \sum_i f_i \mathbf{e}_i   (5)

are the local fluid density and the local macroscopic fluid velocity, respectively. The dimensionless relaxation time τ* is related to the kinematic viscosity ν of the fluid by

\tau^* = \frac{\nu}{\delta t \, c_s^2} + \frac{1}{2},   (6)

where c_s is a speed of sound defined as

c_s = \sqrt{\frac{3 (1 - w_0)}{7}} \, \frac{\delta x}{\delta t}.   (7)

In practice, for better accuracy of the half-way bounce-back boundary condition [19], τ* is chosen equal to 1, and δt and δx are chosen according to these two equations.
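To make the choice τ* = 1 concrete, Eqs. (6) and (7) can be combined: with w_0 = 2/9, Eq. (7) gives c_s = δx/(√3 δt), and τ* = 1 in Eq. (6) then implies ν = c_s² δt/2 = δx²/(6 δt), i.e. δt = δx²/(6ν). As a purely illustrative calculation (the water-like viscosity is an assumption, not a value taken from this work), with ν ≈ 10⁻⁶ m²/s and the lattice spacing δx = 0.0462 µm quoted for the cylinder packing of Fig. 7,

\delta t = \frac{\delta x^2}{6 \nu} \approx \frac{(0.0462 \times 10^{-6}\ \mathrm{m})^2}{6 \times 10^{-6}\ \mathrm{m^2/s}} \approx 3.6 \times 10^{-10}\ \mathrm{s}.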

From initial conditions and appropriate boundary conditions, the collision–propagation scheme is marched in time until an appropriate convergence is reached (e.g. (du/dt)/(du/dt)_max = 10^-5). The LBM scheme (Eqs. (1) and (2)) is explicit and the population update at a lattice node is a local operation since it only requires the populations of the immediate neighbouring nodes. This makes the LBM scheme well adapted to distributed parallelization. Note that, as previously mentioned, there are several ways to implement this scheme, which have recently been rigorously classified and studied by Mattila et al. [9]. In this work, we compare two one-lattice (two-step¹) LBM implementations using two different data structures, as detailed in Section 3.
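The collision–propagation cycle of Eqs. (1)–(5) is compact enough to sketch directly. The fragment below is a minimal NumPy illustration (not the authors' Fortran/MPI code) of one BGK time step on a fully periodic, all-fluid D3Q15 lattice in lattice units (δt = δx = 1); it omits solid nodes, bounce-back and the in-place update ordering of the one-lattice implementation discussed in Section 3:

import numpy as np

# D3Q15 velocity set and weights (1 rest, 6 axis and 8 diagonal directions).
E = np.array([[0, 0, 0]]
             + [[d * s for d in axis] for axis in ((1, 0, 0), (0, 1, 0), (0, 0, 1)) for s in (1, -1)]
             + [[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)])
W = np.array([2/9] + [1/9] * 6 + [1/72] * 8)

def equilibrium(rho, u):
    # Eq. (3) with delta_t/delta_x = 1.
    eu = np.einsum('qd,dxyz->qxyz', E, u)       # e_i . u for each direction
    uu = np.einsum('dxyz,dxyz->xyz', u, u)      # u . u
    return W[:, None, None, None] * rho * (1 + 3 * eu + 4.5 * eu**2 - 1.5 * uu)

def collide_and_propagate(f, tau=1.0):
    # One BGK time step, Eqs. (1)-(2); f has shape (15, nx, ny, nz).
    rho = f.sum(axis=0)                          # Eq. (4)
    u = np.einsum('qd,qxyz->dxyz', E, f) / rho   # Eq. (5)
    f_post = f - (f - equilibrium(rho, u)) / tau # collision, Eq. (1)
    for q in range(15):                          # propagation, Eq. (2)
        f_post[q] = np.roll(f_post[q], shift=E[q], axis=(0, 1, 2))
    return f_post

# Example: start from the rest-state equilibrium on an 8 x 8 x 8 periodic box.
# f = W[:, None, None, None] * np.ones((15, 8, 8, 8)); f = collide_and_propagate(f)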

2.2. Boundary conditions and imposition of pressure drop

In this work, three types of boundary conditions are used, which are typical for a porous medium. First, the boundary conditions at the periphery of the domain are periodic, which means that any out-going population re-enters the domain on its opposite side. For their implementation, a one-layer halo of nodes, called "periodic nodes", is added to the external boundaries of the domain to avoid breaking the pipeline of operations during the propagation step (see Fig. 2b). Second, to impose a pressure drop ΔP in a given e_j direction, a body force is added on each node at each iteration in the e_i directions not normal to e_j, and combined with a periodic boundary condition in the e_j direction. Third, the wall boundary conditions on the solid objects of the porous domain can be modeled using the classical half-way bounce-back method, which reflects any in-coming population to the wall in the opposite direction at the next iteration. The solid boundary is in fact located half-way between the last fluid node and the first solid node when τ* = 1. In practice, it is very convenient to tag as "bounce-back nodes" any solid nodes in direct connection with a fluid node (see Fig. 2b), and copy and flip there any in-coming populations from the fluid nodes in provision of the following propagation step. For a more complete description of the boundary conditions and their implementation, as well as LBM in general, the reader is referred to [1,2].
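As a companion to the previous sketch, the following hypothetical helpers illustrate the two bookkeeping steps described above: tagging as bounce-back nodes the solid nodes in direct contact with fluid, and the copy-and-flip of populations sitting on those nodes so that the next propagation step sends them back into the fluid. They reuse NumPy and the D3Q15 velocity array E from the earlier fragment and are only a schematic of the half-way bounce-back treatment, not the paper's implementation:

# Opposite-direction index for each D3Q15 population, derived from E.
OPPOSITE = np.array([int(np.where((E == -e).all(axis=1))[0][0]) for e in E])

def tag_bounce_back_nodes(is_solid):
    # A solid node is tagged as a bounce-back node if at least one of its
    # lattice neighbours is a fluid node.
    touches_fluid = np.zeros_like(is_solid)
    for e in E[1:]:                              # skip the rest population
        touches_fluid |= np.roll(~is_solid, shift=tuple(e), axis=(0, 1, 2))
    return is_solid & touches_fluid

def apply_half_way_bounce_back(f, bb_nodes):
    # Copy and flip the populations stored on bounce-back nodes so that the
    # following propagation step returns them along the opposite direction.
    f[:, bb_nodes] = f[OPPOSITE][:, bb_nodes]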

1 Named two-step implementation by Mattila et al. [9] despite the fact that a two-step procedure can also be used within a two-lattice implementation.

3. Data structure and memory usage optimization

The original implementation of the LBM collision–propagation scheme, the so-called two-lattice algorithm (e.g. Wang et al. [17]), requires storing information of the n_d populations at the current and the next time steps in two 4-dimensional double precision arrays or matrices (three dimensions to locate the data in space and one dimension to identify the population). Using two matrices has the advantage of simplifying the algorithm and allows performing the collision and the propagation steps simultaneously (i.e. by fusing Eqs. (1) and (2)), although a two-step procedure is also feasible, but probably not as efficient. In addition, the geometry of the domain is stored in a 3-dimensional 1-byte array, or phase matrix, and allows retrieving the phase information, i.e. whether a node is a fluid or a solid node. In this case, the explicit tagging of periodic and bounce-back nodes is not required since it can be replaced by (expensive) if-conditions. Finally, additional space for the fluid density and the three components of the velocity can also be used, like in [17], but is not mandatory if the underlying calculations are performed locally as the collision step proceeds. For a (n_x × n_y × n_z) lattice, the memory requirement (q_mem) for the two-lattice implementation with matrix data structure is then

q_mem ≈ {8 bytes} × 2 × n_d × (n_x × n_y × n_z)   [population storage]
      + {1 byte} × (n_x × n_y × n_z)   [phase matrix],   (8)

where n_d is the number of populations used in the lattice, equal to 15 for a D3Q15 lattice, {8 bytes} indicates the computations are done in double precision and {x byte(s)} refers to an x-byte integer array for x < 8.
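For scale, a quick worked example for a hypothetical 400³ D3Q15 lattice (the domain size used later for the sphere packing): Eq. (8) gives

q_{mem} \approx 8 \times 2 \times 15 \times 400^3\ \mathrm{bytes} + 400^3\ \mathrm{bytes} \approx 15.4\ \mathrm{GB} + 0.06\ \mathrm{GB},

which already approaches the 16 GB of RAM available per server on the Artemis cluster (Table 3) and motivates the leaner one-lattice and vector variants discussed next.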

It appears that the use of (tagged) periodic and bounce-back nodes and a clever order for the population updates during the propagation step may lead to a reduction of memory requirement, since only one 4-dimensional array or matrix is then necessary to store the population information required. For this to work, the order must avoid the overwriting of a yet to be propagated population by one that has been. This leads to the so-called one-lattice implementation that will be used hereafter. It can be shown that the memory requirement for the one-lattice implementation with matrix data structure is given by

q_mem ≈ {8 bytes} × n_d × (n_x + 2) × (n_y + 2) × (n_z + 2)   [population storage]
      + {4 bytes} × n_b × (4 × 1 + 4 × (n_d - 1))   [bounce-back treatment]
      + {1 byte} × (n_x + 2) × (n_y + 2) × (n_z + 2)   [phase matrix],   (9)

where n_b ≈ α S / δx² is the number of bounce-back nodes, which depends on the total surface area S of the solid phase, and α ≥ 1 is a variable that depends on the shape of the interface and its orientation with respect to the lattice reference axes (for cylinders aligned along one reference axis, we found α ≈ 1.26). The number of integer arrays required to store and treat the bounce-back nodes results from an algorithm proposed in [1]. On the downside, the presence of periodic and bounce-back node halos makes the adaptation of the parallel workload balance procedure proposed by Wang et al. [17] trickier. We will come back to this point in the next section.

Fig. 2. 2D schematic discretization of a porous medium (a) using a 24 × 15 node domain. Here, the sparse matrix data structure (b) is compared with a (ordered) vector data structure (c). Note that the vector is made of three "sub-vectors" for storing population information related to fluid, bounce-back and periodic nodes, respectively. (In panel (c), these sub-vectors contain 220 fluid nodes, 104 bounce-back nodes and 82 periodic nodes.)

Further memory gain can be made by considering that, for a porous domain of porosity e, no operation takes place on solid nodes. In the above mentioned data structures, a great amount of memory is wasted in storing zeros for the solid nodes. As already mentioned, several researchers [3–5] have proposed to save up a significant amount of memory by linearizing into vectors the matrices and only storing information related to fluid nodes. As a result, the phase matrix is no longer necessary. The downside is that the node topology provided by a matrix data structure is lost and direct addressing between nodes is no longer feasible. The vector data structure thus requires indirect addressing through the use of a connectivity list for all fluid nodes. Also, it is convenient to store the integer coordinate list of these nodes since such a list is required when post-processing the simulation results. The one-lattice implementation with vector data structure then entails the following memory requirement:

q_mem ≈ {8 bytes} × (n_f + n_c + n_b) × n_d   [population storage]
      + {4 bytes} × ( n_b × (2 × (n_d - 1) + 1)   [bounce-back treatment]
                    + n_c × (n_d,in + 1 + 1)   [periodicity treatment]
                    + n_f × (n_d - 1) )   [connectivity list]
      + {2 bytes} × (n_f × 3)   [coordinate list],   (10)

where n_f = (n_x × n_y × n_z) × e is the number of fluid nodes, n_c ≈ ((n_x + 2) × (n_y + 2) × (n_z + 2) - n_x × n_y × n_z) × e the number of periodic nodes, and n_d,in the number of inward-pointing populations, which is equal to 5 for a D3Q15 lattice. Transforming the matrix data structure into a vector implies the replacement of the matrix of populations by three "sub-vectors", one for the fluid nodes, one for the bounce-back nodes and one for the periodic nodes, all three of which are combined into one single vector (as illustrated in Fig. 2c). Similarly, for comparison purposes, the memory requirement for the two-lattice implementation with vector data structure is provided:

q_mem ≈ {8 bytes} × (2 × n_f × n_d)   [population storage]
      + {4 bytes} × n_f × (n_d - 1)   [connectivity list]
      + {2 bytes} × (n_f × 3)   [coordinate list].   (11)

The resulting memory usage for these different data structures will be assessed in Section 5, but it can be readily seen by comparing Eqs. (8)–(11) that a significant memory gain can be made if a vector is used instead of a matrix, since the population storage, which takes up most of the memory, is proportional to the porosity. In many applications, the porosity of the porous media is low.
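The four estimates are easy to compare numerically. The sketch below (a hypothetical helper, not part of the paper's code) simply encodes Eqs. (8)–(11) for a D3Q15 lattice; the number of bounce-back nodes nb must be supplied, for instance from the α S / δx² estimate given with Eq. (9):

ND, ND_IN = 15, 5   # D3Q15: total and inward-pointing populations

def mem_two_lattice_matrix(nx, ny, nz):                  # Eq. (8)
    return 8 * 2 * ND * nx * ny * nz + 1 * nx * ny * nz

def mem_one_lattice_matrix(nx, ny, nz, nb):              # Eq. (9)
    halo = (nx + 2) * (ny + 2) * (nz + 2)
    return 8 * ND * halo + 4 * nb * (4 * 1 + 4 * (ND - 1)) + 1 * halo

def mem_one_lattice_vector(nx, ny, nz, porosity, nb):    # Eq. (10)
    nf = nx * ny * nz * porosity
    nc = ((nx + 2) * (ny + 2) * (nz + 2) - nx * ny * nz) * porosity
    return (8 * (nf + nc + nb) * ND
            + 4 * (nb * (2 * (ND - 1) + 1) + nc * (ND_IN + 1 + 1) + nf * (ND - 1))
            + 2 * nf * 3)

def mem_two_lattice_vector(nx, ny, nz, porosity):        # Eq. (11)
    nf = nx * ny * nz * porosity
    return 8 * 2 * nf * ND + 4 * nf * (ND - 1) + 2 * nf * 3

# Example: approximate bytes per lattice node on a 400**3 lattice at 27.5%
# porosity, neglecting bounce-back nodes (nb = 0) for simplicity.
n = 400
print(mem_one_lattice_vector(n, n, n, 0.275, nb=0) / n**3)   # about 50 bytes/node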

Another issue related to the data structure is the ordering of the n_d populations in memory, which can affect the computational efficiency. When saving the populations in memory, one can either decide to store contiguously all the n_d populations of a given node and proceed nodewise, or store contiguously a given population for all the nodes and proceed populationwise. The first approach is called a collision-optimized data structure since it tends to give better performance during the collision step, whereas the second one is referred to as the propagation-optimized data structure for opposite reasons. Mattila et al. [9] showed that the advantage of one over the other is processor dependent, with the propagation-optimized data structure giving better performance on Opteron processors, whereas the reverse is observed on Xeon processors. They also observed that a so-called bundle data structure, a somewhat hybrid version of these two structures, offers better overall performance on both processor types due to lower data cache misses. A propagation-optimized data structure was used in this work.
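In array terms, the two orderings differ only in which index varies fastest in memory. A schematic illustration (hypothetical shapes, with NumPy imported as in the earlier sketches):

nf, nd = 100_000, 15

# Collision-optimized: the nd populations of one node are contiguous, so
# f_coll[node, :] is read as one contiguous block during the collision step.
f_coll = np.empty((nf, nd))

# Propagation-optimized (used in this work): all nodes of one population are
# contiguous, so f_prop[q, :] streams through memory during propagation.
f_prop = np.empty((nd, nf))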

4. Parallel workload balance and communication strategies

Three key aspects of a parallel LBM implementation are examined in this section: computational workload, communication overhead and resulting parallel performance.


4.1. Workload balance strategy

With a vector data structure, it becomes straightforward to balance the workload on a parallel computer, as shown by Wang et al. in the case of a two-lattice LBM implementation without bounce-back and periodic nodes [17]. All that must be done is to ensure that each processor receives an equal portion of the vector data. In our one-lattice implementation, the situation is more complicated owing to the presence of periodic and bounce-back nodes. In theory, to achieve a perfect load balance, these periodic and bounce-back nodes would also need to be equally split between processors. In practice, splitting evenly periodic and bounce-back nodes would increase significantly the complexity of the communication layout between processors and require the exchange of data about periodic and bounce-back nodes as well as additional fluid nodes. As a result, only the fluid node sub-vector is equally partitioned, as depicted in Fig. 3(a). This could lead to some workload imbalance depending on the number of periodic and bounce-back nodes that each processor has to deal with. However, for some domain geometries, in the presence of large heterogeneities for instance, one could choose a preferential order of vectorization to limit the potential imbalance of periodic and bounce-back nodes. Nevertheless, the computational load required for the update of periodic and bounce-back nodes would still be very low when compared to the one required for the fluid nodes, so that no noticeable impact on the overall workload should be expected.
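A minimal sketch of this even fluid node partitioning (hypothetical code, assuming the fluid nodes have already been vectorized, e.g. in z–y–x order): each of the np processors receives a contiguous, equally-sized slice of the fluid node sub-vector, and the subdomain interfaces simply fall wherever those slices end.

def partition_fluid_nodes(nf, np_procs):
    # Split nf vectorized fluid nodes into np_procs contiguous ranges [start, end);
    # the leftover nf % np_procs nodes are spread one per processor, so the
    # imbalance never exceeds a single fluid node.
    base, extra = divmod(nf, np_procs)
    bounds, start = [], 0
    for p in range(np_procs):
        end = start + base + (1 if p < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

# Example: the 220 fluid nodes of the 2D domain of Fig. 2 over four processors.
print(partition_fluid_nodes(220, 4))   # [(0, 55), (55, 110), (110, 165), (165, 220)]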

4.2. Communication strategy

Exchange of data between processors only involves fluid nodes, because no communication concerning periodic or bounce-back nodes is necessary due to the fact that boundary conditions are treated after the fluid node data exchange. Overall, the workload balance procedure based on fluid nodes only leads to a slice domain decomposition method that resembles classical ones. For instance, y–x vectorizations in 2D and z–y–x vectorizations in 3D create slices in the x direction, as can be seen in Figs. 3(b) and 4, respectively. In such cases, the data communication pattern between the processors is simple because each processor needs to communicate with only its two nearest processors. However, if not treated with care, the amount of data to be transferred between these processors can be substantially greater than in the case of classical slice decompositions. In fact, it depends on the position of the interface, which can be sharp or staircase-like. In the best-case scenario, the sharp interface, only the five populations pointing towards the neighbouring processor and located in the last layer of nodes (for instance layer I in Fig. 4, and layer I and populations 1, 7, 9, 11 and 13 in Fig. 5, which both correspond to an x-forward data transfer) are required. In the worst-case scenario, the staircase-like interface, the five populations pointing towards the neighbouring processor are required, but several of the four populations in the y–z plane of the interface (populations 3, 4, 5 and 6 for the x-forward direction) may also be required depending on the y–z position of the interface.

The easy solution is to send the nine populations required (1, 3, 4, 5, 6, 7, 9, 11 and 13) to the neighbouring processor for all the fluid nodes of layers I and II (see the simple data transfer layout in Fig. 5). This has the advantage of being straightforward to code, but entails increased communication overhead since more data than required are exchanged. Although a clear enough description of the data transfer layout they used is lacking, Wang et al. [17] seem to have adopted this strategy because they mention that the data to be exchanged are contiguous in memory. The actual amount (q_com) of data to be sent depends on the position of the interface in layer II. If the interface is located at the beginning of layer II (lower left corner), then q_com ≈ 9 × (n_y × n_z) × e. In the worst case, if the interface is located at the end of layer II (upper right corner), q_com ≈ 2 × 9 × (n_y × n_z) × e. The likelihood to fall in the worst-case scenario at one interface increases with the number of processors used, and may yield a communication bottleneck. This scenario can be slightly improved at low implementation cost by using the so-called improved data transfer layout presented in Fig. 5, wherein the amount of data to transfer is q_com ≈ 2 × 7 × (n_y × n_z) × e. Finally, a fully-optimized data transfer layout sending only the required populations for each individual fluid node in layers I and II can be constructed (see Fig. 5). As can be seen, the amount of communication is now equal to q_com ≈ 5 × (n_y × n_z) × e, regardless of the location of the interface (assuming that the interface ghost layers have the same porosity as that of the whole domain, and that n_y and n_z >> 1), which is equal to the amount of communication required by a sharp interface. In other words, such a layout guarantees a balanced communication workload between the processors. This comes at the cost of (1) a trickier to implement data transfer layout, which however needs to be established once and for all at the pre-processing stage, and (2) a data preparation step at each iteration, since the data are no longer contiguous in memory. However, the data preparation (i.e. packing and unpacking) is found to be negligible as compared to the gain obtained from the reduction of the amount of data to be transferred (i.e. [2 × 9 × n_y × n_z × e] vs. [2 × 7 × n_y × n_z × e] or [≈5 × n_y × n_z × e]). In this work, both the improved and the fully-optimized data transfer layouts are used and compared.
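The fully-optimized layout can be summarized as a pre-processing pass that records, for every fluid node near an interface, exactly which populations will land on a fluid node owned by the next processor. The fragment below is only a schematic sketch of that idea with hypothetical data structures (owner, neighbour); the authors' actual implementation is in Fortran/MPI:

def build_forward_send_list(owner, neighbour, my_rank, E):
    # owner[i] is the rank owning vectorized fluid node i; neighbour[i][q] is the
    # vectorized index of the node reached from i along direction q, or None if
    # that target is a solid, bounce-back or periodic node (never exchanged).
    # Only populations whose target fluid node lives on processor my_rank + 1
    # are packed, which yields roughly 5 populations per interface fluid node.
    send_list = []
    for i, rank in enumerate(owner):
        if rank != my_rank:
            continue
        for q in range(1, len(E)):               # skip the rest population
            j = neighbour[i][q]
            if j is not None and owner[j] == my_rank + 1:
                send_list.append((i, q))
    return send_list

Because the solid phase is static, such a list is built once at the pre-processing stage and then reused at every iteration to pack the corresponding populations into a contiguous send buffer and to unpack them on the receiving side.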

4.3. Parallel performance

The parallelization of our LBM code was accomplished by means of a single-program multiple-data (SPMD) model, and implemented using MPI and a Fortran compiler. Communications between processors are carried out by non-blocking MPI_ISEND and MPI_IRECV subroutines. When evaluating parallel performance, one can either keep the lattice dimensions constant while increasing the number of processors (a speed-up scenario), or increase the lattice dimensions proportionally to the number of processors used (a scale-up scenario). If N_tot is the total number of lattice nodes and n_p the number of processors used for a given simulation, these two scenarios lead respectively to N_tot = n_x × n_y × n_z and N_tot = n_x × n_y × n_z × n_p. One can then derive for both scenarios the following theoretical efficiency model E(n_p) for a porous medium of average porosity e (see Appendix):

E(n_p) = \frac{e}{e_{max} + \frac{q \, n_p (t_{lat} + s \, t_{data})}{N_{tot} \, m \, t_{oper}}} = \frac{e}{e_{max} + r_{cc}},   (12)

where e_max is the highest subdomain porosity, m is the number of arithmetic (floating-point) operations per lattice node per iteration, q is the number of MPI_ISEND and MPI_IRECV calls required per processor (q = 4 when each processor has two neighbours), t_oper is the average time spent per arithmetic operation, t_lat is the message latency, t_data is the average time to transfer 1 byte of data, s is the amount of data that needs to be transferred to one neighbouring processor, which is equal to q_com × 8 bytes, and r_cc represents the ratio between the communication and the computational workload. The product (m t_oper) for a specific code on a given machine, hereafter called the nodal computational time, can be approximated by

m \, t_{oper} \approx \frac{t_{CPU,1}}{n_{it} \times n_f},   (13)

where t_CPU,1 is the CPU time measured on a single processor and n_it the number of iterations. The latency t_lat and the data transfer time t_data can be evaluated using utilities such as mpptest [20] and mpiP [21], respectively, although only rough approximations can be obtained because the actual values depend on the number and the size of the messages transferred.

Fig. 3. Schematic splitting of the (ordered) vector of Fig. 2 between four processors (CPUs) (a) and the resulting splitting on the original discretization (b). Note that the vectorization order (here y–x order) determines the location of the CPU interfaces (dashed red lines). Green and blue nodes are ghost layer nodes on the left and the right of each subdomain, respectively. No computations are performed on these ghost layer nodes. Only memory is allocated for the corresponding population data to be transferred during the propagation step. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Furthermore, the computational performance P_comp expressed in MLUPS (Millions of Lattice fluid node Updates Per Second) can be obtained by

P_{comp}(n_p) = \frac{10^{-6} \, e \, N_{tot}}{e_{max} \frac{N_{tot}}{n_p} m \, t_{oper} + q (t_{lat} + s \, t_{data})} = \frac{10^{-6} \, n_p}{m \, t_{oper}} E(n_p).   (14)

Interestingly, this equation links the parallel efficiency to the computational performance, which can be evaluated experimentally as

P_{comp}(n_p) = \frac{10^{-6} \, e \, N_{tot} \, n_{it}}{t_{CPU,n_p}},   (15)

where t_CPU,n_p represents the CPU time with n_p processors. Combining Eqs. (14) and (15) gives a way to measure experimentally the parallel efficiency without timings on one single processor:

E(n_p) = \frac{e \, N_{tot} \, n_{it} \, m \, t_{oper}}{n_p \, t_{CPU,n_p}}.   (16)

Note that simulations on one single processor may not be feasible because of the size of the computational domain. In fact, this equation allows the evaluation of experimental parallel efficiencies for very large domains. It is valid if the computational time is proportional to the number of lattice nodes, which is the case for the LBM algorithms considered in this work.

Fig. 4. Possible interface scenarios between subdomain n and subdomain n+1 (transparent) and the location of the layer data that need to be sent from processor n to processor n+1. Note that the vectorization order is here z–y–x.

From Eq. (12), one can easily see that the efficiency is bounded by e/e_max when communication overhead is negligible (i.e. r_cc << e_max), which means that, in such a case, any porosity variation among subdomains will significantly affect the parallel performance. Also, when the number of subdomains increases, the likelihood of an increase of the subdomain porosity variability grows rapidly, all the more so when the pore features are of the same order of magnitude as the subdomain dimensions, even for apparently homogeneous structures, as pointed out in [5] and illustrated by Fig. 6 in the case of Fontainebleau sandstone. Data presented in this figure show that if 64 processors were used to compute fluid flow through a ~5 mm³ Fontainebleau sandstone discretization, a parallel efficiency as low as 41.4% (= 15.0/36.5 × 100) would at best be observed. If communication overhead is not an issue, the one-lattice implementation with vector data structure should instead provide a theoretical parallel efficiency of 100% because e_max = e and workload is balanced.
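To show how the efficiency model is used in practice, the hypothetical snippet below evaluates Eq. (12) for a balanced vector decomposition (e_max = e) in a speed-up scenario. The timing constants are only order-of-magnitude values borrowed from Tables 1 and 2; this is an illustration of the model, not a reproduction of the measured curves:

def parallel_efficiency(e, e_max, n_p, n_tot, m_toper, q, t_lat, t_data, s):
    # Eq. (12): theoretical parallel efficiency for average porosity e.
    r_cc = q * n_p * (t_lat + s * t_data) / (n_tot * m_toper)
    return e / (e_max + r_cc)

# Balanced decomposition of a 400**3 lattice at 27.5% porosity, q = 4 messages,
# t_lat ~ 5 us, t_data ~ 8 ns/byte, m*t_oper ~ 879 ns, and s corresponding to
# the fully-optimized layout (about 5 populations per interface fluid node).
e, n = 0.275, 400
s = 5 * n * n * e * 8                      # bytes sent to one neighbour
for n_p in (8, 32, 128):
    print(n_p, round(parallel_efficiency(e, e, n_p, n**3, 879e-9, 4, 5e-6, 8e-9, s), 3))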

5. Numerical experiments

In order to demonstrate the efficiency of the proposed algorithms, 3-dimensional fluid flow simulations through both a hexagonal packing of cylinders and a heterogeneous random packing of polydisperse spheres were conducted and compared with those obtained using classical slice domain decomposition.

5.1. Hexagonal packing of cylinders

The following numerical simulations were performed on the high-performance computing (HPC) cluster Mammouth(mp) from the Réseau Québécois de Calcul de Haute Performance (RQCHP). Table 1 summarizes the main features of this HPC cluster.

The dimension of the lattice was increased proportionally to the number of processors used (scale-up test). More precisely, the total number of lattice nodes (N_tot) was 600 × 10 n_p × 347 lattice nodes. Four different cylinder radii, R = 60.6, 67.1, 73.6 and 80.1 lattice nodes, were considered to investigate the impact of porosity variations on performance. An example of a computed flow field is presented in Fig. 7. The comparison between the numerical and analytical values of fluid permeability given in Fig. 8 shows the accuracy of our LBM code. Table 2 presents the nodal computational times (m t_oper) of the various codes used on Mammouth(mp), as calculated from Eq. (13).

The memory requirements predicted by Eqs. (8)–(11) show that the vector data structure offers a significant improvement over the sparse matrix data structure for both two-lattice and one-lattice implementations, when e ≤ 80% and e ≤ 70%, respectively (see Fig. 9). This is in good agreement with previously reported data [3–5]. At high porosity values, the small memory gain resulting from the removal of the solid nodes is surpassed by the cost of the coordinate and connectivity lists. As can also be seen for the one-lattice implementations, the finer the lattice size, the lower the memory requirements per lattice node, since a smaller proportion of bounce-back nodes is needed to discretize the interface between the fluid and solid phases. In this regard, the actual memory requirements for both one-lattice implementations are in good agreement with the predictions for the finest lattices. For the coarse lattices, a deviation from the predictions is observed. It appears that, due to the small size of the domains involved in these test cases, additional secondary arrays not taken into account in the memory models are no longer negligible. Finally, according to the theoretical models, the one-lattice implementation with vector data structure provides a noticeable reduction of memory usage compared to its two-lattice counterpart, especially for fine lattices. For coarse lattices and low porosity (< ~20%), the corresponding two-lattice implementation is less greedy, although the lattice resolution is such that poor accuracy is expected, with a relative error on permeability larger than 10%. Furthermore, note that the two-lattice implementation of Wang et al. [17] requires significantly more memory (~40%) than the values predicted here since they also store the density and the three components of the velocity, and use a D3Q19 lattice, i.e. 19 populations instead of 15 in the case of the D3Q15.

Fig. 5. Various population transfer layouts and resulting communication load (q_com) for the two types of interface (see Fig. 4) and a D3Q15 lattice (see Fig. 1) with periodic boundary conditions. Without loss of generality, assuming a z–y–x vectorization order, only the x-forward data transfer is presented. To construct the x-backward layout, opposite populations to the ones reported are used. Note that the data corresponding to solid, bounce-back and periodic nodes do not need to be exchanged. If the solid phase is static, the data transfer layout is determined once and for all at the pre-processing stage.

For this simple case study, there is an obvious slice domain decomposition along the y direction that can lead to high parallel efficiency due to straightforward workload balance and constant communication load (related to the exchange of data for 600 × 347 lattice nodes). In this regard, with R = 73.6 lattice nodes and 2 ≤ n_p ≤ 100, nearly constant parallel efficiencies of 96.7%, 96.8% and 98.8% were obtained, respectively, for the one-lattice implementation with x-slice decomposition, and for the one-lattice implementations with even vector partitioning using improved and fully-optimized data transfer layouts.

To highlight performance improvements with even vector partitioning domain decomposition when load imbalance is present, the computational domain in Fig. 7 was discretized and purposely decomposed in the x direction (i.e. the domain was swept during vectorization following the z–y–x order). In the case of a classical x-slice decomposition, the porosity ratio e/e_max can be evaluated analytically as a function of the number of subdomains and shown to decrease to a value as low as e for n_p ≥ 75, because e_max = 100% in such cases. As expected and as can be seen in Fig. 10(a), the resulting load imbalance creates a significant drop in parallel efficiency, which reaches values as low as near e = 34.5%. This experimental drop in parallel efficiency is in fact very well predicted by Eq. (12) (green dashed line in Fig. 10a). A substantial gain in performance as well as the elimination of the fluctuations due to porosity variations among processors can be observed for the proposed one-lattice implementations with even vector decomposition. The decrease in performance is now solely due to the increase in message passing as the number of processors is increased (we recall that s ∝ n_p for an x-axis decomposition). The imbalance in the number of periodic and bounce-back nodes from one processor to another does not seem to affect performance, as this would create noticeable fluctuations, which justifies our approach consisting in only balancing the number of fluid nodes. At 100 processors, 75% efficiency is achieved with the fully-optimized data transfer layout, which is much better than the 35% obtained with the x-axis decomposition, and significantly better than the ~60% obtained with the improved data transfer layout. As expected, the performance obtained with the fully-optimized data transfer layout surpasses the one achieved with the improved data transfer layout. The difference is well predicted by the performance model (red and orange dashed lines, respectively, obtained from Eq. (12) with e_max = e). Interestingly, as can be seen in Fig. 10(b) as well as in Table 2, the sequential execution of the one-lattice code with a vector data structure is ~1.3 times faster than its counterpart with a sparse matrix data structure (i.e. x-slice decomposition with one processor in Fig. 10b). This can be attributed to the removal of expensive if-conditions to test the state of the nodes when scanning through the sparse matrix data structure. Furthermore, this was obtained despite the use of indirect memory addressing of the entries of the vector data structure. As a matter of fact, Mattila et al. [9] showed recently that such indirect addressing rather improves computational performance. Finally, higher speed ratios are achieved as the number of processors is increased, to reach a maximum of ~2.7 at 100 processors.

Fig. 6. Experimental porosity variations of a Fontainebleau sandstone as a function of sample volume (drawn from X-ray microtomography data published in [22]). The volume ratios between the three samples and the largest one are 1, 1/8 and 1/64.

Table 1
Specifications of the Mammouth(mp) HPC cluster.

Mammouth(mp) parallel cluster
  Make: Dell PowerEdge SC1425
  Processors used: 2 × 128 Intel Xeon (a)
    – Clock speed: 3.6 GHz
    – Bus: FSB 800 MHz
    – RAM: 8 GB
    – Cache: 1 MB L2
  Network: InfiniBand 4× (700–800 MB/s)
    – t_lat (µs): ~5.0
    – t_data (ns/byte): ~8.0
  Operating system: RedHat Enterprise Linux 4 (2.6.9-42.0.3.ELsmp)
  Compiler: Portland Group pgf90 Fortran (6.0-4)
  Message passing: MPI (mvapich2 0.9.82)
(a) Only one processor per server was used.

Fig. 7. Y-cross section of a hexagonal packing of cylinders and the resulting flow velocity as computed by LBM (blue color chart). The cylinder radius R is 73.6 lattice nodes (δx = 0.0462 µm), which results in a domain porosity of 34.5%. Note that the pressure drop is imposed in the x-direction.

Fig. 8. Comparison between the x-axis normalized fluid permeability predicted by LBM and the analytical solution [23] for hexagonal packings of cylinders with four different radii R. The relative errors are smaller than 1.4%.

Table 2
Nodal computational time (m t_oper) for the various codes on the Mammouth(mp) HPC cluster.

Code                                                      Nodal computational time (ns)
                                                          With 4-byte integer    With 8-byte integer
                                                          compiler option        compiler option
One-lattice algorithm with vector data structure          879                    N/A
One-lattice algorithm with sparse matrix data structure   1142                   N/A

Fig. 9. Memory usage per node as a function of the porosity of a hexagonal packing of cylinders for one- and two-lattice LBM implementations with sparse matrix and vector data structures. The porosity is changed by varying the diameter of the cylinders. Lines correspond to model predictions (Eqs. (8)–(11)) and symbols are actual numerical experiment data points for the one-lattice implementations only. The thicker the lines or the bigger the symbols, the coarser the lattice (δx = 0.0462 µm). The porosity values for the three lattice resolutions, below which the computed permeabilities are off by more than 10% from the analytical solution, are displayed on the left of the x-axis.

Fig. 10. Parallel efficiency (a) and computational performance (b) comparisons on the Mammouth(mp) cluster between x-slice domain decomposition and even vector partitioning domain decomposition with both improved and fully-optimized data transfer layouts for a hexagonal packing of cylinders (R = 73.6 lattice nodes) and proportional domain size (scale-up test). The colored dashed lines in (a) represent the model predictions from Eq. (12) for the corresponding domain decompositions. In (b), the black dashed line represents the theoretical linear performance for the even vector partitioning domain decomposition.

5.2. Random packing of polydisperse spheres

To further investigate the parallel performance of the proposed LBM implementations, a second series of tests was carried out on a 3-dimensional random packing of polydisperse spheres generated using a Monte-Carlo packing procedure described elsewhere [24] (Fig. 11). To induce large scale heterogeneities within the packing, six ellipsoidal pore inclusions were introduced randomly within the packing. The domain size (400³ lattice nodes), which could fit in the memory of a single server, was kept constant as the number of processors was increased (speed-up test). Note that, for this case, the memory usage on a single processor using the vector data structure was 42% lower than that for the sparse matrix data structure (5.9 GB vs. 10.1 GB), which represents a substantial reduction. The tests were performed on the HPC cluster (Artemis) from FPInnovations. Table 3 summarizes the main features of this HPC cluster and Table 4 presents the nodal computational times (m t_oper) of the various codes used on this cluster.

A significant decrease in parallel performance can be observedin Fig. 12 as the number of processors increases for the three algo-rithms investigated. This is due to the reduction of the subdomaingranularity as the number of processors increases, which leads toan increase of the communication over computation ratio, rcc inEq. (12). As the weight of rcc becomes more and more importantthan that of emax in this equation, the upper bound for the parallelefficiency gradually switches from e/emax to e/rcc. In other words,this means that the communication overhead becomes moreimportant than the workload imbalance with regard to the parallelperformance. Moreover, when the number of processors is small,the use of the one-lattice vector implementation with improved

data transfer layout (with s / 14 � ny � nz � e) yields a slightimprovement of the parallel efficiency with respect to the one-lat-tice sparse matrix implementation (with s / 5 � ny � nz), which re-sults from a smaller amount of transferred data (14e � 3.9 < 5).When the number of processors increases, the efficiencies becomesimilar. However, it can be noted in Fig. 12b that the former re-mains �1.5 times faster than the latter over the whole range ofnumber of processors investigated. Furthermore, thanks to its care-ful data transfer layout, the one-lattice vector implementationwith fully-optimized layout (with s / 5 � ny � nz � e) provides anoticeable improvement over the other two algorithms. Thesetrends are also confirmed by the performance model predictions.

To assess the parallel performance as the domain size is in-creased (scale-up test from 3843 to 17923 node lattices), simula-tions were carried out with 128 processors on HPC clusterArtemis for all three algorithms. The computational performanceand parallel efficiency was calculated through Eqs. (15) and (16)for the three algorithms (Fig. 13). As expected, the computationalperformance and efficiency increase with the number of lattice(fluid) nodes in all cases. Moreover, the vector decompositionwith the improved data transfer layout provided enhanced perfor-mance as compared to the sparse matrix implementation withclassical slice domain decomposition. This is mainly due to thebetter sequential performance of the resulting code as the effi-ciency is only slightly better. On the other hand, resorting to thefully-optimized data transfer layout outperformed the other two

Fig. 11. Random packing of spheres with a lognormal particle size distribution (geometric standard deviation equal to 2.5 and median particle size of 0.6 lm), as createdusing a Monte-Carlo packing procedure described in [24] (left). Forty-five different particle sizes were used ranging from 0.1 to 4 lm. The warmer the particle color, thesmaller its size. Ellipsoid-shaped pore inclusions (total volume equal to 6 � �1.85 lm3 and ellipsoid aspect ratio equal to 1.5) were introduced within the packing to inducelarge scale heterogeneities (right). The overall packing porosity is equal to 27.5%. (For interpretation of color mentioned in this figure, the reader is referred to the web versionof this article.)

Table 3
Specifications of the Artemis HPC cluster.

Artemis parallel cluster
  Make:                 Dell PowerEdge 1950
  Processors used:      2 × 64 quad-core Intel Xeon 5440
    – Clock speed:      2.83 GHz
    – Bus:              FSB 1333 MHz
    – RAM:              16 GB
    – Cache:            12 MB L2
  Network:              Gigabit Ethernet
    – t_lat (μs):       ~1.0
    – t_data (ns/byte): ~12.0
  Operating system:     CentOS 4.6 (2.6.9-67.0.15.ELsmp)
  Compiler:             Intel Fortran (10.1.015)
  Message passing:      MPI (OpenMPI 1.2.6)

Table 4
Nodal computational time (m·t_oper) for the various codes on the Artemis HPC cluster.

Code                                                        Nodal computational time (ns)
                                                            With 4-byte integer    With 8-byte integer
                                                            compiler option        compiler option
One-lattice algorithm with vector data structure            443                    478
One-lattice algorithm with sparse matrix data structure     704                    802

[Fig. 12: panel (a) plots parallel efficiency (%) and panel (b) computational performance (MLUPS) against the number of processors for the x-slice decomposition and the proposed vector decompositions with improved and fully-optimized data transfer layouts.]

Fig. 12. Parallel efficiency (a) and computational performance (b) comparisons on the Artemis cluster between the x-slice domain decomposition and the even vector partitioning domain decompositions with both improved and fully-optimized data transfer layouts, for a random packing of spheres with constant domain size (speed-up test). The colored dashed lines in (a) represent the model predictions from Eq. (12) for the corresponding domain decompositions. In (b), the black dashed line represents the theoretical linear performance for the even vector partitioning domain decomposition. (For interpretation of color mentioned in this figure, the reader is referred to the web version of this article.)


implementations both in terms of efficiency and performance. The model predictions from Eqs. (12) and (14) (colored dotted lines) become extremely good above ≈1.2 × 10⁸ fluid nodes (768³ node lattices). As all algorithms closely follow the performance and efficiency models, it can be concluded that they converge to the expected bounds, that is E → 1 (balanced workload) for the vector decompositions and E → ε/ε_max = 27.5%/42.2% = 65.2% (unbalanced workload) for the slice domain decomposition. Also note the small decrease in performance when simulations were performed with 8-byte integers (filled symbols), which were required whenever the number of lattice nodes was too large to be indexed by 4-byte integers. Finally, the fact that domains three times larger (comprising up to 1.6 billion fluid nodes) could be simulated with the even vector partitioning domain decompositions highlights the memory advantage of this strategy over the use of sparse matrices.
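The point at which 8-byte integers become necessary can be anticipated from the lattice size alone. The sketch below is a rough check only; the actual switch point also depends on which quantities are indexed (e.g. populations rather than nodes).

```python
# Rough check of when signed 4-byte integers can no longer index the lattice.
INT32_MAX = 2**31 - 1                      # 2,147,483,647

for n in (400, 768, 1280, 1792):           # lattice edge lengths (illustrative)
    n_nodes = n**3
    width = "4-byte" if n_nodes <= INT32_MAX else "8-byte"
    print(f"{n}^3 = {n_nodes:>13,d} nodes -> {width} indexing")
```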

6. Conclusion and perspectives

A one-lattice vector implementation of LBM with even fluid node partitioning domain decomposition was introduced and shown to (1) substantially reduce the memory usage when domain

[Fig. 13: panel (a) plots parallel efficiency (%) and panel (b) computational performance (MLUPS) against the number of fluid nodes (in millions) for the x-slice decomposition and the proposed vector decompositions with improved and fully-optimized data transfer layouts.]

Fig. 13. Parallel efficiency (a) and computational performance (b) comparisons at 128 processors on the Artemis cluster between the x-slice domain decomposition and the even vector partitioning domain decompositions with both improved and fully-optimized data transfer layouts, for a random packing of spheres with increasing domain size (scale-up test). The colored dotted lines represent the model predictions from Eqs. (12) and (14) for the corresponding domain decompositions. Open and filled symbols correspond to simulations performed with 4- and 8-byte integers, respectively. (For interpretation of color mentioned in this figure, the reader is referred to the web version of this article.)


porosity is low (below ≈70%), (2) decrease the sequential execution time thanks to the removal of the costly if-conditions needed to test the state of the nodes when scanning through the sparse matrix data structure, and (3) eliminate the workload imbalance resulting from local heterogeneities of a porous structure and thus improve parallel performance. It was found that, although workload balancing is only carried out on the fluid nodes, the imbalance in bounce-back and periodic nodes does not measurably affect the performance, which indicates that the algorithm is nearly optimal for load balancing. Although the communication patterns are simple (i.e. communications involving only two neighbouring processors, as in slice domain decomposition), it was observed that the amount of populations transferred to the neighbouring processors should be minimized to reduce the communication overhead. Consequently, a fully-optimized data transfer layout for the D3Q15 lattice and periodic boundary conditions, easily adaptable to other types of lattice, was developed and shown to outperform other vector and matrix data transfer layouts. Predictive memory usage and parallel performance models were also established and found to be in very good agreement with the measured data.
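Point (2) can be illustrated schematically: the sparse matrix structure forces a state test on every lattice node, whereas the vector structure iterates only over a precomputed list of fluid nodes. The sketch below is a language-neutral illustration in Python, not the authors' Fortran implementation.

```python
# Schematic contrast between the two update loops (not the authors' Fortran code).

def update_sparse_matrix(nodes, is_fluid, update_node):
    """Sparse matrix structure: every lattice node is visited and tested."""
    for node in nodes:
        if is_fluid[node]:        # branch evaluated N_tot times per iteration
            update_node(node)

def update_vector(fluid_nodes, update_node):
    """Vector structure: only the pre-listed fluid nodes are visited, no test needed."""
    for node in fluid_nodes:      # loop runs only eps * N_tot times per iteration
        update_node(node)
```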

The proposed method still retains one of the drawbacks of classical slice domain decompositions when dealing with large interfaces between subdomains combined with small problem granularity, which can create an important communication overhead. We believe that parallel performance in such cases can be better with methods that minimize subdomain interfaces, such as the spectral recursive-bisection and multilevel graph partitioning methods. However, this drawback could be alleviated by simplifying the LBM algorithm and further reducing the amount of data to be exchanged. This will be the subject of a forthcoming paper. The proposed method appears to be well-suited for the (dynamic) load balancing of simulations of fluid flow through very large and complex heterogeneous porous domains, such as those involving settling particles [25] or packings of polydisperse spheres. As reported in a companion paper [24], we have recently used the parallel computing and memory-efficient algorithm developed here to perform a comprehensive study of flows through compressed packings of highly polydisperse spheres, and derived a modified Carman–Kozeny correlation from the detailed numerical experiments. Finally, note that the algorithm proposed here is likely to run efficiently on heterogeneous architectures because the fluid node vector can be judiciously split according to the relative computational speed of the processors involved in a parallel simulation.

Acknowledgments

The computer resources and support from the Réseau Québécois de Calcul de Haute Performance (RQCHP) and from FPInnovations, as well as the financial contribution of the NSERC Sentinel Network, are gratefully acknowledged. Special thanks to Louis-Alexandre Leclaire and Christian Poirier.

Appendix A. Development of the theoretical parallel performance models for heterogeneous porous media

Two cases need to be examined: the speed-up test (constant domain size) and the scale-up test (proportional domain size).

A.1. Speed-up test case

The analysis for the speed-up test case is based on the following assumptions:

(1) the computer architecture is homogeneous;
(2) the processors are linked directly to each other;
(3) the sequential fraction of the code is negligible;
(4) the computation time comprises all the arithmetic operations performed on fluid nodes (no computations on solid nodes), which are considered equal;
(5) the lattice domain size is constant as the number of processors varies and equal to N_tot = n_x × n_y × n_z (speed-up test).

If the domain of average porosity ε and lattice size N_tot is partitioned into n_p subdomains of porosity ε_i and equal lattice size N, we have:

\varepsilon = \frac{1}{n_p} \sum_{i=1}^{n_p} \varepsilon_i \qquad (A1)

and

\varepsilon_i N = \frac{\varepsilon_i N_{tot}}{n_p}, \qquad (A2)

where ε_i N represents the number of fluid nodes in subdomain i.

If t_cal,i denotes the time spent per iteration by processor i to perform m arithmetic (floating-point) computations per fluid node in its associated subdomain, and t_com,i the overall time spent per iteration by this processor to transfer s_i,j bytes of data to each of its n_i neighbouring processors j, we then obtain:

t_{cal,i} = \left[ \frac{\varepsilon_i N_{tot}}{n_p}\, m\, t_{oper} \right] \times n_{it} \qquad (A3)

and

t_{com,i} = \left[ 2 \sum_{j=1}^{n_i} \left( t_{lat} + s_{i,j}\, t_{data} \right) \right] \times n_{it}, \qquad (A4)

where the "2" in Eq. (A4) stands for the number of messages (send/receive) required per processor for each neighbouring processor, t_oper is the average time spent per arithmetic operation, t_lat the communication latency, t_data the average time to transfer 1 byte of data, and n_it the number of iterations performed to reach the desired solution.

As the CPU time of the parallel LBM code running on n_p processors is limited by the slowest processor, we have:

t_{CPU,n_p} = \max_i \left[ t_{cal,i} + t_{com,i} \right] = \max_i \left[ \frac{\varepsilon_i N_{tot}}{n_p}\, m\, t_{oper} + 2 \sum_{j=1}^{n_i} \left( t_{lat} + s_{i,j}\, t_{data} \right) \right] \times n_{it}. \qquad (A5)

If now i = max denotes the subdomain with the highest porosity, ε_max, we then have:

t_{CPU,n_p} \approx \left[ \frac{\varepsilon_{max} N_{tot}}{n_p}\, m\, t_{oper} + 2 \sum_{j=1}^{n_{max}} \left( t_{lat} + s_{max,j}\, t_{data} \right) \right] \times n_{it}. \qquad (A6)

Note that this simplification is only valid if \varepsilon \frac{N_{tot}}{n_p}\, m\, t_{oper} \gg 2 \sum_{j=1}^{n_i} \left( t_{lat} + s_{i,j}\, t_{data} \right).

Also, the sequential execution time of the LBM code is given by:

t_{CPU,1} = \left[ \varepsilon N_{tot}\, m\, t_{oper} \right] \times n_{it}. \qquad (A7)

The parallel efficiency for a speed-up test is defined as:

E(n_p) = \frac{t_{CPU,1}}{n_p\, t_{CPU,n_p}}. \qquad (A8)

Combining Eqs. (A6)–(A8) gives the efficiency model for heterogeneous porous media in the case of a speed-up test:

E(n_p) = \frac{\varepsilon}{\varepsilon_{max} + \dfrac{2 n_p \sum_{j=1}^{n_{max}} \left( t_{lat} + s_{max,j}\, t_{data} \right)}{N_{tot}\, m\, t_{oper}}}. \qquad (A9)

Finally, the parallel computational performance expressed in MLUPS (Millions of Lattice fluid node Updates Per Second) can be obtained through:

P_{comp}(n_p) = \frac{10^{-6} \varepsilon N_{tot}\, n_{it}}{t_{CPU,n_p}} = \frac{10^{-6} \varepsilon N_{tot}}{\dfrac{\varepsilon_{max} N_{tot}}{n_p}\, m\, t_{oper} + 2 \sum_{j=1}^{n_{max}} \left( t_{lat} + s_{max,j}\, t_{data} \right)} = \frac{10^{-6} n_p}{m\, t_{oper}}\, E(n_p). \qquad (A10)
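Eqs. (A9) and (A10) can be evaluated directly once the cluster constants (Tables 3 and 4) and the per-neighbour message sizes are known. The following minimal sketch (a transcription in Python; the message-size list is left to the user because it depends on the decomposition and data transfer layout) implements both expressions.

```python
# Direct transcription of Eqs. (A9) and (A10) for the speed-up test case.
# Times are in seconds, message sizes in bytes; s_max lists the data volume
# sent to each neighbour of the most loaded (highest-porosity) processor.

def efficiency_speedup(n_p, eps, eps_max, N_tot, m_t_oper, t_lat, t_data, s_max):
    """Parallel efficiency E(n_p) from Eq. (A9)."""
    comm = 2.0 * sum(t_lat + s * t_data for s in s_max)   # per-iteration communication time
    return eps / (eps_max + n_p * comm / (N_tot * m_t_oper))

def performance_mlups(n_p, m_t_oper, efficiency):
    """Computational performance in MLUPS from Eq. (A10)."""
    return 1e-6 * n_p / m_t_oper * efficiency
```

As noted below, the same expressions apply to the scale-up case with the appropriate definition of N_tot.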

A.2. Scale-up test case

The analysis for the scale-up test case is based on the following assumptions:

(1–4) same as for the speed-up test case;
(5) the lattice domain size is proportional to the number of processors: N_tot = n_x × n_y × n_z × n_p (scale-up test).

Note that Eqs. (A1)–(A6) still hold, but with the new definition of N_tot. However, the sequential execution time of the LBM code with this new definition of N_tot now becomes:

t_{CPU,1} = \left[ \frac{\varepsilon N_{tot}}{n_p}\, m\, t_{oper} \right] \times n_{it}. \qquad (A11)

The parallel efficiency for the scale-up test is now defined as:

E(n_p) = \frac{t_{CPU,1}}{t_{CPU,n_p}}. \qquad (A12)

Combining Eqs. (A6), (A11) and (A12) gives the efficiency model for heterogeneous porous media in the case of a scale-up test:

E(n_p) = \frac{\varepsilon}{\varepsilon_{max} + \dfrac{2 n_p \sum_{j=1}^{n_{max}} \left( t_{lat} + s_{max,j}\, t_{data} \right)}{N_{tot}\, m\, t_{oper}}}. \qquad (A13)

As can be seen, Eqs. (A9) and (A13) are identical except for the underlying definition of N_tot. Finally, Eq. (A10) also holds for the scale-up test with the appropriate definition of N_tot.
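The limiting behaviour observed in the scale-up test follows directly from Eq. (A13): at fixed n_p, the communication term scales with the subdomain interface area while the computation term scales with the subdomain volume, so the former vanishes as the lattice grows and

\lim_{N_{tot} \to \infty} E(n_p) = \frac{\varepsilon}{\varepsilon_{max}}.

With the sphere-packing values ε = 27.5% and ε_max = 42.2% reported above, this gives E → 65.2% for the x-slice decomposition, whereas the even fluid node partitioning effectively enforces ε_max = ε and hence E → 1.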

References

[1] Succi S. The lattice Boltzmann equation for fluid dynamics and beyond. Oxford, UK: Oxford Science Publications; 2001.

[2] Nourgaliev RR, Dinh TN, Theofanous TG, Joseph D. The lattice Boltzmann equation method: theoretical interpretation, numerics and implications. Int J Multiphase Flow 2003;29(1):117–69.

[3] Dupuis A, Chopard B. An object oriented approach to lattice gas modeling. Future Gener Comput Syst 2000;16(5):523–32.

[4] Schulz M, Krafczyk M, Tolke J, Rank E. Parallelization strategies and efficiency of CFD computations in complex geometries using lattice Boltzmann methods on high performance computers. In: Breuer M, Durst F, Zenger C, editors. High performance scientific and engineering computing. Berlin: Springer Verlag; 2002. p. 115–22.

[5] Pan C, Prins JF, Miller CT. A high-performance lattice Boltzmann implementation to model flow in porous media. Comput Phys Commun 2004;158:89–105.

[6] Martys NS, Hagedorn JG. Multiscale modeling of fluid transport in heterogeneous materials using discrete Boltzmann methods. Mater Struct 2002;35:650–9.

[7] Argentini R, Bakker AF, Lowe CP. Efficiently using memory in lattice Boltzmann simulations. Future Gener Comput Syst 2004;20(6):973–80.

[8] Pan C, Luo LS, Miller CT. An evaluation of lattice Boltzmann schemes for porous medium flow simulation. Comput Fluids 2006;35:898–909.

[9] Mattila K, Hyväluoma J, Timonen J, Rossi T. Comparison of implementations of the lattice-Boltzmann method. Comput Math Appl 2008;55(7):1514–24.

[10] Mattila K, Hyväluoma J, Rossi T, Aspnäs M, Westerholm J. An efficient swap algorithm for the lattice Boltzmann method. Comput Phys Commun 2007;176:200–10.

[11] Pohl T, Kowarschik M, Wilke J, Iglberger K, Rüde U. Optimization and profiling of the cache performance of parallel lattice Boltzmann codes. Parallel Process Lett 2003;13(4):549–60.

[12] Wellein G, Zeiser T, Hager G, Donath S. On the single processor performance of simple lattice Boltzmann kernels. Comput Fluids 2006;35:910–9.

[13] Satofuka N, Nishioka T. Parallelization of lattice Boltzmann method for incompressible flow computations. Comput Mech 1999;23:164–71.

[14] Kandhai D, Koponen A, Hoekstra AG, Kataja M, Timonen J, Sloot PMA. Lattice-Boltzmann hydrodynamics on parallel systems. Comput Phys Commun 1998;111:14–26.

[15] Axner L, Bernsdorf J, Zeiser T, Lammers P, Linxweiler J, Hoekstra AG. Performance evaluation of a parallel sparse lattice Boltzmann solver. J Comput Phys 2008;227:4895–911.

[16] Freudiger S, Hegewald J, Krafczyk M. A parallelization concept for a multi-physics lattice Boltzmann prototype based on hierarchical grids. Prog Comput Fluid Dyn Int J 2008;8(1–4):168–78.

[17] Wang J, Zhang X, Bengough AG, Crawford JW. Domain-decomposition method for parallel lattice Boltzmann simulation of incompressible flow in porous media. Phys Rev E 2005;72:016706–11.

[18] Bhatnagar PL, Gross EP, Krook M. A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Phys Rev 1954;94:511–25.

[19] Vidal D. Développement d'algorithmes parallèles pour la simulation d'écoulements de fluides dans les milieux poreux [Development of parallel algorithms for the simulation of fluid flows in porous media]. PhD Thesis, Ecole Polytechnique de Montréal; 2009.

[20] mpptest version 1.4b. Available from: http://www-unix.mcs.anl.gov/mpi/mpptest/; Oct 2008.

[21] mpiP version 3.1.2. Available from: http://mpip.sourceforge.net/; Nov 2008.

[22] Auzerais FM, Dunsmuir J, Ferréol BB, Martys N, Olson J, Ramakrishnan TS, et al. Transport in sandstone: a study based on three dimensional microtomography. Geophys Res Lett 1996;23(7):705–8.

[23] Hayes RE, Bertrand F, Tanguy PA. Modelling of fluid/paper interaction in the application nip of a film coater. Transport Porous Media 2000;40:55–72.

[24] Vidal D, Ridgway C, Pianet G, Schoelkopf J, Roy R, Bertrand F. Effect of particle size distribution and packing compression on fluid permeability as predicted by lattice-Boltzmann simulations. Comput Chem Eng 2009;33:256–66.

[25] Pianet G, Bertrand F, Vidal D, Mallet B. Modeling the compression of particle packings using the discrete element method. In: Proceedings of the 2008 TAPPI advanced coating fundamentals symposium, Atlanta, GA, USA: TAPPI Press; 2008.