
Message-passing for Lattice Boltzmann

Dissertation

Erlend Davidson

August 21, 2008

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2008

Abstract

The lattice Boltzmann algorithm is a successful and popular method for the simulation of fluid systems. In this study we improve the performance of Ludwig, an existing parallel implementation of lattice Boltzmann, by using MPI derived datatypes to reduce the amount of data sent between processors during each iteration. We detail how the optimisation was introduced without changing the structure of the existing code, and provide detailed benchmark results which show a marked improvement to parallel efficiency and run time on three modern HPC architectures.

Contents

1 Introduction
  1.1 Theory of Lattice Boltzmann Method
    1.1.1 Lattice Gas Cellular Automata Methods
    1.1.2 Lattice Boltzmann Method
  1.2 Implementation Details - The Ludwig Code
    1.2.1 Halo-swapping Communication Overheads
    1.2.2 Review of Existing Optimisations
  1.3 Purpose of Study

2 Design of Code
  2.1 Requirements
  2.2 Halo-Swapping Implementation - Derived Datatypes
  2.3 Implementation Stages
    2.3.1 Implement Full Halo-swapping using MPI_Type_struct
    2.3.2 Hard-code the Reduced Datatypes
    2.3.3 Implement the Reduced Halo-swapping Cleanly

3 Testing of Reduced Halo-swapping
  3.1 Modifications to Existing Tests
    3.1.1 test_halo.c
    3.1.2 test_model.c
    3.1.3 test_prop.c
    3.1.4 test_halo
    3.1.5 test_model
  3.2 Comparison of Results with Original Code
  3.3 Physical Quantities
  3.4 Testing of Non-reduced mode

4 Performance Results and Discussion
  4.1 Methodology of Benchmarks
  4.2 Overview of Architectures
    4.2.1 HECToR
    4.2.2 Bluegene/L
    4.2.3 HPCx
  4.3 Performance on HECToR
  4.4 Performance on Bluegene/L
  4.5 Performance on HPCx
  4.6 Large Problem Size
    4.6.1 HECToR
    4.6.2 HPCx
  4.7 Analysis
    4.7.1 Analysis of Original Ludwig Code Performance
    4.7.2 Analysis of Reduced Halo-swapping Code Performance
    4.7.3 Analysis of Full Halo-swapping Mode Performance
    4.7.4 Effect of Optimisation on Code Structure and Maintainability

5 Conclusions

A Review of Process
  A.1 Work Plan
  A.2 Risks
    A.2.1 Risk 1: Machine time unavailable
    A.2.2 Risk 2: Optimisation breaks code
    A.2.3 Risk 3: Illness
    A.2.4 Risk 4: Data loss


List of Tables

4.1 The absolute run times on HECToR, using D3Q15
4.2 Run times of the original and reduced codes for a large system on HECToR
A.1 The risk assessment from the project preparation


List of Figures

1.1 A hexagonal lattice showing six velocity vectors
1.2 Discrete velocities in the D3Q19 and D2Q9 models
1.3 Halo-swapping in 2-dimensions
1.4 A lattice subdomain on one processor (3-dimensions)
1.5 Graph of bandwidth vs. message size
2.1 The parts of “site” to transfer during halo-swapping
2.2 Performance of MPI_Type_struct for full halo-swapping
3.1 Effect of halo-swapping on corners
4.1 The time taken by the original and reduced codes on HECToR
4.2 Scaling and efficiency graphs on HECToR (D3Q15)
4.3 Scaling and efficiency graphs on HECToR (D3Q19)
4.4 Scaling and efficiency graphs on Bluegene/L (D3Q15)
4.5 Scaling and efficiency graphs on Bluegene/L (D3Q19)
4.6 Scaling and efficiency graphs on HPCx (D3Q15)
4.7 Scaling and efficiency graphs on HPCx (D3Q19)
4.8 Run times of the original and reduced codes for a large system on HECToR
4.9 Scaling and efficiency graphs on HECToR (large system, D3Q19)
4.10 Run times of the original and reduced codes for a large system on HPCx
4.11 Scaling and efficiency graphs on HPCx (large system, D3Q19)
4.12 Graph of halo sizes against total lattice size
A.1 Gantt graph of work plan


Acknowledgements

I would like to thank my supervisor Dr Kevin Stratford for his guidance during this project.

Chapter 1

Introduction

The lattice Boltzmann algorithm [3] is an increasingly popular method for the simulation of fluids. Rather than trying to solve the Navier-Stokes equations for fluid flow, the lattice Boltzmann method models the mesoscale local interactions between “particles” of fluid density, on a discrete representation of space. The algorithm is composed of a propagation stage, where particles move according to their velocity, and a collision stage, where particles interact on lattice sites.

The success of the lattice Boltzmann method (LBM) comes from its ability to simulate complex boundary conditions [4], fluid mixtures, and liquid-gas mixtures efficiently. It is also used for the accurate simulation of turbulent flow [5]. As well as being used for the study of theoretical fluid mechanics in academia, the LBM can be applied to realistic systems such as the flow of blood in arteries [6], suspension fluids and fuel exhaust systems [7]. The basic algorithm is easy to implement, and when parallelised scales well up to thousands of processors.

There has been a lot of interest in optimising the LBM, though most of the effort has been spent on memory-cache performance [8, 9] or on serial performance [10]. This work focuses on optimising the communications of one parallel implementation of the LBM.

The remainder of this introduction will describe the problem and a solution. We look at the original implementation of the communications in an existing LBM code, and describe how these were modified to improve parallel performance. Section 3 describes the work done to ensure the correctness of the reduced halo-swapping code. Finally we present detailed benchmark results of the modified code to quantify the performance improvement, and review the effect of the optimisation on the maintainability and readability of the code.


1.1 Theory of Lattice Boltzmann Method

1.1.1 Lattice Gas Cellular Automata Methods

The lattice Boltzmann algorithm is based on the lattice gas cellular automata (LGCA) method [11], in which space and time are discretised and the fluid is represented by individual particles on a hexagonal lattice. These particles move between lattice sites each time-step according to their velocity vector $c_{i\alpha}$, where $i = 1 \ldots 6$ labels the six lattice velocities and $\alpha = x, y$ are the physical dimensions. The discrete directions are shown in figure 1.1.

Figure 1.1: A hexagonal lattice showing the six unique velocity vectors that a particle can have.

The movement of a particle is described by,

$$x(t + \Delta t) = x(t) + c_i \Delta t, \tag{1.1.1}$$

where the time step ($\Delta t$) is usually set to unity for convenience.

Discretising time means propagation and collisions can be split into two separate phases, since collisions can only occur on a lattice site. The collisions must obey conservation laws from physics, namely:

• total momentum,

• mass - the number of particles in the system is constant.

These are necessary, but not sufficient conditions for the simulation of the Navier-Stokes equations. A further requirement of a lattice gas model is to preserve rotational invariance, which leads to the need for a hexagonal lattice.

Macroscopic physical quantities (density and momentum) are extracted from the simulation by considering the number of particles at each lattice site, and averaging over area (lattice sites) and time to minimise the statistical noise.

The LGCA method suffers from statistical noise, which requires averaging over multiple time-steps to obtain accurate measurements. Also the requirement for a hexagonal grid makes it difficult to extend to 3-dimensions, and requires more lattice sites to be stored and processed [12].

1.1.2 Lattice Boltzmann Method

The Boltzmann equation,

$$f_i(\vec{r} + \vec{c}_i, t + 1) - f_i(\vec{r}, t) = \omega \left( f_i^{eq}(\vec{r}, t) - f_i(\vec{r}, t) \right) \tag{1.1.2}$$

is a statistical description of one particle in a fluid. The left-hand side of equation 1.1.2 represents the propagation of particles through space by their velocity vectors ($\vec{c}_i$); the right-hand side describes the system’s movement towards equilibrium due to collisions. If there were only one particle in the fluid the left-hand side would simply keep the particle moving in a constant direction.
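The relaxation on the right-hand side of equation 1.1.2 can be sketched for a single lattice site as follows. This is an illustrative fragment rather than code from Ludwig: the function name relax_site is hypothetical, and the equilibrium distribution f_eq is assumed to have been computed by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Relax the nvel densities of one lattice site towards equilibrium:
 *   f_i <- f_i + omega * (f_eq_i - f_i)
 * which is the right-hand side of equation 1.1.2.
 * Hypothetical interface: f_eq is assumed to be supplied by the caller. */
void relax_site(double *f, const double *f_eq, size_t nvel, double omega)
{
    assert(f != NULL && f_eq != NULL);
    for (size_t i = 0; i < nvel; i++) {
        f[i] += omega * (f_eq[i] - f[i]);
    }
}
```

With omega = 1 each density is replaced by its equilibrium value in a single step; smaller values of omega move the site only part of the way towards equilibrium.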

The lattice Boltzmann method (LBM) replaces the boolean concept of particles in LGCA by a density distribution $f_i(x, t)$ at each point on the lattice which propagates between adjacent lattice points in an analogous fashion to particles. This is desirable because it eliminates the statistical noise problems of LGCA. The LBM requires a square/cubic lattice, which eases implementation especially in 3-dimensions.

Physical quantities are extracted from the LB model using the density function,

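$$\text{(density)} \quad \rho(x, t) = \sum_i f_i(x, t), \tag{1.1.3}$$

$$\text{(momentum)} \quad \rho_\alpha(x, t) = \sum_i f_i(x, t)\, c_{i\alpha}, \tag{1.1.4}$$

$$\text{(stress)} \quad \pi_{\alpha\beta}(x, t) = \sum_i f_i(x, t)\, c_{i\alpha} c_{i\beta}, \tag{1.1.5}$$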

where α, β are physical dimensions (x, y, z).
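The three sums in equations 1.1.3 to 1.1.5 can be written out for one lattice site as below. This is a sketch under stated assumptions, not Ludwig's implementation: it uses the 15-velocity set given by the cv array in section 2.2, and the function names site_density, site_momentum and site_stress are hypothetical.

```c
#include <assert.h>

#define NVEL 15

/* D3Q15 velocity set, in the ordering of the cv array in section 2.2. */
static const int cv[NVEL][3] = {
    { 0,  0,  0},
    { 1,  1,  1}, { 1,  1, -1}, { 1,  0,  0},
    { 1, -1,  1}, { 1, -1, -1}, { 0,  1,  0},
    { 0,  0,  1}, { 0,  0, -1}, { 0, -1,  0},
    {-1,  1,  1}, {-1,  1, -1}, {-1,  0,  0},
    {-1, -1,  1}, {-1, -1, -1}
};

/* Equation 1.1.3: density is the sum of the densities f_i. */
double site_density(const double f[NVEL])
{
    double rho = 0.0;
    for (int i = 0; i < NVEL; i++) rho += f[i];
    return rho;
}

/* Equation 1.1.4: momentum component alpha (0 = x, 1 = y, 2 = z). */
double site_momentum(const double f[NVEL], int alpha)
{
    double m = 0.0;
    assert(alpha >= 0 && alpha < 3);
    for (int i = 0; i < NVEL; i++) m += f[i] * cv[i][alpha];
    return m;
}

/* Equation 1.1.5: stress component (alpha, beta). */
double site_stress(const double f[NVEL], int alpha, int beta)
{
    double s = 0.0;
    assert(alpha >= 0 && alpha < 3 && beta >= 0 && beta < 3);
    for (int i = 0; i < NVEL; i++) s += f[i] * cv[i][alpha] * cv[i][beta];
    return s;
}
```

For a site where every density is equal the momentum vanishes, because the velocity set is symmetric about the origin.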

Different lattice Boltzmann models are classified using the DnQm notation, where n is the number of dimensions and m the number of discrete velocities allowed. The discrete velocities allowed at each lattice site in the D2Q9 and D3Q19 models are shown in figure 1.2.

Figure 1.2: Discrete velocities in the D3Q19 and D2Q9 models on cubic/square lattice. Note there is always a zero velocity, corresponding to the particle remaining at the current site. Image from: [1]

1.2 Implementation Details - The Ludwig Code

This study will concentrate on an existing implementation of the lattice Boltzmann algorithm, developed by the University of Edinburgh, called Ludwig [13]. Ludwig was written in C and uses the Message-Passing Interface for parallel communications. It was developed for the study of complex binary fluids and fluid suspensions. Currently Ludwig supports two models, D3Q15 and D3Q19. However it was intentionally written in a very modular fashion so it could be extended to larger numbers of velocities in the future. The densities are stored as double-precision floating point numbers. This requires extreme amounts of memory and CPU resources for a typical simulation.

A production run of Ludwig would typically consider a total system size of approximately 1024 × 512 × 512. Simply storing a lattice this size requires at least,

$$19 \times 8 \times 1024 \times 512^2 = 38\ \text{GBytes per fluid}, \tag{1.2.1}$$

using the D3Q19 model, and assuming a double-precision float is 8 bytes.
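The arithmetic in equation 1.2.1 can be reproduced with a small helper. This is only a worked example: the function name lattice_bytes is hypothetical, halo sites are ignored, and a double is assumed to occupy 8 bytes as stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store the densities of one fluid on an
 * nx * ny * nz lattice with nvel velocities per site.
 * Assumes 8-byte doubles and ignores halo sites. */
uint64_t lattice_bytes(uint64_t nx, uint64_t ny, uint64_t nz, uint64_t nvel)
{
    const uint64_t bytes_per_double = 8;
    assert(nx > 0 && ny > 0 && nz > 0 && nvel > 0);
    return nvel * bytes_per_double * nx * ny * nz;
}
```

For lattice_bytes(1024, 512, 512, 19) the result is 40,802,189,312 bytes, which is exactly 38 × 2^30 bytes, i.e. the 38 GBytes of equation 1.2.1.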

Furthermore, each iteration will require all $1024 \times 512^2$ grid points to be processed (propagation and collisions). Consequently the algorithm requires much more CPU time and memory than a modern sequential computer can provide.

The lattice Boltzmann method is easily parallelised, as it relies on only local interactions. The spatial domain can be split into equally sized subdomains which are distributed across several processors. At the boundaries between these subdomains the values at the lattice points are swapped, in what is known as halo-swapping. A 2-dimensional demonstration of halo-swapping is shown in figure 1.3.

Figure 1.3: Halo-swapping between 4 subdomains in 2-dimensions. To generalise to 3-dimensions add halo-swapping in the z-direction, with each subdomain as in figure 1.4. The blue (darker) regions are internal lattice sites, the green (lighter) regions are halo sites.

With the example above on 64 processors the subdomain size is 256 × 128 × 128, giving reduced calculation time and a memory requirement of only about 608 MBytes per processor for the lattice.

However at the end of every iteration, each processor will have to communicate a 2-dimensional plane of 19 (or 15 in the D3Q15 case) velocities to each of its adjacent processors. In the above example this leads to a data transfer of approximately,

$$(512^2 \times 2 + 1024 \times 512 \times 4) \times 19 \times 8 = 380\ \text{MBytes per fluid}, \tag{1.2.2}$$

being both sent and received from each processor every iteration.
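The face-area calculation behind equation 1.2.2 can be sketched in the same way. The function name face_halo_bytes is hypothetical; the sketch counts only the six faces of a cuboid of the given dimensions, assumes 8-byte doubles, and ignores the extra sites on halo edges and corners.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of density data (one fluid) on the six faces of an
 * nx * ny * nz cuboid, with nvel velocities per site.
 * Assumes 8-byte doubles; halo edges and corners are not counted. */
uint64_t face_halo_bytes(uint64_t nx, uint64_t ny, uint64_t nz, uint64_t nvel)
{
    const uint64_t bytes_per_double = 8;
    /* two faces perpendicular to each of the x, y and z axes */
    uint64_t face_sites = 2 * (ny * nz + nx * nz + nx * ny);
    assert(nvel > 0);
    return face_sites * nvel * bytes_per_double;
}
```

Evaluated with the dimensions that appear in equation 1.2.2, face_halo_bytes(1024, 512, 512, 19) gives 398,458,880 bytes, i.e. 380 MBytes. Applying the same function to a single 256 × 128 × 128 subdomain gives 24,903,680 bytes, a little under 24 MBytes.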

Figure 1.4: A lattice subdomain on one processor. The internal blue (darker) region represents the real lattice, the surrounding green (lighter) region the halo sites. Note the halos closest to the reader have been omitted for clarity.

1.2.1 Halo-swapping Communication Overheads

The time taken to transfer a message is,

$$T = t_s + t_w m, \tag{1.2.3}$$

where $t_s$ is the startup cost (latency), $t_w$ the transfer time per unit of data (the inverse of the bandwidth, in s/MByte) and $m$ is the size of the message (MBytes).

The latency ($t_s$) is independent of the size of the message being transferred, and is specific to the particular MPI library and network in use. Most parallel codes are affected more by the latency of the network than by its bandwidth, as the messages are normally small. Mathematically,

$$t_s \geq t_w m, \quad \forall\, m < M, \tag{1.2.4}$$

where M is the threshold value after which the message can be considered “large”.
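The cost model of equations 1.2.3 and 1.2.4 can be written as three short functions. This is a sketch under the assumption that $t_w$ is expressed as a transfer time per MByte, so that the product $t_w m$ is a time; the function names are hypothetical and any numbers passed to them are illustrative rather than measurements.

```c
#include <assert.h>

/* Equation 1.2.3: time to transfer a message of m MBytes.
 * ts is the latency (s); tw is the transfer time per MByte (s/MByte). */
double message_time(double ts, double tw, double m)
{
    assert(ts >= 0.0 && tw > 0.0 && m >= 0.0);
    return ts + tw * m;
}

/* Message size M at which the latency and transfer terms are equal
 * (ts = tw * M). Below M the latency dominates, as in equation 1.2.4. */
double latency_threshold(double ts, double tw)
{
    assert(ts >= 0.0 && tw > 0.0);
    return ts / tw;
}

/* Bandwidth as seen by a timing test: message size divided by the
 * total time taken to send it. */
double effective_bandwidth(double ts, double tw, double m)
{
    assert(m > 0.0);
    return m / message_time(ts, tw, m);
}
```

As m grows, effective_bandwidth approaches 1/tw from below, which is why a measured bandwidth curve flattens out for large messages.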

The latency vs. bandwidth communication costs can be demonstrated by timing how long it takes to send and receive messages of increasing size. Figure 1.5 shows a graph of bandwidth¹ against the size of the message. The measured bandwidth is low at small message sizes because the time to send each message is dominated by the latency. As the message size is increased the relative effect of the latency is reduced: for messages larger than 0.5 MBytes, the bandwidth of the interconnect is the limiting factor.

[Plot: bandwidth (MBytes/s), axis range 0 to 160, against message size (MBytes), axis range 0 to 4.]

Figure 1.5: An example of bandwidth vs. message size on the Bluegene/L system. Data generated using a “ping pong” test written by the EPCC [2]

1.2.2 Review of Existing Optimisations

There are a number of academic, commercial and free implementations of the lattice Boltzmann method, which run on both serial [1, 14, 15, 16, 17] and parallel machines [13, 18, 19, 20, 21].

The majority of codes are written for the study of physical phenomena, so relatively little is published about the actual implementations or performance optimisations. An exception is the National Institute of Standards and Technology (NIST), who have developed some optimisations to vastly reduce the required memory and CPU resources under certain restrictive circumstances.

They represent the lattice as an array of pointers, so empty/inactive lattice sites require only memory equal to the size of a pointer. By setting ω = 1 in the Boltzmann equation (equation 1.1.2) the m densities (where m is from D3Qm) at each lattice site are summed to give just one number. This optimisation would reduce the memory and communication requirements of Ludwig by a factor of m (a huge benefit); however, setting ω = 1 sets the viscosities of each fluid to one fixed value, and so it is unsuitable for codes like Ludwig which are designed to handle binary (or more) fluid mixtures. Also, using pointers for the lattice may create cache performance problems, and certainly makes halo-swapping very difficult as data is no longer guaranteed to be contiguous in memory. This optimisation is probably targeted at serial, desktop machines.

¹ Bandwidth is calculated by dividing the message size in megabytes by the time it took to send the message.

1.3 Purpose of Study

The aim of this project is to reduce the amount of data that is transferred between processors each iteration. This is achieved by sending only the velocity densities which will actually propagate into the neighbouring subdomains. From now on we will refer to this as “reduced halo-swapping”. The second objective is to analyse the effectiveness of this optimisation on improving performance, and consider the effect it has on the maintainability and readability of the program code.

Chapter 2

Design of Code

2.1 Requirements

The new code will perform reduced halo-swapping, which should improve performance since less data is transferred each iteration. It must give the same answers as the original code (see section 3). We considered the following design requirements important.

• Usability

– it should be possible to revert to full halo-swapping at runtime (some physical systems involving collisions with solid suspension parts require all the velocity density vectors, and it makes correctness checking easier);

– the optimisations must work for all models (d3q**), including models which may be added in the future.

• Performance

– the reduced halo-swapping mode should improve performance;

– the full halo-swapping mode of the new code should not significantly reduce the performance, compared to the original code.

• Code readability and maintainability


– the structure of the original code should not be changed;

– the optimisation should not add a large amount of code to the original.

We will now describe the structure of the existing code, and discuss how the above requirements can be satisfied by the design of the reduced halo-swapping code. The next section (2.2) explains how derived datatypes are used to send both full and reduced halos. Section 2.3 outlines the development strategy for the reduced version.

2.2 Halo-Swapping Implementation - Derived Datatypes

The Ludwig implementation of the lattice follows a hierarchical design. It is necessary to understand this before modifying any halo exchange code.

Each lattice site contains NVEL velocity densities per fluid, and by default there are two fluids f and g. The site is represented as a C-struct,

    typedef struct {
      double f[NVEL], g[NVEL];
    } Site;

where each element of the f and g arrays propagates in the direction given by cv (in d3qNVEL.c),

    const int cv[NVEL][3] = {{ 0,  0,  0},
                             { 1,  1,  1}, { 1,  1, -1}, { 1,  0,  0},
                             { 1, -1,  1}, { 1, -1, -1}, { 0,  1,  0},
                             { 0,  0,  1}, { 0,  0, -1}, { 0, -1,  0},
                             {-1,  1,  1}, {-1,  1, -1}, {-1,  0,  0},
                             {-1, -1,  1}, {-1, -1, -1}};

The Site construct is packed into one MPI_Datatype to make it easier to send,

MPI_Type_contiguous(sizeof(Site), MPI_BYTE, &DT_Site);

Ideally Site would be stored as a single array of size NVEL × (number of fluids), as this would enable more fluids to be added without recompiling the code. However this modification is beyond the scope of this project, and does not affect the optimisation. If “Site” is changed to an array at some future date, this optimisation will still work.

The 3-dimensional lattice is stored as a 1-dimensional contiguous array of sites, of length $n_x \times n_y \times n_z$, where $n_\alpha$ is the size of the subdomain in the α direction (including halos). The halo sites in each dimension will therefore be evenly spaced throughout the entire lattice. All loops over the lattice follow the structure,

    for (i = 0; i <= n[X] + 1; i++) {
      for (j = 0; j <= n[Y] + 1; j++) {
        for (k = 0; k <= n[Z] + 1; k++) {
          ...
        }
      }
    }

and so z is the fastest moving dimension (x the slowest). The halos transferred in the x-direction are contiguous in memory. The y and z direction halos require MPI_Type_vector,

    MPI_Type_contiguous(ny*nz, DT_Site, &DT_plane_YZ);
    MPI_Type_vector(nx, nz, ny*nz, DT_Site, &DT_plane_XZ);
    MPI_Type_vector(nx*ny, 1, nz, DT_Site, &DT_plane_XY);

From the above it is clear that the optimisation requires six versions of DT_Site - one for each direction perpendicular to the face of a cube. Each of these will send only the elements of f and g which should propagate in that direction.

For cv above, the parts of site[i].f which will be sent in each direction are shown in figure 2.1.
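The selection of velocities for each direction follows directly from cv: a density is sent across a face if its velocity has a component pointing through that face. The sketch below illustrates this for the 15-velocity cv listed above. The function name select_velocities is hypothetical and is not part of Ludwig.

```c
#include <assert.h>

#define NVEL 15

/* D3Q15 velocity set, as listed for cv above. */
static const int cv[NVEL][3] = {
    { 0,  0,  0},
    { 1,  1,  1}, { 1,  1, -1}, { 1,  0,  0},
    { 1, -1,  1}, { 1, -1, -1}, { 0,  1,  0},
    { 0,  0,  1}, { 0,  0, -1}, { 0, -1,  0},
    {-1,  1,  1}, {-1,  1, -1}, {-1,  0,  0},
    {-1, -1,  1}, {-1, -1, -1}
};

/* Set mask[p] to 1 if velocity p has component `sign` (+1 or -1) along
 * dimension `dim` (0 = x, 1 = y, 2 = z), otherwise 0.
 * Returns the number of velocities selected. */
int select_velocities(int dim, int sign, int mask[NVEL])
{
    int count = 0;
    assert(dim >= 0 && dim < 3);
    assert(sign == 1 || sign == -1);
    for (int p = 0; p < NVEL; p++) {
        mask[p] = (cv[p][dim] == sign);
        count += mask[p];
    }
    return count;
}
```

For the forward x-direction this selects indices 1 to 5, so 5 of the 15 densities of each fluid need to be sent instead of all 15.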

The two main options for describing figure 2.1 in MPI are MPI_Type_indexed and MPI_Type_struct. The former would have been simpler as it works with indices and not displacements in bytes. However it is impossible in MPI-1.2 to specify padding at the beginning and end of the indexed datatype (which is required because the datatype is going to be used to make more datatypes). With MPI_Type_struct, the pseudo-datatypes MPI_LB and MPI_UB can be used to mark the lower and upper bounds of the memory block.

The code for two of the six datatypes describing figure 2.1 is,

    int xcount = 3;
    int xblocklens[] = {1, 5, 1};
    /* byte displacements: lower bound, first density sent, upper bound */
    MPI_Aint xdisp_fwd[] = {0, 1*8, 15*8};
    MPI_Aint xdisp_bwd[] = {0, 10*8, 15*8};
    MPI_Datatype xtypes[] = {MPI_LB, MPI_DOUBLE, MPI_UB};
    MPI_Type_struct(xcount, xblocklens, xdisp_fwd, xtypes, &DT_Site_xfwd);
    MPI_Type_struct(xcount, xblocklens, xdisp_bwd, xtypes, &DT_Site_xbwd);

[Figure 2.1 panels: (a) The x-directions; (b) The y-directions; (c) The z-directions]

Figure 2.1: The parts of the array site[i].f (and of site[i].g) which need to be transferred to neighbouring processors in the x, y and z directions.

These cannot simply be hardcoded into model.c, as they will change depending on the choice of model. It therefore makes sense to describe the datatypes in d3qNVEL.c; however, since there are currently no executable statements in these files, and the number of fluids is not known in d3q**.c, the types are actually created in model.c.

2.3 Implementation Stages

To reduce the risk of breaking the program, or of harming performance, the optimised code was written in a set of phases. Each phase takes the code closer to the final product.

2.3.1 Implement Full Halo-swapping using MPI_Type_struct

As MPI_Type_struct is to be used to send the reduced version of “Site”, it is necessary to check this doesn’t impede performance. The full halo-swapping was re-implemented using MPI_Type_struct. This required a version of DT_Site for each direction (init_halo() in model.c),

    MPI_Type_struct(xcount, xblocklens, xdisp_fwd, xtypes, &DT_Site_xfwd);
    MPI_Type_struct(xcount, xblocklens, xdisp_bwd, xtypes, &DT_Site_xbwd);
    MPI_Type_struct(ycount, yblocklens, ydisp_fwd, ytypes, &DT_Site_yfwd);
    MPI_Type_struct(ycount, yblocklens, ydisp_bwd, ytypes, &DT_Site_ybwd);
    MPI_Type_struct(zcount, zblocklens, zdisp_fwd, ztypes, &DT_Site_zfwd);
    MPI_Type_struct(zcount, zblocklens, zdisp_bwd, ztypes, &DT_Site_zbwd);

where the count, displacement and block length arguments were defined in the d3qNVEL file,

    MPI_Datatype xtypes[xcount] = {MPI_DOUBLE};
    /* all of f and g */
    int xblocklens[xcount] = {2*NVEL};
    MPI_Aint xdisp_fwd[xcount] = {0};
    MPI_Aint xdisp_bwd[xcount] = {0};

These are used to make the planes corresponding to halos,

    MPI_Type_vector(nx*ny, 1, nz, DT_Site_zfwd, &DT_plane_XY_fwd);
    MPI_Type_commit(&DT_plane_XY_fwd);
    MPI_Type_vector(nx*ny, 1, nz, DT_Site_zbwd, &DT_plane_XY_bwd);
    MPI_Type_commit(&DT_plane_XY_bwd);

Notice the vector parameters are unchanged. This gives twice as many DT_plane datatypes as before (as there is a forward and backwards one in each dimension). Updating halo_site() to use the new types was trivial, as only “_fwd” or “_bwd” needed to be appended to the existing datatype parameters,

    MPI_Issend(&site[xfac].f[0], 1, DT_plane_YZ_bwd,
               cart_neighb(BACKWARD,X), TAG_BWD, cart_comm(), &request[0]);
    MPI_Irecv(&site[(N[X]+1)*xfac].f[0], 1, DT_plane_YZ_bwd,
              cart_neighb(FORWARD,X), TAG_BWD, cart_comm(), &request[1]);
    MPI_Issend(&site[N[X]*xfac].f[0], 1, DT_plane_YZ_fwd,
               cart_neighb(FORWARD,X), TAG_FWD, cart_comm(), &request[2]);
    MPI_Irecv(&site[0].f[0], 1, DT_plane_YZ_fwd,
              cart_neighb(BACKWARD,X), TAG_FWD, cart_comm(), &request[3]);

[Plot: speedup relative to 2 processing cores (1 processor), axis range 0 to 1400, against number of processors, axis range 0 to 2000. Series: HECToR Original; HECToR Full with MPI_Type_struct.]

Figure 2.2: Performance of MPI_Type_struct for full halo-swapping. There is no noticeable difference in performance.

2.3.2 Hard-code the Reduced Datatypes

The reduced datatypes can be implemented quite quickly by setting the count, displacement and block lengths as appropriate for the model. This allows us to select and test the correct values, which makes debugging the final code easier. It also confirms that it is possible to use MPI_Type_struct for this problem.

Once the code passed all tests we proceeded to the next phase.

2.3.3 Implement the Reduced Halo-swapping Cleanly

Referring back to the requirements (section 2.1) the optimisation must work with different velocity models, therefore the above hard-coded solution is not suitable. As the velocity vectors (cv) are defined in the d3q**.c file, it was decided that the description of cv should reside in the same file. For the cv shown in figure 2.1,

    int xblocklens_cv[xcountcv] = {5};
    int xdisp_fwd_cv[xcountcv] = {1};
    int xdisp_bwd_cv[xcountcv] = {10};

    int yblocklens_cv[ycountcv] = {2, 1, 2};
    int ydisp_fwd_cv[ycountcv] = {1, 6, 10};
    int ydisp_bwd_cv[ycountcv] = {4, 9, 13};

    ...

xcountcv is defined in the d3q**.h header file.

Although the current code specifically defines two fluids, it may be preferred in the future to use more. A new constant, ndist, was introduced in model.h which defines the number of fluids, and model.c uses this to keep the reduced halo-swapping code flexible.

In model.c three new functions were created to expand the description of cv,

• getAintDisp(int indexDisp[ ], MPI_Aint dispArray[ ], int count):

– converts the d3q**.c index displacement description of cv into a displacement in bytes (required for MPI_Type_struct);

– expands to length countcv × ndist + 2, where the first and last displacements are the lower and upper bounds of the whole structure (0 and sizeof(Site)).

• getblocklens(int blocklens_cv[ ], int blocklens[ ], int count):

– converts the d3q**.c blocklens_cv to an array of length ndist × length(blocklens_cv) + 2. The +2 is to accommodate the lower and upper bounds.

• gettypes(MPI_Datatype types[ ], int count):

– outputs an array of countcv × ndist + 2 types. The first and last types are always MPI_LB and MPI_UB respectively. The other types are MPI_DOUBLE.

A fourth function getDerivedDTParms(...) was introduced to wrap the above together into one call.
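The expansion performed by these helper functions can be sketched without MPI as follows. This is not the Ludwig source: the function name expand_description is hypothetical, long stands in for MPI_Aint, and the site is assumed to consist of ndist contiguous arrays of NVEL doubles with no padding, so that the upper bound equals sizeof(Site).

```c
#include <assert.h>
#include <stddef.h>

#define NVEL 15

/* Expand a per-fluid index description (countcv blocks given by disp_cv
 * and blocklens_cv) into byte displacements and block lengths covering
 * ndist fluids, with a lower-bound entry first and an upper-bound entry
 * last. The output arrays must hold countcv * ndist + 2 elements.
 * Returns the number of entries written.
 * Hypothetical helper: long stands in for MPI_Aint. */
int expand_description(const int disp_cv[], const int blocklens_cv[],
                       int countcv, int ndist,
                       long disp[], int blocklens[])
{
    int n = 0;
    assert(countcv > 0 && ndist > 0);

    disp[n] = 0;                 /* lower bound of the site */
    blocklens[n++] = 1;

    for (int d = 0; d < ndist; d++) {        /* fluid d starts at d*NVEL */
        for (int i = 0; i < countcv; i++) {
            disp[n] = (long) ((d * NVEL + disp_cv[i]) * sizeof(double));
            blocklens[n++] = blocklens_cv[i];
        }
    }

    disp[n] = (long) (ndist * NVEL * sizeof(double));  /* upper bound */
    blocklens[n++] = 1;
    return n;
}
```

For the forward y-direction description above (displacements 1, 6, 10 and block lengths 2, 1, 2) with two fluids, the function writes 3 × 2 + 2 = 8 entries.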

To facilitate switching between reduced and full halo-swapping at runtime, a new function int use_reduced_halos() was added to control.c. This checks for the reduced_halos yes/no setting in the input file. The default behaviour is full halo-swapping. When full halo-swapping is used init_halo() uses the derived datatypes code from the original Ludwig, and these are copied to use the same names as the reduced code,

    DT_plane_YZ_fwd = DT_plane_YZ;
    DT_plane_YZ_bwd = DT_plane_YZ;
    DT_plane_XZ_fwd = DT_plane_XZ;
    DT_plane_XZ_bwd = DT_plane_XZ;
    DT_plane_XY_fwd = DT_plane_XY;
    DT_plane_XY_bwd = DT_plane_XY;

This approach means there is no requirement for a conditional in halo_site(), which would have added a lot more code and possibly affected performance (as halo_site() runs every iteration, whereas init_site() runs only once).

Chapter 3

Testing of Reduced Halo-swapping

It is critically important to test that the reduced code gives exactly the same results as the original Ludwig implementation. Due to the size and complexity of the code, a multistage testing process was adopted. In particular unit testing was used to more easily identify the source of error.

3.1 Modifications to Existing Tests

There are nine test codes that come with the Ludwig source. The tests which are of most interest to this work are,

3.1.1 test_halo.c

test_halo_null()

The test_halo_null() test ensures that all parts of the halo regions are being overwritten during the halo transfer, and that values from the halos are not “leaking” into the interior lattice. The code sets all the densities in the interior lattice sites to zero, and all halo sites to one. Then a halo-swap is done via a single call to halo_site(). Finally the test checks that all velocities at every site (halos and interior) are zero.
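The logic of this test can be illustrated on a toy problem. The sketch below is a deliberately simplified stand-in, not the Ludwig test: it uses a single-process, periodic 1-dimensional lattice with one value per site and no MPI, and the names halo_swap_1d and check_halo_null_1d are hypothetical.

```c
#include <assert.h>

#define N 8   /* interior sites are 1..N; indices 0 and N + 1 are halos */

/* Periodic halo swap on a single-process 1D lattice (no MPI): each halo
 * site is overwritten with the interior site it mirrors. */
void halo_swap_1d(double site[N + 2])
{
    site[0]     = site[N];
    site[N + 1] = site[1];
}

/* Same idea as test_halo_null(): set the interior to zero and the halos
 * to one, swap once, then require every site to be zero. */
void check_halo_null_1d(void)
{
    double site[N + 2];

    for (int i = 1; i <= N; i++) site[i] = 0.0;
    site[0] = 1.0;
    site[N + 1] = 1.0;

    halo_swap_1d(site);

    for (int i = 0; i <= N + 1; i++) {
        /* a surviving 1.0 means a halo was not overwritten or leaked */
        assert(site[i] == 0.0);
    }
}
```

If the swap failed to overwrite a halo, or copied a halo value into the interior, one of the assertions would fail.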


test_halo()

This is a more rigorous test than test_halo_null(). The entire lattice is set to zero, except boundary lattice sites which are set to a particular number. After halo-swapping it is checked that each halo site’s density matches the value from the corresponding boundary.

3.1.2 test_model.c

test_halo_swap()

This code ensures that the densities in each site finish in the correct position after halo-swapping. In each internal lattice site the first three densities are set to the displacement of the site in the x, y and z dimensions respectively. The remaining densities at each site are set to their index (i.e. the 5th density at each site is set to 5). After a single halo-swap, the code verifies that each velocity matches the above.

3.1.3 test_prop.c

test_velocity()

Each velocity in the internal lattice is set to its index. A halo-swap and a propagation step are executed (single calls to halo_site() and propagate()). The test code then checks that each velocity at each internal site matches its index. Note: if halo-swapping were not functioning correctly then this test would fail, as after propagation there would be differences on the boundary sites of the internal distribution.

test_source_destination()

This test checks that halos are sent in the correct direction, and are received by the intended neighbouring process. Each internal lattice site is set to a unique value (all velocities at one site are set to this same value). A single halo-swap and propagation step is executed. The code then verifies that each internal site contains the expected data.

Of these three test files, only test_halo and test_model need to be modified to work with reduced halo-swapping. The propagation tests only check internal lattice sites after the propagation phase - since the reduced halo-swapping code sends the same velocities that propagate into the internal sites the original test_prop will still work.

3.1.4 test_halo

The test_halo_null() routine must only check the velocities at each halo site which will have been updated by the halo-swap. The new version should interpret the reduced halo data from one of d3q15.c or d3q19.c to produce a set of “mask” arrays, one for each direction.

The following was used to make six “mask” arrays (forward and backward in each dimension),

if (use_reduced_halos()) {xfwd = calloc(NVEL, sizeof(int));

for (i = 0; i < xcountcv; i++) {for(j = 0; j < xdisp_fwd_cv[i]; j++) {

for(k = 0; k < xblocklens_cv[i]; k++) {xfwd[xdisp_fwd_cv[i]+k] = 1;

}}

}...

}

The elements of the “mask” arrays are 1 when the corresponding velocity is transferred, otherwise 0. For example, for the D3Q15 x-forward communication given in figure 2.1a, xfwd would be,

    int xfwd[] = {0, 1, 1, 1, 1, 1, 0,
                  0, 0, 0, 0, 0, 0, 0, 0};

When checking each lattice site after the halo exchange the mask is used,

    for (p = 0; p < NVEL; p++) {
      if (xfwd[p]) {
        /* This velocity should have been transferred... */
        test_assert(...);
      }
    }
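The mask technique can be sketched in a self-contained form as follows. This is only an illustration: build_mask, count_mismatches and their parameters are hypothetical names rather than Ludwig routines, and the block description used in the usage note (one block of five velocities starting at index 1) is taken from the D3Q15 x-forward example above.

```c
#include <assert.h>

/* Fill mask[0..nvel-1] with 1 for every velocity covered by a block and
 * 0 elsewhere. Block i covers disp[i] .. disp[i] + blocklens[i] - 1.
 * Hypothetical helper for this sketch, not part of Ludwig. */
void build_mask(int *mask, int nvel, int count,
                const int *disp, const int *blocklens) {
  for (int p = 0; p < nvel; p++) mask[p] = 0;
  for (int i = 0; i < count; i++) {
    assert(disp[i] >= 0 && disp[i] + blocklens[i] <= nvel);
    for (int k = 0; k < blocklens[i]; k++) mask[disp[i] + k] = 1;
  }
}

/* Count the velocities that the mask marks as transferred but whose
 * value differs from the expected one. Velocities that are not
 * transferred are ignored, whatever they contain. */
int count_mismatches(const double *site, const double *expected,
                     const int *mask, int nvel) {
  int bad = 0;
  for (int p = 0; p < nvel; p++) {
    if (mask[p] && site[p] != expected[p]) bad++;
  }
  return bad;
}
```

Calling build_mask(mask, 15, 1, disp, blocklens) with disp = {1} and blocklens = {5} sets elements 1 to 5 of the mask, matching the xfwd array shown above. A halo site then passes the check when count_mismatches() returns 0.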


The other issue is that this test would fail on the corners of the lattice, because they are receiving from halo sites. This is explained by figure 3.1.

(a) y-direction (b) x-direction (c) After halo swapping

Figure 3.1: First halos are exchanged in the y-direction (3.1a). Then halos are exchanged in the x-direction - notice that the corner halo sites receive from a halo site, and not from a lattice site (3.1b). 3.1c shows the result after all the halos have been exchanged. The corner sites contain a hybrid of x and y halo-swapped data.

These corners are not a part of the calculation - they only exist because a non-cubic subdomain would lead to far more complex code - as such it is safe to ignore them,

    if (xfwd[p] && !on_corner(...)) {
      ...
    }

with on_corner() defined as,

    int on_corner(int x, int y, int z,
                  int mx, int my, int mz) {

      /* on the axes */
      if ( abs(x) + abs(y) == 0 ||
           abs(x) + abs(z) == 0 ||
           abs(y) + abs(z) == 0 )
      {
        return 1;
      }
      /* opposite the axes */
      ...
      /* the rest of the corners */
      ...
    }

3.1.5 test_model

A modified version of the test_halo_swap() routine was written for reduced swapping (test_reduced_halo_swap()) which only checks the sites involved in the transfer, and avoids corner sites, using the same techniques as for test_halo.c.

3.2 Comparison of Results with Original Code

As stated in section 2 (Design of Code) the new code (both reduced and full modes) should give exactly the same answers as the old code. The halo sites which are being omitted in the new code would never have propagated from the recipient process's halo buffer into the physical subdomain - therefore they are not a part of the calculation. For this reason we can expect the measured values in the output files to match exactly.

On each available machine (Ness, HECToR, HPCx, and Bluegene/L) the output files from the new code were compared to the original using the diff command. This test was conducted on 1, 2, 4, . . . MPI tasks up to the maximum possible on the architecture.

3.3 Physical Quantities

The physical system being simulated must obey conservation of total momentum. As the system starts in the rest state the final momentum should be zero, within small numerical errors. The numerical errors in double-precision are about 10^-14.

Similarly the mass should be conserved (no creation/destruction of fluid particles). If the halo swapping is incorrect then the total mass will usually be changed.
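A conservation check of this kind can be sketched as below. To keep the example short it uses a hypothetical one-dimensional, three-velocity model (rest, +x, -x) rather than D3Q15 or D3Q19, and the names total_mass_momentum and momentum_is_zero are illustrative, not Ludwig routines. The tolerance passed to momentum_is_zero() is also an assumption; the text above suggests something of order 10^-14 for a system at rest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1D three-velocity model: rest, +x, -x. */
enum { NVEL_DEMO = 3 };
static const int cv_demo[NVEL_DEMO] = {0, 1, -1};

/* Sum the mass and x-momentum over nsites lattice sites.
 * The distribution is stored as f[site * NVEL_DEMO + p]. */
void total_mass_momentum(const double *f, size_t nsites,
                         double *mass, double *momentum) {
  assert(f != NULL && mass != NULL && momentum != NULL);
  *mass = 0.0;
  *momentum = 0.0;
  for (size_t s = 0; s < nsites; s++) {
    for (int p = 0; p < NVEL_DEMO; p++) {
      double fp = f[s * NVEL_DEMO + p];
      *mass += fp;                  /* zeroth moment */
      *momentum += fp * cv_demo[p]; /* first moment */
    }
  }
}

/* Returns 1 if the momentum is zero to within the tolerance tol. */
int momentum_is_zero(double momentum, double tol) {
  return momentum <= tol && momentum >= -tol;
}
```

Comparing the total mass before and after a run, and testing the final momentum with momentum_is_zero(), gives a cheap global check that complements the per-site halo tests.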


3.4 Testing of Non-reduced mode

All the above tests were also applied to the full halo swapping mode of the new code. The old tests were sufficient to test this mode.


Chapter 4

Performance Results and Discussion

4.1 Methodology of Benchmarks

To determine the effectiveness of the optimisation on reducing overall execution time some tests were conducted on different HPC architectures. Both the full and reduced halo swapping modes of the new code were compared against the original code (which does only full halo swapping).

For the comparison of the reduced and original swapping modes, two different velocity vector models were used: D3Q15 and D3Q19. As far as we are concerned, the primary difference between these is the amount of data which has to be transferred each iteration. The full halo swapping mode was also benchmarked to examine any overheads which may have been introduced.

Each benchmark was run three times on 1, 2, 4, 8, . . . processors up until the maximum possible on the system. An average and standard deviation was used to graph speedup and efficiency for each code. The standard deviation of the average gives a measure of the precision, and can indicate bad data or that more runs are required. In practice the standard deviations were generally very small. The formula for the standard deviation is,

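\[ \sigma = \frac{1}{\sqrt{N}} \sqrt{\sum_{i=1}^{N} \left( t_i - \langle t \rangle \right)^2}, \tag{4.1.1} \]

where σ = standard deviation, N = number of runs, t_i = time of run i, ⟨t⟩ = average time.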


Parallel speedup and efficiency are defined,

\[ \text{speedup on } p \text{ processes} = S(p) = \frac{T_1}{T_p}, \tag{4.1.2} \]

\[ \text{efficiency on } p \text{ processes} = E(p) = \frac{S(p)}{p}, \tag{4.1.3} \]

where T_i is the time taken on i processes. However since the memory bandwidth on a dual-core processor is shared between both cores, running on just one core would give better performance at higher cost (both cores are still reserved and paid for). We chose to use both cores since that is what the vast majority of users will do. Also as each processor has a finite amount of memory, running on a small number of processors limits the size of the system that can be addressed. For these reasons the speedup and efficiency graphs are relative to four cores (HECToR), or two cores (Bluegene, HPCx).
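The relative measures can be sketched as follows. The speedup relative to a base run is taken as the ratio of the base time to the time on p processes, which reproduces the HECToR speedups quoted in section 4.3. Scaling the efficiency by p_base / p is an assumption made here so that perfect scaling gives 1. The function names are hypothetical.

```c
#include <assert.h>

/* Speedup on p processes relative to a base run taking t_base seconds. */
double relative_speedup(double t_base, double t_p) {
  assert(t_base > 0.0 && t_p > 0.0);
  return t_base / t_p;
}

/* Efficiency relative to a base run on p_base processes: 1.0 means the
 * run time fell in exact proportion to the increase in process count. */
double relative_efficiency(double t_base, int p_base, double t_p, int p) {
  assert(p_base > 0 && p > 0);
  return relative_speedup(t_base, t_p) * (double)p_base / (double)p;
}
```

For example, using the original-code D3Q15 times on HECToR (1673.416 s on 4 tasks, 1.498 s on 8192 tasks), relative_speedup() gives about 1117, and relative_efficiency() gives about 0.55.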

These benchmarks were run on three different HPC architectures: HPCx, Bluegene/L and the new HECToR service. As well as quantifying the effects of the optimisation on real world machines, this provides insight into the effect of different memory layouts, and interconnects on this type of optimisation. The relevant characteristics of these machines are described below.

4.2 Overview of Architectures

4.2.1 HECToR

The HECToR (High End Computing on Terascale Resources) service became operational towards the end of 2007, replacing the HPCx machine as the UK national supercomputing service. The system consists of 5664 AMD Opteron processors. These are dual-core and clocked at 2.8 GHz, giving over 11000 cores and 63.4 Tflops. Each core has two independent FPUs (addition and multiplication). There is 6 GBytes of memory available to each processor, and this is shared between the cores. The caches are not shared between cores.

The processors are connected by a 3-d mesh interconnect. The interconnect uses PowerPC 440 processors (the same type of processors are used for computation in Bluegene/L).


4.2.2 Bluegene/L

The Bluesky service at Edinburgh is a single Bluegene/L cabinet, containing 2048 processors (two per chip). Each processing chip is a PowerPC 440 running at 700 MHz. This relatively slow processor is chosen because it lowers power consumption and cooling costs: the heat produced by a processor is proportional to its clock speed cubed. The two cores on each chip can be used either in Co-processor mode, where one processor is responsible for calculation and another deals with message-passing communications, or the more commonly used Virtual-node mode where both cores do calculation and communication. Each processor has two FPUs. There is 512 MBytes of memory on each chip, shared between both cores.

The communications interconnect allows for a theoretical maximum of 175 MBytes/s (figure 1.5 shows the actual measured bandwidth which will include overheads) when using the torus network. There is a tree network for global operations which has a theoretical bandwidth of 350 MBytes/s.

4.2.3 HPCx

HPCx was the primary UK national computing service, until it was replaced by HECToR near the end of 2007. It consists of 160 IBM eServer 575 logical partitions (LPARs). Each LPAR contains 8 Dual-core modules (DCMs) giving 16 Power5 processors, and 32 GBytes of shared memory. The level 2 and 3 caches are shared between processors, however the level 1 cache is not.

The Power5 processor has support for simultaneous multi-threading (SMT) which allows two threads to run on one processor, essentially making each processor appear like two logical processors with a shared level 1 cache. SMT was not used in any of the benchmarks, as it was found to not benefit the performance of Ludwig.


4.3 Performance on HECToR

On HECToR Ludwig scales almost linearly up to about 2048 processes, given the size of the problem being parallelised. As the number of MPI tasks is increased the scaling steadily decreases.

The code demonstrates poorer scaling as the number of processing cores is increased (figures 4.2a, 4.3a). At 8192 processes there is a very noticeable dip in scalability. Figures 4.2b and 4.3b show a steady drop in efficiency as the number of processes increases. The smaller model (D3Q15) is always about 5% less efficient, but with both models it is uneconomical to run on more than about 2048 processes.

The reduced halo swapping code shows a substantial improvement to parallel scaling (figures 4.2a and 4.3a). The D3Q19 model benefits more from the optimisation - the reduced code speedup on 8192 processes using model D3Q19 is 1443 (compared with an original speedup of 1176), using the D3Q15 model 1309 (compared with an original speedup of 1117). The difference in efficiency between the original code and the reduced code is approximately 5% at 256 processes using both models. On 8192 processes this difference increases to approximately 10% (D3Q15) or 15% (D3Q19). The reduced halo-swapping code on 8192 MPI tasks is slightly more efficient than the full code on half that many processes.

Figure 4.1 shows a comparison of the run times for the original and reduced codes on HECToR.

The full halo-swapping mode of the modified code scales worse than the original code, in the D3Q15 case (figure 4.2a). The effect on efficiency is approximately 5% (figure 4.2b). Interestingly the scaling and efficiency when using the D3Q19 model full halo mode are unaffected.

[Figure 4.1: log-log plot of total run time (s) against number of MPI tasks. Series: HECToR Reduced Halos (D3Q15), HECToR Original Code (D3Q15), HECToR Reduced Halos (D3Q19), HECToR Original Code (D3Q19).]

Figure 4.1: The time taken by the original and reduced Ludwig codes on HECToR. This is an absolute measure of performance, which demonstrates how the improvement in run time from the reduced code is much more noticeable on a higher number of processes. The log scale emphasises scaling (straight implies perfect scaling). The full mode of the modified code has been omitted for clarity.

    P      T (Full) /s   Error    T (Reduced) /s   Error    T (Original) /s   Error
    4      1682.022      0.5      1671.379         0.7      1673.416          0.9
    8      855.615       1        848.997          0.9      851.129           0.6
    16     433.229       0.6      429.512          0.3      431.084           0.09
    32     219.042       0.08     214.627          0.03     217.444           0.1
    64     111.520       0.3      108.281          0.006    109.959           0.03
    128    57.384        0.4      55.0000          0.04     56.241            0.08
    256    28.778        0.08     27.350           0.009    27.9              0.1
    512    14.879        0.3      13.883           0.01     14.645            0.01
    1024   7.977         0.06     7.380            0.007    8.048             0.04
    2048   4.500         0.07     3.960            0.007    4.369             0.003
    4096   2.553         0.002    2.180            0.02     2.452             0.001
    8192   1.618         0.05     1.276            0.01     1.498             0.0005

Table 4.1: The absolute run times on HECToR, using D3Q15. T: time, P: processes.

[Figure 4.2 panels: (a) Strong scaling graph on HECToR (D3Q15), speedup relative to 4 MPI tasks against number of MPI tasks; (b) Efficiency graph on HECToR (D3Q15), efficiency relative to 4 MPI tasks against number of MPI tasks. Series: HECToR Full Halos, HECToR Reduced Halos, HECToR Original Code.]

Figure 4.2: Ludwig: 200 time-steps on a total system size 256^3 on HECToR, using the D3Q15 model. Each MPI task is run on one processing core (both cores in a processor are used). The inset graphs show a zoom in of the data for a smaller number of processes.

[Figure 4.3 panels: (a) Strong scaling graph on HECToR (D3Q19), speedup relative to 4 MPI tasks against number of MPI tasks; (b) Efficiency graph on HECToR (D3Q19), efficiency relative to 4 MPI tasks against number of MPI tasks.]

Figure 4.3: Ludwig: 200 time-steps on a total system size 256^3 on HECToR, using the D3Q19 model. Each MPI task is run on one processing core (both cores in a processor are used). The inset graphs show a zoom in of the data for a smaller number of processes.

4.4 Performance on Bluegene/L

The scalability on Bluegene/L is excellent, given the small lattice size (figures 4.4a, 4.5a). Both models have a small dip in parallel scaling at 256 processing cores (128 PowerPC 440 processors). There is a large decrease in scaling performance at 1024 MPI tasks, though this is understandable given the size of each subdomain is just 16 × 8 × 8 lattice sites. The efficiency (figures 4.4b, 4.5b) is extremely good, with superscaling apparent on 8 to 32 processes in the D3Q15 model. Due to the small problem size, on 8 processors each process requires around 34 MBytes (footnote 1). Recalling that Bluegene/L has a generous 4 MBytes of level 3 cache shared between two cores on a processor, it is very likely a cache effect which gives the superlinear speedup. Also it is important to note that 8 MPI tasks is the smallest number which will give a 3-dimensional cubic decomposition.

In contrast to the HECToR and HPCx benchmarks, the larger D3Q19 model performs worse on Bluegene/L. The efficiency of D3Q15 is about 10-20% better than D3Q19 for more than 8 processes. There is no superscaling when using the D3Q19 model. From all the above runs, it is uneconomical to use more than 512 MPI tasks on Bluegene/L.

Reduced halo-swapping gives a modest performance improvement using the D3Q15 model. However it does not make it any more appealing to run on 1024 processes (speedup of 344 vs. an original speedup of 337 - figure 4.4a). The performance improvement to the D3Q19 model is more significant, particularly on 512 to 1024 MPI tasks (figure 4.5a). The speedup of the reduced code on 1024 processes is 336, compared with an original of 316.

In both models, the reduced code's greatest benefit to efficiency is realised at 256 processes (figures 4.4b, 4.5b). Again contrary to HECToR and HPCx, the efficiencies of the original and reduced codes converge as the number of MPI tasks is increased above 512.

There is no measurable difference in efficiency or performance between the original code, and the new code's full halo-swapping mode.

Footnote 1: This value comes from the program output.

[Figure 4.4 panels: (a) Strong scaling graph on Bluegene/L (D3Q15), speedup relative to 2 MPI tasks against number of MPI tasks; (b) Efficiency graph on Bluegene/L (D3Q15), efficiency relative to 2 MPI tasks against number of MPI tasks. Series: BG/L Full Halos, BG/L Reduced Halos, BG/L Original.]

Figure 4.4: Ludwig: 200 time-steps on a total system size 96^3 on Bluegene/L, using the D3Q15 model. Each MPI task is run on one processing core (both cores in a processor are used).

[Figure 4.5 panels: (a) Strong scaling graph on Bluegene/L (D3Q19), speedup relative to 2 MPI tasks against number of MPI tasks; (b) Efficiency graph on Bluegene/L (D3Q19), efficiency relative to two processing cores against number of processors. Series include BG/L Original.]

Figure 4.5: Ludwig: 200 time-steps on a total system size 96^3 on Bluegene/L, using the D3Q19 model. Each MPI task is run on one processing core (both cores in a processor are used).

4.5 Performance on HPCx

Ludwig scales very well up to 512 processes on HPCx. However there is a large dip in scaling at 1024 processes (figures 4.6a and 4.7a). The efficiency graphs (figures 4.6b, 4.7b) show some interesting features. There is a superlinear peak in efficiency at 8 tasks - the number of DCMs in a logical partition (LPAR) - using both models, followed by a rapid drop in efficiency outside the LPAR. It is clear from this that the MPI library is using the shared memory inside each LPAR to communicate halos. Running on 16 tasks does not give a superlinear speedup, although this is still inside the LPAR. This is due to the sharing of level 2 and 3 caches between both Power5s in the DCM.

The D3Q19 model scales better than the D3Q15 model, particularly at 512 or more processes.

The insets in figures 4.6b and 4.7b show that reduced halo-swapping is actually detrimental to performance inside the LPAR. The overhead of using MPI_Type_struct (as opposed to the original, MPI_Type_contiguous) offsets the reduced halo size when the transfer bandwidth is very high. The reduced code benefits performance on more MPI tasks: in the D3Q19 model just outside the LPAR (greater than 16 tasks - figure 4.7b), and in the D3Q15 case after about 128 tasks (figure 4.6b). Both models scale very well up to 1024 MPI tasks with reduced halo-swapping (figures 4.6a and 4.7a).

Using reduced halo-swapping increases the efficiency by about 10% on 512 processes and almost 20% on 1024 processes, compared to full halo-swapping, using the D3Q19 model.

There is no significant drop in scaling or efficiency by using the modified code's full halo-swapping mode, as all differences lie within the calculated statistical uncertainties.

[Figure 4.6 panels: (a) Strong scaling graph on HPCx (D3Q15), speedup relative to 2 MPI tasks against number of MPI tasks; (b) Efficiency graph on HPCx (D3Q15), efficiency relative to 2 MPI tasks against number of MPI tasks. Series: HPCx Full Halos, HPCx Reduced Halos, HPCx Original.]

Figure 4.6: Ludwig: 200 time-steps on a total system size 256^3 on HPCx, using the D3Q15 model. Each MPI task is run on one processing core (both cores in a processor are used). The inset graphs show a zoom in of the data for a smaller number of processes.

[Figure 4.7 panels: (a) Strong scaling graph on HPCx (D3Q19), speedup relative to 2 MPI tasks against number of MPI tasks; (b) Efficiency graph on HPCx (D3Q19), efficiency relative to 2 MPI tasks against number of MPI tasks. Series include HPCx Original.]

Figure 4.7: Ludwig: 200 time-steps on a total system size 256^3 on HPCx, using the D3Q19 model. Each MPI task is run on one processing core (both cores in a processor are used). The inset graphs show a zoom in of the data for a smaller number of processes.

4.6 Large Problem Size

All the benchmarks so far have considered relatively small systems. As these systems are subdivided on more processes the overheads start to dominate over the computation. These benchmarks emphasise the effects of the optimisation, but the subdomain size on a large number of processes can become unrealistically small.

The following are the same benchmarks on a larger system, sized 1024 × 512^2 lattice sites. This is the same size of system used in production runs of Ludwig. This leads to a subdomain size of 32 × 16^2 on 8192 processes.

4.6.1 HECToR

The original Ludwig code scales very well up to 4096 processes with the problem size shown (figure 4.9a). As before the scaling suffers at 8192 processes.

The reduced halo-swapping code scales considerably better than the original, given the speedup is calculated relative to 128 MPI tasks. There is an increase in efficiency of about 5% (figure 4.9b) on 8192 processes. There is an anomalous dip in efficiency when using the reduced halo-swapping code on 1024 MPI tasks. Figure 4.8 and table 4.2 show the run times of the original and reduced version of Ludwig. Note that on 128 processes the reduced code performs slightly better. The improvement increases significantly to 6.7% on 8192 processes.

There is no determinable difference in efficiency or scaling between the original code and the new code's full mode.

    Processes   Time (reduced code) /s   Error   Time (original code) /s   Error   Percentage
    128         1152.347                 0.2     1142.610                  0.1     0.8%
    256         572.156                  0.2     580.246                   0.4     1.4%
    512         287.954                  0.5     293.256                   1       1.8%
    1024        151.045                  0.1     151.067                   0.2     0.01%
    2048        74.914                   0.005   77.788                    0.2     3.7%
    4096        40.609                   0.004   42.431                    0.005   4.3%
    8192        23.740                   0.003   25.449                    0.005   6.7%

Table 4.2: Run times of the original Ludwig and reduced halo-swapping versions. The final column shows the percentage improvement in the run time of the reduced halo-swapping code. This highlights the general trend that reduced halo-swapping is more beneficial on more processes.
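The percentage column can be reproduced with a one-line calculation. The sketch below assumes the improvement is measured relative to the original run time, which matches the tabulated value for 8192 processes; percent_improvement is a hypothetical name.

```c
#include <assert.h>

/* Percentage reduction in run time of the reduced code, measured
 * relative to the original code's run time. */
double percent_improvement(double t_original, double t_reduced) {
  assert(t_original > 0.0);
  return 100.0 * (t_original - t_reduced) / t_original;
}
```

For the 8192-process row, percent_improvement(25.449, 23.740) evaluates to about 6.7, and for the 2048-process row percent_improvement(77.788, 74.914) gives about 3.7.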

[Figure 4.8: log-log plot of total run time (s) against number of MPI tasks.]

Figure 4.8: The times taken by the original, reduced and full halo-swapping codes on HECToR. This is using a problem size 1024 × 512^2. The time on 8192 processes using the original code is 25.449s, and using the reduced code is 23.74s.

[Figure 4.9 panels: (a) Strong scaling graph on HECToR for a large system (D3Q19), speedup relative to 128 MPI tasks against number of MPI tasks; (b) Efficiency graph on HECToR for a large system (D3Q19), efficiency relative to 128 MPI tasks against number of MPI tasks.]

Figure 4.9: Ludwig: 200 time-steps on a total system size 1024 × 512^2 on HECToR, using the D3Q19 model. Each MPI task is run on one processing core (both cores in a processor are used). The inset graphs show a zoom in of the data for a smaller number of processes.

4.6.2 HPCx

The original Ludwig code scales very well up to the full 1024 processes on HPCx (figure 4.11a). The efficiency remains above 95% on 256 and 512 MPI tasks, and above 87% on 1024 MPI tasks.

The reduced halo-swapping code scales significantly better than the original Ludwig, with an improvement to efficiency of almost 3% on 1024 processes. This is a notable increase as the efficiency of the original code was already very high, and it is calculated relative to 128 processes. It is important to note that the reduced code also performs better than the original (and full) codes on 128 processes. Figure 4.10 shows the absolute runtimes of the three modes. On 1024 MPI tasks - a subdomain size of 64^3 - reduced halo-swapping reduces the runtime by 6.4%.

There is no measurable difference in performance between the original code and the new code's full halo-swapping mode.

[Figure 4.10: log-log plot of total run time (s) against number of MPI tasks. Series: HPCx Full Halos, HPCx Reduced Halos, HPCx Original Code.]

Figure 4.10: The times taken by the original, reduced and full halo-swapping codes on HPCx. This is using a problem size 1024 × 512^2. The time on 1024 processes using the original code is 224.402s, and using the reduced code is 210.699s.

[Figure 4.11 panels: (a) Strong scaling graph on HPCx for a large system (D3Q19), speedup relative to 128 MPI tasks against number of MPI tasks; (b) Efficiency graph on HPCx for a large system (D3Q19), efficiency relative to 128 MPI tasks against number of MPI tasks. Series include HPCx Original.]

Figure 4.11: Ludwig: 200 time-steps on a total system size 1024 × 512^2 on HPCx, using the D3Q19 model. Each MPI task is run on one processing core (both cores in a processor are used).

4.7 Analysis

Increasing the number of processors, while keeping the total lattice size constant, actually increases the total amount of data being communicated every iteration due to halos around every subdomain. These halos are one form of overhead for the program. Consider a total lattice of size 256^3 distributed onto 8 processors. Each subdomain will contain (256/2 + 2)^3 lattice sites, and so the total number of halo sites increases. Assuming all the subdomains are a cube, the general formula for the number of halo sites per subdomain is,

\[ \text{number of halo sites} = h = 6 \sqrt[3]{\left(\frac{L}{p}\right)^{2}} + 12 \sqrt[3]{\frac{L}{p}} + 8, \tag{4.7.1} \]

where L is the total number of lattice sites, and p the number of processors.
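The formula can be checked against a direct count. For a cubic subdomain of side n = (L/p)^(1/3) with a halo one site deep, the halo holds (n + 2)^3 - n^3 sites, which expands to 6n^2 + 12n + 8. The sketch below evaluates both forms; the function names are hypothetical, and it takes the integer side n directly rather than computing a cube root.

```c
#include <assert.h>

/* Halo sites around a cubic subdomain of side n, from equation 4.7.1
 * with n standing for the cube root of L/p. */
long halo_sites(long n) {
  assert(n > 0);
  return 6 * n * n + 12 * n + 8;
}

/* The same quantity counted directly: the sites of the (n+2)^3 cube
 * that are not part of the n^3 subdomain. */
long halo_sites_direct(long n) {
  assert(n > 0);
  return (n + 2) * (n + 2) * (n + 2) - n * n * n;
}
```

For the example in the text (a 256^3 lattice on 8 processors, so n = 128), both functions return 99848 halo sites per subdomain.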

The main overhead is the communications cost of transferring these halos. The total amount of data transferred from each processor is c × h, where c is a constant that depends on the model being used, and on whether reduced halo swapping is used. The bandwidth component of the communications overhead depends on c, h and L,

\[ \text{overhead} \propto \frac{c \times h}{L}. \tag{4.7.2} \]

As the number of processes increases, the size of the subdomains decreases and the number of halo sites relative to the number of sites used in the calculation increases (figure 4.12). This unfortunately means it is impossible to achieve linear speedup on an arbitrary number of processors, as can be seen across all the results for both models.
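To make the trend concrete, the sketch below computes the number of halo sites per subdomain site for a cubic subdomain of side n. The name halo_fraction is hypothetical, and the cubic decomposition is the same simplifying assumption used in the text.

```c
#include <assert.h>

/* Ratio of halo sites to subdomain sites for a cubic subdomain of
 * side n with a halo one site deep. */
double halo_fraction(long n) {
  assert(n > 0);
  long inner = n * n * n;
  long outer = (n + 2) * (n + 2) * (n + 2);
  return (double)(outer - inner) / (double)inner;
}
```

For a 256^3 lattice, going from 8 processes (n = 128) to 4096 processes (n = 16) raises this ratio from about 0.05 to about 0.42, so the halo grows from a few percent of the subdomain to almost half of it.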

4.7.1 Analysis of Original Ludwig Code Performance

The D3Q19 model scales better, and is more efficient, than D3Q15 on HPCx and HECToR. This is expected from Gustafson's Law [22], as more velocities provide a larger problem size. It is interesting to observe that this is not the case with Bluegene/L (figures 4.4a-4.5b), possibly due to the relatively slower speed of each CPU compared with the interconnect bandwidth.

[Figure 4.12: plot against number of MPI tasks. Vertical axis label: number of halo sites per real lattice site. Legend: number of lattice sites per halo site.]

Figure 4.12: L/h for L = 256^3. Increase indicates bigger overhead on more processors.

4.7.2 Analysis of Reduced Halo-swapping Code Performance

All of the results (figures 4.2a-4.7b) show that reduced halo swapping delivers a definite improvement to efficiency and performance on distributed memory systems (footnote 2). This confirms that the original Ludwig code's communications are limited by bandwidth rather than latency, as was hypothesised (figure 1.5).

The reduced halos code generally scales better when using the D3Q19 model. The increase in efficiency is also greater using the D3Q19 model.

From the HPCx results within the LPAR we can predict that the optimisation does not benefit shared memory architectures. The communications bandwidth of using memory for halo-swapping is so high that the cost of more derived datatypes offsets any benefit from reduced transfer times.

The benefit of reduced halo-swapping on the Bluegene/L system is quite small. This is probably because the code spends more time on the computation, due to the slower individual processors, so the communications are less of an overhead. It would be interesting to conduct these benchmarks on a Bluegene/L system up to about 10,000 processing nodes.

Footnote 2: The distributed systems include HECToR and Bluegene/L. HPCx is a distributed system when using more than 16 MPI tasks (one LPAR).

When the problem size is increased (section 4.6) the reduced halo-swapping benefit to scaling and efficiency is less noticeable. The smaller the subdomain size, the larger the increase in performance from the optimisation. Comparing the smaller (256^3) and larger (1024 × 512^2) systems on HECToR, the subdomain size of the smaller system at 2048 processes is the same as the subdomain size at 8192 processes using the bigger system. In the first case the reduced code gives an improvement to the speedup of about 12% (443.7 c.f. 394.7). Using 8192 MPI tasks on the larger system the improvement to the speedup is only about 6% (48.1 c.f. 45.4). Therefore it is not possible to predict the precise benefit from reduced halo-swapping by just considering the subdomain size.

Reduced halo-swapping is of most benefit when the problem is decomposed onto a very large number of processors. Typically this would be uneconomic to do, as the efficiency drops and more parallel CPU hours would be used in total. However with this type of usage the reduced code can increase efficiency by up to about 20% (figure 4.7b). Also in use-cases where the network bandwidth is low (consider the time spent in halo-swapping vs. the time spent in propagation/collision), reduced halo-swapping gives a noticeable increase in performance. The reduced code is almost never detrimental to the performance (the exception being shared memory architectures). Even with a very large lattice size there is still a reasonable improvement in efficiency and scaling.

The optimisation took a total of about five weeks to implement, which also includes time to modify the tests and debug the code. Somebody starting out with more knowledge of the code could probably program reduced halo-swapping in less time. The optimisation only has value if the code is used many times.

4.7.3 Analysis of Full Halo-swapping Mode Performance

The modified code uses the full halo-swapping code from the original code. The only differences are the conditional in init_halo() and the copying of each derived datatype (see section 2.3.3 for details). The benchmark results show no significant difference in performance between this and the original code.

There is a slight decrease in scaling, and efficiency, using the D3Q15 model on HECToR. This may be an issue with the MPI library, the compiler being unable to optimise due to the new conditional in init_site(), or software upgrades. The HECToR system is new, the software is quite regularly updated, and it is the only system which uses the Pathscale compiler. The D3Q19 system was benchmarked after D3Q15, so the first instinct is to blame a system upgrade. However benchmarking the D3Q15 model again confirmed this was not the case. Using a different compiler on HECToR could help analyse the issue, however it is a very minor problem.


4.7.4 Effect of Optimisation on Code Structure and Maintainability

Due to the modular structure of the original Ludwig code, the optimisation has not changed the structure in any way. No new files were introduced to the code, as all modifications were made in existing source files. Although the optimisation was tricky, and required some care to implement, it has added only approximately 200 lines to the source code (excluding test codes). It is possible that the Site struct described in section 2.2 will be replaced with a 1-dimensional array in the future. The optimisation we implemented is fully compatible with this change, and will require no further modification. The original Ludwig allowed new velocity models to be defined by adding (and compiling against) a new d3q**.c file to describe the velocity vectors. Since the model's description of reduced halo-swapping is in the same file, enabling reduced halo-swapping for a new model is simple.

The original test codes are slightly more complicated now. Some of these tests (test_halo, test_model and test_prop) need to be run twice to cover full and reduced halo-swapping. As it is possible to switch between the full and reduced modes at runtime, this can be achieved by specifying two different input files. The reduced halo-swapping tests can automatically determine which densities should be communicated, and so the addition of new velocity models will not require any modification to the test codes.

Writing the optimisation and modifying the tests took about five weeks.


Chapter 5

Conclusions

We have designed, described and implemented an optimisation for the parallel message-passing communications in an existing lattice Boltzmann code called Ludwig [13].

The code was successfully modified to transfer less data during halo-swapping, and aconditional in the input file allows it to switch between full and reduced halo-swappingat runtime. Although the optimisation was specific to this algorithm, there are generaltrends of when the additional computational cost of derived datatypes is worthwhile.

The optimisation was found to almost always improve parallel performance, increasingscalability to higher numbers of processes and reducing run time. The only exceptionwas inside one LPAR on HPCx, where high bandwidth shared memory is used for thecommunications. We found the optimisation gave the most benefit when the paralleloverheads are significant, for example scaling small systems to many processes. Onlarge systems the scaling of the original code is very good, so the benefit of reducedhalo-swapping is less pronounced. We also found that the D3Q19 model received moreof an improvement to parallel performance than the D3Q15 model, and it seems logicalfrom Gustafson’s Law [22] that larger velocity models (for example D3Q27) wouldbenefit even more from the reduced halo-swapping code.

The full halo-swapping mode of the modified code was also benchmarked against the original. We found this gave no decrease in performance, except when using the D3Q15 model on HECToR.

The final code is only about 200 lines longer, and the structure is identical to the original. The optimisation can be enabled for future velocity models by adding a description of the velocities to the model’s source file.

In conclusion we have shown that using MPI derived datatypes to reduce the amount of data being passed between processes can be very worthwhile if the code’s communications are bandwidth limited. This is an optimisation which will improve performance on a high percentage of modern and future supercomputers, and will help code scale efficiently as the number of processors in parallel machines increases.

Appendix A

Review of Process

A.1 Work Plan

The work plan originally proposed is shown in figure A.1. The main features are two weeks of reading time, four weeks to write the optimisation, and three weeks of benchmarks.

The reading stage was used to plan how the reduced halo-swapping mode would be introduced into the existing code (see section 2.3). The first stage of the implementation (using MPI_Type_struct instead of MPI_Type_contiguous to send the full halos) was completed in week 6. Using MPI_Type_struct set out the structure of the reduced halo-swapping optimisation, which was completed in week 10. This is a few weeks after the planned date to finish the implementation, but the slack time in weeks 12 and 13 prevented this from being an issue. The benchmarks took the predicted time.

The work plan shows “Writing up” from week 3 onwards. In practice this was used to write draft versions of the Introduction and Benchmarks Methodology sections. Work on the presentation slides didn’t begin until after the report was finished, which is acceptable as the talk is only 10 minutes (about 10 slides) and most of the content will be similar to the dissertation.

Figure A.1: Gantt chart of work plan.

O1: design the code        B1: benchmark on Bluegene/L
O2: write the code         B2: benchmark on HECToR
O3: test/debug the code    B3: benchmark on HPCx

A.2 Risks

A.2.1 Risk 1: Machine time unavailable

Mitigation: We have access to three machines (Bluegene, HPCx and HECToR) so the likelihood of all three being down at once is very low. While it would be detrimental to the performance analysis of the optimised code, any one of these computers being offline would not be a critical problem.

One of the back-planes on Bluegene/L failed around week 10. This was not repaired during the course of the project, and so it was not possible to get results on 2048 processes from this system.

A.2.2 Risk 2: Optimisation breaks code

Mitigation: Use the existing tests after each significant code modification to check the code’s correctness (primarily propagation and halo, as these test the part of the code I will be changing). Also use CVS so that broken code can be rolled back for debugging.

Risk                           Likelihood   Impact
1. Machine time unavailable    Low          Very high
2. Optimisation breaks code    High         High
3. Illness                     Low          Moderate
4. Data loss                   Low          Extreme

Table A.1: The risk assessment from the project preparation.

It is also possible that in the first two weeks of the project a new test could be written if this is seen to help.

The implementation stage ran over by about a week. This was not an issue, as there was some slack time later in the work plan (week 12).

A.2.3 Risk 3: Illness

Mitigation: The estimates for debugging and benchmarking are conservative. If time is lost during one of these two phases, that time can be made up in the next phase. If the optimisation is completed in less time than allocated we may look at other optimisations.

The only illness was hay fever, which was treated with prescription antihistamines and didn’t cause any disruption.

A.2.4 Risk 4: Data loss

Mitigation: Check code in to CVS regularly (the CVS repository is stored in the National e-Science Centre). The report and data obtained from benchmarks will be stored in a remote CVS repository, and my work machine does a nightly incremental backup.

Backups were useful for rolling back to old versions, and they also helped confidence. CVS was useful for working in different locations (and on different machines).

Bibliography

[1] Thürey, N. & Rüde, U. Free surface lattice-Boltzmann fluid simulations with and without level sets. Workshop on Vision, Modelling, and Visualization (VMV Stanford) 199–208 (2004).

[2] Edinburgh Parallel Computing Centre (2008). URL http://www.epcc.ed.ac.uk.

[3] Higuera, F., Succi, S. & Benzi, R. Lattice Gas Dynamics with Enhanced Collisions. Europhysics Letters (EPL) 9, 345–349 (1989).

[4] Clague, D., Kandhai, B., Zhang, R. & Sloot, P. Hydraulic permeability of (un)bounded fibrous media using the lattice Boltzmann method. Physical Review E 61, 616–625 (2000).

[5] Bernsdorf, J., Zeiser, T., Brenner, G. & Durst, F. Simulation of a 2D Channel Flow Around a Square Obstacle with Lattice-Boltzmann (BGK) Automata. International Journal of Modern Physics C 9, 1129–1141 (1998).

[6] Fang, H., Wang, Z., Lin, Z. & Liu, M. Lattice Boltzmann method for simulating the viscous flow in large distensible blood vessels. Physical Review E 65, 51925 (2002).

[7] Derksen, J. Simulations of confined turbulent vortex flow. Computers and Fluids 34, 301–318 (2005).

[8] Pohl, T., Kowarschik, M., Wilke, J., Iglberger, K. & Rüde, U. Optimization and Profiling of the Cache Performance of Parallel Lattice Boltzmann Codes. Parallel Processing Letters 13, 549–560 (2003).

[9] Donath, S. On Optimized Implementations of the Lattice Boltzmann Method on Contemporary High Performance Architectures. Lehrstuhl für Informatik 10.

[10] Wellein, G., Zeiser, T., Hager, G. & Donath, S. On the single processor performance of simple lattice Boltzmann kernels. Computers and Fluids 35, 910–919 (2006).

[11] Frisch, U., Hasslacher, B. & Pomeau, Y. Lattice-Gas Automata for the Navier-Stokes Equation. Physical Review Letters 56, 1505–1508 (1986).

[12] Succi, S. The Lattice Boltzmann Equation for Fluid Dynamics and Beyond (Oxford University Press, 2001).

[13] Desplat, J., Pagonabarraga, I. & Bladon, P. LUDWIG: A parallel Lattice-Boltzmann code for complex fluids. Computer Physics Communications 134, 273–290 (2001).

[14] Denniston, C., Marenduzzo, D., Orlandini, E. & Yeomans, J. Lattice Boltzmann algorithm for three-dimensional liquid-crystal hydrodynamics. Philosophical Transactions: Mathematical, Physical and Engineering Sciences 362, 1745–1754 (2004).

[15] Care, C., Halliday, I. & Good, K. Lattice Boltzmann nemato-dynamics. J. Phys. Condens. Matter 12, L665–L671 (2000).

[16] El’beem. URL http://elbeem.sourceforge.net/.

[17] Blender. URL http://www.blender.org/.

[18] Lattice Boltzmann at NIST (2008). URL http://math.nist.gov/mcsd/savg/parallel/lb/.

[19] Pohl, T. et al. Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures.

[20] OpenLB. URL http://www.openlb.org.

[21] PowerFLOW. URL http://www.exa.com/pages/pflow/pflow_physics.html.

[22] Gustafson, J. Reevaluating Amdahl’s law. Communications of the ACM 31, 532–533 (1988).
