
Distributed Data Assimilation - A case study

Aad J. van der Steen

High Performance Computing Group, Utrecht University

1. The application

2. Parallel implementation

3. Model and experiments

4. Perspectives for distributed implementation

The application - 1

The application, Ensflow, assimilates ocean flow data into a stochastic ocean flow model.

Many realisations of the model, each with randomly distributed parameters, are run; together they form an ensemble.

Periodically these runs are integrated with satellite data and an optimal ensemble average is computed.

The sequence of ensemble averages over time describes the development of the ocean's currents that best fits the observations.

The application - 2

The region of interest is the southern tip of Africa.

Data from the TOPEX/Nimbus satellite are used for the assimilation.

The purpose is to understand the evolution of streams and eddies in this region.

The application - 3

Because of the stochastic nature of the model, many realisations of the model with slightly different parameter values have to be evolved. The observations of the top-layer values are interpolated to a 251x151 grid.

The ensemble members are allowed to develop independently for some time and are then combined to find the ensemble mean

$\psi = \psi_F + R^T B$

with $\psi_F$ the best estimate for the model evolution without observations,

$R$ = matrix of field measurement covariances,

$B$ = matrix of representer coefficients.

The application - 4

The model performs two computationally intensive tasks:

1. Generation of the ensemble members.

2. The computational flow part that describes the evolution of the stream function $\psi$.

Every 240 hourly time steps an analysis of the ensemble is done to obtain the optimal estimate for the past period.
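The cycle described above (independent evolution of the members, followed by a periodic combination step) can be illustrated with a toy sketch. This is not Ensflow: the scalar state, the decay-rate parameter, and the function name `run_cycle` are hypothetical stand-ins, and a plain ensemble mean replaces the real analysis, which also uses the satellite observations. Only the ensemble size of 60 and the 240 steps per cycle are taken from the slides.

```python
import random
from statistics import mean


def run_cycle(n_members=60, n_steps=240, seed=0):
    """Evolve a toy ensemble independently, then return (members, ensemble mean)."""
    rng = random.Random(seed)
    # Hypothetical toy parameter: each member gets a slightly different decay rate.
    rates = [0.01 + rng.gauss(0.0, 0.001) for _ in range(n_members)]
    states = [1.0] * n_members
    for _ in range(n_steps):
        # No member reads another member's state, so members could be
        # advanced on different processors without any communication.
        states = [x * (1.0 - r) for x, r in zip(states, rates)]
    # Stand-in for the analysis step: a plain mean over all members.
    return states, mean(states)


members, ensemble_mean = run_cycle()
print(f"{len(members)} members, ensemble mean after 240 steps: {ensemble_mean:.4f}")
```

The point of the sketch is structural: all data exchange is concentrated in the single combination step at the end of the cycle.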

Parallel implementation - 1

1. Ensemble members are distributed evenly over the processors.

2. Data of ensemble members are independent and are local to the processors.

3. Only in the analysis phase, which determines the global optimum, do data have to be exchanged (using MPI).

4. The optimal global field is distributed and a new cycle is started.
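A minimal sketch of step 1, the even distribution of ensemble members over processors. The slides do not say how Ensflow assigns members to ranks, so the contiguous block partition and the name `distribute_members` are assumptions made for illustration only.

```python
def distribute_members(n_members: int, n_procs: int) -> list[range]:
    """Split member indices 0..n_members-1 into contiguous, near-equal blocks.

    Assumed scheme: the first (n_members % n_procs) ranks get one extra member.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    base, extra = divmod(n_members, n_procs)
    blocks = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# 60 ensemble members over 12 processors: every rank owns 5 members.
for rank, block in enumerate(distribute_members(60, 12)):
    print(rank, list(block))
```

When the number of members is not a multiple of the number of processors, block sizes differ by at most one, so the most heavily loaded rank determines the time of the computation phase.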


The program contains 2 irreducible scalar parts:

1. Initialisation, which depends linearly on the number of ensemble members, $n_e$, and on the number of gridpoints, $n_d$, as $O(n_d^2 \log n_d)$. Init time = $t_i$.

2. The analysis part, for which the analysis time $t_a \propto n_e^3$.

On the DAS-2 systems $t_i \approx 59.5$ s and $t_a \approx 104$ s (for $n_e = 60$).


The time per ensemble member per 24h time step is $t_s \approx 30$ s. This amounts to 20x60x30 = 36,000 s = 10 h single-processor time for the complete 20-day cycle considered.

After the init phase a distribute operation is required, and per analysis step a collect and a distribute operation. The total amount of data moved is $n_x \times n_y \times n_l \approx 1.5$ MB.

The bandwidth at which this occurs is 120-140 MB/s (using Myrinet on one cluster). So, the total communication time is about 0.12 s per transfer.

Total communication time within one run: $t_c \approx$ 7-27 s.

Model and experiments - 1

The timing model has the following form:

$T(p) = t_i + t_a + \dfrac{1200\,t_s}{p} + t_c = 59.5 + 104 + \dfrac{36{,}000}{p} + 15 = 178.5 + \dfrac{36{,}000}{p}$ s
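The timing model can be evaluated directly. The sketch below uses the values quoted on the slides (init time 59.5 s, analysis time 104 s, 30 s per member per daily step, 1200 member-steps, and a communication time of about 15 s); the constant and function names are illustrative, and treating the communication time as a fixed 15 s is an assumption that only holds for the single-cluster case.

```python
T_INIT = 59.5        # t_i: initialisation time in seconds (serial)
T_ANALYSIS = 104.0   # t_a: analysis time in seconds for 60 members (serial)
T_STEP = 30.0        # t_s: seconds per ensemble member per 24h time step
N_MEMBER_STEPS = 1200  # 20 daily steps x 60 ensemble members
T_COMM = 15.0        # t_c: assumed constant communication time in seconds


def run_time(p: int, t_comm: float = T_COMM) -> float:
    """Modelled wall-clock time in seconds on p processors."""
    return T_INIT + T_ANALYSIS + N_MEMBER_STEPS * T_STEP / p + t_comm


def speedup(p: int) -> float:
    """Modelled speedup relative to a single processor."""
    return run_time(1) / run_time(p)


for p in (1, 2, 4, 6, 12):
    print(f"p={p:2d}  T={run_time(p):8.1f} s  speedup={speedup(p):5.2f}")
```

Because the serial terms add up to only a few minutes against 10 hours of parallel work, the modelled speedup stays close to p for the processor counts considered here.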

Model and experiments - 2

Remarks:

1. There is a mystery with respect to the computation phases: for p = 1, $t_c \approx 17$ s; for p > 1, $t_c \approx 30$ s consistently.

2. For p < 6, using 1 CPU/node is somewhat faster; from p = 6 on, 2 CPUs/node is marginally faster due to decreasing competition for memory and faster intra-node communication.

Model and experiments: Simulation results

Shown is a simulation of 180 daily periods. Note the blue eddies that form counterclockwise in the Atlantic.

Perspectives for distributed implementation - 1

The timing model has the following form:

$T(p) = t_i + t_a + \dfrac{1200\,t_s}{p} + t_c$

In the single-cluster implementation $t_c$ is quite small (ca. 15 s) and virtually independent of $p$.

For the distributed version this might not be the case:

1. Presently Globus cannot yet be used in conjunction with Myrinet's MPI, so communication must be done via IP.

2. The geographical distance between the DAS clusters introduces non-negligible latencies.


As can be seen from the figure, the communication time is still insignificant when distributing the model over two locations (UU and VU).

$t_c$ is quite erratic and is determined more by synchronisation than by communication time.


The results show that this application is excellently suited for distributed processing. Still, both the communication and the analysis phase may be made more efficient:

1. When it is known which process IDs are located where, intra-cluster communication can be done first, after which the assembled messages can be exchanged.

2. The analysis could be done on the local ensemble members (remember $t_a \propto n_e^3$) and synchronised less frequently.
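The potential gain of the second suggestion follows from the cubic dependence of the analysis time on the number of ensemble members. The sketch below scales the measured 104 s for 60 members down to the number of members held by one process. It assumes the cubic law holds with the same constant for small ensembles, which the slides do not verify, and it says nothing about the statistical quality of a local analysis; the function name is illustrative.

```python
T_ANALYSIS_REF = 104.0  # measured analysis time in seconds for the reference ensemble
N_E_REF = 60            # reference ensemble size


def analysis_time(n_e: float) -> float:
    """Analysis time in seconds for n_e members, assuming t_a scales as n_e**3."""
    return T_ANALYSIS_REF * (n_e / N_E_REF) ** 3


for p in (1, 2, 4, 12):
    local_members = N_E_REF / p
    print(f"p={p:2d}  local members={local_members:4.0f}  "
          f"local analysis time={analysis_time(local_members):8.3f} s")
```

Under these assumptions, halving the number of members per analysis cuts the analysis time by a factor of eight, so a local analysis on 5 members (12 processes) would take a small fraction of a second instead of 104 s.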


Using more sites has a notable effect on the communication. Again, synchronisation effects are more important than the communication time proper:

Sites   Exec. time (s)   Comm. time (s)
1       3310             45.1
2       3274             62.9
3       3339             208.5
4       3299             151.9

12-proc. run: UU, UU+VU, UU+VU+Leiden, UU+VU+Leiden+Delft
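To put the measurements in proportion, the share of the execution time spent on communication can be computed from the table. The numbers below are copied from the table; the variable and function names are illustrative.

```python
# (sites, execution time in s, communication time in s) for the 12-processor runs
runs = [
    (1, 3310, 45.1),
    (2, 3274, 62.9),
    (3, 3339, 208.5),
    (4, 3299, 151.9),
]


def comm_fraction(exec_time: float, comm_time: float) -> float:
    """Fraction of the execution time spent on communication."""
    return comm_time / exec_time


for sites, exec_time, comm_time in runs:
    share = 100 * comm_fraction(exec_time, comm_time)
    print(f"{sites} site(s): communication is {share:.1f}% of execution time")

exec_times = [exec_time for _, exec_time, _ in runs]
print(f"spread in execution time: {max(exec_times) - min(exec_times)} s")
```

Even in the worst case (three sites) communication stays below 7% of the execution time, and the total execution time varies by only 65 s across the four configurations.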


This case study was a particularly well-suited candidate for distributed processing. Apart from improving this implementation, we will proceed with three other promising projects:

1. Running two coupled oceanographic models within the Cactus framework.

2. Inexact sequence matching of genetic material.

3. Pattern recognition on proteomic micro arrays.

Acknowledgements: Fons van Hees for the single-system parallelisation.
