SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Advanced User Support for
MPCUGLES code at University of Minnesota
October 09, 2008
Mahidhar Tatineni (SDSC), Lonnie Crosby (NICS), John Cazes (TACC)
Overview of MPCUGLES Code
• MPCUGLES is an unstructured-grid large eddy simulation code (written in Fortran 90/MPI), developed by Prof. Krishnan Mahesh's group at the University of Minnesota, which can be used for very complex geometries.
• The incompressible flow algorithm employs a staggered approach, with face-normal velocities stored at face centroids and velocity and pressure stored at cell centroids. The non-linear terms are discretized so that discrete energy conservation is imposed.
• The code also uses the HYPRE library (developed at LLNL), a set of high-performance preconditioners for solving the sparse linear systems of equations that arise in the main algorithm.
• MPCUGLES has been run at scale with up to 2048 cores and 50 million control volumes on Blue Gene (SDSC), DataStar (SDSC), Ranger (TACC), and Kraken (NICS).
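A minimal one-dimensional sketch (not the actual MPCUGLES discretization) of why an energy-conserving treatment of the non-linear term works: in skew-symmetric form with central differences on a periodic grid, the inner product of the velocity with the convective term vanishes identically, so the semi-discrete scheme cannot spuriously create or destroy kinetic energy.

```python
import random

def skew_convective(u, h):
    """Skew-symmetric central-difference convection,
    N(u) = 0.5 * (u * du/dx + d(u^2)/dx),
    on a periodic 1-D grid.  This form conserves discrete kinetic energy."""
    n = len(u)
    out = []
    for i in range(n):
        up, um = u[(i + 1) % n], u[(i - 1) % n]
        out.append((u[i] * (up - um) + up * up - um * um) / (4.0 * h))
    return out

random.seed(0)
u = [random.uniform(-1.0, 1.0) for _ in range(64)]
N = skew_convective(u, h=1.0 / 64)
# Discrete energy balance: sum_i u_i * N(u)_i is exactly zero,
# up to floating-point round-off.
energy_rate = sum(ui * ni for ui, ni in zip(u, N))
print(abs(energy_rate) < 1e-9)  # True
```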
General Requirements
• Grid generation, initial condition generation, and partitioning for the runs are done using the METIS software. For the larger grids the experimental metis-5.0pre1 version is required (a previous ASTA project uncovered a problem with the metis-4.0 version for large-scale cases).
• The I/O in the code is done using NetCDF; each processor writes its own files in NetCDF format, so there is no MPI-IO or parallel-netCDF requirement.
• The HYPRE library (from LLNL) provides high-performance preconditioners, including parallel multigrid methods for both structured and unstructured grid problems; the code is compiled against version 1.8.2b. The algebraic multigrid solver (HYPRE_BoomerAMG) is used from the library, and MPCUGLES also has the option of using a conjugate-gradient method as an alternative.
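For the partitioning step, METIS operates on the dual graph of the mesh: one vertex per control volume, with an edge between every pair of cells that share a face. A small pure-Python sketch of that graph construction (the cell/face lists below are made-up illustrative data, not MPCUGLES's actual mesh format):

```python
from collections import defaultdict

def dual_graph(cell_faces):
    """Build the dual graph that METIS partitions: one vertex per cell,
    an edge between the two cells on either side of an interior face."""
    face_to_cells = defaultdict(list)
    for cell, faces in enumerate(cell_faces):
        for f in faces:
            face_to_cells[f].append(cell)
    adj = defaultdict(set)
    for cells in face_to_cells.values():
        if len(cells) == 2:          # interior face connects exactly two cells
            a, b = cells
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Four cells in a row: cell i is bounded by faces i and i+1,
# so faces 1, 2, 3 are interior.
adj = dual_graph([(0, 1), (1, 2), (2, 3), (3, 4)])
print(sorted(adj[1]))  # cell 1 neighbours cells 0 and 2 -> [0, 2]
```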
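The file-per-process I/O pattern described above can be sketched as follows. Plain binary output stands in for NetCDF here, and the rank-indexed filename is a hypothetical convention for illustration, not the code's actual one; the point is only that each rank writes its own file, so no collective MPI-IO is needed.

```python
import os
import struct
import tempfile

def write_rank_file(outdir, rank, values):
    """File-per-process output: each MPI rank writes only its own data to
    its own file.  Binary records stand in for NetCDF variables here."""
    path = os.path.join(outdir, "soln.%06d.dat" % rank)  # hypothetical naming
    with open(path, "wb") as f:
        f.write(struct.pack("<i", len(values)))              # record length
        f.write(struct.pack("<%dd" % len(values), *values))  # the data
    return path

# Stand-alone demo: pretend we are rank 7 of a larger job.
outdir = tempfile.mkdtemp()
path = write_rank_file(outdir, rank=7, values=[1.0, 2.5, -3.0])
print(os.path.basename(path))  # soln.000007.dat
```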
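For the conjugate-gradient alternative, a minimal unpreconditioned CG on a small dense SPD system shows the idea; the production code solves distributed sparse systems through HYPRE, not like this.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Unpreconditioned conjugate gradient for a dense SPD system A x = b.
    Illustrative only: the real solver is preconditioned and distributed."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A*x with x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:    # converged
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# 1-D Poisson-like SPD matrix: tridiagonal [-1, 2, -1]
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x = conjugate_gradient(A, b)
residual = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
print(residual < 1e-8)  # True
```

In exact arithmetic CG converges in at most n iterations for an n-by-n SPD system, which is why this 8-by-8 example reaches machine precision almost immediately.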
Porting to Ranger and Kraken
• The code was recently ported to both of the available Track 2 systems (Ranger and Kraken).
• Compiling the code on both machines was relatively straightforward. Both Ranger and Kraken had the NetCDF libraries already installed, and the needed versions of the HYPRE library (v1.8.2b) and METIS (v5.0pre1) were easy to install on both machines.
• The grid and initial condition generation codes are currently serial. For the current scaling studies they were run on Ranger (1 process/node, 32 GB) or DataStar (1 process/p690 node, 128 GB). This is a potential bottleneck for larger runs (>50 million CVs), and part of the current AUS project will focus on parallelizing this step so that much larger grid sizes can be considered.
Performance on Ranger
• Strong Scaling (257^3 grid):

    Cores   4-way          8-way
    16      2298s (2-way)  -
    32      1004s          -
    64      577s           633s
    128     353s           494s
    256     304s           503s
    512     -              678s

• Weak Scaling (64K CVs/task):

    Cores   Total CVs   4-way   8-way
    16      2097152     287s    308s
    32      4194304     417s    453s
    64      8388608     396s    433s
    128     16777216    353s    494s
    256     33554432    560s    -
Performance on Kraken
• Strong Scaling (257^3 grid):

    Cores   1-way   2-way
    16      -       -
    32      -       -
    64      514s    -
    128     285s    365s
    256     187s    280s
    512     157s    268s

• Weak Scaling (64K CVs/task):

    Cores   Total CVs   1-way   2-way
    16      2097152     275s    301s
    32      4194304     365s    405s
    64      8388608     337s    379s
    128     16777216    285s    365s
    256     33554432    428s    -
Comments on Performance
• Strong scaling for the 16 million control volume case is acceptable up to 256 cores on Ranger and 512 cores on Kraken. The primary factor is the network bandwidth available per core (higher on Kraken). Overall the code scales acceptably with roughly 32-64K CVs per task, which is consistent with previous results on DataStar.
• The code should exhibit good weak scaling given the communication pattern seen in older runs (mostly nearest-neighbor). The results are acceptable up to 256 cores but show a jump in run times beyond that. One likely cause is that the underlying solver takes longer to converge as the number of CVs increases (this is not an isotropic problem but a wall-bounded channel flow).
• Weak-scaling runs at 64K CVs/task beyond 512 cores are currently restricted by grid-size limitations; this needs to be addressed.
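The strong-scaling efficiencies behind these comments can be read straight off the tables: relative to the smallest tabulated run on each machine, both the 4-way Ranger case at 256 cores and the 1-way Kraken case at 512 cores land near 41% efficiency.

```python
def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Parallel efficiency relative to a reference run:
    E = (t_ref * p_ref) / (t * p); E = 1.0 means ideal speedup."""
    return (t_ref * p_ref) / (t * p)

# Numbers taken from the Ranger (4-way) and Kraken (1-way)
# strong-scaling tables above (257^3 grid).
e_ranger_256 = strong_scaling_efficiency(1004.0, 32, 304.0, 256)
e_kraken_512 = strong_scaling_efficiency(514.0, 64, 157.0, 512)
print(round(e_ranger_256, 2), round(e_kraken_512, 2))  # 0.41 0.41
```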
Future Work
• Near term:
  • Redo the weak-scaling runs with an isotropic case to see whether that avoids the extra work required of the underlying solver.
  • Run at larger processor counts on both Ranger and Kraken with profiling/performance tools to analyze the performance.
• Long term:
  • Parallelize the initial condition and grid generation codes to enable scaling to much larger processor counts.
  • Investigate the performance implications of changing the underlying linear solver and see whether any improvements can be made. For example, the CG algorithm scales much better (tests on Kraken already show this) but takes longer to converge, so there is a tradeoff.