SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Advanced User Support for
MPCUGLES code at University of Minnesota
October 09, 2008
Mahidhar Tatineni (SDSC), Lonnie Crosby (NICS), John Cazes (TACC)
Overview of MPCUGLES Code
• MPCUGLES is an unstructured-grid large eddy simulation code (written in Fortran 90/MPI), developed by Prof. Krishnan Mahesh's group at the University of Minnesota, which can be used for very complex geometries.
• The incompressible flow algorithm employs a staggered approach, with face-normal velocities stored at face centroids and velocity and pressure stored at cell centroids. The non-linear terms are discretized so that discrete energy conservation is imposed.
• The code also uses the HYPRE library (developed at LLNL), a set of high-performance preconditioners for solving the sparse linear systems of equations that arise in the main algorithm.
• MPCUGLES has been run at scale with up to 2048 cores and 50 million control volumes on Blue Gene (SDSC), DataStar (SDSC), Ranger (TACC), and Kraken (NICS).
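A minimal one-dimensional sketch (not the actual MPCUGLES discretization) of why an energy-conserving treatment of the non-linear term works: in skew-symmetric form with central differences on a periodic grid, the inner product of the velocity with the convective term vanishes identically, so the semi-discrete scheme cannot spuriously create or destroy kinetic energy.

```python
import random

def skew_convective(u, h):
    """Skew-symmetric central-difference convection,
    N(u) = 0.5 * (u * du/dx + d(u^2)/dx),
    on a periodic 1-D grid.  This form conserves discrete kinetic energy."""
    n = len(u)
    out = []
    for i in range(n):
        up, um = u[(i + 1) % n], u[(i - 1) % n]
        out.append((u[i] * (up - um) + up * up - um * um) / (4.0 * h))
    return out

random.seed(0)
u = [random.uniform(-1.0, 1.0) for _ in range(64)]
N = skew_convective(u, h=1.0 / 64)
# Discrete energy balance: sum_i u_i * N(u)_i is exactly zero,
# up to floating-point round-off.
energy_rate = sum(ui * ni for ui, ni in zip(u, N))
print(abs(energy_rate) < 1e-9)  # True
```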
General Requirements
• Grid generation, initial condition generation, and partitioning for the runs are done using the METIS software. For the larger grids the experimental metis-5.0pre1 version is required (a previous ASTA project uncovered a problem with the metis-4.0 version for large-scale cases).
• The I/O in the code is done using NetCDF; each processor writes its own files in NetCDF format, so there is no MPI-IO or parallel-netCDF requirement.
• The HYPRE library (from LLNL) provides high-performance preconditioners, including parallel multigrid methods for both structured and unstructured grid problems; the code is compiled against version 1.8.2b. The algebraic multigrid solver (HYPRE_BoomerAMG) is used from the library, and MPCUGLES also has the option of using a conjugate-gradient method as an alternative.
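For the partitioning step, METIS operates on the dual graph of the mesh: one vertex per control volume, with an edge between every pair of cells that share a face. A small pure-Python sketch of that graph construction (the cell/face lists below are made-up illustrative data, not MPCUGLES's actual mesh format):

```python
from collections import defaultdict

def dual_graph(cell_faces):
    """Build the dual graph that METIS partitions: one vertex per cell,
    an edge between the two cells on either side of an interior face."""
    face_to_cells = defaultdict(list)
    for cell, faces in enumerate(cell_faces):
        for f in faces:
            face_to_cells[f].append(cell)
    adj = defaultdict(set)
    for cells in face_to_cells.values():
        if len(cells) == 2:          # interior face connects exactly two cells
            a, b = cells
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Four cells in a row: cell i is bounded by faces i and i+1,
# so faces 1, 2, 3 are interior.
adj = dual_graph([(0, 1), (1, 2), (2, 3), (3, 4)])
print(sorted(adj[1]))  # cell 1 neighbours cells 0 and 2 -> [0, 2]
```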
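The file-per-process I/O pattern described above can be sketched as follows. Plain binary output stands in for NetCDF here, and the rank-indexed filename is a hypothetical convention for illustration, not the code's actual one; the point is only that each rank writes its own file, so no collective MPI-IO is needed.

```python
import os
import struct
import tempfile

def write_rank_file(outdir, rank, values):
    """File-per-process output: each MPI rank writes only its own data to
    its own file.  Binary records stand in for NetCDF variables here."""
    path = os.path.join(outdir, "soln.%06d.dat" % rank)  # hypothetical naming
    with open(path, "wb") as f:
        f.write(struct.pack("<i", len(values)))              # record length
        f.write(struct.pack("<%dd" % len(values), *values))  # the data
    return path

# Stand-alone demo: pretend we are rank 7 of a larger job.
outdir = tempfile.mkdtemp()
path = write_rank_file(outdir, rank=7, values=[1.0, 2.5, -3.0])
print(os.path.basename(path))  # soln.000007.dat
```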
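For the conjugate-gradient alternative, a minimal unpreconditioned CG on a small dense SPD system shows the idea; the production code solves distributed sparse systems through HYPRE, not like this.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Unpreconditioned conjugate gradient for a dense SPD system A x = b.
    Illustrative only: the real solver is preconditioned and distributed."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A*x with x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:    # converged
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# 1-D Poisson-like SPD matrix: tridiagonal [-1, 2, -1]
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x = conjugate_gradient(A, b)
residual = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
print(residual < 1e-8)  # True
```

In exact arithmetic CG converges in at most n iterations for an n-by-n SPD system, which is why this 8-by-8 example reaches machine precision almost immediately.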
Porting to Ranger and Kraken
• The code was recently ported to both of the available Track 2 systems (Ranger and Kraken).
• Compiling the code on both machines was relatively straightforward. Both Ranger and Kraken had the NetCDF libraries already installed, and the needed versions of the HYPRE library (v1.8.2b) and METIS (v5.0pre1) were easy to install on both machines.
• The grid and initial condition generation codes are currently serial. For the current scaling studies they were run on Ranger (1 process/node, 32 GB) or DataStar (1 process/p690 node, 128 GB). This is a potential bottleneck for larger runs (>50 million CVs), and part of the current AUS project will focus on parallelizing this step so that much larger grid sizes can be considered.
Performance on Ranger
• Strong Scaling (257^3 grid):

    Cores   4-way          8-way
    16      2298s (2-way)  -
    32      1004s          -
    64      577s           633s
    128     353s           494s
    256     304s           503s
    512     -              678s

• Weak Scaling (64K CVs/task):

    Cores   Total CVs   4-way   8-way
    16      2097152     287s    308s
    32      4194304     417s    453s
    64      8388608     396s    433s
    128     16777216    353s    494s
    256     33554432    560s    -
Performance on Kraken
• Strong Scaling (257^3 grid):

    Cores   1-way   2-way
    16      -       -
    32      -       -
    64      514s    -
    128     285s    365s
    256     187s    280s
    512     157s    268s

• Weak Scaling (64K CVs/task):

    Cores   Total CVs   1-way   2-way
    16      2097152     275s    301s
    32      4194304     365s    405s
    64      8388608     337s    379s
    128     16777216    285s    365s
    256     33554432    428s    -
Comments on Performance
• Strong scaling for the 16 million control volume case is acceptable up to 256 cores on Ranger and 512 cores on Kraken. The primary factor is the network bandwidth available per core (higher on Kraken). Overall the code scales acceptably with roughly 32-64K CVs per task, which is consistent with previous results on DataStar.
• The code should exhibit good weak scaling given the communication pattern seen in older runs (mostly nearest-neighbor). The results are acceptable up to 256 cores but show a jump in run times beyond that. One likely cause is that the underlying solver takes longer to converge as the number of CVs increases (this is not an isotropic problem but a wall-bounded channel flow).
• Weak-scaling runs at 64K CVs/task beyond 512 cores are currently restricted by grid-size limitations; this needs to be addressed.
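The strong-scaling efficiencies behind these comments can be read straight off the tables: relative to the smallest tabulated run on each machine, both the 4-way Ranger case at 256 cores and the 1-way Kraken case at 512 cores land near 41% efficiency.

```python
def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Parallel efficiency relative to a reference run:
    E = (t_ref * p_ref) / (t * p); E = 1.0 means ideal speedup."""
    return (t_ref * p_ref) / (t * p)

# Numbers taken from the Ranger (4-way) and Kraken (1-way)
# strong-scaling tables above (257^3 grid).
e_ranger_256 = strong_scaling_efficiency(1004.0, 32, 304.0, 256)
e_kraken_512 = strong_scaling_efficiency(514.0, 64, 157.0, 512)
print(round(e_ranger_256, 2), round(e_kraken_512, 2))  # 0.41 0.41
```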
Future Work
• Near term:
  • Redo the weak-scaling runs with an isotropic case to see whether that avoids the extra work required of the underlying solver.
  • Run at larger processor counts on both Ranger and Kraken with profiling/performance tools to analyze the performance.
• Long term:
  • Parallelize the initial condition and grid generation codes to enable scaling to much larger processor counts.
  • Investigate the performance implications of changing the underlying linear solver and see whether any improvements can be made. For example, the CG algorithm scales much better (tests on Kraken already show this) but takes longer to converge, so there is a tradeoff.